Python Beautiful Soup Table Data Scraping all except a specific td data

I am trying to scrape data about all the politicians of India from a website that spreads the records across multiple pages, identified by a number in the URL.



url: http://www.myneta.info/ls2014/comparisonchart.php?constituency_id=1


I want to export the data from multiple pages into a single CSV file.



This is a sample row from the table I am trying to parse:



<tr>
<td class=chartcell><a href='http://myneta.info/ls2014/candidate.php?candidate_id=7678' target=_blank>Banka Sahadev</a></td>
<td class=chartcell align=center>53</td>
<td class=chartcell align=center>M</td>
<td class=chartcell align=center>IND</td>
<td class=chartcell align=center><span style='font-size:150%;color:red'><b>Yes</b></span></td>
<td class=chartcell align=center><span style='font-size:160%;'><b>1</b></span></td>
<td class=chartcell align=center>1</td>
<td class=chartcell align=left> <b><span style='color:red'> criminal intimidation(506)</span></b>, <b><span style='color:red'> public nuisance in cases not otherwise provided for(290)</span></b>, <b><span style='color:red'> voluntarily causing hurt(323)</span></b>, </td>
<td class=chartcell align=center>Graduate</td>
<td class=chartcell align=center>19,000<br><span style='font-size:70%;color:brown'>~ 19&nbsp;Thou+</span></td>
<td class=chartcell align=center>3,74,000<br><span style='font-size:70%;color:brown'>~ 3&nbsp;Lacs+</span></td>
<td class=chartcell align=center>3,93,000<br><span style='font-size:70%;color:brown'>~ 3&nbsp;Lacs+</span></td>
<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>N</td>
<!--<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>2,00,000<br><span style='font-size:70%;color:brown'>~ 2&nbsp;Lacs+</span></td> -->
</tr>


I have used BeautifulSoup to get the data, but somehow the cells get merged together and look very clumsy when I open the CSV file.



Here is my code:



import time

import requests
from bs4 import BeautifulSoup

num = 1

url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

headers = {'User-Agent': 'Mozilla/5.0'}

with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')


while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

    time.sleep(1)
    response = requests.get(url, headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                for row in tablenew.find_all('tr'):
                    # every cell is left-justified to 250 characters,
                    # with nothing separating one cell from the next
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(250))
                    r.write('\n')
        else:
            print('Too many tables')
    else:
        print('No response')
        print(num)

    num += 1


Also, how could I omit the data from a specific td?
In my case, I don't want the IPC details column from the table.



I am fairly new to coding and Python.










Tags: python web-scraping html-table beautifulsoup export-to-csv






asked Oct 29 at 21:42 by Rajesh Kumar, edited Oct 30 at 12:02 by Brian Tompsett - 汤莱恩


2 Answers


Answer 1 (accepted), answered Oct 29 at 22:37 by Felipe Ferri
I think the "merged data problem" is due to the fact that you're not actually separating the cells with a delimiter. Open your generated CSV file in a plain text editor and you will see it.



An easy solution is to use the join method to build a single delimiter-separated string from the list of cells and write that to the file, e.g.:



content = [cell.text for cell in row.find_all('td')]
r.write(';'.join(content) + '\n')


On the first line I used what is called a "list comprehension", something that is very useful to learn. It lets you iterate over all the elements of a list in a single line of code instead of writing a for loop. On the second line I call the join method on the string ';', which converts the list content into a single string by joining all its elements with ';'. At the end I add the line break.
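For instance, a minimal sketch with made-up cell values:

content = ['Banka Sahadev', '53', 'M', 'IND']
line = ';'.join(content) + '\n'
# line is now 'Banka Sahadev;53;M;IND\n'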



If you want to omit elements based on their index (say, the cell at index 7, which here is the IPC details column), we can extend the list comprehension a little, like so:



# Write in this list the indices of the columns you want
# to exclude
ommit_columns = [7]
content = [cell.text
           for (index, cell) in enumerate(row.find_all('td'))
           if index not in ommit_columns]
r.write(';'.join(content) + '\n')


In ommit_columns you can list several indices. In the list comprehension we use enumerate to obtain index/element pairs from row.find_all('td') and keep only the cells whose index is not in the ommit_columns list.
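To see the pattern in isolation, here is the same filtering on a plain list of strings (the values are hypothetical):

ommit_columns = [7]
texts = ['name', 'age', 'sex', 'party', 'criminal', 'cases', 'charges',
         'IPC details', 'education']
kept = [t for (i, t) in enumerate(texts) if i not in ommit_columns]
# kept holds every entry except 'IPC details', which sits at index 7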



          The complete code should be:



from bs4 import BeautifulSoup
import time
import requests

num = 1

url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

headers = {'User-Agent': 'Mozilla/5.0'}

with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')


while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

    time.sleep(1)
    # note: headers must be passed as a keyword argument; passed
    # positionally they end up as query parameters instead
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                for row in tablenew.find_all('tr'):
                    # Write in this list the indices of the columns
                    # you want to exclude
                    ommit_columns = [7]
                    content = [cell.text
                               for (index, cell) in enumerate(row.find_all('td'))
                               if index not in ommit_columns]
                    r.write(';'.join(content) + '\n')
        else:
            print('Too many tables')
    else:
        print('No response')
        print(num)

    num += 1


and the output would look like this:



          POLITICIANS ALL

          Banka Sahadev;53;M;IND;Yes;1;1;Graduate;19,000~ 19 Thou+;3,74,000~ 3 Lacs+;3,93,000~ 3 Lacs+;0~ ;N
          Godam Nagesh;49;M;TRS;No;0;0;Post Graduate;31,39,857~ 31 Lacs+;72,39,000~ 72 Lacs+;1,03,78,857~ 1 Crore+;1,48,784~ 1 Lacs+;Y
          Mosali Chinnaiah;40;M;IND;No;0;0;12th Pass;1,67,000~ 1 Lacs+;30,00,000~ 30 Lacs+;31,67,000~ 31 Lacs+;40,000~ 40 Thou+;Y
          Naresh;37;M;INC;No;0;0;Doctorate;12,00,000~ 12 Lacs+;6,00,000~ 6 Lacs+;18,00,000~ 18 Lacs+;0~ ;Y
          Nethawath Ramdas;44;M;IND;No;0;0;Illiterate;0~ ;0~ ;0~ ;0~ ;N
          Pawar Krishna;33;M;IND;Yes;1;1;Post Graduate;0~ ;0~ ;0~ ;0~ ;N
          Ramesh Rathod;48;M;TDP;Yes;3;1;12th Pass;54,07,000~ 54 Lacs+;1,37,33,000~ 1 Crore+;1,91,40,000~ 1 Crore+;4,18,32,000~ 4 Crore+;Y
          Rathod Sadashiv;55;M;BSP;No;0;0;Graduate;80,000~ 80 Thou+;13,25,000~ 13 Lacs+;14,05,000~ 14 Lacs+;0~ ;Y





Comment:

Hey Felipe, Thank you very much for your knowledge. That answer helped me scrape data from that link. – Rajesh Kumar, Nov 10 at 14:05
Answer 2, answered Oct 29 at 22:34 by Vitor SRG
As the IPC Details column seems to always be at index 7, you can just slice it out:



import csv
import requests
import time

from bs4 import BeautifulSoup

num = 1

url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

headers = {'User-Agent': 'Mozilla/5.0'}

with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')

while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

    time.sleep(1)
    # headers must be passed as a keyword argument to be sent as HTTP headers
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                # one writer per file handle is enough; tab-delimited output
                writer = csv.writer(r, delimiter='\t')
                for row in tablenew.find_all('tr'):
                    cells = list(map(lambda cell: cell.text, row.find_all('td')))
                    # drop the IPC Details cell at index 7
                    cells = cells[:7] + cells[8:]
                    writer.writerow(cells)
        else:
            print('Too many tables')
    else:
        print('No response')
        print(num)

    num += 1
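A note beyond the original answer on why csv.writer is the safer choice here: it quotes any field that happens to contain the delimiter or a line break, which plain string joining does not. A small illustration with made-up values:

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')
# the embedded tab in the first field is quoted automatically
writer.writerow(['voluntarily causing hurt\t(323)', '53', 'M'])
print(buf.getvalue())  # '"voluntarily causing hurt\t(323)"\t53\tM\r\n'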