Python Beautiful Soup Table Data Scraping all except a specific td data

I am trying to scrape data about all the politicians of India from a website that spreads the records across multiple pages, identified by a number in the URL.



url: http://www.myneta.info/ls2014/comparisonchart.php?constituency_id=1


I want to export the data from multiple pages into a single CSV file.



This is a sample row from the table I am trying to parse:



<tr>
<td class=chartcell><a href='http://myneta.info/ls2014/candidate.php?candidate_id=7678' target=_blank>Banka Sahadev</a></td>
<td class=chartcell align=center>53</td>
<td class=chartcell align=center>M</td>
<td class=chartcell align=center>IND</td>
<td class=chartcell align=center><span style='font-size:150%;color:red'><b>Yes</b></span></td>
<td class=chartcell align=center><span style='font-size:160%;'><b>1</b></span></td>
<td class=chartcell align=center>1</td>
<td class=chartcell align=left> <b><span style='color:red'> criminal intimidation(506)</span></b>, <b><span style='color:red'> public nuisance in cases not otherwise provided for(290)</span></b>, <b><span style='color:red'> voluntarily causing hurt(323)</span></b>, </td>
<td class=chartcell align=center>Graduate</td>
<td class=chartcell align=center>19,000<br><span style='font-size:70%;color:brown'>~ 19&nbsp;Thou+</span></td>
<td class=chartcell align=center>3,74,000<br><span style='font-size:70%;color:brown'>~ 3&nbsp;Lacs+</span></td>
<td class=chartcell align=center>3,93,000<br><span style='font-size:70%;color:brown'>~ 3&nbsp;Lacs+</span></td>
<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>N</td>
<!--<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>2,00,000<br><span style='font-size:70%;color:brown'>~ 2&nbsp;Lacs+</span></td> -->
</tr>


I have used BeautifulSoup to get the data, but somehow the cells get merged together and look very clumsy when I open the CSV file.



Here is my code:



import time

import requests
from bs4 import BeautifulSoup

num = 1

url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

headers = {'User-Agent': 'Mozilla/5.0'}

with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')


while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

    time.sleep(1)
    response = requests.get(url, headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                for row in tablenew.find_all('tr'):
                    # every cell is left-justified to 250 characters,
                    # with nothing separating one cell from the next
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(250))
                    r.write('\n')
        else:
            print('Too many tables')
    else:
        print('No response')
        print(num)

    num += 1


Also, how could I omit the data from a specific td?
In my case, I don't want the IPC details column from the table.



I am fairly new to coding and Python.










Tags: python web-scraping html-table beautifulsoup export-to-csv






asked Oct 29 at 21:42 by Rajesh Kumar, edited Oct 30 at 12:02 by Brian Tompsett - 汤莱恩


2 Answers


Answer 1 (accepted), answered Oct 29 at 22:37 by Felipe Ferri
I think the "merged data problem" is due to the fact that you're not actually separating the cells with a delimiter. Open your generated CSV file in a plain text editor and you will see it.



An easy solution is to use the join method to build a single delimiter-separated string from the list of cells and write that to the file, e.g.:



content = [cell.text for cell in row.find_all('td')]
r.write(';'.join(content) + '\n')


On the first line I used what is called a "list comprehension", something that is very useful to learn. It lets you iterate over all the elements of a list in a single line of code instead of writing a for loop. On the second line I call the join method on the string ';', which converts the list content into a single string by joining all its elements with ';'. At the end I add the line break.
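For instance, a minimal sketch with made-up cell values:

content = ['Banka Sahadev', '53', 'M', 'IND']
line = ';'.join(content) + '\n'
# line is now 'Banka Sahadev;53;M;IND\n'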



If you want to omit elements based on their index (say, the cell at index 7, which here is the IPC details column), we can extend the list comprehension a little, like so:



# Write in this list the indices of the columns you want
# to exclude
ommit_columns = [7]
content = [cell.text
           for (index, cell) in enumerate(row.find_all('td'))
           if index not in ommit_columns]
r.write(';'.join(content) + '\n')


In ommit_columns you can list several indices. In the list comprehension we use enumerate to obtain index/element pairs from row.find_all('td') and keep only the cells whose index is not in the ommit_columns list.
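To see the pattern in isolation, here is the same filtering on a plain list of strings (the values are hypothetical):

ommit_columns = [7]
texts = ['name', 'age', 'sex', 'party', 'criminal', 'cases', 'charges',
         'IPC details', 'education']
kept = [t for (i, t) in enumerate(texts) if i not in ommit_columns]
# kept holds every entry except 'IPC details', which sits at index 7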



          The complete code should be:



from bs4 import BeautifulSoup
import time
import requests

num = 1

url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

headers = {'User-Agent': 'Mozilla/5.0'}

with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')


while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

    time.sleep(1)
    # note: headers must be passed as a keyword argument; passed
    # positionally they end up as query parameters instead
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                for row in tablenew.find_all('tr'):
                    # Write in this list the indices of the columns
                    # you want to exclude
                    ommit_columns = [7]
                    content = [cell.text
                               for (index, cell) in enumerate(row.find_all('td'))
                               if index not in ommit_columns]
                    r.write(';'.join(content) + '\n')
        else:
            print('Too many tables')
    else:
        print('No response')
        print(num)

    num += 1


and the output would look like this:



          POLITICIANS ALL

          Banka Sahadev;53;M;IND;Yes;1;1;Graduate;19,000~ 19 Thou+;3,74,000~ 3 Lacs+;3,93,000~ 3 Lacs+;0~ ;N
          Godam Nagesh;49;M;TRS;No;0;0;Post Graduate;31,39,857~ 31 Lacs+;72,39,000~ 72 Lacs+;1,03,78,857~ 1 Crore+;1,48,784~ 1 Lacs+;Y
          Mosali Chinnaiah;40;M;IND;No;0;0;12th Pass;1,67,000~ 1 Lacs+;30,00,000~ 30 Lacs+;31,67,000~ 31 Lacs+;40,000~ 40 Thou+;Y
          Naresh;37;M;INC;No;0;0;Doctorate;12,00,000~ 12 Lacs+;6,00,000~ 6 Lacs+;18,00,000~ 18 Lacs+;0~ ;Y
          Nethawath Ramdas;44;M;IND;No;0;0;Illiterate;0~ ;0~ ;0~ ;0~ ;N
          Pawar Krishna;33;M;IND;Yes;1;1;Post Graduate;0~ ;0~ ;0~ ;0~ ;N
          Ramesh Rathod;48;M;TDP;Yes;3;1;12th Pass;54,07,000~ 54 Lacs+;1,37,33,000~ 1 Crore+;1,91,40,000~ 1 Crore+;4,18,32,000~ 4 Crore+;Y
          Rathod Sadashiv;55;M;BSP;No;0;0;Graduate;80,000~ 80 Thou+;13,25,000~ 13 Lacs+;14,05,000~ 14 Lacs+;0~ ;Y





Comment:

Hey Felipe, Thank you very much for your knowledge. That answer helped me scrape data from that link. – Rajesh Kumar, Nov 10 at 14:05
Answer 2, answered Oct 29 at 22:34 by Vitor SRG
As the IPC Details column seems to always be at index 7, you can just slice it out:



import csv
import requests
import time

from bs4 import BeautifulSoup

num = 1

url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

headers = {'User-Agent': 'Mozilla/5.0'}

with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')

while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)

    time.sleep(1)
    # headers must be passed as a keyword argument to be sent as HTTP headers
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                # one writer per file handle is enough; tab-delimited output
                writer = csv.writer(r, delimiter='\t')
                for row in tablenew.find_all('tr'):
                    cells = list(map(lambda cell: cell.text, row.find_all('td')))
                    # drop the IPC Details cell at index 7
                    cells = cells[:7] + cells[8:]
                    writer.writerow(cells)
        else:
            print('Too many tables')
    else:
        print('No response')
        print(num)

    num += 1
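A note beyond the original answer on why csv.writer is the safer choice here: it quotes any field that happens to contain the delimiter or a line break, which plain string joining does not. A small illustration with made-up values:

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')
# the embedded tab in the first field is quoted automatically
writer.writerow(['voluntarily causing hurt\t(323)', '53', 'M'])
print(buf.getvalue())  # '"voluntarily causing hurt\t(323)"\t53\tM\r\n'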