Python Beautiful Soup Table Data Scraping all except a specific td data
I am trying to scrape data from a website that contains data on all politicians of India, spread across multiple pages denoted by numbers.
url: http://www.myneta.info/ls2014/comparisonchart.php?constituency_id=1
I want the data from all of these pages exported into a single CSV file.
This is a sample row from the table I am trying to parse:
<tr>
<td class=chartcell><a href='http://myneta.info/ls2014/candidate.php?candidate_id=7678' target=_blank>Banka Sahadev</a></td>
<td class=chartcell align=center>53</td>
<td class=chartcell align=center>M</td>
<td class=chartcell align=center>IND</td>
<td class=chartcell align=center><span style='font-size:150%;color:red'><b>Yes</b></span></td>
<td class=chartcell align=center><span style='font-size:160%;'><b>1</b></span></td>
<td class=chartcell align=center>1</td>
<td class=chartcell align=left> <b><span style='color:red'> criminal intimidation(506)</span></b>, <b><span style='color:red'> public nuisance in cases not otherwise provided for(290)</span></b>, <b><span style='color:red'> voluntarily causing hurt(323)</span></b>, </td>
<td class=chartcell align=center>Graduate</td>
<td class=chartcell align=center>19,000<br><span style='font-size:70%;color:brown'>~ 19 Thou+</span></td>
<td class=chartcell align=center>3,74,000<br><span style='font-size:70%;color:brown'>~ 3 Lacs+</span></td>
<td class=chartcell align=center>3,93,000<br><span style='font-size:70%;color:brown'>~ 3 Lacs+</span></td>
<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>N</td>
<!--<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>0<br><span style='font-size:70%;color:brown'>~ </span></td>
<td class=chartcell align=center>2,00,000<br><span style='font-size:70%;color:brown'>~ 2 Lacs+</span></td> -->
</tr>
I have used BeautifulSoup to get the data, but somehow the data gets merged and looks very clumsy when I open the CSV file.
Here is my code:
num = 1
url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
headers = {'User-Agent': 'Mozilla/5.0'}

with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')

while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
    time.sleep(1)
    response = requests.get(url, headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                for row in tablenew.find_all('tr'):
                    for cell in row.find_all('td'):
                        r.write(cell.text.ljust(250))
                    r.write('\n')
        else:
            print('Too many tables')
    else:
        print('No response')
    print(num)
    num += 1
Also, how could I omit the data from a specific td? In my case, I don't want the IPC details column from the table. I am fairly new to coding and Python.
python web-scraping html-table beautifulsoup export-to-csv
asked Oct 29 at 21:42 by Rajesh Kumar
edited Oct 30 at 12:02 by Brian Tompsett - 汤莱恩
2 Answers
accepted
I think the "merged data" problem is due to the fact that you're not actually separating the cells with a delimiter such as commas. Check your generated CSV file in a regular text editor to see that.
An easy solution is to use the join method to create a single delimited string from the list of cells and write that to the file, e.g.:
content = [cell.text for cell in row.find_all('td')]
r.write(';'.join(content) + '\n')
On the first line I used what is called a "list comprehension", something that is very useful for you to learn. It lets you iterate over all the elements of a list in a single line of code instead of writing a for loop. On the second line I call the join method on the string ';'. This converts the list content into a single string, joining all its elements with ';'. At the end I add the line break.
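For example, here is the same pattern on its own (a toy snippet with invented values):

cells = ['Banka Sahadev', '53', 'M', 'IND']
line = ';'.join(cells)     # join the list elements with ';'
print(line)                # Banka Sahadev;53;M;IND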
If you want to omit elements based on index (say, omit the IPC details column at index 7), we can extend the list comprehension a little, like so:
# Write on this array the indices of the columns you want
# to exclude
ommit_columns = [7]
content = [cell.text
           for (index, cell) in enumerate(row.find_all('td'))
           if index not in ommit_columns]
r.write(';'.join(content) + '\n')
In ommit_columns you can write several indices. In the list comprehension we use enumerate to obtain the index of each element returned by row.find_all('td'), and then filter the cells by checking that index is not in the ommit_columns list.
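If enumerate is new to you, this toy snippet (values invented for illustration) shows the (index, element) pairs it produces:

for index, cell_text in enumerate(['name', 'age', 'party']):
    print(index, cell_text)
# 0 name
# 1 age
# 2 party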
The complete code should be:
from bs4 import BeautifulSoup
import time
import requests

num = 1
url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
headers = {'User-Agent': 'Mozilla/5.0'}

with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')

while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
    time.sleep(1)
    # pass headers as a keyword argument so they are sent as HTTP headers
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                for row in tablenew.find_all('tr'):
                    # content = [cell.text for cell in row.find_all('td')]
                    # r.write(';'.join(content) + '\n')

                    # Write on this array the indices of the columns you want
                    # to exclude
                    ommit_columns = [7]
                    content = [cell.text
                               for (index, cell) in enumerate(row.find_all('td'))
                               if index not in ommit_columns]
                    r.write(';'.join(content) + '\n')
        else:
            print('Too many tables')
    else:
        print('No response')
    print(num)
    num += 1
and the output would look like this:
POLITICIANS ALL
Banka Sahadev;53;M;IND;Yes;1;1;Graduate;19,000~ 19 Thou+;3,74,000~ 3 Lacs+;3,93,000~ 3 Lacs+;0~ ;N
Godam Nagesh;49;M;TRS;No;0;0;Post Graduate;31,39,857~ 31 Lacs+;72,39,000~ 72 Lacs+;1,03,78,857~ 1 Crore+;1,48,784~ 1 Lacs+;Y
Mosali Chinnaiah;40;M;IND;No;0;0;12th Pass;1,67,000~ 1 Lacs+;30,00,000~ 30 Lacs+;31,67,000~ 31 Lacs+;40,000~ 40 Thou+;Y
Naresh;37;M;INC;No;0;0;Doctorate;12,00,000~ 12 Lacs+;6,00,000~ 6 Lacs+;18,00,000~ 18 Lacs+;0~ ;Y
Nethawath Ramdas;44;M;IND;No;0;0;Illiterate;0~ ;0~ ;0~ ;0~ ;N
Pawar Krishna;33;M;IND;Yes;1;1;Post Graduate;0~ ;0~ ;0~ ;0~ ;N
Ramesh Rathod;48;M;TDP;Yes;3;1;12th Pass;54,07,000~ 54 Lacs+;1,37,33,000~ 1 Crore+;1,91,40,000~ 1 Crore+;4,18,32,000~ 4 Crore+;Y
Rathod Sadashiv;55;M;BSP;No;0;0;Graduate;80,000~ 80 Thou+;13,25,000~ 13 Lacs+;14,05,000~ 14 Lacs+;0~ ;Y
answered Oct 29 at 22:37 by Felipe Ferri

Hey Felipe, thank you very much for your knowledge. That answer helped me scrape data from that link. – Rajesh Kumar, Nov 10 at 14:05
As the IPC Details column seems to always be at index 7, you can just slice it out:
import csv
import requests
import time
from bs4 import BeautifulSoup

num = 1
url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
headers = {'User-Agent': 'Mozilla/5.0'}

with open('newstats.csv', 'w') as r:
    r.write('POLITICIANS ALL\n')

while num < 3:
    url = 'http://www.myneta.info/ls2014/comparisonchart.php?constituency_id={}'.format(num)
    time.sleep(1)
    # pass headers as a keyword argument so they are sent as HTTP headers
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        tablenew = soup.find_all('table', id="table1")
        if len(tablenew) < 2:
            tablenew = tablenew[0]
            with open('newstats.csv', 'a') as r:
                for row in tablenew.find_all('tr'):
                    cells = list(map(lambda cell: cell.text, row.find_all('td')))
                    cells = cells[:7] + cells[8:]
                    writer = csv.writer(r, delimiter='\t')
                    writer.writerow(cells)
        else:
            print('Too many tables')
    else:
        print('No response')
    print(num)
    num += 1
answered Oct 29 at 22:34 by Vitor SRG
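One refinement worth noting (a suggestion of this edit, not from either answer above): the csv module documentation recommends opening the file with newline='' so the writer controls line endings itself, and the writer only needs to be created once per file rather than once per row. A minimal sketch, assuming tablenew holds the table element found above:

import csv

# newline='' lets csv.writer handle row terminators itself
with open('newstats.csv', 'a', newline='') as r:
    writer = csv.writer(r, delimiter='\t')
    for row in tablenew.find_all('tr'):
        cells = [cell.text for cell in row.find_all('td')]
        # drop the IPC details column (index 7) before writing
        writer.writerow(cells[:7] + cells[8:])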