Scrapy get text out of span
URL: https://myanimelist.net/anime/236/Es_Otherwise
I am trying to scrape the following content from that URL:
I tried:
for i in response.css('span[class = dark_text]'):
    i.xpath('/following-sibling::text()')
and these XPath expressions, which don't work (or I missed something):
aired_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[11]/text()')
producer_xpath = response.xpath("//*[@id='content']/table/tbody/tr/td[1]/div/div[12]/span/a/@href/text()")
licensor_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[13]/a/text()')
studio_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[14]/a/@href/title/text()')
studio_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[17]/text()')
str_rating_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[18]/text()')
ranked_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[20]/span/text()')
japanese_title_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[7]/text()')
source_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[15]/text()')
genre_xpath = [response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a[{0}]'.format(i)) for i in range(1,4)]
genre_xpath_v2 = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a/@href/text()')
number_of_users_rated_anime_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[19]/span[3]/text()')
popularity_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[21]/span/text()')
members_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[22]/span/text()')
favorite_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[23]/span/text()')
But I figured out that some of the text sits outside the span elements, so I would like to get that text (the part right after each span) with a CSS/XPath expression.
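To make it concrete, this is roughly the pairing I am after — a sketch only, not tested against the live page, and it assumes each label really is a span with class dark_text followed by plain text nodes inside the same parent div:

def label_value_pairs(response):
    # response is the Scrapy response (or Selector) for the anime page
    for label in response.css('span.dark_text'):
        key = label.xpath('normalize-space(text())').extract_first()
        # relative axis (no leading slash), so it starts from the span itself
        texts = label.xpath('following-sibling::text()').extract()
        value = ' '.join(t.strip() for t in texts if t.strip())
        yield key, value

So for a hypothetical row such as <div class="spaceit"><span class="dark_text">Type:</span> TV</div> this should yield ('Type:', 'TV').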
python html css scrapy
edited Nov 11 at 8:41 by quant
asked Nov 10 at 15:10 by user9176398
Hi. Please can you write a paragraph or so to better explain your question? – user, Nov 10 at 15:17
What language do you want to use? Do you have a deal with that site to scrape the content? – bestprogrammerintheworld, Nov 10 at 16:38
I use Python with the Scrapy framework. – user9176398, Nov 10 at 20:50
2 Answers
If you are only trying to scrape the information that you mentioned in the image, you can just make use of:
response.xpath('//div[@class="space-it"]//text()').extract()
Or I am unable to understand your question properly.
answered Nov 10 at 17:18 by Gaurav
That syntax returns an empty list. – user9176398, Nov 10 at 20:49
Have you changed the class name? The class name is actually spaceit. – Gaurav, Nov 11 at 15:19
For a better result you can try response.xpath('//div[@class="js-scrollfix-bottom"]//div[@class="spaceit"]') – Gaurav, Nov 11 at 15:41
It just won't return the alternative name and the type. – Gaurav, Nov 11 at 15:43
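Putting those comments together, one possible way to split each label from its value could look like this sketch (untested; it assumes the sidebar rows are spaceit divs inside the js-scrollfix-bottom column, each starting with a span.dark_text label, and that response is the Scrapy response as in the question):

rows = response.xpath('//div[@class="js-scrollfix-bottom"]//div[contains(@class, "spaceit")]')
for row in rows:
    # the label text sits inside the span; the value is every other text node in the row
    label = row.xpath('normalize-space(.//span[@class="dark_text"]/text())').extract_first()
    value_parts = row.xpath('.//text()[not(ancestor::span[@class="dark_text"])]').extract()
    value = ' '.join(t.strip() for t in value_parts if t.strip())
    if label:
        print(label, value)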
It is simpler to just loop through the divs inside the table:

from scrapy.selector import Selector

foundH2 = False
# htmlString is the raw HTML of the page
response = Selector(text=htmlString).xpath('//*[@id="content"]/table/tr/td[1]/div/*')
for resp in response:
    # name() gives the tag name of the current element, e.g. 'h2' or 'div'
    tagName = resp.xpath('name()').extract_first()
    if 'h2' == tagName:
        foundH2 = True
    if foundH2:
        # start adding 'info' after <h2>Alternative Titles</h2> is found
        info = None
        if 'div' == tagName:
            for item in resp.xpath('.//text()').extract():
                # the googletag ad script marks the end of the useful text in each div
                if 'googletag.' in item:
                    break
                item = item.strip()
                if item and item != ',':
                    info = info + " " + item if info else item
            if info:
                print(info)

Just my opinion: BeautifulSoup is faster and better than Scrapy.
answered Nov 10 at 21:19 by ewwink
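For context, a sketch of how the snippet above might be wired into a spider callback — the spider class and name are made up, and response.text stands in for the htmlString the answer assumes:

import scrapy

class AnimeInfoSpider(scrapy.Spider):
    name = 'anime_info'  # hypothetical spider name
    start_urls = ['https://myanimelist.net/anime/236/Es_Otherwise']

    def parse(self, response):
        # the raw page source plays the role of htmlString in the loop above
        htmlString = response.text
        # ... run the answer's loop here and, for example,
        # yield {'info': info} for each collected line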
Thanks, it works, but what are name() and googletag? Can you explain your code a bit, please? – user9176398, Nov 11 at 9:03
name() returns the tag name of the element (e.g. div); the googletag content comes right after "Favorites: 27", and the loop stops once it is found. – ewwink, Nov 11 at 9:04