Scrapy get text out of span


URL: https://myanimelist.net/anime/236/Es_Otherwise

I am trying to scrape the following content from this URL:

enter image description here

I tried :

for i in response.css('span[class = dark_text]') :

    i.xpath('/following-sibling::text()')

I also tried the following XPath expressions, which do not work, or maybe I missed something:

aired_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[11]/text()')
producer_xpath = response.xpath("//*[@id='content']/table/tbody/tr/td[1]/div/div[12]/span/a/@href/text()")
licensor_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[13]/a/text()')
studio_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[14]/a/@href/title/text()')
studio_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[17]/text()')
str_rating_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[18]/text()')
ranked_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[20]/span/text()')
japanese_title_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[7]/text()')
source_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[15]/text()')
genre_xpath = [response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a[{0}]'.format(i)) for i in range(1,4)]
genre_xpath_v2 = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a/@href/text()')
number_of_users_rated_anime_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[19]/span[3]/text()')
popularity_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[21]/span/text()')
members_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[22]/span/text()')
favorite_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[23]/span/text()')

But I figured out that some of the text sits outside the span elements, so I would like to get that text with a CSS or XPath expression.

edited Nov 11 at 8:41 by quant

asked Nov 10 at 15:10 by user9176398

Hi. Please can you write a paragraph or so to better explain your question?
– user
Nov 10 at 15:17

What language do you want to use? Do you have a deal with that site to scrape the content?
– bestprogrammerintheworld
Nov 10 at 16:38

I use Python with the Scrapy framework.
– user9176398
Nov 10 at 20:50

python html css scrapy

2 Answers

If you are only trying to scrape the information shown in the image, you can just use:

response.xpath('//div[@class="space-it"]//text()').extract()

Otherwise, I may not have understood your question properly.

answered Nov 10 at 17:18 by Gaurav

That syntax returns an empty list.
– user9176398
Nov 10 at 20:49

Have you changed the class name? The class name is actually spaceit.
– Gaurav
Nov 11 at 15:19

For a better result, you can try response.xpath('//div[@class="js-scrollfix-bottom"]//div[@class="spaceit"]
– Gaurav
Nov 11 at 15:41

It just won't return the alternative names and the type.
– Gaurav
Nov 11 at 15:43


It is simpler to just loop through the div elements inside the table:

from scrapy.selector import Selector

# htmlString holds the page's HTML source
foundH2 = False
nodes = Selector(text=htmlString).xpath('//*[@id="content"]/table/tr/td[1]/div/*')

for node in nodes:
  # name() returns the tag name of the current element
  tagName = node.xpath('name()').extract_first()
  if 'h2' == tagName:
    foundH2 = True
  if foundH2:
    # start adding 'info' after <h2>Alternative Titles</h2> found
    info = None
    if 'div' == tagName:
      for item in node.xpath('.//text()').extract():
        # stop reading this div once the ad script text is reached
        if 'googletag.' in item: break
        item = item.strip()
        if item and item != ',':
          info = info + " " + item if info else item
      if info:
        print(info)

Just my opinion: BeautifulSoup is faster and better than Scrapy.

answered Nov 10 at 21:19 by ewwink

Thanks, it works, but what are name and googletag? Can you explain your code a bit, please?
– user9176398
Nov 11 at 9:03

That is the div content that comes after "Favorites: 27"; the loop stops once it finds it.
– ewwink
Nov 11 at 9:04
