Getting a JavaScript variable's value while scraping with Python
I know this has been asked before, but I am new to scraping and Python, so any help would be valuable for my learning.
I am scraping a news site using Python with packages such as Beautiful Soup.
I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.
Here is the part of the HTML page I am scraping (only the script part):
<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>
From the part above, I want to get the value of min_news_id in Python.
I also need the value of the same variable after it is updated at line 2.
Here is how I am doing it:
self.pattern = re.compile('var min_news_id = (.+?);')  # or: self.pattern = re.compile('min_news_id = (.+?);')
page = bs(htmlPage, "html.parser")
# find all the script tags
scripts = page.find_all("script")
for script in scripts:
    for line in script:
        scriptString = str(line)
        if "min_news_id" in scriptString:
            scriptString.replace('"', '\"')
            print(scriptString)
            if(self.pattern.match(str(scriptString))):
                print("matched")
                data = self.pattern.match(scriptString)
                jsVariable = json.loads(data.groups()[0])
                InShortsScraper.newsOffset = jsVariable
                print(InShortsScraper.newsOffset)
But I never get the value of the variable. Is the problem with my regular expression, or something else? Please help me.
Thank you in advance.
python web-scraping beautifulsoup python-3.6
Some dynamic contents are not rendered when scraping with BeautifulSoup. What you're seeing in the browser vs. what your scraper is getting is markedly different. (You can export page.content and compare.) You'll need a different module like selenium or requests-html that can handle dynamic contents.
– Idlehands
Nov 13 '18 at 14:58
@Idlehands Thank you very much for the information. If you have an example reference, please add it.
– Anil
Nov 13 '18 at 15:00
Can you share the URL?
– QHarr
Nov 13 '18 at 15:24
inshorts.com/en/read/politics
– Anil
Nov 13 '18 at 15:26
By using requests, is the JavaScript data ALWAYS there? Also, is it the value in your example above, d7zlgjdu-1, that you're looking for?
– Kamikaze_goldfish
Nov 13 '18 at 15:37
asked Nov 13 '18 at 14:55
Anil
3 Answers
You can't monitor a JavaScript variable change using BeautifulSoup. Here is how to get the next page of news using a while loop, re, and json:
from bs4 import BeautifulSoup
import requests, re

page_url = 'https://inshorts.com/en/read/politics'
ajax_url = 'https://inshorts.com/en/ajax/more_news'
htmlPage = requests.get(page_url).text

# BeautifulSoup extract article summary
# page = BeautifulSoup(htmlPage, "html.parser")
# ...

# get current min_news_id
min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1)  # result: d7zlgjdu-1
customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}

while min_news_id:
    # change "politics" if in different category
    reqBody = {'category': 'politics', 'news_offset': min_news_id}
    # get Ajax next page
    ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json()  # parse string to json
    # again, do extract article summary
    page = BeautifulSoup(ajax_response["html"], "html.parser")
    # ....
    # ....
    # new min_news_id
    min_news_id = ajax_response["min_news_id"]
    # remove this to loop all page (thousand?)
    break
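A bounded variant of the loop above can be sketched as follows. This is only an illustration under assumptions: `iter_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and `fake_fetch` stands in for the `requests.post(...).json()` call so the cursor logic can be tried without network access. The loop stops when the response carries no new `min_news_id`, which mirrors the `data.min_news_id||min_news_id` fallback in the page's own script.

```python
from typing import Callable, Dict, Iterator


def iter_pages(fetch_page: Callable[[str], Dict], first_offset: str,
               max_pages: int = 5) -> Iterator[Dict]:
    """Yield Ajax responses, following min_news_id as a cursor."""
    offset = first_offset
    for _ in range(max_pages):
        response = fetch_page(offset)
        yield response
        next_offset = response.get("min_news_id")
        # Stop when the server returns no cursor or repeats the current one.
        if not next_offset or next_offset == offset:
            break
        offset = next_offset


# Stand-in for: requests.post(ajax_url, headers=customHead, data=reqBody).json()
# The second offset and the HTML snippets are made-up sample data.
FAKE_PAGES = {
    "d7zlgjdu-1": {"min_news_id": "aaaaaaaa-1", "html": "<div>page 2</div>"},
    "aaaaaaaa-1": {"min_news_id": "", "html": "<div>page 3</div>"},
}


def fake_fetch(offset: str) -> Dict:
    return FAKE_PAGES[offset]


pages = list(iter_pages(fake_fetch, "d7zlgjdu-1"))
print([p["html"] for p in pages])
```

Capping the loop with `max_pages` avoids walking thousands of pages by accident; swapping `fake_fetch` for a function that performs the real POST keeps the rest unchanged.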
That's not hard in selenium: driver.execute_script("return min_news_id")
– pguardiario
Nov 14 '18 at 0:41
That returns the current value; it doesn't monitor the value on change. But it's not hard if you watch for an element change.
– ewwink
Nov 14 '18 at 8:33
Just put it in a loop with a sleep
– pguardiario
Nov 14 '18 at 23:57
I missed thinking about that, but you're right
– ewwink
Nov 15 '18 at 0:01
I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
– pguardiario
Nov 15 '18 at 7:53
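The "loop with a sleep" idea from these comments might look roughly like the sketch below. It is an assumption-laden illustration, not the commenters' code: `watch_value` is a hypothetical helper, and a scripted sequence of sample values replaces the browser so the snippet runs without Selenium. With a real driver, `read_value` would be something like `lambda: driver.execute_script("return min_news_id")`.

```python
import time
from typing import Callable, Iterator, Optional


def watch_value(read_value: Callable[[], str], interval: float = 1.0,
                max_polls: int = 10) -> Iterator[str]:
    """Poll read_value and yield it each time it differs from the last value seen."""
    last: Optional[str] = None
    for _ in range(max_polls):
        current = read_value()
        if current != last:
            yield current
            last = current
        time.sleep(interval)


# A scripted sequence stands in for the browser; the values are sample data.
samples = iter(["d7zlgjdu-1", "d7zlgjdu-1", "afk0bz0p-1", "afk0bz0p-1"])
changes = list(watch_value(lambda: next(samples), interval=0.0, max_polls=4))
print(changes)
```

Polling only sees the value at each tick, so a change that happens and reverts between two polls would go unnoticed; a shorter `interval` narrows that window at the cost of more calls into the browser.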
import re
html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>'''
finder = re.findall(r'min_news_id = .*;', html)
print(finder)
Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']
#2 OR YOU CAN USE
print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())
Output:
d7zlgjdu-1
#3 OR YOU CAN USE
finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)
Output:
['d7zlgjdu-1']
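For comparison with the code in the question, one likely reason its pattern never matched is that `re.match` only succeeds at the very start of the string, whereas `re.search` scans the whole text. Note also that `str.replace` returns a new string rather than changing the original. The sketch below assumes the inline script looks like the sample above (shortened here) and captures the quoted value directly.

```python
import json
import re

# Shortened copy of the inline script; it starts with a newline, as tag contents often do.
script_text = '''
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
min_news_id = data.min_news_id||min_news_id; // line 2
}
'''

pattern = re.compile(r'var min_news_id\s*=\s*(".*?");')

# match() anchors at position 0, so it fails on text that starts with a newline.
anchored = pattern.match(script_text)

# search() scans forward and finds the declaration wherever it appears.
found = pattern.search(script_text)

# The captured group is a quoted JS string literal, which is also valid JSON here.
value = json.loads(found.group(1))
print(anchored, value)
```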
It's not handling the value of the variable once it is updated.
– Anil
Nov 13 '18 at 17:03
What do you mean by handling the value? What are you trying to accomplish?
– Kamikaze_goldfish
Nov 13 '18 at 17:19
First I get the articles by opening the URL. The page has a "load more" button, so I want to make the same call that button makes and get more articles. Here is the form data for the HTTP request: category: politics, news_offset: afk0bz0p-1, and the URL for the HTTP POST request is https://inshorts.com/en/ajax/more_news
– Anil
Nov 13 '18 at 17:27
So that's more than your original question implies. What is this variable that you're scraping doing to submit form data?
– Kamikaze_goldfish
Nov 13 '18 at 17:36
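To make the form data described above concrete, here is a small standard-library sketch that builds (but does not send) the POST request. It is illustrative only: the `X-Requested-With` header is an assumption and may not be required by the site.

```python
from urllib.parse import urlencode
from urllib.request import Request

ajax_url = "https://inshorts.com/en/ajax/more_news"
form = {"category": "politics", "news_offset": "afk0bz0p-1"}

# Form fields are sent URL-encoded in the request body.
body = urlencode(form).encode("ascii")
request = Request(
    ajax_url,
    data=body,
    headers={"X-Requested-With": "XMLHttpRequest"},  # assumed, not confirmed as required
    method="POST",
)

print(request.get_method(), request.data)
# Passing `request` to urllib.request.urlopen would actually send it.
```

The `news_offset` field is where the scraped `min_news_id` value goes, which is why the variable matters for loading further articles.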
Thank you for the responses. I finally solved it using the requests package after reading its documentation.
Here is my code:
# Fragment from inside a scraper method; re, bs, urlopen and myRequest are imported elsewhere.
if InShortsScraper.firstLoad:
    self.pattern = re.compile('var min_news_id = (.+?);')
else:
    self.pattern = re.compile('min_news_id = (.+?);')
page = None
# print("Pattern: " + str(self.pattern))
if news_offset is None:
    # first load: fetch the full HTML page
    htmlPage = urlopen(url)
    page = bs(htmlPage, "html.parser")
else:
    # later loads: POST the current offset to the Ajax endpoint
    self.loadMore['news_offset'] = InShortsScraper.newsOffset
    # print("payload : " + str(self.loadMore))
    try:
        r = myRequest.post(
            url = url,
            data = self.loadMore
        )
    except TypeError:
        print("Error in loading")
    InShortsScraper.newsOffset = r.json()["min_news_id"]
    page = bs(r.json()["html"], "html.parser")
#print(page)
if InShortsScraper.newsOffset is None:
    # no offset yet: read it from the inline script on the page
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                finder = re.findall(self.pattern, scriptString)
                InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()
edited Nov 15 '18 at 13:36
answered Nov 13 '18 at 18:19
ewwink
That's not hard in selenium:driver.execute_script("return min_news_id")
– pguardiario
Nov 14 '18 at 0:41
that's return current value, not monitor value on change. but its not hard if using element change.
– ewwink
Nov 14 '18 at 8:33
Just put it in a loop with asleep
– pguardiario
Nov 14 '18 at 23:57
missed thinking about that, but you're right
– ewwink
Nov 15 '18 at 0:01
1
I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
– pguardiario
Nov 15 '18 at 7:53
add a comment |
That's not hard in selenium:driver.execute_script("return min_news_id")
– pguardiario
Nov 14 '18 at 0:41
that's return current value, not monitor value on change. but its not hard if using element change.
– ewwink
Nov 14 '18 at 8:33
Just put it in a loop with asleep
– pguardiario
Nov 14 '18 at 23:57
missed thinking about that, but you're right
– ewwink
Nov 15 '18 at 0:01
1
I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
– pguardiario
Nov 15 '18 at 7:53
That's not hard in selenium:
driver.execute_script("return min_news_id")
– pguardiario
Nov 14 '18 at 0:41
That's not hard in selenium:
driver.execute_script("return min_news_id")
– pguardiario
Nov 14 '18 at 0:41
that's return current value, not monitor value on change. but its not hard if using element change.
– ewwink
Nov 14 '18 at 8:33
that's return current value, not monitor value on change. but its not hard if using element change.
– ewwink
Nov 14 '18 at 8:33
Just put it in a loop with a
sleep
– pguardiario
Nov 14 '18 at 23:57
Just put it in a loop with a
sleep
– pguardiario
Nov 14 '18 at 23:57
missed thinking about that, but you're right
– ewwink
Nov 15 '18 at 0:01
missed thinking about that, but you're right
– ewwink
Nov 15 '18 at 0:01
1
1
I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
– pguardiario
Nov 15 '18 at 7:53
I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
– pguardiario
Nov 15 '18 at 7:53
add a comment |
html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>'''
finder = re.findall(r'min_news_id = .*;', html)
print(finder)
Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']
#2 OR YOU CAN USE
print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())
Output:
d7zlgjdu-1
#3 OR YOU CAN USE
finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)
Output:
['d7zlgjdu-1']
Its not handling the value of the variable, once if it is updated
– Anil
Nov 13 '18 at 17:03
What do you mean handle the value? What are you trying to accomplish?
– Kamikaze_goldfish
Nov 13 '18 at 17:19
First I will get the articles from opening the url, and we have load more button on the page, so I want to make call to load more button and get more articles. here is the form-data to the http request :category: politics news_offset: afk0bz0p-1
and the url to make http post request ishttps://inshorts.com/en/ajax/more_news
– Anil
Nov 13 '18 at 17:27
So that’s more than your original question implies. What is this variable that you’re scraping doing to submit form data?
– Kamikaze_goldfish
Nov 13 '18 at 17:36
add a comment |
html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>
<script type="text/javascript" src="/dist/scripts/index.js"></script>
<script type="text/javascript" src="/dist/scripts/read.js"></script>
<script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
<script type="text/javascript">
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
$("#load-more-btn").hide();
$("#load-more-gif").show();
$.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
data = JSON.parse(data);
min_news_id = data.min_news_id||min_news_id; // line 2
$(".card-stack").append(data.html);
})
.fail(function(){alert("Error : unable to load more news");})
.always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();
</script>'''
finder = re.findall(r'min_news_id = .*;', html)
print(finder)
Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']
#2 OR YOU CAN USE
print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())
Output:
d7zlgjdu-1
#3 OR YOU CAN USE
finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)
Output:
['d7zlgjdu-1']
Its not handling the value of the variable, once if it is updated
– Anil
Nov 13 '18 at 17:03
What do you mean handle the value? What are you trying to accomplish?
– Kamikaze_goldfish
Nov 13 '18 at 17:19
First I will get the articles from opening the url, and we have load more button on the page, so I want to make call to load more button and get more articles. here is the form-data to the http request :category: politics news_offset: afk0bz0p-1
and the url to make http post request ishttps://inshorts.com/en/ajax/more_news
– Anil
Nov 13 '18 at 17:27
So that’s more than your original question implies. What is this variable that you’re scraping doing to submit form data?
– Kamikaze_goldfish
Nov 13 '18 at 17:36
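The request described in these comments can be sketched roughly as follows. This assumes the endpoint answers with JSON containing `html` and `min_news_id` keys, as the page's own `loadMoreNews()` callback suggests. `load_more` and its `post` parameter are hypothetical names; `post` is expected to behave like `requests.post`.

```python
MORE_NEWS_URL = "https://inshorts.com/en/ajax/more_news"


def load_more(post, news_offset, category="politics"):
    """POST the same form data the page's loadMoreNews() sends.

    `post` is any callable shaped like requests.post (a hypothetical
    injection point so the function can be tried without network access).
    Returns (html_fragment, next_offset).
    """
    payload = {"category": category, "news_offset": news_offset}
    response = post(MORE_NEWS_URL, data=payload)
    data = response.json()
    # Mirror the JavaScript: keep the old offset if the server sends none.
    next_offset = data.get("min_news_id") or news_offset
    return data.get("html", ""), next_offset
```

With the real library this would be called as `load_more(requests.post, "afk0bz0p-1")`, and the returned offset would be passed to the next call.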
edited Nov 13 '18 at 15:50
answered Nov 13 '18 at 15:39
Kamikaze_goldfish
Thank you for the response. I finally solved it using the requests package after reading its documentation.
Here is my code:
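import re
from urllib.request import urlopen

import requests as myRequest          # assumed alias, matching the name used below
from bs4 import BeautifulSoup as bs   # assumed alias, matching the name used below

# Fragment from a method of InShortsScraper; url and news_offset come from the caller.
# The first page declares the variable with "var"; later matches do not need it.
if InShortsScraper.firstLoad:
    self.pattern = re.compile('var min_news_id = (.+?);')
else:
    self.pattern = re.compile('min_news_id = (.+?);')
page = None
if news_offset is None:
    # First load: fetch the page itself and parse the HTML.
    htmlPage = urlopen(url)
    page = bs(htmlPage, "html.parser")
else:
    # Later loads: POST the last known offset, as the page's loadMoreNews() does.
    self.loadMore['news_offset'] = InShortsScraper.newsOffset
    try:
        r = myRequest.post(url=url, data=self.loadMore)
    except TypeError:
        print("Error in loading")
    else:
        # Only read the response if the request succeeded.
        # The JSON carries both the next offset and the new HTML.
        payload = r.json()
        InShortsScraper.newsOffset = payload["min_news_id"]
        page = bs(payload["html"], "html.parser")
if InShortsScraper.newsOffset is None:
    # No offset yet: read the initial value from the inline <script> block.
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                finder = re.findall(self.pattern, scriptString)
                # findall returns only the captured group, so just drop the quotes.
                InShortsScraper.newsOffset = finder[0].replace('"', '').strip()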
answered Nov 15 '18 at 13:36
Anil
Some dynamic content is not rendered when scraping with BeautifulSoup. What you're seeing in the browser and what your scraper is getting can be markedly different. (You can export page.content and compare.) You'll need a different module, like selenium or requests-html, that can handle dynamic content.
– Idlehands
Nov 13 '18 at 14:58
@Idlehands Thank you very much for the information. If you have any example reference please add it.
– Anil
Nov 13 '18 at 15:00
Can you share the URL?
– QHarr
Nov 13 '18 at 15:24
inshorts.com/en/read/politics
– Anil
Nov 13 '18 at 15:26
When using requests, is the JavaScript data ALWAYS there? Also, is the value in your example above, d7zlgjdu-1, the one you're looking for?
– Kamikaze_goldfish
Nov 13 '18 at 15:37