Getting a JavaScript variable's value while scraping with Python












0















I know this has been asked before, but I am new to scraping and Python. Any help would be very useful on my learning path.

I am scraping a news site using Python with packages such as Beautiful Soup.

I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.

Here is the part of the HTML page that I am scraping (only the script part):



    <!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

    <script type="text/javascript" src="/dist/scripts/index.js"></script>
    <script type="text/javascript" src="/dist/scripts/read.js"></script>
    <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
    <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
        $("#load-more-btn").hide();
        $("#load-more-gif").show();
        $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
            data = JSON.parse(data);
            min_news_id = data.min_news_id||min_news_id; // line 2
            $(".card-stack").append(data.html);
        })
        .fail(function(){alert("Error : unable to load more news");})
        .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
    </script>


From the part above, I want to get the value of min_news_id in Python.
I should also get the value of the same variable after it is updated at line 2.

Here is how I am doing it:



    self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
    page = bs(htmlPage, "html.parser")
    # find all the script tags
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                scriptString.replace('"', '\"')
                print(scriptString)
                if(self.pattern.match(str(scriptString))):
                    print("matched")
                    data = self.pattern.match(scriptString)
                    jsVariable = json.loads(data.groups()[0])
                    InShortsScraper.newsOffset = jsVariable
                    print(InShortsScraper.newsOffset)


But I never get the value of the variable. Is the problem with my regular expression, or is it something else? Please help.
Thank you in advance.




























  • 1





    Some dynamic content is not rendered when scraping with BeautifulSoup. What you see in the browser and what your scraper gets can be markedly different (you can export page.content and compare). You'll need a different module, such as selenium or requests-html, that can handle dynamic content.

    – Idlehands
    Nov 13 '18 at 14:58













  • @Idlehands Thank you very much for the information. If you have any example reference please add it.

    – Anil
    Nov 13 '18 at 15:00











  • Can you share the URL?

    – QHarr
    Nov 13 '18 at 15:24











  • inshorts.com/en/read/politics

    – Anil
    Nov 13 '18 at 15:26











  • By using requests is the javascript data ALWAYS there? Also, is it the variable, in your above example, d7zlgjdu-1 that you're looking for?

    – Kamikaze_goldfish
    Nov 13 '18 at 15:37

























python web-scraping beautifulsoup python-3.6
















asked Nov 13 '18 at 14:55









Anil






















3 Answers


















1














You can't monitor a JavaScript variable for changes with BeautifulSoup, because it only parses the HTML and never runs the script. Instead, read the initial min_news_id from the page with re, then request each following page from the Ajax endpoint in a while loop and take the new min_news_id from the JSON response:

    from bs4 import BeautifulSoup
    import requests, re

    page_url = 'https://inshorts.com/en/read/politics'
    ajax_url = 'https://inshorts.com/en/ajax/more_news'

    htmlPage = requests.get(page_url).text
    # use BeautifulSoup to extract the article summaries from the first page
    # page = BeautifulSoup(htmlPage, "html.parser")
    # ...

    # get the current min_news_id
    min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1)  # result: d7zlgjdu-1

    customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}

    while min_news_id:
        # change "politics" if scraping a different category
        reqBody = {'category': 'politics', 'news_offset': min_news_id}
        # get the next page via Ajax; .json() parses the response body as JSON
        ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json()
        # again, extract the article summaries
        page = BeautifulSoup(ajax_response["html"], "html.parser")
        # ....
        # ....

        # new min_news_id
        min_news_id = ajax_response["min_news_id"]

        # remove this break to loop over all pages (thousands?)
        break
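The offset-following part of this loop can also be separated from the HTTP call, which makes it easy to try without touching the site. The following is only a sketch under assumptions: `iter_pages`, `fetch`, and `max_pages` are hypothetical names, and `fetch` is assumed to return a dict with the `html` and `min_news_id` keys used in the answer. It keeps the old offset when no new one is returned, as the page's own script does, and stops when an offset repeats.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_pages(first_offset: str,
               fetch: Callable[[str], Dict],
               max_pages: Optional[int] = None) -> Iterator[str]:
    """Yield the HTML fragment of each Ajax page, following min_news_id."""
    offset = first_offset
    seen = set()          # offsets already requested, to avoid looping forever
    pages = 0
    while offset and offset not in seen:
        if max_pages is not None and pages >= max_pages:
            break
        seen.add(offset)
        response = fetch(offset)
        yield response.get("html", "")
        pages += 1
        # like the page script: keep the old offset if none is returned
        offset = response.get("min_news_id") or offset


# Offline stand-in for the real POST request (hypothetical data).
def fake_fetch(offset: str) -> Dict:
    data = {
        "a-1": {"html": "<p>page 2</p>", "min_news_id": "b-1"},
        "b-1": {"html": "<p>page 3</p>", "min_news_id": None},
    }
    return data[offset]


pages = list(iter_pages("a-1", fake_fetch))
print(pages)
```

With the names from the answer, the real `fetch` would presumably be something like `lambda off: requests.post(ajax_url, headers=customHead, data={'category': 'politics', 'news_offset': off}).json()`.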































  • That's not hard in selenium: driver.execute_script("return min_news_id")

    – pguardiario
    Nov 14 '18 at 0:41













  • That returns the current value; it doesn't monitor the value for changes. But it's not hard if you use an element change.

    – ewwink
    Nov 14 '18 at 8:33













  • Just put it in a loop with a sleep

    – pguardiario
    Nov 14 '18 at 23:57











  • missed thinking about that, but you're right

    – ewwink
    Nov 15 '18 at 0:01






  • 1





    I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.

    – pguardiario
    Nov 15 '18 at 7:53
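The "loop with a sleep" idea from these comments could look roughly like the sketch below. It is written against a plain callable so it runs without a browser; `watch_value`, `interval`, and `max_polls` are hypothetical names, not part of Selenium.

```python
import time
from typing import Callable, Iterator, Optional


def watch_value(read_value: Callable[[], object],
                interval: float = 1.0,
                max_polls: Optional[int] = None) -> Iterator[object]:
    """Poll read_value() and yield the value each time it changes."""
    last = object()       # sentinel that never equals a real value
    polls = 0
    while max_polls is None or polls < max_polls:
        current = read_value()
        if current != last:
            yield current
            last = current
        polls += 1
        time.sleep(interval)


# Offline demonstration with a fake reader instead of a browser.
values = iter(["a", "a", "b", "b", "c"])
changes = list(watch_value(lambda: next(values), interval=0, max_polls=5))
print(changes)
```

With Selenium, assuming `driver` is a WebDriver that has already loaded the page, the reader would be `lambda: driver.execute_script("return min_news_id")`, the call quoted in the comment above. Polling only sees the value at each check, so a change that is overwritten between two polls goes unnoticed.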



















0














    import re

    html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

    <script type="text/javascript" src="/dist/scripts/index.js"></script>
    <script type="text/javascript" src="/dist/scripts/read.js"></script>
    <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
    <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
    $("#load-more-btn").hide();
    $("#load-more-gif").show();
    $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
    data = JSON.parse(data);
    min_news_id = data.min_news_id||min_news_id; // line 2
    $(".card-stack").append(data.html);
    })
    .fail(function(){alert("Error : unable to load more news");})
    .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
    </script>'''

    # find every assignment to min_news_id
    finder = re.findall(r'min_news_id = .*;', html)
    print(finder)

Output:

    ['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']

Option 2: strip everything except the value from the first match:

    print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())

Output:

    d7zlgjdu-1

Option 3: match the format of the ID directly:

    finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
    print(finder)

Output:

    ['d7zlgjdu-1']
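A capture group can return the value directly instead of stripping text afterwards. The sketch below also shows one likely reason the code in the question never matched: `match` only succeeds at the very start of the string, while `search` scans all of it. The shortened `script_text` and the exact pattern are assumptions made for illustration.

```python
import json
import re

# Shortened stand-in for the text of the inline <script> tag.
script_text = '''
var min_news_id = "d7zlgjdu-1"; // line 1
function loadMoreNews(){
min_news_id = data.min_news_id||min_news_id; // line 2
}
'''

# Capture the quoted string literal assigned in the "var" declaration.
pattern = re.compile(r'var\s+min_news_id\s*=\s*("[^"]*")\s*;')

# match() anchors at position 0, where the text starts with a newline.
print(pattern.match(script_text))

# search() scans the whole string and finds the declaration.
found = pattern.search(script_text)
# This double-quoted literal is also valid JSON, so json.loads unquotes it.
min_news_id = json.loads(found.group(1))
print(min_news_id)
```

This only reads the initial value written into the HTML. The value assigned at line 2 exists only after the script runs in a browser, so from Python it has to come from the Ajax response instead.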































  • It's not handling the value of the variable once it is updated.

    – Anil
    Nov 13 '18 at 17:03











  • What do you mean handle the value? What are you trying to accomplish?

    – Kamikaze_goldfish
    Nov 13 '18 at 17:19











  • First I get the articles by opening the URL. The page has a "load more" button, so I want to make the same call the button makes and get more articles. Here is the form data for the HTTP request: category: politics, news_offset: afk0bz0p-1. The URL for the HTTP POST request is https://inshorts.com/en/ajax/more_news

    – Anil
    Nov 13 '18 at 17:27













  • So that’s more than your original question implies. What is this variable that you’re scraping doing to submit form data?

    – Kamikaze_goldfish
    Nov 13 '18 at 17:36



















0














Thank you for the responses. I finally solved it using the requests package after reading its documentation.

Here is my code:

    if InShortsScraper.firstLoad == True:
        self.pattern = re.compile('var min_news_id = (.+?);')
    else:
        self.pattern = re.compile('min_news_id = (.+?);')
    page = None
    # print("Pattern: " + str(self.pattern))
    if news_offset == None:
        # first load: fetch the page itself
        htmlPage = urlopen(url)
        page = bs(htmlPage, "html.parser")
    else:
        # later loads: post the current offset to the Ajax URL
        self.loadMore['news_offset'] = InShortsScraper.newsOffset
        # print("payload : " + str(self.loadMore))
        try:
            r = myRequest.post(
                url = url,
                data = self.loadMore
            )
        except TypeError:
            print("Error in loading")

        InShortsScraper.newsOffset = r.json()["min_news_id"]
        page = bs(r.json()["html"], "html.parser")
    #print(page)
    if InShortsScraper.newsOffset == None:
        # no offset yet: read the initial value from the inline script
        scripts = page.find_all("script")
        for script in scripts:
            for line in script:
                scriptString = str(line)
                if "min_news_id" in scriptString:
                    finder = re.findall(self.pattern, scriptString)
                    InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()





























































    edited Nov 15 '18 at 13:36

























    answered Nov 13 '18 at 18:19









    ewwink













    • That's not hard in selenium: driver.execute_script("return min_news_id")

      – pguardiario
      Nov 14 '18 at 0:41













    • that's return current value, not monitor value on change. but its not hard if using element change.

      – ewwink
      Nov 14 '18 at 8:33













    • Just put it in a loop with a sleep

      – pguardiario
      Nov 14 '18 at 23:57











    • missed thinking about that, but you're right

      – ewwink
      Nov 15 '18 at 0:01






    • 1





      I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.

      – pguardiario
      Nov 15 '18 at 7:53



















    • That's not hard in selenium: driver.execute_script("return min_news_id")

      – pguardiario
      Nov 14 '18 at 0:41













    • that's return current value, not monitor value on change. but its not hard if using element change.

      – ewwink
      Nov 14 '18 at 8:33













    • Just put it in a loop with a sleep

      – pguardiario
      Nov 14 '18 at 23:57











    • missed thinking about that, but you're right

      – ewwink
      Nov 15 '18 at 0:01






    • 1





      I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.

      – pguardiario
      Nov 15 '18 at 7:53

















    That's not hard in selenium: driver.execute_script("return min_news_id")

    – pguardiario
    Nov 14 '18 at 0:41







    That's not hard in selenium: driver.execute_script("return min_news_id")

    – pguardiario
    Nov 14 '18 at 0:41















    that's return current value, not monitor value on change. but its not hard if using element change.

    – ewwink
    Nov 14 '18 at 8:33







    that's return current value, not monitor value on change. but its not hard if using element change.

    – ewwink
    Nov 14 '18 at 8:33















    Just put it in a loop with a sleep

    – pguardiario
    Nov 14 '18 at 23:57





    Just put it in a loop with a sleep

    – pguardiario
    Nov 14 '18 at 23:57













    missed thinking about that, but you're right

    – ewwink
    Nov 15 '18 at 0:01





    missed thinking about that, but you're right

    – ewwink
    Nov 15 '18 at 0:01




    1




    1





    I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.

    – pguardiario
    Nov 15 '18 at 7:53





    I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.

    – pguardiario
    Nov 15 '18 at 7:53













    0














    import re

    html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

    <script type="text/javascript" src="/dist/scripts/index.js"></script>
    <script type="text/javascript" src="/dist/scripts/read.js"></script>
    <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
    <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
    $("#load-more-btn").hide();
    $("#load-more-gif").show();
    $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
    data = JSON.parse(data);
    min_news_id = data.min_news_id||min_news_id; // line 2
    $(".card-stack").append(data.html);
    })
    .fail(function(){alert("Error : unable to load more news");})
    .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
    </script>'''

    # 1 Find every assignment to min_news_id in the page source
    finder = re.findall(r'min_news_id = .*;', html)
    print(finder)

    Output:
    ['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']


    #2 Or strip the first match down to the bare value:

    print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())

    Output:
    d7zlgjdu-1


    #3 Or match the format of the ID itself (8 lowercase letters or digits, a hyphen, one digit):

    finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
    print(finder)

    Output:
    ['d7zlgjdu-1']
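
A capture group can pull out the quoted value in one step instead of chaining replace() calls. The sketch below is an alternative, not part of the answer above: `extract_min_news_id` is a hypothetical helper name, and it assumes the initial value is always a double-quoted string literal declared with `var`.

```python
import re

# Matches: var min_news_id = "<value>"  and captures only <value>.
MIN_NEWS_ID_RE = re.compile(r'var\s+min_news_id\s*=\s*"([^"]*)"')


def extract_min_news_id(page_source):
    """Return the initial min_news_id from the page source, or None."""
    match = MIN_NEWS_ID_RE.search(page_source)
    return match.group(1) if match else None


sample = '''
var min_news_id = "d7zlgjdu-1"; // line 1
min_news_id = data.min_news_id||min_news_id; // line 2
'''
print(extract_min_news_id(sample))  # d7zlgjdu-1
```

This only sees the value written into the HTML. The assignment marked "line 2" happens in the browser at run time, so no regex over the static page can recover the updated value.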






    • It doesn't handle the value of the variable once it has been updated.

      – Anil
      Nov 13 '18 at 17:03











    • What do you mean by "handle the value"? What are you trying to accomplish?

      – Kamikaze_goldfish
      Nov 13 '18 at 17:19











    • First I get the articles by opening the URL. The page has a "load more" button, so I want to make the same call that button makes and get more articles. The form data for the HTTP request is category: politics, news_offset: afk0bz0p-1, and the URL for the HTTP POST request is https://inshorts.com/en/ajax/more_news

      – Anil
      Nov 13 '18 at 17:27













    • So that’s more than your original question implies. What is this variable that you’re scraping doing to submit form data?

      – Kamikaze_goldfish
      Nov 13 '18 at 17:36
















    answered Nov 13 '18 at 15:39 (edited Nov 13 '18 at 15:50)
    – Kamikaze_goldfish













    0














    Thank you for the responses. I finally solved it using the requests package after reading its documentation.

    Here is my code:

    # Fragment of a method on InShortsScraper. It assumes re, urlopen,
    # bs (BeautifulSoup) and myRequest (the requests package) are imported elsewhere.
    if InShortsScraper.firstLoad:
        self.pattern = re.compile('var min_news_id = (.+?);')
    else:
        self.pattern = re.compile('min_news_id = (.+?);')
    page = None
    # print("Pattern: " + str(self.pattern))
    if news_offset is None:
        # First load: fetch the full page
        htmlPage = urlopen(url)
        page = bs(htmlPage, "html.parser")
    else:
        # Later loads: post the current offset, as the "load more" button does
        self.loadMore['news_offset'] = InShortsScraper.newsOffset
        # print("payload : " + str(self.loadMore))
        try:
            r = myRequest.post(
                url = url,
                data = self.loadMore
            )
        except TypeError:
            print("Error in loading")

        # The JSON reply carries the next offset and the new articles' HTML
        InShortsScraper.newsOffset = r.json()["min_news_id"]
        page = bs(r.json()["html"], "html.parser")
    #print(page)
    if InShortsScraper.newsOffset is None:
        # No offset yet: read the initial value from the inline script
        scripts = page.find_all("script")
        for script in scripts:
            for line in script:
                scriptString = str(line)
                if "min_news_id" in scriptString:
                    finder = re.findall(self.pattern, scriptString)
                    InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()
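
The flow in the code above (post the current offset, read the next offset from the JSON reply, repeat) can be isolated into a small loop. This is a sketch under assumptions, not the actual InShortsScraper class: `iter_more_news` and `post_form` are hypothetical names, and `post_form(payload)` is assumed to send the form data to the more_news endpoint and return the decoded JSON as a dict with the "min_news_id" and "html" keys seen in the page's script.

```python
def iter_more_news(post_form, first_offset, category="politics", max_pages=5):
    """Yield the HTML fragment of each "load more" page.

    post_form is a callable taking the form payload (a dict) and returning
    the decoded JSON reply as a dict.
    """
    offset = first_offset
    for _ in range(max_pages):
        reply = post_form({"category": category, "news_offset": offset})
        yield reply["html"]
        # Mirror the page's own logic: keep the old offset if none is returned.
        next_offset = reply.get("min_news_id") or offset
        if next_offset == offset:
            break  # no progress, so stop instead of re-requesting the same page
        offset = next_offset


# Hypothetical wiring with the requests package:
#   import requests
#   url = "https://inshorts.com/en/ajax/more_news"
#   post_form = lambda payload: requests.post(url, data=payload).json()
#   for html in iter_more_news(post_form, "d7zlgjdu-1"):
#       ...parse html with BeautifulSoup...
```

Keeping the HTTP call behind `post_form` means the offset handling can be checked with a stub before pointing it at the live site.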





    answered Nov 15 '18 at 13:36
    – Anil





























