Web Crawling: get shares of Youtube Video from statistics tab
Does anybody know a way to get the share count of YouTube videos (not my own)? I would like to store the counts in a DB. This does not work with the YouTube API. Another problem is that not every YouTube video has the statistics tab.
So far I have tried the YouTube API, the jsoup HTML parser (the div showing the shares wasn't there, although it is shown via Inspect in Firefox, for example), and the import.io demo, which worked but is definitely too expensive.
web-scraping youtube youtube-api web-crawler extract
Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
– DaImTo
Jun 20 '17 at 7:56
edited Nov 11 at 16:36
Bertrand Martel
asked Jun 20 '17 at 7:40
neodymium
1 Answer (accepted, 4 votes)
The best way is to look at the network logs. In this case they show a POST request to:
https://www.youtube.com/insight_ajax?action_get_statistics_and_data=1&v=$video_id
The request body carries an XSRF token, which is available in the original HTML of the video page https://www.youtube.com/watch?v=$video_id
in a JavaScript object like:
yt.setConfig({
'XSRF_TOKEN': "QUFFLUhqbnNvZUx4THR3eV80dHlacV9tRkRxc2NwSjlXQXxBQ3Jtc0ttd0JLWENnMjdYNE5IRWhibE9ZdDJTSk1aMktxTDR5d3JjSnkzVUtQWVcwdnp3X0tSOXEtM3hZdzVFdjNPeGpPRGtLVU5pVXV0SmtfdWJSUHNqTVg2WXBndjZpa3d6U25ja2FTelBBVWRlT0lZZkRDaDV6SU94VWE3cnpERHhWNVlUYWdyRjFqN1hvc0VLRmVwcEY3ZWdJMWgyUmc=",
'XSRF_FIELD_NAME': "session_token",
'XSRF_REDIRECT_TOKEN': "VlhMkn6F56dGGYcm4Rg7jCZR0vJ8MTQ5ODA1NzIwMkAxNDk3OTcwODAy"
});
The request also needs some cookies that are set by this same video page.
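The token lookup described above can be tried offline before any request is made. This is only a minimal sketch: `extract_xsrf_token` and `sample_page` are hypothetical names, the token value is a placeholder, and the page fragment is made up to follow the `yt.setConfig` shape shown above. It assumes the token contains no double quote.

```python
import re

# Matches the 'XSRF_TOKEN': "..." entry and captures everything up to the closing quote.
XSRF_RE = re.compile(r"'XSRF_TOKEN'\s*:\s*\"([^\"]*)\"")


def extract_xsrf_token(page_html):
    """Return the XSRF token embedded in the page, or None if it is absent."""
    match = XSRF_RE.search(page_html)
    return match.group(1) if match else None


# Made-up page fragment (assumption); a real page would come from the watch URL.
sample_page = """
yt.setConfig({
  'XSRF_TOKEN': "PLACEHOLDER_TOKEN=",
  'XSRF_FIELD_NAME': "session_token"
});
"""

print(extract_xsrf_token(sample_page))
print(extract_xsrf_token("<html>no config here</html>"))
```

Returning None instead of raising makes it easy to skip pages where the config object is missing.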
Using Python with BeautifulSoup & python-requests:
import re

import requests
from bs4 import BeautifulSoup

s = requests.Session()  # the session keeps the cookies set by the video page
video_id = "CPkU0dF4JKo"

# Load the video page to get the cookies and the XSRF token
r = s.get('https://www.youtube.com/watch?v={}'.format(video_id))
xsrf_token = re.search(r"'XSRF_TOKEN'\s*:\s*\"(.*)\"", r.text, re.IGNORECASE).group(1)

# POST the token as session_token to the statistics endpoint
r = s.post(
    'https://www.youtube.com/insight_ajax?action_get_statistics_and_data=1&v={}'.format(video_id),
    data = {
        'session_token': xsrf_token
    }
)

# The XML response wraps the statistics HTML in an html_content element
metrics = [
    int(t.text.encode('ascii', 'ignore').decode('ascii').split(' ', 1)[0])
    for t in BeautifulSoup(r.content, "lxml").find('html_content').find("tr").findAll("div", {"class":"bragbar-metric"})
]
print(metrics)
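The parsing step can also be exercised without network access, and it can return None for videos that have no statistics, as the question mentions. The following is a rough sketch using only the standard library instead of BeautifulSoup. `MetricParser`, `parse_metrics`, the `<root>` wrapper and the sample values are all assumptions for illustration; only the `html_content` element and the `bragbar-metric` class come from the answer.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class MetricParser(HTMLParser):
    """Collect the text of every <div> whose class list includes bragbar-metric."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a bragbar-metric div
        self.metrics = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a metric div
        elif "bragbar-metric" in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.metrics.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.metrics[-1] += data


def parse_metrics(response_xml):
    """Return the metric values as ints, or None when html_content is missing."""
    node = ET.fromstring(response_xml).find(".//html_content")
    if node is None or not node.text:
        return None
    parser = MetricParser()
    parser.feed(node.text)
    values = []
    for text in parser.metrics:
        ascii_text = text.encode("ascii", "ignore").decode("ascii").strip()
        values.append(int(ascii_text.split(" ", 1)[0]))
    return values


# Made-up response (assumption): the real XML layout may differ.
sample_response = """<root><html_content><![CDATA[
<div><table><tr>
<td><div class="bragbar-metric">120862 views</div></td>
<td><div class="bragbar-metric">454 hours</div></td>
</tr></table></div>
]]></html_content></root>"""

print(parse_metrics(sample_response))
print(parse_metrics("<root><error>no statistics</error></root>"))
```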
Using bash with curl, sed, pup & xml_grep:
The following bash script will:
- request the video page https://www.youtube.com/watch?v=$video_id with curl
- store the cookies in a file called cookie.txt
- extract the XSRF_TOKEN with sed (it is sent as session_token in the following request)
- request the video statistics page https://www.youtube.com/insight_ajax?action_get_statistics_and_data=1&v=$video_id with curl, using the cookies previously stored
- parse the XML result and extract the CDATA part with xml_grep
- parse the HTML with pup to extract the bragbar-metric class div and convert the HTML result to JSON with json{}
- use sed to remove Unicode characters
The script :
video_id=CPkU0dF4JKo
session_token=$(curl -s -c cookie.txt "https://www.youtube.com/watch?v=$video_id" | \
    sed -rn "s/.*'XSRF_TOKEN'\s*:\s*\"(.*)\".*/\1/p")
curl -s -b cookie.txt -d "session_token=$session_token" \
    "https://www.youtube.com/insight_ajax?action_get_statistics_and_data=1&v=$video_id" | \
    xml_grep --text_only 'html_content' | \
    pup 'div table tr .bragbar-metric text{}' | \
    sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//' | \
    sed 's/\s.*$//'
It gives number of views, time watched, subscriptions, shares:
120862
454
18
213
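Since the question asks about storing the values in a DB, here is one possible way to persist the four numbers. It is a sketch under assumptions: SQLite is an arbitrary choice, and the `video_stats` table, its column names and `store_metrics` are hypothetical. The column order follows the order stated above (views, time watched, subscriptions, shares).

```python
import sqlite3

# Order of the values as reported by the answer's output.
METRIC_NAMES = ("views", "time_watched", "subscriptions", "shares")


def store_metrics(conn, video_id, metrics):
    """Insert or replace one row of statistics for a video."""
    if len(metrics) != len(METRIC_NAMES):
        raise ValueError(
            "expected {} metrics, got {}".format(len(METRIC_NAMES), len(metrics))
        )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS video_stats ("
        "video_id TEXT PRIMARY KEY, views INTEGER, time_watched INTEGER, "
        "subscriptions INTEGER, shares INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO video_stats VALUES (?, ?, ?, ?, ?)",
        (video_id, *metrics),
    )
    conn.commit()


# In-memory database for the example; a file path would make it persistent.
conn = sqlite3.connect(":memory:")
store_metrics(conn, "CPkU0dF4JKo", [120862, 454, 18, 213])
row = conn.execute(
    "SELECT shares FROM video_stats WHERE video_id = ?", ("CPkU0dF4JKo",)
).fetchone()
print(row[0])
```

Using the video ID as the primary key means a later crawl of the same video overwrites the earlier row rather than adding a second one.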
edited Nov 11 at 14:30
answered Jun 20 '17 at 15:31
Bertrand Martel