I'm using YouTube Data API V3 to extract info about my YouTube channel.
I'd like to identify Shorts so I can analyze them separately.
I found a solution in another discussion: do a HEAD request to "https://www.youtube.com/shorts/videoId", since YouTube should redirect the URL if the video is not a Short and leave it alone if it is one.
Unfortunately, regardless of whether I pass a Short or not, I get <Response [302]>.
I suspect this is because I'm in the EU: if I try to access the URL without being logged in, I'm redirected to the cookie consent page: https://consent.youtube.com/m?continue=https%3A%2F%2Fwww.youtube.com%2Fshorts%2F-2mHZGXtXSo%3Fcbrd%3D1&gl=DE&m=0&pc=yt&uxe=eomty&hl=en&src=1
Is that the case?
If so, is there any workaround? (aside from a VPN)
Thanks in advance,
I would have gladly commented on the other discussion instead of creating another topic, but I'm a simple lurker with no reputation, so I can't comment.
Here is the original conversation: how do i get youtube shorts from youtube api data v3
Ran into this as well trying to identify shorts. Turns out, sending a cookie value of CONSENT=YES+ will bypass the consent screen. In Python, this might look like:
requests.head(shorts_link, cookies={"CONSENT": "YES+"})
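A slightly fuller sketch of that check (a heuristic, not an official API signal): requests.head() does not follow redirects by default, so a 200 on the /shorts/ URL suggests a Short, while a 3xx means YouTube bounced the request to the regular watch page.

import requests

def is_short(video_id):
    # HEAD request with the consent cookie; since requests.head() does not
    # follow redirects by default, the status code tells us whether the
    # /shorts/ URL was served in place.
    resp = requests.head(
        "https://www.youtube.com/shorts/" + video_id,
        cookies={"CONSENT": "YES+"},  # bypasses the EU consent interstitial
    )
    return resp.status_code == 200

print(is_short("-2mHZGXtXSo"))  # video ID taken from the consent URL quoted above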
Related
I am trying to build a UI where I can post to a Facebook page using Graph API v13.0.
I can see how to post a message, and how to post a link with a message, but I am unable to find an option to post multiple images, a message and multiple videos together at the same time.
Has anyone been able to do it using Python or any other language?
I need suggestions or pointers around this.
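For the multiple-images-plus-one-message part, there is a commonly documented two-step pattern: upload each photo unpublished, then attach the returned IDs to a single feed post. A sketch under those assumptions (page ID, token and image URLs are placeholders; videos go through a separate endpoint, and I'm not aware of a way to mix them into the same post):

import json
import requests

GRAPH = "https://graph.facebook.com/v13.0"
PAGE_ID = "YOUR_PAGE_ID"        # placeholder
TOKEN = "PAGE_ACCESS_TOKEN"     # placeholder

# Step 1: upload each image unpublished and collect the photo IDs.
photo_ids = []
for image_url in ["https://example.com/a.jpg", "https://example.com/b.jpg"]:
    r = requests.post(
        f"{GRAPH}/{PAGE_ID}/photos",
        data={"url": image_url, "published": "false", "access_token": TOKEN},
    )
    photo_ids.append(r.json()["id"])

# Step 2: create one feed post carrying the message and all attached photos.
post = {"message": "Hello from the API", "access_token": TOKEN}
for i, pid in enumerate(photo_ids):
    post[f"attached_media[{i}]"] = json.dumps({"media_fbid": pid})

requests.post(f"{GRAPH}/{PAGE_ID}/feed", data=post)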
I am trying a to build a comment scraper for YouTube comments. I already tried to build a scraper with selenium by opening a headless browser and scrolling down, opening comment responses and then extracting the data. But a headless browser is too slow and also the scrolling seems to be unreliable because the number of scraped comments does not match the number of comment given for each video. Maybe I did something wrong, but this is irrelevant to me right now: one of the main reasons why I would like to find another way is the time: it is too slow.
I know there have been a lot of questions about scraping youtube comment on stackoverflow, but almost every answer I found suggested to use some kind of headless browser, i.e. selenium, to do the job. I don’t like that because of the reasons mentioned above. Also, some references I found online suggest to refrain from using selenium for web scraping and instead to try reverse engineering, which means, as much as I understood, to emulate ajax calls and get the data, ideally in json.
When scrolling down a youtube video (here's an example), the comments are loaded dynamically. If I have a look at the XHR activity, I get a lot of information:
Looking at the XHR activity it seems like I find very information I need to make a request to get the json data with comments. But actually I struggle to construct the right the request in order to obtain the json file with the comments.
I read some tutorials online, which mostly offer simple examples which are easy to replicate. But none of them really helped me for my endeavour.
Can someone give me a hint and show me how to post the request in order to get the json file with the comments with python? I am using the requests library in Python.
I know, there is an Youtube API which can do the job, but I wanted to find out whether there is a way of doing the same thing without the API. I think it should be possible.
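One way to make the reverse-engineering route concrete: the watch page embeds an internal API key and a continuation token for the comment section, and the comment XHR is a POST to an internal "next" endpoint. A sketch under those assumptions (the regexes, the endpoint and the client context fields are all guesses about YouTube internals that can change at any time):

import re
import requests

session = requests.Session()
html = session.get("https://www.youtube.com/watch?v=VIDEO_ID").text  # placeholder ID

# The watch page embeds the internal API key and a first continuation
# token for the comment section inside its inline JSON; both regexes are
# assumptions about the current page layout.
api_key = re.search(r'"INNERTUBE_API_KEY":"([^"]+)"', html).group(1)
token = re.search(r'"continuationCommand":\{"token":"([^"]+)"', html).group(1)

# The comment XHR is a POST to the internal "next" endpoint, carrying a
# client context plus the continuation token.
payload = {
    "context": {"client": {"clientName": "WEB", "clientVersion": "2.20210101.00.00"}},
    "continuation": token,
}
data = session.post(
    "https://www.youtube.com/youtubei/v1/next",
    params={"key": api_key},
    json=payload,
).json()

# The comment texts sit deep inside this response, and the exact path is
# another moving target; inspect `data` interactively to locate them.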
import time
import requests
from sys import exit

url = 'https://www.videohere.com'  # placeholder; needs the http(s):// scheme or requests raises MissingSchema

while True:
    try:
        requests.get(url)  # fetches the page HTML over and over
        time.sleep(5)      # small pause so the server isn't hammered
    except KeyboardInterrupt:
        exit(0)
Ok, so I'm trying to use requests to "view" the video on this page, i.e. basically just stay on the page as if I were watching it. I have no clue how to go about doing it, so if anyone could give any suggestions or show a little PoC, please do! Thanks, guys.
What do you mean by using requests.get() to view a video?
If you would like the progress bar (I don't know what it is named, hope you can understand) to move on, you just need to post some data to the server. You don't need to really "get" the video.
For example, the Chinese video-sharing website BILIBILI uses a heartbeat to record where you were. If you close the webpage and open it again, it will offer to restore you to the second you were watching.
On different websites, the method might be different. On YouTube, the key is encrypted, so it might be a little challenging to pass your crawler off as a real person watching the video.
For the site you mentioned, mixed.com, you can find the following in Developer Tools.
To give you a brief idea of how data is posted on this site, the URL is:
https://browser.events.data.microsoft.com/OneCollector/1.0/?cors=true&content-type=application/x-json-stream&client-id=NO_AUTH&client-version=1DS-Web-JS-2.0.0-rc3&apikey=1918f1672f934c5ca6b2669551351de2-b61c1f8e-4440-4e53-8e38-b46e4c88a7e5-7038&upload-time=1569343709069&w=0
Once you are on someone's page, you can see that your browser keeps posting data to this target URL every few seconds. Personally, I think it is a tool to record the number of users watching in the room.
The https://browser.events.data.microsoft.com/OneCollector/1.0/ part of the URL is the same for everyone, but the rest of the query string is generated from the contents of your request headers.
What's more, the apikey and the payload tell me that it is really tough to make a crawler pretend to be a real user.
Most of it could easily be fetched from the website, but it might require dedication to avoid mistakes.
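If you do want to experiment with replaying such a heartbeat, a rough sketch with requests might look like this; every payload field below is a placeholder, since the real ones have to be copied from the requests your own browser sends in the Network tab:

import time
import requests

HEARTBEAT_URL = "https://browser.events.data.microsoft.com/OneCollector/1.0/"

# Placeholder payload: copy the real field names and values from the POSTs
# visible in Developer Tools; without them the server will most likely
# just discard the event.
payload = {"name": "watch.heartbeat", "data": {"position": 0}}

while True:
    payload["time"] = time.time()
    requests.post(HEARTBEAT_URL, json=payload)
    time.sleep(10)  # the browser posts every few seconds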
Is there a way to scrape Facebook comments and commenter IDs from a Facebook page like nytimes or the guardian, for analytical purposes?
For scraping, the quick answer is no. Use the API. I know this question is about Python, but if you use R, there is the Rfacebook package, which has the functions getPage() and getPost(). A combination of these (i.e. get the page, then loop through the post IDs with getPost() to get the comments and the IDs of the commenters) should get you what you want. Apologies, I don't know if there is anything similar for Python.
To use their API, you'll need to verify your app to get access to "pages_read_user_content" or "Page Public Content Access".
At first, using the API, you might GET the page ID, the page's post IDs and the permalinks to the posts on your own, but to scrape the comments with the API you'll need a verified business account.
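In Python, the same loop the Rfacebook answer describes might look like this against the Graph API directly (page ID and token are placeholders, and the token needs the permissions mentioned above):

import requests

GRAPH = "https://graph.facebook.com/v13.0"
PAGE_ID = "nytimes"            # placeholder page
TOKEN = "YOUR_ACCESS_TOKEN"    # placeholder token with the required permissions

# Get the page's posts, then loop through the post IDs to pull comments.
posts = requests.get(f"{GRAPH}/{PAGE_ID}/posts",
                     params={"access_token": TOKEN}).json()

for post in posts.get("data", []):
    comments = requests.get(
        f"{GRAPH}/{post['id']}/comments",
        params={"access_token": TOKEN, "fields": "from,message"},
    ).json()
    for c in comments.get("data", []):
        print(post["id"], c.get("from", {}).get("id"), c.get("message"))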
I'm trying to scrape a page on YouTube with Python which has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has EXTENSIVE APIs already in place for giving you access to just about any and all data you could possibly want.
Take a look at the YouTube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
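For illustration, the urllib + ElementTree pattern might have looked like this against the old v2-era GData feed; that endpoint has since been retired, so treat the URL and element names as historical assumptions and check the current Data API docs:

from urllib.request import urlopen
from xml.etree import ElementTree

# v2-era Atom feed URL (retired); shown only to illustrate the pattern.
feed = urlopen("http://gdata.youtube.com/feeds/api/videos?q=python&max-results=5")
tree = ElementTree.parse(feed)

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the feed
for entry in tree.findall(ATOM + "entry"):
    print(entry.find(ATOM + "title").text)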
The main problem is that you're violating the ToS (terms of service) of the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it. Technically, your best bets are python-spidermonkey and selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug on Firefox, then turn on the Net panel in Firebug and click the desired link on YouTube. Now see what happens and which pages are requested. Find the ones responsible for the AJAX part of the page. Now you can use urllib or Mechanize to fetch those links. If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be checking user login credentials, session info or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck, and happy "responsible" scraping! :)
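Once you've spotted the AJAX URL in the Net panel, replaying it is just a request with the same headers; a minimal sketch (URL and header values are placeholders copied from the panel):

from urllib.request import Request, urlopen

# Replay a discovered AJAX URL, sending the headers the browser sent.
req = Request(
    "https://www.youtube.com/some_ajax_url_found_in_firebug",  # placeholder
    headers={
        "Referer": "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder
        "User-Agent": "Mozilla/5.0",
    },
)
print(urlopen(req).read()[:200])  # peek at the response body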
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the Scrapy framework. It provides extensive support for crawling and scraping websites, and uses python-spidermonkey under the hood to access JavaScript links.
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.