Python: Get resolution of videos from streaming links

I am pretty new to Python. I am currently using version 3.3.2. I have an array of links to sources of streaming videos. What I want to do is sort them into HD (720p and above) and non-HD (below 720p) videos. I have been searching the internet, but the closest I got was a ffmpeg Python wrapper, https://code.google.com/p/pyffmpeg/.
So I wanted to know if it is even possible? If yes, can you please link me to some resources, or what keywords I should be searching on google?
Regards

The easy way to do this is to use the web service API for each site.
For example, the YouTube API lets you issue a search and get back metadata on all of the matching videos. If you look at the video's properties, you can check definition == 'hd', or you can iterate over the video's videoStreams and check whether heightPixels >= 720 or bitrateBps >= 8*1024*1024 or whatever you think is an appropriate definition of "HD" if you don't like theirs.
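As a rough illustration, here is what that check looks like against the current YouTube Data API v3 with the google-api-python-client package (rather than the older gdata interface); the API key and video IDs are placeholders, and contentDetails.definition simply reports "hd" or "sd":

    # Hedged sketch: classify YouTube video IDs as HD / non-HD via the Data API v3.
    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

    def is_hd(video_id):
        response = youtube.videos().list(part="contentDetails", id=video_id).execute()
        items = response.get("items", [])
        # contentDetails.definition is either "hd" or "sd"
        return bool(items) and items[0]["contentDetails"]["definition"] == "hd"

    hd, non_hd = [], []
    for vid in ["dQw4w9WgXcQ", "someOtherId"]:   # placeholder video IDs
        (hd if is_hd(vid) else non_hd).append(vid)
    print(hd, non_hd)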
You can find the APIs for most sites by just googling "Foo API", but here are links for the ones you asked about:
Daily Motion
Metacafe: I can't find the API docs anymore, but it's just simple RSS feeds.
YouTube
The hard way to do this is to write a universal video downloader—which is very, very hard—and process the file with something like pyffmpeg after you download it (or, if you're lucky, after you've only downloaded part of it).
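If you do go that route, a quick way to check the resolution of a file you have already downloaded, as a hedged alternative to pyffmpeg, is to shell out to ffprobe (which ships with FFmpeg); the file name below is a placeholder:

    import json
    import subprocess

    def video_height(path):
        # Ask ffprobe for the height of the first video stream, as JSON.
        out = subprocess.check_output(
            ["ffprobe", "-v", "error", "-select_streams", "v:0",
             "-show_entries", "stream=height", "-of", "json", path],
            text=True)
        return json.loads(out)["streams"][0]["height"]

    print("HD" if video_height("downloaded_clip.mp4") >= 720 else "non-HD")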

Related

Sending a post request with python to extract youtube comments from json

I am trying to build a comment scraper for YouTube comments. I already tried to build a scraper with Selenium by opening a headless browser, scrolling down, opening comment responses and then extracting the data. But a headless browser is too slow, and the scrolling seems to be unreliable because the number of scraped comments does not match the number of comments shown for each video. Maybe I did something wrong, but that is irrelevant to me right now: one of the main reasons I would like to find another way is time: it is too slow.
I know there have been a lot of questions about scraping YouTube comments on Stack Overflow, but almost every answer I found suggested using some kind of headless browser, i.e. Selenium, to do the job. I don't like that for the reasons mentioned above. Also, some references I found online suggest refraining from using Selenium for web scraping and instead trying reverse engineering, which means, as far as I understand, emulating the AJAX calls and getting the data, ideally as JSON.
When scrolling down a YouTube video (here's an example), the comments are loaded dynamically. If I have a look at the XHR activity, I get a lot of information.
Looking at the XHR activity, it seems like I can find all the information I need to make a request that returns the JSON data with the comments. But I actually struggle to construct the right request in order to obtain that JSON.
I read some tutorials online, which mostly offer simple examples which are easy to replicate. But none of them really helped me for my endeavour.
Can someone give me a hint and show me how to post the request in order to get the JSON file with the comments in Python? I am using the requests library.
I know there is a YouTube API which can do the job, but I wanted to find out whether there is a way of doing the same thing without the API. I think it should be possible.
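Roughly, the pattern I am trying to follow with requests looks like this; the endpoint URL, headers and payload are placeholders that I would copy from the XHR tab (YouTube's internal comment endpoint and its continuation tokens change over time, so none of these values are authoritative):

    import requests

    # Placeholder values -- copy the real endpoint, headers and body from the
    # browser's Network/XHR panel ("Copy as cURL" helps).
    XHR_URL = "https://www.youtube.com/<comment-endpoint-from-devtools>"
    payload = {"continuation": "<token-from-the-previous-response-or-page-source>"}
    headers = {"User-Agent": "Mozilla/5.0", "Content-Type": "application/json"}

    resp = requests.post(XHR_URL, json=payload, headers=headers)
    resp.raise_for_status()
    data = resp.json()   # walk this nested dict to pull out the comment text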

How to extract files from ScrapingHub?

I have deployed some Scrapy spiders to scrape data which I can download in .csv from ScrapingHub.
Some of these spiders use a FilesPipeline, which I used to download files (PDFs) to a specific folder. Is there any way I can retrieve these files from ScrapingHub via the platform or API?
Though I still have to go over ScrapingHub's documentation, I'm quite certain that despite the file explorer, no actual file is generated, or it is ignored during the crawl. I assume so given that if you try to deploy a project with anything other than the files that belong to a standard Scrapy project, it won't work unless you do some hacking around with your settings and setup files to get ScrapingHub to accept the extra files. For example, if you keep a long list of start URLs in a file and read them into your spider with a helper function, that works like a charm locally, but ScrapingHub wasn't built with that in mind.
I assume you know that you can download your items as CSV or another format straight from the web interface. Personally, I use the ScrapingHub client API in Python. All three of their libraries are, I believe, deprecated at this point, but you kind of have to mix and match them to get something fully functional.
I have a side gig doing content aggregation for an adult video site. By using the ScrapingHub API client for Python, I'm able to connect to my account with the API key, maneuver my way around and do as I please. There are some limitations, though; one thing that really bothers me is that the function to get the name of a project was deprecated in the first version of their client library. I'd like to see the project name while I'm parsing my items across the different jobs (the crawls) a spider runs, so when I first started to mess around with the client it just looked messy.
What's even better is that once you create a project, run your spider and collect all your items, you can download those items directly from the web interface as I mentioned, but you can also target your output to get the effect you want.
When I'm crawling a site for media items like videos, there are three things I always need: the name or title of the video, the source URL where the video can be reached or where it is embedded (which you can then request whenever you need it), and the metadata, i.e. the tags and categories associated with the video.
The largest crawl, the one that output the most items, was around 150,000 items; it was a broad crawl, and something like 15-17% of them were duplicates. I then access each video through the API client by its key value (it's not actually a dictionary, by the way). In my case I usually use all three key values, but I can also filter on the categories or tags stored under their corresponding keys and output only the items (still outputting all three fields) that match a particular string or expression, which lets me sort through my content quite effectively. In this particular Scrapy project, I'm simply printing out, or rather creating, an .m3u playlist from all of it.
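For example, a rough sketch of how I pull a finished job's items with the python-scrapinghub client and write the playlist; the API key, project ID and item field names ('title', 'url', 'tags') are placeholders for whatever your spiders actually output:

    from scrapinghub import ScrapinghubClient

    client = ScrapinghubClient("YOUR_API_KEY")
    project = client.get_project(12345)          # numeric project id from the dashboard

    with open("playlist.m3u", "w") as playlist:
        playlist.write("#EXTM3U\n")
        for job in project.jobs.iter(state="finished"):
            for item in client.get_job(job["key"]).items.iter():
                # Keep only items whose tags match the string I'm after.
                if "sometag" in item.get("tags", []):
                    playlist.write("#EXTINF:-1,{}\n{}\n".format(item["title"], item["url"]))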

How to use a search bar with Python

Right now I am trying to figure out how to take a row of data, maybe 50 entries max, and enter each entry individually into a search bar. But first I need to understand the basic concepts, so I want to write a practice program that takes info from an Excel sheet and enters it into a Google or YouTube search, for example.
My problem is that there seems to be no resource on how to do this for beginners. All the posts I have read either cover only part of the problem or are about creating a search bar rather than using one. And every post I read suggests a hundred plug-ins I could possibly add.
I'm just looking for a consistent explanation so I can grasp how to manipulate code in order to use a search-bar function.
To perform a web search (Google, YouTube or whatever) from a program you need to execute the search, either by building up and calling an appropriate search URL or by making a call to an API provided by that site.
The article 'Python - Search for YouTube video' provides a code sample and explanation of how to generate and call a URL to perform a YouTube keyword search. You could do something similar for a Google search by analysing the URL from the result of a Google search, or try searching for 'Python submit google search url'.
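As a minimal sketch of that URL-based approach (assuming openpyxl for reading the spreadsheet and YouTube's search_query parameter; the file name and column layout are placeholders):

    import webbrowser
    from urllib.parse import urlencode
    from openpyxl import load_workbook

    wb = load_workbook("terms.xlsx")             # placeholder spreadsheet
    terms = [row[0] for row in wb.active.iter_rows(values_only=True) if row[0]]

    for term in terms:
        # Build the same URL YouTube's own search bar produces and open it.
        url = "https://www.youtube.com/results?" + urlencode({"search_query": term})
        webbrowser.open(url)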
The above approach is simplistic and relies on the URL structure for a particular site staying the same. A more complex, reliable and flexible approach is to use the API. For YouTube:
YouTube API - Python developers guide
YouTube API - Python code samples - Search by keyword
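A hedged sketch of what the keyword search from those samples boils down to (requires google-api-python-client and an API key of your own):

    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")
    response = youtube.search().list(
        q="python tutorial", part="snippet", type="video", maxResults=5).execute()

    for item in response["items"]:
        print(item["snippet"]["title"],
              "https://www.youtube.com/watch?v=" + item["id"]["videoId"])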

YouTube API retrieve demographic information about a video

Is there a way to pull traffic information for a particular YouTube video (demographics, for example: age of users, country, gender, etc.), say using the Python gdata module or the YouTube API? I have been looking through the module's documentation, but so far have found nothing.
There used to be a range of endpoints exposing this kind of data in the past, but they are all gone now, so I do not think it is possible any more.

Crawling web for specific file type

As a part of a research, I need to download freely available RDF (Resource Description Framework - *.rdf) files via web, as much as possible. What are the ideal libraries/frameworks available in Python for doing this?
Are there any websites/search engines capable of doing this? I've tried a Google filetype:rdf search. Initially, Google shows about 6,960,000 results. However, as you browse the individual result pages, the count drops drastically, down to 205 results. I wrote a script to screen-scrape and download the files, but 205 is not enough for my research, and I am sure there are more than 205 such files on the web. So I really need a file crawler. I'd like to know whether there are any online or offline tools that can be used for this purpose, or frameworks/sample scripts in Python to achieve this. Any help in this regard is highly appreciated.
Crawling RDF content from the Web is no different from crawling any other content. That said, if your question is "what is a good Python web crawler", then you should read this question: Anyone know of a good Python based web crawler that I could use?. If your question is about processing RDF with Python, then there are several options, one being RDFLib.
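For instance, a small hedged sketch that fetches a seed page, follows links ending in .rdf, saves them, and parses each one with RDFLib (the seed URL is a placeholder, and the .rdf files are assumed to be RDF/XML):

    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    from rdflib import Graph

    seed = "https://example.org/datasets"        # placeholder starting page
    soup = BeautifulSoup(requests.get(seed).text, "html.parser")

    for a in soup.find_all("a", href=True):
        href = urljoin(seed, a["href"])
        if href.lower().endswith(".rdf"):
            data = requests.get(href).content
            with open(href.rsplit("/", 1)[-1], "wb") as f:
                f.write(data)
            g = Graph()
            g.parse(data=data, format="xml")     # .rdf is usually RDF/XML
            print(href, len(g), "triples")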
Did you notice text like "Google has hidden similar results, click here to show all results" at the bottom of one of the pages? That might help.
I know I'm a bit late with this answer, but for future searchers: http://sindice.com/ is a great index of RDF documents.
Teleport Pro, although it probably can't copy from Google itself (too big), can likely handle proxy sites that return Google results. I know for a fact that I could download 10,000 PDFs within a day if I wanted to; it has file-type specifiers and many options.
Here's one workaround:
get "Download Master" from the Chrome extensions, or a similar program
search on Google (or another engine) for your results, with Google set to 100 results per page
select "show all files"
enter your file extension, .rdf, and press enter
press download
You can grab 100 files per click, which is not bad.