I'm working on a project which requires downloading the first n results from Google Images given a query term, and I'm wondering how to do this. It seems that they deprecated their API recently, and I haven't been able to find a good up-to-date answer. Ultimately, I want to
Enter query term
Save the URLs for the first n images in a txt file (example)
Download the images from those URLs
I have seen similar solutions that use Selenium, but I was hoping to use Requests instead. I'm not very familiar with HTML parsing, but I have used BeautifulSoup before! Any help is greatly appreciated. I'm currently using Python 3.8.
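For what it's worth, here is a rough sketch of the Requests + BeautifulSoup approach. Treat it as assumption-heavy rather than authoritative: Google's markup changes frequently, the static HTML mostly exposes thumbnail URLs, and the query parameters, headers, and parsing below are guesses to adapt rather than a guaranteed recipe.

import requests
from bs4 import BeautifulSoup

def google_image_urls(query, n=10):
    # tbm=isch switches Google search to image results; the User-Agent header
    # is a guess to avoid the bare-bones page served to unknown clients
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": query, "tbm": "isch"},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # keep only <img> tags whose src is an absolute http(s) URL (skips inline base64 data)
    urls = [img["src"] for img in soup.find_all("img")
            if img.get("src", "").startswith("http")]
    return urls[:n]

def save_and_download(query, n=10):
    urls = google_image_urls(query, n)
    with open("urls.txt", "w") as f:          # step 2: save the URLs to a txt file
        f.write("\n".join(urls))
    for i, url in enumerate(urls):            # step 3: download each image
        data = requests.get(url, timeout=10).content
        with open(f"image_{i}.jpg", "wb") as out:
            out.write(data)

save_and_download("kittens", n=5)  # placeholder query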
I've come across an assignment which requires me to extract tabular data from images in a PDF file into neatly formatted dataframes via Python code. There are several files to be processed, and the relevant pages may have different page numbers in each file, hence the sequence of steps for this problem (my assumption) is:
Navigate to relevant section of the pdf
Extract images of the tabular data
Extract data from the images, format and convert to dataframes.
Some Google searches led me to libraries for PDF text extraction, table extraction and more, but these are modular solutions that each cover only part of the pipeline.
I would appreciate some help in this regard. What packages should I use? Is my approach correct?
Can I get references to any helpful code snippets for similar problems?
[Image: page structure of the required tables]
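Not an answer, but to make steps 1 and 2 above concrete, here is a rough sketch using PyMuPDF (my own choice of library, not named in the question) to find the relevant pages and render them to images; the marker phrase and file names are placeholders:

import fitz  # PyMuPDF (pip install pymupdf)

doc = fitz.open("report.pdf")  # placeholder file name
for page in doc:
    # step 1: locate the relevant section by searching for a marker phrase
    if page.search_for("Table 1"):
        # step 2: render the page to an image for downstream OCR / table extraction
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # 2x zoom for sharper OCR
        pix.save(f"page_{page.number}.png")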
This started as a comment. I believe the answer is valid, as it is in no way an endorsement of the service; I don't even use it. I know Azure uses SO as well.
This is the stuff of commercial services. You can try Azure Form Recognizer (with which I am not affiliated):
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer
Here are some Python examples of how to use it:
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/how-to-guides/try-sdk-rest-api?pivots=programming-language-python
The AWS equivalent is Textract https://aws.amazon.com/textract
The Google Cloud version is called Form Parser - see https://cloud.google.com/document-ai/docs/processors-list#processor_form-parser
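To give a feel for step 3, here is a minimal sketch of turning the recognized tables into dataframes with the azure-ai-formrecognizer Python SDK. The endpoint, key, and file name are placeholders, and the exact client and field names may vary by SDK version, so check the linked guides before relying on this:

import pandas as pd
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                      # placeholder
)

with open("report.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

dataframes = []
for table in result.tables:
    # rebuild each recognized table as a 2D grid, then wrap it in a DataFrame
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    # assumes the first row is the header row
    dataframes.append(pd.DataFrame(grid[1:], columns=grid[0]))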
So I'm using PRAW, a Python wrapper for the Reddit API (see http://praw.readthedocs.io/en to find out more), and I have managed to print the URLs of my latest upvoted posts.
# `reddit` is an authenticated praw.Reddit instance created earlier

# start subreddit instance
subreddit = reddit.subreddit('dankmemes')
print("\n -> Current Subreddit: ")
print(subreddit.display_name)

redditor2 = reddit.redditor('me')

# print the URL of every submission this redditor has upvoted
for upvoted in reddit.redditor('Kish_v').upvoted():
    print(upvoted.url)
This outputs a long list of imgur URLs etc. However, I want to be able to download those images to a folder and then reupload them as a sort of scraper.
So I have got to the point where upvoted.url holds my URLs, but would the best way to do the above be to put the links into an "array" (list) and then download those images individually? How would I go about doing this? Sorry, I am fairly new to Python; I came from PHP to use this well-documented API.
Thank you,
Kish
Check out the requests module; this SO question should point you in the right direction for downloading the URLs you have on hand. I can't comment on the reuploading half.
How to download image using requests
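Roughly, something like this sketch (it assumes the reddit instance from your snippet; the folder name and extension filter are my own choices, not anything PRAW requires):

import os
import requests

def download_image(url, dest_dir="downloads"):
    os.makedirs(dest_dir, exist_ok=True)
    # use the last path segment as the file name (good enough for direct imgur links)
    filename = os.path.join(dest_dir, url.rstrip("/").split("/")[-1])
    resp = requests.get(url, stream=True, timeout=10)
    resp.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return filename

# `reddit` is the authenticated praw.Reddit instance from the question
for upvoted in reddit.redditor('Kish_v').upvoted():
    # only grab direct image links; gallery or page URLs would need extra handling
    if upvoted.url.lower().endswith(('.jpg', '.jpeg', '.png', '.gif')):
        download_image(upvoted.url)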
I am pretty new to Python. I am currently using version 3.3.2. I have an array of links to sources of streaming videos. What I want to do is sort them into HD (720p and above) and non-HD (below 720p) videos. I have been searching the internet, but the closest I got was a Python wrapper for FFmpeg, https://code.google.com/p/pyffmpeg/.
So I wanted to know if it is even possible? If yes, can you please link me to some resources, or tell me what keywords I should be searching for on Google?
Regards
The easy way to do this is to use the web service API for each site.
For example, the YouTube API lets you issue a search and get back metadata on all of the matching videos. If you look at the video's properties, you can check definition == 'hd', or you can iterate over the videoStreams for the video and check whether heightPixels >= 720 or bitrateBps >= 8*1024*1024, or whatever you think is an appropriate definition of "HD" if you don't like theirs.
You can find the APIs for most sites by just googling "Foo API", but here are links for the ones you asked about:
Daily Motion
Metacafe: I can't find the API docs anymore, but it's just simple RSS feeds.
YouTube
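If you go the YouTube route, here is an illustration-only sketch of the definition check against the current Data API v3 (the field names below come from v3's videos.list with part=contentDetails, which may not match the properties mentioned above; the API key and video IDs are placeholders):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder

def is_hd(video_id):
    # videos.list with part=contentDetails reports definition as 'hd' or 'sd'
    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={"part": "contentDetails", "id": video_id, "key": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return bool(items) and items[0]["contentDetails"]["definition"] == "hd"

hd_videos, sd_videos = [], []
for vid in ["dQw4w9WgXcQ"]:  # placeholder list of video IDs
    (hd_videos if is_hd(vid) else sd_videos).append(vid)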
The hard way to do this is to write a universal video downloader—which is very, very hard—and process the file with something like pyffmpeg after you download it (or, if you're lucky, after you've only downloaded part of it).
As part of a research project, I need to download as many freely available RDF (Resource Description Framework, *.rdf) files as possible from the web. What are the ideal libraries/frameworks available in Python for doing this?
Are there any websites/search engines capable of doing this? I've tried a Google filetype:RDF search. Initially, Google shows you 6,960,000 results. However, as you browse the individual results pages, the results drastically drop down to 205. I wrote a script to screen-scrape and download the files, but 205 is not enough for my research and I am sure there are more than 205 files on the web. So, I really need a file crawler. I'd like to know whether there are any online or offline tools that can be used for this purpose, or frameworks/sample scripts in Python to achieve this. Any help in this regard is highly appreciated.
Crawling RDF content from the Web is no different from crawling any other content. That said, if your question is "what is a good Python web crawler", then you should read this question: Anyone know of a good Python based web crawler that I could use?. If your question is related to processing RDF with Python, then there are several options, one being RDFLib.
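For the processing side, here is a minimal RDFLib sketch (the file name is a placeholder):

from rdflib import Graph  # pip install rdflib

g = Graph()
g.parse("example.rdf")  # rdflib guesses the format; pass format="xml" to force RDF/XML

print(len(g), "triples loaded")
for subject, predicate, obj in g:
    print(subject, predicate, obj)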
Did you notice the text at the bottom of one page, something like "Google has hidden similar results, click here to show all results"? That might help.
I know that I'm a bit late with this answer, but for future searchers: http://sindice.com/ is a great index of RDF documents.
Teleport Pro: although it probably can't copy from Google itself (too big), it can likely handle proxy sites that return Google results, and I know for a fact I could download 10,000 PDFs within a day if I wanted to. It has filetype specifiers and many options.
Here's one workaround:
Get "Download Master" from the Chrome extensions (or a similar program).
Search on Google (or another engine) for results, with Google set to 100 results per page.
Select "show all files".
Enter your file extension, .rdf, and press Enter.
Press download.
You can get 100 files per click, not bad.
I want to be able to download a page and all of its associated resources (images, style sheets, script files, etc.) using Python. I am (somewhat) familiar with urllib2 and know how to download individual URLs, but before I go and start hacking at BeautifulSoup + urllib2, I wanted to be sure that there isn't already a Python equivalent to "wget --page-requisites http://www.google.com".
Specifically I am interested in gathering statistical information about how long it takes to download an entire web page, including all resources.
Thanks
Mark
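In case it helps as a baseline while looking for an existing tool, here is a rough sketch of the BeautifulSoup approach mentioned above, with timing added (it uses requests in place of urllib2, and the tag/attribute list is a simplification, not an exhaustive set of page requisites):

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def timed_fetch(page_url):
    # download a page plus its images, stylesheets and scripts; return elapsed seconds
    start = time.time()
    page = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")

    resources = set()
    for tag, attr in (("img", "src"), ("script", "src"), ("link", "href")):
        for node in soup.find_all(tag):
            ref = node.get(attr)
            if ref:
                resources.add(urljoin(page_url, ref))

    for url in resources:
        requests.get(url, timeout=10)  # bodies are discarded; only the timing matters

    return time.time() - start

print(timed_fetch("http://www.google.com"))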
Websucker? See http://effbot.org/zone/websucker.htm
websucker.py doesn't follow CSS @import links. HTTrack is not Python (it's C/C++), but it's a good, maintained utility for downloading a website for offline browsing.
http://www.mail-archive.com/python-bugs-list@python.org/msg13523.html
[issue1124] Webchecker not parsing css "@import url"
Guido> This is essentially unsupported and unmaintained example code. Feel free to submit a patch though!