I am looking for a python library to scrape results from search engines (google, yahoo, bing, etc).
I only found for google -> http://github.com/kevinw/xgoogle/tree/253db7ddc8603a9dcb038ae42684cf3499a22a4b
Does someone know of one for multiple search engines?
Scrapy is a pretty cool framework for scraping, but you will have to code/configure it to work for the sites you want.
It's not too hard to write one yourself. I usually just use PHP: look into cURL to retrieve the page, then the DOM object and DOM XPath. You can use XPath to select the parts of the result you want.
XPath is pretty simple once you install Firebug and FirePath. I am working on a position checker right now; same idea, but it returns the position of a domain for a given keyword.
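For comparison, here is a rough Python equivalent of that cURL-plus-XPath approach (the OP asked about Python), using requests and lxml. The URL, query, and XPath expression are only placeholders to adapt to whatever engine and markup you are targeting:

import requests
from lxml import html

# Placeholder query; real search engines may rate-limit or block automated requests.
resp = requests.get("https://www.bing.com/search",
                    params={"q": "example query"},
                    headers={"User-Agent": "Mozilla/5.0"},
                    timeout=10)
resp.raise_for_status()

tree = html.fromstring(resp.text)
# Placeholder XPath: adjust it to the result markup of the engine you target.
for href in tree.xpath('//li[contains(@class, "b_algo")]//h2/a/@href'):
    print(href)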
All of the answers here are deprecated. Use the standard Google API if you want; you can make 1000 requests in a 24-hour period for free.
What else can you try:
Use requests (a rough sketch follows this list)
Use selenium
Use the 3rd party google libraries (all deprecated to my knowledge)
But you will eventually get blocked, so it is better to use the Google-supported API or any other paid API.
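For what it's worth, a minimal sketch of the plain-requests route looks like this; the query, headers, and link filter are assumptions, and Google's markup changes often, so expect it to break or get blocked eventually:

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

params = {"q": "site:example.com python scraping", "hl": "en"}
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get("https://www.google.com/search",
                    params=params, headers=headers, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Google's result markup varies, so this crude filter just pulls outbound links.
for a in soup.find_all("a", href=True):
    if a["href"].startswith("http") and "google." not in a["href"]:
        print(a["href"])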
With a scraper you can scrape Bing, Google, Baidu, and Yahoo. Check the link.
I would like to know if it is possible to scrape Google search results while specifying a date range. I read about googlesearch and I am trying to use its search module. However, it seems that something is not working.
Using 'cdr:1,cd_min:01/01/2020,cd_max:01/01/2020' to search for all results about a query (for example Kevin Spacey), it does not return the expected URLs. I guess something is not working with the function (as defined in the library). Has anyone ever tried to use it?
I am looking for results in Italian (only pages in Italian and on the google.it domain). Another way to scrape these results would also be welcome.
Many thanks
Maybe this information can help you:
Then, use an HTTP spy to get the details of the request. It's useful when Google changes its search format and the module has not yet been updated to match.
Good luck!
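For completeness, here is how the date-range and locale options are usually passed to that module. This is only a sketch, and the parameter names assume the MarioVilas googlesearch package (installed with pip install google), so check help(search) for the version you actually have:

from googlesearch import search  # the package installed by "pip install google"

# Parameter names below are assumptions based on that package's documented signature.
for url in search(
        "Kevin Spacey",
        tld="it",       # query google.it
        lang="it",      # Italian results
        tbs="cdr:1,cd_min:01/01/2020,cd_max:01/01/2020",  # custom date range
        stop=50,        # stop after 50 results
        pause=2.0):     # wait between requests to reduce the risk of being blocked
    print(url)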
I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.
What I would like to do is use Python to scrape all of the URLs from a Google search so I can use them in a separate script to do text analysis of a large corpus (news sites mainly). This seems relatively straightforward, but none of the attempts I've tried have worked properly.
This is as close as I got:
from google import search
for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)
This returned about 300 URLs before I got kicked. An actual search using these parameters provides about 1000 results and I'd like all of them.
First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
It seems that this package uses screen scraping to retrieve search results from Google, so it doesn't play well with Google's Terms of Service, which could be the reason why you've been blocked.
The relevant clause in Google's Terms of Service:
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
I haven't been able to find a definite number, but it seems like their limit on the number of search queries per day is rather strict too: 100 search queries per day, according to their JSON Custom Search API documentation here.
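If you do go the official route, calling the JSON Custom Search API is only a few lines. This is a sketch, and you need your own API key and search engine ID (the placeholders below):

import requests

API_KEY = "YOUR_API_KEY"          # placeholder: create one in the Google developer console
CX = "YOUR_SEARCH_ENGINE_ID"      # placeholder: your Custom Search Engine ID

params = {
    "key": API_KEY,
    "cx": CX,
    "q": 'site:cbc.ca "kinder morgan" "trans mountain" protest',
    "num": 10,    # results per request
    "start": 1,   # pagination offset
}
resp = requests.get("https://www.googleapis.com/customsearch/v1",
                    params=params, timeout=10)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["link"])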
Nonetheless, there's no harm trying out other alternatives to see if they work better:
BeautifulSoup
Scrapy
ParseHub - this one is not in code, but is a useful piece of software with good documentation. Link to their tutorial on how to scrape a list of URLs.
I've looked at a lot of questions and libraries and didn't find exactly what I wanted. Here's the thing: I'm developing an application in Python for a user to get all sorts of things from social network accounts. I'm having trouble with Facebook. I would like, if possible, a step-by-step tutorial on the code and libraries to use to get a user's information, from posts to photo information (with the user's login information, and how to handle it, because I've had a lot of problems with authentication).
Thank you
I strongly encourage you to use Facebook's own APIs.
First of all, check out documentation on Facebook's Graph API https://developers.facebook.com/docs/reference/api/. If you are not familiar with JSON, DO read a tutorial on it (for instance http://secretgeek.net/json_3mins.asp).
Once you grasp the concepts, start using this API. For Python, there are several alternatives:
facebook/python-sdk https://github.com/facebook/python-sdk
pyFaceGraph https://github.com/iplatform/pyFaceGraph/
It is also semi-trivial to write a simple HTTP client that uses the Graph API.
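As a sketch of that last option (assuming you have already obtained a valid access token through Facebook's login flow; the fields requested here are just examples):

import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"   # placeholder: obtained via Facebook's OAuth login flow

# Minimal Graph API client: fetch a couple of profile fields for the logged-in user.
resp = requests.get(
    "https://graph.facebook.com/me",
    params={"fields": "id,name", "access_token": ACCESS_TOKEN},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())   # plain JSON, as described in the Graph API docs above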
I would suggest you check out the Python libraries, try out the examples in their documentation, and see if they work and do the stuff you need.
Only as a last resort would I write a scraper and try to extract data with screen scraping (it is much more painful and breaks more easily).
I have not used this with Facebook, but in the past when I had to scrape a site that required login I used Mechanize to handle the login and scraping and Beautiful Soup to parse the resulting HTML.
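Roughly, that pattern looks like the sketch below; the login URL and form field names are placeholders, and the mechanize calls are from memory, so check them against the library's documentation:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)                      # the target's robots.txt may disallow bots
br.open("https://example.com/login")             # placeholder login page
br.select_form(nr=0)                             # assume the login form is the first form
br["username"] = "me@example.com"                # placeholder form field names
br["password"] = "secret"
br.submit()

page = br.open("https://example.com/members")    # a page that requires being logged in
soup = BeautifulSoup(page.read(), "html.parser")
print(soup.title.string if soup.title else "(no title)")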
I am new to the world of data scraping; previously I used Python for web and desktop app development.
I am just wondering if there is any way to get the URLs from a page and then look into each one for specific information like phone numbers, addresses, etc.
Currently I am using BeautifulSoup and have built methods that take the URLs as parameters.
The site I am scraping is large, and it's really tough to pass the specific URL for each page by hand.
Any suggestions to make it faster and self-driven?
Thanks in advance.
You can use Scrapy. It simplifies both crawling and parsing (it uses libxml2 for parsing by default).
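A minimal self-driving spider might look like the sketch below. The start URL and CSS selectors are placeholders; the point is that Scrapy follows the links you yield, so you don't have to feed it every page URL yourself:

import scrapy

class ContactSpider(scrapy.Spider):
    name = "contacts"
    start_urls = ["https://example.com/"]   # placeholder start page

    def parse(self, response):
        # Pull out the details you care about on this page (placeholder selectors).
        yield {
            "url": response.url,
            "phones": response.css(".phone::text").getall(),
            "addresses": response.css(".address::text").getall(),
        }
        # Follow every link on the page so the crawl drives itself;
        # Scrapy's duplicate filter keeps it from revisiting pages.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Run it with something like scrapy runspider contacts_spider.py -o results.json (the file name is up to you).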
Use a more efficient HTML parser, like lxml. See here for performance comparisons of various Python parsers.
I'm trying to scrape a page on YouTube with Python, and the page has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has EXTENSIVE APIs already in place giving you access to just about any and all data you could possibly want.
Take a look at the YouTube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
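Something along these lines, as a sketch; the feed URL below is a placeholder for whichever Data API endpoint you actually need, and in Python 3 the module is urllib.request rather than urllib:

import urllib.request
import xml.etree.ElementTree as ET

# Placeholder endpoint: substitute the real Data API feed URL you want to query.
FEED_URL = "https://example.com/feeds/api/videos?q=python+scraping"

with urllib.request.urlopen(FEED_URL, timeout=10) as resp:
    xml_data = resp.read()

root = ET.fromstring(xml_data)
# Atom feeds namespace their elements; adjust the namespace to match the real feed.
ATOM = "{http://www.w3.org/2005/Atom}"
for entry in root.iter(ATOM + "entry"):
    title = entry.find(ATOM + "title")
    print(title.text if title is not None else "(no title)")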
The main problem is that you're violating the TOS (terms of service) of the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it; technically, your best bets are python-spidermonkey and selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug on Firefox, then turn on the NET panel in Firebug and click the desired link on YouTube. Now see what happens and which pages are requested. Find the ones that are responsible for the AJAX part of the page. Now you can use urllib or Mechanize to fetch that link. If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be checking user login credentials, session info, or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsible" scraping! :)
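Once Firebug shows you the request behind the AJAX call, replaying it is usually only a few lines. This sketch uses Python 3's urllib.request (urllib2 in Python 2); the endpoint and headers are placeholders for whatever the NET panel actually shows you:

import json
import urllib.request

# Placeholder: paste the AJAX URL you found in Firebug's NET panel here.
ajax_url = "https://www.youtube.com/placeholder_ajax_endpoint?video_id=abc123"

req = urllib.request.Request(
    ajax_url,
    headers={
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://www.youtube.com/watch?v=abc123",  # some endpoints check this
    },
)
with urllib.request.urlopen(req, timeout=10) as resp:
    body = resp.read().decode("utf-8")

# Many AJAX endpoints return JSON; fall back to the raw body if parsing fails.
try:
    print(json.loads(body))
except ValueError:
    print(body[:500])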
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the scrapy framework. It provides extensive support for crawling and scraping web sites and uses python-spidermonkey under the hood to access javascript links.
You could sniff the network traffic with something like Wireshark and then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.