Python scraping: getting URLs dynamically

I am new to the world of data scraping; previously I used Python for web and desktop app development.
I am just wondering if there is any way to get the URLs from a page and then look into them for specific information like phone numbers, addresses, etc.
Currently I am using BeautifulSoup and have built methods that take the URL as a parameter.
The site I am scraping is large, and it's really tough to pass the specific URL for each page.
Any suggestions to make it faster and self-driven?
Thanks in advance.

You can use Scrapy. It simplifies both crawling and parsing (it uses libxml2 for parsing by default).
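For example, a self-driven crawl can be sketched roughly like this (written against a recent Scrapy release; the start URL and CSS selectors are placeholders for whatever the site actually uses):

    import scrapy

    class ContactSpider(scrapy.Spider):
        name = "contacts"
        start_urls = ["http://example.com/"]            # placeholder: any page of the site

        def parse(self, response):
            # pull the fields you care about from the current page
            for entry in response.css("div.listing"):   # placeholder selector
                yield {
                    "phone": entry.css("span.phone::text").get(),
                    "address": entry.css("span.address::text").get(),
                }
            # follow every link found on the page, so you never pass URLs by hand
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Run it with "scrapy runspider contacts_spider.py -o contacts.json" and Scrapy handles the scheduling and duplicate filtering for you.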

Use a more efficient HTML parser, like lxml. See here for performance comparisons of various Python parsers.
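A small sketch of that approach; the URL and XPath expressions are placeholders for the real site:

    import requests            # urllib2 works just as well for fetching
    import lxml.html

    resp = requests.get("http://example.com/listing")               # placeholder URL
    tree = lxml.html.fromstring(resp.content)
    phones = tree.xpath('//span[@class="phone"]/text()')            # placeholder XPath
    links = tree.xpath('//a/@href')                                  # candidate URLs to visit next
    print(phones)
    print(links)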

Related

Extracting parts of HTML from a website using Python

I'm currently working on a project that involves a program to inspect a web page's HTML using Python. My program has to monitor a web page, and when a change is made to the HTML, it will complete a set of actions. My questions are: how do you extract just part of a web page, and how do you monitor a web page's HTML and report almost instantly when a change is made? Thanks.
In the past I wrote my own parsers. Nowadays HTML is HTML5: more elements, more JavaScript, and a lot of sloppiness introduced by developers and their editors, like
document.write('<SCR' + 'IPT
And some web frameworks / badly coded sites change the Last-Modified HTTP header on every request, even when the text a human reads on the page hasn't changed.
I suggest BeautifulSoup for the parsing; on your own, you have to carefully choose what to watch in order to decide whether the web page has been modified.
Its intro:
BeautifulSoup is a Python package that parses broken HTML, just like
lxml supports it based on the parser of libxml2. BeautifulSoup uses a
different parsing approach. It is not a real HTML parser but uses
regular expressions to dive through tag soup. It is therefore more
forgiving in some cases and less good in others. It is not uncommon
that lxml/libxml2 parses and fixes broken HTML better, but
BeautifulSoup has superior support for encoding detection. It very
much depends on the input which parser works better.
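For the monitoring half of the question, a minimal polling sketch with BeautifulSoup (modern bs4 on Python 3; the URL, the element to watch, and the poll interval are placeholders you have to choose for your page):

    import hashlib
    import time
    import urllib.request
    from bs4 import BeautifulSoup

    URL = "http://example.com/page"                     # placeholder: the page being monitored

    def snapshot():
        html = urllib.request.urlopen(URL).read()
        soup = BeautifulSoup(html, "html.parser")
        # hash only the element you care about, so ads and timestamps don't cause false alarms
        watched = soup.find("div", id="content")        # placeholder selector
        return hashlib.md5(str(watched).encode("utf-8")).hexdigest()

    previous = snapshot()
    while True:
        time.sleep(30)                                  # poll interval; "almost instantly" means polling often
        current = snapshot()
        if current != previous:
            print("watched section changed")            # run your set of actions here
            previous = current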
Scrapy might be a good place to start. http://doc.scrapy.org/en/latest/intro/overview.html
Getting sections of websites is easy; it's just XML/HTML, and you can use Scrapy or BeautifulSoup.

Python 3.2 Beautiful Soup alternative

I need to make a web crawler to extract information from web pages. I did some research and found that Beautiful Soup was excellent, since I could parse the whole document, create DOM objects, iterate, extract attributes, etc. (similar to jQuery).
But I'm using Python 3.2 and there is no stable version for it (I don't think there is one at all; I only saw 3.1 on their home page).
So I need some equally good alternatives.
Looks to me like there is a Beautiful Soup 3.2.0 release from almost a year ago. There's also HTMLParser: http://docs.python.org/library/htmlparser.html
I think the latest release is 4.1.1; you can read about it in the BS4 documentation.
I have used BS4 with PHP on my website for this purpose for a while now, with great results. I had to switch back to BSv3 because of a PHP / Python incompatibility issue, but that is separate from how well the BS4 script works by itself.
Initially I used the built-in HTML parsing engine, but found it slow. After installing the lxml engine on my web server, there was a massive increase in speed! No noticeable difference in the parsing results, but the speed increased dramatically.
I'd give it a go - I recommend it, and I tried a LOT of different options before I settled on Beautiful Soup.
Good luck!
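For reference, the parser engine is chosen with the second argument to the BeautifulSoup constructor; a tiny sketch (the HTML is made up):

    from bs4 import BeautifulSoup

    html = "<html><body><p class='price'>42</p></body></html>"
    soup_fast = BeautifulSoup(html, "lxml")             # needs the lxml package installed
    soup_slow = BeautifulSoup(html, "html.parser")      # pure-Python built-in, slower on big pages
    print(soup_fast.find("p", class_="price").text)     # the parse results are normally the same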
From the lxml homepage:
The latest release works with all CPython versions from 2.4 to 3.2.
The most direct and best alternative to BeautifulSoup is Mechanize.
Mechanize is your saviour if you need to automate simple web page functionality, such as submitting a form (with information you didn't have beforehand, like CSRF tokens). It's even available in multiple programming languages!
That said, Sven's answer is right: I love lxml when I just need to extract information from HTML.
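A rough sketch of the kind of form automation Mechanize handles; the URL and field names are placeholders (and note Mechanize was Python 2 only at the time of this thread):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)                  # be deliberate about this; respect the site's rules
    br.open("http://example.com/login")          # placeholder URL
    br.select_form(nr=0)                         # first form on the page
    br["username"] = "me"                        # placeholder field names
    br["password"] = "secret"
    response = br.submit()                       # hidden fields such as CSRF tokens are sent along
    html = response.read()                       # hand this to lxml/BeautifulSoup for extraction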

Is there a simple class/library which uses PyQt/WebKit to scrape websites with JavaScript support?

I'm looking at using PyQt to scrape websites with JavaScript support, after dabbling with all the static HTML alternatives (BeautifulSoup, Mechanize, etc.).
Clearly PyQt is a much more generic tool, and as such is not optimised for my needs.
Are there any classes/libraries which give me simple functions for using PyQt for relatively simple scraping duties?
I have found a few classes/scripts by searching Google, but am hopeful for something better suited to my needs!
I need to submit forms, maintain sessions, and return the HTML for processing with lxml.
thanks :)
You might want to take a look at spynner--it's a programmatic browser module based on QtWebKit. It might meet your needs.
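If you want to see what such a wrapper does underneath, the core pattern is small. A sketch using the PyQt4 / QtWebKit bindings of that era (Python 2; the URL is a placeholder):

    import sys
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    class Render(QWebPage):
        """Load a URL, let its JavaScript run, and keep the resulting HTML."""
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._finished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()                     # blocks until _finished() quits the event loop

        def _finished(self, ok):
            self.html = unicode(self.mainFrame().toHtml())
            self.app.quit()

    page = Render("http://example.com/js-heavy-page")    # placeholder URL
    # hand page.html to lxml for the actual processing, as planned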

Is there any Python lib to scrape search engine(s) results?

I am looking for a Python library to scrape results from search engines (Google, Yahoo, Bing, etc.).
I only found one for Google -> http://github.com/kevinw/xgoogle/tree/253db7ddc8603a9dcb038ae42684cf3499a22a4b
Does anyone know of one for multiple search engines?
Scrapy is a pretty cool framework for scraping, but you will have to code/configure it to work for the sites you want.
It's not too hard to write them. I usually just use PHP. Look into cURL to retrieve the page, and then the DOM object and DOM XPath. You can use XPath to select the parts of the result you want.
XPath is pretty simple if you install Firebug and FireXPath. I am working on a position checker right now. Same idea, but it returns the position of a domain based on a keyword.
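The same idea translated to Python, as a rough sketch; the engine URL, query, and XPath expression are placeholders you would replace after inspecting the result page with Firebug/FireXPath:

    import requests
    from lxml import html

    resp = requests.get("https://www.bing.com/search",
                        params={"q": "example.com widgets"})        # placeholder query
    tree = html.fromstring(resp.text)
    links = tree.xpath('//li[@class="b_algo"]//h2/a/@href')         # placeholder XPath
    for position, link in enumerate(links, start=1):
        if "example.com" in link:                                   # the domain being checked
            print(position, link)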
All of the answers here are outdated. Use the official Google API if you want; you can make 1000 requests in a 24-hour period for free.
What else can you try:
Use requests
Use selenium
Use the 3rd party google libraries (all deprecated to my knowledge)
But you will eventually get blocked, so you're better off using the Google-supported API or some other paid API.
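A sketch of the API route, assuming the Custom Search JSON API; the key and search-engine id below are placeholders you get from the Google developer console:

    import requests

    API_KEY = "YOUR_API_KEY"                     # placeholder
    CX = "YOUR_SEARCH_ENGINE_ID"                 # placeholder

    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": "python web scraping"},
    )
    for item in resp.json().get("items", []):
        print(item["title"], item["link"])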
There is also scraper, with which you can scrape Bing, Google, Baidu, and Yahoo; check the link.

Scraping Ajax - Using python

I'm trying to scrape a page on YouTube with Python, and it has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has EXTENSIVE APIs already in place for giving you access to just about any and all data you could possibly want.
Take a look at the YouTube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
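Roughly, that looks like the sketch below. The feed URL is the legacy GData v2 endpoint from the era of this thread (it has since been retired; today you would use the Data API v3), and the query is a placeholder:

    import urllib                                        # Python 2, matching the rest of this thread
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"
    url = "http://gdata.youtube.com/feeds/api/videos?q=python+scraping&max-results=10"
    feed = ET.parse(urllib.urlopen(url))
    for entry in feed.findall(ATOM + "entry"):
        title = entry.find(ATOM + "title").text
        link = entry.find(ATOM + "link").get("href")     # first <link> element of the entry
        print title, link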
The main problem is that you're violating the TOS (terms of service) of the YouTube site. YouTube's engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your own head be it - technically, your best bets are python-spidermonkey and selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug in Firefox, then turn on the Net panel in Firebug and click the desired link on YouTube. Now see what happens and which pages are requested. Find the ones responsible for the AJAX part of the page. Now you can use urllib or Mechanize to fetch that link. If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be looking at user login credentials, session info, or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsibly" scraping! :)
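A sketch of replaying such a request once Firebug has shown it to you; the endpoint and header values are placeholders copied from whatever the Net panel reports:

    import urllib2

    req = urllib2.Request("http://www.youtube.com/placeholder_ajax_endpoint")   # placeholder URL
    req.add_header("User-Agent", "Mozilla/5.0")
    req.add_header("Referer", "http://www.youtube.com/watch?v=PLACEHOLDER")     # some endpoints check this
    response = urllib2.urlopen(req)
    data = response.read()          # often JSON or an HTML fragment; parse accordingly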
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the Scrapy framework. It provides extensive support for crawling and scraping web sites and uses python-spidermonkey under the hood to access JavaScript links.
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.
