I want to write a program that searches through a fairly large website and extracts certain things. I've taken a couple of online Python courses, but neither said anything about how to access the internet with Python, so I have no idea where to start.
You should first read about urllib2, from the standard Python library.
Once you are comfortable with the basic ideas behind that library, try requests, which makes interacting with the web, and especially with APIs, much easier. I suggest using it alongside httpie so you can test queries quick and dirty from the command line.
If you go a little further and build a library or an engine to crawl the web, you will need some sort of asynchronous programming; I recommend starting with Gevent.
Finally, if you want to create a crawler/bot, take a look at Scrapy. You should, however, start with the basic libraries before diving into it, as it can get quite complex.
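To make the comparison concrete, here is a minimal, hypothetical fetch done both ways; the URL is just a placeholder:

    # Python 2-style fetch with the standard library's urllib2
    import urllib2
    html = urllib2.urlopen("http://example.com/some-page").read()

    # The same fetch with requests -- noticeably less ceremony once you
    # start adding headers, redirects, or POST data
    import requests
    html = requests.get("http://example.com/some-page").text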
It sounds like you want a web crawler/scraper. What sorts of things do you want to pull: images, links? Either way, that's exactly the job a crawler/scraper is built for.
Start there; there should be plenty of articles on Stack Overflow to help you with implementation details such as connecting to the internet (getting a web response).
See this article.
There is much more on the internet than just websites, but I assume you just want to crawl some HTML pages and extract data from them. You have many options for solving that problem; here are some starting points (a small example combining two of them follows the list):
urllib2 from the standard library
https://pypi.python.org/pypi/requests (much easier and more user friendly)
http://scrapy.org/ (a very good crawling framework)
http://www.crummy.com/software/BeautifulSoup/ (library to extract data from html)
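As a rough sketch of how two of these fit together, here is a hypothetical requests + BeautifulSoup (the bs4 package) snippet that fetches a page and pulls out every link; the URL is a placeholder:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/").text
    soup = BeautifulSoup(html, "html.parser")

    # print every hyperlink and its visible text
    for a in soup.find_all("a", href=True):
        print(a["href"], "->", a.get_text(strip=True))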
What is the best method to scrape a dynamic website where most of the content is generated by what appear to be AJAX requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combination, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an API.
The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.
This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavyweight solution, but I've seen people do this with GreaseMonkey scripts: let Firefox render everything and run the JavaScript, then scrape the elements. You can even initiate user actions on the page if needed.
Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (Javascript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc. as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
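If you end up wanting to drive the browser from Python rather than from the IDE, the Selenium WebDriver bindings do roughly the same thing. This is only a sketch; the page URL and the field name "q" are made up:

    from selenium import webdriver

    driver = webdriver.Firefox()                  # assumes Firefox is installed
    try:
        driver.get("http://example.com/search")
        box = driver.find_element_by_name("q")    # hypothetical form field
        box.send_keys("primary results")
        box.submit()                              # triggers the form, JavaScript included
        print(driver.page_source[:500])           # HTML after the scripts have run
    finally:
        driver.quit()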
Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing: instead of trying to scrape the page, issue the HTTP requests that the JavaScript issues and interpret the results yourself (most likely JSON, which is nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
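As a sketch of that strategy, here is what replaying one of those XmlHttpRequests with requests might look like; the endpoint, parameters, and headers are invented, and you would copy the real ones out of Firebug's network panel:

    import requests

    resp = requests.get(
        "http://example.com/ajax/results",                # hypothetical endpoint
        params={"state": "NH", "race": "president"},      # hypothetical query parameters
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    data = resp.json()    # most such endpoints return JSON
    print(data)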
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of them making their page increasingly difficult to scrape :-(
There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch onto the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP-level tools, since you don't have to emulate the browser; you just ask the browser for the HTML elements. It is also going to be way easier than reverse-engineering the JavaScript/AJAX calls. If needed, you can use tools like Beautiful Soup in conjunction with Pamie.
Probably the easiest way is to use the IE WebBrowser control in C# (or any other language). You have access to everything inside the browser out of the box, and you don't need to worry about cookies, SSL, and so on.
I found the IE WebBrowser control has all kinds of quirks and workarounds that would justify some high-quality software, layered around the shdocvw.dll API and mshtml, to take care of all those inconsistencies and provide a framework.
This seems like a pretty common problem. I wonder why no one has developed a programmatic browser. I'm envisioning a Firefox you can call from the command line with a URL as an argument; it would load the page, run all of the initial page-load JS events, and save the resulting file.
I mean, Firefox and other browsers already do this, so why can't we simply strip off the UI?
Question
I am interested in web scraping and data analysis and I would like to develop my skills by writing a program using Python 2.7 that will monitor changes in stock prices. My goal is to be able to compare two stocks (for the time being) at certain points throughout the day, save that info into a document format easily handled by pandas (which I will learn how to use after I get this front end working). In the end I would like to map relationship trends between chosen stocks (when this one goes up by x what effect does that have on the other one). This is just a hobby project so it doesn't matter if the code is production quality.
My Experience
I am a brand-new Python programmer (I have a very basic understanding of Python and no real experience with any non-included modules), but I do have a technical background, so if the answer to my question requires reading and understanding documentation intended for intermediate-level programmers, that should be OK.
For the basics I am working my way through Learning Python: Powerful Object-Oriented Programming by Mark Lutz if this helps any.
What I'm Looking For
I recognize this is a very broad subject and I am not asking for anyone to write any actual code examples or anything. I just want some direction as to where to go to get information more specific to my interests and goals.
This is actually my first post on this forum so please forgive me if this doesn't follow best practices for posting. I did search for other questions like mine and read the posting tips docs prior to writing this.
So, you want to web-scrape? If you're using Python 2.7, you'll want to look into the urllib2, requests, and BeautifulSoup libraries. If you're using Python 3.x, look at urllib (particularly urllib.request) and, again, BeautifulSoup. Together, these libraries should accomplish everything you're looking to do in terms of web scraping.
If you're interested in scraping stock data, might I suggest the yahoo_finance package? It is a Python wrapper for the Yahoo Finance API. Whenever I've worked with stock data in the past, this module was invaluable. There's also googlefinance. It's much easier to use these already-developed wrappers to extract stock info than to scrape hundreds (if not thousands) of web pages to get the data you want.
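For flavour, here is roughly what the yahoo_finance package's documented Share interface looks like; treat it as illustrative only, since the underlying Yahoo service has changed over the years:

    from yahoo_finance import Share

    stock = Share("GOOG")
    print(stock.get_price())   # last trade price, returned as a string

    stock.refresh()            # pull fresh quote data later in the day
    print(stock.get_price())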
I started learning Python a week ago; I love to learn as many languages as I can, and Python is one of the easiest, but there is still a lot that teachers won't tell you.
Interacting with web pages is an obvious example: registering on forums and other websites is a headache, so building a simple Python application that can read the labels on a form, enter information such as first name, last name, sex, and address into the text boxes, and then submit the page at the end would be fantastic.
Another silly thing, but one I must know, is how to contact a website in the first place.
If you can provide me with any tips, or help, I appreciate your contribution.
For simply contacting a website, consider the requests or twill libraries. For screen-scraping, BeautifulSoup is your friend.
If third-party libraries really aren't an option, though, you'll need to delve into the htmllib, httplib, urllib, and urllib2 Python modules.
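As a rough idea of what submitting a form with requests looks like, here is a sketch; the URL and field names are invented, and in practice you would inspect the real form's HTML to find its action URL and input names:

    import requests

    form_data = {
        "first_name": "Ada",
        "last_name": "Lovelace",
        "sex": "F",
        "address": "12 St James's Square",
    }
    resp = requests.post("http://example.com/register", data=form_data)
    print(resp.status_code)    # check whether the POST was accepted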
This is a complete n00b question and I understand I may get voted down for asking it, but I am totally confused about Python's HTML integration.
As I understand it, one way to integrate Python with HTML code is by using mod_python.
Is there any other way or method that is more effective for using Python with HTML?
Please advise me on this, as I am new to Python and could use some help.
Some pointers to code samples would be highly appreciated.
Thanks a lot.
EDIT: What I would also like to know is how PyHP and mod_python compare with each other. How are they different? And Django? What is Django all about?
I would suggest you start with web.py.
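The classic hello-world from the web.py tutorial gives an idea of how little code it takes:

    import web

    urls = ("/", "index")          # map the root URL to the index class

    class index:
        def GET(self):
            return "Hello, world!"

    if __name__ == "__main__":
        app = web.application(urls, globals())
        app.run()                  # serves on port 8080 by default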
You can read a tutorial on how to use Python on the web:
http://docs.python.org/howto/webservers.html
In a few words, mod_python keeps the Python interpreter in memory, ready to execute Python scripts, which is faster than launching it every time. It doesn't let you embed Python in HTML the way PHP does; for that you need a special application, like PyHP (http://www.pyhp.org) or one of several others. Read the Python tutorial and documentation pages; there's plenty of info and links to many template and HTML-embedding engines.
Engines such as PyHP require some overhead to run. Without them, your Python application must output the HTTP response headers and the page itself as strings. mod_wsgi and FastCGI facilitate this process. The page linked above gives a good overview of that.
Also you may try Tornado, a python web server, if you don't need to stick to Apache.
The standard way for Python web apps to talk to a webserver is WSGI. Also check out WebOb.
http://www.wsgi.org/wsgi/
http://pythonpaste.org/webob/
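For reference, a bare-bones WSGI application is just a callable; this sketch serves it with the standard library's wsgiref server, which is enough to see the idea before picking a framework:

    def application(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<h1>Hello from WSGI</h1>"]   # bytes in Python 3, plain str in Python 2

    if __name__ == "__main__":
        from wsgiref.simple_server import make_server
        make_server("localhost", 8000, application).serve_forever()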
But for a complete noob I'd start with a complete web framework (in which case you can typically ignore the links above). Django and Grok are both full-stack frameworks that are easy to use and learn. Django is more popular, but Grok is built on 13 years of web application publishing experience and is seriously cool. The difference is a matter of taste.
http://django.org/
http://grok.zope.org/
If you want something more minimalistic, the world's your oyster; there are countless web frameworks for Python, from BFG to TurboGears.
I'm writing a spider in Python to crawl a site. Trouble is, I need to examine about 2.5 million pages, so I could really use some help making it optimized for speed.
What I need to do is examine the pages for a certain number and, if it is found, record the link to the page. The spider is very simple; it just needs to sort through a lot of pages.
I'm completely new to Python, but have used Java and C++ before. I have yet to start coding it, so any recommendations on libraries or frameworks to include would be great. Any optimization tips are also greatly appreciated.
You could use MapReduce like Google does, either via Hadoop (specifically with Python: 1 and 2), Disco, or Happy.
The traditional line of thought is to write your program in standard Python; if you find it is too slow, profile it and optimize the specific slow spots. You can make those slow spots faster by dropping down to C, using C/C++ extensions, or even ctypes.
If you are spidering just one site, consider using wget -r (an example).
Where are you storing the results? You can use PiCloud's cloud library to parallelize your scraping easily across a cluster of servers.
As you are new to Python, I think the following may be helpful for you :)
If you are writing a regex to search for a certain pattern in the pages, compile it once wherever you can and reuse the compiled object (see the sketch after this list).
BeautifulSoup is an HTML/XML parser that may be of some use for your project.
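Here is a small, hypothetical sketch of the compiled-regex tip; the number and URLs are placeholders:

    import re
    import urllib2

    TARGET = re.compile(r"\b1234567\b")   # the number you are hunting for

    def page_matches(url):
        html = urllib2.urlopen(url).read()
        return TARGET.search(html) is not None   # reuse the compiled pattern on every call

    urls = ["http://example.com/a", "http://example.com/b"]
    matches = [url for url in urls if page_matches(url)]
    print(matches)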
Spidering somebody's site with millions of requests isn't very polite. Can you instead ask the webmaster for an archive of the site? Once you have that, it's a simple matter of text searching.
You waste a lot of time waiting for network requests when spidering, so you'll definitely want to make your requests in parallel. I would probably save the result data to disk and then have a second process looping over the files searching for the term. That phase could easily be distributed across multiple machines if you needed extra performance.
What Adam said. I did this once to map out Xanga's network. The way I made it faster was by having a thread-safe set containing all the usernames I had to look up. Then I had five or so threads making requests at the same time and processing them. You're going to spend way more time waiting for the page to download than you will processing any of the text (most likely), so just find ways to increase the number of requests you can make at the same time.
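A Python 2-style sketch of that approach, with a thread-safe queue of URLs and a handful of worker threads so the downloads overlap; the URLs and the target number are placeholders:

    import Queue
    import threading
    import urllib2

    url_queue = Queue.Queue()
    for url in ["http://example.com/page1", "http://example.com/page2"]:
        url_queue.put(url)

    hits = []
    hits_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = url_queue.get_nowait()
            except Queue.Empty:
                return                          # no more work left
            html = urllib2.urlopen(url).read()
            if "1234567" in html:               # the number being searched for
                with hits_lock:
                    hits.append(url)
            url_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(hits)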