I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I need is something to download a web page, extract links and follow them recursively, but without visiting the same URL twice. Basically, I want to avoid looping.
I already wrote a crawler in Python, but it's too slow. I'm not able to saturate a 100Mbit line with it. Top speed is ~40 URLs/sec, and for some reason it's hard to get better results. It seems like a problem with Python's multithreading/sockets. I also ran into problems with Python's garbage collector, but that was solvable. CPU isn't the bottleneck, btw.
So, what should I use to write a crawler that is as fast as possible, and what's the best solution to avoid looping while crawling?
EDIT:
The solution was to combine multiprocessing and threading modules. Spawn multiple processes with multiple threads per process for best effect. Spawning multiple threads in a single process is not effective and multiple processes with just one thread consume too much memory.
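A minimal sketch of the layout described in the edit above (several processes, each running its own pool of threads), assuming Python 3 and the standard concurrent.futures module. The names split_round_robin, fetch, crawl_chunk and crawl, as well as the default counts, are illustrative assumptions and not the original crawler's code.

```python
import urllib.request
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_round_robin(items, n):
    """Deal items into n lists so every process gets a similar share."""
    return [items[i::n] for i in range(n)]


def fetch(url, timeout=10):
    """Download one page; return (url, size in bytes), or (url, None) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, len(resp.read())
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return url, None


def crawl_chunk(urls, threads=20, fetch_one=fetch):
    """Runs inside one process: fan the chunk out over a thread pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch_one, urls))


def crawl(urls, processes=4, threads=20):
    """Start `processes` workers, each running `threads` download threads."""
    chunks = [c for c in split_round_robin(urls, processes) if c]
    results = []
    with ProcessPoolExecutor(max_workers=processes) as pool:
        for part in pool.map(crawl_chunk, chunks, [threads] * len(chunks)):
            results.extend(part)
    return results
```

On platforms that start worker processes by re-importing the script, crawl(urls) should be called from under an `if __name__ == "__main__":` guard. The process and thread counts are something to tune by measurement.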
Why not use something already tested for crawling, like Scrapy? I managed to reach almost 100 pages per second on a low-end VPS with limited RAM (about 400 MB), while network speed was around 6-7 Mb/s (i.e. below 100 Mbps).
Another improvement you can make is to use urllib3 (especially when crawling many pages from a single domain). Here's a brief comparison I did some time ago:
UPDATE:
Scrapy now uses the Requests library, which in turn uses urllib3. That makes Scrapy the absolute go-to tool when it comes to scraping. Recent versions also support deploying projects, so scraping from a VPS is easier than ever.
Around two years ago I developed a crawler that can download almost 250 URLs per second. You could follow my steps:
Optimize your use of file pointers. Keep as few files open as possible.
Don't write your data out every time. Dump it after storing around 5,000 or 10,000 URLs.
For robustness you don't need a separate configuration. Keep a log file, and when you want to resume, read the log file and restart the crawler from there.
Distribute the crawler's work across separate tasks and process them at intervals:
a. downloader
b. link extractor
c. URLSeen
d. ContentSeen
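As a rough illustration of the URLSeen and ContentSeen components and of the "dump after a few thousand URLs" advice above, here is a small sketch. The class names SeenFilter and BatchWriter are hypothetical, and keeping everything in in-memory sets is an assumption that only holds while the crawl fits in RAM.

```python
import hashlib
from urllib.parse import urldefrag


class SeenFilter:
    """URLSeen / ContentSeen checks backed by in-memory sets."""

    def __init__(self):
        self._urls = set()
        self._digests = set()

    def url_is_new(self, url):
        """Return True the first time a URL is offered (fragment ignored)."""
        key = urldefrag(url).url
        if key in self._urls:
            return False
        self._urls.add(key)
        return True

    def content_is_new(self, body):
        """Return True the first time this exact byte content is offered."""
        digest = hashlib.sha256(body).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


class BatchWriter:
    """Buffer records and append them to a file once batch_size is reached."""

    def __init__(self, path, batch_size=5000):
        self.path = path
        self.batch_size = batch_size
        self._buffer = []

    def add(self, line):
        self._buffer.append(line)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        # The file is open only for the duration of the write.
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write("\n".join(self._buffer) + "\n")
        self._buffer.clear()
```

Call flush() once more at shutdown so the last partial batch is not lost.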
I have written a simple multithreading crawler. It is available on GitHub as Discovering Web Resources and I've written a related article: Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can change the number of threads being used in the NWORKERS class variable. Don't hesitate to ask any further question if you need extra help.
It sounds like you have a design problem more than a language problem. Try looking into the multiprocessing module for accessing more sites at the same time rather than threads. Also, consider getting some table to store your previously visited sites (a database maybe?).
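One possible shape for the "table of previously visited sites" mentioned above, sketched with the standard sqlite3 module. The table name visited and the helper names open_visited and mark_visited are assumptions made for this example.

```python
import sqlite3


def open_visited(path=":memory:"):
    """Open (or create) the table of URLs that have already been visited."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")
    return conn


def mark_visited(conn, url):
    """Record url; return True if it was new, False if it was already stored."""
    cur = conn.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1
```

A crawler would call mark_visited before queueing a link and skip the link when it returns False. By default a sqlite3 connection may only be used from the thread that created it, so a multi-threaded crawler would need one connection per thread or a single thread that owns the database.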
Impossible to tell what your limitations are. Your problem is similar to the C10K problem -- read first, don't optimize straight away. Go for the low-hanging fruit: most probably you will get significant performance improvements by analyzing your application design. Don't start out massively multithreaded or massively multiprocessed.
I'd use Twisted to write the networking part; this can be very fast. In general, I/O on the machine has to be better than average. You have to write your data either to disk or to another machine, and not every notebook supports 10 MByte/s sustained database writes. Lastly, if you have an asymmetric internet connection, it might simply be that your upstream is saturated. ACK prioritization helps here (OpenBSD example).
Is it possible to create enough threads to use 100% of the CPU, and is it really efficient? I'm planning to create a crawler in Python and, in order to make the program efficient, I want to create as many threads as possible, where each thread will be downloading one website. I tried looking for information online; unfortunately I didn't find much.
You are confusing your terminology, but that is ok. A very high level overview would help.
Concurrency can consist of IO bound (reading and writing from disk, http requests, etc) and CPU bound work (running a machine learning optimization function on a big set of data).
With IO bound work, which I assume is what you are referring to, your CPU is not working very hard but rather waiting around for data to come back.
Contrast that with multi-processing, where you can use multiple cores of your machine to do more intense CPU bound work.
That said, multi-threading could help you. I would advise using the asyncio and aiohttp modules for Python. These make sure that while you are waiting for a response to be returned, the software can continue with other requests.
I use asyncio, aiohttp and bs4 when I need to do some web-scraping.
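A hedged sketch of what that can look like. fetch_all is a hypothetical helper that runs any fetch coroutine with a cap on the number of requests in flight. crawl_with_aiohttp plugs in aiohttp's ClientSession and assumes the third-party aiohttp package is installed. The limit of 100 is an arbitrary starting point.

```python
import asyncio


async def fetch_all(urls, fetch, limit=100):
    """Run fetch(url) for every URL with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:  # keep one bad URL from aborting the batch
                return url, exc

    return await asyncio.gather(*(bounded(u) for u in urls))


async def crawl_with_aiohttp(urls, limit=100):
    """Same idea with a real HTTP client; requires the aiohttp package."""
    import aiohttp  # third-party; imported here so fetch_all stays stdlib-only

    async with aiohttp.ClientSession() as session:

        async def fetch(url):
            async with session.get(url) as resp:
                return await resp.text()

        return await fetch_all(urls, fetch, limit)
```

Something like asyncio.run(crawl_with_aiohttp(urls)) would then return a list of (url, text or exception) pairs in the same order as the input, and each page could be handed to bs4 for parsing.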
I have written a script to do some research on HTTP Archive data. This script needs to make HTTP requests to sites scraped by HTTP Archive in order to classify sites into groups (e.g., Drupal, WordPress, etc). The script is working really well; however, the list of sites that I am handling is 300,000 sites long.
I would like to be able to complete the categorization of sites as fast as possible. I have experimented with running multiple instances of the script at the same time and it is working well with appropriate locks in place to prevent race conditions.
How can I max this out to get all of these operations completed as fast as possible? For instance, I am looking at spinning up a VPS with 8 CPUs and 16 GB RAM. How do I maximize these resources to make sure I'm using every bit of processing power possible? I may consider spinning up something more powerful, but I want to make sure I understand how to get the most out of it so I'm not wasting money.
Thanks!
The multiprocessing module is the best option; it lets you harness the full power of your 8 CPUs:
https://docs.python.org/3.3/library/multiprocessing.html
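A minimal sketch of spreading the classification step over all CPUs with multiprocessing.Pool. The MARKERS fingerprints and the function names are hypothetical placeholders, not the asker's actual detection rules. The sketch assumes the page HTML has already been downloaded, since the download itself is network-bound and gains little from extra cores.

```python
import multiprocessing

# Hypothetical fingerprints; real detection would need more robust rules.
MARKERS = {"WordPress": "wp-content", "Drupal": "sites/default/files"}


def classify(html):
    """Return the first platform whose marker appears in the page, else 'unknown'."""
    for name, marker in MARKERS.items():
        if marker in html:
            return name
    return "unknown"


def classify_many(pages, processes=None):
    """Classify pages in parallel; processes=None means one worker per CPU."""
    with multiprocessing.Pool(processes) as pool:
        return pool.map(classify, pages, chunksize=100)
```

classify_many should be called from under an `if __name__ == "__main__":` guard, as the multiprocessing documentation recommends.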
I'm looking for a Python library or a command line tool for downloading multiple files in parallel. My current solution is to download the files sequentially, which is slow. I know you can easily write a half-assed threaded solution in Python, but I always run into annoying problems when using threading. It is for polling a large number of XML feeds from websites.
My requirements for the solution are:
1. It should be interruptible. Ctrl+C should immediately terminate all downloads.
2. There should be no leftover processes that you have to kill manually using kill, even if the main program crashes or an exception is thrown.
3. It should work on Linux and Windows too.
4. It should retry downloads, be resilient against network errors and should time out properly.
5. It should be smart about not hammering the same server with 100+ simultaneous downloads, but queue them in a sane way.
6. It should handle important HTTP status codes like 301, 302 and 304. That means that for each file, it should take the Last-Modified value as input and only download if it has changed since last time.
7. Preferably it should have a progress bar or it should be easy to write a progress bar for it to monitor the download progress of all files.
8. Preferably it should take advantage of HTTP keep-alive to maximize the transfer speed.
Please don't suggest how I may go about implementing the above requirements. I'm looking for a ready-made, battle-tested solution.
I guess I should describe what I want it for too... I have about 300 different data feeds as XML-formatted files served from 50 data providers. Each file is between 100 KB and 5 MB in size. I need to poll them frequently (as in once every few minutes) to determine if any of them has new data I need to process. So it is important that the downloader uses HTTP caching to minimize the amount of data to fetch. It should also use gzip compression, obviously.
Then the big problem is how to use the bandwidth as efficiently as possible without overstepping any boundaries. For example, one data provider may consider it abuse if you open 20 simultaneous connections to their data feeds. Instead it may be better to use one or two connections that are reused for multiple files. Or your own connection may be limited in strange ways. My ISP limits the number of DNS lookups you can do, so some kind of DNS caching would be nice.
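To make the caching requirement concrete (send the stored Last-Modified value back and skip the download on a 304), here is a small standard-library sketch. The function names are hypothetical, and it illustrates only the conditional request. It does not provide retries, keep-alive or per-server queueing.

```python
import gzip
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None):
    """Build a GET request that lets the server answer 304 Not Modified."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(url, last_modified=None, timeout=30):
    """Return (body, new Last-Modified value), or None if the feed is unchanged."""
    req = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read()
            if resp.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            return body, resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None
        raise
```

The caller stores the returned Last-Modified string per feed and passes it back on the next poll.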
You can try pycurl. The interface is not easy at first, but once you look at the examples it's not hard to understand. I have used it to fetch thousands of web pages in parallel on a meagre Linux box.
You don't have to deal with threads, so it terminates gracefully, and there are no processes left behind
It provides options for timeout, and http status handling.
It works on both linux and windows.
The only problem is that it provides only a basic infrastructure (basically just a Python layer above the excellent curl library). You will have to write a few lines to get the features you want.
There are lots of options but it will be hard to find one which fits all your needs.
In your case, try this approach:
Create a queue.
Put URLs to download into this queue (or "config objects" which contain the URL and other data like the user name, the destination file, etc).
Create a pool of threads
Each thread should try to fetch a URL (or a config object) from the queue and process it.
Use another thread to collect the results (i.e. another queue). When the number of result objects == number of puts in the first queue, then you're finished.
Make sure that all communication goes via the queue or the "config object". Avoid accessing data structures which are shared between threads. This should save you 99% of the problems.
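The steps above can be sketched roughly as follows. run_pool and handler are hypothetical names, and for brevity the calling thread collects the results instead of a dedicated collector thread.

```python
import queue
import threading


def run_pool(jobs, handler, workers=10):
    """Process every job with handler on a pool of threads; return the results."""
    jobs = list(jobs)
    tasks = queue.Queue()
    results = queue.Queue()
    for job in jobs:
        tasks.put(job)

    def worker():
        while True:
            try:
                job = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, thread exits
            try:
                results.put((job, handler(job), None))
            except Exception as exc:  # report the failure instead of dying silently
                results.put((job, None, exc))

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for t in threads:
        t.start()

    # Finished when the number of results equals the number of jobs queued.
    collected = [results.get() for _ in range(len(jobs))]
    for t in threads:
        t.join()
    return collected
```

Each result is a (job, value, error) tuple, so a failed download shows up in the results instead of hanging the count. All data passes through the two queues, and the workers share no other state.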
I don't think such a complete library exists, so you'll probably have to write your own. I suggest taking a look at gevent for this task. They even provide a concurrent_download.py example script. Then you can use urllib2 for most of the other requirements, such as handling HTTP status codes, and displaying download progress.
I would suggest Twisted. It is not a ready-made solution, but it provides the main building blocks to get every feature you listed in an easy way, and it does not use threads.
If you are interested, take a look at the following links:
http://twistedmatrix.com/documents/current/api/twisted.web.client.html#getPage
http://twistedmatrix.com/documents/current/api/twisted.web.client.html#downloadPage
As per your requirements:
1. Supported out of the box
2. Supported out of the box
3. Supported out of the box
4. Timeout supported out of the box, other error handling done through deferreds
5. Achieved easily using cooperators (example 7)
6. Supported out of the box
7. Not supported, but solutions exist (and they are not that hard to implement)
8. Not supported; it can be implemented (but it will be relatively hard)
Nowadays there are excellent Python libs you might want to use - urllib3 and requests
Try using aria2 through Python's simple subprocess module.
It meets all the requirements on your list out of the box except 7, and 7 is easy to write.
aria2c has a nice xml-rpc or json-rpc interface to interact with it from your scripts.
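A rough sketch of driving aria2c from subprocess, assuming aria2c is installed and on the PATH. The helper names and default values are hypothetical. The command-line options used are aria2c's --input-file, --dir, --max-concurrent-downloads and --max-connection-per-server.

```python
import subprocess


def aria2c_command(url_list_file, out_dir=".", max_downloads=5, per_server=2):
    """Build an aria2c command line for a text file listing one URL per line."""
    return [
        "aria2c",
        f"--input-file={url_list_file}",
        f"--dir={out_dir}",
        f"--max-concurrent-downloads={max_downloads}",
        f"--max-connection-per-server={per_server}",
    ]


def download_all(url_list_file, **options):
    """Run aria2c and return its exit status (0 means all downloads succeeded)."""
    return subprocess.run(aria2c_command(url_list_file, **options)).returncode
```

Because the downloads run in a single child process, interrupting the script with Ctrl+C also reaches aria2c when both share the same terminal.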
Does urlgrabber fit your requirements?
http://urlgrabber.baseurl.org/
If it doesn't, you could consider volunteering to help finish it. Contact the authors, Michael Stenner and Ryan Tomayko.
Update: Googling for "parallel wget" yields these, among others:
http://puf.sourceforge.net/
http://www.commandlinefu.com/commands/view/3269/parallel-file-downloading-with-wget
It seems like you have a number of options to choose from.
I used the standard libs for that, urllib.urlretrieve to be precise. I downloaded podcasts this way, via a simple thread pool, each thread using its own retrieve. I ran about 10 simultaneous connections; more should not be a problem. Continuing an interrupted download, maybe not. Ctrl-C could be handled, I guess. It worked on Windows, and I installed a handler for progress bars. All in all 2 screens of code, plus 2 screens for generating the URLs to retrieve.
This seems pretty flexible:
http://keramida.wordpress.com/2010/01/19/parallel-downloads-with-python-and-gnu-wget/
Threading isn't "half-assed" unless you're a bad programmer. The best general approach to this problem is the producer / consumer model. You have one dedicated URL producer, and N dedicated download threads (or even processes if you use the multiprocessing model).
As for all of your requirements, ALL of them CAN be done with the normal python threaded model (yes, even catching Ctrl+C -- I've done it).
I've done some experiments using Apache Bench to profile my code response times, and it doesn't quite generate the right kind of data for me. I hope the good people here have ideas.
Specifically, I need a tool that
Does HTTP requests over the network (it doesn't need to do anything very fancy)
Records response times as accurately as possible (at least to a few milliseconds)
Writes the response time data to a file without further processing (or provides it to my code, if a library)
I know about ab -e, which prints data to a file. The problem is that this prints only the quantile data, which is useful, but not what I need. The ab -g option would work, except that it doesn't print sub-second data, meaning I don't have the resolution I need.
I wrote a few lines of Python to do it, but the httplib is horribly inefficient and so the results were useless. In general, I need better precision than pure Python is likely to provide. If anyone has suggestions for a library usable from Python, I'm all ears.
I need something that is high performance, repeatable, and reliable.
I know that half my responses are going to be along the lines of "internet latency makes that kind of detailed measurements meaningless." In my particular use case, this is not true. I need high resolution timing details. Something that actually used my HPET hardware would be awesome.
Throwing a bounty on here because of the low number of answers and views.
I have done this in two ways.
With "loadrunner", which is a wonderful but pretty expensive product (from HP these days, I think).
With a combination of Perl/PHP and the cURL package. I found the cURL API slightly easier to use from PHP. It's pretty easy to roll your own GET and PUT requests. I would also recommend manually running through some sample requests with Firefox and the LiveHttpHeaders add-on to capture the exact format of the HTTP requests you need.
JMeter is pretty handy. It has a GUI from which you can set up your requests and threadpools and it also can be run from the command line.
If you can code in Java, you can look at the combination of JUnitPerf + HttpUnit.
The downside is that you will have to do more things yourself. In exchange you get unlimited flexibility and arguably more precision than with GUI tools, not to mention HTML parsing, JavaScript execution, etc.
There's also another project called Grinder which seems to be purposed for a similar task but I don't have any experience with it.
A good reference for open-source performance testing tools: http://www.opensourcetesting.org/performance.php
You will find descriptions and a "most popular" list
httperf is very powerful.
I've used a script to drive 10 boxes on the same switch to generate load by "replaying" requests to 1 server. I had my web app logging response time (server only) to the granularity I needed, but I didn't care about the response time to the client. I'm not sure whether you want to include the trip to and from the client in your calculations, but if you do it shouldn't be too difficult to code up. I then processed my log with a script which extracted the times per URL and produced scatter plots and trend graphs based on load.
This satisfied my requirements which were:
Real world distribution of calls to different urls.
Trending performance based on load.
Not influencing the web app by running other intensive ops on the same box.
I wrote the controller as a shell script that, for each server, started a background process to loop over all the URLs in a file, calling curl on each one. I wrote the log processor in Perl since I was doing more Perl at that time.
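The log-processing step could look something like this in Python. The log format is an assumption made for the example (one "<url> <seconds>" pair per line), since the format of the original log is not given, and the function names are hypothetical.

```python
import statistics
from collections import defaultdict


def times_per_url(lines):
    """Group response times by URL from lines of the form '<url> <seconds>'."""
    grouped = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        url, seconds = parts
        grouped[url].append(float(seconds))
    return grouped


def summarize(grouped):
    """Return {url: (count, mean, max)} for each URL."""
    return {
        url: (len(ts), statistics.mean(ts), max(ts))
        for url, ts in grouped.items()
    }
```

The per-URL lists returned by times_per_url keep every raw sample, so they can be fed straight into a plotting tool for scatter or trend graphs.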
I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured I would make a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should I run at once? Should I stick to two? Can I go higher? What would be a reasonable limit for the number of threads? Keep in mind that each thread goes out to a web page, downloads the HTML, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next URL off the queue.
You will probably find your application is bandwidth limited, not CPU or I/O limited.
As such, add as many threads as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. For example, if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may become a problem if you make too many HTTP requests at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really there's no hard and fast answer. Assuming a 1-5 megabit connection, I'd say you could easily have up to 20-30 threads without any problems.
I would use one thread and twisted with either a deferred semaphore or a task cooperator if you already have an easy way to feed an arbitrarily long list of URLs in.
It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a twisted-based crawler.
It's usually simpler to make multiple concurrent processes. Simply use subprocess to create as many Popens as you feel it necessary to run concurrently.
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
When you run some larger number, you see that the overall elapsed time is longer, because there's now more to do than your CPU can manage.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes and your desirable elapsed time.
Think of it this way.
1 crawler does its job in 1 minute. 100 pages done serially could take 100 minutes. 100 crawlers running concurrently might take one hour. Let's say that 25 crawlers finish the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.
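One way to run that comparison is sketched below. For brevity it uses a thread pool from concurrent.futures rather than separate processes, which is a swap made for this example. time_run, scaling_table and the default worker counts are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_run(task, items, workers):
    """Return the elapsed seconds to run task(item) for all items with N workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, items))
    return time.perf_counter() - start


def scaling_table(task, items, worker_counts=(1, 2, 5, 10, 25, 50)):
    """Measure elapsed time for each worker count; return [(workers, seconds)]."""
    return [(n, time_run(task, list(items), n)) for n in worker_counts]
```

Plotting the returned pairs gives the scaling graph described above. The same fixed sample of URLs should be used for every run so that the timings are comparable.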
cletus's answer is the one you want.
A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different solution is pycurl, which is a thin wrapper to libcurl, which is a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example of how to fetch multiple pages in parallel, in about 120 lines of code.
You can go higher than two. How much higher depends entirely on the hardware of the system you're running this on, how much processing is going on after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
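A small sketch of such a per-server cap for a threaded crawler, using one semaphore per host name. HostLimiter is a hypothetical helper, and the default of 5 simply mirrors the figure suggested above.

```python
import threading
from urllib.parse import urlsplit


class HostLimiter:
    """Allow at most per_host simultaneous requests to any single host."""

    def __init__(self, per_host=5):
        self._per_host = per_host
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, url):
        host = urlsplit(url).netloc.lower()
        with self._lock:  # guard the dict while creating semaphores
            if host not in self._semaphores:
                self._semaphores[host] = threading.BoundedSemaphore(self._per_host)
            return self._semaphores[host]

    def run(self, url, fetch):
        """Call fetch(url) while holding one of the host's slots."""
        with self._semaphore_for(url):
            return fetch(url)
```

Each download thread calls limiter.run(url, fetch) in place of fetch(url). Threads working on different hosts do not block each other.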
Threading isn't necessary in this case. Your program is I/O bound rather than CPU bound. The networking part would probably be better done using select() on the sockets. This reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I heard it has really good support for asynchronous networking. This would allow you to specify the URLs you wish to download and register a callback for each. When each is downloaded, the callback will be called and the page can be processed. In order to allow multiple sites to be downloaded without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue. The "worker" thread would do the actual processing.
As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.