I have a web crawler script that spawns at most 500 threads; each thread requests certain data from a remote server, and each server's reply differs from the others in content and size.
I'm setting the stack size to 756 KB per thread:
threading.stack_size(756*1024)
which lets me have the number of threads I need and complete most of the jobs and requests. But some servers' responses are bigger than others, and when a thread gets one of those, the script dies with SIGSEGV.
Stack sizes larger than 756 KB make it impossible to have the required number of threads at the same time.
Any suggestions on how I can continue with the given stack_size without crashes?
And how can I get the currently used stack size of any given thread?
Why on earth are you spawning 500 threads? That seems like a terrible idea!
Remove threading completely, use an event loop to do the crawling. Your program will be faster, simpler, and easier to maintain.
Lots of threads waiting for network won't make your program wait faster. Instead, collect all open sockets in a list and run a loop where you check if any of them has data available.
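A minimal sketch of that idea, using the standard library's select module (the hosts and the hand-rolled HTTP/1.0 request are just placeholders):

import select
import socket

hosts = ["example.com", "example.org"]   # placeholder targets
socks = {}
for host in hosts:
    s = socket.create_connection((host, 80))
    s.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
    s.setblocking(False)
    socks[s] = [host, b""]               # socket -> [host, response bytes so far]

while socks:
    readable, _, _ = select.select(list(socks), [], [], 5.0)
    if not readable:
        break                            # nothing arrived within the timeout; real code would track per-socket deadlines
    for s in readable:
        chunk = s.recv(4096)
        if chunk:
            socks[s][1] += chunk         # response still arriving
        else:                            # server closed the connection: response complete
            print("%s: %d bytes" % (socks[s][0], len(socks[s][1])))
            s.close()
            del socks[s]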
I recommend using Twisted - It is an event-driven networking engine. It is very flexible, secure, scalable and very stable (no segfaults).
You could also take a look at Scrapy - It is a web crawling and screen scraping framework written in Python/Twisted. It is still under heavy development, but maybe you can take some ideas.
Is it OK to run certain pieces of code asynchronously in a Django web app? If so, how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to record in the database that these items were the result of the search, so I can see what users are searching for most. I don't want the client to have to wait for an extra hundred or thousand database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned, yes.
The bigger concern is your web server and whether it plays nicely with threading. For instance, gunicorn's sync workers are single-threaded, but there are other worker types, such as greenlet-based ones, and I'm not sure how well they play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance-analytics utilities that have been using threads to report metrics, so it seems to be an accepted practice.
In sum, it seems safest to use the threading.Thread object from the standard library, so long as whatever you do in it doesn't fork (e.g., via Python's multiprocessing library):
https://docs.python.org/2/library/threading.html
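For example, here is a minimal sketch of that approach; the SearchLog model, the run_search() helper, and the template name are hypothetical placeholders:

import threading

from django.shortcuts import render

def record_search_results(query, result_ids):
    # Runs on a background thread, off the request/response cycle.
    for rid in result_ids:
        SearchLog.objects.create(query=query, result_id=rid)   # hypothetical model

def search_view(request):
    query = request.GET.get("q", "")
    results = run_search(query)                                 # hypothetical search helper
    threading.Thread(
        target=record_search_results,
        args=(query, [r.id for r in results]),
        daemon=True,            # don't keep the process alive just for the logging thread
    ).start()
    return render(request, "results.html", {"results": results})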
Offloading work from the request thread is a common practice, since the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is synchronous: until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de facto way of offloading this work is through celery, which is a task queuing system.
I highly recommend you read celery's introductory documentation, but in summary here is what happens:
You mark certain pieces of code as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the worker a message queue is required. RabbitMQ is the one often recommended.
Once you have all the components running (it takes but a few minutes), your workflow goes like this:
In your view, when you want to offload some work, you call the function that does that work with .delay(). This triggers a worker to start executing the function in the background (a sketch follows below).
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
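A minimal sketch of that flow, assuming a configured celery app and broker; generate_keyword_report is a hypothetical task, while @shared_task, .delay() and AsyncResult are standard celery APIs:

from celery import shared_task
from celery.result import AsyncResult

@shared_task
def generate_keyword_report(keywords):
    # Expensive work runs on a celery worker, not in the web process.
    return {"keywords": keywords, "count": len(keywords)}

# In a view: enqueue the task and return immediately.
async_result = generate_keyword_report.delay(["python", "django"])
task_id = async_result.id          # store this (session, database) to check on it later

# In a later request: see whether the worker has finished.
result = AsyncResult(task_id)
if result.ready():
    report = result.get()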
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.
I was looking to write a little web crawler in Python. I was starting to investigate writing it as a multithreaded script, with one pool of threads downloading and one pool processing results. Due to the GIL, would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off its socket, then move on to the next thread, let it pick some data off its socket, and so on?
Basically I'm asking: is a multi-threaded crawler in Python really going to buy me much performance over a single-threaded one?
thanks!
The GIL is not held by the Python interpreter when doing network operations. If you are doing work that is network-bound (like a crawler), you can safely ignore the effects of the GIL.
On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.
Look at how scrapy works. It can help you a lot. It doesn't use threads, but can do multiple "simultaneous" downloading, all in the same thread.
If you think about it, you have only a single network card, so parallel processing can't really help by definition.
What scrapy does is just not wait around for the response of one request before sending another. All in a single thread.
When it comes to crawling you might be better off using something event-based such as Twisted that uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each one.
Asynchronous network operations can easily be, and usually are, single-threaded. Network I/O almost always has higher latency than CPU work because you really have no idea how long a page will take to return, and this is where async shines: an async operation is much lighter weight than a thread.
Edit: a simple example of how to use Twisted's getPage to build a basic web crawler is sketched below.
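This sketch assumes an older Twisted release where twisted.web.client.getPage (since deprecated) is still available; the URL list is a placeholder:

from twisted.internet import reactor
from twisted.web.client import getPage

urls = ["http://example.com/", "http://example.org/"]
remaining = [len(urls)]                  # countdown so we know when to stop the reactor

def finish(_=None):
    remaining[0] -= 1
    if remaining[0] == 0:
        reactor.stop()

def on_page(body, url):
    print("%s: %d bytes" % (url, len(body)))
    # a real crawler would parse the body here and queue newly found links

def on_error(failure, url):
    print("%s failed: %s" % (url, failure.getErrorMessage()))

for url in urls:
    d = getPage(url)                     # returns a Deferred that fires with the page body
    d.addCallback(on_page, url)
    d.addErrback(on_error, url)
    d.addBoth(finish)

reactor.run()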
Another consideration: if you're scraping a single website and the server places limits on the frequency of requests you can send from your IP address, adding multiple threads may make no difference.
Yes, multithreaded scraping increases the process speed significantly. This is not a case where the GIL is an issue: you are wasting a lot of idle CPU time and unused bandwidth waiting for each request to finish. If the web page you are scraping is on your local network (a rare scraping case), then the difference between multithreaded and single-threaded scraping can be smaller.
You can run the benchmark yourself, varying the thread count from one to n. I have written a simple multithreaded crawler as part of Discovering Web Resources, and a related article on Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can select how many threads to use by changing the NWORKERS class variable in FocusedWebCrawler.
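A rough benchmark sketch along those lines, timing the same batch of fetches with different thread-pool sizes (the URL list is a placeholder):

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URLS = ["http://example.com/page%d" % i for i in range(50)]   # placeholder URLs

def fetch(url):
    return len(urllib.request.urlopen(url, timeout=10).read())

for nworkers in (1, 2, 4, 8, 16):
    start = time.time()
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        sizes = list(pool.map(fetch, URLS))
    print("%2d threads: %.1f s, %d bytes total" % (nworkers, time.time() - start, sum(sizes)))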
I want to use Python's multiprocessing to do concurrent processing without using locks (locks to me are the opposite of multiprocessing), because I want to build multiple reports from different resources at the exact same time during a web request (this normally takes about 3 seconds, but with multiprocessing I can do it in 0.5 seconds).
My problem is that, if I expose such a feature to the web and get 10 users pulling the same report at the same time, I suddenly have 60 interpreters open at the same time (which would crash the system). Is this just the common sense result of using multiprocessing, or is there a trick to get around this potential nightmare?
Thanks
If you're really worried about having too many instances, you could think about protecting the call with a Semaphore object. If I understand what you're doing, you can use the threading Semaphore object:
from threading import Semaphore

# Allow at most 10 requests to run their multiprocessing work at the same time.
sem = Semaphore(10)

with sem:  # blocks here if 10 calls are already in progress
    make_multiprocessing_call()
I'm assuming that make_multiprocessing_call() will clean up after itself.
This way only 10 "extra" instances of Python will ever be open at once; if another request comes along, it will just have to wait until a previous one has completed. Unfortunately this won't be in "queue" order ... or any order in particular.
Hope that helps
You are barking up the wrong tree if you are trying to use multiprocessing to add concurrency to a network app. You are barking up a completely wrong tree if you're creating processes for each request. multiprocessing is not what you want (at least not as a concurrency model).
There's a good chance you want an asynchronous networking framework like Twisted.
Locks are only ever necessary if you have multiple agents writing to a shared resource. If they are just reading, locks are not needed (and, as you said, defeat the purpose of multiprocessing).
Are you sure that would crash the system? On a web server using CGI, each request spawns a new process, so it's not unusual to see thousands of simultaneous processes (granted, in Python one should use WSGI and avoid this), and they do not crash the system.
I suggest you test your theory -- it shouldn't be difficult to manufacture 10 simultaneous accesses -- and see if your server really does crash.
My question is: which python framework should I use to build my server?
Notes:
This server talks HTTP with its clients: GET and POST (via pyAMF)
Clients "submit" "tasks" for processing and then, sometime later, retrieve the associated "task_result"
submit and retrieve might be separated by days - different HTTP connections
The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
When a server gets a "task", it queues it for processing
The server manages this queue and, when tasks get to the top, organises that they are processed.
The processing is performed by a long-running (15 mins?) external program (via subprocess) which is fed the task XML and which produces a "task_result" lump of XML that the server picks up and stores (for later client retrieval).
It also serves a couple of basic HTML pages showing the queue and processing status (admin purposes only)
I've experimented with twisted.web, using SQLite as the database and threads to handle the long running processes.
But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?
I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.
Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)
By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
Some popular message-queue options with Python support:
http://code.google.com/p/stomper/
http://code.google.com/p/pyactivemq/
http://xph.us/software/beanstalkd/
I'd suggest the following. (Since it's what we're doing.)
A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue. No further queueing needed. You get a request, you spawn the subprocess as a child and wait for it to finish. A simple list of children is about all you need.
I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
A simple SQLite database can track request status. Even this may be overkill, because your XML inputs and results can just lie around in the file system.
That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone, child process.
If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only requests that fit within some "max number of children" limit are taken out.
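A minimal sketch of that design, with wsgiref and one child process per submitted task; the "solver" command and the on-disk file layout are hypothetical, and the governor is omitted for brevity:

import os
import subprocess
import uuid
from wsgiref.simple_server import make_server

children = {}                          # task_id -> Popen handle

def app(environ, start_response):
    if environ["REQUEST_METHOD"] == "POST":
        # Accept a task: write the XML to disk and hand it to the external solver.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        task_xml = environ["wsgi.input"].read(length)
        task_id = uuid.uuid4().hex
        with open(task_id + ".task.xml", "wb") as f:
            f.write(task_xml)
        children[task_id] = subprocess.Popen(
            ["solver", task_id + ".task.xml", task_id + ".result.xml"])   # hypothetical command
        start_response("202 Accepted", [("Content-Type", "text/plain")])
        return [task_id.encode()]

    # GET /<task_id>: return the result if the child has finished.
    task_id = environ.get("PATH_INFO", "/").lstrip("/")
    child = children.get(task_id)
    if child is not None and child.poll() is not None and os.path.exists(task_id + ".result.xml"):
        start_response("200 OK", [("Content-Type", "application/xml")])
        with open(task_id + ".result.xml", "rb") as f:
            return [f.read()]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not ready"]

make_server("", 8000, app).serve_forever()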
My reaction is to suggest Twisted, but you've already looked at it. Still, I stick by my answer. Without knowing your personal pain-points, I can at least share some things that helped me reduce almost all of the deferred-madness that arises when you have several dependent, blocking actions you need to perform for a client.
Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
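A tiny sketch of the readability win, again assuming an older Twisted where getPage is available: the generator pauses at each yield until the Deferred fires, so dependent asynchronous steps read like straight-line code.

from twisted.internet import defer, reactor
from twisted.web.client import getPage

@defer.inlineCallbacks
def fetch_and_report(url):
    body = yield getPage(url)        # waits for the page without blocking the reactor
    print("%s returned %d bytes" % (url, len(body)))

d = fetch_and_report("http://example.com/")
d.addBoth(lambda _: reactor.stop())
reactor.run()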
You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable/implementable command line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not all that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances perhaps) via ssh, assuming you have the keys set up already.
You can have a look at celery
It seems any python web framework will suit your needs. I work with a similar system on a daily basis and I can tell you, your solution with threads and SQLite for queue storage is about as simple as you're going to get.
Assuming order doesn't matter in your queue, then threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two of the same job type running simultaneously. If this is the case, I'd suggest a single threaded application to do the items in the queue one by one.
I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured I would make the crawler a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should I run at once? Should I stick to two? Can I go higher? What would be a reasonable limit for the number of threads? Keep in mind that each thread goes out to a web page, downloads the HTML, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next URL off the queue.
You will probably find your application is bandwidth-limited, not CPU- or I/O-limited.
As such, add as many as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. For example, if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may cap how many HTTP requests you can make at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really there's no hard-and-fast answer. Assuming a 1-5 megabit connection, I'd say you could easily have up to 20-30 threads without any problems.
I would use one thread and Twisted, with either a DeferredSemaphore or a task cooperator, if you already have an easy way to feed in an arbitrarily long list of URLs.
It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a twisted-based crawler.
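A minimal sketch of the DeferredSemaphore approach (older Twisted again, with getPage standing in for the fetch): at most MAX_CONCURRENT downloads are in flight at once, all on one thread.

from twisted.internet import defer, reactor
from twisted.web.client import getPage

MAX_CONCURRENT = 10
sem = defer.DeferredSemaphore(MAX_CONCURRENT)
urls = ["http://example.com/page%d" % i for i in range(100)]   # placeholder URLs

def fetch(url):
    return getPage(url)              # Deferred that fires with the page body

# sem.run() acquires a slot, calls fetch, and releases the slot when the
# returned Deferred fires, so no more than MAX_CONCURRENT fetches run at once.
work = [sem.run(fetch, url) for url in urls]
defer.DeferredList(work).addBoth(lambda _: reactor.stop())
reactor.run()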
It's usually simpler to run multiple concurrent processes. Simply use subprocess to create as many Popens as you feel are necessary to run concurrently.
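A minimal sketch of that idea; "crawler.py" and the per-process URL lists are hypothetical placeholders, and each child is a completely independent crawler:

import subprocess

batches = ["urls_part1.txt", "urls_part2.txt", "urls_part3.txt"]
children = [subprocess.Popen(["python", "crawler.py", batch]) for batch in batches]

for child in children:
    child.wait()        # block until every crawler process has finished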
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
When you run some larger number, you see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes and your desirable elapsed time.
Think of it this way.
One crawler does its job in 1 minute. 100 pages done serially could take 100 minutes. 100 crawlers running concurrently might take an hour. Let's say that 25 crawlers finish the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.
cletus's answer is the one you want.
A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different solution is pycurl, which is a thin wrapper to libcurl, which is a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example of how to fetch multiple pages in parallel, in about 120 lines of code.
You can go higher than two. How much higher depends entirely on the hardware of the system you're running this on, how much processing happens after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
Threading isn't necessary in this case. Your program is I/O-bound rather than CPU-bound. The networking part would probably be better done using select() on the sockets, which reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I hear it has really good support for asynchronous networking. It would allow you to specify the URLs you wish to download and register a callback for each. When each page has downloaded, the callback will be called and the page can be processed. To allow multiple sites to be downloaded without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue, and the "worker" thread would do the actual processing.
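A minimal sketch of the downloader/worker split, using Python 3's queue module; in the real crawler the puts would happen inside each download callback:

import threading
import queue

pages = queue.Queue()

def worker():
    while True:
        url, html = pages.get()
        if url is None:                    # sentinel: shut the worker down
            break
        # ... regex searches and database inserts would go here ...
        print("processed %s (%d bytes)" % (url, len(html)))

threading.Thread(target=worker).start()

pages.put(("http://example.com/", "<html>...</html>"))   # done by a download callback
pages.put((None, None))                                   # no more pages; stop the worker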
As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.