For my app, how many threads would be optimal? - python

I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured i would make a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should i run at once? should i stick to two? can i go higher? what would be a reasonable limit for a number of threads? Keep in mind that each thread goes out to a web page, downloads the html, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next url off the queue.

You will probably find your application is bandwidth limited not CPU or I/O limited.
As such, add as many as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. Like if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may impact making too many HTTP requests at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really theres no hard and fast answer. Assuming a 1-5 megabit connection I'd say you could easily have up to 20-30 threads without any problems.

I would use one thread and twisted with either a deferred semaphore or a task cooperator if you already have an easy way to feed an arbitrarily long list of URLs in.
It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a twisted-based crawler.

It's usually simpler to make multiple concurrent processes. Simply use subprocess to create as many Popens as you feel it necessary to run concurrently.
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
You you run some larger number, you see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes and your desirable elapsed time.
Think of it this way.
1 crawler does it's job in 1 minute. 100 pages done serially could take a 100 minutes. 100 crawlers concurrently might take on hour. Let's say that 25 crawlers finishes the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.

cletus's answer is the one you want.
A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different solution is pycurl, which is a thin wrapper to libcurl, which is a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example of how to fetch multiple pages in parallel, in about 120 lines of code.

You can go higher that two. How much higher depends entirely on the hardware of the system you're running this on, how much processing is going on after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.

One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).

Threading isn't necessary in this case. Your program is I/O bound rather than CPU bound. The networking part would probably be better done using select() on the sockets. This reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I heard it has really good support for asynchronous networking. This would allow you you to specify the URLs you wish to download and register a callback for each. When each is downloaded you the callback will be called, and the page can be processed. In order to allow multiple sites to be downloaded, without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue. The "worker" thread would do the actual processing.
As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.

Related

Python threading or multiprocessing for my 'tool'

I have created a script that :
Imports a list of IP's from .txt ( around 5K )
Connects to a REST API and performs a query based on the IP ( web logs for each IP)
Data is returned from the API and some calculations are done on the data
Results of calculations are written to a .csv
At the moment it's really slow as it takes one IP at a time does everything and then goes to the next IP .
I may be wrong but from my understanding with threading or multiprocessing i could have 3-4 threads each doing an IP which would increase the speed of the tool by a huge margin . Is my understanding correct and if it is should i be looking at threading or multi-processing for my task ?
Any help would amazing
Random info, running python 2.7.5 , Win7 with plenty or resources.
multiprocessing is definitely the way to go forward. You could start a process that reads the IPs and puts them in a multiprocessing.Queue and then make a few processes (depending on available resources) that read from this queue, connect to the API and make the requests. These requests shall then be made in parallel, and if the API can handle these requests, your program should finish faster. If the calculations are complex and time consuming, the output from the API can be put into another Queue, from where other processes you start, can read them and make the calculations and store results. You may have to start a collector process to collect the final outputs.
You can find some sample code for such problems in this stackoverflow question. If you require further explanation or sample code, let me know in comments.
With multiprocessing a primitive way to do this whould be chunck the file into 5 equal pieces and give it to 5 different processes write their results to 5 different files, when all processes are done you will merge the results.
You can have the same logic with Python threads without much complication. And probably won't make any difference since the bottle neck is probably the API. So in the end it does not really matter which approach you choose here.
There are two things two consider though:
Using Threads, you are not realy using multiple CPUs hence you are have "wasted resources"
Using Multiprocessing will use multiple processors but it is heavier on start up ... So you will benefit from never stoping the script and keeping the processes alive if the script needs to run very often.
Since the information you gave about the scenario where you use this script (or better say program) is limited, it really hard to say which is the better approach.

Multithreading With Very Large Number of Threads

I'm working on simulating a mesh network with a large number of nodes. The nodes pass data between different master nodes throughout the network.
Each master comes live once a second to receive the information, but the slave nodes don't know when the master is up or not, so when they have information to send, they try and do so every 5 ms for 1 second to make sure they can find the master.
Running this on a regular computer with 1600 nodes results in 1600 threads and the performance is extremely bad.
What is a good approach to handling the threading so each node acts as if it is running on its own thread?
In case it matters, I'm building the simulation in python 2.7, but I'm open to changing to something else if that makes sense.
For one, are you really using regular, default Python threads available in the default Python 2.7 interpreter (CPython), and is all of your code in Python? If so, you are probably not actually using multiple CPU cores because of the global interpreter lock CPython has (see https://wiki.python.org/moin/GlobalInterpreterLock). You could maybe try running your code under Jython, just to check if performance will be better.
You should probably rethink your application architecture and switch to manually scheduling events instead of using threads, or maybe try using something like greenlets (https://stackoverflow.com/a/15596277/1488821), but that would probably mean less precise timings because of lack of parallelism.
To me, 1600 threads sounds like a lot but not excessive given that it's a simulation. If this were a production application it would probably not be production-worthy.
A standard machine should have no trouble handling 1600 threads. As to the OS this article could provide you with some insights.
When it comes to your code a Python script is not a native application but an interpreted script and as such will require more CPU resources to execute.
I suggest you try implementing the simulation in C or C++ instead which will produce a native application which should execute more efficiently.
Do not use threading for that. If sticking to Python, let the nodes perform their actions one by one. If the performance you get doing so is OK, you will not have to use C/C++. If the actions each node perform are simple, that may work. Anyway, there is no reason to use threads in Python at all. Python threads are usable mostly for making blocking I/O not to block your program, not for multiple CPU kernels utilization.
If you want to really use parallel processing and to write your nodes as if they were really separated and exchanging only using messages, you may use Erlang (http://www.erlang.org/). It is a functional language very well suited for executing parallel processes and making them exchange messages. Erlang processes do not map to OS threads, and you may create thousands of them. However, Erlang is a purely functional language and may seem extremely strange if you have never used such languages. And it also is not very fast, so, like Python, it is unlikely to handle 1600 actions every 5ms unless the actions are rather simple.
Finally, if you do not get desired performance using Python or Erlang, you may move to C or C++. However, still do not use 1600 threads. In fact, using threads to gain performance is reasonable only if the number of threads does not dramatically exceed number of CPU kernels. A reactor pattern (with several reactor threads) is what you may need in that case (http://en.wikipedia.org/wiki/Reactor_pattern). There is an excellent implementation of the reactor pattern in boost.asio library. It is explained here: http://www.gamedev.net/blog/950/entry-2249317-a-guide-to-getting-started-with-boostasio/
Some random thoughts here:
I did rather well with several hundred threads working like this in Java; it can be done with the right language. (But I haven't tried this in Python.)
In any language, you could run the master node code in one thread; just have it loop continuously, running the code for each master in each cycle. You'll lose the benefits of multiple cores that way, though. On the other hand, you'll lose the problems of multithreading, too. (You could have, say, 4 such threads, utilizing the cores but getting the multithreading headaches back. It'll keep the thread-overhead down, too, but then there's blocking...)
One big problem I had was threads blocking each other. Enabling 100 threads to call the same method on the same object at the same time without waiting for each other requires a bit of thought and even research. I found my multithreading program at first often used only 25% of a 4-core CPU even when running flat out. This might be one reason you're running slow.
Don't have your slave nodes repeat sending data. The master nodes should come alive in response to data coming in, or have some way of storing it until they do come alive, or some combination.
It does pay to have more threads than cores. Once you have two threads, they can block each other (and will if they share any data). If you have code to run that won't block, you want to run it in its own thread so it won't be waiting for code that does block to unblock and finish. I found once I had a few threads, they started to multiply like crazy--hence my hundreds-of-threads program. Even when 100 threads block at one spot despite all my brilliance, there's plenty of other threads to keep the cores busy!

Python Urllib UrlOpen Read

Say I am retrieving a list of Urls from a server using Urllib2 library from Python. I noticed that it took about 5 seconds to get one page and it would take a long time to finish all the pages I want to collect.
I am thinking out of those 5 seconds. Most of the time was consumed on the server side and I am wondering could I just start using the threading library. Say 5 threads in this case, then the average time could be dramatically increased. Maybe 1 or 2 seconds in each page. (might make the server a bit busy). How could I optimize the number of threads so I could get a legit speed and not pushing the server too hard.
Thanks!
Updated:
I increased the number of threads one by one and monitored the total time (units: minutes) spent to scrape 100 URLs. and it turned out that the total time dramatically decreased when you change the number of threads to 2, and keep decreasing as you increase the number of threads, but the 'improvement' caused by threading become less and less obvious. (the total time even shows a bounce back when you build too many threads)
I know this is only a specific case for the web server that I harvest but I decided to share just to show the power of threading and hope would be helpful for somebody one day.
There are a few things you can do. If the URLs are on different domains, then you might just fan out the work to threads, each downloading a page from a different domain.
If your URLs all point to the same server and you do not want stress the server, then you can just retrieve the URLs sequentially. If the server is happy with a couple of parallel requests, the you can look into pools of workers. You could start, say a pool of four workers and add all your URL to a queue, from which the workers will pull new URLs.
Since you tagged the question with "screen-scraping" as well, scrapy is a dedicated scraping framework, which can work in parallel.
Python 3 comes with a set of new builtin concurrency primitives under concurrent.futures.
Here is a caveat. I have encountered a number of servers powered by somewhat "elderly" releases of IIS. They often will not service a request if there is not a one second delay between requests.

when doing downloading with python,should I use multithreading or multiprocessing?

Recently I'm working on a program which can download manga from a online manga website.It works but a bit slow.So I decide to use multithreading/processing to speed up downloading.Here are my questions:
which one is better?(this is a python3 program)
multiprocessing,I think,will definitely work.If I use multiprocessing,what is the suitable amount of processes?Does it relate to the number of cores in my CPU?
multithreading will probably work.This download work obviously needs much time to wait for pics to be downloaded,so I think when a thread starts waiting,python will make another thread work.Am I correct?
I've read 《Inside the New GIL》by David M.Beazley.What's the influence of GIL if I use multithreading?
You're probably going to be bound by either the server's upload pipe (if you have a faster connection) or your download pipe (if you have a slower connection).
There's significant startup latency associated with TCP connections. To avoid this, HTTP servers can recycle connections for requesting multiple resources. So there are two ways for your client to avoid this latency hit:
(a) Download several resources over a single TCP connection so your program only suffers the latency once, when downloading the first file
(b) Download a single resource per TCP connection, and use multiple connections so that hopefully at every point in time, at least one of them will be downloading at full speed
With option (a), you want to look into how to recycle requests with whatever HTTP library you're using. Any good one will have a way to recycle connections. http://python-requests.org/ is a good Python HTTP library.
For option (b), you probably do want a multithread/multiprocess route. I'd suggest only 2-3 simultaneous threads, since any more will likely just result in sharing bandwidth among the connections, and raise the risk of getting banned for multiple downloads.
The GIL doesn't really matter for this use case, since your code will be doing almost no processing, spending most of its time waiting bytes to arrive over the network.
The lazy way to do this is to avoid Python entirely because most UNIX-like environments have good building blocks for this. (If you're on Windows, your best choices for this approach would be msys, cygwin, or a VirtualBox running some flavor of Linux, I personally like Linux Mint.) If you have a list of URL's you want to download, one per line, in a text file, try this:
cat myfile.txt | xargs -n 1 --max-procs 3 --verbose wget
The "xargs" command with these parameters will take a whitespace-delimited URL's on stdin (in this case coming from myfile.txt) and run "wget" on each of them. It will allow up to 3 "wget" subprocesses to run at a time, when one of them completes (or errors out), it will read another line and launch another subprocess, until all the input URL's are exhausted. If you need cookies or other complicated stuff, curl might be a better choice than wget.
It doesn't really matter. It is indeed true that threads waiting on IO won't get in the way of other threads running, and since downloading over the Internet is an IO-bound task, there's no real reason to try to spread your execution threads over multiple CPUs. Given that and the fact that threads are more light-weight than processes, it might be better to use threads, but you honestly aren't going to notice the difference.
How many threads you should use depends on how hard you want to hit the website. Be courteous and take care that your scraping isn't viewed as a DOS attack.
You don't really need multithreading for this kind of tasks.. you could try single thread async programming using something like Twisted

Does a multithreaded crawler in Python really speed things up?

Was looking to write a little web crawler in python. I was starting to investigate writing it as a multithreaded script, one pool of threads downloading and one pool processing results. Due to the GIL would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off the socket, then move on to the next thread, let it pick some data off the socket, etc..?
Basically I'm asking is doing a multi-threaded crawler in python really going to buy me much performance vs single threaded?
thanks!
The GIL is not held by the Python interpreter when doing network operations. If you are doing work that is network-bound (like a crawler), you can safely ignore the effects of the GIL.
On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.
Look at how scrapy works. It can help you a lot. It doesn't use threads, but can do multiple "simultaneous" downloading, all in the same thread.
If you think about it, you have only a single network card, so parallel processing can't really help by definition.
What scrapy does is just not wait around for the response of one request before sending another. All in a single thread.
When it comes to crawling you might be better off using something event-based such as Twisted that uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each one.
Asynchronous network operations can easily be and usually are single-threaded. Network I/O almost always has higher latency than that of CPU because you really have no idea how long a page is going to take to return, and this is where async shines because an async operation is much lighter weight than a thread.
Edit: Here is a simple example of how to use Twisted's getPage to create a simple web crawler.
Another consideration: if you're scraping a single website and the server places limits on the frequency of requests your can send from your IP address, adding multiple threads may make no difference.
Yes, multithreading scraping increases the process speed significantly. This is not a case where GIL is an issue. You are losing a lot of idle CPU and unused bandwidth waiting for a request to finish. If the web page you are scraping is in your local network (a rare scraping case) then the difference between multithreading and single thread scraping can be smaller.
You can try the benchmark yourself playing with one to "n" threads. I have written a simple multithreaded crawler on Discovering Web Resources and I wrote a related article on Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can select how many threads to use changing the NWORKERS class variable in FocusedWebCrawler.

Categories