I am making a webcrawler, and I have some "sleep" functions that make the crawl quite long.
For now I am doing:
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            deal_with(driver, year, quarter, speciality, ok)
The deal_with function opens several web pages, waiting a few seconds for the complete HTML to download before moving on. The execution time is therefore very long: there are 24 * 20 * 2 = 960 iterations, with no less than a minute per iteration.
I would like to use my 4 physical cores (8 threads) to take advantage of parallelism.
I read about tornado, multiprocessing, joblib... and can't really make up my mind about an easy solution to adapt to my code.
Any insight welcome :-)
tl;dr Investing in any choice without fully understanding the bottlenecks you are facing will not help you.
At the end of the day, there are only two fundamental approaches to scaling out a task like this:
Multiprocessing
You launch a number of Python processes, and distribute tasks to each of them. This is the approach you think will help you right now.
Some sample code for how this works, though you could use any appropriate wrapper:
import multiprocessing

# general rule of thumb: launch twice as many processes as cores
process_pool = multiprocessing.Pool(8)  # launches 8 processes

# generate a list of all inputs you wish to feed to this pool
inputs = []
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            inputs.append((driver, year, quarter, speciality, ok))

# feed your list of inputs to your process_pool and print the results when done
# (starmap, available on Python 3.3+, unpacks each tuple into deal_with's
# arguments; with plain map, deal_with would receive a single tuple instead)
print(process_pool.starmap(deal_with, inputs))
If this is all you wanted, you can stop reading now.
Asynchronous Execution
Here, you are content with a single thread or process, but you don't want it to be sitting idle waiting for stuff like network reads or disk seeks to come back - you want it to go on and do other, more important things while it's waiting.
True native asynchronous I/O support is provided in Python 3 and does not exist in Python 2.7 outside of the Twisted networking library.
import concurrent.futures

# generate a list of all inputs you wish to feed to this pool
inputs = []
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            inputs.append((driver, year, quarter, speciality, ok))

# produce a pool of processes, and make sure they don't block each other
# - submit() returns a Future, an object representing something yet to be
#   resolved, that will only be updated when the result comes in
with concurrent.futures.ProcessPoolExecutor() as executor:
    outputs = [executor.submit(deal_with, *input_tuple) for input_tuple in inputs]

    # wait for all of them to finish - not ideal, since it defeats the purpose
    # in production, but sufficient for an example
    for future_object in concurrent.futures.as_completed(outputs):
        # do something with future_object.result()
        print(future_object.result())
So What's the Difference?
My main point here is to emphasise that choosing from a list of technologies isn't as hard as figuring out where the real bottleneck is.
In the examples above, there isn't any difference. Both follow a simple pattern:
Have a lot of workers
Allow these workers to pick something from a queue of tasks right away
When one is free, set it to work on the next task right away.
Thus, there is no conceptual difference at all if you follow these examples verbatim, even though they use entirely different technologies and claim to use entirely different techniques.
Any technology you pick will be for naught if you write it in this pattern - even though you'll get some speedup, you will be sorely disappointed if you expected a massive performance boost.
Why is this pattern bad? Because it doesn't solve your problem.
Your problem is simple: you have to wait. While your process is waiting for something to come back, it can't do anything else! It can't fetch more pages for you. It can't process an incoming task. All it can do is wait.
Having more processes that ultimately wait is not the true solution. An army of troops that has to march to Waterloo will not be faster if you split it into regiments - each regiment eventually has to sleep, though they may sleep at different times and for different lengths, and what will happen is that all of them will arrive at roughly the same time.
What you need is an army that never sleeps.
So What Should You Do?
Abstract all I/O bound tasks into something non-blocking. This is your true bottleneck. If you're waiting for a network response, don't let the poor process just sit there - give it something to do.
Your task is made somewhat difficult in that, by default, reading from a socket is blocking. That's the way operating systems are. Thankfully, you don't need Python 3 to solve it (though that is always the preferred solution) - the asyncore library (though Twisted is superior to it in every way) already exists in Python 2.7 to run network reads and writes truly in the background.
There is one and only one case where true multiprocessing needs to be used in Python, and that's if you are doing CPU-bound or CPU-intensive work. From your description, it doesn't sound like that's the case.
In short, you should edit your deal_with function to avoid the blocking wait. Make that wait happen in the background, if needed, using a suitable abstraction from Twisted or asyncore. But don't let it consume your process completely.
If you're using Python 3, I would check out the asyncio module. I believe you can just decorate deal_with with @asyncio.coroutine. You will likely have to adjust what deal_with does to work properly with the event loop as well.
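For example, a minimal sketch of that shape (reusing the names from the question; the 5-second sleep and the placeholder driver/ok objects are stand-ins, not working crawler code) might look like:
import asyncio

driver, ok = None, None   # placeholders for the objects used in the question

@asyncio.coroutine
def deal_with(driver, year, quarter, speciality, ok):
    # the blocking "sleep a few seconds for the page" becomes a non-blocking
    # wait, so the event loop can run other deal_with calls in the meantime
    yield from asyncio.sleep(5)
    # ... fetch and parse the page here ...

tasks = [deal_with(driver, year, quarter, speciality, ok)
         for speciality in range(1, 25)
         for year in range(1997, 2017)
         for quarter in [1, 2]]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
On Python 3.5 and newer, the same thing is normally written with async def and await instead of the decorator.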
Related
I'm not a Python programmer and I rarely work with Linux, but I'm forced to use it for a project. The project is fairly straightforward: one task constantly gathers information as a single, often-updating numpy float32 value in a class; the other task, which is also running constantly, needs to occasionally grab the data in that variable asynchronously, but not in a very time-critical way, and entirely on Linux. My default for such tasks would be to create a thread, but after doing some research it appears that Python threading might not be the best solution for this, from what I'm reading.
So my question is this: do I use multithreading, multiprocessing, concurrent.futures, asyncio, or (just thinking out loud here) some stdin-triggering / stdout-reading trickery, or something similar to DDS on Linux that I don't know about, on 2 separate running scripts?
Just to add, both tasks do a lot of IO: task 1 does a lot of USB IO, the other task does a bit of serial and file IO. I'm not sure if this is useful. I also think it's worth stating that the data should be reset once pulled, and that task 1 should have as little downtime as possible. Having the 2 programs talk via a file probably won't satisfy this.
Any help would be appreciated, this has proven to be a difficult use case to google for.
Threading will probably work fine; a lot of the problems with the GIL (global interpreter lock) are overhyped. You just need to make sure both threads provide active opportunities for the scheduler to switch contexts on a regular basis, typically by calling sleep(0). Most threads run in a loop, and if the loop body is fairly short then calling sleep(0) at the top or bottom of it on every iteration is usually enough. If the loop body is long you might want to put a few more in along the way. It's just a hint to the scheduler that this would be a good time to switch if other threads want to run.
Have you considered a double ended queue? You write to one end and grab data from the other end. Then with your multi-threading you could write with one thread and read with the other:
https://docs.python.org/3/library/collections.html#collections.deque
Quoting the documentation: "Deques are a generalization of stacks and queues (the name is pronounced “deck” and is short for “double-ended queue”). Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction."
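A rough sketch of how that could look with two threads (the producer and consumer bodies below are invented placeholders, not your USB/serial code):
import collections
import threading
import time

import numpy as np

readings = collections.deque()   # appends and pops from either end are thread-safe

def producer():
    # stand-in for the task that constantly gathers data
    for i in range(100):
        readings.append(np.float32(i) * np.float32(0.5))
        time.sleep(0.01)

def consumer():
    # stand-in for the task that occasionally grabs the data
    while True:
        try:
            value = readings.popleft()   # oldest value; pop() would take the newest
        except IndexError:
            time.sleep(0.05)             # nothing new yet, check again later
            continue
        print(value)

threading.Thread(target=producer, daemon=True).start()
threading.Thread(target=consumer, daemon=True).start()
time.sleep(2)   # let the demo run for a couple of seconds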
I am just learning Python and don't have much experience with multithreading. I am trying to send some JSON via the Requests session.post method. This is called in the function at the bottom of the many for loops I need to run through the dictionary.
Is there a way to let this run in parallel?
I also have to limit my number of threads, otherwise the post calls get blocked because they come too fast after each other. Help would be much appreciated.
def doWork(session, List, RefHashList):
    for itemRefHash in RefHashList:
        for equipment in res['Response']['data']['items']:
            if equipment['itemHash'] == itemRefHash:
                if equipment['characterIndex'] != 0:
                    SendJsonViaSession(session,
                                       getCharacterIdFromIndex(res, equipment['characterIndex']),
                                       itemRefHash, equipment['quantity'])
First, structuring your code differently might improve the speed without the added complexity of threading.
def doWork(session, res, RefHashList):
    for equipment in res['Response']['data']['items']:
        i = equipment['itemHash']
        k = equipment['characterIndex']
        if i in RefHashList and k != 0:
            SendJsonViaSession(session, getCharacterIdFromIndex(res, k), i, equipment['quantity'])
To start with, we will look up equipment['itemHash'] and equipment['characterIndex'] only once.
Instead of explicitly looping over RefHashList, you could use the in operator. This moves the loop into the interpreter's C code, which is faster. (If RefHashList is large, converting it to a set first would make the membership test faster still.)
And instead of a nested if-conditional, you could use a single conditional using and.
Note: I have removed the unused parameter List, and replaced it with res. It is generally good practice to write functions that only act on parameters that they are given, not global variables.
Second, how much extra performance do you need? How much time is there on average between the SendJsonViaSession calls, and how small can this time become before calls get blocked? If the difference between those numbers is small, it is probably not worth implementing a threaded sender.
Third, a design feature of the standard Python implementation is that only one thread at a time can be executing Python bytecode. So it is not certain that threading will improve performance.
Edit:
There are several ways to run stuff in parallel in Python. There is multiprocessing.Pool which uses processes, and multiprocessing.dummy.ThreadPool which uses threads. And from Python 3.2 onwards there is concurrent.futures, which can use processes or threads.
The thing is, none of them has rate limiting built in. So you could still get blocked for making too many calls.
Every time you call SendJsonViaSession you'd have to save the current time somehow so that all processes or threads can use it. And before every call, you would have to read that time and wait if it is too close to the last call.
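A rough sketch of that idea with a thread pool (the 0.5-second minimum interval and the send_one wrapper around your SendJsonViaSession call are made-up placeholders):
import threading
import time
from multiprocessing.dummy import Pool as ThreadPool   # thread-based Pool

MIN_INTERVAL = 0.5            # assumed minimum spacing between calls, in seconds
rate_lock = threading.Lock()
last_call = [0.0]             # timestamp of the most recent call, shared by all threads

def send_one(args):
    session, character_id, item_hash, quantity = args
    with rate_lock:
        # wait until enough time has passed since the previous call
        pause = MIN_INTERVAL - (time.time() - last_call[0])
        if pause > 0:
            time.sleep(pause)
        last_call[0] = time.time()
    SendJsonViaSession(session, character_id, item_hash, quantity)   # the question's function

work_items = []               # fill with (session, character_id, item_hash, quantity) tuples
pool = ThreadPool(4)
pool.map(send_one, work_items)
pool.close()
pool.join()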
Edit2:
If a call to SendJsonViaSession only takes 0.3 seconds, you should be able to do 3 calls/second sequentially. But your code only does 1 call/second. This implies that the speed restriction is somewhere else. You'd have to profile your code to see where the problem lies.
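For example, assuming doWork, session, res and RefHashList are all defined at module level as above, something like this would show where the time actually goes:
import cProfile
import pstats

# run one full doWork pass under the profiler and dump the stats to a file
cProfile.run("doWork(session, res, RefHashList)", "dowork.prof")

# print the 20 functions with the largest cumulative time
pstats.Stats("dowork.prof").sort_stats("cumulative").print_stats(20)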
I have code with heavy symbolic calculations (many multiple symbolic integrals). I also have access to both an 8-core CPU computer (with 18 GB RAM) and a small 32-CPU cluster. I would prefer to stay on my professor's 8-core PC rather than go to another professor's lab to use his cluster for a more limited time; however, I'm not sure the code will work well on the SMP system, so I am looking for a parallel tool in Python that can be used on both SMP machines and clusters, and I'd of course prefer code written for one system to be easily modifiable, with the least effort, for use on the other.
So far, I have found Parallel Python (PP) promising for my needs, but I have recently been told that MPI (pyMPI or MPI4py) does the same. I couldn't confirm this, as very little seems to be discussed about it on the web; the only statement I found was that MPI (both pyMPI and MPI4py) is usable for clusters only, if I am right about that "only"!
Is "Parallel Python" my only choice, or can I also happily use MPI-based solutions? Which one is more promising for my needs?
PS. It seems neither of them has very comprehensive documentation, so if you know of links other than their official websites that can help a newbie in parallel computation, I would be grateful if you would also mention them in your answer :)
Edit.
My code has two loops, one inside the other. The outer loop cannot be parallelized, as it is an iterative (recursive) method, each step depending on the values calculated in the previous step. The outer loop contains the inner loop alongside 3 extra equations whose calculations depend on the full results of the inner loop. However, the inner loop (which contains 9 of the 12 equations computed at each step) can be safely parallelized: all 3*3 equations are independent of each other, depending only on the previous step. All my equations are computationally heavy, as each contains many multiple symbolic integrals. It seems I can parallelize both the inner loop's 9 equations and the integrations inside each of those 9 equations separately, and also parallelize all the integrations in the other 3 equations alongside the inner loop. You can find my code here if it helps you better understand my needs; it is written in SageMath.
I would look into multiprocessing (doc), which provides a bunch of nice tools for spawning and working with sub-processes.
To quote the documentation:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads.
From the comments I think the Pool and its map would serve your purposes (doc).
from multiprocessing import Pool

def work_done_in_inner_loop(arg):
    # put your work code here
    pass

p = Pool(9)
for o in outer_loop:
    # whatever else you do
    list_of_args = [...]  # what your inner loop currently loops over
    res = p.map(work_done_in_inner_loop, list_of_args)
    # rest of code
It seems like there are a few reasonable ways to design this.
Let me refer to your jobs as the main job, the 9 intermediate jobs, and the many inner jobs the intermediate jobs can spin off. I'm assuming the intermediate jobs have a "merge" step after the inner jobs all finish, and the same for the outer job.
The simplest design is that the main job fires off the intermediate jobs and then waits for them all to finish before doing its merge step. The intermediate jobs then fire off the inner jobs and wait for them all to finish before doing their merge steps.
This can work with a single shared queue, but you need a queue that doesn't block the worker pool while waiting, and I don't think multiprocessing's Pool and Queue can do that out of the box. As soon as you've got all of your processes waiting to join their children, nothing gets done.
One way around that is to change to a continuation-passing style. If you know which of the intermediate jobs will finish last, you can pass it the handles to the other intermediate jobs and have it join on them and do the merge, instead of the outer job. And the intermediate jobs similarly pass off the merge to their last inner job.
The problem is that you usually have no way of knowing what's going to finish last, even without scheduling issues. So that means you need some form of either sharing (e.g., a semaphore) or message passing between the jobs to negotiate that among themselves. You can do that on top of multiprocessing. The only problem is that it destroys the independence of your jobs, and you're suddenly dealing with all the annoying problems of shared concurrency.
A different alternative is to have separate pools and queues for each intermediate job, and some kind of load balancing between the pools that can ensure that each core is running one active process.
Or, of course, a single pool with a more complicated implementation than multiprocessing's, which does either load balancing or cooperative scheduling, so a joiner doesn't block a core.
Or a super-simple solution: Overschedule, and pay a little cost in context switching for simplicity. For example, you can run 32 workers even though you've only got 8 cores, so you've got 22 active workers and 10 waiting. Each core has 2 or 3 active workers, which will slow things down a bit, but maybe not too badly—and at least nobody's idle, and you didn't have to write any code beyond passing a different parameter to the multiprocessing.Pool constructor.
At any rate, multiprocessing is very simple, and it has almost no extra concepts that won't apply to other solutions. So it may take less time to just try it and see whether you run into a brick wall than to figure out in advance whether it will work for you.
I recently ran into a similar problem. However, the following solution is only valid if (1) you wish to run the python script individually on a group of files, AND (2) each invocation of the script is independent of the others.
If the above applies to you, the simplest solution is to write a wrapper in bash along the lines of:
for a_file in $list_of_files
do
    python python_script.py "$a_file" &
done
The '&' runs the preceding command in the background as a separate process. The advantage is that bash will not wait for the Python script to finish before continuing with the for loop.
You may want to place a cap on the number of processes running simultaneously, since this code will use all available resources.
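If you would rather stay in Python, a rough equivalent that caps the number of simultaneously running scripts (the script name and file list below are placeholders) could be:
import subprocess
from concurrent.futures import ThreadPoolExecutor

list_of_files = ["a.txt", "b.txt", "c.txt"]   # placeholder input files
MAX_PARALLEL = 4                              # cap on simultaneous processes

def run_script(path):
    # each call launches one independent invocation of the script
    return subprocess.call(["python", "python_script.py", path])

# the thread pool only ever keeps MAX_PARALLEL subprocesses alive at once
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as executor:
    return_codes = list(executor.map(run_script, list_of_files))

print(return_codes)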
I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured I would make a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should I run at once? Should I stick to two? Can I go higher? What would be a reasonable limit for the number of threads? Keep in mind that each thread goes out to a web page, downloads the HTML, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next URL off the queue.
You will probably find your application is bandwidth limited not CPU or I/O limited.
As such, add as many as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. Like if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may impact making too many HTTP requests at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really there's no hard and fast answer. Assuming a 1-5 megabit connection, I'd say you could easily have up to 20-30 threads without any problems.
I would use one thread and Twisted with either a deferred semaphore or a task cooperator, if you already have an easy way to feed in an arbitrarily long list of URLs.
It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a twisted-based crawler.
It's usually simpler to make multiple concurrent processes. Simply use subprocess to create as many Popens as you feel are necessary to run concurrently.
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
If you run some larger number, you'll see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes and your desirable elapsed time.
Think of it this way.
1 crawler does its job in 1 minute. 100 pages done serially could take 100 minutes. 100 crawlers running concurrently might take an hour. Let's say that 25 crawlers finish the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.
cletus's answer is the one you want.
A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different solution is pycurl, which is a thin wrapper to libcurl, which is a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example of how to fetch multiple pages in parallel, in about 120 lines of code.
You can go higher than two. How much higher depends entirely on the hardware of the system you're running this on, how much processing is going on after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
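One simple way to enforce such a cap with threads is a semaphore. A rough sketch (the limit and fetch function are illustrative; with several different hosts you would keep one semaphore per hostname):
import threading
import urllib.request

PER_SERVER_LIMIT = 5
server_slots = threading.BoundedSemaphore(PER_SERVER_LIMIT)

def fetch(url):
    # only PER_SERVER_LIMIT threads can be inside this block at the same time,
    # so the server never sees more than that many concurrent requests from us
    with server_slots:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read()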
Threading isn't necessary in this case. Your program is I/O bound rather than CPU bound. The networking part would probably be better done using select() on the sockets. This reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I hear it has really good support for asynchronous networking. It would allow you to specify the URLs you wish to download and register a callback for each. When each one is downloaded, the callback will be called, and the page can be processed. In order to allow multiple sites to be downloaded without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue, and the "worker" thread would do the actual processing.
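A minimal sketch of that "downloads feed a queue, worker thread processes" split, using only the standard library (the URLs and the processing step are placeholders, and plain threads stand in for the select()/Twisted download machinery):
import queue
import threading
import urllib.request

urls = ["http://example.com/", "http://example.org/"]   # placeholder URLs
page_queue = queue.Queue()

def downloader(url):
    # fetch a page and hand its contents to the worker via the queue
    with urllib.request.urlopen(url, timeout=30) as response:
        page_queue.put((url, response.read()))

def worker():
    # the worker does the actual processing (regexes, database writes, ...)
    while True:
        url, html = page_queue.get()
        print(url, len(html))
        page_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
downloads = [threading.Thread(target=downloader, args=(u,)) for u in urls]
for t in downloads:
    t.start()
for t in downloads:
    t.join()
page_queue.join()   # wait until the worker has processed everything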
As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.
I want to write a program that spawns processes in two levels, like this: process -> n processes -> n processes each.
Can the second level spawn processes with multiprocessing? I'm using the multiprocessing module of Python 2.6.
Thanks
@vilalian's answer is correct, but terse. Of course, it's hard to supply more information when your original question is so vague.
To expand a little, you'd have your original program spawn its n processes, but they'd be slightly different from the original in that you'd want them (each, if I understand your question) to spawn n more processes. You could accomplish this either by having them run code similar to your original process, except that it spawns new sets of programs that perform the task at hand without further processing, or by using the same code/entry point and just providing different arguments - something like
def main(level):
    if level == 0:
        do_work()
    else:
        for i in range(n):
            spawn_process_that_runs_main(level - 1)
and start it off with level == 2
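A concrete, runnable version of that idea with multiprocessing (the fan-out factor and the do_work body are placeholders) might look like:
import multiprocessing

N = 3   # fan-out at each level (placeholder)

def do_work(path):
    print("working at", path)

def main(level, path=()):
    if level == 0:
        do_work(path)
        return
    # each process spawns N children that run main() one level down
    children = [multiprocessing.Process(target=main, args=(level - 1, path + (i,)))
                for i in range(N)]
    for child in children:
        child.start()
    for child in children:
        child.join()

if __name__ == "__main__":
    main(2)   # process -> N processes -> N processes each
Note that this relies on the spawned processes being non-daemonic (the default), since daemonic processes are not allowed to have children.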
You can structure your app as a series of process pools communicating via Queues at any nested depth, though it can get hairy pretty quickly (probably due to the required context switching).
It's not Erlang, though, that's for sure.
The docs on multiprocessing are extremely useful.
Here (a little too much to drop in a comment) is a description of some code I use to increase throughput in a program that updates my feeds. I have one process polling for feeds that need to be fetched; it stuffs its results into a queue that a process pool of 4 workers picks up from, fetching the feeds. The results (if any) are then put into a queue for another process pool to parse, and finally pushed onto a queue to be shoved back into the database. Done sequentially, this would be really slow, because some sites take their own sweet time to respond, so most of the time the process would just be waiting on data from the internet and would only use one core. Under this process-based model, I'm actually waiting on the database the most, it seems, my NIC is saturated most of the time, and all 4 cores are actually doing something. Your mileage may vary.
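The code itself isn't reproduced here, but a rough sketch of the pipeline being described (with hypothetical fetch_feed and parse_feed stand-ins, and a plain URL list in place of the polling process) could look like:
import multiprocessing

def fetch_feed(url):
    # hypothetical stand-in for downloading one feed
    return url, "<xml>...</xml>"

def parse_feed(fetched):
    # hypothetical stand-in for parsing a downloaded feed into database rows
    url, body = fetched
    return url, len(body)

if __name__ == "__main__":
    urls_to_fetch = ["http://example.com/feed%d" % i for i in range(20)]

    fetch_pool = multiprocessing.Pool(4)    # 4 workers downloading feeds
    parse_pool = multiprocessing.Pool(4)    # 4 workers parsing the results

    # imap_unordered streams results between the stages as they become ready,
    # so parsing starts while other feeds are still downloading
    fetched = fetch_pool.imap_unordered(fetch_feed, urls_to_fetch)
    for row in parse_pool.imap_unordered(parse_feed, fetched):
        print(row)    # in the real program this is where the database write goes
Because the results stream between the two pools, parsing of early feeds overlaps with the downloading of later ones, which is what keeps all the cores busy.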
Yes - but you might run into an issue that would require the fix I committed to Python trunk yesterday. See bug http://bugs.python.org/issue5313
Sure you can. Especially if you are using fork to spawn child processes, they work as perfectly normal processes (just like the parent). Thread management is quite different, but you can also use "second level" sub-threading.
Take care not to over-complicate your program; for example, programs with two levels of threads are rarely needed.