Since I have been wasting a lot of time trying to join() workers in Python multiprocessing architectures where the workers get their tasks from a multiprocessing.Queue that is simultaneously fed by a feeder function through put():
Can someone contribute a short but robust recipe for this kind of architecture?
Let's say up to 10 feeders and up to 100 workers. Queue items might be large.
I can only guess that in the past I often had the queue busy and unresponsive; but mostly the task got done, only the jobs never joined, or whether they joined or not seemed to depend on arbitrary parameters.
So imagine the following workflow:
One or more feeder() jobs read from input (e.g. disk) and create tasks for the first line of workers. They put the tasks into a queue Q. When there are no more tasks to create, the feeders should join.
Many workers take tasks out of Q and process them, then put the results into another queue R. When these workers fail to get more jobs from Q and are done with their last task, they should join. Consider that they might hit an exception due to an invalid input/task; they should still join.
Let's add another line of workers that always takes two results from R, merges them, and puts the merged result back into R. When there is only one result left in R, all workers of the second line should join.
Finally I can take the last remaining aggregated result out of R and be happy.
I think this setup should include enough aspects to allow for generalization of many different tasks.
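To make the workflow concrete, here is a rough, sentinel-based sketch of the first two stages (the merger line would follow the same pattern). It is untested, process() and input_chunks are placeholders, and this is exactly the kind of code where my joins keep going wrong:

import multiprocessing as mp

SENTINEL = None  # marker meaning "no more tasks"; any unique object works

def feeder(q, items):
    for item in items:          # e.g. read tasks from disk
        q.put(item)

def worker(q, r):
    while True:
        task = q.get()
        if task is SENTINEL:    # no more tasks: return so the process can be joined
            return
        try:
            r.put(process(task))    # process() is a placeholder
        except Exception:
            pass                    # invalid task: drop it, but keep consuming

if __name__ == "__main__":
    q, r = mp.Queue(), mp.Queue()
    feeders = [mp.Process(target=feeder, args=(q, chunk)) for chunk in input_chunks]
    workers = [mp.Process(target=worker, args=(q, r)) for _ in range(100)]
    for p in feeders + workers:
        p.start()
    for p in feeders:
        p.join()                # all tasks are now enqueued
    for _ in workers:
        q.put(SENTINEL)         # one sentinel per worker
    for p in workers:
        p.join()

One pitfall I am aware of: a process that has put data on a multiprocessing.Queue will not terminate until that data has been flushed to the underlying pipe, so with large items the results in R may have to be drained before the workers can be joined.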
To avoid blocking of the queue(s), I have already tried out the following ideas:
Use multiple queues, always putting to the shortest queue and getting from a random queue.
Do not block in get(); instead, try getting without waiting, and if that fails, sleep for a short time and retry. Keep track of the number of tries and, after a while, give up, assuming the queue is empty (a sketch of this follows).
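For the second idea, the polling loop I have been using looks roughly like this (the retry count and delay are arbitrary):

import queue   # multiprocessing.Queue raises queue.Empty
import time

def get_with_retries(q, max_tries=50, delay=0.02):
    tries = 0
    while tries < max_tries:
        try:
            return q.get_nowait()
        except queue.Empty:
            tries += 1
            time.sleep(delay)   # back off briefly, then retry
    return None                 # give up: assume the queue is empty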
I can add more of my actual code, but maybe this would lead to bug fixing instead of sharing best practices.
Related
I have a computational workload that I originally ran with concurrent.futures.ProcessPoolExecutor which I converted to use dask so that I could make use of dask's integrations with distributed computing systems for scaling beyond one machine. The workload consists of two task types:
Task A: takes string/float inputs and produces a matrix (around 2000 x 2000). Task duration is usually 60 seconds or less.
Task B: takes the matrix from task A and uses it and some other small inputs to solve an ordinary differential equation. The solution is written to disk (so no return value). Task duration can be up to fifteen minutes.
There can be multiple B tasks for each A task.
Originally, my code looked like this:
a_results = client.map(calc_a, a_inputs)
all_b_inputs = [(a_result, b_input) for b_input in b_inputs for a_result in a_results]
b_results = client.map(calc_b, all_b_inputs)
dask.distributed.wait(b_results)
because that was the clean translation from the concurrent.futures code (I actually kept the code so that it could be run either with dask or concurrent.futures so I could compare). client here is a distributed.Client instance.
I have been experiencing some stability issues with this code, especially for large numbers of tasks, and I think I might not be using dask in the best way. Recently, I changed my code to use Delayed instead like this:
a_results = [dask.delayed(calc_a)(a) for a in a_inputs]
b_results = [dask.delayed(calc_b)(a, b) for a in a_results for b in b_inputs]
client.compute(b_results)
I did this because I thought perhaps the scheduler could work through the tasks more efficiently if it examined the entire graph before starting anything rather than beginning to schedule the A tasks before knowing about the B tasks. This change seems to help some but I still see some stability issues.
I can create separate questions for the stability problems, but I first wanted to find out whether I am using dask in the best way for this use case or whether I should change how I submit the tasks. To describe the problems briefly: the worst one is that over time my workers drop to 0% CPU and tasks stop completing. Other problems include KilledWorker exceptions and log messages about an unresponsive loop and timeouts. Usually the scheduler runs fine for at least a few hours, completing thousands of tasks before these issues show up (which makes debugging difficult, since the feedback loop is so long).
Some questions I have been wondering about:
I can have thousands of tasks to run. Can I submit these all to dask to start out or do I need to submit them in batches? My thought was that the dask scheduler would be better at scheduling tasks than my batching code.
If I do need to batch things myself, can I query the scheduler to find out the maximum number of workers so I can write something that will submit batches of the right size? Or do I need to make the batch size an input to my batching code?
In the end, my results all get written to disk and nothing gets returned. With the way I am running tasks, are resources getting held onto longer than necessary?
My B tasks are long, but they could be split by scheduling tasks that solve for solutions at intermediate time steps and feeding those in as the inputs to subsequent solving tasks. I think I need to do this anyway, because I would like to use an HPC cluster with a timed queue: I think I need the lifetime parameter to retire workers before they run over the time limit, and that works best with short-lived tasks (to avoid losing work when a worker is shut down early). Is there an optimal way to split the B tasks?
There are lots of questions here. With regard to the code snippets you provided: both look correct, but in my experience the futures version will scale better. The reason is that, by default, when one delayed task fails, the computation of all delayed tasks halts, whereas futures can proceed as long as they are not directly affected by the failure.
Another observation is that delayed values tend to hold on to resources after completion, while with futures you can at least .release() them once they have completed (or use fire_and_forget).
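For example, a sketch of letting go of results that are only ever written to disk (client, calc_b and all_b_inputs are the names from the question):

from dask.distributed import fire_and_forget

futures = client.map(calc_b, all_b_inputs)
fire_and_forget(futures)    # tell the scheduler nothing will gather these;
                            # it can forget each result as soon as it is done
# or, if you do gather some results first:
# for fut in futures:
#     fut.release()         # drop this client's reference to the result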
Finally, with very large task lists, it might be worth making them a bit more resilient to restarts. One basic option is to create a simple text file after each successful task completion and, on restart, check which tasks need to be re-computed. Fancier options include prefect and joblib.memory, but if you don't need all the bells and whistles, the text-file route is often fastest.
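A sketch of the text-file route, where task_id is a placeholder for any stable, unique naming scheme:

import os

def calc_b_checkpointed(a_result, b_input, done_dir="done"):
    marker = os.path.join(done_dir, task_id(a_result, b_input))
    if os.path.exists(marker):
        return                       # already done on a previous run
    calc_b(a_result, b_input)        # writes its solution to disk
    open(marker, "w").close()        # create the marker only after success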
What is the difference between a Queue and a JoinableQueue in multiprocessing in Python? This question has already been asked here, but as some comments point out, the accepted answer is not helpful because all it does is quote the documentation. Could someone explain the difference in terms of when to use one versus the other? For example, why would one choose Queue over JoinableQueue, given that JoinableQueue is pretty much the same thing except for offering the two extra methods join() and task_done()? Additionally, the other answer in the post I linked to mentions that "based on the documentation, it's hard to be sure that Queue is actually empty", which again raises the question: why would I want to use a Queue over a JoinableQueue? What advantages does it offer?
multiprocessing patterns its queues off of queue.Queue. In that model, Queue keeps a "task count" of everything put on the queue. There are generally two ways to use this queue. Producers could just put things on the queue and ignore what happens to them in the long run. The producer may wait from time to time if the queue is full, but doesn't care if any of the things put on the queue are actually processed by the consumer. In this case the queue's task count grows, but who cares?
Alternately, the producer can "join" the queue. That means that it waits until the last task on the queue has been processed and the task count has gone to zero. But to do this, the producer needs the consumer's help. A consumer gets an item from the queue, but that doesn't decrease the task count. The consumer has to actively call task_done (typically when the task is done...) and the join will wait until every put has a task_done.
Fast forward to multiprocessing. The task_done mechanism requires communication between processes, which is relatively expensive. If you are the first kind of producer, which doesn't play the join game, use a multiprocessing.Queue and save a bit of CPU time. If you are the second kind, use multiprocessing.JoinableQueue. But remember that the consumer also has to play the task_done game, or the producer will hang.
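A minimal sketch of the second pattern (handle() is a placeholder for the real work):

import multiprocessing as mp

def consumer(q):
    while True:
        item = q.get()
        try:
            handle(item)         # placeholder for the real work
        finally:
            q.task_done()        # without this, join() below never returns

if __name__ == "__main__":
    q = mp.JoinableQueue()
    mp.Process(target=consumer, args=(q,), daemon=True).start()
    for item in range(10):
        q.put(item)
    q.join()                     # blocks until every put has a matching task_done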
I'm trying to solve a problem, where I have many (on the order of ten thousand) URLs, and need to download the content from all of them. I've been doing this in a "for link in links:" loop up till now, but the amount of time it's taking is now too long. I think it's time to implement a multithreaded or multiprocessing approach. My question is, what is the best approach to take?
I know about the Global Interpreter Lock, but since my problem is network-bound, not CPU-bound, I don't think that will be an issue. I need to pass data back from each thread/process to the main thread/process. I don't need help implementing whatever approach (Terminate multiple threads when any thread completes a task covers that), I need advice on which approach to take. My current approach:
data_list = get_data(...)
output = []
for datum in data_list:
    output.append(get_URL_data(datum))
return output
There's no other shared state.
I think the best approach would be to have a queue with all the data in it, and have several worker threads pop from the input queue, get the URL data, then push onto an output queue.
Am I right? Is there anything I'm missing? This is my first time implementing multithreaded code in any language, and I know it's generally a Hard Problem.
For your specific task I would recommend a multiprocessing worker pool. You simply define a pool, telling it how many processes you want to use (one per processor core by default) and which function to run on each unit of work. Then you put every unit of work (in your case, the URLs) in a list and hand it to the worker pool.
Your output will be a list of the return values of your worker function, one for every item of work in your original list. All the cool multiprocessing goodness happens in the background. There are, of course, other ways of working with the worker pool, but this is my favourite one.
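A minimal sketch, assuming get_URL_data and get_data are the functions from the question (the process count is just a guess to tune):

from multiprocessing import Pool

if __name__ == "__main__":
    urls = get_data(...)             # the list from the question
    # More processes than cores is reasonable here, since the work
    # is network-bound rather than CPU-bound.
    with Pool(processes=16) as pool:
        output = pool.map(get_URL_data, urls)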
Happy multi-processing!
The best approach I can think of in your use case will be to use a thread pool and maintain a work queue. The threads in the thread pool get work from the work queue, do the work and then go get some more work. This way you can finely control the number of threads working on your URLs.
So, create a WorkQueue, which in your case is basically a list containing the URLs that need to be downloaded.
Create a thread pool that creates the number of threads you specify, fetches work from the WorkQueue and assigns it to a thread. Each time a thread finishes and returns, check whether the work queue has more work and, accordingly, assign work to that thread again. You may also want to put in a hook so that every time work is added to the queue, it is assigned to a free thread if one is available.
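In outline, it could look like this (a sketch; get_URL_data and urls are from the question, and the thread count is an arbitrary choice):

import queue
import threading

NUM_THREADS = 20                     # tune for your bandwidth and server limits

def worker(work_q, out_q):
    while True:
        url = work_q.get()
        if url is None:              # sentinel: no more work
            break
        out_q.put(get_URL_data(url))

work_q, out_q = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(work_q, out_q))
           for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for url in urls:
    work_q.put(url)
for _ in threads:
    work_q.put(None)                 # one sentinel per thread
for t in threads:
    t.join()
output = [out_q.get() for _ in range(out_q.qsize())]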
The fastest and most efficient way of doing IO-bound tasks like this is an asynchronous event loop. libcurl can do this, and there is a Python wrapper for it called pycurl. Using its "multi" interface you can do high-performance client activities. I have done over 1000 simultaneous fetches about as fast as one.
However, the API is quite low-level and difficult to use. There is a simplifying wrapper here, which you can use as an example.
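A rough sketch of the "multi" interface, adapted from the pattern in the pycurl documentation (urls is a placeholder, and error handling is omitted):

import io
import pycurl

handles = []
multi = pycurl.CurlMulti()
for url in urls:
    buf = io.BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    multi.add_handle(c)
    handles.append((c, buf))

num_active = len(handles)
while num_active:
    # Drive all transfers; E_CALL_MULTI_PERFORM means "call again immediately"
    while True:
        ret, num_active = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    multi.select(1.0)    # wait for network activity before looping again

results = [buf.getvalue() for _, buf in handles]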
I am creating a small application which will perform, say, 4 different time-consuming tasks such that the output of the first is the input of the second, and so on.
At every task level, the output is appended to a list; the next task pops from it, operates, appends its output to its own output list, and so on...
The way I thought I would get the task done is by having multiple threads on each of those 4 tasks.
Coming to the question: is there any way I can have my application spawn threads at each task level depending on the number of tasks in its input queue?
Say the input list of the second task is empty in the beginning, so the number of threads is zero; if there is one task, a single thread is spawned, two for two, and so on. And of course there should be an upper limit on the number of threads, say 10, so that if the length of the input list goes as high as 100, the number of threads stays at 10.
Please suggest the pythonic way to go about achieving this.
You have successfully invented the Thread Pool. There is builtin support and there are many libraries and examples that will provide this for you, so use one or learn from their code.
from multiprocessing.pool import ThreadPool
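For instance, a sketch with one pool per task level, capped at 10 threads (the task functions and inputs are placeholders):

from multiprocessing.pool import ThreadPool

MAX_THREADS = 10

def run_stage(func, inputs):
    # The pool never spawns more threads than it has work for, so sizing
    # it as min(len(inputs), MAX_THREADS) gives the scaling behaviour
    # described in the question.
    if not inputs:
        return []
    with ThreadPool(min(len(inputs), MAX_THREADS)) as pool:
        return pool.map(func, inputs)

stage2_in = run_stage(task_one, stage1_in)   # each level feeds the next
stage3_in = run_stage(task_two, stage2_in)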
I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serial), but is embarrassingly parallel. For this reason I was using Pool.map to split the work and do it in parallel. It worked great from the command line, and I got runtimes of around 5 minutes using a new many-cored chip.
Unfortunately, there is a limitation that does not allow daemonic processes to have children, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub-jobs I'd like to allow in memory at a time). Downside: first of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the worker processes are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken in another open StackOverflow question. Also, I should note that the parallelism I exploit via Pool.map is similar to linear algebra: there is no inter-process communication (each worker runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
Thanks a lot in advance.
What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.
I use multiprocess daemons based on Twisted with forking, and query Gearman for jobs normally.
Take a look at Gearman.