Celery parallel distributed task with multiprocessing

Celery parallel distributed task with multiprocessing - python

I have a CPU intensive Celery task. I would like to use all the processing power (cores) across lots of EC2 instances to get this job done faster (a celery parallel distributed task with multiprocessing - I think).
The terms, threading, multiprocessing, distributed computing, distributed parallel processing are all terms I'm trying to understand better.
Example task:
#app.task
for item in list_of_millions_of_ids:
id = item # do some long complicated equation here very CPU heavy!!!!!!!
database.objects(newid=id).save()
Using the code above (with an example if possible) how one would ago about distributed this task using Celery by allowing this one task to be split up utilising all the computing CPU power across all available machine in the cloud?

Your goals are:
Distribute your work to many machines (distributed
computing/distributed parallel processing)
Distribute the work on a given machine across all CPUs
(multiprocessing/threading)
Celery can do both of these for you fairly easily. The first thing to understand is that each celery worker is configured by default to run as many tasks as there are CPU cores available on a system:
Concurrency is the number of prefork worker process used to process
your tasks concurrently, when all of these are busy doing work new
tasks will have to wait for one of the tasks to finish before it can
be processed.
The default concurrency number is the number of CPU’s on that machine
(including cores), you can specify a custom number using -c option.
There is no recommended value, as the optimal number depends on a
number of factors, but if your tasks are mostly I/O-bound then you can
try to increase it, experimentation has shown that adding more than
twice the number of CPU’s is rarely effective, and likely to degrade
performance instead.
This means each individual task doesn't need to worry about using multiprocessing/threading to make use of multiple CPUs/cores. Instead, celery will run enough tasks concurrently to use each available CPU.
With that out of the way, the next step is to create a task that handles processing some subset of your list_of_millions_of_ids. You have a couple of options here - one is to have each task handle a single ID, so you run N tasks, where N == len(list_of_millions_of_ids). This will guarantee that work is evenly distributed amongst all your tasks since there will never be a case where one worker finishes early and is just waiting around; if it needs work, it can pull an id off the queue. You can do this (as mentioned by John Doe) using the celery group.
tasks.py:
#app.task
def process_ids(item):
id = item #long complicated equation here
database.objects(newid=id).save()
And to execute the tasks:
from celery import group
from tasks import process_id
jobs = group(process_ids(item) for item in list_of_millions_of_ids)
result = jobs.apply_async()
Another option is to break the list into smaller pieces and distribute the pieces to your workers. This approach runs the risk of wasting some cycles, because you may end up with some workers waiting around while others are still doing work. However, the celery documentation notes that this concern is often unfounded:
Some may worry that chunking your tasks results in a degradation of
parallelism, but this is rarely true for a busy cluster and in
practice since you are avoiding the overhead of messaging it may
considerably increase performance.
So, you may find that chunking the list and distributing the chunks to each task performs better, because of the reduced messaging overhead. You can probably also lighten the load on the database a bit this way, by calculating each id, storing it in a list, and then adding the whole list into the DB once you're done, rather than doing it one id at a time. The chunking approach would look something like this
tasks.py:
#app.task
def process_ids(items):
for item in items:
id = item #long complicated equation here
database.objects(newid=id).save() # Still adding one id at a time, but you don't have to.
And to start the tasks:
from tasks import process_ids
jobs = process_ids.chunks(list_of_millions_of_ids, 30) # break the list into 30 chunks. Experiment with what number works best here.
jobs.apply_async()
You can experiment a bit with what chunking size gives you the best result. You want to find a sweet spot where you're cutting down messaging overhead while also keeping the size small enough that you don't end up with workers finishing their chunk much faster than another worker, and then just waiting around with nothing to do.

In the world of distribution there is only one thing you should remember above all :
Premature optimization is the root of all evil. By D. Knuth
I know it sounds evident but before distributing double check you are using the best algorithm (if it exists...).
Having said that, optimizing distribution is a balancing act between 3 things:
Writing/Reading data from a persistent medium,
Moving data from medium A to medium B,
Processing data,
Computers are made so the closer you get to your processing unit (3) the faster and more efficient (1) and (2) will be. The order in a classic cluster will be : network hard drive, local hard drive, RAM, inside processing unit territory...
Nowadays processors are becoming sophisticated enough to be considered as an ensemble of independent hardware processing units commonly called cores, these cores process data (3) through threads (2).
Imagine your core is so fast that when you send data with one thread you are using 50% of the computer power, if the core has 2 threads you will then use 100%. Two threads per core is called hyper threading, and your OS will see 2 CPUs per hyper threaded core.
Managing threads in a processor is commonly called multi-threading.
Managing CPUs from the OS is commonly called multi-processing.
Managing concurrent tasks in a cluster is commonly called parallel programming.
Managing dependent tasks in a cluster is commonly called distributed programming.
So where is your bottleneck ?
In (1): Try to persist and stream from the upper level (the one closer to your processing unit, for example if network hard drive is slow first save in local hard drive)
In (2): This is the most common one, try to avoid communication packets not needed for the distribution or compress "on the fly" packets (for example if the HD is slow, save only a "batch computed" message and keep the intermediary results in RAM).
In (3): You are done! You are using all the processing power at your disposal.
What about Celery ?
Celery is a messaging framework for distributed programming, that will use a broker module for communication (2) and a backend module for persistence (1), this means that you will be able by changing the configuration to avoid most bottlenecks (if possible) on your network and only on your network.
First profile your code to achieve the best performance in a single computer.
Then use celery in your cluster with the default configuration and set CELERY_RESULT_PERSISTENT=True :
from celery import Celery
app = Celery('tasks',
broker='amqp://guest#localhost//',
backend='redis://localhost')
#app.task
def process_id(all_the_data_parameters_needed_to_process_in_this_computer):
#code that does stuff
return result
During execution open your favorite monitoring tools, I use the default for rabbitMQ and flower for celery and top for cpus, your results will be saved in your backend. An example of network bottleneck is tasks queue growing so much that they delay execution, you can proceed to change modules or celery configuration, if not your bottleneck is somewhere else.

Why not use group celery task for this?
http://celery.readthedocs.org/en/latest/userguide/canvas.html#groups
Basically, you should divide ids into chunks (or ranges) and give them to a bunch of tasks in group.
For smth more sophisticated, like aggregating results of particular celery tasks, I have successfully used chord task for similar purpose:
http://celery.readthedocs.org/en/latest/userguide/canvas.html#chords
Increase settings.CELERYD_CONCURRENCY to a number that is reasonable and you can afford, then those celery workers will keep executing your tasks in a group or a chord until done.
Note: due to a bug in kombu there were trouble with reusing workers for high number of tasks in the past, I don't know if it's fixed now. Maybe it is, but if not, reduce CELERYD_MAX_TASKS_PER_CHILD.
Example based on simplified and modified code I run:
#app.task
def do_matches():
match_data = ...
result = chord(single_batch_processor.s(m) for m in match_data)(summarize.s())
summarize gets results of all single_batch_processor tasks. Every task runs on any Celery worker, kombu coordinates that.
Now I get it: single_batch_processor and summarize ALSO have to be celery tasks, not regular functions - otherwise of course it will not be parallelized (I'm not even sure chord constructor will accept it if it's not a celery task).

Adding more celery workers will certainly speed up executing the task. You might have another bottleneck though: the database. Make sure it can handle the simultaneous inserts/updates.
Regarding your question: You are adding celery workers by assigning another process on your EC2 instances as celeryd. Depending on how many workers you need you might want to add even more instances.

Related

How to submit a large set of long running parallel tasks to dask?

I have a computational workload that I originally ran with concurrent.futures.ProcessPoolExecutor which I converted to use dask so that I could make use of dask's integrations with distributed computing systems for scaling beyond one machine. The workload consists of two task types:
Task A: takes string/float inputs and produces a matrix (around 2000 x 2000). Task duration is usually 60 seconds or less.
Task B: takes the matrix from task A and uses it and some other small inputs to solve an ordinary differential equation. The solution is written to disk (so no return value). Task duration can be up to fifteen minutes.
There can multiple B tasks for each A task.
Originally, my code looked like this:
a_results = client.map(calc_a, a_inputs)
all_b_inputs = [(a_result, b_input) for b_input in b_inputs for a_result in a_results]
b_results = client.map(calc_b, all_b_inputs)
dask.distributed.wait(b_results)
because that was the clean translation from the concurrent.futures code (I actually kept the code so that it could be run either with dask or concurrent.futures so I could compare). client here is a distributed.Client instance.
I have been experiencing some stability issues with this code, especially for large numbers of tasks, and I think I might not be using dask in the best way. Recently, I changed my code to use Delayed instead like this:
a_results = [dask.delayed(calc_a)(a) for a in a_inputs]
b_results = [dask.delayed(calc_b)(a, b) for a in a_inputs for b in b_inputs]
client.compute(b_results)
I did this because I thought perhaps the scheduler could work through the tasks more efficiently if it examined the entire graph before starting anything rather than beginning to schedule the A tasks before knowing about the B tasks. This change seems to help some but I still see some stability issues.
I can create separate questions for the stability problems, but I first wanted to find out if I am using dask in the best way for this use case or if I should modify how I am submitting the tasks. Just to describe the problems briefly, the worst problem to me is that over time my workers drop to 0% CPU and tasks stop completing. Other problems include things like getting KilledWorker exceptions and seeing log messages about an unresponsive loop and time outs. Usually the scheduler runs fine for at least a few hours, completing thousands of tasks before these issues show up (which makes debugging difficult since the feedback loop is so long).
Some questions I have been wondering about:
I can have thousands of tasks to run. Can I submit these all to dask to start out or do I need to submit them in batches? My thought was that the dask scheduler would be better at scheduling tasks than my batching code.
If I do need to batch things myself, can I query the scheduler to find out the maximum number of workers so I can write something that will submit batches of the right size? Or do I need to make the batch size an input to my batching code?
In the end, my results all get written to disk and nothing gets returned. With the way I am running tasks, are resources getting held onto longer than necessary?
My B tasks are long but they could be split by scheduling tasks that solve for solutions at intermediate time steps and feeding those in as the inputs to subsequent solving tasks. I think I need to do this any way because I would like to use an HPC cluster with a timed queue and I think I need to use the lifetime parameter to retire workers to keep them from running over the time limit and that works best with short-lived tasks (to avoid losing work when shut down early). Is there an optimal way to split the B task?

There are lots of questions here, but with regards to the code snippets you provided, both look correct, but the futures version will scale better in my experience. The reason for that is that by default, whenever one of the delayed tasks fails, the computation of all delayed tasks halts, while futures can proceed as long as they are not directly affected by the failure.
Another observation is that delayed values will tend to hold on to resources after completion, while for futures you can at least .release() them once they have been completed (or use fire_and_forget).
Finally, with very large task lists, it might be worth to make them a bit more resilient to restarts. One basic option is to create simple text files after successful completion of a task, and then on restart check which tasks need to be re-computed. Fancier options include prefect and joblib.memory, but if you don't need all the bells and whistles, the text file route is often fastest.

Python multiprocessing - reassigning jobs dynamically from pool - without async?

So I have a batch of 1000 tasks that I assign using parmap/python multiprocessing module to 8 cores (dual xeon machine 16 physical cores). Currently this runs using synchronized.
The issue is that usually 1 of the cores lags well behind the other cores and still has several jobs/tasks to complete after all the other cores finished their work. This may be related to core speed (older computer) but more likely due to some of the tasks being more difficult than others - so the 1 core that gets the slightly more difficult jobs gets laggy...
I'm a little confused here - but is this what asynch parallelization does? I've tried using it before, but because this step is part of a very large processing step - it wasn't clear how to create a barrier to force the program to wait until all async processes are done.
Any advice/links to similar questions/answers are appreciated.
[EDIT] To clarify, the processes are ok to run independently, they all save data to disk and do not share variables.

parmap author here
By default, both in multiprocessing and in parmap, tasks are divided in chunks and chunks are sent to each multiprocessing process (see the multiprocessing documentation). The reason behind this is that sending tasks individually to a process would introduce significant computational overhead in many situations. The overhead is reduced if several tasks are sent at once, in chunks.
The number of tasks on each chunk is controlled with chunksize in multiprocessing (and pm_chunksize in parmap). By default, chunksize is computed as "number of tasks"/(4*"pool size"), rounded up (see the multiprocessing source code). So for your case, 1000/(4*4) = 62.5 -> 63 tasks per chunk.
If, as in your case, many computationally expensive tasks fall into the same chunk, that chunk will take a long time to finish.
One "cheap and easy" way to workaround this is to pass a smaller chunksize value. Note that using the extreme chunksize=1 may introduce undesired larger cpu overhead.
A proper queuing system as suggested in other answers is a better solution on the long term, but maybe an overkill for a one-time problem.

You really need to look at creating microservices and using a queue pool. For instance, you could put a list of jobs in celery or redis, and then have the microservices pull from the queue one at a time and process the job. Once done they pull the next item and so forth. That way your load is distributed based on readiness, and not based on a preset list.
http://www.celeryproject.org/
https://www.fullstackpython.com/task-queues.html

schedule tasks based purely on idleness, instead of data communication

Using distributed to schedule lots of interdependent tasks, running on google compute engine. When I start an extra instance with workers halfway, no tasks get scheduled to it (though it registers fine with the scheduler). I presume this is because (from http://distributed.readthedocs.io/en/latest/scheduling-state.html#distributed.scheduler.decide_worker):
"If the task requires data communication, then we choose to minimize the number of bytes sent between workers. This takes precedence over worker occupancy."
Once I'm halfway running the task tree, all remaining tasks depend on the result of tasks which have already run. So, if I interpret the above quote right, nothing will ever be scheduled on the new workers, no matter how idle they are, as the dependent data is never already there but always on a 'old' worker.
However, I do make sure the amount of data to transfer is fairly minimal, usually just a small string. So in this case it would make much more sense to let idleness prevail over data communication. Would it be possible to allow this (e.g. setting a 'scheduler policy'? Or maybe even have a data-vs-idleness tradeoff coefficent which could be tuned?
Update after comment #1:
Complicating factor: every task is using the resources framework to make sure it either runs on the set of workers for cpu-bound tasks ("CPU=1") or on the set of workers for network-bound tasks ("NET=1"). This separation was made to avoid overloading up/download servers and restrict up/download tasks to a certain max, while still being able to scale the other tasks. However, according to http://distributed.readthedocs.io/en/latest/work-stealing.html, task stealing will not happen in these cases? Is there a way to allow task stealing while keeping the resource restrictions?
Update 2: I see there is an open issue for that: https://github.com/dask/distributed/issues/1389. Are there plans to implement this?

While Dask prefers to schedule work to reduce communication it also acknowledges that this isn't always best. Generally Dask runs a task on the machine where it will finish first, taking into account both communication costs and existing task backlogs on overloaded workers.
For more information on load balancing you might consider reading this documentation page:
http://distributed.readthedocs.io/en/latest/work-stealing.html

In Celery are there significant performance implications of using many queues

Are there substantial performance implications that I should keep in mind when Celery workers are pulling from multiple (or perhaps many) queues? For example, would there be a significant performance penalty if my system were designed so that workers pulled from 10 to 15 queues rather than just 1 or 2? As a follow-up, what if some of those queues are sometimes empty?

The short answer to your question on queue limits is:
Don't worry having multiple queues will not be worse or better, broker are designed to handle huge numbers of them. Off course in a lot of use cases you don't need so many, except really advanced one. Empty queues don't create any problem, they just take a tiny amount of memory on the broker.
Don't forget also that you have other things like exchanges and bindings, also there you don't have real limits but is better you understand the performance implication of each of them before using it (a TOPIC exchange will use more CPU than a direct one for example)
To give you a more complete answer let's look at the performance topic from a more generic point of view.
When looking at a distributed system based on message passing like Celery there are 2 main topics to analyze from the point of view of performance:
The workers number and concurrency factor.
As you probably already know each celery worker has a concurrency parameter that sets how many tasks can be executed at the same time, this should be set in relation with the server capacity(CPU, RAM, I/O) and off course also based on the type of tasks that the specific consumer will execute (depends on the queue that it will consume).
Off course depending on the total number of tasks you need to execute in a certain time window you will need to decide how many workers/servers you will need to have up and running.
The broker, the Single point of Failure in this architecture style.
The broker, especially RabbitMQ, is designed to manage millions of messages without any problem, however more messages it will need to store more memory will use and more are the messages to route more CPU it will use.
This machine should be well tuned too and if possible be in high availability.
Off course the main thing to avoid is the messages are consumed at a lower rate than they are produced otherwise your queue will keep growing and your RabbitMQ will explode. Here you can find some hints.
There are cases where you may also need to increase the number of tasks executed in a certain time frame but on only in response to peaks of requests. The nice thing about this architecture is that you can monitor the size of the queues and when you understand is growing to fast you could create new machines on the fly with a celery worker already configured and than turn it off when they are not needed. This is a quite cost saving and efficient approach.
One hint, remember to don't store celery tasks results in RabbitMQ.

What is the optimal way to organize infinitely looped work queue?

I have about 1000-10000 jobs which I need to run on a constant basis each minute or so. Sometimes new job comes in or other needs to be cancelled but it's rare event. Jobs are tagged and must be disturbed among workers each of them processes only jobs of specific kind.
For now I want to use cron and load whole database of jobs in some broker -- RabbitMQ or beanstalkd (haven't decided which one to use though).
But this approach seems ugly to me (using timer to simulate infinity, loading the whole database, etc) and has the disadvantage: for example if some kind of jobs are processed slower than added into the queue it may be overwhelmed and message broker will eat all ram, swap and then just halt.
Is there any other possibilities? Am I not using right patterns for a job? (May be I don't need queue or something..?)
p.s. I'm using python if this is important.

You create your initial batch of jobs and add them to the queue.
You have n-consumers of the queue each running the jobs. Adding consumers to the queue simply round-robins the distribution of jobs to each listening consumer, giving you arbitrary horizontal scalability.
Each job can, upon completion, be responsible for resubmitting itself back to the queue. This means that your job queue won't grow beyond the length that it was when you initialised it.
The master job can, if need be, spawn sub-jobs and add them to the queue.
For different types of jobs it is probably a good idea to use different queues. That way you can balance the load more effectively by having different quantities/horsepower of workers running the jobs from the different queues.
The fact that you are running Python isn't important here, it's the pattern, not the language that you need to nail first.

You can use asynchronous framework, e.g. Twisted
I don't think either it's a good idea to run script by cron daemon each minute (and you mentioned reasons), so I offer you Twisted. It doesn't give you benefit with scheduling, but you get flexibility in process management and memory sharing

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.