Rate limit a celery task without blocking other tasks

Rate limit a celery task without blocking other tasks - python

I am trying to limit the rate of one celery task. Here is how I am doing it:
from project.celery import app
app.control.rate_limit('task_a', '10/m')
It is working well. However, there is a catch. Other tasks that this worker is responsible for are being blocked as well.
Let's say, 100 of task_a have been scheduled. As it is rate-limited, it will take 10 minutes to execute all of them. During this time, task_b has been scheduled as well. It will not be executed until task_a is done.
Is it possible to not block task_b?
By the looks of it, this is just how it works. I just didn't get that impression after reading the documentation.
Other options include:
Separate worker and queue only for this task
Adding an eta to the task task_a so that all of it are scheduled to run during the night
What is the best practice in such cases?

This should be part of a task declaration to work on per-task basis. The way you are doing it via control probably why it has this side-effect on other tasks
#task(rate_limit='10/m')
def task_a():
...
After more reading
Note that this is a per worker instance rate limit, and not a global rate limit. To enforce a global rate limit (e.g., for an API with a maximum number of requests per second), you must restrict to a given queue.
You probably will have to do this in separate queue

The easiest (no coding required) way is separating the task into its own queue and running a dedicated worker just for this purpose.
There's no shame in that, it is totally fine to have many Celery queues and workers, each dedicated just for a specific type of work. As an added bonus you may get some more control over the execution, you can easily turn workers ON/OFF to pause certain processes if needed, etc.
On the other hand, having lots of specialized workers idle most of the time (waiting for a specific job to be queued) is not particularly memory-efficient.
Thus, in case you need to rate limit more tasks and expect the specific workers to be idle most of the time, you may consider increasing the efficiency and implement a Token Bucket. With that all your workers can be generic-purpose and you can scale them naturally as your overall load increases, knowing that the work distribution will not be crippled by a single task's rate limit anymore.

Related

Efficient ways of implementing waiting till a certain criterion is met in Airflow

Sensors in Airflow - are a certain type of operator that will keep running until a certain criterion is met but they consume a full worker slot. Curious if people have been able to reliably use more efficient ways of implementing this.
A few ideas on my mind
using pools to restrict the number of worker slots allotted to sensors
skipping all tasks downstream and then clear and resume via an external trigger
pause the run of the DAG and resume again via an external trigger
Other relevant links:
How to implement polling in Airflow?
How to wait for an asynchronous event in a task of a DAG in a workflow implemented using Airflow?
Airflow unpause dag programmatically?

The new version of Airflow,namely 1.10.2 provides new option for sensors, which I think addresses your concerns:
mode (str) – How the sensor operates. Options are: { poke | reschedule }, default is poke. When set to poke the sensor is taking up a worker slot for its whole execution time and sleeps between pokes. Use this mode if the expected runtime of the sensor is short or if a short poke interval is requried. When set to reschedule the sensor task frees the worker slot when the criteria is not yet met and it’s rescheduled at a later time. Use this mode if the expected time until the criteria is met is. The poke inteval should be more than one minute to prevent too much load on the scheduler.
Here is the link to doc.

I think you need to step back and question why it's a problem that a sensor consumes a full worker slot.
Airflow is a scheduler, not a resource allocator. Using worker concurrency, pools and queues, you can limit resource usage, but only very crudely. In the end, Airflow naively assumes a sensor will use the same resources on worker nodes as a BashOperator that spawns a multi-process genome sequencing utility. But sensors are cheap and sleep 99.9% of the time, so that is a bad assumption.
So, if you want to solve the problem of sensors consuming all your worker slots, just bump your worker concurrency. You should be able to have hundreds of sensors running concurrently on a single worker.
If you then get problems with very uneven workload distribution on your cluster nodes and nodes with dangerously high system load, you can limit the number of expensive jobs using either:
pools that expensive jobs must consume (will start the job and wait until a pool resource is available). This creates a cluster-wide limit.
special workers on each node that only take the expensive jobs (using airflow worker --queues my_expensive_queue) and have a low concurrency setting. This creates a per-node limit.
If you have more complex requirements than that, then consider shipping all non-trivial compute jobs to a dedicated resource allocator, e.g. Apache Mesos, where you can specify the exact CPU, memory and other requirements to make sure your cluster load is distributed more efficiently on each node than Airflow will ever be able to do.

Cross DAG dependencies are feasible per this doc
Criteria can be specified in a separte DAG as a separate task so that when that criteria is met for a given date, the child task is allowed to run.

schedule tasks based purely on idleness, instead of data communication

Using distributed to schedule lots of interdependent tasks, running on google compute engine. When I start an extra instance with workers halfway, no tasks get scheduled to it (though it registers fine with the scheduler). I presume this is because (from http://distributed.readthedocs.io/en/latest/scheduling-state.html#distributed.scheduler.decide_worker):
"If the task requires data communication, then we choose to minimize the number of bytes sent between workers. This takes precedence over worker occupancy."
Once I'm halfway running the task tree, all remaining tasks depend on the result of tasks which have already run. So, if I interpret the above quote right, nothing will ever be scheduled on the new workers, no matter how idle they are, as the dependent data is never already there but always on a 'old' worker.
However, I do make sure the amount of data to transfer is fairly minimal, usually just a small string. So in this case it would make much more sense to let idleness prevail over data communication. Would it be possible to allow this (e.g. setting a 'scheduler policy'? Or maybe even have a data-vs-idleness tradeoff coefficent which could be tuned?
Update after comment #1:
Complicating factor: every task is using the resources framework to make sure it either runs on the set of workers for cpu-bound tasks ("CPU=1") or on the set of workers for network-bound tasks ("NET=1"). This separation was made to avoid overloading up/download servers and restrict up/download tasks to a certain max, while still being able to scale the other tasks. However, according to http://distributed.readthedocs.io/en/latest/work-stealing.html, task stealing will not happen in these cases? Is there a way to allow task stealing while keeping the resource restrictions?
Update 2: I see there is an open issue for that: https://github.com/dask/distributed/issues/1389. Are there plans to implement this?

While Dask prefers to schedule work to reduce communication it also acknowledges that this isn't always best. Generally Dask runs a task on the machine where it will finish first, taking into account both communication costs and existing task backlogs on overloaded workers.
For more information on load balancing you might consider reading this documentation page:
http://distributed.readthedocs.io/en/latest/work-stealing.html

Celery parallel distributed task with multiprocessing

I have a CPU intensive Celery task. I would like to use all the processing power (cores) across lots of EC2 instances to get this job done faster (a celery parallel distributed task with multiprocessing - I think).
The terms, threading, multiprocessing, distributed computing, distributed parallel processing are all terms I'm trying to understand better.
Example task:
#app.task
for item in list_of_millions_of_ids:
id = item # do some long complicated equation here very CPU heavy!!!!!!!
database.objects(newid=id).save()
Using the code above (with an example if possible) how one would ago about distributed this task using Celery by allowing this one task to be split up utilising all the computing CPU power across all available machine in the cloud?

Your goals are:
Distribute your work to many machines (distributed
computing/distributed parallel processing)
Distribute the work on a given machine across all CPUs
(multiprocessing/threading)
Celery can do both of these for you fairly easily. The first thing to understand is that each celery worker is configured by default to run as many tasks as there are CPU cores available on a system:
Concurrency is the number of prefork worker process used to process
your tasks concurrently, when all of these are busy doing work new
tasks will have to wait for one of the tasks to finish before it can
be processed.
The default concurrency number is the number of CPU’s on that machine
(including cores), you can specify a custom number using -c option.
There is no recommended value, as the optimal number depends on a
number of factors, but if your tasks are mostly I/O-bound then you can
try to increase it, experimentation has shown that adding more than
twice the number of CPU’s is rarely effective, and likely to degrade
performance instead.
This means each individual task doesn't need to worry about using multiprocessing/threading to make use of multiple CPUs/cores. Instead, celery will run enough tasks concurrently to use each available CPU.
With that out of the way, the next step is to create a task that handles processing some subset of your list_of_millions_of_ids. You have a couple of options here - one is to have each task handle a single ID, so you run N tasks, where N == len(list_of_millions_of_ids). This will guarantee that work is evenly distributed amongst all your tasks since there will never be a case where one worker finishes early and is just waiting around; if it needs work, it can pull an id off the queue. You can do this (as mentioned by John Doe) using the celery group.
tasks.py:
#app.task
def process_ids(item):
id = item #long complicated equation here
database.objects(newid=id).save()
And to execute the tasks:
from celery import group
from tasks import process_id
jobs = group(process_ids(item) for item in list_of_millions_of_ids)
result = jobs.apply_async()
Another option is to break the list into smaller pieces and distribute the pieces to your workers. This approach runs the risk of wasting some cycles, because you may end up with some workers waiting around while others are still doing work. However, the celery documentation notes that this concern is often unfounded:
Some may worry that chunking your tasks results in a degradation of
parallelism, but this is rarely true for a busy cluster and in
practice since you are avoiding the overhead of messaging it may
considerably increase performance.
So, you may find that chunking the list and distributing the chunks to each task performs better, because of the reduced messaging overhead. You can probably also lighten the load on the database a bit this way, by calculating each id, storing it in a list, and then adding the whole list into the DB once you're done, rather than doing it one id at a time. The chunking approach would look something like this
tasks.py:
#app.task
def process_ids(items):
for item in items:
id = item #long complicated equation here
database.objects(newid=id).save() # Still adding one id at a time, but you don't have to.
And to start the tasks:
from tasks import process_ids
jobs = process_ids.chunks(list_of_millions_of_ids, 30) # break the list into 30 chunks. Experiment with what number works best here.
jobs.apply_async()
You can experiment a bit with what chunking size gives you the best result. You want to find a sweet spot where you're cutting down messaging overhead while also keeping the size small enough that you don't end up with workers finishing their chunk much faster than another worker, and then just waiting around with nothing to do.

In the world of distribution there is only one thing you should remember above all :
Premature optimization is the root of all evil. By D. Knuth
I know it sounds evident but before distributing double check you are using the best algorithm (if it exists...).
Having said that, optimizing distribution is a balancing act between 3 things:
Writing/Reading data from a persistent medium,
Moving data from medium A to medium B,
Processing data,
Computers are made so the closer you get to your processing unit (3) the faster and more efficient (1) and (2) will be. The order in a classic cluster will be : network hard drive, local hard drive, RAM, inside processing unit territory...
Nowadays processors are becoming sophisticated enough to be considered as an ensemble of independent hardware processing units commonly called cores, these cores process data (3) through threads (2).
Imagine your core is so fast that when you send data with one thread you are using 50% of the computer power, if the core has 2 threads you will then use 100%. Two threads per core is called hyper threading, and your OS will see 2 CPUs per hyper threaded core.
Managing threads in a processor is commonly called multi-threading.
Managing CPUs from the OS is commonly called multi-processing.
Managing concurrent tasks in a cluster is commonly called parallel programming.
Managing dependent tasks in a cluster is commonly called distributed programming.
So where is your bottleneck ?
In (1): Try to persist and stream from the upper level (the one closer to your processing unit, for example if network hard drive is slow first save in local hard drive)
In (2): This is the most common one, try to avoid communication packets not needed for the distribution or compress "on the fly" packets (for example if the HD is slow, save only a "batch computed" message and keep the intermediary results in RAM).
In (3): You are done! You are using all the processing power at your disposal.
What about Celery ?
Celery is a messaging framework for distributed programming, that will use a broker module for communication (2) and a backend module for persistence (1), this means that you will be able by changing the configuration to avoid most bottlenecks (if possible) on your network and only on your network.
First profile your code to achieve the best performance in a single computer.
Then use celery in your cluster with the default configuration and set CELERY_RESULT_PERSISTENT=True :
from celery import Celery
app = Celery('tasks',
broker='amqp://guest#localhost//',
backend='redis://localhost')
#app.task
def process_id(all_the_data_parameters_needed_to_process_in_this_computer):
#code that does stuff
return result
During execution open your favorite monitoring tools, I use the default for rabbitMQ and flower for celery and top for cpus, your results will be saved in your backend. An example of network bottleneck is tasks queue growing so much that they delay execution, you can proceed to change modules or celery configuration, if not your bottleneck is somewhere else.

Why not use group celery task for this?
http://celery.readthedocs.org/en/latest/userguide/canvas.html#groups
Basically, you should divide ids into chunks (or ranges) and give them to a bunch of tasks in group.
For smth more sophisticated, like aggregating results of particular celery tasks, I have successfully used chord task for similar purpose:
http://celery.readthedocs.org/en/latest/userguide/canvas.html#chords
Increase settings.CELERYD_CONCURRENCY to a number that is reasonable and you can afford, then those celery workers will keep executing your tasks in a group or a chord until done.
Note: due to a bug in kombu there were trouble with reusing workers for high number of tasks in the past, I don't know if it's fixed now. Maybe it is, but if not, reduce CELERYD_MAX_TASKS_PER_CHILD.
Example based on simplified and modified code I run:
#app.task
def do_matches():
match_data = ...
result = chord(single_batch_processor.s(m) for m in match_data)(summarize.s())
summarize gets results of all single_batch_processor tasks. Every task runs on any Celery worker, kombu coordinates that.
Now I get it: single_batch_processor and summarize ALSO have to be celery tasks, not regular functions - otherwise of course it will not be parallelized (I'm not even sure chord constructor will accept it if it's not a celery task).

Adding more celery workers will certainly speed up executing the task. You might have another bottleneck though: the database. Make sure it can handle the simultaneous inserts/updates.
Regarding your question: You are adding celery workers by assigning another process on your EC2 instances as celeryd. Depending on how many workers you need you might want to add even more instances.

Celery: long dedicated monolithic task vs short multiple tasks

In my solution I use distributed tasks to monitor hardware instances for a period of time (say, 10 minutes). I have to do some stuff when:
I start this monitoring session
I finish this monitoring session
(Potentially) during the monitoring session
Is it safe to have a single task run for the whole session (10 minutes) and perform all these, or should I split these actions into their own tasks?
The advantages of a single task, as I see it, are that it would be easier to manage and enforce timing constraints. But:
Is it a good idea to run a large pool of (mostly) asleep workers? For example, if I know that at most I will have 200 sessions open, I have a pool of 500 workers to ensure there are always available "session" seats?

There is no one-size-fits-all answer to this
Dividing a big task A into many small parts (A¹, A², A³, …) will increase potential concurrency.
So if you have 1 worker instance with 10 worker threads/processes,
A can now run in parallel using the 10 threads instead of sequentially
on one thread.
The number of parts is called the tasks granularity (fine or coarsely grained).
If the task is too finely grained the overhead of messaging will drag performance down.
Each part must have enough computation/IO to offset the overhead of sending the task
message to the broker, possibly writing it to disk if there are no workers to take it, the worker to receive the message, and so on (do note that messaging overhead can be tweaked, e.g. you can have a queue that is transient (not persisting messages to disk), and send tasks that are not so important there).
A busy cluster may make all of this moot
Maximum parallelism may already have been achieved if you have a busy cluster (e.g. 3 worker instances with 10 threads/processes each, all running tasks).
Then you many not get much benefit by dividing the task, but tasks doing I/O have a greater chance of improvement than CPU-bound tasks (split by I/O operations).
Long running tasks are fine
The worker is not allergic to long running tasks, be that 10 minutes or an hour.
But it's not ideal either because any long running task will block that slot from
finishing any waiting tasks. To mitigate this people use routing, so that you have a dedicated queue, with dedicated workers for tasks that must run ASAP.
-

What is the optimal way to organize infinitely looped work queue?

I have about 1000-10000 jobs which I need to run on a constant basis each minute or so. Sometimes new job comes in or other needs to be cancelled but it's rare event. Jobs are tagged and must be disturbed among workers each of them processes only jobs of specific kind.
For now I want to use cron and load whole database of jobs in some broker -- RabbitMQ or beanstalkd (haven't decided which one to use though).
But this approach seems ugly to me (using timer to simulate infinity, loading the whole database, etc) and has the disadvantage: for example if some kind of jobs are processed slower than added into the queue it may be overwhelmed and message broker will eat all ram, swap and then just halt.
Is there any other possibilities? Am I not using right patterns for a job? (May be I don't need queue or something..?)
p.s. I'm using python if this is important.

You create your initial batch of jobs and add them to the queue.
You have n-consumers of the queue each running the jobs. Adding consumers to the queue simply round-robins the distribution of jobs to each listening consumer, giving you arbitrary horizontal scalability.
Each job can, upon completion, be responsible for resubmitting itself back to the queue. This means that your job queue won't grow beyond the length that it was when you initialised it.
The master job can, if need be, spawn sub-jobs and add them to the queue.
For different types of jobs it is probably a good idea to use different queues. That way you can balance the load more effectively by having different quantities/horsepower of workers running the jobs from the different queues.
The fact that you are running Python isn't important here, it's the pattern, not the language that you need to nail first.

You can use asynchronous framework, e.g. Twisted
I don't think either it's a good idea to run script by cron daemon each minute (and you mentioned reasons), so I offer you Twisted. It doesn't give you benefit with scheduling, but you get flexibility in process management and memory sharing

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.