I have a setup with rabbitmq and celery, with workers running on 4 machines with 4 instances each. I have two task functions defined, which basically call the same backend function, but one of them named process_transaction with no rate_limit defined, and another called slow_process_transaction, with rate_limit="6/m". The tasks go to different queues on rabbitmq, slow and normal.
The strange thing is the rate_limit being enforced for both tasks. If I try to change the rate_limit using celery.control.rate_limit, doing it with the process_transaction doesn't change the effective rate, and using the slow_process_transaction name changes the effective rate for both.
Any ideas on what is wrong?
By reading the bucket source code I figured out celery implements rate limiting by sleeping the time delta after finishing a task, so if you mix tasks with different rate limits in the same workers, they affect each other.
Separating the workers solved my problem, but it's not the optimal solution.
You can separate the workers by using node names and named parameters on your celeryd call. For instance, you have nodes 'fast' and 'slow', and you want them to consume separate queues with concurrency 5 and 1 respectively:
celeryd <other opts> -Q:fast fast_queue -c:fast 5 -Q:slow slow_queue -c:slow 1
Related
I am trying to limit the rate of one celery task. Here is how I am doing it:
from project.celery import app
app.control.rate_limit('task_a', '10/m')
It is working well. However, there is a catch. Other tasks that this worker is responsible for are being blocked as well.
Let's say, 100 of task_a have been scheduled. As it is rate-limited, it will take 10 minutes to execute all of them. During this time, task_b has been scheduled as well. It will not be executed until task_a is done.
Is it possible to not block task_b?
By the looks of it, this is just how it works. I just didn't get that impression after reading the documentation.
Other options include:
Separate worker and queue only for this task
Adding an eta to the task task_a so that all of it are scheduled to run during the night
What is the best practice in such cases?
This should be part of a task declaration to work on per-task basis. The way you are doing it via control probably why it has this side-effect on other tasks
#task(rate_limit='10/m')
def task_a():
...
After more reading
Note that this is a per worker instance rate limit, and not a global rate limit. To enforce a global rate limit (e.g., for an API with a maximum number of requests per second), you must restrict to a given queue.
You probably will have to do this in separate queue
The easiest (no coding required) way is separating the task into its own queue and running a dedicated worker just for this purpose.
There's no shame in that, it is totally fine to have many Celery queues and workers, each dedicated just for a specific type of work. As an added bonus you may get some more control over the execution, you can easily turn workers ON/OFF to pause certain processes if needed, etc.
On the other hand, having lots of specialized workers idle most of the time (waiting for a specific job to be queued) is not particularly memory-efficient.
Thus, in case you need to rate limit more tasks and expect the specific workers to be idle most of the time, you may consider increasing the efficiency and implement a Token Bucket. With that all your workers can be generic-purpose and you can scale them naturally as your overall load increases, knowing that the work distribution will not be crippled by a single task's rate limit anymore.
I have tasks that do a get request to an API.
I have around 70 000 requests that I need to do, and I want to spread them out in 24 hours. So not all 70k requests are run at for example 10AM.
How would I do that in celery django? I have been searching for hours but cant find a good simple solution.
The database has a list of games that needs to be refreshed. Currently I have a cron that creates tasks every hour. But is it better to create a task for every game and make it repeat every hour?
The typical approach is to send them whenever you need some work done, no matter how many there are (even hundreds of thousands). The execution however is controlled by how many workers (and worker processes) you have subscribed to a dedicated queue. The key here is the dedicated queue - that is a common way of not allowing all workers start executing the newly created tasks. This goes beyond the basic Celery usage. You need to use celery multi for this use-case, or create two or more separate Celery workers manually with different queues.
If you do not want to over-complicate things you can use your current setup, but make these tasks with lowest priority, so if any new, more important, task gets created, it will be executed first. Problem with this approach is that only Redis and RabbitMQ backends support priorities as far as I know.
I have question about CELERYD_CONCURRENCY and CELERYD_PREFETCH_MULTIPLIER
Because my english is not well to understand the official site description,
I want to make sure it
I set CELERYD_CONCURRENCY=40
I think it will use 40 workers to do things
But I usually see INFO/MainProcess ,seldom see INFO/Worker-n
Is it because the task is fast,so it didn't have to assign to worker??
Here is a task architecture :
I have a period_task is celery period_task , and mail_it is normal celery task
#shared_task
def period_task():
do_something()
....
for mail in mail_list:
mail_it.delay(mail)
And the second question is CELERYD_PREFETCH_MULTIPLIER ,the default value is 4
Is it means that each worker can get 4 tasks from queue one time ??? So I have 40 worker,I can get 40*4 task????
My understanding:
CELERYD_CONCURRENCY:
This is the number of THREADS/GREENTHREADS a given worker will have. Celery calls these "processes". This is the number of tasks a single worker can execute in parallel. I believe Celery creates this numbe PLUS ONE internally, and that the additional 1 is for actually managing/assigning to the others (in your case, 40 of them!). In my experience, you likely don't need/want 40 (closer to 1 or 2 per CPU), but your mileage may vary.
CELERYD_PREFETCH_MULTIPLIER:
Prefetch is how many tasks are reserved per "process" according to the docs. It's a bit like a mini-queue just for that specific thread. This would indeed mean that your ONE started worker would potentially 'reserve' 40 * 4 tasks to do. Keep in mind that these reserved tasks cannot be "stolen" or sent to another worker or thread, so if they are long running you may wish to disable this feature to allow faster stations to pickup the slack of slower ones.
If this isn't clear in your current setup, I might suggest adding a sleep() to your task to be able to observe it.
I have a CPU intensive Celery task. I would like to use all the processing power (cores) across lots of EC2 instances to get this job done faster (a celery parallel distributed task with multiprocessing - I think).
The terms, threading, multiprocessing, distributed computing, distributed parallel processing are all terms I'm trying to understand better.
Example task:
#app.task
for item in list_of_millions_of_ids:
id = item # do some long complicated equation here very CPU heavy!!!!!!!
database.objects(newid=id).save()
Using the code above (with an example if possible) how one would ago about distributed this task using Celery by allowing this one task to be split up utilising all the computing CPU power across all available machine in the cloud?
Your goals are:
Distribute your work to many machines (distributed
computing/distributed parallel processing)
Distribute the work on a given machine across all CPUs
(multiprocessing/threading)
Celery can do both of these for you fairly easily. The first thing to understand is that each celery worker is configured by default to run as many tasks as there are CPU cores available on a system:
Concurrency is the number of prefork worker process used to process
your tasks concurrently, when all of these are busy doing work new
tasks will have to wait for one of the tasks to finish before it can
be processed.
The default concurrency number is the number of CPU’s on that machine
(including cores), you can specify a custom number using -c option.
There is no recommended value, as the optimal number depends on a
number of factors, but if your tasks are mostly I/O-bound then you can
try to increase it, experimentation has shown that adding more than
twice the number of CPU’s is rarely effective, and likely to degrade
performance instead.
This means each individual task doesn't need to worry about using multiprocessing/threading to make use of multiple CPUs/cores. Instead, celery will run enough tasks concurrently to use each available CPU.
With that out of the way, the next step is to create a task that handles processing some subset of your list_of_millions_of_ids. You have a couple of options here - one is to have each task handle a single ID, so you run N tasks, where N == len(list_of_millions_of_ids). This will guarantee that work is evenly distributed amongst all your tasks since there will never be a case where one worker finishes early and is just waiting around; if it needs work, it can pull an id off the queue. You can do this (as mentioned by John Doe) using the celery group.
tasks.py:
#app.task
def process_ids(item):
id = item #long complicated equation here
database.objects(newid=id).save()
And to execute the tasks:
from celery import group
from tasks import process_id
jobs = group(process_ids(item) for item in list_of_millions_of_ids)
result = jobs.apply_async()
Another option is to break the list into smaller pieces and distribute the pieces to your workers. This approach runs the risk of wasting some cycles, because you may end up with some workers waiting around while others are still doing work. However, the celery documentation notes that this concern is often unfounded:
Some may worry that chunking your tasks results in a degradation of
parallelism, but this is rarely true for a busy cluster and in
practice since you are avoiding the overhead of messaging it may
considerably increase performance.
So, you may find that chunking the list and distributing the chunks to each task performs better, because of the reduced messaging overhead. You can probably also lighten the load on the database a bit this way, by calculating each id, storing it in a list, and then adding the whole list into the DB once you're done, rather than doing it one id at a time. The chunking approach would look something like this
tasks.py:
#app.task
def process_ids(items):
for item in items:
id = item #long complicated equation here
database.objects(newid=id).save() # Still adding one id at a time, but you don't have to.
And to start the tasks:
from tasks import process_ids
jobs = process_ids.chunks(list_of_millions_of_ids, 30) # break the list into 30 chunks. Experiment with what number works best here.
jobs.apply_async()
You can experiment a bit with what chunking size gives you the best result. You want to find a sweet spot where you're cutting down messaging overhead while also keeping the size small enough that you don't end up with workers finishing their chunk much faster than another worker, and then just waiting around with nothing to do.
In the world of distribution there is only one thing you should remember above all :
Premature optimization is the root of all evil. By D. Knuth
I know it sounds evident but before distributing double check you are using the best algorithm (if it exists...).
Having said that, optimizing distribution is a balancing act between 3 things:
Writing/Reading data from a persistent medium,
Moving data from medium A to medium B,
Processing data,
Computers are made so the closer you get to your processing unit (3) the faster and more efficient (1) and (2) will be. The order in a classic cluster will be : network hard drive, local hard drive, RAM, inside processing unit territory...
Nowadays processors are becoming sophisticated enough to be considered as an ensemble of independent hardware processing units commonly called cores, these cores process data (3) through threads (2).
Imagine your core is so fast that when you send data with one thread you are using 50% of the computer power, if the core has 2 threads you will then use 100%. Two threads per core is called hyper threading, and your OS will see 2 CPUs per hyper threaded core.
Managing threads in a processor is commonly called multi-threading.
Managing CPUs from the OS is commonly called multi-processing.
Managing concurrent tasks in a cluster is commonly called parallel programming.
Managing dependent tasks in a cluster is commonly called distributed programming.
So where is your bottleneck ?
In (1): Try to persist and stream from the upper level (the one closer to your processing unit, for example if network hard drive is slow first save in local hard drive)
In (2): This is the most common one, try to avoid communication packets not needed for the distribution or compress "on the fly" packets (for example if the HD is slow, save only a "batch computed" message and keep the intermediary results in RAM).
In (3): You are done! You are using all the processing power at your disposal.
What about Celery ?
Celery is a messaging framework for distributed programming, that will use a broker module for communication (2) and a backend module for persistence (1), this means that you will be able by changing the configuration to avoid most bottlenecks (if possible) on your network and only on your network.
First profile your code to achieve the best performance in a single computer.
Then use celery in your cluster with the default configuration and set CELERY_RESULT_PERSISTENT=True :
from celery import Celery
app = Celery('tasks',
broker='amqp://guest#localhost//',
backend='redis://localhost')
#app.task
def process_id(all_the_data_parameters_needed_to_process_in_this_computer):
#code that does stuff
return result
During execution open your favorite monitoring tools, I use the default for rabbitMQ and flower for celery and top for cpus, your results will be saved in your backend. An example of network bottleneck is tasks queue growing so much that they delay execution, you can proceed to change modules or celery configuration, if not your bottleneck is somewhere else.
Why not use group celery task for this?
http://celery.readthedocs.org/en/latest/userguide/canvas.html#groups
Basically, you should divide ids into chunks (or ranges) and give them to a bunch of tasks in group.
For smth more sophisticated, like aggregating results of particular celery tasks, I have successfully used chord task for similar purpose:
http://celery.readthedocs.org/en/latest/userguide/canvas.html#chords
Increase settings.CELERYD_CONCURRENCY to a number that is reasonable and you can afford, then those celery workers will keep executing your tasks in a group or a chord until done.
Note: due to a bug in kombu there were trouble with reusing workers for high number of tasks in the past, I don't know if it's fixed now. Maybe it is, but if not, reduce CELERYD_MAX_TASKS_PER_CHILD.
Example based on simplified and modified code I run:
#app.task
def do_matches():
match_data = ...
result = chord(single_batch_processor.s(m) for m in match_data)(summarize.s())
summarize gets results of all single_batch_processor tasks. Every task runs on any Celery worker, kombu coordinates that.
Now I get it: single_batch_processor and summarize ALSO have to be celery tasks, not regular functions - otherwise of course it will not be parallelized (I'm not even sure chord constructor will accept it if it's not a celery task).
Adding more celery workers will certainly speed up executing the task. You might have another bottleneck though: the database. Make sure it can handle the simultaneous inserts/updates.
Regarding your question: You are adding celery workers by assigning another process on your EC2 instances as celeryd. Depending on how many workers you need you might want to add even more instances.
I am working on a project, to be deployed on Heroku in Django, which has around 12 update functions. They take around 15 minutes to run each. Let's call them update1(), update2()...update10().
I am deploying with one worker dyno on Heroku, and I would like to run up to n or more of these at once (They are not really computationally intensive, they are all HTML parsers, but the data is time-sensitive, so I would like them to be called as often as possible).
I've read a lot of Celery and APScheduler documentation, but I'm not really sure which is the best/easiest for me. Do scheduled tasks run concurrently if the times overlap with one another (ie. if I run one every 2 minutes, and another every 3 minutes, or do they wait until each one finishes?)
Any way I can queue these functions, so at least a few of them are running at once? What is the suggested number of simultaneous calls for this use-case?
Based on you use case description you do not need a Scheduler, so APScheduler will not match your requirements well.
Do you have a web dyno besides your worker dyno? The usual design pattern for this type of processing is to set up a control thread or control process (your web dyno) that accepts requests. These requests are then placed on a request queue.
This queue is read by one or more worker threads or worker processes (you worker dyno). I have not worked with Celery, but it looks like a match with your requirements. How many worker threads or worker dyno's you will need is difficult to determine based on your description. You will need to specify also how many requests for updates you will need to process per second. Also, you will need to specify if the request is CPU bound or IO bound.