How to limit the maximum number of running Celery tasks by name - python

How do you limit the number of instances of a specific Celery task that can be ran simultaneously?
I have a task that processes large files. I'm running into a problem where a user may launch several tasks, causing the server to run out of CPU and memory as it tries to process too many files at once. I want to ensure that only N instances of this one type of task are ran at any given time, and that other tasks will sit queued in the scheduler until the others complete.
I see there's a rate_limit option in the task decorator, but I don't think this does what I want. If I'm understanding the docs correctly, this will just limit how quickly the tasks are launched, but it won't restrict the overall number of tasks running, so this will make my server will crash more slowly...but it will still crash nonetheless.

You have to setup extra queue and set desired concurrency level for it. From Routing Tasks:
# Old config style
CELERY_ROUTES = {
'app.tasks.limited_task': {'queue': 'limited_queue'}
}
or
from kombu import Exchange, Queue
celery.conf.task_queues = (
Queue('default', default_exchange, routing_key='default'),
Queue('limited_queue', default_exchange, routing_key='limited_queue')
)
And start extra worker, serving only limited_queue:
$ celery -A celery_app worker -Q limited_queue --loglevel=info -c 1 -n limited_queue
Then you can check everything running smoothly using Flower or inspect command:
$ celery -A celery_app worker inspect --help

What you can do is to push these tasks to a specific queue and have X number of workers processing them. Having two workers on a queue with 100 items will ensure that there will only be two tasks processed at the same time.

I am not sure you can do that in Celery, what you can do is check how many tasks of that name are currently running when a request arrives and if it exceeds the maximum either return an error or add a mechanism that periodically checks if there are open slots for the tasks and runs it (if you add such a mechanism, you don't need to double check, just at each request add it to it's queue.
In order to check running tasks, you can use the inspect command.
In short:
app = Celery(...)
i = app.control.inspect()
i.active()

Related

Serial processing of specific tasks using Celery with concurrency

I have a python/celery setup: I have a queue named "task_queue" and multiple python scripts that feed it data from different sensors. There is a celery worker that reads from that queue and sends an alarm to user if the sensor value changed from high to low. The worker has multiple threads (I have autoscaling parameter enabled) and everything works fine until one sensor decides to send multiple messages at once. That's when I get the race condition and may send multiple alarms to user, since before a thread stores the info that it had already sent an alarm, few other threads also send it.
I have n sensors (n can be more than 10000) and messages from any sensor should be processed sequentially. So in theory I could have n threads, but that would be an overkill. I'm looking for a simplest way to equally distribute the messages across x threads (usually 10 or 20), so I wouldn't have to (re)write routing function and define new queues each time I want to increase x (or decrease).
So is it possible to somehow mark the tasks that originate from same sensor to be executed in serial manner (when calling the delay or apply_async)? Or is there a different queue/worker architecture I should be using to achieve that?
From what I understand, you have some tasks that can run all at the same time and a specific task that can not do this (this task needs to be executed 1 at a time).
There is no way (for now) to set the concurrency of a specific task queue so I think the best approach in your situation would be handling the problem with multiple workers.
Lets say you have the following queues:
queue_1 Here we send tasks that can run all at the same time
queue_2 Here we send tasks that can run 1 at a time.
You could start celery with the following commands (If you want them in the same machine).
celery -A proj worker --loglevel=INFO --concurrency=10 -n worker1#%h -Q queue_1
celery -A proj worker --loglevel=INFO --concurrency=1 -n worker2#%h -Q queue_2
This will make worker1 which has concurrency 10 handle all tasks that can be ran at the same time and worker2 handles only the tasks that need to be 1 at a time.
Here is some documentation reference:
https://docs.celeryproject.org/en/stable/userguide/workers.html
NOTE: Here you will need to specify the task in which queue runs. This can be done when calling with apply_async, directly from the decorator or some other ways.

How to rate limit Celery tasks by task name?

I'm using Celery to process asynchronous tasks from a Django app. Most tasks are short and run in a few seconds, but I have one task that can take a few hours.
Due to processing restrictions on my server, Celery is configured to only run 2 tasks at once. That means if someone launches two of these long-running tasks, it effectively blocks all other Celery processing site wide for several hours, which is very bad.
Is there any way to configure Celery so it only processes one type of task no more than one at a time? Something like:
#task(max_running_instances=1)
def my_really_long_task():
for i in range(1000000000):
time.sleep(6000)
Note, I don't want to cancel all other launches of my_really_long_task. I just don't want them to start right away, and only begin once all other tasks of the same name finish.
Since this doesn't seem to be supported by Celery, my current hacky solution is to query other tasks within the task, and if we find other running instances, then reschedule ourselves to run later, e.g.
from celery.task.control import inspect
def get_all_active_celery_task_names(ignore_id=None):
"""
Returns Celery task names for all running tasks.
"""
i = inspect()
task_names = defaultdict(int) # {name: count}
if i:
active = i.active()
if active is not None:
for worker_name, tasks in i.active().iteritems():
for task in tasks:
if ignore_id and task['id'] == ignore_id:
continue
task_names[task['name']] += 1
return task_names
#task
def my_really_long_task():
all_names = get_all_active_celery_task_names()
if 'my_really_long_task' in all_names:
my_really_long_task.retry(max_retries=100, countdown=random.randint(10, 300))
return
for i in range(1000000000):
time.sleep(6000)
Is there a better way to do this?
I'm aware of other hacky solutions like this, but setting up a separate memcache server to track task uniqueness is even less reliable, and more complicated than the method I use above.
An alternate solution is to queue my_really_long_task into a seperate queue.
my_really_long_task.apply_async(*args, queue='foo')
Then start a worker with a concurrency of 1 to consume these tasks so that only 1 task gets executed at a time.
celery -A foo worker -l info -Q foo

Stop celery workers to consume from all queues

Cheers,
I have a celery setup running in a production environment (on Linux) where I need to consume two different task types from two dedicated queues (one for each). The problem that arises is, that all workers are always bound to both queues, even when I specify them to only consume from one of them.
TL;DR
Celery running with 2 queues
Messages are published in correct queue as designed
Workers keep consuming both queues
Leads to deadlock
General Information
Think of my two different task types as a hierarchical setup:
A task is a regular celery task that may take quite some time, because it dynamically dispatches other celery tasks and may be required to chain through their respective results
A node is a dynamically dispatched sub-task, which also is a regular celery task but itself can be considered an atomic unit.
My task thus can be a more complex setup of nodes where the results of one or more nodes serves as input for one or more subsequent nodes, and so on. Since my tasks can take longer and will only finish when all their nodes have been deployed, it is essential that they are handled by dedicated workers to keep a sufficient number of workers free to consume the nodes. Otherwise, this could lead to the system being stuck, when a lot of tasks are dispatched, each consumed by another worker, and their respective nodes are only queued but will never be consumed, because all workers are blocked.
If this is a bad design in general, please make any propositions on how I can improve it. I did not yet manage to build one of these processes using celery's built-in canvas primitives. Help me, if you can?!
Configuration/Setup
I run celery with amqp and have set up the following queues and routes in the celery configuration:
CELERY_QUERUES = (
Queue('prod_nodes', Exchange('prod'), routing_key='prod.node'),
Queue('prod_tasks', Exchange('prod'), routing_key='prod.task')
)
CELERY_ROUTES = (
'deploy_node': {'queue': 'prod_nodes', 'routing_key': 'prod.node'},
'deploy_task': {'queue': 'prod_tasks', 'routing_key': 'prod.task'}
)
When I launch my workers, I issue a call similar to the following:
celery multi start w_task_01 w_node_01 w_node_02 -A my.deployment.system \
-E -l INFO -P gevent -Q:1 prod_tasks -Q:2-3 prod_nodes -c 4 --autoreload \
--logfile=/my/path/to/log/%N.log --pidfile=/my/path/to/pid/%N.pid
The Problem
My queue and routing setup seems to work properly, as I can see messages being correctly queued in the RabbitMQ Management web UI.
However, all workers always consume celery tasks from both queues. I can see this when I start and open up the flower web UI and inspect one of the deployed tasks, where e.g. w_node_01 starts consuming messages from the prod_tasks queue, even though it shouldn't.
The RabbitMQ Management web UI furthermore tells me, that all started workers are set up as consumers for both queues.
Thus, I ask you...
... what did I do wrong?
Where is the issue with my setup or worker start call; How can I circumvent the problem of workers always consuming from both queues; Do I really have to make additional settings during runtime (what I certainly do not want)?
Thanks for your time and answers!
You can create 2 separate workers for each queue and each one's define what queue it should get tasks from using the -Q command line argument.
If you want to keep the number processes the same, by default a process is opened for each core for each worker you can use the --concurrency flag (See Celery docs for more info)
Celery allows configuring a worker with a specific queue.
1) Specify the name of the queue with 'queue' attribute for different types of jobs
celery.send_task('job_type1', args=[], kwargs={}, queue='queue_name_1')
celery.send_task('job_type2', args=[], kwargs={}, queue='queue_name_2')
2) Add the following entry in configuration file
CELERY_CREATE_MISSING_QUEUES = True
3) On starting the worker, pass -Q 'queue_name' as argument, for consuming from that desired queue.
celery -A proj worker -l info -Q queue_name_1 -n worker1
celery -A proj worker -l info -Q queue_name_2 -n worker2

Celery Tasks with eta get removed from RabbitMQ

I'm using Django 1.6, RabbitMQ 3.5.6, celery 3.1.19.
There is a periodic task which runs every 30 seconds and creates 200 tasks with given eta parameter. After I run the celery worker, slowly the queue gets created in RabbitMQ and I see around 1200 scheduled tasks waiting to be fired. Then, I restart the celery worker and all of the waiting 1200 scheduled tasks get removed from RabbitMQ.
How I create tasks:
my_task.apply_async((arg1, arg2), eta=my_object.time_in_future)
I run the worker like this:
python manage.py celery worker -Q my_tasks_1 -A my_app -l
CELERY_ACKS_LATE is set to True in Django settings. I couldn't find any possible reason.
Should I run the worker with a different configuration/flag/parameter? Any idea?
As far as I know Celery does not rely on RabbitMQ's scheduled queues. It implements ETA/Countdown internally.
It seems that you have enough workers that are able to fetch enough messages and schedule them internally.
Mind that you don't need 200 workers. You have the prefetch multiplier set to the default value so you need less.

How to cancel conflicting / old tasks in Celery?

I'm using Celery + RabbitMQ.
When a Celery worker isn't available all the tasks are waiting in RabbitMQ.
Just as it becomes online all this bunch of tasks is executed immediately.
Can I somehow prevent it happening?
For example there are 100 tasks (the same) waiting for a Celery worker, can I execute only 1 of them when a Celery worker comes online?
Since all the tasks are the same in your queue, A better way to do this is to send the task only once, to do this you need to be able to track that the task was published, for example:
Using a lock, example: Ensuring a task is only executed one at a time
Using a custom task ID and a custom state after the task is published, for example:
To add a custom state when the task is published:
from celery import current_app
from celery.signals import after_task_publish
#after_task_publish.connect
def add_sent_state(sender=None, body=None, **kwargs):
"""Track Published Tasks."""
# get the task instance from its name
task = current_app.tasks.get(sender)
# if there is no task.backend fallback to app.backend
backend = task.backend if task else current_app.backend
# store the task state
backend.store_result(body['id'], None, 'SENT')
When you want to send the task you can check if the task has already been published, and since we're using a custom state the task's state won't be PENDING when it's published (which could be unkown) so we can check using:
from celery import states
# the task has a custom ID
task = task_func.AsyncResult('CUSTOM_ID')
if task.state != states.PENDING:
# the task already exists
else:
# send the task
task_func.apply_async(args, kwargs, task_id='CUSTOM_ID')
I'm using this approach in my app and it's working great, my tasks could be sent multiple times and they are identified by their IDs so this way each task is sent once.
If you're still want to cancel all the tasks in the queue you can use:
# import your Celery instance
from project.celery import app
app.control.purge()
Check the Celery FAQ How do I purge all waiting tasks ?
There are two ways to do this.
First, Run only one worker with a concurrency of one.
celery worker -A your_app -l info -c 1
This command starts a worker with a concurrency of one. So only one task will be executed at a time. This is the preferred way to do it.
Second method is bit complicated. You need to acquire lock and release the lock to make sure only one task is executed at a time.
Alternatively, if you want, you can remove all the tasks from queue using purge command.
celery -A your_app purge

Categories