I'm using Celery to process asynchronous tasks from a Django app. Most tasks are short and run in a few seconds, but I have one task that can take a few hours.
Due to processing restrictions on my server, Celery is configured to run only 2 tasks at once. That means if someone launches two of these long-running tasks, it effectively blocks all other Celery processing site-wide for several hours, which is very bad.
Is there any way to configure Celery so that it runs no more than one instance of a particular task at a time? Something like:
@task(max_running_instances=1)
def my_really_long_task():
    for i in range(1000000000):
        time.sleep(6000)
Note, I don't want to cancel all other launches of my_really_long_task. I just don't want them to start right away; they should begin only once all other tasks of the same name finish.
Since this doesn't seem to be supported by Celery, my current hacky solution is to query other tasks within the task, and if we find other running instances, then reschedule ourselves to run later, e.g.
from collections import defaultdict

from celery.task.control import inspect

def get_all_active_celery_task_names(ignore_id=None):
    """
    Returns a count of running Celery tasks, keyed by task name.
    """
    i = inspect()
    task_names = defaultdict(int)  # {name: count}
    if i:
        active = i.active()
        if active is not None:
            for worker_name, tasks in active.items():
                for task in tasks:
                    if ignore_id and task['id'] == ignore_id:
                        continue
                    task_names[task['name']] += 1
    return task_names
import random
import time

from celery import task

@task
def my_really_long_task():
    # ignore our own task id so we don't count ourselves as a running copy
    all_names = get_all_active_celery_task_names(ignore_id=my_really_long_task.request.id)
    if 'my_really_long_task' in all_names:
        my_really_long_task.retry(max_retries=100, countdown=random.randint(10, 300))
        return
    for i in range(1000000000):
        time.sleep(6000)
Is there a better way to do this?
I'm aware of other hacky solutions like this, but setting up a separate memcache server to track task uniqueness is even less reliable, and more complicated than the method I use above.
An alternate solution is to queue my_really_long_task into a separate queue.
my_really_long_task.apply_async(args=args, queue='foo')
Then start a worker with a concurrency of 1 to consume these tasks so that only 1 task gets executed at a time.
celery -A foo worker -l info -Q foo
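If you'd rather not pass queue='foo' at every call site, you can also route the task to that queue in configuration. A minimal sketch, assuming old-style settings and that the task lives at proj.tasks.my_really_long_task (the module path is a placeholder):
# Route the long task to its own queue so the dedicated worker picks it up.
CELERY_ROUTES = {
    'proj.tasks.my_really_long_task': {'queue': 'foo'},
}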
Related
I have millions of tasks reserved in Celery (ETA not due yet) and every time I want to update my Celery code base, I have to restart it, which cuts the connection to RabbitMQ and causes RabbitMQ to redistribute tasks again (I am using late ack).
Is it possible to reload new code base but still keep my reserved tasks? I am using Celery with Django.
Short answer: yes, you can, but you have to write your own queue draining logic.
Longer answer: when you want to do a code update (and depending on how you handle this), you have to use the celery remote control api to tell all your workers to stop consuming tasks. RabbitMQ brokers support the remote control api, so you're in luck.
from my_app.celery import app

inspector = app.control.inspect()
controller = app.control

# get a list of current workers
workers = inspector.ping()
active_queues = inspector.active_queues()
all_queues = set()
for worker, queues in active_queues.items():
    for queue in queues:
        all_queues.add(queue['name'])

for queue in all_queues:
    controller.cancel_consumer(queue)
This will stop your workers from consuming tasks. Now you have to monitor your workers until they have finished processing any active tasks.
import time

done = False
while not done:
    active = inspector.active()
    active_count = sum(len(tasks) for tasks in active.values())
    done = active_count == 0  # finished once no worker reports an active task
    if not done:
        time.sleep(60)  # wait a minute between checks
Once your workers are done, you are clear to deploy your code without having to worry about losing tasks.
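After the deploy, the workers still won't pick anything up until you tell them to start consuming again. A minimal sketch, reusing the controller and all_queues from the snippet above:
# Resume consumption on every queue we previously cancelled.
for queue in all_queues:
    controller.add_consumer(queue)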
How do you limit the number of instances of a specific Celery task that can be run simultaneously?
I have a task that processes large files. I'm running into a problem where a user may launch several tasks, causing the server to run out of CPU and memory as it tries to process too many files at once. I want to ensure that only N instances of this one type of task are run at any given time, and that other tasks will sit queued in the scheduler until the others complete.
I see there's a rate_limit option in the task decorator, but I don't think this does what I want. If I'm understanding the docs correctly, this will just limit how quickly the tasks are launched, but it won't restrict the overall number of tasks running, so it will just make my server crash more slowly... but it will still crash nonetheless.
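For reference, rate_limit is set on the decorator and only throttles how fast each worker starts tasks of this type; the task name and limit value below are just placeholders:
@task(rate_limit='10/m')  # at most 10 starts per minute, per worker process
def process_file(path):
    ...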
You have to set up an extra queue and set the desired concurrency level for it. From Routing Tasks:
# Old config style
CELERY_ROUTES = {
    'app.tasks.limited_task': {'queue': 'limited_queue'}
}
or
from kombu import Exchange, Queue

# the exchange needs to be defined before it can be referenced below
default_exchange = Exchange('default', type='direct')

celery.conf.task_queues = (
    Queue('default', default_exchange, routing_key='default'),
    Queue('limited_queue', default_exchange, routing_key='limited_queue')
)
And start an extra worker, serving only limited_queue:
$ celery -A celery_app worker -Q limited_queue --loglevel=info -c 1 -n limited_queue
Then you can check that everything is running smoothly using Flower or the inspect command:
$ celery -A celery_app inspect --help
What you can do is push these tasks to a specific queue and have X workers processing them. Having two workers on a queue with 100 items ensures that only two of those tasks are processed at the same time.
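A minimal sketch of that setup, where 'file_queue' and process_file are placeholder names:
# Send the heavy task to its own queue...
process_file.apply_async(args=(path,), queue='file_queue')

# ...and start a dedicated worker that only consumes that queue, two tasks at a time:
#   celery -A proj worker -Q file_queue -c 2 -l info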
I am not sure you can do that in Celery. What you can do is check how many tasks of that name are currently running when a request arrives, and if that exceeds the maximum, either return an error or add a mechanism that periodically checks whether there are open slots for the task and runs it. (If you add such a mechanism, you don't need to double-check; just add each request to its queue.)
In order to check running tasks, you can use the inspect command.
In short:
app = Celery(...)
i = app.control.inspect()
i.active()
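Building on that, a rough sketch of counting how many instances of one task name are currently running (the task name and the limit are placeholders):
active = i.active() or {}  # may be None if no workers replied
running = sum(
    1
    for tasks in active.values()
    for t in tasks
    if t['name'] == 'app.tasks.limited_task'
)
if running >= 3:  # 3 is just an example limit
    # reject the request, or schedule it to be retried later
    ...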
I'm using Celery 3.1.x with 2 tasks. The first task (TaskOne) is enqueued when Celery starts up through the celeryd_after_setup signal:
from celery.signals import celeryd_after_setup

@celeryd_after_setup.connect
def celeryd_after_setup(*args, **kwargs):
    TaskOne().apply_async(countdown=5)
When TaskOne is run, it does some calculations and then enqueues TaskTwo. Imagine the following workflow:
1. I start celery, thus the signal is fired and TaskOne is enqueued
2. after the countdown (5) TaskTwo is enqueued
3. then I stop celery (the TaskTwo remains in the queue)
4. afterwards I restart celery
5. the workflow is run again and TaskTwo is enqueued again
So we have 2 TaskTwo in the queue. That is a problem for my workflow because I only want one TaskTwo within the queue and avoid that a second one is enqueued.
My question: How can I achieve this?
With celery.app.control.Inspect.scheduled() (Docs) I can get a list of which tasks are scheduled, hidden in a combination of lists and dicts. This may be a way, but going through the result of this does not feel right. Is there any better way?
An easy-to-implement solution would be to add the --purge switch to your worker command. It will clear the queue and the worker will start with no scheduled jobs.
But beware: that's a global, unrecoverable action. If there are other scheduled jobs you depend on, this is not your solution.
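For example, the --purge switch goes on the normal worker command (the app name is a placeholder):
celery -A proj worker -l info --purge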
After considering several options I chose to use app.control.inspect.
It's not a really beautiful solution, but it works:
from celery.task.control import inspect

# this check runs inside the function that would otherwise enqueue TaskTwo,
# so the return below skips the duplicate enqueue

# fetch all scheduled tasks
scheduled_tasks = inspect().scheduled()
# iterate the scheduled task values, see http://docs.celeryproject.org/en/latest/userguide/workers.html?highlight=revoke#dump-of-scheduled-eta-tasks
for task_values in iter(scheduled_tasks.values()):
    # task_values is a list of dicts
    for task in task_values:
        if task['request']['name'] == '{}.{}'.format(TaskTwo.__module__, TaskTwo.__name__):
            logger.info('TaskTwo is already scheduled, skipping additional run')
            return
I'm using Celery + RabbitMQ.
When no Celery worker is available, all the tasks wait in RabbitMQ.
As soon as a worker comes back online, this whole bunch of tasks is executed immediately.
Can I somehow prevent it happening?
For example there are 100 tasks (the same) waiting for a Celery worker, can I execute only 1 of them when a Celery worker comes online?
Since all the tasks in your queue are the same, a better way to do this is to send the task only once. To do this you need to be able to track that the task was published, for example:
Using a lock, example: Ensuring a task is only executed one at a time
Using a custom task ID and a custom state after the task is published, for example:
To add a custom state when the task is published:
from celery import current_app
from celery.signals import after_task_publish

@after_task_publish.connect
def add_sent_state(sender=None, body=None, **kwargs):
    """Track Published Tasks."""
    # get the task instance from its name
    task = current_app.tasks.get(sender)
    # if there is no task.backend fallback to app.backend
    backend = task.backend if task else current_app.backend
    # store the task state
    backend.store_result(body['id'], None, 'SENT')
When you want to send the task, you can check whether it has already been published. Since we're using a custom state, the task's state won't be PENDING once it has been published (a state of PENDING could also mean the task is unknown), so we can check like this:
from celery import states

# the task has a custom ID
task = task_func.AsyncResult('CUSTOM_ID')

if task.state != states.PENDING:
    # the task already exists
    pass
else:
    # send the task
    task_func.apply_async(args, kwargs, task_id='CUSTOM_ID')
I'm using this approach in my app and it's working great; my tasks could be sent multiple times, but since they are identified by their IDs, each task is sent only once.
If you still want to cancel all the tasks in the queue, you can use:
# import your Celery instance
from project.celery import app
app.control.purge()
Check the Celery FAQ: How do I purge all waiting tasks?
There are two ways to do this.
First, run only one worker with a concurrency of one.
celery worker -A your_app -l info -c 1
This command starts a worker with a concurrency of one. So only one task will be executed at a time. This is the preferred way to do it.
The second method is a bit more complicated. You need to acquire a lock before running the task and release it afterwards, so that only one task is executed at a time.
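A minimal sketch of the lock approach, along the lines of the "Ensuring a task is only executed one at a time" recipe in the Celery docs, assuming Django's cache as the lock store and Celery >= 3.1 for bind=True (the cache key, timeout, and do_the_long_work are placeholders):
from celery import task
from django.core.cache import cache

LOCK_EXPIRE = 60 * 60 * 5  # seconds; longer than the task should ever take

@task(bind=True)
def my_really_long_task(self):
    # cache.add is atomic: it only succeeds if the key doesn't exist yet
    if not cache.add('my_really_long_task_lock', 'locked', LOCK_EXPIRE):
        # another instance holds the lock; retry later instead of running now
        raise self.retry(countdown=300, max_retries=100)
    try:
        do_the_long_work()  # placeholder for the real work
    finally:
        cache.delete('my_really_long_task_lock')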
Alternatively, if you want, you can remove all the tasks from the queue using the purge command.
celery -A your_app purge
How can I delete all tasks in a queue right after a task has ended?
I want something like this (Deleting all pending tasks in celery / rabbitmq) but for celery 3.0.
Thanks
EDIT:
From celery documentation:
http://docs.celeryproject.org/en/latest/faq.html#how-do-i-purge-all-waiting-tasks
My code looks like:
from celery import task
from celery import current_app as celery

@task
def task_a():
    celery.control.purge()
I was expecting that, if I issued 5 tasks, only the first would run. Somehow, it's not doing that.
Thanks
Those tasks might have already been prefetched by the workers. To find out if this is so, try running more tasks than the number of active workers multiplied by the prefetch multiplier (see below), and check what result is returned by celery.control.purge(). You can control the number of prefetched tasks using the config parameters CELERYD_PREFETCH_MULTIPLIER and CELERY_ACKS_LATE.
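For example, a minimal sketch of those settings (old-style names, values chosen for illustration) so a worker reserves as little work as possible ahead of time:
# Each worker process prefetches only one message at a time...
CELERYD_PREFETCH_MULTIPLIER = 1
# ...and acknowledges it only after the task has finished running.
CELERY_ACKS_LATE = True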