I have a Celery task that mutates a shared variable. It works perfectly with a single Celery worker, but as soon as I use concurrency everything gets messed up. How can I lock the critical section where the variable is modified?
inb4: I'm using Python 3.6, with Redis as both broker and result backend. threading.Lock doesn't help here.
As long as Celery runs with multiple workers (processes), a thread lock will not help, because it only works inside a single process. Moreover, a threading lock is only useful when you control the overall process, and with Celery there is no way to achieve that.
That means Celery requires a distributed lock. For Django I always use the cache-based lock, as in: here. If you need more generic locks, especially Redis-based ones that work for any Python app, you can use sherlock.
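For illustration, here is a minimal sketch of guarding the critical section with a Redis-based lock using the redis-py client; the task name, lock key and shared counter are assumptions for the sketch, not taken from the question:
# Sketch only: task name, lock key, timeouts and the shared counter are
# illustrative assumptions.
import redis
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')
redis_client = redis.Redis(host='localhost', port=6379, db=1)

@app.task
def mangle_variable():
    # The lock lives in Redis, so it is shared by every worker process,
    # unlike threading.Lock, which only covers threads of one process.
    with redis_client.lock('mangle-variable-lock', timeout=60,
                           blocking_timeout=30):
        # critical section: read-modify-write of the shared value
        value = int(redis_client.get('shared-counter') or 0)
        redis_client.set('shared-counter', value + 1)
The same shape works with sherlock or a Django cache lock; only the object providing acquire/release changes.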
I know this question is more than two years old, but right now I'm tuning my Celery configs and I came across this topic.
I am using Python 2.7 with Django 1.11 and Celery 4 on a Linux machine, with RabbitMQ as the broker.
My setup runs Celery as a daemon, with Celery beat handling scheduled tasks.
And so, with a dedicated queue for a given task, you can configure that queue to be served by one worker (process) with concurrency=1 (one subprocess).
This solves the concurrency problem for tasks run through Celery; but if your code also runs the task without Celery, that path will not respect this constraint.
Example code:
from datetime import timedelta

from kombu import Exchange, Queue

default_exchange = Exchange('celery_periodic', type='direct')
celery_task_1_exchange = Exchange('celery_task_1', type='direct')

CELERY_TASK_QUEUES = (
    Queue('celery_periodic', default_exchange, routing_key='celery_periodic'),
    Queue('celery_task_1', celery_task_1_exchange, routing_key='celery_task_1'),
)

CELERY_BEAT_SCHEDULE = {
    'celery-task-1': {
        'task': 'tasks.celery_task_1',
        'schedule': timedelta(minutes=15),
        # route the beat entry to the dedicated queue via 'options'
        'options': {'queue': 'celery_task_1'},
    },
}
and finally, in /etc/default/celeryd (docs here: https://docs.celeryproject.org/en/latest/userguide/daemonizing.html#example-configuration):
CELERYD_NODES="worker1 worker2"
CELERYD_OPTS="--concurrency=1 --time-limit=600 -Q:worker1 celery_periodic -Q:worker2 celery_task_1"
--concurrency N means you will have exactly N worker subprocesses for your worker instance (meaning the worker instance can handle N concurrent tasks) (from here: https://stackoverflow.com/a/44903753/9412892).
Read more here: https://docs.celeryproject.org/en/stable/userguide/workers.html#concurrency
BR,
Eduardo
Edit: For lack of an alternative, I simply multiply the task. I am using Flask as a web server. When I call the train_network endpoint, the task is executed as many times as there are workers.
response = [fw.train_network.delay().get() for _ in range(Workers)]
For this to work, I also removed the -c 2 argument from the celery worker command and placed the amount in my config:
celery.conf.worker_concurrency = cfg.celery_workers
With this I always know the number of subprocesses and how many times the task should be repeated.
If there is a better option to solve this, I will update the post with an answer. Or maybe somebody else can provide insight.
Edit: Basically, I need all subprocesses to have access to a specific set of variables that should be shared between these processes. Or, if every process gets its own variable, I need to be able to modify all of these variables by executing a task.
Edit:
So, I've found out that the task is indeed broadcast to all workers, but not to the workers spawned by the pool/concurrency setting, only to those started from the terminal.
So, if I start multiple terminals with celery worker ..... -c 2, these celery workers do receive the broadcast task, which is good I guess. Now I want to broadcast these tasks to the pool workers inside the celery workers too.
Basically, I load a model and I want to reload the model on all pool workers.
Original:
I"ve been reading trough the user guide cat celery so that I can send a single task to all of my workers.
I am using RabbitMQ and everzthing elso works fine, but the broadcasted task are only processed by a single worker.
I define the exchange and the Queue
from kombu import Exchange
from kombu.common import Broadcast

exchange = Exchange('broadcast_exchange', type='fanout')
celery.conf.task_queues = (Broadcast(name='broadcast_learning', exchange=exchange),)
And also the Task routes:
celery.conf.task_routes = {
    'fworker.train_network': {
        'queue': 'broadcast_learning',
        'exchange': 'broadcast_exchange'
    },
    ....
}
But executing the task with .delay() or with .apply_async(queue='broadcast_learning') does not seem to send the task to ALL workers - instead only one is processing it.
After starting my worker it listens to the broadcast queue and the default queue, and I see that they are registered in Celery (although with a strange internal name):
[queues]
.> bcast.13bebf5c-f69c-40d9-a0e8-73f74efb9114 exchange=broadcast_exchange(fanout) key=celery
.> celery exchange=celery(direct) key=celery
I have already switched from Redis to RabbitMQ, since some answers suggested that broadcasting does not work with Redis. But whatever I try, it does not seem to work.
Is Celery mostly just a high-level interface for message queues like RabbitMQ? I am trying to set up a system with multiple scheduled workers doing concurrent HTTP requests, but I am not sure whether I need either of them. Another thing I am wondering is where you write the actual task code for the workers to complete, if I am using Celery or RabbitMQ.
RabbitMQ is indeed a message queue, and Celery uses it to send messages to and from workers. But Celery is more than just an interface to RabbitMQ: it is what you use to create workers, kick off tasks, and define your tasks. It sounds like your use case makes sense for Celery/RabbitMQ. You create a task using the @app.task decorator; check the docs for more info. In previous projects, I've set up a module for Celery where I define any tasks I need. Then you can pull in functions from other modules to use in your tasks.
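As a rough illustration of that layout (the module name, broker URL, and fetch_status task below are assumptions for this sketch, not something from the question):
# tasks.py -- a minimal sketch of a Celery task module
import requests
from celery import Celery

# RabbitMQ only appears here, as the broker URL
app = Celery('tasks', broker='amqp://guest:guest@localhost:5672//')

@app.task
def fetch_status(url):
    # Each worker process executes this independently, so many URLs
    # can be fetched concurrently.
    return requests.get(url, timeout=10).status_code
Calling fetch_status.delay('https://example.com') from your own code only puts a message on the queue; whichever worker picks it up performs the HTTP request.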
Celery is the task management framework: the API you use to schedule jobs, the code that gets those jobs started, and the management tools (e.g. Flower) you use to monitor what's going on.
RabbitMQ is one of several "backends" for Celery. It's an oversimplification to say that Celery is a high-level interface to RabbitMQ. RabbitMQ is not actually required for Celery to run and do its job properly. But, in practice, they are often paired together, and Celery is a higher-level way of accomplishing some things that you could do at a lower level with just RabbitMQ (or another queue or message delivery backend).
Cheers,
I have a Celery setup running in a production environment (on Linux) where I need to consume two different task types from two dedicated queues (one for each). The problem is that all workers are always bound to both queues, even when I specify that they should only consume from one of them.
TL;DR
Celery running with 2 queues
Messages are published in correct queue as designed
Workers keep consuming both queues
Leads to deadlock
General Information
Think of my two different task types as a hierarchical setup:
A task is a regular celery task that may take quite some time, because it dynamically dispatches other celery tasks and may be required to chain through their respective results
A node is a dynamically dispatched sub-task, which also is a regular celery task but itself can be considered an atomic unit.
My task can thus be a more complex arrangement of nodes, where the results of one or more nodes serve as input for one or more subsequent nodes, and so on. Since my tasks can take longer and only finish when all of their nodes have been processed, it is essential that they are handled by dedicated workers, so that enough workers stay free to consume the nodes. Otherwise the system can get stuck: many tasks are dispatched, each occupies a worker, and their respective nodes are only queued but never consumed, because all workers are blocked.
If this is a bad design in general, please make any suggestions on how I can improve it. I have not yet managed to build one of these processes using Celery's built-in canvas primitives. Help me if you can!
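(For reference, a minimal sketch of the canvas pattern alluded to above: a chord dispatches all nodes in parallel and runs a callback once they have all finished, so no worker sits blocked waiting on sub-results. The names deploy_node, combine_results and the module path are hypothetical.)
# Sketch only: deploy_node and combine_results are hypothetical tasks,
# and the import path is an assumed module layout.
from celery import chord

from my.deployment.system.tasks import deploy_node, combine_results

def launch_task(node_args):
    # Dispatch all nodes in parallel; the broker invokes combine_results
    # once every node has finished, instead of a parent task blocking.
    return chord(deploy_node.s(arg) for arg in node_args)(combine_results.s())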
Configuration/Setup
I run celery with amqp and have set up the following queues and routes in the celery configuration:
from kombu import Exchange, Queue

CELERY_QUEUES = (
    Queue('prod_nodes', Exchange('prod'), routing_key='prod.node'),
    Queue('prod_tasks', Exchange('prod'), routing_key='prod.task'),
)

CELERY_ROUTES = {
    'deploy_node': {'queue': 'prod_nodes', 'routing_key': 'prod.node'},
    'deploy_task': {'queue': 'prod_tasks', 'routing_key': 'prod.task'},
}
When I launch my workers, I issue a call similar to the following:
celery multi start w_task_01 w_node_01 w_node_02 -A my.deployment.system \
-E -l INFO -P gevent -Q:1 prod_tasks -Q:2-3 prod_nodes -c 4 --autoreload \
--logfile=/my/path/to/log/%N.log --pidfile=/my/path/to/pid/%N.pid
The Problem
My queue and routing setup seems to work properly, as I can see messages being correctly queued in the RabbitMQ Management web UI.
However, all workers always consume tasks from both queues. I can see this when I open the Flower web UI and inspect one of the dispatched tasks: e.g. w_node_01 starts consuming messages from the prod_tasks queue, even though it shouldn't.
The RabbitMQ Management web UI furthermore tells me that all started workers are set up as consumers for both queues.
Thus, I ask you...
... what did I do wrong?
Where is the issue with my setup or my worker start call? How can I prevent the workers from always consuming from both queues? Do I really have to make additional settings at runtime (which I certainly do not want to do)?
Thanks for your time and answers!
You can create two separate workers, one for each queue, and use the -Q command-line argument to define which queue each one should get tasks from.
If you want to keep the number of processes the same (by default one process is started per core for each worker), you can use the --concurrency flag (see the Celery docs for more info).
Celery allows configuring a worker with a specific queue.
1) Specify the name of the queue with the 'queue' argument for the different types of jobs:
celery.send_task('job_type1', args=[], kwargs={}, queue='queue_name_1')
celery.send_task('job_type2', args=[], kwargs={}, queue='queue_name_2')
2) Add the following entry to the configuration file:
CELERY_CREATE_MISSING_QUEUES = True
3) When starting the workers, pass -Q 'queue_name' as an argument to consume from the desired queue:
celery -A proj worker -l info -Q queue_name_1 -n worker1
celery -A proj worker -l info -Q queue_name_2 -n worker2
Is it possible to have multiple Celery instances, possibly on different machines, consuming from a single queue of tasks, working with Django and preferably using the Django ORM as the backend?
How can I implement this, if it is possible? I can't seem to find any documentation for it.
Yes, it's possible; they just have to use the same broker. For instance, if you are using AMQP, the configs on your servers must share the same
BROKER_URL = 'amqp://user:password@localhost:5672//'
See the routing page for more details. For instance, let's say you want a common queue for the two servers plus one specific to each of them; you could do:
On server 1:
CELERY_ROUTES = {'your_app.your_specific_tasks1': {'queue': 'server1'}}
user@server1:/$ celery -A your_celery_app worker -Q server1,default
On server 2:
CELERY_ROUTES = {'your_app.your_specific_tasks2': {'queue': 'server2'}}
user@server2:/$ celery -A your_celery_app worker -Q server2,default
Of course this is optional; by default all tasks are routed to the queue named celery.
I'm looking for a distributed cron-like framework for Python and found Celery. However, the docs say "You have to ensure only a single scheduler is running for a schedule at a time, otherwise you would end up with duplicate tasks", and Celery uses celery.beat.PersistentScheduler, which stores the schedule in a local file.
So, my question is: is there an implementation other than the default that can put the schedule "into the cluster" and coordinate task execution so that each task is only run once?
My goal is to be able to run celerybeat with identical schedules on all hosts in the cluster.
Thanks
tl;dr: No, Celerybeat is not suitable for your use case. You have to run just one celerybeat process, otherwise your tasks will be duplicated.
I know this is a very old question. I will try to give a small summary because I had the same problem/question (in the year 2018).
Some background: we're running a Django application (with Celery) in a Kubernetes cluster. The cluster (EC2 instances) and the pods (~containers) are autoscaled: simply put, I do not know when or how many instances of the application are running.
It's your responsibility to run only one celerybeat process, otherwise your tasks will be duplicated. [1] There was this feature request in the Celery repository: [2]
Requiring the user to ensure that only one instance of celerybeat
exists across their cluster creates a substantial implementation
burden (either creating a single point-of-failure or encouraging users
to roll their own distributed mutex).
celerybeat should either provide a mechanism to prevent inadvertent
concurrency, or the documentation should suggest a best-practice
approach.
After some time, this feature request was rejected by the author of Celery for lack of resources. [3] I highly recommend reading the entire thread on GitHub. People there recommend these projects/solutions:
https://github.com/ybrs/single-beat
https://github.com/sibson/redbeat
Use a locking mechanism (http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html#ensuring-a-task-is-only-executed-one-at-a-time); a sketch of this pattern follows the list.
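A minimal sketch of that cookbook pattern, assuming Django's cache; the task name, lock key and timeout are illustrative:
# Sketch of the cache-based lock from the cookbook recipe linked above;
# task name, lock key and timeout are illustrative.
from celery import Celery
from django.core.cache import cache

app = Celery('proj')  # stands in for your existing Celery app instance

LOCK_EXPIRE = 60 * 5  # seconds; should exceed the task's expected runtime

@app.task(bind=True)
def sync_feeds(self):
    lock_id = 'sync-feeds-lock'
    # cache.add is atomic: it returns True only for the caller that actually
    # created the key, so only one instance enters the body at a time.
    if not cache.add(lock_id, 'locked', LOCK_EXPIRE):
        return 'skipped: another instance is already running'
    try:
        pass  # do the actual periodic work here
    finally:
        cache.delete(lock_id)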
I did not try anything from the above (I do not want another dependency in my app, and I do not like locking tasks: you need to deal with failover, etc.).
I ended up using CronJob in Kubernetes (https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/).
[1] celerybeat - multiple instances & monitoring
[2] https://github.com/celery/celery/issues/251
[3] https://github.com/celery/celery/issues/251#issuecomment-228214951
I think there might be some misunderstanding about what celerybeat does. Celerybeat does not process the periodic tasks; it only publishes them. It puts the periodic tasks on the queue to be processed by the celeryd workers. If you run a single celerybeat process and multiple celeryd processes then the task execution will be distributed into the cluster.
We had this same issue where we had three servers running Celerybeat. However, our solution was to only run Celerybeat on a single server so duplicate tasks weren't created. Why would you want Celerybeat running on multiple servers?
If you're worried about Celerybeat going down, just create a script to monitor that the Celerybeat process is still running.
$ ps aux | grep celerybeat
That will show you whether the Celerybeat process is running. Then create a script that emails your system admins if it sees the process is down. Here's a sample setup where we're only running Celerybeat on one server.
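For illustration, a minimal sketch of such a monitoring script (the SMTP host and addresses are placeholder assumptions, not the answerer's actual setup):
# Checks for a running celerybeat process and emails the admins if it is gone.
import smtplib
import subprocess
from email.message import EmailMessage

def celerybeat_running():
    # pgrep exits with a non-zero status when no matching process is found.
    return subprocess.run(['pgrep', '-f', 'celerybeat'],
                          stdout=subprocess.DEVNULL).returncode == 0

def alert_admins():
    msg = EmailMessage()
    msg['Subject'] = 'celerybeat is down'
    msg['From'] = 'monitor@example.com'
    msg['To'] = 'admins@example.com'
    msg.set_content('The celerybeat process is no longer running.')
    with smtplib.SMTP('localhost') as smtp:
        smtp.send_message(msg)

if __name__ == '__main__':
    if not celerybeat_running():
        alert_admins()
Run it from cron (or a systemd timer) on the one machine that hosts Celerybeat.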