Distributed Celery scheduler


I'm looking for a distributed cron-like framework for Python, and found Celery. However, the docs say "You have to ensure only a single scheduler is running for a schedule at a time, otherwise you would end up with duplicate tasks". By default Celery uses celery.beat.PersistentScheduler, which stores the schedule in a local file.
So, my question: is there an implementation other than the default that can put the schedule "into the cluster" and coordinate task execution so that each task is only run once?
My goal is to be able to run celerybeat with identical schedules on all hosts in the cluster.
Thanks

tl;dr: No, Celerybeat is not suitable for your use case. You have to run just one celerybeat process, otherwise your tasks will be duplicated.
I know this is a very old question. I will try to make a small summary because I have the same problem/question (in the year 2018).
Some background: We're running a Django application (with Celery) in a Kubernetes cluster. The cluster (EC2 instances) and Pods (~containers) are autoscaled: simply put, I do not know when and how many instances of the application are running.
It's your responsibility to run only one celerybeat process, otherwise your tasks will be duplicated. [1] There was this feature request in the Celery repository: [2]
Requiring the user to ensure that only one instance of celerybeat
exists across their cluster creates a substantial implementation
burden (either creating a single point-of-failure or encouraging users
to roll their own distributed mutex).
celerybeat should either provide a mechanism to prevent inadvertent
concurrency, or the documentation should suggest a best-practice
approach.
After some time, this feature request was rejected by the author of Celery for lack of resources. [3] I highly recommend reading the entire thread on GitHub. People there recommend these projects/solutions:
https://github.com/ybrs/single-beat
https://github.com/sibson/redbeat
Use locking mechanism (http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html#ensuring-a-task-is-only-executed-one-at-a-time)
I did not try any of the above (I do not want another dependency in my app, and I do not like locking tasks, since you need to deal with fail-over etc.).
I ended up using CronJob in Kubernetes (https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/).
[1] celerybeat - multiple instances & monitoring
[2] https://github.com/celery/celery/issues/251
[3] https://github.com/celery/celery/issues/251#issuecomment-228214951
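For readers weighing the locking option listed above, the idea can be sketched without any broker. This is only an illustration, assuming an in-memory store stands in for a shared cache such as Redis or memcached; `LockStore` and `run_once_per_slot` are hypothetical names, not Celery APIs. In a real cluster the store must be shared between hosts and its add operation must be atomic. This variant keys the lock on the scheduled slot and lets it expire, so a duplicate published by a second scheduler for the same slot is skipped.

```python
import time


class LockStore:
    """In-memory stand-in for a shared cache with an 'add if absent' operation."""

    def __init__(self):
        self._expires = {}  # key -> expiry timestamp

    def add(self, key, ttl):
        """Return True if the key was absent (or expired) and is now set."""
        now = time.monotonic()
        expiry = self._expires.get(key)
        if expiry is not None and expiry > now:
            return False  # another caller already claimed this key
        self._expires[key] = now + ttl
        return True


def run_once_per_slot(store, task_name, slot, func, ttl=60.0):
    """Run func only for the first caller that claims (task_name, slot)."""
    key = f"{task_name}:{slot}"
    if not store.add(key, ttl):
        return None  # duplicate trigger for the same scheduled slot
    return func()
```

The trade-off the answer mentions still applies: if the process that claimed the slot dies before finishing, nobody retries until the lock expires.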

I think there might be some misunderstanding about what celerybeat does. Celerybeat does not process the periodic tasks; it only publishes them. It puts the periodic tasks on the queue to be processed by the celeryd workers. If you run a single celerybeat process and multiple celeryd processes then the task execution will be distributed into the cluster.
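The split between publishing and executing can be illustrated with a toy model. This sketch assumes a `queue.Queue` standing in for the broker and threads standing in for worker processes; `run_cluster` is a hypothetical helper, and nothing here is Celery code.

```python
import queue
import threading


def run_cluster(task_names, worker_count=3):
    """Publish each task once, then let several workers drain the queue."""
    broker = queue.Queue()   # stands in for the message broker
    executed = []            # (worker_id, task_name) pairs
    lock = threading.Lock()

    # The single "beat" process: it only publishes, it never executes.
    for name in task_names:
        broker.put(name)

    def worker(worker_id):
        while True:
            try:
                name = broker.get_nowait()
            except queue.Empty:
                return       # nothing left to do
            with lock:
                executed.append((worker_id, name))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed
```

Each published task is taken by exactly one worker, which is why one publisher plus many workers already spreads the work across the cluster.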

We had this same issue where we had three servers running Celerybeat. However, our solution was to only run Celerybeat on a single server so duplicate tasks weren't created. Why would you want Celerybeat running on multiple servers?
If you're worried about Celery going down just create a script to monitor that the Celerybeat process is still running.
$ ps aux | grep celerybeat
That will show you if the Celerybeat process is running. Then create a script that emails your system admins when it sees the process is down.
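A monitoring script along those lines might look like the following sketch. It assumes a Unix host with `ps` available; the function names, the host name and the recipient address are placeholders, and the message is only built, not sent. Note that a plain `ps aux | grep celerybeat` also matches the grep process itself, so the check skips those lines.

```python
import subprocess
from email.message import EmailMessage


def beat_is_running(ps_output, pattern="celerybeat"):
    """Return True if a process line mentions the pattern, ignoring grep itself."""
    for line in ps_output.splitlines():
        if pattern in line and "grep" not in line:
            return True
    return False


def build_alert(host, recipient="admins@example.com"):
    """Build (but do not send) an alert e-mail; the address is a placeholder."""
    msg = EmailMessage()
    msg["Subject"] = f"celerybeat is down on {host}"
    msg["To"] = recipient
    msg.set_content(f"No celerybeat process was found on {host}.")
    return msg


def check(host="scheduler-host"):
    """Inspect the local process list; return an alert if beat is missing."""
    ps_output = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    return None if beat_is_running(ps_output) else build_alert(host)
```

Running `check()` from cron every minute and handing any returned message to `smtplib` would complete the loop.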

Related

Use existing celery workers for Airflow's Celeryexecutor workers

I am trying to introduce dynamic workflows into my landscape, which involves multiple steps of different model inference where the output from one model gets fed into another model. Currently we have a few Celery workers spread across hosts to manage the inference chain. As the complexity increases, we are attempting to build workflows on the fly. For that purpose, I got a dynamic DAG setup working with CeleryExecutor. Now, is there a way I can retain the current Celery setup and route Airflow-driven tasks to the same workers? I do understand that these workers need access to the DAG folders and the same environment as the Airflow server. I want to know how the Celery workers need to be started on these servers so that Airflow can route the same tasks that used to be run by the manual workflow from a Python application. If I start the workers using the command "airflow celery worker", I cannot access my application tasks. If I start Celery the way it is currently started, i.e. "celery -A proj", Airflow has nothing to do with it. Looking for ideas to make it work.

Thanks @DejanLekic. I got it working (though the DAG task scheduling latency was so high that I dropped the approach). If someone is looking to see how this was accomplished, here are a few things I did to get it working.
Change airflow.cfg to update the executor, queue and result backend settings (obvious).
If we have to use Celery workers spawned outside the Airflow umbrella, change the celery_app_name setting to celery.execute instead of airflow.executors.celery_execute and change the executor to "LocalExecutor". I have not tested this, but it may even be possible to avoid switching to the Celery executor by registering Airflow's task in the project's Celery app.
Each task will now call send_task(). The AsyncResult object returned is then stored in either XCom (implicitly or explicitly) or Redis (implicitly pushed to the queue), and the child task then gathers the AsyncResult (an implicit call to get the value from XCom or Redis) and calls .get() to obtain the result from the previous step.
Note: It is not necessary to split the send_task() and .get() between two tasks of the DAG. By splitting them between parent and child, I was trying to take advantage of the lag between tasks. But in my case, the celery execution of tasks completed faster than airflow's inherent latency in scheduling dependent tasks.

Celery : understanding the big picture

Celery seems to be a great tool, but I have a hard time understanding how the various Celery components work together:
The workers
The apps
The tasks
The message Broker (like RabbitMQ)
From what I understand, the command line:
celery -A not-clear-what-this-option-is worker
should run some sort of celery "worker server" which would itself need to connect to a broker server (I'm not so sure why so many servers are needed).
Then in any Python code, some task may be sent to the worker by instantiating an app:
from celery import Celery
app = Celery('my_module', broker='pyamqp://guest@localhost//')
and then by decorating functions with this app in the following way:
@app.task
def my_func():
    ...
so that "my_func()" can now be called as "my_func.delay()" to be run in an asynchronous way.
Here are my questions:
What happens when my_func.delay() is called? Which server talks to which first, and what is sent where?
What is the option to put after the "-A" of the celery command? Is it really needed?
Suppose I have a process X which instantiates a Celery app to launch task A, and suppose I have another process Y which wants to know the status of task A launched by X. I assume there is a way for Y to do so, but I don't know how. I suppose that Y should create its own instance of a Celery app. But then:
What function should be called in the Celery app of Y to get this information (and what is the "identifier" of task A inside process Y)?
How does this work in terms of communication, that is, when does the request go through the broker, and when does it go to the worker(s)?
If anyone has some information about these questions, I would be grateful. I intend to use Celery in a Django project, where some requests to the server can trigger various time consuming tasks, and/or inquire about the status of previously launched tasks (pending, finished, error, etc...).

About the broker:
The main role of the broker is to mediate communication between the client and the worker
basically a lot of information is being generated and processed while your worker is running
taking care of this information is the broker's role
e.g. you can configure redis so that no information is lost if the server is shut down while running a process
The worker:
you can think of the worker as an instance independent of your application, which will only execute those tasks that you delegate to it
About the state of a task:
there are ways to consult celery to find out the status of a task, but I would not recommend building your application logic depending on this
if you want to take the output of one process and turn it into the input of another one, using tasks, I would recommend using a queue
run task A, and before it finishes, insert your result objects into the queue
task B will listen to the queue and process whatever comes up
The command:
on the terminal you can see in more detail what each argument means by running celery -h or celery --help
but the argument basically specifies which instance of celery you intend to run. So normally this argument will indicate where the instance you have configured and intend to execute can be found
usage: celery [-h] [-A APP] [-b BROKER] [--result-backend RESULT_BACKEND]
[--loader LOADER] [--config CONFIG] [--workdir WORKDIR]
[--no-color] [--quiet]
I hope this can provide an initial overview for those who get here

Celery is used to make functions run in the background. Imagine you have a web API that does a job and returns a response, and you know that the job would seriously affect the response time of the API. So you hand that particular job over to Celery, and your API can respond right away. Examples of jobs that affect the performance of an API are:
Routing to email servers
Routing to SMS Gateways
Database backup
Chained database operations
File conversion
Now, let's cover each component of Celery.
The workers
Celery workers execute the job (function). They run asynchronously, separately from your application. By default a worker starts as many pool processes as the machine has processor cores, and you can change that with the concurrency option. You can assign a name and tasks to a Celery worker.
The apps
The app is the Celery instance for the project you're working on. You give it the project's name when you create it.
The tasks
The functions you need executed in the background. Every task Celery executes has a task id, a state (and more). You can get these by inspecting a particular task.
The message Broker
Tasks that will be executed in the background have to be moved from your Python project to the Celery workers. Message brokers act as the medium here: a message naming the function, together with its arguments, is sent to the broker, and the workers fetch it from the broker to execute it.
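To make the roles above concrete, here is a toy model of the round trip. It is a sketch only: `ToyBroker`, `work_one` and the `registry` dict are hypothetical stand-ins for the broker, a worker and the task registry, and this is not how Celery is implemented internally. The state names mirror the ones Celery reports.

```python
import json
import uuid
from collections import deque


class ToyBroker:
    """Toy stand-in for a broker plus result backend."""

    def __init__(self):
        self.queue = deque()   # messages waiting for a worker
        self.results = {}      # task_id -> {"state": ..., "result": ...}

    def send(self, name, args):
        """Roughly what a client does on delay(): serialize and enqueue."""
        task_id = str(uuid.uuid4())
        self.queue.append(json.dumps({"id": task_id, "task": name, "args": args}))
        self.results[task_id] = {"state": "PENDING", "result": None}
        return task_id

    def state(self, task_id):
        """What any other process can ask, provided it knows the task id."""
        return self.results[task_id]["state"]


def work_one(broker, registry):
    """What a worker does: take one message, run the function, store the outcome."""
    message = json.loads(broker.queue.popleft())
    func = registry[message["task"]]
    try:
        outcome = {"state": "SUCCESS", "result": func(*message["args"])}
    except Exception as exc:
        outcome = {"state": "FAILURE", "result": repr(exc)}
    broker.results[message["id"]] = outcome
```

The point of the model is that only the task name, the arguments and the task id travel through the broker, and that the task id is the handle a second process needs in order to look up the state.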
Some commands
celery -A project_name worker -n worker_name
celery -A project_name inspect active
More in documentation
docs.celeryproject.org

Script needs to be run as a Celery task. What consequences does this have?

My task is to write a script using OpenCV which will later run as a Celery task. What consequences does this have? What do I have to pay attention to? Is it enough in the end to include two lines of code, or could it be that I have to rewrite my whole script?
I read that Celery is an "asynchronous task queue/job queuing system based on distributed message passing", but I won't pretend to know completely what all that entails.
I will try to update the question as soon as I get more details.

Celery implies a daemon using a broker (some data hub used to queue tasks). The celeryd daemon and the broker (RabbitMQ, Redis, MongoDB or another) should always run in the background.
Your tasks will be queued, this means they won't happen all at the same time. You can choose how many at the same time can be run as a maximum. The rest of them will wait for the others to finish before starting. This also means some concurrency is often expected, and that you must create tasks that play nice with others doing the same thing.
Celery is not meant to run scripts but tasks, written as python functions. You can of course execute external scripts from Python, but your entry point is always a Python function.
Celery uses Kombu, which uses a message broker to dispatch the tasks. This implies the data you pass to your tasks should be serializable.
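For an OpenCV script, the serialization point usually means passing a reference to the image, not the image itself. The check below is a small sketch assuming a JSON serializer is in use; `can_send`, `Frame` and the file path are made up for illustration.

```python
import json


def can_send(payload):
    """Return True if payload survives JSON encoding, as task arguments must."""
    try:
        json.dumps(payload)
    except TypeError:
        return False
    return True


class Frame:
    """Stand-in for an in-memory image object such as an OpenCV array."""


# Passing a reference (a path) plus plain parameters works.
ok = can_send({"image_path": "/data/in/frame_001.png", "threshold": 0.5})

# Passing the in-memory object itself does not.
not_ok = can_send({"image": Frame()})
```

In practice the task would receive the path, load the image inside the worker, and write its output somewhere the caller can reach.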

How to track revoked tasks across multiple celeryd processes

I have a reminder-type app that schedules tasks in Celery using the "eta" argument. If the parameters of the reminder object change (e.g. the time of the reminder), I revoke the task previously sent and queue a new task.
I was wondering if there's any good way of keeping track of revoked tasks across celeryd restarts. I'd like to have the ability to scale celeryd processes up/down on the fly, and it seems that any celeryd processes started after the revoke command was sent will still execute that task.
One way of doing it is to keep a list of revoked task ids, but this method will result in the list growing arbitrarily. Pruning this list requires guarantees that the task is no longer in the RabbitMQ queue, which doesn't seem to be possible.
I've also tried using a shared --statedb file for each of the celeryd workers, but it seems that the statedb file is only updated on termination of the workers and thus not suitable for what I would like to accomplish.
Thanks in advance!

Interesting problem. I think it should be easy to solve using broadcast commands.
When a new worker starts up, it asks all the other workers to send their revoked
tasks to the new worker. This needs two new remote control commands;
you can easily add new commands by using @Panel.register.
Module control.py:
from celery.worker import state
from celery.worker.control import Panel

@Panel.register
def bulk_revoke(panel, ids):
    # Merge the received ids into this worker's set of revoked tasks.
    state.revoked.update(ids)

@Panel.register
def broadcast_revokes(panel, destination):
    # Send this worker's revoked ids to the worker(s) named in destination.
    panel.app.control.broadcast("bulk_revoke", arguments={
        "ids": list(state.revoked)},
        destination=destination)
Add it to CELERY_IMPORTS:
CELERY_IMPORTS = ("control", )
The only missing piece now is to connect it so that the new worker
triggers broadcast_revokes at startup. I guess you could use the worker_ready
signal for this:
from celery import current_app as celery
from celery.signals import worker_ready

@worker_ready.connect
def request_revokes_at_startup(sender=None, **kwargs):
    # Ask every worker to send its revoked ids to the worker that just started.
    celery.control.broadcast("broadcast_revokes",
        arguments={"destination": [sender.hostname]})

I had to do something similar in my project and used celerycam with django-admin-monitor. The monitor periodically takes a snapshot of the tasks and saves it in the database, and there is a nice user interface to browse and check the status of all tasks. You can use it even if your project is not Django based.

I implemented something similar to this some time ago, and the solution I came up with was very similar to yours.
The way I solved this problem was to have the worker fetch the Task object from the database when the job ran (by passing it the primary key, as the documentation recommends). In your case, before the reminder is sent the worker should perform a check to ensure that the task is "ready" to be run. If not, it should simply return without doing any work (assuming that the ETA has changed and another worker will pick up the new job).
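The "check before doing any work" idea can be sketched as follows. This is an assumption-laden illustration: a plain dict stands in for the database, `Reminder` and `send_reminder` are hypothetical names, and in a real task the scheduled time would travel as a serializable value such as an ISO string.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Reminder:
    pk: int
    due_at: datetime
    sent: bool = False


def send_reminder(db, pk, scheduled_for):
    """Task body: reload the reminder and skip the run if it is stale."""
    reminder = db.get(pk)
    if reminder is None or reminder.sent:
        return "skipped"          # deleted or already handled
    if reminder.due_at != scheduled_for:
        return "skipped"          # time was changed; a newer task owns it
    reminder.sent = True          # real code would deliver the reminder here
    return "sent"
```

With this guard in place the old task does not need to be revoked at all: it wakes up, sees that the reminder's time no longer matches, and returns.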

reliable way to deploy new code into a production celery cluster without pausing service

I have a few Celery nodes running in production with RabbitMQ, and I have been handling deploys with service interruption. I have to take down the whole site in order to deploy new code out to Celery. I have max tasks per child set to 1, so in theory, if I make changes to an existing task, they should take effect the next time it is run, but what about registering new tasks? I know that restarting the daemon won't kill running workers, but will instead let them die on their own, but it still seems dangerous. Is there an elegant solution to this problem?

The challenging part here seems to be identifying which celery tasks are new versus old. I would suggest creating another vhost in rabbitmq and performing the following steps:
Update the Django web servers with the new code and reconfigure them to point to the new vhost.
While tasks are queuing up in the new vhost, wait for the Celery workers to finish the tasks from the old vhost.
When the workers have completed, update their code and point their configuration to the new vhost.
I haven't actually tried this but I don't see why this wouldn't work. One annoying aspect is having to alternate between the vhosts with each deploy.

A kind of workaround can be to set the config variable CELERYD_MAX_TASKS_PER_CHILD (worker_max_tasks_per_child in newer Celery versions).
This variable specifies the number of tasks that a pool worker executes before it is replaced by a new one.
Of course, when a new pool worker is started it will load the new code.
On my system I normally restart Celery while leaving the other tasks running in the background. Normally everything goes fine; sometimes one of these tasks is never killed, and you can then kill it with a script.
