Celery: understanding the big picture - python

Celery seems to be a great tool, but I have a hard time understanding how the various Celery components work together:
The workers
The apps
The tasks
The message Broker (like RabbitMQ)
From what I understand, the command line:
celery -A not-clear-what-this-option-is worker
should run some sort of celery "worker server" which would itself need to connect to a broker server (I'm not so sure why so many servers are needed).
Then, in any Python code, some task may be sent to the worker by instantiating an app:
app = Celery('my_module', broker='pyamqp://guest@localhost//')
and then by decorating functions with this app in the following way:
@app.task
def my_func():
    ...
so that "my_func()" can now be called as "my_func.delay()" to be ran in an asynchronuous way.
Here are my questions:
What happens when my_func.delay() is called? Which server talks to which first, and what is sent where?
What should go after the "-A" option of the celery command? Is it really needed?
Suppose I have a process X which instantiates a Celery app to launch task A, and suppose I have another process Y that wants to know the status of task A launched by X. I assume there is a way for Y to do so, but I don't know how. I suppose that Y should create its own instance of a Celery app. But then:
Which function should Y call on its Celery app to get this information (and what is the "identifier" of task A inside process Y)?
How does this work in terms of communication, that is, when does the request go through the broker, and when does it go to the worker(s)?
If anyone has some information about these questions, I would be grateful. I intend to use Celery in a Django project, where some requests to the server can trigger various time-consuming tasks and/or inquire about the status of previously launched tasks (pending, finished, error, etc.).

About the broker:
The main role of the broker is to mediate communication between the client and the worker
basically, a lot of information is generated and passed around while your worker is running
taking care of this information is the broker's role
e.g. you can configure Redis so that no information is lost if the server is shut down while a process is running
The worker:
you can think of the worker as an instance independent of your application, which will only execute those tasks that you delegate to it
About the state of a task:
there are ways to query Celery to find out the status of a task, but I would not recommend building your application logic around it
if you want to take the output of one task and turn it into the input of another, I would recommend you use a queue
run task A and, before it finishes, insert its result objects into the queue
task B will listen to the queue and process whatever comes up (see the sketch below)
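A minimal sketch of that hand-off, assuming a local Redis instance and hypothetical task and queue names (just one way to illustrate the pattern, not the only one):
import json

import redis
from celery import Celery

app = Celery('my_module', broker='redis://localhost:6379/0')
r = redis.Redis(host='localhost', port=6379, db=1)

@app.task
def task_a():
    result = {'value': 42}                         # whatever task A computes
    r.rpush('results_queue', json.dumps(result))   # hand the result off before finishing

@app.task
def task_b():
    # process whatever task A has pushed so far
    while (raw := r.lpop('results_queue')) is not None:
        payload = json.loads(raw)
        print('processing', payload)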
The command:
on the terminal you can see in more detail what each argument means by running celery -h or celery --help
but the -A argument basically specifies which Celery instance you intend to run, i.e. it indicates where the instance you have configured and intend to execute can be found
usage: celery [-h] [-A APP] [-b BROKER] [--result-backend RESULT_BACKEND]
[--loader LOADER] [--config CONFIG] [--workdir WORKDIR]
[--no-color] [--quiet]
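For example, a minimal sketch with a hypothetical module layout: if the configured instance lives in proj/celery_app.py, then -A points at that module.
# proj/celery_app.py  (hypothetical layout)
from celery import Celery

app = Celery('proj', broker='pyamqp://guest@localhost//')

@app.task
def add(x, y):
    return x + y

# started with:
#   celery -A proj.celery_app worker --loglevel=info
# the value after -A is the dotted path to the module where the Celery instance is defined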
I hope this can provide an initial overview for those who get here

Celery is used to run functions in the background. Imagine you have a web API that does a job and returns a response. You know that job would seriously affect the response time of the API. So you transfer that particular job to Celery, and your API responds instantly (see the sketch after this list). Examples of jobs that affect the performance of an API are:
Routing to email servers
Routing to SMS Gateways
Database backup
Chained database operations
File conversion
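A minimal sketch of the idea, assuming a Django view; the send_welcome_email task, the signup_view function, and the mail settings are hypothetical:
from celery import shared_task
from django.core.mail import send_mail
from django.http import JsonResponse

@shared_task
def send_welcome_email(address):
    # the slow part (talking to the email server) happens in the worker
    send_mail('Welcome!', 'Thanks for signing up.', 'noreply@example.com', [address])

def signup_view(request):
    # ... create the user ...
    send_welcome_email.delay(request.POST['email'])   # only enqueues a message
    return JsonResponse({'status': 'ok'})             # the API responds instantly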
Now, let's cover each component of Celery.
The workers
Celery workers execute the job (function). They are asynchronous. By default, a worker starts as many worker processes as your machine has processor cores (tunable with the --concurrency option). You can assign a name to a Celery worker and route specific tasks to it.
The apps
The app is the name of the project you're working on. You'll have to specify that name in the Celery instance.
The tasks
The functions you need to be executed in the background. Every task Celery executes has a task id, a state (and more). You can get those by inspecting a particular task.
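For example, a minimal sketch of checking a task's state from another process, assuming a result backend is configured and the task id returned by .delay() has been shared (e.g. stored in your database):
from celery import Celery
from celery.result import AsyncResult

# hypothetical app; a result backend is required to query task state
app = Celery('my_module',
             broker='pyamqp://guest@localhost//',
             backend='rpc://')

task_id = '...'                      # the id from my_func.delay().id, shared with this process
result = AsyncResult(task_id, app=app)
print(result.state)                  # PENDING, STARTED, SUCCESS, FAILURE, ...
if result.ready():
    print(result.result)             # the return value (or the raised exception)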
The message Broker
The tasks that will be executed in the background have to be moved from your Python project to the Celery workers. Message brokers act as the medium here: functions, along with their arguments, are sent to the broker, and the Celery workers fetch them from the broker to execute.
Some commands
celery -A project_name worker --loglevel=info
celery -A project_name inspect active
More in documentation
docs.celeryproject.org

Related

Celery send_task() method

I have my API, and some endpoints need to forward requests to Celery. The idea is to have a specific API service that basically only instantiates a Celery client and uses the send_task() method, and a separate service (the workers) that consumes the tasks. The code for the task definitions should be located in that worker service. Basically, this separates the Celery app (API) and the Celery worker into two separate services.
I don't want my API to know about any Celery task definitions; endpoints only need to use celery_client.send_task('some_task', (some_arguments)). So on one service I have my API, and on another service/host I have the Celery code base where my Celery worker will execute the tasks.
I came across this great article that describes what I want to do.
https://medium.com/@tanchinhiong/separating-celery-application-and-worker-in-docker-containers-f70fedb1ba6d
and this post Celery - How to send task from remote machine?
I need help on how to create routes for tasks from the API. I was expecting celery_client.send_task() to have a queue= keyword, but it does not. I need to have two queues, and two workers that will consume content from these two queues.
Commands for my workers:
celery -A <path_to_my_celery_file>.celery_client worker --loglevel=info -Q queue_1
celery -A <path_to_my_celery_file>.celery_client worker --loglevel=info -Q queue_2
I have also visited the Celery "Routing Tasks" documentation, but it is still unclear to me how to establish this communication.
Your API side should hold the router. I guess it's not an issue because it is only a map of task -> queue (i.e. send task1 to queue1).
In other words, your celery_client should have task_routes like:
task_routes = {
    'mytasks.some_task': 'queue_1',
    'mytasks.some_other_task': 'queue_2',
}
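A minimal sketch of that API-side client; the broker URL and the 'api' name are assumptions, the task names mirror the map above (written here in the explicit dict form), and the worker services hold the actual task code:
from celery import Celery

# client only: no task modules are imported here
celery_client = Celery('api', broker='pyamqp://guest@localhost//')
celery_client.conf.task_routes = {
    'mytasks.some_task': {'queue': 'queue_1'},
    'mytasks.some_other_task': {'queue': 'queue_2'},
}

# the router decides the queue, so the endpoint only needs the task name
result = celery_client.send_task('mytasks.some_task', args=(1, 2))
print(result.id)   # task id; a result backend would be needed to fetch the return value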

Celery: How to separate the logic of Publisher and Consumer?

I am new to Celery. In this example, I am unable to figure out how to separate the logic of the publisher and the consumer. Is the command celery -A tasks worker --loglevel=INFO used for publishing or for consuming?
If add.delay(4, 4) is to push data into a queue, how do I connect to the same queue in a separate code file and consume it?
Publishers are typically either Celery beat (scheduler), custom scripts that you develop, or other tasks executed by Celery workers in your cluster.
Consumers are EXCLUSIVELY Celery workers. Unless you dig really deep into Celery/Kombu and implement your own consumer, you are pretty much unable to write a consumer that easily.
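A minimal sketch under the question's setup (the tasks.py module and the add task come from the question; the broker URL is an assumption):
# tasks.py - shared task definitions
from celery import Celery

app = Celery('tasks', broker='pyamqp://guest@localhost//')

@app.task
def add(x, y):
    return x + y

# publisher side: any code that imports tasks.py, e.g.
#     from tasks import add
#     add.delay(4, 4)          # only puts a message on the broker's queue
#
# consumer side: a worker started in a separate shell (or on another machine):
#     celery -A tasks worker --loglevel=INFO
# the worker is the consumer: it picks up the message and executes add(4, 4)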

Setup celery periodic task

How do I set up a periodic task with Celerybeat and Flask that queries a database every hour?
The environment looks like this:
/
|-app
| |-__init__.py
| |-jobs
| | |-task.py
|-celery-beat.sh
|-celery-worker.sh
|-manage.py
I currently have a query function called run_query() located in task.py.
I want the scheduler to kick in once the application initiates, so I have the following lines in my /app/__init__.py file:
celery = Celery()
@celery.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    sender.add_periodic_task(1, app.jobs.task.run_query())
(For simplicity's sake, I've set it up so that if it runs, it will run every minute. No such luck yet.)
When I launch celery-worker.sh, it recognizes my function under the [tasks] heading, but the scheduled function never runs. I can manually force the function to run by issuing the following at the command prompt:
>>> from app.jobs import task
>>> task.run_query.delay()
EDIT: Added celerybeat.sh
As a follow-up: if the database is accessed through a Flask context, is it wise, during my async function call, to create a new Flask context to access the database? Use the existing Flask context? Or forget contexts altogether and just open a connection to the database? My worry is that if I just open a new connection, it may interfere with the existing context's connection.
To run periodic tasks you need some kind of scheduler (e.g. celery beat).
celery beat is a scheduler; It kicks off tasks at regular intervals, that are then executed by available worker nodes in the
cluster.
You have to ensure only a single scheduler is running for a schedule
at a time, otherwise you’d end up with duplicate tasks. Using a
centralized approach means the schedule doesn’t have to be
synchronized, and the service can operate without using locks.
Reference: periodic-tasks
You can invoke the scheduler with the command:
$ celery -A proj beat #different process from your worker
You can also embed beat inside the worker by enabling the worker's -B option; this is convenient if you'll never run more than one worker node, but it's not commonly used and for that reason isn't recommended for production use.
Starting the scheduler:
$ celery -A proj worker -B
Reference: celery starting scheduler
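A sketch of how the periodic registration from the question could look, assuming run_query is decorated as a Celery task in app/jobs/task.py; note that a signature (.s()) is passed rather than the result of calling the function:
from celery import Celery
from app.jobs.task import run_query   # assumes run_query is registered as a Celery task

celery = Celery()

@celery.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # pass a signature, not run_query() itself
    sender.add_periodic_task(3600.0, run_query.s(), name='run query every hour')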

Celery: why do I need a broker for periodic tasks?

I have a standalone script that scrapes a page, opens a connection to a database, and writes data to it. I need it to execute periodically, every x hours. I can do it with a bash script, with the pseudocode:
while true
do
    python scraper.py
    sleep $((60 * 60 * x))
done
From what I read about message brokers, they are used for sending "signals" from one running program to another, like HTTP in principle. For example, I have a piece of code that accepts an email id from the user; it sends a signal with the email id to another piece of code that will send the email.
I need Celery to run a periodic task on Heroku. I already have MongoDB on a separate server. Why do I need to run another server for RabbitMQ or Redis just for this? Can I use Celery without the broker?
Celery's architecture is designed to scale and distribute tasks across several servers. For sites like yours it might be overkill. A queue service is generally needed to maintain the task list and signal the status of finished tasks.
You might want to take a look at Huey instead. Huey is a small-scale Celery "clone" needing only Redis as an external dependency, not RabbitMQ. It still uses Redis's queue mechanism to line up the tasks.
There is also Advanced Python Scheduler, which does not even need Redis and can hold the state of the queue in memory, in-process.
Alternatively, if you have a very small number of periodic tasks and no delayed tasks, I would just use cron and plain Python scripts to run them.
As the Celery documentation explains:
Celery communicates via messages, usually using a broker to mediate between clients and workers. To initiate a task, a client adds a message to the queue, which the broker then delivers to a worker.
You can use your existing MongoDB database as the broker; see Using MongoDB.
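A minimal sketch of that configuration, assuming MongoDB runs locally; the MongoDB transport is experimental, so treat this as illustrative rather than a recommendation:
from celery import Celery

# kombu's (experimental) MongoDB transport lets the existing MongoDB act as broker and backend
app = Celery('scraper',
             broker='mongodb://localhost:27017/celery_broker',
             backend='mongodb://localhost:27017/celery_results')

@app.task
def scrape():
    # the body of scraper.py would go here
    ...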
For an application like this, it's better to use Django Background Tasks.
Installation
Install from PyPI:
pip install django-background-tasks
Add to INSTALLED_APPS:
INSTALLED_APPS = (
    # ...
    'background_task',
    # ...
)
Migrate your database:
python manage.py makemigrations background_task
python manage.py migrate
Creating and registering tasks
To register a task use the background decorator:
from background_task import background
from django.contrib.auth.models import User
@background(schedule=60)
def notify_user(user_id):
    # lookup user by id and send them a message
    user = User.objects.get(pk=user_id)
    user.email_user('Here is a notification', 'You have been notified')
This will convert notify_user into a background task function. When you call it from regular code, it will actually create a Task object and store it in the database. The database then contains serialised information about which function actually needs to run later on. This does place limits on the parameters that can be passed when calling the function - they must all be serializable as JSON. That is why, in the example above, a user_id is passed rather than a User object.
Calling notify_user as normal will schedule the original function to be run 60 seconds from now:
notify_user(user.id)
This is the default schedule time (as set in the decorator), but it can be overridden:
# requires: from datetime import timedelta  and  from django.utils import timezone
notify_user(user.id, schedule=90)                     # 90 seconds from now
notify_user(user.id, schedule=timedelta(minutes=20))  # 20 minutes from now
notify_user(user.id, schedule=timezone.now())         # at a specific time
You can also run the original function right now, in synchronous mode:
notify_user.now(user.id) # launch a notify_user function and wait for it
notify_user = notify_user.now # revert task function back to normal function.
Useful for testing.
You can specify a verbose name and a creator when scheduling a task:
notify_user(user.id, verbose_name="Notify user", creator=user)

Distributed Celery scheduler

I'm looking for a distributed cron-like framework for Python, and found Celery. However, the docs say "You have to ensure only a single scheduler is running for a schedule at a time, otherwise you would end up with duplicate tasks", and Celery uses celery.beat.PersistentScheduler, which stores the schedule in a local file.
So, my question: is there an implementation other than the default that can put the schedule "into the cluster" and coordinate task execution so that each task runs only once?
My goal is to be able to run celerybeat with identical schedules on all hosts in the cluster.
Thanks
tl;dr: No, Celerybeat is not suitable for your use case. You have to run just one process of celerybeat, otherwise your tasks will be duplicated.
I know this is a very old question. I will try to make a small summary because I have the same problem/question (in the year 2018).
Some background: We're running a Django application (with Celery) in a Kubernetes cluster. The cluster (EC2 instances) and the pods (~containers) are autoscaled: simply said, I do not know when and how many instances of the application are running.
It's your responsibility to run only one process of celerybeat, otherwise your tasks will be duplicated. [1] There was this feature request in the Celery repository: [2]
Requiring the user to ensure that only one instance of celerybeat
exists across their cluster creates a substantial implementation
burden (either creating a single point-of-failure or encouraging users
to roll their own distributed mutex).
celerybeat should either provide a mechanism to prevent inadvertent
concurrency, or the documentation should suggest a best-practice
approach.
After some time, this feature request was rejected by the author of Celery for lack of resources. [3] I highly recommend reading the entire thread on the Github. People there recommend these project/solutions:
https://github.com/ybrs/single-beat
https://github.com/sibson/redbeat
Use locking mechanism (http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html#ensuring-a-task-is-only-executed-one-at-a-time)
I did not try anything from the above (I do not want another dependency in my app, and I do not like locking tasks, since you need to deal with failover, etc.).
I ended up using CronJob in Kubernetes (https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/).
[1] celerybeat - multiple instances & monitoring
[2] https://github.com/celery/celery/issues/251
[3] https://github.com/celery/celery/issues/251#issuecomment-228214951
I think there might be some misunderstanding about what celerybeat does. Celerybeat does not process the periodic tasks; it only publishes them. It puts the periodic tasks on the queue to be processed by the celeryd workers. If you run a single celerybeat process and multiple celeryd processes, then task execution will be distributed across the cluster.
We had this same issue where we had three servers running Celerybeat. However, our solution was to only run Celerybeat on a single server so duplicate tasks weren't created. Why would you want Celerybeat running on multiple servers?
If you're worried about Celery going down just create a script to monitor that the Celerybeat process is still running.
$ ps aux | grep celerybeat
That will show you whether the Celerybeat process is running. Then create a script that, if it sees the process is down, emails your system admins. Here's a sample setup where we're only running Celerybeat on one server.
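A minimal sketch of such a watchdog, meant to be run from cron every few minutes; the email addresses and the local SMTP server are assumptions:
#!/usr/bin/env python
# emails the admins if no celery beat process is found on this host
import smtplib
import subprocess
from email.message import EmailMessage

def beat_is_running():
    # roughly equivalent to: ps aux | grep celerybeat
    proc = subprocess.run(['pgrep', '-f', 'celery.*beat'],
                          capture_output=True, text=True)
    return proc.returncode == 0

if not beat_is_running():
    msg = EmailMessage()
    msg['Subject'] = 'celerybeat is down'
    msg['From'] = 'watchdog@example.com'        # placeholder addresses
    msg['To'] = 'sysadmin@example.com'
    msg.set_content('No celery beat process found on this host.')
    with smtplib.SMTP('localhost') as smtp:     # assumes a local mail server
        smtp.send_message(msg)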
