Celery django explanation - python

I have been learning about django recently and have stumbled upon celery. I don't seem to understand what it does. I've been to their site to no avail. Can anyone explain to me the concept and it's real world applications (in simple terms)?

Celery is an "asynchronous task queue/job queue based on distributed message passing". It is just a task queue, or something that one puts tasks into to do as soon as possible. You have a celery instance that you integrate directly with your django or python app- this is what you use to talk to celery. Then, you can configure celery to have 'workers' that perform the tasks you give them. The whole point is to be able to do tasks that don't fit within the normal request/response cycle very well that django handles so well.
What kinds of tasks are these? Well, as said before, they don't fit into the normal request/response cycle. The best example I can think of is emails- if you're building a web app and you want to keep your users, you need to keep them engaged and coming back, and a good way to do that is by sending emails. You send them once a week or once a day and they can maybe configure when to send. This would fit horribly within the request/response cycle, but it's perfect for something like Celery.
Other examples are long-running jobs with lots of computation. While you would typically use something like Hadoop for really big computations, you can schedule some queries with Celery. You could also use it to schedule builds if you're doing something like Travis. The uses go on and on, but you probably get the point.

Related

Queue in django/python

I have developed a django REST API. I send request/data to it to perfrom a task and it does it nicely. Though, In a way, I can send it multiple request/data to perfrom the task on each of them. The issue is that server where the task gets performed has limited memory and I need to perform these task one by one. So, I am thinking to have a queue system at django pipeline which can maintain the reqeust on hold till the task in front of the queue is done.
I am not sure if I am on right path, but not sure if celery is the option to solve my issue?
It seems a simple task and I didnt understand if celery is what i need. Can you point me what should be looking at?
If you want to keep your API as is, then celery will not help you. It's a good idea to keep API calls as short as possible. If you have some longer job done during API calls (sending emails, for example), then you better use celery. But the only thing you can get as a response to your API is that the task was queued.

Tool selection - is Celery the right system to use in this instance?

I have been trying to learn more about Celery, but it's difficult to understand what's literal in terms of "workers" and the "queue" and what it actually means in terms of programming. I apologize if this question is very basic, but I can't seem to find a straight answer in simple terms.
I have a Flask/Python app that I want to add a task assignment functionality to. For example, when one user completes a task, I want them to be able to flag it for their team members to check. Additionally, I'd like to be able to schedule task assignments, for example have a user complete 10 of a category of a task per week.
Celery seems like a good way to queue tasks and ensure that they are being completed, but the focus of this system seems to be scheduling resource intensive processes for asynchrous processing, not literal task assignment and queueing.
My question boils down to: is Celery the proper tool for assigning tasks to users, even if it's not for the purpose of resource saving? Have I misunderstood what the capabilities of Celery are? If so, what would be the tools to use to implement this feature?
Thank you!
Celery is meant to for handling the queuing of automated tasks on computer workers. I recommend against using it for assigning tasks to your users.
If you're looking for a lightweight solution for storing queues of tasks for users, you could use lists in Redis: https://redis.io/topics/data-types
You could also probably use any database that you're already using.

Should I use plain Python code or Celery? Django.

I have a heavy function (a lot of calculations are done) which outputs a individual number for each user in my Django project. This number changes just a little over time so to minimize the server load I thought about running the function once a day, save the output and just reference the output. I know that these kinda things are usually handled with Celery but the package requires a lot of site packages and extra modules so I thought about writing a simple function like:
x0 = #last.time function was called
x1 = datetime.now
if x0-x1 > 1 day:
def whatever():
....
x0 = datetime.now
return ....
I like to keep my code clean and not to install Packages which are not really required so I would like to know if there are any downsides by "just" using Python or any gain when I would do that with Celery. The task does not need to be asynchronous so I don't care about that.
Is there a clear "Use case" when Celery should be used and when not? Is there a performance loss/gain?
I hope somebody can explain that properly.
Celery is a clear winner but I would like to explain this with pros and cons.
Pros:
You can control celery from Django very easily. Running a celery task, cancelling task, checking state/progress of task can be done within django.
A periodical task running with celery is very simple, just register the task from django run the celery worker and voila you are done. No need to mess around with crontab or background processes.
Celery is very easy to setup and run. You might already know that if you have gone through the introduction of celery.
Cons
One of the cons is that you need to have at least one result backend with either redis, rabbitmq or any other one running with celery for queuing purposes. Although RabbitMq is not a heavy you need to install it once.
One more is that celery worker itself takes some memory but that won't be an issue if you are on a server, on local memory consumption might seem high to you.
I would suggest celery because it would provide you more control over your task rather than a simple background process.

Running asynchronous python code in a Django web application

Is it OK to run certain pieces of code asynchronously in a Django web app. If so how?
For example:
I have a search algorithm that returns hundreds or thousands of results. I want to enter into the database that these items were the result of the search, so I can see what users are searching most. I don't want the client to have to wait an extra hundred or thousand more database inserts. Is there a way I can do this asynchronously? Is there any danger in doing so? Is there a better way to achieve this?
As far as Django is concerned yes.
The bigger concern is your web server and if it plays nice with threading. For instance, the sync workers of gunicorn are single threads, but there are other engines, such as greenlet. I'm not sure how well they play with threads.
Combining threading and multiprocessing can be an issue if you're forking from threads:
Status of mixing multiprocessing and threading in Python
http://bugs.python.org/issue6721
That being said, I know of popular performance analytics utilities that have been using threads to report on metrics, so seems to be an accepted practice.
In sum, seems safest to use the threading.Thread object from the standard library, so long as whatever you do in it doesn't fork (python's multiprocessing library)
https://docs.python.org/2/library/threading.html
Offloading requests from the main thread is a common practice; as the end goal is to return a result to the client (browser) as quickly as possible.
As I am sure you are aware, HTTP is blocking - so until you return a response, the client cannot do anything (it is blocked, in a waiting state).
The de-facto way of offloading requests is through celery which is a task queuing system.
I highly recommend you read the introduction to celery topic, but in summary here is what happens:
You mark certain pieces of codes as "tasks". These are usually functions that you want to run asynchronously.
Celery manages workers - you can think of them as threads - that will run these tasks.
To communicate with the worker a message queue is required. RabbitMQ is the one often recommended.
Once you have all the components running (it takes but a few minutes); your workflow goes like this:
In your view, when you want to offload some work; you will call the function that does that work with the .delay() option. This will trigger the worker to start executing the method in the background.
Your view then returns a response immediately.
You can then check for the result of the task, and take appropriate actions based on what needs to be done. There are ways to track progress as well.
It is also good practice to include caching - so that you are not executing expensive tasks unnecessarily. For example, you might choose to offload a request to do some analytics on search keywords that will be placed in a report.
Once the report is generated, I would cache the results (if applicable) so that the same report can be displayed if requested later - rather than be generated again.

Using Celery for Realtime, Synchronous External API Querying with Gevent

I'm working on a web application that will receive a request from a user and have to hit a number of external APIs to compose the answer to that request. This could be done directly from the main web thread using something like gevent to fan out the request.
Alternatively, I was thinking, I could put incoming requests into a queue and use workers to distribute the load. The idea would be to try to keep it real time, while splitting up the requests amongst several workers. Each of these workers would be querying only one of the many external APIs. The response they receive would then go through a series transformations, be saved into a DB, be transformed to a common schema and saved in a common DB to finally be composed into one big response that would be returned through the web request. The web request is most likely going to be blocking all this time, with a user waiting, so keeping
the queueing and dequeueing as fast as possible is important.
The external API calls can easily be turned into individual tasks. I think the linking
from one api task to a transformation to a DB saving task could be done using a chain, etc, and the final result combining all results returned to the web thread using a chord.
Some questions:
Can this (and should this) be done using celery?
I'm using django. Should I try to use django-celery over plain celery?
Each one of those tasks might spawn off other tasks - such as logging what just
happened or other types of branching off. Is this possible?
Could tasks be returning the data they get - i.e. potentially Kb of data through celery (redis as underlying in this case) or should they write to the DB, and just pass pointers to that data around?
Each task is mostly I/O bound, and was initially just going to use gevent from the web thread to fan out the requests and skip the whole queuing design, but it turns out that it would be reused for a different component. Trying to keep the whole round trip through the Qs real time will probably require many workers making sure the queueus are mostly empty. Or is it? Would running the gevent worker pool help with this?
Do I have to write gevent specific tasks or will using the gevent pool deal with network IO automagically?
Is it possible to assign priority to certain tasks?
What about keeping them in order?
Should I skip celery and just use kombu?
It seems like celery is geared more towards "tasks" that can be deferred and are
not time sensitive. Am I nuts for trying to keep this real time?
What other technologies should I look at?
Update: Trying to hash this out a bit more. I did some reading on Kombu and it seems to be able to do what I'm thinking of, although at a much lower level than celery. Here is a diagram of what I had in mind.
What seems to be possible with raw queues as accessible with Kombu is the ability for a number of workers to subscribe to a broadcast message. The type and number does not need to be known by the publisher if using a queue. Can something similar be achieved using Celery? It seems like if you want to make a chord, you need to know at runtime what tasks are going to be involved in the chord, whereas in this scenario you can simply add listeners to the broadcast, and simply make sure they announce they are in the running to add responses to the final queue.
Update 2: I see there is the ability to broadcast Can you combine this with a chord? In general, can you combine celery with raw kombu? This is starting to sound like a question about smoothies.
I will try to answer as many of the questions as possible.
Can this (and should this) be done using celery?
Yes you can
I'm using django. Should I try to use django-celery over plain celery?
Django has a good support for celery and would make the life much easier during development
Each one of those tasks might spawn off other tasks - such as logging
what just happened or other types of branching off. Is this possible?
You can start subtasks from withing a task with ignore_result = true for only side effects
Could tasks be returning the data they get - i.e. potentially Kb of
data through celery (redis as underlying in this case) or should they
write to the DB, and just pass pointers to that data around?
I would suggest putting the results in db and then passing id around would make your broker and workers happy. Less data transfer/pickling etc.
Each task is mostly I/O bound, and was initially just going to use
gevent from the web thread to fan out the requests and skip the whole
queuing design, but it turns out that it would be reused for a
different component. Trying to keep the whole round trip through the
Qs real time will probably require many workers making sure the
queueus are mostly empty. Or is it? Would running the gevent worker
pool help with this?
Since the process is io bound then gevent will definitely help here. However, how much the concurrency should be for gevent pool'd worker, is something that I'm looking for answer too.
Do I have to write gevent specific tasks or will using the gevent pool
deal with network IO automagically?
Gevent does the monkey patching automatically when you use it in pool. But the libraries that you use should play well with gevent. Otherwise, if your parsing some data with simplejson (which is written in c) then that would block other gevent greenlets.
Is it possible to assign priority to certain tasks?
You cannot assign specific priorities to certain tasks, but route them to different queue and then have those queues being listened to by varying number of workers. The more the workers for a particular queue, the higher would be the priority of that tasks on that queue.
What about keeping them in order?
Chain is one way to maintain order. Chord is a good way to summarize. Celery takes care of it, so you dont have to worry about it. Even when using gevent pool, it would at the end be possible to reason about the order of the tasks execution.
Should I skip celery and just use kombu?
You can, if your use case will not change to something more complex over time and also if you are willing to manage your processes through celeryd + supervisord by yourself. Also, if you don't care about the task monitoring that comes with tools such as celerymon, flower, etc.
It seems like celery is geared more towards "tasks" that can be
deferred and are not time sensitive.
Celery supports scheduled tasks as well. If that is what you meant by that statement.
Am I nuts for trying to keep this real time?
I don't think so. As long as your consumers are fast enough, it will be as good as real time.
What other technologies should I look at?
Pertaining to celery, you should choose result store wisely. My suggestion would be to use cassandra. It is good for realtime data (both write and query wise). You can also use redis or mongodb. They come with their own set of problems as result store. But then a little tweaking in configuration can go a long way.
If you mean something completely different from celery, then you can look into asyncio (python3.5) and zeromq for achieving the same. I can't comment more on that though.

Categories