Recently I read something about FastAPI's BackgroundTasks versus Celery, and the point was that Celery is more productive.
Now I can't find detailed information about the difference between the two or about the best way to use each.
Straight from the documentation:
If you need to perform heavy background computation and you don't
necessarily need it to be run by the same process (for example, you
don't need to share memory, variables, etc), you might benefit from
using other bigger tools like Celery.
They tend to require more complex configurations, a message/job queue
manager, like RabbitMQ or Redis, but they allow you to run
background tasks in multiple processes, and especially, in multiple
servers.
To see an example, check the Project Generators, they all include
Celery already configured.
But if you need to access variables and objects from the same
FastAPI app, or you need to perform small background tasks (like
sending an email notification), you can simply just use
BackgroundTasks.
Have a look at this answer as well.
I have been trying to learn more about Celery, but it's difficult to understand what's literal in terms of "workers" and the "queue" and what it actually means in terms of programming. I apologize if this question is very basic, but I can't seem to find a straight answer in simple terms.
I have a Flask/Python app that I want to add a task assignment functionality to. For example, when one user completes a task, I want them to be able to flag it for their team members to check. Additionally, I'd like to be able to schedule task assignments, for example have a user complete 10 of a category of a task per week.
Celery seems like a good way to queue tasks and ensure that they are completed, but the focus of the system seems to be scheduling resource-intensive processes for asynchronous processing, not literal task assignment and queueing.
My question boils down to: is Celery the proper tool for assigning tasks to users, even if it's not for the purpose of resource saving? Have I misunderstood what the capabilities of Celery are? If so, what would be the tools to use to implement this feature?
Thank you!
Celery is meant for queuing automated tasks that are executed by worker processes on computers, not by people. I recommend against using it for assigning tasks to your users.
If you're looking for a lightweight solution for storing queues of tasks for users, you could use lists in Redis: https://redis.io/topics/data-types
You could also probably use any database that you're already using.
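As a rough sketch of the "any database you're already using" option, here is one way per-user task queues could look with the standard library's sqlite3. The table, column, and function names are made up for illustration, and an in-memory database stands in for whatever your Flask app already uses.

```python
import sqlite3

# Hypothetical schema: one row per assignment; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE assignment (
        id INTEGER PRIMARY KEY,
        assignee TEXT NOT NULL,
        description TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'open'  -- 'open', 'needs_review', 'done'
    )
    """
)

def assign(assignee, description):
    """Queue a task for a user and return its id."""
    cur = conn.execute(
        "INSERT INTO assignment (assignee, description) VALUES (?, ?)",
        (assignee, description),
    )
    conn.commit()
    return cur.lastrowid

def flag_for_review(task_id, reviewer):
    """Mark a completed task as needing a check and hand it to a teammate."""
    conn.execute(
        "UPDATE assignment SET status = 'needs_review', assignee = ? WHERE id = ?",
        (reviewer, task_id),
    )
    conn.commit()

def queue_for(assignee):
    """Return (id, description, status) for a user's unfinished tasks, oldest first."""
    return conn.execute(
        "SELECT id, description, status FROM assignment "
        "WHERE assignee = ? AND status != 'done' ORDER BY id",
        (assignee,),
    ).fetchall()

first = assign("alice", "label 10 images")
assign("alice", "write summary")
flag_for_review(first, "bob")
print(queue_for("alice"))
print(queue_for("bob"))
```

Nothing here runs in the background: the "queue" is just rows that your views read and update, which is usually all that human task assignment needs.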
I have been learning about Django recently and have stumbled upon Celery. I don't understand what it does, and its site didn't help. Can anyone explain the concept and its real-world applications (in simple terms)?
Celery is an "asynchronous task queue/job queue based on distributed message passing". It is just a task queue: something you put tasks into so that they get done as soon as possible. You have a Celery instance that you integrate directly with your Django or Python app; this is what you use to talk to Celery. Then you configure Celery to have "workers" that perform the tasks you give them. The whole point is to be able to do tasks that don't fit well within the normal request/response cycle that Django handles so well.
What kinds of tasks are these? Well, as said before, they don't fit into the normal request/response cycle. The best example I can think of is email: if you're building a web app and you want to keep your users, you need to keep them engaged and coming back, and a good way to do that is by sending emails. You send them once a week or once a day, and users can maybe configure when they are sent. This would fit horribly within the request/response cycle, but it's perfect for something like Celery.
Other examples are long-running jobs with lots of computation. While you would typically use something like Hadoop for really big computations, you can schedule some queries with Celery. You could also use it to schedule builds if you're doing something like Travis. The uses go on and on, but you probably get the point.
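To make "queue" and "workers" concrete, here is a toy, in-process version built only on the standard library. It is not Celery and uses none of its API: Celery does the same thing across processes and machines through a message broker. The function name send_email and the addresses are placeholders.

```python
import queue
import threading

tasks = queue.Queue()   # the "queue": work waiting to be done
results = []            # where finished work is recorded
results_lock = threading.Lock()

def send_email(address):
    # Stand-in for slow work; a real task would talk to a mail server.
    return f"sent to {address}"

def worker():
    # A "worker" is just a loop that takes the next task and runs it.
    while True:
        func, args = tasks.get()
        try:
            outcome = func(*args)
            with results_lock:
                results.append(outcome)
        finally:
            tasks.task_done()

# Start two workers; daemon threads exit when the main program does.
for _ in range(2):
    threading.Thread(target=worker, daemon=True).start()

# The web request would only do this part: enqueue and return immediately.
for address in ["a@example.com", "b@example.com", "c@example.com"]:
    tasks.put((send_email, (address,)))

tasks.join()  # wait here only so the demo can show the results
print(sorted(results))
```

The request handler's job ends at `tasks.put(...)`; everything after that happens outside the request/response cycle.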
I'm working on a web application that will receive a request from a user and have to hit a number of external APIs to compose the answer to that request. This could be done directly from the main web thread using something like gevent to fan out the request.
Alternatively, I was thinking, I could put incoming requests into a queue and use workers to distribute the load. The idea would be to try to keep it real time, while splitting up the requests amongst several workers. Each of these workers would be querying only one of the many external APIs. The response they receive would then go through a series of transformations, be saved into a DB, be transformed to a common schema and saved in a common DB, and finally be composed into one big response that would be returned through the web request. The web request is most likely going to be blocking all this time, with a user waiting, so keeping the queueing and dequeueing as fast as possible is important.
The external API calls can easily be turned into individual tasks. I think the linking from one API task to a transformation to a DB saving task could be done using a chain, etc., and the final result combining all results could be returned to the web thread using a chord.
Some questions:
Can this (and should this) be done using celery?
I'm using django. Should I try to use django-celery over plain celery?
Each one of those tasks might spawn off other tasks - such as logging what just
happened or other types of branching off. Is this possible?
Could tasks be returning the data they get - i.e. potentially Kb of data through celery (redis as underlying in this case) or should they write to the DB, and just pass pointers to that data around?
Each task is mostly I/O bound, and I was initially just going to use gevent from the web thread to fan out the requests and skip the whole queuing design, but it turns out that it would be reused for a different component. Trying to keep the whole round trip through the Qs real time will probably require many workers making sure the queues are mostly empty. Or is it? Would running the gevent worker pool help with this?
Do I have to write gevent specific tasks or will using the gevent pool deal with network IO automagically?
Is it possible to assign priority to certain tasks?
What about keeping them in order?
Should I skip celery and just use kombu?
It seems like celery is geared more towards "tasks" that can be deferred and are
not time sensitive. Am I nuts for trying to keep this real time?
What other technologies should I look at?
Update: Trying to hash this out a bit more. I did some reading on Kombu and it seems to be able to do what I'm thinking of, although at a much lower level than celery. Here is a diagram of what I had in mind.
What seems to be possible with raw queues as accessible with Kombu is the ability for a number of workers to subscribe to a broadcast message. The type and number does not need to be known by the publisher if using a queue. Can something similar be achieved using Celery? It seems like if you want to make a chord, you need to know at runtime what tasks are going to be involved in the chord, whereas in this scenario you can simply add listeners to the broadcast, and simply make sure they announce they are in the running to add responses to the final queue.
Update 2: I see there is the ability to broadcast. Can you combine this with a chord? In general, can you combine Celery with raw Kombu? This is starting to sound like a question about smoothies.
I will try to answer as many of the questions as possible.
Can this (and should this) be done using celery?
Yes, you can.
I'm using django. Should I try to use django-celery over plain celery?
Django has good support for Celery, which would make life much easier during development.
Each one of those tasks might spawn off other tasks - such as logging
what just happened or other types of branching off. Is this possible?
You can start subtasks from within a task, with ignore_result=True for tasks that are run only for their side effects.
Could tasks be returning the data they get - i.e. potentially Kb of
data through celery (redis as underlying in this case) or should they
write to the DB, and just pass pointers to that data around?
I would suggest putting the results in the DB and passing only the IDs around; that keeps your broker and workers happy, with less data transfer, pickling, etc.
Each task is mostly I/O bound, and I was initially just going to use
gevent from the web thread to fan out the requests and skip the whole
queuing design, but it turns out that it would be reused for a
different component. Trying to keep the whole round trip through the
Qs real time will probably require many workers making sure the
queues are mostly empty. Or is it? Would running the gevent worker
pool help with this?
Since the process is I/O bound, gevent will definitely help here. However, how high the concurrency should be for a gevent-pooled worker is something I'm looking for an answer to as well.
Do I have to write gevent specific tasks or will using the gevent pool
deal with network IO automagically?
Gevent does the monkey patching automatically when you use it as the pool. But the libraries that you use should play well with gevent. Otherwise, for example, if you're parsing some data with simplejson (which is written in C), that would block other gevent greenlets.
Is it possible to assign priority to certain tasks?
You cannot assign specific priorities to certain tasks, but you can route them to different queues and then have those queues listened to by varying numbers of workers. The more workers there are for a particular queue, the higher the effective priority of the tasks on that queue.
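A sketch of what that routing could look like in a Celery settings module. The task paths and queue names are hypothetical. CELERY_ROUTES is the old-style setting name from around the time of this answer; newer Celery versions call it task_routes.

```python
# celeryconfig.py (sketch). Task and queue names are hypothetical.
# Old-style setting name; newer Celery versions call it task_routes.
CELERY_ROUTES = {
    "proj.tasks.fetch_api": {"queue": "realtime"},
    "proj.tasks.log_event": {"queue": "background"},
}

# Give the urgent queue more capacity than the deferred one, e.g.:
#   celery -A proj worker -Q realtime -c 8
#   celery -A proj worker -Q background -c 1
```

With this split, a backlog of logging tasks cannot delay the API fetches, because the two kinds of work never wait in the same queue.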
What about keeping them in order?
Chain is one way to maintain order. Chord is a good way to summarize. Celery takes care of it, so you don't have to worry about it. Even when using the gevent pool, it would in the end be possible to reason about the order of task execution.
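To show the shape of "chain for order, chord to summarize" without needing a broker, here is a standard-library analogue using a thread pool. It is not Celery's canvas API; in Celery the same roles are played by its chain and chord primitives. The function names and the fake API list are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(api):
    # Stand-in for an external API call (I/O bound in the real case).
    return {"api": api, "raw": api.upper()}

def transform(response):
    # Convert one response to a common schema.
    return {"source": response["api"], "value": response["raw"].lower()}

def pipeline(api):
    # The "chain": each step runs only after the previous one finished.
    return transform(fetch(api))

def combine(parts):
    # The "chord" callback: runs once, after every pipeline has finished.
    return {part["source"]: part["value"] for part in parts}

apis = ["weather", "news", "stocks"]
with ThreadPoolExecutor(max_workers=len(apis)) as pool:
    # map() runs the pipelines concurrently but yields results in input order.
    merged = combine(pool.map(pipeline, apis))

print(merged)
```

Order is guaranteed inside each pipeline and at the final combine step; the pipelines themselves run in whatever order the pool schedules them.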
Should I skip celery and just use kombu?
You can, if your use case will not change to something more complex over time and also if you are willing to manage your processes through celeryd + supervisord by yourself. Also, if you don't care about the task monitoring that comes with tools such as celerymon, flower, etc.
It seems like celery is geared more towards "tasks" that can be
deferred and are not time sensitive.
Celery supports scheduled tasks as well, if that is what you meant by that statement.
Am I nuts for trying to keep this real time?
I don't think so. As long as your consumers are fast enough, it will be as good as real time.
What other technologies should I look at?
Pertaining to Celery, you should choose the result store wisely. My suggestion would be to use Cassandra. It is good for real-time data (both write- and query-wise). You can also use Redis or MongoDB. They come with their own set of problems as a result store, but a little tweaking of the configuration can go a long way.
If you mean something completely different from celery, then you can look into asyncio (python3.5) and zeromq for achieving the same. I can't comment more on that though.
Is it possible to use the same Redis database for multiple projects using Celery, like using the same database as a cache for multiple projects with a key prefix? Or do I have to use a separate database for every installation?
To summarize from this helpful blog post: https://kfalck.net/2013/02/21/run-multiple-celeries-on-a-single-redis/
Specify a different database number for each project, e.g. redis://localhost/0 and redis://localhost/1
Define and use different queue names for the different projects. On the task side, define CELERY_DEFAULT_QUEUE, and when starting up your worker, use the -Q parameter to specify that queue. Read more about routing here: http://docs.celeryproject.org/en/latest/userguide/routing.html
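Put together, the two points above might look like this in one project's settings. This is only a sketch: the host, port, and queue names are assumptions, and the uppercase names are the old-style settings that match CELERY_DEFAULT_QUEUE as written here.

```python
# Project A's Celery settings (sketch). Old-style names as used in this
# answer; newer Celery versions spell them broker_url, result_backend and
# task_default_queue.
BROKER_URL = "redis://localhost:6379/0"
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"
CELERY_DEFAULT_QUEUE = "project_a"   # hypothetical queue name

# Project B would use database 1 and its own queue name instead:
#   BROKER_URL = "redis://localhost:6379/1"
#   CELERY_DEFAULT_QUEUE = "project_b"

# Start each project's worker on its own queue, e.g.:
#   celery -A project_a worker -Q project_a
```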
I've used a Redis backend for Celery while also using the same Redis DB for prefixed cache data. This was during development, and I only used Redis for the result backend, not to queue tasks; the production deployment ended up being all AMQP (Redis only for caching). I didn't have any problems and don't see why one would (other than performance issues).
For running multiple celery projects with different task definitions, I think the issue would be if you have two different types of workers that each can only handle a subset of job types. Without separate databases, I'm not sure how the workers would be able to tell which jobs they could process.
I'd probably either want to make sure all workers had all task types defined and could process anything, or would want to keep the separate projects in separate databases. This wouldn't require installing anything extra, you'd just specify a REDIS_DB=1 in one of your celery projects. There might be another way to do this. I don't know for sure that multiple DBs are required, but it kinda makes sense.
If you're only using redis for a result backend, maybe that would work for having multiple celery projects on one redis db... I'm not really sure.
I have a Django project with various apps, which are completely independent. I'd like to run each one in its own process, as some of them spawn background threads to periodically precalculate some data, and now they are competing for the CPU (the machine has loads of cores, but you know, the GIL and such...).
So, is there an easy way to split automatically the project into different ones, or at least to make each app live in its own process?
You can always have different settings files, but that would be like having multiple projects and even multiple endpoints. With some effort you could configure a reverse proxy to forward to the right Django server, based on the request's path and so on, but I don't think that's what you want and it would be an ugly solution to your problem.
The solution to this is to move the heavy processing to a jobs queue. A lot of people and projects prefer Celery for this.
If that seems like overkill for some reason, you can always implement your own based on simple cron jobs. You can take a look at my small project that does this.
The simplest of the simple is probably to write a custom management command that watches a given model (database table) for new entries and processes them. The model is written to by, e.g., a Django view, and the management command is launched periodically from cron (e.g. every 5 minutes).
Example: user registers on the site, but the account creation is an expensive operation (allocating some space, pinging remote services etc.). Therefore you just write a new record to AccountRequest table (AccountRequest.objects.create(...)). Then, cron periodically launches your management script (./manage.py account_creator), which checks for new AccountRequest-s (AccountRequest.objects.filter(unprocessed=True)), does its job and marks those requests as processed.
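The same poll-and-mark loop, sketched with the standard library's sqlite3 so it runs without a Django project. In Django the body of process_pending would live in the management command and use the ORM calls mentioned above; the table layout and the create_account placeholder are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account_request ("
    "id INTEGER PRIMARY KEY, username TEXT, unprocessed INTEGER DEFAULT 1)"
)

def request_account(username):
    # What the view does: record the request and return immediately.
    conn.execute("INSERT INTO account_request (username) VALUES (?)", (username,))
    conn.commit()

def create_account(username):
    # Placeholder for the expensive work (allocating space, remote calls).
    return f"account created for {username}"

def process_pending():
    # What the cron-launched command does on each run.
    rows = conn.execute(
        "SELECT id, username FROM account_request WHERE unprocessed = 1 ORDER BY id"
    ).fetchall()
    done = []
    for request_id, username in rows:
        done.append(create_account(username))
        conn.execute(
            "UPDATE account_request SET unprocessed = 0 WHERE id = ?", (request_id,)
        )
        conn.commit()  # commit per row so a crash does not redo finished work
    return done

request_account("alice")
request_account("bob")
first_run = process_pending()
second_run = process_pending()
print(first_run, second_run)
```

The second run finds nothing to do, which is what makes it safe for cron to launch the command every few minutes.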