Currently I'm working on a Python project that requires implementing some background jobs (mostly for email sending and heavy database updates). I use Redis as the task broker, so at this point I have two candidates: Celery and RQ. I have some experience with these job queues, but I want to ask you guys to share your experience of using these tools.
What are the pros and cons of using Celery vs. RQ?
Any examples of projects/tasks suited to Celery vs. RQ?
Celery looks pretty complicated, but it's a full-featured solution. Actually, I don't think I need all of those features. On the other hand, RQ is very simple (e.g. configuration, integration), but it seems to lack some useful features (e.g. task revoking, code auto-reloading).
Here is what I have found while trying to answer this exact same question. It's probably not comprehensive, and may even be inaccurate on some points.
In short, RQ is designed to be simpler all around. Celery is designed to be more robust. They are both excellent.
Documentation. RQ's documentation is comprehensive without being complex, and mirrors the project's overall simplicity - you never feel lost or confused. Celery's documentation is also comprehensive, but expect to revisit it quite a lot when you're first setting things up, as there are too many options to internalize.
Monitoring. Celery's Flower and the RQ dashboard are both very simple to set up and give you at least 90% of all the information you would ever want.
Broker support. Celery is the clear winner; RQ only supports Redis. This means less documentation on "what is a broker", but it also means you cannot switch brokers in the future if Redis no longer works for you. For example, Instagram considered both Redis and RabbitMQ with Celery. This is important because different brokers have different guarantees, e.g. Redis cannot (as of writing) guarantee 100% that your messages are delivered.
Priority queues. RQ's priority queue model is simple and effective - workers read from queues in order. Celery requires spinning up multiple workers to consume from different queues. Both approaches work.
OS Support. Celery is the clear winner here, as RQ only runs on systems that support fork, e.g. Unix systems.
Language support. RQ only supports Python, whereas Celery lets you send tasks from one language to a different language.
API. Celery is extremely flexible (multiple result backends, nice config format, workflow canvas support), but naturally this power can be confusing. By contrast, the RQ API is simple (a minimal sketch follows this comparison).
Subtask support. Celery supports subtasks (e.g. creating new tasks from within existing tasks). I don't know if RQ does.
Community and Stability. Celery is probably more established, but they are both active projects. As of writing, Celery has ~3500 stars on GitHub while RQ has ~2000, and both projects show active development.
In my opinion, Celery is not as complex as its reputation might lead you to believe, but you will have to RTFM.
So, why would anyone be willing to trade the (arguably more full-featured) Celery for RQ? In my mind, it all comes down to simplicity. By restricting itself to Redis and Unix, RQ provides simpler documentation, a simpler codebase, and a simpler API. This means you (and potential contributors to your project) can focus on the code you care about, instead of having to keep details about the task queue system in your working memory. We all have a limit on how many details we can hold in our heads at once, and by removing the need to keep task-queue details in there, RQ lets you get back to the code you care about. That simplicity comes at the expense of features like inter-language task queues, wide OS support, 100% reliable message guarantees, and the ability to switch message brokers easily.
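To make the "simpler API" point concrete, here is a minimal RQ sketch, assuming a local Redis and a recent RQ release; the count_words function and the example URL are illustrative, not taken from the question:

# --- tasks.py: any plain importable function can be enqueued ---
import requests

def count_words(url):
    # Fetch a page and count the words in it.
    resp = requests.get(url)
    return len(resp.text.split())

# --- application code ---
from redis import Redis
from rq import Queue
from tasks import count_words

q = Queue(connection=Redis())                       # the "default" queue on a local Redis
job = q.enqueue(count_words, "http://example.com")  # returns immediately with a Job handle

# A worker started in another shell picks the job up:
#   $ rq worker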
Celery is not that complicated. At its core, you follow the step-by-step configuration from the tutorials, create a Celery instance, decorate your function with @celery.task, then run the task with my_task.delay(*args, **kwargs).
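For comparison, a minimal Celery sketch of exactly those steps; the Redis broker URL and the task body are assumptions, not something the question specifies:

# tasks.py - create a Celery instance and decorate a function as a task
from celery import Celery

celery = Celery("tasks", broker="redis://localhost:6379/0")

@celery.task
def send_welcome_email(user_id):
    # Placeholder body: look up the user and send the e-mail here.
    print("sending welcome email to user", user_id)

# From application code, enqueue it without blocking the request:
#   send_welcome_email.delay(42)
# and run a worker in another shell:
#   $ celery -A tasks worker --loglevel=info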
Judging from your own assessment, it seems you have to choose between lacking (key) features or having some excess hanging around. That is not too hard of a choice in my book.
I have been trying to learn more about Celery, but it's difficult to understand what's literal about the "workers" and the "queue" and what they actually mean in terms of programming. I apologize if this question is very basic, but I can't seem to find a straight answer in simple terms.
I have a Flask/Python app that I want to add a task assignment functionality to. For example, when one user completes a task, I want them to be able to flag it for their team members to check. Additionally, I'd like to be able to schedule task assignments, for example to have a user complete 10 tasks of a given category per week.
Celery seems like a good way to queue tasks and ensure that they are being completed, but the focus of this system seems to be scheduling resource-intensive processes for asynchronous processing, not literal task assignment and queueing.
My question boils down to: is Celery the proper tool for assigning tasks to users, even if it's not for the purpose of resource saving? Have I misunderstood what the capabilities of Celery are? If so, what would be the tools to use to implement this feature?
Thank you!
Celery is meant for handling the queuing of automated tasks on computer workers. I recommend against using it for assigning tasks to your users.
If you're looking for a lightweight solution for storing queues of tasks for users, you could use lists in Redis: https://redis.io/topics/data-types
You could also probably use any database that you're already using.
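If you go the Redis route, here is a rough sketch with redis-py, keeping one list per user; the key names and payload fields are just illustrative:

import json
import redis

r = redis.Redis()

def assign_task(user_id, description):
    # Append the task to the user's list.
    r.rpush(f"tasks:{user_id}", json.dumps({"description": description, "done": False}))

def pending_tasks(user_id):
    # Read the whole list back; fine for short per-user queues.
    return [json.loads(raw) for raw in r.lrange(f"tasks:{user_id}", 0, -1)]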
I'm working through possible architectures for a problem. In as few words as possible, the problem is: I need to design a system that allows clients to connect using HTTP/REST to kick off long running processes. Each process will create a persistent connection to a third party server and write the received data to a queue. Each process will terminate only if the third party server closes the connection or another HTTP/REST request is received indicating it should be terminated.
Constraints and background:
Clients must be able to connect using HTTP/REST
System must be written in Python
I'm a lower level C guy (with enough Python experience to feel competent) but trying to wrap my head around the Python frameworks available for making this easier. My gut is to jump into the weeds and I know if I implement this as I'm thinking, I might as well have written it in C. Don't want that. I want to leverage as many frameworks and libraries for Python as possible. Performance is not a top priority.
Approaches I've considered:
In doing research, I came across Twisted which might be a fit and seems to make sense to me (thinking about this as a daemon). I'm imagining the final product would be a Twisted app that exposes a REST interface, dispatches new threads connecting to the third party service for each client request received, and would manage its own thread pool. I'm familiar with threading, though admittedly haven't done anything in Python with them yet. In a nutshell, Twisted looks very cool, though in the end, I'm left wondering if I'm overcomplicating this.
The second approach I considered is using Celery and Flask and simply letting Celery handle all the dispatching, thread management, etc. I found this article showing Celery and Flask playing nicely together. It seems like a much simpler approach.
After writing this, I'm leaning towards the second option of using Celery and Flask, though I don't know much about Celery, so I'm looking for any advice you might have, as well as other possible architectures that I'm not considering. I really appreciate it and thank you in advance.
Yes, Twisted is overkill here.
From what you described, the combination of Celery and Flask would suffice. It would allow you to implement a REST interface that kicks off your long-running processes as Celery tasks. You can easily implement a REST method allowing clients to stop running tasks by invoking Celery's revoke method on a task's ID. Note that Celery depends on a message broker for sending and receiving messages (frequently RabbitMQ is used) and a result backend for storing results (frequently Redis is used).
>>> from celery.task.control import revoke
>>> revoke(task_id, terminate=True)
http://docs.celeryproject.org/en/latest/userguide/workers.html#revoking-tasks
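A rough sketch of how that could fit together; the route names, broker/backend URLs, and the task body are assumptions for illustration, not something prescribed by Celery or Flask:

from celery import Celery
from flask import Flask, jsonify, request

app = Flask(__name__)
celery = Celery("worker",
                broker="redis://localhost:6379/0",
                backend="redis://localhost:6379/1")

@celery.task(bind=True)
def long_running_connection(self, upstream_url):
    # Placeholder: hold a persistent connection to the third-party server
    # and push received data onto a queue until the connection closes
    # or the task is revoked.
    pass

@app.route("/processes", methods=["POST"])
def start_process():
    result = long_running_connection.delay(request.json["url"])
    return jsonify({"task_id": result.id}), 202

@app.route("/processes/<task_id>", methods=["DELETE"])
def stop_process(task_id):
    celery.control.revoke(task_id, terminate=True)  # same call as the snippet above
    return "", 204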
I have been learning about Django recently and have stumbled upon Celery. I don't seem to understand what it does. I've been to their site to no avail. Can anyone explain the concept and its real-world applications to me (in simple terms)?
Celery is an "asynchronous task queue/job queue based on distributed message passing". It is just a task queue: something you put tasks into to be done as soon as possible. You have a Celery instance that you integrate directly with your Django or Python app; this is what you use to talk to Celery. Then you can configure Celery to have 'workers' that perform the tasks you give them. The whole point is to be able to do tasks that don't fit well within the normal request/response cycle that Django otherwise handles so well.
What kinds of tasks are these? Well, as said before, they don't fit into the normal request/response cycle. The best example I can think of is emails- if you're building a web app and you want to keep your users, you need to keep them engaged and coming back, and a good way to do that is by sending emails. You send them once a week or once a day and they can maybe configure when to send. This would fit horribly within the request/response cycle, but it's perfect for something like Celery.
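As an illustration, scheduled e-mails are usually wired up with Celery's beat scheduler. A minimal sketch, where the module name, schedule, and task body are made up:

# emails.py
from celery import Celery
from celery.schedules import crontab

app = Celery("emails", broker="redis://localhost:6379/0")

@app.task
def send_weekly_digest(user_id):
    # Placeholder: build and send the e-mail outside the request/response cycle.
    pass

app.conf.beat_schedule = {
    "weekly-digest-for-user-42": {
        "task": "emails.send_weekly_digest",
        "schedule": crontab(minute=0, hour=9, day_of_week="mon"),
        "args": (42,),
    },
}

# Run a worker plus the beat scheduler, e.g.:
#   $ celery -A emails worker --beat --loglevel=info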
Other examples are long-running jobs with lots of computation. While you would typically use something like Hadoop for really big computations, you can schedule some queries with Celery. You could also use it to schedule builds if you're doing something like Travis. The uses go on and on, but you probably get the point.
Question is relevant to this and this; the difference is, I'd prefer something with possibly more precision and lower load (a per-minute cron job isn't preferable for this) and with minimal overhead (i.e. installing Celery with RabbitMQ seems like big overkill).
An example of such a task is a personal reminder server (with reminders that can be edited over the web and sent out through e-mail or XMPP).
I'm probably looking for something more like Node.js's setTimeout but for Django (and though I might prefer to implement reminders in Node.js anyway, it's still a possibly interesting question).
For example, it's possible to start new threads in the Django app (with functions consisting of sleep() and send()); in what ways can this be bad?
The problems with using threads for this are the typical problems with Python threads that always drive people towards multi-process solutions instead. The problem is compounded here by the fact that your thread isn't driven by the normal request-response cycle. This is summarized nicely by Malcolm Tredinnick here:
Have to disagree. Threads are not a good solution to this problem. The issue is process management. As written, your threads will never be rejoined. Webserver processes have a lifecycle uncontrollable by you (the MaxRequestsPerChild Apache parameter and similar things in other servers) and you are messing with that by using threads.
If you need a process with a lifecycle that is not matched by the request-response path — something long running and independent of the response — a completely separate process is definitely the right model to use. Using a thread is tying it to the response lifecycle, which will have unintended side-effects.
A possible solution for you might be to have a long-running process that performs your tasks and gets a wake-up signal from a light cron process.
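A minimal sketch of that general shape, here with a short polling sleep standing in for the cron wake-up; fetch_due_reminders and deliver are placeholders for your own storage and e-mail/XMPP code:

import time
from datetime import datetime

POLL_SECONDS = 10  # finer-grained than a per-minute cron, still cheap

def fetch_due_reminders(now):
    # Placeholder: query your reminders table (or a Redis sorted set) here.
    return []

def deliver(reminder):
    # Placeholder: send the reminder by e-mail or XMPP here.
    print("delivering", reminder)

def run_forever():
    while True:
        for reminder in fetch_due_reminders(now=datetime.utcnow()):
            deliver(reminder)
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    run_forever()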
Another possibility would be to build something using 0mq, which is much lighter than AMQP-style queues (at the cost of some features, of course). Tarek Ziade is working on a Mozilla project called powerhose that uses 0mq; it looks super simple and has a heartbeat capability with resolution to the second.
I'm writing some software which will manage a few hundred small systems in “the field” over an intermittent 3G (or similar) connection.
Home base will need to send jobs to the systems in the field (eg, “report on your status”, “update your software”, etc), and the systems in the field will need to send jobs back to the server (eg, “a failure has been detected”, “here is some data”, etc).
I've spent some time looking at Celery and it seems to be a perfect fit: celeryd running at home base could collect jobs for the systems in the field, a celeryd running on the field systems could collect jobs for the server, and these jobs could be exchanged as clients become available.
So, is Celery a good fit for this problem? Specifically:
The majority of tasks will be directed to an individual worker (eg, “send the ‘get_status’ job to ‘system51’”) — will this be a problem?
Does it gracefully handle adverse network conditions (like, eg, connections dying)?
What functionality is only available if RabbitMQ is being used as a backend? (I'd rather not run RabbitMQ on the field systems)
Is there any other reason Celery could make my life difficult if I use it like I've described?
Thanks!
(it would be valid to suggest that Celery is overkill, but there are other reasons that it would make my life easier, so I would like to consider it)
The majority of tasks will be directed to an individual worker (eg, “send the ‘get_status’ job to ‘system51’”) — will this be a problem?
Not at all. Just create a queue for each worker, e.g. each node listens to a round-robin queue called default and also to its own queue, named after its node name:
(a)$ celeryd -n a.example.com -Q default,a.example.com
(b)$ celeryd -n b.example.com -Q default,b.example.com
(c)$ celeryd -n c.example.com -Q default,c.example.com
Routing a task directly to a node is simple:
>>> get_status.apply_async(args, kwargs, queue="a.example.com")
or by configuration using a Router:
# Always route "app.get_status" to "a.example.com"
CELERY_ROUTES = {"app.get_status": {"queue": "a.example.com"}}
Does it gracefully handle adverse network conditions (like, eg, connections dying)?
The worker gracefully recovers from broker connection failures, at least with RabbitMQ; I'm not sure about all the other backends, but this is easy to test and fix (you only need to add the related exceptions to a list).
For the client, you can always retry sending the task if the connection is down (see the sketch below), or you can set up HA with RabbitMQ: http://www.rabbitmq.com/pacemaker.html
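For example, task publishing can be told to retry on its own. A sketch using the get_status task from above; the retry values are illustrative, not recommendations:

get_status.apply_async(
    args=("system51",),
    queue="a.example.com",
    retry=True,                 # retry publishing if the broker connection is down
    retry_policy={
        "max_retries": 5,       # give up after five attempts
        "interval_start": 0,    # retry immediately the first time
        "interval_step": 2,     # then back off by 2 seconds per attempt
        "interval_max": 30,     # but never wait longer than 30 seconds
    },
)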
What functionality is only available if RabbitMQ is being used as a backend? (I'd rather not run RabbitMQ on the field systems)
Remote control commands; also, with the other backends only "direct" exchanges are supported (not "topic" or "fanout"), but that will be supported in Kombu (http://github.com/ask/kombu).
I would seriously reconsider your decision against RabbitMQ. Why do you think it's not a good fit?
IMHO I wouldn't look elsewhere for a system like this (except maybe ZeroMQ if the system is transient and you don't require message persistence).
Is there any other reason Celery could make my life difficult if I use it like I've described?
I can't think of anything from what you describe above. Since the concurrency model is multiprocessing it does require some memory (I'm working on adding support for thread pools and eventlet pools, which may help in some cases).
(it would be valid to suggest that Celery is overkill, but there are other reasons that it would make my life easier, so I would like to consider it)
In that case I think you use the word overkill lightly. It really depends on how much code and tests you need to write without it. I think it's better to improve an already existing general solution, and in theory it sounds like it should work well for your application.
I would probably set up a (Django) web service to accept requests. The web service could do the job of validating requests and deflecting bad requests. Then Celery can just do the work.
This would require the remote devices to poll the web service to see if their jobs were done though. That may or may not be appropriate, depending on what exactly you're doing.
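A sketch of what that polling could look like on a field device; the base URL, endpoint shape, and timing are assumptions for illustration:

import time
import requests

BASE_URL = "https://homebase.example.com/api"

def handle(job):
    # Placeholder: act on the job ("report on your status", "update your software", ...).
    print("handling", job)

def poll_for_jobs(device_id):
    while True:
        try:
            resp = requests.get(f"{BASE_URL}/devices/{device_id}/jobs", timeout=10)
            resp.raise_for_status()
            for job in resp.json():
                handle(job)
        except requests.RequestException:
            pass  # the 3G link is intermittent; just try again on the next cycle
        time.sleep(60)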