How to replace threads in a Django project with a Celery worker - python

I know it's not best practice to use threads in a Django project, but I have a project that is using them:
threading.Thread(target=save_data, args=(dp, conv_handler)).start()
I want to replace this code with Celery, i.e. run a worker with the function
save_data(dispatcher, conversion)
Inside save_data I have an infinite loop, and in this loop I save the states of dispatcher and conversation to a file on disk with pickle.
I want to know whether I can use Celery for such work.
Can the worker see changes of state in dispatcher and conversation?

I personally don't like long-running tasks in Celery. Normally you will have a maximum task time, and if your task takes too long it can time out. The best tasks for Celery are quick, stateless ones.
Note that Celery arguments are serialized when you launch a task, so passing a Python object as a task argument is tricky (and not recommended).
I would need more info about the problem you are trying to solve, but if dispatcher and conversion are Django objects, I would pass their ids and fetch them inside the task:

def save_data(dispatcher_id, conversion_id):
    dispatcher = Dispatcher.objects.get(id=dispatcher_id)
    conversion = Conversion.objects.get(id=conversion_id)
And you should avoid that infinite loop in a Celery task. You can work around it by calling save_data periodically, but I encourage you to find a solution that fits Celery better (aim for quick, stateless tasks). A sketch of the periodic approach follows.
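For example, a minimal sketch of the periodic approach using Celery beat; the broker URL, the interval, and the ids passed as args are illustrative assumptions, not from the original project:

# a minimal sketch, assuming a recent Celery (4.x+) and a Redis broker
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "save-state-every-10-seconds": {
        "task": "tasks.save_data",  # the task sketched above
        "schedule": 10.0,           # seconds between runs
        "args": (1, 2),             # hypothetical dispatcher/conversion ids
    },
}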


How to profile memory usage of a Celery task?

I have a Django application that runs background tasks using the Celery lib, and I need to obtain and store the maximum memory usage of a task.
I've tried memory_usage from the memory_profiler library, but I cannot use this function inside a task because I get the error: "daemonic processes are not allowed to have children". I've also tried the memory_usage function outside the task, to monitor the task's async call, but for some reason the task is triggered twice.
All the other ways I found consist of checking the memory usage at different places in the code and then taking the maximum, but I have the feeling that this is very inaccurate: some calls with high memory usage are probably missed because garbage collection runs before I manage to check the current usage.
The official documentation has some useful functions, but they would have to rely on the method above. https://docs.celeryproject.org/en/latest/reference/celery.utils.debug.html
Thanks in advance!
Why not a controller task?
Celery's infrastructure lets you query the current status of all workers:
from celery import Celery

app = Celery(...)
# returns a mapping of worker name -> list of currently active tasks
app.control.inspect().active()
This can be used inside a task to poll the cluster every few seconds and understand what's happening.
I've used a similar approach to identify and send the kill() command between tasks. My tasks are killable, so each of them knows how to handle the soft kill.
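A minimal sketch of such a controller task; the import path of the Celery app and what you do with the results are assumptions:

from proj.celery import app  # hypothetical location of your Celery app

@app.task
def monitor_cluster():
    # inspect().active() returns {worker_name: [task_info, ...]},
    # or None when no workers reply
    active = app.control.inspect().active() or {}
    for worker, tasks in active.items():
        for task in tasks:
            print(worker, task["id"], task["name"])

Schedule monitor_cluster to run every few seconds (with Celery beat, for instance) to keep an eye on the cluster.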

How to ensure task execution order per user using Celery, RabbitMQ and Django?

I'm running Django, Celery, and RabbitMQ. What I'm trying to achieve is to ensure that tasks related to one user are executed in order (specifically, one at a time; I don't want task concurrency per user):
whenever a new task is added for a user, it should depend on the most recently added task. Additional functionality might include not adding a task to the queue if a task of this type is already queued for this user and has not yet started.
I've done some research:
- I couldn't find a way to link a newly created task with an already queued one in Celery itself; chains seem to be able to link only new tasks.
- I think both functionalities could be implemented with a custom RabbitMQ message handler, though that might be hard to code.
- I've also read about celery-tasktree, and this might be the easiest way to ensure execution order, but how do I link a new task with a task tree or queue that has already been applied_async? Is there any way I could implement the additional no-duplicate functionality using this package?
Edit: there is also this "lock" example in the Celery cookbook, and while the concept is fine, I can't see a way to make it work as intended in my case: if I can't acquire the lock for a user, the task has to be retried, which means pushing it to the end of the queue.
What would be the best course of action here?
If you configure the Celery workers so that they can only execute one task at a time (see the worker_concurrency setting), then you can enforce the concurrency that you need on a per-user basis. Using a method like
NUMBER_OF_CELERY_WORKERS = 10

def get_task_queue_for_user(user):
    return "user_queue_{}".format(user.id % NUMBER_OF_CELERY_WORKERS)
to get the task queue based on the user id, every task for a given user will be assigned to the same queue. The workers would need to be configured to consume tasks from a single task queue only; a usage sketch follows.
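For example, dispatching would look something like this; process_user_data is a hypothetical task name:

# route the task to the queue derived from the user id
process_user_data.apply_async(args=[user.id],
                              queue=get_task_queue_for_user(user))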
It would play out like this:
1. User 49 triggers a task.
2. The task is sent to user_queue_9.
3. When the one and only Celery worker that is listening to user_queue_9 is ready to consume a new task, the task is executed.
This is a hacky answer, though, because:
- requiring just a single Celery worker for each queue makes for a brittle system: if that worker stops, the whole queue stops
- the workers run inefficiently

What's an alternative choice for a background worker / async deferred task queue in django/wsgi besides celery?

Are there any pure WSGI implementations of background tasks?
I want to use local variables in the same context directly, not serialize/deserialize them to another daemon process via a broker.
Is it possible to make this happen under the current WSGI infrastructure, e.g. run some callback functions after the response has been returned?
This is a duplicate of a question asked on the Python WEB-SIG. I reference the same page that was provided in response to that question so others can see it:
http://code.google.com/p/modwsgi/wiki/RegisteringCleanupCode
In doing this, though, you tie up the request thread, so it will not be able to handle other requests until your task has finished.
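For illustration, a minimal sketch of the run-at-end-of-request pattern that page describes, written as generic WSGI middleware; the callback is a hypothetical stand-in for your deferred work:

def run_after_response(app, callback):
    # wrap a WSGI app so callback() runs once the server has finished
    # iterating (and closing) the response - note this happens on the
    # request thread, as described above
    def middleware(environ, start_response):
        try:
            for chunk in app(environ, start_response):
                yield chunk
        finally:
            callback()
    return middleware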
Creating background threads at the end of a request is not a good idea unless you use a pooling mechanism that limits the number of worker threads for your tasks. And because the process can crash or be shut down, you can lose the job: it is held only in memory and is thus not persistent.
Better to use Celery, or, if you think that is too heavyweight, have a look at Redis Queue (RQ) instead.
You could look at Django async. It uses an in-database queue and so handles transactions much better. All arguments need to be JSONable, as does the return type. In some cases this means you may need to schedule a wrapper function, but that oughtn't to cause you any headaches.
http://pypi.python.org/pypi/django-async
You don't want to be doing this sort of thing inside the web server; it's absolutely not the right place for it. Django async provides a manage.py command for flushing the queue, which you can run in a loop, possibly on a different machine from the web server.

Script needs to be run as a Celery task. What consequences does this have?

My task is to write a script using OpenCV that will later run as a Celery task. What consequences does this have? What do I have to pay attention to? Is it enough, in the end, to include two lines of code, or could it be that I have to rewrite my whole script?
I read that Celery is an "asynchronous task queue/job queuing system based on distributed message passing", but I won't pretend to fully understand what that all entails.
I'll update the question as soon as I get more details.
Celery implies a daemon using a broker (a data hub used to queue tasks). The celeryd daemon and the broker (RabbitMQ, Redis, MongoDB, or something else) should always run in the background.
Your tasks will be queued, which means they won't all run at the same time. You can choose the maximum number that may run concurrently; the rest will wait for the others to finish before starting. This also means some concurrency is often expected, and that you must write tasks that play nicely with others doing the same thing.
Celery is not meant to run scripts but tasks, written as Python functions. You can of course execute external scripts from Python, but your entry point is always a Python function.
Celery uses Kombu, which uses a message broker to dispatch the tasks. This implies that the data you pass to your tasks should be serializable.
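For illustration, a minimal sketch of such an entry point; the broker URL and the OpenCV processing are assumptions, not your script:

from celery import Celery
import cv2  # OpenCV

app = Celery("tasks", broker="redis://localhost:6379/0")  # hypothetical broker

@app.task
def process_image(path):
    # pass a file path, not the image itself: task arguments must be
    # serializable, and raw image data is large
    img = cv2.imread(path)
    edges = cv2.Canny(img, 100, 200)  # illustrative OpenCV call
    out_path = path + ".edges.png"
    cv2.imwrite(out_path, edges)
    return out_path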

How to track revoked tasks across multiple celeryd processes

I have a reminder-type app that schedules tasks in Celery using the "eta" argument. If the parameters of the reminder object change (e.g. the time of the reminder), I revoke the previously submitted task and queue a new one.
I was wondering if there's any good way of keeping track of revoked tasks across celeryd restarts. I'd like to be able to scale celeryd processes up and down on the fly, and it seems that any celeryd processes started after the revoke command was sent will still execute that task.
One way of doing it is to keep a list of revoked task ids, but this list would grow arbitrarily. Pruning it requires a guarantee that the task is no longer in the RabbitMQ queue, which doesn't seem possible.
I've also tried using a shared --statedb file for the celeryd workers, but it seems the statedb file is only updated on worker termination and is thus not suitable for what I'd like to accomplish.
Thanks in advance!
Interesting problem, I think it should be easy to solve using broadcast commands.
When a new worker starts up, it can ask all the other workers to dump their revoked tasks to it. That takes two new remote control commands, which you can easily add using @Panel.register.
Module control.py:
from celery.worker import state
from celery.worker.control import Panel

@Panel.register
def bulk_revoke(panel, ids):
    state.revoked.update(ids)

@Panel.register
def broadcast_revokes(panel, destination):
    panel.app.control.broadcast("bulk_revoke",
                                arguments={"ids": list(state.revoked)},
                                destination=destination)
Add it to CELERY_IMPORTS:
CELERY_IMPORTS = ("control", )
The only missing piece is connecting it so that the new worker triggers broadcast_revokes at startup. I guess you could use the worker_ready signal for this:
from celery import current_app as celery
from celery.signals import worker_ready

@worker_ready.connect
def request_revokes_at_startup(sender=None, **kwargs):
    # destination expects a list of worker hostnames
    celery.control.broadcast("broadcast_revokes",
                             destination=[sender.hostname])
I had to do something similar in my project and used celerycam with django-admin-monitor. The monitor takes a snapshot of the tasks and saves them to the database periodically, and there is a nice user interface to browse and check the status of all tasks. You can use it even if your project is not Django based.
I implemented something similar to this some time ago, and the solution I came up with was very similar to yours.
The way I solved this problem was to have the worker fetch the Task object from the database when the job ran (by passing it the primary key, as the documentation recommends). In your case, before the reminder is sent, the worker should check that the task is still "ready" to run. If it isn't, it should simply return without doing any work (on the assumption that the ETA has changed and another worker will pick up the new job).
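A sketch of that check; the Reminder model, its fields, and send() are assumptions, not from the original post:

from celery import shared_task
from myapp.models import Reminder  # hypothetical model

@shared_task
def send_reminder(reminder_id, scheduled_for):
    reminder = Reminder.objects.get(pk=reminder_id)
    # if the reminder was rescheduled after this task was queued, do
    # nothing; the task queued with the new ETA will handle it
    if reminder.remind_at != scheduled_for:
        return
    reminder.send()  # hypothetical send method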
