I have to spawn Celery tasks that need some kind of namespace (for example, a user id).
So I spawn them like this:
scrapper_start.apply_async((request.user.id,), queue=account.Account_username)
app.control.add_consumer(account.Account_username, reply=True)
The tasks also spawn recursively, from other tasks.
Now I have to check whether any tasks in a given queue are still executing. I tried checking the list length in Redis, but it only returns the true number before Celery starts executing.
How can I solve this? I only need to check whether the queue or consumer is still executing or is already empty. Thanks
If you just want to inspect the queue, you can do this from the command line itself.
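from celery.task.control import inspect  # on Celery 5+, use app.control.inspect() instead
i = inspect()  # the optional argument is a list of worker names, not a task name
i.active()  # get a list of active tasks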
In addition to checking which tasks are currently executing, you can also do the following.
i.registered()  # task names registered on each worker
i.scheduled()  # tasks with an ETA or countdown that are waiting to run
i.reserved()  # tasks that have been received but are waiting to be executed
This kind of command-line inspection is fine if you only want to check once in a while.
If you want to monitor the tasks continuously, you can use Flower, which provides a nice web interface for monitoring workers.
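To turn those inspection calls into a yes/no answer for the original question, one option is a small helper that looks at the replies. This is only a sketch: workers_are_idle is a hypothetical name, it assumes each reply is the usual dict mapping a worker name to a list of tasks (or None when no worker answered in time), and it cannot see messages that are still sitting in the broker and have not been fetched by any worker.

```python
def workers_are_idle(*replies):
    """Return True if none of the given inspect replies lists any task.

    Each reply is assumed to be what inspect().active(), .reserved() or
    .scheduled() returns: a dict of worker name -> list of tasks, or None
    when no worker replied. A None reply is treated as "nothing reported".
    """
    for reply in replies:
        if not reply:
            continue
        # Any non-empty task list on any worker means work is still pending.
        if any(tasks for tasks in reply.values()):
            return False
    return True


# With a running broker and workers (not executed here):
# i = app.control.inspect()
# idle = workers_are_idle(i.active(), i.reserved(), i.scheduled())
```

Because a missing reply is treated the same as an empty one, it may be worth checking separately that at least one worker answered before trusting a True result.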
I recently started putting together a webapp with Plotly Dash. I have a callback function that updates a DataTable with data fetched from a Redis server. The code that connects to Redis and downloads the data was originally developed to be used elsewhere, in scripts that run standalone either from the command line or through scheduling systems. The scripts run fine. The code that fetches the data can be run either sequentially or in parallel via multiprocessing. The multiprocessing code is typical for the use case: it creates two queues, one for pending tasks and one for completed tasks. An infinite while loop listens on the completed-tasks queue and picks up the completed tasks until all of the tasks are finished. Multiprocessing is used because, for each key/value pair fetched from Redis, the value is a big object that needs unpickling, which is relatively time-consuming.
To cut a long story short, when the code is executed via the Dash callback function, the tasks are inserted in the pending queue and the infinite while loop listens on the completed-tasks queue, but no tasks get executed. For some reason, in the example below the function do_work never gets executed by any worker at all.
# Set-up and start the workers
for c in range(num_workers):
    p = mp.Process(target=do_work, args=(tasks_pending, tasks_completed, verbose))
    p.name = 'worker' + str(c)
    processes.append(p)
    p.start()
I did have a look around multiprocessing context managers and Flask etc but I didn't manage to make it work. Any idea what is going on and why Dash (or Flask) is a special case? Any hints or pointers to the right direction would be great.
Many thanks!
You can define a multiprocessing Queue and then pass it to the callback via app.
events_messages = multiprocessing.Queue()
app.queue = events_messages
then you can add messages or read them in the callback function:
app.queue.put('your item for the Queue')
Setup
Celery 3.0
broker=RabbitMQ
Scenario
Tasks have already been acknowledged and have started processing, so they have state=STARTED. I then want to restart my worker (to update it to a newer version). After restarting the worker (using supervisorctl restart), those long-running tasks are all terminated, but their states remain at state=STARTED. How can I update their state to FAILURE or some other value? (And I don't want these tasks to be executed again after the worker restarts.)
Methods tried (but not working)
use track_started=True --- With this option, the tasks stay at state=STARTED after the worker restarts. Without it, the tasks stay at state=PENDING after the worker restarts.
use CELERY_ACKS_LATE=True --- The tasks stay at state=STARTED after the worker restarts, and they are executed again, which is not the desired behavior.
use signal(SIGTERM, handler) and a handler function to catch the signal. The handler is entered successfully. However, no matter what I put inside the handler, it can't change the task's state. The state stays the same and doesn't change to FAILURE. Inside the handler I've tried
raise Exception
exit(0)
exit(1)
Is there any setting in Celery that would let it track the state of tasks that are being shut down?
You need to revoke the tasks on worker shutdown. Take a look at this issue for the actual code.
I believe a better way would be to adjust stopwaitsecs in your supervisor config to be longer than your time limit per task. Supervisor would then wait for your tasks to finish without killing them, so your tasks always finish normally and there are no states to fix in the first place.
This also depends on how long your tasks are. If they run too long to wait for, it may be better to split them into shorter ones.
I found that the last method you tried (signal(SIGTERM, handler)) works. Since the code successfully enters the SIGTERM handling branch, the crux becomes how to update the state of the task. Simply use self.update_state(state=states.FAILURE) inside the task body.
Here's an example:
import time
import signal
from celery import Celery, states
app = Celery("celery")
@app.task(bind=True)
def demo(self):
    # On SIGTERM, record the task as failed before the worker goes away
    def mark_as_fail(*args):
        self.update_state(state=states.FAILURE)
    signal.signal(signal.SIGTERM, mark_as_fail)
    for i in range(30):
        print(f"I am demo task {i}")
        time.sleep(1)
I'm using Celery 3.1.x with 2 tasks. The first task (TaskOne) is enqueued when Celery starts up through the celeryd_after_setup signal:
@celeryd_after_setup.connect
def celeryd_after_setup(*args, **kwargs):
    TaskOne().apply_async(countdown=5)
When TaskOne is run, it does some calculations and then enqueues TaskTwo. Imagine the following workflow:
I start celery, thus the signal is fired and TaskOne is enqueued
after the countdown (5) TaskTwo is enqueued
then I stop celery (the TaskTwo remains in the queue)
afterwards I restart celery
the workflow is run again and TaskTwo is enqueued again
So we have 2 TaskTwo in the queue. That is a problem for my workflow because I only want one TaskTwo within the queue and avoid that a second one is enqueued.
My question: How can I achieve this?
With celery.app.control.Inspect.scheduled() (Docs) I can get a list of which tasks are scheduled, hidden in a combination of lists and dicts. This is maybe a way, but going through the result of this does not feel right. Is there any better way?
An easy-to-implement solution would be to add the --purge switch to your worker command. It will clear the queue, and the worker starts with no scheduled jobs.
But beware: that's a kind of job-global, unrecoverable action. When there are other scheduled jobs you depend on, this is not your solution.
After considering several options I chose to use app.control.inspect.
It's not a really beautiful solution, but it works:
# fetch all scheduled tasks
scheduled_tasks = inspect().scheduled()
# iterate the scheduled task values, see http://docs.celeryproject.org/en/latest/userguide/workers.html?highlight=revoke#dump-of-scheduled-eta-tasks
for task_values in iter(scheduled_tasks.values()):
    # task_values is a list of dicts
    for task in task_values:
        if task['request']['name'] == '{}.{}'.format(TaskTwo.__module__, TaskTwo.__name__):
            logger.info('TaskTwo is already scheduled, skipping additional run')
            return
I'm running Django, Celery and RabbitMQ. What I'm trying to achieve is to ensure that tasks related to one user are executed in order (specifically, one at a time; I don't want task concurrency per user).
Whenever a new task is added for a user, it should depend on the most recently added task. Additional functionality might include not adding a task to the queue if a task of this type is already queued for this user and has not yet started.
I've done some research and:
I couldn't find a way to link a newly created task with an already queued one in Celery itself; chains seem to be only able to link new tasks.
I think that both functionalities could be implemented with a custom RabbitMQ message handler, though it might be hard to code.
I've also read about celery-tasktree, and this might be the easiest way to ensure execution order, but how do I link a new task with an already "applied_async" task_tree or queue? Is there any way that I could implement the additional no-duplicate functionality using this package?
Edit: There is also this "lock" example in the Celery cookbook. The concept is fine, but I can't see a way to make it work as intended in my case: if I can't acquire the lock for a user, the task would have to be retried, and that means pushing it to the end of the queue.
What would be the best course of action here?
If you configure the celery workers so that they can only execute one task at a time (see worker_concurrency setting), then you could enforce the concurrency that you need on a per user basis. Using a method like
NUMBER_OF_CELERY_WORKERS = 10
def get_task_queue_for_user(user):
    return "user_queue_{}".format(user.id % NUMBER_OF_CELERY_WORKERS)
to get the task queue based on the user id, every task will be assigned to the same queue for each user. The workers would need to be configured to only consume tasks from a single task queue.
It would play out like this:
User 49 triggers a task
The task is sent to user_queue_9
When the one and only celery worker that is listening to user_queue_9 is ready to consume a new task, the task is executed
This is a hacky answer though, because
requiring just a single celery worker for each queue is a brittle system -- if the celery worker stops, the whole queue stops
the workers are running inefficiently
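The routing described above can be checked with a few lines of plain Python. This is a sketch only: queue_name_for_user_id is a hypothetical variant of the function above that takes the numeric id directly, and the project name proj and task name some_task in the comments are placeholders.

```python
NUMBER_OF_CELERY_WORKERS = 10


def queue_name_for_user_id(user_id, worker_count=NUMBER_OF_CELERY_WORKERS):
    """Map a user id onto one of `worker_count` queue names."""
    return "user_queue_{}".format(user_id % worker_count)


# Users 9, 19, 29, ... 49 all share one queue, so their tasks are serialized
# together by whichever single-process worker consumes that queue.
print(queue_name_for_user_id(49))

# Sending a task to that queue (needs a broker, not executed here):
# some_task.apply_async(args=(user.id,), queue=queue_name_for_user_id(user.id))
# One worker per queue, one process each, for example:
# celery -A proj worker -Q user_queue_9 --concurrency=1
```

Note that the modulo mapping also serializes unrelated users that happen to land on the same queue, which is part of the inefficiency mentioned above.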
From reading the Celery documentation, it looks like I should be able to use the following python code to list tasks on the queue that have not yet been picked up:
from celery.task.control import inspect
i = inspect()
tasks = i.reserved()
However, when running this code, the list of tasks is empty, even if there are items waiting in the queue (I have verified they are definitely in the queue using django-admin). The same is true for using the command line equivalent:
$ celeryctl inspect reserved
So I'm guessing this is not in fact what this command is for? If not, what is the accepted way to retrieve a list of tasks that have not yet started? Do I have to maintain my own list of task IDs in the code in order to query them?
The reason I ask is because I am trying to handle a situation where two tasks are queued which perform a write operation on the same object in the database. If both tasks execute in parallel and task 1 takes longer than task 2, it will overwrite the output from task 2, but I want the output from the most recent task i.e. task 2. So my plan was to cancel any pending tasks that operate on an object each time a new task is added which will write to the same object.
Thanks
Tom
You can see pending tasks using scheduled instead of reserved.
$ celeryctl inspect scheduled
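For the stated goal of cancelling earlier tasks before queueing a new one, a rough sketch is to collect the ids of matching tasks from the inspect replies and revoke them. Several things here are assumptions: pending_task_ids is a hypothetical helper, the task name in the comments is a placeholder, entries from scheduled() are assumed to wrap the task under a "request" key while entries from reserved() are assumed to be the task dict itself, and tasks that no worker has fetched from the broker yet will not show up in either reply.

```python
def pending_task_ids(reply, task_name):
    """Collect ids of tasks named `task_name` from one inspect reply.

    `reply` is a dict of worker name -> list of entries, or None when no
    worker answered.
    """
    ids = []
    for entries in (reply or {}).values():
        for entry in entries:
            # scheduled() entries nest the task under "request";
            # reserved() entries are the task description itself.
            request = entry.get("request", entry)
            if request.get("name") == task_name:
                ids.append(request["id"])
    return ids


# With a running broker and workers (not executed here):
# i = app.control.inspect()
# for reply in (i.scheduled(), i.reserved()):
#     for task_id in pending_task_ids(reply, "myapp.tasks.update_object"):
#         app.control.revoke(task_id)
```

Filtering on the task arguments as well (to match a specific database object) is left out, since how arguments are represented in the reply may differ between Celery versions.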