I use Celery/RabbitMQ for asynchronous task execution with my django application. I have just started working with Celery.
The tasks execute and everything works perfectly once I start the worker.
The problem is that task execution stops some time later: after a couple of hours, a day, or sometimes a couple of days. I only realise it from the consequences of incomplete task executions. Then I restart Celery, all the pending tasks get executed, and everything is back to normal.
My questions are:
How can I debug (where to start looking) to find out what the problem is?
How can I create a mechanism that will notify me immediately after the problem starts?
My Stack:
Django 1.4.8
Celery 3.1.16
RabbitMQ
Supervisord
Thanks,
andy
(1) If your Celery worker gets stuck sometimes, you can use strace & lsof to find out at which system call it gets stuck.
For example:
$ strace -p 10268 -s 10000
Process 10268 attached - interrupt to quit
recvfrom(5,
10268 is the PID of the Celery worker; recvfrom(5 means the worker is blocked receiving data from file descriptor 5.
Then you can use lsof to check what file descriptor 5 is in this worker process.
lsof -p 10268
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
......
celery 10268 root 5u IPv4 828871825 0t0 TCP 172.16.201.40:36162->10.13.244.205:wap-wsp (ESTABLISHED)
......
It indicates that the worker is stuck on a TCP connection (you can see 5u in the FD column).
Some Python packages like requests block while waiting for data from the peer, which can cause the Celery worker to hang. If you are using requests, please make sure to set the timeout argument.
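For example, a minimal sketch of what I mean; fetch_page and the timeout values are just illustrative, not code from your project:

import requests
from celery import shared_task

@shared_task
def fetch_page(url):
    # Without timeout=, a silent peer can block this call (and the worker) indefinitely.
    response = requests.get(url, timeout=(5, 30))  # 5s to connect, 30s to read
    return response.status_code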
(2) You can monitor your Celery task queue size in RabbitMQ; if it keeps increasing over a long period, the Celery worker has probably gone on strike.
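For example, from the broker host you can watch the queue depth (the queue is named celery by default, adjust to your own setup):

$ rabbitmqctl list_queues name messages messages_unacknowledged consumers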
Have you seen this page:
https://www.caktusgroup.com/blog/2013/10/30/using-strace-debug-stuck-celery-tasks/
In my setup I am using Gunicorn for my deployment on a single-CPU machine, with three worker processes. I came to ask this question from this answer: https://stackoverflow.com/a/53327191/10268003 . I have found that it takes up to one and a half seconds to send a mail, so I was trying to send email asynchronously. I am trying to understand what will happen to the worker process started by Gunicorn, which will be starting a new thread to send the mail: will the process get blocked until the mail-sending thread finishes? In that case I believe my application's throughput will decrease. I did not want to use Celery because setting it up just for sending emails seems to be overkill. I am currently running two containers on the same machine, with three Gunicorn workers each, on my development machine.
Below is the approach in question; the only difference is that I will be using threading for sending mails.
import threading

from django.http import JsonResponse

from .models import Crawl

def startCrawl(request):
    task = Crawl()
    task.save()
    t = threading.Thread(target=doCrawl, args=[task.id])
    t.daemon = True  # don't keep the process alive just for this thread
    t.start()
    return JsonResponse({'id': task.id})

def checkCrawl(request, id):
    task = Crawl.objects.get(pk=id)
    return JsonResponse({'is_done': task.is_done, 'result': task.result})

def doCrawl(id):
    task = Crawl.objects.get(pk=id)
    result = None  # Do crawling, etc., and put the outcome in result
    task.result = result
    task.is_done = True
    task.save()
Assuming that you are using Gunicorn's sync (default), gthread or async workers, you can indeed spawn threads and Gunicorn will take no notice of them or interfere. The worker threads are reused to answer subsequent requests immediately after returning a result, not only after all spawned threads are joined again.
I have used this code to fire an independent event a minute or so after a request:
Timer(timeout, function_that_does_something, [arguments_to_function]).start()
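For completeness, a minimal sketch of that pattern; notify_user and the 60-second delay are placeholders, not code from my project:

from threading import Timer

def notify_user(user_id):
    # runs in its own thread roughly 60 seconds after .start() is called
    print("notifying user", user_id)

Timer(60, notify_user, [42]).start()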
You will find some more technical details in this other answer:
In normal operations, these Workers run in a loop until the Master either tells them to gracefully shut down or kills them. Workers will periodically issue a heartbeat to the Master to indicate that they are still alive and working. If a heartbeat timeout occurs, then the Master will kill the Worker and restart it.
Therefore, daemon and non-daemon threads that do not interfere with the Worker's main loop should have no impact. If the thread does interfere with the Worker's main loop, such as a scenario where the thread is performing work and will provide results to the HTTP Response, then consider using an Async Worker. Async Workers allow for the TCP connection to remain alive for a long time while still allowing the Worker to issue heartbeats to the Master.
I have recently moved on to asynchronous, event-loop-based solutions, like the uvicorn worker for Gunicorn with the FastAPI framework, which provide alternatives to waiting on IO in threads.
For a project using Celery, I would like to test the execution of a task.
I know that the documentation advises mocking it, but since I'm not using the official client I want to check with a specific test that everything works well.
So I set up a very simple task that takes as parameters a Unix socket name and a message to write to it: the task opens the connection on the socket, writes the message and closes the connection.
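For reference, a simplified sketch of that task (the module path, broker URL and names are illustrative, not my exact code):

import socket

from celery import Celery

app = Celery('tests.functional.tasks', broker='amqp://')

@app.task
def write_to_socket(socket_path, message):
    # open the connection on the Unix socket, write the message, close the connection
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(socket_path)
    sock.sendall(message.encode('utf-8'))
    sock.close()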
Inside the tests, the Celery worker is launched with a subprocess: I start it before sending the task, send it a SIGTERM when I receive the message on the socket and then wait for the process to close.
Everything goes well: the message is received, it matches what is expected, and the worker terminates correctly.
But I found that when the tests stop, a message still remains within the RabbitMQ queue, as if the task had never been acknowledged.
I confirmed this by looking at the RabbitMQ graphical interface: a "Deliver" occurs after the task is executed but no "Acknowledge".
This seems strange because, with the default configuration, the acknowledgement should be sent before task execution.
Going further in my investigations I noticed that if I add a sleep of a split second just before sending SIGTERM to the worker, the task is acknowledged.
I tried to inspect the executions with or without sleep using strace, here are the logs:
Execution with a sleep of 0.5s.
Execution without sleep.
The only noticeable difference I see is that with sleep the worker has time to start a new communication with the broker. It receives an EAGAIN from a recvfrom and sends a frame "\1\0\1\0\0\0\r\0<\0P\0\0\0\0\0\0\0\1\0\316".
Is this the acknowledgement? Why does it occur so late?
I give you the parameters with which I launch the Celery worker: celery worker --app tests.functional.tasks.app --concurrency 1 --pool solo --without-heartbeat.
The --without-heartbeat is just here to reduce differences between executions with or without sleep. Otherwise an additional heartbeat frame would occur in the execution with sleep.
Thanks.
I have configured django-celery with RabbitMQ on my server. Currently I have only one node for my tasks.
I have tried celery-flower, events, celerycam, etc. for monitoring the worker/task status and it worked well.
My problem is:
I want to send mail notification if worker goes down for some reason.
I thought of creating a cron job that runs every 5 minutes and checks the worker's status (not sure this is the correct way).
Are there any other extensions or ways to do this without cron?
Run your workers using supervisor. There's an example in the documentation. Then, take a look at this answer for how to send an email when the worker process goes down.
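A minimal sketch of that setup, assuming the crashmail event listener from the superlance package; the command line, paths, project name and email address are placeholders:

[program:celeryworker]
command=celery worker --app=myproject --loglevel=INFO
directory=/path/to/project
autostart=true
autorestart=true
stopwaitsecs=600

[eventlistener:crashmail]
command=crashmail -a -m admin@example.com
events=PROCESS_STATE_EXITED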
I have added some wrong tasks to Celery with a Redis broker,
but now I want to remove the incorrect tasks and I can't find any way to do this.
Are there some commands or some API to do this?
I know two ways of doing so:
1) Delete the queue directly from the broker. In your case it's Redis. There are two commands that could help you: llen (to find the right queue) and del (to delete it).
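For example, with the default queue name (celery), it looks roughly like this in redis-cli:

$ redis-cli llen celery   # number of messages waiting in the default queue
$ redis-cli del celery    # drop the whole queue (unrecoverable)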
2) Start the Celery worker with the --purge or --discard option. Here is the help text:
--purge, --discard Purges all waiting tasks before the daemon is started.
**WARNING**: This is unrecoverable, and the tasks will
be deleted from the messaging server.
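For example (the project name is a placeholder):

$ celery worker -A proj --purge -l info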
The simplest way is to use celery control revoke [id1 [id2 [... [idN]]]], where id1 to idN are task IDs (do not forget to pass the -A project.application flag too). However, it is not guaranteed to succeed every time you run it, for valid reasons...
Sure, Celery has an API for it. Here is an example of how to do it from a script: res = app.control.revoke(task_id, terminate=True)
In the example above app is an instance of the Celery application.
On some rare occasions the control command above will not work, in which case you have to instruct the Celery worker to kill the worker process: res = app.control.revoke(task_id, terminate=True, signal='SIGKILL')
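Put together, a small sketch of such a script; the import path proj.celery and the task ID are placeholders for your own application instance and task:

from proj.celery import app  # your Celery application instance

task_id = 'd9078da5-9915-40a0-bfa1-392c7bde42ed'  # replace with a real task ID
# ask the workers to revoke the task; terminate=True also kills it if it is already running
res = app.control.revoke(task_id, terminate=True)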
I just had this problem so for future readers,
http://celery.readthedocs.org/en/latest/faq.html#i-ve-purged-messages-but-there-are-still-messages-left-in-the-queue
so to properly purge the queue of waiting tasks you have to stop all
the workers, and then purge the tasks using celery.control.purge().
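In other words, something like this once all workers are stopped (the import path is a placeholder for wherever your project defines its Celery application instance):

from proj.celery import app  # your Celery application instance

# run only after every worker has been stopped
num_purged = app.control.purge()
print('purged %d waiting messages' % num_purged)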
1. To properly purge the queue of waiting tasks you have to stop all the workers (http://celery.readthedocs.io/en/latest/faq.html#i-ve-purged-messages-but-there-are-still-messages-left-in-the-queue).
2. Then purge the tasks from a specific queue:
$ cd <source_dir>
$ celery amqp queue.purge <queue name>
3. Start the workers again.
Try removing the .state file, and if you are using a beat worker (celery worker -B) then remove the schedule file as well.
I'm just starting out with Celery in a Django project, and am kind of stuck on this particular problem: basically, I need to distribute a long-running task to different workers. The task is actually broken into several steps, each of which takes considerable time to complete. Therefore, if some step fails, I'd like Celery to retry this task using the same worker, to reuse the results from the completed steps. I understand that Celery uses routing to distribute tasks to certain servers, but I can't find anything about this particular problem. I use RabbitMQ as my broker.
You could have every celeryd instance consume from a queue named after the hostname of the worker:
celeryd -l info -n worker1.example.com -Q celery,worker1.example.com
sets the hostname to worker1.example.com and will consume from a queue named the same, as well as the default queue (named celery).
Then to direct a task to a specific worker you can use:
task.apply_async(args, kwargs, queue="worker1.example.com")
similarly, to direct a retry:
task.retry(queue="worker1.example.com")
or to direct the retry to the same worker:
task.retry(queue=task.request.hostname)
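A hedged sketch of how that can look inside a bound task; the task body, error class and helper function are illustrative, only the retry routing mirrors the call above:

from celery import shared_task

class TransientError(Exception):
    """Stand-in for an error that is worth retrying."""

def run_remaining_steps(doc_id):
    # hypothetical helper that picks up from whatever steps this worker already completed
    pass

@shared_task(bind=True, max_retries=3)
def long_pipeline(self, doc_id):
    try:
        run_remaining_steps(doc_id)
    except TransientError as exc:
        # send the retry to this worker's own queue, so cached partial results are reused
        raise self.retry(exc=exc, queue=self.request.hostname, countdown=10)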