Celery doesn't acknowledge tasks if stopped too quickly

For a project using Celery, I would like to test the execution of a task.
I know that the documentation advises mocking it, but since I'm not using the official client I want to check, with a dedicated test, that everything works end to end.
So I set up a very simple task that takes as parameters a Unix socket name and a message to write to it: the task opens the connection on the socket, writes the message and closes the connection.
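For reference, such a task might look roughly like this (a sketch, not the actual test code; the app name and broker URL are assumptions):
import socket
from celery import Celery

app = Celery('tests', broker='amqp://localhost')  # broker URL is an assumption

@app.task
def write_to_socket(socket_path, message):
    # Open the connection on the Unix socket, write the message, close the connection.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(message.encode())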
Inside the tests, the Celery worker is launched with a subprocess: I start it before sending the task, send it a SIGTERM when I receive the message on the socket and then wait for the process to close.
Everything goes well: the message is received, it matches what is expected and the worker correctly terminates.
But I found that when the tests stop, a message still remains within the RabbitMQ queue, as if the task had never been acknowledged.
I confirmed this by looking at the RabbitMQ graphical interface: a "Deliver" occurs after the task is executed but no "Acknowledge".
This seems strange because with the default configuration the acknowledgement should be sent just before the task is executed.
Going further in my investigation I noticed that if I add a sub-second sleep just before sending SIGTERM to the worker, the task is acknowledged.
I tried to inspect the executions with and without the sleep using strace; here are the logs:
Execution with a sleep of 0.5s.
Execution without sleep.
The only noticeable difference I see is that with sleep the worker has time to start a new communication with the broker. It receives an EAGAIN from a recvfrom and sends a frame "\1\0\1\0\0\0\r\0<\0P\0\0\0\0\0\0\0\1\0\316".
Is this the acknowledgement? Why does it occur so late?
Here are the parameters with which I launch the Celery worker: celery worker --app tests.functional.tasks.app --concurrency 1 --pool solo --without-heartbeat.
The --without-heartbeat flag is only there to reduce the differences between the executions with and without the sleep; otherwise an additional heartbeat frame shows up in the execution with the sleep.
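The test side does roughly the following (a sketch under the same assumptions, not the exact test code; the 0.5 s sleep is the workaround mentioned above):
import signal
import subprocess
import time

# Start the worker with the parameters given above.
worker = subprocess.Popen(
    ['celery', 'worker', '--app', 'tests.functional.tasks.app',
     '--concurrency', '1', '--pool', 'solo', '--without-heartbeat'])
# ... send the task and wait for the message on the Unix socket ...
time.sleep(0.5)  # without this pause the ack never seems to reach RabbitMQ
worker.send_signal(signal.SIGTERM)
worker.wait()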
Thanks.

Related

How do gevent workers behave with sync vs async processes?

I have a Flask app that uses a Gunicorn server for hosting. We are running into an issue where our workers keep getting locked up by long-running requests to different microservices. We currently only have Gunicorn set to give us 3 workers, so if there are 3 requests that are waiting on calls to those microservices, the server is completely locked up.
I started searching around and ran into this post:
gunicorn async worker class
This made sense to me, and it seemed like I could make the endpoint whose only job is to call these microservices asynchronous, then install gunicorn[gevent] and add --worker-class gevent to my start script. I implemented this and tested by using only 1 worker and adding a long time.sleep to the microservice being called. Everything worked perfectly and my server could process other requests while waiting for the async process to complete.
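If it helps, that experiment looks roughly like this (a sketch; the endpoint names and sleep duration are made up):
import time
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/slow')
def slow():
    time.sleep(10)  # stands in for the long-running microservice call
    return jsonify(done=True)

@app.route('/ping')
def ping():
    return 'pong'

# Started with a single gevent worker, e.g.:
#   gunicorn --worker-class gevent --workers 1 app:app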
Then I opened pandora's box and added another long time.sleep to a synchronous endpoint within my server, expecting that because this endpoint is synchronous, everything would be locked up during the time it took for the one worker to finish that process. I was surprised that my worker was responding to pings, even while it was processing the synchronous task.
To me, this suggests that the Gunicorn server is adding threads for the worker to use even when the worker is in the middle of processing a synchronous task, instead of only freeing up the worker to operate on a new thread while waiting for an asynchronous IO process like the above post suggested.
I'm relatively new to thread safety, so I want to confirm whether this implementation is the right way to solve my issue, and if so, how I should expect the new worker class to create threads for synchronous versus asynchronous work.

Understanding Gunicorn worker processes, when using threading module for sending mail

In my setup I am using Gunicorn for deployment on a single-CPU machine, with three worker processes. I came to ask this question from this answer: https://stackoverflow.com/a/53327191/10268003 . I have found that it takes up to one and a half seconds to send a mail, so I was trying to send email asynchronously. I am trying to understand what will happen to the worker process started by Gunicorn when it starts a new thread to send the mail: will the process be blocked until the mail-sending thread finishes? In that case I believe my application's throughput will decrease. I did not want to use Celery because it seems overkill to set it up just for sending emails. I am currently running two containers on the same development machine, with three Gunicorn workers each.
Below is the approach in question; the only difference is that I will be using threading for sending mails.
import threading
from django.http import JsonResponse
from .models import Crawl

def startCrawl(request):
    task = Crawl()
    task.save()
    t = threading.Thread(target=doCrawl, args=[task.id])
    t.daemon = True
    t.start()
    return JsonResponse({'id': task.id})

def checkCrawl(request, id):
    task = Crawl.objects.get(pk=id)
    return JsonResponse({'is_done': task.is_done, 'result': task.result})

def doCrawl(id):
    task = Crawl.objects.get(pk=id)
    # Do the crawling here, producing `result`.
    task.result = result
    task.is_done = True
    task.save()
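Since the goal here is sending mail rather than crawling, the threaded variant would presumably look something like this (a sketch; the view name, addresses and message are placeholders, and Django's send_mail is assumed as the mail helper):
import threading
from django.core.mail import send_mail
from django.http import JsonResponse

def send_mail_in_background(subject, body, recipients):
    send_mail(subject, body, 'noreply@example.com', recipients)

def notify(request):
    # Hand the slow SMTP call to a daemon thread so the response returns immediately.
    t = threading.Thread(
        target=send_mail_in_background,
        args=['Welcome', 'Thanks for signing up!', ['user@example.com']],
        daemon=True)
    t.start()
    return JsonResponse({'queued': True})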
Assuming that you are using Gunicorn's sync (default), gthread or async workers, you can indeed spawn threads and Gunicorn will neither notice nor interfere. The threads are reused to answer subsequent requests immediately after returning a result, not only after all threads have been joined again.
I have used this code to fire an independent event a minute or so after a request:
Timer(timeout, function_that_does_something, [arguments_to_function]).start()
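That one-liner needs an import to run on its own; a minimal, self-contained version might be:
from threading import Timer

def function_that_does_something(message):
    print(message)

# Fires roughly 60 seconds later, after the request has already been answered.
Timer(60, function_that_does_something, ['one minute later']).start()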
You will find some more technical details in this other answer:
In normal operations, these Workers run in a loop until the Master either tells them to graceful shutdown or kills them. Workers will periodically issue a heartbeat to the Master to indicate that they are still alive and working. If a heartbeat timeout occurs, then the Master will kill the Worker and restart it.
Therefore, daemon and non-daemon threads that do not interfere with the Worker's main loop should have no impact. If the thread does interfere with the Worker's main loop, such as a scenario where the thread is performing work and will provide results to the HTTP Response, then consider using an Async Worker. Async Workers allow for the TCP connection to remain alive for a long time while still allowing the Worker to issue heartbeats to the Master.
I have recently moved on to asynchronous, event-loop based solutions such as the uvicorn worker for Gunicorn with the FastAPI framework, which provide alternatives to waiting on I/O in threads.
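A minimal sketch of that kind of setup, assuming httpx for the outbound call (the endpoint and URL are placeholders):
import httpx
from fastapi import FastAPI

app = FastAPI()

@app.get('/proxy')
async def proxy():
    # Awaiting the I/O here frees the event loop to serve other requests.
    async with httpx.AsyncClient() as client:
        resp = await client.get('https://example.com/')
    return {'status': resp.status_code}

# Typically run with something like:
#   gunicorn -k uvicorn.workers.UvicornWorker app:app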

Python Celery - Raise if queue is not available

I have defined a route in my Celery configs:
task_routes = {'tasks.add': {'queue': 'calculate'}}
So that only a specific worker will run that task. I start my worker:
celery -A myproj worker -n worker1@%h -Q calculate
And then run my task:
add.apply_async((2, 2), time_limit=5)
Everything goes well. But now, let's say my worker dies and I try to run my task again. It hangs, forever. Time_limit doesn't do me any good since the task will never get to its queue. How can I define a time out in this case? In other words, if no queue is available in the next X seconds, I'd like to raise an error. Is that possible?
I'm assuming you are using Rabbitmq as the message broker, and if you are, there are a few subtleties about how Rabbitmq (and other AMQP-like message queues) work.
First of all, when you send a message, your process sends it to an exchange, which in turn routes the message to 0 or more queues. Your queue may or may not have a consumer (i.e. a celery worker) consuming messages, but as a sender you have no control of the receiving side unless there is an active reply from that worker.
However, I think it is possible to achieve what you want by doing the following (assuming you have a result backend):
Make sure your queue is declared with a Message TTL of your choice (let's say 60 seconds). Also make sure it is not declared to delete if no consumers are attached. Also declare a dead-letter exchange.
Have a celery worker listening to your dead letter exchange, but that worker is raising an appropriate exception whenever it receives a message. The easiest here is probably to listen to the messages, but not have any tasks loaded. This way, it will result in a FAILURE in your backend saying something about a not implemented task.
If your original worker dies, any message in the queue will expire after your selected TTL and be sent to your dead-letter exchange, at which point the second worker (the auto-failing one) will receive the message and fail the task.
Note that you need to set your TTL well above the time you expect the message to linger in the Rabbitmq queue, as it will expire regardless of there being a worker consuming from the queue or not.
To set up the first queue, I think you need a configuration looking something like:
from kombu import Queue

Queue(
    default_queue_name,
    default_exchange,
    routing_key=default_routing_key,
    queue_arguments={
        'x-message-ttl': 60000,  # milliseconds
        'x-dead-letter-exchange': deadletter_exchange_name,
        'x-dead-letter-routing-key': deadletter_routing_key,
    })
The dead letter queue would look more like a standard celery worker queue configuration, but you may want to have a separate config for it, since you don't want to load any tasks for this worker.
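For illustration, the dead-letter side might be declared along these lines (names and the project module are made up; the auto-failing worker consumes only this queue):
from kombu import Exchange, Queue

deadletter_exchange = Exchange('deadletter', type='direct')
deadletter_queue = Queue('deadletter', deadletter_exchange, routing_key='deadletter')
# Started separately, e.g.: celery -A failproj worker -Q deadletter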
So to sum up, yes it is possible but it is not as straightforward as one might think.

Django Celery Queue getting stuck

I use Celery/RabbitMQ for asynchronous task execution with my django application. I have just started working with Celery.
The tasks execute and everything works perfectly once I start the worker.
The problem is that task execution stops some time later: after a couple of hours, a day, or sometimes a couple of days. I only realise it from the consequences of the incomplete task executions. Then I restart Celery, all the pending tasks get executed and everything is back to normal.
My questions are:
How can I debug (where to start looking) to find out what the problem is?
How can I create a mechanism that will notify me as soon as the problem starts?
My Stack:
Django 1.4.8
Celery 3.1.16
RabbitMQ
Supervisord
Thanks,
andy
(1) If your Celery worker sometimes gets stuck, you can use strace & lsof to find out which system call it is stuck on.
For example:
$ strace -p 10268 -s 10000
Process 10268 attached - interrupt to quit
recvfrom(5,
10268 is the pid of the Celery worker; recvfrom(5 means the worker is blocked receiving data from file descriptor 5.
Then you can use lsof to check what file descriptor 5 is in this worker process.
lsof -p 10268
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
......
celery 10268 root 5u IPv4 828871825 0t0 TCP 172.16.201.40:36162->10.13.244.205:wap-wsp (ESTABLISHED)
......
It indicates that the worker is stuck on a TCP connection (you can see 5u in the FD column).
Some Python packages such as requests block while waiting for data from the peer, which can make the Celery worker hang. If you are using requests, make sure to set the timeout argument.
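For example, always pass an explicit timeout when a task calls out to another service (the URL and values here are arbitrary):
import requests

# Without timeout=, a silent peer can block the worker forever.
response = requests.get('https://example.com/api', timeout=(3.05, 27))  # (connect, read) seconds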
(2) You can monitor the size of your Celery task queue in RabbitMQ; if it keeps growing over a long period, the worker has probably stopped consuming.
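One way to watch the queue depth is the RabbitMQ management API; a sketch, assuming the default vhost, the default 'celery' queue name and the guest credentials:
import requests

resp = requests.get('http://localhost:15672/api/queues/%2F/celery',
                    auth=('guest', 'guest'), timeout=5)
print(resp.json()['messages'])  # alert if this number keeps growing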
Have you seen this page:
https://www.caktusgroup.com/blog/2013/10/30/using-strace-debug-stuck-celery-tasks/

What could make celery worker becoming unresponsive after a few tasks?

My workers are stopping after a few (<50) tasks.
I have a very simple client/worker setup. The client posts the task via func.delay(...), then enters a while loop to wait for the completion of all the tasks (i.e. polling the ready() method of each AsyncResult). I use RabbitMQ for the broker and the result backend.
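For reference, the client loop is essentially this (a sketch; func is the placeholder task name used above, and the task count is arbitrary):
from tasks import func

results = [func.delay(i) for i in range(20)]
while not all(r.ready() for r in results):
    pass  # poll until every AsyncResult reports completion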
The setup works... for a while. After a few tasks, the client doesn't receive anything and the worker seems to be idle (there is no output in the console anymore).
(The machine I work on is a bit old, so a resource problem is not impossible. Still, with fewer than 50 tasks that each run for about 2 seconds, I cannot say the system is under heavy load.)
