What could make a celery worker become unresponsive after a few tasks? - python

My workers are stopping after a few (<50) tasks.
I have a very simple client/worker setup. The client posts the tasks via func.delay(...) and then enters a while loop to wait for all of them to complete (i.e. checking the ready() method of each AsyncResult). I use RabbitMQ for both the broker and the result backend.
The setup works...for a while. After a few tasks, the client doesn't receive anything and the worker seems to be idle (there is no output in the console anymore).
(The machine I work on is a bit old, so a resource problem is not impossible. Still, with 50 tasks that each run for about 2 seconds, I can't say the system is under heavy load.)
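For reference, here is a minimal sketch of the setup as described (the task body, broker/backend URLs and polling interval are assumptions, not the actual code):

    import time
    from celery import Celery

    # RabbitMQ as both broker and result backend, as described above (URLs assumed)
    app = Celery("tasks", broker="amqp://localhost//", backend="rpc://")

    @app.task
    def work(n):
        time.sleep(2)  # stand-in for the ~2 second task
        return n

    # Client side: post the tasks, then poll until every AsyncResult is ready
    results = [work.delay(i) for i in range(50)]
    while not all(r.ready() for r in results):
        time.sleep(0.5)
    print([r.get() for r in results])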

Related

How do gevent workers behave with sync vs async processes?

I have a Flask app that uses a Gunicorn server for hosting. We are running into an issue where our workers keep getting locked up by long-running requests to different microservices. We currently only have Gunicorn set to give us 3 workers, so if there are 3 requests that are waiting on calls to those microservices, the server is completely locked up.
I started searching around and ran into this post:
gunicorn async worker class
This made sense to me, and it seemed like I could make the endpoint whose only job is to call these microservices asynchronous, then install gunicorn[gevent] and add --worker-class gevent to my start script. I implemented this and tested by using only 1 worker and adding a long time.sleep to the microservice being called. Everything worked perfectly and my server could process other requests while waiting for the async process to complete.
Then I opened Pandora's box and added another long time.sleep to a synchronous endpoint within my server, expecting that because this endpoint is synchronous, everything would be locked up during the time it took for the one worker to finish that process. I was surprised that my worker was responding to pings even while it was processing the synchronous task.
To me, this suggests that the Gunicorn server is adding threads for the worker to use even when the worker is in the middle of processing a synchronous task, instead of only freeing up the worker to operate on a new thread while waiting for an asynchronous I/O process, as the post above suggested.
I'm relatively new to thread safety so I want to confirm if I am going about solving my issue the right way with this implementation, and if so, how can I expect the new worker class to generate new threads for synchronous vs asynchronous processes?
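For context, a rough sketch of the test setup described above (endpoint paths, the microservice URL and the launch command are assumptions):

    import time
    import requests
    from flask import Flask

    app = Flask(__name__)

    @app.route("/call-microservice")
    def call_microservice():
        # Long-running outbound call; with the gevent worker class this I/O wait
        # is expected to yield so the single worker can serve other requests.
        resp = requests.get("http://microservice.local/slow")  # assumed URL
        return resp.text

    @app.route("/sync-sleep")
    def sync_sleep():
        time.sleep(30)  # the "synchronous" endpoint used in the second test
        return "done"

    @app.route("/ping")
    def ping():
        return "pong"

    # Launched for the test with a single gevent worker, e.g.:
    #   gunicorn --workers 1 --worker-class gevent app:app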

What are the consequences of disabling gossip, mingle and heartbeat for celery workers?

What are the implications of disabling gossip, mingle, and heartbeat on my celery workers?
In order to reduce the number of messages sent to CloudAMQP to stay within the free plan, I decided to follow these recommendations. I therefore used the options --without-gossip --without-mingle --without-heartbeat. Since then, I have been using these options by default for all my celery projects but I am not sure if there are any side-effects I am not aware of.
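For reference, a sketch of a worker started with all three options from Python; the project name and broker URL are assumptions, and the same flags can equally be passed on the celery worker command line:

    from celery import Celery

    app = Celery("proj", broker="redis://localhost:6379/0")  # broker URL assumed

    if __name__ == "__main__":
        # Equivalent to passing the flags to `celery -A proj worker ...`
        app.worker_main([
            "worker",
            "--loglevel=INFO",
            "--without-gossip",
            "--without-mingle",
            "--without-heartbeat",
        ])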
Please note:
we now moved to a Redis broker and do not have that much limitations on the number of messages sent to the broker
we have several instances running multiple celery workers with multiple queues
This is the base documentation which doesn't give us much info
heartbeat
Is related to communication between the worker and the broker (in your case the broker is CloudAMQP).
See explanation
With --without-heartbeat, the worker won't send heartbeat events.
mingle
It only asks for "logical clocks" and "revoked tasks" from other workers on startup.
Taken from whatsnew-3.1
The worker will now attempt to synchronize with other workers in the same cluster.
Synchronized data currently includes revoked tasks and logical clock.
This only happens at startup and causes a one second startup delay to collect broadcast responses from other workers.
You can disable this bootstep using the --without-mingle argument.
Also see docs
gossip
Workers send events to all other workers and this is currently used for "clock synchronization", but it's also possible to write your own handlers on events, such as on_node_join. See docs
Taken from whatsnew-3.1
Workers are now passively subscribing to worker related events like heartbeats.
This means that a worker knows what other workers are doing and can detect if they go offline. Currently this is only used for clock synchronization, but there are many possibilities for future additions and you can write extensions that take advantage of this already.
Some ideas include consensus protocols, reroute task to best worker (based on resource usage or data locality) or restarting workers when they crash.
We believe that although this is a small addition, it opens amazing possibilities.
You can disable this bootstep using the --without-gossip argument.
Celery workers started up with the --without-mingle option, as #ofirule mentioned above, will not receive synchronization data from other workers, particularly revoked tasks. So if you revoke a task, all workers currently running will receive that broadcast and store it in memory so that when one of them eventually picks up the task from the queue, it will not execute it:
https://docs.celeryproject.org/en/stable/userguide/workers.html#persistent-revokes
But if a new worker starts up before that task has been dequeued by a worker that received the broadcast, it doesn't know to revoke the task. If it eventually picks up the task, then the task is executed. You will see this behavior if you're running in an environment where you are dynamically scaling in and out celery workers constantly.
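A hedged sketch of the revoke flow being described (the task, broker URL and ids are placeholders):

    from celery import Celery

    app = Celery("proj", broker="redis://localhost:6379/0")  # broker URL assumed

    @app.task
    def long_job(x):
        return x * 2

    result = long_job.delay(21)

    # Broadcast a revoke: every worker that is currently running stores the id
    # in memory and will skip the task if it later pulls it from the queue.
    app.control.revoke(result.id)

    # A worker started *after* this broadcast (e.g. during a scale-out) has no
    # record of the revoked id unless persistent revokes (--statedb) are used,
    # so it may still execute the task.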
I wanted to know if the --without-heartbeat flag would impact the worker's ability to detect a broker disconnect and attempt to reconnect. The documentation referenced above only opaquely refers to these heartbeats acting at the application layer rather than the TCP/IP layer. What I really want to know is: does eliminating these messages affect my worker's ability to function, specifically to detect a broker disconnect and then try to reconnect appropriately?
I ran a few quick tests myself and found that with the --without-heartbeat flag passed, workers still detect a broker disconnect very quickly (initiated by my shutting down the RabbitMQ instance), and they attempt to reconnect to the broker and do so successfully when I restart the RabbitMQ instance. So my basic testing suggests the heartbeats are not necessary for basic health checks and functionality. What's the point of them anyway? It's unclear to me, but they don't appear to have an impact on worker functionality.

After delay() is called on a celery task, it takes more than 5 to 10 seconds for the tasks to even start executing with redis as the server

I have Redis as my cache server. When I call delay() on a task, it takes more than 10 seconds for the task to even start executing. Any idea how to reduce this unnecessary lag?
Should I replace Redis with RabbitMQ?
It's very difficult to say what the cause of the delay is without being able to inspect your application and server logs, but I can reassure you that the delay is not normal and is not an effect specific to Celery or to using Redis as the broker. I've used this combination a lot in the past and execution of tasks happens within a few milliseconds.
I'd start by ensuring there are no network-related issues between the client creating the tasks, your broker (Redis) and your task consumers (celery workers).
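One way to narrow it down is to timestamp the task on both sides; a minimal sketch, assuming a Redis broker and backend on localhost:

    import time
    from celery import Celery

    app = Celery("tasks",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/1")

    @app.task
    def echo_enqueue_lag(enqueued_at):
        # Difference between when the client enqueued and when a worker picked it up
        return time.time() - enqueued_at

    lag = echo_enqueue_lag.delay(time.time()).get(timeout=30)
    print(f"enqueue-to-start lag: {lag:.3f}s")  # normally a few milliseconds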
Good luck!

Persistent Long Running Tasks in Celery

I'm working on a Python based system, to enqueue long running tasks to workers.
The tasks originate from an outside service that generates a "token", but once they're created based on that token, they should run continuously and stop only when explicitly removed by code.
The task starts a WebSocket and loops on it. If the socket is closed, it reopens it. Basically, the task shouldn't reach conclusion.
My goals in architecting this solution are:
When gracefully restarting a worker (for example to load new code), the task should be re-added to the queue, and picked up by some worker.
Same thing should happen when ungraceful shutdown happens.
2 workers shouldn't work on the same token.
Other processes may create more tasks that should be directed to the same worker that's handling a specific token. This will be resolved by sending those tasks to a queue named after the token, which the worker should start listening to after starting the token's task (see the sketch after this list). I am listing this requirement as an explanation of why a task engine is even required here.
Independent servers, fast code reload, etc. - Minimal downtime per task.
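A hedged sketch of the per-token queue idea from the list above (the queue naming scheme, worker hostname and task are assumptions):

    from celery import Celery

    app = Celery("tokens", broker="redis://localhost:6379/0")  # broker URL assumed

    @app.task
    def handle_token_event(token, payload):
        ...  # follow-up work for a token already owned by some worker

    def send_to_token_queue(token, payload):
        # Route follow-up tasks to a queue named after the token...
        handle_token_event.apply_async(args=(token, payload), queue=f"token-{token}")

    def claim_token_queue(token, worker_hostname):
        # ...and tell the worker that owns the token to start consuming from it.
        app.control.add_consumer(f"token-{token}", destination=[worker_hostname])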
All our server side is Python, and looks like Celery is the best platform for it.
Are we using the right technology here? Any other architectural choices we should consider?
Thanks for your help!
According to the docs
When shutdown is initiated the worker will finish all currently executing tasks before it actually terminates, so if these tasks are important you should wait for it to finish before doing anything drastic (like sending the KILL signal).
If the worker won’t shutdown after considerate time, for example because of tasks stuck in an infinite-loop, you can use the KILL signal to force terminate the worker, but be aware that currently executing tasks will be lost (unless the tasks have the acks_late option set).
You may get something like what you want by using retry or acks_late
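A hedged sketch of that idea (the task body, broker URL and retry policy are assumptions, not the asker's actual code):

    from celery import Celery

    app = Celery("sockets", broker="amqp://localhost//")  # broker URL assumed

    @app.task(bind=True, acks_late=True, max_retries=None)
    def watch_token(self, token):
        try:
            run_websocket_loop(token)  # hypothetical: opens the socket and loops on it
        except Exception as exc:
            # Put the token back on the queue so another (or the same) worker
            # picks it up again; acks_late also re-delivers it after a crash.
            raise self.retry(exc=exc, countdown=5)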
Overall I reckon you'll need to implement some extra application-side job control, plus, maybe, a lock service.
But, yes, overall you can do this with celery. Whether there are better technologies... that's out of the scope of this site.

Script needs to be run as a Celery task. What consequences does this have?

My task is to write a script using OpenCV which will later run as a Celery task. What consequences does this have? What do I have to pay attention to? Is it enough, in the end, to include two lines of code, or could it be that I have to rewrite my whole script?
I read that Celery is an "asynchronous task queue/job queuing system based on distributed message passing", but I won't pretend to know completely what all of that entails.
I'll update the question as soon as I get more details.
Celery implies a daemon using a broker (some data hub used to queue tasks). The celeryd daemon and the broker (RabbitMQ, Redis, MongoDB or something else) should always run in the background.
Your tasks will be queued, which means they won't all run at the same time. You can choose the maximum number that can run concurrently; the rest will wait for the others to finish before starting. This also means some concurrency is often expected, and that you must create tasks that play nice with others doing the same thing.
Celery is not meant to run scripts but tasks, written as Python functions. You can of course execute external scripts from Python, but your entry point is always a Python function.
Celery uses Kombu, which uses a message broker to dispatch the tasks. This implies the data you pass to your tasks should be serializable.
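To make that concrete, a hedged sketch of wrapping an OpenCV step as a task (the operation and paths are placeholders):

    import cv2
    from celery import Celery

    app = Celery("imaging", broker="redis://localhost:6379/0")  # broker URL assumed

    @app.task
    def blur_image(input_path, output_path):
        # Pass file paths (plain strings serialize cleanly); large numpy arrays
        # would otherwise have to travel through the broker.
        img = cv2.imread(input_path)
        blurred = cv2.GaussianBlur(img, (9, 9), 0)
        cv2.imwrite(output_path, blurred)
        return output_path

    # Callers enqueue it instead of running the script directly:
    # blur_image.delay("/data/in.png", "/data/out.png")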
