Does restarting celery cause duplicate tasks? - python

I have an email task in celery that has an eta of 10 days from now(). However, I'm finding that some people are getting 5-6 duplicate emails at a time. I've come across this problem before when the visibility_timeout in BROKER_TRANSPORT_OPTIONS was set too low. Now I have this in my settings file:
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 2592000}  # 30 days
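For context, a delayed email of the kind described above would be scheduled roughly like this (the task name and arguments are hypothetical):

from datetime import datetime, timedelta
from proj.tasks import send_reminder_email  # hypothetical task module

# Queue the email for delivery 10 days from now. With the Redis broker, the
# visibility_timeout must be longer than the longest eta/countdown, otherwise
# the message is redelivered and the email goes out more than once.
send_reminder_email.apply_async(
    args=[42],  # hypothetical user id
    eta=datetime.utcnow() + timedelta(days=10),
)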
So the visibility timeout shouldn't be a problem any more. I'm just wondering if there is anything else that can cause duplicates, e.g. restarting celery. Celery gets restarted every time I deploy new code, and that can happen 5 or more times a week, so it's the only thing I can think of.
Any ideas?
Thanks.

Duplicate tasks are possible if the worker/beat processes did not stop correctly. How do you restart the celery workers/beat? Check the server for zombie celery worker and beat processes. Try stopping all celery processes, verify that no celery processes remain, and then start them again. Afterwards, check that ps ax | grep celery shows only fresh workers and a single beat process.
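A quick way to double-check which workers the broker actually sees after a restart (this assumes your Celery app instance is importable, e.g. from a hypothetical proj.celery module):

from proj.celery import app  # hypothetical module holding your Celery() app

# Ask every worker currently connected to the broker to reply.
# Zombie workers that are still consuming will show up here too.
replies = app.control.inspect().ping() or {}
for worker_name in sorted(replies):
    print(worker_name)

# Optionally, list the tasks each worker is currently executing.
active = app.control.inspect().active() or {}
for worker_name, tasks in active.items():
    print(worker_name, len(tasks), "active task(s)")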

Tasks won't be restarted after an incorrect worker stop if you set CELERY_ACKS_LATE = False. In that case the task is marked as acknowledged immediately after being consumed, so the broker won't redeliver it. See the docs.
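In old-style settings (as used in the question) that looks like this; the per-task override is shown for completeness, and the task name is hypothetical:

# settings.py (old-style setting name)
CELERY_ACKS_LATE = False  # acknowledge on consume; a killed worker will not cause redelivery

# Or per task, overriding the global default:
from proj.celery import app  # hypothetical app module

@app.task(acks_late=False)
def send_reminder_email(user_id):  # hypothetical task
    ...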
Also make sure that your tasks don't have retries enabled. If an exception is raised inside a task, it might be retried with the same input arguments.
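For example, an explicit retry like the sketch below re-runs the whole task with the same arguments, which can duplicate the email if delivery partially succeeded; the task and helper names are hypothetical:

from proj.celery import app  # hypothetical app module

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def send_reminder_email(self, user_id):  # hypothetical task
    try:
        deliver_email(user_id)  # hypothetical helper that can raise after sending
    except Exception as exc:
        # Each retry runs the task again with the same user_id, so a failure
        # after the mail already went out results in a duplicate email.
        raise self.retry(exc=exc)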
Another possible cause: your tasks are written incorrectly and each run selects the same set of recipients.
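If duplicates keep slipping through, a defensive guard inside the task can make it idempotent. A minimal sketch, assuming a hypothetical Django model that records sent emails:

from proj.celery import app  # hypothetical app module
from proj.models import SentEmail  # hypothetical model recording sent emails

@app.task(bind=True)
def send_reminder_email(self, user_id):  # hypothetical task
    # Record the send before delivering; if this task runs twice (redelivery,
    # retry, restart), the second run sees the existing row and does nothing.
    _, created = SentEmail.objects.get_or_create(user_id=user_id, kind='reminder')
    if not created:
        return 'already sent'
    deliver_email(user_id)  # hypothetical helper that actually sends the mail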

Related

How to gracefully restart a Celery worker after a task in it caused SoftTimeLimitExceeded exception?

I have a problem with my server environment which sometimes makes my tasks hang.
Sometimes the tasks resume when I restart the Celery worker(s).
I don't have the time/skills right now to debug the issue, so I thought of a workaround: restart the workers if a SoftTimeLimitExceeded exception was raised (I have a soft limit of 1 hour set for my tasks).
The thing is that I want to restart the worker which is possibly running the task, i.e. I want the worker to finish its current task housekeeping (setting the task result, etc.) and then be restarted BEFORE it takes/runs the next task from the queue.
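For reference, the setup described above (a one-hour soft limit with the exception handled inside the task) looks roughly like this; the task and helper names are hypothetical:

from celery.exceptions import SoftTimeLimitExceeded
from proj.celery import app  # hypothetical app module

@app.task(bind=True, soft_time_limit=3600)  # 1 hour soft limit, as in the question
def long_task(self, job_id):  # hypothetical task
    try:
        do_work(job_id)  # hypothetical helper that sometimes hangs
    except SoftTimeLimitExceeded:
        # The soft limit fired: record the failure here, then the worker can be
        # restarted externally (e.g. by a supervisor) before it takes the next task.
        record_failure(job_id)  # hypothetical helper
        raise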

Persistent Long Running Tasks in Celery

I'm working on a Python-based system to enqueue long-running tasks to workers.
The tasks originate from an outside service that generates a "token", but once they're created based on that token, they should run continuously and be stopped only when explicitly removed by code.
The task starts a WebSocket and loops on it. If the socket is closed, it reopens it. Basically, the task should never reach a conclusion.
My goals in architecting this solution are:
When gracefully restarting a worker (for example to load new code), the task should be re-added to the queue, and picked up by some worker.
Same thing should happen when ungraceful shutdown happens.
2 workers shouldn't work on the same token.
Other processes may create more tasks that should be directed to the same worker that's handling a specific token. This will be resolved by sending those tasks to a queue named after the token, which the worker should start listening to after starting the token's task (see the sketch after this list). I am listing this requirement as an explanation of why a task engine is even required here.
Independent servers, fast code reload, etc. - Minimal downtime per task.
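A sketch of the per-token queue subscription mentioned in the list above, using Celery's remote control API; the module and worker names are hypothetical:

from proj.celery import app  # hypothetical app module

def start_listening_for_token(token, worker_hostname):
    # Tell one specific worker to also consume from the queue named after the
    # token, so follow-up tasks for that token land on the worker that holds
    # its WebSocket.
    app.control.add_consumer(
        queue=token,                    # queue named after the token
        destination=[worker_hostname],  # e.g. 'celery@worker1' (hypothetical)
        reply=True,
    )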
All our server side is Python, and it looks like Celery is the best platform for it.
Are we using the right technology here? Any other architectural choices we should consider?
Thanks for your help!
According to the docs:
When shutdown is initiated the worker will finish all currently executing tasks before it actually terminates, so if these tasks are important you should wait for it to finish before doing anything drastic (like sending the KILL signal).
If the worker won’t shutdown after considerate time, for example because of tasks stuck in an infinite-loop, you can use the KILL signal to force terminate the worker, but be aware that currently executing tasks will be lost (unless the tasks have the acks_late option set).
You may get something like what you want by using retry or acks_late.
Overall I reckon you'll need to implement some extra application-side job control, plus, maybe, a lock service.
But, yes, overall you can do this with celery. Whether there are better technologies... that's out of the scope of this site.
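A minimal sketch of the kind of application-side lock alluded to above, assuming a Redis instance is available for locking and using hypothetical key names:

import redis

r = redis.Redis()  # assumes a Redis instance is reachable with default settings

def acquire_token_lock(token, timeout=60):
    # SET NX EX: only one worker can hold the lock for a given token at a time.
    # Returns True if this worker got the lock, False otherwise.
    return bool(r.set("token-lock:%s" % token, "locked", nx=True, ex=timeout))

def release_token_lock(token):
    r.delete("token-lock:%s" % token)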

Configure Python Celery for Long Running Tasks

I have a setup where I run long idempotent tasks on AWS spot instances, but I can't work out how to set up Celery to elegantly handle workers being killed mid-task.
At the moment, if a worker is killed, the task is marked as failed (WorkerLostError). I found the documentation on the subject to be a bit lean, but it suggests that you should use CELERY_ACKS_LATE for this scenario. This isn't working for me; the task is still marked as failed.
When I had CELERY_ACKS_LATE = False the task just stayed stuck as PENDING, so at least now I can tell that it has failed, which is a good start.
Here are my config settings at the moment:
# I'm using rabbit-mq as the broker
BROKER_HEARTBEAT = 10
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
CELERY_TRACK_STARTED = True
I have a task spinning on a master server that checks for the results of outstanding tasks and handles updating my local db to mark the tasks as complete (and performs work with the results). At this stage I think I'm going to have to catch the 'Worker exited prematurely: signal 15 (SIGTERM)' scenario and retry the task.
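The "catch and retry" fallback described above might look roughly like this; the task and function names are hypothetical:

from billiard.exceptions import WorkerLostError
from proj.tasks import long_running_task  # hypothetical idempotent task

def check_result(result, task_args):
    # result is the AsyncResult for a previously submitted task
    if result.failed() and isinstance(result.result, WorkerLostError):
        # The worker was killed (e.g. the spot instance went away) mid-task;
        # since the task is idempotent, simply resend it.
        return long_running_task.apply_async(args=task_args)
    return result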
It feels like this should all be handled by celery, so I feel like I've missed something fundamental in my config.
Given idempotent tasks and workers that will fail, what is the best way to configure celery so that those tasks are picked up by a different worker?

celery missed heartbeat (on_node_lost)

I just upgraded to celery 3.1 and now I see this in my logs:
on_node_lost - INFO - missed heartbeat from celery@queue_name, for every queue/worker in my cluster.
According to the docs BROKER_HEARTBEAT is off by default and I haven't configured it.
Should I explicitly set BROKER_HEARTBEAT=0 or is there something else that I should be checking?
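For reference, explicitly disabling broker heartbeats with the old-style setting from the question would look like this (note that this controls AMQP broker heartbeats, which are separate from the worker event heartbeats that gossip listens to):

# settings.py (old-style setting name)
BROKER_HEARTBEAT = 0  # 0 disables broker connection heartbeats entirely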
Celery 3.1 added the new mingle and gossip procedures. I too was getting a ton of missed heartbeats, and passing --without-gossip to my workers cleared it up.
https://docs.celeryproject.org/en/3.1/whatsnew-3.1.html#mingle-worker-synchronization
Mingle: Worker synchronization
The worker will now attempt to synchronize with other workers in the same cluster.
Synchronized data currently includes revoked tasks and logical clock.
This only happens at startup and causes a one second startup delay to collect broadcast responses from other workers.
You can disable this bootstep using the --without-mingle argument.
https://docs.celeryproject.org/en/3.1/whatsnew-3.1.html#gossip-worker-worker-communication
Gossip: Worker <-> Worker communication
Workers are now passively subscribing to worker related events like heartbeats.
This means that a worker knows what other workers are doing and can detect if they go offline. Currently this is only used for clock synchronization, but there are many possibilities for future additions and you can write extensions that take advantage of this already.
Some ideas include consensus protocols, reroute task to best worker (based on resource usage or data locality) or restarting workers when they crash.
We believe that although this is a small addition, it opens amazing possibilities.
You can disable this bootstep using the --without-gossip argument.
Saw the same thing, and noticed a couple of things in the log files.
1) There were messages about time drift at the start of the log and occasional missed heartbeats.
2) At the end of the log file, the drift messages went away and only the missed heartbeat messages were present.
3) There were no changes to the system when the drift messages went away... They just stopped showing up.
I figured that the drift was likely the problem itself.
After syncing the time on all the servers involved, these messages went away. On Ubuntu, run ntpdate from a cron job or run ntpd.
I'm having a similar issue, and I have found the reason in my case.
I have two servers running workers.
When I ping one server from the other, I find that whenever the ping time is larger than 2 seconds the log shows "missed heartbeat from celery@…". The default heartbeat interval is 2 seconds.
The reason is my poor network.
http://docs.celeryproject.org/en/latest/internals/reference/celery.worker.heartbeat.html
Add --without-mingle when you start celery.

reliable way to deploy new code into a production celery cluster without pausing service

I have a few celery nodes running in production with rabbitmq, and I have been handling deploys with service interruption. I have to take down the whole site in order to deploy new code to celery. I have max tasks per child set to 1, so in theory, if I make changes to an existing task, they should take effect the next time they are run, but what about registering new tasks? I know that restarting the daemon won't kill running workers, but instead will let them die on their own, but it still seems dangerous. Is there an elegant solution to this problem?
The challenging part here seems to be identifying which celery tasks are new versus old. I would suggest creating another vhost in rabbitmq and performing the following steps:
1) Update the django web servers with the new code and reconfigure them to point to the new vhost.
2) While tasks are queuing up in the new vhost, wait for the celery workers to finish the tasks from the old vhost.
3) When the workers have completed, update their code and configuration to point to the new vhost.
I haven't actually tried this, but I don't see why it wouldn't work. One annoying aspect is having to alternate between the vhosts with each deploy.
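A sketch of what flipping the app between vhosts might look like; the credentials, host and vhost names are hypothetical:

# settings.py for the new release: point at the new vhost so fresh tasks
# queue there while the old workers drain the old vhost.
# Previous release used: BROKER_URL = 'amqp://user:password@rabbit-host:5672/app_v1'
BROKER_URL = 'amqp://user:password@rabbit-host:5672/app_v2'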
A kind of workaround for you can be to set the CELERYD_MAX_TASKS_PER_CHILD config variable.
This variable specifies the number of tasks a pool worker executes before replacing itself.
Of course, when a new pool worker is spawned it will load the new code.
On my system I normally restart celery while leaving other tasks running in the background; normally everything goes fine, but sometimes one of these tasks is never killed, and you can still kill it with a script.
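For reference, the old-style setting the answer refers to (the question already runs with a value of 1):

# settings.py (old-style setting name)
CELERYD_MAX_TASKS_PER_CHILD = 1  # replace each pool process after one task so new code is loaded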
