I had a workload with 16 instances, and they could communicate with each other (verified by ping). Each of them was running a long-running task and was started like this:
nohup celery worker -A tasks.workers --loglevel=INFO --logfile=/dockerdata/log/celery.log --concurrency=7 >/dev/null 2>&1 &
However, after a while, a few of the Celery instances always stop running. I notice this because the log directory normally receives each day's logs. I checked the last day's logs for these instances and found the following information:
worker exited by signal SIGKILL
[2021-07-23 09:04:24,270: ERROR/MainProcess] Process 'ForkPoolWorker-19773' pid:2846586 exited with 'signal 9 (SIGKILL)'
[2021-07-23 09:04:24,281: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL) Job: 79074.')
Traceback (most recent call last):
File "/data/anaconda3/lib/python3.8/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 79074.
missed heartbeat from...
[2021-07-30 10:24:26,815: INFO/MainProcess] missed heartbeat from celery@instance-1
I suspect that the Celery workers stopping has something to do with the two messages above. Can anyone offer a solution to this problem?
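For what it's worth, signal 9 is what the Linux OOM killer sends, so it may be worth checking whether those instances run out of memory around the time the workers die. If memory turns out to be the culprit, one mitigation is letting Celery recycle pool processes before they grow too large. A minimal sketch, assuming the Celery app object in tasks.workers is named app (that name is an assumption) and using purely illustrative values:

app.conf.worker_max_tasks_per_child = 100       # recycle each pool process after 100 tasks
app.conf.worker_max_memory_per_child = 500_000  # ...or once it uses roughly 500 MB (setting is in KiB)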
Related
I have started my Celery worker, which uses RabbitMQ as the broker, like this:
celery -A my_app worker -l info -P gevent -c 100 --prefetch-multiplier=1 -Q my_app
Then I have a task which looks roughly like this:
@shared_task(queue='my_app', default_retry_delay=10, max_retries=1, time_limit=8 * 60)
def example_task():
    # getting queryset with some filtering
    my_models = MyModel.objects.filter(...)
    for my_model in my_models.iterator():
        my_model.execute_something()
Sometimes this task finishes in less than a minute, and sometimes, during high load, it requires more than 5 minutes to finish.
The main problem is that RabbitMQ constantly removes my worker from the consumers list. It looks really random. Because of that, I need to restart the worker again.
The workers also start throwing these errors:
SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2396)
Sometimes these errors:
consumer: Cannot connect to amqps://my_app:**@example.com:5671/prod: SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)').
Couldn't ack 2057, reason:"RecoverableConnectionError(None, 'connection already closed', None, '')"
I have tried adding --without-heartbeat, but it does nothing.
How can I solve these problems? Sometimes my tasks take more than 30 minutes to finish, and I can't constantly monitor whether workers were kicked out of RabbitMQ.
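Not a definitive fix, but one direction to experiment with is making the heartbeat and acknowledgement behaviour explicit in the Celery configuration rather than only on the command line. A sketch, assuming the Celery app object is named app; the values are illustrative:

app.conf.broker_heartbeat = 0            # disable AMQP heartbeats entirely (same intent as --without-heartbeat)
app.conf.task_acks_late = True           # acknowledge only after the task finishes, so a dropped consumer doesn't lose it
app.conf.worker_prefetch_multiplier = 1  # mirrors --prefetch-multiplier=1 from the worker command above

The effective heartbeat is negotiated between client and broker, so the RabbitMQ server configuration matters as well.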
I have a Celery task that is scheduled to run after five minutes:
my_task.apply_async(countdown=5 * 60)
In case the worker is restarted, I need the task to be requeued, so I'm using the acks_late=True and reject_on_worker_lost=True options in the task decorator.
@shared_task(acks_late=True, reject_on_worker_lost=True)
def my_task():
    print('Running task...')
The Celery worker is running inside a Docker container, so when I restart the worker (with docker restart) after the task has been queued, the task does get requeued, but only after approximately 1 hour. I expected that the task would still be executed at the ETA, or as soon as the worker is back up if the ETA has already passed. How can I configure the wait time for task requeuing?
There are no other tasks running simultaneously, and I'm running this task in a development environment under very light load, so I don't think it is a congestion issue.
Packages:
celery==4.2.2
redis==3.2.1
Example of celery logs without restart:
[2022-08-05 20:00:55,990: INFO/MainProcess] Received task: tasks.my_task[15844866-36c6-465e-874a-78f861837d3c] ETA:[2022-08-05 20:05:55.922479+00:00]
[2022-08-05 20:05:55,972: WARNING/ForkPoolWorker-1] Running task...
[2022-08-05 20:05:55,974: INFO/ForkPoolWorker-1] Task tasks.my_task[15844866-36c6-465e-874a-78f861837d3c] succeeded in 0.001996207982301712s: None
Example of celery logs with restart:
[2022-08-05 19:42:16,961: INFO/MainProcess] Received task: tasks.my_task[c103280b-7d62-4b6a-8311-57769df81c90] ETA:[2022-08-05 19:47:16.893205+00:00]
--- WORKER RESTART HERE ---
[2022-08-05 20:43:39,646: INFO/MainProcess] Received task: tasks.my_task[c103280b-7d62-4b6a-8311-57769df81c90] ETA:[2022-08-05 19:47:16.893205+00:00]
[2022-08-05 20:43:40,106: WARNING/ForkPoolWorker-1] Running task...
[2022-08-05 20:43:40,107: INFO/ForkPoolWorker-1] Task tasks.my_task[c103280b-7d62-4b6a-8311-57769df81c90] succeeded in 0.0011997036635875702s: None
Thanks in advance.
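In case it helps with the diagnosis: with redis in the dependency list, the roughly one-hour delay matches the default visibility_timeout of the Redis broker transport (3600 seconds), which is how long an unacknowledged message waits before being redelivered. If Redis really is the broker here, that timeout can be tuned; a sketch, assuming the Celery app object is named app:

app.conf.broker_transport_options = {
    'visibility_timeout': 600,  # seconds before an unacknowledged task is redelivered (default 3600)
}

Be aware that the Celery docs warn against a visibility timeout shorter than your longest ETA/countdown, since that can cause scheduled tasks to be redelivered and executed more than once.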
I am using Heroku to host a bot that I have been working on; the code itself works perfectly when I launch it locally. However, I updated the bot yesterday, adding some new functionality, and I am receiving this error code when checking the logs:
2021-08-29T13:42:44.000000+00:00 app[api]: Build succeeded
2021-08-29T13:43:05.090490+00:00 heroku[worker.1]: Error R12 (Exit timeout) -> At least one process failed to exit within 30 seconds of SIGTERM
2021-08-29T13:43:05.095578+00:00 heroku[worker.1]: Stopping remaining processes with SIGKILL
2021-08-29T13:43:05.174771+00:00 heroku[worker.1]: Process exited with status 137
2021-08-29T13:56:00.000000+00:00 app[api]: Build started by user ty.unsworth@gmail.com
2021-08-29T13:56:24.413396+00:00 app[api]: Deploy 5ff2d18b by user ty.unsworth@gmail.com
2021-08-29T13:56:24.413396+00:00 app[api]: Release v39 created by user ty.unsworth@gmail.com
2021-08-29T13:56:26.596593+00:00 heroku[worker.1]: Restarting
2021-08-29T13:56:26.610525+00:00 heroku[worker.1]: State changed from up to starting
2021-08-29T13:56:27.284912+00:00 heroku[worker.1]: Stopping all processes with SIGTERM
2021-08-29T13:56:29.770892+00:00 heroku[worker.1]: Starting process with command `python WagonCounterBot.py`
2021-08-29T13:56:30.497452+00:00 heroku[worker.1]: State changed from starting to up
2021-08-29T13:56:34.000000+00:00 app[api]: Build succeeded
2021-08-29T13:56:57.473450+00:00 heroku[worker.1]: Error R12 (Exit timeout) -> At least one process failed to exit within 30 seconds of SIGTERM
2021-08-29T13:56:57.481783+00:00 heroku[worker.1]: Stopping remaining processes with SIGKILL
2021-08-29T13:56:57.535059+00:00 heroku[worker.1]: Process exited with status 137
I think this error may be caused by the new functionality I added:
@client.listen()
async def on_message(message):
    """
    Looks for when a member calls 'bhwagon', and after 24 minutes, sends them a message
    :param message: the message sent by the user
    :return: a DM to the user letting them know their cooldown ended
    """
    channel = client.get_channel(int(WAGON_CHANNEL))  # sets the
    if message.content.startswith("bhwagon"):
        channel = message.channel
        await cool_down_ended(message)
async def cool_down_ended(message):
    """
    Sends the author of the message a personal DM 24 minutes after they type 'bhwagon' in the guild
    :param message: is the message the author sent
    :return: a message to the author
    """
    time.sleep(1440)  # sets a time for 24 minutes = 1440 seconds
    await message.author.send("Your wagon steal timer is up 🎩 time for another materials run!")
So I think I understand this error to mean that Heroku doesn't allow functions to delay themselves for more than 30 seconds, which conflicts with cool_down_ended(message), which delays for 24 minutes.
Would there be any easy way around this?
Don't use time.sleep in asynchronous code; it blocks the entire thread (and with it the event loop). Use await asyncio.sleep(delay) instead.
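For example, the cool-down helper from the question could be rewritten like this (a sketch keeping the original names):

import asyncio

async def cool_down_ended(message):
    # non-blocking wait: hands control back to the event loop for 24 minutes
    await asyncio.sleep(1440)
    await message.author.send("Your wagon steal timer is up 🎩 time for another materials run!")

Because the coroutine now yields instead of blocking, the bot can keep responding to other events (including Heroku's shutdown signal) while the timer runs.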
This is the code to handle SIGTERM from Heroku (you can add it before the client.run call):
import signal
signal.signal(signal.SIGTERM, lambda *_: client.loop.create_task(client.close()))
I'm running an analysis on AWS EMR, and I am getting an unexpected SIGTERM error.
Some background:
I'm running a script that reads in many CSV files I have stored on S3 and then performs an analysis. Schematically, my script is:
analysis_script.py
import pandas as pd
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.types import *
from pyspark import SparkContext
import boto3
#Spark context
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.csv("s3n://csv_files/*", header = True)
def analysis(df):
    # do bunch of stuff. Create output dataframe
    return df_output
df_output = analysis(df)
I launch the cluster using:
aws emr create-cluster \
--release-label emr-5.5.0 \
--name "Analysis" \
--applications Name=Hadoop Name=Hive Name=Spark Name=Ganglia \
--ec2-attributes KeyName=EMRB,InstanceProfile=EMR_EC2_DefaultRole \
--service-role EMR_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=r3.xlarge \
--region us-west-2 \
--log-uri s3://emr-logs/ \
--bootstrap-actions Name="Install Python Packages",Path="s3://emr-bootstraps/install_python_packages_custom.bash",Args=["numpy pandas boto3 tqdm"] \
--auto-terminate
I can see from the logs that the reading in of the CSV files goes fine, but then the job finishes with errors. The following lines are in the stderr file:
18/07/16 12:02:26 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
18/07/16 12:02:26 ERROR ApplicationMaster: User application exited with status 143
18/07/16 12:02:26 INFO ApplicationMaster: Final app status: FAILED, exitCode: 143, (reason: User application exited with status 143)
18/07/16 12:02:26 INFO SparkContext: Invoking stop() from shutdown hook
18/07/16 12:02:26 INFO SparkUI: Stopped Spark web UI at http://172.31.36.42:36169
18/07/16 12:02:26 INFO TaskSetManager: Starting task 908.0 in stage 1494.0 (TID 88112, ip-172-31-35-59.us-west-2.compute.internal, executor 27, partition 908, RACK_LOCAL, 7278 bytes)
18/07/16 12:02:26 INFO TaskSetManager: Finished task 874.0 in stage 1494.0 (TID 88078) in 16482 ms on ip-172-31-35-59.us-west-2.compute.internal (executor 27) (879/4805)
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-36-42.us-west-2.compute.internal:34133 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(20, ip-172-31-36-42.us-west-2.compute.internal, 34133, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO BlockManagerInfo: Added broadcast_2328_piece0 in memory on ip-172-31-47-55.us-west-2.compute.internal:45758 (size: 28.8 KB, free: 2.8 GB)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(16, ip-172-31-47-55.us-west-2.compute.internal, 45758, None),broadcast_2328_piece0,StorageLevel(memory, 1 replicas),29537,0))
18/07/16 12:02:26 INFO DAGScheduler: Job 1494 failed: toPandas at analysis_script.py:267, took 479.895614 s
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1494 (toPandas at analysis_script.py:267) failed in 478.993 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerSQLExecutionEnd(0,1531742546839)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@28e5b10c)
18/07/16 12:02:26 INFO DAGScheduler: ShuffleMapStage 1495 (toPandas at analysis_script.py:267) failed in 479.270 s due to Stage cancelled because SparkContext was shut down
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@6b68c419)
18/07/16 12:02:26 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(1494,1531742546841,JobFailed(org.apache.spark.SparkException: Job 1494 cancelled because SparkContext was shut down))
18/07/16 12:02:26 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/07/16 12:02:26 INFO YarnClusterSchedulerBackend: Shutting down all executors
18/07/16 12:02:26 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/16 12:02:26 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices(serviceOption=None, services=List(),started=false)
18/07/16 12:02:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
I can't find much useful information about exit code 143. Does anybody know why this error is occurring? Thanks.
Spark passes through exit codes when they're over 128, which is often the case with JVM errors. In the case of exit code 143, it signifies that the JVM received a SIGTERM - essentially a unix kill signal (see this post for more exit codes and an explanation). Other details about Spark exit codes can be found in this question.
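Concretely, 143 = 128 + 15, and 15 is the signal number of SIGTERM; a quick sanity check in Python:

import signal

# exit codes above 128 encode 128 + the signal number
print(128 + signal.SIGTERM)  # prints 143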
Since you didn't terminate this yourself, I'd start by suspecting that something else terminated it externally. Given that almost exactly 8 minutes elapse between the job starting and the SIGTERM being issued, it seems much more likely that EMR itself is enforcing a maximum job run time or cluster age. Try checking through your EMR settings to see whether any such timeout is set - there was one in my case (on AWS Glue, but the same concept).
Scenario:
I had created a shared task in Celery for testing purposes [RabbitMQ as the broker for queuing messages]:
@app.task(bind=True, max_retries=5, base=MyTask)
def testing(self):
    try:
        raise smtplib.SMTPException
    except smtplib.SMTPException as exc:
        print 'This is it'
        self.retry(exc=exc, countdown=2)

# Overriding base class of Task
class MyTask(celery.Task):
    def on_failure(self, exc, task_id, args, kwargs, einfo):
        print "MyTask on failure world"
        pass
I called the task for testing by running testing.delay() 10 times after starting a worker. Then I quit the worker by pressing Ctrl+C, deleted all those queues from the RabbitMQ server, and started the worker again.
Command to start the worker: celery worker --app=my_app.settings -l DEBUG
Command to delete a queue: rabbitmqadmin delete queue name=<queue_name>
Command to kill the workers: ps auxww | grep 'celery worker' | awk '{print $2}' | xargs kill -9
Problem:
Since I have already deleted all the queues from the RabbitMQ server, only fresh tasks should be received now. But I am still getting old tasks, and moreover, no new tasks are appearing in the list. What is the actual cause of this?
What is happening is that your worker takes in more than one task at a time, unless you use the -Ofair flag when starting the worker.
https://medium.com/@taylorhughes/three-quick-tips-from-two-years-with-celery-c05ff9d7f9eb
So, even if you clear out your queue, your worker will still be running the tasks it has already picked up, unless you kill the worker process itself.
Edit to add
If you have a task running after restart, you need to revoke the task.
http://celery.readthedocs.io/en/latest/faq.html#can-i-cancel-the-execution-of-a-task
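Revoking can also be done from Python; a minimal sketch, where both the import path (taken from --app=my_app.settings above) and the task id are hypothetical placeholders:

from my_app.settings import app  # assumes the Celery app object is named `app` in that module

task_id = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'  # placeholder: in practice, copy the id from the worker log
app.control.revoke(task_id,
                   terminate=True,    # also kill the task if it is already executing
                   signal='SIGKILL')  # signal sent to the worker process running it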