I have set the heartbeat in my Celery settings:
BROKER_HEARTBEAT = 10
I have also set this configuration value in the RabbitMQ config:
'heartbeat' => '10',
But somehow heartbeats are still disabled:
ubuntu@sync1:~$ sudo rabbitmqctl list_connections name timeout
Listing connections ...
some_address:37781 -> other_address:5672 0
some_address:37782 -> other_address:5672 0
...done.
Any ideas what I am doing wrong?
UPDATE:
So now I get:
ubuntu@sync1:/etc/puppet$ sudo rabbitmqctl list_connections name timeout
Listing connections ...
some_address:41281 -> other_address:5672 10
some_address:41282 -> other_address:5672 10
some_address:41562 -> other_address:5672 0
some_address:41563 -> other_address:5672 0
some_address:41564 -> other_address:5672 0
some_address:41565 -> other_address:5672 0
some_address:41566 -> other_address:5672 0
some_address:41567 -> other_address:5672 0
some_address:41568 -> other_address:5672 0
...done.
I have 3 servers:
RabbitMQ broker
RESTful API server
Remote Worker server
It appears the remote daemonised Celery workers send heartbeats correctly. The RESTful API server, which uses Celery to process tasks remotely, is not using heartbeats for some reason.
The heartbeat of a Celery worker is an application-level heartbeat, not the AMQP protocol's heartbeat.
Each worker periodically sends a heartbeat event message to the "celeryev" event exchange in the broker.
The heartbeat event is forwarded back to the worker, so the worker can know the health status of the broker.
If the number of lost heartbeats exceeds a threshold, the worker can reconnect to the broker.
For the rest of the details, you may check this page.
The section on BROKER_FAILOVER_STRATEGY describes the actions you may take when dropping from a broker.
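For example, a minimal sketch of what a failover configuration could look like (the broker hostnames and credentials are placeholders, and the list form assumes a Celery version that accepts multiple broker URLs):
# Hypothetical sketch: two interchangeable brokers; if the current connection
# is lost, the worker picks the next one using the failover strategy.
BROKER_URL = [
    'pyamqp://guest:guest@broker-a:5672//',
    'pyamqp://guest:guest@broker-b:5672//',
]
BROKER_FAILOVER_STRATEGY = 'round-robin'  # the default; a custom callable can also be used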
Celery workers definitely support AMQP heartbeats. The configuration item BROKER_HEARTBEAT defines the heartbeat interval of the AMQP client (the Celery worker).
You can find the description of BROKER_HEARTBEAT here in the Celery docs.
Possible causes of the heartbeat not working:
Using the wrong transport, such as 'librabbitmq'.
As the Celery docs describe, only the 'pyamqp' transport supports BROKER_HEARTBEAT.
Check whether the librabbitmq package is installed,
or explicitly use the 'pyamqp' transport in the broker URL: 'pyamqp://userid:password@hostname:port/virtual_host' rather than 'amqp://userid:password@hostname:port/virtual_host' (see the sketch after this list).
No events are sent to the Celery worker during the first three heartbeat intervals after boot-up.
Check the code here to see how the heartbeat works.
drain_events is called during worker boot-up; see the code here.
If no events are sent to the Celery worker, connection.heartbeat_check will not be called.
By the way, connection.heartbeat_check is defined here.
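A minimal settings sketch (hostname and credentials are placeholders) that forces the pure-Python transport so the heartbeat setting is honoured:
# Use the 'pyamqp' transport explicitly; 'librabbitmq' does not support BROKER_HEARTBEAT
BROKER_URL = 'pyamqp://guest:guest@localhost:5672//'
BROKER_HEARTBEAT = 10             # seconds between heartbeats
BROKER_HEARTBEAT_CHECKRATE = 2.0  # the connection is checked twice per heartbeat interval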
Hope this helps someone who encounters the heartbeat issue.
Related
I have started my celery worker, which uses RabbitMQ as broker, like this:
celery -A my_app worker -l info -P gevent -c 100 --prefetch-multiplier=1 -Q my_app
Then I have a task which looks like this:
@shared_task(queue='my_app', default_retry_delay=10, max_retries=1, time_limit=8 * 60)
def example_task():
    # getting a queryset with some filtering
    my_models = MyModel.objects.filter(...)
    for my_model in my_models.iterator():
        my_model.execute_something()
Sometimes this task finishes in less than a minute, and sometimes, under high load, it requires more than 5 minutes to finish.
The main problem is that RabbitMQ constantly removes my worker from the consumers list. It looks really random. Because of that, I need to restart the worker again.
Workers also start throwing these errors:
SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2396)
Sometimes these errors:
consumer: Cannot connect to amqps://my_app:**@example.com:5671/prod: SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)').
Couldn't ack 2057, reason:"RecoverableConnectionError(None, 'connection already closed', None, '')"
I have tried to add --without-heartbeat but it does nothing.
How can I solve these problems? Sometimes my tasks take more than 30 minutes to finish, and I can't constantly monitor whether workers were kicked out of RabbitMQ.
Scenario:
I created a shared_task in Celery for testing purposes [RabbitMQ as the broker for queuing messages]:
@app.task(bind=True, max_retries=5, base=MyTask)
def testing(self):
    try:
        raise smtplib.SMTPException
    except smtplib.SMTPException as exc:
        print 'This is it'
        self.retry(exc=exc, countdown=2)

# Overriding the base Task class
class MyTask(celery.Task):
    def on_failure(self, exc, task_id, args, kwargs, einfo):
        print "MyTask on failure world"
        pass
I called the task for testing by running testing.delay() 10 times after creating a worker. Then I quit the server by pressing Ctrl+C and deleted all those queues from the RabbitMQ server. Then I started the server again.
Server starting command: celery worker --app=my_app.settings -l DEBUG
Queue delete command: rabbitmqadmin delete queue name=<queue_name>
Worker kill command: ps auxww | grep 'celery worker' | awk '{print $2}' | xargs kill -9
Problem:
Since I have already deleted all queues from the RabbitMQ server, only fresh tasks should be received now. But I am still getting old tasks; moreover, no new tasks are appearing on the list. What could be the actual cause of this?
What is happening is that your worker takes in more than one task at a time (prefetching), unless you start the worker with the -Ofair flag.
https://medium.com/@taylorhughes/three-quick-tips-from-two-years-with-celery-c05ff9d7f9eb
So, even if you clear out your queue, your worker will still be running with the tasks it's already picked up, unless you kill the worker process itself.
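As an aside, a settings-level sketch that limits how much each worker process reserves (this uses the old-style setting name, not something from the linked article; newer Celery versions call it worker_prefetch_multiplier):
# Hypothetical sketch: reserve only one message per worker process, so a
# cleared queue does not leave a large backlog of already-prefetched tasks.
CELERYD_PREFETCH_MULTIPLIER = 1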
Edit to add
If you have a task running after restart, you need to revoke the task.
http://celery.readthedocs.io/en/latest/faq.html#can-i-cancel-the-execution-of-a-task
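A minimal sketch of revoking a task by id (the import paths below are placeholders for your project layout):
from my_app.celery import app      # placeholder: your Celery app instance
from my_app.tasks import testing    # the task defined above

result = testing.delay()
app.control.revoke(result.id, terminate=True)  # terminate=True also stops a task that is already running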
I am using RabbitMQ to launch processes on remote hosts located in other parts of the world. E.g., RabbitMQ is running on an Oregon host, and it receives a client message to launch processes in Ireland and California.
Most of the time, the processes are launched and, when they finish, RabbitMQ returns the output to the client. But sometimes the jobs finish successfully, yet RabbitMQ hasn't returned the output to the client, and the client keeps hanging, waiting for the response. These processes can take 10 minutes to execute, so the client hangs for 10 minutes waiting for the response.
I am using Celery to connect to RabbitMQ, and the client calls block using task.get(). In other words, the client hangs until it receives the response to its call. I would like to understand why the client does not get the response when the jobs have finished. How can I debug this problem?
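For reference, this is roughly what the blocking client call looks like (the task and argument names are placeholders, not my real code):
result = launch_job.delay(host, command)  # launch_job is a placeholder task name
output = result.get()                     # blocks until a result arrives
# output = result.get(timeout=600)        # would raise TimeoutError instead of hanging forever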
Here is my celeryconfig.py
import os
import sys
# add hadoop python to the env, just for the running
sys.path.append(os.path.dirname(os.path.basename(__file__)))
# broker configuration
# medusa-rabbitmq is the name of the hosts where rabbitmq is running
BROKER_URL = "amqp://celeryuser:celery#medusa-rabbitmq/celeryvhost"
CELERY_RESULT_BACKEND = "amqp"
TEST_RUNNER = 'celery.contrib.test_runner.run_tests'
# for debug
# CELERY_ALWAYS_EAGER = True
# module loaded
CELERY_IMPORTS = ("medusa.mergedirs", "medusa.medusasystem",
"medusa.utility", "medusa.pingdaemon", "medusa.hdfs", "medusa.vote.voting")
I use a simple Flask application with gunicorn's gevent worker to serve server-sent events.
To stream the content, I use:
response = Response(eventstream(), mimetype="text/event-stream")
which streams events from redis:
def eventstream():
    for message in pubsub.listen():
        # ...
        yield str(event)
deployed with:
gunicorn -k gevent -b 127.0.0.1:50008 flaskapplication
But after it's been used for a while, I have 50 Redis connections open, even when no one is connected to the server-sent events stream anymore.
It seems like the view does not terminate, because gunicorn is non-blocking and pubsub.listen() is blocking.
How can I fix this? Should I limit the number of processes gunicorn may spawn, or should Flask kill the view after some timeout? If possible, it should stop the view/Redis connections on inactivity, without disconnecting users who are still connected to the SSE stream.
You can run gunicorn with -t <seconds> to specify a timeout for your workers, which will kill them if they are silent for that many seconds; 30 is typical. I think this should work for your issue, but I'm not completely sure.
From what I've seen, it seems like you could also rewrite your worker to use Timeout from gevent.
This might look something like the following:
from gevent import Timeout

def eventstream():
    pubsub = redis.pubsub()
    try:
        with Timeout(30) as timeout:
            pubsub.subscribe(channel)
            for message in pubsub.listen():
                # ...
                yield str(event)
    except Timeout as t:
        if t is not timeout:
            raise
        else:
            pubsub.unsubscribe(channel)
This example was helpful for getting the hang of how this should work.
Using the Timeout object from natdempk's solution, the most elegant approach is to send a heartbeat to detect dead connections:
while True:
    pubsub = redis.pubsub()
    try:
        with Timeout(30) as timeout:
            for message in pubsub.listen():
                # ...
                yield str(event)
                timeout.cancel()
                timeout.start()
    except Timeout as t:
        if t is not timeout:
            raise
        else:
            yield ":\n\n"  # heartbeat
Note that you need to call redis.pubsub() again, because the Redis connection is lost after the exception; otherwise you will get the error "'NoneType' object has no attribute 'readline'".
So I am using RabbitMQ + Celery to create a simple RPC architecture. I have one RabbitMQ message broker and one remote worker which runs the Celery daemon.
There is a third server which exposes a thin RESTful API. When it receives an HTTP request, it sends a task to the remote worker, waits for the result and returns a response.
This works great most of the time. However, I have noticed that after longer inactivity (say 5 minutes of no incoming requests), the Celery worker behaves strangely. The first 3 tasks received after such inactivity return this error:
exchange.declare: connection closed unexpectedly
After three erroneous tasks it works again. If there are no tasks for a longer period of time, the same thing happens. Any ideas?
My init script for the Celery worker:
# description "Celery worker using sync broker"
console log
start on runlevel [2345]
stop on runlevel [!2345]
setuid richard
setgid richard
script
chdir /usr/local/myproject/myproject
exec /usr/local/myproject/venv/bin/celery worker -n celery_worker_deamon.%h -A proj.sync_celery -Q sync_queue -l info --autoscale=10,3 --autoreload --purge
end script
respawn
My celery config:
# Synchronous blocking tasks
BROKER_URL_SYNC = 'amqp://guest:guest@localhost:5672//'
# Asynchronous non-blocking tasks
BROKER_URL_ASYNC = 'amqp://guest:guest@localhost:5672//'
#: Only add pickle to this list if your broker is secured
#: from unwanted access (see userguide/security.html)
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'UTC'
CELERY_ENABLE_UTC = True
CELERY_BACKEND = 'amqp'
# http://docs.celeryproject.org/en/latest/userguide/tasks.html#disable-rate-limits-if-they-re-not-used
CELERY_DISABLE_RATE_LIMITS = True
# http://docs.celeryproject.org/en/latest/userguide/routing.html
CELERY_DEFAULT_QUEUE = 'sync_queue'
CELERY_DEFAULT_EXCHANGE = "tasks"
CELERY_DEFAULT_EXCHANGE_TYPE = "topic"
CELERY_DEFAULT_ROUTING_KEY = "sync_task.default"
CELERY_QUEUES = {
'sync_queue': {
'binding_key':'sync_task.#',
},
'async_queue': {
'binding_key':'async_task.#',
},
}
Any ideas?
EDIT:
Ok, now it appears to happen randomly. I noticed this in RabbitMQ logs:
=WARNING REPORT==== 6-Jan-2014::17:31:54 ===
closing AMQP connection <0.295.0> (some_ip_address:36842 -> some_ip_address:5672):
connection_closed_abruptly
Is your RabbitMQ server or your Celery worker behind a load balancer by any chance? If so, the load balancer is closing the TCP connection after some period of inactivity. In that case, you will have to enable heartbeats from the client (worker) side. If you do, I would not recommend using the pure Python amqp lib for this; instead, replace it with librabbitmq.
The connection_closed_abruptly warning is caused when clients disconnect without the proper AMQP shutdown protocol:
channel.close(...)
Request a channel close.
This method indicates that the sender wants to close the channel.
This may be due to internal conditions (e.g. a forced shut-down) or due to
an error handling a specific method, i.e. an exception.
When a close is due to an exception, the sender provides the class and method id of
the method which caused the exception.
After sending this method, any received methods except Close and Close-OK MUST be discarded. The response to receiving a Close after sending Close must be to send Close-Ok.
channel.close-ok():
Confirm a channel close.
This method confirms a Channel.Close method and tells the recipient
that it is safe to release resources for the channel.
A peer that detects a socket closure without having received a
Channel.Close-Ok handshake method SHOULD log the error.
Here is an issue about that.
Can you set custom values for BROKER_HEARTBEAT and BROKER_HEARTBEAT_CHECKRATE and check again? For example:
BROKER_HEARTBEAT = 10
BROKER_HEARTBEAT_CHECKRATE = 2.0