Celery Closes Unexpectedly After Longer Inactivity - python

So I am using RabbitMQ + Celery to create a simple RPC architecture. I have one RabbitMQ message broker and one remote worker which runs the Celery daemon.
There is a third server which exposes a thin RESTful API. When it receives an HTTP request, it sends a task to the remote worker, waits for the response and returns it.
This works great most of the time. However, I have noticed that after a longer period of inactivity (say 5 minutes with no incoming requests), the Celery worker behaves strangely. The first 3 tasks received after such an idle period return this error:
exchange.declare: connection closed unexpectedly
After three erroneous tasks it works again. If there are no tasks for a longer period of time, the same thing happens again. Any idea?
My init script for the Celery worker:
# description "Celery worker using sync broker"
console log
start on runlevel [2345]
stop on runlevel [!2345]
setuid richard
setgid richard
script
    chdir /usr/local/myproject/myproject
    exec /usr/local/myproject/venv/bin/celery worker -n celery_worker_deamon.%h -A proj.sync_celery -Q sync_queue -l info --autoscale=10,3 --autoreload --purge
end script
respawn
My celery config:
# Synchronous blocking tasks
BROKER_URL_SYNC = 'amqp://guest:guest@localhost:5672//'
# Asynchronous non blocking tasks
BROKER_URL_ASYNC = 'amqp://guest:guest@localhost:5672//'
#: Only add pickle to this list if your broker is secured
#: from unwanted access (see userguide/security.html)
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'UTC'
CELERY_ENABLE_UTC = True
CELERY_BACKEND = 'amqp'
# http://docs.celeryproject.org/en/latest/userguide/tasks.html#disable-rate-limits-if-they-re-not-used
CELERY_DISABLE_RATE_LIMITS = True
# http://docs.celeryproject.org/en/latest/userguide/routing.html
CELERY_DEFAULT_QUEUE = 'sync_queue'
CELERY_DEFAULT_EXCHANGE = "tasks"
CELERY_DEFAULT_EXCHANGE_TYPE = "topic"
CELERY_DEFAULT_ROUTING_KEY = "sync_task.default"
CELERY_QUEUES = {
    'sync_queue': {
        'binding_key': 'sync_task.#',
    },
    'async_queue': {
        'binding_key': 'async_task.#',
    },
}
Any ideas?
EDIT:
OK, now it appears to happen randomly. I noticed this in the RabbitMQ logs:
=WARNING REPORT==== 6-Jan-2014::17:31:54 ===
closing AMQP connection <0.295.0> (some_ip_address:36842 -> some_ip_address:5672):
connection_closed_abruptly

Is your RabbitMQ server or your Celery worker behind a load balancer by any chance? If so, the load balancer is probably closing the TCP connection after some period of inactivity. In that case, you will have to enable heartbeats from the client (worker) side. If you do, I would not recommend using the pure Python amqp lib for this; instead, replace it with librabbitmq.

connection_closed_abruptly is reported when a client disconnects without going through the proper AMQP shutdown protocol:
channel.close(...)
Request a channel close.
This method indicates that the sender wants to close the channel.
This may be due to internal conditions (e.g. a forced shut-down) or due to
an error handling a specific method, i.e. an exception.
When a close is due to an exception, the sender provides the class and method id of
the method which caused the exception.
After sending this method, any received methods except Close and Close-OK MUST be discarded. The response to receiving a Close after sending Close must be to send Close-Ok.
channel.close-ok():
Confirm a channel close.
This method confirms a Channel.Close method and tells the recipient
that it is safe to release resources for the channel.
A peer that detects a socket closure without having received a
Channel.Close-Ok handshake method SHOULD log the error.
Here is an issue about that.
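For contrast, a graceful client-side shutdown that does complete the handshake looks roughly like this (just a sketch, assuming py-amqp 2.x; credentials and host are placeholders):
import amqp  # py-amqp, the pure-Python transport Celery uses by default

conn = amqp.Connection(host='localhost:5672', userid='guest', password='guest', virtual_host='/')
conn.connect()
channel = conn.channel()
# ... publish / consume ...
channel.close()  # sends channel.close and waits for close-ok
conn.close()     # sends connection.close, completing the AMQP shutdown handshake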
Can you set custom values for BROKER_HEARTBEAT and BROKER_HEARTBEAT_CHECKRATE and check again? For example:
BROKER_HEARTBEAT = 10
BROKER_HEARTBEAT_CHECKRATE = 2.0

Related

Celery, RabbitMQ removes worker from consumers list while it is performing tasks

I have started my Celery worker, which uses RabbitMQ as the broker, like this:
celery -A my_app worker -l info -P gevent -c 100 --prefetch-multiplier=1 -Q my_app
Then I have a task which looks roughly like this:
@shared_task(queue='my_app', default_retry_delay=10, max_retries=1, time_limit=8 * 60)
def example_task():
    # get a queryset with some filtering
    my_models = MyModel.objects.filter(...)
    for my_model in my_models.iterator():
        my_model.execute_something()
Sometimes this task can finish in less than a minute, and sometimes, during high load, it requires more than 5 minutes to finish.
The main problem is that RabbitMQ constantly removes my worker from the consumers list. It looks really random. Because of that I need to restart the worker again.
Workers also start throwing these errors:
SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2396)
Sometimes these errors:
consumer: Cannot connect to amqps://my_app:**@example.com:5671/prod: SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)').
Couldn't ack 2057, reason:"RecoverableConnectionError(None, 'connection already closed', None, '')"
I have tried to add --without-heartbeat but it does nothing.
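For reference, that is the same start command from above with the flag appended:
celery -A my_app worker -l info -P gevent -c 100 --prefetch-multiplier=1 -Q my_app --without-heartbeat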
How do I solve these problems? Sometimes my tasks take more than 30 minutes to finish, and I can't constantly monitor whether workers have been kicked out of RabbitMQ.

Client keeps waiting for RabbitMQ response

I am using RabbitMQ to launch processes on remote hosts located in other parts of the world. E.g., RabbitMQ is running on an Oregon host, and it receives a client message to launch processes in Ireland and California.
Most of the time, the processes are launched and, when they finish, RabbitMQ returns the output to the client. But sometimes the jobs finish successfully yet RabbitMQ never returns the output, and the client keeps hanging, waiting for the response. These processes can take 10 minutes to execute, so the client hangs for 10 minutes waiting for the response.
I am using Celery to connect to RabbitMQ, and the client calls block using task.get(). In other words, the client hangs until it receives the response for its call. I would like to understand why the client does not get the response even though the jobs have finished. How can I debug this problem?
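For illustration, the blocking pattern looks roughly like this (launch_remote_process is a hypothetical task name, not my real one); passing a timeout to get() at least turns the endless hang into an exception:
# launch_remote_process is a hypothetical @app.task, shown only for illustration
result = launch_remote_process.delay('ireland-host', '/path/to/job')
# get() blocks until the worker publishes a result to the amqp result backend;
# with a timeout, celery.exceptions.TimeoutError is raised instead of hanging forever
output = result.get(timeout=15 * 60, propagate=True)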
Here is my celeryconfig.py
import os
import sys
# add hadoop python to the env, just for the running
sys.path.append(os.path.dirname(os.path.basename(__file__)))
# broker configuration
# medusa-rabbitmq is the name of the hosts where rabbitmq is running
BROKER_URL = "amqp://celeryuser:celery#medusa-rabbitmq/celeryvhost"
CELERY_RESULT_BACKEND = "amqp"
TEST_RUNNER = 'celery.contrib.test_runner.run_tests'
# for debug
# CELERY_ALWAYS_EAGER = True
# module loaded
CELERY_IMPORTS = ("medusa.mergedirs", "medusa.medusasystem",
"medusa.utility", "medusa.pingdaemon", "medusa.hdfs", "medusa.vote.voting")

Where to place register code to zookeeper when using nd_service_registry with uwsgi+Django stack?

I'm using nd_service_registry to register my Django service with ZooKeeper; the service is launched with uWSGI.
versions:
uWSGI==2.0.10
Django==1.7.5
My question is: what is the correct place for the nd_service_registry.set_node code so the service registers itself with the ZooKeeper server while avoiding duplicate registration or deregistration?
My uwsgi config ini, with processes=2, enable-threads=true, threads=2:
[uwsgi]
chdir = /data/www/django-proj/src
module = settings.wsgi:application
env = DJANGO_SETTINGS_MODULE=settings.test
master = true
pidfile = /tmp/uwsgi-proj.pid
socket = /tmp/uwsgi_proj.sock
processes = 2
threads = 2
harakiri = 20
max-requests = 50000
vacuum = true
home = /data/www/django-proj/env
enable-threads = true
buffer-size = 65535
chmod-socket=666
register code:
from nd_service_registry import KazooServiceRegistry
nd = KazooServiceRegistry(server=ZOOKEEPER_SERVER_URL)
nd.set_node('/web/test/server0', {'host': 'localhost', 'port': 80})
I've tested the following placements and both worked as expected; the Django service registered only once, at uWSGI master process startup:
place code in settings.py
place code in wsgi.py
Even if I kill uWSGI worker processes (the master process then relaunches another worker) or let a worker be killed and restarted by the uWSGI harakiri option, no new register action is triggered.
So my question is whether my register code is correct for Django + uWSGI with processes and threads enabled, and where to place it.
The problem happens when you use uWSGI in master/worker mode. When the uWSGI master process spawns workers, the ZooKeeper connection maintained by a thread in the ZooKeeper client cannot be carried over to the workers correctly. So in a uWSGI application you should use the uwsgidecorators.postfork decorator to run the registration code. A function decorated with @postfork is called whenever a new worker is spawned, as in the sketch below.
Hope it will help.
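For illustration, a minimal sketch of that approach, reusing the register snippet from the question (ZOOKEEPER_SERVER_URL and the node path are the question's own placeholders):
# e.g. in wsgi.py, which uWSGI imports at startup
from uwsgidecorators import postfork
from nd_service_registry import KazooServiceRegistry

@postfork
def register_to_zookeeper():
    # runs in every worker right after fork, so each worker creates its own
    # ZooKeeper connection instead of inheriting the master's broken one
    nd = KazooServiceRegistry(server=ZOOKEEPER_SERVER_URL)
    nd.set_node('/web/test/server0', {'host': 'localhost', 'port': 80})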

View processes not stopped when streaming with python flask

I use a simple Flask application with gunicorn's gevent worker to serve server-sent events.
To stream the content, I use:
response = Response(eventstream(), mimetype="text/event-stream")
which streams events from Redis:
def eventstream():
    for message in pubsub.listen():
        # ...
        yield str(event)
deployed with:
gunicorn -k gevent -b 127.0.0.1:50008 flaskapplication
But after it has been used for a while, I have 50 Redis connections open, even when no one is connected to the server-sent events stream anymore.
It seems like the view does not terminate, because gunicorn is non-blocking and pubsub.listen() is blocking.
How can I fix this? Should I limit the number of processes gunicorn may spawn, or should Flask kill the view after some timeout? If possible, it should close the view/Redis connections on inactivity, without disconnecting users who are still connected to the SSE stream.
You can run gunicorn with -t <seconds> to specify a timeout for your workers, which will kill them if they are silent for that many seconds; 30 is typical. I think this should work for your issue, but I'm not completely sure.
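For example, adding it to the command from the question (30 seconds is just an illustrative value):
gunicorn -k gevent -t 30 -b 127.0.0.1:50008 flaskapplication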
From what I've seen, it seems like you could also rewrite your worker to use Timeout from gevent.
This might look something like the following:
from gevent import Timeout
def eventstream():
    pubsub = redis.pubsub()
    try:
        with Timeout(30) as timeout:
            pubsub.subscribe(channel)
            for message in pubsub.listen():
                # ...
                yield str(event)
    except Timeout as t:
        if t is not timeout:
            raise
        else:
            pubsub.unsubscribe(channel)
This example was helpful for getting the hang of how this should work.
Using the Timeout object from natdempk's solution, the most elegant fix is to send a heartbeat to detect dead connections:
while True:
    pubsub = redis.pubsub()
    pubsub.subscribe(channel)  # re-subscribe; the previous connection is gone after the Timeout
    try:
        with Timeout(30) as timeout:
            for message in pubsub.listen():
                # ...
                yield str(event)
                timeout.cancel()
                timeout.start()
    except Timeout as t:
        if t is not timeout:
            raise
        else:
            yield ":\n\n"  # heartbeat
Note that you need to call redis.pubsub() again, because the Redis connection is lost after the exception; otherwise you will get the error "'NoneType' object has no attribute 'readline'".

Celery Heartbeat Not Working

I have set heartbeat in Celery settings:
BROKER_HEARTBEAT = 10
I have also set this configuration value in RabbitMQ config:
'heartbeat' => '10',
But somehow heartbeats are still disabled:
ubuntu#sync1:~$ sudo rabbitmqctl list_connections name timeout
Listing connections ...
some_address:37781 -> other_address:5672 0
some_address:37782 -> other_address:5672 0
...done.
Any ideas what I am doing wrong?
UPDATE:
So now I get:
ubuntu#sync1:/etc/puppet$ sudo rabbitmqctl list_connections name timeout
Listing connections ...
some_address:41281 -> other_address:5672 10
some_address:41282 -> other_address:5672 10
some_address:41562 -> other_address:5672 0
some_address:41563 -> other_address:5672 0
some_address:41564 -> other_address:5672 0
some_address:41565 -> other_address:5672 0
some_address:41566 -> other_address:5672 0
some_address:41567 -> other_address:5672 0
some_address:41568 -> other_address:5672 0
...done.
I have 3 servers:
RabbitMQ broker
RESTful API server
Remote Worker server
It appears the remote daemonised Celery workers send heartbeats correctly. The RESTful API server, which uses Celery to process tasks remotely, is not using heartbeats for some reason.
The heartbeat of a Celery worker is an application-level heartbeat, not the AMQP protocol's heartbeat.
Each worker periodically sends a heartbeat event message to the "celeryev" event exchange on the broker.
The heartbeat event is forwarded back to the worker, so the worker can know the health status of the broker.
If the number of missed heartbeats exceeds a threshold, the worker can reconnect to the broker.
For the rest of the details, you may check this page.
The section BROKER_FAILOVER_STRATEGY describes what you can do when the connection to a broker is lost.
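For illustration, a failover strategy is configured roughly like this (the broker URLs are placeholders; 'shuffle' and the default 'round-robin' are the built-in strategies):
BROKER_URL = ['amqp://broker1:5672//', 'amqp://broker2:5672//']
BROKER_FAILOVER_STRATEGY = 'shuffle'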
Celery workers definitely do support AMQP heartbeats. The configuration item BROKER_HEARTBEAT is used to define the heartbeat interval of the AMQP client (the Celery worker).
The description of BROKER_HEARTBEAT can be found in the Celery docs.
Possible causes of heartbeats not working:
1. Using the wrong transport, such as 'librabbitmq'.
As the Celery docs describe, only the 'pyamqp' transport supports BROKER_HEARTBEAT.
Check whether the librabbitmq package is installed,
or use the 'pyamqp' transport in the broker URL: 'pyamqp://userid:password@hostname:port/virtual_host' rather than 'amqp://userid:password@hostname:port/virtual_host' (see the config sketch after this list).
2. No events sent to the Celery worker during three heartbeat intervals after boot-up.
Check the code here to see how the heartbeat works.
drain_events will be called during worker boot-up; see the code here.
If no event is sent to the Celery worker, connection.heartbeat_check will not be called.
By the way, connection.heartbeat_check is defined here.
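Putting those points together, a minimal config sketch that should get AMQP heartbeats working (credentials and values are placeholders):
BROKER_URL = 'pyamqp://guest:guest@localhost:5672//'
BROKER_HEARTBEAT = 10
BROKER_HEARTBEAT_CHECKRATE = 2.0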
Hope this helps someone who encounters the heartbeat issue.
