Celery beat sometimes stops working - python

I'm using the latest stable Celery (4) with RabbitMQ in my Django project.
RabbitMQ runs on a separate server within the local network, and beat periodically just stops sending tasks to the worker without any errors; only restarting it resolves the issue.
There are no exceptions in the worker (I checked the logs, and I'm also using Sentry to catch exceptions). It just stops sending tasks.
Service config:
[Unit]
Description=*** Celery Beat
After=network.target
[Service]
User=***
Group=***
WorkingDirectory=/opt/***/web/
Environment="PATH=/opt/***/bin"
ExecStart=/opt/***/bin/celery -A *** beat --max-interval 30
[Install]
WantedBy=multi-user.target
Is it possible to fix this? Or are there any good alternatives? (Cron doesn't seem like the best solution.)

Your description sounds a lot like this open bug: https://github.com/celery/celery/issues/3409
There are a lot of details there, but the high-level description is that if the connection to RabbitMQ is lost, beat is unable to regain it.
Unfortunately, I can't see that anyone has definitively solved this issue.
You could start by debugging it with:
ExecStart=/opt/***/bin/celery -A *** beat --loglevel DEBUG --max-interval 30
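Until the upstream bug is resolved, a common workaround is to let systemd restart beat for you. A sketch of additions to the unit above; note that Restart= only covers the case where the process actually exits, while RuntimeMaxSec= (systemd >= 229) is a blunt mitigation for silent hangs, and the 6-hour interval is an assumption to tune:

```ini
[Service]
Restart=always        # restart beat if the process exits for any reason
RestartSec=10
# Blunt workaround for silent hangs: force a restart periodically.
RuntimeMaxSec=6h
```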

Related

Gunicorn access to other workers memory

Currently I've got a trading bot application with a UI to start and stop running bots. Behind it there is also a bot manager, which keeps track of all running bots and can start and stop them. Each bot is started in its own thread, and the manager must have access to that thread's memory (in order to stop it when necessary). Currently, if I specify more workers, then depending on which worker the request lands on, I have no access to the thread I need.
At this moment, I've got a gunicorn setup in docker:
gunicorn app:application --worker-tmp-dir /dev/shm --bind 0.0.0.0:8000 --timeout 600 --workers 1 --threads 4
The problem:
Yesterday one of the bots stopped because gunicorn apparently ran out of memory and the worker had to restart, killing the running bot in the process. (Can't have that; these bots should be able to run for months if needed.) Is there a way to fix this, or to tell gunicorn to never stop workers or processes? Or maybe I should use a different Python server like waitress or uWSGI?
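The underlying issue is that long-lived bots live inside a web worker that the server is free to recycle. One way out, sketched below under assumed names, is to move the bot threads into a standalone manager process that gunicorn never touches, and have the web app only send it commands:

```python
# Sketch: keep long-running bots out of gunicorn workers entirely.
# A standalone manager process owns the bot threads; the web app only
# sends it commands. All names here are illustrative assumptions.
import threading

class BotManager:
    """Owns bot threads; runs in its own process, not in a web worker."""

    def __init__(self):
        self._bots = {}  # bot_id -> (thread, stop_event)

    def start_bot(self, bot_id, target):
        stop_event = threading.Event()
        thread = threading.Thread(target=target, args=(stop_event,), daemon=True)
        self._bots[bot_id] = (thread, stop_event)
        thread.start()

    def stop_bot(self, bot_id):
        thread, stop_event = self._bots.pop(bot_id)
        stop_event.set()  # bots must check this event cooperatively
        thread.join(timeout=5)

    def running(self):
        return [b for b, (t, _) in self._bots.items() if t.is_alive()]
```

Run this in its own long-lived process (e.g. under systemd) and let the gunicorn workers talk to it over a socket or task queue; then worker restarts can never kill a bot.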

Django: Celery worker in production, Ubuntu 18+

I'm learning Celery and I'd like to ask:
What is the absolute simplest way to get Celery to run automatically when Django starts on Ubuntu? Right now I manually start celery -A {prj name} worker -l INFO from the terminal.
Can I configure anything so Celery picks up changes in the tasks.py code without needing a restart? Right now I hit ctrl+c and retype celery -A {prj name} worker -l INFO every time I change something in tasks.py. I can foresee a problem with this approach in production: if Celery starts automatically, do I need to restart Ubuntu instead?
(Setup: VPS, Django, Ubuntu 18.10 (no Docker), no external resources, using Redis (which starts automatically).)
I'm aware this is similar to Django-Celery in production and How to ..., but it is still a bit unclear, as those refer to Amazon and to shell scripts and crontabs. It seems a bit peculiar that these things wouldn't work out of the box.
I'll give the benefit of the doubt that I have misunderstood Celery's setup.
I have a deploy script that launches Celery in production.
In production it's better to launch workers with:
celery multi stop 5
celery multi start 5 -A {prj name} -Q:1 default -Q:2,3 QUEUE1 -Q:4,5 QUEUE2 --pidfile="%n.pid"
This will stop and then launch 5 workers for the different queues.
Celery imports your code once at launch, which means you need to relaunch it to apply modifications; you cannot add a file watcher in production (the memory cost is too high).
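For the autostart part, the standard approach on Ubuntu is a systemd unit rather than cron. A minimal sketch, where the paths, user, and project name are assumptions to replace with your own:

```ini
[Unit]
Description=Celery worker
After=network.target redis-server.service

[Service]
User=www-data
WorkingDirectory=/srv/myproject
ExecStart=/srv/myproject/venv/bin/celery -A myproject worker -l INFO
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now celery.service; after deploying code changes, systemctl restart celery replaces the manual ctrl+c cycle, and no Ubuntu restart is needed.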

Restart celery if celery worker is down on windows

I wanted to know if there is a way to restart a Celery worker if it goes down due to some error or issue, so that it can be restarted automatically and programmatically.
Check out this SO thread.
As you are using Windows, check the ability to run Celery as a service, as explained right here on SO.
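If running it as a Windows service is not an option, a crude supervisor loop also works. A sketch; the worker command line and the backoff delay are assumptions to adapt to your project:

```python
# Minimal supervisor: restart the worker whenever it exits abnormally.
import subprocess
import time

def supervise(cmd, restart_delay=5, max_restarts=None):
    """Run cmd; restart it on non-zero exit. Returns the restart count."""
    restarts = 0
    while True:
        proc = subprocess.run(cmd)
        if proc.returncode == 0:
            break  # clean exit: stop supervising
        if max_restarts is not None and restarts >= max_restarts:
            break  # give up after too many failures
        restarts += 1
        time.sleep(restart_delay)
    return restarts

if __name__ == "__main__":
    # -P solo is a common choice on Windows, where the default pool is unsupported
    supervise(["celery", "-A", "myproj", "worker", "-l", "INFO", "-P", "solo"])
```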

I get error 500 from gunicorn when I set Debug=False in Django settings.py - Upstart variant

The problem is identical to this one, but I use upstart. How do I modify my upstart conf below to make it work?
gunicorn.conf
description "gunicorn"
start on (filesystem)
stop on runlevel [016]
respawn
console log
setuid nobody
setgid nogroup
chdir /home/spadmin/spcrm
exec /home/spadmin/.virtualenvs/crm/bin/python /home/spadmin/spcrm/manage.py run_gunicorn -w 3 -k gevent
Nginx is started via upstart conf from nginx package. I can post it if it's relevant.
And out of curiosity: is this problem related to timing when starting jobs?
The question above provides a solution but no explanation.
Jingo's comment describes the solution to a common problem when running with DEBUG = False. Try that, but also make sure you're checking your gunicorn logs/console. The actual exception should be logged there, and that would let you know whether it's the ALLOWED_HOSTS problem (you would see a SuspiciousOperation exception), or some other issue.
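For reference, the usual fix is to list your hostnames in ALLOWED_HOSTS; a sketch of the relevant settings.py lines, with placeholder hostnames:

```python
# settings.py
DEBUG = False
# With DEBUG off, Django rejects any request whose Host header is not
# listed here (SuspiciousOperation), which surfaces as a 500 to clients.
ALLOWED_HOSTS = ["example.com", "www.example.com"]
```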
You may also want to consider adding Sentry to your project (via its Raven client). This will collect log messages and log uncaught exceptions to a very useful web app. See http://sentry.readthedocs.org/en/latest/

How can I communicate with Celery on Cloud Foundry?

I have a wsgi app with a celery component. Basically, when certain requests come in they can hand off relatively time-consuming tasks to celery. I have a working version of this product on a server I set up myself, but our client recently asked me to deploy it to Cloud Foundry. Since Celery is not available as a service on Cloud Foundry, we (me and the client's deployment team) decided to deploy the app twice – once as a wsgi app and once as a standalone celery app, sharing a rabbitmq service.
The code between the apps is identical. The wsgi app responds correctly, returning the expected web pages. vmc logs celeryapp shows that celery appears to be up and running, but when I send requests to the wsgi app that should become celery tasks, they disappear as soon as they hit a .delay() statement. They neither appear in the celery logs nor show up as errors.
Attempts to debug:
I can't use celery.contrib.rdb in Cloud Foundry (to supply a telnet interface to pdb), as each app is sandboxed and port-restricted.
I don't know how to find the specific rabbitmq instance these apps are supposed to share, so I can see what messages it's passing.
Update: to corroborate the above statement about finding rabbitmq, here's what happens when I try to access the node that should be sharing celery tasks:
root@cf:~# export RABBITMQ_NODENAME=eecef185-e1ae-4e08-91af-47f590304ecc
root@cf:~# export RABBITMQ_NODE_PORT=57390
root@cf:~# ~/cloudfoundry/.deployments/devbox/deploy/rabbitmq/sbin/rabbitmqctl list_queues
Listing queues ...
=ERROR REPORT==== 18-Jun-2012::11:31:35 ===
Error in process <0.36.0> on node 'rabbitmqctl17951@cf' with exit value: {badarg,[{erlang,list_to_existing_atom,["eecef185-e1ae-4e08-91af-47f590304ecc@localhost"]},{dist_util,recv_challenge,1},{dist_util,handshake_we_started,1}]}
Error: unable to connect to node 'eecef185-e1ae-4e08-91af-47f590304ecc@cf': nodedown
diagnostics:
- nodes and their ports on cf: [{'eecef185-e1ae-4e08-91af-47f590304ecc',57390},
  {rabbitmqctl17951,36032}]
- current node: rabbitmqctl17951@cf
- current node home dir: /home/cf
- current node cookie hash: 1igde7WRgkhAea8fCwKncQ==
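As an aside on finding the shared instance: on Cloud Foundry, bound service credentials are normally exposed to each app through the VCAP_SERVICES environment variable rather than via discoverable node names. A sketch of pulling the broker URL out of it; the service label and key layout are assumptions that vary by provider:

```python
import json
import os

def rabbitmq_url(label="rabbitmq"):
    """Return the AMQP URL of the first bound service whose label mentions RabbitMQ."""
    services = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    for name, instances in services.items():
        if label in name.lower():
            # key layout varies by provider; "credentials"/"url" is common
            return instances[0]["credentials"]["url"]
    return None
```

Pointing both the wsgi app and the celery app at this URL (instead of a hardcoded broker) at least guarantees they are talking to the same instance.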
How can I debug this and/or why are my tasks vanishing?
Apparently the problem was caused by a deadlock between the broker and the celery worker: the worker would never acknowledge the task as complete and never accept a new one, but never crashed or failed either. The tasks weren't vanishing; they were simply sitting in the queue forever.
Update: the deadlock was caused by the fact that we were running celeryd inside a wrapper script that installed dependencies (literally pip install -r requirements.txt && ./celeryd -lINFO). Because of how Cloud Foundry manages process trees, it would try to kill the parent process (bash), which would HUP celeryd, but many child processes would never die.