We have a django application served through gunicorn sync workers. There's a tornado server run in a thread from the django application itself(Let's not argue about architecture, it's legacy code) and is tightly coupled with worker.
It binds itself to a free port stored in a db. Now every time a worker restarts, I would need to start the tornado too, but since the port has been used it won't start.
I would need to somehow detect that worker went down and would need to store the state of port to available. However, I am not able to detect that anyhow.
signal.signal(signal.SIGTERM, stop_handler)
signal.signal(signal.SIGINT, stop_handler)
Above are only detected when I manually kill the worker but not when gunicorn restarts the worker itself.
How can I detect the same?
Related
Currently I've got a trading bot application which has a UI interface to start and stop currently running bots. In behind there is also a bot manager, which keeps track of all running bots and has the ability to start and stop them. Each bot is started in its own thread and the manager must have access to thats thread memory (in order to stop it when necessary). Currently If i specify more workers then depending which worker the request lands on I've no access to the thread I need.
At this moment, I've got a gunicorn setup in docker:
gunicorn app:application --worker-tmp-dir /dev/shm --bind 0.0.0.0:8000 --timeout 600 --workers 1 --threads 4
The problem:
Yesterday one of the bots stopped because apparently gunicorn ran out of memory and the worker had to restart in the process killing running bot. (Can't have that). These bots should be able to run for months if needed. Is there a way to fix or tell gunicorn to never stop workers or processes? Or maybe I should use a different python server like waitress or uWSGI?
I have two servers, there is one celery worker on each server. I use Redis as the broker to collaborate with workers.
My question is how can I make only one worker run for most of the time, and once this worker is broken, another worker will turn on as a backup worker?
Basically, just take one worker as a back-up.
I know how to specify a task to a certain worker by a queue on the worker respectively, after reading the doc [http://docs.celeryproject.org/en/latest/userguide/routing.html#redis-message-priorities]
This is, in my humble opinion, completely against the point of having distributed system to off-load CPU heavy, or long-running tasks, or have thousands of small tasks that you can't run elsewhere...
- You are running two servers anyway, so why keeping the other one idle? More workers mean you will be able to process more tasks concurrently.
If you are not convinced, and still want to do this, you need to write a tiny service on machine with idle Celery worker. This service will periodically check the health of the active worker, and if that check fails, you will run Celery worker on the backup server.
Here is a question for you - why this service simply does not restart the Celery worker on the active server? - It is pretty much possible to do that, so again, I see no justification for having a completely idle machine doing nothing. If you are on a cloud platform, you can easily spin up a new instance from an existing image of your Celery worker. This is scenario I use in production.
I have a server with above configuration and I am processing long tasks but I have to update user about the process state, which I am doing through Firebase. To respond to the client immediately I enqueue the job in redis using python-rq.
I am using flask and uwsgi and Nginx. In uwsgi conf file, there is a field which asks for number of processes.
My question is, Do I need to start multiple uwsgi processes, or more redis workers?
Does starting more uwsgi workers will create more redis workers?
How would the scaling work, My server has 1 vCPU and 2GB ram. I have aws autoscaling for production. Should I run more uWsgi workers and how many redis workers with only one queue.
I am starting the worker independently. The flask app is importing the connection and adding the job.
my startup script
my worker code
It depends upon how you're running rq workers. There can be two cases
1) Running rq workers from inside the app. Then increasing number of workers in uwsgi settings will automatically spawn num_rq_workers_in_app_conf * num_app_workers_in_uwsgi_conf
2) Running rq workers outside application like using supervisord. Where you can manually control number of rq workers independently of app.
According to me running rq workers under supervisord is a better option than point 1. It helps in effective debugging of each worker and one more issue which I've encountered while using rq is that rq-workers running via point 1 strategy unregisters themselves from rq i.e becomes dead for rq although running in background in few weeks interval.
I would like to run some code in a uWSGI app, but on a long-lived process, not inside workers. That's because the process blocks on a socket recv() call, and only one thread of execution should do this.
I am hoping to avoiding creating my own daemon by somehow starting a long-lived process on uWSGI startup that does not get spawned in each worker.
Does uWSGI support anything like this?
uWSGI Mules are like workers but without network access:
http://uwsgi-docs.readthedocs.org/en/latest/Mules.html
I have a wsgi app with a celery component. Basically, when certain requests come in they can hand off relatively time-consuming tasks to celery. I have a working version of this product on a server I set up myself, but our client recently asked me to deploy it to Cloud Foundry. Since Celery is not available as a service on Cloud Foundry, we (me and the client's deployment team) decided to deploy the app twice – once as a wsgi app and once as a standalone celery app, sharing a rabbitmq service.
The code between the apps is identical. The wsgi app responds correctly, returning the expected web pages. vmc logs celeryapp shows that celery is to be up-and-running, but when I send requests to wsgi that should become celery tasks, they disappear as soon as they get to a .delay() statement. They neither appear in the celery logs nor do they appear as an error.
Attempts to debug:
I can't use celery.contrib.rdb in Cloud Foundry (to supply a telnet interface to pdb), as each app is sandboxed and port-restricted.
I don't know how to find the specific rabbitmq instance these apps are supposed to share, so I can see what messages it's passing.
Update: to corroborate the above statement about finding rabbitmq, here's what happens when I try to access the node that should be sharing celery tasks:
root#cf:~# export RABBITMQ_NODENAME=eecef185-e1ae-4e08-91af-47f590304ecc
root#cf:~# export RABBITMQ_NODE_PORT=57390
root#cf:~# ~/cloudfoundry/.deployments/devbox/deploy/rabbitmq/sbin/rabbitmqctl list_queues
Listing queues ...
=ERROR REPORT==== 18-Jun-2012::11:31:35 ===
Error in process <0.36.0> on node 'rabbitmqctl17951#cf' with exit value: {badarg,[{erlang,list_to_existing_atom,["eecef185-e1ae-4e08-91af-47f590304ecc#localhost"]},{dist_util,recv_challenge,1},{dist_util,handshake_we_started,1}]}
Error: unable to connect to node 'eecef185-e1ae-4e08-91af-47f590304ecc#cf': nodedown
diagnostics:
- nodes and their ports on cf: [{'eecef185-e1ae-4e08-91af-47f590304ecc',57390},
{rabbitmqctl17951,36032}]
- current node: rabbitmqctl17951#cf
- current node home dir: /home/cf
- current node cookie hash: 1igde7WRgkhAea8fCwKncQ==
How can I debug this and/or why are my tasks vanishing?
Apparently the problem was caused by a deadlock between the broker and the celery worker, such that the worker would never acknowledge the task as complete, and never accept a new task, but never crashed or failed either. The tasks weren't vanishing; they were simply staying in queue forever.
Update: The deadlock was caused by the fact that we were running celeryd inside a wrapper script that installed dependencies. (Literally pip install -r requirements.txt && ./celeryd -lINFO). Because of how Cloud Foundry manages process trees, Cloud Foundry would try to kill the parent process (bash), which would HUP celeryd, but ultimately lots of child processes would never die.