When new versions of my Django application are deployed to Heroku, the workers are forced to restart. I have some long-running tasks which should perform some cleanup prior to being killed.
I have tried registering a worker_shutdown hook, which doesn't ever seem to get called.
I have also tried the answer in Notify celery task of worker shutdown, but I am unclear how to abort a given task from within this context, as calling celery.task.control.active() throws an exception (celery is no longer running).
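For context, registering the hook looks roughly like this (a minimal sketch; the app name, broker URL, and cleanup body are placeholders, not necessarily my exact code):

# tasks.py, a sketch of hooking Celery's worker_shutdown signal
from celery import Celery
from celery.signals import worker_shutdown

app = Celery("myapp", broker="redis://localhost:6379/0")  # placeholder broker URL

@worker_shutdown.connect
def cleanup_before_exit(sender=None, **kwargs):
    # Placeholder cleanup: this is where long-running tasks would be told
    # to checkpoint or abort before the worker process exits.
    print("worker_shutdown received, running cleanup")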
Thanks for any help.
If you control the deployment, maybe you can run a script that does a Control.broadcast to a custom command you register beforehand, and only continue the deployment after receiving the required replies (you'd have to implement that logic), or raise a TimeoutException otherwise?
Also, Celery already has a predefined command for shutdown, which I'm guessing you could overload in your instance or subclass of worker. Commands have the advantage of being passed a Panel instance, which gives you access to the consumer. That should expose a lot of control right there...
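A rough sketch of what that could look like (the command name, project import, and timeout are assumptions, not tested on Heroku):

# worker side: register a custom remote-control command
from celery.worker.control import Panel

@Panel.register
def prepare_for_shutdown(panel, **kwargs):
    # panel gives access to the consumer/worker internals; here you would
    # flag running tasks to finish up or checkpoint before the restart.
    return {"ok": "draining"}

# deploy-script side: broadcast the command and wait for replies
from myproject.celery import app  # placeholder import

replies = app.control.broadcast("prepare_for_shutdown", reply=True, timeout=10)
if not replies:
    raise TimeoutError("no workers acknowledged the shutdown request")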
I am running Python 3.9, Windows 10, Celery 4.3, Redis as the backend, and AWS SQS as the broker. (I wasn't intending to use a result backend, but it became more and more apparent that, given the library's restrictions on Windows, I'd be better off using one if I could get it to work; otherwise I would have just used Redis as both the broker and the backend.)
To give you some context, I have a webpage that a user interacts with to allow them to do a resource intensive task. If the user has a task running and decides to resend the task, I need it to kill the task, and use the new information sent by the user to create the new task.
The problem for me arrives after this line of thinking:
Me: "Hmmm, the prefork pool is used for heavy cpu background tasks... I want to use that..."
Me: Goes and configures settings.py,
updates the celery library,
sets the environment variable to allow windows to run prefork pool -
os.environ.setdefault('FORKED_BY_MULTIPROCESSING', '1'),
sets a few other configuration settings, etc,
runs the worker and it works.
Me: "Hey, hey. It works... Oh, I still can't revoke a task DESPITE RUNNING THE PREFORK POOL!?!?!
Oh, that's okay... I can just set a session variable to let me know if the user already started a task,
and if they have, just have celery tell me if the task that they started is finished
before I allow the user to request to run a task again."
Me: Goes and configures django sessions,
configures redis,
updates the views to include the session variable, etc,
Me: "Great! Everything is working, so far..."
Me: Runs a test to see if the redis server returns the status...
Celery: "PENDING"
Me: "Yo! Is my task done, yet!?"
Celery: "No - PENDING"
Celery: "PENDING"
Celery: "PENDING"
Celery: "PENDING"
Celery: "PENDING"
Celery: "PENDING"
Me: Searches stackoverflow for why it's only pending...
Me: Finds out that you must use --pool=solo for the worker...
Me: Dies on the inside.
Ideally - I'd like to be able to use the prefork pool to do intense processing and to kill the task if need be. The thing is that everything that I read tells me prefork is what I want, but solo is the only way I can think of to get it to work.
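For reference, the session-variable approach described above looks roughly like this (the view names, task name, and URL wiring are hypothetical, just to illustrate the idea):

# views.py, a minimal sketch of gating task submission with a session variable
from celery.result import AsyncResult
from django.http import JsonResponse

from .tasks import heavy_task  # hypothetical task

def start_task(request):
    previous_id = request.session.get("task_id")
    if previous_id and AsyncResult(previous_id).state not in ("SUCCESS", "FAILURE", "REVOKED"):
        return JsonResponse({"detail": "a task is already running"}, status=409)
    result = heavy_task.delay(request.GET.get("payload"))
    request.session["task_id"] = result.id
    return JsonResponse({"task_id": result.id})

def task_status(request):
    task_id = request.session.get("task_id")
    state = AsyncResult(task_id).state if task_id else None
    # Note: an id the result backend has never heard of also reports PENDING,
    # which is exactly the symptom described above.
    return JsonResponse({"state": state})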
Questions:
How bad is it for me to compromise these desires and just go with solo, given that the tasks are CPU-heavy and there will be many users? Assume hundreds if not thousands submitting tasks at once.
What other solutions should I consider?
In my experience on Windows I cannot use anything other than --pool=solo.
What other solutions should I consider?
The way I do it: I use the solo pool for Windows development and more processes (prefork) in production on Linux. At least in my case, using the solo pool for development is fine.
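If it helps, here is a small sketch of picking the pool per platform when launching the worker from Python (the app name and broker URL are placeholders):

# run_worker.py, sketch: solo pool on Windows, prefork elsewhere
import sys

from celery import Celery

app = Celery("app", broker="redis://localhost:6379/0")  # placeholder broker

if __name__ == "__main__":
    pool = "solo" if sys.platform == "win32" else "prefork"
    app.worker_main(argv=["worker", "--loglevel=INFO", f"--pool={pool}"])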
How do I make the celery -A app worker command consume only a single task and then exit?
I want to run celery workers as a kubernetes Job that finishes after handling a single task.
I'm using KEDA for autoscaling workers according to queue messages.
I want to run celery workers as jobs for long running tasks, as suggested in the documentation:
KEDA long running execution
There's not really anything specific for this; you would have to hack in your own driver program, probably via a custom concurrency module. Are you trying to use KEDA ScaledJobs or something? You would just use a ScaledObject instead.
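One hack that comes to mind (not an official feature, just a sketch): run the worker with --concurrency=1 and have it ask itself to shut down once its first task finishes, for example via the task_postrun signal, so the Kubernetes Job completes when the process exits. This assumes the default node name celery@<hostname>; adjust if you pass -n.

# celery_app.py, sketch: shut the worker down after it finishes one task
import socket

from celery import Celery
from celery.signals import task_postrun

app = Celery("app", broker="sqs://")  # placeholder broker

@task_postrun.connect
def shutdown_after_first_task(**kwargs):
    # Ask only this worker to shut down so other workers sharing the broker
    # keep running; the KEDA-scaled Job then finishes once the process exits.
    app.control.shutdown(destination=[f"celery@{socket.gethostname()}"])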
I am updating my ECS service to use a new task definition. The task definition in this case is a flask application running on gunicorn.
Under certain conditions, I want the flask application to exit and subsequently the update to the ECS service to fail. In this case, I want to check the database connection (and exit if the database connection is not running).
However, I am not seeing this. Whenever I exit or kill the flask application (using sys.exit or os.kill, for example), the ECS service still continues updating to the new task definition.
How do I make the update to the ECS service fail as well, given that my entrypoint fails?
I saw this as well: https://aws.amazon.com/blogs/containers/graceful-shutdowns-with-ecs/ but the SIGTERM also doesn't work.
To update this: the problem was that the service kept crash-looping. To resolve it, I used the ECS deployment circuit breaker: https://aws.amazon.com/blogs/containers/announcing-amazon-ecs-deployment-circuit-breaker/ which allows you to break out of this loop.
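The circuit breaker can be enabled from the console, the CLI, or programmatically; here's a sketch with boto3 (the cluster and service names are placeholders):

# enable_circuit_breaker.py, sketch: turn on the ECS deployment circuit breaker
import boto3

ecs = boto3.client("ecs")
ecs.update_service(
    cluster="my-cluster",        # placeholder
    service="my-flask-service",  # placeholder
    deploymentConfiguration={
        "deploymentCircuitBreaker": {
            "enable": True,
            "rollback": True,  # roll back to the last working deployment
        }
    },
)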
I have read the documentation of uWSGI mules, and some info from other sources, but I'm confused about the differences between mules and Python threads. Can anyone explain what mules can do that threads cannot, and why mules even exist?
A uWSGI mule can be thought of as a separate worker process that is not accessible via sockets (e.g. direct web requests). It executes an instance of your application and can be used for offloading tasks, for example via the mulefunc Python decorator. Also, as mentioned in the documentation, a mule can be configured to execute custom logic.
On the other hand, a thread runs in its parent's (a uWSGI worker's) address space, so if the worker dies or is reloaded, the thread goes the same way. A thread can handle requests and can also execute specified tasks (functions) via the thread decorator.
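A small sketch of the difference in practice (the function names are arbitrary, and this assumes the app is running under uWSGI with at least one mule configured):

# tasks.py, sketch: offloading work to a mule vs. a thread under uWSGI
from uwsgidecorators import mulefunc, thread

@mulefunc
def rebuild_report(report_id):
    # Runs inside a mule process, outside any HTTP worker, so reloading
    # or killing a worker does not interrupt it.
    ...

@thread
def warm_cache(key):
    # Runs in a thread inside the calling worker process and lives and
    # dies with that worker.
    ...

def some_view(request):
    rebuild_report(42)      # queued to a mule, returns immediately
    warm_cache("homepage")  # spawns a thread in this worker
    return "OK"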
Python threads do not span multiple CPUs; roughly speaking, they can't use all the CPU power. This is a limitation of the Python GIL (see What is the global interpreter lock (GIL) in CPython?).
This is one of the reasons for using web servers: their duty is to spawn a worker process, or reuse an idle one, for each task received (HTTP request).
A mule works on the same principle, but is particular in that it is intended to run tasks outside of an HTTP request context. The idea is that you can reserve some mules, each running in a separate process (spanning multiple CPUs) as regular workers do, but they don't serve any HTTP requests, only the tasks you set up as described in the uWSGI documentation.
It's worth mentioning that mules are also monitored by the web server's master process, so they are respawned when killed or dead.
I have a wsgi app with a celery component. Basically, when certain requests come in they can hand off relatively time-consuming tasks to celery. I have a working version of this product on a server I set up myself, but our client recently asked me to deploy it to Cloud Foundry. Since Celery is not available as a service on Cloud Foundry, we (me and the client's deployment team) decided to deploy the app twice – once as a wsgi app and once as a standalone celery app, sharing a rabbitmq service.
The code between the apps is identical. The wsgi app responds correctly, returning the expected web pages. vmc logs celeryapp shows that celery appears to be up and running, but when I send requests to wsgi that should become celery tasks, they disappear as soon as they get to a .delay() statement. They neither appear in the celery logs nor do they appear as an error.
Attempts to debug:
I can't use celery.contrib.rdb in Cloud Foundry (to supply a telnet interface to pdb), as each app is sandboxed and port-restricted.
I don't know how to find the specific rabbitmq instance these apps are supposed to share, so I can see what messages it's passing.
Update: to corroborate the above statement about finding rabbitmq, here's what happens when I try to access the node that should be sharing celery tasks:
root@cf:~# export RABBITMQ_NODENAME=eecef185-e1ae-4e08-91af-47f590304ecc
root@cf:~# export RABBITMQ_NODE_PORT=57390
root@cf:~# ~/cloudfoundry/.deployments/devbox/deploy/rabbitmq/sbin/rabbitmqctl list_queues
Listing queues ...
=ERROR REPORT==== 18-Jun-2012::11:31:35 ===
Error in process <0.36.0> on node 'rabbitmqctl17951@cf' with exit value: {badarg,[{erlang,list_to_existing_atom,["eecef185-e1ae-4e08-91af-47f590304ecc@localhost"]},{dist_util,recv_challenge,1},{dist_util,handshake_we_started,1}]}
Error: unable to connect to node 'eecef185-e1ae-4e08-91af-47f590304ecc@cf': nodedown
diagnostics:
- nodes and their ports on cf: [{'eecef185-e1ae-4e08-91af-47f590304ecc',57390},
  {rabbitmqctl17951,36032}]
- current node: rabbitmqctl17951@cf
- current node home dir: /home/cf
- current node cookie hash: 1igde7WRgkhAea8fCwKncQ==
How can I debug this and/or why are my tasks vanishing?
Apparently the problem was caused by a deadlock between the broker and the celery worker, such that the worker would never acknowledge the task as complete, and never accept a new task, but never crashed or failed either. The tasks weren't vanishing; they were simply staying in queue forever.
Update: The deadlock was caused by the fact that we were running celeryd inside a wrapper script that installed dependencies. (Literally pip install -r requirements.txt && ./celeryd -lINFO). Because of how Cloud Foundry manages process trees, Cloud Foundry would try to kill the parent process (bash), which would HUP celeryd, but ultimately lots of child processes would never die.