I have a WSGI app with a Celery component. Basically, when certain requests come in, they can hand off relatively time-consuming tasks to Celery. I have a working version of this product on a server I set up myself, but our client recently asked me to deploy it to Cloud Foundry. Since Celery is not available as a service on Cloud Foundry, we (the client's deployment team and I) decided to deploy the app twice – once as a WSGI app and once as a standalone Celery app, sharing a RabbitMQ service.
The code between the two apps is identical. The WSGI app responds correctly, returning the expected web pages. vmc logs celeryapp shows that Celery is up and running, but when I send requests to the WSGI app that should become Celery tasks, they disappear as soon as they reach a .delay() statement. They never appear in the Celery logs, nor do they surface as errors.
Attempts to debug:
I can't use celery.contrib.rdb on Cloud Foundry (it supplies a telnet interface to pdb), since each app is sandboxed and port-restricted.
I don't know how to find the specific rabbitmq instance these apps are supposed to share, so I can see what messages it's passing.
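For what it's worth, a bound service's credentials are exposed to each Cloud Foundry app in the VCAP_SERVICES environment variable, so a sketch like the following (the 'rabbitmq' label filter is an assumption; adjust it to whatever the service is actually called) should at least reveal which broker instance each app is bound to:

import json
import os

# print the RabbitMQ credentials Cloud Foundry bound to this app
services = json.loads(os.environ.get('VCAP_SERVICES', '{}'))
for label, instances in services.items():
    if 'rabbitmq' in label.lower():
        for instance in instances:
            print(instance['name'], instance['credentials'])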
Update: to corroborate the above statement about finding rabbitmq, here's what happens when I try to access the node that should be sharing celery tasks:
root@cf:~# export RABBITMQ_NODENAME=eecef185-e1ae-4e08-91af-47f590304ecc
root@cf:~# export RABBITMQ_NODE_PORT=57390
root@cf:~# ~/cloudfoundry/.deployments/devbox/deploy/rabbitmq/sbin/rabbitmqctl list_queues
Listing queues ...
=ERROR REPORT==== 18-Jun-2012::11:31:35 ===
Error in process <0.36.0> on node 'rabbitmqctl17951@cf' with exit value: {badarg,[{erlang,list_to_existing_atom,["eecef185-e1ae-4e08-91af-47f590304ecc@localhost"]},{dist_util,recv_challenge,1},{dist_util,handshake_we_started,1}]}
Error: unable to connect to node 'eecef185-e1ae-4e08-91af-47f590304ecc@cf': nodedown
diagnostics:
- nodes and their ports on cf: [{'eecef185-e1ae-4e08-91af-47f590304ecc',57390},
{rabbitmqctl17951,36032}]
- current node: rabbitmqctl17951@cf
- current node home dir: /home/cf
- current node cookie hash: 1igde7WRgkhAea8fCwKncQ==
How can I debug this and/or why are my tasks vanishing?
Apparently the problem was caused by a deadlock between the broker and the Celery worker, such that the worker would never acknowledge a task as complete and would never accept a new task, yet never crashed or failed either. The tasks weren't vanishing; they were simply staying in the queue forever.
Update: the deadlock was caused by the fact that we were running celeryd inside a wrapper script that installed dependencies (literally pip install -r requirements.txt && ./celeryd -lINFO). Because of how Cloud Foundry manages process trees, Cloud Foundry would try to kill the parent process (bash), which would HUP celeryd, but many of the child processes would never die.
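For anyone hitting the same thing: the fix is to make sure celeryd ends up as the process the platform signals directly, not a child of the wrapper. A minimal sketch of a launcher that achieves this (the file name is made up):

import os
import subprocess

# install dependencies first, as the wrapper script did
subprocess.check_call(['pip', 'install', '-r', 'requirements.txt'])
# then replace this process with celeryd (exec never returns), so
# platform signals reach the worker itself rather than a shell parent
os.execvp('./celeryd', ['./celeryd', '-lINFO'])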
Related
I'm getting the following Airflow issue:
When I run DAGs that have multiple tasks in them, Airflow randomly sets some of the tasks to a failed state and also doesn't show any logs in the UI. I went to my running worker container and saw that the log files for those failed tasks were never created.
Going to Celery Flower, I found these logs on failed tasks:
airflow.exceptions.AirflowException: Celery command failed on host
How to solve this?
My environment is:
airflow:2.3.1
Docker compose
Celery Executor
Worker, webserver, scheduler and triggerer in different containers
Docker compose hosted on Ubuntu
I also saw this https://stackoverflow.com/a/69201032/11949273 answer that might be related.
Anyone with these same issues?
Edit:
On my EC2 instance I added more vCPUs and fine-tuned the Airflow/Celery worker parameters, which solved it. The problem was probably a lack of CPU, or something related.
I was facing the same issue. In my case, Inspect -> Console showed an error with replaceAll in an old browser (Chrome 83.x); Chrome 98.x does not have this issue.
I am running Python 3.9, Windows 10, Celery 4.3, Redis as the backend, and AWS SQS as the broker. (I wasn't intending to use a result backend, but it became more and more apparent that, due to the library's restrictions on Windows, I'd be better off using it if I could get it to work; otherwise I would have just used Redis as both the broker and the backend.)
To give you some context: I have a webpage that lets a user kick off a resource-intensive task. If the user has a task running and decides to resend the task, I need to kill the running task and use the new information sent by the user to create the new one.
The problem for me arrives after this line of thinking:
Me: "Hmmm, the prefork pool is used for heavy cpu background tasks... I want to use that..."
Me: Goes and configures settings.py,
updates the celery library,
sets the environment variable to allow Windows to run the prefork pool -
os.environ.setdefault('FORKED_BY_MULTIPROCESSING', '1'),
sets a few other configuration settings, etc,
runs the worker and it works.
Me: "Hey, hey. It works... Oh, I still can't revoke a task DESPITE RUNNING THE PREFORK POOL!?!?!
Oh, that's okay... I can just set a session variable to let me know if the user already started a task,
and if they have, just have celery tell me if the task that they started is finished
before I allow the user to request to run a task again."
Me: Goes and configures django sessions,
configures redis,
updates the views to include the session variable, etc,
Me: "Great! Everything is working, so far..."
Me: Runs a test to see if the redis server returns the status...
Celery: "PENDING"
Me: "Yo! Is my task done, yet!?"
Celery: "No - PENDING"
Celery: "PENDING"
Celery: "PENDING"
Celery: "PENDING"
Celery: "PENDING"
Celery: "PENDING"
Me: Searches stackoverflow for why it's only pending...
Me: Finds out that you must use --pool=solo for the worker...
Me: Dies on the inside.
Ideally, I'd like to be able to use the prefork pool to do intense processing and to kill a task if need be. The thing is that everything I read tells me prefork is what I want, but solo is the only way I can get it to work.
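For reference, this is roughly the flow I'm after – revoke with terminate=True, which is exactly the part that needs a pool able to kill its children (the task module and names here are placeholders):

from celery.result import AsyncResult
from myapp.tasks import heavy_task  # placeholder task

def resubmit(old_task_id, new_args):
    # terminate=True signals the child process running the task;
    # the solo pool can't honor it, since the worker IS the task
    AsyncResult(old_task_id).revoke(terminate=True)
    return heavy_task.delay(*new_args)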
Questions:
How bad is it for me to compromise these desires and just go with solo, given that the tasks will be CPU-heavy and there will be many users? Assume hundreds if not thousands submitting tasks at once.
What other solutions should I consider?
In my experience on Windows, I cannot use anything other than --pool=solo.
What other solutions should I consider?
The way I do it: I use the solo pool for Windows development and more processes in production (Linux). At least in my case, using the solo pool for development is fine.
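A minimal sketch of how to pick the pool per platform without changing the command line, assuming you keep a config module (worker_pool and worker_concurrency are standard Celery settings):

# celeryconfig.py
import sys

worker_pool = 'solo' if sys.platform == 'win32' else 'prefork'
worker_concurrency = 1 if sys.platform == 'win32' else 8  # tune for your CPUs

Then load it with app.config_from_object('celeryconfig'), so development and production only differ in what the module computes.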
I'm learning Celery and I'd like to ask:
What is the absolute simplest way to get Celery to run automatically when Django starts on Ubuntu? Right now I manually start celery -A {prj name} worker -l INFO via the terminal.
Can I add any kind of configuration so that Celery picks up changes in the tasks.py code without needing a restart? Right now I Ctrl+C and retype celery -A {prj name} worker -l INFO every time I change something in tasks.py. I can foresee a problem with this approach in production: if Celery starts automatically, would I need to restart Ubuntu instead?
(Setup: VPS, Django, Ubuntu 18.10 (no Docker), no external resources, using Redis (which starts automatically).)
I am aware this is similar to Django-Celery in production and How to ..., but it is still a bit unclear, as those refer to Amazon and to using shell scripts and crontabs. It seems a bit peculiar that these things wouldn't work out of the box.
I'll give the benefit of the doubt and assume I have misunderstood how Celery is meant to be set up.
I have a deploy script that launches Celery in production.
In production it's better to launch the workers like this:
celery multi stop 5
celery multi start 5 -A {prj name} -Q:1 default -Q:2,3 QUEUE1 -Q:4,5 QUEUE2 --pidfile="%n.pid"
This will stop and then launch five workers for the different queues.
At launch, Celery reads your code once and uses that instance of it (just as the wsgi file does), which means you need to relaunch it to apply modifications; you cannot add a file watcher in production (memory cost).
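For development, though, a watcher is fine. A sketch using the watchdog package (assumes pip install watchdog; the project name is a placeholder) that restarts the worker whenever a .py file changes:

import subprocess
import time

from watchdog.events import PatternMatchingEventHandler
from watchdog.observers import Observer

CMD = ['celery', '-A', 'prj_name', 'worker', '-l', 'INFO']

class RestartWorker(PatternMatchingEventHandler):
    def __init__(self):
        super().__init__(patterns=['*.py'])
        self.proc = subprocess.Popen(CMD)

    def on_any_event(self, event):
        # kill the old worker and start a fresh one with the new code
        self.proc.terminate()
        self.proc.wait()
        self.proc = subprocess.Popen(CMD)

handler = RestartWorker()
observer = Observer()
observer.schedule(handler, path='.', recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    handler.proc.terminate()
    observer.stop()
    observer.join()

watchdog also ships a watchmedo auto-restart command that does the same job without custom code.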
I have created a Flask application to process GNSS data. Certain functions take a lot of time to execute, so I have integrated Celery to run them as asynchronous tasks. First I tested the app on localhost, with RabbitMQ as the message broker:
app.config['CELERY_BROKER_URL']='amqp://localhost//'
app.config['CELERY_RESULT_BACKEND']='db+postgresql://username:password@localhost/DBname'
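(The Celery instance is created from this config in the classic Flask pattern – a sketch, not necessarily the exact code here:)

from celery import Celery

def make_celery(app):
    celery = Celery(app.import_name,
                    broker=app.config['CELERY_BROKER_URL'],
                    backend=app.config['CELERY_RESULT_BACKEND'])
    celery.conf.update(app.config)
    return celery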
After fully testing the application in a virtualenv, I deployed it on Heroku and added a RabbitMQ add-on. Then I changed the app.config as follows:
app.config['CELERY_BROKER_URL']='amqp://myUsername:Mypassword@small-fiver-23.bigwig.lshift.net:10123/FlGJwZfbz4TR'
app.config['CELERY_RESULT_BACKEND']='db+postgres://myusername:Mypassword@ec2-54-163-246-193.compute-1.amazonaws.com:5432/dhcbl58v8ifst/MYDB'
After changing the above, I ran the Celery worker:
celery -A app.celery worker --loglevel=info
and got this error:
[2018-03-16 11:21:16,796: ERROR/MainProcess] consumer: Cannot connect to amqp://SHt1Xvhb:**@small-fiver-23.bigwig.lshift.net:10123/FlGJwZfbz4TR: timed out.
How can I check from the RabbitMQ management console whether my Heroku add-on is working?
It seems the port 10123 is not exposed. Can you try telnet small-fiver-23.bigwig.lshift.net 10123 from the server and see if you're able to connect successfully?
If not, you have to expose that port so it is reachable from the server you're connecting from.
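If telnet isn't available, the same check takes a few lines of Python:

import socket

host, port = 'small-fiver-23.bigwig.lshift.net', 10123
try:
    with socket.create_connection((host, port), timeout=10):
        print('TCP connection OK - the broker port is reachable')
except OSError as exc:
    print('cannot reach %s:%s - %s' % (host, port, exc))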
When new versions of my Django application are deployed to Heroku, the workers are forced to restart. I have some long-running tasks which should perform some cleanup prior to being killed.
I have tried registering a worker_shutdown hook, which doesn't ever seem to get called.
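(Roughly what I mean by registering the hook – the cleanup body is a placeholder:)

from celery.signals import worker_shutdown

@worker_shutdown.connect
def cleanup(**kwargs):
    # checkpoint/abort long-running work here; this never
    # seems to fire when Heroku restarts the worker
    pass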
I have also tried the answer in Notify celery task of worker shutdown, but I am unclear how to abort a given task from within that context, since calling celery.task.control.active() throws an exception (celery is no longer running).
Thanks for any help.
If you control the deployment, maybe you can run a script that does a Control.broadcast to a custom command you register beforehand, and only continue the deployment after receiving the required replies (you'd have to implement that logic), raising a TimeoutException otherwise.
Also, Celery already has a predefined command for shutdown, which I'm guessing you could overload in your worker instance or subclass. Commands have the advantage of being passed a Panel instance, which gives you access to the consumer. That should expose a lot of control right there...
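A minimal sketch of the broadcast idea (the command name is made up; Panel.register is the documented way to add custom remote control commands):

from celery.worker.control import Panel

@Panel.register
def prepare_for_shutdown(state, **kwargs):
    # flag long-running tasks so they checkpoint and exit cleanly
    return {'ok': 'cleanup scheduled'}

Then, from the deploy script, before restarting the dynos (app being your Celery app instance):

replies = app.control.broadcast('prepare_for_shutdown', reply=True, timeout=30)
if not replies:
    raise TimeoutError('no worker acknowledged the shutdown command')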