Reloading celery code without losing reserved tasks

I have millions of tasks reserved in Celery (ETA not due yet) and every time I want to update my Celery code base, I have to restart it, which cuts the connection to RabbitMQ and causes RabbitMQ to redistribute tasks again (I am using late ack).
Is it possible to reload new code base but still keep my reserved tasks? I am using Celery with Django.

Short answer: yes, you can, but you have to write your own queue draining logic.
Longer answer: when you want to do a code update (and depending on how you handle deployments), you need to use the Celery remote control API to tell all your workers to stop consuming tasks. The RabbitMQ broker supports the remote control API, so you're in luck.
from my_app.celery import app

inspector = app.control.inspect()
controller = app.control

# get a list of current workers
workers = inspector.ping()
active_queues = inspector.active_queues()

all_queues = set()
for worker, queues in active_queues.items():
    for queue in queues:
        all_queues.add(queue['name'])

for queue in all_queues:
    controller.cancel_consumer(queue)
This will stop your workers from consuming tasks. Now you have to monitor your workers until they have finished processing any active tasks.
import time

done = False
while not done:
    active = inspector.active()
    active_count = sum(len(tasks) for tasks in active.values())
    done = active_count == 0  # done once no worker reports an active task
    if not done:
        time.sleep(60)  # wait a minute between checks
Once your workers are done, you are clear to deploy your new code without having to worry about losing tasks.
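If you later want the same, still-running workers to start consuming again (for example after a hot reload rather than a full restart), the control API can undo the cancellation. A small sketch, reusing the all_queues set built above:

# Resume consumption on every queue we previously cancelled.
for queue in all_queues:
    controller.add_consumer(queue)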

Related

What does it mean for a celery task to be "Received"? When all celery workers are blocked, what is happening with new tasks that are not "Received"?

I'm working on a new monitoring system that can measure Celery queue throughput and help alert the team when the queue is getting backed up. Over the course of my work, I've come across some peculiar behaviors that I don't understand (and that are not well documented in the Celery docs).
For testing purposes, I've set up an endpoint that will populate the queue with 16 long-running tasks that can be used to simulate a backed-up queue. The framework is Flask and the queue broker is Redis. Celery is configured so that each worker works on up to 4 tasks in parallel, and I have 2 workers running.
api/health.py
from flask import Blueprint, make_response

from jobs import sleepy_job


def health():
    health = Blueprint("health", __name__)

    @health.route("/api/debug/create-long-queue", methods=["GET"])
    def long_queue():
        for i in range(16):
            sleepy_job.delay()
        return make_response({}, 200)

    return health
jobs.py
import time


@celery.task(priority=HIGH_PRIORITY)
def sleepy_job(*args, **kwargs):
    time.sleep(30)
Here's what I do to simulate a backed-up production queue:
I call /api/debug/create-long-queue to simulate a back-up in my queue. Based on the above math, the workers should be busy sleeping for 1 minute each (Together, they can concurrently handle 8 tasks at a time. Each task just sleeps for 30 seconds, and there are 16 tasks total.)
I make another API call shortly after (< 5 s), which kicks off a different job with real business logic (processing of an inbound webhook API call). We'll call this job handle_incoming_message.
Here's what I see using Flower to inspect the queue:
While all workers are blocked by the first 8 sleepy_job tasks, I see no sign of the new handle_incoming_message on the queue, even though I am certain handle_incoming_message.delay() has been called as a result of the 2nd API call.
After the first 8 sleepy_job tasks have been completed (~30 s), I see the new handle_incoming_message on the queue with state RECEIVED.
After the second (and final) 8 sleepy_job tasks have been completed, I now see that handle_incoming_message has state STARTED (and I can confirm this, as the UI updates with the new data that was received and processed in that task).
Questions
So it seems clear that when the workers are momentarily unblocked after handling the first 8 sleepy_job tasks, they are doing something to mark/acknowledge the new handle_incoming_message task in a way that is visible to flower. But this leaves several unanswered questions:
What is the state of the new handle_incoming_message task when the workers are blocked?
What changes after workers are unblocked that makes it so flower now has visibility into the new handle_incoming_message task?
What does the "RECEIVED" state actually mean?
(Bonus: How can I get visibility into tasks that are queued while workers are blocked?)
When all workers are blocked, SOME tasks can still be in the RECEIVED state because of prefetching (look for that in the documentation). So the chances are very high that your task is simply sitting in the queue, waiting to be received by a Celery worker (the coordinating process, not one of its actual worker processes).
Flower is a simple service built upon a Celery feature called "task events". In simple terms, Flower subscribes itself as a receiver of all events (received, started, succeeded, failed, etc.) and visually represents them to web clients. More about it here. So when a task gets received by a Celery worker, a "task-received" event is sent; Flower picks up this event and changes the state of that task in the dashboard.
When a task is "received", it means that a particular Celery worker took that task off the queue, and it may be executed immediately (if there is a free worker process to execute it), or the Celery worker will wait for a worker process to become available to run it. I have already mentioned prefetching: Celery workers will often take more tasks than they have available worker processes.
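If you want to shrink that received-but-not-started window, you can reduce prefetching so a worker only reserves what it can run right now. A minimal sketch, assuming app is your Celery instance and Celery 4+ setting names (older versions use CELERYD_PREFETCH_MULTIPLIER / CELERY_ACKS_LATE):

app.conf.worker_prefetch_multiplier = 1  # reserve only one extra task per worker process
app.conf.task_acks_late = True           # acknowledge after execution, not on receipt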
Celery does not give users a way to list what is in a particular queue. That is why you will see many similar questions, including this one, which offers answers; you will see my short answer there among others. In short, it depends on your broker of choice. If it is Redis, then you simply go through the list of objects. If it is RabbitMQ, then you can use their tools to inspect queues. I think the decision not to provide this is a good one, as the information is never reliable: by the time you have listed all the tasks in a particular queue, there may be thousands of new ones...
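For the bonus question with a Redis broker, you can peek at the waiting messages yourself. A rough sketch, assuming Redis on localhost, the default "celery" queue name and task message protocol 2 (these are broker internals and may change between versions):

import base64
import json

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Each waiting message is a JSON envelope; the task name sits in the headers
# and the (base64-encoded) body carries the args/kwargs.
for raw in r.lrange('celery', 0, -1):
    message = json.loads(raw)
    print(message['headers']['task'], base64.b64decode(message['body']))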

Understanding Gunicorn worker processes, when using threading module for sending mail

In my setup I am using Gunicorn for my deployment on a single-CPU machine, with three worker processes. I came to ask this question from this answer: https://stackoverflow.com/a/53327191/10268003 . I have found that it takes up to one and a half seconds to send a mail, so I was trying to send email asynchronously. I am trying to understand what will happen to the worker process started by Gunicorn when it starts a new thread to send the mail: will the process get blocked until the mail-sending thread finishes? In that case I believe my application's throughput will decrease. I did not want to use Celery because it seems to be overkill to set it up just for sending emails. I am currently running two containers on the same machine, with three Gunicorn workers each, on my development machine.
Below is the approach in question; the only difference is that I will be using threading for sending mails.
import threading

from django.http import JsonResponse

from .models import Crawl


def startCrawl(request):
    task = Crawl()
    task.save()
    t = threading.Thread(target=doCrawl, args=[task.id])
    t.setDaemon(True)
    t.start()
    return JsonResponse({'id': task.id})


def checkCrawl(request, id):
    task = Crawl.objects.get(pk=id)
    return JsonResponse({'is_done': task.is_done, 'result': task.result})


def doCrawl(id):
    task = Crawl.objects.get(pk=id)
    # Do crawling, etc., producing `result`
    task.result = result
    task.is_done = True
    task.save()
Assuming that you are using Gunicorn's sync (default), gthread or async workers, you can indeed spawn threads and Gunicorn will take no notice and won't interfere. The worker is reused to answer subsequent requests immediately after returning a response; it does not wait for all spawned threads to be joined first.
I have used this code to fire an independent event a minute or so after a request:
from threading import Timer

Timer(timeout, function_that_does_something, [arguments_to_function]).start()
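Applied to the question's mail use case, a minimal sketch of the same pattern (the view, subject, body and recipient are made-up placeholders; it assumes Django's built-in django.core.mail.send_mail):

import threading

from django.core.mail import send_mail
from django.http import JsonResponse


def _send_mail_in_background(subject, body, recipient):
    # Runs inside the Gunicorn worker process, but off the request path.
    send_mail(subject, body, None, [recipient], fail_silently=False)


def notify(request):
    t = threading.Thread(
        target=_send_mail_in_background,
        args=('Hello', 'Your report is ready.', 'user@example.com'),
        daemon=True,
    )
    t.start()
    # The response returns immediately; the mail goes out in the background.
    return JsonResponse({'queued': True})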
You will find some more technical details in this other answer:
In normal operations, these Workers run in a loop until the Master either tells them to graceful shutdown or kills them. Workers will periodically issue a heartbeat to the Master to indicate that they are still alive and working. If a heartbeat timeout occurs, then the Master will kill the Worker and restart it.
Therefore, daemon and non-daemon threads that do not interfere with the Worker's main loop should have no impact. If the thread does interfere with the Worker's main loop, such as a scenario where the thread is performing work and will provide results to the HTTP Response, then consider using an Async Worker. Async Workers allow for the TCP connection to remain alive for a long time while still allowing the Worker to issue heartbeats to the Master.
I have recently moved on to asynchronous, event-loop based solutions, such as the uvicorn worker for Gunicorn with the FastAPI framework, which provide alternatives to waiting on IO in threads.

How to rate limit Celery tasks by task name?

I'm using Celery to process asynchronous tasks from a Django app. Most tasks are short and run in a few seconds, but I have one task that can take a few hours.
Due to processing restrictions on my server, Celery is configured to only run 2 tasks at once. That means if someone launches two of these long-running tasks, it effectively blocks all other Celery processing site wide for several hours, which is very bad.
Is there any way to configure Celery so that it runs no more than one instance of a particular task at a time? Something like:
@task(max_running_instances=1)
def my_really_long_task():
    for i in range(1000000000):
        time.sleep(6000)
Note, I don't want to cancel all other launches of my_really_long_task. I just don't want them to start right away, and only begin once all other tasks of the same name finish.
Since this doesn't seem to be supported by Celery, my current hacky solution is to query other tasks within the task, and if we find other running instances, then reschedule ourselves to run later, e.g.
import random
import time
from collections import defaultdict

from celery.task.control import inspect


def get_all_active_celery_task_names(ignore_id=None):
    """
    Returns Celery task names for all running tasks.
    """
    i = inspect()
    task_names = defaultdict(int)  # {name: count}
    if i:
        active = i.active()
        if active is not None:
            for worker_name, tasks in active.items():
                for task in tasks:
                    if ignore_id and task['id'] == ignore_id:
                        continue
                    task_names[task['name']] += 1
    return task_names


@task
def my_really_long_task():
    # Pass our own task id so we only count *other* running instances.
    all_names = get_all_active_celery_task_names(
        ignore_id=my_really_long_task.request.id)
    if 'my_really_long_task' in all_names:
        my_really_long_task.retry(max_retries=100, countdown=random.randint(10, 300))
        return
    for i in range(1000000000):
        time.sleep(6000)
Is there a better way to do this?
I'm aware of other hacky solutions like this, but setting up a separate memcache server to track task uniqueness is even less reliable, and more complicated than the method I use above.
An alternate solution is to queue my_really_long_task into a separate queue.
my_really_long_task.apply_async(args, queue='foo')
Then start a worker with a concurrency of 1 to consume these tasks so that only 1 task gets executed at a time.
celery -A foo worker -l info -Q foo
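If you would rather not pass queue='foo' at every call site, the same effect can be achieved with task routing in configuration. A small sketch, assuming the task is importable as myapp.tasks.my_really_long_task and Celery 4+ setting names (older versions use CELERY_ROUTES):

# in your Celery config (sketch)
app.conf.task_routes = {
    'myapp.tasks.my_really_long_task': {'queue': 'foo'},
}

All other tasks keep flowing to the default queue, so the single-concurrency worker on foo only ever runs the long task.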

How to cancel conflicting / old tasks in Celery?

I'm using Celery + RabbitMQ.
When a Celery worker isn't available all the tasks are waiting in RabbitMQ.
As soon as it comes back online, this whole bunch of tasks is executed immediately.
Can I somehow prevent this from happening?
For example, if there are 100 identical tasks waiting for a Celery worker, can I execute only 1 of them when a worker comes online?
Since all the tasks in your queue are the same, a better way to do this is to send the task only once. To do that you need to be able to track that the task was published, for example:
Using a lock, example: Ensuring a task is only executed one at a time
Using a custom task ID and a custom state after the task is published, for example:
To add a custom state when the task is published:
from celery import current_app
from celery.signals import after_task_publish


@after_task_publish.connect
def add_sent_state(sender=None, body=None, **kwargs):
    """Track published tasks."""
    # get the task instance from its name
    task = current_app.tasks.get(sender)
    # if there is no task.backend, fall back to app.backend
    backend = task.backend if task else current_app.backend
    # store the task state
    backend.store_result(body['id'], None, 'SENT')
When you want to send the task, you can check whether it has already been published. Since we're using a custom state, the task's state won't be PENDING once it has been published (PENDING effectively means unknown), so we can check like this:
from celery import states

# the task has a custom ID
task = task_func.AsyncResult('CUSTOM_ID')
if task.state != states.PENDING:
    # the task already exists, nothing to do
    pass
else:
    # send the task
    task_func.apply_async(args, kwargs, task_id='CUSTOM_ID')
I'm using this approach in my app and it's working great; my tasks could be sent multiple times, but since they are identified by their IDs, each task is sent only once.
If you still want to cancel all the tasks in the queue, you can use:
# import your Celery instance
from project.celery import app
app.control.purge()
Check the Celery FAQ: How do I purge all waiting tasks?
There are two ways to do this.
First, run only one worker with a concurrency of one.
celery worker -A your_app -l info -c 1
This command starts a worker with a concurrency of one, so only one task will be executed at a time. This is the preferred way to do it.
The second method is a bit more complicated: you need to acquire and release a lock to make sure only one task is executed at a time, as in the sketch below.
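A minimal lock sketch along those lines, using Django's cache as the lock store; my_exclusive_task, do_the_work and the key name are placeholders, and it assumes a shared cache backend such as memcached or Redis:

from celery import shared_task
from django.core.cache import cache

LOCK_EXPIRE = 60 * 60  # worst-case runtime of the task, in seconds


@shared_task(bind=True)
def my_exclusive_task(self):
    # cache.add is atomic: it only succeeds if the key does not already exist.
    if not cache.add('my_exclusive_task_lock', 'locked', LOCK_EXPIRE):
        # Another instance holds the lock; try again later instead of running twice.
        raise self.retry(countdown=60, max_retries=None)
    try:
        do_the_work()  # placeholder for the real job
    finally:
        cache.delete('my_exclusive_task_lock')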
Alternatively, you can remove all the tasks from the queue using the purge command.
celery -A your_app purge

AppEngine Timeout with Task Queues

I'm trying to execute a task in AppEngine through the Task Queues, but I still seem to be faced with a 60 second timeout. I'm unsure what I'm doing incorrectly, as I would expect the limit to be 10 minutes, as advertised.
I have a call to urlfetch.fetch() that appears to be the culprit. My call is:
urlfetch.fetch(url, payload=query_data, method=method, deadline=300)
The tail end of my stack trace shows the method that triggers the url fetch call right before the DeadlineExceededError:
File "/base/data/home/apps/s~mips-conversion-scheduler/000-11.371629749593131630/views.py", line 81, in _get_mips_updated_data
policies_changed = InquiryClient().get_changed_policies(company_id, initial=initial).json()
When I look at the task queue information it shows:
Method/URL: POST /tasks/queue-initial-load
Dispatched time (UTC): 2013/11/14 15:18:49
Seconds late: 0.18
Seconds to process task: 59.90
Last http response code: 500
Reason to retry: AppError
My View that processes the task looks like:
class QueueInitialLoad(webapp2.RequestHandler):
    def post(self):
        company = self.request.get("company")
        if company:
            company_id = self.request.get("company")
            queue_policy_load(company_id, queue_name="initialLoad", initial=True)
with the queue_policy_load being the method that triggers the urlfetch call.
Is there something obvious I'm missing that makes me limited to the 60 second timeout instead of 10 minutes?
Might be a little too general, but here are some thoughts that might help close the loop. There are 2 kinds of task queues, push queues and pull queues. Push queue tasks execute automatically, and they are only available to your App Engine app. On the other hand, pull queue tasks wait to be leased, are available to workers outside the app, and can be batched.
If you want to configure your queue, you can do it in the queue config file. In Java, that happens in the queue.xml file, and in Python that happens in the queue.yaml file. In terms of push queues specifically, push queue tasks are processed by handlers (URLs) as POST requests. They:
Are executed ASAP
May cause new instances (Frontend or Backend)
Have a task duration limit of 10 minutes
But, they have an unlimited duration if the tasks are run on the backend
Here is a quick Python code example showing how you can add tasks to a named push queue. Have a look at the Google developers page for Task Queues if you need more information: https://developers.google.com/appengine/docs/python/taskqueue/
Adding Tasks to a Named Push Queue:
queue = taskqueue.Queue("Qname")
task = taskqueue.Task(url='/handler', params=args)
queue.add(task)
On the other hand, let's say that you wanted to use a pull queue. You could add tasks in Python to a pull queue using the following:
queue = taskqueue.Queue("Qname")
task = taskqueue.Task(payload=load, method='PULL')
queue.add(task)
You can then lease these tasks out using the following approach in Python:
queue = taskqueue.Queue("Qname")
tasks = queue.lease_tasks(lease_seconds, max_tasks)  # e.g. lease_tasks(3600, 100)
Remember that, for pull queues, if a task fails, App Engine retries it until it succeeds.
Hope that helps in terms of providing a general perspective!
Task queues have a 10 min deadline, but a URLFetch call has a 1 min deadline:
maximum deadline (request handler) 60 seconds
UPDATE: the intended behaviour was a maximum URLFetch deadline of 10 minutes when running in a task queue; see this bug.
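If the fix referenced in that bug applies to your runtime, one way to ask for a longer deadline on every fetch made from the task handler is to raise the default. A sketch, with the caveat that whether values above 60 seconds are honoured inside a task depends on that fix:

from google.appengine.api import urlfetch

# Raise the default deadline for all urlfetch calls made from this handler.
# The runtime may still cap this; treat 600 seconds as a request, not a guarantee.
urlfetch.set_default_fetch_deadline(600)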
As GAE has evolved, this answer pertains to today, where the idea of "backend" instances is deprecated. GAE apps can be configured as services (aka modules) and run with a manual scaling policy, which allows longer timeouts. If you run your app with an automatic scaling policy, it will cap your urlfetch calls at 60 seconds and your queued tasks at 10 minutes:
https://cloud.google.com/appengine/docs/python/an-overview-of-app-engine
