Real-time progress tracking of Celery tasks - Python

I have a main Celery task that starts multiple sub-tasks (thousands), each performing multiple actions (the same actions per sub-task).
What I want is to track in real time, from the main Celery task, how many actions are done and how many have failed for each sub-task.
In summary:
Main task: receives a list of objects and a list of actions to perform on each object.
For each object, a sub-task is started to perform the actions for that object.
The main task is finished when all the sub-tasks are finished.
So I need to know, from the main task, the real-time progress of the sub-tasks.
The app I am developing uses Django/AngularJS, and I need to show the real-time progress asynchronously in the front-end.
I am new to Celery, and I am confused about how to implement this.
Any help would be appreciated.
Thanks in advance.

I have done this before. There's too much code to include here, so please allow me to simply outline the approach; I trust you can take care of the actual implementation and configuration:
Socket.io-based microservice to send real-time events to the browser
First, Django is synchronous, so it's not easy to do anything real-time with it.
So I resorted to a Socket.io process. You could say it's a microservice that only listens to a Redis-backed "channel" and sends notifications to a browser client that listens on a given channel.
Celery -> Redis -> Socket.io -> Browser
I made it so that each channel is identified by a Celery task ID. So when I fire a Celery task from the browser, I get the task ID, keep it, and start listening to events from Socket.io via that channel.
In chronological order it looks like this:
Fire off the Celery task and get its ID
Keep the ID in your client app and open a Socket.io channel to listen for updates
The Celery task publishes messages to Redis, which triggers Socket.io events
Socket.io relays the messages to the browser, in real time
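
For illustration, here is a minimal sketch of what such a relay process could look like in Python. It assumes python-socketio, redis-py and Werkzeug; the channel pattern ('task:*:progress'), the 'join' and 'task_progress' event names, and the port are assumptions made for this sketch, not part of the original setup.

import json
import threading

import redis
import socketio

# async_mode='threading' keeps the sketch free of eventlet/gevent specifics.
sio = socketio.Server(async_mode='threading', cors_allowed_origins='*')
app = socketio.WSGIApp(sio)

@sio.event
def join(sid, data):
    # The browser sends the Celery task ID it wants to follow and joins that room.
    sio.enter_room(sid, data['task_id'])

def forward_progress_events():
    # Subscribe to all progress channels and re-emit each Redis message as a Socket.io event.
    pubsub = redis.StrictRedis().pubsub()
    pubsub.psubscribe('task:*:progress')  # hypothetical channel naming scheme
    for message in pubsub.listen():
        if message['type'] != 'pmessage':
            continue
        task_id = message['channel'].decode().split(':')[1]
        sio.emit('task_progress', json.loads(message['data']), room=task_id)

if __name__ == '__main__':
    threading.Thread(target=forward_progress_events, daemon=True).start()
    from werkzeug.serving import run_simple
    run_simple('0.0.0.0', 5000, app, threaded=True)

On the browser side, the client would emit 'join' with the task ID it got back from Django and then update the progress display on every 'task_progress' event.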
Reporting the progress
As for the actual updating of the task status: within its code, the Celery task simply publishes a message on Redis with something like {'done': 2, 'total_to_be_done': 10} (representing a task that has gone through 2 out of 10 steps, i.e. 20% progress; I prefer to send both numbers for better UI/UX):
import json

import redis

redis_pub = redis.StrictRedis()
channel = 'task:<task_id>:progress'
redis_pub.publish(channel, json.dumps({'done': 2, 'total_to_be_done': 10}))
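
To tie this back to the original question (a main task fanning out to thousands of sub-tasks), a sketch of how each sub-task could publish its own progress is below. The task names, the per-action loop and the perform_action helper are assumptions for illustration, not the code I actually used.

import json

import redis
from celery import group, shared_task

@shared_task(bind=True)
def process_object(self, obj_id, actions):
    # Each sub-task reports progress on a channel derived from its own task ID.
    redis_pub = redis.StrictRedis()
    channel = 'task:%s:progress' % self.request.id
    done, failed = 0, 0
    for action in actions:
        try:
            perform_action(obj_id, action)  # hypothetical helper
            done += 1
        except Exception:
            failed += 1
        redis_pub.publish(channel, json.dumps(
            {'done': done, 'failed': failed, 'total_to_be_done': len(actions)}))
    return {'done': done, 'failed': failed}

@shared_task
def main_task(object_ids, actions):
    # Fan out one sub-task per object; a chord callback could mark the whole job finished.
    group(process_object.s(obj_id, actions) for obj_id in object_ids).apply_async()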
You can find the documentation for publishing messages on Redis with Python in the redis-py docs.
AngularJS/Socket.io integration
You can use or at least get some inspiration from a library like angular-socket-io

Related

Periodic and non-periodic tasks with Django + Telegram + Celery

I am building a project based on Django, and one of my goals is to have a Telegram bot that receives information from a Telegram group. I was able to implement the bot to send messages to Telegram with no issues.
At the moment I have a couple of Celery tasks running with Beat, plus the Django web app, and they are decoupled. All good here.
I have seen that python-telegram-bot runs a function in one of its examples (https://github.com/python-telegram-bot/python-telegram-bot/blob/master/examples/echobot.py) that waits idle to receive data from Telegram. Right now, all my Celery tasks are periodic and are called every 10 or 60 minutes by Beat.
How can I run this non-periodic task with Celery in my configuration? I say non-periodic because, as I understand it, it will wait for content until it is manually interrupted.
Django~=3.2.6
celery~=5.1.2
CELERY_BEAT_SCHEDULE = {
    'task_1': {
        'task': 'apps.envc.tasks.Fetch1',
        'schedule': 600.0,
    },
    'task_2': {
        'task': 'apps.envc.tasks.Fetch2',
        'schedule': crontab(minute='*/60'),
    },
    'task_3': {
        'task': 'apps.envc.tasks.Analyze',
        'schedule': 600,
    },
}
In my tasks.py I have one of the tasks like this:
@celery_app.task(name='apps.envc.tasks.TelegramBot')
def TelegramBot():
    status = start_bot()
    return status
As for the start_bot implementation, I simply copied the echobot.py example and added my TOKEN there (the functions for the different commands from the example are there as well, of course).
Set up a webhook instead of polling with Celery
With Django, you shouldn't be using Celery to run Telegram polling (what you call PTB's "non-periodic task", which is better described as a long-running process or service). Celery is designed for finite tasks, not indefinitely running processes.
Since Django implies you're already running a web server, the webhook option is a better fit. (Remember that you can either do polling or set up a webhook in order to receive updates from Telegram's servers.) The option that @CallMeStag suggested, of using a non-threading webhook setup, makes the most sense for Django-PTB integration.
You can do the bot setup (defining and registering your handler functions on a Dispatcher instance) in a separate module; to avoid threading, pass update_queue=None, workers=0 when instantiating the Dispatcher. Then use it in a Django view, like this:
import json
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from telegram import Update
from .telegram_init import telegram_bot, telegram_dispatcher
...
@csrf_exempt
def telegram_webhook(request):
    data = json.loads(request.body)
    update = Update.de_json(data, telegram_bot)
    telegram_dispatcher.process_update(update)
    return JsonResponse({})
where telegram_bot is the Bot instance that I used to instantiate telegram_dispatcher. (Error handling is left out of this snippet.)
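
For completeness, a minimal sketch of what that telegram_init module might contain, assuming the PTB v13-style API implied by the question; the /start handler and the environment variable holding the token are hypothetical, and only the update_queue=None, workers=0 arguments come from the answer above.

import os

from telegram import Bot
from telegram.ext import CommandHandler, Dispatcher

telegram_bot = Bot(token=os.environ['TELEGRAM_TOKEN'])

# No update queue and no worker threads: updates are processed synchronously in the view.
telegram_dispatcher = Dispatcher(telegram_bot, update_queue=None, workers=0)

def start(update, context):
    update.message.reply_text('Hello!')

telegram_dispatcher.add_handler(CommandHandler('start', start))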
Why avoid threading? Threads in the general sense are not forbidden in Django, but in the context of PTB, threading usually means running bot updaters or dispatchers in a long-running thread that shares an update/message queue, and that's a complication that doesn't look nice and doesn't play well with, for example, a typical Django deployment that uses multiple Gunicorn workers in separate processes. There is, however, a motivation for using multithreading (multiple processes, actually, using Celery) in Django-PTB integration; see below.
Development environment caveat
The above setup is what you'd want to use for a basic production system. But during dev, unless your dev machine is internet-facing with a fixed IP, you probably can't use a webhook, so you'd still want to do polling. One way to do this is by creating a custom Django management command:
<my_app>/management/commands/polltelegram.py:
from django.core.management.base import BaseCommand

from my_django_project.telegram_init import telegram_updater

class Command(BaseCommand):
    help = 'Run Telegram bot polling.'

    def handle(self, *args, **options):
        telegram_updater.start_polling()
        self.stdout.write(
            'Telegram bot polling started. '
            'Press CTRL-BREAK to terminate.'
        )
        telegram_updater.idle()
        self.stdout.write('Polling stopped.')
And then, during dev, run python manage.py polltelegram to fetch and process Telegram updates. (Run this along with python manage.py runserver to be able to use the main Django app simultaneously; the polling runs in a separate process with this setup, not just a separate thread.)
When Celery makes sense
Celery does have a role to play if you're integrating PTB with Django, and this is when reliability is a concern. For instance, when you want to be able to retry sending replies in case of transient network issues. Another potential issue is that the non-threading webhook setup detailed above can, in a high-traffic scenario, run into flood/rate limits. PTB's current solution for this, MessageQueue, uses threading, and while it can work, it can introduce other problems, for example interference with Django's autoreload function when running runserver during dev.
A more elegant and reliable solution is to use Celery to run the message sending function of PTB. This allows for retries and rate limiting for better reliability.
Briefly described, this integration can still use the non-threading webhook setup above, but you have to isolate the Bot.send_message() function into a Celery task, and then make sure that all handlers call this Celery task asynchronously instead of using the bot to run send_message() in the webhook process 'eagerly'.
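
As a rough sketch of that isolation (the task name, the retry parameters and the .telegram_init import are assumptions, following the hypothetical module sketched earlier):

from celery import shared_task

from .telegram_init import telegram_bot

@shared_task(bind=True, max_retries=5, default_retry_delay=2)
def send_telegram_message(self, chat_id, text):
    # Running send_message inside a Celery task makes it retryable and rate-limitable.
    try:
        telegram_bot.send_message(chat_id=chat_id, text=text)
    except Exception as exc:
        raise self.retry(exc=exc)

def start(update, context):
    # Handlers enqueue the reply instead of sending it eagerly in the webhook process.
    send_telegram_message.delay(update.effective_chat.id, 'Hello!')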
In PTB, Updater.start_polling/webhook() starts a background thread that waits for incoming updates. Updater.idle() blocks the main thread and when receiving a stop signal, it ends the background thread mentioned above.
I'm not familiar with Celery and only know the basics of Django, but I see a few options here that I'd like to point out.
You can run the PTB-related code in a standalone thread, i.e. a thread that calls Updater.start_polling and Updater.idle. To end that thread on shutdown, you'll have to forward the stop signal to that thread.
Vice versa, you can run PTB in the main thread and the Django & Celery related tasks in a standalone thread.
You don't have to use Updater. Since you're using Django anyway, you could switch to a webhook-based solution for receiving updates, where Django serves as webhook for you. You can even eliminate threading for PTB completely by calling Dispatcher.process_update manually. Please see this wiki page for more info on custom webhook solutions
Finally, I'd like to mention that PTB comes with a built-in solution of scheduling tasks, see the wiki page on Job Queue. This may or may not be relevant for you depending on your setup.
Disclaimer: I'm currently the maintainer of python-telegram-bot.

Creating queue for Flask backend that can handle multiple users

I am creating a robot that has a Flask and React based interface (running on a Raspberry Pi Zero) for users to request it to perform tasks. When a user requests a task, I want the backend to put it in a queue and have the backend constantly watch the queue and process tasks one by one. Each task can take anywhere from 15-60 seconds, so they are pretty lengthy.
Currently I just do the task immediately in the same Python process that is running the Flask server, and from testing locally it seems like I can go to the React app in two different browsers, request tasks at the same time, and the Raspberry Pi tries to run them in parallel (from what I'm seeing in the printed logs).
What is the best way to allow multiple users to go to the front-end and queue up tasks? When multiple users go to the React app, I assume they all connect to the same instance of the back-end. So is it enough just to add a deque to the back-end and protect it with a mutex lock (and what is the Pythonic way to use mutexes)? Or is this too simple? Do I need some other process or method to implement the task queue (such as writing/reading an external file to act as the queue)?
In general, the most popular way to run background tasks in Python is Celery. It is a Python framework that runs in a separate process, continuously checking a queue (backed by, say, Redis or RabbitMQ/AMQP) for tasks. When it finds one, it executes it and logs the result to a "result backend" (for example a database, or Redis again). The Flask server then just pushes tasks onto the queue.
In order to notify the users, you could use polling from the React app: request an update every 5 seconds until the result backend shows that the task has completed successfully. As soon as you see that, stop polling and show the user the notification.
You can easily have multiple worker processes running in parallel if the app becomes large enough to need it. In general, just remember to have every process do what it is meant to do: Flask servers should answer web requests, and Celery workers should process tasks, not the other way around.
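
As a hedged illustration of how those pieces could fit together (the endpoint paths, the task body and the Redis URLs are assumptions, not a prescribed setup):

from celery import Celery
from flask import Flask, jsonify, request

app = Flask(__name__)
celery_app = Celery('robot', broker='redis://localhost:6379/0',
                    backend='redis://localhost:6379/1')

@celery_app.task
def run_robot_task(task_name):
    # Placeholder for the 15-60 second robot action.
    return 'done'

@app.route('/tasks', methods=['POST'])
def enqueue_task():
    # Flask only pushes the job onto the queue; the Celery worker executes it.
    result = run_robot_task.delay(request.json['task_name'])
    return jsonify({'task_id': result.id}), 202

@app.route('/tasks/<task_id>', methods=['GET'])
def task_status(task_id):
    # The React app can poll this endpoint until the state becomes SUCCESS.
    result = celery_app.AsyncResult(task_id)
    return jsonify({'state': result.state})

Running the worker with a single process (for example celery -A yourmodule worker --concurrency=1) gives the strict one-by-one processing the question asks for.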

How to detect a Celery task doing a similar job before running another task?

My Celery task does time-consuming calculations on some database-stored entity. The workflow is like this: get information from the database, compile it into some serializable object, save the object. Other tasks do other calculations (like rendering images) on the loaded object.
But serialization is time-consuming, so I'd like to have one task per entity running for a while, which holds the serialized object in memory and processes client requests delivered through a messaging queue (Redis pub/sub). If there are no requests for a while, the task exits. After that, if a client needs some job to be done, it runs another task, which loads the object, processes it, and stays around for a while for other jobs. This task should check at startup that it is the only worker on this particular entity, to avoid collisions. So what is the best strategy to check whether another task is already running for this entity?
1) The first idea is to send a message to some channel associated with the entity and wait for a response. Bad idea: the target task can be busy with calculations, and waiting for a response with a timeout just wastes time.
2) Storing the Celery task ID in the DB is even worse - the task can be killed but the record will stay, so we would need to ensure that the target task is alive.
3) The third idea is to inspect workers for running tasks, checking their state for the entity ID (which each task would provide at startup). It also seems that collisions can happen, e.g. if several tasks are scheduled but not yet running.
For now I think idea 1 is best, with modifications like this: the task sends a message to the entity channel on startup with its startup time, but then immediately starts working, without waiting for a response. Then it checks the message queue, and if someone responds, they compare timestamps and the task with the bigger timestamp quits. This seems complicated enough - is there a better solution?
The final solution is to start a supervisor thread in the task, which replies to 'discover' messages from competing tasks.
So the workflow is like this:
The task starts, then subscribes to a Redis pub/sub channel keyed by the entity ID
The task sends a 'discover' message to the channel
The task waits a little bit
The task looks for a 'reply' among the incoming messages on the channel; if one is found, it exits
The task starts a supervisor thread, which answers all incoming 'discover' messages with 'reply'
This works fine except when several tasks start simultaneously, e.g. after a worker restart. To avoid this, the subscription process needs to be made atomic, using a Redis lock (a sketch of the discover/reply flow itself follows the class below):
from redis import StrictRedis

class RedisChannel:
    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.redis = StrictRedis()
        self.channel = self.redis.pubsub()
        # Hold a Redis lock so that competing tasks cannot subscribe at the same moment.
        with self.redis.lock(channel_id):
            self.channel.subscribe(channel_id)
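
And a sketch of the discover/reply flow described in the list above (the one-second wait, the message payloads and the claim_entity/supervise names are made up for illustration):

import threading
import time

def claim_entity(channel):
    # Return True if this task should own the entity, False if another task already does.
    channel.redis.publish(channel.channel_id, 'discover')
    time.sleep(1)  # wait a little bit for competing tasks to answer
    while True:
        message = channel.channel.get_message(ignore_subscribe_messages=True)
        if message is None:
            break
        if message['data'] == b'reply':
            return False  # someone else already owns this entity
    # Become the owner: answer future 'discover' messages with 'reply'.
    threading.Thread(target=supervise, args=(channel,), daemon=True).start()
    return True

def supervise(channel):
    for message in channel.channel.listen():
        if message['type'] == 'message' and message['data'] == b'discover':
            channel.redis.publish(channel.channel_id, 'reply')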

The best practice to run periodic tasks in Python?

In my system, users are allowed to set a notification schedule: they can choose any date and time at which they want to get messages. I have discovered a mechanism in Python called Celery, which executes tasks asynchronously. This leaves me with a few questions:
How do I integrate Celery with the user interface?
Are there any Celery alternatives?
Is it a panacea?
What you are looking for is something to process background tasks submitted to a queue from your web server. To that end, Celery is a good option and easy to configure; a more comprehensive list of alternatives can be found here. None of these options integrates with a user interface directly; they integrate with your web server. They can queue jobs based on what is sent from the client side, which can be done as part of handling the request-response flow.
Also, this article provides a good reference for scheduling periodic tasks with Celery.
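
For the user-chosen date and time specifically, one common pattern (a sketch, not something from the answer above; the task and function names are hypothetical) is to enqueue a one-off task with an eta when the user saves their schedule:

from celery import shared_task

@shared_task
def send_notification(user_id, message):
    # Hypothetical task body: look up the user and deliver the message.
    pass

def schedule_notification(user_id, message, when):
    # 'when' is the datetime the user picked in the UI; Celery holds the task until then.
    send_notification.apply_async(args=[user_id, message], eta=when)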

Batch processing of incoming notifications with GAE

My App Engine app receives notifications from SendGrid for email deliveries, opens, etc. SendGrid doesn't do much batching of these notifications, so I could receive several per second.
I'd like to batch-process the incoming notifications, for example by processing all of the notifications received in the last minute (my processing includes transactions, so I need to combine them to avoid contention). There seem to be several ways of doing this...
For storing the incoming notifications, I could:
add an entity to the datastore or
create a pull queue task.
For triggering processing, I could:
Run a CRON job every minute (is this a good idea?) or
Have the handler that processes the incoming Sendgrid requests trigger processing of notifications but only if the last trigger was more than a minute ago (could store a last trigger date in memcache).
I'd love to hear pros and cons of the above or other approaches.
After a couple of days, I've come up with an implementation that works pretty well.
For storing incoming notifications, I'm storing the data in a pull queue task. I didn't know at the time of my question that you can store any raw data you want in a task, and that the task doesn't itself have to be the execution of a function. You probably could store the incoming data in the datastore, but then you'd sort of be creating your own pull tasks, so you might as well use the pull tasks provided by GAE.
For triggering a worker to process tasks in the pull queue, I came across this excellent blog post about On-demand Cron Jobs by a former GAE developer. I don't want to repeat that entire post here, but the basic idea is that each time you add a task to the pull queue, you also create a worker task (in a regular push queue) to process tasks in the pull queue. For the worker task, you use a task name corresponding to a time interval to make sure you only have one worker task per interval. This gives you the benefit of a 1-minute CRON job with the added bonus that it only runs when needed, so you don't have a CRON job running when there's nothing to do.
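
A rough sketch of that idea using the legacy GAE Python taskqueue API (the queue names, the worker URL, the one-minute bucketing and the handle_batch helper are assumptions):

import time

from google.appengine.api import taskqueue

def store_notification(payload):
    # Store the raw SendGrid payload in a pull queue task.
    taskqueue.Queue('sendgrid-pull').add(taskqueue.Task(payload=payload, method='PULL'))
    # Enqueue at most one worker per minute by naming the push task after the current minute.
    bucket = int(time.time() // 60)
    try:
        taskqueue.add(name='sendgrid-worker-%d' % bucket,
                      url='/tasks/process_notifications',
                      queue_name='sendgrid-push',
                      countdown=60)
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        pass  # a worker for this minute is already scheduled

def process_notifications():
    # The worker leases a batch from the pull queue and processes it in one go.
    queue = taskqueue.Queue('sendgrid-pull')
    tasks = queue.lease_tasks(lease_seconds=60, max_tasks=100)
    if tasks:
        handle_batch([t.payload for t in tasks])  # hypothetical batch handler
        queue.delete_tasks(tasks)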
