Celery: why do I need a broker for periodic tasks? - python

I have a standalone script that scrapes a page, opens a connection to a database, and writes the scraped data to it. I need it to execute periodically, every x hours. I could do this with a bash script along the lines of this pseudocode:
while true
do
    python scraper.py
    sleep $((60 * 60 * x))   # x = number of hours
done
From what I have read about message brokers, they are used for sending "signals" from one running program to another, a bit like HTTP in principle. For example, one piece of code accepts an email id from the user and sends a signal with that email id to another piece of code that actually sends the email.
I need Celery to run a periodic task on Heroku. I already have MongoDB on a separate server. Why do I need to run another server for RabbitMQ or Redis just for this? Can I use Celery without a broker?

Celery's architecture is designed to scale and distribute tasks across several servers. For a site like yours it might be overkill. A queue service is generally needed to maintain the task list and signal the status of finished tasks.
You might want to take a look at Huey instead. Huey is a small-scale Celery "clone" that needs only Redis as an external dependency, not RabbitMQ. It still uses Redis' queue mechanism to line up the tasks.
There is also Advanced Python Scheduler (APScheduler), which does not even need Redis and can hold the state of the queue in memory, in-process.
Alternatively, if you have a very small number of periodic tasks and no delayed tasks, I would just use cron and plain Python scripts to run them.
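For the scraper described in the question, an in-process scheduler is probably the lightest option. A minimal sketch using APScheduler, assuming the scraping code lives in scraper.py and exposes a main() function (both names are placeholders):

# run_scraper.py - no broker or extra server needed
from apscheduler.schedulers.blocking import BlockingScheduler

from scraper import main as run_scraper  # hypothetical entry point

scheduler = BlockingScheduler()
scheduler.add_job(run_scraper, 'interval', hours=3)  # the "x hours" interval
scheduler.start()  # blocks the process; stop with Ctrl+C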

As the Celery documentation explains:
Celery communicates via messages, usually using a broker to mediate between clients and workers. To initiate a task, a client adds a message to the queue, which the broker then delivers to a worker.
You can use your existing MongoDB database as the broker; see Using MongoDB in the Celery documentation.
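A hedged sketch of what that might look like; the MongoDB transport has been experimental and its availability varies between Celery/Kombu releases, so verify it against your version (host and database names below are placeholders):

from celery import Celery

app = Celery(
    'scraper',
    broker='mongodb://mongo-host:27017/celery_broker',
    backend='mongodb://mongo-host:27017/celery_results',
)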

For an application like this, it's better to use Django Background Tasks.
Installation
Install from PyPI:
pip install django-background-tasks
Add to INSTALLED_APPS:
INSTALLED_APPS = (
    # ...
    'background_task',
    # ...
)
Migrate your database:
python manage.py makemigrations background_task
python manage.py migrate
Creating and registering tasks
To register a task use the background decorator:
from background_task import background
from django.contrib.auth.models import User
@background(schedule=60)
def notify_user(user_id):
    # look up the user by id and send them a message
    user = User.objects.get(pk=user_id)
    user.email_user('Here is a notification', 'You have been notified')
This will convert notify_user into a background task function. When you call it from regular code it will actually create a Task object and store it in the database. The database then contains serialised information about which function actually needs to run later on. This places limits on the parameters that can be passed when calling the function: they must all be serializable as JSON. That is why in the example above a user_id is passed rather than a User object.
Calling notify_user as normal will schedule the original function to be run 60 seconds from now:
notify_user(user.id)
This is the default schedule time (as set in the decorator), but it can be overridden:
notify_user(user.id, schedule=90) # 90 seconds from now
notify_user(user.id, schedule=timedelta(minutes=20)) # 20 minutes from now
notify_user(user.id, schedule=timezone.now()) # at a specific time
You can also run the original function immediately, in synchronous mode:
notify_user.now(user.id) # launch a notify_user function and wait for it
notify_user = notify_user.now # revert task function back to normal function.
Useful for testing.
You can specify a verbose name and a creator when scheduling a task:
notify_user(user.id, verbose_name="Notify user", creator=user)
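Since the original question is about periodic execution, note that scheduled calls can also repeat; a hedged sketch, assuming a django-background-tasks version that supports the repeat/repeat_until arguments (verify against the version you install):

from datetime import timedelta
from django.utils import timezone
from background_task.models import Task

# re-run the notification every hour for a week
notify_user(user.id, repeat=Task.HOURLY, repeat_until=timezone.now() + timedelta(days=7))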

Related

Periodic and Non periodic tasks with Django + Telegram + Celery

I am building a project based on Django, and one of my intentions is to have a Telegram bot that receives information from a Telegram group. I was able to implement the bot to send messages in Telegram, no issues.
At the moment I have a couple of Celery tasks which run with Beat, and also the Django web app, and they are decoupled. All good here.
I have seen that python-telegram-bot runs a function in one of its examples (https://github.com/python-telegram-bot/python-telegram-bot/blob/master/examples/echobot.py) which waits idle to receive data from Telegram. Right now, all my Celery tasks are periodic and are called every 10 or 60 minutes by Beat.
How can I run this non-periodic task with Celery in my configuration? I am saying non-periodic because I understood that it will wait for content until it is manually interrupted.
Django~=3.2.6
celery~=5.1.2
CELERY_BEAT_SCHEDULE = {
    'task_1': {
        'task': 'apps.envc.tasks.Fetch1',
        'schedule': 600.0,
    },
    'task_2': {
        'task': 'apps.envc.tasks.Fetch2',
        'schedule': crontab(minute='*/60'),
    },
    'task_3': {
        'task': 'apps.envc.tasks.Analyze',
        'schedule': 600,
    },
}
In my tasks.py I have one of the tasks like this:
@celery_app.task(name='apps.envc.tasks.TelegramBot')
def TelegramBot():
    status = start_bot()
    return status
As the start_bot implementation, I simply copied the echobot.py example and added my TOKEN there (of course the functions for the different commands from the example are also there).
Set up a webhook instead of polling with Celery
With Django, you shouldn't be using Celery to run Telegram polling (what you call PTB's “non-periodic task”, which is better described as a long-running process or service). Celery is designed for definite tasks, not indefinitely-running processes.
As Django implies that you're already running a web server, the webhook option is a better fit. (Remember that you can either poll or set up a webhook in order to receive updates from Telegram's servers, not both.) The option that @CallMeStag suggested, a non-threading webhook setup, makes the most sense for Django-PTB integration.
You can do the bot setup (defining and registering your handler functions on a Dispatcher instance) in a separate module; to avoid threading, you should pass update_queue=None, workers=0 to your Dispatcher instantiation. And then, use it in a Django view, like this:
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from telegram import Update

from .telegram_init import telegram_bot, telegram_dispatcher
...
@csrf_exempt
def telegram_webhook(request):
    data = json.loads(request.body)
    update = Update.de_json(data, telegram_bot)
    telegram_dispatcher.process_update(update)
    return JsonResponse({})
where telegram_bot is the Bot instance that I use for instantiating telegram_dispatcher. (I left out error handling in this snippet.)
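The telegram_init module isn't shown above; a hedged sketch of what it might contain, assuming python-telegram-bot v13.x (the token and handler are placeholders):

# telegram_init.py
from telegram import Bot
from telegram.ext import CommandHandler, Dispatcher

TOKEN = '...'  # your bot token

telegram_bot = Bot(TOKEN)
# update_queue=None, workers=0: no threads, updates are processed synchronously
telegram_dispatcher = Dispatcher(telegram_bot, update_queue=None, workers=0)

def start(update, context):
    update.message.reply_text('Hello from the Django webhook!')

telegram_dispatcher.add_handler(CommandHandler('start', start))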
Why avoid threading? Threads in the general sense are not forbidden in Django, but in the context of PTB, threading usually means running bot updaters or dispatchers in a long-running thread that shares an update/message queue, and that's a complication that neither looks nice nor plays well with, for example, a typical Django deployment that uses multiple Gunicorn workers in separate processes. There is, however, a motivation for using multithreading (multiple processes, actually, using Celery) in Django-PTB integration; see below.
Development environment caveat
The above setup is what you'd want to use for a basic production system. But during dev, unless your dev machine is internet-facing with a fixed IP, you probably can't use a webhook, so you'd still want to do polling. One way to do this is by creating a custom Django management command:
<my_app>/management/commands/polltelegram.py:
from django.core.management.base import BaseCommand

from my_django_project.telegram_init import telegram_updater

class Command(BaseCommand):
    help = 'Run Telegram bot polling.'

    def handle(self, *args, **options):
        telegram_updater.start_polling()
        self.stdout.write(
            'Telegram bot polling started. '
            'Press CTRL-BREAK to terminate.'
        )
        telegram_updater.idle()
        self.stdout.write('Polling stopped.')
And then, during dev, run python manage.py polltelegram to fetch and process Telegram updates. (Run this along with python manage.py runserver to be able to use the main Django app simultaneously; the polling runs in a separate process with this setup, not just a separate thread.)
When Celery makes sense
Celery does have a role to play if you're integrating PTB with Django, and this is when reliability is a concern. For instance, when you want to be able to retry sending replies in case of transient network issues. Another potential issue is that the non-threading webhook setup detailed above can, in a high-traffic scenario, run into flood/rate limits. PTB's current solution for this, MessageQueue, uses threading, and while it can work, it can introduce other problems, for example interference with Django's autoreload function when running runserver during dev.
A more elegant and reliable solution is to use Celery to run the message sending function of PTB. This allows for retries and rate limiting for better reliability.
Briefly described, this integration can still use the non-threading webhook setup above, but you have to isolate the Bot.send_message() calls into a Celery task, and then make sure that all handlers call this Celery task asynchronously instead of having the bot run send_message() 'eagerly' in the webhook process.
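A hedged sketch of what isolating the sending step could look like; the task name, retry policy and TOKEN below are assumptions, not something prescribed by PTB or Celery:

from celery import shared_task
from telegram import Bot

TOKEN = '...'  # your bot token

@shared_task(bind=True, max_retries=3, default_retry_delay=5)
def send_telegram_message(self, chat_id, text):
    try:
        Bot(TOKEN).send_message(chat_id=chat_id, text=text)
    except Exception as exc:  # e.g. telegram.error.NetworkError
        raise self.retry(exc=exc)

Handlers would then call send_telegram_message.delay(update.effective_chat.id, 'some reply') instead of replying directly, so sends can be retried and rate-limited by the Celery worker.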
In PTB, Updater.start_polling/webhook() starts a background thread that waits for incoming updates. Updater.idle() blocks the main thread and when receiving a stop signal, it ends the background thread mentioned above.
I'm not familiar with Celery and only know the basics of Django, but I see a few options here that I'd like to point out.
You can run the PTB-related code in a standalone thread, i.e. a thread that calls Updater.start_polling and Updater.idle. To end that thread on shutdown, you'll have to forward the stop signal to that thread.
Vice versa, you can run PTB in the main thread and the Django & Celery related tasks in a standalone thread.
You don't have to use Updater. Since you're using Django anyway, you could switch to a webhook-based solution for receiving updates, where Django serves as webhook for you. You can even eliminate threading for PTB completely by calling Dispatcher.process_update manually. Please see this wiki page for more info on custom webhook solutions
Finally, I'd like to mention that PTB comes with a built-in solution of scheduling tasks, see the wiki page on Job Queue. This may or may not be relevant for you depending on your setup.
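As a rough illustration of the JobQueue idea, assuming PTB v13.x and an existing Updater instance named updater:

# schedule a periodic callback via PTB's own JobQueue instead of Celery Beat
def fetch_data(context):
    ...  # periodic work goes here

updater.job_queue.run_repeating(fetch_data, interval=600, first=10)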
Disclaimer: I'm currently the maintainer of python-telegram-bot.

Celery : understanding the big picture

Celery seems to be a great tool, but I have a hard time understanding how the various Celery components work together:
The workers
The apps
The tasks
The message Broker (like RabbitMQ)
From what I understand, the command line:
celery -A not-clear-what-this-option-is worker
should run some sort of celery "worker server" which would itself need to connect to a broker server (I'm not so sure why so many servers are needed).
Then in any python code, some task may be sent to the worker by instantiating an app:
app = Celery('my_module', broker='pyamqp://guest@localhost//')
and then by decorating functions with this app in the following way:
@app.task
def my_func():
    ...
so that my_func() can now be called as my_func.delay() to be run asynchronously.
Here are my questions:
What happens when my_func.delay() is called? Which server talks to which first, and what gets sent where?
What is the option to put behind the "-A" of the celery command? Is it really needed?
Suppose I have a process X which instantiates a Celery app to launch a task A, and suppose I have another process Y that wants to know the status of task A launched by X. I assume there is a way for Y to do so, but I don't know how. I suppose that Y should create its own instance of a Celery app. But then:
What function should Y call on its Celery app to get this information (and what is the "identifier" of task A inside the process Y)?
How does this work in terms of communication, that is, when does the request go through the broker, and when does it go to the worker(s)?
If anyone has some information about these questions, I would be grateful. I intend to use Celery in a Django project, where some requests to the server can trigger various time consuming tasks, and/or inquire about the status of previously launched tasks (pending, finished, error, etc...).
About the broker:
The main role of the broker is to mediate communication between the client and the worker
basically a lot of information is being generated and processed while your worker is running
taking care of this information is the broker's role
e.g. you can configure redis so that no information is lost if the server is shut down while running a process
The worker:
you can think of the worker as an instance independent of your application, which will only execute those tasks that you delegate to it
About the state of a task:
there are ways to consult celery to find out the status of a task, but I would not recommend building your application logic depending on this
if you want to get the output of one task and turn it into the input of another, I would recommend using a queue
run task A and, before it finishes, insert its result objects into the queue
task B listens to the queue and processes whatever comes up
The command:
on the terminal you can see in more detail what each argument means by running celery -h or celery --help
but the -A argument basically specifies which Celery application instance you intend to run, i.e. it tells Celery where the app you have configured can be found
usage: celery [-h] [-A APP] [-b BROKER] [--result-backend RESULT_BACKEND]
[--loader LOADER] [--config CONFIG] [--workdir WORKDIR]
[--no-color] [--quiet]
I hope this provides an initial overview for those who get here.
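To address the question about a separate process Y checking the status of a task launched by X, here is a minimal sketch, assuming a result backend is configured and that Y learns the task id from somewhere (e.g. a database row written by X):

from celery import Celery

# process Y builds its own app pointing at the same broker and result backend
app = Celery('my_module', broker='pyamqp://guest@localhost//', backend='rpc://')

def check_status(task_id):
    # task_id is the id returned by my_func.delay() in process X
    result = app.AsyncResult(task_id)
    print(result.state)       # PENDING, STARTED, SUCCESS, FAILURE, ...
    if result.ready():
        print(result.result)  # the task's return value (or its exception)

Note that these status queries go to the result backend (which the worker writes to), not directly to the worker.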
Celery is used to make functions run in the background. Imagine you have a web API that does a job and returns a response. You know that job would seriously affect the response time of the API. So you transfer that particular job to Celery, and your API responds instantly. Examples of jobs that affect the performance of an API are:
Routing to email servers
Routing to SMS Gateways
Database backup
Chained database operations
File conversion
Now, let's cover each component of Celery.
The workers
Celery workers execute the job (function). They are asynchronous. A common rule of thumb is to run twice as many worker processes as you have processor cores. You can assign a name and tasks to a Celery worker.
The apps
The app is the name of the project you're working on. You'll have to specify that name in the Celery instance.
The tasks
The functions you need to be executed in the background. Every task Celery executes will have a task id, a state (and more). You can get these by inspecting a particular task.
The message Broker
The tasks that will be executed in the background have to be moved from your Python project to the Celery workers. Message brokers act as the medium here: functions with their arguments are transferred to the broker, and Celery fetches them from the broker to execute.
Some codes
celery -A project_name worker -n worker_name
celery -A project_name inspect active
More in documentation
docs.celeryproject.org

How do I schedule a job in Django?

I have to schedule a job using Schedule in my Django web application.
def new_job(request):
    print("I'm working...")
    file = schedulesdb.objects.filter(user=request.user, f_name__icontains="mp4").last()
    f_name_initiated = str(file.f_name)
    os.startfile(f_name_initiated)
I need to run it at the time stored in the db, something like this pseudocode:
given_datetime = schedulesdb.objects.datetimes('request_time', 'second').last()
schedule.given_datetime.do(job)
Django is a web framework. It receives a request, does whatever processing is necessary and sends out a response. It doesn't have any persistent process that could keep track of time and run scheduled tasks, so there is no good way to do it using just Django.
That said, Celery (http://www.celeryproject.org/) is a python framework specifically built to run tasks, both scheduled and on-demand. It also integrates with Django ORM with minimal configuration. I suggest you look into it.
You could, of course, write your own external script that would use the schedule module you mentioned. You would need to implement a way to write schedule objects into the database, and then your script could read and execute them. Is your "scheduledb" model already implemented?
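A minimal sketch of such an external script, assuming the actual work lives in a run_due_jobs() function you would write against your own models (name and polling interval are placeholders):

# poll_schedules.py - run outside Django's request/response cycle
import time

import schedule

def run_due_jobs():
    ...  # read due rows from schedulesdb and start the corresponding files

schedule.every(1).minutes.do(run_due_jobs)

while True:
    schedule.run_pending()
    time.sleep(10)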

Spawn Asynchronous Python Process From Flask [duplicate]

I have to do some long-running work in my Flask app, and I want to do it asynchronously: just start the work, then check its status from JavaScript.
I'm trying to do something like:
from multiprocessing import Process

@app.route('/sync')
def sync():
    p = Process(target=routine, args=('abc',))
    p.start()
    return "Working..."
But this creates defunct gunicorn workers.
How can this be solved? Should I use something like Celery?
There are many options. You can develop your own solution, use Celery or Twisted (I'm sure there are more already-made options out there but those are the most common ones).
Developing your own in-house solution isn't difficult. You can use the multiprocessing module of the Python standard library:
When a task arrives, you insert a row in your database with the task id and status.
Then you launch a process to perform the work, which updates the row's status when it finishes.
You can have a view to check if the task is finished, which actually just checks the status in the corresponding row.
Of course you have to think where you want to store the result of the computation and what happens with errors.
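A rough sketch of that in-house approach using only the standard library; the SQLite table here stands in for whatever database your app already uses:

import sqlite3
import uuid
from multiprocessing import Process

from flask import Flask, jsonify

app = Flask(__name__)
DB = 'tasks.db'

def set_status(task_id, status):
    with sqlite3.connect(DB) as con:
        con.execute('CREATE TABLE IF NOT EXISTS tasks (id TEXT PRIMARY KEY, status TEXT)')
        con.execute('INSERT OR REPLACE INTO tasks VALUES (?, ?)', (task_id, status))

def routine(task_id, data):
    ...  # the long-running work goes here
    set_status(task_id, 'done')

@app.route('/sync')
def sync():
    task_id = str(uuid.uuid4())
    set_status(task_id, 'pending')
    Process(target=routine, args=(task_id, 'abc')).start()
    return jsonify(task_id=task_id)

@app.route('/status/<task_id>')
def status(task_id):
    with sqlite3.connect(DB) as con:
        row = con.execute('SELECT status FROM tasks WHERE id = ?', (task_id,)).fetchone()
    return jsonify(status=row[0] if row else 'unknown')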
Going with Celery is also easy. It would look like the following.
To define a function to be executed asynchronously:
@celery.task
def mytask(data):
    ...  # do a lot of work
Then instead of calling the task directly, like mytask(data), which would execute it straight away, use the delay method:
result = mytask.delay(mydata)
Finally, you can check if the result is available or not with ready:
result.ready()
However, remember that to use Celery you have to run an external worker process.
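For completeness, a hedged sketch of the setup the snippet above assumes; the module name and broker URL are placeholders, so point them at your own Redis or RabbitMQ:

# myapp.py
from celery import Celery

celery = Celery('myapp',
                broker='redis://localhost:6379/0',
                backend='redis://localhost:6379/0')

The worker would then be started with something like celery -A myapp worker.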
I haven't ever taken a look at Twisted, so I cannot tell you if it is more or less complex than this (but it should be fine for doing what you want to do, too).
In any case, any of those solutions should work fine with Flask. To check the result it doesn't matter at all if you use Javascript. Just make the view that checks the status return JSON (you can use Flask's jsonify).
I would use a message broker such as RabbitMQ or ActiveMQ. The Flask process would add jobs to the message queue, and a long-running worker process (or pool of worker processes) would take jobs off the queue to complete them. The worker process could update a database to allow the Flask server to know the current status of the job and pass this information to the clients.
Using celery seems to be a nice way to do this.

How to track revoked tasks across multiple celeryd processes

I have a reminder type app that schedules tasks in celery using the "eta" argument. If the parameters in the reminder object changes (e.g. time of reminder), then I revoke the task previously sent and queue a new task.
I was wondering if there's any good way of keeping track of revoked tasks across celeryd restarts. I'd like to have the ability to scale celeryd processes up/down on the fly, and it seems that any celeryd processes started after the revoke command was sent will still execute that task.
One way of doing it is to keep a list of revoked task ids, but this method will result in the list growing arbitrarily. Pruning this list requires guarantees that the task is no longer in the RabbitMQ queue, which doesn't seem to be possible.
I've also tried using a shared --statedb file for each of the celeryd workers, but it seems that the statedb file is only updated on termination of the workers and thus not suitable for what I would like to accomplish.
Thanks in advance!
Interesting problem, I think it should be easy to solve using broadcast commands.
When a new worker starts up, it can request all the other workers to dump their revoked tasks to it. This adds two new remote control commands; you can easily add new commands by using @Panel.register.
Module control.py:
from celery.worker import state
from celery.worker.control import Panel

@Panel.register
def bulk_revoke(panel, ids):
    state.revoked.update(ids)

@Panel.register
def broadcast_revokes(panel, destination):
    panel.app.control.broadcast("bulk_revoke",
                                arguments={"ids": list(state.revoked)},
                                destination=destination)
Add it to CELERY_IMPORTS:
CELERY_IMPORTS = ("control", )
The only missing piece now is to connect things so that the new worker triggers broadcast_revokes at startup. I guess you could use the worker_ready signal for this:
from celery import current_app as celery
from celery.signals import worker_ready

def request_revokes_at_startup(sender=None, **kwargs):
    celery.control.broadcast("broadcast_revokes",
                             destination=sender.hostname)

worker_ready.connect(request_revokes_at_startup)
I had to do something similar in my project and used celerycam with django-admin-monitor. The monitor takes a snapshot of the tasks and saves them in the database periodically, and there is a nice user interface to browse and check the status of all tasks. You can use it even if your project is not Django based.
I implemented something similar to this some time ago, and the solution I came up with was very similar to yours.
The way I solved this problem was to have the worker fetch the Task object from the database when the job ran (by passing it the primary key, as the documentation recommends). In your case, before the reminder is sent the worker should perform a check to ensure that the task is "ready" to be run. If not, it should simply return without doing any work (assuming that the ETA has changed and another worker will pick up the new job).
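A hedged sketch of that approach, assuming a hypothetical Reminder model that stores the currently scheduled time; the point is that rescheduling then never requires revoking anything:

from celery import shared_task

from .models import Reminder  # hypothetical model with a remind_at field

@shared_task
def send_reminder(reminder_pk, expected_remind_at):
    reminder = Reminder.objects.get(pk=reminder_pk)
    # the reminder was rescheduled after this task was queued: skip silently,
    # the task queued for the new ETA will handle it instead
    if reminder.remind_at.isoformat() != expected_remind_at:
        return
    reminder.send()  # hypothetical sending logic

When (re)scheduling, you would queue it with send_reminder.apply_async(args=[r.pk, r.remind_at.isoformat()], eta=r.remind_at) and simply let any stale copies no-op.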
