Replacing Celery Beat "CELERYBEAT_SCHEDULE" with a dynamic source (database) - python

In the Celery docs, the standard way to set the schedule of tasks is documented as hardcoding the schedule into the config file.
However, it also hints that this can be replaced with a custom backend. I see there is a dynamic, database driven option for Django but I'm using a simple Flask app to define my tasks.
Does anyone have a way to dynamically load the schedule, avoiding the need to restart the celery beat worker, either by dynamically pulling the schedule from a database or by reloading the schedule from a text file on a regular basis? Is it as simple as putting a reload() call around the schedule in a text file, perhaps even as its own scheduled celery task?

CELERYBEAT_SCHEDULE is just init/config sugar, and the object is available from within a bound task at:
self.app.conf['CELERYBEAT_SCHEDULE']
You might write a periodic task that pulls new values down from some backend.
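For example, a bound task could read the current schedule and merge in entries pulled from a database. A rough sketch of that idea follows; load_entries_from_db is a placeholder for your own storage layer and the broker URL is an assumption:

from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # broker URL assumed

def load_entries_from_db():
    # Placeholder: return schedule entries from your database, e.g.
    # {'scrape-hourly': {'task': 'tasks.scrape', 'schedule': 3600.0}}
    return {}

@app.task(bind=True)
def refresh_beat_schedule(self):
    # Merge freshly loaded entries into the in-memory schedule.
    self.app.conf['CELERYBEAT_SCHEDULE'].update(load_entries_from_db())

One caveat: the beat process keeps its own copy of the schedule read at startup, so for beat itself to pick up changes without a restart you generally still need a custom scheduler, which is what the custom-backend hint in the docs is about.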

Related

Use existing celery workers for Airflow's Celeryexecutor workers

I am trying to introduce dynamic workflows into my landscape, involving multiple steps of different model inference where the output from one model gets fed into another model. Currently we have a few Celery workers spread across hosts to manage the inference chain. As the complexity increases, we are attempting to build workflows on the fly. For that purpose, I got a dynamic DAG setup working with the CeleryExecutor.
Now, is there a way I can retain the current Celery setup and route Airflow-driven tasks to the same workers? I do understand that these workers should have access to the same DAG folders and environment as the Airflow server. I want to know how the Celery workers need to be started on those servers so that Airflow can route the same tasks that used to be run by the manual workflow from a Python application. If I start the workers using the command "airflow celery worker", I cannot access my application's tasks. If I start Celery the way it is currently started, i.e. "celery -A proj", Airflow has nothing to do with it. Looking for ideas to make it work.
Thanks @DejanLekic. I got it working (though the DAG task scheduling latency was so high that I dropped the approach). If someone is looking to see how this was accomplished, here are a few things I did to get it working:
1. Change airflow.cfg to update the executor, queue and result backend settings (obvious).
2. If you have to use Celery workers spawned outside the Airflow umbrella, change the celery_app_name setting to celery.execute instead of airflow.executors.celery_execute and change the executor to "LocalExecutor". I have not tested this, but it may even be possible to avoid switching to the Celery executor by registering Airflow's task in the project's Celery app.
3. Each task now calls send_task(); the returned AsyncResult object is stored either in XCom (implicitly or explicitly) or in Redis (implicitly pushed to the queue), and the child task then gathers the AsyncResult (an implicit call to get the value from XCom or Redis) and calls .get() to obtain the result of the previous step. A sketch of this pattern follows below.
Note: it is not necessary to split the send_task() and .get() between two tasks of the DAG. By splitting them between parent and child I was trying to take advantage of the lag between tasks, but in my case the Celery execution of the tasks completed faster than Airflow's inherent latency in scheduling dependent tasks.
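A hedged sketch of that parent/child pattern; the Celery app import, the task name 'proj.tasks.infer', and the Airflow task ids are assumptions, and the Airflow plumbing is simplified:

# Sketch only: assumes an existing Celery app (proj.celery_app) whose workers
# register a task named 'proj.tasks.infer'.
from proj import celery_app

def submit_inference(**context):
    # Parent Airflow task: enqueue work on the existing Celery workers.
    result = celery_app.send_task('proj.tasks.infer', kwargs={'payload': '...'})
    # Store only the task id in XCom; the AsyncResult object itself is not serialisable.
    context['ti'].xcom_push(key='celery_task_id', value=result.id)

def collect_inference(**context):
    # Child Airflow task: rebuild the AsyncResult from the id and wait for the result.
    task_id = context['ti'].xcom_pull(task_ids='submit_inference', key='celery_task_id')
    return celery_app.AsyncResult(task_id).get(timeout=600)

Both callables would be wrapped in PythonOperators (or TaskFlow @task functions) in the DAG; the important part is that send_task() only needs the broker and the task's registered name, not the task's code.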

Enqueue celery task from other project

I have a project using celery to process tasks, and a second project which is an API that might need to enqueue tasks to be processed by celery workers.
However, these 2 projects are separated and I can't import the tasks in the API one.
I've used Sidekiq, Celery's equivalent in Ruby, in the past; with it, for example, it is possible to push jobs from other languages/apps/processes by storing data in Redis in the same format/payload.
Is something similar possible with Celery? I couldn't find anything related.
Yes, this is possible in Celery using send_task or signatures. Assuming fetch_data is the task in the separate code base, you can invoke it with one of the methods below.
send_task:
celery_app.send_task('fetch_data', kwargs={'url': request.json['url']})
app.signature:
celery_app.signature('fetch_data', kwargs={'url': request.json['url']}).delay()
You just specify the task name as a string and do not need to import it into your codebase.
You can read about this in more detail from https://www.distributedpython.com/2018/06/19/call-celery-task-outside-codebase/
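For completeness, a minimal caller-side sketch; the broker URL and the task name 'fetch_data' are assumptions and must match whatever the worker project uses:

# The API project only needs a Celery app pointed at the same broker;
# it never imports the worker project's code.
from celery import Celery

celery_app = Celery(broker='redis://localhost:6379/0')  # same broker as the workers

result = celery_app.send_task('fetch_data', kwargs={'url': 'https://example.com'})
print(result.id)  # call result.get() later only if a result backend is configured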

How do I schedule a job in Django?

I have to schedule a job using the schedule library in my Django web application.
def new_job(request):
    print("I'm working...")
    file = schedulesdb.objects.filter(user=request.user, f_name__icontains="mp4").last()
    file_path = str(file.f_name)
    os.startfile(file_path)
I need to run it at the time stored in the database, something like this pseudocode:
GIVEN_DATETIME = schedulesdb.objects.datetimes('request_time', 'second').last()
schedule.GIVEN_DATETIME.do(job)  # pseudocode
Django is a web framework. It receives a request, does whatever processing is necessary and sends out a response. It doesn't have any persistent process that could keep track of time and run scheduled tasks, so there is no good way to do it using just Django.
That said, Celery (http://www.celeryproject.org/) is a Python framework specifically built to run tasks, both scheduled and on-demand. It also integrates with the Django ORM with minimal configuration. I suggest you look into it.
You could, of course, write your own external script that uses the schedule module you mentioned. You would need to implement a way to write schedule objects into the database, and your script could then read and execute them. Is your "schedulesdb" model already implemented?
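If you go the external-script route, a minimal sketch with the schedule library is below; the job body and the hard-coded time are only illustrations (in practice you would read them from your schedulesdb rows):

import time
import schedule  # pip install schedule

def job():
    # e.g. fetch the latest schedulesdb entry and start the file
    print("I'm working...")

schedule.every().day.at("10:30").do(job)  # the time would come from your DB

while True:
    schedule.run_pending()
    time.sleep(1)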

Celery: why do I need a broker for periodic tasks?

I have a standalone script that scrapes a page, opens a connection to a database, and writes data to it. I need it to execute periodically, every x hours. I can do it with a bash script, with the pseudocode:
while true
do
python scraper.py
sleep 60*60*x
done
From what I have read about message brokers, they are used for sending "signals" from one running program to another, a bit like HTTP in principle. For example, a piece of code that accepts an email id from a user sends a signal with that email id to another piece of code that will send the email.
I need Celery to run a periodic task on Heroku. I already have MongoDB on a separate server. Why do I need to run another server for RabbitMQ or Redis just for this? Can I use Celery without a broker?
The Celery architecture is designed to scale and distribute tasks across several servers. For sites like yours it might be overkill. A queue service is generally needed to maintain the task list and signal the status of finished tasks.
You might want to take a look at Huey instead. Huey is a small-scale Celery "clone" that needs only Redis as an external dependency, not RabbitMQ; it still uses Redis's queue mechanism to line up tasks.
There is also the Advanced Python Scheduler, which does not even need Redis and can hold the state of the queue in memory, in-process.
Alternatively, if you have a very small number of periodic tasks and no delayed tasks, I would just use cron and plain Python scripts to run the tasks.
As the Celery documentation explains:
Celery communicates via messages, usually using a broker to mediate between clients and workers. To initiate a task, a client adds a message to the queue, which the broker then delivers to a worker.
You can use your existing MongoDB database as the broker; see Using MongoDB in the Celery documentation.
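A minimal sketch of that setup; MongoDB as a broker is provided through kombu's (experimental) transport, and the database name, task name and interval below are assumptions:

from celery import Celery
from celery.schedules import crontab

app = Celery('scraper', broker='mongodb://localhost:27017/celery_broker')

app.conf.beat_schedule = {
    'scrape-every-6-hours': {
        'task': 'scraper.scrape_and_store',         # assumed task name
        'schedule': crontab(minute=0, hour='*/6'),  # every 6 hours, for example
    },
}

@app.task(name='scraper.scrape_and_store')
def scrape_and_store():
    ...  # scrape the page and write the results to MongoDB

You would then run a worker with beat enabled (for example celery -A scraper worker -B) instead of the bash loop.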
For an application like this, it is better to use Django Background Tasks.
Installation
Install from PyPI:
pip install django-background-tasks
Add to INSTALLED_APPS:
INSTALLED_APPS = (
    # ...
    'background_task',
    # ...
)
Migrate your database:
python manage.py makemigrations background_task
python manage.py migrate
Creating and registering tasks
To register a task use the background decorator:
from background_task import background
from django.contrib.auth.models import User

@background(schedule=60)
def notify_user(user_id):
    # look up the user by id and send them a message
    user = User.objects.get(pk=user_id)
    user.email_user('Here is a notification', 'You have been notified')
This converts notify_user into a background task function. When you call it from regular code it will actually create a Task object and store it in the database. The database then contains serialised information about which function actually needs running later on. This places limits on the parameters that can be passed when calling the function: they must all be serializable as JSON. That is why in the example above a user_id is passed rather than a User object.
Calling notify_user as normal will schedule the original function to be run 60 seconds from now:
notify_user(user.id)
This is the default schedule time (as set in the decorator), but it can be overridden:
notify_user(user.id, schedule=90) # 90 seconds from now
notify_user(user.id, schedule=timedelta(minutes=20)) # 20 minutes from now
notify_user(user.id, schedule=timezone.now()) # at a specific time
You can also run the original function immediately, in synchronous mode:
notify_user.now(user.id) # launch a notify_user function and wait for it
notify_user = notify_user.now # revert task function back to normal function.
Useful for testing.
You can specify a verbose name and a creator when scheduling a task:
notify_user(user.id, verbose_name="Notify user", creator=user)
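One more thing worth noting (from the library's documentation, so verify against your version): queued tasks are only executed while a worker process is running, which you start with:
python manage.py process_tasks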

How do I delay a task using Celery?

I'm not talking about the delay method.
I want to be able to get a task, given its task_id, and change its ETA on the fly, before it is executed.
For now I have to cancel it and re-schedule a new one, which is troublesome if the scheduled process involves a lot of work.
You should store some 'pause' value outside of Celery/the task queue. I do this with a mailer that uses Celery. I can pause parts of the system by setting values in either memcache or MySQL. The tasks query that outside resource before executing; if the task is meant to be paused, it calls task.retry(), which sends it through the retry delay and so on.
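A minimal sketch of that pattern, assuming the pause flag lives in Django's cache (memcached behind it, say); the key name and retry delay are made up:

from celery import shared_task
from django.core.cache import cache

@shared_task(bind=True, max_retries=None)
def send_mail_batch(self, batch_id):
    if cache.get('mailer:paused'):
        # System is paused: put the task back with a delay and check again later.
        raise self.retry(countdown=300)
    ...  # actually send the batch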
Assuming you are using django-celery and a PeriodicTask with the DatabaseScheduler, you need to modify your PeriodicTask's interval or crontab and save it. If your task is defined by an interval, modify the last_run_at property.
You run celerybeat with the database scheduler with:
python manage.py celerybeat -S djcelery.schedulers.DatabaseScheduler
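A hedged sketch of changing an interval-based PeriodicTask at runtime, assuming django-celery's models (the task name and new interval are assumptions):

from djcelery.models import PeriodicTask, IntervalSchedule

task = PeriodicTask.objects.get(name='my-periodic-task')
interval, _ = IntervalSchedule.objects.get_or_create(every=10, period='minutes')
task.interval = interval
task.save()  # the DatabaseScheduler picks this up without restarting beat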
