Celery Task Priority - python

I want to manage tasks using Celery. I want to have a single task queue (with concurrency 1) and be able to push tasks onto the queue with different priorities such that higher priority tasks will preempt the others.
I am adding three tasks to a queue like so:
add_tasks.py
from tasks import example_task
example_task.apply_async((1,), priority=1)
example_task.apply_async((2,), priority=3)
example_task.apply_async((3,), priority=2)
I have the following configuration:
tasks.py
from __future__ import absolute_import, unicode_literals
from celery import Celery
from kombu import Queue, Exchange
import time
app = Celery('tasks', backend='rpc://', broker='pyamqp://')
app.conf.task_queues = [Queue('celery', Exchange('celery'), routing_key='celery', queue_arguments={'x-max-priority': 10})]
@app.task
def example_task(task_num):
    time.sleep(3)
    print('Started {}'.format(task_num))
    return True
I expect the second task I added to run before the third because it has a higher priority, but it doesn't. They run in the order they were added.
I am following the docs and thought I had configured the app correctly.
Am I doing something wrong or am I misunderstanding the priority feature?

It is possible that the queue never gets a chance to prioritize the messages, because the worker downloads them before any sorting can happen. Try these two settings (adapt to your project as needed):
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
Prefetch multiplier is 4 by default.
I developed a sample application to exercise Celery's priority tasking on a very small scale. While developing it, I encountered a very similar problem, and this change in settings actually solved it.
Note that you also require RabbitMQ version 3.5.0 or higher.
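A minimal sketch of those settings applied to the question's tasks.py, assuming a reasonably recent Celery (the lowercase names are the new-style equivalents of the uppercase settings above):
# tasks.py (continued) - limit prefetch and ack late so the broker
# can reorder messages by priority before the worker pulls them.
# On Celery 3.x use the uppercase CELERY_ACKS_LATE /
# CELERYD_PREFETCH_MULTIPLIER settings shown above instead.
app.conf.task_acks_late = True
app.conf.worker_prefetch_multiplier = 1
With prefetch limited to one message and late acknowledgement, the worker only holds the task it is currently executing, so the broker can hand over the highest-priority message next.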

Related

How to execute tasks on a different server using Python Celery?

Let's say I have three servers A, B & C. Server C will have the celery task code, and I need to execute them from servers A & B.
From the celery documentation, I see that there's a tasks.py file which is run as a celery worker:
from celery import Celery
app = Celery('tasks', broker='pyamqp://guest@localhost//')

@app.task
def add(x, y):
    return x + y
And then we have another python file (let's say client.py) which calls these tasks.
from tasks import add
add.delay(4, 4)
Here I can see that the client.py file depends on the tasks.py file, as it imports the task from it. If we are to run these two files on separate servers, we need to decouple them and somehow call the tasks without having to import the code. I am not able to figure out how to achieve that. So, how can it be done?
In general you do not do that. You deploy the same code (containing the tasks) to both producers (clients) and consumers (workers). However, Celery is a cool piece of software, and it allows you to schedule a task without distributing the code to the producer side. For that you use the send_task() method. You have to configure the producer with the same parameters as your workers (the same broker, naturally, and the same serialization), and you must know the task's name and its parameters in order to schedule its execution correctly.
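A minimal sketch of the producer side, assuming the worker on server C has registered the task under the name tasks.add and that both sides point at the same broker:
# client.py on server A or B - no import of tasks.py required
from celery import Celery

# Must use the same broker (and serializer settings, if customized)
# as the workers running on server C.
app = Celery(broker='pyamqp://guest@localhost//')

# Schedule by task name; whichever worker has the code will execute it.
result = app.send_task('tasks.add', args=(4, 4))
print(result.id)  # fetching the actual result also needs a result backend
send_task() only needs the task's registered name, so client.py no longer imports anything from tasks.py.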

Django 3.0: Running background infinite loop in app ready()

I'm trying to find a way to constantly poll a server every x seconds from the ready() function of Django, basically something which looks like this:
from django.apps import AppConfig

class ApiConfig(AppConfig):
    name = 'api'

    def ready(self):
        import threading
        import time
        from django.conf import settings
        from api import utils

        def refresh_ndt_servers_list():
            while True:
                utils.refresh_servers_list()
                time.sleep(settings.WAIT_SECONDS_SERVER_POLL)

        thread1 = threading.Thread(target=refresh_ndt_servers_list)
        thread1.start()
I just want my utils.refresh_servers_list() to be executed when Django starts/is ready, and to re-execute that same method (which populates my DB) every settings.WAIT_SECONDS_SERVER_POLL seconds indefinitely. The problem with that is that if I run python manage.py migrate, the ready() function gets called and never finishes. I would like to avoid calling this function during migrations.
Thanks!
AppConfig.ready() is meant to "... perform initialization tasks ..." and make your app ready to run / serve requests. The app's actual working logic should run after the Django app is initialized.
For launching a task at regular intervals, a cron job can be used.
Or set up a periodic Celery task with celery beat.
Also, the task in question performs updates in the database (good for it to be atomic), so it may be critical to have only a single running instance of it. One cron job or one celery task takes care of that.
However, the next run may still start while the previous one has not yet finished, or the task may be launched manually for some reason, so adding some locking logic to the task to ensure only one instance runs at a time (or locking the database table for the run) may be desired.
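For the celery beat route, a minimal sketch (the project name and the task path api.tasks.refresh_servers_list are assumptions based on the question's layout):
# celery.py - poll the servers list on a fixed interval instead of
# spawning a thread in AppConfig.ready().
from celery import Celery
from django.conf import settings

app = Celery('project')  # project name assumed

app.conf.beat_schedule = {
    'refresh-servers-list': {
        # Assumed task that simply wraps utils.refresh_servers_list().
        'task': 'api.tasks.refresh_servers_list',
        'schedule': settings.WAIT_SECONDS_SERVER_POLL,  # interval in seconds
    },
}
The schedule runs in a separate celery beat process next to the worker, so python manage.py migrate never touches the polling loop.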

Memory usage not going down even after the job completes successfully

I have a job added in apscheduler which loads some data in memory, and I am deleting all the objects after the job is complete. If I run this job directly with Python, it works and the memory drops after the process exits successfully. But with apscheduler the memory usage does not come down. I am using BackgroundScheduler. Thanks in advance.
I was running quite a few tasks via apscheduler. I suspected this setup led to R14 errors on Heroku, with dyno memory overload, crashes and restarts occurring daily. So I spun up another dyno and scheduled a few jobs to run very frequently.
Watching the metrics tab in Heroku, it immediately became clear that apscheduler was the culprit.
Removing jobs after they're run was recommended to me. But this is of course a bad idea when running cron and interval jobs as they won't run again.
What finally solved it was tweaking the ThreadPoolExecutor (lowering the max number of workers); see the related answer on Stack Overflow and the posts on GitHub. I definitely suggest you read the docs on this.
Example code:
import logging

from apscheduler.executors.pool import ThreadPoolExecutor, ProcessPoolExecutor
from apscheduler.schedulers.blocking import BlockingScheduler

from tests import overloadcheck

logging.basicConfig()
logging.getLogger('apscheduler').setLevel(logging.DEBUG)

sched = BlockingScheduler(
    executors={
        'threadpool': ThreadPoolExecutor(max_workers=9),
        'processpool': ProcessPoolExecutor(max_workers=3)
    }
)

@sched.scheduled_job('interval', minutes=10, executor='threadpool')
def message_overloadcheck():
    overloadcheck()

sched.start()
Or, if like me you love to run heavy tasks, try the ProcessPoolExecutor as an alternative to, or in addition to, the ThreadPool, but make sure to call it from specific jobs in that case.
Update: you also need to import ProcessPoolExecutor if you wish to use it; I added that to the code above.
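As a small sketch of that last point (the job name and interval are made up), a heavy job can be pinned to the 'processpool' executor configured above:
# Hypothetical CPU-heavy job routed to the process pool instead of threads.
@sched.scheduled_job('interval', minutes=30, executor='processpool')
def heavy_crunch_job():
    overloadcheck()  # placeholder for the actual heavy work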
Just in case anyone is using Flask-APScheduler and having memory leak issues, it took me a while to realize that it expects any configuration settings to be in your Flask Config, not when you instantiate the scheduler.
So if you (like me) did something like this:
from flask_apscheduler import APScheduler
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor
bg_scheduler = BackgroundScheduler(executors={'threadpool': ThreadPoolExecutor(max_workers=1)})
scheduler = APScheduler(scheduler=bg_scheduler)
scheduler.init_app(app)
scheduler.start()
then for whatever reason, when jobs are run in the Flask request context, it will not recognize the executor, 'threadpool', or any other configuration settings you may have set.
However, if you set these same options in the Flask Config class as:
class Config(object):
    #: Enable the Flask-APScheduler API
    SCHEDULER_API_ENABLED = True
    #: Set max workers to 1, which reduces the memory footprint
    SCHEDULER_EXECUTORS = {"default": {"type": "threadpool", "max_workers": 1}}
    # ... other Flask configuration options
and then do (back in the main script)
scheduler = APScheduler()
scheduler.init_app(app)
scheduler.start()
then the configuration settings actually do get set. I'm guessing when I called scheduler.init_app in the original script, Flask-APScheduler saw that I hadn't set any of those settings in my Flask Config and so overwrote them with default values, but not 100% sure.
Regardless, hopefully this helps anyone who has tried the top-rated answer but is also using Flask-APScheduler as a wrapper and might still be seeing memory issues.

Celery: @shared_task and non-standard BROKER_URL

I have a Celery 3.1.19 setup which uses a BROKER_URL including a virtual host.
# in settings.py
BROKER_URL = 'amqp://guest:guest@localhost:5672/yard'
Celery starts normally, loads the tasks, and the tasks I define with the @app.task decorator work fine. I assume that my RabbitMQ and Celery configuration at this end is correct.
Tasks I define with @shared_task and load with app.autodiscover_tasks also load correctly upon start. However, if I call such a task, the message ends up in the (still existing) amqp://guest:guest@localhost:5672/ virtual host.
Question: What am I missing here? Where do shared tasks get their actual configuration from?
And here are some more details:
# celery_app.py
from celery import Celery

celery_app = Celery('celery_app')
celery_app.config_from_object('settings')
celery_app.autodiscover_tasks(['connectors'])

@celery_app.task
def i_do_work():
    print('this works')
And in connectors/tasks.py (with an __init__.py in the same folder):
# in connectors/tasks.py
from celery import shared_task

@shared_task
def I_do_not_work():
    print('bummer')
And again, the shared task also gets picked up by the Celery instance. It just somehow lacks the context to send messages to the right BROKER_URL.
Btw., why are shared_tasks so poorly documented? Do they rely on some Django context? I am not using Django.
Or do I need additional parameters in my settings?
Thanks a lot.
The celery_app was not yet imported at application start. Within my project, I added the following code to __init__.py at the same module level as my celery_app definition.
from __future__ import absolute_import
try:
    from .celery_app import celery_app
except ImportError:
    # just in case someone develops the application without
    # celery running
    pass
I was confused by the fact that Celery seems to come with a perfectly working default app. In this case, a more interface-like structure raising NotImplementedError might have been more helpful. Nevertheless, Celery is awesome.

How to track revoked tasks across multiple celeryd processes

I have a reminder type app that schedules tasks in celery using the "eta" argument. If the parameters in the reminder object changes (e.g. time of reminder), then I revoke the task previously sent and queue a new task.
I was wondering if there's any good way of keeping track of revoked tasks across celeryd restarts. I'd like to have the ability to scale celeryd processes up/down on the fly, and it seems that any celeryd processes started after the revoke command was sent will still execute that task.
One way of doing it is to keep a list of revoked task ids, but this method will result in the list growing arbitrarily. Pruning this list requires guarantees that the task is no longer in the RabbitMQ queue, which doesn't seem to be possible.
I've also tried using a shared --statedb file for each of the celeryd workers, but it seems that the statedb file is only updated on termination of the workers and thus not suitable for what I would like to accomplish.
Thanks in advance!
Interesting problem, I think it should be easy to solve using broadcast commands.
When a new worker starts up, it can ask all the other workers to dump their revoked tasks to it. This means adding two new remote control commands;
you can easily add new commands by using @Panel.register:
Module control.py:
from celery.worker import state
from celery.worker.control import Panel

@Panel.register
def bulk_revoke(panel, ids):
    state.revoked.update(ids)

@Panel.register
def broadcast_revokes(panel, destination):
    panel.app.control.broadcast("bulk_revoke",
                                arguments={"ids": list(state.revoked)},
                                destination=destination)
Add it to CELERY_IMPORTS:
CELERY_IMPORTS = ("control", )
The only missing piece now is to make the new worker trigger broadcast_revokes at startup. I guess you could use the worker_ready signal for this:
from celery import current_app as celery
from celery.signals import worker_ready

def request_revokes_at_startup(sender=None, **kwargs):
    celery.control.broadcast("broadcast_revokes",
                             destination=sender.hostname)

# Connect the handler so it fires once the worker is ready.
worker_ready.connect(request_revokes_at_startup)
I had to do something similar in my project and used celerycam with django-admin-monitor. The monitor takes a snapshot of the tasks and saves them in the database periodically, and there is a nice user interface to browse and check the status of all tasks. You can use it even if your project is not Django based.
I implemented something similar to this some time ago, and the solution I came up with was very similar to yours.
The way I solved this problem was to have the worker fetch the Task object from the database when the job ran (by passing it the primary key, as the documentation recommends). In your case, before the reminder is sent the worker should perform a check to ensure that the task is "ready" to be run. If not, it should simply return without doing any work (assuming that the ETA has changed and another worker will pick up the new job).
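A minimal sketch of that check (the Reminder model, its fields, and the task name are made-up names for illustration):
# tasks.py - re-check the database before doing any work, so a stale
# ETA-scheduled job becomes a no-op instead of needing to be revoked.
from celery import shared_task
from myapp.models import Reminder  # hypothetical model

@shared_task
def send_reminder(reminder_id, scheduled_for):
    reminder = Reminder.objects.filter(pk=reminder_id).first()
    if reminder is None:
        return  # reminder was deleted; nothing to send
    if reminder.remind_at != scheduled_for:
        return  # reminder was rescheduled; a newer job will handle it
    reminder.send()
Because the task re-reads its state at run time, there is no need to track revoked ids at all for the rescheduling case.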
