djcelery PeriodicTasks.changed method - update query running slow - python

I have been using Celery (3.1) and django-celery (3.1.17, i.e. djcelery) with Django (1.9) for scheduling tasks. Recently I noticed that one update query against the PeriodicTasks table takes a long time (around 1 minute), which slows down the whole task-scheduling process. The update query on the PeriodicTasks model is triggered by the pre_save and pre_delete signals on the PeriodicTask model. So whenever an object is added to the PeriodicTask table, the signal fires and calls the PeriodicTasks.changed method, which simply updates the same PeriodicTasks object again and again.
I assume the update query is slow because millions of tasks get published in a very short time, and the same PeriodicTasks object is updated for each task.
The question is (assuming I don't want to upgrade to Celery 4.x): what is the purpose of this update of the PeriodicTasks object? If I remove the signal call, will it affect the functionality of django-celery in any way?
Also, just to mention: in Celery 4 the same signal calls on pre_save and pre_delete of the PeriodicTask model still happen, so upgrading to Celery 4 would not help me.
Update
I experimented by commenting out the two signal calls in djcelery/models.py, and everything works fine. Now the question is: what is the purpose of this update call? Why are the signals set up to update the same instance of PeriodicTasks?

Related

Django save model everyday

I have a model and a signal in models.py, and this model sends a message to a Discord webhook saying how many days are left until something. I want to refresh it automatically every day at 12:00 AM without using django-celery, because it doesn't work for me. My plan is to do something like this:
time_set = 0  # 12:00 AM is hour 0 on a 24-hour clock
if time_set == timezone.now().hour:
    # ... save model instances ...
but I have absolutely no idea how to do it.
I want to do it this way because the signal runs when model instances are saved.
Django doesn't handle this scenario out of the box, hence the need for celery and its ilk. The simplest way is to set a scheduled task on the operating system that calls a custom django management command (which is essentially a python script that can reference your django models and methods etc by calling python manage.py myNewCommand).
You can find more about custom commands at https://docs.djangoproject.com/en/4.0/howto/custom-management-commands/
You can create a custom management command and call it by using a cron entry, set to run every day.
Check Django official documentation for the instructions on creating the custom command.
Instead of calling the save() method each time, I'd create a send_discord_message() method on the model and call it wherever required. If you need to execute it every time an instance is saved, then it is preferable to override the save() method in the model. Signals are a great way to plug and extend different apps together, but they have some caveats, and it is simpler to override the save() method.
I'm supposing you are using a Unix-like system. You can check how to configure and create cron jobs.

Django + background-tasks how to initialize

I have a basic Django project that I use as a front-end interface for a (Condor) computing cluster for generating simulations. From the Django app the users can start simulations (in Condor). The simulation-related metadata and the simulation state are kept in a DB.
I need to add a new feature: notification when (some) simulations are done.
Since I want a simple solution (and I am already using background tasks), I was thinking of using a repeating task that queries Condor about the tasks at fixed intervals, updates the DB and, if necessary, sends notifications.
So if I want to update the statuses every 10 minutes, I will have something like:
from background_task import background

@background(schedule=1)
def check_simulations():
    # look up simulation statuses
    simulation_list = get_Simulations()
    for sim in simulation_list:
        if sim.status == Simulation.DONE:
            user.email_user('Simulation Complete', 'You have been notified')

def initialize():
    # repeat is passed when the task is scheduled, not declared as a parameter
    check_simulations(repeat=600)
However, this task (or rather the initialize() method) must be called once to create and schedule the check_simulations() task (which in practice serializes the call and saves it in the DB); after that, the background-tasks thread will read it, execute it, and reschedule it (if there is an error).
My questions:
Where should I put the call to the initialize() method so that it runs only once?
One such place could be, for instance, urls.py, but this is an extremely ugly solution. Is there a better way?
How can I ensure that a server restart will not create and schedule a new task (if one already exists)?
This may happen if a task is already scheduled (so a serialized task is in the background-tasks table) and the web server is restarted, so the initialize() method is called again and a new task is created and scheduled.
I had a similar problem and solved it this way.
I initialize my task in urls.py (I don't know whether it can be put elsewhere), and I added an if to check whether the task is already in the database:
from background_task.models import Task

if not Task.objects.filter(verbose_name="update_orders").exists():
    tasks.update_orders(repeat=300, verbose_name="update_orders")
I have tested it and it works fine. You can also search for the task by other parameters, such as name or hash.
you can check the task model here: https://github.com/arteria/django-background-tasks/blob/master/background_task/models.py
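The check-before-create idea in the answer above can also be sketched outside Django. The snippet below is a minimal illustration that uses the standard-library sqlite3 module as a stand-in for the background-tasks table; the table layout and the schedule_once helper are hypothetical and not part of django-background-tasks.

```python
import sqlite3


def schedule_once(conn, verbose_name, repeat_seconds):
    """Insert a task row only if none with this verbose_name exists.

    Returns True if a new row was created, False if one was already there.
    """
    already_there = conn.execute(
        "SELECT 1 FROM task WHERE verbose_name = ?", (verbose_name,)
    ).fetchone()
    if already_there:
        return False
    conn.execute(
        "INSERT INTO task (verbose_name, repeat_seconds) VALUES (?, ?)",
        (verbose_name, repeat_seconds),
    )
    conn.commit()
    return True


# Hypothetical stand-in for the table that stores serialized tasks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (verbose_name TEXT, repeat_seconds INTEGER)")

# Simulate two server starts: only the first one schedules the task.
created_first = schedule_once(conn, "check_simulations", 600)
created_second = schedule_once(conn, "check_simulations", 600)
task_count = conn.execute("SELECT COUNT(*) FROM task").fetchone()[0]
```

Note that if several web server processes start at the same moment, two of them could still pass the check together, so a check like this guards against restarts rather than against concurrent starts.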

Django with Celery - existing object not found

I am having a problem executing a Celery task from another Celery task.
Here is the problematic snippet (data object already exists in database, its attributes are just updated inside finalize_data function):
def finalize_data(data):
    data = update_statistics(data)
    data.save()
    from apps.datas.tasks import optimize_data
    optimize_data.delay(data.pk)

@shared_task
def optimize_data(data_pk):
    data = Data.objects.get(pk=data_pk)
    # Do something with data
The get() call in the optimize_data function fails with "Data matching query does not exist."
If I retrieve the object by pk inside the finalize_data function, it works fine. It also works fine if I delay the Celery task call for some time.
This line:
optimize_data.apply_async((data.pk,), countdown=10)
instead of
optimize_data.delay(data.pk)
works fine. But I don't want to use hacks in my code. Is it possible that the .save() call is asynchronously blocking access to that row/object?
I know that this is an old post, but I stumbled on this problem today. Lee's answer pointed me in the right direction, but I think a better solution exists today.
Using the on_commit handler provided by Django, this problem can be solved without hackish countdowns in the code, whose purpose might not be obvious to a later reader.
I'm not sure if this existed when the question was posted, but I'm posting the answer so that people who come here in the future know about the alternative.
I'm guessing your caller is inside a transaction that hasn't committed before celery starts to process the task. Hence celery can't find the record. That is why adding a countdown makes it work.
A 1 second countdown will probably work as well as the 10 second one in your example. I've used 1 second countdowns throughout code to deal with this issue.
Another solution is to stop using transactions.
You could use an on_commit hook to make sure the Celery task isn't triggered until after the transaction commits.
See "Performing actions after commit" in the Django documentation.
It's a feature that was added in Django 1.9.
from django.db import transaction

def do_something():
    pass  # send a mail, invalidate a cache, fire off a Celery task, etc.

transaction.on_commit(do_something)
You can also wrap your function in a lambda:
transaction.on_commit(lambda: some_celery_task.delay('arg1'))
The function you pass in will be called immediately after a hypothetical database write made where on_commit() is called would be successfully committed.
If you call on_commit() while there isn’t an active transaction, the callback will be executed immediately.
If that hypothetical database write is instead rolled back (typically when an unhandled exception is raised in an atomic() block), your function will be discarded and never called.
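The commit and rollback behaviour described above can be illustrated with a toy model. The ToyTransaction class below is a hypothetical stand-in written only for this sketch; it is not Django's implementation and only mimics the ordering: callbacks are queued, run on commit, and dropped on rollback.

```python
class ToyTransaction:
    """Toy stand-in for a database transaction (not Django's implementation)."""

    def __init__(self):
        self._callbacks = []

    def on_commit(self, func):
        # Queue the callback instead of running it right away.
        self._callbacks.append(func)

    def commit(self):
        # The write is now visible to other processes, so run the callbacks.
        callbacks, self._callbacks = self._callbacks, []
        for func in callbacks:
            func()

    def rollback(self):
        # The write never happened, so the callbacks are discarded.
        self._callbacks = []


sent = []  # stands in for tasks published to the broker

committed = ToyTransaction()
committed.on_commit(lambda: sent.append("optimize_data(1)"))
sent_before_commit = list(sent)  # nothing has been published yet
committed.commit()

rolled_back = ToyTransaction()
rolled_back.on_commit(lambda: sent.append("optimize_data(2)"))
rolled_back.rollback()
```

Applied to the question's finalize_data, the same idea would presumably read transaction.on_commit(lambda: optimize_data.delay(data.pk)) placed after data.save(), so the worker only receives the task once the row is visible to it.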

Django model doesn't get saved to database inside Celery Task

I've hit a really nasty situation. I have the following setup.
I have a django model representing an FSM with a django FSM field
I have a celery task that sends out an email and then advances the state of the main objects FSM. From the celery task's perspective, the object "seems" to be saved. But from the main django process' perspective, the object isn't being updated. The strange thing is that ancillary objects are being saved properly to the DB, and later accessible from the main django process.
I explicitly call .save() on the object from the Celery task, and the date_last_modified = models.DateTimeField(auto_now=True, null=True) field has a later timestamp in the Celery task than the main thread, although I'm not sure if that's an indication of anything, i.e. it may have been updated but the update has not been flushed out to the DB.
I'm using django 1.5.1,
postgresql 9.3.0,
celery v3.1.0,
Redis 2.6.10
Running Celery like so
$ celery -A tracking worker -E -B -l info
Any ideas of why this may be happening would be greatly appreciated
Are you re-getting the object after the save? I.e. not just looking at the instance you got before the save?
I had a similar problem with Django 1.5.
I guess it's because Django does not commit changes to the database immediately.
Adding
'OPTIONS': {
    'autocommit': True
}
to the DATABASES setting fixed the problem for me.
The problem will not exist in Django 1.6+ because autocommit is the default there.
What about transactions? You can try setting CELERY_EAGER_PROPAGATES_EXCEPTIONS=True and running Celery with -l DEBUG to see whether any error occurs after the model's .save() call.
Also take care with concurrent updates. If one process reads a model instance, then Celery reads and saves the same row, and the initial process later calls save() on its instance, that save will overwrite all of the row's fields with the stale values.
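The overwrite described above can be simulated without a database. In this sketch a plain dict stands in for one row, and the two helper functions are hypothetical; they only mimic the difference between writing every field back and writing back just the changed ones, which is what Django's save() does when given an update_fields argument.

```python
# A dict stands in for one database row; this is a simulation, not ORM code.
row = {"state": "new", "email_sent": False}


def save_all_fields(db_row, instance):
    """Write every field back, like a plain save() on a stale instance."""
    db_row.update(instance)


def save_fields(db_row, instance, fields):
    """Write back only the named fields, like save(update_fields=[...])."""
    for name in fields:
        db_row[name] = instance[name]


# The web process loads the row before the worker touches it.
stale_copy = dict(row)

# The Celery worker advances the state and saves.
row["state"] = "sent"

# The web process changes an unrelated field and saves everything.
stale_copy["email_sent"] = True
save_all_fields(row, stale_copy)
state_after_full_save = row["state"]  # the worker's change is overwritten

# Same sequence again, but the web process writes only what it changed.
row["state"] = "sent"
save_fields(row, stale_copy, ["email_sent"])
state_after_partial_save = row["state"]  # the worker's change survives
```

Re-fetching the object in the web process before modifying it is the other common way to avoid saving stale values.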
I had an issue that looked like yours on Django 3.2.7, using get_nb_lines.delay(flow.pk) within a class-based UpdateView.
Having fixed it, I suppose it was some kind of concurrent or crossing update (I don't know what to call it).
I understood this after noticing that get_nb_lines.apply_async((flow.pk,), countdown=5) fixed my problem. If anybody can explain it another way, I'll take it.
Take care: as an error message pointed out, the arguments passed to apply_async must be an iterable. So in my case I had to pass flow.pk as a one-element tuple (note the comma after flow.pk).
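The trailing comma matters because parentheses alone do not make a tuple in Python. The short sketch below shows the difference; check_args is a hypothetical helper that only mimics the kind of validation that produces such an error, and it is not Celery code.

```python
def check_args(args):
    """Mimic a check that positional task arguments come as a list or tuple."""
    if not isinstance(args, (list, tuple)):
        raise TypeError(
            "args must be a list or tuple, got %s" % type(args).__name__
        )
    return tuple(args)


pk = 42  # hypothetical primary key

without_comma = (pk)  # just the integer 42 in parentheses
with_comma = (pk,)    # a one-element tuple

ok = check_args(with_comma)

try:
    check_args(without_comma)
    failed = False
except TypeError:
    failed = True
```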

How to track revoked tasks across multiple celeryd processes

I have a reminder-type app that schedules tasks in Celery using the "eta" argument. If the parameters of the reminder object change (e.g. the time of the reminder), then I revoke the task previously sent and queue a new task.
I was wondering if there's any good way of keeping track of revoked tasks across celeryd restarts. I'd like to have the ability to scale celeryd processes up/down on the fly, and it seems that any celeryd processes started after the revoke command was sent will still execute that task.
One way of doing it is to keep a list of revoked task ids, but this method will result in the list growing arbitrarily. Pruning this list requires guarantees that the task is no longer in the RabbitMQ queue, which doesn't seem to be possible.
I've also tried using a shared --statedb file for each of the celeryd workers, but it seems that the statedb file is only updated on termination of the workers and thus not suitable for what I would like to accomplish.
Thanks in advance!
Interesting problem. I think it should be easy to solve using broadcast commands.
When a new worker starts up, it can ask all the other workers to dump their revoked tasks to it. This takes two new remote control commands, which you can easily add using @Panel.register.
Module control.py:
from celery.worker import state
from celery.worker.control import Panel

@Panel.register
def bulk_revoke(panel, ids):
    state.revoked.update(ids)

@Panel.register
def broadcast_revokes(panel, destination):
    panel.app.control.broadcast("bulk_revoke", arguments={
                                    "ids": list(state.revoked)},
                                destination=destination)
Add it to CELERY_IMPORTS:
CELERY_IMPORTS = ("control", )
The only missing piece now is to connect it so that the new worker
triggers broadcast_revokes at startup. I guess you could use the worker_ready
signal for this:
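from celery import current_app as celery
from celery.signals import worker_ready

@worker_ready.connect  # without connecting, the handler is never called
def request_revokes_at_startup(sender=None, **kwargs):
    # Ask the other workers to send their revoked ids to this new worker;
    # broadcast_revokes expects the destination as a list of hostnames.
    celery.control.broadcast("broadcast_revokes",
                             arguments={"destination": [sender.hostname]})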
I had to do something similar in my project and used celerycam with django-admin-monitor. The monitor takes a snapshot of tasks and saves them in the database periodically, and there is a nice user interface to browse and check the status of all tasks. You can even use it if your project is not Django-based.
I implemented something similar to this some time ago, and the solution I came up with was very similar to yours.
The way I solved this problem was to have the worker fetch the Task object from the database when the job ran (by passing it the primary key, as the documentation recommends). In your case, before the reminder is sent the worker should perform a check to ensure that the task is "ready" to be run. If not, it should simply return without doing any work (assuming that the ETA has changed and another worker will pick up the new job).
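The "check before doing any work" approach from the last answer can be sketched as follows. This is a minimal illustration under stated assumptions: the REMINDERS dict stands in for the database table, and send_reminder with its scheduled_for argument is a hypothetical task body, not code from the original project.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for the reminders table, keyed by primary key.
REMINDERS = {1: {"text": "Call Bob", "remind_at": datetime(2024, 1, 1, 9, 0)}}


def send_reminder(reminder_id, scheduled_for, store=REMINDERS):
    """Task body: send only if the reminder still matches what was scheduled."""
    reminder = store.get(reminder_id)
    if reminder is None:
        return "skipped: reminder deleted"
    if reminder["remind_at"] != scheduled_for:
        # The time was changed after this task was queued; a newer task
        # carries the new time, so this one does nothing.
        return "skipped: stale"
    return "sent: " + reminder["text"]


original_time = datetime(2024, 1, 1, 9, 0)
new_time = datetime(2024, 1, 1, 10, 0)

# The user moves the reminder after the first task was queued.
REMINDERS[1]["remind_at"] = new_time

old_task_result = send_reminder(1, original_time)
new_task_result = send_reminder(1, new_time)
```

With a check like this, the outdated task does not need to be revoked at all: it simply returns without sending, so there is no revoked-id list to share between workers.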
