Django model doesn't get saved to database inside Celery Task - python

I've hit a really nasty situation. I have the following setup.
I have a django model representing an FSM with a django FSM field
I have a Celery task that sends out an email and then advances the state of the main object's FSM. From the Celery task's perspective, the object "seems" to be saved. But from the main Django process's perspective, the object isn't being updated. The strange thing is that ancillary objects are being saved properly to the DB and are later accessible from the main Django process.
I explicitly call .save() on the object from the Celery task, and the date_last_modified = models.DateTimeField(auto_now=True, null=True) field has a later timestamp in the Celery task than in the main thread, although I'm not sure whether that's an indication of anything, i.e. it may have been updated but the update may not have been flushed out to the DB.
I'm using django 1.5.1,
postgresql 9.3.0,
celery v3.1.0,
Redis 2.6.10
Running Celery like so
$ celery -A tracking worker -E -B -l info
Any ideas as to why this may be happening would be greatly appreciated.

Are you re-getting the object after the save? I.e. not just looking at the instance you got before the save?
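For illustration, "re-getting" here means issuing a fresh query rather than re-reading the in-memory instance held by the web process; a sketch (the model name is hypothetical):

obj = TrackedItem.objects.get(pk=obj.pk)  # fresh row from the DB, not the cached instance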

I had similar problem with Django 1.5
I guess it's because Django does not commit changes to the database immediately.
Adding
    'OPTIONS': {
        'autocommit': True
    }
to the DATABASES setting fixed the problem for me.
The problem will not exist in Django 1.6+ because autocommit is the default there.
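For reference, a sketch of where that option sits in settings.py on Django 1.5 with the PostgreSQL backend (the database name and credentials are placeholders):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'tracking_db',
        'USER': 'tracking_user',
        'PASSWORD': '...',
        'OPTIONS': {
            'autocommit': True,
        },
    },
}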

What about transactions? You can try setting CELERY_EAGER_PROPAGATES_EXCEPTIONS = True and running Celery with -l DEBUG to see whether any error happens after the model's .save() call.
Also take care with concurrent updates. If one process reads the model, then Celery reads and saves the same model, and afterwards the initial process calls model.save(), the later save will overwrite all the fields with stale values.
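One way to reduce that risk (a sketch, not part of the original answer; the model and field names are illustrative) is to let the task write only the columns it actually changes:

# Inside the Celery task: write only the columns this task owns
obj = TrackedItem.objects.get(pk=pk)        # fresh copy from the DB
obj.state = 'email_sent'                    # advance the FSM
obj.save(update_fields=['state', 'date_last_modified'])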

I had an issue that looked like yours on Django 3.2.7, using get_nb_lines.delay(flow.pk) within a class-based UpdateView.
After the fix, I suppose it was some kind of concurrent or crossing update (I don't know what to call it).
I understood that after I noticed that get_nb_lines.apply_async((flow.pk,), countdown=5) fixed my problem. If anybody can explain it another way, I'll take it.
Take care, because the arguments passed to apply_async must be iterable, as an error message pointed out. So in my case, I had to treat flow.pk as a tuple (add a comma after flow.pk).
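For comparison, the two call styles from this answer side by side (get_nb_lines and flow come from the snippet above):

get_nb_lines.delay(flow.pk)                        # positional arguments, enqueued immediately
get_nb_lines.apply_async((flow.pk,), countdown=5)  # args as a tuple, delayed by 5 seconds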

Related

Django save model everyday

I have a model and a signal in models.py, and this model sends a message to a Discord webhook with how many days are left until something. I want to refresh it automatically every day at 12:00 AM, without using django-celery because it doesn't work for me. My plan is to do something like this:
time_set = 12
if time_set == timezone.now().hour:
    ...save model instances...
but I have no idea how to do it.
And I want to do it this way because the signal runs when a model instance is saved.
Django doesn't handle this scenario out of the box, hence the need for Celery and its ilk. The simplest way is to set up a scheduled task on the operating system that calls a custom Django management command (which is essentially a Python script that can reference your Django models, methods, etc.) by running python manage.py myNewCommand.
You can find more about custom commands at https://docs.djangoproject.com/en/4.0/howto/custom-management-commands/
You can create a custom management command and call it by using a cron entry, set to run every day.
Check Django official documentation for the instructions on creating the custom command.
Instead of calling the save() method each time, I'd create a send_discord_message() method on the model and call it wherever required. If you need to execute it every time an instance is saved, then it is preferable to override the save() method of the model. Signals are a great way to plug and extend different apps together, but they have some caveats, and it is simpler to override the save() method.
I'm supposing you are using a Unix-like system. You can check how to configure and create cron jobs.
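A rough sketch of that setup (the app name myapp, the command name post_daily_countdown, and the Countdown model are made up for illustration): a management command that re-saves the instances so the signal fires, plus a cron entry that runs it every day at 12:00.

# myapp/management/commands/post_daily_countdown.py
from django.core.management.base import BaseCommand

from myapp.models import Countdown


class Command(BaseCommand):
    help = "Re-save Countdown instances so the post_save signal posts to Discord"

    def handle(self, *args, **options):
        for countdown in Countdown.objects.all():
            countdown.save()  # triggers the signal that sends the webhook message

And the matching crontab entry (paths are placeholders):

0 12 * * * /path/to/venv/bin/python /path/to/project/manage.py post_daily_countdown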

Huey not calling tasks in Django

I have a Django rest framework app that calls 2 huey tasks in succession in a serializer create method like so:
...
def create(self, validated_data):
    user = self.context['request'].user
    player_ids = validated_data.get('players', [])
    game = Game.objects.create()
    tasks.make_players_friends_task(player_ids)
    tasks.send_notification_task(user.id, game.id)
    return game

# tasks.py
@db_task()
def make_players_friends_task(ids):
    players = User.objects.filter(id__in=ids)
    # process players

@db_task()
def send_notification_task(user_id, game_id):
    user = User.objects.get(id=user_id)
    game = Game.objects.get(id=game_id)
    # send notifications
When running the huey process in the terminal, I can see that when I hit this endpoint only one or the other of the tasks is ever called, but never both. I am running huey with the default settings (Redis with one worker thread).
If I alter the code so that I am passing in the objects themselves as parameters, rather than the ids, and remove the Django queries inside the @db_task methods, things seem to work all right.
The reason I initially used the ids as parameters is that I assumed (or read somewhere) that huey uses JSON serialization by default, but after looking into it, pickle is actually the default serializer.
One theory is that since I am only running one worker, and also have a @db_periodic_task method in the app, the process can only either listen for tasks or execute them at any given time, but not both. This is the way Celery seems to work, where you need a separate process each for the scheduler and the worker, but this isn't mentioned in huey's documentation.
If you run the huey consumer it will actually spawn a separate scheduler alongside the number of workers you've specified, so that's not going to be your problem.
You're not giving enough information to properly see what's going wrong, so check the following:
If you run the huey consumer in the terminal, observe whether all your tasks show up as properly registered, so that the consumer is actually capable of consuming them.
Check whether your Redis process is running.
Try performing the tasks with a blocking call to see which task fails:
task_result = tasks.make_players_friends_task(player_ids)
task_result.get(blocking=True)
task_result = tasks.send_notification_task(user.id, game.id)
task_result.get(blocking=True)
Do this with a debugger or print statements to see whether it makes it to the end of your function or where it gets stuck.
Make sure to always restart your consumer when you change code. It doesn't automatically pick up new code the way the Django dev server does. The fact that your code works as intended when pickling whole objects instead of passing ids could point to this, as it would be really weird for the ids to break it. On the other hand, you shouldn't pass Django ORM objects to tasks; your id approach makes far more sense.
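For reference, with huey's Django integration the consumer is typically started through the run_huey management command, so "restarting the consumer" just means stopping that process and starting it again:

python manage.py run_huey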

djcelery PeriodicTasks.changed method - update query running slow

I have been using celery (3.1) and django-celery (3.1.17, i.e. djcelery) with Django (1.9) for scheduling tasks. Recently I noticed that one update query to the PeriodicTasks table takes a lot of time (around 1 minute), which slows down the whole task-scheduling process. The update query on the PeriodicTasks model is issued by pre_save and pre_delete signals on the PeriodicTask model. Thus, whenever a new object gets added to the PeriodicTask table, the signal fires and calls the PeriodicTasks.changed method, which simply updates the same PeriodicTasks object over and over again.
I am assuming the update query is slow because millions of tasks get published in a very short time, and for each task the same PeriodicTasks object gets updated.
The question is (if I don't want to upgrade to celery 4.x): what is the purpose of this update of the PeriodicTasks object? If I remove the signal call, will it affect the functionality of django-celery in any way?
Also, just to mention, celery 4 has the same signal handlers on pre_save and pre_delete of the PeriodicTask model, so upgrading to celery 4 would not help me either.
Update
I experimented by commenting out the two signal calls in djcelery/models.py and everything works well. Now the question is: what is the purpose of this update call? Why are the signals used to update the same instance of PeriodicTasks?

Django with Celery - existing object not found

I am having a problem with executing a celery task from another celery task.
Here is the problematic snippet (the data object already exists in the database; its attributes are just updated inside the finalize_data function):
def finalize_data(data):
    data = update_statistics(data)
    data.save()

    from apps.datas.tasks import optimize_data
    optimize_data.delay(data.pk)

@shared_task
def optimize_data(data_pk):
    data = Data.objects.get(pk=data_pk)
    # Do something with data
The get call in the optimize_data function fails with "Data matching query does not exist."
If I call the retrieve-by-pk code inside finalize_data it works fine. It also works fine if I delay the celery task call for some time.
This line:
optimize_data.apply_async((data.pk,), countdown=10)
instead of
optimize_data.delay(data.pk)
works fine. But I don't want to use hacks in my code. Is it possible that the .save() call is asynchronously blocking access to that row/object?
I know that this is an old post but I stumbled on this problem today. Lee's answer pointed me to the correct direction but I think a better solution exists today.
Using the on_commit handler provided by Django, this problem can be solved without the hackish countdowns in the code, which might not make it obvious to the reader why they exist.
I'm not sure if this existed when the question was posted but I'm just posting the answer so that people who come here in the future know about the alternative.
I'm guessing your caller is inside a transaction that hasn't committed before celery starts to process the task. Hence celery can't find the record. That is why adding a countdown makes it work.
A 1 second countdown will probably work as well as the 10 second one in your example. I've used 1 second countdowns throughout code to deal with this issue.
Another solution is to stop using transactions.
You could use an on_commit hook to make sure the celery task isn't triggered until after the transaction commits?
DjangoDocs#performing-actions-after-commit
It's a feature that was added in Django 1.9.
from django.db import transaction

def do_something():
    pass  # send a mail, invalidate a cache, fire off a Celery task, etc.

transaction.on_commit(do_something)
You can also wrap your function in a lambda:
transaction.on_commit(lambda: some_celery_task.delay('arg1'))
The function you pass in will be called immediately after a hypothetical database write made where on_commit() is called would be successfully committed.
If you call on_commit() while there isn’t an active transaction, the callback will be executed immediately.
If that hypothetical database write is instead rolled back (typically when an unhandled exception is raised in an atomic() block), your function will be discarded and never called.
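Applied to the snippet from the question, this would look roughly as follows (a sketch; it assumes finalize_data runs inside the transaction that creates or updates data):

from django.db import transaction

def finalize_data(data):
    data = update_statistics(data)
    data.save()

    from apps.datas.tasks import optimize_data
    # Enqueue the task only once the surrounding transaction has committed,
    # so the worker is guaranteed to find the row
    transaction.on_commit(lambda: optimize_data.delay(data.pk))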

Set / get objects with memcached

In a Django Python app, I launch jobs with Celery (a task manager). When each job is launched, it returns an object (let's call it an instance of class X) that lets you check on the job and retrieve the return value or the errors thrown.
Several people (someday, I hope) will be able to use this web interface at the same time; therefore, several instances of class X may exist at the same time, each corresponding to a job that is queued or running in parallel. It's difficult to come up with a way to hold onto these X objects, because I cannot use a global variable (a dictionary that allows me to look up each X object from a key): Celery uses different processes, not just different threads, so each would modify its own copy of the global table, causing mayhem.
Subsequently, I received the great advice to use memcached to share the memory across the tasks. I got it working and was able to set and get integer and string values between processes.
The trouble is this: after a great deal of debugging today, I learned that memcached's set and get don't seem to work for classes. This is my best guess: perhaps under the hood memcached serializes objects into the shared memory; class X (understandably) cannot be serialized because it points at live data (the status of the job), and so the serialized version may be out of date (i.e. it may point to the wrong place) when it is loaded again.
Attempts to use an SQLite database were similarly fruitless: not only could I not figure out how to serialize objects as database fields (using my Django models.py file), I would be stuck with the same problem: the handles of the launched jobs need to stay in RAM somehow (or use some fancy OS tricks underneath) so that they update as the jobs finish or fail.
My best guess is that (despite the advice that thankfully got me this far) I should be launching each job in some external queue (for instance Sun/Oracle Grid Engine). However, I couldn't come up with a good way of doing that without using a system call, which I thought may be bad style (and potentially insecure).
How do you keep track of jobs that you launch in Django or Django Celery? Do you launch them by simply putting the job arguments into a database and then have another job that polls the database and runs jobs?
Thanks a lot for your help, I'm quite lost.
I think django-celery does this work for you. Did you have a look at the tables created by django-celery? E.g. djcelery_taskstate holds all data for a given task, like state, worker_id and so on. For periodic tasks there is a table called djcelery_periodictask.
In a Django view you can access the TaskMeta object:
from djcelery.models import TaskMeta
task = TaskMeta.objects.get(task_id=task_id)
print task.status
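If you only have the task id (for example, stored in your own table or in the session), plain Celery exposes the same information through AsyncResult; a short sketch in the same Python 2 style as the snippet above:

from celery.result import AsyncResult

result = AsyncResult(task_id)   # task_id recorded when the job was launched
print result.status             # PENDING, STARTED, SUCCESS, FAILURE, ...
if result.ready():
    print result.result         # return value, or the exception instance on failure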
