Should django model object instances be passed to celery? - python

# models.py
from django.db import models

class Person(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)
    text_blob = models.CharField(max_length=50000)

# tasks.py
import celery

@celery.task
def my_task(person):
    # example operation: does something to person
    # needs only a few of the attributes of person
    # and not the entire bulky record
    person.first_name = person.first_name.title()
    person.last_name = person.last_name.title()
    person.save()
In my application somewhere I have something like:
from models import Person
from tasks import my_task
import celery
g = celery.group([my_task.s(p) for p in Person.objects.all()])
g.apply_async()
Celery pickles p to send it to the worker, right?
If the workers are running on multiple machines, would the entire person object (along with the bulky text_blob which is primarily not required) be transmitted over the network? Is there a way to avoid it?
How can I efficiently and evenly distribute the Person records to workers running on multiple machines?
Could this be a better idea? Wouldn't it overwhelm the db if Person has a few million records?
# tasks.py
import celery
from models import Person

@celery.task
def my_task(person_pk):
    # example operation that does not need text_blob
    person = Person.objects.get(pk=person_pk)
    person.first_name = person.first_name.title()
    person.last_name = person.last_name.title()
    person.save()

# In my application somewhere
from models import Person
from tasks import my_task
import celery

g = celery.group([my_task.s(p.pk) for p in Person.objects.all()])
g.apply_async()
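If even loading the full Person rows on the dispatching side is a concern, the same dispatch can be written so that only primary keys are ever pulled from the database (an illustrative sketch; values_list and iterator are standard queryset methods):

# Sketch only: pull just the primary keys, so neither the full rows nor the
# bulky text_blob are ever loaded on the dispatching side.
from models import Person
from tasks import my_task
import celery

pks = Person.objects.values_list('pk', flat=True).iterator()
g = celery.group(my_task.s(pk) for pk in pks)
g.apply_async()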

I believe it is better and safer to pass the PK rather than the whole model object. Since the PK is just a number, serialization is also much simpler. Most importantly, you can use a safer serializer (JSON/YAML instead of pickle) and have peace of mind that you won't have any problems serializing your model.
As this article says:
Since Celery is a distributed system, you can't know in which process, or even on what machine, the task will run. So you shouldn't pass Django model objects as arguments to tasks; it's almost always better to re-fetch the object from the database instead, as there are possible race conditions involved.
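For example, switching away from pickle is just configuration. A minimal sketch, assuming a Celery app instance named app and Celery 4+ lowercase setting names:

# Only JSON goes over the wire: a plain integer pk serializes trivially,
# while a full model instance is rejected at dispatch time.
app.conf.update(
    task_serializer='json',
    result_serializer='json',
    accept_content=['json'],
)

With that in place, an attempt to pass a model instance fails loudly at .delay()/.apply_async() time instead of silently pickling the whole record.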

Yes. If there are millions of records in the database then this probably isn't the best approach, but since you have to go through many millions of records anyway, pretty much no matter what you do your DB is going to get hit pretty hard.
Here are some alternatives, none of which I'd call "better", just different.
Implement a pre_save signal handler for your Person class that does the .title() stuff (a sketch follows after these alternatives). That way your first_name/last_name will always get stored correctly in the db and you'll not have to do this again.
Use a management command that takes some kind of paging parameter... perhaps use the first letter of the last name to segment the Persons. So running ./manage.py my_task a would update all the records where the last name starts with "a". Obviously you'd have to run this several times to get through the whole database.
Maybe you can do it with some creative SQL. I'm not even going to attempt it here, but it might be worth investigating.
Keep in mind that the .save() is going to be a harder "hit" on the database than actually selecting the millions of records.
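A minimal sketch of the pre_save idea from the first alternative (using the Person model from the question; the receiver name is arbitrary):

# signals.py
from django.db.models.signals import pre_save
from django.dispatch import receiver

from models import Person

@receiver(pre_save, sender=Person)
def titlecase_names(sender, instance, **kwargs):
    # Normalize on every save, so a one-off batch job is never needed again.
    instance.first_name = instance.first_name.title()
    instance.last_name = instance.last_name.title()

Remember the module has to be imported somewhere (e.g. in the app's ready()) for the receiver to be registered.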

Related

Restore lost tasks Redis python

I am thinking about how to ensure that all tasks stored in Redis queues get completed in the event of a server shutdown, for example.
My initial thought was to create a job-description instance and save it to the database. Something like:
class JobDescription(db.Model):
    id = db.Column(...)
    name = db.Column(...)
    queue_name = db.Column(...)
    is_started = db.Column(db.Boolean)
    is_finished = db.Column(db.Boolean)
    ...
And then updating boolean flags when necessary.
So on Flask/Django/FastAPI application startup I would search for jobs that are either not started or not finished.
My question is - are there any best practices for what I described here or better ways to restore lost jobs rather than saving to db job-descriptions?
You're on the right track. To restore RQ tasks after Redis has restarted, you have to record them so that you can determine which jobs need to be re-queued (delayed).
The JobDescription approach you're using can work fine, with the caveat that as time passes and the underlying table for JobDescription gets larger, it'll take longer to query that table unless you build a secondary index on is_finished.
You might find that using datetimes instead of booleans gives you more options. E.g.
started_at = db.Column(db.DateTime)
finished_at = db.Column(db.DateTime, index=True)
lets you answer several useful questions, such as whether there are interesting patterns in job requests over time, how completion latency varies over time, etc.
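On startup, the re-queue step is then a simple query over the unfinished rows. A rough sketch, assuming Flask-SQLAlchemy to match the db.Model base above, with RQ as the queue; the dotted job path is a placeholder for whatever function actually performs the work:

from redis import Redis
from rq import Queue

def requeue_unfinished():
    # Everything that was recorded but never got a finish timestamp is re-queued.
    unfinished = JobDescription.query.filter(JobDescription.finished_at.is_(None))
    for job in unfinished:
        queue = Queue(job.queue_name, connection=Redis())
        queue.enqueue('myapp.jobs.run_job', job.id)  # placeholder function path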

Huey not calling tasks in Django

I have a Django REST Framework app that calls two huey tasks in succession in a serializer's create method, like so:
...
def create(self, validated_data):
    user = self.context['request'].user
    player_ids = validated_data.get('players', [])
    game = Game.objects.create()
    tasks.make_players_friends_task(player_ids)
    tasks.send_notification_task(user.id, game.id)
    return game

# tasks.py
@db_task()
def make_players_friends_task(ids):
    players = User.objects.filter(id__in=ids)
    # process players

@db_task()
def send_notification_task(user_id, game_id):
    user = User.objects.get(id=user_id)
    game = Game.objects.get(id=game_id)
    # send notifications
When running the huey process in the terminal and hitting this endpoint, I can see that only one or the other of the tasks is ever called, but never both. I am running huey with the default settings (Redis, with one worker thread).
If I alter the code so that I am passing in the objects themselves as parameters rather than the ids, and remove the Django queries in the @db_task methods, things seem to work all right.
The reason I initially used the ids as parameters is because I assumed (or read somewhere) that huey uses json serialization as default, but after looking into it, pickle is actually the default serializer.
One theory is that since I am only running one worker, and also have a @db_periodic_task method in the app, the process can only handle listening for tasks or executing them at any one time, but not both. This is the way Celery seems to work, where you need separate processes for the scheduler and the worker, but this isn't mentioned in huey's documentation.
If you run the huey consumer it will actually spawn a separate scheduler alongside the number of workers you've specified, so that's not going to be your problem.
You're not giving enough information to properly see what's going wrong, so check the following:
If you run the huey consumer in the terminal, observe whether all your tasks show up as properly registered so that the consumer is actually capable of consuming them.
Check whether your redis process is running.
Try performing the tasks with a blocking call to see which task fails:
task_result = tasks.make_players_friends_task(player_ids)
task_result.get(blocking=True)
task_result = tasks.send_notification_task(user.id, game.id)
task_result.get(blocking=True)
Do this with a debugger or print statements to see whether it makes it to the end of your function or where it gets stuck.
Make sure to always restart your consumer when you change code. It doesn't automatically pick up new code like the Django dev server does. The fact that your code works as intended while pickling whole objects instead of passing IDs could point to this, as it would be really weird for that to break it. On the other hand, you shouldn't pass in Django ORM objects; it makes far more sense to use your ID approach.
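For reference, when checking whether the tasks register, a tasks module the consumer can pick up looks like this when huey's Django integration is used (a sketch; the key point is that the decorators come from huey.contrib.djhuey):

# myapp/tasks.py -- decorators must come from huey's Django contrib module
from huey.contrib.djhuey import db_task

@db_task()
def make_players_friends_task(ids):
    ...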

What is the proper way to process user-triggered events in Django?

For example, I have a blog based on Django and I already have several functions for users: login, edit_profile, share.
But now I need to implement a mission system.
A user logs in: award 10 points per day.
A user completes their profile: award 20 points.
A user shares one of my blog posts: award 30 points.
I don't want to mix the reward code with the normal function code, so I decided to use a message queue. Pseudo-code might look like:
@login_required
def edit_profile(request):
    user = request.user
    nickname = ...
    desc = ...
    user.save(...)
    action.send(sender='edit_profile', payload={'user_id': user.id})
    return Response(...)
And the reward code can subscribe to this action:
@receiver('edit_profile')
def edit_profile_reward(payload):
    user_id = payload['user_id']
    user = User.objects.get(id=user_id)
    mission, created = Mission.objects.get_or_create(user=user, type='complete_profile')
    if created:
        user.score += 20
        user.save()
But I don't know if this is the right way. If so, what message queue should I use? django-channels, django-q, or something else?
If not, what is the best practice?
If you are looking for an async queue, you will need a combination of Redis and workers.
One of the most common, and simplest, libraries out there for this is RQ workers.
Implementation is simple, but you will need to run the RQ workers as a separate app.
It also allows you to implement different queues with different priorities. I use these for things like sending emails, or anything that needs to be updated without making the user wait (logs, etc.).
Django-Q is another good solution, with the advantage of being able to use your current database as the queue, but it also works with Redis et al.
Finally, Celery is the granddaddy of them all. You can have scheduled jobs with Celery as well as async jobs. A bit more complex, but a good solution.
Hope this helps...
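A minimal RQ sketch of the reward flow described in the question (queue name and function are made up for illustration; the worker is started separately with the rq command-line tool):

from redis import Redis
from rq import Queue

reward_queue = Queue('rewards', connection=Redis())

def edit_profile_reward(user_id):
    # Runs in the worker process, not in the request/response cycle.
    ...

def edit_profile(request):
    user = request.user
    # ... save the profile as before ...
    reward_queue.enqueue(edit_profile_reward, user.id)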
What you are seeking to do is fairly normal when it comes to queuing tasks with Django, or any Python framework for that matter. While there is no "right" way to do this, I personally would recommend going with Redis. Considering that you will have many users receiving points, this will make your querying really fast.
You can naturally combine this with Celery to build your own stack. Everything will be done in RAM, which is helpful for such repetitive tasks.
You can take a look at Redis for Django over here.
You would essentially need to include this as a caching server in your settings.
In whichever file you implement the queuing, remember to add the following:
from django.core.cache.backends.base import DEFAULT_TIMEOUT
from django.views.decorators.cache import cache_page
I would agree that initially setting this up seems daunting, but trust me on this: it is a great way to queue any task quickly and efficiently. Give it a shot! You will find it extremely useful in all of your projects.
For asynchronous/deferred execution of tasks/jobs you can use
Celery: https://github.com/celery/celery/
Celery with Django: http://docs.celeryproject.org/en/latest/django/first-steps-with-django.html
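Wired up as in the first-steps-with-django guide above, the reward handler from the question maps directly onto a task. A sketch (User and Mission are the models from the question; the import paths are assumed):

# rewards/tasks.py
from celery import shared_task

from myapp.models import Mission, User  # import paths assumed

@shared_task
def edit_profile_reward(user_id):
    user = User.objects.get(id=user_id)
    mission, created = Mission.objects.get_or_create(user=user, type='complete_profile')
    if created:
        user.score += 20
        user.save()

# called from the view instead of action.send(...):
# edit_profile_reward.delay(user.id)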

Django store joined table in DB and update when tables change

In my Django (1.9) project I need to construct a table from an expensive JOIN. I would therefore like to store the table in the DB and only redo the query if the tables involved in the JOIN change. As I need the table as a basis for later JOIN operations I definitely want to store it in my database and not in any cache.
The problem I'm facing is that I'm not sure how to determine whether the data in the tables has changed. Connecting to the post_save and post_delete signals of the respective models doesn't seem right, since the models might be updated in bulk via CSV upload and I don't want the expensive query to be fired each time a new row is imported, because the DB table will change right away. My current approach is to check whether the data has changed at a certain time interval, which would be perfectly fine for me. For this purpose I use a new thread that compares the checksums of the involved tables (see the code below). As I'm not really familiar with multi-threading, especially on web servers, I do not know whether this is acceptable. My questions are therefore:
Is the threading approach acceptable for running this single task?
Would a Distributed Task Queue like Celery be more appropriate?
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
This is my current code:
import threading
import time

from django.apps import apps
from .models import SomeModel

def check_for_table_change():
    app_label = SomeModel._meta.app_label

    def join():
        """Join the tables and save the resulting table to the DB."""
        ...

    def get_involved_models(app_label):
        """Get all the models that are involved in the join."""
        ...

    involved_models = get_involved_models(app_label)
    involved_dbtables = tuple(model._meta.db_table for model in involved_models)
    sql = 'CHECKSUM TABLE %s' % ', '.join(involved_dbtables)
    old_checksums = None
    while True:
        # Get the result of the query as named tuples.
        # (from_db is a helper defined elsewhere that runs raw SQL.)
        checksums = from_db(sql, fetch_as='namedtuple')
        if old_checksums is not None:
            # Compare checksums.
            for pair in zip(checksums, old_checksums):
                if pair[0].Checksum != pair[1].Checksum:
                    print('db changed, table is rejoined')
                    join()
                    break
        old_checksums = checksums
        time.sleep(60)

check_tables_thread = threading.Thread()
check_tables_thread.run = check_for_table_change
check_tables_thread.start()
I'm grateful for any suggestions.
Materialized Views and PostgreSQL
If you were on PostgreSQL, you could have used what's known as a materialized view. You can create a view based on your join and it will exist almost like a real table. This is very different from normal views, where the query needs to be executed each and every time the view is used. Now the bad news: MySQL does not have materialized views.
If you switched to PostgreSQL, you might even find that materialized views are not needed after all. That's because PostgreSQL can use more than one index per table in a query, so the join that seems slow on MySQL at the moment might run faster with better use of indexes on PostgreSQL. Of course, this is very dependent on what your structure is like.
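For illustration, the feature being described boils down to two statements, shown here run through Django's cursor (a sketch with made-up table and view names):

from django.db import connection

with connection.cursor() as cursor:
    # Created once: the join runs and its result is stored like a real table.
    cursor.execute("""
        CREATE MATERIALIZED VIEW person_summary AS
        SELECT p.id, p.first_name, d.details
        FROM person p JOIN person_detail d ON d.person_id = p.id
    """)
    # Re-run whenever the underlying tables are known to have changed.
    cursor.execute("REFRESH MATERIALIZED VIEW person_summary")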
Signals vs Triggers
The problem I'm facing is that I'm not sure how to determine whether the data in the tables has changed. Connecting to the post_save and post_delete signals of the respective models doesn't seem right, since the models might be updated in bulk via CSV upload and I don't want the expensive query to be fired each time a new row is imported, because the DB table will change right away.
As you have rightly determined, Django signals aren't the right way. This is the sort of task that is best done at the database level. Since you don't have materialized views, this is a job for triggers. However, there is a lot of hard work involved (whether you use triggers or signals).
Is the threading approach acceptable for running this single task?
Why not use Django as a CLI here? That effectively means a Django script (e.g. a management command) invoked by cron, or executed by some other mechanism, independently of your website; a sketch follows after these questions and answers.
Would a Distributed Task Queue like Celery be more appropriate?
Very much so. Each time the data changes, you can fire off a task that does the update of the table.
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
Keyword here is 'TRIGGER' :-)
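To make the CLI suggestion concrete, a sketch of a management command (the command and app names are made up) that cron, or equally a Celery beat schedule, could invoke:

# myapp/management/commands/rebuild_join_table.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Compare table checksums and re-run the expensive join if they changed."

    def handle(self, *args, **options):
        # Same checksum comparison as in the question, minus the thread and the
        # while/sleep loop, since cron provides the scheduling.
        ...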
Alternatives.
Having said all that, doing a join and physically populating a table is going to be very, very slow if your table grows to even a few thousand rows. This is because you will need an elaborate query to determine which records have changed (unless you use a separate queue for that). You would then need to insert or update the records in the 'join table'; update/insert is generally slower than retrieval, so as the size of the data grows, this will become progressively worse.
The real solution may be to optimize your queries and/or tables. May I suggest you post a new question with the slow query and also share your table structures?

Trigger Django module on Database update

I want to develop an application that monitors the database for new records and allows me to execute a method in the context of my Django application when a new record is inserted.
I am planning to use an approach where a Celery task checks the database for changes since the last check and triggers the above method.
Is there a better way to achieve this?
I'm using SQLite as the backend and tried apsw's setupdatehook API, but it doesn't seem to run my module in Django context.
NOTE: The updates are made by a different application outside Django.
Create a celery task to do whatever it is you need to do with the object:
tasks.py
from celery.decorators import task

@task()
def foo(object):
    object.do_some_calculation()
Then create a Django signal that is fired every time an instance of your model is saved, queuing up your task in Celery:
models.py
class MyModel(models.Model):
    ...

from django.db.models.signals import post_save
from django.dispatch import receiver
from mymodel import tasks

@receiver(post_save, sender=MyModel)
def queue_task(sender, instance, created, **kwargs):
    tasks.foo.delay(object=instance)
What's important to note is that Django's signals are synchronous; in other words, the queue_task function runs within the request cycle. But all the queue_task function is doing is telling Celery to handle the actual guts of the work (do_some_calculation) in the background.
A better way would be to have the application that modifies the records call yours, or at least write a Celery queue entry, so that you don't have to query the database so often to see if something changed.
But if that is not an option, letting Celery query the database to find out whether something changed is probably the next best option (and surely better than the other possible option of calling a web service from the database as a trigger, which you should really avoid).
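A sketch of that polling task (the model, cache key, and method name are placeholders for whatever table the external application actually writes to and whatever work needs to run in the Django context):

from celery import shared_task
from django.core.cache import cache

from myapp.models import Record  # placeholder for the externally-updated table

@shared_task
def process_new_records():
    # Remember the highest primary key processed so far and only fetch newer rows.
    last_pk = cache.get('last_processed_pk', 0)
    for record in Record.objects.filter(pk__gt=last_pk).order_by('pk'):
        record.do_some_calculation()  # the method to run when a new record appears
        last_pk = record.pk
    cache.set('last_processed_pk', last_pk, None)  # None = keep indefinitely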
