I want to develop an application that monitors the database for new records and allows me to execute a method in the context of my Django application when a new record is inserted.
I am planning to use an approach where a Celery task checks the database for changes since the last check and triggers the above method.
Is there a better way to achieve this?
I'm using SQLite as the backend and tried apsw's setupdatehook API, but it doesn't seem to run my module within the Django context.
NOTE: The updates are made by a different application outside Django.
Create a celery task to do whatever it is you need to do with the object:
tasks.py
from celery.decorators import task

@task()
def foo(object):
    object.do_some_calculation()
Then create a Django signal that is fired every time an instance of your model is saved, queuing up your task in Celery:
models.py
class MyModel(models.Model):
    ...

from django.db.models.signals import post_save
from django.dispatch import receiver
from mymodel import tasks

@receiver(post_save, sender=MyModel)
def queue_task(sender, instance, created, **kwargs):
    tasks.foo.delay(object=instance)
What's important to note is that Django's signals are synchronous; in other words, the queue_task function runs within the request cycle, but all queue_task does is tell Celery to handle the actual guts of the work (do_some_calculation) in the background.
A better way would be to have the application that modifies the records call yours, or at least have it create a Celery queue entry, so that you don't have to query the database so often to see whether something changed.
But if that is not an option, letting Celery query the database to find out whether something changed is probably the next best option (and surely better than the other possibility of calling a web service from a database trigger, which you should really avoid). A rough sketch of such a polling task is below.
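If you do end up polling, this is roughly what such a periodic task could look like. It is purely an illustration: MyModel, PollState, and process_new_record are hypothetical names, and it assumes the table has a monotonically increasing primary key.

# tasks.py -- sketch: poll for rows inserted since the last run
from celery import shared_task

from myapp.models import MyModel, PollState   # hypothetical models


def process_new_record(record):
    # placeholder for the method you want to run in the Django context
    ...


@shared_task
def poll_for_new_records():
    # PollState is assumed to store the last primary key already handled
    state, _ = PollState.objects.get_or_create(name='mymodel', defaults={'last_pk': 0})
    for record in MyModel.objects.filter(pk__gt=state.last_pk).order_by('pk'):
        process_new_record(record)
        state.last_pk = record.pk
    state.save()

Scheduling it with Celery beat (or cron calling a management command) keeps the polling interval in one place.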
I have a Django rest framework app that calls 2 huey tasks in succession in a serializer create method like so:
...
def create(self, validated_data):
    user = self.context['request'].user
    player_ids = validated_data.get('players', [])
    game = Game.objects.create()
    tasks.make_players_friends_task(player_ids)
    tasks.send_notification_task(user.id, game.id)
    return game
# tasks.py
@db_task()
def make_players_friends_task(ids):
    players = User.objects.filter(id__in=ids)
    # process players

@db_task()
def send_notification_task(user_id, game_id):
    user = User.objects.get(id=user_id)
    game = Game.objects.get(id=game_id)
    # send notifications
When running the huey process in the terminal and hitting this endpoint, I can see that only one or the other of the tasks is ever called, but never both. I am running huey with the default settings (Redis with one worker thread).
If I alter the code so that I pass in the objects themselves as parameters, rather than the ids, and remove the Django queries in the @db_task methods, things seem to work all right.
The reason I initially used the ids as parameters is that I assumed (or read somewhere) that huey uses JSON serialization by default, but after looking into it, pickle is actually the default serializer.
One theory is that since I am only running one worker, and also have a @db_periodic_task method in the app, the process can only handle listening for tasks or executing them at any one time, but not both. This is the way Celery seems to work, where you need a separate process each for the scheduler and the worker, but this isn't mentioned in huey's documentation.
If you run the huey consumer it will actually spawn a separate scheduler together with the number of workers you've specified, so that's not going to be your problem.
You're not giving enough information to properly see what's going wrong, so check the following:
If you run the huey consumer in the terminal, observe whether all your tasks show up as properly registered so that the consumer is actually capable of consuming them.
Check whether your redis process is running.
Try performing the tasks with a blocking call to see which task fails:
task_result = tasks.make_players_friends_task(player_ids)
task_result.get(blocking=True)
task_result = tasks.send_notification_task(user.id, game.id)
task_result.get(blocking=True)
Do this with a debugger or print statements to see whether it makes it to the end of your function or where it gets stuck.
Make sure to always restart your consumer when you change code. It doesn't automatically pick up new code like the Django dev server does. The fact that your code works as intended when pickling whole objects instead of passing ids could point to this, as it would be really weird for that to be what breaks it. On the other hand, you shouldn't pass Django ORM objects to tasks; your id approach makes much more sense.
I'm trying to find a way to constantly poll a server every x seconds from the ready() function of Django, basically something which looks like this:
from django.apps import AppConfig

class ApiConfig(AppConfig):
    name = 'api'

    def ready(self):
        import threading
        import time
        from django.conf import settings
        from api import utils

        def refresh_ndt_servers_list():
            while True:
                utils.refresh_servers_list()
                time.sleep(settings.WAIT_SECONDS_SERVER_POLL)

        thread1 = threading.Thread(target=refresh_ndt_servers_list)
        thread1.start()
I just want my utils.refresh_servers_list() to be executed when Django starts/is ready, and then have that same method (which populates my DB) re-executed every settings.WAIT_SECONDS_SERVER_POLL seconds indefinitely. The problem is that if I run python manage.py migrate, the ready() function gets called and never finishes. I would like to avoid calling this function during migrations.
Thanks!
AppConfig.ready() is meant to "... perform initialization tasks ..." and make your app ready to run / serve requests. The actual working logic of the app should run after the Django app is initialized.
For launching a task at regular intervals, a cron job can be used.
Or, set up a periodic Celery task with Celery beat.
Also, the provided task seems to perform updates in the database (good for it to be atomic), so it may be critical that only a single instance of it runs at a time. One cron job, or one Celery task, takes care of that.
However, the next job may still run if the previous one has not yet finished, or one may be launched manually for some reason, so adding some locking logic to the task to check that only one instance is running (or locking the database table for the run) may be desirable; a sketch is below.
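For illustration only, here is a minimal sketch combining a Celery task with a cache-based lock. The task name, lock key, and the assumption that the cache backend supports an atomic add (Redis and memcached do) are mine, not from the question.

# api/tasks.py -- sketch: periodic refresh guarded by a cache lock
from celery import shared_task
from django.core.cache import cache

from api import utils

LOCK_KEY = 'refresh-servers-lock'   # hypothetical lock key
LOCK_TTL = 5 * 60                   # seconds; should outlive the longest expected run


@shared_task
def refresh_servers_list_task():
    # cache.add() only sets the key if it does not already exist, so a second
    # concurrent run returns early instead of touching the DB twice.
    if not cache.add(LOCK_KEY, 'locked', LOCK_TTL):
        return
    try:
        utils.refresh_servers_list()
    finally:
        cache.delete(LOCK_KEY)

The task would then be registered with Celery beat on the interval currently kept in settings.WAIT_SECONDS_SERVER_POLL, e.g. via CELERY_BEAT_SCHEDULE or django-celery-beat, instead of starting a thread in ready().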
I'm planning on using django-celery-results backend to track status and results of Celery tasks.
Is the django-celery-results backend suitable to store status of task while it is running, or only after it has finished?
It's not clear when the TaskResult model is first created (upon task creation, task execution, or completion?)
If it's created upon task creation, will the model status automatically be updated to RUNNING when the task is picked up, if task_track_started option is set?
Can the TaskResult instance be accessed within the task function?
Another question here appears to indicate so, but doesn't mention the task status being updated to RUNNING.
About TaskResult: of course you can access it during task execution, you only need to import it:
from django_celery_results.models import TaskResult
With this you can access the TaskResult model and filter by task_id, task_name, etc.
This is the official model code; you can filter by any of these fields: https://github.com/celery/django-celery-results/blob/master/django_celery_results/models.py
Example:
task = TaskResult.objects.filter(task_name=task_name, status='SUCCESS').first()
In addition, you can store any extra data your task needs in the model's fields. While the task is created and still executing you can also modify the fields, but once the status is 'SUCCESS' you can only read them; the task status itself is saved automatically by default.
Example:
task_result.meta = json.dumps({'help_text': {'date': '2020-11-30'}})
I recommend working with the to_dict() function, where you can see the model's attributes; a sketch of reading the current task's own TaskResult from inside the task is below.
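For completeness, here is a rough sketch of looking up the task's own TaskResult row from inside the task body, assuming the django-db result backend is configured; the task name is made up for illustration.

# tasks.py -- sketch: reading this task's own TaskResult while it runs
from celery import shared_task
from django_celery_results.models import TaskResult


@shared_task(bind=True)
def my_tracked_task(self):
    # With bind=True, self.request.id is the task_id the backend stores.
    # The row may not exist yet at the very start of execution unless
    # task_track_started (CELERY_TASK_TRACK_STARTED) is enabled.
    result_row = TaskResult.objects.filter(task_id=self.request.id).first()
    if result_row is not None:
        print(result_row.status, result_row.to_dict())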
The backend is configured in the settings module as:
CELERY_RESULT_BACKEND = 'django-db' # in this case it is django DB
If you configured the Django DB as the backend then you can import it as:
from django_celery_results.models import TaskResult
I have to schedule a job using the schedule module in my Django web application.

def new_job(request):
    print("I'm working...")
    file = schedulesdb.objects.filter(user=request.user, f_name__icontains="mp4").last()
    file_initiated = str(file.f_name)
    os.startfile(file_initiated)

I need to run it at a time read from the DB, something like:

GIVEN_DATETIME = schedulesdb.objects.datetimes('request_time', 'second').last()
schedule.GIVEN_DATETIME.do(job)
Django is a web framework. It receives a request, does whatever processing is necessary and sends out a response. It doesn't have any persistent process that could keep track of time and run scheduled tasks, so there is no good way to do it using just Django.
That said, Celery (http://www.celeryproject.org/) is a python framework specifically built to run tasks, both scheduled and on-demand. It also integrates with Django ORM with minimal configuration. I suggest you look into it.
You could, of course, write your own external script that uses the schedule module you mentioned. You would need to implement a way to write schedule objects into the database, and then your script could read and execute them. Is your "schedulesdb" model already implemented? A rough sketch of such a script follows.
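Something along these lines, purely as a sketch: it assumes the schedulesdb model stores a request_time and the file name, and the settings path, app path, and daily-recurrence behaviour are my assumptions, not yours.

# run_scheduler.py -- standalone script, run outside the web process (sketch)
import os
import time

import django
import schedule

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # hypothetical settings path
django.setup()

from myapp.models import schedulesdb  # hypothetical app path


def job(entry):
    os.startfile(str(entry.f_name))    # Windows-only, as in the question


def load_jobs():
    # schedule only supports recurring rules, so here each DB row becomes a
    # daily job at the stored time of day; one-off datetimes would need extra
    # logic to cancel the job after it has run once.
    for entry in schedulesdb.objects.all():
        run_at = entry.request_time.strftime('%H:%M')
        schedule.every().day.at(run_at).do(job, entry)


if __name__ == '__main__':
    load_jobs()
    while True:
        schedule.run_pending()
        time.sleep(1)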
# models.py
from django.db import models

class Person(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)
    text_blob = models.CharField(max_length=50000)

# tasks.py
import celery

@celery.task
def my_task(person):
    # example operation: does something to person
    # needs only a few of the attributes of person
    # and not the entire bulky record
    person.first_name = person.first_name.title()
    person.last_name = person.last_name.title()
    person.save()
In my application somewhere I have something like:
from models import Person
from tasks import my_task
import celery
g = celery.group([my_task.s(p) for p in Person.objects.all()])
g.apply_async()
Celery pickles p to send it to the worker, right?
If the workers are running on multiple machines, would the entire person object (along with the bulky text_blob which is primarily not required) be transmitted over the network? Is there a way to avoid it?
How can I efficiently and evenly distribute the Person records to workers running on multiple machines?
Could this be a better idea? Wouldn't it overwhelm the db if Person has a few million records?
# tasks.py
import celery
from models import Person

@celery.task
def my_task(person_pk):
    # example operation that does not need text_blob
    person = Person.objects.get(pk=person_pk)
    person.first_name = person.first_name.title()
    person.last_name = person.last_name.title()
    person.save()

# In my application somewhere
from models import Person
from tasks import my_task
import celery

g = celery.group([my_task.s(p.pk) for p in Person.objects.all()])
g.apply_async()
I believe it is better and safer to pass the PK rather than the whole model object. Since a PK is just a number, serialization is also much simpler. Most importantly, you can use a safer serializer (JSON/YAML instead of pickle) and have peace of mind that you won't have any problems serializing your model.
As this article says:
Since Celery is a distributed system, you can't know in which process, or even on what machine, the task will run. So you shouldn't pass Django model objects as arguments to tasks; it's almost always better to re-fetch the object from the database instead, as there are possible race conditions involved.
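As a side note, switching to the JSON serializer is just configuration. A minimal sketch, assuming a standard Celery app instance (JSON is already the default in newer Celery versions, so this mainly matters on older setups):

# celery_app.py -- sketch: force JSON serialization instead of pickle
from celery import Celery

app = Celery('myproject')                 # hypothetical project name
app.conf.task_serializer = 'json'
app.conf.result_serializer = 'json'
app.conf.accept_content = ['json']        # workers reject non-JSON payloads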
Yes. If there are millions of records in the database then this probably isn't the best approach, but since you have to go through all of those millions of records, then pretty much no matter what you do, your DB is going to get hit pretty hard.
Here are some alternatives, none of which I'd call "better", just different.
Implement a pre_save signal handler for your Person class that does the .title() stuff (sketched after this list). That way your first_name/last_name will always get stored correctly in the DB and you'll never have to do this again.
Use a management command that takes some kind of paging parameter...perhaps use the first letter of the last name to segment the Persons. So running ./manage.py my_task a would update all the records where the last name starts with "a". Obviously you'd have to run this several times to get through the whole database
Maybe you can do it with some creative sql. I'm not even going to attempt here, but it might be worth investigating.
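A minimal sketch of the pre_save option from the first bullet, purely illustrative and following the import style used above:

# signals.py -- sketch: normalize names before every save
from django.db.models.signals import pre_save
from django.dispatch import receiver

from models import Person


@receiver(pre_save, sender=Person)
def titlecase_names(sender, instance, **kwargs):
    # runs on every save, so new and updated rows are always stored title-cased
    instance.first_name = instance.first_name.title()
    instance.last_name = instance.last_name.title()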
Keep in mind that the .save() is going to be a harder "hit" on the database than actually selecting the millions of records.