Let's say I have two Models in Django:
Book:
class Book(models.Model):
    title = models.CharField(max_length=100, blank=False)
    number_of_readers = models.PositiveIntegerField(default=0)
Reader:
class Reader(models.Model):
    book = models.ForeignKey(Book)
    name_of_reader = models.CharField(max_length=100, blank=False)
Every time I add a new Reader to the database I want to increase number_of_readers in the Book model by 1. For performance reasons, I do not want to dynamically count the Reader rows related to a particular Book.
Where would be the best place to increase the number_of_readers field? In the Serializer or in the Model? And what method shall I use? Should I override .save in the Model? Or something else in the Serializer?
Even better if someone could provide a full blown example on how to modify the Book table when doing a post of a new Reader.
Thanks.
I wouldn't do this on the REST API level, I'd do it on the model level, because then the +1 increase will always happen, regardless of where it happened (not only when you hit a particular REST view/serializer)
Django signals
Everytime I add a new Reader to the database I want to increase number_of_readers in the Book model by 1
I'd implement a post_save signal that triggers when a model (Reader) is created
That signal has a parameter called created, which is True when the model instance is created. This makes it more convenient than overriding Model.save().
Example outline
from django.db.models.signals import post_save

def my_callback(sender, instance, created, **kwargs):
    if created:  # True only on the first save of this Reader
        reader = instance
        book = reader.book
        book.number_of_readers += 1  # prone to race condition, more on that below
        book.save(update_fields=['number_of_readers'])  # update_fields takes a list of field names

post_save.connect(my_callback, sender=your.models.Reader)
https://docs.djangoproject.com/en/1.8/ref/signals/#django.db.models.signals.post_save
Race Conditions
In the above code snippet, if you'd like to avoid a race condition (which can happen when many threads update the same counter), you can replace the book.number_of_readers += 1 part with the F expression F('number_of_readers') + 1, which performs the read and write at the database level instead of in Python:
    from django.db.models import F
    book.number_of_readers = F('number_of_readers') + 1
    book.save(update_fields=['number_of_readers'])
more on that here: https://docs.djangoproject.com/en/1.8/ref/models/expressions/#avoiding-race-conditions-using-f
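To see why the F expression matters, here is a minimal sketch that uses Python's built-in sqlite3 module instead of Django, so it runs on its own. The table layout and the function names are made up for illustration. It simulates two writers that both read the counter before either one saves; the second function lets the database do the arithmetic, which is roughly what F('number_of_readers') + 1 asks for.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one book whose counter is 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, number_of_readers INTEGER NOT NULL)")
    conn.execute("INSERT INTO book (id, number_of_readers) VALUES (1, 0)")
    conn.commit()
    return conn


def read_counter(conn, book_id=1):
    """Return the stored counter for one book."""
    row = conn.execute("SELECT number_of_readers FROM book WHERE id = ?", (book_id,)).fetchone()
    return row[0]


def racy_increment_twice(conn):
    """Both writers read first, then both write: one increment is lost."""
    seen_by_a = read_counter(conn)
    seen_by_b = read_counter(conn)  # B reads before A has written
    conn.execute("UPDATE book SET number_of_readers = ? WHERE id = 1", (seen_by_a + 1,))
    conn.execute("UPDATE book SET number_of_readers = ? WHERE id = 1", (seen_by_b + 1,))
    conn.commit()
    return read_counter(conn)


def atomic_increment_twice(conn):
    """Each writer lets the database add 1 to whatever value is current."""
    for _ in range(2):
        conn.execute("UPDATE book SET number_of_readers = number_of_readers + 1 WHERE id = 1")
    conn.commit()
    return read_counter(conn)
```

Calling racy_increment_twice(make_db()) leaves the counter at 1 even though two readers were added, while atomic_increment_twice(make_db()) leaves it at 2.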
There is a post_delete signal too, to reverse the counter, if you ever think of supporting "unreading" a book :)
Batch or periodic updates
If you wish to have batch imports of readers, or need to periodically update (or "reflow") the reader counts (e.g. once a week), you can, in addition to the above, implement a function that recounts the readers and updates Book.number_of_readers.
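A recount of this kind could look roughly like the sketch below. It is written against plain sqlite3 rather than the Django ORM so that it is self-contained, and the table and column names (book, reader, book_id) are assumptions that mirror the models in the question.

```python
import sqlite3


def recount_readers(conn):
    """Overwrite every book's counter with the real number of reader rows."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            """
            UPDATE book
            SET number_of_readers = (
                SELECT COUNT(*) FROM reader WHERE reader.book_id = book.id
            )
            """
        )


def demo_recount():
    """Build a tiny database with a drifted counter and repair it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE book (id INTEGER PRIMARY KEY, number_of_readers INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE reader (id INTEGER PRIMARY KEY, book_id INTEGER NOT NULL, name TEXT NOT NULL);
        INSERT INTO book (id, number_of_readers) VALUES (1, 99), (2, 0);
        INSERT INTO reader (book_id, name) VALUES (1, 'ann'), (1, 'bob');
        """
    )
    recount_readers(conn)
    return conn.execute("SELECT id, number_of_readers FROM book ORDER BY id").fetchall()
```

Because the counter is recomputed from the reader rows, running this periodically also repairs any drift left behind by bulk imports that bypassed the signal.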
It depends on the design of your app and particularly on where you will reuse this logic.
For example, if you want the same logic for adding Readers everywhere in your app, do it in a signal, as bakkal suggests or in save. If it depends on the API endpoint, you might want to do it in a view.
It will also depend on whether you are doing bulk inserts of readers: if you do it in save or in a pre_save/post_save signal, it will not run for bulk inserts (bulk_create does not call save or send those signals), so it would be better to do it in the QuerySet's create and bulk_create methods.
From a performance point of view, you might want to use F expressions, no matter where you do it:
book.number_of_readers = F('number_of_readers') + added_readers_count
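As a rough illustration of the bulk case, the sketch below inserts a batch of readers and then bumps each book's counter once, by the number of readers added for that book (the added_readers_count above). It uses plain sqlite3 so it can run standalone; the schema and the helper name add_readers_in_bulk are assumptions, not part of Django.

```python
import sqlite3
from collections import Counter


def add_readers_in_bulk(conn, rows):
    """Insert (book_id, name) rows and bump each book's counter once per batch."""
    added_per_book = Counter(book_id for book_id, _name in rows)
    with conn:  # one transaction for the inserts and the counter updates
        conn.executemany("INSERT INTO reader (book_id, name) VALUES (?, ?)", rows)
        conn.executemany(
            "UPDATE book SET number_of_readers = number_of_readers + ? WHERE id = ?",
            [(count, book_id) for book_id, count in added_per_book.items()],
        )


def demo_bulk_add():
    """Add three readers across two books and return the resulting counters."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE book (id INTEGER PRIMARY KEY, number_of_readers INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE reader (id INTEGER PRIMARY KEY, book_id INTEGER NOT NULL, name TEXT NOT NULL);
        INSERT INTO book (id) VALUES (1), (2);
        """
    )
    add_readers_in_bulk(conn, [(1, "ann"), (1, "bob"), (2, "cy")])
    return conn.execute("SELECT id, number_of_readers FROM book ORDER BY id").fetchall()
```

The counter update is still done by the database (number_of_readers + ?), so it keeps the race-condition protection of the F expression while issuing one update per book instead of one per reader.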
I have a model Student with a manager StudentManager, as given below. A property computes the last date by adding college_duration to join_date. Accessing this property works well, but using it in StudentManager gives an error. How do I write a manager class that computes a field on the fly from model fields and uses it to filter records?
The computed field is not among the model fields. Still, I want to use it as a filter criterion.
class StudentManager(models.Manager):
    def passed_students(self):
        return self.filter(college_end_date__lt=timezone.now())

class Student(models.Model):
    join_date = models.DateTimeField(auto_now_add=True)
    college_duration = models.IntegerField(default=4)
    objects = StudentManager()

    @property
    def college_end_date(self):
        last_date = self.join_date + timezone.timedelta(days=self.college_duration)
        return last_date
This is the error Django gives when I try to access Student.objects.passed_students():
django.core.exceptions.FieldError: Cannot resolve keyword 'college_end_date' into field. Choices are: join_date, college_duration
Q 1. How are alias queries done in the Django ORM?
By using annotate(...) or, if you're using the value only as a filter, alias(...) (new in Django 3.2).
Q 2. Why can't a property be accessed in Django managers?
Because the model managers (more accurately, the QuerySets) wrap operations that are performed in the database. You can think of model managers as a high-level database wrapper.
But the property college_end_date is only defined in your model class and the database is not aware of it, hence the error.
Q 3. How do I write a manager that filters records based on a field which is not in the model, but can be calculated from fields present in the model?
Using the annotate(...) method is the proper Django way of doing so. As a side note, complex property logic cannot always be re-created with the annotate(...) method.
In your case, I would change the college_duration field from IntegerField(...) to DurationField(...), since it makes more sense (to me).
Then update your manager and the property as follows:
from django.db import models
from django.utils import timezone

class StudentManager(models.Manager):
    def passed_students(self):
        default_qs = self.get_queryset()
        college_end = models.ExpressionWrapper(
            models.F('join_date') + models.F('college_duration'),
            output_field=models.DateField()
        )
        return default_qs \
            .annotate(college_end=college_end) \
            .filter(college_end__lt=timezone.now().date())

class Student(models.Model):
    join_date = models.DateTimeField()
    college_duration = models.DurationField()
    objects = StudentManager()

    @property
    def college_end_date(self):
        # return date by summing the datetime and timedelta objects
        return (self.join_date + self.college_duration).date()
Note:
DurationField(...) will work as expected in PostgreSQL and this implementation will work as-is in PSQL. You may have problems if you are using any other databases, if so, you may need to have a "database function" which operates over the datetime and duration datasets corresponding to your specific database.
Personally, I like this solution,
To quote @Willem Van Olsem's comment:
You don't. The database does not know anything about properties, etc. So it can not filter on this. You can make use of .annotate(..) to move the logic to the database side.
You can either do what he suggested, or make it a real model field that is calculated automatically on save.
from datetime import timedelta

class StudentManager(models.Manager):
    def passed_students(self):
        return self.filter(college_end_date__lt=timezone.now())

class Student(models.Model):
    join_date = models.DateTimeField(auto_now_add=True)
    college_duration = models.IntegerField(default=4)
    college_end_date = models.DateTimeField()
    objects = StudentManager()

    def save(self, *args, **kwargs):
        if not self.college_end_date:
            # auto_now_add fills join_date only while saving, so it is still None on a new instance
            start = self.join_date or timezone.now()
            self.college_end_date = start + timedelta(days=self.college_duration)
        return super().save(*args, **kwargs)
Now you can search it in the database.
NOTE: This sort of thing is best to do from the start on data you KNOW you're going to want to filter. If you have pre-existing data, you'll need to re-save all existing instances.
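For pre-existing data, the backfill could also be done in one statement instead of re-saving every instance. The sketch below shows the idea with plain sqlite3 so it runs standalone; the student table layout and the text date format are assumptions, and the date arithmetic syntax (datetime with a '+N days' modifier) is specific to SQLite.

```python
import sqlite3


def backfill_end_dates(conn):
    """Fill college_end_date for rows that were saved before the column existed."""
    with conn:
        cur = conn.execute(
            """
            UPDATE student
            SET college_end_date = datetime(join_date, '+' || college_duration || ' days')
            WHERE college_end_date IS NULL
            """
        )
    return cur.rowcount  # number of rows that were backfilled


def demo_backfill():
    """One row needs a backfill, the other already has an end date."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE student (
            id INTEGER PRIMARY KEY,
            join_date TEXT NOT NULL,
            college_duration INTEGER NOT NULL,
            college_end_date TEXT
        );
        INSERT INTO student (join_date, college_duration, college_end_date) VALUES
            ('2020-01-01 00:00:00', 4, NULL),
            ('2020-01-01 00:00:00', 4, '2020-02-01 00:00:00');
        """
    )
    updated = backfill_end_dates(conn)
    dates = [row[0] for row in conn.execute("SELECT college_end_date FROM student ORDER BY id")]
    return updated, dates
```

Rows that already have an end date are left untouched, so the backfill can be run more than once without harm.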
Problem
You’re attempting to query on a column that doesn’t exist in the database. The Django ORM also doesn’t recognize a property as a field it can query.
Solution
The direct answer to your question would be to create annotations, which can then be filtered on. However, I would reconsider your table design for Student, as it introduces unnecessary complexity and maintenance overhead.
There’s much more framework and database support for the start date, end date idiom than there is for start date, timedelta.
Instead of storing duration, store end_date and calculate duration in a model method. This not only makes more sense, as students are generally given a start date and an estimated graduation date rather than a duration, but it also makes queries like these much easier.
Example
Querying which students are graduating in 2020.
Student.objects.filter(end_date__year=2020)
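The "store both dates, derive the duration" design can be sketched in plain Python as follows. StudentRecord and graduating_in are hypothetical stand-ins for the model and for the end_date__year filter, used only so the example runs without a database.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class StudentRecord:
    """Plain-Python stand-in for a model that stores both dates."""
    join_date: date
    end_date: date

    @property
    def duration(self) -> timedelta:
        # Derived on demand instead of being stored
        return self.end_date - self.join_date


def graduating_in(students, year):
    """Rough equivalent of filtering on end_date__year."""
    return [s for s in students if s.end_date.year == year]
```

The stored columns are exactly the ones you filter on, and the derived value only exists in Python, which is the reverse of the original design.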
So I have a queryset to update
stories = Story.objects.filter(introtext="")
for story in stories:
    # just set it to the first 'sentence'
    story.introtext = story.content[0:(story.content.find('.'))] + ".</p>"
    story.save()
And the save() operation completely kills performance. In the process list there are multiple entries for "./manage.py shell" (yes, I ran this through the Django shell).
However, in the past I've run scripts that didn't need to use save(), as they were changing a many-to-many field. These scripts were very performant.
My project has this code, which could be relevant to why these scripts were so good.
@receiver(signals.m2m_changed, sender=Story.tags.through)
def save_story(sender, instance, action, reverse, model, pk_set, **kwargs):
    instance.save()
What is the best way to update a large queryset (10000+) efficiently?
Since the new introtext value depends on the content field of each object, a simple update() with a constant value won't work. But you can speed up saving the individual objects by wrapping the loop in a transaction:
from django.db import transaction

with transaction.atomic():
    stories = Story.objects.filter(introtext='')
    for story in stories:
        introtext = story.content[0:(story.content.find('.'))] + ".</p>"
        Story.objects.filter(pk=story.pk).update(introtext=introtext)
transaction.atomic() can increase speed by an order of magnitude, because all the updates are committed once instead of one by one.
The filter(pk=story.pk).update() trick avoids the pre_save/post_save signals that a plain save() emits. This is the officially recommended method of updating a single field of an object.
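The same pattern (compute the value in Python, write all rows inside one transaction) can be tried outside Django with the sketch below. It uses sqlite3 so it is self-contained; the story table and both function names are assumptions. It also guards the case where the content has no period, which the original slice does not handle.

```python
import sqlite3


def first_sentence_intro(content):
    """Return the text up to the first period, closed with '.</p>'.

    Unlike the slice in the question, content without a period is kept whole
    instead of losing its last character (str.find returns -1 in that case).
    """
    end = content.find(".")
    head = content if end == -1 else content[:end]
    return head + ".</p>"


def fill_missing_introtext(conn):
    """Update every story with an empty introtext inside one transaction."""
    rows = conn.execute("SELECT id, content FROM story WHERE introtext = ''").fetchall()
    updates = [(first_sentence_intro(content), story_id) for story_id, content in rows]
    with conn:  # a single commit for the whole batch
        conn.executemany("UPDATE story SET introtext = ? WHERE id = ?", updates)
    return len(updates)
```

fill_missing_introtext returns the number of stories it updated, which is handy for logging when this runs from a shell or a management command.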
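You can use the built-in update method on a queryset.
Example:
MyModel.objects.all().update(color='red')
In your case, the new value comes from the instance's own fields, so it has to be expressed in a way the database can evaluate. Python string methods such as find cannot be called on an F() object, so the substring has to be built with database functions. A sketch, assuming a reasonably recent Django version (StrIndex was added in Django 2.0):
from django.db.models import Value
from django.db.models.functions import Concat, StrIndex, Substr

# only rows that contain a period, so the substring length is never negative
stories = Story.objects.filter(introtext__exact='', content__contains='.')
first_sentence = Substr('content', 1, StrIndex('content', Value('.')) - 1)
stories.update(introtext=Concat(first_sentence, Value('.</p>')))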
I am using appengine with python 2.7 and webapp2 framework. I am not using ndb.model.
I have the following model:
class Story(db.Model):
    name = db.StringProperty()

class UserProfile(db.Model):
    name = db.StringProperty()
    user = db.UserProperty()

class Tracking(db.Model):
    user_profile = db.ReferenceProperty(UserProfile)
    story = db.ReferenceProperty(Story)
    upvoted = db.BooleanProperty()
    flagged = db.BooleanProperty()
A user can upvote and/or flag a story but only once. Hence I came up with the above model.
Now, when a user clicks on the upvote link, I check in the database whether the user has already voted on the story. To do that I try the following:
get the user instance with his id as up = db.get(db.Key.from_path('UserProfile', uid))
then get the story instance as follows s_ins = db.get(db.Key.from_path('Story', uid))
Now it is the turn to check if a Tracking based on these two exist, if yes then don't allow voting, else allow him to vote and update the Tracking instance.
What is the most convenient way to fetch a Tracking instance given an id(db.key().id()) of user_profile and story?
What is the most convenient way to save a Tracking model given a user profile id and a story id?
Is there a better way to implement tracking?
You can try tracking using lists of keys versus having a separate entry for track/user/story:
class Story(db.Model):
    name = db.StringProperty()

class UserProfile(db.Model):
    name = db.StringProperty()
    user = db.UserProperty()

class Tracking(db.Model):
    story = db.ReferenceProperty(Story)
    upvoted = db.ListProperty(db.Key)
    flagged = db.ListProperty(db.Key)
So when you want to see if a user upvoted for a given story:
Tracking.all().filter('story =', db.Key.from_path('Story', uid)).filter('upvoted =', db.Key.from_path('UserProfile', uid)).get(keys_only=True)
Now the only problem here is the size of the upvoted/flagged lists can't grow too large (I think the limit is 5000), so you'd have to make a class to manage this (that is, when adding to the upvoted/flagged lists, detect if X entries exists, and if so, start a new tracking object to hold additional values). You will also have to make this transactional and with HR you have a 1 write per second threshold. This may or may not be an issue depending on your expected use case. A way around the write threshold would be to implement upvotes/flags using pull-queues and to have a cron job that pulls and batch updates tracking objects as needed.
This method has its pros/cons. The most obvious cons are the ones I just listed. The pros, however, may be worth it. You can get a full list of users who upvoted/flagged a story from a single list (or multiple depending on how popular the story is). You can get a full list of users with a lot fewer queries to the datastore. This method should also take less storage, index, and metadata space. Additionally, adding a user to a tracking object will be cheaper, instead of writing a new object + 2 writes for each property, you would just be charged 1 write for the object + 2 writes for the entry to the list (9 vs 3 writes for adding users to a pre-existing tracked story, or 9 vs 7 for untracked stories)
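The "start a new tracking object when the list is full" bookkeeping could be sketched like this. It is a pure-Python, in-memory stand-in: ShardedKeyList is a hypothetical helper, each inner list plays the role of one Tracking entity's upvoted list, and the default limit of 5000 is just the figure guessed above, not a verified datastore limit.

```python
class ShardedKeyList:
    """In-memory stand-in for several Tracking entities that share one story."""

    def __init__(self, max_per_shard=5000):
        self.max_per_shard = max_per_shard
        self.shards = [[]]  # each inner list plays the role of one upvoted list

    def __contains__(self, user_key):
        return any(user_key in shard for shard in self.shards)

    def add(self, user_key):
        """Record a vote once; return False if the user had already voted."""
        if user_key in self:
            return False
        if len(self.shards[-1]) >= self.max_per_shard:
            self.shards.append([])  # current shard is full, start a new one
        self.shards[-1].append(user_key)
        return True
```

In a real datastore version the membership check and the append would need to happen inside a transaction, as noted above.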
What you propose sounds reasonable.
Don't use the App Engine generated key for Tracking. Because the combination of story/user should be unique, create your own key name as a combination of the two. Something like
tracking = Tracking.get_or_insert(str(story.key().id()) + "-" + str(user.key().id()), **params)
If you know the story/user, then you can always fetch the tracking by key name.
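The idea behind the composite key name can be tried out without the datastore. In the sketch below a plain dict imitates the entity store, and get_or_insert is a local imitation of the create-once behaviour, not the App Engine method itself; tracking_key_name and upvote are hypothetical helper names.

```python
def tracking_key_name(story_id, user_id):
    """Build one deterministic key name per (story, user) pair."""
    return "%s-%s" % (story_id, user_id)


def get_or_insert(store, key_name, **defaults):
    """Dict-backed imitation of get_or_insert: create once, then reuse."""
    return store.setdefault(key_name, dict(defaults))


def upvote(store, story_id, user_id):
    """Return True if the vote was recorded, False if the user already voted."""
    tracking = get_or_insert(store, tracking_key_name(story_id, user_id), upvoted=False, flagged=False)
    if tracking["upvoted"]:
        return False
    tracking["upvoted"] = True
    return True
```

Because the key name is derived from the two ids, there can never be two tracking records for the same story/user pair, and no query is needed to find the record.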
I have a model with a unique integer that needs to increment with regards to a foreign key, and the following code is how I currently handle it:
class MyModel(models.Model):
    business = models.ForeignKey(Business)
    number = models.PositiveIntegerField()
    spam = models.CharField(max_length=255)

    class Meta:
        unique_together = (('number', 'business'),)

    def save(self, *args, **kwargs):
        if self.pk is None:  # New instances only
            try:
                highest_number = MyModel.objects.filter(business=self.business).order_by('-number').all()[0].number
                self.number = highest_number + 1
            except IndexError:  # First MyModel instance: indexing an empty queryset raises IndexError
                self.number = 1
        super(MyModel, self).save(*args, **kwargs)
I have the following questions regarding this:
Multiple people can create MyModel instances for the same business, all over the internet. Is it possible for 2 people to create MyModel instances at the same time, for the lookup to return 500 for both of them, and for both to then try to set self.number = 501 (raising an IntegrityError)? The answer seems like an obvious "yes, it could happen", but I had to ask.
Is there a shortcut, or "Best way" to do this, which I can use (or perhaps a SuperAutoField that handles this)?
I can't just slap a while model_not_saved: try:, except IntegrityError: in, because other restraints in the model could lead to an endless loop, and a disaster worse than Chernobyl (maybe not quite that bad).
You want that constraint at the database level. Otherwise you're going to eventually run into the concurrency problem you discussed. The solution is to wrap the entire operation (read, increment, write) in a transaction.
Why can't you use an AutoField instead of a PositiveIntegerField?
number = models.AutoField()
However, in this case number is almost certainly going to equal yourmodel.id, so why not just use that?
Edit:
Oh, I see what you want. You want a number field that increments separately for each MyModel.business rather than across all instances.
I would still recommend just using the id field if you can, since it's certain to be unique. If you absolutely don't want to do that (maybe you're showing this number to users), then you will need to wrap your save method in a transaction.
You can read more about transactions in the docs:
http://docs.djangoproject.com/en/dev/topics/db/transactions/
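As a rough sketch of keeping the read, increment and write together, the example below computes the next number inside the INSERT statement itself and keeps the unique constraint as a safety net. It uses plain sqlite3 so it runs standalone; the mymodel table mirrors the model in the question and create_numbered is a made-up helper. Whether a single statement like this is sufficient under heavy concurrency depends on the database and its isolation level, so treat it as an illustration rather than a guarantee.

```python
import sqlite3


def create_numbered(conn, business_id, spam):
    """Insert a row whose number is the next free one for this business."""
    with conn:  # the read of MAX(number) and the insert commit together
        cur = conn.execute(
            """
            INSERT INTO mymodel (business_id, number, spam)
            SELECT ?, COALESCE(MAX(number), 0) + 1, ?
            FROM mymodel
            WHERE business_id = ?
            """,
            (business_id, spam, business_id),
        )
    row = conn.execute("SELECT number FROM mymodel WHERE id = ?", (cur.lastrowid,)).fetchone()
    return row[0]


def demo_numbering():
    """Create rows for two businesses and return the numbers they received."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE mymodel (
            id INTEGER PRIMARY KEY,
            business_id INTEGER NOT NULL,
            number INTEGER NOT NULL,
            spam TEXT NOT NULL,
            UNIQUE (business_id, number)
        )
        """
    )
    return [create_numbered(conn, b, "x") for b in (1, 1, 2, 1)]
```

If two writers still collide, the UNIQUE constraint makes one of them fail with an integrity error instead of silently storing a duplicate number.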
If you're just using this to count how many instances of MyModel have a FK to Business, you should do that as a query rather than trying to store a count.
Iterating over a queryset, like so:
class Book(models.Model):
    # <snip some other stuff>
    activity = models.PositiveIntegerField(default=0)
    views = models.PositiveIntegerField(default=0)

    def calculate_statistics(self):
        self.activity = self.views * 4
        self.save()

def cron_job_calculate_all_book_statistics():
    for book in Book.objects.all():
        book.calculate_statistics()
...works just fine. However, this is a cron task. book.views is being incremented while this is happening. If book.views is modified while this cronjob is running, it gets reverted.
Now, book.views is not being modified by the cronjob, but it is being cached during the .all() queryset call. When book.save() runs, I have a feeling it is writing back the old book.views value.
Is there a way to make sure that only the activity field is updated? Alternatively, let's say there are 100,000 books. This will take quite a while to run. But the book.views will be from when the queryset originally starts running. Is the solution to just use an .iterator()?
UPDATE: Here's effectively what I am doing. If you have ideas about how to make this work well inline, then I'm all for it.
def calculate_statistics(self):
    self.activity = self.views + self.hearts.count() * 2
    # Can't do self.comments.count with a comments GenericRelation, because Comment uses
    # a TextField for object_pk, and that breaks the whole system. Lame.
    self.activity += Comment.objects.for_model(self).count() * 4
    self.save()
The following will do the job for you in Django 1.1, no loop necessary:
from django.db.models import F
Book.objects.all().update(activity=F('views')*4)
You can have a more complicated calculation too:
for book in Book.objects.all().iterator():
    Book.objects.filter(pk=book.pk).update(activity=book.calculate_activity())
Both these options have the potential to leave the activity field out of sync with the rest, but I assume you're ok with that, given that you're calculating it in a cron job.
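The difference between writing the whole row back and writing only the activity column can be reproduced with the small sqlite3 sketch below. It is a stand-in for the ORM, with made-up function names: full_row_save behaves like a save() that writes every field, single_column_update like update(activity=...), and bump_views simulates a page view that arrives in between.

```python
import sqlite3


def make_books():
    """One book with 10 views and no activity yet."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, views INTEGER, activity INTEGER)")
    conn.execute("INSERT INTO book (id, views, activity) VALUES (1, 10, 0)")
    conn.commit()
    return conn


def bump_views(conn):
    """Stands in for a page view arriving while the cron job is running."""
    conn.execute("UPDATE book SET views = views + 1 WHERE id = 1")


def full_row_save(conn):
    """Writes back every column, including the stale views value."""
    views, = conn.execute("SELECT views FROM book WHERE id = 1").fetchone()
    bump_views(conn)
    conn.execute("UPDATE book SET views = ?, activity = ? WHERE id = 1", (views, views * 4))
    conn.commit()
    return conn.execute("SELECT views, activity FROM book WHERE id = 1").fetchone()


def single_column_update(conn):
    """Writes only the activity column, so the newer views value survives."""
    views, = conn.execute("SELECT views FROM book WHERE id = 1").fetchone()
    bump_views(conn)
    conn.execute("UPDATE book SET activity = ? WHERE id = 1", (views * 4,))
    conn.commit()
    return conn.execute("SELECT views, activity FROM book WHERE id = 1").fetchone()
```

With the full-row write the extra view is lost (views goes back to 10), while the single-column write keeps it (views stays at 11). In both cases activity is computed from the value that was read, so it can lag slightly behind.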
In addition to what others have said, if you are iterating over a large queryset you should use iterator():
Book.objects.filter(stuff).order_by(stuff).iterator()
This will cause Django to not cache the items as it iterates (caching could use a ton of memory for a large result set).
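The underlying idea (pull rows from the cursor in chunks and never build the full list) looks roughly like this outside the ORM. This is a sqlite3 sketch with hypothetical helper names, not what Django does internally.

```python
import sqlite3


def iter_rows(conn, sql, params=(), chunk_size=1000):
    """Yield rows one at a time while fetching them from the cursor in chunks."""
    cursor = conn.execute(sql, params)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        for row in chunk:
            yield row


def total_views(conn):
    """Sum a column without ever holding the full result set in a list."""
    return sum(views for (views,) in iter_rows(conn, "SELECT views FROM book", chunk_size=2))
```

Only one chunk of rows is held in memory at a time, which is the property that matters for a table with 100,000 books.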
No matter how you solve this, beware of transaction-related issues. For example, the default transaction isolation level is REPEATABLE READ, at least for the MySQL backend. This, plus the fact that both Django and the database backend work in a specific autocommit mode with an ongoing transaction, means that even if you use whrde's (very nice) suggestion, the value of 'views' could no longer be valid. I could be wrong here, but consider yourself warned.