So I have a queryset to update
stories = Story.objects.filter(introtext="")
for story in stories:
    # just set it to the first 'sentence'
    story.introtext = story.content[0:(story.content.find('.'))] + ".</p>"
    story.save()
And the save() operation completely kills performance. In the process list there are multiple entries for "./manage.py shell" - yes, I ran this through the Django shell.
However, in the past I've run scripts that didn't need to use save(), because they were changing a many-to-many field. Those scripts were very performant.
My project has this code, which could be relevant to why those scripts were so fast.
from django.db.models import signals
from django.dispatch import receiver

@receiver(signals.m2m_changed, sender=Story.tags.through)
def save_story(sender, instance, action, reverse, model, pk_set, **kwargs):
    instance.save()
What is the best way to update a large queryset (10000+) efficiently?
As the new introtext value depends on the content field of the same object, you can't do a plain bulk update. But you can speed up saving the list of individual objects by wrapping it in a transaction:
from django.db import transaction

with transaction.atomic():
    stories = Story.objects.filter(introtext='')
    for story in stories:
        introtext = story.content[0:(story.content.find('.'))] + ".</p>"
        Story.objects.filter(pk=story.pk).update(introtext=introtext)
transaction.atomic() will increase speed by an order of magnitude.
The filter(pk=story.pk).update() trick prevents the pre_save/post_save signals that a plain save() would emit. It is also the officially recommended way to update a single field of an object.
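If you are on Django 2.2 or newer, bulk_update() offers another option: compute the new values in Python and write them back in batches. A minimal sketch, assuming the same Story model:

from django.db import transaction

with transaction.atomic():
    stories = list(Story.objects.filter(introtext=''))
    for story in stories:
        # compute the new value in Python, as before
        story.introtext = story.content[0:story.content.find('.')] + ".</p>"
    # one UPDATE per batch instead of one query per row (Django 2.2+)
    Story.objects.bulk_update(stories, ['introtext'], batch_size=500)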
You can use the update() built-in over a queryset.
Example:
MyModel.objects.all().update(color='red')
In your case the new value is derived from the object's own content field, so a bare F() expression (read more here) is not enough - F() cannot slice a string or call find() on it. You need the database string functions instead (Substr, StrIndex and Concat; StrIndex requires Django 2.0+):
from django.db.models import Value
from django.db.models.functions import Concat, StrIndex, Substr

stories = Story.objects.filter(introtext__exact='')
# Keep everything up to and including the first '.', then append '</p>', all in a single UPDATE.
stories.update(introtext=Concat(Substr('content', 1, StrIndex('content', Value('.'))), Value('</p>')))
I have a piece of code which fetches a QuerySet from the DB and then adds a new calculated field to every object in the QuerySet. Adding this field via annotation is not an option (because it's legacy code and because the calculation is based on other already prefetched data).
Like this:
from django.db import models

class Human(models.Model):
    name = models.CharField()
    surname = models.CharField()

def calculate_new_field(s):
    return len(s.name) * 42

people = Human.objects.filter(id__in=[1, 2, 3, 4, 5])
for s in people:
    s.new_column = calculate_new_field(s)
# people.somehow_reorder(new_order_by=new_column)
So now all people in the QuerySet have a new column, and I want to order these objects by the new_column field. order_by() will obviously not work, since it is a database operation. I understand that I can pass them around as a sorted list, but there is a lot of template code and other logic that expects a QuerySet-like interface from this object, with its methods and so on.
So the question is: is there a way that isn't too bad and dirty to reorder an existing QuerySet by a dynamically added field, or to create a new QuerySet-like object with this data? I believe I'm not the only one who has faced this problem and that it's already solved in Django, but I can't find anything (except adding third-party libs, which is not an option either).
Conceptually, a QuerySet is not a list of results but the "instructions to get those results". It's lazily evaluated and also cached. The internal attribute of the QuerySet that keeps the cached results is qs._result_cache.
So the for s in people loop forces evaluation of the query and caches the results.
You could, after that, sort the results by doing:
from operator import attrgetter

people._result_cache.sort(key=attrgetter('new_column'))
But, after evaluating a QuerySet, it makes little sense (in my opinion) to keep the QuerySet interface, as many of the operations will cause a reevaluation of the query. From this point on you should be dealing with a list of Models
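For instance, a minimal sketch of the plain-list approach (nothing here relies on private QuerySet internals):

from operator import attrgetter

people = list(Human.objects.filter(id__in=[1, 2, 3, 4, 5]))
for s in people:
    s.new_column = calculate_new_field(s)

# An ordinary list sort; templates that only iterate behave the same over a list.
people.sort(key=attrgetter('new_column'))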
Can you try annotating with database functions, e.g. functions.Length:
from django.db.models.functions import Length

qs = Human.objects.filter(id__in=[1, 2, 3, 4, 5])
qs = qs.annotate(reorder=Length('name') * 42).order_by('reorder')
I have two Django models as shown below, MyModel1 & MyModel2:
class MyModel1(CachingMixin, MPTTModel):
    name = models.CharField(null=False, blank=False, max_length=255)

    objects = CachingManager()

    def __str__(self):
        return "; ".join(["ID: %s" % self.pk, "name: %s" % self.name, ])

class MyModel2(CachingMixin, models.Model):
    name = models.CharField(null=False, blank=False, max_length=255)
    model1 = models.ManyToManyField(MyModel1, related_name="MyModel2_MyModel1")

    objects = CachingManager()

    def __str__(self):
        return "; ".join(["ID: %s" % self.pk, "name: %s" % self.name, ])
MyModel2 has a ManyToManyField to MyModel1 called model1.
Now look what happens when I add a new entry to this ManyToMany field. According to Django, it has no effect:
>>> m1 = MyModel1.objects.all()[0]
>>> m2 = MyModel2.objects.all()[0]
>>> m2.model1.all()
[]
>>> m2.model1.add(m1)
>>> m2.model1.all()
[]
Why? It definitely seems like a caching issue, because I can see that there is a new entry in the database table myapp_mymodel2_mymodel1 for this link between m2 and m1. How should I fix it?
Is django-cache-machine really needed?
MyModel1.objects.all()[0]
Roughly translates to
SELECT * FROM app_mymodel LIMIT 1
Queries like this are always fast. There would not be a significant difference in speed whether you fetch the row from the cache or from the database.
When you use the caching manager you actually add a bit of overhead here, which might make things a bit slower. Most of the time this effort is wasted, because there may not be a cache hit, as explained in the next section.
How django-cache-machine works
Whenever you run a query, CachingQuerySet will try to find that query
in the cache. Queries are keyed by {prefix}:{sql}. If it’s there, we
return the cached result set and everyone is happy. If the query isn’t
in the cache, the normal codepath to run a database query is executed.
As the objects in the result set are iterated over, they are added to
a list that will get cached once iteration is done.
source: https://cache-machine.readthedocs.io/en/latest/
Accordingly, since the two queries executed in your question are identical, the cache manager will fetch the second result set from memcache, provided the cache hasn't been invalidated.
The same link explains how cache keys are invalidated.
To support easy cache invalidation, we use “flush lists” to mark the
cached queries an object belongs to. That way, all queries where an
object was found will be invalidated when that object changes. Flush
lists map an object key to a list of query keys.
When an object is saved or deleted, all query keys in its flush list
will be deleted. In addition, the flush lists of its foreign key
relations will be cleared. To avoid stale foreign key relations, any
cached objects will be flushed when the object their foreign key
points to is invalidated.
It's clear that saving or deleting an object results in many cached queries having to be invalidated, so you are slowing down those operations by using the cache manager. Also worth noting is that the invalidation documentation does not mention many-to-many fields at all. There is an open issue for this, and from your comment on that issue it's clear that you have discovered it too.
Solution
Chuck cache machine. Caching all queries is almost never worth it; it leads to all kinds of hard-to-find bugs and issues. The best approach is to optimize your tables and fine-tune your queries. If you find a particular query that is too slow, cache it manually.
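A minimal sketch of that manual route using Django's low-level cache API (the helper name and timeout are made up for illustration):

from django.core.cache import cache

def related_model1_ids_for(m2):
    # Cache just this one lookup by hand instead of caching every query.
    key = 'mymodel2:%s:model1_ids' % m2.pk
    ids = cache.get(key)
    if ids is None:
        ids = list(m2.model1.values_list('pk', flat=True))
        cache.set(key, ids, 60 * 15)  # keep for 15 minutes
    return MyModel1.objects.filter(pk__in=ids)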
This was my workaround solution:
>>> m1 = MyModel1.objects.all()[0]
>>> m1
<MyModel1: ID: 8887972990743179; name: my-name-blahblah>
>>> m2 = MyModel2.objects.all()[0]
>>> m2.model1.all()
[]
>>> m2.model1.add(m1)
>>> m2.model1.all()
[]
>>> MyModel1.objects.invalidate(m1)
>>> MyModel2.objects.invalidate(m2)
>>> m2.save()
>>> m2.model1.all()
[<MyModel1: ID: 8887972990743179; name: my-name-blahblah>]
Have you considered hooking into the model signals to invalidate the cache when an object is added? For your case you should look at the m2m_changed signal.
A small example that doesn't solve your problem, but relates the workaround you gave to my signal-based approach (I don't know django-cache-machine):
from django.db.models.signals import m2m_changed

def invalidate_m2m(sender, **kwargs):
    instance = kwargs.get('instance', None)
    action = kwargs.get('action', None)
    if action == 'post_add':
        MyModel2.objects.invalidate(instance)

m2m_changed.connect(invalidate_m2m, sender=MyModel2.model1.through)
A. J. Parr's answer is almost correct, but it forgets post_remove, and you can also bind it to every ManyToManyField like this:
from django.db.models.signals import m2m_changed
from django.dispatch import receiver

@receiver(m2m_changed)
def invalidate_cache_m2m(sender, instance, action, reverse, model, pk_set, **kwargs):
    if action in ['post_add', 'post_remove']:
        model.objects.invalidate(instance)
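One practical note if you use the @receiver decorator: the module containing it has to be imported at startup for the handler to be registered. A common place for that is the AppConfig.ready() hook (the app and module names below are made up):

from django.apps import AppConfig

class MyAppConfig(AppConfig):
    name = 'myapp'

    def ready(self):
        # Importing the module registers the @receiver-decorated handlers.
        from . import signals  # noqa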
Let's say I have two Models in Django:
Book:
class Book(models.Model):
    title = models.CharField(max_length=100, blank=False)
    number_of_readers = models.PositiveIntegerField(default=0)
Reader:
class Reader(models.Model):
    book = models.ForeignKey(Book)
    name_of_reader = models.CharField(max_length=100, blank=False)
Every time I add a new Reader to the database, I want to increase number_of_readers in the Book model by 1. I do not want to dynamically count the number of Reader rows related to a particular Book, for performance reasons.
Where would be the best place to increase the number_of_readers field - in the serializer or in the model? And what method should I use? Should I override .save() on the model, or something else in the serializer?
Even better if someone could provide a full-blown example of how to modify the Book table when POSTing a new Reader.
Thanks.
I wouldn't do this at the REST API level, I'd do it at the model level, because then the +1 increase will always happen, regardless of where it came from (not only when you hit a particular REST view/serializer).
Django signals
Everytime I add a new Reader to the database I want to increase number_of_readers in the Book model by 1
I'd implement a post_save signal that triggers when a Reader is created.
That signal receives a created argument, which is True when the instance was created; this makes it more convenient than overriding Model.save().
Example outline
from django.db.models.signals import post_save

def my_callback(sender, instance, created, **kwargs):
    if created:
        reader = instance
        book = reader.book
        book.number_of_readers += 1  # prone to race conditions, more on that below
        book.save(update_fields=['number_of_readers'])  # save the counter field only

post_save.connect(my_callback, sender=your.models.Reader)
https://docs.djangoproject.com/en/1.8/ref/signals/#django.db.models.signals.post_save
Race Conditions
In the above code snippet, if you'd like to avoid a race condition (which can happen when many threads update the same counter), you can replace the book.number_of_readers += 1 part with an F expression, F('number_of_readers') + 1, which moves the increment to the database level instead of Python:
from django.db.models import F

book.number_of_readers = F('number_of_readers') + 1
book.save(update_fields=['number_of_readers'])
more on that here: https://docs.djangoproject.com/en/1.8/ref/models/expressions/#avoiding-race-conditions-using-f
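Note that after saving with an F() expression the in-memory attribute holds the expression object rather than the new number; if you need to read the fresh value, reload it (refresh_from_db is available since Django 1.8):

# book.number_of_readers is now a CombinedExpression, not an int; reload before reading it.
book.refresh_from_db(fields=['number_of_readers'])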
There is a post_delete signal too, to reverse the counter, if you ever think of supporting "unreading" a book :)
Batch or periodic updates
If you want to batch-import readers, or need to periodically update (or "reflow") the reader counts (e.g. once a week), you can, in addition to the above, implement a function that recounts the readers and updates Book.number_of_readers, as sketched below.
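A minimal sketch of such a recount, assuming the default reverse accessor name (reader) for the Reader.book foreign key:

from django.db.models import Count

def reflow_reader_counts():
    # Recompute the counter from the actual Reader rows and persist only that field.
    for book in Book.objects.annotate(reader_count=Count('reader')):
        Book.objects.filter(pk=book.pk).update(number_of_readers=book.reader_count)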
It depends on the design of your app, and particularly on where you will reuse this logic.
For example, if you want the same logic for adding Readers everywhere in your app, do it in a signal, as bakkal suggests, or in save(). If it depends on the API endpoint, you might want to do it in a view.
It also depends on whether you are doing bulk inserts of readers: if you do it in save() or in a pre_save/post_save signal, it will not work for bulk inserts, so it would be better to do it in the QuerySet's create() and bulk_create() methods (see the sketch at the end of this answer).
From a performance point of view, you might want to use F expressions, no matter where you do it:
book.number_of_readers = F('number_of_readers') + added_readers_count
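A minimal sketch of the create()/bulk_create() route mentioned above (the ReaderQuerySet name is made up for illustration):

from collections import Counter

from django.db import models
from django.db.models import F

class ReaderQuerySet(models.QuerySet):
    def create(self, **kwargs):
        reader = super(ReaderQuerySet, self).create(**kwargs)
        Book.objects.filter(pk=reader.book_id).update(
            number_of_readers=F('number_of_readers') + 1)
        return reader

    def bulk_create(self, objs, **kwargs):
        readers = super(ReaderQuerySet, self).bulk_create(objs, **kwargs)
        # Count how many new readers each book received, then bump each counter once.
        per_book = Counter(r.book_id for r in readers)
        for book_id, added in per_book.items():
            Book.objects.filter(pk=book_id).update(
                number_of_readers=F('number_of_readers') + added)
        return readers

class Reader(models.Model):
    book = models.ForeignKey(Book)
    name_of_reader = models.CharField(max_length=100, blank=False)

    objects = ReaderQuerySet.as_manager()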
So QuerySets are "lazy" and are only evaluated in certain cases (repr(), list(), iteration, etc.). I have created a custom QuerySet for doing many queries, but these could end up containing millions of items - way more than I want to return.
When the QuerySet is evaluated, it should not return more than 25 results. Now, I know I could do the following:
first = example.objects.filter(...)
last = first.filter(...)
result = last[:25]
#do stuff with result
but I will be doing so many queries with example objects that I feel it is unnecessary to repeat the line result = last[:25] every time. Is there a way to specify how a QuerySet is returned?
If there is, how can I change it so that whenever the QuerySet is evaluated it only returns the first x items, where in this case x = 25?
Important note:
the slicing must happen at evaluation time, so that I can chain queries without limiting the intermediate results, but whenever I return a result upon evaluation, it has a maximum of x items.
I'm not sure I understand what your issue with slicing the queryset is. If it's the extra line of code or the hardcoded number that's bothering you, you can run
example.objects.filter(**filters)[:x]
and pass x into whatever method you're using.
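For example, a tiny hypothetical helper that keeps the slice out of every call site (the slice still translates to a LIMIT in SQL and does not evaluate the queryset by itself):

def limited(qs, x=25):
    # Slicing an unevaluated queryset only records a LIMIT; the database is not hit yet.
    return qs[:x]

result = limited(example.objects.filter(**filters))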
You can write a custom Manager:
class LimitedNumberOfResultsManager(models.Manager):
    def get_queryset(self):
        return super(LimitedNumberOfResultsManager, self).get_queryset()[:25]
Note: You may think that adding a slice here will immediately evaluate the queryset. It won't. Instead, information about the query limit will be saved to the underlying Query object and used later, during the final evaluation - as long as it is not overwritten by another slice in the meantime.
Then add the manager to your model:
class YourModel(models.Model):
    # ...
    objects = LimitedNumberOfResultsManager()
After setting this up, YourModel.objects.all() and other operations on your queryset will always return at most 25 results. You can still override this at any time using slicing. For example:
This will return up to 25 results:
YourModel.objects.filter(lorem='ipsum')
but this will return up to 100 results:
YourModel.objects.filter(lorem='ipsum')[:100]
One more point: overriding the default manager may be confusing to other people reading your code, so I think it would be better to leave the default manager alone and use the custom one as an optional alternative:
class YourModel(models.Model):
    # ...
    objects = models.Manager()
    limited = LimitedNumberOfResultsManager()
With this setup, this will return all results:
YourModel.objects.all()
and this will return only up to 25 results:
YourModel.limited.all()
Depending on your exact use case, you may also want to look at pagination in Django.
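A minimal sketch of the pagination route, assuming 25 items per page:

from django.core.paginator import Paginator

qs = YourModel.objects.filter(lorem='ipsum')
page = Paginator(qs, 25).page(1)  # the paginator slices the queryset, i.e. LIMIT/OFFSET in SQL
for obj in page.object_list:
    print(obj.pk)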
Iterating over a queryset, like so:
class Book(models.Model):
    # <snip some other stuff>
    activity = models.PositiveIntegerField(default=0)
    views = models.PositiveIntegerField(default=0)

    def calculate_statistics(self):
        self.activity = self.views * 4
        self.save()

def cron_job_calculate_all_book_statistics():
    for book in Book.objects.all():
        book.calculate_statistics()
...works just fine. However, this is a cron task, and book.views is being incremented while it runs. If book.views is modified while this cron job is running, the change gets reverted.
Now, book.views is not being modified by the cron job, but it is being cached during the .all() queryset call. When book.save() runs, I have a feeling it is using the old book.views value.
Is there a way to make sure that only the activity field is updated? Alternatively, say there are 100,000 books: this will take quite a while to run, and book.views will be from when the queryset originally started running. Is the solution just to use .iterator()?
UPDATE: Here's effectively what I am doing. If you have ideas about how to make this work well inline, then I'm all for it.
def calculate_statistics(self):
    self.activity = self.views + self.hearts.count() * 2
    # Can't do self.comments.count with a comments GenericRelation, because Comment uses
    # a TextField for object_pk, and that breaks the whole system. Lame.
    self.activity += Comment.objects.for_model(self).count() * 4
    self.save()
The following will do the job for you in Django 1.1, no loop necessary:
from django.db.models import F
Book.objects.all().update(activity=F('views')*4)
You can have a more complicated calculation too:
for book in Book.objects.all().iterator():
    Book.objects.filter(pk=book.pk).update(activity=book.calculate_activity())
Both these options have the potential to leave the activity field out of sync with the rest, but I assume you're ok with that, given that you're calculating it in a cron job.
In addition to what others have said, if you are iterating over a large queryset you should use iterator():
Book.objects.filter(stuff).order_by(stuff).iterator()
This will cause Django not to cache the items as it iterates (caching could use a ton of memory for a large result set).
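On newer Django versions (2.0+), iterator() also accepts a chunk_size argument that controls how many rows are fetched from the database at a time:

for book in Book.objects.filter(stuff).order_by(stuff).iterator(chunk_size=2000):
    book.calculate_statistics()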
No matter how you solve this, beware of transaction-related issues. For example, the default transaction isolation level is REPEATABLE READ, at least for the MySQL backend. This, plus the fact that both Django and the DB backend work in a specific autocommit mode with an ongoing transaction, means that even if you use whrde's (very nice) suggestion, the value of `views` could be stale. I could be wrong here, but consider yourself warned.