Why is Django returning stale cache data?

I have two Django models as shown below, MyModel1 & MyModel2:
class MyModel1(CachingMixin, MPTTModel):
    name = models.CharField(null=False, blank=False, max_length=255)

    objects = CachingManager()

    def __str__(self):
        return "; ".join(["ID: %s" % self.pk, "name: %s" % self.name, ])

class MyModel2(CachingMixin, models.Model):
    name = models.CharField(null=False, blank=False, max_length=255)
    model1 = models.ManyToManyField(MyModel1, related_name="MyModel2_MyModel1")

    objects = CachingManager()

    def __str__(self):
        return "; ".join(["ID: %s" % self.pk, "name: %s" % self.name, ])
MyModel2 has a ManyToManyField to MyModel1 called model1.
Now look what happens when I add a new entry to this ManyToMany field. According to Django, it has no effect:
>>> m1 = MyModel1.objects.all()[0]
>>> m2 = MyModel2.objects.all()[0]
>>> m2.model1.all()
[]
>>> m2.model1.add(m1)
>>> m2.model1.all()
[]
Why? It definitely seems like a caching issue, because I can see that there is a new row in the database table myapp_mymodel2_mymodel1 for this link between m2 and m1. How should I fix it?

Is django-cache-machine really needed?
MyModel1.objects.all()[0]
Roughly translates to
SELECT * FROM app_mymodel LIMIT 1
Queries like this are always fast. There would not be a significant difference in speed whether you fetch the row from the cache or from the database.
When you use a caching manager you actually add a bit of overhead here that can make things a little slower. Most of the time this effort will be wasted, because there may not even be a cache hit, as explained in the next section.
How django-cache-machine works
Whenever you run a query, CachingQuerySet will try to find that query
in the cache. Queries are keyed by {prefix}:{sql}. If it’s there, we
return the cached result set and everyone is happy. If the query isn’t
in the cache, the normal codepath to run a database query is executed.
As the objects in the result set are iterated over, they are added to
a list that will get cached once iteration is done.
source: https://cache-machine.readthedocs.io/en/latest/
Accordingly, since the two queries executed in your question are identical, the caching manager will fetch the second result set from memcached, provided the cache hasn't been invalidated.
The same link explains how cache keys are invalidated.
To support easy cache invalidation, we use “flush lists” to mark the
cached queries an object belongs to. That way, all queries where an
object was found will be invalidated when that object changes. Flush
lists map an object key to a list of query keys.
When an object is saved or deleted, all query keys in its flush list
will be deleted. In addition, the flush lists of its foreign key
relations will be cleared. To avoid stale foreign key relations, any
cached objects will be flushed when the object their foreign key
points to is invalidated.
It's clear that saving or deleting an object results in many objects in the cache having to be invalidated, so you are slowing down those operations by using the caching manager. It's also worth noting that the invalidation documentation does not mention many-to-many fields at all. There is an open issue for this, and from your comment on that issue it's clear that you have discovered it too.
Solution
Chuck cache machine. Caching all queries is almost never worth it; it leads to all kinds of hard-to-find bugs and issues. The best approach is to optimize your tables and fine-tune your queries. If you find a particular query that is too slow, cache it manually, as sketched below.
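If you go that route, a minimal sketch of manual caching with Django's low-level cache API might look like this (the key name, the queryset and the 30-second timeout are illustrative assumptions, not part of the question):
from django.core.cache import cache

def expensive_model1_list():
    # Cache one known-slow query instead of caching every query.
    result = cache.get('mymodel1:expensive-list')
    if result is None:
        result = list(MyModel1.objects.filter(name__icontains='foo'))
        cache.set('mymodel1:expensive-list', result, timeout=30)
    return result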

This was my workaround solution:
>>> m1 = MyModel1.objects.all()[0]
>>> m1
<MyModel1: ID: 8887972990743179; name: my-name-blahblah>
>>> m2 = MyModel2.objects.all()[0]
>>> m2.model1.all()
[]
>>> m2.model1.add(m1)
>>> m2.model1.all()
[]
>>> MyModel1.objects.invalidate(m1)
>>> MyModel2.objects.invalidate(m2)
>>> m2.save()
>>> m2.model1.all()
[<MyModel1: ID: 8887972990743179; name: my-name-blahblah>]

Have you considered hooking into the model signals to invalidate the cache when an object is added? For your case you should look at M2M Changed Signal
A small example that doesn't solve your problem, but relates the workaround you gave above to my signals-based approach (I don't know django-cache-machine):
from django.db.models.signals import m2m_changed

def invalidate_m2m(sender, **kwargs):
    instance = kwargs.get('instance', None)
    action = kwargs.get('action', None)
    if action == 'post_add':
        # sender here is the auto-created through model of the m2m field
        sender.objects.invalidate(instance)

m2m_changed.connect(invalidate_m2m, sender=MyModel2.model1.through)

A. J. Parr's answer is almost correct, but it forgets post_remove, and you can also bind it to every ManyToManyField like this:
from django.db.models.signals import m2m_changed
from django.dispatch import receiver

@receiver(m2m_changed)
def invalidate_cache_m2m(sender, instance, action, reverse, model, pk_set, **kwargs):
    if action in ['post_add', 'post_remove']:
        model.objects.invalidate(instance)
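To make sure the receiver is actually registered at startup, the usual pattern is to import the module that defines it from the app config's ready() hook. A small sketch, where the app label and module name are hypothetical:
# myapp/apps.py -- hypothetical app; adjust names to your project
from django.apps import AppConfig

class MyAppConfig(AppConfig):
    name = 'myapp'

    def ready(self):
        # Importing the signals module registers the @receiver handlers.
        from . import signals  # noqa: F401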

Related

Prefetch object not working with order_by queryset

Using Django 1.11 with a PostgreSQL db. I have the models shown below. I'm trying to prefetch a related queryset using the Prefetch object and prefetch_related, without assigning it to an attribute.
class Person(Model):
    name = CharField()

    @property
    def latest_photo(self):
        return self.photos.order_by('created_at')[-1]

class Photo(Model):
    person = ForeignKey(Person, related_name='photos')
    created_at = models.DateTimeField(auto_now_add=True)

first_person = Person.objects.prefetch_related(
    Prefetch('photos', queryset=Photo.objects.order_by('created_at'))
).first()
first_person.photos.order_by('created_at')  # still hits the database
first_person.latest_photo  # still hits the database
In the ideal case, calling person.latest_photo will not hit the database again. This will allow me to use that property safely in a list display.
However, as noted in the comments in the code, the prefetched queryset is not being used when I try to get the latest photo. Why is that?
Note: I've tried using the to_attr argument of Prefetch and that seems to work, however, it's not ideal since it means I would have to edit latest_photo to try to use the prefetched attribute.
The problem is the slicing: it creates a different query, so the prefetched result set cannot be reused.
You can work around it like this:
...

@property
def latest_photo(self):
    # Use .all() so the prefetched queryset (already ordered by the Prefetch above)
    # is reused; calling .order_by() on the manager would issue a new query.
    first_use_the_prefetch = list(self.photos.all())
    then_slice = first_use_the_prefetch[-1]
    return then_slice
And in case you want to try, it is not possible to use slicing inside Prefetch(queryset=...no slicing here...) (there is a wontfix feature request for this somewhere in the Django tracker).
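If editing latest_photo is acceptable, the to_attr approach mentioned in the question is also worth spelling out; a small sketch, where ordered_photos is a hypothetical attribute name:
from django.db.models import Prefetch

people = Person.objects.prefetch_related(
    Prefetch(
        'photos',
        queryset=Photo.objects.order_by('created_at'),
        to_attr='ordered_photos',  # prefetched results are stored as a plain list
    )
)
first_person = people.first()
latest = first_person.ordered_photos[-1] if first_person.ordered_photos else None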

Marking an object as clean in SQLAlchemy ORM

Is there any way to explicitly mark an object as clean in the SQLAlchemy ORM?
This is related partly to a previous question on bulk update strategies.
I want to, within a before_flush event listener, mark a bunch of objects as not actually needing to be flushed. This is due to them being manually synced with the database by other means.
I have tried the strategy below, but it results in the object being removed from the session, which then can cause problems later when a lazy load happens.
@event.listens_for(SignallingSession, 'before_flush')
def before_flush(session, flush_context, instances):
    ledgers = []

    if session.dirty:
        for elem in session.dirty:
            if session.is_modified(elem, include_collections=False):
                if isinstance(elem, Wallet):
                    session.expunge(elem)  # causes problems later
                    ledgers.append(Ledger(id=elem.id, amount=elem.balance))

    if ledgers:
        session.bulk_save_objects(ledgers)
        session.execute('UPDATE wallet w JOIN ledger l ON w.id = l.id SET w.balance = l.amount')
        session.execute('TRUNCATE ledger')
I want to do something like:
session.dirty.remove(MyObject)
But that doesn't work as session.dirty is a computed property, not a regular attribute. I've been digging around the instrumentation code, but can't see how I might fool the dirty list to not contain something. I see there is also a history on the object state that will need taking care of as well.
Any ideas? The underlying database is MySQL if that makes any difference.
When you modify the database outside of the ORM, you can let the ORM know the current database state by using set_committed_value().
Example:
from sqlalchemy.orm.attributes import set_committed_value

wallet = session.query(Wallet).filter_by(id=123).one()
wallet.balance = 0

session.execute("UPDATE wallet SET balance = 0 WHERE id = 123;")

# Tell the ORM that 0 is now the value persisted in the database.
set_committed_value(wallet, "balance", 0)
session.commit()  # won't issue additional SQL to update wallet
If you really wanted to mark the instance as not dirty, you can muck with the internals of SQLAlchemy:
from sqlalchemy import inspect

state = inspect(p)
session.identity_map._modified.discard(state)
state.modified = False

print(p in session.dirty)  # False
Let me summarize this insanity.
from sqlalchemy.orm import attributes
attributes.instance_state(your_object).committed_state.clear()
Easy. (no)

Django model refresh_from_db() vs. explicitly re-fetching from db

If I have an object retrieved from a model, for example:
obj = Foo.objects.first()
I know that if I want to reference this object later and make sure that it has the current values from the database, I can call:
obj.refresh_from_db()
My question is, is there any advantage to using the refresh_from_db() method over simply doing?:
obj = Foo.objects.get(id=obj.id)
As far as I know, the result will be the same. refresh_from_db() seems more explicit, but in some cases it means an extra line of code. Let's say I update the value field for obj and later want to test that it has been updated to False. Compare:
obj = Foo.objects.first()
assert obj.value is True
# value of foo obj is updated somewhere to False and I want to test below
obj.refresh_from_db()
assert obj.value is False
with this:
obj = Foo.objects.first()
assert obj.value is True
# value of foo obj is updated somewhere to False and I want to test below
assert Foo.objects.get(id=obj.id).value is False
I am not interested in a discussion of which of the two is more pythonic. Rather, I am wondering if one method has a practical advantage over the other in terms of resources, performance, etc. I have read this bit of documentation, but I was not able to ascertain from that whether there is an advantage to using refresh_from_db(). Thank you!
Django sources are usually relatively easy to follow. If we look at the refresh_from_db() implementation, at its core it is still using this same Foo.objects.get(id=obj.id) approach:
db_instance_qs = self.__class__._default_manager.using(db).filter(pk=self.pk)
...
db_instance_qs = db_instance_qs.only(*fields)
...
db_instance = db_instance_qs.get()
There are only a couple of extra bells and whistles:
deferred fields are ignored
stale foreign key references are cleared (according to the comment explanation)
So for everyday usage it is safe to say that they are pretty much the same, use whatever you like.
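For completeness, refresh_from_db() also accepts a fields argument, so you can reload just the columns you care about rather than the whole row; a tiny sketch reusing the Foo example from the question:
obj = Foo.objects.first()
# Reload only the 'value' column; other fields keep their in-memory state.
obj.refresh_from_db(fields=['value'])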
Just to add to @serg's answer, there's a case where explicitly re-fetching from the db is helpful and refresh_from_db() isn't of much use.
This is the case when you're adding permissions to an object and checking them immediately afterwards, and you need to clear the cached permissions for the object so that your permission checks work as expected.
According to the permission caching section of the django documentation:
The ModelBackend caches permissions on the user object after the first time they need to be fetched for a permissions check. This is typically fine for the request-response cycle since permissions aren’t typically checked immediately after they are added (in the admin, for example). If you are adding permissions and checking them immediately afterward, in a test or view for example, the easiest solution is to re-fetch the user from the database...
For an example, consider this block of code inspired by the one in the documentation cited above:
from django.contrib.auth import get_user_model
from django.contrib.auth.models import Permission
from django.contrib.contenttypes.models import ContentType

from smoothies.models import Smoothie

def force_deblend(user, smoothie):
    # Any permission check will cache the current set of permissions
    if not user.has_perm('smoothies.deblend_smoothie'):
        permission = Permission.objects.get(
            codename='deblend_smoothie',
            content_type=ContentType.objects.get_for_model(Smoothie)
        )
        user.user_permissions.add(permission)

        # Subsequent permission checks hit the cached permission set
        print(user.has_perm('smoothies.deblend_smoothie'))  # False

        # Re-fetch user (explicitly) from db to clear permissions cache
        # Be aware that user.refresh_from_db() won't help here
        user = get_user_model().objects.get(pk=user.pk)

        # Permission cache is now repopulated from the database
        print(user.has_perm('smoothies.deblend_smoothie'))  # True

        ...
It seems there is a difference if you use cached properties.
See here:
>>> p.roles[0]
<Role: 16888649>
>>> p.refresh_from_db()
>>> p.roles[0]
<Role: 16888649>
>>> p = Person.objects.get(id=p.id)
>>> p.roles[0]
<Role: 16888650>
Definition from models.py:
@cached_property
def roles(self):
    return Role.objects.filter(employer__person=self).order_by("id")
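Since Django's cached_property stores its result on the instance, deleting the attribute is enough to clear it without re-fetching the whole object; a small sketch using the same roles property:
>>> del p.roles    # clears the cached_property value
>>> p.roles[0]     # the queryset is re-evaluated on the next access
<Role: 16888650>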

What is the most efficient way to iterate django objects updating them?

So I have a queryset to update
stories = Story.objects.filter(introtext="")
for story in stories:
    # just set it to the first 'sentence'
    story.introtext = story.content[0:(story.content.find('.'))] + ".</p>"
    story.save()
The save() operation completely kills performance, and in the process list there are multiple entries for "./manage.py shell" (yes, I ran this through the Django shell).
However, in the past I've run scripts that didn't need to call save(), as they were changing a many-to-many field. Those scripts were very performant.
My project has this code, which could be relevant to why these scripts were so good.
@receiver(signals.m2m_changed, sender=Story.tags.through)
def save_story(sender, instance, action, reverse, model, pk_set, **kwargs):
    instance.save()
What is the best way to update a large queryset (10000+) efficiently?
Since the new introtext value depends on the content field of the object, you can't do a plain bulk update. But you can speed up saving the list of individual objects by wrapping it in a transaction:
from django.db import transaction

with transaction.atomic():
    stories = Story.objects.filter(introtext='')
    for story in stories:
        introtext = story.content[0:(story.content.find('.'))] + ".</p>"
        Story.objects.filter(pk=story.pk).update(introtext=introtext)
transaction.atomic() will increase the speed by an order of magnitude.
The filter(pk=story.pk).update() trick avoids the pre_save/post_save signals that are emitted by a plain save(). This is the officially recommended way of updating a single field of an object.
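On Django 2.2+ there is also QuerySet.bulk_update(), which likewise bypasses per-object save() signals and writes many rows per query; a minimal sketch of that approach (the batch_size of 500 is an arbitrary choice):
from django.db import transaction

stories = list(Story.objects.filter(introtext=''))
for story in stories:
    story.introtext = story.content[0:story.content.find('.')] + ".</p>"

with transaction.atomic():
    Story.objects.bulk_update(stories, ['introtext'], batch_size=500)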
You can use the update() method directly on a queryset.
Example:
MyModel.objects.all().update(color='red')
In your case the new value has to be computed from the instance's own content field. An F() expression (read more here) refers to a column, but it cannot be sliced like a Python string, so the single-query version has to be written with database functions such as Left, StrIndex and Concat (available in recent Django versions):
from django.db.models import Value
from django.db.models.functions import Concat, Left, StrIndex

Story.objects.filter(introtext__exact='').update(
    introtext=Concat(Left('content', StrIndex('content', Value('.'))), Value('</p>'))
)

Django batching/bulk update_or_create?

I have data in the database which needs updating periodically. The source of the data returns everything that's available at that point in time, so will include new data that is not already in the database.
As I loop through the source data I don't want to be making 1000s of individual writes if possible.
Is there anything such as update_or_create but works in batches?
One thought was using update_or_create in combination with manual transactions, but I'm not sure if that just queues up the individual writes or if it would combine it all into one SQL insert?
Or, similarly, could using @commit_on_success() on a function with update_or_create inside the loop work?
I am not doing anything with the data other than translating it and saving it to a model. Nothing is dependent on that model existing during the loop.
Since Django added support for bulk_update, this is now somewhat possible, though you need to do 3 database calls (a get, a bulk create, and a bulk update) per batch. It's a bit challenging to make a good interface to a general purpose function here, as you want the function to support both efficient querying as well as the updates. Here is a method I implemented that is designed for bulk update_or_create where you have a number of common identifying keys (which could be empty) and one identifying key that varies among the batch.
This is implemented as a method on a base model, but can be used independently of that. This also assumes that the base model has an auto_now timestamp on the model named updated_on; if this is not the case, the lines of the code that assume this have been commented for easy modification.
In order to use this in batches, chunk your updates into batches before calling it. This is also a way to get around data that can have one of a small number of values for a secondary identifier without having to change the interface.
from django.db import models, transaction

class BaseModel(models.Model):
    updated_on = models.DateTimeField(auto_now=True)

    @classmethod
    def bulk_update_or_create(cls, common_keys, unique_key_name, unique_key_to_defaults):
        """
        common_keys: {field_name: field_value}
        unique_key_name: field_name
        unique_key_to_defaults: {field_value: {field_name: field_value}}

        ex. Event.bulk_update_or_create(
            {"organization": organization}, "external_id", {1234: {"started": True}}
        )
        """
        with transaction.atomic():
            filter_kwargs = dict(common_keys)
            filter_kwargs[f"{unique_key_name}__in"] = unique_key_to_defaults.keys()
            existing_objs = {
                getattr(obj, unique_key_name): obj
                for obj in cls.objects.filter(**filter_kwargs).select_for_update()
            }

            create_data = {
                k: v for k, v in unique_key_to_defaults.items() if k not in existing_objs
            }
            for unique_key_value, obj in create_data.items():
                obj[unique_key_name] = unique_key_value
                obj.update(common_keys)
            creates = [cls(**obj_data) for obj_data in create_data.values()]
            if creates:
                cls.objects.bulk_create(creates)

            # This set should contain the name of the `auto_now` field of the model
            update_fields = {"updated_on"}
            updates = []
            for key, obj in existing_objs.items():
                obj.update(unique_key_to_defaults[key], save=False)
                update_fields.update(unique_key_to_defaults[key].keys())
                updates.append(obj)
            if existing_objs:
                cls.objects.bulk_update(updates, update_fields)
            return len(creates), len(updates)

    def update(self, update_dict=None, save=True, **kwargs):
        """Helper method to update objects."""
        if not update_dict:
            update_dict = kwargs
        # This set should contain the name of the `auto_now` field of the model
        update_fields = {"updated_on"}
        for k, v in update_dict.items():
            setattr(self, k, v)
            update_fields.add(k)
        if save:
            self.save(update_fields=update_fields)
Example usage:
class Event(BaseModel):
    organization = models.ForeignKey(Organization, on_delete=models.CASCADE)
    external_id = models.IntegerField(unique=True)
    started = models.BooleanField()

organization = Organization.objects.get(...)
updates_by_external_id = {
    1234: {"started": True},
    2345: {"started": True},
    3456: {"started": False},
}
Event.bulk_update_or_create(
    {"organization": organization}, "external_id", updates_by_external_id
)
Possible Race Conditions
The code above leverages a transaction and select-for-update to prevent race conditions on updates. There is, however, a possible race condition on inserts if two threads or processes are trying to create objects with the same identifiers.
The easy mitigation is to ensure that the combination of your common_keys and your unique_key is a database-enforced uniqueness constraint (which is the intended use of this function). This can be achieved either with the unique_key referencing a field with unique=True, or with the unique_key combined with a subset of the common_keys enforced as unique together by a UniqueConstraint. With database-enforced uniqueness protection, if multiple threads are trying to perform conflicting creates, all but one will fail with an IntegrityError. Due to the enclosing transaction, threads that fail will perform no changes and can be safely retried or ignored (a conflicting create that failed could just be treated as a create that happened first and then was immediately overwritten).
If leveraging uniqueness constraints is not possible, then you will either need to implement your own concurrency control or lock the entire table.
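As an illustration of the unique-together variant of that mitigation, using the field names from the Event example above (the constraint name is a hypothetical choice):
class Event(BaseModel):
    organization = models.ForeignKey(Organization, on_delete=models.CASCADE)
    external_id = models.IntegerField()
    started = models.BooleanField()

    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=['organization', 'external_id'],
                name='unique_event_per_organization',
            ),
        ]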
As of Django 4.1, the bulk_create method supports upserts via update_conflicts, which is the single-query, batch equivalent of update_or_create:
class Foo(models.Model):
    a = models.IntegerField(unique=True)
    b = models.IntegerField()

objs = [Foo(a=1, b=1), Foo(a=1, b=2)]
Foo.objects.bulk_create(
    objs,
    update_conflicts=True,
    unique_fields=['a'],
    update_fields=['b'],
)
Batching your updates is going to require an upsert command, and like @imposeren said, Postgres 9.5 gives you that ability. I think MySQL 5.7 does as well (see http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html), depending on your exact needs. That said, it's probably easiest to just use a db cursor. Nothing wrong with that, it's there for when the ORM just isn't enough.
Something along these lines should work. It's pseudo-ish code, so don't just cut-and-paste it, but the concept is there for you.
import itertools

from django.db import connection, transaction

class GroupByChunk(object):
    def __init__(self, size):
        self.count = 0
        self.size = size
        self.toggle = False

    def __call__(self, *args, **kwargs):
        if self.count >= self.size:    # Allows for size 0
            self.toggle = not self.toggle
            self.count = 0
        self.count += 1
        return self.toggle

def batch_update(db_results, upsert_sql):
    with transaction.atomic():
        cursor = connection.cursor()
        for _, chunk in itertools.groupby(db_results, GroupByChunk(size=1000)):
            cursor.executemany(upsert_sql, list(chunk))
Assumptions here are:
db_results is some kind of results iterator, either in a list or dictionary
A result from db_results can be fed directly into a raw sql exec statement
If any of the batch updates fail, you'll be rolling back ALL of them. If you want that to happen per chunk instead, just push the with block down into the loop (see the sketch below).
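A sketch of that per-chunk variant of the pseudo-code above, so a failure only rolls back the current chunk:
def batch_update_per_chunk(db_results, upsert_sql):
    cursor = connection.cursor()
    for _, chunk in itertools.groupby(db_results, GroupByChunk(size=1000)):
        # Each chunk commits or rolls back independently.
        with transaction.atomic():
            cursor.executemany(upsert_sql, list(chunk))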
There is a django-bulk-update-or-create library for Django that can do that.
I have been using @Zags' answer and I think it's the best solution. But I'd like to point out a little issue in his code.
update_fields = {"updated_on"}
updates = []
for key, obj in existing_objs.items():
obj.update(unique_key_to_defaults[key], save=False)
update_fields.update(unique_key_to_defaults[key].keys())
updates.append(obj)
if existing_objs:
cls.objects.bulk_update(updates, update_fields)
If you are using auto_now=True fields, they are not going to be updated when you use .update() or bulk_update(), because the auto_now behaviour is only triggered by .save(), as you can read in the documentation.
In case you have an auto_now field, e.g. updated_on, it is better to add it explicitly to the unique_key_to_defaults dict:
"unique_value" : {
"field1.." : value...,
"updated_on" : timezone.now()
}...
