I'm trying to determine whether a simple caching trick will actually be useful. I know Django querysets are lazy (to improve efficiency), but I'm wondering whether they save the result of their query once the data has been fetched.
For instance, if I have two models:
class Klass1(models.Model):
    k2 = models.ForeignKey('Klass2')

class Klass2(models.Model):
    # Model code ...

    @property
    def klasses(self):
        self.klasses = Klass1.objects.filter(k2=self)
        return self.klasses
If I access klass_2_instance.klasses[:] somewhere, the database is hit and the results are returned. I'm wondering: if I then call klass_2_instance.klasses again, will the database be accessed a second time, or will the query save the result from the first call?
Django will not cache it for you.
Instead of Klass1.objects.filter(k2=self), you could just do self.klass1_set.all(), because Django automatically creates that reverse accessor (klass1_set) for the 'many' side of a 1-n relation.
I suspect this kind of caching is complicated to do generically, because the cache would have to account for every filter, exclude and order_by applied. It could be done with a well-designed hash of the query, but you would at least want a parameter to disable the cache. If you do want some caching, you could do:
class Klass2(models.Model):
    def __init__(self, *args, **kwargs):
        self._klass1_cache = None
        super(Klass2, self).__init__(*args, **kwargs)

    def klasses(self):
        if self._klass1_cache is None:
            # The list(...) call can't be removed: it forces the query
            # to execute exactly once, so the cache holds real results.
            self._klass1_cache = list(self.klass1_set.all())
        return self._klass1_cache
This is very useful when you loop over all the related objects more than once. For me that often happens in templates, where I need to loop over the same relation several times.
This query isn't cached by Django.
The forward FK relationship - i.e. given a Klass1 object klass, doing klass.k2 - is cached after the first lookup. But the reverse, which is what you're doing here - and which is usually spelled klass2.klass1_set.all() - is not cached.
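For illustration, a rough sketch of the difference (the pk values are made up):

klass = Klass1.objects.get(pk=1)
klass.k2   # first access hits the database
klass.k2   # cached on the instance - no second query

klass2 = Klass2.objects.get(pk=1)
klass2.klass1_set.all()  # hits the database
klass2.klass1_set.all()  # hits the database again - reverse lookups aren't cached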
You can easily memoize it:
@property
def klasses(self):
    if not hasattr(self, '_klasses'):
        self._klasses = self.klass1_set.all()
    return self._klasses
(Note that your existing code won't work: inside the property you assign to self.klasses, and you can't shadow a property with an instance attribute - it raises an AttributeError.)
Try using johnny-cache if you want transparent caching of querysets.
Let's say there is a Django model called TaskModel which has a field priority. We want to insert a new element at a given priority, incrementing the priority of the existing element that already holds that priority, as well as the priorities of all elements after it.
priority is just a numeric field without any special flags like unique or primary/foreign key.
queryset = models.TaskModel.objects.filter().order_by('priority')
Can this be done in a smart way with some methods on the Queryset itself?
I believe you can do this by using Django's F expressions and overriding the model's save method. I guess you could instead override the model's __init__ method as in this answer, but I think using the save method is best.
from django.db.models import F

class TaskModel(models.Model):
    task = models.CharField(max_length=20)
    priority = models.IntegerField()

    # Override the save method so that this runs whenever a new
    # TaskModel object is added.
    def save(self, *args, **kwargs):
        # First get all TaskModels with priority greater than or
        # equal to the priority of the new task you are adding.
        queryset = TaskModel.objects.filter(priority__gte=self.priority)
        # Use update() with an F expression to increase the priorities
        # of all the tasks at or above the one you're adding.
        queryset.update(priority=F('priority') + 1)
        # Finally, call the super method to run the model's
        # actual save().
        super(TaskModel, self).save(*args, **kwargs)

    def __str__(self):
        return self.task
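For illustration, roughly how inserts would behave with this override in place (the task names are made up):

# Inserting at an occupied priority shifts that task and
# everything after it down by one.
TaskModel.objects.create(task='first', priority=1)
TaskModel.objects.create(task='second', priority=2)
TaskModel.objects.create(task='new', priority=1)
# Now 'first' has priority 2 and 'second' has priority 3.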
Keep in mind that this can create gaps in the priorities. For example, what if you create a task with priority 5, then delete it, then add another task with priority 5? I think the only way to handle that would be to loop through the queryset in your view, perhaps with a function like the one below, and call it whenever a new task is created or its priority is modified:
# tasks is the queryset of all tasks, i.e. TaskModel.objects.all()
def reorder_tasks(tasks):
    for i, task in enumerate(tasks):
        task.priority = i + 1
        task.save()
This method is not nearly as efficient, but it will not create the gaps. For this method, you would not change the TaskModel at all.
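If you're on Django 2.2 or later, where bulk_update() is available, one way to speed that loop up is to assign all the priorities in memory and write them back in a single query - a sketch, assuming the same TaskModel as above:

def reorder_tasks(tasks):
    tasks = list(tasks)  # evaluate the queryset once
    for i, task in enumerate(tasks):
        task.priority = i + 1
    # One UPDATE round trip instead of one save() per task.
    # Note that bulk_update() does not call save() or emit signals.
    TaskModel.objects.bulk_update(tasks, ['priority'])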
Or perhaps you could override the delete method of TaskModel as well, as shown in this answer, but I haven't had a chance to test that yet.
EDIT
Short Version
I don't know how to delete objects with a method similar to the save override while still preventing gaps in the priorities. I would just use a loop as shown above.
Long version
I knew there was something different about deleting objects like this:
def delete(self, *args, **kwargs):
    queryset = TaskModel.objects.filter(priority__gt=self.priority)
    queryset.update(priority=F('priority') - 1)
    super(TaskModel, self).delete(*args, **kwargs)
This will work, in some situations.
According to the docs on delete():
Keep in mind that this [calling delete()] will, whenever possible, be executed purely in SQL, and so the delete() methods of individual object instances will not necessarily be called during the process. If you’ve provided a custom delete() method on a model class and want to ensure that it is called, you will need to “manually” delete instances of that model (e.g., by iterating over a QuerySet and calling delete() on each object individually) rather than using the bulk delete() method of a QuerySet.
So if you delete() a TaskModel object through the admin panel, the custom delete() written above will never even be called. It should work when you delete an instance directly, for example in a view, but because it acts directly on the database, the change will not show up in Python until you refresh the queryset:
tasks = TaskModel.objects.order_by('priority')
for t in tasks:
    print(t.task, t.priority)

tr = TaskModel.objects.get(task='three')
tr.delete()

# Here I need to fetch the queryset AGAIN ...
tasks = TaskModel.objects.order_by('priority')
# ... BEFORE looping over it
for t in tasks:
    print(t.task, t.priority)
# to see the effect
If you still want to do it, I again refer to this answer to see how to handle it.
I am currently implementing soft deletion for all models in my database. The idea is that when an instance gets deleted, it actually gets archived with all of its children. If the user tries to create an instance that is identical to the archived one, the archived one gets undeleted along with all of its children instead of creating a new instance.
To do this, I am using django-safedelete, where I am making a BaseModel with an overridden save() method that looks something like this:
def save(self, *args, **kwargs):
    # get the foreign key id
    foreign_key_id = self.foreign_field.id
    # execute a query by that id and some other params
    '''I don't know how to do this'''
As for how to do it, I thought I could construct a kwargs dictionary consisting of <field>: value pairs, where <field> = self._meta.get_field(some_field.name) and value = getattr(self, some_field.name).
So how do I add the foreign_key_id to kwargs? I know there is this syntax: Model.objects.filter(foreign_field__id=value)
...but I don't know how to replicate that to put into kwargs the way I'm doing it.
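One way this could look, since keyword lookups are just dictionary keys (so filter(foreign_field__id=value) is equivalent to filter(**{'foreign_field__id': value})) - a sketch, with foreign_field standing in for the real FK name and the loop being an assumption about the models, not tested code:

# Build filter kwargs dynamically, FK included, to look for an
# archived row identical to self.
kwargs = {}
for field in self._meta.concrete_fields:
    if field.primary_key:
        continue  # skip the pk: we want an identical row, not this row
    # For a ForeignKey, field.attname is e.g. 'foreign_field_id',
    # which filters on the raw id without a join.
    kwargs[field.attname] = getattr(self, field.attname)

existing = type(self).objects.filter(**kwargs).first()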
Likewise, is there a better way to do this in general? I don't want to hard-code too many things, which is why I didn't just do this individually for each of the models that I have.
Thank you so much in advance.
So I have a queryset to update
stories = Story.objects.filter(introtext="")
for story in stories:
    # just set it to the first 'sentence'
    story.introtext = story.content[0:(story.content.find('.'))] + ".</p>"
    story.save()
The save() operation completely kills performance, and the process list shows multiple entries for ./manage.py shell (yes, I ran this through the Django shell). However, in the past I've run scripts that didn't need save(), because they were changing a many-to-many field, and those scripts were very performant.
My project has this code, which could be relevant to why these scripts were so good.
@receiver(signals.m2m_changed, sender=Story.tags.through)
def save_story(sender, instance, action, reverse, model, pk_set, **kwargs):
    instance.save()
What is the best way to update a large queryset (10000+) efficiently?
Since the new introtext value depends on the content field of each object, you can't do a bulk update. But you can speed up saving the list of individual objects by wrapping it in a transaction:
from django.db import transaction

with transaction.atomic():
    stories = Story.objects.filter(introtext='')
    for story in stories:
        introtext = story.content[0:(story.content.find('.'))] + ".</p>"
        Story.objects.filter(pk=story.pk).update(introtext=introtext)
transaction.atomic() will increase the speed by an order of magnitude.
The filter(pk=story.pk).update(...) trick lets you skip the pre_save/post_save signals that are emitted by a plain save(). This is the officially recommended way of updating a single field on an object.
You can use the built-in update() method on a queryset.
Example:
MyModel.objects.all().update(color='red')
In your case you would normally reach for F() expressions (read more here) to refer to the instance's own fields, but F() by itself doesn't support Python string slicing or .find(), so an update built along those lines won't actually run. You can, however, push the same transformation into the database with Django's string functions.
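A sketch of that approach using Substr, StrIndex and Concat (StrIndex requires Django 2.0+; the field names are taken from the question):

from django.db.models import Value
from django.db.models.functions import Concat, StrIndex, Substr

# Take each story's content up to and including the first '.',
# append '</p>', and write it to introtext in a single UPDATE.
# Caveat: if content contains no '.', StrIndex returns 0 and the
# row would end up with just '</p>'.
Story.objects.filter(introtext__exact='').update(
    introtext=Concat(
        Substr('content', 1, StrIndex('content', Value('.'))),
        Value('</p>'),
    )
)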
So QuerySets are "lazy" and only run in certain instances (repr,list,etc.). I have created a custom QuerySet for doing many queries, but these could end up having millions of items!! That is way more than I want to return.
When returning the evaluated QuerySet, it should not have more than 25 results! Now I know I could do the following:
first = example.objects.filter(...)
last = first.filter(...)
result = last[:25]
# do stuff with result
but I will be doing so many queries with example objects that the extra result = last[:25] line feels unnecessary. Is there a way to specify how a QuerySet is returned? If there is, how can I change it so that whenever the QuerySet is evaluated it only returns the first x items, where in this case x = 25?
Important note:
the slicing must happen at evaluation time, so that I can still chain queries without limiting the intermediate results; only a result returned upon evaluation should be capped at x.
I'm not sure I understand what your issue with slicing the queryset is. If it's the extra line of code or the hardcoded number that's bothering you, you can run
example.objects.filter(**filters)[:x]
and pass x into whatever method you're using.
You can write a custom Manager:
class LimitedNumberOfResultsManager(models.Manager):
    def get_queryset(self):
        return super(LimitedNumberOfResultsManager, self).get_queryset()[:25]
Note: you may think that adding a slice here immediately evaluates the queryset. It doesn't. Information about the query limit is saved on the underlying Query object and only used later, during the final evaluation, as long as it is not narrowed further by another slice in the meantime.
Then add the manager to your model:
class YourModel(models.Model):
    # ...
    objects = LimitedNumberOfResultsManager()
After setting this up, YourModel.objects.all() will always return only up to 25 results. Two caveats, though: Django refuses to filter a queryset once a slice has been taken, so chaining filter() onto this manager (e.g. YourModel.objects.filter(lorem='ipsum')) will raise an error rather than return 25 rows; and slicing again, as in YourModel.objects.all()[:100], takes a slice of the already-sliced queryset, so it can narrow the limit but never extend it past 25.
One more point: overriding the default manager like this may confuse other people reading your code, so I think it is better to leave the default manager alone and add the custom one as an optional alternative:
class YourModel(models.Model):
    # ...
    objects = models.Manager()
    limited = LimitedNumberOfResultsManager()
With this setup, this will return all results:
YourModel.objects.all()
and this will return only up to 25 results:
YourModel.limited.all()
Depending on your exact use case you may also want to look at pagination in Django.
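A minimal sketch of the pagination route, using Django's built-in Paginator (the page size and ordering are arbitrary choices here):

from django.core.paginator import Paginator

queryset = YourModel.objects.order_by('pk')  # pagination needs a stable ordering
paginator = Paginator(queryset, 25)          # 25 items per page

page = paginator.page(1)        # a Page object holding the first 25 results
for obj in page.object_list:
    print(obj)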
In the following example, cached_attr is used to get or set an attribute on a model instance when a database-expensive property (related_spam in the example) is called. In the example, I use cached_spam to save queries, and I added print statements on setting and on getting values so that I could test it out.

I tested it in a view by passing an Egg instance into the view and using {{ egg.cached_spam }}, as well as other methods on the Egg model that call cached_spam themselves. The shell output in Django's development server showed that the attribute cache was missed several times, as well as successfully hit several times. It seems to be inconsistent: with the same data, when I made small changes (as little as changing a print statement's string) and refreshed, different numbers of misses / hits happened. How and why is this happening? Is this code incorrect or highly problematic?
class Egg(models.Model):
    # ... fields

    @property
    def related_spam(self):
        # Each time this property is called the database is queried (expected).
        return Spam.objects.filter(egg=self).all()  # Spam has a foreign key to Egg.

    @property
    def cached_spam(self):
        # This should call self.related_spam the first time, and then return
        # cached results every time after that.
        return self.cached_attr('related_spam')

    def cached_attr(self, attr):
        """This method (normally attached via an abstract base class, but put
        directly on the model for this example) attempts to return a cached
        version of a requested attribute, and calls the actual attribute when
        the cached version isn't available."""
        try:
            value = getattr(self, '_p_cache_{0}'.format(attr))
            print('GETTING - {0}'.format(value))
        except AttributeError:
            value = getattr(self, attr)
            print('SETTING - {0}'.format(value))
            setattr(self, '_p_cache_{0}'.format(attr), value)
        return value
Nothing wrong with your code, as far as it goes. The problem probably isn't there, but in how you use that code.
The main thing to realise is that model instances don't have identity. That means that if you instantiate an Egg object somewhere, and a different one somewhere else, even if they refer to the same underlying database row they won't share internal state. So calling cached_attr on one won't cause the cache to be populated in the other.
For example, assuming you have a RelatedObject class with a ForeignKey to Egg:
my_first_egg = Egg.objects.get(pk=1)
my_related_object = RelatedObject.objects.get(egg__pk=1)
my_second_egg = my_related_object.egg
Here my_first_egg and my_second_egg both refer to the database row with pk 1, but they are not the same object:
>>> my_first_egg.pk == my_second_egg.pk
True
>>> my_first_egg is my_second_egg
False
So, filling the cache on my_first_egg doesn't fill it on my_second_egg.
And, of course, objects won't persist across requests (unless they're specifically made global, which is horrible), so the cache won't persist either.
HTTP servers that scale are shared-nothing; you can't rely on anything being a singleton. To share state, you need to connect to a special-purpose service.
Django's caching support is appropriate for your use case. It isn't necessarily a global singleton either; if you use locmem://, it will be process-local, which could be the more efficient choice.
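A rough sketch of what that could look like with the cache framework (the key scheme and timeout are arbitrary choices, not anything Django prescribes):

from django.core.cache import cache

def get_spam_for_egg(egg_pk, timeout=300):
    """Fetch the Spam rows for an Egg, cached across instances and
    requests (subject to the configured cache backend)."""
    key = 'egg-spam-{0}'.format(egg_pk)
    value = cache.get(key)
    if value is None:
        # list(...) forces evaluation so real results are cached,
        # not a lazy queryset.
        value = list(Spam.objects.filter(egg__pk=egg_pk))
        cache.set(key, value, timeout)
    return value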