Should you check before setting a Django model's field?

I have a background service that imports django, and uses my django project's ORM directly.
It monitors something, in a loop, very often - every few seconds.
It goes through every user in my database, checks a condition, and based on that, sets a flag to either be True or False. I might have thousands of users in the database, so the efficiency here can add up.
while True:
    time.sleep(5)
    for user in User.objects.all():
        if user.check():
            user.flag = True
        else:
            user.flag = False
        user.save()
I'm using MySQL as my database.
What I'm curious about is this: if a particular user has .flag set to True, am I doing a disk write every time I run user.flag = True; user.save(), even though nothing changed? Or is Django or MySQL smart enough not to do a disk write if nothing changed?
I assume a MySQL read operation is less expensive than a write operation. Would it make more sense to check the value of user.flag, and only try to set user.flag if the value actually changed? This would essentially be exchanging a database read for a database write, from what I understand (except in cases where something actually changed, in which case, first a read is performed, and then a write).
Note: This is just a basic example. I'm not actually dealing with users.

If you can move the logic of check into a .filter() clause, that would be best. That way you could do:
User.objects.filter(match_check, flag=False).update(flag=True)
User.objects.filter(~Q(match_check), flag=True).update(flag=False)
Or if you could annotate the value:
User.objects.annotate(
    new_flag=some_check,
).exclude(flag=F('new_flag')).update(flag=F('new_flag'))
Otherwise if you can't then maybe do it like this:
new_flag = user.check()
if new_flag != user.flag:
    user.flag = new_flag
    user.save(update_fields=['flag'])  # This will only update that single column.
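For reference, here is a minimal sketch of what the bulk filter/update approach from the top of this answer could look like inside the polling loop. The needs_flag Q object and the import path are illustrative assumptions, not part of the original code:

import time

from django.db.models import Q

from myapp.models import User  # hypothetical import for the placeholder model

# Hypothetical condition expressed as a Q object; substitute the real check.
needs_flag = Q(profile__is_active=True)

while True:
    time.sleep(5)
    # Two bulk UPDATEs instead of a SELECT plus an UPDATE per user,
    # and rows whose flag is already correct are never touched.
    User.objects.filter(needs_flag, flag=False).update(flag=True)
    User.objects.filter(~needs_flag, flag=True).update(flag=False)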

Related

Django - lazy queries and efficiency

Is there a way to speed this up?
section_list_active = Section.objects.all().filter(course=this_course, Active=True).filter(teacher__in=[this_teacher]).order_by('pk')
if section_list_active.count() > 0:
    active_section_id = section_list_active.first().pk
else:
    section_list = Section.objects.all().filter(course=this_course).filter(teacher__in=[this_teacher]).order_by('pk')
    active_section_id = section_list.first().pk
I know that the querysets aren't evaluated until they're used (the .first().pk bit?), so is there then no faster way to do this? If they weren't lazy, I could hit the DB for the whole list of sections, then just filter that set for ones with the property Active, but I'm not seeing how I can do this without hitting it twice (assuming that I have to go to the else).
Given you are only interested in the Section (or the corresponding pk), yes. You can order by the Active field first:
active_section_id = Section.objects.filter(
    course=this_course,
    teacher=this_teacher
).order_by('-Active', 'pk').first().pk
or even a bit more elegant:
active_section_id = Section.objects.filter(
    course=this_course,
    teacher=this_teacher
).earliest('-Active', 'pk').pk
We thus first order by Active in descending order, so that if there is a row with Active=True, it will be sorted to the top. We then take the .first() object and, optionally, its pk.
In case there is no Section that satisfies the filter conditions, .first() will return None, so fetching the pk attribute will raise an AttributeError (while .earliest() will instead raise a DoesNotExist exception).
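If the no-match case can actually occur, a small guard avoids that exception. A minimal sketch, reusing the queryset from above:

section = Section.objects.filter(
    course=this_course,
    teacher=this_teacher
).order_by('-Active', 'pk').first()
# first() returns None when nothing matches, so guard before reading pk.
active_section_id = section.pk if section is not None else None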
Note: typically fields are written in lowercase, so active instead of Active.

Django model reload_from_db() vs. explicitly recalling from db

If I have an object retrieved from a model, for example:
obj = Foo.objects.first()
I know that if I want to reference this object later and make sure that it has the current values from the database, I can call:
obj.refresh_from_db()
My question is, is there any advantage to using the refresh_from_db() method over simply doing?:
obj = Foo.objects.get(id=obj.id)
As far as I know, the result will be the same. refresh_from_db() seems more explicit, but in some cases it means an extra line of code. Let's say I update the value field for obj and later want to test that it has been updated to False. Compare:
obj = Foo.objects.first()
assert obj.value is True
# value of foo obj is updated somewhere to False and I want to test below
obj.refresh_from_db()
assert obj.value is False
with this:
obj = Foo.objects.first()
assert obj.value is True
# value of foo obj is updated somewhere to False and I want to test below
assert Foo.objects.get(id=obj.id).value is False
I am not interested in a discussion of which of the two is more pythonic. Rather, I am wondering if one method has a practical advantage over the other in terms of resources, performance, etc. I have read this bit of documentation, but I was not able to ascertain from that whether there is an advantage to using refresh_from_db(). Thank you!
Django sources are usually relatively easy to follow. If we look at the refresh_from_db() implementation, at its core it is still using this same Foo.objects.get(id=obj.id) approach:
db_instance_qs = self.__class__._default_manager.using(db).filter(pk=self.pk)
...
db_instance_qs = db_instance_qs.only(*fields)
...
db_instance = db_instance_qs.get()
Only there are a couple of extra bells and whistles:
deferred fields are ignored
stale foreign key references are cleared (according to the comment explanation)
So for everyday usage it is safe to say that they are pretty much the same, use whatever you like.
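One small practical difference: refresh_from_db() takes an optional fields argument, so you can reload just the columns you care about without writing the query yourself. A minimal sketch using the Foo model from the question:

obj = Foo.objects.first()
# ... obj.value is changed elsewhere ...
obj.refresh_from_db(fields=['value'])  # reloads only the "value" column, in place
assert obj.value is False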
Just to add to @serg's answer, there's a case where explicitly re-fetching from the db is helpful and refreshing from the db isn't as useful.
This is the case when you're adding permissions to an object and checking them immediately afterwards, and you need to clear the cached permissions for the object so that your permission checks work as expected.
According to the permission caching section of the django documentation:
The ModelBackend caches permissions on the user object after the first time they need to be fetched for a permissions check. This is typically fine for the request-response cycle since permissions aren’t typically checked immediately after they are added (in the admin, for example). If you are adding permissions and checking them immediately afterward, in a test or view for example, the easiest solution is to re-fetch the user from the database...
For an example, consider this block of code inspired by the one in the documentation cited above:
from django.contrib.auth import get_user_model
from django.contrib.auth.models import Permission
from django.contrib.contenttypes.models import ContentType
from smoothies.models import Smoothie
def force_deblend(user, smoothie):
    # Any permission check will cache the current set of permissions
    if not user.has_perm('smoothies.deblend_smoothie'):
        permission = Permission.objects.get(
            codename='deblend_smoothie',
            content_type=ContentType.objects.get_for_model(Smoothie)
        )
        user.user_permissions.add(permission)

        # Subsequent permission checks hit the cached permission set
        print(user.has_perm('smoothies.deblend_smoothie'))  # False

        # Re-fetch user (explicitly) from db to clear permissions cache
        # Be aware that user.refresh_from_db() won't help here
        user = get_user_model().objects.get(pk=user.pk)

        # Permission cache is now repopulated from the database
        print(user.has_perm('smoothies.deblend_smoothie'))  # True

        ...
It seems there is a difference if you use cached properties.
See here:
>>> p.roles[0]
<Role: 16888649>
>>> p.refresh_from_db()
>>> p.roles[0]
<Role: 16888649>
>>> p = Person.objects.get(id=p.id)
>>> p.roles[0]
<Role: 16888650>
Definition from models.py:
@cached_property
def roles(self):
    return Role.objects.filter(employer__person=self).order_by("id")
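As a side note (not from the original answer), a cached_property can also be invalidated on the existing instance by deleting the cached attribute, which avoids re-fetching the whole object:

# cached_property stores the value in the instance's __dict__,
# so deleting the attribute forces roles() to run again on next access.
try:
    del p.roles
except AttributeError:
    pass  # nothing cached yet
p.roles[0]  # re-evaluated against the database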

Update a field in a django model only if it needs updating

Suppose I have some django model and I'm updating an instance
def modify_thing(id, new_blah):
    mything = MyModel.objects.get(pk=id)
    mything.blah = new_blah
    mything.save()
My question is, if it happened that it was already the case that mything.blah == new_blah, does django somehow know this and not bother to save this [non-]modification again? Or will it always go into the db (MySQL in my case) and update data?
If I want to avoid an unnecessary write, does it make any sense to do something like:
if mything.blah != new_blah:
    mything.blah = new_blah
    mything.save()
Given that the record would have to be read from db anyway in order to do the comparison in the first place? Is there any efficiency to be gained from this sort of construction - and if so, is there a less ugly way of doing that than with the if statement in python?
You can use Django Signals to ensure that code like the one you just posted doesn't write to the db. Take a look at pre_save, that's the signal you're looking for.
Given that Django does not cache the values, a trip to the DB is inevitable; you have to fetch the record to compare the value. And there are definitely less ugly ways to do that. You could do it as:
if mything.blah == new_blah:
    pass  # Do nothing
else:
    mything.blah = new_blah
    mything.save()
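If you want to avoid the comparison in Python entirely, the check and the write can also be pushed into a single UPDATE. This is a sketch of that idea, not part of the original answer:

def modify_thing(id, new_blah):
    # The UPDATE only touches rows where blah actually differs,
    # so nothing is written when the value is already new_blah.
    MyModel.objects.filter(pk=id).exclude(blah=new_blah).update(blah=new_blah)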

Is there a better way to deal with DoesNotExist query sets

There is probably a better way of dealing with non-existent query sets...!
The problem I have with this code is that it raises an exception in what should be the normal case: when there is no workspace with the same name in the db.
But instead of handling an exception, I would like to go for a query that does not raise DoesNotExist but returns True or False.
My inelegant code:
try:
    is_workspace_name = Workspace.objects.get(workspace_name=workspace_name, user=self.user.id)
except:
    return workspace_name
if is_workspace_name:
    raise forms.ValidationError(u'%s already exists as a workspace name! Please choose a different one!' % workspace_name)
Thanks a lot!
You can use the exists() method. Quoting the docs:
Returns True if the QuerySet contains any results, and False if not.
This tries to perform the query in the simplest and fastest way
possible, but it does execute nearly the same query as a normal
QuerySet query.
Remark: the simplest and fastest way. It is cheaper to use exists() than count() because with exists() the database can stop at the first match.
if Workspace.objects.filter(workspace_name=workspace_name,
                            user=self.user.id).exists():
    raise forms.ValidationError(u'%s already exists ...!' % workspace_name)
else:
    return workspace_name
Checking for the existence of a record.
If you want to test for the existence of a record in your database, you could use Workspace.objects.filter(workspace_name=workspace_name, user=self.user.id).count().
This will return the number of records matching your conditions. This number will be 0 in case there is none, which is readily usable in an if clause. I believe this to be the most standard and easy way to do what you need here.
EDIT: Actually that's not the best approach; you might want to check danihp's answer for a better solution using QuerySet.exists()!
A word of warning: the case of checking for existence before insertion
Be cautious when using such a construct however, especially if you plan on checking whether you have a duplicate before trying to insert a record. In such a case, the best solution is to try to create the record and see if it raises an exception.
Indeed, you could be in the following situation:
Request 1 reaches the server
Request 2 reaches the server
Check is done for request 1, no object exist.
Check is done for request 2, no object exist.
Proceed with creation in request 1.
Proceed with creation in request 2.
And... you have a duplicate - this is called a race condition, and is a common issue when dealing with parallel code.
Long story short, you should use try/except and unique constraints when dealing with insertion.
Using get_or_create, as suggested by init3, also helps. Indeed, get_or_create is aware of this, and you'll be safe so long as unwanted duplicates would raise an IntegrityError:
obj, created = Workspace.objects.get_or_create(workspace_name=workspace_name, user=self.user.id)
if created:
    # everything ok
    # do something
    pass
else:
    # not ok
    # respond that they should choose something else
    pass
read more at the docs
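For completeness, here is a minimal sketch of the try/except approach mentioned above. It assumes a database-level unique constraint on (workspace_name, user) and a ForeignKey named user on Workspace, neither of which is shown in the original code:

from django.db import IntegrityError

try:
    # The constraint, not a prior SELECT, guarantees uniqueness even under concurrency.
    Workspace.objects.create(workspace_name=workspace_name, user_id=self.user.id)
except IntegrityError:
    raise forms.ValidationError(
        u'%s already exists as a workspace name! Please choose a different one!' % workspace_name
    )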

How do I force Django to ignore any caches and reload data?

I'm using the Django database models from a process that's not called from an HTTP request. The process is supposed to poll for new data every few seconds and do some processing on it. I have a loop that sleeps for a few seconds and then gets all unhandled data from the database.
What I'm seeing is that after the first fetch, the process never sees any new data. I ran a few tests and it looks like Django is caching results, even though I'm building new QuerySets every time. To verify this, I did this from a Python shell:
>>> MyModel.objects.count()
885
# (Here I added some more data from another process.)
>>> MyModel.objects.count()
885
>>> MyModel.objects.update()
0
>>> MyModel.objects.count()
1025
As you can see, adding new data doesn't change the result count. However, calling the manager's update() method seems to fix the problem.
I can't find any documentation on that update() method and have no idea what other bad things it might do.
My question is, why am I seeing this caching behavior, which contradicts what Django docs say? And how do I prevent it from happening?
Having had this problem and found two definitive solutions for it I thought it worth posting another answer.
This is a problem with MySQL's default transaction mode. Django opens a transaction at the start, which means that by default you won't see changes made in the database.
Demonstrate it like this:
Run a django shell in terminal 1
>>> MyModel.objects.get(id=1).my_field
u'old'
And another in terminal 2
>>> MyModel.objects.get(id=1).my_field
u'old'
>>> a = MyModel.objects.get(id=1)
>>> a.my_field = "NEW"
>>> a.save()
>>> MyModel.objects.get(id=1).my_field
u'NEW'
>>>
Back to terminal 1 to demonstrate the problem - we still read the old value from the database.
>>> MyModel.objects.get(id=1).my_field
u'old'
Now in terminal 1 demonstrate the solution
>>> from django.db import transaction
>>>
>>> @transaction.commit_manually
... def flush_transaction():
...     transaction.commit()
...
>>> MyModel.objects.get(id=1).my_field
u'old'
>>> flush_transaction()
>>> MyModel.objects.get(id=1).my_field
u'NEW'
>>>
The new data is now read
Here is that code in an easy to paste block with docstring
from django.db import transaction

@transaction.commit_manually
def flush_transaction():
    """
    Flush the current transaction so we don't read stale data

    Use in long running processes to make sure fresh data is read from
    the database. This is a problem with MySQL and the default
    transaction mode. You can fix it by setting
    "transaction-isolation = READ-COMMITTED" in my.cnf or by calling
    this function at the appropriate moment
    """
    transaction.commit()
The alternative solution is to change my.cnf for MySQL to change the default transaction mode
transaction-isolation = READ-COMMITTED
Note that that is a relatively new feature for MySQL and has some consequences for binary logging / slaving. You could also put this in the Django connection preamble if you wanted.
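If you would rather not touch my.cnf, the isolation level can also be set per connection from the Django settings. A minimal sketch for the MySQL backend, using the init_command connection option (database name here is hypothetical):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',  # hypothetical database name
        'OPTIONS': {
            # Run on every new connection, so each transaction sees committed changes.
            'init_command': "SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED",
        },
    }
}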
Update 3 years later
Now that Django 1.6 has turned on autocommit in MySQL this is no longer a problem. The example above now works fine without the flush_transaction() code whether your MySQL is in REPEATABLE-READ (the default) or READ-COMMITTED transaction isolation mode.
What was happening in previous versions of Django which ran in non autocommit mode was that the first select statement opened a transaction. Since MySQL's default mode is REPEATABLE-READ this means that no updates to the database will be read by subsequent select statements - hence the need for the flush_transaction() code above which stops the transaction and starts a new one.
There are still reasons why you might want to use READ-COMMITTED transaction isolation though. If you were to put terminal 1 in a transaction and you wanted to see the writes from the terminal 2 you would need READ-COMMITTED.
The flush_transaction() code now produces a deprecation warning in Django 1.6 so I recommend you remove it.
We've struggled a fair bit with forcing Django to refresh the "cache" - which it turns out wasn't really a cache at all but an artifact due to transactions. This might not apply to your example, but certainly in Django views, by default, there's an implicit call to a transaction, which MySQL then isolates from any changes that happen from other processes after you start.
We used the @transaction.commit_manually decorator and calls to transaction.commit() just before every occasion where you need up-to-date info.
As I say, this definitely applies to views, not sure whether it would apply to django code not being run inside a view.
detailed info here:
http://devblog.resolversystems.com/?p=439
I'm not sure I'd recommend it...but you can just kill the cache yourself:
>>> qs = MyModel.objects.all()
>>> qs.count()
1
>>> MyModel().save()
>>> qs.count() # cached!
1
>>> qs._result_cache = None
>>> qs.count()
2
And here's a better technique that doesn't rely on fiddling with the innards of the QuerySet: Remember that the caching is happening within a QuerySet, but refreshing the data simply requires the underlying Query to be re-executed. The QuerySet is really just a high-level API wrapping a Query object, plus a container (with caching!) for Query results. Thus, given a queryset, here is a general-purpose way of forcing a refresh:
>>> MyModel().save()
>>> qs = MyModel.objects.all()
>>> qs.count()
1
>>> MyModel().save()
>>> qs.count() # cached!
1
>>> from django.db.models import QuerySet
>>> qs = QuerySet(model=MyModel, query=qs.query)
>>> qs.count() # refreshed!
2
>>> party_time()
Pretty easy! You can of course implement this as a helper function and use as needed.
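For example, a small helper along these lines (a sketch, not part of the original answer) re-wraps the existing Query in a fresh QuerySet, so the old result cache is simply left behind:

from django.db.models import QuerySet

def refreshed(qs):
    # Same underlying Query, brand-new QuerySet: the result cache starts empty.
    return QuerySet(model=qs.model, query=qs.query, using=qs.db)

qs = MyModel.objects.all()
qs.count()             # may be answered from the result cache once qs is evaluated
refreshed(qs).count()  # always hits the database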
If you append .all() to a queryset, it'll force a reread from the DB. Try
MyModel.objects.all().count() instead of MyModel.objects.count().
Seems like the count() goes to cache after the first time. This is the Django source for QuerySet.count():
def count(self):
    """
    Performs a SELECT COUNT() and returns the number of records as an
    integer.

    If the QuerySet is already fully cached this simply returns the length
    of the cached results set to avoid multiple SELECT COUNT(*) calls.
    """
    if self._result_cache is not None and not self._iter:
        return len(self._result_cache)

    return self.query.get_count(using=self.db)
update does seem to be doing quite a bit of extra work, besides what you need.
But I can't think of any better way to do this, short of writing your own SQL for the count.
If performance is not super important, I would just do what you're doing, calling update before count.
QuerySet.update:
def update(self, **kwargs):
    """
    Updates all elements in the current QuerySet, setting all the given
    fields to the appropriate values.
    """
    assert self.query.can_filter(), \
        "Cannot update a query once a slice has been taken."
    self._for_write = True
    query = self.query.clone(sql.UpdateQuery)
    query.add_update_values(kwargs)
    if not transaction.is_managed(using=self.db):
        transaction.enter_transaction_management(using=self.db)
        forced_managed = True
    else:
        forced_managed = False
    try:
        rows = query.get_compiler(self.db).execute_sql(None)
        if forced_managed:
            transaction.commit(using=self.db)
        else:
            transaction.commit_unless_managed(using=self.db)
    finally:
        if forced_managed:
            transaction.leave_transaction_management(using=self.db)
    self._result_cache = None
    return rows
update.alters_data = True
You can also use MyModel.objects._clone().count(). All of the methods on the QuerySet call _clone() prior to doing any work - that ensures that any internal caches are invalidated.
The root cause is that MyModel.objects is the same instance each time. By cloning it you're creating a new instance without the cached value. Of course, you can always reach in and invalidate the cache if you'd prefer to use the same instance.
