Iterating over a large Django queryset while the data is changing elsewhere - python

Iterating over a queryset, like so:
class Book(models.Model):
    # <snip some other stuff>
    activity = models.PositiveIntegerField(default=0)
    views = models.PositiveIntegerField(default=0)

    def calculate_statistics(self):
        self.activity = self.views * 4
        self.save()

def cron_job_calculate_all_book_statistics():
    for book in Book.objects.all():
        book.calculate_statistics()
...works just fine. However, this is a cron task, and book.views is being incremented by other processes while it runs. If book.views is modified while this cron job is running, the modification gets reverted.
Now, book.views is not modified by the cron job itself, but it is cached during the .all() queryset call, so I suspect that when book.save() runs it writes back the old book.views value.
Is there a way to make sure that only the activity field is updated? Alternatively, say there are 100,000 books: the job will take quite a while to run, and the book.views values will be from when the queryset originally started executing. Is the solution to just use .iterator()?
UPDATE: Here's effectively what I am doing. If you have ideas about how to make this work well inline, then I'm all for it.
def calculate_statistics(self):
    self.activity = self.views + self.hearts.count() * 2
    # Can't do self.comments.count with a comments GenericRelation, because Comment uses
    # a TextField for object_pk, and that breaks the whole system. Lame.
    self.activity += Comment.objects.for_model(self).count() * 4
    self.save()

The following will do the job for you in Django 1.1, no loop necessary:
from django.db.models import F
Book.objects.all().update(activity=F('views')*4)
You can have a more complicated calculation too:
for book in Book.objects.all().iterator():
    Book.objects.filter(pk=book.pk).update(activity=book.calculate_activity())
Both these options have the potential to leave the activity field out of sync with the rest, but I assume you're ok with that, given that you're calculating it in a cron job.

In addition to what others have said if you are iterating over a large queryset you should use iterator():
Book.objects.filter(stuff).order_by(stuff).iterator()
this will cause Django to not cache the items as it iterates (which could use a ton of memory for a large result set).

No matter how you solve this, beware of transaction-related issues. E.g. the default transaction isolation level is REPEATABLE READ, at least for the MySQL backend. This, plus the fact that both Django and the db backend work in a specific autocommit mode with an ongoing transaction, means that even if you use whrde's (very nice) suggestion, the value of `views` could no longer be valid. I could be wrong here, but be warned.
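Tying the answers together: a minimal hedged sketch for the questioner's fuller calculation, using the hearts relation and the Comment.objects.for_model manager from the question's update. Because F('views') is resolved by the database at UPDATE time, a concurrent increment of views is never clobbered by a value this process has cached:

from django.db.models import F

for book in Book.objects.all().iterator():
    # The counts still come from this process, but views is read by
    # the database inside the UPDATE itself, and only the activity
    # column is written.
    extra = book.hearts.count() * 2 + Comment.objects.for_model(book).count() * 4
    Book.objects.filter(pk=book.pk).update(activity=F('views') + extra)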

Related

How to avoid ordering by in django queryset, order_by() not working

I have a "big" db whith over 60M records, and I'm trying to paginate by 50.
I have another db whith ~8M records and it works perfectly, but with the 60M amount it just never loads and overflows the db.
I found that the problem was the order_by(id) made by django so I tried using a mysql view already ordered by id, but then django tries to order it again. To avoid this, I used order_by(), which is supposed to avoid any ordering, but it still does it.
def get_queryset(self, request):
    qs = super(CropAdmin, self).get_queryset(request)
    qs1 = qs.only('id', 'grain__id', 'scan__id', 'scan__acquisition__id',
                  'validated', 'area', 'crop_date', 'matched_label',
                  'grain__grain_number', 'filename').order_by()
    if request.user.is_superuser:
        return qs1
The query made is still using order_by:
SELECT `crops_ordered`.`crop_id`,
`crops_ordered`.`crop_date`,
`crops_ordered`.`area`,
`crops_ordered`.`matched_label`,
`crops_ordered`.`validated`,
`crops_ordered`.`scan_id`,
`crops_ordered`.`grain_id`,
`crops_ordered`.`filename`,
`scans`.`scan_id`,
`scans`.`acquisition_id`,
`acquisitions`.`acquisition_id`,
`grains`.`grain_id`,
`grains`.`grain_number`
FROM `crops_ordered`
INNER JOIN `scans`
ON (`crops_ordered`.`scan_id` = `scans`.`scan_id`)
INNER JOIN `acquisitions`
ON (`scans`.`acquisition_id` = `acquisitions`.`acquisition_id`)
INNER JOIN `grains`
ON (`crops_ordered`.`grain_id` = `grains`.`grain_id`)
ORDER BY `crops_ordered`.`crop_id` DESC
LIMIT 50
Any idea on how to fix this? Or a better way to work with a db of this size?
I don't believe order_by() alone will work here: the admin changelist applies a default ordering of its own on top of whatever your queryset specifies. Having said that, I believe this thread has the answer that you want.
Edit
The link in that thread might provide too much information at once, although there isn't much detail on this anywhere. If you don't like GitHub, there's also an official documentation page on this method, but you'll have to search for clear_ordering manually with CTRL+F or the equivalent.
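For reference, a hedged sketch of the clear_ordering approach referred to above, applied to the get_queryset from the question. Note that query.clear_ordering is internal API whose signature has changed across Django versions (older releases take a single force_empty argument), and the admin changelist may still append a pk ordering of its own for deterministic pagination, so verify this against your version:

def get_queryset(self, request):
    qs = super(CropAdmin, self).get_queryset(request)
    # Strip any ORDER BY at the Query level, below what order_by()
    # touches. Internal API: the signature varies by Django version.
    qs.query.clear_ordering(force_empty=True)
    return qs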

Solution needed to a scenario

I am trying to use a column's values as a radio button's choices, using the code below.
Forms.py
# retrieving data from the database and assigning it to the diction list
diction = polls_datum.objects.values_list('poll_choices', flat=True)

# initializing the list and dictionary
OPTIONS1 = {}
OPTIONS = []

# creating the dictionary with keys 0 to the number of options in the list
for i in range(len(diction)):
    OPTIONS1[i] = diction[i]

# creating tuples from the dictionary above
# OPTIONS = zip(OPTIONS1.keys(), OPTIONS1.values())
for i in OPTIONS1:
    k = (i, OPTIONS1[i])
    OPTIONS.append(k)

class polls_form(forms.ModelForm):
    options = forms.ChoiceField(choices=OPTIONS, widget=forms.RadioSelect())

    class Meta:
        model = polls_model
        fields = ['options']
Using a form I am saving the data (the choices) in a field (poll_choices); when trying to display it on the index page, a new choice is not reflected until a server restart.
Can someone help with this, please?
of course "it is not reflecting until a server restart" - that's obvious when you remember that django server processes are long-running processes (it's not like PHP where each script is executed afresh on each request), and that top-level code (code that's at the module's top-level, not in a function) is only executed once per process when the module is first imported. As a general rule: don't do ANY db query at a module's top-level or at the top-level of a class statement - at best you'll get stale data, at worse it will crash your server process (if you're doing query before everything has been properly setup by django, or if you're doing query based on a schema update before the migration has been applied).
The possible solutions are either to wait until the form's initialisation to set up your field's choices, or to pass a callable as the formfield's choices option, cf https://docs.djangoproject.com/en/2.1/ref/forms/fields/#django.forms.ChoiceField.choices
Also, the way you're building your choices list is uselessly complicated - you could do it as a one-liner:
OPTIONS = list(enumerate(polls_datum.objects.values_list('poll_choices', flat=True)))
but it's also very brittle - you're relying on the current db content and ordering for the choice value, when you should use the polls_datum's pk instead (which is guaranteed to be stable).
And finally: since you're working with what seems to be a related model, you may want to use a ModelChoiceField instead.
For future reference:
What version of Django are you using?
Have you read up on the documentation of ModelForms? https://docs.djangoproject.com/en/2.1/topics/forms/modelforms/
I'm not sure what you're trying to do with diction to dictionary to tuple. I think you could skip a step there and your future self will thank you for that.
Try to follow some tutorials and understand why certain steps are being taken. I can see from your code that you're rather new to coding or Python and there's room for improvement. Not trying to put you down, but I'm trying to push you in the direction of becoming a better developer ;-)
REAL ANSWER:
That being said, I think the solution is to write the loading of the data somewhere in your form model, rather than 'loose' in forms.py. See bruno's answer for more information on this.
If you want to reload the data on each request that loads the form, you should create a function that gets called every time the form is loaded (for example in the form's __init__ function).
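To make the two suggested fixes concrete, here is a minimal hedged sketch of the callable-as-choices variant (polls_datum is the model from the question; choices are keyed on pk as bruno suggests, and ChoiceField has accepted a callable for choices since Django 1.8):

from django import forms

def poll_options():
    # Evaluated each time a form is instantiated, so the query runs
    # per request instead of once at import time.
    return list(polls_datum.objects.values_list('pk', 'poll_choices'))

class polls_form(forms.Form):
    options = forms.ChoiceField(choices=poll_options, widget=forms.RadioSelect)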

What is the most efficient way to iterate django objects updating them?

So I have a queryset to update
stories = Story.objects.filter(introtext="")
for story in stories:
    # just set it to the first 'sentence'
    story.introtext = story.content[0:(story.content.find('.'))] + ".</p>"
    story.save()
And the save() operation completely kills performance. In the process list there are multiple entries for "./manage.py shell" - yes, I ran this through the Django shell.
However, in the past I've run scripts that didn't need to use save(), as they were changing a many-to-many field. Those scripts were very performant.
My project has this code, which could be relevant to why these scripts were so good.
@receiver(signals.m2m_changed, sender=Story.tags.through)
def save_story(sender, instance, action, reverse, model, pk_set, **kwargs):
    instance.save()
What is the best way to update a large queryset (10000+) efficiently?
Since the new introtext value depends on the content field of the object, you can't do any bulk update. But you can speed up saving the list of individual objects by wrapping it in a transaction:
from django.db import transaction

with transaction.atomic():
    stories = Story.objects.filter(introtext='')
    for story in stories:
        introtext = story.content[0:(story.content.find('.'))] + ".</p>"
        Story.objects.filter(pk=story.pk).update(introtext=introtext)
transaction.atomic() will increase the speed by an order of magnitude.
The filter(pk=story.pk).update() trick allows you to avoid the pre_save/post_save signals that are emitted by a plain save(). This is the officially recommended way of updating a single field of an object.
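As an added aside (not part of the original answers): on Django 2.2+ there is also bulk_update(), which batches the writes and, like filter().update(), skips the save() signals. A hedged sketch:

from django.db import transaction

with transaction.atomic():
    stories = list(Story.objects.filter(introtext=''))
    for story in stories:
        story.introtext = story.content[:story.content.find('.')] + ".</p>"
    # One UPDATE statement per batch_size rows instead of one per object.
    Story.objects.bulk_update(stories, ['introtext'], batch_size=500)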
You can use the update() built-in over a queryset.
Example:
MyModel.objects.all().update(color='red')
In your case, you need to use the F() built-in (read more here) to refer to the instance's own attributes:
from django.db.models import F, Value
from django.db.models.functions import Concat, StrIndex, Substr

stories = Story.objects.filter(introtext__exact='')
# F() can't slice strings or call find(), so the substring has to be
# computed with database string functions (StrIndex needs Django 2.0+):
# everything up to and including the first '.', with "</p>" appended.
stories.update(introtext=Concat(
    Substr('content', 1, StrIndex('content', Value('.'))), Value('</p>')))

Update a field in a django model only if it needs updating

Suppose I have some django model and I'm updating an instance
def modify_thing(id, new_blah):
    mything = MyModel.objects.get(pk=id)
    mything.blah = new_blah
    mything.save()
My question is, if it happened that it was already the case that mything.blah == new_blah, does django somehow know this and not bother to save this [non-]modification again? Or will it always go into the db (MySQL in my case) and update data?
If I want to avoid an unnecessary write, does it make any sense to do something like:
if mything.blah != new_blah:
    mything.blah = new_blah
    mything.save()
Given that the record has to be read from the db anyway in order to do the comparison in the first place, is there any efficiency to be gained from this sort of construction - and if so, is there a less ugly way of doing it than with the if statement in Python?
You can use Django signals to ensure that code like what you just posted doesn't write to the db. Take a look at pre_save; that's the signal you're looking for.
Given that Django does not cache the values, a trip to the DB is inevitable: you have to fetch the row to compare the value. And we definitely have less ugly ways to do that. You could do it as
if mything.blah == new_blah:
    pass  # Do nothing
else:
    mything.blah = new_blah
    mything.save()
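An added sketch (not from the original answers): you can also push the comparison into the database with a single filtered UPDATE, which avoids the redundant write without a separate read (assuming the MyModel from the question):

def modify_thing(id, new_blah):
    # One query: rows already holding new_blah are excluded, so the
    # UPDATE only touches the row if it actually needs changing.
    # Returns the number of rows updated (0 or 1 here).
    return (MyModel.objects
            .filter(pk=id)
            .exclude(blah=new_blah)
            .update(blah=new_blah))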

How do I force Django to ignore any caches and reload data?

I'm using the Django database models from a process that's not called from an HTTP request. The process is supposed to poll for new data every few seconds and do some processing on it. I have a loop that sleeps for a few seconds and then gets all unhandled data from the database.
What I'm seeing is that after the first fetch, the process never sees any new data. I ran a few tests and it looks like Django is caching results, even though I'm building new QuerySets every time. To verify this, I did this from a Python shell:
>>> MyModel.objects.count()
885
# (Here I added some more data from another process.)
>>> MyModel.objects.count()
885
>>> MyModel.objects.update()
0
>>> MyModel.objects.count()
1025
As you can see, adding new data doesn't change the result count. However, calling the manager's update() method seems to fix the problem.
I can't find any documentation on that update() method and have no idea what other bad things it might do.
My question is, why am I seeing this caching behavior, which contradicts what Django docs say? And how do I prevent it from happening?
Having had this problem and found two definitive solutions for it I thought it worth posting another answer.
This is a problem with MySQL's default transaction mode. Django opens a transaction at the start, which means that by default you won't see changes made in the database.
You can demonstrate it like this:
Run a django shell in terminal 1
>>> MyModel.objects.get(id=1).my_field
u'old'
And another in terminal 2
>>> MyModel.objects.get(id=1).my_field
u'old'
>>> a = MyModel.objects.get(id=1)
>>> a.my_field = "NEW"
>>> a.save()
>>> MyModel.objects.get(id=1).my_field
u'NEW'
>>>
Back to terminal 1 to demonstrate the problem - we still read the old value from the database.
>>> MyModel.objects.get(id=1).my_field
u'old'
Now in terminal 1 demonstrate the solution
>>> from django.db import transaction
>>>
>>> @transaction.commit_manually
... def flush_transaction():
... transaction.commit()
...
>>> MyModel.objects.get(id=1).my_field
u'old'
>>> flush_transaction()
>>> MyModel.objects.get(id=1).my_field
u'NEW'
>>>
The new data is now read
Here is that code in an easy to paste block with docstring
from django.db import transaction

@transaction.commit_manually
def flush_transaction():
    """
    Flush the current transaction so we don't read stale data

    Use in long running processes to make sure fresh data is read from
    the database. This is a problem with MySQL and the default
    transaction mode. You can fix it by setting
    "transaction-isolation = READ-COMMITTED" in my.cnf or by calling
    this function at the appropriate moment
    """
    transaction.commit()
The alternative solution is to change my.cnf for MySQL to change the default transaction mode
transaction-isolation = READ-COMMITTED
Note that READ-COMMITTED is a relatively new feature for MySQL and has some consequences for binary logging / replication. You could also put this in the Django connection preamble if you wanted.
Update 3 years later
Now that Django 1.6 has turned on autocommit in MySQL this is no longer a problem. The example above now works fine without the flush_transaction() code whether your MySQL is in REPEATABLE-READ (the default) or READ-COMMITTED transaction isolation mode.
What was happening in previous versions of Django which ran in non autocommit mode was that the first select statement opened a transaction. Since MySQL's default mode is REPEATABLE-READ this means that no updates to the database will be read by subsequent select statements - hence the need for the flush_transaction() code above which stops the transaction and starts a new one.
There are still reasons why you might want to use READ-COMMITTED transaction isolation though. If you were to put terminal 1 in a transaction and you wanted to see the writes from the terminal 2 you would need READ-COMMITTED.
The flush_transaction() code now produces a deprecation warning in Django 1.6 so I recommend you remove it.
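(An added note, not from the original answer: since Django 1.8 a single model instance can also be re-read in place with refresh_from_db(), which re-SELECTs the row and overwrites the instance's fields - a hedged sketch in the same shell style as above:)
>>> a = MyModel.objects.get(id=1)
>>> # ... another process updates the row ...
>>> a.refresh_from_db()
>>> a.my_field
u'NEW'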
We've struggled a fair bit with forcing Django to refresh the "cache" - which it turns out wasn't really a cache at all, but an artifact of transactions. This might not apply to your example, but certainly in Django views there is, by default, an implicit transaction, which MySQL then isolates from any changes that happen from other processes after you start it.
We used the @transaction.commit_manually decorator and calls to transaction.commit() just before every occasion where you need up-to-date info.
As I say, this definitely applies to views; I'm not sure whether it applies to Django code not run inside a view.
detailed info here:
http://devblog.resolversystems.com/?p=439
I'm not sure I'd recommend it...but you can just kill the cache yourself:
>>> qs = MyModel.objects.all()
>>> qs.count()
1
>>> MyModel().save()
>>> qs.count() # cached!
1
>>> qs._result_cache = None
>>> qs.count()
2
And here's a better technique that doesn't rely on fiddling with the innards of the QuerySet: Remember that the caching is happening within a QuerySet, but refreshing the data simply requires the underlying Query to be re-executed. The QuerySet is really just a high-level API wrapping a Query object, plus a container (with caching!) for Query results. Thus, given a queryset, here is a general-purpose way of forcing a refresh:
>>> MyModel().save()
>>> qs = MyModel.objects.all()
>>> qs.count()
1
>>> MyModel().save()
>>> qs.count() # cached!
1
>>> from django.db.models import QuerySet
>>> qs = QuerySet(model=MyModel, query=qs.query)
>>> qs.count() # refreshed!
2
>>> party_time()
Pretty easy! You can of course implement this as a helper function and use as needed.
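For instance, a hedged sketch of such a helper (refresh_queryset is a made-up name; it just packages the technique above):

from django.db.models import QuerySet

def refresh_queryset(qs):
    # Build a new QuerySet over the same underlying Query; the new
    # instance starts with an empty result cache, so iterating or
    # counting it hits the database again.
    return QuerySet(model=qs.model, query=qs.query)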
If you append .all() to a queryset, it'll force a reread from the DB. Try
MyModel.objects.all().count() instead of MyModel.objects.count().
It seems like count() goes to the cache after the first time. This is the Django source for QuerySet.count:
def count(self):
    """
    Performs a SELECT COUNT() and returns the number of records as an
    integer.

    If the QuerySet is already fully cached this simply returns the length
    of the cached results set to avoid multiple SELECT COUNT(*) calls.
    """
    if self._result_cache is not None and not self._iter:
        return len(self._result_cache)
    return self.query.get_count(using=self.db)
update does seem to be doing quite a bit of extra work, besides what you need.
But I can't think of any better way to do this, short of writing your own SQL for the count.
If performance is not super important, I would just do what you're doing, calling update before count.
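If you did want to write your own SQL for the count, as mentioned above, a minimal hedged sketch (fresh_count is a made-up name; the table name comes from the model's metadata, so nothing user-supplied is interpolated):

from django.db import connection

def fresh_count(model):
    # Raw COUNT(*) straight against the table, bypassing the QuerySet
    # and its result cache entirely.
    with connection.cursor() as cursor:
        cursor.execute('SELECT COUNT(*) FROM %s' % model._meta.db_table)
        return cursor.fetchone()[0]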
QuerySet.update:
def update(self, **kwargs):
    """
    Updates all elements in the current QuerySet, setting all the given
    fields to the appropriate values.
    """
    assert self.query.can_filter(), \
        "Cannot update a query once a slice has been taken."
    self._for_write = True
    query = self.query.clone(sql.UpdateQuery)
    query.add_update_values(kwargs)
    if not transaction.is_managed(using=self.db):
        transaction.enter_transaction_management(using=self.db)
        forced_managed = True
    else:
        forced_managed = False
    try:
        rows = query.get_compiler(self.db).execute_sql(None)
        if forced_managed:
            transaction.commit(using=self.db)
        else:
            transaction.commit_unless_managed(using=self.db)
    finally:
        if forced_managed:
            transaction.leave_transaction_management(using=self.db)
    self._result_cache = None
    return rows
update.alters_data = True
You can also use MyModel.objects._clone().count(). All of the methods in the QuerySet call _clone() prior to doing any work - that ensures that any internal caches are invalidated.
The root cause is that MyModel.objects is the same instance each time. By cloning it you're creating a new instance without the cached value. Of course, you can always reach in and invalidate the cache if you'd prefer to use the same instance.
