Avoiding django QuerySet caching in a #staticmethod [duplicate] - python

This question already has answers here:
How do I force Django to ignore any caches and reload data?
(6 answers)
Closed 2 years ago.
The following few lines of code illustrate a distributed worker model that I use to crunch data. Jobs are being created in a database, their data goes onto the big drives, and once all information is available, the job status is set to 'WAITING'. From here, multiple active workers come into play: from time to time each of them issues a query, in which it attempts to "claim" a job. In order to synchronize the claims, the queries are encapsulated into a transaction that immediately changes the job state if the query returns a candidate. So far so good.
The problem is that the call to claim only works the first time. Reading up on QuerySets and their caching behavior, it seems to me that combining static methods and QuerySet caching always falls back on the cache... see for yourselves:
I have a class derived from django.db.models.Model:
class Job(models.Model):
[...]
in which I define the following static function.
#staticmethod
#transaction.commit_on_success
def claim():
# select the oldest, top priority job and
# update its record
jobs = Job.objects.filter(state__exact = 'WAITING').order_by('-priority', 'create_timestamp')
if jobs.count() > 0:
j = jobs[0]
j.state = 'CLAIMED'
j.save()
logger.info('Job::claim: claimed %s' % j.name)
return j
return None
Is there any obvious thing that I am doing wrong? What would be a better way of dealing with this? How can I make sure that the QuerySet does not cache its results across different invocations of the static method? Or am I missing something and chasing a phantom? Any help would be greatly appreciated... Thanks!

Why not just have a plain module-level function claim_jobs() that would run the query?
def claim_jobs():
jobs = Job.objects.filter(...)
... etc.

Related

Hook every variable reference and execute code

I have a Python app split across different files. One of them, models.py, contains, among PyQt5 table models, several maps referred from several PyQt5 form files:
# first lines:
agents_id_map = \
{agent.name:agent.id for agent in db.session.query(db.Agent, db.Agent.id)}
# ....
# 2000 thousand lines
I want to keep this kind of maps centralized in a single point. I'm using SQLAlchemy also. Agent class is defined in a db.py file. I use these maps to fulfill the foreign key in another object, say, an invoice, like:
invoice = db.Invoice()
# Here is a reference
invoice.agent_id = models.agents_id_map[agent_combo.currentText()]
····
db.session.add(invoice)
db.session.commit()
The problem is that the model.py module gets cached and several parts of the application access old data, and, if another running instance A of the app creates a new agent, and a running instance B wants to create a new invoice, the B running instance won't see the new Agent created by A unless restarts the app. This also happens if a user in the same running instance creates an agent and then he wants to create an invoice. My solutions are:
Reload the module, to get the whole code executed again, but this could be very expensive.
Isolate the code building those maps in another file, say maps.py, which would be less expensive to reload and change all code that references it through refactoring.
Is there a solution that would allow me to touch only the code building those maps and the rest of the application remains ignorant of the change, and every time the map is referenced from another module or even the same, the code gets executed, effectively re-building maps with fresh data?
Is there a solution that would allow me to touch only the code building those maps and the rest of the application remains ignorant of the change, and every time the map is referenced from another module or even the same, the code gets executed, effectively re-building maps with fresh data?
Certainly: put you maps inside a function, or even better, a class.
If I understand this problem correctly, you have stateful data (maps) which need regenerating under some condition (every time they are accessed? Or just every time the db is updated?). I would do something like this:
class Mappings:
def __init__(self, db):
self._db = db
... # do any initial db stuff you need to here
def id_map(self, thing):
db_thing = getattr(self._db, thing.title)
return {x.name:x.id for x in self._db.session.query(db_thing, db_thing.id)}
def other_property_map(self, prop):
... # etc
mapping = Mapping(db)
mapping.id_map("agent")
This assumes that the mapping example you've given is your major use-case, but this model could easily be adapted for almost any other mapping you might want.
You would write a method of every kind of 'mapping' you need, and it would return the desired dictionary. Note that here I've assumed you handle setting up the db elsewhere and pass a fully initialised db access object to the class, which is probably what you want to do---this class is just about encapsulating mapper state, not re-inventing your orm.
Caching
I have not provided any caching. But if you have complete control over the db, it is easy enough to run a hook before you do any db commits looking to see if you've touched any particular model, and then state that those need rebuilding. Something like this:
class DbAccess(Mappings):
def __init__(self, db, models):
super().init(db)
self._cached_map = {model: {} for model in models}
def db_update(model: str, params: dict):
try:
self._cached_map[model] = {} # wipe cache
except KeyError:
pass
self._db.update_with_model(model, params) # dummy fn
def id_map(self, thing: str):
try:
return self._cached_map[thing]["id"]
except KeyError:
self._cached_map[thing]["id"] = super().id_map(thing)
return self._cached_map[thing]["id"]
I don't really think DbAccess should inherit from Mappings---put it all in one class, or have a DB class and a Mappings mixin and inherit from both. I just didn't want to write everything out again.
I've not written any real db access routines, (hence my dummy fn) as I don't know how you're doing it (but clearly using an ORM). But the basic idea is just to handle the caching yourself, by storing the mapping every time, but deleting all the stored mappings every time you do any commit transactions involving the model in question (thus rebuilding the cache as needed).
Aside
Note that if you really do have 2,000 lines of manually declared mappings of the form thing.name: thing.id you really should generate them at runtime anyhow. Declarative is all very well and good, but writing out 2,000 permutations of the same thing isn't declarative, it's just time-consuming---and doing the job a simple loop putting the data in ram could do for you at startup.

Solution needed to a scenario

I am trying to make use of a column's value as a radio button's choice using below code
Forms.py
#retreiving data from database and assigning it to diction list
diction = polls_datum.objects.values_list('poll_choices', flat=True)
#initializing list and dictionary
OPTIONS1 = {}
OPTIONS = []
#creating the dictionary with 0 to no of options given in list
for i in range(len(diction)):
OPTIONS1[i] = diction[i]
#creating tuples from the dictionary above
#OPTIONS = zip(OPTIONS1.keys(), OPTIONS1.values())
for i in OPTIONS1:
k = (i,OPTIONS1[i])
OPTIONS.append(k)
class polls_form(forms.ModelForm):
#retreiving data from database and assigning it to diction list
options = forms.ChoiceField(choices=OPTIONS, widget = forms.RadioSelect())
class Meta:
model = polls_model
fields = ['options']
Using a form I am saving the data or choices in a field (poll_choices), when trying to display it on the index page, it is not reflecting until a server restart.
Can someone help on this please
of course "it is not reflecting until a server restart" - that's obvious when you remember that django server processes are long-running processes (it's not like PHP where each script is executed afresh on each request), and that top-level code (code that's at the module's top-level, not in a function) is only executed once per process when the module is first imported. As a general rule: don't do ANY db query at a module's top-level or at the top-level of a class statement - at best you'll get stale data, at worse it will crash your server process (if you're doing query before everything has been properly setup by django, or if you're doing query based on a schema update before the migration has been applied).
The possible solutions are either to wait until the form's initialisation to setup your field's choices, or to pass a callable as the formfield's choices options, cf https://docs.djangoproject.com/en/2.1/ref/forms/fields/#django.forms.ChoiceField.choices
Also, the way you're building your choices list is uselessly complicated - you could do it as a one-liner:
OPTIONS = list(enumerate(polls_datum.objects.values_list('poll_choices', flat=True))
but it's also very brittle - you're relying on the current db content and ordering for the choice value when you should use the polls_datum's pk instead (which is garanteed to be stable).
And finally: since you're working with what seems to be a related model, you may want to use a ModelChoiceField instead.
For future reference:
What version of Django are you using?
Have you read up on the documentation of ModelForms? https://docs.djangoproject.com/en/2.1/topics/forms/modelforms/
I'm not sure what you're trying to do with diction to dictionary to tuple. I think you could skip a step there and your future self will thank you for that.
Try to follow some tutorials and understand why certain steps are being taken. I can see from your code that you're rather new to coding or Python and there's room for improvement. Not trying to talk you down, but I'm trying to push you into the direction of becoming a better developer ;-)
REAL ANSWER:
That being said, I think the solution is to write the loading of the data somewhere in your form model, rather than 'loose' in forms.py. See bruno's answer for more information on this.
If you want to reload the data on each request that loads the form, you should create a function that gets called every time the form is loaded (for example in the form's __init__ function).

What is the most efficient way to iterate django objects updating them?

So I have a queryset to update
stories = Story.objects.filter(introtext="")
for story in stories:
#just set it to the first 'sentence'
story.introtext = story.content[0:(story.content.find('.'))] + ".</p>"
story.save()
And the save() operation completely kills performance. And in the process list, there are multiple entries for "./manage.py shell" yes I ran this through django shell.
However, in the past I've ran scripts that didn't need to use save(), as it was changing a many to many field. These scripts were very performant.
My project has this code, which could be relevant to why these scripts were so good.
#receiver(signals.m2m_changed, sender=Story.tags.through)
def save_story(sender, instance, action, reverse, model, pk_set, **kwargs):
instance.save()
What is the best way to update a large queryset (10000+) efficiently?
As far as new introtext value depends on content field of the object you can't do any bulk update. But you can speed up saving list of individual objects by wrapping it into transaction:
from django.db import transaction
with transaction.atomic():
stories = Story.objects.filter(introtext='')
for story in stories:
introtext = story.content[0:(story.content.find('.'))] + ".</p>"
Story.objects.filter(pk=story.pk).update(introtext=introtext)
transaction.atomic() will increase speed by order of magnitude.
filter(pk=story.pk).update() trick allows you to prevent any pre_save/post_save signals which are emitted in case of the simple save(). This is the officially recommended method of updating single field of the object.
You can use update built-in function over a queryset
Exmaple:
MyModel.objects.all().update(color=red)
In your case, you need use F() (read more here) built-in function to use instance own attributes:
from django.db.models import F
stories = Story.objects.filter(introtext__exact='')
stories.update(F('introtext')[0:F('content').find(.)] + ".</p>" )

Update a field in a django model only if it needs updating

Suppose I have some django model and I'm updating an instance
def modify_thing(id, new_blah):
mything = MyModel.objects.get(pk=id)
mything.blah = new_blah
mything.save()
My question is, if it happened that it was already the case that mything.blah == new_blah, does django somehow know this and not bother to save this [non-]modification again? Or will it always go into the db (MySQL in my case) and update data?
If I want to avoid an unnecessary write, does it make any sense to do something like:
if mything.blah != new_blah:
mything.blah = new_blah
mything.save()
Given that the record would have to be read from db anyway in order to do the comparison in the first place? Is there any efficiency to be gained from this sort of construction - and if so, is there a less ugly way of doing that than with the if statement in python?
You can use Django Signals to ensure that code like that you just posted don´t write to the db. Take a look at pre_save, that's the signal you're looking for.
Given that django does not cache the values, a trip to DB is inevitable, you have to fetch it to compare the value. And definitely we have less ugly ways to do that. You could do it as
if mything.blah is new_blah:
#Do nothing
else:
mything.blah = new_blah
mything.blah.save()

Iterating over a large Django queryset while the data is changing elsewhere

Iterating over a queryset, like so:
class Book(models.Model):
# <snip some other stuff>
activity = models.PositiveIntegerField(default=0)
views = models.PositiveIntegerField(default=0)
def calculate_statistics():
self.activity = book.views * 4
book.save()
def cron_job_calculate_all_book_statistics():
for book in Book.objects.all():
book.calculate_statistics()
...works just fine. However, this is a cron task. book.views is being incremented while this is happening. If book.views is modified while this cronjob is running, it gets reverted.
Now, book.views is not being modified by the cronjob, but it is being cached during the .all() queryset call. When book.save(), I have a feeling it is using the old book.views value.
Is there a way to make sure that only the activity field is updated? Alternatively, let's say there are 100,000 books. This will take quite a while to run. But the book.views will be from when the queryset originally starts running. Is the solution to just use an .iterator()?
UPDATE: Here's effectively what I am doing. If you have ideas about how to make this work well inline, then I'm all for it.
def calculate_statistics(self):
self.activity = self.views + self.hearts.count() * 2
# Can't do self.comments.count with a comments GenericRelation, because Comment uses
# a TextField for object_pk, and that breaks the whole system. Lame.
self.activity += Comment.objects.for_model(self).count() * 4
self.save()
The following will do the job for you in Django 1.1, no loop necessary:
from django.db.models import F
Book.objects.all().update(activity=F('views')*4)
You can have a more complicated calculation too:
for book in Book.objects.all().iterator():
Book.objects.filter(pk=book.pk).update(activity=book.calculate_activity())
Both these options have the potential to leave the activity field out of sync with the rest, but I assume you're ok with that, given that you're calculating it in a cron job.
In addition to what others have said if you are iterating over a large queryset you should use iterator():
Book.objects.filter(stuff).order_by(stuff).iterator()
this will cause Django to not cache the items as it iterates (which could use a ton of memory for a large result set).
No matter how you solve this, beware of transaction-related issues. E.g. default transaction isolation level is set to REPEATABLE READ, at least for MySQL backend. This, plus the fact that both Django and db backend work in a specific autocommit mode with an ongoing transaction means, that even if you use (very nice) whrde suggestion, value of `views' could be no longer valid. I could be wrong here, but feel warned.

Categories