Python - caching a property to avoid future calculations

Python - caching a property to avoid future calculations - python

In the following example, cached_attr is used to get or set an attribute on a model instance when a database-expensive property (related_spam in the example) is called. In the example, I use cached_spam to save queries. I put print statements when setting and when getting values so that I could test it out. I tested it in a view by passing an Egg instance into the view and in the view using {{ egg.cached_spam }}, as well as other methods on the Egg model that make calls to cached_spam themselves. When I finished and tested it out the shell output in Django's development server showed that the attribute cache was missed several times, as well as successfully gotten several times. It seems to be inconsistent. With the same data, when I made small changes (as little as changing the print statement's string) and refreshed (with all the same data), different amounts of misses / successes happened. How and why is this happening? Is this code incorrect or highly problematic?
class Egg(models.Model):
... fields
#property
def related_spam(self):
# Each time this property is called the database is queried (expected).
return Spam.objects.filter(egg=self).all() # Spam has foreign key to Egg.
#property
def cached_spam(self):
# This should call self.related_spam the first time, and then return
# cached results every time after that.
return self.cached_attr('related_spam')
def cached_attr(self, attr):
"""This method (normally attached via an abstract base class, but put
directly on the model for this example) attempts to return a cached
version of a requested attribute, and calls the actual attribute when
the cached version isn't available."""
try:
value = getattr(self, '_p_cache_{0}'.format(attr))
print('GETTING - {0}'.format(value))
except AttributeError:
value = getattr(self, attr)
print('SETTING - {0}'.format(value))
setattr(self, '_p_cache_{0}'.format(attr), value)
return value

Nothing wrong with your code, as far as it goes. The problem probably isn't there, but in how you use that code.
The main thing to realise is that model instances don't have identity. That means that if you instantiate an Egg object somewhere, and a different one somewhere else, even if they refer to the same underlying database row they won't share internal state. So calling cached_attr on one won't cause the cache to be populated in the other.
For example, assuming you have a RelatedObject class with a ForeignKey to Egg:
my_first_egg = Egg.objects.get(pk=1)
my_related_object = RelatedObject.objects.get(egg__pk=1)
my_second_egg = my_related_object.egg
Here my_first_egg and my_second_egg both refer to the database row with pk 1, but they are not the same object:
>>> my_first_egg.pk == my_second_egg.pk
True
>>> my_first_egg is my_second_egg
False
So, filling the cache on my_first_egg doesn't fill it on my_second_egg.
And, of course, objects won't persist across requests (unless they're specifically made global, which is horrible), so the cache won't persist either.

Http servers that scale are shared-nothing; you can't rely on anything being singleton. To share state, you need to connect to a special-purpose service.
Django's caching support is appropriate for your use case. It isn't necessarily a global singleton either; if you use locmem://, it will be process-local, which could be the more efficient choice.

Related

Hook every variable reference and execute code

I have a Python app split across different files. One of them, models.py, contains, among PyQt5 table models, several maps referred from several PyQt5 form files:
# first lines:
agents_id_map = \
{agent.name:agent.id for agent in db.session.query(db.Agent, db.Agent.id)}
# ....
# 2000 thousand lines
I want to keep this kind of maps centralized in a single point. I'm using SQLAlchemy also. Agent class is defined in a db.py file. I use these maps to fulfill the foreign key in another object, say, an invoice, like:
invoice = db.Invoice()
# Here is a reference
invoice.agent_id = models.agents_id_map[agent_combo.currentText()]
····
db.session.add(invoice)
db.session.commit()
The problem is that the model.py module gets cached and several parts of the application access old data, and, if another running instance A of the app creates a new agent, and a running instance B wants to create a new invoice, the B running instance won't see the new Agent created by A unless restarts the app. This also happens if a user in the same running instance creates an agent and then he wants to create an invoice. My solutions are:
Reload the module, to get the whole code executed again, but this could be very expensive.
Isolate the code building those maps in another file, say maps.py, which would be less expensive to reload and change all code that references it through refactoring.
Is there a solution that would allow me to touch only the code building those maps and the rest of the application remains ignorant of the change, and every time the map is referenced from another module or even the same, the code gets executed, effectively re-building maps with fresh data?

Is there a solution that would allow me to touch only the code building those maps and the rest of the application remains ignorant of the change, and every time the map is referenced from another module or even the same, the code gets executed, effectively re-building maps with fresh data?
Certainly: put you maps inside a function, or even better, a class.
If I understand this problem correctly, you have stateful data (maps) which need regenerating under some condition (every time they are accessed? Or just every time the db is updated?). I would do something like this:
class Mappings:
def __init__(self, db):
self._db = db
... # do any initial db stuff you need to here
def id_map(self, thing):
db_thing = getattr(self._db, thing.title)
return {x.name:x.id for x in self._db.session.query(db_thing, db_thing.id)}
def other_property_map(self, prop):
... # etc
mapping = Mapping(db)
mapping.id_map("agent")
This assumes that the mapping example you've given is your major use-case, but this model could easily be adapted for almost any other mapping you might want.
You would write a method of every kind of 'mapping' you need, and it would return the desired dictionary. Note that here I've assumed you handle setting up the db elsewhere and pass a fully initialised db access object to the class, which is probably what you want to do---this class is just about encapsulating mapper state, not re-inventing your orm.
Caching
I have not provided any caching. But if you have complete control over the db, it is easy enough to run a hook before you do any db commits looking to see if you've touched any particular model, and then state that those need rebuilding. Something like this:
class DbAccess(Mappings):
def __init__(self, db, models):
super().init(db)
self._cached_map = {model: {} for model in models}
def db_update(model: str, params: dict):
try:
self._cached_map[model] = {} # wipe cache
except KeyError:
pass
self._db.update_with_model(model, params) # dummy fn
def id_map(self, thing: str):
try:
return self._cached_map[thing]["id"]
except KeyError:
self._cached_map[thing]["id"] = super().id_map(thing)
return self._cached_map[thing]["id"]
I don't really think DbAccess should inherit from Mappings---put it all in one class, or have a DB class and a Mappings mixin and inherit from both. I just didn't want to write everything out again.
I've not written any real db access routines, (hence my dummy fn) as I don't know how you're doing it (but clearly using an ORM). But the basic idea is just to handle the caching yourself, by storing the mapping every time, but deleting all the stored mappings every time you do any commit transactions involving the model in question (thus rebuilding the cache as needed).
Aside
Note that if you really do have 2,000 lines of manually declared mappings of the form thing.name: thing.id you really should generate them at runtime anyhow. Declarative is all very well and good, but writing out 2,000 permutations of the same thing isn't declarative, it's just time-consuming---and doing the job a simple loop putting the data in ram could do for you at startup.

"Updater" design pattern, as opposed to "Builder"

This is actually language agnostic, but I always prefer Python.
The builder design pattern is used to validate that a configuration is valid prior to creating an object, via delegation of the creation process.
Some code to clarify:
class A():
def __init__(self, m1, m2): # obviously more complex in life
self._m1 = m1
self._m2 = m2
class ABuilder():
def __init__():
self._m1 = None
self._m2 = None
def set_m1(self, m1):
self._m1 = m1
return self
def set_m2(self, m1):
self._m2 = m2
return self
def _validate(self):
# complicated validations
assert self._m1 < 1000
assert self._m1 < self._m2
def build(self):
self._validate()
return A(self._m1, self._m2)
My problem is similar, with an extra constraint that I can't re-create the object each time due to to performance limitations.
Instead, I want to only update an existing object.
Bad solutions I came up with:
I could do as suggested here and just use setters like so
class A():
...
set_m1(self, m1):
self._m1 = m1
# and so on
But this is bad because using setters
Beats the purpose of encapsulation
Beats the purpose of the buillder (now updater), which is supposed to validate that some complex configuration is preserved after the creation, or update in this case.
As I mentioned earlier, I can't recreate the object every time, as this is expensive and I only want to update some fields, or sub-fields, and still validate or sub-validate.
I could add update and validation methods to A and call those, but this beats the purpose of delegating the responsibility of updates, and is intractable in the number of fields.
class A():
...
def update1(m1):
pass # complex_logic1
def update2(m2):
pass # complex_logic2
def update12(m1, m2):
pass # complex_logic12
I could just force to update every single field in A in a method with optional parameters
class A():
...
def update("""list of all fields of A"""):
pass
Which again is not tractable, as this method will soon become a god method due to the many combinations possible.
Forcing the method to always accept changes in A, and validating in the Updater also can't work, as the Updater will need to look at A's internal state to make a descision, causing a circular dependency.
How can I delegate updating fields in my object A
in a way that
Doesn't break encapsulation of A
Actually delegates the responsibility of updating to another object
Is tractable as A becomes more complicated
I feel like I am missing something trivial to extend building to updating.

I am not sure I understand all of your concerns, but I want to try and answer your post. From what you have written I assume:
Validation is complex and multiple properties of an object must be checked to decide if any change to the object is valid.
The object must always be in a valid state. Changes that make the object invalid are not permitted.
It is too expensive to copy the object, make the change, validate the object, and then reject the change if the validation fails.
Move the validation logic out of the builder and into a separate class like ModelValidator with a validateModel(model) method
The first option is to use a command pattern.
Create abstract class or interface named Update (I don't think Python abstract classes/interfaces, but that's fine). The Update interface implements two methods, execute() and undo().
A concrete class has a name like UpdateAdress, UpdatePortfolio, or UpdatePaymentInfo.
Each concrete Update object also holds a reference to your model object.
The concrete classes hold the state needed to for a particular kind of update. Imageine these methods exist on the UpdateAddress class:
UpdateAddress
setStreetNumber(...)
setCity(...)
setPostcode(...)
setCountry(...)
The update object needs to hold both the current and new values of a property. Like:
setStreetNumber(aString):
self.oldStreetNumber = model.getStreetNumber
self.newStreetNumber = aString
When the execute method is called, the model is updated:
execute:
model.setStreetNumber(newStreetNumber)
model.setCity(newCity)
# Set postcode and country
if not ModelValidator.isValid(model):
self.undo()
raise ValidationError
and the undo method looks like:
undo:
model.setStreetNumber(oldStreetNumber)
model.setCity(oldCity)
# Set postcode and country
That is a lot of typing, but it would work. Mutating your model object is nicely encapsulated by different kinds of updates. You can execute or undo the changes by calling those methods on the update object. You can even store a list of update objects for multi-level undos and re-tries.
However, it is a lot of typing for the programmer. Consider using persistent data structures. Persistent data structures can be used to copy objects very quickly -- approximately constant time complexity. Here is a python library of persistent data structures.
Let's assume your data was in a persistent data structure version of a dict. The library I referenced calls it a PMap.
The implementation of the update classes can be simpler. Starting with the constructor:
UpdateAddress(pmap)
self.oldPmap = pmap
self.newPmap = pmap
The setters are easier:
setStreetNumber(aString):
self.newPmap = newPmap.set('streetNumber', aString)
Execute passes back a new instance of the model, with all the updates.
execute:
if ModelValidator.isValid(newModel):
return newModel;
else:
raise ValidationError
The original object has not changed at all, thanks to the magic of persistent data structures.
The best thing is to not do any of this. Instead, use an ORM or object database. That is the "enterprise grade" solution. These libraries give you sophisticated tools like transactions and object version history.

How can I cache/memoize my SQLAlchemy functions?

I am using FLask-OAuthlib and want to do some caching/memoization using Flask-Cache. I've got caching setup on my views but I'm having trouble with caching this function:
#oauth.clientgetter
#cache.memoize(timeout=86400)
def load_client(client_id):
return DBSession.query(Client).filter_by(client_id=client_id).first()
The first time the function is run (not cached yet) it runs fine but when it gets it from cache something gets messed up somehow and says it's an invalid client. I don't know if it's caching it incorrectly or if having the #oauth.clientgetter decorator somehow screws up the caching. Everything works fine without caching and the client is valid. I've tried to move the function around like so, but get the same result:
class Client(Base):
__tablename__ = 'client'
__table_args__ = {'autoload': True}
user = relationship('User')
#classmethod
#cache.memoize(timeout=86400)
def get_client(cls,client_id):
return DBSession.query(cls).filter_by(client_id=client_id).first()
Then, in my view I have:
#oauth.clientgetter
def load_client(client_id):
return Client.get_client(client_id)
But this gives the same result. I am using redis as my cache backend and the keys/values I have are:
1) "flask_cache_Pwd2uVDVikMYMDNB+gVWlW"
2) "flask_cache_api.models.Client.get_client_memver"
3) "flask_cache_http://lvho.st:5000/me"
GET flask_cache_Pwd2uVDVikMYMDNB+gVWlW:
"!ccopy_reg\n_reconstructor\np1\n(capi.models\nClient\np2\nc__builtin__\nobject\np3\nNtRp4\n(dp5\nS'_sa_instance_state'\np6\ng1\n(csqlalchemy.orm.state\nInstanceState\np7\ng3\nNtRp8\n(dp9\nS'manager'\np10\ng1\n(csqlalchemy.orm.instrumentation\n_SerializeManager\np11\ng3\nNtRp12\n(dp13\nS'class_'\np14\ng2\nsbsS'class_'\np15\ng2\nsS'modified'\np16\nI00\nsS'committed_state'\np17\n(dp18\nsS'instance'\np19\ng4\nsS'callables'\np20\n(dp21\nsS'key'\np22\n(g2\n(S'Iu6copdawXIQIskY5kwPgxFgU7JoE9lTSqmlqw29'\np23\nttp24\nsS'expired'\np25\nI00\nsbsVuser_id\np26\nL4L\nsVname\np27\nS'Default'\np28\nsV_default_scopes\np29\nS'email'\np30\nsVclient_id\np31\ng23\nsV_redirect_uris\np32\nS'http://localhost:8000/authorized/'\np33\nsVactive\np34\nI1\nsVclient_secret\np35\nS'Vnw0YJjgNzR06KiwXWmYz7aSPu1ht7JnY1eRil4s5vXLM9N2ph'\np36\nsVdescription\np37\nNsb."
GET flask_cache_api.models.Client.get_client_memver:
"!S'+gVWlW'\np1\n."

Try reversing the order of your decorators:
#cache.memoize(timeout=86400)
#oauth.clientgetter
def load_client(client_id):
return DBSession.query(Client).filter_by(client_id=client_id).first()
EDIT
The problem seem to be that a Client object is not pickle-able, while cache.memoize relies on objects' pickle-ability. Therefor, in one case, you end up with an invalid-client error (the client object did not "survive" the picke-dump-then-pickle-load process), and in another case, with some kind of caching error which (silently) prevents the object from being cached (I'm not sure what mechanism causes this silent-handling).
In any case, it seems to me you shouldn't attempt to memoize your client object in the first place.

Initialize module python

I have a module which wrap an json api for querying song cover/remix data with limits for number of requests per hour/minute. I'd like to keep an optional cache of json responses without forcing users to adjust a cache/context parameter every time. What is a good way of initializing a library/module in python? Or would you recommend I just do the explicit thing and use a cache named parameter in every call that eventually request json data?
I was thinking of doing
_cache = None
class LFU(object):
...
NO_CACHE, LFU = "NO_CACHE", "LFU"
def set_cache_strategy(strategy):
if _cache == NO_CACHE:
_cache = None
else:
_cache = LFU()
import second_hand_songs_wrapper as s
s.set_cache_strategy(s.LFU)
l1 = s.ShsLabel.get_from_resource_id(123)
l2 = s.ShsLabel.get_from_resource_id(123,use_cache=Fale)
edit:
I'm probably only planning on having two strategies one with/ one without a cache.
Other possible alternative initialization schemes I can think of of the top of my head include using enviromental variables, initializing _cache by hand in the user code to None/LFU(), and using explicit cache everywhere(possibly defaulting to having a cache).
Note the reason I don't set cache on an instance of the class is that I currently use a never instantiated class(use class functions + class state as a singleton) to abstract downloading the json data along with some convenience/methods to download certain urls. I could instantiate the downloader class but then I'd have to pass the instance explicitly to each function, or use another global variable for a convenience version of the class. The downloader class also keeps track of # of requests(website has limit per minute/hour) so having multiple downloader objects would cause more trouble.

There's nothing wrong in setting a default, even if that default is None. I would note though that having the pseudo-constants as well as a conditional (provided that's all you use the values for) is redundant. Try:
caching_strategies = {'NO_CACHE' : lambda: None,
'LFU' : LFU}
_cache = caching_strategies['NO_CACHE']
def set_cache_strategy(strategy):
_cache = caching_methods[strategy]()
If you want to provide a convenience method for the available strategies, just wrap caching_strategies.keys(). Really though, as far as your strategies go, you should probably have all your strategies inherit from some base strategy, and just create a no_cache strategy class that inherits from that and stubs all the methods for your standardized caching inteface.

How do I get the key value of a db.ReferenceProperty without a database hit?

Is there a way to get the key (or id) value of a db.ReferenceProperty, without dereferencing the actual entity it points to? I have been digging around - it looks like the key is stored as the property name preceeded with an _, but I have been unable to get any code working. Examples would be much appreciated. Thanks.
EDIT: Here is what I have unsuccessfully tried:
class Comment(db.Model):
series = db.ReferenceProperty(reference_class=Series);
def series_id(self):
return self._series
And in my template:
more
The result:
more

Actually, the way that you are advocating accessing the key for a ReferenceProperty might well not exist in the future. Attributes that begin with '_' in python are generally accepted to be "protected" in that things that are closely bound and intimate with its implementation can use them, but things that are updated with the implementation must change when it changes.
However, there is a way through the public interface that you can access the key for your reference-property so that it will be safe in the future. I'll revise the above example:
class Comment(db.Model):
series = db.ReferenceProperty(reference_class=Series);
def series_id(self):
return Comment.series.get_value_for_datastore(self)
When you access properties via the class it is associated, you get the property object itself, which has a public method that can get the underlying values.

You're correct - the key is stored as the property name prefixed with '_'. You should just be able to access it directly on the model object. Can you demonstrate what you're trying? I've used this technique in the past with no problems.
Edit: Have you tried calling series_id() directly, or referencing _series in your template directly? I'm not sure whether Django automatically calls methods with no arguments if you specify them in this context. You could also try putting the #property decorator on the method.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.