Caching functions in Python to disk with expiration based on version

I want to cache results of some functions/methods, with these specifications:
Live between runs: The cache should remain intact between runs, after the interpreter dies, meaning the data needs to be saved to disk.
Expiration based on function version: Data in the cache should remain valid as long as the function hasn't changed. If the function changed, it should invalidate the data.
Everything runs single-threaded on the same machine for now. Support for concurrency on the same machine would be a bonus.
I know there are cache decorators for disk-based cache, but their expiration is usually based on time, which is irrelevant to my needs.
I thought about using the Git commit SHA for detecting function/class version, but the problem is that there are multiple functions/classes in the same file. I need a way to check whether the specific function/class segment of the file was changed or not.
I assume the solution will consist of a combination of version management and caching, but I'm too unfamiliar with the available options to solve this elegantly.
Example:
# file a.py
@cache_by_version
def f(a, b):
    # ...

@cache_by_version
def g(a, b):
    # ...

# file b.py
from a import *

def main():
    f(1, 2)
Running main in file b.py once should result in caching the result of f with arguments 1 and 2 to disk. Running main again should bring the result from the cache without evaluating f(1,2) again. However, if f changed, then the cache should be invalidated. On the other hand, if g changed, it should not affect the caching of f.

Ok, so after a bit of messing around here's something that mostly works:
import os
import hashlib
import pickle
from functools import wraps
import inspect

# just cache in a "cache" directory within the current working directory
# also using pickle, but there are other caching libraries out there
# that might be more useful
__cache_dir__ = os.path.join(os.path.abspath(os.getcwd()), 'cache')

def _read_from_cache(cache_key):
    cache_file = os.path.join(__cache_dir__, cache_key)
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    return None

def _write_to_cache(cache_key, value):
    cache_file = os.path.join(__cache_dir__, cache_key)
    if not os.path.exists(__cache_dir__):
        os.mkdir(__cache_dir__)
    with open(cache_file, 'wb') as f:
        pickle.dump(value, f)

def cache_result(fn):
    @wraps(fn)
    def _decorated(*arg, **kw):
        # hash the function's source so any edit produces a new cache key
        m = hashlib.md5()
        fn_src = inspect.getsourcelines(fn)
        m.update(str(fn_src))
        # generate a different key based on the arguments too
        m.update(str(arg))  # could possibly do a better job with the arguments
        m.update(str(kw))
        cache_key = m.hexdigest()
        cached = _read_from_cache(cache_key)
        if cached is not None:
            return cached
        value = fn(*arg, **kw)
        _write_to_cache(cache_key, value)
        return value
    return _decorated

@cache_result
def add(a, b):
    print "Add called"
    return a + b

if __name__ == '__main__':
    print add(1, 2)
I've made this use inspect.getsourcelines to read in the function's code and use it to generate the key for looking up in the cache (along with the arguments). This means that any change to the function (even whitespace) will generate a new cache key and the function will need to be called again.
Note, though, that if the function calls other functions and those functions have changed, then you will still get the original cached result, which may be unexpected.
So this is probably OK to use for something that's intensely numerical or involves heavy network activity, but you might find you need to clear the cache directory every now and then.
One downside of using getsourcelines is that if you don't have access to the source, this won't work. For most Python programs, though, that shouldn't be too big a problem.
So I'd take this as a starting point, rather than as a fully working solution.
Also, it uses pickle to store the cached value, so it's only safe to use if you can trust the data being unpickled.
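If the caveat about helper functions matters for your case, one partial workaround is to let the caller declare which helpers a function depends on and hash their source too. This is only a sketch layered on the code above; the depends_on parameter and the cache_result_with_deps name are my own additions, not part of the answer:
import hashlib
import inspect
from functools import wraps

def cache_result_with_deps(*depends_on):
    """Like cache_result above, but also hashes the source of declared helpers."""
    def decorator(fn):
        @wraps(fn)
        def _decorated(*arg, **kw):
            m = hashlib.md5()
            # hash the decorated function plus every declared dependency,
            # so editing any of them produces a new cache key
            for func in (fn,) + depends_on:
                m.update(str(inspect.getsourcelines(func)))
            m.update(str(arg))
            m.update(str(kw))
            cache_key = m.hexdigest()
            cached = _read_from_cache(cache_key)
            if cached is not None:
                return cached
            value = fn(*arg, **kw)
            _write_to_cache(cache_key, value)
            return value
        return _decorated
    return decorator

# usage: editing either helper() or compute() now invalidates compute's entries
# @cache_result_with_deps(helper)
# def compute(a, b):
#     return helper(a) + b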

Related

Caching a non changing frequently read file in python

Okay folks, let me illustrate. I have this:
def get_config_file(file='scrapers_conf.json'):
    """
    Load the default .json config file
    """
    return json.load(open(file))
and this function is called a lot. This will be on a server, and every request will trigger this function at least 5 times. I have multiple scrapers running, each one shaped like the following.
I removed helper methods for convenience. The problem is that each scraper should have its own request headers, payload, ... or use the default ones that live in scrapers_conf.json:
class Scraper(threading.Thread):  # __init__ is overridden and sets self.conf
    def run(self):
        self.get()

    def get(self):
        # logic
The problem is that I'm getting the headers like this:
class Scraper(threading.Thread):
    def run(self):
        self.get()

    def get(self):
        headers = self.conf.get('headers') or get_config_file().get('headers')
So as you see, every single instance on every single request calls the get_config_file() function, which I don't think is optimal in my case. I know about lru_cache, but I don't think it's the optimal solution (correct me please!).
The config files are small; os.sys.getsizeof reports under 1 KB.
I'm thinking of just leaving it as is, considering that reading 1 KB isn't a problem.
Thanks in advance.
lru_cache(maxsize=None) sounds like the right way to do this; the maxsize=None makes it faster by turning off the LRU machinery.
The other way would be to call get_config_file() at the beginning of the program (in __init__, get, or in the place that instantiates the class), assign it to an attribute on each Scraper class and then always refer to self.config (or whatever). That has the advantage that you can skip reading the config file in unit tests — you can pass a test config directly into the class.
In this case, since the class already has a self.conf, it might be best to update that dictionary with the values from the file, rather than referring to two places in each of the methods.
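For reference, here is a minimal sketch of the lru_cache suggestion applied to the question's function (assuming Python 3's functools; on Python 2 the functools32 backport provides the same decorator):
import json
from functools import lru_cache

@lru_cache(maxsize=None)
def get_config_file(file='scrapers_conf.json'):
    """Load the default .json config file once; later calls return the cached dict."""
    with open(file) as f:
        return json.load(f)

# get_config_file.cache_clear() forces a re-read if the file changes on disk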
I'd totally forgotten about @functools.cached_property:
@cached_property
def get_config_file(file='scrapers_conf.json'):
    """
    Load the default .json config file
    """
    return json.load(open(file))
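Note that functools.cached_property is a descriptor intended for methods on a class (it caches per instance on first attribute access), so it won't work on a module-level function like the one above. A class-based sketch, purely for illustration:
import json
from functools import cached_property  # Python 3.8+

class Config(object):
    def __init__(self, path='scrapers_conf.json'):
        self.path = path

    @cached_property
    def data(self):
        # the file is read on first access only; the result is stored on the instance
        with open(self.path) as f:
            return json.load(f)

config = Config()
headers = config.data.get('headers')  # reads the file here, once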

How to patch method io.RawIOBase.read with unittest?

I've recently learned about unittest.mock.patch and its variants, and I'd like to use it to unit test the atomicity of a file read function. However, the patch doesn't seem to have any effect.
Here's my set-up. The method under scrutiny is roughly like so (abridged):
# local_storage.py
def read(uri):
    with open(uri, "rb") as file_handle:
        result = file_handle.read()
    return result
And the module that performs the unit tests (also abridged):
# test/test_local_storage.py
import unittest
import unittest.mock
import local_storage

def _read_while_writing(io_handle, size=-1):
    """ The patch function, to replace io.RawIOBase.read. """
    _write_something_to(TestLocalStorage._unsafe_target_file)  # Appends "12".
    result = io_handle.read(size)  # Should call the actual read.
    _write_something_to(TestLocalStorage._unsafe_target_file)  # Appends "34".
    return result

class TestLocalStorage(unittest.TestCase):
    _unsafe_target_file = "test.txt"

    def test_read_atomicity(self):
        with open(self._unsafe_target_file, "wb") as unsafe_file_handle:
            unsafe_file_handle.write(b"Test")
        with unittest.mock.patch("io.RawIOBase.read", _read_while_writing):  # <--- This doesn't work!
            result = local_storage.read(TestLocalStorage._unsafe_target_file)  # The actual test.
        self.assertIn(result, [b"Test", b"Test1234"], "Read is not atomic.")
This way, the patch should ensure that every time you try to read it, the file gets modified just before and just after the actual read, as if it happens concurrently, thus testing for atomicity of our read.
The unit test currently succeeds, but I've verified with print statements that the patch function doesn't actually get called, so the file never gets the additional writes (it just says "Test"). I've also modified the code to be non-atomic on purpose.
So my question: How can I patch the read function of an IO handle inside the local_storage module? I've read elsewhere that people tend to replace the open() function to return something like a StringIO, but I don't see how that could fix this problem.
I need to support Python 3.4 and up.
I've finally found a solution myself.
The problem is that mock can't mock any methods of objects that are written in C. One of these is the RawIOBase that I was encountering.
So indeed the solution was to mock open to return a wrapper around RawIOBase. I couldn't get mock to produce a wrapper for me, so I implemented it myself.
There is one pre-defined file that's considered "unsafe". The wrapper writes to this "unsafe" file every time any call is made to the wrapper. This allows for testing the atomicity of file writes, since it writes additional things to the unsafe file while writing. My implementation prevents this by writing to a temporary ("safe") file and then moving that file over the target file.
The wrapper has a special case for the read function, because to test atomicity properly it needs to write to the file during the read. So it reads halfway through the file first, then stops and writes something, and then reads on. This solution is currently semi-hardcoded (in how far "halfway" is), but I'll find a way to improve that.
You can see my solution here: https://github.com/Ghostkeeper/Luna/blob/0e88841d19737fb1f4606917f86e3de9b5b9f29b/plugins/storage/localstorage/test/test_local_storage.py
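For illustration, here is a rough sketch of that idea: patch open inside the module under test so it returns a wrapper whose read performs extra writes. The names (InterferingReader, fake_open) and the exact interference points are my own, not the linked implementation:
import io
import unittest.mock

class InterferingReader(object):
    """Wraps a real file object and writes to the file around every read call."""
    def __init__(self, real_file, path):
        self._real = real_file
        self._path = path

    def read(self, size=-1):
        with open(self._path, "ab") as f:
            f.write(b"12")              # interfere just before the real read
        data = self._real.read(size)
        with open(self._path, "ab") as f:
            f.write(b"34")              # and again just after it
        return data

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._real.close()
        return False

def fake_open(path, mode="r", *args, **kwargs):
    return InterferingReader(io.open(path, mode), path)

# in the test:
# with unittest.mock.patch("local_storage.open", fake_open, create=True):
#     result = local_storage.read("test.txt")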

Memoize a function so that it isn't reset when I rerun the file in Python

I often do interactive work in Python that involves some expensive operations that I don't want to repeat often. I'm generally running whatever Python file I'm working on frequently.
If I write:
import functools32

@functools32.lru_cache()
def square(x):
    print "Squaring", x
    return x*x
I get this behavior:
>>> square(10)
Squaring 10
100
>>> square(10)
100
>>> runfile(...)
>>> square(10)
Squaring 10
100
That is, rerunning the file clears the cache. This works:
try:
    safe_square
except NameError:
    @functools32.lru_cache()
    def safe_square(x):
        print "Squaring", x
        return x*x
but when the function is long it feels strange to have its definition inside a try block. I can do this instead:
def _square(x):
    print "Squaring", x
    return x*x

try:
    safe_square_2
except NameError:
    safe_square_2 = functools32.lru_cache()(_square)
but it feels pretty contrived (for example, calling the decorator without an '@' sign).
Is there a simple way to handle this, something like:
@non_resetting_lru_cache()
def square(x):
    print "Squaring", x
    return x*x
?
Writing a script to be executed repeatedly in the same session is an odd thing to do.
I can see why you'd want to do it, but it's still odd, and I don't think it's unreasonable for the code to expose that oddness by looking a little odd, and having a comment explaining it.
However, you've made things uglier than necessary.
First, you can just do this:
@functools32.lru_cache()
def _square(x):
    print "Squaring", x
    return x*x

try:
    safe_square_2
except NameError:
    safe_square_2 = _square
There is no harm in attaching a cache to the new _square definition. It won't waste any time, or more than a few bytes of storage, and, most importantly, it won't affect the cache on the previous _square definition. That's the whole point of closures.
There is a potential problem here with recursive functions. It's already inherent in the way you're working, and the cache doesn't add to it in any way, but you might only notice it because of the cache, so I'll explain it and show how to fix it. Consider this function:
@lru_cache()
def _fact(n):
    if n < 2:
        return 1
    return _fact(n-1) * n
When you re-exec the script, even if you have a reference to the old _fact, it's going to end up calling the new _fact, because it's accessing _fact as a global name. It has nothing to do with the @lru_cache; remove that, and the old function will still end up calling the new _fact.
But if you're using the renaming trick above, you can just call the renamed version:
@lru_cache()
def _fact(n):
    if n < 2:
        return 1
    return fact(n-1) * n
Now the old _fact will call fact, which is still the old _fact. Again, this works identically with or without the cache decorator.
Beyond that initial trick, you can factor that whole pattern out into a simple decorator. I'll explain step by step below, or see this blog post.
Anyway, even with the less-ugly version, it's still a bit ugly and verbose. And if you're doing this dozens of times, my "well, it should look a bit ugly" justification will wear thin pretty fast. So, you'll want to handle this the same way you always factor out ugliness: wrap it in a function.
You can't really pass names around as objects in Python. And you don't want to use a hideous frame hack just to deal with this. So you'll have to pass the names around as strings, like this:
globals().setdefault('fact', _fact)
The globals function just returns the current scope's global dictionary. Which is a dict, which means it has the setdefault method, which means this will set the global name fact to the value _fact if it didn't already have a value, but do nothing if it did. Which is exactly what you wanted. (You could also use setattr on the current module, but I think this way emphasizes that the script is meant to be (repeatedly) executed in someone else's scope, not used as a module.)
So, here that is wrapped up in a function:
def new_bind(name, value):
    globals().setdefault(name, value)
…which you can turn into a decorator almost trivially:
def new_bind(name):
    def wrap(func):
        globals().setdefault(name, func)
        return func
    return wrap
Which you can use like this:
@new_bind('foo')
def _foo():
    print(1)
But wait, there's more! The func that new_bind gets is going to have a __name__, right? If you stick to a naming convention, like that the "private" name must be the "public" name with a _ prefixed, we can do this:
def new_bind(func):
    assert func.__name__[0] == '_'
    globals().setdefault(func.__name__[1:], func)
    return func
And you can see where this is going:
@new_bind
@lru_cache()
def _square(x):
    print "Squaring", x
    return x*x
There is one minor problem: if you use any other decorators that don't wrap the function properly, they will break your naming convention. So… just don't do that. :)
And I think this works exactly the way you want in every edge case. In particular, if you've edited the source and want to force the new definition with a new cache, you just del square before rerunning the file, and it works.
And of course if you want to merge those two decorators into one, it's trivial to do so, and call it non_resetting_lru_cache.
However, I'd keep them separate. I think it's more obvious what they do. And if you ever want to wrap another decorator around #lru_cache, you're probably still going to want #new_bind to be the outermost decorator, right?
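For completeness, a minimal sketch of that merged decorator, keeping the answer's underscore naming convention (and subject to the same globals() caveat discussed next):
from functools32 import lru_cache  # functools.lru_cache on Python 3

def non_resetting_lru_cache(maxsize=128):
    def wrap(func):
        assert func.__name__[0] == '_'
        cached = lru_cache(maxsize=maxsize)(func)
        # bind the public name only on the first run of the script;
        # later runs keep the old binding and therefore the old cache
        globals().setdefault(func.__name__[1:], cached)
        return cached
    return wrap

@non_resetting_lru_cache()
def _square(x):
    print "Squaring", x
    return x * x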
What if you want to put new_bind into a module that you can import? Then it's not going to work, because it will be referring to the globals of that module, not the one you're currently writing.
You can fix that by explicitly passing your globals dict, or your module object, or your module name as an argument, like @new_bind(__name__), so it can find your globals instead of its own. But that's ugly and repetitive.
You can also fix it with an ugly frame hack. At least in CPython, sys._getframe() can be used to get your caller's frame, and frame objects have a reference to their globals namespace, so:
import sys

def new_bind(func):
    assert func.__name__[0] == '_'
    g = sys._getframe(1).f_globals
    g.setdefault(func.__name__[1:], func)
    return func
Notice the big box in the docs that tells you this is an "implementation detail" that may only apply to CPython and is "for internal and specialized purposes only". Take this seriously. Whenever someone has a cool idea for the stdlib or builtins that could be implemented in pure Python, but only by using _getframe, it's generally treated almost the same as an idea that can't be implemented in pure Python at all. But if you know what you're doing, and you want to use this, and you only care about present-day versions of CPython, it will work.
There is no persistent_lru_cache in the stdlib. But you can build one pretty easily.
The functools source is linked directly from the docs, because this is one of those modules that's as useful as sample code as it is for using it directly.
As you can see, the cache is just a dict. If you replace that with, say, a shelf, it will become persistent automatically:
def persistent_lru_cache(filename, maxsize=128, typed=False):
    """new docstring explaining what dbpath does"""
    # same code as before up to here
    def decorating_function(user_function):
        cache = shelve.open(filename)
        # same code as before from here on.
Of course that only works if your arguments are strings. And it could be a little slow.
So, you might want to instead keep it as an in-memory dict, and just write code that pickles it to a file atexit, and restores it from a file if present at startup:
def decorating_function(user_function):
    # ...
    try:
        with open(filename, 'rb') as f:
            cache = pickle.load(f)
    except:
        cache = {}

    def cache_save():
        with lock:
            with open(filename, 'wb') as f:
                pickle.dump(cache, f)
    atexit.register(cache_save)

    # …
    wrapper.cache_save = cache_save
    wrapper.cache_filename = filename
Or, if you want it to write every N new values (so you don't lose the whole cache on, say, an _exit or a segfault or someone pulling the cord), add this to the second and third versions of wrapper, right after the misses += 1:
if misses % N == 0:
    cache_save()
See here for a working version of everything up to this point (using save_every as the "N" argument, and defaulting to 1, which you probably don't want in real life).
If you want to be really clever, maybe copy the cache and save that in a background thread.
You might want to extend the cache_info to include something like number of cache writes, number of misses since last cache write, number of entries in the cache at startup, …
And there are probably other ways to improve this.
From a quick test, with save_every=1, this makes the cache on both get_pep and fib (from the functools docs) persistent, with no measurable slowdown to get_pep and a very small slowdown to fib the first time (note that fib(100) has 100097 hits vs. 101 misses…), and of course a large speedup to get_pep (but not fib) when you re-run it. So, just what you'd expect.
I can't say I won't just use @abarnert's "ugly frame hack", but here is the version that requires you to pass in the calling module's globals dict. I think it's worth posting given that decorator functions with arguments are tricky and meaningfully different from those without arguments.
def create_if_not_exists_2(my_globals):
    def wrap(func):
        if "_" != func.__name__[0]:
            raise Exception("Function names used in cine must begin with '_'")
        my_globals.setdefault(func.__name__[1:], func)
        def wrapped(*args):
            return func(*args)
        return wrapped
    return wrap
Which you can then use in a different module like this:
from functools32 import lru_cache
from cine import create_if_not_exists_2

@create_if_not_exists_2(globals())
@lru_cache()
def _square(x):
    print "Squaring", x
    return x*x

assert "_square" in globals()
assert "square" in globals()
I've gained enough familiarity with decorators during this process that I was comfortable taking a swing at solving the problem another way:
from functools32 import lru_cache

try:
    my_cine
except NameError:
    class my_cine(object):
        _reg_funcs = {}

        @classmethod
        def func_key(cls, f):
            try:
                name = f.func_name
            except AttributeError:
                name = f.__name__
            return (f.__module__, name)

        def __init__(self, f):
            k = self.func_key(f)
            self._f = self._reg_funcs.setdefault(k, f)

        def __call__(self, *args, **kwargs):
            return self._f(*args, **kwargs)

if __name__ == "__main__":
    @my_cine
    @lru_cache()
    def fact_my_cine(n):
        print "In fact_my_cine for", n
        if n < 2:
            return 1
        return fact_my_cine(n-1) * n

    x = fact_my_cine(10)
    print "The answer is", x
@abarnert, if you are still watching, I'd be curious to hear your assessment of the downsides of this method. I know of two:
You have to know in advance what attributes to look in for a name to associate with the function. My first stab at it only looked at func_name which failed when passed an lru_cache object.
Resetting a function is painful: del my_cine._reg_funcs[('__main__', 'fact_my_cine')], and the swing I took at adding a __delitem__ was unsuccessful.
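For what it's worth, resetting could be made less painful with a small helper along these lines; this is my own sketch (the reset_cine name is not from the code above):
def reset_cine(name, module='__main__'):
    """Drop a registered function so the next run's definition (and a fresh cache) wins."""
    my_cine._reg_funcs.pop((module, name), None)

# usage: reset_cine('fact_my_cine'), then re-run the file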

Pickling Self and Return to Run-state?

I found the following post extremely helpful:
How to pickle yourself?
However, the limitation of this solution is that when the class is reloaded, it is not returned in its "runtime" state; i.e. it will reload all the variables and the general state of the object at the moment it was dumped, but it won't continue running from that point.
Consider:
class someClass(object):
    def doSomething(self):
        i = 0
        while i <= 20:
            execute
            i += 1
            if i == 10:
                self.dumpState()

    def dumpState(self):
        with open('somePickleFile', 'wb') as handle:
            pickle.dump(self, handle)

    @classmethod
    def loadState(cls, file_name):
        with open(file_name, 'rb') as handle:
            return pickle.load(handle)
If the above is run, by creating an instance of someClass:
sC = someClass()
sC.doSomething()
sC.loadState('somePickleFile')
This does not return the object to its runtime state; it does not continue through the while loop until i == 20.
This may not be the correct approach, but I am trying to find a way to capture the runtime state of my program, i.e. freeze/hibernate it, and then relaunch it after possibly moving it to another machine. This is due to time restrictions enforced by a queuing system on a cluster that does not support checkpointing.
That approach won't be possible with Pickle and Unpickle alone without your code being aware of it.
Pickle can save fundamental Python objects, and ordinary user classes that reference those fundamental types. But it can't freeze information of a running context as you want.
Python does allow limited (yet powerful) ways of accessing a running code context through its frame objects: you can get a frame object with a call to inspect.currentframe in the inspect module. This will allow you to see the currently running line of code, the local variables and their contents, and so on, but there is no way inside pure Python, without resorting to raw memory manipulation of the Python interpreter's data structures, to rebuild a mid-execution frame object and jump execution to there.
So, for that approach, it would be better to "freeze" the entire process and its in-memory data structures using an OS-level mechanism (there is probably a way to do that on Linux, and it should work as long as the process holds no file or file-like resources).
Or, from within Python, as you want, you have to keep book-keeping of all your state data in a manner that pickle is able to see. In your basic example, you would refactor your code to something like:
class someClass(object):
    def setup(self):
        self.i = 0

    def doSomething(self):
        while self.i <= 20:
            execute
            self.i += 1
            if self.i == 10:
                self.dumpState()
    ...

    @classmethod
    def loadState(cls, file_name):
        with open(file_name, 'rb') as handle:
            self = pickle.load(handle)
        if self.i <= 20:  # or other check for "running context"
            return self.doSomething()
The fundamental difference here is the book-keeping of the otherwise local "i" variable as an object attribute, and separating out the initialization code. In this way, all the state needed to continue the execution (for this small example) is recorded in the object's attributes, which can be properly pickled.
loadState is a classmethod returning a new instance of someClass (or something else pickled into the file). So you should write instead:
sC = someClass()
sC.doSomething()
sC = someClass.loadState('somePickleFile')
I believe pickle only keeps the attribute values of the instance, not the internal state of any methods executing. It will not save the fact that a method was executing, and it won't save the values of the local variables, like i in your example.

How do I check if a module/class/methods has changed and log the changes?

I am trying to compare two modules/classes/methods and to find out if the class/method has changed. We allow users to change classes/methods, and after processing, we make those changes persistent without overwriting the older classes/methods. However, before we commit the new classes, we need to establish whether the code has changed and also whether the functionality of the methods has changed, e.g. outputs differ or performance differs on the same input data. I am OK with a performance change, but my problem is changes in code and how to log what has changed. I wrote something like the code below:
class TestIfClassHasChanged(unittest.TestCase):
    def setUp(self):
        self.old = old_class()
        self.new = new_class()

    def test_if_code_has_changed(self):
        # simple case for one method
        old_codeobject = self.old.area.func_code.co_code
        new_codeobject = self.new.area.func_code.co_code
        self.assertEqual(old_codeobject, new_codeobject)
where area() is a method in both classes. However, if I have many methods, what I see here is looping over all methods. Is it possible to do this at the class or module level?
Secondly, if I find that the code objects are not equal, I would like to log the changes. I used inspect.getsource(self.old.area) and inspect.getsource(self.new.area) and compared the two to get the difference; could there be a better way of doing this?
You should be using a version control program to help manage development. One of the specific features you get from a version control program is the ability to track changes. You can do diffs between the current source code and previous check-ins to test whether there were any changes.
"if I have many methods, what I see here is looping over all methods. Possible to do this at class or module level?"
I will not ask why you want to do such a thing, but yes, you can. Here is an example:
import inspect
import collections

# Here I will loop over all the functions in a module.
module = __import__('inspect')  # this is fun !!!

# Get all functions in the module.
list_functions = inspect.getmembers(module, inspect.isfunction)

# Get classes and the methods of each class.
list_class = inspect.getmembers(module, inspect.isclass)
class_method = collections.defaultdict(list)
for class_name, class_obj in list_class:
    for method in inspect.getmembers(class_obj, inspect.ismethod):
        class_method[class_name].append(method)
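For the second part of the question (logging what changed), a small sketch using difflib on the two sources; it assumes you already have the old and new function or method objects:
import difflib
import inspect

def log_source_diff(old_func, new_func):
    """Return a unified diff of two functions' source code (empty string if identical)."""
    old_src = inspect.getsource(old_func).splitlines(True)
    new_src = inspect.getsource(new_func).splitlines(True)
    return ''.join(difflib.unified_diff(old_src, new_src, fromfile='old', tofile='new'))

# usage: print log_source_diff(self.old.area, self.new.area)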
