Caching a non-changing, frequently read file in Python

Okay folks, let me illustrate. I have this:
def get_config_file(file='scrapers_conf.json'):
    """
    Load the default .json config file
    """
    return json.load(open(file))
and this function is called a lot. This will be on a server, and every request will trigger this function at least 5 times. I have multiple scrapers running; each one has the following shape.
I removed helper methods for convenience. Each scraper should have its own request headers, payload, ... or use the default ones that live in scrapers_conf.json:
class Scraper(threading.Thread):  # __init__ is overridden and sets self.conf
    def run(self):
        self.get()

    def get(self):
        # logic
The problem is that I'm getting the headers like this:
class Scraper(threading.Thread):
    def run(self):
        self.get()

    def get(self):
        headers = self.conf.get('headers') or get_config_file().get('headers')
So, as you can see, every single instance calls get_config_file() on every single request, which I don't think is optimal in my case. I know about lru_cache, but I don't think it's the optimal solution (correct me, please!).
The config files are small; sys.getsizeof reports under 1 KB.
I'm thinking of just leaving it as is, considering that reading a 1 KB file isn't really a problem.
Thanks in advance.

lru_cache(maxsize=None) sounds like the right way to do this; the maxsize=None makes it faster by turning off the LRU machinery.
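For instance, a minimal sketch of that (on Python 3.9+, functools.cache is shorthand for the same thing):
import json
from functools import lru_cache

@lru_cache(maxsize=None)
def get_config_file(file='scrapers_conf.json'):
    """
    Load the default .json config file; parsed once per distinct
    file argument, then served from memory.
    """
    with open(file) as f:
        return json.load(f)
Note that all callers share the same returned dict, so they should treat it as read-only; if the file can change while the server runs, get_config_file.cache_clear() forces a re-read.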
The other way would be to call get_config_file() at the beginning of the program (in __init__, get, or in the place that instantiates the class), assign it to an attribute on each Scraper class and then always refer to self.config (or whatever). That has the advantage that you can skip reading the config file in unit tests — you can pass a test config directly into the class.
In this case, since the class already has a self.conf, it might be best to update that dictionary with the values from the file, rather than referring to two places in each of the methods.
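A rough sketch of that last variant (reusing get_config_file from the question, and assuming the file values are defaults that per-scraper settings may override):
import threading

class Scraper(threading.Thread):
    def __init__(self, conf=None):
        super().__init__()
        # Start from the file defaults, then let per-scraper values win.
        self.conf = dict(get_config_file())
        self.conf.update(conf or {})

    def get(self):
        headers = self.conf['headers']  # single lookup, no fallback chain
This reads the file once per Scraper rather than once per request; combined with the lru_cache above, it is read once per process.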

I'd totally forgotten about functools.cached_property:
from functools import cached_property

@cached_property
def get_config_file(file='scrapers_conf.json'):
    """
    Load the default .json config file
    """
    return json.load(open(file))
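Note that functools.cached_property is a descriptor meant for methods: it stores the computed value on the instance, so it won't work on a module-level function like the one above. A sketch of how it is actually meant to be used (ScraperConfig is a made-up name for illustration):
import json
from functools import cached_property

class ScraperConfig:
    def __init__(self, path='scrapers_conf.json'):
        self.path = path

    @cached_property
    def data(self):
        # Parsed on first access, then cached on this instance.
        with open(self.path) as f:
            return json.load(f)

conf = ScraperConfig()
headers = conf.data.get('headers')  # the file is read here, once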

Related

Mocking a function call within a function in Python

This is my first time building out unit tests, and I'm not quite sure how to proceed here. Here's the function I'd like to test; it's a method in a class that accepts one argument, url, and returns one string, task_id:
def url_request(self, url):
    conn = self.endpoint_request()
    authorization = conn.authorization
    response = requests.get(url, authorization)
    return response["task_id"]
The method starts out by calling another method within the same class to obtain a token to connect to an API endpoint. Should I be mocking the output of that call (self.endpoint_request())?
If I do have to mock it, and my test function looks like this, how do I pass a fake token/auth endpoint_request response?
#patch("common.DataGetter.endpoint_request")
def test_url_request(mock_endpoint_request):
mock_endpoint_request.return_value = {"Auth": "123456"}
# How do I pass the fake token/auth to this?
task_id = DataGetter.url_request(url)
The code you have shown is strongly dominated by interactions, which means there will most likely be no bugs for unit tests to find. The potential bugs are at the interaction level: you access conn.authorization, but is that the proper member? Does it already have the representation you need further on? Is requests.get the right method for the job? Is the argument order what you expect? Is the return value what you expect? Is task_id spelled correctly?
These are (some of) the potential bugs in your code, but unit testing will not find them: when you replace the depended-on components with mocks (which you create or configure yourself), your unit tests will just succeed. Let's assume you have a misconception about the return value of requests.get, namely that task_id is spelled wrongly and should actually be taskId. If you mock requests.get, you will implement the mock based on your own misconception; that is, your mock will return a map with the (misspelled) key task_id, and the unit test will succeed despite the bug.
You will only find that bug with integration testing, where you bring your component and the depended-on components together. Only then can you test the assumptions made in your component against the reality of the other components.
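To make that concrete, here is a sketch of such a self-confirming test (the common module path is taken from the question's patch target; assuming DataGetter() needs no constructor arguments, and imagining that the real API actually returns taskId, not task_id):
from unittest.mock import patch
from common import DataGetter

@patch("common.requests.get")
@patch("common.DataGetter.endpoint_request")
def test_url_request(mock_endpoint, mock_get):
    mock_endpoint.return_value.authorization = {"Auth": "123456"}
    # The mock repeats our misconception about the key name, so the
    # assertion passes even though the real service spells it taskId.
    mock_get.return_value = {"task_id": "abc"}
    assert DataGetter().url_request("http://example.com") == "abc"
The test is green, the code is broken, and only a test against the real endpoint would reveal it.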

Caching functions in Python to disk with expiration based on version

I want to cache results of some functions/methods, with these specifications:
Live between runs: The cache should remain intact between runs, after the interpreter dies, meaning the data needs to be saved to disk.
Expiration based on function version: Data in the cache should remain valid as long as the function hasn't changed. If the function changed, it should invalidate the data.
It's all happening single-threadedly on the same machine, for now. Support of concurrency on the same machine is a "bonus".
I know there are cache decorators for disk-based cache, but their expiration is usually based on time, which is irrelevant to my needs.
I thought about using the Git commit SHA for detecting function/class version, but the problem is that there are multiple functions/classes in the same file. I need a way to check whether the specific function/class segment of the file was changed or not.
I assume the solution will consist of some combination of version management and caching, but I'm not familiar enough with the possibilities to solve this elegantly.
Example:
# file: a.py
@cache_by_version
def f(a, b):
    # ...

@cache_by_version
def g(a, b):
    # ...

# file: b.py
from a import *

def main():
    f(1, 2)
Running main in file b.py once should cache the result of f with arguments 1 and 2 to disk. Running main again should fetch the result from the cache without evaluating f(1, 2) again. However, if f has changed, the cache should be invalidated. On the other hand, if g has changed, it should not affect the caching of f.
Ok, so after a bit of messing around here's something that mostly works:
import os
import hashlib
import pickle
from functools import wraps
import inspect

# Just cache in a "cache" directory within the current working directory.
# This uses pickle, but there are other caching libraries out there
# that might be more useful.
__cache_dir__ = os.path.join(os.path.abspath(os.getcwd()), 'cache')

def _read_from_cache(cache_key):
    cache_file = os.path.join(__cache_dir__, cache_key)
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    return None

def _write_to_cache(cache_key, value):
    cache_file = os.path.join(__cache_dir__, cache_key)
    if not os.path.exists(__cache_dir__):
        os.mkdir(__cache_dir__)
    with open(cache_file, 'wb') as f:
        pickle.dump(value, f)

def cache_result(fn):
    @wraps(fn)
    def _decorated(*arg, **kw):
        m = hashlib.md5()
        # Hash the function's source so any edit invalidates the cache.
        fn_src = inspect.getsourcelines(fn)
        m.update(str(fn_src).encode('utf-8'))
        # Generate a different key based on the arguments too.
        m.update(str(arg).encode('utf-8'))  # could do a better job with arguments
        m.update(str(kw).encode('utf-8'))
        cache_key = m.hexdigest()
        cached = _read_from_cache(cache_key)
        if cached is not None:
            return cached
        value = fn(*arg, **kw)
        _write_to_cache(cache_key, value)
        return value
    return _decorated

@cache_result
def add(a, b):
    print("Add called")
    return a + b

if __name__ == '__main__':
    print(add(1, 2))
I've made this use inspect.getsourcelines to read in the function's source and use it (along with the arguments) to generate the key for the cache lookup. This means that any change to the function (even whitespace) will generate a new cache key, and the function will need to be called again.
Note though: if the function calls other functions and those functions have changed, you will still get the original cached result, which may be unexpected.
So this is probably OK to use for something that's intensely numerical or involves heavy network activity, but you might find you need to clear the cache directory every now and then.
One downside of using getsourcelines is that it won't work if you don't have access to the source. I guess for most Python programs this shouldn't be too big a problem.
So I'd take this as a starting point rather than as a fully working solution.
Also, it uses pickle to store the cached value, so it's only safe to use if you trust the contents of the cache directory.

Getter with side effect

I create a class whose objects are initialized with a bunch of XML code. The class has the ability to extract various parameters out of that XML and to cache them inside the object's state variables. The number of potential parameters is large, and most probably the user will not need most of them. That is why I have decided to perform "lazy" initialization.
In the following test case, such a parameter is title. When the user tries to access it for the first time, the getter function parses the XML, properly initializes the state variable, and returns its value:
class MyClass(object):
    def __init__(self, xml=None):
        self.xml = xml
        self.title = None

    def get_title(self):
        if self.__title is None:
            self.__title = self.__title_from_xml()
        return self.__title

    def set_title(self, value):
        self.__title = value

    title = property(get_title, set_title, None, "Citation title")

    def __title_from_xml(self):
        # parse the XML and return the title
        return title
This looks nice and works fine for me. However, I am a little disturbed by the fact that the getter is effectively a setter, in the sense that it has a very significant side effect on the object. Is this a legitimate concern? If so, how should I address it?
This design pattern is called Lazy initialization and it has legitimate use.
While the getter certainly performs a side-effect, that's not traditionally what one would consider a bad side-effect. Since the getter always returns the same thing (barring any intervening changes in state), it has no user-visible side-effects. This is a typical use for properties, so there's nothing to be concerned about.
Quite some years later, but: while lazy initialization is fine in itself, I would definitely not postpone XML parsing until someone accesses the object's title. Computed attributes are supposed to behave like plain attributes, and a plain attribute access never raises (assuming the attribute exists, of course).
FWIW, I had a very similar case in a project I took over, with XML parsing errors happening in the most unexpected places because the previous developer used properties the very same way as in the OP's example; I had to fix it by moving the parsing and validation to instantiation time.
So, use properties for lazy initialization only if and when you know the first access will never raise. Actually, never use a property for anything that might raise (at least when getting; setting is a different situation). Otherwise, don't use a property: make the getter an explicit method and clearly document that it might raise this or that.
NB: using a property to cache something is not the problem here; that by itself is fine.
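A sketch of the eager variant described above (parse and validate at instantiation so that malformed XML raises immediately; the parsing body is elided, as in the OP's example):
class MyClass(object):
    def __init__(self, xml=None):
        self.xml = xml
        # Parsing happens here: a bad document raises at construction
        # time, not at some arbitrary later attribute access.
        self._title = self._title_from_xml() if xml is not None else None

    @property
    def title(self):
        """Citation title (already parsed; plain attribute access)."""
        return self._title

    def _title_from_xml(self):
        # parse the XML and return the title
        ...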

How do I check if a module/class/method has changed and log the changes?

I am trying to compare two versions of a module/class/method to find out whether the class/method has changed. We allow users to change classes/methods, and after processing we make those changes persistent without overwriting the older classes/methods. However, before we commit the new classes, we need to establish whether the code has changed and also whether the functionality of the methods has changed, e.g. the output differs or the performance differs on the same input data. I am OK with performance changes, but my problem is changes in code and how to log what has changed. I wrote something like the code below:
import unittest

class TestIfClassHasChanged(unittest.TestCase):
    def setUp(self):
        self.old = old_class()
        self.new = new_class()

    def test_if_code_has_changed(self):
        # simple case for one method
        old_codeobject = self.old.area.__code__.co_code
        new_codeobject = self.new.area.__code__.co_code
        self.assertEqual(old_codeobject, new_codeobject)
where area() is a method in both classes. However, if I have many methods, what I see here is looping over all of them. Is it possible to do this at class or module level?
Secondly, if I find that the code objects are not equal, I would like to log the changes. I used inspect.getsource(self.old.area) and inspect.getsource(self.new.area) and compared the two to get the difference; could there be a better way of doing this?
You should be using a version control program to help manage development. One of the specific features you get from a version control program is the ability to track changes: you can diff the current source code against previous check-ins to test whether anything changed.
"if I have many methods, what I see here is looping over all methods. Possible to do this at class or module level?"
I won't ask why you want to do such a thing, but yes, you can. Here is an example:
import inspect
import collections

# Here I will loop over all the functions in a module.
module = __import__('inspect')  # this is fun !!!

# Get all functions in the module.
list_functions = inspect.getmembers(module, inspect.isfunction)

# Get the classes and their corresponding methods.
list_class = inspect.getmembers(module, inspect.isclass)
class_method = collections.defaultdict(list)
for class_name, class_obj in list_class:
    for method in inspect.getmembers(class_obj, inspect.ismethod):
        class_method[class_name].append(method)
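As for logging what changed when the code objects differ: the inspect.getsource comparison the question already uses can be turned into a readable report with difflib from the standard library. A small sketch:
import difflib
import inspect

def log_method_diff(old_method, new_method):
    old_src = inspect.getsource(old_method).splitlines(keepends=True)
    new_src = inspect.getsource(new_method).splitlines(keepends=True)
    # unified_diff yields only the changed hunks plus a little context.
    return ''.join(difflib.unified_diff(old_src, new_src,
                                        fromfile='old', tofile='new'))

# e.g. print(log_method_diff(self.old.area, self.new.area))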

How to wrap a file object's read and write operations (which are read-only)?

I am trying to wrap the read and write operations of an instance of a file object (specifically the readline() and write() methods).
Normally, I would simply replace those functions with a wrapper, a bit like this:
def log(stream):
    def logwrite(write):
        def inner(data):
            print('LOG: > ' + data.replace('\r', '<cr>').replace('\n', '<lf>'))
            return write(data)
        return inner
    stream.write = logwrite(stream.write)
But the attributes of a file object are read-only! How can I wrap them properly?
(Note: I am too lazy to wrap the whole file object... really, I don't want to miss a feature that I did not wrap properly, or a feature that may be added in a future version of Python.)
More context:
I am trying to automate communication with a modem whose AT command set is made available on the network through a telnet session. Once logged in, I "grab" the module I want to communicate with. After some time without activity, a timeout occurs and releases the module (so that it is available to other users on the network... not that I care; I am the sole user of this equipment). The automatic release writes a specific line to the session.
I want to wrap readline() on a file built from a socket (cf. socket.makefile()) so that when the timeout occurs, a specific exception is thrown. That way I can detect the timeout anywhere in the script and react appropriately without complicating the AT command parser...
(Of course, I want to do this because the timeout is quite spurious; otherwise I would simply keep the module alive by feeding the modem commands that have no side effects.)
(Feel free to propose any other method or strategy to achieve this effect.)
Use __getattr__ to wrap your file object, and provide modified methods for the ones you care about:
class Wrapped(object):
    def __init__(self, file_):
        self._file = file_

    def write(self, data):
        print('LOG: > ' + data.replace('\r', '<cr>').replace('\n', '<lf>'))
        return self._file.write(data)

    def __getattr__(self, attr):
        return getattr(self._file, attr)
This way, requests for attributes you don't explicitly provide are routed to the same attribute on the wrapped object, so you only implement the ones you want:
logged = Wrapped(open(filename))
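Applied to the modem scenario from the question, the same __getattr__ pattern can turn the release line into an exception. A sketch; ModemReleasedError and the RELEASE_LINE marker are made-up names, to be replaced with whatever the equipment actually prints:
class ModemReleasedError(Exception):
    """Raised when the session prints the module-release line."""

class ReleaseDetectingFile(object):
    RELEASE_LINE = '+RELEASED'  # hypothetical marker

    def __init__(self, file_):
        self._file = file_

    def readline(self, *args):
        line = self._file.readline(*args)
        if line.strip() == self.RELEASE_LINE:
            raise ModemReleasedError(line)
        return line

    def __getattr__(self, attr):
        return getattr(self._file, attr)

# sock being the telnet socket from the question:
session = ReleaseDetectingFile(sock.makefile('rw'))
With that, any readline() anywhere in the script raises ModemReleasedError when the timeout fires, and the top-level code can catch it without the AT command parser knowing anything about it.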
