We have a codebase with the following use pattern:
factory = DataFactory(args)
dataset = factory.download_and_cache_big_dataset(key)
metadata = dataset.get_some_metadata()
Currently, download_and_cache_big_dataset fetches a very large file from S3 and puts it somewhere. Among other things, it does
filename = get_s3_key(key)
filepath = os.path.join(get_tmp_dir(), filename)
s3.download_file(key, filepath)
return BigFileClass(filepath) # gets stored in a class somewhere
However, this file doesn't get deleted. This is fine when this function is called sparingly and relies on file caching, but bad when it is called repeatedly and we don't want to fill up the disk. Is there a way to refactor the code with a context manager such that we can use it as
factory = DataFactory(args)
with factory.download_and_cache_big_dataset(key) as dataset:
    metadata = dataset.get_some_metadata()
    # do something with metadata
# file gets automatically deleted
But critically, without breaking the existing usage, so that the other code works as is? Or will there need to be a different method that returns the context manager?
Since you return an instance of BigFileClass to handle/represent the data, I would suggest the following.
I'm assuming that the data file is unique to each instance.
Add an instance variable to BigFileClass to keep track of the path of the data file.
Add a __del__ method to BigFileClass in which the data file is removed.
Edit: If you want to use BigFileClass as a context manager, define __enter__ and __exit__ methods on BigFileClass. The only thing __enter__ has to do in this case is basically return self.
I would leave the task of deleting the file to the __del__ method (when the reference count for a BigFileClass reaches 0). It doesn't feel right to have the class instance still around when you have already deleted the data file.
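A minimal sketch of what that could look like, assuming BigFileClass just wraps the file path (any method names other than get_some_metadata are illustrative):

import os

class BigFileClass:
    def __init__(self, filepath):
        self.filepath = filepath  # keep track of the path of the data file

    def get_some_metadata(self):
        ...  # existing behaviour, unchanged

    def __enter__(self):
        return self  # nothing else to do here

    def __exit__(self, exc_type, exc_value, traceback):
        return False  # don't suppress exceptions; deletion is left to __del__

    def __del__(self):
        # remove the data file once the last reference to the instance is gone
        try:
            os.remove(self.filepath)
        except OSError:
            pass

With this, the existing dataset = factory.download_and_cache_big_dataset(key) call keeps working, and the with form works too; the file is removed when the last reference to the instance disappears (for example when the enclosing function returns), which is exactly the trade-off described above.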
Remark w.r.t. architecture.
The use of a factory seems like an unnecessary complication to me. IMO, download_and_cache_big_dataset could just be a function returning a BigFileClass instance.
I have a Python app split across different files. One of them, models.py, contains, among PyQt5 table models, several maps referred from several PyQt5 form files:
# first lines:
agents_id_map = \
{agent.name:agent.id for agent in db.session.query(db.Agent, db.Agent.id)}
# ....
# roughly 2,000 more lines like this
I want to keep these kinds of maps centralized in a single place. I'm also using SQLAlchemy; the Agent class is defined in db.py. I use these maps to fill in the foreign key of another object, say an invoice, like:
invoice = db.Invoice()
# Here is a reference
invoice.agent_id = models.agents_id_map[agent_combo.currentText()]
ยทยทยทยท
db.session.add(invoice)
db.session.commit()
The problem is that the models.py module gets cached, so several parts of the application end up reading stale data. If one running instance A of the app creates a new agent and another running instance B wants to create a new invoice, B won't see the agent created by A unless the app is restarted. The same thing happens within a single running instance: if a user creates an agent and then wants to create an invoice, the new agent isn't in the map. My solutions are:
Reload the module so that all of its code gets executed again, but this could be very expensive.
Isolate the code building those maps in a separate file, say maps.py, which would be less expensive to reload, and change all the code that references it accordingly.
Is there a solution that would allow me to touch only the code building those maps and the rest of the application remains ignorant of the change, and every time the map is referenced from another module or even the same, the code gets executed, effectively re-building maps with fresh data?
Certainly: put your maps inside a function, or even better, a class.
If I understand this problem correctly, you have stateful data (maps) which need regenerating under some condition (every time they are accessed? Or just every time the db is updated?). I would do something like this:
class Mappings:
    def __init__(self, db):
        self._db = db
        ...  # do any initial db stuff you need to here

    def id_map(self, thing):
        db_thing = getattr(self._db, thing.title())
        return {x.name: x.id for x in self._db.session.query(db_thing)}

    def other_property_map(self, prop):
        ...  # etc.

mapping = Mappings(db)
mapping.id_map("agent")
This assumes that the mapping example you've given is your major use-case, but this model could easily be adapted for almost any other mapping you might want.
You would write a method for every kind of 'mapping' you need, and it would return the desired dictionary. Note that here I've assumed you handle setting up the db elsewhere and pass a fully initialised db access object to the class, which is probably what you want to do: this class is just about encapsulating mapper state, not re-inventing your ORM.
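Tying this back to the invoice snippet from the question, the call site would become something like the following (mapping is assumed to be created once, e.g. at startup, and shared wherever the old module-level maps were imported):

mapping = Mappings(db)

invoice = db.Invoice()
# the map is rebuilt from the database on every call, so an Agent created
# by another instance of the app is picked up without a restart
invoice.agent_id = mapping.id_map("agent")[agent_combo.currentText()]
db.session.add(invoice)
db.session.commit()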
Caching
I have not provided any caching. But if you have complete control over the db, it is easy enough to run a hook before any db commit that checks whether you've touched a particular model, and flag that its maps need rebuilding. Something like this:
class DbAccess(Mappings):
    def __init__(self, db, models):
        super().__init__(db)
        self._cached_map = {model: {} for model in models}

    def db_update(self, model: str, params: dict):
        try:
            self._cached_map[model] = {}  # wipe cache
        except KeyError:
            pass
        self._db.update_with_model(model, params)  # dummy fn

    def id_map(self, thing: str):
        try:
            return self._cached_map[thing]["id"]
        except KeyError:
            self._cached_map[thing]["id"] = super().id_map(thing)
            return self._cached_map[thing]["id"]
I don't really think DbAccess should inherit from Mappings: put it all in one class, or have a DB class and a Mappings mixin and inherit from both. I just didn't want to write everything out again.
I've not written any real db access routines, (hence my dummy fn) as I don't know how you're doing it (but clearly using an ORM). But the basic idea is just to handle the caching yourself, by storing the mapping every time, but deleting all the stored mappings every time you do any commit transactions involving the model in question (thus rebuilding the cache as needed).
Aside
Note that if you really do have 2,000 lines of manually declared mappings of the form thing.name: thing.id, you really should generate them at runtime anyhow. Declarative is all very well and good, but writing out 2,000 permutations of the same thing isn't declarative, it's just time-consuming, and it does by hand a job that a simple loop could do for you at startup, putting the data in RAM.
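For example, something along these lines could replace the hand-written module-level maps (the selection of models is hypothetical; only db.Agent comes from the question):

# models you want name -> id maps for
MAPPED_MODELS = [db.Agent, db.Customer, db.Product]

id_maps = {
    model.__name__.lower() + 's_id_map': {obj.name: obj.id for obj in db.session.query(model)}
    for model in MAPPED_MODELS
}
# id_maps['agents_id_map'] plays the role of the old agents_id_map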
Is there a way to do the following:
asset, _ = Asset.objects.get_or_create(system=item['system'], system_table=item['system_table'], ...)
Asset.objects.filter(pk=asset.pk).update(**item)
And also call the .save() method? I think I've read somewhere that you can run an update on the actual instance and not go through the objects manager. How would that be done? Currently I'm doing the following, which is quite repetitive and inefficient:
a = Asset.objects.filter(pk=asset.pk).update(**item)
a.save()
Since you already have the asset object, you can just make use of it.
# assuming that you have an `Asset` object in the `asset` variable, somehow
item = {"foo": "foo-value"}
for field, value in item.items():
    setattr(asset, field, value)
asset.save()
You can also specify the update_fields parameter of the save() method:
asset.save(update_fields=list(item.keys()))
The best way to do this is to just call save() directly. You will need to call get() instead of filter(), though.
Asset.objects.get(pk=asset.pk).save(update_fields=item)
This isn't a problem since your existing filter() is guaranteed to return a queryset with at most one Asset anyway. You just have to be sure that the given pk actually exists or wrap the get() call in a try...except block.
But...since you already have the Asset instance in asset, there's no reason to waste time with a DB query. Just call save directly on the object you have:
asset.save(update_fields=item)
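Putting the two answers together with the get_or_create call from the question (using just the two lookup fields shown there), a minimal sketch might be the following. Note that the new values still have to be set on the instance before saving; update_fields only restricts which columns get written.

asset, created = Asset.objects.get_or_create(
    system=item['system'], system_table=item['system_table'])
for field, value in item.items():
    setattr(asset, field, value)             # copy the new values onto the instance
asset.save(update_fields=list(item.keys()))  # write only the touched columns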
I'm creating an app that will generate math problems. They're specific problems where some parameters can be altered. Each problem will be different and will require a different method to solve (all of which will be implemented programmatically).
For example:
models.py
import random
from django.db import models
class Problem(models.Model):
    unformattedText = models.TextField()

    def __init__(self, unformattedText, genFunction, *args, **kwargs):
        super(Problem, self).__init__(*args, **kwargs)
        self.unformattedText = unformattedText
        self.genFunction = genFunction

    def genQAPair(self):
        return self.genFunction(self.unformattedText)
views.py
def genP1(text):
    num_1 = random.randrange(0, 100)
    num_2 = random.randrange(0, 100)
    text = text.format(num_1, num_2)
    return {'question': text, 'answer': num_1 - num_2}

def genP2(text, lim=4):
    num_1 = random.randrange(0, lim)
    text = text.format(num_1)
    return {'question': text, 'answer': num_1 * 40}
p1 = Problem(
    unformattedText='Sally has {} apples. Frank takes {}. How many apples does Sally have?',
    genFunction=genP1
)
p1.save()

p2 = Problem(
    unformattedText='John jumps {} feet into the air. How long does it take for him to age?',
    genFunction=genP2
)
p2.save()
When I try this, the function isn't actually saved. Django just saves the integer 1. When I instantiate the model, the function is there as intended, but apparently only 1 gets saved to the database.
Bonus question: I'm actually beginning to question whether or not I even need Django models for this. I'm using Django because it's super easy to get everything onto a webpage. Is there a better way to do this? (Maybe store the text of each problem in a JSON file and the generating functions in some separate script.)
The persistence layer for a Django application is the database, and the database schema is specified by your model definitions. In this case you've only defined a single field in your model, unformattedText; you haven't specified any storage for the corresponding function. Your self.genFunction = genFunction is just creating an attribute on an object in memory; it won't be persisted.
There are various possible ways to store the function. You could store it as raw text; you could store it as a pickle blob; you could store the function path and name (e.g. "my.path.to.problems.genP1"); or do something else. In any case, you'll need to create a database field for that information.
Here is a rough outline of an example solution using the function path:
models.py
class Problem(models.Model):
    unformattedText = models.TextField()
    genPath = models.TextField()
views.py
import importlib

from django.shortcuts import render

def problem_view(request, problem_id):
    problem = Problem.objects.get(id=problem_id)
    gen_path, gen_name = problem.genPath.rsplit(".", 1)
    gen_module = importlib.import_module(gen_path)
    gen_function = getattr(gen_module, gen_name)
    context = gen_function(problem.unformattedText)
    return render(request, 'template.html', context)
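Creating a problem then just stores the dotted path to the generator. A rough sketch, where the module path is an assumption that depends on where genP1 actually lives:

p1 = Problem(
    unformattedText='Sally has {} apples. Frank takes {}. How many apples does Sally have?',
    genPath='myapp.views.genP1'  # assumed path; adjust to your project layout
)
p1.save()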
Only you can determine if you need to use a database at all. If you only have a few fixed questions then you could just stuff everything into a Python file and be done with it. But there are advantages to using Django's models, including the ability to use the admin.
There are a couple of options, depending on the actual task. I've ranked them from the safest option to the most dangerous (but most flexible):
1. Store function identifiers
You can store genP1 and genP2 by name, i.e. as the strings 'genP1' and 'genP2' (or use any other unique identifier); see the sketch after this list.
Pros:
You can validate user input and execute trusted code only, because in this case you control almost everything.
You can easily debug your functions, because they are part of your system.
Cons:
You need to define all your functions in the code. That means that if you want to add a new function, you need to redeploy your application.
If you are storing function names, you need to manually import the module (or package) containing the functions and call them.
If you are storing identifiers, you need to define a mapping {identifier: path to the actual function}.
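A minimal sketch of the identifier approach, reusing genP1 and genP2 from the question (the import path, the registry, the generate helper and the genName field are illustrative names):

from myapp.views import genP1, genP2  # wherever the trusted generators actually live

GENERATORS = {
    'genP1': genP1,
    'genP2': genP2,
}

def generate(problem):
    # problem.genName is assumed to be a text field holding the identifier, e.g. 'genP1'
    try:
        gen = GENERATORS[problem.genName]
    except KeyError:
        raise ValueError('Unknown generator: {!r}'.format(problem.genName))
    return gen(problem.unformattedText)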
2. Use DSL
You can write your own DSL (or use an existing one).
Pros:
You can add new functions at runtime without redeploying the application.
You can control which code the user can execute.
You can see the source code of your functions.
Cons:
It is hard to write a safe and flexible DSL, especially if you want to call some Python code from it.
It is hard to debug huge functions.
3. Serialize them
You can serialize functions using pickle.
Pros:
You can add new functions at runtime without redeploying the application.
Easier than writing your own DSL.
Cons:
Unsafe: you must not execute untrusted code. If you allow users to create their own functions, serialization is not the way to go; define (or use an existing) safe DSL instead.
It might be impossible to show the source Python code for a serialized function. For more information, see: How can I get the source code of a Python function?
It is hard to debug huge functions.
4. Just store actual Python code
Just store the Python source code in the DB as a string; see the sketch after this list.
Pros:
You can add new functions at runtime without redeploying the application.
You can see the source code without any additional processing.
Easier than writing your own DSL.
Cons:
Unsafe: you must not execute untrusted code. If you allow users to create their own functions, storing source code is not the way to go; define (or use an existing) safe DSL instead.
It is hard to debug huge functions.
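If you do go with option 4 and the stored code is entirely your own, executing it could look roughly like this (the source field and the convention that the stored code defines a generate(text) function are assumptions, not part of the question):

import random

def load_generator(problem):
    # NEVER run this on code you did not write yourself
    namespace = {'random': random}
    exec(compile(problem.source, '<stored code>', 'exec'), namespace)
    return namespace['generate']  # the stored code is expected to define generate(text)

# usage: qa_pair = load_generator(problem)(problem.unformattedText)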
An essential part of my project is being able to save and load class instances to a file. For further context, my class has both a set of attributes and a few methods.
So far I've tried using pickle, but it's not working quite as expected. For starters, it's not loading the methods, nor is it letting me set the attributes that I defined initially; in other words, it's not really giving me back a copy of the class I can work with.
Relevant Code:
class Brick(object):
    def __init__(self, name, filename=None, areaMin=None, areaMax=None, kp=None):
        self.name = name
        self.filename = filename
        self.areaMin = areaMin
        self.areaMax = areaMax
        self.kp = kp
        self.__kpsave = None
        if filename != None:
            self.__logfile = file(filename, 'w')

    def __getstate__(self):
        f = self.__logfile
        self.__kpsave = []
        for point in self.kp:
            temp = (point.pt, point.size, point.angle, point.response, point.octave, point.class_id)
            self.__kpsave.append(temp)
        return (self.name, self.areaMin, self.areaMax, self.__kpsave,
                f.name, f.tell())

    def __setstate__(self, state):
        self.name, self.areaMin, self.areaMax, self.__kpsave, name, position = state
        f = file(name, 'w')
        f.seek(position)
        self.__logfile = f
        self.filename = name
        self.kp = []
        for point in self.__kpsave:
            temp = cv2.KeyPoint(x=point[0][0], y=point[0][1], _size=point[1], _angle=point[2],
                                _response=point[3], _octave=point[4], _class_id=point[5])
            self.kp.append(temp)

    def calculateORB(self, img):
        pass  # I've omitted the actual method here
(There are a few more attributes and methods, but they're not relevant)
Now, this class definition works just fine when creating new instances: I can make a new Brick with just the name, I can then set areaMin or any other attribute, and I can use pickle (cPickle) to dump the current instance to a file just fine. (I'm using those __getstate__ and __setstate__ methods because pickle won't work with OpenCV's KeyPoint objects.)
The problem comes, of course, when I load the instance: using pickle.load() I can load the instance from a file, and the values I set previously will be there (i.e. I can access areaMin just fine if I had set a value for it), but I can't access the methods or set values on any of the other attributes whose values I never changed. I've also noticed that I don't need to import my class definition if I'm simply unpickling from a completely different source file.
Since all I want to do is build a "database" of sorts from my class objects, what's the best way to approach this? I know something that would work is to write a .Save() method that writes out a .py source file in which I essentially create an instance of the class, so I can then .Load() it with exec and eval as appropriate. However, this seems like the worst possible way to do it, so how should I actually do this?
Thanks.
You should not try to do I/O inside your __getstate__ and __setstate__ methods. Those are called by pickle, and the expected result is just an in-memory object that can then be pickled.
Moreover, if the "Point" class held in the self.kp attribute were just a regular Python class, there would be no need for you to customize pickling at all.
What you have to worry about is to deal with the I/O at the point you call Pickle. If you really need to load different instances independently, you could resort to the "shelve" module, or, better yet, use pickle.dumps and store the resulting string in a DBMS (which can be the built-in sqlite).
All in all:
class Point(object):
    ...

class Brick(object):
    def __init__(self, point, ...):
        self.kp = point
Then, to save a single object to a file:
with open("filename.pickle", "wb") as file_:
    pickle.dump(my_brick, file_, -1)
and restore with:
with open("filename.pickle", "rb") as file_:
    my_brick = pickle.load(file_)
To store several instances and recover them all at once, you could just dump them in sequence to the same open file and then read them back one by one until you hit EOFError at the end of the file, or you can simply add all the objects you want to save to a list and pickle the whole list at once.
To save and retrieve arbitrary objects that you can look up by some attribute like "name" or "id", you can resort to the shelve module (https://docs.python.org/3/library/shelve.html), or use a real database if you need complex queries and such. Trying to write your own ad hoc binary format to allow searching for the required instance is a horrible idea, as you'd have to implement the whole protocol for that file: reading, writing, safeguards, corner cases, and so on.
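For instance, a minimal sketch of the shelve approach, keying each Brick by its name attribute (the filename 'bricks.db' is arbitrary):

import shelve

# save
db = shelve.open('bricks.db')
db[my_brick.name] = my_brick   # the Brick is pickled under that key
db.close()

# restore later, possibly from another run of the program
db = shelve.open('bricks.db')
restored_brick = db['some brick name']   # hypothetical key
db.close()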
In the following example, cached_attr is used to get or set an attribute on a model instance when a database-expensive property (related_spam in the example) is called. In the example, I use cached_spam to save queries, and I added print statements when setting and getting values so that I could test it out. I tested it in a view by passing an Egg instance into the view and using {{ egg.cached_spam }} in the template, as well as through other methods on the Egg model that call cached_spam themselves. When I tested it, the shell output from Django's development server showed that the attribute cache was missed several times and successfully hit several times; it seems to be inconsistent. With the same data, when I made small changes (as little as changing a print statement's string) and refreshed, different numbers of misses / hits happened. How and why is this happening? Is this code incorrect or highly problematic?
class Egg(models.Model):
    ...  # fields

    @property
    def related_spam(self):
        # Each time this property is called the database is queried (expected).
        return Spam.objects.filter(egg=self).all()  # Spam has a foreign key to Egg.

    @property
    def cached_spam(self):
        # This should call self.related_spam the first time, and then return
        # cached results every time after that.
        return self.cached_attr('related_spam')

    def cached_attr(self, attr):
        """This method (normally attached via an abstract base class, but put
        directly on the model for this example) attempts to return a cached
        version of a requested attribute, and calls the actual attribute when
        the cached version isn't available."""
        try:
            value = getattr(self, '_p_cache_{0}'.format(attr))
            print('GETTING - {0}'.format(value))
        except AttributeError:
            value = getattr(self, attr)
            print('SETTING - {0}'.format(value))
            setattr(self, '_p_cache_{0}'.format(attr), value)
        return value
Nothing wrong with your code, as far as it goes. The problem probably isn't there, but in how you use that code.
The main thing to realise is that Django model instances don't share identity. That means that if you instantiate an Egg object somewhere, and a different one somewhere else, then even if they refer to the same underlying database row they won't share internal state. So calling cached_attr on one won't cause the cache to be populated in the other.
For example, assuming you have a RelatedObject class with a ForeignKey to Egg:
my_first_egg = Egg.objects.get(pk=1)
my_related_object = RelatedObject.objects.get(egg__pk=1)
my_second_egg = my_related_object.egg
Here my_first_egg and my_second_egg both refer to the database row with pk 1, but they are not the same object:
>>> my_first_egg.pk == my_second_egg.pk
True
>>> my_first_egg is my_second_egg
False
So, filling the cache on my_first_egg doesn't fill it on my_second_egg.
And, of course, objects won't persist across requests (unless they're specifically made global, which is horrible), so the cache won't persist either.
HTTP servers that scale are shared-nothing; you can't rely on anything being a singleton. To share state, you need to connect to a special-purpose service.
Django's caching support is appropriate for your use case. It isn't necessarily a global singleton either; if you use locmem://, it will be process-local, which could be the more efficient choice.
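For illustration, cached_spam rewritten on top of Django's cache framework might look roughly like this (the key format and the 60-second timeout are arbitrary choices; related_spam is the property from the question):

from django.core.cache import cache
from django.db import models

class Egg(models.Model):
    # ... fields and related_spam as in the question ...

    @property
    def cached_spam(self):
        key = 'egg:{0}:related_spam'.format(self.pk)
        value = cache.get(key)
        if value is None:
            value = list(self.related_spam)  # evaluate the queryset so it can be cached
            cache.set(key, value, 60)        # shared across instances/processes per your CACHES setting
        return value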