Alternatives to pickle's `persistent_id`? - python

I have been using Python's pickle
module for implementing a thin file-based persistence layer. The
persistence layer (part of a larger library) relies heavily on pickle's persistent_id feature
to save objects of specified classes as separate files.
The only issue with this approach is that pickle files are not human
editable, and I'd much rather have objects saved in a format that is
human readable and editable with a text editor (e.g., YAML or JSON).
Do you know of any library that uses a human-editable format and
offers features similar to pickle's persistent_id? Alternatively,
do you have suggestions for implementing them on top of a YAML- or
JSON-based serialization library, without rewriting a large subset of
pickle?

I haven't tried this yet myself, but I think you should be able to do this elegantly with PyYAML using what they call "representers" and "constructors".
EDIT
After an extensive exchange of comments with the poster, here is a method to achieve the required behavior with PyYAML.
Important Note: if a Persistable instance has another such instance as an attribute, or contained somewhere inside one of its attributes, then the contained Persistable instance will not be saved to yet another separate file; instead, it will be saved inline, in the same file as the parent Persistable instance. To the best of my understanding, this limitation also existed in the OP's pickle-based system, and it may be acceptable for their use cases. I haven't found an elegant solution for this that doesn't involve hacking yaml.representer.BaseRepresenter.
import yaml
from functools import partial

class Persistable(object):
    # simulate a unique id
    _unique = 0
    def __init__(self, *args, **kw):
        Persistable._unique += 1
        self.persistent_id = ("%s.%d" %
                              (self.__class__.__name__, Persistable._unique))

def persistable_representer(dumper, data):
    # dump the object to its own file, then emit only a cross-reference
    id = data.persistent_id
    print "Writing to file: %s" % id
    outfile = open(id, 'w')
    # plain yaml.dump (not my_yaml_dump) is used here on purpose, so
    # contained Persistable instances are saved inline (see the note above)
    outfile.write(yaml.dump(data))
    outfile.close()
    return dumper.represent_scalar(u'!xref', u'%s' % id)

class PersistingDumper(yaml.Dumper):
    pass

# add_multi_representer (rather than add_representer) makes the
# registration apply to subclasses of Persistable as well
PersistingDumper.add_multi_representer(Persistable, persistable_representer)
my_yaml_dump = partial(yaml.dump, Dumper=PersistingDumper)

def persistable_constructor(loader, node):
    # follow the cross-reference and load the object from its own file
    xref = loader.construct_scalar(node)
    print "Reading from file: %s" % xref
    infile = open(xref, 'r')
    value = yaml.load(infile.read())
    infile.close()
    return value

yaml.add_constructor(u'!xref', persistable_constructor)
# example use, also serves as a test
class Foo(Persistable):
    def __init__(self):
        self.one = 1
        Persistable.__init__(self)

class Bar(Persistable):
    def __init__(self, foo):
        self.foo = foo
        Persistable.__init__(self)

foo = Foo()
bar = Bar(foo)

print "=== foo ==="
dumped_foo = my_yaml_dump(foo)
print dumped_foo
print yaml.load(dumped_foo)
print yaml.load(dumped_foo).one

print "=== bar ==="
dumped_bar = my_yaml_dump(bar)
print dumped_bar
print yaml.load(dumped_bar)
print yaml.load(dumped_bar).foo
print yaml.load(dumped_bar).foo.one

baz = Bar(Persistable())
print "=== baz ==="
dumped_baz = my_yaml_dump(baz)
print dumped_baz
print yaml.load(dumped_baz)
From now on, use my_yaml_dump instead of yaml.dump whenever you want to save instances of the Persistable class to separate files, but don't use it inside persistable_representer or persistable_constructor! No special loading function is necessary; just use yaml.load.
Phew, that took some work... I hope this helps!
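For the JSON half of the question, an analogous cross-reference scheme can be sketched on top of the standard json module, using a custom JSONEncoder plus an object_hook. This is an untested sketch: the '$xref' key is just an illustrative convention, and note that plain JSON gives you back dicts rather than class instances, so a rebuild step would still be needed:

import json

class PersistingEncoder(json.JSONEncoder):
    """Write each Persistable to its own file and emit only a cross-reference."""
    def default(self, obj):
        if isinstance(obj, Persistable):
            with open(obj.persistent_id, 'w') as outfile:
                # unlike the YAML version above, this recurses into nested
                # Persistable attributes, sending each one to its own file
                json.dump(obj.__dict__, outfile, cls=PersistingEncoder)
            return {'$xref': obj.persistent_id}
        return json.JSONEncoder.default(self, obj)

def xref_hook(d):
    """Follow '$xref' markers back to their files while loading."""
    if '$xref' in d:
        with open(d['$xref']) as infile:
            return json.load(infile, object_hook=xref_hook)
    return d

# usage sketch:
#   text = json.dumps(bar, cls=PersistingEncoder)
#   data = json.loads(text, object_hook=xref_hook)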

Related

Saving and Loading a class instance from file

An essential part of my project is being able to save and load class instances to a file. For further context, my class has both a set of attributes as well as a few methods.
So far, I've tried using pickle, but it's not working quite as expected. For starters, it's not loading the methods, nor is it letting me set attributes that I defined initially; in other words, it's not really giving me back a copy of the class I can work with.
Relevant Code:
class Brick(object):
    def __init__(self, name, filename=None, areaMin=None, areaMax=None, kp=None):
        self.name = name
        self.filename = filename
        self.areaMin = areaMin
        self.areaMax = areaMax
        self.kp = kp
        self.__kpsave = None
        if filename != None:
            self.__logfile = file(filename, 'w')

    def __getstate__(self):
        f = self.__logfile
        self.__kpsave = []
        for point in self.kp:
            temp = (point.pt, point.size, point.angle, point.response, point.octave, point.class_id)
            self.__kpsave.append(temp)
        return (self.name, self.areaMin, self.areaMax, self.__kpsave,
                f.name, f.tell())

    def __setstate__(self, state):
        self.name, self.areaMin, self.areaMax, self.__kpsave, name, position = state
        f = file(name, 'w')
        f.seek(position)
        self.__logfile = f
        self.filename = name
        self.kp = []
        for point in self.__kpsave:
            temp = cv2.KeyPoint(x=point[0][0], y=point[0][1], _size=point[1], _angle=point[2], _response=point[3],
                                _octave=point[4], _class_id=point[5])
            self.kp.append(temp)

    def calculateORB(self, img):
        pass  # I've omitted the actual method here
(There are a few more attributes and methods, but they're not relevant.)
Now, this class definition works just fine when creating new instances: I can make a new Brick with just the name, I can then set areaMin or any other attribute, and I can use pickle (cPickle) to dump the current instance to a file just fine (I'm using those __getstate__ and __setstate__ methods because pickle won't work with OpenCV's KeyPoint objects).
The problem comes, of course, when I load the instance: using pickle's load() I can load the instance from a file, and the values I set previously are there (i.e., I can access areaMin just fine if I set a value for it), but I can't access any methods or assign values to any of the attributes whose values I never changed. I've also noticed that I don't even need to import my class definition if I'm simply unpickling from a completely different source file.
Since all I want to do is build a "database" of sorts from my class objects, what's the best way to approach this? I know something that would work is to simply write a .Save() method that writes a .py source file in which I essentially create an instance of the class, so I can then .Load() it with exec and eval as appropriate; however, this seems like the worst possible way to do it, so how should I actually do this?
Thanks.
You should not try to do I/O inside your __getstate__ and __setstate__ methods - those are called by pickle, and the expected result is just an in-memory object that can be further pickled.
Moreover, if the "Point" class in the "self.kp" attribute is just a regular Python class, there is no need for you to customize pickling at all.
What you have to worry about is dealing with the I/O at the point where you call pickle. If you really need to load different instances independently, you could resort to the "shelve" module, or, better yet, use pickle.dumps and store the resulting string in a DBMS (which can be the built-in sqlite3).
All in all:
class Point(object):
    ...

class Brick(object):
    def __init__(self, point, ...):
        self.kp = point
Then, to save a single object to a file:
with open("filename.pickle", "wb") as file_:
    pickle.dump(my_brick, file_, -1)
and restore with:
my_brick = pickle.load(open("filename.pickle", "rb"))
To store several instances and recover them all at once, you could just dump them in sequence to the same open file, and then read them back one by one until you get an EOFError at the end of the file (see the sketch below) - or you can simply add all the objects you want to save to a list and pickle the whole list at once.
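A minimal sketch of that dump-in-sequence approach (the load_all helper is my illustration, not a library function):

import pickle

def load_all(path):
    """Read back every object that was dumped in sequence to one file."""
    objects = []
    with open(path, 'rb') as f:
        while True:
            try:
                objects.append(pickle.load(f))
            except EOFError:
                break   # no more pickled objects in the file
    return objects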
To save and retrieve arbitrary objects that you can look up by some attribute like "name" or "id", you can resort to the shelve module (https://docs.python.org/3/library/shelve.html), or use a real database if you need complex queries and such. Trying to write your own ad hoc binary format that allows searching for the required instance is a horrible idea - you'd have to implement a whole file protocol yourself: reading, writing, safeguards, corner cases, and so on.
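And a minimal sketch of the shelve variant; the 'bricks' filename and keying on brick.name are illustrative choices, and the stand-in class below replaces the simplified Brick from above just to keep the example self-contained:

import shelve

class Brick(object):
    # minimal stand-in for the simplified Brick above
    def __init__(self, name, areaMin=None):
        self.name = name
        self.areaMin = areaMin

with shelve.open('bricks') as db:        # 'bricks' is an illustrative filename
    for brick in (Brick('red', 1.0), Brick('blue', 2.5)):
        db[brick.name] = brick           # shelve pickles the value for us

with shelve.open('bricks') as db:        # later, possibly in another session
    restored = db['red']                 # only this object is unpickled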

Ignoring objects for PyYAML dump

I use PyYAML's dump to print complex data structures, but there is one class of objects that cannot be dumped - and that I also do not want dumped.
Currently I get:
yaml.representer.RepresenterError: cannot represent an object
I would like yaml.dump to completely ignore this particular class, or just render the class name and continue as usual.
If this is possible, how can I do that?
You'll have to provide a representer for the object. There are multiple ways to do that, some involving changing the object.
When you explicitly register a representer, the object doesn't have to be changed:
import sys
from ruamel import yaml

class Secret():
    def __init__(self, user, password):
        self.user = user
        self.password = password

def secret_representer(dumper, data):
    return dumper.represent_scalar(u'!secret', u'unknown')

yaml.add_representer(Secret, secret_representer)

data = dict(a=1, b=2, c=[42, Secret(user='cary', password='knoop')])
yaml.dump(data, sys.stdout)
In secret_representer, data is the instantiated Secret(); since the function doesn't use it, no "secrets" are leaked. You could also, e.g., return the user name but not the password. The represent_scalar method expects a tag (here I used !secret) and a scalar (here the string unknown).
The output of the above:
a: 1
b: 2
c: [42, !secret 'unknown']
I am using ruamel.yaml above, which is an upgraded version of PyYAML (disclaimer: I am the author of that package). The above should work with PyYAML as well.
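If you want the "just render the class name" behavior from the question for any object that would otherwise be unrepresentable, one option (an untested sketch; the !opaque tag is an arbitrary choice) is to register a multi-representer on object, which is consulted for every type that has no representer of its own:

import sys
from ruamel import yaml

class Secret():
    def __init__(self, user, password):
        self.user = user
        self.password = password

def fallback_representer(dumper, data):
    # emit only the class name, so the dump continues instead of
    # raising a RepresenterError
    return dumper.represent_scalar(u'!opaque', type(data).__name__)

yaml.add_multi_representer(object, fallback_representer)

yaml.dump(dict(a=1, b=2, c=[42, Secret('cary', 'knoop')]), sys.stdout)
# a: 1
# b: 2
# c: [42, !opaque 'Secret']

Built-in types such as dict, list, and int keep their normal representation because they are matched by an exact-type representer before the multi-representer on object is consulted.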

How to dynamically create new python class files?

I currently have a CSV file with over 200 entries, where each line needs to be made into its own class file. These classes will be inheriting from a base class with some field variables that it will inherit and set values to based on the CSV file. Additionally, the name of the python module will need to be based off an entry of the CSV file.
I really don't want to manually make over 200 individual python class files, and was wondering if there was a way to do this easily. Thanks!
Edit: I'm definitely more of a Java/C# coder, so I'm not too familiar with Python.
Some more details: I'm trying to create an AI for an existing web game, from which I can extract live data via a live-stream text box.
There are over 200 moves that a player can use each turn, and each move is vastly different. I could create a new instance of a generic move class each time a move is used, but then I would have to loop through a database of all the moves and their effects every time, which seems very inefficient. Hence, I was thinking of creating a class for every move, with the same name as it appears in the text box, so that I could create new instances of a specific move more quickly.
As others have stated, you usually want to be doing runtime class generation for this kind of thing, rather than creating individual files.
But I thought: what if you had some good reason to do this, like just making class templates for a bunch of files, so that you could go in and expand them later? Say I plan on writing a lot of code, so I'd like to automate the boilerplate code parts, that way I'm not stuck doing tedious work.
Turns out writing a simple templating engine for Python classes isn't that hard. Here's my go at it, which is able to do templating from a csv file.
from os import path
from sys import argv
import csv

INIT = 'def __init__'

def csvformat(csvpath):
    """ Read a csv file containing a class name and attrs.

    Returns a list of the form [['ClassName', {'attr': 'val'}]].
    """
    csv_lines = []
    with open(csvpath) as f:
        reader = csv.reader(f)
        for line in reader:
            csv_lines.append(line)
    result = []
    for line in csv_lines:
        attr_dict = {}
        attrs = line[1:]
        last_attr = attrs[0]
        for attr in attrs[1:]:
            if last_attr:
                attr_dict[last_attr] = attr
                last_attr = ''
            else:
                last_attr = attr
        result.append([line[0], attr_dict])
    return result

def attr_template(attrs):
    """ Format a list of default attribute setting code. """
    attr_list = []
    for attr, val in attrs.items():
        attr_list.append(str.format('        if {} is None:\n', attr))
        attr_list.append(str.format('            self.{} = {}\n', attr, val))
        attr_list.append('        else:\n')
        attr_list.append(str.format('            self.{} = {}\n', attr, attr))
    return attr_list

def import_template(imports):
    """ Import superclasses.

    Assumes the .py files are named based on the lowercased class name.
    """
    imp_lines = []
    for imp in imports:
        imp_lines.append(str.format('from {} import {}\n',
                                    imp.lower(), imp))
    return imp_lines

def init_template(attrs):
    """ Template a series of optional arguments based on a dict of attrs. """
    init_string = 'self'
    for key in attrs:
        init_string += str.format(', {}=None', key)
    return init_string

def gen_code(foldername, superclass, name, attrs):
    """ Generate python code in foldername.

    Uses superclass for the superclass, name for the class name,
    and attrs as a dict of {attr: val} for the generated class.
    Writes to a file with the lowercased name as the name of the class.
    """
    imports = [superclass]
    pathname = path.join(foldername, name.lower() + '.py')
    with open(pathname, 'w') as pyfile:
        for imp in import_template(imports):
            pyfile.write(imp)
        pyfile.write('\n')
        pyfile.write(str.format('class {}({}):\n', name, superclass))
        pyfile.write(str.format('    {}({}):\n',
                                INIT, init_template(attrs)))
        for attribute in attr_template(attrs):
            pyfile.write(attribute)
        pyfile.write('        super().__init__()')

def read_and_generate(csvpath, foldername, superclass):
    class_info = csvformat(csvpath)
    for line in class_info:
        gen_code(foldername, superclass, *line)

def main():
    read_and_generate(argv[1], argv[2], argv[3])

if __name__ == "__main__":
    main()
The above takes a csv file formatted like this as its first argument (here, saved as a.csv):
Magistrate,foo,42,fizz,'baz'
King,fizz,'baz'
Here the first field is the class name, followed by alternating attribute names and default values. The second argument is the path to the output folder, and the third is the name of the superclass.
If I make a folder called classes and create a classes/mysuper.py in it with a basic class structure:
class MySuper():
    def __init__(self, *args, **kwargs):
        pass
And then run the code like this:
$ python3 codegen.py a.csv classes MySuper
I get the file classes/magistrate.py with the following contents:
from mysuper import MySuper

class Magistrate(MySuper):
    def __init__(self, fizz=None, foo=None):
        if fizz is None:
            self.fizz = 'baz'
        else:
            self.fizz = fizz
        if foo is None:
            self.foo = 42
        else:
            self.foo = foo
        super().__init__()
And classes/king.py:
from mysuper import MySuper

class King(MySuper):
    def __init__(self, fizz=None):
        if fizz is None:
            self.fizz = 'baz'
        else:
            self.fizz = fizz
        super().__init__()
You can actually load them and use them, too!
$ cd classes
classes$ python3 -i magistrate.py
>>> m = Magistrate()
>>> m.foo
42
>>> m.fizz
'baz'
>>>
The above generates Python 3 code, which is what I'm used to, so you will need to make some small changes for it to work in Python 2.
First of all, you don't have to separate Python classes into one file each - it is more common to group them by functionality into modules and packages (see What's the difference between a Python module and a Python package?). Furthermore, 200 similar classes sounds like a quite unusual design - are they really needed, or could you, e.g., use a dict to store some properties?
And of course you can just write a small Python script that reads in the csv and generates one or more .py files containing the classes (lines of text written to the file).
It should be just a few lines of code, depending on the level of customization.
If the list changes, you don't even have to write the classes to a file: you can generate them on the fly.
If you tell us how far you've got, or give more details about the problem, we can help you complete the code...
Instead of generating .py files, read in the csv and do dynamic type creation, as sketched below. This way, if the csv changes, you can be sure that your types are regenerated accordingly.
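As a rough illustration (the function name and the CSV layout, borrowed from the other answer, are assumptions), runtime creation with type() could look like this:

import csv

def make_classes(csvpath, base=object):
    """Create one class per CSV row at runtime instead of writing files."""
    classes = {}
    with open(csvpath) as f:
        for row in csv.reader(f):
            if not row:
                continue
            name, pairs = row[0], row[1:]
            # alternating attribute names and default values
            attrs = dict(zip(pairs[::2], pairs[1::2]))
            classes[name] = type(name, (base,), attrs)
    return classes

# usage sketch:
#   moves = make_classes('a.csv')
#   magistrate = moves['Magistrate']()

Re-running make_classes after the csv changes rebuilds the whole mapping, so the classes can never drift out of sync with the data.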

How do I use TDD to create a database representation of existing objects?

I have used TDD to develop a set of classes in Python. These objects contain data fields, functions, and links to each other. Functionally, everything works as I want.
Eventually all of this should be stored in a database, to be used in a Django web application.
I have sketched some possible database schemas to hold the same information, but this feels like a "sudden big leap" compared to the traditional TDD way of developing the rest of the application.
So, now I wonder, which tests should I write to force me to store these objects in a database in a step-by-step TDD way?
Making this question a bit more concrete, the classes are currently like this:
class Connector(object):
    def __init__(self, title=None):
        self.value = None
        self.valid = False
        self.title = title
    ...

class Element(object):
    def __init__(self, title=None):
        self.title = title
        self.input_connectors = []
        self.output_connectors = []
        self.number_of_runs = 0

    def run(self):
        ...
        self.number_of_runs += 1

class Average(Element):
    def __init__(self, title=None):
        super(Average, self).__init__(title=title)
        self.src = Connector("source")
        self.avg = Connector("average")
        self.input_connectors.append(self.src)
        self.output_connectors.append(self.avg)

    def run(self):
        super(Average, self).run()
        self.avg.set_value(numpy.average(self.src.value))
I realize some of the data should be in the database, while processing functions should not. I think there should be a table which represents the details of the different "types / subclasses " of Element, while also one which stores actual instances. But, as I said, I don't see how to get there using TDD.
First, ask yourself whether you will be testing your code or the Django ORM. Storing and reading will be fine most of the time.
The things you will need to test are the validation of your data and any properties that are not model fields. I think you will end up with a good schema by writing tests on the next layer up from the database.
Also, use South or some other migration tool to reduce the cost of schema changes. That will give you some peace of mind.
If you have several levels of tests (e.g., integration tests), it makes sense to check that the database configuration is intact, but most of the tests do not need to hit the database. You can achieve that by mocking the model or some of its database operations (at least save()). Having said that, you can check database writes and reads with this simple test:
def test_db_access(self):
    input = Element(title='foo')
    input.save()
    output = Element.objects.get(title='foo')
    self.assertEquals(input, output)
Mocking save:
def save(obj):
    obj.id = 1
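For instance, the stub above could be patched in with unittest.mock (a sketch under the assumption that Element has a save() method like a Django model; the plain class here just stands in for it):

from unittest import mock

class Element(object):
    # stand-in for the Django model in the question
    def __init__(self, title=None):
        self.title = title
        self.id = None

    def save(self):
        raise RuntimeError("would hit the database")

def fake_save(obj, *args, **kwargs):
    # pretend the object reached the database by giving it a primary key
    obj.id = 1

with mock.patch.object(Element, 'save', autospec=True, side_effect=fake_save):
    element = Element(title='foo')
    element.save()           # no database hit; fake_save runs instead
    assert element.id == 1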

Python Pickle Yielding EOF Error on Read and Not Reading Correctly

I am trying to pickle a patient object by using:
theFile = open(str(location)+str(filename)+'.pkl','wb')
pickle.dump(self,theFile)
theFile.close()
This works well and successfully writes to the file as desired. But! When I try to load the data from the thumb drive, I either get an EOF error or it loads old data that is not present on the thumb drive. I don't know where this old data is coming from, considering the pickle file contains all the correct saved data...
Loading operation:
theFile = open('/media/SUPER/hr4e/thumb/patient.pkl','r+')
self = pickle.load(theFile)
theFile.close()
An example would be: I change an attribute of the desired object and save it. The attribute is clearly saved in the pickle file, but when I reload the pickle file on another computer, it doesn't read it correctly and loads old data. I checked whether it was reading the pickle file, and it is...
Are there any subtle nuances about pickles that I am missing? Or am I just using the wrong read and write arguments for saving and loading the pickle?
Assigning to self inside a method only rebinds the local variable self within that method; it doesn't update the object itself. To load an object, return the newly loaded object from a classmethod or function instead. Try code like this:
import pickle

class Patient(object):
    def __init__(self, name):
        self.name = name

    def save(self, location, filename):
        theFile = open(str(location)+str(filename)+'.pkl', 'wb')
        pickle.dump(self, theFile)
        theFile.close()

    @classmethod
    def load(cls, location, filename):
        theFile = open(str(location)+str(filename)+'.pkl', 'rb')
        m = pickle.load(theFile)
        theFile.close()
        return m

p = Patient("Bob")
print p.name

# save the patient
p.save("c:\\temp\\", "bob")

# load the patient - this could be in a new session
l = Patient.load("c:\\temp\\", "bob")
print l.name
Open the file in binary mode, e.g.:
theFile = open('/media/SUPER/hr4e/thumb/patient.pkl','r+b')
I ended up pickling the object's attribute dictionary instead, which worked much better. Example:
self.__dict__ = pickle.load(file('data/pickles/clinic.pkl','r+b'))
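A small sketch of that idea as a save/load pair (the method names are illustrative): updating self.__dict__ in place mutates the existing instance, which works where rebinding self does not.

import pickle

class Patient(object):
    def save(self, path):
        # persist only the attribute dictionary, not the instance itself
        with open(path, 'wb') as f:
            pickle.dump(self.__dict__, f)

    def load(self, path):
        # rebind the attributes on this existing instance, in place
        with open(path, 'rb') as f:
            self.__dict__.update(pickle.load(f))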
