Can I mark variables as transient so they won't be pickled? - python

Let's say I have a class:
class Thing(object):
    cachedBar = None
    def __init__(self, foo):
        self.foo = foo
    def bar(self):
        if not self.cachedBar:
            self.cachedBar = doSomeIntenseCalculation()
        return self.cachedBar
Computing bar requires some intense calculation, so I cache the result in memory to speed things up.
However, when I pickle an instance of this class I don't want cachedBar to be pickled.
Can I mark cachedBar as volatile / transient / not picklable?

According to the Pickle documentation, you can provide a method called __getstate__(), which returns something representing the state you want to have pickled (if it isn't provided, pickle uses thing.__dict__). So, you can do something like this:
class Thing:
    def __getstate__(self):
        state = dict(self.__dict__)
        # pop() instead of del: cachedBar is only in the instance dict once bar() has run
        state.pop('cachedBar', None)
        return state
This doesn't have to be a dict, but if it is something else, you need to also implement __setstate__(state).
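For illustration, here is a minimal round-trip sketch of this approach. The class body follows the question, but the expensive calculation is replaced by a stand-in (`sum(range(1000))`), which is purely an assumption for the sake of a runnable example.

```python
import pickle


class Thing:
    cachedBar = None  # class-level default, as in the question

    def __init__(self, foo):
        self.foo = foo

    def bar(self):
        if self.cachedBar is None:
            # Stand-in for doSomeIntenseCalculation() (assumption).
            self.cachedBar = sum(range(1000))
        return self.cachedBar

    def __getstate__(self):
        # Copy the instance dict so the live object keeps its cache.
        state = dict(self.__dict__)
        state.pop("cachedBar", None)
        return state


original = Thing("spam")
original.bar()  # populate the cache
clone = pickle.loads(pickle.dumps(original))
```

Because `__getstate__` still returns a dict, no `__setstate__` is needed here: the unpickled object simply falls back to the class-level `cachedBar = None` and recomputes on the next call to `bar()`.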

Implement __getstate__ to return only the parts of the object that should be pickled.

Related

How can I specialise instances of objects when I don't have access to the instantiation code?

Let's assume I am using a library which gives me instances of classes defined in that library when calling its functions:
>>> from library import find_objects
>>> result = find_objects("name = any")
[SomeObject(name="foo"), SomeObject(name="bar")]
Let's further assume that I want to attach new attributes to these instances. For example a classifier to avoid running this code every time I want to classify the instance:
>>> from library import find_objects
>>> result = find_objects("name = any")
>>> for row in result:
...     row.item_class = my_classifier(row)
Note that this is contrived but illustrates the problem: I now have instances of the class SomeObject but the attribute item_class is not defined in that class and trips up the type-checker.
So when I now write:
print(result[0].item_class)
I get a typing error. It also trips up auto-completion in editors as the editor does not know that this attribute exists.
And, not to mention that this way of implementing this is quite ugly and hacky.
One thing I could do is create a subclass of SomeObject:
class ExtendedObject(SomeObject):
    item_class = None
    def classify(self):
        cls = do_something_with(self)
        self.item_class = cls
This now makes everything explicit, I get a chance to properly document the new attributes and give it proper type-hints. Everything is clean. However, as mentioned before, the actual instances are created inside library and I don't have control over the instantiation.
Side note: I ran into this issue in flask for the Response class. I noticed that flask actually offers a way to customise the instantiation using Flask.response_class. But I am still interested how this could be achieved in libraries that don't offer this injection seam.
One thing I could do is write a wrapper that does something like this:
class WrappedObject(SomeObject):
    item_class = None
    wrapped = None
    @classmethod
    def from_original(cls, wrapped):
        self = cls.__new__(cls)
        self.wrapped = wrapped
        self.item_class = do_something_with(wrapped)
        return self
    def __getattribute__(self, key):
        return getattr(self.wrapped, key)
But this seems rather hacky and will not work in other programming languages.
Or try to copy the data:
from copy import deepcopy
class CopiedObject(SomeObject):
    item_class = None
    @classmethod
    def from_original(cls, wrapped):
        self = cls.__new__(cls)
        for key, value in vars(wrapped).items():
            setattr(self, key, deepcopy(value))
        self.item_class = do_something_with(wrapped)
        return self
but this feels equally hacky, and is risky when the objects use properties and/or descriptors.
Are there any known "clean" patterns for something like this?
I would go with a variant of your WrappedObject approach, with the following adjustments:
I would not extend SomeObject: this is a case where composition feels more appropriate than inheritance
With that in mind, from_original is unnecessary: you can have a proper __init__ method
item_class should be an instance variable and not a class variable. It should be initialized in your WrappedObject class constructor
Think twice before implementing __getattribute__ and forwarding everything to the wrapped object. If you need only a few methods and attributes of the original SomeObject class, it might be better to implement them explicitly as methods and properties
class WrappedObject:
    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.item_class = do_something_with(wrapped)
    def a_method(self):
        return self.wrapped.a_method()
    @property
    def a_property(self):
        return self.wrapped.a_property
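As a rough, self-contained sketch of how this composition approach might look end to end: `SomeObject` and `my_classifier` below are hypothetical stand-ins for the library class and the classifier from the question, and `wrap_all` is a helper name invented for this example.

```python
class SomeObject:
    """Hypothetical stand-in for the class returned by the library."""

    def __init__(self, name):
        self.name = name

    def a_method(self):
        return f"hello from {self.name}"


def my_classifier(obj):
    """Hypothetical classifier; the real one would be application-specific."""
    return "short" if len(obj.name) <= 3 else "long"


class WrappedObject:
    """Holds a SomeObject and adds a typed, documented item_class attribute."""

    def __init__(self, wrapped: SomeObject) -> None:
        self.wrapped = wrapped
        self.item_class: str = my_classifier(wrapped)

    def a_method(self) -> str:
        # Explicit forwarding keeps the public surface small and type-checkable.
        return self.wrapped.a_method()

    @property
    def name(self) -> str:
        return self.wrapped.name


def wrap_all(objects):
    """Wrap every object returned by the library call."""
    return [WrappedObject(obj) for obj in objects]


rows = wrap_all([SomeObject("foo"), SomeObject("barbaz")])
```

Since `item_class` is declared in `__init__`, type checkers and editors can see it, which was the original complaint about attaching attributes after the fact.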

Function to behave differently on class vs on instance

I'd like a particular function to be callable as a classmethod, and to behave differently when it's called on an instance.
For example, if I have a class Thing, I want Thing.get_other_thing() to work, but also thing = Thing(); thing.get_other_thing() to behave differently.
I think overwriting the get_other_thing method on initialization should work (see below), but that seems a bit hacky. Is there a better way?
class Thing:
    def __init__(self):
        # Shadow the classmethod with the instance version on each instance
        self.get_other_thing = self._get_other_thing_inst
    @classmethod
    def get_other_thing(cls):
        pass  # do something...
    def _get_other_thing_inst(self):
        pass  # do something else
Great question! What you seek can be easily done using descriptors.
Descriptors are Python objects which implement the descriptor protocol, usually starting with __get__().
They exist, mostly, to be set as a class attribute on different classes. Upon accessing them, their __get__() method is called, with the instance and owner class passed in.
class DifferentFunc:
    """Deploys a different function according to attribute access.
    I am a descriptor.
    """
    def __init__(self, clsfunc, instfunc):
        # Set our functions
        self.clsfunc = clsfunc
        self.instfunc = instfunc
    def __get__(self, inst, owner):
        # Accessed from class
        if inst is None:
            return self.clsfunc.__get__(None, owner)
        # Accessed from instance
        return self.instfunc.__get__(inst, owner)
class Test:
    @classmethod
    def _get_other_thing(cls):
        print("Accessed through class")
    def _get_other_thing_inst(inst):
        print("Accessed through instance")
    get_other_thing = DifferentFunc(_get_other_thing,
                                    _get_other_thing_inst)
And now for the result:
>>> Test.get_other_thing()
Accessed through class
>>> Test().get_other_thing()
Accessed through instance
That was easy!
By the way, did you notice me using __get__ on the class and instance function? Guess what? Functions are also descriptors, and that's the way they work!
>>> def func(self):
...     pass
...
>>> func.__get__(object(), object)
<bound method func of <object object at 0x000000000046E100>>
Upon accessing a function attribute, its __get__ is called, and that's how you get function binding.
For more information, I highly suggest reading the Python manual and the "How-To" linked above. Descriptors are one of Python's most powerful features and are barely even known.
Why not set the function on instantiation?
Or Why not set self.func = self._func inside __init__?
Setting the function on instantiation comes with quite a few problems:
self.func = self._func causes a circular reference. The instance is stored inside the bound method object returned by self._func, and that bound method is in turn stored on the instance by the assignment. The end result is that the instance references itself and will be cleaned up in a much slower and heavier manner.
Other code interacting with your class might attempt to take the function straight out of the class, and use __get__(), which is the usual expected method, to bind it. They will receive the wrong function.
Will not work with __slots__.
Although with descriptors you need to understand the mechanism, setting it on __init__ isn't as clean and requires setting multiple functions on __init__.
Takes more memory. Instead of storing one single function, you store a bound function for each and every instance.
Will not work with properties.
There are many more that I didn't add as the list goes on and on.
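The circular-reference point can be observed with a small experiment. This is only a sketch and it relies on CPython's reference counting, so other interpreters may behave differently; the class and function names are made up for the demonstration.

```python
import gc
import weakref


class StoresBoundMethod:
    def __init__(self):
        # The bound method holds a reference to self, and self holds the
        # bound method: a reference cycle.
        self.func = self._func

    def _func(self):
        return "instance"


class PlainMethod:
    def _func(self):
        return "instance"


def freed_by_refcount_alone(cls):
    """Return True if an instance disappears as soon as its last name is deleted."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep the cycle collector from running during the check
    try:
        obj = cls()
        ref = weakref.ref(obj)
        del obj
        return ref() is None
    finally:
        if gc_was_enabled:
            gc.enable()
```

On CPython, `freed_by_refcount_alone(PlainMethod)` should be true, while the `StoresBoundMethod` instance lingers until the cyclic garbage collector runs.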
Here is a slightly hacky solution:
class Thing(object):
    @staticmethod
    def get_other_thing():
        return 1
    def __getattribute__(self, name):
        if name == 'get_other_thing':
            return lambda: 2
        return super(Thing, self).__getattribute__(name)
print(Thing.get_other_thing())  # 1
print(Thing().get_other_thing())  # 2
If we are on the class, the staticmethod is executed. If we are on an instance, __getattribute__ is executed first, so we can return not Thing.get_other_thing but some other function (a lambda in my case).

Preventing fields from being pickled

I have a class like this:
class Something(object):
    def __init__(self):
        self._thing_id = None
        self._cached_thing = None
    @property
    def thing(self):
        if self._cached_thing:
            return self._cached_thing
        return Thing.objects.get(id=self._thing_id)
When pickling objects like this, I'd like to prevent pickling of the _cached_thing field, as it's volatile and specifically an in-memory-only implementation detail.
Is there a way to suggest to Pickle that I only want a subset of my fields to be pickled?
Pickle can be customized in three ways, as described in the docs.
Provide __getstate__ and __setstate__ methods.
Provide __getnewargs__/__getnewargs_ex__ (and a constructor that takes those args).
Provide __reduce__ (and a function to give to __reduce__ to reverse it).
The first is usually the simplest:
class Something(object):
    def __init__(self):
        self._thing_id = None
        self._cached_thing = None
    def __getstate__(self):
        # Note: pickle skips __setstate__ on load if this value is falsy (e.g. 0 or None)
        return self._thing_id
    def __setstate__(self, thing_id):
        self._thing_id = thing_id
    # etc.
If you want something more generic, that will pickle all values (including those set by a subclass, or dynamically after creation, etc.) except your blacklist, note that the default is "the instance's __dict__ is pickled", so just filter that:
    _blacklist = ['_cached_thing']
    def __getstate__(self):
        return {k: v for k, v in self.__dict__.items() if k not in self._blacklist}
    def __setstate__(self, state):
        self.__dict__.update(state)
And please see gnibbler's comment on the question: if you're doing something generic, you should seriously consider coming up with some kind of naming convention instead of putting a blacklist in each class. Any reader who knows or learns the convention will immediately know which properties are "cache" values rather than part of the "real" value, it'll be more obvious how things work, there's less work for you to do in each class, and fewer places to screw things up with a typo…
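One way such a naming convention could look, as a sketch only: the `_cached_` prefix and the `TransientMixin` name are assumptions made up for this example, not an established convention, and the lookup in `thing` is replaced by a string stand-in.

```python
import pickle


class TransientMixin:
    """Skips every attribute whose name starts with the agreed prefix."""

    _transient_prefix = "_cached_"  # hypothetical project-wide convention

    def __getstate__(self):
        return {
            key: value
            for key, value in self.__dict__.items()
            if not key.startswith(self._transient_prefix)
        }

    def __setstate__(self, state):
        self.__dict__.update(state)


class Something(TransientMixin):
    def __init__(self, thing_id):
        self._thing_id = thing_id
        self._cached_thing = None

    @property
    def thing(self):
        # getattr() with a default: the attribute is absent after unpickling.
        if getattr(self, "_cached_thing", None) is None:
            # Stand-in for Thing.objects.get(id=self._thing_id) (assumption).
            self._cached_thing = f"thing #{self._thing_id}"
        return self._cached_thing


original = Something(7)
original.thing  # populate the cache
clone = pickle.loads(pickle.dumps(original))
```

With this in place a subclass needs no per-class list at all; anything named `_cached_...` is left out of the pickle automatically.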
Yes, you can use the special methods __getstate__ and __setstate__ to have pickle save customized data for your objects.
http://docs.python.org/2/library/pickle.html#object.getstate
This should get you started:
class Something(object):
    def __init__(self):
        self._thing_id = 0
        self._cached_thing = None
    def __getstate__(self):
        return {
            '_thing_id': self._thing_id,
        }
    def __setstate__(self, state):
        self._thing_id = state['_thing_id']
        # The cache was not pickled, so start with an empty one
        self._cached_thing = None

(Un)Pickle Class having Instancemethod Objects

I have a class (Bar) which effectively has its own state and callback(s) and is used by another class (Foo):
class Foo(object):
    def __init__(self):
        self._bar = Bar(self.say, 10)
        self._bar.work()
    def say(self, msg):
        print(msg)
class Bar(object):
    def __init__(self, callback, value):
        self._callback = callback
        self._value = value
        self._more = {'foo': 1, 'bar': 3, 'baz': 'fubar'}
    def work(self):
        # Do some work
        self._more['foo'] = 5
        self._value = 10
        self._callback('FooBarBaz')
Foo()
Obviously I can't pickle the class Foo since Bar holds an instancemethod, so I'm left with implementing __getstate__ & __setstate__ in Bar to save self._value & self._more. But I also have to restore the self._callback method on unpickling (i.e. call __init__() from the outer class Foo, passing the callback function).
But I cannot figure out how to achieve this.
Any help is much appreciated.
Thanks.
I think if you need to serialize something like this you need to be able to define your callback as a string. For example, you might say that callback = 'myproject.callbacks.foo_callback'.
Basically in __getstate__ you'd replace the _callback function with something you could use to look up the function later like self._callback.__name__.
In __setstate__ you'd replace _callback with a function.
This depends on your functions all having real names so you couldn't use a lambda as a callback and expect it to be serialized. You'd also need a reasonable mechanism for looking up your functions by name.
You could potentially use __import__ (something like: 'myproject.somemodule.somefunc' dotted name syntax could be supported that way, see http://code.google.com/p/mock/source/browse/mock.py#1076) or just define a lookup table in your code.
Just a quick (untested, sorry!) example assuming you have a small set of possible callbacks defined in a lookup table:
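def a():
    pass
callbacks_to_name = {a: 'a'
                     # ...
                     }
callbacks_by_name = {'a': a,
                     # ...
                     }
class C:
    def __init__(self, cb):
        self._callback = cb
    def __getstate__(self):
        # Work on a copy so the live object keeps its real callback
        state = self.__dict__.copy()
        state['_callback'] = callbacks_to_name[self._callback]
        return state
    def __setstate__(self, state):
        # Swap the stored name back for the actual function
        state['_callback'] = callbacks_by_name[state['_callback']]
        self.__dict__.update(state)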
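The dotted-name variant mentioned earlier could be sketched with importlib instead of a lookup table. Treat this as an untested-in-production illustration: `qualified_name`, `resolve`, and `Worker` are names invented here, and the approach only handles plain module-level functions (no lambdas, no bound methods).

```python
import importlib
import json
import pickle


def qualified_name(func):
    """Return 'package.module.func' for a module-level function."""
    return f"{func.__module__}.{func.__qualname__}"


def resolve(dotted):
    """Import the module part of a dotted name and fetch the attribute."""
    module_name, _, attr = dotted.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


class Worker:
    def __init__(self, callback):
        self._callback = callback

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_callback"] = qualified_name(self._callback)  # store the name only
        return state

    def __setstate__(self, state):
        state["_callback"] = resolve(state["_callback"])  # look the function up again
        self.__dict__.update(state)


# json.dumps is used purely as a convenient module-level callback.
restored = pickle.loads(pickle.dumps(Worker(json.dumps)))
```

The stored state contains only the string `'json.dumps'`, so the pickle stays valid as long as that name can still be imported when it is loaded.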
I'm not sure what your use case is but I'd recommend doing this by serializing your work items to JSON or XML and writing a simple set of functions to serialize and deserialize them yourself.
The benefit is that the serialized format can be read and understood by humans and modified when you upgrade your software. Pickle is tempting because it seems close enough, but by the time you have a serious pile of __getstate__ and __setstate__ you haven't really saved yourself much effort or headache over building your own scheme specifically for your application.

Hash a python new-style class instance?

Given a custom, new-style python class instance, what is a good way to hash it and get a unique ID-like value from it to use for various purposes? Think md5sum or sha1sum of a given class instance.
The approach I am currently using pickles the class and runs that through hexdigest, storing the resultant hash string into a class property (this property is never part of the pickle/unpickle procedures, fyi). Except now I've run into a case where a third-party module uses nested classes, and there is no really good way to pickle those without some hacks. I figure that I am missing out on some clever little Python trick somewhere to accomplish this.
Edit:
Example code because it seems to be a requirement around here to get any traction on a question. The below class can be initialized and the self._uniq_id property can be properly setup.
#!/usr/bin/env python
import hashlib
# cPickle or pickle.
try:
    import cPickle as pickle
except ImportError:
    import pickle
# END try
# Single class, pickles fine.
class FooBar(object):
    __slots__ = ("_foo", "_bar", "_uniq_id")
    def __init__(self, eth=None, ts=None, pkt=None):
        self._foo = "bar"
        self._bar = "bar"
        self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]
    def __getstate__(self):
        return {'foo': self._foo, 'bar': self._bar}
    def __setstate__(self, state):
        self._foo = state['foo']
        self._bar = state['bar']
        self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]
    def _get_foo(self): return self._foo
    def _get_bar(self): return self._bar
    def _get_uniq_id(self): return self._uniq_id
    foo = property(_get_foo)
    bar = property(_get_bar)
    uniq_id = property(_get_uniq_id)
# End
This next class, however, cannot be initialized because of Bar being nested in Foo:
#!/usr/bin/env python
import hashlib
# cPickle or pickle.
try:
    import cPickle as pickle
except ImportError:
    import pickle
# END try
# Nested class, can't pickle for hexdigest.
class Foo(object):
    __slots__ = ("_foo", "_bar", "_uniq_id")
    class Bar(object):
        pass
    def __init__(self, eth=None, ts=None, pkt=None):
        self._foo = "bar"
        self._bar = self.Bar()
        self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]
    def __getstate__(self):
        return {'foo': self._foo, 'bar': self._bar}
    def __setstate__(self, state):
        self._foo = state['foo']
        self._bar = state['bar']
        self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]
    def _get_foo(self): return self._foo
    def _get_bar(self): return self._bar
    def _get_uniq_id(self): return self._uniq_id
    foo = property(_get_foo)
    bar = property(_get_bar)
    uniq_id = property(_get_uniq_id)
# End
The error I receive is:
Traceback (most recent call last):
  File "./nest_test.py", line 70, in <module>
    foobar2 = Foo()
  File "./nest_test.py", line 49, in __init__
    self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]
cPickle.PicklingError: Can't pickle <class '__main__.Bar'>: attribute lookup __main__.Bar failed
(nest_test.py has both classes in it, hence the line number offset.)
Pickling requires the __getstate__() method I found out, so I also implemented __setstate__() for completeness as well. But given the already existing warnings about security and pickle, there's got to be a better way to do this.
Based on what I have read so far, the error stems from Python not being able to resolve the nested classes. It tries to look up the attribute __main__.Bar, which doesn't exist. It really needs to be able to find __main__.Foo.Bar instead, but there is no really good way to do this. I bumped into another SO answer here that provides a "hack" to trick Python, but it came with a stern warning that such an approach is not advisable, and to either use something other than pickling or to move the nested class definition to the outside versus the inside.
However, the original question of that SO answer, I believe, was for pickling and unpickling to a file. I only need to pickle in order to use the requisite hashlib functions, which seem to operate on a bytearray (much like I am used to in .NET), and pickling (Especially cPickle) is fast and optimized versus writing my own bytearray routine.
That depends entirely on what properties the ID should have.
For instance, you can use id(foo) to get an ID which is guaranteed to be unique as long as foo is active in memory, or you could use repr(instance.__dict__) if all of the fields have sensible repr values.
What specifically do you need it for?
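If the ID is meant to reflect the object's contents, the repr-based idea might look roughly like this. It is a sketch under assumptions: `Point` and `content_id` are made-up names, the object must have a `__dict__` (so it would not work as-is with the `__slots__` classes from the question), and every attribute needs a stable, meaningful repr.

```python
import hashlib


class Point:
    """Toy class used only for the demonstration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def content_id(obj, length=16):
    """Hash the sorted attribute items so equal contents give equal IDs."""
    # Sorting by name keeps the digest independent of assignment order.
    payload = repr(sorted(vars(obj).items())).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:length]
```

Two objects with the same attribute values get the same ID here, which avoids pickling entirely but also means the result is only as good as the reprs involved.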
While you're using hexdigests of pickles at the moment, you make it sound like the id doesn't actually need to be related to the object, it just needs to be unique. Why not simply use the uuid module, specifically uuid.uuid4 to generate unique IDs and assign them to a uuid field in the object...
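A minimal sketch of the uuid suggestion, assuming the ID only has to be unique and not derived from the object's contents (`Tagged` is a made-up class name):

```python
import pickle
import uuid


class Tagged:
    def __init__(self, foo):
        self.foo = foo
        # Random 128-bit ID rendered as 32 hex characters; unrelated to contents.
        self.uniq_id = uuid.uuid4().hex


first = Tagged("bar")
second = Tagged("bar")
restored = pickle.loads(pickle.dumps(first))
```

Because the ID is an ordinary attribute it survives pickling unchanged, and nothing has to be serialized just to compute it.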
