Class for pickle- and copy-persistent object? - python

I'm trying to write a class for a read-only object which will not really be duplicated by the copy module, and which, when pickled and transferred between processes, leaves each process holding at most one copy of it, no matter how many times it is passed around as a "new" object. Is there already something like that?

I made an attempt to implement this. @Alex Martelli and anyone else, please give me comments/improvements. I think this will eventually end up on GitHub.
"""
todo: need to lock library to avoid thread trouble?
todo: need to raise an exception if we're getting pickled with
an old protocol?
todo: make it polite to other classes that use __new__. Therefore, should
probably work not only when there is only one item in the *args passed to new.
"""
import uuid
import weakref
library = weakref.WeakValueDictionary()
class UuidToken(object):
def __init__(self, uuid):
self.uuid = uuid
class PersistentReadOnlyObject(object):
def __new__(cls, *args, **kwargs):
if len(args)==1 and len(kwargs)==0 and isinstance(args[0], UuidToken):
received_uuid = args[0].uuid
else:
received_uuid = None
if received_uuid:
# This section is for when we are called at unpickling time
thing = library.pop(received_uuid, None)
if thing:
thing._PersistentReadOnlyObject__skip_setstate = True
return thing
else: # This object does not exist in our library yet; Let's add it
new_args = args[1:]
thing = super(PersistentReadOnlyObject, cls).__new__(cls,
*new_args,
**kwargs)
thing._PersistentReadOnlyObject__uuid = received_uuid
library[received_uuid] = thing
return thing
else:
# This section is for when we are called at normal creation time
thing = super(PersistentReadOnlyObject, cls).__new__(cls, *args,
**kwargs)
new_uuid = uuid.uuid4()
thing._PersistentReadOnlyObject__uuid = new_uuid
library[new_uuid] = thing
return thing
def __getstate__(self):
my_dict = dict(self.__dict__)
del my_dict["_PersistentReadOnlyObject__uuid"]
return my_dict
def __getnewargs__(self):
return (UuidToken(self._PersistentReadOnlyObject__uuid),)
def __setstate__(self, state):
if self.__dict__.pop("_PersistentReadOnlyObject__skip_setstate", None):
return
else:
self.__dict__.update(state)
def __deepcopy__(self, memo):
return self
def __copy__(self):
return self
# --------------------------------------------------------------
"""
From here on it's just testing stuff; will be moved to another file.
"""
def play_around(queue, thing):
import copy
queue.put((thing, copy.deepcopy(thing),))
class Booboo(PersistentReadOnlyObject):
def __init__(self):
self.number = random.random()
if __name__ == "__main__":
import multiprocessing
import random
import copy
def same(a, b):
return (a is b) and (a == b) and (id(a) == id(b)) and \
(a.number == b.number)
a = Booboo()
b = copy.copy(a)
c = copy.deepcopy(a)
assert same(a, b) and same(b, c)
my_queue = multiprocessing.Queue()
process = multiprocessing.Process(target = play_around,
args=(my_queue, a,))
process.start()
process.join()
things = my_queue.get()
for thing in things:
assert same(thing, a) and same(thing, b) and same(thing, c)
print("all cool!")

I don't know of any such functionality already implemented. The interesting problem is as follows, and needs precise specs as to what's to happen in this case...:
process A makes the obj and sends it to B which unpickles it, so far so good
A makes change X to the obj, meanwhile B makes change Y to ITS copy of the obj
now either process sends its obj to the other, which unpickles it: what changes
to the object need to be visible at this time in each process? does it matter
whether A's sending to B or vice versa, i.e. does A "own" the object? or what?
If you don't care, say because only A OWNS the obj -- only A is ever allowed to make changes and send the obj to others, others can't and won't change -- then the problems boil down to identifying obj uniquely -- a GUID will do. The class can maintain a class attribute dict mapping GUIDs to existing instances (probably as a weak-value dict to avoid keeping instances needlessly alive, but that's a side issue) and ensure the existing instance is returned when appropriate.
But if changes need to be synchronized to any finer granularity, then suddenly it's a REALLY difficult problem of distributed computing and the specs of what happens in what cases really need to be nailed down with the utmost care (and more paranoia than is present in most of us -- distributed programming is VERY tricky unless a few simple and provably correct patterns and idioms are followed fanatically!-).
If you can nail down the specs for us, I can offer a sketch of how I would go about trying to meet them. But I won't presume to guess the specs on your behalf;-).
Edit: the OP has clarified, and it seems all he needs is a better understanding of how to control __new__. That's easy: see __getnewargs__ -- you'll need a new-style class and pickling with protocol 2 or better (but those are advisable anyway for other reasons!-), then __getnewargs__ in an existing object can simply return the object's GUID (which __new__ must receive as an optional parameter). So __new__ can check if the GUID is present in the class's memo (weak-value ;-) dict (and if so return the corresponding object value) -- if not (or if the GUID is not passed, implying it's not an unpickling, so a fresh GUID must be generated), then make a truly-new object (setting its GUID;-) and also record it in the class-level memo.
BTW, to make GUIDs, consider using the uuid module in the standard library.
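For concreteness, here is a minimal sketch of that pattern (the class and attribute names are mine, not taken from the question's code): __getnewargs__ hands the GUID to __new__, which consults a class-level weak-value memo dict.

import pickle
import uuid
import weakref

class Interned(object):
    _memo = weakref.WeakValueDictionary()    # class-level map: GUID -> instance

    def __new__(cls, guid=None):
        if guid is not None and guid in cls._memo:
            return cls._memo[guid]            # unpickling an object we already hold
        self = super(Interned, cls).__new__(cls)
        self.guid = guid if guid is not None else uuid.uuid4()
        cls._memo[self.guid] = self
        return self

    def __getnewargs__(self):
        # Used by pickle protocol 2+ to pass the GUID back into __new__
        return (self.guid,)

obj = Interned()
clone = pickle.loads(pickle.dumps(obj, protocol=2))
print(clone is obj)   # True: the memo returned the existing instance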

You could simply use a dictionary in the receiver, with the keys and the values the same. And to avoid a memory leak, use a WeakKeyDictionary.

Related

Is this sound software engineering practice for class construction?

Is this a plausible and sound way to write a class where there is a syntactic-sugar @staticmethod that the outside uses to interact with it? Thanks.
### script1.py ###
from script2 import SampleClass

output = SampleClass.method1(input_var)

### script2.py ###
class SampleClass(object):
    def __init__(self):
        self.var1 = 'var1'
        self.var2 = 'var2'

    @staticmethod
    def method1(input_var):
        # Syntactic sugar method that the outside uses
        sample_class = SampleClass()
        result = sample_class._method2(input_var)
        return result

    def _method2(self, input_var):
        # Main method that executes the various steps.
        self.var4 = self._method3(input_var)
        return self._method4(self.var4)

    def _method3(self, input_var):
        pass

    def _method4(self, var4):
        pass
Answering both your question and your comment: yes, it is possible to write such code, but I see no point in doing it:
class A:
    def __new__(cls, value):
        return cls.meth1(value)

    def meth1(value):
        return value + 1

result = A(100)
print(result)
# output:
# 101
You can't store a reference to a class A instance because you get your method result instead of an A instance. And because of this, an existing __init__ will not be called.
So if the instance just calculates something and gets discarded right away, what you want is to write a simple function, not a class. You are not storing state anywhere.
And if you look at it:
result = some_func(value)
looks exactly like what people expect when reading it: a function call.
So no, it is not a good practice unless you come up with a good use case for it (I can't remember one right now).
Also relevant for this question is the Python data model documentation on __new__ and __init__ behaviour.
Regarding your other comment below my answer:
Defining __init__ in a class to set the initial state (attribute values) of the (already) created instance happens all the time. But __new__ has the different goal of customizing the object creation itself. The instance object does not exist yet when __new__ is run (it is a constructor function). __new__ is rarely needed in Python unless you need things like a singleton, say a class A that always returns the very same object instance (of A) when called with A(). Normal user-defined classes usually return a new object on instantiation. You can check this with the id() builtin function. Another use case is when you create your own version (by subclassing) of an immutable type. Because it's immutable, the value was already set and there is no way of changing it inside __init__ or later. Hence the need to act before that, adding code inside __new__. Using __new__ without returning an object of the same class type (this is the uncommon case) has the additional problem of not running __init__.
If you are just grouping lots of methods inside a class but there is still no state to store/manage in each instance (you notice this also by the absence of self use in the methods body), consider not using a class at all and organize these methods now turned into selfless functions in a module or package for import. Because it looks you are grouping just to organize related code.
If you stick to classes because there is state involved, consider breaking the class into smaller classes with no more than five to seven methods. Think also of giving them some more structure by grouping some of the small classes in various modules/submodules and using subclasses, because a long plain list of small classes (or functions, for that matter) can be mentally difficult to follow.
This has nothing to do with __new__ usage.
In summary, use the syntax of a call for a function call that returns a result (or None), or for an object instantiation by calling the class name. In the latter case the usual thing is to return an object of the intended type (the class called). Returning the result of a method instead usually means returning a different type, and that can look unexpected to the class user. There is a closely related use case in which some coders return self from their methods to allow for chained ("train-like") syntax:
my_font = SomeFont().italic().bold()
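A tiny illustration of that chained style (SomeFont is a made-up class here, just to show the return self pattern):

class SomeFont:
    def __init__(self):
        self.styles = set()

    def italic(self):
        self.styles.add('italic')
        return self            # returning self is what enables the chaining

    def bold(self):
        self.styles.add('bold')
        return self

my_font = SomeFont().italic().bold()
print(my_font.styles)          # {'italic', 'bold'}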
Finally if you don't like result = A().method(value), consider an alias:
func = A().method
...
result = func(value)
Note how you are left with no reference to the A() instance in your code.
If you need the reference, split the assignment further:
a = A()
func = a.method
...
result = func(value)
If the reference to A() is not needed then you probably don't need the instance either, and the class is just grouping the methods. You can just write
func = A.method
result = func(value)
where selfless methods should be decorated with @staticmethod because there is no instance involved. Note also how static methods could be turned into simple functions outside classes, as in the small sketch below.
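For instance (a sketch with made-up names), the two forms below behave the same; the plain function just drops the class wrapper:

class TextTools:
    @staticmethod
    def shout(text):
        # No instance state involved, so no self parameter.
        return text.upper() + '!'

def shout(text):
    # The same thing as a module-level function.
    return text.upper() + '!'

print(TextTools.shout('hello'))   # HELLO!
print(shout('hello'))             # HELLO!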
Edit:
I have set up an example similar to what you are trying to accomplish. It is also difficult to judge whether having methods inject results into the next method is the best choice for a multistep procedure. Because they share some state, they are coupled to each other and so can also inject errors into each other more easily. I assume you want to share some data between them that way (and that's why you are setting them up in a class):
So this an example class where a public method builds the result by calling a chain of internal methods. All methods depend on object state, self.offset in this case, despite getting an input value for calculations.
Because of this it makes sense that every method uses self to access the state. It also makes sense that you are able to instantiate different objects holding different configurations, so I see no use here for @staticmethod or @classmethod.
Initial instance configuration is done in __init__ as usual.
# file: multistepinc.py

class MultiStepInc:
    def __init__(self, offset):
        self.offset = offset

    def result(self, value):
        return self._step1(value)

    def _step1(self, x):
        x = self._step2(x)
        return self.offset + 1 + x

    def _step2(self, x):
        x = self._step3(x)
        return self.offset + 2 + x

    def _step3(self, x):
        return self.offset + 3 + x

def get_multi_step_inc(offset):
    return MultiStepInc(offset).result
--------
# file: multistepinc_example.py

from multistepinc import get_multi_step_inc

# Get the result method of a configured MultiStepInc instance
# with offset = 10.
# Much like an object factory, but you mentioned that you prefer
# to have the result method of the instance instead of the
# instance itself.
inc10 = get_multi_step_inc(10)

# invoke the inc10 method
result = inc10(1)
print(result)

# creating another instance with offset = 2
inc2 = get_multi_step_inc(2)
result = inc2(1)
print(result)

# if you need to manipulate the object instance
# you have to (on file top)
from multistepinc import MultiStepInc
# and then
inc_obj = MultiStepInc(5)
# ...
# ... do something with your obj, then
result = inc_obj.result(1)
print(result)
Outputs:
37
13
22

Changing global variable temporarily when calling object

I'm trying to understand how to change an object's attribute temporarily when it is called and have the original value persist when the object is not called.
Let me describe the problem with some code:
class DateCalc:
    DEFAULT = "1/1/2001"

    def __init__(self, day=DEFAULT):
        self.day = day

    def __call__(self, day=DEFAULT):
        self.day = day
        return self

    def getday(self):
        return self.day
When a user calls the object while passing another value,
e.g. 2/2/2002, self.day is set to 2/2/2002. However, I want to be able to revert self.day to the original value of 1/1/2001 after the call:
d_obj = DateCalc()
d_obj.getday() == "1/1/2001"
True
d_obj().getday() == "1/1/2001"
True
another_day_str = "2/2/2002"
d_obj(another_day_str).getday()
returns
"2/2/2002"
But when I run the following
d_obj.getday()
returns
"2/2/2002"
I was wondering what's the right way to revert the value, without needing to include code at every method call. Secondly, this should also be true when the object is called. For example:
d_obj().getday()
should return
"1/1/2001"
I thought a decorator on the call magic method would work here, but I'm not really sure where to start.
Any help would be much appreciated
Since you probably don't really want to modify the attributes of your object for a poorly defined interval, you need to return or otherwise create a different object.
The simplest case would be one in which you had two separate objects, and no __call__ method at all:
d1_obj = DateCalc()
d2_obj = DateCalc('2/2/2002')
print(d1_obj.getday()) # 1/1/2001
print(d2_obj.getday()) # 2/2/2002
If you know where you want to use d_obj vs d_obj() in the original case, you clearly know where to use d1_obj vs d2_obj in this version as well.
This may not be adequate for cases where DateCalc actually represents a very complex object that has many attributes that you do not want to change. In that case, you can have the __call__ method return a separate object that intelligently copies the portions of the original that you want.
For a simple case, this could be just
def __call__(self, day=DEFAULT):
    return type(self)(day)
If the object becomes complex enough, you will want to create a proxy. A proxy is an object that forwards most of the implementation details to another object. super() is an example of a proxy that has a very highly customized __getattribute__ implementation, among other things.
In your particular case, you have a couple of requirements:
The proxy must store all overriden attributes.
The proxy must get all non-overriden attributes from the original objects.
The proxy must pass itself as the self parameter to any (at least non-special) methods that are invoked.
You can get as complicated with this as you want (in which case look up how to properly implement proxy objects). Here is a fairly simple example:
# Assume that there are many fields like `day` that you want to modify
class DateCalc:
    DEFAULT = "1/1/2001"

    def __init__(self, day=DEFAULT):
        self.day = day

    def getday(self):
        return self.day

    def __call__(self, **kwargs):
        class Proxy:
            def __init__(self, original, **kwargs):
                self._self_ = original
                self.__dict__.update(kwargs)

            def __getattribute__(self, name):
                # Don't forward any overridden, dunder or quasi-private attributes
                if name.startswith('_') or name in self.__dict__:
                    return object.__getattribute__(self, name)
                # This part is simplified:
                # it does not take into account __slots__
                # or attributes shadowing methods
                t = type(self._self_)
                if name in t.__dict__:
                    try:
                        return t.__dict__[name].__get__(self, t)
                    except AttributeError:
                        pass
                return getattr(self._self_, name)

        return Proxy(self, **kwargs)
The proxy would work exactly as you would want: it forwards any values that you did not override in __call__ from the original object. The interesting thing is that it binds instance methods to the proxy object instead of the original, so that getday gets called with a self that has the overridden value in it:
d_obj = DateCalc()
print(type(d_obj)) # __main__.DateCalc
print(d_obj.getday()) # 1/1/2001
d2_obj = d_obj(day='2/2/2002')
print(type(d2_obj)) # __main__.DateCalc.__call__.<locals>.Proxy
print(d2_obj.getday()) # 2/2/2002
Keep in mind that the proxy object shown here has very limited functionality implemented, and will not work properly in many situations. That being said, it likely covers many of the use cases that you will have out of the box. A good example is if you chose to make day a property instead of having a getter (it is the more Pythonic approach):
class DateCalc:
    DEFAULT = "1/1/2001"

    def __init__(self, day=DEFAULT):
        self.__dict__['day'] = day

    @property
    def day(self):
        return self.__dict__['day']

    # __call__ same as above
...
d_obj = DateCalc()
print(d_obj(day='2/2/2002').day) # 2/2/2002
The catch here is that the proxy's version of day is just a regular writable attribute instead of a read-only property. If this is a problem for you, implementing __setattr__ appropriately on the proxy will be left as an exercise for the reader.
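If you do want to go down that road, here is one possible sketch (mine, not part of the answer above, and deliberately simplified: it forwards via __getattr__ instead of the full __getattribute__ proxy, so methods of the original are not rebound to the proxy):

class DateCalc:
    DEFAULT = "1/1/2001"

    def __init__(self, day=DEFAULT):
        self.__dict__['day'] = day

    @property
    def day(self):
        return self.__dict__['day']

class Proxy:
    def __init__(self, original, **overrides):
        object.__setattr__(self, '_self_', original)
        self.__dict__.update(overrides)

    def __getattr__(self, name):
        # Only called when normal lookup fails: fall back to the original.
        return getattr(object.__getattribute__(self, '_self_'), name)

    def __setattr__(self, name, value):
        # Refuse to rebind names that are read-only properties on the
        # wrapped object's class, mimicking the original's behaviour.
        descriptor = getattr(type(self._self_), name, None)
        if isinstance(descriptor, property) and descriptor.fset is None:
            raise AttributeError("can't set attribute %r" % name)
        object.__setattr__(self, name, value)

d_obj = DateCalc()
p = Proxy(d_obj, day='2/2/2002')
print(p.day)            # 2/2/2002
try:
    p.day = '3/3/2003'
except AttributeError as e:
    print(e)            # can't set attribute 'day'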
It seems that you want a behavior like a context manager: to modify an attribute for a limited time, use the updated attribute and then revert to the original. You can do this by having __call__ return a context manager, which you can then use in a with block like this:
d_obj = DateCalc()
print(d_obj.getday())      # 1/1/2001
with d_obj('2/2/2002'):
    print(d_obj.getday())  # 2/2/2002
print(d_obj.getday())      # 1/1/2001
There are a couple of ways of creating such a context manager. The simplest would be to use a nested method in __call__ and decorate it with contextlib.contextmanager:
from contextlib import contextmanager

...

def __call__(self, day=DEFAULT):
    @contextmanager
    def context():
        orig = self.day
        self.day = day
        yield
        self.day = orig
    return context()
You could also use a fully-fledged nested class for this, but I would not recommend it unless you have some really complex requirements. I am just providing it for completeness:
def __call__(self, day=DEFAULT):
    class Context:
        def __init__(self, inst, new):
            self.inst = inst
            self.old = inst.day
            self.new = new

        def __enter__(self):
            self.inst.day = self.new

        def __exit__(self, *args):
            self.inst.day = self.old

    return Context(self, day)
Also, you should consider making getday a property, especially if it is really read-only.
Another alternative would be to have your methods accept different values:
def getday(self, day=None):
    if day is None:
        day = self.day
    return day
This is actually a fairly common idiom.

Python heapify by some attribute, reheapify after attribute changes

I'm trying to use the heapq module in the Python 3.5 standard library to make a priority queue of objects of the same type. I'd like to be able to heapify based on an attribute of the objects, then change the value of some of those attributes, then re-heapify based on the new values. I'm wondering how I go about doing this.
import heapq

class multiNode:
    def __init__(self, keyValue):
        self.__key = keyValue

    def setKey(self, keyValue):
        self.__key = keyValue

    def getKey(self):
        return self.__key

queue = [multiNode(1), multiNode(2), multiNode(3)]
heapq.heapify(queue)  # want to heapify by whatever getKey returns for each node
queue[0].setKey(1000)
heapq.heapify(queue)  # re-heapify with those new values
There are a variety of ways of making your code work. For instance, you could make your items orderable by implementing some of the rich comparison operator methods (and perhaps use functools.total_ordering to implement the rest):
import functools

@functools.total_ordering
class multiNode:
    def __init__(self, keyValue):
        self.__key = keyValue

    def setKey(self, keyValue):
        self.__key = keyValue

    def getKey(self):
        return self.__key

    def __eq__(self, other):
        if not isinstance(other, multiNode):
            return NotImplemented
        return self.__key == other.__key

    def __lt__(self, other):
        if not isinstance(other, multiNode):
            return NotImplemented
        return self.__key < other.__key
This will make your code work, but it may not be very efficient to reheapify your queue every time you make a change to a node within it, especially if there are a lot of nodes in the queue. A better approach might be to write some extra logic around the queue so that you can invalidate a queue entry without removing it or violating the heap property. Then when you have an item you need to update, you just invalidate its old entry and add in a new one with the new priority.
Here's a quick and dirty implementation that uses a dictionary to map from a node instance to a [priority, node] list. If the node is getting its priority updated, the dictionary is checked and the node part of the list gets set to None. Invalidated entries are ignored when popping nodes off the front of the queue.
import heapq

queue = []
queue_register = {}

def add_to_queue(node):
    item = [node.getKey(), node]
    heapq.heappush(queue, item)
    queue_register[node] = item

def update_key_in_queue(node, new_key):
    queue_register[node][1] = None  # invalidate old item
    node.setKey(new_key)
    add_to_queue(node)

def pop_from_queue():
    node = None
    while node is None:
        _, node = heapq.heappop(queue)  # keep popping items until we find one that's valid
    del queue_register[node]  # clean up our bookkeeping record
    return node
You may want to test this against reheapifying to see which is faster for your program's actual usage of the queue.
A few final notes about your multiNode class (unrelated to what you were asking about in your question):
There are a number of things you're doing in the class that are not very Pythonic. To start with, the most common naming convention for Python uses CapitalizedNames for classes, and lower_case_names_with_underscores for almost everything else (variables of all kinds, functions, modules).
Another issue is using double leading underscores for __key. Double leading (and not trailing) underscores invoke Python's name mangling system. This may seem like it's intended as a way to make variables private, but it is not really. It's more intended to help prevent accidental name collisions, such as when you're setting an attribute in a proxy object (that otherwise mimics the attributes of some other object) or in a mixin class (which may be inherited by other types with unknown attributes). If code outside your class really wants to access the mangled attribute __key in your multiNode class, it can still do so by using _multiNode__key. To hint that something is intended to be a private attribute, you should just use a single underscore: _key.
And that brings me right to my final issue: key probably shouldn't be private at all. It is not very Pythonic to use getX and setX methods to modify a private instance variable. It's much more common to document that the attribute is part of the class's public API and let other code access it directly. If you later decide you need to do something fancy whenever the attribute is looked up or modified, you can use a property descriptor to automatically transform attribute access into calls to a getter and setter function. Other programming languages usually start with getters and setters rather than public attributes because they have no such way of changing the implementation of an attribute API later on. So anyway, I'd make your class's __init__ just set self.key = keyValue and get rid of setKey and getKey completely!
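For example, you can start with a plain public attribute and switch to a property later without changing any calling code (a sketch; the non-negative check is an invented stand-in for "something fancy"):

class MultiNode:
    def __init__(self, key):
        self.key = key              # goes through the property setter below

    @property
    def key(self):
        return self._key

    @key.setter
    def key(self, value):
        if value < 0:               # hypothetical validation added later
            raise ValueError("key must be non-negative")
        self._key = value

node = MultiNode(3)
node.key = 1000                     # same attribute syntax callers always used
print(node.key)                     # 1000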
A crude way of doing what you're looking for would be to use dicts and Python's built-in id() function. This would basically allow you to keep your heap as a heap of the ids of the objects that you create, and then update those objects by accessing them in the dict where their ids are the keys. I tried this on my local machine and it seems to do what you're looking for:
import heapq

class multiNode:
    def __init__(self, keyValue):
        self.__key = keyValue

    def setKey(self, keyValue):
        self.__key = keyValue

    def getKey(self):
        return self.__key

first_node = multiNode(1)
second_node = multiNode(2)
third_node = multiNode(3)
# add more nodes here

q = [id(first_node), id(second_node), id(third_node)]
multiNode_dict = {
    id(first_node): first_node,
    id(second_node): second_node,
    id(third_node): third_node
}

heapq.heapify(q)
multiNode_dict[q[0]].setKey(1000)
heapq.heapify(q)
heapify() won't really do too much here because the id of the object is going to be the same until it's deleted. It is more useful if you're adding new objects to the heap and taking objects out.

Attempting to replicate Python's string interning functionality for non-strings

For a self-project, I wanted to do something like:
class Species(object):  # immutable.
    def __init__(self, id):
        # ... (using id to obtain height and other data from file)
        pass

    def height(self):
        # ...
        pass

class Animal(object):  # mutable.
    def __init__(self, nickname, species_id):
        self.nickname = nickname
        self.species = Species(species_id)

    def height(self):
        return self.species.height()
As you can see, I don't really need more than one instance of Species(id) per id, but I'd be creating one every time I'm creating an Animal object with that id, and I'd probably need multiple calls of, say, Animal(somename, 3).
To solve that, what I'm trying to do is to make a class so that for 2 instances of it, let's say a and b, the following is always true:
(a == b) == (a is b)
This is something that Python does with string literals and is called interning. Example:
a = "hello"
b = "hello"
print(a is b)
That print will yield True (as long as the string is short enough, if we're using the Python shell directly).
I can only guess how CPython does this (it probably involves some C magic) so I'm doing my own version of it. So far I've got:
class MyClass(object):
    myHash = {}  # This replicates the intern pool.

    def __new__(cls, n):  # The default __new__ method returns a new instance
        if n in MyClass.myHash:
            return MyClass.myHash[n]
        self = super(MyClass, cls).__new__(cls)
        self.__init(n)
        MyClass.myHash[n] = self
        return self

    # as pointed out in an answer, it's better to avoid initializing the
    # instance with __init__, as that one is called even when returning an
    # old instance.
    def __init(self, n):
        self.n = n

a = MyClass(2)
b = MyClass(2)
print(a is b)  # <<< True
My questions are:
a) Is my problem even worth solving, given that my intended Species objects should be quite lightweight and the maximum number of times Animal will be called rather limited (imagine a Pokemon game: no more than 1000 instances, tops)?
b) If it is, is this a valid approach to solve my problem?
c) If it's not valid, could you please elaborate on a simpler / cleaner / more Pythonic way to solve this?
To make this as general as possible, I'm going to recommend a couple things. One, inherit from a namedtuple if you want "true" immutability (normally people are rather hands off about this, but when you're doing interning, breaking the immutable invariant can cause much bigger problems). Second, use locks to allow thread safe behavior.
Because this is rather complex, I'm going to provide a modified copy of Species code with comments explaining it:
import collections
import operator
import threading

# Inheriting from a namedtuple is a convenient way to get immutability
class Species(collections.namedtuple('SpeciesBase', 'species_id height ...')):
    # Prevent creation of arbitrary values on instances; true immutability of
    # declared values from namedtuple makes true immutable instances
    __slots__ = ()

    # Lock and cache, with underscore prefixes to indicate they're internal details
    _cache_lock = threading.Lock()
    _cache = {}

    def __new__(cls, species_id):  # Switching to canonical name cls for class type
        # Do quick fail fast check that ID is in fact an int/long
        # If it's int-like, this will force conversion to true int/long
        # and minimize risk of incompatible hash/equality checks in dict
        # lookup
        # I suspect that in CPython, this would actually remove the need
        # for the _cache_lock due to the GIL protecting you at the
        # critical stages (because no byte code is executing comparing
        # or hashing built-in int/long types), but the lock is a good idea
        # for correctness (avoiding reliance on implementation details)
        # and should cost little
        species_id = operator.index(species_id)

        # Lock when checking/mutating cache to make it thread safe
        try:
            with cls._cache_lock:
                return cls._cache[species_id]
        except KeyError:
            pass

        # Read in data here; not done under lock on assumption this might
        # be expensive and other Species (that already exist) might be
        # created/retrieved from cache during this time
        species_id = ...
        height = ...

        # Pass all the values read to the superclass (the namedtuple base)
        # constructor (which will set them and leave them immutable thereafter)
        self = super(Species, cls).__new__(cls, species_id, height, ...)

        with cls._cache_lock:
            # If someone tried to create the same species and raced
            # ahead of us, use their version, not ours, to ensure uniqueness
            # If no one raced us, this will put our new object in the cache
            self = cls._cache.setdefault(species_id, self)
        return self
If you want to do interning for general libraries (where users might be threaded, and you can't trust them not to break the immutability invariant), something like the above is a basic structure to work with. It's fast, minimizes the opportunity for stalls even if construction is heavyweight (in exchange for possibly reconstructing an object more than once and throwing away all but one copy if many threads try to construct it for the first time at once), etc.
Of course, if construction is cheap and instances are small, then just write a __eq__ (and possibly __hash__ if it's logically immutable) and be done with it.
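For that cheap-construction case, it could be as small as this sketch (using species_id as the value that defines equality):

class Species(object):
    def __init__(self, species_id):
        self.species_id = species_id
        # ... load height and other data from file ...

    def __eq__(self, other):
        if not isinstance(other, Species):
            return NotImplemented
        return self.species_id == other.species_id

    def __ne__(self, other):            # needed on Python 2; harmless on Python 3
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self.species_id)

print(Species(3) == Species(3))   # True, even though they are two distinct objects
print(Species(3) is Species(3))   # False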
Yes, implementing a __new__ method that returns a cached object is the appropriate way of creating a limited number of instances. If you don't expect to be creating a lot of instances, you could just implement __eq__ and compare by value rather than identity, but it doesn't hurt to do it this way instead.
Note that an immutable object should generally do all its initialization in __new__, rather than __init__, since the latter is called after the object has been created. Further, __init__ will be called on any instance of the class that is returned from __new__, so when you're caching, it will be called again each time a cached object is returned.
Also, the first argument to __new__ is the class object, not an instance, so you should probably name it cls rather than self (you can still use the name self for the instance you create later in the method if you want, though!).
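A small demonstration of that point (my own illustration, not from the question's code): with a caching __new__, __init__ runs again every time the cached instance is handed back.

class Cached(object):
    _cache = {}

    def __new__(cls, key):
        if key in cls._cache:
            return cls._cache[key]
        instance = super(Cached, cls).__new__(cls)
        cls._cache[key] = instance
        return instance

    def __init__(self, key):
        print("__init__ called for key %r" % key)
        self.key = key

a = Cached(1)   # __init__ called for key 1
b = Cached(1)   # __init__ called for key 1 -- again, on the same cached object
print(a is b)   # True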

Is it bad to store all instances of a class in a class field?

I was wondering if there is anything wrong (from a OOP point of view) in doing something like this:
class Foobar:
    foobars = {}

    def __init__(self, name, something):
        self.name = name
        self.something = something
        Foobar.foobars[name] = self

Foobar('first', 42)
Foobar('second', 77)

for name in Foobar.foobars:
    print(name, Foobar.foobars[name])
EDIT: this is the actual piece of code I'm using right now
from threading import Event

class Task:
    ADDED, WAITING_FOR_DEPS, READY, IN_EXECUTION, DONE = range(5)
    tasks = {}

    def __init__(self, name, dep_names, job, ins, outs, uptodate, where):
        self.name = name
        self.dep_names = [dep_names] if isinstance(dep_names, str) else dep_names
        self.job = job
        self.where = where
        self.done = Event()
        self.status = Task.ADDED
        self.jobs = []
        # other stuff...
        Task.tasks[name] = self

    def set_done(self):
        self.done.set()
        self.status = Task.DONE

    def wait_for_deps(self):
        self.status = Task.WAITING_FOR_DEPS
        for dep_name in self.dep_names:
            Task.tasks[dep_name].done.wait()
        self.status = Task.READY

    def add_jobs_to_queues(self):
        jobs = self.jobs
        # a lot of stuff I trimmed here
        for w in self.where:
            Queue.queues[w].put(jobs)
        self.status = Task.IN_EXECUTION

    def wait_for_jobs(self):
        for j in self.jobs:
            j.wait()

    # [...]
As you can see, I need to access the dictionary with all the instances in the wait_for_deps method. Would it make more sense to have a global variable instead of a class field? I could be using a wrong approach here; maybe that stuff shouldn't even be in a method, but it made sense to me (I'm new to OOP).
Yes. It's bad. It conflates the instance with the collection of instances.
Collections are one thing.
The instances which are collected are unrelated.
Also, class-level variables which get updated confuse some of us. Yes, we can eventually reason on what's going on, but the Standard Expectation™ is that state change applies to objects, not classes.
class Foobar_Collection( dict ):
    def __init__( self, *arg, **kw ):
        super( Foobar_Collection, self ).__init__( *arg, **kw )

    def foobar( self, *arg, **kw ):
        fb = Foobar( *arg, **kw )
        self[fb.name] = fb
        return fb

class Foobar( object ):
    def __init__( self, name, something ):
        self.name = name
        self.something = something

fc = Foobar_Collection()
fc.foobar( 'first', 42 )
fc.foobar( 'second', 77 )

for name in fc:
    print(name, fc[name])
That's more typical.
In your example, the wait_for_deps is simply a method of the task collection, not the individual task. You don't need globals.
You need to refactor.
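A minimal sketch of that refactoring (my own, with the Task class trimmed down to the attributes the dependency logic needs): the collection owns the name-to-task mapping and the dependency wait, so no class-level dict or global is required.

from threading import Event

class Task:
    ADDED, WAITING_FOR_DEPS, READY, IN_EXECUTION, DONE = range(5)

    def __init__(self, name, dep_names=()):
        self.name = name
        self.dep_names = [dep_names] if isinstance(dep_names, str) else list(dep_names)
        self.done = Event()
        self.status = Task.ADDED

class TaskCollection(dict):
    """Owns the name-to-Task mapping; the dependency logic lives here."""

    def add(self, task):
        self[task.name] = task
        return task

    def wait_for_deps(self, task):
        task.status = Task.WAITING_FOR_DEPS
        for dep_name in task.dep_names:
            self[dep_name].done.wait()
        task.status = Task.READY

tasks = TaskCollection()
a = tasks.add(Task('a'))
b = tasks.add(Task('b', dep_names='a'))
a.done.set()
tasks.wait_for_deps(b)   # returns immediately because 'a' is already done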
I don't suppose that there's anything wrong with this, but I don't really see how this would be sensible. Why would you need to keep a global variable (in the class, of all places) that holds references to all the instances? The client could just as easily implement this himself if he just kept a list of his instances. All in all, it seems a little hackish and unnecessary, so I'd recommend that you don't do it.
If you're more specific about what you're trying to do, perhaps we can find a better solution.
This is NOT cohesive, and not very functional either: you want to strive to get your objects as far from the "data-bucket" mindset as possible. The static object collection is not really going to gain you anything; you need to think about WHY you need all the objects in the collection, and consider creating a second class whose responsibility is to manage and be queried for all the Foobars in the system.
Why would you want to do this?
There are several problems with this code. The first is that you have to take care of deleting instances -- there will always be a reference to each Foobar instance left in Foobar.foobars, so the garbage collector will never garbage collect them. The second problem is that it won't work with copy and pickle.
But apart from the technical problems, it feels like a wrong design. The purpose of object instances is hiding state, and you make them see each other.
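If you do keep the registry on the class, one way around the first problem (a sketch of my own, not from this answer) is to make it a weak-value mapping, so the class attribute no longer keeps instances alive:

import weakref

class Foobar:
    foobars = weakref.WeakValueDictionary()

    def __init__(self, name, something):
        self.name = name
        self.something = something
        Foobar.foobars[name] = self

f = Foobar('first', 42)
print(list(Foobar.foobars))   # ['first']
del f                         # drop the last strong reference
print(list(Foobar.foobars))   # [] -- the entry vanished (immediately, on CPython)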
From an OOP point of view there's nothing wrong with it. A class is an instance of a metaclass, and any instance can hold any kind of data in it.
However, from an efficiency point of view, if you don't eventually clean up the foobars dict in a long-running Python program, you have a potential memory leak.
No one has mentioned the potential problem this might have if you later derive a subclass from Foobar: subclass instances will also be registered if the base class's __init__() is called from the derived class's __init__(). Specifically, consider whether you want all the subclass instances to be stored in the same place as those of the base class -- which of course depends on why you're doing this.
It's a solvable problem, but something to consider, and perhaps to code for, up front in the base class.
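One way to code for it up front (a sketch of my own): key the registry by the concrete class, so each subclass gets its own bucket.

class Foobar:
    _registries = {}

    def __init__(self, name, something):
        self.name = name
        self.something = something
        type(self)._registry()[name] = self

    @classmethod
    def _registry(cls):
        # One dict per concrete class, created on first use.
        return Foobar._registries.setdefault(cls, {})

class SpecialFoobar(Foobar):
    pass

Foobar('first', 42)
SpecialFoobar('second', 77)
print(list(Foobar._registry()))         # ['first']
print(list(SpecialFoobar._registry()))  # ['second']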
I needed multiple Jinja environments in an App Engine application:
class JinjaEnv(object):
    """ Jinja environment / loader instance per env_name """

    _env_lock = threading.Lock()
    with _env_lock:
        _jinja_envs = dict()  # instances of this class

    def __init__(self, env_name):
        self.jinja_loader = .....  # init jinja loader
        self.client_cache = memcache.Client()
        self.jinja_bcc = MemcachedBytecodeCache(self.client_cache,
                                                prefix='jinja2/bcc_%s/' % env_name,
                                                timeout=3600)
        self.jinja_env = self.jinja_loader(self.jinja_bcc, env_name)

    @classmethod
    def get_env(cls, env_name):
        with cls._env_lock:
            if env_name not in cls._jinja_envs:
                cls._jinja_envs[env_name] = JinjaEnv(env_name)  # new env
            return cls._jinja_envs[env_name].jinja_env

    @classmethod
    def flush_env(cls, env_name):
        with cls._env_lock:
            if env_name not in cls._jinja_envs:
                self = cls._jinja_envs[env_name] = JinjaEnv(env_name)  # new env
            else:
                self = cls._jinja_envs[env_name]
            self.client_cache.flush_all()
            self.jinja_env = self.jinja_loader(self.jinja_bcc, env_name)
            return self.jinja_env
Used like:
template = JinjaEnv.get_env('example_env').get_template('example_template')
