Use a Python dict to look up mutable objects

I have a bunch of File objects, and a bunch of Folder objects. Each folder has a list of files. Now, sometimes I'd like to lookup which folder a certain file is in. I don't want to traverse over all folders and files, so I create a lookup dict file -> folder.
folder = Folder()
myfile = File()
folder_lookup = {}
# This is pseudocode, I don't actually reach into the Folder
# object, but have an appropriate method
folder.files.append(myfile)
folder_lookup[myfile] = folder
Now, the problem is, the files are mutable objects. My application is built around this fact: I change properties on them, and the GUI is notified and updated accordingly. Of course you can't put mutable objects in dicts. So what I tried first is to generate a hash based on the current content, basically:
def __hash__(self):
    return hash((self.title, ...))
This didn't work, of course, because when the object's contents changed, its hash (and thus its identity) changed, and everything got messed up. What I need is an object that keeps its identity although its contents change. I tried various things, like making __hash__ return id(self), overriding __eq__, and so on, but never found a satisfying solution. One complication is that the whole construction should be picklable, which means I'd have to store the id at creation time, since it could change when pickling, I guess.
So I basically want to use the identity of an object (not its state) to quickly look up data related to the object. I've actually found a really nice pythonic workaround for my problem, which I might post shortly, but I'd like to see if someone else comes up with a solution.

I felt dirty writing this. Just put folder as an attribute on the file.
class dodgy(list):
    def __init__(self, title):
        super(dodgy, self).__init__()
        self.title = title
        # a throwaway class object; classes hash by identity, and that
        # identity never changes, so it makes a stable hash anchor
        self.store = type("store", (object,), {"blanket": self})

    def __hash__(self):
        return hash(self.store)

innocent_d = {}
dodge_1 = dodgy("dodge_1")
dodge_2 = dodgy("dodge_2")
innocent_d[dodge_1] = dodge_1.title
innocent_d[dodge_2] = dodge_2.title
print(innocent_d[dodge_1])
dodge_1.extend(range(5))
dodge_1.title = "oh no"
print(innocent_d[dodge_1])

OK, everybody noticed the extremely obvious workaround (that took me some days to come up with): just put an attribute on File that tells you which folder it is in. (Don't worry, that is also what I did.)
But, it turns out that I was working under wrong assumptions. You are not supposed to use mutable objects as keys, but that doesn't mean you can't (diabolical laughter)! The default implementation of __hash__ returns a value derived from the object's id() (in CPython, its memory address), which stays constant for the object's lifetime. And the default __eq__ follows the same notion of object identity.
So you can put mutable objects in a dict, and they work as expected (if you expect equality based on instance, not on value).
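For example, a quick demonstration of that default identity-based behaviour:

class File(object):
    pass

f = File()
d = {f: "some folder"}
f.title = "mutate all you like"   # the default hash/eq ignore the contents
print(d[f])                       # 'some folder' -- the lookup still works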
See also: I'm able to use a mutable object as a dictionary key in python. Is this not disallowed?
I was having problems because I was pickling/unpickling the objects, which of course changed the hashes. One could overcome this by generating a unique ID in the constructor and using it for equality and for deriving the hash.
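A minimal sketch of that idea, assuming a simple File class (the _uid field is my illustration, not the original code):

import uuid

class File(object):
    def __init__(self, title):
        self.title = title
        # minted once at construction and pickled along with the rest of
        # the state, so hash and equality survive a pickle round-trip
        self._uid = uuid.uuid4()

    def __hash__(self):
        return hash(self._uid)

    def __eq__(self, other):
        return isinstance(other, File) and self._uid == other._uid

    def __ne__(self, other):
        return not self.__eq__(other)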
(For the curious, as to why such a "lookup based on instance identity" dict might be necessary: I've been experimenting with a kind of "object database". You have pure Python objects, put them in lists/containers, and can define indexes on attributes for faster lookup, complex queries and so on. For foreign keys (1:n relationships) I can just use containers, but for the backlink I have to come up with something clever if I don't want to modify the objects on the n side.)

Related

Python pseudo-immutable object field

I currently need to partially create a Python object and be able to update it for some time. However, I must no longer be able to update it once I have used the object as a dictionary key.
Of course there is the solution of marking the fields as private, which is mostly a warning for the programmer, and I will actually go for that solution.
But I stumbled on another solution and I want to know if this could be a good idea, or if it could simply go horribly wrong. Here it is:
class FooIsNowImmutable(Exception):
    pass

class Foo(object):
    def __init__(self, bar):
        self._bar = bar
        self._has_been_hashed = False

    def __hash__(self):
        self._has_been_hashed = True
        return hash(self._bar)

    def __eq__(self, other):
        return self._bar == other._bar

    def __copy__(self):
        return Foo(self._bar)

    def set_bar(self, bar):
        if self._has_been_hashed:
            raise FooIsNowImmutable()
        self._bar = bar
Some testing proved it to work as desired: I can no longer use set_bar once I have, say, used my object as a dictionary key.
What do you think? Is it a good idea? Will it turn against me? Is there an easier way? And is it somehow a bad practice?
Doing it that way is a bit fragile, since you never know when something might be used as a dictionary key, or when its hash might be called for some other reason. An object isn't supposed to "know" whether it's being used as a dictionary key. It will be confusing to have code that may raise an exception just because some other code somewhere else put the object in a dictionary.
Following the Python philosophy of "explicit is better than implicit", it would be safer to just give your object a method called .finalize() or .lock() or something, which would set a flag indicating the object is immutable. You could also reverse the exception-raising logic, so that __hash__ raises an exception if the object is not yet locked (rather than mutation raising an exception if the object has been hashed).
You would then call .lock() when you're ready to make the object immutable. It makes more sense to explicitly set it immutable when you're done with whatever mutating you need to do, rather than implicitly assuming that as soon as you use it in a dictionary, you're done mutating it.
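A minimal sketch of that locking approach (the lock() name and the use of TypeError are placeholders, not an established API):

class Foo(object):
    def __init__(self, bar):
        self._bar = bar
        self._locked = False

    def lock(self):
        # explicitly freeze the object; mutation is refused from here on
        self._locked = True

    def __hash__(self):
        if not self._locked:
            raise TypeError("lock() this Foo before using it as a dict key")
        return hash(self._bar)

    def __eq__(self, other):
        return isinstance(other, Foo) and self._bar == other._bar

    def set_bar(self, bar):
        if self._locked:
            raise TypeError("Foo is locked and can no longer be mutated")
        self._bar = bar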
You can do that, but I'm not sure I'd recommend it. Why do you need it in a dictionary?
It requires a lot more awareness of the state of the object... think of a file object. Would you put one in a dictionary? It has to be opened for a lot of the functions to work, and once it's closed, you can't use them anymore. The user has to be aware, in the surrounding code, of which state the object is in.
For files, that makes sense - after all, you don't normally hold files open across large parts of your program, or if you do, they have well-defined init and close code paths; something similar has to make sense for your object. Especially if you have some APIs that take the object but expect an immutable version, and others that take the same object but expect to change it...
I have used the lock method before, and it works well for complex, read-only objects that you want to initialize once and then make sure no one is messing with. E.g., you load a copy of a (say, English) dictionary from disk... it has to be mutable while you are populating it, but you don't want anyone to accidentally modify it, so locking it is a great idea. I would only use it if it was a one-time lock, though - something you are locking and unlocking seems like a recipe for disaster.
There are two solutions IMHO if you just want to create a version you can use in hashable places. First is to explicitly create an immutable copy when you put it in a dictionary - tuple and frozenset are examples of this sort of behaviour... if you want to put a list in a dict, you can't, but you can create a tuple from it first, and that can be hashed. Create a frozen version of your object, then it's very clear by looking at the object type whether it's expected to be mutable or immutable, and so cases where it was used incorrectly are easily seen.
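For instance, a sketch of that first option, assuming Foo's state (_bar) is itself hashable; FrozenFoo is an illustrative name:

import copy

class FrozenFoo(object):
    # the immutable counterpart of Foo, analogous to list -> tuple:
    # snapshot the state once, then hash and compare by value
    def __init__(self, foo):
        self._bar = copy.deepcopy(foo._bar)

    def __hash__(self):
        return hash(self._bar)

    def __eq__(self, other):
        return isinstance(other, FrozenFoo) and self._bar == other._bar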
Second, if you really want it to be hashable but need it to be mutable... that's actually legal, but implemented a little differently. It goes back to the idea of hashing... hashing is used both for optimized lookups and for equality.
The first is to ensure you can get objects back... you put something in a dictionary, and it hashes to a value of 4 - goes in slot 4. Then you modify it. Then you go to look it up again, and now it hashes to 9 - there's nothing in slot 9, or worse, a different object, and you're broken.
Second is equality - for things like sets, I need to know if my object is already in there. I can hash, but if you know anything about hashing, you still need to test equality to handle hash collisions.
That doesn't preclude supporting __hash__ and being mutable, but it's unusual. You need to decide, for your item, what makes it the same even though it's mutable. What you need to do then is give each object a unique id. Technically, you may be able to get away with id(self), but something like the uuid module is probably a better possibility. The UUID4 (or technically, the hash of the UUID4) is what determines both the hash and equality; two objects that contain the same UUID4 should be the exact same object; two objects that have the exact same data but different UUID4s would be different objects.
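A sketch of that second option; the Record class and its fields are my illustration:

import uuid

class Record(object):
    def __init__(self, data):
        self.data = data
        self.uid = uuid.uuid4()   # identity is fixed at construction

    def __hash__(self):
        return hash(self.uid)

    def __eq__(self, other):
        return isinstance(other, Record) and self.uid == other.uid

a = Record([1, 2, 3])
b = Record([1, 2, 3])
print(a == b)          # False: same data, different UUIDs, different objects
d = {a: "first"}
a.data.append(4)       # mutating a is fine; its hash never changes
print(d[a])            # 'first'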

Get dictionary in python to store primitive type objects (memory location) instead of the value of those objects?

I want to store attributes of an object in a dictionary so I can change those attributes later. The problem is that these attributes are booleans, so when I put them in a dictionary it stores the boolean value, not the attribute variable.
I.e.
location_dict = {
    'dorm_common': newnom.dorm_common,
    'cafeteria': newnom.cafeteria,
    'comp_room': newnom.comp_room,
}
I want to iterate through the dict and set newnom.comp_room to True. The problem is that the dictionary is storing the default value of newnom.comp_room, which is None, so I can't set newnom.comp_room through it.
I can think of plenty of ways to get around this, such as storing the attribute name string "newnom.comp_room" and using exec(v + ' = True'), or reaching into the instance dictionary with
newnom.__dict__[k] = True
but I was wondering if there was another way to override this default behavior in a Python dictionary.
This isn't something special about dictionaries; it's intrinsic to the Python programming language as a whole. You want either pass-by-reference or mutable booleans (and in any case you should review your understanding of Python's semantics and its implementation; you sound like you may be confusing some things), but Python doesn't support that. You can't change that, and if you could, you'd open several cans' worth of worms.
Change your design such that (in increasing order of obscurity)...
you only store the object, and access/change the various attributes through other means (e.g. with getattr/setattr, or a dedicated method with validation to limit dynamicness creep and accidents). I really recommend this. It's by far the simplest and least surprising approach. See the sketch after this list.
you can go from the dict value to the object and change the attribute directly (e.g. store a (object, attribute name) tuple: (newnom, 'dorm_common')).
location_dict is not really a dict but an object supporting some dictionary operations, routing them to manipulation of newnom. (I would only recommend such hackery if you have a ton of legacy code that requires an external dictionary.)
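For illustration, a sketch of the first two options combined, with Nom standing in for whatever class newnom actually is:

class Nom(object):
    def __init__(self):
        self.dorm_common = None
        self.cafeteria = None
        self.comp_room = None

newnom = Nom()

# each value is an (object, attribute name) pair, so the attribute can be
# changed on the object itself rather than on a copied value
location_dict = {
    'dorm_common': (newnom, 'dorm_common'),
    'cafeteria': (newnom, 'cafeteria'),
    'comp_room': (newnom, 'comp_room'),
}

obj, attr = location_dict['comp_room']
setattr(obj, attr, True)
print(newnom.comp_room)   # True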
What you could do is try using a proxy object that contains the value that you want to modify:
class MutableProxy(object):
    __slots__ = ("value",)   # keep the per-instance overhead small

    def __init__(self, value):
        self.value = value

    def get(self):
        return self.value

    def set(self, value):
        self.value = value
You can then use the get() and set() methods to read and write the actual value, something like:
>>> var_1 = MutableProxy(100)
>>> var_1.get()
100
>>> d = { "var_1" : var_1 }
>>> d["var_1"].set(600)
>>> var_1.get()
600
The major disadvantage of this is that the proxy object will take noticeably more memory than the raw object, but that might be an appropriate trade-off.

The use of id() in Python

Of what use is id() in real-world programming? I have always thought this function is there just for academic purposes. Where would I actually use it in programming?
I have been programming applications in Python for some time now, but I have never encountered any "need" for using id(). Could someone throw some light on its real world usage?
It can be used for creating a dictionary of metadata about objects:
For example:
someobj = int(1)
somemetadata = "The type is an int"
data = {id(someobj): somemetadata}
Now if I encounter this object somewhere else, I can find out in O(1) time whether metadata about it exists (instead of looping over all objects and comparing with is).
I use id() frequently when writing temporary files to disk. It's a very lightweight way of getting a pseudo-random number.
Let's say that during data processing I come up with some intermediate results that I want to save off for later use. I simply create a file name using the pertinent object's id.
fileName = "temp_results_" + str(id(self))
Although there are many other ways of creating unique file names, this is my favorite. In CPython, the id is the memory address of the object, so as long as the objects are alive at the same time, I'm guaranteed never to have a naming collision (an address can be reused once an object has been garbage-collected). That's all for the cost of one address lookup. The other methods that I'm aware of for getting a unique string are much more expensive.
A concrete example would be a word-processing application where each open document is an object. I could periodically save progress to disk with multiple files open using this naming convention.
Anywhere where one might conceivably need id() one can use either is or a weakref instead. So, no need for it in real-world code.
The only time I've found id() useful outside of debugging or answering questions on comp.lang.python is with a WeakValueDictionary, that is a dictionary which holds a weak reference to the values and drops any key when the last reference to that value disappears.
Sometimes you want to be able to access a group (or all) of the live instances of a class without extending the lifetime of those instances and in that case a weak mapping with id(instance) as key and instance as value can be useful.
However, I don't think I've had to do this very often, and if I had to do it again today then I'd probably just use a WeakSet (but I'm pretty sure that didn't exist last time I wanted this).
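A sketch of that pattern; the Widget class is my illustration, and the prompt removal of dead entries relies on CPython's reference counting:

import weakref

class Widget(object):
    _live = weakref.WeakValueDictionary()   # id -> instance, weakly held

    def __init__(self, name):
        self.name = name
        # register without extending the instance's lifetime
        Widget._live[id(self)] = self

a = Widget("a")
b = Widget("b")
print(sorted(w.name for w in Widget._live.values()))   # ['a', 'b']
del a   # the 'a' entry disappears once the instance is collected
print(sorted(w.name for w in Widget._live.values()))   # ['b']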
In one program I used it to compute the intersection of lists of non-hashables, like:
def intersection(*lists):
    from functools import reduce
    from operator import and_

    id_row_row = {}   # id(row) -> row
    key_id_row = {}   # list index -> set of id(row)
    for key, rows in enumerate(lists):
        key_id_row[key] = set()
        for row in rows:
            id_row_row[id(row)] = row
            key_id_row[key].add(id(row))

    def intersect(sets):
        if len(sets) > 0:
            return reduce(and_, sets)
        return set()

    seq = [id_row_row[id_row] for id_row in intersect(key_id_row.values())]
    return seq

Can't get Beaker cache working

I'm trying to use Beaker's caching library but I can't get it working.
Here's my test code.
class IndexHandler():
    @cache.cache('search_func', expire=300)
    def get_results(self, query):
        results = get_results(query)   # presumably a module-level search function
        return results

    def get(self, query):
        results = self.get_results(query)
        return render_index(results=results)
I've tried the examples in Beaker's documentation but all I see is
<type 'exceptions.TypeError'> at /
can't pickle generator objects
Clearly I'm missing something but I couldn't find a solution.
By the way, this problem occurs if the cache type is set to "file".
If you configure beaker to save to the filesystem, you can easily see that each argument is being pickled as well. Example:
tp3
sS'tags <myapp.controllers.tags.TagsController object at 0x103363c10> <MySQLdb.cursors.Cursor object at 0x103363dd0> apple'
p4
Notice that the cache "key" contains more than just my keyword "apple"; it also contains instance-specific information. This is pretty bad, because 'self' in particular won't be the same across invocations. The cache will miss every single time (and will get filled up with useless keys).
The method with the cache decorator should only take the arguments that correspond to whatever "key" you have in mind. To paraphrase: let's say you want to store the fact that "John" corresponds to the value 555-1212 and you want to cache this. Your function should not take anything except a string as an argument, and any arguments you pass in should stay constant from invocation to invocation, so something like "self" would be bad.
One easy way to make this work is to inline the function so that you don't need to pass anything else beyond the key. For example:
def index(self):
    # some code here

    # suppose 'place' is a string that you're using as a key; maybe
    # you're caching a description for cities and 'place' would be
    # "New York" in one instance
    @cache_region('long_term', 'place_desc')
    def getDescriptionForPlace(place):
        # perform the expensive operation here
        description = ...
        return description

    # this will either fetch the data or just load it from the cache
    description = getDescriptionForPlace(place)
Your cache file should resemble the following. Notice that only 'place_desc' and 'John' were saved as a key.
tp3
sS'place_desc John'
p4
I see that the beaker docs do not mention this explicitly but, clearly, the decorating function must pickle the arguments it's called with (to use as part of the key into the cache, to check if the entry is present, and to add it later otherwise) -- and generator objects are not pickleable, as the error message is telling you. This implies that query is a generator object, of course.
What you should be doing in order to use beaker or any other kind of cache is to pass around, instead of a query generator object, the (pickleable) parameters from which that query can be built -- strings, numbers, dicts, lists, tuples, etc, composed in any way that is easy for you to arrange and easy to build the query from "just in time", only within the function body of get_results. This way, the arguments will be pickleable and caching will work.
If convenient, you could build a simple pickleable class whose instances "stand for" queries, emulating whatever initialization and parameter-setting you require, and performing the just-in-time instantiation only when some method requiring an actual query object is called. But that's just a "convenience" idea, and does not alter the underlying concept as explained in the previous paragraph.
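A minimal sketch of such a stand-in class, with hypothetical names and a generic DB-API connection; the point is that the instance holds only plain picklable state, and the real query work happens just in time:

class QuerySpec(object):
    def __init__(self, text, params):
        self.text = text        # e.g. a SQL string
        self.params = params    # e.g. a dict of bind parameters

    def run(self, connection):
        # build and execute the real (unpicklable) query only here
        cursor = connection.cursor()
        cursor.execute(self.text, self.params)
        return cursor.fetchall()   # a plain list, safe to cache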
Try return list(results) instead of return results and see if it helps.
The beaker file cache needs to be able to pickle both cache keys and values; most iterators and generators are unpickleable.

Best way to store and use a large text-file in python

I'm creating a networked server for a boggle-clone I wrote in python, which accepts users, solves the boards, and scores the player input. The dictionary file I'm using is 1.8MB (the ENABLE2K dictionary), and I need it to be available to several game solver classes. Right now, I have it so that each class iterates through the file line-by-line and generates a hash table (associative array), but the more solver classes I instantiate, the more memory it takes up.
What I would like to do is import the dictionary file once and pass it to each solver instance as they need it. But what is the best way to do this? Should I import the dictionary in the global space, then access it in the solver class as globals()['dictionary']? Or should I import the dictionary then pass it as an argument to the class constructor? Is one of these better than the other? Is there a third option?
If you create a dictionary.py module, containing code which reads the file and builds a dictionary, this code will only be executed the first time it is imported. Further imports will return a reference to the existing module instance. As such, your classes can:
import dictionary
dictionary.words[whatever]
where dictionary.py has:
words = {}
# read file and add to 'words'
Even though it is essentially a singleton at this point, the usual arguments against globals apply. For a pythonic singleton-substitute, look up the "borg" object.
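A sketch of what dictionary.py could look like, assuming the ENABLE2K list lives in a plain text file with one word per line (the path is hypothetical):

# dictionary.py -- module-level code runs only on the first import;
# every later 'import dictionary' gets the same module object back
words = {}

with open('enable2k.txt') as f:   # hypothetical path to the word list
    for line in f:
        words[line.strip().lower()] = True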
That's really the only difference: once the dictionary object is created, you are only binding new references as you pass it along, unless you explicitly perform a deep copy. It makes sense that it is centrally constructed once and only once, so long as each solver instance does not require a private copy for modification.
Adam, remember that in Python when you say:
a = read_dict_from_file()
b = a
... you are not actually copying a (and thus using more memory); you are merely making b another reference to the same object.
So basically any of the solutions you propose will be far better in terms of memory usage. Basically, read in the dictionary once and then hang on to a reference to that. Whether you do it with a global variable, or pass it to each instance, or something else, you'll be referencing the same object and not duplicating it.
Which one is most Pythonic? That's a whole 'nother can of worms, but here's what I would do personally:
def main(args):
    run_initialization_stuff()
    dictionary = read_dictionary_from_file()
    # note: 'class' is a reserved word, so the keyword argument is renamed
    solvers = [Solver(solver_class=x, dictionary=dictionary)
               for x in range(number_of_solvers)]
HTH.
Depending on what your dict contains, you may be interested in the 'shelve' or 'anydbm' modules. They give you dict-like interfaces (just strings as keys and items for 'anydbm', and strings as keys and any python object as item for 'shelve') but the data is actually in a DBM file (gdbm, ndbm, dbhash, bsddb, depending on what's available on the platform.) You probably still want to share the actual database between classes as you are asking for, but it would avoid the parsing-the-textfile step as well as the keeping-it-all-in-memory bit.
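For example, a sketch of the shelve variant, with hypothetical file names; the shelf behaves like a dict but keeps its data on disk:

import shelve

# build the shelf once, e.g. in a separate preparation step
with shelve.open('words_db') as db:       # hypothetical shelf name
    with open('enable2k.txt') as f:       # hypothetical word list path
        for line in f:
            db[line.strip().lower()] = True

# solver instances can then share one open shelf and use it like a dict
db = shelve.open('words_db')
print('python' in db)
db.close()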
