The use of id() in Python

The use of id() in Python - python

Of what use is id() in real-world programming? I have always thought this function is there just for academic purposes. Where would I actually use it in programming?
I have been programming applications in Python for some time now, but I have never encountered any "need" for using id(). Could someone throw some light on its real world usage?

It can be used for creating a dictionary of metadata about objects:
For example:
someobj = int(1)
somemetadata = "The type is an int"
data = {id(someobj):somemetadata}
Now if I occur this object somewhere else I can find if metadata about this object exists, in O(1) time (instead of looping with is).

I use id() frequently when writing temporary files to disk. It's a very lightweight way of getting a pseudo-random number.
Let's say that during data processing I come up with some intermediate results that I want to save off for later use. I simply create a file name using the pertinent object's id.
fileName = "temp_results_" + str(id(self)).
Although there are many other ways of creating unique file names, this is my favorite. In CPython, the id is the memory address of the object. Thus, if multiple objects are instantiated, I'm guaranteed to never have a naming collision. That's all for the cost of 1 address lookup. The other methods that I'm aware of for getting a unique string are much more intense.
A concrete example would be a word-processing application where each open document is an object. I could periodically save progress to disk with multiple files open using this naming convention.

Anywhere where one might conceivably need id() one can use either is or a weakref instead. So, no need for it in real-world code.

The only time I've found id() useful outside of debugging or answering questions on comp.lang.python is with a WeakValueDictionary, that is a dictionary which holds a weak reference to the values and drops any key when the last reference to that value disappears.
Sometimes you want to be able to access a group (or all) of the live instances of a class without extending the lifetime of those instances and in that case a weak mapping with id(instance) as key and instance as value can be useful.
However, I don't think I've had to do this very often, and if I had to do it again today then I'd probably just use a WeakSet (but I'm pretty sure that didn't exist last time I wanted this).

in one program i used it to compute the intersection of lists of non-hashables, like:
def intersection(*lists):
id_row_row = {} # id(row):row
key_id_row = {} # key:set(id(row))
for key, rows in enumerate(lists):
key_id_row[key] = set()
for row in rows:
id_row_row[id(row)] = row
key_id_row[key].add(id(row))
from operator import and_
def intersect(sets):
if len(sets) > 0:
return reduce(and_, sets)
else:
return set()
seq = [ id_row_row[id_row] for id_row in intersect( key_id_row.values() ) ]
return seq

Related

list() constructor with list?

I'm new in Python and I don't understand the purpose of list() function in this piece of code:
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
The method words() is already returning a list of tokenized words from a string, and I don't see any difference between that and
documents = [(movie_reviews.words(fileid), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]

There are three possibilities:
It is a mistake, no call to list() is required.
The interface only guarantees that the method returns an Iterable type, which may be any one of: list, set, iterator, generator, etc. The specific movie_reviews.words() may return a list today, but either that may change in future versions, or in other classes with similar interfaces (child/parent/or simply similar interface).
Whether this is the case, should be stated explicitly in the documentation, or could be gleamed out of the inheritance hierarchy.
The method performs some sort of memoization, while keeping a copy of the returned list. A good practice would be to copy the cached-list inside the method, but maybe they returned a shared list-object.
If the method returns a reference to a shared list object, then it is a good idea to call list(), in order to create a new list object. Without the copy operation, any change to the list by one side (inside the method vs. through documents variable) will confuse the other side. If you change the list through documents varaible, then calling movie_reviews.words(fileid) with the same fileid may return the wrong value.
In general, although this is bad design, this happens in real code. I once had to debug such an issue in live code. Usually, in case of memoization, it is better to return an immutable type such as a tuple, instead of a list, which will guarantee both speed and safety.

Getting a hash of a function that is stable across runs

IIUC python hash of functions (e.g. for use as keys in dict) is not stable across runs.
Can something like dill or other libraries be used to get a hash of a function which is stable across runs and different computers? (id is of course not stable).

I'm the dill author. I've written a package called klepto which is a hierarchical caching/database abstraction useful for local memory hashing and object sharing across parallel/distributed resources. It includes several options for building ids of functions.
See klepto.keymaps and klepto.crypto for hashing choices -- some work across parallel/distributed resources, some don't. One of the choices is serialization with dill or otherwise.
klepto is similar to joblib, but designed specifically to have object permanence and sharing beyond a single python session. There may be something similar to klepto in dask.

As you mentioned, id will almost never be the same across different processes and though surely across different machines. As per docs:
id(object):
Return the “identity” of an object. This is an integer which is
guaranteed to be unique and constant for this object during its
lifetime. Two objects with non-overlapping lifetimes may have the same
id() value.
This means that id should be different because the objects created by every instance of your script reside in different places in the memory and are not the same object. id defines the identity, it's not a checksum of a block of code.
The only thing that will be consistent over different instances of your script being executed is the name of the function.
One other approach that you could use to have a deterministic way to identify a block of code inside your script would be to calculate a checksum of the actual text. But controlling the contents of your methods should rather be handled by a versioning system like git. It is likely that if you need to calculate a hash sum of your code or a piece of it, you are doing something suboptimally.

I stubled about "hash() is not stable across runs" today. I am now using
def stable_hash(a_string):
sha256 = hashlib.sha256()
sha256.update(bytes(a_string, "UTF-8"))
digest = sha256.digest()
h = 0
#
for index in range(0, len(digest) >> 3):
index8 = index << 3
bytes8 = digest[index8 : index8 + 8]
i = unpack('q', bytes8)[0]
h = xor(h, i)
#
return h
It's for string arguments. To use it e.g. for a dict you would pass str(tuple(sorted(a_dict.items()))) or something like that as argument. The "sorted" is important in this case to get a "canonical" representation.

Python pseudo-immutable object field

I currently need to partially create a Python object and be able to update it for some time. Although, I must not be able to update it once I used the object as a dictionary key.
Of course there is the solution of marking the fields as private, which is mostly a warning for the programmer, and I will actually go for that solution.
But I stumbled on another solution and I want to know if this could be a good idea, or if it could simply go horribly wrong. Here it is:
class Foo():
def __init__(self, bar):
self._bar = bar
self._has_been_hashed = False
def __hash__(self):
self._has_been_hashed = True
return self._bar.__hash__()
def __eq__(self, other):
return self._bar == other._bar
def __copy__(self):
return Foo(self._bar)
def set_bar(self, bar):
if self.has_been_hashed:
raise FooIsNowImmutable
else:
self._bar = bar
Some testing proved it to work as desired, I can no longer use set_bar once I, say, used my object as a dictionary key.
What do you think? Is it a good idea? Will it turn against me? Is there an easier way? And is it somehow a bad practice?

Doing it that way is a bit fragile, since you never know when something might be used as a dictionary key, or when its hash might be called for some other reason. An object isn't supposed to "know" whether it's being used as a dictionary key. It will be confusing to have code that may raise an exception just because some other code somewhere else put the object in a dictionary.
Following the Python philosophy of "explicit is better than implicit", it would be safer to just give your object a method called .finalize() or .lock() or something, which would set a flag indicating the object is immutable. You could also reverse the exception-raising logic, so that __hash__ raises an exception if the object is not yet locked (rather than mutation raising an exception if the object has been hashed).
You would then call .lock() when you're ready to make the object immutable. It makes more sense to explicitly set it immutable when you're done with whatever mutating you need to do, rather than implicitly assuming that as soon as you use it in a dictionary, you're done mutating it.

You can do that, but I'm not sure I'd recommend it. Why do you need it in a dictionary?
It requires a lot more awareness of the state of the object... think a file object. Would you put one in a dictionary? It has to be opened for a lot of the functions to work, and once it's closed, you can't do them anymore. The user has to be aware in the surrounding code which state the object is in.
For files, that makes sense - after all, you don't normally hold files open across large parts of your program, or if you do, they have very defined init and close codes; something similar has to make sense for your object. Especially if you have some APIs that take the object, but expect an immutable version, and others that take the same object, but expect to change it...
I have used the lock method before, and it works well for complex, read-only objects that you want to initialize once and then make sure no one is messing with. E.G. you load a copy of a (say, English) dictionary from disk... it has to be mutable while you are populating it, but you don't want anyone to accidentally modify it, so locking it is a great idea. I would only use it if it was a one-time lock though - something you are locking and unlocking seems like a recipe for disaster.
There are two solutions IMHO if you just want to create a version you can use in hashable places. First is to explicitly create an immutable copy when you put it in a dictionary - tuple and frozenset are examples of this sort of behaviour... if you want to put a list in a dict, you can't, but you can create a tuple from it first, and that can be hashed. Create a frozen version of your object, then it's very clear by looking at the object type whether it's expected to be mutable or immutable, and so cases where it was used incorrectly are easily seen.
Second, if you really want it to be hashable, but need it to be mutable... that's actually legal, but implemented a little different. It goes back to the idea of hashing... hashing is used both for optimized lookups, and equality.
The first is to ensure you can get objects back... you put something in a dictionary, and it hashes to a value of 4 - goes in slot 4. Then you modify it. Then you go to look it up again, and now it hashes to 9 - there's nothing in slot 9, or worse, a different object, and you're broken.
Second is equality - for things like sets, I need to know if my object is already in there. I can hash, but if you know anything about hashing, you still need to check equality to check for hash collisions.
That doesn't preclude supporting __hash__ and being mutable, but it's unusual. You need to decide for your item what makes it the same, even though it's mutable. What you need to do then is give each object a unique id. Technically, you may be able to get away with id(self), but something like the uuid module is probably a better possibility. The UUID4 (or technically, the hash of the UUID4) is what determines both the hash and equality; two objects that contain the same UUID4 should be the exact same object; two objects that have the exact same data but a different UUID4 would be different object.

Use python dict to lookup mutable objects

I have a bunch of File objects, and a bunch of Folder objects. Each folder has a list of files. Now, sometimes I'd like to lookup which folder a certain file is in. I don't want to traverse over all folders and files, so I create a lookup dict file -> folder.
folder = Folder()
myfile = File()
folder_lookup = {}
# This is pseudocode, I don't actually reach into the Folder
# object, but have an appropriate method
folder.files.append(myfile)
folder_lookup[myfile] = folder
Now, the problem is, the files are mutable objects. My application is built around the fact. I change properites on them, and the GUI is notified and updated accordingly. Of course you can't put mutable objects in dicts. So what I tried first is to generate a hash based on the current content, basically:
def __hash__(self):
return hash((self.title, ...))
This didn't work of course, because when the object's contents changed its hash (and thus its identity) changed, and everything got messed up. What I need is an object that keeps its identity, although its contents change. I tried various things, like making __hash__ return id(self), overriding __eq__, and so on, but never found a satisfying solution. One complication is that the whole construction should be pickelable, so that means I'd have to store id on creation, since it could change when pickling, I guess.
So I basically want to use the identity of an object (not its state) to quickly look up data related to the object. I've actually found a really nice pythonic workaround for my problem, which I might post shortly, but I'd like to see if someone else comes up with a solution.

I felt dirty writing this. Just put folder as an attribute on the file.
class dodgy(list):
def __init__(self, title):
self.title = title
super(list, self).__init__()
self.store = type("store", (object,), {"blanket" : self})
def __hash__(self):
return hash(self.store)
innocent_d = {}
dodge_1 = dodgy("dodge_1")
dodge_2 = dodgy("dodge_2")
innocent_d[dodge_1] = dodge_1.title
innocent_d[dodge_2] = dodge_2.title
print innocent_d[dodge_1]
dodge_1.extend(range(5))
dodge_1.title = "oh no"
print innocent_d[dodge_1]

OK, everybody noticed the extremely obvious workaround (that took my some days to come up with), just put an attribute on File that tells you which folder it is in. (Don't worry, that is also what I did.)
But, it turns out that I was working under wrong assumptions. You are not supposed to use mutable objects as keys, but that doesn't mean you can't (diabolic laughter)! The default implementation of __hash__ returns a unique value, probably derived from the object's address, that remains constant in time. And the default __eq__ follows the same notion of object identity.
So you can put mutable objects in a dict, and they work as expected (if you expect equality based on instance, not on value).
See also: I'm able to use a mutable object as a dictionary key in python. Is this not disallowed?
I was having problems because I was pickling/unpickling the objects, which of course changed the hashes. One could generate a unique ID in the constructor, and use that for equality and deriving a hash to overcome this.
(For the curious, as to why such a "lookup based on instance identity" dict might be neccessary: I've been experimenting with a kind of "object database". You have pure python objects, put them in lists/containers, and can define indexes on attributes for faster lookup, complex queries and so on. For foreign keys (1:n relationships) I can just use containers, but for the backlink I have to come up with something clever if I don't want to modify the objects on the n side.)

Best way to store and use a large text-file in python

I'm creating a networked server for a boggle-clone I wrote in python, which accepts users, solves the boards, and scores the player input. The dictionary file I'm using is 1.8MB (the ENABLE2K dictionary), and I need it to be available to several game solver classes. Right now, I have it so that each class iterates through the file line-by-line and generates a hash table(associative array), but the more solver classes I instantiate, the more memory it takes up.
What I would like to do is import the dictionary file once and pass it to each solver instance as they need it. But what is the best way to do this? Should I import the dictionary in the global space, then access it in the solver class as globals()['dictionary']? Or should I import the dictionary then pass it as an argument to the class constructor? Is one of these better than the other? Is there a third option?

If you create a dictionary.py module, containing code which reads the file and builds a dictionary, this code will only be executed the first time it is imported. Further imports will return a reference to the existing module instance. As such, your classes can:
import dictionary
dictionary.words[whatever]
where dictionary.py has:
words = {}
# read file and add to 'words'

Even though it is essentially a singleton at this point, the usual arguments against globals apply. For a pythonic singleton-substitute, look up the "borg" object.
That's really the only difference. Once the dictionary object is created, you are only binding new references as you pass it along unless if you explicitly perform a deep copy. It makes sense that it is centrally constructed once and only once so long as each solver instance does not require a private copy for modification.

Adam, remember that in Python when you say:
a = read_dict_from_file()
b = a
... you are not actually copying a, and thus using more memory, you are merely making b another reference to the same object.
So basically any of the solutions you propose will be far better in terms of memory usage. Basically, read in the dictionary once and then hang on to a reference to that. Whether you do it with a global variable, or pass it to each instance, or something else, you'll be referencing the same object and not duplicating it.
Which one is most Pythonic? That's a whole 'nother can of worms, but here's what I would do personally:
def main(args):
run_initialization_stuff()
dictionary = read_dictionary_from_file()
solvers = [ Solver(class=x, dictionary=dictionary) for x in len(number_of_solvers) ]
HTH.

Depending on what your dict contains, you may be interested in the 'shelve' or 'anydbm' modules. They give you dict-like interfaces (just strings as keys and items for 'anydbm', and strings as keys and any python object as item for 'shelve') but the data is actually in a DBM file (gdbm, ndbm, dbhash, bsddb, depending on what's available on the platform.) You probably still want to share the actual database between classes as you are asking for, but it would avoid the parsing-the-textfile step as well as the keeping-it-all-in-memory bit.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.