Getting a hash of a function that is stable across runs - python

IIUC the Python hash() of a function (e.g. for use as a dict key) is not stable across runs.
Can something like dill or other libraries be used to get a hash of a function which is stable across runs and different computers? (id is of course not stable).

I'm the dill author. I've written a package called klepto which is a hierarchical caching/database abstraction useful for local memory hashing and object sharing across parallel/distributed resources. It includes several options for building ids of functions.
See klepto.keymaps and klepto.crypto for hashing choices -- some work across parallel/distributed resources, some don't. One of the choices is serialization with dill or otherwise.
klepto is similar to joblib, but designed specifically to have object permanence and sharing beyond a single python session. There may be something similar to klepto in dask.
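As a rough illustration of the underlying idea (a hand-rolled sketch, not klepto's API): serialize the function with dill and hash the bytes. The digest is only stable as long as the Python and dill versions match, since byte-for-byte stability of the serialization is not formally guaranteed.

import hashlib
import dill

def stable_func_hash(func):
    # dill serializes the function's code object, defaults, closure, etc.,
    # so identical definitions produce identical bytes -- and thus digests.
    return hashlib.sha256(dill.dumps(func)).hexdigest()

def add(a, b):
    return a + b

print(stable_func_hash(add))   # same digest on every run (same dill/Python version)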

As you mentioned, id will almost never be the same across different processes, let alone across different machines. As per the docs:
id(object):
Return the “identity” of an object. This is an integer which is
guaranteed to be unique and constant for this object during its
lifetime. Two objects with non-overlapping lifetimes may have the same
id() value.
This means that the id will generally differ, because the objects created by each run of your script reside in different places in memory and are not the same object. id defines identity; it's not a checksum of a block of code.
The only thing that will be consistent over different instances of your script being executed is the name of the function.
Another approach that gives you a deterministic way to identify a block of code inside your script is to calculate a checksum of the actual text. But controlling the contents of your methods should rather be handled by a version control system like git. If you need to calculate a hash of your code or a piece of it, you are likely doing something suboptimal.
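For illustration, a small sketch of the checksum-of-the-text idea (the helper name is made up): hash the source returned by inspect.getsource(). This requires the source file to be available and is sensitive to whitespace and comment edits.

import hashlib
import inspect

def source_hash(obj):
    # Hash the function's (or class's) source text; stable across runs and
    # machines as long as the source file is unchanged.
    return hashlib.sha256(inspect.getsource(obj).encode("utf-8")).hexdigest()

def extract_features(x):
    return [x, x * x]

print(source_hash(extract_features))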

I stumbled over "hash() is not stable across runs" today. I am now using:

import hashlib
from operator import xor
from struct import unpack

def stable_hash(a_string):
    sha256 = hashlib.sha256()
    sha256.update(bytes(a_string, "UTF-8"))
    digest = sha256.digest()
    h = 0
    for index in range(0, len(digest) >> 3):
        index8 = index << 3
        bytes8 = digest[index8 : index8 + 8]
        i = unpack('q', bytes8)[0]
        h = xor(h, i)
    return h
It's for string arguments. To use it e.g. for a dict you would pass str(tuple(sorted(a_dict.items()))) or something like that as argument. The "sorted" is important in this case to get a "canonical" representation.
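For example, to key a cache on a parameter dict (the dict here is just illustrative):

params = {"lr": 0.01, "epochs": 10}
key = stable_hash(str(tuple(sorted(params.items()))))
print(key)   # identical value on every run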

Related

How to hash a class or function definition?

Background
When experimenting with machine learning, I often reuse models trained previously, by means of pickling/unpickling.
However, when working on the feature-extraction part, it's a challenge not to confuse different models.
Therefore, I want to add a check that ensures that the model was trained using exactly the same feature-extraction procedure as the test data.
Problem
My idea was the following:
Along with the model, I'd include in the pickle dump a hash value which fingerprints the feature-extraction procedure.
When training a model or using it for prediction/testing, the model wrapper is given a feature-extraction class that conforms to certain protocol.
Using hash() on that class won't work, of course, as it isn't persistent across calls.
So I thought I could maybe find the source file where the class is defined, and get a hash value from that file.
However, there might be a way to get a stable hash value from the class’s in-memory contents directly.
This would have two advantages:
It would also work if no source file can be found.
And it would probably ignore irrelevant changes to the source file (eg. fixing a typo in the module docstring).
Do classes have a code object that could be used here?
All you’re looking for is a hash procedure that includes all the salient details of the class’s definition. (Base classes can be included by including their definitions recursively.) To minimize false matches, the basic idea is to apply a wide (cryptographic) hash to a serialization of your class. So start with pickle: it supports more types than hash and, when it uses identity, it uses a reproducible identity based on name. This makes it a good candidate for the base case of a recursive strategy: deal with the functions and classes whose contents are important and let it handle any ancillary objects referenced.
So define a serialization by cases. Call an object special if it falls under any case below but the last.
For a tuple deemed to contain special objects:
The character t
The serialization of its len
The serialization of each element, in order
For a dict deemed to contain special objects:
The character d
The serialization of its len
The serialization of each name and value, in sorted order
For a class whose definition is salient:
The character C
The serialization of its __bases__
The serialization of its vars
For a function whose definition is salient:
The character f
The serialization of its __defaults__
The serialization of its __kwdefaults__ (in Python 3)
The serialization of its __closure__ (but with cell values instead of the cells themselves)
The serialization of its vars
The serialization of its __code__
For a code object (since pickle doesn’t support them at all):
The character c
The serializations of its co_argcount, co_nlocals, co_flags, co_code, co_consts, co_names, co_freevars, and co_cellvars, in that order; none of these are ever special
For a static or class method object:
The character s or m
The serialization of its __func__
For a property:
The character p
The serializations of its fget, fset, and fdel, in that order
For any other object: pickle.dumps(x,-1)
(You never actually store all this: just create a hashlib object of your choice in the top-level function, and in the recursive part update it with each piece of the serialization in turn.)
The type tags are to avoid collisions and in particular to be prefix-free. Binary pickles are already prefix-free. You can base the decision about a container on a deterministic analysis of its contents (even if heuristic) or on context, so long as you’re consistent.
As always, there is something of an art to balancing false positives against false negatives: for a function, you could include __globals__ (with pruning of objects already serialized to avoid large if not infinite serializations) or just any __name__ found therein. Omitting co_varnames ignores renaming local variables, which is good unless introspection is important; similarly for co_filename and co_name.
You may need to support more types: look for static attributes and default arguments that don’t pickle correctly (because they contain references to special types) or at all. Note of course that some types (like file objects) are unpicklable because it’s difficult or impossible to serialize them (although unlike pickle you can handle lambdas just like any other function once you’ve done code objects). At some risk of false matches, you can choose to serialize just the type of such objects (as always, prefixed with a character ? to distinguish from actually having the type in that position).
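To make the recipe concrete, here is a compressed sketch of the recursive strategy (an illustration, not a complete implementation): it covers tuples, dicts, functions, and code objects, simply recurses into every tuple and dict rather than deciding which ones contain special objects, leaves out the class/method/property cases and __closure__/__globals__, and tags anything pickle refuses with ? as suggested above.

import hashlib
import pickle
import types

def _update(h, obj):
    if isinstance(obj, tuple):
        h.update(b't')
        _update(h, len(obj))
        for item in obj:
            _update(h, item)
    elif isinstance(obj, dict):
        h.update(b'd')
        _update(h, len(obj))
        for name in sorted(obj):
            _update(h, name)
            _update(h, obj[name])
    elif isinstance(obj, types.FunctionType):
        h.update(b'f')
        _update(h, obj.__defaults__)
        _update(h, obj.__kwdefaults__)
        _update(h, obj.__code__)
    elif isinstance(obj, types.CodeType):
        h.update(b'c')
        for part in (obj.co_argcount, obj.co_nlocals, obj.co_flags,
                     obj.co_code, obj.co_consts, obj.co_names,
                     obj.co_freevars, obj.co_cellvars):
            _update(h, part)
    else:
        try:
            h.update(pickle.dumps(obj, -1))
        except Exception:
            # unpicklable ancillary object: hash just its type, tagged with '?'
            h.update(b'?')
            h.update(type(obj).__qualname__.encode())

def definition_hash(obj):
    h = hashlib.sha256()
    _update(h, obj)
    return h.hexdigest()

def extract(x, scale=2):
    return x * scale

print(definition_hash(extract))   # stable across runs for the same definition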

List comprehension is sorting automatically [duplicate]

The question arose when answering to another SO question (there).
When I iterate several times over a python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, or random? Or implementation-defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python) integers less than the machine word size (32 bit or 64 bit) hash to themselves, but text strings, byte strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle

seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)

while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))

print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent so I should say that I'm running the macports Python 2.6 on Snow Leopard. While the program will output the same answer for long stretches of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
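For example, a minimal ordered-set sketch built on OrderedDict (the class and its method set are illustrative, not a standard API):

from collections import OrderedDict

class OrderedSet(object):
    """Keeps elements unique while preserving insertion order."""
    def __init__(self, iterable=()):
        self._data = OrderedDict((item, None) for item in iterable)
    def add(self, item):
        self._data[item] = None
    def discard(self, item):
        self._data.pop(item, None)
    def __contains__(self, item):
        return item in self._data
    def __iter__(self):
        return iter(self._data)
    def __len__(self):
        return len(self._data)

s = OrderedSet(['b', 'a', 'b', 'c'])
print(list(s))   # ['b', 'a', 'c'] -- always in insertion order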
The answer is simply a NO.
Python set iteration order is NOT stable.
I did a simple experiment to show this.
The code:
import random

random.seed(1)
x = []

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)
print('====')
for j in set(x):
    print(j.a, j.b)
Run this twice and you will get the following:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0 (see details here, here and here, and the sketch after this list).
Use OrderedDict instead.
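As a hedged illustration of the first option: PYTHONHASHSEED only takes effect at interpreter startup, so this sketch launches a child interpreter with the variable pinned; run it twice and the printed ordering stays the same.

import os
import subprocess
import sys

# PYTHONHASHSEED is read at interpreter startup, so run the set in a child
# interpreter with the seed pinned; the printed order is then reproducible.
env = dict(os.environ, PYTHONHASHSEED='0')
code = "print(list({'alpha', 'beta', 'gamma', 'delta'}))"
print(subprocess.check_output([sys.executable, '-c', code], env=env).decode().strip())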
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.

Embarassingly parallel tasks with IPython Parallel (or other package) depending on unpickable objects

I often hit problems where I want to do something simple over a set of many, many objects quickly. My natural choice is to use IPython Parallel for its simplicity, but often I have to deal with unpicklable objects. After trying for a few hours I usually resign myself to running my tasks overnight on a single computer, or do something clumsy like dividing the work semi-manually to run in multiple python scripts.
To give a concrete example, suppose I want to delete all keys in a given S3 bucket.
What I'd normally do without thinking is:
import boto
from IPython.parallel import Client
connection = boto.connect_s3(awskey, awssec)
bucket = connection.get_bucket('mybucket')
client = Client()
loadbalancer = client.load_balanced_view()
keyList = list(bucket.list())
loadbalancer.map(lambda key: key.delete(), keyList)
The problem is that the Key object in boto is unpicklable (*). This occurs very often in different contexts for me. It's a problem also with multiprocessing, execnet, and all the other frameworks and libs I've tried (for obvious reasons: they all use the same pickler to serialize the objects).
Do you guys also have these problems? Is there a way I can serialize these more complex objects? Do I have to write my own pickler for these particular objects? If I do, how do I tell IPython Parallel to use it? How do I write a pickler?
Thanks!
(*) I'm aware that I can simply make a list of the key names and do something like this:
loadbalancer.map(lambda keyname: getKey(keyname).delete())
and define the getKey function in each engine of the IPython cluster. This is just a particular instance of a more general problem that I find often. Maybe it's a bad example, since it can be easily solved in another way.
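For completeness, a sketch of that name-based workaround (illustrative only: it assumes boto 2's bucket.delete_key call and that the engines have S3 credentials configured):

import boto
from IPython.parallel import Client

client = Client()
loadbalancer = client.load_balanced_view()

connection = boto.connect_s3(awskey, awssec)
bucket = connection.get_bucket('mybucket')
keyNames = [key.name for key in bucket.list()]   # plain strings pickle fine

def delete_by_name(name):
    # runs on the engine: open a fresh connection there instead of shipping Key objects
    import boto
    conn = boto.connect_s3()
    conn.get_bucket('mybucket').delete_key(name)

loadbalancer.map(delete_by_name, keyNames)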
IPython has a use_dill option, where if you have the dill serializer installed, you can serialize most "unpicklable" objects.
How can I use dill instead of pickle with load_balanced_view
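A hedged sketch of enabling it (reusing keyList from the question's snippet; the use_dill call has moved around between IPython/ipyparallel versions, so check the docs for yours):

from IPython.parallel import Client

client = Client()
client[:].use_dill()          # ask the client and all engines to serialize with dill
                              # (dill must be installed on every engine)

loadbalancer = client.load_balanced_view()
loadbalancer.map(lambda key: key.delete(), keyList)   # keyList as built in the question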
That IPython sure brings people together ;). From what I've been able to gather, the problem with pickling these objects is their bound methods. So maybe instead of using the key's own method to delete it, you could write a function that takes the key and deletes it. Maybe first get a list of dicts with the relevant information on each key and then afterwards call a function delete_key(dict), which I leave up to you to write because I've no idea how to handle S3 keys.
Would that work?
Alternatively, it could be that this works: simply instead of calling the method of the instance, call the method of the class with the instance as an argument. So instead of lambda key : key.delete() you would do lambda key : Key.delete(key). Of course you have to push the class to the nodes then, but that shouldn't be a problem. A minimal example:
class stuff(object):
    def __init__(self, a=1):
        self.list = []
    def append(self, a):
        self.list.append(a)

import IPython.parallel as p

c = p.Client()
dview = c[:]

li = map(stuff, [[]]*10)                     # creates 10 stuff instances
dview.map(lambda x: x.append(1), li)         # should append 1 to all lists, but fails
dview.push({'stuff': stuff})                 # push the class to the engines
dview.map(lambda x: stuff.append(x, 1), li)  # this works.

Strings from `raw_input()` in memory

I've known for a while that Python likes to reuse strings in memory instead of having duplicates:
>>> a = "test"
>>> id(a)
36910184L
>>> b = "test"
>>> id(b)
36910184L
However, I recently discovered that the string returned from raw_input() does not follow that typical optimization pattern:
>>> a = "test"
>>> id(a)
36910184L
>>> c = raw_input()
test
>>> id(c)
45582816L
I'm curious why this is the case. Is there a technical reason?
To me it appears that python interns string literals, but strings which are created via some other process don't get interned:
>>> s = 'ab'
>>> id(s)
952080
>>> g = 'a' if True else 'c'
>>> g += 'b'
>>> g
'ab'
>>> id(g)
951336
Of course, raw_input is creating new strings without using string literals, so it's quite feasible to assume that it won't have the same id. There are (at least) two reasons why C-python interns strings -- memory (you can save a bunch if you don't store a whole bunch of copies of the same thing) and resolution for hash collisions. If 2 strings hash to the same value (in a dictionary lookup for instance), then python needs to check to make sure that both strings are equivalent. It can do a string compare if they're not interned, but if they are interned, it only needs to do a pointer compare which is a bit more efficient.
The compiler can't intern strings except where they're present in actual source code (ie, string literals). In addition to that, raw_input also strips off new lines.
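A small demonstration of that distinction, plus forcing interning by hand (the intern() built-in of Python 2 lives at sys.intern() in Python 3):

import sys

a = "ab"            # literal: interned at compile time (identifier-like, so CPython interns it)
b = "a"
b += "b"            # built at runtime: usually a distinct object
print(a is b)       # typically False

b = sys.intern(b)   # in Python 2: b = intern(b)
print(a is b)       # True -- both names now point at the canonical copy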
[update] In order to answer the question, it is necessary to know why, how and when Python reuses strings.
Let's start with the how: Python uses "interned" strings - from Wikipedia:
In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
Why? Seems like saving memory is not the main goal here, only a nice side-effect.
String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fills up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.
String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory.
Now the "when": cpython "interns" a string in the following situations:
when you use the intern() non-essential built-in function (which moved to sys.intern in Python 3).
small strings (0 or 1 byte) - this very informative article by Laurent Luce explains the implementation
the names used in Python programs are automatically interned
the dictionaries used to hold module, class or instance attributes have interned keys
For other situations, every implementation seems to have great variation on when strings are automatically interned.
I could not possibly put it better than Alex Martelli did in this answer (no wonder the guy has a 245k reputation):
Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings) -- either making a new one, or finding an existing equal one and using one more reference to it, are just fine from the language's point of view. In practice, of course, real-world implementation strike reasonable compromise: one more reference to a suitable existing object when locating such an object is cheap and easy, just make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could potentially take a long time searching.
So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).
I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
[initial answer]
This is more a comment than an answer, but the comment system is not well suited for posting code:
def main():
    while True:
        s = raw_input('feed me:')
        print '"{}".id = {}'.format(s, id(s))

if __name__ == '__main__':
    main()
Running this gives me:
"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688
"test".id = 41900848
"test".id = 41898928
"test".id = 41898688
From my experience, at least on 2.7, there is some optimization going on even for raw_input().
If the implementation uses hash tables, I guess there is more than one. Going to dive in the source right now.
[first update]
Looks like my experiment was flawed:
def main():
    storage = []
    while True:
        s = raw_input('feed me:')
        print '"{}".id = {}'.format(s, id(s))
        storage.append(s)

if __name__ == '__main__':
    main()
Result:
"test".id = 43274944
"test".id = 43277104
"test".id = 43487408
"test".id = 43487504
"test".id = 43487552
"test".id = 43487600
"test".id = 43487648
"test".id = 43487744
"test".id = 43487864
"test".id = 43487936
"test".id = 43487984
"test".id = 43488032
In his answer to another question, user tzot warns about object lifetime:
A side note: it is very important to know the lifetime of objects in Python. Note the following session:
Python 2.6.4 (r264:75706, Dec 26 2009, 01:03:10)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a="a"
>>> b="b"
>>> print id(a+b), id(b+a)
134898720 134898720
>>> print (a+b) is (b+a)
False
Your thinking that by printing the IDs of two separate expressions and noting “they are equal ergo the two expressions must be equal/equivalent/the same” is faulty. A single line of output does not necessarily imply all of its contents were created and/or co-existed at the same single moment in time.
If you want to know if two objects are the same object, ask Python directly (using the is operator).

The use of id() in Python

Of what use is id() in real-world programming? I have always thought this function is there just for academic purposes. Where would I actually use it in programming?
I have been programming applications in Python for some time now, but I have never encountered any "need" for using id(). Could someone throw some light on its real world usage?
It can be used for creating a dictionary of metadata about objects:
For example:
someobj = int(1)
somemetadata = "The type is an int"
data = {id(someobj):somemetadata}
Now if I come across this object somewhere else, I can check in O(1) time whether metadata about it exists (instead of looping with is).
I use id() frequently when writing temporary files to disk. It's a very lightweight way of getting a pseudo-random number.
Let's say that during data processing I come up with some intermediate results that I want to save off for later use. I simply create a file name using the pertinent object's id.
fileName = "temp_results_" + str(id(self)).
Although there are many other ways of creating unique file names, this is my favorite. In CPython, the id is the memory address of the object. Thus, if multiple objects are instantiated, I'm guaranteed to never have a naming collision. That's all for the cost of 1 address lookup. The other methods that I'm aware of for getting a unique string are much more intense.
A concrete example would be a word-processing application where each open document is an object. I could periodically save progress to disk with multiple files open using this naming convention.
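For illustration, a small sketch of that naming scheme (the class, method, and directory are made up for the example):

import os

class Document(object):
    def save_intermediate(self, payload, directory="/tmp"):
        # id(self) is unique among objects alive at the same time (in CPython
        # it's the address), so concurrently open documents never collide.
        file_name = "temp_results_" + str(id(self)) + ".bin"
        path = os.path.join(directory, file_name)
        with open(path, "wb") as f:
            f.write(payload)
        return path

doc = Document()
print(doc.save_intermediate(b"partial results"))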
Anywhere where one might conceivably need id() one can use either is or a weakref instead. So, no need for it in real-world code.
The only time I've found id() useful outside of debugging or answering questions on comp.lang.python is with a WeakValueDictionary, that is a dictionary which holds a weak reference to the values and drops any key when the last reference to that value disappears.
Sometimes you want to be able to access a group (or all) of the live instances of a class without extending the lifetime of those instances and in that case a weak mapping with id(instance) as key and instance as value can be useful.
However, I don't think I've had to do this very often, and if I had to do it again today then I'd probably just use a WeakSet (but I'm pretty sure that didn't exist last time I wanted this).
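A minimal sketch of that pattern (class and attribute names are made up):

import weakref

class Node(object):
    _live = weakref.WeakValueDictionary()   # id -> instance, without keeping it alive

    def __init__(self, name):
        self.name = name
        Node._live[id(self)] = self

    @classmethod
    def live_instances(cls):
        return list(cls._live.values())

a, b = Node("a"), Node("b")
print(len(Node.live_instances()))   # 2
del b
# once b is garbage-collected, its entry silently disappears from the mapping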
In one program I used it to compute the intersection of lists of non-hashables, like:
def intersection(*lists):
    id_row_row = {}  # id(row): row
    key_id_row = {}  # key: set(id(row))
    for key, rows in enumerate(lists):
        key_id_row[key] = set()
        for row in rows:
            id_row_row[id(row)] = row
            key_id_row[key].add(id(row))

    from operator import and_

    def intersect(sets):
        if len(sets) > 0:
            return reduce(and_, sets)
        else:
            return set()

    seq = [id_row_row[id_row] for id_row in intersect(key_id_row.values())]
    return seq
