I have a function that produces a block with some data in it:
def new_block(self, proof, previous_hash=None):
    ...
    block = {
        'message': 'New Block Forged',
        'index': len(self.chain) + 1,
        'transactions': self.current_transactions,
        'proof': proof,
        'previous_hash': previous_hash or self.hash_block(self.chain[-1]),
        'timestamp': response.tx_time or time(),
    }
self.chain is the list of blocks that the block gets added to. The previous_hash (the hash of the previous block) gets passed to the function, and a timestamp is created. Don't worry too much about the details of the actual data (something could be wrong there, but it relates more to the hash() function than to what is going on with the data).
Next I hash the block, and add it to the block:
block['hash'] = self.hash_block(block)
The hash_block function looks like this:
@staticmethod
def hash_block(block):
    block_string = json.dumps(block, sort_keys=True)
    return hash(block_string)
This function produces a completely different hash than the one the next block in line says it has: the hash attached to the last block on the chain does not match the previous_hash of the block in front of it, even though both are computed with the same function.
This line:
'previous_hash': previous_hash or self.hash_block(self.chain[-1])
and this line:
block['hash'] = self.hash_block(block)
are the important lines (along with the hash_block function). A block gets created, gets hashed, and the hash gets attached; then another block gets created, hashes the previous block, and the result doesn't match the hash that was attached to that previous block when it was created.
Also, I started out with hashlib.sha256, and when I noticed this problem I decided to see if it was the hashing function, so I switched to the built-in hash(), but I am still having the problem. Ultimately I want this to work with hashlib, but I figure that if I can get it to work with hash() first, I will have solved the problem for hashlib too.
hash() is only suitable for building hash tables (the mechanism behind dict and set). It uses a random seed to prevent denial-of-service attacks, it is not a cryptographic hash, and it should not be counted on to be stable across Python invocations.
From the hash() function documentation:
Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).
and from the __hash__ hook method, which hash() calls if present:
Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
Stick to the hashlib module options; those are stable across calls.
Apart from this, within a single Python process, hash() on objects with the same value will also produce the exact same hash. Since your block dictionary changes between blocks (as it includes the hash of the preceding block in the chain), it will naturally not be the same string and so not the same hash value.
The same applies to the hashlib functions; they produce the same value for the same input only. If your hash values differ, then the input differs. And your inputs naturally differ because each block dictionary includes a reference to the preceding hash.
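To make that concrete, here is a minimal sketch (hashlib instead of the built-in hash, and a stripped-down block dict standing in for the real one) showing that the digest only changes because the input changes:

import hashlib
import json

def hash_block(block):
    # Deterministic: the same dict always serializes to the same string,
    # so the same input always gives the same digest.
    block_string = json.dumps(block, sort_keys=True)
    return hashlib.sha256(block_string.encode()).hexdigest()

block = {'index': 1, 'proof': 100, 'previous_hash': None}
print(hash_block(block) == hash_block(block))  # True: identical input
block['hash'] = hash_block(block)              # the block now contains its own hash
print(hash_block(block) == block['hash'])      # False: the input has changed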
From what I understood from the answer by Martijn Pieters, the Python hash() function uses a random seed. The hashlib function is deterministic. Hence to get the same hash everytime, one can use:
import hashlib

def get_hash(string: str):
    return hashlib.sha256(string.encode("utf-8")).hexdigest()
Which can be used as:
some_hash_code = get_hash("One sentence")
print(some_hash_code)
Which outputs:
9d8c9567f9bfd8112d43c14e3e394ae97599afe39b6d7ef66cf365e342f009d4
every time.
After printing the object which would be json.dumps'd, I noticed that it was retaining the added hash property (I thought it wasn't). Check your variables.
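For anyone hitting the same thing: one way to verify against the stored hash is to drop the added 'hash' key before re-hashing. A hedged sketch of that idea (the 'hash' field name is from the question; the helper itself is only an illustration, not the original code):

import hashlib
import json

def hash_block_for_verification(block):
    # Exclude the 'hash' key that was attached after the block was first hashed,
    # so verification hashes exactly the content that produced the stored value.
    stripped = {k: v for k, v in block.items() if k != 'hash'}
    return hashlib.sha256(json.dumps(stripped, sort_keys=True).encode()).hexdigest()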
Related
I am implementing a hash function in a class, and as the hash value for such an object I use the hash of the username, i.e.:
class DiscordUser:
    def __init__(self, username):
        self.username = username

    def __hash__(self):
        return hash(self.username)
The problem arises when adding such objects to a set and then testing membership with a new object built from the exact same username, i.e.:
user = DiscordUser("Username#123")
if user in users_set:
    # user is already in users_set; this condition is NEVER met, don't understand why
    pass
else:
    # add user to users_set; this condition is ALWAYS met
    users_set.add(user)
Why is the hash function not working properly, or what am I doing wrong here?
The hash function is working properly, set membership uses __hash__(), but if two objects have the same hash, set will use the __eq__() method to determine whether or not they are equal. Ultimately, set guarantees that no two elements are equal, not that no two elements have equal hashes. The hash value is used as a first pass because it is often less expensive to compute than equality.
Why?
There is no guarantee that any two objects with the same hash are in fact equal. Consider that there are infinitely many possible values for `self.username` in your `DiscordUser`. Python uses SipHash for hashing `str` values. SipHash has a finite range, therefore collisions must be possible.
Be careful about using a mutable value as input to hash(). The hash value of an object is expected to be the same for its lifetime.
Take a look at this answer for some nice info about sets, hashing, and equality testing in Python.
edit: Python uses siphash for str values since 3.4
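Concretely, a minimal sketch of the missing piece for the class above (the username-based equality is my assumption about the intended semantics):

class DiscordUser:
    def __init__(self, username):
        self.username = username

    def __hash__(self):
        return hash(self.username)

    def __eq__(self, other):
        # Two users with the same username count as the same set element.
        return isinstance(other, DiscordUser) and self.username == other.username

users_set = set()
users_set.add(DiscordUser("Username#123"))
print(DiscordUser("Username#123") in users_set)  # True once __eq__ is defined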
IIUC the Python hash of a function (e.g. for use as a key in a dict) is not stable across runs.
Can something like dill or other libraries be used to get a hash of a function which is stable across runs and different computers? (id is of course not stable).
I'm the dill author. I've written a package called klepto which is a hierarchical caching/database abstraction useful for local memory hashing and object sharing across parallel/distributed resources. It includes several options for building ids of functions.
See klepto.keymaps and klepto.crypto for hashing choices -- some work across parallel/distributed resources, some don't. One of the choices is serialization with dill or otherwise.
klepto is similar to joblib, but designed specifically to have object permanence and sharing beyond a single python session. There may be something similar to klepto in dask.
As you mentioned, id will almost never be the same across different processes, and certainly not across different machines. As per the docs:
id(object):
Return the “identity” of an object. This is an integer which is
guaranteed to be unique and constant for this object during its
lifetime. Two objects with non-overlapping lifetimes may have the same
id() value.
This means that id should be different because the objects created by every instance of your script reside in different places in the memory and are not the same object. id defines the identity, it's not a checksum of a block of code.
The only thing that will be consistent over different instances of your script being executed is the name of the function.
One other approach that you could use to have a deterministic way to identify a block of code inside your script would be to calculate a checksum of the actual text. But controlling the contents of your methods should rather be handled by a versioning system like git. It is likely that if you need to calculate a hash sum of your code or a piece of it, you are doing something suboptimally.
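If you do go that route anyway, a rough sketch of the idea (inspect.getsource plus hashlib; note that any whitespace or comment change alters the digest, and getsource only works for functions defined in a source file):

import hashlib
import inspect

def source_fingerprint(func):
    # Hash the function's source text; this is stable across runs and machines
    # as long as the text itself does not change.
    source = inspect.getsource(func)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def example(x):
    return x * 2

print(source_fingerprint(example))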
I stumbled over the "hash() is not stable across runs" problem today. I am now using:
import hashlib
from operator import xor
from struct import unpack

def stable_hash(a_string):
    # SHA-256 the string, then fold the 32-byte digest into a single 64-bit
    # integer by XOR-ing its four 8-byte chunks together.
    sha256 = hashlib.sha256()
    sha256.update(bytes(a_string, "UTF-8"))
    digest = sha256.digest()

    h = 0
    for index in range(0, len(digest) >> 3):
        index8 = index << 3
        bytes8 = digest[index8 : index8 + 8]
        i = unpack('q', bytes8)[0]
        h = xor(h, i)

    return h
It's for string arguments. To use it e.g. for a dict you would pass str(tuple(sorted(a_dict.items()))) or something like that as argument. The "sorted" is important in this case to get a "canonical" representation.
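For example, assuming the stable_hash function above:

a_dict = {"b": 2, "a": 1}
canonical = str(tuple(sorted(a_dict.items())))  # "(('a', 1), ('b', 2))"
print(stable_hash(canonical))                   # same integer on every run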
I'm working through http://www.mypythonquiz.com, and question #45 asks for the output of the following code:
confusion = {}
confusion[1] = 1
confusion['1'] = 2
confusion[1.0] = 4

sum = 0
for k in confusion:
    sum += confusion[k]

print sum
The output is 6, since the key 1.0 replaces 1. This feels a bit dangerous to me, is this ever a useful language feature?
First of all: the behaviour is documented explicitly in the docs for the hash function:
hash(object)
Return the hash value of the object (if it has one). Hash values are
integers. They are used to quickly compare dictionary keys during a
dictionary lookup. Numeric values that compare equal have the same
hash value (even if they are of different types, as is the case for 1
and 1.0).
Secondly, a limitation of hashing is pointed out in the docs for object.__hash__
object.__hash__(self)
Called by built-in function hash() and for operations on members of
hashed collections including set, frozenset, and dict. __hash__()
should return an integer. The only required property is that objects
which compare equal have the same hash value;
This is not unique to python. Java has the same caveat: if you implement hashCode then, in order for things to work correctly, you must implement it in such a way that: x.equals(y) implies x.hashCode() == y.hashCode().
So, python decided that 1.0 == 1 holds, hence it's forced to provide an implementation for hash such that hash(1.0) == hash(1). The side effect is that 1.0 and 1 act exactly in the same way as dict keys, hence the behaviour.
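A quick check of both the equality and the matching hashes, and the dict behaviour that follows from them:

print(1 == 1.0, hash(1) == hash(1.0))  # True True

d = {1: 'int'}
d[1.0] = 'float'   # the same key as far as the dict is concerned
print(d)           # {1: 'float'} - one entry, and its value was overwritten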
In other words the behaviour in itself doesn't have to be used or useful in any way. It is necessary. Without that behaviour there would be cases where you could accidentally overwrite a different key.
If we had 1.0 == 1 but hash(1.0) != hash(1) we could still have a collision. And if 1.0 and 1 collide, the dict will use equality to be sure whether they are the same key or not and kaboom the value gets overwritten even if you intended them to be different.
The only way to avoid this would be to have 1.0 != 1, so that the dict is able to distinguish between them even in case of collision. But it was deemed more important to have 1.0 == 1 than to avoid the behaviour you are seeing, since you practically never use floats and ints as dictionary keys anyway.
Since python tries to hide the distinction between numbers by automatically converting them when needed (e.g. 1/2 -> 0.5) it makes sense that this behaviour is reflected even in such circumstances. It's more consistent with the rest of python.
This behaviour would appear in any implementation where the matching of the keys is at least partially (as in a hash map) based on comparisons.
For example, if a dict was implemented using a red-black tree or another kind of balanced BST, when the key 1.0 is looked up the comparisons with other keys would return the same results as for 1, and so they would still act in the same way.
Hash maps require even more care because of the fact that it's the value of the hash that is used to find the entry of the key and comparisons are done only afterwards. So breaking the rule presented above means you'd introduce a bug that's quite hard to spot because at times the dict may seem to work as you'd expect it, and at other times, when the size changes, it would start to behave incorrectly.
Note that there would be a way to fix this: have a separate hash map/BST for each type inserted in the dictionary. In this way there couldn't be any collisions between objects of different type and how == compares wouldn't matter when the arguments have different types.
However, this would complicate the implementation and would probably be inefficient, since hash maps have to keep quite a few free locations in order to have O(1) access times. If they become too full, performance decreases. Having multiple hash maps means wasting more space, and you'd need to first choose which hash map to look at before even starting the actual lookup of the key.
If you used BSTs, you'd first have to look up the type and then perform a second lookup. So if you are going to use many types you'd end up with twice the work (and the lookup would take O(log n) instead of O(1)).
You should consider that the dict aims at storing data depending on the logical numeric value, not on how you represented it.
The difference between ints and floats is indeed just an implementation detail and not conceptual. Ideally the only number type would be an arbitrary-precision number with unbounded accuracy even for fractional values... this is however hard to implement without getting into trouble... but maybe that will be the only numeric type for Python in the future.
So while having different types for technical reasons Python tries to hide these implementation details and int->float conversion is automatic.
It would be much more surprising if, in a Python program, the branch if x == 1: ... wasn't taken when x is a float with value 1.
Note also that in Python 3 the value of 1/2 is 0.5 (the division of two integers) and that the long and non-Unicode string types have been dropped, in the same attempt to hide implementation details.
In python:
1 == 1.0
True
This is because of implicit numeric conversion during the comparison.
However:
1 is 1.0
False
I can see why automatic conversion between float and int is handy; it is relatively safe to convert an int into a float, and yet there are other languages (e.g. Go) that stay away from implicit conversion.
It is actually a language design decision, and a matter of taste more than of differing functionality.
Dictionaries are implemented with a hash table. To look up something in a hash table, you start at the position indicated by the hash value, then search different locations until you find a key value that's equal or an empty bucket.
If you have two key values that compare equal but have different hashes, you may get inconsistent results depending on whether the other key value was in the searched locations or not. For example this would be more likely as the table gets full. This is something you want to avoid. It appears that the Python developers had this in mind, since the built-in hash function returns the same hash for equivalent numeric values, no matter if those values are int or float. Note that this extends to other numeric types, False is equal to 0 and True is equal to 1. Even fractions.Fraction and decimal.Decimal uphold this property.
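A small illustration of that property across the numeric types mentioned:

from decimal import Decimal
from fractions import Fraction

# Numerically equal values hash equal, regardless of type.
print(hash(True) == hash(1) == hash(1.0) == hash(Fraction(1, 1)) == hash(Decimal(1)))  # True
print(hash(False) == hash(0))  # True

d = {0: 'zero'}
d[False] = 'false'  # overwrites the value stored under the key 0
print(d)            # {0: 'false'}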
The requirement that if a == b then hash(a) == hash(b) is documented in the definition of object.__hash__():
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
TL;DR: a dictionary would break if keys that compared equal did not map to the same value.
Frankly, the opposite would be dangerous! 1 == 1.0, so if they could map to different keys and you tried to access one of them via a computed number, you'd likely run into trouble, because the ambiguity would be hard to track down.
Dynamic typing means that the value is more important than what the technical type of something is, since the type is malleable (which is a very useful feature) and so distinguishing both ints and floats of the same value as distinct is unnecessary semantics that will only lead to confusion.
I agree with others that it makes sense to treat 1 and 1.0 as the same in this context. Even if Python did treat them differently, it would probably be a bad idea to try to use 1 and 1.0 as distinct keys for a dictionary. On the other hand -- I have trouble thinking of a natural use-case for using 1.0 as an alias for 1 in the context of keys. The problem is that either the key is literal or it is computed. If it is a literal key then why not just use 1 rather than 1.0? If it is a computed key -- round off error could muck things up:
>>> d = {}
>>> d[1] = 5
>>> d[1.0]
5
>>> x = sum(0.01 for i in range(100)) #conceptually this is 1.0
>>> d[x]
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
d[x]
KeyError: 1.0000000000000007
So I would say that, generally speaking, the answer to your question "is this ever a useful language feature?" is "No, probably not."
The question arose when answering to another SO question (there).
When I iterate several times over a python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python) integers less than the machine word size (32 bit or 64 bit) hash to themselves, but text strings, bytes strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle

seed(42)

data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)

while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val

    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))

print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long stretches of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
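One way to avoid the run-to-run reshuffling, sketched under the assumption that two Foo objects wrapping the same int should count as the same element: hash on the wrapped value instead of relying on the default id()-based hash, so the set layout no longer depends on where the wrapper objects happen to be allocated.

class Foo(object):
    def __init__(self, val):
        self.val = val

    def __repr__(self):
        return str(self.val)

    def __hash__(self):
        # Small ints hash to themselves, so the ordering becomes reproducible.
        return hash(self.val)

    def __eq__(self, other):
        return isinstance(other, Foo) and self.val == other.val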
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
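For example, a bare-bones sketch of that idea (only add, membership, iteration and length; a full implementation would also cover removal, the set operators, and so on):

from collections import OrderedDict

class OrderedSet:
    def __init__(self, iterable=()):
        # The OrderedDict keys are the elements; the values are unused.
        self._data = OrderedDict((item, None) for item in iterable)

    def add(self, item):
        self._data[item] = None

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

s = OrderedSet(['b', 'a', 'b', 'c'])
print(list(s))  # ['b', 'a', 'c'] - insertion order, duplicates dropped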
The answer is simply a NO.
Python set iteration order is NOT stable.
I did a simple experiment to show this.
The code:
import random

random.seed(1)
x = []

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)

print('====')

for j in set(x):
    print(j.a, j.b)
Run this twice and you will get this:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here (and the sketch after this list).
Use OrderedDict instead.
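As a rough illustration of the PYTHONHASHSEED point, here is a sketch that launches child interpreters with and without a fixed seed (the actual hash values printed will differ on your machine):

import os
import subprocess
import sys

def child_hash(extra_env):
    # Start a fresh interpreter and print hash('hello') under the given environment.
    env = dict(os.environ, **extra_env)
    out = subprocess.run([sys.executable, "-c", "print(hash('hello'))"],
                         capture_output=True, text=True, env=env)
    return out.stdout.strip()

# Salted hashing: the two child processes usually print different values.
print(child_hash({}), child_hash({}))

# Fixed seed: the values are identical, run after run.
print(child_hash({"PYTHONHASHSEED": "0"}), child_hash({"PYTHONHASHSEED": "0"}))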
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.
Of what use is id() in real-world programming? I have always thought this function is there just for academic purposes. Where would I actually use it in programming?
I have been programming applications in Python for some time now, but I have never encountered any "need" for using id(). Could someone throw some light on its real world usage?
It can be used for creating a dictionary of metadata about objects:
For example:
someobj = int(1)
somemetadata = "The type is an int"
data = {id(someobj):somemetadata}
Now if I encounter this object somewhere else, I can check whether metadata about it exists in O(1) time (instead of looping and comparing with is).
I use id() frequently when writing temporary files to disk. It's a very lightweight way of getting a pseudo-random number.
Let's say that during data processing I come up with some intermediate results that I want to save off for later use. I simply create a file name using the pertinent object's id.
fileName = "temp_results_" + str(id(self))
Although there are many other ways of creating unique file names, this is my favorite. In CPython, the id is the memory address of the object, so as long as those objects are alive at the same time I'm guaranteed never to have a naming collision. That's all for the cost of one address lookup. The other methods that I'm aware of for getting a unique string are much more involved.
A concrete example would be a word-processing application where each open document is an object. I could periodically save progress to disk with multiple files open using this naming convention.
Anywhere where one might conceivably need id() one can use either is or a weakref instead. So, no need for it in real-world code.
The only time I've found id() useful outside of debugging or answering questions on comp.lang.python is with a WeakValueDictionary, that is a dictionary which holds a weak reference to the values and drops any key when the last reference to that value disappears.
Sometimes you want to be able to access a group (or all) of the live instances of a class without extending the lifetime of those instances and in that case a weak mapping with id(instance) as key and instance as value can be useful.
However, I don't think I've had to do this very often, and if I had to do it again today then I'd probably just use a WeakSet (but I'm pretty sure that didn't exist last time I wanted this).
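A rough sketch of that pattern (the registry class and names here are illustrative, not from any particular library):

import weakref

class Instrument:
    _live = weakref.WeakValueDictionary()  # id(instance) -> instance

    def __init__(self, name):
        self.name = name
        Instrument._live[id(self)] = self

    @classmethod
    def live_instances(cls):
        return list(cls._live.values())

a = Instrument("a")
b = Instrument("b")
print(len(Instrument.live_instances()))  # 2
del b
# Once the last strong reference disappears, the entry drops out on its own
# (immediately in CPython, thanks to reference counting).
print(len(Instrument.live_instances()))  # 1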
In one program I used it to compute the intersection of lists of non-hashable items, like:
from functools import reduce  # built in to Python 2, needs importing in Python 3
from operator import and_

def intersection(*lists):
    id_row_row = {}  # id(row): row
    key_id_row = {}  # key: set of id(row)
    for key, rows in enumerate(lists):
        key_id_row[key] = set()
        for row in rows:
            id_row_row[id(row)] = row
            key_id_row[key].add(id(row))

    def intersect(sets):
        if len(sets) > 0:
            return reduce(and_, sets)
        else:
            return set()

    seq = [id_row_row[id_row] for id_row in intersect(key_id_row.values())]
    return seq