I have a function that produces a block with some data in it:
def new_block(self, proof, previous_hash=None):
    ...
    block = {
        'message': 'New Block Forged',
        'index': len(self.chain) + 1,
        'transactions': self.current_transactions,
        'proof': proof,
        'previous_hash': previous_hash or self.hash_block(self.chain[-1]),
        'timestamp': response.tx_time or time(),
    }
self.chain is the list of blocks that make up the chain. The previous_hash (the hash of the previous block) gets passed to the function, and a timestamp is created. Don't worry too much about the details of the actual data (well, something could be wrong there, but it relates more to the hash() function than to what is going on with the data).
Next I hash the block, and add it to the block:
block['hash'] = self.hash_block(block)
The hash_block function looks like this:
@staticmethod
def hash_block(block):
    block_string = json.dumps(block, sort_keys=True)  # requires `import json`
    return hash(block_string)
This function creates a completely different hash than the next block in line says it has: the hash of the last block on the chain does not match the previous_hash of the block in front of it, even though both are produced by the same function.
This line:
'previous_hash': previous_hash or self.hash_block(self.chain[-1])
and this line:
block['hash'] = self.hash_block(block)
are the important lines (along with the hash_block function). A block gets created, gets hashed, and the hash gets attached; then another block gets created, hashes the previous block, and the result doesn't match the hash that was created for that block when it was made.
Also, I started out with hashlib.sha256, and when I noticed this problem I decided to see if it was the hashing function, so I switched to the built-in hash, but I am still having the problem. Ultimately I want this to work with hashlib, but I figure that if I can get it to work with hash first, I will have solved the problem for hashlib too.
hash() is only suitable for producing hash values for mappings and hash tables. It uses a random seed to prevent attacks; it is not a cryptographic hash and should not be counted on to be stable across Python invocations.
From the hash() function documentation:
Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).
and from the __hash__ hook method, which hash() calls if present:
Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
Stick to the hashlib module options; those are stable across calls.
Apart from this: within a single Python process, hash() on objects with the same value will also produce the exact same hash. Since your block dictionary changes between blocks (as it includes the hash of the preceding block in the chain), it will naturally not be the same string and so not produce the same hash value.
The same applies to the hashlib functions; they produce the same value for the same input only. If your hash values differ, then the input differs. And your inputs naturally differ because each block dictionary includes a reference to the preceding hash.
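A quick way to see both claims at once (a minimal sketch; it assumes PYTHONHASHSEED is not pinned in your environment):
import subprocess
import sys

probe = ("import hashlib; s = 'New Block Forged'; "
         "print(hash(s), hashlib.sha256(s.encode()).hexdigest())")

# Run the same code in two fresh interpreters: the built-in hash() output
# changes between runs (random salt), while the sha256 digest does not.
for _ in range(2):
    print(subprocess.check_output([sys.executable, "-c", probe]).decode().strip())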
From what I understood from the answer by Martijn Pieters, the Python hash() function uses a random seed, while the hashlib functions are deterministic. Hence, to get the same hash every time, one can use:
import hashlib
def get_hash(string: str):
    return hashlib.sha256(string.encode("utf-8")).hexdigest()
Which can be used as:
some_hash_code = get_hash("One sentence")
print(some_hash_code)
Which outputs:
9d8c9567f9bfd8112d43c14e3e394ae97599afe39b6d7ef66cf365e342f009d4
every time.
After printing the object that would be json.dumps'd, I noticed that it was retaining the added hash property (I thought it wasn't) - check your variables.
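The mix-up is easy to reproduce. A minimal sketch (hypothetical block contents, using the deterministic hashlib approach above): once block['hash'] has been attached, re-serialising the dict includes that key too, so the recomputed digest no longer matches.
import hashlib
import json

def hash_block(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

block = {'index': 1, 'proof': 100}
block['hash'] = hash_block(block)   # digest computed without the 'hash' key

# Re-hashing now includes the 'hash' key itself, so the digests differ:
print(hash_block(block) == block['hash'])  # False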
I have two user-defined objects, say a and b.
Both these objects have the same hash values.
However, the id(a) and id(b) are unequal.
Moreover,
>>> a is b
False
>>> a == b
True
From this observation, can I infer the following?
Unequal objects may have the same hash values.
Equal objects need to have the same id values.
Whenever obj1 is obj2 is called, the id values of both objects are compared, not their hash values.
There are three concepts to grasp when trying to understand id, hash and the == and is operators: identity, value and hash value. Not all objects have all three.
All objects have an identity, though even this can be a little slippery in some cases. The id function returns a number corresponding to an object's identity (in cpython, it returns the memory address of the object, but other interpreters may return something else). If two objects (that exist at the same time) have the same identity, they're actually two references to the same object. The is operator compares items by identity, a is b is equivalent to id(a) == id(b).
Identity can get a little confusing when you deal with objects that are cached somewhere in their implementation. For instance, the objects for small integers and strings in cpython are not remade each time they're used. Instead, existing objects are returned any time they're needed. You should not rely on this in your code though, because it's an implementation detail of cpython (other interpreters may do it differently or not at all).
All objects also have a value, though this is a bit more complicated. Some objects do not have a meaningful value other than their identity (so value and identity may be synonymous, in some cases). Value can be defined as what the == operator compares, so any time a == b, you can say that a and b have the same value. Container objects (like lists) have a value that is defined by their contents, while some other kinds of objects will have values based on their attributes. Objects of different types can sometimes have the same values, as with numbers: 0 == 0.0 == 0j == decimal.Decimal("0") == fractions.Fraction(0) == False (yep, bools are numbers in Python, for historic reasons).
If a class doesn't define an __eq__ method (to implement the == operator), it will inherit the default version from object and its instances will be compared solely by their identities. This is appropriate when otherwise identical instances may have important semantic differences. For instance, two different sockets connected to the same port of the same host need to be treated differently if one is fetching an HTML webpage and the other is getting an image linked from that page, so they don't have the same value.
In addition to a value, some objects have a hash value, which means they can be used as dictionary keys (and stored in sets). The function hash(a) returns the object a's hash value, a number based on the object's value. The hash of an object must remain the same for the lifetime of the object, so it only makes sense for an object to be hashable if its value is immutable (either because it's based on the object's identity, or because it's based on contents of the object that are themselves immutable).
Multiple different objects may have the same hash value, though well designed hash functions will avoid this as much as possible. Storing objects with the same hash in a dictionary is much less efficient than storing objects with distinct hashes (each hash collision requires more work). Objects are hashable by default (since their default value is their identity, which is immutable). If you write an __eq__ method in a custom class, Python will disable this default hash implementation, since your __eq__ function will define a new meaning of value for its instances. You'll need to write a __hash__ method as well, if you want your class to still be hashable. If you inherit from a hashable class but don't want to be hashable yourself, you can set __hash__ = None in the class body.
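For example (Point is a made-up class, not from the question): in Python 3, defining __eq__ without __hash__ sets __hash__ to None, so hashability has to be restored explicitly:
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    # Defining __eq__ disabled the default identity-based hash, so restore
    # hashability with a hash that is consistent with __eq__:
    def __hash__(self):
        return hash((self.x, self.y))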
Unequal objects may have the same hash values.
Yes this is true. A simple example is hash(-1) == hash(-2) in CPython.
Equal objects need to have the same id values.
No this is false in general. A simple counterexample noted by @chepner is that 5 == 5.0 but id(5) != id(5.0).
Whenever obj1 is obj2 is called, the id values of both objects are compared, not their hash values.
Yes this is true. is compares the id of the objects for equality (in CPython it is the memory address of the object). Generally, this has nothing to do with the object's hash value (the object need not even be hashable).
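All three points can be checked directly at the interpreter (CPython output; the first line is implementation-specific):
>>> hash(-1) == hash(-2)        # unequal objects, same hash
True
>>> 5 == 5.0, id(5) == id(5.0)  # equal values, different identities
(True, False)
>>> a, b = object(), object()
>>> a is b                      # `is` compares identity, not hash
False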
The hash function is used to:
quickly compare dictionary keys during a dictionary lookup
the ID function is used to:
Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
I just did an experiment that assigned the same whole number to two separate variables, and when I used the is operator to compare them, it returned True. I didn't expect that; I thought the variables had to be the exact same object, like this:
a = 10
b = a
a is b # output == True
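What makes the experiment surprising is that two independent assignments of the same small integer behave the same way; that is CPython's small-integer cache at work (an implementation detail, as the answer above notes). A sketch:
a = 10
b = 10            # a separate assignment, not b = a
print(a is b)     # True in CPython: small ints (-5..256) are cached

x = int("1000")   # int() call used to dodge compile-time constant folding
y = int("1000")
print(x is y)     # typically False: larger ints are created fresh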
I'm working through http://www.mypythonquiz.com, and question #45 asks for the output of the following code:
confusion = {}
confusion[1] = 1
confusion['1'] = 2
confusion[1.0] = 4
sum = 0
for k in confusion:
    sum += confusion[k]
print sum
The output is 6, since the key 1.0 replaces 1. This feels a bit dangerous to me, is this ever a useful language feature?
First of all: the behaviour is documented explicitly in the docs for the hash function:
hash(object)
Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).
Secondly, a limitation of hashing is pointed out in the docs for object.__hash__
object.__hash__(self)
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value;
This is not unique to python. Java has the same caveat: if you implement hashCode then, in order for things to work correctly, you must implement it in such a way that: x.equals(y) implies x.hashCode() == y.hashCode().
So, python decided that 1.0 == 1 holds, hence it's forced to provide an implementation for hash such that hash(1.0) == hash(1). The side effect is that 1.0 and 1 act exactly in the same way as dict keys, hence the behaviour.
In other words the behaviour in itself doesn't have to be used or useful in any way. It is necessary. Without that behaviour there would be cases where you could accidentally overwrite a different key.
If we had 1.0 == 1 but hash(1.0) != hash(1) we could still have a collision. And if 1.0 and 1 collide, the dict will use equality to be sure whether they are the same key or not and kaboom the value gets overwritten even if you intended them to be different.
The only way to avoid this would be to have 1.0 != 1, so that the dict is able to distinguish between them even in case of collision. But it was deemed more important to have 1.0 == 1 than to avoid the behaviour you are seeing, since you practically never use floats and ints as dictionary keys anyway.
Since python tries to hide the distinction between numbers by automatically converting them when needed (e.g. 1/2 -> 0.5) it makes sense that this behaviour is reflected even in such circumstances. It's more consistent with the rest of python.
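You can see both the forced hash equality and the resulting key replacement directly:
>>> hash(1) == hash(1.0)
True
>>> d = {1: 'int'}
>>> d[1.0] = 'float'   # equal key with equal hash: the value is replaced
>>> d
{1: 'float'}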
This behaviour would appear in any implementation where the matching of the keys is at least partially (as in a hash map) based on comparisons.
For example, if a dict was implemented using a red-black tree or another kind of balanced BST, when the key 1.0 is looked up, the comparisons with other keys would return the same results as for 1 and so they would still act in the same way.
Hash maps require even more care because of the fact that it's the value of the hash that is used to find the entry of the key and comparisons are done only afterwards. So breaking the rule presented above means you'd introduce a bug that's quite hard to spot because at times the dict may seem to work as you'd expect it, and at other times, when the size changes, it would start to behave incorrectly.
Note that there would be a way to fix this: have a separate hash map/BST for each type inserted in the dictionary. In this way there couldn't be any collisions between objects of different type and how == compares wouldn't matter when the arguments have different types.
However, this would complicate the implementation and it would probably be inefficient, since hash maps have to keep quite a few free locations in order to have O(1) access times. If they become too full, performance decreases. Having multiple hash maps also means wasting more space, and you'd need to first choose which hash map to look at before even starting the actual lookup of the key.
If you used BSTs, you'd first have to look up the type and then perform a second lookup. So if you are going to use many types you'd end up with twice the work (and the lookup would take O(log n) instead of O(1)).
You should consider that the dict aims at storing data depending on the logical numeric value, not on how you represented it.
The difference between ints and floats is indeed just an implementation detail and not conceptual. Ideally the only number type would be an arbitrary-precision number with unbounded accuracy, even below unity... this is however hard to implement without getting into trouble... but maybe that will be the only numeric type for Python in the future.
So while having different types for technical reasons Python tries to hide these implementation details and int->float conversion is automatic.
It would be much more surprising if in a Python program if x == 1: ... wasn't going to be taken when x is a float with value 1.
Note also that in Python 3 the value of 1/2 is 0.5 (the division of two integers), and that the long and non-Unicode string types have been dropped, in the same attempt to hide implementation details.
In python:
1 == 1.0
True
This is because of implicit casting
However:
1 is 1.0
False
I can see why automatic casting between float and int is handy, and it is relatively safe to cast an int into a float, yet there are other languages (e.g. Go) that stay away from implicit casting.
It is actually a language design decision and a matter of taste, more than a difference in functionality.
Dictionaries are implemented with a hash table. To look up something in a hash table, you start at the position indicated by the hash value, then search different locations until you find a key value that's equal or an empty bucket.
If you have two key values that compare equal but have different hashes, you may get inconsistent results depending on whether the other key value was in the searched locations or not; for example, this would become more likely as the table gets full. This is something you want to avoid. It appears that the Python developers had this in mind, since the built-in hash function returns the same hash for equivalent numeric values, no matter if those values are int or float. Note that this extends to other numeric types: False is equal to 0 and True is equal to 1. Even fractions.Fraction and decimal.Decimal uphold this property.
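A quick check of that property across the numeric types mentioned:
from decimal import Decimal
from fractions import Fraction

# Equal numeric values must hash equal, across the whole numeric tower:
print(hash(1) == hash(1.0) == hash(True) == hash(Fraction(1)) == hash(Decimal(1)))  # True
print(hash(0) == hash(0.0) == hash(False))  # True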
The requirement that if a == b then hash(a) == hash(b) is documented in the definition of object.__hash__():
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
TL;DR: a dictionary would break if keys that compared equal did not map to the same value.
Frankly, the opposite is dangerous! 1 == 1.0, so it's easy to imagine that if 1 and 1.0 pointed to different keys and you tried to access them based on a computed number, you'd likely run into trouble, because the ambiguity is hard to track down.
Dynamic typing means that the value is more important than what the technical type of something is, since the type is malleable (which is a very useful feature) and so distinguishing both ints and floats of the same value as distinct is unnecessary semantics that will only lead to confusion.
I agree with others that it makes sense to treat 1 and 1.0 as the same in this context. Even if Python did treat them differently, it would probably be a bad idea to try to use 1 and 1.0 as distinct keys for a dictionary. On the other hand -- I have trouble thinking of a natural use-case for using 1.0 as an alias for 1 in the context of keys. The problem is that either the key is literal or it is computed. If it is a literal key then why not just use 1 rather than 1.0? If it is a computed key -- round off error could muck things up:
>>> d = {}
>>> d[1] = 5
>>> d[1.0]
5
>>> x = sum(0.01 for i in range(100)) #conceptually this is 1.0
>>> d[x]
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
d[x]
KeyError: 1.0000000000000007
So I would say that, generally speaking, the answer to your question "is this ever a useful language feature?" is "No, probably not."
I am more familiar with the "Java way" of building complex / combined hash codes from superclasses in subclasses. Is there a better / different / preferred way in Python 3? (I cannot find anything specific to Python3 on this matter via Google.)
class Superclass:
    def __init__(self, data):
        self.__data = data

    def __hash__(self):
        return hash(self.__data)

class Subclass(Superclass):
    def __init__(self, data, more_data):
        super().__init__(data)
        self.__more_data = more_data

    def __hash__(self):
        # Just a guess...
        return hash(super()) + 31 * hash(self.__more_data)
To simplify this question, please assume self.__data and self.__more_data are simple, hashable data, such as str or int.
The easiest way to produce good hashes is to put your values in a standard hashable Python container, then hash that. This includes combining hashes in subclasses. I'll explain why, and then how.
Base requirements
First things first:
If two objects test as equal, then they MUST have the same hash value
Objects that have a hash MUST produce the same hash over time.
Only when you follow those two rules can your objects safely be used in dictionaries and sets. The hash not changing is what keeps dictionaries and sets from breaking, as they use the hash to pick a storage location, and won't be able to locate the object again given another object that tests equal if the hash changed.
Note that it doesn’t even matter if the two objects are of different types; True == 1 == 1.0 so all have the same hash and will all count as the same key in a dictionary.
What makes a good hash value
You'd want to combine the components of your object value in ways that will produce, as much as possible, different hashes for different values. That includes things like ordering and specific meaning, so that two attributes that represent different aspects of your value, but that can hold the same type of Python objects, still result in different hashes, most of the time.
Note that it's fine if two objects that represent different values (won't test equal) have equal hashes. Reusing a hash value won't break sets or dictionaries. However, if a lot of different object values produce equal hashes then that reduces their efficiency, as you increase the likelihood of collisions. Collisions require collision resolution and collision resolution takes more time, so much so that you can use denial of service attacks on servers with predictable hashing implementations (*).
So you want a nice wide spread of possible hash values.
Pitfalls to watch out for
The documentation for the object.__hash__ method includes some advice on how to combine values:
The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
but only using XOR will not produce good hash values, not when the values whose hashes you XOR together can be of the same type but have different meaning depending on the attribute they've been assigned to. To illustrate with an example:
>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash(self.a) ^ hash(self.b)
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
True
Because the hashes for self.a and self.b were just XOR-ed together, we got the same hash value for either order, effectively halving the number of usable hashes. Do so with more attributes and you cut the number of unique hashes down rapidly. So you may want to include a bit more information in the hash about each attribute, if the same values can be used in different elements that make up the hash.
Next, know that while Python integers are unbounded, hash values are not. That is to say, hash values have a finite range. From the same documentation:
Note: hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.
This means that if you used addition or multiplication or other operations that increase the number of bits needed to store the hash value, you will end up losing the upper bits and so reduce the number of different hash values again.
Next, if you combine multiple hashes with XOR that already have a limited range, chances are you end up with an even smaller number of possible hashes. Try XOR-ing the hashes of 1000 random integers in the range 0-10, for an extreme example.
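One way to run that suggested experiment (in CPython, where hash(n) == n for small non-negative integers):
import random

# XOR 1000 random small-int hashes together; since each hash fits in 4
# bits, the combined result can only ever be one of 16 values (0..15):
combined = 0
for _ in range(1000):
    combined ^= hash(random.randint(0, 10))
print(combined)  # always lands in the tiny range 0..15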
Hashing, the easy way
Python developers have long since wrestled with the above pitfalls, and solved it for the standard library types. Use this to your advantage. Put your values in a tuple, then hash that tuple.
Python tuples use a simplified version of the xxHash algorithm to capture order information and to ensure a broad range of hash values. So for different attributes, you can capture the different meanings by giving them different positions in a tuple, then hashing the tuple:
def __hash__(self):
    return hash((self.a, self.b))
This ensures you get distinct hash values for distinct orderings.
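Repeating the earlier Foo experiment with the tuple-based hash shows the two orderings now hash differently:
>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash((self.a, self.b))
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
False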
If you are subclassing something, put the hash of the parent implementation into one of the tuple positions:
def __hash__(self):
    return hash((super().__hash__(), self.__more_data))
Hashing a hash value does reduce it to a 60-bit or 30-bit value (on 64-bit or 32-bit platforms, respectively), but that's not a big problem when combined with other values in a tuple. If you are really concerned about this, put None in the tuple as a placeholder and XOR the parent hash (so super().__hash__() ^ hash((None, self.__more_data))). But this is overkill, really.
If you have multiple values whose relative order doesn't matter, and don't want to XOR these all together one by one, consider using a frozenset() object for fast processing, combined with a collections.Counter() object if values are not meant to be unique. The frozenset() hash operation accounts for small hash ranges by reshuffling the bits in hashes first:
# unordered collection hashing
from collections import Counter
hash(frozenset(Counter(...).items()))
As always, all values in the tuple or frozenset() must be hashable themselves.
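For example (tags is a hypothetical multiset, not part of the original answer):
from collections import Counter

# Hash a collection of values where order is irrelevant but duplicates count:
tags = ['red', 'blue', 'red']
h = hash(frozenset(Counter(tags).items()))

# The same values in any order produce the same hash:
assert h == hash(frozenset(Counter(['blue', 'red', 'red']).items()))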
Consider using dataclasses
For most objects you write __hash__ functions for, you actually want to be using a dataclass-generated class:
from dataclasses import dataclass
from typing import Union
@dataclass(frozen=True)
class Foo:
    a: Union[int, str]
    b: Union[int, str]
Dataclasses are given a sane __hash__ implementation when frozen=True or unsafe_hash=True, using a tuple() of all the field values.
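With the frozen Foo dataclass above, equality and hashing line up as required:
>>> f1 = Foo(1, 'bar')
>>> f2 = Foo(1, 'bar')
>>> f1 == f2, hash(f1) == hash(f2)
(True, True)
>>> {f1, f2}   # equal and hash-equal, so the set holds a single element
{Foo(a=1, b='bar')}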
(*) Python protects your code against such hash collision attacks by using a process-wide random hash seed to hash strings, bytes and datetime objects.
The python documentation suggests that you use xor to combine hashes:
The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
I'd also recommend xor over addition and multiplication because of this:
Note
hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds. If an object’s __hash__() must interoperate on builds of different bit sizes, be sure to check the width on all supported builds. An easy way to do this is with python -c "import sys; print(sys.hash_info.width)"
This documentation is the same for python 2.7 and python 3.4, by the way.
A note on symmetry and xoring items with themselves.
As pointed out in the comments, xor is symmetric, so the order of operations disappears. The XOR of two of the same element is also zero. So if that's not desired, mix in some rotations or shifts, or, even better, use this solution's suggestion of taking the hash of a tuple of the identifying elements. If you don't want to preserve order, consider using a frozenset.
Instead of combining multiple strings together, use tuples, as they are hashable in Python.
from typing import Tuple

t: Tuple[str, str, int] = ('Field1', 'Field2', 33)
print(t.__hash__())
This will make the code also easier to read.
For anyone reading this, XORing hashes is a bad idea because it is possible for a particular sequence of duplicate hash values to XOR together and effectively remove an element from the hash set.
For instance:
(hash('asd') ^ hash('asd') ^ hash('derp')) == hash('derp')
and even:
(hash('asd') ^ hash('derp') ^ hash('asd')) == hash('derp')
So if you're using this technique to figure out if a certain set of values is in the combined hash, where you could potentially have duplicate values added to the hash, using XOR can result in that value's removal from the set. Instead you should consider OR, which has the same property of avoiding unbounded integer growth that the previous poster mentioned, but ensures duplicates are not removed.
(hash('asd') | hash('asd') | hash('derp')) != hash('derp')
If you're looking to explore this more, you should look up Bloom filters.
What does this mean?
The only types of values not acceptable as dictionary keys are values containing lists or dictionaries or other mutable types that are compared by value rather than by object identity, the reason being that the efficient implementation of dictionaries requires a key’s hash value to remain constant.
I think even for tuples, comparison will happen by value.
The problem with a mutable object as a key is that when we use a dictionary, we rarely want to check identity. For example, when we use a dictionary like this:
a = "bob"
test = {a: 30}
print(test["bob"])
We expect it to work - the second string "bob" may not be the same object as a, but it has the same value, which is what we care about. This works because any two strings that compare equal will have the same hash, meaning that the dict (implemented as a hashmap) can find those strings very efficiently.
The issue comes into play when we have a list as a key, imagine this case:
a = ["bob"]
test = {a: 30}
print(test[["bob"]])
We can't do this any more - the comparison won't work as the hash of a list is not based on it's value, but rather the instance of the list (aka (id(a) != id(["bob"))).
Python has the choice of making the list's hash change (undermining the efficiency of a hashmap) or simply comparing on identity (which is useless in most cases). Python disallows these specific mutable keys to avoid subtle but common bugs where people expect keys to be matched by value rather than by identity.
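A minimal sketch of the resulting behaviour, and the usual workaround of converting the key to a tuple:
a = ['bob']
test = {}
try:
    test[a] = 30               # lists are unhashable
except TypeError as e:
    print(e)                   # unhashable type: 'list'

test[tuple(a)] = 30            # an immutable tuple works fine
print(test[('bob',)])          # 30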
The documentation mixes together two different things: mutability, and value-comparable. Let's separate them out.
Immutable objects that compare by identity are fine. The identity can never change, for any object.
Immutable objects that compare by value are fine. The value can never change for an immutable object. This includes tuples.
Mutable objects that compare by identity are fine. The identity can never change, for any object.
Mutable objects that compare by value are not acceptable. The value can change for a mutable object, which would make the dictionary invalid.
Meanwhile, your wording isn't quite the same as Mapping Types (4.10 in Python 3.3 or 5.8 in Python 2.7), both of which say:
A dictionary’s keys are almost arbitrary values. Values that are not hashable, that is, values containing lists, dictionaries or other mutable types (that are compared by value rather than by object identity) may not be used as keys.
Anyway, the key point here is that the rule is "not hashable"; "mutable types (that are compared by value rather than by object identity)" is just to explain things a little further. It isn't strictly true that comparing by object identity and hashing by object identity are always the same (the only thing that's required is that if id is equal, the hash is equal).
The part about "efficient implementation of dictionaries" from the version you posted just adds to the confusion (which is probably why it's not in the reference documentation). Even if someone came up with an efficient way to deal with storing lists as dict keys tomorrow, the language doesn't allow it.
A hash is a way of calculating a unique code for an object; within a single Python process, this code is always the same for objects with the same value. hash('test'), for example, might be 2314058222102390712 (string hashes are salted per process, as noted above), and then a = 'test'; hash(a) gives the same 2314058222102390712.
Internally, a dictionary value is looked up by the key's hash, not by the variable you specify. A list is mutable; a hash for a list, if it were defined, would change whenever the list changes. Therefore Python's design does not hash lists, and lists cannot be used as dictionary keys.
Tuples are immutable, therefore tuples do have hashes, e.g. hash((1, 2)) = 3713081631934410656. To check whether a tuple a is equal to the tuple (1, 2), the dict can compare hashes first and only fall back to a full value comparison when the hashes match. This is more efficient, as a single integer comparison usually settles the question.