Reading How to implement a good __hash__ function in Python - can I not write __eq__ as
def __eq__(self, other):
    return isinstance(other, self.__class__) and hash(other) == hash(self)

def __ne__(self, other):
    return not self.__eq__(other)

def __hash__(self):
    return hash((self.firstfield, self.secondfield, totuple(self.thirdfield)))
? Of course I am going to implement __hash__(self) as well. My class has rather clearly defined members; I am going to turn them all into tuples, combine those into one overall tuple, and hash that.
Generally speaking, a hash function will have collisions. If you define equality in terms of a hash, you're running the risk of entirely dissimilar items comparing as equal, simply because they ended up with the same hash code. The only way to avoid this would be if your class only had a small, fixed number of possible instances, and you somehow ensured that each one of those had a distinct hash. If your class is simple enough for that to be practical, then it is almost certainly simple enough for you to just compare instance variables to determine equality directly. Your __hash__() implementation would have to examine all the instance variables anyway, in order to calculate a meaningful hash.
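For comparison, here is the usual pattern sketched out: compare the attributes directly in __eq__ and hash the same attributes in __hash__. This is a minimal sketch, not the asker's actual class; the attribute names come from the question, and tuple(self.thirdfield) stands in for the asker's totuple() helper.

# Minimal sketch, assuming the attribute names from the question and that
# thirdfield is something iterable (e.g. a list) that tuple() can convert.
class MyClass:
    def __init__(self, firstfield, secondfield, thirdfield):
        self.firstfield = firstfield
        self.secondfield = secondfield
        self.thirdfield = thirdfield

    def __eq__(self, other):
        if not isinstance(other, MyClass):
            return NotImplemented
        # Compare the attributes themselves rather than their hashes.
        return (self.firstfield == other.firstfield
                and self.secondfield == other.secondfield
                and self.thirdfield == other.thirdfield)

    def __hash__(self):
        # Hash the same attributes that define equality.
        return hash((self.firstfield, self.secondfield, tuple(self.thirdfield)))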
Of course you can define __eq__ and __ne__ the way you have in your example, but unless you also explicitly define __hash__, you will get a TypeError: unhashable type exception any time you try to compare two objects of that type for equality: your __eq__ calls hash() on the instances, and (in Python 3) defining __eq__ without __hash__ sets __hash__ to None.
Since you're giving your object some semblance of a value by defining an __eq__ method, the more important question is: what do you consider the value of your object to be? Looking at your code, your answer is "the value of this object is its hash". Without knowing the contents of your __hash__ method, it's impossible to evaluate the quality or validity of your __eq__ method.
Related
I noticed that when I use user-defined objects (that override the __hash__ method) as keys in my dicts in Python, lookup time increases by at least a factor of 5.
This behaviour is observed even when I use very basic hash methods such as in the following example:
class A:
    def __init__(self, a):
        self.a = a

    def __hash__(self):
        return hash(self.a)

    def __eq__(self, other):
        if not isinstance(other, A):
            return NotImplemented
        return (self.a == other.a and
                self.__class__ == other.__class__)

# get an instance of class A
mya = A(42)

# define a dict
d1 = {mya: [1, 2], 'foo': [3, 4]}
If I time access through the two different keys, I observe a significant difference in performance:
%timeit d1['foo']
results in ~ 100 ns. Whereas
%timeit d1[mya]
results in ~ 600 ns.
If I remove the overriding of the __hash__ and __eq__ methods, performance is back at the same level as for a default object.
Is there a way to avoid this loss in performance and still implement a customised hash calculation ?
The default CPython __hash__ implementation for a custom class is written in C and is based on the memory address of the object. Therefore, it does not have to access anything inside the object at all and can be computed very quickly; it amounts to little more than a single integer operation.
The "very basic" __hash__ from the example is not as simple as it may seem:
def __hash__(self):
    return hash(self.a)
This has to read the attribute a of self, which in this case presumably means calling object.__getattribute__(self, 'a'), which in turn looks up 'a' in the instance's __dict__. That lookup already involves computing hash('a') and probing the dict. Only then is the retrieved value passed to hash().
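As a rough illustration (my own sketch of the steps described above, not what CPython literally executes), hash(self.a) for the class A from the question boils down to something like:

# Rough illustration only, using the class A defined in the question.
obj = A(42)
value = object.__getattribute__(obj, 'a')  # looks 'a' up in obj.__dict__,
                                           # which itself hashes the string 'a'
result = hash(value)                       # only then is the attribute value hashed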
To answer the additional question:
Is there a way to implement a faster __hash__ method that returns predictable values, I mean values that are not randomly computed at each run, as in the case of the memory address of the object?
Anything that accesses attributes of the object will be slower than the default implementation, which does not need to access attributes at all. You could make attribute access faster by using __slots__, or by implementing a highly optimized C extension for the class.
There is, however, another question: is this really a problem? I find it hard to believe that an application becomes slow because of a slow __hash__. __hash__ should still be pretty fast unless the dictionary has trillions of entries, but then everything else would become slow as well and call for bigger changes...
I did some testing and have to make a correction. Using __slots__ is not going to help in this case at all. My tests actually showed that in CPython 3.7 the above class becomes slightly slower when using __slots__.
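For reference, this is the kind of __slots__ variant being referred to (my reconstruction, not the original benchmark code). __slots__ removes the per-instance __dict__ and stores attributes in fixed slots, though, as noted above, it did not speed up hashing in this test:

# Sketch of a __slots__ variant of class A; my reconstruction of what was
# presumably benchmarked, not the original test code.
class ASlots:
    __slots__ = ('a',)  # no per-instance __dict__; 'a' lives in a fixed slot

    def __init__(self, a):
        self.a = a

    def __hash__(self):
        return hash(self.a)

    def __eq__(self, other):
        if not isinstance(other, ASlots):
            return NotImplemented
        return self.a == other.a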
I would like to write a class that can be used as a key in hashed collections (e.g. as a dict key). I know that user classes are hashable by default, but using id(self) would be the wrong thing here.
My class holds a tuple as member variable. Deriving from tuple doesn't seem like an option because in my constructor I don't get the same kind of arguments as a tuple constructor. But perhaps that's not a limitation?
What I need is basically the hash of a tuple the way a real tuple would give it.
hash(self.member_tuple) does just that.
The idea here is that two tuples can be equal without their id being equal.
If I implement my __cmp__() as follows:
def __cmp__(self, other):
    return cmp(self, other)
will this automatically resort to hash(self) for the comparison? ... or should I implement it as follows:
def __cmp__(self, other):
    return cmp(self.member_tuple, other)
My __hash__() function is implemented to return the hash of the held tuple, i.e.:
def __hash__(self):
    return hash(self.member_tuple)
Basically, how do __cmp__() and __hash__() interact? I don't know whether in __cmp__() the other will already be a hash or not and whether I should compare against "my" hash (which would be the one of the held tuple) or against self.
So which one is the right one?
Can anyone shed any light on this and possibly point me to documentation?
I'd not use __cmp__ and stick to using __eq__ instead. For hashing that is enough and you don't want to extend to being sortable here. Moreover, __cmp__ has been removed from Python 3 in favour of the rich comparison methods (__eq__, __lt__, __gt__, etc.).
Next, your __eq__ should return True when the member tuples are equal:
def __eq__(self, other):
    if not isinstance(other, ThisClass):
        return NotImplemented
    return self.member_tuple == other.member_tuple
Returning the NotImplemented singleton when the other object is not of the same type is good practice, because it delegates the equality test to the other object; if that object doesn't implement __eq__, or also returns NotImplemented, Python falls back to the default identity-based comparison.
Your __hash__ implementation is spot-on.
Because a hash is not meant to be unique (it is just a means to pick a slot in the hash table), equality is then used to determine if a matching key is already present or if a hash collision has taken place. As such __eq__ (or __cmp__ if __eq__ is missing) is not called if the slot to which the object is being hashed is empty.
This does mean that if two objects are considered equal (a.__eq__(b) returns True), then their hash values must be equal too. Otherwise you could end up with a corrupted dictionary, as Python will no longer be able to determine if a key is already present in the hash table.
If both your __eq__ and __hash__ methods are delegating their duties to the self.member_tuple attribute, you are maintaining that property; you can trust the basic tuple type to have implemented this correctly.
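Putting the two pieces together, a minimal sketch (assuming, as above, that the class is called ThisClass and its state lives in the member_tuple attribute):

# Minimal sketch: both __eq__ and __hash__ delegate to member_tuple.
class ThisClass:
    def __init__(self, *values):
        self.member_tuple = tuple(values)

    def __eq__(self, other):
        if not isinstance(other, ThisClass):
            return NotImplemented
        return self.member_tuple == other.member_tuple

    def __hash__(self):
        # Delegating to the tuple keeps __eq__ and __hash__ consistent.
        return hash(self.member_tuple)

# Two distinct instances with equal member tuples collapse to one dict key:
d = {ThisClass(1, 2): 'first'}
d[ThisClass(1, 2)] = 'second'
assert len(d) == 1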
See the glossary definition of hashable and the object.__hash__() documentation. If you are curious, I've written about how the dict and set types work internally:
Why is the order in dictionaries and sets arbitrary?
Overriding Python's Hashing Function in Dictionary
It seems a common and quick way to create a stock __hash__() for any given Python object is to return hash(str(self)), if that object implements __str__(). Is this efficient, though? Per this SO answer, a hash of a tuple of the object's attributes is "good", but it doesn't seem to indicate whether that is the most efficient option for Python. Or would it be better to implement a __hash__() for each object and use a real hashing algorithm from this page, mixing the values of the individual attributes into the final value returned by __hash__()?
Pretend I've implemented the Jenkins hash routines from this SO question. Which __hash__() would be better to use?:
# hash str(self)
def __hash__(self):
    return hash(str(self))

# hash of tuple of attributes
def __hash__(self):
    return hash((self.attr1, self.attr2, self.attr3,
                 self.attr4, self.attr5, self.attr6))

# jenkins hash
def __hash__(self):
    from jenkins import mix, final
    a = self.attr1
    b = self.attr2
    c = self.attr3
    a, b, c = mix(a, b, c)
    a += self.attr4
    b += self.attr5
    c += self.attr6
    a, b, c = final(a, b, c)
    return c
Assume the attrs in the sample object are all integers, for simplicity. Also assume that all objects derive from a base class and that each object implements its own __str__(). The trade-off in using the first hash is that I could implement it once in the base class and not add extra code to each derived object. But if the second or third __hash__() implementation is better in some way, does that offset the cost of the added code in each derived class (because each may have different attributes)?
Edit: the import in the third __hash__() implementation is there only because I didn't want to draft out an entire example module + objects. Assume that import really happens at the top of the module, not on each invocation of the function.
Conclusion: Per the answer and comments on this closed SO question, it looks like I really want the tuple hash implementation, not for speed or efficiency, but because of the underlying duality of __hash__ and __eq__. Since a hash value is going to have a limited range of some form (be it 32 or 64 bits, for example), in the event you do have a hash collision, object equality is then checked. So since I do implement __eq__() for each object by using tuple comparison of self/other's attributes, I also want to implement __hash__() using an attribute tuple so that I respect the hash/equality nature of things.
Your third one has an important performance pessimization: it imports two names every time the function is called. Of course, how it performs relative to the string-hash version depends on how the string is generated.
That said, when you have attributes that define equality for the object, and those attributes are themselves hashable types, the simplest (and almost certainly best-performing) approach is going to be to hash a tuple containing those attribute values.
def __hash__(self):
    return hash((self.attr1, self.attr2, self.attr3))
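Since the question's conclusion stresses that __hash__ and __eq__ must agree, here is a sketch of the matching __eq__ (attr1..attr3 are just the illustrative names used above, not from a real class):

# Sketch of the __eq__ that pairs with the tuple-based __hash__ above.
def __eq__(self, other):
    if not isinstance(other, type(self)):
        return NotImplemented
    # Compare exactly the attributes that are hashed.
    return (self.attr1, self.attr2, self.attr3) == \
           (other.attr1, other.attr2, other.attr3)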
Let's say I have a referentially transparent function. It is very easy to memoize it; for example:
import functools

kwd_mark = object()  # sentinel separating positional from keyword arguments

def memoize(obj):
    memo = {}

    @functools.wraps(obj)
    def memoizer(*args, **kwargs):
        combined_args = args + (kwd_mark,) + tuple(sorted(kwargs.items()))
        if combined_args not in memo:
            memo[combined_args] = obj(*args, **kwargs)
        return memo[combined_args]

    return memoizer

@memoize
def my_function(data, alpha, beta):
    ...  # expensive, referentially transparent computation
Now suppose that the data argument to my_function is huge; say, it's a frozenset with millions of elements. In this case, the cost of memoization is prohibitive: every time, we'd have to calculate hash(data) as part of the dictionary lookup.
I could make the memo dictionary an attribute of data instead of an object inside the memoize decorator. That way I could skip the data argument entirely during the cache lookup, since the chance that another huge frozenset will be equal to it is negligible. However, this approach ends up polluting an argument passed to my_function. Worse, if I have two or more large arguments, it won't help at all (I can only attach the memo to one argument).
Is there anything else that can be done?
Well, you can use "hash" there with no fears. A frozenset's hash is not calculated more than once by Python - just when it is created - check the timings:
>>> timeit("frozenset(a)", "a=range(100)")
3.26825213432312
>>> timeit("hash(a)", "a=frozenset(range(100))")
0.08160710334777832
>>> timeit("(lambda x:x)(a)", "a=hash(frozenset(range(100)))")
0.1994171142578125
Don't forget Python's "hash" builtin calls the object's __hash__ method, which has its return value defined at creation time for built-in hasheable objects. Above you can see that calling a identity lambda function is more than twice slower than calling "hash (a)"
So, if all your arguments are hashable, just add their hashes when building combined_args; otherwise, build combined_args so that hash() is applied conditionally, only to frozenset (and perhaps other large) arguments.
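As a sketch of that suggestion (my variation on the memoize decorator above, not code from the answer), the cache key can be built from the arguments' hashes; note that this makes two different arguments with colliding hashes share a cache entry, trading strict correctness for speed:

import functools

# Sketch: key the memo on the arguments' hashes instead of the arguments
# themselves, so collision checks compare small integers, not huge frozensets.
def memoize_by_hash(obj):
    memo = {}

    @functools.wraps(obj)
    def memoizer(*args, **kwargs):
        key = (tuple(hash(a) for a in args),
               tuple(sorted((k, hash(v)) for k, v in kwargs.items())))
        if key not in memo:
            memo[key] = obj(*args, **kwargs)
        return memo[key]

    return memoizer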
It turns out that the built-in __hash__ is not that bad, since it caches its own value after the first calculation. The real performance hit comes from the built-in __eq__, since it doesn't short-circuit on identical objects, and actually goes through the full comparison every time, making it very costly.
One approach I thought of is to subclass the built-in class for all large arguments:
class MyFrozenSet(frozenset):
    __eq__ = lambda self, other: id(self) == id(other)
    __hash__ = lambda self: id(self)
This way, dictionary lookup would be instantaneous. But the equality for the new class will be broken.
A better solution is probably this: only when the dictionary lookup is performed, the large arguments can be wrapped inside a special class that redefines __eq__ and __hash__ to return the wrapped object's id(). The obvious implementation of the wrapper is a bit annoying, since it requires copying all the standard frozenset methods. Perhaps deriving it from the relevant ABC class would make it easier.
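A minimal sketch of such a wrapper (the name IdKey is mine), used only while building the cache key so that none of the frozenset methods need to be copied:

# Sketch: wrap a large argument only for use inside the cache key.
class IdKey:
    __slots__ = ('obj',)

    def __init__(self, obj):
        self.obj = obj

    def __hash__(self):
        return id(self.obj)  # constant time, independent of the object's size

    def __eq__(self, other):
        return isinstance(other, IdKey) and self.obj is other.obj

# Usage inside a memoizer: key = (IdKey(data), alpha, beta)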
Consider this snippet:
class SomeClass(object):
    def __init__(self, someattribute="somevalue"):
        self.someattribute = someattribute

    def __eq__(self, other):
        return self.someattribute == other.someattribute

    def __ne__(self, other):
        return not self.__eq__(other)

list_of_objects = [SomeClass()]
print(SomeClass() in list_of_objects)

set_of_objects = set([SomeClass()])
print(SomeClass() in set_of_objects)
which evaluates to:
True
False
Can anyone explain why the 'in' keyword has a different meaning for sets and lists?
I would have expected both to return True, especially when the type being tested has equality methods defined.
The meaning is the same, but the implementation is different. Lists simply examine each object, checking for equality, so it works for your class. Sets first hash the objects, and if they don't implement hash properly, the set appears not to work.
Your class defines __eq__ but doesn't define __hash__, and so won't work properly in sets or as a dictionary key. The rule for __eq__ and __hash__ is that two objects that compare equal must also have equal hashes. By default, objects hash based on their memory address, so your two objects, which are equal by your definition, don't produce the same hash and therefore break that rule. (In Python 3, defining __eq__ without __hash__ additionally sets __hash__ to None, so the set example would raise a TypeError instead of printing False.)
If you provide a __hash__ implementation, it will work fine. For your sample code, it could be:
def __hash__(self):
    return hash(self.someattribute)
In pretty much any hashtable implementation, including Python's, if you override the equality method you must also override the hashing method (in Python, __hash__). The in operator for lists just checks equality against every element of the list, while the in operator for sets first hashes the object you are looking for, checks for an object in that slot of the hashtable, and then checks for equality if there is anything in the slot. So, if you override __eq__ without overriding __hash__, you cannot be guaranteed that the in operator for sets will check the right slot.
Define a __hash__() method that corresponds to your __eq__() method.
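A minimal sketch (not part of the original answer) applying that to the SomeClass from this question:

# SomeClass from the question with a matching __hash__ added.
class SomeClass(object):
    def __init__(self, someattribute="somevalue"):
        self.someattribute = someattribute

    def __eq__(self, other):
        return isinstance(other, SomeClass) and \
               self.someattribute == other.someattribute

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Hash the same attribute that defines equality.
        return hash(self.someattribute)

# Now membership tests behave the same for lists and sets:
print(SomeClass() in [SomeClass()])   # True
print(SomeClass() in {SomeClass()})   # True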