I would like to write a class that can be used as a key in hashed collections (e.g. as a dict key). I know that user-defined classes are hashable by default, but using id(self) would be the wrong thing here.
My class holds a tuple as member variable. Deriving from tuple doesn't seem like an option because in my constructor I don't get the same kind of arguments as a tuple constructor. But perhaps that's not a limitation?
What I need is basically the hash of a tuple the way a real tuple would give it.
hash(self.member_tuple) does just that.
The idea here is that two tuples can be equal without their id being equal.
If I implement my __cmp__() as follows:
def __cmp__(self, other):
    return cmp(self, other)
will this automatically resort to hash(self) for the comparison? ... or should I implement it as follows:
def __cmp__(self, other):
    return cmp(self.member_tuple, other)
My __hash__() function is implemented to return the hash of the held tuple, i.e.:
def __hash__(self):
    return hash(self.member_tuple)
Basically, how do __cmp__() and __hash__() interact? I don't know whether in __cmp__() the other will already be a hash or not and whether I should compare against "my" hash (which would be the one of the held tuple) or against self.
So which one is the right one?
Can anyone shed any light on this and possibly point me to documentation?
I'd not use __cmp__ and stick to using __eq__ instead. For hashing that is enough and you don't want to extend to being sortable here. Moreover, __cmp__ has been removed from Python 3 in favour of the rich comparison methods (__eq__, __lt__, __gt__, etc.).
Next, your __eq__ should return True when the member tuples are equal:
def __eq__(self, other):
    if not isinstance(other, ThisClass):
        return NotImplemented
    return self.member_tuple == other.member_tuple
Returning the NotImplemented singleton when the other object is not of the same type is good practice, because that delegates the equality test to the other object; if it doesn't implement __eq__, or also returns NotImplemented, Python falls back to the standard identity (id()) test.
Your __hash__ implementation is spot-on.
Because a hash is not meant to be unique (it is just a means to pick a slot in the hash table), equality is then used to determine if a matching key is already present or if a hash collision has taken place. As such __eq__ (or __cmp__ if __eq__ is missing) is not called if the slot to which the object is being hashed is empty.
This does mean that if two objects are considered equal (a.__eq__(b) returns True), then their hash values must be equal too. Otherwise you could end up with a corrupted dictionary, as Python will no longer be able to determine if a key is already present in the hash table.
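Here is a minimal sketch of that failure mode (the class name and the counter-based hash are purely illustrative): instances compare equal but hash differently, so a dict lookup misses a key that is "in" the dict by equality.

```python
import itertools

_counter = itertools.count()

class BadKey:
    def __init__(self, value):
        self.value = value
        self._h = next(_counter)  # every instance gets a distinct hash

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

    def __hash__(self):
        # Violates the contract: equal objects get unequal hashes.
        return self._h

d = {BadKey(1): "x"}
print(BadKey(1) == BadKey(1))  # True: the keys compare equal
print(BadKey(1) in d)          # False: the lookup probes the wrong slot
```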
If both your __eq__ and __hash__ methods are delegating their duties to the self.member_tuple attribute, you are maintaining that property; you can trust the basic tuple type to have implemented this correctly.
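Putting it together, here is a minimal sketch of such a class (the name Point and its two fields are illustrative), delegating both equality and hashing to the member tuple:

```python
class Point:
    def __init__(self, x, y):
        self.member_tuple = (x, y)

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return self.member_tuple == other.member_tuple

    def __hash__(self):
        # Delegate to the tuple; equal tuples guarantee equal hashes.
        return hash(self.member_tuple)

d = {Point(1, 2): "a"}
print(Point(1, 2) in d)  # True: a distinct but equal instance is found
```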
See the glossary definition of hashable and the object.__hash__() documentation. If you are curious, I've written about how the dict and set types work internally:
Why is the order in dictionaries and sets arbitrary?
Overriding Python's Hashing Function in Dictionary
Related
I had a strange bug when porting a feature to the Python 3.1 fork of my program. I narrowed it down to the following hypothesis:
In contrast to Python 2.x, in Python 3.x if an object has an __eq__ method it is automatically unhashable.
Is this true?
Here's what happens in Python 3.1:
>>> class O(object):
...     def __eq__(self, other):
...         return 'whatever'
...
>>> o = O()
>>> d = {o: 0}
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    d = {o: 0}
TypeError: unhashable type: 'O'
The follow-up question is: how do I solve my personal problem? I have an object ChangeTracker which stores a WeakKeyDictionary that points to several objects, giving for each the value of its pickle dump at a certain point in the past. Whenever an existing object is checked in, the change tracker says whether its new pickle is identical to its old one, thereby saying whether the object has changed in the meantime. The problem is, now I can't even check whether the given object is in the library, because that raises an exception about the object being unhashable (because it has an __eq__ method). How can I work around this?
Yes, if you define __eq__, the default __hash__ (namely, hashing the address of the object in memory) goes away. This is important because hashing needs to be consistent with equality: equal objects need to hash the same.
The solution is simple: just define __hash__ along with defining __eq__.
This paragraph from http://docs.python.org/3.1/reference/datamodel.html#object.hash explains it:
If a class that overrides __eq__() needs to retain the implementation of __hash__() from a parent class, the interpreter must be told this explicitly by setting __hash__ = <ParentClass>.__hash__. Otherwise the inheritance of __hash__() will be blocked, just as if __hash__ had been explicitly set to None.
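For the WeakKeyDictionary use case, where objects should be tracked by identity, a hedged sketch (class and attribute names are illustrative) is to restore the inherited identity hash explicitly:

```python
class Tracked:
    def __init__(self, payload):
        self.payload = payload

    def __eq__(self, other):
        if not isinstance(other, Tracked):
            return NotImplemented
        return self.payload == other.payload

    # Defining __eq__ sets __hash__ to None in Python 3; this restores
    # the parent's identity-based hash so instances stay usable as
    # (weak) dict keys. Caveat: distinct-but-equal instances will hash
    # differently, so only identity-keyed lookups are reliable.
    __hash__ = object.__hash__

t = Tracked(b"pickle-bytes")
d = {t: "old pickle"}
print(t in d)  # True: the same instance is found by identity
```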
Check the Python 3 manual on object.__hash__:
If a class does not define an __eq__() method it should not define a __hash__() operation either; if it defines __eq__() but not __hash__(), its instances will not be usable as items in hashable collections.
Emphasis is mine.
If you want to be lazy, it sounds like you can just define __hash__(self) to return id(self):
User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns id(x).
I'm no Python expert, but wouldn't it make sense that when you define an __eq__ method, you also have to define a __hash__ method (which calculates the hash value for an object)? Otherwise, the hashing mechanism wouldn't know whether it hit the same object or just a different object with the same hash value. Actually, it's the other way around: it would probably end up computing different hash values for objects your __eq__ method considers equal.
I have no idea what that hash function is called though, __hash__ perhaps? :)
Reading How to implement a good __hash__ function in python, can I not write __eq__ as
def __eq__(self, other):
    return isinstance(other, self.__class__) and hash(other) == hash(self)

def __ne__(self, other):
    return not self.__eq__(other)

def __hash__(self):
    return hash((self.firstfield, self.secondfield, totuple(self.thirdfield)))
? Of course I am going to implement __hash__(self) as well. I have rather clearly defined class members. I am going to turn them all into tuples and make a total tuple out of those and hash that.
Generally speaking, a hash function will have collisions. If you define equality in terms of a hash, you're running the risk of entirely dissimilar items comparing as equal, simply because they ended up with the same hash code. The only way to avoid this would be if your class only had a small, fixed number of possible instances, and you somehow ensured that each one of those had a distinct hash. If your class is simple enough for that to be practical, then it is almost certainly simple enough for you to just compare instance variables to determine equality directly. Your __hash__() implementation would have to examine all the instance variables anyway, in order to calculate a meaningful hash.
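A concrete illustration of the collision risk: in CPython, hash(-1) equals hash(-2) (both are -2, since -1 is reserved as an error code at the C level), so a hash-based __eq__ declares unequal values equal. The class below is a hypothetical sketch of the pattern being questioned:

```python
class ByHash:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        # Equality defined via hashes -- the pattern under discussion.
        return isinstance(other, ByHash) and hash(self) == hash(other)

a, b = ByHash(-1), ByHash(-2)
print(a == b)              # True in CPython: hash(-1) == hash(-2) == -2
print(a.value == b.value)  # False: the underlying values differ
```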
Of course you can define __eq__ and __ne__ the way you have in your example, but unless you also explicitly define __hash__, you will get a TypeError: unhashable type exception any time you try to use an object of that type as a dict key or set member.
Since you're defining some semblance of value for your object by defining an __eq__ method, a more important question is: what do you consider the value of your object to be? By looking at your code, your answer to that is, "the value of this object is its hash". Without knowing the contents of your __hash__ method, it's impossible to evaluate the quality or validity of your __eq__ method.
It seems a common and quick way to create a stock __hash__() for any given Python object is to return hash(str(self)), if that object implements __str__(). Is this efficient, though? Per this SO answer, a hash of a tuple of the object's attributes is "good", but doesn't seem to indicate if it's the most efficient for Python. Or would it be better to implement a __hash__() for each object and use a real hashing algorithm from this page and mixup the values of the individual attributes into the final value returned by __hash__()?
Pretend I've implemented the Jenkins hash routines from this SO question. Which __hash__() would be better to use?:
# hash str(self)
def __hash__(self):
    return hash(str(self))

# hash of tuple of attributes
def __hash__(self):
    return hash((self.attr1, self.attr2, self.attr3,
                 self.attr4, self.attr5, self.attr6))

# jenkins hash
def __hash__(self):
    from jenkins import mix, final
    a = self.attr1
    b = self.attr2
    c = self.attr3
    a, b, c = mix(a, b, c)
    a += self.attr4
    b += self.attr5
    c += self.attr6
    a, b, c = final(a, b, c)
    return c
Assume the attrs in the sample object are all integers for simplicity. Also assume that all objects derive from a base class and that each objects implements its own __str__(). The tradeoff in using the first hash is that I could implement that in the base class as well and not add additional code to each of the derived objects. But if the second or third __hash__() implementations are better in some way, does that offset the cost of the added code to each derived object (because each may have different attributes)?
Edit: the import in the third __hash__() implementation is there only because I didn't want to draft out an entire example module + objects. Assume that import really happens at the top of the module, not on each invocation of the function.
Conclusion: Per the answer and comments on this closed SO question, it looks like I really want the tuple hash implementation, not for speed or efficiency, but because of the underlying duality of __hash__ and __eq__. Since a hash value is going to have a limited range of some form (be it 32 or 64 bits, for example), in the event you do have a hash collision, object equality is then checked. So since I do implement __eq__() for each object by using tuple comparison of self/other's attributes, I also want to implement __hash__() using an attribute tuple so that I respect the hash/equality nature of things.
Your third one has an important performance pessimization: it imports two names each time the function is called. Of course, how it performs relative to the string-hash version depends on how the string is generated.
That said, when you have attributes that define equality for the object, and those attributes are themselves hashable types, the simplest (and almost certainly best-performing) approach is going to be to hash a tuple containing those attribute values.
def __hash__(self):
    return hash((self.attr1, self.attr2, self.attr3))
Consider this snippet:
class SomeClass(object):
    def __init__(self, someattribute="somevalue"):
        self.someattribute = someattribute

    def __eq__(self, other):
        return self.someattribute == other.someattribute

    def __ne__(self, other):
        return not self.__eq__(other)

list_of_objects = [SomeClass()]
print(SomeClass() in list_of_objects)
set_of_objects = set([SomeClass()])
print(SomeClass() in set_of_objects)
which evaluates to:
True
False
Can anyone explain why the 'in' keyword has a different meaning for sets and lists?
I would have expected both to return True, especially when the type being tested has equality methods defined.
The meaning is the same, but the implementation is different. Lists simply examine each object, checking for equality, so it works for your class. Sets first hash the objects, and if they don't implement hash properly, the set appears not to work.
Your class defines __eq__, but doesn't define __hash__, and so won't work properly in sets or as a dictionary key. The rule for __eq__ and __hash__ is that two objects that compare equal via __eq__ must also have equal hashes. By default, objects hash based on their memory address, so your two objects that are equal by your definition don't produce the same hash, breaking that rule.
If you provide a __hash__ implementation, it will work fine. For your sample code, it could be:
def __hash__(self):
    return hash(self.someattribute)
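With that method added, the full snippet behaves consistently; here is a sketch based on the question's class (the isinstance guard is an addition for robustness):

```python
class SomeClass(object):
    def __init__(self, someattribute="somevalue"):
        self.someattribute = someattribute

    def __eq__(self, other):
        return (isinstance(other, SomeClass)
                and self.someattribute == other.someattribute)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Hash the same attribute that __eq__ compares.
        return hash(self.someattribute)

print(SomeClass() in [SomeClass()])  # True
print(SomeClass() in {SomeClass()})  # now also True
```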
In pretty much any hashtable implementation, including Python's, if you override the equality method you must also override the hashing method (in Python, __hash__). The in operator for lists just checks equality against every element of the list, while the in operator for sets first hashes the object you are looking for, checks for an object in that slot of the hashtable, and then checks for equality only if the slot is occupied. So if you override __eq__ without overriding __hash__, there is no guarantee that the in operator for sets will check the right slot.
Define a __hash__() method that corresponds to your __eq__() method.