Following on from this question, I'm interested to know when is a python object's hash computed?
1. At an instance's __init__ time,
2. The first time __hash__() is called,
3. Every time __hash__() is called, or
4. Any other opportunity I might be missing?
May this vary depending on the type of the object?
Why does hash(-1) == -2 whilst other integers are equal to their hash?
The hash is generally computed each time it's used, as you can quite easily check yourself (see below).
Of course, any particular object is free to cache its hash. For example, CPython strings do this, but tuples don't (see e.g. this rejected bug report for reasons).
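One rough way to see the string cache in action (a timing sketch; the exact numbers are illustrative and will vary by machine):

import timeit

s = 'x' * 10**7  # a large string; CPython caches its hash after first use

first = timeit.timeit(lambda: hash(s), number=1)   # computes and caches the hash
second = timeit.timeit(lambda: hash(s), number=1)  # reuses the cached value
print(first, second)  # the second call should be orders of magnitude faster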
The hash value -1 signals an error in CPython. This is because C doesn't have exceptions, so it needs to use the return value. When a Python object's __hash__ returns -1, CPython will actually silently change it to -2.
See for yourself:
class HashTest(object):
    def __hash__(self):
        print('Yes! __hash__ was called!')
        return -1
hash_test = HashTest()
# All of these will print out 'Yes! __hash__ was called!':
print('__hash__ call #1')
hash_test.__hash__()
print('__hash__ call #2')
hash_test.__hash__()
print('hash call #1')
hash(hash_test)
print('hash call #2')
hash(hash_test)
print('Dict creation')
dct = {hash_test: 0}
print('Dict get')
dct[hash_test]
print('Dict set')
dct[hash_test] = 0
print('__hash__ return value:')
print(hash_test.__hash__()) # prints -1
print('Actual hash value:')
print(hash(hash_test)) # prints -2
From here:
The hash value -1 is reserved (it’s used to flag errors in the C implementation).
If the hash algorithm generates this value, we simply use -2 instead.
Since an integer's hash is the integer itself, hash(-1) would be -1, so it is changed to -2 right away.
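You can check this at the prompt:

>>> hash(1), hash(2), hash(-2)
(1, 2, -2)
>>> hash(-1)  # would be -1, which is reserved, so CPython substitutes -2
-2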
It is easy to see that option #3 holds for user defined objects. This allows the hash to vary if you mutate the object, but if you ever use the object as a dictionary key you must be sure to prevent the hash ever changing.
>>> class C:
...     def __hash__(self):
...         print("__hash__ called")
...         return id(self)
...
>>> inst = C()
>>> hash(inst)
__hash__ called
43795408
>>> hash(inst)
__hash__ called
43795408
>>> d = { inst: 42 }
__hash__ called
>>> d[inst]
__hash__ called
Strings use option #2: they calculate the hash value once and cache the result. This is safe because strings are immutable, so the hash can never change; but if you subclass str, the result might not be immutable, so the __hash__ method will be called again every time. Tuples are usually thought of as immutable, so you might think the hash could be cached, but in fact a tuple's hash depends on the hashes of its contents, and those might include mutable values.
For @max, who doesn't believe that subclasses of str can modify the hash:
>>> class C(str):
...     def __init__(self, s):
...         self._n = 1
...     def __hash__(self):
...         return str.__hash__(self) + self._n
...
>>> x = C('hello')
>>> hash(x)
-717693723
>>> x._n = 2
>>> hash(x)
-717693722
Assume I have a custom class CustomObject and I do not define a custom __hash__ or __eq__ function for it. Will there be any difference between the following two operations, in terms of output, under any conditions?
a = CustomObject(1)
b = CustomObject(1)
setA = set()
# option 1
setA.add(a)
print((b in setA))
# option 2
setA.add(id(a))
print((id(b) in setA))
According to What is the default __hash__ in python?, the default __hash__ function is bound to the id of the object, so I assume there is no difference between the above two options?
If I define custom __hash__ functions for CustomObject like in add object into python's set collection and determine by object's attribute, the above two options will be different, right?
Saving the ID can result in a false positive if any of the objects become garbage and the ID is reassigned.
a = CustomObject(1)
setA = set()
setA.add(id(a))
del a
b = CustomObject(1)
print(id(b) in setA)
This would print True if b gets the same ID that a previously had.
For the same reason @Barmar mentions, a phenomenon that is easier to reproduce is that creating temporary CustomObject instances many times can yield just one address:
>>> class CustomObject:
...     def __init__(self, value):
...         self.value = value
...
>>> {id(CustomObject(1)) for _ in range(10)}
{1799037490496}
>>> {id(CustomObject(i)) for i in range(10)}
{1799034371856}
In addition, when iterating over the set you only get the addresses, not the objects you added. The ctypes library has methods that can recover an object from its address, but once the object has been destroyed, it is not safe to retrieve it that way.
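For completeness, here is the kind of ctypes trick alluded to; it is a sketch relying on the CPython detail that id() returns the object's memory address, and it is only safe while the object is still alive:

import ctypes

class CustomObject:
    def __init__(self, value):
        self.value = value

a = CustomObject(1)
addr = id(a)  # CPython: the object's memory address

# Recover the object from its address. Safe only while a is still alive;
# after del a this may return garbage or crash the interpreter.
recovered = ctypes.cast(addr, ctypes.py_object).value
print(recovered is a)  # True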
Using python 3.6
I have read the docs.
When implementing __hash__, must all attributes of an object be included in the object's __hash__ function?
So would this example be ok?
class Foo(object):
    def __init__(self, value):
        self.value = value
        self.not_important = 'irrelevant'

    def __hash__(self):
        return hash((self.value,))

    def __eq__(self, other):
        if self.value == other.value:
            return True
        return False
foo.not_important is modified while foo is a key in a dictionary:
>>> foo = Foo(1)
>>> d = {foo:'foo is in d'}
>>> foo.not_important = 'something else'
>>> d[foo]
'foo is in d'
>>> bar = Foo(1)
>>> foo == bar
True
>>> d[bar]
'foo is in d'
But foo.not_important isn't used by its __hash__ implementation. Is this perfectly okay? Or can this go horribly wrong?
Answering the literal question, it's okay to leave out attributes not considered by __eq__. In fact, you can leave out all the attributes and just return 0, though it'll kill your dict efficiency, so don't do that.
Answering the implied question, it's okay to mutate an object while it's a dict key, as long as you don't mutate it in ways that affect __eq__ or __hash__. For example, the default __hash__ implementation doesn't consider any attributes of the object it's hashing at all - it's based on object identity. With the default __hash__ and __eq__, an object is only equal to itself, and you can mutate such an object all you want while it's a dict key without breaking dicts.
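A quick sketch of that last point (the Box class is just an illustration): with the default identity-based __hash__ and __eq__, you can mutate a dict key freely:

class Box:
    pass  # inherits identity-based __hash__ and __eq__ from object

b = Box()
d = {b: 'still here'}
b.payload = [1, 2, 3]  # mutating attributes doesn't change the identity-based hash
print(d[b])            # 'still here'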
No. As the documentation states:
object.__hash__(self)
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple.
In your example, both __eq__ and __hash__ only use self.value. This satisfies the only requirement.
Given this program:
class Obj:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __hash__(self):
        return hash((self.a, self.b))

class Collection:
    def __init__(self):
        self.objs = set()

    def add(self, obj):
        self.objs.add(obj)

    def find(self, a, b):
        objs = []
        for obj in self.objs:
            if obj.b == b and obj.a == a:
                objs.append(obj)
        return objs

    def remove(self, a, b):
        for obj in self.find(a, b):
            print('removing', obj)
            self.objs.remove(obj)

o1 = Obj('a1', 'b1')
o2 = Obj('a2', 'b2')
o3 = Obj('a3', 'b3')
o4 = Obj('a4', 'b4')
o5 = Obj('a5', 'b5')

objs = Collection()
for o in (o1, o2, o3, o4, o5):
    objs.add(o)

objs.remove('a1', 'b1')
o2.a = 'a1'
o2.b = 'b1'
objs.remove('a1', 'b1')
o3.a = 'a1'
o3.b = 'b1'
objs.remove('a1', 'b1')
o4.a = 'a1'
o4.b = 'b1'
objs.remove('a1', 'b1')
o5.a = 'a1'
o5.b = 'b1'
If I run this a few times with Python 3.4.2, sometimes it will succeed, other times it throws a KeyError after removing 2 or 3 objects:
$ python3 py_set_obj_remove_test.py
removing <__main__.Obj object at 0x7f3648035828>
removing <__main__.Obj object at 0x7f3648035860>
removing <__main__.Obj object at 0x7f3648035898>
removing <__main__.Obj object at 0x7f36480358d0>
$ python3 py_set_obj_remove_test.py
removing <__main__.Obj object at 0x7f156170b828>
removing <__main__.Obj object at 0x7f156170b860>
Traceback (most recent call last):
File "py_set_obj_remove_test.py", line 42, in <module>
objs.remove('a1', 'b1')
File "py_set_obj_remove_test.py", line 27, in remove
self.objs.remove(obj)
KeyError: <__main__.Obj object at 0x7f156170b860>
Is this a bug in Python? Or something about the implementation of sets I don't know about?
Interestingly, it seems to always fail at the second objs.remove() call in Python 2.7.9.
This is not a bug in Python, your code is violating a principle of sets: that the hash value must not change. By mutating your object attributes, the hash changes and the set can no longer reliably locate the object in the set.
From the __hash__ method documentation:
If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
Custom Python classes define a default __eq__ method that returns True when both operands reference the same object (that is, when obj1 is obj2).
That it sometimes works in Python 3 is a property of hash randomisation for strings. Because the hash value for a string changes between Python interpreter runs, and because the modulus of a hash against the size of the hash table is used, you can end up with the right hash slot anyway, purely by accident, and then the == equality test will still be true because you didn't implement a custom __eq__ method.
Python 2 has hash randomisation too, but it is disabled by default; still, you could make your test 'pass' anyway by carefully picking the 'right' values for the a and b attributes.
Instead, you could make your code work by basing your hash on the id() of your instance; that makes the hash value not change and would match the default __eq__ implementation:
def __hash__(self):
    return hash(id(self))
You could also just remove your __hash__ implementation for the same effect, as the default implementation does basically the above (with the id() value rotated by 4 bits to evade memory alignment patterns). Again, from the __hash__ documentation:
User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).
Alternatively, implement an __eq__ method that bases equality on equality of the attributes of the instance, and don't mutate the attributes.
You are changing the objects (ie changing the objects' hash) after they were added to the set.
When remove is called, it can't find that hash in the set because it was changed after it was calculated (when the objects were originally added to the set).
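A minimal reproduction of that failure mode (a stripped-down variant of the question's Obj, for illustration):

class Obj:
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        return hash(self.a)
    def __eq__(self, other):
        return self.a == other.a

o = Obj('a1')
s = {o}
o.a = 'changed'  # the bucket o was stored under no longer matches hash(o)
print(o in s)    # almost always False: the lookup probes the wrong bucket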
I have a list of objects (Foo). A Foo object has several attributes. An instance of a Foo object is equivalent (equal) to another instance of a Foo object iff (if and only if) all the attributes are equal.
I have the following code:
class Foo(object):
    def __init__(self, myid):
        self.myid = myid

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            print 'DEBUG: self:', self.__dict__
            print 'DEBUG: other:', other.__dict__
            return self.__dict__ == other.__dict__
        else:
            print 'DEBUG: ATTEMPT TO COMPARE DIFFERENT CLASSES:', self.__class__, 'compared to:', other.__class__
            return False
import copy
f1 = Foo(1)
f2 = Foo(2)
f3 = Foo(3)
f4 = Foo(4)
f5 = copy.deepcopy(f3) # overkill here (I know), but needed for my real code
f_list = [f1,f2,f3,f4,f5]
# Surely, there must be a better way? (this doesn't work BTW!)
new_foo_list = list(set(f_list))
I often used this little (anti?) 'pattern' above (converting to set and back), when dealing with simple types (int, float, string - and surprisingly datetime.datetime types), but it has come a cropper with the more involved data type - like Foo above.
So, how could I change the list f_list above into a list of unique items, without having to loop through each item and doing a check on whether it already exists in some temporary cache, and so on?
What is the most pythonic way to do this?
First, I want to emphasize that using set is certainly not an anti-pattern. sets eliminate duplicates in O(n) time, which is the best you can do, and way better than the naive O(n^2) solution of comparing every item to every other item. It's even better than sorting -- and indeed, it seems your data structure might not even have a natural order, in which case sorting doesn't make a lot of sense.
The problem with using a set in this case is that you have to define a custom __hash__ method. Others have said this. But whether or not you can do so easily is an open question -- it depends on details about your actual class that you haven't told us. For example, if any attributes of a Foo object above are not hashable, then creating a custom hash function is going to be difficult, because you'll have to not only write a custom hash for Foo objects, you'll also have to write custom hashes for every other type of object!
So you need to tell us more about what kinds of attributes your class has if you want a conclusive answer. But I can offer some speculation.
Assuming that a hash function could be written for Foo objects, but also assuming that Foo objects are mutable and so really shouldn't have a __hash__ method, as Niklas B. points out, here is one workable approach. Create a function freeze that, given a mutable instance of Foo, returns an immutable collection of the data in Foo. So for example, say Foo has a dict and a list in it; freeze returns a tuple containing a tuple of tuples (representing the dict) and another tuple (representing the list). The function freeze should have the following property:
freeze(a) == freeze(b)
If and only if
a == b
Now pass your list through the following code:
dupe_free = dict((freeze(x), x) for x in dupe_list).values()
Now you have a dupe free list in O(n) time. (Indeed, after adding this suggestion, I saw that fraxel suggested something similar; but I think using a custom function -- or even a method -- (x.freeze(), x) -- is the better way to go, rather than relying on __dict__ as he does, which can be unreliable. The same goes for your custom __eq__ method, IMO -- __dict__ is not always a safe shortcut for various reasons I can't get into here.)
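As a sketch, assuming Foo's attributes are plain hashable values, lists, sets, or dicts of the same (the recursion below is my illustration, not the original poster's code):

def freeze(obj):
    def _freeze(value):
        if isinstance(value, dict):
            # dicts become sorted tuples of (key, frozen value) pairs
            return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
        if isinstance(value, (set, frozenset)):
            return frozenset(_freeze(v) for v in value)
        if isinstance(value, list):
            return tuple(_freeze(v) for v in value)
        return value  # assume anything else is already hashable
    return _freeze(vars(obj))

dupe_free = list(dict((freeze(x), x) for x in dupe_list).values())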
Another approach would be to use only immutable objects in the first place! For example, you could use namedtuples. Here's an example stolen from the python docs:
>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(11, y=22) # instantiate with positional or keyword arguments
>>> p[0] + p[1] # indexable like the plain tuple (11, 22)
33
>>> x, y = p # unpack like a regular tuple
>>> x, y
(11, 22)
>>> p.x + p.y # fields also accessible by name
33
>>> p # readable __repr__ with a name=value style
Point(x=11, y=22)
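Because namedtuples hash and compare by value, the set round-trip from the question then just works (Foo as a namedtuple is my illustration):

>>> Foo = namedtuple('Foo', ['myid'])
>>> f_list = [Foo(1), Foo(2), Foo(3), Foo(3)]
>>> sorted(set(f_list))
[Foo(myid=1), Foo(myid=2), Foo(myid=3)]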
Have you tried using a set (or frozenset)? It's explicitly for holding a unique set of items.
You'll need to create an appropriate __hash__ method, though. set (and frozenset) use the __hash__ method to hash objects; __eq__ is only used on a collision, AFAIK. Accordingly, you'll want to use a hash like hash(frozenset(self.__dict__.items())).
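Applied to the Foo class from the question, that suggestion would look roughly like this (a sketch; it requires every attribute value to be hashable):

class Foo(object):
    def __init__(self, myid):
        self.myid = myid

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __hash__(self):
        # one hashable snapshot of all attributes
        return hash(frozenset(self.__dict__.items()))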
According to the documentation, you need to define __hash__() and __eq__() for your custom class to work correctly with a set or frozenset, as both are implemented using hash tables in CPython.
If you implement __hash__, keep in mind that if a == b, then hash(a) must equal hash(b). Rather than comparing the whole __dict__s, I suggest the following more straightforward implementation for your simple class:
class Foo(object):
    def __init__(self, myid):
        self.myid = myid

    def __eq__(self, other):
        return isinstance(other, self.__class__) and other.myid == self.myid

    def __hash__(self):
        return hash(self.myid)
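With those two methods defined, the set round-trip from the question deduplicates as hoped:

>>> f_list = [Foo(1), Foo(2), Foo(3), Foo(3)]
>>> len(set(f_list))
3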
If your object contains mutable attributes, you simply shouldn't put it inside a set or use it as a dictionary key.
Here is an alternative method, just make a dictionary keyed by __dict__.items() for the instances:
f_list = [f1,f2,f3,f4,f5]
f_dict = dict([(tuple(i.__dict__.items()), i) for i in f_list])
print f_dict
print f_dict.values()
#output:
{(('myid', 1),): <__main__.Foo object at 0xb75e190c>,
(('myid', 2),): <__main__.Foo object at 0xb75e184c>,
(('myid', 3),): <__main__.Foo object at 0xb75e1f6c>,
(('myid', 4),): <__main__.Foo object at 0xb75e1cec>}
[<__main__.Foo object at 0xb75e190c>,
<__main__.Foo object at 0xb75e184c>,
<__main__.Foo object at 0xb75e1f6c>,
<__main__.Foo object at 0xb75e1cec>]
This way you just let the dictionary take care of the uniqueness based on attributes, and can easily retrieve the objects by getting the values.
If you are allowed to, you can use a set: http://docs.python.org/library/sets.html
lst = [1, 2, 3, 3, 45, 4, 45, 6]  # renamed from list to avoid shadowing the builtin
print set(lst)
set([1, 2, 3, 4, 6, 45])
x = set(lst)
print x
set([1, 2, 3, 4, 6, 45])
Normally, NaN (not a number) propagates through calculations, so I don't need to check for NaN in each step. This works almost always, but apparently there are exceptions. For example:
>>> nan = float('nan')
>>> pow(nan, 0)
1.0
I found the following comment on this:
The propagation of quiet NaNs through arithmetic operations allows
errors to be detected at the end of a sequence of operations without
extensive testing during intermediate stages. However, note that
depending on the language and the function, NaNs can silently be
removed in expressions that would give a constant result for all other
floating-point values e.g. NaN^0, which may be defined as 1, so in
general a later test for a set INVALID flag is needed to detect all
cases where NaNs are introduced.
To satisfy those wishing a more strict interpretation of how the power
function should act, the 2008 standard defines two additional power
functions; pown(x, n) where the exponent must be an integer, and
powr(x, y) which returns a NaN whenever a parameter is a NaN or the
exponentiation would give an indeterminate form.
Is there a way to check the INVALID flag mentioned above through Python? Alternatively, is there any other approach to catch cases where NaN does not propagate?
Motivation: I decided to use NaN for missing data. In my application, missing inputs should result in missing result. It works great, with the exception I described.
I realise that a month has passed since this was asked, but I've come across a similar problem (i.e. pow(float('nan'), 1) throws an exception in some Python implementations, e.g. Jython 2.52b2), and I found the above answers weren't quite what I was looking for.
Using a MissingData type as suggested by 6502 seems like the way to go, but I needed a concrete example. I tried Ethan Furman's NullType class but found that this didn't work with any arithmetic operations as it doesn't coerce data types (see below), and I also didn't like that it explicitly named each arithmetic function that was overridden.
Starting with Ethan's example and tweaking code I found here, I arrived at the class below. Although the class is heavily commented you can see that it actually only has a handful of lines of functional code in it.
The key points are:
1. Use __coerce__() to return two NoData objects for mixed-type (e.g. NoData + float) arithmetic operations, and two strings for string-based (e.g. concatenation) operations.
2. Use __getattr__() to return a callable NoData() object for all other attribute/method access.
3. Use __call__() to implement all other methods of the NoData() object: by returning a NoData() object.
Here are some examples of its use.
>>> nd = NoData()
>>> nd + 5
NoData()
>>> pow(nd, 1)
NoData()
>>> math.pow(NoData(), 1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: nb_float should return float object
>>> nd > 5
NoData()
>>> if nd > 5:
...     print "Yes"
... else:
...     print "No"
...
No
>>> "The answer is " + nd
'The answer is NoData()'
>>> "The answer is %f" % (nd)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: float argument required, not instance
>>> "The answer is %s" % (nd)
'The answer is '
>>> nd.f = 5
>>> nd.f
NoData()
>>> nd.f()
NoData()
I noticed that using pow with NoData() calls the ** operator and hence works with NoData, but using math.pow does not, as it first tries to convert the NoData() object to a float. I'm happy using the non-math pow - hopefully 6502 etc. were using math.pow when they had problems with pow in their comments above.
The other issue I can't think of a way of solving is the use with the format (%f) operator... No methods of NoData are called in this case, the operator just fails if you don't provide a float. Anyway here's the class itself.
class NoData():
    """NoData object - any interaction returns NoData()"""
    def __str__(self):
        #I want '' returned as it represents no data in my output (e.g. csv) files
        return ''

    def __unicode__(self):
        return ''

    def __repr__(self):
        return 'NoData()'

    def __coerce__(self, other_object):
        if isinstance(other_object, str) or isinstance(other_object, unicode):
            #Return string objects when coerced with another string object.
            #This ensures that e.g. concatenation operations produce strings.
            return repr(self), other_object
        else:
            #Otherwise return two NoData objects - these will then be passed to the appropriate
            #operator method for NoData, which should then return a NoData object
            return self, self

    def __nonzero__(self):
        #__nonzero__ is the operation that is called whenever, e.g. "if NoData:" occurs,
        #i.e. as all operations involving NoData return NoData, whenever a
        #NoData object propagates to a test in a branch statement.
        return False

    def __hash__(self):
        #prevent NoData() from being used as a key for a dict or used in a set
        raise TypeError("Unhashable type: " + repr(self))

    def __setattr__(self, name, value):
        #This is overridden to prevent any attributes from being created on NoData
        #when e.g. "NoData().f = x" is called
        return None

    def __call__(self, *args, **kwargs):
        #if a NoData object is called (i.e. used as a method), return a NoData object
        return self

    def __getattr__(self, name):
        #For all other attribute accesses or method accesses, return a NoData object.
        #Remember that the NoData object can be called (__call__), so if a method is called,
        #a NoData object is first returned and then called. This works for operators,
        #so e.g. NoData() + 5 will:
        # - call NoData().__coerce__, which returns a (NoData, NoData) tuple
        # - call __getattr__, which returns a NoData object
        # - call the returned NoData object with args (self, NoData)
        # - this call (i.e. __call__) returns a NoData object
        #For attribute accesses NoData will be returned, and that's it.
        #print name #(uncomment this line for debugging, i.e. to see which attribute was accessed/method was called)
        return self
Why use NaN, which already has another semantic, instead of an instance of a MissingData class defined by yourself?
Defining operations on MissingData instances to get propagation should be easy...
If it's just pow() giving you headaches, you can easily redefine it to return NaN under whatever circumstances you like.
def pow(x, y):
    return x ** y if x == x else float("NaN")
If NaN can be used as an exponent you'd also want to check for that; this raises a ValueError exception except when the base is 1 (apparently on the theory that 1 to any power, even one that's not a number, is 1).
(And of course pow() actually takes three operands, the third optional, which omission I'll leave as an exercise...)
Unfortunately the ** operator has the same behavior, and there's no way to redefine that for built-in numeric types. A possibility to catch this is to write a subclass of float that implements __pow__() and __rpow__() and use that class for your NaN values.
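That subclassing idea might look like this sketch (the class name and exact behaviour are my own illustration):

class PropagatingNaN(float):
    """A NaN whose ** always propagates, even where pow(nan, 0) would give 1."""
    def __new__(cls):
        return super(PropagatingNaN, cls).__new__(cls, 'nan')
    def __pow__(self, other, mod=None):
        return self  # nan ** anything -> nan
    def __rpow__(self, other):
        return self  # anything ** nan -> nan

nan = PropagatingNaN()
print(nan ** 0)  # nan, instead of 1.0
print(2 ** nan)  # nan (int defers to our __rpow__)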
Python doesn't seem to provide access to any flags set by calculations; even if it did, it's something you'd have to check after each individual operation.
In fact, on further consideration, I think the best solution might be to simply use an instance of a dummy class for missing values. Python will choke on any operation you try to do with these values, raising an exception, and you can catch the exception and return a default value or whatever. There's no reason to proceed with the rest of the calculation if a needed value is missing, so an exception should be fine.
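A sketch of that approach (the names are illustrative):

class Missing(object):
    pass  # defines no numeric methods, so any arithmetic raises TypeError

MISSING = Missing()

def safe_compute(x):
    try:
        return x * 2 + 1
    except TypeError:  # a missing input poisoned the calculation
        return MISSING

print(safe_compute(3.0))      # 7.0
print(safe_compute(MISSING))  # the MISSING sentinel comes back out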
To answer your question: No, there is no way to check the flags using normal floats. You can use the Decimal class, however, which provides much more control, but is a bit slower.
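For instance, decimal keeps per-context status flags much like IEEE's, so you can run a computation with the InvalidOperation trap disabled and test the flag afterwards:

from decimal import Decimal, localcontext, InvalidOperation

with localcontext() as ctx:
    ctx.traps[InvalidOperation] = False  # record the INVALID flag instead of raising
    ctx.clear_flags()
    result = Decimal(-1).sqrt()          # an invalid operation -> quiet NaN
    print(result)                        # NaN
    print(ctx.flags[InvalidOperation])   # True: the flag is left set for later testing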
Your other option is to use an EmptyData or Null class, such as this one:
import sys  # needed for the version check below

class NullType(object):
    "Null object -- any interaction returns Null"
    def _null(self, *args, **kwargs):
        return self

    __eq__ = __ne__ = __ge__ = __gt__ = __le__ = __lt__ = _null
    __add__ = __iadd__ = __radd__ = _null
    __sub__ = __isub__ = __rsub__ = _null
    __mul__ = __imul__ = __rmul__ = _null
    __div__ = __idiv__ = __rdiv__ = _null
    __mod__ = __imod__ = __rmod__ = _null
    __pow__ = __ipow__ = __rpow__ = _null
    __and__ = __iand__ = __rand__ = _null
    __xor__ = __ixor__ = __rxor__ = _null
    __or__ = __ior__ = __ror__ = _null
    __divmod__ = __rdivmod__ = _null
    __truediv__ = __itruediv__ = __rtruediv__ = _null
    __floordiv__ = __ifloordiv__ = __rfloordiv__ = _null
    __lshift__ = __ilshift__ = __rlshift__ = _null
    __rshift__ = __irshift__ = __rrshift__ = _null
    __neg__ = __pos__ = __abs__ = __invert__ = _null
    __call__ = __getattr__ = _null

    def __divmod__(self, other):
        return self, self
    __rdivmod__ = __divmod__

    if sys.version_info[:2] >= (2, 6):
        __hash__ = None
    else:
        def __hash__(yo):
            raise TypeError("unhashable type: 'Null'")

    def __new__(cls):
        return cls.null

    def __nonzero__(yo):
        return False

    def __repr__(yo):
        return '<null>'

    def __setattr__(yo, name, value):
        return None

    def __setitem__(yo, index, value):
        return None

    def __str__(yo):
        return ''

NullType.null = object.__new__(NullType)
Null = NullType()
You may want to change the __repr__ and __str__ methods. Also, be aware that Null cannot be used as a dictionary key, nor stored in a set.