Removing objects that have changed from a Python set

Given this program:
class Obj:
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __hash__(self):
        return hash((self.a, self.b))

class Collection:
    def __init__(self):
        self.objs = set()
    def add(self, obj):
        self.objs.add(obj)
    def find(self, a, b):
        objs = []
        for obj in self.objs:
            if obj.b == b and obj.a == a:
                objs.append(obj)
        return objs
    def remove(self, a, b):
        for obj in self.find(a, b):
            print('removing', obj)
            self.objs.remove(obj)
o1 = Obj('a1', 'b1')
o2 = Obj('a2', 'b2')
o3 = Obj('a3', 'b3')
o4 = Obj('a4', 'b4')
o5 = Obj('a5', 'b5')
objs = Collection()
for o in (o1, o2, o3, o4, o5):
objs.add(o)
objs.remove('a1', 'b1')
o2.a = 'a1'
o2.b = 'b1'
objs.remove('a1', 'b1')
o3.a = 'a1'
o3.b = 'b1'
objs.remove('a1', 'b1')
o4.a = 'a1'
o4.b = 'b1'
objs.remove('a1', 'b1')
o5.a = 'a1'
o5.b = 'b1'
If I run this a few times with Python 3.4.2, sometimes it will succeed, other times it throws a KeyError after removing 2 or 3 objects:
$ python3 py_set_obj_remove_test.py
removing <__main__.Obj object at 0x7f3648035828>
removing <__main__.Obj object at 0x7f3648035860>
removing <__main__.Obj object at 0x7f3648035898>
removing <__main__.Obj object at 0x7f36480358d0>
$ python3 py_set_obj_remove_test.py
removing <__main__.Obj object at 0x7f156170b828>
removing <__main__.Obj object at 0x7f156170b860>
Traceback (most recent call last):
File "py_set_obj_remove_test.py", line 42, in <module>
objs.remove('a1', 'b1')
File "py_set_obj_remove_test.py", line 27, in remove
self.objs.remove(obj)
KeyError: <__main__.Obj object at 0x7f156170b860>
Is this a bug in Python? Or something about the implementation of sets I don't know about?
Interestingly, it seems to always fail at the second objs.remove() call in Python 2.7.9.

This is not a bug in Python; your code violates a core requirement of sets: an object's hash value must not change while the object is in the set. By mutating the attributes, you change the hash, and the set can no longer reliably locate the object.
From the __hash__ method documentation:
If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
Custom Python classes inherit a default __eq__ method that returns True only when both operands reference the same object (i.e. obj1 is obj2).
That it sometimes works in Python 3 is a property of hash randomisation for strings. Because the hash value for a string changes between Python interpreter runs, and because the modulus of a hash against the size of the hash table is used, you can end up with the right hash slot anyway, purely by accident, and then the == equality test will still be true because you didn't implement a custom __eq__ method.
Python 2 supports hash randomisation too, but it is disabled by default; you could still make your test 'pass' by carefully picking the 'right' values for the a and b attributes.
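The randomisation is easy to observe: the same string hashes differently in interpreters launched with different PYTHONHASHSEED values. A small sketch (the string_hash helper is mine, for illustration; it re-invokes the current interpreter via sys.executable):

```python
import os
import subprocess
import sys

def string_hash(seed):
    # Run a fresh interpreter with a fixed hash seed and report hash('a1').
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('a1'))"],
        env=env, capture_output=True, text=True,
    )
    return int(out.stdout)

# The same seed always yields the same hash...
print(string_hash("1") == string_hash("1"))   # True
# ...but different seeds (almost certainly) yield different ones.
print(string_hash("1") != string_hash("2"))
```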
Instead, you could make your code work by basing your hash on the id() of your instance; that makes the hash value not change and would match the default __eq__ implementation:
def __hash__(self):
    return hash(id(self))
You could also just remove your __hash__ implementation for the same effect, as the default implementation does basically the above (with the id() value rotated by 4 bits to evade memory alignment patterns). Again, from the __hash__ documentation:
User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).
Alternatively, implement an __eq__ method that bases equality on equality of the attributes of the instance, and don't mutate the attributes.
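A minimal sketch of that last option, reusing the question's Obj class (the isinstance guard is my addition):

```python
class Obj:
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __eq__(self, other):
        # Value-based equality over the same attributes the hash uses.
        return isinstance(other, Obj) and (self.a, self.b) == (other.a, other.b)
    def __hash__(self):
        return hash((self.a, self.b))

s = {Obj('a1', 'b1')}
# Lookup by value now works -- as long as the attributes are never mutated.
print(Obj('a1', 'b1') in s)   # True
```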

You are changing the objects (i.e. changing their hashes) after they were added to the set.
When remove is called, the set can't find the object because its hash changed after it was stored (the stored hash was computed when the object was originally added to the set).
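A stripped-down illustration of the failure mode, using a one-attribute variant of the question's class:

```python
class Obj:
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        # The hash depends on a mutable attribute -- the root of the problem.
        return hash(self.a)

s = set()
o = Obj('a1')
s.add(o)        # stored under hash('a1')
o.a = 'a2'      # hash(o) is now hash('a2')
# Membership tests and remove() probe using the new hash, which no longer
# matches the stored one, so the object is effectively lost in the set.
print(o in s)   # False (barring an astronomically unlikely hash collision)
print(len(s))   # 1 -- the object is still there, just unreachable by lookup
```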


Python's standard hashing algorithm [duplicate]

Following on from this question, I'm interested to know when a Python object's hash is computed. Is it:
At an instance's __init__ time,
The first time __hash__() is called,
Every time __hash__() is called, or
Any other opportunity I might be missing?
May this vary depending on the type of the object?
Why does hash(-1) == -2 whilst other integers are equal to their hash?
The hash is generally computed each time it's used, as you can quite easily check yourself (see below).
Of course, any particular object is free to cache its hash. For example, CPython strings do this, but tuples don't (see e.g. this rejected bug report for reasons).
The hash value -1 signals an error in CPython. This is because C doesn't have exceptions, so it needs to use the return value. When a Python object's __hash__ returns -1, CPython will actually silently change it to -2.
See for yourself:
class HashTest(object):
    def __hash__(self):
        print('Yes! __hash__ was called!')
        return -1
hash_test = HashTest()
# All of these will print out 'Yes! __hash__ was called!':
print('__hash__ call #1')
hash_test.__hash__()
print('__hash__ call #2')
hash_test.__hash__()
print('hash call #1')
hash(hash_test)
print('hash call #2')
hash(hash_test)
print('Dict creation')
dct = {hash_test: 0}
print('Dict get')
dct[hash_test]
print('Dict set')
dct[hash_test] = 0
print('__hash__ return value:')
print(hash_test.__hash__()) # prints -1
print('Actual hash value:')
print(hash(hash_test)) # prints -2
From here:
The hash value -1 is reserved (it’s used to flag errors in the C implementation).
If the hash algorithm generates this value, we simply use -2 instead.
Since an integer's hash is the integer itself, hash(-1) is simply remapped to -2 right away.
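You can verify both halves of that answer directly:

```python
# Small integers hash to themselves...
print(hash(5))    # 5
print(hash(0))    # 0
# ...except -1, which is reserved as CPython's error flag and remapped.
print(hash(-1))   # -2
print(hash(-2))   # -2, so hash(-1) == hash(-2) collide (harmlessly)
```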
It is easy to see that option #3 holds for user defined objects. This allows the hash to vary if you mutate the object, but if you ever use the object as a dictionary key you must be sure to prevent the hash ever changing.
>>> class C:
...     def __hash__(self):
...         print("__hash__ called")
...         return id(self)
...
>>> inst = C()
>>> hash(inst)
__hash__ called
43795408
>>> hash(inst)
__hash__ called
43795408
>>> d = { inst: 42 }
__hash__ called
>>> d[inst]
__hash__ called
Strings use option #2: they calculate the hash value once and cache the result. This is safe because strings are immutable, so the hash can never change. If you subclass str, however, the result might not be immutable, so __hash__ will be called every time. Tuples are usually thought of as immutable, so you might expect their hash to be cached too, but a tuple's hash depends on the hashes of its contents, and those may include mutable values.
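The tuple point is easy to demonstrate: a tuple is only hashable if everything inside it is.

```python
hashable = (1, "two", (3, 4))
# Equal tuples of hashable contents produce equal hashes.
print(hash(hashable) == hash((1, "two", (3, 4))))   # True

unhashable = (1, [2, 3])   # contains a mutable list
try:
    hash(unhashable)
except TypeError as exc:
    print(exc)   # unhashable type: 'list'
```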
For @max, who doesn't believe that subclasses of str can modify the hash:
>>> class C(str):
...     def __init__(self, s):
...         self._n = 1
...     def __hash__(self):
...         return str.__hash__(self) + self._n
...
>>> x = C('hello')
>>> hash(x)
-717693723
>>> x._n = 2
>>> hash(x)
-717693722

Python: what is the difference of adding an object to a set by id() or directly?

Assume I have a custom class CustomObject and I do not define a custom __hash__ or __eq__ function for it. Will there be any difference between the following two operations in terms of outputs in any conditions?
a = CustomObject(1)
b = CustomObject(1)
setA = set()
# option 1
setA.add(a)
print((b in setA))
# option 2
setA.add(id(a))
print((id(b) in setA))
According to What is the default __hash__ in python?, the default __hash__ function is bound to the id of the object, so I assume there is no difference between the above two options?
If I define custom __hash__ functions for CustomObject like in add object into python's set collection and determine by object's attribute, the above two options will be different, right?
Saving the ID can result in a false positive if any of the objects become garbage and the ID is reassigned.
a = CustomObject(1)
setA = set()
setA.add(id(a))
del a
b = CustomObject(1)
print(id(b) in setA)
This would print True if b gets the same ID that a previously had.
For the same reason @Barmar mentions, here is a phenomenon that is easier to reproduce: temporary CustomObject instances created in a loop are garbage-collected immediately, so their addresses get reused and the set of ids can collapse to a single value:
>>> class CustomObject:
...     def __init__(self, value):
...         self.value = value
...
>>> {id(CustomObject(1)) for _ in range(10)}
{1799037490496}
>>> {id(CustomObject(i)) for i in range(10)}
{1799034371856}
In addition, when you iterate over the set you only get addresses, not the objects you added. The ctypes library has ways to recover an object from its address, but doing so is unsafe once the object has been destroyed.
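If the goal of storing id()s was identity-based membership, a safer sketch is to store the objects themselves -- or, if the set must not keep them alive, a weakref.WeakSet. (The immediate cleanup shown below relies on CPython's reference counting.)

```python
import weakref

class CustomObject:
    def __init__(self, value):
        self.value = value

seen = weakref.WeakSet()   # identity-based membership, no strong references
a = CustomObject(1)
seen.add(a)
print(a in seen)   # True, using the default identity-based hash
del a              # in CPython the object dies immediately...
print(len(seen))   # 0 -- ...and its entry vanishes: no stale-id false positives
```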

Must all attributes of an object be in the object's __hash__ function?

Using python 3.6
I have read the docs.
When implementing __hash__, must all attributes of an object be in the object's __hash__ function?
So would this example be ok?
class Foo(object):
    def __init__(self, value):
        self.value = value
        self.not_important = 'irrelevant'
    def __hash__(self):
        return hash((self.value,))
    def __eq__(self, other):
        if self.value == other.value:
            return True
        return False
Here foo.not_important is modified while foo is a key in a dictionary:
>>> foo = Foo(1)
>>> d = {foo:'foo is in d'}
>>> foo.not_important = 'something else'
>>> d[foo]
'foo is in d'
>>> bar = Foo(1)
>>> foo == bar
True
>>> d[bar]
'foo is in d'
But foo.not_important isn't used by its __hash__ implementation. Is this perfectly okay? Or can this go horribly wrong?
Answering the literal question, it's okay to leave out attributes not considered by __eq__. In fact, you can leave out all the attributes and just return 0, though it'll kill your dict efficiency, so don't do that.
Answering the implied question, it's okay to mutate an object while it's a dict key, as long as you don't mutate it in ways that affect __eq__ or __hash__. For example, the default __hash__ implementation doesn't consider any attributes of the object it's hashing at all - it's based on object identity. With the default __hash__ and __eq__, an object is only equal to itself, and you can mutate such an object all you want while it's a dict key without breaking dicts.
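A quick sketch of that identity-based behaviour, with a class that defines neither method:

```python
class Plain:
    pass   # inherits identity-based __eq__ and __hash__ from object

p = Plain()
d = {p: "still here"}
p.anything = [1, 2, 3]   # mutate freely: identity, and thus the hash, is stable
print(d[p])              # 'still here'

q = Plain()              # a distinct object is a distinct key
print(q in d)            # False
```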
No. As the documentation states:
object.__hash__(self)
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple.
In your example, both __eq__ and __hash__ only use self.value. This satisfies the only requirement.

Most pythonic way of ensuring a list of objects contains only unique items

I have a list of objects (Foo). A Foo object has several attributes. An instance of a Foo object is equivalent (equal) to another instance of a Foo object iff (if and only if) all the attributes are equal.
I have the following code:
class Foo(object):
    def __init__(self, myid):
        self.myid = myid
    def __eq__(self, other):
        if isinstance(other, self.__class__):
            print 'DEBUG: self:', self.__dict__
            print 'DEBUG: other:', other.__dict__
            return self.__dict__ == other.__dict__
        else:
            print 'DEBUG: ATTEMPT TO COMPARE DIFFERENT CLASSES:', self.__class__, 'compared to:', other.__class__
            return False
import copy
f1 = Foo(1)
f2 = Foo(2)
f3 = Foo(3)
f4 = Foo(4)
f5 = copy.deepcopy(f3) # overkill here (I know), but needed for my real code
f_list = [f1,f2,f3,f4,f5]
# Surely, there must be a better way? (this doesn't work BTW!)
new_foo_list = list(set(f_list))
I often used this little (anti?) 'pattern' above (converting to set and back), when dealing with simple types (int, float, string - and surprisingly datetime.datetime types), but it has come a cropper with the more involved data type - like Foo above.
So, how could I change the list f_list above into a list of unique items - without having to loop through each item and doing a check on whether it already exists in some temporary cache etc etc?
What is the most pythonic way to do this?
First, I want to emphasize that using set is certainly not an anti-pattern. sets eliminate duplicates in O(n) time, which is the best you can do, and way better than the naive O(n^2) solution of comparing every item to every other item. It's even better than sorting -- and indeed, it seems your data structure might not even have a natural order, in which case sorting doesn't make a lot of sense.
The problem with using a set in this case is that you have to define a custom __hash__ method. Others have said this. But whether or not you can do so easily is an open question -- it depends on details about your actual class that you haven't told us. For example, if any attributes of a Foo object above are not hashable, then creating a custom hash function is going to be difficult, because you'll have to not only write a custom hash for Foo objects, you'll also have to write custom hashes for every other type of object!
So you need to tell us more about what kinds of attributes your class has if you want a conclusive answer. But I can offer some speculation.
Assuming that a hash function could be written for Foo objects, but also assuming that Foo objects are mutable and so really shouldn't have a __hash__ method, as Niklas B. points out, here is one workable approach. Create a function freeze that, given a mutable instance of Foo, returns an immutable collection of the data in Foo. So for example, say Foo has a dict and a list in it; freeze returns a tuple containing a tuple of tuples (representing the dict) and another tuple (representing the list). The function freeze should have the following property:
freeze(a) == freeze(b)
If and only if
a == b
Now pass your list through the following code:
dupe_free = dict((freeze(x), x) for x in dupe_list).values()
Now you have a dupe free list in O(n) time. (Indeed, after adding this suggestion, I saw that fraxel suggested something similar; but I think using a custom function -- or even a method -- (x.freeze(), x) -- is the better way to go, rather than relying on __dict__ as he does, which can be unreliable. The same goes for your custom __eq__ method, IMO -- __dict__ is not always a safe shortcut for various reasons I can't get into here.)
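A sketch of such a freeze function, assuming (hypothetically) that Foo holds a dict attribute named mapping and a list attribute named items, and that the dict's values are themselves hashable:

```python
def freeze(foo):
    # Hypothetical: converts Foo's mutable contents into hashable equivalents.
    return (
        frozenset(foo.mapping.items()),   # dict -> order-independent frozenset
        tuple(foo.items),                 # list -> tuple
    )

class Foo:
    def __init__(self, mapping, items):
        self.mapping = mapping
        self.items = items
    def __eq__(self, other):
        # Satisfies the required property: freeze(a) == freeze(b) iff a == b.
        return isinstance(other, Foo) and freeze(self) == freeze(other)

dupe_list = [Foo({'a': 1}, [1, 2]), Foo({'a': 1}, [1, 2]), Foo({'b': 2}, [3])]
dupe_free = list({freeze(x): x for x in dupe_list}.values())
print(len(dupe_free))   # 2 -- duplicates removed in O(n)
```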
Another approach would be to use only immutable objects in the first place! For example, you could use namedtuples. Here's an example stolen from the python docs:
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(11, y=22) # instantiate with positional or keyword arguments
>>> p[0] + p[1] # indexable like the plain tuple (11, 22)
33
>>> x, y = p # unpack like a regular tuple
>>> x, y
(11, 22)
>>> p.x + p.y # fields also accessible by name
33
>>> p # readable __repr__ with a name=value style
Point(x=11, y=22)
Have you tried using a set (or frozenset)? It's explicitly for holding a unique set of items.
You'll need to create an appropriate __hash__ method, though. set (and frozenset) use the __hash__ method to hash objects; __eq__ is only used on a collision, AFAIK. Accordingly, you'll want to use a hash like hash(frozenset(self.__dict__.items())).
According to the documentation, you need to define __hash__() and __eq__() for your custom class to work correctly with a set or frozenset, as both are implemented using hash tables in CPython.
If you implement __hash__, keep in mind that if a == b, then hash(a) must equal hash(b). Rather than comparing the whole __dict__s, I suggest the following more straightforward implementation for your simple class:
class Foo(object):
    def __init__(self, myid):
        self.myid = myid
    def __eq__(self, other):
        return isinstance(other, self.__class__) and other.myid == self.myid
    def __hash__(self):
        return hash(self.myid)
If your object contains mutable attributes, you simply shouldn't put it inside a set or use it as a dictionary key.
Here is an alternative method, just make a dictionary keyed by __dict__.items() for the instances:
f_list = [f1,f2,f3,f4,f5]
f_dict = dict([(tuple(i.__dict__.items()), i) for i in f_list])
print f_dict
print f_dict.values()
#output:
{(('myid', 1),): <__main__.Foo object at 0xb75e190c>,
(('myid', 2),): <__main__.Foo object at 0xb75e184c>,
(('myid', 3),): <__main__.Foo object at 0xb75e1f6c>,
(('myid', 4),): <__main__.Foo object at 0xb75e1cec>}
[<__main__.Foo object at 0xb75e190c>,
<__main__.Foo object at 0xb75e184c>,
<__main__.Foo object at 0xb75e1f6c>,
<__main__.Foo object at 0xb75e1cec>]
This way you just let the dictionary take care of the uniqueness based on attributes, and can easily retrieve the objects by getting the values.
If you are allowed to, you can use a set: http://docs.python.org/library/sets.html
lst = [1, 2, 3, 3, 45, 4, 45, 6]
print set(lst)
set([1, 2, 3, 4, 6, 45])
x = set(lst)
print x
set([1, 2, 3, 4, 6, 45])

I'm able to use a mutable object as a dictionary key in python. Is this not disallowed?

class A(object):
    x = 4

i = A()
d = {}
d[i] = 2
print d
i.x = 10
print d
I thought only immutable objects can be dictionary keys, but the object i above is mutable.
Any object with a __hash__ method can be a dictionary key. For classes you write, this method defaults to returning a value based off id(self), and if equality is not determined by identity for those classes, you may be surprised by using them as keys:
>>> class A(object):
...     def __eq__(self, other):
...         return True
...
>>> one, two = A(), A()
>>> d = {one: "one"}
>>> one == two
True
>>> d[one]
'one'
>>> d[two]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: <__main__.A object at 0xb718836c>
>>> hash(set()) # sets cannot be dict keys
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
Changed in version 2.6: __hash__ may now be set to None to explicitly flag instances of a class as unhashable. [__hash__]
class Unhashable(object):
    __hash__ = None
An object can be a key in a dictionary if it is hashable.
Here is the definition of hashable from the documentation:
An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.
Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.
All of Python’s immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are. Objects which are instances of user-defined classes are hashable by default; they all compare unequal, and their hash value is their id().
Since object provides a default implementation of __hash__, __eq__ and __cmp__ this means that anything deriving from object is hashable unless it is explicitly defined not to be hashable. It is not disallowed to create a mutable type that is hashable, but it might not behave as you want.
@Fred Nurk's example above luckily no longer works in Python 3, because of this change:
A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None. When the __hash__() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value...
Thank God for that. However, if you explicitly define __hash__() for yourself, you can still do evil things:
class BadHasher:
    def __init__(self):
        self.first = True
    # Implement __hash__ in an evil way. The first time an instance is hashed,
    # return 1. Every time after that, return 0.
    def __hash__(self):
        if self.first:
            self.first = False
            return 1
        return 0

myobject = BadHasher()
# We can put this object in a set...
myset = {myobject}
# ...but as soon as we look for it, it's gone!
if myobject not in myset:
    print("what the hell we JUST put it in there")
The requirement is that the hash of an object doesn't change over time, and that it keeps comparing equal (==) with its original value. Your class A meets both these requirements, so it makes a valid dictionary key. The x attribute is not considered at all in keying, only the object identity is.
