Python set __contains__ is not finding objects contained in the set [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 4 years ago.
(python 3.7.1 on linux)
I am observing some strange behavior storing user-defined objects in a set. The objects are highly complex so a minimal example is not in the cards-- but I am hoping the observed behavior will elicit an explanation from someone wiser than myself. Here it is:
>>> from mycode import MyObject
>>> a = MyObject(*args1)
>>> b = MyObject(*args2)
>>> a == b
False
>>> z = {a, b}
>>> len(z)
2
>>> a in z
False
My understanding was that an object is "in" a set if (1) its hash matches the hash of an object in the set and (2) it equals that object. But those expectations are violated here:
>>> [hash(t) for t in z]
[1013724486348463466, -1852733432963649245]
>>> hash(a)
1013724486348463466
>>> [(hash(t) == hash(a), t == a) for t in z]
[(True, True), (False, False)]
>>> [t is a for t in z]
[True, False]
And the strangest (syntactically) of all:
>>> [t in z for t in z]
[False, False]
What might be up with MyObject to cause it to behave this way? To recap: it has a sane __hash__ and __eq__ function, set is just a stock python set.
Here they are specifically:
class MyObject(object):
    ...
    def __hash__(self):
        return hash(self.link)

    def __eq__(self, other):
        """
        two entities are equal if their types, origins, and external references are the same.
        internal refs do not need to be equal; reference entities do not need to be equal
        :return:
        """
        if other is None:
            return False
        try:
            is_eq = (self.external_ref == other.external_ref
                     and self.origin == other.origin
                     and self.entity_type == other.entity_type)
        except AttributeError:
            is_eq = False
        return is_eq
All of those properties are defined on these objects. As demonstrated above, a == t evaluates to True for one of the objects in the set. Thanks for any suggestions.

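I was mutating the objects after adding them to the set. The hash function as defined was not static: it depends on self.link, so once that changed, the set could no longer find the object under its new hash.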
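A minimal sketch of the failure mode described in this answer, using a hypothetical Node class as a stand-in for MyObject (whose full definition is not shown). It assumes, as in the question, that __hash__ is computed from a mutable link attribute; integer values are used for link only because their hashes are predictable.

```python
class Node:
    """Hypothetical stand-in for MyObject: the hash depends on a mutable attribute."""

    def __init__(self, link):
        self.link = link

    def __hash__(self):
        return hash(self.link)

    def __eq__(self, other):
        return isinstance(other, Node) and self.link == other.link


a = Node(1)
z = {a}
found_before = a in z                  # looked up under hash(1), where it was stored

a.link = 2                             # mutation changes hash(a) while a sits in the set
found_after = a in z                   # looked up under hash(2); the stored hash no longer matches
still_stored = any(t is a for t in z)  # iteration does not use the hash

print(found_before, found_after, still_stored)  # True False True
```

Removing the object before changing link and adding it back afterwards, or basing __hash__ only on attributes that never change, avoids the stale entry.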

Related

Why don't functions preserve identity? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
I was wondering why Python 3.7 functions behave in a rather strange way. I think it's kinda weird and contradictory to the whole notion of hashability. Let me clarify what I encounter with a simple example code. Knowing that tuples are hashable, consider the following:
a = (-1, 20, 8)
b = (-1, 20, 8)
def f(x):
    return min(x), max(x)
Now let us examine:
>>> print(a is b, a.__hash__() == b.__hash__())
False True
>>> print((-1, 20, 8) is (-1, 20, 8))
True
This is odd enough, but I guess "naming" hashable objects makes them something different (their id()s change during variable definition). How about functions? Functions are hashable, right? Let's see:
>>> print(f(a) is f(b))
False
>>> print(id(f(a)) == id(f(b)), f(a).__hash__() == f(b).__hash__())
True True
Now this is the climax of my confusion. You should be surprised that even f(a) is f(a) is False. But how so? Don't you think this kind of behavior is incorrect and should be addressed and fixed by the Python community?

You can't guarantee that two identical calls return the same object: functions are themselves objects in Python, so they can maintain state. Even setting state aside, you shouldn't rely on is evaluating to True just because the contents of two objects are the same.
There are cases in which Python will optimize the code to reuse the same object as a singleton, but you shouldn't assume anything about this.
As an implementation detail, CPython caches the small integers from -5 through 256, so two computations that both produce 256 yield the same object, while two that produce 257 generally do not. If you only care about equality of values, use ==. is is designed for object identity checks.
c = 40
def f(x):
    return c + x
a = 1
f(a)
# 41
c += 1
f(a)
# 42
f(a) is f(a)
# True
c += 500
f(a) is f(a)
# False
f(a) is f(a) can evaluate to True: CPython keeps the integers from -5 through 256 as singletons, so the first test returns True. Once the result falls outside that range (after c += 500), each call creates its own object to return, and f(a) is f(a) returns False.
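A small sketch of the same point that does not depend on how literals are compiled. The values are built at run time with int(), and the identity results are a CPython implementation detail (its cache of the integers from -5 through 256), not something the language guarantees.

```python
small_a = int("256")
small_b = int("256")
big_a = int("257")
big_b = int("257")

# Equality compares values and is always reliable.
values_equal = (small_a == small_b) and (big_a == big_b)

# Identity depends on whether the interpreter happens to reuse an object.
small_same = small_a is small_b   # True on CPython: 256 is in the small-int cache
big_same = big_a is big_b         # typically False on CPython: 257 is not cached

print(values_equal, small_same, big_same)
```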

The is keyword in Python checks whether its operands point to the same object. Python provides the id() function, which returns a unique identifier for an object instance. So a is b does not compare whether the objects contain the same value; it only tells you whether a and b are the same object.
For a tuple, the __hash__() function returns a value based on the content/value of the object.
>>> a = (-1, 20, 8)
>>> b = (-1, 20, 8)
>>> id(a)
2347044252768
>>> id(b)
2347044252336
>>> hash(a)
-3789721413161926883
>>> hash(b)
-3789721413161926883
Now the last question: f(a) is f(b) checks whether the results returned by f(a) and f(b) point to the same object in memory.
Since your function returns min(x), max(x), each call builds a new tuple containing the min and max of x. Therefore, print(f(a) is f(b)) prints False.
f(a).__hash__() == f(b).__hash__() is True because this compares the hashes of the returned values, not the hash of the function as you think.
If you want the hash of the function itself, use f.__hash__() or hash(f), since a function in Python is just a callable object.
The only interesting part is that print(id(f(a)) == id(f(b))) shows True. This happens because the tuple returned by f(a) is discarded as soon as id() returns, so CPython is free to reuse the same memory for the tuple returned by f(b); id() values are only guaranteed to be unique among objects that exist at the same time.
If you keep both results alive by assigning them to names, the comparison returns False.
>>> c = f(a)
>>> d = f(b)
>>> print(id(f(a)) == id(f(b)))
True
>>> print(id(c) == id(d))
False
This is not a bug, although it does look like an odd inconsistency. BTW, I'm using Python 3.7.2 on 64-bit Windows. The behavior might differ on other Python versions or implementations.
If you replace the integer values with strings, the behavior also changes because of Python's string interning optimization.
Therefore, the lesson here is the same as the general guideline in other languages: avoid comparing object references/pointers if possible, because you may end up depending on implementation details such as how objects are referenced, which optimizations apply, and possibly how the garbage collector works.
Here's an interesting related article: Python Optimization: How it Can Make You a Better Programmer
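To make the id() observation in this answer concrete, here is a hedged sketch. It relies only on the documented rule that id() values are unique among objects that are alive at the same time; whether a freed address is actually reused is up to the interpreter, so the first comparison is printed but not relied on.

```python
def f(x):
    return min(x), max(x)


a = (-1, 20, 8)
b = (-1, 20, 8)

# Each temporary tuple is freed right after id() returns, so its address
# may be handed to the next tuple. The result is implementation-dependent.
temporaries_share_id = id(f(a)) == id(f(b))

# Holding references keeps both results alive at once, so their ids must differ.
c = f(a)
d = f(b)
live_objects_share_id = id(c) == id(d)

print(temporaries_share_id, live_objects_share_id, c == d)
```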

Check if object is in an iterable using "is" identity instead of "==" equality

if object in lst:
    # do something
As far as I can tell, when you execute this statement it is internally checking == between object and every element in lst, which will refer to the __eq__ methods of these two objects. This can have the implication of two distinct objects being "equal", which is usually desired if all of their attributes are the same.
However, is there a way to Pythonically achieve a predicate such as in where the underlying equality check is is - i.e. we're actually checking if the two references are to the same object?

List membership in Python is dictated by the __contains__ dunder method. You can override it for a custom implementation if you want to keep the normal "in" syntax:
class my_list(list):
    def __contains__(self, x):
        for y in self:
            if x is y:
                return True
        return False

4 in my_list([4, [3,2,1]])
>>> True
[3,2,1] in my_list([4, [3,2,1]]) # Because while the lists are "==" equal, they have different pointers.
>>> False
Otherwise, I'd suggest kaya3's answer of using a generator check.

Use the any function:
if any(x is object for x in lst):
    # ...
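A short sketch contrasting the two kinds of membership test. The helper name contains_identity and the sample data are made up for illustration; the helper is just the any() expression above wrapped in a function.

```python
def contains_identity(iterable, target):
    """Return True if some element of iterable is the very same object as target."""
    return any(item is target for item in iterable)


inner = [3, 2, 1]
lst = [4, inner]
equal_copy = [3, 2, 1]   # equal to inner, but a different object

by_equality = equal_copy in lst                         # True: "in" falls back to ==
by_identity_copy = contains_identity(lst, equal_copy)   # False: no element is this object
by_identity_same = contains_identity(lst, inner)        # True: inner itself is in lst

print(by_equality, by_identity_copy, by_identity_same)  # True False True
```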

If you specifically want to use is, you can also use filter:
filtered_list = filter(lambda n: n is object, lst)

Are classobjects singletons?

If we have x = type(a) and x == y, does it necessarily imply that x is y?
Here is a counter-example, but it's a cheat:
>>> class BrokenEq(type):
...     def __eq__(cls, other):
...         return True
...
>>> class A(metaclass=BrokenEq):
...     pass
...
>>> a = A()
>>> x = type(a)
>>> x == A, x is A
(True, True)
>>> x == BrokenEq, x is BrokenEq
(True, False)
And I could not create a counterexample like this:
>>> A1 = type('A', (), {})
>>> A2 = type('A', (), {})
>>> a = A1()
>>> x = type(a)
>>> x == A1, x is A1
(True, True)
>>> x == A2, x is A2
(False, False)
To clarify my question - without overriding equality operators to do something insane, is it possible for a class to exist at two different memory locations or does the import system somehow prevent this?
If so, how can we demonstrate this behavior - for example, doing weird things with reload or __import__?
If not, is that guaranteed by the language or documented anywhere?
Epilogue:
# thing.py
class A:
    pass
Finally, this is what clarified the real behaviour for me (and it supports the claims in Blckknght's answer):
>>> import sys
>>> from thing import A
>>> a = A()
>>> isinstance(a, A), type(a) == A, type(a) is A
(True, True, True)
>>> del sys.modules['thing']
>>> from thing import A
>>> isinstance(a, A), type(a) == A, type(a) is A
(False, False, False)
So, although code that uses importlib.reload could break type checking by class identity, it will also break isinstance anyway.

No, there's no way to create two class objects that compare equal without being identical, except by messing around with metaclass __eq__ methods.
This behavior though is not something unique to classes. It's the default behavior for any object without an __eq__ method defined in its class. The behavior is inherited from object, which is the base class for all other (new-style) classes. It's only overridden for builtin types that have some other semantic for equality (e.g. container types which compare their contents) and for custom classes that define an __eq__ operator of their own.
As for getting two different references to the same class at different memory locations, that's not really possible due to Python's object semantics. The memory location of the object is its identity (in CPython at least). Another class with identical contents can exist somewhere else, but as in your A1 and A2 example, it's going to be seen as a different object by all Python logic.

I'm not aware of any documentation about how == works for types, but it definitely works by identity. You can see that the CPython 2.7 implementation is a pointer comparison:
static PyObject*
type_richcompare(PyObject *v, PyObject *w, int op)
{
    ...
    /* Compare addresses */
    vv = (Py_uintptr_t)v;
    ww = (Py_uintptr_t)w;
    switch (op) {
    ...
    case Py_EQ: c = vv == ww; break;
In CPython 3.5, type doesn't implement its own tp_richcompare, so it inherits the default equality comparison from object, which is a pointer comparison:
PyTypeObject PyType_Type = {
    ...
    0,                                          /* tp_richcompare */

Does comparing using `==` compare identities before comparing values?

If I compare two variables using ==, does Python compare the identities, and, if they're not the same, then compare the values?
For example, I have two strings which point to the same string object:
>>> a = 'a sequence of chars'
>>> b = a
Does this compare the values, or just the ids?:
>>> b == a
True
It would make sense to compare identity first, and I guess that is the case, but I haven't yet found anything in the documentation to support this. The closest I've got is this:
x==y calls x.__eq__(y)
which doesn't tell me whether anything is done before calling x.__eq__(y).

For user-defined class instances, is is used as a fallback - where the default __eq__ isn't overridden, a == b is evaluated as a is b. This ensures that the comparison will always have a result (except in the NotImplemented case, where comparison is explicitly forbidden).
This is (somewhat obliquely - good spot Sven Marnach) referred to in the data model documentation (emphasis mine):
User-defined classes have __eq__() and __hash__() methods by
default; with them, all objects compare unequal (except with
themselves) and x.__hash__() returns an appropriate value such
that x == y implies both that x is y and hash(x) == hash(y).
You can demonstrate it as follows:
>>> class Unequal(object):
...     def __eq__(self, other):
...         return False
...
>>> ue = Unequal()
>>> ue is ue
True
>>> ue == ue
False
so __eq__ must be called before id, but:
>>> class NoEqual(object):
...     pass
...
>>> ne = NoEqual()
>>> ne is ne
True
>>> ne == ne
True
so id must be invoked where __eq__ isn't defined.
You can see this in the CPython implementation, which notes:
/* If neither object implements it, provide a sensible default
for == and !=, but raise an exception for ordering. */
The "sensible default" implemented is a C-level equality comparison of the pointers v and w, which will return whether or not they point to the same object.

In addition to the answer by @jonrsharpe: if the objects being compared implement __eq__, it would be wrong for Python to check for identity first.
Look at the following example:
>>> x = float('nan')
>>> x is x
True
>>> x == x
False
NaN is a specific thing that should never compare equal to itself; however, even in this case x is x should return True, because of the semantics of is.
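The NaN case also shows where identity does get checked first. In this short sketch, the == operator calls __eq__ directly, whereas membership tests on built-in containers compare by identity before equality, so the same NaN object is found even though it is not equal to itself.

```python
x = float("nan")

equal_to_itself = x == x                  # False: NaN never compares equal, even to itself
in_list = x in [x]                        # True: identity is checked before ==
in_set = x in {x}                         # True: same hash, then identity
other_nan_in_list = float("nan") in [x]   # False: a different object, and nan != nan

print(equal_to_itself, in_list, in_set, other_nan_in_list)  # False True True False
```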

Set "in" operator: uses equality or identity?

class A(object):
    def __cmp__(self):
        print '__cmp__'
        return object.__cmp__(self)

    def __eq__(self, rhs):
        print '__eq__'
        return True

a1 = A()
a2 = A()
print a1 in set([a1])
print a1 in set([a2])
Why does the first line print True, but the second print False? And why does neither call __eq__?
I am using Python 2.6

Set __contains__ makes checks in the following order:
'Match' if hash(a) == hash(b) and (a is b or a==b) else 'No Match'
The relevant C source code is in Objects/setobject.c::set_lookkey() and in Objects/object.c::PyObject_RichCompareBool().

You need to define __hash__ too. For example
class A(object):
    def __hash__(self):
        print '__hash__'
        return 42

    def __cmp__(self, other):
        print '__cmp__'
        return object.__cmp__(self, other)

    def __eq__(self, rhs):
        print '__eq__'
        return True

a1 = A()
a2 = A()
print a1 in set([a1])
print a1 in set([a2])
Will work as expected.
As a general rule, any time you implement __cmp__ you should implement a __hash__ such that for all x and y such that x == y, x.__hash__() == y.__hash__().

Sets and dictionaries gain their speed by using hashing as a fast approximation of full equality checking. If you want to redefine equality, you usually need to redefine the hash algorithm so that it is consistent.
The default hash function uses the identity of the object, which is pretty useless as a fast approximation of full equality, but at least allows you to use an arbitrary class instance as a dictionary key and retrieve the value stored with it if you pass exactly the same object as a key. But it means if you redefine equality and don't redefine the hash function, your objects will go into a dictionary/set without complaining about not being hashable, but still won't actually work the way you expect them to.
See the official python docs on __hash__ for more details.
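The discussion above is about Python 2. As a hedged aside for Python 3 readers: there, a class that defines __eq__ without also defining __hash__ has its __hash__ set to None, so the silent mismatch described above becomes an immediate TypeError instead. The class name OnlyEq is made up for the illustration.

```python
class OnlyEq:
    """Defines equality but no hash, so Python 3 marks instances unhashable."""

    def __eq__(self, other):
        return True


try:
    s = {OnlyEq()}
    hashable = True
except TypeError:
    # Raised because OnlyEq.__hash__ is None.
    hashable = False

print(OnlyEq.__hash__, hashable)  # None False
```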

A tangential answer, but your question and my testing made me curious. If you ignore the set operator which is the source of your __hash__ problem, it turns out your question is still interesting.
Thanks to the help I got on this SO question, I was able to chase the in operator through the source code to its root. Near the bottom I found the PyObject_RichCompareBool function, which indeed tests for identity (see the comment about "Quick result") before testing for equality.
So unless I misunderstand the way things work, the technical answer to your question is first identity and then equality, through the equality test itself. Just to reiterate, that is not the source of the behavior you were seeing but just the technical answer to your question.
If I misunderstood the source, somebody please set me straight.
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}

Sets seem to use hash codes, then identity, before comparing for equality. The following code:
class A(object):
    def __eq__(self, rhs):
        print '__eq__'
        return True

    def __hash__(self):
        print '__hash__'
        return 1

a1 = A()
a2 = A()
print 'set1'
set1 = set([a1])
print 'set2'
set2 = set([a2])
print 'a1 in set1'
print a1 in set1
print 'a1 in set2'
print a1 in set2
outputs:
set1
__hash__
set2
__hash__
a1 in set1
__hash__
True
a1 in set2
__hash__
__eq__
True
What happens seems to be:
The hash code is computed when an element is inserted into a set. (To compare with the existing elements.)
The hash code for the object you're checking with the in operator is computed.
Elements of the set with the same hash code are inspected by first checking whether they're the same object as the one you're looking for, or if they're logically equal to it.
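The experiment above is written for Python 2. A Python 3 sketch of the same idea follows; it records method calls in a list (the name calls is introduced here) instead of printing, so the order of operations can be inspected afterwards.

```python
calls = []


class A:
    def __eq__(self, other):
        calls.append("__eq__")
        return True

    def __hash__(self):
        calls.append("__hash__")
        return 1


a1 = A()
a2 = A()
set1 = {a1}
set2 = {a2}

calls.clear()
in_same = a1 in set1        # hash matches and the stored element is a1 itself
calls_same = list(calls)

calls.clear()
in_other = a1 in set2       # hash matches, not the same object, so __eq__ decides
calls_other = list(calls)

print(in_same, calls_same)    # True ['__hash__']
print(in_other, calls_other)  # True ['__hash__', '__eq__']
```

The first lookup never reaches __eq__ because the identity check succeeds; the second falls through to __eq__ only after the hashes match.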
