custom comparison for built-in containers - python

In my code there are numerous comparisons for equality between various containers (list, dict, etc.). The keys and values of the containers are of types float, bool, int, and str. The built-in == and != worked perfectly fine.
I just learned that the floats used in the values of the containers must be compared using a custom comparison function. I've written that function already (let's call it approxEqual(), and assume that it takes two floats and returns True if they are judged to be equal and False otherwise).
I prefer that the changes to the existing code are kept to a minimum. (New classes/functions/etc can be as complicated as necessary.)
Example:
if dict1 != dict2:
    raise DataMismatch
The dict1 != dict2 condition needs to be rewritten so that any floats used in the values of dict1 and dict2 are compared using the approxEqual function instead of __eq__.
The actual contents of the dictionaries come from various sources (parsing files, calculations, etc.).
Note: I asked a question earlier about how to override the built-in float's __eq__. That would have been an easy solution, but I learned that Python doesn't allow overriding built-in types' __eq__ operator. Hence this new question.

The only route to altering the way built-in containers check equality is to make them contain, instead of the "originals", wrapped values (wrapped in a class that overrides __eq__ and __ne__). That is what you need if you want to alter the way the containers themselves use equality checking, e.g. for the purposes of the in operator where the right-hand operand is a list, as well as in the containers' own methods such as their __eq__ (type(x).__eq__(y) is essentially how Python performs internally what you code as x == y).
If what you're talking about is performing your own equality checks (without altering the checks performed internally by the containers themselves), then the only way is to change every cont1 == cont2 into (e.g.) same(cont1, cont2, value_same) where value_same is a function accepting two values and returning True or False like == would. That's probably too invasive WRT the criterion you specify.
If you can change the containers themselves (i.e., the number of places where container objects are created is much smaller than the number of places where two containers are checked for equality), then using a container subclass which overrides __eq__ is best.
E.g.:
class EqMixin(object):
    def __eq__(self, other):
        return same(self, other, value_same)
(with same being as I mentioned in this answer's second paragraph) and
class EqM_list(EqMixin, list): pass
(and so forth for other container types you need), then wherever you have (e.g.)
x = list(someiter)
change it into
x = EqM_list(someiter)
and be sure to also catch other ways to create list objects, e.g. replace
x = [bah*2 for bah in buh]
with
x = EqM_list(bah*2 for bah in buh)
and
x = d.keys()
with
x = EqM_list(d.iterkeys())
and so forth.
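To make the subclass route concrete, here is a minimal, hedged sketch. The approx_equal below is only a stand-in for the OP's approxEqual, and same is a simple version of the helper described above, limited to flat lists:

def approx_equal(a, b, tol=1e-9):
    # stand-in for the OP's approxEqual(); only floats get the fuzzy treatment
    if isinstance(a, float) and isinstance(b, float):
        return abs(a - b) <= tol
    return a == b

def same(a, b, value_same):
    if len(a) != len(b):
        return False
    return all(value_same(x, y) for x, y in zip(a, b))

class EqMixin(object):
    def __eq__(self, other):
        return same(self, other, approx_equal)
    def __ne__(self, other):
        return not self == other

class EqM_list(EqMixin, list):
    pass

print(EqM_list([1.0, 2.0]) == [1.0 + 1e-12, 2.0])  # True
print(EqM_list([1.0, 2.0]) == [1.1, 2.0])          # False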
Yeah, I know, what a bother -- but it's a core principle (and practice;-) of Python that builtin types (be they containers, or value types like e.g. float) themselves cannot be changed. That's a very different philosophy from e.g. Ruby's and Javascript's (and I personally prefer it but I do see how it can seem limiting at times!).
Edit: the OP's specific request seems to be (in terms of this answer) "how do I implement same" for the various container types, not how to apply it without changing the == into a function call. If that's correct, then (e.g.), without using iterators for simplicity:
def samelist(a, b, samevalue):
    if len(a) != len(b): return False
    return all(samevalue(x, y) for x, y in zip(a, b))
def samedict(a, b, samevalue):
    if set(a) != set(b): return False
    return all(samevalue(a[x], b[x]) for x in a)
Note that this applies to values, as requested, NOT to keys. "Fuzzying up" the equality comparison of a dict's keys (or a set's members) is a REAL problem. Look at it this way: first, how do you guarantee with absolute certainty that samevalue(a, b) and samevalue(b, c) totally imply and ensure samevalue(a, c)? This transitivity condition does not apply to most semi-sensible "fuzzy comparisons" I've ever seen, and yet it's completely indispensable for the hash-table based containers (such as dicts and sets). If you pass that hurdle, then the nightmare of making the hash values somehow "magically" consistent arises -- and what if two actually different keys in one dict "map to" equality in this sense with the same key in the other dict, which of the two corresponding values should be used then...? This way madness lies, if you ask me, so I hope that when you say values you do mean, exactly, values, and not keys!-)
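Tying the edit's helpers back to the original if dict1 != dict2 check, here is a hedged sketch of a recursive value comparison. It builds on samelist/samedict above; approxEqual is the OP's own function, and the placeholder definition below only keeps the sketch runnable:

def approxEqual(a, b, tol=1e-9):    # placeholder for the OP's real function
    return abs(a - b) <= tol

def values_same(x, y):
    # floats get the fuzzy comparison; nested containers recurse; everything
    # else falls back to ordinary ==
    if isinstance(x, float) and isinstance(y, float):
        return approxEqual(x, y)
    if isinstance(x, list) and isinstance(y, list):
        return samelist(x, y, values_same)
    if isinstance(x, dict) and isinstance(y, dict):
        return samedict(x, y, values_same)
    return x == y

# The original check then becomes:
# if not samedict(dict1, dict2, values_same):
#     raise DataMismatch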

Related

Is there a general method for testing if the attributes of two objects are equivalent in Python?

I just wrote a testing script for a project I have in python, and I learned that checking if the values of two objects are equivalent is not as simple as foo == bar. For example, I have an object dashboard that has a pandas dataframe as an attribute, so I defined its __eq__ method as:
def __eq__(self, other):
    vals = [self.__dict__[k] == other.__dict__[k] for k in self.__dict__.keys()]
    vals = [v if isinstance(v, bool) else all(v) for v in vals]
    return all(vals)
It compares the dictionaries of each object, and if any of these comparisons yields something other than a boolean (e.g., a dataframe) it applies all() to reduce it to a single boolean. I then apply all() to this entire list of attribute comparisons to test whether or not every attribute of self and other are equivalent.
I used this __eq__ definition in several classes, and also used something similar for a comparison method in my parent Test class. I got my test to work, but I'm curious if there's a more elegant/efficient way to handle this. (Disclaimer: Testing is new to me, as well as OOP in general.)
No, there is not a general method for testing equality of objects.
This is partly because custom/arbitrary objects have widely varying interpretations of what equality means. For example, in your code you've defined equality to mean "the values are booleans and/or iterables of booleans and all values agree", but this rule ONLY applies to keys that both dictionaries have in common. Someone else might need to check that a dictionary has all the same keys but doesn't care about the values, or that all keys and all values are identical and of specific datatypes, etc. So, this is left to the user to implement.
To compare dataframes, since you mention them, there is:
DataFrame.equals(other)
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.equals.html
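Putting that together with the question's attribute-by-attribute approach, here is a hedged sketch that special-cases DataFrame attributes via DataFrame.equals (which also treats NaNs in the same positions as equal). The class name and attributes are hypothetical, standing in for the OP's object:

import pandas as pd

class Dashboard(object):
    def __init__(self, name, data):
        self.name = name          # plain attribute
        self.data = data          # a pandas DataFrame

    def __eq__(self, other):
        if set(self.__dict__) != set(other.__dict__):
            return False
        for k, v in self.__dict__.items():
            w = other.__dict__[k]
            if isinstance(v, pd.DataFrame):
                if not v.equals(w):
                    return False
            elif v != w:
                return False
        return True

a = Dashboard("x", pd.DataFrame({"v": [1, 2]}))
b = Dashboard("x", pd.DataFrame({"v": [1, 2]}))
print(a == b)   # True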

Why can a floating point dictionary key overwrite an integer key with the same value?

I'm working through http://www.mypythonquiz.com, and question #45 asks for the output of the following code:
confusion = {}
confusion[1] = 1
confusion['1'] = 2
confusion[1.0] = 4
sum = 0
for k in confusion:
    sum += confusion[k]
print sum
The output is 6, since the key 1.0 replaces 1. This feels a bit dangerous to me, is this ever a useful language feature?
First of all: the behaviour is documented explicitly in the docs for the hash function:
hash(object)
Return the hash value of the object (if it has one). Hash values are
integers. They are used to quickly compare dictionary keys during a
dictionary lookup. Numeric values that compare equal have the same
hash value (even if they are of different types, as is the case for 1
and 1.0).
Secondly, a limitation of hashing is pointed out in the docs for object.__hash__
object.__hash__(self)
Called by built-in function hash() and for operations on members of
hashed collections including set, frozenset, and dict. __hash__()
should return an integer. The only required property is that objects
which compare equal have the same hash value;
This is not unique to python. Java has the same caveat: if you implement hashCode then, in order for things to work correctly, you must implement it in such a way that: x.equals(y) implies x.hashCode() == y.hashCode().
So, python decided that 1.0 == 1 holds, hence it's forced to provide an implementation for hash such that hash(1.0) == hash(1). The side effect is that 1.0 and 1 act exactly in the same way as dict keys, hence the behaviour.
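A quick interactive check of that claim:

>>> 1.0 == 1
True
>>> hash(1.0) == hash(1)
True
>>> {1: 'int'}[1.0]
'int'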
In other words the behaviour in itself doesn't have to be used or useful in any way. It is necessary. Without that behaviour there would be cases where you could accidentally overwrite a different key.
If we had 1.0 == 1 but hash(1.0) != hash(1) we could still have a collision. And if 1.0 and 1 collide, the dict will use equality to be sure whether they are the same key or not and kaboom the value gets overwritten even if you intended them to be different.
The only way to avoid this would be to have 1.0 != 1, so that the dict is able to distinguish between them even in case of collision. But it was deemed more important to have 1.0 == 1 than to avoid the behaviour you are seeing, since you practically never use floats and ints as dictionary keys anyway.
Since python tries to hide the distinction between numbers by automatically converting them when needed (e.g. 1/2 -> 0.5) it makes sense that this behaviour is reflected even in such circumstances. It's more consistent with the rest of python.
This behaviour would appear in any implementation where the matching of the keys is at least partially (as in a hash map) based on comparisons.
For example if a dict was implemented using a red-black tree or an other kind of balanced BST, when the key 1.0 is looked up the comparisons with other keys would return the same results as for 1 and so they would still act in the same way.
Hash maps require even more care because of the fact that it's the value of the hash that is used to find the entry of the key and comparisons are done only afterwards. So breaking the rule presented above means you'd introduce a bug that's quite hard to spot because at times the dict may seem to work as you'd expect it, and at other times, when the size changes, it would start to behave incorrectly.
Note that there would be a way to fix this: have a separate hash map/BST for each type inserted in the dictionary. In this way there couldn't be any collisions between objects of different type and how == compares wouldn't matter when the arguments have different types.
However this would complicate the implementation, it would probably be inefficient since hash maps have to keep quite a few free locations in order to have O(1) access times. If they become too full the performances decrease. Having multiple hash maps means wasting more space and also you'd need to first choose which hash map to look at before even starting the actual lookup of the key.
If you used BSTs you'd first have to look up the type and then perform a second lookup. So if you are going to use many types you'd end up with twice the work (and the lookup would take O(log n) instead of O(1)).
You should consider that the dict aims at storing data depending on the logical numeric value, not on how you represented it.
The difference between ints and floats is indeed just an implementation detail and not conceptual. Ideally the only number type should be an arbitrary-precision number with unbounded accuracy, even for sub-unity values... this is however hard to implement without getting into trouble... but maybe that will be the only numeric type for Python in the future.
So while having different types for technical reasons Python tries to hide these implementation details and int->float conversion is automatic.
It would be much more surprising if, in a Python program, the branch if x == 1: ... were not taken when x is a float with value 1.
Note also that in Python 3 the value of 1/2 is 0.5 (division of two integers), and that the types long and non-unicode str were dropped in the same attempt to hide implementation details.
In Python:
>>> 1 == 1.0
True
This is because the two compare equal numerically (the int is implicitly converted for the comparison). However:
>>> 1 is 1.0
False
I can see why automatic conversion between float and int is handy. It is relatively safe to convert an int into a float, and yet there are other languages (e.g. Go) that stay away from implicit conversions.
It is really a language design decision and a matter of taste, more than a difference in functionality.
Dictionaries are implemented with a hash table. To look up something in a hash table, you start at the position indicated by the hash value, then search different locations until you find a key value that's equal or an empty bucket.
If you have two key values that compare equal but have different hashes, you may get inconsistent results depending on whether the other key value was in the searched locations or not, and this would become more likely as the table fills up. This is something you want to avoid. It appears that the Python developers had this in mind, since the built-in hash function returns the same hash for equivalent numeric values, no matter whether those values are int or float. Note that this extends to other numeric types: False is equal to 0 and True is equal to 1. Even fractions.Fraction and decimal.Decimal uphold this property.
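For instance, in CPython:

>>> from fractions import Fraction
>>> from decimal import Decimal
>>> hash(1) == hash(1.0) == hash(True) == hash(Fraction(1)) == hash(Decimal(1))
True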
The requirement that if a == b then hash(a) == hash(b) is documented in the definition of object.__hash__():
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
TL;DR: a dictionary would break if keys that compared equal did not map to the same value.
Frankly, the opposite is what would be dangerous! 1 == 1.0, so if the two could point to different keys, any key you compute (rather than write as a literal) could silently pick one or the other, and that kind of ambiguity is hard to track down.
Dynamic typing means that the value is more important than what the technical type of something is, since the type is malleable (which is a very useful feature), and so treating ints and floats of the same value as distinct keys would be unnecessary semantics that only leads to confusion.
I agree with others that it makes sense to treat 1 and 1.0 as the same in this context. Even if Python did treat them differently, it would probably be a bad idea to try to use 1 and 1.0 as distinct keys for a dictionary. On the other hand -- I have trouble thinking of a natural use-case for using 1.0 as an alias for 1 in the context of keys. The problem is that either the key is literal or it is computed. If it is a literal key then why not just use 1 rather than 1.0? If it is a computed key -- round off error could muck things up:
>>> d = {}
>>> d[1] = 5
>>> d[1.0]
5
>>> x = sum(0.01 for i in range(100)) #conceptually this is 1.0
>>> d[x]
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
d[x]
KeyError: 1.0000000000000007
So I would say that, generally speaking, the answer to your question "is this ever a useful language feature?" is "No, probably not."

What is the use case of the immutable objects

What is the use case of immutable types/objects like tuple in Python?
tuple('hello')
('h', 'e', 'l', 'l', 'o')
Where can we use such unchangeable sequences?
One common use case is the list of (unnamed) arguments to a function.
In [1]: def foo(*args):
   ...:     print(type(args))
   ...:
In [2]: foo(1,2,3)
<class 'tuple'>
Technically, tuples are semantically different from lists.
When you have a list, you have something that is... a list. Of items of some sort. And therefore can have items added to it or removed from it.
A tuple, on the other hand, is a fixed collection of values in a given order. It just happens to be one value that is made up of more than one value. A composite value.
For example, say you have a point: X, Y. You could have a class called Point, but that class would need a dictionary to store its attributes. A point is only two values which are, most of the time, used together. You don't need the flexibility or the cost of a dictionary for storing named attributes; you can use a tuple instead.
myPoint = 70, 2
Points are always X and Y. Always 2 values. They are not lists of numbers. They are two values whose order matters.
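As an aside (not from the original answer): if you want the same fixed-order pair but with readable field names, collections.namedtuple gives you that while staying a plain tuple underneath:

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(70, 2)
print(p.x, p.y)      # 70 2 -- named access, no per-instance __dict__
print(p == (70, 2))  # True -- it is still just a 2-tuple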
Another example of tuple usage. A function that creates links from a list of tuples. The tuples must be the href and then the label of the link. Fixed order. Order that has meaning.
def make_links(*tuples):
    return "".join('<a href="%s">%s</a>' % t for t in tuples)

make_links(
    ("//google.com", "Google"),
    ("//stackoveflow.com", "Stack Overflow")
)
So the reason tuples don't change is because they are supposed to be one single value. You can only assign the whole thing at once.
Here is a good resource that describes the difference between tuples and lists, and the reasons for using each: https://mail.python.org/pipermail/tutor/2001-September/008888.html
The main reason outlined in that link is that tuples are immutable and less flexible than, say, lists. This makes them useful only in certain situations, but if those situations can be identified, tuples take up far fewer resources.
Immutable objects will make life simpler in many cases. They are especially applicable for value types, where objects don't have an identity so they can be easily replaced. And they can make concurrent programming way safer and cleaner (most of the notoriously hard to find concurrency bugs are ultimately caused by mutable state shared between threads). However, for large and/or complex objects, creating a new copy of the object for every single change can be very costly and/or tedious. And for objects with a distinct identity, changing an existing objects is much more simple and intuitive than creating a new, modified copy of it.
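One concrete illustration of the value-type point above: only hashable (in practice, immutable) objects can serve as dictionary keys or set members, which is a very common reason to reach for a tuple. A small sketch:

coords = {}
coords[(40.7, -74.0)] = "New York"       # a tuple works as a key
try:
    coords[[40.7, -74.0]] = "New York"   # a list does not
except TypeError as err:
    print(err)                           # unhashable type: 'list'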

Python: Testing equivalence of sets of custom classes when all instances are unique by definition?

Using Python 2.6, with the set() builtin, not sets.set.
I have defined some custom data abstraction classes, which will be made members of some sets using the builtin set() object.
The classes are already being stored in a separate structure, before being divided up into sets. All instances of the classes are declared first. No class instances are created or deleted after the first set is declared. No two class instances are ever considered to be "equal" to each other. (Two instances of the class, containing identical data, are considered not the same. A == B is False for all A,B where B is not A.)
Given the above, will there be any reasonable difference between these strategies for testing set_a == set_b?:
Option 1: Store integers in the sets that uniquely identify instances of my class.
Option 2: Store instances of my class, and implement __hash__() and __eq__() to compare id(self) == id(other). (This may not be necessary? Do default implementations of these functions in object just do the same thing but faster?) Possibly use an instance variable that increments every time a new instance calls __init__(). (Not thread safe?)
or,
Option 3: The instances are already stored and looked up in dictionaries keyed by rather long strings. The strings are what most directly represents what the instances are, and are kept unique. I thought storing these strings in the sets would be a RAM overhead and/or create a bunch of extra runtime by calling __eq__() and __hash__(). If this is not the case, I should store the strings directly. (But I think what I've read so far tells me it is the case.)
I'm somewhat new to sets in Python. I've figured out some of what I need to know already, just want to make sure I'm not overlooking something tricky or drawing a false conclusion somewhere.
I might be misunderstanding the question, but this is how Python behaves by default:
class Foo(object):
    pass

a = Foo()
b = Foo()
c = Foo()
x = set([a, b])
y = set([a, b])
z = set([a, c])
print x == y # True
print x == z # False
Do default implementations of these functions in object just do the same thing but faster?
Yes. User-defined classes have __cmp__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns id(x) (see the Python 2 data model docs).
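In other words, Option 2's explicit __eq__/__hash__ overrides buy you nothing over the defaults. A quick sketch:

class Thing(object):
    pass

a, b = Thing(), Thing()
print(a == a)                      # True  -- default equality is identity
print(a == b)                      # False
print(len(set([a, b, Thing()])))   # 3 -- every instance is distinct in a set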

What objects are guaranteed to have different identity?

ORIGINAL QUESTION:
(My question applies to Python 3.2+, but I doubt this has changed since Python 2.7.)
Suppose I use an expression that we usually expect to create an object. Examples: [1,2,3]; 42; 'abc'; range(10); True; open('readme.txt'); MyClass(); lambda x : 2 * x; etc.
Suppose two such expressions are executed at different times and "evaluate to the same value" (i.e., have the same type, and compare as equal). Under what conditions does Python provide what I call a distinct object guarantee that the two expressions actually create two distinct objects (i.e., x is y evaluates as False, assuming the two objects are bound to x and y, and both are in scope at the same time)?
I understand that for objects of any mutable type, the "distinct object guarantee" holds:
x = [1,2]
y = [1,2]
assert x is not y # guaranteed to pass
I also know for certain immutable types (str, int) the guarantee does not hold; and for certain other immutable types (bool, NoneType), the opposite guarantee holds:
x = True
y = not not x
assert x is not y # guaranteed to fail
x = 2
y = 3 - 1
assert x is not y # implementation-dependent; likely to fail in CPython
x = 1234567890
y = x + 1 - 1
assert x is not y # implementation-dependent; likely to pass in CPython
But what about all the other immutable types?
In particular, can two tuples created at different times have the same identity?
The reason I'm interested in this is that I represent nodes in my graph as tuples of int, and the domain model is such that any two nodes are distinct (even if they are represented by tuples with the same values). I need to create sets of nodes. If Python guarantees that tuples created at different times are distinct objects, I could simply subclass tuple to redefine equality to mean identity:
class DistinctTuple(tuple):
    __hash__ = tuple.__hash__
    def __eq__(self, other):
        return self is other

x = (1,2)
y = (1,2)
s = set([x, y])
assert len(s) == 1 # pass; but not what I want

x = DistinctTuple(x)
y = DistinctTuple(y)
s = set([x, y])
assert len(s) == 2 # pass; as desired
But if tuples created at different times are not guaranteed to be distinct, then the above is a terrible technique, which hides a dormant bug that may appear at random and may be very hard to replicate and find. In that case, subclassing won't help; I will actually need to add to each tuple, as an extra element, a unique id. Alternatively, I can convert my tuples to lists. Either way, I'd use more memory. Obviously, I'd prefer not to use these alternatives unless my original subclassing solution is unsafe.
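For what it's worth, the unique-id alternative mentioned above can be a one-liner per node; make_node and _ids below are hypothetical names, purely for illustration:

import itertools

_ids = itertools.count()

def make_node(x, y):
    # A plain tuple whose first element is a unique id, so two nodes built
    # from the same (x, y) still compare (and hash) as distinct.
    return (next(_ids), x, y)

a = make_node(1, 2)
b = make_node(1, 2)
print(a == b)        # False
print(len({a, b}))   # 2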
My guess is that Python does not offer the "distinct object guarantee" for immutable types, either built-in or user-defined. But I haven't found a clear statement about it in the documentation.
UPDATE 1:
@LuperRouch @larsmans Thank you for the discussion and the answer so far. Here's the last issue I'm still unclear about:
Is there any chance that the creation of an object of a user-defined
type results in a reuse of an existing object?
If this is possible, I'd like to know how I can verify for any class I work with whether it might exhibit such a behavior.
Here's my understanding. Any time an object of a user-defined class is created, the class' __new__() method is called first. If this method is overridden, nothing in the language would prevent the programmer from returning a reference to an existing object, thus violating my "distinct object guarantee". Obviously, I can observe it by examining the class definition.
I am not sure what happens if a user-defined class does not override __new__() (or explicitly relies on __new__() from the base class). If I write
class MyInt(int):
    pass
the object creation is handled by int.__new__(). I would expect that this means I may sometimes see the following assertion fail:
x = MyInt(1)
y = MyInt(1)
assert x is not y # may fail, since int.__new__() might return the same object twice?
But in my experimentation with CPython I could not achieve such behavior. Does this mean the language provides "distinct object guarantee" for user-defined classes that don't override __new__, or is it just an arbitrary implementation behavior?
UPDATE 2:
While my DistinctTuple turned out to be a perfectly safe implementation, I now understand that my design idea of using DistinctTuple to model nodes is very bad.
The identity operator is already available in the language; making == behave in the same way as is is logically superfluous.
Worse, where == could have done something useful, I made it unavailable. For instance, it's quite likely that somewhere in my program I'll want to see if two nodes are represented by the same pair of integers; == would have been perfect for that - and in fact, that's what it does by default...
Worse yet, most people actually do expect == to compare some "value" rather than identity - even for a user-defined class. They would be caught unawares with my override that only looks at identity.
Finally... the only reason I had to redefine == was to allow multiple nodes with the same tuple representation to be part of a set. This is the wrong way to go about it! It's not == behavior that needs to change, it's the container type! I simply needed to use multisets instead of sets.
In short, while my question may have some value for other situations, I am absolutely convinced that creating class DistinctTuple is a terrible idea for my use case (and I strongly suspect it has no valid use case at all).
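Regarding the multiset remark in the update: collections.Counter can serve as one, so plain tuples keep their normal value equality (a sketch, not part of the original post):

from collections import Counter

nodes = Counter()
nodes[(1, 2)] += 1
nodes[(1, 2)] += 1   # a second node with the same tuple representation
nodes[(3, 4)] += 1

print(sum(nodes.values()))   # 3 -- total nodes, duplicates included
print(len(nodes))            # 2 -- distinct representations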
Python reference, section 3, Data model:
for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed.
(Emphasis added.)
In practice, it seems CPython only caches the empty tuple:
>>> 1 is 1
True
>>> (1,) is (1,)
False
>>> () is ()
True
Is there any chance that the creation of an object of a user-defined type results in a reuse of an existing object?
This will happen if, and only if, the user-defined type is explicitly designed to do that. With __new__() or some metaclass.
I'd like to know how I can verify for any class I work with whether it might exhibit such a behavior.
Use the source, Luke.
When it comes to int, small integers are pre-allocated, and these pre-allocated integers are used wherever you create or calculate with integers. You can't get this working when you do MyInt(1) is MyInt(1), because what you have there are not plain ints. However:
>>> MyInt(1) + MyInt(1) is 2
True
This is because of course MyInt(1) + MyInt(1) does not return a MyInt. It returns an int, because that's what the __add__ of an integer returns (and that's where the check for pre-allocated integers occur as well). This if anything just shows that subclassing int in general isn't particularly useful. :-)
Does this mean the language provides "distinct object guarantee" for user-defined classes that don't override __new__, or is it just an arbitrary implementation behavior?
It doesn't guarantee it, because there is no need to do so. The default behavior is to create a new object. You have to override it if you don't want that to happen. Having a guarantee makes no sense.
If Python guarantees that tuples created at different times are distinct objects, I could simply subclass tuple to redefine equality to mean identity.
You seem to be confused about how subclassing works: If B subclasses A, then B gets to use all of A's methods[1] -- but the A methods will be working on instances of B, not of A. This holds even for __new__:
--> class Node(tuple):
...     def __new__(cls):
...         obj = tuple.__new__(cls)
...         print(type(obj))
...         return obj
...
--> n = Node()
<class '__main__.Node'>
As @larsmans pointed out in the Python reference:
for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed
However, keep in mind this passage is talking about Python's built-in types, not user-defined types (which can go crazy pretty much any way they like).
I understand the above excerpt to guarantee that Python will not hand back an existing object when a new mutable object is requested; and classes that are user-defined and created in Python code are inherently mutable (again, see note above about crazy user-defined classes).
A more complete Node class (note that you do need to explicitly reference tuple.__hash__, since defining __eq__ would otherwise leave the subclass unhashable in Python 3):
class Node(tuple):
    __slots__ = tuple()
    __hash__ = tuple.__hash__
    def __eq__(self, other):
        return self is other
    def __ne__(self, other):
        return self is not other
--> n1 = Node()
--> n2 = Node()
--> n1 is n2
False
--> n1 == n2
False
--> n1 != n2
True
--> n1 <= n2
True
--> n1 < n2
False
As you can see from the last two comparisons, you may want to also override the __le__ and __ge__ methods.
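If you do want those comparisons to follow the same identity-only semantics, a possible (hedged) extension is:

class Node(tuple):
    __slots__ = ()
    __hash__ = tuple.__hash__
    def __eq__(self, other):
        return self is other
    def __ne__(self, other):
        return self is not other
    def __le__(self, other):     # identity-only, matching __eq__ above
        return self is other
    def __ge__(self, other):
        return self is other

n1, n2 = Node(), Node()
print(n1 <= n2)   # False now, instead of True
print(n1 <= n1)   # True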
[1] The only exception I am aware of is __hash__ -- if __eq__ is defined on the subclass but the subclass wants the parent class' __hash__ it has to explicitly say so (this is a Python 3 change).
