I find it odd that in Python, {1} == frozenset({1}) evaluates to True. set and frozenset are different object types, and I don't see this similarity between other iterable object types (ex. {1} == (1,) evaluates to False). Why does this behavior occur with sets? Are there other iterable object types that have similar behavior?
As per the documentation python2 and documentation python3
Instances of set are compared to instances of frozenset based on
their members. For example, "set('abc') == frozenset('abc')" returns
True.
and in the python3 documentation:
Both set and frozenset support set to set comparisons. Two sets are equal if and only if every element of each set is contained in the other (each is a subset of the other). A set is less than another set if and only if the first set is a proper subset of the second set (is a subset, but is not equal). A set is greater than another set if and only if the first set is a proper superset of the second set (is a superset, but is not equal).
Related
This question already has answers here:
What does "hashable" mean in Python?
(10 answers)
Closed last month.
If we execute the two following lines in Python:
a = (1,)
{a}
it won't cause any problems. Instead, if we execute the two following lines:
a = ([1],)
{a}
now what we get is "TypeError: unhashable type: 'list'".
In both cases, we're building a set consisting of elements whose type is a tuple, which is immutable (in the first case (1,) and in the second case ([1],)). In the second case, however, our tuple contains an object of mutable type, i.e., the list [1].
It seems that the condition that the elements of a set must be immutable is not enough to guarantee that sets are successfully created. What is the exact condition that won't lead to any error when building a set? What is happening at low level?
As the error message indicates, the actual criterion for whether something can be in a set is whether it is hashable, as stated in the set docs:
Set elements, like dictionary keys, must be hashable.
However, only immutable object should be hashable, as Python's docs on __hash__() explain:
If a class defines mutable objects and implements an __eq__() method, it should not implement __hash__(), since the implementation of hashable collections requires that a key’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
In order for the hash to be representative of the entire contained data, immutable containers like tuples implement hashing by hashing each contained element in turn, with all the elements' hashes then going into the computation of the container's hash. Of course this is continued recursively if any elements are themselves immutable containers.
In your case, hashing the tuple will thus lead to an attempted hashing of the contained list, which fails as lists are not hashable:
>>> hash([1, 2])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
As for why set elements must be hashable: It's because that will allow membership checking to proceed in O(1) time (think hash map lookup). You could make your own set type without this requirement that would just do membership checking by going through the list of elements one by one, of course - that would effectively just be a list with an addition operation that refuses to insert duplicates.
Why does the expression below evaluate to False? Both the values in the comparison are Falsy.
print('' == [])
According to the documentation:
An object’s type determines the operations that the object supports (e.g., “does it have a length?”) and also defines the possible values for objects of that type.
Furthermore, it also states:
== compares the values of two objects
However, since the value of an object in Python is pretty much abstract, in that there is no canonical method to access it, the default behavior of == compares the identity of the two objects (which can be thought of as the object’s address in memory) but is defined as:
An integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same identity.
It's also important to mention that many built-in types have customized comparison methods within their respective types that compare based on their 'values'.
While it's true that both of your objects are Falsy, they are of different types (String and List), and as a result, the default behavior of comparing identities is used. Since there is no chance of the two objects having non-overlapping lifetimes, their identities will be different, and so '' == [] will evaluate to False.
In order to compare the Truthy values of two objects, first convert them to booleans as Waket suggested in the comments:
bool('') == bool([])
taking advantage of Python's truth value testing.
Example:
>>> tuple((1, 2)) == tuple((2, 1))
False
>>> frozenset((1, 2)) == frozenset((2, 1))
True
Frozen sets are immutable. I would expect that equality between immutable objects should by determined by order, but here obviously that is not the case.
How can I discard frozensets with same elements and different order, without casting to different type?
The short answer is you can't, since as pointed out in the comments, sets and frozensets are unordered data structures. Here are some excerpts from the docs* to support this statement:
A set object is an unordered collection of distinct hashable objects.
There are currently two built-in set types, set and frozenset. The set type is mutable — the contents can be changed using methods like add() and remove(). Since it is mutable, it has no hash value and cannot be used as either a dictionary key or as an element of another set. The frozenset type is immutable and hashable — its contents cannot be altered after it is created; it can therefore be used as a dictionary key or as an element of another set.
* Python 2.7.12
For a better grasp of the equality issue, I would encourage you to run the following snippet using the Online Python Tutor:
tup_1 = tuple((1, 2))
tup_2 = tuple((2, 1))
fs_1 = frozenset((1, 2))
fs_2 = frozenset((2, 1))
This is an extremely handy tool that renders a graphical representation of the objects in memory while the code is executed step by step. I'm attaching a screenshot:
The answer by Tonechas is wrong.
frozenset is unordered, but it can be used for equality comparison. Quote from the python official doc:
Both set and frozenset support set to set comparisons. Two sets are equal if and only if every element of each set is contained in the other (each is a subset of the other). A set is less than another set if and only if the first set is a proper subset of the second set (is a subset, but is not equal). A set is greater than another set if and only if the first set is a proper superset of the second set (is a superset, but is not equal).
I have two user-defined objects, say a and b.
Both these objects have the same hash values.
However, the id(a) and id(b) are unequal.
Moreover,
>>> a is b
False
>>> a == b
True
From this observation, can I infer the following?
Unequal objects may have the same hash values.
Equal objects need to have the same id values.
Whenever obj1 is obj2 is called, the id values of both objects is compared, not their hash values.
There are three concepts to grasp when trying to understand id, hash and the == and is operators: identity, value and hash value. Not all objects have all three.
All objects have an identity, though even this can be a little slippery in some cases. The id function returns a number corresponding to an object's identity (in cpython, it returns the memory address of the object, but other interpreters may return something else). If two objects (that exist at the same time) have the same identity, they're actually two references to the same object. The is operator compares items by identity, a is b is equivalent to id(a) == id(b).
Identity can get a little confusing when you deal with objects that are cached somewhere in their implementation. For instance, the objects for small integers and strings in cpython are not remade each time they're used. Instead, existing objects are returned any time they're needed. You should not rely on this in your code though, because it's an implementation detail of cpython (other interpreters may do it differently or not at all).
All objects also have a value, though this is a bit more complicated. Some objects do not have a meaningful value other than their identity (so value an identity may be synonymous, in some cases). Value can be defined as what the == operator compares, so any time a == b, you can say that a and b have the same value. Container objects (like lists) have a value that is defined by their contents, while some other kinds of objects will have values based on their attributes. Objects of different types can sometimes have the same values, as with numbers: 0 == 0.0 == 0j == decimal.Decimal("0") == fractions.Fraction(0) == False (yep, bools are numbers in Python, for historic reasons).
If a class doesn't define an __eq__ method (to implement the == operator), it will inherit the default version from object and its instances will be compared solely by their identities. This is appropriate when otherwise identical instances may have important semantic differences. For instance, two different sockets connected to the same port of the same host need to be treated differently if one is fetching an HTML webpage and the other is getting an image linked from that page, so they don't have the same value.
In addition to a value, some objects have a hash value, which means they can be used as dictionary keys (and stored in sets). The function hash(a) returns the object a's hash value, a number based on the object's value. The hash of an object must remain the same for the lifetime of the object, so it only makes sense for an object to be hashable if its value is immutable (either because it's based on the object's identity, or because it's based on contents of the object that are themselves immutable).
Multiple different objects may have the same hash value, though well designed hash functions will avoid this as much as possible. Storing objects with the same hash in a dictionary is much less efficient than storing objects with distinct hashes (each hash collision requires more work). Objects are hashable by default (since their default value is their identity, which is immutable). If you write an __eq__ method in a custom class, Python will disable this default hash implementation, since your __eq__ function will define a new meaning of value for its instances. You'll need to write a __hash__ method as well, if you want your class to still be hashable. If you inherit from a hashable class but don't want to be hashable yourself, you can set __hash__ = None in the class body.
Unequal objects may have the same hash values.
Yes this is true. A simple example is hash(-1) == hash(-2) in CPython.
Equal objects need to have the same id values.
No this is false in general. A simple counterexample noted by #chepner is that 5 == 5.0 but id(5) != id(5.0).
Whenever, obj1 is obj2 is called, the id values of both the objects is compared, not their hash values.
Yes this is true. is compares the id of the objects for equality (in CPython it is the memory address of the object). Generally, this has nothing to do with the object's hash value (the object need not even be hashable).
The hash function is used to:
quickly compare dictionary keys during a dictionary lookup
the ID function is used to:
Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
I just had an experiment , which assigned same whole number to 2 separated variables , but when I used ' is ' operator to compare them , then in returned True ; I didn't expect such a thing , Thought that the variables must be exact same like==>
a = 10
b = a
a is b # output == True
I'm working through http://www.mypythonquiz.com, and question #45 asks for the output of the following code:
confusion = {}
confusion[1] = 1
confusion['1'] = 2
confusion[1.0] = 4
sum = 0
for k in confusion:
sum += confusion[k]
print sum
The output is 6, since the key 1.0 replaces 1. This feels a bit dangerous to me, is this ever a useful language feature?
First of all: the behaviour is documented explicitly in the docs for the hash function:
hash(object)
Return the hash value of the object (if it has one). Hash values are
integers. They are used to quickly compare dictionary keys during a
dictionary lookup. Numeric values that compare equal have the same
hash value (even if they are of different types, as is the case for 1
and 1.0).
Secondly, a limitation of hashing is pointed out in the docs for object.__hash__
object.__hash__(self)
Called by built-in function hash() and for operations on members of
hashed collections including set, frozenset, and dict. __hash__()
should return an integer. The only required property is that objects
which compare equal have the same hash value;
This is not unique to python. Java has the same caveat: if you implement hashCode then, in order for things to work correctly, you must implement it in such a way that: x.equals(y) implies x.hashCode() == y.hashCode().
So, python decided that 1.0 == 1 holds, hence it's forced to provide an implementation for hash such that hash(1.0) == hash(1). The side effect is that 1.0 and 1 act exactly in the same way as dict keys, hence the behaviour.
In other words the behaviour in itself doesn't have to be used or useful in any way. It is necessary. Without that behaviour there would be cases where you could accidentally overwrite a different key.
If we had 1.0 == 1 but hash(1.0) != hash(1) we could still have a collision. And if 1.0 and 1 collide, the dict will use equality to be sure whether they are the same key or not and kaboom the value gets overwritten even if you intended them to be different.
The only way to avoid this would be to have 1.0 != 1, so that the dict is able to distinguish between them even in case of collision. But it was deemed more important to have 1.0 == 1 than to avoid the behaviour you are seeing, since you practically never use floats and ints as dictionary keys anyway.
Since python tries to hide the distinction between numbers by automatically converting them when needed (e.g. 1/2 -> 0.5) it makes sense that this behaviour is reflected even in such circumstances. It's more consistent with the rest of python.
This behaviour would appear in any implementation where the matching of the keys is at least partially (as in a hash map) based on comparisons.
For example if a dict was implemented using a red-black tree or an other kind of balanced BST, when the key 1.0 is looked up the comparisons with other keys would return the same results as for 1 and so they would still act in the same way.
Hash maps require even more care because of the fact that it's the value of the hash that is used to find the entry of the key and comparisons are done only afterwards. So breaking the rule presented above means you'd introduce a bug that's quite hard to spot because at times the dict may seem to work as you'd expect it, and at other times, when the size changes, it would start to behave incorrectly.
Note that there would be a way to fix this: have a separate hash map/BST for each type inserted in the dictionary. In this way there couldn't be any collisions between objects of different type and how == compares wouldn't matter when the arguments have different types.
However this would complicate the implementation, it would probably be inefficient since hash maps have to keep quite a few free locations in order to have O(1) access times. If they become too full the performances decrease. Having multiple hash maps means wasting more space and also you'd need to first choose which hash map to look at before even starting the actual lookup of the key.
If you used BSTs you'd first have to lookup the type and the perform a second lookup. So if you are going to use many types you'd end up with twice the work (and the lookup would take O(log n) instead of O(1)).
You should consider that the dict aims at storing data depending on the logical numeric value, not on how you represented it.
The difference between ints and floats is indeed just an implementation detail and not conceptual. Ideally the only number type should be an arbitrary precision number with unbounded accuracy even sub-unity... this is however hard to implement without getting into troubles... but may be that will be the only future numeric type for Python.
So while having different types for technical reasons Python tries to hide these implementation details and int->float conversion is automatic.
It would be much more surprising if in a Python program if x == 1: ... wasn't going to be taken when x is a float with value 1.
Note that also with Python 3 the value of 1/2 is 0.5 (the division of two integers) and that the types long and non-unicode string have been dropped with the same attempt to hide implementation details.
In python:
1==1.0
True
This is because of implicit casting
However:
1 is 1.0
False
I can see why automatic casting between float and int is handy, It is relatively safe to cast int into float, and yet there are other languages (e.g. go) that stay away from implicit casting.
It is actually a language design decision and a matter of taste more than different functionalities
Dictionaries are implemented with a hash table. To look up something in a hash table, you start at the position indicated by the hash value, then search different locations until you find a key value that's equal or an empty bucket.
If you have two key values that compare equal but have different hashes, you may get inconsistent results depending on whether the other key value was in the searched locations or not. For example this would be more likely as the table gets full. This is something you want to avoid. It appears that the Python developers had this in mind, since the built-in hash function returns the same hash for equivalent numeric values, no matter if those values are int or float. Note that this extends to other numeric types, False is equal to 0 and True is equal to 1. Even fractions.Fraction and decimal.Decimal uphold this property.
The requirement that if a == b then hash(a) == hash(b) is documented in the definition of object.__hash__():
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
TL;DR: a dictionary would break if keys that compared equal did not map to the same value.
Frankly, the opposite is dangerous! 1 == 1.0, so it's not improbable to imagine that if you had them point to different keys and tried to access them based on an evaluated number then you'd likely run into trouble with it because the ambiguity is hard to figure out.
Dynamic typing means that the value is more important than what the technical type of something is, since the type is malleable (which is a very useful feature) and so distinguishing both ints and floats of the same value as distinct is unnecessary semantics that will only lead to confusion.
I agree with others that it makes sense to treat 1 and 1.0 as the same in this context. Even if Python did treat them differently, it would probably be a bad idea to try to use 1 and 1.0 as distinct keys for a dictionary. On the other hand -- I have trouble thinking of a natural use-case for using 1.0 as an alias for 1 in the context of keys. The problem is that either the key is literal or it is computed. If it is a literal key then why not just use 1 rather than 1.0? If it is a computed key -- round off error could muck things up:
>>> d = {}
>>> d[1] = 5
>>> d[1.0]
5
>>> x = sum(0.01 for i in range(100)) #conceptually this is 1.0
>>> d[x]
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
d[x]
KeyError: 1.0000000000000007
So I would say that, generally speaking, the answer to your question "is this ever a useful language feature?" is "No, probably not."