Understanding iterable types in comparisons - python

Recently I ran into cosmologicon's pywats and am now trying to understand the part about fun with iterators:
>>> a = 2, 1, 3
>>> sorted(a) == sorted(a)
True
>>> reversed(a) == reversed(a)
False
OK, sorted(a) returns a list, so sorted(a) == sorted(a) is just a comparison of two lists. But reversed(a) returns a reversed object. So why are these reversed objects different? And comparing the ids confuses me even more:
>>> id(reversed(a)) == id(reversed(a))
True

The basic reason why id(reversed(a)) == id(reversed(a)) returns True, whereas reversed(a) == reversed(a) returns False, can be seen from the example below using a custom class -
>>> class CA:
...     def __del__(self):
...         print('deleted', self)
...     def __init__(self):
...         print('inited', self)
...
>>> CA() == CA()
inited <__main__.CA object at 0x021B8050>
inited <__main__.CA object at 0x021B8110>
deleted <__main__.CA object at 0x021B8050>
deleted <__main__.CA object at 0x021B8110>
False
>>> id(CA()) == id(CA())
inited <__main__.CA object at 0x021B80F0>
deleted <__main__.CA object at 0x021B80F0>
inited <__main__.CA object at 0x021B80F0>
deleted <__main__.CA object at 0x021B80F0>
True
As you can see, when you did customobject == customobject, the objects created on the fly were not destroyed until after the comparison occurred, because they were required for the comparison.
But in the case of id(co) == id(co), each custom object is passed to the id() function, and only the result of id() is needed for the comparison. The object therefore has no references left, so it is garbage collected; when the Python interpreter then creates a new object for the right-hand side of the == operation, it reuses the space that was just freed. Hence the ids come out the same.
This behavior is an implementation detail of CPython (it may or may not differ in other implementations of Python), and you should never rely on the equality of ids. For example, in the case below it gives the wrong result -
>>> a = [1,2,3]
>>> b = [4,5,6]
>>> id(reversed(a)) == id(reversed(b))
True
The reason is again as explained above: the reversed object created for reversed(a) is garbage collected before the reversed object for reversed(b) is created, so the same address gets reused.
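The effect disappears as soon as the first object is kept alive; a small sketch of both situations:

```python
a = [1, 2, 3]
b = [4, 5, 6]

# While both iterators are alive, their ids are guaranteed to differ:
r1 = reversed(a)
r2 = reversed(b)
assert id(r1) != id(r2)

# Without any reference, the first object may already be freed (in CPython)
# before the second is created, so the address can be reused. The result
# below is often True in CPython, but it is never guaranteed:
maybe_same = id(reversed(a)) == id(reversed(b))
```

Two live objects can never share an id, which is why holding references makes the coincidence go away.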
If the lists are large, the most memory-efficient, and probably the fastest, way to compare two iterators for equality is to use the all() built-in function along with zip() in Python 3.x (or itertools.izip() in Python 2.x).
Example for Python 3.x -
all(x==y for x,y in zip(aiterator,biterator))
Example for Python 2.x -
from itertools import izip
all(x==y for x,y in izip(aiterator,biterator))
This is because all() short-circuits at the first False value it encounters, and zip() in Python 3.x returns an iterator which yields the corresponding elements from the two iterators as it goes. This does not need to create a separate list in memory.
Demo -
>>> a = [1,2,3]
>>> b = [4,5,6]
>>> all(x==y for x,y in zip(reversed(a),reversed(b)))
False
>>> all(x==y for x,y in zip(reversed(a),reversed(a)))
True
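One caveat (my addition, not from the original answer): zip() stops at the shorter iterator, so an iterator that is a strict prefix of another would compare equal. itertools.zip_longest with a unique sentinel fill value catches length mismatches as well:

```python
from itertools import zip_longest

# plain zip() stops at the shorter iterator, so a strict prefix looks equal
prefix_equal = all(x == y for x, y in zip(iter([1, 2]), iter([1, 2, 3])))

# zip_longest with a unique sentinel also flags the length mismatch,
# because object() can never compare equal to a real element
_sentinel = object()
really_equal = all(x == y for x, y in
                   zip_longest(iter([1, 2]), iter([1, 2, 3]),
                               fillvalue=_sentinel))
# prefix_equal is True, really_equal is False
```

Use the zip_longest variant whenever the two iterators are not known to have the same length.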

sorted returns a list, whereas reversed returns a reversed object, and each call produces a distinct object. If you cast the result of reversed to a list before the comparison, they will be equal.
In [8]: reversed(a)
Out[8]: <reversed at 0x2c98d30>
In [9]: reversed(a)
Out[9]: <reversed at 0x2c989b0>

reversed returns an iterator that doesn't implement a specific __eq__ operator, and is therefore compared by identity.
The confusion about id(reversed(a)) == id(reversed(a)) arises because, after the first id(...) call is evaluated, the iterator can be disposed of (nothing references it), and the second iterator may be allocated at the very same memory address when the second id(...) call is made. This is, however, just a coincidence.
Try
ra1 = reversed(a)
ra2 = reversed(a)
and compare id(ra1) with id(ra2); you will see they are different numbers (because in this case the iterator objects cannot be deallocated while they are referenced by the ra1/ra2 variables).

You may try list(reversed(a)) == list(reversed(a)), which will return True:
list(reversed(a))
[3, 2, 1]
Now try:
>>> v = id(reversed(a))
>>> n = id(reversed(a))
>>> v == n
False
again
>>> v = id(reversed(a))
>>> n = id(reversed(a))
>>> n1 = id(reversed(a))
>>> v == n1
True

Related

Why does copy.deepcopy() behave differently for tuples than lists?

My understanding of deep copies is that they replace references to objects with new copies of those objects. Consider:
>>> o = [1, 2, 3]
>>> l = [o]
>>> c = deepcopy(l)
>>> c[0] is l[0]
False
Compared to this:
>>> o = (1, 2, 3)
>>> l = [o]
>>> c = deepcopy(l)
>>> c[0] is l[0]
True
Why is the behaviour different?
deepcopy is redundant for immutable objects, because there's no practical way to tell the difference between a copy and the original. Yes you can use is or id() but those don't tell you much about the object itself.
A tuple is deeply immutable as long as all of the elements it contains are immutable. Numbers and strings are immutable, so they make good tuple members. A list is never immutable.
A class may implement a method __deepcopy__ and if it does, that function will be called to make the copy. That function may return the original object or a new object depending on the properties of the class.
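For illustration, here is a minimal hypothetical class (my sketch, not from the original answer) whose __deepcopy__ returns the original object, which is safe precisely because instances can never change:

```python
from copy import deepcopy

class FrozenPoint:
    """Hypothetical immutable value type."""
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        # bypass our own __setattr__ guard during construction
        object.__setattr__(self, 'x', x)
        object.__setattr__(self, 'y', y)

    def __setattr__(self, name, value):
        raise AttributeError('FrozenPoint is immutable')

    def __deepcopy__(self, memo):
        return self  # no copy needed: the object can never change

p = FrozenPoint(1, 2)
assert deepcopy(p) is p
assert deepcopy([p])[0] is p  # the hook is honored inside containers too
```

This mirrors what deepcopy itself does for built-in immutables such as ints, strings, and tuples of immutables.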
As John Gordon pointed out in the comments, this has nothing to do with deepcopy. Looks like your Python implementation re-uses the same object for equal tuples of literals, no matter where they appear. As in:
a = (1, 2, 3)
b = (1, 2, 3)
a is b # True
This is an implementation detail that you cannot rely on. In CPython, the most common implementation, that line used to evaluate to False. Nobody guarantees that it won't return False again next year. I plugged the same into an online interpreter that relies on the Skulpt implementation and it returns False too (code).
Just use == for comparison!
Why does CPython store equal tuples in the same memory location? I can only speculate but it's probably to conserve memory.
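Regardless of whether an implementation caches equal tuples, value comparison with == is always reliable; a quick sketch:

```python
a = (1, 2, 3)
b = tuple([1, 2, 3])  # built at runtime, so unlikely to be a cached constant

assert a == b   # value equality always holds
shared = a is b  # identity is an implementation detail; do not rely on it
```

Whether `shared` is True or False can vary between implementations, versions, and even between a script and the interactive prompt.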

Why can't you use a memory address of an unhashable type as a key for a dict?

I understand that since unhashable types like lists are mutable, they cannot be used as dict keys. However, I don't see why their memory address (which I don't believe changes) can't be used as the hash?
For example:
my_list = [1,2,3]
my_dict = {my_list: 1} # TypeError: unhashable type: 'list'
my_dict = {id(my_list): 1} # no error
You actually can use the memory address of an object as a hash function if you extend list, set, etc.
The primary reason using a memory address for a hash is bad is because if two objects are equal (a == b evaluates to True) we also want their hashes to be equal (hash(a) == hash(b) to be True). Otherwise, we could get unintended behavior.
To see an example of this, let's create our own class that extends list and use the memory address of the object as a hash function.
>>> class HashableList(list):
...     def __hash__(self):
...         return id(self)  # Returns the memory address of the object
Now we can create two hashable lists! Our HashableList uses the same constructor as python's built-in list.
>>> a = HashableList((1, 2, 3))
>>> b = HashableList((1, 2, 3))
Sure enough, as we would expect, we get
>>> a == b
True
And we can hash our lists!
>>> hash(a)
1728723187976
>>> hash(b)
1728723187816
>>> hash(a) == hash(b)
False
If you look at the last 3 digits, you'll see a and b are close to each other in memory, but aren't in the same location. Since we're using the memory address as our hash, that also means their hashes aren't equal.
What happens if we compare the built-in hash of two equal tuples (or any other hashable objects)?
>>> y = ('foo', 'bar')
>>> z = ('foo', 'bar')
>>> y == z
True
>>> hash(y)
-1256824942587948134
>>> hash(z)
-1256824942587948134
>>> hash(y) == hash(z)
True
If you try this on your own, your hash of ('foo', 'bar') won't match mine, since string hashes change every time a new Python session starts (hash randomization). The important thing is that, within the same session, hash(y) will always equal hash(z).
Let's see what happens if we make a set, and play around with the HashableList objects and the tuples we made.
>>> s = set()
>>> s.add(a)
>>> s.add(y)
>>> s
{[1, 2, 3], ('foo', 'bar')}
>>> a in s # Since hash(a) == hash(a), we can find a in our set
True
>>> y in s # Since hash(y) == hash(y), we can find y in our set
True
>>> b in s
False
>>> z in s
True
Even though a == b, b in s is False because hash(b) doesn't equal hash(a), so we couldn't find our equivalent list in the set!
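If you really do want list-like objects usable as set or dict keys, hash by contents rather than by address so that equal objects hash equal. A sketch (my addition); note it is only safe if the object is never mutated while it is in use as a key:

```python
class ValueHashList(list):
    """Hypothetical list subclass hashed by its contents."""
    def __hash__(self):
        # equal lists now produce equal hashes, unlike id()-based hashing
        return hash(tuple(self))

a = ValueHashList((1, 2, 3))
b = ValueHashList((1, 2, 3))
assert a == b and hash(a) == hash(b)

s = {a}
assert b in s  # the equal list is found, as expected
```

This restores the invariant the answer describes: a == b implies hash(a) == hash(b).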

Why does the original list change?

Running this:
a = [[1], [2]]
for i in a:
    i *= 2
print(a)
Gives
[[1, 1], [2, 2]]
I would expect to get the original list, as happens here:
a = [1, 2]
for i in a:
    i *= 2
print(a)
Which gives:
[1, 2]
Why is the list in the first example being modified?
You are using augmented assignment statements. These operate on the object named on the left-hand side, giving that object the opportunity to update in-place:
An augmented assignment expression like x += 1 can be rewritten as x = x + 1 to achieve a similar, but not exactly equal effect. In the augmented version, x is only evaluated once. Also, when possible, the actual operation is performed in-place, meaning that rather than creating a new object and assigning that to the target, the old object is modified instead.
(bold emphasis mine).
This is achieved by letting objects implement __i[op]__ methods; for *= that's the __imul__ hook:
These methods are called to implement the augmented arithmetic assignments (+=, -=, *=, @=, /=, //=, %=, **=, <<=, >>=, &=, ^=, |=). These methods should attempt to do the operation in-place (modifying self) and return the result (which could be, but does not have to be, self).
Using *= on a list multiplies that list object and returns the same list object (self) to be 'assigned' back to the same name.
Integers on the other hand are immutable objects. Arithmetic operations on integers return new integer objects, so int objects do not even implement the __imul__ hook; Python has to fall back to executing i = i * 2 in that case.
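You can verify this asymmetry directly:

```python
# list defines the in-place hook, so *= mutates the list object itself;
# int does not, so Python falls back to i = i * 2 and rebinds the name
assert hasattr(list, '__imul__')
assert not hasattr(int, '__imul__')
```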
So for the first example, the code:
a = [[1], [2]]
for i in a:
    i *= 2
really does this (with the loop unrolled for illustration purposes):
a = [[1], [2]]
i = a[0].__imul__(2) # a[0] is altered in-place
i = a[1].__imul__(2) # a[1] is altered in-place
where the list.__imul__ method applies the change to the list object itself, and returns the reference to the list object.
For integers, this is executed instead:
a = [1, 2]
i = a[0] * 2 # a[0] is not affected
i = a[1] * 2 # a[1] is not affected
So now the new integer objects are assigned to i, which is independent from a.
The reason your results differ between the two examples is that lists are mutable, but integers are not.
Since an integer object cannot be modified in place, the operator must return a new integer object. Since lists are mutable, the changes are simply applied to the already existing list object.
So when you used *= in the for-loop in the first example, Python modified the already existing inner lists. But when you used *= with integers, a new integer object had to be returned each time.
This can also be observed with a simple example. Instead of comparing raw addresses across statements, save the ids first:
>>> a = 1
>>> b = [1]
>>>
>>> id_a_before = id(a)
>>> id_b_before = id(b)
>>>
>>> a *= 2
>>> b *= 2
>>>
>>> id(a) == id_a_before
False
>>> id(b) == id_b_before
True
As you can see above, the id of a changed when we used *=, so a new integer object was returned. But the id of the list did not change. That means *= modified the list object in-place.
In the first case, the i in the for loop is a list, so you're telling Python: take the ith list and repeat it twice. You're basically repeating the list 2 times; this is what the * operator does to a list. In the second case, i is a number, so you're applying * to a value, not a list.
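Both behaviors side by side, plus a non-mutating alternative (a small sketch of my own):

```python
a = [[1], [2]]
for i in a:
    i *= 2              # list.__imul__ mutates each inner list in place
assert a == [[1, 1], [2, 2]]

b = [1, 2]
for i in b:
    i *= 2              # ints are immutable: this only rebinds the name i
assert b == [1, 2]

# To double the values without mutating anything, build a new list instead:
doubled = [x * 2 for x in b]
assert doubled == [2, 4]
```

The list comprehension is the idiomatic way to derive new values when you do not want the loop to touch the original data.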

Why does "[] is [ ]" evaluate to False in python

Try this in an interactive python shell.
[] is [ ]
The above returns False, why?
You created two mutable objects, then used is to see if those are the same object. That should definitely return False, or something would be broken.
You wouldn't ever want is to return true here. Imagine if you did this:
foo = []
bar = []
foo.append(42)
then you'd be very surprised if bar now contained 42. If is returned True, meaning that both [] invocations returned the exact same object, then appending to foo would be visible through bar.
For immutable objects, it makes sense to cache objects, at which point is may return true, like with empty tuples:
>>> () is () # are these two things the same object?
True
The CPython implementation has optimised empty tuple creation; you'll always get the exact same object, because that saves memory and makes certain operations faster. Because tuples are immutable, this is entirely safe.
If you expected to test for value equality instead, then you got the wrong operator. Use the == operator instead:
>>> [] == [] # do these two objects have the same value?
True
In Python, is does a reference-equality check. [] and [] are different objects; you can check that with:
print id([]),id([])
or
In [1]: id([])
Out[1]: 140464629086976
In [2]: id([])
Out[2]: 140464628521656
Both return different addresses, and the two objects are different, so is will always give False:
[] is []
output:
False
[] is like list(). If you do this:
a = list()
b = list()
clearly a and b are two completely different objects, hence:
a is b # False
like
list() is list() # False
like
[] is [] # False
The == operator compares the values of the two operands and checks for value equality, whereas the is operator checks whether both operands refer to the same object.
id('') : 139634828889200
id('') : 139634828889200
id('') : 139634828889200
id([]) : 139634689473416
id([]) : 139634689054536
id([]) : 139634742570824

Does a slicing operation give me a deep or shallow copy?

The official Python docs say that using the slicing operator and assigning in Python makes a shallow copy of the sliced list.
But when I write code for example:
o = [1, 2, 4, 5]
p = o[:]
And when I write:
id(o)
id(p)
I get different ids, and appending to one list is not reflected in the other. Isn't it creating a deep copy, or am I going wrong somewhere?
You are creating a shallow copy, because nested values are not copied, merely referenced. A deep copy would create copies of the values referenced by the list too.
Demo:
>>> lst = [{}]
>>> lst_copy = lst[:]
>>> lst_copy[0]['foo'] = 'bar'
>>> lst_copy.append(42)
>>> lst
[{'foo': 'bar'}]
>>> id(lst) == id(lst_copy)
False
>>> id(lst[0]) == id(lst_copy[0])
True
Here the nested dictionary is not copied; it is merely referenced by both lists. The new element 42 is not shared.
Remember that everything in Python is an object, and names and list elements are merely references to those objects. A copy of a list creates a new outer list, but the new list merely receives references to the exact same objects.
A proper deep copy creates new copies of each and every object contained in the list, recursively:
>>> from copy import deepcopy
>>> lst_deepcopy = deepcopy(lst)
>>> id(lst_deepcopy[0]) == id(lst[0])
False
You should know that tests using is or id can be misleading about whether a true copy has been made when dealing with immutable and interned objects, such as strings, integers, and tuples that contain only immutables.
Consider an easily understood example of interned strings:
>>> l1=['one']
>>> l2=['one']
>>> l1 is l2
False
>>> l1[0] is l2[0]
True
Now make a shallow copy of l1 and test the immutable string:
>>> l3=l1[:]
>>> l3 is l1
False
>>> l3[0] is l1[0]
True
Now make a copy of the string contained by l1[0]:
>>> s1=l1[0][:]
>>> s1
'one'
>>> s1 is l1[0] is l2[0] is l3[0]
True # they are all the same object
Try a deepcopy where every element should be copied:
>>> from copy import deepcopy
>>> l4=deepcopy(l1)
>>> l4[0] is l1[0]
True
In each case, the string 'one' is interned into Python's internal cache of immutable strings, and is shows that they are the same object (they have the same id). What gets interned, and when, is implementation- and version-dependent, so you cannot depend on it; it can, however, be a substantial memory and performance enhancement.
You can force an example that does not get interned instantly:
>>> s2=''.join(c for c in 'one')
>>> s2==l1[0]
True
>>> s2 is l1[0]
False
And then you can use the intern function (a builtin in Python 2; sys.intern in Python 3) to make that string refer to the cached object if found:
>>> l1[0] is s2
False
>>> s2=intern(s2)
>>> l1[0] is s2
True
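In Python 3 the function lives in the sys module; the same experiment looks like this (a sketch; the final identity check relies on CPython interning identifier-like string literals, which is an implementation detail):

```python
import sys

s = ''.join(c for c in 'one')  # built at runtime, not automatically interned
t = 'one'                      # a literal; CPython interns identifier-like literals

assert s == t                  # value equality, always true
s = sys.intern(s)              # returns the canonical cached string
assert s is t                  # identity now holds in CPython
```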
Same applies to tuples of immutables:
>>> t1=('one','two')
>>> t2=t1[:]
>>> t1 is t2
True
>>> t3=deepcopy(t1)
>>> t3 is t2 is t1
True
And mutable lists of immutables (like small integers) can have their members interned:
>>> li1=[1,2,3]
>>> li2=deepcopy(li1)
>>> li2 == li1
True
>>> li2 is li1
False
>>> li1[0] is li2[0]
True
So you may use Python operations that you KNOW will copy something, but the end result can still be another reference to an interned immutable object. The is test is dispositive of whether a copy was made only IF the items are mutable.
