Equality of copy.copy and copy.deepcopy in the Python copy module
I am creating a list of numpy arrays and then copying it to another array to keep an original copy. The copying was done with the deepcopy() function. When I compare the two arrays now, the equality test returns False, but everything is fine when I use the copy() function. I understand the difference between copy and deepcopy, but shouldn't the equality result be the same?
That is:
import copy
import numpy as np

grid1 = np.empty([3, 3], dtype=object)
for i in range(3):
    for j in range(3):
        grid1[i][j] = [i, np.random.uniform(-3.5, 3.5, (3, 3))]

grid_init = copy.deepcopy(grid1)
grid1 == grid_init                # returns False

grid_init = copy.copy(grid1)
grid1 == grid_init                # returns True

grid_init = copy.deepcopy(grid1)
np.array_equal(grid1, grid_init)  # returns False
Shouldn't all three comparisons return True?
This is what I'm getting when running the first example:
WARNING:py.warnings:/usr/local/bin/ipython:1: DeprecationWarning: elementwise comparison failed; this will raise the error in the future.
To see why the elementwise comparison fails, simply try to compare a single element:
grid_init=copy.deepcopy(grid1)
grid_init[0][0] == grid1[0][0]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This fails because the second element of each list is itself a numpy array, and comparing two numpy arrays does not return a bool (it returns an array of bools).
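To see that elementwise result in isolation (a minimal sketch, separate from the grids above):

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])
a == b        # array([ True,  True,  True]): an array, not a bool
bool(a == b)  # raises the ValueError quoted above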
Now, why does the copy() example behave differently?
It seems to be an interpreter optimization that skips the actual comparison logic when the two objects are one and the same. Here they are the same object, because the copy was shallow.
grid_init=copy.copy(grid1)
grid_init[0][0] is grid1[0][0]
> True
grid_init[0][0] == grid1[0][0]
> True
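A minimal sketch of that shortcut with a standalone list (it relies on CPython checking identity before calling == on each list element):

arr = np.arange(3)
[0, arr] == [0, arr]         # True: both lists hold the very same array object
[0, arr] == [0, arr.copy()]  # raises ValueError: equal values, but different objects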
The root cause is that you're using a numpy array of dtype=object, with lists in it. This is not a good idea, and can lead to all sorts of weirdnesses.
Instead, you should simply create 2 aligned arrays, one for the first element in your lists, and one for the second.
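For example, a sketch of that layout (the names idx and vals are made up for illustration):

idx = np.empty((3, 3), dtype=int)  # the integer part of each list
vals = np.empty((3, 3, 3, 3))      # the 3x3 array part of each list
for i in range(3):
    for j in range(3):
        idx[i, j] = i
        vals[i, j] = np.random.uniform(-3.5, 3.5, (3, 3))

np.array_equal(vals, copy.deepcopy(vals))  # True: plain float arrays compare cleanly
np.array_equal(idx, copy.deepcopy(idx))    # True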
I must be running a different version of numpy/python, because I get slightly different errors and/or results. Still, the same issue applies: mixing arrays and lists can produce complicated results.
Make the 2 copies:
In [217]: x=copy.copy(grid1)
In [218]: y=copy.deepcopy(grid1)
Equality with the shallow copy gives an element-by-element comparison, a 3x3 boolean array:
In [219]: x==grid1
Out[219]:
array([[ True, True, True],
[ True, True, True],
[ True, True, True]], dtype=bool)
The elements are 2-item lists:
In [220]: grid1[0,0]
Out[220]:
[0, array([[ 2.08833787, -0.24595155, -3.15694342],
[-3.05157909, 1.83814619, -0.78387624],
[ 1.70892355, -0.87361521, -0.83255383]])]
And in the shallow copy, the list ids are the same. The 2 arrays have different data buffers (x is not a view), but they both point to the same list objects (located elsewhere in memory).
In [221]: id(grid1[0,0])
Out[221]: 2958477004
In [222]: id(x[0,0])
Out[222]: 2958477004
With the same id the lists are equal (they also satisfy the is test).
In [234]: grid1[0,0]==x[0,0]
Out[234]: True
But == with the deepcopy produces a plain False, with no element-by-element comparison. I'm not sure why; maybe this is an area in which numpy is still evolving.
In [223]: y==grid1
Out[223]: False
Note that the deepcopy element ids are different:
In [229]: id(y[0,0])
Out[229]: 2957009900
When I try to apply == to an element of these arrays I get an error:
In [235]: grid1[0,0]==y[0,0]
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This is the error that comes up repeatedly in SO questions, usually because people try to use a boolean array (from a comparison) in a scalar Python context.
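A minimal illustration of that trap and the usual fix:

flags = np.array([True, False, True])
# if flags: ...   would raise the same ValueError: no single truth value
flags.any()       # True:  at least one element is True
flags.all()       # False: not every element is True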
I can compare the arrays within the lists:
In [236]: grid1[0,0][1]==y[0,0][1]
Out[236]:
array([[ True, True, True],
[ True, True, True],
[ True, True, True]], dtype=bool)
I can reproduce the ValueError with a simpler comparison: two lists that each contain an array. On the surface they look the same, but because the arrays have different ids, the comparison fails.
In [239]: [0,np.arange(3)]==[0,np.arange(3)]
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This pair of comparisons shows what is going on:
In [242]: [0,np.arange(3)][0]==[0,np.arange(3)][0]
Out[242]: True
In [243]: [0,np.arange(3)][1]==[0,np.arange(3)][1]
Out[243]: array([ True, True, True], dtype=bool)
Python compares the respective elements of the lists and then tries to combine the results with a logical reduction, all(). But it can't perform all() on [True, array([True, True, True])], because the embedded boolean array has no single truth value.
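The same reduction fails in isolation:

all([True, np.array([True, True, True])])
# raises ValueError: all() calls bool() on the embedded array, which is ambiguous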
So in my version, y==grid1 returns False because the element-by-element comparisons raise ValueErrors; numpy's choice is either to return False or to raise an error or warning. Either way, the two clearly aren't equal.
In sum, with this array of lists holding a number and an array, equality tests end up mixing array operations and list operations. The outcomes are logical, but complicated. You have to be keenly aware of how arrays are compared and how lists are compared; they are not interchangeable.
A structured array
You could put this data in a structured array, with a dtype like this:
In [263]: dt = np.dtype([('f0',int),('f1',float,(3,3))])
In [264]: grid2=np.empty([3,3],dtype=dt)
In [265]: for i in range(3):
   .....:     for j in range(3):
   .....:         grid2[i][j] = (i, np.random.uniform(-3.5, 3.5, (3, 3)))
   .....:
In [266]: grid2
Out[266]:
array([[ (0,
[[2.719807845330254, -0.6379512247418969, -0.02567206509563602],
[0.9585030371031278, -1.0042751112999135, -2.7805349057485946],
[-2.244526250770717, 0.5740647379258945, 0.29076071288760574]]),
....]])]],
dtype=[('f0', '<i4'), ('f1', '<f8', (3, 3))])
The first field, the integers, can be fetched by field name, giving a 3x3 array:
In [267]: grid2['f0']
Out[267]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
The second field contains the 3x3 arrays; accessed by field name, they form a single 4d array:
In [269]: grid2['f1'].shape
Out[269]: (3, 3, 3, 3)
A single element is a record (or tuple):
In [270]: grid2[2,1]
Out[270]: (2, [[1.6236266210555836, -2.7383730706629636, -0.46604477485902374], [-2.781740733659544, 0.7822732671353201, 3.0054266762730473], [3.3135671425199824, -2.7466097112667103, -0.15205961855874406]])
Now both kinds of copy produce the same thing:
In [271]: x=copy.copy(grid2)
In [272]: y=copy.deepcopy(grid2)
In [273]: x==grid2
Out[273]:
array([[ True, True, True],
[ True, True, True],
[ True, True, True]], dtype=bool)
In [274]: y==grid2
Out[274]:
array([[ True, True, True],
[ True, True, True],
[ True, True, True]], dtype=bool)
Since grid2 is a pure ndarray (no intermediate lists), I suspect copy.copy and copy.deepcopy both end up using grid2.copy(). In numpy we normally use the array's copy method and don't bother with the copy module.
P.S. It appears that with dtype=object, grid1.copy() is the same as copy.copy(grid1): a new array, but with the same object pointers (i.e. the same data).
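A quick check of that (a sketch reusing grid1 from the question):

shallow = grid1.copy()
shallow is grid1              # False: a new array object
shallow[0, 0] is grid1[0, 0]  # True: it still points at the same lists
deep = copy.deepcopy(grid1)
deep[0, 0] is grid1[0, 0]     # False: deepcopy recursed into the lists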
Related
Is it possible to vectorize this numpy array comparison?
I have these two numpy arrays in Python:

a = np.array(sorted(np.random.rand(6)*6))  # It is sorted.
b = np.array(np.random.rand(3)*6)

Say that the arrays are

a = array([0.27148588, 0.42828064, 2.48130785, 4.01811243, 4.79403723, 5.46398145])
b = array([0.06231266, 1.64276013, 5.22786201])

I want to produce an array containing, for each element of b, the index of the last element of a that is smaller than it, i.e. I want exactly this:

np.argmin(np.array([a<b_i for b_i in b]),1)-1

which produces

array([-1, 1, 4])

meaning that b[0]<a[0], a[1]<b[1]<a[2] and a[4]<b[2]<a[5]. Is there any native numpy fast vectorized way of doing this, avoiding the for loop?
To answer your specific question, i.e., a vectorized way to get the equivalent of np.array([a<b_i for b_i in b]), you can take advantage of broadcasting. Here, you could use:

a[None, ...] < b[..., None]

So:

>>> a[None, ...] < b[..., None]
array([[False, False, False, False, False, False],
       [ True,  True, False, False, False, False],
       [ True,  True,  True,  True,  True, False]])

Importantly, for broadcasting:

>>> a[None, ...].shape, b[..., None].shape
((1, 6), (3, 1))

See the official numpy docs to understand broadcasting. Some relevant tidbits:

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when they are equal, or one of them is 1 ... When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or "copied" to match the other.

Edit: As noted in the comments under your question, an entirely different approach is much better algorithmically than your brute-force solution, namely taking advantage of binary search with np.searchsorted; a sketch follows below.
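A sketch of that searchsorted approach, using the a and b from the question (this assumes no exact ties between a and b; with ties, the side argument of np.searchsorted matters):

import numpy as np
a = np.array([0.27148588, 0.42828064, 2.48130785, 4.01811243, 4.79403723, 5.46398145])
b = np.array([0.06231266, 1.64276013, 5.22786201])
np.searchsorted(a, b) - 1  # array([-1,  1,  4]): index of the last a element below each b element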
Replacing numpy array elements with chained masks
Consider some array arr and advanced indexing mask mask:

import numpy as np
arr = np.arange(4).reshape(2, 2)
mask = arr < 2

Using advanced indexing creates a new copy of an array. Accordingly, one cannot "chain" a mask with an additional mask, or even with a basic slicing operation, to replace elements of an array:

submask = [False, True]
arr[mask][submask] = -1  # chaining 2 masks
arr[mask][:] = -1        # chaining a mask with a basic slicing operation
print(arr)
[[0 1]
 [2 3]]

I have two related questions:

1. What is the best way to replace elements of an array using chained masks?
2. If advanced indexing returns a copy of an array, why does the following work?

arr[mask] = -1
print(arr)
[[-1 -1]
 [ 2  3]]
The short answer: you have to figure out a way of combining the masks. Since masks can "chain" in different ways, I don't think there's a simple all-purpose substitute.

Indexing can either be a __getitem__ call or a __setitem__ call. Your last case is a set. With chained indexing,

a[mask1][mask2] = value

gets translated into

a.__getitem__(mask1).__setitem__(mask2, value)

Whether a gets modified or not depends on what the first getitem produces (a view vs. a copy).

In [11]: arr = np.arange(4).reshape(2,2)
In [12]: mask = arr<2
In [13]: mask
Out[13]:
array([[ True,  True],
       [False, False]])
In [14]: arr[mask]
Out[14]: array([0, 1])

Indexing with a list or array may preserve the number of dimensions, but a boolean mask like this returns a 1d array, the items where the mask is true. In your example, we could tweak the mask (details may vary with the intent of the 2nd mask):

In [15]: mask[:,0]=False
In [16]: mask
Out[16]:
array([[False,  True],
       [False, False]])
In [17]: arr[mask]
Out[17]: array([1])
In [18]: arr[mask] += 10
In [19]: arr
Out[19]:
array([[ 0, 11],
       [ 2,  3]])

Or a logical combination of masks:

In [26]: (np.arange(4).reshape(2,2)<2)&[False,True]
Out[26]:
array([[False,  True],
       [False, False]])
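One general way to combine the masks, sketched here (not from the original answer): thin the first mask in place with the second, then assign through the combined mask.

import numpy as np
arr = np.arange(4).reshape(2, 2)
mask = arr < 2
submask = np.array([False, True])
combined = mask.copy()
combined[mask] = submask  # keep only the masked cells that submask selects
arr[combined] = -1
arr                       # array([[ 0, -1],
                          #        [ 2,  3]])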
Couple of good questions! My take: I would do something like this:

x, y = np.where(mask)
arr[x[submask], y[submask]] = -1

From the official documentation:

Most of the following examples show the use of indexing when referencing data in an array. The examples work just as well when assigning to an array. See the section at the end for specific examples and explanations on how assignments work.

which means arr[mask] = -1 is assigning by reference, while arr[mask] on its own extracts the data and creates a copy.
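Filling that sketch in with the arrays from the question:

import numpy as np
arr = np.arange(4).reshape(2, 2)
mask = arr < 2
submask = np.array([False, True])
x, y = np.where(mask)             # row and column indices of the True cells
arr[x[submask], y[submask]] = -1  # index only the cells submask keeps
arr                               # array([[ 0, -1],
                                  #        [ 2,  3]])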
Query regarding a Numpy exercise in Datacamp
Just started learning Python with Datacamp, and I ran into a question on Numpy. When doing this problem (it's a standalone question, so it should be easy to understand without any context), I am confused by the instruction: "You can use a little trick here: use np_positions == 'GK' as an index for np_heights". Nowhere in the code are np_heights and np_positions linked together, so how can this index work? At first I thought I had to concatenate the two vertically, but it turns out that's not necessary. Is it because there are only two Numpy arrays and, since they happen to have the same number of elements, Python decides to pair them up automatically? What if I have multiple Numpy arrays with the same number of elements and I use that index; will it be a problem?
The only thing they have in common is their length. Other than that, they are not linked together. The length comes into play when you use boolean indexing. Consider the following array:

arr = np.array([1, 2, 3])

With boolean values, we can index into this array:

arr[[True, False, True]]
Out: array([1, 3])

This returned the values at positions 0 and 2 (where the index holds True). The boolean array may come from anywhere: from a comparison on the same array, or from a different array of the same length.

arr1 = np.array(['a', 'b', 'a', 'c'])

If I do arr1 == 'a' it will do an element-wise comparison and return

arr1 == 'a'
Out: array([ True, False,  True, False], dtype=bool)

I can use this on the same array:

arr1[arr1=='a']
Out: array(['a', 'a'], dtype='<U1')

Or on a different array:

arr2 = np.array([2, 5, 1, 7])
arr2[arr1=='a']
Out: array([2, 1])

Note that this is no different from arr2[[True, False, True, False]], so we are not actually using arr1 here. In your example, np_positions == 'GK' will return a boolean array too. Since it has the same size as np_heights, indexing with it selects exactly the positions where the boolean array holds True.
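A sketch with made-up data shaped like the exercise (np_positions and np_heights stand in for the Datacamp arrays):

import numpy as np
np_positions = np.array(['GK', 'M', 'A', 'GK'])
np_heights = np.array([191, 184, 185, 194])
gk_mask = np_positions == 'GK'  # array([ True, False, False,  True])
np_heights[gk_mask]             # array([191, 194]): heights at the 'GK' positions only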
How to generate a bool 2D array from two 1D arrays using numpy
I have two arrays, a=[1,2,3,4] and b=[2,3]. I am wondering whether there is an efficient way to construct a boolean 2D array c (a 2x4 matrix) based on element comparisons, i.e. c[0,0] = True iff a[0] == b[0]. The basic way is to iterate through all the elements of a and b, but I think there may be a better way using numpy. I checked the numpy reference but could not find a routine that does exactly that. Thanks.
If I understood the question correctly, you can extend the dimensions of b with np.newaxis/None to form a 2D array and then perform an equality check against a, which brings in broadcasting for a vectorized solution, like so:

b[:,None] == a

Sample run:

In [5]: a
Out[5]: array([1, 2, 3, 4])
In [6]: b
Out[6]: array([2, 3])
In [7]: b[:,None] == a
Out[7]:
array([[False,  True, False, False],
       [False, False,  True, False]], dtype=bool)
Efficient way of removing Nones from a numpy array
Is there an efficient way to remove Nones from numpy arrays and resize the array to its new size? For example, how would you remove the None from this array without iterating through it in Python? I can easily iterate through it, but this is for an API call that could be made many times.

a = np.array([1,45,23,23,1234,3432,-1232,-34,233,None])
In [17]: a[a != np.array(None)]
Out[17]: array([1, 45, 23, 23, 1234, 3432, -1232, -34, 233], dtype=object)

The above works because a != np.array(None) is a boolean array which maps out the non-None values:

In [20]: a != np.array(None)
Out[20]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True, False], dtype=bool)

Selecting elements of an array in this manner is called boolean array indexing.
I use the following, which I find simpler than the accepted answer:

a = a[a != None]

Caveat: PEP 8 warns against using the equality operator with singletons such as None. I didn't know about this when I posted this answer. That said, for numpy arrays I find this too Pythonic and pretty not to use. See the discussion in the comments.