Sliced numpy array does not modify original array - python

I've run into this interaction with arrays that I'm a little confused about. I can work around it, but for my own understanding, I'd like to know what is going on.
Essentially, I have a datafile that I'm trying to tailor so I can use it as input for some code I've already written. This involves some calculations on some columns, rows, etc. In particular, I also need to reassign some elements, and there the original array isn't being modified as I expect it would be.
import numpy as np
ex_data = np.arange(12).reshape(4,3)
ex_data[2,0] = 0 #Constructing some fake data
ex_data[ex_data[:,0] == 0][:,1] = 3
print(ex_data)
Basically, I look in a column of interest, collect all the rows where that column contains a parameter value of interest, and reassign values.
With the snippet above, I would expect ex_data to have its column 1 elements set to 3 wherever its column 0 element is equal to 0. However, what I'm seeing is that there is no effect at all.
>>> ex_data
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 0, 7, 8],
[ 9, 10, 11]])
In another case, if I don't index the masked selection a second time, the reassignment works as expected.
ex_data[ex_data[:,0] == 0] = 3
print(ex_data)
Here I'd expect every row where column 0 is equal to 0 to be populated with 3. This is what you see.
>>> ex_data
array([[ 3, 3, 3],
[ 3, 4, 5],
[ 3, 3, 3],
[ 9, 10, 11]])
Can anyone explain the interaction?

In [368]: ex_data
Out[368]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 0, 7, 8],
[ 9, 10, 11]])
The column 0 test:
In [369]: ex_data[:,0]==0
Out[369]: array([ True, False, True, False])
That boolean mask can be applied to the rows as:
In [370]: ex_data[ex_data[:,0]==0,0]
Out[370]: array([0, 0]) # the 0's you expected
In [371]: ex_data[ex_data[:,0]==0,1]
Out[371]: array([1, 7]) # the col 1 values you want to replace
In [372]: ex_data[ex_data[:,0]==0,1] = 3
In [373]: ex_data
Out[373]:
array([[ 0, 3, 2],
[ 3, 4, 5],
[ 0, 3, 8],
[ 9, 10, 11]])
The indexing you tried:
In [374]: ex_data[ex_data[:,0]==0]
Out[374]:
array([[0, 3, 2],
[0, 3, 8]])
produces a copy. Assigning ...[:,1]=3 just changes that copy, not the original array. Fortunately in this case, it is easy to use
ex_data[ex_data[:,0]==0,1]
instead of
ex_data[ex_data[:,0]==0][:,1]
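The copy-versus-original behaviour described above can be checked directly. The following is a minimal sketch, not part of the original answer; the names mask, selected, is_view and unchanged are chosen here for illustration.

```python
import numpy as np

ex_data = np.arange(12).reshape(4, 3)
ex_data[2, 0] = 0                      # same fake data as in the question
mask = ex_data[:, 0] == 0

# Boolean-mask indexing returns a new array, not a view of ex_data.
selected = ex_data[mask]
is_view = np.shares_memory(ex_data, selected)

# Chained form: the assignment lands in a temporary copy and is lost.
before = ex_data.copy()
ex_data[mask][:, 1] = 3
unchanged = np.array_equal(ex_data, before)

# Single indexing expression: the assignment reaches ex_data itself.
ex_data[mask, 1] = 3

print(is_view, unchanged)   # False True
print(ex_data.tolist())     # [[0, 3, 2], [3, 4, 5], [0, 3, 8], [9, 10, 11]]
```

Plain slices such as ex_data[:, 1] do return views, which is why the chained form only fails when the first index is a mask or an integer array.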

Related

Why does the axis argument in NumPy change?

I am very confused when it comes to the logic of the NumPy axis argument. In some cases it affects the row when axis = 0 and in some cases it affects the columns when axis = 0. Example:
a = np.array([[1,3,6,7,4],[3,2,5,9,1]])
array([[1,3,6,7,4],
[3,2,5,9,1]])
np.sort(a, axis = 0) #This sorts the columns
array([[1, 2, 5, 7, 1],
[3, 3, 6, 9, 4]])
np.sort(a, axis=1) #This sorts the rows
array([[1, 3, 4, 6, 7],
[1, 2, 3, 5, 9]])
#####################################################################
arr = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
arr
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
np.delete(arr,obj = 1, axis = 0) # This deletes the row
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
np.delete(arr,obj = 1, axis = 1) #This deletes the column
array([[ 1, 3, 4],
[ 5, 7, 8],
[ 9, 11, 12]])
If there is some logic here that I am missing I would love to learn it.
It's perhaps simplest to remember it as 0=down and 1=across.
This means:
Use axis=0 to apply a method down each column, or to the row labels (the index).
Use axis=1 to apply a method across each row, or to the column labels.
[Figure: the parts of a DataFrame that each axis refers to]
It's also useful to remember that Pandas follows NumPy's use of the word axis. The usage is explained in NumPy's glossary of terms:
Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]
So the call in the question, np.sort(a, axis=1), behaves as defined: it sorts entries horizontally across columns, that is, along each individual row. On the other hand, np.sort(a, axis=0) acts vertically downwards across rows, so each column is sorted.
Similarly, np.delete(arr, 1, axis=1) acts on columns, because column positions run along the horizontal axis. Specifying axis=0 makes it act on rows instead.
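One way to see the "down" and "across" rule is to rebuild both sorts from one-dimensional sorts. This is only an illustrative sketch using the array from the question; down_manual and across_manual are names introduced here.

```python
import numpy as np

a = np.array([[1, 3, 6, 7, 4],
              [3, 2, 5, 9, 1]])

down = np.sort(a, axis=0)     # walks down axis 0: each column gets sorted
across = np.sort(a, axis=1)   # walks along axis 1: each row gets sorted

# The same results built by hand, one 1-D sort per column / per row.
down_manual = np.stack([np.sort(a[:, j]) for j in range(a.shape[1])], axis=1)
across_manual = np.stack([np.sort(a[i, :]) for i in range(a.shape[0])], axis=0)

print(down.tolist())    # [[1, 2, 5, 7, 1], [3, 3, 6, 9, 4]]
print(across.tolist())  # [[1, 3, 4, 6, 7], [1, 2, 3, 5, 9]]
```

In both cases the axis argument names the direction the operation travels in, which is the same convention np.delete follows.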
arr = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
arr
# array([[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12]])
arr has 2 dimensions; use the empty slice : to select everything along the first and second axes, as in arr[:,:]. From the documentation of np.delete regarding the second parameter obj:
obj : slice, int or array of ints
Indicate indices of sub-arrays to remove along the specified axis.
If we want to delete obj=1 from axis=0 we are effectively removing arr[[1],:] from arr
arr[[1],:] # array([[5, 6, 7, 8]])
With the same intuition, we can remove obj=1 from axis=1
arr[:,[1]] # array([[ 2],
# [ 6],
# [10]])
When sorting the array arr above along axis=0 we are comparing the following elements:
# array([[1, 2, 3, 4]])
# array([[5, 6, 7, 8]])
# array([[ 9, 10, 11, 12]])
The array is already sorted in this case but the comparison is done between two rows. For example array([[5, 6, 7, 8]]) is compared with array([[ 9, 10, 11, 12]]) by doing an element-wise comparison.
Sorting the array on axis=1 we are comparing the following elements
# array([[1], array([[ 2], array([[ 3], array([[ 4],
# [5], [ 6], [ 7], [ 8],
# [9]]) [10]]) [11]]) [12]])
Notice the difference of axis usage between np.delete and np.sort. np.delete will remove the complete row/column while np.sort will use the complete row/column for comparison.
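A quick way to confirm which axis np.delete touched is to look at the shapes: only the length of the named axis shrinks. The sketch below is illustrative; rows_left, cols_left and keep are names introduced here.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

rows_left = np.delete(arr, 1, axis=0)   # removes arr[1, :]
cols_left = np.delete(arr, 1, axis=1)   # removes arr[:, 1]

print(arr.shape, rows_left.shape, cols_left.shape)   # (3, 4) (2, 4) (3, 3)

# Deleting position 1 along axis 1 matches keeping every other position
# along that same axis with a boolean mask.
keep = np.arange(arr.shape[1]) != 1
same = np.array_equal(cols_left, arr[:, keep])
print(same)   # True
```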

Compare two 3d Numpy array and return unmatched values with index and later recreate them without loop

I am currently working on a problem where one requirement is to compare two 3D NumPy arrays, return the unmatched values with their index positions, and later recreate the same array. Currently, the only approach I can think of is to loop over the arrays to get the values while comparing and again while recreating. The problem is scale: there will be hundreds of arrays, and looping affects the latency of the overall application. I would be thankful if anyone can help me make better use of NumPy comparison with minimal or no loops. Dummy code is below:
def compare_array(final_array_list):
    base_array = None
    i = 0
    for array in final_array_list:
        if i == 0:
            base_array = array  # the first array in the list is the base
        else:
            index = np.where(base_array != array)
            # getting index like (array([0, 1]), array([1, 1]), array([2, 2]))
            # to access all unmatched values I need to loop. Need to avoid loop here
        i = i + 1
    # pseudocode: return [base_array, [unmatched values (8, 10) and their index (array([0, 1]), array([1, 1]), array([2, 2]))], ...]
# similarly recreate array1 back
def recreate_array(array_list):
    # need to avoid looping while recreating array back
    # pseudocode: return list of arrays, i.e. [base_array, array_1]
    pass
# creating dummy array
base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = b = np.array([[[1, 2,3], [3, 4,8]], [[5, 6,7], [7, 8,10]]])
final_array_list = [base_array,array_1, ...... ]
#compare base_array with other arrays and get unmatched values (like 8,10 in array_1) and their index
difff_array = compare_array(final_array_list)
# recreate array1 from the base array after receiving unmatched value and its index value
recreate_array(difff_array)
I think this may be what you're looking for:
base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = b = np.array([[[1, 2,3], [3, 4,8]], [[5, 6,7], [7, 8,10]]])
match_mask = (base_array == array_1)
idx_unmatched = np.argwhere(~match_mask)
# idx_unmatched:
# array([[0, 1, 2],
# [1, 1, 2]])
# values associated with idx_unmatched:
values_unmatched = base_array[tuple(idx_unmatched.T)]
# values_unmatched:
# array([5, 9])
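Note that values_unmatched above holds the base array's values (5 and 9). If the goal is to rebuild array_1 from the base array later, it is the other array's values (8 and 10) that need to be stored. A minimal sketch of that round trip follows; diff_against_base and rebuild_from_diff are hypothetical helper names, not part of NumPy or of the original answer.

```python
import numpy as np

def diff_against_base(base, other):
    """Return (index_tuple, values) for positions where other differs from base."""
    idx = np.nonzero(base != other)   # one index array per dimension
    return idx, other[idx]            # values taken from `other`, not `base`

def rebuild_from_diff(base, idx, values):
    """Recreate the other array from the base plus the stored differences."""
    out = base.copy()
    out[idx] = values                 # vectorized assignment, no Python loop
    return out

base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = np.array([[[1, 2, 3], [3, 4, 8]], [[5, 6, 7], [7, 8, 10]]])

idx, values = diff_against_base(base_array, array_1)
rebuilt = rebuild_from_diff(base_array, idx, values)

print(values.tolist())                   # [8, 10]
print(np.array_equal(rebuilt, array_1))  # True
```

The index tuple returned by np.nonzero can be used directly for both reading and writing, so no conversion through np.argwhere is needed.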
I'm not sure I understand what you mean by "recreate them" (completely recreate them? why not use the arrays themselves?).
I can help you though by noting that there are plenty of functions which vectorize with numpy, and as a general rule of thumb, do not use for loops unless G-d himself tells you to :)
For example:
If a, b are any np.arrays (regardless of dimensions), the simple a == b will return a numpy array of the same size, with boolean values. Trues = they are equal in this coordinate, and False otherwise.
The function np.where(c), will convert c to a boolean np.array, and return you the indexes in which c is True.
To clarify:
Here I instantiate two arrays, with b differing from a with -1 values:
Note what a==b is, at the end.
>>> a = np.random.randint(low=0, high=10, size=(4, 4))
>>> b = np.copy(a)
>>> b[2, 3] = -1
>>> b[0, 1] = -1
>>> b[1, 1] = -1
>>> a
array([[9, 9, 3, 4],
[8, 4, 6, 7],
[8, 4, 5, 5],
[1, 7, 2, 5]])
>>> b
array([[ 9, -1, 3, 4],
[ 8, -1, 6, 7],
[ 8, 4, 5, -1],
[ 1, 7, 2, 5]])
>>> a == b
array([[ True, False, True, True],
[ True, False, True, True],
[ True, True, True, False],
[ True, True, True, True]])
Now for the function np.where, whose output is a bit tricky but easy to use. For a 2D input it returns two arrays of the same size: the first holds the row indices and the second the column indices of the places where the given array is True.
>>> np.where(a == b)
(array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3], dtype=int64), array([0, 2, 3, 0, 2, 3, 0, 1, 2, 0, 1, 2, 3], dtype=int64))
Now you can "fix" the b array to match a by replacing the values of b, at the indexes where it differs from a, with a's values:
>>> b[np.where(a != b)]
array([-1, -1, -1])
>>> b[np.where(a != b)] = a[np.where(a != b)]
>>> np.all(a == b)
True
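The same elementwise comparison extends to many arrays at once, which may help with the "hundreds of arrays" case in the question. The sketch below assumes every array has the same shape as the base so they can be stacked; array_2 is a made-up second array added only for illustration.

```python
import numpy as np

base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = np.array([[[1, 2, 3], [3, 4, 8]], [[5, 6, 7], [7, 8, 10]]])
array_2 = base_array.copy()
array_2[0, 0, 0] = 99                  # hypothetical extra array

stack = np.stack([array_1, array_2])   # shape (2, 2, 2, 3)
differs = stack != base_array          # base is broadcast over the first axis

idx = np.nonzero(differs)              # idx[0] says which array each diff belongs to
values = stack[idx]

# Rebuild every array from the base plus the stored differences.
rebuilt = np.broadcast_to(base_array, stack.shape).copy()
rebuilt[idx] = values

print(idx[0].tolist(), values.tolist())   # [0, 0, 1] [8, 10, 99]
print(np.array_equal(rebuilt, stack))     # True
```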

Indexing in NumPy: Access every other group of values

The [::n] indexing option in numpy provides a very useful way to index every nth item in a list. However, is it possible to use this feature to extract multiple values, e.g. every other pair of values?
For example:
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
And I want to extract every other pair of values i.e. I want to return
a[[0, 1, 4, 5, 8, 9]]
Of course the index could be built using loops or something, but I wonder if there's a faster way to use ::-style indexing in numpy but also specifying the width of the pattern to take every nth iteration of.
Thanks
With length of array being a multiple of the window size -
In [29]: W = 2 # window-size
In [30]: a.reshape(-1,W)[::2].ravel()
Out[30]: array([0, 1, 4, 5, 8, 9])
Explanation with breaking-down-the-steps -
# Reshape to split into W-sized groups
In [43]: a.reshape(-1,W)
Out[43]:
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])
# Use stepsize to select every other pair starting from the first one
In [44]: a.reshape(-1,W)[::2]
Out[44]:
array([[0, 1],
[4, 5],
[8, 9]])
# Flatten for desired output
In [45]: a.reshape(-1,W)[::2].ravel()
Out[45]: array([0, 1, 4, 5, 8, 9])
If you are okay with 2D output, skip the last step, as the result is still a view into the input and virtually free at runtime. Let's verify the view part -
In [47]: np.shares_memory(a,a.reshape(-1,W)[::2])
Out[47]: True
For the generic case, where the length is not necessarily a multiple of the window size, we can use a masking-based approach -
In [64]: a[(np.arange(len(a))%(2*W))<W]
Out[64]: array([0, 1, 4, 5, 8, 9])
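The masking approach can be wrapped in a small helper to show that it also copes with lengths that are not a multiple of the window size. This is a sketch only; take_alternate_groups is a hypothetical name introduced here.

```python
import numpy as np

def take_alternate_groups(a, width):
    """Keep the 1st, 3rd, 5th, ... group of `width` consecutive items of a 1-D array."""
    positions = np.arange(len(a))
    # Within every block of 2*width positions, keep only the first `width`.
    return a[(positions % (2 * width)) < width]

even_case = take_alternate_groups(np.arange(12), 2)
ragged_case = take_alternate_groups(np.arange(11), 3)

print(even_case.tolist())    # [0, 1, 4, 5, 8, 9]
print(ragged_case.tolist())  # [0, 1, 2, 6, 7, 8]
```

Unlike the reshape version, boolean masking always returns a copy rather than a view.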
You can do that by reshaping the array into an n x 3 matrix, slicing out the first two elements of each row, and finally flattening the result:
a.reshape((-1,3))[:,:2].flatten()
resulting in:
array([ 0, 1, 3, 4, 6, 7, 9, 10])

Minimum value in 3D NumPy array along specified axis

Say you have a 3D array as follows:
a = np.random.uniform(0,10,(3,4,4))
a
Out[167]:
array([[[6.11382489, 5.33572952, 2.6994938 , 5.32924568],
[0.02494179, 9.5813176 , 3.78090323, 7.73698908],
[0.4559432 , 3.14531716, 4.18929635, 9.44256735],
[7.05641989, 0.51355523, 6.61806454, 1.3124488 ]],
[[9.79806021, 6.9343234 , 3.96018673, 8.97424501],
[3.25146771, 5.06744849, 6.05870707, 2.27286515],
[4.66656429, 6.92791142, 7.1623226 , 5.34108811],
[6.09831564, 9.52367529, 8.27257007, 8.01510805]],
[[5.62545596, 9.01048599, 6.76713644, 7.71836144],
[5.59842752, 0.34003062, 8.07114444, 8.5382837 ],
[0.20420194, 6.39088367, 4.97895935, 4.26247875],
[1.2701483 , 8.35244104, 2.69965027, 8.39305974]]])
Is there a way to get the minimum values in the slices along axis=0 as one array efficiently?
So in this case I would specify axis=0 (i.e. the axis with dimension length=3) and return the minimum values: (0.02494179, 2.27286515, 0.20420194).
I feel like this is a simple problem but I can't seem to get it to work, so any help on the matter would be greatly appreciated!
If I got it right, you just have to apply "min" twice
for instance:
>>> np.random.seed(1) #reproduce the same results
>>> a = np.random.randint(0,10,(3,2,4)) #using int is easier to understand
Out[4]:
array([[[5, 8, 9, 5],
[0, 0, 1, 7]],
[[6, 9, 2, 4],
[5, 2, 4, 2]],
[[4, 7, 7, 9],
[1, 7, 0, 6]]])
>>> a.min(axis=0).min(axis=0)
Out[5]: array([0, 0, 0, 2])
It is the first time I post an answer, I hope I did okay.
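The question asks for one minimum per slice along axis 0 (three values for a 3x4x4 array), whereas applying min over axis 0 twice yields one value per position along the last axis. If the per-slice reading is the intended one, reducing over the other two axes is a possible alternative. The sketch below reuses the integer array shown in the answer above, written out literally so it does not depend on the random seed.

```python
import numpy as np

a = np.array([[[5, 8, 9, 5], [0, 0, 1, 7]],
              [[6, 9, 2, 4], [5, 2, 4, 2]],
              [[4, 7, 7, 9], [1, 7, 0, 6]]])

# Reduce over axes 1 and 2 so one value remains per index along axis 0.
per_slice_min = a.min(axis=(1, 2))

# Equivalent: flatten each slice to a row, then take the row-wise minimum.
per_slice_min_alt = a.reshape(a.shape[0], -1).min(axis=1)

print(per_slice_min.tolist())   # [0, 2, 0]
```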

pick TxK numpy array from TxN numpy array using TxK column index array

This is an indirect indexing problem.
It can be solved with a list comprehension.
The question is whether, or, how to solve it within numpy,
When
data.shape is (T,N)
and
c.shape is (T,K)
and each element of c is an int between 0 and N-1 inclusive, that is,
each element of c is intended to refer to a column number from data.
The goal is to obtain out where
out.shape = (T,K)
And for each i in 0..(T-1)
the row out[i] = [ data[i, c[i,0]] , ... , data[i, c[i,K-1]] ]
Concrete example:
data = np.array([\
[ 0, 1, 2],\
[ 3, 4, 5],\
[ 6, 7, 8],\
[ 9, 10, 11],\
[12, 13, 14]])
c = np.array([
[0, 2],\
[1, 2],\
[0, 0],\
[1, 1],\
[2, 2]])
out should be out = [[0, 2], [4, 5], [6, 6], [10, 10], [14, 14]]
The first row of out is [0,2] because the columns chosen are given by c's row 0, they are 0 and 2, and data[0] at columns 0 and 2 are 0 and 2.
The second row of out is [4,5] because the columns chosen are given by c's row 1, they are 1 and 2, and data[1] at columns 1 and 2 is 4 and 5.
Numpy fancy indexing doesn't seem to solve this in an obvious way because indexing data with c (e.g. data[c], np.take(data,c,axis=1) ) always produces a 3 dimensional array.
A list comprehension can solve it:
out = [ [data[rowidx,i1],data[rowidx,i2]] for (rowidx, (i1,i2)) in enumerate(c) ]
if K is 2 I suppose this is marginally OK. If K is variable, this is not so good.
The list comprehension has to be rewritten for each value K, because it unrolls the columns picked out of data by each row of c. It also violates DRY.
Is there a solution based entirely in numpy?
You can avoid loops with np.choose:
In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
data = np.array([\
[ 0, 1, 2],\
[ 3, 4, 5],\
[ 6, 7, 8],\
[ 9, 10, 11],\
[12, 13, 14]])
c = np.array([
[0, 2],\
[1, 2],\
[0, 0],\
[1, 1],\
[2, 2]])
--
In [2]: np.choose(c, data.T[:,:,np.newaxis])
Out[2]:
array([[ 0, 2],
[ 4, 5],
[ 6, 6],
[10, 10],
[14, 14]])
Here's one possible route to a general solution...
Create masks for data to select the values for each column of out. For example, the first mask could be achieved by writing:
>>> np.arange(3) == np.vstack(c[:,0])
array([[ True, False, False],
[False, True, False],
[ True, False, False],
[False, True, False],
[False, False, True]], dtype=bool)
>>> data[_]
array([ 0, 4, 6, 10, 14])
The mask to get the values for the second column of out: np.arange(3) == np.vstack(c[:,1]).
So, to get the out array...
>>> mask0 = np.arange(3) == np.vstack(c[:,0])
>>> mask1 = np.arange(3) == np.vstack(c[:,1])
>>> np.vstack((data[mask0], data[mask1])).T
array([[ 0, 2],
[ 4, 5],
[ 6, 6],
[10, 10],
[14, 14]])
Edit: Given arbitrary array widths K and N you could use a loop to create the masks, so the general construction of the out array might simply look like this:
np.vstack([data[np.arange(N) == np.vstack(c[:,i])] for i in range(K)]).T
Edit 2: A slightly neater solution (though still relying on a loop) is:
np.vstack([data[i][c[i]] for i in range(T)])
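For completeness, a loop-free alternative that is not among the answers above: pair each row number with its column indices through broadcasting, or use np.take_along_axis (available in NumPy 1.15 and later). A minimal sketch with the data from the question; rows, out and out_alt are names chosen here.

```python
import numpy as np

data = np.arange(15).reshape(5, 3)     # same values as `data` in the question
c = np.array([[0, 2], [1, 2], [0, 0], [1, 1], [2, 2]])

# A (T, 1) column of row numbers broadcasts against the (T, K) index array,
# so element [i, k] of the result is data[i, c[i, k]].
rows = np.arange(data.shape[0])[:, np.newaxis]
out = data[rows, c]

# The same selection expressed with take_along_axis.
out_alt = np.take_along_axis(data, c, axis=1)

print(out.tolist())   # [[0, 2], [4, 5], [6, 6], [10, 10], [14, 14]]
```

Both forms work for any K and do not need a Python-level loop over T or K.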
