I want to get the first index of a numpy array element which is greater than some specific element of that same array. I tried the following:
>>> Q5=[[1,2,3],[4,5,6]]
>>> Q5 = np.array(Q5)
>>> Q5[0][Q5[0]>Q5[0,0]]
array([2, 3])
>>> np.where(Q5[0]>Q5[0,0])
(array([1, 2], dtype=int32),)
>>> np.where(Q5[0]>Q5[0,0])[0][0]
1
Q1. Is the above the correct way to obtain the first index of an element in Q5[0] greater than Q5[0,0]?
I am more concerned that np.where(Q5[0]>Q5[0,0]) returns the tuple (array([1, 2], dtype=int32),), which requires me to double index with [0][0] at the end, as in np.where(Q5[0]>Q5[0,0])[0][0].
Q2. Why does this return a tuple, while the call below returns a proper numpy array?
>>> np.where(Q5[0]>Q5[0,0],Q5[0],-1)
array([-1, 2, 3])
So that I can index directly:
>>> np.where(Q5[0]>Q5[0,0],Q5[0],-1)[1]
2
In [58]: A = np.arange(1,10).reshape(3,3)
In [59]: A.shape
Out[59]: (3, 3)
In [60]: A
Out[60]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
np.where with just the condition is really np.nonzero.
Generate a boolean array:
In [63]: A==6
Out[63]:
array([[False, False, False],
[False, False, True],
[False, False, False]])
Find where that is true:
In [64]: np.nonzero(A==6)
Out[64]: (array([1]), array([2]))
The result is a tuple, one element per dimension of the condition. Each element is an indexing array; together they define the location of the True(s).
Another test with several True values:
In [65]: (A%3)==1
Out[65]:
array([[ True, False, False],
[ True, False, False],
[ True, False, False]])
In [66]: np.nonzero((A%3)==1)
Out[66]: (array([0, 1, 2]), array([0, 0, 0]))
Using the tuple to index the original array:
In [67]: A[np.nonzero((A%3)==1)]
Out[67]: array([1, 4, 7])
Using the 3 argument where to create a new array with a mix of values from A and A+10
In [68]: np.where((A%3)==1,A+10, A)
Out[68]:
array([[11, 2, 3],
[14, 5, 6],
[17, 8, 9]])
If the condition has multiple True values, nonzero isn't the best tool for finding the "first", since it necessarily finds all of them.
The nonzero tuple can be turned into a 2d array with a transpose. It actually may be easier to get the "first" from this array:
In [73]: np.argwhere((A%3)==1)
Out[73]:
array([[0, 0],
[1, 0],
[2, 0]])
You are looking in a 1d array, a row of A:
In [77]: A[0]>A[0,0]
Out[77]: array([False, True, True])
In [78]: np.nonzero(A[0]>A[0,0])
Out[78]: (array([1, 2]),) # 1 element tuple
In [79]: np.argwhere(A[0]>A[0,0])
Out[79]:
array([[1],
[2]])
In [81]: np.where(A[0]>A[0,0], 100, 0) # 3 argument where
Out[81]: array([ 0, 100, 100])
So whether you are searching a 1d array or a 2d (or 3 or 4), nonzero returns a tuple with one array element per dimension. That way it can always be used to index a like sized array. The 1d tuple might look redundant, but it is consistent with other dimensional results.
When trying to understand operations like this, read the docs carefully, and look at the individual steps. Here I looked at the conditional matrix, the nonzero result, and its various uses.
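As a minimal sketch of the 1d case discussed above (the names row, idx and first are only illustrative), the first matching index can be read from the single array inside the tuple, or from np.flatnonzero, which returns that index array directly:

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)
row = A[0]

# nonzero on a 1d condition gives a 1-element tuple; [0] unpacks the index array
idx = np.nonzero(row > row[0])[0]

# flatnonzero returns the index array itself, so only one [0] is needed
first = np.flatnonzero(row > row[0])[0]

print(idx)    # indices of all elements greater than row[0]
print(first)  # index of the first such element
```

Both forms raise an IndexError on the final [0] when no element satisfies the condition, so that case needs its own check.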
Using argmax with a boolean array will give you the index of the first True.
In [54]: q
Out[54]:
array([[1, 2, 3],
[4, 5, 6]])
In [55]: q > q[0,0]
Out[55]:
array([[False, True, True],
[ True, True, True]], dtype=bool)
argmax can take an axis/dimension argument.
In [56]: np.argmax(q > q[0,0], 0)
Out[56]: array([1, 0, 0], dtype=int64)
That says the first True is index one for column zero and index zero for columns one and two.
In [57]: np.argmax(q > q[0,0], 1)
Out[57]: array([1, 0], dtype=int64)
That says the first True is index one for row zero and index zero for row one.
Q1. Is above correct way to obtain first index of an element in Q5[0] greater than Q5[0,0]?
No. I would use argmax with 1 for the axis argument, then select the first item from that result.
Q2. Why this return tuple
You told it to return -1 for False values and return Q5[0] items for True values.
Q2 ...but below returns proper numpy array?
You got lucky and chose the correct index.
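One thing to keep in mind with the argmax approach: on a boolean array with no True at all, argmax still returns 0, which looks like a valid index. A small sketch of a guard, where first_true_index is a hypothetical helper name and -1 is an arbitrary sentinel chosen for illustration:

```python
import numpy as np

def first_true_index(mask):
    """Return the index of the first True in a 1d boolean array, or -1 if there is none."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return -1          # sentinel: argmax alone would misleadingly return 0 here
    return int(np.argmax(mask))

q = np.array([[1, 2, 3], [4, 5, 6]])
print(first_true_index(q[0] > q[0, 0]))   # first element of row 0 greater than q[0, 0]
print(first_true_index(q[0] > 100))       # no match, so the sentinel comes back
```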
numpy.where() is like a for loop with an if.
numpy.where(condition, values, new_value)
condition - just like an if condition.
values - the values to iterate over; a value is kept where the condition is true
new_value - where the condition is false, the result takes new_value instead
If we would like to write it for a 1-dimensional array it should look something like this:
[xv if c else yv for c, xv, yv in zip(condition, x, y)]
Example:
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a < 5, a, 10*a)
array([ 0, 1, 2, 3, 4, 50, 60, 70, 80, 90])
First we create an array with the numbers from 0 to 9 (0, 1, 2 ... 7, 8, 9),
and then we check which values in the array are less than 5; those are kept, and every other value is replaced by 10 times itself.
So all the values that are less than 5 stay the same, and all the values that are 5 or greater are multiplied by 10.
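To check the "for loop with an if" reading, here is a short sketch that runs the list comprehension from above next to np.where on the same data (the variable names cond, vectorized and looped are just for illustration):

```python
import numpy as np

a = np.arange(10)
cond = a < 5

# Vectorized form: take a where cond is True, 10*a where it is False
vectorized = np.where(cond, a, 10 * a)

# Equivalent element-by-element loop for the 1d case
looped = np.array([xv if c else yv for c, xv, yv in zip(cond, a, 10 * a)])

print(vectorized)
print(np.array_equal(vectorized, looped))
```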
I have a matrix like this:
A = sp.csr_matrix(np.array(
[[1, 1, 2, 1],
[0, 0, 2, 0],
[1, 4, 1, 1],
[0, 1, 0, 0]]))
I want to get all the rows where all columns are nonzero, so I can then get their sum. Either as an array:
rows = [True, False, True, False]
result = A[rows].sum()
Or as indices:
rows = [0, 2]
result = A[rows].sum()
I am stuck however at the first part, figuring out which rows to include in the sum, as most results seem to be looking for the opposite (rows where all columns are zero).
In [35]: from scipy import sparse
In [36]: A = sparse.csr_matrix(np.array(
...: [[1, 1, 2, 1],
...: [0, 0, 2, 0],
...: [1, 4, 1, 1],
...: [0, 1, 0, 0]]))
In [37]: A
Out[37]:
<4x4 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>
Sparse doesn't do 'all/any' kinds of operations because they treat 0's as significant values.
all on the dense equivalent works nicely:
In [41]: A.A.all(axis=1)
Out[41]: array([ True, False, True, False])
On the sparse one we can turn the dtype to boolean, and sum along the axis. And then test it for the full value:
In [42]: A.astype(bool).sum(axis=1)
Out[42]:
matrix([[4],
[1],
[4],
[1]])
In [43]: A.astype(bool).sum(axis=1).A1==4
Out[43]: array([ True, False, True, False])
Notice that the sparse sum returns a np.matrix. I used A1 to turn that into a 1d array.
If the matrix isn't too large, working with the dense array may be faster. Sparse operations like sum are actually performed with matrix multiplication.
In [51]: A.astype(bool)@np.ones(4,int)
Out[51]: array([4, 1, 4, 1])
Or we could convert it to lil format, and look at the length of the 'rows':
In [67]: A.tolil().data
Out[67]:
array([list([1, 1, 2, 1]), list([2]), list([1, 4, 1, 1]), list([1])],
dtype=object)
In [68]: [len(i) for i in A.tolil().data]
Out[68]: [4, 1, 4, 1]
But wait, there's more. The indptr attribute of the csr is:
In [69]: A.indptr
Out[69]: array([ 0, 4, 5, 9, 10], dtype=int32)
In [70]: np.diff(A.indptr)
Out[70]: array([4, 1, 4, 1], dtype=int32)
I've omitted some test timings, but this last is clearly the fastest!
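Putting the indptr idea together with the original goal, a possible sketch looks like this. It assumes SciPy is installed, and it calls eliminate_zeros() first because indptr counts stored entries, which could include explicitly stored zeros in a matrix built some other way:

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array(
    [[1, 1, 2, 1],
     [0, 0, 2, 0],
     [1, 4, 1, 1],
     [0, 1, 0, 0]]))

# Assumption: drop explicitly stored zeros so indptr counts true nonzeros only
A.eliminate_zeros()

stored_per_row = np.diff(A.indptr)          # number of stored entries in each row
full_rows = stored_per_row == A.shape[1]    # rows where every column is nonzero
result = A[np.flatnonzero(full_rows)].sum() # sum over just those rows

print(full_rows, result)
```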
It is a bit easier to do for numpy arrays than for sparse ones. If you do not mind converting to numpy as an intermediate step, you can get the right rows via
(A.toarray() != 0).all(axis=1)
to produce
array([ True, False, True, False])
and then use it in indexing A as such:
A[(A.toarray() != 0).all(axis=1),:].sum()
returns 12
Say I have these two numpy arrays:
A = np.array([[1,2,3],[4,5,6],[8,7,3]])
B = np.array([[1,2,3],[3,2,1],[8,7,3]])
It should return
[0,2]
Since the values at the 0th and 2nd index are equal to each other.
What's the most efficient way of doing this?
I tried something like:
[val for val in range(len(A)) if A[val]==B[val]]
but got the error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
You are better off looking for a vectorized solution.
You can try:
>>> np.where(np.all(A == B, axis=1))
(array([0, 2]),)
You can see the speed benefit of vectorization here: https://chelseatroy.com/2018/11/07/code-mechanic-numpy-vectorization/amp/
Assuming A.shape == B.shape (otherwise just take A=A[:len(B)] and B=B[:len(A)]) consider:
>>> A==B
[[ True True True]
[False False False]
[ True True True]]
>>> (A==B).all(axis=1)
[ True False True]
>>> np.argwhere((A==B).all(axis=1))
[[0]
[2]]
You can do something like this:
>>> [a in B for a in A]
[True, False, True]
>>> A[[a in B for a in A]]
array([[1, 2, 3],
[8, 7, 3]])
>>> np.where((A==B).all(axis=1))
(array([0, 2]),)
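A word of caution about the `a in B` form: for NumPy arrays, `in` is true as soon as any single element matches, so it works for this data but is not a row-equality test in general. A small sketch of the difference, using a made-up row called probe that is not present in B:

```python
import numpy as np

B = np.array([[1, 2, 3], [3, 2, 1], [8, 7, 3]])
probe = np.array([1, 9, 9])   # hypothetical row that does not occur in B

# `in` reduces with any(): one matching element (B[0, 0] == 1) is enough
loose = probe in B

# Row-wise test: all columns must match within at least one row
strict = bool((B == probe).all(axis=1).any())

print(loose, strict)
```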
The following solution works also for arrays that do not match in their first dimension, i.e., have a different number of rows. It also works if a match occurs multiple times.
import numpy as np
from scipy.spatial import distance
A = np.array([[1, 2 ,3],
[4, 5 ,6],
[8, 7, 3]])
B = np.array([[1, 2, 3],
[3, 2, 1],
[1, 2, 3],
[9, 9, 9]])
res = np.nonzero(distance.cdist(A, B) == 0)
# ==> (array([0, 0]), array([0, 2]))
The result res is a tuple of two arrays, which hold the matching row indices of the first and the second input array, respectively. So, in this example, the row at index 0 of the first array matches the row at index 0 of the second array, and also the row at index 2 of the second array.
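If pulling in SciPy is not wanted, the same pairwise matching can be sketched with plain NumPy broadcasting. This is a swapped-in technique rather than the cdist approach above; it tests exact equality and builds a temporary boolean array of shape (len(A), len(B), ncols), so it is only reasonable for moderately sized inputs:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [8, 7, 3]])
B = np.array([[1, 2, 3],
              [3, 2, 1],
              [1, 2, 3],
              [9, 9, 9]])

# Compare every row of A with every row of B via broadcasting
pairwise = A[:, None, :] == B[None, :, :]

# A pair of rows matches when all of its columns are equal
rows_a, rows_b = np.nonzero(pairwise.all(axis=2))

print(rows_a, rows_b)
```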
In [174]: A = np.array([[1,2,3],[4,5,6],[8,7,3]])
...: B = np.array([[1,2,3],[3,2,1],[8,7,3]])
Your list comprehension works fine for lists:
In [175]: Al = A.tolist(); Bl = B.tolist()
In [177]: [val for val in range(len(Al)) if Al[val]==Bl[val]]
Out[177]: [0, 2]
For lists == is a simple boolean test - same or not; for arrays it returns a boolean array, which can't be used in an if:
In [178]: Al[0]==Bl[0]
Out[178]: True
In [179]: A[0]==B[0]
Out[179]: array([ True, True, True])
With arrays, you need to add an all, as suggested by the error:
In [180]: [val for val in range(len(A)) if np.all(A[val]==B[val])]
Out[180]: [0, 2]
The list version will be faster.
But you can also compare the whole arrays, and take row by row all:
In [181]: A==B
Out[181]:
array([[ True, True, True],
[False, False, False],
[ True, True, True]])
In [182]: np.all(A==B, axis=1)
Out[182]: array([ True, False, True])
In [183]: np.nonzero(np.all(A==B, axis=1))
Out[183]: (array([0, 2]),)
I'm confused about the way numpy array slicing is working in the example below. I can't figure out how exactly the slicing is working and would appreciate an explanation.
import numpy as np
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]
])
m = [False,True,True,False]
# Test 1 - Expected behaviour
print(arr[m])
Out:
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
# Test 2 - Expected behaviour
print(arr[m,:])
Out:
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
# Test 3 - Expected behaviour
print(arr[:,m])
Out:
array([[ 2, 3],
[ 6, 7],
[10, 11],
[14, 15]])
### What's going on here? ###
# Test 4
print(arr[m,m])
Out:
array([ 6, 11]) # <--- diagonal components. I expected [[6,7],[10,11]].
I found that I could achieve the desired result with arr[:,m][m]. But I'm still curious about how this works.
You can multiply a column version of the mask by the row version (an outer product via broadcasting) to create a 2d mask.
import numpy as np
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]
])
m = [False,True,True,False]
mask2d = np.array([m]).T * m
print(arr[mask2d])
Output :
[ 6 7 10 11]
Alternatively, you can have the output in matrix format.
print(np.ma.masked_array(arr, ~mask2d))
It's just the way indexing works for numpy arrays. Usually if you have specific "slices" of rows and columns you want to select you just do:
import numpy as np
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]
])
# You want to check out rows 2-3 cols 2-3
print(arr[2:4,2:4])
Out:
[[11 12]
[15 16]]
Now say you want to select arbitrary combinations of specific row and column indices, for example you want row0-col2 and row2-col3
print(arr[[0, 2], [2, 3]])
Out:
[ 3 12]
What you are doing is identical to the above. [m,m] is equivalent to:
[m,m] == [[False,True,True,False], [False,True,True,False]]
Which is in turn equivalent to saying you want row1-col1 and row2-col2
print(arr[[1, 2], [1, 2]])
Out:
[ 6 11]
I don't know why, but this is the way numpy treats slicing by a tuple of 1d boolean arrays:
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12]
])
m1 = [True, False, True]
m2 = [False, False, True, True]
# Pseudocode for what NumPy does
#def arr[m1,m2]:
# intm1 = np.transpose(np.argwhere(m1)) # [True, False, True] -> [0,2]
# intm2 = np.transpose(np.argwhere(m2)) # [False, False, True, True] -> [2,3]
# return arr[intm1,intm2] # arr[[0,2],[2,3]]
print(arr[m1,m2]) # --> [3 12]
What I was expecting was slicing behaviour with non-contiguous segments of the array, i.e. selecting the intersection of the chosen rows and columns. That can be achieved with:
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12]
])
m1 = [True, False, True]
m2 = [False, False, True, True]
def row_col_select(arr, *ms):
    n = arr.ndim
    assert(len(ms) == n)
    # Accumulate a full boolean mask which will have the shape of `arr`
    accum_mask = np.reshape(True, (1,) * n)
    for i in range(n):
        shape = tuple([1]*i + [arr.shape[i]] + [1]*(n-i-1))
        m = np.reshape(ms[i], shape)
        accum_mask = np.logical_and(accum_mask, m)
    # Select `arr` according to full boolean mask
    # The boolean mask is the multiplication of the boolean arrays across each corresponding dimension. E.g. for m1 and m2 above it is:
    #   m1:   |  m2: False False True  True
    #         |
    #   True  |    [[False False True  True]
    #   False |     [False False False False]
    #   True  |     [False False True  True]]
    return arr[accum_mask]
print(row_col_select(arr,m1,m2)) # --> [ 3 4 11 12]
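For the common 2d case, np.ix_ may be a shorter route to the same row/column intersection. Its documentation says boolean sequences are treated as masks for the corresponding dimension, so the lists can be passed as they are. A minimal sketch, which keeps the 2d shape instead of flattening:

```python
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])
m1 = [True, False, True]
m2 = [False, False, True, True]

# np.ix_ turns each boolean mask into integer indices shaped for broadcasting
block = arr[np.ix_(m1, m2)]      # rows 0 and 2 crossed with columns 2 and 3

print(block)
print(block.ravel())             # same elements as the flattened mask selection
```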
In [55]: arr = np.array([
...: [1,2,3,4],
...: [5,6,7,8],
...: [9,10,11,12],
...: [13,14,15,16]
...: ])
...: m = [False,True,True,False]
In all your examples we can use this m1 instead of the boolean list:
In [58]: m1 = np.where(m)[0]
In [59]: m1
Out[59]: array([1, 2])
If m was a 2d array like arr then we could use it to select elements from arr, but they would be raveled. When it is used to select along one dimension, the equivalent array index is clearer. Yes, we could use np.array([2,1]) or np.array([2,1,1,2]) to select rows in a different order or even multiple times. But substituting m1 for m does not lose any information or control.
Select rows, or columns:
In [60]: arr[m1]
Out[60]:
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
In [61]: arr[:,m1]
Out[61]:
array([[ 2, 3],
[ 6, 7],
[10, 11],
[14, 15]])
With 2 arrays, we get 2 elements, arr[1,1] and arr[2,2].
In [62]: arr[m1, m1]
Out[62]: array([ 6, 11])
Note that in MATLAB we have to use sub2ind to do the same thing. What's easy in numpy is a bit harder in MATLAB; for blocks it's the other way.
To get a block, we have to create a column array to broadcast with the row one:
In [63]: arr[m1[:,None], m1]
Out[63]:
array([[ 6, 7],
[10, 11]])
If that's too hard to remember, np.ix_ can do it for us:
In [64]: np.ix_(m1,m1)
Out[64]:
(array([[1],
[2]]),
array([[1, 2]]))
[63] is doing the same thing as [62]; the difference is that the 2 arrays broadcast differently. It's the same broadcasting as done in these additions:
In [65]: m1+m1
Out[65]: array([2, 4])
In [66]: m1[:,None]+m1
Out[66]:
array([[2, 3],
[3, 4]])
This indexing behavior is perfectly consistent - provided we don't import expectations from other languages.
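To make the broadcasting point concrete, here is a short sketch comparing the three forms on the same array (paired, block and via_ix are illustrative names only):

```python
import numpy as np

arr = np.arange(1, 17).reshape(4, 4)
m1 = np.array([1, 2])

paired = arr[m1, m1]             # index shapes (2,) and (2,): element pairs
block = arr[m1[:, None], m1]     # index shapes (2, 1) and (2,): broadcast to (2, 2)
via_ix = arr[np.ix_(m1, m1)]     # np.ix_ builds the (2, 1) and (1, 2) pair for us

print(paired.shape, block.shape)
print(np.array_equal(block, via_ix))
```

The result shape is always the broadcast shape of the index arrays, which is why the first form gives two elements and the other two give a 2x2 block.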
I used m1 because boolean arrays don't broadcast, as shown below:
In [67]: np.array(m)
Out[67]: array([False, True, True, False])
In [68]: np.array(m)[:,None]
Out[68]:
array([[False],
[ True],
[ True],
[False]])
In [69]: arr[np.array(m)[:,None], np.array(m)]
...
IndexError: too many indices for array
In fact the 'column' boolean doesn't work either:
In [70]: arr[np.array(m)[:,None]]
...
IndexError: boolean index did not match indexed array along dimension 1; dimension is 4 but corresponding boolean dimension is 1
We can use logical_and to broadcast a column boolean against a row boolean:
In [72]: mb = np.array(m)
In [73]: mb[:,None]&mb
Out[73]:
array([[False, False, False, False],
[False, True, True, False],
[False, True, True, False],
[False, False, False, False]])
In [74]: arr[_]
Out[74]: array([ 6, 7, 10, 11]) # 1d result
This is the case you quoted: "If obj.ndim == x.ndim, x[obj] returns a 1-dimensional array filled with the elements of x corresponding to the True values of obj"
Your other quote:
"Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view)."
means that if arr1 = arr[m,:], arr1 is a copy, and any modifications to arr1 will not affect arr. However, I could use arr[m,:]=10 to modify arr. The alternative to a copy is a view, as in basic indexing, arr2=arr[0::2,:]. Modifications to arr2 do modify arr as well.
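The copy versus view distinction can be checked directly. A small sketch, using np.shares_memory as the test and the same arr1/arr2 naming as above:

```python
import numpy as np

arr = np.arange(1, 17).reshape(4, 4)
m = [False, True, True, False]

arr1 = arr[m, :]        # advanced (boolean) indexing: a copy
arr2 = arr[0::2, :]     # basic slicing: a view

arr1[:] = 0             # changes the copy only; arr is untouched
arr2[0, 0] = -1         # writes through to arr[0, 0]

print(np.shares_memory(arr, arr1), np.shares_memory(arr, arr2))
print(arr[0, 0], arr[1, 0])
```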
I'm trying to turn a 2x3 numpy array into a 2x2 array by removing select indexes.
I think I can do this with a mask array with true/false values.
Given
[ 1, 2, 3],
[ 4, 1, 6]
I want to remove one element from each row to give me:
[ 2, 3],
[ 4, 6]
However this method isn't working quite like I would expect:
import numpy as np
in_array = np.array([
[ 1, 2, 3],
[ 4, 1, 6]
])
mask = np.array([
[False, True, True],
[True, False, True]
])
print(in_array[mask])
Gives me:
[2 3 4 6]
Which is not what I want. Any ideas?
The only thing 'wrong' with that is the shape: 1d rather than 2d. But what if your mask was
mask = np.array([
[False, True, False],
[True, False, True]
])
1 value in the first row, 2 in second. It couldn't return that as a 2d array, could it?
So the default behavior when masking like this is to return a 1d, or raveled result.
Boolean indexing like this is effectively a where indexing:
In [19]: np.where(mask)
Out[19]: (array([0, 0, 1, 1], dtype=int32), array([1, 2, 0, 2], dtype=int32))
In [20]: in_array[_]
Out[20]: array([2, 3, 4, 6])
It finds the elements of the mask which are true, and then selects the corresponding elements of the in_array.
Maybe the transpose of where is easier to visualize:
In [21]: np.argwhere(mask)
Out[21]:
array([[0, 1],
[0, 2],
[1, 0],
[1, 2]], dtype=int32)
and indexing iteratively:
In [23]: for ij in np.argwhere(mask):
...: print(in_array[tuple(ij)])
...:
2
3
4
6
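If every row of the mask keeps the same number of elements, as in the question, the flat result can simply be reshaped back to 2d. A minimal sketch under that assumption (the check on kept_per_row is there to make the assumption explicit, and kept_per_row and out are illustrative names):

```python
import numpy as np

in_array = np.array([[1, 2, 3],
                     [4, 1, 6]])
mask = np.array([[False, True, True],
                 [True, False, True]])

kept_per_row = mask.sum(axis=1)
# Assumption: every row keeps the same number of elements
assert (kept_per_row == kept_per_row[0]).all()

# Boolean indexing ravels in row-major order, so a reshape restores the rows
out = in_array[mask].reshape(in_array.shape[0], kept_per_row[0])
print(out)
```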
I would like to remove duplicates which follow each other, but not duplicates along the whole array. Also, I want to keep the ordering unchanged.
So if the input is [0 0 1 3 2 2 3 3] the output should be [0 1 3 2 3]
I found a way using itertools.groupby() but I am looking for a faster NumPy solution.
a[np.insert(np.diff(a).astype(bool), 0, True)]
Out[99]: array([0, 1, 3, 2, 3])
The general idea is to use diff to find the difference between each pair of consecutive elements in the array, and then keep only the elements whose difference from the previous one is non-zero. Since the result of diff is shorter than the array by 1, we need to insert a True at the beginning of the diff array before indexing.
Explanation:
In [100]: a
Out[100]: array([0, 0, 1, 3, 2, 2, 3, 3])
In [101]: diff = np.diff(a).astype(bool)
In [102]: diff
Out[102]: array([False, True, True, True, False, True, False], dtype=bool)
In [103]: idx = np.insert(diff, 0, True)
In [104]: idx
Out[104]: array([ True, False, True, True, True, False, True, False], dtype=bool)
In [105]: a[idx]
Out[105]: array([0, 1, 3, 2, 3])
For pure Python, which also works with numpy arrays, use this:
def modify(l):
    last = None
    for e in l:
        if e != last:
            yield e
            last = e
pure = list(modify([0, 0, 1, 3, 2, 2, 3, 3]))
import numpy
# modify() is a generator, so materialize it before building the array
num = numpy.array(list(modify(numpy.array([0, 0, 1, 3, 2, 2, 3, 3]))))
I don't know if there are any numpy functions which would speed this up.
For NumPy version >= 1.16.0 you can use the prepend argument:
a[np.diff(a, prepend=np.nan).astype(bool)]
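A variant that avoids arithmetic altogether is to compare each element with its predecessor using !=, which should also work for non-numeric data such as strings. This is a sketch, not taken from the answers above, and drop_consecutive_duplicates is a hypothetical helper name:

```python
import numpy as np

def drop_consecutive_duplicates(a):
    """Keep the first element of every run of equal neighbours (1d input)."""
    a = np.asarray(a)
    if a.size == 0:
        return a
    # The first element is always kept; the rest are kept when they differ
    # from the element just before them
    keep = np.concatenate(([True], a[1:] != a[:-1]))
    return a[keep]

print(drop_consecutive_duplicates([0, 0, 1, 3, 2, 2, 3, 3]))
print(drop_consecutive_duplicates(["a", "a", "b", "a"]))
```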