Extract values that satisfy a condition from numpy array - python

Say I have the following arrays:
a = np.array([1,1,1,2,2,2])
b = np.array([4,6,1,8,2,1])
Is it possible to do the following:
a[np.where(b>3)[0]]
#array([1, 1, 2])
Thus select values from a according to the indices in which a condition in b is satisfied, but using exclusively np.where or a similar numpy function?
In other words, can np.where be used specifying only an array from which to get values when the condition is True? Or is there another numpy function to do this in one step?

Yes, there is a function: numpy.extract(condition, array) returns all values from array that satifsy the condition.
There is not much benefit in using this function over np.where or boolean indexing. All of these approaches create a temporary boolean array that stores the result of b>3. np.where creates an additional index array, while a[b>3]and np.extract use the boolean array directly.
Personally, I would use a[b>3] because it is the tersest form.

Just use boolean indexing.
>>> a = np.array([1,1,1,2,2,2])
>>> b = np.array([4,6,1,8,2,1])
>>>
>>> a[b > 3]
array([1, 1, 2])
b > 3 will give you array([True, True, False, True, False, False]) and with a[b > 3] you select all elements from a where the indexing array is True.

Let's use list comprehension to solve this -
a = np.array([1,1,1,2,2,2])
b = np.array([4,6,1,8,2,1])
indices = [i for i in range(len(b)) if b[i]>3] # Returns indexes of b where b > 3 - [0, 1, 3]
a[indices]
array([1, 1, 2])

Related

How to do element-wise comparison between two NumPy arrays

I have two arrays. I would like to do an element-wise comparison between the two of them to find out which values are the same.
a= np.array([[1,2],[3,4]])
b= np.array([[3,2],[1,4]])
Is there a way for me to compare these two arrays to 1) find out which values are the same and 2) get the index of the same values?
Adding on to the previous question, is there a way for me to return 1 if the values are the same and 0 otherwise?
Thanks in advance!
a= np.array([[1,2],[3,4]])
b= np.array([[3,2],[1,4]])
#1) find out which values are the same
a==b
# array([[False, True],
# [False, True]])
#2) get the index of the same values?
np.where((a==b) == True) # or np.where(a==b)
#(array([0, 1]), array([1, 1]))
# Adding on to the previous question, is there a way for me to return 1 if the values are the same and 0 otherwise
(a==b).astype(int)
# array([[0, 1],
# [0, 1]])

Return True/False for entire array if any value meets mask requirement(s)

I have already tried looking at other similar posts however, their solutions do not solve this specific issue. Using the answer from this post I found that I get the error: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()" because I define my array differently from theirs. Their array is a size (n,) while my array is a size (n,m). Moreover, the solution from this post does not work either because it applies to lists. The only method I could think of was this:
When there is at least 1 True in array, then entire array is considered True:
filt = 4
tracktruth = list()
arraytruth = list()
arr1 = np.array([[1,2,4]])
for track in range(0,arr1.size):
if filt == arr1[0,track]:
tracktruth.append(True)
else:
tracktruth.append(False)
if any(tracktruth):
arraytruth.append(True)
else:
arraytruth.append(False)
When there is not a single True in array, then entire array is considered False:
filt = 5
tracktruth = list()
arraytruth = list()
arr1 = np.array([[1,2,4]])
for track in range(0,arr1.size):
if filt == arr1[0,track]:
tracktruth.append(True)
else:
tracktruth.append(False)
if any(tracktruth):
arraytruth.append(True)
else:
arraytruth.append(False)
The reason the second if-else statement is there is because I wish to apply this mask to multiple arrays and ultimately create a master list that describes which arrays are true and which are false in their entirety. However, with a for loop and two if-else statements, I think this would be very slow with larger arrays. What would be a faster way to do this?
This seems overly complicated, you can use boolean indexing to achieve results without loops
arr1=np.array([[1,2,4]])
filt=4
arr1==filt
array([[False, False, True]])
np.sum(arr1==filt).astype(bool)
True
With nmore than one row, you can use the row or column index in the np.sum or you can use the axis parameter to sum on rows or columns
As pointed out in the comments, you can use np.any() instead of the np.sum(...).astype(bool) and it runs in roughly 2/3 the time on the test dataset:
np.any(a==filt, axis=1)
array([ True])
You can do this with list comprehension. I've done it here for one array but it's easily extended to multiple arrays with a for loop
filt = 4
arr1 = np.array([[1,2,4]])
print(any([part == filt for part in arr1[0]]))
You can get the arraytruth more generally, with list comprehension for the array of size (n,m)
import numpy as np
filt = 4
a = np.array([[1, 2, 4]])
b = np.array([[1, 2, 3],
[5, 6, 7]])
array_lists = [a, b]
arraytruth = [True if a[a==filt].size>0 else False for a in array_lists]
print(arraytruth)
This will give you:
[True, False]
[edit] Use numpy hstack method.
filt = 4
arr = np.array([[1,2,3,4], [1,2,3]])
print(any([x for x in np.hstack(arr) if x < filt]))

numpy.where for 2+ specific values

Can the numpy.where function be used for more than one specific value?
I can specify a specific value:
>>> x = numpy.arange(5)
>>> numpy.where(x == 2)[0][0]
2
But I would like to do something like the following. It gives an error of course.
>>> numpy.where(x in [3,4])[0][0]
[3,4]
Is there a way to do this without iterating through the list and combining the resulting arrays?
EDIT: I also have a lists of lists of unknown lengths and unknown values so I cannot easily form the parameters of np.where() to search for multiple items. It would be much easier to pass a list.
You can use the numpy.in1d function with numpy.where:
import numpy
numpy.where(numpy.in1d(x, [2,3]))
# (array([2, 3]),)
I guess np.in1d might help you, instead:
>>> x = np.arange(5)
>>> np.in1d(x, [3,4])
array([False, False, False, True, True], dtype=bool)
>>> np.argwhere(_)
array([[3],
[4]])
If you only need to check for a few values you can:
import numpy as np
x = np.arange(4)
ret_arr = np.where([x == 1, x == 2, x == 4, x == 0])[1]
print "Ret arr = ",ret_arr
Output:
Ret arr = [1 2 0]

Difference between nonzero(a), where(a) and argwhere(a). When to use which?

In Numpy, nonzero(a), where(a) and argwhere(a), with a being a numpy array, all seem to return the non-zero indices of the array. What are the differences between these three calls?
On argwhere the documentation says:
np.argwhere(a) is the same as np.transpose(np.nonzero(a)).
Why have a whole function that just transposes the output of nonzero ? When would that be so useful that it deserves a separate function?
What about the difference between where(a) and nonzero(a)? Wouldn't they return the exact same result?
nonzero and argwhere both give you information about where in the array the elements are True. where works the same as nonzero in the form you have posted, but it has a second form:
np.where(mask,a,b)
which can be roughly thought of as a numpy "ufunc" version of the conditional expression:
a[i] if mask[i] else b[i]
(with appropriate broadcasting of a and b).
As far as having both nonzero and argwhere, they're conceptually different. nonzero is structured to return an object which can be used for indexing. This can be lighter-weight than creating an entire boolean mask if the 0's are sparse:
mask = a == 0 # entire array of bools
mask = np.nonzero(a)
Now you can use that mask to index other arrays, etc. However, as it is, it's not very nice conceptually to figure out which indices correspond to 0 elements. That's where argwhere comes in.
I can't comment on the usefulness of having a separate convenience function that transposes the result of another, but I can comment on where vs nonzero. In it's simplest use case, where is indeed the same as nonzero.
>>> np.where(np.array([[0,4],[4,0]]))
(array([0, 1]), array([1, 0]))
>>> np.nonzero(np.array([[0,4],[4,0]]))
(array([0, 1]), array([1, 0]))
or
>>> a = np.array([[1, 2],[3, 4]])
>>> np.where(a == 3)
(array([1, 0]),)
>>> np.nonzero(a == 3)
(array([1, 0]),)
where is different from nonzero in the case when you wish to pick elements of from array a if some condition is True and from array b when that condition is False.
>>> a = np.array([[6, 4],[0, -3]])
>>> b = np.array([[100, 200], [300, 400]])
>>> np.where(a > 0, a, b)
array([[6, 4], [300, 400]])
Again, I can't explain why they added the nonzero functionality to where, but this at least explains how the two are different.
EDIT: Fixed the first example... my logic was incorrect previously

How to invert numpy.where (np.where) function

I frequently use the numpy.where function to gather a tuple of indices of a matrix having some property. For example
import numpy as np
X = np.random.rand(3,3)
>>> X
array([[ 0.51035326, 0.41536004, 0.37821622],
[ 0.32285063, 0.29847402, 0.82969935],
[ 0.74340225, 0.51553363, 0.22528989]])
>>> ix = np.where(X > 0.5)
>>> ix
(array([0, 1, 2, 2]), array([0, 2, 0, 1]))
ix is now a tuple of ndarray objects that contain the row and column indices, whereas the sub-expression X>0.5 contains a single boolean matrix indicating which cells had the >0.5 property. Each representation has its own advantages.
What is the best way to take ix object and convert it back to the boolean form later when it is desired? For example
G = np.zeros(X.shape,dtype=np.bool)
>>> G[ix] = True
Is there a one-liner that accomplishes the same thing?
Something like this maybe?
mask = np.zeros(X.shape, dtype='bool')
mask[ix] = True
but if it's something simple like X > 0, you're probably better off doing mask = X > 0 unless mask is very sparse or you no longer have a reference to X.
mask = X > 0
imask = np.logical_not(mask)
For example
Edit: Sorry for being so concise before. Shouldn't be answering things on the phone :P
As I noted in the example, it's better to just invert the boolean mask. Much more efficient/easier than going back from the result of where.
The bottom of the np.where docstring suggests to use np.in1d for this.
>>> x = np.array([1, 3, 4, 1, 2, 7, 6])
>>> indices = np.where(x % 3 == 1)[0]
>>> indices
array([0, 2, 3, 5])
>>> np.in1d(np.arange(len(x)), indices)
array([ True, False, True, True, False, True, False], dtype=bool)
(While this is a nice one-liner, it is a lot slower than #Bi Rico's solution.)

Categories