how does numpy.where work? - python

I can understand the following numpy behavior.
>>> a
array([[ 0. ,  0. ,  0. ],
       [ 0. ,  0.7,  0. ],
       [ 0. ,  0.3,  0.5],
       [ 0.6,  0. ,  0.8],
       [ 0.7,  0. ,  0. ]])
>>> argmax_overlaps = a.argmax(axis=1)
>>> argmax_overlaps
array([0, 1, 2, 2, 0])
>>> max_overlaps = a[np.arange(5),argmax_overlaps]
>>> max_overlaps
array([ 0. , 0.7, 0.5, 0.8, 0.7])
>>> gt_argmax_overlaps = a.argmax(axis=0)
>>> gt_argmax_overlaps
array([4, 1, 3])
>>> gt_max_overlaps = a[gt_argmax_overlaps,np.arange(a.shape[1])]
>>> gt_max_overlaps
array([ 0.7, 0.7, 0.8])
>>> gt_argmax_overlaps = np.where(a == gt_max_overlaps)
>>> gt_argmax_overlaps
(array([1, 3, 4]), array([1, 2, 0]))
I understood that 0.7, 0.7 and 0.8 are a[1,1], a[3,2] and a[4,0], so I got the tuple (array([1,3,4]), array([1,2,0])), whose two arrays hold the row and column indices of those three elements. I then tried other examples to see whether my understanding was correct.
>>> np.where(a == [0.3])
(array([2]), array([1]))
0.3 is in a[2,1], so the outcome looks as I expected. Then I tried
>>> np.where(a == [0.3, 0.5])
(array([], dtype=int64),)
?? I expected to see (array([2,2]), array([1,2])), since 0.3 is at a[2,1] and 0.5 is at a[2,2]. Why do I see the output above?
>>> np.where(a == [0.7, 0.7, 0.8])
(array([1, 3, 4]), array([1, 2, 0]))
>>> np.where(a == [0.8,0.7,0.7])
(array([1]), array([1]))
I can't understand the second result either. Could someone please explain it to me? Thanks.

The first thing to realize is that np.where(a == [whatever]) is just showing you the indices where a == [whatever] is True. So you can get a hint by looking at the value of a == [whatever]. In your case that "works":
>>> a == [0.7, 0.7, 0.8]
array([[False, False, False],
       [False,  True, False],
       [False, False, False],
       [False, False,  True],
       [ True, False, False]], dtype=bool)
You aren't getting what you think you are. You think this asks for the indices of each element separately, but the list is actually broadcast against every row, so the comparison asks, elementwise: "for each row, is the first element 0.7, is the second 0.7, and is the third 0.8?" np.where then returns the indices of the matching positions. In other words, the comparison matches by value and position within each row, not by value alone. For your last example:
>>> a == [0.8, 0.7, 0.7]
array([[False, False, False],
       [False,  True, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)
You now get a different result. It's not asking for "the indices where a has value 0.8"; it's asking for the indices where there is a 0.8 at the beginning of a row -- and likewise a 0.7 in either of the two later positions.
This type of row-wise comparison only works elementwise if the value you compare against broadcasts against a single row of a. When you try it with a two-element list, broadcasting fails, so the whole comparison collapses to the single scalar False, and np.where of False returns an empty result.
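You can see the fallback directly (a minimal check; on the NumPy version used in the question a failed elementwise comparison silently returns the scalar False, while newer releases warn or may raise):
>>> a == [0.3, 0.5]   # shapes (5,3) and (2,) don't broadcast
False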
The upshot is that you can't use == on a list of values and expect it to just tell you where any of the values occurs. The equality will match by value and position (if the value you compare against is the same shape as a row of your array), or it will try to compare the whole list as a scalar (if the shape doesn't match). If you want to search for the values independently, you need to do something like what Khris suggested in a comment:
np.where((a==0.3)|(a==0.5))
That is, you need to make two (or more) separate comparisons against separate values, not a single comparison against a list of values.
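Equivalently, np.isin (available since NumPy 1.13) does the elementwise membership test the question was reaching for; a short sketch:
>>> np.where(np.isin(a, [0.3, 0.5]))
(array([2, 2]), array([1, 2]))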

NumPy masked operation?

Say there's a np.float32 matrix A of shape (N, M). Together with A, I possess another matrix B, of type np.bool, of the exact same shape (elements from A can be mapped 1:1 to B). Example:
A =
[
  [0.1,  0.2,   0.3],
  [4.02, 123.4, 534.65],
  [2.32, 22.0,  754.01],
  [5.41, 23.1,  1245.5],
  [6.07, 0.65,  22.12],
]
B =
[
  [True,  False, True],
  [False, False, True],
  [True,  True,  False],
  [True,  True,  True],
  [True,  False, True],
]
Now, I'd like to perform np.max, np.min, np.argmax and np.argmin on axis=1 of A, but only considering elements A[i,j] for which B[i,j] == True. Is it possible to do something like this in NumPy? The for-loop version is trivial, but I'm wondering whether I can get some of that juicy NumPy speed.
The result for A, B and np.max (for example) would be:
[ 0.3, 534.65, 22.0, 1245.5, 22.12 ]
I've avoided ma because I've heard that the computation gets very slow and I don't feel like specifying fill_value makes sense in this context. I just want the numbers to be ignored.
Also, if it matters at all in my case, N is in the thousands and M is in the single digits.
This is a textbook application for masked arrays. But as always there are other ways to do it.
import numpy as np

A = np.array([[ 0.1,    0.2,    0.3  ],
              [ 4.02, 123.4,  534.65],
              [ 2.32,  22.0,  754.01],
              [ 5.41,  23.1, 1245.5 ],
              [ 6.07,   0.65,  22.12]])

B = np.array([[ True, False,  True],
              [False, False,  True],
              [ True,  True, False],
              [ True,  True,  True],
              [ True, False,  True]])
With nanmax etc.
You could cast the 'invalid' values to NaN (say), then use NumPy's special NaN-ignoring functions:
>>> A[~B] = np.nan # <-- Note this mutates A
>>> np.nanmax(A, axis=1)
array([3.0000e-01, 5.3465e+02, 2.2000e+01, 1.2455e+03, 2.2120e+01])
The catch is that, while np.nanmax, np.nanmin, np.nanargmax, and np.nanargmin all exist, lots of functions don't have a NaN-ignoring twin, so you might have to come up with something else eventually.
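If you'd rather not mutate A, a small variant (a sketch using np.where to build the NaN-filled copy) gives the same result:
>>> masked = np.where(B, A, np.nan)   # NaN wherever B is False; A is left untouched
>>> np.nanmax(masked, axis=1)
array([3.0000e-01, 5.3465e+02, 2.2000e+01, 1.2455e+03, 2.2120e+01])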
With ma
It seems weird not to mention masked arrays, which are straightforward. Notice that the mask is (to my mind anyway) 'backwards'. That is, True means the value is 'masked' or invalid and will be ignored. Hence having to negate B with the tilde. Then you can do what you want with the masked array:
>>> X = np.ma.masked_array(A, mask=~B) # <--- Note the tilde.
>>> np.max(X, axis=1)
masked_array(data=[0.3, 534.65, 22.0, 1245.5, 22.12],
             mask=[False, False, False, False, False],
       fill_value=1e+20)
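The arg-variants work on the masked array too; masked entries are skipped (a quick check against the example data, using the X defined above):
>>> X.argmax(axis=1)
array([2, 2, 1, 2, 2])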

How to delete decimal values from an array in a pythonic way

I am trying to delete an element from an array. When I try to delete integer values (using numpy.delete) it works, but it doesn't work for decimal values.
For integer deletion
X = [1. 2. 2.5 5.7 3. 6. ]
to_delete_key = [3, 7.3]
Y = np.delete(X, to_delete_key, None)
Output is [1. 2. 2.5 5.7 6. ]
The value 3 got deleted
Whereas in the case of decimal deletion
For decimal deletion
X = [6. 7.3 9.1]
to_delete_key = [3, 7.3]
Y = np.delete(X, to_delete_key, None)
Output is [6. 7.3 9.1]
The value 7.3 didn't get deleted.
I know how to do it the normal way, but is there any efficient pythonic way to do it?
In [249]: X = np.array([1., 2., 2.5, 5.7, 3., 6.])
     ...: to_delete_key = [3, 7.3]

In [252]: np.delete(X, to_delete_key)
Traceback (most recent call last):
  File "<ipython-input-252-f9031065a548>", line 1, in <module>
    np.delete(X, to_delete_key)
  File "<__array_function__ internals>", line 5, in delete
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 4406, in delete
    keep[obj,] = False
IndexError: arrays used as indices must be of integer (or boolean) type
Using an integer:
In [253]: np.delete(X, 3)
Out[253]: array([1. , 2. , 2.5, 3. , 6. ])
It was the 5.7 that was deleted, X[3].
np.delete does not delete by value! From the docs:
obj : slice, int or array of ints
Indicate indices of sub-arrays to remove along the specified axis.
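So a list argument works only when it holds valid integer indices; for instance (illustrative run, same X as above):
In [254]: np.delete(X, [1, 3])   # removes X[1] and X[3], i.e. 2.0 and 5.7
Out[254]: array([1. , 2.5, 3. , 6. ])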
We can look for value matches
In [267]: vals = [3, 2.5]
In [268]: X[:,None]==vals
Out[268]:
array([[False, False],
       [False, False],
       [False,  True],
       [False, False],
       [ True, False],
       [False, False]])
But an equality match on floats can be unreliable; isclose operates with a tolerance:
In [269]: np.isclose(X[:,None], vals)
Out[269]:
array([[False, False],
       [False, False],
       [False,  True],
       [False, False],
       [ True, False],
       [False, False]])
Then find the rows where there's a match:
In [270]: _.any(axis=1)
Out[270]: array([False, False, True, False, True, False])
In [271]: X[_]
Out[271]: array([2.5, 3. ])
In [272]: X[~__]
Out[272]: array([1. , 2. , 5.7, 6. ])
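Putting those pieces together, a small helper (hypothetical name, a sketch only) deletes by value with a tolerance:
def delete_by_value(arr, values, tol=1e-8):
    # keep only elements that are not close to any of the given values
    mask = np.isclose(arr[:, None], values, atol=tol).any(axis=1)
    return arr[~mask]

delete_by_value(X, [3, 2.5])   # -> array([1. , 2. , 5.7, 6. ])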
Lists have a remove by value:
In [284]: alist=X.tolist()
In [285]: alist.remove(3.0)
In [286]: alist.remove(2.5)
In [287]: alist
Out[287]: [1.0, 2.0, 5.7, 6.0]
You are dealing with floating-point numbers that cannot be compared exactly. Search for "What every programmer should know about floating-point numbers".
1/3 + 1/3 + 1/3 might not be equal to 1 due to rounding errors.
So the explanation is that your value of 7.3 is not found: NumPy may have stored or converted it at a precision that is not exactly equal to what is in the array.
As mentioned by @elPastor, you are misusing NumPy.

efficient numpy.cumsum and numpy.digitize

Given a matrix of values that represent probabilities, I am trying to write an efficient process that returns the bin that a value belongs to. For example:
sample = 0.5
x = np.array([0.1]*10)
np.digitize(sample, np.cumsum(x)) - 1
# returns 4 (np.digitize itself returns 5)
is the result I am looking for.
According to timeit, for x arrays with few elements it is more efficient to do it as:
cdf = 0
for key, val in enumerate(x):
    cdf += val
    if sample <= cdf:
        print(key)
        break
while for bigger x arrays the numpy solution is faster.
The question:
Is there a way to further accelerate it, e.g., a function that combines the steps?
Can we vectorize the process for the case where sample is a list, each item of which is associated with its own x array (x will then be 2-D)?
In the application, x contains the marginal probabilities; this is why I need to decrement the result of np.digitize.
You could use some broadcasting magic there -
(x.cumsum(1) > sample[:,None]).argmax(1)-1
Steps involved:
I. Perform cumsum along each row.
II. Use a broadcasted comparison of each cumsum row against the corresponding sample value and look for the first position where the cumsum exceeds the sample; the element before that position in x is the index we are looking for.
Step-by-step run -
In [64]: x
Out[64]:
array([[ 0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ,  0.1 ],
       [ 0.8 ,  0.96,  0.88,  0.36,  0.5 ,  0.68,  0.71],
       [ 0.37,  0.56,  0.5 ,  0.01,  0.77,  0.88,  0.36],
       [ 0.62,  0.08,  0.37,  0.93,  0.65,  0.4 ,  0.79]])
In [65]: sample # one elem per row of x
Out[65]: array([ 0.5, 2.2, 1.9, 2.2])
In [78]: x.cumsum(1)
Out[78]:
array([[ 0.1 ,  0.2 ,  0.3 ,  0.4 ,  0.5 ,  0.6 ,  0.7 ],
       [ 0.8 ,  1.76,  2.64,  2.99,  3.49,  4.18,  4.89],
       [ 0.37,  0.93,  1.43,  1.45,  2.22,  3.1 ,  3.47],
       [ 0.62,  0.69,  1.06,  1.99,  2.64,  3.04,  3.83]])

In [79]: x.cumsum(1) > sample[:,None]
Out[79]:
array([[False, False, False, False, False,  True,  True],
       [False, False,  True,  True,  True,  True,  True],
       [False, False, False, False,  True,  True,  True],
       [False, False, False, False,  True,  True,  True]], dtype=bool)
In [80]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[80]: array([4, 1, 3, 3])
# A loopy solution to verify results against
In [81]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[81]: [4, 1, 3, 3]
Boundary cases:
The proposed solution automatically handles the cases where sample values are less than the smallest of the cumulative summed values -
In [113]: sample[0] = 0.08 # editing first sample to be lesser than 0.1
In [114]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[114]: [-1, 1, 3, 3]
In [115]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[115]: array([-1, 1, 3, 3])
For cases where a sample value is greater than the largest of the cumulative summed values, we need one extra step -
In [116]: sample[0] = 0.8 # editing first sample to be greater than 0.7
In [121]: mask = (x.cumsum(1) > sample[:,None])
In [122]: idx = mask.argmax(1)-1
In [123]: np.where(mask.any(1),idx,x.shape[1]-1)
Out[123]: array([6, 1, 3, 3])
In [124]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[124]: [6, 1, 3, 3]

How to convert 2d numpy array into binary indicator matrix for max value

Assuming I have a 2D numpy array indicating probabilities for m samples in n classes (the probabilities sum to 1 for each sample).
Assuming each sample can only be in one category, I want to create a new array with the same shape as the original, but with only binary values indicating which class had the highest probability.
Example:
[[0.2, 0.3, 0.5], [0.7, 0.1, 0.1]]
should be converted to:
[[0, 0, 1], [1, 0, 0]]
It seems argmax already does almost what I want, but instead of the indices I want an indicator matrix as described above.
Seems simple, but somehow I can't figure it out using standard numpy functions. I could use regular python loops of course, but it seems there should be a simpler way.
In case multiple classes have the same probability, I would prefer a solution which only selects one of the classes (I don't care which in this case).
Thanks!
Here's one way:
In [112]: a
Out[112]:
array([[ 0.2,  0.3,  0.5],
       [ 0.7,  0.1,  0.1]])

In [113]: a == a.max(axis=1, keepdims=True)
Out[113]:
array([[False, False,  True],
       [ True, False, False]], dtype=bool)

In [114]: (a == a.max(axis=1, keepdims=True)).astype(int)
Out[114]:
array([[0, 0, 1],
       [1, 0, 0]])
(But this will give a True value for each occurrence of the maximum in a row. See Divakar's answer for a nice way to select just the first occurrence of the maximum.)
In case of ties (two or more elements being the highest one in a row), where you want to select only one, here's one approach to do so with np.argmax and broadcasting -
(A.argmax(1)[:,None] == np.arange(A.shape[1])).astype(int)
Sample run -
In [296]: A
Out[296]:
array([[ 0.2,  0.3,  0.5],
       [ 0.5,  0.5,  0. ]])

In [297]: (A.argmax(1)[:,None] == np.arange(A.shape[1])).astype(int)
Out[297]:
array([[0, 0, 1],
       [1, 0, 0]])
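An equivalent one-hot construction (a common idiom, offered as a sketch rather than part of the original answer) indexes the identity matrix by each row's argmax; it likewise keeps only the first maximum on ties:
np.eye(A.shape[1], dtype=int)[A.argmax(1)]
# -> array([[0, 0, 1],
#           [1, 0, 0]])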

compare tuple with tuples in numpy array

I have an array (dtype=object) with the first column containing tuples of arrays and the second column containing scalars. I want all scalars from the second column where the tuples in the first column equal a certain tuple.
Say
>>> X
array([[(array([ 21.]), array([ 13.])), 0.29452519286647716],
       [(array([ 25.]), array([ 9.])), 0.9106600600510809],
       [(array([ 25.]), array([ 13.])), 0.8137344043493814],
       [(array([ 25.]), array([ 14.])), 0.8143093864975313],
       [(array([ 25.]), array([ 15.])), 0.6004337591112664],
       [(array([ 25.]), array([ 16.])), 0.6239450452872853],
       [(array([ 21.]), array([ 13.])), 0.32082105959687424]], dtype=object)
and I want all rows where the 1st column equals X[0,0].
ar = X[0,0]
>>> ar
(array([ 21.]), array([ 13.]))
I thought checking X[:,0]==ar would find those rows. I would then have retrieved my final result with X[X[:,0]==ar,1].
What seems to happen, however, is that ar gets interpreted as a two-dimensional array and each single element of ar is compared to the tuples in X[:,0]. This yields, in this case, a 2x7 array with all entries equal to False. In contrast, the comparison X[0,0]==ar works just as I would expect, giving a value of True.
Why is that happening, and how can I fix it to obtain the desired result?
Comparison using list comprehension works:
In [176]: [x==ar for x in X[:,0]]
Out[176]: [True, False, False, False, False, False, True]
This is comparing tuples with tuples.
Comparing tuple ids gives a different result:
In [175]: [id(x)==id(ar) for x in X[:,0]]
Out[175]: [True, False, False, False, False, False, False]
since the 2nd match has a different id.
In [177]: X[:,0]==ar
Out[177]:
array([[False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False]], dtype=bool)
returns a (2,7) result because it is, in effect, comparing a (7,) array with a (2,1) array (np.array(ar)).
But this works like the comprehension:
In [190]: ar1=np.zeros(1,dtype=object)
In [191]: ar1[0]=ar
In [192]: ar1
Out[192]: array([(array([ 21.]), array([ 13.]))], dtype=object)
In [193]: X[:,0]==ar1
Out[193]: array([ True, False, False, False, False, False, True], dtype=bool)
ar1 is a one-element array containing the ar tuple. Now the comparison with the elements of X[:,0] proceeds as expected.
np.array(...) tries to create as high-dimensional an array as the input data allows. That is why it turns a 2-element tuple into a 2-element array. I had to do a two-step assignment to get around that default.
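To finish the original task (a sketch building on the comprehension above): put the per-element comparison into a boolean array and use it to pull the scalars from the second column.
>>> mask = np.array([x == ar for x in X[:,0]])
>>> X[mask, 1]
array([0.29452519286647716, 0.32082105959687424], dtype=object)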
