I have a 2D NumPy array like:
array([[0.87, 0.13, 0.18, 0.04, 0.79],
       [0.07, 0.58, 0.84, 0.82, 0.76],
       [0.12, 0.77, 0.68, 0.58, 0.8 ],
       [0.43, 0.2 , 0.57, 0.91, 0.01],
       [0.43, 0.74, 0.56, 0.11, 0.58]])
I'd like to test each number to see whether it is the local minimum within a window of x by y; for example, a 3x3 window would return this:
array([[False, False, False,  True, False],
       [ True, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False,  True],
       [False, False, False, False, False]])
I want to avoid using a native Python loop, since my array is quite large.
You can use scipy.ndimage.minimum_filter to compute the 2D minima, then perform a comparison to the original:
import numpy as np
from scipy.ndimage import minimum_filter

# a is the input array from the question
mins = minimum_filter(a, size=(3, 3), mode='constant', cval=np.inf)
a == mins  # or np.isclose(a, mins) to be safe with floats
output:
array([[False, False, False,  True, False],
       [ True, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False,  True],
       [False, False, False, False, False]])
intermediate mins:
array([[0.07, 0.07, 0.04, 0.04, 0.04],
       [0.07, 0.07, 0.04, 0.04, 0.04],
       [0.07, 0.07, 0.2 , 0.01, 0.01],
       [0.12, 0.12, 0.11, 0.01, 0.01],
       [0.2 , 0.2 , 0.11, 0.01, 0.01]])
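For reference, here is the same thing as a self-contained snippet (the variable names are mine); an arbitrary x-by-y window is just a matter of passing size=(x, y):

import numpy as np
from scipy.ndimage import minimum_filter

a = np.array([[0.87, 0.13, 0.18, 0.04, 0.79],
              [0.07, 0.58, 0.84, 0.82, 0.76],
              [0.12, 0.77, 0.68, 0.58, 0.8 ],
              [0.43, 0.2 , 0.57, 0.91, 0.01],
              [0.43, 0.74, 0.56, 0.11, 0.58]])

# Each cell becomes the minimum of its 3x3 neighbourhood; padding with +inf
# (mode='constant', cval=np.inf) means border cells are compared only against
# values that actually lie inside the array.
mins = minimum_filter(a, size=(3, 3), mode='constant', cval=np.inf)
local_min = np.isclose(a, mins)  # True where a cell equals its window minimum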
I have a time series t composed of 30 features, with a shape of (5400, 30). To plot it and identify the anomalies I had to reshape it in the following way:
t = t[:,0].reshape(-1)
Now it is a single tensor of shape (5400,), on which I could perform my analysis and build a list of 5400 True/False elements based on the positions of the anomalies:
anomaly = [True, False, True, ...., False]
Now I would like to reshape this list to a size of (30, 5400) (the reverse of the first one). How can I do that?
EDIT: this is an example of what I'm trying to achieve:
I have a time series of size (2, 4)
feature 1 | feature 2 | feature 3 | feature 4
  0.3     |   0.1     |   0.24    |   0.25
  0.62    |   0.45    |   0.43    |   0.9
Coded as:
[[0.3, 0.1, 0.24, 0.25]
[0.62, 0.45, 0.43, 0.9]]
When I reshape it I get this univariate time series of size (8,):
[0.3, 0.1, 0.24, 0.25, 0.62, 0.45, 0.43, 0.9]
On this time series I applied an anomaly detection method which gave me a list of True/False for each value:
[True, False, True, False, False, True, True, False]
I want to reshape this list to the reverse of the original shape, so it would be structured as:
feature 1 True, False
feature 2 False, True
feature 3 True, True
feature 4 False, False
with a shape of (4, 2), so coded it should be:
[[True, False]
[False, True]
[True, True]
[False, False]]
import numpy as np

t = np.array([[0.3, 0.1, 0.24, 0.25], [0.62, 0.45, 0.43, 0.9]])
anomaly = [True, False, True, False, False, True, True, False]
your_req_array = np.array(anomaly).reshape(2, 4).T
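As a quick sanity check, printing the result should reproduce the (4, 2) array asked for above; the same reshape-then-transpose pattern should carry over to the full-size series, assuming the anomaly list covers every value of the original array:

print(your_req_array)
# [[ True False]
#  [False  True]
#  [ True  True]
#  [False False]]

# For the original (5400, 30) series (assuming 5400*30 anomaly flags):
# np.array(anomaly).reshape(5400, 30).T   # -> shape (30, 5400)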
I am filtering the arrays a and b for matching values and then appending them to a new array difference; however, I get the error: ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 2. How can I fix this?
import numpy as np

a = np.array([[0, 12], [1, 40], [0, 55], [1, 23], [0, 123.5], [1, 4]])
b = np.array([[0, 3], [1, 10], [0, 55], [1, 34], [1, 122], [0, 123]])
difference = np.array([[]])
for i in a:
    for j in b:
        if np.allclose(i, j, atol=0.5):
            difference = np.concatenate((difference, [i]))
Expected Output:
[[ 0. 55.],[ 0. 123.5]]
The problem is that you are trying to concatenate an array whose rows have size 0
difference= np.array([[]]) # Specifically [ [<no elements>] ]
with an array whose rows have size 2:
np.concatenate((difference,[i])) # Specifically [i] which is [ [ 0., 55.] ]
Instead of initializing difference with an empty array of size 0, you can give it the right number of columns by calling .reshape() on it:
# difference= np.array([[]]) # Old code
difference= np.array([]).reshape(0, 2) # Updated code
Output
[[ 0. 55. ]
[ 0. 123.5]]
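Putting the fix into the original loop, a minimal runnable sketch (using the arrays from the question):

import numpy as np

a = np.array([[0, 12], [1, 40], [0, 55], [1, 23], [0, 123.5], [1, 4]])
b = np.array([[0, 3], [1, 10], [0, 55], [1, 34], [1, 122], [0, 123]])

# 0 rows, 2 columns: each matching row [i] (shape (1, 2)) can now be
# stacked onto it along the default axis 0.
difference = np.array([]).reshape(0, 2)
for i in a:
    for j in b:
        if np.allclose(i, j, atol=0.5):
            difference = np.concatenate((difference, [i]))

print(difference)
# [[  0.   55. ]
#  [  0.  123.5]]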
The shape of np.array([[]]) is (1, 0). To make it work, it should be (0, 2):
difference= np.zeros((0, *a.shape[1:]))
In [22]: a = np.array([[0,12],[1,40],[0,55],[1,23],[0,123.5],[1,4]])
...: b = np.array([[0,3],[1,10],[0,55],[1,34],[1,122],[0,123]])
Using a straightforward list comprehension:
In [23]: [i for i in a for j in b if np.allclose(i,j,atol=0.5)]
Out[23]: [array([ 0., 55.]), array([ 0. , 123.5])]
But as for your concatenate: look at the shapes of the arrays:
In [24]: np.array([[]]).shape
Out[24]: (1, 0)
In [25]: np.array([i]).shape
Out[25]: (1, 2)
Those can only be joined on axis 1; the default is axis 0, which gives you the error. As noted in the comments, you have to understand array shapes to use concatenate.
In [26]: difference = np.array([[]])
    ...: for i in a:
    ...:     for j in b:
    ...:         if np.allclose(i, j, atol=0.5):
    ...:             difference = np.concatenate((difference, [i]), axis=1)
    ...:
In [27]: difference
Out[27]: array([[ 0. , 55. , 0. , 123.5]])
vectorized
A whole-array approach:
Broadcast a against b, producing a (6, 6, 2) closeness array:
In [37]: np.isclose(a[:,None,:],b[None,:,:], atol=0.5)
Out[37]:
array([[[ True, False],
[False, False],
[ True, False],
[False, False],
[False, False],
[ True, False]],
[[False, False],
[ True, False],
[False, False],
[ True, False],
[ True, False],
[False, False]],
[[ True, False],
[False, False],
[ True, True],
[False, False],
[False, False],
[ True, False]],
[[False, False],
[ True, False],
[False, False],
[ True, False],
[ True, False],
[False, False]],
[[ True, False],
[False, False],
[ True, False],
[False, False],
[False, False],
[ True, True]],
[[False, False],
[ True, False],
[False, False],
[ True, False],
[ True, False],
[False, False]]])
Find where both columns are True, and then where at least one "row" (i.e. some row of b) matches:
In [38]: _.all(axis=2)
Out[38]:
array([[False, False, False, False, False, False],
[False, False, False, False, False, False],
[False, False, True, False, False, False],
[False, False, False, False, False, False],
[False, False, False, False, False, True],
[False, False, False, False, False, False]])
In [39]: _.any(axis=1)
Out[39]: array([False, False, True, False, True, False])
In [40]: a[_]
Out[40]:
array([[ 0. , 55. ],
[ 0. , 123.5]])
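The same steps can be chained into a single expression (a sketch, equivalent to In [37]-[40] above):

close = np.isclose(a[:, None, :], b[None, :, :], atol=0.5)   # shape (6, 6, 2)
difference = a[close.all(axis=2).any(axis=1)]
# array([[  0. ,  55. ],
#        [  0. , 123.5]])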
Say there's a np.float32 matrix A of shape (N, M). Together with A, I possess another matrix B, of type np.bool, of the exact same shape (elements from A can be mapped 1:1 to B). Example:
A =
[
[0.1, 0.2, 0.3],
[4.02, 123.4, 534.65],
[2.32, 22.0, 754.01],
[5.41, 23.1, 1245.5],
[6.07, 0.65, 22.12],
]
B =
[
[True, False, True],
[False, False, True],
[True, True, False],
[True, True, True],
[True, False, True],
]
Now, I'd like to perform np.max, np.min, np.argmax and np.argmin on axis=1 of A, but only considering elements A[i,j] for which B[i,j] == True. Is it possible to do something like this in NumPy? The for-loop version is trivial, but I'm wondering whether I can get some of that juicy NumPy speed.
The result for A, B and np.max (for example) would be:
[ 0.3, 534.65, 22.0, 1245.5, 22.12 ]
I've avoided ma because I've heard that the computation gets very slow and I don't feel like specifying fill_value makes sense in this context. I just want the numbers to be ignored.
Also, if it matters at all in my case, N ranges in thousands and M ranges in units.
This is a textbook application for masked arrays. But as always there are other ways to do it.
import numpy as np
A = np.array([[  0.1 ,    0.2 ,    0.3 ],
              [  4.02,  123.4 ,  534.65],
              [  2.32,   22.0 ,  754.01],
              [  5.41,   23.1 , 1245.5 ],
              [  6.07,    0.65,   22.12]])
B = np.array([[ True, False,  True],
              [False, False,  True],
              [ True,  True, False],
              [ True,  True,  True],
              [ True, False,  True]])
With nanmax etc.
You could cast the 'invalid' values to NaN (say), then use NumPy's special NaN-ignoring functions:
>>> A[~B] = np.nan # <-- Note this mutates A
>>> np.nanmax(A, axis=1)
array([3.0000e-01, 5.3465e+02, 2.2000e+01, 1.2455e+03, 2.2120e+01])
The catch is that, while np.nanmax, np.nanmin, np.nanargmax, and np.nanargmin all exist, lots of functions don't have a NaN-ignoring twin, so you might have to come up with something else eventually.
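In those cases one common workaround (my suggestion, not part of the original answer) is to substitute a sentinel value with np.where rather than mutating A; this assumes every row of B contains at least one True:

safe = np.where(B, A, -np.inf)   # -inf can never win a max, so masked-out entries are ignored
np.max(safe, axis=1)             # -> array([   0.3 ,  534.65,   22.  , 1245.5 ,   22.12])
np.argmax(safe, axis=1)          # index of the row-wise max among the allowed entries
# For min/argmin, use +np.inf as the sentinel instead.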
With ma
It seems weird not to mention masked arrays, which are straightforward. Notice that the mask is (to my mind anyway) 'backwards'. That is, True means the value is 'masked' or invalid and will be ignored. Hence having to negate B with the tilde. Then you can do what you want with the masked array:
>>> X = np.ma.masked_array(A, mask=~B) # <--- Note the tilde.
>>> np.max(X, axis=1)
masked_array(data=[0.3, 534.65, 22.0, 1245.5, 22.12],
mask=[False, False, False, False, False],
fill_value=1e+20)
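The question also asks about argmax/argmin; those work directly on the masked array too (the indices below are my own expected output, shown for illustration):

>>> X.argmax(axis=1)   # masked entries are ignored
array([2, 2, 1, 2, 2])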
Given a matrix of values that represent probabilities, I am trying to write an efficient process that returns the bin each value belongs to. For example:
import numpy as np

sample = 0.5
x = np.array([0.1] * 10)
np.digitize(sample, np.cumsum(x)) - 1
# returns 5
is the result I am looking for.
According to timeit, for x arrays with few elements it is more efficient to do it as:
cdf = 0
for key, val in enumerate(x):
    cdf += val
    if sample <= cdf:
        print(key)
        break
while for bigger x arrays the numpy solution is faster.
The question:
Is there a way to further accelerate it, e.g., a function that combines the steps?
Can we vectorize the process for the case where sample is a list, each item of which is associated with its own x array (x will then be 2D)?
In the application x contains the marginal probabilities; this is why I need to decrement the results of np.digitize.
You could use some broadcasting magic there -
(x.cumsum(1) > sample[:,None]).argmax(1)-1
Steps involved:
I. Perform cumsum along each row.
II. Use a broadcast comparison of each cumsum row against its sample value and look for the first place where the cumsum exceeds the sample; the position just before that is the bin index we are looking for.
Step-by-step run -
In [64]: x
Out[64]:
array([[ 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 ],
[ 0.8 , 0.96, 0.88, 0.36, 0.5 , 0.68, 0.71],
[ 0.37, 0.56, 0.5 , 0.01, 0.77, 0.88, 0.36],
[ 0.62, 0.08, 0.37, 0.93, 0.65, 0.4 , 0.79]])
In [65]: sample # one elem per row of x
Out[65]: array([ 0.5, 2.2, 1.9, 2.2])
In [78]: x.cumsum(1)
Out[78]:
array([[ 0.1 , 0.2 , 0.3 , 0.4 , 0.5 , 0.6 , 0.7 ],
[ 0.8 , 1.76, 2.64, 2.99, 3.49, 4.18, 4.89],
[ 0.37, 0.93, 1.43, 1.45, 2.22, 3.1 , 3.47],
[ 0.62, 0.69, 1.06, 1.99, 2.64, 3.04, 3.83]])
In [79]: x.cumsum(1) > sample[:,None]
Out[79]:
array([[False, False, False, False, False, True, True],
[False, False, True, True, True, True, True],
[False, False, False, False, True, True, True],
[False, False, False, False, True, True, True]], dtype=bool)
In [80]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[80]: array([4, 1, 3, 3])
# A loopy solution to verify results against
In [81]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[81]: [4, 1, 3, 3]
Boundary cases:
The proposed solution automatically handles the cases where sample values are less than the smallest of the cumulative summed values -
In [113]: sample[0] = 0.08 # editing first sample to be lesser than 0.1
In [114]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[114]: [-1, 1, 3, 3]
In [115]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[115]: array([-1, 1, 3, 3])
For cases where a sample value is greater than the largest of the cumulative summed values, we need one extra step -
In [116]: sample[0] = 0.8 # editing first sample to be greater than 0.7
In [121]: mask = (x.cumsum(1) > sample[:,None])
In [122]: idx = mask.argmax(1)-1
In [123]: np.where(mask.any(1),idx,x.shape[1]-1)
Out[123]: array([6, 1, 3, 3])
In [124]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[124]: [6, 1, 3, 3]
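Wrapping those pieces into one helper (the function name row_digitize is mine; this is just a sketch of the combined steps):

import numpy as np

def row_digitize(sample, x):
    # Per-row equivalent of np.digitize(sample[i], np.cumsum(x[i])) - 1.
    csum = x.cumsum(axis=1)
    mask = csum > sample[:, None]     # first True marks where the cumsum exceeds the sample
    idx = mask.argmax(axis=1) - 1     # gives -1 when the sample is below the smallest cumsum
    return np.where(mask.any(axis=1), idx, x.shape[1] - 1)   # clamp when above all cumsums

Applied to the sample and x above, this should reproduce the loopy digitize results shown in Out[114] and Out[124].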
Example dataset (rows were randomly extracted from a much larger matrix)
import numpy as np
test = [[np.nan, np.nan, 0.217, 0.562],
        [np.nan, np.nan, 0.217, 0.562],
        [0.269, 0.0, 0.217, 0.562],
        [np.nan, np.nan, 0.217, -0.953],
        [np.nan, np.nan, 0.217, -0.788],
        [0.75, 0.0, 0.217, 0.326],
        [0.207, 0.0, 0.217, 0.814],
        [np.nan, np.nan, 0.217, 0.562],
        [np.nan, np.nan, 0.217, -0.022],
        [np.nan, np.nan, 0.217, 0.562],
        [np.nan, np.nan, 0.217, -0.953],
        [np.nan, np.nan, 0.217, -0.953],
        [0.078, 0.0, 0.217, -0.953],
        [np.nan, np.nan, 0.217, -0.953],
        [0.078, 0.0, 0.217, 0.562]]
maskedarr = np.ma.array(test)
np.ma.cov(maskedarr,rowvar=False,allow_masked=True)
[[-- -- -- --]
[-- -- -- --]
[-- -- 0.0 0.0]
[-- -- 0.0 0.554]]
However, if I use R,
import rpy2.robjects as robjects
robjects.globalenv['maskedarr'] = robjects.FloatVector(maskedarr.T.flatten())
robjects.r('''
dim(maskedarr) <- c(%d,%d)
maskedarr[] <- replace(maskedarr,!is.finite(maskedarr),NA)
''' % maskedarr.shape)
robjects.r('''
print(cov(maskedarr,use="pairwise"))
''')
[,1] [,2] [,3] [,4]
[1,] 0.0769733 0 0 0.0428294
[2,] 0.0000000 0 0 0.0000000
[3,] 0.0000000 0 0 0.0000000
[4,] 0.0428294 0 0 0.5536484
I get a very different matrix. If pairwise correlations are taken with NaNs removed only for each pair, then I would expect something like R's answer. numpy.ma.cov says that allow_masked=True will allow these pairwise correlations to be calculated, but that does not appear to be the case. Am I missing something?
Your maskedarr does not have any values masked.
>>> maskedarr.mask
False
You need to include the mask argument when initializing the array.
>>> maskedarr = np.ma.array(test, mask=np.isnan(test))
Now maskedarr.mask is as follows.
>>> maskedarr.mask
array([[ True, True, False, False],
[ True, True, False, False],
[False, False, False, False],
[ True, True, False, False],
[ True, True, False, False],
[False, False, False, False],
[False, False, False, False],
[ True, True, False, False],
[ True, True, False, False],
[ True, True, False, False],
[ True, True, False, False],
[ True, True, False, False],
[False, False, False, False],
[ True, True, False, False],
[False, False, False, False]], dtype=bool)
This time when doing numpy.ma.cov:
>>> np.ma.cov(maskedarr,rowvar=False,allow_masked=True)
masked_array(data =
[[0.0769732996251 0.0 0.0 0.0428294015418]
[0.0 0.0 0.0 0.0]
[0.0 0.0 0.0 0.0]
[0.0428294015418 0.0 0.0 0.553648402899]],
mask =
[[False False False False]
[False False False False]
[False False False False]
[False False False False]],
fill_value = 1e+20)