Given a matrix of values that represent probabilities I am trying to write an efficient process that returns the bin that the value belongs to. For example:
sample = 0.5
x = np.array([0.1]*10)
np.digitize( sample, np.cumsum(x))-1
# returns 4
is the result I am looking for.
According to timeit, for x arrays with few elements it is more efficient to do it as:
cdf = 0
for key, val in enumerate(x):
    cdf += val
    if sample <= cdf:
        print(key)
        break
while for bigger x arrays the numpy solution is faster.
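A rough timing sketch of that comparison (the function names below are made up for illustration; the exact numbers and the crossover point depend on array size and hardware):
import timeit
import numpy as np

def loop_bin(sample, x):
    # linear scan over the running sum
    cdf = 0.0
    for key, val in enumerate(x):
        cdf += val
        if sample <= cdf:
            return key

def numpy_bin(sample, x):
    # binary search over the precomputed cumulative sum
    return np.digitize(sample, np.cumsum(x)) - 1

x_small = np.array([0.1] * 10)
x_large = np.full(100_000, 1e-5)
print(timeit.timeit(lambda: loop_bin(0.5, x_small), number=10000))
print(timeit.timeit(lambda: numpy_bin(0.5, x_small), number=10000))
print(timeit.timeit(lambda: loop_bin(0.9, x_large), number=100))
print(timeit.timeit(lambda: numpy_bin(0.9, x_large), number=100))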
The question:
Is there a way to further accelerate it, e.g., a function that combines the steps?
Can we vectorize the process for the case where sample is a list, each item of which is associated with its own x array (x will then be 2-D)?
In the application x contains the marginal probabilities; this is why I need to decrement the result of np.digitize.
You could use some broadcasting magic there -
(x.cumsum(1) > sample[:,None]).argmax(1)-1
Steps involved :
I. Perform cumsum along each row.
II. Use a broadcasted comparison of each sample value against its row of cumulative sums and look for the first position where the cumulative sum exceeds the sample; the element just before that position in x gives the bin index we are looking for.
Step-by-step run -
In [64]: x
Out[64]:
array([[ 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 , 0.1 ],
[ 0.8 , 0.96, 0.88, 0.36, 0.5 , 0.68, 0.71],
[ 0.37, 0.56, 0.5 , 0.01, 0.77, 0.88, 0.36],
[ 0.62, 0.08, 0.37, 0.93, 0.65, 0.4 , 0.79]])
In [65]: sample # one elem per row of x
Out[65]: array([ 0.5, 2.2, 1.9, 2.2])
In [78]: x.cumsum(1)
Out[78]:
array([[ 0.1 , 0.2 , 0.3 , 0.4 , 0.5 , 0.6 , 0.7 ],
[ 0.8 , 1.76, 2.64, 2.99, 3.49, 4.18, 4.89],
[ 0.37, 0.93, 1.43, 1.45, 2.22, 3.1 , 3.47],
[ 0.62, 0.69, 1.06, 1.99, 2.64, 3.04, 3.83]])
In [79]: x.cumsum(1) > sample[:,None]
Out[79]:
array([[False, False, False, False, False, True, True],
[False, False, True, True, True, True, True],
[False, False, False, False, True, True, True],
[False, False, False, False, True, True, True]], dtype=bool)
In [80]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[80]: array([4, 1, 3, 3])
# A loopy solution to verify results against
In [81]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[81]: [4, 1, 3, 3]
Boundary cases :
The proposed solution automatically handles the cases where a sample value is less than the smallest of the cumulative-summed values -
In [113]: sample[0] = 0.08 # editing first sample to be lesser than 0.1
In [114]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[114]: [-1, 1, 3, 3]
In [115]: (x.cumsum(1) > sample[:,None]).argmax(1)-1
Out[115]: array([-1, 1, 3, 3])
For cases where a sample value is greater than the largest of the cumulative-summed values, we need one extra step -
In [116]: sample[0] = 0.8 # editing first sample to be greater than 0.7
In [121]: mask = (x.cumsum(1) > sample[:,None])
In [122]: idx = mask.argmax(1)-1
In [123]: np.where(mask.any(1),idx,x.shape[1]-1)
Out[123]: array([6, 1, 3, 3])
In [124]: [np.digitize( sample[i], np.cumsum(x[i]))-1 for i in range(x.shape[0])]
Out[124]: [6, 1, 3, 3]
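If this is needed repeatedly, the steps above can be wrapped into one small helper. This is only a sketch of the same approach; the name vectorized_bin is made up here, not an existing function:
import numpy as np

def vectorized_bin(sample, x):
    """Row-wise equivalent of np.digitize(sample[i], np.cumsum(x[i])) - 1."""
    csum = x.cumsum(1)                 # cumulative sum per row
    mask = csum > sample[:, None]      # broadcasted comparison
    idx = mask.argmax(1) - 1           # first crossing, shifted back by one
    # rows where sample exceeds the last cumulative value get the last bin
    return np.where(mask.any(1), idx, x.shape[1] - 1)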
Related
Say I have an array like:
a1 = np.array([.1, .2, 23., 4.3, 3.2, .1, .05, .2, .3, 4.2, 7.6])
And I filter out, and create a mask, of all values less than 1, like:
a2 = a1[a1 >= 1]
a2_mask = np.ma.masked_where(a1 < 1, a1)
And then search for a specific value:
a2_idx = np.where(a2==3.2)[0][0]
How would I convert that index to the corresponding index in the original array?
e.g.
>>> a2_idx
2
>>> a1_idx = reframe_index(a2_idx, a2_mask)
>>> a1_idx
4
My naive implementation would be:
def reframe_index(old_idx, mask):
    cnt = 0
    ref = 0
    for v in mask:
        if not isinstance(v, (int, float)):
            cnt += 1
        else:
            if ref == old_idx:
                return ref + cnt
            ref += 1
Does Numpy have a more efficient way to do this?
a2 is a copy, so there's no link between it and a1 - except for some values.
In [19]: a2
Out[19]: array([23. , 4.3, 3.2, 4.2, 7.6])
In [20]: np.nonzero(a2 == 3.2)
Out[20]: (array([2]),)
In [21]: a2[2]
Out[21]: 3.2
The mask of a2_mask, just a1<1, does give us a way of finding the corresponding element of a1:
In [22]: a2_mask = np.ma.masked_where(a1 < 1, a1)
In [23]: a2_mask
Out[23]:
masked_array(data=[--, --, 23.0, 4.3, 3.2, --, --, --, --, 4.2, 7.6],
mask=[ True, True, False, False, False, True, True, True,
True, False, False],
fill_value=1e+20)
In [24]: a2_mask.compressed()
Out[24]: array([23. , 4.3, 3.2, 4.2, 7.6])
In [25]: a2_mask.mask
Out[25]:
array([ True, True, False, False, False, True, True, True, True,
False, False])
In [26]: np.nonzero(~a2_mask.mask)
Out[26]: (array([ 2, 3, 4, 9, 10]),)
In [27]: np.nonzero(~a2_mask.mask)[0][2]
Out[27]: 4
In [28]: a1[4]
Out[28]: 3.2
So you need the mask or indices used to select a2 in the first place. a2 itself does not have the information.
In [30]: np.nonzero(a1>=1)
Out[30]: (array([ 2, 3, 4, 9, 10]),)
In [31]: np.nonzero(a1 >= 1)[0][2]
Out[31]: 4
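Putting those pieces together, the back-mapping is a single fancy-indexing step. A minimal sketch, reusing the question's reframe_index name (it is not a NumPy function):
import numpy as np

def reframe_index(old_idx, keep_mask):
    """Map an index into a1[keep_mask] back to an index into a1."""
    return np.flatnonzero(keep_mask)[old_idx]

a1 = np.array([.1, .2, 23., 4.3, 3.2, .1, .05, .2, .3, 4.2, 7.6])
keep = a1 >= 1
a2 = a1[keep]
a2_idx = np.nonzero(a2 == 3.2)[0][0]   # 2
a1_idx = reframe_index(a2_idx, keep)   # 4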
I had a similar problem recently, so I made haggis.npy_util.unmasked_index1. This function is overkill for your relatively simple case, because it's intended to operate on an arbitrary number of dimensions. That being said, given
>>> arr = np.array([.1, .2, 23., 4.3, 3.2, .1, .05, .2, .3, 4.2, 7.6])
and
>>> mask = arr >= 1
>>> mask
array([False, False, True, True, True, False, False, False, False,
True, True])
You can do something like
>>> idx = unmasked_index(np.flatnonzero(arr[mask] == 3.2), mask)
>>> idx
array([4])
If you ever need it, there is also an inverse function haggis.npy_util.masked_index that converts a location in a multidimensional input array into its index in the masked array.
1Disclaimer: I am the author of haggis.
I have a time series t composed of 30 features, with a shape of (5400, 30). To plot it and identify the anomalies I had to reshape it in the following way:
t = t[:,0].reshape(-1)
Now it is a single tensor of shape (5400,), on which I performed my analysis and created a list of 5400 True/False values based on the positions of the anomalies:
anomaly = [True, False, True, ...., False]
Now I would like to reshape this list to a size of (30, 5400) (the reverse of the original shape). How can I do that?
EDIT: this is an example of what I'm trying to achieve:
I have a time series of size (2, 4)
feature 1 | feature 2 | feature 3 | feature 4
0.3       | 0.1       | 0.24      | 0.25
0.62      | 0.45      | 0.43      | 0.9
Coded as:
[[0.3, 0.1, 0.24, 0.25],
 [0.62, 0.45, 0.43, 0.9]]
When I reshape it I get this univariate time series of size (8,):
[0.3, 0.1, 0.24, 0.25, 0.62, 0.45, 0.43, 0.9]
On this time series I applied an anomaly detection method which gave me a list of True/False for each value:
[True, False, True, False, False, True, True, False]
I want to make this list the reverse of the shape of the original one, so it would be structured as:
feature 1: True, False
feature 2: False, True
feature 3: True, True
feature 4: False, False
with a shape of (4, 2), so coded it should be:
[[True, False],
 [False, True],
 [True, True],
 [False, False]]
t = np.array([[0.3, 0.1, 0.24, 0.25],[0.62, 0.45, 0.43, 0.9]])
anomaly= [True, False, True, False, False, True, True, False]
your_req_array = np.array(anomaly).reshape(2,4).T
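To double-check the idea on the toy example (this is just a verification of the one-liner above: reshape back to the (rows, features) layout of the original, then transpose):
import numpy as np

t = np.array([[0.3, 0.1, 0.24, 0.25],
              [0.62, 0.45, 0.43, 0.9]])
anomaly = [True, False, True, False, False, True, True, False]

result = np.array(anomaly).reshape(t.shape).T
print(result)
# [[ True False]
#  [False  True]
#  [ True  True]
#  [False False]]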
I am trying to delete an element from an array. When I try to delete integer values (using numpy.delete) it works, but it doesn't work for decimal values.
For integer deletion
X = [1. 2. 2.5 5.7 3. 6. ]
to_delete_key = [3, 7.3]
Y = np.delete(X, to_delete_key, None)
Output is [1. 2. 2.5 5.7 6. ]
The value 3 got deleted
Whereas in the case of decimal deletion
For decimal deletion
X = [6. 7.3 9.1]
to_delete_key = [3, 7.3]
Y = np.delete(X, to_delete_key, None)
Output is [6. 7.3 9.1]
The value 7.3 didn't get deleted.
I know how to do it the normal way, but is there a more efficient, Pythonic way to do it?
In [249]: X = np.array([1., 2., 2.5, 5.7, 3., 6. ])
...: to_delete_key = [3, 7.3]
In [252]: np.delete(X, to_delete_key)
Traceback (most recent call last):
File "<ipython-input-252-f9031065a548>", line 1, in <module>
np.delete(X, to_delete_key)
File "<__array_function__ internals>", line 5, in delete
File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 4406, in delete
keep[obj,] = False
IndexError: arrays used as indices must be of integer (or boolean) type
Using an integer:
In [253]: np.delete(X, 3)
Out[253]: array([1. , 2. , 2.5, 3. , 6. ])
It was the 5.7 that was deleted, X[3].
np.delete does not delete by value! From the docs:
obj : slice, int or array of ints
Indicate indices of sub-arrays to remove along the specified axis.
We can look for value matches
In [267]: vals = [3, 2.5]
In [268]: X[:,None]==vals
Out[268]:
array([[False, False],
[False, False],
[False, True],
[False, False],
[ True, False],
[False, False]])
But an equality match on floats can be unreliable; np.isclose compares within a tolerance:
In [269]: np.isclose(X[:,None],vals)
Out[269]:
array([[False, False],
[False, False],
[False, True],
[False, False],
[ True, False],
[False, False]])
Then find the rows where there's a match:
In [270]: _.any(axis=1)
Out[270]: array([False, False, True, False, True, False])
In [271]: X[_]
Out[271]: array([2.5, 3. ])
In [272]: X[~__]
Out[272]: array([1. , 2. , 5.7, 6. ])
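If delete-by-value is needed often, those steps can be bundled into a small helper built on np.isclose. This is a sketch, not a NumPy builtin; the tolerance arguments just mirror np.isclose defaults:
import numpy as np

def delete_by_value(arr, values, rtol=1e-05, atol=1e-08):
    """Return arr with entries approximately equal to any of `values` removed."""
    drop = np.isclose(arr[:, None], values, rtol=rtol, atol=atol).any(axis=1)
    return arr[~drop]

X = np.array([1., 2., 2.5, 5.7, 3., 6.])
print(delete_by_value(X, [3, 2.5]))   # [1.  2.  5.7 6. ]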
Lists have a remove by value:
In [284]: alist=X.tolist()
In [285]: alist.remove(3.0)
In [286]: alist.remove(2.5)
In [287]: alist
Out[287]: [1.0, 2.0, 5.7, 6.0]
You are dealing with floating-point numbers that cannot be compared exactly. Google "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
1/3 + 1/3 + 1/3 might not be equal to 1 due to rounding errors.
So the explanation is that your value of 7.3 is not found: the binary value NumPy stores is not necessarily exactly equal to the 7.3 you wrote.
As mentioned by @elPastor, you are misusing NumPy.
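A tiny demonstration of the floating-point caveat described above, with np.isclose as the tolerance-based comparison to reach for:
import numpy as np

print(0.1 + 0.2 == 0.3)            # False: the classic rounding surprise
print(f"{7.3:.20f}")               # 7.29999999999999982236: the stored value is not exactly 7.3
print(np.isclose(0.1 + 0.2, 0.3))  # True: comparison within a tolerance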
I can understand the following NumPy behavior.
>>> a
array([[ 0. , 0. , 0. ],
[ 0. , 0.7, 0. ],
[ 0. , 0.3, 0.5],
[ 0.6, 0. , 0.8],
[ 0.7, 0. , 0. ]])
>>> argmax_overlaps = a.argmax(axis=1)
>>> argmax_overlaps
array([0, 1, 2, 2, 0])
>>> max_overlaps = a[np.arange(5),argmax_overlaps]
>>> max_overlaps
array([ 0. , 0.7, 0.5, 0.8, 0.7])
>>> gt_argmax_overlaps = a.argmax(axis=0)
>>> gt_argmax_overlaps
array([4, 1, 3])
>>> gt_max_overlaps = a[gt_argmax_overlaps,np.arange(a.shape[1])]
>>> gt_max_overlaps
array([ 0.7, 0.7, 0.8])
>>> gt_argmax_overlaps = np.where(a == gt_max_overlaps)
>>> gt_argmax_overlaps
(array([1, 3, 4]), array([1, 2, 0]))
I understood that 0.7, 0.7 and 0.8 are a[1,1], a[3,2] and a[4,0], so I got the tuple (array([1,3,4]), array([1,2,0])), whose arrays hold the row and column indices of those three elements. I then tried other examples to see whether my understanding is correct.
>>> np.where(a == [0.3])
(array([2]), array([1]))
0.3 is in a[2,1] so the outcome looks as I expected. Then I tried
>>> np.where(a == [0.3, 0.5])
(array([], dtype=int64),)
?? I expected to see (array([2, 2]), array([1, 2])). Why do I see the output above?
>>> np.where(a == [0.7, 0.7, 0.8])
(array([1, 3, 4]), array([1, 2, 0]))
>>> np.where(a == [0.8,0.7,0.7])
(array([1]), array([1]))
I can't understand the second result either. Could someone please explain it to me? Thanks.
The first thing to realize is that np.where(a == [whatever]) is just showing you the indices where a == [whatever] is True. So you can get a hint by looking at the value of a == [whatever]. In your case that "works":
>>> a == [0.7, 0.7, 0.8]
array([[False, False, False],
[False, True, False],
[False, False, False],
[False, False, True],
[ True, False, False]], dtype=bool)
You aren't getting what you think you are. You think that is asking for the indices of each element separately, but instead it's getting the positions where the values match at the same position in the row. Basically what this comparison is doing is saying "for each row, tell me whether the first element is 0.7, whether the second is 0.7, and whether the third is 0.8". It then returns the indices of those matching positions. In other words, the comparison is done between entire rows, not just individual values. For your last example:
>>> a == [0.8,0.7,0.7]
array([[False, False, False],
[False, True, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
You now get a different result. It's not asking for "the indices where a has value 0.8", it's asking for only the indices where there is a 0.8 at the beginning of the row -- and likewise a 0.7 in either of the later two positions.
This type of row-wise comparison can only be done if the value you compare against matches the shape of a single row of a. When you try it with a two-element list, the shapes cannot be broadcast, so the comparison collapses to a single False (with a warning in some NumPy versions), and np.where of False is an empty result.
The upshot is that you can't use == on a list of values and expect it to just tell you where any of the values occurs. The equality will match by value and position (if the value you compare against is the same shape as a row of your array), or the comparison will fail to broadcast (if the shape doesn't match). If you want to search for the values independently, you need to do something like what Khris suggested in a comment:
np.where((a==0.3)|(a==0.5))
That is, you need to make two (or more) separate comparisons against separate values, not a single comparison against a list of values.
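If the goal is simply "where does a contain any of these values", np.isin expresses that membership test directly. A small sketch with the array from the question; note that np.isin still uses exact float equality, so the usual float-comparison caveats apply:
import numpy as np

a = np.array([[0. , 0. , 0. ],
              [0. , 0.7, 0. ],
              [0. , 0.3, 0.5],
              [0.6, 0. , 0.8],
              [0.7, 0. , 0. ]])

rows, cols = np.where(np.isin(a, [0.3, 0.5]))   # element-wise membership, position-independent
print(rows, cols)                               # [2 2] [1 2]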
I have a matrix
A = np.array([[0.2, 0.4, 0.6],
[0.5, 0.5, 0.5],
[0.6, 0.4, 0.2]])
I want a new matrix, where the value of the entry in row i and column j is the product of all the entries of the ith row of A, except for the cell of that row in the jth column.
array([[ 0.24, 0.12, 0.08],
[ 0.25, 0.25, 0.25],
[ 0.08, 0.12, 0.24]])
The solution that first occurred to me was
np.repeat(np.prod(A, 1, keepdims = True), 3, axis = 1) / A
But this only works as long as no entries are zero.
Any thoughts? Thank you!
Edit: I have developed
B = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        B[i, j] = np.prod(A[i, [k for k in range(3) if k != j]])
but surely there is a more elegant way to accomplish this, one that uses numpy's efficient C backend instead of slow Python loops?
If you're willing to tolerate a single loop:
B = np.empty_like(A)
for col in range(A.shape[1]):
    B[:,col] = np.prod(np.delete(A, col, 1), 1)
That computes what you need, a single column at a time. It is not as efficient as theoretically possible because np.delete() creates a copy; if you care a lot about memory allocation, use a mask instead:
B = np.empty_like(A)
mask = np.ones(A.shape[1], dtype=bool)
for col in range(A.shape[1]):
    mask[col] = False
    B[:,col] = np.prod(A[:,mask], 1)
    mask[col] = True
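A quick sanity check of the masked-column loop against the example in the question (just re-running the code above on that A):
import numpy as np

A = np.array([[0.2, 0.4, 0.6],
              [0.5, 0.5, 0.5],
              [0.6, 0.4, 0.2]])

B = np.empty_like(A)
mask = np.ones(A.shape[1], dtype=bool)
for col in range(A.shape[1]):
    mask[col] = False
    B[:, col] = np.prod(A[:, mask], 1)   # product over every column except `col`
    mask[col] = True

expected = np.array([[0.24, 0.12, 0.08],
                     [0.25, 0.25, 0.25],
                     [0.08, 0.12, 0.24]])
print(np.allclose(B, expected))   # True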
A variation on your repeat-based solution uses [:,None] broadcasting instead:
np.prod(A,axis=1)[:,None]/A
My 1st stab at handling 0s is:
In [21]: B
array([[ 0.2, 0.4, 0.6],
[ 0. , 0.5, 0.5],
[ 0.6, 0.4, 0.2]])
In [22]: np.prod(B,axis=1)[:,None]/(B+np.where(B==0,1,0))
array([[ 0.24, 0.12, 0.08],
[ 0. , 0. , 0. ],
[ 0.08, 0.12, 0.24]])
But as the comment pointed out, the [1,0] cell should be 0.25.
This corrects that problem, but now has problems when there are multiple 0s in a row.
In [30]: I=B==0
In [31]: B1=B+np.where(I,1,0)
In [32]: B2=np.prod(B1,axis=1)[:,None]/B1
In [33]: B3=np.prod(B,axis=1)[:,None]/B1
In [34]: np.where(I,B2,B3)
Out[34]:
array([[ 0.24, 0.12, 0.08],
[ 0.25, 0. , 0. ],
[ 0.08, 0.12, 0.24]])
In [55]: C
array([[ 0.2, 0.4, 0.6],
[ 0. , 0.5, 0. ],
[ 0.6, 0.4, 0.2]])
In [64]: np.where(I,sum1[:,None],sum[:,None])/C1
array([[ 0.24, 0.12, 0.08],
[ 0.5 , 0. , 0.5 ],
[ 0.08, 0.12, 0.24]])
Blaz Bratanic's epsilon approach is the best non-iterative solution (so far):
In [74]: np.prod(C+eps,axis=1)[:,None]/(C+eps)
A different solution iterating over the columns:
def paulj(A):
    P = np.ones_like(A)
    for i in range(1, A.shape[1]):
        P *= np.roll(A, i, axis=1)
    return P
In [130]: paulj(A)
array([[ 0.24, 0.12, 0.08],
[ 0.25, 0.25, 0.25],
[ 0.08, 0.12, 0.24]])
In [131]: paulj(B)
array([[ 0.24, 0.12, 0.08],
[ 0.25, 0. , 0. ],
[ 0.08, 0.12, 0.24]])
In [132]: paulj(C)
array([[ 0.24, 0.12, 0.08],
[ 0. , 0. , 0. ],
[ 0.08, 0.12, 0.24]])
I tried some timings on a large matrix
In [13]: A=np.random.randint(0,100,(1000,1000))*0.01
In [14]: timeit paulj(A)
1 loops, best of 3: 23.2 s per loop
In [15]: timeit blaz(A)
10 loops, best of 3: 80.7 ms per loop
In [16]: timeit zwinck1(A)
1 loops, best of 3: 15.3 s per loop
In [17]: timeit zwinck2(A)
1 loops, best of 3: 65.3 s per loop
The epsilon approximation is probably the best speed we can expect, but has some rounding issues. Having to iterate over many columns hurts the speed. I'm not sure why the np.prod(A[:,mask], 1) approach is slowest.
eeclo https://stackoverflow.com/a/22441825/901925 suggested using as_strided. Here's what I think he has in mind (adapted from an overlapping block question, https://stackoverflow.com/a/8070716/901925)
def strided(A):
    h, w = A.shape
    A2 = np.hstack([A, A])
    x, y = A2.strides
    strides = (y, x, y)
    shape = (w, h, w-1)
    blocks = np.lib.stride_tricks.as_strided(A2[:,1:], shape=shape, strides=strides)
    P = blocks.prod(2).T   # faster to prod on last dim
    # alt: shape = (w-1, h, w), and P = blocks.prod(0)
    return P
Timing for the (1000,1000) array is quite an improvement over the column iterations, though still much slower than the epsilon approach.
In [153]: timeit strided(A)
1 loops, best of 3: 2.51 s per loop
Another indexing approach, while relatively straightforward, is slower and produces memory errors sooner.
def foo(A):
    h, w = A.shape
    I = (np.arange(w)[:,None] + np.arange(1, w))
    I1 = np.array(I) % w
    P = A[:,I1].prod(2)
    return P
I'm on the run, so I don't have time to work out this solution; but what I'd do is create a contiguous circular view over the last axis, by concatenating the array to itself along that axis, and then use np.lib.index_tricks.as_strided to select the appropriate elements to take an np.prod over. No Python loops, no numerical approximation.
edit: here you go:
import numpy as np
A = np.array([[0.2, 0.4, 0.6],
[0.5, 0.5, 0.5],
[0.5, 0.0, 0.5],
[0.6, 0.4, 0.2]])
B = np.concatenate((A,A),axis=1)
C = np.lib.index_tricks.as_strided(
B,
A.shape +A.shape[1:],
B.strides+B.strides[1:])
D = np.prod(C[...,1:], axis=-1)
print(D)
Note: this method is not ideal, as it is O(n^3). See my other posted solution, which is O(n^2)
If you are willing to tolerate a small error, you could use the solution you first proposed:
A += 1e-10
np.around(np.repeat(np.prod(A, 1, keepdims = True), 3, axis = 1) / A, 9)
Here is an O(n^2) method without python loops or numerical approximation:
def double_cumprod(A):
    B = np.empty((A.shape[0], A.shape[1]+1), A.dtype)
    B[:,0] = 1
    B[:,1:] = A
    L = np.cumprod(B, axis=1)
    B[:,1:] = A[:,::-1]
    R = np.cumprod(B, axis=1)[:,::-1]
    return L[:,:-1] * R[:,1:]
Note: it appears to be about twice as slow as the numerical approximation method, which is in line with expectation.
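A quick check (assuming double_cumprod as defined above): it reproduces the desired result from the question and copes with a zero entry.
import numpy as np

A = np.array([[0.2, 0.4, 0.6],
              [0.5, 0.5, 0.5],
              [0.6, 0.4, 0.2]])
B = np.array([[0.2, 0.4, 0.6],
              [0. , 0.5, 0.5],
              [0.6, 0.4, 0.2]])

print(double_cumprod(A))   # matches the desired matrix in the question
print(double_cumprod(B))   # middle row becomes [0.25 0.   0.  ]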