Identification of rows containing column median in numpy matrix of cum percentiles

Consider the matrix quantiles that's a subset [:8,:3,0] of a 3D matrix with shape (10,355,8).
quantiles = np.array([
[ 1. , 1. , 1. ],
[ 0.63763978, 0.61848863, 0.75348137],
[ 0.43439645, 0.42485407, 0.5341457 ],
[ 0.22682343, 0.18878366, 0.25253915],
[ 0.16229408, 0.12541476, 0.15263742],
[ 0.12306046, 0.10372971, 0.09832783],
[ 0.09271845, 0.08209844, 0.05982584],
[ 0.06363636, 0.05471266, 0.03855727]])
I want a boolean output of the same shape as the quantiles matrix where True marks the row in which the median is located:
In [21]: medians
Out[21]:
array([[False, False, False],
[ True, True, False],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
To achieve this, I have the following algorithm in mind:
1) Identify the entries that are greater than .5:
In [22]: quantiles>.5
Out[22]:
array([[ True, True, True],
[ True, True, True],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
2) Considering only the values selected by the quantiles>.5 operation, mark the row that minimizes the np.abs distance between the entry and .5. Torturing the terminology a bit, I wish to intersect the two matrices np.argmin(np.abs(quantiles-.5),axis=0) and quantiles>.5 to get the above result. However, I cannot for the life of me figure out a way to perform the np.argmin on the subset and retain the shape of the quantiles matrix.
PS. Yes, there is a similar question here but it doesn't implement my algorithm which could be, I think, more efficient on a larger scale

Using NumPy's masked arrays (numpy.ma), I found the following solution:
import numpy as np
import numpy.ma as ma
# mask quantiles that are less than or equal to .5
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)
# identify the minimum in each column of the masked array
median_idx = np.where(masked_quantiles == masked_quantiles.min(axis=0))
# make a matrix of all False values
median_mat = np.zeros(quantiles.shape, dtype=bool)
# assign True to the corresponding entries
In [86]: median_mat[median_idx] = True
In [87]: median_mat
Out[87]:
array([[False, False, False],
[ True, True, False],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
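For reference, here is a minimal self-contained sketch of the masked-array idea, using the argmin variant that appears in the timings below. It assumes a 2D input in which every column has at least one entry above 0.5; the names row_idx and col_idx are only illustrative.

```python
import numpy as np
import numpy.ma as ma

quantiles = np.array([
    [1.0,        1.0,        1.0],
    [0.63763978, 0.61848863, 0.75348137],
    [0.43439645, 0.42485407, 0.5341457],
    [0.22682343, 0.18878366, 0.25253915],
    [0.16229408, 0.12541476, 0.15263742],
    [0.12306046, 0.10372971, 0.09832783],
    [0.09271845, 0.08209844, 0.05982584],
    [0.06363636, 0.05471266, 0.03855727],
])

# Hide entries at or below 0.5 so they cannot be picked by argmin.
masked = ma.masked_where(quantiles <= 0.5, quantiles)

# Row index of the smallest unmasked value in each column.
row_idx = masked.argmin(axis=0)
col_idx = np.arange(quantiles.shape[1])

# Scatter the (row, column) pairs into an all-False matrix.
median_mat = np.zeros(quantiles.shape, dtype=bool)
median_mat[row_idx, col_idx] = True
print(median_mat)
```

Indexing with the pair (row_idx, col_idx) sets exactly one entry per column, which avoids the equality comparison against the column minimum.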
Update: comparison of my answer to that of Divakar's:
I ran two comparisons, one on the sample 2D matrix provided for this question and one on my 3D (10,380,8) dataset (not large data by any means).
Sample dataset:
My code
%%timeit
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)
median_idx = masked_quantiles.argmin(0)
10000 loops, best of 3: 65.1 µs per loop
Divakar's code
%%timeit
mask1 = quantiles<=0.5
min_idx = (quantiles+mask1).argmin(0)
The slowest run took 17.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.92 µs per loop
Full dataset
My code:
%%timeit
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)
median_idx = masked_quantiles.argmin(0)
1000 loops, best of 3: 490 µs per loop
Divakar's code:
%%timeit
mask1 = quantiles<=0.5
min_idx = (quantiles+mask1).argmin(0)
10000 loops, best of 3: 172 µs per loop
Conclusion:
Divakar's answer is roughly 3 to 11 times faster than mine on these two datasets (490 µs vs. 172 µs on the full dataset, 65.1 µs vs. 5.92 µs on the sample). I presume that the ma.masked_where masking operation takes longer than matrix addition. However, the result of the addition has to be stored in a temporary array, so masking may turn out to be more efficient on larger datasets. I wonder how the two would compare on data that barely fits, or does not fit, into memory.

Approach #1
Here's an approach using broadcasting and a masking trick:
# Mask of quantiles less than or equal to 0.5, marking the invalid ones
mask1 = quantiles<=0.5
# Since we are dealing with quantiles, no element exceeds 1. We can
# exploit this by adding 1 to the invalid elements and then taking
# the argmin down each column
min_idx = (np.abs(quantiles-0.5)+mask1).argmin(0)
# Compare against a column of row indices; broadcasting yields the boolean matrix
out = min_idx == np.arange(quantiles.shape[0])[:,None]
Step-by-step run
1) Input :
In [37]: quantiles
Out[37]:
array([[ 1. , 1. , 1. ],
[ 0.63763978, 0.61848863, 0.75348137],
[ 0.43439645, 0.42485407, 0.5341457 ],
[ 0.22682343, 0.18878366, 0.25253915],
[ 0.16229408, 0.12541476, 0.15263742],
[ 0.12306046, 0.10372971, 0.09832783],
[ 0.09271845, 0.08209844, 0.05982584],
[ 0.06363636, 0.05471266, 0.03855727]])
2) Run the code :
In [38]: mask1 = quantiles<=0.5
...: min_idx = (np.abs(quantiles-0.5)+mask1).argmin(0)
...: out = min_idx == np.arange(quantiles.shape[0])[:,None]
...:
3) Analyze output at each step :
In [39]: mask1
Out[39]:
array([[False, False, False],
[False, False, False],
[ True, True, False],
[ True, True, True],
[ True, True, True],
[ True, True, True],
[ True, True, True],
[ True, True, True]], dtype=bool)
In [40]: np.abs(quantiles-0.5)+mask1
Out[40]:
array([[ 0.5 , 0.5 , 0.5 ],
[ 0.13763978, 0.11848863, 0.25348137],
[ 1.06560355, 1.07514593, 0.0341457 ],
[ 1.27317657, 1.31121634, 1.24746085],
[ 1.33770592, 1.37458524, 1.34736258],
[ 1.37693954, 1.39627029, 1.40167217],
[ 1.40728155, 1.41790156, 1.44017416],
[ 1.43636364, 1.44528734, 1.46144273]])
In [41]: (np.abs(quantiles-0.5)+mask1).argmin(0)
Out[41]: array([1, 1, 2])
In [42]: min_idx == np.arange(quantiles.shape[0])[:,None]
Out[42]:
array([[False, False, False],
[ True, True, False],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
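Performance boost: following the comments, min_idx can be computed more simply. Every valid entry is greater than 0.5, so the valid entry closest to 0.5 is just the smallest one, and the np.abs step is unnecessary: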
min_idx = (quantiles+mask1).argmin(0)
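Put together, Approach #1 with the shortcut can be wrapped in a small function. This is only a sketch: the name median_row_mask is made up for illustration, and it assumes a 2D array with values in [0, 1] and at least one entry above 0.5 in every column. The sample below is a rounded, shortened version of the question's data.

```python
import numpy as np

def median_row_mask(q):
    """Boolean matrix marking, per column, the smallest entry above 0.5.

    Hypothetical helper; assumes 2D input with values in [0, 1].
    """
    q = np.asarray(q, dtype=float)
    # Entries at or below 0.5 are invalid; adding 1 pushes them to 1 or above.
    invalid = q <= 0.5
    min_idx = (q + invalid).argmin(axis=0)
    # Compare each column's winning row index against all row indices.
    return min_idx == np.arange(q.shape[0])[:, None]

sample = np.array([
    [1.00, 1.00, 1.00],
    [0.64, 0.62, 0.75],
    [0.43, 0.42, 0.53],
    [0.23, 0.19, 0.25],
])
print(median_row_mask(sample))
```

One caveat: a valid entry equal to exactly 1.0 ties with an invalid entry equal to 0.0 after the shift, and argmin then returns whichever comes first in the column.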
Approach #2
This is focused on memory efficiency.
# Mask of quantiles greater than 0.5 to select the valid ones
mask = quantiles>0.5
# Select valid elems
vals = quantiles.T[mask.T]
# Get valid count per col
count = mask.sum(0)
# Get the min val per col given the mask
minval = np.minimum.reduceat(vals,np.append(0,count[:-1].cumsum()))
# Get final boolean array by just comparing the min vals across each col
out = np.isclose(quantiles,minval)
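As a check, Approach #2 can be run end to end on the sample data. The sketch below adds one guard that is not in the original snippet: np.minimum.reduceat needs every segment to be non-empty, so a column without any entry above 0.5 is rejected up front (that guard is my assumption about how one might handle the case).

```python
import numpy as np

quantiles = np.array([
    [1.0,        1.0,        1.0],
    [0.63763978, 0.61848863, 0.75348137],
    [0.43439645, 0.42485407, 0.5341457],
    [0.22682343, 0.18878366, 0.25253915],
    [0.16229408, 0.12541476, 0.15263742],
    [0.12306046, 0.10372971, 0.09832783],
    [0.09271845, 0.08209844, 0.05982584],
    [0.06363636, 0.05471266, 0.03855727],
])

mask = quantiles > 0.5
count = mask.sum(0)                      # number of valid entries per column
if not count.all():
    raise ValueError("every column needs at least one entry above 0.5")

# Valid values laid out column after column.
vals = quantiles.T[mask.T]
# Start offset of each column's segment inside vals.
starts = np.append(0, count[:-1].cumsum())
# Minimum of each segment, i.e. the smallest valid value per column.
minval = np.minimum.reduceat(vals, starts)

out = np.isclose(quantiles, minval)
print(out)
```

Only the valid values and one minimum per column are materialized, which is where the memory saving comes from. If a column contains its minimum valid value more than once, all of those entries are marked.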

Related

How to intersect boolean subarrays for True values?

I know that Numpy provides logical_and() which allows us to intersect two boolean arrays for True values only (True and True would yield True while True and False would yield False). For example,
a = np.array([True, False, False, True, False], dtype=bool)
b = np.array([False, True, True, True, False], dtype=bool)
np.logical_and(a, b)
> array([False, False, False, True, False], dtype=bool)
However, I'm wondering how I can apply this to two subarrays in an overall array? For example, consider the array:
[[[ True, True], [ True, False]], [[ True, False], [False, True]]]
The two subarrays I'm looking to intersect are:
[[ True, True], [ True, False]]
and
[[ True, False], [False, True]]
which should yield:
[[ True, False], [False, False]]
Is there a way to specify that I want to apply logical_and() to the outermost subarrays to combine the two?

You can use .reduce() along the first axis:
>>> a = np.array([[[ True, True], [ True, False]], [[ True, False], [False, True]]])
>>> np.logical_and.reduce(a, axis=0)
array([[ True, False],
[False, False]])
This works even when you have more than two "sub-arrays" in your outer array. I prefer this over the unpacking approach because it allows you to apply your function (np.logical_and) over any axis of your array.
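To illustrate the "more than two sub-arrays" case, here is a small sketch with a third subarray added (the third one is invented for the example). For boolean input, reducing with logical_and along an axis gives the same result as ndarray.all along that axis.

```python
import numpy as np

stack = np.array([
    [[True, True], [True, False]],
    [[True, False], [False, True]],
    [[True, True], [False, False]],   # extra subarray, made up for this example
])

# AND together all outermost subarrays, element by element.
combined = np.logical_and.reduce(stack, axis=0)
print(combined)

# Same result via the all() method.
print(stack.all(axis=0))
```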

If I understand your question correctly, you are looking to do:
import numpy as np
output = np.logical_and(a[0], a[1])
This simply indexes the two outermost subarrays so that you can use logical_and the way your results suggest.

Appending to a multidimensional array Python

I am filtering the arrays a and b for matching rows, and then I want to append those rows to a new array, difference. However, I get the error: ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 0 and the array at index 1 has size 2. How can I fix this?
import numpy as np
a = np.array([[0,12],[1,40],[0,55],[1,23],[0,123.5],[1,4]])
b = np.array([[0,3],[1,10],[0,55],[1,34],[1,122],[0,123]])
difference= np.array([[]])
for i in a:
    for j in b:
        if np.allclose(i, j, atol=0.5):
            difference = np.concatenate((difference,[i]))
Expected Output:
[[ 0. 55.],[ 0. 123.5]]

The problem is that you are trying to concatenate an array whose rows have size 0
difference= np.array([[]]) # Specifically [ [<no elements>] ]
to an array whose rows have size 2
np.concatenate((difference,[i])) # Specifically [i] which is [ [ 0., 55.] ]
Instead of initializing difference with an empty array of shape (1, 0), you could reshape an empty array to shape (0, 2):
# difference= np.array([[]]) # Old code
difference= np.array([]).reshape(0, 2) # Updated code
Output
[[ 0. 55. ]
[ 0. 123.5]]

The shape of np.array([[]]) is (1, 0). For concatenation along axis 0 to work, it should be (0, 2):
difference= np.zeros((0, *a.shape[1:]))
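For completeness, here is the question's loop with that one-line change applied, as a runnable sketch. Nothing else is altered; the starting array simply has zero rows and the same number of columns as a.

```python
import numpy as np

a = np.array([[0, 12], [1, 40], [0, 55], [1, 23], [0, 123.5], [1, 4]])
b = np.array([[0, 3], [1, 10], [0, 55], [1, 34], [1, 122], [0, 123]])

# Zero rows, two columns: compatible with rows of a along axis 0.
difference = np.zeros((0, a.shape[1]))
for i in a:
    for j in b:
        if np.allclose(i, j, atol=0.5):
            # [i] becomes a (1, 2) array, so it stacks under difference.
            difference = np.concatenate((difference, [i]))

print(difference)
```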

In [22]: a = np.array([[0,12],[1,40],[0,55],[1,23],[0,123.5],[1,4]])
...: b = np.array([[0,3],[1,10],[0,55],[1,34],[1,122],[0,123]])
Using a straightforward list comprehension:
In [23]: [i for i in a for j in b if np.allclose(i,j,atol=0.5)]
Out[23]: [array([ 0., 55.]), array([ 0. , 123.5])]
But as for your concatenate, look at the shapes of the arrays:
In [24]: np.array([[]]).shape
Out[24]: (1, 0)
In [25]: np.array([a[0]]).shape  # same shape as [i] for any row i of a
Out[25]: (1, 2)
Those can only be joined on axis 1; the default is axis 0, which gives you the error. As I wrote in the comment, you have to understand array shapes to use concatenate. Joining on axis 1 runs without error, but it strings all matches together into a single row:
In [26]: difference= np.array([[]])
    ...: for i in a:
    ...:     for j in b:
    ...:         if np.allclose(i, j, atol=0.5):
    ...:             difference = np.concatenate((difference,[i]), axis=1)
    ...:
In [27]: difference
Out[27]: array([[ 0. , 55. , 0. , 123.5]])
Vectorized
A whole-array approach: broadcast a against b, producing a (6,6,2) closeness array:
In [37]: np.isclose(a[:,None,:],b[None,:,:], atol=0.5)
Out[37]:
array([[[ True, False],
[False, False],
[ True, False],
[False, False],
[False, False],
[ True, False]],
[[False, False],
[ True, False],
[False, False],
[ True, False],
[ True, False],
[False, False]],
[[ True, False],
[False, False],
[ True, True],
[False, False],
[False, False],
[ True, False]],
[[False, False],
[ True, False],
[False, False],
[ True, False],
[ True, False],
[False, False]],
[[ True, False],
[False, False],
[ True, False],
[False, False],
[False, False],
[ True, True]],
[[False, False],
[ True, False],
[False, False],
[ True, False],
[ True, False],
[False, False]]])
Find where both columns are true, and where at least one "row" is:
In [38]: _.all(axis=2)
Out[38]:
array([[False, False, False, False, False, False],
[False, False, False, False, False, False],
[False, False, True, False, False, False],
[False, False, False, False, False, False],
[False, False, False, False, False, True],
[False, False, False, False, False, False]])
In [39]: _.any(axis=1)
Out[39]: array([False, False, True, False, True, False])
In [40]: a[_]
Out[40]:
array([[ 0. , 55. ],
[ 0. , 123.5]])
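The three interactive steps can be collected into one function. This is a sketch; the name rows_close_to_any is made up, and it assumes a and b are 2D with the same number of columns. Like np.allclose, np.isclose also applies its default relative tolerance on top of atol.

```python
import numpy as np

def rows_close_to_any(a, b, atol=0.5):
    """Rows of a that are close, in every column, to at least one row of b."""
    # Shape (len(a), len(b), n_cols): every row of a against every row of b.
    close = np.isclose(a[:, None, :], b[None, :, :], atol=atol)
    # A pair matches only if all of its columns are close.
    pair_match = close.all(axis=2)
    # Keep rows of a that match at least one row of b.
    return a[pair_match.any(axis=1)]

a = np.array([[0, 12], [1, 40], [0, 55], [1, 23], [0, 123.5], [1, 4]])
b = np.array([[0, 3], [1, 10], [0, 55], [1, 34], [1, 122], [0, 123]])
print(rows_close_to_any(a, b))
```

Unlike the nested loop, a row of a that matches several rows of b appears only once in the result.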

How to change the values of a 2d tensor in certain rows and columns

Suppose I have an all-zero mask tensor like this:
mask = torch.zeros(5,3, dtype=torch.bool)
Now I want to set the value of mask at the intersection of the following rows and cols indices to True:
rows = torch.tensor([0,2,4])
cols = torch.tensor([1,2])
I would like to produce the following result:
tensor([[False, True, True ],
[False, False, False],
[False, True, True ],
[False, False, False],
[False, True, True ]])
When I try the following code, I receive an error:
mask[rows, cols] = True
IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [3], [2]
How can I do that efficiently in PyTorch?

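The two index tensors must be broadcastable against each other. Shapes [3] and [2] are not, but [3] and [2, 1] are, so you can give cols an extra dimension with torch.unsqueeze: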
mask = torch.zeros(5,3, dtype=torch.bool)
mask[rows, cols.unsqueeze(1)] = True
mask
tensor([[False, True, True],
[False, False, False],
[False, True, True],
[False, False, False],
[False, True, True]])
or torch.reshape
mask[rows, cols.reshape(-1,1)] = True
mask
tensor([[False, True, True],
[False, False, False],
[False, True, True],
[False, False, False],
[False, True, True]])
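NumPy's integer-array indexing follows the same broadcasting rule, so the idea can be cross-checked there. Note that the sketch below is NumPy, not PyTorch, a deliberate substitution to keep it dependency-light; np.ix_ is a NumPy helper that builds the broadcastable index arrays for you.

```python
import numpy as np

rows = np.array([0, 2, 4])
cols = np.array([1, 2])

# Manual version: shapes (3,) and (2, 1) broadcast to (2, 3).
mask_a = np.zeros((5, 3), dtype=bool)
mask_a[rows, cols[:, None]] = True

# np.ix_ returns index arrays of shapes (3, 1) and (1, 2).
mask_b = np.zeros((5, 3), dtype=bool)
mask_b[np.ix_(rows, cols)] = True

print(mask_a)
print(np.array_equal(mask_a, mask_b))
```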

Compare a numpy array to each element of another one

A = np.array([5,1,5,8])
B = np.array([2,5])
I want to compare the A array to each element of B. In other words, I'm looking for a function which does the following computations:
A>2
A>5
(array([ True, False, True, True]), array([False, False, False, True]))

Not particularly fancy but a list comprehension will work:
[A > b for b in B]
[array([ True, False, True, True], dtype=bool),
array([False, False, False, True], dtype=bool)]
You can also use np.greater(), which requires the dimension-adding trick that Brenlla uses in the comments:
np.greater(A, B[:,np.newaxis])
array([[ True, False, True, True],
[False, False, False, True]], dtype=bool)
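The same broadcast also works with the plain > operator. Below is a short sketch; converting the 2D result with tuple() to get one array per element of B is just one way to match the output format shown in the question.

```python
import numpy as np

A = np.array([5, 1, 5, 8])
B = np.array([2, 5])

# B[:, None] has shape (2, 1); comparing with A of shape (4,) gives (2, 4).
result = A > B[:, None]
print(result)

# One boolean array per element of B, as in the question.
per_element = tuple(result)
print(per_element)
```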

Populate numpy matrix dynamically from array values?

I'm trying to dynamically construct a 2-D matrix with numpy based on the values of an array, like this:
In [113]: A = np.zeros((5,5),dtype=bool)
In [114]: A
Out[114]: array([[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False]], dtype=bool)
In [116]: B = np.array([0,1,3,0,2])
In [117]: B
Out[117]: array([0, 1, 3, 0, 2])
Now, I'd like to use the values of B to assign the first n values of each row to A to True. For this A and B, the correct output would be:
In [118]: A
Out[118]: array([[False, False, False, False, False],
[ True, False, False, False, False],
[ True, True, True, False, False],
[False, False, False, False, False],
[ True, True, False, False, False]], dtype=bool)
The length of B will always equal the number of rows of A, and the values of B will always be less than or equal to the number of columns of A. The size of A and the values of B are constantly changing, so I need to build these on the fly.
I'm certain that this has a simple(-ish) solution in numpy, but I've spent the last hour banging my head against variations of repeat, tile, and anything else I can think of. Can anyone help me out before I give myself a concussion? :)
EDIT: I'm going to need to do this a lot, so speed will be an issue. The only version that I can come up with for now is something like:
np.vstack([ [True]*x + [False]*(500-x) for x in B ])
but I expect that this will be slow due to the for loop (I would time it if I had anything to compare it to).

How about:
>>> A = np.zeros((5, 7),dtype=bool)
>>> B = np.array([0,1,3,0,2])
>>> (np.arange(len(A[0])) < B[:,None])
array([[False, False, False, False, False, False, False],
[ True, False, False, False, False, False, False],
[ True, True, True, False, False, False, False],
[False, False, False, False, False, False, False],
[ True, True, False, False, False, False, False]], dtype=bool)
(I changed the shape from (5,5) because I was getting confused about which axis was which, and I wanted to make sure I was using the right one.)
[Simplified from (np.arange(len(A[0]))[:,None] < B).T -- if we expand B and not A, there's no need for the transpose.]
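Since the matrix has to be rebuilt often, the comparison can be wrapped in a small function that takes the column count as a parameter. This is a sketch only; the name prefix_mask is made up. It also checks the result against the vstack version from the question.

```python
import numpy as np

def prefix_mask(counts, n_cols):
    """Row i has its first counts[i] entries set to True."""
    counts = np.asarray(counts)
    # Column indices (n_cols,) against per-row counts (n_rows, 1).
    return np.arange(n_cols) < counts[:, None]

B = np.array([0, 1, 3, 0, 2])
A = prefix_mask(B, 5)
print(A)

# Loop-based construction from the question, for comparison.
looped = np.vstack([[True] * x + [False] * (5 - x) for x in B])
print(np.array_equal(A, looped))
```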
