list of indexes of maximum values in ndarray - python

I have an ndarray. From this array I need to choose the N numbers with the biggest values. I found heapq.nlargest to find the N largest entries, but I need to extract their indexes.
I want to build a new array where only the N rows with the largest weights in the first column survive. The rest of the rows will be replaced by random values.
import numpy as np
import heapq # For choosing list of max values
a = [[1.1,2.1,3.1], [2.1,3.1,4.1], [5.1,0.1,7.1],[0.1,1.1,1.1],[4.1,3.1,9.1]]
a = np.asarray(a)
maxVal = heapq.nlargest(2,a[:,0])
if __name__ == '__main__':
    print(a)
    print(maxVal)
The output I have is:
[[ 1.1 2.1 3.1]
[ 2.1 3.1 4.1]
[ 5.1 0.1 7.1]
[ 0.1 1.1 1.1]
[ 4.1 3.1 9.1]]
[5.0999999999999996, 4.0999999999999996]
but what I need is [2,4] as the indexes to build a new array. The indexes are the rows so if in this example I want to replace the rest by 0 I need to finish with:
[[0.0 0.0 0.0]
[ 0.0 0.0 0.0]
[ 5.1 0.1 7.1]
[ 0.0 0.0 0.0]
[ 4.1 3.1 9.1]]
I am stuck at the point where I need the indexes. The original array has 1000 rows and 100 columns. The weights are normalized floating-point numbers, and I don't want to do something like if a[:,0] == maxVal[0]: because the weights are sometimes very close and I can end up with more matches for maxVal[0] than my original N.
Is there any simple way to extract indexes on this setup to replace the rest of the array?

If you only have 1000 rows, I would forget about the heap and use np.argsort on the first column:
>>> np.argsort(a[:,0])[::-1][:2]
array([2, 4])
If you want to put it all together, it would look something like:
def trim_rows(a, n):
    # indices of every row except the n with the largest first-column values
    idx = np.argsort(a[:,0])[:-n]
    # zero those rows in place
    a[idx] = 0
>>> a = np.random.rand(10, 4)
>>> a
array([[ 0.34416425, 0.89021968, 0.06260404, 0.0218131 ],
[ 0.72344948, 0.79637177, 0.70029863, 0.20096129],
[ 0.27772833, 0.05372373, 0.00372941, 0.18454153],
[ 0.09124461, 0.38676351, 0.98478492, 0.72986697],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0.27597241, 0.26705301, 0.62124467, 0.43337711],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0.3943888 , 0.61586129, 0.02776393, 0.2560126 ],
[ 0.5934556 , 0.23093912, 0.12550062, 0.58542137]])
>>> trim_rows(a, 3)
>>> a
array([[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0. , 0. , 0. , 0. ],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
And for your data size it's probably fast enough:
In [7]: a = np.random.rand(1000, 100)
In [8]: %timeit -n1 -r1 trim_rows(a, 50)
1 loops, best of 1: 7.65 ms per loop
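If a full sort ever becomes the bottleneck, np.argpartition can pick out the top N without ordering the rest. The following is only a sketch under a few assumptions: the helper name keep_top_rows and its fill parameter are made up for illustration, n is assumed to be between 1 and the number of rows, and it returns a new array instead of modifying the input as trim_rows does.

```python
import numpy as np

def keep_top_rows(a, n, fill=0.0):
    """Hypothetical helper: keep the n rows with the largest first-column value."""
    # argpartition only guarantees that the last n positions hold the indices
    # of the n largest values; their relative order is arbitrary
    top = np.argpartition(a[:, 0], -n)[-n:]
    out = np.full_like(a, fill)   # new array, the input is left untouched
    out[top] = a[top]
    return out, np.sort(top)      # sort the indices only for readable output

a = np.array([[1.1, 2.1, 3.1],
              [2.1, 3.1, 4.1],
              [5.1, 0.1, 7.1],
              [0.1, 1.1, 1.1],
              [4.1, 3.1, 9.1]])

trimmed, idx = keep_top_rows(a, 2)
print(idx)       # row indices of the two largest first-column weights
print(trimmed)
```

With the array from the question, idx comes out as rows 2 and 4. To replace the other rows with random values rather than a constant, one could start from a random array in place of np.full_like.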

Related

Minimum sum route through numpy array

This is a two part question.
Part 1
Given the following Numpy array:
foo = array([[22.5, 20. , 0. , 20. ],
[24. , 40. , 0. , 8. ],
[ 0. , 0. , 50. , 9.9],
[ 0. , 0. , 0. , 9. ],
[ 0. , 0. , 0. , 2.5]])
what is the most efficient way to (i) find the two minimal possible sums of values across columns (taking into account cell values greater than zero only) where for every column only one row is used and (ii) keep track of the array index locations visited on that route?
For example, in the example above this would be: minimum_bar = 22.5 + 20 + 50 + 2.5 = 95 at indices [0,0], [0,1], [2,2], [4,3] and next_best_bar = 22.5 + 20 + 50 + 8 = 100.5 at indices [0,0], [0,1], [2,2], [1,3].
Part 2
Similar to Part 1, but now with the constraint that the row-wise sum of foo (if that row is used in the solution) must be greater than the corresponding value in a given array (for example np.array([10, 10, 10, 10, 10])). In other words, sum(row[0]) > array[0] gives 62.5 > 10 = True, but sum(row[4]) > array[4] gives 2.5 > 10 = False.
In which case the result is: minimum_bar = 22.5 + 20 + 50 + 9.9 = 102.4 at indices [0,0], [0,1], [2,2], [2,3] and next_best_bar = 22.5 + 20 + 50 + 20 = 112.5 at indices [0,0], [0,1], [2,2], [0,3].
My initial approach was to find all possible routes (combinations of indices using itertools) but this solution does not scale well for large matrix sizes (e.g., mxn=500x500).
Here's one solution that I came up with (hopefully I didn't misunderstand anything in your question)
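def minimum_routes(foo):
    assert len(foo) >= 2
    # every column needs at least one positive value
    assert np.all(np.any(foo > 0, axis=0))
    foo = foo.astype(float)
    # non-positive cells can never be chosen
    foo[foo <= 0] = np.inf
    # sort each column independently, smallest first
    foo.sort(0)
    minimum_bar = foo[0]
    next_best_bar = minimum_bar.copy()
    # swap in the second-smallest value in the column where that costs least
    c = np.argmin(np.abs(foo[0] - foo[1]))
    next_best_bar[c] = foo[1, c]
    return minimum_bar, next_best_bar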
Let's test it:
foo = np.array([[22.5, 20. , 0. , 20. ],
[24. , 40. , 0. , 8. ],
[ 0. , 0. , 50. , 9.9],
[ 0. , 0. , 0. , 9. ],
[ 0. , 0. , 0. , 2.5]])
# PART 1
minimum_bar, next_best_bar = minimum_routes(foo)
# (array([22.5, 20. , 50. , 2.5]), array([24. , 20. , 50. , 2.5]))
# PART 2
constraint = np.array([10, 10, 10, 10, 10])
minimum_bar, next_best_bar = minimum_routes(foo[foo.sum(1) > constraint])
# (array([22.5, 20. , 50. , 8. ]), array([24., 20., 50., 8.]))
To find the indices:
np.where(foo == minimum_bar)
np.where(foo == next_best_bar)
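Looking the indices up with np.where(foo == ...) depends on exact float equality and reports every matching cell if a value repeats within a column. As a possible alternative, the row indices can be tracked while sorting by using np.argsort. This is only a sketch: the function name minimum_routes_with_rows is hypothetical, it uses the same "cheapest single swap" rule as minimum_routes above, and it assumes at least one column has two positive values.

```python
import numpy as np

def minimum_routes_with_rows(foo):
    """Hypothetical variant that returns the chosen row index for every column."""
    work = np.where(foo > 0, foo, np.inf).astype(float)  # mask non-positive cells
    assert np.all(np.isfinite(work).any(axis=0)), "each column needs a positive value"
    order = np.argsort(work, axis=0)        # row indices, smallest value first
    cols = np.arange(work.shape[1])
    best_rows = order[0].copy()
    # extra cost of switching each column to its second-smallest value
    extra = work[order[1], cols] - work[best_rows, cols]
    c = np.argmin(extra)
    next_rows = best_rows.copy()
    next_rows[c] = order[1, c]
    return best_rows, next_rows

foo = np.array([[22.5, 20. ,  0. , 20. ],
                [24. , 40. ,  0. ,  8. ],
                [ 0. ,  0. , 50. ,  9.9],
                [ 0. ,  0. ,  0. ,  9. ],
                [ 0. ,  0. ,  0. ,  2.5]])

best_rows, next_rows = minimum_routes_with_rows(foo)
cols = np.arange(foo.shape[1])
print(list(zip(best_rows, cols)), foo[best_rows, cols].sum())
print(list(zip(next_rows, cols)), foo[next_rows, cols].sum())
```

Each (row, column) pair printed is one cell on the route, so the index locations come directly from the sort and need no second lookup.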

bool masking in numpy array matrices

I have following program
import numpy as np
arr = np.random.randn(3,4)
print(arr)
regArr = (arr > 0.8)
print (regArr)
print (arr[ regArr].reshape(arr.shape))
output:
[[ 0.37182134 1.4807685 0.11094223 0.34548185]
[ 0.14857641 -0.9159358 -0.37933393 -0.73946522]
[ 1.01842304 -0.06714827 -1.22557205 0.45600827]]
I am looking for output in which the values of arr greater than 0.8 are kept and all other values are set to zero.
I tried bool masking as shown above, but I am not able to solve this. Kindly help.
I'm not entirely sure what exactly you want to achieve, but this is what I did to filter.
arr = np.random.randn(3,4)
array([[-0.04790508, -0.71700005, 0.23204224, -0.36354634],
[ 0.48578236, 0.57983561, 0.79647091, -1.04972601],
[ 1.15067885, 0.98622772, -0.7004639 , -1.28243462]])
arr[arr < 0.8] = 0
array([[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[1.15067885, 0.98622772, 0. , 0. ]])
Thanks to user3053452, I have added one more solution that leaves the original data unchanged.
arr = np.random.randn(3,4)
array([[ 0.4297907 , 0.38100702, 0.30358291, -0.71137138],
[ 1.15180635, -1.21251676, 0.04333404, 1.81045931],
[ 0.17521058, -1.55604971, 1.1607159 , 0.23133528]])
new_arr = np.where(arr < 0.8, 0, arr)
array([[0. , 0. , 0. , 0. ],
[1.15180635, 0. , 0. , 1.81045931],
[0. , 0. , 1.1607159 , 0. ]])
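For reference, here is a small self-contained sketch that uses the array printed in the question instead of fresh random data. It also shows why arr[regArr].reshape(arr.shape) cannot work: boolean indexing returns a 1-D array holding only the selected elements. The variable names mask and masked are chosen just for this example.

```python
import numpy as np

arr = np.array([[ 0.37182134,  1.4807685 ,  0.11094223,  0.34548185],
                [ 0.14857641, -0.9159358 , -0.37933393, -0.73946522],
                [ 1.01842304, -0.06714827, -1.22557205,  0.45600827]])

mask = arr > 0.8

# Boolean indexing flattens: only the selected values remain, so there is
# nothing left to reshape back to (3, 4)
print(arr[mask].shape)

# Keep values where the mask is True, use 0 elsewhere; arr itself is unchanged
masked = np.where(mask, arr, 0.0)
print(masked)
```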

Fast way to fill matrix from np.array of row index, column index, and max(values)

I have quite large arrays (about 5e6 elements) with which to fill a matrix. I know the fast way to fill is something like
(simplified example)
bbb = (np.array([1,2,3,4,1])) # row
ccc = (np.array([0,1,2,1,0])) # column
ddd = (np.array([55.5,22.2,33.3,44.4,11.1])) # values
experiment = np.zeros(shape=(5,3))
experiment[bbb, ccc] = [ddd] # filling
>[[ 0. 0. 0. ]
[ 11.1 0. 0. ]
[ 0. 22.2 0. ]
[ 0. 0. 33.3]
[ 0. 44.4 0. ]]
But what if I want to keep the maximum ddd for each cell instead? Something like this at the # filling step:
#pseudocode
experiment[bbb, ccc] = [ddd if ddd > experiment[bbb, ccc]]
The matrix should return
>[[ 0. 0. 0. ]
[ 55.5 0. 0. ]
[ 0. 22.2 0. ]
[ 0. 0. 33.3]
[ 0. 44.4 0. ]]
What is a good fast way to get max to fill the matrix from np.array here?
You can use np.ufunc.at on np.maximum.
np.ufunc.at performs the preceding ufunc "unbuffered and in-place". This means all indices appearing in [bbb, ccc] will be processed by np.maximum, no matter how often those indices appear.
In your case (1, 0) appears twice, so it will be processed twice, each time picking the maximum of the current value in experiment and the corresponding value in ddd.
np.maximum.at(experiment, (bbb, ccc), ddd)
# array([[ 0. , 0. , 0. ],
# [ 55.5, 0. , 0. ],
# [ 0. , 22.2, 0. ],
# [ 0. , 0. , 33.3],
# [ 0. , 44.4, 0. ]])
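Here is the same idea as a complete script, in case it helps to run it end to end. It is a sketch with two assumptions worth flagging: the index is passed as a tuple of arrays, which is the form the NumPy documentation describes for multi-dimensional operands, and the matrix starts at zero, so negative values in ddd would never win. Initialising with -np.inf instead would avoid that.

```python
import numpy as np

bbb = np.array([1, 2, 3, 4, 1])                  # row indices
ccc = np.array([0, 1, 2, 1, 0])                  # column indices
ddd = np.array([55.5, 22.2, 33.3, 44.4, 11.1])   # values

experiment = np.zeros(shape=(5, 3))

# Unbuffered in-place maximum: the repeated cell (1, 0) is visited twice,
# once with 55.5 and once with 11.1, and keeps the larger value
np.maximum.at(experiment, (bbb, ccc), ddd)
print(experiment)
```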

Meshgrid a N-columned matrix in Numpy (or smth else)

Python version: 2.7
I have the following numpy 2d array:
array([[ -5.05000000e+01, -1.05000000e+01],
[ -4.04000000e+01, -8.40000000e+00],
[ -3.03000000e+01, -6.30000000e+00],
[ -2.02000000e+01, -4.20000000e+00],
[ -1.01000000e+01, -2.10000000e+00],
[ 7.10542736e-15, -1.77635684e-15],
[ 1.01000000e+01, 2.10000000e+00],
[ 2.02000000e+01, 4.20000000e+00],
[ 3.03000000e+01, 6.30000000e+00],
[ 4.04000000e+01, 8.40000000e+00]])
If I wanted to find all the combinations of the first and the second columns, I would use np.array(np.meshgrid(first_column, second_column)).T.reshape(-1,2). As a result, I would get a 100*2 matrix with 10*10 = 100 data points. However, my matrix can have 3, 4, or more columns, so I have a problem using this numpy function.
Question: how can I make an automatically meshgridded matrix with 3+ columns?
UPD: for example, I have the initial array:
[[-50.5 -10.5]
[ 0. 0. ]]
As a result, I want to have the output array like this:
array([[-10.5, -50.5],
[-10.5, 0. ],
[ 0. , -50.5],
[ 0. , 0. ]])
or this:
array([[-50.5, -10.5],
[-50.5, 0. ],
[ 0. , -10.5],
[ 0. , 0. ]])
You could use the * operator on the transposed array to unpack its columns as separate arguments to np.meshgrid. Finally, a swap-axes operation is needed to merge the output grid arrays into one array.
Thus, one generic solution would be -
np.swapaxes(np.meshgrid(*arr.T),0,2)
Sample run -
In [44]: arr
Out[44]:
array([[-50.5, -10.5],
[ 0. , 0. ]])
In [45]: np.swapaxes(np.meshgrid(*arr.T),0,2)
Out[45]:
array([[[-50.5, -10.5],
[-50.5, 0. ]],
[[ 0. , -10.5],
[ 0. , 0. ]]])
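If the goal is the flat (number_of_combinations, number_of_columns) layout shown in the question's update, one possible variant is to build the grids with indexing="ij" and stack them along a new last axis before reshaping. This is a sketch, not part of the answer above; the helper name column_product is made up for illustration, and np.stack needs NumPy 1.10 or later.

```python
import numpy as np

def column_product(arr):
    """Hypothetical helper: all combinations of one value from each column."""
    # one grid per column; "ij" keeps the grid axes in column order
    grids = np.meshgrid(*arr.T, indexing="ij")
    # put the coordinates on the last axis, then flatten the grid axes
    return np.stack(grids, axis=-1).reshape(-1, arr.shape[1])

arr = np.array([[-50.5, -10.5],
                [  0. ,   0. ]])
print(column_product(arr))

# The same call works for three columns: 3 values per column give 27 rows
arr3 = np.arange(9.0).reshape(3, 3)
print(column_product(arr3).shape)
```

For the two-column example this reproduces the second layout given in the question's update.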

Flip non-zero values along each row of a lower triangular numpy array

I have a lower triangular array, like B:
B = np.array([[1,0,0,0],[.25,.75,0,0], [.1,.2,.7,0],[.2,.3,.4,.1]])
>>> B
array([[ 1. , 0. , 0. , 0. ],
[ 0.25, 0.75, 0. , 0. ],
[ 0.1 , 0.2 , 0.7 , 0. ],
[ 0.2 , 0.3 , 0.4 , 0.1 ]])
I want to flip it to look like:
array([[ 1. , 0. , 0. , 0. ],
[ 0.75, 0.25, 0. , 0. ],
[ 0.7 , 0.2 , 0.1 , 0. ],
[ 0.1 , 0.4 , 0.3 , 0.2 ]])
That is, I want to take all the positive values, and reverse within the positive values, leaving the trailing zeros in place. This is not what fliplr does:
>>> np.fliplr(B)
array([[ 0. , 0. , 0. , 1. ],
[ 0. , 0. , 0.75, 0.25],
[ 0. , 0.7 , 0.2 , 0.1 ],
[ 0.1 , 0.4 , 0.3 , 0.2 ]])
Any tips? Also, the actual array I am working with would be something like B.shape = (200,20,4,4) instead of (4,4). Each (4,4) block looks like the above example (with different numbers across the 200, 20 different entries).
How about this:
# row, column indices of the lower triangle of B
r, c = np.tril_indices_from(B)
# flip the column indices by subtracting them from r, which is equal to the number
# of nonzero elements in each row minus one
B[r, c] = B[r, r - c]
print(repr(B))
# array([[ 1. , 0. , 0. , 0. ],
# [ 0.75, 0.25, 0. , 0. ],
# [ 0.7 , 0.2 , 0.1 , 0. ],
# [ 0.1 , 0.4 , 0.3 , 0.2 ]])
The same approach will generalize to any arbitrary N-dimensional array that consists of multiple lower triangular submatrices:
# creates a (200, 20, 4, 4) array consisting of tiled copies of B
B2 = np.tile(B[None, None, ...], (200, 20, 1, 1))
print(repr(B2[100, 10]))
# array([[ 1. , 0. , 0. , 0. ],
# [ 0.25, 0.75, 0. , 0. ],
# [ 0.1 , 0.2 , 0.7 , 0. ],
# [ 0.2 , 0.3 , 0.4 , 0.1 ]])
r, c = np.tril_indices_from(B2[0, 0])
B2[:, :, r, c] = B2[:, :, r, r - c]
print(repr(B2[100, 10]))
# array([[ 1. , 0. , 0. , 0. ],
# [ 0.75, 0.25, 0. , 0. ],
# [ 0.7 , 0.2 , 0.1 , 0. ],
# [ 0.1 , 0.4 , 0.3 , 0.2 ]])
For an upper triangular matrix you could simply subtract r from c instead, e.g.:
r, c = np.triu_indices_from(B.T)
B.T[r, c] = B.T[c - r, c]
Here's one approach for a 2D array case -
mask = np.tril(np.ones((4,4),dtype=bool))
out = np.zeros_like(B)
out[mask] = B[:,::-1][mask[:,::-1]]
You can extend it to a 3D array case using the same 2D mask by masking the last two axes with it, like so -
out = np.zeros_like(B)
out[:,mask] = B[:,:,::-1][:,mask[:,::-1]]
.. and similarly for a 4D array case, like so -
out = np.zeros_like(B)
out[:,:,mask] = B[:,:,:,::-1][:,:,mask[:,::-1]]
As one can see, we are keeping the masking process to the last two axes of (4,4) and the solution basically stays the same.
Sample run -
In [95]: B
Out[95]:
array([[ 1. , 0. , 0. , 0. ],
[ 0.25, 0.75, 0. , 0. ],
[ 0.1 , 0.2 , 0.7 , 0. ],
[ 0.2 , 0.3 , 0.4 , 0.1 ]])
In [96]: mask = np.tril(np.ones((4,4),dtype=bool))
...: out = np.zeros_like(B)
...: out[mask] = B[:,::-1][mask[:,::-1]]
...:
In [97]: out
Out[97]:
array([[ 1. , 0. , 0. , 0. ],
[ 0.75, 0.25, 0. , 0. ],
[ 0.7 , 0.2 , 0.1 , 0. ],
[ 0.1 , 0.4 , 0.3 , 0.2 ]])
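As a quick sanity check, the index-based and mask-based approaches can be run side by side on a (200, 20, 4, 4) array. The sketch below makes a few choices that neither answer makes: it writes into new output arrays instead of modifying the input, it uses an ellipsis so the same lines work for any number of leading axes, and the names B4, out_idx and out_mask exist only for this example.

```python
import numpy as np

B = np.array([[1, 0, 0, 0],
              [.25, .75, 0, 0],
              [.1, .2, .7, 0],
              [.2, .3, .4, .1]])
B4 = np.tile(B, (200, 20, 1, 1))          # shape (200, 20, 4, 4)

# Index-based: within row r, column c reads from column r - c
r, c = np.tril_indices(4)
out_idx = np.zeros_like(B4)
out_idx[..., r, c] = B4[..., r, r - c]

# Mask-based: flip the last axis, then pick the mirrored triangle
mask = np.tril(np.ones((4, 4), dtype=bool))
out_mask = np.zeros_like(B4)
out_mask[..., mask] = B4[..., ::-1][..., mask[:, ::-1]]

print(np.array_equal(out_idx, out_mask))
print(out_idx[100, 10])
```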
