How does "Fancy Indexing with Broadcasting and Boolean Masking" work? - python

I came across this snippet of code in Jake Vanderplas's Data Science Handbook. The concept of using Broadcasting along with Fancy Indexing here wasn't clear to me. Please explain.
In[5]: X = np.arange(12).reshape((3, 4))
X
Out[5]: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In[6]: row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
In[7]: X[row[:, np.newaxis], col]
Out[7]: array([[ 2, 1, 3],
[ 6, 5, 7],
[10, 9, 11]])
It says: "Here, each row value is matched with each column vector, exactly as we saw in broadcasting of arithmetic operations. For example:"
In[8]: row[:, np.newaxis] * col
Out[8]: array([[0, 0, 0],
[2, 1, 3],
[4, 2, 6]])

If you use an integer array to index another array
you basically loop over the given indices and pick the respective elements (may still be an array) along the axis you are indexing and stack them together.
arr55 = np.arange(25).reshape((5, 5))
# array([[ 0, 1, 2, 3, 4],
# [ 5, 6, 7, 8, 9],
# [10, 11, 12, 13, 14],
# [15, 16, 17, 18, 19],
# [20, 21, 22, 23, 24]])
arr53 = arr55[:, [3, 3, 4]]
# pick the elements at (arr[:, 3], arr[:, 3], arr[:, 4])
# array([[ 3, 3, 4],
# [ 8, 8, 9],
# [13, 13, 14],
# [18, 18, 19],
# [23, 23, 24]])
So if you index an (n, m) array with a row index of length k (or a column index of length l), the resulting shape is:
A_nm[row, :] -> A_km
A_nm[:, col] -> A_nl
If however you use two arrays row and col to index an array
you loop over both indices simultaneously and stack the elements (may still be arrays) at the respective position together.
Here row and col must have the same length (or be broadcastable to a common shape).
A_nm[row, col] -> A_k
arr3 = arr55[[0, 2, 4], [3, 3, 4]]
# pick the elements at (arr[0, 3], arr[2, 3], arr[4, 4])
# array([ 3, 13, 24])
Now, finally, to your question: it is possible to use broadcasting while indexing arrays. Sometimes you do not want to pick only the elements
(arr[0, 3], arr[2, 3], arr[4, 4])
but rather the expanded version:
(arr[0, [3, 3, 4]], arr[2, [3, 3, 4]], arr[4, [3, 3, 4]])
# each row value is matched with each column vector
This matching/broadcasting is exactly as in other arithmetic operations.
But the example may be misleading, because the result of the multiplication itself does not matter for the indexing.
The focus here is on the combinations and the resulting shape:
row * col
# performs an element-wise multiplication resulting in 3 numbers
row[:, np.newaxis] * col
# performs a multiplication where each row value is *matched* with each column vector
The example wanted to emphasize this matching of row and col.
We can have a look and play around with the different possibilities:
n = 3
m = 4
X = np.arange(n*m).reshape((n, m))
row = np.array([0, 1, 2]) # k = 3
col = np.array([2, 1, 3]) # l = 3
X[row, :] # A_nm[row, :] -> A_km
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
X[:, col] # A_nm[:, col] -> A_nl
# array([[ 2, 1, 3],
# [ 6, 5, 7],
# [10, 9, 11]])
X[row, col] # A_nm[row, col] -> A_l == A_k
# array([ 2, 5, 11])
X[row, :][:, col] # A_nm[row, :][:, col] -> A_km[:, col] -> A_kl
# == X[:, col][row, :]
# == X[row[:, np.newaxis], col] # A_nm[row[:, np.newaxis], col] -> A_kl
# array([[ 2, 1, 3],
# [ 6, 5, 7],
# [10, 9, 11]])
X[row, col[:, np.newaxis]]
# == X[row[:, np.newaxis], col].T
# array([[ 2, 6, 10],
# [ 1, 5, 9],
# [ 3, 7, 11]])
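If you would rather not add the new axis by hand, np.ix_ builds the same pair of broadcastable index arrays. A minimal sketch, reusing the X, row and col from above (the names rows_2d, cols_2d, block and manual are just for illustration):

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# np.ix_ reshapes the 1-D index arrays to (3, 1) and (1, 3) so they broadcast
rows_2d, cols_2d = np.ix_(row, col)
print(rows_2d.shape, cols_2d.shape)   # (3, 1) (1, 3)

block = X[np.ix_(row, col)]
manual = X[row[:, np.newaxis], col]
print(np.array_equal(block, manual))  # True
```

Both forms select every combination of the given rows and columns, so the result has shape (k, l).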

I came here looking for an answer to this question, and hpaulj's comment helped me. I'm going to expand on it.
In the following snippet,
import numpy as np
X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
Y = X[row.reshape(-1, 1), col]
the indexes we're passing to X are broadcast against each other.
The code below, which follows the numpy broadcasting rules but uses far more memory, accomplishes the same slicing:
# Make the row and column indices 'conformable'
R = np.repeat(row.reshape(-1, 1), 3, axis=1) # repeat row index across columns
C = np.repeat(col.reshape(1, -1), 3, axis=0) # repeat column index across rows
Y = X[R, C] # Y[i, j] = X[R[i, j], C[i, j]]
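To inspect what the broadcast index arrays look like without building them with np.repeat, np.broadcast_arrays can be used. This is only a sketch to make the shapes visible; the names Y_broadcast and Y_explicit are mine:

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Broadcast the (3, 1) row index against the (3,) column index.
# The results are views of the inputs, so no repeated copies are made.
R, C = np.broadcast_arrays(row.reshape(-1, 1), col)
print(R.shape, C.shape)  # (3, 3) (3, 3)

Y_broadcast = X[row.reshape(-1, 1), col]
Y_explicit = X[R, C]
print(np.array_equal(Y_broadcast, Y_explicit))  # True
```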


Find index of n biggest values [duplicate]

NumPy proposes a way to get the index of the maximum value of an array via np.argmax.
I would like a similar thing, but returning the indexes of the N maximum values.
For instance, if I have an array, [1, 3, 2, 4, 5], then nargmax(array, n=3) would return the indices [4, 3, 1] which correspond to the elements [5, 4, 3].
Newer NumPy versions (1.8 and up) have a function called argpartition for this. To get the indices of the four largest elements, do
>>> a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> a
array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> ind = np.argpartition(a, -4)[-4:]
>>> ind
array([1, 5, 8, 0])
>>> top4 = a[ind]
>>> top4
array([4, 9, 6, 9])
Unlike argsort, this function runs in linear time in the worst case, but the returned indices are not sorted, as can be seen from the result of evaluating a[ind]. If you need that too, sort them afterwards:
>>> ind[np.argsort(a[ind])]
array([1, 8, 5, 0])
To get the top-k elements in sorted order in this way takes O(n + k log k) time.
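The two steps (partition, then sort only the k candidates) can be wrapped in a small helper. This is a sketch, not part of NumPy; top_k_indices is a hypothetical name and it assumes a 1-D input with 1 <= k <= len(a):

```python
import numpy as np

def top_k_indices(a, k):
    """Indices of the k largest values of a 1-D array, largest first (hypothetical helper)."""
    a = np.asarray(a)
    # Partial selection: the last k positions hold the k largest values, unordered
    candidates = np.argpartition(a, -k)[-k:]
    # Sort only those k candidates, descending by value
    return candidates[np.argsort(a[candidates])[::-1]]

a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
idx = top_k_indices(a, 4)
print(a[idx])  # [9 9 6 4]
```

Which index is reported for tied values (here the three 4s) depends on the partition and is not guaranteed.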
The simplest I've been able to come up with is:
>>> import numpy as np
>>> arr = np.array([1, 3, 2, 4, 5])
>>> arr.argsort()[-3:][::-1]
array([4, 3, 1])
This involves a complete sort of the array. I wonder if numpy provides a built-in way to do a partial sort; so far I haven't been able to find one.
If this solution turns out to be too slow (especially for small n), it may be worth looking at coding something up in Cython.
Simpler yet:
idx = (-arr).argsort()[:n]
where n is the number of maximum values.
Use:
>>> import heapq
>>> import numpy
>>> a = numpy.array([1, 3, 2, 4, 5])
>>> heapq.nlargest(3, range(len(a)), a.take)
[4, 3, 1]
For regular Python lists:
>>> a = [1, 3, 2, 4, 5]
>>> heapq.nlargest(3, range(len(a)), a.__getitem__)
[4, 3, 1]
If you use Python 2, use xrange instead of range.
Source: heapq — Heap queue algorithm
If you happen to be working with a multidimensional array then you'll need to flatten and unravel the indices:
def largest_indices(ary, n):
    """Returns the indices of the n largest values in a numpy array."""
    flat = ary.flatten()
    indices = np.argpartition(flat, -n)[-n:]
    indices = indices[np.argsort(-flat[indices])]
    return np.unravel_index(indices, ary.shape)
For example:
>>> xs = np.sin(np.arange(9)).reshape((3, 3))
>>> xs
array([[ 0. , 0.84147098, 0.90929743],
[ 0.14112001, -0.7568025 , -0.95892427],
[-0.2794155 , 0.6569866 , 0.98935825]])
>>> largest_indices(xs, 3)
(array([2, 0, 0]), array([2, 2, 1]))
>>> xs[largest_indices(xs, 3)]
array([ 0.98935825, 0.90929743, 0.84147098])
Three Answers Compared For Coding Ease And Speed
Speed was important for my needs, so I tested three answers to this question.
Code from those three answers was modified as needed for my specific case.
I then compared the speed of each method.
Coding wise:
NPE's answer was the next most elegant and adequately fast for my needs.
Fred Foo's answer required the most refactoring for my needs but was the fastest. I went with this answer because, even though it took more work, it was not too bad and had significant speed advantages.
off99555's answer was the most elegant, but it is the slowest.
Complete Code for Test and Comparisons
import numpy as np
import time
import random
import sys
from operator import itemgetter
from heapq import nlargest
''' Fake Data Setup '''
a1 = list(range(1000000))
random.shuffle(a1)
a1 = np.array(a1)
''' ################################################ '''
''' NPE's Answer Modified A Bit For My Case '''
t0 = time.time()
indices = np.flip(np.argsort(a1))[:5]
results = []
for index in indices:
    results.append((index, a1[index]))
t1 = time.time()
print("NPE's Answer:")
print(results)
print(t1 - t0)
print()
''' Fred Foo's Answer Modified A Bit For My Case'''
t0 = time.time()
indices = np.argpartition(a1, -6)[-5:]
results = []
for index in indices:
    results.append((a1[index], index))
results.sort(reverse=True)
results = [(b, a) for a, b in results]
t1 = time.time()
print("Fred Foo's Answer:")
print(results)
print(t1 - t0)
print()
''' off99555's Answer - No Modification Needed For My Needs '''
t0 = time.time()
result = nlargest(5, enumerate(a1), itemgetter(1))
t1 = time.time()
print("off99555's Answer:")
print(result)
print(t1 - t0)
Output with Speed Reports
NPE's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.1349949836730957
Fred Foo's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.011161565780639648
off99555's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.439760684967041
If you don't care about the order of the K largest elements, you can use argpartition, which should perform better than a full sort through argsort.
K = 4 # We want the indices of the four largest values
a = np.array([0, 8, 0, 4, 5, 8, 8, 0, 4, 2])
np.argpartition(a,-K)[-K:]
array([4, 1, 5, 6])
Credits go to this question.
I ran a few tests and it looks like argpartition outperforms argsort as the size of the array and the value of K increase.
For multidimensional arrays you can use the axis keyword in order to apply the partitioning along the expected axis.
# For a 2D array
indices = np.argpartition(arr, -N, axis=1)[:, -N:]
And for grabbing the items:
x = arr.shape[0]
arr[np.repeat(np.arange(x), N), indices.ravel()].reshape(x, N)
But note that this won't return a sorted result. In that case you can use np.argsort() along the intended axis:
indices = np.argsort(arr, axis=1)[:, -N:]
# Result
x = arr.shape[0]
arr[np.repeat(np.arange(x), N), indices.ravel()].reshape(x, N)
Here is an example:
In [42]: a = np.random.randint(0, 20, (10, 10))
In [44]: a
Out[44]:
array([[ 7, 11, 12, 0, 2, 3, 4, 10, 6, 10],
[16, 16, 4, 3, 18, 5, 10, 4, 14, 9],
[ 2, 9, 15, 12, 18, 3, 13, 11, 5, 10],
[14, 0, 9, 11, 1, 4, 9, 19, 18, 12],
[ 0, 10, 5, 15, 9, 18, 5, 2, 16, 19],
[14, 19, 3, 11, 13, 11, 13, 11, 1, 14],
[ 7, 15, 18, 6, 5, 13, 1, 7, 9, 19],
[11, 17, 11, 16, 14, 3, 16, 1, 12, 19],
[ 2, 4, 14, 8, 6, 9, 14, 9, 1, 5],
[ 1, 10, 15, 0, 1, 9, 18, 2, 2, 12]])
In [45]: np.argpartition(a, np.argmin(a, axis=0))[:, 1:] # 1 is because the first item is the minimum one.
Out[45]:
array([[4, 5, 6, 8, 0, 7, 9, 1, 2],
[2, 7, 5, 9, 6, 8, 1, 0, 4],
[5, 8, 1, 9, 7, 3, 6, 2, 4],
[4, 5, 2, 6, 3, 9, 0, 8, 7],
[7, 2, 6, 4, 1, 3, 8, 5, 9],
[2, 3, 5, 7, 6, 4, 0, 9, 1],
[4, 3, 0, 7, 8, 5, 1, 2, 9],
[5, 2, 0, 8, 4, 6, 3, 1, 9],
[0, 1, 9, 4, 3, 7, 5, 2, 6],
[0, 4, 7, 8, 5, 1, 9, 2, 6]])
In [46]: np.argpartition(a, np.argmin(a, axis=0))[:, -3:]
Out[46]:
array([[9, 1, 2],
[1, 0, 4],
[6, 2, 4],
[0, 8, 7],
[8, 5, 9],
[0, 9, 1],
[1, 2, 9],
[3, 1, 9],
[5, 2, 6],
[9, 2, 6]])
With ind bound to the result shown in Out[46] and x = a.shape[0]:
In [89]: a[np.repeat(np.arange(x), 3), ind.ravel()].reshape(x, 3)
Out[89]:
array([[10, 11, 12],
[16, 16, 18],
[13, 15, 18],
[14, 18, 19],
[16, 18, 19],
[14, 14, 19],
[15, 18, 19],
[16, 17, 19],
[ 9, 14, 14],
[12, 15, 18]])
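On NumPy 1.15 or later, np.take_along_axis can replace the repeat/ravel/reshape gather. A sketch under that assumption, with random data and an arbitrarily chosen seed (the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily
arr = rng.integers(0, 20, size=(10, 10))
N = 3

# Indices of the N largest entries in each row (unordered within a row)
indices = np.argpartition(arr, -N, axis=1)[:, -N:]

# Gather values row by row without repeat/ravel/reshape
top_vals = np.take_along_axis(arr, indices, axis=1)

# Optional: order each row's top-N ascending
order = np.argsort(top_vals, axis=1)
top_vals_sorted = np.take_along_axis(top_vals, order, axis=1)
top_idx_sorted = np.take_along_axis(indices, order, axis=1)
```

Only the N selected columns per row are sorted, so the extra cost over the plain partition stays small when N is much smaller than the row length.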
np.argpartition does not sort the whole array: it only partitions it so that the indices of the k largest values end up in the last k positions, which makes it faster than np.argsort (a full sort) when the array is quite large. But the returned indices are NOT in ascending/descending order. Let's see this with an example:
We can see that if you want the top k indices in strict ascending order, np.argpartition won't return what you want.
Apart from sorting manually after np.argpartition, my solution is to use torch.topk from PyTorch, a neural-network library that provides NumPy-like APIs with both CPU and GPU support. It's as fast as NumPy with MKL, and offers a GPU boost if you need large matrix/vector calculations.
The code for top k indices in strict ascending/descending order will be:
Note that torch.topk accepts a torch tensor, and returns both the top k values and the top k indices as torch.Tensor objects. As with NumPy's axis, torch.topk also accepts a dimension argument (dim), so you can handle multi-dimensional arrays/tensors.
This will be faster than a full sort depending on the size of your original array and the size of your selection:
>>> A = np.random.randint(0,10,10)
>>> A
array([5, 1, 5, 5, 2, 3, 2, 4, 1, 0])
>>> B = np.zeros(3, int)
>>> for i in range(3):
...     idx = np.argmax(A)
...     B[i] = idx; A[idx] = -1  # something smaller than A.min()
...
>>> B
array([0, 2, 3])
It, of course, involves tampering with your original array. Which you could fix (if needed) by making a copy or replacing back the original values. ...whichever is cheaper for your use case.
Use:
from operator import itemgetter
from heapq import nlargest
result = nlargest(N, enumerate(your_list), itemgetter(1))
Now the result list would contain N tuples (index, value) where value is maximized.
Use:
def max_indices(arr, k):
    '''
    Returns the indices of the k first largest elements of arr
    (in descending order in values)
    '''
    assert k <= arr.size, 'k should be smaller or equal to the array size'
    arr_ = arr.astype(float)  # make a copy of arr
    max_idxs = []
    for _ in range(k):
        max_element = np.max(arr_)
        if np.isinf(max_element):
            break
        else:
            idx = np.where(arr_ == max_element)
        max_idxs.append(idx)
        arr_[idx] = -np.inf
    return max_idxs
It also works with 2D arrays. For example,
In [0]: A = np.array([[ 0.51845014, 0.72528114],
[ 0.88421561, 0.18798661],
[ 0.89832036, 0.19448609],
[ 0.89832036, 0.19448609]])
In [1]: max_indices(A, 8)
Out[1]:
[(array([2, 3], dtype=int64), array([0, 0], dtype=int64)),
(array([1], dtype=int64), array([0], dtype=int64)),
(array([0], dtype=int64), array([1], dtype=int64)),
(array([0], dtype=int64), array([0], dtype=int64)),
(array([2, 3], dtype=int64), array([1, 1], dtype=int64)),
(array([1], dtype=int64), array([1], dtype=int64))]
In [2]: A[max_indices(A, 8)[0]][0]
Out[2]: array([ 0.89832036])
I found it most intuitive to use np.unique.
The idea is that np.unique with return_inverse=True also returns, for each input value, its index in the array of unique values. From the index of the largest unique value and these inverse indices, the positions of the maximum in the original array can be recovered.
multi_max = [1,1,2,2,4,0,0,4]
uniques, idx = np.unique(multi_max, return_inverse=True)
print(np.squeeze(np.argwhere(idx == np.argmax(uniques))))
>> [4 7]
The following is a very easy way to see the maximum elements and their positions. Here axis selects the direction of the reduction: for a 2D array, axis=0 gives the maximum of each column and axis=1 the maximum of each row. For higher dimensions, choose the axis you want to reduce over.
M = np.random.random((3, 4))
print(M)
print(M.max(axis=1), M.argmax(axis=1))
Here's a more complicated way that increases n if the nth value has ties:
>>> def get_top_n_plus_ties(arr, n):
...     sorted_args = np.argsort(-arr)
...     thresh = arr[sorted_args[n - 1]]  # value of the n-th largest element
...     n_ = np.sum(arr >= thresh)
...     return sorted_args[:n_]
...
>>> get_top_n_plus_ties(np.array([2,9,8,3,0,2,8,3,1,9,5]), 3)
array([1, 9, 2, 6])
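A variant sketch of the same idea, assuming the threshold should be the value of the n-th largest element (position n - 1 in the descending order), so that n is only extended when that value is actually tied. The name top_n_with_ties is hypothetical, and a stable sort is used so that tied values come out in index order:

```python
import numpy as np

def top_n_with_ties(arr, n):
    """Indices of the n largest values plus any values tied with the n-th (hypothetical helper)."""
    arr = np.asarray(arr)
    order = np.argsort(-arr, kind="stable")   # descending by value
    thresh = arr[order[n - 1]]                # value of the n-th largest element
    n_keep = np.count_nonzero(arr >= thresh)
    return order[:n_keep]

arr = np.array([2, 9, 8, 3, 0, 2, 8, 3, 1, 9, 5])
print(top_n_with_ties(arr, 2))  # [1 9]
print(top_n_with_ties(arr, 3))  # [1 9 2 6]
```

Negating the array assumes a signed or floating-point dtype; for unsigned integers the negation would wrap around.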
I think the most time-efficient way is to iterate through the array manually and keep a k-size min-heap, as other people have mentioned.
I also came up with a brute-force approach:
top_k_index_list = []
for i in range(k):
    top_k_index_list.append(np.argmax(my_array))
    my_array[top_k_index_list[-1]] = -float('inf')
Set the largest element to a large negative value after you use argmax to get its index. The next call of argmax will then return the index of the second largest element.
You can also record the original values of these elements and restore them afterwards if you want.
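If you would rather not touch the original array at all, the same brute-force loop can run on a private float copy. A sketch only; top_k_by_argmax is a made-up name and it assumes a 1-D numeric input:

```python
import numpy as np

def top_k_by_argmax(values, k):
    """Repeated-argmax selection on a private float copy (hypothetical helper)."""
    work = np.array(values, dtype=float)   # copy, so the caller's array is untouched
    picked = []
    for _ in range(k):
        i = int(np.argmax(work))
        picked.append(i)
        work[i] = -np.inf                  # exclude this position from later rounds
    return picked

my_array = np.array([5, 1, 5, 5, 2, 3, 2, 4, 1, 0])
print(top_k_by_argmax(my_array, 3))  # [0, 2, 3]
print(my_array)                      # unchanged
```

Converting to float also sidesteps the problem that -inf cannot be stored in an integer array.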
This code works for a 2D NumPy array:
mat = np.array([[1, 3], [2, 5]]) # 2D array
n = 2 # number of largest elements
n_largest_mat = np.sort(mat, axis=None)[-n:] # the n largest values
tf_n_largest = np.zeros(mat.shape, dtype=bool) # all-False mask
for x in n_largest_mat:
    tf_n_largest = (tf_n_largest) | (mat == x) # mark positions equal to x
n_largest_elems = mat[tf_n_largest] # boolean-mask indexing
This produces a boolean mask of the n largest elements, which can also be used to extract those elements from the array.
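The same boolean mask can probably be built without the loop by comparing against the n-th largest value. A sketch, assuming ties with that value should all be included (the name threshold is mine):

```python
import numpy as np

mat = np.array([[1, 3], [2, 5]])
n = 2

# The n-th largest value overall; np.partition with axis=None works on the flattened array
threshold = np.partition(mat, -n, axis=None)[-n]

mask = mat >= threshold          # boolean array with the same shape as mat
print(mask)
print(mat[mask])                 # [3 5]
```

Note that boolean-mask indexing returns the selected elements in row-major order, not sorted by value.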
When top_k << axis_length, this is faster than argsort.
import numpy as np
def get_sorted_top_k(array, top_k=1, axis=-1, reverse=False):
    if reverse:
        axis_length = array.shape[axis]
        partition_index = np.take(np.argpartition(array, kth=-top_k, axis=axis),
                                  range(axis_length - top_k, axis_length), axis)
    else:
        partition_index = np.take(np.argpartition(array, kth=top_k, axis=axis), range(0, top_k), axis)
    top_scores = np.take_along_axis(array, partition_index, axis)
    # resort partition
    sorted_index = np.argsort(top_scores, axis=axis)
    if reverse:
        sorted_index = np.flip(sorted_index, axis=axis)
    top_sorted_scores = np.take_along_axis(top_scores, sorted_index, axis)
    top_sorted_indexes = np.take_along_axis(partition_index, sorted_index, axis)
    return top_sorted_scores, top_sorted_indexes


if __name__ == "__main__":
    import time
    from sklearn.metrics.pairwise import cosine_similarity
    x = np.random.rand(10, 128)
    y = np.random.rand(1000000, 128)
    z = cosine_similarity(x, y)
    start_time = time.time()
    sorted_index_1 = get_sorted_top_k(z, top_k=3, axis=1, reverse=True)[1]
    print(time.time() - start_time)
You can simply use a dictionary to find top k values & indices in a numpy array.
For example, if you want to find top 2 maximum values & indices
import numpy as np
nums = np.array([0.2, 0.3, 0.25, 0.15, 0.1])
def TopK(x, k):
    a = dict([(i, j) for i, j in enumerate(x)])
    sorted_a = dict(sorted(a.items(), key = lambda kv:kv[1], reverse=True))
    indices = list(sorted_a.keys())[:k]
    values = list(sorted_a.values())[:k]
    return (indices, values)
print(f"Indices: {TopK(nums, k = 2)[0]}")
print(f"Values: {TopK(nums, k = 2)[1]}")
Indices: [1, 2]
Values: [0.3, 0.25]
A vectorized 2D implementation using argpartition:
k = 3
probas = np.array([
[.6, .1, .15, .15],
[.1, .6, .15, .15],
[.3, .1, .6, 0],
])
k_indices = np.argpartition(-probas, k-1, axis=-1)[:, :k]
# adjust indices to apply in flat array
adjuster = np.arange(probas.shape[0]) * probas.shape[1]
adjuster = np.broadcast_to(adjuster[:, None], k_indices.shape)
k_indices_flat = k_indices + adjuster
k_values = probas.flatten()[k_indices_flat]
# k_indices:
# array([[0, 2, 3],
# [1, 2, 3],
# [2, 0, 1]])
# k_values:
# array([[0.6 , 0.15, 0.15],
# [0.6 , 0.15, 0.15],
# [0.6 , 0.3 , 0.1 ]])
If you are dealing with NaNs and/or have problems understanding np.argpartition, try pandas.DataFrame.sort_values.
import numpy as np
import pandas as pd
a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
df = pd.DataFrame(a, columns=['array'])
max_values = df['array'].sort_values(ascending=False, na_position='last')
ind = max_values[0:3].index.to_list()
This example gives the indices of the 3 largest, not-NaN values. Probably inefficient, but easy to read and customize.
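If you want to stay in NumPy, one way to keep NaNs out of the result is to replace them with -inf before ranking. A sketch with made-up data (the array a and the name filled below are only for illustration):

```python
import numpy as np

a = np.array([9.0, np.nan, 4.0, 3.0, np.nan, 9.0, 0.0, 6.0])
k = 3

# Replace NaN with -inf so it can never rank among the largest values
filled = np.where(np.isnan(a), -np.inf, a)
ind = np.argsort(-filled, kind="stable")[:k]
print(ind)     # [0 5 7]
print(a[ind])  # [9. 9. 6.]
```

If fewer than k values are non-NaN, the remaining slots will still point at NaN entries, so check for that case if it can occur in your data.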

Numpy concatenate lists where first column is in range n

I am trying to select all rows in a numpy matrix named matrix with shape (25323, 9), where the values of the first column are inside the range of start and end for each tuple on the list range_tuple. Ultimately, I want to create a new numpy matrix with the result where final has a shape of (n, 9). The following code returns this error: TypeError: only integer scalar arrays can be converted to a scalar index. I have also tried initializing final with numpy.zeros((1,9)) and used np.concatenate but get similar results. I do get a compiled result when I use final.append(result) instead of using np.concatenate but the shape of the matrix gets lost. I know there is a proper solution to this problem, any help would be appreciated.
final = []
for i in range_tuples:
    copy = np.copy(matrix)
    start = i[0]
    end = i[1]
    result = copy[(matrix[:,0] < end) & (matrix[:,0] > start)]
    final = np.concatenate(final, result)
final = np.matrix(final)
In [33]: arr
Out[33]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17],
[18, 19, 20],
[21, 22, 23]])
In [34]: tups = [(0,6),(3,12),(9,10),(15,14)]
In [35]: alist=[]
    ...: for start, stop in tups:
    ...:     res = arr[(arr[:,0]<stop)&(arr[:,0]>=start), :]
    ...:     alist.append(res)
    ...:
Check the list; note that the elements differ in shape, and some have 1 or 0 rows. It's a good idea to test these edge cases.
In [37]: alist
Out[37]:
[array([[0, 1, 2],
[3, 4, 5]]), array([[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]]), array([[ 9, 10, 11]]), array([], shape=(0, 3), dtype=int64)]
vstack joins them:
In [38]: np.vstack(alist)
Out[38]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[ 9, 10, 11]])
Here concatenate also works, because default axis is 0, and all inputs are already 2d.
Try the following:
final = np.empty((0, 9))
for start, end in range_tuples:
    result = matrix[(matrix[:,0] < end) & (matrix[:,0] > start)]
    final = np.concatenate((final, result))
First, final is initialized as a numpy array. Second, the first argument to concatenate has to be a sequence (tuple or list) of the arrays; see the docs. In your code, concatenate interprets the result variable as the value of the axis parameter.
Notes
I used tuple unpacking to make the loop clearer
the copy is not needed
appending to a list can be faster. The final result can afterwards be obtained through reshaping, if result is always of the same length.
I would simply create a boolean mask to select rows that satisfy required conditions.
EDIT: I missed that you are working with matrix (as opposed to ndarray). The answer was edited for matrix.
Assume following input data:
matrix = np.matrix([[1, 2, 3], [5, 6, 7], [2, 1, 7], [3, 4, 5], [8, 9, 0]])
range_tuple = [(0, 2), (1, 4), (1, 9), (5, 9), (0, 100)]
Then, first, I would convert range_tuple to a numpy.matrix:
range_mat = np.matrix(range_tuple)
Now, create the mask:
mask = np.ravel((matrix[:, 0] > range_mat[:, 0]) & (matrix[:, 0] < range_mat[:, 1]))
Apply the mask:
final = matrix[mask] # boolean-mask indexing already returns a copy
To check:
print(final)
[[1 2 3]
[2 1 7]
[8 9 0]]
If length of range_tuple can be different from the number of rows in the matrix, then do this:
n = min(range_mat.shape[0], matrix.shape[0])
mask = np.pad(
    np.ravel(
        (matrix[:n, 0] > range_mat[:n, 0]) & (matrix[:n, 0] < range_mat[:n, 1])
    ),
    (0, matrix.shape[0] - n)
)
final = matrix[mask]
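If the goal is instead to test every row against every range (as in the question), broadcasting can build the whole mask in one step. A sketch using a plain ndarray rather than np.matrix, with illustrative ranges that are not from the question:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [5, 6, 7], [2, 1, 7], [3, 4, 5], [8, 9, 0]])
range_tuples = [(0, 2), (4, 6)]          # illustrative ranges, not from the question

ranges = np.asarray(range_tuples)
starts, ends = ranges[:, 0], ranges[:, 1]

first_col = matrix[:, 0][:, np.newaxis]  # shape (rows, 1)
# Compare every row against every range: result has shape (rows, n_ranges)
inside = (first_col > starts) & (first_col < ends)

mask = inside.any(axis=1)                # True if the row falls in at least one range
final = matrix[mask]
print(final)
# [[1 2 3]
#  [5 6 7]]
```

Unlike the loop-and-concatenate version, a row that falls inside several overlapping ranges appears only once here, and rows keep their original order.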

Array of indexes for each element along the first dimension in a 2D array (numpy, tensorflow)

indexes = np.array([[0,1,3],[1,2,4 ]])
data = np.random.rand(2,5)
Now, I would like an array of shape (2, 3), where
result[0] = data[0,indexes[0]]
result[1] = data[1,indexes[1]]
What would be the proper way to achieve this? A numpy way that would generalize to bigger arrays (perhaps even higher dimensional).
Please note the difference to questions like this, where the array of indexes contains tuples. This is not what I am asking.
Edit
A more general formulation of the question would be:
data.shape == (s0, s1, .., sn)
indexes.shape == (s0, s1, ..., sn-1, K)
so they have all dimensions but the last equal.
Then
result[i, j, ..., k] = data[i, j, ..., k, indexes[i, j, ..., k]]
where
len([i, j, ..., k]) == data.ndim - 1 == indexes.ndim - 1
Here are NumPy and TensorFlow solutions:
import numpy as np
import tensorflow as tf
def gather_index_np(data, index):
    data = np.asarray(data)
    index = np.asarray(index)
    # Make open grid of all but last dimension indices
    grid = np.ogrid[tuple(slice(s) for s in index.shape[:-1])]
    # Add extra dimension in grid
    grid = [g[..., np.newaxis] for g in grid]
    # Complete index
    index_full = tuple(grid + [index])
    # Index data to get result
    result = data[index_full]
    return result

def gather_index_tf(data, index):
    data = tf.convert_to_tensor(data)
    index = tf.convert_to_tensor(index)
    index_shape = tf.shape(index)
    d = index.shape.ndims
    # Make grid of all dimension indices
    grid = tf.meshgrid(*(tf.range(index_shape[i]) for i in range(d)), indexing='ij')
    # Complete index
    index_full = tf.stack(grid[:-1] + [index], axis=-1)
    # Index data to get result
    result = tf.gather_nd(data, index_full)
    return result
Example:
import numpy as np
import tensorflow as tf
data = np.arange(10).reshape((2, 5))
index = np.array([[0, 1, 3], [1, 2, 4]])
print(gather_index_np(data, index))
# [[0 1 3]
# [6 7 9]]
with tf.Session() as sess:
    print(sess.run(gather_index_tf(data, index)))
# [[0 1 3]
# [6 7 9]]
numpy has take_along_axis which does what you describe plus it also lets you choose the axis.
Example:
>>> a = np.arange(24).reshape(2,3,4)
>>> i = np.random.randint(0,4,(2,3,5))
>>> i
array([[[3, 3, 0, 1, 3],
[3, 1, 0, 3, 3],
[3, 2, 0, 3, 3]],
[[2, 3, 0, 0, 0],
[1, 1, 3, 1, 2],
[1, 3, 0, 0, 2]]])
>>> np.take_along_axis(a, i, -1)
array([[[ 3, 3, 0, 1, 3],
[ 7, 5, 4, 7, 7],
[11, 10, 8, 11, 11]],
[[14, 15, 12, 12, 12],
[17, 17, 19, 17, 18],
[21, 23, 20, 20, 22]]])
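For the 2D case in the question, the same result also falls out of plain fancy indexing with a broadcast row index. A minimal sketch using the example data from above (the name rows is mine):

```python
import numpy as np

data = np.arange(10).reshape((2, 5))
indexes = np.array([[0, 1, 3], [1, 2, 4]])

# Row numbers as a (2, 1) column broadcast against the (2, 3) column indices
rows = np.arange(data.shape[0])[:, np.newaxis]
result = data[rows, indexes]
print(result)
# [[0 1 3]
#  [6 7 9]]

print(np.array_equal(result, np.take_along_axis(data, indexes, axis=-1)))  # True
```

Here result[i, j] is data[i, indexes[i, j]], which is what take_along_axis computes along the last axis.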

Backwards axes in numpy.delete

It seems as though the axis argument in numpy.delete() is backwards from all other axis arguments in both numpy and pandas. Typically, axis=0 refers to columns and axis=1 refers to rows. For example:
import numpy as np
mat=np.array([[1,2], [3,4]])
# sum columns
np.sum(mat, axis=0)
# sum rows
np.sum(mat, axis=1)
# min of columns
np.min(mat, axis=0)
That all works as expected. But if I use numpy.delete, I have to switch:
# delete 1st row
np.delete(mat, 0, axis=0)
# delete 1st column
np.delete(mat, 0, axis=1)
Has anyone else noticed this? Am I crazy or is this by design?
It is by design. You are specifying the axis from which to delete the given index (or indices). For example, suppose we have z as follows:
In [62]: z
Out[62]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
You select different rows of z by varying the first index of z (i.e. by selecting indices along axis 0):
In [63]: z[0, :]
Out[63]: array([0, 1, 2, 3, 4])
In [64]: z[1, :]
Out[64]: array([5, 6, 7, 8, 9])
So it makes sense that you would also select axis=0 to delete, say, the row at index 1:
In [65]: np.delete(z, 1, axis=0)
Out[65]:
array([[ 0, 1, 2, 3, 4],
[10, 11, 12, 13, 14]])
Similarly, you use axis 1 (i.e. the second index) to access different columns:
In [66]: z[:, 0]
Out[66]: array([ 0, 5, 10])
In [67]: z[:, 3]
Out[67]: array([ 3, 8, 13])
and so you use axis=1 to delete columns:
In [68]: np.delete(z, 3, axis=1)
Out[68]:
array([[ 0, 1, 2, 4],
[ 5, 6, 7, 9],
[10, 11, 12, 14]])
Don't forget that this generalizes to n-dimensional arrays. For example, if you have a three-dimensional array a, and you want to delete the two-dimensional slice a[:, :, k], you would use np.delete(a, k, axis=2).
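A quick way to convince yourself is to look at the shapes. The sketch below uses an arbitrary 3D array: np.delete shrinks the named axis by one, while a reduction such as sum removes the axis it runs along:

```python
import numpy as np

a = np.arange(24).reshape((2, 3, 4))

# Deleting one index along an axis shrinks that axis by one
print(np.delete(a, 0, axis=0).shape)  # (1, 3, 4)
print(np.delete(a, 0, axis=1).shape)  # (2, 2, 4)
print(np.delete(a, 1, axis=2).shape)  # (2, 3, 3)

# A reduction such as sum removes the axis it runs along
print(a.sum(axis=0).shape)            # (3, 4)

# What remains after the deletion is every slice along axis 2 except a[:, :, 1]
remaining = np.delete(a, 1, axis=2)
print(np.array_equal(remaining, a[:, :, [0, 2, 3]]))  # True
```

In both cases axis names the dimension being operated on; "sum of the columns" with axis=0 means summing along the row index, which is why the two conventions only look reversed.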

Finding the largest K elements in a list with numpy [duplicate]

NumPy proposes a way to get the index of the maximum value of an array via np.argmax.
I would like a similar thing, but returning the indexes of the N maximum values.
For instance, if I have an array, [1, 3, 2, 4, 5], then nargmax(array, n=3) would return the indices [4, 3, 1] which correspond to the elements [5, 4, 3].
Newer NumPy versions (1.8 and up) have a function called argpartition for this. To get the indices of the four largest elements, do
>>> a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> a
array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> ind = np.argpartition(a, -4)[-4:]
>>> ind
array([1, 5, 8, 0])
>>> top4 = a[ind]
>>> top4
array([4, 9, 6, 9])
Unlike argsort, this function runs in linear time in the worst case, but the returned indices are not sorted, as can be seen from the result of evaluating a[ind]. If you need that too, sort them afterwards:
>>> ind[np.argsort(a[ind])]
array([1, 8, 5, 0])
To get the top-k elements in sorted order in this way takes O(n + k log k) time.
The simplest I've been able to come up with is:
>>> import numpy as np
>>> arr = np.array([1, 3, 2, 4, 5])
>>> arr.argsort()[-3:][::-1]
array([4, 3, 1])
This involves a complete sort of the array. I wonder if numpy provides a built-in way to do a partial sort; so far I haven't been able to find one.
If this solution turns out to be too slow (especially for small n), it may be worth looking at coding something up in Cython.
Simpler yet:
idx = (-arr).argsort()[:n]
where n is the number of maximum values.
Use:
>>> import heapq
>>> import numpy
>>> a = numpy.array([1, 3, 2, 4, 5])
>>> heapq.nlargest(3, range(len(a)), a.take)
[4, 3, 1]
For regular Python lists:
>>> a = [1, 3, 2, 4, 5]
>>> heapq.nlargest(3, range(len(a)), a.__getitem__)
[4, 3, 1]
If you use Python 2, use xrange instead of range.
Source: heapq — Heap queue algorithm
If you happen to be working with a multidimensional array then you'll need to flatten and unravel the indices:
def largest_indices(ary, n):
"""Returns the n largest indices from a numpy array."""
flat = ary.flatten()
indices = np.argpartition(flat, -n)[-n:]
indices = indices[np.argsort(-flat[indices])]
return np.unravel_index(indices, ary.shape)
For example:
>>> xs = np.sin(np.arange(9)).reshape((3, 3))
>>> xs
array([[ 0. , 0.84147098, 0.90929743],
[ 0.14112001, -0.7568025 , -0.95892427],
[-0.2794155 , 0.6569866 , 0.98935825]])
>>> largest_indices(xs, 3)
(array([2, 0, 0]), array([2, 2, 1]))
>>> xs[largest_indices(xs, 3)]
array([ 0.98935825, 0.90929743, 0.84147098])
Three Answers Compared For Coding Ease And Speed
Speed was important for my needs, so I tested three answers to this question.
Code from those three answers was modified as needed for my specific case.
I then compared the speed of each method.
Coding wise:
NPE's answer was the next most elegant and adequately fast for my needs.
Fred Foos answer required the most refactoring for my needs but was the fastest. I went with this answer, because even though it took more work, it was not too bad and had significant speed advantages.
off99555's answer was the most elegant, but it is the slowest.
Complete Code for Test and Comparisons
import numpy as np
import time
import random
import sys
from operator import itemgetter
from heapq import nlargest
''' Fake Data Setup '''
a1 = list(range(1000000))
random.shuffle(a1)
a1 = np.array(a1)
''' ################################################ '''
''' NPE's Answer Modified A Bit For My Case '''
t0 = time.time()
indices = np.flip(np.argsort(a1))[:5]
results = []
for index in indices:
results.append((index, a1[index]))
t1 = time.time()
print("NPE's Answer:")
print(results)
print(t1 - t0)
print()
''' Fred Foos Answer Modified A Bit For My Case'''
t0 = time.time()
indices = np.argpartition(a1, -6)[-5:]
results = []
for index in indices:
results.append((a1[index], index))
results.sort(reverse=True)
results = [(b, a) for a, b in results]
t1 = time.time()
print("Fred Foo's Answer:")
print(results)
print(t1 - t0)
print()
''' off99555's Answer - No Modification Needed For My Needs '''
t0 = time.time()
result = nlargest(5, enumerate(a1), itemgetter(1))
t1 = time.time()
print("off99555's Answer:")
print(result)
print(t1 - t0)
Output with Speed Reports
NPE's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.1349949836730957
Fred Foo's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.011161565780639648
off99555's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.439760684967041
If you don't care about the order of the K-th largest elements you can use argpartition, which should perform better than a full sort through argsort.
K = 4 # We want the indices of the four largest values
a = np.array([0, 8, 0, 4, 5, 8, 8, 0, 4, 2])
np.argpartition(a,-K)[-K:]
array([4, 1, 5, 6])
Credits go to this question.
I ran a few tests and it looks like argpartition outperforms argsort as the size of the array and the value of K increase.
For multidimensional arrays you can use the axis keyword in order to apply the partitioning along the expected axis.
# For a 2D array
indices = np.argpartition(arr, -N, axis=1)[:, -N:]
And for grabbing the items:
x = arr.shape[0]
arr[np.repeat(np.arange(x), N), indices.ravel()].reshape(x, N)
But note that this won't return a sorted result. In that case you can use np.argsort() along the intended axis:
indices = np.argsort(arr, axis=1)[:, -N:]
# Result
x = arr.shape[0]
arr[np.repeat(np.arange(x), N), indices.ravel()].reshape(x, N)
Here is an example:
In [42]: a = np.random.randint(0, 20, (10, 10))
In [44]: a
Out[44]:
array([[ 7, 11, 12, 0, 2, 3, 4, 10, 6, 10],
[16, 16, 4, 3, 18, 5, 10, 4, 14, 9],
[ 2, 9, 15, 12, 18, 3, 13, 11, 5, 10],
[14, 0, 9, 11, 1, 4, 9, 19, 18, 12],
[ 0, 10, 5, 15, 9, 18, 5, 2, 16, 19],
[14, 19, 3, 11, 13, 11, 13, 11, 1, 14],
[ 7, 15, 18, 6, 5, 13, 1, 7, 9, 19],
[11, 17, 11, 16, 14, 3, 16, 1, 12, 19],
[ 2, 4, 14, 8, 6, 9, 14, 9, 1, 5],
[ 1, 10, 15, 0, 1, 9, 18, 2, 2, 12]])
In [45]: np.argpartition(a, np.argmin(a, axis=0))[:, 1:] # 1 is because the first item is the minimum one.
Out[45]:
array([[4, 5, 6, 8, 0, 7, 9, 1, 2],
[2, 7, 5, 9, 6, 8, 1, 0, 4],
[5, 8, 1, 9, 7, 3, 6, 2, 4],
[4, 5, 2, 6, 3, 9, 0, 8, 7],
[7, 2, 6, 4, 1, 3, 8, 5, 9],
[2, 3, 5, 7, 6, 4, 0, 9, 1],
[4, 3, 0, 7, 8, 5, 1, 2, 9],
[5, 2, 0, 8, 4, 6, 3, 1, 9],
[0, 1, 9, 4, 3, 7, 5, 2, 6],
[0, 4, 7, 8, 5, 1, 9, 2, 6]])
In [46]: np.argpartition(a, np.argmin(a, axis=0))[:, -3:]
Out[46]:
array([[9, 1, 2],
[1, 0, 4],
[6, 2, 4],
[0, 8, 7],
[8, 5, 9],
[0, 9, 1],
[1, 2, 9],
[3, 1, 9],
[5, 2, 6],
[9, 2, 6]])
In [89]: a[np.repeat(np.arange(x), 3), ind.ravel()].reshape(x, 3)
Out[89]:
array([[10, 11, 12],
[16, 16, 18],
[13, 15, 18],
[14, 18, 19],
[16, 18, 19],
[14, 14, 19],
[15, 18, 19],
[16, 17, 19],
[ 9, 14, 14],
[12, 15, 18]])
np.argpartition only performs a partial sort to find the indices of the k largest elements, so it is faster than np.argsort (which performs a full sort) when the array is quite large. But the returned indices are NOT in ascending/descending order. For example:
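A minimal sketch of this behaviour, using a made-up array `a` and `k` chosen purely for illustration. It also shows the manual sort after np.argpartition:

```python
import numpy as np

a = np.array([3, 9, 1, 7, 5, 8, 2])
k = 3

# argpartition guarantees only that the last k positions hold the indices
# of the k largest values; their relative order is unspecified.
unordered = np.argpartition(a, -k)[-k:]

# Sorting just those k candidates by value gives a strict descending order.
ordered = unordered[np.argsort(a[unordered])[::-1]]

print(sorted(unordered.tolist()))  # [1, 3, 5]: the right set of indices
print(ordered)                     # [1 5 3]: values 9, 8, 7
```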
We can see that if you want the top k indices in strict ascending order, np.argpartition won't return what you want.
Apart from sorting manually after np.argpartition, my solution is to use torch.topk from PyTorch, a neural-network library that provides NumPy-like APIs with both CPU and GPU support. It's as fast as NumPy with MKL, and offers a GPU boost if you need large matrix/vector calculations.
The code for top k indices in strict ascending/descending order would be:
Note that torch.topk accepts a torch tensor and returns both the top k values and the top k indices as torch.Tensor objects. Like NumPy's axis argument, torch.topk takes a dim argument so that you can handle multi-dimensional arrays/tensors.
This will be faster than a full sort depending on the size of your original array and the size of your selection:
>>> A = np.random.randint(0,10,10)
>>> A
array([5, 1, 5, 5, 2, 3, 2, 4, 1, 0])
>>> B = np.zeros(3, int)
>>> for i in range(3):
...     idx = np.argmax(A)
...     B[i]=idx; A[idx]=0 # overwrite with a value no larger than A.min()
...
>>> B
array([0, 2, 3])
This, of course, modifies your original array. You can avoid that (if needed) by working on a copy or by restoring the original values afterwards, whichever is cheaper for your use case.
Use:
from operator import itemgetter
from heapq import nlargest
result = nlargest(N, enumerate(your_list), itemgetter(1))
Now the result list would contain N tuples (index, value) where value is maximized.
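As a quick illustration of the shape of the result, here is a small sketch with sample data (the values of `your_list` and `N` are made up):

```python
from heapq import nlargest
from operator import itemgetter

your_list = [4, 10, 7, 1, 9]  # sample data, chosen for illustration
N = 2

# enumerate yields (index, value) pairs; itemgetter(1) ranks them by value.
result = nlargest(N, enumerate(your_list), key=itemgetter(1))
print(result)  # [(1, 10), (4, 9)]

# Keep only the indices if the values are not needed.
indices = [i for i, _ in result]
print(indices)  # [1, 4]
```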
Use:
def max_indices(arr, k):
    '''
    Returns the indices of the k first largest elements of arr
    (in descending order in values)
    '''
    assert k <= arr.size, 'k should be smaller or equal to the array size'
    arr_ = arr.astype(float)  # make a copy of arr
    max_idxs = []
    for _ in range(k):
        max_element = np.max(arr_)
        if np.isinf(max_element):
            break
        else:
            idx = np.where(arr_ == max_element)
        max_idxs.append(idx)
        arr_[idx] = -np.inf
    return max_idxs
It also works with 2D arrays. For example,
In [0]: A = np.array([[ 0.51845014, 0.72528114],
[ 0.88421561, 0.18798661],
[ 0.89832036, 0.19448609],
[ 0.89832036, 0.19448609]])
In [1]: max_indices(A, 8)
Out[1]:
[(array([2, 3], dtype=int64), array([0, 0], dtype=int64)),
(array([1], dtype=int64), array([0], dtype=int64)),
(array([0], dtype=int64), array([1], dtype=int64)),
(array([0], dtype=int64), array([0], dtype=int64)),
(array([2, 3], dtype=int64), array([1, 1], dtype=int64)),
(array([1], dtype=int64), array([1], dtype=int64))]
In [2]: A[max_indices(A, 8)[0]][0]
Out[2]: array([ 0.89832036])
I found it most intuitive to use np.unique.
The idea is that, with return_inverse=True, np.unique also returns for each input value its index in the sorted array of unique values. The positions of the maximum in the original array can then be recovered by comparing those indices with the index of the largest unique value.
multi_max = [1,1,2,2,4,0,0,4]
uniques, idx = np.unique(multi_max, return_inverse=True)
print(np.squeeze(np.argwhere(idx == np.argmax(uniques))))
>> [4 7]
The following is a very easy way to see the maximum elements and their positions. Here axis selects the direction of the reduction: in the 2D case, axis=0 gives the maximum of each column and axis=1 gives the maximum of each row. For higher dimensions, choose whichever axis you want to reduce along.
M = np.random.random((3, 4))
print(M)
print(M.max(axis=1), M.argmax(axis=1))
Here's a more complicated way that increases n if the nth value has ties:
>>> def get_top_n_plus_ties(arr, n):
...     sorted_args = np.argsort(-arr)
...     thresh = arr[sorted_args[n - 1]]  # value of the nth largest element
...     n_ = np.sum(arr >= thresh)
...     return sorted_args[:n_]
...
>>> get_top_n_plus_ties(np.array([2,9,8,3,0,2,8,3,1,9,5]),3)
array([1, 9, 2, 6])
I think the most time-efficient way is to iterate through the array manually and keep a min-heap of size k, as other people have mentioned.
I also came up with a brute-force approach:
top_k_index_list = []
for i in range(k):
    top_k_index_list.append(np.argmax(my_array))
    my_array[top_k_index_list[-1]] = -float('inf')
After using argmax to get the index of the largest element, set that element to a large negative value. The next call to argmax then returns the index of the second largest element, and so on. (Assigning -float('inf') requires a float array; for an integer array, use a sufficiently small integer instead.)
You can also save the original values of these elements and restore them afterwards if you want.
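The min-heap idea mentioned above could be sketched roughly as follows. The helper name `top_k_indices_heap` is hypothetical, and the sample data is made up; unlike the brute-force loop, this version leaves the input untouched:

```python
import heapq

def top_k_indices_heap(values, k):
    """Return indices of the k largest values, largest first (hypothetical helper)."""
    if k <= 0:
        return []
    heap = []  # min-heap of (value, index); heap[0] is the smallest value kept so far
    for idx, val in enumerate(values):
        if len(heap) < k:
            heapq.heappush(heap, (val, idx))
        elif val > heap[0][0]:
            # The new value beats the smallest kept one, so swap it in.
            heapq.heapreplace(heap, (val, idx))
    # Sort the k survivors by value, largest first, and return their indices.
    return [idx for val, idx in sorted(heap, reverse=True)]

my_array = [3, 9, 1, 7, 5, 8, 2]  # sample data
print(top_k_indices_heap(my_array, 3))  # [1, 5, 3]
```

The heap never holds more than k items, so the extra memory stays proportional to k rather than to the array size. For NumPy arrays, a pure-Python loop like this is likely slower in practice than np.argpartition, so treat it as an illustration of the idea.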
This code works for a numpy 2D matrix array:
mat = np.array([[1, 3], [2, 5]]) # numpy matrix
n = 2 # n
n_largest_mat = np.sort(mat, axis=None)[-n:] # n_largest
tf_n_largest = np.zeros(mat.shape, dtype=bool) # all false matrix
for x in n_largest_mat:
    tf_n_largest = (tf_n_largest) | (mat == x) # true-false
n_largest_elems = mat[tf_n_largest] # true-false indexing
This produces a boolean mask marking the n largest elements, which can also be used to extract those elements from the matrix.
When top_k << axis_length, this is faster than argsort.
import numpy as np
def get_sorted_top_k(array, top_k=1, axis=-1, reverse=False):
    if reverse:
        axis_length = array.shape[axis]
        partition_index = np.take(np.argpartition(array, kth=-top_k, axis=axis),
                                  range(axis_length - top_k, axis_length), axis)
    else:
        partition_index = np.take(np.argpartition(array, kth=top_k, axis=axis), range(0, top_k), axis)
    top_scores = np.take_along_axis(array, partition_index, axis)
    # resort partition
    sorted_index = np.argsort(top_scores, axis=axis)
    if reverse:
        sorted_index = np.flip(sorted_index, axis=axis)
    top_sorted_scores = np.take_along_axis(top_scores, sorted_index, axis)
    top_sorted_indexes = np.take_along_axis(partition_index, sorted_index, axis)
    return top_sorted_scores, top_sorted_indexes
if __name__ == "__main__":
    import time
    from sklearn.metrics.pairwise import cosine_similarity
    x = np.random.rand(10, 128)
    y = np.random.rand(1000000, 128)
    z = cosine_similarity(x, y)
    start_time = time.time()
    sorted_index_1 = get_sorted_top_k(z, top_k=3, axis=1, reverse=True)[1]
    print(time.time() - start_time)
You can simply use a dictionary to find the top k values and indices in a NumPy array.
For example, to find the top 2 values and their indices:
import numpy as np
nums = np.array([0.2, 0.3, 0.25, 0.15, 0.1])
def TopK(x, k):
    a = dict([(i, j) for i, j in enumerate(x)])
    sorted_a = dict(sorted(a.items(), key = lambda kv:kv[1], reverse=True))
    indices = list(sorted_a.keys())[:k]
    values = list(sorted_a.values())[:k]
    return (indices, values)
print(f"Indices: {TopK(nums, k = 2)[0]}")
print(f"Values: {TopK(nums, k = 2)[1]}")
Indices: [1, 2]
Values: [0.3, 0.25]
A vectorized 2D implementation using argpartition:
k = 3
probas = np.array([
[.6, .1, .15, .15],
[.1, .6, .15, .15],
[.3, .1, .6, 0],
])
k_indices = np.argpartition(-probas, k-1, axis=-1)[:, :k]
# adjust indices to apply in flat array
adjuster = np.arange(probas.shape[0]) * probas.shape[1]
adjuster = np.broadcast_to(adjuster[:, None], k_indices.shape)
k_indices_flat = k_indices + adjuster
k_values = probas.flatten()[k_indices_flat]
# k_indices:
# array([[0, 2, 3],
# [1, 2, 3],
# [2, 0, 1]])
# k_values:
# array([[0.6 , 0.15, 0.15],
# [0.6 , 0.15, 0.15],
# [0.6 , 0.3 , 0.1 ]])
If you are dealing with NaNs and/or have problems understanding np.argpartition, try pandas.DataFrame.sort_values.
import numpy as np
import pandas as pd
a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
df = pd.DataFrame(a, columns=['array'])
max_values = df['array'].sort_values(ascending=False, na_position='last')
ind = max_values[0:3].index.to_list()
This example gives the indices of the 3 largest, not-NaN values. Probably inefficient, but easy to read and customize.
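If you would rather stay in NumPy, a comparable effect can be sketched by masking the NaNs before sorting. This is a different technique from the pandas approach above, and it assumes NaNs should never rank among the largest values; the sample array is made up:

```python
import numpy as np

a = np.array([9.0, np.nan, 4.0, 3.0, np.nan, 6.0])
k = 3

# Assumption: NaNs should never count as "largest", so treat them as -inf.
filled = np.where(np.isnan(a), -np.inf, a)

# A stable sort on the negated values gives descending order,
# with ties resolved by first position.
top = np.argsort(-filled, kind="stable")[:k]
print(top)  # [0 5 2]
```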
