Shift values in numpy array by differing amounts - python

I have an array a = np.array([2, 2, 2, 3, 3, 15, 7, 7, 9]) that continues like that. I would like to shift this array but I'm not sure if I can use np.roll() here.
The array I would like to produce is [0, 0, 0, 2, 2, 3, 15, 15, 7].
As you can see, the first like numbers which are in array a (in this case the three '2's) should be replaced with '0's. Everything should then be shifted such that the '3's are replaced with '2's, the '15' is replaced with the '3' etc. Ideally I would like to do this operation without any for loop as I need it to run quickly.
I realise this operation may be a bit confusing so please ask questions.

If you want to stick with NumPy, you can achieve this using np.unique by returning the counts per unique elements with the return_counts option.
Then, simply roll the values and construct a new array with np.repeat:
>>> s, i, c = np.unique(a, return_index=True, return_counts=True)
(array([ 2, 3, 7, 9, 15]), array([0, 3, 6, 8, 5]), array([3, 2, 2, 1, 1]))
The three outputs are respectively: unique sorted elements, indices of first encounter unique element, and the count per unique element.
np.unique sorts the value, so we need to unsort the values as well as the counts first. We can then shift the values with np.roll:
>>> idx = np.argsort(i)
>>> v = np.roll(s[idx], 1)
>>> v[0] = 0
array([ 0, 2, 3, 15, 7])
Alternatively with np.append, this requires a whole copy though:
>>> v = np.append([0], s[idx][:-1])
array([ 0, 2, 3, 15, 7])
Finally reassemble:
>>> np.repeat(v, c[idx])
array([ 0, 0, 0, 2, 2, 3, 15, 15, 7])
Another - more general - solution that will work when there are recurring values in a. This requires the use of np.diff.
You can get the indices of the elements with:
>>> i = np.diff(np.append(a, [0])).nonzero()[0] + 1
array([3, 5, 6, 8, 9])
>>> idx = np.append([0], i)
array([0, 3, 5, 6, 8, 9])
The values are then given using a[idx]:
>>> v = np.append([0], a)[idx]
array([ 0, 2, 3, 15, 7, 9])
And the counts per element with:
>>> c = np.append(np.diff(i, prepend=0), [0])
array([3, 2, 1, 2, 1, 0])
Finally, reassemble:
>>> np.repeat(v, c)
array([ 0, 0, 0, 2, 2, 3, 15, 15, 7])

This is not using numpy, but one approach that comes to mind is to itertools.groupby to collect contiguous runs of the same elements. Then shift all the elements (by prepending a 0) and use the counts to repeat them.
from itertools import chain, groupby
def shift(data):
values = [(k, len(list(g))) for k,g in groupby(data)]
keys = [0] + [i[0] for i in values]
reps = [i[1] for i in values]
return list(chain.from_iterable([[k]*rep for k, rep in zip(keys, reps)]))
For example
>>> a = np.array([2,2,2,3,3,15,7,7,9])
>>> shift(a)
[0, 0, 0, 2, 2, 3, 15, 15, 7]

You can try this code:
import numpy as np
a = np.array([2, 2, 2, 3, 3, 15, 7, 7, 9])
diff_a=np.diff(a)
idx=np.flatnonzero(diff_a)
val=diff_a[idx]
val=np.insert(val[:-1],0, a[0]) #update value
diff_a[idx]=val
res=np.append([0],np.cumsum(diff_a))
print(res)

You can try this:
import numpy as np
a = np.array([2, 2, 2, 3, 3, 15, 7, 7, 9])
z = a - np.pad(a, (1,0))[:-1]
z[m] = np.pad(z[(m := z!=0)], (1,0))[:-1]
print(z.cumsum())
It gives:
[ 0 0 0 2 2 3 15 15 7]

Related

Find index of n biggest values [duplicate]

NumPy proposes a way to get the index of the maximum value of an array via np.argmax.
I would like a similar thing, but returning the indexes of the N maximum values.
For instance, if I have an array, [1, 3, 2, 4, 5], then nargmax(array, n=3) would return the indices [4, 3, 1] which correspond to the elements [5, 4, 3].
Newer NumPy versions (1.8 and up) have a function called argpartition for this. To get the indices of the four largest elements, do
>>> a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> a
array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> ind = np.argpartition(a, -4)[-4:]
>>> ind
array([1, 5, 8, 0])
>>> top4 = a[ind]
>>> top4
array([4, 9, 6, 9])
Unlike argsort, this function runs in linear time in the worst case, but the returned indices are not sorted, as can be seen from the result of evaluating a[ind]. If you need that too, sort them afterwards:
>>> ind[np.argsort(a[ind])]
array([1, 8, 5, 0])
To get the top-k elements in sorted order in this way takes O(n + k log k) time.
The simplest I've been able to come up with is:
>>> import numpy as np
>>> arr = np.array([1, 3, 2, 4, 5])
>>> arr.argsort()[-3:][::-1]
array([4, 3, 1])
This involves a complete sort of the array. I wonder if numpy provides a built-in way to do a partial sort; so far I haven't been able to find one.
If this solution turns out to be too slow (especially for small n), it may be worth looking at coding something up in Cython.
Simpler yet:
idx = (-arr).argsort()[:n]
where n is the number of maximum values.
Use:
>>> import heapq
>>> import numpy
>>> a = numpy.array([1, 3, 2, 4, 5])
>>> heapq.nlargest(3, range(len(a)), a.take)
[4, 3, 1]
For regular Python lists:
>>> a = [1, 3, 2, 4, 5]
>>> heapq.nlargest(3, range(len(a)), a.__getitem__)
[4, 3, 1]
If you use Python 2, use xrange instead of range.
Source: heapq — Heap queue algorithm
If you happen to be working with a multidimensional array then you'll need to flatten and unravel the indices:
def largest_indices(ary, n):
"""Returns the n largest indices from a numpy array."""
flat = ary.flatten()
indices = np.argpartition(flat, -n)[-n:]
indices = indices[np.argsort(-flat[indices])]
return np.unravel_index(indices, ary.shape)
For example:
>>> xs = np.sin(np.arange(9)).reshape((3, 3))
>>> xs
array([[ 0. , 0.84147098, 0.90929743],
[ 0.14112001, -0.7568025 , -0.95892427],
[-0.2794155 , 0.6569866 , 0.98935825]])
>>> largest_indices(xs, 3)
(array([2, 0, 0]), array([2, 2, 1]))
>>> xs[largest_indices(xs, 3)]
array([ 0.98935825, 0.90929743, 0.84147098])
Three Answers Compared For Coding Ease And Speed
Speed was important for my needs, so I tested three answers to this question.
Code from those three answers was modified as needed for my specific case.
I then compared the speed of each method.
Coding wise:
NPE's answer was the next most elegant and adequately fast for my needs.
Fred Foos answer required the most refactoring for my needs but was the fastest. I went with this answer, because even though it took more work, it was not too bad and had significant speed advantages.
off99555's answer was the most elegant, but it is the slowest.
Complete Code for Test and Comparisons
import numpy as np
import time
import random
import sys
from operator import itemgetter
from heapq import nlargest
''' Fake Data Setup '''
a1 = list(range(1000000))
random.shuffle(a1)
a1 = np.array(a1)
''' ################################################ '''
''' NPE's Answer Modified A Bit For My Case '''
t0 = time.time()
indices = np.flip(np.argsort(a1))[:5]
results = []
for index in indices:
results.append((index, a1[index]))
t1 = time.time()
print("NPE's Answer:")
print(results)
print(t1 - t0)
print()
''' Fred Foos Answer Modified A Bit For My Case'''
t0 = time.time()
indices = np.argpartition(a1, -6)[-5:]
results = []
for index in indices:
results.append((a1[index], index))
results.sort(reverse=True)
results = [(b, a) for a, b in results]
t1 = time.time()
print("Fred Foo's Answer:")
print(results)
print(t1 - t0)
print()
''' off99555's Answer - No Modification Needed For My Needs '''
t0 = time.time()
result = nlargest(5, enumerate(a1), itemgetter(1))
t1 = time.time()
print("off99555's Answer:")
print(result)
print(t1 - t0)
Output with Speed Reports
NPE's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.1349949836730957
Fred Foo's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.011161565780639648
off99555's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.439760684967041
If you don't care about the order of the K-th largest elements you can use argpartition, which should perform better than a full sort through argsort.
K = 4 # We want the indices of the four largest values
a = np.array([0, 8, 0, 4, 5, 8, 8, 0, 4, 2])
np.argpartition(a,-K)[-K:]
array([4, 1, 5, 6])
Credits go to this question.
I ran a few tests and it looks like argpartition outperforms argsort as the size of the array and the value of K increase.
For multidimensional arrays you can use the axis keyword in order to apply the partitioning along the expected axis.
# For a 2D array
indices = np.argpartition(arr, -N, axis=1)[:, -N:]
And for grabbing the items:
x = arr.shape[0]
arr[np.repeat(np.arange(x), N), indices.ravel()].reshape(x, N)
But note that this won't return a sorted result. In that case you can use np.argsort() along the intended axis:
indices = np.argsort(arr, axis=1)[:, -N:]
# Result
x = arr.shape[0]
arr[np.repeat(np.arange(x), N), indices.ravel()].reshape(x, N)
Here is an example:
In [42]: a = np.random.randint(0, 20, (10, 10))
In [44]: a
Out[44]:
array([[ 7, 11, 12, 0, 2, 3, 4, 10, 6, 10],
[16, 16, 4, 3, 18, 5, 10, 4, 14, 9],
[ 2, 9, 15, 12, 18, 3, 13, 11, 5, 10],
[14, 0, 9, 11, 1, 4, 9, 19, 18, 12],
[ 0, 10, 5, 15, 9, 18, 5, 2, 16, 19],
[14, 19, 3, 11, 13, 11, 13, 11, 1, 14],
[ 7, 15, 18, 6, 5, 13, 1, 7, 9, 19],
[11, 17, 11, 16, 14, 3, 16, 1, 12, 19],
[ 2, 4, 14, 8, 6, 9, 14, 9, 1, 5],
[ 1, 10, 15, 0, 1, 9, 18, 2, 2, 12]])
In [45]: np.argpartition(a, np.argmin(a, axis=0))[:, 1:] # 1 is because the first item is the minimum one.
Out[45]:
array([[4, 5, 6, 8, 0, 7, 9, 1, 2],
[2, 7, 5, 9, 6, 8, 1, 0, 4],
[5, 8, 1, 9, 7, 3, 6, 2, 4],
[4, 5, 2, 6, 3, 9, 0, 8, 7],
[7, 2, 6, 4, 1, 3, 8, 5, 9],
[2, 3, 5, 7, 6, 4, 0, 9, 1],
[4, 3, 0, 7, 8, 5, 1, 2, 9],
[5, 2, 0, 8, 4, 6, 3, 1, 9],
[0, 1, 9, 4, 3, 7, 5, 2, 6],
[0, 4, 7, 8, 5, 1, 9, 2, 6]])
In [46]: np.argpartition(a, np.argmin(a, axis=0))[:, -3:]
Out[46]:
array([[9, 1, 2],
[1, 0, 4],
[6, 2, 4],
[0, 8, 7],
[8, 5, 9],
[0, 9, 1],
[1, 2, 9],
[3, 1, 9],
[5, 2, 6],
[9, 2, 6]])
In [89]: a[np.repeat(np.arange(x), 3), ind.ravel()].reshape(x, 3)
Out[89]:
array([[10, 11, 12],
[16, 16, 18],
[13, 15, 18],
[14, 18, 19],
[16, 18, 19],
[14, 14, 19],
[15, 18, 19],
[16, 17, 19],
[ 9, 14, 14],
[12, 15, 18]])
Method np.argpartition only returns the k largest indices, performs a local sort, and is faster than np.argsort(performing a full sort) when array is quite large. But the returned indices are NOT in ascending/descending order. Let's say with an example:
We can see that if you want a strict ascending order top k indices, np.argpartition won't return what you want.
Apart from doing a sort manually after np.argpartition, my solution is to use PyTorch, torch.topk, a tool for neural network construction, providing NumPy-like APIs with both CPU and GPU support. It's as fast as NumPy with MKL, and offers a GPU boost if you need large matrix/vector calculations.
Strict ascend/descend top k indices code will be:
Note that torch.topk accepts a torch tensor, and returns both top k values and top k indices in type torch.Tensor. Similar with np, torch.topk also accepts an axis argument so that you can handle multi-dimensional arrays/tensors.
This will be faster than a full sort depending on the size of your original array and the size of your selection:
>>> A = np.random.randint(0,10,10)
>>> A
array([5, 1, 5, 5, 2, 3, 2, 4, 1, 0])
>>> B = np.zeros(3, int)
>>> for i in xrange(3):
... idx = np.argmax(A)
... B[i]=idx; A[idx]=0 #something smaller than A.min()
...
>>> B
array([0, 2, 3])
It, of course, involves tampering with your original array. Which you could fix (if needed) by making a copy or replacing back the original values. ...whichever is cheaper for your use case.
Use:
from operator import itemgetter
from heapq import nlargest
result = nlargest(N, enumerate(your_list), itemgetter(1))
Now the result list would contain N tuples (index, value) where value is maximized.
Use:
def max_indices(arr, k):
'''
Returns the indices of the k first largest elements of arr
(in descending order in values)
'''
assert k <= arr.size, 'k should be smaller or equal to the array size'
arr_ = arr.astype(float) # make a copy of arr
max_idxs = []
for _ in range(k):
max_element = np.max(arr_)
if np.isinf(max_element):
break
else:
idx = np.where(arr_ == max_element)
max_idxs.append(idx)
arr_[idx] = -np.inf
return max_idxs
It also works with 2D arrays. For example,
In [0]: A = np.array([[ 0.51845014, 0.72528114],
[ 0.88421561, 0.18798661],
[ 0.89832036, 0.19448609],
[ 0.89832036, 0.19448609]])
In [1]: max_indices(A, 8)
Out[1]:
[(array([2, 3], dtype=int64), array([0, 0], dtype=int64)),
(array([1], dtype=int64), array([0], dtype=int64)),
(array([0], dtype=int64), array([1], dtype=int64)),
(array([0], dtype=int64), array([0], dtype=int64)),
(array([2, 3], dtype=int64), array([1, 1], dtype=int64)),
(array([1], dtype=int64), array([1], dtype=int64))]
In [2]: A[max_indices(A, 8)[0]][0]
Out[2]: array([ 0.89832036])
I found it most intuitive to use np.unique.
The idea is, that the unique method returns the indices of the input values. Then from the max unique value and the indicies, the position of the original values can be recreated.
multi_max = [1,1,2,2,4,0,0,4]
uniques, idx = np.unique(multi_max, return_inverse=True)
print np.squeeze(np.argwhere(idx == np.argmax(uniques)))
>> [4 7]
The following is a very easy way to see the maximum elements and its positions. Here axis is the domain; axis = 0 means column wise maximum number and axis = 1 means row wise max number for the 2D case. And for higher dimensions it depends upon you.
M = np.random.random((3, 4))
print(M)
print(M.max(axis=1), M.argmax(axis=1))
Here's a more complicated way that increases n if the nth value has ties:
>>>> def get_top_n_plus_ties(arr,n):
>>>> sorted_args = np.argsort(-arr)
>>>> thresh = arr[sorted_args[n]]
>>>> n_ = np.sum(arr >= thresh)
>>>> return sorted_args[:n_]
>>>> get_top_n_plus_ties(np.array([2,9,8,3,0,2,8,3,1,9,5]),3)
array([1, 9, 2, 6])
I think the most time efficiency way is manually iterate through the array and keep a k-size min-heap, as other people have mentioned.
And I also come up with a brute force approach:
top_k_index_list = [ ]
for i in range(k):
top_k_index_list.append(np.argmax(my_array))
my_array[top_k_index_list[-1]] = -float('inf')
Set the largest element to a large negative value after you use argmax to get its index. And then the next call of argmax will return the second largest element.
And you can log the original value of these elements and recover them if you want.
This code works for a numpy 2D matrix array:
mat = np.array([[1, 3], [2, 5]]) # numpy matrix
n = 2 # n
n_largest_mat = np.sort(mat, axis=None)[-n:] # n_largest
tf_n_largest = np.zeros((2,2), dtype=bool) # all false matrix
for x in n_largest_mat:
tf_n_largest = (tf_n_largest) | (mat == x) # true-false
n_largest_elems = mat[tf_n_largest] # true-false indexing
This produces a true-false n_largest matrix indexing that also works to extract n_largest elements from a matrix array
When top_k<<axis_length,it better than argsort.
import numpy as np
def get_sorted_top_k(array, top_k=1, axis=-1, reverse=False):
if reverse:
axis_length = array.shape[axis]
partition_index = np.take(np.argpartition(array, kth=-top_k, axis=axis),
range(axis_length - top_k, axis_length), axis)
else:
partition_index = np.take(np.argpartition(array, kth=top_k, axis=axis), range(0, top_k), axis)
top_scores = np.take_along_axis(array, partition_index, axis)
# resort partition
sorted_index = np.argsort(top_scores, axis=axis)
if reverse:
sorted_index = np.flip(sorted_index, axis=axis)
top_sorted_scores = np.take_along_axis(top_scores, sorted_index, axis)
top_sorted_indexes = np.take_along_axis(partition_index, sorted_index, axis)
return top_sorted_scores, top_sorted_indexes
if __name__ == "__main__":
import time
from sklearn.metrics.pairwise import cosine_similarity
x = np.random.rand(10, 128)
y = np.random.rand(1000000, 128)
z = cosine_similarity(x, y)
start_time = time.time()
sorted_index_1 = get_sorted_top_k(z, top_k=3, axis=1, reverse=True)[1]
print(time.time() - start_time)
You can simply use a dictionary to find top k values & indices in a numpy array.
For example, if you want to find top 2 maximum values & indices
import numpy as np
nums = np.array([0.2, 0.3, 0.25, 0.15, 0.1])
def TopK(x, k):
a = dict([(i, j) for i, j in enumerate(x)])
sorted_a = dict(sorted(a.items(), key = lambda kv:kv[1], reverse=True))
indices = list(sorted_a.keys())[:k]
values = list(sorted_a.values())[:k]
return (indices, values)
print(f"Indices: {TopK(nums, k = 2)[0]}")
print(f"Values: {TopK(nums, k = 2)[1]}")
Indices: [1, 2]
Values: [0.3, 0.25]
A vectorized 2D implementation using argpartition:
k = 3
probas = np.array([
[.6, .1, .15, .15],
[.1, .6, .15, .15],
[.3, .1, .6, 0],
])
k_indices = np.argpartition(-probas, k-1, axis=-1)[:, :k]
# adjust indices to apply in flat array
adjuster = np.arange(probas.shape[0]) * probas.shape[1]
adjuster = np.broadcast_to(adjuster[:, None], k_indices.shape)
k_indices_flat = k_indices + adjuster
k_values = probas.flatten()[k_indices_flat]
# k_indices:
# array([[0, 2, 3],
# [1, 2, 3],
# [2, 0, 1]])
# k_values:
# array([[0.6 , 0.15, 0.15],
# [0.6 , 0.15, 0.15],
# [0.6 , 0.3 , 0.1 ]])
If you are dealing with NaNs and/or have problems understanding np.argpartition, try pandas.DataFrame.sort_values.
import numpy as np
import pandas as pd
a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
df = pd.DataFrame(a, columns=['array'])
max_values = df['array'].sort_values(ascending=False, na_position='last')
ind = max_values[0:3].index.to_list()
This example gives the indices of the 3 largest, not-NaN values. Probably inefficient, but easy to read and customize.

How to ignore out of bounds values in an index array when using numpy?

I need to assign one array to another using an index array. But some values are out of bounds...
a = np.array([0, 1, 2, 3, 4])
b = np.array([10, 11, 12, 13, 14])
indexes = np.array([0, 2, 3, 5, 6])
a and b are the same size. If I use a[indexes] = b, it would throw an IndexError. I want it to ignore the out of bounds values, 5 and 6, so that a would become [10, 1, 11, 12, 4].
I tried to do indexes[indexes > b.size()] = 0
but this would mess up the value at index 0.
How can this be solved?
Edit
The indexes may not necessarily be in order. For example:
indexes = np.array([2, 3, 0, 5, 6])
a should become np.array([12, 1, 10, 11, 4])
You can filter out those invalid indexes:
indexes = indexes[indexes < len(a)]
a[indexes] = b[indexes]
Output:
array([10, 1, 12, 13, 4])

How can I merge rows in np matrix?

I've got a numpy matrix that has 2 rows and N columns, e.g. (if N=4):
[[ 1 3 5 7]
[ 2 4 6 8]]
The goal is create a string 1,2,3,4,5,6,7,8.
Merge the rows such that the elements from the first row have the even (1, 3, ..., N - 1) positions (the index starts from 1) and the elements from the second row have the odd positions (2, 4, ..., N).
The following code works but it isn't really nice:
xs = []
for i in range(number_of_cols):
xs.append(nums.item(0, i))
ys = []
for i in range(number_of_cols):
ys.append(nums.item(1, i))
nums_str = ""
for i in range(number_of_cols):
nums_str += '{},{},'.format(xs[i], ys[i])
Join the result list with a comma as a delimiter (row.join(','))
How can I merge the rows using built in functions (or just in a more elegant way overall)?
Specify F order when flattening (or ravel):
In [279]: arr = np.array([[1,3,5,7],[2,4,6,8]])
In [280]: arr
Out[280]:
array([[1, 3, 5, 7],
[2, 4, 6, 8]])
In [281]: arr.ravel(order='F')
Out[281]: array([1, 2, 3, 4, 5, 6, 7, 8])
Joining rows can be done this way :
>>> a = np.arange(12).reshape(3,4)
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> np.hstack([a[i,:] for i in range(a.shape[0])])
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Then it's simple to convert this array into string.
Here's one way of doing it:
out_str = ','.join(nums.T.ravel().astype('str'))
We are first transposing the array with .T, then flattening it with .ravel(), then converting each element from int to str, and then applying `','.join() to combine all the str elements
Trying it out:
import numpy as np
nums = np.array([[1,3,5,7],[2,4,6,8]])
out_str = ','.join(nums.T.ravel().astype('str'))
print (out_str)
Result:
1,2,3,4,5,6,7,8

How to get lists of indices to unique values efficiently?

Is there a built-in method that would help me achieve the following efficiently: given an array, I need a list of arrays, each with indices to a different unique value of the array?
If f is the desired function,
b = f(a)
and
u, idxs = unique(a)
then
b[i] == where(idxs==i)[0]
I am aware that pandas.Series.groupby() can do this, but it may no be efficient to create a dict when there are over 10^5 unique integers.
If you have numpy >= 1.9 you can do:
>>> a = np.random.randint(5, size=10)
>>> a
array([0, 2, 4, 4, 2, 4, 4, 3, 2, 1])
>>> unq, unq_inv, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
>>> np.split(np.argsort(unq_inv), np.cumsum(unq_cnt[:-1]))
[array([0]), array([9]), array([1, 4, 8]), array([7]), array([2, 3, 5, 6])]
>>> unq
array([0, 1, 2, 3, 4])
In earlier versions, you can get the counts doing an extra:
>>> unq_cnt = np.bincount(unq_inv)
Also, if you want to make sure that the indices for each value are sorted, I think you will need to use a stable sort, e.g. np.argsort(unq_inv, kind='mergesort')
Thinking about what you seem to be after, which I think is minimizing calls to an expensive function, I don't think you need to do what you are asking. Say that your function was squaring, you could simply do:
>>> unq, unq_inv = np.unique(a, return_inverse=True)
>>> f_unq = unq**2
>>> f_a = f_unq[unq_inv]
>>> a
array([0, 2, 4, 4, 2, 4, 4, 3, 2, 1])
>>> f_a
array([ 0, 4, 16, 16, 4, 16, 16, 9, 4, 1])
def foo(a):
I=np.arange(a.shape[0])
d={}
while a.shape[0]:
x = a[0]
ii = a==x
d[x] = I[ii]
a = a[~ii]
I = I[~ii]
return d
In [767]: a
Out[767]: array([4, 4, 3, 0, 0, 2, 1, 1, 0, 3])
In [768]: foo(a)
Out[768]:
{0: array([3, 4, 8]),
1: array([6, 7]),
2: array([5]),
3: array([2, 9]),
4: array([0, 1])}
Is this the sort of dictionary that you want?
For small a this works fine.
An equivalent dictionary building function is:
def foo1(a):
unq = np.unique(a)
return {i:np.where(a==i)[0] for i in unq}
Off hand I don't see how unq_inv helps with building the dictionary.
foo is about 30% slower than foo1. I was hoping that by reducing the searched array each time a value was counted that I might gain some speed. But it looks like the extra bookkeeping chews up time. And the where time might not be that sensitive to the length of a.
For a2=np.random.randint(5000,size=100000) run times are on the order of 2-3 sec.
But np.random.randint(50000,size=1000000) takes too long to time (for either version).
On further experimentation, a 'dumb' approach using a collections.defaultdict is much faster (20x):
def food(a):
d = defaultdict(list)
for i,j in enumerate(a):
d[j].append(i)
return d
The 'too big' (1000000,) array takes only 1.1 sec;
Maybe do something like:
s = argsort(a)
d = diff(a[s])
starts = where(d)[0]
f = [s[starts[i:i+1]] for i in xrange(len(a))]
(code not checked)

Finding the largest K elements in a list with numpy [duplicate]

NumPy proposes a way to get the index of the maximum value of an array via np.argmax.
I would like a similar thing, but returning the indexes of the N maximum values.
For instance, if I have an array, [1, 3, 2, 4, 5], then nargmax(array, n=3) would return the indices [4, 3, 1] which correspond to the elements [5, 4, 3].
Newer NumPy versions (1.8 and up) have a function called argpartition for this. To get the indices of the four largest elements, do
>>> a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> a
array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> ind = np.argpartition(a, -4)[-4:]
>>> ind
array([1, 5, 8, 0])
>>> top4 = a[ind]
>>> top4
array([4, 9, 6, 9])
Unlike argsort, this function runs in linear time in the worst case, but the returned indices are not sorted, as can be seen from the result of evaluating a[ind]. If you need that too, sort them afterwards:
>>> ind[np.argsort(a[ind])]
array([1, 8, 5, 0])
To get the top-k elements in sorted order in this way takes O(n + k log k) time.
The simplest I've been able to come up with is:
>>> import numpy as np
>>> arr = np.array([1, 3, 2, 4, 5])
>>> arr.argsort()[-3:][::-1]
array([4, 3, 1])
This involves a complete sort of the array. I wonder if numpy provides a built-in way to do a partial sort; so far I haven't been able to find one.
If this solution turns out to be too slow (especially for small n), it may be worth looking at coding something up in Cython.
Simpler yet:
idx = (-arr).argsort()[:n]
where n is the number of maximum values.
Use:
>>> import heapq
>>> import numpy
>>> a = numpy.array([1, 3, 2, 4, 5])
>>> heapq.nlargest(3, range(len(a)), a.take)
[4, 3, 1]
For regular Python lists:
>>> a = [1, 3, 2, 4, 5]
>>> heapq.nlargest(3, range(len(a)), a.__getitem__)
[4, 3, 1]
If you use Python 2, use xrange instead of range.
Source: heapq — Heap queue algorithm
If you happen to be working with a multidimensional array then you'll need to flatten and unravel the indices:
def largest_indices(ary, n):
"""Returns the n largest indices from a numpy array."""
flat = ary.flatten()
indices = np.argpartition(flat, -n)[-n:]
indices = indices[np.argsort(-flat[indices])]
return np.unravel_index(indices, ary.shape)
For example:
>>> xs = np.sin(np.arange(9)).reshape((3, 3))
>>> xs
array([[ 0. , 0.84147098, 0.90929743],
[ 0.14112001, -0.7568025 , -0.95892427],
[-0.2794155 , 0.6569866 , 0.98935825]])
>>> largest_indices(xs, 3)
(array([2, 0, 0]), array([2, 2, 1]))
>>> xs[largest_indices(xs, 3)]
array([ 0.98935825, 0.90929743, 0.84147098])
Three Answers Compared For Coding Ease And Speed
Speed was important for my needs, so I tested three answers to this question.
Code from those three answers was modified as needed for my specific case.
I then compared the speed of each method.
Coding wise:
NPE's answer was the next most elegant and adequately fast for my needs.
Fred Foos answer required the most refactoring for my needs but was the fastest. I went with this answer, because even though it took more work, it was not too bad and had significant speed advantages.
off99555's answer was the most elegant, but it is the slowest.
Complete Code for Test and Comparisons
import numpy as np
import time
import random
import sys
from operator import itemgetter
from heapq import nlargest
''' Fake Data Setup '''
a1 = list(range(1000000))
random.shuffle(a1)
a1 = np.array(a1)
''' ################################################ '''
''' NPE's Answer Modified A Bit For My Case '''
t0 = time.time()
indices = np.flip(np.argsort(a1))[:5]
results = []
for index in indices:
results.append((index, a1[index]))
t1 = time.time()
print("NPE's Answer:")
print(results)
print(t1 - t0)
print()
''' Fred Foos Answer Modified A Bit For My Case'''
t0 = time.time()
indices = np.argpartition(a1, -6)[-5:]
results = []
for index in indices:
results.append((a1[index], index))
results.sort(reverse=True)
results = [(b, a) for a, b in results]
t1 = time.time()
print("Fred Foo's Answer:")
print(results)
print(t1 - t0)
print()
''' off99555's Answer - No Modification Needed For My Needs '''
t0 = time.time()
result = nlargest(5, enumerate(a1), itemgetter(1))
t1 = time.time()
print("off99555's Answer:")
print(result)
print(t1 - t0)
Output with Speed Reports
NPE's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.1349949836730957
Fred Foo's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.011161565780639648
off99555's Answer:
[(631934, 999999), (788104, 999998), (413003, 999997), (536514, 999996), (81029, 999995)]
0.439760684967041
If you don't care about the order of the K-th largest elements you can use argpartition, which should perform better than a full sort through argsort.
K = 4 # We want the indices of the four largest values
a = np.array([0, 8, 0, 4, 5, 8, 8, 0, 4, 2])
np.argpartition(a,-K)[-K:]
array([4, 1, 5, 6])
Credits go to this question.
I ran a few tests and it looks like argpartition outperforms argsort as the size of the array and the value of K increase.
For multidimensional arrays you can use the axis keyword in order to apply the partitioning along the expected axis.
# For a 2D array
indices = np.argpartition(arr, -N, axis=1)[:, -N:]
And for grabbing the items:
x = arr.shape[0]
arr[np.repeat(np.arange(x), N), indices.ravel()].reshape(x, N)
But note that this won't return a sorted result. In that case you can use np.argsort() along the intended axis:
indices = np.argsort(arr, axis=1)[:, -N:]
# Result
x = arr.shape[0]
arr[np.repeat(np.arange(x), N), indices.ravel()].reshape(x, N)
Here is an example:
In [42]: a = np.random.randint(0, 20, (10, 10))
In [44]: a
Out[44]:
array([[ 7, 11, 12, 0, 2, 3, 4, 10, 6, 10],
[16, 16, 4, 3, 18, 5, 10, 4, 14, 9],
[ 2, 9, 15, 12, 18, 3, 13, 11, 5, 10],
[14, 0, 9, 11, 1, 4, 9, 19, 18, 12],
[ 0, 10, 5, 15, 9, 18, 5, 2, 16, 19],
[14, 19, 3, 11, 13, 11, 13, 11, 1, 14],
[ 7, 15, 18, 6, 5, 13, 1, 7, 9, 19],
[11, 17, 11, 16, 14, 3, 16, 1, 12, 19],
[ 2, 4, 14, 8, 6, 9, 14, 9, 1, 5],
[ 1, 10, 15, 0, 1, 9, 18, 2, 2, 12]])
In [45]: np.argpartition(a, np.argmin(a, axis=0))[:, 1:] # 1 is because the first item is the minimum one.
Out[45]:
array([[4, 5, 6, 8, 0, 7, 9, 1, 2],
[2, 7, 5, 9, 6, 8, 1, 0, 4],
[5, 8, 1, 9, 7, 3, 6, 2, 4],
[4, 5, 2, 6, 3, 9, 0, 8, 7],
[7, 2, 6, 4, 1, 3, 8, 5, 9],
[2, 3, 5, 7, 6, 4, 0, 9, 1],
[4, 3, 0, 7, 8, 5, 1, 2, 9],
[5, 2, 0, 8, 4, 6, 3, 1, 9],
[0, 1, 9, 4, 3, 7, 5, 2, 6],
[0, 4, 7, 8, 5, 1, 9, 2, 6]])
In [46]: np.argpartition(a, np.argmin(a, axis=0))[:, -3:]
Out[46]:
array([[9, 1, 2],
[1, 0, 4],
[6, 2, 4],
[0, 8, 7],
[8, 5, 9],
[0, 9, 1],
[1, 2, 9],
[3, 1, 9],
[5, 2, 6],
[9, 2, 6]])
In [89]: a[np.repeat(np.arange(x), 3), ind.ravel()].reshape(x, 3)
Out[89]:
array([[10, 11, 12],
[16, 16, 18],
[13, 15, 18],
[14, 18, 19],
[16, 18, 19],
[14, 14, 19],
[15, 18, 19],
[16, 17, 19],
[ 9, 14, 14],
[12, 15, 18]])
Method np.argpartition only returns the k largest indices, performs a local sort, and is faster than np.argsort(performing a full sort) when array is quite large. But the returned indices are NOT in ascending/descending order. Let's say with an example:
We can see that if you want a strict ascending order top k indices, np.argpartition won't return what you want.
Apart from doing a sort manually after np.argpartition, my solution is to use PyTorch, torch.topk, a tool for neural network construction, providing NumPy-like APIs with both CPU and GPU support. It's as fast as NumPy with MKL, and offers a GPU boost if you need large matrix/vector calculations.
Strict ascend/descend top k indices code will be:
Note that torch.topk accepts a torch tensor, and returns both top k values and top k indices in type torch.Tensor. Similar with np, torch.topk also accepts an axis argument so that you can handle multi-dimensional arrays/tensors.
This will be faster than a full sort depending on the size of your original array and the size of your selection:
>>> A = np.random.randint(0,10,10)
>>> A
array([5, 1, 5, 5, 2, 3, 2, 4, 1, 0])
>>> B = np.zeros(3, int)
>>> for i in xrange(3):
... idx = np.argmax(A)
... B[i]=idx; A[idx]=0 #something smaller than A.min()
...
>>> B
array([0, 2, 3])
It, of course, involves tampering with your original array. Which you could fix (if needed) by making a copy or replacing back the original values. ...whichever is cheaper for your use case.
Use:
from operator import itemgetter
from heapq import nlargest
result = nlargest(N, enumerate(your_list), itemgetter(1))
Now the result list would contain N tuples (index, value) where value is maximized.
Use:
def max_indices(arr, k):
'''
Returns the indices of the k first largest elements of arr
(in descending order in values)
'''
assert k <= arr.size, 'k should be smaller or equal to the array size'
arr_ = arr.astype(float) # make a copy of arr
max_idxs = []
for _ in range(k):
max_element = np.max(arr_)
if np.isinf(max_element):
break
else:
idx = np.where(arr_ == max_element)
max_idxs.append(idx)
arr_[idx] = -np.inf
return max_idxs
It also works with 2D arrays. For example,
In [0]: A = np.array([[ 0.51845014, 0.72528114],
[ 0.88421561, 0.18798661],
[ 0.89832036, 0.19448609],
[ 0.89832036, 0.19448609]])
In [1]: max_indices(A, 8)
Out[1]:
[(array([2, 3], dtype=int64), array([0, 0], dtype=int64)),
(array([1], dtype=int64), array([0], dtype=int64)),
(array([0], dtype=int64), array([1], dtype=int64)),
(array([0], dtype=int64), array([0], dtype=int64)),
(array([2, 3], dtype=int64), array([1, 1], dtype=int64)),
(array([1], dtype=int64), array([1], dtype=int64))]
In [2]: A[max_indices(A, 8)[0]][0]
Out[2]: array([ 0.89832036])
I found it most intuitive to use np.unique.
The idea is, that the unique method returns the indices of the input values. Then from the max unique value and the indicies, the position of the original values can be recreated.
multi_max = [1,1,2,2,4,0,0,4]
uniques, idx = np.unique(multi_max, return_inverse=True)
print np.squeeze(np.argwhere(idx == np.argmax(uniques)))
>> [4 7]
The following is a very easy way to see the maximum elements and its positions. Here axis is the domain; axis = 0 means column wise maximum number and axis = 1 means row wise max number for the 2D case. And for higher dimensions it depends upon you.
M = np.random.random((3, 4))
print(M)
print(M.max(axis=1), M.argmax(axis=1))
Here's a more complicated way that increases n if the nth value has ties:
>>>> def get_top_n_plus_ties(arr,n):
>>>> sorted_args = np.argsort(-arr)
>>>> thresh = arr[sorted_args[n]]
>>>> n_ = np.sum(arr >= thresh)
>>>> return sorted_args[:n_]
>>>> get_top_n_plus_ties(np.array([2,9,8,3,0,2,8,3,1,9,5]),3)
array([1, 9, 2, 6])
I think the most time efficiency way is manually iterate through the array and keep a k-size min-heap, as other people have mentioned.
And I also come up with a brute force approach:
top_k_index_list = [ ]
for i in range(k):
top_k_index_list.append(np.argmax(my_array))
my_array[top_k_index_list[-1]] = -float('inf')
Set the largest element to a large negative value after you use argmax to get its index. And then the next call of argmax will return the second largest element.
And you can log the original value of these elements and recover them if you want.
This code works for a numpy 2D matrix array:
mat = np.array([[1, 3], [2, 5]]) # numpy matrix
n = 2 # n
n_largest_mat = np.sort(mat, axis=None)[-n:] # n_largest
tf_n_largest = np.zeros((2,2), dtype=bool) # all false matrix
for x in n_largest_mat:
tf_n_largest = (tf_n_largest) | (mat == x) # true-false
n_largest_elems = mat[tf_n_largest] # true-false indexing
This produces a true-false n_largest matrix indexing that also works to extract n_largest elements from a matrix array
When top_k<<axis_length,it better than argsort.
import numpy as np
def get_sorted_top_k(array, top_k=1, axis=-1, reverse=False):
if reverse:
axis_length = array.shape[axis]
partition_index = np.take(np.argpartition(array, kth=-top_k, axis=axis),
range(axis_length - top_k, axis_length), axis)
else:
partition_index = np.take(np.argpartition(array, kth=top_k, axis=axis), range(0, top_k), axis)
top_scores = np.take_along_axis(array, partition_index, axis)
# resort partition
sorted_index = np.argsort(top_scores, axis=axis)
if reverse:
sorted_index = np.flip(sorted_index, axis=axis)
top_sorted_scores = np.take_along_axis(top_scores, sorted_index, axis)
top_sorted_indexes = np.take_along_axis(partition_index, sorted_index, axis)
return top_sorted_scores, top_sorted_indexes
if __name__ == "__main__":
import time
from sklearn.metrics.pairwise import cosine_similarity
x = np.random.rand(10, 128)
y = np.random.rand(1000000, 128)
z = cosine_similarity(x, y)
start_time = time.time()
sorted_index_1 = get_sorted_top_k(z, top_k=3, axis=1, reverse=True)[1]
print(time.time() - start_time)
You can simply use a dictionary to find top k values & indices in a numpy array.
For example, if you want to find top 2 maximum values & indices
import numpy as np
nums = np.array([0.2, 0.3, 0.25, 0.15, 0.1])
def TopK(x, k):
a = dict([(i, j) for i, j in enumerate(x)])
sorted_a = dict(sorted(a.items(), key = lambda kv:kv[1], reverse=True))
indices = list(sorted_a.keys())[:k]
values = list(sorted_a.values())[:k]
return (indices, values)
print(f"Indices: {TopK(nums, k = 2)[0]}")
print(f"Values: {TopK(nums, k = 2)[1]}")
Indices: [1, 2]
Values: [0.3, 0.25]
A vectorized 2D implementation using argpartition:
k = 3
probas = np.array([
[.6, .1, .15, .15],
[.1, .6, .15, .15],
[.3, .1, .6, 0],
])
k_indices = np.argpartition(-probas, k-1, axis=-1)[:, :k]
# adjust indices to apply in flat array
adjuster = np.arange(probas.shape[0]) * probas.shape[1]
adjuster = np.broadcast_to(adjuster[:, None], k_indices.shape)
k_indices_flat = k_indices + adjuster
k_values = probas.flatten()[k_indices_flat]
# k_indices:
# array([[0, 2, 3],
# [1, 2, 3],
# [2, 0, 1]])
# k_values:
# array([[0.6 , 0.15, 0.15],
# [0.6 , 0.15, 0.15],
# [0.6 , 0.3 , 0.1 ]])
If you are dealing with NaNs and/or have problems understanding np.argpartition, try pandas.DataFrame.sort_values.
import numpy as np
import pandas as pd
a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
df = pd.DataFrame(a, columns=['array'])
max_values = df['array'].sort_values(ascending=False, na_position='last')
ind = max_values[0:3].index.to_list()
This example gives the indices of the 3 largest, not-NaN values. Probably inefficient, but easy to read and customize.

Categories