Efficiently find indices of all values in an array

I have a very large array, consisting of integers between 0 and N, where each value occurs at least once.
I'd like to know, for each value k, all the indices in my array where the array's value equals k.
For example:
arr = np.array([0,1,2,3,2,1,0])
desired_output = {
0: np.array([0,6]),
1: np.array([1,5]),
2: np.array([2,4]),
3: np.array([3]),
}
Right now I am accomplishing this with a loop over range(N+1), and calling np.where N times.
indices = {}
for value in range(max(arr)+1):
    indices[value] = np.where(arr == value)[0]
This loop is by far the slowest part of my code. (Both the arr==value evaluation and the np.where call take up significant chunks of time.) Is there a more efficient way to do this?
I also tried playing around with np.unique(arr, return_index=True) but that only tells me the very first index, rather than all of them.

Approach #1
Here's a vectorized approach to get those indices as list of arrays -
sidx = arr.argsort()
unq, cut_idx = np.unique(arr[sidx],return_index=True)
indices = np.split(sidx,cut_idx)[1:]
If you want a final dictionary that maps each unique element to its indices, a dict comprehension over the split result does it -
dict_out = {unq[i]:iterID for i,iterID in enumerate(indices)}
Approach #2
If you are just interested in the list of arrays, here's an alternative meant for performance -
sidx = arr.argsort()
indices = np.split(sidx,np.flatnonzero(np.diff(arr[sidx])>0)+1)
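As a minimal, self-contained sketch of the sort-and-split idea above: the helper name `indices_by_value` is hypothetical, and `kind="stable"` is an added assumption so that each group's indices come out in ascending order.

```python
import numpy as np

def indices_by_value(arr):
    """Map each distinct value in a 1-D array to the indices where it occurs."""
    arr = np.asarray(arr)
    # A stable sort keeps equal values in their original order,
    # so every group of indices is ascending.
    sidx = np.argsort(arr, kind="stable")
    sorted_vals = arr[sidx]
    # Positions where the sorted values change mark the group boundaries.
    cuts = np.flatnonzero(sorted_vals[1:] != sorted_vals[:-1]) + 1
    groups = np.split(sidx, cuts)
    # The first element of each group is the key for that group.
    keys = sorted_vals[np.r_[0, cuts]] if arr.size else sorted_vals
    return dict(zip(keys.tolist(), groups))

result = indices_by_value([0, 1, 2, 3, 2, 1, 0])
```

The cost is dominated by the single sort, so it should scale as O(n log n) regardless of how many distinct values there are.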

A pythonic way is using collections.defaultdict():
>>> from collections import defaultdict
>>>
>>> d = defaultdict(list)
>>>
>>> for i, j in enumerate(arr):
...     d[j].append(i)
...
>>> d
defaultdict(<type 'list'>, {0: [0, 6], 1: [1, 5], 2: [2, 4], 3: [3]})
And here is a Numpythonic way using a dictionary comprehension and numpy.where():
>>> {i: np.where(arr == i)[0] for i in np.unique(arr)}
{0: array([0, 6]), 1: array([1, 5]), 2: array([2, 4]), 3: array([3])}
And here is a pure Numpythonic approach if you don't want to involve the dictionary:
>>> uniq = np.unique(arr)
>>> args, indices = np.where((np.tile(arr, len(uniq)).reshape(len(uniq), len(arr)) == np.vstack(uniq)))
>>> np.split(indices, np.where(np.diff(args))[0] + 1)
[array([0, 6]), array([1, 5]), array([2, 4]), array([3])]

I don't know numpy but you could definitely do this in one iteration, with a defaultdict:
from collections import defaultdict
indices = defaultdict(list)
for i, val in enumerate(arr):
    indices[val].append(i)

Fully vectorized solution using the numpy_indexed package:
import numpy_indexed as npi
k, idx = npi.group_by(arr, np.arange(len(arr)))
On a higher level; why do you need these indices? Subsequent grouped-operations can usually be computed much more efficiently using the group_by functionality [eg, npi.group_by(arr).mean(someotherarray)], without explicitly computing the indices of the keys.
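To illustrate the point about grouped operations without the package, here is a NumPy-only sketch of a per-group mean. It assumes the labels are non-negative integers that each occur at least once, as in the question; `values` is made-up data.

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 2, 1, 0])                    # group labels in 0..N
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])   # hypothetical data to aggregate

# Sum of `values` per label, divided by the number of items per label.
group_sums = np.bincount(arr, weights=values)
group_counts = np.bincount(arr)
group_means = group_sums / group_counts
```

No per-group index arrays are ever materialized here, which is the saving the answer is pointing at.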

Related

What is the alternative for numpy bincount when using negative integers? [duplicate]

Suppose I have the following NumPy array:
a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
How can I find the most frequent number in this array?

If your list contains only non-negative ints, you should take a look at numpy.bincount:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html
and then probably use np.argmax:
a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print(np.argmax(counts))
For a more complicated list (that perhaps contains negative numbers or non-integer values), you can use np.histogram in a similar way. Alternatively, if you just want to work in python without using numpy, collections.Counter is a good way of handling this sort of data.
from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print(b.most_common(1))
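If the data are integers that merely include negative values, one possible workaround (a sketch, not something the answer itself proposes) is to shift the array by its minimum before calling np.bincount. The helper name `most_frequent_int` is hypothetical, and the input is assumed to be a non-empty integer array.

```python
import numpy as np

def most_frequent_int(a):
    """Most frequent value in a non-empty integer array that may contain negatives."""
    a = np.asarray(a)
    offset = a.min()
    counts = np.bincount(a - offset)   # shift so the smallest value maps to 0
    return int(np.argmax(counts) + offset)

mode_value = most_frequent_int([-3, -1, -3, 2, -1, -3])
```

The counts array has one slot per integer between the minimum and the maximum, so this only makes sense when that range is reasonably small.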

You may use
values, counts = np.unique(a, return_counts=True)
ind = np.argmax(counts)
print(values[ind]) # prints the most frequent element
ind = np.argpartition(-counts, kth=10)[:10]
print(values[ind]) # prints the 10 most frequent elements
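If several elements are tied for the highest count, np.argmax returns only the first of them. Note that the argpartition line needs more than 10 distinct values, because kth must be smaller than len(counts).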

If you're willing to use SciPy:
>>> from scipy.stats import mode
>>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
(array([ 1.]), array([ 6.]))
>>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
>>> most_frequent
1.0

Performances (using iPython) for some solutions found here:
>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>>
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>>
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>>
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>>
>>> from collections import defaultdict
>>> def jjc(l):
...     d = defaultdict(int)
...     for i in l:
...         d[i] += 1
...     return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
...
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>>
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>>
Best is 'max' with 'set' for small arrays like the problem.
According to @David Sanders, if you increase the array size to something like 100,000 elements, the "max w/set" algorithm ends up being the worst by far whereas the "numpy bincount" method is the best.

Starting in Python 3.4, the standard library includes the statistics.mode function to return the single most common data point.
from statistics import mode
mode([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
# 1
If there are multiple modes with the same frequency, statistics.mode returns the first one encountered (as of Python 3.8; earlier versions raise StatisticsError in that case).
Starting in Python 3.8, the statistics.multimode function returns a list of the most frequently occurring values in the order they were first encountered:
from statistics import multimode
multimode([1, 2, 3, 1, 2])
# [1, 2]

Also if you want to get most frequent value(positive or negative) without loading any modules you can use the following code:
lVals = [1,2,3,1,2,1,1,1,3,2,2,1]
print(max(map(lambda val: (lVals.count(val), val), set(lVals))))

While most of the answers above are useful, in case you:
1) need it to support non-positive-integer values (e.g. floats or negative integers ;-)), and
2) aren't on Python 2.7 (which collections.Counter requires), and
3) prefer not to add the dependency of scipy (or even numpy) to your code, then a purely python 2.6 solution that is O(nlogn) (i.e., efficient) is just this:
from collections import defaultdict
a = [1,2,3,1,2,1,1,1,3,2,2,1]
d = defaultdict(int)
for i in a:
    d[i] += 1
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]

In Python 3 the following should work:
max(set(a), key=lambda x: a.count(x))

I like the solution by JoshAdel.
But there is just one catch.
The np.bincount() solution only works on numbers.
If you have strings, collections.Counter solution will work for you.
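A short sketch of that string case, using made-up data:

```python
from collections import Counter

words = ["pear", "apple", "pear", "fig", "apple", "pear"]

# most_common(1) returns a one-element list of (value, count) pairs.
value, count = Counter(words).most_common(1)[0]
```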

Here is a general solution that may be applied along an axis, regardless of values, using purely numpy. I've also found that this is much faster than scipy.stats.mode if there are a lot of unique values.
import numpy
def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except (IndexError, TypeError):
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))
    # If array is 1-D and numpy version is >= 1.9 numpy.unique will suffice
    version = tuple(int(part) for part in numpy.__version__.split('.')[:2])
    if ndim == 1 and version >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]
    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts (index with a tuple, not a list)
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1
    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[tuple(index)], counts[tuple(index)]

Expanding on this method, applied to finding the mode of the data where you may need the index of the actual array to see how far away the value is from the center of the distribution.
(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]
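Note that np.argmax returns a single index, so it cannot signal a tie. If you want to discard the mode when it is not unique, check whether np.count_nonzero(counts == counts.max()) > 1.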

You can use the following approach:
x = np.array([[2, 5, 5, 2], [2, 7, 8, 5], [2, 5, 7, 9]])
u, c = np.unique(x, return_counts=True)
print(u[c == np.amax(c)])
This will print [2 5], since both values occur four times.

Using np.bincount together with np.argmax gives the most common value in a numpy array of non-negative integers. If your array is an image array, use np.ravel or the array's flatten() method to convert the ndarray to a 1-dimensional array first.

I recently used collections.Counter in a project, and it was painful.
In my opinion Counter performs very poorly. It is just a class wrapping dict().
What's worse, if you profile its methods with cProfile, you will see a lot of time spent in '__missing__' and '__instancecheck__'.
Be careful with most_common(): every call performs a sort, which makes it extremely slow. most_common(x) uses a heap-based selection instead, which is also slow.
Btw, numpy's bincount also has a problem: np.bincount([1,2,4000000]) returns an array with 4000001 elements.
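A small sketch of that memory issue, with np.unique as one possible alternative for sparse values (the sample array is made up):

```python
import numpy as np

a = np.array([1, 2, 4000000, 2])

# bincount allocates one slot per integer from 0 up to the maximum value.
dense_counts = np.bincount(a)

# unique only stores the values that actually occur.
values, counts = np.unique(a, return_counts=True)
most_common = values[np.argmax(counts)]
```

np.unique sorts the data, so it trades the large allocation for O(n log n) time.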

Other methods to derive index values from an array

Given an example NumPy array a such as
array([[1, A, 3.00, 4, 5],
[2, B, 4.00, 5, 6],
[3, C, 5.00, 6, 7],
[3, D, 6.00, 7, 8],
[3, E, 7.00, 8, 9]])
my goal is to find the indices where the value 3 occurs in the first column, and select the very last index value.
I can think of two different methods of collecting the index values in a list.
SOLUTION 1: Use a for loop
indx = []
for i in range(len(a)):
    if int(a[i,0]) == int(3):
        indx.append(i)
indx = indx[-1]
SOLUTION 2: Use NumPy where
indx = np.where(a[:,0] == 3)
indx = indx[0]
indx = indx[-1]
However, I have a tendency to find better methods to solving problems, and that actually helps me learn more. Given such a problem, does anyone know of any other solution that I am not aware of? Thanks in advance!

There are 2 reasons why your solutions are inefficient for your task:
1) Using your for loop, you search from first to last, instead of last to first. In addition, you are unnecessarily building a list.
2) For numpy.where, you retrieve all the indices before you select the final one.
You can resolve these 2 issues via a custom function which searches from last to first. In addition, you can improve performance via JIT-compiling.
from numba import jit
import numpy as np
arr = np.random.randint(0, 9, 100000)
@jit(nopython=True)
def indexer(arr, item):
    for idx, val in enumerate(arr[::-1]):
        if val == item:
            return len(arr) - idx - 1
%timeit indexer(arr, 5) # 2.52 µs
%timeit np.where(arr==5)[0][-1] # 454 µs
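If numba is not an option, a NumPy-only variant can at least avoid building the full index array. This is a sketch under that assumption; the helper name `last_index` is hypothetical, and unlike the JIT-compiled loop it does not stop early.

```python
import numpy as np

def last_index(arr, item):
    """Index of the last occurrence of item in a 1-D array, or -1 if absent."""
    arr = np.asarray(arr)
    matches = arr[::-1] == item          # still scans the whole array
    if not matches.any():
        return -1
    # argmax returns the position of the first True in the reversed view.
    return arr.size - 1 - int(np.argmax(matches))

col = np.array([1, 2, 3, 3, 3])
```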

Unless there is a simpler built-in that I have not considered, the simplest method would be to reverse the first layer of the array and search for the first occurrence.
reversed_a = a[::-1]
for i, item in enumerate(reversed_a):
    if int(item[0]) == 3:
        break
indx = len(a) - i - 1

Python: How to index the elements of a numpy array?

I'm looking for a function that would do what the function indices does in the following hypothetical code:
indices( numpy.array([[1, 2, 3], [2, 3, 4]]) )
{1: [(0,0)], 2: [(0,1),(1,0)], 3: [(0,2),(1,1)], 4: [(1,2)]}
Specifically, I want to produce a dictionary whose keys are the unique elements in the flattened array and whose values are lists of the full indices of the respective key.
I've looked at the where function, but it does not seem to provide an efficient way to solve this for large arrays. What's the best way to do this?
Notes: I'm using Python 2.7

Given that your desired output is a dictionary, I don't think there's going to be an efficient way to do this with NumPy operations. Your best bet will probably be something like
import collections
import itertools
d = collections.defaultdict(list)
for indices in itertools.product(*map(range, a.shape)):
    d[a[indices]].append(indices)

The numpy_indexed package can perform these kinds of grouping operations in an efficient and fully vectorized manner, i.e.:
import numpy_indexed as npi
a = np.array([[1, 2, 3], [2, 3, 4]])
keys, values = npi.group_by(a.reshape(-1), np.indices(a.shape).reshape(a.ndim, -1).T)
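For comparison, the same grouping can be sketched with NumPy alone by sorting the flattened array and converting flat positions back to coordinates. The helper name `indices_by_value_nd` is hypothetical, and a non-empty input is assumed.

```python
import numpy as np

def indices_by_value_nd(a):
    """Map each distinct value of an N-D array to a list of its full index tuples."""
    a = np.asarray(a)
    flat = a.ravel()
    order = np.argsort(flat, kind="stable")
    sorted_vals = flat[order]
    # Boundaries between runs of equal values in the sorted data.
    cuts = np.flatnonzero(sorted_vals[1:] != sorted_vals[:-1]) + 1
    # Convert flat positions back to N-D coordinates, one row per element.
    coords = np.column_stack(np.unravel_index(order, a.shape))
    keys = sorted_vals[np.r_[0, cuts]].tolist()
    groups = np.split(coords, cuts)
    return {k: [tuple(row) for row in g.tolist()] for k, g in zip(keys, groups)}

out = indices_by_value_nd(np.array([[1, 2, 3], [2, 3, 4]]))
```

Building the final dictionary of tuples is still a Python-level loop, so the vectorized part only covers the grouping itself.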

I don't know about numpy, but this is an example solution if just using arrays:
arrs = [[1, 2, 3], [2, 3, 4]]
positions = {}  # named to avoid shadowing the built-in dict
for i in range(0, len(arrs)):
    arr = arrs[i]
    for j in range(0, len(arr)):
        num = arr[j]
        indices = positions.get(num)
        if indices is None:
            positions[num] = [(i, j)]
        else:
            positions[num].append((i, j))

How to tell when more than one index matches?

I have an array of values that are often the same, and I am trying to find the index of the smallest one. But I want to know all the indices that share that smallest value.
So for example I have the array a = [1, 2, 3, 4] and to find the index of the smallest one I use a.index(min(a)) and this returns 0. But if I had an array of a = [1, 1, 1, 1], using the same thing would still return 0.
I want to know that multiple indices match what I am searching for and what those indices are. How would I go about doing this?

list.index(value) returns the index of the first occurrence of value in list.
A better idea is to use a simple list comprehension and enumerate:
indices = [i for i, x in enumerate(iterable) if x == v]
where v is the value you want to search for and iterable is an object that supports iterator protocol e.g. it can be a generator or a sequence (like list).
For your specific use case, that'll look like
def smallest(seq):
    m = min(seq)
    return [i for i, x in enumerate(seq) if x == m]
Some examples:
In [23]: smallest([1, 2, 3, 4])
Out[23]: [0]
In [24]: smallest([1, 1, 1, 1])
Out[24]: [0, 1, 2, 3]
If you're not sure whether the seq is empty or not, you can pass the default=-1 (or some other value) argument to min function (in Python 3.4+):
m = min(seq, default=-1)
Consider using m = min(seq or (-1,)) (again, any value) instead, when using older Python.
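A variant that sidesteps the sentinel value altogether is sketched below. The name `smallest_indices` is hypothetical; copying the input into a list is an added assumption so that one-shot iterators (which min would otherwise exhaust) also work.

```python
def smallest_indices(seq):
    """Indices of every minimal element; an empty input gives an empty list."""
    seq = list(seq)            # allow generators as well as sequences
    if not seq:
        return []
    m = min(seq)
    return [i for i, x in enumerate(seq) if x == m]
```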

A different approach using numpy.where could look like
In [1]: import numpy as np
In [2]: def np_smallest(seq):
   ...:     return np.where(seq==seq.min())[0]
In [3]: np_smallest(np.array([1,1,1,1]))
Out[3]: array([0, 1, 2, 3])
In [4]: np_smallest(np.array([1,2,3,4]))
Out[4]: array([0])
This approach is slightly less efficient than the list comprehension for small lists, but if you face large arrays, numpy may save you some time.
In [5]: seq = np.random.randint(100, size=1000)
In [6]: %timeit np_smallest(seq)
100000 loops, best of 3: 10.1 µs per loop
In [7]: %timeit smallest(seq)
1000 loops, best of 3: 194 µs per loop

Here is my solution:
def all_smallest(seq):
    """Takes sequence, returns list of indices of all smallest elements"""
    min_i = min(seq)
    amount = seq.count(min_i)
    ans = []
    if amount > 1:
        for n, i in enumerate(seq):
            if i == min_i:
                ans.append(n)
                if len(ans) == amount:
                    return ans
    return [seq.index(min_i)]
The code is straightforward, so I think it is clear without further explanation.

Is there a NumPy function to return the first index of something in an array?

I know there is a method for a Python list to return the first index of something:
>>> xs = [1, 2, 3]
>>> xs.index(2)
1
Is there something like that for NumPy arrays?

Yes, given an array, array, and a value, item to search for, you can use np.where as:
itemindex = numpy.where(array == item)
The result is a tuple with first all the row indices, then all the column indices.
For example, if an array is two dimensions and it contained your item at two locations then
array[itemindex[0][0]][itemindex[1][0]]
would be equal to your item and so would be:
array[itemindex[0][1]][itemindex[1][1]]
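A concrete sketch of that tuple layout on a small, made-up 2-D array:

```python
import numpy as np

grid = np.array([[1, 7, 3],
                 [7, 5, 7]])

rows, cols = np.where(grid == 7)                  # one index array per dimension
positions = list(zip(rows.tolist(), cols.tolist()))
first = positions[0] if positions else None       # None when nothing matches
```

Matches come back in row-major order, so `first` is the match with the smallest row index, then the smallest column index.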

If you need the index of the first occurrence of only one value, you can use nonzero (or where, which amounts to the same thing in this case):
>>> t = array([1, 1, 1, 2, 2, 3, 8, 3, 8, 8])
>>> nonzero(t == 8)
(array([6, 8, 9]),)
>>> nonzero(t == 8)[0][0]
6
If you need the first index of each of many values, you could obviously do the same as above repeatedly, but there is a trick that may be faster. The following finds the indices of the first element of each subsequence:
>>> nonzero(r_[1, diff(t)])
(array([0, 3, 5, 6, 7, 8]),)
Notice that it finds the beginning of both subsequences of 3s and both subsequences of 8s:
[1, 1, 1, 2, 2, 3, 8, 3, 8, 8]
So it's slightly different from finding the first occurrence of each value. In your program, you may be able to work with a sorted version of t to get what you want:
>>> st = sorted(t)
>>> nonzero(r_[1, diff(st)])
(array([0, 3, 5, 7]),)
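If what you want is the first occurrence of each value in the original, unsorted order, np.unique with return_index=True is one way to get it directly. A minimal sketch using the same t:

```python
import numpy as np

t = np.array([1, 1, 1, 2, 2, 3, 8, 3, 8, 8])

# return_index gives, for each distinct value, the index of its first occurrence in t.
values, first_idx = np.unique(t, return_index=True)
first_occurrence = dict(zip(values.tolist(), first_idx.tolist()))
```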

You can also convert a NumPy array to a list on the fly and get its index. For example,
l = [1,2,3,4,5] # Python list
a = numpy.array(l) # NumPy array
i = a.tolist().index(2) # i is the index of 2
print(i)
It will print 1.

Just to add a very performant and handy numba alternative based on np.ndenumerate to find the first index:
from numba import njit
import numpy as np
@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    # If no item was found return None, other return types might be a problem due to
    # numbas type inference.
This is pretty fast and deals naturally with multidimensional arrays:
>>> arr1 = np.ones((100, 100, 100))
>>> arr1[2, 2, 2] = 2
>>> index(arr1, 2)
(2, 2, 2)
>>> arr2 = np.ones(20)
>>> arr2[5] = 2
>>> index(arr2, 2)
(5,)
This can be much faster (because it's short-circuiting the operation) than any approach using np.where or np.nonzero.
However np.argwhere could also deal gracefully with multidimensional arrays (you would need to manually cast it to a tuple and it's not short-circuited) but it would fail if no match is found:
>>> tuple(np.argwhere(arr1 == 2)[0])
(2, 2, 2)
>>> tuple(np.argwhere(arr2 == 2)[0])
(5,)

l.index(x) returns the smallest i such that i is the index of the first occurrence of x in the list.
One can safely assume that the index() function in Python is implemented so that it stops after finding the first match, and this results in an optimal average performance.
For finding an element stopping after the first match in a NumPy array use an iterator (ndenumerate).
In [67]: l=range(100)
In [68]: l.index(2)
Out[68]: 2
NumPy array:
In [69]: a = np.arange(100)
In [70]: next((idx for idx, val in np.ndenumerate(a) if val==2))
Out[70]: (2L,)
Note that both methods index() and next return an error if the element is not found. With next, one can use a second argument to return a special value in case the element is not found, e.g.
In [77]: next((idx for idx, val in np.ndenumerate(a) if val==400),None)
There are other functions in NumPy (argmax, where, and nonzero) that can be used to find an element in an array, but they all have the drawback of going through the whole array looking for all occurrences, thus not being optimized for finding the first element. Note also that where and nonzero return arrays, so you need to select the first element to get the index.
In [71]: np.argmax(a==2)
Out[71]: 2
In [72]: np.where(a==2)
Out[72]: (array([2], dtype=int64),)
In [73]: np.nonzero(a==2)
Out[73]: (array([2], dtype=int64),)
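One caveat with the argmax form is that it returns 0 when the value is absent, which is indistinguishable from a match at index 0. A guarded version might look like the sketch below; the helper name `first_index_or_none` is hypothetical, and a non-empty 1-D array is assumed.

```python
import numpy as np

a = np.arange(100)

missing = np.argmax(a == 400)          # no True anywhere, yet the result is 0

def first_index_or_none(arr, value):
    """First index of value in a non-empty 1-D array, or None if it does not occur."""
    mask = np.asarray(arr) == value
    idx = int(np.argmax(mask))
    # If the mask is all False, argmax points at a False entry.
    return idx if mask[idx] else None
```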
Time comparison
Just checking that for large arrays the solution using an iterator is faster when the searched item is at the beginning of the array (using %timeit in the IPython shell):
In [285]: a = np.arange(100000)
In [286]: %timeit next((idx for idx, val in np.ndenumerate(a) if val==0))
100000 loops, best of 3: 17.6 µs per loop
In [287]: %timeit np.argmax(a==0)
1000 loops, best of 3: 254 µs per loop
In [288]: %timeit np.where(a==0)[0][0]
1000 loops, best of 3: 314 µs per loop
This is an open NumPy GitHub issue.
See also: Numpy: find first index of value fast

If you're going to use this as an index into something else, you can use boolean indices if the arrays are broadcastable; you don't need explicit indices. The absolute simplest way to do this is to simply index based on a truth value.
other_array[first_array == item]
Any boolean operation works:
a = numpy.arange(100)
other_array[first_array > 50]
The nonzero method takes booleans, too:
index = numpy.nonzero(first_array == item)[0][0]
The two zeros are for the tuple of indices (assuming first_array is 1D) and then the first item in the array of indices.

For one-dimensional sorted arrays, it is much simpler and more efficient, O(log(n)), to use numpy.searchsorted, which returns a NumPy integer (position). For example,
arr = np.array([1, 1, 1, 2, 3, 3, 4])
i = np.searchsorted(arr, 3)
Just make sure the array is already sorted.
Also check whether the returned index i actually contains the searched element, since searchsorted's main objective is to find indices where elements should be inserted to maintain order. The index can equal len(arr) when the value is larger than every element, so check the bound first.
if i < len(arr) and arr[i] == 3:
    print("present")
else:
    print("not present")
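Since this whole page is about finding all indices of a value, it may be worth noting that two searchsorted calls give the full range of occurrences in a sorted array. A sketch, with the hypothetical helper name `occurrence_range`:

```python
import numpy as np

arr = np.array([1, 1, 1, 2, 3, 3, 4])   # must already be sorted

def occurrence_range(sorted_arr, value):
    """Half-open index range [lo, hi) of value in a sorted 1-D array."""
    lo = int(np.searchsorted(sorted_arr, value, side="left"))
    hi = int(np.searchsorted(sorted_arr, value, side="right"))
    return lo, hi   # lo == hi means the value is absent
```

With this form no separate bounds check is needed, because an absent value simply yields an empty range.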

For 1D arrays, I'd recommend np.flatnonzero(array == value)[0], which is equivalent to both np.nonzero(array == value)[0][0] and np.where(array == value)[0][0] but avoids the ugliness of unboxing a 1-element tuple.

To index on any criteria, you can do something like the following:
In [1]: from numpy import *
In [2]: x = arange(125).reshape((5,5,5))
In [3]: y = indices(x.shape)
In [4]: locs = y[:,x >= 120] # put whatever you want in place of x >= 120
In [5]: pts = hsplit(locs, len(locs[0]))
In [6]: for pt in pts:
   .....:     print(', '.join(str(p[0]) for p in pt))
4, 4, 0
4, 4, 1
4, 4, 2
4, 4, 3
4, 4, 4
And here's a quick function to do what list.index() does, except doesn't raise an exception if it's not found. Beware -- this is probably very slow on large arrays. You can probably monkey patch this on to arrays if you'd rather use it as a method.
def ndindex(ndarray, item):
    if len(ndarray.shape) == 1:
        try:
            return [ndarray.tolist().index(item)]
        except:
            pass
    else:
        for i, subarray in enumerate(ndarray):
            try:
                return [i] + ndindex(subarray, item)
            except:
                pass
In [1]: ndindex(x, 103)
Out[1]: [4, 0, 3]

An alternative to selecting the first element from np.where() is to use a generator expression together with enumerate, such as:
>>> import numpy as np
>>> x = np.arange(100) # x = array([0, 1, 2, 3, ... 99])
>>> next(i for i, x_i in enumerate(x) if x_i == 2)
2
For a two dimensional array one would do:
>>> x = np.arange(100).reshape(10,10) # x = array([[0, 1, 2,... 9], [10,..19],])
>>> next((i,j) for i, x_i in enumerate(x)
... for j, x_ij in enumerate(x_i) if x_ij == 2)
(0, 2)
The advantage of this approach is that it stops checking the elements of the array after the first match is found, whereas np.where checks all elements for a match. A generator expression would be faster if there's a match early in the array.

There are lots of operations in NumPy that could perhaps be put together to accomplish this. This will return indices of elements equal to item:
numpy.nonzero(array == item)
You could then take the first element of each returned index array to get a single element.

Comparison of 8 methods
TL;DR:
(Note: applicable to 1d arrays under 100M elements.)
For maximum performance use index_of__v5 (numba + numpy.ndenumerate + for loop; see the code below).
If numba is not available:
Use index_of__v7 (for loop + enumerate) if the target value is expected to be found within the first 100k elements.
Else use index_of__v2/v3/v4 (numpy.argmax or numpy.flatnonzero based).
Powered by perfplot
import numpy as np
from numba import njit
# Based on: numpy.argmax()
# Proposed by: John Haberstroh (https://stackoverflow.com/a/67497472/7204581)
def index_of__v1(arr: np.ndarray, v):
    is_v = (arr == v)
    return is_v.argmax() if is_v.any() else -1
# Based on: numpy.argmax()
def index_of__v2(arr: np.ndarray, v):
    return (arr == v).argmax() if v in arr else -1
# Based on: numpy.flatnonzero()
# Proposed by: 1'' (https://stackoverflow.com/a/42049655/7204581)
def index_of__v3(arr: np.ndarray, v):
    idxs = np.flatnonzero(arr == v)
    return idxs[0] if len(idxs) > 0 else -1
# Based on: numpy.argmax()
def index_of__v4(arr: np.ndarray, v):
    return np.r_[False, (arr == v)].argmax() - 1
# Based on: numba, for loop
# Proposed by: MSeifert (https://stackoverflow.com/a/41578614/7204581)
@njit
def index_of__v5(arr: np.ndarray, v):
    for idx, val in np.ndenumerate(arr):
        if val == v:
            return idx[0]
    return -1
# Based on: numpy.ndenumerate(), for loop
def index_of__v6(arr: np.ndarray, v):
    return next((idx[0] for idx, val in np.ndenumerate(arr) if val == v), -1)
# Based on: enumerate(), for loop
# Proposed by: Noyer282 (https://stackoverflow.com/a/40426159/7204581)
def index_of__v7(arr: np.ndarray, v):
    return next((idx for idx, val in enumerate(arr) if val == v), -1)
# Based on: list.index()
# Proposed by: Hima (https://stackoverflow.com/a/23994923/7204581)
def index_of__v8(arr: np.ndarray, v):
    l = list(arr)
    try:
        return l.index(v)
    except ValueError:
        return -1
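A further NumPy-only option, which is not one of the eight benchmarked methods, is to scan the array block by block so that the search can stop early without numba. This is an untimed sketch; the name `index_of_chunked` and the default `chunk_size` are assumptions.

```python
import numpy as np

def index_of_chunked(arr, v, chunk_size=65536):
    """First index of v in a 1-D array, scanning one block at a time; -1 if absent."""
    for start in range(0, len(arr), chunk_size):
        # Only this block is compared, so an early match skips the rest of the array.
        hits = np.flatnonzero(arr[start:start + chunk_size] == v)
        if hits.size:
            return start + int(hits[0])
    return -1

data = np.arange(200000)
```

The block size trades per-call overhead against wasted work after a match; a suitable value would need to be measured on the actual data.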

The numpy_indexed package (disclaimer, I am its author) contains a vectorized equivalent of list.index for numpy.ndarray; that is:
sequence_of_arrays = [[0, 1], [1, 2], [-5, 0]]
arrays_to_query = [[-5, 0], [1, 0]]
import numpy_indexed as npi
idx = npi.indices(sequence_of_arrays, arrays_to_query, missing=-1)
print(idx) # [2, -1]
This solution has vectorized performance, generalizes to ndarrays, and has various ways of dealing with missing values.

There is a fairly idiomatic and vectorized way to do this built into numpy. It uses a quirk of the np.argmax() function to accomplish this -- if many values match, it returns the index of the first match. The trick is that for booleans, there will only ever be two values: True (1) and False (0). Therefore, the returned index will be that of the first True.
For the simple example provided, you can see it work with the following
>>> np.argmax(np.array([1,2,3]) == 2)
1
A great example is computing buckets, e.g. for categorizing. Let's say you have an array of cut points, and you want the "bucket" that corresponds to each element of your array. The algorithm is to compute the first index of cuts where x < cuts (after padding cuts with np.inf). I could use broadcasting to broadcast the comparisons, then apply argmax along the cuts-broadcasted axis.
>>> cuts = np.array([10, 50, 100])
>>> cuts_pad = np.array([*cuts, np.inf])
>>> x = np.array([7, 11, 80, 443])
>>> bins = np.argmax( x[:, np.newaxis] < cuts_pad[np.newaxis, :], axis = 1)
>>> print(bins)
[0 1 2 3]
As expected, each value from x falls into one of the sequential bins, with well-defined and easy to specify edge case behavior.
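For sorted cut points, np.searchsorted with side="right" should give the same buckets without building the broadcast comparison matrix. The sketch below checks the two formulations against each other; the extra sample value 10 is added to exercise the boundary case.

```python
import numpy as np

cuts = np.array([10, 50, 100])   # must be sorted in ascending order
x = np.array([7, 11, 80, 443, 10])

# side="right" returns the first position i where x < cuts[i], or len(cuts) if none.
bins = np.searchsorted(cuts, x, side="right")

# The broadcast-and-argmax formulation, for comparison.
cuts_pad = np.append(cuts, np.inf)
bins_argmax = np.argmax(x[:, np.newaxis] < cuts_pad[np.newaxis, :], axis=1)
```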

Another option not previously mentioned is the bisect module, which also works on lists, but requires a pre-sorted list/array:
import bisect
import numpy as np
z = np.array([104,113,120,122,126,138])
bisect.bisect_left(z, 122)
yields
3
bisect also returns a result when the number you're looking for doesn't exist in the array, so that the number can be inserted in the correct place.
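Because of that, a membership check is needed before treating the result as the index of an existing element. A minimal sketch, with the hypothetical helper name `sorted_index`:

```python
import bisect

def sorted_index(seq, value):
    """Index of the first occurrence of value in a sorted sequence, or -1."""
    i = bisect.bisect_left(seq, value)
    # bisect_left returns an insertion point, which may be len(seq).
    if i < len(seq) and seq[i] == value:
        return i
    return -1

z = [104, 113, 120, 122, 126, 138]
```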

Note: this is for python 2.7 version
You can use a lambda function to deal with the problem, and it works both on NumPy array and list.
your_list = [11, 22, 23, 44, 55]
result = filter(lambda x:your_list[x]>30, range(len(your_list)))
#result: [3, 4]
import numpy as np
your_numpy_array = np.array([11, 22, 23, 44, 55])
result = filter(lambda x:your_numpy_array[x]>30, range(len(your_numpy_array)))
#result: [3, 4]
And you can use
result[0]
to get the first index of the filtered elements.
For Python 3, where filter returns an iterator instead of a list, use
list(result)
instead of
result

Use ndindex
Sample array
arr = np.array([[1,4],
[2,3]])
print(arr)
...[[1,4],
[2,3]]
create an empty list to store the index and the element tuples
index_elements = []
for i in np.ndindex(arr.shape):
    index_elements.append((arr[i],i))
convert the list of tuples into dictionary
index_elements = dict(index_elements)
The keys are the elements and the values are their
indices - use keys to access the index
index_elements[4]
output
... (0,1)

For my use case, I could not sort the array ahead of time because the order of the elements is important. This is my all-NumPy implementation:
import numpy as np
# The array in question
arr = np.array([1,2,1,2,1,5,5,3,5,9])
# Find all of the present values
vals=np.unique(arr)
# Make all indices up-to and including the desired index positive
cum_sum=np.cumsum(arr==vals.reshape(-1,1),axis=1)
# Add zeros to account for the n-1 shape of diff and the all-positive array of the first index
bl_mask=np.concatenate([np.zeros((cum_sum.shape[0],1)),cum_sum],axis=1)>=1
# The desired indices
idx=np.where(np.diff(bl_mask))[1]
# Show results
print(list(zip(vals,idx)))
>>> [(1, 0), (2, 1), (3, 7), (5, 5), (9, 9)]
I believe it accounts for unsorted arrays with duplicate values.

Found another solution with loops:
new_array_of_indicies = []
for i in range(len(some_array)):
    if some_array[i] == some_value:
        new_array_of_indicies.append(i)

index_lst_form_numpy = pd.DataFrame(df).reset_index()["index"].tolist()
