I am attempting to sort a NumPy array by frequency of elements. For example, given the array [3,4,5,1,2,4,1,1,2,4], the output would be another NumPy array sorted from most common to least common elements (no duplicates), so the solution would be [4,1,2,3,5]. If two elements have the same number of occurrences, the element that appears first is placed first in the output. I have tried doing this, but I can't seem to get a functional answer. Here is my code so far:
temp1 = problems[j]
indexes = np.unique(temp1, return_index = True)[1]
temp2 = temp1[np.sort(indexes)]
temp3 = np.unique(temp1, return_counts = True)[1]
temp4 = np.argsort(temp3)[::-1] + 1
where problems[j] is a NumPy array like [3,4,5,1,2,4,1,1,2,4]. temp4 returns [4,1,2,5,3] so far but it is not correct because it can't handle when two elements have the same number of occurrences.
You can use argsort on the frequency of each element to find the sorted positions, then apply those indexes to the array of unique elements:
unique_elements, frequency = np.unique(array, return_counts=True)
sorted_indexes = np.argsort(frequency)[::-1]
sorted_by_freq = unique_elements[sorted_indexes]
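Note that reversing a plain argsort does not break ties by first appearance, which the question asks for. A minimal sketch of one way to get that behaviour, assuming np.lexsort with the first-occurrence index as the secondary key (the function name sort_by_frequency_stable is made up for illustration):

```python
import numpy as np

def sort_by_frequency_stable(arr):
    arr = np.asarray(arr)
    # values are sorted; first_idx[i] is where values[i] first appears in arr
    values, first_idx, counts = np.unique(arr, return_index=True, return_counts=True)
    # lexsort uses the LAST key as the primary key:
    # primary = descending count, secondary = ascending first occurrence
    order = np.lexsort((first_idx, -counts))
    return values[order]

result = sort_by_frequency_stable([3, 4, 5, 1, 2, 4, 1, 1, 2, 4])
print(result)  # [4 1 2 3 5]
```

Here 4 and 1 both occur three times, and 4 comes first because it appears earlier in the input.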
A non-NumPy solution, which does still work with NumPy arrays, is to use an OrderedCounter followed by sorted with a custom function:
from collections import OrderedDict, Counter
class OrderedCounter(Counter, OrderedDict):
    pass
L = [3,4,5,1,2,4,1,1,2,4]
c = OrderedCounter(L)
keys = list(c)
res = sorted(c, key=lambda x: (-c[x], keys.index(x)))
print(res)
[4, 1, 2, 3, 5]
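On Python 3.7 or later, a plain Counter keeps insertion order and most_common() is documented to order equal counts by first encounter, so the subclass may not be needed. A short sketch under that version assumption:

```python
from collections import Counter

L = [3, 4, 5, 1, 2, 4, 1, 1, 2, 4]
# most_common() sorts by count (descending); ties keep first-seen order
res = [value for value, _ in Counter(L).most_common()]
print(res)  # [4, 1, 2, 3, 5]
```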
If the values are integer and small, or you only care about bins of size 1:
def sort_by_frequency(arr):
    return np.flip(np.argsort(np.bincount(arr))[-(np.unique(arr).size):])
v = [1,1,1,1,1,2,2,9,3,3,3,3,7,8,8]
sort_by_frequency(v)
this should yield
array([1, 3, 8, 2, 9, 7])
Using zip and itemgetter should help:
from operator import itemgetter
import numpy as np
temp1 = problems[j]
temp, idx, cnt = np.unique(temp1, return_index = True, return_counts=True)
cnt = 1 / cnt
k = sorted(zip(temp, cnt, idx), key=itemgetter(1, 2))
print(next(zip(*k)))
You can count up the number of each element in the array, and then use it as a key to the built-in sorted function:
def sortbyfreq(arr):
    # arr must be a list here: list.count and list.index are used
    s = set(arr)
    keys = {n: (-arr.count(n), arr.index(n)) for n in s}
    return sorted(list(s), key=lambda n: keys[n])
Related
Pandas groupby "ngroup" function tags each group in "group" order.
I'm looking for similar behaviour but need the assigned tags to be in original (index) order, how can I do so efficiently (this will happen often with large arrays) in pandas and numpy?
> df = pd.DataFrame(
{"A": [9,8,7,8,9]},
index=list("abcde"))
A
a 9
b 8
c 7
d 8
e 9
> df.groupby("A").ngroup()
a 2
b 1
c 0
d 1
e 2
# LOOKING FOR ###################
a 0
b 1
c 2
d 1
e 0
How can I achieve the desired output with a single dimension numpy array?
arr = np.array([9,8,7,8 ,9])
# looking for [0,1,2,1,0]
Perhaps a better way is factorize:
df['A'].factorize()[0]
Output:
array([0, 1, 2, 1, 0])
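For the plain NumPy array in the question, the top-level pandas.factorize function can be called on the array directly. A minimal sketch, assuming pandas is acceptable as a dependency:

```python
import numpy as np
import pandas as pd

arr = np.array([9, 8, 7, 8, 9])
# codes number the values in order of first appearance; uniques lists them
codes, uniques = pd.factorize(arr)
print(codes)    # [0 1 2 1 0]
print(uniques)  # [9 8 7]
```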
You can use np.unique -
In [105]: a = np.array([9,8,7,8,9])
In [106]: u,idx,tags = np.unique(a, return_index=True, return_inverse=True)
In [107]: idx.argsort().argsort()[tags]
Out[107]: array([0, 1, 2, 1, 0])
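The double argsort can be hard to read, so here is the same idea unrolled on a different input (chosen only for illustration), with each intermediate array spelled out:

```python
import numpy as np

a = np.array([8, 9, 7, 8])
u, idx, tags = np.unique(a, return_index=True, return_inverse=True)
# u    -> [7 8 9]    sorted unique values
# idx  -> [2 0 1]    first position of each unique value in a
# tags -> [1 2 0 1]  index into u for every element of a

# argsort applied twice turns first positions into ranks:
# the unique value seen earliest gets rank 0, the next one rank 1, ...
rank = idx.argsort().argsort()
# rank -> [2 0 1]    7 is seen third, 8 first, 9 second

result = rank[tags]
print(result)  # [0 1 2 0]
```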
You can pass sort=False to groupby():
df.groupby('A', sort=False).ngroup()
a 0
b 1
c 2
d 1
e 0
dtype: int64
As far as I can tell, there isn't a direct equivalent of groupby in numpy. For a pure numpy version, you can use numpy.unique() to get the unique values. numpy.unique() has the option to return the inverse, basically the array of indices that would recreate your input array, but it sorts the unique values first, so the result is the same as using the regular (sorted) pandas.groupby() command.
To get around this, you can capture the index values of the first occurrence of each unique value. Sort the index values and use these as indices into the original array to get the unique values in their original order. Create a dictionary to map between the unique values and the group numbers and then use that dictionary to convert the values in the array to the appropriate group numbers.
import numpy as np
arr = np.array([9, 8, 7, 8, 9])
_, i = np.unique(arr, return_index=True) # get the indexes of the first occurrence of each unique value
groups = arr[np.sort(i)] # sort the indexes and retrieve the values from the array so that they are in the array order
m = {value:ngroup for ngroup, value in enumerate(groups)} # create a mapping of value:groupnumber
np.vectorize(m.get)(arr) # use vectorize to create a new array using m
array([0, 1, 2, 1, 0])
I've benchmarked the suggested solutions:
Turns out that:
— factorize is the fastest for array sizes > 10³
— unique-argsort is the fastest for array sizes < 10³ (but slower by a factor of 10 for larger ones),
— ngroup is always slower, but for array sizes >3*10³ it has roughly the same speed as factorize.
from contextlib import contextmanager
from time import perf_counter as clock
from itertools import count
import numpy as np
import pandas as pd
def f1(s):
    return s.factorize()[0]
def f2(s):
    return s.groupby(s, sort=False).ngroup().values
def f3(s):
    u, idx, tags = np.unique(s.values, return_index=True, return_inverse=True)
    return idx.argsort().argsort()[tags]
@contextmanager
def bench(r):
    # append the elapsed wall-clock time of the with-block to r
    t1 = clock()
    yield
    t2 = clock()
    r.append(t2-t1)
res = []
for i in count():
    n = 2**i
    a = np.random.randint(0, n, n)
    s = pd.Series(a)
    rr = []
    for j in range(5):
        r = []
        with bench(r):
            a1 = f1(s)
        with bench(r):
            a2 = f2(s)
        with bench(r):
            a3 = f3(s)
        rr.append(r)
        if max(r) > 0.5:
            break
    res.append(np.min(rr, axis=0))
    if np.max(rr) > 0.4:
        break
np.save('results.npy', np.array(res))
Let's say I have an arbitrary array np.array([1,2,3,4,5,6]) and another array that maps each element to a group, np.array(['a','b', 'a','c','c', 'b']). I want to separate the elements into three different arrays depending on the label/group given in the second array, so that a,b,c = narray([1,3]), narray([2,6]), narray([4,5]). Is a simple for loop the way to go, or is there some efficient method I'm missing here?
When you write efficient, I assume that what you want here is actually fast.
I will try to discuss briefly asymptotic efficiency.
In this context, we refer to N as the input size and K as the number of unique values.
My approach would be to use a combination of np.argsort() and a custom-built groupby_np() specifically optimized for NumPy inputs:
import numpy as np
def groupby_np(arr, both=True):
    n = len(arr)
    extrema = np.nonzero(arr[:-1] != arr[1:])[0] + 1
    if both:
        last_i = 0
        for i in extrema:
            yield last_i, i
            last_i = i
        yield last_i, n
    else:
        yield 0
        yield from extrema
        yield n
def labeling_groupby_np(values, labels):
    slicing = labels.argsort()
    sorted_labels = labels[slicing]
    sorted_values = values[slicing]
    del slicing
    result = {}
    for i, j in groupby_np(sorted_labels, True):
        result[sorted_labels[i]] = sorted_values[i:j]
    return result
This has complexity O(N log N + K).
The N log N comes from the sorting step and the K comes from the last loop.
The interesting part is that both the N-dependent and the K-dependent steps are fast: the N-dependent part is executed at a low level, and each iteration of the K-dependent loop is O(1) (it only takes a slice).
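As a compact illustration of the same sort-then-slice idea, here is a sketch that lets np.split do the slicing. It is a variation rather than the function benchmarked in this answer; the name split_by_label and the choice of a stable sort are assumptions made for the example:

```python
import numpy as np

def split_by_label(values, labels):
    values = np.asarray(values)
    labels = np.asarray(labels)
    if labels.size == 0:
        return {}
    # stable sort keeps the original order of values inside each group
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    sorted_values = values[order]
    # positions where the sorted label changes mark the group boundaries
    boundaries = np.flatnonzero(sorted_labels[1:] != sorted_labels[:-1]) + 1
    groups = np.split(sorted_values, boundaries)
    starts = np.concatenate(([0], boundaries))
    return {sorted_labels[s].item(): g for s, g in zip(starts, groups)}

out = split_by_label(np.array([1, 2, 3, 4, 5, 6]),
                     np.array(['a', 'b', 'a', 'c', 'c', 'b']))
print(out)
```

For the arrays from the question this gives 'a' -> [1, 3], 'b' -> [2, 6] and 'c' -> [4, 5].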
A solution like the following (very similar to @theEpsilon's answer):
import numpy as np
def labeling_loop(values, labels):
    labeled = {}
    for x, l in zip(values, labels):
        if l not in labeled:
            labeled[l] = [x]
        else:
            labeled[l].append(x)
    return {k: np.array(v) for k, v in labeled.items()}
uses two loops and has O(N + K) complexity. I do not think you can easily avoid the second loop (without a significant speed penalty). As for the first loop, this is executed in Python, which carries a significant speed penalty on its own.
Another possibility is to use np.unique(), which brings the main loop to a lower level. However, this brings other challenges, because once the unique values are extracted, there is no efficient way of extracting the information to construct the arrays you want without some NumPy advanced indexing, which is O(N). The overall complexity of these solutions is then O(K * N), but because the NumPy advanced indexing is done at a lower level, this can lead to a relatively fast solution, although with a worse asymptotic complexity than the alternatives.
Possible implementations include (similar to @AjayVerma's and @AKX's answers):
import numpy as np
def labeling_unique_bool(values, labels):
    return {l: values[l == labels] for l in np.unique(labels)}
import numpy as np
def labeling_unique_nonzero(values, labels):
    return {l: values[np.nonzero(l == labels)] for l in np.unique(labels)}
Additionally, one could consider a pre-sorting step to then speed up the slicing part by avoiding NumPy advanced indexing.
However, the sorting step can be more costly than the advanced indexing, and in general the proposed approach tends to be faster for the input I tested.
import numpy as np
def labeling_unique_argsort(values, labels):
    uniques, counts = np.unique(labels, return_counts=True)
    sorted_values = values[labels.argsort()]
    bound = 0
    result = {}
    for x, c in zip(uniques, counts):
        result[x] = sorted_values[bound:bound + c]
        bound += c
    return result
Another approach, which is neat in principle (same as my proposed approach), but slow in practice would be to use sorting and itertools.groupby():
import itertools
from operator import itemgetter
import numpy as np
def labeling_groupby(values, labels):
    slicing = labels.argsort()
    sorted_labels = labels[slicing]
    sorted_values = values[slicing]
    del slicing
    result = {}
    for x, g in itertools.groupby(zip(sorted_labels, sorted_values), itemgetter(0)):
        result[x] = np.fromiter(map(itemgetter(1), g), dtype=sorted_values.dtype)
    return result
Finally, a Pandas-based approach, which is quite concise and reasonably fast for larger inputs, but under-performing for smaller ones (similar to @Ehsan's answer):
import pandas as pd
def labeling_groupby_pd(values, labels):
    df = pd.DataFrame({'values': values, 'labels': labels})
    return df.groupby('labels').values.apply(lambda x: x.values).to_dict()
Now, talking is cheap, so let us attach some numbers to fast and slow and produce some plots for varying input sizes. The value of K is capped to 52 (lower and upper case letters of the English alphabet). When N is much larger than K, the probability of reaching capping value is very high.
Input is generated programmatically with the following:
import string
import numpy as np
def gen_input(n, p, labels=string.ascii_letters):
    k = len(labels)
    values = np.arange(n)
    # draw from the first int(k * p) labels only, so p caps the value of K
    labels = np.array([labels[i] for i in np.random.randint(0, int(k * p), n)])
    return values, labels
and the benchmarks are produced for values of p from (1.0, 0.5, 0.1, 0.05), which change the maximum value of K. The plots below refer to the p values in that order.
p=1.0 (at most K = 52)
...and zoomed on the fastest
p=0.5 (at most K = 26)
p=0.1 (at most K = 5)
p=0.05 (at most K = 2)
...and zoomed on the fastest
One can see how the proposed method, except for very small inputs, outperforms the other methods proposed so far for the tested inputs.
(Full benchmarks available here).
One may also consider moving some parts of the looping to Numba / Cython, but I'd leave this to the interested reader.
You can use numpy.unique
x = np.array([1,2,3,4,5,6])
y = np.array(['a','b', 'a','c','c', 'b'])
print({value:x[y==value] for value in np.unique(y)})
Output
{'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}
This is a textbook use of pandas groupby:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,5,6],'B':['a','b','a','c','c','b']})
a,b,c = df.groupby('B').A.apply(lambda x:x.values)
#[1 3], [2 6], [4 5]
I'm sure there's some simple invocation to do all in one fell swoop and a Numpy guru will soon enlighten us, but
indices = np.array([1,2,3,4,5,6])
values = np.array(['a', 'b', 'a', 'c', 'c', 'b'])
indices_by_value = {}
for value in np.unique(values):
    indices_by_value[value] = indices[values == value]
will leave you with
{'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}
You can do something like this:
from collections import defaultdict
d = defaultdict(list)
letters = ['a', 'b', 'a', 'c', 'c', 'b']
numbers = [1, 2, 3, 4, 5, 6]
for l, n in zip(letters, numbers):
    d[l].append(n)
And d will have your answer.
Using the mask selection feature of numpy should do the work.
Something like this :
> import numpy as np
> xx = np.array(range(5))
> yy = np.array(['a','b','a','d','e'])
> yy=='a'
array([ True, False, True, False, False])
> xx[(yy=='a')]
array([0, 2])
Consider iterating over the unique values of your label array and building a dictionary of matches incrementally.
I have an array in which I want to find the index of the smallest elements. I have tried the following method:
distance = [2,3,2,5,4,7,6]
a = distance.index(min(distance))
This returns 0, which is the index of the first smallest distance. However, I want to find all such instances, 0 and 2. How can I do this in Python?
Use np.where to get all the indexes that match a given value:
import numpy as np
distance = np.array([2,3,2,5,4,7,6])
np.where(distance == np.min(distance))[0]
Out[1]: array([0, 2])
Numpy outperforms other methods as the size of the array grows:
Results of TimeIt comparison test, adapted from Yannic Hamann's code below
Length of Array x 7
Method               1       10      20      50      100     1000
Sorted Enumerate     2.47    16.291  33.643
List Comprehension   1.058   4.745   8.843   24.792
Numpy                5.212   5.562   5.931   6.22    6.441   6.055
Defaultdict          2.376   9.061   16.116  39.299
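A small variation on the np.where approach, in case the distances are floats: exact equality can miss values that are equal only up to rounding. The sketch below assumes np.isclose with its default tolerances is an acceptable notion of "equal to the minimum"; the sample data is made up:

```python
import numpy as np

distance = np.array([2.0, 3.0, 2.0 + 1e-12, 5.0])

# exact comparison: only the true minimum matches
exact = np.flatnonzero(distance == distance.min())
print(exact)  # [0]

# tolerant comparison: values within the default tolerance also match
close = np.flatnonzero(np.isclose(distance, distance.min()))
print(close)  # [0 2]
```

np.flatnonzero returns the index array directly, so there is no tuple to unpack as with np.where.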
You may enumerate array elements and extract their indexes if the condition holds:
min_value = min(distance)
[i for i,n in enumerate(distance) if n==min_value]
#[0,2]
Surprisingly the numpy answer seems to be the slowest.
Update: Depends on the size of the input list.
import numpy as np
import timeit
from collections import defaultdict
def weird_function_so_bad_to_read(distance):
    se = sorted(enumerate(distance), key=lambda x: x[1])
    smallest_numb = se[0][1] # careful exceptions when list is empty
    return [x for x in se if smallest_numb == x[1]]
# t1 = 1.8322973089525476
def pythonic_way(distance):
    min_value = min(distance)
    return [i for i, n in enumerate(distance) if n == min_value]
# t2 = 0.8458914929069579
def fastest_dont_even_have_to_measure(np_distance):
    # np_distance = np.array([2, 3, 2, 5, 4, 7, 6])
    min_v = np.min(np_distance)
    return np.where(np_distance == min_v)[0]
# t3 = 4.247801031917334
def dd_answer_was_my_first_guess_too(distance):
    d = defaultdict(list) # a dictionary where every value is a list by default
    for idx, num in enumerate(distance):
        d[num].append(idx) # for each number append the value of the index
    return d.get(min(distance))
# t4 = 1.8876687170704827
def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped
distance = [2, 3, 2, 5, 4, 7, 6]
t1 = wrapper(weird_function_so_bad_to_read, distance)
t2 = wrapper(pythonic_way, distance)
t3 = wrapper(fastest_dont_even_have_to_measure, np.array(distance))
t4 = wrapper(dd_answer_was_my_first_guess_too, distance)
print(timeit.timeit(t1))
print(timeit.timeit(t2))
print(timeit.timeit(t3))
print(timeit.timeit(t4))
We can use an interim dict to store indices of the list and then just fetch the minimum value of distance from it. We will also use a simple for-loop here so that you can understand what is happening step by step.
from collections import defaultdict
d = defaultdict(list) # a dictionary where every value is a list by default
for idx, num in enumerate(distance):
    d[num].append(idx) # for each number append the value of the index
d.get(min(distance)) # fetch the indices of the min number from our dict
[0, 2]
You can also do the following list comprehension
distance = [2,3,2,5,4,7,6]
min_distance = min(distance)
[index for index, val in enumerate(distance) if val == min_distance]
>>> [0, 2]
Having a list with N (large number) elements:
from random import randint
eles = [randint(0, 10) for i in range(3000000)]
I'm trying to implement the best way (performance/resources spent) this function below:
def mosty(lst):
    sort = sorted((v, k) for k, v in enumerate(lst))
    count, maxi, last_ele, idxs = 0, 0, None, []
    for ele, idx in sort:
        if(last_ele != ele):
            count = 1
            idxs = []
        idxs.append(idx)
        if(last_ele == ele):
            count += 1
            if(maxi < count):
                results = (ele, count, idxs)
                maxi = count
        last_ele = ele
    return results
This function returns the most common element, number of occurrences, and the indexes where it was found.
Here is the benchmark with 300000 eles.
But I think I could improve it. One reason is Python 3's sorted function (Timsort): if it returned a generator, I wouldn't have to loop through the list twice, right?
My questions are:
Is there any way for this code to be optimized? How?
With lazy sorting I'm sure it would be, am I right? How can I implement a lazy Timsort?
did not do any benchmarks, but that should not perform that badly (even though it iterates twice over the list):
from collections import Counter
from random import randint
eles = [randint(0, 10) for i in range(30)]
counter = Counter(eles)
most_common_element, number_of_occurrences = counter.most_common(1)[0]
indices = [i for i, x in enumerate(eles) if x == most_common_element]
print(most_common_element, number_of_occurrences, indices)
and the indices (the second iteration) can be found lazily in a generator expression:
indices = (i for i, x in enumerate(eles) if x == most_common_element)
if you need to care about multiple elements being the most common, this might work for you:
from collections import Counter
from itertools import groupby
from operator import itemgetter
eles = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5]
counter = Counter(eles)
_key, group = next(groupby(counter.most_common(), key=itemgetter(1)))
most_common = dict(group)
indices = {key: [] for key in most_common}
for i, x in enumerate(eles):
    if x in indices:
        indices[x].append(i)
print(most_common)
print(indices)
you could of course still make the indices lazy the same way as above.
If you are willing to use numpy, then you can do something like this:
arr = np.array(eles)
values, counts = np.unique(arr, return_counts=True)
ind = np.argmax(counts)
most_common_elem, its_count = values[ind], counts[ind]
indices = np.where(arr == most_common_elem)
HTH.
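If the elements are small non-negative integers, as in the question's randint(0, 10) data, np.bincount may be a cheaper way to count than np.unique because it avoids sorting. A minimal sketch under that assumption (the sample list is made up):

```python
import numpy as np

eles = [1, 3, 3, 0, 3, 1]
arr = np.asarray(eles)

# counts[v] is the number of times the integer v occurs in arr
counts = np.bincount(arr)
most_common_elem = int(counts.argmax())
its_count = int(counts[most_common_elem])
indices = np.flatnonzero(arr == most_common_elem)

print(most_common_elem, its_count, indices)  # 3 3 [1 2 4]
```

np.bincount allocates an array as long as the largest value plus one, so it is not a good fit for large or negative values.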
I know there is a method for a Python list to return the first index of something:
>>> xs = [1, 2, 3]
>>> xs.index(2)
1
Is there something like that for NumPy arrays?
Yes, given an array, array, and a value, item to search for, you can use np.where as:
itemindex = numpy.where(array == item)
The result is a tuple with first all the row indices, then all the column indices.
For example, if an array is two dimensions and it contained your item at two locations then
array[itemindex[0][0]][itemindex[1][0]]
would be equal to your item and so would be:
array[itemindex[0][1]][itemindex[1][1]]
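To make the shape of that tuple concrete, here is a small sketch on a made-up 2D array that pairs the row and column indices into coordinates:

```python
import numpy as np

array = np.array([[1, 2],
                  [2, 3]])
item = 2

# one index array per dimension: rows first, then columns
rows, cols = np.where(array == item)
coords = list(zip(rows.tolist(), cols.tolist()))
print(coords)  # [(0, 1), (1, 0)]

# the first match in row-major order, or None if there is no match
first = coords[0] if coords else None
print(first)   # (0, 1)
```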
If you need the index of the first occurrence of only one value, you can use nonzero (or where, which amounts to the same thing in this case):
>>> t = array([1, 1, 1, 2, 2, 3, 8, 3, 8, 8])
>>> nonzero(t == 8)
(array([6, 8, 9]),)
>>> nonzero(t == 8)[0][0]
6
If you need the first index of each of many values, you could obviously do the same as above repeatedly, but there is a trick that may be faster. The following finds the indices of the first element of each subsequence:
>>> nonzero(r_[1, diff(t)[:-1]])
(array([0, 3, 5, 6, 7, 8]),)
Notice that it finds the beginning of both subsequences of 3s and both subsequences of 8s:
[1, 1, 1, 2, 2, 3, 8, 3, 8, 8]
So it's slightly different than finding the first occurrence of each value. In your program, you may be able to work with a sorted version of t to get what you want:
>>> st = sorted(t)
>>> nonzero(r_[1, diff(st)[:-1]])
(array([0, 3, 5, 7]),)
You can also convert a NumPy array to a list on the fly and get its index. For example,
l = [1,2,3,4,5] # Python list
a = numpy.array(l) # NumPy array
i = a.tolist().index(2) # i is the index of 2
print(i)
It will print 1.
Just to add a very performant and handy numba alternative based on np.ndenumerate to find the first index:
from numba import njit
import numpy as np
@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    # If no item was found return None, other return types might be a problem due to
    # numbas type inference.
This is pretty fast and deals naturally with multidimensional arrays:
>>> arr1 = np.ones((100, 100, 100))
>>> arr1[2, 2, 2] = 2
>>> index(arr1, 2)
(2, 2, 2)
>>> arr2 = np.ones(20)
>>> arr2[5] = 2
>>> index(arr2, 2)
(5,)
This can be much faster (because it's short-circuiting the operation) than any approach using np.where or np.nonzero.
However np.argwhere could also deal gracefully with multidimensional arrays (you would need to manually cast it to a tuple and it's not short-circuited) but it would fail if no match is found:
>>> tuple(np.argwhere(arr1 == 2)[0])
(2, 2, 2)
>>> tuple(np.argwhere(arr2 == 2)[0])
(5,)
l.index(x) returns the smallest i such that i is the index of the first occurrence of x in the list.
One can safely assume that the index() function in Python is implemented so that it stops after finding the first match, and this results in an optimal average performance.
For finding an element stopping after the first match in a NumPy array use an iterator (ndenumerate).
In [67]: l=range(100)
In [68]: l.index(2)
Out[68]: 2
NumPy array:
In [69]: a = np.arange(100)
In [70]: next((idx for idx, val in np.ndenumerate(a) if val==2))
Out[70]: (2L,)
Note that both methods index() and next return an error if the element is not found. With next, one can use a second argument to return a special value in case the element is not found, e.g.
In [77]: next((idx for idx, val in np.ndenumerate(a) if val==400),None)
There are other functions in NumPy (argmax, where, and nonzero) that can be used to find an element in an array, but they all have the drawback of going through the whole array looking for all occurrences, thus not being optimized for finding the first element. Note also that where and nonzero return arrays, so you need to select the first element to get the index.
In [71]: np.argmax(a==2)
Out[71]: 2
In [72]: np.where(a==2)
Out[72]: (array([2], dtype=int64),)
In [73]: np.nonzero(a==2)
Out[73]: (array([2], dtype=int64),)
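One caveat with np.argmax worth showing: when nothing matches, the boolean array is all False and argmax still returns 0. A minimal sketch of a guard, assuming a non-empty 1-D array; the helper name first_index and the -1 sentinel are choices made for this example:

```python
import numpy as np

def first_index(a, value):
    mask = (a == value)
    i = int(np.argmax(mask))
    # argmax returns 0 for an all-False mask, so confirm the hit
    return i if mask[i] else -1

a = np.arange(100)
print(np.argmax(a == 400))   # 0, even though 400 is not in a
print(first_index(a, 2))     # 2
print(first_index(a, 400))   # -1
```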
Time comparison
Just checking that for large arrays the solution using an iterator is faster when the searched item is at the beginning of the array (using %timeit in the IPython shell):
In [285]: a = np.arange(100000)
In [286]: %timeit next((idx for idx, val in np.ndenumerate(a) if val==0))
100000 loops, best of 3: 17.6 µs per loop
In [287]: %timeit np.argmax(a==0)
1000 loops, best of 3: 254 µs per loop
In [288]: %timeit np.where(a==0)[0][0]
1000 loops, best of 3: 314 µs per loop
This is an open NumPy GitHub issue.
See also: Numpy: find first index of value fast
If you're going to use this as an index into something else, you can use boolean indices if the arrays are broadcastable; you don't need explicit indices. The absolute simplest way to do this is to simply index based on a truth value.
other_array[first_array == item]
Any boolean operation works:
a = numpy.arange(100)
other_array[first_array > 50]
The nonzero method takes booleans, too:
index = numpy.nonzero(first_array == item)[0][0]
The two zeros are for the tuple of indices (assuming first_array is 1D) and then the first item in the array of indices.
For one-dimensional sorted arrays, it is much simpler and more efficient (O(log n)) to use numpy.searchsorted, which returns a NumPy integer (position). For example,
arr = np.array([1, 1, 1, 2, 3, 3, 4])
i = np.searchsorted(arr, 3)
Just make sure the array is already sorted.
Also check that the returned index i is within bounds and actually contains the searched element, since searchsorted's main objective is to find indices where elements should be inserted to maintain order (it returns len(arr) when the value is larger than every element).
if i < len(arr) and arr[i] == 3:
    print("present")
else:
    print("not present")
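The lookup and the membership check can be wrapped together. A sketch, assuming a sorted 1-D array; the helper name index_sorted and the -1 return value for a miss are made up for illustration:

```python
import numpy as np

def index_sorted(arr, x):
    # side='left' (the default) gives the first position where x could go
    i = int(np.searchsorted(arr, x))
    # i == len(arr) means x is larger than every element
    if i < len(arr) and arr[i] == x:
        return i
    return -1

arr = np.array([1, 1, 1, 2, 3, 3, 4])
print(index_sorted(arr, 3))  # 4
print(index_sorted(arr, 5))  # -1
print(index_sorted(arr, 0))  # -1
```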
For 1D arrays, I'd recommend np.flatnonzero(array == value)[0], which is equivalent to both np.nonzero(array == value)[0][0] and np.where(array == value)[0][0] but avoids the ugliness of unboxing a 1-element tuple.
To index on any criteria, you can do something like the following:
In [1]: from numpy import *
In [2]: x = arange(125).reshape((5,5,5))
In [3]: y = indices(x.shape)
In [4]: locs = y[:,x >= 120] # put whatever you want in place of x >= 120
In [5]: pts = hsplit(locs, len(locs[0]))
In [6]: for pt in pts:
.....: print(', '.join(str(p[0]) for p in pt))
4, 4, 0
4, 4, 1
4, 4, 2
4, 4, 3
4, 4, 4
And here's a quick function to do what list.index() does, except doesn't raise an exception if it's not found. Beware -- this is probably very slow on large arrays. You can probably monkey patch this on to arrays if you'd rather use it as a method.
def ndindex(ndarray, item):
    if len(ndarray.shape) == 1:
        try:
            return [ndarray.tolist().index(item)]
        except:
            pass
    else:
        for i, subarray in enumerate(ndarray):
            try:
                return [i] + ndindex(subarray, item)
            except:
                pass
In [1]: ndindex(x, 103)
Out[1]: [4, 0, 3]
An alternative to selecting the first element from np.where() is to use a generator expression together with enumerate, such as:
>>> import numpy as np
>>> x = np.arange(100) # x = array([0, 1, 2, 3, ... 99])
>>> next(i for i, x_i in enumerate(x) if x_i == 2)
2
For a two dimensional array one would do:
>>> x = np.arange(100).reshape(10,10) # x = array([[0, 1, 2,... 9], [10,..19],])
>>> next((i,j) for i, x_i in enumerate(x)
... for j, x_ij in enumerate(x_i) if x_ij == 2)
(0, 2)
The advantage of this approach is that it stops checking the elements of the array after the first match is found, whereas np.where checks all elements for a match. A generator expression would be faster if there's match early in the array.
There are lots of operations in NumPy that could perhaps be put together to accomplish this. This will return indices of elements equal to item:
numpy.nonzero(array == item)
(Note that numpy.nonzero(array - item) would instead return the indices of elements that differ from item.) You could then take the first element of each returned index array to get the first match.
Comparison of 8 methods
TL;DR:
(Note: applicable to 1d arrays under 100M elements.)
For maximum performance use index_of__v5 (numba + numpy.ndenumerate + for loop; see the code below).
If numba is not available:
Use index_of__v7 (for loop + enumerate) if the target value is expected to be found within the first 100k elements.
Else use index_of__v2/v3/v4 (numpy.argmax or numpy.flatnonzero based).
Powered by perfplot
import numpy as np
from numba import njit
# Based on: numpy.argmax()
# Proposed by: John Haberstroh (https://stackoverflow.com/a/67497472/7204581)
def index_of__v1(arr: np.array, v):
    is_v = (arr == v)
    return is_v.argmax() if is_v.any() else -1
# Based on: numpy.argmax()
def index_of__v2(arr: np.array, v):
    return (arr == v).argmax() if v in arr else -1
# Based on: numpy.flatnonzero()
# Proposed by: 1'' (https://stackoverflow.com/a/42049655/7204581)
def index_of__v3(arr: np.array, v):
    idxs = np.flatnonzero(arr == v)
    return idxs[0] if len(idxs) > 0 else -1
# Based on: numpy.argmax()
def index_of__v4(arr: np.array, v):
    return np.r_[False, (arr == v)].argmax() - 1
# Based on: numba, for loop
# Proposed by: MSeifert (https://stackoverflow.com/a/41578614/7204581)
@njit
def index_of__v5(arr: np.array, v):
    for idx, val in np.ndenumerate(arr):
        if val == v:
            return idx[0]
    return -1
# Based on: numpy.ndenumerate(), for loop
def index_of__v6(arr: np.array, v):
    return next((idx[0] for idx, val in np.ndenumerate(arr) if val == v), -1)
# Based on: enumerate(), for loop
# Proposed by: Noyer282 (https://stackoverflow.com/a/40426159/7204581)
def index_of__v7(arr: np.array, v):
    return next((idx for idx, val in enumerate(arr) if val == v), -1)
# Based on: list.index()
# Proposed by: Hima (https://stackoverflow.com/a/23994923/7204581)
def index_of__v8(arr: np.array, v):
    l = list(arr)
    try:
        return l.index(v)
    except ValueError:
        return -1
Go to Colab
The numpy_indexed package (disclaimer, I am its author) contains a vectorized equivalent of list.index for numpy.ndarray; that is:
sequence_of_arrays = [[0, 1], [1, 2], [-5, 0]]
arrays_to_query = [[-5, 0], [1, 0]]
import numpy_indexed as npi
idx = npi.indices(sequence_of_arrays, arrays_to_query, missing=-1)
print(idx) # [2, -1]
This solution has vectorized performance, generalizes to ndarrays, and has various ways of dealing with missing values.
There is a fairly idiomatic and vectorized way to do this built into numpy. It uses a quirk of the np.argmax() function to accomplish this -- if many values match, it returns the index of the first match. The trick is that for booleans, there will only ever be two values: True (1) and False (0). Therefore, the returned index will be that of the first True.
For the simple example provided, you can see it work with the following
>>> np.argmax(np.array([1,2,3]) == 2)
1
A great example is computing buckets, e.g. for categorizing. Let's say you have an array of cut points, and you want the "bucket" that corresponds to each element of your array. The algorithm is to compute the first index of cuts where x < cuts (after padding cuts with np.inf; the np.Infinity alias was removed in NumPy 2.0). I could use broadcasting to broadcast the comparisons, then apply argmax along the cuts-broadcasted axis.
>>> cuts = np.array([10, 50, 100])
>>> cuts_pad = np.array([*cuts, np.inf])
>>> x = np.array([7, 11, 80, 443])
>>> bins = np.argmax( x[:, np.newaxis] < cuts_pad[np.newaxis, :], axis = 1)
>>> print(bins)
[0 1 2 3]
As expected, each value from x falls into one of the sequential bins, with well-defined and easy to specify edge case behavior.
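For this particular bucketing task, NumPy's np.digitize appears to give the same result without building the broadcasted comparison matrix. A sketch comparing the two on the same cut points, with a few extra values added to exercise the edges:

```python
import numpy as np

cuts = np.array([10, 50, 100])
x = np.array([7, 10, 11, 80, 100, 443])

# argmax over the broadcasted comparison, as described above
cuts_pad = np.append(cuts, np.inf)
bins_argmax = np.argmax(x[:, np.newaxis] < cuts_pad[np.newaxis, :], axis=1)

# digitize with right=False: bin i satisfies cuts[i-1] <= x < cuts[i]
bins_digitize = np.digitize(x, cuts)

print(bins_argmax)    # [0 1 1 2 3 3]
print(bins_digitize)  # [0 1 1 2 3 3]
```

np.digitize requires the cut points to be monotonic, which matches the setup here.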
Another option not previously mentioned is the bisect module, which also works on lists, but requires a pre-sorted list/array:
import bisect
import numpy as np
z = np.array([104,113,120,122,126,138])
bisect.bisect_left(z, 122)
yields
3
bisect also returns a result when the number you're looking for doesn't exist in the array, so that the number can be inserted in the correct place.
Note: this is for python 2.7 version
You can use a lambda function to deal with the problem, and it works both on NumPy array and list.
your_list = [11, 22, 23, 44, 55]
result = filter(lambda x:your_list[x]>30, range(len(your_list)))
#result: [3, 4]
import numpy as np
your_numpy_array = np.array([11, 22, 23, 44, 55])
result = filter(lambda x:your_numpy_array [x]>30, range(len(your_list)))
#result: [3, 4]
And you can use
result[0]
to get the first index of the filtered elements.
For python 3.6, use
list(result)
instead of
result
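On Python 3, the same result can be had without filter. A short sketch using a list comprehension for the list and np.flatnonzero for the array:

```python
import numpy as np

your_list = [11, 22, 23, 44, 55]
# indices of list elements greater than 30
result = [i for i, v in enumerate(your_list) if v > 30]
print(result)     # [3, 4]

your_numpy_array = np.array(your_list)
# vectorized equivalent for the NumPy array
result_np = np.flatnonzero(your_numpy_array > 30)
print(result_np)  # [3 4]
```

result[0] and result_np[0] then give the first matching index, provided at least one element matches.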
Use ndindex
Sample array
arr = np.array([[1,4],
[2,3]])
print(arr)
...[[1,4],
[2,3]]
create an empty list to store the index and the element tuples
index_elements = []
for i in np.ndindex(arr.shape):
index_elements.append((arr[i],i))
convert the list of tuples into dictionary
index_elements = dict(index_elements)
The keys are the elements and the values are their
indices - use keys to access the index
index_elements[4]
output
... (0,1)
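One thing to watch with this approach: if a value occurs more than once, converting the list of tuples with dict() keeps the last index seen, not the first. A sketch that keeps the first occurrence instead, using dict.setdefault on a made-up array with a repeated value:

```python
import numpy as np

arr = np.array([[1, 4],
                [4, 3]])

first_index = {}
for i in np.ndindex(arr.shape):
    # setdefault only stores the index the first time a value is seen
    first_index.setdefault(arr[i].item(), i)

print(first_index[4])  # (0, 1)
```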
For my use case, I could not sort the array ahead of time because the order of the elements is important. This is my all-NumPy implementation:
import numpy as np
# The array in question
arr = np.array([1,2,1,2,1,5,5,3,5,9])
# Find all of the present values
vals=np.unique(arr)
# Make all indices up-to and including the desired index positive
cum_sum=np.cumsum(arr==vals.reshape(-1,1),axis=1)
# Add zeros to account for the n-1 shape of diff and the all-positive array of the first index
bl_mask=np.concatenate([np.zeros((cum_sum.shape[0],1)),cum_sum],axis=1)>=1
# The desired indices
idx=np.where(np.diff(bl_mask))[1]
# Show results
print(list(zip(vals,idx)))
>>> [(1, 0), (2, 1), (3, 7), (5, 5), (9, 9)]
I believe it accounts for unsorted arrays with duplicate values.
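For comparison, np.unique with return_index=True returns the index of the first occurrence of each unique value directly, and on this input it gives the same pairs. A minimal sketch:

```python
import numpy as np

arr = np.array([1, 2, 1, 2, 1, 5, 5, 3, 5, 9])
# idx[k] is the position of the first occurrence of vals[k] in arr
vals, idx = np.unique(arr, return_index=True)
pairs = list(zip(vals.tolist(), idx.tolist()))
print(pairs)  # [(1, 0), (2, 1), (3, 7), (5, 5), (9, 9)]
```

The input does not need to be sorted beforehand; only the returned unique values come out in sorted order.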
Found another solution with loops:
new_array_of_indicies = []
for i in range(len(some_array)):
    if some_array[i] == some_value:
        new_array_of_indicies.append(i)
index_lst_form_numpy = pd.DataFrame(df).reset_index()["index"].tolist()