Index of element in NumPy array [duplicate] - python

In Python we can get the index of a value in an array by using .index().
But with a NumPy array, when I try to do:
decoding.index(i)
I get:
AttributeError: 'numpy.ndarray' object has no attribute 'index'
How could I do this on a NumPy array?

Use np.where to get the indices where a given condition is True.
Examples:
For a 2D np.ndarray called a:
i, j = np.where(a == value) # when comparing arrays of integers
i, j = np.where(np.isclose(a, value)) # when comparing floating-point arrays
For a 1D array:
i, = np.where(a == value) # integers
i, = np.where(np.isclose(a, value)) # floating-point
Note that this also works for conditions like >=, <=, != and so forth...
You can also create a subclass of np.ndarray with an index() method:
class myarray(np.ndarray):
    def __new__(cls, *args, **kwargs):
        return np.array(*args, **kwargs).view(myarray)
    def index(self, value):
        return np.where(self == value)
Testing:
a = myarray([1,2,3,4,4,4,5,6,4,4,4])
a.index(4)
#(array([ 3, 4, 5, 8, 9, 10]),)
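To illustrate why the floating-point examples above use np.isclose instead of ==, here is a small sketch. The array values are made up for the illustration; the only assumption is the default tolerances of np.isclose.

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in binary floating point
a = np.array([0.1 + 0.2, 0.5, 0.3])

exact, = np.where(a == 0.3)             # only the literal 0.3 matches
close, = np.where(np.isclose(a, 0.3))   # values within tolerance match too

print(exact)  # [2]
print(close)  # [0 2]
```

The trailing comma unpacks the one-element tuple that np.where returns for a 1D array.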

You can convert a NumPy array to a list and use the list's index() method.
For example:
tmp = [1,2,3,4,5] # Python list
a = numpy.array(tmp) # NumPy array
i = list(a).index(2) # i is the index of the first 2, which is 1
This gives the same result as list.index() on the original list.

I'm torn between these two ways of implementing an index of a NumPy array:
idx = list(classes).index(var)
idx = np.where(classes == var)
Both take the same number of characters, but the first method returns an int instead of a numpy.ndarray.
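A quick sketch of the difference in return types mentioned above. The sample values in classes and var are invented for the example.

```python
import numpy as np

classes = np.array([10, 20, 30, 20])
var = 20

idx_list = list(classes).index(var)    # plain Python int, first match only
idx_where = np.where(classes == var)   # tuple holding one array of all matches
first = int(idx_where[0][0])           # first match as a plain int

print(idx_list)   # 1
print(idx_where)  # (array([1, 3]),)
print(first)      # 1
```

If only the first match is needed as an int, the np.where result has to be unwrapped twice: once for the tuple and once for the array.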

This problem can be solved efficiently using the numpy_indexed library (disclaimer: I am its author), which was created to address problems of this type. npi.indices can be viewed as an n-dimensional generalisation of list.index. It acts on nd-arrays (along a specified axis) and looks up multiple entries in a vectorized manner, as opposed to a single item at a time.
a = np.random.rand(50, 60, 70)
i = np.random.randint(0, len(a), 40)
b = a[i]
import numpy_indexed as npi
assert all(i == npi.indices(a, b))
This solution has better time complexity (n log n at worst) than any of the previously posted answers, and is fully vectorized.

You can use the function numpy.nonzero(), or the nonzero() method of an array
import numpy as np
A = np.array([[2,4],
              [6,2]])
index = np.nonzero(A > 2)
OR
(A > 2).nonzero()
Output:
(array([0, 1]), array([1, 0]))
First array in output depicts the row index and second array depicts the corresponding column index.

If you are interested in the indices that would sort the array, rather than the position of a single value, use np.argsort(a):
a = np.random.randint(0, 100, 10)
sorted_idx = np.argsort(a)
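For reference, a minimal sketch of what np.argsort returns, using a small fixed array instead of random data so the result is easy to read. Note that it gives the positions that would sort the array, not the position of one particular value.

```python
import numpy as np

a = np.array([30, 10, 20])
sorted_idx = np.argsort(a)   # positions that would sort a

print(sorted_idx)      # [1 2 0]
print(a[sorted_idx])   # [10 20 30]
```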

Related

What does [i,:] mean in Python?

I've finished one part of this assignment I have to do. There's only one part of the assignment that doesn't make any sense to me.
I'm doing a LinearRegression model and according to others I need to apply ans[i,:] = y_poly at the very end, but I never got an answer as to why.
Can someone please explain to me what [i,:] means? I haven't found any explanations online.
It's specific to the numpy module, used in most data science modules.
ans[i,:] = y_poly
This assigns a vector to a slice of a 2D NumPy array (slice assignment). Self-contained example:
>>> import numpy
>>> a = numpy.array([[0,0,0],[1,1,1]])
>>> a[0,:] = [3,4,5]
>>> a
array([[3, 4, 5],
       [1, 1, 1]])
There is also slice assignment in base python, using only one dimension (a[:] = [1,2,3])
I guess you are also using numpy to manipulate data (as a matrix)?
If based on numpy, ans[i,:] means to pick the ith 'row' of ans with all of its 'columns'.
Note: when dealing with NumPy arrays, we should (almost) always use [i, j] instead of [i][j]. This might be counter-intuitive if you've used Python or Java to manipulate matrices before.
I think in this case [] is the indexing operator, which a class can support by defining the __getitem__ method:
class A:
    def __getitem__(self, key):
        pass
key can be literally anything. In your case, for "[1,:]", key is a tuple consisting of 1 and slice(None, None, None). Such a key can be useful if your class represents multi-dimensional data which you want to access via the [] operator. As suggested by other answers, this could be a NumPy array.
Here is a quick example of how such a multi-dimensional indexing could work:
class A:
    values = [[1,2,3,4], [4,5,6,7]]
    def __getitem__(self, key):
        i, j = key
        if isinstance(i, int):
            i = slice(i, i + 1)
        if isinstance(j, int):
            j = slice(j, j + 1)
        for row in self.values[i]:
            print(row[j])
>>> a = A()
>>> a[:,2:4]
[3, 4]
[6, 7]
>>> a[1,1]
[5]
>>> a[:, 2]
[3]
[6]
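To see exactly what Python passes as the key, here is a tiny sketch. ShowKey is a hypothetical class written only for this illustration; it simply hands the key back.

```python
class ShowKey:
    """Hypothetical helper that returns whatever key it is indexed with."""

    def __getitem__(self, key):
        return key


s = ShowKey()
print(s[1, :])     # (1, slice(None, None, None))
print(s[:, 2:4])   # (slice(None, None, None), slice(2, 4, None))
print(s[3])        # 3
```

A comma inside the brackets builds a tuple, and each ":" becomes a slice object, which is all that NumPy needs to interpret ans[i,:] as "row i, every column".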

How to get the two smallest values from a numpy array

I would like to take the two smallest values from an array x. But when I use np.where:
A,B = np.where(x == x.min())[0:1]
I get this error:
ValueError: need more than 1 value to unpack
How can I fix this error? And do I need to arrange the numbers in the array in ascending order?
You can use numpy.partition to get the lowest k+1 items:
A, B = np.partition(x, 1)[0:2] # k=1, so the first two are the smallest items
In Python 3.x you could also use:
A, B, *_ = np.partition(x, 1)
For example:
import numpy as np
x = np.array([5, 3, 1, 2, 6])
A, B = np.partition(x, 1)[0:2]
print(A) # 1
print(B) # 2
How about using sorted instead of np.where?
A,B = sorted(x)[:2]
There are two errors in the code. The first is that the slice is [0:1] when it should be [0:2]. The second is actually a very common issue with np.where. If you look into the documentation, you will see that it always returns a tuple, with one element if you only pass one parameter. Hence you have to access the tuple element first and then index the array normally:
A,B = np.where(x == x.min())[0][0:2]
Which will give you the two first indices containing the minimum value. If no two such indices exist you will get an exception, so you may want to check for that.
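A sketch of the point above: with a unique minimum, np.where yields a single index, so unpacking into two names fails. If what is wanted is the positions of the two smallest values, np.argpartition is one option; that is a swapped-in technique and not something the answers above propose. The sample array is the one used earlier in this thread.

```python
import numpy as np

x = np.array([5, 3, 1, 2, 6])

min_positions = np.where(x == x.min())[0]   # unwrap the 1-element tuple
print(min_positions)   # [2], a single index, so "A, B = ..." would fail

# Positions of the two smallest values (their relative order is not guaranteed)
two_smallest_idx = np.argpartition(x, 1)[:2]
print(np.sort(x[two_smallest_idx]))   # [1 2]
```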

How do I remove all zero elements from a NumPy array?

I have a rank-1 numpy.array of which I want to make a boxplot. However, I want to exclude all values equal to zero in the array. Currently, I solve this by looping over the array and copying each value to a new array if it is not equal to zero. However, as the array consists of 86 000 000 values and I have to do this multiple times, this takes a lot of patience.
Is there a more intelligent way to do this?
For a NumPy array a, you can use
a[a != 0]
to extract the values not equal to zero.
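A minimal sketch of what the expression does step by step. The sample values are invented.

```python
import numpy as np

a = np.array([0.0, 1.5, 0.0, -2.0, 3.0])

mask = a != 0        # boolean array, True where the value should be kept
nonzero = a[mask]    # new array with the kept values; a itself is unchanged

print(mask.tolist())     # [False, True, False, True, True]
print(nonzero.tolist())  # [1.5, -2.0, 3.0]
```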
This is a case where you want to use masked arrays: they keep the shape of your array and are recognized by many NumPy and matplotlib functions.
import numpy as np
import matplotlib.pyplot as plt
X = np.random.randn(1000, 5) # the shape arguments must be integers, so 1000 rather than 1e3
X[np.abs(X)< .1]= 0 # some zeros
X = np.ma.masked_equal(X,0)
plt.boxplot(X) #masked values are not plotted
#other functionalities of masked arrays
X.compressed() # get normal array with masked values removed
X.mask # get a boolean array of the mask
X.mean() # it automatically discards masked values
I decided to compare the runtime of the different approaches mentioned here. I've used my library simple_benchmark for this.
The boolean indexing with array[array != 0] seems to be the fastest (and shortest) solution.
For small arrays the MaskedArray approach is very slow compared to the other approaches, but for large arrays it is about as fast as the boolean indexing approach. For moderately sized arrays there is not much difference between the approaches.
Here is the code I've used:
from simple_benchmark import BenchmarkBuilder
import numpy as np
bench = BenchmarkBuilder()
@bench.add_function()
def boolean_indexing(arr):
    return arr[arr != 0]
@bench.add_function()
def integer_indexing_nonzero(arr):
    return arr[np.nonzero(arr)]
@bench.add_function()
def integer_indexing_where(arr):
    return arr[np.where(arr != 0)]
@bench.add_function()
def masked_array(arr):
    return np.ma.masked_equal(arr, 0)
@bench.add_arguments('array size')
def argument_provider():
    for exp in range(3, 25):
        size = 2**exp
        arr = np.random.random(size)
        arr[arr < 0.1] = 0 # add some zeros
        yield size, arr
r = bench.run()
r.plot()
You can index with a Boolean array. For a NumPy array A:
res = A[A != 0]
You can use Boolean array indexing as above, bool type conversion, np.nonzero, or np.where. Here's some performance benchmarking:
# Python 3.7, NumPy 1.14.3
np.random.seed(0)
A = np.random.randint(0, 5, 10**8)
%timeit A[A != 0] # 768 ms
%timeit A[A.astype(bool)] # 781 ms
%timeit A[np.nonzero(A)] # 1.49 s
%timeit A[np.where(A)] # 1.58 s
I would suggest simply using NaN for cases like this, where you want to ignore some values but still keep the procedure as statistically meaningful as possible. So:
In []: X = randn(1000, 5)
In []: X[abs(X) < .1] = NaN
In []: isnan(X).sum(0)
Out[]: array([82, 84, 71, 81, 73])
In []: boxplot(X)
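np.argwhere returns the indices of the non-zero elements, not the values themselves, so index the array with the result to drop the zeros:
np.argwhere(*array*)
Example:
import numpy as np
array = np.array([0, 1, 0, 3, 4, 5, 0])
idx = np.argwhere(array).ravel() # indices of the non-zero elements
print(idx)
[1 3 4 5]
print(array[idx]) # the non-zero values (here they happen to equal their indices)
[1 3 4 5]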
You can also use a list comprehension: [i for i in array if i != 0.0] if the numbers are floats,
or [i for i in array if i != 0] if the numbers are ints.

Get the position of the largest value in a multi-dimensional NumPy array

How can I get the position (indices) of the largest value in a multi-dimensional NumPy array?
The argmax() method should help.
Update
(After reading comment) I believe the argmax() method would work for multi dimensional arrays as well. The linked documentation gives an example of this:
>>> a = array([[10,50,30],[60,20,40]])
>>> maxindex = a.argmax()
>>> maxindex
3
Update 2
(Thanks to KennyTM's comment) You can use unravel_index(a.argmax(), a.shape) to get the index as a tuple:
>>> from numpy import unravel_index
>>> unravel_index(a.argmax(), a.shape)
(1, 0)
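The same two-step pattern carries over to higher dimensions. Here is a sketch with a 3D array whose maximum is planted at a known position; the shape and values are chosen only for the illustration.

```python
import numpy as np

a = np.zeros((2, 3, 4))
a[1, 2, 0] = 9.0                        # plant a known maximum

flat = a.argmax()                       # index into the flattened array
pos = np.unravel_index(flat, a.shape)   # one index per axis

print(int(flat))                        # 20
print(tuple(int(p) for p in pos))       # (1, 2, 0)
```

The flat index is 1*12 + 2*4 + 0 = 20 for this shape, which unravel_index converts back into (1, 2, 0).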
(edit) I was referring to an old answer which had been deleted. And the accepted answer came after mine. I agree that argmax is better than my answer.
Wouldn't it be more readable/intuitive to do like this?
numpy.nonzero(a.max() == a)
(array([1]), array([0]))
Or,
numpy.argwhere(a.max() == a)
You can simply write a function (that works only in 2d):
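def argmax_2d(matrix):
    maxN = np.argmax(matrix) # flat index in row-major order
    (xD, yD) = matrix.shape # number of rows, number of columns
    x = maxN // yD # row: one full row every yD elements
    y = maxN % yD # column: offset within the row
    return (x, y)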
An alternative is to convert the NumPy array to a list and use the max and index methods:
List = np.array([34, 7, 33, 10, 89, 22, -5])
_max = List.tolist().index(max(List))
_max
>>> 4

Is there a NumPy function to return the first index of something in an array?

I know there is a method for a Python list to return the first index of something:
>>> xs = [1, 2, 3]
>>> xs.index(2)
1
Is there something like that for NumPy arrays?
Yes, given an array, array, and a value, item to search for, you can use np.where as:
itemindex = numpy.where(array == item)
The result is a tuple with first all the row indices, then all the column indices.
For example, if an array is two dimensions and it contained your item at two locations then
array[itemindex[0][0]][itemindex[1][0]]
would be equal to your item and so would be:
array[itemindex[0][1]][itemindex[1][1]]
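A short sketch of how the row and column arrays pair up into coordinates. The array and item are made-up sample data.

```python
import numpy as np

array = np.array([[1, 7, 3],
                  [7, 5, 6]])
item = 7

rows, cols = np.where(array == item)
coords = [(int(r), int(c)) for r, c in zip(rows, cols)]

print(coords)                    # [(0, 1), (1, 0)]
print(array[rows[0], cols[0]])   # 7
```

The k-th match sits at (rows[k], cols[k]), so the first occurrence in row-major order is (rows[0], cols[0]).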
If you need the index of the first occurrence of only one value, you can use nonzero (or where, which amounts to the same thing in this case):
>>> t = array([1, 1, 1, 2, 2, 3, 8, 3, 8, 8])
>>> nonzero(t == 8)
(array([6, 8, 9]),)
>>> nonzero(t == 8)[0][0]
6
If you need the first index of each of many values, you could obviously do the same as above repeatedly, but there is a trick that may be faster. The following finds the indices of the first element of each subsequence:
>>> nonzero(r_[1, diff(t)[:-1]])
(array([0, 3, 5, 6, 7, 8]),)
Notice that it finds the beginning of both subsequences of 3s and both subsequences of 8s:
[1, 1, 1, 2, 2, 3, 8, 3, 8, 8]
So it's slightly different than finding the first occurrence of each value. In your program, you may be able to work with a sorted version of t to get what you want:
>>> st = sorted(t)
>>> nonzero(r_[1, diff(st)[:-1]])
(array([0, 3, 5, 7]),)
You can also convert a NumPy array to a list on the fly and get its index. For example,
l = [1,2,3,4,5] # Python list
a = numpy.array(l) # NumPy array
i = a.tolist().index(2) # i is the index of the first 2
print(i)
It will print 1.
Just to add a very performant and handy numba alternative based on np.ndenumerate to find the first index:
from numba import njit
import numpy as np
@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    # If no item was found return None, other return types might be a problem due to
    # numbas type inference.
This is pretty fast and deals naturally with multidimensional arrays:
>>> arr1 = np.ones((100, 100, 100))
>>> arr1[2, 2, 2] = 2
>>> index(arr1, 2)
(2, 2, 2)
>>> arr2 = np.ones(20)
>>> arr2[5] = 2
>>> index(arr2, 2)
(5,)
This can be much faster (because it's short-circuiting the operation) than any approach using np.where or np.nonzero.
However np.argwhere could also deal gracefully with multidimensional arrays (you would need to manually cast it to a tuple and it's not short-circuited) but it would fail if no match is found:
>>> tuple(np.argwhere(arr1 == 2)[0])
(2, 2, 2)
>>> tuple(np.argwhere(arr2 == 2)[0])
(5,)
l.index(x) returns the smallest i such that i is the index of the first occurrence of x in the list.
One can safely assume that the index() function in Python is implemented so that it stops after finding the first match, and this results in an optimal average performance.
For finding an element stopping after the first match in a NumPy array use an iterator (ndenumerate).
In [67]: l=range(100)
In [68]: l.index(2)
Out[68]: 2
NumPy array:
In [69]: a = np.arange(100)
In [70]: next((idx for idx, val in np.ndenumerate(a) if val==2))
Out[70]: (2L,)
Note that both index() and next raise an error if the element is not found. With next, one can use a second argument to return a special value in case the element is not found, e.g.
In [77]: next((idx for idx, val in np.ndenumerate(a) if val==400),None)
There are other functions in NumPy (argmax, where, and nonzero) that can be used to find an element in an array, but they all have the drawback of going through the whole array looking for all occurrences, thus not being optimized for finding the first element. Note also that where and nonzero return arrays, so you need to select the first element to get the index.
In [71]: np.argmax(a==2)
Out[71]: 2
In [72]: np.where(a==2)
Out[72]: (array([2], dtype=int64),)
In [73]: np.nonzero(a==2)
Out[73]: (array([2], dtype=int64),)
Time comparison
Just checking that for large arrays the solution using an iterator is faster when the searched item is at the beginning of the array (using %timeit in the IPython shell):
In [285]: a = np.arange(100000)
In [286]: %timeit next((idx for idx, val in np.ndenumerate(a) if val==0))
100000 loops, best of 3: 17.6 µs per loop
In [287]: %timeit np.argmax(a==0)
1000 loops, best of 3: 254 µs per loop
In [288]: %timeit np.where(a==0)[0][0]
1000 loops, best of 3: 314 µs per loop
This is an open NumPy GitHub issue.
See also: Numpy: find first index of value fast
If you're going to use this as an index into something else, you can use boolean indices if the arrays are broadcastable; you don't need explicit indices. The absolute simplest way to do this is to simply index based on a truth value.
other_array[first_array == item]
Any boolean operation works:
a = numpy.arange(100)
other_array[first_array > 50]
The nonzero method takes booleans, too:
index = numpy.nonzero(first_array == item)[0][0]
The two zeros are for the tuple of indices (assuming first_array is 1D) and then the first item in the array of indices.
For one-dimensional sorted arrays, it is much simpler and more efficient, O(log(n)), to use numpy.searchsorted, which returns a NumPy integer (position). For example,
arr = np.array([1, 1, 1, 2, 3, 3, 4])
i = np.searchsorted(arr, 3)
Just make sure the array is already sorted.
Also check that the returned index i actually contains the searched element, since searchsorted's main objective is to find the index where an element should be inserted to maintain order. Note that i can equal len(arr) when the value is larger than every element.
if i < len(arr) and arr[i] == 3:
    print("present")
else:
    print("not present")
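The presence check can be folded into a small helper. This is only a sketch: sorted_index is a hypothetical name, and returning -1 for a missing value is an arbitrary convention chosen here, not something searchsorted does by itself.

```python
import numpy as np

def sorted_index(arr, value):
    """Hypothetical helper: first index of value in a sorted 1D array, or -1."""
    i = np.searchsorted(arr, value)   # leftmost insertion point
    # i == len(arr) means value is larger than every element
    if i < len(arr) and arr[i] == value:
        return int(i)
    return -1

arr = np.array([1, 1, 1, 2, 3, 3, 4])
print(sorted_index(arr, 3))   # 4
print(sorted_index(arr, 5))   # -1
print(sorted_index(arr, 0))   # -1
```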
For 1D arrays, I'd recommend np.flatnonzero(array == value)[0], which is equivalent to both np.nonzero(array == value)[0][0] and np.where(array == value)[0][0] but avoids the ugliness of unboxing a 1-element tuple.
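A sketch of the flatnonzero variant, including what happens when nothing matches: indexing [0] on an empty result raises IndexError, so the size check below is one way to guard it. The -1 fallback is an assumption made for this example.

```python
import numpy as np

array = np.array([4, 8, 15, 8])

hits = np.flatnonzero(array == 8)      # plain 1D array of matching positions
first = int(hits[0]) if hits.size else -1

misses = np.flatnonzero(array == 99)   # empty array, no match
first_missing = int(misses[0]) if misses.size else -1

print(first)           # 1
print(first_missing)   # -1
```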
To index on any criteria, you can do something like the following:
In [1]: from numpy import *
In [2]: x = arange(125).reshape((5,5,5))
In [3]: y = indices(x.shape)
In [4]: locs = y[:,x >= 120] # put whatever you want in place of x >= 120
In [5]: pts = hsplit(locs, len(locs[0]))
In [6]: for pt in pts:
.....: print(', '.join(str(p[0]) for p in pt))
4, 4, 0
4, 4, 1
4, 4, 2
4, 4, 3
4, 4, 4
And here's a quick function to do what list.index() does, except doesn't raise an exception if it's not found. Beware -- this is probably very slow on large arrays. You can probably monkey patch this on to arrays if you'd rather use it as a method.
def ndindex(ndarray, item):
    if len(ndarray.shape) == 1:
        try:
            return [ndarray.tolist().index(item)]
        except:
            pass
    else:
        for i, subarray in enumerate(ndarray):
            try:
                return [i] + ndindex(subarray, item)
            except:
                pass
In [1]: ndindex(x, 103)
Out[1]: [4, 0, 3]
An alternative to selecting the first element from np.where() is to use a generator expression together with enumerate, such as:
>>> import numpy as np
>>> x = np.arange(100) # x = array([0, 1, 2, 3, ... 99])
>>> next(i for i, x_i in enumerate(x) if x_i == 2)
2
For a two dimensional array one would do:
>>> x = np.arange(100).reshape(10,10) # x = array([[0, 1, 2,... 9], [10,..19],])
>>> next((i,j) for i, x_i in enumerate(x)
... for j, x_ij in enumerate(x_i) if x_ij == 2)
(0, 2)
The advantage of this approach is that it stops checking the elements of the array after the first match is found, whereas np.where checks all elements for a match. A generator expression would be faster if there's a match early in the array.
There are lots of operations in NumPy that could perhaps be put together to accomplish this. This will return the indices of elements equal to item:
numpy.nonzero(array == item)
You could then take the first element of each returned index array to get the first match.
Comparison of 8 methods
TL;DR:
(Note: applicable to 1d arrays under 100M elements.)
For maximum performance use index_of__v5 (numba + numpy.ndenumerate + for loop; see the code below).
If numba is not available:
Use index_of__v7 (for loop + enumerate) if the target value is expected to be found within the first 100k elements.
Else use index_of__v2/v3/v4 (numpy.argmax or numpy.flatnonzero based).
Powered by perfplot
import numpy as np
from numba import njit
# Based on: numpy.argmax()
# Proposed by: John Haberstroh (https://stackoverflow.com/a/67497472/7204581)
def index_of__v1(arr: np.array, v):
    is_v = (arr == v)
    return is_v.argmax() if is_v.any() else -1
# Based on: numpy.argmax()
def index_of__v2(arr: np.array, v):
    return (arr == v).argmax() if v in arr else -1
# Based on: numpy.flatnonzero()
# Proposed by: 1'' (https://stackoverflow.com/a/42049655/7204581)
def index_of__v3(arr: np.array, v):
    idxs = np.flatnonzero(arr == v)
    return idxs[0] if len(idxs) > 0 else -1
# Based on: numpy.argmax()
def index_of__v4(arr: np.array, v):
    return np.r_[False, (arr == v)].argmax() - 1
# Based on: numba, for loop
# Proposed by: MSeifert (https://stackoverflow.com/a/41578614/7204581)
@njit
def index_of__v5(arr: np.array, v):
    for idx, val in np.ndenumerate(arr):
        if val == v:
            return idx[0]
    return -1
# Based on: numpy.ndenumerate(), for loop
def index_of__v6(arr: np.array, v):
    return next((idx[0] for idx, val in np.ndenumerate(arr) if val == v), -1)
# Based on: enumerate(), for loop
# Proposed by: Noyer282 (https://stackoverflow.com/a/40426159/7204581)
def index_of__v7(arr: np.array, v):
    return next((idx for idx, val in enumerate(arr) if val == v), -1)
# Based on: list.index()
# Proposed by: Hima (https://stackoverflow.com/a/23994923/7204581)
def index_of__v8(arr: np.array, v):
    l = list(arr)
    try:
        return l.index(v)
    except ValueError:
        return -1
The numpy_indexed package (disclaimer, I am its author) contains a vectorized equivalent of list.index for numpy.ndarray; that is:
sequence_of_arrays = [[0, 1], [1, 2], [-5, 0]]
arrays_to_query = [[-5, 0], [1, 0]]
import numpy_indexed as npi
idx = npi.indices(sequence_of_arrays, arrays_to_query, missing=-1)
print(idx) # [2, -1]
This solution has vectorized performance, generalizes to ndarrays, and has various ways of dealing with missing values.
There is a fairly idiomatic and vectorized way to do this built into numpy. It uses a quirk of the np.argmax() function to accomplish this -- if many values match, it returns the index of the first match. The trick is that for booleans, there will only ever be two values: True (1) and False (0). Therefore, the returned index will be that of the first True.
For the simple example provided, you can see it work with the following
>>> np.argmax(np.array([1,2,3]) == 2)
1
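One caveat with this trick: when nothing matches, every comparison is False and argmax still returns 0, which looks like a hit at the first position. A sketch of the problem and one possible guard follows; first_index is a hypothetical wrapper and -1 is an arbitrary "not found" value.

```python
import numpy as np

a = np.array([1, 2, 3])

print(int(np.argmax(a == 2)))   # 1, the first True
print(int(np.argmax(a == 9)))   # 0, even though 9 is not in the array

def first_index(arr, value):
    """Hypothetical wrapper that returns -1 when value is absent."""
    matches = arr == value
    return int(matches.argmax()) if matches.any() else -1

print(first_index(a, 9))        # -1
```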
A great example is computing buckets, e.g. for categorizing. Let's say you have an array of cut points, and you want the "bucket" that corresponds to each element of your array. The algorithm is to compute the first index of cuts where x < cuts (after padding cuts with np.inf). You can use broadcasting to broadcast the comparisons, then apply argmax along the cuts-broadcasted axis.
>>> cuts = np.array([10, 50, 100])
>>> cuts_pad = np.array([*cuts, np.inf])
>>> x = np.array([7, 11, 80, 443])
>>> bins = np.argmax(x[:, np.newaxis] < cuts_pad[np.newaxis, :], axis=1)
>>> print(bins)
[0 1 2 3]
As expected, each value from x falls into one of the sequential bins, with well-defined and easy to specify edge case behavior.
Another option not previously mentioned is the bisect module, which also works on lists, but requires a pre-sorted list/array:
import bisect
import numpy as np
z = np.array([104,113,120,122,126,138])
bisect.bisect_left(z, 122)
yields
3
bisect also returns a result when the number you're looking for doesn't exist in the array, so that the number can be inserted in the correct place.
Note: this is for python 2.7 version
You can use a lambda function to deal with the problem, and it works both on NumPy array and list.
your_list = [11, 22, 23, 44, 55]
result = filter(lambda x:your_list[x]>30, range(len(your_list)))
#result: [3, 4]
import numpy as np
your_numpy_array = np.array([11, 22, 23, 44, 55])
result = filter(lambda x: your_numpy_array[x] > 30, range(len(your_numpy_array)))
#result: [3, 4]
And you can use
result[0]
to get the first index of the filtered elements.
For python 3.6, use
list(result)
instead of
result
Use ndindex
Sample array
arr = np.array([[1,4],
[2,3]])
print(arr)
...[[1,4],
[2,3]]
create an empty list to store the index and the element tuples
index_elements = []
for i in np.ndindex(arr.shape):
    index_elements.append((arr[i],i))
convert the list of tuples into dictionary
index_elements = dict(index_elements)
The keys are the elements and the values are their
indices - use keys to access the index
index_elements[4]
output
... (0,1)
For my use case, I could not sort the array ahead of time because the order of the elements is important. This is my all-NumPy implementation:
import numpy as np
# The array in question
arr = np.array([1,2,1,2,1,5,5,3,5,9])
# Find all of the present values
vals=np.unique(arr)
# Make all indices up-to and including the desired index positive
cum_sum=np.cumsum(arr==vals.reshape(-1,1),axis=1)
# Add zeros to account for the n-1 shape of diff and the all-positive array of the first index
bl_mask=np.concatenate([np.zeros((cum_sum.shape[0],1)),cum_sum],axis=1)>=1
# The desired indices
idx=np.where(np.diff(bl_mask))[1]
# Show results
print(list(zip(vals,idx)))
>>> [(1, 0), (2, 1), (3, 7), (5, 5), (9, 9)]
I believe it accounts for unsorted arrays with duplicate values.
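For comparison, the same first-occurrence indices can be obtained with np.unique and its return_index option. This is a different technique from the cumsum/diff approach above, shown here only as a sketch on the same sample array.

```python
import numpy as np

arr = np.array([1, 2, 1, 2, 1, 5, 5, 3, 5, 9])

# return_index=True also returns the index of the first occurrence of each value
vals, first_idx = np.unique(arr, return_index=True)
pairs = [(int(v), int(i)) for v, i in zip(vals, first_idx)]

print(pairs)  # [(1, 0), (2, 1), (3, 7), (5, 5), (9, 9)]
```

The values come back sorted, while the indices still refer to positions in the original, unsorted array.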
Found another solution with loops:
new_array_of_indicies = []
for i in range(len(some_array)):
    if some_array[i] == some_value:
        new_array_of_indicies.append(i)
index_lst_form_numpy = pd.DataFrame(df).reset_index()["index"].tolist()
