Using numpy.binary_repr on an array of numbers, or alternatives - Python

Using the following code I am trying to convert a list of numbers into binary, but I am getting an error:
import numpy as np
lis = np.array([1,2,3,4,5,6,7,8,9])
a = np.binary_repr(lis, width=32)
The error after running the program is:
Traceback (most recent call last):
  File "", line 4, in <module>
    a = np.binary_repr(lis, width=32)
  File "C:\Users.......", in binary_repr
    if num == 0:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any way to fix this?

You can use np.vectorize to overcome this issue.
>>> lis = np.array([1,2,3,4,5,6,7,8,9])
>>> binary_repr_vec = np.vectorize(np.binary_repr)
>>> binary_repr_vec(lis, width=32)
array(['00000000000000000000000000000001',
       '00000000000000000000000000000010',
       '00000000000000000000000000000011',
       '00000000000000000000000000000100',
       '00000000000000000000000000000101',
       '00000000000000000000000000000110',
       '00000000000000000000000000000111',
       '00000000000000000000000000001000',
       '00000000000000000000000000001001'], dtype='<U32')

Approach #1
Here's a vectorized approach for an array of numbers, leveraging broadcasting -
def binary_repr_ar(A, W):
    # Bit-test each number against all W bit positions via broadcasting,
    # then view the resulting 0/1 bytes as fixed-width byte strings
    p = (((A[:,None] & (1 << np.arange(W-1,-1,-1)))!=0)).view('u1')
    return p.astype('S1').view('S'+str(W)).ravel()
Sample run -
In [67]: A
Out[67]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [68]: binary_repr_ar(A,32)
Out[68]:
array(['00000000000000000000000000000001',
       '00000000000000000000000000000010',
       '00000000000000000000000000000011',
       '00000000000000000000000000000100',
       '00000000000000000000000000000101',
       '00000000000000000000000000000110',
       '00000000000000000000000000000111',
       '00000000000000000000000000001000',
       '00000000000000000000000000001001'], dtype='|S32')
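For reference, here is the core broadcasting step in isolation (a small illustration I've added, using W=4, not part of the original answer):
import numpy as np

A = np.array([5, 6])
# Each number is ANDed against one bit per column, giving a (len(A), W)
# boolean matrix of its binary digits, most significant bit first
bits = (A[:, None] & (1 << np.arange(3, -1, -1))) != 0
print(bits.astype(int))
# [[0 1 0 1]   -> 5
#  [0 1 1 0]]  -> 6
The rest of binary_repr_ar just turns this boolean matrix into fixed-width byte strings.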
Approach #2
Another vectorized one with array-assignment -
def binary_repr_ar_v2(A, W):
    mask = (((A[:,None] & (1 << np.arange(W-1,-1,-1)))!=0))
    # 48 and 49 are the ASCII codes of '0' and '1'
    out = np.full((len(A),W), 48, dtype=np.uint8)
    out[mask] = 49
    return out.view('S'+str(W)).ravel()
Alternatively, use the mask directly to get the string array -
def binary_repr_ar_v3(A, W):
    mask = (((A[:,None] & (1 << np.arange(W-1,-1,-1)))!=0))
    # adding ASCII '0' (48) to the boolean mask gives 48/49, i.e. '0'/'1'
    return (mask+np.array([48],dtype=np.uint8)).view('S'+str(W)).ravel()
Note that the final output would be a view into one of the intermediate outputs. So, if you need it to have its own memory space, simply append .copy().
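For example (a sketch of that):
out = binary_repr_ar_v3(A, 32).copy()  # result now owns its memory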
Timings on a large sized input array -
In [49]: np.random.seed(0)
...: A = np.random.randint(1,1000,(100000))
...: W = 32
In [50]: %timeit binary_repr_ar(A, W)
...: %timeit binary_repr_ar_v2(A, W)
...: %timeit binary_repr_ar_v3(A, W)
1 loop, best of 3: 854 ms per loop
100 loops, best of 3: 14.5 ms per loop
100 loops, best of 3: 7.33 ms per loop
From other posted solutions -
In [22]: %timeit [np.binary_repr(i, width=32) for i in A]
10 loops, best of 3: 97.2 ms per loop
In [23]: %timeit np.frompyfunc(np.binary_repr,2,1)(A,32).astype('U32')
10 loops, best of 3: 80 ms per loop
In [24]: %timeit np.vectorize(np.binary_repr)(A, 32)
10 loops, best of 3: 69.8 ms per loop
On @Paul Panzer's solutions -
In [5]: %timeit bin_rep(A,32)
548 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit bin_rep(A,31)
2.2 ms ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As the documentation on binary_repr says:
num : int
    Only an integer decimal number can be used.
You can however vectorize this operation, like:
np.vectorize(np.binary_repr)(lis, 32)
this then gives us:
>>> np.vectorize(np.binary_repr)(lis, 32)
array(['00000000000000000000000000000001',
       '00000000000000000000000000000010',
       '00000000000000000000000000000011',
       '00000000000000000000000000000100',
       '00000000000000000000000000000101',
       '00000000000000000000000000000110',
       '00000000000000000000000000000111',
       '00000000000000000000000000001000',
       '00000000000000000000000000001001'], dtype='<U32')
or if you need this often, you can store the vectorized variant in a variable:
binary_repr_vector = np.vectorize(np.binary_repr)
binary_repr_vector(lis, 32)
Which of course gives the same result:
>>> binary_repr_vector = np.vectorize(np.binary_repr)
>>> binary_repr_vector(lis, 32)
array(['00000000000000000000000000000001',
       '00000000000000000000000000000010',
       '00000000000000000000000000000011',
       '00000000000000000000000000000100',
       '00000000000000000000000000000101',
       '00000000000000000000000000000110',
       '00000000000000000000000000000111',
       '00000000000000000000000000001000',
       '00000000000000000000000000001001'], dtype='<U32')

Here is a fast method using np.unpackbits
(np.unpackbits(lis.astype('>u4').view(np.uint8))+ord('0')).view('S32')
# array([b'00000000000000000000000000000001',
#        b'00000000000000000000000000000010',
#        b'00000000000000000000000000000011',
#        b'00000000000000000000000000000100',
#        b'00000000000000000000000000000101',
#        b'00000000000000000000000000000110',
#        b'00000000000000000000000000000111',
#        b'00000000000000000000000000001000',
#        b'00000000000000000000000000001001'], dtype='|S32')
More general:
def bin_rep(A, n):
    if n in (8, 16, 32, 64):
        # native widths: cast to big-endian, view as bytes, unpack the bits
        return (np.unpackbits(A.astype(f'>u{n>>3}').view(np.uint8))+ord('0')).view(f'S{n}')
    # otherwise use the next power-of-two byte width and keep the last n bits
    nb = max((n-1).bit_length()-3, 0)
    return (np.unpackbits(A.astype(f'>u{1<<nb}')[...,None].view(np.uint8), axis=1)[...,-n:]+ord('0')).ravel().view(f'S{n}')
Note: special-casing n = 8, 16, 32, 64 is absolutely worth it, since it gives a severalfold speedup for these widths.
Also note that this method maxes out at 2^64; larger ints require a different approach.
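For completeness, one hedged fallback for values beyond 2^64 (my addition, not from the answer above) is per-element Python formatting, which handles arbitrary precision at Python-loop speed:
# Python ints are arbitrary precision, so format() works at any width
wide = [2**100, 2**100 + 1]
out = [format(x, '0128b') for x in wide]  # 128-char binary strings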

In [193]: alist = [1,2,3,4,5,6,7,8,9]
np.vectorize is convenient, but not fast:
In [194]: np.vectorize(np.binary_repr)(alist, 32)
Out[194]:
array(['00000000000000000000000000000001',
       '00000000000000000000000000000010',
       '00000000000000000000000000000011',
       ....
       '00000000000000000000000000001001'], dtype='<U32')
In [195]: timeit np.vectorize(np.binary_repr)(alist, 32)
71.8 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
plain old list comprehension is better:
In [196]: [np.binary_repr(i, width=32) for i in alist]
Out[196]:
['00000000000000000000000000000001',
 '00000000000000000000000000000010',
 '00000000000000000000000000000011',
 ...
 '00000000000000000000000000001001']
In [197]: timeit [np.binary_repr(i, width=32) for i in alist]
11.5 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
another iterator:
In [200]: timeit np.frompyfunc(np.binary_repr,2,1)(alist,32).astype('U32')
30.1 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Python numpy array index slicing with n number of :(colons) [duplicate]

I want an easy-to-read way to access some parts of a multidimensional numpy array. For any array, accessing the first dimension is easy (b[index]). Accessing the sixth dimension, on the other hand, is "hard" (especially to read):
b[:,:,:,:,:,index]  # the next person to read the code will have to count the :
Is there a better way to do this?
In particular, is there a way where the axis is not known while writing the program?
Edit:
The indexed dimension is not necessarily the last dimension
You can use np.take.
For example:
b.take(index, axis=5)
If you want a view and want it fast you can just create the index manually:
arr[(slice(None), )*5 + (your_index, )]
# ^---- This is equivalent to 5 colons: `:, :, :, :, :`
Which is much faster than np.take and only marginally slower than indexing with :s:
import numpy as np
arr = np.random.random((10, 10, 10, 10, 10, 10, 10))
np.testing.assert_array_equal(arr[:,:,:,:,:,4], arr.take(4, axis=5))
np.testing.assert_array_equal(arr[:,:,:,:,:,4], arr[(slice(None), )*5 + (4, )])
%timeit arr.take(4, axis=5)
# 18.6 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr[(slice(None), )*5 + (4, )]
# 2.72 µs ± 39.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr[:, :, :, :, :, 4]
# 2.29 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But maybe not as readable, so if you need that often you probably should put it in a function with a meaningful name:
def index_axis(arr, index, axis):
    return arr[(slice(None), )*axis + (index, )]
np.testing.assert_array_equal(arr[:,:,:,:,:,4], index_axis(arr, 4, axis=5))
%timeit index_axis(arr, 4, axis=5)
# 3.79 µs ± 127 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
An intermediate way (in readability and time) between the answers of MSeifert and kazemakase is using np.rollaxis:
np.rollaxis(b, axis=5)[index]
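(NumPy's documentation recommends np.moveaxis over rollaxis these days; the equivalent selection would be:)
np.moveaxis(b, 5, 0)[index]  # move axis 5 to the front, then index it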
Testing the solutions:
import numpy as np
arr = np.random.random((10, 10, 10, 10, 10, 10, 10))
np.testing.assert_array_equal(arr[:,:,:,:,:,4], arr.take(4, axis=5))
np.testing.assert_array_equal(arr[:,:,:,:,:,4], arr[(slice(None), )*5 + (4, )])
np.testing.assert_array_equal(arr[:,:,:,:,:,4], np.rollaxis(arr, 5)[4])
%timeit arr.take(4, axis=5)
# 100 loops, best of 3: 4.44 ms per loop
%timeit arr[(slice(None), )*5 + (4, )]
# 1000000 loops, best of 3: 731 ns per loop
%timeit arr[:, :, :, :, :, 4]
# 1000000 loops, best of 3: 540 ns per loop
%timeit np.rollaxis(arr, 5)[4]
# 100000 loops, best of 3: 3.41 µs per loop
In the spirit of @Jürg Merlin Spaak's rollaxis, but much faster and not deprecated:
b.swapaxes(0, axis)[index]
(One caveat: unlike rollaxis, swapaxes drops the original leading axis into the indexed axis' slot, so the remaining axes come out in a different order; transpose afterwards if the order matters.)
You can say:
slice = b[..., index]
Note that the Ellipsis fills in the leading axes, so this always selects along the last axis; the question's edit says the indexed dimension isn't necessarily the last one. Also, slice shadows a builtin name.

Need help to speed up this code - Python and numpy

I have a function which processes an input array of dimension (h,w,200) (the number 200 can vary) and returns an array of dimension (h,w,50,3). The function takes ~0.8 seconds for an input array of size 512,512,200.
def myfunc(arr, n=50):
    # shape of arr is (h,w,200)
    # output shape is (h,w,50,3)
    # a1 is an array of length 50; I get it from a different function,
    # which doesn't take much time. For simplicity, I fix it
    # as np.arange(0,50)
    a1 = np.arange(0, 50)
    output = np.stack((arr[:,:,a1],)*3, axis=-1)
    return output
This preprocessing step is done to ~8 arrays in a single batch, due to which loading a batch of data takes 8*0.8 = 6.4 seconds. Is there a way to speed up the computation of myfunc? Can I use libraries like numba for this?
I get about the same time:
In [14]: arr = np.ones((512,512,200))
In [15]: timeit output = np.stack((arr[:,:,np.arange(50)],)*3, axis=-1)
681 ms ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: np.stack((arr[:,:,np.arange(50)],)*3, axis=-1).shape
Out[16]: (512, 512, 50, 3)
Looking at the timings in more detail.
First, the index/copy step takes about 1/3 of the time:
In [17]: timeit arr[:,:,np.arange(50)]
249 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
And the stack:
In [18]: %%timeit temp = arr[:,:,np.arange(50)]
...: output = np.stack([temp,temp,temp], axis=-1)
...:
...:
426 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
stack expands dimensions and then concatenates; so let's call concatenate directly:
In [19]: %%timeit temp = arr[:,:,np.arange(50),None]
...: output = np.concatenate([temp,temp,temp], axis=-1)
...:
...:
430 ms ± 8.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
another approach is to use repeat:
In [20]: %%timeit temp = arr[:,:,np.arange(50),None]
...: output = np.repeat(temp, 3, axis=-1)
...:
...:
531 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it looks like your code is about as good as it gets.
Indexing and concatenate already use compiled code, so I don't expect numba to help much (not that I have much experience with it).
Stacking on a new front axis is faster (making (3, 512, 512, 50))
In [21]: %%timeit temp = arr[:,:,np.arange(50)]
...: output = np.stack([temp,temp,temp])
...:
...:
254 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That could then be transposed (cheaply), though subsequent operations might be slower (if they require a copy and/or reordering). A plain copy of the full output array times at around 350 ms.
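For instance, a sketch of that cheap transpose back to the requested layout (my addition):
temp = arr[:,:,np.arange(50)]
# transpose returns a view; the data is only reordered if later copied
output = np.stack([temp, temp, temp]).transpose(1, 2, 3, 0)  # (512, 512, 50, 3)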
Inspired by comments, I tried broadcasted assignment:
In [101]: %%timeit temp = arr[:,:,np.arange(50)]
...: res = np.empty(temp.shape + (3,), temp.dtype)
...: res[...] = temp[...,None]
...:
...:
...:
337 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Same ball park.
Another trick is to play with strides to make a 'virtual' copy:
In [74]: res1 = np.broadcast_to(arr, (3,)+arr.shape)
In [75]: res1.shape
Out[75]: (3, 512, 512, 200)
In [76]: res1.strides
Out[76]: (0, 819200, 1600, 8)
This does not work directly with (512,512,200,3), since broadcasting aligns shapes from the trailing axes: the existing 200 would have to line up against the 3, so arr would first need a trailing length-1 axis. Maybe someone can experiment with as_strided.
Though I can transpose this just fine:
np.broadcast_to(arr, (3,)+arr.shape).transpose(1,2,3,0)
In any case this is much faster:
In [82]: timeit res1 = np.broadcast_to(arr, (3,)+arr.shape)
10.4 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
(but making a copy brings time back up.)
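Building on that suggestion, here is a hedged sketch of getting the trailing-axis virtual copy directly (my addition, either via as_strided or by adding the trailing length-1 axis first); the result aliases arr's memory, so treat it as read-only:
import numpy as np
from numpy.lib.stride_tricks import as_strided

arr = np.ones((512, 512, 200))
# A zero stride on the new trailing axis repeats the data without copying
res = as_strided(arr, shape=arr.shape + (3,), strides=arr.strides + (0,))
# Equivalently, broadcast_to works once the trailing axis is added:
res2 = np.broadcast_to(arr[..., None], arr.shape + (3,))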

Apply a custom function to multidimensional numpy array, keeping the same shape

Basically I want to map over each value of a multidimensional numpy array. The output should have the same shape as the input.
This is the way I did it:
def f(x):
    return x*x

input = np.arange(3*4*5).reshape(3,4,5)
output = np.array(list(map(f, input)))
print(output)
It works, but it feels a bit too complicated (np.array, list, map). Is there a more elegant solution?
Just call your function on the array:
f(input)
Also, better not to use the name input for your variable as it is a builtin:
import numpy as np

def f(x):
    return x*x

arr = np.arange(3*4*5).reshape(3,4,5)
print(np.alltrue(f(arr) == np.array(list(map(f, arr)))))
Output:
True
If the function is more complex:
def f(x):
    return x+1 if x%2 else 2*x
use vectorize:
np.vectorize(f)(arr)
Better yet, always try to use vectorized NumPy functions such as np.where:
>>> np.alltrue(np.vectorize(f)(arr) == np.where(arr % 2, arr + 1, arr * 2))
True
The native NumPy version is considerably faster:
%%timeit
np.vectorize(f)(arr)
34 µs ± 996 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
np.where(arr % 2, arr + 1, arr * 2)
5.16 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is much more pronounced for larger arrays:
big_arr = np.arange(30 * 40 * 50).reshape(30, 40, 50)
%%timeit
np.vectorize(f)(big_arr)
15.5 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
np.where(big_arr % 2, big_arr + 1, big_arr * 2)
797 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Count unique elements along an axis of a NumPy array

I have a three-dimensional array like
A = np.array([[[1,1],
               [1,0]],
              [[1,2],
               [1,0]],
              [[1,0],
               [0,0]]])
Now I would like to obtain an array that has a nonzero value in a given position if only a unique nonzero value (or zero) occurs in that position. It should have zero if only zeros or more than one nonzero value occur in that position. For the example above, I would like
[[1,0],
[1,0]]
since
in A[:,0,0] there are only 1s
in A[:,0,1] there are 0, 1 and 2, so more than one nonzero value
in A[:,1,0] there are 0 and 1, so 1 is retained
in A[:,1,1] there are only 0s
I can find how many nonzero elements there are with np.count_nonzero(A, axis=0), but I would like to keep 1s or 2s even if there are several of them. I looked at np.unique but it doesn't seem to support what I'd like to do.
Ideally, I'd like a function like np.count_unique(A, axis=0) which would return an array in the original shape, e.g. [[1, 3],[2, 1]], so I could check whether 3 or more occur and then ignore that position.
All I could come up with was a list comprehension iterating over the two dimensions of the array that I'd like to obtain:
[[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]
Any other ideas?
You can use np.diff to stay at numpy level for the second task (counting unique values): after sorting along the first axis, equal values become adjacent, so each strictly positive step marks a new value, and the number of steps plus one is the number of unique values.
def diffcount(A):
    B = A.copy()
    B.sort(axis=0)               # equal values become adjacent
    C = np.diff(B, axis=0) > 0   # True at each step to a new value
    D = C.sum(axis=0) + 1        # distinct values per position
    return D
# [[1 3]
#  [2 1]]
It seems to be a little faster on big arrays:
In [62]: A=np.random.randint(0,100,(100,100,100))
In [63]: %timeit diffcount(A)
46.8 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [64]: %timeit [[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]
149 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Finally, counting unique values is simpler than sorting, so a ln(A.shape[0]) factor could be won. One way to try to win this factor is to use Python's set mechanism:
In [81]: %timeit np.apply_along_axis(lambda a: len(set(a)), 0, A)
183 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this is not faster.
Another way is to do it by hand:
def countunique(A, Amax):
    res = np.empty(A.shape[1:], A.dtype)
    c = np.empty(Amax+1, A.dtype)
    for i in range(A.shape[1]):
        for j in range(A.shape[2]):
            T = A[:,i,j]
            for k in range(c.size): c[k] = 0   # reset the flags
            for x in T:
                c[x] = 1                       # flag each value that occurs
            res[i,j] = c.sum()                 # count the flags
    return res
At python level:
In [70]: %timeit countunique(A,100)
429 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is not so bad for a pure Python approach. Then just shift this code to low level with numba:
import numba
countunique2=numba.jit(countunique)
In [71]: %timeit countunique2(A,100)
3.63 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Which will be difficult to improve a lot.
One approach would be to use A as first-axis indices for setting a boolean array along the other two axes, and then simply count the non-zeros along its first axis. Two variants are possible: one keeps it 3D, and the other reshapes into 2D for some performance benefit, as indexing into 2D is faster. Thus, the two implementations would be -
def nunique_axis0_maskcount_app1(A):
    m,n = A.shape[1:]
    mask = np.zeros((A.max()+1,m,n), dtype=bool)
    mask[A, np.arange(m)[:,None], np.arange(n)] = 1
    return mask.sum(0)

def nunique_axis0_maskcount_app2(A):
    m,n = A.shape[1:]
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.zeros((maxn,N), dtype=bool)
    mask[A, np.arange(N)] = 1
    A.shape = (-1,m,n)
    return mask.sum(0).reshape(m,n)
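As an aside, here is the scatter idea on a tiny single-column example (an illustration I've added, not from the original answer):
col = np.array([3, 1, 3, 2])
# Values become row indices into a boolean table; summing each column
# then counts how many distinct values were scattered in
mask = np.zeros((col.max()+1, 1), dtype=bool)
mask[col, 0] = True
print(mask.sum(0))  # [3] -> three distinct values: 1, 2, 3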
Runtime test -
In [154]: A = np.random.randint(0,100,(100,100,100))
# @B. M.'s soln
In [155]: %timeit f(A)
10 loops, best of 3: 28.3 ms per loop
# @B. M.'s soln using slicing : (B[1:] != B[:-1]).sum(0)+1
In [156]: %timeit f2(A)
10 loops, best of 3: 26.2 ms per loop
In [157]: %timeit nunique_axis0_maskcount_app1(A)
100 loops, best of 3: 12 ms per loop
In [158]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 9.14 ms per loop
Numba method
Using the same strategy as nunique_axis0_maskcount_app2, but getting the counts directly at C level with numba, we would have -
from numba import njit

@njit
def nunique_loopy_func(mask, N, A, p, count):
    for j in range(N):
        mask[:] = True
        mask[A[0,j]] = False
        c = 1
        for i in range(1,p):
            if mask[A[i,j]]:
                c += 1
                mask[A[i,j]] = False
        count[j] = c
    return count
def nunique_axis0_numba(A):
    p,m,n = A.shape
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.empty(maxn, dtype=bool)
    count = np.empty(N, dtype=int)
    out = nunique_loopy_func(mask, N, A, p, count).reshape(m,n)
    A.shape = (-1,m,n)
    return out
Runtime test -
In [328]: np.random.seed(0)
In [329]: A = np.random.randint(0,100,(100,100,100))
In [330]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 11.1 ms per loop
# @B.M.'s numba soln
In [331]: %timeit countunique2(A,A.max()+1)
100 loops, best of 3: 3.43 ms per loop
# Numba soln posted in this post
In [332]: %timeit nunique_axis0_numba(A)
100 loops, best of 3: 2.76 ms per loop

Find element's index in pandas Series

I know this is a very basic question, but for some reason I can't find an answer. How can I get the index of a certain element of a Series in pandas? (The first occurrence would suffice.)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None

print(find(myseries, 7))
but I assume there should be a better way. Is there?
>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
Though I admit that there should be a better way to do that, this at least avoids iterating and looping through the object and moves it to the C level.
Converting to an Index, you can use get_loc
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: Index(myseries).get_loc(7)
Out[3]: 3
In [4]: Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It will return a boolean array if the matches are non-contiguous:
In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
It uses a hashtable internally, so it's fast:
In [7]: s = Series(randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, there is a one-time creation overhead to creating an index (it's incurred when you actually DO something with the index, e.g. check is_unique):
In [2]: s = Series(randint(0,10,10000))
In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements, and assumed the general case where the index could contain any values, and where the index value you want corresponds to a search value near the end of the series.
Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
   ...:         9500, 6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425,
   ...:         300, 212, 150, 106, 75, 53, 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: assert(myseries[21] == 150)
In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.
Correction: sorry, I missed one; @Alex Spangher's solution using the list index method is by far the fastest.
Update: added @EliadL's answer.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions, and that many are so slow: over half a millisecond in some cases to find a value in a series of 25.
2022-02-18 Update
Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have significantly reduced (10 to 70%) compared to the previous tests (version 0.25.3).
Plus: Added two more methods utilizing dictionaries.
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know 7 is there in advance. You can check this with
(myseries==7).any()
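Combining the two (a small sketch):
idx = (myseries == 7).argmax() if (myseries == 7).any() else None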
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
Another way to do this, although equally unsatisfying, is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns:
3
In timing tests using a current dataset I'm working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
If you use numpy, you can get an array of the indices at which your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
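To turn that into the first matching index label (the pattern also used in the timing summary above):
myseries.index[np.where(myseries == 7)[0][0]]  # -> 3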
You can use Series.idxmax():
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
Note that idxmax returns the index of the maximum value, so this only finds 7 because 7 happens to be the largest element here; for a general lookup, use (myseries == 7).idxmax(), as in the argmax answer above.
This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64
Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
should return the correct index, assuming the value exists in the Series.
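If the value may be absent, list.index raises ValueError, so a small guard helps (a sketch):
try:
    idx = myseries.index[myseries.tolist().index(7)]
except ValueError:
    idx = None  # 7 not present in the Series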
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
Pandas has a builtin class Index with a function called get_loc. This function will return one of:
index (the element's index)
slice (if the specified value occurs in a contiguous run)
array (a boolean array, if the value is at multiple non-contiguous indexes)
Example:
import pandas as pd
>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns index
3 # Index of 10 in series
>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns slice
slice(3, 6, None) # 10 occurs at index 3 (included) to 6 (not included)
# If the data is not in a contiguous run, it returns an array of bools.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False, True, False, False, True, True, True, False, True])
There are many other options too, but I found this one very simple to use.
The df.index method will help you find the exact row number:
my_fl2 = (df['ConvertedCompYearly'] == 45241312)
print(df[my_fl2].index)
# Name: ConvertedCompYearly, dtype: float64
# Int64Index([66910], dtype='int64')
