I know this is a very basic question, but for some reason I can't find an answer. How can I get the index of a certain element of a Series in pandas? (The first occurrence would suffice.)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
print(myseries.find(7))  # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None

print(find(myseries, 7))
but I assume there should be a better way. Is there?
>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
I admit there may be a better way to do this, but it at least avoids iterating through the object in Python and moves the work to the C level.
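If the value might be absent, a small guard (this is my own sketch, not part of the answer above) avoids the IndexError from index[0]:
import pandas as pd

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
matches = myseries[myseries == 7].index
first_index = matches[0] if len(matches) else None  # None when the value is absent
print(first_index)  # 3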
Converting to an Index, you can use get_loc
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: Index(myseries).get_loc(7)
Out[3]: 3
In [4]: Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It will return a boolean array if the matches are non-contiguous:
In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
It uses a hashtable internally, so it's fast:
In [7]: s = Series(randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, there is a one-time overhead to creating an index (it's incurred when you actually do something with the index, e.g. check is_unique):
In [2]: s = Series(randint(0,10,10000))
In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
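So if you will be doing many lookups against the same Series, it can pay to build the Index once and reuse it; a rough sketch (variable names are mine):
import pandas as pd
from numpy.random import randint

s = pd.Series(randint(0, 10, 10000))
idx = pd.Index(s)  # pay the construction cost once
positions = [idx.get_loc(v) for v in (3, 5, 7)]  # later lookups hit the hashtable
# (here each lookup returns a boolean mask, since values repeat non-contiguously)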
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain arbitrary values, and where you want the index label corresponding to a search value near the end of the series.
Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
   ...:         9500, 6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425,
   ...:         300, 212, 150, 106, 75, 53, 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: assert(myseries[21] == 150)
In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.
Correction: sorry, I missed one; @Alex Spangher's solution using the list index method is by far the fastest.
Update: added @EliadL's answer.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions, and that many are so slow: over half a millisecond in some cases to find a value in a series of 25.
2022-02-18 Update
Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have been reduced significantly (by 10 to 70%) compared to the previous tests (version 0.25.3).
Plus: Added two more methods utilizing dictionaries.
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know in advance that 7 is present. You can check that with
(myseries==7).any()
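Putting the two together, a guarded sketch (the naming is mine; note that argmax returns a position, so use (myseries == 7).idxmax() if you want the index label instead):
mask = myseries == 7
pos = mask.argmax() if mask.any() else None  # position of the first 7, or None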
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
Another way to do this, although equally unsatisfying, is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns:
3
Timing tests using a current dataset I'm working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
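One caveat (my note): list.index raises a ValueError when the value is absent, and it returns a position rather than an index label, so a guard plus a label lookup may be needed:
import pandas as pd

s = pd.Series([1, 3, 0, 7, 5], index=[0, 1, 2, 3, 4])
try:
    label = s.index[list(s).index(7)]  # convert the position to an index label
except ValueError:
    label = None                       # value not present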
If you use numpy, you can get an array of the indices at which your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
You can use Series.idxmax(). Note, though, that idxmax returns the label of the maximum value, so this works here only because 7 happens to be the largest element in the series; for an arbitrary value the idiom is (myseries == 7).idxmax().
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>>
This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64
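If values repeat, this reverse lookup returns every matching label; a sketch (my own example) of taking just the first:
import pandas as pd

s = pd.Series([1, 7, 0, 7, 5], index=list('abcde'))
rev = pd.Series(s.index, index=s.values)
print(rev[7])          # both matching labels, 'b' and 'd'
print(rev[7].iloc[0])  # first occurrence only: 'b'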
Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
should return the correct index, assuming the value exists in the Series.
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
Pandas has a built-in class Index with a function called get_loc. This function will return one of:
an integer (the element's position)
a slice (if the specified value occurs in a contiguous run)
a boolean array (if the value occurs at multiple non-contiguous positions)
Example:
import pandas as pd
>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns index
3 # Index of 10 in series
>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns slice
slice(3, 6, None) # 10 occurs at index 3 (included) to 6 (not included)
# If the data is not in sequence then it would return an array of bool's.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False,  True, False, False,  True,  True,  True, False,  True])
There are many other options too, but I found this one very simple to use.
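If you always want the first position regardless of which of the three forms comes back, a small normalizing helper (entirely my own sketch, not part of the original answer) could look like this:
import numpy as np
import pandas as pd

def first_loc(series, value):
    # Return the position of the first occurrence of value, or None if absent.
    try:
        loc = pd.Index(series).get_loc(value)
    except KeyError:
        return None
    if isinstance(loc, slice):       # contiguous duplicates
        return loc.start
    if isinstance(loc, np.ndarray):  # non-contiguous duplicates (boolean mask)
        return int(np.flatnonzero(loc)[0])
    return loc                       # unique value: a plain integer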
df.index will help you find the exact row number:
my_fl2 = (df['ConvertedCompYearly'] == 45241312)
print(df[my_fl2].index)
Int64Index([66910], dtype='int64')
I want an easy-to-read way to access parts of a multidimensional numpy array. For any array, accessing the first dimension is easy (b[index]). Accessing the sixth dimension, on the other hand, is "hard" (especially to read):
b[:,:,:,:,:,index] #the next person to read the code will have to count the :
Is there a better way to do this?
In particular, is there a way where the axis is not known while writing the program?
Edit:
The indexed dimension is not necessarily the last dimension
You can use np.take.
For example:
b.take(index, axis=5)
If you want a view and want it fast you can just create the index manually:
arr[(slice(None), )*5 + (your_index, )]
# ^---- This is equivalent to 5 colons: `:, :, :, :, :`
Which is much faster than np.take and only marginally slower than indexing with :s:
import numpy as np
arr = np.random.random((10, 10, 10, 10, 10, 10, 10))
np.testing.assert_array_equal(arr[:,:,:,:,:,4], arr.take(4, axis=5))
np.testing.assert_array_equal(arr[:,:,:,:,:,4], arr[(slice(None), )*5 + (4, )])
%timeit arr.take(4, axis=5)
# 18.6 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr[(slice(None), )*5 + (4, )]
# 2.72 µs ± 39.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr[:, :, :, :, :, 4]
# 2.29 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But maybe not as readable, so if you need that often you probably should put it in a function with a meaningful name:
def index_axis(arr, index, axis):
    return arr[(slice(None), )*axis + (index, )]
np.testing.assert_array_equal(arr[:,:,:,:,:,4], index_axis(arr, 4, axis=5))
%timeit index_axis(arr, 4, axis=5)
# 3.79 µs ± 127 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
An intermediate way (in readability and time) between the answers of MSeifert and kazemakase is using np.rollaxis:
np.rollaxis(b, axis=5)[index]
Testing the solutions:
import numpy as np
arr = np.random.random((10, 10, 10, 10, 10, 10, 10))
np.testing.assert_array_equal(arr[:,:,:,:,:,4], arr.take(4, axis=5))
np.testing.assert_array_equal(arr[:,:,:,:,:,4], arr[(slice(None), )*5 + (4, )])
np.testing.assert_array_equal(arr[:,:,:,:,:,4], np.rollaxis(arr, 5)[4])
%timeit arr.take(4, axis=5)
# 100 loops, best of 3: 4.44 ms per loop
%timeit arr[(slice(None), )*5 + (4, )]
# 1000000 loops, best of 3: 731 ns per loop
%timeit arr[:, :, :, :, :, 4]
# 1000000 loops, best of 3: 540 ns per loop
%timeit np.rollaxis(arr, 5)[4]
# 100000 loops, best of 3: 3.41 µs per loop
In the spirit of @Jürg Merlin Spaak's rollaxis but much faster and not deprecated:
b.swapaxes(0, axis)[index]
One caveat: unlike rollaxis, swapaxes exchanges the two axes, so the original first axis ends up where axis was; if you need the remaining axes in their original order, np.moveaxis(b, axis, 0)[index] preserves it.
You can say:
sliced = b[..., index]
Note, though, that the Ellipsis form only selects along the last dimension, which the question's edit says is not guaranteed (and calling the result slice would shadow the builtin).
Using the following code I am trying to convert a list of numbers into binary numbers, but I am getting an error:
import numpy as np
lis=np.array([1,2,3,4,5,6,7,8,9])
a=np.binary_repr(lis,width=32)
The error after running the program is:
Traceback (most recent call last):
  File "", line 4, in
    a = np.binary_repr(lis, width=32)
  File "C:\Users.......", in binary_repr
    if num == 0:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any way to fix this?
You can use np.vectorize to overcome this issue.
>>> lis = np.array([1,2,3,4,5,6,7,8,9])
>>> binary_repr_vec = np.vectorize(np.binary_repr)
>>> binary_repr_vec(lis, width=32)
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
'00000000000000000000000000000100',
'00000000000000000000000000000101',
'00000000000000000000000000000110',
'00000000000000000000000000000111',
'00000000000000000000000000001000',
'00000000000000000000000000001001'], dtype='<U32')
Approach #1
Here's a vectorized one for an array of numbers, leveraging broadcasting -
def binary_repr_ar(A, W):
    p = ((A[:, None] & (1 << np.arange(W - 1, -1, -1))) != 0).view('u1')
    return p.astype('S1').view('S' + str(W)).ravel()
Sample run -
In [67]: A
Out[67]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [68]: binary_repr_ar(A,32)
Out[68]:
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
'00000000000000000000000000000100',
'00000000000000000000000000000101',
'00000000000000000000000000000110',
'00000000000000000000000000000111',
'00000000000000000000000000001000',
'00000000000000000000000000001001'], dtype='|S32')
Approach #2
Another vectorized one with array-assignment -
def binary_repr_ar_v2(A, W):
    mask = (A[:, None] & (1 << np.arange(W - 1, -1, -1))) != 0
    out = np.full((len(A), W), 48, dtype=np.uint8)  # 48 is ASCII '0'
    out[mask] = 49                                  # 49 is ASCII '1'
    return out.view('S' + str(W)).ravel()
Alternatively, use the mask directly to get the string array -
def binary_repr_ar_v3(A, W):
    mask = (A[:, None] & (1 << np.arange(W - 1, -1, -1))) != 0
    return (mask + np.array([48], dtype=np.uint8)).view('S' + str(W)).ravel()
Note that the final output would be a view into one of the intermediate outputs. So, if you need it to have its own memory space, simply append .copy().
Timings on a large sized input array -
In [49]: np.random.seed(0)
...: A = np.random.randint(1,1000,(100000))
...: W = 32
In [50]: %timeit binary_repr_ar(A, W)
...: %timeit binary_repr_ar_v2(A, W)
...: %timeit binary_repr_ar_v3(A, W)
1 loop, best of 3: 854 ms per loop
100 loops, best of 3: 14.5 ms per loop
100 loops, best of 3: 7.33 ms per loop
From other posted solutions -
In [22]: %timeit [np.binary_repr(i, width=32) for i in A]
10 loops, best of 3: 97.2 ms per loop
In [23]: %timeit np.frompyfunc(np.binary_repr,2,1)(A,32).astype('U32')
10 loops, best of 3: 80 ms per loop
In [24]: %timeit np.vectorize(np.binary_repr)(A, 32)
10 loops, best of 3: 69.8 ms per loop
On @Paul Panzer's solutions -
In [5]: %timeit bin_rep(A,32)
548 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit bin_rep(A,31)
2.2 ms ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As the documentation on binary_repr says:
num : int
    Only an integer decimal number can be used.
You can however vectorize this operation, like:
np.vectorize(np.binary_repr)(lis, 32)
this then gives us:
>>> np.vectorize(np.binary_repr)(lis, 32)
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
'00000000000000000000000000000100',
'00000000000000000000000000000101',
'00000000000000000000000000000110',
'00000000000000000000000000000111',
'00000000000000000000000000001000',
'00000000000000000000000000001001'], dtype='<U32')
or if you need this often, you can store the vectorized variant in a variable:
binary_repr_vector = np.vectorize(np.binary_repr)
binary_repr_vector(lis, 32)
Which of course gives the same result:
>>> binary_repr_vector = np.vectorize(np.binary_repr)
>>> binary_repr_vector(lis, 32)
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
'00000000000000000000000000000100',
'00000000000000000000000000000101',
'00000000000000000000000000000110',
'00000000000000000000000000000111',
'00000000000000000000000000001000',
'00000000000000000000000000001001'], dtype='<U32')
Here is a fast method using np.unpackbits:
(np.unpackbits(lis.astype('>u4').view(np.uint8))+ord('0')).view('S32')
# array([b'00000000000000000000000000000001',
# b'00000000000000000000000000000010',
# b'00000000000000000000000000000011',
# b'00000000000000000000000000000100',
# b'00000000000000000000000000000101',
# b'00000000000000000000000000000110',
# b'00000000000000000000000000000111',
# b'00000000000000000000000000001000',
# b'00000000000000000000000000001001'], dtype='|S32')
More general:
def bin_rep(A, n):
    if n in (8, 16, 32, 64):
        return (np.unpackbits(A.astype(f'>u{n>>3}').view(np.uint8)) + ord('0')).view(f'S{n}')
    nb = max((n - 1).bit_length() - 3, 0)
    return (np.unpackbits(A.astype(f'>u{1<<nb}')[..., None].view(np.uint8), axis=1)[..., -n:] + ord('0')).ravel().view(f'S{n}')
Note: special-casing n = 8, 16, 32, 64 is absolutely worth it, since it gives a severalfold speedup for those widths.
Also note that this method maxes out at 2^64; larger ints require a different approach.
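A quick usage sketch (my own check, assuming the bin_rep definition above is in scope):
import numpy as np

lis = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(bin_rep(lis, 32)[:2])  # fast path: width is one of the special-cased sizes
print(bin_rep(lis, 12)[:2])  # general path: arbitrary widths work too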
In [193]: alist = [1,2,3,4,5,6,7,8,9]
np.vectorize is convenient, but not fast:
In [194]: np.vectorize(np.binary_repr)(alist, 32)
Out[194]:
array(['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
....
'00000000000000000000000000001001'], dtype='<U32')
In [195]: timeit np.vectorize(np.binary_repr)(alist, 32)
71.8 µs ± 1.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
plain old list comprehension is better:
In [196]: [np.binary_repr(i, width=32) for i in alist]
Out[196]:
['00000000000000000000000000000001',
'00000000000000000000000000000010',
'00000000000000000000000000000011',
...
'00000000000000000000000000001001']
In [197]: timeit [np.binary_repr(i, width=32) for i in alist]
11.5 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
another iterator:
In [200]: timeit np.frompyfunc(np.binary_repr,2,1)(alist,32).astype('U32')
30.1 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Let's say that I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'n': [1, 2, 3], 'm': [4, 4, 7]})
df.loc[df['m']==4,'n']=1
Running this .loc function on a relatively small dataset (~50,000 int32 samples) is taking 11ms. Is there any way I can speed this up? I'm hoping to get the same operation down to between 10-100μs.
Update
I've edited the above example to be a bit more concise.
After testing the suggested methods, the fastest was:
df['n'].values[df['m'].values == 4] = 1
After applying it to a ~50,000 sample data set, this solution ran 244 times faster than the original code.
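One caution (my note, not from the original post): writing through .values mutates the underlying NumPy buffer directly and bypasses pandas' bookkeeping; on newer pandas versions with copy-on-write enabled, such in-place mutation is not guaranteed to propagate back to the DataFrame, so verify it on your version:
import pandas as pd

df = pd.DataFrame({'n': [1, 2, 3], 'm': [4, 4, 7]})
df['n'].values[df['m'].values == 4] = 1  # in-place write through the NumPy buffer
print(df)  # confirm that column 'n' actually changed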
You could use np.where for a more efficient solution:
df = pd.DataFrame({'numbers': np.random.choice(range(5), 100_000),
'more_numbers': np.random.choice(range(5), 100_000)})
%timeit df.loc[df.more_numbers==4,'numbers']=1
7.09 ms ± 658 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.where(df.more_numbers == 4, 1, df.numbers)
547 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So you could instead do:
df.numbers = np.where(df.more_numbers == 4, 1, df.numbers)
There are many approaches. You may wish to consider modifying the underlying NumPy array. However, this is not a documented or officially recommended method.
# Python 3.6.5, Pandas 0.19.2, NumPy 1.11.4
np.random.seed(0)
df = pd.DataFrame({'n': np.random.randint(0, 10, 10**5),
'm': np.random.randint(0, 10, 10**5)})
%timeit df.loc[df['m'] == 4, 'n'] = 1 # 1.3 ms
%timeit df['n'].values[df['m'].values == 4] = 1 # 436 µs
%timeit df['n'] = np.where(df['m'].values == 4, 1, df['n']) # 751 µs
%timeit df.iloc[df['m'].values == 4, df.columns.get_loc('n')] = 1 # 880 µs
%timeit df.loc[df['m'].values == 4, 'n'] = 1 # 1.12 ms
%timeit df['n'] = df['n'].mask(df['m'].values == 4, 1) # 1.34 ms
So just do it with values:
%timeit df.values[df['more_numbers']==4,0]=1
10000 loops, best of 3: 127 µs per loop
%timeit df.loc[df['more_numbers']==4,'numbers']=1
1000 loops, best of 3: 692 µs per loop
You can have a look at np.where():
df.numbers = np.where(df['more_numbers'] == 4, 1, df.numbers)
Suppose I have a DataFrame such as:
df = pd.DataFrame(np.random.randn(10,5), columns = ['a','b','c','d','e'])
and I would like to retrieve the last value in column e. I could do:
df['e'].tail(1)
but this would return a Series that still carries its index (9). Ideally, I just want to obtain the value as a number that I can work with directly. I could also do:
np.array(df['e'].tail(1))
but this would then require me to access/call the 0'th element of it before I can really work with it. Is there a more direct/easy way to do this?
You could try the iloc method of the DataFrame:
In [26]: df
Out[26]:
a b c d e
0 -1.079547 -0.722903 0.457495 -0.687271 -0.787058
1 1.326133 1.359255 -0.964076 -1.280502 1.460792
2 0.479599 -1.465210 -0.058247 -0.984733 -0.348068
3 -0.608238 -1.238068 -0.126889 0.572662 -1.489641
4 -1.533707 -0.218298 -0.877619 0.679370 0.485987
5 -0.864651 -0.180165 -0.528939 0.270885 1.313946
6 0.747612 -1.206509 0.616815 -1.758354 -0.158203
7 -2.309582 -0.739730 -0.004303 0.125640 -0.973230
8 1.735822 -0.750698 1.225104 0.431583 -1.483274
9 -0.374557 -1.132354 0.875028 0.032615 -1.131971
In [27]: df['e'].iloc[-1]
Out[27]: -1.1319705662711321
Or if you want just a scalar, you could use iat, which is faster. From the docs:
If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures
In [28]: df.e.iat[-1]
Out[28]: -1.1319705662711321
Benchmarking:
In [31]: %timeit df.e.iat[-1]
100000 loops, best of 3: 18 µs per loop
In [32]: %timeit df.e.iloc[-1]
10000 loops, best of 3: 24 µs per loop
Try
df['e'].iloc[[-1]]
Sometimes,
df['e'].iloc[-1]
doesn't work. The difference is that the double-bracket form returns a one-element Series (keeping the index label), while the single-bracket form returns a bare scalar.
We can also access it by indexing df.index and at:
df.at[df.index[-1], 'e']
It's faster than iloc but slower than without indexing.
If we decide to assign a value to the last element in column "e", the above method is much faster than the other two options (9-11 times faster):
>>> %timeit df.at[df.index[-1], 'e'] = 1
11.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit df['e'].iat[-1] = 1
107 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df['e'].iloc[-1] = 1
127 µs ± 7.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I am trying to determine whether there is an entry in a Pandas column that has a particular value. I tried to do this with if x in df['id']. I thought this was working, except that when I fed it a value I knew was not in the column, 43 in df['id'], it still returned True. When I subset to a data frame containing only entries matching the missing id, df[df['id'] == 43], there are, obviously, no entries in it. How do I determine whether a column in a Pandas data frame contains a particular value, and why doesn't my current method work? (FYI, I have the same problem when I use the implementation in this answer to a similar question.)
in of a Series checks whether the value is in the index:
In [11]: s = pd.Series(list('abc'))
In [12]: s
Out[12]:
0 a
1 b
2 c
dtype: object
In [13]: 1 in s
Out[13]: True
In [14]: 'a' in s
Out[14]: False
One option is to see if it's in unique values:
In [21]: s.unique()
Out[21]: array(['a', 'b', 'c'], dtype=object)
In [22]: 'a' in s.unique()
Out[22]: True
or a python set:
In [23]: set(s)
Out[23]: {'a', 'b', 'c'}
In [24]: 'a' in set(s)
Out[24]: True
As pointed out by @DSM, it may be more efficient (especially if you're just doing this for one value) to use in directly on the values:
In [31]: s.values
Out[31]: array(['a', 'b', 'c'], dtype=object)
In [32]: 'a' in s.values
Out[32]: True
You can also use pandas.Series.isin although it's a little bit longer than 'a' in s.values:
In [2]: s = pd.Series(list('abc'))
In [3]: s
Out[3]:
0 a
1 b
2 c
dtype: object
In [3]: s.isin(['a'])
Out[3]:
0 True
1 False
2 False
dtype: bool
In [4]: s[s.isin(['a'])].empty
Out[4]: False
In [5]: s[s.isin(['z'])].empty
Out[5]: True
But this approach can be more flexible if you need to match multiple values at once for a DataFrame (see DataFrame.isin)
>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]})
>>> df.isin({'A': [1, 3], 'B': [4, 7, 12]})
A B
0 True False # Note that B didn't match 1 here.
1 False True
2 True True
found = df[df['Column'].str.contains('Text_to_search')]
print(found.count())
found.count() will contain the number of matches, and if it is 0, the string was not found in the column.
You can try this to check for a particular value 'x' in a particular column named 'id':
if x in df['id'].values:
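A runnable version of that check (the frame and values below are my own example):
import pandas as pd

df = pd.DataFrame({'id': [10, 43, 7]})
x = 43
if x in df['id'].values:  # membership test against the underlying NumPy array
    print('found')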
I did a few simple tests:
In [10]: x = pd.Series(range(1000000))
In [13]: timeit 999999 in x.values
567 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: timeit (x == 999999).any()
6.86 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit x.eq(999999).any()
7.03 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: timeit x.eq(9).any()
7.04 ms ± 60 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: timeit x.isin([999999]).any()
9.54 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: timeit 999999 in set(x)
79.8 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly, it doesn't matter whether you look up 9 or 999999; it seems to take about the same amount of time using the in syntax (it must be using some vectorized computation):
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [25]: timeit 9999 in x.values
647 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]: timeit 999999 in x.values
642 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [27]: timeit 99199 in x.values
644 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: timeit 1 in x.values
667 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It seems like using x.values is the fastest, but maybe there is a more elegant way in pandas?
Or use Series.tolist or Series.any:
>>> s = pd.Series(list('abc'))
>>> s
0 a
1 b
2 c
dtype: object
>>> 'a' in s.tolist()
True
>>> (s=='a').any()
True
Series.tolist makes a list out of a Series; in the other one I am just getting a boolean Series from a regular Series, then checking if there are any Trues in the boolean Series.
Simple condition:
if any(str(elem) in ['a','b'] for elem in df['column'].tolist()):
Use
df[df['id']==x].index.tolist()
If x is present in id, it'll return the list of indices where it is present; otherwise it gives an empty list.
I had a CSV file to read:
df = pd.read_csv('50_states.csv')
And after trying:
if value in df.column:
    print(True)
which never printed True, even though the value was in the column;
I tried:
for values in df.column:
    if value == values:
        print(True)
        # Or do something
    else:
        print(False)
which worked. I hope this can help!
Use query() to find the rows where the condition holds and get the number of rows with shape[0]. If there exists at least one entry, this statement is True:
df.query('id == 123').shape[0] > 0
Suppose your dataframe has a column "filename", and you want to check whether the filename "80900026941984" is present in the dataframe or not.
You can simply write:
if sum(df["filename"].astype("str").str.contains("80900026941984")) > 0:
    print("found")