Given a DataFrame, I am trying to count how many rows have one column equal to a specific value while another column at the same index equals another specific value.
In this instance the output should be 2, since the condition is df['z'] == 4 and df['x'] == 'C', and only rows 10 and 11 match this requirement.
My code does not output any result, only a warning message:

:5: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  if (df[df['z']== 4].index.values) == (df[df['x']== 'C'].index.values):
Besides fixing this issue, is there a more 'pythonic' way of doing this without a for loop?
import numpy as np
import pandas as pd
data = [['A', 1, 2, 5, 'blue'],
        ['A', 1, 5, 6, 'blue'],
        ['A', 4, 4, 7, 'blue'],
        ['B', 6, 5, 4, 'yellow'],
        ['B', 9, 9, 3, 'blue'],
        ['B', 7, 9, 1, 'yellow'],
        ['B', 2, 3, 1, 'yellow'],
        ['B', 5, 1, 2, 'yellow'],
        ['C', 2, 10, 9, 'green'],
        ['C', 8, 2, 8, 'green'],
        ['C', 5, 4, 3, 'green'],
        ['C', 8, 4, 3, 'green']]
df = pd.DataFrame(data, columns=['x','y','z','xy', 'color'])
k = 0
print(df[df['z'] == 4].index.values)
print(df[df['x'] == 'C'].index.values)
for i in df['z']:
    if (df[df['z'] == 4].index.values) == (df[df['x'] == 'C'].index.values):
        k += 1
print(k)
Try:

c = df['z'].eq(4) & df['x'].eq('C')  # your condition

Finally:

count = df[c].index.size
# OR
count = len(df[c].index)

Output:

print(count)
>>> 2
You can do the following:
df[(df['z']==4) & (df['x']=='C')].shape[0]
#2
Assuming just the number is necessary and not the filtered frame, calculating the number of True values in the Boolean Series is faster:
Calculate the conditions as Boolean Series:
m = df['z'].eq(4) & df['x'].eq('C')
Count True values via Series.sum:
k = m.sum()
or via np.count_nonzero:
k = np.count_nonzero(m)
k:
2
Timing Information via %timeit:
All timings exclude creation of the Boolean mask m, since every method uses the same mask and that cost is identical in all cases:
m = df['z'].eq(4) & df['x'].eq('C')
Henry Ecker (This Answer)
%timeit m.sum()
25.6 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.count_nonzero(m)
7 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
IoaTzimas
%timeit df[m].shape[0]
151 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Anurag Dabas
%timeit df[m].index.size
163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit len(df[m].index)
165 µs ± 5.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
SeaBean
%timeit df.loc[m].shape[0]
151 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(Without .loc this is the same as IoaTzimas's answer.)
You can use .loc with the Boolean condition mask to select the matching rows. Then use shape[0] to get the row count:
df.loc[(df['z']== 4) & (df['x']== 'C')].shape[0]
Row selection works with or without .loc, so this is the same as:
df[(df['z']== 4) & (df['x']== 'C')].shape[0]
However, it is good practice to use .loc rather than omit it. You can refer to this post for further information.
Result:
2
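One practical advantage of .loc, shown here as a small illustrative sketch (picking column 'y' is an arbitrary choice for the example): it lets you filter rows and select columns in a single step:

df.loc[(df['z'] == 4) & (df['x'] == 'C'), 'y']  # just the 'y' values of the matching rows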
I have a dataframe:
df = pd.DataFrame({'Sequence': ['ABCDEFG', 'AWODIH', 'AWODIHAWD'], 'Length': [7, 6, 9]})
I want to be able to check if a particular sequence, say 'WOD', exists in any entry of the 'Sequence' column. It doesn't matter whether it appears in the middle or at the ends of the entry; if that sequence, in that order, exists anywhere in an entry of that column, return True.
How would I do this?
I looked into .isin and .contains, both of which only return True if the exact, ENTIRE sequence is in the column:
df.isin('ABCDEFG')  # returns true
df.isin('ABC')  # returns false
I want a sort of Ctrl+F function that could search for any sequence in that order, regardless of where it is or how long it is.
You can simply do this using str.contains:
In [657]: df['Sequence'].str.contains('WOD')
Out[657]:
0 False
1 True
2 True
Name: Sequence, dtype: bool
OR, you can use str.find:
In [658]: df['Sequence'].str.find('WOD')
Out[658]:
0 -1
1 1
2 1
Name: Sequence, dtype: int64
Which returns -1 on failure.
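If you need a Boolean mask from str.find rather than positions, a minimal sketch is to compare the result against -1:

df['Sequence'].str.find('WOD') != -1  # True wherever 'WOD' occurs somewhere in the string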
We can use str.findall before str.contains:
df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
0 False
1 True
2 True
Name: Sequence, dtype: bool
If you want to use the in syntax, you can do:
df.Sequence.apply(lambda x: 'WOD' in x)
If performance is a consideration, the following solution is many times faster than other solutions:
['WOD' in e for e in df.Sequence]
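Note that the list comprehension produces a plain Python list rather than a Series; pandas still accepts it directly as a Boolean mask, so filtering works, for example:

df[['WOD' in e for e in df.Sequence]]  # rows whose Sequence contains 'WOD'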
Benchmark
%%timeit
['WOD' in e for e in df.Sequence]
8.26 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df.Sequence.apply(lambda x: 'WOD' in x)
164 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['Sequence'].str.contains('WOD')
153 µs ± 4.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['Sequence'].str.find('WOD')
159 µs ± 7.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df.Sequence.str.findall('W|O|D').str.join('').str.contains('WOD')
585 µs ± 34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Suppose I have a DataFrame such as:
df = pd.DataFrame(np.random.randn(10,5), columns = ['a','b','c','d','e'])
and I would like to retrieve the last value in column e. I could do:
df['e'].tail(1)
but this would return a Series that still carries index 9 with it. Ideally, I just want to obtain the value as a number that I can work with directly. I could also do:
np.array(df['e'].tail(1))
but this would then require me to access the 0th element of it before I can really work with it. Is there a more direct/easy way to do this?
You could try the iloc method of the DataFrame:
In [26]: df
Out[26]:
a b c d e
0 -1.079547 -0.722903 0.457495 -0.687271 -0.787058
1 1.326133 1.359255 -0.964076 -1.280502 1.460792
2 0.479599 -1.465210 -0.058247 -0.984733 -0.348068
3 -0.608238 -1.238068 -0.126889 0.572662 -1.489641
4 -1.533707 -0.218298 -0.877619 0.679370 0.485987
5 -0.864651 -0.180165 -0.528939 0.270885 1.313946
6 0.747612 -1.206509 0.616815 -1.758354 -0.158203
7 -2.309582 -0.739730 -0.004303 0.125640 -0.973230
8 1.735822 -0.750698 1.225104 0.431583 -1.483274
9 -0.374557 -1.132354 0.875028 0.032615 -1.131971
In [27]: df['e'].iloc[-1]
Out[27]: -1.1319705662711321
Or if you want just a scalar, you could use iat, which is faster. From the docs:
If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures
In [28]: df.e.iat[-1]
Out[28]: -1.1319705662711321
Benchmarking:
In [31]: %timeit df.e.iat[-1]
100000 loops, best of 3: 18 µs per loop
In [32]: %timeit df.e.iloc[-1]
10000 loops, best of 3: 24 µs per loop
Try

df['e'].iloc[[-1]]

Note that the double brackets return a one-element Series rather than a scalar. Sometimes

df['e'].iloc[-1]

doesn't work.
We can also access it by indexing into df.index and using at:
df.at[df.index[-1], 'e']
It's faster than iloc, but slower than the forms that don't have to look up df.index[-1] first.
If we decide to assign a value to the last element in column "e", the above method is much faster than the other two options (9-11 times faster):
>>> %timeit df.at[df.index[-1], 'e'] = 1
11.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit df['e'].iat[-1] = 1
107 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df['e'].iloc[-1] = 1
127 µs ± 7.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
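A related caveat: the two slower forms assign through chained indexing (select df['e'] first, then assign into it), which newer pandas versions warn about and which may not update the original frame under copy-on-write; the at form assigns in a single indexing operation:

df.at[df.index[-1], 'e'] = 1  # one indexing step, reliably updates df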
I know this is a very basic question but for some reason I can't find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None

print(find(myseries, 7))
but I assume there should be a better way. Is there?
>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
I admit that there should be a better way to do that, but this at least avoids iterating through the object in Python and moves the work to the C level.
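If the value might not be present, index[0] raises an IndexError; a minimal guard (just a sketch) is:

matches = myseries[myseries == 7].index
result = matches[0] if len(matches) else None  # None when 7 is absent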
Converting to an Index, you can use get_loc
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: pd.Index(myseries).get_loc(7)
Out[3]: 3
In [4]: pd.Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: pd.Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It will return a Boolean array if the duplicates are non-contiguous:
In [6]: pd.Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
Uses a hashtable internally, so fast
In [7]: s = pd.Series(np.random.randint(0, 10, 10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = pd.Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, there is a one-time overhead to creating an index (it's incurred when you actually DO something with the index, e.g. check is_unique):
In [2]: s = pd.Series(np.random.randint(0, 10, 10000))
In [3]: %timeit pd.Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit pd.Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index can contain any values, and the value you search for sits towards the end of the series.
Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
   ...:         9500, 6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425,
   ...:         300, 212, 150, 106, 75, 53, 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: assert(myseries[21] == 150)
In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.
Correction: sorry, I missed one; @Alex Spangher's solution using the list index method is by far the fastest.
Update: added @EliadL's answer.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.
2022-02-18 Update
Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have significantly reduced (10 to 70%) compared to the previous tests (version 0.25.3).
Plus: Added two more methods utilizing dictionaries.
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know 7 is there in advance. You can check this with
(myseries==7).any()
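Combining the two, a minimal sketch that also maps the position back to an index label and falls back to None when 7 is absent:

mask = myseries == 7
# argmax gives the position of the first True; index[...] converts it to a label
idx = myseries.index[mask.argmax()] if mask.any() else None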
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
Another way to do this, although equally unsatisfying, is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns:
3
Timing tests using a current dataset I'm working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
If you use NumPy, you can get an array of the indices at which your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
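To get from those positions back to an index label, assuming at least one match, a small sketch:

positions = np.where(myseries == 7)[0]  # positional indices of the matches
label = myseries.index[positions[0]]    # first matching index label, here 3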
You can use Series.idxmax(); note that it returns the index of the maximum value, so it only finds 7 here because 7 happens to be the largest element in the series.
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>>
This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64
Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
should return the correct index, assuming the value exists in the Series.
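Note that this gives a position, not an index label, and list.index raises ValueError when the value is missing; a minimal sketch handling both:

try:
    pos = myseries.tolist().index(7)  # position of the first match
    label = myseries.index[pos]       # corresponding index label
except ValueError:
    label = None                      # 7 not present in the Series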
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
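If you need those labels as a plain Python list, wrap the result:

list(myseries.index[myseries == 1])  # [3, 4, 5, 6, 10, 11]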
Pandas has a builtin class Index with a method called get_loc. Depending on the data, it will return one of the following:
an index (the element's position)
a slice (if the specified value occurs in one contiguous run)
an array (a Boolean array if the value occurs at multiple non-contiguous positions)
Example:
import pandas as pd
>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns index
3 # Index of 10 in series
>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns slice
slice(3, 6, None) # 10 occurs at index 3 (included) to 6 (not included)
# If the occurrences are not contiguous, it returns an array of bools.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False, True, False, False, True, True, True, False, True])
There are many other options too, but I found this one the simplest.
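Because of those three possible return types, a small helper (a hypothetical sketch, not part of pandas) can normalize the result to a list of integer positions:

import numpy as np
import pandas as pd

def get_positions(series, value):
    # normalize pd.Index.get_loc output to a list of integer positions
    loc = pd.Index(series).get_loc(value)
    if isinstance(loc, slice):
        return list(range(loc.start, loc.stop))
    if isinstance(loc, np.ndarray):   # Boolean-mask case
        return list(np.flatnonzero(loc))
    return [loc]                      # single integer position

get_positions(pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10]), 10)
# [1, 4, 5, 6, 8]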
Filtering with a Boolean mask and then reading df.index will give you the exact row number:

my_fl2 = (df['ConvertedCompYearly'] == 45241312)
print(df[my_fl2].index)

Int64Index([66910], dtype='int64')