I'd like to know if there's a way to find the location (column and row index) of the highest value in a dataframe. So if for example my dataframe looks like this:
A B C D E
0 100 9 1 12 6
1 80 10 67 15 91
2 20 67 1 56 23
3 12 51 5 10 58
4 73 28 72 25 1
How do I get a result that looks like this: [0, 'A'] using Pandas?
Use np.argmax
NumPy's argmax can be helpful:
>>> df.stack().index[np.argmax(df.values)]
(0, 'A')
In steps
df.values is a two-dimensional NumPy array:
>>> df.values
array([[100, 9, 1, 12, 6],
[ 80, 10, 67, 15, 91],
[ 20, 67, 1, 56, 23],
[ 12, 51, 5, 10, 58],
[ 73, 28, 72, 25, 1]])
argmax gives you the index of the maximum value in the "flattened" array:
>>> np.argmax(df.values)
0
Now, you can use this index to find the row-column location on the stacked dataframe:
>>> df.stack().index[0]
(0, 'A')
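This works because df.stack() orders its MultiIndex row by row, which matches the row-major (C-order) flattening that np.argmax applies to df.values; a quick check on the example frame:
>>> df.stack().index[:3].tolist()
[(0, 'A'), (0, 'B'), (0, 'C')]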
Fast Alternative
If you need it fast, do as few steps as possible.
Working only on the NumPy array, with np.argmax and np.unravel_index, seems best:
v = df.values
i, j = [x[0] for x in np.unravel_index([np.argmax(v)], v.shape)]
[df.index[i], df.columns[j]]
Result:
[0, 'A']
Timings
Timings are most meaningful on a large data frame:
df = pd.DataFrame(data=np.arange(int(1e6)).reshape(-1,5), columns=list('ABCDE'))
Sorted slowest to fastest:
Mask:
%timeit df.mask(~(df==df.max().max())).stack().index.tolist()
33.4 ms ± 982 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Stack-idxmax
%timeit list(df.stack().idxmax())
17.1 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Stack-argmax
%timeit df.stack().index[np.argmax(df.values)]
14.8 ms ± 392 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Where
%%timeit
i,j = np.where(df.values == df.values.max())
list((df.index[i].values.tolist()[0],df.columns[j].values.tolist()[0]))
4.45 ms ± 84.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Argmax-unravel_index
%%timeit
v = df.values
i, j = [x[0] for x in np.unravel_index([np.argmax(v)], v.shape)]
[df.index[i], df.columns[j]]
499 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Compare
d = {'name': ['Mask', 'Stack-idxmax', 'Stack-argmax', 'Where', 'Argmax-unravel_index'],
     'time': [33.4, 17.1, 14.8, 4.45, 499],
     'unit': ['ms', 'ms', 'ms', 'ms', 'µs']}
timings = pd.DataFrame(d)
timings['seconds'] = timings.time * timings.unit.map({'ms': 1e-3, 'µs': 1e-6})
timings['factor slower'] = timings.seconds / timings.seconds.min()
timings.sort_values('factor slower')
Output:
name time unit seconds factor slower
4 Argmax-unravel_index 499.00 µs 0.000499 1.000000
3 Where 4.45 ms 0.004450 8.917836
2 Stack-argmax 14.80 ms 0.014800 29.659319
1 Stack-idxmax 17.10 ms 0.017100 34.268537
0 Mask 33.40 ms 0.033400 66.933868
So the "Argmax-unravel_index" version seems to be one to nearly two orders of magnitude faster for large data frames, i.e. where often speeds matters most.
Use stack to get a Series with a MultiIndex, then idxmax for the index of the max value:
print (df.stack().idxmax())
(0, 'A')
print (list(df.stack().idxmax()))
[0, 'A']
Detail:
print (df.stack())
0 A 100
B 9
C 1
D 12
E 6
1 A 80
B 10
C 67
D 15
E 91
2 A 20
B 67
C 1
D 56
E 23
3 A 12
B 51
C 5
D 10
E 58
4 A 73
B 28
C 72
D 25
E 1
dtype: int64
mask + max
df.mask(~(df==df.max().max())).stack().index.tolist()
Out[17]: [(0, 'A')]
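Here df == df.max().max() marks the cell(s) holding the global maximum, mask(~...) replaces every other value with NaN, and stack() then drops those NaNs, so only the (row, column) pair(s) of the maximum survive in the index.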
This should work:
def max_df(df):
    m = None   # best value seen so far
    p = None   # (row label, column name) of that value
    for col_pos, row_label in enumerate(df.idxmax()):
        c = df.columns[col_pos]   # column whose maximum this is
        val = df[c][row_label]    # that column's maximum value
        if m is None or val > m:
            m = val
            p = row_label, c
    return p
This uses the idxmax function to find each column's maximum, then compares those maxima against one another.
Example usage:
>>> df
A B
0 100 9
1 90 8
>>> max_df(df)
(0, 'A')
Here's a one-liner (for fun):
def max_df2(df):
    return max((df[df.columns[col]][row], row, df.columns[col]) for col, row in enumerate(df.idxmax()))[1:]
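On the same example frame it returns the same location:
>>> max_df2(df)
(0, 'A')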
In my opinion, for larger datasets stack() becomes inefficient; let's use np.where to return the index positions:
i,j = np.where(df.values == df.values.max())
list((df.index[i].values.tolist()[0],df.columns[j].values.tolist()[0]))
Output:
[0, 'A']
Timings for larger dataframes:
df = pd.DataFrame(data=np.arange(10000).reshape(-1,5), columns=list('ABCDE'))
np.where method:
%%timeit
i,j = np.where(df.values == df.values.max())
list((df.index[i].values.tolist()[0],df.columns[j].values.tolist()[0]))
1000 loops, best of 3: 364 µs per loop
Other stack methods:
%timeit df.mask(~(df==df.max().max())).stack().index.tolist()
100 loops, best of 3: 7.68 ms per loop
%timeit df.stack().index[np.argmax(df.values)]
10 loops, best of 3: 50.5 ms per loop
%timeit list(df.stack().idxmax())
1000 loops, best of 3: 1.58 ms per loop
Even larger dataframe:
df = pd.DataFrame(data=np.arange(100000).reshape(-1,5), columns=list('ABCDE'))
Respectively:
1000 loops, best of 3: 1.62 ms per loop
10 loops, best of 3: 18.2 ms per loop
100 loops, best of 3: 5.69 ms per loop
100 loops, best of 3: 6.64 ms per loop
A simple, fast one-liner:
loc = [df.max(axis=1).idxmax(), df.max().idxmax()]
(For large data frames, .stack() can be quite slow.)
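Applied to the example frame from the question (a quick check; the frame below is reconstructed from the table above):
import pandas as pd

df = pd.DataFrame({'A': [100, 80, 20, 12, 73],
                   'B': [9, 10, 67, 51, 28],
                   'C': [1, 67, 1, 5, 72],
                   'D': [12, 15, 56, 10, 25],
                   'E': [6, 91, 23, 58, 1]})

loc = [df.max(axis=1).idxmax(), df.max().idxmax()]  # [row label, column label]
print(loc)  # [0, 'A']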
print('Max value:', df.stack().max())
print('Parameters :', df.stack().idxmax())
This is the best way imho.
Related
So I have a DataFrame that looks something along these lines:
import pandas as pd
ddd = {
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'b': [22, 25, 18, 53, 19, 8, 75, 11, 49, 64],
    'c': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5]
}
df = pd.DataFrame(ddd)
What I need is to group the data by the 'c' column and apply some data transformations. At the moment I'm doing this:
def do_stuff(d: pd.DataFrame):
    if d.shape[0] >= 2:
        return pd.DataFrame(
            {
                'start': [d.a.values[0]],
                'end': [d.a.values[d.shape[0] - 1]],
                'foo': [d.a.sum()],
                'bar': [d.b.mean()]
            }
        )
    else:
        return pd.DataFrame()
r = df.groupby('c').apply(lambda x: do_stuff(x))
Which gives the correct result:
start end foo bar
c
1 0 1.0 3.0 6.0 21.666667
2 0 4.0 5.0 9.0 36.000000
4 0 7.0 9.0 24.0 45.000000
The problem is that this approach appears to be too slow. On my actual data it runs in around 0.7 seconds which is too long and needs to be ideally much faster.
Is there any way I can do this faster? Or maybe there's some other faster method not involving groupby that I could use?
We could first filter df for the "c" values that appear 2 or more times; then use groupby + named aggregation:
msk = df['c'].value_counts() >= 2
out = (df[df['c'].isin(msk.index[msk])]
       .groupby('c')
       .agg(start=('a','first'), end=('a','last'), foo=('a','sum'), bar=('b','mean')))
You could also do:
out = (df[df.groupby('c')['c'].transform('count').ge(2)]
       .groupby('c')
       .agg(start=('a','first'),
            end=('a','last'),
            foo=('a','sum'),
            bar=('b','mean')))
or
msk = df['c'].value_counts() >= 2
out = (df[df['c'].isin(msk.index[msk])]
       .groupby('c')
       .agg({'a':['first','last','sum'], 'b':'mean'})
       .set_axis(['start','end','foo','bar'], axis=1))
Output:
start end foo bar
c
1 1 3 6 21.666667
2 4 5 9 36.000000
4 7 9 24 45.000000
Some benchmarks:
>>> %timeit out = df.groupby('c').apply(lambda x: do_stuff(x))
6.49 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit msk = df['c'].value_counts() >= 2; out = (df[df['c'].isin(msk.index[msk])].groupby('c').agg(start=('a','first'), end=('a','last'), foo=('a','sum'), bar=('b','mean')))
7.6 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit out = (df[df.groupby('c')['c'].transform('count').ge(2)].groupby('c').agg(start=('a','first'), end=('a','last'), foo=('a','sum'), bar=('b','mean')))
7.86 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit msk = df['c'].value_counts() >= 2; out = (df[df['c'].isin(msk.index[msk])].groupby('c').agg({'a':['first','last','sum'], 'b':'mean'}).set_axis(['start','end','foo','bar'], axis=1))
4.68 ms ± 57.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A similar dataframe can be created:
import pandas as pd
df = pd.DataFrame()
df["nodes"] = list(range(1, 11))
df["x"] = [1,4,9,12,27,87,99,121,156,234]
df["y"] = [3,5,6,1,8,9,2,1,0,-1]
df["z"] = [2,3,4,2,1,5,9,99,78,1]
df.set_index("nodes", inplace=True)
So the dataframe looks like this:
x y z
nodes
1 1 3 2
2 4 5 3
3 9 6 4
4 12 1 2
5 27 8 1
6 87 9 5
7 99 2 9
8 121 1 99
9 156 0 78
10 234 -1 1
My first try for searching e.g. all nodes containing number 1 is:
>>> df[(df == 1).any(axis=1)].index.values
[1 4 5 8 10]
As I have to do this for many numbers and my real dataframe is much bigger than this one, I'm searching for a very fast way to do this.
I just tried something that may be enlightening.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df.set_index("A", inplace=True)
df_no_index = df.reset_index()
So this sets up a dataframe with ints all the way through. It is not the same as yours, but it will suffice.
Then I ran four tests:
%timeit df[(df == 1).any(axis=1)].index.values
%timeit df[(df['B'] == 1) | (df['C']==1)| (df['D']==1)].index.values
%timeit df_no_index[(df_no_index == 1).any(axis=1)].A.values
%timeit df_no_index[(df_no_index['B'] == 1) | (df_no_index['C']==1)| (df_no_index['D']==1)].A.values
The results I got were:
940 µs ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.08 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.55 ms ± 51.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This shows that your initial method, using the index, seems to be the fastest of these approaches; removing the index does not improve the speed on a moderately sized dataframe.
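If the same lookup has to be repeated for many numbers, one option (a hedged sketch, not included in the timings above) is to test them all in a single vectorised pass with np.isin:
import numpy as np

targets = [1, 5, 99]                            # illustrative set of numbers to search for
mask = np.isin(df.values, targets).any(axis=1)  # True for rows containing any target
matching_index_values = df.index.values[mask]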
I'm trying to work out how to apply a dictionary to an existing DataFrame column at certain indexes, i.e.:
Primary DataFrame:
country year Corruption Index
0 X 2010 6,5
1 X 2015 78,0
So I have a dictionary (separately, divide 78 by 10):
x = {1: 7.8}
I want to change value of index 1, column 'Corruption Index'.
I know I can use a for loop, but are there any faster variants? Or maybe I can directly divide the values in the existing DataFrame that are higher than 10 by ten? (The problem is that the official statistics changed from a 1-10 scale to a 1-100 scale after 2012.)
The fastest solution is the answer from Mykola Zotko.
For a vectorized solution, use pandas.DataFrame.where, which will replace values where the condition is False.
.where will be much faster than using .apply, e.g. df['Corruption Index'].apply(lambda x: x/10 if x > 10 else x).
numpy.where is slightly different, in that it works on the True condition, and is faster than pandas.DataFrame.where, but requires importing numpy.
Since the issue is that the official statistics changed from a 1-10 to a 1-100 scale after 2012, m = df.year <= 2012 should be used as the condition.
import pandas as pd
import numpy as np
# test dataframe
df = pd.DataFrame({'country': ['X', 'X'], 'year': [2010, 2015], 'Corruption_Index': [6.5, 78.0]})
# display(df)
country year Corruption_Index
0 X 2010 6.5
1 X 2015 78.0
# create a Boolean mask for the condition
m = df.year <= 2012
# use pandas.DataFrame.where to calculate on the False condition
df['Corruption_Index'].where(m, df['Corruption_Index'] / 10, inplace=True)
# alternatively, use np.where to calculate on the True condition
df['Corruption_Index'] = np.where(df.year > 2012, df['Corruption_Index'] / 10, df['Corruption_Index'])
# display(df)
country year Corruption_Index
0 X 2010 6.5
1 X 2015 7.8
%%timeit Comparison
# test data with 1M rows
np.random.seed(365)
df = pd.DataFrame({'v': [np.random.randint(20) for _ in range(1000000)]})
%%timeit
df.loc[df['v'] > 10, 'v'] /= 10
[out]:
2.66 ms ± 61.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
np.where(df['v'] > 10, df['v'] / 10, df['v'])
[out]:
8.11 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df['v'].where(df['v'] <= 10, df['v'] / 10)
[out]:
17.9 ms ± 615 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df['v'].apply(lambda x: x/10 if x > 10 else x)
[out]:
319 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use:
df.loc[df['Corruption_Index'] > 10, 'Corruption_Index'] /= 10
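Applied to the small test frame from the previous answer (a quick sketch), only the values on the new 1-100 scale get divided back down:
import pandas as pd

df = pd.DataFrame({'country': ['X', 'X'], 'year': [2010, 2015],
                   'Corruption_Index': [6.5, 78.0]})
df.loc[df['Corruption_Index'] > 10, 'Corruption_Index'] /= 10
print(df)
#   country  year  Corruption_Index
# 0       X  2010               6.5
# 1       X  2015               7.8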
I need to create a two-column dataframe.
The first column contains the values from 7000 to 15000 in increments of 500 (7000, 7500, 8000, ..., 14500, 15000).
The second column contains all the integers from 6 to 24.
I need a simple way to generate these values and all their unique combinations:
6,7000
6,7500
6,8000
....
24,14500
24,15000
You can use numpy.arange to generate the sequences of numbers, numpy.repeat and numpy.tile to generate the cross-product, and stack them using numpy.c_ or numpy.column_stack:
x = np.arange(6, 25)
y = np.arange(7000, 15001, 500)
pd.DataFrame(np.c_[x.repeat(len(y)),np.tile(y, len(x))])
# pd.DataFrame(np.column_stack([x.repeat(len(y)),np.tile(y, len(x))]))
0 1
0 6 7000
1 6 7500
2 6 8000
3 6 8500
4 6 9000
.. .. ...
318 24 13000
319 24 13500
320 24 14000
321 24 14500
322 24 15000
[323 rows x 2 columns]
Another idea is to use itertools.product:
from itertools import product
pd.DataFrame(list(product(x,y)))
Timeit results:
# Henry's answer in the comments
In [44]: %timeit pd.DataFrame([(x,y) for x in range(6,25) for y in range(7000,15001,500)])
657 µs ± 169 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# My solution
In [45]: %%timeit
...: x = np.arange(6, 25)
...: y = np.arange(7000, 15001, 500)
...:
...: pd.DataFrame(np.c_[x.repeat(len(y)),np.tile(y, len(x))])
...:
...:
155 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Using `np.column_stack`
In [49]: %%timeit
...: x = np.arange(6, 25)
...: y = np.arange(7000, 15001, 500)
...:
...: pd.DataFrame(np.column_stack([x.repeat(len(y)),np.tile(y, len(x))]))
...:
121 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# `itertools.product` solution
In [62]: %timeit pd.DataFrame(list(product(x,y)))
489 µs ± 7.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have a pandas dataframe with 1 million rows. I want to replace the values in 900,000 rows of a column by another set of values. Is there a fast way to do this without a for loop (which takes me two days to complete)?
For example, look at this sample dataframe where I have condensed 1 million rows to 8 rows
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['a'] = [-1,-3,-4,-4,-3, 4,5,6]
df['b'] = [23,45,67,89,0,-1, 2, 3]
L2 = [-1,-3,-4]
L5 = [9,10,11]
I want to replace values where a is -1, -3, -4 in a single shot if possible or as fast as possible without a for loop.
The crucial part is that values in L5 have to be repeated as needed.
I have tried
df.loc[df.a < 0, 'a'] = L5
but this works only when len(df.a.values) == len(L5)
Use map with a dictionary created from both lists by zip, and finally restore the original values for non-matched rows with fillna:
d = dict(zip(L2, L5))
print (d)
{-1: 9, -3: 10, -4: 11}
df['a'] = df['a'].map(d).fillna(df['a'])
print (df)
a b
0 9.0 23
1 10.0 45
2 11.0 67
3 11.0 89
4 10.0 0
5 4.0 -1
6 5.0 2
7 6.0 3
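Note that map followed by fillna produces floats, because non-matched positions are NaN before they are filled. If keeping the integer dtype matters, a hedged alternative is replace with the same dictionary, which leaves non-matching values untouched:
df['a'] = df['a'].replace(dict(zip(L2, L5)))  # keeps the int dtype; unmatched values stay as they are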
Performance:
It depends on the number of values to replace and on the length of the lists.
If the length of the lists is 100:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(1000, size=N)})
L2 = np.arange(100)
L5 = np.arange(100) + 10
In [336]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
180 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
56.9 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the length of the lists is small (e.g. 3):
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'a':np.random.randint(100, size=N)})
L2 = np.arange(3)
L5 = np.arange(3) + 10
In [339]: %timeit df['d'] = np.select([df['a'] == i for i in L2], L5, df['a'])
11.9 ms ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [340]: %timeit df['a'].map(dict(zip(L2, L5))).fillna(df['a'])
54 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can use np.select as follows:
import numpy as np
condition = [df['a'] == i for i in L2]
df['a'] = np.select(condition, L5, df['a'])
and you get:
a b
0 9 23
1 10 45
2 11 67
3 11 89
4 10 0
5 4 -1
6 5 2
7 6 3
Timing: let's create a bigger dataframe from your df:
df_l = pd.concat([df]*10000)
print (df_l.shape)
(80000, 2)
Now some timings:
# with map, #jezrael
d = dict(zip(L2, L5))
%timeit df_l['a'].map(d).fillna(df_l['a'])
100 loops, best of 3: 7.71 ms per loop
# with np.select
condition = [df_l['a'] == i for i in L2]
%timeit np.select(condition, L5, df_l['a'])
1000 loops, best of 3: 350 µs per loop