Apply using multiple rows in vector - python

I'd like to do some math on a series vector. I'd like to take the difference between two rows in a vector. My first intuition was:
def row_diff(prev, next):
return(next - prev)
and then using it
my_col_vec.apply(row_diff)
but this doesn't do what I'd like. It appears apply is row-wise, which is fine, but I can't seem to find an equivalent operation that will allow me to easy create a new vector from the old one by subtracting the previous row from the next.
Is there a better way to do this? I've been reading this document and it doesn't look like it.
Thanks!

To calculate inter-row differences use diff:
In [6]:
df = pd.DataFrame({'a':np.random.rand(5)})
df
Out[6]:
a
0 0.525220
1 0.031826
2 0.260853
3 0.273792
4 0.281368
In [7]:
df['diff'] = df['a'].diff()
df
Out[7]:
a diff
0 0.525220 NaN
1 0.031826 -0.493394
2 0.260853 0.229027
3 0.273792 0.012940
Also please try to avoid using apply as there is usually a vectorised method available

Related

Is there a simple way to select a loop or cycle of data in pandas/Python?

I would like to select a cycle of data in Python (in this case, pandas, but it's a flexible and recurring problem), such that the selected data circles back around to the start of the data for n of the beginning rows. I keep running into this problem, and while I have a variety of working solutions for it, I'm interested in whether there's some built-in method for this I don't know about?
Here's an example pandas DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([[0,1],[2,3],[5,6],[7,8]],columns = ['x','y'])
This DataFrame would, of course, look like this:
x y
0 0 1
1 2 3
2 5 6
3 7 8
A good-enough solution I've found is to create an index array:
n = 1 #how far through the start of the dataframe to select
select = np.concatenate((range(len(df)),range(n)))
The output of select would be:
array([0, 1, 2, 3, 0])
If I do df.x[select], I get what I'm looking for:
0 0
1 2
2 5
3 7
0 0
Name: x, dtype: int64
Is there some functionality in NumPy, pandas, or any other module of Python that would allow this kind of looping selection of data? Something like df.x[0:+1] that would work in a way analogous to df.x[0:-1] instead of how it currently works? Or a NumPy method that works the way I'm making np.arange() work by combining it with np.concatenate()? As I mentioned, I keep running into this problem, and for a language as convenient as Python has been, it sure seems like a feature that would or should exist. Am I missing or forgetting something?
Edit to clarify request:
A suggested method by avloss was to use np.tile, which is the right degree of simple and generalizable I'm looking for, only it's excessive for the applications I'm using. These are cases where you have to close the loop of cyclic data either for plotting, or for calculating (eg if you're matching slopes of the beginning and end of a closed curve using a finite difference method). In these cases you only need the first n data points in a series (where n is usually somewhere between 1 and 3) to be repeated for closing the calculation or plotting the cycle.
This is not exactly what you're asking for, but np.tile comes pretty close:
https://numpy.org/doc/stable/reference/generated/numpy.tile.html
Or, if you want to do this through indices, you might use mod division
np.arange(5) % 4 == array([0, 1, 2, 3, 0])

Pandas apply, rolling, groupby with multiple input & multiple output columns

I’ve been struggling the past week trying to use apply to use functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large amount of questions on SO about this topic and many old & outdated answers. So I started to create a notebook for every possible combination of x inputs & outputs, rolling, rolling & groupby combined and I focused on performance as well. Since I’m not the only one struggling with these questions I thought I’d provide my solutions here with working examples, hoping it helps any existing/future pandas-users.
Important notes
The combination of apply & rolling in pandas has a very strong output requirement. You have to return one single value. You can not return a pd.Series, not a list, not an array, not secretly an array within an array, but just one value, e.g. one integer. This requirement makes it hard to get a working solution when trying to return multiple outputs for multiple columns. I don’t understand why it has this requirement for 'apply & rolling', because without rolling 'apply' doesn’t have this requirement. Must be due to some internal pandas functions.
The combination of 'apply & rolling' combined with multiple input columns simply does not work! Imagine a dataframe with 2 columns, 6 rows and you want to apply a custom function with a rolling window of 2. Your function should get an input array with 2x2 values - 2 values of each column for 2 rows. But it seems pandas can’t handle rolling and multiple input columns at the same time. I tried to use the axis parameter to get it working but:
Axis = 0, will call your function per column. In the dataframe described above, it will call your function 10 times (not 12 because rolling=2) and since it’s per column, it only provides the 2 rolling values of that column…
Axis = 1, will call your function per row. This is what you probably want, but pandas will not provide a 2x2 input. It actually completely ignores the rolling and only provides one row with values of 2 columns...
When using 'apply' with multiple input columns, you can provide a parameter called raw (boolean). It’s False by default, which means the input will be a pd.Series and thus includes indexes next to the values. If you don’t need the indexes you can set raw to True to get a Numpy array, which often achieves a much better performance.
When combining 'rolling & groupby', it returns a multi-indexes series which can’t easily serve as an input for a new column. The easiest solution is to append a reset_index(drop=True) as answered & commented here (Python - rolling functions for GroupBy object).
You might ask me, when would you ever want to use a rolling, groupby custom function with multiple outputs!? Answer: I recently had to do a Fourier transform with sliding windows (rolling) over a dataset of 5 million records (speed/performance is important) with different batches within the dataset (groupby). And I needed to save both the power & phase of the Fourier transform in different columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that especially in the Machine Learning/Data-science sectors the more complex examples can be useful.
Please let me know if you have even better, clearer or faster ways to perform any of the solutions below. I'll update my answer and we can all benefit!
Code examples
Let’s create a dataframe first that will be used in all the examples below, including a group-column for the groupby examples.
For the rolling window and multiple input/output columns I just use 2 in all code examples below, but obviously this could be any number > 1.
df = pd.DataFrame(np.random.randint(0,5,size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]
It will look like this:
group a b
0 0 2 2
1 0 4 1
2 0 0 4
3 1 0 2
4 1 3 2
5 1 3 0
Input 1 column, output 1 column
Basic
def func_i1_o1(x):
return x+1
df['c'] = df['b'].apply(func_i1_o1)
Rolling
def func_i1_o1_rolling(x):
return (x[0] + x[1])
df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)
Roling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
Input 2 columns, output 1 column
Basic
def func_i2_o1(x):
return np.sum(x)
df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)
Rolling
As explained in point 2 in the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses the 'raw=False' to ensure the input is a pd.Series, which means we also get the indexes next to the values. This enables us to get values from other columns at the correct indexes to be used.
def func_i2_o1_rolling(x):
values_b = x
values_c = df.loc[x.index, 'c'].to_numpy()
return np.sum(values_b) + np.sum(values_c)
df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)
Input 1 column, output 2 columns
Basic
You could use a 'normal' solution by returning pd.Series:
def func_i1_o2(x):
return pd.Series((x+1, x+2))
df[['i', 'j']] = df['b'].apply(func_i1_o2)
Or you could use the zip/tuple combination which is about 8 times faster!
def func_i1_o2_fast(x):
return x+1, x+2
df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
Rolling
As explained in point 1 in the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.
1
def func_i1_o2_rolling_solution1(x):
output_1 = np.max(x)
output_2 = np.min(x)
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['m', 'n']] = output_1, output_2
return 0
df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)
Pros: Everything is done within 1 function.
Cons: You have to create the columns first and it is slower since it doesn't use the raw input.
2
rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()
def func_i1_o2_rolling_solution2(x):
output_list_1.append(np.max(x))
output_list_2.append(np.min(x))
return 0
df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2
Pros: It uses the raw input which makes it about twice as fast. And since it doesn't use indexes to set the output values the code looks a bit more clear (to me at least).
Cons: You have to create the nan-prefix yourself and it takes a bit more lines of code.
Rolling & Groupby
Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling this means you'd have to manually set NaN's/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier and solves the automatic NaNs/grouping automatically. Once again, I use the reset_index solution at the end.
def func_i1_o2_rolling_groupby(x):
output_1 = np.max(x)
output_2 = np.min(x)
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['q', 'r']] = output_1, output_2
return 0
df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
Input 2 columns, output 2 columns
Basic
I suggest using the same 'fast' way as for i1_o2 with the only difference that you get 2 input values to use.
def func_i2_o2(x):
return np.mean(x), np.median(x)
df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))
Rolling
As I use a workaround for applying rolling with multiple inputs and I use another workaround for rolling with multiple outputs, you can guess I need to combine them for this one.
1. Get values from other columns using indexes (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)
def func_i2_o2_rolling(x):
values_b = x.to_numpy()
values_c = df.loc[x.index, 'c'].to_numpy()
output_1 = np.min([np.sum(values_b), np.sum(values_c)])
output_2 = np.max([np.sum(values_b), np.sum(values_c)])
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['u', 'v']] = output_1, output_2
return 0
df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
def func_i2_o2_rolling_groupby(x):
values_b = x.to_numpy()
values_c = df.loc[x.index, 'c'].to_numpy()
output_1 = np.min([np.sum(values_b), np.sum(values_c)])
output_2 = np.max([np.sum(values_b), np.sum(values_c)])
# Last index is where to place the final values: x.index[-1]
df.at[x.index[-1], ['w', 'x']] = output_1, output_2
return 0
df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)

How to determine which object is greatest -> python [duplicate]

How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, I don't know how to get the corresponding row.
Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']) -- it provides the same thing, and appears at least as fast as idxmax in cursory observations.
idxmax() returns indices labels, not integers.
Example': if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd').
if you want the integer position of that label within the Index you have to get it manually (which can be tricky now that duplicate row labels are allowed).
HISTORICAL NOTES:
idxmax() used to be called argmax() prior to 0.11
argmax was deprecated prior to 1.0.0 and removed entirely in 1.0.0
back as of Pandas 0.16, argmax used to exist and perform the same function (though appeared to run more slowly than idxmax).
argmax function returned the integer position within the index of the row location of the maximum element.
pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
A B C
a 0.143693 0.653810 0.586007
b 0.623582 0.312903 0.919076
c 0.165438 0.889809 0.000967
d 0.308245 0.787776 0.571195
e 0.870068 0.935626 0.606911
f 0.037602 0.855193 0.728495
g 0.605366 0.338105 0.696460
h 0.000000 0.090814 0.963927
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.iloc[dfrm['A'].idxmax()] # .ix instead of .iloc in older versions of pandas
Out[21]:
A B C
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
You might also try idxmax:
In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])
In [6]: df
Out[6]:
A B C
0 2.001289 0.482561 1.579985
1 -0.991646 -0.387835 1.320236
2 0.143826 -1.096889 1.486508
3 -0.193056 -0.499020 1.536540
4 -2.083647 -3.074591 0.175772
5 -0.186138 -1.949731 0.287432
6 -0.480790 -1.771560 -0.930234
7 0.227383 -0.278253 2.102004
8 -0.002592 1.434192 -1.624915
9 0.404911 -2.167599 -0.452900
In [7]: df.idxmax()
Out[7]:
A 0
B 8
C 7
e.g.
In [8]: df.loc[df['A'].idxmax()]
Out[8]:
A 2.001289
B 0.482561
C 1.579985
Both above answers would only return one index if there are multiple rows that take the maximum value. If you want all the rows, there does not seem to have a function.
But it is not hard to do. Below is an example for Series; the same can be done for DataFrame:
In [1]: from pandas import Series, DataFrame
In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])
In [3]: s.idxmax()
Out[3]: 'b'
In [4]: s[s==s.max()]
Out[4]:
b 4
c 4
dtype: int64
df.iloc[df['columnX'].argmax()]
argmax() would provide the index corresponding to the max value for the columnX. iloc can be used to get the row of the DataFrame df for this index.
A more compact and readable solution using query() is like this:
import pandas as pd
df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
print(df)
# find row with maximum A
df.query('A == A.max()')
It also returns a DataFrame instead of Series, which would be handy for some use cases.
Very simple: we have df as below and we want to print a row with max value in C:
A B C
x 1 4
y 2 10
z 5 9
In:
df.loc[df['C'] == df['C'].max()] # condition check
Out:
A B C
y 2 10
If you want the entire row instead of just the id, you can use df.nlargest and pass in how many 'top' rows you want and you can also pass in for which column/columns you want it for.
df.nlargest(2,['A'])
will give you the rows corresponding to the top 2 values of A.
use df.nsmallest for min values.
The direct ".argmax()" solution does not work for me.
The previous example provided by #ely
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
returns the following message :
FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
So that my solution is :
df['A'].values.argmax()
mx.iloc[0].idxmax()
This one line of code will give you how to find the maximum value from a row in dataframe, here mx is the dataframe and iloc[0] indicates the 0th index.
Considering this dataframe
[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
A B C
0 -0.253233 0.226313 1.223688
1 0.472606 1.017674 1.520032
2 1.454875 1.066637 0.381890
3 -0.054181 0.234305 -0.557915
Assuming one want to know the rows where column "C" is max, the following will do the work
[In]: df[df['C']==df['C'].max()])
[Out]:
A B C
1 0.472606 1.017674 1.520032
The idmax of the DataFrame returns the label index of the row with the maximum value and the behavior of argmax depends on version of pandas (right now it returns a warning). If you want to use the positional index, you can do the following:
max_row = df['A'].values.argmax()
or
import numpy as np
max_row = np.argmax(df['A'].values)
Note that if you use np.argmax(df['A']) behaves the same as df['A'].argmax().
Use:
data.iloc[data['A'].idxmax()]
data['A'].idxmax() -finds max value location in terms of row
data.iloc() - returns the row
If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:
A B C
0 1 0 1
1 0 0 1
2 0 0 0
3 0 1 1
4 1 0 0
idxmax returns
A 0
B 3
C 0
dtype: int64
Now, if we want all indices corresponding to max values, then we could use max + eq to create a boolean DataFrame, then use it on df.index to filter out indexes:
out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())
Output:
A [0, 4]
B [3]
C [0, 1, 3]
dtype: object
what worked for me is:
df[df['colX'] == df['colX'].max()
You then get the row in your df with the maximum value of colX.
Then if you just want the index you can add .index at the end of the query.

Pass entire contents of a dataframe to a function in Pandas

I'm trying to rework much of my analysis code for signal processing using Dataframes instead of numpy arrays. However, I'm having a hard time figuring out how to pass the entire matrix of a dataframe to a function as an entire unit.
E.g., If I'm computing the common average reference a signal, I have something like:
avg = signal.mean(axis=1)
CAR = signal - avg
What I'd like to do is pass a pandas array to this function and have it return a dataframe with CAR as the values now. I'd like to do this without just returning an array and then re-converting it back into a dataframe.
It sounds like when you use df.apply(), it goes row-wise or column-wise, and doesn't put in the whole matrix. I could alter the code of CAR to make this work, but it seems like it would slow it down quite a bit rather than just using numpy's code to do it all at once. It probably wouldn't make a big difference for computing the mean, but I foresee this being a problem with other functions in the future that might take longer.
Can anyone point me in the right direction?
EDIT: To clarify, I'm not just doing this for subtracting the mean, it was just a simple example. A more realistic example would be linearly filtering the array along axis 0. I'd like to use the scipy.signal filtfilt function to filter my array. This is quite easy if I can just pass it a tpts x feats matrix, but right now it seems that the only way to do it is column-wise using "apply"
You can get the raw numpy array version of a DataFrame with df.values. However, in many cases you can just pass the DataFrame itself, since it still allows use of the normal numpy API (i.e., it has all the right methods).
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html this will allow you to perform operations on a row (or column, or the entire dataframe).
import random
signal=pd.DataFrame([[10*random.random() for _ in range(3)]for _ in range(5)])
def testm(frame, average=0):
return frame-average
signal.apply(testm,average=signal.mean(),axis=1)
results:
signal
Out[57]:
0 1 2
0 5.566445 7.612070 8.554966
1 0.869158 2.382429 6.197272
2 5.933192 3.564527 9.805669
3 9.676292 1.707944 2.731479
4 5.319629 3.348337 6.476631
signal.mean()
Out[59]:
0 5.472943
1 3.723062
2 6.753203
dtype: float64
signal.apply(testm,average=signal.mean(),axis=1)
Out[58]:
0 1 2
0 0.093502 3.889008 1.801763
1 -4.603785 -1.340632 -0.555932
2 0.460249 -0.158534 3.052466
3 4.203349 -2.015117 -4.021724
4 -0.153314 -0.374724 -0.276572
This will take the mean of each column, and subtract it from each value in that column.

Find row where values for column is maximal in a pandas DataFrame

How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, I don't know how to get the corresponding row.
Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']) -- it provides the same thing, and appears at least as fast as idxmax in cursory observations.
idxmax() returns indices labels, not integers.
Example': if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd').
if you want the integer position of that label within the Index you have to get it manually (which can be tricky now that duplicate row labels are allowed).
HISTORICAL NOTES:
idxmax() used to be called argmax() prior to 0.11
argmax was deprecated prior to 1.0.0 and removed entirely in 1.0.0
back as of Pandas 0.16, argmax used to exist and perform the same function (though appeared to run more slowly than idxmax).
argmax function returned the integer position within the index of the row location of the maximum element.
pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
A B C
a 0.143693 0.653810 0.586007
b 0.623582 0.312903 0.919076
c 0.165438 0.889809 0.000967
d 0.308245 0.787776 0.571195
e 0.870068 0.935626 0.606911
f 0.037602 0.855193 0.728495
g 0.605366 0.338105 0.696460
h 0.000000 0.090814 0.963927
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.iloc[dfrm['A'].idxmax()] # .ix instead of .iloc in older versions of pandas
Out[21]:
A B C
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
You might also try idxmax:
In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])
In [6]: df
Out[6]:
A B C
0 2.001289 0.482561 1.579985
1 -0.991646 -0.387835 1.320236
2 0.143826 -1.096889 1.486508
3 -0.193056 -0.499020 1.536540
4 -2.083647 -3.074591 0.175772
5 -0.186138 -1.949731 0.287432
6 -0.480790 -1.771560 -0.930234
7 0.227383 -0.278253 2.102004
8 -0.002592 1.434192 -1.624915
9 0.404911 -2.167599 -0.452900
In [7]: df.idxmax()
Out[7]:
A 0
B 8
C 7
e.g.
In [8]: df.loc[df['A'].idxmax()]
Out[8]:
A 2.001289
B 0.482561
C 1.579985
Both above answers would only return one index if there are multiple rows that take the maximum value. If you want all the rows, there does not seem to have a function.
But it is not hard to do. Below is an example for Series; the same can be done for DataFrame:
In [1]: from pandas import Series, DataFrame
In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])
In [3]: s.idxmax()
Out[3]: 'b'
In [4]: s[s==s.max()]
Out[4]:
b 4
c 4
dtype: int64
df.iloc[df['columnX'].argmax()]
argmax() would provide the index corresponding to the max value for the columnX. iloc can be used to get the row of the DataFrame df for this index.
A more compact and readable solution using query() is like this:
import pandas as pd
df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
print(df)
# find row with maximum A
df.query('A == A.max()')
It also returns a DataFrame instead of Series, which would be handy for some use cases.
Very simple: we have df as below and we want to print a row with max value in C:
A B C
x 1 4
y 2 10
z 5 9
In:
df.loc[df['C'] == df['C'].max()] # condition check
Out:
A B C
y 2 10
If you want the entire row instead of just the id, you can use df.nlargest and pass in how many 'top' rows you want and you can also pass in for which column/columns you want it for.
df.nlargest(2,['A'])
will give you the rows corresponding to the top 2 values of A.
use df.nsmallest for min values.
The direct ".argmax()" solution does not work for me.
The previous example provided by #ely
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
returns the following message :
FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
So that my solution is :
df['A'].values.argmax()
mx.iloc[0].idxmax()
This one line of code will give you how to find the maximum value from a row in dataframe, here mx is the dataframe and iloc[0] indicates the 0th index.
Considering this dataframe
[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
A B C
0 -0.253233 0.226313 1.223688
1 0.472606 1.017674 1.520032
2 1.454875 1.066637 0.381890
3 -0.054181 0.234305 -0.557915
Assuming one want to know the rows where column "C" is max, the following will do the work
[In]: df[df['C']==df['C'].max()])
[Out]:
A B C
1 0.472606 1.017674 1.520032
The idmax of the DataFrame returns the label index of the row with the maximum value and the behavior of argmax depends on version of pandas (right now it returns a warning). If you want to use the positional index, you can do the following:
max_row = df['A'].values.argmax()
or
import numpy as np
max_row = np.argmax(df['A'].values)
Note that if you use np.argmax(df['A']) behaves the same as df['A'].argmax().
Use:
data.iloc[data['A'].idxmax()]
data['A'].idxmax() -finds max value location in terms of row
data.iloc() - returns the row
If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:
A B C
0 1 0 1
1 0 0 1
2 0 0 0
3 0 1 1
4 1 0 0
idxmax returns
A 0
B 3
C 0
dtype: int64
Now, if we want all indices corresponding to max values, then we could use max + eq to create a boolean DataFrame, then use it on df.index to filter out indexes:
out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())
Output:
A [0, 4]
B [3]
C [0, 1, 3]
dtype: object
what worked for me is:
df[df['colX'] == df['colX'].max()
You then get the row in your df with the maximum value of colX.
Then if you just want the index you can add .index at the end of the query.

Categories