Pandas agg function with operations on multiple columns - python

Can the pandas.core.groupby.DataFrameGroupBy.agg function perform arithmetic operations across multiple columns? For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(15).reshape(5, 3))
df['C'] = [0, 0, 2, 2, 5]
print(df.groupby('C').mean()[0] - df.groupby('C').mean()[1])
print(df.groupby('C').agg({0: 'mean', 1: 'sum', 2: 'nunique', 'C': 'mean0-mean1'}))
Is it possible to get a result like the one in this example, that is, the difference between the means of column 0 and column 1, grouped by column 'C'?
df
0 1 2 C
0 0 1 2 0
1 3 4 5 0
2 6 7 8 2
3 9 10 11 2
4 12 13 14 5
Grouped difference
C
0 -1.0
2 -1.0
5 -1.0
dtype: float64
I am not interested in solutions that do not use the agg method. I am only curious whether agg can take multiple columns as arguments and perform an operation on them that returns a single column.

IIUC:
In [12]: df.groupby('C').mean().diff(axis=1)
Out[12]:
0 1 2
C
0 NaN 1.0 1.0
2 NaN 1.0 1.0
5 NaN 1.0 1.0
or
In [13]: df.groupby('C').mean().diff(-1, axis=1)
Out[13]:
0 1 2
C
0 -1.0 -1.0 NaN
2 -1.0 -1.0 NaN
5 -1.0 -1.0 NaN
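As far as I can tell, agg applies each function to one column at a time, so a dict entry such as 'mean0-mean1' is not something it understands. One possible workaround, sketched here, is a named aggregation whose function reaches the second column through the group's index. The helper name mean0_minus_mean1 and the output column names (mean0, sum1, nunique2, diff) are made up for this illustration, and named aggregation assumes pandas 0.25 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3))
df['C'] = [0, 0, 2, 2, 5]

def mean0_minus_mean1(col0):
    # col0 holds column 0 for one group; its index identifies the group's rows,
    # so the matching rows of column 1 can be fetched from the outer df.
    return col0.mean() - df.loc[col0.index, 1].mean()

# Named aggregation: output_name=(input_column, function)
result = df.groupby('C').agg(
    mean0=(0, 'mean'),
    sum1=(1, 'sum'),
    nunique2=(2, 'nunique'),
    diff=(0, mean0_minus_mean1),
)
print(result)
```

This stays inside agg, but the function depends on the outer df, so it only works while the row labels of df are unique.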

Related

min/max value of a column based on values of another column, grouped by and transformed in pandas

I'd like to know if I can do all this in one line, rather than multiple lines.
My dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'A': [1, 2, 3, 10, np.nan, 5, 20, 6, 7, np.nan, np.nan, np.nan],
'B': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0],
'desired_output': [5, 5, 5, 5, 5, 5, 20, 20, 20, 20, 20, 20]})
df
ID A B desired_output
0 1 1.0 0 5
1 1 2.0 1 5
2 1 3.0 1 5
3 1 10.0 0 5
4 1 NaN 1 5
5 1 5.0 1 5
6 2 20.0 1 20
7 2 6.0 1 20
8 2 7.0 1 20
9 2 NaN 0 20
10 2 NaN 1 20
11 2 NaN 0 20
I'm trying to find the maximum value of column A over the rows where column B == 1, grouped by column ID, and to transform the result so that the value lands back in the dataframe without any extra merging.
Something like the following (but without the error!):
df['desired_output'] = df.groupby('ID').A.where(df.B == 1).transform('max') ## this gives error
The max function should also ignore the NaNs. I wonder if I'm trying to do too much in one line, but one can hope there is a way to write this elegantly.
EDIT:
I can get a very similar output by changing the where clause:
df['desired_output'] = df.where(df.B == 1).groupby('ID').A.transform('max') ## this works but output is not what i want
However, the output is not exactly what I want: desired_output should not contain any NaN unless all values of A are NaN where B == 1.
Here is a way to do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ID' : [1,1,1,1,1,1,2,2,2,2,2,2],
'A': [1, 2, 3, 10, np.nan, 5 , 20, 6, 7, np.nan, np.nan, np.nan],
'B': [0,1,1,0,1,1,1,1,1,0,1,0],
'desired_output' : [5,5,5,5,5,5,20,20,20,20,20,20]
})
df['output'] = df[df.B == 1].groupby('ID').A.max()[df.ID].array
df
Result:
ID A B desired_output output
0 1 1.0 0 5 5.0
1 1 2.0 1 5 5.0
2 1 3.0 1 5 5.0
3 1 10.0 0 5 5.0
4 1 NaN 1 5 5.0
5 1 5.0 1 5 5.0
6 2 20.0 1 20 20.0
7 2 6.0 1 20 20.0
8 2 7.0 1 20 20.0
9 2 NaN 0 20 20.0
10 2 NaN 1 20 20.0
11 2 NaN 0 20 20.0
Decomposition:
(
df[df.B == 1] # start by filtering on B
.groupby('ID') # group by ID
.A.max() # get max values in column A
[df.ID] # look up each row's ID in the per-group result
.array # fetch the raw values from the Series
)
Important note: it relies on the fact that the index is as in the given example, that is, sorted, starting from 0, with a 1 increment. You will have to reset_index() of your DataFrame before this operation when this is not the case.
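A possible one-line alternative, sketched under the assumption that masking only column A is acceptable: the attempt with df.where(df.B == 1) also turns ID into NaN on the masked rows, so those rows fall out of their groups. Applying where to the A Series alone and grouping by the untouched ID column avoids that.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'A': [1, 2, 3, 10, np.nan, 5, 20, 6, 7, np.nan, np.nan, np.nan],
    'B': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0],
})

# Mask A where B != 1 (those values become NaN), then group the masked
# Series by the original ID column and broadcast each group's max.
# 'max' skips NaN, so a group only yields NaN if nothing is left in it.
df['output'] = df['A'].where(df['B'] == 1).groupby(df['ID']).transform('max')
print(df)
```

Because transform returns a Series aligned to the original index, the assignment does not depend on how the rows are ordered.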

Duplicate Row Value to Next Null Rows [duplicate]

Suppose I have a DataFrame with some NaNs:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
0 1 2
0 1 2 3
1 4 NaN NaN
2 NaN NaN 9
What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?
You could use the fillna method on the DataFrame and specify the method as ffill (forward fill):
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
This method...
propagate[s] last valid observation forward to next valid
To go the opposite way, there's also a bfill method.
This method doesn't modify the DataFrame inplace - you'll need to rebind the returned DataFrame to a variable or else specify inplace=True:
df.fillna(method='ffill', inplace=True)
The accepted answer is perfect. I had a related but slightly different situation where I had to fill in forward but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.
>>> from numpy import nan
>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
name number
0 a 0.0
1 a 1.0
2 a 2.0
3 b NaN
4 b 4.0
5 b NaN
6 c 6.0
7 c 7.0
8 c 8.0
9 c 9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0 0.0
1 1.0
2 2.0
3 NaN
4 4.0
5 4.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: number, dtype: float64
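The grouped fill can also be written with the ffill method that the groupby object exposes directly. A minimal sketch on the same data; the variable name filled is only for illustration.

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({
    'number': [0, 1, 2, np.nan, 4, np.nan, 6, 7, 8, 9],
    'name': list('aaabbbcccc'),
})

# Forward fill within each 'name' group: row 5 takes the value of row 4,
# while row 3 stays NaN because it is the first row of group 'b'.
filled = example.groupby('name')['number'].ffill()
print(filled)
```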
You can use pandas.DataFrame.fillna with the method='ffill' option. 'ffill' stands for 'forward fill' and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')
print(df)
# 0 1 2
#0 1 2 3
#1 4 2 3
#2 4 2 9
There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.
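For reference, a short sketch of the direct methods. As far as I know, the method argument of fillna has been deprecated since pandas 2.1, so ffill and bfill are probably the safer spelling on current versions.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

# Each NaN takes the last valid value above it in the same column.
forward = df.ffill()
# Each NaN takes the next valid value below it in the same column.
backward = df.bfill()
print(forward)
print(backward)
```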
One thing I noticed when trying this solution is that if there are NaNs at the start or the end of the array, neither ffill nor bfill alone fills everything. You need both.
In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])
In [225]: df.ffill()
Out[225]:
0
0 NaN
1 1.0
...
7 6.0
8 6.0
In [226]: df.bfill()
Out[226]:
0
0 1.0
1 1.0
...
7 6.0
8 NaN
In [227]: df.bfill().ffill()
Out[227]:
0
0 1.0
1 1.0
...
7 6.0
8 6.0
Single-column version
Fill NaN with the last valid value:
df[column_name].fillna(method='ffill', inplace=True)
Fill NaN with the next valid value:
df[column_name].fillna(method='backfill', inplace=True)
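The same single-column fills can be written as plain assignments, which avoids relying on an in-place change to a column selection (something that may not propagate back to the DataFrame on recent pandas versions). A minimal sketch; the column name 'value' and the data are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1.0, np.nan, np.nan, 4.0, np.nan]})

# Last valid value carried forward.
forward = df['value'].ffill()
# Next valid value carried backward; the trailing NaN has nothing after it.
backward = df['value'].bfill()

# Assign the result back to keep it in the DataFrame.
df['value'] = forward
print(df)
```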
I agree with the ffill approach. One extra point: you can cap how far values are propagated with the limit keyword argument.
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])
>>> df
0 1 2
0 1.0 2.0 3
1 NaN NaN 6
2 NaN NaN 9
>>> df[1].fillna(method='ffill', inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 NaN 2.0 6
2 NaN 2.0 9
Now with the limit keyword argument:
>>> df[0].fillna(method='ffill', limit=1, inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 1.0 2.0 6
2 NaN 2.0 9
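The limit argument is also accepted by ffill itself. A small sketch on the same data, applied to the whole frame instead of one column at a time; the name limited is only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])

# limit=1 fills at most one consecutive NaN per gap, so row 1 is filled
# from row 0 while row 2 stays NaN in columns 0 and 1.
limited = df.ffill(limit=1)
print(limited)
```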
ffill now has its own method, pd.DataFrame.ffill:
df.ffill()
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
You can use fillna to remove or replace NaN values.
Remove NaN (forward fill):
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df.fillna(method='ffill')
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
Replace NaN with a fixed value:
df.fillna(0) # 0 is the value that replaces each NaN
0 1 2
0 1.0 2.0 3.0
1 4.0 0.0 0.0
2 0.0 0.0 9.0
Reference pandas.DataFrame.fillna
There's also pandas.DataFrame.interpolate, which I think gives more control:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.interpolate(method="pad", limit=None, downcast="infer")  # downcast keeps dtype as int
print(df)
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
In my case, we have time series from different devices, but some devices may not send any value during some periods. So we create NA values for every device and time period, and then apply fillna.
df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')
Result:
0 1 value
0 device1 1 first val of device1
1 device1 2 first val of device1
2 device1 3 first val of device1
3 device2 1 None
4 device2 2 first val of device2
5 device2 3 first val of device2
6 device3 1 None
7 device3 2 None
8 device3 3 first val of device3

opposite of df.diff() in pandas

I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of each row and an adjacent row, the opposite of the .diff() function, which takes the difference.
This is how I'm currently solving the problem:
df = pd.DataFrame({'c': ['dd', 'ee', 'ff', 'gg', 'hh'], 'd': [1, 2, 3, 4, 5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
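To check that the rolling version matches the shift-and-add approach from the question, here is a small comparison on the same data. The names via_shift and via_rolling are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'c': ['dd', 'ee', 'ff', 'gg', 'hh'], 'd': [1, 2, 3, 4, 5]})

# Question's approach: add each row to the row that follows it.
via_shift = df['d'] + df['d'].shift(-1)

# Answer's approach: sum over a 2-row window (current and previous row),
# then shift up one row so each sum sits on the earlier row of the pair.
via_rolling = df['d'].rolling(2).sum().shift(-1)

print(pd.DataFrame({'via_shift': via_shift, 'via_rolling': via_rolling}))
```

Both leave the last row as NaN, since it has no following row to add.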
df.cumsum()
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling, because of a MultiIndex or for some other reason, you can try .cumsum() followed by .diff(2), which subtracts the .cumsum() result from two positions earlier.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
In general, .cumsum().diff(n+1) gives the sum of each row with the n rows before it, which for n = 1 is the counterpart of .diff(1). The drawback is that the first n+1 results are NaN.
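A short sketch comparing the cumsum trick with rolling on the same data. The variable names n, window, via_cumsum and via_rolling are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 6, 3, 9, 5, 30, 101, 8]})

n = 1            # sum each row with the n rows before it
window = n + 1   # number of rows that go into each sum

# Running total minus the running total `window` rows earlier.
via_cumsum = df['a'].cumsum().diff(window)

# Direct windowed sum over the same number of rows.
via_rolling = df['a'].rolling(window).sum()

print(pd.DataFrame({'a': df['a'], 'via_cumsum': via_cumsum, 'via_rolling': via_rolling}))
```

The two agree from row `window` onward; rolling also produces a value at row `window - 1`, where the cumsum version is still NaN.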
