Pandas groupby aggregate to new columns - python

I have a DataFrame that looks something like this:
A   B   C   D
1  10  22  14
1  12  20  37
1  11   8  18
1  10  10   6
2  11  13   4
2  12  10  12
3  14   0   5
and a function that looks something like this (NOTE: it's actually doing something more complex that can't be easily separated into three independent calls, but I'm simplifying for clarity):
def myfunc(g):
    return min(g), mean(g), max(g)
I want to use groupby on A with myfunc to get an output on columns B and C (ignoring D) something like this:
      B                  C
    min   mean  max    min  mean  max
A
1    10  10.75   12      8  15.0   22
2    11  11.50   12     10  11.5   13
3    14  14.00   14      0   0.0    0
I can do the following:
df2.groupby('A')[['B','C']].agg({
    'min':  lambda g: myfunc(g)[0],
    'mean': lambda g: myfunc(g)[1],
    'max':  lambda g: myfunc(g)[2]
})
But then—aside from this being ugly and calling myfunc multiple times—I end up with
    max        mean         min
     B   C       B     C     B   C
A
1   12  22   10.75  15.0    10   8
2   12  13   11.50  11.5    11  10
3   14   0   14.00   0.0    14   0
I can use .swaplevel(axis=1) to swap the column levels, but even then B and C are in multiple duplicated columns, and with the multiple function calls it feels like barking up the wrong tree.

If you arrange for myfunc to return a DataFrame whose columns are ['B', 'C'] and whose row index is ['min', 'mean', 'max'], then you can use groupby/apply to call the function (once per group) and concatenate the results as desired:
import numpy as np
import pandas as pd

def myfunc(g):
    result = pd.DataFrame({'min': np.min(g),
                           'mean': np.mean(g),
                           'max': np.max(g)}).T
    return result

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3],
                   'B': [10, 12, 11, 10, 11, 12, 14],
                   'C': [22, 20, 8, 10, 13, 10, 0],
                   'D': [14, 37, 18, 6, 4, 12, 5]})

result = df.groupby('A')[['B','C']].apply(myfunc)
result = result.unstack(level=-1)
print(result)
prints
       B                     C
     max   mean   min      max  mean   min
A
1   12.0  10.75  10.0     22.0  15.0   8.0
2   12.0  11.50  11.0     13.0  11.5  10.0
3   14.0  14.00  14.0      0.0   0.0   0.0
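Note that unstack leaves the inner column level in alphabetical order (max, mean, min). If you want the question's min/mean/max ordering, a reindex on that level should restore it (a small sketch):

result = result.reindex(['min', 'mean', 'max'], axis=1, level=1)  # reorder the stat level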
For others who may run across this and who do not need a custom function, note that it is best to use the builtin aggregators (below, specified by the strings 'min', 'mean' and 'max') whenever possible. They perform better than custom Python functions. Happily, in this toy problem, they produce the desired result:
In [99]: df.groupby('A')[['B','C']].agg(['min','mean','max'])
Out[99]:
      B                  C
    min   mean  max    min  mean  max
A
1    10  10.75   12      8  15.0   22
2    11  11.50   12     10  11.5   13
3    14  14.00   14      0   0.0    0
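As an aside, if flat single-level column names are preferred over the MultiIndex, named aggregation (pandas >= 0.25) can produce them directly; a sketch on the same df (the output column names here are arbitrary):

df.groupby('A').agg(B_min=('B', 'min'), B_mean=('B', 'mean'), B_max=('B', 'max'),
                    C_min=('C', 'min'), C_mean=('C', 'mean'), C_max=('C', 'max'))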

Something like this might work:

grouped = df2.groupby('A')[['B','C']]
aggregated = grouped.agg(['min', 'mean', 'max'])

Then you could use swaplevel to get the column order swapped around:

aggregated.columns = aggregated.columns.swaplevel(0, 1)
aggregated = aggregated.sort_index(level=0, axis=1)

(Note: sortlevel was removed in pandas 0.21; sort_index(level=..., axis=1) is its replacement.)

Related

min/max value of a column based on values of another column, grouped by and transformed in pandas

I'd like to know if I can do all this in one line, rather than multiple lines.
my dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1,1,1,1,1,1,2,2,2,2,2,2],
                   'A': [1, 2, 3, 10, np.nan, 5, 20, 6, 7, np.nan, np.nan, np.nan],
                   'B': [0,1,1,0,1,1,1,1,1,0,1,0],
                   'desired_output': [5,5,5,5,5,5,20,20,20,20,20,20]})
df
df
    ID     A  B  desired_output
0    1   1.0  0               5
1    1   2.0  1               5
2    1   3.0  1               5
3    1  10.0  0               5
4    1   NaN  1               5
5    1   5.0  1               5
6    2  20.0  1              20
7    2   6.0  1              20
8    2   7.0  1              20
9    2   NaN  0              20
10   2   NaN  1              20
11   2   NaN  0              20
I'm trying to find the maximum value of column A over the rows where column B == 1, grouped by column ID, and transform the result directly back into the dataframe without any extra merging or the like.
Something like the following (but without getting errors!):
df['desired_output'] = df.groupby('ID').A.where(df.B == 1).transform('max')  ## this gives an error
The max function should ignore the NaNs as well. I wonder if I'm trying to do too much in one line, but one can hope there is a way to write this beautifully.
EDIT:
I can get a very similar output by changing the where clause:
df['desired_output'] = df.where(df.B == 1).groupby('ID').A.transform('max')  ## this works but the output is not what I want
but the output is not exactly what I want: desired_output should not contain any NaN, unless all values of A are NaN within a group where B == 1.
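One pattern that seems to satisfy both constraints is to mask column A first and only then group, since transform('max') skips NaN and broadcasts the group max to every row of the group; a sketch:

# mask A where B != 1, then take the group-wise max over what remains
df['desired_output'] = df['A'].where(df['B'].eq(1)).groupby(df['ID']).transform('max')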
Here is a way to do it:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [1,1,1,1,1,1,2,2,2,2,2,2],
    'A': [1, 2, 3, 10, np.nan, 5, 20, 6, 7, np.nan, np.nan, np.nan],
    'B': [0,1,1,0,1,1,1,1,1,0,1,0],
    'desired_output': [5,5,5,5,5,5,20,20,20,20,20,20]
})

df['output'] = df[df.B == 1].groupby('ID').A.max()[df.ID].array
df
Result:
    ID     A  B  desired_output  output
0    1   1.0  0               5     5.0
1    1   2.0  1               5     5.0
2    1   3.0  1               5     5.0
3    1  10.0  0               5     5.0
4    1   NaN  1               5     5.0
5    1   5.0  1               5     5.0
6    2  20.0  1              20    20.0
7    2   6.0  1              20    20.0
8    2   7.0  1              20    20.0
9    2   NaN  0              20    20.0
10   2   NaN  1              20    20.0
11   2   NaN  0              20    20.0
Decomposition:
df[df.B == 1]        # start by filtering on B
    .groupby('ID')   # group by ID
    .A.max()         # get the max of column A within each group
    [df.ID]          # recast the result onto the shape of the ID series
    .array           # fetch the raw values from the Series
Important note: this relies on the index being as in the given example, that is, sorted, starting from 0, with an increment of 1. You will have to reset_index() on your DataFrame before this operation when that is not the case.
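If relying on the default RangeIndex is a concern, mapping the per-group maxima back through the ID column avoids it altogether; a sketch of that variant:

# map each row's ID to its group's max over the filtered rows; index-agnostic
df['output'] = df['ID'].map(df[df.B == 1].groupby('ID')['A'].max())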

[python] Filter on opposite values - pandas

I would like to create a filter that retrieves only the values of opposite signs in a certain column (for example 10, -10, 22, -22).
How can I do this? Thanks.
I would like to keep only the B codes whose opposite value is in A, typically:
The exact logic and expected output are unclear (please provide an example), but you could use the absolute value and the sign as groupers:
out = (df
       .assign(abs=df['col'].abs(), sign=np.sign(df['col']))
       .pivot(index='abs', columns='sign')
      )
output:
        id           col
sign    -1    1       -1     1
abs
4      NaN  4.0      NaN   4.0
7      5.0  NaN     -7.0   NaN
10     3.0  0.0    -10.0  10.0
22     2.0  1.0    -22.0  22.0
used input:
df = pd.DataFrame({'id': range(6),
                   'col': [10, 22, -22, -10, 4, -7],
                   })

   id  col
0   0   10
1   1   22
2   2  -22
3   3  -10
4   4    4
5   5   -7
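If the goal is simply to keep the rows whose opposite value also appears in the column, a plain membership test might already be enough (a sketch against the same input):

out = df[df['col'].isin(-df['col'])]  # keeps 10, 22, -22, -10; drops 4 and -7, whose opposites are absent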

Chunking a list to size N and adding to a DataFrame ends up with missing data

I'm chunking a list into smaller lists of size n and trying to add each new list to a DataFrame. When I print the lists, all of the data is there; when I try to put the lists in a DataFrame, the first list of the set disappears.
my_list = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
n = 5

def divide_chunks(a, n):
    # yield successive n-sized chunks from a
    for i in range(0, len(a), n):
        yield a[i:i+n]

x = divide_chunks(my_list, n)
for i in x:
    print(i)
gives me
[1, 2, 3, 4, 5]
[6, 7, 8, 9, 10]
[11, 12, 13, 14, 15]
[16, 17, 18, 19, 20]
I would like to put this into a DataFrame.
Here is how I'm trying to do that:

x = divide_chunks(my_list, n)
for i in x:
    emptydf = pd.DataFrame(x)
emptydf
I would expect the output to be like above, but instead I'm missing the list that has 1 through 5:

    0   1   2   3   4
0   6   7   8   9  10
1  11  12  13  14  15
2  16  17  18  19  20
Your code is not doing what you think it does:
x = divide_chunks(my_list, 4)
print(x)
Will return an object like such:
<generator object divide_chunks at 0x2aaae0622e60>
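That generator is the key to the missing first chunk: the for loop pulls one chunk from it before pd.DataFrame(x) drains whatever is left. A minimal reproduction, assuming the my_list and divide_chunks defined above and pandas imported as pd:

x = divide_chunks(my_list, 5)
first = next(x)         # consumes [1, 2, 3, 4, 5], just like the loop's first iteration
rest = pd.DataFrame(x)  # the frame is built only from the three remaining chunks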
Now you can directly use:
pd.DataFrame(x)
    0   1   2   3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
4  17  18  19  20
This can be done with np.array_split. Here I added an extra value to show how it behaves with an uneven division:
import pandas as pd
import numpy as np
my_list = [*range(1, 22)]
N = 5
pd.DataFrame(np.array_split(my_list, range(N, len(my_list), N)))
# 0 1 2 3 4
#0 1 2.0 3.0 4.0 5.0
#1 6 7.0 8.0 9.0 10.0
#2 11 12.0 13.0 14.0 15.0
#3 16 17.0 18.0 19.0 20.0
#4 21 NaN NaN NaN NaN

Is there a pandas function to return instantaneous values from a cumulative sum?

I have a dataframe whose columns hold accumulated values, e.g. a financial report for all four quarters in a year. I need to de-accumulate the values in order to get the value for each period instead of the accumulated sum over time.
I've already built a function that loops over every column in the dataframe and subtracts the previous column from the selected column (very inefficient). But in some cases I have monthly data instead of quarterly, so the number of periods changes from 4 to 12.
[image of the original dataframe omitted]
I need a function that takes the number of periods as input (like a rolling sum takes the window size) and outputs the disaggregated sum of the dataframe.
Thank you!
Take a diff within each group, then .fillna to recover the first value.
Sample Data
df = pd.DataFrame(np.random.randint(1, 10, (3, 8)))
df.columns = [f'{y}-{str(m).zfill(2)}' for y in range(2012, 2014) for m in range(1, 5)]
df = df.cumsum(1)  # for illustration; don't worry about accumulation across years
df['tag'] = 'foo'
   2012-01  2012-02  2012-03  2012-04  2013-01  2013-02  2013-03  2013-04  tag
0        5        6       15       23       25       28       36       45  foo
1        5        9       14       17       24       27       31       38  foo
2        4       10       11       19       24       29       38       41  foo
Code:
df.groupby(df.columns.str[0:4], axis=1).diff(1).fillna(df)
   2012-01  2012-02  2012-03  2012-04  2013-01  2013-02  2013-03  2013-04  tag
0      5.0      1.0      9.0      8.0     25.0      3.0      8.0      9.0  foo
1      5.0      4.0      5.0      3.0     24.0      3.0      4.0      7.0  foo
2      4.0      6.0      1.0      8.0     24.0      5.0      9.0      3.0  foo
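Note that groupby(..., axis=1) is deprecated in recent pandas (2.1+); an equivalent that transposes, groups on the year prefix, and transposes back might look like this (a sketch on the same df):

num = df.drop(columns='tag')  # keep only the numeric period columns
out = num.T.groupby(num.columns.str[:4]).diff().fillna(num.T).T
out['tag'] = df['tag']        # reattach the non-numeric column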
You can follow these steps:

import pandas as pd

df = pd.DataFrame([[1, 3, 2], [100, 90, 110]],
                  columns=['2019-01', '2019-02', '2019-03'], index=['A', 'B'])
df = df.unstack().reset_index(name='value').sort_values(['level_1', 'level_0'])
df['delta'] = df.groupby('level_1')['value'].diff()
df['delta'] = df['delta'].fillna(df['value'])
df.pivot(index='level_1', columns='level_0', values='delta')

Pandas GroupBy and Calculate Z-Score [duplicate]

This question already has an answer here: adding a grouped-by zscore column to a pandas dataframe.
So I have a dataframe that looks like this:
pd.DataFrame([[1, 10, 14], [1, 12, 14], [1, 20, 12], [1, 25, 12],
              [2, 18, 12], [2, 30, 14], [2, 4, 12], [2, 10, 14]],
             columns=['A', 'B', 'C'])

   A   B   C
0  1  10  14
1  1  12  14
2  1  20  12
3  1  25  12
4  2  18  12
5  2  30  14
6  2   4  12
7  2  10  14
My goal is to get the z-scores of column B relative to their groups, defined by columns A and C. I know I can calculate the mean and standard deviation of each group:
test.groupby(['A', 'C']).mean()

          B
A C
1 12   22.5
  14   11.0
2 12   11.0
  14   20.0

test.groupby(['A', 'C']).std()

              B
A C
1 12   3.535534
  14   1.414214
2 12   9.899495
  14  14.142136
Now, for every item in column B, I want to calculate its z-score based on these means and standard deviations. So the first result would be (10 - 11) / 1.41. I feel like there has to be a way to do this without too much complexity, but I've been stuck on how to proceed. Let me know if anyone can point me in the right direction or if I need to clarify anything!
Do it with transform:

Mean = test.groupby(['A', 'C']).B.transform('mean')
Std = test.groupby(['A', 'C']).B.transform('std')

Then:

(test.B - Mean) / Std
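The same computation collapses into a single assignment if preferred (a sketch; the column name B_z is arbitrary):

test['B_z'] = ((test['B'] - test.groupby(['A', 'C'])['B'].transform('mean'))
               / test.groupby(['A', 'C'])['B'].transform('std'))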
Or use the zscore function from scipy:
from scipy.stats import zscore
test.groupby(['A', 'C']).B.transform(lambda x: zscore(x, ddof=1))
Out[140]:
0   -0.707107
1    0.707107
2   -0.707107
3    0.707107
4    0.707107
5    0.707107
6   -0.707107
7   -0.707107
Name: B, dtype: float64
To confirm that the two results tie out:

(test.B - Mean) / Std == test.groupby(['A', 'C']).B.transform(lambda x: zscore(x, ddof=1))
Out[148]:
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
Name: B, dtype: bool
