adding subtotals to multiple layers of pandas pivot-table - python

Suppose I have a very basic dataset:
name food city rating
paul cream LA 2
daniel chocolate NY 3
paul chocolate LA 4
john cream NY 5
daniel jam LA 1
daniel butter NY 3
john jam NY 9
I want to compute the descriptive stats for each person's food preferences, which is easy enough:
df1 = pd.pivot_table(df, values='rating', index=['city', 'name', 'food'], aggfunc=['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'], margins=True, margins_name="Total")
But I want to add subtotals for each name and city.
I can get subtotals for name and city in separate objects:
df2 = df.groupby('name').agg(['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'])
df2.index = pd.MultiIndex.from_arrays([df2.index + '_total', len(df2.index) * ['']])
df3 = df.groupby('city').agg(['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'])
df3.index = pd.MultiIndex.from_arrays([df3.index + '_total', len(df3.index) * ['']])
But I'm struggling to combine the three tables.
The output of df1 has columns for 'city', 'name', and 'food' on each row:
city name food count nunique...
LA daniel jam 1 1
paul choc 1 1
cream 1 1
NY daniel butter 1 1
but the outputs for df2 and df3 just have 'name' (df2) or 'city' (df3):
name count nunique
daniel_total 3 1
john_total 2 1
I want to merge these so the name totals are placed in the 'name' column and the city totals in the 'city' column, like so:
city name food count
LA daniel jam 1
paul choc 1
cream 1
LA_total 3
NY daniel butter 1
NY_total 2
daniel_total 3
john_total 2
paul_total 2
I've tried using pandas concat, but it groups the descriptive columns together
pd.concat([df1, df2, df3]).sort_index()
I think I need to tell Python which column to join the df2 and df3 datasets into, but I'm not sure how.

Let's try this:
df2 = df.groupby(['city','name']).agg(['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'])
df2 = df2.rename(index=lambda x: x+'_total', level=1)
df2 = df2.swaplevel(0, 1, axis=1)
df2 = df2.assign(food='').set_index('food', append=True)
df3 = df.groupby('city').agg(['count', 'nunique', 'sum', 'min', 'max', 'mean', 'std', 'sem', 'median', 'mad', 'var', 'skew'])
df3.index = pd.MultiIndex.from_arrays([df3.index + '_total', len(df3.index) * ['']])
df3 = df3.assign(name='', food='').set_index(['name','food'], append=True)
df3 = df3.swaplevel(0,1, axis=1)
df_out = pd.concat([df1,df2,df3]).sort_index()
df_out
Output:
count nunique sum min max mean std sem median mad var skew
rating rating rating rating rating rating rating rating rating rating rating rating
city name food
LA daniel jam 1 1 1 1 1 1.000000 NaN NaN 1 0.000000 NaN NaN
daniel_total 1 1 1 1 1 1.000000 NaN NaN 1 0.000000 NaN NaN
paul chocolate 1 1 4 4 4 4.000000 NaN NaN 4 0.000000 NaN NaN
cream 1 1 2 2 2 2.000000 NaN NaN 2 0.000000 NaN NaN
paul_total 2 2 6 2 4 3.000000 1.414214 1.000000 3 1.000000 2.000000 NaN
LA_total 3 3 7 1 4 2.333333 1.527525 0.881917 2 1.111111 2.333333 0.935220
NY daniel butter 1 1 3 3 3 3.000000 NaN NaN 3 0.000000 NaN NaN
chocolate 1 1 3 3 3 3.000000 NaN NaN 3 0.000000 NaN NaN
daniel_total 2 1 6 3 3 3.000000 0.000000 0.000000 3 0.000000 0.000000 NaN
john cream 1 1 5 5 5 5.000000 NaN NaN 5 0.000000 NaN NaN
jam 1 1 9 9 9 9.000000 NaN NaN 9 0.000000 NaN NaN
john_total 2 2 14 5 9 7.000000 2.828427 2.000000 7 2.000000 8.000000 NaN
NY_total 4 3 20 3 9 5.000000 2.828427 1.414214 4 2.000000 8.000000 1.414214
Total 7 6 27 1 9 3.857143 2.609506 0.986301 3 1.836735 6.809524 1.398866
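For reference, here is a condensed, runnable version of the same idea. This is a sketch, not the answer verbatim: it uses a shorter aggregation list (so it also runs on pandas 2.x, where 'mad' has been removed) and builds the subtotal indexes explicitly with MultiIndex.from_arrays instead of assign/set_index.
import pandas as pd

df = pd.DataFrame({'name': ['paul', 'daniel', 'paul', 'john', 'daniel', 'daniel', 'john'],
                   'food': ['cream', 'chocolate', 'chocolate', 'cream', 'jam', 'butter', 'jam'],
                   'city': ['LA', 'NY', 'LA', 'NY', 'LA', 'NY', 'NY'],
                   'rating': [2, 3, 4, 5, 1, 3, 9]})
aggs = ['count', 'sum', 'mean']

# Detail rows per (city, name, food) plus the grand total row
df1 = pd.pivot_table(df, values='rating', index=['city', 'name', 'food'],
                     aggfunc=aggs, margins=True, margins_name='Total')

# Per-name subtotals, reshaped onto the same 3-level index
df2 = df.groupby(['city', 'name'])[['rating']].agg(aggs).swaplevel(0, 1, axis=1)
df2.index = pd.MultiIndex.from_arrays(
    [df2.index.get_level_values('city'),
     df2.index.get_level_values('name') + '_total',
     [''] * len(df2)],
    names=['city', 'name', 'food'])

# Per-city subtotals, reshaped the same way
df3 = df.groupby('city')[['rating']].agg(aggs).swaplevel(0, 1, axis=1)
df3.index = pd.MultiIndex.from_arrays(
    [df3.index + '_total', [''] * len(df3), [''] * len(df3)],
    names=['city', 'name', 'food'])

print(pd.concat([df1, df2, df3]).sort_index())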

Related

Python Dataframe Groupby Mean and STD

I know how to compute the groupby mean or std. But now I want to compute both at the same time.
My code:
df =
a b c d
0 Apple 3 5 7
1 Banana 4 4 8
2 Cherry 7 1 3
3 Apple 3 4 7
xdf = df.groupby('a').agg([np.mean(),np.std()])
Present output:
TypeError: _mean_dispatcher() missing 1 required positional argument: 'a'
Try removing the () from the np functions:
xdf = df.groupby("a").agg([np.mean, np.std])
print(xdf)
Prints:
b c d
mean std mean std mean std
a
Apple 3 0.0 4.5 0.707107 7 0.0
Banana 4 NaN 4.0 NaN 8 NaN
Cherry 7 NaN 1.0 NaN 3 NaN
EDIT: To "flatten" column multi-index:
xdf = df.groupby("a").agg([np.mean, np.std])
xdf.columns = xdf.columns.map("_".join)
print(xdf)
Prints:
b_mean b_std c_mean c_std d_mean d_std
a
Apple 3 0.0 4.5 0.707107 7 0.0
Banana 4 NaN 4.0 NaN 8 NaN
Cherry 7 NaN 1.0 NaN 3 NaN
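As an aside (not part of the original answer): recent pandas versions also support named aggregation, which produces flat column names directly, so the "_".join step isn't needed. A sketch:
import pandas as pd

df = pd.DataFrame({'a': ['Apple', 'Banana', 'Cherry', 'Apple'],
                   'b': [3, 4, 7, 3], 'c': [5, 4, 1, 4], 'd': [7, 8, 3, 7]})

# Named aggregation (pandas >= 0.25): keyword is the output column, value is (column, func)
xdf = df.groupby('a').agg(b_mean=('b', 'mean'), b_std=('b', 'std'),
                          c_mean=('c', 'mean'), c_std=('c', 'std'),
                          d_mean=('d', 'mean'), d_std=('d', 'std'))
print(xdf)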

Using Pandas DataFrames is there a way to break a row into multiple rows based on if each column contains a value?

Given a dataframe with columns A B C D E F and 3 rows:
[1,100,null,100,null,"cat"]
[2,null,50,null,50,"dog"]
[3,100,null,null,100,"cow"]
I need to find a way to go through each row and, based on whether there is a value in columns B, C, D, or E, break each value out into its own cloned row, with only one value present per row.
Expected result:
[1,100,null,null,null,"cat"]
[1,null,null,100,null,"cat"]
[2,null,50,null,null,"dog"]
[2,null,null,null,50,"dog"]
[3,100,null,null,null,"cow"]
[3,null,null,null,100,"cow"]
I have searched all over and did not find a good solution to this.
IIUC, one way:
df.melt(['A','F'])\
.dropna()\
.reset_index()\
.pivot(index=['index','A','F'], columns='variable', values='value')\
.reset_index()\
.drop(['index'], axis=1)
Output:
variable A F B C D E
0 1 cat 100.0 NaN NaN NaN
1 3 cow 100.0 NaN NaN NaN
2 2 dog NaN 50.0 NaN NaN
3 1 cat NaN NaN 100.0 NaN
4 2 dog NaN NaN NaN 50.0
5 3 cow NaN NaN NaN 100.0
Another way:
df.set_index(['A', 'F'])\
.stack()\
.reset_index()\
.set_index(['A','F','level_2'], append=True)[0]\
.unstack()\
.reset_index()\
.drop('level_0', axis=1)
Output:
level_2 A F B C D E
0 1 cat 100.0 NaN NaN NaN
1 1 cat NaN NaN 100.0 NaN
2 2 dog NaN 50.0 NaN NaN
3 2 dog NaN NaN NaN 50.0
4 3 cow 100.0 NaN NaN NaN
5 3 cow NaN NaN NaN 100.0
This is some pretty detailed dataframe reshaping.
Scott's first answer worked for me, with a slight modification:
df = pd.melt(df, ['A','F'])\
    .dropna()\
    .reset_index()\
    .pivot(index=['index','A','F'], columns='variable', values='value')\
    .reset_index()\
    .drop(['index'], axis=1)
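For completeness, a minimal self-contained version of the first approach, with the sample data built explicitly (a sketch; the pivot with a list index needs a reasonably recent pandas):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 100, np.nan, 100, np.nan, 'cat'],
                   [2, np.nan, 50, np.nan, 50, 'dog'],
                   [3, 100, np.nan, np.nan, 100, 'cow']],
                  columns=list('ABCDEF'))

out = (df.melt(['A', 'F'])              # long format: one row per (A, F, column, value)
         .dropna()                      # keep only the cells that actually held a value
         .reset_index()                 # keep a unique row id so duplicates stay separate
         .pivot(index=['index', 'A', 'F'], columns='variable', values='value')
         .reset_index()
         .drop(columns='index'))
print(out)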

pandas groupby apply/transform operation to do manipulation per group

I have a dataframe like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'year': [1990,1990,1992,1992,1992],
'value': [100,200,300,400,np.nan],
'rank': [2,1,2,1,3]})
print(df)
year value rank
0 1990 100.0 2
1 1990 200.0 1
2 1992 300.0 2
3 1992 400.0 1
4 1992 NaN 3
I am trying to achieve this:
# For year 1990, maximum value is 200, rank is 1 and also relative value is 1.
year value rank value_relative
0 1990 100.0 2 0.5
1 1990 200.0 1 1
2 1992 300.0 2 0.75
3 1992 400.0 1 1
4 1992 NaN 3 NaN
My attempt:
df['value_relative'] = df.groupby('year')['value'].transform(lambda x: x/x[x.rank == 1]['value'])
How can we do this operation where we calculate relative value for each year?
IIUC, use transform with first after sort_values:
df['value_relative']=df.value/df.sort_values('rank').groupby('year').value.transform('first')
df
Out[60]:
year value rank value_relative
0 1990 100.0 2 0.50
1 1990 200.0 1 1.00
2 1992 300.0 2 0.75
3 1992 400.0 1 1.00
4 1992 NaN 3 NaN
Or just use transform with max:
df['value_relative']=df.value/df.groupby('year').value.transform('max')
Another method
df.value/df.loc[df.groupby('year')['rank'].transform('idxmin'),'value'].values
Out[64]:
0 0.50
1 1.00
2 0.75
3 1.00
4 NaN
Name: value, dtype: float64
If you need the 2nd rank as the denominator:
df.value/df.year.map(df.loc[df['rank']==2].set_index('year')['value'])
The difference here depends on how you get your rank: if it is based on the max of value, then both approaches return the same result, but if the rank is given independently of the value column, then you should use first.
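A tiny illustration of that point, with made-up data where rank 1 deliberately points at the smaller value:
import pandas as pd

demo = pd.DataFrame({'year': [1990, 1990],
                     'value': [100.0, 200.0],
                     'rank': [1, 2]})   # rank 1 is not the maximum here

print(demo.value / demo.sort_values('rank').groupby('year').value.transform('first'))
# 0    1.0
# 1    2.0   -> relative to the rank-1 value (100)
print(demo.value / demo.groupby('year').value.transform('max'))
# 0    0.5
# 1    1.0   -> relative to the maximum (200)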
I liked and accepted Wen's answer, but wanted to give my 2 cents:
The simplest method is to just divide value by the maximum, but I am trying to learn to do this using the separate rank column:
df.groupby('year')['value'].transform(lambda x: x/x.max())
0 0.50
1 1.00
2 0.75
3 1.00
4 NaN
Another simple method for rank ==2:
df.groupby('year')['value'].transform(lambda x: x/x.nlargest(2).iloc[-1])
0 1.000000
1 2.000000
2 1.000000
3 1.333333
4 NaN
NOTE: Wen's method:
df.value/df.year.map(df.loc[df['rank']==2].set_index('year')['value'])
0 1.000000
1 2.000000
2 1.000000
3 1.333333
4 NaN
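To round this out, here is a small hypothetical helper (not from the answers above) that divides each value by its year's value at an arbitrary rank r, following Wen's map-based pattern:
def value_relative_to_rank(df, r):
    # map each row's year to the value holding rank r within that year
    denom = df['year'].map(df.loc[df['rank'] == r].set_index('year')['value'])
    return df['value'] / denom

df['value_relative'] = value_relative_to_rank(df, 1)   # same result as the accepted answer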

Join dataframes by key - repeated data as new columns

I'm facing the following situation. I have two dataframes, let's say df1 and df2, and I need to join them by a key (ID_ed, ID). The second dataframe may have more than one occurrence of the key. What I need is to join the two dataframes and add the repeated occurrences of the key as new columns (as shown in the next image).
I tried merge = df2.join(df1, lsuffix='_ZID', rsuffix='_IID', how='left') and concat operations, but no luck so far. It seems that only the last occurrence is preserved (as if the data were being overwritten).
Any help with this is really appreciated; thanks in advance.
Another approach is to create a serial counter for the ID_ed column, then set_index and unstack before calling pivot_table. The pivot_table aggregation would be 'first'. This approach is fairly similar to this SO answer.
Generate the data
import pandas as pd
import numpy as np
a = [['ID_ed','color'],[1,5],[2,8],[3,7]]
b = [['ID','code'],[1,1],[1,5],
[2,np.nan],[2,20],[2,74],
[3,10],[3,98],[3,85],
[3,21],[3,45]]
df1 = pd.DataFrame(a[1:], columns=a[0])
df2 = pd.DataFrame(b[1:], columns=b[0])
print(df1)
ID_ed color
0 1 5
1 2 8
2 3 7
print(df2)
ID code
0 1 1.0
1 1 5.0
2 2 NaN
3 2 20.0
4 2 74.0
5 3 10.0
6 3 98.0
7 3 85.0
8 3 21.0
9 3 45.0
First the merge and unstack
# Merge and add a serial counter column
df = df1.merge(df2, how='inner', left_on='ID_ed', right_on='ID')
df['counter'] = df.groupby('ID_ed').cumcount()+1
print(df)
ID_ed color ID code counter
0 1 5 1 1.0 1
1 1 5 1 5.0 2
2 2 8 2 NaN 1
3 2 8 2 20.0 2
4 2 8 2 74.0 3
5 3 7 3 10.0 1
6 3 7 3 98.0 2
7 3 7 3 85.0 3
8 3 7 3 21.0 4
9 3 7 3 45.0 5
# Set index and unstack
df_unstacked = df.set_index(['ID_ed','color','counter']).\
    unstack().\
    swaplevel(1,0,axis=1).\
    sort_index(level=0,axis=1).add_prefix('counter_')
print(df_unstacked)
counter counter_1 counter_2 \
counter_ID counter_code counter_ID counter_code\
ID_ed color \
1 5 1.0 1.0 1.0 5.0\
2 8 2.0 NaN 2.0 20.0\
3 7 3.0 10.0 3.0 98.0 \
counter_3 counter_4 counter_5
counter_ID counter_code counter_ID counter_code counter_ID counter_code
NaN NaN NaN NaN NaN NaN
2.0 74.0 NaN NaN NaN NaN
3.0 85.0 3.0 21.0 3.0 45.0
Next generate the pivot table
# Pivot table with 'first' aggregation
dfp = pd.pivot_table(df, index=['ID_ed','color'],
columns=['counter'],
values=['ID', 'code'],
aggfunc='first')
print(dfp)
ID code
counter 1 2 3 4 5 1 2 3 4 5
ID_ed color
1 5 1.0 1.0 NaN NaN NaN 1.0 5.0 NaN NaN NaN
2 8 2.0 2.0 2.0 NaN NaN NaN 20.0 74.0 NaN NaN
3 7 3.0 3.0 3.0 3.0 3.0 10.0 98.0 85.0 21.0 45.0
Finally rename the columns and slice by partial column name
# Rename columns
level_1_names = list(dfp.columns.get_level_values(1))
level_0_names = list(dfp.columns.get_level_values(0))
new_cnames = [b+'_'+str(f) for f, b in zip(level_1_names, level_0_names)]
dfp.columns = new_cnames
# Slice by new column names
print(dfp.loc[:, dfp.columns.str.contains('code')].reset_index(drop=False))
ID_ed color code_1 code_2 code_3 code_4 code_5
0 1 5 1.0 5.0 NaN NaN NaN
1 2 8 NaN 20.0 74.0 NaN NaN
2 3 7 10.0 98.0 85.0 21.0 45.0
I'd use cumcount and pivot_table:
In [11]: df1
Out[11]:
ID color
0 1 5
1 2 8
2 3 7
In [12]: df2
Out[12]:
ID code
0 1 1.0
1 1 5.0
2 2 NaN
3 2 20.0
4 2 74.0
In [13]: res = df1.merge(df2) # This is a merge if the column names match
In [14]: res
Out[14]:
ID color code
0 1 5 1.0
1 1 5 5.0
2 2 8 NaN
3 2 8 20.0
4 2 8 74.0
In [15]: res['count'] = res.groupby('ID').cumcount()
In [16]: res.pivot_table('code', ['ID', 'color'], 'count')
Out[16]:
count 0 1 2
ID color
1 5 1.0 5.0 NaN
2 8 NaN 20.0 74.0
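If you also want flat column names and ID/color back as regular columns, one possible final step (a sketch, not part of the original answer):
wide = res.pivot_table('code', ['ID', 'color'], 'count')
wide.columns = [f'code_{c + 1}' for c in wide.columns]   # the cumcount counter starts at 0
print(wide.reset_index())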

Python: How to add specific columns of .mean to dataframe

How can I add the means of b and c to my dataframe? I tried a merge but it didn't seem to work. So I want two extra columns, b_mean and c_mean, added to my dataframe with the results of df.groupby('date').mean().
DataFrame
a b c date
0 2 3 5 1
1 5 9 1 1
2 3 7 1 1
I have the following code
import pandas as pd
a = [{'date': 1,'a':2, 'b':3, 'c':5}, {'date':1, 'a':5, 'b':9, 'c':1}, {'date':1, 'a':3, 'b':7, 'c':1}]
df = pd.DataFrame(a)
x = df.groupby('date').mean()
Edit:
Desired output would be the following
df.groupby('date').mean() returns:
a b c
date
1 3.333333 6.333333 2.333333
My desired result would be the following data frame
a b c date a_mean b_mean
0 2 3 5 1 3.3333 6.3333
1 5 9 1 1 3.3333 6.3333
2 3 7 1 1 3.3333 6.3333
As #ayhan mentioned, you can use groupby().transform() for this. Transform is like apply, but it uses the same index as the original dataframe instead of the unique values in the column(s) grouped on.
df['a_mean'] = df.groupby('date')['a'].transform('mean')
df['b_mean'] = df.groupby('date')['b'].transform('mean')
>>> df
a b c date b_mean a_mean
0 2 3 5 1 6.333333 3.333333
1 5 9 1 1 6.333333 3.333333
2 3 7 1 1 6.333333 3.333333
solution
Use join with an rsuffix parameter.
df.join(df.groupby('date').mean(), on='date', rsuffix='_mean')
a b c date a_mean b_mean c_mean
0 2 3 5 1 3.333333 6.333333 2.333333
1 5 9 1 1 3.333333 6.333333 2.333333
2 3 7 1 1 3.333333 6.333333 2.333333
We can limit it to just ['a', 'b']
df.join(df.groupby('date')[['a', 'b']].mean(), on='date', rsuffix='_mean')
a b c date a_mean b_mean
0 2 3 5 1 3.333333 6.333333
1 5 9 1 1 3.333333 6.333333
2 3 7 1 1 3.333333 6.333333
extra credit
Not really answering your question... but I thought it was neat!
d1 = df.set_index('date', append=True).swaplevel(0, 1)
g = df.groupby('date').describe()
d1.append(g).sort_index()
a b c
date
1 0 2.000000 3.000000 5.000000
1 5.000000 9.000000 1.000000
2 3.000000 7.000000 1.000000
25% 2.500000 5.000000 1.000000
50% 3.000000 7.000000 1.000000
75% 4.000000 8.000000 3.000000
count 3.000000 3.000000 3.000000
max 5.000000 9.000000 5.000000
mean 3.333333 6.333333 2.333333
min 2.000000 3.000000 1.000000
std 1.527525 3.055050 2.309401
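A caveat if you are on current pandas: DataFrame.append was removed in pandas 2.0, so the last line would need pd.concat instead, e.g. pd.concat([d1, g]).sort_index(). Note also that groupby().describe() now puts the statistics in the columns rather than the index, so the combined layout will differ from the output shown.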
I'm assuming that you need the mean value of a column added as a new column in the dataframe. Please correct me if that's not the case.
You can achieve this by taking the mean of each column per group directly and creating a new column by assigning it, like this:
In [1]: import pandas as pd
In [2]: a = [{'date': 1,'a':2, 'b':3, 'c':5}, {'date':1, 'a':5, 'b':9, 'c':1}, {'date':1, 'a':3, 'b':7, 'c':1}]
In [3]: df = pd.DataFrame(a)
In [4]: for col in ['b','c']:
...: df[col+"_mean"] = df.groupby('date')[col].transform('mean')
In [5]: df
Out[5]:
a b c date b_mean c_mean
0 2 3 5 1 6.333333 2.333333
1 5 9 1 1 6.333333 2.333333
2 3 7 1 1 6.333333 2.333333
