How to add sub-total columns to a multilevel columns dataframe?

I've a dataframe with 3 levels of multi index columns:
quarter        Q1                     Q2                     Totals
year         2021        2022        2021        2022
            qty orders  qty orders  qty orders  qty orders  qty orders
month name
January      40      2    5      1    1      2    0      0   46      5
February     20      8    2      3    4      6    0      0   26     17
March         2     10    7      4    3      3    0      0   12     17
Totals       62     20   14      8    8     11    0      0   84     39
After doing a groupby on column levels (0, 2), I have the following subtotals dataframe:
quarter        Q1          Q2          Totals
            qty orders  qty orders  qty orders
month name
January      45      3    1      2   46      5
February     22     10    4      6   26     16
March         9     14    3      3   12     17
Totals       76     28    8     11   84     39
I need to insert the second into the first, without disturbing the columns, levels or index, so that I get the following dataframe:
quarter        Q1                              Q2                                Totals
year         2021        2022     Subtotal        2021        2022     Subtotal
            qty orders  qty orders  qty orders  qty orders  qty orders  qty orders  qty orders
month name
January      40      2    5      1   45      3    1      2    0      0    1      2   46      5
February     20      8    2      3   22     10    4      6    0      0    4      6   26     16
March         2     10    7      4    9     14    3      3    0      0    3      3   12     17
Totals       62     20   14      8   76     28    8     11    0      0    8     11   84     39
How do I do this?

With your initial dataframe (before groupby):
import pandas as pd
df = pd.DataFrame(
    [
        [40, 2, 5, 1, 1, 2, 0, 0],
        [20, 8, 2, 3, 4, 6, 0, 0],
        [2, 10, 7, 4, 3, 3, 0, 0],
        [62, 20, 14, 8, 8, 11, 0, 0],
    ],
    columns=pd.MultiIndex.from_product(
        [("Q1", "Q2"), ("2021", "2022"), ("qty", "orders")]
    ),
    index=["January", "February", "March", "Totals"],
)
Here is one way to do it (using product from the Python standard library's itertools module; a nested for-loop would also work):
from itertools import product

# Add the subtotal columns (2021 + 2022) for each quarter/measure pair
for level1, level2 in product(["Q1", "Q2"], ["qty", "orders"]):
    df.loc[:, (level1, "subtotal", level2)] = (
        df.loc[:, (level1, "2021", level2)] + df.loc[:, (level1, "2022", level2)]
    )

# Sort the columns so each subtotal sits next to its years
df = df.reindex(
    pd.MultiIndex.from_product(
        [("Q1", "Q2"), ("2021", "2022", "subtotal"), ("qty", "orders")]
    ),
    axis=1,
)
Then:
print(df)
# Output
               Q1                                  Q2              \
             2021        2022     subtotal        2021        2022
            qty orders  qty orders  qty orders  qty orders  qty orders
January      40      2    5      1   45      3    1      2    0      0
February     20      8    2      3   22     11    4      6    0      0
March         2     10    7      4    9     14    3      3    0      0
Totals       62     20   14      8   76     28    8     11    0      0
          subtotal
          qty orders
January     1      2
February    4      6
March       3      3
Totals      8     11
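An alternative sketch, starting from the initial df above (before the subtotal loop), assuming you would rather reuse a groupby result than loop: sum over the "year" level, re-insert "Subtotal" as the middle column level, then concatenate and reorder.
subtotals = df.T.groupby(level=[0, 2]).sum().T          # columns: (quarter, measure)
subtotals.columns = pd.MultiIndex.from_tuples(
    [(q, "Subtotal", m) for q, m in subtotals.columns]  # re-add the middle (year) level
)
result = (
    pd.concat([df, subtotals], axis=1)
    .sort_index(axis=1, level=[0, 1], sort_remaining=False)  # 2021, 2022, Subtotal per quarter
    .reindex(["qty", "orders"], axis=1, level=2)             # keep qty before orders
)
This relies on "Subtotal" sorting after the year strings "2021"/"2022"; with other labels you would reindex level 1 explicitly instead.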


Merge should produce NaN values when no matching value exists

I have the following problem:
If a year or date does not exist in df2, a price and a listing_id still get attached during the merge, but those should be NaN.
The second problem: when merging, as soon as there are multiple rows with the same day and year, the temperature is also copied onto the second row, for example:
d = {'id': [1], 'day': [1], 'temperature': [20], 'year': [2001]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
d2 = {'id': [122, 244], 'day': [1, 1],
'listing_id': [2, 4], 'price': [20, 440], 'year': [2001, 2001]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 1 4 440 2001
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 1 1 20 2001 4 440 # <-- The second temperature is wrong :/
This should not be so: if I later have another day-1 date, from year 2002 with a temperature of 30, and I want to calculate the average, I get (20 + 20 + 30) / 3 ≈ 23.3. It should be (20 + 30) / 2 = 25. Therefore, where a match has already been used, there should be a NaN value instead.
Code Snippet
d = {'id': [1, 2, 3, 4, 5], 'day': [1, 2, 3, 4, 2],
'temperature': [20, 40, 50, 60, 20], 'year': [2001, 2002, 2004, 2005, 1999]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
1 2 2 40 2002
2 3 3 50 2004
3 4 4 60 2005
4 5 2 20 1999
d2 = {'id': [122, 244, 387, 4454, 521], 'day': [1, 2, 3, 4, 2],
'listing_id': [2, 4, 5, 6, 7], 'price': [20, 440, 500, 6600, 500],
'year': [2001, 2002, 2004, 2005, 2005]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 2 4 440 2002
2 387 3 5 500 2004
3 4454 4 6 6600 2005
4 521 2 7 500 2005
df3 = pd.merge(df, df2[['day', 'listing_id', 'price']],
               left_on='day', right_on='day', how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 40 2002 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 4 440
6 5 2 20 1999 7 500
What I want
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 NaN 2005 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 NaN NaN
IIUC:
>>> df.merge(df2[['day', 'listing_id', 'price', 'year']],
...          on=['day', 'year'], how='outer')
id day temperature year listing_id price
0 1.0 1 20.0 2001 2.0 20.0
1 2.0 2 40.0 2002 4.0 440.0
2 3.0 3 50.0 2004 5.0 500.0
3 4.0 4 60.0 2005 6.0 6600.0
4 5.0 2 20.0 1999 NaN NaN
5 NaN 2 NaN 2005 7.0 500.0
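If the end goal is the per-day average temperature, here is a possible follow-up sketch (the drop_duplicates guard is an assumption for the case of several listings on the same day/year):
df3 = df.merge(df2[['day', 'listing_id', 'price', 'year']],
               on=['day', 'year'], how='outer')
# keep one temperature per (day, year) so repeated listings don't skew the mean
avg_temp = (df3.drop_duplicates(subset=['day', 'year'])
               .groupby('day')['temperature']
               .mean())
print(avg_temp)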

How to get the max out of a group by on two columns and sum on third in a pandas dataframe?

So I used a groupby on a pandas dataframe, which looks like this:
df.groupby(['year','month'])['AMT'].agg('sum')
And I get something like this
year month
2003 1 114.00
2 9195.00
3 300.00
5 200.00
6 450.00
7 68.00
8 750.00
9 3521.00
10 250.00
11 799.00
12 1000.00
2004 1 8551.00
2 9998.00
3 17334.00
4 2525.00
5 16014.00
6 9132.00
7 10623.00
8 7538.00
9 3650.00
10 7733.00
11 10128.00
12 4741.00
2005 1 6965.00
2 3208.00
3 8630.00
4 7776.00
5 11950.00
6 11717.00
7 1510.00
...
2015 7 1431441.00
8 966974.00
9 1121650.00
10 1200104.00
11 1312191.90
12 482535.00
2016 1 1337343.00
2 1465068.00
3 1170113.00
4 1121691.00
5 1302936.00
6 1518047.00
7 1251844.00
8 825215.00
9 1491626.00
10 1243877.00
11 1632252.00
12 750995.50
2017 1 905974.00
2 1330182.00
3 1382628.52
4 1146789.00
5 1201425.00
6 1278701.00
7 1172596.00
8 1517116.50
9 1108609.00
10 1360841.00
11 1340386.00
12 860686.00
What I want is to just select the max out of the third summed column so that the final data frame has only the max from each year, something like:
year month
2003 2 9195.00
2004 3 17334.00
2005 5 11950.00
... and so on
What do I have to add to my group by aggregation to do this?
I think you need DataFrameGroupBy.idxmax:
s = df.groupby(['year','month'])['AMT'].sum()
out = s.loc[s.groupby(level=0).idxmax()]
# equivalently, grouping by the index level name:
# out = s.loc[s.groupby('year').idxmax()]
print(out)
year month
2003 2 9195.0
2004 3 17334.0
2005 5 11950.0
Name: AMT, dtype: float64
If possible multiple max values per years:
out = s[s == s.groupby(level=0).transform('max')]
print (out)
year month
2003 2 9195.0
2004 3 17334.0
2005 5 11950.0
Name: AMT, dtype: float64
You can use GroupBy + transform with max. Note this gives multiple maximums for any years where a tie exists. This may or may not be what you require.
As you have requested, it's possible to do this in 2 steps, first summing and then calculating maximums by year.
df = pd.DataFrame({'year': [2003, 2003, 2003, 2004, 2004, 2004],
                   'month': [1, 2, 2, 1, 1, 2],
                   'AMT': [100, 200, 100, 100, 300, 100]})
# STEP 1: sum by year + month
df2 = df.groupby(['year', 'month']).sum().reset_index()
# STEP 2: filter for max by year
res = df2[df2['AMT'] == df2.groupby(['year'])['AMT'].transform('max')]
print(res)
year month AMT
1 2003 2 300
2 2004 1 400
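If a single row per year is enough (ties resolved to the first maximum), idxmax on the summed frame from step 1 gives the same rows without the boolean mask; a small sketch:
# select, per year, the row whose summed AMT is largest
res = df2.loc[df2.groupby('year')['AMT'].idxmax()]
print(res)
   year  month  AMT
1  2003      2  300
2  2004      1  400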

A more complex rolling sum over the next n rows

I have the following dataframe:
print(df)
day month year quantity
6 04 2018 10
8 04 2018 8
12 04 2018 8
I would like to create a column containing the sum of "quantity" over the next "n" days, as follows:
n = 2
print(df1)
day month year quantity final_quantity
6 04 2018 10 10 + 0 + 8 = 18
8 04 2018 8 8 + 0 + 0 = 8
12 04 2018 8 8 + 0 + 0 = 8
Specifically, adding 0 if the product has not been sold in the next "n" days.
I tried rolling sums from Pandas, but they do not seem to work across the separate day/month/year columns:
n = 2
df.quantity[::-1].rolling(n + 1, min_periods=1).sum()[::-1]
You can use a list comprehension:
import pandas as pd

# Build a single datetime column, then sum 'quantity' for every row whose date
# falls within the next 2 days of each row's date
df['DateTime'] = pd.to_datetime(df[['year', 'month', 'day']])
df['final_quantity'] = [df.loc[df['DateTime'].between(d, d + pd.Timedelta(days=2)), 'quantity'].sum()
                        for d in df['DateTime']]
print(df)
# day month year quantity DateTime final_quantity
# 0 6 4 2018 10 2018-04-06 18
# 1 8 4 2018 8 2018-04-08 8
# 2 12 4 2018 8 2018-04-12 8
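The days=2 above is hard-coded; a sketch with the window as a variable n, to match the question's notation (reusing the DateTime column built above):
n = 2  # number of days to look ahead
df['final_quantity'] = [
    df.loc[df['DateTime'].between(d, d + pd.Timedelta(days=n)), 'quantity'].sum()
    for d in df['DateTime']
]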
You can use set_index and rolling with sum:
# Build a daily datetime index from the day/month/year columns
df_out = df.set_index(pd.to_datetime(df['month'].astype(str) +
                                     df['day'].astype(str) +
                                     df['year'].astype(str), format='%m%d%Y'))['quantity']
# Fill in the missing calendar days with quantity 0
d1 = df_out.resample('D').asfreq(fill_value=0)
# Reverse, so that a backward-looking rolling window becomes forward-looking
d2 = d1[::-1].reset_index()
# Rolling sum over n + 1 = 3 days, reversed back and realigned to the original dates
df['final_quantity'] = d2['quantity'].rolling(3, min_periods=1).sum()[::-1].to_frame()\
                         .set_index(d1.index)\
                         .reindex(df_out.index).values
Output:
day month year quantity final_quantity
0 6 4 2018 10 18.0
1 8 4 2018 8 8.0
2 12 4 2018 8 8.0

Transpose the columns in aggregate function in Pandas

I am using an aggregate function with groupby to get summarized values.
My data set:
df = pd.DataFrame({"A": ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Units": [12, 2, 2, 33, 6, 2, 4, 8, 3, 5],
                   "Week": [1, 2, 2, 1, 2, 1, 1, 2, 2, 1]})
Upon this, I am applying the function:
def my_agg(x):
    names = {
        'Sales': x['Sales'].sum(),
        'Units': x['Sales'].sum()   # note: as posted this sums Sales again, so Units == Sales below
    }
    return pd.Series(names, index=['Sales', 'Units'])
dfA= df.groupby(['A','Week']).apply(my_agg)
which gives me output:
Sales Units
A Week
a 1 6 6
2 14 14
b 1 15 15
2 15 15
I want to transpose week into columns. Like this:
REQUIRED OUTPUT:
Week 1 2
A Sales Units Sales Units
a 6 6 14 14
b 15 15 15 15
Also, please suggest how to get OUTPUT 2:
Sales Units
A Week 1 2
a 6 14 6 14
b 15 15 15 15
Use unstack with swaplevel:
s=dfA.unstack()
s
Out[127]:
Sales Units
Week 1 2 1 2
A
a 6 14 6 14
b 15 15 15 15
s.swaplevel(0,1,axis=1).sort_index(level=0,axis=1)
Out[128]:
Week 1 2
Sales Units Sales Units
A
a 6 6 14 14
b 15 15 15 15
Output 1
df.pivot_table(index='A', columns='Week', aggfunc='sum').swaplevel(1, 0, 1)
Week 1 2 1 2
Sales Sales Units Units
A
a 6 14 47 10
b 15 15 9 11
Output 2
df.pivot_table(index='A', columns='Week', aggfunc='sum')
Sales Units
Week 1 2 1 2
A
a 6 14 47 10
b 15 15 9 11
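To reproduce the exact column layout of the requested output 1 with the pivot_table route, sorting on the week level after the swap should do it; a sketch:
res = (df.pivot_table(index='A', columns='Week', aggfunc='sum')
         .swaplevel(0, 1, axis=1)        # put Week on top
         .sort_index(axis=1, level=0))   # group Sales/Units under each week
print(res)
which should give something like:
Week       1            2
       Sales Units Sales Units
A
a          6    47    14    10
b         15     9    15    11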

How to get top 2 per multi index in Pandas dataframe (generated by pivot_table)

I would like to show the top 2 results per the first 2 levels of a 3-level indexed dataframe (created with pivot_table).
import pandas as pd
df = pd.DataFrame([[2015, 1, 'A', 'R1', 70],
                   [2015, 2, 'B', 'R2', 40],
                   [2015, 3, 'C', 'R3', 20],
                   [2015, 1, 'D', 'R2', 90],
                   [2015, 2, 'A', 'R1', 30],
                   [2015, 3, 'A', 'R3', 20],
                   [2015, 1, 'B', 'R2', 50],
                   [2015, 2, 'C', 'R1', 90],
                   [2015, 3, 'B', 'R3', 10],
                   [2015, 1, 'C', 'R3', 10]],
                  columns=['year', 'month', 'profile', 'ranking', 'sales'])
# create a pivot that sums the sales, sorts the profiles by total sales per year, month and profile
df.pivot_table(values = 'sales',
index = ['year','month','profile'],
columns = ['ranking'],
aggfunc = 'sum',
fill_value = 0,
margins = True).sort_values(by = 'All',ascending = False).sort_index(level=[0,1], sort_remaining=False)
Question 1: how to get only the top two profiles per year month combination?
so
for: 2015,1: D & A
for: 2015,2: C & B
for: 2015,3: A & C
Bonus question:
How to get the sums for the non top 2 profiles and call them 'Other'
so
for: 2015,1: Other,0,50,10,60 (which is the sum of B&C)
for: 2015,2: Other,30,0,0,30 (which is A only in this case)
for: 2015,3: Other,0,0,10,10 (which is B only in this case)
I would like to have it returned as a dataframe to me
UPDATE:
without pivoting:
In [120]: srt = df.sort_values(['year','month','profile'])
In [123]: srt[srt.groupby(['year','month'])['profile'].rank(method='min') <= 2]
Out[123]:
year month profile ranking sales
0 2015 1 A R1 70
6 2015 1 B R2 50
4 2015 2 A R1 30
1 2015 2 B R2 40
5 2015 3 A R3 20
8 2015 3 B R3 10
Bonus answer:
In [131]: srt[srt.groupby(['year','month'])['profile'] \
.rank(method='min') >= 2] \
.groupby(['year','month']).agg({'sales':'sum'})
Out[131]:
sales
year month
2015 1 150
2 130
3 30
With pivoting: you can try to reset index after pivoting:
In [109]: pvt = df.pivot_table(values = 'sales',
.....: index = ['year','month','profile'],
.....: columns = ['ranking'],
.....: aggfunc = 'sum',
.....: fill_value = 0,
.....: margins = True).reset_index()
In [111]: pvt
Out[111]:
ranking year month profile R1 R2 R3 All
0 2015 1 A 70 0 0 70
1 2015 1 B 0 50 0 50
2 2015 1 C 0 0 10 10
3 2015 1 D 0 90 0 90
4 2015 2 A 30 0 0 30
5 2015 2 B 0 40 0 40
6 2015 2 C 90 0 0 90
7 2015 3 A 0 0 20 20
8 2015 3 B 0 0 10 10
9 2015 3 C 0 0 20 20
10 All 190 180 60 430
Now you can use the rank() method:
In [110]: pvt[pvt.sort_values(['year','month','profile']).groupby(['year','month'])['profile'].rank(method='min') <= 2]
Out[110]:
ranking year month profile R1 R2 R3 All
0 2015 1 A 70 0 0 70
1 2015 1 B 0 50 0 50
4 2015 2 A 30 0 0 30
5 2015 2 B 0 40 0 40
7 2015 3 A 0 0 20 20
8 2015 3 B 0 0 10 10
10 All 190 180 60 430
Ranking itself:
In [112]: pvt.sort_values(['year','month','profile']).groupby(['year','month'])['profile'].rank(method='min')
Out[112]:
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 1
8 2
9 3
10 1
dtype: float64
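Note that the rank above orders profiles alphabetically, which is why A and B come back for every month. If the intent is the top 2 by sales (D & A, C & B, A & C as listed in the question), a sketch on the raw df that ranks on the sales column instead (ties broken by first appearance) would be:
# rank sales within each year/month, largest first
by_sales = df.groupby(['year', 'month'])['sales'].rank(method='first', ascending=False)
top2 = df[by_sales <= 2].sort_values(['year', 'month'])             # top 2 profiles per month by sales
other = df[by_sales > 2].groupby(['year', 'month'])['sales'].sum()  # bonus: everything else, summed
With the sample data this returns D & A for 2015-1, C & B for 2015-2 and C & A for 2015-3, and other sums to 60, 30 and 10, matching the totals in the question.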
