I want to go through all the columns of the dataframe so that I can pick out particular values from each column and use them to calculate values for another dataframe.
Here is what I have:
DP1 DP2 DP3 DP4 DP5 DP6 DP7 DP8 DP9 DP10 Total
OP1 357848.0 1124788.0 1735330.0 2218270.0 2745596.0 3319994.0 3466336.0 3606286.0 3833515.0 3901463.0 3901463.0
OP2 352118.0 1236139.0 2170033.0 3353322.0 3799067.0 4120063.0 4647867.0 4914039.0 5339085.0 NaN 5339085.0
OP3 290507.0 1292306.0 2218525.0 3235179.0 3985995.0 4132918.0 4628910.0 4909315.0 NaN NaN 4909315.0
OP4 310608.0 1418858.0 2195047.0 3757447.0 4029929.0 4381982.0 4588268.0 NaN NaN NaN 4588268.0
OP5 443160.0 1136350.0 2128333.0 2897821.0 3402672.0 3873311.0 NaN NaN NaN NaN 3873311.0
OP6 396132.0 1333217.0 2180715.0 2985752.0 3691712.0 NaN NaN NaN NaN NaN 3691712.0
OP7 440832.0 1288463.0 2419861.0 3483130.0 NaN NaN NaN NaN NaN NaN 3483130.0
OP8 359480.0 1421128.0 2864498.0 NaN NaN NaN NaN NaN NaN NaN 2864498.0
OP9 376686.0 1363294.0 NaN NaN NaN NaN NaN NaN NaN NaN 1363294.0
OP10 344014.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 344014.0
Total 3671385.0 11614543.0 17912342.0 21930921.0 21654971.0 19828268.0 17331381.0 13429640.0 9172600.0 3901463.0 34358090.0
Latest Observation 344014.0 1363294.0 2864498.0 3483130.0 3691712.0 3873311.0 4588268.0 4909315.0 5339085.0 3901463.0 NaN
From this table I would like to calculate this formula: for column DP1, take the DP2 column's Total and divide it by (DP1's Total minus DP1's Latest Observation). We have to calculate this for all the columns and save the result in another dataframe.
We need a row like this:
Weighted Average 3.491 1.747 1.457 1.174 1.104 1.086 1.054 1.077 1.018
This is the code we tried:
LDFTriangledf['Weighted Average'] =CumulativePaidTriangledf.loc['Total','DP2']/(CumulativePaidTriangledf.loc['Total','DP1'] - CumulativePaidTriangledf.loc['Latest Observation','DP1'])
You can remove the column names from .loc and just shift(-1, axis=1) to get the next column's Total. This lets you apply the formula to all columns in a single operation:
CumulativePaidTriangledf.shift(-1, axis=1).loc['Total'] / (CumulativePaidTriangledf.loc['Total'] - CumulativePaidTriangledf.loc['Latest Observation'])
# DP1 3.490607
# DP2 1.747333
# DP3 1.457413
# DP4 1.173852
# DP5 1.103824
# DP6 1.086269
# DP7 1.053874
# DP8 1.076555
# DP9 1.017725
# DP10 inf
# Total NaN
# dtype: float64
Here is a breakdown of what the three components are doing:
A: .shift(-1, axis=1).loc['Total'] -- We are shifting the whole Total row to the left, so every column now has the next column's Total value.
B: .loc['Total'] -- This is the normal Total row.
C: .loc['Latest Observation'] -- This is the normal Latest Observation row.
A / (B-C) -- This is what the code above does. It takes the shifted Total row (A) and divides it by the difference of the current Total row (B) and the current Latest Observation row (C).

            DP1            DP2            DP3            DP4            DP5            DP6            DP7            DP8            DP9            DP10          Total
A           1.161454e+07   1.791234e+07   2.193092e+07   2.165497e+07   1.982827e+07   1.733138e+07   1.342964e+07   9.172600e+06   3.901463e+06   34358090.0    NaN
B           3.671385e+06   1.161454e+07   1.791234e+07   2.193092e+07   2.165497e+07   1.982827e+07   1.733138e+07   1.342964e+07   9.172600e+06   3901463.0     34358090.0
C           3.440140e+05   1.363294e+06   2.864498e+06   3.483130e+06   3.691712e+06   3.873311e+06   4.588268e+06   4.909315e+06   5.339085e+06   3901463.0     NaN
A / (B-C)   3.490607       1.747333       1.457413       1.173852       1.103824       1.086269       1.053874       1.076555       1.017725       inf           NaN
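If you then want to save the factors into another dataframe, here is a minimal sketch, assuming LDFTriangledf already exists (as in your attempt) and that the factors should be stored as a row to match the desired output; DP10 (inf) and Total (NaN) are dropped and the values rounded to three decimals:

ldf = (CumulativePaidTriangledf.shift(-1, axis=1).loc['Total']
       / (CumulativePaidTriangledf.loc['Total']
          - CumulativePaidTriangledf.loc['Latest Observation']))
# drop the unusable DP10 (inf) and Total (NaN) entries and round to match the desired row
LDFTriangledf.loc['Weighted Average'] = ldf.drop(['DP10', 'Total']).round(3)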
I have this dataframe:
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 471294.2 692828.5 1107766.9 1580052.7 2452123.5 4374088.4 4545222.2 4764249.9 19987626.3
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 333833.4 573533.5 948961.2 1343783.2 2354595.9 4061858.2 4348907.9 4769410.1 18734883.4
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 441838.4 660710.6 976074.4 1391775.4 2002799.8 3921497.8 3708159.7 3852268.8 16955124.9
I need the row names to be like this:
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 471294.2 692828.5 1107766.9 1580052.7 2452123.5 4374088.4 4545222.2 4764249.9 19987626.3
Simulation2 NaN 0.0 333833.4 573533.5 948961.2 1343783.2 2354595.9 4061858.2 4348907.9 4769410.1 18734883.4
Simulation3 NaN 0.0 441838.4 660710.6 976074.4 1391775.4 2002799.8 3921497.8 3708159.7 3852268.8 16955124.9
.
.
.
and so on.
Here I have to increment the row name: Simulation1, Simulation2, and so on.
I have this code:
simulationDf = pd.DataFrame(columns=['OP1','OP2','OP3','OP4','OP5','OP6','OP7','OP8','OP9','OP10','Total'])
simulationDf.loc['Simulation1'] = ultiCalc['Reserves']
It seems those labels are your index (in the dataframe you posted in the second code block), so you can set the .index attribute with a list comprehension:
df.index = ['Simulation' + str(x) for x in range(1, len(df)+1)]
Now if you print df you will get your desired output:
OP1 OP2 OP3 OP4 OP5 OP6 OP7 OP8 OP9 OP10 Total
Simulation1 NaN 0.0 471294.2 692828.5 1107766.9 1580052.7 2452123.5 4374088.4 4545222.2 4764249.9 19987626.3
Simulation2 NaN 0.0 333833.4 573533.5 948961.2 1343783.2 2354595.9 4061858.2 4348907.9 4769410.1 18734883.4
Simulation3 NaN 0.0 441838.4 660710.6 976074.4 1391775.4 2002799.8 3921497.8 3708159.7 3852268.8 16955124.9
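If you are instead appending one simulation at a time, as in the code you posted, another option is to build the label from the loop counter as you insert each row. A minimal sketch, where n_simulations is a hypothetical stand-in for however many runs you do and ultiCalc is recomputed inside your own loop:

import pandas as pd

cols = ['OP1','OP2','OP3','OP4','OP5','OP6','OP7','OP8','OP9','OP10','Total']
simulationDf = pd.DataFrame(columns=cols)
for i in range(1, n_simulations + 1):                       # n_simulations is hypothetical
    simulationDf.loc['Simulation' + str(i)] = ultiCalc['Reserves']   # recomputed each iteration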
I want to scrape some data from a website, so I wrote code that builds a list containing all the records. Then I want to extract some elements from those records to create a dataframe.
However, some information is missing from the dataframe. The all-records list has information from 2012 to 2019, but the dataframe only has 2018 and 2019. I tried different ways to resolve the problem and finally found that if I do not use the zip function, the problem does not occur. May I know why, and if I should not use zip, what solution can I use instead?
import requests
import pandas as pd

records = []
tickers = ['AAL']
url_metrics = 'https://stockrow.com/api/companies/{}/financials.json?ticker={}&dimension=A&section=Growth'
indicators_url = 'https://stockrow.com/api/indicators.json'

# scrape all data and append to a list - all_records
for s in tickers:
    indicators = {i['id']: i for i in requests.get(indicators_url).json()}
    all_records = []
    for d in requests.get(url_metrics.format(s, s)).json():
        d['id'] = indicators[d['id']]['name']
        all_records.append(d)
    gross_profit_growth = next(d for d in all_records if 'Gross Profit Growth' in d['id'])
    operating_income_growth = next(d for d in all_records if 'Operating Income Growth' in d['id'])
    net_income_growth = next(d for d in all_records if 'Net Income Growth' in d['id'])
    diluted_eps_growth = next(d for d in all_records if 'EPS Growth (diluted)' in d['id'])
    operating_cash_flow_growth = next(d for d in all_records if 'Operating Cash Flow Growth' in d['id'])
    # extract values from all_records and create the dataframe
    for (k1, v1), (_, v2), (_, v3), (_, v4), (_, v5) in zip(gross_profit_growth.items(), operating_income_growth.items(), net_income_growth.items(), diluted_eps_growth.items(), operating_cash_flow_growth.items()):
        if k1 == 'id':
            continue
        records.append({
            'symbol': s,
            'date': k1,
            'gross_profit_growth%': v1,
            'operating_income_growth%': v2,
            'net_income_growth%': v3,
            'diluted_eps_growth%': v4,
            'operating_cash_flow_growth%': v5
        })

df = pd.DataFrame(records)
df.head(50)
The result is incorrect. It only has 2018 and 2019 data. It should have data from 2012 to 2019.
symbol date gross_profit_growth% operating_income_growth% net_income_growth% diluted_eps_growth% operating_cash_flow_growth%
0 AAL 2019-12-31 0.0405 -0.1539 -0.0112 0.2508 0.0798
1 AAL 2018-12-31 -0.0876 -0.2463 0.0 -0.2231 -0.2553
My expected result:
symbol date gross_profit_growth% operating_income_growth% net_income_growth% diluted_eps_growth% operating_cash_flow_growth%
0 AAL 31/12/2019 0.0405 0.154 0.1941 0.2508 0.0798
1 AAL 31/12/2018 -0.0876 -0.3723 0.1014 -0.2231 -0.2553
2 AAL 31/12/2017 -0.0165 -0.1638 -0.5039 -0.1892 -0.2728
3 AAL 31/12/2016 -0.079 -0.1844 -0.6604 -0.5655 0.044
4 AAL 31/12/2015 0.1983 0.4601 1.6405 1.8168 1.0289
5 AAL 31/12/2014 0.7305 2.0372 2.5714 1.2308 3.563
6 AAL 31/12/2013 0.3575 8.4527 0.0224 nan -0.4747
7 AAL 31/12/2012 0.1688 1.1427 0.052 nan 0.7295
8 AAL 31/12/2011 0.0588 -4.3669 -3.2017 nan -0.4013
9 AAL 31/12/2010 0.3413 1.3068 0.6792 nan 0.3344
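The reason for the missing years is that zip pairs the dict items by position and stops at the shortest iterable: at least one of the five growth dicts has fewer year keys than the others (most likely the diluted EPS series, which is NaN for the earlier years), so every other indicator is truncated to match, and the remaining pairs are not guaranteed to line up by date. A small illustration with hypothetical dicts standing in for two of the growth records:

# hypothetical dicts standing in for two of the growth records
a = {'id': 'x', '2019-12-31': 0.1, '2018-12-31': 0.2, '2017-12-31': 0.3}
b = {'id': 'y', '2019-12-31': 0.4, '2018-12-31': 0.5}

# zip stops at the shortest iterable and pairs items by position, not by key,
# so 2017-12-31 is silently dropped from a:
print(list(zip(a.items(), b.items())))
# [(('id', 'x'), ('id', 'y')),
#  (('2019-12-31', 0.1), ('2019-12-31', 0.4)),
#  (('2018-12-31', 0.2), ('2018-12-31', 0.5))]

The alternative below avoids zip entirely: it builds one small dataframe per indicator, indexed by date, and concatenates them on that index, so a year missing from one series simply becomes NaN instead of truncating the rest.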
import requests
import pandas as pd

records = []
tickers = ['A', 'AAL', 'AAPL']
url_metrics = 'https://stockrow.com/api/companies/{}/financials.json?ticker={}&dimension=A&section=Growth'
indicators_url = 'https://stockrow.com/api/indicators.json'

for s in tickers:
    print('Getting data for ticker: {}'.format(s))
    indicators = {i['id']: i for i in requests.get(indicators_url).json()}
    all_records = []
    for d in requests.get(url_metrics.format(s, s)).json():
        d['id'] = indicators[d['id']]['name']
        all_records.append(d)
    gross_profit_growth = next(d for d in all_records if 'Gross Profit Growth' == d['id'])
    operating_income_growth = next(d for d in all_records if 'Operating Income Growth' == d['id'])
    net_income_growth = next(d for d in all_records if 'Net Income Growth' == d['id'])
    eps_growth_diluted = next(d for d in all_records if 'EPS Growth (diluted)' == d['id'])
    operating_cash_flow_growth = next(d for d in all_records if 'Operating Cash Flow Growth' == d['id'])

    del gross_profit_growth['id']
    del operating_income_growth['id']
    del net_income_growth['id']
    del eps_growth_diluted['id']
    del operating_cash_flow_growth['id']

    # one small dataframe per indicator, indexed by date
    d1 = pd.DataFrame({'date': gross_profit_growth.keys(), 'gross_profit_growth%': gross_profit_growth.values()}).set_index('date')
    d2 = pd.DataFrame({'date': operating_income_growth.keys(), 'operating_income_growth%': operating_income_growth.values()}).set_index('date')
    d3 = pd.DataFrame({'date': net_income_growth.keys(), 'net_income_growth%': net_income_growth.values()}).set_index('date')
    d4 = pd.DataFrame({'date': eps_growth_diluted.keys(), 'diluted_eps_growth%': eps_growth_diluted.values()}).set_index('date')
    d5 = pd.DataFrame({'date': operating_cash_flow_growth.keys(), 'operating_cash_flow_growth%': operating_cash_flow_growth.values()}).set_index('date')

    # concat aligns on the date index, so a missing year becomes NaN instead of truncating the rest
    d = pd.concat([d1, d2, d3, d4, d5], axis=1)
    d['symbol'] = s
    records.append(d)

df = pd.concat(records)
print(df)
Prints:
gross_profit_growth% operating_income_growth% net_income_growth% diluted_eps_growth% operating_cash_flow_growth% symbol
2019-10-31 0.0466 0.0409 2.3892 2.4742 -0.0607 A
2018-10-31 0.1171 0.1202 -0.538 -0.5381 0.2227 A
2017-10-31 0.0919 0.3122 0.4805 0.5 0.1211 A
2016-10-31 0.0764 0.1782 0.1521 0.1765 0.5488 A
2015-10-31 0.0329 0.2458 -0.2696 -0.1905 -0.2996 A
2014-10-31 0.0362 0.0855 -0.252 -0.3 -0.3655 A
2013-10-31 -0.4709 -0.655 -0.3634 -0.3578 -0.0619 A
2012-10-31 0.0213 0.0448 0.1393 0.1474 -0.0254 A
2011-10-31 0.2044 0.8922 0.4795 0.6102 0.7549 A
2019-12-31 0.0405 0.154 0.1941 0.2508 0.0798 AAL
2018-12-31 -0.0876 -0.3723 0.1014 -0.2231 -0.2553 AAL
2017-12-31 -0.0165 -0.1638 -0.5039 -0.1892 -0.2728 AAL
2016-12-31 -0.079 -0.1844 -0.6604 -0.5655 0.044 AAL
2015-12-31 0.1983 0.4601 1.6405 1.8168 1.0289 AAL
2014-12-31 0.7305 2.0372 2.5714 1.2308 3.563 AAL
2013-12-31 0.3575 8.4527 0.0224 NaN -0.4747 AAL
2012-12-31 0.1688 1.1427 0.052 NaN 0.7295 AAL
2011-12-31 0.0588 -4.3669 -3.2017 NaN -0.4013 AAL
2010-12-31 0.3413 1.3068 0.6792 NaN 0.3344 AAL
2020-09-30 0.0667 0.0369 0.039 NaN 0.1626 AAPL
2019-09-30 -0.0338 -0.0983 -0.0718 -0.0017 -0.1039 AAPL
2018-09-30 0.1548 0.1557 0.2312 0.2932 0.2057 AAPL
2017-09-30 0.0466 0.022 0.0583 0.1083 -0.0303 AAPL
2016-09-30 -0.1 -0.1573 -0.1443 -0.0987 -0.185 AAPL
2015-09-30 0.3273 0.3567 0.3514 0.4295 0.3609 AAPL
2014-09-30 0.0969 0.0715 0.0668 0.1358 0.1127 AAPL
2013-09-30 -0.0635 -0.113 -0.1125 -0.0996 0.0553 AAPL
2012-09-30 0.567 0.6348 0.6099 0.595 0.3551 AAPL
2011-09-30 0.706 0.8379 0.8499 0.827 1.0182 AAPL
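If you also want the date formatted as 31/12/2019 and as a regular column next to symbol, as in your expected result, here is a small follow-up sketch (the column order and date format are assumptions based on that table):

df = df.reset_index()                                   # the index from set_index('date') becomes a 'date' column
df['date'] = pd.to_datetime(df['date']).dt.strftime('%d/%m/%Y')
df = df[['symbol', 'date', 'gross_profit_growth%', 'operating_income_growth%',
         'net_income_growth%', 'diluted_eps_growth%', 'operating_cash_flow_growth%']]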
Here is an example of the data we want to process:
import numpy as np
import pandas as pd

df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})
X boat_id target_Y
0 482 275 6
1 705 245 4
2 328 102 6
3 631 227 6
4 234 236 8
...
I want to obtain an output like this :
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 target_Y boat_id
40055 684.0 692.0 950.0 572.0 442.0 850.0 75.0 140.0 382.0 576.0 0.0 1
40056 178.0 949.0 490.0 777.0 335.0 559.0 397.0 729.0 701.0 44.0 4.0 1
40057 21.0 818.0 341.0 577.0 612.0 57.0 303.0 183.0 519.0 357.0 0.0 1
40058 501.0 1000.0 999.0 532.0 765.0 913.0 964.0 922.0 772.0 534.0 1.0 2
40059 305.0 906.0 724.0 996.0 237.0 197.0 414.0 171.0 369.0 299.0 8.0 2
40060 408.0 796.0 815.0 638.0 691.0 598.0 913.0 579.0 650.0 955.0 2.0 3
40061 298.0 512.0 247.0 824.0 764.0 414.0 71.0 440.0 135.0 707.0 9.0 4
40062 535.0 687.0 945.0 859.0 718.0 580.0 427.0 284.0 122.0 777.0 2.0 4
40063 352.0 115.0 228.0 69.0 497.0 387.0 552.0 473.0 574.0 759.0 3.0 4
40064 179.0 870.0 862.0 186.0 25.0 125.0 925.0 310.0 335.0 739.0 7.0 4
...
I wrote the following code, but it is way too slow.
It groups by boat_id, cuts with enumerate, transposes, then merges the results into one pandas DataFrame:
import time

start_time = time.time()
N = 10
col_names = ['X' + str(x) for x in range(N)]
compil = pd.DataFrame(columns=col_names)
i = 0

# I group by boat ID
for boat_id, df_boat in df_random.groupby('boat_id'):
    # then I cut every 5 lines
    for (line_number, (index, row)) in enumerate(df_boat.iterrows()):
        if line_number % 5 == 0:
            compil_new_line_X = list(df_boat.iloc[line_number-N:line_number, :]["X"])
            # filter to avoid issues at the start and end of the columns
            if len(compil_new_line_X) == N:
                compil.loc[i, col_names] = compil_new_line_X
                compil.loc[i, 'target_Y'] = row['target_Y']
                compil.loc[i, 'boat_id'] = row['boat_id']
                i += 1

print("Total %s seconds" % (time.time() - start_time))
Total 232.947000027 seconds
My question is:
How can I do something every "x number of lines" and then merge the results?
Does a way exist to vectorize that kind of operation?
Here is a solution that improves calculation time by 35%.
It uses a groupby on 'boat_id', then groupby.apply to divide each group into small chunks.
Then a final apply creates the new lines. We can probably still improve it.
import time
import numpy as np
import pandas as pd

df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})

start_time = time.time()
len_of_chunks = 10
col_names = ['X' + str(x) for x in range(len_of_chunks)] + ['boat_id', 'target_Y']

def prepare_data(group):
    # this function creates the new line we will put in 'compil'
    info_we_want_to_keep = ['boat_id', 'target_Y']
    info_and_target = group.tail(1)[info_we_want_to_keep].values
    k = group["X"]
    return np.hstack([k.values, info_and_target[0]])  # the new line we will put in 'compil'

# we group by ID (boat)
# we divide each group into chunks of length "len_of_chunks"
# we apply prepare_data to each chunk
groups = df_random.groupby('boat_id').apply(lambda x: x.groupby(np.arange(len(x)) // len_of_chunks).apply(prepare_data))

# we reset the index
# we take the '0' column containing the valuable info
# we put the info in a new 'compil' dataframe
# we drop incomplete lines (generated by chunks shorter than len_of_chunks)
compil = pd.DataFrame(groups.reset_index()[0].values.tolist(), columns=col_names).dropna()

print("Total %s seconds" % (time.time() - start_time))
Total 153.781999826 seconds
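On the vectorization question, here is a minimal sketch of a more vectorized variant of the same chunking idea: cumcount numbers the rows within each boat and a pivot spreads the X values across columns, instead of the row-by-row loop. The chunk length and column names are the same assumptions as above, and it mirrors the non-overlapping chunks of the solution above, not the sliding window of the original loop:

import numpy as np
import pandas as pd

len_of_chunks = 10
g = df_random.groupby('boat_id')
chunk = g.cumcount() // len_of_chunks   # chunk number within each boat
pos = g.cumcount() % len_of_chunks      # position of the row inside its chunk

# spread the X values of each (boat_id, chunk) pair across columns X0..X9
wide = (df_random.assign(chunk=chunk, pos=pos)
        .pivot_table(index=['boat_id', 'chunk'], columns='pos', values='X'))
wide.columns = ['X' + str(c) for c in wide.columns]

# keep the target_Y of the last row of each chunk, as prepare_data does
last = (df_random.assign(chunk=chunk)
        .groupby(['boat_id', 'chunk'])['target_Y'].last())

# drop incomplete chunks and flatten the index back into columns
compil = wide.join(last).dropna().reset_index().drop(columns='chunk')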
I am fairly new to programming & am looking for a more pythonic way to implement some code. Here is dummy data:
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=(10000)),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2018', freq='D'), 10000)})
I have lots of transactional data like this that I perform various groupbys on. My current solution is to make a master groupby like this:
master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()
From there, I perform various groupbys using the .groupby(level=...) function to aggregate the information in the way I'm looking for. I usually make a summary at each level. In addition, I create sub-totals at each level using some variation of the code below.
y = master.groupby(level=[0,1,2]).sum()
y.index = pd.MultiIndex.from_arrays([
y.index.get_level_values(0),
y.index.get_level_values(1),
y.index.get_level_values(2) + ' Total',
len(y.index)*['']
])
y1 = master.groupby(level=[0,1]).sum()
y1.index = pd.MultiIndex.from_arrays([
y1.index.get_level_values(0),
y1.index.get_level_values(1)+ ' Total',
len(y1.index)*[''],
len(y1.index)*['']
])
y2 = master.groupby(level=[0]).sum()
y2.index = pd.MultiIndex.from_arrays([
y2.index.get_level_values(0)+ ' Total',
len(y2.index)*[''],
len(y2.index)*[''],
len(y2.index)*['']
])
pd.concat([master, y, y1, y2]).sort_index()\
    .assign(Diff=lambda x: x.iloc[:, -1] - x.iloc[:, -2])\
    .assign(Diff_Perc=lambda x: (x.iloc[:, -2] / x.iloc[:, -3]) - 1)\
    .dropna(how='all')
This is just an example - I may perform the same exercise, but perform the groupby in a different order. For example - next I may want to group by 'Category', 'Product', then 'Customer', so I'd have to do:
master.groupby(level=[1,3,0]).sum()
Then I will have to repeat the whole exercise for sub-totals like above. I also frequently change the time period - could be year-ending a specific month, could be year to date, could be by quarter, etc.
From what I've learned so far in programming (which is minimal, clearly!), you should look to write a function any time you repeat code. Obviously I am repeating code over & over again in this example.
Is there a way to construct a function where you can provide the levels to Groupby, along with the time frame, all while creating a function for sub-totaling each level as well?
Thanks in advance for any guidance on this. It is very much appreciated.
For a DRY-er solution, consider generalizing your current method into a defined function that filters the original data frame by date range and runs the aggregations, receiving the groupby levels and date range (the latter being optional) as passed-in parameters:
Method
def multiple_agg(mylevels, start_date='2016-01-01', end_date='2018-12-31'):

    filter_df = df[df['Date'].between(start_date, end_date)]

    master = (filter_df.groupby(['Customer', 'Category', 'Sub-Category', 'Product',
                                 pd.Grouper(key='Date', freq='A')])['Units_Sold']
                       .sum()
                       .unstack()
              )

    y = master.groupby(level=mylevels[:-1]).sum()
    y.index = pd.MultiIndex.from_arrays([
        y.index.get_level_values(0),
        y.index.get_level_values(1),
        y.index.get_level_values(2) + ' Total',
        len(y.index)*['']
    ])

    y1 = master.groupby(level=mylevels[0:2]).sum()
    y1.index = pd.MultiIndex.from_arrays([
        y1.index.get_level_values(0),
        y1.index.get_level_values(1) + ' Total',
        len(y1.index)*[''],
        len(y1.index)*['']
    ])

    y2 = master.groupby(level=mylevels[0]).sum()
    y2.index = pd.MultiIndex.from_arrays([
        y2.index.get_level_values(0) + ' Total',
        len(y2.index)*[''],
        len(y2.index)*[''],
        len(y2.index)*['']
    ])

    final_df = (pd.concat([master, y, y1, y2])
                  .sort_index()
                  .assign(Diff=lambda x: x.iloc[:, -1] - x.iloc[:, -2])
                  .assign(Diff_Perc=lambda x: (x.iloc[:, -2] / x.iloc[:, -3]) - 1)
                  .dropna(how='all')
                  .reorder_levels(mylevels)
               )

    return final_df
Aggregation Runs (of different levels and date ranges)
agg_df1 = multiple_agg([0,1,2,3])
agg_df2 = multiple_agg([1,3,0,2], '2016-01-01', '2017-12-31')
agg_df3 = multiple_agg([2,3,1,0], start_date='2017-01-01', end_date='2018-12-31')
Testing (final_df being the OP's pd.concat() output)
# EQUALITY TESTING OF FIRST 10 ROWS
print(final_df.head(10).eq(agg_df1.head(10)))
# Date 2016-12-31 00:00:00 2017-12-31 00:00:00 2018-12-31 00:00:00 Diff Diff_Perc
# Customer Category Sub-Category Product
# 45mhn4PU1O Group A X Product 1 True True True True True
# Product 2 True True True True True
# Product 3 True True True True True
# X Total True True True True True
# Y Product 1 True True True True True
# Product 2 True True True True True
# Product 3 True True True True True
# Y Total True True True True True
# Z Product 1 True True True True True
# Product 2 True True True True True
I think you can do it using sum with the level parameter:
master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()
s1 = master.sum(level=[0,1,2]).assign(Product='Total').set_index('Product',append=True)
s2 = master.sum(level=[0,1])
# Wanted to use assign method but because of the hyphen in the column name you can't.
# Also use the Z in front for sorting purposes
s2['Sub-Category'] = 'ZTotal'
s2['Product'] = ''
s2 = s2.set_index(['Sub-Category','Product'], append=True)
s3 = master.sum(level=[0])
s3['Category'] = 'Total'
s3['Sub-Category'] = ''
s3['Product'] = ''
s3 = s3.set_index(['Category','Sub-Category','Product'], append=True)
master_new = pd.concat([master,s1,s2,s3]).sort_index()
master_new
Output:
Date 2016-12-31 2017-12-31 2018-12-31
Customer Category Sub-Category Product
30XWmt1jm0 Group A X Product 1 651.0 341.0 453.0
Product 2 267.0 445.0 117.0
Product 3 186.0 280.0 352.0
Total 1104.0 1066.0 922.0
Y Product 1 426.0 417.0 670.0
Product 2 362.0 210.0 380.0
Product 3 232.0 290.0 430.0
Total 1020.0 917.0 1480.0
Z Product 1 196.0 212.0 703.0
Product 2 277.0 340.0 579.0
Product 3 416.0 392.0 259.0
Total 889.0 944.0 1541.0
ZTotal 3013.0 2927.0 3943.0
Group B X Product 1 356.0 230.0 407.0
Product 2 402.0 370.0 590.0
Product 3 262.0 381.0 377.0
Total 1020.0 981.0 1374.0
Y Product 1 575.0 314.0 643.0
Product 2 557.0 375.0 411.0
Product 3 344.0 246.0 280.0
Total 1476.0 935.0 1334.0
Z Product 1 278.0 152.0 392.0
Product 2 149.0 596.0 303.0
Product 3 234.0 505.0 521.0
Total 661.0 1253.0 1216.0
ZTotal 3157.0 3169.0 3924.0
Total 6170.0 6096.0 7867.0
3U2anYOD6o Group A X Product 1 214.0 443.0 195.0
Product 2 170.0 220.0 423.0
Product 3 111.0 469.0 369.0
... ... ... ...
somc22Y2Hi Group B Z Total 906.0 1063.0 680.0
ZTotal 3070.0 3751.0 2736.0
Total 6435.0 7187.0 6474.0
zRZq6MSKuS Group A X Product 1 421.0 182.0 387.0
Product 2 359.0 287.0 331.0
Product 3 232.0 394.0 279.0
Total 1012.0 863.0 997.0
Y Product 1 245.0 366.0 111.0
Product 2 377.0 148.0 239.0
Product 3 372.0 219.0 310.0
Total 994.0 733.0 660.0
Z Product 1 280.0 363.0 354.0
Product 2 384.0 604.0 178.0
Product 3 219.0 462.0 366.0
Total 883.0 1429.0 898.0
ZTotal 2889.0 3025.0 2555.0
Group B X Product 1 466.0 413.0 187.0
Product 2 502.0 370.0 368.0
Product 3 745.0 480.0 318.0
Total 1713.0 1263.0 873.0
Y Product 1 218.0 226.0 385.0
Product 2 123.0 382.0 570.0
Product 3 173.0 572.0 327.0
Total 514.0 1180.0 1282.0
Z Product 1 480.0 317.0 604.0
Product 2 256.0 215.0 572.0
Product 3 463.0 50.0 349.0
Total 1199.0 582.0 1525.0
ZTotal 3426.0 3025.0 3680.0
Total 6315.0 6050.0 6235.0
[675 rows x 3 columns]
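If you want to fold the subtotal logic itself into a single reusable function, which was the original ask, here is a minimal sketch that generalizes the pattern above to any number of grouping levels. The function name, signature, and defaults are illustrative assumptions, and it leaves out the Diff/Diff_Perc columns and level reordering for brevity (those can be added back as in the first answer's function):

import pandas as pd

def subtotal_report(df, levels=('Customer', 'Category', 'Sub-Category', 'Product'),
                    start_date='2016-01-01', end_date='2018-12-31'):
    # yearly Units_Sold pivot with a 'Total' subtotal row appended at every level
    data = df[df['Date'].between(start_date, end_date)]
    master = (data.groupby([*levels, pd.Grouper(key='Date', freq='A')])['Units_Sold']
                  .sum()
                  .unstack())
    pieces = [master]
    # build subtotals from the innermost grouping level outwards
    for depth in range(len(levels) - 1, 0, -1):
        s = master.groupby(level=list(range(depth))).sum()
        # pad the collapsed levels with '' so the subtotal aligns with master's MultiIndex
        arrays = [s.index.get_level_values(i) for i in range(depth - 1)]
        arrays.append(s.index.get_level_values(depth - 1) + ' Total')
        arrays += [[''] * len(s)] * (len(levels) - depth)
        s.index = pd.MultiIndex.from_arrays(arrays)
        pieces.append(s)
    return pd.concat(pieces).sort_index()

# e.g. a different level order and date range:
report = subtotal_report(df, levels=('Category', 'Product', 'Customer'), start_date='2017-01-01')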