How to randomly create a preference dataframe from a dataframe of choices? - python

I have a DataFrame of votes and I would like to create one of preferences.
For example, here is the number of votes for each party P1, P2, P3 in each city comm1, comm2, ...
Comm Votes P1 P2 P3
0 comm1 1315.0 2.0 424.0 572.0
1 comm2 4682.0 117.0 2053.0 1584.0
2 comm3 2397.0 2.0 40.0 192.0
3 comm4 931.0 2.0 12.0 345.0
4 comm5 842.0 47.0 209.0 76.0
... ... ... ... ... ...
1524 comm1525 10477.0 13.0 673.0 333.0
1525 comm1526 2674.0 1.0 55.0 194.0
1526 comm1527 1691.0 331.0 29.0 78.0
These electoral results would suffice for a first-past-the-post system, but I would like to test the alternative vote model. So for each political party I need to get the preferences.
As I don't know the preferences, I want to generate them with random numbers. I assume that voters are honest. For example, for the "P1" party in town "comm1", we know that 2 people voted for it and that there are 1315 voters. I need to create preferences showing whether people would put it as their first, second or third option. That is to say, for each party:
Comm Votes P1_1 P1_2 P1_3 P2_1 P2_2 P2_3 P3_1 P3_2 P3_3
0 comm1 1315.0 2.0 1011.0 303.0 424.0 881.0 10.0 570.0 1.0 1.0
... ... ... ... ... ...
1526 comm1527 1691.0 331.0 1300.0 60.0 299.0 22.0 10.0 ...
So I have to do:
# for each column in parties I create (parties - 1) other columns
# I rename them all Party_i. The former one becomes Party_1.
# In the other columns I put a random number.
# For a given line, the sum of all Party_i for i in [1, parties] must be equal to Votes
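Put differently, the constraints amount to something like this sketch (np.random.multinomial splits an integer total across categories, so each party's ranks sum to Votes by construction):
import numpy as np
import pandas as pd

parties = [c for c in df.columns if c not in ['Comm', 'Votes']]
n_prefs = len(parties)

def random_preferences(row):
    out = {}
    for p in parties:
        first = int(row[p])               # honest voters: rank 1 keeps the actual votes
        rest = int(row['Votes']) - first  # ballots left to spread over ranks 2..n
        split = np.random.multinomial(rest, [1.0 / (n_prefs - 1)] * (n_prefs - 1))
        out[f'{p}_1'] = first
        for rank, count in enumerate(split, start=2):
            out[f'{p}_{rank}'] = count
    return pd.Series(out)

prefs = df.apply(random_preferences, axis=1)  # columns P1_1 ... P3_3, each party summing to Votes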
I tried this so far:
parties = [item for item in df.columns if item not in ['Comm','Votes']]
for index, row in df_test.iterrows():
    # In the other columns I put a random number.
    for party in parties:
        # for each column in parties I create (parties - 1) other columns
        for i in range(0, len(parties) - 1):
            print(random.randrange(0, row['Votes']))
            # I rename them all Party_i. The former 1 becomes Party_1.
            row["{party}_{preference}".format(party=party, preference=i)] = random.randrange(0, row['Votes']) if (row[party] < row['Votes']) else 0  # false because the sum of the votes isn't = to df['Votes']
The results are:
Comm Votes ... P1_1 P1_2 P1_3 P2_1 P2_2 P2_3 P3_1 P3_2 P3_3
0 comm1 1315.0 ... 1003 460 1588 1284 1482 1613 1429 345
1 comm2 1691.0 ... 1003 460 1588 1284 1482 1613 ...
...
But:
the numbers are the same for every row
the value in the Pi_1 column isn't equal to the one in the Pi column (Pi being a given party)
the sum of Pi_j over all j isn't equal to the number in the Votes column
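I suspect part of the problem is that iterrows() yields a copy of each row, so assigning into row never writes back into df_test; writing through df_test.at does (the column name below is only an illustration):
for index, row in df_test.iterrows():
    row['P1_2'] = 0                  # modifies only the local copy
    df_test.at[index, 'P1_2'] = 0    # writes back into the dataframe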
Update
I tried Antihead's answer with his own data and it worked well. But when applying it to my own data it doesn't; it leaves me with an empty dataframe:
import collections

def fill_cells(cell):
    v_max = cell['Votes']
    all_dict = {}
    # iterate over parties
    for p in parties:
        tmp_l = parties.copy()
        tmp_l.remove(p)
        # sample new data with equal choices
        sampled = np.random.choice(tmp_l, int(v_max - cell[p]))
        # transform into dictionary
        c_sampled = dict(collections.Counter(sampled))
        c_sampled.update({p: cell[p]})
        # batch update of the dictionary keys
        all_dict.update(
            dict(zip([p + '_%s' % k[1] for k in c_sampled.keys()], c_sampled.values()))
        )
    return pd.Series(all_dict)
Indeed, with the following dataframe:
Comm Votes LPC CPC BQ
0 comm1 1315.0 2.0 424.0 572.0
1 comm2 4682.0 117.0 2053.0 1584.0
2 comm3 2397.0 2.0 40.0 192.0
3 comm4 931.0 2.0 12.0 345.0
4 comm5 842.0 47.0 209.0 76.0
... ... ... ... ... ...
1522 comm1523 23808.0 1588.0 4458.0 13147.0
1523 comm1524 639.0 40.0 126.0 40.0
1524 comm1525 10477.0 13.0 673.0 333.0
1525 comm1526 2674.0 1.0 55.0 194.0
1526 comm1527 1691.0 331.0 29.0 78.0
I have an empty dataframe:
0
1
2
3
4
...
1522
1523
1524
1525
1526

Does this work:
# data
columns = ['Comm', 'Votes', 'P1', 'P2', 'P3']
data = [['comm1', 1315.0, 2.0, 424.0, 572.0],
        ['comm2', 4682.0, 117.0, 2053.0, 1584.0],
        ['comm3', 2397.0, 2.0, 40.0, 192.0],
        ['comm4', 931.0, 2.0, 12.0, 345.0],
        ['comm5', 842.0, 47.0, 209.0, 76.0],
        ['comm1525', 10477.0, 13.0, 673.0, 333.0],
        ['comm1526', 2674.0, 1.0, 55.0, 194.0],
        ['comm1527', 1691.0, 331.0, 29.0, 78.0]]
df = pd.DataFrame(data=data, columns=columns)

import collections

def fill_cells(cell):
    v_max = cell['Votes']
    all_dict = {}
    # iterate over parties
    for p in ['P1', 'P2', 'P3']:
        tmp_l = ['P1', 'P2', 'P3']
        tmp_l.remove(p)
        # sample new data with equal choices
        sampled = np.random.choice(tmp_l, int(v_max - cell[p]))
        # transform into dictionary
        c_sampled = dict(collections.Counter(sampled))
        c_sampled.update({p: cell[p]})
        # batch update of the dictionary keys
        all_dict.update(
            dict(zip([p + '_%s' % k[1] for k in c_sampled.keys()], c_sampled.values()))
        )
    return pd.Series(all_dict)
# get back a data frame
df.apply(fill_cells, axis=1)
If you need to merge the data frame back, do something like:
new_df = df.apply(fill_cells, axis=1)
pd.concat([df, new_df], axis=1)
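Note that the column key p + '_%s' % k[1] just takes the second character of each sampled party's name, so it only yields labels like P1_1, P1_2 when the party names follow the P&lt;digit&gt; pattern; with names such as LPC or CPC the generated keys collide. A name-agnostic variant (a sketch, assuming a parties list such as ['P1', 'P2', 'P3'] is defined) could key on each party's position in that list instead:
all_dict.update({'%s_%s' % (p, parties.index(name) + 1): count
                 for name, count in c_sampled.items()})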

Based on Antihead's answer and for the following dataset:
Comm Votes LPC CPC BQ
0 comm1 1315.0 2.0 424.0 572.0
1 comm2 4682.0 117.0 2053.0 1584.0
2 comm3 2397.0 2.0 40.0 192.0
3 comm4 931.0 2.0 12.0 345.0
4 comm5 842.0 47.0 209.0 76.0
... ... ... ... ... ...
1522 comm1523 23808.0 1588.0 4458.0 13147.0
1523 comm1524 639.0 40.0 126.0 40.0
1524 comm1525 10477.0 13.0 673.0 333.0
1525 comm1526 2674.0 1.0 55.0 194.0
1526 comm1527 1691.0 331.0 29.0 78.0
I tried:
def fill_cells(cell):
    votes_max = cell['Votes']
    all_dict = {}
    # iterate over parties
    parties_temp = parties.copy()
    for p in parties_temp:
        preferences = ['1', '2', '3']
        for preference in preferences:
            preferences.remove(preference)
            # sample new data with equal choices
            sampled = np.random.choice(preferences, int(votes_max - cell[p]))
            # transform into dictionary
            c_sampled = dict(collections.Counter(sampled))
            c_sampled.update({p: cell[p]})
            c_sampled['1'] = c_sampled.pop(p)
            # batch update of the dictionary keys
            all_dict.update(
                dict(zip([p + '_%s' % k for k in c_sampled.keys()], c_sampled.values()))
            )
    return pd.Series(all_dict)
It returns
LPC_2 LPC_3 LPC_1 CPC_2 CPC_3 CPC_1 BQ_2 BQ_3 BQ_1
0 891.0 487.0 424.0 743.0 373.0 572.0 1313.0 683.0 2.0
1 2629.0 1342.0 2053.0 3098.0 1603.0 1584.0 4565.0 2301.0 117.0
2 2357.0 1186.0 40.0 2205.0 1047.0 192.0 2395.0 1171.0 2.0
3 919.0 451.0 12.0 586.0 288.0 345.0 929.0 455.0 2.0
4 633.0 309.0 209.0 766.0 399.0 76.0 795.0 396.0 47.0
... ... ... ... ... ... ... ... ... ...
1520 1088.0 536.0 42.0 970.0 462.0 160.0 1117.0 540.0 13.0
1521 4742.0 2341.0 219.0 3655.0 1865.0 1306.0 4705.0 2375.0 256.0
1522 19350.0 9733.0 4458.0 10661.0 5352.0 13147.0 22220.0 11100.0 1588.0
1523 513.0 264.0 126.0 599.0 267.0 40.0 599.0 306.0 40.0
1524 9804.0 4885.0 673.0 10144.0 5012.0 333.0 10464.0 5162.0 13.0
It's almost good. I would have preferred the preferences to be dynamically encoded rather than hard-coding ['1','2','3'].
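For reference, here is a sketch of how the preference labels could be derived from the number of parties instead of being hard-coded (untested, and assuming parties, np, pd and collections are available as above; looping over a fixed list of ranks also avoids mutating the list while iterating over it):
def fill_cells(cell):
    votes_max = cell['Votes']
    all_dict = {}
    # preference ranks 1..n, derived from the number of parties
    preferences = [str(i) for i in range(1, len(parties) + 1)]
    for p in parties:
        other_prefs = [pref for pref in preferences if pref != '1']
        # spread the remaining ballots over ranks 2..n
        sampled = np.random.choice(other_prefs, int(votes_max - cell[p]))
        c_sampled = dict(collections.Counter(sampled))
        c_sampled['1'] = cell[p]  # honest voters: rank 1 keeps the actual votes
        all_dict.update({'%s_%s' % (p, k): v for k, v in c_sampled.items()})
    return pd.Series(all_dict)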

Related

ValueError: z array must not contain non-finite values within the triangulation

Hi, I am trying to plot an irregular grid in Basemap with tricontour. However, I now get this error:
tri, z = self._contour_args(args, kwargs)
File "C:\Users\OName\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\matplotlib\tri\tricontour.py", line 72, in _contour_args
raise ValueError('z array must not contain non-finite values '
ValueError: z array must not contain non-finite values within the triangulation
My database looks like this:
Latitude Longitude Altitude Value O18d
0 30.0 -30.0 0.0 0.522199 0.522199
1 30.0 -29.9 0.0 0.531214 0.531214
2 30.0 -29.8 0.0 0.540248 0.540248
3 30.0 -29.7 0.0 0.549301 0.549301
4 30.0 -29.6 0.0 0.558374 0.558374
... ... ... ... ... ...
359995 69.9 59.5 68.0 -1.489262 -1.625262
359996 69.9 59.6 74.0 -1.487915 -1.635915
359997 69.9 59.7 52.0 -1.486524 -1.590524
359998 69.9 59.8 71.0 -1.485089 -1.627089
359999 69.9 59.9 68.0 -1.483611 -1.619611
Where could my error be?
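A first thing worth checking (a sketch, assuming the dataframe is named df and the Value column is passed as z): tricontour raises exactly this error when z contains NaN or infinite values, so masking or dropping those rows before triangulating usually resolves it:
import numpy as np

# keep only rows where the z column is finite
clean = df[np.isfinite(df['Value'])]

# then build the contour from the cleaned columns, e.g.
# plt.tricontourf(clean['Longitude'], clean['Latitude'], clean['Value'])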

Rolling Mean not being calculated on a new column

I have an issue calculating the rolling mean for a column I added in the code. For some reason, it doesn't work on the column I added, but it works on a column from the original csv.
Original dataframe from the csv as follow:
Open High Low Last Change Volume Open Int
Time
09/20/19 98.50 99.00 98.35 98.95 0.60 3305.0 0.0
09/19/19 100.35 100.75 98.10 98.35 -2.00 17599.0 0.0
09/18/19 100.65 101.90 100.10 100.35 0.00 18258.0 121267.0
09/17/19 103.75 104.00 100.00 100.35 -3.95 34025.0 122453.0
09/16/19 102.30 104.95 101.60 104.30 1.55 21403.0 127447.0
Ticker = pd.read_csv('\\......\Historical data\kcz19 daily.csv',
index_col=0, parse_dates=True)
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1)).fillna('')
Ticker['ret20'] = Ticker['Return'].rolling(window=20, win_type='triang').mean()
print(Ticker.head())
Open High Low ... Open Int Return ret20
Time ...
09/20/19 98.50 99.00 98.35 ... 0.0
09/19/19 100.35 100.75 98.10 ... 0.0 -0.00608213 -0.00608213
09/18/19 100.65 101.90 100.10 ... 121267.0 0.0201315 0.0201315
09/17/19 103.75 104.00 100.00 ... 122453.0 0 0
09/16/19 102.30 104.95 101.60 ... 127447.0 0.0386073 0.0386073
The ret20 column should contain the rolling mean of the Return column, so it should show some data starting from row 21, whereas here it is only a copy of the Return column.
If I compute it on the Last column instead, it works.
Below is the result using column Last:
Open High Low ... Open Int Return ret20
Time ...
09/20/19 98.50 99.00 98.35 ... 0.0 NaN
09/19/19 100.35 100.75 98.10 ... 0.0 -0.00608213 NaN
09/18/19 100.65 101.90 100.10 ... 121267.0 0.0201315 NaN
09/17/19 103.75 104.00 100.00 ... 122453.0 0 NaN
09/16/19 102.30 104.95 101.60 ... 127447.0 0.0386073 NaN
09/13/19 103.25 103.60 102.05 ... 128707.0 -0.0149725 NaN
09/12/19 102.80 103.85 101.15 ... 128904.0 0.00823848 NaN
09/11/19 102.00 104.70 101.40 ... 132067.0 -0.00193237 NaN
09/10/19 98.50 102.25 98.00 ... 135349.0 -0.0175614 NaN
09/09/19 97.00 99.25 95.30 ... 137347.0 -0.0335283 NaN
09/06/19 95.35 97.30 95.00 ... 135399.0 -0.0122889 NaN
09/05/19 96.80 97.45 95.05 ... 136142.0 -0.0171477 NaN
09/04/19 95.65 96.95 95.50 ... 134864.0 0.0125002 NaN
09/03/19 96.00 96.60 94.20 ... 134685.0 -0.0109291 NaN
08/30/19 95.40 97.20 95.10 ... 134061.0 0.0135137 NaN
08/29/19 97.05 97.50 94.75 ... 132639.0 -0.0166584 NaN
08/28/19 97.40 98.15 95.95 ... 130573.0 0.0238601 NaN
08/27/19 97.35 98.00 96.40 ... 129921.0 -0.00410889 NaN
08/26/19 95.55 98.50 95.25 ... 129003.0 0.0035962 NaN
08/23/19 96.90 97.40 95.05 ... 130268.0 -0.0149835 98.97775
Appreciate any help
The .fillna('') puts a string in the first row, which then breaks the rolling calculation for Ticker['ret20'].
Delete it and the code will run fine:
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1))
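With that change, the expectation (a sketch; since the first Return is NaN, ret20 stays NaN until 20 returns are available):
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1))
Ticker['ret20'] = Ticker['Return'].rolling(window=20, win_type='triang').mean()
# ret20 is NaN for the first 20 rows and starts showing values from row 21 onwards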

How do I create new columns by combining data in existing columns?

I have a dataset that includes 5 columns. Excuse the formatting:
id Price Service Rater Name Cleanliness
401013357 5 3 A 1
401014972 2 1 A 5
401022510 3 4 B 2
401022510 5 1 C 9
401022510 3 1 D 4
401022510 2 2 E 2
I would like for there to be only one row for each ID. Therefore, I need to create columns for each of the raters' names and ratings categories (e.g. Rater Name Price, Rater Name Service, Rater name Cleanliness), each in its own column. Thank you.
I've explored groupby but cannot figure out how to manipulate these into new columns. Thank you!
Here's the code and data I'm actually using:
import requests
from pandas import DataFrame
import pandas as pd
linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'])
From this, I get
formattedSpread overUnder spread
id provider
401013357 consensus UMass -21 68.0 -21.0
401014972 consensus Rice -22.5 58.5 -22.5
401022510 Caesars Colorado State -17.5 57.5 -17.5
consensus Colorado State -17 57.5 -17.0
numberfire Colorado State -17 58.5 -17.0
teamrankings Colorado State -17 58.0 -17.0
401013437 numberfire Wyoming -5 47.0 5.0
teamrankings Wyoming -5 47.0 5.0
401020671 consensus Ball State -19.5 61.5 -19.5
401019470 Caesars UCF -22.5 NaN 22.5
consensus UCF -22.5 NaN 22.5
numberfire UCF -24 70.0 24.0
teamrankings UCF -24 70.0 24.0
401013328 numberfire Minnesota -21.5 47.0 -21.5
teamrankings Minnesota -21.5 49.0 -21.5
The outcome I am looking for is for each of the 4 different providers to have three columns each, so that it's caesars_formattedSpread, caesars_overUnder, caesars_spread, numberfire_formattedSpread, numberfire_overUnder, numberfire_spread, etc.
When I run unstack as suggested, I don't get what I expect. Instead I get:
formattedSpread 0 UMass -21
1 Rice -22.5
2 Colorado State -17.5
3 Colorado State -17
4 Colorado State -17
5 Colorado State -17
6 Wyoming -5
7 Wyoming -5
8 Ball State -19.5
9 UCF -22.5
10 UCF -22.5
11 UCF -24
12 UCF -24
* Edited, based on the edited question *
Given that your dataframe is df:
df = df.set_index(['id', 'Rater Name']) # Make it a Multi Index
df_unstacked = df.unstack()
The problem with your edited code is that you don't assign dflinestable.set_index(['id', 'provider']) to anything. So when you then use dflinestable.unstack(), you are unstacking the original dflinestable.
So with your entire code, it should be:
import requests
import pandas as pd
linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = pd.DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'], inplace=True) # Add inplace=True
dflinestable_unstacked = dflinestable.unstack() # unstack (you could also reassign to the same df
# Flatten columns to single level, in the order as described
dflinestable_unstacked.columns = [f'{j}_{i}' for i, j in dflinestable_unstacked.columns]
This will give you a DataFrame like (abbreviated):
Caesars_formattedSpread ... teamrankings_spread
id ...
401012246 Alabama -24 ... -23.5
401012247 Arkansas -34 ... NaN
401012248 Auburn -1 ... -1.5
401012249 NaN ... NaN
401012250 Georgia -44 ... NaN
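The same set_index / unstack / flatten pattern also works on the small ratings table from the top of the question; a sketch, assuming that dataframe is named ratings:
ratings = ratings.set_index(['id', 'Rater Name']).unstack()
# flatten the MultiIndex columns into names like 'A_Price', 'A_Service', 'A_Cleanliness'
ratings.columns = [f'{rater}_{measure}' for measure, rater in ratings.columns]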

How to groupby, cut, transpose then merge result of one pandas Dataframe using vectorisation

Here is an example of the data we want to process:
df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})
X boat_id target_Y
0 482 275 6
1 705 245 4
2 328 102 6
3 631 227 6
4 234 236 8
...
I want to obtain an output like this :
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 target_Y boat_id
40055 684.0 692.0 950.0 572.0 442.0 850.0 75.0 140.0 382.0 576.0 0.0 1
40056 178.0 949.0 490.0 777.0 335.0 559.0 397.0 729.0 701.0 44.0 4.0 1
40057 21.0 818.0 341.0 577.0 612.0 57.0 303.0 183.0 519.0 357.0 0.0 1
40058 501.0 1000.0 999.0 532.0 765.0 913.0 964.0 922.0 772.0 534.0 1.0 2
40059 305.0 906.0 724.0 996.0 237.0 197.0 414.0 171.0 369.0 299.0 8.0 2
40060 408.0 796.0 815.0 638.0 691.0 598.0 913.0 579.0 650.0 955.0 2.0 3
40061 298.0 512.0 247.0 824.0 764.0 414.0 71.0 440.0 135.0 707.0 9.0 4
40062 535.0 687.0 945.0 859.0 718.0 580.0 427.0 284.0 122.0 777.0 2.0 4
40063 352.0 115.0 228.0 69.0 497.0 387.0 552.0 473.0 574.0 759.0 3.0 4
40064 179.0 870.0 862.0 186.0 25.0 125.0 925.0 310.0 335.0 739.0 7.0 4
...
I wrote the following code, but it is way too slow.
It groups by boat_id, cuts with enumerate, transposes, then merges the results into one pandas DataFrame:
start_time = time.time()
N = 10
col_names = ['X' + str(x) for x in range(N)]
compil = pd.DataFrame(columns=col_names)
i = 0
# I group by boat ID
for boat_id, df_boat in df_random.groupby('boat_id'):
    # then I cut every 5 lines
    for (line_number, (index, row)) in enumerate(df_boat.iterrows()):
        if line_number % 5 == 0:
            compil_new_line_X = list(df_boat.iloc[line_number - N:line_number, :]["X"])
            # filter to avoid issues at the start and end of the columns
            if len(compil_new_line_X) == N:
                compil.loc[i, col_names] = compil_new_line_X
                compil.loc[i, 'target_Y'] = row['target_Y']
                compil.loc[i, 'boat_id'] = row['boat_id']
                i += 1
print("Total %s seconds" % (time.time() - start_time))
Total 232.947000027 seconds
My question is:
How do I do something every "x number of lines" and then merge the results?
Is there a way to vectorize that kind of operation?
Here is a solution that improves calculation time by 35%.
It uses a groupby on 'boat_id', then groupby.apply to divide the groups into small chunks.
A final apply then creates the new lines. We can probably still improve it.
df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})
start_time = time.time()
len_of_chunks = 10
col_names = ['X' + str(x) for x in range(len_of_chunks)] + ['boat_id', 'target_Y']

def prepare_data(group):
    # this function creates the new line we will put in 'compil'
    info_we_want_to_keep = ['boat_id', 'target_Y']
    info_and_target = group.tail(1)[info_we_want_to_keep].values
    k = group["X"]
    return np.hstack([k.values, info_and_target[0]])

# we group by ID (boat)
# we divide into chunks of length "len_of_chunks"
# we apply prepare_data to each chunk
groups = df_random.groupby('boat_id').apply(lambda x: x.groupby(np.arange(len(x)) // len_of_chunks).apply(prepare_data))
# we reset the index
# we take the '0' column containing the valuable info
# we put the info in a new 'compil' dataframe
# we drop incomplete lines (generated by chunks < len_of_chunks)
compil = pd.DataFrame(groups.reset_index()[0].values.tolist(), columns=col_names).dropna()
print("Total %s seconds" % (time.time() - start_time))
Total 153.781999826 seconds
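To make the chunking step concrete, np.arange(len(x)) // len_of_chunks just assigns a chunk number to every row of a group, which the inner groupby then splits on; a tiny illustration (assuming len_of_chunks = 10):
import numpy as np

len_of_chunks = 10
chunk_ids = np.arange(23) // len_of_chunks
# rows 0-9 get chunk 0, rows 10-19 get chunk 1, rows 20-22 get chunk 2
# (the last, incomplete chunk later produces NaNs that dropna() removes)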

Pythonic / Panda Way to Create Function to Groupby

I am fairly new to programming & am looking for a more pythonic way to implement some code. Here is dummy data:
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=(10000)),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2018', freq='D'), 10000)})
I have lots of transactional data like that that I perform various Groupby's on. My current solution is to make a master groupby like this:
master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()
From there, I perform various groupbys using .groupby(level=) function to aggregate the information in the way I'm looking for. I usually make a summary at each level. In addition, I create sub-totals at each level using some variation of the below code.
y = master.groupby(level=[0,1,2]).sum()
y.index = pd.MultiIndex.from_arrays([
    y.index.get_level_values(0),
    y.index.get_level_values(1),
    y.index.get_level_values(2) + ' Total',
    len(y.index)*['']
])
y1 = master.groupby(level=[0,1]).sum()
y1.index = pd.MultiIndex.from_arrays([
    y1.index.get_level_values(0),
    y1.index.get_level_values(1) + ' Total',
    len(y1.index)*[''],
    len(y1.index)*['']
])
y2 = master.groupby(level=[0]).sum()
y2.index = pd.MultiIndex.from_arrays([
    y2.index.get_level_values(0) + ' Total',
    len(y2.index)*[''],
    len(y2.index)*[''],
    len(y2.index)*['']
])
pd.concat([master,y,y1,y2]).sort_index()\
    .assign(Diff = lambda x: x.iloc[:,-1] - x.iloc[:,-2])\
    .assign(Diff_Perc = lambda x: (x.iloc[:,-2] / x.iloc[:,-3]) - 1)\
    .dropna(how='all')
This is just an example - I may perform the same exercise, but perform the groupby in a different order. For example - next I may want to group by 'Category', 'Product', then 'Customer', so I'd have to do:
master.groupby(level=[1,3,0]).sum()
Then I will have to repeat the whole exercise for sub-totals like above. I also frequently change the time period - could be year-ending a specific month, could be year to date, could be by quarter, etc.
From what I've learned so far in programming (which is minimal, clearly!), you should look to write a function any time you repeat code. Obviously I am repeating code over & over again in this example.
Is there a way to construct a function where you can provide the levels to Groupby, along with the time frame, all while creating a function for sub-totaling each level as well?
Thanks in advance for any guidance on this. It is very much appreciated.
For a DRY-er solution, consider generalizing your current method into a defined method that filters the original data frame by date range and runs the aggregations, receiving the groupby levels and date range (the latter optional) as parameters:
Method
def multiple_agg(mylevels, start_date='2016-01-01', end_date='2018-12-31'):
    filter_df = df[df['Date'].between(start_date, end_date)]
    master = (filter_df.groupby(['Customer', 'Category', 'Sub-Category', 'Product',
                                 pd.Grouper(key='Date', freq='A')])['Units_Sold']
                       .sum()
                       .unstack()
             )
    y = master.groupby(level=mylevels[:-1]).sum()
    y.index = pd.MultiIndex.from_arrays([
        y.index.get_level_values(0),
        y.index.get_level_values(1),
        y.index.get_level_values(2) + ' Total',
        len(y.index)*['']
    ])
    y1 = master.groupby(level=mylevels[0:2]).sum()
    y1.index = pd.MultiIndex.from_arrays([
        y1.index.get_level_values(0),
        y1.index.get_level_values(1) + ' Total',
        len(y1.index)*[''],
        len(y1.index)*['']
    ])
    y2 = master.groupby(level=mylevels[0]).sum()
    y2.index = pd.MultiIndex.from_arrays([
        y2.index.get_level_values(0) + ' Total',
        len(y2.index)*[''],
        len(y2.index)*[''],
        len(y2.index)*['']
    ])
    final_df = (pd.concat([master, y, y1, y2])
                  .sort_index()
                  .assign(Diff=lambda x: x.iloc[:,-1] - x.iloc[:,-2])
                  .assign(Diff_Perc=lambda x: (x.iloc[:,-2] / x.iloc[:,-3]) - 1)
                  .dropna(how='all')
                  .reorder_levels(mylevels)
               )
    return final_df
Aggregation Runs (of different levels and date ranges)
agg_df1 = multiple_agg([0,1,2,3])
agg_df2 = multiple_agg([1,3,0,2], '2016-01-01', '2017-12-31')
agg_df3 = multiple_agg([2,3,1,0], start_date='2017-01-01', end_date='2018-12-31')
Testing (final_df being OP's pd.concat() output)
# EQUALITY TESTING OF FIRST 10 ROWS
print(final_df.head(10).eq(agg_df1.head(10)))
# Date 2016-12-31 00:00:00 2017-12-31 00:00:00 2018-12-31 00:00:00 Diff Diff_Perc
# Customer Category Sub-Category Product
# 45mhn4PU1O Group A X Product 1 True True True True True
# Product 2 True True True True True
# Product 3 True True True True True
# X Total True True True True True
# Y Product 1 True True True True True
# Product 2 True True True True True
# Product 3 True True True True True
# Y Total True True True True True
# Z Product 1 True True True True True
# Product 2 True True True True True
I think you can do it using sum with the level parameter:
master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()
s1 = master.sum(level=[0,1,2]).assign(Product='Total').set_index('Product',append=True)
s2 = master.sum(level=[0,1])
# Wanted to use assign method but because of the hyphen in the column name you can't.
# Also use the Z in front for sorting purposes
s2['Sub-Category'] = 'ZTotal'
s2['Product'] = ''
s2 = s2.set_index(['Sub-Category','Product'], append=True)
s3 = master.sum(level=[0])
s3['Category'] = 'Total'
s3['Sub-Category'] = ''
s3['Product'] = ''
s3 = s3.set_index(['Category','Sub-Category','Product'], append=True)
master_new = pd.concat([master,s1,s2,s3]).sort_index()
master_new
Output:
Date 2016-12-31 2017-12-31 2018-12-31
Customer Category Sub-Category Product
30XWmt1jm0 Group A X Product 1 651.0 341.0 453.0
Product 2 267.0 445.0 117.0
Product 3 186.0 280.0 352.0
Total 1104.0 1066.0 922.0
Y Product 1 426.0 417.0 670.0
Product 2 362.0 210.0 380.0
Product 3 232.0 290.0 430.0
Total 1020.0 917.0 1480.0
Z Product 1 196.0 212.0 703.0
Product 2 277.0 340.0 579.0
Product 3 416.0 392.0 259.0
Total 889.0 944.0 1541.0
ZTotal 3013.0 2927.0 3943.0
Group B X Product 1 356.0 230.0 407.0
Product 2 402.0 370.0 590.0
Product 3 262.0 381.0 377.0
Total 1020.0 981.0 1374.0
Y Product 1 575.0 314.0 643.0
Product 2 557.0 375.0 411.0
Product 3 344.0 246.0 280.0
Total 1476.0 935.0 1334.0
Z Product 1 278.0 152.0 392.0
Product 2 149.0 596.0 303.0
Product 3 234.0 505.0 521.0
Total 661.0 1253.0 1216.0
ZTotal 3157.0 3169.0 3924.0
Total 6170.0 6096.0 7867.0
3U2anYOD6o Group A X Product 1 214.0 443.0 195.0
Product 2 170.0 220.0 423.0
Product 3 111.0 469.0 369.0
... ... ... ...
somc22Y2Hi Group B Z Total 906.0 1063.0 680.0
ZTotal 3070.0 3751.0 2736.0
Total 6435.0 7187.0 6474.0
zRZq6MSKuS Group A X Product 1 421.0 182.0 387.0
Product 2 359.0 287.0 331.0
Product 3 232.0 394.0 279.0
Total 1012.0 863.0 997.0
Y Product 1 245.0 366.0 111.0
Product 2 377.0 148.0 239.0
Product 3 372.0 219.0 310.0
Total 994.0 733.0 660.0
Z Product 1 280.0 363.0 354.0
Product 2 384.0 604.0 178.0
Product 3 219.0 462.0 366.0
Total 883.0 1429.0 898.0
ZTotal 2889.0 3025.0 2555.0
Group B X Product 1 466.0 413.0 187.0
Product 2 502.0 370.0 368.0
Product 3 745.0 480.0 318.0
Total 1713.0 1263.0 873.0
Y Product 1 218.0 226.0 385.0
Product 2 123.0 382.0 570.0
Product 3 173.0 572.0 327.0
Total 514.0 1180.0 1282.0
Z Product 1 480.0 317.0 604.0
Product 2 256.0 215.0 572.0
Product 3 463.0 50.0 349.0
Total 1199.0 582.0 1525.0
ZTotal 3426.0 3025.0 3680.0
Total 6315.0 6050.0 6235.0
[675 rows x 3 columns]
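A caveat for newer pandas versions: DataFrame.sum(level=...) is deprecated (and removed in pandas 2.0), so on current versions the same sub-totals are written with an explicit groupby on the index levels, e.g.:
# equivalent to master.sum(level=[0, 1, 2]) on modern pandas
s1 = master.groupby(level=[0, 1, 2]).sum().assign(Product='Total').set_index('Product', append=True)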
