Stretching the data frame by date - python

I have this data frame:
ID date X1 X2 Y
A 16-07-19 58 50 0
A 21-07-19 28 74 0
B 25-07-19 54 65 1
B 27-07-19 50 30 0
B 29-07-19 81 61 0
C 30-07-19 55 29 0
C 31-07-19 97 69 1
C 03-08-19 13 48 1
D 19-07-18 77 27 1
D 20-07-18 68 50 1
D 22-07-18 89 57 1
D 23-07-18 46 70 0
D 26-07-18 56 13 0
E 06-08-19 47 35 1
I want to "stretch" the data by date, from the first row, to the last row of each ID (groupby),
and to fill the missing values with NaN.
For example: ID A has two rows on 16-07-19, and 21-07-19.
After the implementation, (s)he should have 6 rows on 16-21 of July, 2019.
Expected result:
ID date X1 X2 Y
A 16-07-19 58.0 50.0 0.0
A 17-07-19 NaN NaN NaN
A 18-07-19 NaN NaN NaN
A 19-07-19 NaN NaN NaN
A 20-07-19 NaN NaN NaN
A 21-07-19 28.0 74.0 0.0
B 25-07-19 54.0 65.0 1.0
B 26-07-19 NaN NaN NaN
B 27-07-19 50.0 30.0 0.0
B 28-07-19 NaN NaN NaN
B 29-07-19 81.0 61.0 0.0
C 30-07-19 55.0 29.0 0.0
C 31-07-19 97.0 69.0 1.0
C 01-08-19 NaN NaN NaN
C 02-08-19 NaN NaN NaN
C 03-08-19 13.0 48.0 1.0
D 19-07-18 77.0 27.0 1.0
D 20-07-18 68.0 50.0 1.0
D 21-07-18 NaN NaN NaN
D 22-07-18 89.0 57.0 1.0
D 23-07-18 46.0 70.0 0.0
D 24-07-18 NaN NaN NaN
D 25-07-18 NaN NaN NaN
D 26-07-18 56.0 13.0 0.0
E 06-08-19 47.0 35.0 1.0

Use DataFrame.asfreq per group, working with a DatetimeIndex:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
df = df.set_index('date').groupby('ID')[cols].apply(lambda x: x.asfreq('d')).reset_index()
print (df)
ID date X1 X2 Y
0 A 2019-07-16 58.0 50.0 0.0
1 A 2019-07-17 NaN NaN NaN
2 A 2019-07-18 NaN NaN NaN
3 A 2019-07-19 NaN NaN NaN
4 A 2019-07-20 NaN NaN NaN
5 A 2019-07-21 28.0 74.0 0.0
6 B 2019-07-25 54.0 65.0 1.0
7 B 2019-07-26 NaN NaN NaN
8 B 2019-07-27 50.0 30.0 0.0
9 B 2019-07-28 NaN NaN NaN
10 B 2019-07-29 81.0 61.0 0.0
11 C 2019-07-30 55.0 29.0 0.0
12 C 2019-07-31 97.0 69.0 1.0
13 C 2019-08-01 NaN NaN NaN
14 C 2019-08-02 NaN NaN NaN
15 C 2019-08-03 13.0 48.0 1.0
16 D 2018-07-19 77.0 27.0 1.0
17 D 2018-07-20 68.0 50.0 1.0
18 D 2018-07-21 NaN NaN NaN
19 D 2018-07-22 89.0 57.0 1.0
20 D 2018-07-23 46.0 70.0 0.0
21 D 2018-07-24 NaN NaN NaN
22 D 2018-07-25 NaN NaN NaN
23 D 2018-07-26 56.0 13.0 0.0
24 E 2019-08-06 47.0 35.0 1.0
Another idea, with DataFrame.reindex per group:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max()))
df = df.set_index('date').groupby('ID')[cols].apply(f).reset_index()
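One small caveat with the reindex variant: pd.date_range does not carry over the index name, so after reset_index the date column may come back as level_1 instead of date. Passing name='date' keeps the original name (a small tweak to the lambda above):
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(), name='date'))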

Here is my sort jitsu:
def Sort_by_date(dataf):
    # rule 1: sort by date
    dataf['Current'] = pd.to_datetime(dataf.Current)
    dataf = dataf.sort_values(by=['Current'], ascending=True)
    # rule 2: keep only dates in the window of interest
    mask = (dataf['Current'] > '1/1/2020') & (dataf['Current'] <= '12/31/2022')
    dataf = dataf.loc[mask]
    return dataf
You can adapt this code to sort by date for your solution.
Next, let's group by ID:
Week1 = WeeklyDF.groupby('ID')
Week1_Report = Week1[['ID','date','X1','X2','Y']]
Week1_Report
Lastly, let's replace the NaN values:
Week1_Report['X1'].fillna("X1 is 0", inplace=True)
Week1_Report['X2'].fillna("X2 is 0", inplace=True)
Week1_Report['Y'].fillna("Y is 0", inplace=True)
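Putting those pieces together on this question's columns, a minimal sketch (assuming the frame has already been "stretched" by one of the answers above, so the NaN rows exist, and that zeros rather than NaN are wanted in the new rows):
# sort within each ID by date, then fill the generated gaps with zeros
df = df.sort_values(['ID', 'date'])
df[['X1', 'X2', 'Y']] = df[['X1', 'X2', 'Y']].fillna(0)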

Store the result in new columns, named based on another variable (Pandas)

I have a dataframe. I need to calculate the difference between the variables A and B, and store the result in new columns based on the variable df['Value']. If Value == 1, the result is stored in a column named Diff_1; if Value == 2, in Diff_2; and so on.
Here is the code so far, but obviously the line df_red['Diff_' + str(value)] = df_red['A'] - df_red['B'] is not doing what I want:
import pandas as pd
df = pd.read_excel(r'E:\...\.xlsx')
print(df)
value = list(set(df['Value']))
print(value)
for value in value:
    df_red = df[df['Value'] == value]
    df_red['Diff_' + str(value)] = df_red['A'] - df_red['B']
Out[126]:
ID Value A B
0 1 1 56.0 49.0
1 2 3 56.0 50.0
2 3 4 103.0 44.0
3 4 2 89.0 44.0
4 5 1 84.0 41.0
5 6 1 77.0 43.0
6 7 2 71.0 35.0
7 8 4 77.0 32.0
print(value)
[1, 2, 3, 4]
After a simple operation of df['A'] - df['B'] the result should look like this.
Out[128]:
ID Value A B Diff_1 Diff_2 Diff_3 Diff_4
0 1 1 56.0 49.0 7.0 0.0 0.0 0.0
1 2 3 56.0 50.0 0.0 0.0 6.0 0.0
2 3 4 103.0 44.0 0.0 0.0 0.0 59.0
3 4 2 89.0 44.0 0.0 45.0 0.0 0.0
4 5 1 84.0 41.0 43.0 0.0 0.0 0.0
5 6 1 77.0 43.0 34.0 0.0 0.0 0.0
6 7 2 71.0 35.0 0.0 36.0 0.0 0.0
7 8 4 77.0 32.0 0.0 0.0 0.0 45.0
A not-so-great way of doing this would be the following; however, I am looking for more efficient, better ways:
df['Diff_1'] = df[df['Value']==1]['A'] - df[df['Value']==1]['B']
df['Diff_2'] = df[df['Value']==2]['A'] - df[df['Value']==2]['B']
df['Diff_3'] = df[df['Value']==3]['A'] - df[df['Value']==3]['B']
df['Diff_4'] = df[df['Value']==4]['A'] - df[df['Value']==4]['B']
You can use:
df.join(df.set_index(['ID', 'Value'])
          .eval('A-B')
          .unstack(level=1).add_prefix('Diff_')
          .reset_index(drop=True)
       )
output:
ID Value A B Diff_1 Diff_2 Diff_3 Diff_4
0 1 1 56.0 49.0 7.0 NaN NaN NaN
1 2 3 56.0 50.0 NaN NaN 6.0 NaN
2 3 4 103.0 44.0 NaN NaN NaN 59.0
3 4 2 89.0 44.0 NaN 45.0 NaN NaN
4 5 1 84.0 41.0 43.0 NaN NaN NaN
5 6 1 77.0 43.0 34.0 NaN NaN NaN
6 7 2 71.0 35.0 NaN 36.0 NaN NaN
7 8 4 77.0 32.0 NaN NaN NaN 45.0
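If zeros are preferred over NaN, as in the expected output in the question, the unstacked frame can be filled before joining (a small variation on the answer above, not part of the original):
out = df.join(df.set_index(['ID', 'Value'])
                .eval('A-B')
                .unstack(level=1).add_prefix('Diff_')
                .reset_index(drop=True)
                .fillna(0))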
Here is my approach which may not be fastest, but it's a start:
for i in df['Value'].unique():
    df.loc[df['Value'] == i, 'Diff_' + str(i)] = df['A'] - df['B']
df.fillna(0, inplace=True)
Output of my fake data:
Value A B Diff_1 Diff_2 Diff_3 Diff_4
0 1 20 2 18.0 0.0 0.0 0.0
1 1 30 5 25.0 0.0 0.0 0.0
2 2 40 7 0.0 33.0 0.0 0.0
3 2 50 15 0.0 35.0 0.0 0.0
4 3 60 25 0.0 0.0 35.0 0.0
5 3 20 7 0.0 0.0 13.0 0.0
6 4 15 36 0.0 0.0 0.0 -21.0
7 4 14 3 0.0 0.0 0.0 11.0

Check if one column's values are within plus or minus 10% of another column in Pandas

Given a dataframe as follows:
id value1 value2
0 3918703 62.0 64.705882
1 3919144 60.0 60.000000
2 3919534 62.5 30.000000
3 3919559 55.0 55.000000
4 3920438 82.0 82.031250
5 3920463 71.0 71.428571
6 3920502 70.0 69.230769
7 3920535 80.0 40.000000
8 3920674 62.0 62.222222
9 3920856 80.0 79.987176
I want to check whether value2 is within plus or minus 10% of value1, and return a new column results_review.
If it's not in the required range, indicate no as the results_review value.
id value1 value2 results_review
0 3918703 62.0 64.705882 NaN
1 3919144 60.0 60.000000 NaN
2 3919534 62.5 30.000000 no
3 3919559 55.0 55.000000 NaN
4 3920438 82.0 82.031250 NaN
5 3920463 71.0 71.428571 NaN
6 3920502 70.0 69.230769 NaN
7 3920535 80.0 40.000000 no
8 3920674 62.0 62.222222 NaN
9 3920856 80.0 79.987176 NaN
How can I do that in Pandas? Thanks in advance for your help.
Use Series.between with DataFrame.loc:
m = df['value2'].between(df['value1'].mul(0.9), df['value1'].mul(1.1))
df.loc[~m, 'results_review'] = 'no'
print(df)
id value1 value2 results_review
0 3918703 62.0 64.705882 NaN
1 3919144 60.0 60.000000 NaN
2 3919534 62.5 30.000000 no
3 3919559 55.0 55.000000 NaN
4 3920438 82.0 82.031250 NaN
5 3920463 71.0 71.428571 NaN
6 3920502 70.0 69.230769 NaN
7 3920535 80.0 40.000000 no
8 3920674 62.0 62.222222 NaN
9 3920856 80.0 79.987176 NaN
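An equivalent way to spell out the plus/minus 10% bounds explicitly, if that reads more clearly (a sketch equivalent to the mask above):
lower = df['value1'] * 0.9
upper = df['value1'] * 1.1
df.loc[(df['value2'] < lower) | (df['value2'] > upper), 'results_review'] = 'no'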

"Begin" & "End" date based on values in a row

I have an example of the input data, which can be found here: input
I need to add 2 columns, "Begin_date" and "End_date", based on the data in each row:
Begin_date - the first month with a non-empty cell (all previous cells are empty), as a date of the form yyyymm01
End_date - the last month with a non-empty cell (all subsequent cells are empty):
if there are no empty cells after it (the data runs to the last column), a "lifelong" date is used: "99991231"
otherwise - yyyymm30 or 31 or 28 (depending on the month)
Example of output:
I would appreciate any ideas :) Thank you
Use pd.melt(), then sort the data by id and date:
import pandas as pd
import numpy as np
from pandas.tseries.offsets import MonthEnd
df = pd.read_excel("input.xlsx")
max_date = df.columns[-1]
res = pd.melt(df, id_vars=['id', 'region'], value_vars=df.columns[2:])
res.dropna(subset=['value'], inplace=True)
res.sort_values(by=['id', 'variable'], ascending=[True, True], inplace=True)
minimum_date = res.drop_duplicates(subset=['id'], keep='first')
maximum_date = res.drop_duplicates(subset=['id'], keep='last')
minimum_date.rename(columns={'variable': 'start_date'}, inplace=True)
maximum_date.rename(columns={'variable': 'end_date'}, inplace=True)
df = pd.merge(df, minimum_date[['id', 'start_date']], on=['id'], how='left')
df = pd.merge(df, maximum_date[['id', 'end_date']], on=['id'], how='left')
df['end_date'] = np.where(df['end_date'] == max_date,
                          "99991231", df['end_date'])
df['start_date'] = (pd.to_datetime(df['start_date'], format="%Y%m", errors='coerce') + MonthEnd(1)).astype(str)
df['end_date'] = (pd.to_datetime(df['end_date'], format="%Y%m", errors='coerce') + MonthEnd(1)).astype(str)
df['end_date'] = np.where(df['end_date'] == 'NaT',
                          "99991231", df['end_date'])
print(df)
id region 201801 201802 ... 201905 201906 start_date end_date
0 100001 628 NaN NaN ... 26.0 23.0 2018-09-30 99991231
1 100002 1149 27.0 24.0 ... 26.0 24.0 2018-01-31 99991231
2 100003 1290 26.0 26.0 ... 27.0 25.0 2018-01-31 99991231
3 100004 955 25.0 26.0 ... NaN NaN 2018-01-31 2018-12-31
4 100005 1397 15.0 25.0 ... NaN NaN 2018-01-31 2018-11-30
5 100006 1397 15.0 25.0 ... NaN NaN 2018-01-31 2019-02-28
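If the begin date should be the first day of the month (yyyymm01, as described for Begin_date in the question) rather than the month end, the start_date conversion above can drop the MonthEnd offset (a small tweak, not part of the original answer):
df['start_date'] = pd.to_datetime(df['start_date'], format="%Y%m", errors='coerce').astype(str)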
The idea is to convert the non-datetimelike columns to a MultiIndex with DataFrame.set_index and then convert the remaining columns to datetimes:
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
Then create the new columns with DataFrame.assign. For the begin date, filter the January columns, test for non-missing values and take the first one with DataFrame.idxmax, then convert to the output format with Series.dt.strftime. For the end date, first reverse the column order with ::-1 indexing and take the last non-missing value, convert it to the last day of the month, and finally substitute the default value with Series.where when the last column is not missing:
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1).dt.strftime('%Y%m%d')
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
end = end1.dt.strftime('%Y%m%d').where(df.iloc[:, -1].isna(), '99991231')
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 20190101 99991231
1 24.0 20180101 99991231
2 25.0 20180101 99991231
3 NaN 20180101 20181231
4 NaN 20180101 20181130
5 NaN 20180101 20190228
[6 rows x 22 columns]
It is also possible to create valid datetimes in both new columns by using Timestamp.max with Timestamp.floor:
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1)
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
end = end1.where(df.iloc[:, -1].isna(), pd.Timestamp.max.floor('d'))
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 2019-01-01 2262-04-11
1 24.0 2018-01-01 2262-04-11
2 25.0 2018-01-01 2262-04-11
3 NaN 2018-01-01 2018-12-31
4 NaN 2018-01-01 2018-11-30
5 NaN 2018-01-01 2019-02-28
[6 rows x 22 columns]
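As a tiny illustration of the idxmax step used in both variants above (hypothetical toy data, not from the question): on a boolean frame, idxmax(axis=1) returns the label of the first True column, which is why notna().idxmax(axis=1) yields the first non-missing month per row (and, after reversing the columns with ::-1, the last one):
import pandas as pd
toy = pd.DataFrame({'201801': [None, 1.0], '201802': [2.0, 3.0]}, index=['x', 'y'])
print(toy.notna().idxmax(axis=1))
# x    201802
# y    201801
# dtype: object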

Update / merge and update a subset of columns in pandas

I have df1:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2:
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
I am trying to update df2 only on a choice of columns, choice = ['ColA','ColB'], where ID1 and ID2 both match in the two dfs.
Expected output:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
So far I have tried:
u = df1.set_index(['ID1','ID2'])
u = u.loc[u.index.dropna()]
v = df2.set_index(['ID1','ID2'])
v= v.loc[v.index.dropna()]
v.update(u)
v.reset_index()
This gives me the correct update (but I lose the IDs which are NaN), and the update also takes place on ColC, which I don't want:
ID1 ID2 ColA ColB ColC
0 45.0 23.0 a 1.0 xyz
1 56.0 24.0 b 2.0 abc
2 23.0 45.0 NaN 0.0 e
3 56.0 29.0 NaN 0.0 NaN
I have also tried merge and combine_first, but can't figure out the best approach to do this based on the choice list.
Use merge with right join and then combine_first:
choice= ['ColA','ColB']
joined = ['ID1','ID2']
c = choice + joined
df3 = df1[c].merge(df2[c], on=joined, suffixes=('','_'), how='right')[c]
print (df3)
ColA ColB ID1 ID2
0 a 1.0 45.0 23.0
1 b 2.0 56.0 24.0
2 NaN NaN NaN 25.0
3 NaN NaN NaN 26.0
4 NaN NaN 23.0 45.0
5 NaN NaN 45.0 NaN
6 NaN NaN 56.0 29.0
df2[c] = df3.combine_first(df2[c])
print (df2)
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
Here's a way:
df1
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
df3 = df1.merge(df2, on=['ID1','ID2'], left_index=True)[['ColA_x','ColB_x']]
df2.loc[df3.index, 'ColA'] = df3['ColA_x']
df2.loc[df3.index, 'ColB'] = df3['ColB_x']
output
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
There still seems to be the issue in pandas 0.24 where NaN merges with NaN when they are keys. Prevent this by dropping those records before merging. I'm assuming ['ID1', 'ID2'] is a unique key for df1 (for rows where both are not null):
keys = ['ID1', 'ID2']
updates = ['ColA', 'ColB']
df3 = df2.merge(df1[updates+keys].dropna(subset=keys), on=keys, how='left')
Then resolve the information: take the value from df1 if it's not null, else take the value from df2. In recent versions of pandas the merge output should be ordered so that for duplicated columns the _x column appears to the left of the _y column. If not, sort the column index:
#df3 = df3.sort_index(axis=1) # If not sorted _x left of _y
df3.groupby([x[0] for x in df3.columns.str.split('_')], axis=1).apply(lambda x: x.ffill(1).iloc[:, -1])
ColA ColB ColC ID1 ID2
0 a 1.0 NaN 45.0 23.0
1 b 2.0 NaN 56.0 24.0
2 NaN 0.0 fd NaN 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 e 23.0 45.0
5 NaN 0.0 r 45.0 NaN
6 NaN 0.0 NaN 56.0 29.0
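Yet another way to express the same restriction to the choice columns, combining the left join used in the answer above with a per-column combine_first (a sketch; keys and choice named as in the question):
keys = ['ID1', 'ID2']
choice = ['ColA', 'ColB']
# left-join df1's choice columns onto df2 by the keys, dropping df1 rows with missing keys
merged = df2.merge(df1.dropna(subset=keys)[keys + choice],
                   on=keys, how='left', suffixes=('', '_new'))
# take the value from df1 where available, otherwise keep df2's value; ColC is never touched
for col in choice:
    df2[col] = merged[col + '_new'].combine_first(merged[col])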

Pandas Sort Multiindex by Group Sum

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'County': ['A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D','A','B'],
                   'Hospital': ['a','b','c','d','e','a','b','c','e','a','b','c','d','e','a','b','c','e'],
                   'Enrollment': [44,55,42,57,95,54,27,55,81,54,65,23,89,76,34,12,1,67],
                   'Year': ['2012','2012','2012','2012','2012','2012','2012','2012','2012','2013',
                            '2013','2013','2013','2013','2013','2013','2013','2013']})
d2 = pd.pivot_table(df, index=['County','Hospital'], columns=['Year'])  #.sort_columns
d2
Enrollment
Year 2012 2013
County Hospital
A a 44.0 NaN
c NaN 1.0
d NaN 89.0
e 88.0 NaN
B a 54.0 54.0
b 55.0 NaN
e NaN 71.5
C a NaN 34.0
b 27.0 65.0
c 42.0 NaN
D b NaN 12.0
c 55.0 23.0
d 57.0 NaN
I need to sort the data frame such that County is sorted in descending order by the sum of Enrollment for the most recent year (I want to avoid using '2013' directly), like this:
Enrollment
Year 2012 2013
County Hospital
B a 54 54
b 55 NaN
e NaN 71.5
C a NaN 34
b 27 65
c 42 NaN
A a 44 NaN
c NaN 1
d NaN 89
e 88 NaN
D b NaN 12
c 55 23
d 57 NaN
Then, I'd like the hospitals sorted within each county, in descending order by 2013 enrollment, like this:
Enrollment
Year 2012 2013
County Hospital
B e NaN 71.5
a 54 54
b 55 NaN
C b 27 65
a NaN 34
c 42 NaN
A d NaN 89
c NaN 1
a 44 NaN
e 88 NaN
D c 55 23
b NaN 12
d 57 NaN
So far, I've tried using groupby to get the sums and merge them back, but have not had any luck:
d2.groupby('County').sum()
Thanks in advance!
You could:
max_col = max(d2.columns.get_level_values(1)) # get column 2013
d2['sum'] = d2.groupby(level='County').transform('sum').loc[:, ('Enrollment', max_col)]
d2 = d2.sort_values(['sum', ('Enrollment', max_col)], ascending=[False, False])
to get:
Enrollment sum
Year 2012 2013
County Hospital
B e NaN 71.5 125.5
a 54.0 54.0 125.5
b 55.0 NaN 125.5
C b 27.0 65.0 99.0
a NaN 34.0 99.0
c 42.0 NaN 99.0
A d NaN 89.0 90.0
c NaN 1.0 90.0
a 44.0 NaN 90.0
e 88.0 NaN 90.0
D c 55.0 23.0 35.0
b NaN 12.0 35.0
d 57.0 NaN 35.0
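If the helper column should not remain in the result, selecting the Enrollment level again after sorting drops it (a small follow-up, not part of the original answer):
# keep only the original Enrollment columns, dropping the helper 'sum'
d2 = d2[['Enrollment']]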
