Pandas Sort Multiindex by Group Sum - python

Given the following data frame:
import pandas as pd

df = pd.DataFrame({'County': ['A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D','A','B'],
                   'Hospital': ['a','b','c','d','e','a','b','c','e','a','b','c','d','e','a','b','c','e'],
                   'Enrollment': [44,55,42,57,95,54,27,55,81,54,65,23,89,76,34,12,1,67],
                   'Year': ['2012','2012','2012','2012','2012','2012','2012','2012','2012','2013',
                            '2013','2013','2013','2013','2013','2013','2013','2013']})
d2 = pd.pivot_table(df, index=['County','Hospital'], columns=['Year'])  # .sort_columns
d2
                Enrollment
Year                  2012  2013
County Hospital
A      a              44.0   NaN
       c               NaN   1.0
       d               NaN  89.0
       e              88.0   NaN
B      a              54.0  54.0
       b              55.0   NaN
       e               NaN  71.5
C      a               NaN  34.0
       b              27.0  65.0
       c              42.0   NaN
D      b               NaN  12.0
       c              55.0  23.0
       d              57.0   NaN
I need to sort the data frame so that County is sorted in descending order by the sum of Enrollment for the most recent year (I want to avoid hardcoding '2013'), like this:
                Enrollment
Year                  2012  2013
County Hospital
B      a              54.0  54.0
       b              55.0   NaN
       e               NaN  71.5
C      a               NaN  34.0
       b              27.0  65.0
       c              42.0   NaN
A      a              44.0   NaN
       c               NaN   1.0
       d               NaN  89.0
       e              88.0   NaN
D      b               NaN  12.0
       c              55.0  23.0
       d              57.0   NaN
Then, I'd like the hospitals within each county sorted in descending order by 2013 enrollment, like this:
                Enrollment
Year                  2012  2013
County Hospital
B      e               NaN  71.5
       a              54.0  54.0
       b              55.0   NaN
C      b              27.0  65.0
       a               NaN  34.0
       c              42.0   NaN
A      d               NaN  89.0
       c               NaN   1.0
       a              44.0   NaN
       e              88.0   NaN
D      c              55.0  23.0
       b               NaN  12.0
       d              57.0   NaN
So far, I've tried using groupby to get the sums and merge them back, but have not had any luck:
d2.groupby('County').sum()
Thanks in advance!

You could:
max_col = max(d2.columns.get_level_values(1))  # the most recent year, here '2013', without hardcoding it
# per-County sum of the most recent year's enrollment, broadcast back onto every row
d2['sum'] = d2.groupby(level='County').transform('sum').loc[:, ('Enrollment', max_col)]
# sort counties by that sum, then hospitals within each county by the most recent year
d2 = d2.sort_values(['sum', ('Enrollment', max_col)], ascending=[False, False])
to get:
                Enrollment          sum
Year                  2012  2013
County Hospital
B      e               NaN  71.5  125.5
       a              54.0  54.0  125.5
       b              55.0   NaN  125.5
C      b              27.0  65.0   99.0
       a               NaN  34.0   99.0
       c              42.0   NaN   99.0
A      d               NaN  89.0   90.0
       c               NaN   1.0   90.0
       a              44.0   NaN   90.0
       e              88.0   NaN   90.0
D      c              55.0  23.0   35.0
       b               NaN  12.0   35.0
       d              57.0   NaN   35.0
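If the helper column shouldn't stay in the result, it can be dropped once the sort is done. A small cleanup sketch (my addition, assuming the ('sum', '') column created above):
d2 = d2.drop(columns='sum', level=0)  # remove the helper, keeping only Enrollment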


Iterative ffill with median values in a dataframe

Appreciate any help on this.
Let's say I have the following df with two columns:
col1 col2
NaN NaN
11 100
12 110
15 115
NaN NaN
NaN NaN
NaN NaN
9 142
12 144
NaN NaN
NaN NaN
NaN NaN
6 155
9 156
7 161
NaN NaN
NaN NaN
I'd like to forward fill and replace the NaN values with the median of the preceding non-NaN values. For example, the median of 11, 12, 15 in 'col1' is 12, so I need the NaN values to be filled with 12 until I get to the next non-NaN values in the column, and so on down the column. See the expected df below:
col1 col2
NaN NaN
11 100
12 110
15 115
12 110
12 110
12 110
9 142
12 144
10.5 143
10.5 143
10.5 143
6 155
9 156
7 161
7 156
7 156
Try:
# m1/m2 give each consecutive run of NaN / non-NaN values its own group id
m1 = (df.col1.isna() != df.col1.isna().shift(1)).cumsum()
m2 = (df.col2.isna() != df.col2.isna().shift(1)).cumsum()
# one median per run; ffill carries each run's median into the NaN run that follows
df["col1"] = df["col1"].fillna(
    df.groupby(m1)["col1"].transform("median").ffill()
)
df["col2"] = df["col2"].fillna(
    df.groupby(m2)["col2"].transform("median").ffill()
)
print(df)
Prints:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0
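To see why this works (my illustration, not part of the original answer): m1 numbers each consecutive run of NaN / non-NaN values, so the transform computes one median per run and the ffill pushes each non-NaN run's median into the NaN run right after it. For the df above:
print(m1.tolist())
# [1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7]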
IIUC, if we fill null values like so:
1. Fill with the median of the last 3 non-null items.
2. Fill with the median of the last 2 non-null items.
3. Forward fill values.
We'll get what you're looking for.
out = (df.combine_first(df.rolling(4, 3).median())   # window of 4, needs 3 non-null values
         .combine_first(df.rolling(3, 2).median())   # window of 3, needs 2 non-null values
         .ffill())                                   # carry the fills through longer gaps
print(out)
Output:
col1 col2
0 NaN NaN
1 11.0 100.0
2 12.0 110.0
3 15.0 115.0
4 12.0 110.0
5 12.0 110.0
6 12.0 110.0
7 9.0 142.0
8 12.0 144.0
9 10.5 143.0
10 10.5 143.0
11 10.5 143.0
12 6.0 155.0
13 9.0 156.0
14 7.0 161.0
15 7.0 156.0
16 7.0 156.0

Shift only selected rows in Pandas

I would like to shift only specific rows in my DataFrame by 1 period on the columns axis.
Df
Out:
Month Year_2005 Year_2006 Year_2007
0 01 NaN 31 35
1 02 NaN 40 45
2 03 NaN 87 46
3 04 NaN 55 41
4 05 NaN 36 28
5 06 31 21 NaN
6 07 29 27 NaN
To have something like this:
Df
Out:
Month Year_2005 Year_2006 Year_2007
0 01 NaN 31 35
1 02 NaN 40 45
2 03 NaN 87 46
3 04 NaN 55 41
4 05 NaN 36 28
5 06 NaN 31 21
6 07 NaN 29 27
My code so far:
rows_to_shift = Df[Df['Year_2005'].notnull()].index
Df.iloc[rows_to_shift, 1] = Df.iloc[rows_to_shift,2].shift(1)
Try:
df = df.set_index("Month")
df[df["Year_2005"].notnull()] = df[df["Year_2005"].notnull()].shift(axis=1)
>>> df
Year_2005 Year_2006 Year_2007
Month
1 NaN 31.0 35.0
2 NaN 40.0 45.0
3 NaN 87.0 46.0
4 NaN 55.0 41.0
5 NaN 36.0 28.0
6 NaN 31.0 21.0
7 NaN 29.0 27.0
You can try:
df1 = df.set_index('Month')
# push the NaNs in each row to the front; sorted() is stable, so value order is kept
df1 = df1.apply(lambda x: pd.Series(sorted(x, key=pd.notna), index=x.index), axis=1)
df = df1.reset_index()
Result:
Month Year_2005 Year_2006 Year_2007
0 1 NaN 31.0 35.0
1 2 NaN 40.0 45.0
2 3 NaN 87.0 46.0
3 4 NaN 55.0 41.0
4 5 NaN 36.0 28.0
5 6 NaN 31.0 21.0
6 7 NaN 29.0 27.0
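As a side note, the asker's own attempt was close: the shift just has to run along the columns rather than the rows. A sketch, assuming Df has its default RangeIndex as shown in the question (year_cols is a hypothetical helper name):
rows_to_shift = Df[Df['Year_2005'].notnull()].index
year_cols = Df.columns[1:]  # every column except Month
Df.loc[rows_to_shift, year_cols] = Df.loc[rows_to_shift, year_cols].shift(1, axis=1)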

Stretching the data frame by date

I have this data frame:
ID date X1 X2 Y
A 16-07-19 58 50 0
A 21-07-19 28 74 0
B 25-07-19 54 65 1
B 27-07-19 50 30 0
B 29-07-19 81 61 0
C 30-07-19 55 29 0
C 31-07-19 97 69 1
C 03-08-19 13 48 1
D 19-07-18 77 27 1
D 20-07-18 68 50 1
D 22-07-18 89 57 1
D 23-07-18 46 70 0
D 26-07-18 56 13 0
E 06-08-19 47 35 1
I want to "stretch" the data by date, from the first row, to the last row of each ID (groupby),
and to fill the missing values with NaN.
For example: ID A has two rows on 16-07-19, and 21-07-19.
After the implementation, (s)he should have 6 rows on 16-21 of July, 2019.
Expected result:
ID date X1 X2 Y
A 16-07-19 58.0 50.0 0.0
A 17-07-19 NaN NaN NaN
A 18-07-19 NaN NaN NaN
A 19-07-19 NaN NaN NaN
A 20-07-19 NaN NaN NaN
A 21-07-19 28.0 74.0 0.0
B 25-07-19 54.0 65.0 1.0
B 26-07-19 NaN NaN NaN
B 27-07-19 50.0 30.0 0.0
B 28-07-19 NaN NaN NaN
B 29-07-19 81.0 61.0 0.0
C 30-07-19 55.0 29.0 0.0
C 31-07-19 97.0 69.0 1.0
C 01-08-19 NaN NaN NaN
C 02-08-19 NaN NaN NaN
C 03-08-19 13.0 48.0 1.0
D 19-07-18 77.0 27.0 1.0
D 20-07-18 68.0 50.0 1.0
D 21-07-18 NaN NaN NaN
D 22-07-18 89.0 57.0 1.0
D 23-07-18 46.0 70.0 0.0
D 24-07-18 NaN NaN NaN
D 25-07-18 NaN NaN NaN
D 26-07-18 56.0 13.0 0.0
E 06-08-19 47.0 35.0 1.0
Use DataFrame.asfreq per group, working with a DatetimeIndex:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)   # dates are dd-mm-yy
cols = df.columns.difference(['date','ID'], sort=False)  # every column except the keys
df = df.set_index('date').groupby('ID')[cols].apply(lambda x: x.asfreq('d')).reset_index()
print (df)
ID date X1 X2 Y
0 A 2019-07-16 58.0 50.0 0.0
1 A 2019-07-17 NaN NaN NaN
2 A 2019-07-18 NaN NaN NaN
3 A 2019-07-19 NaN NaN NaN
4 A 2019-07-20 NaN NaN NaN
5 A 2019-07-21 28.0 74.0 0.0
6 B 2019-07-25 54.0 65.0 1.0
7 B 2019-07-26 NaN NaN NaN
8 B 2019-07-27 50.0 30.0 0.0
9 B 2019-07-28 NaN NaN NaN
10 B 2019-07-29 81.0 61.0 0.0
11 C 2019-07-30 55.0 29.0 0.0
12 C 2019-07-31 97.0 69.0 1.0
13 C 2019-08-01 NaN NaN NaN
14 C 2019-08-02 NaN NaN NaN
15 C 2019-08-03 13.0 48.0 1.0
16 D 2018-07-19 77.0 27.0 1.0
17 D 2018-07-20 68.0 50.0 1.0
18 D 2018-07-21 NaN NaN NaN
19 D 2018-07-22 89.0 57.0 1.0
20 D 2018-07-23 46.0 70.0 0.0
21 D 2018-07-24 NaN NaN NaN
22 D 2018-07-25 NaN NaN NaN
23 D 2018-07-26 56.0 13.0 0.0
24 E 2019-08-06 47.0 35.0 1.0
Another idea with DataFrame.reindex per group:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max()))
df = df.set_index('date').groupby('ID')[cols].apply(f).reset_index()
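One detail worth noting with the reindex variant (my observation, not from the original answer): pd.date_range builds an unnamed index, so reset_index brings the dates back as a column called level_1 rather than date. Renaming it restores the expected output:
df = df.rename(columns={'level_1': 'date'})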
Here is my sort jitsu:
def Sort_by_date(dataf):
    # rule 1: sort by date
    dataf['Current'] = pd.to_datetime(dataf.Current)
    dataf = dataf.sort_values(by=['Current'], ascending=True)
    # rule 2: keep only dates inside the window of interest
    Mask = (dataf['Current'] > '1/1/2020') & (dataf['Current'] <= '12/31/2022')
    dataf = dataf.loc[Mask]
    return dataf
You can adapt this code to sort by date for your case.
Next, let's sort by group:
Week1 = WeeklyDF.sort_values(['ID', 'date'])
Week1_Report = Week1[['ID', 'date', 'X1', 'X2', 'Y']]
Week1_Report
Lastly, let's replace the NaN values:
Week1_Report['X1'] = Week1_Report['X1'].fillna(0)
Week1_Report['X2'] = Week1_Report['X2'].fillna(0)
Week1_Report['Y'] = Week1_Report['Y'].fillna(0)

update / merge and update a subset of columns pandas

I have df1:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2:
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
I am trying to update df2 only on the columns in choice = ['ColA','ColB'], and only where ID1 and ID2 both match between the two dfs.
Expected output:
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
So far I have tried:
u = df1.set_index(['ID1','ID2'])
u = u.loc[u.index.dropna()]
v = df2.set_index(['ID1','ID2'])
v= v.loc[v.index.dropna()]
v.update(u)
v.reset_index()
This gives me the correct update (though I lose the IDs that are NaN), but the update also takes place on ColC, which I don't want:
ID1 ID2 ColA ColB ColC
0 45.0 23.0 a 1.0 xyz
1 56.0 24.0 b 2.0 abc
2 23.0 45.0 NaN 0.0 e
3 56.0 29.0 NaN 0.0 NaN
I have also tried merge and combine_first, but can't figure out the best approach to do this based on the choice list.
Use merge with right join and then combine_first:
choice= ['ColA','ColB']
joined = ['ID1','ID2']
c = choice + joined
df3 = df1[c].merge(df2[c], on=joined, suffixes=('','_'), how='right')[c]
print (df3)
ColA ColB ID1 ID2
0 a 1.0 45.0 23.0
1 b 2.0 56.0 24.0
2 NaN NaN NaN 25.0
3 NaN NaN NaN 26.0
4 NaN NaN 23.0 45.0
5 NaN NaN 45.0 NaN
6 NaN NaN 56.0 29.0
df2[c] = df3.combine_first(df2[c])
print (df2)
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
Here's a way:
df1
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 xyz 23.0
1 b 2.0 56.0 abc 24.0
2 c 3.0 34.0 qwerty 28.0
3 d 4.0 34.0 wer 33.0
4 e NaN NaN NaN NaN
df2
ColA ColB ID1 ColC ID2
0 i 0 45.0 NaN 23.0
1 j 0 56.0 NaN 24.0
2 NaN 0 NaN fd 25.0
3 NaN 0 NaN NaN 26.0
4 NaN 0 23.0 e 45.0
5 NaN 0 45.0 r NaN
6 NaN 0 56.0 NaN 29.0
# keep df2's row labels through the merge so we can assign back by index
df3 = df2.reset_index().merge(df1, on=['ID1','ID2'], suffixes=('','_x')).set_index('index')[['ColA_x','ColB_x']]
df2.loc[df3.index, 'ColA'] = df3['ColA_x']
df2.loc[df3.index, 'ColB'] = df3['ColB_x']
output
ColA ColB ID1 ColC ID2
0 a 1.0 45.0 NaN 23.0
1 b 2.0 56.0 NaN 24.0
2 NaN 0.0 NaN fd 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 23.0 e 45.0
5 NaN 0.0 45.0 r NaN
6 NaN 0.0 56.0 NaN 29.0
There still seems to be the issue in pandas 0.24 where NaN merges with NaN when they are keys. Prevent this by dropping those records before merging. I'm assuming ['ID1', 'ID2'] is a unique key for df1 (for rows where both are not null):
keys = ['ID1', 'ID2']
updates = ['ColA', 'ColB']
df3 = df2.merge(df1[updates+keys].dropna(subset=keys), on=keys, how='left')
Then resolve the information: take the value from df1 if it's not null, else take the value from df2. In recent versions of pandas the merge output should be ordered so that, for duplicated columns, _x appears to the left of the _y column. If not, sort the columns first.
#df3 = df3.sort_index(axis=1)  # if _x is not left of _y
resolved = df3.groupby([x[0] for x in df3.columns.str.split('_')], axis=1).apply(lambda x: x.ffill(axis=1).iloc[:, -1])
resolved
ColA ColB ColC ID1 ID2
0 a 1.0 NaN 45.0 23.0
1 b 2.0 NaN 56.0 24.0
2 NaN 0.0 fd NaN 25.0
3 NaN 0.0 NaN NaN 26.0
4 NaN 0.0 e 23.0 45.0
5 NaN 0.0 r 45.0 NaN
6 NaN 0.0 NaN 56.0 29.0
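To write the resolved values back into df2's choice columns (a final assignment I'm assuming is wanted; it is not shown above):
df2[updates] = resolved[updates]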

An optimal way of getting a correct reshape of that dataframe

I've the following dataframe:
df =
c f V E
0 M 5 32 22
1 M 7 45 40
2 R 7 42 36
3 R 9 41 38
4 R 3 28 24
And I want a result like this, in which the values of column 'f' are my new columns, and my new indexes are a combination of column 'c' and the rest of the columns in the dataframe (the order of rows doesn't matter):
df_result =
3 5 7 9
V(M) NaN 32 45 NaN
E(M) NaN 22 40 NaN
V(R) 28 NaN 42 41
E(R) 24 NaN 36 38
Currently, my code is:
df_result = pd.concat(
    [df.pivot('c', 'f', col)
       .rename(index={e: col + '(' + e + ')' for e in df.pivot('c', 'f', col).index})
     for col in [e for e in df.columns if e not in ['c', 'f']]]
)
With that code I'm getting:
df_result =
f 3 5 7 9
c
E(M) NaN 22 40 NaN
E(R) 24 NaN 36 38
V(M) NaN 32 45 NaN
V(R) 28 NaN 42 41
I think it's a valid result; however, I don't know if there is a way to get exactly my desired result or, at least, a better way to get what I am already getting.
Thank you very much in advance.
To get the table, this is .melt + .pivot_table
df_result = df.melt(['f', 'c']).pivot_table(index=['variable', 'c'], columns='f')
Then we can clean up the naming:
df_result = df_result.rename_axis([None, None], axis=1)
df_result.columns = [y for _, y in df_result.columns]  # drop the 'value' level
df_result.index = [f'{x}({y})' for x, y in df_result.index]
# Python 2: ['{0}({1})'.format(*x) for x in df_result.index]
Output:
3 5 7 9
E(M) NaN 22.0 40.0 NaN
E(R) 24.0 NaN 36.0 38.0
V(M) NaN 32.0 45.0 NaN
V(R) 28.0 NaN 42.0 41.0
You might consider keeping the MultiIndex instead of flattening to new strings, as it can be simpler for certain aggregations.
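If the exact row order from the question matters (V before E within each value of c), one option is to reindex the flattened result. A sketch assuming df_result and df from above, with the value-column order ['V', 'E'] hardcoded:
order = [f'{v}({c})' for c in df['c'].unique() for v in ['V', 'E']]
df_result = df_result.reindex(order)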
Check with pivot_table
s=pd.pivot_table(df,index='c',columns='f',values=['V','E']).stack(level=0).sort_index(level=1)
s.index=s.index.map('{0[1]}({0[0]})'.format)
s
Out[95]:
f 3 5 7 9
E(M) NaN 22.0 40.0 NaN
E(R) 24.0 NaN 36.0 38.0
V(M) NaN 32.0 45.0 NaN
V(R) 28.0 NaN 42.0 41.0
