multi-index and multi-column grouping - python

I have a dataframe with a 3-level index and 2-level columns.
Group
Label A B C D
number start end
1 2020-01-01 2020-12-31 -43.0 0 105.0 -37.0
2020-12-15 2020-12-15 NaN NaN NaN 195.0
2 2019-01-01 2019-12-31 -35.0 80.0 -14.0 NaN
2019-12-17 2019-12-17 NaN NaN NaN 141.0
2020-01-01 2020-12-31 -15.0 45.0 -7.0 NaN
3 2020-12-17 2020-12-17 NaN NaN NaN 326.0
2022-01-01 2022-12-31 NaN 50.0 NaN NaN
2023-12-31 2023-12-31 -25.0 NaN NaN NaN
2023-01-01 2023-12-31 NaN 50.0 NaN NaN
2020-12-15 2020-12-15 NaN NaN NaN 61.0
.............
I would like to group by number and start (only the year), summing values per Label:
Group
Label A B C D
number start end
1 2020 2020 -43.0 0 105.0 232.0
2 2019 2019 -35.0 80.0 -14.0 141
2020 2020 -15.0 45.0 -7.0 NaN
3 2020 2020 NaN NaN NaN 387.0
2022 2022 NaN 50.0 NaN NaN
2023 2023 -25.0 50.0 NaN NaN
.............
Please note that there is a higher-level column as well (called Group; there are other higher-level columns that I am not including, to keep it simple) and other sub-columns (Label: A, B, C, D, repeated for each higher-level column).
How can I do this?
Thank you in advance.

You can reference the MultiIndex levels by name and use DatetimeIndex.year to keep just the year of the levels you care about. Passing min_count=1 to sum makes groups whose values are all missing come out as NaN instead of 0.
df.groupby(['number',
            df.index.get_level_values('start').year,
            df.index.get_level_values('end').year]).sum(min_count=1)
A B C D
number start end
1 2020 2020 -43.0 0.0 105.0 158.0
2 2019 2019 -35.0 80.0 -14.0 141.0
2020 2020 -15.0 45.0 -7.0 NaN
3 2020 2020 NaN NaN NaN 387.0
2022 2022 NaN 50.0 NaN NaN
2023 2023 -25.0 50.0 NaN NaN
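For a self-contained way to try this with two-level columns like in the question, here is a minimal sketch (the index/column values are made up to mirror the sample; the groupby call itself is the same as above and is unaffected by the column MultiIndex):
import numpy as np
import pandas as pd

# Hypothetical reconstruction of a small slice of the frame: a three-level
# row index (number, start, end) and two-level columns (Group, Label).
idx = pd.MultiIndex.from_tuples(
    [(1, pd.Timestamp('2020-01-01'), pd.Timestamp('2020-12-31')),
     (1, pd.Timestamp('2020-12-15'), pd.Timestamp('2020-12-15'))],
    names=['number', 'start', 'end'])
cols = pd.MultiIndex.from_product([['Group'], list('ABCD')], names=[None, 'Label'])
df = pd.DataFrame([[-43.0, 0.0, 105.0, -37.0],
                   [np.nan, np.nan, np.nan, 195.0]],
                  index=idx, columns=cols)

# Same grouping as above; sums are computed per (higher level, Label) column.
out = df.groupby(['number',
                  df.index.get_level_values('start').year,
                  df.index.get_level_values('end').year]).sum(min_count=1)
print(out)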

Related

Pandas: Filling NaN values in dataframe with monthly mean

The dataframe I am working with is as follows:
date AA1 AB2 AC3 AD4
0 1996-01-01 00:00:00 NaN NaN NaN NaN
1 1996-01-01 01:00:00 NaN 19.2 NaN NaN
2 1996-01-01 02:00:00 NaN 16.4 NaN NaN
3 1996-01-01 03:00:00 NaN 23.5 NaN NaN
4 1996-01-01 04:00:00 20.4 NaN NaN NaN
... ... ... ... ... ...
219164 2020-12-31 20:00:00 13.4 NaN 23.0 26.6
219165 2020-12-31 21:00:00 14.2 NaN 19.6 28.3
219166 2020-12-31 22:00:00 13.5 NaN 17.9 20.5
219167 2020-12-31 23:00:00 NaN NaN 16.7 20.7
219168 2021-01-01 00:00:00 NaN NaN NaN NaN
These are hourly data readings taken from different sensors from the year 1996 to 2021.
My goal is to be able to fill the NaN values with the monthly mean for each of the columns based on the date.
I have tried grouping the data and getting the monthly means for the group, though I am not sure where to go from here to transfer the grouped means to the original, larger dataframe, filling in some of the NaN values.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
tem = df.groupby(['year', 'month']).mean().reset_index()
The resulting dataframe looks like this, with fewer rows because of the grouping:
year month AA1 AB2 AC3 AD4
0 1996 1 20.1 18.3 NaN NaN
1 1996 2 NaN NaN NaN NaN
2 1996 3 NaN NaN NaN NaN
3 1996 4 NaN NaN NaN NaN
4 1996 5 NaN NaN NaN NaN
... ... ... ... ... ... ...
296 2020 9 NaN NaN 15.7 20.2
297 2020 10 NaN NaN 15.3 19.7
298 2020 11 NaN NaN 26.7 25.9
299 2020 12 NaN NaN 24.6 25.3
300 2021 1 NaN NaN NaN NaN
Any advice on how I can implement this would be helpful. In the end, I need the original dataset indices, dates and columns, but with the NaN values filled with the means calculated from the monthly groups. The months with all NaN values can be ignored for the time being.
Assuming your date column is of type datetime64 or equivalent:
df['AA1'] = df['AA1'].fillna(df.groupby(df.date.dt.month)['AA1'].transform('mean'))
Or looping over all your columns (except the date column):
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby(df.date.dt.month)[col].transform('mean'))
If you only want the mean of that month in that specific year, add df.date.dt.year to the groupby keys:
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby([df.date.dt.year, df.date.dt.month])[col].transform('mean'))
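If the loop feels clunky, a loop-free variant does the same thing for all sensor columns at once (a sketch, again assuming date is datetime64 and that df holds only the date column plus the sensor columns):
# Group means aligned back to the original rows via transform; fillna then
# fills only the missing cells, column by column.
value_cols = df.columns.drop('date')
monthly_means = (df.groupby([df.date.dt.year, df.date.dt.month])[value_cols]
                   .transform('mean'))
df[value_cols] = df[value_cols].fillna(monthly_means)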

How to return a dataframe with the last non-NaN values in each column for each month?

I have a dataframe in the following format:
A B C D
2020-11-18 64.0 74.0 34.0 57.0
2020-11-20 NaN 71.0 NaN 58.0
2020-11-23 NaN 11.0 NaN NaN
2020-11-25 69.0 NaN NaN 0.0
2020-11-27 NaN 37.0 19.0 NaN
2020-11-29 63.0 NaN NaN 85.0
2020-12-03 NaN 73.0 NaN 49.0
2020-12-10 NaN NaN 32.0 NaN
2020-12-22 52.0 90.0 33.0 24.0
2020-12-23 NaN 96.0 NaN NaN
2020-12-28 78.0 NaN NaN 68.0
2020-12-29 17.0 70.0 NaN 16.0
2021-01-03 51.0 43.0 NaN 66.0
I want to obtain a new dataframe that contains the last non-NaN values for each month in each column:
A B C D
2020-11 63.0 37.0 19.0 85.0
2020-12 17.0 70.0 33.0 16.0
I tried grouping by month and applying a lambda that returns the in-group maximum index like so:
df.loc[df.groupby(df.index.to_period('M')).apply(lambda x: x.index.max())]
which yields:
A B C D
2020-11-29 63.0 NaN NaN 85.0
2020-12-29 17.0 70.0 NaN 16.0
This returns the values that appear on the last day in each month but not the last non-NaN value. In case the value for the last day in a particular month is a NaN, I will have a NaN appearing here. Instead, I'd only like to have NaN values present if there are absolutely no values for that particular month in that column.
Use GroupBy.last, which returns the last non-NaN value in each column for every group:
df = df.groupby(df.index.to_period('M')).last()
print (df)
A B C D
2020-11 63.0 37.0 19.0 85.0
2020-12 17.0 70.0 33.0 16.0
2021-01 51.0 43.0 NaN 66.0
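Two small follow-ups that may be useful (a sketch; it assumes the dates are the index, as in the question): parse the index first if it holds strings, and convert the resulting PeriodIndex back to timestamps if you prefer plain dates:
# Parse string dates if needed (a no-op if the index is already datetime64).
df.index = pd.to_datetime(df.index)

out = df.groupby(df.index.to_period('M')).last()

# The result is indexed by Period ('2020-11', '2020-12', ...); convert back
# to timestamps (first day of each month) if a DatetimeIndex is preferred.
out.index = out.index.to_timestamp()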

python pandas stop fillna at last non NaN value

I have a dataframe where the index is date increasing and the columns are observations of variables. The array is sparse.
My goal is to propagate a known value forward in time to fill NaNs, but I want to stop at the last non-NaN value, as that last value signifies the "death" of the variable.
e.g. for the dataset
            a    b    c
2020-01-01  NaN   11  NaN
2020-02-01    1  NaN  NaN
2020-03-01  NaN  NaN   14
2020-04-01    2  NaN  NaN
2020-05-01  NaN  NaN  NaN
2020-06-01  NaN  NaN   15
2020-07-01    3  NaN  NaN
2020-08-01  NaN  NaN  NaN
I want to output
            a    b    c
2020-01-01  NaN   11  NaN
2020-02-01    1  NaN  NaN
2020-03-01    1  NaN   14
2020-04-01    2  NaN   14
2020-05-01    2  NaN   14
2020-06-01    2  NaN   15
2020-07-01    3  NaN  NaN
2020-08-01  NaN  NaN  NaN
I can identify the index of the last observation using df.notna()[::-1].idxmax() but can't figure out how to use this as a way to limit the fillna function
I'd be grateful for any suggestions. Many thanks
Use DataFrame.where to forward fill through a mask: a cell that is still missing after a back fill lies after the column's last observation, so keep those cells as NaN and forward fill everywhere else:
df = df.where(df.bfill().isna(), df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
Your solution can be used too if you compare the Series (converted to a NumPy array) against the index with broadcasting:
mask = df.notna()[::-1].idxmax().to_numpy() < df.index.to_numpy()[:, None]
df = df.where(mask, df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
You can use Series.last_valid_index, which is designed exactly for this (it returns the index of the last non-NA/null value), to forward fill only up to that point; anything after a column's last observation is left untouched:
Assuming your dataset is called df:
df.apply(lambda x: x.loc[:x.last_valid_index()].ffill())
index a b c
0 2020-01-01 NaN 11.00 NaN
1 2020-02-01 1.00 NaN NaN
2 2020-03-01 1.00 NaN 14.00
3 2020-04-01 2.00 NaN 14.00
4 2020-05-01 2.00 NaN 14.00
5 2020-06-01 2.00 NaN 15.00
6 2020-07-01 3.00 NaN NaN
7 2020-08-01 NaN NaN NaN
More on this in the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.last_valid_index.html

Concat multiindex pandas into single one

I have 3 pandas dataframes, each with a MultiIndex produced by groupby(['location','date']):
print(a)
hosp
location date
976 2020-10-02 9
2020-10-03 10
2020-10-04 10
print(b)
incid_hosp
location date
976 2020-10-02 1
2020-10-03 1
2020-10-04 0
print(c)
P T
location date
978 2020-10-02 5 60
2020-10-02 4 52
2020-10-03 4 2
I want to concat them to get:
print(result)
hosp incid_hosp P T
location date
976 2020-10-02 9 1 NaN NaN
2020-10-03 10 1 NaN NaN
2020-10-04 10 0 NaN NaN
978 2020-10-02 NaN NaN 5 60
2020-10-03 NaN NaN 4 52
2020-10-04 NaN NaN 4 2
I have tried
result = pd.concat([a,b,c], axis=1, sort=False)
But it produces too many NaN values ...
Try this using combine_first and reduce:
from functools import reduce
reduce(lambda x, y: x.combine_first(y), [a,b,c])
Output:
P T hosp incid_hosp
location date
976 2020-10-02 NaN NaN 9.0 1.0
2020-10-03 NaN NaN 10.0 1.0
2020-10-04 NaN NaN 10.0 0.0
978 2020-10-02 5.0 60.0 NaN NaN
2020-10-02 4.0 52.0 NaN NaN
2020-10-03 4.0 2.0 NaN NaN
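Note that combine_first sorts the columns alphabetically; if you want them back in the order shown in your expected result, you can reorder afterwards (a small follow-up sketch using the same a, b, c):
from functools import reduce

result = reduce(lambda x, y: x.combine_first(y), [a, b, c])
# restore the column order from the desired output
result = result[['hosp', 'incid_hosp', 'P', 'T']]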
For three dataframes, you can chain join calls:
a.join(b,how='outer').join(c, how='outer')
Output:
hosp incid_hosp P T
location date
976 2020-10-02 9.0 1.0 NaN NaN
2020-10-03 10.0 1.0 NaN NaN
2020-10-04 10.0 0.0 NaN NaN
978 2020-10-02 NaN NaN 5.0 60.0
2020-10-02 NaN NaN 4.0 52.0
2020-10-03 NaN NaN 4.0 2.0
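The chained join also generalizes to any number of frames with functools.reduce (a sketch equivalent to the chain above):
from functools import reduce

result = reduce(lambda x, y: x.join(y, how='outer'), [a, b, c])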

Merging columns and removing duplicates with Pandas

I need to merge similar columns and remove duplicates (entries with the same date). The data frame:
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count test_date
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12.35 2016-04-17 23:00:00
1 NaN NaN NaN NaN 133.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
4 NaN 32.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
5 36.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
6 NaN NaN NaN 99.7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25.0 NaN NaN NaN NaN 2016-04-17 23:00:00
12 36.0 NaN 32.2 99.7 NaN 133.0 NaN NaN NaN 406.0 NaN 25.0 NaN 12.35 NaN 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN 2016-04-25 23:00:00
79 34.0 NaN 5.4 55.9 NaN 133.0 NaN NaN NaN 372.0 NaN 28.0 NaN 7.99 NaN 2016-06-12 23:00:00
I need to get:
Albumin CRP Ferritin Hb Nancy Index Plasma Platelets Transferrin saturations UCEIS (0 to 8) WCC test_date
12 36.0 32.2 99.7 133.0 NaN NaN 406.0 25.0 NaN 12.35 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN 2016-04-25 23:00:00
79 34.0 5.4 55.9 133.0 NaN NaN 372.0 28.0 NaN 7.99 2016-06-12 23:00:00
So, column 'C-reactive protein' should be merged with 'CRP', 'Haemoglobin' with 'Hb', and 'Transferrin saturation %' with 'Transferrin saturations'.
I can easily remove duplicates with .drop_duplicates(), but the trick is to remove not only rows with the same date, but also to make sure that the values in the corresponding columns are duplicates. For example, 'C-reactive protein' at row 4 has the same value as 'CRP' in row 12, and they share the same entry date. Given all that, I need to keep only a 'CRP' column with the value 32.2 and the date '2016-04-17' (plus the other unique columns).
EDIT
Some entries are really duplicates (absolutely identical, due to system glitches), for example (last three rows, on 2016-06-20, indices '803' and '122'). Is the solution below capable of removing such identical rows?
P.S. Thanks for the amazing and general solution for duplicate, but not identical entries.
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count setName test_date
735 39.0 NaN 0.4 52.0 NaN 144.0 NaN NaN NaN 197.0 NaN 25.0 NaN 4.88 NaN Bloods 2016-05-31 23:00:00
803 40.0 NaN 0.2 81.0 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
347 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN Research Bloods 2016-06-20 23:00:00
122 40.0 NaN 0.2 81.9 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
I think you need to group by test_date, rename the columns with a dict, and then take the maximum across columns that share a name:
d = {'C-reactive protein':'CRP', 'Haemoglobin':'Hb',
     'Transferrin saturation %':'Transferrin saturations'}
df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
print (df)
Albumin CRP Ferritin Hb Iron Nancy Index \
test_date
2016-04-17 23:00:00 36.0 32.2 99.7 133.0 NaN NaN
2016-04-25 23:00:00 NaN NaN NaN NaN NaN NaN
2016-06-12 23:00:00 34.0 5.4 55.9 133.0 NaN NaN
Plasma Platelets Transferrin saturations \
test_date
2016-04-17 23:00:00 NaN 406.0 25.0
2016-04-25 23:00:00 NaN NaN NaN
2016-06-12 23:00:00 NaN 372.0 28.0
UCEIS (0 to 8) WCC White Cell Count
test_date
2016-04-17 23:00:00 NaN 12.35 12.35
2016-04-25 23:00:00 7.0 NaN NaN
2016-06-12 23:00:00 NaN 7.99 NaN
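As a side note, grouping with axis=1 is deprecated in recent pandas versions; an equivalent spelling (a sketch using the same d as above) transposes, groups on the row labels, and transposes back:
# axis=1 groupby is deprecated in newer pandas; transposing achieves the same.
df = df.groupby('test_date').max().rename(columns=d)
df = df.T.groupby(level=0).max().T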
A more general solution is to reshape with melt, remove duplicates, and then rebuild the DataFrame:
d = {'C-reactive protein':'CRP', 'Haemoglobin':'Hb',
     'Transferrin saturation %':'Transferrin saturations'}
df = df.rename(columns=d).groupby(axis=1, level=0).max()
df = pd.melt(df, id_vars='test_date').dropna(subset=['value']).drop_duplicates()
df = df.groupby(['test_date','variable'])['value'] \
       .apply(lambda x: pd.Series(x.values)) \
       .unstack(1) \
       .reset_index(level=1, drop=True) \
       .reset_index() \
       .rename_axis(None, axis=1)
print (df)
test_date Albumin CRP Ferritin Hb Platelets \
0 2016-04-17 23:00:00 1000.0 32.2 99.7 1000.0 406.0
1 2016-04-17 23:00:00 36.0 NaN NaN 133.0 NaN
2 2016-04-25 23:00:00 NaN NaN NaN NaN NaN
3 2016-06-12 23:00:00 34.0 5.4 55.9 133.0 372.0
Transferrin saturations UCEIS (0 to 8) WCC White Cell Count
0 25.0 NaN 12.35 12.35
1 NaN NaN NaN NaN
2 NaN 7.0 NaN NaN
3 28.0 NaN 7.99 NaN
What #jezrael was saying is that if you had a situation where:
Albumin C-reactive protein CRP test_date
0 NaN NaN 32 2016-04-17 23:00:00
1 NaN 8.0 NaN 2016-04-17 23:00:00
then his method would erase the 8.0 reading and keep only the 32 (this is because he works in two or three steps in this line): df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
df = df.groupby('test_date').max() # selects max of each column
# while collapsing 'test_date'
which for my truncated example would give:
Albumin C-reactive protein CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then rename .rename(columns=d) giving:
Albumin CRP CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then .groupby(axis=1, level=0).max() to group along rows (instead of down columns) which gives:
Albumin CRP test_date
0 NaN 32 2016-04-17 23:00:00
which is where you run the highest risk of losing data.
Alternative
I would split the original data into two frames first
df1 = df[["C-reactive protein","Haemoglobin", ...]]
df2 = df[["CRP", "Hb"]]
# then rename
df2 = df2.rename(columns={"CRP":"C-reactive protein", "Hb":"Haemoglobin", ...})
# use concat to stack them on one another
df3 = pd.concat([df1, df2]) # i've run out of names
df3 = df3.drop_duplicates() # perhaps also drop NAs?
but this is only necessary if you have multiple non-duplicate entries for the same test on the same day.
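A fuller sketch of that split/rename/stack idea, with a final collapse to one row per date (the column lists here are illustrative, not the complete set from the question):
# Hypothetical column lists: each 'long' spelling maps onto its 'short' twin.
long_cols = ['C-reactive protein', 'Haemoglobin', 'Transferrin saturation %']
short_cols = ['CRP', 'Hb', 'Transferrin saturations']
other_cols = [c for c in df.columns if c not in long_cols + short_cols]

df1 = df[other_cols + short_cols]
df2 = df[other_cols + long_cols].rename(columns=dict(zip(long_cols, short_cols)))

stacked = pd.concat([df1, df2]).drop_duplicates()
# one row per test_date, keeping the first non-NaN reading seen per column
result = stacked.groupby('test_date', as_index=False).first()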
