Concat multiindex pandas into single one

Concat multiindex pandas into single one - python

I have 3 pandas multiindex groupby(['location','date'])
print(a)
location date hosp
976 2020-10-02 9
2020-10-03 10
2020-10-04 10
print(b)
incid_hosp
location date
976 2020-10-02 1
2020-10-03 1
2020-10-04 0
print(c)
P T
location date
978 2020-10-02 5 60
2020-10-02 4 52
2020-10-03 4 2
I want to concat them to get:
print(result)
hosp incid_hosp P T
location date
976 2020-10-02 9 1 NaN NaN
2020-10-03 10 1 NaN NaN
2020-10-04 10 0 NaN NaN
978 2020-10-02 NaN NaN 5 60
2020-10-03 NaN NaN 4 52
2020-10-04 NaN NaN 4 2
I have tried
result = pd.concat([a,b,c], axis=1, sort=False)
But It produces to much NaN values ...

Try this using combine_first and reduce:
from functools import reduce
reduce(lambda x, y: x.combine_first(y), [a,b,c])
Output:
P T hosp incid_hosp
location date
976 2020-10-02 NaN NaN 9.0 1.0
2020-10-03 NaN NaN 10.0 1.0
2020-10-04 NaN NaN 10.0 0.0
978 2020-10-02 5.0 60.0 NaN NaN
2020-10-02 4.0 52.0 NaN NaN
2020-10-03 4.0 2.0 NaN NaN

For three dataframes, you can use chain join:
a.join(b,how='outer').join(c, how='outer')
Output:
hosp incid_hosp P T
location date
976 2020-10-02 9.0 1.0 NaN NaN
2020-10-03 10.0 1.0 NaN NaN
2020-10-04 10.0 0.0 NaN NaN
978 2020-10-02 NaN NaN 5.0 60.0
2020-10-02 NaN NaN 4.0 52.0
2020-10-03 NaN NaN 4.0 2.0

Related

python pandas stop fillna at last non NaN value

I have a dataframe where the index is date increasing and the columns are observations of variables. The array is sparse.
My goal is to propogate forward in time a known value to fill NaN but I want to stop at the last non-NaN value as that last value signifies the "death" of the variable.
e.g. for the dataset
a
b
c
2020-01-01
NaN
11
NaN
2020-02-01
1
NaN
NaN
2020-03-01
NaN
NaN
14
2020-04-01
2
NaN
NaN
2020-05-01
NaN
NaN
NaN
2020-06-01
NaN
NaN
15
2020-07-01
3
NaN
NaN
2020-08-01
NaN
NaN
NaN
I want to output
a
b
c
2020-01-01
NaN
11
NaN
2020-02-01
1
NaN
NaN
2020-03-01
1
NaN
14
2020-04-01
2
NaN
14
2020-05-01
2
NaN
14
2020-06-01
2
NaN
15
2020-07-01
3
NaN
NaN
2020-08-01
NaN
NaN
NaN
I can identify the index of the last observation using df.notna()[::-1].idxmax() but can't figure out how to use this as a way to limit the fillna function
I'd be grateful for any suggestions. Many thanks

Use DataFrame.where for forward filling by mask - testing only non missing values by back filling them:
df = df.where(df.bfill().isna(), df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
Your solution should be used too if compare Series converted to numpy array with broadcasting:
mask = df.notna()[::-1].idxmax().to_numpy() < df.index.to_numpy()[:, None]
df = df.where(mask, df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN

You can use Series.last_valid_index which is specifically designed for this (to return the index for last non-NA/null value) , to just ffill up to that point:
Assuming your dataset is called df:
df.apply(lambda x: x.loc[:x.last_valid_index()].ffill())
index a b c
0 2020-01-01 NaN 11.00 NaN
1 2020-02-01 1.00 NaN NaN
2 2020-03-01 1.00 NaN 14.00
3 2020-04-01 2.00 NaN 14.00
4 2020-05-01 2.00 NaN 14.00
5 2020-06-01 2.00 NaN 15.00
6 2020-07-01 3.00 NaN NaN
7 2020-08-01 NaN NaN NaN
More on this on:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.last_valid_index.html

Stretching the data frame by date

I have this data frame:
ID date X1 X2 Y
A 16-07-19 58 50 0
A 21-07-19 28 74 0
B 25-07-19 54 65 1
B 27-07-19 50 30 0
B 29-07-19 81 61 0
C 30-07-19 55 29 0
C 31-07-19 97 69 1
C 03-08-19 13 48 1
D 19-07-18 77 27 1
D 20-07-18 68 50 1
D 22-07-18 89 57 1
D 23-07-18 46 70 0
D 26-07-18 56 13 0
E 06-08-19 47 35 1
I want to "stretch" the data by date, from the first row, to the last row of each ID (groupby),
and to fill the missing values with NaN.
For example: ID A has two rows on 16-07-19, and 21-07-19.
After the implementation, (s)he should have 6 rows on 16-21 of July, 2019.
Expected result:
ID date X1 X2 Y
A 16-07-19 58.0 50.0 0.0
A 17-07-19 NaN NaN NaN
A 18-07-19 NaN NaN NaN
A 19-07-19 NaN NaN NaN
A 20-07-19 NaN NaN NaN
A 21-07-19 28.0 74.0 0.0
B 25-07-19 54.0 65.0 1.0
B 26-07-19 NaN NaN NaN
B 27-07-19 50.0 30.0 0.0
B 28-07-19 NaN NaN NaN
B 29-07-19 81.0 61.0 0.0
C 30-07-19 55.0 29.0 0.0
C 31-07-19 97.0 69.0 1.0
C 01-08-19 NaN NaN NaN
C 02-08-19 NaN NaN NaN
C 03-08-19 13.0 48.0 1.0
D 19-07-18 77.0 27.0 1.0
D 20-07-18 68.0 50.0 1.0
D 21-07-18 NaN NaN NaN
D 22-07-18 89.0 57.0 1.0
D 23-07-18 46.0 70.0 0.0
D 24-07-18 NaN NaN NaN
D 25-07-18 NaN NaN NaN
D 26-07-18 56.0 13.0 0.0
E 06-08-19 47.0 35.0 1.0

Use DataFrame.asfreq per groups working with DatetimeIndex:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
df = df.set_index('date').groupby('ID')[cols].apply(lambda x: x.asfreq('d')).reset_index()
print (df)
ID date X1 X2 Y
0 A 2019-07-16 58.0 50.0 0.0
1 A 2019-07-17 NaN NaN NaN
2 A 2019-07-18 NaN NaN NaN
3 A 2019-07-19 NaN NaN NaN
4 A 2019-07-20 NaN NaN NaN
5 A 2019-07-21 28.0 74.0 0.0
6 B 2019-07-25 54.0 65.0 1.0
7 B 2019-07-26 NaN NaN NaN
8 B 2019-07-27 50.0 30.0 0.0
9 B 2019-07-28 NaN NaN NaN
10 B 2019-07-29 81.0 61.0 0.0
11 C 2019-07-30 55.0 29.0 0.0
12 C 2019-07-31 97.0 69.0 1.0
13 C 2019-08-01 NaN NaN NaN
14 C 2019-08-02 NaN NaN NaN
15 C 2019-08-03 13.0 48.0 1.0
16 D 2018-07-19 77.0 27.0 1.0
17 D 2018-07-20 68.0 50.0 1.0
18 D 2018-07-21 NaN NaN NaN
19 D 2018-07-22 89.0 57.0 1.0
20 D 2018-07-23 46.0 70.0 0.0
21 D 2018-07-24 NaN NaN NaN
22 D 2018-07-25 NaN NaN NaN
23 D 2018-07-26 56.0 13.0 0.0
24 E 2019-08-06 47.0 35.0 1.0
Another idea with DataFrame.reindex per groups:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
cols = df.columns.difference(['date','ID'], sort=False)
f = lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max()))
df = df.set_index('date').groupby('ID')[cols].apply(f).reset_index()

Here is my sort jitsu:
def Sort_by_date(dataf):
# rule1
dataf['Current'] = pd.to_datetime(dataf.Current)
dataf = dataf.sort_values(by=['Current'],ascending=True)
# rule2
dataf['Current'] = pd.to_datetime(dataf.Current)
Mask = (dataf['Current'] > '1/1/2020') & (dataf['Current'] <= '12/31/2022')
dataf = dataf.loc[Mask]
return dataf
you can modify this code to learn to sort by date for your solution.
Next lets sort by group:
Week1 = WeeklyDF.groupby('ID')
Week1_Report = Week1['ID','date','X1','X2','Y']
Week1_Report
lastly, Lets replace the NaN
Week1_Report['X1'.fillna("X1 is 0", inplace = True)
Week1_Report['X2'.fillna("X2 is 0", inplace = True)
Week1_Report['Y'.fillna("Y is 0", inplace = True)

multi-index and multi-column grouping

I have a dataframe with 3 levels index and 2 level columns.
Group
Label A B C D
number start end
1 2020-01-01 2020-12-31 -43.0 0 105.0 -37.0
2020-12-15 2020-12-15 NaN NaN NaN 195.0
2 2019-01-01 2019-12-31 -35.0 80.0 -14.0 NaN
2019-12-17 2019-12-17 NaN NaN NaN 141.0
2020-01-01 2020-12-31 -15.0 45.0 -7.0 NaN
3 2020-12-17 2020-12-17 NaN NaN NaN 326.0
2022-01-01 2022-12-31 NaN 50.0 NaN NaN
2023-12-31 2023-12-31 -25.0 NaN NaN NaN
2023-01-01 2023-12-31 NaN 50.0 NaN NaN
2020-12-15 2020-12-15 NaN NaN NaN 61.0
.............
I would like to group by number and start (only the year), summing values per Label:
Group
Label A B C D
number start end
1 2020 2020 -43.0 0 105.0 232.0
2 2019 2019 -35.0 80.0 -14.0 141
2020 2020 -15.0 45.0 -7.0 NaN
3 2020 2020 NaN NaN NaN 387.0
2022 2022 NaN 50.0 NaN NaN
2023 2023 -25.0 50.0 NaN NaN
.............
Please note that there is higher-level-column as well (called Group, and other higher-level-columns that I am not including to keep it simple) and other sub-columns (Label: A, B, C, D, repeated for each higher-level-column).
how can I do this?
thank you in advance

You can reference the MultiIndex levels by name, and use DatetimeIndex.year to get just the year of the levels you care about. min_count=1 gives NaN instead of 0 for group cells with all missing.
df.groupby(['number',
df.index.get_level_values('start').year,
df.index.get_level_values('end').year]).sum(min_count=1)
A B C D
number start end
1 2020 2020 -43.0 0.0 105.0 158.0
2 2019 2019 -35.0 80.0 -14.0 141.0
2020 2020 -15.0 45.0 -7.0 NaN
3 2020 2020 NaN NaN NaN 387.0
2022 2022 NaN 50.0 NaN NaN
2023 2023 -25.0 50.0 NaN NaN

Pandas Dataframe interpolating in sections delimited by indexes

My sample code is as follow:
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
I'm trying to interpolate various segments which contain the value 'nan'.
For context, I'm trying to track bus speeds using GPS data provided by the city (São Paulo, Brazil), but the data is scarce and with parts that do not provide the information, as the e.g., but there're segments which I know for a fact that they are stopped, such as dawn, but the information come as 'nan' as well.
What I need:
I've been experimenting with dataframe.interpolate() parameters (limit and limit_diretcion) but came up short. If I set df.interpolate(limit=2) I will not only interpolate the data that I need but the data where it shouldn't. So I need to interpolate between sections defined by a limit
Desired output:
Out[7]:
col1 col2 col3
0 1.0 20.00 15.00
1 nan nan nan
2 nan nan nan
3 nan nan nan
4 5.0 22.00 10.00
5 6.0 23.50 12.00
6 7.0 25.00 14.00
7 8.0 27.50 13.50
8 9.0 30.00 13.00
9 nan nan nan
10 nan nan nan
11 nan nan nan
12 13.0 25.00 9.00
The logic that I've been trying to apply is basically trying to find nan's and calculating the difference between their indexes and so createing a new dataframe_temp to interpolate and only than add it to another creating a new dataframe_final. But this has become hard to achieve due to the fact that 'nan'=='nan' return False

This is a hack but may still be useful. Likely Pandas 0.23 will have a better solution.
https://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#dataframe-interpolate-has-gained-the-limit-area-kwarg
df_fw = df.interpolate(limit=1)
df_bk = df.interpolate(limit=1, limit_direction='backward')
df_fw.where(df_bk.notna())
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Not a Hack
More legitimate way of handling it.
Generalized to handle any limit.
def interp(df, limit):
d = df.notna().rolling(limit + 1).agg(any).fillna(1)
d = pd.concat({
i: d.shift(-i).fillna(1)
for i in range(limit + 1)
}).prod(level=1)
return df.interpolate(limit=limit).where(d.astype(bool))
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Can also handle variation in NaN from column to column. Consider a different df
dictx = {'col1':[1,'nan','nan','nan',5,'nan','nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan','nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9,'nan']}
df = pd.DataFrame(dictx).astype(float)
df
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN NaN NaN
6 NaN 25.0 14.0
7 7.0 NaN NaN
8 NaN NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 NaN
Then with limit=1
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN 23.5 12.0
6 NaN 25.0 14.0
7 7.0 NaN 13.5
8 8.0 NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 9.0
And with limit=2
df.pipe(interp, 2).round(2)
col1 col2 col3
0 1.00 20.00 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.00 22.00 10.0
5 5.67 23.50 12.0
6 6.33 25.00 14.0
7 7.00 26.67 13.5
8 8.00 28.33 13.0
9 9.00 30.00 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.00 25.00 9.0

Here is a way to selectively ignore rows which are consecutive runs of NaNs whose length is greater than a certain size (given by limit):
import numpy as np
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
limit = 2
notnull = pd.notnull(df).all(axis=1)
# assign group numbers to the rows of df. Each group starts with a non-null row,
# followed by null rows
group = notnull.cumsum()
# find the index of groups having length > limit
ignore = (df.groupby(group).filter(lambda grp: len(grp)>limit)).index
# only ignore rows which are null
ignore = df.loc[~notnull].index.intersection(ignore)
keep = df.index.difference(ignore)
# interpolate only the kept rows
df.loc[keep] = df.loc[keep].interpolate()
print(df)
prints
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
By changing the value of limit you can control how big the group has to be before it should be ignored.

This is a partial answer.
for i in list(df):
for x in range(len(df[i])):
if not df[i][x] > -100:
df[i][x] = 0
df
col1 col2 col3
0 1.0 20.0 15.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 5.0 22.0 10.0
5 0.0 0.0 0.0
6 7.0 25.0 14.0
7 0.0 0.0 0.0
8 9.0 30.0 13.0
9 0.0 0.0 0.0
10 0.0 0.0 0.0
11 0.0 0.0 0.0
12 13.0 25.0 9.0
Now,
df["col1"][1] == df["col2"][1]
True

Merging columns and removing duplicates with Pandas

I need to merge similar columns and remove duplicates (entries with the same date). The data frame:
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count test_date
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12.35 2016-04-17 23:00:00
1 NaN NaN NaN NaN 133.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
4 NaN 32.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
5 36.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
6 NaN NaN NaN 99.7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25.0 NaN NaN NaN NaN 2016-04-17 23:00:00
12 36.0 NaN 32.2 99.7 NaN 133.0 NaN NaN NaN 406.0 NaN 25.0 NaN 12.35 NaN 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN 2016-04-25 23:00:00
79 34.0 NaN 5.4 55.9 NaN 133.0 NaN NaN NaN 372.0 NaN 28.0 NaN 7.99 NaN 2016-06-12 23:00:00
I need to get:
Albumin CRP Ferritin Hb Nancy Index Plasma Platelets Transferrin saturations UCEIS (0 to 8) WCC test_date
12 36.0 32.2 99.7 133.0 NaN NaN 406.0 25.0 NaN 12.35 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN 2016-04-25 23:00:00
79 34.0 5.4 55.9 133.0 NaN NaN 372.0 28.0 NaN 7.99 2016-06-12 23:00:00
So, columns 'C-reactive protein' should be merged with 'CRP', 'Hemoglobin' with 'Hb', 'Transferrin saturation %' with 'Transferrin saturation'.
I can easily remove duplicates with .drop_duplicates(), but the trick is remove not only row with the same date, but also to make sure, that the values in the same column are duplicated. For example, 'C-reactive protein' at row '4' has the same values as 'CRP' in row '12', in addition, they both have the same entry date. Given all that, I need to have only 'CRP' column with values 32.2 and the date '2016-04-17' (plus other unique columns).
EDIT
Some entries are really duplicates (absolutely identical, due to system glitches), for example (last three rows, on 2016-06-20, indices '803' and '122'). Is the solution below capable of removing such identical rows?
P.S. Thanks for the amazing and general solution for duplicate, but not identical entries.
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count setName test_date
735 39.0 NaN 0.4 52.0 NaN 144.0 NaN NaN NaN 197.0 NaN 25.0 NaN 4.88 NaN Bloods 2016-05-31 23:00:00
803 40.0 NaN 0.2 81.0 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
347 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN Research Bloods 2016-06-20 23:00:00
122 40.0 NaN 0.2 81.9 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00

I think you need groupby with rename columns by dict:
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
'Transferrin saturation %':'Transferrin saturations'}
df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
print (df)
Albumin CRP Ferritin Haemoglobin Hb Iron \
test_date
2016-04-17 23:00:00 36.0 32.2 99.7 133.0 133.0 NaN
2016-04-25 23:00:00 NaN NaN NaN NaN NaN NaN
2016-06-12 23:00:00 34.0 5.4 55.9 NaN 133.0 NaN
Nancy Index Plasma Platelets Transferrin saturations \
test_date
2016-04-17 23:00:00 NaN NaN 406.0 25.0
2016-04-25 23:00:00 NaN NaN NaN NaN
2016-06-12 23:00:00 NaN NaN 372.0 28.0
UCEIS (0 to 8) WCC White Cell Count
test_date
2016-04-17 23:00:00 NaN 12.35 12.35
2016-04-25 23:00:00 7.0 NaN NaN
2016-06-12 23:00:00 NaN 7.99 NaN
More general solution is reshape by melt, remove duplicates and then create DataFrame back:
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
'Transferrin saturation %':'Transferrin saturations'}
df = df.rename(columns=d).groupby(axis=1, level=0).max()
df = pd.melt(df, id_vars='test_date').dropna(subset=['value']).drop_duplicates()
df = df.groupby(['test_date','variable'])['value'] \
.apply(lambda x: pd.Series(x.values)) \
.unstack(1) \
.reset_index(level=1, drop=True) \
.reset_index() \
.rename_axis(None,axis=1)
print (df)
test_date Albumin CRP Ferritin Hb Platelets \
0 2016-04-17 23:00:00 1000.0 32.2 99.7 1000.0 406.0
1 2016-04-17 23:00:00 36.0 NaN NaN 133.0 NaN
2 2016-04-25 23:00:00 NaN NaN NaN NaN NaN
3 2016-06-12 23:00:00 34.0 5.4 55.9 133.0 372.0
Transferrin saturations UCEIS (0 to 8) WCC White Cell Count
0 25.0 NaN 12.35 12.35
1 NaN NaN NaN NaN
2 NaN 7.0 NaN NaN
3 28.0 NaN 7.99 NaN

What #jezrael was saying is that if you had a situation where:
Albumin C-reactive protein CRP test_date
0 NaN NaN 32 2016-04-17 23:00:00
1 NaN 8.0 NaN 2016-04-17 23:00:00
then his method would erase the 8.0 reading and keep only the 32 (this is because he does it in two steps (or 3?), in this line: df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
df = df.groupby('test_date').max() # selects max of each column
# while collapsing 'test_date'
which for my truncated example would give:
Albumin C-reactive protein CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then rename .rename(columns=d) giving:
Albumin CRP CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then .groupby(axis=1, level=0).max() to group along rows (instead of down columns) which gives:
Albumin CRP test_date
0 NaN 32 2016-04-17 23:00:00
which is where you run the highest risk of losing data.
Alternative
I would split the original data into two frames first
df1 = df[["C-reactive protein","Haemoglobin", ...]]
df2 = df[["CRP", "Hb"]]
# then rename
df2 = df2.rename(columns={"CRP":"C-reactive protein", "Hb":"Haemoglobin", ...})
# use concat to stack them on one another
df3 = pd.concat([df1, df2]) # i've run out of names
df3 = df3.drop_duplicates() # perhaps also drop NAs?
but this is only necessary if you have multiple non-duplicate entries for the same test on the same day.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Concat multiindex pandas into single one - python

Related

python pandas stop fillna at last non NaN value

Stretching the data frame by date

multi-index and multi-column grouping

Pandas Dataframe interpolating in sections delimited by indexes

Merging columns and removing duplicates with Pandas

Categories

Resources