Getting an error while trying to drop rows in pandas - python

I have the following DataFrame:
fin_data[fin_data['Ticker']=='DNMR']
high low open close volume adj_close Ticker CUMLOGRET_1 PCTRET_1 CUMPCTRET_1 OBV EMA_5 EMA_10 EMA_20 VWMA_15 BBL_20_2.0 BBM_20_2.0 BBU_20_2.0 RSI_14 PVT MACD_10_20_9 MACDh_10_20_9 MACDs_10_20_9 VOLUME_SMA_10 NAV Status Premium_over_NAV
date
2020-05-28 4.700000 4.700000 4.700000 4.700000 100.0 4.700000 DNMR NaN NaN NaN 100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10 Completed -0.530
2020-05-29 4.700000 4.700000 4.700000 4.700000 0.0 4.700000 DNMR 0.000000 0.000000 0.000000 100.0 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000e+00 NaN NaN NaN NaN 10 Completed -0.530
2020-06-01 9.660000 9.630000 9.630000 9.660000 2000.0 9.660000 DNMR 0.720431 1.055319 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-02 9.660000 9.650000 9.650000 9.660000 60020 9.660000 DNMR 0.720431 0.000000 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-03 9.720000 9.630000 9.720000 9.630000 1100.0 9.630000 DNMR 0.717321 -0.003106 1.052214 1000.0 7.670000 NaN NaN NaN NaN NaN NaN 99.303423 2.107222e+05 NaN NaN NaN NaN 10 Completed -0.037
I'd like to either drop the first two rows where the close price is 4.70 or replace 4.70 by 9.66.
In order to drop the rows I tried this but it's giving me an error:
fin_data.drop(fin_data[fin_data['Ticker']=='DNMR'],axis=0,inplace=True)
KeyError: "['high' 'low' 'open' 'close' 'volume' 'adj_close' 'Ticker' 'CUMLOGRET_1'\n 'PCTRET_1' 'CUMPCTRET_1' 'OBV' 'EMA_5' 'EMA_10' 'EMA_20' 'VWMA_15'\n 'BBL_20_2.0' 'BBM_20_2.0' 'BBU_20_2.0' 'RSI_14' 'PVT' 'MACD_10_20_9'\n 'MACDh_10_20_9' 'MACDs_10_20_9' 'VOLUME_SMA_10' 'NAV' 'Status'\n 'Premium_over_NAV'] not found in axis"
Then I tried replace the 4.70 values but even though the code executed without an error the DataFrame is unchanged.
fin_data.loc[fin_data['Ticker']=='DNMR','adj_close'][0:2] = 9.66
Please note that I don't want to delete the data for those two dates (2020-05-28 and 2020-5-29) for other Tickers in the database but just for this one ('DNMR')
Thanks.

You're calling drop incorrectly. To drop the rows in question (that is, to keep everything except them), negate the condition and select with it:
fin_data = fin_data[~((fin_data['Ticker'] == 'DNMR') & (fin_data['close'] == 4.7))]

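A note on why your drop call fails:
The drop() method expects a "single label or list-like" of index or column labels to drop (see DataFrame.drop). You passed a DataFrame, so pandas looked for its column names among the row labels, which is what the KeyError reports. Passing the subset's index works (notice the .index after the subset):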
df
a b c d
0 10 8 3 5
1 5 5 3 1
2 2 2 8 6
df.drop(df[df["a"]== 10].index, axis= 0, inplace= True)
df
a b c d
1 5 5 3 1
2 2 2 8 6
BUT, if your index holds dates and several rows share the same date, this also drops those other rows.
One solution is to reset the index to integers. But (even though I don't see why you would want a non-unique index) that may not be what you want, in which case you should stick to Jimmar's answer :)
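To make the non-unique index caveat concrete, here is a minimal sketch. The two-ticker frame and the ticker name ABCD are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical miniature of fin_data: two tickers share the same dates.
dates = pd.to_datetime(["2020-05-28", "2020-05-29", "2020-05-28", "2020-05-29"])
fin_data = pd.DataFrame(
    {"Ticker": ["DNMR", "DNMR", "ABCD", "ABCD"], "close": [4.7, 4.7, 12.0, 12.5]},
    index=pd.Index(dates, name="date"),
)

bad = (fin_data["Ticker"] == "DNMR") & (fin_data["close"] == 4.7)

# Label-based drop removes every row carrying those dates, including ABCD.
by_label = fin_data.drop(fin_data[bad].index)

# Boolean indexing removes only the rows where the mask is True.
by_mask = fin_data[~bad]

print(len(by_label), len(by_mask))
```

With a date index shared across tickers, the boolean form is the safer of the two because it never goes through index labels.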

It's quite often simpler to use a mask. The code below covers both cases: first the values are updated, then the rows are dropped.
import pandas as pd
import io
fin_data = pd.read_csv(io.StringIO("""date high low open close volume adj_close Ticker CUMLOGRET_1 PCTRET_1 CUMPCTRET_1 OBV EMA_5 EMA_10 EMA_20 VWMA_15 BBL_20_2.0 BBM_20_2.0 BBU_20_2.0 RSI_14 PVT MACD_10_20_9 MACDh_10_20_9 MACDs_10_20_9 VOLUME_SMA_10 NAV Status Premium_over_NAV
2020-05-28 4.700000 4.700000 4.700000 4.700000 100.0 4.700000 DNMR NaN NaN NaN 100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10 Completed -0.530
2020-05-29 4.700000 4.700000 4.700000 4.700000 0.0 4.700000 DNMR 0.000000 0.000000 0.000000 100.0 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000e+00 NaN NaN NaN NaN 10 Completed -0.530
2020-06-01 9.660000 9.630000 9.630000 9.660000 2000.0 9.660000 DNMR 0.720431 1.055319 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-02 9.660000 9.650000 9.650000 9.660000 60020 9.660000 DNMR 0.720431 0.000000 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-03 9.720000 9.630000 9.720000 9.630000 1100.0 9.630000 DNMR 0.717321 -0.003106 1.052214 1000.0 7.670000 NaN NaN NaN NaN NaN NaN 99.303423 2.107222e+05 NaN NaN NaN NaN 10 Completed -0.037"""), sep=r"\s+")
fin_data.date=pd.to_datetime(fin_data.date)
fin_data = fin_data.set_index(["date"])
mask = fin_data["Ticker"].eq("DNMR") & fin_data["close"].eq(4.7)
fin_data.loc[mask, "close"] = 0
print(fin_data.iloc[:,0:6].to_markdown())
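| date                | high | low  | open | close | volume | adj_close |
|:--------------------|-----:|-----:|-----:|------:|-------:|----------:|
| 2020-05-28 00:00:00 | 4.7  | 4.7  | 4.7  | 0     | 100    | 4.7       |
| 2020-05-29 00:00:00 | 4.7  | 4.7  | 4.7  | 0     | 0      | 4.7       |
| 2020-06-01 00:00:00 | 9.66 | 9.63 | 9.63 | 9.66  | 2000   | 9.66      |
| 2020-06-02 00:00:00 | 9.66 | 9.65 | 9.65 | 9.66  | 60020  | 9.66      |
| 2020-06-03 00:00:00 | 9.72 | 9.63 | 9.72 | 9.63  | 1100   | 9.63      |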
fin_data = fin_data[~mask]  # removes only the masked rows, even if other tickers share these dates
print(fin_data.iloc[:,0:6].to_markdown())
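| date                | high | low  | open | close | volume | adj_close |
|:--------------------|-----:|-----:|-----:|------:|-------:|----------:|
| 2020-06-01 00:00:00 | 9.66 | 9.63 | 9.63 | 9.66  | 2000   | 9.66      |
| 2020-06-02 00:00:00 | 9.66 | 9.65 | 9.65 | 9.66  | 60020  | 9.66      |
| 2020-06-03 00:00:00 | 9.72 | 9.63 | 9.72 | 9.63  | 1100   | 9.63      |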
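The question also asked about replacing 4.70 with 9.66. The chained assignment tried there ran without error but changed nothing, most likely because .loc with a boolean mask returns a copy, so the trailing [0:2] assignment lands on a temporary object. A minimal sketch of doing it in a single .loc call follows; the small frame and the ABCD ticker are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-in for fin_data; ABCD shares a date with DNMR.
fin_data = pd.DataFrame(
    {
        "Ticker": ["DNMR", "DNMR", "DNMR", "ABCD"],
        "close": [4.7, 4.7, 9.66, 4.7],
        "adj_close": [4.7, 4.7, 9.66, 4.7],
    },
    index=pd.to_datetime(["2020-05-28", "2020-05-29", "2020-06-01", "2020-05-28"]),
)

# Build the row selection once, then assign through one .loc call so the
# write goes to fin_data itself rather than to an intermediate copy.
mask = fin_data["Ticker"].eq("DNMR") & fin_data["close"].eq(4.7)
fin_data.loc[mask, ["close", "adj_close"]] = 9.66

print(fin_data)
```

The ABCD row keeps its 4.7 because the mask also tests the ticker.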

Related

Pandas: Filling NaN values in dataframe with monthly mean

The dataframe I am working with is as follows:
date AA1 AB2 AC3 AD4
0 1996-01-01 00:00:00 NaN NaN NaN NaN
1 1996-01-01 01:00:00 NaN 19.2 NaN NaN
2 1996-01-01 02:00:00 NaN 16.4 NaN NaN
3 1996-01-01 03:00:00 NaN 23.5 NaN NaN
4 1996-01-01 04:00:00 20.4 NaN NaN NaN
... ... ... ... ... ...
219164 2020-12-31 20:00:00 13.4 NaN 23.0 26.6
219165 2020-12-31 21:00:00 14.2 NaN 19.6 28.3
219166 2020-12-31 22:00:00 13.5 NaN 17.9 20.5
219167 2020-12-31 23:00:00 NaN NaN 16.7 20.7
219168 2021-01-01 00:00:00 NaN NaN NaN NaN
These are hourly data readings taken from different sensors from the year 1996 to 2021.
My goal is to be able to fill the NaN values with the monthly mean for each of the columns based on the date.
I have tried grouping the data and getting the monthly means for the group, though I am not sure where to go from here to transfer the grouped means to the original, larger dataframe, filling in some of the NaN values.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
tem = df.groupby(['year', 'month']).mean().reset_index()
The resulting dataframe looks like this, with fewer rows because of the grouping:
year month AA1 AB2 AC3 AD4
0 1996 1 20.1 18.3 NaN NaN
1 1996 2 NaN NaN NaN NaN
2 1996 3 NaN NaN NaN NaN
3 1996 4 NaN NaN NaN NaN
4 1996 5 NaN NaN NaN NaN
... ... ... ... ... ... ...
296 2020 9 NaN NaN 15.7 20.2
297 2020 10 NaN NaN 15.3 19.7
298 2020 11 NaN NaN 26.7 25.9
299 2020 12 NaN NaN 24.6 25.3
300 2021 1 NaN NaN NaN NaN
Any advice on how I can implement this would be helpful. In the end, I need the original dataset indices, dates and columns, but with the NaN values filled with the means calculated from the monthly groups. The months with all NaN values can be ignored for the time being.
Assuming your date column is of type datetime64 or equivalent:
df['AA1'] = df['AA1'].fillna(df.groupby(df.date.dt.month)['AA1'].transform('mean'))
Or looping over all your columns (except the date column):
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby(df.date.dt.month)[col].transform('mean'))
If you only want the mean of the month in that specific year, add df.date.dt.year to the group by function:
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby([df.date.dt.year, df.date.dt.month])[col].transform('mean'))
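For reference, a loop-free sketch of the same year-and-month fill on a tiny made-up frame. The values are invented; only the column names AA1 and AB2 come from the question, and the helper names keys and value_cols are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Invented hourly readings spanning two months.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "1996-01-01 00:00", "1996-01-01 01:00", "1996-01-01 02:00",
        "1996-02-01 00:00", "1996-02-01 01:00",
    ]),
    "AA1": [np.nan, 10.0, 20.0, np.nan, 4.0],
    "AB2": [np.nan, np.nan, np.nan, 1.0, 3.0],
})

# Group by calendar year and month taken from the date column.
keys = [df["date"].dt.year.rename("year"), df["date"].dt.month.rename("month")]
value_cols = list(df.columns.drop("date"))

# transform('mean') returns a frame shaped like the input, so each row
# carries the mean of its own (year, month) group.
monthly_mean = df.groupby(keys)[value_cols].transform("mean")

# fillna aligns on index and column labels; 'date' is left untouched.
filled = df.fillna(monthly_mean)

print(filled)
```

Months in which a column has no readings at all stay NaN, which matches the "can be ignored for the time being" note in the question.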

python pandas stop fillna at last non NaN value

I have a dataframe where the index is date increasing and the columns are observations of variables. The array is sparse.
My goal is to propagate a known value forward in time to fill NaN, but I want to stop at the last non-NaN value, as that last value signifies the "death" of the variable.
e.g. for the dataset
              a    b    c
2020-01-01  NaN   11  NaN
2020-02-01    1  NaN  NaN
2020-03-01  NaN  NaN   14
2020-04-01    2  NaN  NaN
2020-05-01  NaN  NaN  NaN
2020-06-01  NaN  NaN   15
2020-07-01    3  NaN  NaN
2020-08-01  NaN  NaN  NaN
I want to output
              a    b    c
2020-01-01  NaN   11  NaN
2020-02-01    1  NaN  NaN
2020-03-01    1  NaN   14
2020-04-01    2  NaN   14
2020-05-01    2  NaN   14
2020-06-01    2  NaN   15
2020-07-01    3  NaN  NaN
2020-08-01  NaN  NaN  NaN
I can identify the index of the last observation using df.notna()[::-1].idxmax() but can't figure out how to use this as a way to limit the fillna function
I'd be grateful for any suggestions. Many thanks
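Use DataFrame.where to forward fill by mask. Back filling first shows which cells come after the last valid value (they stay NaN after bfill); those are kept as they are, and everything else takes the forward-filled value: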
df = df.where(df.bfill().isna(), df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
Your idxmax approach can be used too, by converting its result to a numpy array and comparing it with the index using broadcasting:
mask = df.notna()[::-1].idxmax().to_numpy() < df.index.to_numpy()[:, None]
df = df.where(mask, df.ffill())
print (df)
a b c
2020-01-01 NaN 11.0 NaN
2020-02-01 1.0 NaN NaN
2020-03-01 1.0 NaN 14.0
2020-04-01 2.0 NaN 14.0
2020-05-01 2.0 NaN 14.0
2020-06-01 2.0 NaN 15.0
2020-07-01 3.0 NaN NaN
2020-08-01 NaN NaN NaN
You can use Series.last_valid_index, which is designed for exactly this (it returns the index of the last non-NA/null value), to ffill only up to that point.
Assuming your dataset is called df:
df.apply(lambda x: x.loc[:x.last_valid_index()].ffill())
index a b c
0 2020-01-01 NaN 11.00 NaN
1 2020-02-01 1.00 NaN NaN
2 2020-03-01 1.00 NaN 14.00
3 2020-04-01 2.00 NaN 14.00
4 2020-05-01 2.00 NaN 14.00
5 2020-06-01 2.00 NaN 15.00
6 2020-07-01 3.00 NaN NaN
7 2020-08-01 NaN NaN NaN
More on this on:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.last_valid_index.html
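To check the approaches above against the question's sample, here is a small self-contained sketch. It rebuilds the example frame and uses ffill followed by mask, which is just another spelling of the where-based answer, not a different method.

```python
import numpy as np
import pandas as pd

# The sample from the question, rebuilt by hand.
idx = pd.date_range("2020-01-01", periods=8, freq="MS")
df = pd.DataFrame(
    {
        "a": [np.nan, 1, np.nan, 2, np.nan, np.nan, 3, np.nan],
        "b": [11, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
        "c": [np.nan, np.nan, 14, np.nan, np.nan, 15, np.nan, np.nan],
    },
    index=idx,
)

# A cell lies after its column's last observation exactly when
# back-filling still leaves it NaN.
after_last = df.bfill().isna()

# Forward fill everything, then blank out the cells past the last observation.
result = df.ffill().mask(after_last)

# The where-based form from the answer above, for comparison.
alt = df.where(df.bfill().isna(), df.ffill())

print(result)
```

Both forms give the same frame on this sample, so the choice between them is a matter of taste.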

Getting NaN when multiplying these two columns from different dataframes in pandas

I'm trying to multiply columns from two different dataframes into a new df. The first dataframe (df1) contains the prices for different items, and the column header is the date. The second dataframe (df2) contains the quantity of each item.
df1
Date 1990-01-03 1990-01-04 1990-01-05 ... 2020-04-09 2020-04-14 2020-04-15
AAAAAAA 1.11 1.11 1.09 ... 102.22 103.46 103.96
BBBBBBB NaN NaN NaN ... 308.70 314.95 314.10
CCCCCCC NaN NaN NaN ... 65.34 58.72 56.18
DDDDDDD 5.52 5.51 5.53 ... 104.50 106.03 NaN
EEEEEEE NaN NaN NaN ... 1211.45 1269.23 NaN
FFFFFFF NaN NaN NaN ... 36.14 36.85 NaN
GGGGGGG 93.35 94.37 94.37 ... 1564.00 1537.50 1482.50
HHHHHHH NaN NaN NaN ... 45.69 46.68 46.24
IIIIIII NaN NaN NaN ... 75.10 74.88 74.40
JJJJJJJ 328.76 328.25 327.74 ... 6168.00 6448.00 6296.00
KKKKKKK NaN NaN NaN ... 23.49 23.50 24.04
LLLLLLL 4.45 4.41 4.34 ... 36.55 35.96 NaN
MMMMMMM 1.96 1.96 1.94 ... 141.23 146.03 NaN
NNNNNNN 1.09 1.09 1.09 ... 267.99 287.05 NaN
OOOOOOO 1.09 1.09 1.08 ... 201.53 207.17 NaN
PPPPPPP NaN NaN NaN ... 98.00 100.80 100.50
QQQQQQQ NaN NaN NaN ... 129.00 128.40 124.20
RRRRRRR NaN NaN NaN ... 140.60 141.45 139.60
[18 rows x 7658 columns]
and df2
Symbol Average Purchase Price Quantity
0 AAAAAAA 49.980 320.0
1 BBBBBBB 239.125 120.0
2 CCCCCCC 223.040 40.0
3 DDDDDDD 90.370 100.0
4 EEEEEEE 701.300 10.0
5 FFFFFFF 35.150 120.0
6 GGGGGGG 1259.000 700.0
7 HHHHHHH 32.050 250.0
8 IIIIIII 53.300 240.0
9 JJJJJJJ 6805.000 130.0
10 KKKKKKK 27.590 1000.0
11 LLLLLLL 82.120 170.0
12 MMMMMMM 106.470 150.0
13 NNNNNNN 95.970 308.0
14 OOOOOOO 81.420 150.0
15 PPPPPPP 39.690 60.0
16 QQQQQQQ 35.270 104.0
17 RRRRRRR 68.240 12.0
however when I use the function:
date = '2020-04-14'
total = df2[['Quantity']].mul(df1[date], axis=0)
print(total)
(Ideally, I'd like to do it for every date but I'm just learning so I thought I'd start out with one date)
I get:
Quantity
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
AAAAAAA NaN
BBBBBBB NaN
CCCCCCC NaN
DDDDDDD NaN
EEEEEEE NaN
FFFFFFF NaN
GGGGGGG NaN
HHHHHHH NaN
IIIIIII NaN
JJJJJJJ NaN
KKKKKKK NaN
LLLLLLL NaN
MMMMMMM NaN
NNNNNNN NaN
OOOOOOO NaN
PPPPPPP NaN
QQQQQQQ NaN
RRRRRRR NaN
how can I solve this?
This is an index alignment problem. The index of the product dataframe shows that the first dataframe is indexed by symbol, while the second has a sequential integer index, so no labels match and every result is NaN. Assuming no symbol is repeated in either dataframe, you can set Symbol as the index of the second one:
date = '2020-04-14'
total = df2.set_index('Symbol')[['Quantity']].mul(df1[date], axis=0)
print(total)
it gives:
Quantity
Symbol
AAAAAAA 33107.2
BBBBBBB 37794.0
CCCCCCC 2348.8
DDDDDDD 10603.0
EEEEEEE 12692.3
FFFFFFF 4422.0
GGGGGGG 1076250.0
HHHHHHH 11670.0
IIIIIII 17971.2
JJJJJJJ 838240.0
KKKKKKK 23500.0
LLLLLLL 6113.2
MMMMMMM 21904.5
NNNNNNN 88411.4
OOOOOOO 31075.5
PPPPPPP 6048.0
QQQQQQQ 13353.6
RRRRRRR 1697.4
The problem is in the indexing: your data frames have different indices. To make your code work, give both frames the same index with the pandas.DataFrame.reset_index() method:
>>> df1.reset_index(inplace=True)
This replaces the index of df1 with integers from 0 to 17, the same index df2 has. The multiplication then matches rows by position, so it is only correct if both frames list the symbols in the same order.
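The question also mentions wanting this for every date eventually. A minimal sketch of that, under the assumption that df1 is indexed by symbol with one column per date: multiply the whole price frame by the quantity Series along the rows. The two-symbol frames below are cut down from the question's data, and df2 is deliberately listed in a different order to show that alignment is by label.

```python
import pandas as pd

# Prices: one row per symbol, one column per date (cut down from the question).
df1 = pd.DataFrame(
    {"2020-04-14": [103.46, 314.95], "2020-04-15": [103.96, 314.10]},
    index=["AAAAAAA", "BBBBBBB"],
)

# Quantities, listed in a different order on purpose.
df2 = pd.DataFrame({"Symbol": ["BBBBBBB", "AAAAAAA"], "Quantity": [120.0, 320.0]})

# Index the quantities by symbol so they align with the rows of df1.
quantity = df2.set_index("Symbol")["Quantity"]

# axis=0 matches the Series index against df1's row labels, so each
# row of prices is scaled by that symbol's quantity for all dates at once.
totals = df1.mul(quantity, axis=0)

print(totals)
```

Because the match is by label, the row order of df2 does not matter here.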

Merging columns and removing duplicates with Pandas

I need to merge similar columns and remove duplicates (entries with the same date). The data frame:
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count test_date
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12.35 2016-04-17 23:00:00
1 NaN NaN NaN NaN 133.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
4 NaN 32.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
5 36.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
6 NaN NaN NaN 99.7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25.0 NaN NaN NaN NaN 2016-04-17 23:00:00
12 36.0 NaN 32.2 99.7 NaN 133.0 NaN NaN NaN 406.0 NaN 25.0 NaN 12.35 NaN 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN 2016-04-25 23:00:00
79 34.0 NaN 5.4 55.9 NaN 133.0 NaN NaN NaN 372.0 NaN 28.0 NaN 7.99 NaN 2016-06-12 23:00:00
I need to get:
Albumin CRP Ferritin Hb Nancy Index Plasma Platelets Transferrin saturations UCEIS (0 to 8) WCC test_date
12 36.0 32.2 99.7 133.0 NaN NaN 406.0 25.0 NaN 12.35 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN 2016-04-25 23:00:00
79 34.0 5.4 55.9 133.0 NaN NaN 372.0 28.0 NaN 7.99 2016-06-12 23:00:00
So, column 'C-reactive protein' should be merged with 'CRP', 'Haemoglobin' with 'Hb', and 'Transferrin saturation %' with 'Transferrin saturations'.
I can easily remove duplicates with .drop_duplicates(), but the trick is to remove not only rows with the same date, but also values duplicated across equivalent columns. For example, 'C-reactive protein' in row '4' has the same value as 'CRP' in row '12', and both rows have the same entry date. Given all that, I need to end up with only a 'CRP' column holding the value 32.2 for the date '2016-04-17' (plus the other unique columns).
EDIT
Some entries are really duplicates (absolutely identical, due to system glitches), for example (last three rows, on 2016-06-20, indices '803' and '122'). Is the solution below capable of removing such identical rows?
P.S. Thanks for the amazing and general solution for duplicate, but not identical entries.
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count setName test_date
735 39.0 NaN 0.4 52.0 NaN 144.0 NaN NaN NaN 197.0 NaN 25.0 NaN 4.88 NaN Bloods 2016-05-31 23:00:00
803 40.0 NaN 0.2 81.0 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
347 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN Research Bloods 2016-06-20 23:00:00
122 40.0 NaN 0.2 81.9 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
I think you need groupby, with the columns renamed by a dict:
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
'Transferrin saturation %':'Transferrin saturations'}
df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
print (df)
Albumin CRP Ferritin Haemoglobin Hb Iron \
test_date
2016-04-17 23:00:00 36.0 32.2 99.7 133.0 133.0 NaN
2016-04-25 23:00:00 NaN NaN NaN NaN NaN NaN
2016-06-12 23:00:00 34.0 5.4 55.9 NaN 133.0 NaN
Nancy Index Plasma Platelets Transferrin saturations \
test_date
2016-04-17 23:00:00 NaN NaN 406.0 25.0
2016-04-25 23:00:00 NaN NaN NaN NaN
2016-06-12 23:00:00 NaN NaN 372.0 28.0
UCEIS (0 to 8) WCC White Cell Count
test_date
2016-04-17 23:00:00 NaN 12.35 12.35
2016-04-25 23:00:00 7.0 NaN NaN
2016-06-12 23:00:00 NaN 7.99 NaN
A more general solution is to reshape with melt, remove duplicates, and then rebuild the DataFrame:
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
'Transferrin saturation %':'Transferrin saturations'}
df = df.rename(columns=d).groupby(axis=1, level=0).max()
df = pd.melt(df, id_vars='test_date').dropna(subset=['value']).drop_duplicates()
df = df.groupby(['test_date','variable'])['value'] \
.apply(lambda x: pd.Series(x.values)) \
.unstack(1) \
.reset_index(level=1, drop=True) \
.reset_index() \
.rename_axis(None,axis=1)
print (df)
test_date Albumin CRP Ferritin Hb Platelets \
0 2016-04-17 23:00:00 1000.0 32.2 99.7 1000.0 406.0
1 2016-04-17 23:00:00 36.0 NaN NaN 133.0 NaN
2 2016-04-25 23:00:00 NaN NaN NaN NaN NaN
3 2016-06-12 23:00:00 34.0 5.4 55.9 133.0 372.0
Transferrin saturations UCEIS (0 to 8) WCC White Cell Count
0 25.0 NaN 12.35 12.35
1 NaN NaN NaN NaN
2 NaN 7.0 NaN NaN
3 28.0 NaN 7.99 NaN
What @jezrael was saying is that if you had a situation where:
Albumin C-reactive protein CRP test_date
0 NaN NaN 32 2016-04-17 23:00:00
1 NaN 8.0 NaN 2016-04-17 23:00:00
then his method would erase the 8.0 reading and keep only the 32. This is because the line df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max() works in three steps. First:
df = df.groupby('test_date').max() # selects max of each column
# while collapsing 'test_date'
which for my truncated example would give:
Albumin C-reactive protein CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then rename .rename(columns=d) giving:
Albumin CRP CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then .groupby(axis=1, level=0).max() to group along rows (instead of down columns) which gives:
Albumin CRP test_date
0 NaN 32 2016-04-17 23:00:00
which is where you run the highest risk of losing data.
Alternative
I would split the original data into two frames first
df1 = df[["C-reactive protein","Haemoglobin", ...]]
df2 = df[["CRP", "Hb"]]
# then rename
df2 = df2.rename(columns={"CRP":"C-reactive protein", "Hb":"Haemoglobin", ...})
# use concat to stack them on one another
df3 = pd.concat([df1, df2]) # i've run out of names
df3 = df3.drop_duplicates() # perhaps also drop NAs?
but this is only necessary if you have multiple non-duplicate entries for the same test on the same day.
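Here is a small sketch of the rename-then-collapse idea on invented rows, written without groupby(axis=1), which recent pandas versions deprecate. The aliases dict uses the column spellings as they appear in the frame ('Haemoglobin'), and first() is a swapped-in choice in place of max(): it takes the first non-null value, so it carries the same risk of hiding a second, different reading on the same day.

```python
import numpy as np
import pandas as pd

# Invented rows following the layout in the question.
df = pd.DataFrame({
    "C-reactive protein": [32.2, np.nan, np.nan],
    "CRP": [np.nan, 32.2, 5.4],
    "Haemoglobin": [133.0, np.nan, np.nan],
    "Hb": [np.nan, 133.0, 133.0],
    "test_date": pd.to_datetime(
        ["2016-04-17 23:00", "2016-04-17 23:00", "2016-06-12 23:00"]
    ),
})

# Long name -> short name, spelled exactly as the columns are.
aliases = {"C-reactive protein": "CRP", "Haemoglobin": "Hb"}

# Step 1: one row per date, first non-null value in each column.
per_date = df.groupby("test_date").first()

# Step 2: rename, then collapse same-named columns by grouping the
# transposed frame on its row labels and transposing back.
result = per_date.rename(columns=aliases).T.groupby(level=0).first().T

print(result)
```

If two different readings for the same test can occur on one day, the split-and-concat approach above is the safer route.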

Pandas: Merge data with different timing

I have two data frames that contain time-series data that are on different ranges. One starts earlier, and ends earlier. Also, one is monthly and one is quarterly. However, the index of both is in the form of YYYY-MM-DD. Is there a cute way of merging these dataframes using "Python" and "Pandas"?
Thanks!
/edit
One set:
DATE GDP GPDI NFLS
0 1947-01-01 243.1 35.9 112.815
1 1947-04-01 246.3 34.5 111.253
2 1947-07-01 250.1 34.9 113.023
3 1947-10-01 260.3 43.2 111.440
The other one:
DATE INDPRO M08354USM310NNBR GDP
(...)
334 1946-11-01 13.3916 NaN NaN
335 1946-12-01 13.4721 NaN NaN
336 1947-01-01 13.6332 42.8 NaN
337 1947-02-01 13.7137 42.5 NaN
Together I would like to join them, such that
DATE INDPRO M08354USM310NNBR GDP GPDI NFLS
1946-11-01 13.3916 NaN NaN NaN NaN
1946-12-01 13.4712 NaN NaN NaN NaN
1947-01-01 13.6332 42.8 243.1 35.9 112.815
1947-02-01 13.7137 42.5 NaN NaN NaN
(...)
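Just perform an outer merge on 'DATE'. The fact that the periods differ and only partly overlap actually works in your favour, because dates present in only one frame are kept and filled with NaN: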
merged = df1.merge(df2, on='DATE', how='outer')
merged
Out[54]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
[7 rows x 7 columns]
You can then rename, fill, or drop the erroneous 'GDP_y' column.
To sort by the merged 'DATE' column, call sort_values (DataFrame.sort was removed in later pandas versions):
In [57]:
merged.sort_values('DATE')
Out[57]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
[7 rows x 7 columns]
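Putting the steps together on current pandas, a minimal sketch with cut-down copies of the two frames. It assumes the monthly frame's GDP column is entirely empty, as it is in the rows shown, and drops it before merging so no GDP_x/GDP_y pair appears; the names quarterly and monthly are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Cut-down copies of the two frames from the question.
quarterly = pd.DataFrame({
    "DATE": pd.to_datetime(["1947-01-01", "1947-04-01"]),
    "GDP": [243.1, 246.3],
    "GPDI": [35.9, 34.5],
})
monthly = pd.DataFrame({
    "DATE": pd.to_datetime(["1946-12-01", "1947-01-01", "1947-02-01"]),
    "INDPRO": [13.4721, 13.6332, 13.7137],
    "GDP": [np.nan, np.nan, np.nan],  # assumed empty in the monthly data
})

merged = (
    monthly.drop(columns="GDP")                # avoid the GDP_x / GDP_y clash
    .merge(quarterly, on="DATE", how="outer")  # keep dates from both frames
    .sort_values("DATE")                       # chronological order
    .reset_index(drop=True)
)

print(merged)
```

If the monthly GDP column does hold values in the full data, keeping both columns and combining them afterwards would be the safer choice.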
