Pandas: Filling NaN values in dataframe with monthly mean - python

The dataframe I am working with is as follows:
date AA1 AB2 AC3 AD4
0 1996-01-01 00:00:00 NaN NaN NaN NaN
1 1996-01-01 01:00:00 NaN 19.2 NaN NaN
2 1996-01-01 02:00:00 NaN 16.4 NaN NaN
3 1996-01-01 03:00:00 NaN 23.5 NaN NaN
4 1996-01-01 04:00:00 20.4 NaN NaN NaN
... ... ... ... ... ...
219164 2020-12-31 20:00:00 13.4 NaN 23.0 26.6
219165 2020-12-31 21:00:00 14.2 NaN 19.6 28.3
219166 2020-12-31 22:00:00 13.5 NaN 17.9 20.5
219167 2020-12-31 23:00:00 NaN NaN 16.7 20.7
219168 2021-01-01 00:00:00 NaN NaN NaN NaN
These are hourly data readings taken from different sensors from the year 1996 to 2021.
My goal is to be able to fill the NaN values with the monthly mean for each of the columns based on the date.
I have tried grouping the data and getting the monthly means for the group, though I am not sure where to go from here to transfer the grouped means to the original, larger dataframe, filling in some of the NaN values.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
tem = df.groupby(['year', 'month']).mean().reset_index()
The resulting dataframe looks like this, with fewer rows because of the grouping:
year month AA1 AB2 AC3 AD4
0 1996 1 20.1 18.3 NaN NaN
1 1996 2 NaN NaN NaN NaN
2 1996 3 NaN NaN NaN NaN
3 1996 4 NaN NaN NaN NaN
4 1996 5 NaN NaN NaN NaN
... ... ... ... ... ... ...
296 2020 9 NaN NaN 15.7 20.2
297 2020 10 NaN NaN 15.3 19.7
298 2020 11 NaN NaN 26.7 25.9
299 2020 12 NaN NaN 24.6 25.3
300 2021 1 NaN NaN NaN NaN
Any advice on how I can implement this would be helpful. In the end, I need the original dataset indices, dates and columns, but with the NaN values filled with the means calculated from the monthly groups. The months with all NaN values can be ignored for the time being.

Assuming your date column is of type datetime64 or equivalent:
df['AA1'] = df['AA1'].fillna(df.groupby(df.date.dt.month)['AA1'].transform('mean'))
Or looping over all your columns (except the date column):
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby(df.date.dt.month)[col].transform('mean'))
If you only want the mean of that month in that specific year, add df.date.dt.year to the groupby:
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby([df.date.dt.year, df.date.dt.month])[col].transform('mean'))
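Note that transform('mean') returns a Series aligned to the original index, repeating each group's mean on every row of that group, which is why the fillna above lines up row for row. A minimal sketch with made-up numbers:
import pandas as pd
# hypothetical toy data: two months of readings with gaps
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-02-01', '2020-02-02']),
    'AA1': [10.0, None, None, 30.0],
})
# broadcast each month's mean back to every row of that month
monthly_mean = df.groupby(df.date.dt.month)['AA1'].transform('mean')
print(monthly_mean.tolist())                # [10.0, 10.0, 30.0, 30.0]
df['AA1'] = df['AA1'].fillna(monthly_mean)  # NaNs replaced by their month's mean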

Related

Replace Unnamed values in date column with true values

I'm working on this raw data frame that needs some cleaning. So far, I have transformed this xlsx file
into this pandas dataframe:
print(df.head(16))
date technician alkalinity colour uv ph turbidity \
0 2020-02-01 00:00:00 Catherine 24.5 33 0.15 7.24 1.53
1 Unnamed: 2 NaN NaN NaN NaN NaN 2.31
2 Unnamed: 3 NaN NaN NaN NaN NaN 2.08
3 Unnamed: 4 NaN NaN NaN NaN NaN 2.2
4 Unnamed: 5 Michel 24 35 0.152 7.22 1.59
5 Unnamed: 6 NaN NaN NaN NaN NaN 1.66
6 Unnamed: 7 NaN NaN NaN NaN NaN 1.71
7 Unnamed: 8 NaN NaN NaN NaN NaN 1.53
8 2020-02-02 00:00:00 Catherine 24 NaN 0.145 7.21 1.44
9 Unnamed: 10 NaN NaN NaN NaN NaN 1.97
10 Unnamed: 11 NaN NaN NaN NaN NaN 1.91
11 Unnamed: 12 NaN NaN 33.0 NaN NaN 2.07
12 Unnamed: 13 Michel 24 34 0.15 7.24 1.76
13 Unnamed: 14 NaN NaN NaN NaN NaN 1.84
14 Unnamed: 15 NaN NaN NaN NaN NaN 1.72
15 Unnamed: 16 NaN NaN NaN NaN NaN 1.85
temperature
0 3
1 NaN
2 NaN
3 NaN
4 3
5 NaN
6 NaN
7 NaN
8 3
9 NaN
10 NaN
11 NaN
12 3
13 NaN
14 NaN
15 NaN
From here, I want to combine the rows so that I only have one row for each date. The values for each row will be the mean of the respective columns, i.e.:
print(new_df.head(2))
date time alkalinity colour uv ph turbidity temperature
0 2020-02-01 00:00:00 24.25 34 0.151 7.23 1.83 3
1 2020-02-02 00:00:00 24 33.5 0.148 7.23 1.82 3
How can I accomplish this when I have Unnamed values in my date column? Thanks!
Try setting the values to NaN and then use ffill:
df.loc[df.date.str.contains('Unnamed', na=False), 'date'] = np.nan
df.date = df.date.ffill()
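From there, getting to the one-row-per-date frame you sketched is just a groupby. A minimal sketch, assuming the measurement columns can be coerced to numeric and that 'technician' may be dropped from the result:
# average every measurement column per date (non-numeric cells become NaN)
num_cols = df.columns.drop(['date', 'technician'])
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
new_df = df.groupby('date', as_index=False)[list(num_cols)].mean()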
If I understand correctly, you want to drop rows that contain 'Unnamed' in the date column, right?
Please look here:
https://stackoverflow.com/a/27360130/12790501
The solution would be something like this:
df = df.drop(df[df.date.str.contains('Unnamed')].index)
Edit:
No, I would like to replace those Unnamed values with the date so I
could then use the groupby('date') function to return the mean values
for the columns
so in that case you should just iterate over the whole table:
last_date = ''
for i in df.index:
    if 'Unnamed' not in df.at[i, 'date']:
        last_date = df.at[i, 'date']
    else:
        df.at[i, 'date'] = last_date
If the 'date' column is of type object, i.e. string, then you can just loop over the numbers in the 'Unnamed: N' labels, provided they follow a certain pattern:
for i in range(2, 9):
    df.loc[df['date'] == 'Unnamed: ' + str(i), 'date'] = your_value

Getting an error while trying to drop rows in pandas

I have the following DataFrame:
fin_data[fin_data['Ticker']=='DNMR']
high low open close volume adj_close Ticker CUMLOGRET_1 PCTRET_1 CUMPCTRET_1 OBV EMA_5 EMA_10 EMA_20 VWMA_15 BBL_20_2.0 BBM_20_2.0 BBU_20_2.0 RSI_14 PVT MACD_10_20_9 MACDh_10_20_9 MACDs_10_20_9 VOLUME_SMA_10 NAV Status Premium_over_NAV
date
2020-05-28 4.700000 4.700000 4.700000 4.700000 100.0 4.700000 DNMR NaN NaN NaN 100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10 Completed -0.530
2020-05-29 4.700000 4.700000 4.700000 4.700000 0.0 4.700000 DNMR 0.000000 0.000000 0.000000 100.0 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000e+00 NaN NaN NaN NaN 10 Completed -0.530
2020-06-01 9.660000 9.630000 9.630000 9.660000 2000.0 9.660000 DNMR 0.720431 1.055319 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-02 9.660000 9.650000 9.650000 9.660000 60020 9.660000 DNMR 0.720431 0.000000 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-03 9.720000 9.630000 9.720000 9.630000 1100.0 9.630000 DNMR 0.717321 -0.003106 1.052214 1000.0 7.670000 NaN NaN NaN NaN NaN NaN 99.303423 2.107222e+05 NaN NaN NaN NaN 10 Completed -0.037
I'd like to either drop the first two rows where the close price is 4.70 or replace 4.70 by 9.66.
In order to drop the rows I tried this but it's giving me an error:
fin_data.drop(fin_data[fin_data['Ticker']=='DNMR'],axis=0,inplace=True)
KeyError: "['high' 'low' 'open' 'close' 'volume' 'adj_close' 'Ticker' 'CUMLOGRET_1'\n 'PCTRET_1' 'CUMPCTRET_1' 'OBV' 'EMA_5' 'EMA_10' 'EMA_20' 'VWMA_15'\n 'BBL_20_2.0' 'BBM_20_2.0' 'BBU_20_2.0' 'RSI_14' 'PVT' 'MACD_10_20_9'\n 'MACDh_10_20_9' 'MACDs_10_20_9' 'VOLUME_SMA_10' 'NAV' 'Status'\n 'Premium_over_NAV'] not found in axis"
Then I tried replacing the 4.70 values, but even though the code executed without an error, the DataFrame is unchanged.
fin_data.loc[fin_data['Ticker']=='DNMR','adj_close'][0:2] = 9.66
Please note that I don't want to delete the data for those two dates (2020-05-28 and 2020-05-29) for other Tickers in the database, just for this one ('DNMR').
Thanks.
You are using it wrong; to drop the rows in question (or actually, to select the opposite ones) you should do:
fin_data = fin_data[~((fin_data['Ticker'] == 'DNMR') & (fin_data['close'] == 4.7))]
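As for why the replacement attempt ran without changing anything: fin_data.loc[...]['adj_close'][0:2] = 9.66 is chained indexing, so the assignment lands on a temporary copy instead of on fin_data. Doing the row and column selection in a single .loc call avoids that; a sketch where the positional [0:2] is replaced by matching the 4.7 close price directly:
# one .loc call, so the write hits fin_data itself rather than a copy
mask = (fin_data['Ticker'] == 'DNMR') & (fin_data['close'] == 4.7)
fin_data.loc[mask, 'adj_close'] = 9.66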
To point out something about your drop issue:
the drop() method expects "single label or list-like Index or column labels to drop" (see DataFrame.drop), so this would work (notice the .index after the subset):
df
a b c d
0 10 8 3 5
1 5 5 3 1
2 2 2 8 6
df.drop(df[df["a"] == 10].index, axis=0, inplace=True)
df
a b c d
1 5 5 3 1
2 2 2 8 6
BUT, if you have dates as indexes and there are multiple rows with the same date, you would also drop those.
A solution would be to reset the index to integers, but (even though I don't see why you would want non-unique indexes) that may not be what you want, and you should stick to Jimmar's answer :)
It's quite often simpler to use a mask, both for updating values and for dropping rows:
import pandas as pd
import io
fin_data = pd.read_csv(io.StringIO("""date high low open close volume adj_close Ticker CUMLOGRET_1 PCTRET_1 CUMPCTRET_1 OBV EMA_5 EMA_10 EMA_20 VWMA_15 BBL_20_2.0 BBM_20_2.0 BBU_20_2.0 RSI_14 PVT MACD_10_20_9 MACDh_10_20_9 MACDs_10_20_9 VOLUME_SMA_10 NAV Status Premium_over_NAV
2020-05-28 4.700000 4.700000 4.700000 4.700000 100.0 4.700000 DNMR NaN NaN NaN 100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10 Completed -0.530
2020-05-29 4.700000 4.700000 4.700000 4.700000 0.0 4.700000 DNMR 0.000000 0.000000 0.000000 100.0 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000e+00 NaN NaN NaN NaN 10 Completed -0.530
2020-06-01 9.660000 9.630000 9.630000 9.660000 2000.0 9.660000 DNMR 0.720431 1.055319 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-02 9.660000 9.650000 9.650000 9.660000 60020 9.660000 DNMR 0.720431 0.000000 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-03 9.720000 9.630000 9.720000 9.630000 1100.0 9.630000 DNMR 0.717321 -0.003106 1.052214 1000.0 7.670000 NaN NaN NaN NaN NaN NaN 99.303423 2.107222e+05 NaN NaN NaN NaN 10 Completed -0.037"""), sep=r"\s+")
fin_data.date=pd.to_datetime(fin_data.date)
fin_data = fin_data.set_index(["date"])
mask = fin_data["Ticker"].eq("DNMR") & fin_data["close"].eq(4.7)
fin_data.loc[mask, "close"] = 0
print(fin_data.iloc[:,0:6].to_markdown())
| date                | high | low  | open | close | volume | adj_close |
|:--------------------|-----:|-----:|-----:|------:|-------:|----------:|
| 2020-05-28 00:00:00 |  4.7 |  4.7 |  4.7 |     0 |    100 |       4.7 |
| 2020-05-29 00:00:00 |  4.7 |  4.7 |  4.7 |     0 |      0 |       4.7 |
| 2020-06-01 00:00:00 | 9.66 | 9.63 | 9.63 |  9.66 |   2000 |      9.66 |
| 2020-06-02 00:00:00 | 9.66 | 9.65 | 9.65 |  9.66 |  60020 |      9.66 |
| 2020-06-03 00:00:00 | 9.72 | 9.63 | 9.72 |  9.63 |   1100 |      9.63 |
fin_data = fin_data.drop(fin_data.loc[mask].index, axis=0)
print(fin_data.iloc[:,0:6].to_markdown())
| date                | high | low  | open | close | volume | adj_close |
|:--------------------|-----:|-----:|-----:|------:|-------:|----------:|
| 2020-06-01 00:00:00 | 9.66 | 9.63 | 9.63 |  9.66 |   2000 |      9.66 |
| 2020-06-02 00:00:00 | 9.66 | 9.65 | 9.65 |  9.66 |  60020 |      9.66 |
| 2020-06-03 00:00:00 | 9.72 | 9.63 | 9.72 |  9.63 |   1100 |      9.63 |

Plot DataFrame in 1 year period

I have dataframe:
temp_old temp_new
Year 2013 2014 2015 2016 2017 2018 2013 2014 2015 2016 2017 2018
Date
2013-01-01 23:00:00 21.587569 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-02 00:00:00 21.585347 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-01-02 01:00:00 21.583472 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
2018-02-05 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22.882083
2018-02-05 01:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22.878472
When I plot this df, this is my result.
My goal is to show it without the separation by years, so I want to have 5 curves over the January to December range on one chart.
Update (code to plot):
df_sep_by_year.plot(figsize=(15,8))
Simply remove year from your Date column. I mean instead of 2013-01-01 23:00:00 use 01-01 23:00:00 and adjust your data similarly for other records.
# remove datetime index
df.reset_index(inplace=True)
# create new column without year, use ':02d' to correct sorting
df['new_date'] = df.Date.apply(lambda x: '{:02d}-{:02d} {:02d}:00:00'.format(x.month, x.day, x.hour))
# set new index to df
df.set_index('new_date', inplace=True)
# remove old column with datetime
df = df.drop(labels=['Date'], axis=1)
# flatten the multiindex columns, e.g. ('temp_old', 2013) -> 'temp_old_2013'
df.columns = ['_'.join(map(str, col)) for col in df.columns]
# join variables from different years that share the same month, day and hour
df = pd.concat([pd.DataFrame(df[x]).dropna(axis=0, how='any') for x in df.columns], axis=1).dropna(axis=1, how='all')
df.plot()
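A different sketch of the same idea, not from the answer above: keep the DatetimeIndex and plot each year against its fractional day of year, which avoids building a string index. It assumes a long-form frame with a single hypothetical 'temp' column:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(15, 8))
for year, group in df.groupby(df.index.year):
    # x axis: fractional day of year, so every year shares one Jan-Dec range
    x = group.index.dayofyear + group.index.hour / 24
    ax.plot(x, group['temp'], label=str(year))
ax.set_xlabel('day of year')
ax.legend()
plt.show()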

Merging columns and removing duplicates with Pandas

I need to merge similar columns and remove duplicates (entries with the same date). The data frame:
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count test_date
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12.35 2016-04-17 23:00:00
1 NaN NaN NaN NaN 133.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
4 NaN 32.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
5 36.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
6 NaN NaN NaN 99.7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25.0 NaN NaN NaN NaN 2016-04-17 23:00:00
12 36.0 NaN 32.2 99.7 NaN 133.0 NaN NaN NaN 406.0 NaN 25.0 NaN 12.35 NaN 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN 2016-04-25 23:00:00
79 34.0 NaN 5.4 55.9 NaN 133.0 NaN NaN NaN 372.0 NaN 28.0 NaN 7.99 NaN 2016-06-12 23:00:00
I need to get:
Albumin CRP Ferritin Hb Nancy Index Plasma Platelets Transferrin saturations UCEIS (0 to 8) WCC test_date
12 36.0 32.2 99.7 133.0 NaN NaN 406.0 25.0 NaN 12.35 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN 2016-04-25 23:00:00
79 34.0 5.4 55.9 133.0 NaN NaN 372.0 28.0 NaN 7.99 2016-06-12 23:00:00
So, the column 'C-reactive protein' should be merged with 'CRP', 'Haemoglobin' with 'Hb', and 'Transferrin saturation %' with 'Transferrin saturations'.
I can easily remove duplicates with .drop_duplicates(), but the trick is to remove not only rows with the same date, but also to make sure that the values in the same column are duplicated. For example, 'C-reactive protein' at row '4' has the same value as 'CRP' in row '12', and in addition they both have the same entry date. Given all that, I need to have only a 'CRP' column with the value 32.2 and the date '2016-04-17' (plus the other unique columns).
EDIT
Some entries are really duplicates (absolutely identical, due to system glitches), for example (last three rows, on 2016-06-20, indices '803' and '122'). Is the solution below capable of removing such identical rows?
P.S. Thanks for the amazing and general solution for duplicate, but not identical entries.
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count setName test_date
735 39.0 NaN 0.4 52.0 NaN 144.0 NaN NaN NaN 197.0 NaN 25.0 NaN 4.88 NaN Bloods 2016-05-31 23:00:00
803 40.0 NaN 0.2 81.0 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
347 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN Research Bloods 2016-06-20 23:00:00
122 40.0 NaN 0.2 81.9 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
I think you need groupby with columns renamed by dict:
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
     'Transferrin saturation %':'Transferrin saturations'}
df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
print (df)
Albumin CRP Ferritin Haemoglobin Hb Iron \
test_date
2016-04-17 23:00:00 36.0 32.2 99.7 133.0 133.0 NaN
2016-04-25 23:00:00 NaN NaN NaN NaN NaN NaN
2016-06-12 23:00:00 34.0 5.4 55.9 NaN 133.0 NaN
Nancy Index Plasma Platelets Transferrin saturations \
test_date
2016-04-17 23:00:00 NaN NaN 406.0 25.0
2016-04-25 23:00:00 NaN NaN NaN NaN
2016-06-12 23:00:00 NaN NaN 372.0 28.0
UCEIS (0 to 8) WCC White Cell Count
test_date
2016-04-17 23:00:00 NaN 12.35 12.35
2016-04-25 23:00:00 7.0 NaN NaN
2016-06-12 23:00:00 NaN 7.99 NaN
A more general solution is to reshape by melt, remove duplicates and then build the DataFrame back:
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
'Transferrin saturation %':'Transferrin saturations'}
df = df.rename(columns=d).groupby(axis=1, level=0).max()
df = pd.melt(df, id_vars='test_date').dropna(subset=['value']).drop_duplicates()
df = df.groupby(['test_date','variable'])['value'] \
       .apply(lambda x: pd.Series(x.values)) \
       .unstack(1) \
       .reset_index(level=1, drop=True) \
       .reset_index() \
       .rename_axis(None, axis=1)
print (df)
test_date Albumin CRP Ferritin Hb Platelets \
0 2016-04-17 23:00:00 1000.0 32.2 99.7 1000.0 406.0
1 2016-04-17 23:00:00 36.0 NaN NaN 133.0 NaN
2 2016-04-25 23:00:00 NaN NaN NaN NaN NaN
3 2016-06-12 23:00:00 34.0 5.4 55.9 133.0 372.0
Transferrin saturations UCEIS (0 to 8) WCC White Cell Count
0 25.0 NaN 12.35 12.35
1 NaN NaN NaN NaN
2 NaN 7.0 NaN NaN
3 28.0 NaN 7.99 NaN
What @jezrael was saying is that if you had a situation where:
Albumin C-reactive protein CRP test_date
0 NaN NaN 32 2016-04-17 23:00:00
1 NaN 8.0 NaN 2016-04-17 23:00:00
then his method would erase the 8.0 reading and keep only the 32, because this line works in steps: df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
df = df.groupby('test_date').max()  # selects max of each column while collapsing 'test_date'
which for my truncated example would give:
Albumin C-reactive protein CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then rename .rename(columns=d) giving:
Albumin CRP CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then .groupby(axis=1, level=0).max() to group along rows (instead of down columns) which gives:
Albumin CRP test_date
0 NaN 32 2016-04-17 23:00:00
which is where you run the highest risk of losing data.
Alternative
I would split the original data into two frames first
df1 = df[["C-reactive protein","Haemoglobin", ...]]
df2 = df[["CRP", "Hb"]]
# then rename
df2 = df2.rename(columns={"CRP":"C-reactive protein", "Hb":"Haemoglobin", ...})
# use concat to stack them on one another
df3 = pd.concat([df1, df2]) # I've run out of names
df3 = df3.drop_duplicates() # perhaps also drop NAs?
but this is only necessary if you have multiple non-duplicate entries for the same test on the same day.
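A variant that avoids stacking rows altogether (my own sketch, not from the answer above) coalesces each pair of columns with combine_first and then drops the alternate name; it assumes the two names never disagree on the same row:
# map each alternate column name to its canonical twin
pairs = {'C-reactive protein': 'CRP',
         'Haemoglobin': 'Hb',
         'Transferrin saturation %': 'Transferrin saturations'}
for alt, canon in pairs.items():
    # fill gaps in the canonical column from the alternate one
    df[canon] = df[canon].combine_first(df[alt])
df = df.drop(columns=list(pairs))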

Pandas: Merge data with different timing

I have two data frames that contain time-series data that are on different ranges. One starts earlier, and ends earlier. Also, one is monthly and one is quarterly. However, the index of both is in the form of YYYY-MM-DD. Is there a cute way of merging these dataframes using "Python" and "Pandas"?
Thanks!
/edit
One set:
DATE GDP GPDI NFLS
0 1947-01-01 243.1 35.9 112.815
1 1947-04-01 246.3 34.5 111.253
2 1947-07-01 250.1 34.9 113.023
3 1947-10-01 260.3 43.2 111.440
The other one:
DATE INDPRO M08354USM310NNBR GDP
(...)
334 1946-11-01 13.3916 NaN NaN
335 1946-12-01 13.4721 NaN NaN
336 1947-01-01 13.6332 42.8 NaN
337 1947-02-01 13.7137 42.5 NaN
Together I would like to join them, such that
DATE INDPRO M08354USM310NNBR GDP GPDI NFLS
1946-11-01 13.3916 NaN NaN NaN NaN
1946-12-01 13.4712 NaN NaN NaN NaN
1947-01-01 13.6332 42.8 243.1 35.9 112.815
1947-02-01 13.7137 42.5 NaN NaN NaN
(...)
Just perform an outer merge; the fact that the periods are different and don't overlap actually suits you here:
merged = df1.merge(df2, on='DATE', how='outer')
merged
Out[54]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
[7 rows x 7 columns]
You can then rename, fill, or drop the duplicated 'GDP' columns.
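For instance, a short sketch to coalesce the two GDP columns, assuming (as in the sample) that at most one of the pair is non-null per row:
# prefer the quarterly GDP (GDP_x), fall back to the monthly frame's GDP_y
merged['GDP'] = merged['GDP_x'].fillna(merged['GDP_y'])
merged = merged.drop(columns=['GDP_x', 'GDP_y'])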
To sort by the merged 'DATE' column just call sort_values:
In [57]:
merged.sort_values('DATE')
Out[57]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
[7 rows x 7 columns]
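If you want the merge and the date ordering in one step, pd.merge_ordered is another option worth a look; it performs an ordered merge designed for time-series-like data:
merged = pd.merge_ordered(df1, df2, on='DATE', how='outer')  # 'outer' is the default here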
