Pandas - how to merge dataframes on datetime column of different format? - python

I have two dataframes that I need to merge based on date. The first dataframe looks like:
Time Stamp HP_1H_mean Coolant1_1H_mean Extreme_1H_mean
0 2019-07-26 07:00:00 410.637966 414.607081 0.0
1 2019-07-26 08:00:00 403.521735 424.787366 0.0
2 2019-07-26 09:00:00 403.143925 425.739639 0.0
3 2019-07-26 10:00:00 410.542895 426.210538 0.0
...
17 2019-07-27 00:00:00 0.000000 0.000000 0.0
18 2019-07-27 01:00:00 0.000000 0.000000 0.0
19 2019-07-27 02:00:00 0.000000 0.000000 0.0
20 2019-07-27 03:00:00 0.000000 0.000000 0.0
The second is like this:
Time Stamp Qty Compl
0 2019-07-26 150
1 2019-07-27 20
2 2019-07-29 230
3 2019-07-30 230
4 2019-07-31 170
Both Time Stamp columns are datetime64[ns]. I want to do a left merge and forward-fill the daily value into all the other rows for that day. My problem is that, at the merge, the Qty Compl from the second df is applied only at midnight of each day, and some days do not have a midnight time stamp, such as the first day in the first dataframe.
Is there a way to merge and match every row that contains the same day? The desired output would look like this:
Time Stamp HP_1H_mean Coolant1_1H_mean Extreme_1H_mean Qty Compl
0 2019-07-26 07:00:00 410.637966 414.607081 0.0 150
1 2019-07-26 08:00:00 403.521735 424.787366 0.0 150
2 2019-07-26 09:00:00 403.143925 425.739639 0.0 150
3 2019-07-26 10:00:00 410.542895 426.210538 0.0 150
...
17 2019-07-27 00:00:00 0.000000 0.000000 0.0 20
18 2019-07-27 01:00:00 0.000000 0.000000 0.0 20
19 2019-07-27 02:00:00 0.000000 0.000000 0.0 20
20 2019-07-27 03:00:00 0.000000 0.000000 0.0 20

Use merge_asof, with both DataFrames sorted by their datetime column:
#if necessary
df1['Time Stamp'] = pd.to_datetime(df1['Time Stamp'])
df2['Time Stamp'] = pd.to_datetime(df2['Time Stamp'])
df1 = df1.sort_values('Time Stamp')
df2 = df2.sort_values('Time Stamp')
df = pd.merge_asof(df1, df2, on='Time Stamp')
print (df)
Time Stamp HP_1H_mean Coolant1_1H_mean Extreme_1H_mean \
0 2019-07-26 07:00:00 410.637966 414.607081 0.0
1 2019-07-26 08:00:00 403.521735 424.787366 0.0
2 2019-07-26 09:00:00 403.143925 425.739639 0.0
3 2019-07-26 10:00:00 410.542895 426.210538 0.0
4 2019-07-27 00:00:00 0.000000 0.000000 0.0
5 2019-07-27 01:00:00 0.000000 0.000000 0.0
6 2019-07-27 02:00:00 0.000000 0.000000 0.0
7 2019-07-27 03:00:00 0.000000 0.000000 0.0
Qty Compl
0 150
1 150
2 150
3 150
4 20
5 20
6 20
7 20
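One caveat with merge_asof is that, by default, it picks the most recent earlier key, so a day that is absent from the second frame inherits the previous day's Qty Compl. If an exact same-day match is wanted instead, a minimal sketch is to merge on the timestamp truncated to midnight. The toy values and the helper column name "Day" below are assumptions for illustration, not part of the original data:

```python
import pandas as pd

# Toy data shaped like the question's frames (values are illustrative only).
df1 = pd.DataFrame({
    "Time Stamp": pd.to_datetime([
        "2019-07-26 07:00:00", "2019-07-26 08:00:00",
        "2019-07-27 00:00:00", "2019-07-27 01:00:00",
        "2019-07-28 05:00:00",   # a day with no entry in df2
    ]),
    "HP_1H_mean": [410.6, 403.5, 0.0, 0.0, 1.0],
})
df2 = pd.DataFrame({
    "Time Stamp": pd.to_datetime(["2019-07-26", "2019-07-27", "2019-07-29"]),
    "Qty Compl": [150, 20, 230],
})

# "Day" is a hypothetical helper column: the timestamp truncated to midnight.
merged = (
    df1.assign(Day=df1["Time Stamp"].dt.normalize())
       .merge(df2.rename(columns={"Time Stamp": "Day"}), on="Day", how="left")
       .drop(columns="Day")
)
print(merged)
```

With this variant the 2019-07-28 row gets NaN rather than the value from 2019-07-27, which may or may not be what you want.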

Related

Fill missing dates in a pandas DataFrame

I’ve a lot of DataFrames with 2 columns, like this:
Fecha unidades
0 2020-01-01 2.0
84048 2020-09-01 4.0
149445 2020-10-01 11.0
532541 2020-11-01 4.0
660659 2020-12-01 2.0
1515682 2021-03-01 9.0
1563644 2021-04-01 2.0
1759823 2021-05-01 1.0
2226586 2021-07-01 1.0
As can be seen, some months are missing. Which months are missing depends on the DataFrame: it can have 2 months, 10, all of them, or only one. I need to complete the "Fecha" column with the missing months (from 2020-01-01 to 2021-12-01) and, whenever a date is added to "Fecha", set the "unidades" column to 0 for that row.
Each element in the Fecha column is of class 'pandas._libs.tslibs.timestamps.Timestamp'.
How can I fill in the missing dates for each DataFrame?
You could create a date range and use "Fecha" column to set_index + reindex to add missing months. Then fillna + reset_index fetches the desired outcome:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = (df.set_index('Fecha')
        .reindex(pd.date_range('2020-01-01', '2021-12-01', freq='MS'))
        .rename_axis(['Fecha'])
        .fillna(0)
        .reset_index())
Output:
Fecha unidades
0 2020-01-01 2.0
1 2020-02-01 0.0
2 2020-03-01 0.0
3 2020-04-01 0.0
4 2020-05-01 0.0
5 2020-06-01 0.0
6 2020-07-01 0.0
7 2020-08-01 0.0
8 2020-09-01 4.0
9 2020-10-01 11.0
10 2020-11-01 4.0
11 2020-12-01 2.0
12 2021-01-01 0.0
13 2021-02-01 0.0
14 2021-03-01 9.0
15 2021-04-01 2.0
16 2021-05-01 1.0
17 2021-06-01 0.0
18 2021-07-01 1.0
19 2021-08-01 0.0
20 2021-09-01 0.0
21 2021-10-01 0.0
22 2021-11-01 0.0
23 2021-12-01 0.0
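Since the question mentions many DataFrames, the same steps could be wrapped in a small function and applied to each one. This is only a sketch: the function name fill_missing_months and the sample frame are made up for illustration, it uses reindex's fill_value argument instead of a separate fillna, and it assumes "Fecha" has no duplicate dates (reindex needs a unique index):

```python
import pandas as pd

def fill_missing_months(df, start="2020-01-01", end="2021-12-01"):
    """Return a copy with one row per month start; absent months get 0 units."""
    full_range = pd.date_range(start, end, freq="MS")  # MS = month start
    out = df.copy()
    out["Fecha"] = pd.to_datetime(out["Fecha"])
    return (out.set_index("Fecha")
               .reindex(full_range, fill_value=0)  # new rows are filled with 0
               .rename_axis("Fecha")
               .reset_index())

# Illustrative input with only three of the 24 months present.
sample = pd.DataFrame({"Fecha": ["2020-01-01", "2020-09-01", "2021-03-01"],
                       "unidades": [2.0, 4.0, 9.0]})
filled = fill_missing_months(sample)
print(filled)
```

Note that the original integer index (0, 84048, ...) is discarded, as it is in the answer above.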

Date time conversion to pandas datetime64[updated]

I have a series of 40 years of data in the format stn;yyyymmddhh;rainfall. I want to convert the data into datetime64 format. When I convert it to datetime with the code below, the elements are reported as pandas._libs.tslibs.timestamps.Timestamp, but I need them in pandas datetime format. Basically, I want to convert, for example, 1981010100, which is numpy.int64, into datetime64.
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df = pd.to_datetime(data.yyyy, format='%Y-%m-%d')
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You can use pd.to_datetime() together with format= parameter, as follows:
df['yyyymmddhh'] = pd.to_datetime(df['yyyymmddhh'], format='%Y%m%d%H')
Output:
print(df)
Stn yyyymmddhh rainfall
0 xyz 1981-01-01 00:00:00 0.0
1 xyz 1981-01-01 01:00:00 0.0
2 xyz 1981-01-01 02:00:00 0.0
3 xyz 1981-01-01 03:00:00 0.0
4 xyz 1981-01-01 04:00:00 0.0
5 xyz 1981-01-01 05:00:00 0.0
6 xyz 1981-01-01 06:00:00 0.0
7 xyz 1981-01-01 07:00:00 0.0
8 xyz 1981-01-01 08:00:00 0.0
9 xyz 1981-01-01 09:00:00 0.4
10 xyz 1981-01-01 10:00:00 0.6
11 xyz 1981-01-01 11:00:00 0.1
12 xyz 1981-01-01 12:00:00 0.1
13 xyz 1981-01-01 13:00:00 0.0
14 xyz 1981-01-01 14:00:00 0.1
15 xyz 1981-01-01 15:00:00 0.6
16 xyz 1981-01-01 16:00:00 0.0
17 xyz 1981-01-01 17:00:00 0.0
18 xyz 1981-01-01 18:00:00 0.2
19 xyz 1981-01-01 19:00:00 0.0
20 xyz 1981-01-01 20:00:00 0.0
21 xyz 1981-01-01 21:00:00 0.0
22 xyz 1981-01-01 22:00:00 0.0
23 xyz 1981-01-01 23:00:00 0.0
24 xyz 1981-01-02 00:00:00 0.0
I believe this should fit the bill for you.
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df['date'] = pd.to_datetime(df['yyyymmddhh'], format='%Y%m%d%H')
df['formatted'] = pd.to_datetime(df['date'].dt.strftime('%Y-%m-%d %H:%M:%S'))
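On the Timestamp concern in the question: a column converted with pd.to_datetime has a datetime64 dtype as a whole, while a single element pulled out of it is shown as a pandas Timestamp, so seeing Timestamp for one value does not mean the conversion failed. A small check, using a made-up three-row frame in place of data.txt and casting to str first so the format string is matched against text:

```python
import pandas as pd

# Illustrative stand-in for the first rows of data.txt.
df = pd.DataFrame({"Stn": ["xyz", "xyz", "xyz"],
                   "yyyymmddhh": [1981010100, 1981010101, 1981010102],
                   "rainfall": [0.0, 0.0, 0.4]})

# Cast to str so "%Y%m%d%H" is matched against text such as "1981010100".
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"].astype(str), format="%Y%m%d%H")

print(df["yyyymmddhh"].dtype)          # dtype of the whole column
print(type(df["yyyymmddhh"].iloc[0]))  # type of one element
```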

Repeating dates in pandas DataFrame without hour format

I'm trying to insert a range of date labels in my dataframe, df1. I've managed part of the way, but I still have some bumps that I want to smooth out.
I'm trying to generate a column with dates from 2017-01-01 to 2020-12-31 with all dates repeated 24 times, i.e., a column with 35,068 rows.
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
repeated_dates = pd.DataFrame(dates.repeat(num_repeats))
df1.insert(0, 'Date', repeated_dates)
However, it only generates some iterations of the last date, meaning that my column will be NaT for the remaining x hours.
output:
Date DK1 Up DK1 Down DK2 Up DK2 Down
0 2017-01-01 0.0 0.0 0.0 0.0
1 2017-01-01 0.0 0.0 0.0 0.0
2 2017-01-01 0.0 0.0 0.0 0.0
3 2017-01-01 0.0 0.0 0.0 0.0
4 2017-01-01 0.0 0.0 0.0 0.0
... ... ... ... ... ...
35063 2020-12-31 0.0 0.0 0.0 0.0
35064 NaT 0.0 0.0 0.0 0.0
35065 NaT 0.0 -54.1 0.0 0.0
35066 NaT 25.5 0.0 0.0 0.0
35067 NaT 0.0 0.0 0.0 0.0
Furthermore, how can I change the date format from '2017-01-01' to '01-01-2017'?
You set this up perfectly, so here are the dates that you have:
import pandas as pd
import numpy as np
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
num_repeats = 24
df = pd.DataFrame(dates.repeat(num_repeats),columns=['date'])
and converting the column to the format you want is simple with the strftime function
df['newFormat'] = df['date'].dt.strftime('%d-%m-%Y')
Which gives
date newFormat
0 2017-01-01 01-01-2017
1 2017-01-01 01-01-2017
2 2017-01-01 01-01-2017
3 2017-01-01 01-01-2017
4 2017-01-01 01-01-2017
... ... ...
35059 2020-12-31 31-12-2020
35060 2020-12-31 31-12-2020
35061 2020-12-31 31-12-2020
35062 2020-12-31 31-12-2020
35063 2020-12-31 31-12-2020
now
dates = pd.date_range(start="01-01-2017", end="31-12-2020")
gives
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10',
...
'2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
'2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
'2020-12-30', '2020-12-31'],
dtype='datetime64[ns]', length=1461, freq='D')
and
1461 * 24 = 35064
so I am not sure where 35,068 comes from. Are you sure about that number?
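The length mismatch would also explain the NaT rows. A quick sketch, assuming df1 really has 35,068 rows with a default 0..n-1 index (the one-column df1 below is a hypothetical stand-in): inserting a shorter Series aligns on the index, so the rows beyond the last date are left as NaT.

```python
import pandas as pd

# ISO-style strings avoid any day/month ambiguity when parsing.
dates = pd.date_range(start="2017-01-01", end="2020-12-31")
repeated = dates.repeat(24)
print(len(dates), len(repeated))

# Hypothetical stand-in for df1, four rows longer than the repeated dates.
df1 = pd.DataFrame({"DK1 Up": 0.0}, index=range(len(repeated) + 4))
df1.insert(0, "Date", pd.Series(repeated))  # aligned on index labels 0..35063
print(df1["Date"].isna().sum())             # rows left without a date
```

If the four extra rows are genuine data, it is worth checking which hours they belong to before deciding how to label them.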

How can I make a RainDay column from Precipitation values?

keytable
Out[66]:
datahora pp pres ... WeekDay Power_kW Power_kW18
Month Day Hour ...
1 3 0 2019-01-03 00:00 0.0 1027.6 ... 3 77.303046 117.774419
1 2019-01-03 01:00 0.0 1027.0 ... 3 72.319602 110.710928
2 2019-01-03 02:00 0.0 1027.0 ... 3 71.831852 106.067667
3 2019-01-03 03:00 0.0 1027.0 ... 3 69.555751 106.325955
4 2019-01-03 04:00 0.0 1027.0 ... 3 69.525780 102.855393
... ... ... ... ... ... ...
12 30 19 2019-12-30 19:00 0.0 1031.5 ... 0 72.590489 89.749535
20 2019-12-30 20:00 0.0 1032.0 ... 0 71.444516 87.691824
21 2019-12-30 21:00 0.0 1032.0 ... 0 68.940099 87.242445
22 2019-12-30 22:00 0.0 1032.0 ... 0 67.244716 83.618018
23 2019-12-30 23:00 0.0 1032.0 ... 0 68.531573 81.288847
[8637 rows x 12 columns]
I have this dataframe and I want to go through each day's values of 'pp' (precipitation) to see if it rained within a 24-hour period, by creating a column called 'rainday' that is set to 1 if a certain threshold of 'pp' is exceeded during the day. How can I do it?
Use groupby with max and compare with your threshold:
threshold = 1
df["rainday"] = (df.reset_index().groupby(["Month","Day"])["pp"].max()
.gt(threshold).astype(int))
print (df)
datahora pp pres WeekDay Power_kW Power_kW18 rainday
Month Day Hour
1 3 0 2019-01-03 00:00 0.0 1027.6 3 77.303046 117.774419 0
1 2019-01-03 01:00 0.0 1027.0 3 72.319602 110.710928 0
2 2019-01-03 02:00 0.0 1027.0 3 71.831852 106.067667 0
3 2019-01-03 03:00 0.0 1027.0 3 69.555751 106.325955 0
4 2019-01-03 04:00 1.0 1027.0 3 69.525780 102.855393 0
12 30 19 2019-12-30 19:00 0.0 1031.5 0 72.590489 89.749535 1
20 2019-12-30 20:00 0.0 1032.0 0 71.444516 87.691824 1
21 2019-12-30 21:00 0.0 1032.0 0 68.940099 87.242445 1
22 2019-12-30 22:00 1.0 1032.0 0 67.244716 83.618018 1
23 2019-12-30 23:00 2.0 1032.0 0 68.531573 81.288847 1
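A variant of the same idea is to use transform, which returns the daily maximum already broadcast to every hourly row, so the result has the same (Month, Day, Hour) index as the frame. This is a sketch on a tiny made-up frame, not the original keytable:

```python
import pandas as pd

# Illustrative frame with the same index levels as in the question.
idx = pd.MultiIndex.from_tuples(
    [(1, 3, 0), (1, 3, 1), (1, 3, 2), (12, 30, 22), (12, 30, 23)],
    names=["Month", "Day", "Hour"],
)
df = pd.DataFrame({"pp": [0.0, 0.0, 1.0, 1.0, 2.0]}, index=idx)

threshold = 1
# Daily maximum of pp, repeated on each row of that day.
daily_max = df.groupby(level=["Month", "Day"])["pp"].transform("max")
# 1 if the day's maximum is strictly above the threshold, else 0.
df["rainday"] = daily_max.gt(threshold).astype(int)
print(df)
```

Here gt means strictly greater than; use ge if a value equal to the threshold should also count as a rain day.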

Pandas how to outer merge on datetime column correctly

I have two dataframes:
resetted.head()
WeightedSentiment Popularity Datetime
0 0 2 2012-11-22 11:00:00
1 0 2 2012-11-22 11:30:00
2 0 4 2012-11-22 12:00:00
3 0 2 2012-11-22 15:00:00
4 0 2 2012-11-22 15:30:00
prices.head()
Open High Low Close Volume Datetime
46623 236.9392 238.6095 236.5392 238.2094 315177 2012-11-23 10:00:00
46624 238.1894 238.3095 236.7492 237.4993 122132 2012-11-23 10:30:00
46625 237.4793 238.2595 237.1393 238.2094 144457 2012-11-23 11:00:00
46626 238.2094 238.9196 238.1694 238.7695 131733 2012-11-23 11:30:00
46627 238.7695 239.1396 237.9394 238.9496 150386 2012-11-23 12:00:00
And I tried to outer join these two dataframes, but by using
pd.merge(prices,resetted,how='outer',on='Datetime')
The result is very strange and seems wrong:
Open High Low Close Volume Datetime WeightedSentiment Popularity
0 236.9392 238.6095 236.5392 238.2094 315177.0 2012-11-23 10:00:00 0.0 20.0
1 238.1894 238.3095 236.7492 237.4993 122132.0 2012-11-23 10:30:00 0.0 12.0
2 237.4793 238.2595 237.1393 238.2094 144457.0 2012-11-23 11:00:00 0.0 12.0
3 238.2094 238.9196 238.1694 238.7695 131733.0 2012-11-23 11:30:00 0.0 2.0
4 238.7695 239.1396 237.9394 238.9496 150386.0 2012-11-23 12:00:00 0.0 12.0
5 238.7995 242.0301 238.0394 241.5900 1183601.0 2012-11-23 12:30:00 0.0 16.0
If I swap the two dataframes' positions in the merge function, there are NaN values at the head as expected, but the other rows are wrong. I have set up a demo notebook on GitHub.
I'm on pandas 0.21.0
