I have this data on a csv file:
Date/Time kWh kVArh kVA PF
0 2021-01-01 00:30:00 471.84 0.00 943.6800 1.0000
1 2021-01-01 01:00:00 491.04 1.44 982.0842 1.0000
2 2021-01-01 01:30:00 475.20 0.00 950.4000 1.0000
3 2021-01-01 02:00:00 470.88 0.00 941.7600 1.0000
4 2021-01-01 02:30:00 466.56 0.00 933.1200 1.0000
... ... ... ... ... ...
9223 2021-07-14 04:00:00 1104.00 53.28 2210.5698 0.9988
9224 2021-07-14 04:30:00 1156.30 49.92 2314.7542 0.9991
9225 2021-07-14 05:00:00 1176.00 37.92 2353.2224 0.9995
9226 2021-07-14 05:30:00 1177.00 27.36 2354.6359 0.9997
9227 2021-07-14 06:00:00 1196.60 22.56 2393.6253 0.9998
And I use this code to read it and later export it to a csv file, after I calculate the average for every hour.
import pandas as pd
file = pd.read_csv('Electricity_data.csv',
sep = ',',
skiprows = 0,
dayfirst = True,
parse_dates = ['Date/Time'])
pd_mean = file.groupby(pd.Grouper(key = 'Date/Time', freq = 'H')).mean().reset_index()
pd_mean.to_csv("data_1h_year_.csv")
However, when I run it, my final file has a gap.
Data before the code launches (Date: 03/01/2021):
Date/Time kWh kVArh kVA PF
90 2021-02-01 21:30:00 496.83 0.00 993.6600 1.0
91 2021-02-01 22:00:00 486.72 0.00 973.4400 1.0
92 2021-02-01 22:30:00 490.08 0.00 980.1600 1.0
93 2021-02-01 23:00:00 503.00 1.92 1006.0073 1.0
94 2021-02-01 23:30:00 484.84 0.00 969.6800 1.0
95 2021-03-01 00:00:00 484.80 0.00 969.6000 1.0
96 2021-03-01 00:30:00 487.68 0.00 975.3600 1.0
97 2021-03-01 01:00:00 508.30 1.44 1016.6041 1.0
98 2021-03-01 01:30:00 488.66 0.00 977.3200 1.0
99 2021-03-01 02:00:00 486.24 0.00 972.4800 1.0
100 2021-03-01 02:30:00 495.36 1.44 990.7242 1.0
101 2021-03-01 03:00:00 484.32 0.00 968.6400 1.0
102 2021-03-01 03:30:00 485.76 0.00 971.5200 1.0
103 2021-03-01 04:00:00 492.48 1.44 984.9642 1.0
104 2021-03-01 04:30:00 476.16 0.00 952.3200 1.0
105 2021-03-01 05:00:00 477.12 0.00 954.2400 1.0
Data after the code launches (Date: 03/01/2021):
Date/Time kWh kVArh kVA PF
45 2021-01-02 21:00:00 1658.650 292.32 3368.45000 0.98485
46 2021-01-02 22:00:00 1622.150 291.60 3296.34415 0.98420
47 2021-01-02 23:00:00 1619.300 261.36 3280.52380 0.98720
48 2021-01-03 00:00:00 NaN NaN NaN NaN
49 2021-01-03 01:00:00 NaN NaN NaN NaN
50 2021-01-03 02:00:00 NaN NaN NaN NaN
51 2021-01-03 03:00:00 NaN NaN NaN NaN
52 2021-01-03 04:00:00 NaN NaN NaN NaN
53 2021-01-03 05:00:00 NaN NaN NaN NaN
54 2021-01-03 06:00:00 1202.400 158.40 2425.57730 0.99140
55 2021-01-03 07:00:00 1209.375 168.00 2441.98105 0.99050
56 2021-01-03 08:00:00 1260.950 162.72 2542.89820 0.99175
57 2021-01-03 09:00:00 1308.975 195.60 2647.07935 0.98900
58 2021-01-03 10:00:00 1334.150 193.20 2696.17005 0.98965
I do not know why this is happening, but it didn't calculate the mean values and I got the NaN forming gaps around the final csv file.
Pandas does not interpret correctly your dates. Specify the format yourself.
Use the code below to solve your problem:
parser = lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M')
df = pd.read_csv('data.csv', sep=',', skiprows=0,
parse_dates=['Date/Time'], date_parser=parser)
pd_mean = df.groupby(pd.Grouper(key='Date/Time', freq='H')).mean()
Check your dates before operation:
93 2021-02-01 23:00:00 # February, 1st
94 2021-02-01 23:30:00 # February, 1st
95 2021-03-01 00:00:00 # March, 1st
96 2021-03-01 00:30:00 # March, 1st
Related
I am trying to visualise rain events using a data contained in a dataframe.
the idea seems very simple, but the execution seems to be impossible!
here is a part of the dataframe:
start_time end_time duration br_open_total
0 2022-01-01 10:00:00 2022-01-01 19:00:00 9.0 0.2540000563879943
1 2022-01-02 00:00:00 2022-01-02 10:00:00 10.0 1.0160002255520624
2 2022-01-02 17:00:00 2022-01-03 02:00:00 9.0 0.7620001691640113
3 2022-01-03 02:00:00 2022-01-04 12:00:00 34.0 10.668002368296513
4 2022-01-07 21:00:00 2022-01-08 06:00:00 9.0 0.2540000563879943
5 2022-01-16 05:00:00 2022-01-16 20:00:00 15.0 0.5080001127760454
6 2022-01-19 04:00:00 2022-01-19 17:00:00 13.0 0.7620001691640255
7 2022-01-21 14:00:00 2022-01-22 00:00:00 10.0 1.5240003383280751
8 2022-01-27 02:00:00 2022-01-27 16:00:00 14.0 3.0480006766561503
9 2022-02-01 12:00:00 2022-02-01 21:00:00 9.0 0.2540000563880126
10 2022-02-03 05:00:00 2022-02-03 15:00:00 10.0 0.5080001127760251
What I want to do is have a plot with time on the x axis, and the value of the 'br_open_total' on the y axis.
I can draw what I want it to look like, see below:
I apologise for the simplicity of the drawing, but I think it explains what I want to do.
How do I do this, and then repeat for other dataframes on the same plot.
I have tried staircase, matplotlib.pyplot.stair and others with no success.
It seems such a simple concept!
Edit 1:
Tried Joswin K J's answer with the actual data, and got this:
The event at 02-12 11:00 should be 112 hours duration, but the bar is the same width as all the others.
Edit2:
Tried Mozway's answer and got this:
Still doesn't show width of each event, and doesn't discretise the events either
Edit 3:
Using Mozway's amended answer I get this plot for the actual data:
I have added the cursor position using paint, at the top right of the plot you can see that the cursor is at 2022-02-09 and 20.34, which is actually the value for 2022-02-01, so it seems that the plot is shifted to the left by one data point?, also the large block between 2022-3-01 and 2022-04-03 doesn't seem to be in the data
edit 4: as requested by Mozway
Reshaped Data
duration br_open_total variable date
0 10.0 1.0160002255520624 start_time 2022-01-02 00:00:00
19 10.0 0.0 end_time 2022-01-02 10:00:00
1 9.0 0.7620001691640113 start_time 2022-01-02 17:00:00
2 34.0 10.668002368296513 start_time 2022-01-03 02:00:00
21 34.0 0.0 end_time 2022-01-04 12:00:00
3 15.0 0.5080001127760454 start_time 2022-01-16 05:00:00
22 15.0 0.0 end_time 2022-01-16 20:00:00
4 13.0 0.7620001691640255 start_time 2022-01-19 04:00:00
23 13.0 0.0 end_time 2022-01-19 17:00:00
5 10.0 1.5240003383280751 start_time 2022-01-21 14:00:00
24 10.0 0.0 end_time 2022-01-22 00:00:00
6 14.0 3.0480006766561503 start_time 2022-01-27 02:00:00
25 14.0 0.0 end_time 2022-01-27 16:00:00
7 10.0 0.5080001127760251 start_time 2022-02-03 05:00:00
26 10.0 0.0 end_time 2022-02-03 15:00:00
8 18.0 7.366001635252363 start_time 2022-02-03 23:00:00
27 18.0 0.0 end_time 2022-02-04 17:00:00
9 13.0 2.28600050749211 start_time 2022-02-05 11:00:00
28 13.0 0.0 end_time 2022-02-06 00:00:00
10 19.0 2.2860005074921173 start_time 2022-02-06 04:00:00
29 19.0 0.0 end_time 2022-02-06 23:00:00
11 13.0 1.2700002819400584 start_time 2022-02-07 11:00:00
30 13.0 0.0 end_time 2022-02-08 00:00:00
12 12.0 2.79400062026814 start_time 2022-02-09 01:00:00
31 12.0 0.0 end_time 2022-02-09 13:00:00
13 112.0 20.320004511041 start_time 2022-02-12 11:00:00
32 112.0 0.0 end_time 2022-02-17 03:00:00
14 28.0 2.0320004511041034 start_time 2022-02-18 14:00:00
33 28.0 0.0 end_time 2022-02-19 18:00:00
15 17.0 17.272003834384847 start_time 2022-02-23 17:00:00
34 17.0 0.0 end_time 2022-02-24 10:00:00
16 9.0 0.7620001691640397 start_time 2022-02-27 13:00:00
35 9.0 0.0 end_time 2022-02-27 22:00:00
17 18.0 4.0640009022082 start_time 2022-04-04 00:00:00
36 18.0 0.0 end_time 2022-04-04 18:00:00
18 15.0 1.0160002255520482 start_time 2022-04-06 05:00:00
37 15.0 0.0 end_time 2022-04-06 20:00:00
when plotted using
plt.step(bdf2['date'], bdf2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
plt.xticks(rotation=90)
produces the plot shown above, in which the top left corner of a block corresponds to the previous data point.
edit 5: further info
When I plot all my dataframes (different sensors) I get the same differential on the event start and end times?
You can use a step plot:
# ensure datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# reshape the data
df2 = (df
.melt(id_vars=['duration', 'br_open_total'], value_name='date')
.sort_values(by='date')
.drop_duplicates(subset='date')
.assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
# plot
import matplotlib.pyplot as plt
plt.step(df2['date'], df2['br_open_total'])
plt.gcf().set_size_inches(10, 4)
output:
reshaped data:
duration br_open_total variable date
0 9.0 0.254000 start_time 2022-01-01 10:00:00
11 9.0 0.000000 end_time 2022-01-01 19:00:00
1 10.0 1.016000 start_time 2022-01-02 00:00:00
12 10.0 0.000000 end_time 2022-01-02 10:00:00
2 9.0 0.762000 start_time 2022-01-02 17:00:00
3 34.0 10.668002 start_time 2022-01-03 02:00:00
14 34.0 0.000000 end_time 2022-01-04 12:00:00
4 9.0 0.254000 start_time 2022-01-07 21:00:00
15 9.0 0.000000 end_time 2022-01-08 06:00:00
5 15.0 0.508000 start_time 2022-01-16 05:00:00
16 15.0 0.000000 end_time 2022-01-16 20:00:00
6 13.0 0.762000 start_time 2022-01-19 04:00:00
17 13.0 0.000000 end_time 2022-01-19 17:00:00
7 10.0 1.524000 start_time 2022-01-21 14:00:00
18 10.0 0.000000 end_time 2022-01-22 00:00:00
8 14.0 3.048001 start_time 2022-01-27 02:00:00
19 14.0 0.000000 end_time 2022-01-27 16:00:00
9 9.0 0.254000 start_time 2022-02-01 12:00:00
20 9.0 0.000000 end_time 2022-02-01 21:00:00
10 10.0 0.508000 start_time 2022-02-03 05:00:00
21 10.0 0.000000 end_time 2022-02-03 15:00:00
Try this:
import matplotlib.pyplot as plt
for ind,row in df.iterrows():
plt.plot(pd.Series([row['start_time'],row['end_time']]),pd.Series([row['br_open_total'],row['br_open_total']]),color='b')
plt.plot(pd.Series([row['start_time'],row['start_time']]),pd.Series([0,row['br_open_total']]),color='b')
plt.plot(pd.Series([row['end_time'],row['end_time']]),pd.Series([0,row['br_open_total']]),color='b')
plt.xticks(rotation=90)
Result:
I believe I have now cracked it, with a great debt of thanks to #Mozway.
The code to restructure the dataframe for plotting:
#create dataframes of each open gauge events removing any event with an open total of less than 0.254mm
#bresser/open
bdftdf=bdf.loc[bdf['br_open_total'] > 0.255]
bdftdf=bdftdf.copy()
bdftdf['start_time'] = pd.to_datetime(bdftdf['start_time'])
bdftdf['end_time'] = pd.to_datetime(bdftdf['end_time'])
bdf2 = (bdftdf
.melt(id_vars=['duration', 'ic_total','mc_total','md_total','imd_total','oak_total','highpoint_total','school_total','br_open_total',
'fr_gauge_total','open_mean_total','br_open_ic_%_int','br_open_mc_%_int','br_open_md_%_int','br_open_imd_%_int',
'br_open_oak_%_int'], value_name='date')
.sort_values(by='date')
#.drop_duplicates(subset='date')
.assign(br_open_total=lambda d: d['br_open_total'].mask(d['variable'].eq('end_time'), 0))
)
#create array for stairs plot
bdfarr=np.array(bdf2['date'])
bl=len(bdf2)
bdfarr=np.append(bdfarr,[bdfarr[bl-1]+np.timedelta64(1,'h')])
Rather than use the plt.step plot as suggested by Mozway, I have used plt.stairs, after creating an array of the 'date' column in the dataframe and appending an extra element to that array equal to the last element =1hour.
This means that the data now plots as I had intended it to.:
code for plot:
fig1=plt.figure()
plt.stairs(bdf2['br_open_total'], bdfarr, label='Bresser\Open')
plt.stairs(frdf2['fr_gauge_total'], frdfarr, label='FR Gauge')
plt.stairs(hpdf2['highpoint_total'], hpdfarr, label='Highpoint')
plt.stairs(schdf2['school_total'], schdfarr, label='School')
plt.stairs(opmedf2['open_mean_total'], opmedfarr, label='Open mean')
plt.xticks(rotation=90)
plt.legend(title='Rain events', loc='best')
plt.show()
import numpy as np
import pandas as pd
import xarray as xr
validIdx = np.ones(365*5, dtype= bool)
validIdx[np.random.randint(low=0, high=365*5, size=30)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()
In the above example, the time-series data ds (or df) has 30 randomly chosen missing records without having those as NaNs. Therefore, the length of data is 365x5 - 30, not 365x5).
My question is this: how can I expand the ds and df to have the 30 missing values as NaNs (so, the length will be 365x5)? For example, if a value in "2000-12-02" is missed in the example data, then it will look like:
...
2000-12-01 value 1
2000-12-03 value 2
...
, while what I want to have is:
...
2000-12-01 value 1
2000-12-02 NaN
2000-12-03 value 2
...
Perhaps you can try resample with 1 hour.
The df without NaNs (just after df = ds.to_dataframe()):
>>> df
foo
time
2000-01-01 00:00:00 0
2000-01-01 01:00:00 1
2000-01-01 02:00:00 2
2000-01-01 03:00:00 3
2000-01-01 04:00:00 4
... ...
2000-03-16 20:00:00 1820
2000-03-16 21:00:00 1821
2000-03-16 22:00:00 1822
2000-03-16 23:00:00 1823
2000-03-17 00:00:00 1824
[1795 rows x 1 columns]
The df with NaNs (df_1h):
>>> df_1h = df.resample('1H').mean()
>>> df_1h
foo
time
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 4.0
... ...
2000-03-16 20:00:00 1820.0
2000-03-16 21:00:00 1821.0
2000-03-16 22:00:00 1822.0
2000-03-16 23:00:00 1823.0
2000-03-17 00:00:00 1824.0
[1825 rows x 1 columns]
Rows with NaN:
>>> df_1h[df_1h['foo'].isna()]
foo
time
2000-01-02 10:00:00 NaN
2000-01-04 07:00:00 NaN
2000-01-05 06:00:00 NaN
2000-01-09 02:00:00 NaN
2000-01-13 15:00:00 NaN
2000-01-16 16:00:00 NaN
2000-01-18 21:00:00 NaN
2000-01-21 22:00:00 NaN
2000-01-23 19:00:00 NaN
2000-01-24 01:00:00 NaN
2000-01-24 19:00:00 NaN
2000-01-27 12:00:00 NaN
2000-01-27 16:00:00 NaN
2000-01-29 06:00:00 NaN
2000-02-02 01:00:00 NaN
2000-02-06 13:00:00 NaN
2000-02-09 11:00:00 NaN
2000-02-15 12:00:00 NaN
2000-02-15 15:00:00 NaN
2000-02-21 04:00:00 NaN
2000-02-28 05:00:00 NaN
2000-02-28 06:00:00 NaN
2000-03-01 15:00:00 NaN
2000-03-02 18:00:00 NaN
2000-03-04 18:00:00 NaN
2000-03-05 20:00:00 NaN
2000-03-12 08:00:00 NaN
2000-03-13 20:00:00 NaN
2000-03-16 01:00:00 NaN
The number of NaNs in df_1h:
>>> df_1h.isnull().sum()
foo 30
dtype: int64
Why do I receive Nan for rolling mean? Here's a code and output for this code. Initially I thought my data's wrong but simple .mean() works OK.
print(df_train.head())
y_hat_avg['mean'] = df_train['pickups'].mean()
print(y_hat_avg.head())
y_hat_avg['moving_avg_forecast'] = df_train['pickups'].rolling(1).mean()
print(y_hat_avg.head())
Added some data:
...................................................................
pickups
date
2014-04-01 00:00:00 12
2014-04-01 01:00:00 5
2014-04-01 02:00:00 2
2014-04-01 03:00:00 4
2014-04-01 04:00:00 3
pickups mean
date
2014-08-01 00:00:00 19 47.25888
2014-08-01 01:00:00 26 47.25888
2014-08-01 02:00:00 9 47.25888
2014-08-01 03:00:00 4 47.25888
2014-08-01 04:00:00 11 47.25888
pickups mean moving_avg_forecast
date
2014-08-01 00:00:00 19 47.25888 NaN
2014-08-01 01:00:00 26 47.25888 NaN
2014-08-01 02:00:00 9 47.25888 NaN
2014-08-01 03:00:00 4 47.25888 NaN
2014-08-01 04:00:00 11 47.25888 NaN
df_train.index = pd.RangeIndex(len(df_train.index)) fixed the problem for me.
Fairly new to python and pandas here.
I make a query that's giving me back a timeseries. I'm never sure how many data points I receive from the query (run for a single day), but what I do know is that I need to resample them to contain 24 points (one for each hour in the day).
Printing m3hstream gives
[(1479218009000L, 109), (1479287368000L, 84)]
Then I try to make a dataframe df with
df = pd.DataFrame(data = list(m3hstream), columns=['Timestamp', 'Value'])
and this gives me an output of
Timestamp Value
0 1479218009000 109
1 1479287368000 84
Following I do this
daily_summary = pd.DataFrame()
daily_summary['value'] = df['Value'].resample('H').mean()
daily_summary = daily_summary.truncate(before=start, after=end)
print "Now daily summary"
print daily_summary
But this is giving me a TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could anyone please let me know how to resample it so I have 1 point for each hour in the 24 hour period that I'm querying for?
Thanks.
First thing you need to do is convert that 'Timestamp' to an actual pd.Timestamp. It looks like those are milliseconds
Then resample with the on parameter set to 'Timestamp'
df = df.assign(
Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index()
Timestamp Value
0 2016-11-15 13:00:00 109.0
1 2016-11-15 14:00:00 NaN
2 2016-11-15 15:00:00 NaN
3 2016-11-15 16:00:00 NaN
4 2016-11-15 17:00:00 NaN
5 2016-11-15 18:00:00 NaN
6 2016-11-15 19:00:00 NaN
7 2016-11-15 20:00:00 NaN
8 2016-11-15 21:00:00 NaN
9 2016-11-15 22:00:00 NaN
10 2016-11-15 23:00:00 NaN
11 2016-11-16 00:00:00 NaN
12 2016-11-16 01:00:00 NaN
13 2016-11-16 02:00:00 NaN
14 2016-11-16 03:00:00 NaN
15 2016-11-16 04:00:00 NaN
16 2016-11-16 05:00:00 NaN
17 2016-11-16 06:00:00 NaN
18 2016-11-16 07:00:00 NaN
19 2016-11-16 08:00:00 NaN
20 2016-11-16 09:00:00 84.0
If you want to fill those NaN values, use ffill, bfill, or interpolate
df.assign(
Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index().interpolate()
Timestamp Value
0 2016-11-15 13:00:00 109.00
1 2016-11-15 14:00:00 107.75
2 2016-11-15 15:00:00 106.50
3 2016-11-15 16:00:00 105.25
4 2016-11-15 17:00:00 104.00
5 2016-11-15 18:00:00 102.75
6 2016-11-15 19:00:00 101.50
7 2016-11-15 20:00:00 100.25
8 2016-11-15 21:00:00 99.00
9 2016-11-15 22:00:00 97.75
10 2016-11-15 23:00:00 96.50
11 2016-11-16 00:00:00 95.25
12 2016-11-16 01:00:00 94.00
13 2016-11-16 02:00:00 92.75
14 2016-11-16 03:00:00 91.50
15 2016-11-16 04:00:00 90.25
16 2016-11-16 05:00:00 89.00
17 2016-11-16 06:00:00 87.75
18 2016-11-16 07:00:00 86.50
19 2016-11-16 08:00:00 85.25
20 2016-11-16 09:00:00 84.00
Let's try:
daily_summary = daily_summary.set_index('Timestamp')
daily_summary.index = pd.to_datetime(daily_summary.index, unit='ms')
For once an hour:
daily_summary.resample('H').mean()
or for once a day:
daily_summary.resample('D').mean()
I have two dataframes which are datetimeindexed. One is missing a few of these datetimes (df1) while the other is complete (has regular timestamps without any gaps in this series) and is full of NaN's (df2).
I'm trying to match the values from df1 to the index of df2, filling with NaN's where such a datetimeindex doesn't exist in df1.
Example:
In [51]: df1
Out [51]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-03-01 00:00:00 6
2015-03-01 01:00:00 37
2015-03-01 02:00:00 56
2015-03-01 03:00:00 12
2015-03-01 04:00:00 41
2015-03-01 05:00:00 31
... ...
2018-12-25 23:00:00 41
<34843 rows × 1 columns>
In [52]: df2 = pd.DataFrame(data=None, index=pd.date_range(freq='60Min', start=df1.index.min(), end=df1.index.max()))
df2['value']=np.NaN
df2
Out [52]: value
2015-01-01 14:00:00 NaN
2015-01-01 15:00:00 NaN
2015-01-01 16:00:00 NaN
2015-01-01 17:00:00 NaN
2015-01-01 18:00:00 NaN
2015-01-01 19:00:00 NaN
2015-01-01 20:00:00 NaN
2015-01-01 21:00:00 NaN
2015-01-01 22:00:00 NaN
2015-01-01 23:00:00 NaN
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 NaN
<34906 rows × 1 columns>
Using df2.combine_first(df1) returns the same data as df1.reindex(index= df2.index), which fills any gaps where there shouldn't be data with some value, instead of NaN.
In [53]: Result = df2.combine_first(df1)
Result
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 35
2015-01-02 01:00:00 53
2015-01-02 02:00:00 28
2015-01-02 03:00:00 48
2015-01-02 04:00:00 42
2015-01-02 05:00:00 51
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
This is what I was hoping to get:
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
Could someone shed some light on why this is happening, and how to set how these values are filled?
IIUC you need resample df1, because you have an irregular frequency and you need regular frequency:
print df1.index.freq
None
print Result.index.freq
<60 * Minutes>
EDIT1
You can use function asfreq instead of resample - doc, resample vs asfreq.
EDIT2
First I think that resample didn't work, because after resampling the Result is the same as df1. But I try print df1.info() and print Result.info() gets different results - 34857 entries vs 34920 entries.
So I try to find rows with NaN values and it returns 63 rows.
So I think resample works well.
import pandas as pd
df1 = pd.read_csv('test/GapInTimestamps.csv', sep=",", index_col=[0], parse_dates=[0])
print df1.head()
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print df1.info()
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34857 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Data columns (total 1 columns):
#value 34857 non-null int64
#dtypes: int64(1)
#memory usage: 544.6 KB
#None
Result = df1.resample('60min')
print Result.head()
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print Result.info()
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34920 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Freq: 60T
#Data columns (total 1 columns):
#value 34857 non-null float64
#dtypes: float64(1)
#memory usage: 545.6 KB
#None
#find values with NaN
resultnan = Result[Result.isnull().any(axis=1)]
#temporaly display 999 rows and 15 columns
with pd.option_context('display.max_rows', 999, 'display.max_columns', 15):
print resultnan
# value
#Date/Time
#2015-01-13 19:00:00 NaN
#2015-01-13 20:00:00 NaN
#2015-01-13 21:00:00 NaN
#2015-01-13 22:00:00 NaN
#2015-01-13 23:00:00 NaN
#2015-01-14 00:00:00 NaN
#2015-01-14 01:00:00 NaN
#2015-01-14 02:00:00 NaN
#2015-01-14 03:00:00 NaN
#2015-01-14 04:00:00 NaN
#2015-01-14 05:00:00 NaN
#2015-01-14 06:00:00 NaN
#2015-01-14 07:00:00 NaN
#2015-01-14 08:00:00 NaN
#2015-01-14 09:00:00 NaN
#2015-02-01 00:00:00 NaN
#2015-02-01 01:00:00 NaN
#2015-02-01 02:00:00 NaN
#2015-02-01 03:00:00 NaN
#2015-02-01 04:00:00 NaN
#2015-02-01 05:00:00 NaN
#2015-02-01 06:00:00 NaN
#2015-02-01 07:00:00 NaN
#2015-02-01 08:00:00 NaN
#2015-02-01 09:00:00 NaN
#2015-02-01 10:00:00 NaN
#2015-02-01 11:00:00 NaN
#2015-02-01 12:00:00 NaN
#2015-02-01 13:00:00 NaN
#2015-02-01 14:00:00 NaN
#2015-02-01 15:00:00 NaN
#2015-02-01 16:00:00 NaN
#2015-02-01 17:00:00 NaN
#2015-02-01 18:00:00 NaN
#2015-02-01 19:00:00 NaN
#2015-02-01 20:00:00 NaN
#2015-02-01 21:00:00 NaN
#2015-02-01 22:00:00 NaN
#2015-02-01 23:00:00 NaN
#2015-11-01 00:00:00 NaN
#2015-11-01 01:00:00 NaN
#2015-11-01 02:00:00 NaN
#2015-11-01 03:00:00 NaN
#2015-11-01 04:00:00 NaN
#2015-11-01 05:00:00 NaN
#2015-11-01 06:00:00 NaN
#2015-11-01 07:00:00 NaN
#2015-11-01 08:00:00 NaN
#2015-11-01 09:00:00 NaN
#2015-11-01 10:00:00 NaN
#2015-11-01 11:00:00 NaN
#2015-11-01 12:00:00 NaN
#2015-11-01 13:00:00 NaN
#2015-11-01 14:00:00 NaN
#2015-11-01 15:00:00 NaN
#2015-11-01 16:00:00 NaN
#2015-11-01 17:00:00 NaN
#2015-11-01 18:00:00 NaN
#2015-11-01 19:00:00 NaN
#2015-11-01 20:00:00 NaN
#2015-11-01 21:00:00 NaN
#2015-11-01 22:00:00 NaN
#2015-11-01 23:00:00 NaN