groupby.aggregate and modification then cast / reindex - python

I want to apply a deviation at monthly granularity to a dataframe and then cast the result back onto the initial (hourly) dataframe. First I do a groupby and aggregate; that part works well. Then I reindex and get NaN. I want the reindexing to match the month of each groupby result with the corresponding rows of the initial dataframe.
I would also like to be able to do this operation at other granularities (yearly -> month & year, ...).
Does anyone have a solution to this problem?
>>> df['profile']
date
2015-01-01 00:00:00 3.000000
2015-01-01 01:00:00 3.000143
2015-01-01 02:00:00 3.000287
2015-01-01 03:00:00 3.000430
2015-01-01 04:00:00 3.000574
...
2015-12-31 20:00:00 2.999426
2015-12-31 21:00:00 2.999570
2015-12-31 22:00:00 2.999713
2015-12-31 23:00:00 2.999857
Freq: H, Name: profile, Length: 8760
### Deviation on monthly basis
>>> dev_monthly = np.random.uniform(0.5, 1.5, len(df['profile'].groupby(df.index.month).aggregate(np.sum)))
>>> df['profile_monthly'] = (df['profile'].groupby(df.index.month).aggregate(np.sum) * dev_monthly).reindex(df.index)
>>> df['profile_monthly']
date
2015-01-01 00:00:00 NaN
2015-01-01 01:00:00 NaN
2015-01-01 02:00:00 NaN
...
2015-12-31 22:00:00 NaN
2015-12-31 23:00:00 NaN
Freq: H, Name: profile_monthly, Length: 8760

Check out the documentation for resampling.
You're looking for resample followed by fillna with method='bfill':
In [105]: df = DataFrame({'profile': normal(3, 0.1, size=10000)}, pd.date_range(start='2015-01-01', freq='H', periods=10000))
In [106]: df['profile_monthly'] = df.profile.resample('M', how='sum')
In [107]: df
Out[107]:
profile profile_monthly
2015-01-01 00:00:00 2.8328 NaN
2015-01-01 01:00:00 3.0607 NaN
2015-01-01 02:00:00 3.0138 NaN
2015-01-01 03:00:00 3.0402 NaN
2015-01-01 04:00:00 3.0335 NaN
2015-01-01 05:00:00 3.0087 NaN
2015-01-01 06:00:00 3.0557 NaN
2015-01-01 07:00:00 2.9280 NaN
2015-01-01 08:00:00 3.1359 NaN
2015-01-01 09:00:00 2.9681 NaN
2015-01-01 10:00:00 3.1240 NaN
2015-01-01 11:00:00 3.0635 NaN
2015-01-01 12:00:00 2.9206 NaN
2015-01-01 13:00:00 3.0714 NaN
2015-01-01 14:00:00 3.0688 NaN
2015-01-01 15:00:00 3.0703 NaN
2015-01-01 16:00:00 2.9102 NaN
2015-01-01 17:00:00 2.9368 NaN
2015-01-01 18:00:00 3.0864 NaN
2015-01-01 19:00:00 3.2124 NaN
2015-01-01 20:00:00 2.8988 NaN
2015-01-01 21:00:00 3.0659 NaN
2015-01-01 22:00:00 2.7973 NaN
2015-01-01 23:00:00 3.0824 NaN
2015-01-02 00:00:00 3.0199 NaN
... ...
[10000 rows x 2 columns]
In [108]: df.dropna()
Out[108]:
profile profile_monthly
2015-01-31 2.9769 2230.9931
2015-02-28 2.9930 2016.1045
2015-03-31 2.7817 2232.4096
2015-04-30 3.1695 2158.7834
2015-05-31 2.9040 2236.5962
2015-06-30 2.8697 2162.7784
2015-07-31 2.9278 2231.7232
2015-08-31 2.8289 2236.4603
2015-09-30 3.0368 2163.5916
2015-10-31 3.1517 2233.2285
2015-11-30 3.0450 2158.6998
2015-12-31 2.8261 2228.5550
2016-01-31 3.0264 2229.2221
[13 rows x 2 columns]
In [110]: df.fillna(method='bfill')
Out[110]:
profile profile_monthly
2015-01-01 00:00:00 2.8328 2230.9931
2015-01-01 01:00:00 3.0607 2230.9931
2015-01-01 02:00:00 3.0138 2230.9931
2015-01-01 03:00:00 3.0402 2230.9931
2015-01-01 04:00:00 3.0335 2230.9931
2015-01-01 05:00:00 3.0087 2230.9931
2015-01-01 06:00:00 3.0557 2230.9931
2015-01-01 07:00:00 2.9280 2230.9931
2015-01-01 08:00:00 3.1359 2230.9931
2015-01-01 09:00:00 2.9681 2230.9931
2015-01-01 10:00:00 3.1240 2230.9931
2015-01-01 11:00:00 3.0635 2230.9931
2015-01-01 12:00:00 2.9206 2230.9931
2015-01-01 13:00:00 3.0714 2230.9931
2015-01-01 14:00:00 3.0688 2230.9931
2015-01-01 15:00:00 3.0703 2230.9931
2015-01-01 16:00:00 2.9102 2230.9931
2015-01-01 17:00:00 2.9368 2230.9931
2015-01-01 18:00:00 3.0864 2230.9931
2015-01-01 19:00:00 3.2124 2230.9931
2015-01-01 20:00:00 2.8988 2230.9931
2015-01-01 21:00:00 3.0659 2230.9931
2015-01-01 22:00:00 2.7973 2230.9931
2015-01-01 23:00:00 3.0824 2230.9931
2015-01-02 00:00:00 3.0199 2230.9931
... ...
[10000 rows x 2 columns]
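In current pandas the how= argument to resample no longer exists. A minimal sketch of the equivalent, assuming a reasonably recent pandas version, which also shows how transform can broadcast the monthly sums straight back onto the hourly rows without any bfill:
import numpy as np
import pandas as pd

# Hourly profile for one year (stand-in for the df above).
idx = pd.date_range('2015-01-01', periods=8760, freq='H')
df = pd.DataFrame({'profile': np.random.normal(3, 0.1, len(idx))}, index=idx)

# Modern spelling of resample('M', how='sum'):
monthly = df['profile'].resample('M').sum()

# transform() returns a result aligned to the original hourly index, so the
# monthly totals land on every row of their month with no reindex/bfill step.
# (pandas >= 2.2 prefers the aliases 'h' and 'ME' over 'H' and 'M'.)
df['profile_monthly'] = df['profile'].groupby(pd.Grouper(freq='M')).transform('sum')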

When I use your code, I don't get the same value for 2015-12-31 00:00:00 and 2015-12-31 01:00:00, as you can see below:
>>> df.fillna(method='bfill')[np.logical_and(df.index.month==12, df.index.day==31)]
profile profile_monthly
2015-12-31 00:00:00 2.926504 2232.288997
2015-12-31 01:00:00 3.008543 2234.470731
2015-12-31 02:00:00 2.930133 2234.470731
2015-12-31 03:00:00 3.078552 2234.470731
2015-12-31 04:00:00 3.141578 2234.470731
2015-12-31 05:00:00 3.061820 2234.470731
2015-12-31 06:00:00 2.981626 2234.470731
2015-12-31 07:00:00 3.010749 2234.470731
2015-12-31 08:00:00 2.878577 2234.470731
2015-12-31 09:00:00 2.915487 2234.470731
2015-12-31 10:00:00 3.072721 2234.470731
2015-12-31 11:00:00 3.087866 2234.470731
2015-12-31 12:00:00 3.089208 2234.470731
2015-12-31 13:00:00 2.957047 2234.470731
2015-12-31 14:00:00 3.002072 2234.470731
2015-12-31 15:00:00 3.106656 2234.470731
2015-12-31 16:00:00 3.100891 2234.470731
2015-12-31 17:00:00 3.077835 2234.470731
2015-12-31 18:00:00 3.032497 2234.470731
2015-12-31 19:00:00 2.959838 2234.470731
2015-12-31 20:00:00 2.878819 2234.470731
2015-12-31 21:00:00 3.041171 2234.470731
2015-12-31 22:00:00 3.061970 2234.470731
2015-12-31 23:00:00 3.019011 2234.470731
[24 rows x 2 columns]
So I finally found the following solution:
>>> AA = df.groupby([df.index.year, df.index.month]).aggregate('mean')
>>> AA['dev'] = np.random.normal(0, 1, len(AA))
>>> df['dev'] = AA.loc[list(zip(df.index.year, df.index.month)), 'dev'].values
Short and fast. The only remaining question is:
=> How to deal with other granularities (half year, quarter, week, ...)?
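One way to handle arbitrary granularities is to group with pd.Grouper(freq=...) and broadcast the per-period deviation back with transform. A rough sketch, assuming a recent pandas; the helper name add_deviation and the uniform deviation are just illustrative:
import numpy as np
import pandas as pd

def add_deviation(series, freq):
    """Multiply `series` by a random deviation drawn once per period of `freq`
    ('M' month, 'Q' quarter, 'W' week, '6M' half year, 'Y' year, ...)."""
    # transform broadcasts the single per-group scalar back to every row of the group.
    dev = series.groupby(pd.Grouper(freq=freq)).transform(lambda s: np.random.uniform(0.5, 1.5))
    return series * dev

idx = pd.date_range('2015-01-01', periods=8760, freq='H')
df = pd.DataFrame({'profile': np.random.normal(3, 0.1, len(idx))}, index=idx)
df['profile_quarterly'] = add_deviation(df['profile'], 'Q')   # or 'W', '6M', 'Y', ...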

Related

How to convert hourly data to half hourly

I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to have data every half hour, and impute each new position with the mean of the previous and following values (or any similar interpolation), that is, for example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you
You can use the resample() function in pandas. With it you set the frequency to down/upsample to, and then what to do with the result (mean, sum, etc.). In your case you can also interpolate between the values.
For this to work, your datetime column has to have a datetime dtype and then be set as the index.
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T') and then interpolate.
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
Read more about the frequency strings and resampling in the Pandas docs.
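Putting the steps together, a minimal end-to-end sketch (the sample data here is made up):
import pandas as pd

df = pd.DataFrame({
    'datetime': ['2015-01-01 00:00:00', '2015-01-01 01:00:00',
                 '2015-01-01 02:00:00', '2015-01-01 03:00:00'],
    'temp': [11.22, 11.32, 11.30, 11.25],
})

df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')

# Upsample to 30-minute steps and fill the new rows by linear interpolation
# between the surrounding hourly values ('30min' is equivalent to '30T').
half_hourly = df.resample('30min').interpolate()
print(half_hourly)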

Forward fill seasonal data in pandas

I have hourly observations of several variables that exhibit daily seasonality. I want to fill any missing value with the corresponding variable's value 24 hours prior.
Ideally my function would fill the missing values from oldest to newest. Thus if there are 25 consecutive missing values, the 25th missing value is filled with the same value as the first missing value. Using Series.map() fails in this case.
value desired_output
hour
2019-08-17 00:00:00 58.712986 58.712986
2019-08-17 01:00:00 28.904234 28.904234
2019-08-17 02:00:00 14.275149 14.275149
2019-08-17 03:00:00 58.777087 58.777087
2019-08-17 04:00:00 95.964955 95.964955
2019-08-17 05:00:00 64.971372 64.971372
2019-08-17 06:00:00 95.759469 95.759469
2019-08-17 07:00:00 98.675457 98.675457
2019-08-17 08:00:00 77.510319 77.510319
2019-08-17 09:00:00 56.492446 56.492446
2019-08-17 10:00:00 90.968924 90.968924
2019-08-17 11:00:00 66.647501 66.647501
2019-08-17 12:00:00 7.756725 7.756725
2019-08-17 13:00:00 49.328135 49.328135
2019-08-17 14:00:00 28.634033 28.634033
2019-08-17 15:00:00 65.157161 65.157161
2019-08-17 16:00:00 93.127539 93.127539
2019-08-17 17:00:00 98.806335 98.806335
2019-08-17 18:00:00 94.789761 94.789761
2019-08-17 19:00:00 63.518037 63.518037
2019-08-17 20:00:00 89.524433 89.524433
2019-08-17 21:00:00 48.076081 48.076081
2019-08-17 22:00:00 5.027928 5.027928
2019-08-17 23:00:00 0.417763 0.417763
2019-08-18 00:00:00 29.933627 29.933627
2019-08-18 01:00:00 61.726948 61.726948
2019-08-18 02:00:00 NaN 14.275149
2019-08-18 03:00:00 NaN 58.777087
2019-08-18 04:00:00 NaN 95.964955
2019-08-18 05:00:00 NaN 64.971372
2019-08-18 06:00:00 NaN 95.759469
2019-08-18 07:00:00 NaN 98.675457
2019-08-18 08:00:00 NaN 77.510319
2019-08-18 09:00:00 NaN 56.492446
2019-08-18 10:00:00 NaN 90.968924
2019-08-18 11:00:00 NaN 66.647501
2019-08-18 12:00:00 NaN 7.756725
2019-08-18 13:00:00 NaN 49.328135
2019-08-18 14:00:00 NaN 28.634033
2019-08-18 15:00:00 NaN 65.157161
2019-08-18 16:00:00 NaN 93.127539
2019-08-18 17:00:00 NaN 98.806335
2019-08-18 18:00:00 NaN 94.789761
2019-08-18 19:00:00 NaN 63.518037
2019-08-18 20:00:00 NaN 89.524433
2019-08-18 21:00:00 NaN 48.076081
2019-08-18 22:00:00 NaN 5.027928
2019-08-18 23:00:00 NaN 0.417763
2019-08-19 00:00:00 NaN 29.933627
2019-08-19 01:00:00 NaN 61.726948
2019-08-19 02:00:00 NaN 14.275149
2019-08-19 03:00:00 NaN 58.777087
2019-08-19 04:00:00 NaN 95.964955
2019-08-19 05:00:00 NaN 64.971372
2019-08-19 06:00:00 NaN 95.759469
2019-08-19 07:00:00 NaN 98.675457
2019-08-19 08:00:00 NaN 77.510319
2019-08-19 09:00:00 NaN 56.492446
2019-08-19 10:00:00 NaN 90.968924
2019-08-19 11:00:00 NaN 66.647501
2019-08-19 12:00:00 NaN 7.756725
2019-08-19 13:00:00 61.457913 61.457913
2019-08-19 14:00:00 52.429383 52.429383
2019-08-19 15:00:00 79.016485 79.016485
2019-08-19 16:00:00 77.724758 77.724758
2019-08-19 17:00:00 62.205810 62.205810
2019-08-19 18:00:00 15.841707 15.841707
2019-08-19 19:00:00 72.196028 72.196028
2019-08-19 20:00:00 5.497441 5.497441
2019-08-19 21:00:00 30.737596 30.737596
2019-08-19 22:00:00 65.550690 65.550690
2019-08-19 23:00:00 3.543332 3.543332
import pandas as pd
from dateutil.relativedelta import relativedelta as rel_delta
df['isna'] = df['value'].isna()
df['value'] = df.index.map(lambda t: df.at[t - rel_delta(hours=24), 'value'] if df.at[t,'isna'] and t - rel_delta(hours=24) >= df.index.min() else df.at[t, 'value'])
What is the most efficient way to complete this naive forward fill?
IIUC, just groupby time and ffill()
df['results'] = df.groupby(df.hour.dt.time).value.ffill()
If hour is your index, just do df.index.time instead.
Checking:
>>> (df['results'] == df['desired_output']).all()
True
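A self-contained sketch of the same idea on made-up data, with the datetime index named hour as in the question:
import numpy as np
import pandas as pd

# Three days of hourly data with a block of missing values (made-up numbers).
idx = pd.date_range('2019-08-17', periods=72, freq='H', name='hour')
df = pd.DataFrame({'value': np.random.rand(72) * 100}, index=idx)
df.loc['2019-08-18 02:00':'2019-08-19 12:00', 'value'] = np.nan

# Group rows by their time of day; within each group, ffill makes a missing
# hour take the most recent observed value 24h (or 48h, ...) earlier.
df['results'] = df.groupby(df.index.time)['value'].ffill()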
Wouldn't this work?
df['value'] = df['value'].fillna(df.index.hour)
Separate Date and Time into two columns as strings. Call it df.
Date Time Value
0 2019-08-17 00:00:00 58.712986
1 2019-08-17 01:00:00 28.904234
2 2019-08-17 02:00:00 14.275149
3 2019-08-17 03:00:00 58.777087
4 2019-08-17 04:00:00 95.964955
Then do some data reshaping: pivot Time into the column headers and forward-fill NaNs along each hour.
(df reshaping)
Date 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00
2019-08-17 58.712986 28.904234 14.275149 58.777087 95.964955
2019-08-18 29.933627 61.726948 NaN NaN NaN
2019-08-19 NaN NaN NaN NaN NaN
(df ffill)
Date 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00
2019-08-17 58.712986 28.904234 14.275149 58.777087 95.964955
2019-08-18 29.933627 61.726948 14.275149 58.777087 95.964955
2019-08-19 29.933627 61.726948 14.275149 58.777087 95.964955
(Code)
(df.set_index(['Date', 'Time'])['Value']
   .unstack()
   .ffill()
   .stack()
   .reset_index(name='Value'))
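A rough sketch of the Date/Time preparation this answer assumes, starting from the question's datetime index named hour (the column names are only the ones used above):
import pandas as pd

# Hypothetical prep: split the DatetimeIndex into string Date / Time columns
# so the unstack/ffill/stack chain above can pivot on them.
tmp = df.reset_index()
tmp['Date'] = tmp['hour'].dt.strftime('%Y-%m-%d')
tmp['Time'] = tmp['hour'].dt.strftime('%H:%M:%S')
df = tmp.rename(columns={'value': 'Value'})[['Date', 'Time', 'Value']]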

Create an hourly Series of a year

I'm not able to create a Pandas Series of every hour (as datetime objects) of a given year without iterating and adding one hour to the previous value, and that's slow. Is there a vectorized way to do that?
My input would be a year and the output should be a Pandas Series of every hour of that year.
You can use pd.date_range with freq='H' which is hourly frequency:
Edit: end at 23:00:00, after a comment by @ALollz
year = 2019
pd.Series(pd.date_range(start=f'{year}-01-01', end=f'{year}-12-31 23:00:00', freq='H'))
0 2019-01-01 00:00:00
1 2019-01-01 01:00:00
2 2019-01-01 02:00:00
3 2019-01-01 03:00:00
4 2019-01-01 04:00:00
5 2019-01-01 05:00:00
6 2019-01-01 06:00:00
7 2019-01-01 07:00:00
8 2019-01-01 08:00:00
9 2019-01-01 09:00:00
10 2019-01-01 10:00:00
11 2019-01-01 11:00:00
12 2019-01-01 12:00:00
13 2019-01-01 13:00:00
14 2019-01-01 14:00:00
15 2019-01-01 15:00:00
16 2019-01-01 16:00:00
17 2019-01-01 17:00:00
18 2019-01-01 18:00:00
19 2019-01-01 19:00:00
20 2019-01-01 20:00:00
21 2019-01-01 21:00:00
22 2019-01-01 22:00:00
23 2019-01-01 23:00:00
24 2019-01-02 00:00:00
25 2019-01-02 01:00:00
26 2019-01-02 02:00:00
27 2019-01-02 03:00:00
28 2019-01-02 04:00:00
29 2019-01-02 05:00:00
30 2019-01-02 06:00:00
31 2019-01-02 07:00:00
32 2019-01-02 08:00:00
33 2019-01-02 09:00:00
34 2019-01-02 10:00:00
35 2019-01-02 11:00:00
36 2019-01-02 12:00:00
37 2019-01-02 13:00:00
38 2019-01-02 14:00:00
39 2019-01-02 15:00:00
40 2019-01-02 16:00:00
41 2019-01-02 17:00:00
42 2019-01-02 18:00:00
43 2019-01-02 19:00:00
44 2019-01-02 20:00:00
45 2019-01-02 21:00:00
46 2019-01-02 22:00:00
47 2019-01-02 23:00:00
48 2019-01-03 00:00:00
49 2019-01-03 01:00:00
...
8711 2019-12-29 23:00:00
8712 2019-12-30 00:00:00
8713 2019-12-30 01:00:00
8714 2019-12-30 02:00:00
8715 2019-12-30 03:00:00
8716 2019-12-30 04:00:00
8717 2019-12-30 05:00:00
8718 2019-12-30 06:00:00
8719 2019-12-30 07:00:00
8720 2019-12-30 08:00:00
8721 2019-12-30 09:00:00
8722 2019-12-30 10:00:00
8723 2019-12-30 11:00:00
8724 2019-12-30 12:00:00
8725 2019-12-30 13:00:00
8726 2019-12-30 14:00:00
8727 2019-12-30 15:00:00
8728 2019-12-30 16:00:00
8729 2019-12-30 17:00:00
8730 2019-12-30 18:00:00
8731 2019-12-30 19:00:00
8732 2019-12-30 20:00:00
8733 2019-12-30 21:00:00
8734 2019-12-30 22:00:00
8735 2019-12-30 23:00:00
8736 2019-12-31 00:00:00
8737 2019-12-31 01:00:00
8738 2019-12-31 02:00:00
8739 2019-12-31 03:00:00
8740 2019-12-31 04:00:00
8741 2019-12-31 05:00:00
8742 2019-12-31 06:00:00
8743 2019-12-31 07:00:00
8744 2019-12-31 08:00:00
8745 2019-12-31 09:00:00
8746 2019-12-31 10:00:00
8747 2019-12-31 11:00:00
8748 2019-12-31 12:00:00
8749 2019-12-31 13:00:00
8750 2019-12-31 14:00:00
8751 2019-12-31 15:00:00
8752 2019-12-31 16:00:00
8753 2019-12-31 17:00:00
8754 2019-12-31 18:00:00
8755 2019-12-31 19:00:00
8756 2019-12-31 20:00:00
8757 2019-12-31 21:00:00
8758 2019-12-31 22:00:00
8759 2019-12-31 23:00:00
Length: 8760, dtype: datetime64[ns]
Note if your Python version is lower than 3.6 use .format for string formatting:
year = 2019
pd.Series(pd.date_range(start='{}-01-01'.format(year), end='{}-12-31 23:00:00'.format(year), freq='H'))
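An alternative that avoids spelling out the end timestamp is to pass a start and a number of periods; a small sketch that also handles leap years (8784 hours) by computing the count:
import calendar
import pandas as pd

year = 2019
n_hours = (366 if calendar.isleap(year) else 365) * 24
hours = pd.Series(pd.date_range(start=f'{year}-01-01', periods=n_hours, freq='H'))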

How to apply an expanding window formula that restarts with change in date in a Pandas dataframe?

My data looks like this
Date and Time Close dif
2015/01/01 17:00:00.211 2030.25 0.3
2015/01/01 17:00:02.456 2030.75 0.595137615
2015/01/01 23:55:01.491 2037.25 2.432613592
2015/01/02 00:02:01.955 2036.75 -0.4
2015/01/02 00:04:04.887 2036.5 -0.391144414
2015/01/02 15:14:56.207 2021.5 -4.732676608
2015/01/05 15:14:59.020 2021.5 -4.731171953
2015/01/05 17:00:00.105 2020.5 0
2015/01/05 17:00:01.077 2021 0.423093923
I want to do a cumsum of the dif column that resets every day, so the output would look like:
Date and Time Close dif Cum_
2015/01/01 17:00:00.211 2030.25 0.3 0.3
2015/01/01 17:00:02.456 2030.75 0.5 0.8
2015/01/01 23:55:01.491 2037.25 2.4 3.2
2015/01/02 00:02:01.955 2036.75 0.4 0.4
2015/01/02 00:04:04.887 2036.5 0.3 0.7
2015/01/02 15:14:56.207 2021.5 4.7 5.0
2015/01/05 17:00:00.020 2021.5 4.7 4.7
2015/01/05 17:00:00.105 2020.5 0 4.7
2015/01/05 17:00:01.077 2021 0.4 5.1
Thanks
Using a similar example:
df = pd.DataFrame({'time': pd.date_range(start='2015-01-01', freq='H', periods=100), 'value': np.random.random(100)}).set_index('time')
print(df.groupby(pd.Grouper(freq='D')).apply(lambda x: x.cumsum()))
value
time
2015-01-01 00:00:00 0.112809
2015-01-01 01:00:00 0.175091
2015-01-01 02:00:00 0.257127
2015-01-01 03:00:00 0.711317
2015-01-01 04:00:00 1.372902
2015-01-01 05:00:00 1.544617
2015-01-01 06:00:00 1.748132
2015-01-01 07:00:00 2.547540
2015-01-01 08:00:00 2.799640
2015-01-01 09:00:00 2.913003
2015-01-01 10:00:00 3.883643
2015-01-01 11:00:00 3.926428
2015-01-01 12:00:00 4.045293
2015-01-01 13:00:00 4.214375
2015-01-01 14:00:00 4.456385
2015-01-01 15:00:00 5.374335
2015-01-01 16:00:00 5.828024
2015-01-01 17:00:00 6.295117
2015-01-01 18:00:00 7.171010
2015-01-01 19:00:00 7.907834
2015-01-01 20:00:00 8.132203
2015-01-01 21:00:00 9.007994
2015-01-01 22:00:00 9.755925
2015-01-01 23:00:00 10.373546
2015-01-02 00:00:00 0.797521
2015-01-02 01:00:00 1.582709
2015-01-02 02:00:00 1.811771
2015-01-02 03:00:00 2.493248
2015-01-02 04:00:00 3.278923
2015-01-02 05:00:00 3.626356
... ...
2015-01-03 22:00:00 11.625891
2015-01-03 23:00:00 12.597532
2015-01-04 00:00:00 0.075442
2015-01-04 01:00:00 0.155059
2015-01-04 02:00:00 0.754960
2015-01-04 03:00:00 0.926798
2015-01-04 04:00:00 1.890215
2015-01-04 05:00:00 2.734722
2015-01-04 06:00:00 2.803935
2015-01-04 07:00:00 3.103064
2015-01-04 08:00:00 3.727508
2015-01-04 09:00:00 4.117465
2015-01-04 10:00:00 4.250926
2015-01-04 11:00:00 4.996832
2015-01-04 12:00:00 5.081889
2015-01-04 13:00:00 5.493243
2015-01-04 14:00:00 5.987519
2015-01-04 15:00:00 6.719041
2015-01-04 16:00:00 7.325912
2015-01-04 17:00:00 8.163208
2015-01-04 18:00:00 9.015092
2015-01-04 19:00:00 9.062396
2015-01-04 20:00:00 9.350298
2015-01-04 21:00:00 9.947669
2015-01-04 22:00:00 10.820609
2015-01-04 23:00:00 11.165523
2015-01-05 00:00:00 0.385323
2015-01-05 01:00:00 0.999182
2015-01-05 02:00:00 1.240272
2015-01-05 03:00:00 1.398086
So in your example, do df.set_index('Date and Time') and then groupby and apply. You can of course assign the result back to the original DataFrame.
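Applied to the question's data, a minimal sketch (grouping on the normalized date restarts the cumsum whenever the calendar day changes; only a few sample rows from the question are used):
import pandas as pd

df = pd.DataFrame({
    'Date and Time': ['2015/01/01 17:00:00.211', '2015/01/01 17:00:02.456',
                      '2015/01/02 00:02:01.955', '2015/01/02 00:04:04.887'],
    'dif': [0.3, 0.5, 0.4, 0.3],
})
df['Date and Time'] = pd.to_datetime(df['Date and Time'])
df = df.set_index('Date and Time')

# Cumulative sum of `dif` that resets at each new calendar day.
df['Cum_'] = df['dif'].groupby(df.index.normalize()).cumsum()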

Fill datetimeindex gap by NaN

I have two dataframes which are datetimeindexed. One is missing a few of these datetimes (df1) while the other is complete (has regular timestamps without any gaps in this series) and is full of NaN's (df2).
I'm trying to match the values from df1 to the index of df2, filling with NaN's where such a datetimeindex doesn't exist in df1.
Example:
In [51]: df1
Out [51]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-03-01 00:00:00 6
2015-03-01 01:00:00 37
2015-03-01 02:00:00 56
2015-03-01 03:00:00 12
2015-03-01 04:00:00 41
2015-03-01 05:00:00 31
... ...
2018-12-25 23:00:00 41
<34843 rows × 1 columns>
In [52]: df2 = pd.DataFrame(data=None, index=pd.date_range(freq='60Min', start=df1.index.min(), end=df1.index.max()))
df2['value'] = np.nan
df2
Out [52]: value
2015-01-01 14:00:00 NaN
2015-01-01 15:00:00 NaN
2015-01-01 16:00:00 NaN
2015-01-01 17:00:00 NaN
2015-01-01 18:00:00 NaN
2015-01-01 19:00:00 NaN
2015-01-01 20:00:00 NaN
2015-01-01 21:00:00 NaN
2015-01-01 22:00:00 NaN
2015-01-01 23:00:00 NaN
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 NaN
<34906 rows × 1 columns>
Using df2.combine_first(df1) returns the same data as df1.reindex(index=df2.index): the gaps where there shouldn't be data get filled with some value instead of NaN.
In [53]: Result = df2.combine_first(df1)
Result
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 35
2015-01-02 01:00:00 53
2015-01-02 02:00:00 28
2015-01-02 03:00:00 48
2015-01-02 04:00:00 42
2015-01-02 05:00:00 51
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
This is what I was hoping to get:
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
Could someone shed some light on why this is happening, and how to set how these values are filled?
IIUC you need to resample df1, because you have an irregular frequency and you need a regular frequency:
print df1.index.freq
None
print Result.index.freq
<60 * Minutes>
EDIT1
You can use function asfreq instead of resample - doc, resample vs asfreq.
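A small sketch of the asfreq approach on made-up data with one missing hour; asfreq just re-imposes the regular grid and leaves the inserted rows as NaN, with no aggregation:
import pandas as pd

# Hourly data with 02:00 missing.
idx = pd.to_datetime(['2015-01-01 00:00', '2015-01-01 01:00', '2015-01-01 03:00'])
df1 = pd.DataFrame({'value': [20, 29, 43]}, index=idx)

regular = df1.asfreq('60min')
#                      value
# 2015-01-01 00:00:00   20.0
# 2015-01-01 01:00:00   29.0
# 2015-01-01 02:00:00    NaN
# 2015-01-01 03:00:00   43.0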
EDIT2
At first I thought that resample didn't work, because after resampling the Result looks the same as df1. But printing df1.info() and Result.info() gives different results - 34857 entries vs 34920 entries.
So I looked for rows with NaN values, and that returned 63 rows.
So I think resample works well.
import pandas as pd
df1 = pd.read_csv('test/GapInTimestamps.csv', sep=",", index_col=[0], parse_dates=[0])
print df1.head()
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print df1.info()
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34857 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Data columns (total 1 columns):
#value 34857 non-null int64
#dtypes: int64(1)
#memory usage: 544.6 KB
#None
Result = df1.resample('60min')
print Result.head()
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print Result.info()
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34920 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Freq: 60T
#Data columns (total 1 columns):
#value 34857 non-null float64
#dtypes: float64(1)
#memory usage: 545.6 KB
#None
#find values with NaN
resultnan = Result[Result.isnull().any(axis=1)]
#temporaly display 999 rows and 15 columns
with pd.option_context('display.max_rows', 999, 'display.max_columns', 15):
print resultnan
# value
#Date/Time
#2015-01-13 19:00:00 NaN
#2015-01-13 20:00:00 NaN
#2015-01-13 21:00:00 NaN
#2015-01-13 22:00:00 NaN
#2015-01-13 23:00:00 NaN
#2015-01-14 00:00:00 NaN
#2015-01-14 01:00:00 NaN
#2015-01-14 02:00:00 NaN
#2015-01-14 03:00:00 NaN
#2015-01-14 04:00:00 NaN
#2015-01-14 05:00:00 NaN
#2015-01-14 06:00:00 NaN
#2015-01-14 07:00:00 NaN
#2015-01-14 08:00:00 NaN
#2015-01-14 09:00:00 NaN
#2015-02-01 00:00:00 NaN
#2015-02-01 01:00:00 NaN
#2015-02-01 02:00:00 NaN
#2015-02-01 03:00:00 NaN
#2015-02-01 04:00:00 NaN
#2015-02-01 05:00:00 NaN
#2015-02-01 06:00:00 NaN
#2015-02-01 07:00:00 NaN
#2015-02-01 08:00:00 NaN
#2015-02-01 09:00:00 NaN
#2015-02-01 10:00:00 NaN
#2015-02-01 11:00:00 NaN
#2015-02-01 12:00:00 NaN
#2015-02-01 13:00:00 NaN
#2015-02-01 14:00:00 NaN
#2015-02-01 15:00:00 NaN
#2015-02-01 16:00:00 NaN
#2015-02-01 17:00:00 NaN
#2015-02-01 18:00:00 NaN
#2015-02-01 19:00:00 NaN
#2015-02-01 20:00:00 NaN
#2015-02-01 21:00:00 NaN
#2015-02-01 22:00:00 NaN
#2015-02-01 23:00:00 NaN
#2015-11-01 00:00:00 NaN
#2015-11-01 01:00:00 NaN
#2015-11-01 02:00:00 NaN
#2015-11-01 03:00:00 NaN
#2015-11-01 04:00:00 NaN
#2015-11-01 05:00:00 NaN
#2015-11-01 06:00:00 NaN
#2015-11-01 07:00:00 NaN
#2015-11-01 08:00:00 NaN
#2015-11-01 09:00:00 NaN
#2015-11-01 10:00:00 NaN
#2015-11-01 11:00:00 NaN
#2015-11-01 12:00:00 NaN
#2015-11-01 13:00:00 NaN
#2015-11-01 14:00:00 NaN
#2015-11-01 15:00:00 NaN
#2015-11-01 16:00:00 NaN
#2015-11-01 17:00:00 NaN
#2015-11-01 18:00:00 NaN
#2015-11-01 19:00:00 NaN
#2015-11-01 20:00:00 NaN
#2015-11-01 21:00:00 NaN
#2015-11-01 22:00:00 NaN
#2015-11-01 23:00:00 NaN
