I have a pandas data frame containing a large-ish set of hourly data points. For a few days, there are missing data (NaN). I want to interpolate values for the missing hourly data points by calculating the mean of the same time period on the prior and following day (I've done some analysis and believe this will be reasonable).
An example of the data is below:
datetime              value
2018-11-17 00:00:00    9.12
2018-11-17 01:00:00    8.94
2018-11-17 02:00:00    8.68
2018-11-17 03:00:00    8.19
2018-11-17 04:00:00    7.75
2018-11-17 05:00:00    7.35
2018-11-17 06:00:00    7.05
2018-11-17 07:00:00    6.55
2018-11-17 08:00:00    6.30
2018-11-17 09:00:00    6.28
2018-11-17 10:00:00    6.68
2018-11-17 11:00:00    7.64
2018-11-17 12:00:00    8.61
2018-11-17 13:00:00    9.44
2018-11-17 14:00:00    9.84
2018-11-17 15:00:00    9.62
2018-11-17 16:00:00    8.17
2018-11-17 17:00:00    6.16
2018-11-17 18:00:00    5.93
2018-11-17 19:00:00    5.36
2018-11-17 20:00:00    4.69
2018-11-17 21:00:00    4.36
2018-11-17 22:00:00    4.68
2018-11-17 23:00:00    4.86
2018-11-18 00:00:00     NaN
2018-11-18 01:00:00     NaN
2018-11-18 02:00:00     NaN
2018-11-18 03:00:00     NaN
2018-11-18 04:00:00     NaN
2018-11-18 05:00:00     NaN
2018-11-18 06:00:00     NaN
2018-11-18 07:00:00     NaN
2018-11-18 08:00:00     NaN
2018-11-18 09:00:00     NaN
2018-11-18 10:00:00     NaN
2018-11-18 11:00:00     NaN
2018-11-18 12:00:00     NaN
2018-11-18 13:00:00     NaN
2018-11-18 14:00:00     NaN
2018-11-18 15:00:00     NaN
2018-11-18 16:00:00     NaN
2018-11-18 17:00:00     NaN
2018-11-18 18:00:00     NaN
2018-11-18 19:00:00     NaN
2018-11-18 20:00:00     NaN
2018-11-18 21:00:00     NaN
2018-11-18 22:00:00     NaN
2018-11-18 23:00:00     NaN
2018-11-19 00:00:00    3.19
2018-11-19 01:00:00    2.60
2018-11-19 02:00:00    2.29
2018-11-19 03:00:00    1.97
2018-11-19 04:00:00    2.19
2018-11-19 05:00:00    3.09
2018-11-19 06:00:00    4.32
2018-11-19 07:00:00    4.87
2018-11-19 08:00:00    5.14
2018-11-19 09:00:00    5.55
2018-11-19 10:00:00    6.34
2018-11-19 11:00:00    7.43
2018-11-19 12:00:00    8.18
2018-11-19 13:00:00    8.53
2018-11-19 14:00:00    8.45
2018-11-19 15:00:00    7.94
2018-11-19 16:00:00    6.87
2018-11-19 17:00:00    5.56
2018-11-19 18:00:00    4.65
2018-11-19 19:00:00    4.18
2018-11-19 20:00:00    3.97
2018-11-19 21:00:00    3.98
2018-11-19 22:00:00    4.01
2018-11-19 23:00:00    4.00
So, for example, the desired output for 2018-11-18 00:00:00 would be the mean of 9.12 and 3.19 = 6.16. And so on for the other hours of the day on 2018-11-18.
Is there a simple way to do this in pandas? Ideally with a method that can be applied to a whole column (feature) of a data frame, rather than having to slice out some of the data, transform it, and then put it back (because honestly, it would be a lot quicker for me to do that in Excel!).
Thanks in advance for your help.
Try:
#make sure every hour is in the datetime
df = df.set_index("datetime").resample("1h").last()
#create a series of means averaging the values 24 hours before and after
means = df["value"].shift(24).add(df["value"].shift(-24)).mul(0.5)
#fill the NaN in df with means
df["value"] = df["value"].combine_first(means)
>>> df.iloc[24:48]
value
datetime
2018-11-18 00:00:00 6.155
2018-11-18 01:00:00 5.770
2018-11-18 02:00:00 5.485
2018-11-18 03:00:00 5.080
2018-11-18 04:00:00 4.970
2018-11-18 05:00:00 5.220
2018-11-18 06:00:00 5.685
2018-11-18 07:00:00 5.710
2018-11-18 08:00:00 5.720
2018-11-18 09:00:00 5.915
2018-11-18 10:00:00 6.510
2018-11-18 11:00:00 7.535
2018-11-18 12:00:00 8.395
2018-11-18 13:00:00 8.985
2018-11-18 14:00:00 9.145
2018-11-18 15:00:00 8.780
2018-11-18 16:00:00 7.520
2018-11-18 17:00:00 5.860
2018-11-18 18:00:00 5.290
2018-11-18 19:00:00 4.770
2018-11-18 20:00:00 4.330
2018-11-18 21:00:00 4.170
2018-11-18 22:00:00 4.345
2018-11-18 23:00:00 4.430
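Note that add propagates NaN, so an hour with data on only one neighbouring day stays NaN. If you would rather fall back to the single available neighbour, a row-wise mean over the two shifts skips NaN (a sketch, assuming the same df and pandas imported as pd):
#row-wise mean ignores NaN, so a lone available neighbour is used as-is
means = pd.concat([df["value"].shift(24), df["value"].shift(-24)], axis=1).mean(axis=1)
df["value"] = df["value"].combine_first(means)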
I have a dataframe df that contains datetimes for every hour of a day from 2003-02-12 to 2017-06-30, and I want to delete all datetimes between 24th Dec and 1st Jan of EVERY year.
An extract of my data frame is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7513,2003-12-24 01:00:00
7514,2003-12-24 02:00:00
7515,2003-12-24 03:00:00
7516,2003-12-24 04:00:00
7517,2003-12-24 05:00:00
7518,2003-12-24 06:00:00
...
7723,2004-01-01 19:00:00
7724,2004-01-01 20:00:00
7725,2004-01-01 21:00:00
7726,2004-01-01 22:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
and my expected output is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
...
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
Sample dataframe:
dates
0 2003-12-23 23:00:00
1 2003-12-24 05:00:00
2 2004-12-27 05:00:00
3 2003-12-13 23:00:00
4 2002-12-23 23:00:00
5 2004-01-01 05:00:00
6 2014-12-24 05:00:00
Solution:
If you want these dates excluded for every year, extract the month and day first:
df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day
And now put the condition check:
dec_days = [24, 25, 26, 27, 28, 29, 30, 31]
## if the month is dec, then check for these dates
## if the month is jan, then just check for the day to be 1 like below
df = df[~(((df.month == 12) & (df.day.isin(dec_days))) | ((df.month == 1) & (df.day == 1)))]
Sample output:
dates month day
0 2003-12-23 23:00:00 12 23
3 2003-12-13 23:00:00 12 13
4 2002-12-23 23:00:00 12 23
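If you would rather not add the helper columns, the same condition can be written inline with the .dt accessor (a sketch, assuming df['dates'] is already datetime64):
#same filter without mutating df; day >= 24 is equivalent to isin(dec_days)
mask = ((df['dates'].dt.month == 12) & (df['dates'].dt.day >= 24)) | \
       ((df['dates'].dt.month == 1) & (df['dates'].dt.day == 1))
df = df[~mask]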
This takes advantage of the fact that zero-padded date strings in the form mm-dd sort in calendar order. Read everything in from the CSV file, then filter out the dates you don't want:
df = pd.read_csv('...', parse_dates=['DateTime'])
s = df['DateTime'].dt.strftime('%m-%d')
excluded = (s == '01-01') | (s >= '12-24') # Jan 1 or >= Dec 24
df[~excluded]
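A quick way to sanity-check the sortability claim (strftime zero-pads both fields, so lexicographic order matches calendar order):
#every mm-dd string compares the way the calendar does
assert '01-01' < '01-02' < '12-23' < '12-24' < '12-31'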
You can try dropping rows conditionally, either by pattern-matching the date string or by parsing the date and filtering on its components.
datesIdontLike = df[df['colname'] == <stringPattern>].index
newDF = df.drop(datesIdontLike)  # note: with inplace=True, drop returns None
Check this out: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
(If you have issues, let me know.)
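For instance, a concrete sketch of the pattern-match idea for the Dec 24 - Jan 1 window, assuming ISO-formatted date strings and a hypothetical column name 'colname':
#regex on the month-day part: Dec 24-31 or Jan 01 ('colname' is a placeholder)
pattern = r'^\d{4}-(12-(2[4-9]|3[01])|01-01)'
datesIdontLike = df[df['colname'].astype(str).str.match(pattern)].index
newDF = df.drop(datesIdontLike)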
You can use pandas and boolean filtering with strftime
# version 0.23.4
import pandas as pd
# make df
df = pd.DataFrame(pd.date_range('20181223', '20190103', freq='H'), columns=['date'])
# string-format the date to include only the month and day,
# then keep rows strictly less than '12-24' AND greater than or equal to '01-02'
df = df.loc[
(df.date.dt.strftime('%m-%d') < '12-24') &
(df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2018-12-23 00:00:00
1 2018-12-23 01:00:00
2 2018-12-23 02:00:00
3 2018-12-23 03:00:00
4 2018-12-23 04:00:00
5 2018-12-23 05:00:00
6 2018-12-23 06:00:00
7 2018-12-23 07:00:00
8 2018-12-23 08:00:00
9 2018-12-23 09:00:00
10 2018-12-23 10:00:00
11 2018-12-23 11:00:00
12 2018-12-23 12:00:00
13 2018-12-23 13:00:00
14 2018-12-23 14:00:00
15 2018-12-23 15:00:00
16 2018-12-23 16:00:00
17 2018-12-23 17:00:00
18 2018-12-23 18:00:00
19 2018-12-23 19:00:00
20 2018-12-23 20:00:00
21 2018-12-23 21:00:00
22 2018-12-23 22:00:00
23 2018-12-23 23:00:00
240 2019-01-02 00:00:00
241 2019-01-02 01:00:00
242 2019-01-02 02:00:00
243 2019-01-02 03:00:00
244 2019-01-02 04:00:00
245 2019-01-02 05:00:00
246 2019-01-02 06:00:00
247 2019-01-02 07:00:00
248 2019-01-02 08:00:00
249 2019-01-02 09:00:00
250 2019-01-02 10:00:00
251 2019-01-02 11:00:00
252 2019-01-02 12:00:00
253 2019-01-02 13:00:00
254 2019-01-02 14:00:00
255 2019-01-02 15:00:00
256 2019-01-02 16:00:00
257 2019-01-02 17:00:00
258 2019-01-02 18:00:00
259 2019-01-02 19:00:00
260 2019-01-02 20:00:00
261 2019-01-02 21:00:00
262 2019-01-02 22:00:00
263 2019-01-02 23:00:00
264 2019-01-03 00:00:00
This will work with multiple years because we are only filtering on the month and day.
# change range to include 2017
df = pd.DataFrame(pd.date_range('20171223', '20190103', freq='H'), columns=['date'])
df = df.loc[
(df.date.dt.strftime('%m-%d') < '12-24') &
(df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2017-12-23 00:00:00
1 2017-12-23 01:00:00
2 2017-12-23 02:00:00
3 2017-12-23 03:00:00
4 2017-12-23 04:00:00
5 2017-12-23 05:00:00
6 2017-12-23 06:00:00
7 2017-12-23 07:00:00
8 2017-12-23 08:00:00
9 2017-12-23 09:00:00
10 2017-12-23 10:00:00
11 2017-12-23 11:00:00
12 2017-12-23 12:00:00
13 2017-12-23 13:00:00
14 2017-12-23 14:00:00
15 2017-12-23 15:00:00
16 2017-12-23 16:00:00
17 2017-12-23 17:00:00
18 2017-12-23 18:00:00
19 2017-12-23 19:00:00
20 2017-12-23 20:00:00
21 2017-12-23 21:00:00
22 2017-12-23 22:00:00
23 2017-12-23 23:00:00
240 2018-01-02 00:00:00
241 2018-01-02 01:00:00
242 2018-01-02 02:00:00
243 2018-01-02 03:00:00
244 2018-01-02 04:00:00
245 2018-01-02 05:00:00
... ...
8779 2018-12-23 19:00:00
8780 2018-12-23 20:00:00
8781 2018-12-23 21:00:00
8782 2018-12-23 22:00:00
8783 2018-12-23 23:00:00
9000 2019-01-02 00:00:00
9001 2019-01-02 01:00:00
9002 2019-01-02 02:00:00
9003 2019-01-02 03:00:00
9004 2019-01-02 04:00:00
9005 2019-01-02 05:00:00
9006 2019-01-02 06:00:00
9007 2019-01-02 07:00:00
9008 2019-01-02 08:00:00
9009 2019-01-02 09:00:00
9010 2019-01-02 10:00:00
9011 2019-01-02 11:00:00
9012 2019-01-02 12:00:00
9013 2019-01-02 13:00:00
9014 2019-01-02 14:00:00
9015 2019-01-02 15:00:00
9016 2019-01-02 16:00:00
9017 2019-01-02 17:00:00
9018 2019-01-02 18:00:00
9019 2019-01-02 19:00:00
9020 2019-01-02 20:00:00
9021 2019-01-02 21:00:00
9022 2019-01-02 22:00:00
9023 2019-01-02 23:00:00
9024 2019-01-03 00:00:00
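Note that strftime formats a string for every row, which can be slow on large frames; an equivalent filter on integer month/day avoids that (a sketch, same df as above):
# same window using integer comparisons instead of formatted strings
m, d = df.date.dt.month, df.date.dt.day
df = df.loc[~(((m == 12) & (d >= 24)) | ((m == 1) & (d == 1)))].copy()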
Since you want this to happen for every year, we can first define a series in which we replace the year with a static value (2000, for example). Let date be the column that stores the date; we can generate such a series as:
dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})
For the given sample data, we get:
>>> dt
0 2000-12-23
1 2000-12-23
2 2000-12-23
3 2000-12-23
4 2000-12-23
5 2000-12-23
6 2000-12-23
7 2000-12-24
8 2000-12-24
9 2000-12-24
10 2000-12-24
11 2000-12-24
12 2000-12-24
13 2000-12-24
14 2000-01-01
15 2000-01-01
16 2000-01-01
17 2000-01-01
18 2000-01-01
19 2000-01-02
20 2000-01-02
21 2000-01-02
22 2000-01-02
23 2000-01-02
24 2000-01-02
25 2000-01-02
26 2000-01-02
dtype: datetime64[ns]
Next we can filter the rows, like:
df[(dt >= pd.Timestamp(2000, 1, 2)) & (dt < pd.Timestamp(2000, 12, 24))]
This gives us the following data for your sample data:
>>> df[(dt >= pd.Timestamp(2000, 1, 2)) & (dt < pd.Timestamp(2000, 12, 24))]
id dt
0 7505 2003-12-23 17:00:00
1 7506 2003-12-23 18:00:00
2 7507 2003-12-23 19:00:00
3 7508 2003-12-23 20:00:00
4 7509 2003-12-23 21:00:00
5 7510 2003-12-23 22:00:00
6 7511 2003-12-23 23:00:00
19 7728 2004-01-02 00:00:00
20 7729 2004-01-02 01:00:00
21 7730 2004-01-02 02:00:00
22 7731 2004-01-02 03:00:00
23 7732 2004-01-02 04:00:00
24 7733 2004-01-02 05:00:00
25 7734 2004-01-02 06:00:00
26 7735 2004-01-02 07:00:00
So regardless of what the year is, we only keep dates between the 2nd of January and the 23rd of December (both inclusive).
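One caveat worth noting: 2000 is a leap year, which is why Feb 29 rows survive the rebuild; a non-leap dummy year would make pd.to_datetime raise on them. A minimal self-contained check:
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['2004-02-29', '2004-06-01', '2004-12-25'])})
dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})
#keeps Feb 29 and Jun 1, drops Dec 25
print(df[(dt >= pd.Timestamp(2000, 1, 2)) & (dt < pd.Timestamp(2000, 12, 24))])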
I have a dataframe and I want to remove certain specific repeating rows:
import numpy as np
import pandas as pd
nrows = 144
df = pd.DataFrame(np.random.rand(nrows,), pd.date_range('2016-02-08 00:00:00', periods=nrows, freq='2h'), columns=['A'])
The dataframe is continuous in time, providing data every two hours ad infinitum, but I've chosen to show only a subset for brevity. I want to remove the data every 72 hours at 8:00, starting on Mondays, to coincide with an external event that alters the data. For this snapshot of data I want to remove the rows indexed at 2016-02-08 08:00, 2016-02-11 08:00, +3D, etc.
Is there a simple way to do this?
IIUC you could do this:
In [18]:
start = df.index[(df.index.dayofweek == 0) & (df.index.hour == 8)][0]
start
Out[18]:
Timestamp('2016-02-08 08:00:00')
In [45]:
df.loc[df.index.difference(pd.date_range(start, end=df.index[-1], freq='3D'))]
Out[45]:
A
2016-02-08 00:00:00 0.323742
2016-02-08 02:00:00 0.962252
2016-02-08 04:00:00 0.706537
2016-02-08 06:00:00 0.561446
2016-02-08 10:00:00 0.225042
2016-02-08 12:00:00 0.746258
2016-02-08 14:00:00 0.167950
2016-02-08 16:00:00 0.199958
2016-02-08 18:00:00 0.808286
2016-02-08 20:00:00 0.288797
2016-02-08 22:00:00 0.508109
2016-02-09 00:00:00 0.980772
2016-02-09 02:00:00 0.995731
2016-02-09 04:00:00 0.742751
2016-02-09 06:00:00 0.392247
2016-02-09 08:00:00 0.460511
2016-02-09 10:00:00 0.083660
2016-02-09 12:00:00 0.273620
2016-02-09 14:00:00 0.791506
2016-02-09 16:00:00 0.440630
2016-02-09 18:00:00 0.326418
2016-02-09 20:00:00 0.790780
2016-02-09 22:00:00 0.521131
2016-02-10 00:00:00 0.219315
2016-02-10 02:00:00 0.016625
2016-02-10 04:00:00 0.958566
2016-02-10 06:00:00 0.405643
2016-02-10 08:00:00 0.958025
2016-02-10 10:00:00 0.786663
2016-02-10 12:00:00 0.589064
... ...
2016-02-17 12:00:00 0.360848
2016-02-17 14:00:00 0.757499
2016-02-17 16:00:00 0.391574
2016-02-17 18:00:00 0.062812
2016-02-17 20:00:00 0.308282
2016-02-17 22:00:00 0.251520
2016-02-18 00:00:00 0.832871
2016-02-18 02:00:00 0.387108
2016-02-18 04:00:00 0.070969
2016-02-18 06:00:00 0.298831
2016-02-18 08:00:00 0.878526
2016-02-18 10:00:00 0.979233
2016-02-18 12:00:00 0.386620
2016-02-18 14:00:00 0.420962
2016-02-18 16:00:00 0.238879
2016-02-18 18:00:00 0.124069
2016-02-18 20:00:00 0.985828
2016-02-18 22:00:00 0.585278
2016-02-19 00:00:00 0.409226
2016-02-19 02:00:00 0.093945
2016-02-19 04:00:00 0.389450
2016-02-19 06:00:00 0.378091
2016-02-19 08:00:00 0.874232
2016-02-19 10:00:00 0.527629
2016-02-19 12:00:00 0.490236
2016-02-19 14:00:00 0.509008
2016-02-19 16:00:00 0.097061
2016-02-19 18:00:00 0.111626
2016-02-19 20:00:00 0.877099
2016-02-19 22:00:00 0.796201
[140 rows x 1 columns]
This determines the start of the range by comparing dayofweek and hour and taking the first matching index value. We then generate the timestamps to drop with date_range, call difference on the index to remove them, and pass the result to loc.
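An equivalent way to express the same drop, assuming start as computed above, is a modular-arithmetic mask over the elapsed hours instead of building a date_range:
#hours elapsed since the first Monday 08:00; drop non-negative exact multiples of 72
hours = (df.index - start) / pd.Timedelta('1h')
df_filtered = df[~((hours >= 0) & (hours % 72 == 0))]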
I have two datetime-indexed dataframes. One (df1) is missing a few of these datetimes, while the other (df2) is complete (regular timestamps without any gaps in the series) and is full of NaNs.
I'm trying to match the values from df1 to the index of df2, filling with NaN wherever a datetime in df2's index doesn't exist in df1.
Example:
In [51]: df1
Out [51]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-03-01 00:00:00 6
2015-03-01 01:00:00 37
2015-03-01 02:00:00 56
2015-03-01 03:00:00 12
2015-03-01 04:00:00 41
2015-03-01 05:00:00 31
... ...
2018-12-25 23:00:00 41
<34843 rows × 1 columns>
In [52]: df2 = pd.DataFrame(data=None, index=pd.date_range(freq='60Min', start=df1.index.min(), end=df1.index.max()))
df2['value'] = np.nan
df2
Out [52]: value
2015-01-01 14:00:00 NaN
2015-01-01 15:00:00 NaN
2015-01-01 16:00:00 NaN
2015-01-01 17:00:00 NaN
2015-01-01 18:00:00 NaN
2015-01-01 19:00:00 NaN
2015-01-01 20:00:00 NaN
2015-01-01 21:00:00 NaN
2015-01-01 22:00:00 NaN
2015-01-01 23:00:00 NaN
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 NaN
<34906 rows × 1 columns>
Using df2.combine_first(df1) returns the same data as df1.reindex(index=df2.index): the gaps, where there should be no data, get filled with values instead of NaN.
In [53]: Result = df2.combine_first(df1)
Result
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 35
2015-01-02 01:00:00 53
2015-01-02 02:00:00 28
2015-01-02 03:00:00 48
2015-01-02 04:00:00 42
2015-01-02 05:00:00 51
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
This is what I was hoping to get:
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
Could someone shed some light on why this is happening, and how to set how these values are filled?
IIUC you need to resample df1, because it has an irregular frequency and you need a regular one:
print(df1.index.freq)
None
print(Result.index.freq)
<60 * Minutes>
EDIT1
You can use the function asfreq instead of resample - doc, resample vs asfreq; a minimal sketch follows below.
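With asfreq, NaN rows are inserted at the missing hours and the existing values are left untouched (a sketch on the same df1):
Result = df1.asfreq('60min')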
EDIT2
At first I thought resample didn't work, because after resampling the Result looked the same as df1. But print(df1.info()) and print(Result.info()) give different results: 34857 entries vs 34920 entries.
So I looked for the rows with NaN values, and that returned 63 rows.
So I think resample works well.
import pandas as pd
df1 = pd.read_csv('test/GapInTimestamps.csv', sep=",", index_col=[0], parse_dates=[0])
print(df1.head())
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print(df1.info())
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34857 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Data columns (total 1 columns):
#value 34857 non-null int64
#dtypes: int64(1)
#memory usage: 544.6 KB
#None
# in modern pandas, resample needs an explicit aggregation (the old default was mean)
Result = df1.resample('60min').mean()
print(Result.head())
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print(Result.info())
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34920 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Freq: 60T
#Data columns (total 1 columns):
#value 34857 non-null float64
#dtypes: float64(1)
#memory usage: 545.6 KB
#None
#find values with NaN
resultnan = Result[Result.isnull().any(axis=1)]
#temporarily display 999 rows and 15 columns
with pd.option_context('display.max_rows', 999, 'display.max_columns', 15):
    print(resultnan)
# value
#Date/Time
#2015-01-13 19:00:00 NaN
#2015-01-13 20:00:00 NaN
#2015-01-13 21:00:00 NaN
#2015-01-13 22:00:00 NaN
#2015-01-13 23:00:00 NaN
#2015-01-14 00:00:00 NaN
#2015-01-14 01:00:00 NaN
#2015-01-14 02:00:00 NaN
#2015-01-14 03:00:00 NaN
#2015-01-14 04:00:00 NaN
#2015-01-14 05:00:00 NaN
#2015-01-14 06:00:00 NaN
#2015-01-14 07:00:00 NaN
#2015-01-14 08:00:00 NaN
#2015-01-14 09:00:00 NaN
#2015-02-01 00:00:00 NaN
#2015-02-01 01:00:00 NaN
#2015-02-01 02:00:00 NaN
#2015-02-01 03:00:00 NaN
#2015-02-01 04:00:00 NaN
#2015-02-01 05:00:00 NaN
#2015-02-01 06:00:00 NaN
#2015-02-01 07:00:00 NaN
#2015-02-01 08:00:00 NaN
#2015-02-01 09:00:00 NaN
#2015-02-01 10:00:00 NaN
#2015-02-01 11:00:00 NaN
#2015-02-01 12:00:00 NaN
#2015-02-01 13:00:00 NaN
#2015-02-01 14:00:00 NaN
#2015-02-01 15:00:00 NaN
#2015-02-01 16:00:00 NaN
#2015-02-01 17:00:00 NaN
#2015-02-01 18:00:00 NaN
#2015-02-01 19:00:00 NaN
#2015-02-01 20:00:00 NaN
#2015-02-01 21:00:00 NaN
#2015-02-01 22:00:00 NaN
#2015-02-01 23:00:00 NaN
#2015-11-01 00:00:00 NaN
#2015-11-01 01:00:00 NaN
#2015-11-01 02:00:00 NaN
#2015-11-01 03:00:00 NaN
#2015-11-01 04:00:00 NaN
#2015-11-01 05:00:00 NaN
#2015-11-01 06:00:00 NaN
#2015-11-01 07:00:00 NaN
#2015-11-01 08:00:00 NaN
#2015-11-01 09:00:00 NaN
#2015-11-01 10:00:00 NaN
#2015-11-01 11:00:00 NaN
#2015-11-01 12:00:00 NaN
#2015-11-01 13:00:00 NaN
#2015-11-01 14:00:00 NaN
#2015-11-01 15:00:00 NaN
#2015-11-01 16:00:00 NaN
#2015-11-01 17:00:00 NaN
#2015-11-01 18:00:00 NaN
#2015-11-01 19:00:00 NaN
#2015-11-01 20:00:00 NaN
#2015-11-01 21:00:00 NaN
#2015-11-01 22:00:00 NaN
#2015-11-01 23:00:00 NaN
I want to apply a deviation at monthly granularity and then broadcast it back onto the original dataframe. First I do a groupby and aggregate, and this part works well. Then I reindex and get NaN. I want the reindexing to be done by matching the month of each groupby result with the months in the original dataframe.
I also want to be able to do this operation at different granularities (yearly -> month & year, ...).
Does anyone have a solution to this problem?
>>> df['profile']
date
2015-01-01 00:00:00 3.000000
2015-01-01 01:00:00 3.000143
2015-01-01 02:00:00 3.000287
2015-01-01 03:00:00 3.000430
2015-01-01 04:00:00 3.000574
...
2015-12-31 20:00:00 2.999426
2015-12-31 21:00:00 2.999570
2015-12-31 22:00:00 2.999713
2015-12-31 23:00:00 2.999857
Freq: H, Name: profile, Length: 8760
### Deviation on monthly basis
>>> dev_monthly = np.random.uniform(0.5, 1.5, len(df['profile'].groupby(df.index.month).aggregate(np.sum)))
>>> df['profile_monthly'] = (df['profile'].groupby(df.index.month).aggregate(np.sum) * dev_monthly).reindex(df)
>>> df['profile_monthly']
date
2015-01-01 00:00:00 NaN
2015-01-01 01:00:00 NaN
2015-01-01 02:00:00 NaN
...
2015-12-31 22:00:00 NaN
2015-12-31 23:00:00 NaN
Freq: H, Name: profile_monthly, Length: 8760
Check out the documentation for resampling.
You're looking for resample followed by fillna with method='bfill':
In [105]: df = pd.DataFrame({'profile': np.random.normal(3, 0.1, size=10000)},
     ...:                   index=pd.date_range(start='2015-01-01', freq='H', periods=10000))
In [106]: df['profile_monthly'] = df.profile.resample('M').sum()
In [107]: df
Out[107]:
profile profile_monthly
2015-01-01 00:00:00 2.8328 NaN
2015-01-01 01:00:00 3.0607 NaN
2015-01-01 02:00:00 3.0138 NaN
2015-01-01 03:00:00 3.0402 NaN
2015-01-01 04:00:00 3.0335 NaN
2015-01-01 05:00:00 3.0087 NaN
2015-01-01 06:00:00 3.0557 NaN
2015-01-01 07:00:00 2.9280 NaN
2015-01-01 08:00:00 3.1359 NaN
2015-01-01 09:00:00 2.9681 NaN
2015-01-01 10:00:00 3.1240 NaN
2015-01-01 11:00:00 3.0635 NaN
2015-01-01 12:00:00 2.9206 NaN
2015-01-01 13:00:00 3.0714 NaN
2015-01-01 14:00:00 3.0688 NaN
2015-01-01 15:00:00 3.0703 NaN
2015-01-01 16:00:00 2.9102 NaN
2015-01-01 17:00:00 2.9368 NaN
2015-01-01 18:00:00 3.0864 NaN
2015-01-01 19:00:00 3.2124 NaN
2015-01-01 20:00:00 2.8988 NaN
2015-01-01 21:00:00 3.0659 NaN
2015-01-01 22:00:00 2.7973 NaN
2015-01-01 23:00:00 3.0824 NaN
2015-01-02 00:00:00 3.0199 NaN
... ...
[10000 rows x 2 columns]
In [108]: df.dropna()
Out[108]:
profile profile_monthly
2015-01-31 2.9769 2230.9931
2015-02-28 2.9930 2016.1045
2015-03-31 2.7817 2232.4096
2015-04-30 3.1695 2158.7834
2015-05-31 2.9040 2236.5962
2015-06-30 2.8697 2162.7784
2015-07-31 2.9278 2231.7232
2015-08-31 2.8289 2236.4603
2015-09-30 3.0368 2163.5916
2015-10-31 3.1517 2233.2285
2015-11-30 3.0450 2158.6998
2015-12-31 2.8261 2228.5550
2016-01-31 3.0264 2229.2221
[13 rows x 2 columns]
In [110]: df.fillna(method='bfill')
Out[110]:
profile profile_monthly
2015-01-01 00:00:00 2.8328 2230.9931
2015-01-01 01:00:00 3.0607 2230.9931
2015-01-01 02:00:00 3.0138 2230.9931
2015-01-01 03:00:00 3.0402 2230.9931
2015-01-01 04:00:00 3.0335 2230.9931
2015-01-01 05:00:00 3.0087 2230.9931
2015-01-01 06:00:00 3.0557 2230.9931
2015-01-01 07:00:00 2.9280 2230.9931
2015-01-01 08:00:00 3.1359 2230.9931
2015-01-01 09:00:00 2.9681 2230.9931
2015-01-01 10:00:00 3.1240 2230.9931
2015-01-01 11:00:00 3.0635 2230.9931
2015-01-01 12:00:00 2.9206 2230.9931
2015-01-01 13:00:00 3.0714 2230.9931
2015-01-01 14:00:00 3.0688 2230.9931
2015-01-01 15:00:00 3.0703 2230.9931
2015-01-01 16:00:00 2.9102 2230.9931
2015-01-01 17:00:00 2.9368 2230.9931
2015-01-01 18:00:00 3.0864 2230.9931
2015-01-01 19:00:00 3.2124 2230.9931
2015-01-01 20:00:00 2.8988 2230.9931
2015-01-01 21:00:00 3.0659 2230.9931
2015-01-01 22:00:00 2.7973 2230.9931
2015-01-01 23:00:00 3.0824 2230.9931
2015-01-02 00:00:00 3.0199 2230.9931
... ...
[10000 rows x 2 columns]
When I use your code, I don't get the same value for 2015-12-31 00:00:00 and 2015-12-31 01:00:00, as you can see below (the month-end label 2015-12-31 00:00:00 carries the December sum, while the remaining hours of that day backfill from the next label, 2016-01-31):
>>> df.fillna(method='bfill')[np.logical_and(df.index.month==12, df.index.day==31)]
profile profile_monthly
2015-12-31 00:00:00 2.926504 2232.288997
2015-12-31 01:00:00 3.008543 2234.470731
2015-12-31 02:00:00 2.930133 2234.470731
2015-12-31 03:00:00 3.078552 2234.470731
2015-12-31 04:00:00 3.141578 2234.470731
2015-12-31 05:00:00 3.061820 2234.470731
2015-12-31 06:00:00 2.981626 2234.470731
2015-12-31 07:00:00 3.010749 2234.470731
2015-12-31 08:00:00 2.878577 2234.470731
2015-12-31 09:00:00 2.915487 2234.470731
2015-12-31 10:00:00 3.072721 2234.470731
2015-12-31 11:00:00 3.087866 2234.470731
2015-12-31 12:00:00 3.089208 2234.470731
2015-12-31 13:00:00 2.957047 2234.470731
2015-12-31 14:00:00 3.002072 2234.470731
2015-12-31 15:00:00 3.106656 2234.470731
2015-12-31 16:00:00 3.100891 2234.470731
2015-12-31 17:00:00 3.077835 2234.470731
2015-12-31 18:00:00 3.032497 2234.470731
2015-12-31 19:00:00 2.959838 2234.470731
2015-12-31 20:00:00 2.878819 2234.470731
2015-12-31 21:00:00 3.041171 2234.470731
2015-12-31 22:00:00 3.061970 2234.470731
2015-12-31 23:00:00 3.019011 2234.470731
[24 rows x 2 columns]
So I finally found the following solution:
>>> AA = df.groupby([df.index.year, df.index.month]).aggregate(np.mean)
>>> AA['dev'] = np.random.normal(0, 1, len(AA))
>>> df['dev'] = AA.loc[list(zip(df.index.year, df.index.month)), 'dev'].values
Short and fast. The only remaining question: how to deal with other granularities (half-year, quarter, week, ...)? One generalization is sketched below.
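One option, using groupby plus transform rather than the reindex trick (a sketch, assuming the hourly df above): transform broadcasts the group statistic straight back onto the original index, so the granularity is just the period code.
#per-period mean broadcast back to every hourly row; swap 'Q'/'W' for other granularities
df['dev_q'] = df['profile'].groupby(df.index.to_period('Q')).transform('mean')
df['dev_w'] = df['profile'].groupby(df.index.to_period('W')).transform('mean')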