Pandas: how to group by a column and minute ranges, excluding the date - python

I have to compute simple statistics by grouping a dataframe by one column, for instance day_of_the_week, and by minute ranges, for instance 15 minutes, without keeping the dates. In other words, I need statistics on what is happening in each 15-minute interval across all Sundays, Mondays, etc., not split by date. The starting dataframe is something like this:
datetime high low day_of_the_week HL_delta
2021-08-01 22:00:00 4403.00 4395.25 6.0 7.75
2021-08-01 22:15:00 4404.00 4401.00 6.0 3.00
2021-08-01 22:30:00 4409.00 4403.25 6.0 5.75
2021-08-01 22:45:00 4408.25 4406.25 6.0 2.00
2021-08-01 23:00:00 4408.25 4405.75 6.0 2.5
where datetime is the index of the dataframe and is a DatetimeIndex.
I need to calculate the mean and max value of HL_delta for each distinct day of the week, grouped by 15-minute ranges, over 1 year of data.
I have tried something like this:
df_statistics['HL_mean'] = df_data_for_statistics.groupby([df_data_for_statistics['day_of_the_week'], pd.Grouper(freq=statistics_period_minutes_formatted,closed='left',label='left')]).agg({ "HL_delta": "mean"})
df_statistics['HL_max'] = df_data_for_statistics.groupby([df_data_for_statistics['day_of_the_week'], pd.Grouper(freq=statistics_period_minutes_formatted,closed='left',label='left')]).agg({ "HL_delta": "max"})
but what I get is not an aggregation over all the distinct weekdays of the year; the aggregation is applied to each 15-minute group of each date, not to each Monday, Tuesday, Wednesday, ... The statistics should answer questions like: "what is the max value of HL_delta between 00:00 and 00:15 across all the Mondays of the year", "what is the max value of HL_delta between 00:16 and 00:30 across all the Mondays of the year", ..., "what is the max value of HL_delta between 00:00 and 00:15 across all the Fridays of the year", etc. Instead, what I get from this attempt is this:
high low day_of_the_week HL_delta
datetime
2021-08-01 22:00:00 4403.00 4395.25 6.0 7.75
2021-08-01 22:15:00 4404.00 4401.00 6.0 3.00
2021-08-01 22:30:00 4409.00 4403.25 6.0 5.75
2021-08-01 22:45:00 4408.25 4406.25 6.0 2.00
2021-08-01 23:00:00 4408.25 4405.75 6.0 2.50
... ... ... ... ...
2022-03-21 22:45:00 4453.50 4451.50 0.0 2.00
2022-03-21 23:00:00 4452.25 4449.00 0.0 3.25
2022-03-21 23:15:00 4451.50 4449.25 0.0 2.25
2022-03-21 23:30:00 4451.50 4448.50 0.0 3.00
2022-03-21 23:45:00 4449.75 4445.25 0.0 4.50
Any suggestion?

With the following toy dataframe:
import random
import pandas as pd
list_of_dates = pd.date_range(start="2021-1-1", end="2021-12-31", freq="T")
df = pd.DataFrame(
    {
        "datetime": list_of_dates,
        "low": [random.randint(0, 4_999) for _ in range(len(list_of_dates))],
        "high": [random.randint(5_000, 9_999) for _ in range(len(list_of_dates))],
    }
)
df["HL_delta"] = df["high"] - df["low"]
print(df)
# Output
datetime low high HL_delta
0 2021-01-01 00:00:00 4325 5059 734
1 2021-01-01 00:01:00 917 7224 6307
2 2021-01-01 00:02:00 2956 7804 4848
3 2021-01-01 00:03:00 1329 8056 6727
4 2021-01-01 00:04:00 1721 9144 7423
...
Here is one way to do it:
# Setup
df["weekday"] = df["datetime"].dt.day_name()
df["datetime_minute"] = df["datetime"].dt.minute
intervals = {"0-15": [0, 15], "16-30": [16, 30], "31-45": [31, 45], "46-59": [46, 59]}
# Find intervals
df["interval"] = df.apply(
lambda x: next(
filter(
None,
[
key
if x["datetime_minute"] >= intervals[key][0]
and x["datetime_minute"] <= intervals[key][1]
else None
for key in intervals.keys()
],
)
),
axis=1,
)
# Get stats
new_df = (
    df.drop(columns=["low", "high", "datetime", "datetime_minute"])
    .groupby(["weekday", "interval"], sort=False)
    .agg(["max", "mean"])
)
And so:
print(new_df)
# Output
HL_delta
max mean
weekday interval
Friday 0-15 9989 5011.461666
16-30 9948 5003.452724
31-45 9902 4969.810577
46-59 9926 5007.599073
Saturday 0-15 9950 5004.103966
16-30 9961 4984.479327
31-45 9954 5005.854647
46-59 9973 5011.797447
Sunday 0-15 9979 4994.012270
16-30 9950 4981.940438
31-45 9877 5009.572276
46-59 9930 5020.719609
Monday 0-15 9974 4963.538812
16-30 9918 4977.481090
31-45 9971 4977.858173
46-59 9958 4992.886733
Tuesday 0-15 9924 5014.045623
16-30 9966 4990.358547
31-45 9948 4993.595566
46-59 9948 5000.271120
Wednesday 0-15 9975 4998.320463
16-30 9976 4981.763889
31-45 9981 4981.806303
46-59 9995 5001.579670
Thursday 0-15 9958 5015.276643
16-30 9900 4996.797489
31-45 9949 4991.088034
46-59 9948 4980.678457
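As a side note: if you need one row per weekday per 15-minute slot of the whole day (e.g. Monday 00:00-00:15 separately from Monday 13:00-13:15, as the question phrases it), a vectorized sketch using dt.floor avoids the row-wise apply; the weekday and slot column names here are just illustrative:
# A sketch on the same toy df: floor each timestamp to its 15-minute slot,
# keep only the time of day, then group by weekday and slot.
df["weekday"] = df["datetime"].dt.day_name()
df["slot"] = df["datetime"].dt.floor("15min").dt.time
stats = df.groupby(["weekday", "slot"])["HL_delta"].agg(["max", "mean"])
print(stats)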

Related

Transform an hourly dataframe into a monthly totalled dataframe in Python

I have a Pandas dataframe, called df, containing hourly precipitation data (tp) between 2013 and 2020:
tp
time
2013-01-01 00:00:00 0.1
2013-01-01 01:00:00 0.1
2013-01-01 02:00:00 0.1
2013-01-01 03:00:00 0.0
2013-01-01 04:00:00 0.2
...
2020-12-31 19:00:00 0.2
2020-12-31 20:00:00 0.1
2020-12-31 21:00:00 0.0
2020-12-31 22:00:00 0.1
2020-12-31 23:00:00 0.0
I'm trying to convert this hourly dataset into monthly totals for each year. I then want to take the average of the monthly summed rainfall, so that I end up with a dataframe with 12 rows, one for each month, showing the average summed rainfall over the whole period.
I've tried the resample function:
df.resample('M').mean()
However, this outputs the following and is not what I'm looking to achieve:
tp1
time
2013-01-31 0.121634
2013-02-28 0.318097
2013-03-31 0.356973
2013-04-30 0.518160
2013-05-31 0.055290
...
2020-09-30 0.132713
2020-10-31 0.070817
2020-11-30 0.060525
2020-12-31 0.040002
2021-01-31 0.000000
[97 rows x 1 columns]
While it's converting the hourly data to monthly, I want to show an average of the rainfall across the years.
e.g.
January Column = Average of January rainfall between 2013 and 2020.
Assuming your index is a DatetimeIndex, you can use:
out = df.groupby(df.index.month).mean()
print(out)
# Output
tp1
time
1 0.498262
2 0.502057
3 0.502644
4 0.496880
5 0.499100
6 0.497931
7 0.504981
8 0.497841
9 0.499646
10 0.499804
11 0.506938
12 0.501172
Setup:
import pandas as pd
import numpy as np
np.random.seed(2022)
dti = pd.date_range('2013-01-31', '2021-01-31', freq='H', name='time')
df = pd.DataFrame({'tp1': np.random.random(len(dti))}, index=dti)
print(df)
# Output
tp1
time
2013-01-31 00:00:00 0.009359
2013-01-31 01:00:00 0.499058
2013-01-31 02:00:00 0.113384
2013-01-31 03:00:00 0.049974
2013-01-31 04:00:00 0.685408
... ...
2021-01-30 20:00:00 0.021295
2021-01-30 21:00:00 0.275759
2021-01-30 22:00:00 0.367263
2021-01-30 23:00:00 0.777680
2021-01-31 00:00:00 0.021225
[70129 rows x 1 columns]
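Note that this averages the hourly values directly. If you want the average of the monthly totals instead (sum each month of each year, then average across years, as the question describes), a two-step sketch under the same setup:
# Sum to one total per calendar month per year, then average those totals
# across years by month number.
monthly_totals = df.resample('M').sum()
out = monthly_totals.groupby(monthly_totals.index.month).mean()
print(out)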

Generate date_range for string intervals like 09:00-10:00/11:00-14:00/15:00-18:00 with built-in pandas functions

I've been reading the forum and investigating on the internet, but I can't figure out how to use pandas functions to condense this whole code:
def get_time_and_date(schedule, starting_date, position):
    # calculate time and date for each start and ending time;
    # if the ending time < starting time, add one day to the ending
    my_time = datetime.strptime(schedule.split('-')[position], '%H:%M')
    my_date = datetime.strptime(starting_date, '%Y-%m-%d')
    # get the starting hour for the range if we are calculating the last interval
    if position == 1:
        starting_hour = datetime.strptime(schedule.split('-')[0], '%H:%M')
        starting_hour = datetime(my_date.year, my_date.month, my_date.day, starting_hour.hour, 0)
    # unify my_time and my_date, normalizing the minutes
    if my_time.minute >= 30:
        my_hour_and_date = datetime(my_date.year, my_date.month, my_date.day, my_time.hour, 30)
    else:
        my_hour_and_date = datetime(my_date.year, my_date.month, my_date.day, my_time.hour, 0)
    # if the final time of the day < the starting time, there is a day jump, so we add a day
    if position == 1 and my_hour_and_date < starting_hour:
        my_hour_and_date += timedelta(days=1)
    return my_hour_and_date

def get_time_interval_ranges(schedule, my_date):
    # get all matching schedules if there are any
    schedules = schedule.split('/')
    intervals_list = []
    # loop through all the schedules and add each range, followed by a split separator
    for my_schedule in schedules:
        current_range = pd.date_range(start=get_time_and_date(my_schedule, my_date, 0), end=get_time_and_date(my_schedule, my_date, 1), freq="30min").strftime('%Y-%m-%d, %H:%M').to_list()
        intervals_list += current_range
        intervals_list.append('separate_range_here')
    return intervals_list

def generate_time_intervals(df, column_to_process, new_column):
    # generate the range-of-times column
    df[new_column] = df.apply(lambda row: get_time_interval_ranges(row[column_to_process], row['my_date']), axis=1)
    return df
I believe there is a better way to do this, but I can't find out how. What I'm giving to the first function (generate_time_intervals) is a dataframe with several columns, but only Date (yyyy-mm-dd) and Schedule are important.
When the schedule is 09:00-15:00 it's easy: just split by the "-" and give it to the built-in function date_range. The problem comes with handling awkward times like the one in the title, or the likes of 09:17-16:24.
Is there any way to handle this without so much looping and the like in my code?
Edit:
With this input:
Worker   Date        Schedule
Worker1  2022-05-01  09:00-10:00/11:00-14:00/15:00-18:00
Worker2  2022-05-01  09:37-15:38
I would like this output:
Date        Interval  Working Minutes
2022-05-01  09:00     30
2022-05-01  09:30     53
2022-05-01  10:00     30
2022-05-01  10:30     30
2022-05-01  11:00     60
2022-05-01  11:30     60
2022-05-01  12:00     60
2022-05-01  12:30     60
2022-05-01  13:00     60
2022-05-01  13:30     60
2022-05-01  14:00     30
2022-05-01  14:30     30
2022-05-01  15:00     60
2022-05-01  15:30     38
2022-05-01  16:00     30
2022-05-01  16:30     30
2022-05-01  17:00     30
2022-05-01  17:30     30
2022-05-01  18:00     0
Working with datetime:
df = pd.DataFrame({'schedule': ['09:17-16:24', '19:40-21:14']})
schedules = df.schedule.str.split('-', expand=True)
start = pd.to_datetime(schedules[0]).dt.round('H')
end = pd.to_datetime(schedules[1]).dt.round('H')
df['interval_out'] = start.dt.hour.astype(str) + ':00 - ' + end.dt.hour.astype(str) + ':00'
And result:
>>> df
schedule
0 09:17-16:24
1 19:40-21:14
>>> schedules
0 1
0 09:17 16:24
1 19:40 21:14
>>> start
0 2022-05-18 09:00:00
1 2022-05-18 20:00:00
Name: 0, dtype: datetime64[ns]
>>> end
0 2022-05-18 16:00:00
1 2022-05-18 21:00:00
Name: 1, dtype: datetime64[ns]
>>> df
schedule interval_out
0 09:17-16:24 9:00 - 16:00
1 19:40-21:14 20:00 - 21:00
Of course the rounding should be floor & ceil if you want to expand it...
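For instance, a sketch of that floor/ceil variant on the same schedules frame:
# Floor the start down and ceil the end up to the hour, instead of rounding both
start = pd.to_datetime(schedules[0]).dt.floor('H')
end = pd.to_datetime(schedules[1]).dt.ceil('H')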
EDIT: Trying the original question... It also helps to read about the datetime functions in pandas (which I have now learnt) :facepalm:
1. Expand the blocks into individual start/stop items
2. Floor / ceil them for the start/stop
3. Calculate the intervals using a convenient pandas function
4. Explode the intervals as rows
5. Calculate the late starts
6. Calculate the early stops (soon_stop)
7. Calculate how many people were actually in the office
8. Group data on slots, adding lost minutes and worked minutes * workers
9. Do the calculation
# 1-2. Expand the schedule blocks into rows, split start/stop, and floor/ceil them to 30-minute slots
df['timeblocks'] = df.Schedule.str.split('/')
df2 = df.explode('timeblocks')
timeblocks = df2.timeblocks.str.split('-', expand=True)
df2['start'] = pd.to_datetime(df2.Date + " " + timeblocks[0])
df2['stop'] = pd.to_datetime(df2.Date + " " + timeblocks[1])
df2['start_slot'] = df2.start.dt.floor('30min')
df2['stop_slot'] = df2.stop.dt.ceil('30min')
# 3-4. Build the 30-minute intervals per block and explode them into rows
df2['intervals'] = df2.apply(lambda x: pd.date_range(x.start_slot, x.stop_slot, freq='30min'), axis=1)
df3 = df2.explode('intervals')
# 5-7. Minutes lost to late starts / early stops, and whether someone was present in each slot
df3['late_start'] = (df3.start > df3.intervals) * (df3.start - df3.intervals).dt.seconds / 60
df3['soon_stop'] = ((df3.stop > df3.intervals) & (df3.stop < (df3.intervals + pd.Timedelta('30min')))) * ((df3.intervals + pd.Timedelta('30min')) - df3.stop).dt.seconds / 60
df3['someone'] = (df3.start < df3.intervals + pd.Timedelta('30min')) & (df3.stop > df3.intervals)
# 8-9. Group by slot and compute the worked minutes
df4 = df3.groupby('intervals').agg({'late_start': 'sum', 'soon_stop': 'sum', 'someone': 'sum'})
df4['worked_time'] = df4.someone * 30 - df4.late_start - df4.soon_stop
df4
>>> df4
late_start soon_stop someone worked_time
intervals
2022-05-01 09:00:00 0.0 0.0 1 30.0
2022-05-01 09:30:00 7.0 0.0 2 53.0
2022-05-01 10:00:00 0.0 0.0 1 30.0
2022-05-01 10:30:00 0.0 0.0 1 30.0
2022-05-01 11:00:00 0.0 0.0 2 60.0
2022-05-01 11:30:00 0.0 0.0 2 60.0
2022-05-01 12:00:00 0.0 0.0 2 60.0
2022-05-01 12:30:00 0.0 0.0 2 60.0
2022-05-01 13:00:00 0.0 0.0 2 60.0
2022-05-01 13:30:00 0.0 0.0 2 60.0
2022-05-01 14:00:00 0.0 0.0 1 30.0
2022-05-01 14:30:00 0.0 0.0 1 30.0
2022-05-01 15:00:00 0.0 0.0 2 60.0
2022-05-01 15:30:00 0.0 22.0 2 38.0
2022-05-01 16:00:00 0.0 0.0 1 30.0
2022-05-01 16:30:00 0.0 0.0 1 30.0
2022-05-01 17:00:00 0.0 0.0 1 30.0
2022-05-01 17:30:00 0.0 0.0 1 30.0
2022-05-01 18:00:00 0.0 0.0 0 0.0
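To reshape df4 into the exact Date / Interval / Working Minutes layout asked for, one more step could look like this (a sketch; the output column names are simply the ones from the question):
out = df4.reset_index()
out['Date'] = out.intervals.dt.date
out['Interval'] = out.intervals.dt.strftime('%H:%M')
out = out[['Date', 'Interval', 'worked_time']].rename(columns={'worked_time': 'Working Minutes'})
print(out)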

Adding missing time stamp rows to a df in pandas

I have very unusual time series data which is both irregular and has several missing values.
The data points are measured 3 times a day only on weekdays, at 10:00AM, 2:00PM, and 6:00PM, most days are missing one or two measurements, and some days are missing altogether.
My df looks something like this:
date time | value
0 2020-07-30 10:00:00 5
1 2020-07-30 14:00:00 3
2 2020-07-31 10:00:00 6
3 2020-07-31 14:00:00 4.5
4 2020-07-31 18:00:00 7
5 2020-08-03 14:00:00 5.5
6 2020-08-04 14:00:00 5
I'm trying to figure out how to fill it out with the timestamps for the missing measurements, adding a row with the missing timestamp and an NA value, but without adding extra times of day or any Saturdays or Sundays, so that my df looks like this at the end:
date time | value
0 2020-07-30 10:00:00 5
1 2020-07-30 14:00:00 3
2 2020-07-30 18:00:00 NA
3 2020-07-31 10:00:00 6
4 2020-07-31 14:00:00 4.5
5 2020-07-31 18:00:00 7
6 2020-08-03 10:00:00 NA
7 2020-08-03 14:00:00 5.5
8 2020-08-03 18:00:00 NA
9 2020-08-04 10:00:00 NA
10 2020-08-04 14:00:00 5
11 2020-08-04 18:00:00 NA
The only thing I could come up with was pretty convoluted: write a loop to generate a row for every date in the desired date range times 3 (one per measurement time), formatted as datetimes, along with an additional day-of-week counter; convert that into a df, drop all rows where the day of week is 6 or 7, then join this new df with my original df on the datetime column (outer or left, whichever keeps all rows).
Is there any more elegant way of doing this?
You could create a filtered date range and reindex by it (this assumes the combined date-time column is named datetime):
all_ts = pd.date_range(start=df['datetime'].min(), end=df['datetime'].max(), freq='H')
weekday_ts = all_ts[~all_ts.weekday.isin([5,6])]
filtered_ts = weekday_ts[weekday_ts.hour.isin([10, 14, 18])]
df.set_index(df['datetime']).reindex(filtered_ts).drop('datetime', axis=1).reset_index()
import datetime

import pandas as pd

df = pd.DataFrame(
    [
        {"date time": datetime.datetime.strptime("2020-07-30 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
        {"date time": datetime.datetime.strptime("2020-07-30 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 3},
        {"date time": datetime.datetime.strptime("2020-07-31 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 6},
        {"date time": datetime.datetime.strptime("2020-07-31 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 4.5},
        {"date time": datetime.datetime.strptime("2020-07-31 18:00:00", '%Y-%m-%d %H:%M:%S'), "value": 7},
        {"date time": datetime.datetime.strptime("2020-08-03 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5.5},
        {"date time": datetime.datetime.strptime("2020-08-04 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
    ]
)
# index by the timestamp so we can reindex below
df = df.set_index("date time")
# define your range of dates you're working with
range_dates = pd.date_range('2020-07-30', '2020-08-04', freq='D')
# remove weekend days
range_dates = range_dates[~range_dates.weekday.isin([5,6])]
range_dates = pd.Series(range_dates)
# here we will create a range of your 3 hours of measurements
range_times = pd.date_range('10:00:00', '18:00:00', freq='4H')
range_times = pd.Series(range_times.time)
# we combine our two ranges
index = range_dates.apply(
    lambda date: range_times.apply(
        lambda time: datetime.datetime.combine(date, time)
    )
).unstack()
# we reindex the dataframe and sort it
df = df.reindex(index=index).sort_index()
Output:
value
2020-07-30 10:00:00 5.0
2020-07-30 14:00:00 3.0
2020-07-30 18:00:00 NaN
2020-07-31 10:00:00 6.0
2020-07-31 14:00:00 4.5
2020-07-31 18:00:00 7.0
2020-08-03 10:00:00 NaN
2020-08-03 14:00:00 5.5
2020-08-03 18:00:00 NaN
2020-08-04 10:00:00 NaN
2020-08-04 14:00:00 5.0
2020-08-04 18:00:00 NaN

How to calculate daily averages from noon to noon with pandas?

I am fairly new to python and pandas, so I apologise for any future misunderstandings.
I have a pandas DataFrame with hourly values, looking something like this:
2014-04-01 09:00:00 52.9 41.1 36.3
2014-04-01 10:00:00 56.4 41.6 70.8
2014-04-01 11:00:00 53.3 41.2 49.6
2014-04-01 12:00:00 50.4 39.5 36.6
2014-04-01 13:00:00 51.1 39.2 33.3
2016-11-30 16:00:00 16.0 13.5 36.6
2016-11-30 17:00:00 19.6 17.4 44.3
Now I need to calculate 24h average values for each column, starting from 2014-04-01 12:00 through 2014-04-02 11:00, and so on.
So I want daily averages from noon to noon.
Unfortunately, I have no idea how to do that. I have read some suggestions to use groupby, but I don't really know how...
Thank you very much in advance! Any help is appreciated!!
For newer versions of pandas (>= 1.1.0), use the offset argument:
df.resample('24H', offset='12H').mean()
For older versions, use the base argument: a day is 24 hours, so base=12 starts the grouping at noon, giving noon-to-noon bins. Resample gives you all days in between, so you can .dropna(how='all') if you don't need the complete basis. (I assume you have a DatetimeIndex; if not, you can use the on argument of resample to specify your datetime column.)
df.resample('24H', base=12).mean()
#df.groupby(pd.Grouper(level=0, base=12, freq='24H')).mean() # Equivalent
1 2 3
0
2014-03-31 12:00:00 54.20 41.30 52.233333
2014-04-01 12:00:00 50.75 39.35 34.950000
2014-04-02 12:00:00 NaN NaN NaN
2014-04-03 12:00:00 NaN NaN NaN
2014-04-04 12:00:00 NaN NaN NaN
... ... ... ...
2016-11-26 12:00:00 NaN NaN NaN
2016-11-27 12:00:00 NaN NaN NaN
2016-11-28 12:00:00 NaN NaN NaN
2016-11-29 12:00:00 NaN NaN NaN
2016-11-30 12:00:00 17.80 15.45 40.450000
You could subtract 12 hours from the index, normalize it to midnight, and group by that:
df.groupby((df.index - pd.to_timedelta('12:00:00')).normalize()).mean()
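The resulting group labels are the midnights of the shifted days; to label each bin by the noon it actually starts at, a small follow-up sketch:
# Shift the labels back so each group is labelled by its noon start
out = df.groupby((df.index - pd.to_timedelta('12:00:00')).normalize()).mean()
out.index = out.index + pd.Timedelta(hours=12)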
You can shift the times by 12 hours and resample at the day level.
from io import StringIO
import pandas as pd
data = """
2014-04-01 09:00:00,52.9,41.1,36.3
2014-04-01 10:00:00,56.4,41.6,70.8
2014-04-01 11:00:00,53.3,41.2,49.6
2014-04-01 12:00:00,50.4,39.5,36.6
2014-04-01 13:00:00,51.1,39.2,33.3
2016-11-30 16:00:00,16.0,13.5,36.6
2016-11-30 17:00:00,19.6,17.4,44.3
"""
df = pd.read_csv(StringIO(data), sep=',', header=None, index_col=0)
df.index = pd.to_datetime(df.index)
# shift by 12 hours
df.index = df.index - pd.Timedelta(hours=12)
# resample and drop na rows
df.resample('D').mean().dropna()

Optimize code to find the median of values over the past 30 days for each row in a DataFrame

I'd like to find faster code to achieve the same goal: for each row, compute the median of all data in the past 30 days; if there are fewer than 5 data points, return np.nan.
import pandas as pd
import numpy as np
import datetime
def findPastVar(df, var='var', window=30, method='median'):
    # window = number of past days
    def findPastVar_apply(row):
        pastVar = df[var].loc[(df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0)) & (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window))]
        if len(pastVar) < 5:
            return np.nan
        if method == 'median':
            return np.median(pastVar.values)
    df['past{}d_{}_median'.format(window, var)] = df.apply(findPastVar_apply, axis=1)
    return df
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
Data looks like this. In my real data, there are gaps in time and possibly more than one data point per day.
In [47]: df.head()
Out[47]:
timestamp var
0 2011-01-01 00:00:00 -0.670695
1 2011-01-02 00:00:00 0.315148
2 2011-01-03 00:00:00 -0.717432
3 2011-01-04 00:00:00 2.904063
4 2011-01-05 00:00:00 -1.092813
Desired output:
In [55]: df.head(10)
Out[55]:
timestamp var past30d_var_median
0 2011-01-01 00:00:00 -0.670695 NaN
1 2011-01-02 00:00:00 0.315148 NaN
2 2011-01-03 00:00:00 -0.717432 NaN
3 2011-01-04 00:00:00 2.904063 NaN
4 2011-01-05 00:00:00 -1.092813 NaN
5 2011-01-06 00:00:00 -2.676784 -0.670695
6 2011-01-07 00:00:00 -0.353425 -0.694063
7 2011-01-08 00:00:00 -0.223442 -0.670695
8 2011-01-09 00:00:00 0.162126 -0.512060
9 2011-01-10 00:00:00 0.633801 -0.353425
However, this is the running speed of my current code:
In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop
I need to run this on a large dataframe from time to time, so I want to optimize the code.
Any suggestions or comments are welcome.
New in pandas 0.19 is time-aware rolling. It can deal with missing data.
Code:
print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
Test Code:
df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
df['timestamp'] = df.timestamp.astype(pd.Timestamp)
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))
# duplicate one sample
df.timestamp.loc[50] = df.timestamp.loc[51]
# drop some data
df = df.drop(range(15, 50))
df['median'] = df.rolling(
'30d', on='timestamp', min_periods=5)['var'].median()
Results:
timestamp var median
0 2011-01-01 00:00:00 -0.639901 NaN
1 2011-01-02 00:00:00 -1.212541 NaN
2 2011-01-03 00:00:00 1.015730 NaN
3 2011-01-04 00:00:00 -0.203701 NaN
4 2011-01-05 00:00:00 0.319618 -0.203701
5 2011-01-06 00:00:00 1.272088 0.057958
6 2011-01-07 00:00:00 0.688965 0.319618
7 2011-01-08 00:00:00 -1.028438 0.057958
8 2011-01-09 00:00:00 1.418207 0.319618
9 2011-01-10 00:00:00 0.303839 0.311728
10 2011-01-11 00:00:00 -1.939277 0.303839
11 2011-01-12 00:00:00 1.052173 0.311728
12 2011-01-13 00:00:00 0.710270 0.319618
13 2011-01-14 00:00:00 1.080713 0.504291
14 2011-01-15 00:00:00 1.192859 0.688965
50 2011-02-21 00:00:00 -1.126879 NaN
51 2011-02-21 00:00:00 0.213635 NaN
52 2011-02-22 00:00:00 -1.357243 NaN
53 2011-02-23 00:00:00 -1.993216 NaN
54 2011-02-24 00:00:00 1.082374 -1.126879
55 2011-02-25 00:00:00 0.124840 -0.501019
56 2011-02-26 00:00:00 -0.136822 -0.136822
57 2011-02-27 00:00:00 -0.744386 -0.440604
58 2011-02-28 00:00:00 -1.960251 -0.744386
59 2011-03-01 00:00:00 0.041767 -0.440604
You can try rolling_median, an O(N log(window)) implementation using a skip list:
pd.rolling_median(df, window=30, min_periods=5)
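Note that pd.rolling_median has since been deprecated and removed; in modern pandas the equivalent row-based call is:
# Same 30-row window with at least 5 observations, in current pandas
df.rolling(window=30, min_periods=5).median()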
