Pandas: compute average and standard deviation by clock time - python

I have a DataFrame like this:
date time value
0 2019-04-18 07:00:10 100.8
1 2019-04-18 07:00:20 95.6
2 2019-04-18 07:00:30 87.6
3 2019-04-18 07:00:40 94.2
The DataFrame contains a value recorded every 10 seconds for the entire year 2019. I need to calculate the standard deviation and mean/average of value for each hour of each date, and create two new columns for them. I first tried separating the hour for each value like:
df["hour"] = df["time"].astype(str).str[:2]
Then I have tried to calculate standard deviation by:
df["std"] = df.groupby("hour").median().index.get_level_values('value').stack().std()
But that won't work. Could I have some advice on the problem?

We can split the time column around the delimiter :, take the hour component with str[0], and finally group the dataframe on date along with the hour component, aggregating column value with mean and std:
hr = df['time'].str.split(':', n=1).str[0]
df.groupby(['date', hr])['value'].agg(['mean', 'std'])
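With the four sample rows (all within hour 07 of 2019-04-18) this yields a single group; the mean is 94.55 and the sample standard deviation is about 5.434:
                 mean       std
date       time
2019-04-18 07    94.55  5.434151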
If you want to broadcast the aggregated values back to the original dataframe, use transform instead of agg:
g = df.groupby(['date', df['time'].str.split(':', n=1).str[0]])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')
date time value mean std
0 2019-04-18 07:00:10 100.8 94.55 5.434151
1 2019-04-18 07:00:20 95.6 94.55 5.434151
2 2019-04-18 07:00:30 87.6 94.55 5.434151
3 2019-04-18 07:00:40 94.2 94.55 5.434151

I have synthesized data. Start by generating a true datetime column, then:
groupby() hour
use describe() to get mean & std
merge() back to the original data frame
import numpy as np
import pandas as pd

# one full year of 10-second samples; end-exclusive so the last stamp
# is 2019-12-30 23:59:50, matching the output below
d = pd.date_range("1-Jan-2019", "31-Dec-2019", freq="10S", closed="left")
df = pd.DataFrame({"datetime": d, "value": np.random.uniform(70, 90, len(d))})
df = df.assign(date=df.datetime.dt.strftime("%Y-%m-%d"),
               time=df.datetime.dt.strftime("%H:%M:%S"))
# create a datetime column - better than manipulating strings
df["datetime"] = pd.to_datetime(df.date + " " + df.time)
# calc mean & std by hour
dfh = (df.groupby(df.datetime.dt.hour, as_index=False)
         .apply(lambda dfa: dfa.describe().T.loc[:, ["mean", "std"]].reset_index(drop=True))
         .droplevel(1)
      )
# merge mean & std by hour back
df.merge(dfh, left_on=df.datetime.dt.hour, right_index=True).drop(columns="key_0")
datetime value mean std
0 2019-01-01 00:00:00 86.014209 80.043364 5.777724
1 2019-01-01 00:00:10 77.241141 80.043364 5.777724
2 2019-01-01 00:00:20 71.650739 80.043364 5.777724
3 2019-01-01 00:00:30 71.066332 80.043364 5.777724
4 2019-01-01 00:00:40 77.203291 80.043364 5.777724
... ... ... ... ...
3144955 2019-12-30 23:59:10 89.577237 80.009751 5.773007
3144956 2019-12-30 23:59:20 82.154883 80.009751 5.773007
3144957 2019-12-30 23:59:30 82.131952 80.009751 5.773007
3144958 2019-12-30 23:59:40 85.346724 80.009751 5.773007
3144959 2019-12-30 23:59:50 78.122761 80.009751 5.773007

Related

monthly resampling pandas with specific start day

I'm creating a pandas DataFrame with random dates and random integer values, and I want to resample it by month and compute the average value of the integers. This can be done with the following code:
def random_dates(start='2018-01-01', end='2019-01-01', n=300):
    start_u = pd.to_datetime(start).value // 10**9
    end_u = pd.to_datetime(end).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The thing is that the resampled months always start from day one, and I want all months to start from day 15. I'm using pandas 1.1.4 and I've tried using origin='15/01/2018' or offset='15', and neither works with the 'M' resample rule (they do work when I use '30D', but that is of no use). I've also tried to use '2SM', but it doesn't work either.
So my question is: is there a way of changing the resample rule, or will I have to add an offset to my data?
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift the Month column so that the 15th day of the month becomes the 1st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest resampling using 'SMS' - semi-month start frequency (1st and 15th). Instead of keeping just the mean values, keep the count and sum values and recalculate the weighted mean for each monthly period from its two sub-periods (for example: 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantage here is that, unlike with an (improper) use of an offset, we are certain we always go from the 15th of the month to the 14th of the next month.
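For reference, a quick sketch of what the 'SMS' alias generates - period boundaries on the 1st and the 15th of each month:
pd.date_range('2018-01-01', periods=4, freq='SMS')
# DatetimeIndex(['2018-01-01', '2018-01-15', '2018-02-01', '2018-02-15'], ...)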
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Compute a rolling sum and a rolling count over pairs of adjacent sub-periods, then derive the mean from them:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
Integers sum_rolling count_rolling mean
sum count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm[1::2]['mean']
Edit: Changed a name of one of the columns to make it clearer

How to define a 4-4-5 week period in Pandas

My company uses a 4-4-5 calendar for reporting purposes. Each month (aka period) is 4 weeks long, except every 3rd month is 5 weeks long.
Pandas seems to have good support for custom calendar periods. However, I'm having trouble figuring out the correct frequency string or custom business month offset to achieve months for a 4-4-5 calendar.
For example:
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(
    index=df_index, columns=["a"], data=np.random.randint(0, 100, size=len(df_index))
)
df.groupby(pd.Grouper(level=0, freq="4W-SUN")).mean()
Grouping by 4 weeks starting on Sunday results in the following. The first three month start dates are correct, but I need every third month to be 5 weeks long. The 4th month start date should be 2020-06-28.
a
date
2020-03-29 16.000000
2020-04-26 50.250000
2020-05-24 39.071429
2020-06-21 52.464286
2020-07-19 41.535714
2020-08-16 46.178571
2020-09-13 51.857143
2020-10-11 44.250000
2020-11-08 47.714286
2020-12-06 56.892857
2021-01-03 55.821429
2021-01-31 53.464286
2021-02-28 53.607143
2021-03-28 45.037037
Essentially what I'd like to achieve is something like this:
a
date
2020-03-29 20.000000
2020-04-26 50.750000
2020-05-24 49.750000
2020-06-28 49.964286
2020-07-26 52.214286
2020-08-23 47.714286
2020-09-27 46.250000
2020-10-25 53.357143
2020-11-22 52.035714
2020-12-27 39.750000
2021-01-24 43.428571
2021-02-21 49.392857
Pandas currently supports only yearly and quarterly 52-53 week calendars (aka 4-4-5 calendars).
See pandas.tseries.offsets.FY5253 and pandas.tseries.offsets.FY5253Quarter.
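To get a feel for these offsets, a minimal sketch (weekday=6 anchors week ends on Sunday, as in the answer below; the exact date returned depends on the startingMonth and variation defaults):
from pandas.tseries.offsets import FY5253Quarter
# rolls a timestamp forward to the next 52-53 week fiscal quarter boundary
pd.Timestamp("2020-03-29") + FY5253Quarter(weekday=6)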
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(index=df_index)
df['a'] = np.random.randint(0, 100, df.shape[0])
So indeed you need some more work to get to week level while maintaining a 4-4-5 calendar. You could align to quarters using the native pandas offset and fill in the 4-4-5 week pattern manually.
def date_range(start, end, offset_array, name=None):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    index = []
    start -= offset_array[0]
    while start < end:
        for x in offset_array:
            start += x
            if start > end:
                break
            index.append(start)
    return pd.Series(index, name=name)
This function takes a list of offsets rather than a regular frequency period, so it allows moving from date to date following the offsets in the given array:
offset_445 = [
    pd.tseries.offsets.FY5253Quarter(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
]
df_index_445 = date_range("2020-03-29", "2021-03-27", offset_445, name='date')
Out:
0 2020-05-03
1 2020-05-31
2 2020-06-28
3 2020-08-02
4 2020-08-30
5 2020-09-27
6 2020-11-01
7 2020-11-29
8 2020-12-27
9 2021-01-31
10 2021-02-28
Name: date, dtype: datetime64[ns]
Once the index is created, it's back to aggregation logic to get the data into the right row buckets. Assuming that you want the mean for each 4- or 5-week period starting at the dates in the df_index_445 you have generated, it could look like this:
# calculate the mean on reindex groups
reindex = df_index_445.searchsorted(df.index, side='right') - 1
res = df.groupby(reindex).mean()
# filter valid output
res = res[res.index>=0]
res.index = df_index_445
Out:
a
2020-05-03 47.857143
2020-05-31 53.071429
2020-06-28 49.257143
2020-08-02 40.142857
2020-08-30 47.250000
2020-09-27 52.485714
2020-11-01 48.285714
2020-11-29 56.178571
2020-12-27 51.428571
2021-01-31 50.464286
2021-02-28 53.642857
Note that since the frequency is not regular, pandas will set the datetime index frequency to None.
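A quick sanity check on the spacing (a sketch; the gaps between consecutive dates should be a mix of 28 and 35 days, matching the 4-4-5 pattern):
df_index_445.diff().value_counts()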

Difference between pandas aggregators .first() and .last()

I'm curious as to what last() and first() do in this specific instance (when chained to a resampling). Correct me if I'm wrong, but I understand that if you pass arguments into first and last, e.g. 3, they return the first 3 months or first 3 years.
In this circumstance, since I'm not passing any arguments into first() and last(), what are they actually doing when I resample like that? I know that if I resample by chaining .mean(), I'll resample into years with the mean score from averaging all the months, but what is happening when I use last()?
More importantly, why do first() and last() give me different answers in this context? I see that numerically they are not equal.
i.e.: post2008.resample().first() != post2008.resample().last()
TL;DR:
What do .first() and .last() do?
What do .first() and .last() do in this instance, when chained to a resample?
Why does .resample().first() != .resample().last()?
This is the code before the aggregation:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]
# Print the last 8 rows of post2008
print(post2008.tail(8))
This is what print(post2008.tail(8)) outputs:
VALUE
DATE
2014-07-01 17569.4
2014-10-01 17692.2
2015-01-01 17783.6
2015-04-01 17998.3
2015-07-01 18141.9
2015-10-01 18222.8
2016-01-01 18281.6
2016-04-01 18436.5
Here is the code that resamples and aggregates by last():
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)
This is what yearly is like when it's post2008.resample('A').last():
VALUE
DATE
2008-12-31 14549.9
2009-12-31 14566.5
2010-12-31 15230.2
2011-12-31 15785.3
2012-12-31 16297.3
2013-12-31 16999.9
2014-12-31 17692.2
2015-12-31 18222.8
2016-12-31 18436.5
Here is the code that resamples and aggregates by first():
# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)
This is what yearly is like when it's post2008.resample('A').first():
VALUE
DATE
2008-12-31 14668.4
2009-12-31 14383.9
2010-12-31 14681.1
2011-12-31 15238.4
2012-12-31 15973.9
2013-12-31 16475.4
2014-12-31 17025.2
2015-12-31 17783.6
2016-12-31 18281.6
Before anything else, let's create a dataframe with example data:
import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-07-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)
The output will be
VALUE
2014-07-01 1000
2014-10-01 2000
2015-01-01 3000
2015-04-01 4000
2015-07-01 5000
2015-07-01 6000
2016-01-01 7000
2016-04-01 8000
If we pass e.g. '6M' to df.first (which is not an aggregator, but a DataFrame method), we will be selecting the first six months of data, which in the example above consists of just two rows:
print(df.first('6M'))
VALUE
2014-07-01 1000
2014-10-01 2000
Similarly, last returns only the rows that belong to the last six months of data:
print(df.last('6M'))
VALUE
2016-01-01 7000
2016-04-01 8000
In this context, not passing the required argument results in an error:
print(df.first())
TypeError: first() missing 1 required positional argument: 'offset'
On the other hand, df.resample('Y') returns a Resampler object, which has aggregation methods first, last, mean, etc. In this case, they keep only the first (respectively, last) values of each year (instead of e.g. averaging all values, or some other kind of aggregation):
print(df.resample('Y').first())
VALUE
2014-12-31 1000
2015-12-31 3000 # This is the first of the 4 values from 2015
2016-12-31 7000
print(df.resample('Y').last())
VALUE
2014-12-31 2000
2015-12-31 6000 # This is the last of the 4 values from 2015
2016-12-31 8000
As an extra example, consider also the case of grouping by a smaller period:
print(df.resample('M').last().head())
VALUE
2014-07-31 1000.0 # This is the last (and only) value from July, 2014
2014-08-31 NaN # No data for August, 2014
2014-09-30 NaN # No data for September, 2014
2014-10-31 2000.0
2014-11-30 NaN # No data for November, 2014
In this case, any periods for which there is no value will be filled with NaNs. Also, for this example, using first instead of last would have returned the same values, since each month has (at most) one value.

How to resample yearly starting from 1st of June to 31st of May?

How do I resample a dataframe with a daily time-series index to yearly, but not from 1st Jan to 31st Dec? Instead I want the yearly sum from 1 June to 31 May.
First I did this, which gives me the yearly sum from 1 Jan to 31 Dec:
df.resample(rule='A').sum()
I have tried using the base-parameter, but it does not change the resample sum.
df.resample(rule='A', base=100).sum()
Here is a part of my dataframe:
In []: df
Out[]:
Index ET P R
2010-01-01 00:00:00 -0.013 0.0 0.773
2010-01-02 00:00:00 0.0737 0.21 0.797
2010-01-03 00:00:00 -0.048 0.0 0.926
...
In []: df.resample(rule='A', base = 0, label='left').sum()
Out []:
Index
2009-12-31 00:00:00 424.131138 871.48 541.677405
2010-12-31 00:00:00 405.625780 939.06 575.163096
2011-12-31 00:00:00 461.586365 1064.82 710.507947
...
I would really appreciate it if anyone could help me figure out how to do this. Thank you!
Use 'AS-JUN' as the rule with resample:
# Example data
idx = pd.date_range('2017-01-01', '2018-12-31')
s = pd.Series(1, idx)
# Resample
s = s.resample('AS-JUN').sum()
The resulting output:
2016-06-01 151
2017-06-01 365
2018-06-01 214
Freq: AS-JUN, dtype: int64
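If you would rather label each window by its end (31 May) instead of its start, the year-end anchored alias 'A-MAY' should give the same June-to-May windows with right-edge labels. A sketch applied to the same example data (using a fresh series, since s was reassigned above):
pd.Series(1, idx).resample('A-MAY').sum()
# 2017-05-31    151
# 2018-05-31    365
# 2019-05-31    214
# Freq: A-MAY, dtype: int64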

Pandas Subset of a Time Series Without Resampling

I have a pandas data series with cumulative daily returns:
Date CumReturn
3/31/2017 1
4/3/2017 .99
4/4/2017 .992
... ...
4/28/2017 1.012
5/1/2017 1.011
... ...
5/31/2017 1.022
... ...
6/30/2017 1.033
... ...
I want only the month-end values.
Date CumReturn
4/28/2017 1.012
5/31/2017 1.022
6/30/2017 1.033
Because I want only the month-end values, resampling doesn't work as it aggregates the interim values.
What is the easiest way to get only the month end values as they appear in the original dataframe?
Use the is_month_end component of the .dt date accessor:
# Ensure the date column is a Timestamp
df['Date'] = pd.to_datetime(df['Date'])
# Filter to end of the month only
df = df[df['Date'].dt.is_month_end]
Applying this to the data you provided:
Date CumReturn
0 2017-03-31 1.000
5 2017-05-31 1.022
6 2017-06-30 1.033
EDIT
To get the business month end, compare each date with itself rolled forward by BMonthEnd(0), which leaves dates already on a business month end unchanged:
from pandas.tseries.offsets import BMonthEnd
# Ensure the date column is a Timestamp
df['Date'] = pd.to_datetime(df['Date'])
# Filter to end of the month only
df = df[df['Date'] == df['Date'] + BMonthEnd(0)]
Applying this to the data you provided:
Date CumReturn
0 2017-03-31 1.000
3 2017-04-28 1.012
5 2017-05-31 1.022
6 2017-06-30 1.033
Alternatively, sort by date, group by year and month, and take the last row of each group:
df.sort_values('Date').groupby([df.Date.dt.year, df.Date.dt.month]).last()
Out[197]:
Date CumReturn
Date Date
2017 3 2017-03-31 1.000
4 2017-04-28 1.012
5 2017-05-31 1.022
6 2017-06-30 1.033
Assuming that the dataframe is already sorted by 'Date' and that the values in that column are Pandas timestamps, you can convert them to YYYY-mm string values for grouping and take the last value:
df.groupby(df['Date'].dt.strftime('%Y-%m'))['CumReturn'].last()
# Example output:
# 2017-01 0.127002
# 2017-02 0.046894
# 2017-03 0.005560
# 2017-04 0.150368
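A close variant that keeps a proper PeriodIndex instead of string labels (a sketch, under the same assumptions about the Date column):
df.groupby(df['Date'].dt.to_period('M'))['CumReturn'].last()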
