I'm creating a pandas DataFrame with random dates and random integer values, and I want to resample it by month and compute the average of the integers. This can be done with the following code:
import numpy as np
import pandas as pd

def random_dates(start='2018-01-01', end='2019-01-01', n=300):
    # Convert to Timestamps so that .value (nanoseconds since epoch) exists
    start_u = pd.to_datetime(start).value // 10**9
    end_u = pd.to_datetime(end).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The thing is that the resampled months always start from day one, and I want all months to start from day 15. I'm using pandas 1.1.4 and I've tried using origin='15/01/2018' or offset='15', and neither of them works with the 'M' resample rule (they do work when I use '30D', but that is of no use). I've also tried '2SM', but it doesn't work either.
So my question is: is there a way to change the resample rule, or will I have to add an offset to my data?
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift the Month column so that
the 15th day of each month becomes the 1st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest resampling with 'SMS' (semi-month start frequency: bins begin on the 1st and the 15th). Instead of keeping just the mean values, keep the count and sum, and recalculate the weighted mean for each monthly period from its two sub-periods (for example, 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantage here is that, unlike with an (improper) use of an offset, we are certain we always start on the 15th of the month and end on the 14th of the next month.
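To see how 'SMS' buckets the data, here is a quick sketch on a tiny synthetic frame (the dates and values are made up, not the OP's): each bucket runs from the 1st up to (but excluding) the 15th, or from the 15th to the end of the month.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame: one row per day over two months (made-up data)
idx = pd.date_range('2018-01-01', '2018-02-28', freq='D')
s = pd.DataFrame({'Month': idx, 'Integers': np.arange(len(idx))})

# 'SMS' buckets start on the 1st and the 15th of each month
counts = s.resample('SMS', on='Month').size()
print(counts)
# 2018-01-01    14   (Jan 1-14)
# 2018-01-15    17   (Jan 15-31)
# 2018-02-01    14   (Feb 1-14)
# 2018-02-15    14   (Feb 15-28)
```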
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Compute a rolling sum and a rolling count over each pair of adjacent half-months, then divide them to get the mean:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
Integers sum_rolling count_rolling mean
sum count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm.iloc[1::2]['mean']
Edit: Changed a name of one of the columns to make it clearer
Related
My company uses a 4-4-5 calendar for reporting purposes. Each month (aka period) is 4-weeks long, except every 3rd month is 5-weeks long.
Pandas seems to have good support for custom calendar periods. However, I'm having trouble figuring out the correct frequency string or custom business month offset to achieve months for a 4-4-5 calendar.
For example:
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(
    index=df_index, columns=["a"], data=np.random.randint(0, 100, size=len(df_index))
)
df.groupby(pd.Grouper(level=0, freq="4W-SUN")).mean()
Grouping by 4-weeks starting on Sunday results in the following. The first three month start dates are correct but I need every third month to be 5-weeks long. The 4th month start date should be 2020-06-28.
a
date
2020-03-29 16.000000
2020-04-26 50.250000
2020-05-24 39.071429
2020-06-21 52.464286
2020-07-19 41.535714
2020-08-16 46.178571
2020-09-13 51.857143
2020-10-11 44.250000
2020-11-08 47.714286
2020-12-06 56.892857
2021-01-03 55.821429
2021-01-31 53.464286
2021-02-28 53.607143
2021-03-28 45.037037
Essentially what I'd like to achieve is something like this:
a
date
2020-03-29 20.000000
2020-04-26 50.750000
2020-05-24 49.750000
2020-06-28 49.964286
2020-07-26 52.214286
2020-08-23 47.714286
2020-09-27 46.250000
2020-10-25 53.357143
2020-11-22 52.035714
2020-12-27 39.750000
2021-01-24 43.428571
2021-02-21 49.392857
Pandas currently supports only yearly and quarterly 52-53-week fiscal calendars (aka 4-4-5 calendars).
See pandas.tseries.offsets.FY5253 and pandas.tseries.offsets.FY5253Quarter.
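As a quick illustration of how these offsets advance a date (a sketch; weekday=6 anchors weeks on Sunday, and the remaining parameters are left at their defaults, which may not match your company's calendar):

```python
import pandas as pd

# FY5253Quarter steps to the next fiscal-quarter boundary of a 52-53 week
# calendar; quarters are 13 weeks, with one 14-week quarter in 53-week years.
q = pd.tseries.offsets.FY5253Quarter(weekday=6)  # quarter ends fall on Sundays

ts = pd.Timestamp("2020-03-29")
nxt = ts + q
print(nxt, nxt.weekday())  # the next quarter boundary, always a Sunday
```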
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(index=df_index)
df['a'] = np.random.randint(0, 100, df.shape[0])
So indeed you need some more work to get to week level and maintain a 4-4-5 calendar. You could align to quarters using the native pandas offset and fill-in the 4-4-5 week pattern manually.
def date_range(start, end, offset_array, name=None):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    index = []
    start -= offset_array[0]
    while start < end:
        for x in offset_array:
            start += x
            if start > end:
                break
            index.append(start)
    return pd.Series(index, name=name)
This function takes a list of offsets rather than a regular frequency, so it moves from date to date by applying the offsets in the given array in turn:
offset_445 = [
    pd.tseries.offsets.FY5253Quarter(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
]
df_index_445 = date_range("2020-03-29", "2021-03-27", offset_445, name='date')
Out:
0 2020-05-03
1 2020-05-31
2 2020-06-28
3 2020-08-02
4 2020-08-30
5 2020-09-27
6 2020-11-01
7 2020-11-29
8 2020-12-27
9 2021-01-31
10 2021-02-28
Name: date, dtype: datetime64[ns]
Once the index is created, it's back to the usual aggregation logic to get the data into the right row buckets. Assuming that you want the mean for the start of each 4- or 5-week period, according to the df_index_445 you have generated, it could look like this:
# calculate the mean on reindex groups
reindex = df_index_445.searchsorted(df.index, side='right') - 1
res = df.groupby(reindex).mean()
# filter valid output
res = res[res.index>=0]
res.index = df_index_445
Out:
a
2020-05-03 47.857143
2020-05-31 53.071429
2020-06-28 49.257143
2020-08-02 40.142857
2020-08-30 47.250000
2020-09-27 52.485714
2020-11-01 48.285714
2020-11-29 56.178571
2020-12-27 51.428571
2021-01-31 50.464286
2021-02-28 53.642857
Note that since the frequency is not regular, pandas will set the datetime index frequency to None.
I am trying to calculate a rolling mean over 1 year in the pandas dataframe below. 'mean_1year' is calculated from the month and year columns.
For example, the month and year of the first row are '05' and '2016', so its 'mean_1year' is the average 'price' from '2016-04' back to '2015-04', i.e. (1300+1400+1500)/3 = 1400. While calculating this average, a filter also has to be applied on the "type" column: since the "type" of the first row is "A", the rows are first filtered on type == "A" and the average is then computed from '2016-04' back to '2015-04'.
type year month price mean_1year
A 2016 05 1200 1400
A 2016 04 1300
A 2016 01 1400
A 2015 12 1500
Any suggestions would be appreciated. Thanks !
First you need a datetime index in ascending order so you can apply a rolling time period calculation.
df['date'] = pd.to_datetime(df['year'].astype('str')+'-'+df['month'].astype('str'))
df = df.set_index('date')
df = df.sort_index()
Then you groupby type and apply the rolling mean.
df['mean_1year'] = df.groupby('type')['price'].rolling('365D').mean().reset_index(0,drop=True)
The result is:
type year month price mean_1year
date
2015-12-01 A 2015 12 1500 1500.0
2016-01-01 A 2016 1 1400 1450.0
2016-04-01 A 2016 4 1300 1400.0
2016-05-01 A 2016 5 1200 1350.0
"Ordinary" rolling can't be applied here, because:
- the window includes the current row, whereas you want to exclude it,
- the window extends into the future, whereas you want it to extend back in time.
So I used a different approach, based on loc with suitable date slices.
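One detail that matters for the loc-based approach: slicing with loc on a DatetimeIndex is inclusive at both ends, which is why a 1-day offset is subtracted from the upper bound below. A tiny illustration with made-up dates:

```python
import pandas as pd

s = pd.Series([1, 2, 3],
              index=pd.to_datetime(['2016-01-01', '2016-02-01', '2016-03-01']))
# Label-based datetime slicing includes both endpoints
window = s.loc['2016-01-01':'2016-02-01']
print(window.tolist())  # [1, 2]
```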
As a test DataFrame I used:
type year month price
0 A 2016 5 1200
1 A 2016 4 1300
2 A 2016 1 1400
3 A 2015 12 1500
4 B 2016 5 1200
5 B 2016 4 1300
And the code is as follows:
Compute date offsets of 12 months and 1 day:
yearOffs = pd.offsets.DateOffset(months=12)
dayOffs = pd.offsets.DateOffset(days=1)
Will be needed in loc later.
Set the index to a datetime, derived from year and
month columns:
df.set_index(pd.to_datetime(df.year.astype(str)
                            + df.month.astype(str), format='%Y%m'), inplace=True)
Define the function to compute means within the current
group:
def myMeans(grp):
    wrk = grp.sort_index()
    return wrk.apply(lambda row: wrk.loc[row.name - yearOffs
                                         : row.name - dayOffs, 'price'].mean(), axis=1)
Compute the means:
means = df.groupby('type').apply(myMeans).swaplevel()
So far the result is:
type
2015-12-01 A NaN
2016-01-01 A 1500.0
2016-04-01 A 1450.0
2016-05-01 A 1400.0
2016-04-01 B NaN
2016-05-01 B 1300.0
dtype: float64
but df has a single level index, with non-unique values.
So to add means to df and drop now unnecessary index,
the last step is:
df = df.set_index('type', append=True).assign(mean_1year=means)\
       .reset_index(level=1).reset_index(drop=True)
The final result is:
type year month price mean_1year
0 A 2016 5 1200 1400.0
1 A 2016 4 1300 1450.0
2 A 2016 1 1400 1500.0
3 A 2015 12 1500 NaN
4 B 2016 5 1200 1300.0
5 B 2016 4 1300 NaN
For the "earliest" rows in each group the result is NaN,
as there are no source (earlier) rows to compute the means
for them (so there is apparently something wrong in the other solution).
I have a pandas dataframe with a date column.
I'm trying to create a function, and apply it to the dataframe, to add a column that returns the number of days in the month/year specified.
So far I have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y, m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
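As an aside, pandas also exposes this directly through the dt accessor, which avoids the per-row apply (equivalent for datetime columns):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
# Vectorized: no apply needed
df['daysinmonths'] = df['date'].dt.days_in_month
print(df['daysinmonths'].head(3).tolist())  # [31, 29, 31] (2020 is a leap year)
```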
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
I have a dataframe that looks like this:
Date DFW
242 2000-05-01 00:00:00 75.92
243 2000-05-01 12:00:00 75.02
244 2000-05-02 00:00:00 71.96
245 2000-05-02 12:00:00 75.92
246 2000-05-03 00:00:00 71.96
... ... ...
14991 2020-07-09 12:00:00 93.90
14992 2020-07-10 00:00:00 91.00
14993 2020-07-10 12:00:00 93.00
14994 2020-07-11 00:00:00 89.10
14995 2020-07-11 12:00:00 97.00
The df contains the max value of temperature for a specific location every 12 hours from May - July 11 during 2000-2020. I want to count the number of times that the value is >90 and then store that value in a column where the row is the year. Should I use groupby to accomplish this?
Expected output:
Year count
2000 x
2001 y
... ...
2019 z
2020 a
You can do it with groupby:
# extract the years from dates
years = df['Date'].dt.year
# compare `DFW` with `90`
# gt90 will be just True or False
gt90 = df['DFW'].gt(90)
# sum the `True` by years
output = gt90.groupby(years).sum()
# set the years as normal column:
output = output.reset_index()
All that in one line:
df['DFW'].gt(90).groupby(df['Date'].dt.year).sum().reset_index()
One possible approach is to extract and create a new column for year (let's say "year") and then,
df[df['DFW'] > 90].groupby('year').count().reset_index()
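For completeness, a runnable sketch of the year-grouping idea on a tiny made-up sample (the column names and the 90-degree threshold follow the question; the values are invented):

```python
import pandas as pd

# Tiny made-up sample in the shape of the question's data
df = pd.DataFrame({
    'Date': pd.to_datetime(['2000-05-01', '2000-05-02', '2000-06-01',
                            '2001-05-01', '2001-05-02']),
    'DFW': [75.9, 91.0, 95.5, 89.1, 97.0],
})

# Compare to 90, group the booleans by year, and sum the Trues
out = (df['DFW'].gt(90)
       .groupby(df['Date'].dt.year)
       .sum()
       .rename_axis('Year')
       .reset_index(name='count'))
print(out)
#    Year  count
# 0  2000      2
# 1  2001      1
```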
I have csv time series data with one reading per day: a date and a cumulative sale. Similar to this:
01-01-2010 12:10:10 50.00
01-02-2010 12:10:10 80.00
01-03-2010 12:10:10 110.00
.
. for each day of 2010
.
01-01-2011 12:10:10 2311.00
01-02-2011 12:10:10 2345.00
01-03-2011 12:10:10 2445.00
.
. for each day of 2011
.
and so on.
I am looking to get the monthly sale (max - min) for each month in each year. Therefore for past 5 years, I will have 5 Jan values (max - min), 5 Feb values (max - min) ... and so on
once I have those, I next get the (5 years avg) for Jan, 5 years avg for Feb .. and so on.
Right now, I do this by slicing the original df [year/month] and then do the averaging over the specific month of the year.
I am looking to use time series resample() approach, but I am currently stuck at telling PD to sample monthly (max - min) for each month in [past 10 years from today]. and then chain in a .mean()
Any advice on an efficient way to do this with resample() would be appreciated.
It would probably look something like this (note: no cumulative sale values here). The key is to perform a df.groupby() passing dt.year and dt.month.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'date': pd.date_range(start='2016-01-01', end='2017-12-31'),
    'sale': np.random.randint(100, 200, size=365*2+1)
})
# Get month max, min and size (and as they are sorted - last and first)
dfg = df.groupby([df.date.dt.year,df.date.dt.month])['sale'].agg(['last','first','size'])
# Assign new cols (diff and avg) and drop max min size
dfg = dfg.assign(diff = dfg['last'] - dfg['first'])
dfg = dfg.assign(avg = dfg['diff'] / dfg['size']).drop(['last','first','size'], axis=1)
# Rename index cols
dfg.index = dfg.index.rename(['Year','Month'])
print(dfg.head(6))
Returns:
diff avg
Year Month
2016 1 -56 -1.806452
2 -17 -0.586207
3 30 0.967742
4 34 1.133333
5 46 1.483871
6 2 0.066667
You can do it with two resamples:
First resample to months (M) and take the diff (max() - min()).
Then resample to 5 years (5AS), group by month, and take the mean().
E.g.:
In []:
date_range = pd.date_range(start='2008-01-01', end='2017-12-31')
df = pd.DataFrame({'sale': np.random.randint(100, 200, size=date_range.size)},
                  index=date_range)
In []:
df1 = df.resample('M').apply(lambda g: g.max()-g.min())
df1.resample('5AS').apply(lambda g: g.groupby(g.index.month).mean()).unstack()
Out[]:
sale
1 2 3 4 5 6 7 8 9 10 11 12
2008-01-01 95.4 90.2 95.2 95.4 93.2 93.8 91.8 95.6 93.4 93.4 94.2 93.8
2013-01-01 93.2 96.4 92.8 96.4 92.6 93.0 93.2 92.6 91.2 93.2 91.8 92.2