pandas time series average monthly volume - python

I have CSV time series data with one row per day: a date and the cumulative sale. Similar to this:
01-01-2010 12:10:10 50.00
01-02-2010 12:10:10 80.00
01-03-2010 12:10:10 110.00
.
. for each day of 2010
.
01-01-2011 12:10:10 2311.00
01-02-2011 12:10:10 2345.00
01-03-2011 12:10:10 2445.00
.
. for each day of 2011
.
and so on.
I am looking to get the monthly sale (max - min) for each month in each year. So for the past 5 years I will have 5 Jan values (max - min), 5 Feb values (max - min), and so on.
Once I have those, I next get the 5-year average for Jan, the 5-year average for Feb, and so on.
Right now, I do this by slicing the original df by [year/month] and then averaging over the specific month of each year.
I would like to use the time series resample() approach, but I am currently stuck at telling pandas to sample the monthly (max - min) for each month in [past 10 years from today] and then chain in a .mean().
Any advice on an efficient way to do this with resample() would be appreciated.

It would probably look something like this (note: no cumulative sale values). The key here is to call df.groupby(), passing dt.year and dt.month.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': pd.date_range(start='2016-01-01',end='2017-12-31'),
'sale': np.random.randint(100,200, size = 365*2+1)
})
# Get each month's last and first values and its row count (for sorted cumulative data, last/first are the max/min)
dfg = df.groupby([df.date.dt.year,df.date.dt.month])['sale'].agg(['last','first','size'])
# Assign new cols (diff and avg) and drop last, first and size
dfg = dfg.assign(diff = dfg['last'] - dfg['first'])
dfg = dfg.assign(avg = dfg['diff'] / dfg['size']).drop(['last','first','size'], axis=1)
# Rename index cols
dfg.index = dfg.index.rename(['Year','Month'])
print(dfg.head(6))
Returns:
diff avg
Year Month
2016 1 -56 -1.806452
2 -17 -0.586207
3 30 0.967742
4 34 1.133333
5 46 1.483871
6 2 0.066667

You can do it with two resamples:
First resample to months (M) and take the difference (max() - min()).
Then resample to 5 years (5AS), group by month and take the mean().
E.g.:
In []:
date_range = pd.date_range(start='2008-01-01',end='2017-12-31')
df = pd.DataFrame({'sale': np.random.randint(100, 200, size=date_range.size)},
index=date_range)
In []:
df1 = df.resample('M').apply(lambda g: g.max()-g.min())
df1.resample('5AS').apply(lambda g: g.groupby(g.index.month).mean()).unstack()
Out[]:
sale
1 2 3 4 5 6 7 8 9 10 11 12
2008-01-01 95.4 90.2 95.2 95.4 93.2 93.8 91.8 95.6 93.4 93.4 94.2 93.8
2013-01-01 93.2 96.4 92.8 96.4 92.6 93.0 93.2 92.6 91.2 93.2 91.8 92.2
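For the cumulative data described in the question, the same two steps can also be sketched with plain groupby calls on monthly periods, which avoids the resample frequency aliases. The data below is hypothetical (a constant 10 units sold per day for 2010-2014), and the names `cumulative`, `monthly_range` and `monthly_avg` are illustrative only:

```python
import pandas as pd

# Hypothetical data: one cumulative sales reading per day for 2010-2014,
# assuming exactly 10 units are sold every day.
idx = pd.date_range("2010-01-01", "2014-12-31", freq="D")
cumulative = pd.Series(10.0, index=idx).cumsum().rename("sale")

# Step 1: range (max - min) of the cumulative series within each month.
by_month = cumulative.groupby(cumulative.index.to_period("M"))
monthly_range = by_month.max() - by_month.min()

# Step 2: average each calendar month (1..12) across all years.
monthly_avg = monthly_range.groupby(monthly_range.index.month).mean()
print(monthly_avg)
```

Note that max - min within a month leaves out the sales booked on the first day of that month, since the first cumulative reading already includes them.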

Related

Get the min value of a week in a pandas dataframe

Let's say I have a pandas dataframe with SOME repeated dates:
import pandas as pd
import random
reportDate = pd.date_range('04-01-2010', '09-03-2021',periods = 5000).date
lowPriceMin = [random.randint(10, 20) for x in range(5000)]
df = pd.DataFrame()
df['reportDate'] = reportDate
df['lowPriceMin'] = lowPriceMin
Now I want to get the min value from every week since the starting date. So I will have around 559 (the number of weeks from '04-01-2010' to '09-03-2021') values with the min value from every week.
Try with resample:
df['reportDate'] = pd.to_datetime(df['reportDate'])
df.set_index("reportDate").resample("W").min()
lowPriceMin
reportDate
2010-01-10 10
2010-01-17 10
2010-01-24 14
2010-01-31 10
2010-02-07 14
...
2021-02-14 11
2021-02-21 11
2021-02-28 10
2021-03-07 10
2021-03-14 17
[584 rows x 1 columns]
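To see how the weekly bins are formed, here is a small deterministic sketch (the 14 values are made up for illustration). With the default "W" rule, each bin ends on a Sunday and is labeled with that date:

```python
import pandas as pd

# Two full weeks of made-up prices, Monday 2021-03-01 through Sunday 2021-03-14
df = pd.DataFrame({
    "reportDate": pd.date_range("2021-03-01", periods=14, freq="D"),
    "lowPriceMin": [14, 13, 12, 11, 15, 16, 17, 20, 19, 18, 10, 12, 13, 14],
})

# One row per week, holding the minimum of that week's values
weekly_min = df.set_index("reportDate")["lowPriceMin"].resample("W").min()
print(weekly_min)
```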

monthly resampling pandas with specific start day

I'm creating a pandas DataFrame with random dates and random integers values and I want to resample it by month and compute the average value of integers. This can be done with the following code:
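import numpy as np
import pandas as pd

def random_dates(start, end, n=300):
    # start and end must be pd.Timestamp objects; .value is nanoseconds since the epoch
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')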
start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The thing is that the resampled months always start on day 1, and I want every month to start on day 15. I'm using pandas 1.1.4 and I've tried origin='15/01/2018' and offset='15', but neither works with the 'M' resample rule (they do work when I use 30D, but that is of no use). I've also tried '2SM', but that doesn't work either.
So my question is whether there is a way of changing the resample rule, or whether I will have to add an offset to my data.
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift the Month column so that the 15th day of the month becomes the 1st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
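The same binning can be sketched without modifying the Month column, by grouping on the shifted monthly period and then relabeling each bin with its actual start date (the 15th). The relabeling step and the use of groupby instead of resample are illustrative choices, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    "Month": pd.to_datetime(["2020-05-05", "2020-05-14", "2020-05-15",
                             "2020-05-20", "2020-05-30", "2020-06-15",
                             "2020-06-20"]),
    "Amount": [1, 1, 10, 10, 10, 20, 20],
})

# Shift by 14 days so that the 15th falls on the 1st, then bin by calendar month
shifted = (df["Month"] - pd.Timedelta("14D")).dt.to_period("M")
res = df.groupby(shifted)["Amount"].mean()

# Relabel each bin with the real start of its window (the 15th)
res.index = res.index.to_timestamp() + pd.Timedelta("14D")
print(res)
```

Because the shift is applied to a temporary Series, the original dates in df stay untouched.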
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest resampling using 'SMS', the semi-month start frequency (1st and 15th). Instead of keeping just the mean values, keep the count and sum values and recalculate the weighted mean for each monthly period from its two sub-periods (for example, 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantage here is that, unlike with an (improper use of an) offset, we are certain that each period always runs from the 15th of the month to the 14th of the next month.
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Compute the rolling sum and rolling count, then derive the mean from them:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
Integers sum_rolling count_rolling mean
sum count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm[1::2]['mean']
Edit: Changed a name of one of the columns to make it clearer

Rolling mean of 1 year after filtering of data in pandas dataframe

I am trying to calculate the rolling mean over 1 year in the pandas dataframe below. 'mean_1year' is calculated over a 1-year window based on month and year.
For example, the month and year of the first row in the dataframe below are '05' and '2016'. Hence 'mean_1year' is the average 'price' from '2016-04' back to '2015-04', which is (1300+1400+1500)/3 = 1400. Also, the rows have to be filtered on the "type" column before averaging: as the "type" of the first row is "A", only rows with type=="A" from '2016-04' back to '2015-04' are used to compute "mean_1year".
type year month price mean_1year
A 2016 05 1200 1400
A 2016 04 1300
A 2016 01 1400
A 2015 12 1500
Any suggestions would be appreciated. Thanks !
First you need a datetime index in ascending order so you can apply a rolling time period calculation.
df['date'] = pd.to_datetime(df['year'].astype('str')+'-'+df['month'].astype('str'))
df = df.set_index('date')
df = df.sort_index()
Then you groupby type and apply the rolling mean.
df['mean_1year'] = df.groupby('type')['price'].rolling('365D').mean().reset_index(0,drop=True)
The result is:
type year month price mean_1year
date
2015-12-01 A 2015 12 1500 1500.0
2016-01-01 A 2016 1 1400 1450.0
2016-04-01 A 2016 4 1300 1400.0
2016-05-01 A 2016 5 1200 1350.0
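Note that the result above includes the current row in each window, while the question asks for the mean of the preceding 12 months only. One possible adjustment, offered as a sketch rather than a definitive fix, is to pass closed='left' so the window excludes its right edge. It assumes a 365-day window is an acceptable stand-in for 12 calendar months; the B rows and the name `out` are added for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "type":  ["A", "A", "A", "A", "B", "B"],
    "year":  [2016, 2016, 2016, 2015, 2016, 2016],
    "month": [5, 4, 1, 12, 5, 4],
    "price": [1200, 1300, 1400, 1500, 1200, 1300],
})
df["date"] = pd.to_datetime(df[["year", "month"]].assign(day=1))
df = df.set_index("date").sort_index()

# closed="left" makes each window [t - 365D, t), so the current row is left out
rolled = df.groupby("type")["price"].rolling("365D", closed="left").mean()

# Align on (type, date), because dates alone are not unique across types
out = df.set_index("type", append=True).swaplevel().sort_index()
out["mean_1year"] = rolled
print(out)
```

The earliest row of each type gets NaN, since there is nothing before it to average.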
"Ordinary" rolling can't be applied, because it:
- includes rows starting from the current row, whereas you want to exclude it,
- extends the range of the window into the future, whereas you want to extend it back.
So I used a different approach, based on loc with suitable date slices.
As a test DataFrame I used:
type year month price
0 A 2016 5 1200
1 A 2016 4 1300
2 A 2016 1 1400
3 A 2015 12 1500
4 B 2016 5 1200
5 B 2016 4 1300
And the code is as follows:
Compute date offsets of 12 months and 1 day:
yearOffs = pd.offsets.DateOffset(months=12)
dayOffs = pd.offsets.DateOffset(days=1)
They will be needed in loc later.
Set the index to a datetime, derived from the year and month columns:
df.set_index(pd.to_datetime(df.year.astype(str)
+ df.month.astype(str), format='%Y%m'), inplace=True)
Define the function to compute means within the current group:
def myMeans(grp):
    wrk = grp.sort_index()
    # For each row, average the prices from 12 months back up to the day before
    return wrk.apply(lambda row: wrk.loc[row.name - yearOffs
        : row.name - dayOffs, 'price'].mean(), axis=1)
Compute the means:
means = df.groupby('type').apply(myMeans).swaplevel()
So far the result is:
type
2015-12-01 A NaN
2016-01-01 A 1500.0
2016-04-01 A 1450.0
2016-05-01 A 1400.0
2016-04-01 B NaN
2016-05-01 B 1300.0
dtype: float64
but df has a single-level index with non-unique values. So to add means to df and drop the now unnecessary index, the last step is:
df = df.set_index('type', append=True).assign(mean_1year=means)\
.reset_index(level=1).reset_index(drop=True)
The final result is:
type year month price mean_1year
0 A 2016 5 1200 1400.0
1 A 2016 4 1300 1450.0
2 A 2016 1 1400 1500.0
3 A 2015 12 1500 NaN
4 B 2016 5 1200 1300.0
5 B 2016 4 1300 NaN
For the "earliest" rows in each group the result is NaN, as there are no source (earlier) rows to compute the means for them (so there is apparently something wrong in the other solution).

extract number of days in month from date column, return days to date in current month

I have a pandas dataframe with a date column.
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified.
So far I have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y,m)
    days = monthrange[1]
    return days
However, this does not work when I attempt to apply it to the date column.
Additionally, I would like to identify whether the date falls in the current month and, if so, return the number of days up to the current date in that month rather than the days in the entire month.
I am not sure of the best way to do this; all I can think of is to check the month/year against datetime's today and then use a delta.
Thanks in advance.
For part 1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
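As a possible alternative for part 1, the .dt accessor exposes days_in_month directly, which avoids the per-row apply. A minimal sketch with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-02-29", "2021-02-28"]),
})

# Vectorized: number of days in the month of each date (leap years included)
df["daysinmonths"] = df["date"].dt.days_in_month
print(df)
```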
For part 2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is earlier than that of 'now'. Then calculate the cumsum of the daysinmonths column for the section where the mask is True, and reverse the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
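The year/month masks above compare year and month separately, which may give unexpected results once the dates span more than one calendar year (for example, December of the previous year has a month number greater than the current month). A hedged sketch that compares whole monthly periods instead; the helper name `days_elapsed` and the fixed `now` value are assumptions for illustration:

```python
import pandas as pd

def days_elapsed(dates: pd.Series, now: pd.Timestamp) -> pd.Series:
    """Days counted per month: the full length for past months, the days so
    far for the current month, and 0 for future months."""
    months = dates.dt.to_period("M")
    current = now.to_period("M")
    elapsed = dates.dt.days_in_month
    elapsed = elapsed.mask(months == current, now.day)
    elapsed = elapsed.mask(months > current, 0)
    return elapsed

dates = pd.Series(pd.to_datetime(
    ["2019-12-31", "2020-07-31", "2020-08-31", "2020-09-30", "2021-01-31"]))
now = pd.Timestamp("2020-08-27")  # fixed so the example is reproducible
print(days_elapsed(dates, now).tolist())
```

In real use, `now` would be pd.Timestamp('now') as in the answer above.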

Applying Date Operation to Entire Data Frame

import pandas as pd
import numpy as np
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
In this data frame, I am interested in creating a field called 'year_month' such that each value looks like so:
datetime.date(df['year'][0], df['month'][0], 1).strftime("%Y%m")
I'm stuck on how to apply this operation to the entire data frame and would appreciate any help.
Join both columns after converting them to strings, and zero-pad the months with zfill:
df['new'] = df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
Or add a new day column with assign, convert the columns with to_datetime, and finally apply strftime:
df['new'] = pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
If the DataFrame has other columns as well, select only the date parts:
df['new'] = pd.to_datetime(df.assign(day=1)[['day','month','year']]).dt.strftime("%Y%m")
print (df)
month year new
0 1 2018 201801
1 2 2018 201802
2 3 2018 201803
3 4 2018 201804
4 5 2018 201805
5 6 2018 201806
6 7 2018 201807
7 8 2018 201808
8 9 2018 201809
9 10 2018 201810
10 11 2018 201811
11 12 2018 201812
Timings:
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
df = pd.concat([df] * 1000, ignore_index=True)
In [212]: %timeit pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
10 loops, best of 3: 74.1 ms per loop
In [213]: %timeit df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
10 loops, best of 3: 41.3 ms per loop
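Another vectorized option, not part of the timings above and so of unknown relative speed, is to build the number with integer arithmetic before converting to string. This sketch assumes year and month are integer columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': np.repeat(2018, 12), 'month': range(1, 13)})

# 2018 * 100 + 1 gives 201801, so no zero-padding step is needed
df['year_month'] = (df['year'] * 100 + df['month']).astype(str)
print(df.head())
```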
One way would be to create the datetime objects directly from the source data:
import pandas as pd
import numpy as np
from datetime import date
df = pd.DataFrame({'date': [date(i, j, 1) for i, j \
in zip(np.repeat(2018,12), range(1,13))]})
# date
# 0 2018-01-01
# 1 2018-02-01
# 2 2018-03-01
# 3 2018-04-01
# 4 2018-05-01
# 5 2018-06-01
# 6 2018-07-01
# 7 2018-08-01
# 8 2018-09-01
# 9 2018-10-01
# 10 2018-11-01
# 11 2018-12-01
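You could use an apply function such as (selecting the columns by name, so the result does not depend on column order):
import datetime
df['year_month'] = df.apply(lambda row: datetime.date(row['year'], row['month'], 1).strftime("%Y%m"), axis=1)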
