Pandas Resample Monthly data to Weekly within Groups and Split Values - python

I have a dataframe, below:
ID Date Volume Sales
1 2020-02 10 4
1 2020-03 8 6
2 2020-02 6 8
2 2020-03 4 10
Is there an easy way to convert this to weekly data using resampling, dividing the Volume and Sales columns by the number of weeks in the month?
I have started my process with code that looks like:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
grouped = df.groupby('ID').resample('W').ffill().reset_index()
print(grouped)
After this step, I get an error message: cannot insert ID, already exists
Also, is there a way to find the number of weeks in a month, so that I can divide the Volume and Sales columns by it?
The expected output is:
ID Volume Sales Weeks
0 1 2.5 1.0 2020-02-02
0 1 2.5 1.0 2020-02-09
0 1 2.5 1.0 2020-02-16
0 1 2.5 1.0 2020-02-23
1 1 1.6 1.2 2020-03-01
1 1 1.6 1.2 2020-03-08
1 1 1.6 1.2 2020-03-15
1 1 1.6 1.2 2020-03-22
1 1 1.6 1.2 2020-03-29
2 2 1.5 2 2020-02-02
2 2 1.5 2 2020-02-09
2 2 1.5 2 2020-02-16
2 2 1.5 2 2020-02-23
3 2 0.8 2 2020-03-01
3 2 0.8 2 2020-03-08
3 2 0.8 2 2020-03-15
3 2 0.8 2 2020-03-22
3 2 0.8 2 2020-03-29

After review, a much simpler solution can be used. Please refer to the subsection labeled New Solution in Part 1 below.
This task requires multiple steps. Let's break it down as follows:
Part 1: Transform Date & Resample
New Solution
Since the required weekly frequency is Sunday-based (i.e. freq='W-SUN') and each month is independent of any adjacent month, we can use the year-month values in column Date to generate the weekly date ranges directly in one step, rather than first generating daily date ranges from the year-month and then resampling those daily ranges to weekly.
The new logic just uses pd.date_range() with freq='W', together with pd.offsets.MonthEnd(), to generate the weekly dates for each month. It does not need to call .resample() or .asfreq() like other solutions; effectively, pd.date_range() with freq='W' does the resampling for us.
Here is the code:
df['Weeks'] = df['Date'].map(lambda x:
    pd.date_range(
        start=pd.to_datetime(x),
        end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
        freq='W'))
df = df.explode('Weeks')
Result:
print(df)
ID Date Volume Sales Weeks
0 1 2020-02 10 4 2020-02-02
0 1 2020-02 10 4 2020-02-09
0 1 2020-02 10 4 2020-02-16
0 1 2020-02 10 4 2020-02-23
1 1 2020-03 8 6 2020-03-01
1 1 2020-03 8 6 2020-03-08
1 1 2020-03 8 6 2020-03-15
1 1 2020-03 8 6 2020-03-22
1 1 2020-03 8 6 2020-03-29
2 2 2020-02 6 8 2020-02-02
2 2 2020-02 6 8 2020-02-09
2 2 2020-02 6 8 2020-02-16
2 2 2020-02 6 8 2020-02-23
3 2 2020-03 4 10 2020-03-01
3 2 2020-03 4 10 2020-03-08
3 2 2020-03 4 10 2020-03-15
3 2 2020-03 4 10 2020-03-22
3 2 2020-03 4 10 2020-03-29
With the two lines of code above, we already have the required result for Part 1; there is no need for the more complicated .groupby() and .resample() chain in the old solution.
We can now continue to Part 2. As we have not created the grouped object, either replace grouped with df in the Part 2 code or add a line grouped = df before continuing.
Old Solution
We use pd.date_range() with freq='D', together with pd.offsets.MonthEnd(), to produce daily entries for the full month. We then move these full-month ranges into the index before resampling to weekly frequency, using closed='left' to exclude the unwanted week of 2020-04-05 that the default resample() parameters would produce.
df['Weeks'] = df['Date'].map(lambda x:
    pd.date_range(
        start=pd.to_datetime(x),
        end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
        freq='D'))
df = df.explode('Weeks').set_index('Weeks')
grouped = (df.groupby(['ID', 'Date'], as_index=False)
             .resample('W', closed='left')
             .ffill().dropna().reset_index(-1))
Result:
print(grouped)
Weeks ID Date Volume Sales
0 2020-02-02 1.0 2020-02 10.0 4.0
0 2020-02-09 1.0 2020-02 10.0 4.0
0 2020-02-16 1.0 2020-02 10.0 4.0
0 2020-02-23 1.0 2020-02 10.0 4.0
1 2020-03-01 1.0 2020-03 8.0 6.0
1 2020-03-08 1.0 2020-03 8.0 6.0
1 2020-03-15 1.0 2020-03 8.0 6.0
1 2020-03-22 1.0 2020-03 8.0 6.0
1 2020-03-29 1.0 2020-03 8.0 6.0
2 2020-02-02 2.0 2020-02 6.0 8.0
2 2020-02-09 2.0 2020-02 6.0 8.0
2 2020-02-16 2.0 2020-02 6.0 8.0
2 2020-02-23 2.0 2020-02 6.0 8.0
3 2020-03-01 2.0 2020-03 4.0 10.0
3 2020-03-08 2.0 2020-03 4.0 10.0
3 2020-03-15 2.0 2020-03 4.0 10.0
3 2020-03-22 2.0 2020-03 4.0 10.0
3 2020-03-29 2.0 2020-03 4.0 10.0
Here we retain the column Date for later use.
Part 2: Divide Volume and Sales by number of weeks in month
Here, the number of weeks used to divide the Volume and Sales figures should actually be the number of resampled weeks within the month, as shown in the interim result above.
If we used the actual number of calendar weeks instead, Feb 2020 (a leap-year February with 29 days) would span 5 weeks rather than the 4 resampled weeks in the interim result above. That would give inconsistent results: there would be only 4 week entries while each Volume and Sales figure is divided by 5.
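As a quick sanity check of this point, here is a small illustrative snippet (not part of the original answer) counting both quantities for Feb 2020:
# Sundays (freq='W' week-ending dates) that fall inside Feb 2020 -> the 4 resampled weeks
len(pd.date_range('2020-02-01', '2020-02-29', freq='W'))     # 4
# Calendar weeks touched by Feb 2020 -> 5
len(pd.period_range('2020-02-01', '2020-02-29', freq='W'))    # 5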
Let's go to the code then:
We group by columns ID and Date and then divide each value in columns Volume and Sales by group size (i.e. number of resampled weeks).
grouped[['Volume', 'Sales']] = (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                       .transform(lambda x: x / x.count()))
or, in simplified form, using /= as follows:
grouped[['Volume', 'Sales']] /= (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                        .transform('count'))
Result:
print(grouped)
Weeks ID Date Volume Sales
0 2020-02-02 1.0 2020-02 2.5 1.0
0 2020-02-09 1.0 2020-02 2.5 1.0
0 2020-02-16 1.0 2020-02 2.5 1.0
0 2020-02-23 1.0 2020-02 2.5 1.0
1 2020-03-01 1.0 2020-03 1.6 1.2
1 2020-03-08 1.0 2020-03 1.6 1.2
1 2020-03-15 1.0 2020-03 1.6 1.2
1 2020-03-22 1.0 2020-03 1.6 1.2
1 2020-03-29 1.0 2020-03 1.6 1.2
2 2020-02-02 2.0 2020-02 1.5 2.0
2 2020-02-09 2.0 2020-02 1.5 2.0
2 2020-02-16 2.0 2020-02 1.5 2.0
2 2020-02-23 2.0 2020-02 1.5 2.0
3 2020-03-01 2.0 2020-03 0.8 2.0
3 2020-03-08 2.0 2020-03 0.8 2.0
3 2020-03-15 2.0 2020-03 0.8 2.0
3 2020-03-22 2.0 2020-03 0.8 2.0
3 2020-03-29 2.0 2020-03 0.8 2.0
Optionally, you can do some cosmetic work: drop the column Date and move the column Weeks to your desired position.
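For instance, to match the layout of the expected output (an illustrative line, not part of the original answer):
grouped = grouped.drop(columns='Date')[['ID', 'Volume', 'Sales', 'Weeks']]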
Edit: (Similarities and differences with other questions about resampling from month to week)
In this review, I searched for other questions with similar titles and compared the questions and solutions.
There is another question with a similar requirement: split the monthly values equally into weekly values according to the number of weeks in the resampled month. In that question, the months are represented by the first date of each month, in datetime format and used as the dataframe index, while in this question the months are represented as YYYY-MM, which can be of string type.
A big and critical difference is that in that question, the last month period index 2018-05-01 with value 22644 is not actually processed: the month 2018-05 is not resampled into weeks and the value 22644 is never split into weekly proportions. The accepted solution using .asfreq() shows no entry for 2018-05 at all, and the other solution using .resample() keeps one (un-resampled) entry for 2018-05, again without splitting the value 22644 into weekly proportions.
However, in our question here, the last month listed in each group still needs to be resampled into weeks, with its values split equally across those resampled weeks.
As for the code, my new solution calls neither .resample() nor .asfreq(); it just uses pd.date_range() with freq='W', together with pd.offsets.MonthEnd(), to generate the weekly dates for each month from the 'YYYY-MM' values, something I had not thought of when I wrote the old solution based on .resample().
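For reference, the complete new solution (Parts 1 and 2 combined, with grouped replaced by df as noted above) boils down to:
df['Weeks'] = df['Date'].map(lambda x:
    pd.date_range(
        start=pd.to_datetime(x),
        end=pd.to_datetime(x) + pd.offsets.MonthEnd(),
        freq='W'))
df = df.explode('Weeks')
df[['Volume', 'Sales']] /= df.groupby(['ID', 'Date'])[['Volume', 'Sales']].transform('count')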

Related

Dynamic Dates difference calculation Pandas

customer_id Order_date
1 2015-01-16
1 2015-01-19
2 2014-12-21
2 2015-01-10
1 2015-01-10
3 2018-01-18
3 2017-03-04
4 2019-11-05
4 2010-01-01
3 2019-02-03
Let's say I have data like this.
Basically, for an e-commerce firm, some people buy regularly, some buy once a year, some buy once a month, etc. I need to find the difference in days between consecutive transactions for each customer.
This will be a dynamic list, since some people will have transacted a thousand times, some once, some ten times, etc. Any ideas on how to achieve this?
Output needed:
customer_id Order_date_Difference_in_days
1 6,3       # difference between the first 2 dates, 2015-01-10 and 2015-01-16, is 6 days,
            # and the difference between the next 2 consecutive dates,
            # 2015-01-16 and 2015-01-19, is 3 days
2 20
3 320,381
4 3596
Basically, these are the differences between consecutive dates, after sorting them for each customer id.
You can also use the below for the current output:
m = (df.assign(Diff=df.sort_values(['customer_id', 'Order_date'])
                     .groupby('customer_id')['Order_date'].diff().dt.days)
       .dropna())
m = m.assign(Diff=m['Diff'].astype(str)).groupby('customer_id')['Diff'].agg(','.join)
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
Name: Diff, dtype: object
First we need to sort the data by customer id and order date. Make sure the dates are proper datetimes first:
df['Order_date'] = pd.to_datetime(df['Order_date'])
import numpy as np  # needed for np.timedelta64 below

df.sort_values(['customer_id', 'Order_date'], inplace=True)
df["days"] = df.groupby("customer_id")["Order_date"].apply(
    lambda x: (x - x.shift()) / np.timedelta64(1, "D")
)
print(df)
customer_id Order_date days
4 1 2015-01-10 NaN
0 1 2015-01-16 6.0
1 1 2015-01-19 3.0
2 2 2014-12-21 NaN
3 2 2015-01-10 20.0
6 3 2017-03-04 NaN
5 3 2018-01-18 320.0
9 3 2019-02-03 381.0
8 4 2010-01-01 NaN
7 4 2019-11-05 3595.0
Then you can do a simple agg, but you'll need to convert the values into strings.
df.dropna().groupby("customer_id")["days"].agg(
    lambda x: ",".join(x.astype(str))
).to_frame()
days
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0

Year to date average in dataframe

I have a dataframe for which I am trying to calculate the year-to-date average of my value columns. Below is a sample dataframe.
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
I want to create new columns (values_ytd & values2_ytd) that will average the values from January to the latest period within the same year (April in sample data). I will need to group the data by year & name when calculating the averages. I am looking for an output similar to this.
date name values values2 values2_ytd values_ytd
0 2019-01-01 a 1 1 1 1
1 2019-02-01 a 3 3 2 2
2 2019-03-01 a 2 2 2 2
3 2019-04-01 a 6 2 2 3
I have tried unsuccessfully to use expanding().mean(), but most likely I was doing it wrong. My main dataframe has numerous name categories and many more columns. Here is the code I was attempting to use:
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).expanding().mean().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
but am receiving the following error.
NotImplementedError: ops for Expanding for this dtype datetime64[ns] are not implemented
Note: The code below works perfectly when substituting cumsum() for .expanding().mean() to create a year-to-date sum of the values, but I can't figure it out for averages:
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).cumsum().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
Any help is greatly appreciated.
Try this:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df[['values_ytd', 'values2_ytd']] = (df.groupby([df.index.year, 'name'])[['values', 'values2']]
                                       .expanding().mean()
                                       .reset_index(level=[0, 1], drop=True))
df
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
Example using multiple names and years:
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
4 2019-01-01 b 1 4
5 2019-02-01 b 3 4
6 2020-01-01 a 1 1
7 2020-02-01 a 3 3
8 2020-03-01 a 2 2
9 2020-04-01 a 6 2
Output:
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
2019-01-01 b 1 4 1.0 4.0
2019-02-01 b 3 4 2.0 4.0
2020-01-01 a 1 1 1.0 1.0
2020-02-01 a 3 3 2.0 2.0
2020-03-01 a 2 2 2.0 2.0
2020-04-01 a 6 2 3.0 2.0
You should set the date column as the index, df.set_index('date', inplace=True), and then use df.groupby('name').resample('AS').mean()

How do you convert start and end date records into timestamps?

For example (input pandas dataframe):
start_date end_date value
0 2018-05-17 2018-05-20 4
1 2018-05-22 2018-05-27 12
2 2018-05-14 2018-05-21 8
I want it to divide each value by the number of days in its interval (e.g. 2018-05-22 to 2018-05-27 spans 6 days, so 12 / 6 = 2) and then create time series data like the following:
date value
0 2018-05-14 1
1 2018-05-15 1
2 2018-05-16 1
3 2018-05-17 2
4 2018-05-18 2
5 2018-05-19 2
6 2018-05-20 2
7 2018-05-21 1
8 2018-05-22 2
9 2018-05-23 2
10 2018-05-24 2
11 2018-05-25 2
12 2018-05-26 2
13 2018-05-27 2
Is this possible to do in pandas without an inefficient loop through every row? Is there also a name for this method?
You can use:
#convert to datetimes if necessary
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
For each row, generate a Series indexed by a date_range, divide its values by its length, and then aggregate with groupby and sum:
dfs = [pd.Series(r.value, pd.date_range(r.start_date, r.end_date)) for r in df.itertuples()]
df = (pd.concat([x / len(x) for x in dfs])
        .groupby(level=0)
        .sum()
        .rename_axis('date')
        .reset_index(name='val'))
print (df)
date val
0 2018-05-14 1.0
1 2018-05-15 1.0
2 2018-05-16 1.0
3 2018-05-17 2.0
4 2018-05-18 2.0
5 2018-05-19 2.0
6 2018-05-20 2.0
7 2018-05-21 1.0
8 2018-05-22 2.0
9 2018-05-23 2.0
10 2018-05-24 2.0
11 2018-05-25 2.0
12 2018-05-26 2.0
13 2018-05-27 2.0
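For comparison, the same result can also be obtained with the date_range-plus-explode pattern used in the main answer at the top of this page; the sketch below is illustrative and not part of the original answer:
# build one date_range per row, explode to one row per day,
# split each value evenly over its interval, then sum per date
out = df.assign(date=[pd.date_range(s, e) for s, e in zip(df['start_date'], df['end_date'])])
out = out.explode('date')
out['value'] = out['value'] / out.groupby(level=0)['date'].transform('size')
out = out.groupby('date', as_index=False)['value'].sum()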

Pandas — match last identical row and compute difference

With a DataFrame like the following:
timestamp value
0 2012-01-01 3.0
1 2012-01-05 3.0
2 2012-01-06 6.0
3 2012-01-09 3.0
4 2012-01-31 1.0
5 2012-02-09 3.0
6 2012-02-11 1.0
7 2012-02-13 3.0
8 2012-02-15 2.0
9 2012-02-18 5.0
What would be an elegant and efficient way to add a time_since_last_identical column, so that the previous example would result in:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 5 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 10 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
The important part of the problem is not necessarily the usage of time delays. Any solution that matches one particular row with the previous row of identical value, and computes something out of those two rows (here, a difference) will be valid.
Note: not interested in apply or loop-based approaches.
A simple, clean and elegant groupby will do the trick:
df['time_since_last_identical'] = df.groupby('value')['timestamp'].diff()
Gives:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
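Note that this assumes timestamp is already a datetime column; if it is stored as strings, convert it first (an added reminder, not part of the original answer):
df['timestamp'] = pd.to_datetime(df['timestamp'])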
Here is a solution using pandas groupby:
out = df.groupby(df['value'])\
.apply(lambda x: pd.to_datetime(x['timestamp'], format = "%Y-%m-%d").diff())\
.reset_index(level = 0, drop = False)\
.reindex(df.index)\
.rename(columns = {'timestamp' : 'time_since_last_identical'})
out = pd.concat([df['timestamp'], out], axis = 1)
That gives the following output:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
It does not exactly match your desired output, but I guess it is a matter of conventions (e.g. whether to include current day or not). Happy to refine if you provide more details.

Get average counts per minute by hour

I have a dataframe with a time stamp as the index and a column of labels
from datetime import datetime
from pandas import DataFrame

df = DataFrame({'time': [datetime(2015,11,2,4,41,10), datetime(2015,11,2,4,41,39), datetime(2015,11,2,4,41,47),
                         datetime(2015,11,2,4,41,59), datetime(2015,11,2,4,42,4), datetime(2015,11,2,4,42,11),
                         datetime(2015,11,2,4,42,15), datetime(2015,11,2,4,42,30), datetime(2015,11,2,4,42,39),
                         datetime(2015,11,2,4,42,41), datetime(2015,11,2,5,2,9), datetime(2015,11,2,5,2,10),
                         datetime(2015,11,2,5,2,16), datetime(2015,11,2,5,2,29), datetime(2015,11,2,5,2,51),
                         datetime(2015,11,2,5,9,1), datetime(2015,11,2,5,9,21), datetime(2015,11,2,5,9,31),
                         datetime(2015,11,2,5,9,40), datetime(2015,11,2,5,9,55)],
                'Label': [2,0,0,0,1,0,0,1,1,1,1,3,0,0,3,0,1,0,1,1]}).set_index(['time'])
I want to get the average number of times that a label appears in a distinct minute within a distinct hour.
For example, Label 0 appears 3 times in hour 4 in minute 41, 2 times in hour 4 in minute 42, 2 times in hour 5 in minute 2, and 2 times in hour 5 in minute 9, so its average count per minute in hour 4 is (3+2)/2 = 2.5 and its average count per minute in hour 5 is (2+2)/2 = 2.
The output I am looking for is
Hour 1
Label avg
0 2.5
1 2
2 .5
3 0
Hour 2
Label avg
0 2
1 1.5
2 0
3 1
What I have so far is
df['hour']=df.index.hour
hour_grp=df.groupby(['hour'], as_index=False)
then I can do something like
res=[]
for key, value in hour_grp:
res.append(value)
then group by minute
res[0].groupby(pd.TimeGrouper('1Min'))['Label'].value_counts()
but this is where I'm stuck, not to mention it is not very efficient
Start by squeezing your DataFrame into a Series (after all, it only has one column):
s = df.squeeze()
Compute how many times each label occurs by minute:
counts_by_min = (s.resample('min')
                  .apply(lambda x: x.value_counts())
                  .unstack()
                  .fillna(0))
# 0 1 2 3
# time
# 2015-11-02 04:41:00 3.0 0.0 1.0 0.0
# 2015-11-02 04:42:00 2.0 4.0 0.0 0.0
# 2015-11-02 05:02:00 2.0 1.0 0.0 2.0
# 2015-11-02 05:09:00 2.0 3.0 0.0 0.0
Resample counts_by_min by hour to obtain the number of times each label occurs by hour:
counts_by_hour = counts_by_min.resample('H').sum()
# 0 1 2 3
# time
# 2015-11-02 04:00:00 5.0 4.0 1.0 0.0
# 2015-11-02 05:00:00 4.0 4.0 0.0 2.0
Count the number of minutes each label occurs by hour:
minutes_by_hour = counts_by_min.astype(bool).resample('H').sum()
# 0 1 2 3
# time
# 2015-11-02 04:00:00 2.0 1.0 1.0 0.0
# 2015-11-02 05:00:00 2.0 2.0 0.0 1.0
Divide the last two to get the result you want:
avg_per_hour = counts_by_hour.div(minutes_by_hour).fillna(0)
# 0 1 2 3
# time
# 2015-11-02 04:00:00 2.5 4.0 1.0 0.0
# 2015-11-02 05:00:00 2.0 2.0 0.0 2.0
Accessing minute of DateTimeIndex:
mn = df.index.minute
Accessing hour of DateTimeIndex:
hr = df.index.hour
Perform a groupby using the variables obtained above as keys, compute value_counts of the Label column, unstack while filling missing values with 0, and finally average across the index level containing the hour values.
df.groupby([mn,hr])['Label'].value_counts().unstack(fill_value=0).mean(level=1)
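In pandas 2.0+ the level argument of DataFrame.mean() has been removed, so the last step needs a groupby instead; an equivalent adaptation (not part of the original answer) is:
df.groupby([mn, hr])['Label'].value_counts().unstack(fill_value=0).groupby(level=1).mean()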
