Forward rolling 365 day groupby sum - irregular intervals - python

I have a pandas dataframe consisting of transactional data that looks like the below:
Customer_ID  Day         Sales
1            2018-08-01  80.11
2            2019-01-07  10.15
2            2021-02-21  74.15
1            2019-06-18  10.00
3            2020-03-17  15.15
2            2020-04-29  80.98
4            2016-06-01  133.54
3            2022-01-14  17.15
2            2021-02-28  25.12
1            2021-01-02  1.22
I need to calculate the forward rolling 365 day sum of sales grouped by the customer, exclusive of the current day. I would like to insert the result as a new column.
e.g. for customer_id == 1 for the first row, the value to be inserted in the new column will be the sum of sales between 2018-08-02 and 2019-08-01 for customer_id == 1.
I'm sure there's a way to do this using pandas, but I can't figure it out.
Code to produce the dataframe:
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': [1, 2, 2, 1, 3, 2, 4, 3, 2, 1],
    'Day': ['2018-08-01', '2019-01-07', '2021-02-21', '2019-06-18', '2020-03-17',
            '2020-04-29', '2016-06-01', '2022-01-14', '2021-02-28', '2021-01-02'],
    'Sales': [80.11, 10.15, 74.15, 10.00, 15.15, 80.98, 133.54, 17.15, 25.12, 1.22]
})
df.Day = pd.to_datetime(df.Day)

You first need to group by the Customer_ID column, then perform a rolling sum on each group after setting the 'Day' column as each group's index.
(
    df.groupby('Customer_ID')
      .apply(lambda gr: gr.set_index('Day').sort_index()['Sales'].rolling('365D').sum())
      .reset_index()
)
There is probably a better way to do this, but this is the simplest one for me.
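Note that rolling('365D') above is a trailing window that includes the current day. For a strictly forward-looking sum that excludes the current day, here is a minimal sketch using numpy.searchsorted over a per-customer cumulative sum (the helper name forward_365_excl and the column name Fwd_365D_Sales are just illustrative):
import numpy as np

def forward_365_excl(gr):
    # sum Sales in the window (Day, Day + 365 days] for every row of one customer
    gr = gr.sort_values('Day')
    days = gr['Day'].to_numpy()
    csum = np.concatenate(([0.0], gr['Sales'].to_numpy().cumsum()))
    left = np.searchsorted(days, days, side='right')  # skip the current day (and any same-day rows)
    right = np.searchsorted(days, days + np.timedelta64(365, 'D'), side='right')  # include Day + 365
    gr['Fwd_365D_Sales'] = csum[right] - csum[left]
    return gr

out = df.groupby('Customer_ID', group_keys=False).apply(forward_365_excl)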

Related

Pandas - Compute sum of a column as week-wise columns

I have a table like below containing values for multiple IDs:
ID
value
date
1
20
2022-01-01 12:20
2
25
2022-01-04 18:20
1
10
2022-01-04 11:20
1
150
2022-01-06 16:20
2
200
2022-01-08 13:20
3
40
2022-01-04 21:20
1
75
2022-01-09 08:20
I would like to calculate the week-wise sum of values for all IDs:
The start date is given (for example, 01-01-2022).
Weeks are calculated over the range from Saturday 00:00 to the following Friday 23:59 (i.e. Week 1 is from 01-01-2022 00:00 to 07-01-2022 23:59).
ID  Week 1 sum  Week 2 sum  Week 3 sum  ...
1   180         75          --          --
2   25          200         --          --
3   40          --          --          --
There's a pandas function (pd.Grouper) that allows you to specify a groupby instruction.[1] In this case, that instruction is to "resample" date by a weekly frequency that ends on Fridays.[2] Since you also need to group by ID, add it to the grouper.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# pivot the dataframe
df1 = (
df.groupby(['ID', pd.Grouper(key='date', freq='W-FRI')])['value'].sum()
.unstack(fill_value=0)
)
# rename columns
df1.columns = [f"Week {c} sum" for c in range(1, df1.shape[1]+1)]
df1 = df1.reset_index()
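On the sample data above this should produce something like the following (only two week columns appear, since the sample has no Week 3 rows):
   ID  Week 1 sum  Week 2 sum
0   1         180          75
1   2          25         200
2   3          40           0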
[1] What you actually need is a pivot_table-style result, but groupby + unstack is equivalent to pivot_table and is more convenient here.
[2] Because Jan 1, 2022 is a Saturday, the weekly frequency has to be anchored on Friday ('W-FRI') so that each week runs Saturday through Friday.
You can compute a week column. If all of your data falls within a single year you could extract just the week number, but that is unlikely in real scenarios; with data spanning multiple years it is wiser to derive a combination of year and week number.
df['Year-Week'] = df['date'].dt.strftime('%Y-%U')
In your case the dates 2022-01-01 and 2022-01-04 18:20 should both map to 2022-01 under the scenario you described.
To calculate your pivot table, you can use pandas pivot_table. Example code:
pd.pivot_table(df, values='value', index=['ID'], columns=['Year-Week'], aggfunc='sum')
Let's define a formatting helper.
def fmt(row):
    return f"{row.year}-{row.week:02d}"  # We ignore row.day
Now it's easy.
>>> df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
dict(id=2, value=25, date="2022-01-04 18:20")])
>>> df['date'] = pd.to_datetime(df.date)
>>> df['iso'] = df.date.dt.isocalendar().apply(fmt, axis='columns')
>>> df
id value date iso
0 1 20 2022-01-01 12:20:00 2021-52
1 2 25 2022-01-04 18:20:00 2022-01
Just groupby the ISO week.
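A minimal sketch of that last step, assuming df holds the full sample table with the iso column built as above:
weekly = df.groupby(['id', 'iso'])['value'].sum().unstack(fill_value=0)
Each ISO week then becomes a column, which can be renamed to the "Week 1 sum" / "Week 2 sum" style shown in the question.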

Create column based on date conditions, but I get this error AttributeError: 'SeriesGroupBy' object has no attribute 'sub'?

Hey, a Python newbie here.
Suppose I have the first three columns of this dataframe:
df = pd.DataFrame({'group': ["Sun", "Moon", "Sun", "Moon", "Mars", "Mars"],
'score': [2, 13, 24, 15, 11, 44],
'datetime': ["2017-08-30 07:00:00", "2017-08-30 08:00:00", "2017-08-31 07:00:00", "2017-08-31 08:00:00", "2017-08-29 21:00:00", "2017-08-28 21:00:00"],
'difference': [2, 13, 22, 2, -33, 44]})
I want to create a new column named difference (I have put it there as an illustration), such that it equals:
the score value in that row minus the score value at the same hour on the previous day, for that group
e.g. the difference in row 3 is equal to:
the score in that row minus the score on the day before (the 30th) at 08:00:00 for that group (i.e. Moon), i.e. 15 - 13 = 2. If the same time on the previous day does not exist, the score of that row itself is taken (e.g. in row 0, for time 2017-08-30 07:00:00 there is no 2017-08-29 07:00:00, hence only the 2 is taken).
I write the following:
df['datetime'] = pd.to_datetime(df['datetime'])
before = df['datetime'] - pd.DateOffset(days=1)
df['difference'] = df.groupby(["group", "datetime"])['score'].sub(
before.map(df.set_index('datetime')['score']), fill_value=0)
but I get the error:
AttributeError: 'SeriesGroupBy' object has no attribute 'sub'
What am I missing? Is there a more elegant solution?
MultiIndex.map
The error itself occurs because df.groupby(["group", "datetime"])['score'] returns a SeriesGroupBy object, which has no arithmetic methods such as sub; those live on Series. To compute the difference, we can instead set the group column together with the before column as the index of the dataframe, map that MultiIndex against the scores indexed by (group, datetime), and then subtract the mapped scores from the score column.
s = df.set_index(['group', before]).index.map(df.set_index(['group', 'datetime'])['score'])
df['difference'] = df['score'].sub(list(s), fill_value=0)
>>> df
group score datetime difference
0 Sun 2 2017-08-30 07:00:00 2.0
1 Moon 13 2017-08-30 08:00:00 13.0
2 Sun 24 2017-08-31 07:00:00 22.0
3 Moon 15 2017-08-31 08:00:00 2.0
4 Mars 11 2017-08-29 21:00:00 -33.0
5 Mars 44 2017-08-28 21:00:00 44.0

Pandas sum by date indexed but exclude totals column

I have a dataframe that is being read from database records and looks like this:
date added total
2020-09-14 5 5
2020-09-15 4 9
2020-09-16 2 11
I need to be able to resample by different periods and this is what I am using:
df = pd.DataFrame.from_records(raw_data, index='date')
df.index = pd.to_datetime(df.index)
# let's say I want yearly sample, then I would do
df = df.fillna(0).resample('Y').sum()
This almost works, but it is obviously summing the total column as well, which I don't want. I need the total column to keep the value from the last date in each resampled period, like this:
# What I need
date added total
2020 11 11
# What I'm getting
date added total
2020 11 25
You can do this by aggregating different columns differently within the same resample: use the sum() aggregator for the added column, but max() for total.
df = pd.DataFrame({'date':[20200914, 20200915, 20200916, 20210101, 20210102],
'added':[5, 4, 2, 1, 6],
'total':[5, 9, 11, 1, 7]})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df_res = df.resample('Y', on='date').agg({'added':'sum', 'total':'max'})
And the result is:
df_res
added total
date
2020-12-31 11 11
2021-12-31 7 7
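If total is a running cumulative total and the rows within each period are already ordered by date, taking the last value instead of the max is an equivalent choice:
df_res = df.resample('Y', on='date').agg({'added': 'sum', 'total': 'last'})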

How to add a new column by searching for data in a Pandas time series dataframe

I have a Pandas time series dataframe.
It has minute data for a stock for 30 days.
I want to create a new column stating the price of the stock at noon for that day, e.g. for all lines for January 1, I want a new column with the price at noon on January 1, and for all lines for January 2, I want a new column with the price at noon on January 2, etc.
Existing timeframe:
Date Time Last_Price Date Time 12amT
1/1/19 08:00 100 1/1/19 08:00 ?
1/1/19 08:01 101 1/1/19 08:01 ?
1/1/19 08:02 100.50 1/1/19 08:02 ?
...
31/1/19 21:00 106 31/1/19 21:00 ?
I used this hack, but it is very slow, and I assume there is a quicker and easier way to do this.
for lab, row in df.iterrows():
    t = row["Date"]
    df.loc[lab, "12amT"] = df[(df['Date'] == t) & (df['Time'] == "12:00")]["Last_Price"].values[0]
One way to do this is to use groupby with pd.Grouper:
For pandas 0.24.1+:
df.groupby(pd.Grouper(freq='D'))[0]\
.transform(lambda x: x.loc[(x.index.hour == 12) &
(x.index.minute==0)].to_numpy()[0])
Older pandas use:
df.groupby(pd.Grouper(freq='D'))[0]\
.transform(lambda x: x.loc[(x.index.hour == 12) &
(x.index.minute==0)].values[0])
MVCE:
df = pd.DataFrame(np.arange(48*60), index=pd.date_range('02-01-2019',periods=(48*60), freq='T'))
df['12amT'] = df.groupby(pd.Grouper(freq='D'))[0].transform(lambda x: x.loc[(x.index.hour == 12)&(x.index.minute==0)].to_numpy()[0])
Output (head):
0 12amT
2019-02-01 00:00:00 0 720
2019-02-01 00:01:00 1 720
2019-02-01 00:02:00 2 720
2019-02-01 00:03:00 3 720
2019-02-01 00:04:00 4 720
I'm not sure why you have two DateTime columns, so I made my own example to demonstrate:
ind = pd.date_range('1/1/2019', '30/1/2019', freq='H')
df = pd.DataFrame({'Last_Price':np.random.random(len(ind)) + 100}, index=ind)
def noon_price(df):
    noon_price = df.loc[df.index.hour == 12, 'Last_Price'].values
    noon_price = noon_price[0] if len(noon_price) > 0 else np.nan
    df['noon_price'] = noon_price
    return df
df.groupby(df.index.day).apply(noon_price).reindex(ind)
Each day's rows are filled with that day's noon_price; the final reindex(ind) simply restores the original chronological order.
To add a column with the next day's noon price, you can shift the column 24 rows down, like this:
df['T+1'] = df.noon_price.shift(-24)

merge pandas data frames based on id and date range

I need to perform a merge to map a new set of ids to an old set of ids. My starting data looks like this:
lst = [10001, 20001, 30001]
dt = pd.date_range(start='2016', end='2018', freq='M')
idx = pd.MultiIndex.from_product([dt,lst],names=['date','id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
In [94]: df.head()
Out[94]:
0
date id
2016-01-31 10001 -0.512371
20001 -1.164461
30001 -1.253232
2016-02-29 10001 -0.129874
20001 0.711938
And I want to map id to newid using data that looks like this:
df1 = pd.DataFrame({'id': [10001, 10001, 10001, 10001],
'start_date': ['2015-11-30', '2016-02-01', '2016-05-16', '2017-02-16'],
'end_date': ['2016-01-31', '2016-05-15', '2017-02-15', '2018-04-02'],
'new_id': ['ABC123', 'XYZ789', 'HIJ456', 'LMN654']},)
df2 = pd.DataFrame({'id': [20001, 20001, 20001, 20001],
'start_date': ['2015-10-07', '2016-01-08', '2016-06-02', '2017-02-13'],
'end_date': ['2016-01-07', '2016-06-01', '2017-02-12', '2018-03-17'],
'new_id': ['CBA321', 'ZYX987', 'JIH765', 'NML345']},)
df3 = pd.DataFrame({'id': [30001, 30001, 30001, 30001],
'start_date': ['2015-07-31', '2016-02-23', '2016-06-17', '2017-05-12'],
'end_date': ['2016-02-22', '2016-06-16', '2017-05-11', '2018-01-05'],
'new_id': ['CCC333', 'XXX444', 'HHH888', 'III888']},)
df_ranges = pd.concat([df1,df2,df3])
In [95]: df_ranges.head()
Out[95]:
index end_date id new_id start_date
0 0 2016-01-31 10001 ABC123 2015-11-30
1 1 2016-05-15 10001 XYZ789 2016-02-01
2 2 2017-02-15 10001 HIJ456 2016-05-16
3 3 2018-04-02 10001 LMN654 2017-02-16
4 0 2016-01-07 20001 CBA321 2015-10-07
Basically, my data is monthly panel data and the new data has ranges of dates for which a specific mapping from A->B is valid. So row 1 of the mapping data says that from 2015-11-30 through 2016-01-31 the id 10001 maps to ABC123.
I've previously done this in SAS/SQL with a statement like this:
SELECT a.*, b.newid FROM df as a, df_ranges as b
WHERE a.id = b.id AND b.start_date <= a.date < b.end_date
A few notes about the data:
it should be a 1:1 mapping of id to newid.
the date ranges are non-overlapping
The solution here may be a good start: Merging dataframes based on date range
It is exactly what I'm looking for except that it merges only on dates, not additionally on id. I played with groupby() and this solution but didn't find a way to make it work. Another idea I had was to unstack() the mapping data (df_ranges) to match the dimensions/time frequency of df but this seems to simply re-state the existing problem.
Perhaps I got downvoted because this was too easy, but I couldn't find the answer anywhere, so I'll just post it here: you should use merge_asof(), which provides fuzzy matching on dates.
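One wrinkle with the example data as constructed above: date and id are index levels of df, and the start_date/end_date columns of df_ranges are plain strings, whereas merge_asof needs ordinary columns with matching dtypes. A small preparation step (a sketch, assuming the frames defined in the question):
df = df.reset_index()  # turn the (date, id) MultiIndex into columns
df_ranges['start_date'] = pd.to_datetime(df_ranges['start_date'])
df_ranges['end_date'] = pd.to_datetime(df_ranges['end_date'])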
First, the data need to be sorted:
df_ranges.sort_values(by=['start_date','id'],inplace=True)
df.sort_values(by=['date','id'],inplace=True)
Then, do the merge:
pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date')
Output:
In [30]: pd.merge_asof(df,df_ranges, by='id', left_on='date', right_on='start_date').head()
Out[30]:
date id 0 start_date end_date new_id
0 2016-01-31 10001 0.120892 2015-11-30 2016-01-31 ABC123
1 2016-01-31 20001 -0.576096 2016-01-08 2016-06-01 ZYX987
2 2016-01-31 30001 0.543597 2015-07-31 2016-02-22 CCC333
3 2016-02-29 10001 0.316212 2016-02-01 2016-05-15 XYZ789
4 2016-02-29 20001 -0.625878 2016-01-08 2016-06-01 ZYX987
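Note that merge_asof only enforces the start_date <= date half of the original SQL condition. If a date could fall past the end of its matched range (or into a gap between ranges), the end_date bound can be applied afterwards; a sketch, treating end_date as inclusive as in the output above:
out = pd.merge_asof(df, df_ranges, by='id', left_on='date', right_on='start_date')
out.loc[out['date'] > out['end_date'], 'new_id'] = None  # date falls outside the matched range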
