Rolling Look Forward Sum with Datetime Index in Pandas - python

I have multivariate time-series/panel data in the following simplified format:
id,date,event_ind
1,2014-01-01,0
1,2014-01-02,1
1,2014-01-03,1
2,2014-01-01,1
2,2014-01-02,1
2,2014-01-03,1
3,2014-01-01,0
3,2014-01-02,0
3,2014-01-03,1
For this simplified example, I would like to compute the future 2-day sum of event_ind, grouped by id.
For some reason, adapting this example still gives me the "index is not monotonic" error: how to do forward rolling sum in pandas?
Here is my approach, which otherwise worked for backward-looking rolling by group before I adapted it:
df.sort_values(['id','date'], ascending=[True,True], inplace=True)
df.reset_index(drop=True, inplace=True)
df['date'] = pd.DatetimeIndex(df['date'])
df.set_index(['date'], drop=True, inplace=True)
rolling_forward_2_day = lambda x: x.iloc[::-1].rolling('2D').sum().shift(1).iloc[::-1]
df['future_2_day_total'] = df.groupby(['id'], sort=False)['event_ind'].transform(rolling_forward_2_day)
df.reset_index(drop=False, inplace=True)
Here is the expected result:
id date event_ind future_2_day_total
0 1 2014-01-01 0 2
1 1 2014-01-02 1 1
2 1 2014-01-03 1 0
3 2 2014-01-01 1 2
4 2 2014-01-02 1 1
5 2 2014-01-03 1 0
6 3 2014-01-01 0 1
7 3 2014-01-02 0 1
8 3 2014-01-03 1 0
Any tips on what I might be doing wrong or high-performance alternatives would be great!
EDIT:
One quick clarification: this example is simplified, and valid solutions need to be able to handle unevenly spaced/irregular time series, which is why rolling with a time-based index is used.

You can still use rolling here, but use it with the flag win_type='boxcar' (a boxcar window is flat, so this is just an ordinary two-row sum) and shift your data around before and after you sum. Incidentally, your "index is not monotonic" error comes from the iloc[::-1] reversal: the reversed DatetimeIndex is monotonically decreasing, while time-based rolling windows require a monotonically increasing index.
df['future_day_2_total'] = (
    df.groupby('id').event_ind.shift(-1)
      .fillna(0).groupby(df.id).rolling(2, win_type='boxcar')
      .sum().shift(-1).fillna(0)
      .reset_index(drop=True)  # drop the group level so the result aligns with df's rows
)
id date event_ind future_day_2_total
0 1 2014-01-01 0 2.0
1 1 2014-01-02 1 1.0
2 1 2014-01-03 1 0.0
3 2 2014-01-01 1 2.0
4 2 2014-01-02 1 1.0
5 2 2014-01-03 1 0.0
6 3 2014-01-01 0 1.0
7 3 2014-01-02 0 1.0
8 3 2014-01-03 1 0.0
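If you need a true time-based forward window on irregularly spaced data, one alternative (a sketch of mine, not part of the answer above; future_sum is a hypothetical helper) sums a cumulative array between positions found with searchsorted. It assumes df is sorted by id and date and indexed by date, as in the question's setup before the final reset_index, and it sums event_ind over the half-open window (t, t + 2 days]:
import numpy as np
import pandas as pd

def future_sum(g, window=np.timedelta64(2, 'D')):
    # g: one id's event_ind values with an ascending DatetimeIndex
    t = g.index.values
    csum = np.concatenate(([0], g.to_numpy().cumsum()))
    # rows whose timestamp lies in (t, t + window] contribute to row t
    right = np.searchsorted(t, t + window, side='right')
    left = np.arange(len(t)) + 1  # exclude the current row itself
    return pd.Series(csum[right] - csum[left], index=g.index)

df['future_2_day_total'] = df.groupby('id', sort=False)['event_ind'].transform(future_sum)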

Related

Calculate how many touch points the customer had in X months

I have a problem. Starting from a date, for example 2022-06-01, I want to calculate how many touches the customer with customerId == 1 had in the previous 6 months. He had two touches, on 2022-05-25 and 2022-05-20. I have already calculated the cutoff date up to which the data should be taken into account. However, I don't know how to group by customer and count, for each row, how many touches fall between count_from_date and fromDate.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
     'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20",
                  "2021-09-05", "2022-06-02", "2021-03-01", "2021-02-01"]}
df = pd.DataFrame(data=d)
print(df)
from dateutil.relativedelta import relativedelta

def find_last_date(date):
    return date + relativedelta(months=-6)

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['count_from_date'] = df['fromDate'].apply(find_last_date)
print(df)
What I have
customerId fromDate count_from_date
0 1 2022-06-01 2021-12-01
1 1 2022-05-25 2021-11-25
2 1 2022-05-25 2021-11-25
3 1 2022-05-20 2021-11-20
4 1 2021-09-05 2021-03-05
5 2 2022-06-02 2021-12-02
6 3 2021-03-01 2020-09-01
7 3 2021-02-01 2020-08-01
What I want
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3 # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
2 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
3 1 2022-05-20 2021-11-20 0 # None in the last 6 months
4 1 2021-09-05 2021-03-05 0 # None in the last 6 months
5 2 2022-06-02 2021-12-02 0 # None in the last 6 months
6 3 2021-03-01 2020-09-01 1 # 2021-02-01 = 1
7 3 2021-02-01 2020-08-01 0 # None in the last 6 months
You can try grouping by customerId and looping through the rows in each subgroup to count the number of fromDate values strictly between count_from_date and fromDate:
def count(g):
    m = pd.concat([g['fromDate'].between(d1, d2, 'neither')
                   for d1, d2 in zip(g['count_from_date'], g['fromDate'])], axis=1)
    return g.assign(occur_last_6_months=m.sum().tolist())
out = df.groupby('customerId').apply(count)
print(out)
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
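If the Python-level loop over rows becomes a bottleneck, the same strict between-count can be done with one numpy broadcast per group. This is a sketch of mine (count_vec is a hypothetical name), still O(n²) per group but without building a Series per row:
import numpy as np

def count_vec(g):
    t = g['fromDate'].to_numpy()
    lo = g['count_from_date'].to_numpy()
    # strictly between count_from_date and fromDate, matching inclusive='neither'
    m = (t[None, :] > lo[:, None]) & (t[None, :] < t[:, None])
    return g.assign(occur_last_6_months=m.sum(axis=1))

out = df.groupby('customerId', group_keys=False).apply(count_vec)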
For this problem, the challenge for a performant solution is to reshape the data into a structure suitable for rolling window operations.
First of all, we need to avoid duplicate indices. In your case, this means aggregating multiple touch points on a single day:
>>> df = df.groupby(['customerId', 'fromDate'], as_index=False).count()
customerId fromDate count_from_date
0 1 2021-09-05 1
1 1 2022-05-20 1
2 1 2022-05-25 2
3 1 2022-06-01 1
4 2 2022-06-02 1
5 3 2021-02-01 1
6 3 2021-03-01 1
Now we can set the index to fromDate, sort it, and group by customerId so that we can use rolling windows. Here I use a 180D rolling window (roughly 6 months):
>>> roll_df = (df.set_index('fromDate')
               .sort_index()
               .groupby('customerId')
               .apply(lambda s: s['count_from_date'].rolling('180D').sum()))
The sort_index step is important to ensure the index is monotonically increasing.
However, this also counts the touch on the day itself, which does not seem to be what you want, so we subtract 1 from the result:
>>> roll_df - 1
customerId fromDate
1 2021-09-05 0.0
2022-05-20 0.0
2022-05-25 2.0
2022-06-01 3.0
2 2022-06-02 0.0
3 2021-02-01 0.0
2021-03-01 1.0
Name: count_from_date, dtype: float64
Finally, we divide by the initial counts to get back to per-row values in the original structure:
>>> roll_df / df.set_index(['customerId', 'fromDate'])['count_from_date']
customerId fromDate count_from_date
0 1 2021-09-05 0.0
1 1 2022-05-20 0.0
2 1 2022-05-25 1.0
3 1 2022-06-01 3.0
4 2 2022-06-02 0.0
5 3 2021-02-01 0.0
6 3 2021-03-01 1.0
You can always .reset_index() at the end.
The one-liner solution is:
(df.set_index('fromDate')
   .sort_index()
   .groupby('customerId')
   .apply(lambda s: s['count_from_date'].rolling('180D').sum())
 - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
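For example, to materialize that as a named column in a flat frame, one possible wrap-up (result is my name for it):
result = (roll_df - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
result = result.rename('occur_last_6_months').reset_index()
print(result)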

Apply function to a range of specific rows

I have the following dataframe df:
bucket_value is_new_bucket
dates
2019-03-07 0 1
2019-03-08 1 0
2019-03-09 2 0
2019-03-10 3 0
2019-03-11 4 0
2019-03-12 5 1
2019-03-13 6 0
2019-03-14 7 1
I want to apply a specific function (say, the mean) to each group of bucket_value rows where the column is_new_bucket equals zero, so that the resulting dataframe would look like this:
mean_values
dates
2019-03-08 2.5
2019-03-13 6.0
In other words, apply a function to each run of consecutive rows where is_new_bucket = 0, taking bucket_value as input.
For instance, if I want to apply the max function, the resulting dataframe would look like this:
max_values
dates
2019-03-11 4.0
2019-03-13 6.0
Using cumsum with filter
df.reset_index(inplace=True)
s = df.loc[df.is_new_bucket == 0].groupby(df.is_new_bucket.cumsum()).agg(
    {'dates': 'first', 'bucket_value': ['mean', 'max']})
s
                   dates bucket_value
                   first         mean max
is_new_bucket
1             2019-03-08          2.5   4
2             2019-03-13          6.0   6
Update:
df.loc[df.loc[df.is_new_bucket == 0].groupby(df.is_new_bucket.cumsum())['bucket_value'].idxmax()]
        dates  bucket_value  is_new_bucket
4  2019-03-11             4              0
6  2019-03-13             6              0
Update 2: after using cumsum to create the group key Newkey, you can do whatever you need based on that key:
df['Newkey'] = df.is_new_bucket.cumsum()
df
        dates  bucket_value  is_new_bucket  Newkey
0  2019-03-07             0              1       1
1  2019-03-08             1              0       1
2  2019-03-09             2              0       1
3  2019-03-10             3              0       1
4  2019-03-11             4              0       1
5  2019-03-12             5              1       2
6  2019-03-13             6              0       2
7  2019-03-14             7              1       3
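Putting the pieces together, a sketch of mine (df2 is a hypothetical name) that reproduces the exact mean_values frame requested in the question, assuming df still has its original dates index:
df2 = df.reset_index()
zeros = df2.loc[df2['is_new_bucket'] == 0]
out = (zeros.groupby(df2['is_new_bucket'].cumsum())
            .agg(dates=('dates', 'first'),
                 mean_values=('bucket_value', 'mean'))
            .set_index('dates'))
print(out)
            mean_values
dates
2019-03-08          2.5
2019-03-13          6.0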

Sum a column based on groupby and condition

I have a dataframe with several columns. I want to sum the gap column where time falls within given time slots.
region date time gap
0 1 2016-01-01 00:00:08 1
1 1 2016-01-01 00:00:48 0
2 1 2016-01-01 00:02:50 1
3 1 2016-01-01 00:00:52 0
4 1 2016-01-01 00:10:01 0
5 1 2016-01-01 00:10:03 1
6 1 2016-01-01 00:10:05 0
7 1 2016-01-01 00:10:08 0
I want to sum the gap column. I have the time slots in a dict like this:
'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'
After summation, the above dataframe should look like this:
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
I have many regions and 144 time slots from 00:00:00 to 23:59:59. I have tried this:
regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()
But it doesn't work.
The idea is to convert the time column to datetimes, floor them to 10-minute bins, then convert back to HH:MM:SS strings:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
region date time gap
0 1 2016-01-01 00:00:00 1
1 1 2016-01-01 00:00:00 0
2 1 2016-01-01 00:00:00 1
3 1 2016-01-01 00:00:00 0
4 1 2016-01-01 00:10:00 0
5 1 2016-01-01 00:10:00 1
6 1 2016-01-01 00:10:00 0
7 1 2016-01-01 00:10:00 0
Aggregate with sum, then map the values using the dictionary with keys and values swapped:
regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:00:00/slot1 2
1 1 2016-01-01 00:10:00/slot2 1
If you want to display the next 10-minute slot boundary instead:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
region date time gap time1
0 1 2016-01-01 00:00:00 1 00:10:00
1 1 2016-01-01 00:00:00 0 00:10:00
2 1 2016-01-01 00:00:00 1 00:10:00
3 1 2016-01-01 00:00:00 0 00:10:00
4 1 2016-01-01 00:10:00 0 00:20:00
5 1 2016-01-01 00:10:00 1 00:20:00
6 1 2016-01-01 00:10:00 0 00:20:00
7 1 2016-01-01 00:10:00 0 00:20:00
regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
EDIT:
An improvement over flooring and string conversion is binning with cut or searchsorted:
import numpy as np

df['time'] = pd.to_timedelta(df['time'])
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
labels = np.array([str(x)[-8:] for x in bins])[:-1]
# right=False so a time exactly on a boundary (e.g. 00:10:00) falls in its own bin
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels, right=False)
# side='right' keeps exact boundaries aligned with their own slot here too
df['time11'] = labels[np.searchsorted(bins, df['time'].values, side='right') - 1]
Just to avoid the complication of the datetime comparison (unless that is your whole point, in which case ignore my answer), and to show the essence of this group-by-slot-window problem, I here assume times are integers:
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap':  [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])
# for each row, pick the largest slot boundary strictly below the row's time
df['slot'] = df.apply(lambda x: slots[np.argmax(slots[x['time'] > slots])], axis=1)
df.groupby('slot')[['gap']].sum()
Output
      gap
slot
0       2
1000    1
1500    3
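The row-wise apply can also be replaced with a vectorized lookup. A one-line alternative of mine using np.searchsorted (note that, unlike the strict comparison above, it puts a time that falls exactly on a boundary into that boundary's own slot):
df['slot'] = slots[np.searchsorted(slots, df['time'], side='right') - 1]
df.groupby('slot')[['gap']].sum()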
The way to think about this problem is: first convert your time column to the slot values you want, then do a groupby sum on that column.
The code below shows the approach I've used. I used np.select so I can include as many conditions and choices as I want. After converting time to the values I wanted, I did a simple groupby sum.
None of the fuss of formatting times or converting strings is really needed; simply let the pandas DataFrame handle it.
import pandas as pd
import numpy as np  # required for the np.select function

# Just creating the DataFrame using a dictionary here
regdict = {
    'time': ['00:00:08', '00:00:48', '00:02:50', '00:00:52',
             '00:10:01', '00:10:03', '00:10:05', '00:10:08'],
    'gap': [1, 0, 1, 0, 0, 1, 0, 0],
}
df = pd.DataFrame(regdict)

# Add all your conditions and choices here
condlist = [df['time'] < '00:10:00', df['time'] < '00:20:00']
choicelist = ['00:10:00/slot1', '00:20:00/slot2']

# Use np.select once all conditions and choices are defined
answerlist = np.select(condlist, choicelist)
print (answerlist)
['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
'00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']
# Assign answerlist to df['time']
df['time'] = answerlist
print(df)
             time  gap
0  00:10:00/slot1    1
1  00:10:00/slot1    0
2  00:10:00/slot1    1
3  00:10:00/slot1    0
4  00:20:00/slot2    0
5  00:20:00/slot2    1
6  00:20:00/slot2    0
7  00:20:00/slot2    0
df = df.groupby('time', as_index=False)['gap'].sum()
print(df)
             time  gap
0  00:10:00/slot1    2
1  00:20:00/slot2    1
If you wish to keep the original time you can instead do df['timeNew'] = answerlist and then filter from there.
df['timeNew'] = answerlist
print (df)
time gap timeNew
0 00:00:08 1 00:10:00/slot1
1 00:00:48 0 00:10:00/slot1
2 00:02:50 1 00:10:00/slot1
3 00:00:52 0 00:10:00/slot1
4 00:10:01 0 00:20:00/slot2
5 00:10:03 1 00:20:00/slot2
6 00:10:05 0 00:20:00/slot2
7 00:10:08 0 00:20:00/slot2
# Use transform here to retain all prior rows
df['aggregate sum of gap'] = df.groupby('timeNew')['gap'].transform('sum')
print (df)
time gap timeNew aggregate sum of gap
0 00:00:08 1 00:10:00/slot1 2
1 00:00:48 0 00:10:00/slot1 2
2 00:02:50 1 00:10:00/slot1 2
3 00:00:52 0 00:10:00/slot1 2
4 00:10:01 0 00:20:00/slot2 1
5 00:10:03 1 00:20:00/slot2 1
6 00:10:05 0 00:20:00/slot2 1
7 00:10:08 0 00:20:00/slot2 1
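With 144 slots you would not want to hardcode condlist and choicelist. One possible way to build both from the slot dict (a sketch of mine, assuming the dict values are sorted, zero-padded HH:MM:SS strings and df['time'] still holds the raw time strings):
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
bounds = sorted(d.items(), key=lambda kv: kv[1])  # [(slot name, slot start), ...]
# each slot's condition is "time < next slot's start"; its label joins the
# next boundary with the current slot name, e.g. '00:10:00/slot1'
condlist = [df['time'] < v for _, v in bounds[1:]]
choicelist = ['{}/{}'.format(v, k) for (k, _), (_, v) in zip(bounds[:-1], bounds[1:])]
df['timeNew'] = np.select(condlist, choicelist)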

Find days since last event pandas dataframe

I have a pandas data frame:
df12 = pd.DataFrame({'group_ids':[1,1,1,2,2,2],'dates':['2016-04-01','2016-04-20','2016-04-28','2016-04-05','2016-04-20','2016-04-29'],'event_today_in_group':[1,0,1,1,1,0]})
group_ids dates event_today_in_group
0 1 2016-04-01 1
1 1 2016-04-20 0
2 1 2016-04-28 1
3 2 2016-04-05 1
4 2 2016-04-20 1
5 2 2016-04-29 0
I would like to compute an additional column that contains, for each group_ids, the number of days since the last time event_today_in_group was 1.
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 0
1 1 2016-04-20 0 19
2 1 2016-04-28 1 27
3 2 2016-04-05 1 0
4 2 2016-04-20 1 15
5 2 2016-04-29 0 9
First, this will get you the non-cumulative difference between dates within each group (dates must be a datetime column first):
df['dates'] = pd.to_datetime(df['dates'])
df['days_since_last_event'] = df.groupby('group_ids')['dates'].diff().dt.days
In order to get a cumulative sum of this difference, based on whenever event_today_in_group changes, I propose using shift to get the value of the previous row, and then generating a cumulative sum, like so:
df['event_today_in_group'].shift().cumsum()
Output:
0 NaN
1 1.0
2 1.0
3 2.0
4 3.0
5 4.0
This gives us the second grouping value we need to get the cumulative sums. You could assign the above values to a new column, but if you're only using them for the calculation, then you can simply include them in the subsequent groupby operation like so:
df.loc[:, 'days_since_last_event'] = df.groupby(['group_ids', df['event_today_in_group'].shift().cumsum()])['days_since_last_event'].cumsum()
Result:
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 NaN
1 1 2016-04-20 0 19.0
2 1 2016-04-28 1 27.0
3 2 2016-04-05 1 NaN
4 2 2016-04-20 1 15.0
5 2 2016-04-29 0 9.0
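One final touch if you want to match the question's expected output exactly, where rows with no prior event show 0 instead of NaN:
df['days_since_last_event'] = df['days_since_last_event'].fillna(0)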

time interval partitioned by 2 fields in pandas

I have the following data frame:
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
I would like to generate the interval column: the minutes between rows, but only within the same id and the same day, just as in the example. In SQL I would partition by id and date and use LAG to get the time interval from the previous row. How can I do this in pandas?
You can convert the datetime column with to_datetime, then use groupby with diff and convert the timedelta to minutes with astype:
print(df)
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
df['datetime'] = pd.to_datetime(df['datetime'])
# group by the full date (dt.date) rather than dt.day so days in different months don't collide
df['new'] = df.groupby(['id', df['datetime'].dt.date])['datetime'].diff().astype('timedelta64[m]')
print(df)
id datetime interval new
0 1 2016-01-01 07:00:00 NaN NaN
1 1 2016-01-01 08:00:00 60 60
2 1 2016-01-02 07:00:00 NaN NaN
3 1 2016-01-02 07:30:00 30 30
4 2 2016-01-01 07:15:00 NaN NaN
5 2 2016-01-01 07:16:00 1 1
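If you want something that reads even closer to SQL's LAG, a sketch of mine using shift with the same grouping (total_seconds is a version-proof way to get minutes from the timedelta):
prev = df.groupby(['id', df['datetime'].dt.date])['datetime'].shift()
df['new'] = (df['datetime'] - prev).dt.total_seconds() / 60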
