Calculate time blocked within a timerange with pandas


I have a list of products produced or processes finished like this one:
Name       Timestamp Start      Timestamp Stop
Product 1  2021-01-01 15:15:00  2021-01-01 15:37:00
Product 1  2021-01-01 15:30:00  2021-01-01 15:55:00
Product 1  2021-01-02 15:05:00  2021-01-02 15:22:00
Product 1  2021-01-03 15:45:00  2021-01-03 15:55:00
...        ...                  ...
What I want to do is calculate, for each day, the amount of time during which no product/process was running within a given timeframe, for example from 15:00 to 16:00.
The output could be "amount of idle minutes/time where nothing happened" or "percentage of idle time".
import pandas as pd
import datetime
df = pd.read_csv('example_data.csv')
# generate list of products
listOfProducts = df['NAME'].drop_duplicates().tolist()
# define timeframe for each day
startTime = datetime.time(15, 0)
stopTime = datetime.time(16, 0)
# define daterange to look for
startDay = datetime.datetime(2021, 1, 1)
stopDay = datetime.datetime(2021, 1, 5)
# do it for every product
for i in listOfProducts:
    # filter dataframe by product
    df_product = df[df['NAME'] == i]
    # sort dataframe by start
    df_product = df_product.sort_values(by='started')
    # ... how to proceed?
The wanted output should look like this or similar:
Day         Time idle
2021-01-01  00:20:00
2021-02-01  00:43:00
2021-03-01  00:50:00
...         ...
Here are some notes that are important:
- Time ranges of products can overlap each other; in this case they should only "count once".
- Time ranges of products can overlap the borders (15:00 or 16:00 in this case); in this case only the time within the borders should be counted.
I struggle to implement it in a pandas way, because these border cases prevent me from simply adding up Timedeltas.
In the past, I solved this issue by iterating row by row and adding up the minutes or seconds. But I'm sure there is a more pandas-like way, maybe with the .groupby() function?

Input data:
>>> df
Name Start Stop
0 Product 1 2021-01-01 14:49:00 2021-01-01 15:04:00 # OK (overlap 4')
1 Product 1 2021-01-01 15:15:00 2021-01-01 15:37:00 # OK
2 Product 1 2021-01-01 15:30:00 2021-01-01 15:55:00 # OK
3 Product 1 2021-01-02 15:05:00 2021-01-02 15:22:00 # OK
4 Product 1 2021-01-03 15:45:00 2021-01-03 15:55:00 # OK
5 Product 1 2021-01-03 15:51:00 2021-01-03 16:23:00 # OK (overlap 9')
6 Product 1 2021-01-04 14:28:00 2021-01-04 17:12:00 # OK (overlap 60')
7 Product 1 2021-01-05 11:46:00 2021-01-05 13:40:00 # Out of bounds
8 Product 1 2021-01-05 17:20:00 2021-01-05 19:11:00 # Out of bounds
First, remove data out of bounds (7 & 8):
import datetime
START = datetime.time(15)
STOP = datetime.time(16)
df1 = df.loc[(df["Start"].dt.floor(freq="H").dt.time <= START)
             & (START <= df["Stop"].dt.floor(freq="H").dt.time),
             ["Start", "Stop"]]
Extract the minute of the Start and Stop datetimes. If the process began before 15:00, set the start minute to 0, because we only want to keep the overlapping part. If the process ended after 16:00, set the stop minute to 59.
import numpy as np
df1["m1"] = np.where(df1["Start"].dt.time > START,
                     df1["Start"].sub(df1["Start"].dt.floor(freq="H"))
                                 .dt.seconds // 60, 0)
df1["m2"] = np.where(df1["Stop"].dt.time < STOP,
                     df1["Stop"].sub(df1["Stop"].dt.floor(freq="H"))
                                .dt.seconds // 60, 59)
>>> df1
Start Stop m1 m2
0 2021-01-01 14:49:00 2021-01-01 15:04:00 0 4
1 2021-01-01 15:15:00 2021-01-01 15:37:00 15 37
2 2021-01-01 15:30:00 2021-01-01 15:55:00 30 55
3 2021-01-02 15:05:00 2021-01-02 15:22:00 5 22
4 2021-01-03 15:45:00 2021-01-03 15:55:00 45 55
5 2021-01-03 15:51:00 2021-01-03 16:23:00 51 59
6 2021-01-04 14:28:00 2021-01-04 17:12:00 0 59
Create an empty table with len(df1) rows and 60 columns (one per minute) to store process usage:
out = pd.DataFrame(0, index=df1.index, columns=pd.RangeIndex(60))
Fill the out dataframe:
for idx, (i1, i2) in df1[["m1", "m2"]].iterrows():
    out.loc[idx, i1:i2] = 1
>>> out
0 1 2 3 4 5 6 ... 53 54 55 56 57 58 59
0 1 1 1 1 1 0 0 ... 0 0 0 0 0 0 0 # 4'
1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 ... 1 1 1 0 0 0 0
3 0 0 0 0 0 1 1 ... 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 ... 1 1 1 0 0 0 0
5 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 # full hour
[7 rows x 60 columns]
Finally, compute the idle minutes. Taking the per-day maximum of each minute column makes overlapping processes count only once:
>>> 60 - out.groupby(df1["Start"].dt.date).max().sum(axis="columns")
Start
2021-01-01    14
2021-01-02    42
2021-01-03    45
2021-01-04     0
dtype: int64
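Note: you have to decide whether the Stop datetime is closed (inclusive) or not. As written, the label slice i1:i2 is inclusive, so the Stop minute itself counts as busy.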
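An alternative that avoids the per-minute table is to clip each range to the window and merge overlapping ranges before summing. The following is only a sketch under stated assumptions: no process spans midnight, the Stop timestamp is treated as exclusive, and the helper names (`tmp`, `block`, `WINDOW_START`, `WINDOW_LEN`) are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": pd.to_datetime([
        "2021-01-01 14:49", "2021-01-01 15:15", "2021-01-01 15:30",
        "2021-01-02 15:05", "2021-01-03 15:45", "2021-01-03 15:51",
        "2021-01-04 14:28", "2021-01-05 11:46", "2021-01-05 17:20"]),
    "Stop": pd.to_datetime([
        "2021-01-01 15:04", "2021-01-01 15:37", "2021-01-01 15:55",
        "2021-01-02 15:22", "2021-01-03 15:55", "2021-01-03 16:23",
        "2021-01-04 17:12", "2021-01-05 13:40", "2021-01-05 19:11"]),
})

WINDOW_START = pd.Timedelta(hours=15)  # hypothetical name: window opens at 15:00
WINDOW_LEN = 60                        # hypothetical name: window length in minutes

# Express Start/Stop as minutes after the window start, clipped to [0, 60].
# Assumption: a process starts and stops on the same calendar day.
day = df["Start"].dt.normalize()
origin = day + WINDOW_START
tmp = pd.DataFrame({
    "day": day,
    "s": ((df["Start"] - origin).dt.total_seconds() / 60).clip(0, WINDOW_LEN),
    "e": ((df["Stop"] - origin).dt.total_seconds() / 60).clip(0, WINDOW_LEN),
})
# Ranges entirely outside the window collapse to zero length and are dropped.
tmp = tmp[tmp["e"] > tmp["s"]].sort_values(["day", "s"])

# A new busy block starts when a range begins after all earlier ranges ended.
prev_end = tmp.groupby("day")["e"].transform(lambda x: x.cummax().shift())
block = (prev_end.isna() | (tmp["s"] > prev_end)).cumsum().rename("block")

# Merge each block into one range and add up the busy minutes per day.
merged = tmp.groupby(["day", block]).agg(s=("s", "min"), e=("e", "max"))
busy = (merged["e"] - merged["s"]).groupby(level="day").sum()

# Days with no activity inside the window are fully idle.
all_days = pd.Index(day.unique(), name="day").sort_values()
idle = (WINDOW_LEN - busy.reindex(all_days, fill_value=0)).astype(int)
print(idle)
```

Because Stop is exclusive here, the numbers differ slightly from the minute-table approach, and 2021-01-05 appears with 60 idle minutes instead of being dropped.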

Related

Calculate how many touch points the customer had in X months

I have a problem. Starting from a date, for example 2022-06-01, I want to calculate how many touches the customer with customerId == 1 had in the last 6 months. That customer had touches on 2022-05-25 (twice) and 2022-05-20. I have already calculated the date from which the data should be taken into account (count_from_date). However, I don't know how to group by customer and count, for each row, how many touches that customer had between count_from_date and fromDate.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
     'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
                  "2022-06-02", "2021-03-01", "2021-02-01"]
    }
df = pd.DataFrame(data=d)
print(df)
from datetime import date
from dateutil.relativedelta import relativedelta
def find_last_date(date):
    six_months = date + relativedelta(months=-6)
    return six_months
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['count_from_date'] = df['fromDate'].apply(lambda x: find_last_date(x))
print(df)
What I have
customerId fromDate count_from_date
0 1 2022-06-01 2021-12-01
1 1 2022-05-25 2021-11-25
2 1 2022-05-25 2021-11-25
3 1 2022-05-20 2021-11-20
4 1 2021-09-05 2021-03-05
5 2 2022-06-02 2021-12-02
6 3 2021-03-01 2020-09-01
7 3 2021-02-01 2020-08-01
What I want
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3 # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
2 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
3 1 2022-05-20 2021-11-20 0 # None in the last 6 months
4 1 2021-09-05 2021-03-05 0 # None in the last 6 months
5 2 2022-06-02 2021-12-02 0 # None in the last 6 months
6 3 2021-03-01 2020-09-01 1 # 2021-02-01 = 1
7 3 2021-02-01 2020-08-01 0 # None in the last 6 months

You can try grouping by customerId and looping through the rows of each subgroup to count the fromDate values that fall strictly between count_from_date and fromDate:
def count(g):
    m = pd.concat([g['fromDate'].between(d1, d2, inclusive='neither')
                   for d1, d2 in zip(g['count_from_date'], g['fromDate'])], axis=1)
    g = g.assign(occur_last_6_months=m.sum().tolist())
    return g
out = df.groupby('customerId', group_keys=False).apply(count)
print(out)
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0

For this problem, the challenge for a performant solution is to reshape the data so that it has an appropriate structure for rolling window operations.
First of all, we need to avoid duplicate indices. In your case, this means aggregating multiple touch points on a single day:
>>> df = df.groupby(['customerId', 'fromDate'], as_index=False).count()
customerId fromDate count_from_date
0 1 2021-09-05 1
1 1 2022-05-20 1
2 1 2022-05-25 2
3 1 2022-06-01 1
4 2 2022-06-02 1
5 3 2021-02-01 1
6 3 2021-03-01 1
Now we can set the index to fromDate, sort it, and group by customerId so that rolling windows can be used. Here I use a 180D rolling window (roughly 6 months):
>>> roll_df = (df.set_index(['fromDate'])
...              .sort_index()
...              .groupby('customerId')
...              .apply(lambda s: s['count_from_date'].rolling('180D').sum()))
The sort_index step is important to ensure your data is monotonically increasing.
However, this also counts the touches on the day itself, which does not seem to be what you want, so we subtract each day's own count from the result:
>>> counts = df.set_index(['customerId', 'fromDate'])['count_from_date']
>>> roll_df - counts
customerId  fromDate
1           2021-09-05    0.0
            2022-05-20    0.0
            2022-05-25    1.0
            2022-06-01    3.0
2           2022-06-02    0.0
3           2021-02-01    0.0
            2021-03-01    1.0
Name: count_from_date, dtype: float64
Finally, reset the index to get back to a flat structure with one row per customer and day:
>>> (roll_df - counts).reset_index()
   customerId   fromDate  count_from_date
0           1 2021-09-05              0.0
1           1 2022-05-20              0.0
2           1 2022-05-25              1.0
3           1 2022-06-01              3.0
4           2 2022-06-02              0.0
5           3 2021-02-01              0.0
6           3 2021-03-01              1.0
The one liner solution is
(df.set_index(['fromDate'])
   .sort_index()
   .groupby('customerId')
   .apply(lambda s: s['count_from_date'].rolling('180D').sum())
 - df.set_index(['customerId', 'fromDate'])['count_from_date'])
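For reference, here is a self-contained sketch of the rolling-window idea that starts from the question's raw data and maps the result back onto the original rows. It assumes that touches on the same day should not count toward a row's total and that 180 days is an acceptable stand-in for 6 months; the names `daily`, `rolled`, and `previous` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 1, 1, 1, 1, 2, 3, 3],
    "fromDate": pd.to_datetime([
        "2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20",
        "2021-09-05", "2022-06-02", "2021-03-01", "2021-02-01"]),
})

# One row per customer and day, holding the number of touches on that day.
daily = (df.groupby(["customerId", "fromDate"]).size()
           .rename("touches").reset_index())

# Rolling 180-day sum per customer; the window includes the current day.
rolled = (daily.set_index("fromDate").sort_index()
               .groupby("customerId")["touches"]
               .rolling("180D").sum())

# Subtract the current day's own touches so that only earlier ones remain.
own = daily.set_index(["customerId", "fromDate"])["touches"]
previous = (rolled - own).astype(int).rename("occur_last_6_months")

# Map the per-day result back onto the original rows, duplicates included.
out = df.join(previous, on=["customerId", "fromDate"])
print(out)
```

Joining on both customerId and fromDate keeps the original row order and repeats the per-day value for duplicated dates.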

python, pandas groupby the groupby dataframe

I have a dataframe on which I have already run a groupby().agg() using
df = df.groupby(['Date','Time', 'ID_Code']).agg({'ID_Code':['count','nunique']}).reset_index()
It now looks like this:
Date Time ID_Code count nunique
0 2021-01-04 10:50:00 CA_031 2 1
1 2021-01-04 12:40:00 CA_021 8 1
2 2021-01-04 13:20:00 CA_044 4 1
3 2021-01-04 13:30:00 CA_045 4 1
4 2021-01-04 13:36:00 CA_040 13 1
.. ... ... ... ... ...
433 2021-12-28 12:12:00 CA_805 3 1
434 2021-12-28 12:40:00 CA_802 3 1
435 2021-12-28 15:35:00 CA_003 22 1
436 2021-12-28 8:29:00 CA_806 3 1
What I now need is a sum of the count and how many times each ID_Code occurs.
# the below line removed the multi index header.
df.columns = ['Date', 'Time', 'ID_Code', 'count', 'nunique']
# the below line does not appear to work. Why and how would I fix it?
df = df.groupby('ID_Code').agg(total_count = ('count','sum'), frequency = ('nunique','sum'),)
What I want is:
ID_Code total_count frequency
0 CA_031 242 12
1 CA_021 89 9
2 CA_044 148 11
3 CA_045 76 7
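As a point of comparison, the following sketch runs the same named aggregation on a small, made-up frame that already has flat column names. The data values are invented for illustration and are not taken from the question; on such a frame the second groupby produces one row per ID_Code.

```python
import pandas as pd

# Hypothetical data shaped like the flattened frame in the question.
df = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-04", "2021-01-05", "2021-01-05"],
    "Time": ["10:50:00", "12:40:00", "09:00:00", "11:15:00"],
    "ID_Code": ["CA_031", "CA_021", "CA_031", "CA_031"],
    "count": [2, 8, 5, 1],
    "nunique": [1, 1, 1, 1],
})

# Named aggregation: output column name = (input column, aggregation function).
summary = (df.groupby("ID_Code", sort=False)
             .agg(total_count=("count", "sum"),
                  frequency=("nunique", "sum"))
             .reset_index())
print(summary)
```

If the columns were still a MultiIndex at this point, the tuples would not match any column label, so flattening the header first (as done in the question) matters.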

Create new column based on how many rows, with condition based on another column, are within X days of current row date

My DF currently has just the first two columns, DATE and RESULT, and I want to create the third column, N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS:
DATE RESULT N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-12-26 23:13:43+00:00 1 0
2019-02-18 23:27:58+00:00 0 1
2019-02-28 15:02:33+00:00 0 0
2019-03-05 18:30:26+00:00 1 0
2019-05-21 14:54:52+00:00 1 0
2019-08-26 14:30:38+00:00 1 0
2019-09-19 15:51:01+00:00 1 1
2019-12-16 17:58:24+00:00 0 0
2021-02-23 03:50:33+00:00 0 0
2021-08-08 22:26:01+00:00 1 0
2021-09-01 18:04:46+00:00 0 1
For each row I want to check all the previous rows that are within 60 days of the current row and count how many of them have RESULT == 1. I can only think of a double for loop to solve this problem, which is not efficient. Is there a more efficient way to solve this problem?
Edit 1: I made a mistake when first creating the N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS column, I was not considering the RESULT == 1, I'm now fixing it.
Edit 2: I thought that having this simple example would be enough to solve the problem. However as it turns out, the best answer so far requires that the date is sorted, and I actually can't sort the date, here is why:
My actual problem has some IDs in it, and I must solve this problem for each individual ID. My DF is actually more like this:
DATE ID RESULT N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-01-24 22:02:36+00:00 104 1 0
2018-05-15 18:27:17+00:00 104 0 0
2019-05-15 22:58:06+00:00 104 1 0
2019-07-22 15:17:55+00:00 104 1 0
2020-01-03 20:27:52+00:00 104 1 0
2018-12-26 23:13:43+00:00 105 1 0
2019-02-18 23:27:58+00:00 105 0 1
2019-02-28 15:02:33+00:00 105 0 0
2019-03-05 18:30:26+00:00 105 1 0
2019-05-21 14:54:52+00:00 105 1 0
2019-08-26 14:30:38+00:00 105 1 0
2019-09-19 15:51:01+00:00 105 1 1
2019-12-16 17:58:24+00:00 105 0 0
2021-02-23 03:50:33+00:00 105 0 0
2021-08-08 22:26:01+00:00 105 1 0
2021-09-01 18:04:46+00:00 105 0 1
2019-01-12 21:24:23+00:00 106 0 0
2019-05-28 08:03:55+00:00 106 1 0
2019-09-17 02:56:47+00:00 106 0 0
2020-05-06 17:20:55+00:00 106 0 0
2021-01-07 13:14:41+00:00 106 0 0
So, if I set my DATE column as the index and then sort the index, I end up messing up my ID column.

Assuming that 'DATE' is a DatetimeIndex, you can group by 'ID' and then use .rolling(), which now works for ragged datetimes:
df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df.groupby('ID')['RESULT'].rolling('60D').sum().astype('int').droplevel(0)
I'm using the original index to add the column here, which works, but I think a more robust solution would be to use 'ID' and 'DATE' to merge the original df with the df that contains the 60-day sums, so you can also try that. Also, I understand you don't want to include the 'RESULT' itself, only the sum from previous ones. In that case, just subtract it:
df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] = df['N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS'] - df['RESULT']
Output:
DATE ID RESULT N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS
2018-01-24 22:02:36+00:00 104 1 0
2018-05-15 18:27:17+00:00 104 0 0
2019-05-15 22:58:06+00:00 104 1 0
2019-07-22 15:17:55+00:00 104 1 0
2020-01-03 20:27:52+00:00 104 1 0
2018-12-26 23:13:43+00:00 105 1 0
2019-02-18 23:27:58+00:00 105 0 1
2019-02-28 15:02:33+00:00 105 0 0
2019-03-05 18:30:26+00:00 105 1 0
2019-05-21 14:54:52+00:00 105 1 0
2019-08-26 14:30:38+00:00 105 1 0
2019-09-19 15:51:01+00:00 105 1 1
2019-12-16 17:58:24+00:00 105 0 0
2021-02-23 03:50:33+00:00 105 0 0
2021-08-08 22:26:01+00:00 105 1 0
2021-09-01 18:04:46+00:00 105 0 1
2019-01-12 21:24:23+00:00 106 0 0
2019-05-28 08:03:55+00:00 106 1 0
2019-09-17 02:56:47+00:00 106 0 0
2020-05-06 17:20:55+00:00 106 0 0
2021-01-07 13:14:41+00:00 106 0 0

It seems @MethodGuy had already described how to use rolling() while I was working on my solution, but I am posting my version because it adds something else.
I also get the same result as @MethodGuy, which differs from N_RESULTS_EQUAL_1_PREVIOUS_60_DAYS, so I checked how many days lie between the first and last date in each rolling window.
I'm not sure, but maybe the window should be 61D (minus the last date, which the function drops) to cover the past 60 days.
If DATE is the index, you can use rolling('60D') to create a rolling window that covers only the last 60 days, and then use .sum(), .count(), etc. You can also use .apply(func) to run your own function, which can skip the current date:
def result(data):
    return data[:-1].sum()
df['result'] = df['RESULT'].rolling('60D').apply(result).astype(int)
Minimal working code which shows .sum(), .count(), and .apply() to calculate the sum without the current day. I also use .apply() to calculate the days between the first and last date in each rolling window:
text = '''DATE  RESULT  60_DAYS
2018-12-26 23:13:43+00:00  1  0
2019-02-18 23:27:58+00:00  0  1
2019-02-28 15:02:33+00:00  0  1
2019-03-05 18:30:26+00:00  1  2
2019-05-21 14:54:52+00:00  1  0
2019-08-26 14:30:38+00:00  1  0
2019-09-19 15:51:01+00:00  1  1
2019-12-16 17:58:24+00:00  0  0
2021-02-23 03:50:33+00:00  0  0
2021-08-08 22:26:01+00:00  1  0
2021-09-01 18:04:46+00:00  0  1'''
import pandas as pd
import io
# columns are separated by two or more spaces (a regex separator needs the python engine)
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df.index = pd.to_datetime(df['DATE'])
del df['DATE']
print(df)
def result1(data):
    # drop the last (current) row, then sum the rest
    data = data[:-1]
    return data.sum()
def result2(data):
    # drop the last (current) row, then count rows equal to 1
    data = data[:-1]
    return len(data[ data == 1 ])
def days(data):
    # days between the first and last date in the window
    return (data.index[-1] - data.index[0]).days
window = df['RESULT'].rolling('60D')
df['sum'] = window.sum().astype(int)
df['count'] = window.count().astype(int)
df['result1'] = window.apply(result1).astype(int)
df['result2'] = window.apply(result2).astype(int)
df['days'] = window.apply(days).astype(int)
print(df)
Result:
RESULT 60_DAYS sum count result1 result2 days
DATE
2018-12-26 23:13:43+00:00 1 0 1 1 0 0 0
2019-02-18 23:27:58+00:00 0 1 1 2 1 1 54
2019-02-28 15:02:33+00:00 0 1 0 2 0 0 9
2019-03-05 18:30:26+00:00 1 2 1 3 0 0 14
2019-05-21 14:54:52+00:00 1 0 1 1 0 0 0
2019-08-26 14:30:38+00:00 1 0 1 1 0 0 0
2019-09-19 15:51:01+00:00 1 1 2 2 1 1 24
2019-12-16 17:58:24+00:00 0 0 0 1 0 0 0
2021-02-23 03:50:33+00:00 0 0 0 1 0 0 0
2021-08-08 22:26:01+00:00 1 0 1 1 0 0 0
2021-09-01 18:04:46+00:00 0 1 1 2 1 1 23
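Another way to leave out the current row is the closed='left' option of rolling(), which makes the window cover the 60 days before the current timestamp but not the timestamp itself. Below is a rough per-ID sketch on a handful of rows taken from the question; the short column name `N_PREV_60D` is made up for this example, and the positional assignment relies on the frame being sorted by ID and DATE first.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-12-26 23:13:43", "2019-02-18 23:27:58",
        "2019-02-28 15:02:33", "2019-03-05 18:30:26",
        "2018-01-24 22:02:36", "2018-05-15 18:27:17"], utc=True),
    "ID": [105, 105, 105, 105, 104, 104],
    "RESULT": [1, 0, 0, 1, 1, 0],
})

# Sort so each ID's rows are contiguous and in date order; the grouped
# rolling result below then comes back in exactly this row order.
df = df.sort_values(["ID", "DATE"]).reset_index(drop=True)

prev = (df.set_index("DATE")
          .groupby("ID")["RESULT"]
          .rolling("60D", closed="left")  # window excludes the current row
          .sum())

# An empty window (no earlier rows within 60 days) yields NaN, i.e. zero hits.
df["N_PREV_60D"] = prev.fillna(0).astype(int).to_numpy()
print(df)
```

On these rows the only non-zero value is the ID 105 row of 2019-02-18, which sees the RESULT == 1 from 2018-12-26.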

Group nearby dates

I want to group nearby dates together, using a rolling window (?) of three week periods.
See example and attempt below:
import pandas as pd
d = {'id':[1, 1, 1, 1, 2, 3],
'datefield':['2021-01-01', '2021-01-15', '2021-01-30', '2021-02-05', '2020-02-10', '2020-02-20']}
df = pd.DataFrame(data=d)
df['datefield'] = pd.to_datetime(df['datefield'])
#   id  datefield
#0   1 2021-01-01
#1   1 2021-01-15
#2   1 2021-01-30
#3   1 2021-02-05
#4   2 2020-02-10
#5   3 2020-02-20
df['event'] = df.groupby(['id', pd.Grouper(key='datefield', freq='3W')]).ngroup()
# id datefield event
#0 1 2021-01-01 0
#1 1 2021-01-15 0
#2 1 2021-01-30 1 #Should be 0, since last id 1 event happened just 2 weeks ago
#3 1 2021-02-05 1 #Should be 0
#4 2 2020-02-10 2
#5 3 2020-02-20 3 #Correct, within 3 weeks of another but since the ids are not the same the event is different

You can compute a few helper columns to make the logic easy to follow:
df
id datefield
0 1 2021-01-01
1 1 2021-01-15
2 1 2021-01-30
3 1 2021-02-05
4 2 2020-02-10
5 2 2020-03-20
Calculate difference between dates in number of days
df['diff'] = df['datefield'].diff().dt.days
Get previous ID
df['prevId'] = df['id'].shift()
Decide whether to increment or not (this needs import numpy as np)
df['increment'] = np.where((df['diff']>21) | (df['prevId'] != df['id']), 1, 0)
Lastly, just get the cumulative sum
df['event'] = df['increment'].cumsum()
Output
id datefield diff prevId increment event
0 1 2021-01-01 NaN NaN 1 1
1 1 2021-01-15 14.0 1.0 0 1
2 1 2021-01-30 15.0 1.0 0 1
3 1 2021-02-05 6.0 1.0 0 1
4 2 2020-02-10 -361.0 1.0 1 2
5 2 2020-03-20 39.0 2.0 1 3

Let's try a different approach using a boolean series instead:
df['group'] = ((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift()))).cumsum()
Output:
id datefield group
0 1 2021-01-01 1
1 1 2021-01-15 1
2 1 2021-01-30 1
3 1 2021-02-05 1
4 2 2020-02-10 2
5 2 2020-03-20 3
Is the difference from the previous row greater than 3 weeks:
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))))
0 False
1 False
2 False
3 False
4 False
5 True
Name: datefield, dtype: bool
Or is the current id not equal to the previous id:
print((df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 False
Name: id, dtype: bool
or (|) together the conditions
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 True
dtype: bool
Then use cumsum to increment everywhere there is a True value, which delimits the groups.
*Assumes the id and datefield columns are appropriately ordered.
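If the rows are not already ordered, one possible variation is to sort first and take the difference within each id, so that the gap never crosses an id boundary. This is only a sketch on the question's data; the names `gap` and `new_event` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 3],
    "datefield": pd.to_datetime([
        "2021-01-01", "2021-01-15", "2021-01-30",
        "2021-02-05", "2020-02-10", "2020-02-20"]),
})

# Order rows by id and date so consecutive rows can be compared.
df = df.sort_values(["id", "datefield"]).reset_index(drop=True)

# Gap to the previous row of the same id; NaT on the first row of each id.
gap = df.groupby("id")["datefield"].diff()

# A new event starts on the first row of an id or after more than 3 weeks.
new_event = gap.isna() | (gap > pd.Timedelta(weeks=3))
df["event"] = new_event.cumsum() - 1  # zero-based, as in the question
print(df)
```

All four rows of id 1 end up in event 0, while ids 2 and 3 each get their own event.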

It looks like you want the diff between consecutive rows to be three weeks or less, otherwise a new group is formed. You can do it like this, starting from initial time t0:
df = df.sort_values("datefield").reset_index(drop=True)
t0 = df.datefield.iloc[0]
df["delta_t"] = pd.TimedeltaIndex(df.datefield - t0)
df["group"] = (df.delta_t.dt.days.diff() > 21).cumsum()
output:
id datefield delta_t group
0 2 2020-02-10 0 days 0
1 2 2020-03-20 39 days 1
2 1 2021-01-01 326 days 2
3 1 2021-01-15 340 days 2
4 1 2021-01-30 355 days 2
5 1 2021-02-05 361 days 2
Note that your original dataframe is not sorted properly.

Pandas count event per day from join date

I have this data frame:
name event join_date created_at
A X 2020-12-01 2020-12-01
A X 2020-12-01 2020-12-01
A X 2020-12-01 2020-12-02
A Y 2020-12-01 2020-12-02
B X 2020-12-05 2020-12-05
B X 2020-12-05 2020-12-07
C X 2020-12-07 2020-12-08
C X 2020-12-07 2020-12-09
...
I want to transform it into this data frame:
name event join_date day_0 day_1 day_2 .... day_n
A X 2020-12-01 2 1 0 0
A Y 2020-12-01 0 1 0 0
B X 2020-12-05 1 0 1 0
C X 2020-12-07 0 1 1 0
...
The first row means that user A did event X twice on day_0 (the day they joined), once on day_1, and so on until day_n.
For now, the result is like this:
name event join_date day_0 day_1 day_2 .... day_n
A X 2020-12-01 2 1 0 0
A Y 2020-12-01 0 1 0 0
B X 2020-12-05 1 0 1 0
C X 2020-12-07 1 1 0 0
...
The code sets 2020-12-02 as day_0 instead of day_1, because user A has no Y event on 2020-12-01.

First subtract join_date from created_at to get the time elapsed since joining.
Then use DataFrame.pivot_table, add all possible timedeltas with DataFrame.reindex and timedelta_range, and finally rename the columns using range:
df['d'] = df['created_at'].sub(df['join_date'])
print (df)
name event join_date created_at d
0 A X 2020-12-01 2020-12-01 0 days
1 A X 2020-12-01 2020-12-01 0 days
2 A X 2020-12-01 2020-12-02 1 days
3 A Y 2020-12-01 2020-12-02 1 days
4 B X 2020-12-05 2020-12-05 0 days
5 B X 2020-12-05 2020-12-07 2 days
6 C X 2020-12-07 2020-12-08 1 days
7 C X 2020-12-07 2020-12-09 2 days
df1 = (df.pivot_table(index=['name','event','join_date'],
                      columns='d',
                      aggfunc='size',
                      fill_value=0)
         .reindex(pd.timedelta_range(df['d'].min(), df['d'].max()),
                  axis=1,
                  fill_value=0))
df1.columns = [f'day_{i}' for i in range(len(df1.columns))]
df1 = df1.reset_index()
print (df1)
name event join_date day_0 day_1 day_2
0 A X 2020-12-01 2 1 0
1 A Y 2020-12-01 0 1 0
2 B X 2020-12-05 1 0 1
3 C X 2020-12-07 0 1 1
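The same result can also be sketched with integer day offsets and unstack instead of timedelta columns. This variant is an illustration rather than the answer's method; it rebuilds the sample frame from the question, and the names `day` and `counts` are made up here.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "event": ["X", "X", "X", "Y", "X", "X", "X", "X"],
    "join_date": pd.to_datetime([
        "2020-12-01", "2020-12-01", "2020-12-01", "2020-12-01",
        "2020-12-05", "2020-12-05", "2020-12-07", "2020-12-07"]),
    "created_at": pd.to_datetime([
        "2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02",
        "2020-12-05", "2020-12-07", "2020-12-08", "2020-12-09"]),
})

# Whole days elapsed between joining and the event.
df["day"] = (df["created_at"] - df["join_date"]).dt.days

counts = (df.groupby(["name", "event", "join_date", "day"]).size()
            .unstack("day", fill_value=0)
            # make sure every day from 0 to the maximum has a column
            .reindex(columns=range(df["day"].max() + 1), fill_value=0)
            .add_prefix("day_")
            .reset_index())
print(counts)
```

Because the offset is measured from join_date rather than from the first event, user A's Y event on 2020-12-02 lands in day_1.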
