How do I find user retention within n_days in pandas?

I have a df that looks like this:
date | user_id | purchase
2020-01-01 | 1 | 10
2020-10-01 | 1 | 12
2020-15-01 | 1 | 5
2020-11-01 | 2 | 500 ...
Now, I want to add an n_day retention flag for each user_id in my df. The expected output should look like:
date | user_id | purchase | 3D_retention (did user purchase within next 3 days)
2020-01-01 | 1 | 10 | 0 (because there was no purchase after 2020-01-01 and on/before 2020-04-01)
2020-10-01 | 1 | 12 | 1 (because there was a purchase on 2020-11-01, which was within 3 days of 2020-10-01)
2020-11-01 | 1 | 5 | 0
What is the best way of doing this in pandas?

I modified the dates to be in yyyy-mm-dd format:
date user_id purchase
0 2020-01-01 1 10
1 2020-01-10 1 12
2 2020-01-15 1 5
3 2020-01-11 2 500
df['date'] = pd.to_datetime(df['date'])
# a purchase counts as retained if the user's next purchase is fewer than this many days later
next_purchase_days = 6
# per user: days until the next purchase, compared against the threshold
df['retention'] = df.groupby('user_id')['date'].transform(
    lambda x: ((x.shift(-1) - x).dt.days < next_purchase_days).astype(int)
)
df
date user_id purchase retention
0 2020-01-01 1 10 0
1 2020-01-10 1 12 1
2 2020-01-15 1 5 0
3 2020-01-11 2 500 0
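Since the rows must be in chronological order within each user for shift(-1) to see the nearest future purchase, a reusable sketch that sorts first and takes the window as a parameter (the function name add_retention_flag is mine, and it uses an inclusive <= comparison, unlike the strict < above):
import pandas as pd

def add_retention_flag(df, n_days, date_col='date', user_col='user_id'):
    # sort so shift(-1) really is each user's nearest future purchase
    df = df.sort_values([user_col, date_col]).copy()
    gap = df.groupby(user_col)[date_col].transform(lambda x: (x.shift(-1) - x).dt.days)
    # NaN gaps (a user's last purchase) compare as False and become 0
    df[f'{n_days}D_retention'] = (gap <= n_days).astype(int)
    return df

df = add_retention_flag(df, n_days=3)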


Add a new record for each missing second in a DataFrame with Timestamp

Given the following Pandas DataFrame:
| date                | counter |
|---------------------|---------|
| 2022-01-01 10:00:01 | 1       |
| 2022-01-01 10:00:04 | 1       |
| 2022-01-01 10:00:06 | 1       |
I want to create a function that, given the previous DataFrame, returns a similar DataFrame with a new row, with counter 0, for each missing second in that time interval.
| date                | counter |
|---------------------|---------|
| 2022-01-01 10:00:01 | 1       |
| 2022-01-01 10:00:02 | 0       |
| 2022-01-01 10:00:03 | 0       |
| 2022-01-01 10:00:04 | 1       |
| 2022-01-01 10:00:05 | 0       |
| 2022-01-01 10:00:06 | 1       |
If the initial DataFrame spans more than one day, the function should do the same, filling in every missing second across all the days included.
Thank you for your help.
Use DataFrame.asfreq on a DatetimeIndex:
# reindex at a 1-second frequency; new rows get counter filled with 0
df = df.set_index('date').asfreq('1S', fill_value=0).reset_index()
print(df)
date counter
0 2022-01-01 10:00:01 1
1 2022-01-01 10:00:02 0
2 2022-01-01 10:00:03 0
3 2022-01-01 10:00:04 1
4 2022-01-01 10:00:05 0
5 2022-01-01 10:00:06 1
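Since the question asks for a function, a minimal wrapper around this approach could look like the sketch below (the name fill_missing_seconds is mine; it assumes the date column is already datetime-typed and has no duplicate timestamps, since asfreq cannot reindex a non-unique index):
def fill_missing_seconds(df, date_col='date', fill_value=0):
    # reindex to a 1-second grid between the first and last timestamp
    return df.set_index(date_col).asfreq('1S', fill_value=fill_value).reset_index()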
You can also use df.resample:
In [314]: df = df.set_index('date').resample('1S').sum().fillna(0).reset_index()
In [315]: df
Out[315]:
date counter
0 2022-01-01 10:00:01 1
1 2022-01-01 10:00:02 0
2 2022-01-01 10:00:03 0
3 2022-01-01 10:00:04 1
4 2022-01-01 10:00:05 0
5 2022-01-01 10:00:06 1
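One practical difference between the two approaches (my observation): asfreq needs a unique index, while resample aggregates rows that share a timestamp, so the resample variant also collapses duplicates:
dup = pd.DataFrame({'date': pd.to_datetime(['2022-01-01 10:00:01', '2022-01-01 10:00:01']),
                    'counter': [1, 1]})
# resample sums the duplicated second into a single row with counter == 2
print(dup.set_index('date').resample('1S').sum())
# dup.set_index('date').asfreq('1S')  # would raise, since the index labels are duplicated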

Filling Missing Date Column using groupby method

I have a dataframe that looks something like:
+---+----+---------------+------------+------------+
| | id | date1 | date2 | days_ahead |
+---+----+---------------+------------+------------+
| 0 | 1 | 2021-10-21 | 2021-10-24 | 3 |
| 1 | 1 | 2021-10-22 | NaN | NaN |
| 2 | 1 | 2021-11-16 | 2021-11-24 | 8 |
| 3 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 4 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 5 | 3 | 2021-10-26 | 2021-10-31 | 5 |
| 6 | 3 | 2021-10-30 | 2021-11-04 | 5 |
| 7 | 3 | 2021-11-02 | NaN | NaN |
| 8 | 3 | 2021-11-04 | 2021-11-04 | 0 |
| 9 | 4 | 2021-10-28 | NaN | NaN |
+---+----+---------------+------------+------------+
I am trying to fill the missing date2 values with the days_ahead median of each id group.
For example:
The median for id 1 is 5.5, which rounds to 6,
so the filled value of date2 at index 1 should be 2021-10-28 (2021-10-22 + 6 days).
Similarly, for id 3 the median is 5,
so the filled value of date2 at index 7 should be 2021-11-07.
And for id 4 the median is NaN,
so the filled value of date2 at index 9 should be 2021-10-28 (date1 itself).
I tried
df['date2'].fillna(df.groupby('id')['days_ahead'].transform('median'), inplace=True)
but this fills date2 with plain numbers instead of dates.
I know I could use apply with a lambda to detect the numbers and convert them to dates, but how do I use groupby and fillna together directly?
You can round the median values, convert them to timedeltas with to_timedelta, add them to date1 using the fill_value parameter (so a missing timedelta falls back to 0 days), and use the result to replace the missing values:
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
# per-id median of days_ahead, rounded and converted to timedeltas (NaT for id 4)
td = pd.to_timedelta(df.groupby('id')['days_ahead'].transform('median').round(), unit='d')
# fill_value=Timedelta(0) makes date1 + NaT fall back to date1 itself
df['date2'] = df['date2'].fillna(df['date1'].add(td, fill_value=pd.Timedelta(0)))
print(df)
id date1 date2 days_ahead
0 1 2021-10-21 2021-10-24 3.0
1 1 2021-10-22 2021-10-28 NaN
2 1 2021-11-16 2021-11-24 8.0
3 2 2021-10-22 2021-10-24 2.0
4 2 2021-10-22 2021-10-24 2.0
5 3 2021-10-26 2021-10-31 5.0
6 3 2021-10-30 2021-11-04 5.0
7 3 2021-11-02 2021-11-07 NaN
8 3 2021-11-04 2021-11-04 0.0
9 4 2021-10-28 2021-10-28 NaN
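A self-contained reproduction of the key intermediate step, showing the rounded per-group medians as timedeltas (data copied from the example above, with id 2 omitted for brevity):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 3, 3, 3, 3, 4],
                   'date1': ['2021-10-21', '2021-10-22', '2021-11-16', '2021-10-26',
                             '2021-10-30', '2021-11-02', '2021-11-04', '2021-10-28'],
                   'date2': ['2021-10-24', None, '2021-11-24', '2021-10-31',
                             '2021-11-04', None, '2021-11-04', None],
                   'days_ahead': [3, None, 8, 5, 5, None, 0, None]})
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

td = pd.to_timedelta(df.groupby('id')['days_ahead'].transform('median').round(), unit='d')
print(td)  # 6 days for id 1 (5.5 rounds to 6), 5 days for id 3, NaT for id 4

df['date2'] = df['date2'].fillna(df['date1'].add(td, fill_value=pd.Timedelta(0)))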

Assign Week Number beginning at 1 based on Dates starting in November

I have a date column with dates starting in the month of November. Below is a sample:
| dt |
|------------|
| 11/13/2017 |
| 11/13/2017 |
| 11/13/2017 |
| 11/13/2017 |
| 11/20/2017 |
| 11/20/2017 |
| 11/27/2017 |
| 11/27/2017 |
| 11/27/2017 |
| 12/4/2017 |
| 12/11/2017 |
| 12/18/2017 |
| 12/18/2017 |
| 12/25/2017 |
| 1/1/2018 |
| 1/8/2018 |
I want to get week numbers from the dates, where the week number is 1 for 11/13/2017, 2 for 11/20/2017, and keeps increasing through 1/8/2018. How can I achieve this in Python?
You can do:
# whole 7-day periods elapsed since the earliest date, starting at week 1
df['week'] = (df['dt'] - df['dt'].min()) // pd.to_timedelta('7D') + 1
Output:
dt week
0 2017-11-13 1
1 2017-11-13 1
2 2017-11-13 1
3 2017-11-13 1
4 2017-11-20 2
5 2017-11-20 2
6 2017-11-27 3
7 2017-11-27 3
8 2017-11-27 3
9 2017-12-04 4
10 2017-12-11 5
11 2017-12-18 6
12 2017-12-18 6
13 2017-12-25 7
14 2018-01-01 8
15 2018-01-08 9
Let us do the following. PS: just do not name your column dt, since .dt is also pandas' datetime accessor on Series.
((df['dt'] - df['dt'].min()) // 7).dt.days + 1
Out[300]:
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 3
9 4
10 5
11 6
12 6
13 7
14 8
15 9
Name: dt, dtype: int64
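If week 1 should be pinned to a fixed start date rather than whatever the column minimum happens to be, a parameterized sketch (the function name week_number and its start argument are mine):
import pandas as pd

def week_number(dates, start=None):
    # 1-based week index relative to start (defaults to the earliest date)
    dates = pd.to_datetime(dates)
    start = dates.min() if start is None else pd.Timestamp(start)
    return (dates - start) // pd.to_timedelta('7D') + 1

df['week'] = week_number(df['dt'], start='2017-11-13')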

Interpolate time series and resample/pivot. How to get the expected output

I have a df that looks like this:
Video | Start               | End                 | Duration
vid1  | 2018-10-02 16:00:29 | 2018-10-02 20:07:05 | 246
vid2  | 2018-10-04 16:03:08 | 2018-10-04 16:10:11 | 7
vid3  | 2018-10-04 10:13:40 | 2018-10-06 12:07:38 | 113
What I want to do is resample the dataframe into 10-minute bins based on the Start column and assign 1 if the video was running during that bin and 0 if not.
The desired output is:
Start | vid1 | vid2 | vid3 |
2018-10-02 16:00:00| 1 | 0 | 0 |
2018-10-02 16:10:00| 1 | 0 | 0 |
...
2018-10-04 16:10:00| 0 | 1 | 0 |
2018-10-04 16:20:00| 0 | 0 | 1 |
The output above is only an illustration, so it may contain errors.
The problem is that I cannot resample the dataframe in a way that produces the desired crosstab output.
Try this:
# build one Series per video, valued with its name over the rounded 10-minute
# window, then stack and one-hot encode the names
df.apply(lambda x: pd.Series(x['Video'],
                             index=pd.date_range(x['Start'].floor('10T'),
                                                 x['End'].ceil('10T'),
                                                 freq='10T')), axis=1)\
  .stack().str.get_dummies().reset_index(level=0, drop=True)
Output:
vid1 vid2 vid3
2018-10-02 16:00:00 1 0 0
2018-10-02 16:10:00 1 0 0
2018-10-02 16:20:00 1 0 0
2018-10-02 16:30:00 1 0 0
2018-10-02 16:40:00 1 0 0
... ... ... ...
2018-10-06 11:30:00 0 0 1
2018-10-06 11:40:00 0 0 1
2018-10-06 11:50:00 0 0 1
2018-10-06 12:00:00 0 0 1
2018-10-06 12:10:00 0 0 1
[330 rows x 3 columns]
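Since the question frames the goal as a crosstab, the same idea can be decomposed into named steps with explode and pd.crosstab (my sketch, assuming Start and End are already datetime columns):
# one rounded 10-minute range per video, exploded to one row per (timestamp, video)
spans = df.apply(lambda r: list(pd.date_range(r['Start'].floor('10T'),
                                              r['End'].ceil('10T'),
                                              freq='10T')), axis=1)
long = pd.DataFrame({'Start': spans, 'Video': df['Video']}).explode('Start')
out = pd.crosstab(long['Start'], long['Video'])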

Calculate streak in pandas without apply

I have a DataFrame like this:
date | type | column1
----------------------------
2019-01-01 | A | 1
2019-02-01 | A | 1
2019-03-01 | A | 1
2019-04-01 | A | 0
2019-05-01 | A | 1
2019-06-01 | A | 1
2019-07-01 | B | 1
2019-08-01 | B | 1
2019-09-01 | B | 0
I want to have a column called "streak" that has a streak, but grouped by column "type":
date | type | column1 | streak
-------------------------------------
2019-01-01 | A | 1 | 1
2019-02-01 | A | 1 | 2
2019-03-01 | A | 1 | 3
2019-04-01 | A | 0 | 0
2019-05-01 | A | 1 | 1
2019-06-01 | A | 1 | 2
2019-07-01 | B | 1 | 1
2019-08-01 | B | 1 | 2
2019-09-01 | B | 0 | 0
I managed to do it like this:
def streak(df):
    grouper = (df.column1 != df.column1.shift(1)).cumsum()
    df['streak'] = df.groupby(grouper).cumsum()['column1']
    return df

df = df.groupby(['type']).apply(streak)
But I'm wondering if it's possible to do it in a single vectorized expression, without groupby and apply, because my DataFrame contains about 100M rows and processing takes several hours.
Any ideas on how to optimize this for speed?
You want the cumulative sum of column1, grouped by both type and the cumulative sum of a Boolean Series that starts a new group at every 0.
df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()
date type column1 streak
0 2019-01-01 A 1 1
1 2019-02-01 A 1 2
2 2019-03-01 A 1 3
3 2019-04-01 A 0 0
4 2019-05-01 A 1 1
5 2019-06-01 A 1 2
6 2019-07-01 B 1 1
7 2019-08-01 B 1 2
8 2019-09-01 B 0 0
IIUC, this is what you need.
m = df.column1.ne(df.column1.shift()).cumsum()
df['streak'] = df.groupby([m, 'type'])['column1'].cumsum()
Output
date type column1 streak
0 1/1/2019 A 1 1
1 2/1/2019 A 1 2
2 3/1/2019 A 1 3
3 4/1/2019 A 0 0
4 5/1/2019 A 1 1
5 6/1/2019 A 1 2
6 7/1/2019 B 1 1
7 8/1/2019 B 1 2
8 9/1/2019 B 0 0
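A minimal self-contained check of the first answer's one-liner against the sample data (my reproduction):
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2019-01-01', periods=9, freq='MS'),
                   'type': list('AAAAAABBB'),
                   'column1': [1, 1, 1, 0, 1, 1, 1, 1, 0]})

# a new block starts at every 0; cumsum within each (type, block) pair
df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()
print(df)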
