Interpolate time series and resample/pivot. How to get the expected output - python

I have a df that looks like this:
Video | Start | End | Duration |
vid1 |2018-10-02 16:00:29 |2018-10-02 20:07:05 | 246 |
vid2 |2018-10-04 16:03:08 |2018-10-04 16:10:11 | 7 |
vid3 |2018-10-04 10:13:40 |2018-10-06 12:07:38 | 113 |
What I want to do is resample the dataframe into 10-minute bins based on the Start column and assign 1 if the video was running during that timestamp and 0 if not.
The desired output is:
Start | vid1 | vid2 | vid3 |
2018-10-02 16:00:00| 1 | 0 | 0 |
2018-10-02 16:10:00| 1 | 0 | 0 |
...
2018-10-04 16:10:00| 0 | 1 | 0 |
2018-10-04 16:20:00| 0 | 0 | 1 |
The output above is only meant to illustrate the desired shape, so it may contain errors.
The problem is that I cannot resample the dataframe in a way that produces the desired crosstab output.
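For reference, the sample data above can be constructed like this (a sketch, with Start/End parsed as datetimes and column names taken from the table):
import pandas as pd

df = pd.DataFrame({
    "Video": ["vid1", "vid2", "vid3"],
    "Start": pd.to_datetime(["2018-10-02 16:00:29", "2018-10-04 16:03:08", "2018-10-04 10:13:40"]),
    "End": pd.to_datetime(["2018-10-02 20:07:05", "2018-10-04 16:10:11", "2018-10-06 12:07:38"]),
    "Duration": [246, 7, 113],
})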

Try this:
(df.apply(lambda x: pd.Series(x['Video'],
                              index=pd.date_range(x['Start'].floor('10T'),
                                                  x['End'].ceil('10T'),
                                                  freq='10T')),
          axis=1)                       # one row per video, its 10-minute slots as index
   .stack()                             # long format: (row, slot) -> video name
   .str.get_dummies()                   # one 0/1 column per video
   .reset_index(level=0, drop=True))    # keep only the timestamp index
Output:
vid1 vid2 vid3
2018-10-02 16:00:00 1 0 0
2018-10-02 16:10:00 1 0 0
2018-10-02 16:20:00 1 0 0
2018-10-02 16:30:00 1 0 0
2018-10-02 16:40:00 1 0 0
... ... ... ...
2018-10-06 11:30:00 0 0 1
2018-10-06 11:40:00 0 0 1
2018-10-06 11:50:00 0 0 1
2018-10-06 12:00:00 0 0 1
2018-10-06 12:10:00 0 0 1
[330 rows x 3 columns]
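The same idea can be spelled out step by step, as a rough sketch: build the 10-minute slots per row, explode them, then cross-tabulate. Overlapping videos would share one row per slot here, and '10min' is the newer spelling of '10T':
slots = [
    pd.date_range(s.floor("10min"), e.ceil("10min"), freq="10min")
    for s, e in zip(df["Start"], df["End"])
]
exploded = df.assign(slot=slots).explode("slot")       # one row per (video, 10-minute slot)
out = pd.crosstab(exploded["slot"], exploded["Video"]) # 0/1 indicator per video per slot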

Related

How to reindex a datetime-based multiindex in pandas

I have a dataframe that counts the number of times an event has occurred per user per day. Users may have 0 events per day and (since the table is an aggregate from a raw event log) rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week so that each user has one entry per week (including 0 if applicable).
Here is an example of my input:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})
# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
However, according to the docs of reindex, I should be able to use it with level=1 as a kwarg directly, without having to do another groupby. When I do this, I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every level of my desired index occurs in my current index (and vice versa).
You need to reindex by both levels of the MultiIndex:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print (result)
person_id date event_count
0 0 2022-01-02 9
1 0 2022-01-09 13
2 0 2022-01-16 0
3 0 2022-01-23 0
4 1 2022-01-02 0
5 1 2022-01-09 11
6 1 2022-01-16 5
7 1 2022-01-23 0
8 2 2022-01-02 0
9 2 2022-01-09 0
10 2 2022-01-16 20
11 2 2022-01-23 0
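As a side note, the unstack route mentioned in the question can also be made to work here if you reindex the date level after unstacking; a rough sketch on the same grouped data:
result = (
    df.groupby(["person_id", pd.Grouper(key="date", freq="w")])["event_count"].sum()
      .unstack("person_id", fill_value=0)        # weeks as rows, one column per person
      .reindex(desired_index, fill_value=0)      # add the weeks missing from the data
      .rename_axis("date")
      .stack()                                   # back to long format
      .swaplevel()
      .sort_index()
      .reset_index(name="event_count")
)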

How do I find users retention within n_days in pandas?

I have a df that looks like this:
date | user_id | purchase
2020-01-01 | 1 | 10
2020-10-01 | 1 | 12
2020-15-01 | 1 | 5
2020-11-01 | 2 | 500 ...
Now, I want to add an n_day retention flag for each user_id in my df. The expected output should look like:
date | user_id | purchase | 3D_retention (did user purchase within next 3 days)
2020-01-01 | 1 | 10 | 0 (because there was no purchase on/before 2020-04-01 after 2020-01-01)
2020-10-01 | 1 | 12 | 1 (because there was a purchase on 2020-11-01, which was within 3 days of 2020-10-01)
2020-11-01 | 1 | 5 | 0
What is the best way of doing this in pandas?
I modified the dates to be in yyyy-mm-dd format:
date user_id purchase
0 2020-01-01 1 10
1 2020-01-10 1 12
2 2020-01-15 1 5
3 2020-01-11 2 500
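For a runnable version, that reformatted sample can be built like this (a sketch based on the table above):
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-10", "2020-01-15", "2020-01-11"],
    "user_id": [1, 1, 1, 2],
    "purchase": [10, 12, 5, 500],
})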
df['date'] = pd.to_datetime(df['date'])
next_purchase_days = 6
# flag 1 if the user's next purchase happened within the window, else 0
df['retention'] = df.groupby('user_id')['date'].transform(
    lambda x: ((x.shift(-1) - x).dt.days < next_purchase_days).astype(int)
)
df
date user_id purchase retention
0 2020-01-01 1 10 0
1 2020-01-10 1 12 1
2 2020-01-15 1 5 0
3 2020-01-11 2 500 0
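For the 3-day flag from the question, the same pattern works with the window as a parameter; a sketch (the n_days name and the explicit sort are assumptions, and note this only looks at the immediately following purchase):
n_days = 3
df["retention_3d"] = (
    df.sort_values("date")
      .groupby("user_id")["date"]
      .transform(lambda x: ((x.shift(-1) - x).dt.days <= n_days).astype(int))
)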

Add a new record for each missing second in a DataFrame with TimeStamp [duplicate]

This question already has answers here:
Add missing dates to pandas dataframe
(7 answers)
Closed 9 months ago.
Given the following Pandas DataFrame:
| date | counter |
|-------------------------------------|------------------|
| 2022-01-01 10:00:01 | 1 |
| 2022-01-01 10:00:04 | 1 |
| 2022-01-01 10:00:06 | 1 |
I want to create a function that, given the previous DataFrame, returns a similar DataFrame with a new row for each missing second in that time interval, with counter 0.
| date | counter |
|-------------------------------------|------------------|
| 2022-01-01 10:00:01 | 1 |
| 2022-01-01 10:00:02 | 0 |
| 2022-01-01 10:00:03 | 0 |
| 2022-01-01 10:00:04 | 1 |
| 2022-01-01 10:00:05 | 0 |
| 2022-01-01 10:00:06 | 1 |
If the initial DataFrame spans more than one day, the function should do the same, filling in every missing second across all the days included.
Thank you for your help.
Use DataFrame.asfreq working with DatetimeIndex:
df = df.set_index('date').asfreq('1S', fill_value=0).reset_index()
print (df)
date counter
0 2022-01-01 10:00:01 1
1 2022-01-01 10:00:02 0
2 2022-01-01 10:00:03 0
3 2022-01-01 10:00:04 1
4 2022-01-01 10:00:05 0
5 2022-01-01 10:00:06 1
You can also use df.resample:
In [314]: df = df.set_index('date').resample('1S').sum().fillna(0).reset_index()
In [315]: df
Out[315]:
date counter
0 2022-01-01 10:00:01 1
1 2022-01-01 10:00:02 0
2 2022-01-01 10:00:03 0
3 2022-01-01 10:00:04 1
4 2022-01-01 10:00:05 0
5 2022-01-01 10:00:06 1
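Wrapped as the function the question asks for, roughly (a sketch; the function name is an assumption, and the 'date' column is assumed to already be datetime):
import pandas as pd

def fill_missing_seconds(df: pd.DataFrame) -> pd.DataFrame:
    # insert a row for every missing second, with counter set to 0
    return df.set_index("date").asfreq("1S", fill_value=0).reset_index()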

Calculate streak in pandas without apply

I have a DataFrame like this:
date | type | column1
----------------------------
2019-01-01 | A | 1
2019-02-01 | A | 1
2019-03-01 | A | 1
2019-04-01 | A | 0
2019-05-01 | A | 1
2019-06-01 | A | 1
2019-07-01 | B | 1
2019-08-01 | B | 1
2019-09-01 | B | 0
I want to have a column called "streak" that has a streak, but grouped by column "type":
date | type | column1 | streak
-------------------------------------
2019-01-01 | A | 1 | 1
2019-02-01 | A | 1 | 2
2019-03-01 | A | 1 | 3
2019-04-01 | A | 0 | 0
2019-05-01 | A | 1 | 1
2019-06-01 | A | 1 | 2
2019-07-01 | B | 1 | 1
2019-08-01 | B | 1 | 2
2019-09-01 | B | 0 | 0
I managed to do it like that:
def streak(df):
    grouper = (df.column1 != df.column1.shift(1)).cumsum()
    df['streak'] = df.groupby(grouper).cumsum()['column1']
    return df

df = df.groupby(['type']).apply(streak)
But I'm wondering if it's possible to do it inline without using a groupby and apply, because my DataFrame contains about 100M rows and it takes several hours to process.
Any ideas on how to optimize this for speed?
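For reference, the sample frame can be built like this (a sketch, taking the dates as month starts), so the answers below are runnable:
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=9, freq="MS"),
    "type": ["A"] * 6 + ["B"] * 3,
    "column1": [1, 1, 1, 0, 1, 1, 1, 1, 0],
})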
You want the cumsum of 'column1' grouping by 'type' + the cumsum of a Boolean Series which resets the grouping at every 0.
df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()
date type column1 streak
0 2019-01-01 A 1 1
1 2019-02-01 A 1 2
2 2019-03-01 A 1 3
3 2019-04-01 A 0 0
4 2019-05-01 A 1 1
5 2019-06-01 A 1 2
6 2019-07-01 B 1 1
7 2019-08-01 B 1 2
8 2019-09-01 B 0 0
IIUC, this is what you need.
m = df.column1.ne(df.column1.shift()).cumsum()
df['streak'] = df.groupby([m, 'type'])['column1'].cumsum()
Output
date type column1 streak
0 1/1/2019 A 1 1
1 2/1/2019 A 1 2
2 3/1/2019 A 1 3
3 4/1/2019 A 0 0
4 5/1/2019 A 1 1
5 6/1/2019 A 1 2
6 7/1/2019 B 1 1
7 8/1/2019 B 1 2
8 9/1/2019 B 0 0

Create new columns based on other's columns value

I'm trying to do some feature engineering for a pandas data frame.
Say I have this:
Data frame 1:
X | date | is_holiday
a | 1/4/2018 | 0
a | 1/5/2018 | 0
a | 1/6/2018 | 1
a | 1/7/2018 | 0
a | 1/8/2018 | 0
...
b | 1/1/2018 | 1
I'd like to have additional indicators to flag whether a date is 1 or 2 days before a holiday, and also 1 or 2 days after.
Data frame 1:
X | date | is_holiday | one_day_before_hol | ... | one_day_after_hol
a | 1/4/2018 | 0 | 0 | ... | 0
a | 1/5/2018 | 0 | 1 | ... | 0
a | 1/6/2018 | 1 | 0 | ... | 0
a | 1/7/2018 | 0 | 0 | ... | 1
a | 1/8/2018 | 0 | 0 | ... | 0
...
b | 1/1/2018 | 1 | 0 | ... | 0
Is there any efficient way to do it? I believe I can do it using for loops, but since I'm new to Python, I'd like to see if there is an elegant way to do it. Dates might not be adjacent or continuous (i.e. for some values of X, a specific date might not be present).
Thank you so much!
Use pandas.DataFrame.groupby.shift:
import pandas as pd
g = df.groupby('X')['is_holiday']
df['one_day_before'] = g.shift(-1).fillna(0)   # is the next day a holiday?
df['two_day_before'] = g.shift(-2).fillna(0)   # is the day after next a holiday?
df['one_day_after'] = g.shift(1).fillna(0)     # was the previous day a holiday?
Output:
X date is_holiday one_day_before two_day_before one_day_after
0 a 1/4/2018 0 0.0 1.0 0.0
1 a 1/5/2018 0 1.0 0.0 0.0
2 a 1/6/2018 1 0.0 0.0 0.0
3 a 1/7/2018 0 0.0 0.0 1.0
4 a 1/8/2018 0 0.0 0.0 0.0
5 b 1/1/2018 1 0.0 0.0 0.0
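If the dates are not guaranteed to be one row per consecutive day (as the question warns), an alternative sketch is to look up actual holiday dates instead of shifting rows; the new column names here are assumptions, and holidays are assumed to depend only on the date, not on X:
import pandas as pd

df["date"] = pd.to_datetime(df["date"])
holidays = set(df.loc[df["is_holiday"] == 1, "date"])   # the known holiday dates

# flag rows whose date falls 1 or 2 days before, or 1 day after, a holiday
df["one_day_before_hol"] = (df["date"] + pd.Timedelta(days=1)).isin(holidays).astype(int)
df["two_days_before_hol"] = (df["date"] + pd.Timedelta(days=2)).isin(holidays).astype(int)
df["one_day_after_hol"] = (df["date"] - pd.Timedelta(days=1)).isin(holidays).astype(int)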
You could shift:
import pandas as pd
df = pd.DataFrame([1,0,0,1,1,0], columns=['day'])
df.head()
   day
0    1
1    0
2    0
3    1
4    1
df['One Day Before'] = df['day'].shift(-1)
df['One Day After'] = df['day'].shift(1)
df['Two Days before'] = df['day'].shift(-2)
df
   day  One Day Before  One Day After  Two Days before
0    1             0.0            NaN              0.0
1    0             0.0            1.0              1.0
2    0             1.0            0.0              1.0
3    1             1.0            0.0              0.0
4    1             0.0            1.0              NaN
5    0             NaN            1.0              NaN
This would move the is_holiday value up or down into a new column. You will have to deal with the NaNs though.
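One way to deal with those NaNs, as a sketch: treat missing neighbours as non-holidays and cast back to integers.
cols = ["One Day Before", "One Day After", "Two Days before"]
df[cols] = df[cols].fillna(0).astype(int)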
