Pandas Running Total by Group, Time Interval - python

I would like to take a table of customer orders like this:
customer_id | order_date | amount
0 | 2020-03-01 | 10.00
0 | 2020-03-02 | 2.00
1 | 2020-03-02 | 5.00
1 | 2020-03-02 | 1.00
2 | 2020-03-08 | 2.00
1 | 2020-03-09 | 1.00
0 | 2020-03-10 | 1.00
And create a table calculating a running total by week. Something like:
order_week | 0 | 1 | 2
2020-03-01 | 12.00 | 6.00 | 0.00
2020-03-08 | 13.00 | 7.00 | 2.00
Thanks so much for your help!!

IIUC:
df['order_date'] = pd.to_datetime(df['order_date'])
(df.groupby(['customer_id', df.order_date.dt.floor('7D')])   # bin order dates into 7-day blocks
 .amount.sum()                                                # total per customer per bin
 .unstack('customer_id', fill_value=0)                        # one column per customer
 .cumsum()                                                    # running total down the weeks
)
Output:
customer_id 0 1 2
order_date
2020-02-27 12.0 6.0 0.0
2020-03-05 13.0 7.0 2.0
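Note that floor('7D') bins dates into fixed 7-day blocks counted from the Unix epoch (1970-01-01, a Thursday), not from the first date in your data, which is why the labels above fall on Thursdays rather than on the Sundays in your desired output:
df.order_date.dt.floor('7D').unique()   # 2020-02-27 and 2020-03-05 for the sample data
If you want the bins anchored to 2020-03-01 instead, newer pandas (1.1+) accepts an origin argument on Grouper/resample; a sketch under that assumption:
(df.groupby(['customer_id',
             pd.Grouper(key='order_date', freq='7D', origin=pd.Timestamp('2020-03-01'))])
 ['amount'].sum()
 .unstack('customer_id', fill_value=0)
 .cumsum()
)
This should label the rows 2020-03-01 and 2020-03-08, as in the desired output.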

@Quang Hoang's answer is beautiful and concise. But did you want strict 7-day bins or calendar weeks?
I had a go at partitioning by calendar week because I wanted the dates stated in your expected outcome to appear. Obviously @Quang Hoang's experience is unmatched. Feel free to criticize, because I am learning.
Coerce date to datetime and set it as the index:
df['order_date'] = pd.to_datetime(df['order_date'])
df.set_index(df['order_date'], inplace=True)
df.drop(columns=['order_date'], inplace=True)
Group by customer id and resample the amount weekly:
(df.groupby('customer_id')['amount']
 .apply(lambda x: x.resample('W').sum())
 .unstack('customer_id', fill_value=0)
 .cumsum())
Outcome

Related

Create a counter of date values for a given max-min interval

Be the following python pandas DataFrame:
| date | column_1 | column_2 |
| ---------- | -------- | -------- |
| 2022-02-01 | val | val2 |
| 2022-02-03 | val1 | val |
| 2022-02-01 | val | val3 |
| 2022-02-04 | val2 | val |
| 2022-02-27 | val2 | val4 |
I want to create a new DataFrame, where each row has a value between the minimum and maximum date value from the original DataFrame. The counter column contains a row counter for that date.
| date | counter |
| ---------- | -------- |
| 2022-02-01 | 2 |
| 2022-02-02 | 0 |
| 2022-02-03 | 1 |
| 2022-02-04 | 1 |
| 2022-02-05 | 0 |
...
| 2022-02-26 | 0 |
| 2022-02-27 | 1 |
Count the dates first and remove duplicates using drop_duplicates. Then fill in the intermediate dates: pandas has the asfreq function for a DatetimeIndex, which is basically just a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
df['counts'] = df['date'].map(df['date'].value_counts())
df = df.drop_duplicates(subset='date', keep="first")
df.date = pd.to_datetime(df.date)
df = df.set_index('date').asfreq('D').reset_index()
df = df.fillna(0)
print(df)
Gives:
date counts
0 2022-02-01 2.0
1 2022-02-02 0.0
2 2022-02-03 1.0
3 2022-02-04 1.0
4 2022-02-05 0.0
5 2022-02-06 0.0
6 2022-02-07 0.0
7 2022-02-08 0.0
8 2022-02-09 0.0
9 2022-02-10 0.0
10 2022-02-11 0.0
11 2022-02-12 0.0
12 2022-02-13 0.0
13 2022-02-14 0.0
14 2022-02-15 0.0
15 2022-02-16 0.0
16 2022-02-17 0.0
17 2022-02-18 0.0
18 2022-02-19 0.0
19 2022-02-20 0.0
20 2022-02-21 0.0
21 2022-02-22 0.0
22 2022-02-23 0.0
23 2022-02-24 0.0
24 2022-02-25 0.0
25 2022-02-26 0.0
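Since asfreq is just that thin wrapper, the asfreq('D') step can equivalently be spelled out with reindex over an explicit date_range; a sketch, assuming df is the deduplicated frame from just before the asfreq call:
full_range = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
df = (df.set_index('date')
      .reindex(full_range)        # same effect as asfreq('D'): insert the missing days
      .rename_axis('date')
      .reset_index()
      .fillna(0))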
There are many ways to do this. Here is mine. It is probably not optimal, but at least I am not iterating over rows or using .apply, which are both sure recipes for slow solutions.
import datetime
import numpy as np
import pandas as pd
# A minimal example (you should provide such an example next time)
df = pd.DataFrame({'date': pd.to_datetime(['2022-02-01', '2022-02-03', '2022-02-01', '2022-02-04', '2022-02-27']),
                   'c1': ['val', 'val1', 'val', 'val2', 'val2'],
                   'c2': range(5)})
# A delta of 1 day, to create the list of dates
dt = datetime.timedelta(days=1)
# Result dataframe, with a count of 0 for now
res = pd.DataFrame({'date': df.date.min() + dt * np.arange((df.date.max() - df.date.min()).days + 1),
                    'count': 0})
# Count dates
countDates = df[['date', 'c1']].groupby('date').agg('count')
# Merge the counted dates with the target array, filling missing values with 0
res['count'] = res.merge(countDates, on='date', how='left').fillna(0)['c1']

How do I find users retention within n_days in pandas?

I have a df that looks like this:
date | user_id | purchase
2020-01-01 | 1 | 10
2020-10-01 | 1 | 12
2020-15-01 | 1 | 5
2020-11-01 | 2 | 500 ...
Now, I want to add an n_day retention flag for each user_id in my df. The expected output should look like:
date | user_id | purchase | 3D_retention (did user purchase within next 3 days)
2020-01-01 | 1 | 10 | 0 (because there was no purchase on/before 2020-04-01 after 2020-01-01)
2020-10-01 | 1 | 12 | 1 (because there was a purchase on 2020-11-01, which was within 3 days of 2020-10-01)
2020-11-01 | 1 | 5 | 0
What is the best way of doing this in pandas?
I modified the dates to be in yyyy-mm-dd format:
date user_id purchase
0 2020-01-01 1 10
1 2020-01-10 1 12
2 2020-01-15 1 5
3 2020-01-11 2 500
df['date'] = pd.to_datetime(df['date'])
next_purchase_days = 6
df['retention'] = df.groupby('user_id')['date'].transform(
    lambda x: ((x.shift(-1) - x).dt.days < next_purchase_days).astype(int))
df
date user_id purchase retention
0 2020-01-01 1 10 0
1 2020-01-10 1 12 1
2 2020-01-15 1 5 0
3 2020-01-11 2 500 0
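Note that shift(-1) compares each purchase only with the immediately following one within the same user, which is enough here because the nearest next purchase is also the earliest one; it does assume each user's rows are in chronological order. If that is not guaranteed in your data, sort first:
df = df.sort_values(['user_id', 'date'])   # keep each user's purchases in date order before the transform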

Pandas DataFrame - Access Values That are created on the fly

I am trying to figure out something which I can easily perform in Excel, but I am having a hard time understanding how to do it on a pandas DataFrame without using loops.
Suppose that I have a data frame as follows:
+------------+-------+-------+-----+------+
| Date | Price | Proxy | Div | Days |
+------------+-------+-------+-----+------+
| 13/01/2021 | 10 | 20 | 0.5 | NaN |
| 08/01/2021 | NaN | 30 | 0.6 | 5 |
| 04/01/2021 | NaN | 40 | 0.7 | 4 |
| 03/01/2021 | NaN | 50 | 0.8 | 1 |
| 01/01/2021 | NaN | 60 | 0.9 | 2 |
+------------+-------+-------+-----+------+
The task is to fill all the Price values where Price is null. In Excel, suppose that Date is column A and the first row of Date is row 2; then to fill the NaN in row 2 of Price I would use the formula =(B2)/(((C3/C2)*D3)*E3) = 2.22.
Now I want to use the value 2.22 on the fly to fill the NaN in row 3 of Price, the reason being that to fill the NaN of row 3 I need to make use of the just-filled row 2 value. Hence the Excel formula to fill the row 3 price would be =(B3)/(((C4/C3)*D4)*E4).
One way would be to loop over all the rows of the DataFrame, which I don't want to do. What would be a vectorised approach to solve this problem?
Expected Output
+------------+-------+-------+-----+------+
| Date | Price | Proxy | Div | Days |
+------------+-------+-------+-----+------+
| 13/01/2021 | 10 | 20 | 0.5 | NA |
| 08/01/2021 | 2.22 | 30 | 0.6 | 5 |
| 04/01/2021 | 0.60 | 40 | 0.7 | 4 |
| 03/01/2021 | 0.60 | 50 | 0.8 | 1 |
| 01/01/2021 | 0.28 | 60 | 0.9 | 2 |
+------------+-------+-------+-----+------+
Current_Price = Prev Price (non-nan) / (((Current_Proxy/Prev_Proxy) * Div) * Days)
Edit
Create the initial data frame using the code below:
import numpy as np
import pandas as pd

data = {'Date': ['2021-01-13', '2021-01-08', '2021-01-04', '2021-01-03', '2021-01-01'],
        'Price': [10, np.nan, np.nan, np.nan, np.nan],
        'Proxy': [20, 30, 40, 50, 60],
        'Div': [0.5, 0.6, 0.7, 0.8, 0.9],
        'Days': [np.nan, 5, 4, 1, 2]}
df = pd.DataFrame(data)
What you want to achieve is actually a cumulative product:
df['Price'] = (df['Price'].combine_first(df['Proxy'].shift()/df.eval('Proxy*Div*Days'))
.cumprod().round(2))
Output:
Date Price Proxy Div Days
0 2021-01-13 10.00 20 0.5 NaN
1 2021-01-08 2.22 30 0.6 5.0
2 2021-01-04 0.60 40 0.7 4.0
3 2021-01-03 0.60 50 0.8 1.0
4 2021-01-01 0.28 60 0.9 2.0
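For reference, this is the row-by-row recurrence that the combine_first/cumprod one-liner vectorises; a sketch of the explicit (slow) loop, only useful for checking the result against the formula in the question:
# loop version of Price_i = Price_{i-1} / ((Proxy_i / Proxy_{i-1}) * Div_i * Days_i)
prices = [df.loc[0, 'Price']]
for i in range(1, len(df)):
    row = df.loc[i]
    prices.append(prices[-1] / ((row['Proxy'] / df.loc[i - 1, 'Proxy']) * row['Div'] * row['Days']))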

Find accounts with more than N months of activity in a row using Pandas

I want to filter out accounts that do not have N consecutive months of activity.
Example:
a100000001 | 2019-01-31 | NaN
| 2019-02-28 | 40
| 2019-03-31 | 30
| 2019-04-30 | 50
-----------|------------|-----
a100000002 | 2019-01-31 | NaN
| 2019-02-28 | NaN
| 2019-03-31 | 20
| 2019-04-30 | NaN
-----------|------------|-----
... | |
The result for N=3 consecutive months will look like this:
a100000001 | 2019-01-31 | NaN
| 2019-02-28 | 40
| 2019-03-31 | 30
| 2019-04-30 | 50
-----------|------------|-----
... | |
where account "a100000002" was ignored.
I tried df[df.rolling(3)['amount'].min().notna()] but it also removes the NaN rows from the desired accounts.
Something like this should work:
df.groupby('account').filter(lambda g: (g['date'].dt.month.diff() <= n).all())
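A hedged alternative that builds on the rolling idea from the question, assuming one row per month per account, rows sorted by date, and columns named account and amount: keep an account if any window of 3 consecutive rows has no missing amount.
# rolling(3).count() counts the non-NaN amounts in each 3-month window;
# an account passes if at least one window is completely filled
df.groupby('account').filter(lambda g: g['amount'].rolling(3).count().eq(3).any())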

Partition dataset by timestamp

I have a dataframe of millions of rows like so, with no duplicate time-ID stamps:
ID | Time | Activity
a | 1 | Bar
a | 3 | Bathroom
a | 2 | Bar
a | 4 | Bathroom
a | 5 | Outside
a | 6 | Bar
a | 7 | Bar
What's the most efficient way to convert it to this format?
ID | StartTime | EndTime | Location
a | 1 | 2 | Bar
a | 3 | 4 | Bathroom
a | 5 | N/A | Outside
a | 6 | 7 | Bar
I have to do this with a lot of data, so wondering how to speed up this process as much as possible.
I am using groupby:
(df.groupby(['ID', 'Activity']).Time.apply(list)
 .apply(pd.Series)
 .rename(columns={0: 'starttime', 1: 'endtime'})
 .reset_index())
Out[251]:
ID Activity starttime endtime
0 a Bar 1.0 2.0
1 a Bathroom 3.0 4.0
2 a Outside 5.0 NaN
Or using pivot_table
(df.assign(I=df.groupby(['ID', 'Activity']).cumcount())
 .pivot_table(index=['ID', 'Activity'], columns='I', values='Time'))
Out[258]:
I 0 1
ID Activity
a Bar 1.0 2.0
Bathroom 3.0 4.0
Outside 5.0 NaN
Update
(df.assign(I=df.groupby(['ID', 'Activity']).cumcount() // 2)
 .groupby(['ID', 'Activity', 'I']).Time.apply(list)
 .apply(pd.Series)
 .rename(columns={0: 'starttime', 1: 'endtime'})
 .reset_index())
Out[282]:
ID Activity I starttime endtime
0 a Bar 0 1.0 2.0
1 a Bar 1 6.0 7.0
2 a Bathroom 0 3.0 4.0
3 a Outside 0 5.0 NaN
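One caveat on the cumcount() // 2 pairing: it matches occurrences in the order they appear in the frame, so it assumes each (ID, Activity) group is already in time order (which happens to hold for the sample). If that is not guaranteed, sort first:
df = df.sort_values(['ID', 'Time'])   # so the start/end pairs follow the actual time order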
