Efficiently counting records with a date between two date columns - python

Say I have this DataFrame:
      user            sub_date          unsub_date group
0    alice 2021-01-01 00:00:00 2021-02-09 00:00:00     A
1      bob 2021-02-03 00:00:00 2021-04-05 00:00:00     B
2  charlie 2021-02-03 00:00:00                 NaT     A
3     dave 2021-01-29 00:00:00 2021-09-01 00:00:00     B
What is the most efficient way to count the subbed users per date and per group? In other words, to get this DataFrame:
      date  group  subbed
2021-01-01      A       1
2021-01-01      B       0
2021-01-02      A       1
2021-01-02      B       0
       ...    ...     ...
2021-02-03      A       2
2021-02-03      B       2
       ...    ...     ...
2021-02-10      A       1
2021-02-10      B       2
       ...    ...     ...
Here's a snippet to init the example df:
import pandas as pd
import datetime as dt
users = pd.DataFrame(
    [
        ["alice", "2021-01-01", "2021-02-09", "A"],
        ["bob", "2021-02-03", "2021-04-05", "B"],
        ["charlie", "2021-02-03", None, "A"],
        ["dave", "2021-01-29", "2021-09-01", "B"],
    ],
    columns=["user", "sub_date", "unsub_date", "group"],
)
users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
    pd.to_datetime
)

Using a smaller date range for convenience.
Note: my users df is different from the OP's; I've changed a few dates to keep the outputs small.
In [26]: import pandas as pd
    ...: import datetime as dt
    ...:
    ...: users = pd.DataFrame(
    ...:     [
    ...:         ["alice", "2021-01-01", "2021-01-05", "A"],
    ...:         ["bob", "2021-01-03", "2021-01-07", "B"],
    ...:         ["charlie", "2021-01-03", None, "A"],
    ...:         ["dave", "2021-01-09", "2021-01-11", "B"],
    ...:     ],
    ...:     columns=["user", "sub_date", "unsub_date", "group"],
    ...: )
    ...:
    ...: users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(
    ...:     pd.to_datetime
    ...: )
In [81]: users
Out[81]:
      user   sub_date unsub_date group
0    alice 2021-01-01 2021-01-05     A
1      bob 2021-01-03 2021-01-07     B
2  charlie 2021-01-03        NaT     A
3     dave 2021-01-09 2021-01-11     B
In [82]: users.melt(id_vars=['user', 'group'])
Out[82]:
      user group    variable      value
0    alice     A    sub_date 2021-01-01
1      bob     B    sub_date 2021-01-03
2  charlie     A    sub_date 2021-01-03
3     dave     B    sub_date 2021-01-09
4    alice     A  unsub_date 2021-01-05
5      bob     B  unsub_date 2021-01-07
6  charlie     A  unsub_date        NaT
7     dave     B  unsub_date 2021-01-11
# dropna to remove rows with no unsub_date
# sort_values to sort by date
# sub_date -> 1, unsub_date -> -1, then cumsum gives # of subbed people at each date
In [85]: import numpy as np
    ...: melted = users.melt(id_vars=['user', 'group']).dropna().sort_values('value')
    ...: melted['sub_value'] = np.where(melted['variable'] == 'sub_date', 1, -1)  # or melted['variable'].map({'sub_date': 1, 'unsub_date': -1})
    ...: melted['sub_cumsum_group'] = melted.groupby('group')['sub_value'].cumsum()
    ...: melted
Out[85]:
      user group    variable      value  sub_value  sub_cumsum_group
0    alice     A    sub_date 2021-01-01          1                 1
1      bob     B    sub_date 2021-01-03          1                 1
2  charlie     A    sub_date 2021-01-03          1                 2
4    alice     A  unsub_date 2021-01-05         -1                 1
5      bob     B  unsub_date 2021-01-07         -1                 0
3     dave     B    sub_date 2021-01-09          1                 1
7     dave     B  unsub_date 2021-01-11         -1                 0
In [93]: idx = pd.date_range(melted['value'].min(), melted['value'].max(), freq='1D')
...: idx
Out[93]:
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09', '2021-01-10', '2021-01-11'],
              dtype='datetime64[ns]', freq='D')
In [94]: melted.set_index('value').groupby('group')['sub_cumsum_group'].apply(lambda x: x.reindex(idx).ffill().fillna(0))
Out[94]:
group
A      2021-01-01    1.0
       2021-01-02    1.0
       2021-01-03    2.0
       2021-01-04    2.0
       2021-01-05    1.0
       2021-01-06    1.0
       2021-01-07    1.0
       2021-01-08    1.0
       2021-01-09    1.0
       2021-01-10    1.0
       2021-01-11    1.0
B      2021-01-01    0.0
       2021-01-02    0.0
       2021-01-03    1.0
       2021-01-04    1.0
       2021-01-05    1.0
       2021-01-06    1.0
       2021-01-07    0.0
       2021-01-08    0.0
       2021-01-09    1.0
       2021-01-10    1.0
       2021-01-11    0.0
Name: sub_cumsum_group, dtype: float64
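To finish with the tidy date / group / subbed frame the question asks for, the reindexed result above can be renamed and reset. A minimal end-to-end sketch using the smaller dataset from this answer (the level names and final column order are my choices, not the only option):

```python
import numpy as np
import pandas as pd

users = pd.DataFrame(
    [
        ["alice", "2021-01-01", "2021-01-05", "A"],
        ["bob", "2021-01-03", "2021-01-07", "B"],
        ["charlie", "2021-01-03", None, "A"],
        ["dave", "2021-01-09", "2021-01-11", "B"],
    ],
    columns=["user", "sub_date", "unsub_date", "group"],
)
users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(pd.to_datetime)

# melt / cumsum steps, as above
melted = users.melt(id_vars=["user", "group"]).dropna().sort_values("value")
melted["sub_value"] = np.where(melted["variable"] == "sub_date", 1, -1)
melted["sub_cumsum_group"] = melted.groupby("group")["sub_value"].cumsum()
idx = pd.date_range(melted["value"].min(), melted["value"].max(), freq="1D")

out = (
    melted.set_index("value")
    .groupby("group")["sub_cumsum_group"]
    .apply(lambda x: x.reindex(idx).ffill().fillna(0))
    .rename_axis(["group", "date"])   # name the MultiIndex levels
    .rename("subbed")
    .reset_index()[["date", "group", "subbed"]]
    .sort_values(["date", "group"])   # match the requested ordering
    .reset_index(drop=True)
)
```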

The data is described by step functions, and staircase can be used for these applications
import staircase as sc
stepfunctions = users.groupby("group").apply(sc.Stairs, "sub_date", "unsub_date")
stepfunctions will be a pandas.Series, indexed by group, and the values are Stairs objects which represent step functions.
group
A <staircase.Stairs, id=2516834869320>
B <staircase.Stairs, id=2516112096072>
dtype: object
You could plot the step function for A if you wanted like so
stepfunctions["A"].plot()
The next step is to sample the step function at whatever dates you want, e.g. for every day of January:
sc.sample(stepfunctions, pd.date_range("2021-01-01", "2021-02-01")).melt(ignore_index=False).reset_index()
The result is this
    group    variable  value
0       A  2021-01-01      1
1       B  2021-01-01      0
2       A  2021-01-02      1
3       B  2021-01-02      0
4       A  2021-01-03      1
..    ...         ...    ...
59      B  2021-01-30      1
60      A  2021-01-31      1
61      B  2021-01-31      1
62      A  2021-02-01      1
63      B  2021-02-01      1
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.

Try this?
>>> users.groupby(['sub_date','group'])[['user']].count()
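Note that grouping on sub_date only tallies new subscriptions on each start date; it does not track who is still subscribed on intermediate days or drop users after unsub_date. For the running count the question asks for, a brute-force interval check per date also works (a sketch, not tuned for speed; the reporting window and the "counted on sub_date, not on unsub_date" inequalities are assumptions to adjust):

```python
import pandas as pd

users = pd.DataFrame(
    [
        ["alice", "2021-01-01", "2021-02-09", "A"],
        ["bob", "2021-02-03", "2021-04-05", "B"],
        ["charlie", "2021-02-03", None, "A"],
        ["dave", "2021-01-29", "2021-09-01", "B"],
    ],
    columns=["user", "sub_date", "unsub_date", "group"],
)
users[["sub_date", "unsub_date"]] = users[["sub_date", "unsub_date"]].apply(pd.to_datetime)

dates = pd.date_range("2021-01-01", "2021-02-10")  # assumed reporting window
counts = pd.DataFrame(
    [
        (d, g, int(((grp["sub_date"] <= d)
                    & (grp["unsub_date"].isna() | (grp["unsub_date"] > d))).sum()))
        for d in dates
        for g, grp in users.groupby("group")
    ],
    columns=["date", "group", "subbed"],
)
```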

Related

Dataframe - Datetime, get cumulated sum of previous day

I have a dataframe with the following columns:
datetime: HH:MM:SS (not continuous, there are some missing days)
date: df['datetime'].dt.date
X: various values
X_daily_cum = df.groupby(['date']).X.cumsum()
So X_daily_cum is the cumulative sum of X grouped per day; it resets every day.
Code to reproduce:
import pandas as pd
df = pd.DataFrame([['2021-01-01 10:10', 3],
                   ['2021-01-03 13:33', 7],
                   ['2021-01-03 14:44', 6],
                   ['2021-01-07 17:17', 2],
                   ['2021-01-07 07:07', 4],
                   ['2021-01-07 01:07', 9],
                   ['2021-01-09 09:09', 3]],
                  columns=['datetime', 'X'])
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %M:%S')
df['date'] = df['datetime'].dt.date
df['X_daily_cum'] = df.groupby(['date']).X.cumsum()
print(df)
Now I would like a new column that takes for value the cumulated sum of previous available day, like that:
             datetime  X        date  X_daily_cum  last_day_cum_value
0 2021-01-01 00:10:10  3  2021-01-01            3                 NaN
1 2021-01-03 00:13:33  7  2021-01-03            7                   3
2 2021-01-03 00:14:44  6  2021-01-03           13                   3
3 2021-01-07 00:17:17  2  2021-01-07            2                  13
4 2021-01-07 00:07:07  4  2021-01-07            6                  13
5 2021-01-07 00:01:07  9  2021-01-07           15                  13
6 2021-01-09 00:09:09  3  2021-01-09            3                  15
Is there a clean way to do it with pandas, perhaps with an apply?
I have managed to do it in an ugly way: copying the df, removing the datetime granularity, selecting the last record of each date, and joining this new df back with the original. I would like a more elegant solution.
Thanks for the help.
Use Series.mask with Series.duplicated to set all values except the last per date to missing, then shift and forward-fill the missing values:
df['last_day_cum_value'] = (df['X_daily_cum'].mask(df['date'].duplicated(keep='last'))
                                             .shift()
                                             .ffill())
print (df)
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 00:10:10 3 2021-01-01 3 NaN
1 2021-01-03 00:13:33 7 2021-01-03 7 3.0
2 2021-01-03 00:14:44 6 2021-01-03 13 3.0
3 2021-01-07 00:17:17 2 2021-01-07 2 13.0
4 2021-01-07 00:07:07 4 2021-01-07 6 13.0
5 2021-01-07 00:01:07 9 2021-01-07 15 13.0
6 2021-01-09 00:09:09 3 2021-01-09 3 15.0
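The crux is the mask step: Series.duplicated(keep='last') flags every row except the last one per date, so only each day's final cumulative value survives before the shift. A stripped-down sketch of that intermediate (toy data, not the full frame):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-03", "2021-01-03", "2021-01-07"],
    "X_daily_cum": [3, 7, 13, 2],
})

# keep only each date's last cumulative value; other rows become NaN
masked = df["X_daily_cum"].mask(df["date"].duplicated(keep="last"))
# shift so each row sees the previous day's total, then fill forward
last_day = masked.shift().ffill()
```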
Old solution:
Use DataFrame.drop_duplicates with keep='last' to get one row per date, Series.shift for the previous date's value, then Series.map to create the new column:
s = df.drop_duplicates('date', keep='last').set_index('date')['X_daily_cum'].shift()
print (s)
date
2021-01-01 NaN
2021-01-03 3.0
2021-01-07 13.0
2021-01-09 15.0
Name: X_daily_cum, dtype: float64
df['last_day_cum_value'] = df['date'].map(s)
print (df)
datetime X date X_daily_cum last_day_cum_value
0 2021-01-01 00:10:10 3 2021-01-01 3 NaN
1 2021-01-03 00:13:33 7 2021-01-03 7 3.0
2 2021-01-03 00:14:44 6 2021-01-03 13 3.0
3 2021-01-07 00:17:17 2 2021-01-07 2 13.0
4 2021-01-07 00:07:07 4 2021-01-07 6 13.0
5 2021-01-07 00:01:07 9 2021-01-07 15 13.0
6 2021-01-09 00:09:09 3 2021-01-09 3 15.0

how to add a day to date column?

I wanna add a day to all cells of this dataframe:
value B N S date
date
2020-12-31 1 11 0 2020-12-31
2021-01-01 3 80 0 2021-01-01
2021-01-02 4 99 0 2021-01-02
2021-01-03 3 78 0 2021-01-03
2021-01-04 0 50 0 2021-01-04
to make it like this:
value B N S date
date
2020-12-31 1 11 0 2021-01-01
2021-01-01 3 80 0 2021-01-02
2021-01-02 4 99 0 2021-01-03
2021-01-03 3 78 0 2021-01-04
2021-01-04 0 50 0 2021-01-05
how can I do this?
df['date']=pd.to_datetime(df['date']).add(pd.offsets.Day(1))
df
value B N S date
0 2020-12-31 1 11 0 2021-01-01
1 2021-01-01 3 80 0 2021-01-02
2 2021-01-02 4 99 0 2021-01-03
3 2021-01-03 3 78 0 2021-01-04
4 2021-01-04 0 50 0 2021-01-05
You can temporarily convert to datetime to add a DateOffset:
df['date'] = (pd.to_datetime(df['date'])
                .add(pd.DateOffset(days=1))
                .dt.strftime('%Y-%m-%d')  # optional
             )
Output:
value B N S date
0 2020-12-31 1 11 0 2021-01-01
1 2021-01-01 3 80 0 2021-01-02
2 2021-01-02 4 99 0 2021-01-03
3 2021-01-03 3 78 0 2021-01-04
4 2021-01-04 0 50 0 2021-01-05
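If the column can stay as datetime, a plain Timedelta addition also works and skips the string round-trip (a sketch; DateOffset is mainly useful for calendar-aware shifts such as months or business days):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2020-12-31", "2021-01-01", "2021-01-02"]})
# add one day and keep the column as datetime64
df["date"] = pd.to_datetime(df["date"]) + pd.Timedelta(days=1)
```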

Using groupby's aggregation to populate a new column

Given this dataframe df:
      date  type  target
2021-01-01     0       5
2021-01-01     0       6
2021-01-01     1       4
2021-01-01     1       2
2021-01-02     0       5
2021-01-02     1       3
2021-01-02     1       7
2021-01-02     0       1
2021-01-03     0       2
2021-01-03     1       5
I want to create a new column that contains yesterday's target mean by type.
For example, for the 5th row (date=2021-01-02, type=0) the new column's value would be 5.5, as the mean of the target for the previous day, 2021-01-01 for type=0 is (5+6)/2.
I can easily obtain the mean of target grouping by date and type as:
means = df.groupby(['date', 'type'])['target'].mean()
But I don't know how to create a new column on the original dataframe with the desired data, which should look as follows:
      date  type  target  mean
2021-01-01     0       5  NaN (or null or whatever)
2021-01-01     0       6  NaN
2021-01-01     1       4  NaN
2021-01-01     1       2  NaN
2021-01-02     0       5  5.5
2021-01-02     1       3  3
2021-01-02     1       7  3
2021-01-02     0       2  5.5
2021-01-03     0       2  3.5
2021-01-03     1       5  5
Ensure your date column is datetime, and add another temporary column to df of the date the day before:
df['date'] = pd.to_datetime(df['date'])
df['yesterday'] = df['date'] - pd.Timedelta('1 day')
Then use your means groupby, with as_index=False, and left merge that onto the original df on yesterday/date and type columns, and select the desired columns:
means = df.groupby(['date', 'type'], as_index=False)['target'].mean()
df.merge(means, left_on=['yesterday', 'type'], right_on=['date', 'type'],
         how='left', suffixes=[None, ' mean'])[['date', 'type', 'target', 'target mean']]
Output:
date type target target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
The idea is to add one day to the first level of the MultiIndex Series with Timedelta, so the new column can be added with DataFrame.join:
df['date'] = pd.to_datetime(df['date'])
s1 = df.groupby(['date', 'type'])['target'].mean()
s2 = s1.rename(index=lambda x: x + pd.Timedelta(days=1), level=0)
df = df.join(s2.rename('mean'), on=['date','type'])
print (df)
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
Another solution:
df['date'] = pd.to_datetime(df['date'])
s1 = df.groupby([df['date'] + pd.Timedelta(days=1), 'type'])['target'].mean()
df = df.join(s1.rename('mean'), on=['date','type'])
print (df)
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 1 5.5
8 2021-01-03 0 2 3.0
9 2021-01-03 1 5 5.0
A small edit on @Emi OB's answer:
means = df.groupby(["date", "type"], as_index=False)["target"].mean()
means["mean"] = means.pop("target").shift(2)
df = df.merge(means, how="left", on=["date", "type"])
date type target mean
0 2021-01-01 0 5 NaN
1 2021-01-01 0 6 NaN
2 2021-01-01 1 4 NaN
3 2021-01-01 1 2 NaN
4 2021-01-02 0 5 5.5
5 2021-01-02 1 3 3.0
6 2021-01-02 1 7 3.0
7 2021-01-02 0 2 5.5
8 2021-01-03 0 2 3.5
9 2021-01-03 1 5 5.0
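The shift(2) above relies on every date contributing exactly one row per type to means; if a (date, type) pair is ever missing, the values slip silently. A variant sketch that shifts within each type instead, so it only assumes the per-type means are sorted by date:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2021-01-01"] * 4 + ["2021-01-02"] * 4 + ["2021-01-03"] * 2,
    "type": [0, 0, 1, 1, 0, 1, 1, 0, 0, 1],
    "target": [5, 6, 4, 2, 5, 3, 7, 1, 2, 5],
})

means = df.groupby(["date", "type"], as_index=False)["target"].mean()
# previous observed date's mean, per type -- robust to missing (date, type) pairs
means["mean"] = means.groupby("type")["target"].shift()
out = df.merge(means.drop(columns="target"), how="left", on=["date", "type"])
```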

Cumulative sum that updates between two date ranges

I have data that looks like this: (assume start and end are date times)
id  start  end
 1  01-01  01-02
 1  01-03  01-05
 1  01-04  01-07
 1  01-06  NaT
 1  01-07  NaT
I want to get a data frame that would include all dates, that has a 'cumulative sum' that only counts for the range they are in.
dates  count
01-01      1
01-02      0
01-03      1
01-04      2
01-05      1
01-06      2
01-07      3
One idea I thought of was simply using cumcount on the start dates, and doing a 'reverse cumcount' decreasing the counts using the end dates, but I am having trouble wrapping my head around doing this in pandas and I'm wondering whether there's a more elegant solution.
Here are two options. First, consider this data with only one id; note that your start and end columns must be datetime.
d = {'id': [1, 1, 1, 1, 1],
     'start': [pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-03'),
               pd.Timestamp('2021-01-04'), pd.Timestamp('2021-01-06'),
               pd.Timestamp('2021-01-07')],
     'end': [pd.Timestamp('2021-01-02'), pd.Timestamp('2021-01-05'),
             pd.Timestamp('2021-01-07'), pd.NaT, pd.NaT]}
df = pd.DataFrame(d)
To get your result, subtract the get_dummies of end from the get_dummies of start. Then sum (in case several intervals start and/or end on the same date), take the cumsum along the dates, and reindex to get all dates between the min and max available. Wrap it in a function:
def dates_cc(df_):
    return (
        pd.get_dummies(df_['start'], dtype=int)  # dtype=int: get_dummies defaults to bool in pandas >= 2.0
        .sub(pd.get_dummies(df_['end'], dtype=int), fill_value=0)
        .sum()
        .cumsum()
        .to_frame(name='count')
        .reindex(pd.date_range(df_['start'].min(), df_['end'].max()), method='ffill')
        .rename_axis('dates')
    )
Now you can apply this function to your dataframe
res = dates_cc(df).reset_index()
print(res)
# dates count
# 0 2021-01-01 1.0
# 1 2021-01-02 0.0
# 2 2021-01-03 1.0
# 3 2021-01-04 2.0
# 4 2021-01-05 1.0
# 5 2021-01-06 2.0
# 6 2021-01-07 2.0
Now if you have several ids, like
df1 = df.assign(id=[1,1,2,2,2])
print(df1)
# id start end
# 0 1 2021-01-01 2021-01-02
# 1 1 2021-01-03 2021-01-05
# 2 2 2021-01-04 2021-01-07
# 3 2 2021-01-06 NaT
# 4 2 2021-01-07 NaT
then you can use the above function like
res1 = df1.groupby('id').apply(dates_cc).reset_index()
print(res1)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 2 2021-01-04 1.0
# 6 2 2021-01-05 1.0
# 7 2 2021-01-06 2.0
# 8 2 2021-01-07 2.0
That said, a more straightforward possibility is crosstab, which creates a row per id; the rest is roughly the same manipulation.
res2 = (
    pd.crosstab(index=df1['id'], columns=df1['start'])
    .sub(pd.crosstab(index=df1['id'], columns=df1['end']), fill_value=0)
    .reindex(columns=pd.date_range(df1['start'].min(), df1['end'].max()), fill_value=0)
    .rename_axis(columns='dates')
    .cumsum(axis=1)
    .stack()
    .reset_index(name='count')
)
print(res2)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 1 2021-01-06 0.0
# 6 1 2021-01-07 0.0
# 7 2 2021-01-01 0.0
# 8 2 2021-01-02 0.0
# 9 2 2021-01-03 0.0
# 10 2 2021-01-04 1.0
# 11 2 2021-01-05 1.0
# 12 2 2021-01-06 2.0
# 13 2 2021-01-07 2.0
The main difference between the two options is that crosstab creates extra dates for each id: for example, 2021-01-01 occurs only in id=1, but this version also produces that date for id=2, whereas the groupby version does not include it.
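If those extra dates are unwanted, the crosstab result can be clipped afterwards to each id's own span (a sketch rebuilding df1 and res2 from above; the bounds/lo/hi names are mine):

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "start": pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-04",
                             "2021-01-06", "2021-01-07"]),
    "end": pd.to_datetime(["2021-01-02", "2021-01-05", "2021-01-07", None, None]),
})
res2 = (
    pd.crosstab(index=df1["id"], columns=df1["start"])
    .sub(pd.crosstab(index=df1["id"], columns=df1["end"]), fill_value=0)
    .reindex(columns=pd.date_range(df1["start"].min(), df1["end"].max()), fill_value=0)
    .rename_axis(columns="dates")
    .cumsum(axis=1)
    .stack()
    .reset_index(name="count")
)
# each id's own [min start, max end] span (NaT is ignored by max)
bounds = df1.groupby("id").agg(lo=("start", "min"), hi=("end", "max"))
res3 = res2.join(bounds, on="id")
res3 = (res3[res3["dates"].between(res3["lo"], res3["hi"])]
        .drop(columns=["lo", "hi"])
        .reset_index(drop=True))
```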

Creating a DataFrame with a row for each date from date range in other DataFrame

Below is script for a simplified version of the df in question:
plan_dates = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                           'start_date': ['2021-01-01', '2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05'],
                           'end_date': ['2021-01-04', '2021-01-03', '2021-01-03', '2021-01-06', '2021-01-08']})
plan_dates
id start_date end_date
0 1 2021-01-01 2021-01-04
1 2 2021-01-01 2021-01-03
2 3 2021-01-03 2021-01-03
3 4 2021-01-04 2021-01-06
4 5 2021-01-05 2021-01-08
I would like to create a new DataFrame with a row for each day where the plan is active, for each id.
INTENDED DF:
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
Any help would be greatly appreciated.
Use:
#first part is same like https://stackoverflow.com/a/66869805/2901002
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')
s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()
df['start_date'] = df['start_date'].add(pd.to_timedelta(counter, unit='d'))
Then remove end_date column, rename and create default index:
df = (df.drop('end_date', axis=1)
        .rename(columns={'start_date': 'active_days'})
        .reset_index(drop=True))
print (df)
id active_days
0 1 2021-01-01
1 1 2021-01-02
2 1 2021-01-03
3 1 2021-01-04
4 2 2021-01-01
5 2 2021-01-02
6 2 2021-01-03
7 3 2021-01-03
8 4 2021-01-04
9 4 2021-01-05
10 4 2021-01-06
11 5 2021-01-05
12 5 2021-01-06
13 5 2021-01-07
14 5 2021-01-08
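An equivalent approach, sketched here, builds a per-row date_range and uses DataFrame.explode (available since pandas 0.25):

```python
import pandas as pd

plan_dates = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "start_date": ["2021-01-01", "2021-01-01", "2021-01-03", "2021-01-04", "2021-01-05"],
    "end_date": ["2021-01-04", "2021-01-03", "2021-01-03", "2021-01-06", "2021-01-08"],
})

out = (
    plan_dates.assign(
        # one DatetimeIndex per row covering start..end inclusive
        active_days=[
            pd.date_range(s, e)
            for s, e in zip(plan_dates["start_date"], plan_dates["end_date"])
        ]
    )
    .explode("active_days")[["id", "active_days"]]  # one row per active day
    .reset_index(drop=True)
)
```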
