Cumulative groupby with condition on datetime pandas - python

I need to calculate cumulative sums for different columns in a pandas dataframe based on a column playerId and a datetime column. My dataframe looks like this:
eventId playerId goal shot header dateutc
0 0 100 0 1 0 2020-11-08 17:00:00
1 1 100 0 0 1 2020-11-08 17:00:00
2 2 100 1 1 0 2020-11-08 17:00:00
3 3 200 0 1 0 2020-11-08 17:00:00
4 4 100 1 0 1 2020-11-15 17:00:00
5 5 100 1 1 0 2020-11-15 17:00:00
6 6 200 1 1 0 2020-11-15 17:00:00
So I need to calculate, for each player, cumulative sums over the current date and all previous dates. My final dataframe will look like this:
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
Hopefully someone can help me :)
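For reference, the sample frame can be rebuilt like this (a minimal sketch; the dtypes are assumptions based on the printout):
import pandas as pd

df = pd.DataFrame({
    'eventId':  [0, 1, 2, 3, 4, 5, 6],
    'playerId': [100, 100, 100, 200, 100, 100, 200],
    'goal':     [0, 0, 1, 0, 1, 1, 1],
    'shot':     [1, 0, 1, 1, 0, 1, 1],
    'header':   [0, 1, 0, 0, 1, 0, 0],
    'dateutc':  pd.to_datetime(['2020-11-08 17:00:00'] * 4
                               + ['2020-11-15 17:00:00'] * 3),
})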

First remove eventId so it is not summed (it is numeric), then aggregate with sum per player and date, and finally take the cumulative sum per player:
df1 = (df.drop('eventId', axis=1)
         .groupby(['playerId','dateutc'], sort=False)
         .sum()
         .groupby(level=0, sort=False)
         .cumsum()
         .reset_index())
print (df1)
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
If you need to specify the columns to process:
df1 = (df.groupby(['playerId','dateutc'], sort=False)[['goal', 'shot', 'header']]
         .sum()
         .groupby(level=0, sort=False)
         .cumsum()
         .reset_index())

Try:
out = df.groupby(['playerId', 'dateutc'], sort=False)[['goal', 'shot', 'header']].sum()
out = out.groupby(level='playerId').cumsum().reset_index()
Output:
>>> out
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0

Related

How to calculate month by month change in value per user in pandas?

I was looking for similar topics, but I only found the overall change by month. What I would like is the month-by-month change in a value (e.g. UPL), but per user, as in the example below.
user_id  month                UPL
1        2022-01-01 00:00:00  100
1        2022-02-01 00:00:00  200
2        2022-01-01 00:00:00  100
2        2022-02-01 00:00:00   50
1        2022-03-01 00:00:00  150
And I would like an additional column named "UPL change month by month":
user_id  month                UPL  UPL_change_by_month
1        2022-01-01 00:00:00  100    0
1        2022-02-01 00:00:00  200  100
2        2022-01-01 00:00:00  100    0
2        2022-02-01 00:00:00   50  -50
1        2022-03-01 00:00:00  150  -50
Is it possible using aggfunc or the shift function in pandas?
IIUC, you can use groupby followed by diff:
df['UPL_change_by_month'] = df.sort_values('month').groupby('user_id')['UPL'].diff().fillna(0)
print(df)
# Output
user_id month UPL UPL_change_by_month
0 1 2022-01-01 100 0.0
1 1 2022-02-01 200 100.0
2 2 2022-01-01 100 0.0
3 2 2022-02-01 50 -50.0
4 1 2022-03-01 150 -50.0
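Since the question also mentions shift, an equivalent shift-based version (a sketch under the same assumptions) that keeps the original row order:
df['UPL_change_by_month'] = (
    df['UPL'] - df.sort_values('month').groupby('user_id')['UPL'].shift()
).fillna(0)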

Group rows by certain time period depending on other factors

What I start with is a large dataframe (more than a million entries) of this structure:
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
...
1 2021-02-01 00:00:00 0 ...
2 2020-01-15 00:05:00 0 ...
2 2020-03-10 00:07:00 0 ...
...
2 2021-05-22 00:00:00 1 ...
...
There is no specific order other than a sort by id and then datetime. The dataset is not complete (there is no data for every day, but there can be multiple entries for the same day).
Now, for each row where indicator == 1, I want to collect every row with the same id and a datetime that is at most 10 days earlier. All other rows that are not in range of an indicator can be dropped. Ideally I want the result saved as a set of time series, each of which will later be used in a neural network. (There can be more than one indicator == 1 case per id; the other values should be kept.)
An example for one id: I want to convert this
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
1 2020-01-17 00:13:00 0 ...
1 2020-01-20 00:05:00 0 ...
1 2020-03-10 00:07:00 0 ...
1 2020-05-19 00:00:00 0 ...
1 2020-05-20 00:00:00 1 ...
into this
id datetime group other_values ...
1 2020-01-14 00:12:00 A ...
1 2020-01-17 00:23:00 A ...
1 2020-01-17 00:13:00 A ...
1 2020-05-19 00:00:00 B ...
1 2020-05-20 00:00:00 B ...
or grouped in a similar way into groups A, B, ....
A naive Python for-loop is not feasible because it takes ages on a dataset like this.
There is probably a clever way to use df.groupby('id'), df.groupby('id').agg(...), df.sort_values(...) or df.apply(), but I just do not see it.
Here is a way to do it with pd.merge_asof(). Let's create our data:
import pandas as pd

data = {'id': [1, 1, 1, 1, 1, 1, 1],
        'datetime': ['2020-01-14 00:12:00',
                     '2020-01-17 00:23:00',
                     '2020-01-17 00:13:00',
                     '2020-01-20 00:05:00',
                     '2020-03-10 00:07:00',
                     '2020-05-19 00:00:00',
                     '2020-05-20 00:00:00'],
        'ind': [0, 1, 0, 0, 0, 0, 1]}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'])
Data:
id datetime ind
0 1 2020-01-14 00:12:00 0
1 1 2020-01-17 00:23:00 1
2 1 2020-01-17 00:13:00 0
3 1 2020-01-20 00:05:00 0
4 1 2020-03-10 00:07:00 0
5 1 2020-05-19 00:00:00 0
6 1 2020-05-20 00:00:00 1
Next, let's add a date to the dataset and pull all dates where the indicator is 1.
df['date'] = df['datetime'].dt.normalize()
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)
Which gives us this:
df:
id datetime ind date
0 1 2020-01-14 00:12:00 0 2020-01-14
1 1 2020-01-17 00:23:00 1 2020-01-17
2 1 2020-01-17 00:13:00 0 2020-01-17
3 1 2020-01-20 00:05:00 0 2020-01-20
4 1 2020-03-10 00:07:00 0 2020-03-10
5 1 2020-05-19 00:00:00 0 2020-05-19
6 1 2020-05-20 00:00:00 1 2020-05-20
df2:
id date ind2
1 1 2020-01-17 1
6 1 2020-05-20 1
Now let's join them using pd.merge_asof() with direction=forward and a tolerance of 10 days. This will join all data up to 10 days looking forward.
df = pd.merge_asof(df.drop('ind', axis=1), df2, by='id', on='date', tolerance=pd.Timedelta('10d'), direction='forward')
Which gives us this:
id datetime ind date ind2
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0
Next, let's work on creating the groups. A row starts a new group when it is in range of an indicator (ind2 is not NaN) and one of three things holds:
The previous value of ind2 is NaN (the previous row was not in range of any indicator)
The previous value of id is not the current value of id (we're at the first row of a new id)
The previous date is more than 10 days earlier than the current one
With these rules, we can create a Boolean which we can then cumulatively sum to create our groups.
# A new group starts where ind2 == 1 and the previous row is either out of range
# (ind2 was NaN), belongs to a different id, or is more than 10 days older.
# Note: the NaN check needs isna(); comparing with np.nan via == is always False.
new_group = (df['ind2'].shift().isna()
             | (df['id'].shift() != df['id'])
             | (df['date'] - df['date'].shift() > pd.Timedelta('10d')))
df['group_id'] = df['ind2'].eq(new_group).cumsum()
id datetime ind date ind2 group_id
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0 1
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0 1
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0 1
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN 1
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN 1
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0 2
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0 2
Now we drop all rows where ind2 is NaN, remove the helper columns date and ind2, and we're done.
df = df.dropna(subset=['ind2']).drop(['date', 'ind2'], axis=1)
Final output:
id datetime ind group_id
0 1 2020-01-14 00:12:00 0 1
1 1 2020-01-17 00:23:00 1 1
2 1 2020-01-17 00:13:00 0 1
5 1 2020-05-19 00:00:00 0 2
6 1 2020-05-20 00:00:00 1 2
I'm not aware of a way to do this with df.agg, but you can put your for loop inside the groupby using .apply(). That way, your comparisons/lookups can be done on smaller tables, then groupby will handle the re-concatenation:
import pandas as pd
import datetime
import uuid

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "datetime": [
        '2020-01-14 00:12:00',
        '2020-01-17 00:23:00',
        '2021-02-01 00:00:00',
        '2020-01-15 00:05:00',
        '2020-03-10 00:07:00',
        '2021-05-22 00:00:00',
    ],
    "indicator": [0, 1, 0, 0, 0, 1]
})
df.datetime = pd.to_datetime(df.datetime)
timedelta = datetime.timedelta(days=10)
def consolidate(grp):
    grp['Group'] = None
    for time in grp[grp.indicator == 1]['datetime']:
        # use .loc for the conditional assignment to avoid chained-indexing issues
        grp.loc[grp['datetime'].between(time - timedelta, time), 'Group'] = uuid.uuid4()
    return grp.dropna(subset=['Group'])

df.groupby('id').apply(consolidate)
If there are multiple rows with indicator == 1 in each id grouping, then the for loop will apply in index order (so a later group might overwrite an earlier group). If you can be certain that there is only one indicator == 1 in each grouping, we can simplify the consolidate function:
def consolidate(grp):
    time = grp[grp.indicator == 1]['datetime'].iloc[0]
    grp = grp[grp['datetime'].between(time - timedelta, time)].copy()
    grp['Group'] = uuid.uuid4()
    return grp
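Usage is the same as before, for example (a small sketch):
result = df.groupby('id').apply(consolidate)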

How to get 1 for 8 days after a date in pandas and 0 otherwise?

I have two dataframes:
daily = pd.DataFrame({'Date': pd.date_range(start="2021-01-01",end="2021-04-29")})
pc21 = pd.DataFrame({'Date': ["2021-01-21", "2021-03-11", "2021-04-22"]})
pc21['Date'] = pd.to_datetime(pc21['Date'])
What I want to do is the following: for every date in pc21 that is also present in daily, I want a new column in daily with values equal to 1 for the 8 days starting at that date and 0 otherwise.
This is an example of a desired output:
# 2021-01-21 is in both dataframes, so I want a new column in 'daily' that looks like this:
Date newcol
.
.
.
2021-01-20 0
2021-01-21 1
2021-01-22 1
2021-01-23 1
2021-01-24 1
2021-01-25 1
2021-01-26 1
2021-01-27 1
2021-01-28 1
2021-01-29 0
.
.
.
Can anyone help me achieve this?
Thanks!
you can try the following approach:
res = (daily
       .merge(pd.concat([pd.date_range(d, freq="D", periods=8).to_frame(name="Date")
                         for d in pc21["Date"]]),
              how="left", indicator=True)
       .replace({"both": 1, "left_only": 0})
       .rename(columns={"_merge": "newcol"}))
Result:
In [15]: res
Out[15]:
Date newcol
0 2021-01-01 0
1 2021-01-02 0
2 2021-01-03 0
3 2021-01-04 0
4 2021-01-05 0
.. ... ...
114 2021-04-25 1
115 2021-04-26 1
116 2021-04-27 1
117 2021-04-28 1
118 2021-04-29 1
[119 rows x 2 columns]
daily['value'] = 0
pc21['value'] = 1
daily = (pd.merge(daily, pc21, on='Date', how='left')
           .rename(columns={'value_y': 'value'})
           .drop(columns='value_x')
           .ffill(limit=7)
           .fillna(0))
pc21 = pc21.drop(columns='value')  # restore pc21 to its original columns
Output subset:
daily.query('value.eq(1)')
Date value
20 2021-01-21 1.0
21 2021-01-22 1.0
22 2021-01-23 1.0
23 2021-01-24 1.0
24 2021-01-25 1.0
25 2021-01-26 1.0
26 2021-01-27 1.0
27 2021-01-28 1.0
69 2021-03-11 1.0
daily["new_col"] = np.where(daily.Date.isin(pc21.Date), 1, np.nan)
daily["new_col"] = daily["new_col"].fillna(method="ffill", limit=7).fillna(0)
We generate the new column first:
if the Date of daily is in the Date of pc21, put 1, otherwise put NaN.
Then forward fill that column, but with a limit of 7, so that we end up with 8 consecutive 1s.
Lastly, fill the remaining NaNs with 0.
(You can append astype(int) at the end to get integers; see the sketch below.)
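Put together as a single chain (a minimal sketch assuming daily and pc21 as defined in the question), including the integer conversion:
import numpy as np
import pandas as pd

daily["new_col"] = (
    pd.Series(np.where(daily["Date"].isin(pc21["Date"]), 1, np.nan), index=daily.index)
      .ffill(limit=7)
      .fillna(0)
      .astype(int)
)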

Drop overlapping periods less than 6 months in pandas dataframe

I have the following pandas dataframe, and I want to drop, for each customer, the rows where the difference between Dates is less than 6 months. For example, for the customer with ID 1 I want to keep only the dates 2017-07-01, 2018-01-01 and 2018-08-01.
Customer_ID Date
1 2017-07-01
1 2017-08-01
1 2017-09-01
1 2017-10-01
1 2017-11-01
1 2017-12-01
1 2018-01-01
1 2018-02-01
1 2018-03-01
1 2018-04-01
1 2018-06-01
1 2018-08-01
2 2018-11-01
2 2019-02-01
2 2019-03-01
2 2019-05-01
2 2020-02-01
2 2020-05-01
Define the following function to process each group of rows (for each customer):
def selDates(grp):
    res = []
    while grp.size > 0:
        stRow = grp.iloc[0]
        res.append(stRow)
        # keep only rows at least 6 months after the row just accepted
        grp = grp[grp.Date >= stRow.Date + pd.DateOffset(months=6)]
    return pd.DataFrame(res)
Then apply this function to each group:
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)
The result, for your data sample, is:
Customer_ID Date
0 1 2017-07-01
6 1 2018-01-01
11 1 2018-08-01
12 2 2018-11-01
15 2 2019-05-01
16 2 2020-02-01

Fill in missing days in dataframe and add zero value in Python

I have a dataframe that looks like the following
Date A B
2014-12-20 00:00:00.000 3 2
2014-12-21 00:00:00.000 7 1
2014-12-22 00:00:00.000 2 9
2014-12-24 00:00:00.000 2 2
and I would like to add the missing day and fill the values for A and B with 0 so I get
Date A B
2014-12-20 00:00:00.000 3 2
2014-12-21 00:00:00.000 7 1
2014-12-22 00:00:00.000 2 9
2014-12-23 00:00:00.000 0 0
2014-12-24 00:00:00.000 2 2
What is the best way to achieve this?
If Date is a column, create a DatetimeIndex from it and then use DataFrame.asfreq:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.set_index('Date').asfreq('d', fill_value=0)
print (df1)
A B
Date
2014-12-20 3 2
2014-12-21 7 1
2014-12-22 2 9
2014-12-23 0 0
2014-12-24 2 2
If the first column is already the index:
df.index = pd.to_datetime(df.index)
df1 = df.asfreq('d', fill_value=0)
print (df1)
A B
Date
2014-12-20 3 2
2014-12-21 7 1
2014-12-22 2 9
2014-12-23 0 0
2014-12-24 2 2
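An alternative (a sketch, not part of the answer above) is to reindex against an explicit daily date_range, which also lets you extend the range beyond the first and last observed dates:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
full_idx = pd.date_range(df['Date'].min(), df['Date'].max(), freq='D')
df1 = (df.set_index('Date')
         .reindex(full_idx, fill_value=0)
         .rename_axis('Date'))
print(df1)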
