How to reindex a datetime-based multiindex in pandas

I have a dataframe that counts the number of times an event has occurred per user per day. Users may have 0 events on a given day, and (since the table is an aggregate of a raw event log) rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week so that each user has one entry per week (including 0 where applicable).
Here is an example of my input:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})
# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
However, according to the docs of reindex, I should be able to pass level=1 as a kwarg directly, without another groupby. When I do this, I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every value of my desired index occurs in my current index (and vice versa).

You need to reindex by both levels of the MultiIndex:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print(result)
person_id date event_count
0 0 2022-01-02 9
1 0 2022-01-09 13
2 0 2022-01-16 0
3 0 2022-01-23 0
4 1 2022-01-02 0
5 1 2022-01-09 11
6 1 2022-01-16 5
7 1 2022-01-23 0
8 2 2022-01-02 0
9 2 2022-01-09 0
10 2 2022-01-16 20
11 2 2022-01-23 0
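For completeness: the unstack approach from the linked question can also be made to work by reindexing the columns before stacking back, since a plain column reindex can insert week values that never occur in the data. A sketch, assuming the same df and desired_index as above:
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")])["event_count"].sum()
    .unstack(fill_value=0)                         # person_id rows, week columns
    .reindex(columns=desired_index, fill_value=0)  # insert the missing weeks as 0
    .stack()                                       # back to one row per (person, week)
    .rename_axis(["person_id", "date"])
    .reset_index(name="event_count")
)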

Related

add values on a negative pandas df based on condition date

I have a dataframe that contains a user's credit; each row is how much credit was refilled on a given day.
A user loses 1 credit per day.
I need a way to express that credit accumulated in the past fills the days on which the credit was 0.
An example of refilling past credits:
import pandas as pd

data = pd.DataFrame({'credit_refill': [0, 0, 0, 2, 0, 0, 1],
                     'date': pd.date_range('01-01-2021', '01-07-2021')})
data['credit_after_consuming'] = data.credit_refill - 1
Looks like:
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 0 | 2021-01-01 00:00:00 | -1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 2 | 2021-01-04 00:00:00 | 1 |
| 4 | 0 | 2021-01-05 00:00:00 | -1 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
| 6 | 1 | 2021-01-07 00:00:00 | 0 |
The logic, as you can see: for the first three days the user's credit is -1, until the 4th of January, when the user gains 2 days of credit: one is used that same day and the other is consumed on the 5th.
In total there would be 3 days without credit (the first three).
If at the start of the week a user picks up 7 credits, the whole week is covered.
Another case would be
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 2 | 2021-01-01 00:00:00 | 1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 0 | 2021-01-04 00:00:00 | -1 |
| 4 | 1 | 2021-01-05 00:00:00 | 0 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
In this case the participant runs out of credits on the 3rd and 4th day: they have 2 credits on the 1st, one consumed that same day and the other on the 2nd.
Then on the 5th they refill and consume it the same day, and run out of credits again on the 6th.
I feel it's like some variation of cumsum but I can't manage to get the expected results.
I could sum over all the days and fill the 0s with the accumulated credits, but I have to take into account that I can only refill with credits accumulated in the past.
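A running balance that is clamped at zero before each day's refill reproduces the behaviour described above; that clamping is why a plain cumsum is not enough, and a short Python loop (or numba, for large data) is the simplest way to express it. A minimal sketch of this interpretation, using the second example's data:
import pandas as pd

data = pd.DataFrame({'credit_refill': [2, 0, 0, 0, 1, 0],
                     'date': pd.date_range('01-01-2021', '01-06-2021')})

balance = 0
consumed = []
for refill in data['credit_refill']:
    # credits cannot go below zero, and one credit is consumed per day
    balance = max(balance, 0) + refill - 1
    consumed.append(balance)
data['credit_after_consuming'] = consumed
# gives [1, 0, -1, -1, 0, -1]: a value >= 0 means the day was covered,
# so the user runs out on the 3rd, 4th and 6th, matching the description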

Generate date column within a range for every unique ID in python

I have a data set which has unique IDs and names.
| ID | NAME |
| -------- | -------------- |
| 1 | Jane |
| 2 | Max |
| 3 | Tom |
| 4 | Beth |
Now, I want to generate a column of dates, using a date range, for all the IDs. For example, if the date range is ('2019-02-11', '2019-02-15'), I want the following output.
| ID | NAME | DATE |
| -------- | -------------- | -------------- |
| 1 | Jane | 2019-02-11 |
| 1 | Jane | 2019-02-12 |
| 1 | Jane | 2019-02-13 |
| 1 | Jane | 2019-02-14 |
| 1 | Jane | 2019-02-15 |
| 2 | Max | 2019-02-11 |
| 2 | Max | 2019-02-12 |
| 2 | Max | 2019-02-13 |
| 2 | Max | 2019-02-14 |
| 2 | Max | 2019-02-15 |
and so on for all the IDs. What is the most efficient way to get this in Python?

You can do this with a pandas cross merge:
import pandas as pd

df = pd.DataFrame([[1, 'Jane'], [2, 'Max'], [3, 'Tom'], [4, 'Beth']],
                  columns=["ID", "NAME"])
print(df)

df2 = pd.DataFrame(
    [['2022-01-01'], ['2022-01-02'], ['2022-01-03'], ['2022-01-04']],
    columns=['DATE'])
print(df2)

df3 = pd.merge(df, df2, how='cross')
print(df3)
Output:
ID NAME
0 1 Jane
1 2 Max
2 3 Tom
3 4 Beth
DATE
0 2022-01-01
1 2022-01-02
2 2022-01-03
3 2022-01-04
ID NAME DATE
0 1 Jane 2022-01-01
1 1 Jane 2022-01-02
2 1 Jane 2022-01-03
3 1 Jane 2022-01-04
4 2 Max 2022-01-01
5 2 Max 2022-01-02
6 2 Max 2022-01-03
7 2 Max 2022-01-04
8 3 Tom 2022-01-01
9 3 Tom 2022-01-02
10 3 Tom 2022-01-03
11 3 Tom 2022-01-04
12 4 Beth 2022-01-01
13 4 Beth 2022-01-02
14 4 Beth 2022-01-03
15 4 Beth 2022-01-04
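Rather than hard-coding the date rows, df2 can be built directly from the range in the question with pd.date_range (a sketch; how='cross' requires pandas 1.2+):
dates = pd.DataFrame({'DATE': pd.date_range('2019-02-11', '2019-02-15')})
result = df.merge(dates, how='cross')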

Filling Missing Date Column using groupby method

I have a dataframe that looks something like:
+---+----+---------------+------------+------------+
| | id | date1 | date2 | days_ahead |
+---+----+---------------+------------+------------+
| 0 | 1 | 2021-10-21 | 2021-10-24 | 3 |
| 1 | 1 | 2021-10-22 | NaN | NaN |
| 2 | 1 | 2021-11-16 | 2021-11-24 | 8 |
| 3 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 4 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 5 | 3 | 2021-10-26 | 2021-10-31 | 5 |
| 6 | 3 | 2021-10-30 | 2021-11-04 | 5 |
| 7 | 3 | 2021-11-02 | NaN | NaN |
| 8 | 3 | 2021-11-04 | 2021-11-04 | 0 |
| 9 | 4 | 2021-10-28 | NaN | NaN |
+---+----+---------------+------------+------------+
I am trying to fill the missing date2 values using the days_ahead median of each id group.
For example:
The median for id 1 = 5.5, which rounds to 6, so the filled value of date2 at index 1 should be 2021-10-28.
Similarly, for id 3 the median = 5, so the filled value of date2 at index 7 should be 2021-11-07.
And for id 4 the median = NaN, so the filled value of date2 at index 9 should be 2021-10-28 (its own date1).
I tried:
df['date2'].fillna(df.groupby('id')['days_ahead'].transform('median'), inplace=True)
But this fills date2 with int values.
Although I could use lambda and apply to detect those ints and turn them into dates, how do I use groupby and fillna together directly?
You can round the group medians, convert them to timedeltas with to_timedelta, add them to date1 (using the fill_value parameter for missing medians), and use the result to replace the missing values:
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
td = pd.to_timedelta(df.groupby('id')['days_ahead'].transform('median').round(), unit='d')
df['date2'] = df['date2'].fillna(df['date1'].add(td, fill_value=pd.Timedelta(0)))
print(df)
id date1 date2 days_ahead
0 1 2021-10-21 2021-10-24 3.0
1 1 2021-10-22 2021-10-28 NaN
2 1 2021-11-16 2021-11-24 8.0
3 2 2021-10-22 2021-10-24 2.0
4 2 2021-10-22 2021-10-24 2.0
5 3 2021-10-26 2021-10-31 5.0
6 3 2021-10-30 2021-11-04 5.0
7 3 2021-11-02 2021-11-07 NaN
8 3 2021-11-04 2021-11-04 0.0
9 4 2021-10-28 2021-10-28 NaN
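To see why index 9 falls back to its own date1: the group median for id 4 is NaN, so the rounded timedelta is NaT, and fill_value=pd.Timedelta(0) makes the addition treat it as a zero-day offset. Inspecting the intermediate td (values follow from the group medians 5.5 → 6, 2, 5, NaN):
print(td.tolist())
# [Timedelta('6 days'), Timedelta('6 days'), Timedelta('6 days'),
#  Timedelta('2 days'), Timedelta('2 days'), Timedelta('5 days'),
#  Timedelta('5 days'), Timedelta('5 days'), Timedelta('5 days'), NaT]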

Calculate streak in pandas without apply

I have a DataFrame like this:
date | type | column1
----------------------------
2019-01-01 | A | 1
2019-02-01 | A | 1
2019-03-01 | A | 1
2019-04-01 | A | 0
2019-05-01 | A | 1
2019-06-01 | A | 1
2019-07-01 | B | 1
2019-08-01 | B | 1
2019-09-01 | B | 0
I want to have a column called "streak" that has a streak, but grouped by column "type":
date | type | column1 | streak
-------------------------------------
2019-01-01 | A | 1 | 1
2019-02-01 | A | 1 | 2
2019-03-01 | A | 1 | 3
2019-04-01 | A | 0 | 0
2019-05-01 | A | 1 | 1
2019-06-01 | A | 1 | 2
2019-07-01 | B | 1 | 1
2019-08-01 | B | 1 | 2
2019-09-01 | B | 0 | 0
I managed to do it like that:
def streak(df):
    grouper = (df.column1 != df.column1.shift(1)).cumsum()
    df['streak'] = df.groupby(grouper).cumsum()['column1']
    return df

df = df.groupby(['type']).apply(streak)
But I'm wondering if it's possible to do it inline without using a groupby and apply, because my DataFrame contains about 100M rows and it takes several hours to process.
Any ideas on how to optimize this for speed?
You want the cumsum of 'column1', grouping by 'type' together with the cumsum of a Boolean Series, which resets the grouping at every 0.
df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()
date type column1 streak
0 2019-01-01 A 1 1
1 2019-02-01 A 1 2
2 2019-03-01 A 1 3
3 2019-04-01 A 0 0
4 2019-05-01 A 1 1
5 2019-06-01 A 1 2
6 2019-07-01 B 1 1
7 2019-08-01 B 1 2
8 2019-09-01 B 0 0
IIUC, this is what you need.
m = df.column1.ne(df.column1.shift()).cumsum()
df['streak'] = df.groupby([m, 'type'])['column1'].cumsum()
Output
date type column1 streak
0 1/1/2019 A 1 1
1 2/1/2019 A 1 2
2 3/1/2019 A 1 3
3 4/1/2019 A 0 0
4 5/1/2019 A 1 1
5 6/1/2019 A 1 2
6 7/1/2019 B 1 1
7 8/1/2019 B 1 2
8 9/1/2019 B 0 0
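Both answers replace the per-group Python apply with a single vectorized groupby(...).cumsum(), which is what removes the hours-long runtime. A quick way to check this on synthetic data of a similar shape (a sketch; exact timings depend on hardware and pandas version):
import numpy as np
import pandas as pd
from timeit import timeit

n = 1_000_000
df = pd.DataFrame({'type': np.random.choice(list('AB'), n),
                   'column1': np.random.randint(0, 2, n)})

def vectorized():
    return df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()

print(timeit(vectorized, number=3))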

Pandas DataFrame Group and Rollup in one operation

I have a Pandas DataFrame with two columns: the "close_time" of a trade (DateTime format) and the "net_profit" from that trade. I have shared some sample data below. I need to find the count of total trades and the count of profitable trades by day. So, for example, the output would look like
+-----------------------------------------------------------+
| Close_day Total_Trades Total_Profitable_Trades |
+-----------------------------------------------------------+
| 2014-11-03 5 4 |
+-----------------------------------------------------------+
Can this be done using something like groupby? How?
+------------------------------------+
| close_time net_profit |
+------------------------------------+
| 0 2014-10-31 14:41:41 20.84 |
| 1 2014-11-03 10:50:59 238.74 |
| 2 2014-11-03 11:05:10 491.32 |
| 3 2014-11-03 12:31:06 55.87 |
| 4 2014-11-03 14:31:34 -402.29 |
| 5 2014-11-03 20:33:29 164.18 |
| 6 2014-11-04 16:30:24 -296.96 |
| 7 2014-11-04 23:59:21 281.86 |
| 8 2014-11-04 23:59:34 -296.37 |
| 9 2014-11-05 10:14:42 517.55 |
| 10 2014-11-05 20:38:49 350.35 |
| 11 2014-11-07 11:23:31 710.13 |
| 12 2014-11-07 11:23:38 137.55 |
| 13 2014-11-11 19:00:01 201.97 |
| 14 2014-11-11 19:00:15 -484.77 |
| 15 2014-11-12 23:41:04 -1346.71 |
| 16 2014-11-12 23:41:25 514.30 |
| 17 2014-11-13 18:55:34 103.34 |
| 18 2014-11-13 18:55:43 -180.37 |
| 19 2014-11-26 17:10:59 -1756.69 |
+------------------------------------+
Setup
Make sure that your close_time is datetime by using
df.close_time = pd.to_datetime(df.close_time)
You can use groupby and agg here:
out = (df.groupby(df.close_time.dt.date)
         .net_profit.agg(['count', lambda x: x.gt(0).sum()])).astype(int)
out.columns = ['trades', 'profitable_trades']
trades profitable_trades
close_time
2014-10-31 1 1
2014-11-03 5 4
2014-11-04 3 1
2014-11-05 2 2
2014-11-07 2 2
2014-11-11 2 1
2014-11-12 2 1
2014-11-13 2 1
2014-11-26 1 0
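On pandas 0.25 or newer, named aggregation produces the same table without renaming the columns afterwards (a sketch under the same setup):
out = df.groupby(df.close_time.dt.date).agg(
    trades=('net_profit', 'count'),
    profitable_trades=('net_profit', lambda x: x.gt(0).sum()),
)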
