Filling Missing Date Column using groupby method - python

I have a dataframe that looks something like:
+---+----+---------------+------------+------------+
| | id | date1 | date2 | days_ahead |
+---+----+---------------+------------+------------+
| 0 | 1 | 2021-10-21 | 2021-10-24 | 3 |
| 1 | 1 | 2021-10-22 | NaN | NaN |
| 2 | 1 | 2021-11-16 | 2021-11-24 | 8 |
| 3 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 4 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 5 | 3 | 2021-10-26 | 2021-10-31 | 5 |
| 6 | 3 | 2021-10-30 | 2021-11-04 | 5 |
| 7 | 3 | 2021-11-02 | NaN | NaN |
| 8 | 3 | 2021-11-04 | 2021-11-04 | 0 |
| 9 | 4 | 2021-10-28 | NaN | NaN |
+---+----+---------------+------------+------------+
I am trying to fill the missing date2 values using the median of days_ahead for each id group. For example:
The median for id 1 is 5.5, which rounds to 6, so the filled value of date2 at index 1 should be 2021-10-28.
Similarly, the median for id 3 is 5, so the filled value of date2 at index 7 should be 2021-11-07.
And for id 4 the median is NaN, so the filled value of date2 at index 9 should simply be its date1, 2021-10-28.
I tried
df['date2'].fillna(df.groupby('id')['days_ahead'].transform('median'), inplace=True)
but this fills date2 with the numeric medians instead of dates.
I could use apply with a lambda to find those numbers and convert them to dates, but how do I use groupby and fillna together directly?

You can round the group medians, convert them to timedeltas with to_timedelta, add them to date1 (using the fill_value parameter so a missing median falls back to a zero offset), and use the result to fill the missing values:
# make sure both columns are real datetimes
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# per-id median of days_ahead, rounded and converted to a timedelta
td = pd.to_timedelta(df.groupby('id')['days_ahead'].transform('median').round(), unit='d')

# fill missing date2 with date1 + median offset; fill_value=Timedelta(0)
# keeps date1 unchanged where the median itself is NaN (e.g. id 4)
df['date2'] = df['date2'].fillna(df['date1'].add(td, fill_value=pd.Timedelta(0)))
print (df)
id date1 date2 days_ahead
0 1 2021-10-21 2021-10-24 3.0
1 1 2021-10-22 2021-10-28 NaN
2 1 2021-11-16 2021-11-24 8.0
3 2 2021-10-22 2021-10-24 2.0
4 2 2021-10-22 2021-10-24 2.0
5 3 2021-10-26 2021-10-31 5.0
6 3 2021-10-30 2021-11-04 5.0
7 3 2021-11-02 2021-11-07 NaN
8 3 2021-11-04 2021-11-04 0.0
9 4 2021-10-28 2021-10-28 NaN
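For reference, here is a minimal construction of the example frame (values copied from the table in the question), in case you want to run the snippet above directly:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    'date1': ['2021-10-21', '2021-10-22', '2021-11-16', '2021-10-22', '2021-10-22',
              '2021-10-26', '2021-10-30', '2021-11-02', '2021-11-04', '2021-10-28'],
    'date2': ['2021-10-24', None, '2021-11-24', '2021-10-24', '2021-10-24',
              '2021-10-31', '2021-11-04', None, '2021-11-04', None],
    'days_ahead': [3, np.nan, 8, 2, 2, 5, 5, np.nan, 0, np.nan],
})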

Related

How to reindex a datetime-based multiindex in pandas

I have a dataframe that counts the number of times an event has occured per user per day. Users may have 0 events per day and (since the table is an aggregate from a raw event log) rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week so that each user has one entry per week (including 0 if applicable).
Here is an example of my input:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})
# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
However, according to the docs of reindex, I should be able to pass level=1 as a kwarg directly, without the second groupby. When I do that, though, I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every level of my desired index occurs in my current index (and vice versa).
You need to reindex by both levels of the MultiIndex, i.e. build the full grid of (person_id, week) combinations with MultiIndex.from_product:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print (result)
person_id date event_count
0 0 2022-01-02 9
1 0 2022-01-09 13
2 0 2022-01-16 0
3 0 2022-01-23 0
4 1 2022-01-02 0
5 1 2022-01-09 11
6 1 2022-01-16 5
7 1 2022-01-23 0
8 2 2022-01-02 0
9 2 2022-01-09 0
10 2 2022-01-16 20
11 2 2022-01-23 0
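As a side note, if you would rather not build the MultiIndex by hand, a roughly equivalent sketch (assuming the same df and desired_index as above) is to unstack the weeks into columns, reindex the columns, and stack back:
weekly = df.groupby(["person_id", pd.Grouper(key="date", freq="w")])["event_count"].sum()

result = (
    weekly
    .unstack("date", fill_value=0)                                # weeks become columns
    .reindex(columns=desired_index.rename("date"), fill_value=0)  # add the missing weeks as 0
    .stack()                                                      # back to long format
    .reset_index(name="event_count")
)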

Generate date column within a range for every unique ID in python

I have a data set which has unique IDs and names.
| ID | NAME |
| -------- | -------------- |
| 1 | Jane |
| 2 | Max |
| 3 | Tom |
| 4 | Beth |
Now, I want to generate a column of dates using a date range, for all the IDs. For example, if the date range is ('2019-02-11', '2019-02-15'), I want the following output.
| ID | NAME | DATE |
| -------- | -------------- | -------------- |
| 1 | Jane | 2019-02-11 |
| 1 | Jane | 2019-02-12 |
| 1 | Jane | 2019-02-13 |
| 1 | Jane | 2019-02-14 |
| 1 | Jane | 2019-02-15 |
| 2 | Max | 2019-02-11 |
| 2 | Max | 2019-02-12 |
| 2 | Max | 2019-02-13 |
| 2 | Max | 2019-02-14 |
| 2 | Max | 2019-02-15 |
and so on for all the IDs. What is the most efficient way to get this in Python?
You can do this with a pandas cross merge:
import pandas as pd
df = pd.DataFrame( [[1,'Jane'],[2,'Max'],[3,'Tom'],[4,'Beth']], columns=["ID","NAME"] )
print(df)
df2 = pd.DataFrame(
    [['2022-01-01'], ['2022-01-02'], ['2022-01-03'], ['2022-01-04']],
    columns=['DATE'])
print(df2)
df3 = pd.merge(df, df2, how='cross')
print(df3)
Output:
ID NAME
0 1 Jane
1 2 Max
2 3 Tom
3 4 Beth
DATE
0 2022-01-01
1 2022-01-02
2 2022-01-03
3 2022-01-04
ID NAME DATE
0 1 Jane 2022-01-01
1 1 Jane 2022-01-02
2 1 Jane 2022-01-03
3 1 Jane 2022-01-04
4 2 Max 2022-01-01
5 2 Max 2022-01-02
6 2 Max 2022-01-03
7 2 Max 2022-01-04
8 3 Tom 2022-01-01
9 3 Tom 2022-01-02
10 3 Tom 2022-01-03
11 3 Tom 2022-01-04
12 4 Beth 2022-01-01
13 4 Beth 2022-01-02
14 4 Beth 2022-01-03
15 4 Beth 2022-01-04
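As a side note, for the question's actual range you can generate the DATE frame with pd.date_range instead of typing the dates out (a sketch, reusing the df defined above):
dates = pd.DataFrame({'DATE': pd.date_range('2019-02-11', '2019-02-15', freq='D')})
out = df.merge(dates, how='cross')
print(out)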

How can I compute the most recent value from a column in a second dataset for each individual?

I have a pandas dataframe values that looks like:
person | date | value
-------|------------|------
A | 01-01-2020 | 1
A | 01-08-2020 | 2
A | 01-12-2020 | 3
B | 01-02-2020 | 4
B | 01-05-2020 | 5
B | 01-06-2020 | 6
And another dataframe encounters that looks like:
person | date
-------|------------
A | 01-01-2020
A | 01-03-2020
A | 01-06-2020
A | 01-11-2020
A | 01-12-2020
A | 01-15-2020
B | 01-01-2020
B | 01-04-2020
B | 01-06-2020
B | 01-08-2020
B | 01-09-2020
B | 01-10-2020
What I'd like to end up with is a merged dataframe that adds a third column to the encounters dataset with the most recent value of value for the corresponding person (shown below). Is there a straightforward way to do this in pandas?
person | date | most_recent_value
-------|------------|-------------------
A | 01-01-2020 | 1
A | 01-03-2020 | 1
A | 01-06-2020 | 1
A | 01-11-2020 | 2
A | 01-12-2020 | 3
A | 01-15-2020 | 3
B | 01-01-2020 | None
B | 01-04-2020 | 4
B | 01-06-2020 | 6
B | 01-08-2020 | 6
B | 01-09-2020 | 6
B | 01-10-2020 | 6
This is essentially merge_asof: with the default direction='backward', each encounter row is matched with the last values row for the same person whose date is on or before the encounter date:
import pandas as pd
import numpy as np

values['date'] = pd.to_datetime(values['date'])
encounters['date'] = pd.to_datetime(encounters['date'])

# merge_asof needs both frames sorted by the "on" key; the temporary "rank"
# column is only there to restore the original row order afterwards
(pd.merge_asof(encounters.assign(rank=np.arange(encounters.shape[0]))
                         .sort_values('date'),
               values.sort_values('date'),
               by='person', on='date')
   .sort_values('rank')
   .drop('rank', axis=1)
)
Output:
person date value
0 A 2020-01-01 1.0
2 A 2020-01-03 1.0
4 A 2020-01-06 1.0
9 A 2020-01-11 2.0
10 A 2020-01-12 3.0
11 A 2020-01-15 3.0
1 B 2020-01-01 NaN
3 B 2020-01-04 4.0
5 B 2020-01-06 6.0
6 B 2020-01-08 6.0
7 B 2020-01-09 6.0
8 B 2020-01-10 6.0
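For reference, the two input frames can be reconstructed from the question's tables like this (dates assumed to be MM-DD-YYYY as written), in case you want to run the snippet above:
values = pd.DataFrame({
    'person': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': ['01-01-2020', '01-08-2020', '01-12-2020',
             '01-02-2020', '01-05-2020', '01-06-2020'],
    'value': [1, 2, 3, 4, 5, 6],
})
encounters = pd.DataFrame({
    'person': ['A'] * 6 + ['B'] * 6,
    'date': ['01-01-2020', '01-03-2020', '01-06-2020', '01-11-2020', '01-12-2020', '01-15-2020',
             '01-01-2020', '01-04-2020', '01-06-2020', '01-08-2020', '01-09-2020', '01-10-2020'],
})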

Python rolling period returns

I need to develop a rolling 6-month return on the following dataframe
date Portfolio Performance
2001-11-30 1.048134
2001-12-31 1.040809
2002-01-31 1.054187
2002-02-28 1.039920
2002-03-29 1.073882
2002-04-30 1.100327
2002-05-31 1.094338
2002-06-28 1.019593
2002-07-31 1.094096
2002-08-30 1.054130
2002-09-30 1.024051
2002-10-31 0.992017
A lot of answers to previous questions describe rolling average returns, which I can do. However, I am not looking for the average. What I need is the following example formula for a rolling 6-month return:
(1.100327 - 1.048134)/1.100327
The formula would then consider the next 6-month block between 2001-12-31 and 2002-05-31 and continue through to the end of the dataframe.
I've tried the following, but it doesn't give the right answer.
portfolio['rolling'] = portfolio['Portfolio Performance'].rolling(window=6).apply(np.prod) - 1
Expected output would be:
date Portfolio Performance Rolling
2001-11-30 1.048134 NaN
2001-12-31 1.040809 NaN
2002-01-31 1.054187 NaN
2002-02-28 1.039920 NaN
2002-03-29 1.073882 NaN
2002-04-30 1.100327 0.0520
2002-05-31 1.094338 0.0422
2002-06-28 1.019593 -0.0280
The current output is:
Portfolio Performance rolling
date
2001-11-30 1.048134 NaN
2001-12-31 1.040809 NaN
2002-01-31 1.054187 NaN
2002-02-28 1.039920 NaN
2002-03-29 1.073882 NaN
2002-04-30 1.100327 0.413135
2002-05-31 1.094338 0.475429
2002-06-28 1.019593 0.445354
2002-07-31 1.094096 0.500072
2002-08-30 1.054130 0.520569
2002-09-30 1.024051 0.450011
2002-10-31 0.992017 0.307280
I simply added a column shifted by 6 months and applied the formula presented. Does this meet the intent of the question?
df['before_6m'] = df['Portfolio Performance'].shift(6)
df['rolling'] = (df['Portfolio Performance'] - df['before_6m'])/df['Portfolio Performance']
df
| | date | Portfolio Performance | before_6m | rolling |
|---:|:--------------------|------------------------:|------------:|------------:|
| 0 | 2001-11-30 00:00:00 | 1.04813 | nan | nan |
| 1 | 2001-12-31 00:00:00 | 1.04081 | nan | nan |
| 2 | 2002-01-31 00:00:00 | 1.05419 | nan | nan |
| 3 | 2002-02-28 00:00:00 | 1.03992 | nan | nan |
| 4 | 2002-03-29 00:00:00 | 1.07388 | nan | nan |
| 5 | 2002-04-30 00:00:00 | 1.10033 | nan | nan |
| 6 | 2002-05-31 00:00:00 | 1.09434 | 1.04813 | 0.042221 |
| 7 | 2002-06-28 00:00:00 | 1.01959 | 1.04081 | -0.0208083 |
| 8 | 2002-07-31 00:00:00 | 1.0941 | 1.05419 | 0.0364767 |
| 9 | 2002-08-30 00:00:00 | 1.05413 | 1.03992 | 0.0134803 |
| 10 | 2002-09-30 00:00:00 | 1.02405 | 1.07388 | -0.0486607 |
| 11 | 2002-10-31 00:00:00 | 0.992017 | 1.10033 | -0.109182 |
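As a side note, the same result can be computed without the helper column; this one-liner is algebraically equivalent to the formula above:
# (current - value 6 rows earlier) / current, rewritten as 1 - shifted / current
df['rolling'] = 1 - df['Portfolio Performance'].shift(6) / df['Portfolio Performance']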

Calculate streak in pandas without apply

I have a DataFrame like this:
date | type | column1
----------------------------
2019-01-01 | A | 1
2019-02-01 | A | 1
2019-03-01 | A | 1
2019-04-01 | A | 0
2019-05-01 | A | 1
2019-06-01 | A | 1
2019-07-01 | B | 1
2019-08-01 | B | 1
2019-09-01 | B | 0
I want to have a column called "streak" that has a streak, but grouped by column "type":
date | type | column1 | streak
-------------------------------------
2019-01-01 | A | 1 | 1
2019-02-01 | A | 1 | 2
2019-03-01 | A | 1 | 3
2019-04-01 | A | 0 | 0
2019-05-01 | A | 1 | 1
2019-06-01 | A | 1 | 2
2019-07-01 | B | 1 | 1
2019-08-01 | B | 1 | 2
2019-09-01 | B | 0 | 0
I managed to do it like this:
def streak(df):
    grouper = (df.column1 != df.column1.shift(1)).cumsum()
    df['streak'] = df.groupby(grouper).cumsum()['column1']
    return df

df = df.groupby(['type']).apply(streak)
But I'm wondering if it's possible to do it inline without using a groupby and apply, because my DataFrame contains about 100M rows and it takes several hours to process.
Any ideas on how to optimize this for speed?
You want the cumsum of 'column1', grouping by 'type' plus the cumsum of a Boolean Series that resets the grouping at every 0.
df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()
date type column1 streak
0 2019-01-01 A 1 1
1 2019-02-01 A 1 2
2 2019-03-01 A 1 3
3 2019-04-01 A 0 0
4 2019-05-01 A 1 1
5 2019-06-01 A 1 2
6 2019-07-01 B 1 1
7 2019-08-01 B 1 2
8 2019-09-01 B 0 0
IIUC, this is what you need.
m = df.column1.ne(df.column1.shift()).cumsum()
df['streak'] = df.groupby([m, 'type'])['column1'].cumsum()
Output
date type column1 streak
0 1/1/2019 A 1 1
1 2/1/2019 A 1 2
2 3/1/2019 A 1 3
3 4/1/2019 A 0 0
4 5/1/2019 A 1 1
5 6/1/2019 A 1 2
6 7/1/2019 B 1 1
7 8/1/2019 B 1 2
8 9/1/2019 B 0 0
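For reference, a quick sketch (frame reconstructed from the question's table) confirming that the two variants agree on this data:
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2019-01-01', periods=9, freq='MS'),
    'type': ['A'] * 6 + ['B'] * 3,
    'column1': [1, 1, 1, 0, 1, 1, 1, 1, 0],
})

s1 = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()
m = df.column1.ne(df.column1.shift()).cumsum()
s2 = df.groupby([m, 'type'])['column1'].cumsum()
print(s1.equals(s2))  # True on this sample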
