How can I compute the most recent value from a column in a second dataset for each individual? - python

I have a pandas dataframe values that looks like:
person | date | value
-------|------------|------
A | 01-01-2020 | 1
A | 01-08-2020 | 2
A | 01-12-2020 | 3
B | 01-02-2020 | 4
B | 01-05-2020 | 5
B | 01-06-2020 | 6
And another dataframe encounters that looks like:
person | date
-------|------------
A | 01-01-2020
A | 01-03-2020
A | 01-06-2020
A | 01-11-2020
A | 01-12-2020
A | 01-15-2020
B | 01-01-2020
B | 01-04-2020
B | 01-06-2020
B | 01-08-2020
B | 01-09-2020
B | 01-10-2020
What I'd like to end up with is a merged dataframe that adds a third column to the encounters dataset with the most recent value of value for the corresponding person (shown below). Is there a straightforward way to do this in pandas?
person | date | most_recent_value
-------|------------|-------------------
A | 01-01-2020 | 1
A | 01-03-2020 | 1
A | 01-06-2020 | 1
A | 01-11-2020 | 2
A | 01-12-2020 | 3
A | 01-15-2020 | 3
B | 01-01-2020 | None
B | 01-04-2020 | 4
B | 01-06-2020 | 6
B | 01-08-2020 | 6
B | 01-09-2020 | 6
B | 01-10-2020 | 6

This is essentially merge_asof:
import numpy as np
import pandas as pd

values['date'] = pd.to_datetime(values['date'])
encounters['date'] = pd.to_datetime(encounters['date'])

# Tag each encounter with its original position so the original row order can
# be restored after merge_asof, which requires both frames sorted by the key.
result = (pd.merge_asof(encounters.assign(rank=np.arange(encounters.shape[0]))
                                  .sort_values('date'),
                        values.sort_values('date'),
                        by='person', on='date')
            .sort_values('rank')
            .drop('rank', axis=1))
result
Output:
person date value
0 A 2020-01-01 1.0
2 A 2020-01-03 1.0
4 A 2020-01-06 1.0
9 A 2020-01-11 2.0
10 A 2020-01-12 3.0
11 A 2020-01-15 3.0
1 B 2020-01-01 NaN
3 B 2020-01-04 4.0
5 B 2020-01-06 6.0
6 B 2020-01-08 6.0
7 B 2020-01-09 6.0
8 B 2020-01-10 6.0
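
To match the column name in the desired output above, the merged value column can simply be renamed:
result = result.rename(columns={'value': 'most_recent_value'})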

How to reindex a datetime-based multiindex in pandas

I have a dataframe that counts the number of times an event has occurred per user per day. Users may have 0 events on a given day and, since the table is an aggregate of a raw event log, rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week, so that each user has one entry per week (including 0 where applicable).
Here is an example of my input:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})
# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
However, according to the docs of reindex, I should be able to pass level=1 as a kwarg directly, without the extra groupby. When I do this, though, I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every level of my desired index occurs in my current index (and vice versa).
You need to reindex by both levels of the MultiIndex:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print(result)
person_id date event_count
0 0 2022-01-02 9
1 0 2022-01-09 13
2 0 2022-01-16 0
3 0 2022-01-23 0
4 1 2022-01-02 0
5 1 2022-01-09 11
6 1 2022-01-16 5
7 1 2022-01-23 0
8 2 2022-01-02 0
9 2 2022-01-09 0
10 2 2022-01-16 20
11 2 2022-01-23 0
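
For completeness, a minimal alternative sketch via the unstack route from the linked question: it can work here too if the columns (rather than the row index) are reindexed against desired_index before stacking back. (weekly and result are just illustrative names; the counts come back as floats after fillna, hence the cast.)
weekly = df.groupby(["person_id", pd.Grouper(key="date", freq="w")])["event_count"].sum()
result = (
    weekly.unstack("date")                 # weeks become columns
          .reindex(columns=desired_index)  # add the missing weeks as NaN columns
          .fillna(0)                       # absent (person, week) pairs -> 0
          .astype(int)
          .rename_axis("date", axis=1)
          .stack()
          .rename("event_count")
          .reset_index()
)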

Filling Missing Date Column using groupby method

I have a dataframe that looks something like:
+---+----+---------------+------------+------------+
| | id | date1 | date2 | days_ahead |
+---+----+---------------+------------+------------+
| 0 | 1 | 2021-10-21 | 2021-10-24 | 3 |
| 1 | 1 | 2021-10-22 | NaN | NaN |
| 2 | 1 | 2021-11-16 | 2021-11-24 | 8 |
| 3 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 4 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 5 | 3 | 2021-10-26 | 2021-10-31 | 5 |
| 6 | 3 | 2021-10-30 | 2021-11-04 | 5 |
| 7 | 3 | 2021-11-02 | NaN | NaN |
| 8 | 3 | 2021-11-04 | 2021-11-04 | 0 |
| 9 | 4 | 2021-10-28 | NaN | NaN |
+---+----+---------------+------------+------------+
I am trying to fill the missing date2 values using the median of days_ahead within each id group.
For example:
The median for id 1 is 5.5, which rounds to 6, so the filled value of date2 at index 1 should be 2021-10-28.
Similarly, the median for id 3 is 5, so the filled value of date2 at index 7 should be 2021-11-07.
And the median for id 4 is NaN, so the filled value of date2 at index 9 should be 2021-10-28.
I tried
df['date2'].fillna(df.groupby('id')['days_ahead'].transform('median'), inplace=True)
but this fills with int values. I could use lambda and apply to find those ints and turn them into dates, but how do I use groupby and fillna together directly?
You can round the group medians, convert them to timedeltas with to_timedelta, add them to date1 using the fill_value parameter, and use the result to replace the missing values:
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# per-id median of days_ahead, rounded and converted to a timedelta
td = pd.to_timedelta(df.groupby('id')['days_ahead'].transform('median').round(), unit='d')

# fill_value=Timedelta(0) makes an all-NaN median (id 4) fall back to date1 itself
df['date2'] = df['date2'].fillna(df['date1'].add(td, fill_value=pd.Timedelta(0)))
print(df)
id date1 date2 days_ahead
0 1 2021-10-21 2021-10-24 3.0
1 1 2021-10-22 2021-10-28 NaN
2 1 2021-11-16 2021-11-24 8.0
3 2 2021-10-22 2021-10-24 2.0
4 2 2021-10-22 2021-10-24 2.0
5 3 2021-10-26 2021-10-31 5.0
6 3 2021-10-30 2021-11-04 5.0
7 3 2021-11-02 2021-11-07 NaN
8 3 2021-11-04 2021-11-04 0.0
9 4 2021-10-28 2021-10-28 NaN
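
As a quick sanity check, these are the per-id medians driving the fill, matching the values worked out in the question (note that Series.round rounds half to even, so 5.5 becomes 6):
print(df.groupby('id')['days_ahead'].median())
# id
# 1    5.5
# 2    2.0
# 3    5.0
# 4    NaN
# Name: days_ahead, dtype: float64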

Is there a pandas function to concatenate, for example, three previous rows together (like a window of length three)?

For example, I have the DataFrame below:
df13 = pd.DataFrame(np.random.randint(1, 9, size=(5, 3)),
                    columns=['a', 'b', 'c'])
df13
a b c
0 8 5 2
1 5 7 7
2 3 7 5
3 7 7 7
4 2 2 6
and want
a b c a b c a b c
0 None None None None None None 8.00 5.00 2.00
1 None None None 8 5 2 5.00 7.00 7.00
2 8 5 2 5 7 7 3.00 7.00 5.00
3 5 7 7 3 7 5 7.00 7.00 7.00
4 3 7 5 7 7 7 2.00 2.00 6.00
5 7 7 7 2 2 6 nan nan nan
6 2 2 6 NaN NaN NaN nan nan nan
For example, row 2 has two previous rows.
I do that with this code:
def laa(df, previous_count):
    dfNone = pd.DataFrame({col: None for col in df.columns}, index=[0])
    df_tmp = df.copy()
    for x in range(1, previous_count + 1):
        df_tmp = pd.concat([dfNone, df_tmp])
        df_tmp = df_tmp.reset_index()
        del df_tmp['index']
        df = pd.concat([df_tmp, df], axis=1)
    return df
(The None rows must be removed.)
Doesn't pandas have a function to do that?
This will do the trick using the shift() and concat() functions in pandas:
df = pd.DataFrame(np.random.randint(1, 9, size=(5, 3)), columns=['a', 'b', 'c'])
df1 = pd.concat([df.shift(2), df.shift(1), df], axis=1)    # each row with its two predecessors
df2 = pd.concat([df, df.shift(-1), df.shift(-2)], axis=1)  # each row with its two successors
final_df = pd.concat([df1, df2]).drop_duplicates()         # stack both views, drop the overlap
Sample output:
If df is as follows:
+----+-----+-----+-----+
| | a | b | c |
|----+-----+-----+-----|
| 0 | 6 | 2 | 6 |
| 1 | 7 | 2 | 1 |
| 2 | 4 | 4 | 5 |
| 3 | 1 | 1 | 1 |
| 4 | 2 | 2 | 4 |
+----+-----+-----+-----+
Then, final_df would be :
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| | a | b | c | a | b | c | a | b | c |
|----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | nan | nan | nan | nan | nan | nan | 6 | 2 | 6 |
| 1 | nan | nan | nan | 6 | 2 | 6 | 7 | 2 | 1 |
| 2 | 6 | 2 | 6 | 7 | 2 | 1 | 4 | 4 | 5 |
| 3 | 7 | 2 | 1 | 4 | 4 | 5 | 1 | 1 | 1 |
| 4 | 4 | 4 | 5 | 1 | 1 | 1 | 2 | 2 | 4 |
| 3 | 1 | 1 | 1 | 2 | 2 | 4 | nan | nan | nan |
| 4 | 2 | 2 | 4 | nan | nan | nan | nan | nan | nan |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
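
A generic sketch of the same idea for an arbitrary window size (previous_count is an illustrative parameter; positive shifts produce the predecessor columns, so with previous_count = 2 this reproduces df1 above):
previous_count = 2
window = pd.concat([df.shift(i) for i in range(previous_count, -1, -1)], axis=1)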

Rolling quantiles over a column in pandas

I have a table as such
+------+------------+-------+
| Idx | date | value |
+------+------------+-------+
| A | 20/11/2016 | 10 |
| A | 21/11/2016 | 8 |
| A | 22/11/2016 | 12 |
| B | 20/11/2016 | 16 |
| B | 21/11/2016 | 18 |
| B | 22/11/2016 | 11 |
+------+------------+-------+
I'd like to add a new column 'rolling_quantile_value', based on the column 'value', that for each row computes a quantile over the past values of the corresponding Idx.
For the example above, if the quantile chosen is median, the output should look like this :
+------+------------+-------+-----------------------+
| Idx | date | value | rolling_median_value |
+------+------------+-------+-----------------------+
| A | 20/11/2016 | 10 | NaN |
| A | 21/11/2016 | 8 | 10 |
| A | 22/11/2016 | 12 | 9 |
| A | 23/11/2016 | 14 | 10 |
| B | 20/11/2016 | 16 | NaN |
| B | 21/11/2016 | 18 | 16 |
| B | 22/11/2016 | 11 | 17 |
+------+------------+-------+-----------------------+
I've done it the naive way, with a function that builds the column row by row from the preceding values and flags the jump from one Idx to the next, but I'm sure that's neither the most efficient nor the most elegant way to do it.
Looking forward to your suggestions !
I think you want expanding:
df['rolling_median_value'] = (df.groupby('Idx', sort=False)
                                .expanding(1)['value']
                                .median()
                                .groupby(level=0)
                                .shift()
                                .reset_index(drop=True))
print(df)
Idx date value rolling_median_value
0 A 20/11/2016 10 NaN
1 A 21/11/2016 8 10.0
2 A 22/11/2016 12 9.0
3 A 23/11/2016 14 10.0
4 B 20/11/2016 16 NaN
5 B 21/11/2016 18 16.0
6 B 22/11/2016 11 17.0
UPDATE
df['rolling_quantile_value'] = (df.groupby('Idx', sort=False)
                                  .expanding(1)['value']
                                  .quantile(0.75)
                                  .groupby(level=0)
                                  .shift()
                                  .reset_index(drop=True))
print(df)
Idx date value rolling_quantile_value
0 A 20/11/2016 10 NaN
1 A 21/11/2016 8 10.0
2 A 22/11/2016 12 9.5
3 A 23/11/2016 14 11.0
4 B 20/11/2016 16 NaN
5 B 21/11/2016 18 16.0
6 B 22/11/2016 11 17.5
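
The shift() is what makes the statistic strictly backward-looking: each expanding median/quantile is pushed down one row within its group, so a row never sees its own value. An equivalent sketch using transform (same semantics, arguably more compact):
df['rolling_quantile_value'] = (
    df.groupby('Idx', sort=False)['value']
      .transform(lambda s: s.expanding().quantile(0.75).shift())
)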

Multiply dataframe's row non-NA values "element-wise" with list

Imagine we have a pandas.DataFrame like:
| na | na | 3 | 3 | 5  | 2  |
| na | 5  | 2 | 2 | 1  | na |
| 1  | 2  | 2 | 3 | na | na |
The idea is to multiply each row's non-NA values element-wise by a constant list such as const = [0, 1, 2, 3].
If a cell is NA, it should stay NA in the result:
| na | na | 0 | 3 | 10 | 6  |
| na | 0  | 2 | 4 | 3  | na |
| 0  | 2  | 4 | 9 | na | na |
Using cumsum and mul, which handles a variable number of NaNs per row and avoids stack (it works because const here is [0, 1, 2, 3], i.e. the 0-based position of each non-NA value within its row):
df.mul(df.notnull().cumsum(1).sub(1))
0 1 2 3 4 5
0 NaN NaN 0 3 10.0 6.0
1 NaN 0.0 2 4 3.0 NaN
2 0.0 2.0 4 9 NaN NaN
IIUC, you can stack (which drops the NaNs) and then unstack, assuming every row has exactly len(const) non-NA values:
df.stack().mul(np.tile(const, df.shape[0])).unstack()
Output:
0 1 2 3 4 5
0 NaN NaN 0.0 3.0 10.0 6.0
1 NaN 0.0 2.0 4.0 3.0 NaN
2 0.0 2.0 4.0 9.0 NaN NaN
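
A NumPy-based sketch of the same idea (same assumption as the stack approach: every row has exactly len(const) non-NA values; arr[mask] flattens row by row, so tiling const once per row lines up):
import numpy as np
import pandas as pd

const = [0, 1, 2, 3]
arr = df.to_numpy(dtype=float)
mask = ~np.isnan(arr)
arr[mask] *= np.tile(const, len(df))  # multiply only the non-NA cells, row by row
result = pd.DataFrame(arr, index=df.index, columns=df.columns)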
