dataframe backfill with max value - python

I have a dataframe like this:
[screenshot of the input dataframe]
I want to backfill each row where date_activity is 1/1/2000 12:00:00 with the max date_activity for that row's item_id. In the end, I want something like this, using pandas:
[screenshot of the desired output]

Create missing values with Series.duplicated and Series.mask, then backfill them:
import pandas as pd

df = pd.DataFrame({'item_id': [1, 1, 1, 2, 2, 2, 2],
                   'date_active': pd.date_range('2019-02-02', periods=7)})
print (df)
item_id date_active
0 1 2019-02-02
1 1 2019-02-03
2 1 2019-02-04
3 2 2019-02-05
4 2 2019-02-06
5 2 2019-02-07
6 2 2019-02-08
df['date_active'] = df['date_active'].mask(df['item_id'].duplicated(keep='last')).bfill()
print (df)
item_id date_active
0 1 2019-02-04
1 1 2019-02-04
2 1 2019-02-04
3 2 2019-02-08
4 2 2019-02-08
5 2 2019-02-08
6 2 2019-02-08
Details:
print (df['item_id'].duplicated(keep='last'))
0 True
1 True
2 False
3 True
4 True
5 True
6 False
Name: item_id, dtype: bool
print (df['date_active'].mask(df['item_id'].duplicated(keep='last')))
0 NaT
1 NaT
2 2019-02-04
3 NaT
4 NaT
5 NaT
6 2019-02-08
Name: date_active, dtype: datetime64[ns]
EDIT: With real data, sort the values first so that the last value per group is the maximum:
print (df)
item_id date_active
0 1 7/26/2019 17:06
1 1 8/27/2019 17:06
df['date_active'] = pd.to_datetime(df['date_active'])
df = df.sort_values(['item_id','date_active'])
df['date_active'] = df['date_active'].mask(df['item_id'].duplicated(keep='last')).bfill()
print (df)
item_id date_active
0 1 2019-08-27 17:06:00
1 1 2019-08-27 17:06:00
EDIT1: Use DataFrame.resample to add the missing datetimes per group:
df['date_active'] = pd.to_datetime(df['date_active'])
df = df.sort_values(['item_id','date_active'])
df = (df.set_index('date_active').groupby('item_id')
        .resample('D')
        .last()
        .drop('item_id', axis=1)
        .reset_index())
df['date_active'] = df['date_active'].mask(df['item_id'].duplicated(keep='last')).bfill()
print (df.tail())
item_id date_active
28 1 2019-08-27
29 1 2019-08-27
30 1 2019-08-27
31 1 2019-08-27
32 1 2019-08-27
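For the exact scenario in the question, where the rows already exist but carry a literal placeholder of 1/1/2000 12:00:00, a minimal sketch using GroupBy.transform('max') may be more direct. The column name date_activity and the placeholder value are taken from the question; everything else is illustrative:
import pandas as pd

placeholder = pd.Timestamp('2000-01-01 12:00:00')

df['date_activity'] = pd.to_datetime(df['date_activity'])
# maximum date_activity per item_id, broadcast back onto every row of the group
group_max = df.groupby('item_id')['date_activity'].transform('max')
# overwrite only the placeholder rows with their group's maximum
df['date_activity'] = df['date_activity'].mask(df['date_activity'].eq(placeholder), group_max)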

Calculate how many touch points the customer had in X months

I have a problem. Starting from a given date, for example 2022-06-01, I want to calculate how many touches the customer with customerId == 1 had in the previous 6 months (here, the touches on 2022-05-25 and 2022-05-20). I have already calculated count_from_date, the date up to which data should be taken into account. However, I don't know how to group by customer and count, for each row, how many touches fall between count_from_date and that row's date.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
     'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
                  "2022-06-02", "2021-03-01", "2021-02-01"]
     }
df = pd.DataFrame(data=d)
print(df)
from datetime import date
from dateutil.relativedelta import relativedelta
def find_last_date(date):
    six_months = date + relativedelta(months=-6)
    return six_months
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['count_from_date'] = df['fromDate'].apply(lambda x: find_last_date(x))
print(df)
What I have
customerId fromDate count_from_date
0 1 2022-06-01 2021-12-01
1 1 2022-05-25 2021-11-25
2 1 2022-05-25 2021-11-25
3 1 2022-05-20 2021-11-20
4 1 2021-09-05 2021-03-05
5 2 2022-06-02 2021-12-02
6 3 2021-03-01 2020-09-01
7 3 2021-02-01 2020-08-01
What I want
  customerId   fromDate count_from_date  occur_last_6_months
0          1 2022-06-01      2021-12-01  3  # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1          1 2022-05-25      2021-11-25  1  # 2022-05-20 = 1
2          1 2022-05-25      2021-11-25  1  # 2022-05-20 = 1
3          1 2022-05-20      2021-11-20  0  # no touches in the last 6 months
4          1 2021-09-05      2021-03-05  0  # no touches in the last 6 months
5          2 2022-06-02      2021-12-02  0  # no touches in the last 6 months
6          3 2021-03-01      2020-09-01  1  # 2021-02-01 = 1
7          3 2021-02-01      2020-08-01  0  # no touches in the last 6 months
You can try grouping by customerId and looping through the rows of each subgroup to count how many fromDate values fall between count_from_date and fromDate:
def count(g):
    m = pd.concat([g['fromDate'].between(d1, d2, 'neither')
                   for d1, d2 in zip(g['count_from_date'], g['fromDate'])], axis=1)
    g = g.assign(occur_last_6_months=m.sum().tolist())
    return g
out = df.groupby('customerId').apply(count)
print(out)
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
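If the per-row list comprehension becomes slow on larger groups, a rough vectorized sketch of the same per-group logic (strictly between count_from_date and fromDate, matching between(..., 'neither')) can use NumPy broadcasting; the function name count_vectorized is just illustrative:
import numpy as np

def count_vectorized(g):
    dates = g['fromDate'].to_numpy()
    starts = g['count_from_date'].to_numpy()
    # hits[i, j] is True when touch j lies strictly inside row i's window
    hits = (dates[None, :] > starts[:, None]) & (dates[None, :] < dates[:, None])
    return g.assign(occur_last_6_months=hits.sum(axis=1))

out = df.groupby('customerId', group_keys=False).apply(count_vectorized)
print(out)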
For this problem, the challenge for a performant solution is to reshape the data into a structure that supports rolling window operations.
First of all, we need to avoid duplicate indices. In your case, this means aggregating multiple touch points that fall on the same day:
>>> df = df.groupby(['customerId', 'fromDate'], as_index=False).count()
customerId fromDate count_from_date
0 1 2021-09-05 1
1 1 2022-05-20 1
2 1 2022-05-25 2
3 1 2022-06-01 1
4 2 2022-06-02 1
5 3 2021-02-01 1
6 3 2021-03-01 1
Now, we can set the index to fromDate, sort it and groupby customerId so that we can use rolling windows. I use a 180D rolling window here (roughly 6 months):
>>> roll_df = (df.set_index(['fromDate'])
...              .sort_index()
...              .groupby('customerId')
...              .apply(lambda s: s['count_from_date'].rolling('180D').sum()))
The sort_index step is important to ensure your data is monotonically increasing.
However, this also counts the touch on the day itself, which seems not what you want, so we remove 1 from the result:
>>> roll_df - 1
customerId fromDate
1 2021-09-05 0.0
2022-05-20 0.0
2022-05-25 2.0
2022-06-01 3.0
2 2022-06-02 0.0
3 2021-02-01 0.0
2021-03-01 1.0
Name: count_from_date, dtype: float64
Finally, we divide by the initial counts to get back to the original structure:
>>> (roll_df - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
customerId fromDate count_from_date
0 1 2021-09-05 0.0
1 1 2022-05-20 0.0
2 1 2022-05-25 1.0
3 1 2022-06-01 3.0
4 2 2022-06-02 0.0
5 3 2021-02-01 0.0
6 3 2021-03-01 1.0
You can always .reset_index() at the end.
The one-liner solution is:
(df.set_index(['fromDate'])
   .sort_index()
   .groupby('customerId')
   .apply(lambda s: s['count_from_date'].rolling('180D').sum())
 - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
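Building on the .reset_index() remark above, here is one possible way (a sketch, not from the original answer) to turn the final Series into a tidy frame with a named column; occur_last_6_months is the column name used in the question:
result = (roll_df - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
result = result.rename('occur_last_6_months').reset_index()
print(result)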

interpolation of missing values, not NA

I want to interpolate (linear interpolation) my data, but there are no NA values to fill: the rows are simply missing.
Here is my data, with many missing timestamps.
    timestamp  id  strength
1383260400000   1  -0.3803901328171995
1383261000000   1  -0.42196042219455937
1383265200000   1  -0.460714706261982
My expected output:
    timestamp  id  strength
1383260400000   1  -0.3803901328171995
1383261000000   1  -0.42196042219455937
1383261600000   1  Linear interpolated data
1383262200000   1  Linear interpolated data
1383262800000   1  Linear interpolated data
1383263400000   1  Linear interpolated data
1383264000000   1  Linear interpolated data
1383264600000   1  Linear interpolated data
1383265200000   1  -0.460714706261982
The timestamps start at 1383260400000 and end at 1383343800000, and the other ids (from 1 to 2025) have the same issue.
The idea is to create datetimes, convert them to a DatetimeIndex, and in a lambda function add the missing datetimes with Series.asfreq and then interpolate:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
f = lambda x: x.asfreq('10Min').interpolate()
df = df.set_index('timestamp').groupby('id')['strength'].apply(f).reset_index()
print (df)
id timestamp strength
0 1 2013-10-31 23:00:00 -0.380390
1 1 2013-10-31 23:10:00 -0.421960
2 1 2013-10-31 23:20:00 -0.427497
3 1 2013-10-31 23:30:00 -0.433033
4 1 2013-10-31 23:40:00 -0.438569
5 1 2013-10-31 23:50:00 -0.444106
6 1 2013-11-01 00:00:00 -0.449642
7 1 2013-11-01 00:10:00 -0.455178
8 1 2013-11-01 00:20:00 -0.460715
Last, if you need the original timestamp format:
import numpy as np

df['timestamp'] = df['timestamp'].astype(np.int64) // 1000000
print (df)
id timestamp strength
0 1 1383260400000 -0.380390
1 1 1383261000000 -0.421960
2 1 1383261600000 -0.427497
3 1 1383262200000 -0.433033
4 1 1383262800000 -0.438569
5 1 1383263400000 -0.444106
6 1 1383264000000 -0.449642
7 1 1383264600000 -0.455178
8 1 1383265200000 -0.460715
EDIT:
#data from question
df = pd.DataFrame({'timestamp': [1383260400000, 1383261000000, 1383265200000],
                   'id': [1, 1, 1],
                   'strength': [-0.3803901328171995, -0.4219604221945593, -0.460714706261982]})
print (df)
timestamp id strength
0 1383260400000 1 -0.380390
1 1383261000000 1 -0.421960
2 1383265200000 1 -0.460715
The solution creates all datetimes for each id with date_range, creates the missing values with DataFrame.reindex and a MultiIndex, and finally interpolates per id:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
r = pd.date_range(pd.to_datetime(1383260400000, unit='ms'),
                  pd.to_datetime(1383343800000, unit='ms'),
                  freq='10Min')
ids = df['id'].unique()
mux = pd.MultiIndex.from_product([r, ids], names=['timestamp','id'])
f = lambda x: x.interpolate()
df = (df.set_index(['timestamp', 'id'])
        .reindex(mux)
        .groupby('id')['strength']
        .transform(f)
        .reset_index())
print (df)
timestamp id strength
0 2013-10-31 23:00:00 1 -0.380390
1 2013-10-31 23:10:00 1 -0.421960
2 2013-10-31 23:20:00 1 -0.427497
3 2013-10-31 23:30:00 1 -0.433033
4 2013-10-31 23:40:00 1 -0.438569
.. ... .. ...
135 2013-11-01 21:30:00 1 -0.460715
136 2013-11-01 21:40:00 1 -0.460715
137 2013-11-01 21:50:00 1 -0.460715
138 2013-11-01 22:00:00 1 -0.460715
139 2013-11-01 22:10:00 1 -0.460715
[140 rows x 3 columns]
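A small caveat on the tail of that output (not raised in the original answer): by default interpolate() also fills the values after the last observation per id, which is why the final rows all repeat -0.460715. If only the gaps between observed values should be filled, interpolate accepts limit_area='inside'; a sketch, using the same variables as above:
# only fill NaNs that sit between observed values; leading/trailing gaps stay NaN
f = lambda x: x.interpolate(limit_area='inside')
and then re-run the reindex/transform pipeline above with this f.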

How to create an indicator column of the first occurrence of a variable of groupby ID sorted by date?

I have some hospital visit healthcare data in a dataframe of the form:
CLIENT_ID DATE_ENCOUNTER DATE_COUNSELLING COUNSELLING_COUNT
    54950     2017-11-24              NaN                 0
    54950     2018-01-19              NaN                 0
    54950     2018-03-13              NaN                 0
    54950     2018-05-11       2018-04-30                 1
    54950     2018-12-17       2018-06-25                 3
    67777     2015-09-01              NaN                 0
    67777     2015-12-01              NaN                 0
    67777     2016-02-28       2016-02-28                 1
    70000     2019-06-07       2019-06-07                 1
    70000     2019-08-09       2019-06-07                 1
I want to create a column COUNSELLING_STARTED which indicates when a client (CLIENT_ID) has started counselling, but flags only the first time, i.e. the first occurrence where COUNSELLING_COUNT == 1 for each CLIENT_ID. This should result in the dataframe below:
CLIENT_ID DATE_ENCOUNTER DATE_COUNSELLING COUNSELLING_COUNT COUNSELLING_STARTED
    54950     2017-11-24              NaN                 0                   0
    54950     2018-01-19              NaN                 0                   0
    54950     2018-03-13              NaN                 0                   0
    54950     2018-05-11       2018-04-30                 1                   1
    54950     2018-12-17       2018-06-25                 3                   0
    67777     2015-09-01              NaN                 0                   0
    67777     2015-12-01              NaN                 0                   0
    67777     2016-02-28       2016-02-28                 1                   1
    70000     2019-06-07       2019-06-07                 1                   1
    70000     2019-08-09       2019-06-07                 1                   0
Below is the code to generate the dataframe:
import numpy as np
import pandas as pd

data = {'CLIENT_ID': [54950, 54950, 54950, 54950, 54950, 67777, 67777, 67777, 70000, 70000],
        'DATE_ENCOUNTER': ['2017-11-24', '2018-01-19', '2018-03-13', '2018-05-11', '2018-12-17',
                           '2015-09-01', '2015-12-01', '2016-02-28', '2019-06-07', '2019-08-09'],
        'DATE_COUNSELLING': [np.nan, np.nan, np.nan, '2018-04-30', '2018-06-25',
                             np.nan, np.nan, '2016-02-28', '2019-06-07', '2019-06-07'],
        'COUNSELLING_COUNT': [0, 0, 0, 1, 3, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)
Update
In my original answer, I had missed the fact that if someone has no counseling dates, my method would assign a 1 to their first entry. Here are two quick ways to fix that.
One option is to explicitly drop the rows with NA before you do the groupby I describe below:
dropped = df[~df['DATE_COUNSELLING'].isna()]
df.loc[:, 'COUNSELLING_STARTED'] = 0
df.loc[dropped['DATE_COUNSELLING'].isna().groupby(dropped['CLIENT_ID']).idxmin(), 'COUNSELLING_STARTED'] = 1
# note that `dropped` is used inside the brackets in the last line
The second option is to do exactly what I had before, but then overwrite the erroneous entries (i.e., where the counseling is NA):
df.loc[:, 'COUNSELLING_STARTED'] = 0
df.loc[df['DATE_COUNSELLING'].isna().groupby(df['CLIENT_ID']).idxmin(), 'COUNSELLING_STARTED'] = 1
df.loc[df['DATE_COUNSELLING'].isna(), 'COUNSELLING_STARTED'] = 0
# last line catches people with no counseling
This was my original answer:
df.loc[:, 'COUNSELLING_STARTED'] = 0
df.loc[df['DATE_COUNSELLING'].isna().groupby(df['CLIENT_ID']).idxmin(), 'COUNSELLING_STARTED'] = 1
Explanation (using my first approach):
Find where the counseling dates are nan; then groupby the client IDs and find the index of the minimum (which will be the first entry):
>>> dropped['DATE_COUNSELLING'].isna().groupby(dropped['CLIENT_ID']).idxmin()
CLIENT_ID
54950 3
67777 7
70000 8
Name: DATE_COUNSELLING, dtype: int64
You are using these indices to choose where to write 1 in the new column. And even though dropped does not have any NA values, we still use .isna() in the groupby in order to get a value that we can take a min on (instead of a string). You could also do something like .astype(bool).
The final df is then:
CLIENT_ID DATE_ENCOUNTER ... COUNSELLING_COUNT COUNSELLING_STARTED
0 54950 2017-11-24 ... 0 0
1 54950 2018-01-19 ... 0 0
2 54950 2018-03-13 ... 0 0
3 54950 2018-05-11 ... 1 1
4 54950 2018-12-17 ... 3 0
5 67777 2015-09-01 ... 0 0
6 67777 2015-12-01 ... 0 0
7 67777 2016-02-28 ... 1 1
8 70000 2019-06-07 ... 1 1
9 70000 2019-08-09 ... 1 0
[10 rows x 5 columns]
If you wanted to instead explicitly select the earliest counseling date (rather than the first non-NA value), you could instead use this as your indexer:
>>> pd.to_datetime(dropped['DATE_COUNSELLING']).groupby(dropped['CLIENT_ID']).idxmin()
CLIENT_ID
54950 3
67777 7
70000 8
Name: DATE_COUNSELLING, dtype: int64
Which gives the same result here since the dates are sorted for each client (i.e. the earliest observed date is the first non-NA value).
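As a possible alternative sketch (not part of the original answer), the same flag can be built without idxmin by cumulatively counting non-null counselling dates per client: the first non-null row of each CLIENT_ID is the one whose running count equals 1, and clients with no counselling dates are never flagged. This assumes the rows are already ordered by DATE_ENCOUNTER within each client, as in the sample data:
has_date = df['DATE_COUNSELLING'].notna()
# running count of counselling dates seen so far within each client
running = has_date.groupby(df['CLIENT_ID']).cumsum()
df['COUNSELLING_STARTED'] = (has_date & running.eq(1)).astype(int)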

Subtract previous row from preceding row by group WITH condition

I have a data frame
Count ID Date
1 1 2020-07-09
2 1 2020-07-11
1 1 2020-07-21
1 2 2020-07-04
2 2 2020-07-09
3 2 2020-07-18
1 3 2020-07-02
2 3 2020-07-05
1 3 2020-07-19
2 3 2020-07-22
Within each ID group, I want to subtract from each row's date the date of the previous row that has the same Count. Rows without a matching previous Count get a value of zero.
Expected output
ID Date Days
1 2020-07-09 0
1 2020-07-11 0
1 2020-07-21 12 (2020-07-21 MINUS 2020-07-09)
2 2020-07-04 0
2 2020-07-09 0
2 2020-07-18 0
3 2020-07-02 0
3 2020-07-05 0
3 2020-07-19 17 (2020-07-19 MINUS 2020-07-02)
3 2020-07-22 17 (2020-07-22 MINUS 2020-07-05)
My initial thought was to filter by Count-ID pairs and then do the calculation. I was wondering if there is a better way around this?
You can use .groupby() on the columns ID and Count, get the difference in days with .diff(), and fill the NaN values with 0 using .fillna(), as follows:
df['Date'] = pd.to_datetime(df['Date']) # convert to datetime if not already in datetime format
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0, downcast='infer')
Result:
print(df)
Count ID Date Days
0 1 1 2020-07-09 0
1 2 1 2020-07-11 0
2 1 1 2020-07-21 12
3 1 2 2020-07-04 0
4 2 2 2020-07-09 0
5 3 2 2020-07-18 0
6 1 3 2020-07-02 0
7 2 3 2020-07-05 0
8 1 3 2020-07-19 17
9 2 3 2020-07-22 17
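One small note on the .fillna(0, downcast='infer') above: I believe the downcast keyword of fillna is deprecated in recent pandas releases, so an equivalent sketch without it would be:
df['Days'] = (df.groupby(['ID', 'Count'])['Date'].diff()
                .dt.days.fillna(0).astype(int))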
I like SeaBean's answer, but here is what I was working on before I saw that
df2 = df.sort_values(by = ['ID', 'Count'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['shift1'] = df2.groupby(['ID', 'Count'])['Date'].shift(1)
df2['diff'] = (df2.Date - df2.shift1.combine_first(df2.Date)).dt.days
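If you then want the result back in the original row order as a Days column, a possible finishing step (a sketch) is to assign the diff back to df, which aligns on the index that sort_values preserved:
df['Days'] = df2['diff']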

Pandas - Counting the number of days for group by

I want to count the number of days after grouping by 2 columns:
groups = df.groupby([df.col1,df.col2])
Now i want to count the number of days relevant for each group:
result = groups['date_time'].dt.date.nunique()
I'm using something similar when I want to group by day, but here I get an error:
AttributeError: Cannot access attribute 'dt' of 'SeriesGroupBy' objects, try using the 'apply' method
What is the proper way to get the number of days?
You need another variation of groupby - define the column first:
df['date_time'].dt.date.groupby([df.col1, df.col2]).nunique()
Or use apply:
df.groupby(['col1','col2'])['date_time'].apply(lambda x: x.dt.date.nunique())
Or create a helper column:
df['date_time1'] = df['date_time'].dt.date
a = df.groupby([df.col1, df.col2]).date_time1.nunique()
Sample:
import pandas as pd

start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10, freq='15H')
df = pd.DataFrame({'date_time': rng, 'col1': [0]*5 + [1]*5, 'col2': [2]*3 + [3]*4 + [4]*3})
print (df)
col1 col2 date_time
0 0 2 2015-02-24 00:00:00
1 0 2 2015-02-24 15:00:00
2 0 2 2015-02-25 06:00:00
3 0 3 2015-02-25 21:00:00
4 0 3 2015-02-26 12:00:00
5 1 3 2015-02-27 03:00:00
6 1 3 2015-02-27 18:00:00
7 1 4 2015-02-28 09:00:00
8 1 4 2015-03-01 00:00:00
9 1 4 2015-03-01 15:00:00
#solution with apply
df1 = df.groupby(['col1','col2'])['date_time'].apply(lambda x: x.dt.date.nunique())
print (df1)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time, dtype: int64
#create new helper column
df['date_time1'] = df['date_time'].dt.date
df2 = df.groupby([df.col1,df.col2]).date_time1.nunique()
print (df2)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time, dtype: int64
df3 = df['date_time'].dt.date.groupby([df.col1,df.col2]).nunique()
print (df3)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time, dtype: int64
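As one more possible variant (a sketch, not from the original answers), the day extraction can be done once with assign, which avoids both apply and mutating df with a helper column:
df4 = (df.assign(day=df['date_time'].dt.date)
         .groupby(['col1', 'col2'])['day']
         .nunique())
print(df4)
This produces the same counts as the variants above.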
