Groupby over periods of time - python

I have a table which contains ids, dates, a target (potentially multi class but for now binary where 1 is a fail) and a yearmonth column based on the date column. Below are the first 8 rows of this table:
row  id  date        target  yearmonth
0    A   2015-03-16  0       2015-03
1    A   2015-05-29  1       2015-05
2    A   2015-08-02  1       2015-08
3    A   2015-09-05  1       2015-09
4    A   2015-09-22  0       2015-09
5    A   2015-10-15  1       2015-10
6    A   2015-11-09  1       2015-11
7    B   2015-04-17  0       2015-04
I want to create lookback features over, say, the last 3 months, so that for each row we look into the past and see how that id performed over the previous 3 months. For example, for row 6, where the date is 9 Nov 2015, the percentage of fails for id A in the last 3 calendar months (the whole of Aug, Sep and Oct) would be 75% (using rows 2-5).
import pandas as pd
df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B'],
                   'date': ['2015-03-16','2015-05-29','2015-08-02','2015-09-05','2015-09-22','2015-10-15','2015-11-09','2015-04-17'],
                   'target': [0,1,1,1,0,1,1,0]})
df['date'] = pd.to_datetime(df['date'])  # dates are ISO formatted, so dayfirst is not needed
df['yearmonth'] = df['date'].dt.to_period('M')
agg_dict = {
    "Total_Transactions": pd.NamedAgg(column='target', aggfunc='count'),
    "Fail_Count": pd.NamedAgg(column='target', aggfunc=lambda x: (x == 1).sum()),
    "Perc_Monthly_Fails": pd.NamedAgg(column='target', aggfunc=lambda x: (x == 1).mean() * 100),
}
df.groupby(['id', 'yearmonth']).agg(**agg_dict).reset_index(level=1)
I've done an aggregation by id and month (see below) and I've tried things like rolling windows, but I couldn't find a way to aggregate looking back over a specific period for each individual row. Any help is appreciated.
id  yearmonth  Total_Transactions  Fail_Count  Perc_Monthly_Fails
A   2015-03    1                   0           0
A   2015-05    1                   1           100
A   2015-08    1                   1           100
A   2015-09    2                   1           50
A   2015-10    1                   1           100
A   2015-11    1                   1           100
B   2015-04    1                   0           0

You can do this by merging the DataFrame with itself on 'id'.
First we'll create a first of month 'fom' column since your date logic wants to look back based on prior months, not the date specifically. Then we merge the DataFrame with itself, bringing along the index so we can assign the result back in the end.
With month offsets we can then filter that to only keeping the observations within 3 months of the observation for that row, and then we groupby the original index and take the mean of 'target' to get the percent fail, which we can just assign back (alignment on index).
If the output contains NaN, that row had no observations in the prior 3 months, so the percentage can't be calculated.
#df['date'] = pd.to_datetime(df['date'])
# First of month. The original used df['date'].astype('datetime64[M]') (credit @anky),
# which recent pandas versions reject.
df['fom'] = df['date'].dt.to_period('M').dt.to_timestamp()
df1 = df.reset_index()
df1 = df1.drop(columns='target').merge(df1, on='id', suffixes=['', '_past'])
df1 = df1[df1.fom_past.between(df1.fom - pd.offsets.DateOffset(months=3),
                               df1.fom - pd.offsets.DateOffset(months=1))]
df['Pct_fail'] = df1.groupby('index').target.mean()*100
   id  date        target  fom         Pct_fail
0  A   2015-03-16  0       2015-03-01  NaN         # No Rows to Avg
1  A   2015-05-29  1       2015-05-01  0.000000    # Avg Rows 0
2  A   2015-08-02  1       2015-08-01  100.000000  # Avg Rows 1
3  A   2015-09-05  1       2015-09-01  100.000000  # Avg Rows 2
4  A   2015-09-22  0       2015-09-01  100.000000  # Avg Rows 2
5  A   2015-10-15  1       2015-10-01  66.666667   # Avg Rows 2,3,4
6  A   2015-11-09  1       2015-11-01  75.000000   # Avg Rows 2,3,4,5
7  B   2015-04-17  0       2015-04-01  NaN         # No Rows to Avg
If you're having an issue with memory we can take a very slow loop approach, which subsets for each row and then calculates the average from that subset.
import numpy as np
def get_prev_avg(row, df):
    # Keep only rows of the same id whose month falls in the 3 months before this row's month
    df = df[df['id'].eq(row['id'])
            & df['fom'].between(row['fom'] - pd.offsets.DateOffset(months=3),
                                row['fom'] - pd.offsets.DateOffset(months=1))]
    if not df.empty:
        return df['target'].mean()*100
    else:
        return np.nan
#df['date'] = pd.to_datetime(df['date'])
df['fom'] = df['date'].dt.to_period('M').dt.to_timestamp()
df['Pct_fail'] = df.apply(lambda row: get_prev_avg(row, df), axis=1)
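If both the self-merge and the row-wise loop are too costly, one possible middle ground is to aggregate to one row per id and month first and only then look back. The sketch below is an illustration, not part of the answer above: it assumes a binary target (so the sum of 'target' equals the number of fails), and the names ym, monthly, shifted and lookback are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
    'date': pd.to_datetime(['2015-03-16', '2015-05-29', '2015-08-02', '2015-09-05',
                            '2015-09-22', '2015-10-15', '2015-11-09', '2015-04-17']),
    'target': [0, 1, 1, 1, 0, 1, 1, 0],
})
df['ym'] = df['date'].dt.to_period('M')

# One row per (id, month): number of rows and number of fails (binary target assumed).
monthly = (df.groupby(['id', 'ym'])['target']
             .agg(n='count', fails='sum')
             .reset_index())

# Each month contributes to the lookback of the 1st, 2nd and 3rd month after it.
shifted = pd.concat(
    [monthly.assign(ym=monthly['ym'] + k) for k in (1, 2, 3)],
    ignore_index=True,
)
lookback = shifted.groupby(['id', 'ym'])[['n', 'fails']].sum().reset_index()
lookback['Pct_fail'] = lookback['fails'] / lookback['n'] * 100

# Attach the per-month result to every original row; months without history stay NaN.
out = df.merge(lookback[['id', 'ym', 'Pct_fail']], on=['id', 'ym'], how='left')
print(out)
```

The intermediate frames grow with the number of distinct id/month pairs rather than with the number of row pairs per id, which is where the self-merge becomes expensive.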

I have modified @ALollz's code so that it fits my original dataset better: I have a multiclass target, I would like the percentage of fails for classes 1 and 2 plus the number of transactions, and I need to group by different columns over different periods of time. I also decided it is simpler and better to use the last x months prior to the date rather than calendar months. My solution was this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Id': ['A','A','A','A','A','A','A','B'],
                   'Type': ['T1','T3','T1','T2','T2','T1','T1','T3'],
                   'date': ['2015-03-16','2015-05-29','2015-08-10','2015-09-05','2015-09-22','2015-11-08','2015-11-09','2015-04-17'],
                   'target': [2,1,2,1,0,1,2,0]})
df['date'] = pd.to_datetime(df['date'])
def get_prev_avg(row, df, columnname, lastxmonths):
    # Rows sharing this row's value in `columnname`, dated within the last x months up to the day before
    df = df[df[columnname].eq(row[columnname])
            & df['date'].between(row['date'] - pd.offsets.DateOffset(months=lastxmonths),
                                 row['date'] - pd.offsets.DateOffset(days=1))]
    if not df.empty:
        NrTransactions = len(df['target'])
        PctMinorFails = (df['target'] == 1).sum() / len(df['target']) * 100
        PctMajorFails = (df['target'] == 2).sum() / len(df['target']) * 100
        return pd.Series([NrTransactions, PctMinorFails, PctMajorFails])
    else:
        return pd.Series([np.nan, np.nan, np.nan])
for lastxmonths in [3, 4]:
    for columnname in ['Id', 'Type']:
        df[['NrTransactionsBy' + str(columnname) + 'Last' + str(lastxmonths) + 'Months',
            'PctMinorFailsBy' + str(columnname) + 'Last' + str(lastxmonths) + 'Months',
            'PctMajorFailsBy' + str(columnname) + 'Last' + str(lastxmonths) + 'Months'
            ]] = df.apply(lambda row: get_prev_avg(row, df, columnname, lastxmonths), axis=1)
Each iteration takes a couple of hours on my original dataset, which is not great, but I'm unsure how to optimise it further.
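One direction that might help with the run time is to replace the per-row filtering with a binary search over each group's sorted dates, plus prefix sums for the class counts. The following is only a sketch under assumptions: the helper name lookback_stats and its output column names are hypothetical, the DataFrame index is assumed to be unique, and the window is meant to match the between(...) bounds above (inclusive on both ends).

```python
import numpy as np
import pandas as pd

def lookback_stats(df, key, months):
    """Per row: count and class shares of same-`key` rows dated in
    [date - months, date - 1 day]. Hypothetical helper for illustration."""
    out = pd.DataFrame(index=df.index,
                       columns=['n', 'pct_minor', 'pct_major'], dtype=float)
    for _, g in df.groupby(key):
        g = g.sort_values('date')
        dates = g['date'].to_numpy()
        lower = (g['date'] - pd.DateOffset(months=months)).to_numpy()
        upper = (g['date'] - pd.Timedelta(days=1)).to_numpy()
        # Positions of the first date >= lower and the first date > upper.
        lo = np.searchsorted(dates, lower, side='left')
        hi = np.searchsorted(dates, upper, side='right')
        n = (hi - lo).astype(float)
        n[n == 0] = np.nan  # no history: leave the result as NaN
        # Prefix sums give the number of class-1 / class-2 rows in any slice.
        target = g['target'].to_numpy()
        c1 = np.concatenate([[0], np.cumsum(target == 1)])
        c2 = np.concatenate([[0], np.cumsum(target == 2)])
        out.loc[g.index, 'n'] = n
        out.loc[g.index, 'pct_minor'] = (c1[hi] - c1[lo]) / n * 100
        out.loc[g.index, 'pct_major'] = (c2[hi] - c2[lo]) / n * 100
    return out

df = pd.DataFrame({
    'Id': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
    'Type': ['T1', 'T3', 'T1', 'T2', 'T2', 'T1', 'T1', 'T3'],
    'date': pd.to_datetime(['2015-03-16', '2015-05-29', '2015-08-10', '2015-09-05',
                            '2015-09-22', '2015-11-08', '2015-11-09', '2015-04-17']),
    'target': [2, 1, 2, 1, 0, 1, 2, 0],
})
res = lookback_stats(df, 'Id', 3)
print(res)
```

The loop now runs once per group rather than once per row, so the cost is dominated by sorting each group; whether that is fast enough depends on the real data.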

Related

How to check if date ranges are overlapping in a pandas dataframe according to a categorical column?

Let's take this sample dataframe :
import pandas as pd
df = pd.DataFrame({'ID':[1,1,2,2,3],'Date_min':["2021-01-01","2021-01-20","2021-01-28","2021-01-01","2021-01-02"],'Date_max':["2021-01-23","2021-12-01","2021-09-01","2021-01-15","2021-01-09"]})
# pd.to_datetime instead of astype('datetime64'), which recent pandas rejects without a unit
df["Date_min"] = pd.to_datetime(df["Date_min"])
df["Date_max"] = pd.to_datetime(df["Date_max"])
ID Date_min Date_max
0 1 2021-01-01 2021-01-23
1 1 2021-01-20 2021-12-01
2 2 2021-01-28 2021-09-01
3 2 2021-01-01 2021-01-15
4 3 2021-01-02 2021-01-09
I would like to check for each ID whether there are overlapping date ranges. I can use a loop like the following, but it is not efficient and consequently quite slow on a really big dataframe:
L_output = []
for index, row in df.iterrows():
    if len(df[(df["ID"]==row["ID"]) & (df["Date_min"]<= row["Date_min"]) &
              (df["Date_max"]>= row["Date_min"])].index)>1:
        print("overlapping date ranges for ID %d" %row["ID"])
        L_output.append(row["ID"])
Output :
overlapping date ranges for ID 1
Would you know a better way to check that ID 1 has overlapping date ranges?
Expected output :
[1]
Try:
1. Create a column "Dates" that contains the list of dates from "Date_min" to "Date_max" for each row.
2. Explode the "Dates" column.
3. Get the duplicated rows.
df["Dates"] = df.apply(lambda row: pd.date_range(row["Date_min"], row["Date_max"]), axis=1)
df = df.explode("Dates").drop(["Date_min", "Date_max"], axis=1)
#if you want all the ID and Dates that are duplicated/overlap
>>> df[df.duplicated()]
ID Dates
1 1 2021-01-20
1 1 2021-01-21
1 1 2021-01-22
1 1 2021-01-23
#if you just want a count of overlapping dates per ID
>>> df.groupby("ID").agg(lambda x: x.duplicated().sum())
Dates
ID
1 4
2 0
3 0
You can transform your datetime objects into timestamps, then construct pd.Interval objects and iterate over a generator of all possible interval combinations for each ID:
from itertools import combinations
import pandas as pd
def group_has_overlap(group):
    timestamps = group[["Date_min", "Date_max"]].values.tolist()
    # Compare every pair of ranges within the group
    for t1, t2 in combinations(timestamps, 2):
        i1 = pd.Interval(t1[0], t1[1])
        i2 = pd.Interval(t2[0], t2[1])
        if i1.overlaps(i2):
            return True
    return False
for ID, group in df.groupby("ID"):
    print(ID, group_has_overlap(group))
Output is :
1 True
2 False
3 False
Set the index as an intervalindex, and use groupby to get your overlapping IDs:
(df.set_index(pd.IntervalIndex
              .from_arrays(df.Date_min,
                           df.Date_max,
                           closed='both'))
   .groupby('ID')
   .apply(lambda df: df.index.is_overlapping)
)
ID
1 True
2 False
3 False
dtype: bool
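Another way to think about the check, shown here only as a sketch: after sorting by ID and start date, a range overlaps an earlier one exactly when its start is not later than the latest end date seen so far for that ID. The variable names s, prev_max_end and overlapping_ids are made up for this example, and touching endpoints are treated as overlapping, as with closed='both' above.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3],
    'Date_min': pd.to_datetime(['2021-01-01', '2021-01-20', '2021-01-28',
                                '2021-01-01', '2021-01-02']),
    'Date_max': pd.to_datetime(['2021-01-23', '2021-12-01', '2021-09-01',
                                '2021-01-15', '2021-01-09']),
})

s = df.sort_values(['ID', 'Date_min'])

# Latest end date among the earlier ranges of the same ID (NaT for the first range).
prev_max_end = s.groupby('ID')['Date_max'].transform(lambda x: x.cummax().shift())

# Comparisons against NaT are False, so the first range of each ID never flags.
overlap = s['Date_min'] <= prev_max_end
overlapping_ids = s.loc[overlap, 'ID'].unique().tolist()
print(overlapping_ids)
```

This avoids both the pairwise comparison and the expansion to one row per day, at the cost of a sort.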

Pandas dataframe: keep rows with duplicates

This question is slightly more complicated than Remove duplicate rows in pandas dataframe based on condition:
Instead of one 'valu' column, I now have two columns 'valu1', 'valu2':
t valu1 valu2
2015-08-01 1 10
2015-08-01 2 11
2015-08-01 3 12
2015-09-31 4 15
2015-10-31 5 13
In the dataframe above, I want to remove the duplicate rows (i.e. rows where the column 't' is repeated) by retaining the row with a higher value in the valu1 column and a lower value in the valu2 column.
Expected outcome:
t valu1 valu2
2015-08-01 3 10
2015-09-31 4 15
2015-10-31 5 13
The df.sort_values() and drop_duplicates with keep='last' mentioned in the linked question obviously don't work.
Something I can think of now is:
#Let's call the dataframe df
dups = df[df['t'].duplicated()]['t'].drop_duplicates() #get duplicated dates
for d in dups:
    mask = df['t'] == d
    max_v1 = df.loc[mask, 'valu1'].max() #find the max of valu1 on day d
    min_v2 = df.loc[mask, 'valu2'].min() #find the min of valu2 on day d
    df.loc[mask, 'valu1'] = max_v1 #set valu1 of day d to max_v1 (.loc avoids chained assignment)
    df.loc[mask, 'valu2'] = min_v2 #set valu2 of day d to min_v2
df = df.drop_duplicates(subset='t') #drop everything duplicated
I think this should work, but it really seems unsophisticated, especially since I actually need to do this for a large dataset. Any idea how I should approach this problem?
I think you are looking for
df.groupby('t').agg({'valu1':'max','valu2':'min'}).reset_index()
t valu1 valu2
0 2015-08-01 3 10
1 2015-09-31 4 15
2 2015-10-31 5 13

Pandas Sum values from different columns based on dates

I'm working with a dataframe on pandas and I'm trying to sum the values of different rows to a new column. This must be based on the previous date (current month - 1 to be precise).
I have something like this:
Period Value
2015-01 1
2015-09 2
2015-10 1
2015-11 3
2015-12 1
And I would like to create a new column with the sum of 'Value' from the current 'Period' and ('Period' - 1month) if it exists. Example:
Period Value Result
2015-01 1 1
2015-09 2 2
2015-10 1 3
2015-11 3 4
2015-12 1 4
I tried to use a lambda function with something like:
df['Result'] = df.apply(lambda x: df.loc[(df.Period <= x.Period) &
(x.Period >= df.Period-1),
['Value']].sum(), axis=1)
It was based on other answers, but I'm a little confused if it is the best way to do it and how to make it work successfully (It is not giving any python error message, but it is not giving my expected output either).
UPDATE
I'm testing @taras's answer on a simple example with three columns:
Account Period Value
15035 2015-01 1
15035 2015-09 1
15035 2015-10 1
The expected result would be:
Account Period Value
15035 2015-01 1
15035 2015-09 1
15035 2015-10 2
But I'm getting:
Account Period Value
15035 2015-01 1
15035 2015-09 2
15035 2015-10 2
When inspecting
print(df.loc[df.index - 1, 'Value'].fillna(0).values)
I'm getting [ 0. 1. 1.] (it should be [ 0. 0. 1.]). By looking at
print(df.loc[df.index - 1, 'Period'].fillna(0).values)
I'm getting [0 Period('2015-01', 'M') Period('2015-09', 'M')] (which looks like the index is getting the value from the previous row, and not the previous month).
Am I doing something wrong?
You can compute the index of rows for previous month with
idx = df.index - pd.DateOffset(months=1)
and then simply add it to your Value column
df['Value'].reindex(idx).fillna(0).values + df['Value']  # reindex tolerates missing months; .loc raises KeyError for them in recent pandas
which results in
Period
2015-01-01 1.0
2015-09-01 2.0
2015-10-01 3.0
2015-11-01 4.0
2015-12-01 4.0
Name: Value, dtype: float64
Update: since you use a pd.PeriodIndex rather than a pd.DatetimeIndex, idx can be computed in a much simpler way:
idx = df.index - 1
because your period is 1 month.
So, to wrap up, the whole thing can be expressed in one quite simple expression:
df['Value'].reindex(df.index - 1).fillna(0).values + df['Value']  # reindex tolerates missing months
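For reference, here is a small self-contained version of the previous-month lookup on the sample data from the question. It is a sketch that assumes 'Period' is a monthly PeriodIndex set as the DataFrame index; the variable name prev is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Value': [1, 2, 1, 3, 1]},
    index=pd.PeriodIndex(['2015-01', '2015-09', '2015-10', '2015-11', '2015-12'],
                         freq='M', name='Period'),
)

# Look up the value one month earlier; months that are absent come back as NaN.
prev = df['Value'].reindex(df.index - 1)

# Add by position, treating a missing previous month as 0.
df['Result'] = df['Value'] + prev.fillna(0).to_numpy()
print(df)
```

The lookup is by label (the period one month back), not by row position, which is the distinction the update in the question runs into.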
You can join on an auxiliary column that manages the string conversion of your inputs:
import pandas as pd
# Previous month for each row; assumes Period has period[M] dtype
df['prev'] = (df.Period.apply(lambda x: x.to_timestamp()) - pd.DateOffset(months=1)).dt.to_period('M')
aux = df.merge(df, how='left', left_on='prev', right_on='Period')
df['sum'] = aux.Value_x + aux.Value_y.fillna(0)  # no previous month: keep the row's own value
df = df.drop('prev', axis=1)

Pandas Group by and sum by times before time listed in row

Fun problem!
I have a data frame that has many columns, but the relevant ones are: id, event_time
The ids are repeatable. I am trying to count all the times an id occurs in the data frame before the time of the id in each row. So if id = 43 and event_time = 2016-01-01 12:00:00 , I want the count of all the times id 43 occurs before this event_time. The event_time has already been formatted with pd.to_datetime() but it is not the index.
This loop solves the problem, but it is horrifically slow for 400k + rows.
occs=[]
for ix in range(len(df)):
    cur=df.iloc[[ix]]
    occurrences=df[(df.id==cur.id.values[0])&
                   (df.event_time < cur.event_time.values[0])]
    occs.append(len(occurrences))
df['total_occ']=occs
I know there has to be a better way, probably using group by. The key is that it has to be ONLY times before the event_time and they are NOT in order.
Thanks everyone!
* EDIT SAMPLE DATA *
id | event_time | count
11 2016-11-09 1
8 2016-11-10 1
32 2016-11-08 0
11 2016-11-08 0
8 2016-11-11 2
8 2016-11-07 0
(the counts will be much higher though, in the thousands... and count is the desired output)
I think this might be what you are after:
#sort df by id and datetime
df.sort_values(['id','event_time'],inplace=True)
#add cumulative count for each id.
df['count'] = df.groupby('id').cumcount()
df
Out[1114]:
id event_time count
5 8 2016-11-07 0
1 8 2016-11-10 1
4 8 2016-11-11 2
3 11 2016-11-08 0
0 11 2016-11-09 1
2 32 2016-11-08 0
From your code, I'm guessing you mean to count the occurrences of id where event_time is before an event_time associated with the id, which is taken to be the event time of the first occurrence of the given id.
So figure out what these event times are:
first_event_times = df.groupby('id', as_index = False).event_time.first().rename(columns = {'event_time':'first_event_time'})
Merge those first event times with the dataframe, and only keep the relevant columns:
df0 = df[['id','event_time']].merge(first_event_times)
Filter down to rows where event_time < first_event_time:
df0 = df0[df0.event_time < df0.first_event_time]
Get the row counts for each id in what's left:
df0.groupby(['id','first_event_time']).size().to_frame('count') # gives you the desired output
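If several rows of the same id can share an event_time, a cumulative count after sorting would count the tied rows as "before" each other. One way around that, sketched here under the assumption that event_time is a datetime64 column, is a per-id rank with method='min' minus one, which counts strictly earlier events and does not need the frame to be sorted.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [11, 8, 32, 11, 8, 8],
    'event_time': pd.to_datetime(['2016-11-09', '2016-11-10', '2016-11-08',
                                  '2016-11-08', '2016-11-11', '2016-11-07']),
})

# Rank within each id on integer nanoseconds; the 'min' rank is
# 1 + number of strictly smaller values, so subtracting 1 gives the count.
ranks = (df['event_time'].astype('int64')
           .groupby(df['id'])
           .rank(method='min'))
df['count'] = (ranks - 1).astype(int)
print(df)
```

On the sample data this reproduces the desired count column while keeping the original row order.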

How to merge two data frames based on nearest date

I want to merge two data frames based on two columns: "Code" and "Date". It is straightforward to merge data frames based on "Code", however in case of "Date" it becomes tricky - there is no exact match between Dates in df1 and df2. So, I want to select closest Dates. How can I do this?
df = df1[column_names1].merge(df2[column_names2], on='Code')
I don't think there's a quick, one-line way to do this kind of thing, but I believe the best approach is to do it this way:
add a column to df1 with the closest date from the appropriate group in df2
call a standard merge on these
As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbors code for this sort of thing.
I've put together one approach to that solution that should scale relatively well.
First we can generate some simple data:
import pandas as pd
import numpy as np
dates = pd.date_range('2015', periods=200, freq='D')
rand = np.random.RandomState(42)
i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
'Date': dates[i1],
'val1':rand.rand(5)})
df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
'Date': dates[i2],
'val2':rand.rand(5)})
Let's check these out:
>>> df1
Code Date val1
0 0 2015-01-16 0.975852
1 0 2015-01-31 0.516300
2 1 2015-04-06 0.322956
3 1 2015-05-09 0.795186
4 1 2015-06-08 0.270832
>>> df2
Code Date val2
0 1 2015-02-03 0.184334
1 1 2015-04-13 0.080873
2 0 2015-05-02 0.428314
3 1 2015-06-26 0.688500
4 0 2015-06-30 0.058194
Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:
from sklearn.neighbors import NearestNeighbors
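def find_nearest(group, match, groupname):
    group = group.copy()
    match = match[match[groupname] == group.name]
    # Cast dates to integer nanoseconds so scikit-learn receives numeric input
    nbrs = NearestNeighbors(n_neighbors=1).fit(match['Date'].values.astype('int64')[:, None])
    dist, ind = nbrs.kneighbors(group['Date'].values.astype('int64')[:, None])
    group['Date1'] = group['Date']
    group['Date'] = match['Date'].values[ind.ravel()]
    return group
# group_keys=False keeps the original flat index on recent pandas versions
df1_mod = df1.groupby('Code', group_keys=False).apply(find_nearest, df2, 'Code')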
>>> df1_mod
Code Date val1 Date1
0 0 2015-05-02 0.975852 2015-01-16
1 0 2015-05-02 0.516300 2015-01-31
2 1 2015-04-13 0.322956 2015-04-06
3 1 2015-04-13 0.795186 2015-05-09
4 1 2015-06-26 0.270832 2015-06-08
Finally, we can merge these together with a straightforward call to pd.merge:
>>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
Code Date val1 Date1 val2
0 0 2015-05-02 0.975852 2015-01-16 0.428314
1 0 2015-05-02 0.516300 2015-01-31 0.428314
2 1 2015-04-13 0.322956 2015-04-06 0.080873
3 1 2015-04-13 0.795186 2015-05-09 0.080873
4 1 2015-06-26 0.270832 2015-06-08 0.688500
Notice that rows 0 and 1 both matched the same val2; this is expected given the way you described your desired solution.
Here's an alternative solution:
Merge on Code.
Add a date difference column according to your need (I used abs in the example below) and sort the data using the new column.
Group by the records of the first data frame and for each group take a record from the second data frame with the closest date.
Code:
df = df1.reset_index()[column_names1].merge(df2[column_names2], on='Code')
df['DateDiff'] = (df['Date1'] - df['Date2']).abs()
df.sort_values('DateDiff').groupby('index').first().reset_index()
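Recent pandas versions also provide pd.merge_asof, which can match on the nearest key within groups. The sketch below rebuilds the sample frames from the first answer by hand (val1 and val2 are copied from the printed output) and renames the right-hand date column to Date2 purely for illustration; both frames must be sorted by their date key.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Code': [0, 0, 1, 1, 1],
    'Date': pd.to_datetime(['2015-01-16', '2015-01-31', '2015-04-06',
                            '2015-05-09', '2015-06-08']),
    'val1': [0.975852, 0.516300, 0.322956, 0.795186, 0.270832],
})
df2 = pd.DataFrame({
    'Code': [1, 1, 0, 1, 0],
    'Date': pd.to_datetime(['2015-02-03', '2015-04-13', '2015-05-02',
                            '2015-06-26', '2015-06-30']),
    'val2': [0.184334, 0.080873, 0.428314, 0.688500, 0.058194],
})

merged = pd.merge_asof(
    df1.sort_values('Date'),
    df2.sort_values('Date').rename(columns={'Date': 'Date2'}),
    left_on='Date',
    right_on='Date2',
    by='Code',            # only match rows with the same Code
    direction='nearest',  # closest date in either direction
)
print(merged)
```

A tolerance argument (for example pd.Timedelta('30D')) can be added if matches that are too far apart should be left empty.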
