Rolling operations on a DataFrameGroupBy object - python

I have a pandas dataframe on which I wish to perform the same rolling operation for different groups within the data. Consider the following df (see the bottom of the question for the code to construct it) with four columns:
id date category target
1 2017-01-01 'a' 0
1 2017-01-01 'b' 0
1 2017-01-21 'a' 1
1 2017-01-21 'b' 1
1 2017-10-01 'a' 0
1 2017-10-01 'b' 0
2 2017-01-01 'a' 1
2 2017-01-01 'b' 1
2 2017-01-21 'a' 0
2 2017-01-21 'b' 0
2 2017-10-01 'a' 0
2 2017-10-01 'b' 0
What I would like is an operation which calculates a boolean for each unique id-date pair, indicating whether the target column is 1 at any point within the 6 months after the given date. So for the provided df I would expect a result which looks like:
id date one_within_6m
1 2017-01-01 True
1 2017-01-21 False
1 2017-10-01 False
2 2017-01-01 False
2 2017-01-21 False
2 2017-10-01 False
I can do this with a for loop iterating over the rows and looking 6 months in advance for each visit, but it is too slow due to the large size of my dataset.
So, I was wondering whether it is possible to group by id, set the date as the index, and do a rolling operation over a time window to achieve this? For example:
df_grouped = df.groupby(['id', 'date'])
# … do something to set date as index
# ... define some custom function
df_grouped.rolling('6m', on='target').apply(some_custom_function)
Some notes:
There can be multiple 1s in the 6-month window; this should still just be treated as True for the current date.
In my head some_custom_function will check whether the sum of target over the next 6 months (excluding the current date) is at least 1.
Supporting code:
To produce the DataFrame instance used in this question:
import numpy as np
import pandas as pd

ids = np.concatenate([np.ones(6), np.ones(6) + 1])
dates = ['2017-01-01', '2017-01-01', '2017-01-21', '2017-01-21',
         '2017-10-01', '2017-10-01', '2017-01-01', '2017-01-01',
         '2017-01-21', '2017-01-21', '2017-10-01', '2017-10-01']
categories = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
targets = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
df = pd.DataFrame({'id': ids,
                   'date': dates,
                   'category': categories,
                   'target': targets})
df['date'] = pd.to_datetime(df['date'])

I have found a workable solution, but it only works if each date is unique for a given id. This is the case for my data after some additional processing:
new_df = df.groupby(['id','date']).mean().reset_index()
which returns:
id date target
0 1.0 2017-01-01 0
1 1.0 2017-01-21 1
2 1.0 2017-10-01 0
3 2.0 2017-01-01 1
4 2.0 2017-01-21 0
5 2.0 2017-10-01 0
I can then use the rolling method on a groupby object to get the desired result:
df = new_df.set_index('date')
df.iloc[::-1].groupby('id')['target'].rolling(window='180D',
                                              center=False).apply(lambda x: x[:-1].sum())
There are two tricks here:
I reverse the order of the dates (.iloc[::-1]) to take a forward looking window; this has been suggested in other SO questions.
I drop the last entry of the sum to remove the 'current' date from the sum, so it only looks forward.
The second 'hack' means it only works when there are no repeats of dates for a given id.
I would be interested in making a more robust solution (e.g., where dates are repeated for an id).
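For reference, here is a minimal sketch of a more robust, forward-looking check that tolerates repeated dates per id. It is only an illustration under my own assumptions (a plain row-by-row scan within each group, with roughly 6 months taken as 180 days), and it starts again from the original df built in the supporting code rather than from new_df:

import numpy as np
import pandas as pd

def one_within_6m(group):
    # For every row, look at all rows of the same id whose date falls
    # strictly after the current date but within 180 days, and test
    # whether any of them has target == 1.
    dates = group['date'].values
    targets = group['target'].values
    flags = []
    for d in dates:
        mask = (dates > d) & (dates <= d + np.timedelta64(180, 'D'))
        flags.append(bool(targets[mask].any()))
    return pd.Series(flags, index=group.index)

df['one_within_6m'] = df.groupby('id', group_keys=False).apply(one_within_6m)
result = df.groupby(['id', 'date'])['one_within_6m'].any().reset_index()

On the sample data this reproduces the expected one_within_6m column shown above.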

Related

How to check if date ranges are overlapping in a pandas dataframe according to a categorical column?

Let's take this sample dataframe :
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3],
                   'Date_min': ["2021-01-01", "2021-01-20", "2021-01-28", "2021-01-01", "2021-01-02"],
                   'Date_max': ["2021-01-23", "2021-12-01", "2021-09-01", "2021-01-15", "2021-01-09"]})
df["Date_min"] = pd.to_datetime(df["Date_min"])
df["Date_max"] = pd.to_datetime(df["Date_max"])
ID Date_min Date_max
0 1 2021-01-01 2021-01-23
1 1 2021-01-20 2021-12-01
2 2 2021-01-28 2021-09-01
3 2 2021-01-01 2021-01-15
4 3 2021-01-02 2021-01-09
I would like to check, for each ID, whether there are overlapping date ranges. I can use a loopy solution such as the following one, but it is not efficient and consequently quite slow on a really big dataframe:
L_output = []
for index, row in df.iterrows():
    if len(df[(df["ID"] == row["ID"]) &
              (df["Date_min"] <= row["Date_min"]) &
              (df["Date_max"] >= row["Date_min"])].index) > 1:
        print("overlapping date ranges for ID %d" % row["ID"])
        L_output.append(row["ID"])
Output:
overlapping date ranges for ID 1
Is there a better way to check that ID 1 has overlapping date ranges?
Expected output :
[1]
Try:
Create a column "Dates" that contains a list of dates from "Date_min" to "Date_max" for each row
explode the "Dates" column
get the duplicated rows
df["Dates"] = df.apply(lambda row: pd.date_range(row["Date_min"], row["Date_max"]), axis=1)
df = df.explode("Dates").drop(["Date_min", "Date_max"], axis=1)
#if you want all the ID and Dates that are duplicated/overlap
>>> df[df.duplicated()]
ID Dates
1 1 2021-01-20
1 1 2021-01-21
1 1 2021-01-22
1 1 2021-01-23
#if you just want a count of overlapping dates per ID
>>> df.groupby("ID").agg(lambda x: x.duplicated().sum())
Dates
ID
1 4
2 0
3 0
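To recover the question's expected output ([1]) from here, one small follow-up (a sketch that assumes df is the exploded frame produced above):

# IDs that have at least one duplicated (ID, Dates) pair, i.e. an overlap
overlapping_ids = df.loc[df.duplicated(), "ID"].unique().tolist()
print(overlapping_ids)  # [1]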
You can take each row's pair of dates, construct pd.Interval objects from them, and iterate over all possible interval combinations for each ID:
from itertools import combinations
import pandas as pd

def group_has_overlap(group):
    timestamps = group[["Date_min", "Date_max"]].values.tolist()
    for t1, t2 in combinations(timestamps, 2):
        i1 = pd.Interval(t1[0], t1[1])
        i2 = pd.Interval(t2[0], t2[1])
        if i1.overlaps(i2):
            return True
    return False

for ID, group in df.groupby("ID"):
    print(ID, group_has_overlap(group))
Output is:
1 True
2 False
3 False
Set the index to an IntervalIndex, and use groupby to get your overlapping IDs:
(df.set_index(pd.IntervalIndex.from_arrays(df.Date_min,
                                           df.Date_max,
                                           closed='both'))
   .groupby('ID')
   .apply(lambda df: df.index.is_overlapping)
)
ID
1 True
2 False
3 False
dtype: bool

How does (DataFrame - Groupby) match rows?

I can't figure out how (DataFrame - Groupby) works.
Specifically, given the following dataframe:
df = pd.DataFrame([['usera',1,100],['usera',5,130],['userc',1,100],['userd',5,100]])
df.columns = ['id','date','sum']
id date sum
0 usera 1 100
1 usera 5 130
2 userc 1 100
3 userd 5 100
Running the code below returns:
df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1)
      id  date  sum  shift
0  usera     1  100    NaN
1  usera     5  130    4.0
2  userc     1  100    NaN
3  userd     5  100    NaN
How did Python know that I meant for it to match by id column?
It doesn't even appear in df['date']
Let us dissect the command df['shift'] = df['date']-df.groupby(['id'])['date'].shift(1).
The assignment to df['shift'] adds a new column "shift" to the dataframe.
df['date'] returns Series using date column from the dataframe.
0 1
1 5
2 1
3 5
Name: date, dtype: int64
In df.groupby(['id'])['date'].shift(1), groupby(['id']) creates a groupby object.
From that groupby object, the date column is selected and shifted by one position within each group using shift(1), so every row gets the previous date for its own id. This is also a Series, and it keeps the original row index.
df.groupby(['id'])['date'].shift(1)
0 NaN
1 1.0
2 NaN
3 NaN
Name: date, dtype: float64
The Series obtained in step 3 is subtracted, element-wise and aligned on the index, from the Series obtained in step 2. Because both Series share the original row index, the rows line up automatically; the grouping only affected how the shifted values were computed. The result is assigned to the df['shift'] column.
df['date']-df.groupby(['id'])['date'].shift(1)
0 NaN
1 4.0
2 NaN
3 NaN
Name: date, dtype: float64
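A small sketch (reusing the df built above) that makes the index alignment explicit:

shifted = df.groupby(['id'])['date'].shift(1)

# groupby(...).shift() keeps the original row index, so the subtraction
# below aligns row by row; the id column only affected which value was
# shifted into each row.
print(shifted.index.equals(df.index))  # True

df['shift'] = df['date'] - shifted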
I don't know exactly what you are trying to do, but the groupby() method is useful if you have several identical values in a column (like your usera) and you want to calculate, for example, the sum(), mean(), max(), etc. of all columns or of just one specific column.
e.g. df.groupby(['id'])['sum'].sum() groups your usera rows, selects just the sum column, and builds the sum over all usera rows, so it is 230. If you used .mean() it would output 115, etc. It also does this for every other unique id in your id column. In the example above it outputs one column with just three rows (usera, userc and userd).
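A quick illustration with the df from this question:

print(df.groupby(['id'])['sum'].sum())
# id
# usera    230
# userc    100
# userd    100
# Name: sum, dtype: int64

print(df.groupby(['id'])['sum'].mean())
# id
# usera    115.0
# userc    100.0
# userd    100.0
# Name: sum, dtype: float64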
Greetz, miGa

Can LAMBDA function, while doing aggregation, take condition from another column in python?

I am looking for a way, if one exists, to perform an aggregation on a df using only a lambda approach, subject to a condition from another column. Here is a small microcosm of the problem.
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2],
                   'revenue': [40, 55, 75, 80, 35, 60],
                   'month': ['2012-01-01', '2012-02-01', '2012-01-01',
                             '2012-03-01', '2012-02-01', '2012-03-01']})
print(df)
ID month revenue
0 1 2012-01-01 40
1 1 2012-02-01 55
2 1 2012-01-01 75
3 1 2012-03-01 80
4 2 2012-02-01 35
5 2 2012-03-01 60
If you need the number of unique months for every ID, then the following code is fine (this code is just for demonstration; 'month': 'nunique' would also work here).
df = df.groupby(['ID']).agg({'month':lambda x:x.nunique()}).reset_index()
print(df)
ID month
0 1 3
1 2 2
But I need to count unique months where revenue was greater than 50, by taking two variables (revenue & month) in the lambda, something like lambda x, y: ... .
I could have done it like df[df['revenue'] > 50].groupby(...), but there are many other columns in the agg() where this condition is not needed. So, is there an approach where the lambda could take 2 variables simultaneously?
Expected output:
ID month
0 1 3
1 2 1
Unfortunately there is no easy / performant way to do this, because GroupBy.agg processes each column separately. The following is possible, but don't use it, because it is extremely slow on a large df or with many groups:
def f(x):
    a = df.loc[x.index]
    return a.loc[a['revenue'] > 50, 'month'].nunique()

df1 = df.groupby(['ID']).agg({'month': f}).reset_index()
print(df1)
ID month
0 1 3
1 2 1
So one possible solution is to filter beforehand, or to use GroupBy.apply (sketched below).
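A minimal sketch of the GroupBy.apply alternative (my own illustration of the idea above, assuming df is the original frame from the question, before it was reassigned by the nunique demo): apply sees the whole sub-DataFrame at once, so the revenue condition and the month count can be combined in one pass per group.

out = (df.groupby('ID')
         .apply(lambda g: g.loc[g['revenue'] > 50, 'month'].nunique())
         .reset_index(name='month'))
print(out)
#    ID  month
# 0   1      3
# 1   2      1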

Pandas Sum values from different columns based on dates

I'm working with a dataframe in pandas and I'm trying to sum the values of different rows into a new column. This must be based on the previous date (the current month - 1, to be precise).
I have something like this:
Period Value
2015-01 1
2015-09 2
2015-10 1
2015-11 3
2015-12 1
And I would like to create a new column with the sum of 'Value' from the current 'Period' and ('Period' - 1month) if it exists. Example:
Period Value Result
2015-01 1 1
2015-09 2 2
2015-10 1 3
2015-11 3 4
2015-12 1 4
I tried to use a lambda function with something like:
df['Result'] = df.apply(lambda x: df.loc[(df.Period <= x.Period) &
(x.Period >= df.Period-1),
['Value']].sum(), axis=1)
It was based on other answers, but I'm a little confused about whether it is the best way to do it and how to make it work (it is not giving any Python error message, but it is not giving my expected output either).
UPDATE
I'm testing #taras answer on a simple example with three columns:
Account Period Value
15035 2015-01 1
15035 2015-09 1
15035 2015-10 1
The expected result would be:
Account Period Value
15035 2015-01 1
15035 2015-09 1
15035 2015-10 2
But I'm getting:
Account Period Value
15035 2015-01 1
15035 2015-09 2
15035 2015-10 2
When inspecting
print(df.loc[df.index - 1, 'Value'].fillna(0).values)
I'm getting [ 0. 1. 1.] (it should be [ 0. 0. 1.]). By looking at
print(df.loc[df.index - 1, 'Period'].fillna(0).values)
I'm getting [0 Period('2015-01', 'M') Period('2015-09', 'M')] (which looks like the index is getting the value from the previous row, and not the previous month).
Am I doing something wrong?
You can compute the index of rows for previous month with
idx = df.index - pd.DateOffset(months=1)
and then simply add it to your Value column
df.loc[idx, 'Value'].fillna(0).values + df['Value']
which results in
Period
2015-01-01 1.0
2015-09-01 2.0
2015-10-01 3.0
2015-11-01 4.0
2015-12-01 4.0
Name: Value, dtype: float64
Update: since you use a pd.PeriodIndex rather than a pd.DatetimeIndex, idx is computed in a much simpler way:
idx = df.index - 1
because your period is 1 month.
So, to wrap up, the whole thing can be expressed in one quite simple expression:
df.loc[df.index - 1, 'Value'].fillna(0).values + df['Value']
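As a side note, a minimal self-contained sketch of this answer (my own reconstruction, assuming Period has been set as a PeriodIndex; recent pandas no longer allows .loc with missing labels, so reindex is used instead):

import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 1, 3, 1]},
                  index=pd.PeriodIndex(['2015-01', '2015-09', '2015-10',
                                        '2015-11', '2015-12'],
                                       freq='M', name='Period'))

# previous month's Value, or 0 where that month is absent
prev = df['Value'].reindex(df.index - 1).fillna(0).values
df['Result'] = df['Value'] + prev
print(df)
#          Value  Result
# Period
# 2015-01      1     1.0
# 2015-09      2     2.0
# 2015-10      1     3.0
# 2015-11      3     4.0
# 2015-12      1     4.0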
You can self-join on an auxiliary column that holds the previous period; converting both merge keys to timestamps keeps their dtypes consistent:
import pandas as pd

df['prev'] = df.Period.apply(lambda x: x.to_timestamp()) - pd.DateOffset(months=1)
df['ts'] = df.Period.apply(lambda x: x.to_timestamp())
aux = df.merge(df, how='left', left_on='prev', right_on='ts')
df['sum'] = aux.Value_x + aux.Value_y.fillna(0)
df = df.drop(['prev', 'ts'], axis=1)

Determine change in values in a grouped dataframe

Assume a dataset like this (which originally is read in from a .csv):
data = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3],
                     'time': ['2017-01-01 12:00:00', '2017-01-01 12:00:00', '2017-01-01 12:00:00',
                              '2017-01-01 12:10:00', '2017-01-01 12:10:00', '2017-01-01 12:10:00'],
                     'value': [10, 11, 12, 10, 12, 13]})
=>
id time value
0 1 2017-01-01 12:00:00 10
1 2 2017-01-01 12:00:00 11
2 3 2017-01-01 12:00:00 12
3 1 2017-01-01 12:10:00 10
4 2 2017-01-01 12:10:00 12
5 3 2017-01-01 12:10:00 13
Time is identical for all IDs in each observation period. The series goes on like that for many observations, i.e. every ten minutes.
I want the total number of changes in the value column, by id, between consecutive times. For example: for id=1 there is no change (result: 0); for id=2 there is one change (result: 1).
Inspired by this post, I have tried taking differences:
Determining when a column value changes in pandas dataframe
This is what I've come up with so far (not working as expected):
data = data.set_index(['id', 'time']) # MultiIndex
grouped = data.groupby(level='id')
data['diff'] = grouped['value'].diff()
data.loc[data['diff'].notnull(), 'diff'] = 1
data.loc[data['diff'].isnull(), 'diff'] = 0
grouped['diff'].sum()
However, this just counts the non-first rows for each id, not the actual changes.
Since my dataset is huge (and won't fit into memory), the solution should be as fast as possible. (This is why I use a MultiIndex on id + time; I expect a significant speedup because, optimally, the data need not be shuffled anymore.)
Moreover, I have come across dask dataframes, which are very similar to pandas dfs. A solution making use of them would be fantastic.
Do you want something like this?
data.groupby('id').value.apply(lambda x: len(set(x)) - 1)
You get
id
1 0
2 1
3 1
Edit: As @COLDSPEED mentioned, if the requirement is to capture a change back to a previous value as well, use
data.groupby('id').value.apply(lambda x: (x != x.shift()).sum() - 1)
I think you're looking for a groupby and a comparison against shift:
data.groupby('id')['value'].agg(lambda x: (x != x.shift(-1)).sum() - 1)
id
1 0
2 1
3 1
Name: value, dtype: int64
data.groupby('id').value.agg(lambda x : (x.diff()!=0).sum()).add(-1)
id
1 0
2 1
3 1
Name: value, dtype: int64
Another option, using pct_change:
data.groupby('id').value.apply(lambda x : (x.pct_change()!=0).sum()).add(-1)
Out[323]:
id
1 0
2 1
3 1
Name: value, dtype: int64
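Since the question also asks about dask: a hedged sketch of the same per-group change count with dask.dataframe (my own illustration, not from the answers; it assumes data is the original flat frame from the question, not the MultiIndexed one, and in practice the frame would come from something like dd.read_csv rather than from_pandas):

import dask.dataframe as dd

ddf = dd.from_pandas(data, npartitions=4)
changes = (ddf.groupby('id')
              .apply(lambda g: int((g.sort_values('time')['value']
                                      .diff().fillna(0) != 0).sum()),
                     meta=('changes', 'int64'))
              .compute())
# expected: 0 changes for id 1, and 1 change each for ids 2 and 3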
