Taking Min and Max of Contigous Rows in a Pandas Dataframe - python

I have some data that looks something like this:
ID Value Starts Ends
0 A 1 2000-01-01 2000-06-01
1 A 2 2000-06-02 2000-12-31
2 A 1 2001-01-01 2001-06-01
3 A 1 2001-06-02 2001-12-31
What I want to do is collapse consecutive rows where there Id and value are the same. So ideally the output would be:
ID Value Starts Ends
0 A 1 2000-01-01 2000-06-01
1 A 2 2000-06-02 2000-12-31
2 A 1 2001-01-01 2001-12-31
However, if you naively take np.min(Starts) and np.max(Ends) it appears that (A,1) spans the values (A,2).
gb = df.groupby(['ID', 'Value'], as_index=False)
df = gb.agg({'Starts': np.min, 'Ends': np.max}, as_index=False)
ID Value Starts Ends
0 A 1 2000-01-01 2001-12-31
1 A 2 2000-06-02 2000-12-31
Is there an efficient way to get Pandas to do what I want?

If you add a column (let's call it "extra") that increments each time the groupby category changes, you can groupby that instead. The challenge is then to make the addition of the new column efficient, and this is the most vectorized way I can think of to make it work.
increment = ((df.Value[:-1] != df.Value[1:]) | (df.ID[:-1] != df.ID[1:])).cumsum()
df["extra"] = pd.concat((pd.Series([0]),increment),ignore_index=True)
The first line takes the cumulative sum of a boolean array showing differing lines, then the second tacks on a zero at the front and adds it to the dataframe.
Then you can do
gb = df.groupby(['extra'], as_index=False)
df = gb.agg({'Starts': np.min, 'Ends': np.max}, as_index=False)

Just do df.drop_duplicates(subset = ['ID', 'Value'], inplace=True)
this will drop the rows where you have duplicate ID and Value input.

Related

How to populate date in a dataframe using pandas in python

I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, saying add three (month_num) more dates to each case and removing the original ones.
original dataframe:
Case Date
0 1 2010-01-01
1 2 2011-04-01
2 3 2012-08-01
after populating dates:
Case Date
0 1 2010-02-01
1 1 2010-03-01
2 1 2010-04-01
3 2 2011-05-01
4 2 2011-06-01
5 2 2011-07-01
6 3 2012-09-01
7 3 2012-10-01
8 3 2012-11-01
I tried to declare an empty dataframe with the same column names and data type, and used for loop to loop over Case and month_num, and add rows into the new dataframe.
import pandas as pd
data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns = ['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)
df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
month_num = 3
for c in df.Case:
for m in range(1, month_num+1):
temp = df.loc[df['Case']==c]
temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code can work, however, when the original dataframe and month_num become large, it took huge time to run. Are there any better ways to do what I need? Thanks a alot!!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate once the list.
Given your input data this is what worked on my notebook:
df2=pd.DataFrame()
df2['Date']=df['Date'].apply(lambda x: pd.date_range(start=x, periods=3,freq='M')).explode()
df3=pd.merge_asof(df2,df,on='Date')
df3['Date']=df3['Date']+ pd.DateOffset(days=1)
df3[['Case','Date']]
We create a df2 to which we populate 'Date' with the needed dates coming from the original df
Then df3 resulting of a merge_asof between df2 and df (to populate the 'Case' column)
Finally , we offset the resulting column off 1 day

Groupby over periods of time

I have a table which contains ids, dates, a target (potentially multi class but for now binary where 1 is a fail) and a yearmonth column based on the date column. Below are the first 8 rows of this table:
row
id
date
target
yearmonth
0
A
2015-03-16
0
2015-03
1
A
2015-05-29
1
2015-05
2
A
2015-08-02
1
2015-08
3
A
2015-09-05
1
2015-09
4
A
2015-09-22
0
2015-09
5
A
2015-10-15
1
2015-10
6
A
2015-11-09
1
2015-11
7
B
2015-04-17
0
2015-04
I want to create lookback features for the last let's say 3 months so that for each single row, we take a look in the past and see the how that id performed over the last 3 months. So for ex for row 6, where date is 9th Nov 2015, the percentage of fails for id A in the last 3 calendaristic months (so in the whole of months of Aug, Sept & Oct) would be 75% (using rows 2-5).
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B'],'date' :['2015-03-16','2015-05-29','2015-08-02','2015-09-05','2015-09-22','2015-10-15','2015-11-09','2015-04-17'],'target':[0,1,1,1,0,1,1,0]} )
df['date'] = pd.to_datetime(df['date'], dayfirst = True)
df['yearmonth'] = df['date'].dt.to_period('M')
agg_dict = {
"Total_Transactions": pd.NamedAgg(column='target', aggfunc='count'),
"Fail_Count": pd.NamedAgg(column='target', aggfunc=(lambda x: len(x[x == 1]))),
"Perc_Monthly_Fails": pd.NamedAgg(column='target', aggfunc=(lambda x: len(x[x == 1])/len(x)*100))
}
df.groupby(['id','yearmonth']).agg(**agg_dict).reset_index(level = 1)
I've done an aggregation using id and month (see below) and I've tried things like rolling windows, but I could't find a way to actually aggregate looking back over a specific period for each single row. Any help is appreciated.
id
yearmonth
Total_Transactions
Fail_Count
Perc_Monthly_Fails
A
2015-03
1
0
0
A
2015-05
1
1
100
A
2015-08
1
1
100
A
2015-09
2
1
50
A
2015-10
1
1
100
A
2015-11
1
1
100
B
2015-04
1
0
0
You can do this by merging the DataFrame with itself on 'id'.
First we'll create a first of month 'fom' column since your date logic wants to look back based on prior months, not the date specifically. Then we merge the DataFrame with itself, bringing along the index so we can assign the result back in the end.
With month offsets we can then filter that to only keeping the observations within 3 months of the observation for that row, and then we groupby the original index and take the mean of 'target' to get the percent fail, which we can just assign back (alignment on index).
If there are NaN in the output it's because that row had no observations in the prior 3 months so you can't calculate.
#df['date'] = pd.to_datetime(df['date'], dayfirst = True)
df['fom'] = df['date'].astype('datetime64[M]') # Credit #anky
df1 = df.reset_index()
df1 = (df1.drop(columns='target').merge(df1, on='id', suffixes=['', '_past']))
df1 = df1[df1.fom_past.between(df1.fom-pd.offsets.DateOffset(months=3),
df1.fom-pd.offsets.DateOffset(months=1))]
df['Pct_fail'] = df1.groupby('index').target.mean()*100
id date target fom Pct_fail
0 A 2015-03-16 0 2015-03-01 NaN # No Rows to Avg
1 A 2015-05-29 1 2015-05-01 0.000000 # Avg Rows 0
2 A 2015-08-02 1 2015-08-01 100.000000 # Avg Rows 1
3 A 2015-09-05 1 2015-09-01 100.000000 # Avg Rows 2
4 A 2015-09-22 0 2015-09-01 100.000000 # Avg Rows 2
5 A 2015-10-15 1 2015-10-01 66.666667 # Avg Rows 2,3,4
6 A 2015-11-09 1 2015-11-01 75.000000 # Avg Rows 2,3,4,5
7 B 2015-04-17 0 2015-04-01 NaN # No Rows to Avg
If you're having an issue with memory we can take a very slow loop approach, which subsets for each row and then calculates the average from that subset.
def get_prev_avg(row, df):
df = df[df['id'].eq(row['id'])
& df['fom'].between(row['fom']-pd.offsets.DateOffset(months=3),
row['fom']-pd.offsets.DateOffset(months=1))]
if not df.empty:
return df['target'].mean()*100
else:
return np.NaN
#df['date'] = pd.to_datetime(df['date'], dayfirst = True)
df['fom'] = df['date'].astype('datetime64[M]')
df['Pct_fail'] = df.apply(lambda row: get_prev_avg(row, df), axis=1)
I have modified #ALollz code so that it applies better to my original dataset, where I have a multiclass target, and I would like to obtain PctFails for class 1 and 2, plus the nr of transactions, and I would need to group by different columns over different periods of times. Also, decided it's simpler and better to use the last x months prior to the date rather than the calendar months. So my solution to that was this:
df = pd.DataFrame({'Id':['A','A','A','A','A','A','A','B'],'Type':['T1','T3','T1','T2','T2','T1','T1','T3'],'date' :['2015-03-16','2015-05-29','2015-08-10','2015-09-05','2015-09-22','2015-11-08','2015-11-09','2015-04-17'],'target':[2,1,2,1,0,1,2,0]} )
df['date'] = pd.to_datetime(df['date'], dayfirst = True)
def get_prev_avg(row, df, columnname, lastxmonths):
df = df[df[columnname].eq(row[columnname])
& df['date'].between(row['date']-pd.offsets.DateOffset(months=lastxmonths),
row['date']-pd.offsets.DateOffset(days=1))]
if not df.empty:
NrTransactions= len(df['target'])
PctMinorFails= (df['target'].where(df['target'] == 1).count())/len(df['target'])*100
PctMajorFails= (df['target'].where(df['target'] == 2).count())/len(df['target'])*100
return pd.Series([NrTransactions, PctMinorFails, PctMajorFails])
else:
return pd.Series([np.NaN, np.NaN, np.NaN])
for lastxmonths in [3, 4]:
for columnname in ['Id','Type']:
df[['NrTransactionsBy' + str(columnname) + 'Last' + str(lastxmonths) +'Months',
'PctMinorFailsBy' + str(columnname) + 'Last' + str(lastxmonths) +'Months',
'PctMajorFailsBy' + str(columnname) + 'Last' + str(lastxmonths) +'Months'
]]= df.apply(lambda row: get_prev_avg(row, df, columnname, lastxmonths), axis=1)
Each iteration takes a couple hours for my original dataset which is not great, but unsure how to optimise it further.

Pandas Sum values from different columns based on dates

I'm working with a dataframe on pandas and I'm trying to sum the values of different rows to a new column. This must be based on the previous date (current month - 1 to be precise).
I have something like this:
Period Value
2015-01 1
2015-09 2
2015-10 1
2015-11 3
2015-12 1
And I would like to create a new column with the sum of 'Value' from the current 'Period' and ('Period' - 1month) if it exists. Example:
Period Value Result
2015-01 1 1
2015-09 2 2
2015-10 1 3
2015-11 3 4
2015-12 1 4
I tried to use a lambda function with something like:
df['Result'] = df.apply(lambda x: df.loc[(df.Period <= x.Period) &
(x.Period >= df.Period-1),
['Value']].sum(), axis=1)
It was based on other answers, but I'm a little confused if it is the best way to do it and how to make it work successfully (It is not giving any python error message, but it is not giving my expected output either).
UPDATE
I'm testing #taras answer on a simple example with three columns:
Account Period Value
15035 2015-01 1
15035 2015-09 1
15035 2015-10 1
The expected result would be:
Account Period Value
15035 2015-01 1
15035 2015-09 1
15035 2015-10 2
But I'm getting:
Account Period Value
15035 2015-01 1
15035 2015-09 2
15035 2015-10 2
When inspecting
print(df.loc[df.index - 1, 'Value'].fillna(0).values)
I'm getting [ 0. 1. 1.] (it should be [ 0. 0. 1.]). By looking at
print(df.loc[df.index - 1, 'Period'].fillna(0).values)
I'm getting [0 Period('2015-01', 'M') Period('2015-09', 'M')] (which looks like the index is getting the value from the previous row, and not the previous month).
Am I doing something wrong?
You can compute the index of rows for previous month with
idx = df.index - pd.DateOffset(months=1)
and then simply add it to your Value column
df.loc[idx, 'Value'].fillna(0).values + df['Value']
which results in
Period
2015-01-01 1.0
2015-09-01 2.0
2015-10-01 3.0
2015-11-01 4.0
2015-12-01 4.0
Name: Value, dtype: float64
Update: since you use pd.PeriodIndex rather than df.DatetimeIndex, idx is computed in much simple way:
idx = df.index - 1
because your period is 1 month.
So, to wrap up, the whole thing can be expressed in one quite simple expression:
df.loc[df.index - 1, 'Value'].fillna(0).values + df['Value']
You can join on an auxiliary column that manages the string conversion of your inputs:
import pandas as pd
from datetime import datetime
df['prev'] = (df.Period.apply(lambda x: x.to_timestamp()) - pd.DateOffset(months=1)
aux = df.merge(df, how='left', left_on = 'prev', right_on = 'Period')
df['sum'] = aux.Value_x + aux.Value_y
df= df.drop('prev',axis=1)

Determine change in values in a grouped dataframe

Assume a dataset like this (which originally is read in from a .csv):
data = pd.DataFrame({'id': [1,2,3,1,2,3],
'time':['2017-01-01 12:00:00','2017-01-01 12:00:00','2017-01-01 12:00:00',
'2017-01-01 12:10:00','2017-01-01 12:10:00','2017-01-01 12:10:00'],
'value': [10,11,12,10,12,13]})
=>
id time value
0 1 2017-01-01 12:00:00 10
1 2 2017-01-01 12:00:00 11
2 3 2017-01-01 12:00:00 12
3 1 2017-01-01 12:10:00 10
4 2 2017-01-01 12:10:00 12
5 3 2017-01-01 12:10:00 13
Time is identical for all IDs in each observation period. The series goes on like that for many observations, i.e. every ten minutes.
I want the number of total changes in the value column by id between consecutive times. For example: For id=1 there is no change (result: 0). For id=2 there is one change (result: 1).
Inspired by this post, I have tried taking differences:
Determining when a column value changes in pandas dataframe
This is what I've come up so far (not working as expected):
data = data.set_index(['id', 'time']) # MultiIndex
grouped = data.groupby(level='id')
data['diff'] = grouped['value'].diff()
data.loc[data['diff'].notnull(), 'diff'] = 1
data.loc[data['diff'].isnull(), 'diff'] = 0
grouped['diff'].sum()
However, this will just be the sum of occurrences for each id.
Since my dataset is huge (and wont fit into memory), the solution should be as fast as possible. ( This is why I use a MultiIndex on id + time. I expect significant speedup because optimally the data need not be shuffled anymore.)
Moreover, I have come around dask dataframes which are very similar to pandas dfs. A solution making use of them would be fantastic.
Do you want something like this?
data.groupby('id').value.apply(lambda x: len(set(x)) - 1)
You get
id
1 0
2 1
3 1
Edit: As #COLDSPEED mentioned, if the requirement is to capture change back to a certain value, use
data.groupby('id').value.apply(lambda x: (x != x.shift()).sum() - 1)
I think you're looking for a groupby and comparison by shift;
data.groupby('id')['value'].agg(lambda x: (x != x.shift(-1)).sum() - 1)
id
1 0
2 1
3 1
Name: value, dtype: int64
data.groupby('id').value.agg(lambda x : (x.diff()!=0).sum()).add(-1)
id
1 0
2 1
3 1
Name: value, dtype: int64
Another by using pct_change
data.groupby('id').value.apply(lambda x : (x.pct_change()!=0).sum()).add(-1)
Out[323]:
id
1 0
2 1
3 1
Name: value, dtype: int64

How to merge two data frames based on nearest date

I want to merge two data frames based on two columns: "Code" and "Date". It is straightforward to merge data frames based on "Code", however in case of "Date" it becomes tricky - there is no exact match between Dates in df1 and df2. So, I want to select closest Dates. How can I do this?
df = df1[column_names1].merge(df2[column_names2], on='Code')
I don't think there's a quick, one-line way to do this kind of thing but I belive the best approach is to do it this way:
add a column to df1 with the closest date from the appropriate group in df2
call a standard merge on these
As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbor code for this sort of thing.
I've put together one approach to that solution that should scale relatively well.
First we can generate some simple data:
import pandas as pd
import numpy as np
dates = pd.date_range('2015', periods=200, freq='D')
rand = np.random.RandomState(42)
i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
'Date': dates[i1],
'val1':rand.rand(5)})
df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
'Date': dates[i2],
'val2':rand.rand(5)})
Let's check these out:
>>> df1
Code Date val1
0 0 2015-01-16 0.975852
1 0 2015-01-31 0.516300
2 1 2015-04-06 0.322956
3 1 2015-05-09 0.795186
4 1 2015-06-08 0.270832
>>> df2
Code Date val2
0 1 2015-02-03 0.184334
1 1 2015-04-13 0.080873
2 0 2015-05-02 0.428314
3 1 2015-06-26 0.688500
4 0 2015-06-30 0.058194
Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:
from sklearn.neighbors import NearestNeighbors
def find_nearest(group, match, groupname):
match = match[match[groupname] == group.name]
nbrs = NearestNeighbors(1).fit(match['Date'].values[:, None])
dist, ind = nbrs.kneighbors(group['Date'].values[:, None])
group['Date1'] = group['Date']
group['Date'] = match['Date'].values[ind.ravel()]
return group
df1_mod = df1.groupby('Code').apply(find_nearest, df2, 'Code')
>>> df1_mod
Code Date val1 Date1
0 0 2015-05-02 0.975852 2015-01-16
1 0 2015-05-02 0.516300 2015-01-31
2 1 2015-04-13 0.322956 2015-04-06
3 1 2015-04-13 0.795186 2015-05-09
4 1 2015-06-26 0.270832 2015-06-08
Finally, we can merge these together with a straightforward call to pd.merge:
>>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
Code Date val1 Date1 val2
0 0 2015-05-02 0.975852 2015-01-16 0.428314
1 0 2015-05-02 0.516300 2015-01-31 0.428314
2 1 2015-04-13 0.322956 2015-04-06 0.080873
3 1 2015-04-13 0.795186 2015-05-09 0.080873
4 1 2015-06-26 0.270832 2015-06-08 0.688500
Notice that rows 0 and 1 both matched the same val2; this is expected given the way you described your desired solution.
Here's an alternative solution:
Merge on Code.
Add a date difference column according to your need (I used abs in the example below) and sort the data using the new column.
Group by the records of the first data frame and for each group take a record from the second data frame with the closest date.
Code:
df = df1.reset_index()[column_names1].merge(df2[column_names2], on='Code')
df['DateDiff'] = (df['Date1'] - df['Date2']).abs()
df.sort_values('DateDiff').groupby('index').first().reset_index()

Categories