Fastest way to add an extra row to a groupby in pandas - python

I'm trying to create a new row for each group in a dataframe by copying the last row and then modifying some values. My approach is below; the concat step appears to be the bottleneck (I tried append too). Any suggestions?
def genNewObs(df):
    lastRowIndex = df.obsNumber.idxmax()
    row = df.loc[[lastRowIndex]].copy()   # keep it as a one-row DataFrame
    # change some other values in row here
    df = pd.concat([df, row], ignore_index=True)
    return df

df = df.groupby(GROUP).apply(genNewObs)
Edit 1: Basically I have a bunch of data with the last observation on different dates. I want to create a final observation for all groups on the current date.
Group Date Days Since last Observation
A 1/1/2014 0
A 1/10/2014 9
B 1/5/2014 0
B 1/25/2014 20
B 1/27/2014 2
If we pretend the current date is 1/31/2014 this becomes:
Group Date Days Since last Observation
A 1/1/2014 0
A 1/10/2014 9
A 1/31/2014 21
B 1/5/2014 0
B 1/25/2014 20
B 1/27/2014 2
B 1/31/2014 4
I've tried setting with enlargement and it is the slowest of all techniques. Any ideas?

Thanks to user1827356, I sped it up by a factor of 100 by taking the operation out of the apply. For some reason first was dropping the Group column, so I used idxmax instead.
def genNewObs(df):
    lastRowIndex = df.groupby('Group')['Date'].idxmax()
    rows = df.loc[lastRowIndex]
    df = pd.concat([df, rows], ignore_index=True)
    df = df.sort_values(['Group', 'Date'], ascending=True)
    return df
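For completeness, a minimal sketch of the full step (the "change some other values" part), assuming a datetime 'Date' column, a fixed current date of 1/31/2014, and a hypothetical 'DaysSinceLast' column standing in for "Days Since last Observation":

import pandas as pd

current_date = pd.Timestamp('2014-01-31')  # assumed "current" date

last_idx = df.groupby('Group')['Date'].idxmax()   # last row of each group
new_rows = df.loc[last_idx].copy()
new_rows['DaysSinceLast'] = (current_date - new_rows['Date']).dt.days
new_rows['Date'] = current_date

df = (pd.concat([df, new_rows], ignore_index=True)
        .sort_values(['Group', 'Date'])
        .reset_index(drop=True))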

Related

How to populate date in a dataframe using pandas in python

I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, say, adding three (month_num) more dates to each case and removing the original ones.
original dataframe:
Case Date
0 1 2010-01-01
1 2 2011-04-01
2 3 2012-08-01
after populating dates:
Case Date
0 1 2010-02-01
1 1 2010-03-01
2 1 2010-04-01
3 2 2011-05-01
4 2 2011-06-01
5 2 2011-07-01
6 3 2012-09-01
7 3 2012-10-01
8 3 2012-11-01
I tried to declare an empty dataframe with the same column names and data type, and used for loop to loop over Case and month_num, and add rows into the new dataframe.
import pandas as pd

data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns=['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)

df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])

month_num = 3
for c in df.Case:
    for m in range(1, month_num + 1):
        temp = df.loc[df['Case'] == c]
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code works; however, when the original dataframe and month_num get large, it takes a huge amount of time to run. Are there any better ways to do what I need? Thanks a lot!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate the list once at the end.
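A minimal sketch of that collect-then-concat pattern, reusing the df and month_num defined in the question:

frames = []
for c in df.Case:
    for m in range(1, month_num + 1):
        temp = df.loc[df['Case'] == c].copy()
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        frames.append(temp)

df_new = pd.concat(frames, ignore_index=True)  # one concat instead of one per iteration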
Given your input data this is what worked on my notebook:
df2 = pd.DataFrame()
df2['Date'] = df['Date'].apply(lambda x: pd.date_range(start=x, periods=3, freq='M')).explode()
df3 = pd.merge_asof(df2, df, on='Date')
df3['Date'] = df3['Date'] + pd.DateOffset(days=1)
df3[['Case', 'Date']]
We create df2 and populate 'Date' with the needed dates derived from the original df.
Then df3 is the result of a merge_asof between df2 and df (to populate the 'Case' column).
Finally, we offset the resulting column by 1 day.
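A loop-free alternative sketch that skips the merge_asof step, assuming the df and month_num from the question:

offsets = [pd.DateOffset(months=m) for m in range(1, month_num + 1)]

# Give every row a list of shifted dates, then explode to one row per date.
df_new = (df.assign(Date=df['Date'].apply(lambda d: [d + o for o in offsets]))
            .explode('Date')
            .reset_index(drop=True))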

Comparing rows of string inside groupby and assigning a value to a new column pandas

I have a dataset of employees (their IDs) and the names of their bosses for several years.
df:
What I need to do is to see if an employee had a change of boss. So, the desired output is:
For employees who appear in the df only once, I just assign 0 (no boss change). However, I cannot figure out how to do it for the employees who are in the df for several years.
I was thinking that first I need to assign 0 for the first year they appear in the df (because we do not know who the boss was before, so there is no boss change). Then I need to compare the name of the boss with the name in the next row and decide whether to assign 1 or 0 to the ManagerChange column.
So far I split the df into two (with unique IDs and duplicated IDs) and assigned 0 to ManagerChange for the unique IDs.
Then I groupby the duplicated IDs and sort them by year. However, I am new to Python and cannot figure out how to compare strings and assign a result value to a new column inside the groupby. Please, help.
Code I have so far:
# splitting database in two
bool_series = df["ID"].duplicated(keep=False)
df_duplicated=df[bool_series]
df_unique = df[~bool_series]
# assigning 0 for ManagerChange for the unique IDs
df_unique['ManagerChange'] = 0
# groupby by ID and sorting by year for the duplicated IDs
df_duplicated.groupby('ID').apply(lambda x: x.sort_values('Year'))
You can group by ID, then shift() within each group and compare against the Boss column.
# Sort value first
df.sort_values(['ID', 'Year'], inplace=True)
# Compare Boss column with shifted Boss column
df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1)).tolist()
# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})
# Sort df to original df
df = df.sort_index()
# Change the first in each group to 0
df.loc[df.groupby('ID').head(1).index, 'ManagerChange'] = 0
# print(df)
ID Year Boss ManagerChange
0 1234 2018 Anna 0
1 567 2019 Sarah 0
2 1234 2020 Michael 0
3 8976 2019 John 0
4 1234 2019 Michael 1
5 8976 2020 John 0
You could also make use of the fill_value argument; this lets you get rid of the last df.loc[] operation.
# Sort value first
df.sort_values(['ID', 'Year'], inplace=True)
df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1, fill_value=group['Boss'].iloc[0])).tolist()
# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})
# Sort df to original df
df = df.sort_index()
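For reference, a fully vectorised sketch of the same logic without apply, assuming the ID / Year / Boss columns used above:

df = df.sort_values(['ID', 'Year'])

boss_changed = df['Boss'].ne(df.groupby('ID')['Boss'].shift())  # True where Boss differs from previous year
first_in_group = ~df['ID'].duplicated()                         # first year of each employee
df['ManagerChange'] = (boss_changed & ~first_in_group).astype(int)

df = df.sort_index()  # restore original row order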

Efficiency: Dropping rows with the same timestamp while keeping the median of the second column for that timestamp

What I want to do:
Column 'angle' holds about 20 angle readings per second (this can vary), but my 'Time' timestamp only has an accuracy of 1 s, so roughly ~20 rows share the same timestamp (the dataframe has over 1 million rows in total).
The result should be a new dataframe with one row per timestamp, where the angle for each timestamp is the median of the ~20 readings in that interval.
My Idea:
I iterate through the rows and check if the timestamp has changed.
If so, I select all timestamps until it changes, calculate the median, and append it to a new dataframe.
Nevertheless, I have many big data files and I am wondering if there is a faster way to achieve my goal.
Right now my code is the following (see below).
It is not fast and I think there must be a better way to do that with pandas/numpy (or something else?).
a = 0
for i in range(1, len(df1.index)):
    if df1.iloc[[a], [1]].iloc[0][0] == df1.iloc[[i], [1]].iloc[0][0]:
        continue
    else:
        if a == 0:
            df_result = df1[a:i-1].median()
        else:
            df_result = df_result.append(df1[a:i-1].median(), ignore_index=True)
        a = i
You can use groupby here. Below, I made a simple dummy dataframe.
import pandas as pd
df1 = pd.DataFrame({'time': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                    'angle': [8, 9, 7, 1, 4, 5, 11, 4, 3, 8, 7, 6]})
df1
time angle
0 1 8
1 1 9
2 1 7
3 1 1
4 1 4
5 1 5
6 2 11
7 2 4
8 2 3
9 2 8
10 2 7
11 2 6
Then, we group by the timestamp, take the median of the angle column within each group, and convert the result to a pandas dataframe.
df2 = pd.DataFrame(df1.groupby('time')['angle'].median())
df2 = df2.reset_index()
df2
time angle
0 1 6.0
1 2 6.5
You can also use .agg after grouping to choose the operation to apply to each column:
df1.groupby('Time', as_index=False).agg({"angle":"median"})
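If 'Time' is an actual datetime column, a resample-based sketch does the same aggregation (assuming datetime64 values and that seconds with no readings should be dropped):

df2 = (df1.set_index('Time')
          .resample('1S')['angle']
          .median()
          .dropna()
          .reset_index())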

How to re-index as multi-index pandas dataframe from index value that repeats

I have an index in a pandas dataframe which repeats the index value. I want to re-index as multi-index where repeated indexes are grouped.
The index looks like this:
So I would like all the 112335586 index values to be grouped under the same index.
I have looked at this question, Create pandas dataframe by repeating one row with new multiindex, but there the index values can be pre-defined, which is not possible here as my dataframe is far too large to hard-code.
I also looked at the multi-index documentation, but this also pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10 1
10 2
20 3
20 4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10 0 1
1 2
20 0 3
1 4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
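Applied to a toy frame mirroring the series above (a sketch, assuming the repeated index is named EVENT_ID):

import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4]},
                  index=pd.Index([10, 10, 20, 20], name='EVENT_ID'))

df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID', 'sub_idx'], inplace=True)
print(df)
#                   value
# EVENT_ID sub_idx
# 10       0            1
#          1            2
# 20       0            3
#          1            4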

Create new column based on another column for a multi-index pandas dataframe

I'm running Python 3.5 on Windows and writing code to study financial econometrics.
I have a multi-index panda dataframe where the level=0 index is a series of month-end dates and the level=1 index is a simple integer ID. I want to create a new column of values ('new_var') where for each month-end date, I look forward 1-month and get the values from another column ('some_var') and of course the IDs from the current month need to align with the IDs for the forward month. Here is a simple test case.
import pandas as pd
import numpy as np

# Create some time series data
id = np.arange(0, 5)
date = [pd.Timestamp(2017, 1, 31) + pd.offsets.MonthEnd(i) for i in [0, 1]]
my_data = []
for d in date:
    for i in id:
        my_data.append((d, i, np.random.random()))
df = pd.DataFrame(my_data, columns=['date', 'id', 'some_var'])
df['new_var'] = np.nan
df.set_index(['date', 'id'], inplace=True)

# Drop an observation to reflect my true data
df.drop(('2017-02-28', 3), level=None, inplace=True)
df

# The desired output....
list1 = df.loc['2017-01-31'].index.get_level_values('id').tolist()
list2 = df.loc['2017-02-28'].index.get_level_values('id').tolist()
common = list(set(list1) & set(list2))
for i in common:
    df.loc[('2017-01-31', i), 'new_var'] = df.loc[('2017-02-28', i), 'some_var']
df
I feel like there is a better way to get my desired output. Maybe I should just embrace the "for" loop? Maybe a better solution is to reset the index?
Thank you,
F
I would create an integer column representing the date, subtract one from it (to shift it by one month), and then merge the values back onto the original dataframe with a left join.
Out[28]:
some_var
date id
2017-01-31 0 0.736003
1 0.248275
2 0.844170
3 0.671364
4 0.034331
2017-02-28 0 0.051586
1 0.894579
2 0.136740
4 0.902409
df = df.reset_index()
df['n_group'] = df.groupby('date').ngroup()
df_shifted = df[['n_group', 'some_var','id']].rename(columns={'some_var':'new_var'})
df_shifted['n_group'] = df_shifted['n_group']-1
df = df.merge(df_shifted, on=['n_group','id'], how='left')
df = df.set_index(['date','id']).drop('n_group', axis=1)
Out[31]:
some_var new_var
date id
2017-01-31 0 0.736003 0.051586
1 0.248275 0.894579
2 0.844170 0.136740
3 0.671364 NaN
4 0.034331 0.902409
2017-02-28 0 0.051586 NaN
1 0.894579 NaN
2 0.136740 NaN
4 0.902409 NaN
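If the month-end dates are consecutive (the same assumption the n_group trick relies on), a shorter sketch is to shift some_var backwards within each id group:

df['new_var'] = df.groupby(level='id')['some_var'].shift(-1)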
