I am reading in a .csv file and creating a pandas dataframe. The file is a file of stocks. I am only interested in the date, the company, and the closing price. I want my program to find the max profit, along with the starting date, the ending date, and the company. It needs to use a divide and conquer algorithm. I only know how to use for loops, but that takes forever to run. The .csv file is 200,000 rows. How can I get this to run fast?
import pandas as pd
import numpy as np
import math
def cleanData(file):
    df = pd.read_csv(file)
    del df['open']
    del df['low']
    del df['high']
    del df['volume']
    return np.array(df)

df = cleanData('prices-split-adjusted.csv')

bestStock = [None, None, None, float(-math.inf)]

def DAC(data):
    global bestStock
    if len(data) > 1:
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        DAC(left)
        DAC(right)
        for i in range(len(data)):
            for j in range(i+1, len(data)):
                if data[i,1] == data[j,1]:
                    profit = data[j,2] - data[i,2]
                    if profit > bestStock[3]:
                        bestStock[0] = data[i,0]
                        bestStock[1] = data[j,0]
                        bestStock[2] = data[i,1]
                        bestStock[3] = profit
        print(bestStock)
        print('\n')
    return bestStock

print(DAC(df))
I've got two things for your consideration (my answer tries not to change your algorithmic approach, i.e. the nested loops and recursive functions, and tackles the low-hanging fruit first):
Unless you are debugging, try to avoid print() inside a loop (in your case the .. print(bestStock) .. call). The I/O overhead can add up, especially if you are looping over large datasets and printing to screen often. Once you are happy with your code, comment it out for the full-dataset run and uncomment it only during debugging sessions. You can expect some improvement in speed just from not printing to screen inside the loop.
If you are after even more ways to speed it up, I found in my case (similar to yours, which I often encounter especially in search/sort problems) that simply switching the expensive part (the Python for loops) to Cython, and statically defining variable types (this is KEY to the speed-up!), gives several orders of magnitude of speed-up even before optimizing the implementation. Check out Cython: https://cython.readthedocs.io/en/latest/index.html. If that's not enough, then parallelism is your next best friend, which would require rethinking your code.
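As a very rough illustration of that second point (my own sketch, not code from the question, written in Cython's "pure Python" mode so the same file also runs as ordinary Python; the function name and the use of a plain list of one company's closing prices are assumptions):
# profit_scan.py -- compile with e.g. `cythonize -i profit_scan.py`,
# or just import it as normal Python (without the speed-up).
import cython

@cython.ccall
def best_profit(closes: list) -> float:
    # Statically typing the loop indices and accumulators is what lets Cython
    # generate fast C loops; a typed memoryview over a numpy array would be faster still.
    n: cython.Py_ssize_t = len(closes)
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t
    best: cython.double = float('-inf')
    buy: cython.double
    for i in range(n):
        buy = closes[i]
        for j in range(i + 1, n):
            if closes[j] - buy > best:
                best = closes[j] - buy
    return best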
The main problems causing the slow performance are:
You manually iterate over 2 columns in nested loops instead of using pandas operations, which are backed by fast ndarray functions;
you use recursive calls, which look nice and simple but are slow.
Setting the sample data as follows:
Date Company Close
0 2019-12-31 AAPL 73.412498
1 2019-12-31 FB 205.250000
2 2019-12-31 NFLX 323.570007
3 2020-01-02 AAPL 75.087502
4 2020-01-02 FB 209.779999
... ... ... ...
184 2020-03-30 FB 165.949997
185 2020-03-30 NFLX 370.959991
186 2020-03-31 AAPL 63.572498
187 2020-03-31 FB 166.800003
188 2020-03-31 NFLX 375.500000
189 rows × 3 columns
Then use the following codes (modify the column labels to your labels if different):
df_result = df.groupby('Company').agg(
    Start_Date=pd.NamedAgg(column='Date', aggfunc="first"),
    End_Date=pd.NamedAgg(column='Date', aggfunc="last"),
    bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0]))
Resulting output:
Start_Date End_Date bestGain
Company
AAPL 2019-12-31 2020-03-31 8.387505
FB 2019-12-31 2020-03-31 17.979996
NFLX 2019-12-31 2020-03-31 64.209991
To get the entry with greatest gain:
df_result.loc[df_result['bestGain'].idxmax()]
Resulting output:
Start_Date 2019-12-31 00:00:00
End_Date 2020-03-31 00:00:00
bestGain 64.209991
Name: NFLX, dtype: object
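As an aside (a sketch of my own, not part of the answer above): if the buy date is allowed to be any trading day rather than the first one per company, the same "max profit" idea can still be fully vectorized with a per-company cumulative minimum, assuming df is still a DataFrame with the Date/Company/Close labels used here:
# For each row: profit if we sold that day after buying at the cheapest close seen so far.
df = df.sort_values(['Company', 'Date'])
df['cummin'] = df.groupby('Company')['Close'].cummin()
df['profit'] = df['Close'] - df['cummin']
best = df.loc[df.groupby('Company')['profit'].idxmax()]   # best sell day per company
The Date in each of those rows is the sell date; the matching buy date is the day the running minimum was set, which takes one extra lookup per company.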
Execution time comparison
With my scaled-down data of 3 stocks over 3 months, the code making use of pandas functions takes 8.9 ms, which is about half the execution time of the original code that manually iterates over the numpy array with nested loops and recursive calls (16.9 ms), even after the majority of the print() calls were removed.
Your codes with print() inside DAC() function removed:
%%timeit
"""
def cleanData(df):
    # df = pd.read_csv(file)
    del df['Open']
    del df['Low']
    del df['High']
    del df['Volume']
    return np.array(df)
"""
# df = cleanData('prices-split-adjusted.csv')
# df = cleanData(df0)
df = np.array(df0)

bestStock = [None, None, None, float(-math.inf)]

def DAC(data):
    global bestStock
    if len(data) > 1:
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        DAC(left)
        DAC(right)
        for i in range(len(data)):
            for j in range(i+1, len(data)):
                if data[i,1] == data[j,1]:
                    profit = data[j,2] - data[i,2]
                    if profit > bestStock[3]:
                        bestStock[0] = data[i,0]
                        bestStock[1] = data[j,0]
                        bestStock[2] = data[i,1]
                        bestStock[3] = profit
        # print(bestStock)
        # print('\n')
    return bestStock

print(DAC(df))
[Timestamp('2020-03-16 00:00:00'), Timestamp('2020-03-31 00:00:00'), 'NFLX', 76.66000366210938]
16.9 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New simplified code, written the pandas way:
%%timeit
df_result = df.groupby('Company').agg(
    Start_Date=pd.NamedAgg(column='Date', aggfunc="first"),
    End_Date=pd.NamedAgg(column='Date', aggfunc="last"),
    bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0]))
df_result.loc[df_result['bestGain'].idxmax()]
8.9 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution using recursive function:
The main problem with your recursive function is that you do not make use of the results of the recursive calls on the reduced-size data.
To properly use recursive function as a divide-and-conquer approach, you should take 3 major steps:
Divide the whole set of data into smaller pieces and handle the smaller pieces by recursive calls each taking one of the smaller pieces
Handle the end-point case (the easiest case most of the time) in each recursive call
Consolidate the results of all recursive calls of smaller pieces
The beauty of recursive calls is that you can solve a complicated problem by replacing the processing with 2 much easier steps: the 1st step handles the end-point case, where most of the time you deal with only ONE data item (which is usually easy); the 2nd step is another easy step that consolidates the results of the reduced-size calls.
You managed to take the first step but not the other 2 steps. In particular, you did not take advantage of simplifying the processing by making use of the results of the smaller pieces. Instead, you handle the whole set of data in each call by looping over all rows of the 2-dimensional numpy array. The nested-loop logic is just like a bubble sort, with complexity O(n²) instead of O(n). Hence, your recursive calls are just wasted time with no added value!
Suggest to modify your recursive functions as follows:
def DAC(data):
    # global bestStock                                  # define bestStock as a local variable instead
    bestStock = [None, None, None, float(-math.inf)]    # init bestStock
    if len(data) == 1:      # End-point case: data = 1 row
        bestStock[0] = data[0,0]
        bestStock[1] = data[0,0]
        bestStock[2] = data[0,1]
        bestStock[3] = 0.0
    elif len(data) == 2:    # End-point case: data = 2 rows
        bestStock[0] = data[0,0]
        bestStock[1] = data[1,0]
        bestStock[2] = data[0,1]    # Enhance here to allow stock break
        bestStock[3] = data[1,2] - data[0,2]
    elif len(data) >= 3:    # Recursive calls and consolidate results
        mid = len(data)//2
        left = data[:mid]
        right = data[mid:]
        bestStock_left = DAC(left)
        bestStock_right = DAC(right)
        # Now make use of the results of divide-and-conquer and consolidate the results
        bestStock[0] = bestStock_left[0]
        bestStock[1] = bestStock_right[1]
        bestStock[2] = bestStock_left[2]    # Enhance here to allow stock break
        bestStock[3] = bestStock_left[3] if bestStock_left[3] >= bestStock_right[3] else bestStock_right[3]
    # print(bestStock)
    # print('\n')
    return bestStock
Here we need to handle 2 kinds of end-point cases: 1 row and 2 rows. The reason is that for the case with only 1 row we cannot calculate a gain and can only set it to zero; a gain can only be calculated from 2 rows onward. If we did not split into these 2 end-point cases, we could end up propagating only zero gains all the way up.
This is a demo of how you should code the recursive calls to take advantage of them. The code still has a limitation you need to fine-tune: you have to enhance it further to handle the stock-break case, where the split point falls between two different companies. The 2-row and >= 3-row branches currently assume no stock break; a sketch of how the consolidation step can use the sub-results follows below.
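To make that enhancement concrete, here is a minimal sketch (my own illustration, not the answer's code) of a combine step that does use the sub-results, assuming all rows belong to one company and are sorted by date, with rows shaped [date, company, close] as in the question. The combine step is O(1), so the whole recursion does O(n) work per company instead of O(n²):
def dac_profit(data):
    # Returns (best, min_close, max_close) where
    #   best      = (buy_date, sell_date, profit)
    #   min_close = (date, close) of the cheapest close in this slice
    #   max_close = (date, close) of the dearest close in this slice
    if len(data) == 1:
        row = data[0]
        return (row[0], row[0], 0.0), (row[0], row[2]), (row[0], row[2])
    mid = len(data) // 2
    best_l, min_l, max_l = dac_profit(data[:mid])
    best_r, min_r, max_r = dac_profit(data[mid:])
    # Consolidate: the best trade is either entirely in the left half, entirely in the
    # right half, or buys at the cheapest left close and sells at the dearest right close.
    cross = (min_l[0], max_r[0], max_r[1] - min_l[1])
    best = max(best_l, best_r, cross, key=lambda t: t[2])
    min_close = min_l if min_l[1] <= min_r[1] else min_r
    max_close = max_l if max_l[1] >= max_r[1] else max_r
    return best, min_close, max_close
Handling multiple companies would mean carrying one such triple per company through the combine step, or simply splitting the data by company first.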
Related
Is there a way to optimize the below code snippet?
I am trying to calculate the value of a column in the current row using the previous row's value of that column, a period specified in a custom function, and the price in the current row.
import pandas as pd

class EMA_Period:
    fast = 8
    slow = 17

def calculate_ema(prev_ema, price, period):
    return prev_ema + (2.0 / (1.0 + period)) * (price - prev_ema)

times = [1578614400, 1578614700, 1578615000, 1578615300, 1578615600]
closes = [10278.6, 10276.0, 10275.6, 10274.8, 10277.0]
fast_ema = [10278.6, 0, 0, 0, 0]

df = pd.DataFrame(data={'time': times, 'close': closes, 'fast_ema': fast_ema})
df.set_index('time', inplace=True)

for i in range(1, df.shape[0]):
    df.iloc[i]['fast_ema'] = calculate_ema(df.iloc[i-1]['fast_ema'], df.iloc[i]['close'], EMA_Period.fast)
You should really use a vectorized approach if you care about speed. Looping over the rows will always be the slowest option (though sometimes unavoidable)
You don't even need to change your function to make it vectorized!
def calculate_ema(prev_ema, price, period):
    return prev_ema + (2.0 / (1.0 + period)) * (price - prev_ema)

# though we will make your dataframe longer: 500 rows instead of 5 rows
df = pd.concat([df] * 100)
print(df)
close fast_ema
time
1578614400 10278.6 10278.6
1578614700 10276.0 0.0
1578615000 10275.6 0.0
1578615300 10274.8 0.0
1578615600 10277.0 0.0
... ... ...
1578614400 10278.6 10278.6
1578614700 10276.0 0.0
1578615000 10275.6 0.0
1578615300 10274.8 0.0
1578615600 10277.0 0.0
[500 rows x 2 columns]
Note that these tests are timing 2 important things:
Performance of the calculation itself
Performance of assigning values back into a dataframe
Row looping solution
%%timeit
for i in range(1, df.shape[0]):
    df.iloc[i]['fast_ema'] = calculate_ema(df.iloc[i-1]['fast_ema'], df.iloc[i]['close'], EMA_Period.fast)
10 loops, best of 5: 86.1 ms per loop
86.1 ms is pretty slow for such a small dataset. Let's see how the vectorized approach compares:
Vectorized Solution
By using .shift() on the "fast_ema" column we can change how these vectors align such that each value in "close" is aligned with the previous "fast_ema".
With the alignment taken care of, we can feed these vectors directly into the calculate_ema function without making any changes
%%timeit
df["fast_ema"].iloc[1:] = calculate_ema(df["fast_ema"].shift(), df["close"], EMA_Period.fast).iloc[1:]
1000 loops, best of 5: 569 µs per loop
Time comparisons:
Approach
Time
Row Looping
86.1 ms
Vectorized
569 µs
Thanks @Mars
def calc_ema(df, period=8, col_name='fast'):
    prev_value = df.iloc[0][col_name]

    def func2(row):
        # nonlocal variable ==> will use prev_value from the enclosing calc_ema function
        nonlocal prev_value
        prev_value = prev_value + (2.0 / (1.0 + period)) * (row['close'] - prev_value)
        return prev_value

    # This line might throw a SettingWithCopyWarning warning
    df.iloc[1:][col_name] = df.iloc[1:].apply(func2, axis=1)
    return df
df = calc_ema(df)
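If that chained assignment does not actually stick, or to avoid the SettingWithCopyWarning noted in the comment, one variation (assuming the DataFrame has a unique index) is to write through .loc inside calc_ema, replacing the assignment line with:
    # Assign via .loc with explicit index labels so pandas writes into the
    # original DataFrame rather than a temporary copy.
    df.loc[df.index[1:], col_name] = df.iloc[1:].apply(func2, axis=1)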
I'm new to python and although I can write for loops with no issue, I'm finding they're horrendously slow. Here's my code:
perc_match is a function that runs a calculation between two vectors, which in this case are rows of a dataframe.
def perc_match(customer_id, bait_name):
    score = int(df_master.loc[customer_id,:].dot(df_pim.loc[bait_name,:].values))
    perfect = int(df_master.loc[customer_id,:].dot(df_perf.iloc[0,:].values))
    if perfect == 0:
        return 0
    elif (score / perfect)*100 < 0:
        return 0
    else:
        percent = round((score / perfect)*100, 3)
        percent = float(percent)
        return percent
match_maker calls perc_match for every row in two dataframes and places the output in its respective cell in df_match.
def match_maker(df_match):
    for i in df_match.index:
        for j in df_match.columns:
            df_match.loc[i,j] = perc_match(i,j)
for reference:
df_master.shape = (122905, 33)
df_pim.shape = (36, 33)
df_perf.shape = (1, 33)
df_match.shape = (122905, 36)
This all works fine - except when I test how long it takes...
5.49 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Not good when I'm running this on 100,000s of rows. I know there are ways to optimize the code, but I'm having a hard time understanding it. What's the best way I can slim this code down?
EDIT:
The inputs look something like this:
df_master:
Customer ID Email Technique 1 ... Technique 33
12345 i#me.com 1 ... 0
...
df_pim:
Product ID Technique 1 ... Technique 33
Product 1 1 0
...
df_perc (all values are 1):
index Technique 1 ... Technique 33
1 1
df_match:
Customer ID Email Product 1 ... Product N
12345 i#me.com 0 ... 0
...
I want the function to edit df_match to look like this:
df_match (gives a % match based on comparison between technique values):
Customer ID Email Product 1 ... Product N
12345 i#me.com 12.842 ... 44.312
...
Assumptions:
I'm assuming df_perf in perc_match() line 3 is a typo and you meant df_perc.
You are thinking of things as individual values to be calculated one at a time. The .dot operator you are using can handle 2 dimensions as well as single dimensions.
In your perc_match() you have:
score = int(df_master.loc[customer_id,:].dot(df_pim.loc[bait_name,:].values))
this operates on one row at a time against one other row. How about making a whole score dataframe with:
columns = ["Technique " + str(a) for a in range(1, 34)]
score_df = df_master[columns].dot(df_pim[columns].T)  # transpose so the technique axes line up
The perfect value is mostly unnecessary work if you are multiplying by a dataframe of all ones; it is just each customer's row sum. So how about something like this:
perfect = df_master[columns].sum(axis=1)
This will give you some thoughts to ponder for a while. I'll finish this answer later or someone can pick this up while I'm away.
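Putting those pieces together, a rough end-to-end sketch (my own illustration built on the hints above, not the finished answer; it assumes Customer ID and Product ID are the indexes of df_master and df_pim, as the .loc lookups in perc_match suggest, and it drops the int() truncation):
import numpy as np

def match_maker_vectorized(df_master, df_pim):
    # Technique columns assumed to be named "Technique 1" .. "Technique 33".
    columns = ["Technique " + str(a) for a in range(1, 34)]

    # (customers x techniques) . (techniques x products) -> customers x products
    scores = df_master[columns].dot(df_pim[columns].T)

    # With df_perc being all ones, the "perfect" score is each customer's row sum.
    perfect = df_master[columns].sum(axis=1)

    # Percentage match: avoid dividing by a zero perfect score, clip negatives,
    # and round to 3 decimal places, mirroring perc_match.
    pct = scores.div(perfect.replace(0, np.nan), axis=0) * 100
    return pct.clip(lower=0).fillna(0).round(3)
The result has the same shape as df_match (customers by products), so its columns can be written straight into df_match.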
I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date). I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And have began to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
    if dates.Start[index] < dates.End[index]:
        dates.Correct[index] = "correct"
    elif dates.Start[index] == dates.End[index]:
        dates.Correct[index] = "same"
    elif dates.Start[index] > dates.End[index]:
        dates.Correct[index] = "incorrect"
This works, it is just taking a really long time (over 15 minutes). I need more efficient code. Is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
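If you still want the text labels from your 'Correct' column, one way to build them from those comparisons is numpy.select (a small sketch):
import numpy as np

dates['Correct'] = np.select(
    [dates['Start'] < dates['End'], dates['Start'] == dates['End']],
    ['correct', 'same'],
    default='incorrect')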
Since the list doesn't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparison simultaneously. Take a look at the multiprocessing module for help.
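For example, a rough sketch of that idea, assuming the dates DataFrame from the question (the chunk count and helper function are my own; for a comparison this cheap, the vectorized answer above will almost certainly beat the process start-up cost, so treat this as a pattern rather than a recommendation):
import numpy as np
import pandas as pd
from multiprocessing import Pool

def check_chunk(chunk):
    # Vectorized comparison within one chunk of the dates frame.
    return np.where(chunk['Start'] < chunk['End'], 'correct',
                    np.where(chunk['Start'] == chunk['End'], 'same', 'incorrect'))

if __name__ == '__main__':
    chunks = np.array_split(dates, 4)            # split the frame into 4 pieces
    with Pool(processes=4) as pool:
        results = pool.map(check_chunk, chunks)  # compare the pieces in parallel
    dates['Correct'] = np.concatenate(results)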
Something like the following may be quicker:
import pandas as pd
import datetime
df = pd.DataFrame({
    'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
    'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})

def comparison_check(df):
    start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
    end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
    if start < end:
        return "correct"
    elif start == end:
        return "same"
    return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447 µs / 3 rows) × 14,000 = 149 µs × 14,000 ≈ 2.086 s, quite a bit shorter than 15 minutes :)
I am facing a strange performance issue when parsing a lot of dates with Pandas 0.17.1. For demonstration, I created CSV files with exactly one column, containing datetimes in the format "2015-12-31 13:01:01". The sample files contain 10k, 100k, 1M, and 10M records. I am parsing it like this:
start = timer()
pd.read_csv('10k_records.csv', parse_dates=['date'])
end = timer()
print(end - start)
The elapsed times are:
10k: 0.011 s
100k: 0.10 s
1m: 1.2 s
10m: 300 s
You see, the time scales linearly with the number of records up to 1 million, but then there is a huge slowdown.
It's not a memory issue. I have 16GB, and I work with dataframes of this size without any problems in Pandas, only parsing of dates appears to be slow.
I tried to use infer_datetime_format=True, but the speed was similar. There was also a huge slowdown for the 10m records.
Then I tried to register my own naive date parser:
def parseDate(t):
    if type(t) is str:
        st = str(t)
        try:
            return datetime.datetime(int(st[:4]), int(st[5:7]), int(st[8:10]), int(st[11:13]), int(st[14:16]), int(st[17:19]))
        except:
            return None
    return datetime.datetime(0,0,0,0,0,0)

pd.read_csv(
    '10k_records.csv', parse_dates=['date'],
    date_parser=parseDate
)
And the times are now:
10k: 0.045 s
100k: 0.36 s
1m: 3.7 s
10m: 36 s
The routine is slower than the default pandas parser on smaller files, but it scales perfectly linearly for the larger one. So it really looks like some kind of performance leak in the standard date parsing routine.
Well, I could use my parser, but it's very simple, stupid, and apparently slow. I would prefer to use the intelligent, robust, and fast Pandas parser, if only I could somehow solve the scalability issue. Does anyone have an idea whether it could be solved, possibly by some esoteric parameter or something?
UPDATE
Thank all of you for your help so far.
After all, it seems that there is a reproducible performance problem with date parsing, but it has nothing to do with scalability. I was wrong in my original analysis.
You can try to download this file
https://www.dropbox.com/s/c5m21s1uif329f1/slow.csv.tar.gz?dl=0
and parse it in Pandas. The format and everything is correct, all the data are valid. There are only 100k records, but it takes 3 seconds to parse them - while it takes 0.1s to parse 100k records from the generated regular sequence.
What happened: I did not generate my original testing data as a regular sequence, as @exp1orer did. I was taking samples of our real data, and their distribution is not that regular. The sequence grows overall at a constant pace, but there are some local irregularities and unordered pieces. And, apparently, in my 10M sample there happened to be one section that made pandas particularly unhappy, so parsing took that long. Only a tiny fraction of the file content is responsible for all the slowness, but I was not able to spot any principal difference between that fraction and the rest of the file.
UPDATE 2
So, the cause of the slowness was that there were some weird dates, like 20124-10-20. Apparently, I will need to do some more pre-processing before importing the data into Pandas.
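One way to do that pre-processing in pandas itself (a sketch; the file and column names follow the answer below, and malformed dates such as 20124-10-20 simply become NaT instead of dragging the whole parse into the slow fallback path):
import pandas as pd

df = pd.read_csv('slow.csv', index_col=0)        # after extracting slow.csv.tar.gz
df['from'] = pd.to_datetime(df['from'], format='%Y-%m-%d %H:%M:%S',
                            errors='coerce')     # bad rows become NaT
bad = df[df['from'].isna()]                      # inspect or drop these rows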
UPDATE:
look at this comparison:
In [507]: fn
Out[507]: 'D:\\download\\slow.csv.tar.gz'
In [508]: fn2
Out[508]: 'D:\\download\\slow_filtered.csv.gz'
In [509]: %timeit df = pd.read_csv(fn, parse_dates=['from'], index_col=0)
1 loop, best of 3: 15.7 s per loop
In [510]: %timeit df2 = pd.read_csv(fn2, parse_dates=['from'], index_col=0)
1 loop, best of 3: 399 ms per loop
In [511]: len(df)
Out[511]: 99831
In [512]: len(df2)
Out[512]: 99831
In [513]: df.dtypes
Out[513]:
from object
dtype: object
In [514]: df2.dtypes
Out[514]:
from datetime64[ns]
dtype: object
The only difference between those two DFs is in row # 36867, which I've manually corrected in the D:\\download\\slow_filtered.csv.gz file:
In [518]: df.iloc[36867]
Out[518]:
from 20124-10-20 10:12:00
Name: 36867, dtype: object
In [519]: df2.iloc[36867]
Out[519]:
from 2014-10-20 10:12:00
Name: 36867, dtype: datetime64[ns]
Conclusion: it took Pandas 39 times longer because of a single row with a "bad" date, and in the end Pandas left the from column in the df DF as strings
OLD answer:
it works reasonably well for me (pandas 0.18.0):
setup:
start_ts = '2000-01-01 00:00:00'
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**4)}).to_csv('d:/temp/10k.csv', index=False)
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**5)}).to_csv('d:/temp/100k.csv', index=False)
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**6)}).to_csv('d:/temp/1m.csv', index=False)
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**7)}).to_csv('d:/temp/10m.csv', index=False)
dt_parser = lambda x: pd.to_datetime(x, format="%Y-%m-%d %H:%M:%S")
checks:
In [360]: fn = 'd:/temp/10m.csv'
In [361]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 22.6 s per loop
In [362]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 29.9 s per loop
In [363]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 29.9 s per loop
In [364]: fn = 'd:/temp/1m.csv'
In [365]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 2.32 s per loop
In [366]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 3.06 s per loop
In [367]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 3.06 s per loop
In [368]: %timeit pd.read_csv(fn)
1 loop, best of 3: 1.53 s per loop
Conclusion: it's a bit faster when I'm using date_parser with the date format specified, so read_csv doesn't have to guess it. The difference is approx. 30%.
Okay -- based on the discussion in the comments and in the chat room it seems that there is a problem with OP's data. Using the code below he is unable to reproduce his own error:
import pandas as pd
import datetime
from time import time
format_string = '%Y-%m-%d %H:%M:%S'
base_dt = datetime.datetime(2016,1,1)
exponent_range = range(2,8)
def dump(number_records):
    print 'now dumping %s records' % number_records
    dts = pd.date_range(base_dt, periods=number_records, freq='1s')
    df = pd.DataFrame({'date': [dt.strftime(format_string) for dt in dts]})
    df.to_csv('%s_records.csv' % number_records)

def test(number_records):
    start = time()
    pd.read_csv('%s_records.csv' % number_records, parse_dates=['date'])
    end = time()
    print str(number_records), str(end - start)

def main():
    for i in exponent_range:
        number_records = 10**i
        dump(number_records)
        test(number_records)

if __name__ == '__main__':
    main()
I am trying to get data ready for a financial event analysis and want to calculate the buy-and-hold abnormal return (BHAR). For a test data set I have three events (noted by event_id), and for each event I have 272 rows, going from t-252 days to t+20 days (noted by the variable time). For each day I also have the stock's return data (ret) as well as the expected return (Exp_Ret), which was calculated using a market model. Here's a sample of the data:
index event_id time ret vwretd Exp_Ret
0 0 -252 0.02905 0.02498 nan
1 0 -251 0.01146 -0.00191 nan
2 0 -250 0.01553 0.00562 nan
...
250 0 -2 -0.00378 0.00028 -0.00027
251 0 -1 0.01329 0.00426 0.00479
252 0 0 -0.01723 -0.00875 -0.01173
271 0 19 0.01335 0.01150 0.01398
272 0 20 0.00722 -0.00579 -0.00797
273 1 -252 0.01687 0.00928 nan
274 1 -251 -0.00615 -0.01103 nan
And here's the issue. I would like to calculate the following BHAR formula for each day:
BHAR(0, T) = prod_{t=0..T} (1 + ret_t) - prod_{t=0..T} (1 + Exp_Ret_t)
So, using the above formula as an example, if I would like to calculate the 10-day buy-and-hold abnormal return, I would have to calculate (1+ret_t=0) x (1+ret_t=1) x ... x (1+ret_t=10), then do the same with the expected return, (1+Exp_Ret_t=0) x (1+Exp_Ret_t=1) x ... x (1+Exp_Ret_t=10), then subtract the latter from the former.
I have made some progress using rolling_apply but it doesn't solve all my problems:
df['part1'] = pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod())
This seems to correctly implement the left-side of the BHAR equation in that it will add in the correct product -- though it will enter the value two rows down (which can be solved by shifting). One problem, though, is that there are three different 'groups' in the dataframe (3 events), and if the window were to go forward more than 30 days it might start using products from the next event. I have tried to implement a groupby with rolling_apply but keep getting error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
df.groupby('event_id').apply(pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod()))
I am sure I am missing something basic here so any help would be appreciated. I might just need to approach it from a different angle. Here's one thought: In the end, what I am most interested in is getting the 30-day and 60-day buy-and-hold abnormal returns starting at time=0. So, maybe it is easier to select each event at time=0 and then calculate the 30-day product going forward? I'm not sure how I could best approach that.
Thanks in advance for any insights.
# Create sample data.
np.random.seed(0)
VOL = .3
df = pd.DataFrame({'event_id': [0] * 273 + [1] * 273 + [2] * 273,
                   'time': range(-252, 21) * 3,
                   'ret': np.random.randn(273 * 3) * VOL / 252 ** .5,
                   'Exp_Ret': np.random.randn(273 * 3) * VOL / 252 ** .5})

# Pivot on time and event_id.
df = df.set_index(['time', 'event_id']).unstack('event_id')

# Calculate the return difference from t=0.
df_diff = df.loc[df.index >= 0, 'ret'] - df.loc[df.index >= 0, 'Exp_Ret']
# Calculate cumulative abnormal returns.
cum_returns = (1 + df_diff).cumprod() - 1
# Get 10 day abnormal returns.
>>> cum_returns.loc[10]
event_id
0 -0.014167
1 -0.172599
2 -0.032647
Name: 10, dtype: float64
Edited so that final values of BHAR are included in the main DataFrame.
BHAR = pd.Series()

def bhar(arr):
    return np.cumprod(arr + 1)[-1]

grouped = df.groupby('event_id')
for name, group in grouped:
    BHAR = BHAR.append(pd.rolling_apply(group['ret'], 10, bhar) -
                       pd.rolling_apply(group['Exp_Ret'], 10, bhar))

df['BHAR'] = BHAR
You can then slice the DataFrame using df[df['time']>=0] such that you get only the required part.
You can obviously collapse the loop into one line by using .apply() on the group (a sketch of that follows below), but I like it this way. Shorter lines to read = better readability.
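For completeness, the collapsed version might look roughly like this (an untested sketch using the same bhar helper, 10-day window, and old rolling_apply API as above):
df['BHAR'] = df.groupby('event_id', group_keys=False).apply(
    lambda g: pd.rolling_apply(g['ret'], 10, bhar) -
              pd.rolling_apply(g['Exp_Ret'], 10, bhar))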
This is what I did:
((df + 1.0)
    .apply(lambda x: np.log(x), axis=1)   # take logs so the product becomes a sum
    .rolling(365).sum()                   # rolling sum of logs == log of the rolling product
    .apply(lambda x: np.exp(x), axis=1)   # exponentiate back
    - 1.0)
The result is a rolling product.
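On newer pandas versions (where rolling(...).apply accepts raw=True), the same rolling product can be written more directly; a sketch assuming the same df of returns:
import numpy as np

# 365-period rolling product of (1 + value), minus 1 -- no log/exp round trip needed.
result = (df + 1.0).rolling(365).apply(np.prod, raw=True) - 1.0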