Applying a custom function to a dataframe column - python

Is there a way to optimize the below code snipped?
I am trying to calculate the value of the current row column using the previous row column value and a period specified in the custom function and a price in the current row column.
import pandas as pd
class EMA_Period:
fast = 8
slow = 17
def calculate_ema(prev_ema, price, period):
return prev_ema + (2.0 / (1.0 + period)) * (price - prev_ema)
times = [1578614400, 1578614700, 1578615000, 1578615300, 1578615600]
closes = [10278.6, 10276.0, 10275.6, 10274.8, 10277.0]
fast_ema = [10278.6, 0, 0, 0, 0]
df = pd.DataFrame(data={'time': times, 'close': closes, 'fast_ema': fast_ema})
df.set_index('time', inplace=True)
for i in range(1, df.shape[0]):
df.iloc[i]['fast_ema'] = calculate_ema(df.iloc[i-1]['fast_ema'], df.iloc[i]['close'], EMA_Period.fast)

You should really use a vectorized approach if you care about speed. Looping over the rows will always be the slowest option (though sometimes unavoidable)
You don't even need to change your function to make it vectorized!
def calculate_ema(prev_ema, price, period):
return prev_ema + (2.0 / (1.0 + period)) * (price - prev_ema)
# though we will make your dataframe longer: 500 rows instead of 5 rows
df = pd.concat([df] * 100)
print(df)
close fast_ema
time
1578614400 10278.6 10278.6
1578614700 10276.0 0.0
1578615000 10275.6 0.0
1578615300 10274.8 0.0
1578615600 10277.0 0.0
... ... ...
1578614400 10278.6 10278.6
1578614700 10276.0 0.0
1578615000 10275.6 0.0
1578615300 10274.8 0.0
1578615600 10277.0 0.0
[500 rows x 2 columns]
Note that these tests are timing 2 important things:
Performance of the calculation itself
Performance of assigning values back into a dataframe
Row looping solution
%%timeit
for i in range(1, df.shape[0]):
df.iloc[i]['fast_ema'] = calculate_ema(df.iloc[i-1]['fast_ema'], df.iloc[i]['close'], EMA_Period.fast)
10 loops, best of 5: 86.1 ms per loop
86.1 ms is pretty slow for such a small dataset. Let's see how the vectorized approach compares:
Vectorized Solution
By using .shift() on the "fast_ema" column we can change how these vectors align such that each value in "close" is aligned with the previous "fast_ema".
With the alignment taken care of, we can feed these vectors directly into the calculate_ema function without making any changes
%%timeit
df["fast_ema"].iloc[1:] = calculate_ema(df["fast_ema"].shift(), df["close"], EMA_Period.fast).iloc[1:]
1000 loops, best of 5: 569 µs per loop
Time comparisons:
Approach
Time
Row Looping
86.1 ms
Vectorized
569 µs

Thanks #Mars
def calc_ema(df, period=8, col_name='fast'):
prev_value = df.iloc[0][col_name]
def func2(row):
# non local variable ==> will use pre_value from the new_fun function
nonlocal prev_value
prev_value = prev_value + (2.0 / (1.0 + period)) * (row['close'] - prev_value)
return prev_value
# This line might throw a SettingWithCopyWarning warning
df.iloc[1:][col_name] = df.iloc[1:].apply(func2, axis=1)
return df
df = calc_ema(df)

Related

How can I make my python program run faster?

I am reading in a .csv file and creating a pandas dataframe. The file is a file of stocks. I am only interested in the date, the company, and the closing cost. I want my program to find the max profit with the starting date, the ending date and the company. It needs to use the divide and conquer algorithm. I only know how to use for loops but it takes forever to run. The .csv file is 200,000 rows. How can I get this to run fast?
import pandas as pd
import numpy as np
import math
def cleanData(file):
df = pd.read_csv(file)
del df['open']
del df['low']
del df['high']
del df['volume']
return np.array(df)
df = cleanData('prices-split-adjusted.csv')
bestStock = [None, None, None, float(-math.inf)]
def DAC(data):
global bestStock
if len(data) > 1:
mid = len(data)//2
left = data[:mid]
right = data[mid:]
DAC(left)
DAC(right)
for i in range(len(data)):
for j in range(i+1,len(data)):
if data[i,1] == data[j,1]:
profit = data[j,2] - data[i,2]
if profit > bestStock[3]:
bestStock[0] = data[i,0]
bestStock[1] = data[j,0]
bestStock[2] = data[i,1]
bestStock[3] = profit
print(bestStock)
print('\n')
return bestStock
print(DAC(df))
I've got two things for your consideration (my answer tries not to change your algorithm approach i.e. nested loops and recursive funcs and tackles the low lying fruits first):
Unless you are debugging, try to avoid print() inside a loop. (in your case .. print(bestStock) ..) The I/O overhead can add up esp. if you are looping across large datasets and printing to screen often. Once you are OK with your code, comment it out to run on your full dataset and uncomment it only during debugging sessions. You can expect to see some improvement in speed without having to print to screen in the loop.
If you are after even more ways to 'speed it up', I found in my case (similar to yours which I often encounter especially in search/sort problems) that simply by switching the expensive part (the python 'For' loops) to Cython (and statically defining variable types .. this is KEY! to SPEEEEDDDDDD) gives me several orders of magnitude speed ups even before optimizing implementation. Check Cython out https://cython.readthedocs.io/en/latest/index.html. If thats not enough, then parrelism is your next best friend which would require rethinking your code implementation.
The main problems causing slow system performance are:
You manually iterate over 2 columns in nested loops without using pandas operations which make use of fast ndarray functions;
you use recursive calls which looks nice and simple but slow.
Setting the sample data as follows:
Date Company Close
0 2019-12-31 AAPL 73.412498
1 2019-12-31 FB 205.250000
2 2019-12-31 NFLX 323.570007
3 2020-01-02 AAPL 75.087502
4 2020-01-02 FB 209.779999
... ... ... ...
184 2020-03-30 FB 165.949997
185 2020-03-30 NFLX 370.959991
186 2020-03-31 AAPL 63.572498
187 2020-03-31 FB 166.800003
188 2020-03-31 NFLX 375.500000
189 rows × 3 columns
Then use the following codes (modify the column labels to your labels if different):
df_result = df.groupby('Company').agg(Start_Date=pd.NamedAgg(column='Date', aggfunc="first"), End_Date=pd.NamedAgg(column='Date', aggfunc="last"), bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0]))
Resulting output:
Start_Date End_Date bestGain
Company
AAPL 2019-12-31 2020-03-31 8.387505
FB 2019-12-31 2020-03-31 17.979996
NFLX 2019-12-31 2020-03-31 64.209991
To get the entry with greatest gain:
df_result.loc[df_result['bestGain'].idxmax()]
Resulting output:
Start_Date 2019-12-31 00:00:00
End_Date 2020-03-31 00:00:00
bestGain 64.209991
Name: NFLX, dtype: object
Execution time comparison
With my scaled down data of 3 stocks over 3 months, the codes making use of pandas function (takes 8.9ms) which is about about half the execution time with the original codes manually iterate over the numpy array with nested loops and recursive calls (takes 16.9ms) even after the majority of print() function calls removed.
Your codes with print() inside DAC() function removed:
%%timeit
"""
def cleanData(df):
# df = pd.read_csv(file)
del df['Open']
del df['Low']
del df['High']
del df['Volume']
return np.array(df)
"""
# df = cleanData('prices-split-adjusted.csv')
# df = cleanData(df0)
df = np.array(df0)
bestStock = [None, None, None, float(-math.inf)]
def DAC(data):
global bestStock
if len(data) > 1:
mid = len(data)//2
left = data[:mid]
right = data[mid:]
DAC(left)
DAC(right)
for i in range(len(data)):
for j in range(i+1,len(data)):
if data[i,1] == data[j,1]:
profit = data[j,2] - data[i,2]
if profit > bestStock[3]:
bestStock[0] = data[i,0]
bestStock[1] = data[j,0]
bestStock[2] = data[i,1]
bestStock[3] = profit
# print(bestStock)
# print('\n')
return bestStock
print(DAC(df))
[Timestamp('2020-03-16 00:00:00'), Timestamp('2020-03-31 00:00:00'), 'NFLX', 76.66000366210938]
16.9 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
New simplified codes in pandas' way of coding:
%%timeit
df_result = df.groupby('Company').agg(Start_Date=pd.NamedAgg(column='Date', aggfunc="first"), End_Date=pd.NamedAgg(column='Date', aggfunc="last"), bestGain=pd.NamedAgg(column='Close', aggfunc=lambda x: x.max() - x.iloc[0]))
df_result.loc[df_result['bestGain'].idxmax()]
8.9 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution using recursive function:
The main problem of your recursive function lies in that you did not make use of the results of recursive calls of reduced size data.
To properly use recursive function as a divide-and-conquer approach, you should take 3 major steps:
Divide the whole set of data into smaller pieces and handle the smaller pieces by recursive calls each taking one of the smaller pieces
Handle the end-point case (the easiest case most of the time) in each recursive call
Consolidate the results of all recursive calls of smaller pieces
The beauty of recursive calls is that you can solve a complicated problem by replacing the processing with 2 much more easier steps: 1st step is to handle the end-point case where you can handle for most of the time only ONE data item (which is most often easy). 2nd step is to just take another easy step to consolidate the results of the reduced-size calls.
You managed to take the first step but not the other 2 steps. In particular, you did not take advantage of simplifying the processing by making use of the results of smaller pieces. Instead, you handle the whole set of data in each call by looping all over all rows in the 2-dimensional numpy array. The nested loop logics is just like a "Bubble Sort" [with complexity order(n squared) instead of order(n)] . Hence, your recursive calls are just wasting time without value!
Suggest to modify your recursive functions as follows:
def DAC(data):
# global bestStock # define bestStock as a local variable instead
bestStock = [None, None, None, float(-math.inf)] # init bestStock
if len(data) = 1: # End-point case: data = 1 row
bestStock[0] = data[0,0]
bestStock[1] = data[0,0]
bestStock[2] = data[0,1]
bestStock[3] = 0.0
elif len(data) = 2: # End-point case: data = 2 rows
bestStock[0] = data[0,0]
bestStock[1] = data[1,0]
bestStock[2] = data[0,1] # Enhance here to allow stock break
bestStock[3] = data[1,2] - data[0,2]
elif len(data) >= 3: # Recursive calls and consolidate results
mid = len(data)//2
left = data[:mid]
right = data[mid:]
bestStock_left = DAC(left)
bestStock_right = DAC(right)
# Now make use of the results of divide-and-conquer and consolidate the results
bestStock[0] = bestStock_left[0]
bestStock[1] = bestStock_right[1]
bestStock[2] = bestStock_left[2] # Enhance here to allow stock break
bestStock[3] = bestStock_left[3] if bestStock_left[3] >= bestStock_right[3] else bestStock_right[3]
# print(bestStock)
# print('\n')
return bestStock
Here we need to handle 2 kinds of end-point cases: 1 row and 2 rows. The reason is that for case with only 1 row, we cannot calculate the gain and can only set the gain to zero. Gain can start to calculate with 2 rows. If not split into these 2 end-point cases, we could end up only propagating zero gain all the way up.
Here is a demo of how you should code the recursive calls to take advantage of it. There is limitation of the codes that you still need to fine-tune. You have to enhance it further to handle stock break case. The codes for 2 rows and >= 3 rows now assume no stock break at the moment.

Create a column that is a conditional smoothed moving average of another column in python

My Problem
I am trying to create a column in python which is the conditional smoothed moving 14 day average of another column. The condition is that I only want to include positive values from another column in the rolling average.
I am currently using the following code which works exactly how I want it to, but it is really slow because of the loops. I want to try and re-do it without using loops. The dataset is simply the last closing price of a stock.
Current Working Code
import numpy as np
import pandas as pd
csv1 = pd.read_csv('stock_price.csv', delimiter = ',')
df = pd.DataFrame(csv1)
df['delta'] = df.PX_LAST.pct_change()
df.loc[df.index[0], 'avg_gain'] = 0
for x in range(1,len(df.index)):
if df["delta"].iloc[x] > 0:
df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + df["delta"].iloc[x]) / 14
else:
df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + 0) / 14
df
Correct Output Example
Dates PX_LAST delta avg_gain
03/09/2018 43.67800 NaN 0.000000
04/09/2018 43.14825 -0.012129 0.000000
05/09/2018 42.81725 -0.007671 0.000000
06/09/2018 43.07725 0.006072 0.000434
07/09/2018 43.37525 0.006918 0.000897
10/09/2018 43.47925 0.002398 0.001004
11/09/2018 43.59750 0.002720 0.001127
12/09/2018 43.68725 0.002059 0.001193
13/09/2018 44.08925 0.009202 0.001765
14/09/2018 43.89075 -0.004502 0.001639
17/09/2018 44.04200 0.003446 0.001768
Attempted Solutions
I tried to create a new column that only comprises of the positive values and then tried to create the smoothed moving average of that new column but it doesn't give me the right answer
df['new_col'] = df['delta'].apply(lambda x: x if x > 0 else 0)
df['avg_gain'] = df['new_col'].ewm(14,min_periods=1).mean()
The maths behind it as follows...
Avg_Gain = ((Avg_Gain(t-1) * 13) + (New_Col * 1)) / 14
where New_Col only equals the positive values of Delta
Does anyone know how I might be able to do it?
Cheers
This should speed up your code:
df['avg_gain'] = df[df['delta'] > 0]['delta'].rolling(14).mean()
Does your current code converge to zero? If you can provide the data, then it would be easier for the folk to do some analysis.
I would suggest you add a column which is 0 if the value is < 0 and instead has the same value as the one you want to consider if it is >= 0. Then you take the running average of this new column.
df['new_col'] = df.apply(lambda x: x['delta'] if x['delta'] >= 0 else 0)
df['avg_gain'] = df['new_value'].rolling(14).mean()
This would take into account zeros instead of just discarding them.

How to append item to list of different column in Pandas

I have a dataframe that looks like this:
dic = {'A':['PINCO','PALLO','CAPPO','ALLOP'],
'B':['KILO','KULO','FIGA','GAGO'],
'C':[['CAL','GOL','TOA','PIA','STO'],
['LOL','DAL','ERS','BUS','TIS'],
['PIS','IPS','ZSP','YAS','TUS'],
[]]}
df1 = pd.DataFrame(dic)
My goal is to insert for each row the element of A as first item of the list contained in column C. At the same time I want to set the element of B as last item of the list contained in C.
I was able to achieve my goal by using the following lines of code:
for index, row in df1.iterrows():
try:
row['C'].insert(0,row['A'])
row['C'].append(row['B'])
except:
pass
Is there a more elegant and efficient way to achieve my goal maybe using some Pandas function? I would like to avoid for loops possibly.
Inspired by Ted's solution but without modifying columns A and B:
def tolist(value):
return [value]
df1.C = df1.A.map(tolist) + df1.C + df1.B.map(tolist)
Using apply, you would not write an explicit loop:
def modify(row):
row['C'][:] = [row['A']] + row['C'] + [row['B']]
df1.apply(modify, axis=1)
A good general rule is to avoid using apply with axis=1 if at all possible as iterating over the rows is expenisve
You can convert each element in columns A and B to a list with map and then sum across the rows.
df1['A'] = df1.A.map(lambda x: [x])
df1['B'] = df1.B.map(lambda x: [x])
df1.sum(1)
CPU times: user 3.07 s, sys: 207 ms, total: 3.27 s
The alternative is to use apply with axis=1 which ran 15 times slower on my computer on 1 million rows
df1.apply(lambda x: [x['A']] + x['C'] + [x['B']], 1)
CPU times: user 48.5 s, sys: 119 ms, total: 48.6 s
Use a list comprehension with df1.values.tolist()
pd.Series([[r[0]] + r[2] + [r[1]] for r in df1.values.tolist()], df1.index)
0 [PINCO, CAL, GOL, TOA, PIA, STO, KILO]
1 [PALLO, LOL, DAL, ERS, BUS, TIS, KULO]
2 [CAPPO, PIS, IPS, ZSP, YAS, TUS, FIGA]
3 [ALLOP, GAGO]
dtype: object
time testing

Compare each pair of dates in two columns in python efficiently

I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date).I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And have began to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
if dates.Start[index] < dates.End[index]:
dates.Correct[index] = "correct"
elif dates.Start[index] == dates.End[index]:
dates.Correct[index] = "same"
elif dates.Start[index] > dates.End[index]:
dates.Correct[index] = "incorrect"
Which works, it is just taking a really really long-time (about over 15 minutes). I need a more efficiently running code - is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
Since the list doesn't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparison simultaneously. Take a look at the multiprocessing module for help.
Something like the following may be quicker:
import pandas as pd
import datetime
df = pd.DataFrame({
'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})
def comparison_check(df):
start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
if start < end:
return "correct"
elif start == end:
return "same"
return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447/3)*14,000 = (149 µs)*14,000 = 2.086s, so a might shorter than 15 minutes :)

Rolling Product in PANDAS over 30-day time window

I am trying to get data ready for a financial event analysis and want to calculate the buy-and-hold abnormal return (BHAR). For a test data set I have three events (noted by event_id), and for each event I have 272 rows, going from t-252 days to t+20 days (noted by the variable time). For each day I also have the stock's return data (ret) as well as the expected return (Exp_Ret), which was calculated using a market model. Here's a sample of the data:
index event_id time ret vwretd Exp_Ret
0 0 -252 0.02905 0.02498 nan
1 0 -251 0.01146 -0.00191 nan
2 0 -250 0.01553 0.00562 nan
...
250 0 -2 -0.00378 0.00028 -0.00027
251 0 -1 0.01329 0.00426 0.00479
252 0 0 -0.01723 -0.00875 -0.01173
271 0 19 0.01335 0.01150 0.01398
272 0 20 0.00722 -0.00579 -0.00797
273 1 -252 0.01687 0.00928 nan
274 1 -251 -0.00615 -0.01103 nan
And here's the issue. I would like to calculate the following BHAR formula for each day:
So, using the above formula as an example, if I would like to calculate the 10-day buy-and-hold abnormal return,I would have to calculate (1+ret_t=0)x(1+ret_t=1)...x(1+ret_t=10), then do the same with the expected return, (1+Exp_Ret_t=0)x(1+Exp_Ret_t=1)...x(1+Exp_Ret_t=10), then substract the latter from the former.
I have made some progress using rolling_apply but it doesn't solve all my problems:
df['part1'] = pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod())
This seems to correctly implement the left-side of the BHAR equation in that it will add in the correct product -- though it will enter the value two rows down (which can be solved by shifting). One problem, though, is that there are three different 'groups' in the dataframe (3 events), and if the window were to go forward more than 30 days it might start using products from the next event. I have tried to implement a groupby with rolling_apply but keep getting error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
df.groupby('event_id').apply(pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod()))
I am sure I am missing something basic here so any help would be appreciated. I might just need to approach it from a different angle. Here's one thought: In the end, what I am most interested in is getting the 30-day and 60-day buy-and-hold abnormal returns starting at time=0. So, maybe it is easier to select each event at time=0 and then calculate the 30-day product going forward? I'm not sure how I could best approach that.
Thanks in advance for any insights.
# Create sample data.
np.random.seed(0)
VOL = .3
df = pd.DataFrame({'event_id': [0] * 273 + [1] * 273 + [2] * 273,
'time': range(-252, 21) * 3,
'ret': np.random.randn(273 * 3) * VOL / 252 ** .5,
'Exp_Ret': np.random.randn(273 * 3) * VOL / 252 ** .5})
# Pivot on time and event_id.
df = df.set_index(['time', 'event_id']).unstack('event_id')
# Calculated return difference from t=0.
df_diff = df.ix[df.index >= 0, 'ret'] - df.loc[df.index >= 0, 'Exp_Ret']
# Calculate cumulative abnormal returns.
cum_returns = (1 + df_diff).cumprod() - 1
# Get 10 day abnormal returns.
>>> cum_returns.loc[10]
event_id
0 -0.014167
1 -0.172599
2 -0.032647
Name: 10, dtype: float64
Edited so that final values of BHAR are included in the main DataFrame.
BHAR = pd.Series()
def bhar(arr):
return np.cumprod(arr+1)[-1]
grouped = df.groupby('event_id')
for name, group in grouped:
BHAR = BHAR.append(pd.rolling_apply(group['ret'],10,bhar) -
pd.rolling_apply(group['Exp_Ret'],10,bhar))
df['BHAR'] = BHAR
You can then slice the DataFrame using df[df['time']>=0] such that you get only the required part.
You can obviously collapse the loop in one line using .apply() on the group, but I like it this way. Shorter lines to read = better readability.
This is what I did:
((df+1.0) \
.apply(lambda x: np.log(x),axis=1)\
.rolling(365).sum() \
.apply(lambda x: np.exp(x),axis=1)-1.0)
result is a rolling product.

Categories