Finding start of the maximum drawdown in Pandas [duplicate] - python

Maximum Drawdown is a common risk metric used in quantitative finance to assess the largest negative return that has been experienced.
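In terms of the cumulative value series V_t, it is min over t of (V_t / max_{s <= t} V_s) - 1: the worst peak-to-trough decline, expressed here as a negative number (some formulations flip the sign), or zero if the series never falls below a previous peak.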
Recently, I became impatient with the time to calculate max drawdown using my looped approach.
def max_dd_loop(returns):
    """returns is assumed to be a pandas Series of periodic returns"""
    max_so_far = None
    start, end = None, None
    r = returns.add(1).cumprod()
    for r_start in r.index:
        for r_end in r.index:
            if r_start < r_end:
                # .loc replaces the deprecated .ix indexer
                current = r.loc[r_end] / r.loc[r_start] - 1
                if (max_so_far is None) or (current < max_so_far):
                    max_so_far = current
                    start, end = r_start, r_end
    return max_so_far, start, end
I'm familiar with the common perception that a vectorized solution would be better.
The questions are:
Can I vectorize this problem?
What does this solution look like?
How beneficial is it?
Edit
I modified Alexander's answer into the following function:
def max_dd(returns):
    """Assumes returns is a pandas Series"""
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.idxmin()            # idxmin/idxmax return the index labels
    start = r.loc[:end].idxmax()
    return mdd, start, end
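A quick way to sanity-check it (a sketch only, with hypothetical random data; it assumes a pandas Series of periodic returns with a DatetimeIndex):
import numpy as np
import pandas as pd

# hypothetical example data: 1000 days of random returns
rets = pd.Series(np.random.normal(0.001, 0.02, 1000),
                 index=pd.date_range('2016-01-01', periods=1000, freq='D'))

mdd, start, end = max_dd(rets)
print(mdd, start, end)  # e.g. -0.32, 2016-06-01, 2017-02-10 (values vary per run)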

df_returns is assumed to be a dataframe of returns, where each column is a separate strategy/manager/security, and each row is a new date (e.g. monthly or daily).
cum_returns = (1 + df_returns).cumprod()
drawdown = 1 - cum_returns.div(cum_returns.cummax())
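To get each strategy's maximum drawdown from there, one can simply take the column-wise maximum (a small sketch; note that in this formulation drawdown is expressed as a positive fraction):
# largest drawdown experienced by each column (strategy/manager/security)
max_drawdown_per_column = drawdown.max()
# ...and the date on which each trough occurred
trough_dates = drawdown.idxmax()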

I had first suggested using an .expanding() window, but that's not necessary: the .cumprod() and .cummax() built-ins are enough to calculate max drawdown up to any given point:
import numpy as np
import pandas as pd
from datetime import date

df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)},
                  index=pd.date_range(start=date(2016, 1, 1), periods=1000, freq='D'))
df['cumulative_return'] = df.returns.add(1).cumprod().subtract(1)
df['max_drawdown'] = df.cumulative_return.add(1).div(df.cumulative_return.cummax().add(1)).subtract(1)
returns cumulative_return max_drawdown
2016-01-01 -0.014522 -0.014522 0.000000
2016-01-02 -0.022769 -0.036960 -0.022769
2016-01-03 0.026735 -0.011214 0.000000
2016-01-04 0.054129 0.042308 0.000000
2016-01-05 -0.017562 0.024004 -0.017562
2016-01-06 0.055254 0.080584 0.000000
2016-01-07 0.023135 0.105583 0.000000
2016-01-08 -0.072624 0.025291 -0.072624
2016-01-09 -0.055799 -0.031919 -0.124371
2016-01-10 0.129059 0.093020 -0.011363
2016-01-11 0.056123 0.154364 0.000000
2016-01-12 0.028213 0.186932 0.000000
2016-01-13 0.026914 0.218878 0.000000
2016-01-14 -0.009160 0.207713 -0.009160
2016-01-15 -0.017245 0.186886 -0.026247
2016-01-16 0.003357 0.190869 -0.022979
2016-01-17 -0.009284 0.179813 -0.032050
2016-01-18 -0.027361 0.147533 -0.058533
2016-01-19 -0.058118 0.080841 -0.113250
2016-01-20 -0.049893 0.026914 -0.157492
2016-01-21 -0.013382 0.013173 -0.168766
2016-01-22 -0.020350 -0.007445 -0.185681
2016-01-23 -0.085842 -0.092648 -0.255584
2016-01-24 0.022406 -0.072318 -0.238905
2016-01-25 0.044079 -0.031426 -0.205356
2016-01-26 0.045782 0.012917 -0.168976
2016-01-27 -0.018443 -0.005764 -0.184302
2016-01-28 0.021461 0.015573 -0.166797
2016-01-29 -0.062436 -0.047836 -0.218819
2016-01-30 -0.013274 -0.060475 -0.229189
... ... ... ...
2018-08-28 0.002124 0.559122 -0.478738
2018-08-29 -0.080303 0.433921 -0.520597
2018-08-30 -0.009798 0.419871 -0.525294
2018-08-31 -0.050365 0.348359 -0.549203
2018-09-01 0.080299 0.456631 -0.513004
2018-09-02 0.013601 0.476443 -0.506381
2018-09-03 -0.009678 0.462153 -0.511158
2018-09-04 -0.026805 0.422960 -0.524262
2018-09-05 0.040832 0.481062 -0.504836
2018-09-06 -0.035492 0.428496 -0.522411
2018-09-07 -0.011206 0.412489 -0.527762
2018-09-08 0.069765 0.511031 -0.494817
2018-09-09 0.049546 0.585896 -0.469787
2018-09-10 -0.060201 0.490423 -0.501707
2018-09-11 -0.018913 0.462235 -0.511131
2018-09-12 -0.094803 0.323611 -0.557477
2018-09-13 0.025736 0.357675 -0.546088
2018-09-14 -0.049468 0.290514 -0.568542
2018-09-15 0.018146 0.313932 -0.560713
2018-09-16 -0.034118 0.269104 -0.575700
2018-09-17 0.012191 0.284576 -0.570527
2018-09-18 -0.014888 0.265451 -0.576921
2018-09-19 0.041180 0.317562 -0.559499
2018-09-20 0.001988 0.320182 -0.558623
2018-09-21 -0.092268 0.198372 -0.599348
2018-09-22 -0.015386 0.179933 -0.605513
2018-09-23 -0.021231 0.154883 -0.613888
2018-09-24 -0.023536 0.127701 -0.622976
2018-09-25 0.030160 0.161712 -0.611605
2018-09-26 0.025528 0.191368 -0.601690

Given a time series of returns, we need to evaluate the aggregate return for every combination of starting point and ending point.
The first trick is to convert a time series of returns into a series of return indices. Given a series of return indices, I can calculate the return over any sub-period with the return index at the beginning ri_0 and at the end ri_1. The calculation is: ri_1 / ri_0 - 1.
The second trick is to produce a second series of inverses of return indices. If r is my series of return indices then 1 / r is my series of inverses.
The third trick is to take the matrix product of r * (1 / r).Transpose.
r is an n x 1 matrix. (1 / r).Transpose is a 1 x n matrix. The resulting product contains every combination of ri_j / ri_k. Just subtract 1 and I've actually got returns.
The fourth trick is to ensure that I'm constraining my denominator to represent periods prior to those being represented by the numerator.
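As an illustration of tricks three and four in isolation, here is a minimal NumPy sketch (illustration only, not the function used below; np.triu is just one way to enforce the start-before-end constraint):
import numpy as np

r = np.array([1.00, 1.10, 0.99, 1.05])                  # a tiny return-index series
ratios = np.outer(1.0 / r, r) - 1.0                     # ratios[start, end] = r[end] / r[start] - 1
mask = np.triu(np.ones_like(ratios, dtype=bool), k=1)   # keep only pairs with start < end
print(ratios[mask].min())                               # worst start-to-end return: -0.1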
Below is my vectorized function.
import numpy as np
import pandas as pd
def max_dd(returns):
    # make returns into a DataFrame so that it is a 2-dimensional
    # matrix and I can perform an n x 1 by 1 x n matrix
    # multiplication, ending up with an n x n matrix
    r = pd.DataFrame(returns).add(1).cumprod()
    # I copy r.T to ensure r's index is not the same
    # object as (1 / r.T)'s columns object
    x = r.dot(1 / r.T.copy()) - 1
    x.columns.name, x.index.name = 'start', 'end'
    # let's make sure we only calculate a return when start
    # is less than end
    y = x.stack().reset_index()
    y = y[y.start < y.end]
    # my choice is to return the periods as well as the actual max drawdown
    z = y.set_index(['start', 'end']).iloc[:, 0]
    start, end = z.idxmin()  # the (start, end) label of the worst return
    return z.min(), start, end
How does this perform?
For the vectorized solution, I ran 10 iterations over time series of lengths [10, 50, 100, 150, 200]. The times are below:
10: 0.032 seconds
50: 0.044 seconds
100: 0.055 seconds
150: 0.082 seconds
200: 0.047 seconds
The same test for the looped solution is below:
10: 0.153 seconds
50: 3.169 seconds
100: 12.355 seconds
150: 27.756 seconds
200: 49.726 seconds
Edit
Alexander's answer provides superior results. The same test using the modified code:
10: 0.000 seconds
50: 0.000 seconds
100: 0.004 seconds
150: 0.007 seconds
200: 0.008 seconds
I modified his code into the following function:
def max_dd(returns):
    r = returns.add(1).cumprod()
    dd = r.div(r.cummax()).sub(1)
    mdd = dd.min()
    end = dd.idxmin()
    start = r.loc[:end].idxmax()
    return mdd, start, end

I recently had a similar issue, but instead of a global MDD, I was required to find the MDD for the interval after each peak. Also, in my case, I was supposed to take the MDD of each strategy alone and thus wasn't required to apply the cumprod. My vectorized implementation is also based on Investopedia.
import pandas as pd

def calc_MDD(networth):
    df = pd.Series(networth, name="nw").to_frame()

    max_peaks_idx = df.nw.expanding(min_periods=1).apply(lambda x: x.argmax()).fillna(0).astype(int)
    df['max_peaks_idx'] = pd.Series(max_peaks_idx).to_frame()

    nw_peaks = pd.Series(df.nw.iloc[max_peaks_idx.values].values, index=df.nw.index)

    df['dd'] = ((df.nw - nw_peaks) / nw_peaks)
    df['mdd'] = df.groupby('max_peaks_idx').dd.apply(
        lambda x: x.expanding(min_periods=1).apply(lambda y: y.min())).fillna(0)
    return df
Here is a sample after running this code:
nw max_peaks_idx dd mdd
0 10000.000 0 0.000000 0.000000
1 9696.948 0 -0.030305 -0.030305
2 9538.576 0 -0.046142 -0.046142
3 9303.953 0 -0.069605 -0.069605
4 9247.259 0 -0.075274 -0.075274
5 9421.519 0 -0.057848 -0.075274
6 9315.938 0 -0.068406 -0.075274
7 9235.775 0 -0.076423 -0.076423
8 9091.121 0 -0.090888 -0.090888
9 9033.532 0 -0.096647 -0.096647
10 8947.504 0 -0.105250 -0.105250
11 8841.551 0 -0.115845 -0.115845
And here is what the same calculation looks like when applied to the complete dataset (plot not reproduced here).
Although vectorized, this code is probably slower than the others, because each time series can have many peaks, each peak requires its own calculation, and so the cost is roughly O(n_peaks * n_intervals).
PS: I could have eliminated the zero values in the dd and mdd columns, but I find it useful that these values help indicate when a new peak was observed in the time-series.

Related

Dask dataframe groupby and applying custom functions

I have this function that I am trying to apply to a dask dataframe; it calculates cooling assuming certain storage capacity and rate limits. It takes the 15-minute-timestep cooling values a building uses and returns the amount a certain storage rate can accommodate.
def cooling_kwh_by_case(row, storage_capacity, storage_rate):
    if ((row['daily_cooling_kwh'] <= storage_capacity/row['cop']) &
            (row['max_cooling_kw'] <= storage_rate/row['cop'])):
        return row['daily_cooling_kwh']
    elif ((row['daily_cooling_kwh'] <= storage_capacity/row['cop']) &
            (row['max_cooling_kw'] > storage_rate/row['cop'])):
        daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'] \
            .apply(lambda x: sum(min(x, storage_rate/(4*row['cop']))))
        return daily_groupby.loc[(row.building_date)]
    else:
        n_largest = 1
        daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'] \
            .apply(lambda x: x.nlargest(n_largest).sum())
        while ((daily_groupby.loc[(row.building_date)]) <= (storage_capacity/row['cop'])) & \
                (n_largest < net_load_w_times.groupby('index')['electricity_cooling_kwh'].count()):
            n_largest += 1
            daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'] \
                .apply(lambda x: x.nlargest(n_largest).sum())
        return min(storage_capacity/row['cop'],
                   net_load_w_times.groupby('index')['electricity_cooling_kwh']
                   .apply(lambda x: x.nlargest(n_largest-1).sum()).loc[(row.building_date)])
When I apply it, this is my error message.
<ipython-input-22-88e243d194c6> in cooling_kwh_by_case()
16 n_largest = 1
17 daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
---> 18 while ((daily_groupby.loc[(row.building_date)]) <= (storage_capacity/row['cop'])) & (n_largest < net_load_w_times.groupby('index')['electricity_cooling_kwh'].count()):
19 n_largest += 1
20 daily_groupby = net_load_w_times.groupby('index')['electricity_cooling_kwh'].apply(lambda x: x.nlargest(n_largest).sum())
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
I think the issue I'm running into is the way I try to calculate the value I want in the else branch, which covers the cases where the cooling kWh is larger than the storage_capacity parameter. To calculate this value, I apply a function to find when the sum of the largest 15-min cooling kWh values for the day exceeds the storage_capacity. I then return the sum of the largest values.
The dataframe that I am trying to groupby in the function to return a value is called net_load_w_times:
time electricity_cooling_kwh \
building_id
2 2016-07-05 19:00:00 0.050000
2 2016-07-05 22:00:00 3.200000
2 2016-07-05 16:00:00 5.779318
2 2016-07-05 20:00:00 1.888300
2 2016-07-05 18:00:00 7.490000
electricity_heating_kwh total_site_electricity_kwh iso_zone \
building_id
2 0.000000 19.529506 MISO-E
2 0.045235 6.310719 MISO-E
2 0.000000 22.514705 MISO-E
2 0.018624 13.474863 MISO-E
2 0.005464 18.192927 MISO-E
index date
building_id
2 2|2016-10-24 2016-10-24
2 2|2016-03-05 2016-03-05
2 2|2016-08-14 2016-08-14
2 2|2016-03-05 2016-03-05
2 2|2016-03-05 2016-03-05
Desired Output:
Given cooling_kwh_by_case(row, 8, 5), it outputs:
7.717618, because this is the maximum cooling kWh that can be accumulated without exceeding 8.
Dask dataframes are lazy and do not work well with Python-level control flow like if-else statements or for loops. I recommend trying to find solutions within the pandas API, like the where method.
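For example, the first branch of the function could be expressed with column-wise boolean masks and Series.where instead of per-row if/else (a sketch only, with hypothetical stand-in data; the groupby/nlargest branches would need their own vectorized treatment, computed once outside the row-wise function):
import numpy as np
import pandas as pd

# hypothetical stand-in for the real data; dask dataframes expose the same API
ddf = pd.DataFrame({'daily_cooling_kwh': [6.0, 12.0],
                    'max_cooling_kw': [4.0, 7.0],
                    'cop': [1.0, 1.0]})
storage_capacity, storage_rate = 8, 5   # mirrors the function arguments

# column-wise boolean masks instead of per-row if/else
fits_capacity = ddf['daily_cooling_kwh'] <= storage_capacity / ddf['cop']
fits_rate = ddf['max_cooling_kw'] <= storage_rate / ddf['cop']

# case 1 of the original function: everything fits, so keep the daily kWh;
# the other cases are left as NaN and would need their own vectorized steps
ddf['case1_kwh'] = ddf['daily_cooling_kwh'].where(fits_capacity & fits_rate, other=np.nan)
print(ddf)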

Python - For loop millions of rows

I have a dataframe c with lots of different columns. Also, arr is a dataframe that corresponds to a subset of c: arr = c[c['A_D'] == 'A'].
The main idea of my code is to iterate over all rows in the c-dataframe and search for all the possible cases (in the arr dataframe) where some specific conditions should happen:
It is only necessary to iterate over rows where c['A_D'] == 'D' and c['Already_linked'] == 0
The hour in the arr dataframe must be less than the hour_aux in the c dataframe
The column Already_linked of the arr dataframe must be zero: arr.Already_linked == 0
The Terminal and the Operator needs to be the same in the c and arr dataframe
Right now, the conditions are stored using both Boolean indexing and groupby get_group:
Groupby the arr dataframe in order to choose the same Operator and Terminal: g = groups.get_group((row.Operator, row.Terminal))
Choose only the arrivals where the hour is smaller than the hour in the c dataframe and where Already_linked==0: vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]
For each of the rows in the c dataframe that verify all conditions, a vb dataframe is created. Naturally, this dataframe has a different length in each iteration. After creating the vb dataframe, my goal is to choose the index of the vb dataframe that minimises the time between vb.START and c[x]. The FlightID that corresponds to this index is then stored in the c dataframe in column a. Additionally, since the arrival was linked to a departure, the column Already_linked in the arr dataframe is changed from 0 to 1.
It is important to notice that the column Already_linked of the arr dataframe may change in every iteration (and arr.Already_linked == 0 is one of the conditions to create the vb dataframe). Therefore, it is not possible to parallelize this code.
I have already used c.itertuples() for efficiency; however, since c has millions of rows, this code is still too time-consuming.
Another option would be to use pd.apply on every row. Nonetheless, this is not really straightforward, since in each loop there are values that change in both c and arr (and I believe that even with pd.apply it would be extremely slow).
Is there any way to convert this for loop into a vectorized solution (or to decrease the running time by 10x, or more if possible)?
Initial dataframe:
START END A_D Operator FlightID Terminal TROUND_ID tot
0 2017-03-26 16:55:00 2017-10-28 16:55:00 A QR QR001 4 QR002 70
1 2017-03-26 09:30:00 2017-06-11 09:30:00 D DL DL001 3 " " 84
2 2017-03-27 09:30:00 2017-10-28 09:30:00 D DL DL001 3 " " 78
3 2017-10-08 15:15:00 2017-10-22 15:15:00 D VS VS001 3 " " 45
4 2017-03-26 06:50:00 2017-06-11 06:50:00 A DL DL401 3 " " 9
5 2017-03-27 06:50:00 2017-10-28 06:50:00 A DL DL401 3 " " 19
6 2017-03-29 06:50:00 2017-04-19 06:50:00 A DL DL401 3 " " 3
7 2017-05-03 06:50:00 2017-10-25 06:50:00 A DL DL401 3 " " 32
8 2017-06-25 06:50:00 2017-10-22 06:50:00 A DL DL401 3 " " 95
9 2017-03-26 07:45:00 2017-10-28 07:45:00 A DL DL402 3 " " 58
Desired Output (some of the columns were excluded in the dataframe below. Only the a and Already_linked columns are relevant):
START END A_D Operator a Already_linked
0 2017-03-26 16:55:00 2017-10-28 16:55:00 A QR 0 1
1 2017-03-26 09:30:00 2017-06-11 09:30:00 D DL DL402 1
2 2017-03-27 09:30:00 2017-10-28 09:30:00 D DL DL401 1
3 2017-10-08 15:15:00 2017-10-22 15:15:00 D VS No_link_found 0
4 2017-03-26 06:50:00 2017-06-11 06:50:00 A DL 0 0
5 2017-03-27 06:50:00 2017-10-28 06:50:00 A DL 0 1
6 2017-03-29 06:50:00 2017-04-19 06:50:00 A DL 0 0
7 2017-05-03 06:50:00 2017-10-25 06:50:00 A DL 0 0
8 2017-06-25 06:50:00 2017-10-22 06:50:00 A DL 0 0
9 2017-03-26 07:45:00 2017-10-28 07:45:00 A DL 0 1
Code:
groups = arr.groupby(['Operator', 'Terminal'])

for row in c[(c.A_D == "D") & (c.Already_linked == 0)].itertuples():
    try:
        g = groups.get_group((row.Operator, row.Terminal))
        vb = g[(g.Already_linked == 0) & (g.hour < row.hour_aux)]
        aux = (vb.START - row.x).abs().idxmin()
        c.loc[row.Index, 'a'] = vb.loc[aux].FlightID
        arr.loc[aux, 'Already_linked'] = 1
        continue
    except:
        continue

c['Already_linked'] = np.where((c.a != 0) & (c.a != 'No_link_found') & (c.A_D == 'D'), 1, c['Already_linked'])
c.Already_linked.loc[arr.Already_linked.index] = arr.Already_linked
c['a'] = np.where((c.Already_linked == 0) & (c.A_D == 'D'), 'No_link_found', c['a'])
Code for the initial c dataframe:
import numpy as np
import pandas as pd
import io
s = '''
A_D Operator FlightID Terminal TROUND_ID tot
A QR QR001 4 QR002 70
D DL DL001 3 " " 84
D DL DL001 3 " " 78
D VS VS001 3 " " 45
A DL DL401 3 " " 9
A DL DL401 3 " " 19
A DL DL401 3 " " 3
A DL DL401 3 " " 32
A DL DL401 3 " " 95
A DL DL402 3 " " 58
'''
data_aux = pd.read_table(io.StringIO(s), delim_whitespace=True)
data_aux.Terminal = data_aux.Terminal.astype(str)
data_aux.tot= data_aux.tot.astype(str)
d = {'START': ['2017-03-26 16:55:00', '2017-03-26 09:30:00', '2017-03-27 09:30:00', '2017-10-08 15:15:00',
               '2017-03-26 06:50:00', '2017-03-27 06:50:00', '2017-03-29 06:50:00', '2017-05-03 06:50:00',
               '2017-06-25 06:50:00', '2017-03-26 07:45:00'],
     'END': ['2017-10-28 16:55:00', '2017-06-11 09:30:00', '2017-10-28 09:30:00', '2017-10-22 15:15:00',
             '2017-06-11 06:50:00', '2017-10-28 06:50:00', '2017-04-19 06:50:00', '2017-10-25 06:50:00',
             '2017-10-22 06:50:00', '2017-10-28 07:45:00']}
aux_df = pd.DataFrame(data=d)
aux_df.START = pd.to_datetime(aux_df.START)
aux_df.END = pd.to_datetime(aux_df.END)
c = pd.concat([aux_df, data_aux], axis = 1)
c['A_D'] = c['A_D'].astype(str)
c['Operator'] = c['Operator'].astype(str)
c['Terminal'] = c['Terminal'].astype(str)
c['hour'] = pd.to_datetime(c['START'], format='%H:%M').dt.time
c['hour_aux'] = pd.to_datetime(c['START'] - pd.Timedelta(15, unit='m'),
                               format='%H:%M').dt.time
c['start_day'] = c['START'].astype(str).str[0:10]
c['end_day'] = c['END'].astype(str).str[0:10]
c['x'] = c.START - pd.to_timedelta(c.tot.astype(int), unit='m')
c["a"] = 0
c["Already_linked"] = np.where(c.TROUND_ID != " ", 1 ,0)
arr = c[c['A_D'] == 'A']
While this is not a vectorized solution, it should speed things up considerably if your sample data set mimics your true data set. Currently, you are wasting time looping over every row, but you only care about rows where ['A_D'] == 'D' and ['Already_linked'] == 0. Instead, remove the ifs and loop over the truncated dataframe, which is only 30% of the initial dataframe:
for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
    vb = arr[(arr.Already_linked == 0) & (arr.hour < row.hour_aux)].copy().query(row.query_string)
    try:
        aux = (vb.START - row.x).abs().idxmin()
        print(row.x)
        c.loc[row.Index, 'a'] = vb.loc[aux, 'FlightID']
        arr.loc[aux, 'Already_linked'] = 1
        continue
    except:
        continue
Your question was if there is a way to vectorize the for loop, but I think that question hides what you really want which is an easy way to speed your code up. For performance questions, a good starting point is always profiling. However, I have a strong suspicion that the dominant operation in your code is .query(row.query_string). Running that for every row is expensive if arr is large.
For arbitrary queries, that runtime can't really be improved at all without removing dependencies between iterations and parallelizing the expensive step. You might be a bit luckier though. Your query string always checks two different columns to see if they're equal to something you care about. However, for each row that requires going through your entire slice of arr. Since the slice changes each time, that could cause problems, but here are some ideas:
Since you're slicing arr each time anyway, maintain a view of just the arr.Already_linked == 0 rows so you're iterating over a smaller object.
Better yet, before you do any looping you should first group arr by Terminal and Operator. Then, instead of running through all of arr, first select the group you want and then do your slicing and filtering. This would require rethinking the exact implementation of query_string a little bit, but the advantage is that if you have a lot of terminals and operators you'll typically be working over a much smaller object than arr. Moreover, you wouldn't even have to query that object since that was implicitly done by the groupby.
Depending on how aux.hour typically relates to row.hour_aux, you might have improvements by sorting aux at the beginning with respect to hour. Just using the inequality operator you probably wouldn't see any gains, but you could pair that with a logarithmic search for the cutoff point and then just slice up to that cutoff point.
And so on. Again, I suspect any way of restructuring the query you're doing on all of arr for every row will offer substantially more gains than just switching frameworks or vectorizing bits and pieces.
Expanding on some of those points a little bit and adapting #DJK's code a bit, look at what happens when we have the following changes.
groups = arr.groupby(['Operator', 'Terminal'])

for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
    g = groups.get_group((row.Operator, row.Terminal))
    vb = g[(g.Already_linked == 0) & (g.hour < row.hour_aux)]
    try:
        aux = (vb.START - row.x).abs().idxmin()
        print(row.x)
        c.loc[row.Index, 'a'] = vb.loc[aux, 'FlightID']
        g.loc[aux, 'Already_linked'] = 1
        continue
    except:
        continue
Part of the reason your query is so slow is because it's searching over all of arr each time. In contrast, the .groupby() executes in roughly the same time as one query, but then for every subsequent iteration you can just use .get_group() to efficiently find the tiny subset of the data you care about.
A helpful (extremely crude) rule of thumb when benchmarking is that a billion things takes a second. If you're seeing much longer times than that for something measured in millions of things, like your millions of rows, that means that for each of those rows you're doing tons of things to get up to billions of operations. That leaves a ton of potential for better algorithms to reduce the number of operations, whereas vectorization really only yields constant factor improvements (and for many string/query operations not even great improvements at that).
This solution uses pd.DataFrame.isin, which uses numpy.in1d.
Apparently 'isin' isn't necessarily faster for small datasets (like this sample), but is significantly faster for large datasets. You'll have to run it against your data to determine performance.
flight_record_linkage.ipynb
Expanded the dataset using c = pd.concat([c] * 10000, ignore_index=True)
Increase the dataset length by 3 orders of magnitude (10000 rows total).
Original method: Wall time: 8.98s
New method: Wall time: 16.4s
Increase the dataset length by 4 orders of magnitude (100000 rows total).
Original method: Wall time: 8min 17s
New method: Wall time: 1min 14s
Increase the dataset length by 5 orders of magnitude (1000000 rows total).
New method: Wall time: 11min 33s
New Method: Using isin and apply
def apply_do_g(it_row):
    """
    This is your function, but using isin and apply
    """
    keep = {'Operator': [it_row.Operator], 'Terminal': [it_row.Terminal]}  # dict for isin combined mask

    holder1 = arr[list(keep)].isin(keep).all(axis=1)  # create boolean mask
    holder2 = arr.Already_linked.isin([0])            # create boolean mask
    holder3 = arr.hour < it_row.hour_aux              # create boolean mask

    holder = holder1 & holder2 & holder3              # combine the masks

    holder = arr.loc[holder]

    if not holder.empty:
        aux = np.absolute(holder.START - it_row.x).idxmin()
        c.loc[it_row.name, 'a'] = holder.loc[aux].FlightID  # use 'it_row.name' with apply
        arr.loc[aux, 'Already_linked'] = 1


def new_way_2():
    keep = {'A_D': ['D'], 'Already_linked': [0]}
    df_test = c[c[list(keep)].isin(keep).all(axis=1)].copy()  # returns the resultant df
    df_test.apply(lambda row: apply_do_g(row), axis=1)        # g is multiple DataFrames


# call the function
new_way_2()
Your problem looks like one of the most common problems in database operations. I do not fully understand what you want to get, because you have not fully formulated the task. Now to the possible solution: avoid loops at all.
You have a very long table with the columns time, FlightID, Operator, Terminal, A_D. The other columns and dates do not matter if I understand you correctly. Also, start_time and end_time are the same in every row. By the way, you may get the time column with the code table.loc[:, 'time'] = table.loc[:, 'START'].dt.time.
table = table.drop_duplicates(subset=['time', 'FlightID', 'Operator', 'Terminal']), and your table becomes significantly shorter.
Split table into table_arr and table_dep according to A_D value: table_arr = table.loc[table.loc[:, 'A_D'] == 'A', ['FlightID', 'Operator', 'Terminal', 'time']], table_dep = table.loc[table.loc[:, 'A_D'] == 'D', ['FlightID', 'Operator', 'Terminal', 'time']]
Seems like all you tried to get with loops you may get with a single line: table_result = table_arr.merge(table_dep, how='right', on=['Operator', 'Terminal'], suffixes=('_arr', '_dep')). It is basically the same operation as JOIN in SQL.
According to my understanding of your problem, and with the tiny piece of data you have provided, you get the desired output (the correspondence between FlightID_dep and FlightID_arr for all FlightID_dep values) without any loop, so much faster. table_result is:
FlightID_arr Operator Terminal time_arr FlightID_dep time_dep
0 DL401 DL 3 06:50:00 DL001 09:30:00
1 DL402 DL 3 07:45:00 DL001 09:30:00
2 NaN VS 3 NaN VS001 15:15:00
Of course, in general case (with actual data) you will need one more step - filter table_result on condition time_arr < time_dep or any other condition you have. Unfortunately the data you have provided is not enough to fully solve your problem.
Complete code is:
import io
import pandas as pd
data = '''
START,END,A_D,Operator,FlightID,Terminal,TROUND_ID,tot
2017-03-26 16:55:00,2017-10-28 16:55:00,A,QR,QR001,4,QR002,70
2017-03-26 09:30:00,2017-06-11 09:30:00,D,DL,DL001,3,,84
2017-03-27 09:30:00,2017-10-28 09:30:00,D,DL,DL001,3,,78
2017-10-08 15:15:00,2017-10-22 15:15:00,D,VS,VS001,3,,45
2017-03-26 06:50:00,2017-06-11 06:50:00,A,DL,DL401,3,,9
2017-03-27 06:50:00,2017-10-28 06:50:00,A,DL,DL401,3,,19
2017-03-29 06:50:00,2017-04-19 06:50:00,A,DL,DL401,3,,3
2017-05-03 06:50:00,2017-10-25 06:50:00,A,DL,DL401,3,,32
2017-06-25 06:50:00,2017-10-22 06:50:00,A,DL,DL401,3,,95
2017-03-26 07:45:00,2017-10-28 07:45:00,A,DL,DL402,3,,58
'''
table = pd.read_csv(io.StringIO(data), parse_dates=[0, 1])
table.loc[:, 'time'] = table.loc[:, 'START'].dt.time
table = table.drop_duplicates(subset=['time', 'FlightID', 'Operator', 'Terminal'])
table_arr = table.loc[table.loc[:, 'A_D'] == 'A', ['FlightID', 'Operator', 'Terminal', 'time']]
table_dep = table.loc[table.loc[:, 'A_D'] == 'D', ['FlightID', 'Operator', 'Terminal', 'time']]
table_result = table_arr.merge(
    table_dep,
    how='right',
    on=['Operator', 'Terminal'],
    suffixes=('_arr', '_dep'))
print(table_result)
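The extra filtering step mentioned above could then be a single boolean selection, for example (a sketch; it assumes time_arr and time_dep hold comparable datetime.time values as produced by .dt.time, and drops departures with no candidate arrival first):
# keep only pairings where the arrival actually happens before the departure
matched = table_result.dropna(subset=['time_arr'])
matched = matched[matched['time_arr'] < matched['time_dep']]
print(matched)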

Avoid looping to calculate simple moving average crossing-derived signals

I would like to calculate buy and sell signals for stocks based on simple moving average (SMA) crossing. A buy signal should be given as soon as the SMA_short is higher than the SMA_long (i.e., SMA_difference > 0). To avoid the position being sold too quickly, I would like to have a sell signal only once the SMA_short has moved well beyond the cross (i.e., SMA_difference < -1), and, importantly, even if that takes longer than one day.
I managed, by this help to implement it (see below):
Buy and sell signals are indicated by in and out.
Column Position takes first the buy_limit into account.
In Position_extended, an in is then set for all the cases where SMA_short has just crossed below SMA_long (SMA_difference < 0) but SMA_difference > -1. For this it takes the Position_extended of i-1 into account, in case the crossing was more than one day ago but SMA_difference remained in the range 0 > SMA_difference > -1.
Python code
import pandas as pd
import numpy as np
index = pd.date_range('20180101', periods=6)
df = pd.DataFrame(index=index)
df["SMA_short"] = [9,10,11,10,10,9]
df["SMA_long"] = 10
df["SMA_difference"] = df["SMA_short"] - df["SMA_long"]
buy_limit = 0
sell_limit = -1
df["Position"] = np.where((df["SMA_difference"] > buy_limit),"in","out")
df["Position_extended"] = df["Position"]
for i in range(1, len(df)):
    df.loc[index[i], "Position_extended"] = \
        np.where((df.loc[index[i], "SMA_difference"] > sell_limit)
                 & (df.loc[index[i-1], "Position_extended"] == "in"),
                 "in", df.loc[index[i], 'Position'])
print(df)
The result is:
SMA_short SMA_long SMA_difference Position Position_extended
2018-01-01 9 10 -1 out out
2018-01-02 10 10 0 out out
2018-01-03 11 10 1 in in
2018-01-04 10 10 0 out in
2018-01-05 10 10 0 out in
2018-01-06 9 10 -1 out out
The code works, however, it makes use of a for loop, which slows down the script considerably and becomes inapplicable in the larger context of this analysis. As SMA crossing is such a highly used tool, I was wondering whether somebody could see a more elegant and faster solution for this.
Essentially you are trying to get rid of the ambivalent zero entries by propagating the last non-zero value, similar to a zero-order hold. You can do so by first replacing the zero values with NaNs and then interpolating over the latter using ffill.
import pandas as pd
import numpy as np
index = pd.date_range('20180101', periods=6)
df = pd.DataFrame(index=index)
df["SMA_short"] = [9,10,11,10,10,9]
df["SMA_long"] = 10
df["SMA_difference"] = df["SMA_short"] - df["SMA_long"]
buy_limit = 0
sell_limit = -1
df["ZOH"] = df["SMA_difference"].replace(0,np.nan).ffill()
df["Position"] = np.where((df["ZOH"] > buy_limit),"in","out")
print(df)
results in:
SMA_short SMA_long SMA_difference ZOH Position
2018-01-01 9 10 -1 -1.0 out
2018-01-02 10 10 0 -1.0 out
2018-01-03 11 10 1 1.0 in
2018-01-04 10 10 0 1.0 in
2018-01-05 10 10 0 1.0 in
2018-01-06 9 10 -1 -1.0 out
If row T requires as input a value calculated in row T-1, then you'll probably want to do an iterative calculation. Typically backtesting is done by iterating through price data in sequence. You can calculate some signals just based on the state of the market, but you won't know the portfolio value, the pnl, or the portfolio positions unless you start at the beginning and work your way forward in time. That's why if you look at a site like Quantopian, the backtests always run from start date to end date.

How to vectorize this python loop involving millions of records

I have a pandas dataframe, df, with 4,000,000 timesteps for a single stock.
The task is, for each timestep, I want to determine if it rises .1% or falls .1% first. So right now I am converting the dataframe to numpy arrays and looping through each timestep, starting at 0 to 4,000,000.
For each timestep, I iterate through the following time steps until I find one where there is a .1% difference in price. If the price rose .1% the label is 1, if it fell .1% the label is 0. This is taking a very long time.
Is it even possible to vectorize this? I tried thinking of a dynamic programming solution to reduce time complexity but I'm not sure if there is one.
high_bid = df['high_bid'].values
high_ask = df['high_ask'].values
low_bid = df['low_bid'].values
low_ask = df['low_ask'].values
open_bid = df['open_bid'].values
open_ask = df['open_ask'].values
target = 0.001  # the 0.1% move described above

labels = np.empty(len(df))
labels[:] = np.nan

for i in range(len(labels) - 1):
    for j in range(i + 1, len(labels) - 1):
        if (open_ask[i] + (open_ask[i] * target) <= high_bid[j]):
            labels[i] = 1
            break
        elif (open_bid[i] - (open_bid[i] * target) >= low_ask[j]):
            labels[i] = 0
            break

df['direction'] = labels
Example
time open_bid open_ask high_bid high_ask low_bid \
0 2006-09-19 12:00:00 1.26606 1.26621 1.27063 1.27078 1.26504
1 2006-09-19 13:00:00 1.27010 1.27025 1.27137 1.27152 1.26960
2 2006-09-19 14:00:00 1.27076 1.27091 1.27158 1.27173 1.26979
3 2006-09-19 15:00:00 1.27008 1.27023 1.27038 1.27053 1.26708
4 2006-09-19 16:00:00 1.26816 1.26831 1.26821 1.26836 1.26638
5 2006-09-19 17:00:00 1.26648 1.26663 1.26762 1.26777 1.26606
6 2006-09-19 18:00:00 1.26756 1.26771 1.26781 1.26796 1.26733
7 2006-09-19 19:00:00 1.26763 1.26778 1.26785 1.26800 1.26754
8 2006-09-19 20:00:00 1.26770 1.26785 1.26825 1.26840 1.26765
9 2006-09-19 21:00:00 1.26781 1.26796 1.26791 1.26806 1.26703
low_ask direction
0 1.26519 1
1 1.26975 1
2 1.26994 0
3 1.26723 0
4 1.26653 0
5 1.26621 1
6 1.26748 NaN
7 1.26769 NaN
8 1.26780 NaN
9 1.26718 NaN
I want to add that direction column for all 4 million rows.
You can probably also check the expanding() window function, but in a reverse direction to calculate the max_future_high_bid and min_future_low_ask after each row:
# 0.1% increase/decrease
target = 0.001

# new column names
new_columns = ["max_future_high_bid", "min_future_low_ask"]

df[new_columns] = df[::-1].expanding(1) \
    .agg({'high_bid': 'max', 'low_ask': 'min'})[::-1] \
    .shift(-1)

# after you have these two values, you can calculate the direction with the apply() function
def get_direction(x):
    if x.max_future_high_bid >= (1 + target) * x.open_ask:
        return 1
    elif (1 - target) * x.open_bid >= x.min_future_low_ask:
        return 0
    else:
        return None

# calculate the direction
df['direction'] = df.apply(get_direction, axis=1)
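If the row-wise apply turns out to be slow on 4 million rows, the same three-way decision can also be written with np.select (a sketch, not part of the original answer; it reuses the two helper columns computed above):
import numpy as np

conditions = [
    df['max_future_high_bid'] >= (1 + target) * df['open_ask'],  # some future bid reaches the +0.1% level
    (1 - target) * df['open_bid'] >= df['min_future_low_ask'],   # some future ask reaches the -0.1% level
]
df['direction'] = np.select(conditions, [1, 0], default=np.nan)  # NaN where neither level is reached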
First solution to try: Cython. In a similar setting, I've got 20-90x speed up just by adding %%cython to my code.
In one Jupyter cell:
%load_ext Cython
In the next cell (the %%cython magic must be the first line of the cell):
%%cython
cimport numpy as np
import numpy as np

cpdef func(np.ndarray high_bid, np.ndarray high_ask, np.ndarray low_bid,
           np.ndarray low_ask, np.ndarray open_bid, np.ndarray open_ask,
           np.ndarray labels):
    target = 0.001
    cdef Py_ssize_t i, j, n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            # The following is just a copy-paste of your code
            if (open_ask[i] + (open_ask[i] * target) <= high_bid[j]):
                labels[i] = 1
                break
            elif (open_bid[i] - (open_bid[i] * target) >= low_ask[j]):
                labels[i] = 0
                break
In another Jupyter cell:
func(high_bid, high_ask, low_bid, low_ask, open_bid, open_ask, labels)
More optimisation
Here is an excellent introduction to Cython for pandas.
You can speed up more by adding the data type (np.ndarray[double])
Second solution: Use cummax, cummin on high_bid, low_ask in reversed order
target = 0.001
df['highest_bid_from_on'] = df.high_bid.sort_index(ascending=False).cummax().sort_index(ascending=True)
df['lowest_ask_from_on'] = df.low_ask.sort_index(ascending=False).cummin().sort_index(ascending=True)
df['direction'] = np.nan
df.loc[df.open_bid * (1 - target) >= df.lowest_ask_from_on, 'direction'] = 0
df.loc[df.open_ask * (1 + target) <= df.highest_bid_from_on, 'direction'] = 1

How to find slope of time series variable

I have a time series data similar to this
val
2015-10-15 7.85
2015-10-16 8
2015-10-19 8.18
2015-10-20 5.39
2015-10-21 2.38
2015-10-22 1.98
2015-10-23 9.25
2015-10-26 14.29
2015-10-27 15.52
2015-10-28 15.93
2015-10-29 15.79
2015-10-30 13.83
How can I find the slope between adjacent rows (e.g. 8 and 7.85) of the val variable and print it in a different column, in R or Python?
I know the formula for a slope, i.e. slope = (y2 - y1) / (x2 - x1), but the problem is how to take the difference of the x values (the dates) in time series data.
(Here x is the date and y is val.)
If by slope you mean (value(y)-value(x))/(y-x), then your slope should have at least one value less than your data frame, so it will be difficult to show it in the same data frame.
In R, this would be my answer:
slope <- numeric(length = nrow(df))
for(i in 2:nrow(df)){
  slope[i-1] <- (df[i-1, "val"] - df[i, "val"]) / (as.numeric(df[i-1, 1] - df[i, 1]))
}
slope[nrow(df)] <- NA
df$slope <- slope
Edit (answering to your edition)
In R, dates is a class of data (like integers, numeric, or characters).
For example I can define a vector of dates:
x<-as.Date(c("2015-10-15","2015-10-16"))
print( x )
[1] "2015-10-15" "2015-10-16"
And the difference of 2 dates returns:
x[2]-x[1]
Time difference of 1 days
As you mentioned, you cannot divide by a date:
2/(x[2]-x[1])
Error in `/.difftime`(2, (x[2] - x[1])) :
second argument cannot be "difftime"
That is why I used as.numeric, which forces the vector to be a numeric value (in days):
2/as.numeric(x[2]-x[1])
[1] 2
To prove that it works:
as.numeric(as.Date("2016-10-15")-as.Date("2015-10-16"))
[1] 365
(2016 being a leap year, this is correct!)
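Since the question also asked about Python, here is a comparable pandas sketch (not part of the original answer; it assumes the data is a DataFrame with a DatetimeIndex and a val column, and it stores each slope on the later of the two rows):
import pandas as pd

df = pd.DataFrame({'val': [7.85, 8.00, 8.18, 5.39]},
                  index=pd.to_datetime(['2015-10-15', '2015-10-16',
                                        '2015-10-19', '2015-10-20']))

# change in val divided by the number of days between consecutive rows
days = df.index.to_series().diff().dt.days
df['slope'] = df['val'].diff() / days
print(df)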
