Rolling Window of Local Minima/Maxima - python

I've made a script (shown below) that helps determine local maxima points using historical stock data. It uses the daily highs to mark out local resistance levels. It works great, but for any given point in time (or row in the stock data), I want to know the most recent resistance level just prior to that point, stored in its own column in the dataset. For instance:
The top grey line is the highs for each day, and the bottom grey line is the close of each day. So roughly speaking, the dataset for that section would look like this:
High Close
216.8099976 216.3399963
215.1499939 213.2299957
214.6999969 213.1499939
215.7299957 215.2799988 <- First blue dot at high
213.6900024 213.3699951
214.8800049 213.4100037 <- 2nd blue dot at high
214.5899963 213.4199982
216.0299988 215.8200073
217.5299988 217.1799927 <- 3rd blue dot at high
216.8800049 215.9900055
215.2299957 214.2400055
215.6799927 215.5700073
....
Right now, this script looks at the entire dataset at once to determine the local maxima indexes for the highs, and then, for any given point in the stock history (i.e. any given row), it looks for the NEXT maximum in the list of all maxima found. That would be a way to determine where the next resistance level is, but I don't want that because of look-ahead bias. I just want a column with the most recent past resistance level, or ideally the latest 2 recent points in 2 columns.
So my final output would look like this for the single-column case:
High Close Most_Rec_Max
216.8099976 216.3399963 0
215.1499939 213.2299957 0
214.6999969 213.1499939 0
215.7299957 215.2799988 0
213.6900024 213.3699951 215.7299957
214.8800049 213.4100037 215.7299957
214.5899963 213.4199982 214.8800049
216.0299988 215.8200073 214.8800049
217.5299988 217.1799927 214.8800049
216.8800049 215.9900055 217.5299988
215.2299957 214.2400055 217.5299988
215.6799927 215.5700073 217.5299988
....
You'll notice that a dot only shows up in the most-recent column after it has already been discovered.
Here is the code I am using:
real_close_prices = df['Close'].to_numpy()
highs = df['High'].to_numpy()
max_indexes = (np.diff(np.sign(np.diff(highs))) < 0).nonzero()[0] + 1 # local max
# +1 due to the fact that diff reduces the original index number
max_values_at_indexes = highs[max_indexes]
curr_high = [c for c in highs]
max_values_at_indexes.sort()
for m in max_values_at_indexes:
    for i, c in enumerate(highs):
        if m > c and curr_high[i] == c:
            curr_high[i] = m
#print(nextbig)
df['High_Resistance'] = curr_high
# plot
plt.figure(figsize=(12, 5))
plt.plot(x, highs, color='grey')
plt.plot(x, real_close_prices, color='grey')
plt.plot(x[max_indexes], highs[max_indexes], "o", label="max", color='b')
plt.show()
Hoping someone will be able to help me out with this. Thanks!

Here is one approach. Once you know where the peaks are, you can store peak indices in p_ids and peak values in p_vals. To assign the k'th most recent peak, note that p_vals[:-k] will occur at p_ids[k:]. The rest is forward filling.
# find all local maxima in the series by comparing to shifted values
peaks = (df.High > df.High.shift(1)) & (df.High > df.High.shift(-1))
# pass peak value if peak is achieved and NaN otherwise
# forward fill with previous peak value & handle leading NaNs with fillna
df['Most_Rec_Max'] = (df.High * peaks.replace(False, np.nan)).ffill().fillna(0)
# for finding n-most recent peak
p_ids, = np.where(peaks)
p_vals = df.High[p_ids].values
for n in [1, 2]:
    col_name = f'{n+1}_Most_Rec_Max'
    df[col_name] = np.nan
    df.loc[p_ids[n:], col_name] = p_vals[:-n]
    df[col_name].ffill(inplace=True)
    df[col_name].fillna(0, inplace=True)
# High Close Most_Rec_Max 2_Most_Rec_Max 3_Most_Rec_Max
# 0 216.809998 216.339996 0.000000 0.000000 0.000000
# 1 215.149994 213.229996 0.000000 0.000000 0.000000
# 2 214.699997 213.149994 0.000000 0.000000 0.000000
# 3 215.729996 215.279999 215.729996 0.000000 0.000000
# 4 213.690002 213.369995 215.729996 0.000000 0.000000
# 5 214.880005 213.410004 214.880005 215.729996 0.000000
# 6 214.589996 213.419998 214.880005 215.729996 0.000000
# 7 216.029999 215.820007 214.880005 215.729996 0.000000
# 8 217.529999 217.179993 217.529999 214.880005 215.729996
# 9 216.880005 215.990006 217.529999 214.880005 215.729996
# 10 215.229996 214.240006 217.529999 214.880005 215.729996
# 11 215.679993 215.570007 217.529999 214.880005 215.729996
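A small aside (a sketch, not part of the original answer): if you want Most_Rec_Max to match the question's desired output exactly, where a peak only appears on the rows after the row it occurs on, you can shift the peak values down one row before forward filling.
# assumes `df` and the boolean `peaks` Series from the snippet above
peak_vals = df.High.where(peaks)                                  # peak value on peak rows, NaN elsewhere
df['Most_Rec_Max_lagged'] = peak_vals.shift(1).ffill().fillna(0)  # a peak is only usable from the next row on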

I just came across this function that might help you a lot: scipy.signal.find_peaks.
Based on your sample dataframe, we can do the following:
from scipy.signal import find_peaks
## Grab the minimum high value as a threshold.
min_high = df["High"].min()
### Run the High values through the function. The docs explain more,
### but we can set our height to the minimum high value.
### We just need one out of two return values.
peaks, _ = find_peaks(df["High"], height=min_high)
### Do some maintenance and add a column to mark peaks
# Build a small frame of the peak rows first (this step is implicit in the snippet above)
peaks_df = pd.DataFrame({"local_high": df["High"].iloc[peaks]})
# Merge on our index values
df1 = df.merge(peaks_df, how="left", left_index=True, right_index=True)
# Set non-null values to 1 and null values to 0; Convert column to integer type.
df1.loc[~df1["local_high"].isna(), "local_high"] = 1
df1.loc[df1["local_high"].isna(), "local_high"] = 0
df1["local_high"] = df1["local_high"].astype(int)
Then, your dataframe should look like the following:
High Close local_high
0 216.809998 216.339996 0
1 215.149994 213.229996 0
2 214.699997 213.149994 0
3 215.729996 215.279999 1
4 213.690002 213.369995 0
5 214.880005 213.410004 1
6 214.589996 213.419998 0
7 216.029999 215.820007 0
8 217.529999 217.179993 1
9 216.880005 215.990005 0
10 215.229996 214.240005 0
11 215.679993 215.570007 0
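To go from the local_high marker back to the resistance column the question asks for, one possible continuation (a sketch, assuming the same one-row lag as in the desired output) is:
# keep only the peak highs, lag them by one row, then forward fill
df1["Most_Rec_Max"] = (
    df1["High"].where(df1["local_high"] == 1)  # peak values only
               .shift(1)                       # a peak is only known after it happens
               .ffill()
               .fillna(0)
)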

Related

Apply multiple condition groupby + sort + sum to pandas dataframe rows

I have a dataframe that has the following columns:
Acct Num, Correspondence Date, Open Date
For each opened account, I am being asked to look back at all the correspondences that happened within 30 days of the open date of that account, then assign points to the correspondences as follows:
Forty-twenty-forty: Attribute 40% (0.4 points) of the attribution to the first touch,
40% to the last touch, and divide the remaining 20% between all touches in between
So I know apply and group by functions, but this is beyond my paygrade.
I have to group by account, with a conditional based on comparing 2 columns against each other, to get a total number of correspondences, and I guess they have to be sorted as well, as the following step of assigning points to correspondences depends on the order in which they occurred.
I would like to do this efficiently, as I have a ton of rows, I know apply() can go fast, but I am pretty bad at applying it when the row-level operation I am trying to do gets even a little complex.
I appreciate any help, as I am not good at pandas.
EDIT
as per request
Acct, ContactDate, OpenDate, Points (what I need to calculate)
123, 1/1/2018, 1/1/2021, 0 (because correspondence not within 30 days of open)
123, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
123, 12/11/2020, 1/1/2021, 0.2 (other 'touches' get 0.2/(num of touches-2) 'points')
123, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
456, 1/1/2018, 1/1/2021, 0 (again, because correspondence not within 30 days of open)
456, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
This returns a reduced dataframe in that it excludes timeframes exceeding 30 days, and then merges the original df into it to get all the data in one df. This assumes your date sorting is correct; otherwise, you may have to do that upfront before applying the function below.
from datetime import timedelta  # needed for the 30-day cutoff

df['Points'] = 0  # add column to dataframe before analysis
# df.columns
# Index(['Acct', 'ContactDate', 'OpenDate', 'Points'], dtype='object')

def points(x):
    newx = x.loc[(x['OpenDate'] - x['ContactDate']) <= timedelta(days=30)]  # keep only rows within 30 days of open
    # print(newx.Acct)
    if newx.Acct.count() > 2:  # check more than two dates exist
        newx['Points'].iloc[0] = .4   # first row
        newx['Points'].iloc[-1] = .4  # last row
        newx['Points'].iloc[1:-1] = .2 / newx['Points'].iloc[1:-1].count()  # middle rows split the remaining 0.2
        return newx
    elif newx.Acct.count() == 2:  # placeholder for later
        # edge case logic here for two occurrences
        return newx
    elif newx.Acct.count() == 1:  # placeholder for later
        # edge case logic here for one occurrence
        return newx

# groupby Acct then clean up the indices so it can be merged back into original df
dft = df.groupby('Acct', as_index=False).apply(points).reset_index().set_index('level_1').drop('level_0', axis=1)
# merge on index
df_points = df[['Acct', 'ContactDate', 'OpenDate']].merge(dft['Points'], how='left', left_index=True, right_index=True).fillna(0)
Output:
Acct ContactDate OpenDate Points
0 123 2018-01-01 2021-01-01 0.0
1 123 2020-12-10 2021-01-01 0.4
2 123 2020-12-11 2021-01-01 0.2
3 123 2020-12-12 2021-01-01 0.4
4 456 2018-01-01 2021-01-01 0.0
5 456 2020-12-10 2021-01-01 0.4
6 456 2020-12-11 2021-01-01 0.1
7 456 2020-12-11 2021-01-01 0.1
8 456 2020-12-12 2021-01-01 0.4
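For larger frames, here is a hedged vectorized sketch that avoids groupby.apply entirely. It assumes ContactDate and OpenDate are already datetimes and that rows are sorted by ContactDate within each account; accounts with only one or two in-window touches simply get 0.4 each, which differs from the unspecified placeholder logic above.
from datetime import timedelta
import numpy as np

within = (df['OpenDate'] - df['ContactDate']) <= timedelta(days=30)
sub = df[within]

pos = sub.groupby('Acct').cumcount()                 # position of each touch within its account
n = sub.groupby('Acct')['Acct'].transform('size')    # number of in-window touches per account
middle = 0.2 / np.maximum(n - 2, 1)                  # share for the touches in between (guard n <= 2)

df['Points'] = 0.0
df.loc[sub.index, 'Points'] = np.where((pos == 0) | (pos == n - 1), 0.4, middle)
This reproduces the expected sample output: 0.4/0.2/0.4 for account 123 and 0.4/0.1/0.1/0.4 for account 456, with out-of-window rows left at 0.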

Python - Count row between interval in dataframe

I have a dataset with a date, engine, energy and max power column. Let's say the dataset covers 2 machines over a depth of one month. Each machine has a maximum power (say 100 for simplicity) and 3 operating states: nominal power between Pmax and 80% of Pmax, a drop in load between 80% and 20% of Pmax, and a stop below 20% of Pmax.
The idea is to know, by period and machine, the number of times the machine has operated in the 2nd interval (between 80% and 20% of Pmax). If a machine drops through this interval on its way to a stop it should not be counted, and if it passes through it when returning from a stop it should not be counted either.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez
data = {'date': ['01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020',
'01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020'],
'engine': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b',],
'energy': [100,100,100,100,100,80,80,60,60,60,60,60,90,100,100,50,50,40,20,0,0,0,20,50,60,100,100,50,50,50,50,
50,50,100,100,100,80,80,60,60,60,60,60,0,0,0,50,50,100,90,50,50,50,50,50,60,100,100,50,50,100,100],
'pmax': [100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,
100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100]
}
df = pd.DataFrame(data, columns = ['date', 'engine', 'energy', 'pmax'])
df['date'] = df['date'].astype('datetime64[ns]')
df = df.set_index('date')
df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x < 80 and x >= 20 else 0 ))
liste = []
engine_off = ez((df['inter'] == 1).to_numpy())
for i in engine_off:
    if df.iloc[(i.start)-1, 3] == 0:
        engine_off.remove(i)
    elif df.iloc[(i.stop), 3] == 0:
        engine_off.remove(i)
    else:
        liste.append([df['engine'][i.start], df.index[i.start], df.index[i.stop], i.stop - i.start])
dfend = pd.DataFrame(liste, columns=['engine','begin','end','nb_heure'])
dfend['month'] = dfend['begin'].dt.month_name()
dfgroupe = dfend.set_index('begin').groupby(['engine','month']).agg(['mean','max','min','std','count','sum']).fillna(1)
First I load my data into a DataFrame and classify, for each line, the associated energy into an interval (2 for nominal operation, 1 for intermediate and 0 for stop).
Then I check each row where the inter column == 1, which lets me retrieve a list of slices with the start and end of each run.
Then I loop to check that the element before and after each slice is different from 0, to exclude the drops that lead into a stop or return from one.
Then I create a dataframe from the list, and compute the mean, sum, etc.
The problem is that my list has only 4 drops while there are 5. This comes from the 4th slice (27.33).
Can someone help me?
Thank you
Here is one way to do it. I tried to use your approach with groups but ended up doing it slightly differently.
# another way to create inter, probably faster on big dataframe
df['inter'] = pd.cut(df['energy']/df['pmax'], [-1,0.2, 0.8, 1.01],
labels=[0,1,2], right=False)
# mask if inter is equal to 1 and groupby engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])
# create a mask to get True for the rows you want
m = (df['inter'].eq(1) # the row are 1s
& ~gr.ffill().eq(0) # the row before 1s is not 0
& ~gr.bfill().eq(0) # the row after 1s is not 0
)
#create dfend with similar shape to yours
dfend = (df.assign(date=df.index) #create a column date for the agg
.where(m) # replace the rows not interesting by nan
.groupby(['engine', #groupby per engine
m.ne(m.shift()).cumsum()]) # and per group of following 1s
.agg(begin=('date','first'), #agg date with both start date
end = ('date','last')) # and end date
)
# create the column nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days+1
print (dfend)
begin end nb_hours
engine inter
a 2 2020-01-08 2020-01-12 5
4 2020-01-28 2020-01-31 4
b 4 2020-01-01 2020-01-02 2
6 2020-01-20 2020-01-25 6
8 2020-01-28 2020-01-29 2
And you get the three segments for engine b as required. Then you can:
#create dfgroupe
dfgroupe = (dfend.groupby(['engine', #groupby engine
dfend['begin'].dt.month_name()]) #and month name
.agg(['mean','max','min','std','count','sum']) #agg
.fillna(1)
)
print (dfgroupe)
nb_hours
mean max min std count sum
engine begin
a January 4.500000 5 4 0.707107 2 9
b January 3.333333 6 2 2.309401 3 10
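The grouping key m.ne(m.shift()).cumsum() used above is a common idiom for labelling runs of consecutive equal values; a tiny standalone illustration (toy series, not the question's data):
import pandas as pd

s = pd.Series([1, 1, 0, 1, 1, 1, 0, 0, 1])
m = s.eq(1)
run_id = m.ne(m.shift()).cumsum()   # increases by 1 every time the value changes
print(run_id.tolist())              # [1, 1, 2, 3, 3, 3, 4, 4, 5]
print(s.groupby(run_id).agg(['first', 'size']))   # one row per consecutive run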
I am assuming the following terminology:
- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- 20 > energy ---> df['inter'] == 0, stop mode.
I reckon you want to find those periods of time in which:
1) The machine is operating in intermediate mode.
2) You don't want to count if the status is changing from intermediate to stop mode or from stop to intermediate mode.
# df['before']: this is to compare each row of df['inter'] with the previous row
# df['after']: this is to compare each row of df['inter'] with the next row
# df['target'] == 1 is when both above mentioned conditions (conditions 1 and 2) are met.
# Next we mask the original df and keep those rows where conditions 1 and 2 are met, then we group by machine and month, and obtain the min, max, mean, and so on.
df['before'] = df['inter'].shift(periods=1, fill_value=0)
df['after'] = df['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
df['month'] = df.index.month  # 'date' was set as the index earlier in the question's code
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])

Get the daily percentages of values that fall within certain ranges

I have a large dataset of test results where I have columns representing the date a test was completed and the number of hours it took to complete the test, i.e.
df = pd.DataFrame({'Completed':['21/03/2020','22/03/2020','21/03/2020','24/03/2020','24/03/2020',], 'Hours_taken':[23,32,8,73,41]})
I have a month's worth of test data and the tests can take anywhere from a couple of hours to a couple of days. I want to work out, for each day, what percentage of tests took within 24hrs/48hrs/72hrs etc. to complete, up to the percentage of tests that took longer than a week.
I've been able to work it out generally without taking the dates into account like so:
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48)
Lab_tests['GreaterThanWeek'] = Lab_tests['hours'] >168
one = Lab_tests['one-day'].value_counts().loc[True]
two = Lab_tests['two-day'].value_counts().loc[True]
eight = Lab_tests['GreaterThanWeek'].value_counts().loc[True]
print(one/10407 * 100)
print(two/10407 * 100)
print(eight/10407 * 100)
Ideally I'd like to represent the percentages in another dataset where the rows represent the dates and the columns represent the data ranges. But I can't work out how to take what I've done and modify it to get these percentages for each date. Is this possible to do in pandas?
This question, Counting qualitative values based on the date range in Pandas is quite similar but the fact that I'm counting the occurrences in specified ranges is throwing me off and I haven't been able to get a solution out of it.
Bonus Question
I'm sure you've noticed my current code is not the most elegant thing in the world; is there a cleaner way to do what I've done above, as I'm doing that for every range that I want?
Edit:
So the Output for the sample data given would look like so:
df = pd.DataFrame({'1-day':[100,0,0,0], '2-day':[0,100,0,50],'3-day':[0,0,0,0],'4-day':[0,0,0,50]},index=['21/03/2020','22/03/2020','23/03/2020','24/03/2020'])
You're almost there. You just need to do a few final steps:
First, cast your bools to ints, so that you can sum them.
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
Completed hours one-day two-day GreaterThanWeek
0 21/03/2020 23 1 0 0
1 22/03/2020 32 0 1 0
2 21/03/2020 8 1 0 0
3 24/03/2020 73 0 0 0
4 24/03/2020 41 0 1 0
Then, drop the hours column and roll the rest up to the level of Completed:
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
one-day two-day GreaterThanWeek
Completed
21/03/2020 2 0 0
22/03/2020 0 1 0
24/03/2020 0 1 0
EDIT: To get to percent, you just need to divide each column by the sum of all three. You can sum columns by defining the axis of the sum:
...
daily_totals = Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
daily_totals.sum(axis=1)
Completed
21/03/2020 2
22/03/2020 1
24/03/2020 1
dtype: int64
Then divide the daily totals dataframe by the column-wise sum of the daily totals (again, we use axis to define whether each value of the series will be the divisor for a row or a column.):
daily_totals.div(daily_totals.sum(axis=1), axis=0)
one-day two-day GreaterThanWeek
Completed
21/03/2020 1.0 0.0 0.0
22/03/2020 0.0 1.0 0.0
24/03/2020 0.0 1.0 0.0
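For the bonus question, a hedged sketch of a more general approach (assuming the real frame is Lab_tests with columns Completed and hours as in the code above; the bin edges and labels are illustrative): bin the hours once with pd.cut and let pd.crosstab compute the per-day percentages.
import numpy as np
import pandas as pd

bins = [0, 24, 48, 72, 96, 120, 144, 168, np.inf]
labels = ['one-day', 'two-day', 'three-day', 'four-day',
          'five-day', 'six-day', 'seven-day', 'GreaterThanWeek']

hour_range = pd.cut(Lab_tests['hours'], bins=bins, labels=labels)

# rows: completion dates, columns: ranges, values: percentage of that day's tests
percentages = pd.crosstab(Lab_tests['Completed'], hour_range, normalize='index') * 100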

Inserting missing numbers in dataframe

I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes, it skips a second or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the amount of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" datafiles, the amount of data points becomes less. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method, which adds data to the dataframe. I thought about it, but don't know how to implement it. I thought of it as follows:
If the index is not equal to the time, then we need to add a number, at time = index. If this gap is only 1 value, then the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with a complete time column and then interpolate:
# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
First you can set the second values to actual time values as such:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time
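Putting the three steps together on the small training set from the question (a sketch; it assumes the time column holds seconds since the start of the recording):
import pandas as pd

df = pd.DataFrame({'time': [0, 1, 2, 4, 100],
                   'temp': [20.10, 20.20, 20.20, 20.10, 22.30]})

df.index = pd.to_datetime(df['time'], unit='s')
filled = df['temp'].resample('s').interpolate('time')   # one value per second, gaps filled linearly

# optional smoothing (needs scipy for the window); the edges lose a couple of points
smoothed = filled.rolling(5, center=True, win_type='hann').mean()
print(filled.head(6))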

Avoid looping to calculate simple moving average crossing-derived signals

I would like to calculate buy and sell signals for stocks based on simple moving average (SMA) crossing. A buy signal should be given as soon as the SMA_short is higher than the SMA_long (i.e., SMA_difference > 0). To avoid the position being sold too quickly, I would like a sell signal only once the SMA_short has moved considerably beyond the cross (i.e., SMA_difference < -1), and, importantly, even if this takes longer than one day.
I managed, with some help, to implement it (see below):
Buy and sell signals are indicated by in and out.
Column Position first takes the buy_limit into account.
In Position_extended an in is then set for all the cases where the SMA_short has just crossed back through the SMA_long (SMA_short <= SMA_long) but SMA_difference > -1. For this it takes the Position_extended of row i-1 into account, in case the crossing was more than one day ago but SMA_difference has remained in the range 0 >= SMA_difference > -1.
Python code
import pandas as pd
import numpy as np
index = pd.date_range('20180101', periods=6)
df = pd.DataFrame(index=index)
df["SMA_short"] = [9,10,11,10,10,9]
df["SMA_long"] = 10
df["SMA_difference"] = df["SMA_short"] - df["SMA_long"]
buy_limit = 0
sell_limit = -1
df["Position"] = np.where((df["SMA_difference"] > buy_limit),"in","out")
df["Position_extended"] = df["Position"]
for i in range(1, len(df)):
    df.loc[index[i], "Position_extended"] = \
        np.where((df.loc[index[i], "SMA_difference"] > sell_limit)
                 & (df.loc[index[i-1], "Position_extended"] == "in"),
                 "in", df.loc[index[i], 'Position'])
print(df)
The result is:
SMA_short SMA_long SMA_difference Position Position_extended
2018-01-01 9 10 -1 out out
2018-01-02 10 10 0 out out
2018-01-03 11 10 1 in in
2018-01-04 10 10 0 out in
2018-01-05 10 10 0 out in
2018-01-06 9 10 -1 out out
The code works, however, it makes use of a for loop, which slows down the script considerably and becomes inapplicable in the larger context of this analysis. As SMA crossing is such a highly used tool, I was wondering whether somebody could see a more elegant and faster solution for this.
Essentially you are trying to get rid of the ambivalent zero entries by propagating the last non-zero value, similar to a zero-order hold. You can do so by first replacing the zero values with NaNs and then forward filling over them using ffill.
import pandas as pd
import numpy as np
index = pd.date_range('20180101', periods=6)
df = pd.DataFrame(index=index)
df["SMA_short"] = [9,10,11,10,10,9]
df["SMA_long"] = 10
df["SMA_difference"] = df["SMA_short"] - df["SMA_long"]
buy_limit = 0
sell_limit = -1
df["ZOH"] = df["SMA_difference"].replace(0,np.nan).ffill()
df["Position"] = np.where((df["ZOH"] > buy_limit),"in","out")
print(df)
results in:
SMA_short SMA_long SMA_difference ZOH Position
2018-01-01 9 10 -1 -1.0 out
2018-01-02 10 10 0 -1.0 out
2018-01-03 11 10 1 1.0 in
2018-01-04 10 10 0 1.0 in
2018-01-05 10 10 0 1.0 in
2018-01-06 9 10 -1 -1.0 out
If row T requires as input a value calculated in row T-1, then you'll probably want to do an iterative calculation. Typically backtesting is done by iterating through price data in sequence. You can calculate some signals just based on the state of the market, but you won't know the portfolio value, the pnl, or the portfolio positions unless you start at the beginning and work your way forward in time. That's why if you look at a site like Quantopian, the backtests always run from start date to end date.
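For completeness, a sketch of what such an iterative, stateful version could look like for this particular rule (it reproduces the Position_extended column above; buy_limit and sell_limit as defined in the question):
import numpy as np

diff = df["SMA_difference"].to_numpy()
position = np.empty(len(diff), dtype=object)

state = "out"
for i, d in enumerate(diff):
    if d > buy_limit:         # SMA_short clearly above SMA_long: enter
        state = "in"
    elif d <= sell_limit:     # moved well below the cross: exit
        state = "out"
    # otherwise keep the previous state (the hysteresis band)
    position[i] = state

df["Position_loop"] = position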
