I have been trying a crossover MA strategy using bt library. The strategy: 65(SPY)/35(AGG) portfolio as a default. IF the 50 day moving average for SPY crosses above the 200 day moving average, the investor would go 10% overweight SPY, so the portfolio would change to 75/25. If the SPY 50 day moving average crosses below the 200, the portfolio would change to 55/45. I believe the problem lies within the bt.algos.weighspecified. I think I can only put one numerical value for the following securities. Is it even possible to do alternating weights? Thanks
I have the following code:
spy_agg_data_MA
spy_agg_data_MA['mean_close_50'] = spy_agg_data_MA['SPY'].rolling(50).mean()
spy_agg_data_MA['mean_close_200'] = spy_agg_data_MA['SPY'].rolling(200).mean()
spy_agg_data_MA.dropna(inplace = True)
spy_agg_data_MA # Dataset that contains SPY, AGG, 50 MA, 200MA
spy_agg_data_MA['SPY_weights'] = np.where(spy_agg_data_MA['mean_close_50']>spy_agg_data_MA['mean_close_200'], .75,.65)
spy_agg_data_MA['AGG_weights'] = np.where(spy_agg_data_MA['mean_close_50']>spy_agg_data_MA['mean_close_200'], .25,.35)
spy_agg_data_MA #Now, the dataset has two more columns; spy weight and agg weight depending on MAs
SPY_weights = spy_agg_data_MA['SPY_weights']
AGG_weights = spy_agg_data_MA['AGG_weights']
target_weights = pd.concat([SPY_weights,AGG_weights], axis = 1)
target_weights #This might be redundant, but I put the two columns into another dataset
name = 'SPY_AGG_weightrebal'
strategy_2 = bt.Strategy(
name,
[
bt.algos.RunWeekly(),
bt.algos.SelectAll(),
bt.algos.WeighSpecified(SPY = target_weights['SPY_weights'], AGG = target_weights['AGG_weights']),
bt.algos.Rebalance()
]
)
backtest_2 = bt.Backtest(strategy_2, spy_agg_data_MA[['SPY','AGG']])
res = bt.run(backtest_2)```
The error: cannot convert the series to <class 'float'>
Related
I'm building a model using the open-source pvlib software (and CEC Modules) to estimate yearly photovoltaic energy output. I have encountered some inconsistencies in the model and would appreciate any troubleshooting the community can offer.
My main problem is this: the model tells me that the ideal Northern hemisphere surface_azimuth for energy production (ie. surface_azimuth with the highest energy output) is around 76° (just North of due-East) while the worst surface_azimuth for energy production is around 270° (due West). However, I understand that the ideal surface_azimuth in the Northern hemisphere should be about180° (due South) with the worst surface_azimuth at 0° (due North).
I've included this graph to help visualize the variation in energy production based on surface_azimuth
This is also generated at the end of the code attached.
Can anyone help me rectify this issue or correct my understanding?
Code copied below for reference
import os
import pandas as pd
import numpy as np
import os
import os.path
import matplotlib.pyplot as plt
import pvlib
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim
from IPython.display import Image
## GET THE LATITUDE & LONGITUDE OF A GIVEN CITY
geolocator = Nominatim(user_agent="____'s app")
geo = geolocator.geocode("Berkeley")
## CHECK THAT CITY IS CORRECT (by Country, State, etc.)
print(geo.address)
# CHECK THE LAT, LON order
print(geo.latitude, geo.longitude)
## SELECT THE YEAR & TIMES YOU'D LIKE TO MODEL OFF
YEAR = 2019
STARTDATE = '%d-01-01T00:00:00' % YEAR
ENDDATE = '%d-12-31T23:59:59' % YEAR
TIMES = pd.date_range(start=STARTDATE, end=ENDDATE, freq='H')
## ACCESS THE NREL API TO EXTRACT WEATHER DATA
NREL_API_KEY = os.getenv('NREL_API_KEY', 'DEMO_KEY')
## FILL IN THE BLANK WITH YOUR EMAIL ADRRESS
EMAIL = os.getenv('EMAIL', '_______.com')
##NEED TO COMMENT OUT THIS LINE BELOW -- if you call it too many times within an hour, it will break your code
header, data = pvlib.iotools.get_psm3(LATITUDE, LONGITUDE, NREL_API_KEY, EMAIL)
## SELECT THE PVLIB PANEL & INTERVTER YOU'D LIKE TO USE
## CAN ALSO CHOOSE FROM SANDIA LABS' DATASET OF PANELS & INVERTERS (check out the function)
## WE CHOSE THE CECMods because they highlighted the ones that were BIPV
INVERTERS = pvlib.pvsystem.retrieve_sam('CECInverter')
INVERTER_10K = INVERTERS['SMA_America__SB10000TL_US__240V_']
CECMODS = pvlib.pvsystem.retrieve_sam('CECMod')
## SELECT THE PANEL YOU'D LIKE TO USE (NOTE: THE PEVAFERSA MODEL IS A BIPV PANEL)
CECMOD_MONO = CECMODS['Pevafersa_America_IP_235_GG']
## CREATING AN ARRAY TO ITERATE THROUGH IN ORDER TO TEST DIFFERENT SURFACE_AZIMUTHS
heading_array = np.arange(0, 361, 2)
heading_array
heading_DF = pd.DataFrame(heading_array).rename(columns = {0: "Heading"})
heading_DF.head()
# geo IS AN OBJECT (the given city) CREATED ABOVE
LATITUDE, LONGITUDE = geo.latitude, geo.longitude
# data IS AN OBJECT (the weather patterns) CREATED ABOVE
# TIMES IS ALSO CREATED ABOVE, AND REPRESENTS TIME
data.index = TIMES
dni = data.DNI.values
ghi = data.GHI.values
dhi = data.DHI.values
surface_albedo = data['Surface Albedo'].values
temp_air = data.Temperature.values
dni_extra = pvlib.irradiance.get_extra_radiation(TIMES).values
# GET SOLAR POSITION
sp = pvlib.solarposition.get_solarposition(TIMES, LATITUDE, LONGITUDE)
solar_zenith = sp.apparent_zenith.values
solar_azimuth = sp.azimuth.values
# CREATING THE ARRY TO STORE THE DAILY ENERGY OUTPUT BY SOLAR AZIMUTH
e_by_az = []
# IDEAL surface_tilt ANGLE IN NORTHERN HEMISPHERE IS ~25
surface_tilt = 25
# ITERATING THROUGH DIFFERENT SURFACE_AZIMUTH VALUES
for heading in heading_DF["Heading"]:
surface_azimuth = heading
poa_sky_diffuse = pvlib.irradiance.get_sky_diffuse(
surface_tilt, surface_azimuth, solar_zenith, solar_azimuth,
dni, ghi, dhi, dni_extra=dni_extra, model='haydavies')
# calculate the angle of incidence using the surface_azimuth and (hardcoded) surface_tilt
aoi = pvlib.irradiance.aoi(
surface_tilt, surface_azimuth, solar_zenith, solar_azimuth)
# https://pvlib-python.readthedocs.io/en/stable/generated/pvlib.irradiance.aoi.html
# https://pvlib-python.readthedocs.io/en/stable/generated/pvlib.pvsystem.PVSystem.html
poa_ground_diffuse = pvlib.irradiance.get_ground_diffuse(
surface_tilt, ghi, albedo=surface_albedo)
poa = pvlib.irradiance.poa_components(
aoi, dni, poa_sky_diffuse, poa_ground_diffuse)
poa_direct = poa['poa_direct']
poa_diffuse = poa['poa_diffuse']
poa_global = poa['poa_global']
iam = pvlib.iam.ashrae(aoi)
effective_irradiance = poa_direct*iam + poa_diffuse
temp_cell = pvlib.temperature.pvsyst_cell(poa_global, temp_air)
# THIS IS THE MAGIC
cecparams = pvlib.pvsystem.calcparams_cec(
effective_irradiance, temp_cell,
CECMOD_MONO.alpha_sc, CECMOD_MONO.a_ref,
CECMOD_MONO.I_L_ref, CECMOD_MONO.I_o_ref,
CECMOD_MONO.R_sh_ref, CECMOD_MONO.R_s, CECMOD_MONO.Adjust)
# mpp is the list of energy output by hour for the whole year using a single panel
mpp = pvlib.pvsystem.max_power_point(*cecparams, method='newton')
mpp = pd.DataFrame(mpp, index=TIMES)
first48 = mpp[:48]
Edaily = mpp.p_mp.resample('D').sum()
# Edaily is the list of energy output by day for the whole year using a single panel
Eyearly = sum(Edaily)
e_by_az.append(Eyearly)
## LINKING THE Heading (ie. surface_azimuth) AND THE Eyearly (ie. yearly energy output) IN A DF
heading_DF["Eyearly"] = e_by_az
heading_DF.head()
## VISUALIZE ENERGY OUTPUT BY SURFACE_AZIMUTH
plt.plot(heading_DF["Heading"], heading_DF["Eyearly"])
plt.xlabel("Surface_Azimuth Angle")
plt.ylabel("Yearly Energy Output with tilt # " + str(surface_tilt))
plt.title("Yearly Energy Output by Solar_Azimuth Angle using surface_tilt = " + str(surface_tilt) + " in Berkeley, CA");
# FIND SURFACE_AZIMUTH THAT YIELDS THE MAX ENERGY OUTPUT
heading_DF[heading_DF["Eyearly"] == max(heading_DF["Eyearly"])]
# FIND SURFACE_AZIMUTH THAT YIELDS THE MIN ENERGY OUTPUT
heading_DF[heading_DF["Eyearly"] == min(heading_DF["Eyearly"])]
Thanks to kevinanderso#gmail.com for helping me out in the pvlib-python Google Group. He pointed out that my "TIMES variable is not timezone-aware and so the solar position calculation is assuming UTC".
To fix that, he suggested that I initialize TIMES with tz='Etc/GMT+8' (ie. PST in the US).
In his words, "the code [I] originally posted is modeling a hypothetical system where solar position and irradiance are timeshifted with respect to each other. That's a big departure from reality, and so real-life expectations don't apply to your model".
Thanks to Kevin and hopefully this helps anyone else having a similar issue.
I've tried several ways to get the same RSI like Tradingview. The funny thing is, that my own calculated RSI matches e.g. the Bitcoin related RSI's perfectly. But when i try to calculate the RSI for altcoins, it's different. I have tried different EMA/RMAs, Excel recreation and of course python. Even: XRSIs (eg: RSI = 0,6 RSI-XRP + 0,4 RSI-BTC), but never got the same result.
Does anyone know how Tradingview is calculating the AltCoin RSIs?
Thank you in advance,
Best regards,
Domi
The calculation of the RSI should be the same for any kind of data in Tradingview. In Pinescript the RSI can be calculated from scratch as follows:
pine_rsi(x, y) =>
u = max(x - x[1], 0) // upward change
d = max(x[1] - x, 0) // downward change
rs = rma(u, y) / rma(d, y)
res = 100 - 100 / (1 + rs)
res
If the results differ it might be due to rounding errors, another cause might the use of data different from the one provided by Tradingview.
They are using a smoothed RSI formula. I checked it against the one year chart which uses daily bars.
import yfinance as yf
import talib as ta
#get data
ticker = yf.Ticker("BTC-USD")
period = '10y'
interval = '1d'
data = ticker.history(interval=interval, period= period)
df = data .reset_index()
df = df.rename(columns={"index": "Date"})
df['RSI_Ta'] = ta.RSI(df['Close'], timeperiod=14)
df
It is the same as Yahoo data somehow. Which is strange because I buy and sell as a market maker for higher and lower prices most every day.
I am building a Monte Carlo simulation in order to study the behaviour of a set of 1000 iterations. Every simulation has an output graph given by a Pandas dataframe converted into a png by matplotlib.pyplot. Since I am not sure that every output is a Normal ditribution, even if a read an article about this and it secures every output is, I'd like to understand how to check it.
I've found something in this link but I didn't understand which one is the best and how to implement it.
Here's the code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
avg = 1
std_dev = .1
num_reps = 500
num_simulations = 1000
#generate a list of percentages that will replicate our historical normsal distribution
#two decimal places in order to make it very easy to see the boundaries
pct_to_target = np.random.normal(avg, std_dev, num_reps).round(2)
#input of historical datas
sales_target_values = [75_000, 100_000, 200_000, 300_000, 400_000, 500_000]
sales_target_prob = [.3, .3, .2, .1, .05, .05]
sales_target = np.random.choice(sales_target_values, num_reps, p=sales_target_prob)
#build up a pandas dataframe
df = pd.DataFrame(index=range(num_reps), data={'Pct_To_Target': pct_to_target,
'Sales_Target': sales_target})
df['Sales'] = df['Pct_To_Target'] * df['Sales_Target']
#Here is what our new dataframe looks like
print("how our dataframe looks like")
print(df)
#Return the commission rate based on the excell table
def calc_commission_rate(x):
if x <= .90:
return .02
if x <= .99:
return .03
else:
return .04
#create our commission rate and multiply it times sales
df['Commission_Rate'] = df['Pct_To_Target'].apply(calc_commission_rate)
df['Commission_Amount'] = df['Commission_Rate'] * df['Sales']
print(df)
# Define a list to keep all the results from each simulation that we want to analyze
all_stats = []
# Loop through many simulations
for i in range(num_simulations):
# Choose random inputs for the sales targets and percent to target
sales_target = np.random.choice(sales_target_values, num_reps, p=sales_target_prob)
pct_to_target = np.random.normal(avg, std_dev, num_reps).round(2)
# Build the dataframe based on the inputs and number of reps
df = pd.DataFrame(index=range(num_reps), data={'Pct_To_Target': pct_to_target,
'Sales_Target': sales_target})
# Back into the sales number using the percent to target rate
df['Sales'] = df['Pct_To_Target'] * df['Sales_Target']
# Determine the commissions rate and calculate it
df['Commission_Rate'] = df['Pct_To_Target'].apply(calc_commission_rate)
df['Commission_Amount'] = df['Commission_Rate'] * df['Sales']
#print(df)
# We want to track sales,commission amounts and sales targets over all the simulations
all_stats.append([df['Sales'].sum().round(0),
df['Commission_Amount'].sum().round(0),
df['Sales_Target'].sum().round(0)])
results_df = pd.DataFrame.from_records(all_stats, columns=['Sales',
'Commission_Amount',
'Sales_Target'])
results_df.describe().style.format('{:,}')
print(results_df)
results_df['Commission_Amount'].plot(kind='hist', title="Total Commission Amount")
plt.savefig('graph.png')
# results_df['Sales'].plot(kind='hist')
# plt.savefig('graph2.png')
print(results_df)
I'd like to add a function that checks if the output distribution is a Gaussian (normal) distribution , because I am not sure that it actually is at every running.
I have some skin temperature data (collected at 1Hz) which I intend to analyse.
However, the sensors were not always in contact with the skin. So I have a challenge of removing this non-skin temperature data, whilst preserving the actual skin temperature data. I have about 100 files to analyse, so I need to make this automated.
I'm aware that there is already this similar post, however I've not been able to use that to solve my problem.
My data roughly looks like this:
df =
timeStamp Temp
2018-05-04 10:08:00 28.63
. .
. .
2018-05-04 21:00:00 31.63
The first step I've taken is to simply apply a minimum threshold- this has got rid of the majority of the non-skin data. However, I'm left with the sharp jumps where the sensor was either removed or attached:
To remove these jumps, I was thinking about taking an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
e.g.
df_diff = df.diff(60) # period of about 60 makes jumps stick out
filter_index = np.nonzero((df.Temp <-1) | (df.Temp>0.5)) # when diff is less than -1 and greater than 0.5, most likely data jumps.
However, I find myself stuck here. The main problem is that:
1) I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
The more minor problem is that
2) I think I will still be left with some residual artefacts from the data jumps near the edges (e.g. where a tighter threshold would start to chuck away good data). Is there either a better filtering strategy or a way to then get rid of these artefacts?
*Edit as suggested I've also calculated the second order diff, but to be honest, I think the first order diff would allow for tighter thresholds (see below):
*Edit 2: Link to sample data
Try the code below (I used a tangent function to generate data). I used the second order difference idea from Mad Physicist in the comments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])
#the following line calculates the absolute value of a second order finite
#difference (derivative)
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()
df.loc[df[2] < .05][1].plot() #select out regions of a high rate-of-change
df[1].plot() #plot original data
plt.show()
Following is a zoom of the output showing what got filtered. Matplotlib plots a line from beginning to end of the removed data.
Your first question I believe is answered with the .loc selection above.
You second question will take some experimentation with your dataset. The code above only selects out high-derivative data. You'll also need your threshold selection to remove zeroes or the like. You can experiment with where to make the derivative selection. You can also plot a histogram of the derivative to give you a hint as to what to select out.
Also, higher order difference equations are possible to help with smoothing. This should help remove artifacts without having to trim around the cuts.
Edit:
A fourth-order finite difference can be applied using this:
df[2] = (df[1].diff(periods=1)-df[1].diff(periods=-1))*8/12 - \
(df[1].diff(periods=2)-df[1].diff(periods=-2))*1/12
df[2] = df[2].abs()
It's reasonable to think that it may help. The coefficients above can be worked out or derived from the following link for higher orders.
Finite Difference Coefficients Calculator
Note: The above second and fourth order central difference equations are not proper first derivatives. One must divide by the interval length (in this case 0.005) to get the actual derivative.
Here's a suggestion that targets your issues regarding
[...]an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
[..]I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
using stats.zscore() and pandas.merge()
As it is, it will still have a minor issue with your concerns regarding
[...]left with some residual artefacts from the data jumps near the edges[...]
But we'll get to that later.
First, here's a snippet to produce a dataframe that shares some of the challenges with your dataset:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
base = 100
nsample = 50
sigma = 10
# Basic df with trend and sinus seasonality
trend1 = np.linspace(0,1, nsample)
y1 = np.sin(trend1)
dates = pd.date_range(pd.datetime(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
# Gaussian Noise with amplitude sigma
df['y2'] = sigma * np.random.normal(size=nsample)
df['y3'] = df['y2'] + base + (np.sin(trend1))
df['trend2'] = 1/(np.cos(trend1)/1.05)
df['y4'] = df['y3'] * df['trend2']
df=df['y4'].to_frame()
df.columns = ['Temp']
df['Temp'][20:31] = np.nan
# Insert spikes and missing values
df['Temp'][19] = df['Temp'][39]/4000
df['Temp'][31] = df['Temp'][15]/4000
return(df)
# Dataframe with random data
df_raw = sample()
df_raw.plot()
As you can see, there are two distinct spikes with missing numbers between them. And it's really the missing numbers that are causing the problems here if you prefer to isolate values where the differences are large. The first spike is not a problem since you'll find the difference between a very small number and a number that is more similar to the rest of the data:
But for the second spike, you're going to get the (nonexisting) difference between a very small number and a non-existing number, so that the extreme data-point you'll end up removing is the difference between your outlier and the next observation:
This is not a huge problem for one single observation. You could just fill it right back in there. But for larger data sets that would not be a very viable soution. Anyway, if you can manage without that particular value, the below code should solve your problem. You will also have a similar problem with your very first observation, but I think it would be far more trivial to decide whether or not to keep that one value.
The steps:
# 1. Get some info about the original data:
firstVal = df_raw[:1]
colName = df_raw.columns
# 2. Take the first difference and
df_diff = df_raw.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
level = 3
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df_raw.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)
# 7. Keep only the first column
df_out = df_out.ix[:,0].to_frame()
# 8. Fill missing values
df_complete = df_out.fillna(axis=0, method='ffill')
# 9. Replace first value
df_complete.iloc[0] = firstVal.iloc[0]
# 10. Reset column names
df_complete.columns = colName
# Result
df_complete.plot()
Here's the whole thing for an easy copy-paste:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
base = 100
nsample = 50
sigma = 10
# Basic df with trend and sinus seasonality
trend1 = np.linspace(0,1, nsample)
y1 = np.sin(trend1)
dates = pd.date_range(pd.datetime(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
# Gaussian Noise with amplitude sigma
df['y2'] = sigma * np.random.normal(size=nsample)
df['y3'] = df['y2'] + base + (np.sin(trend1))
df['trend2'] = 1/(np.cos(trend1)/1.05)
df['y4'] = df['y3'] * df['trend2']
df=df['y4'].to_frame()
df.columns = ['Temp']
df['Temp'][20:31] = np.nan
# Insert spikes and missing values
df['Temp'][19] = df['Temp'][39]/4000
df['Temp'][31] = df['Temp'][15]/4000
return(df)
# A function for removing outliers
def noSpikes(df, level, keepFirst):
# 1. Get some info about the original data:
firstVal = df[:1]
colName = df.columns
# 2. Take the first difference and
df_diff = df.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df_raw.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)
# 7. Keep only the first column
df_out = df_out.ix[:,0].to_frame()
# 8. Fill missing values
df_complete = df_out.fillna(axis=0, method='ffill')
# 9. Reset column names
df_complete.columns = colName
# Keep the first value
if keepFirst:
df_complete.iloc[0] = firstVal.iloc[0]
return(df_complete)
# Dataframe with random data
df_raw = sample()
df_raw.plot()
# Remove outliers
df_cleaned = noSpikes(df=df_raw, level = 3, keepFirst = True)
df_cleaned.plot()
I am trying to construct trend following momentum portfolio strategy based on S&P500 index (momthly data)
I used Kaufmann's fractal efficiency ratio to filter out whipsaw signal
(http://etfhq.com/blog/2011/02/07/kaufmans-efficiency-ratio/)
I succeeded in coding, but it's very clumsy, so I need advice for better code.
Strategy
Get data of S&P 500 index from yahoo finance
Calculate Kaufmann's efficiency ratio on lookback period X (1 , if close > close(n), 0)
Averages calculated value of 2, from 1 to 12 time period ---> Monthly asset allocation ratio, 1-asset allocation ratio = cash (3% per year)
I am having a difficulty in averaging 1 to 12 efficiency ratio. Of course I know that it can be simply implemented by for loop and it's very easy task, but I failed.
I need more concise and refined code, anybody can help me?
a['meanfractal'] bothers me in the code below..
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pandas_datareader.data as web
def price(stock, start):
price = web.DataReader(name=stock, data_source='yahoo', start=start)['Adj Close']
return price.div(price.iat[0]).resample('M').last().to_frame('price')
a = price('SPY','2000-01-01')
def fractal(a,p):
a['direction'] = np.where(a['price'].diff(p)>0,1,0)
a['abs'] = a['price'].diff(p).abs()
a['volatility'] = a.price.diff().abs().rolling(p).sum()
a['fractal'] = a['abs'].values/a['volatility'].values*a['direction'].values
return a['fractal']
def meanfractal(a):
a['meanfractal']= (fractal(a,1).values+fractal(a,2).values+fractal(a,3).values+fractal(a,4).values+fractal(a,5).values+fractal(a,6).values+fractal(a,7).values+fractal(a,8).values+fractal(a,9).values+fractal(a,10).values+fractal(a,11).values+fractal(a,12).values)/12
a['portfolio1'] = (a.price/a.price.shift(1).values*a.meanfractal.shift(1).values+(1-a.meanfractal.shift(1).values)*1.03**(1/12)).cumprod()
a['portfolio2'] = ((a.price/a.price.shift(1).values*a.meanfractal.shift(1).values+1.03**(1/12))/(1+a.meanfractal.shift(1))).cumprod()
a=a.dropna()
a=a.div(a.ix[0])
return a[['price','portfolio1','portfolio2']].plot()
print(a)
plt.show()
You could simplify further by storing the values corresponding to p in a DF rather than computing for each series separately as shown:
def fractal(a, p):
df = pd.DataFrame()
for count in range(1,p+1):
a['direction'] = np.where(a['price'].diff(count)>0,1,0)
a['abs'] = a['price'].diff(count).abs()
a['volatility'] = a.price.diff().abs().rolling(count).sum()
a['fractal'] = a['abs']/a['volatility']*a['direction']
df = pd.concat([df, a['fractal']], axis=1)
return df
Then, you could assign the repeating operations to a variable which reduces the re-computation time.
def meanfractal(a, l=12):
a['meanfractal']= pd.DataFrame(fractal(a, l)).sum(1,skipna=False)/l
mean_shift = a['meanfractal'].shift(1)
price_shift = a['price'].shift(1)
factor = 1.03**(1/l)
a['portfolio1'] = (a['price']/price_shift*mean_shift+(1-mean_shift)*factor).cumprod()
a['portfolio2'] = ((a['price']/price_shift*mean_shift+factor)/(1+mean_shift)).cumprod()
a.dropna(inplace=True)
a = a.div(a.ix[0])
return a[['price','portfolio1','portfolio2']].plot()
Resulting plot obtained:
meanfractal(a)
Note: If speed is not a major concern, you could perform the operations via the built-in methods present in pandas instead of converting them into it's corresponding numpy array values.