Calculating WAIC for models with multiple likelihood functions with pymc3 - python

I am trying to predict the outcome of soccer games based on the number of goals scored, using the following model:
with pm.Model() as model:
    # global model parameters
    h = pm.Normal('h', mu = mu, tau = tau)
    sd_a = pm.Gamma('sd_a', .1, .1)
    sd_d = pm.Gamma('sd_d', .1, .1)
    alpha = pm.Normal('alpha', mu=mu, tau = tau)
    # team-specific model parameters
    a_s = pm.Normal("a_s", mu=0, sd=sd_a, shape=n)
    d_s = pm.Normal("d_s", mu=0, sd=sd_d, shape=n)
    atts = pm.Deterministic('atts', a_s - tt.mean(a_s))
    defs = pm.Deterministic('defs', d_s - tt.mean(d_s))
    h_theta = tt.exp(alpha + h + atts[h_t] + defs[a_t])
    a_theta = tt.exp(alpha + atts[a_t] + defs[h_t])
    # likelihood of observed data
    h_goals = pm.Poisson('h_goals', mu=h_theta, observed=observed_h_goals)
    a_goals = pm.Poisson('a_goals', mu=a_theta, observed=observed_a_goals)
When I sample the model, the trace plots look fine.
Afterward when I want to calculate the WAIC:
waic = pm.waic(trace, model)
I get the following error:
----> 1 waic = pm.waic(trace, model)
~\Anaconda3\envs\env\lib\site-packages\pymc3\stats\__init__.py in wrapped(*args, **kwargs)
22 )
23 kwargs[new] = kwargs.pop(old)
---> 24 return func(*args, **kwargs)
25
26 return wrapped
~\Anaconda3\envs\env\lib\site-packages\arviz\stats\stats.py in waic(data, pointwise, scale)
1176 """
1177 inference_data = convert_to_inference_data(data)
-> 1178 log_likelihood = _get_log_likelihood(inference_data)
1179 scale = rcParams["stats.ic_scale"] if scale is None else scale.lower()
1180
~\Anaconda3\envs\env\lib\site-packages\arviz\stats\stats_utils.py in get_log_likelihood(idata, var_name)
403 var_names.remove("lp")
404 if len(var_names) > 1:
--> 405 raise TypeError(
406 "Found several log likelihood arrays {}, var_name cannot be None".format(var_names)
407 )
TypeError: Found several log likelihood arrays ['h_goals', 'a_goals'], var_name cannot be None
Is there any way to calculate WAIC and compare models in pymc3 when I have two likelihood functions (1: the goals scored by the home team, 2: the goals scored by the away team)?

It is possible, but it requires defining what you are interested in predicting: it can be the result of the match, or the number of goals scored by either team (not the aggregate; each match would then provide 2 results to predict).
A complete and detailed answer is available at PyMC discourse.
Here I transcribe, as a summary, the case where the quantity of interest is the result of the match. ArviZ will automatically retrieve 2 pointwise log likelihood arrays, which we have to combine somehow (e.g. add, concatenate, groupby...) to get a single array. The tricky part is knowing which operation corresponds to each quantity, which has to be assessed on a per-model basis. The code uses the observed variable names home_points and away_points; with the model in the question, the corresponding names are h_goals and a_goals, as listed in the error message. In this particular example, the predictive accuracy of a match result can be calculated in the following way:
dims = {
"home_points": ["match"],
"away_points": ["match"],
}
idata = az.from_pymc3(trace, dims=dims, model=model)
Setting the match dim is important to tell xarray how to align the pointwise log likelihood arrays; otherwise they would not be broadcast and aligned in the desired way.
idata.sample_stats["log_likelihood"] = (
idata.log_likelihood.home_points + idata.log_likelihood.away_points
)
az.waic(idata)
# Output
# Computed from 3000 by 60 log-likelihood matrix
#
#           Estimate     SE
# elpd_waic  -551.28  37.96
# p_waic       46.16      -
#
# There has been a warning during the calculation. Please check the results.
Note that ArviZ>=0.7.0 is required.
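To make the "add versus concatenate" choice concrete, here is a minimal NumPy-only sketch. It is an illustration under stated assumptions, not ArviZ's implementation: the log-likelihood values are synthetic stand-ins, the helper name waic_from_pointwise is hypothetical, and the variance normalization (ddof=1) follows the usual textbook definition and may differ from what ArviZ does internally.

```python
import numpy as np

def waic_from_pointwise(log_lik):
    """Hypothetical helper: WAIC pieces from a (n_samples, n_obs) log-likelihood matrix."""
    n_samples = log_lik.shape[0]
    # log pointwise predictive density, computed with a stable log-mean-exp
    max_ll = log_lik.max(axis=0)
    lppd_i = max_ll + np.log(np.exp(log_lik - max_ll).sum(axis=0)) - np.log(n_samples)
    # effective number of parameters: posterior variance of each pointwise term
    p_waic_i = log_lik.var(axis=0, ddof=1)
    return float((lppd_i - p_waic_i).sum()), float(p_waic_i.sum())

rng = np.random.default_rng(0)
n_samples, n_matches = 500, 60
# Synthetic stand-ins for the two pointwise log-likelihood arrays (assumption)
ll_home = rng.normal(-1.5, 0.1, size=(n_samples, n_matches))
ll_away = rng.normal(-1.2, 0.1, size=(n_samples, n_matches))

# Quantity of interest = match result: one observation per match, so add
ll_match = ll_home + ll_away
# Quantity of interest = goals of either team: two observations per match, so concatenate
ll_goals = np.concatenate([ll_home, ll_away], axis=1)

elpd_match, p_match = waic_from_pointwise(ll_match)
elpd_goals, p_goals = waic_from_pointwise(ll_goals)
print(ll_match.shape, elpd_match, p_match)
print(ll_goals.shape, elpd_goals, p_goals)
```

Adding keeps one column per match, while concatenating doubles the number of columns; the two choices answer different predictive questions, so their WAIC values are not directly comparable with each other.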

Related

Unable to fit a function onto a given set of data points in Python using the Scipy library

I have been trying to fit a function (given in the code under the name concave_func) to data points in Python, but have had very little to no success. The function has 7 parameters (C_1, C_2, alpha_one, alpha_two, I_x, nu_t, T_e) that I have to estimate, and I have only 6 data points. I have tried 2 methods to fit the curve and estimate the parameters:
1). scipy.optimize.minimize
2). scipy.optimize.curve_fit.
However, I'm not obtaining the desired results, i.e. the curve is not fitting the data points.
I have attached my code below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
frequency = np.array([22,45,150,408,1420,23000]) #x_values
b_temp = [2.55080863e+04, 4.90777800e+03, 2.28984753e+02, 2.10842949e+01, 3.58631166e+00, 5.68716056e-04] #y_values
#Defining the function that I want to fit
def concave_func(x, C_1, C_2, alpha_one, alpha_two, I_x, nu_t, T_e):
    one = x**(-alpha_one)
    two = (C_2/C_1)*(x**(-alpha_two))
    three = I_x*(x**-2.1)
    expo = np.exp(-1*((nu_t/x)**2.1))
    eqn_one = C_1*(one + two + three)*expo
    eqn_two = T_e*(1 - expo)
    return eqn_one + eqn_two
#Defining chi_square function
def chisq(params, xobs, yobs):
    ynew = concave_func(xobs, *params)
    #yerr = np.sum((ynew- yobs)**2)
    yerr = np.sum(((yobs- ynew)/ynew)**2)
    print(yerr)
    return yerr
result = minimize(chisq, [1,2,2,2,1,1e6,8000], args = (frequency,b_temp), method = 'Nelder-Mead', options = {'disp' : True, 'maxiter': 10000})
x = np.linspace(-300,24000,1000)
plt.yscale("log")
plt.xscale("log")
plt.plot(x,concave_func(x, *result.x))
print(result.x)
print(result)
plt.plot(frequency, b_temp, 'r*' )
plt.xlabel("log Frequency[MHz]")
plt.ylabel("log Temp[K]")
plt.title('log Temperature vs log Frequency')
plt.grid()
plt.savefig('the_plot_2060.png')
I have attached the plot that I obtained below.
The plot clearly does not fit the data, and something is definitely wrong. I would also like my parameters alpha_one and alpha_two to be constrained to lie between 2 and 3, and I do not want my parameter T_e to exceed 10,000. Any thoughts?
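The requested constraints can be expressed as box bounds. The following is only a sketch of that mechanism, not a claim that it produces a good fit: with 7 parameters and 6 points the problem is underdetermined. The log-residual cost, the helper name log_cost, the starting value nu_t = 100, the small positive lower bound on C_1 and the 1e12 fallback for non-finite costs are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

frequency = np.array([22, 45, 150, 408, 1420, 23000], dtype=float)
b_temp = np.array([2.55080863e+04, 4.90777800e+03, 2.28984753e+02,
                   2.10842949e+01, 3.58631166e+00, 5.68716056e-04])

def concave_func(x, C_1, C_2, alpha_one, alpha_two, I_x, nu_t, T_e):
    one = x ** (-alpha_one)
    two = (C_2 / C_1) * x ** (-alpha_two)
    three = I_x * x ** -2.1
    expo = np.exp(-((nu_t / x) ** 2.1))
    return C_1 * (one + two + three) * expo + T_e * (1 - expo)

def log_cost(params, xobs, yobs):
    # Squared residuals in log space, because the data span many decades (assumption)
    with np.errstate(all="ignore"):
        cost = np.sum((np.log(yobs) - np.log(concave_func(xobs, *params))) ** 2)
    # Fall back to a large finite value if the model is not positive and finite
    return float(cost) if np.isfinite(cost) else 1e12

# Order: C_1, C_2, alpha_one, alpha_two, I_x, nu_t, T_e
bounds = [(1e-6, None), (0, None), (2, 3), (2, 3), (0, None), (0, None), (0, 1e4)]
x0 = [1, 2, 2.5, 2.5, 1, 100, 8000]  # illustrative starting point, not tuned

result = minimize(log_cost, x0, args=(frequency, b_temp),
                  method="L-BFGS-B", bounds=bounds)
print(result.x)
print(result.fun)
```

L-BFGS-B keeps every iterate inside the bounds, so alpha_one and alpha_two stay in [2, 3] and T_e stays at or below 10,000. Note also that evaluating the fitted curve for plotting only makes sense at positive frequencies, since the model raises x to negative powers.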

Python weighted quantile as R wtd.quantile()

I want to port the R function Hmisc::wtd.quantile() to Python.
Here is the example in R:
I took the following two functions as a reference, and their logic seems to differ from R's:
# First function
def weighted_quantile(values, quantiles, sample_weight = None,
                      values_sorted = False, old_style = False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `array`
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), 'quantiles should be in [0, 1]'
    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]
    # weighted_quantiles = np.cumsum(sample_weight)
    # weighted_quantiles /= np.sum(sample_weight)
    weighted_quantiles = np.cumsum(sample_weight)/np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)
weighted_quantile(values = [0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
                  quantiles = np.arange(0, 1 + 1 / 5, 1 / 5),
                  sample_weight = [1,1,1,1,1])
>> array([0.2136325, 0.2136325, 0.4079128, 0.4890342, 0.5083345, 0.6197319])
# Second function
def weighted_percentile(data, weights, perc):
    """
    perc : percentile in [0-1]!
    """
    data = np.array(data)
    weights = np.array(weights)
    ix = np.argsort(data)
    data = data[ix] # sort data
    weights = weights[ix] # sort weights
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights) # 'like' a CDF function
    return np.interp(perc, cdf, data)
weighted_percentile([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319], [1,1,1,1,1], np.arange(0, 1 + 1 / 5, 1 / 5))
>> array([0.2136325 , 0.31077265, 0.4484735 , 0.49868435, 0.5640332 ,
          0.6197319 ])
Both results differ from R's. Any idea?
I am Python-illiterate, but from what I see and after some quick checks I can tell you the following.
Here you use uniform (sampling) weights, so you could also directly use the quantile() function. Not surprisingly, it gives the same results as wtd.quantile() with uniform weights:
x <- c(0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319)
n <- length(x)
x <- sort(x)
quantile(x, probs = seq(0,1,0.2))
# 0% 20% 40% 60% 80% 100%
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
The R quantile() function by default (type 7) gets the quantiles from interpolated order statistics, i.e. by determining the (possibly fractional) index i of the obs to use with i = 1 + q(n-1).
In your case:
1 + seq(0,1,0.2)*(n-1)
# 1.0 1.8 2.6 3.4 4.2 5.0
Of course since you have 5 values/obs and you want quintiles, the indices are not integers. But you know for example that the first quintile (i = 1.8) lies between obs 1 and obs 2. More precisely, it is a linear combination of the two observations (the 'weights' are derived from the fractional part of the index):
0.2*x[1] + 0.8*x[2]
# 0.3690567
You can do the same for all the quintiles, on the basis of the indices:
q <-
  c(min(x), ## 1.0: actually, the first obs
    0.2*x[1] + 0.8*x[2], ## 1.8: quintile lies between obs 1 and 2
    0.4*x[2] + 0.6*x[3], ## 2.6: quintile lies between obs 2 and 3
    0.6*x[3] + 0.4*x[4], ## 3.4: quintile lies between obs 3 and 4
    0.8*x[4] + 0.2*x[5], ## 4.2: quintile lies between obs 4 and 5
    max(x) ## 5.0: actually, the last obs
  )
q
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
You can see that you get exactly the output of quantile() and wtd.quantile().
If instead of 0.2*x[1] + 0.8*x[2] we consider the following:
0.5*x[1] + 0.5*x[2]
# 0.3107726
We get the output of your second Python function. It appears that your second function considers uniform 'weights' (obviously I am not talking about the sampling weights here) when combining the two observations. The issue (at least for the second Python function) seems to come from this. I know these are just insights, but I hope they will help.
EDIT: note that the difference between the two is not necessarily an 'issue' with the Python code. There are different quantile estimators (and their weighted versions), and the Python functions could simply rely on a different estimator than Hmisc::wtd.quantile(). I think that the latter uses the weighted version of the Harrell-Davis quantile estimator. If you really want to implement this one, you should check the source code of Hmisc::wtd.quantile() and try to 'directly' translate it into Python.
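The interpolation between neighbouring order statistics described above can be sketched in Python as follows. This is not a translation of the Hmisc source: the function name weighted_quantile_type7 is hypothetical, and it assumes that weights act as frequency counts, so that a weight of 3 behaves like repeating the observation three times. The unweighted rule it generalizes is the linear interpolation that np.quantile uses by default.

```python
import numpy as np

def weighted_quantile_type7(values, probs, weights=None):
    """Hypothetical sketch: linear interpolation between weighted order statistics."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    if weights is None:
        weights = np.ones_like(values)
    weights = np.asarray(weights, dtype=float)

    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_w = np.cumsum(weights)
    n = cum_w[-1]                      # total weight plays the role of the sample size

    pos = 1 + (n - 1) * probs          # fractional 1-based position
    low = np.maximum(np.floor(pos), 1)
    high = np.minimum(low + 1, n)
    frac = pos % 1

    # The observation at rank k is the first sorted value whose cumulative weight reaches k
    last = len(values) - 1
    idx_low = np.minimum(np.searchsorted(cum_w, low, side="left"), last)
    idx_high = np.minimum(np.searchsorted(cum_w, high, side="left"), last)
    return (1 - frac) * values[idx_low] + frac * values[idx_high]

x = [0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319]
probs = np.linspace(0, 1, 6)
print(weighted_quantile_type7(x, probs))                   # unit weights
print(weighted_quantile_type7(x, probs, [2, 1, 3, 1, 2]))  # frequency weights
```

With unit weights the result agrees with np.quantile(x, probs), and with integer weights it agrees with np.quantile applied to the repeated data, which is a convenient sanity check before comparing against R.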

Simulations Confidence Interval Not Equal to conf_int Results

Given this simulated data:
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.statespace.structural import UnobservedComponents
np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
We build an UnobservedComponents model with the first 70 points to run inferences on the last 30 points like so:
model = UnobservedComponents(y[:70], level='llevel', exog=X[:70])
f_model = model.fit()
forecaster = f_model.get_forecast(
steps=30,
exog=X[70:].reshape(-1, 1)
)
conf_int = forecaster.conf_int()
If we observe the mean for the 95% confidence interval, we get the following:
conf_int.mean(axis=0)
array([118.19789195, 122.14101161])
But when trying to get the same values through model simulations, we don't quite get the same results. Here's the script we run for the simulated boundaries:
sim_model = UnobservedComponents(np.zeros(30), level='llevel', exog=X[70:])
res = []
predicted_state = f_model.predicted_state[..., -1]
predicted_state_cov = f_model.predicted_state_cov[..., -1]
for i in range(1000):
    init_state = np.random.multivariate_normal(
        predicted_state,
        predicted_state_cov
    )
    sim = sim_model.simulate(
        f_model.params,
        30,
        initial_state=init_state)
    res.append(sim.mean())
Printing the lower 2.5 and upper 97.5 percentile we get:
np.percentile(res, [2.5, 97.5])
array([119.06735028, 121.26810407])
As we use model simulations to distinguish signal from noise in data, this difference ended up being big enough to lead to contradictory conclusions. If we make for instance:
y[70:] += 1
Then according to the first technique we conclude the new y carries no signal as its mean is lower than 122.14. But the same is not true if we use the second technique: as the upper boundary is 121.2, we conclude that there's signal.
What we are trying to understand now is whether this is expected. Shouldn't the lower and upper 95% confidence interval of both techniques be equal?
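One thing worth checking, offered as an assumption rather than a diagnosis of these statsmodels results: conf_int() returns bounds for each forecast step, so conf_int.mean(axis=0) is an average of per-step bounds, whereas the simulation takes percentiles of each path's mean. These two summaries generally differ, because averaging over 30 steps cancels part of the step-level noise. The sketch below uses purely synthetic paths (a random-walk level plus noise with made-up variances), not the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 5000, 30

# Synthetic local-level-like paths: random-walk level plus observation noise (assumed scales)
level = np.cumsum(rng.normal(0.0, 0.1, size=(n_paths, n_steps)), axis=1)
paths = 120 + level + rng.normal(0.0, 1.0, size=(n_paths, n_steps))

# (a) 95% bounds at each step, then averaged over the steps
lower_t, upper_t = np.percentile(paths, [2.5, 97.5], axis=0)
avg_of_bounds = (lower_t.mean(), upper_t.mean())

# (b) 95% bounds of the per-path mean
bounds_of_avg = np.percentile(paths.mean(axis=1), [2.5, 97.5])

print("mean of per-step bounds:", avg_of_bounds)
print("bounds of the path mean:", bounds_of_avg)
```

In this toy setup the interval in (a) is noticeably wider than the one in (b), which is the same direction as the gap reported above. Which summary is appropriate depends on whether the comparison is made per time step or on the average over the forecast window.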

Different constraints for fit parameter in lmfit model

I am trying to create a multiple Voigt/Gaussian/Lorentzian peak fit function with lmfit.
Therefore, I wrote the following function:
def apply_fit_mix_multy(data,modelPeak,peakPos,amplitud,**kwargs):
    peakPos=np.array(peakPos)
    Start=kwargs.get('Start',data[0,0])
    length_data=len(data)-1
    End=kwargs.get('End',data[length_data,0])
    StartPeak=kwargs.get('StartPeak',data[0,0])
    EndPeak=kwargs.get('EndPeak',data[length_data,0])
    BackFunc=kwargs.get('BackFunc',False)
    BackCut=kwargs.get('BackCut',False)
    dataN=data_intervall(data,Start,End)
    y=dataN[:, 1]
    x=dataN[:, 0]
    amplitud=amplitud
    center=peakPos
    mod = None
    for i in range(len(peakPos)):
        this_mod = make_model(i,amplitud,center,modelPeak)
        if mod is None:
            mod = this_mod
        else:
            mod = mod + this_mod
    bgy=[list() for f in range(len(x))]
    if(BackFunc==True):
        bg,bgx=BackFunc
        for i in range(len(x)):
            bgy[i]=bg.best_values.get('c')
    elif(BackCut!=False):
        slope,intercept=back_ground_cut(data,BackCut[0],BackCut[1])
        for i in range(len(x)):
            bgy[i]=slope*x[i]+intercept
    if(BackCut!=False):
        print('Background substraction model is used! (Sign=Sign-backgr(linear between two points))')
        y=y-bgy
        out = mod.fit(y, x=x)
    else:
        print('Combination model is used! (offset+Gauss/Lor/Voig)')
        offset=ConstantModel()
        mod=mod+offset
        out = mod.fit(y, x=x)#out is the fitted function
    area=[list() for f in range(len(peakPos))]
    comps=out.eval_components(x=x)
    if(BackCut!=False):
        for i in range(len(peakPos)):
            area[i]=simps(comps['peak'+str(i)+'_'],x=x,even='avg')-simps(bgy,x=x,even='avg')
        fit_dict={'signal':y, 'convol':out.best_fit,'x':x,'peak_area':area,'backgr':bgy,'comps':comps}
    else:
        for i in range(len(peakPos)):
            area[i]=simps(comps['peak'+str(i)+'_'],x=x,even='avg')
        fit_dict={'convol':out.best_fit,'x':x,'peak_area':area,'comps':comps} #comps is inf. of sperate peaks
    return fit_dict
The function reads in a data set, the modelPeak (e.g. GaussianModel), and an initial guess of peak positions and amplitudes (peakPos, amplitud).
In the first part I initialize the model of the peaks (how many peaks...):
for i in range(len(peakPos)):
    this_mod = make_model(i,amplitud,center,modelPeak)
    if mod is None:
        mod = this_mod
    else:
        mod = mod + this_mod
With the make_model function:
def make_model(num,amplitud,center,mod):
    pref = "peak{0}_".format(num)
    model = mod(prefix = pref)
    model.set_param_hint(pref+'amplitud', value=amplitud[num], min=0, max=5*amplitud[num])
    model.set_param_hint(pref+'center', value=center[num], min=center[num]-0.5, max=center[num]+0.5)
    if(num==0):
        model.set_param_hint(pref+'sigma', value=0.3, min=0.01, max=1)
    else:
        model.set_param_hint(pref+'sigma', value=0.3, min=0.01, max=1)
    #print('Jetzt',center[num],amplitud[num])
    return model
Here is my problem: if I want to fit e.g. 3 peaks, I want the sigma of the first peak to vary during the fit, while the sigmas of the other peaks depend on the sigma of the first peak.
Any idea?
Thanks,
maths
FYI, this is what a fit looks like (image not included).
If I understand your long question (it would be helpful to remove the extraneous stuff - and there is quite a lot of it), you want to create a Model with multiple peaks, allowing sigma from the 1st peak to vary freely, and constraining sigma for the other peaks to depend on this.
To do that, you can either use parameter hints (as you use in your make_model() function) or set expressions for the parameters after the Parameters object is created. For the first approach, something like this:
def make_model(num,amplitud,center,mod):
    pref = "peak{0}_".format(num)
    model = mod(prefix = pref)
    # lmfit's built-in peak models name this parameter 'amplitude'
    model.set_param_hint(pref+'amplitude', value=amplitud[num], min=0, max=5*amplitud[num])
    model.set_param_hint(pref+'center', value=center[num], min=center[num]-0.5, max=center[num]+0.5)
    if(num==0):
        model.set_param_hint(pref+'sigma', value=0.3, min=0.01, max=1)
    else:
        ## instead of
        # model.set_param_hint(pref+'sigma', value=0.3, min=0.01, max=1)
        ## set peakN_sigma == peak0_sigma
        model.set_param_hint(pref+'sigma', expr='peak0_sigma')
        ## or maybe set peakN_sigma == N * peak0_sigma (use this line instead of the one above)
        # model.set_param_hint(pref+'sigma', expr='%d*peak0_sigma' % num)
    return model
You could also make the full model (simplified somewhat from your code, but the same idea):
model = (VoigtModel(prefix='peak0_') + VoigtModel(prefix='peak1_') +
VoigtModel(prefix='peak2_') + LinearModel(prefix='const_'))
# create parameters with default values
params = model.make_params(peak0_amplitude=10, peak0_sigma=2, ....)
# set constraints for `sigma` params:
params['peak1_sigma'].expr = 'peak0_sigma'
params['peak2_sigma'].expr = 'peak0_sigma'
# similarly, set bounds as needed:
params['peak1_sigma'].min = 0
params['peak1_amplitude'].min = 0
Hope that helps...

Python: How to optimize function parameters?

Background:
I'd like to solve a wide array of optimization problems such as asset weights in a portfolio, and parameters in trading strategies where the variables are passed to functions containing a bunch of other variables as well.
Until now, I've been able to do these things easily in Excel using the Solver Add-In. But I think it would be much more efficient and even more widely applicable using Python. For the sake of clarity, I'm going to boil the question down to the essence of portfolio optimization.
My question (short version):
Here's a dataframe and a corresponding plot with asset returns.
Dataframe 1:
                A1      A2
2017-01-01  0.0075  0.0096
2017-01-02 -0.0075 -0.0033
.
.
2017-01-10  0.0027  0.0035
Plot 1 - Asset returns
Based on that, I would like to find the weights for the optimal portfolio with regards to risk / return (Sharpe ratio), represented by the green dot in the plot below (the red dot is the so-called minimum variance portfolio, and represents another optimization problem).
Plot 2 - Efficient frontier and optimal portfolios:
How can I do this with numpy or scipy?
The details:
The following code section contains the function returns() to build a dataframe with random returns for two assets, as well as a function pf_sharpe to calculate the Sharpe ratio of two given weights for a portfolio of the returns.
# imports
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
np.random.seed(1234)
# Reproducible data sample
def returns(rows, names):
    ''' Function to create data sample with random returns
    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets
    Example
    =======
    >>> returns(rows = 2, names = ['A', 'B'])
                     A       B
    2017-01-01  0.0027  0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars= names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
    df_temp = df_temp.set_index(rng)
    df_temp = df_temp / 10000
    return df_temp
# Sharpe ratio
def pf_sharpe(df, w1, w2):
    ''' Function to calculate risk / reward ratio
    based on a pandas dataframe with two return series
    Parameters
    ==========
    df : pandas dataframe
    w1 : portfolio weight for asset 1
    w2 : portfolio weight for asset 2
    '''
    weights = [w1,w2]
    # Calculate portfolio returns and volatility
    pf_returns = (np.sum(df.mean() * weights) * 252)
    pf_volatility = (np.sqrt(np.dot(np.asarray(weights).T, np.dot(df.cov() * 252, weights))))
    # Calculate sharpe ratio
    pf_sharpe = pf_returns / pf_volatility
    return pf_sharpe
# Make df with random returns and calculate
# sharpe ratio for a 80/20 split between assets
df_returns = returns(rows = 10, names = ['A1', 'A2'])
df_returns.plot(kind = 'bar')
sharpe = pf_sharpe(df = df_returns, w1 = 0.8, w2 = 0.2)
print(sharpe)
# Output:
# 5.09477512073
Now I'd like to find the portfolio weights that optimize the Sharpe ratio. I think you could express the optimization problem as follows:
maximize:
pf_sharpe()
by changing:
w1, w2
under the constraints:
0 < w1 < 1
0 < w2 < 1
w1 + w2 = 1
What I've tried so far:
I found a possible setup in the post Python Scipy Optimization.minimize using SLSQP showing maximized results. Below is what I have so far, and it addresses a central aspect of my question directly:
[...]where the variables are passed to functions containing a bunch of other variables as well.
As you can see, my initial challenge prevents me from even testing if my bounds and constraints will be accepted by the function optimize.minimize(). I haven't even bothered to take into consideration the fact that this is a maximization and not a minimization problem (hopefully amendable by changing the sign of the function).
Attempts:
# bounds
b = (0,1)
bnds = (b,b)
# constraints
def constraint1(w1,w2):
    return w1 - w2
cons = ({'type': 'eq', 'fun':constraint1})
# initial guess
x0 = [0.5, 0.5]
# Testing the initial guess
print(pf_sharpe(df = df_returns, weights = x0))
# Optimization attempts
attempt1 = optimize.minimize(pf_sharpe(), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
attempt2 = optimize.minimize(pf_sharpe(df = df_returns, weights), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
attempt3 = optimize.minimize(pf_sharpe(weights, df = df_returns), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
Results:
Attempt1 is closest to the scipy setup here, but understandably fails because neither df nor weights have been specified.
Attempt2 fails with SyntaxError: positional argument follows keyword argument
Attempt3 fails with NameError: name 'weights' is not defined
I was under the impression that df could freely be specified, and that x0 in optimize.minimize would be considered the variables to be tested as 'representatives' for the weights in the function specified by pf_sharpe().
As you surely understand, my transition from Excel to Python in this regard has not been the easiest, and there is plenty I don't understand here. Anyway, I'm hoping some of you may offer some suggestions or clarifications!
Thank you!
Appendix 1 - Simulation approach:
This particular portfolio optimization problem can easily be solved by simulating a bunch of portfolio weights. And I did exactly that to produce the portfolio plot above. Here's the whole function if anyone is interested:
# Portfolio simulation
def portfolioSim(df, simRuns):
    ''' Function to take a df with asset returns,
    runs a number of simulated portfolio weights,
    plots return and risk for those weights,
    and finds minimum risk portfolio
    and max risk / return portfolio
    Parameters
    ==========
    df : pandas dataframe with returns
    simRuns : number of simulations
    '''
    prets = []
    pvols = []
    pwgts = []
    names = list(df)
    for p in range (simRuns):
        # Assign random weights
        weights = np.random.random(len(list(df)))
        weights /= np.sum(weights)
        weights = np.asarray(weights)
        # Calculate risk and returns with random weights
        prets.append(np.sum(df.mean() * weights) * 252)
        pvols.append(np.sqrt(np.dot(weights.T, np.dot(df.cov() * 252, weights))))
        pwgts.append(weights)
    prets = np.array(prets)
    pvols = np.array(pvols)
    pwgts = np.array(pwgts)
    pshrp = prets / pvols
    # Store calculations in a df
    df1 = pd.DataFrame({'return':prets})
    df2 = pd.DataFrame({'risk':pvols})
    df3 = pd.DataFrame(pwgts)
    df3.columns = names
    df4 = pd.DataFrame({'sharpe':pshrp})
    df_temp = pd.concat([df1, df2, df3, df4], axis = 1)
    # Plot results
    plt.figure(figsize=(8, 4))
    plt.scatter(pvols, prets, c=prets / pvols, cmap = 'viridis', marker='o')
    # Min risk
    min_vol_port = df_temp.iloc[df_temp['risk'].idxmin()]
    plt.plot([min_vol_port['risk']], [min_vol_port['return']], marker='o', markersize=12, color="red")
    # Max sharpe
    max_sharpe_port = df_temp.iloc[df_temp['sharpe'].idxmax()]
    plt.plot([max_sharpe_port['risk']], [max_sharpe_port['return']], marker='o', markersize=12, color="green")
    # Return the max Sharpe portfolio so the result can be printed by the caller
    return max_sharpe_port
# Test run
portfolioSim(df = df_returns, simRuns = 250)
Appendix 2 - Excel Solver approach:
Here is how I would approach the problem using Excel Solver. Instead of linking to a file, I've only attached a screenshot and included the most important formulas in a code section. I'm guessing not many of you are going to be interested in reproducing this anyway, but I've included it just to show that it can be done quite easily in Excel.
Grey ranges represent formulas. Ranges that can be changed and used as arguments in the optimization problem are highlighted in yellow. The green range is the objective function.
Here's an image of the worksheet and Solver setup:
Excel formulas:
C3 =AVERAGE(C7:C16)
C4 =AVERAGE(D7:D16)
H4 =COVARIANCE.P(C7:C16;D7:D16)
G5 =COVARIANCE.P(C7:C16;D7:D16)
G10 =G8+G9
G13 =MMULT(TRANSPOSE(G8:G9);C3:C4)
G14 =SQRT(MMULT(TRANSPOSE(G8:G9);MMULT(G4:H5;G8:G9)))
H13 =G12/G13
H14 =G13*252
G16 =G13/G14
H16 =H13/H14
End notes:
As you can see from the screenshot, Excel Solver suggests a 47% / 53% split between A1 and A2 to obtain an optimal Sharpe ratio of 5.6. Running the Python function sr_opt = portfolioSim(df = df_returns, simRuns = 25000) yields a Sharpe ratio of 5.3 with corresponding weights of approximately 47% and 53% for A1 and A2:
print(sr_opt)
#Output
#return 0.361439
#risk 0.067851
#A1 0.465550
#A2 0.534450
#sharpe 5.326933
The method applied in Excel is GRG Nonlinear. I understand that changing the SLSQP argument to a non-linear method would get me somewhere, and I've looked into nonlinear solvers in scipy as well, but with little success.
And maybe Scipy even isn't the best option here?
Here is a more detailed answer. The first part of your code remains the same:
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
np.random.seed(1234)
# Reproducible data sample
def returns(rows, names):
    ''' Function to create data sample with random returns
    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets
    Example
    =======
    >>> returns(rows = 2, names = ['A', 'B'])
                     A       B
    2017-01-01  0.0027  0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars= names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
    df_temp = df_temp.set_index(rng)
    df_temp = df_temp / 10000
    return df_temp
The function pf_sharpe is modified, the 1st input is one of the weights, the parameter to be optimised. Instead of inputting constraint w1 + w2 = 1, we can define w2 as 1-w1 inside pf_sharpe, which is perfectly equivalent but simpler and faster. Also, minimize will attempt to minimize pf_sharpe, and you actually want to maximize it, so now the output of pf_sharpe is multiplied by -1.
# Sharpe ratio
def pf_sharpe(weight, df):
    ''' Function to calculate risk / reward ratio
    based on a pandas dataframe with two return series
    '''
    weights = [weight[0], 1-weight[0]]
    # Calculate portfolio returns and volatility
    pf_returns = (np.sum(df.mean() * weights) * 252)
    pf_volatility = (np.sqrt(np.dot(np.asarray(weights).T, np.dot(df.cov() * 252, weights))))
    # Calculate sharpe ratio
    pf_sharpe = pf_returns / pf_volatility
    return -pf_sharpe
# initial guess
x0 = [0.5]
df_returns = returns(rows = 10, names = ['A1', 'A2'])
# Optimization attempts
out = minimize(pf_sharpe, x0, method='SLSQP', bounds=[(0, 1)], args=(df_returns,))
optimal_weights = [out.x, 1-out.x]
print(optimal_weights)
print(-pf_sharpe(out.x, df_returns))
This returns an optimized Sharpe ratio of 6.16 (better than 5.3), with w1 practically 1 and w2 practically 0.
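Substituting w2 = 1 - w1 works for two assets. For more assets, the constraint from the question (weights summing to 1) can be passed to SLSQP directly. The following is a sketch under assumptions: the returns are synthetic normal draws for three assets, and the helper name neg_sharpe and the NumPy-only data handling are illustrative choices, not part of the code above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_sharpe(weights, mean_returns, cov, periods=252):
    """Negative annualized Sharpe ratio, so that minimizing it maximizes the ratio."""
    pf_return = np.dot(mean_returns, weights) * periods
    pf_volatility = np.sqrt(weights @ (cov * periods) @ weights)
    return -pf_return / pf_volatility

# Synthetic daily returns for three assets (assumption, for illustration only)
rng = np.random.default_rng(1234)
daily_returns = rng.normal(0.001, 0.01, size=(250, 3))
mean_returns = daily_returns.mean(axis=0)
cov = np.cov(daily_returns, rowvar=False)

n_assets = daily_returns.shape[1]
x0 = np.full(n_assets, 1.0 / n_assets)       # start from equal weights
bounds = [(0, 1)] * n_assets                 # 0 <= w_i <= 1
constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1}  # sum of weights equals 1

res = minimize(neg_sharpe, x0, args=(mean_returns, cov),
               method="SLSQP", bounds=bounds, constraints=constraints)
print(res.x)      # optimized weights
print(-res.fun)   # corresponding Sharpe ratio
```

The objective receives the whole weight vector as its first argument and everything else through args, which is the calling convention minimize expects; an equality constraint function must return 0 when the constraint is satisfied.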
