I am trying to implement a portfolio optimization that uses constraints to cap exposure to, e.g., a country, sector, or industry. In the code below I pass in an 'africa' vector that flags the stocks belonging to Africa, and in my constraints I then limit those stocks to no more than 40% of the total weight. The only way I managed to implement this is to apply sum_entries to the weights at the indexes where africa = 1. I also tried using a Parameter but didn't succeed. Hopefully there is a more elegant way to apply this kind of constraint; any suggestion is appreciated. Also, if anyone knows of an example that shows tracking error, turnover, or volatility constraints, those are the ones I am still struggling with.
import numpy as np
from cvxpy import *
np.random.seed(1)
n = 10 # number of assets
mu = np.abs(np.random.randn(n,1)) #mean
Sigma = np.random.randn(n,n)
Sigma = Sigma.T.dot(Sigma)
# Long only PFO Opt
w = Variable(n)
#africa = Parameter(10, sign='positive')
#africa.value = [1,1,1,0,0,0,0,0,0,0]
africa = [0,0,0,0,0,0,0,1,1,1]
gamma = Parameter(sign='positive')
ret = mu.T*w
risk = quad_form(w,Sigma)
filters = [i for i in range(len(africa)) if africa[i] == 1]
constraints = [sum_entries(w) == 1, w >= 0, w[1] >= 0.50, w[0] == 0, sum_entries(w[filters]) <= 0.4]
#prob = Problem(Maximize(ret - gamma*risk), [sum_entries(w) == 1, w >=0])
prob = Problem(Minimize(risk), constraints)
SAMPLE = 1000
risk_data = np.zeros(SAMPLE)
ret_data = np.zeros(SAMPLE)
gamma_vals = np.logspace(-2,3,num=SAMPLE)
for i in range(SAMPLE):
    gamma.value = gamma_vals[i]
    prob.solve()
    risk_data[i] = sqrt(risk).value
    ret_data[i] = ret.value
print(prob.status)
print(prob.value)
print('OPT WEIGHTS : ')
for i in range(n):
    print(round(w[i].value,3))
I think you may want to have a look at these examples. The developer has incorporated portfolio risk constraint as follows:
import cvxpy as cp
w = cp.Variable(n)
gamma = cp.Parameter(nonneg=True)
ret = mu.T @ w
risk = cp.quad_form(w, Sigma)
Lmax = cp.Parameter()
# Portfolio optimization with a leverage limit and a bound on risk.
prob = cp.Problem(cp.Maximize(ret),
[cp.sum(w) == 1,
cp.norm(w, 1) <= Lmax,
risk <= 2])
Here's the link to jupyter nbviewer
I'm using cvxpy to solve a portfolio optimization problem with a constraint on the maximum number of assets to hold.
To do that, I want to introduce new boolean variables 'yi' that are equal to 1 if asset i is included in the portfolio and 0 otherwise.
The sum of the 'yi' variables will be equal to 'k', which is the number of assets I want to hold.
import numpy as np
import pandas as pd
from cvxpy import *
# assets names
tickers = ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"]
# return matrix
ret = pd.DataFrame(np.random.rand(1,6), columns = tickers)
# Variance_Coviariance matrix
covm = pd.DataFrame(np.random.rand(6,6), columns = tickers, index = tickers)
# problem setting
x = Variable(len(tickers)) # xi variables
y = Variable(len(tickers), boolean = True) # yi variables
er = np.asarray(ret.T) * x # expected return
min_ret = 0.2 # minimum return
risk = quad_form(x, np.asmatrix(covm)) # risk
k = 3 #maximum number of assets to include
constraints = [sum(x) == 1, er >= min_ret, x >= 0, sum(y) == k] #constraints array
for i in range(len(tickers)):
    constraints.append(x[i] <= y[i]) # additional constraint for which each xi must be less or equal to each yi
objective = Minimize(risk) # set the objective function
prob = Problem(objective, constraints) # set problem
prob.solve() # solve problem
I get the following error:
Either candidate conic solvers (['GLPK_MI']) do not support the cones output by the problem (SOC, NonNeg, Zero), or there are not enough constraints in the problem.
I'm not sure what I did wrong.
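Two things stand out in the setup above. GLPK_MI is a mixed-integer linear solver, so it cannot handle the quadratic risk term (the SOC cone in the message); a problem with boolean variables and a quadratic objective needs a solver that supports mixed-integer quadratic or conic programs, for example SCIP, GUROBI or MOSEK if one of them is installed, passed via prob.solve(solver=...). Separately, a matrix of independent uniform draws is in general neither symmetric nor positive semidefinite, which a covariance matrix has to be. A minimal sketch of the second point, assuming only NumPy; the simulated returns are made-up data used purely to obtain a valid covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of matrix as covm in the question: independent uniform entries.
A = rng.random((6, 6))
print("random matrix symmetric:", np.allclose(A, A.T))

# A valid covariance matrix can be estimated from a (here simulated) return history.
returns = rng.normal(size=(250, 6))        # 250 observations of 6 assets (made-up data)
cov = np.cov(returns, rowvar=False)        # 6 x 6 sample covariance

min_eig = np.linalg.eigvalsh(cov).min()
print("covariance symmetric:", np.allclose(cov, cov.T))
print("smallest eigenvalue:", min_eig)
```

A sample covariance built this way is symmetric with non-negative eigenvalues, which is what quad_form expects.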
I am learning how to use Drake to solve optimization problems.
The problem is to find the optimal length and width of a fence whose perimeter must be less than or equal to 40. The code below only works when the perimeter constraint is an equality constraint. It should also work as an inequality constraint, but then my optimal solution comes out as x=[nan nan]. Does anyone know why this is the case?
from pydrake.solvers.mathematicalprogram import MathematicalProgram, Solve
import numpy as np
import matplotlib.pyplot as plt
prog = MathematicalProgram()
#add two decision variables
x = prog.NewContinuousVariables(2, "x")
#adds objective function where
#
# min 0.5 xt * Q * x + bt * x
#
# Q = [0,-1
# -1,0]
#
# bt = [0,
# 0]
#
Q = [[0,-1],[-1,0]]
b = [[0],[0]]
prog.AddQuadraticCost(Q , b, vars=[x[0],x[1]])
# Adds the linear constraints.
prog.AddLinearEqualityConstraint(2*x[0] + 2*x[1] == 40)
#prog.AddLinearConstraint(2*x[0] + 2*x[1] <= 40)
prog.AddLinearConstraint(0*x[0] + -1*x[1] <= 0)
prog.AddLinearConstraint(-1*x[0] + 0*x[1] <= 0)
# Solve the program.
result = Solve(prog)
print(f"optimal solution x: {result.GetSolution(x)}")
I get [nan, nan] for both inequality constraint and equality constraint.
As Russ mentioned, the problem is that the cost is non-convex, and Drake picked the wrong solver. For the moment, I would suggest explicitly designating a solver. You could do
from pydrake.solvers.ipopt_solver import IpoptSolver
from pydrake.solvers.mathematicalprogram import MathematicalProgram, Solve
import numpy as np
import matplotlib.pyplot as plt
prog = MathematicalProgram()
#add two decision variables
x = prog.NewContinuousVariables(2, "x")
#adds objective function where
#
# min 0.5 xt * Q * x + bt * x
#
# Q = [0,-1
# -1,0]
#
# bt = [0,
# 0]
#
Q = [[0,-1],[-1,0]]
b = [[0],[0]]
prog.AddQuadraticCost(Q , b, vars=[x[0],x[1]])
# Adds the linear constraints.
prog.AddLinearEqualityConstraint(2*x[0] + 2*x[1] == 40)
#prog.AddLinearConstraint(2*x[0] + 2*x[1] <= 40)
prog.AddLinearConstraint(0*x[0] + -1*x[1] <= 0)
prog.AddLinearConstraint(-1*x[0] + 0*x[1] <= 0)
# Solve the program.
solver = IpoptSolver()
result = solver.Solve(prog)
print(f"optimal solution x: {result.GetSolution(x)}")
I will work on a fix on the Drake side to make sure it picks the right solver when you have a non-convex quadratic cost.
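The non-convexity mentioned above can be seen without any solver. A small sketch in plain Python, using only the Q from the question: the quadratic form takes both signs, so Q is indefinite, and along the equality constraint 2*x0 + 2*x1 = 40 the cost reduces to -x0 * (20 - x0), whose minimum is the square fence. The grid scan is only an illustration, not what Drake does internally.

```python
# Quadratic cost from the question: 0.5 * x^T Q x with Q = [[0, -1], [-1, 0]],
# which expands to -x0 * x1 (the negated area of the fence).
Q = [[0, -1], [-1, 0]]

def cost(x0, x1):
    return 0.5 * (Q[0][0] * x0 * x0 + (Q[0][1] + Q[1][0]) * x0 * x1 + Q[1][1] * x1 * x1)

# The form is negative in one direction and positive in another,
# so Q is indefinite and the cost is not convex.
print(cost(1, 1))    # direction (1, 1)
print(cost(1, -1))   # direction (1, -1)

# On the equality constraint x1 = 20 - x0, scan x0 in steps of 0.5.
candidates = [i * 0.5 for i in range(41)]          # 0.0, 0.5, ..., 20.0
best_x0 = min(candidates, key=lambda x0: cost(x0, 20 - x0))
print(best_x0, 20 - best_x0, cost(best_x0, 20 - best_x0))
```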
With the dataframe underneath I want to optimize the total return, while certain bounds are satisfied.
d = {'Win':[0,0,1, 0, 0, 1, 0],'Men':[0,1,0, 1, 1, 0, 0], 'Women':[1,0,1, 0, 0, 1,1],'Matches' :[0,5,4, 7, 4, 10,13],
'Odds':[1.58,3.8,1.95, 1.95, 1.62, 1.8, 2.1], 'investment':[0,0,6, 10, 5, 25,0],}
data = pd.DataFrame(d)
I want to maximize the following equation:
totalreturn = np.sum(data['Odds'] * data['investment'] * (data['Win'] == 1))
The function should be maximized satisfying the following bounds:
for i in range(len(data)):
    investment = data['investment'][i]
    C = alpha0 + alpha1 * data['Men'][i] + alpha2 * data['Women'][i] + alpha3 * data['Matches'][i]
    if not ((lb < investment) and (investment < ub) and (investment > C)):
        data.loc[i, 'investment'] = 0
Here lb and ub are constant for every row in the dataframe. The threshold C, however, is different for every row. Thus there are 6 parameters to be optimized: lb, ub, alpha0, alpha1, alpha2, alpha3.
Can anyone tell me how to do this in Python? My attempts so far have used scipy (Approach 1) and Bayesian optimization (Approach 2), and in both only lb and ub are optimized.
Approach1:
import pandas as pd
from scipy.optimize import minimize
def objective(val, data):
    # Approach 1
    # Lowerbound and upperbound
    lb, ub = val
    # investments
    # These matches/bets are selected to put wager on
    tf1 = (data['investment'] > lb) & (data['investment'] < ub)
    data.loc[~tf1, 'investment'] = 0
    # Total investment
    totalinvestment = sum(data['investment'])
    # Good placed bets
    data['reward'] = data['Odds'] * data['investment'] * (data['Win'] == 1)
    totalreward = sum(data['reward'])
    # Return and cumalative return
    data['return'] = data['reward'] - data['investment']
    totalreturn = sum(data['return'])
    data['Cum return'] = data['return'].cumsum()
    # Return on investment
    print('\n',)
    print('lb, ub:', lb, ub)
    print('TotalReturn: ',totalreturn)
    print('TotalInvestment: ', totalinvestment)
    print('TotalReward: ', totalreward)
    print('# of bets', (data['investment'] != 0).sum())
    return totalreturn
# Bounds and contraints
b = (0,100)
bnds = (b,b,)
x0 = [0,100]
sol = minimize(objective, x0, args = (data,), method = 'Nelder-Mead', bounds = bnds)
and approach2:
import pandas as pd
import time
import pickle
from hyperopt import fmin, tpe, Trials
from hyperopt import STATUS_OK
from hyperopt import hp
def objective(args):
    # Approach2
    # Lowerbound and upperbound
    lb, ub = args
    # investments
    # These matches/bets are selected to put wager on
    tf1 = (data['investment'] > lb) & (data['investment'] < ub)
    data.loc[~tf1, 'investment'] = 0
    # Total investment
    totalinvestment = sum(data['investment'])
    # Good placed bets
    data['reward'] = data['Odds'] * data['investment'] * (data['Win'] == 1)
    totalreward = sum(data['reward'])
    # Return and cumalative return
    data['return'] = data['reward'] - data['investment']
    totalreturn = sum(data['return'])
    data['Cum return'] = data['return'].cumsum()
    # store results
    d = {'loss': - totalreturn, 'status': STATUS_OK, 'eval time': time.time(),
         'other stuff': {'type': None, 'value': [0, 1, 2]},
         'attachments': {'time_module': pickle.dumps(time.time)}}
    return d
trials = Trials()
parameter_space = [hp.uniform('lb', 0, 100), hp.uniform('ub', 0, 100)]
best = fmin(objective,
space= parameter_space,
algo=tpe.suggest,
max_evals=500,
trials = trials)
print('\n', trials.best_trial)
Does anyone know how I should proceed? Scipy doesn't produce the desired result. Hyperopt does produce the desired result. In either approach I don't know how to incorporate a bound that is row-dependent (C(i)).
Anything would help!
(Any relevant articles, exercises or helpful explanations about this sort of optimization are also more than welcome)
I assume here that you cannot go through the whole dataset, or that it is incomplete, or that you want to extrapolate, so that you cannot calculate all combinations.
If you have no prior, are uncertain about the smoothness, or the evaluations could be costly, I would use Bayesian optimization. You can control the exploration/exploitation trade-off and avoid getting stuck in a local minimum.
I would use scikit-optimize, which IMO implements Bayesian optimization better. It has better initialization techniques, like Sobol' sequences, which are implemented correctly there. This ensures that your search space will be properly sampled.
from skopt import gp_minimize
res = gp_minimize(objective, bnds, initial_point_generator='sobol')
I think your formulation needs one more variable, a binary one that defines whether the investment is set to 0 or keeps its initial value. Assuming this variable is stored in another column called 'new_binary', your objective function could be changed as follows:
totalreturn = np.sum(data['Odds'] * data['investment'] * data['new_binary'] * data['Win'])
Then the only thing missing is introducing the variable itself. No loop over the rows is needed, because the column operations already apply to every row:
C = alpha0 + alpha1 * data['Men'] + alpha2 * data['Women'] + alpha3 * data['Matches']
data['new_binary'] = (lb < data['investment']) & (data['investment'] < ub) & (data['investment'] > C)
# This gives a boolean column; in arithmetic Python treats True/False as 1/0.
The only problem I see now is that this becomes an integer problem, so I am not sure scipy.optimize.minimize would do. I am not sure what the best alternative is, but according to this, PuLP and Pyomo could work.
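To make the row-dependent threshold concrete, here is a minimal sketch in plain Python, using the data from the question copied into lists. It evaluates all six parameters with a keep/skip mask instead of overwriting the investments, so every evaluation sees the original data (the question's objective zeroes the column in place, which affects later evaluations, and Approach 1 returns the return itself to a minimizer rather than its negative). Because the objective is piecewise constant in the parameters, the sketch uses a plain random search; the sampling ranges and the number of samples are arbitrary assumptions.

```python
import random

# Data copied from the question's DataFrame, as plain lists.
win        = [0, 0, 1, 0, 0, 1, 0]
men        = [0, 1, 0, 1, 1, 0, 0]
women      = [1, 0, 1, 0, 0, 1, 1]
matches    = [0, 5, 4, 7, 4, 10, 13]
odds       = [1.58, 3.8, 1.95, 1.95, 1.62, 1.8, 2.1]
investment = [0, 0, 6, 10, 5, 25, 0]

def total_return(params):
    """Sum of (reward - stake) over the bets that pass all three bounds."""
    lb, ub, a0, a1, a2, a3 = params
    total = 0.0
    for i in range(len(investment)):
        c_i = a0 + a1 * men[i] + a2 * women[i] + a3 * matches[i]  # row threshold C(i)
        keep = lb < investment[i] < ub and investment[i] > c_i
        if keep:
            total += odds[i] * investment[i] * win[i] - investment[i]
    return total

# Start from "keep every bet" and try random parameter vectors.
keep_all = (0, 100, 0, 0, 0, 0)
rng = random.Random(0)
best_params, best_value = keep_all, total_return(keep_all)
for _ in range(5000):
    cand = (rng.uniform(0, 100), rng.uniform(0, 100),
            rng.uniform(-50, 50), rng.uniform(-50, 50),
            rng.uniform(-50, 50), rng.uniform(-5, 5))
    value = total_return(cand)
    if value > best_value:
        best_params, best_value = cand, value

print(best_params, best_value)
```

The same total_return function, negated, could be handed to hyperopt or gp_minimize with a six-dimensional search space.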
I'm trying to fit the SIR epidemic spread model to the current new-case data of individual countries. To do that I used the work here: https://github.com/epimath/param-estimation-SIR . The main idea is to fit the best possible SIR infected curve to the new-case data for a specific country, and then calculate the total predicted number of cases and the days on which 98% and 95% of the total cases are reached. The problem is that when I select Brazil, Mexico or the United States, it shows that the epidemic will never end. I am curious about the reason. Any help on how to deal with these non-converging cases would be appreciated.
Please change the selected_location variable from "Spain" to one of those three countries (Brazil, Mexico or United States) to reproduce the result that led me to ask here.
P.S. I know the limitations of this work. For example, the number of new cases is bound to the number of tests, etc. Please ignore those limitations. I'd like to see what is needed to produce a result from the following code.
Here are some outputs:
Spain (Expected Output Example)
Turkey (Expected Output Example)
France (Expected Output Example)
USA (Unexpected Output Example)
Brazil (Unexpected Output Example)
I suspect that something makes the gamma parameter (the recovery rate) too small, which leads to the same number of cases every day. But I couldn't get further and find out what is causing that. (I concluded this by printing the paramests variable and examining its values.)
Here you can find my code below.
import scipy.optimize as optimize
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson
from scipy.stats import norm
import json
from scipy.integrate import odeint as ode
import pandas as pd
from datetime import datetime
time_start = datetime.timestamp(datetime.now())
output = {"result": "error"}
error = False
def model(ini, time_step, params):
    Y = np.zeros(3) # column vector for the state variables
    X = ini
    mu = 0
    beta = params[0]
    gamma = params[1]
    Y[0] = mu - beta * X[0] * X[1] - mu * X[0] # S
    Y[1] = beta * X[0] * X[1] - gamma * X[1] - mu * X[1] # I
    Y[2] = gamma * X[1] - mu * X[2] # R
    return Y
def x0fcn(params, data):
    S0 = 1.0 - (data[0] / params[2])
    I0 = data[0] / params[2]
    R0 = 0.0
    X0 = [S0, I0, R0]
    return X0
def yfcn(res, params):
    return res[:, 1] * params[2]
# cost function for the SIR model for python 2.7
# Marisa Eisenberg (marisae#umich.edu)
# Yu-Han Kao (kaoyh#umich.edu) -7-9-17
def NLL(params, data, times): # negative log likelihood
    params = np.abs(params)
    data = np.array(data)
    res = ode(model, x0fcn(params, data), times, args=(params,))
    y = yfcn(res, params)
    nll = sum(y) - sum(data * np.log(y))
    # note this is a slightly shortened version--there's an additive constant term missing but it
    # makes calculation faster and won't alter the threshold. Alternatively, can do:
    # nll = -sum(np.log(poisson.pmf(np.round(data),np.round(y)))) # the round is b/c Poisson is for (integer) count data
    # this can also barf if data and y are too far apart because the dpois will be ~0, which makes the log angry
    # ML using normally distributed measurement error (least squares)
    # nll = -sum(np.log(norm.pdf(data,y,0.1*np.mean(data)))) # example WLS assuming sigma = 0.1*mean(data)
    # nll = sum((y - data)**2) # alternatively can do OLS but note this will mess with the thresholds
    # for the profile! This version of OLS is off by a scaling factor from
    # actual LL units.
    return nll
df = pd.read_csv('https://github.com/owid/covid-19-data/raw/master/public/data/owid-covid-data.csv')
selected_location = 'Spain'
selected_df = df[df.location == selected_location].reset_index()
selected_df.date = pd.to_datetime(selected_df.date)
print(selected_df.head())
selected_df.date = pd.to_datetime(selected_df.date)
selected_df = selected_df[['date', 'new_cases']]
print(selected_df)
df = selected_df
optimizer = optimize.minimize(NLL, params, args=(data, times), method='Nelder-Mead',
options={'disp': False, 'return_all': False, 'xatol': 3.1201, 'fatol': 0.0001,
'adaptive': False})
paramests = np.abs(optimizer.x)
iniests = x0fcn(paramests, data)
print('Paramests:')
print(paramests)
times_long = range(0, int(len(times) * 10))
start_day = df['date'][0]
dates_long = []
for i in range(0, int(len(times) * 10)):
    dates_long.append(start_day + (np.timedelta64(1, 'D') * i))
# print(df)
# print(dates_long)
# sys.exit()
#### Re-simulate and plot the model with the final parameter estimates ####
xest = ode(model, iniests, times_long, args=(paramests,))
# print(xest)
est_measure = yfcn(xest, paramests)
# plt.plot(times, data, 'k-o', linewidth=1, label='Data')
json_dict = {}
time_end = datetime.timestamp(datetime.now())
json_dict['duration'] = time_end - time_start
json_df = pd.DataFrame()
json_df['dates'] = dates_long
json_df['new_cases'] = df['new_cases']
json_df['prediction'] = est_measure
json_df = json_df.fillna("")
json_df['cumulative'] = json_df['prediction'].cumsum()
json_df = json_df[json_df['prediction'] >= 1]
if error == True:
    json_dict['result'] = 'error'
    json_dict['message'] = error_message
    json_dict['timestamp'] = datetime.timestamp(datetime.now())
    json_dict['chart_data'] = json_df.drop(columns=['prediction'], axis=1)
else:
    json_dict['result'] = 'success'
    json_dict['day_for_95_percent_predicted_cases'] = \
        json_df[json_df['cumulative'] > (json_df['cumulative'].iloc[-1] * 0.95)]['dates'].reset_index(drop=True)[0]
    json_dict['day_for_98_percent_predicted_cases'] = \
        json_df[json_df['cumulative'] > (json_df['cumulative'].iloc[-1] * 0.98)]['dates'].reset_index(drop=True)[0]
    # json_dict['timestamp'] = str(f"{datetime.now():%Y-%m-%d %H:%M:%S}")
    json_dict['timestamp'] = datetime.timestamp(datetime.now())
    json_dict['chart_data'] = json_df.to_dict()
json_string = json.dumps(json_dict, default=str)
print(json_string)
output = json_string # json string
plt.plot(json_df['dates'], json_df['prediction'], 'r-', linewidth=3, label='Predicted New Cases')
plt.bar(df['date'], data)
plt.axvline(x=json_dict['day_for_95_percent_predicted_cases'], label='(95%) '+str(json_dict['day_for_95_percent_predicted_cases'].date()),color='red')
plt.axvline(x=json_dict['day_for_98_percent_predicted_cases'], label='(98%) '+str(json_dict['day_for_98_percent_predicted_cases'].date()),color='green')
plt.xlabel('Time')
plt.ylabel('Individuals')
plt.legend()
plt.show()
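The suspicion about a very small gamma can be illustrated outside the fitting code. Below is a minimal sketch in plain Python that integrates the same SIR equations (with mu = 0) using a forward Euler step and compares two recovery rates. The beta and gamma values, the initial infected fraction and the step size are illustrative assumptions, not fitted values for any country.

```python
def simulate_infected(beta, gamma, days=300, i0=1e-4, dt=0.1):
    """Forward-Euler integration of the SIR equations with mu = 0.

    Returns the infected fraction at the end of each day.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    steps_per_day = int(round(1 / dt))
    infected = []
    for _ in range(days):
        for _ in range(steps_per_day):
            new_inf = beta * s * i * dt     # S -> I during this step
            new_rec = gamma * i * dt        # I -> R during this step
            s -= new_inf
            i += new_inf - new_rec
            r += new_rec
        infected.append(i)
    return infected

fast = simulate_infected(beta=0.3, gamma=0.1)      # recovery in about 10 days
slow = simulate_infected(beta=0.3, gamma=0.001)    # recovery in about 1000 days

print("infected fraction on day 300, gamma=0.1:  ", fast[-1])
print("infected fraction on day 300, gamma=0.001:", slow[-1])
```

With the larger gamma the infected curve rises, peaks and dies out within the simulated window; with the tiny gamma it rises and then stays high, so a cumulative 95% or 98% date keeps moving out as the horizon is extended. A fit to data that has not yet passed its peak can plausibly end up in that second regime.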
In R there is a very useful function that helps with determining the parameters of a two-sided test of proportions in order to obtain a target statistical power.
The function is called power.prop.test.
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/power.prop.test.html
You can call it using:
power.prop.test(p1 = .50, p2 = .75, power = .90)
And it will tell you n, the sample size needed to obtain this power. This is extremely useful in determining sample sizes for tests.
Is there a similar function in the scipy package?
I've managed to replicate the function using the below formula for n and the inverse survival function norm.isf from scipy.stats
from scipy.stats import norm, zscore
def sample_power_probtest(p1, p2, power=0.8, sig=0.05):
    z = norm.isf([sig/2]) # two-sided test
    zp = -1 * norm.isf([power])
    d = (p1-p2)
    s = 2*((p1+p2) /2)*(1-((p1+p2) /2))
    n = s * ((zp + z)**2) / (d**2)
    return int(round(n[0]))
def sample_power_difftest(d, s, power=0.8, sig=0.05):
    z = norm.isf([sig/2])
    zp = -1 * norm.isf([power])
    n = s * ((zp + z)**2) / (d**2)
    return int(round(n[0]))
if __name__ == '__main__':
    n = sample_power_probtest(0.1, 0.11, power=0.8, sig=0.05)
    print(n)  # 14752
    n = sample_power_difftest(0.1, 0.5, power=0.8, sig=0.05)
    print(n)  # 392
Some of the basic power calculations are now available in statsmodels
http://statsmodels.sourceforge.net/devel/stats.html#power-and-sample-size-calculations
http://jpktd.blogspot.ca/2013/03/statistical-power-in-statsmodels.html
The blog article does not yet take the latest changes to the statsmodels code into account. Also, I haven't decided yet how many wrapper functions to provide, since many power calculations just reduce to the basic distribution.
>>> import statsmodels.stats.api as sms
>>> es = sms.proportion_effectsize(0.5, 0.75)
>>> sms.NormalIndPower().solve_power(es, power=0.9, alpha=0.05, ratio=1)
76.652940372066908
In R stats
> power.prop.test(p1 = .50, p2 = .75, power = .90)
Two-sample comparison of proportions power calculation
n = 76.7069301141077
p1 = 0.5
p2 = 0.75
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
using R's pwr package
> library(pwr)
> h<-ES.h(0.5,0.75)
> pwr.2p.test(h=h, power=0.9, sig.level=0.05)
Difference of proportion power calculation for binomial distribution (arcsine transformation)
h = 0.5235987755982985
n = 76.6529406106181
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: same sample sizes
Matt's answer for getting the needed n (per group) is almost right, but there is a small error.
Given d (difference in means), s (standard deviation), sig (significance level, typically .05), and power (typically .80), the formula for calculating the number of observations per group is:
n = 2 * s^2 * (z_(sig/2) + z_power)^2 / d^2
As you can see in his formula, he has
n = s * ((zp + z)**2) / (d**2)
The "s" part is wrong: it should be 2*s^2. A correct function that reproduces R's functionality is:
def sample_power_difftest(d, s, power=0.8, sig=0.05):
    z = norm.isf([sig/2])
    zp = -1 * norm.isf([power])
    n = (2*(s**2)) * ((zp + z)**2) / (d**2)
    return int(round(n[0]))
Hope this helps.
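For readers without scipy, the same normal-approximation formula can be sketched with the standard library only. This is a minimal sketch of the formula above, not a replacement for power.prop.test or the statsmodels solvers; the function name n_per_group and the example inputs (d = 0.5, s = 1) are made up for illustration.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, s, power=0.8, sig=0.05):
    """Observations per group for a two-sided, two-sample comparison of means
    (normal approximation): n = 2 * s^2 * (z_(sig/2) + z_power)^2 / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - sig / 2)   # upper sig/2 quantile
    z_power = NormalDist().inv_cdf(power)         # quantile for the target power
    return 2 * s ** 2 * (z_alpha + z_power) ** 2 / d ** 2

n = n_per_group(d=0.5, s=1.0)   # detect half a standard deviation
print(n, ceil(n))
```

Rounding up with ceil, rather than rounding to the nearest integer, keeps the achieved power at or above the target.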
You also have:
from statsmodels.stats.power import tt_ind_solve_power
and put "None" in the value you want to obtain. For instance, to obtain the number of observations in the case of effect_size = 0.1, power = 0.8 and so on, you should put:
tt_ind_solve_power(effect_size=0.1, nobs1 = None, alpha=0.05, power=0.8, ratio=1, alternative='two-sided')
and obtain: 1570.7330663315456 as the number of observations required.
Or else, to obtain the power you can attain with the other values fixed:
tt_ind_solve_power(effect_size= 0.2, nobs1 = 200, alpha=0.05, power=None, ratio=1, alternative='two-sided')
and you obtain: 0.5140816347005553