Absurd solution using Gurobi Python in regression

So I am new to Gurobi and I decided to start working with it on a well-known problem such as regression. I found this official notebook, where an L0-penalized regression model is solved, and I took just the regression part out of it. However, when I solve this problem in Gurobi, I get a really strange solution, totally different from the actual correct regression solution.
The code I am running is:
import gurobipy as gp
from gurobipy import GRB
import numpy as np
from sklearn.datasets import load_boston
from itertools import product
boston = load_boston()
x = boston.data
x = x[:, [0, 2, 4, 5, 6, 7, 10, 11, 12]] # select non-categorical variables
response = boston.target
samples, dim = x.shape
regressor = gp.Model()
# Append a column of ones to the feature matrix to account for the y-intercept
x = np.concatenate([x, np.ones((samples, 1))], axis=1)
# Decision variables
beta = regressor.addVars(dim + 1, name="beta") # Beta
# Objective Function (OF): minimize 1/2 * RSS using the fact that
# if x* is a minimizer of f(x), it is also a minimizer of k*f(x) for any k > 0
Quad = np.dot(x.T, x)
lin = np.dot(response.T, x)
obj = sum(0.5 * Quad[i, j] * beta[i] * beta[j] for i, j in product(range(dim + 1), repeat=2))
obj -= sum(lin[i] * beta[i] for i in range(dim + 1))
obj += 0.5 * np.dot(response, response)
regressor.setObjective(obj, GRB.MINIMIZE)
regressor.optimize()
beta_sol_gurobi = np.array([beta[i].X for i in range(dim+1)])
The solution provided by this code is
array([1.22933632e-14, 2.40073891e-15, 1.10109084e-13, 2.93142174e+00,
6.14486489e-16, 3.93021623e-01, 5.52707727e-15, 8.61271603e-03,
1.55963041e-15, 3.19117429e-13])
While the true linear regression solution should be
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(x, response)
lr.coef_
lr.intercept_
That yields,
array([-5.23730841e-02, -3.35655253e-02, -1.39501039e+01, 4.40955833e+00,
-7.33680982e-03, -1.24312668e+00, -9.59615262e-01, 8.60275557e-03,
-5.17452533e-01])
29.531492975441015
So the Gurobi solution is completely different. Any guess/suggestion on what's happening? Am I doing anything wrong here?
PS: I know that this problem can be solved using other packages, or even other optimization frameworks, but I am especially interested in solving it in Gurobi Python, since I want to start using Gurobi on some more complex problems.

The wrong result is due to your decision variables. Gurobi assumes a lower bound of 0 for all variables by default, so you need to set the lower bound explicitly:
beta = regressor.addVars(dim + 1, lb = -GRB.INFINITY, name="beta") # Beta
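For completeness, here is a minimal sketch of the corrected model (it reuses x, with the appended column of ones, and response exactly as built in the question; the least-squares comparison at the end is my own sanity check, not part of the original answer):
import numpy as np
import gurobipy as gp
from gurobipy import GRB
from itertools import product

samples, n_coef = x.shape              # x already contains the column of ones
regressor = gp.Model()
# Without lb=-GRB.INFINITY every beta[i] would be restricted to beta[i] >= 0
beta = regressor.addVars(n_coef, lb=-GRB.INFINITY, name="beta")

Quad = np.dot(x.T, x)
lin = np.dot(response.T, x)
obj = sum(0.5 * Quad[i, j] * beta[i] * beta[j]
          for i, j in product(range(n_coef), repeat=2))
obj -= sum(lin[i] * beta[i] for i in range(n_coef))
obj += 0.5 * np.dot(response, response)
regressor.setObjective(obj, GRB.MINIMIZE)
regressor.optimize()

beta_sol_gurobi = np.array([beta[i].X for i in range(n_coef)])
ols, *_ = np.linalg.lstsq(x, response, rcond=None)
print(beta_sol_gurobi)
print(ols)   # the two vectors should now agree up to solver tolerance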

Related

Why does it work when columns are larger than rows in Python Sklearn (Linear Regression) [duplicate]

It's known that when the number of variables (p) is larger than the number of samples (n), the least-squares estimator is not uniquely defined.
In sklearn I receive these values:
In [30]: lm = LinearRegression().fit(xx,y_train)
In [31]: lm.coef_
Out[31]:
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
0.08619906, -0.08108713]])
In [32]: xx.shape
Out[32]: (1097, 3419)
Call [30] should return an error. How does sklearn work when p > n, as in this case?
EDIT:
It seems that the matrix is filled with some values
if n > m:
    # need to extend b matrix as it will be filled with
    # a larger solution matrix
    if len(b1.shape) == 2:
        b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
        b2[:m,:] = b1
    else:
        b2 = np.zeros(n, dtype=gelss.dtype)
        b2[:m] = b1
    b1 = b2
When the linear system is underdetermined, sklearn.linear_model.LinearRegression finds the minimum L2-norm solution, i.e.
argmin_w l2_norm(w) subject to Xw = y
This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.
w = np.linalg.pinv(X).dot(y)
The specific implementation of scipy.linalg.lstsq, which is used by LinearRegression, calls get_lapack_funcs(('gelss',), ..., and gelss is precisely a solver that finds the minimum-norm solution via singular value decomposition (provided by LAPACK).
Check out this example
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_
coef2 = np.linalg.pinv(X).dot(y)
print(coef1)
print(coef2)
And you will see that coef1 == coef2. (Note that fit_intercept=False is specified in the constructor of the sklearn estimator, because otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients)
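To make the minimum-norm claim concrete, here is a small sketch of my own (not from the original answer): any vector of the form np.linalg.pinv(X).dot(y) plus a null-space direction of X solves Xw = y just as well, but has a strictly larger L2 norm.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 10)                       # underdetermined: 5 samples, 10 features
y = rng.randn(5)

w_min = np.linalg.pinv(X).dot(y)           # minimum-norm solution
_, _, Vt = np.linalg.svd(X)
v = Vt[-1]                                 # direction in the null space of X (rank(X) <= 5)
w_alt = w_min + 0.5 * v                    # another exact solution of Xw = y

print(np.allclose(X.dot(w_min), y), np.allclose(X.dot(w_alt), y))  # True True
print(np.linalg.norm(w_min) < np.linalg.norm(w_alt))               # True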

Non-Linearity Test NaN error

I wanted to try statsmodels' linear_harvey_collier test with an easy example. However, I get nan as a result. Can you see where my error lies?
import numpy as np
import statsmodels.stats.api as sms
from statsmodels.regression.linear_model import OLS
np.random.seed(44)
n_samples, n_features = 50, 4
X = np.random.randn(n_samples, n_features)
coef=np.random.uniform(-12,12,4)
y = np.dot(X, coef)
var = 400
y += var**(1/2) * np.random.normal(size=n_samples)
regr=OLS(y, X).fit()
print(regr.params)
print(regr.summary())
sms.linear_harvey_collier(regr)
I get the result Ttest_1sampResult(statistic=nan, pvalue=nan).
If I perform the test while excluding one variable, I get a result:
X3=X[:,:3]
regr3=OLS(y, X3).fit()
In [1]: sms.linear_harvey_collier(regr3)
Out[2]: Ttest_1sampResult(statistic=0.2447803429683807, pvalue=0.806727747845282)
Is the problem that I did not add a constant/intercept? That is just a hunch, and if there is indeed a problem, I don't understand why.
There is a bug in linear_harvey_collier that hard-codes the number of initial observations to 3.
https://github.com/statsmodels/statsmodels/pull/6727
linear_harvey_collier has only two lines of code.
A workaround is to compute the test directly
res = regr
from scipy import stats
skip = len(res.params) # bug in linear_harvey_collier
rr = sms.recursive_olsresiduals(res, skip=skip, alpha=0.95, order_by=None)
stats.ttest_1samp(rr[3][skip:], 0)
Ttest_1sampResult(statistic=0.03092937323130299, pvalue=0.9754626388210277)
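As a side note (my addition, not part of the original answer): if your installed statsmodels already contains the fix from the linked PR, linear_harvey_collier may accept the number of initial observations directly via a skip argument; check your version's signature before relying on it.
# Assumed to work only on statsmodels versions that include the linked fix
sms.linear_harvey_collier(regr, skip=len(regr.params))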

Use Python lmfit with a variable number of parameters in function

I am trying to deconvolve complex gas chromatogram signals into individual gaussian signals. Here is an example, where the dotted line represents the signal I am trying to deconvolve.
I was able to write the code to do this using scipy.optimize.curve_fit; however, once applied to real data the results were unreliable. I believe being able to set bounds on my parameters will improve my results, so I am attempting to use lmfit, which allows this. I am having a problem getting lmfit to work with a variable number of parameters. The signals I am working with may have an arbitrary number of underlying Gaussian components, so the number of parameters I need will vary. I found some hints here, but still can't figure it out...
Creating a python lmfit Model with arbitrary number of parameters
Here is the code I am currently working with. The code will run, but the parameter estimates do not change when the model is fit. Does anyone know how I can get my model to work?
import numpy as np
from collections import OrderedDict
from scipy.stats import norm
from lmfit import Parameters, Model
def add_peaks(x_range, *pars):
    y = np.zeros(len(x_range))
    for i in np.arange(0, len(pars), 3):
        curve = norm.pdf(x_range, pars[i], pars[i+1]) * pars[i+2]
        y = y + curve
    return y
# generate some fake data
x_range = np.linspace(0, 100, 1000)
peaks = [50., 40., 60.]
a = norm.pdf(x_range, peaks[0], 5) * 2
b = norm.pdf(x_range, peaks[1], 1) * 0.1
c = norm.pdf(x_range, peaks[2], 1) * 0.1
fake = a + b + c
param_dict = OrderedDict()
for i in range(0, len(peaks)):
    param_dict['pk' + str(i)] = peaks[i]
    param_dict['wid' + str(i)] = 1.
    param_dict['mult' + str(i)] = 1.
# In case, you'd like to see the plot of fake data
#y = add_peaks(x_range, *param_dict.values())
#plt.plot(x_range, y)
#plt.show()
# Initialize the model and fit
pmodel = Model(add_peaks)
params = pmodel.make_params()
for i in param_dict.keys():
    params.add(i, value=param_dict[i])
result = pmodel.fit(fake, params=params, x_range=x_range)
print(result.fit_report())
I think you would be better off using lmfit's ability to build a composite model.
That is, with a single peak defined with
from scipy.stats import norm
def peak(x, amp, center, sigma):
    return amp * norm.pdf(x, center, sigma)
(see also lmfit.models.GaussianModel), you can build a model with many peaks:
npeaks = 3
model = Model(peak, prefix='p1_')
for i in range(1, npeaks):
    model = model + Model(peak, prefix='p%d_' % (i+1))
params = model.make_params()
Now model will be a sum of 3 Gaussian functions, and the params created for that model will have names like p1_amp, p1_center, p2_amp, ..., to which you can add sensible initial values and/or bounds and/or constraints (a short bounds sketch follows the fit call below).
Given your example data, you could pass in initial values to make_params like
params = model.make_params(p1_amp=2.0, p1_center=50., p1_sigma=2,
                           p2_amp=0.2, p2_center=40., p2_sigma=2,
                           p3_amp=0.2, p3_center=60., p3_sigma=2)
result = model.fit(fake, params, x=x_range)
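Since the question is specifically about being able to set bounds, here is a minimal sketch of how bounds could be attached to the prefixed parameters created above (the particular limits are made-up values for illustration only):
# Made-up bounds for illustration; adjust to your data
params['p1_center'].set(min=45, max=55)
params['p1_sigma'].set(min=0.1, max=10)
params['p1_amp'].set(min=0)
result = model.fit(fake, params, x=x_range)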
I was able to find a solution here:
https://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes
Building on the code above, the following accomplishes what I was trying to do...
import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel
gauss1 = GaussianModel(prefix='g1_')
gauss2 = GaussianModel(prefix='g2_')
gauss3 = GaussianModel(prefix='g3_')
gauss4 = GaussianModel(prefix='g4_')
gauss5 = GaussianModel(prefix='g5_')
gauss = [gauss1, gauss2, gauss3, gauss4, gauss5]
prefixes = ['g1_', 'g2_', 'g3_', 'g4_', 'g5_']
mod = np.sum(gauss[0:len(peaks)])
pars = mod.make_params()
for i, prefix in zip(range(0, len(peaks)), prefixes[0:len(peaks)]):
    pars[prefix + 'center'].set(peaks[i])
init = mod.eval(pars, x=x_range)
out = mod.fit(fake, pars, x=x_range)
print(out.fit_report(min_correl=0.5))
out.plot_fit()
plt.show()

How to configure lasso regression to not penalize certain variables?

I'm trying to use lasso regression in Python.
I'm currently using the Lasso function from the scikit-learn library.
I want my model not to penalize certain variables during training (i.e., penalize only the rest of the variables).
Below is my current code for training
rg_mdt = linear_model.LassoCV(alphas=np.array(10**np.linspace(0, -4, 100)), fit_intercept=True, normalize=True, cv=10)
rg_mdt.fit(df_mdt_rgmt.loc[df_mdt_rgmt.CLUSTER_ID == k].drop(['RESPONSE', 'CLUSTER_ID'], axis=1), df_mdt_rgmt.loc[df_mdt_rgmt.CLUSTER_ID == k, 'RESPONSE'])
df_mdt_rgmt is the data mart, and I'm trying to keep the coefficients for certain columns non-zero.
glmnet in R provides a 'penalty.factor' parameter that lets me do this, but how can I do that in Python scikit-learn?
Below is the code I have in R
get.Lassomodel <- function(TB.EXP, TB.RSP){
  VT.PEN <- rep(1, ncol(TB.EXP))
  VT.PEN[which(colnames(TB.EXP) == "DC_RATE")] <- 0
  VT.PEN[which(colnames(TB.EXP) == "FR_PRICE_PW_REP")] <- 0
  VT.GRID <- 10^seq(0, -4, length=100)
  REG.MOD <- cv.glmnet(as.matrix(TB.EXP), as.matrix(TB.RSP), alpha=1,
                       lambda=VT.GRID, penalty.factor=VT.PEN, nfolds=10, intercept=TRUE)
  return(REG.MOD)
}
I'm afraid you can't. Of course, it's not a theoretical issue, just a design decision.
My reasoning is based on the available API: while sometimes there are undocumented functions, this time I don't think there is what you need, because the user guide already states the problem in the single-factor, all-coefficients form alpha * ||w||_1.
Depending on your setting, you might modify sklearn's code (I'm a bit wary of the coordinate-descent tuning) or even implement a customized objective using scipy.optimize (although the latter might be a bit slower).
Here is an example showing the scipy.optimize approach. I simplified the problem by removing the intercept.
""" data """
import numpy as np
from sklearn import datasets
diabetes = datasets.load_diabetes()
A = diabetes.data[:150]
y = diabetes.target[:150]
alpha=0.1
weights=np.ones(A.shape[1])
""" sklearn """
from sklearn import linear_model
clf = linear_model.Lasso(alpha=alpha, fit_intercept=False)
clf.fit(A, y)
""" scipy """
from scipy.optimize import minimize
def lasso(x):  # following sklearn's definition from user-guide!
    return (1. / (2*A.shape[0])) * np.square(np.linalg.norm(A.dot(x) - y, 2)) + alpha * np.linalg.norm(weights*x, 1)
""" Test with weights = 1 """
x0 = np.zeros(A.shape[1])
res = minimize(lasso, x0, method='L-BFGS-B', options={'disp': False})
print('Equal weights')
print(lasso(clf.coef_), clf.coef_[:5])
print(lasso(res.x), res.x[:5])
""" Test scipy-based with special weights """
weights[[0, 3, 5]] = 0.0
res = minimize(lasso, x0, method='L-BFGS-B', options={'disp': False})
print('Specific weights')
print(lasso(res.x), res.x[:5])
Output:
Equal weights
12467.4614224 [-524.03922009 -75.41111354 820.0330707 40.08184085 -307.86020107]
12467.6514697 [-526.7102518 -67.42487561 825.70158417 40.04699607 -271.02909258]
Specific weights
12362.6078842 [ -6.12843589e+02 -1.51628334e+01 8.47561732e+02 9.54387812e+01
-1.02957112e-05]

Softmax Regression (Multinomial Logistic) with PyMC3

I am trying to implement a logistic multinomial regression (a.k.a. softmax regression). In this example I am trying to classify the iris dataset.
I have a problem specifying the model: I get an optimization error with find_MAP(). If I avoid using find_MAP(), I get a "sample" of all-zero vectors if I use a Categorical for the likelihood, or a posterior exactly the same as the priors if I use Multinomial(n=1, p=p).
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
iris = sns.load_dataset("iris")
y_2 = pd.Categorical(iris['species']).labels
x_n = iris.columns[:-1]
x_2 = iris[x_n].values
x_2 = (x_2 - x_2.mean(axis=0))/x_2.std(axis=0)
indice = list(set(y_2))
with pm.Model() as modelo_s:
    alfa = pm.Normal('alfa', mu=0, sd=100, shape=3)
    beta = pm.Normal('beta', mu=0, sd=100, shape=(4,3))
    mu = (alfa[indice] + pm.dot(x_2, beta[:,indice])).T
    p = pm.exp(mu)/pm.sum(pm.exp(mu), axis=0)
    yl = pm.Categorical('yl', p=p, observed=y_2)
    #yl = pm.Multinomial('yl', n=1, p=p, observed=y_2)
    start = pm.find_MAP()
    step = pm.Metropolis()
    trace_s = pm.sample(1000, step, start)
The issue is probably the lack of gibbs updating of vector-valued variables. Thus, a jump is only accepted if all binary values produce a good logp. This PR might be helpful: #799
So you can try: pip install git+https://github.com/pymc-devs/pymc3#gibbs and then do Metropolis(gibbs='random').
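Putting that suggestion together, a minimal sketch (assuming the patched branch installs and exposes the gibbs keyword exactly as described above; I have not verified this against a released PyMC3 version):
# pip install git+https://github.com/pymc-devs/pymc3#gibbs   (from the suggestion above)
with modelo_s:
    step = pm.Metropolis(gibbs='random')   # gibbs keyword as described in the linked PR
    trace_s = pm.sample(1000, step)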
