I wanted to try statsmodels' linear_harvey_collier test on a simple example. However, I get nan as the result. Can you see where my error lies?
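import numpy as np
import statsmodels.stats.api as sms
from statsmodels.regression.linear_model import OLS
n_samples, n_features = 50, 4
X = np.random.randn(n_samples, n_features)
coef = np.random.randn(n_features)  # assumed: the original definition of coef was not shown
y = np.dot(X, coef)
var = 400
y += var**(1/2) * np.random.normal(size=n_samples)
regr = OLS(y, X).fit()
sms.linear_harvey_collier(regr)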
I get the result Ttest_1sampResult(statistic=nan, pvalue=nan).
If I perform the test while excluding one variable, I get a result:
regr3=OLS(y, X3).fit()
In [1]: sms.linear_harvey_collier(regr3)
Out[2]: Ttest_1sampResult(statistic=0.2447803429683807, pvalue=0.806727747845282)
Is the problem that I did not add a constant (intercept)? This is just a feeling, and if that is indeed the problem, I don't understand why.
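There is a bug in linear_harvey_collier: it hard-codes the number of initial observations to 3. With four regressors, three observations are too few to fit the initial regression, which explains the nan; with three regressors, as in regr3, the test runs.
linear_harvey_collier has only two lines of code.
A workaround is to compute the test directly: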
res = regr
from scipy import stats
skip = len(res.params) # bug in linear_harvey_collier
rr = sms.recursive_olsresiduals(res, skip=skip, alpha=0.95, order_by=None)
stats.ttest_1samp(rr[3][skip:], 0)
Ttest_1sampResult(statistic=0.03092937323130299, pvalue=0.9754626388210277)
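To see why the number of initial observations matters, here is a minimal NumPy sketch of recursive residuals. It is not statsmodels' implementation; the function name recursive_residuals and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def recursive_residuals(X, y, skip):
    """One-step-ahead prediction errors, each divided by its forecast scale factor.

    The first `skip` observations are used only to fit the initial model.
    (Hypothetical helper, not part of statsmodels.)
    """
    n_obs = X.shape[0]
    resid = np.empty(n_obs - skip)
    for t in range(skip, n_obs):
        X_past, y_past = X[:t], y[:t]
        # Fit on everything seen so far, then predict observation t.
        beta, *_ = np.linalg.lstsq(X_past, y_past, rcond=None)
        xtx_inv = np.linalg.inv(X_past.T @ X_past)
        x_t = X[t]
        scale = np.sqrt(1.0 + x_t @ xtx_inv @ x_t)
        resid[t - skip] = (y[t] - x_t @ beta) / scale
    return resid

rng = np.random.RandomState(0)
n_samples, n_features = 50, 4
X = rng.randn(n_samples, n_features)
y = X @ rng.randn(n_features) + 20.0 * rng.randn(n_samples)

# Three rows cannot identify four coefficients: the initial design is rank deficient.
rank_three_rows = np.linalg.matrix_rank(X[:3])

# Starting with as many rows as parameters gives a well-defined initial fit.
w = recursive_residuals(X, y, skip=n_features)
t_stat = w.mean() / (w.std(ddof=1) / np.sqrt(len(w)))  # one-sample t statistic against 0
```

Here rank_three_rows is smaller than n_features, which is the situation the hard-coded value of 3 creates for a four-regressor model, while skip=n_features yields finite residuals and a finite t statistic.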
I am new to Gurobi and decided to start with a well-known problem, regression. I found this official notebook, in which an L0-penalized regression model is solved, and I extracted just the regression part. However, when I solve this problem in Gurobi, I get a really strange solution, totally different from the correct regression solution.
The code I am running is:
import gurobipy as gp
from gurobipy import GRB
import numpy as np
from sklearn.datasets import load_boston
from itertools import product
boston = load_boston()
x = boston.data
x = x[:, [0, 2, 4, 5, 6, 7, 10, 11, 12]] # select non-categorical variables
response = boston.target
samples, dim = x.shape
regressor = gp.Model()
# Append a column of ones to the feature matrix to account for the y-intercept
x = np.concatenate([x, np.ones((samples, 1))], axis=1)
# Decision variables
beta = regressor.addVars(dim + 1, name="beta") # Beta
# Objective Function (OF): minimize 1/2 * RSS using the fact that
# if x* is a minimizer of f(x), it is also a minimizer of k*f(x) iff k > 0
Quad = np.dot(x.T, x)
lin = np.dot(response.T, x)
obj = sum(0.5 * Quad[i, j] * beta[i] * beta[j] for i, j in product(range(dim + 1), repeat=2))
obj -= sum(lin[i] * beta[i] for i in range(dim + 1))
obj += 0.5 * np.dot(response, response)
regressor.setObjective(obj, GRB.MINIMIZE)
regressor.optimize()  # solve the model before reading variable values
beta_sol_gurobi = np.array([beta[i].X for i in range(dim+1)])
The solution provided by this code is
array([1.22933632e-14, 2.40073891e-15, 1.10109084e-13, 2.93142174e+00,
6.14486489e-16, 3.93021623e-01, 5.52707727e-15, 8.61271603e-03,
1.55963041e-15, 3.19117429e-13])
While the true linear regression solution should be
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(x, response)
That yields,
array([-5.23730841e-02, -3.35655253e-02, -1.39501039e+01, 4.40955833e+00,
-7.33680982e-03, -1.24312668e+00, -9.59615262e-01, 8.60275557e-03,
So the Gurobi solution is completely different. Any guess or suggestion as to what's happening? Am I doing anything wrong here?
PS: I know that this problem can be solved using other packages, or even other optimization frameworks, but I am especially interested in solving it with Gurobi in Python, since I want to start using Gurobi on more complex problems.
The wrong result is due to your decision variables. Since Gurobi assumes the lower bound 0 for all variables by default, you need to explicitly set the lower bound:
beta = regressor.addVars(dim + 1, lb = -GRB.INFINITY, name="beta") # Beta
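The effect of an implicit lower bound of 0 can be reproduced without Gurobi. The sketch below is a NumPy stand-in, not Gurobi code: it compares an unconstrained least squares fit with a fit restricted to non-negative coefficients, solved by projected gradient descent. The data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
beta_true = np.array([2.0, -3.0, 0.5])
y = X @ beta_true  # noise-free, so unconstrained least squares recovers beta_true

# Unconstrained fit: corresponds to variables with lb = -infinity.
beta_free = np.linalg.lstsq(X, y, rcond=None)[0]

# Fit with beta >= 0: corresponds to the default lower bound of 0.
# Projected gradient descent on 0.5 * ||X beta - y||^2 with step 1/L.
step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
beta_nonneg = np.zeros(3)
for _ in range(5000):
    grad = X.T @ (X @ beta_nonneg - y)
    beta_nonneg = np.maximum(beta_nonneg - step * grad, 0.0)  # project onto beta >= 0
```

The coefficient whose true value is negative is pinned at exactly 0 in beta_nonneg, and the remaining coefficients shift to compensate, which is the same pattern as the near-zero entries in the Gurobi solution above.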
It's known that when the number of variables (p) is larger than the number of samples (n), the least squares estimator is not defined.
In sklearn I get these values:
In [30]: lm = LinearRegression().fit(xx,y_train)
In [31]: lm.coef_
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
0.08619906, -0.08108713]])
In [32]: xx.shape
Out[32]: (1097, 3419)
Call [30] should return an error. How does sklearn work when p>n like in this case?
It seems that the matrix is filled with some values:
if n > m:
    # need to extend b matrix as it will be filled with
    # a larger solution matrix
    if len(b1.shape) == 2:
        b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
        b2[:m,:] = b1
    else:
        b2 = np.zeros(n, dtype=gelss.dtype)
        b2[:m] = b1
    b1 = b2
When the linear system is underdetermined, sklearn.linear_model.LinearRegression finds the minimum L2 norm solution, i.e.
argmin_w l2_norm(w) subject to Xw = y
This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.
w = np.linalg.pinv(X).dot(y)
The specific implementation of scipy.linalg.lstsq, which is used by LinearRegression, uses get_lapack_funcs(('gelss',), ..., which is precisely a solver that finds the minimum norm solution via singular value decomposition (provided by LAPACK).
Check out this example
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_
coef2 = np.linalg.pinv(X).dot(y)
And you will see that coef1 and coef2 agree up to floating-point error, i.e. np.allclose(coef1, coef2) is True. (Note that fit_intercept=False is specified in the constructor of the sklearn estimator because otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients.)
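The "minimum norm" part can also be checked directly with NumPy alone. In this sketch (variable names are illustrative), a second exact solution is built by adding a null-space direction of X to the pseudoinverse solution; both fit the data perfectly, but the pseudoinverse solution has the smaller L2 norm.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 10)   # 5 samples, 10 features: underdetermined
y = rng.randn(5)

# Minimum-norm solution via the pseudoinverse, and via NumPy's SVD-based lstsq.
w_min = np.linalg.pinv(X) @ y
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# The last right-singular vector lies in the null space of X (X has rank 5).
_, _, vt = np.linalg.svd(X)
null_vec = vt[-1]

# Adding a null-space vector gives another exact solution with a larger norm.
w_other = w_min + null_vec

norm_min = np.linalg.norm(w_min)
norm_other = np.linalg.norm(w_other)
```

Since w_min is orthogonal to the null space, norm_other squared equals norm_min squared plus 1 here, so any other interpolating solution is strictly longer.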
I get ValueError: Found input variables with inconsistent numbers of samples: [20000, 1] when I run the following, even though x and y have the correct number of rows. I load the RCV1 dataset, get the indices of the top categories by document count, create a list of tuples with an equal number of randomly selected positives and negatives for each category, and finally attempt to run a logistic regression on one of the categories.
import numpy as np
import sklearn.datasets
from sklearn import model_selection, preprocessing
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
from scipy import sparse
rcv1 = sklearn.datasets.fetch_rcv1()
def get_top_cat_indices(target_matrix, num_cats):
    cat_counts = target_matrix.sum(axis=0)
    #cat_counts = cat_counts.reshape((1,103)).tolist()[0]
    cat_counts = cat_counts.reshape((103,))
    #b = sorted(cat_counts, reverse=True)
    ind_temp = np.argsort(cat_counts)[::-1].tolist()[0]
    ind = [ind_temp[i] for i in range(5)]
    return ind
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        cat_present = x.tocsr()[np.where(temp.sum(axis=1)>0)[0],:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[np.where(temp.sum(axis=1)==0)[0],:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[idx_cat,:]
        sampled_y_neg = temp.tocsr()[idx_nocat,:]
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
ind = get_top_cat_indices(rcv1.target, 5)
test_res = prepare_data(train_x, train_y, ind, 20000)
x, y = test_res[0]
LogisticRegression().fit(x, y)
Could it be an issue with the sparse matrices, or a problem with dimensionality (there are 20K samples and 47K features)?
When I run your code, I get the following error:
AttributeError: 'bool' object has no attribute 'any'
That's because y for LogisticRegression needs to be a 1-D NumPy array, not a sparse matrix. So I changed the last line to:
LogisticRegression().fit(x, y.toarray().ravel())
Then I get the following error:
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
This is because your sampling code has a bug. The sampling indices are positions within the category subset, so you need to subset the y array to the rows of that category before applying them. See the code below:
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        c1 = np.where(temp.sum(axis=1)>0)[0]
        c2 = np.where(temp.sum(axis=1)==0)[0]
        cat_present = x.tocsr()[c1,:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[c2,:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[c1][idx_cat,:]
        sampled_y_neg = temp.tocsr()[c2][idx_nocat,:]
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
Now everything works like a charm.
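The indexing mistake is easier to see on a tiny dense example. This is a simplified sketch with made-up labels, not the RCV1 data: positions drawn relative to the positive subset pick the wrong rows when applied to the full label array.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy labels: the positives sit at the end of the array.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
pos_rows = np.where(labels > 0)[0]            # row numbers of the positives: 6..9

# Sampled positions are relative to the positive subset (values 0..3).
idx = rng.randint(len(pos_rows), size=3)

wrong = labels[idx]            # applied to the full array: hits rows 0..3, all zeros
right = labels[pos_rows][idx]  # subset first, then sample: all ones
```

In the original prepare_data, the same mismatch made every sampled label 0, which is why the solver complained about a single class.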
I am trying to sample a simple model of a categorical distribution with a Dirichlet prior. Here is my code:
import numpy as np
from scipy import optimize
from pymc3 import *
k = 6
alpha = 0.1 * np.ones(k)
with Model() as model:
    p = Dirichlet('p', a=alpha, shape=k)
    categ = Categorical('categ', p=p, shape=1)
    tr = sample(10000)
And I get this error:
PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [0 1 2 3 4]
The problem is that NUTS is failing to initialize properly. One solution is to use another sampler like this:
import pymc3 as pm
with pm.Model() as model:
    p = pm.Dirichlet('p', a=alpha)
    categ = pm.Categorical('categ', p=p)
    step = pm.Metropolis(vars=p)
    tr = pm.sample(1000, step=step)
Here I am manually assigning p to Metropolis, and letting PyMC3 assign categ to a proper sampler.
The following code gives me the following error: ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
The error is raised on the line where predict is called. I assume something is wrong with the shape of the dataframe obs_to_pred. I checked the shape, which is (1046, 3).
What do you recommend so I can fix this and run the prediction?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import scipy.stats as stats
from sklearn import linear_model
# Import Titanic Data
train_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/train.csv'
test_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/test.csv'
train = pd.read_csv(train_loc)
test = pd.read_csv(test_loc)
# Predict Missing Age Values Based on Factors Pclass, SibSp, and Parch.
# In the function, combine train and test data.
def regressionPred(traindata, testdata):
    allobs = pd.concat([traindata, testdata])
    allobs = allobs[~allobs.Age.isnull()]
    y = allobs.Age
    y, X = dmatrices('y ~ Pclass + SibSp + Parch', data = allobs, return_type = 'dataframe')
    mod = sm.OLS(y,X)
    res = mod.fit()
    predictors = ['Pclass', 'SibSp', 'Parch']
    regr = linear_model.LinearRegression()
    regr.fit(allobs.ix[:,predictors], y)
    obs_to_pred = allobs[allobs.Age.isnull()].ix[:,predictors]
    prediction = regr.predict( obs_to_pred ) # Error Produced in This Line ***
    return res.summary(), prediction
In case you may want to look at the dataset, the link will take you there: https://www.kaggle.com/c/titanic/data
In the line
allobs = allobs[~allobs.Age.isnull()]
you define allobs as all the cases with no NaN in the Age column.
Later, with:
obs_to_pred = allobs[allobs.Age.isnull()].ix[:,predictors]
you have no cases to predict on, because allobs.Age.isnull() now evaluates to False for every row, so obs_to_pred is empty. Hence your error:
array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
Check the logic of what you want to predict: the rows with missing Age have to be selected before they are filtered out of allobs.
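One way to arrange that logic is to compute the missing-value mask once on the unfiltered data and use it for both the training rows and the rows to predict. The sketch below uses plain NumPy and made-up numbers rather than the Titanic files; the same boolean-mask pattern carries over to a pandas DataFrame.

```python
import numpy as np

# Toy data: two predictors and an age column with gaps (values are made up).
features = np.array([[1., 0.], [3., 1.], [2., 0.], [3., 2.], [1., 1.], [2., 3.]])
age = np.array([38., np.nan, 29., 21., np.nan, 25.])

missing = np.isnan(age)  # mask computed on the unfiltered data

# The pattern from the question: filter first, then look for missing rows.
age_filtered = age[~missing]
n_left_to_predict = int(np.isnan(age_filtered).sum())  # nothing is left to predict

# Instead, take both subsets from the unfiltered data.
design = np.column_stack([np.ones(len(age)), features])  # intercept + predictors
coef = np.linalg.lstsq(design[~missing], age[~missing], rcond=None)[0]
predicted_age = design[missing] @ coef  # one prediction per row with missing age
```

Fitting uses only the rows where age is known, and prediction uses only the rows where it is missing, so neither subset can end up empty by accident.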