Linear Regression vs Closed-form Ordinary Least Squares in Python

I am trying to apply linear regression to a dataset of 9 samples with around 50 features using Python. I have tried different methods for linear regression, i.e. closed-form OLS (Ordinary Least Squares), LR (Linear Regression), HR (Huber Regression), and NNLS (Non-Negative Least Squares), and each of them gives different weights.
I can see the intuition for why HR and NNLS have different solutions, but LR and closed-form OLS have the same objective function: minimizing the sum of the squares of the differences between the observed values in the given sample and those predicted by a linear function of the features. Since the training set is singular, I had to use the pseudoinverse to perform closed-form OLS:
# normal equations with a pseudoinverse, since X^T X is singular
w = np.dot(train_features.T, train_features)                             # X^T X
w1 = np.dot(np.linalg.pinv(w), np.dot(train_features.T, train_target))   # (X^T X)^+ X^T y
For LR I have used scikit-learn's LinearRegression, which uses the LAPACK library from www.netlib.org to solve the least-squares problem:
linear_model.LinearRegression()
A system of linear equations (or of polynomial equations) is referred to as underdetermined if the number of available equations is less than the number of unknown parameters. Each unknown parameter can be counted as an available degree of freedom, and each equation can be viewed as a constraint that removes one degree of freedom. As a result, an underdetermined system can have infinitely many solutions or no solution at all. Since in our case the system is underdetermined and also singular, there exist many solutions.
Now both the pseudoinverse and the LAPACK routines try to find the minimum-norm solution of an underdetermined system when the number of samples is less than the number of features. Why, then, do the closed form and LR give completely different solutions for the same system of linear equations? Am I missing something here that can explain the behavior of both approaches? For example, if the pseudoinverse is computed in different ways, such as SVD or QR/LQ factorization, can they produce different solutions for the same set of equations?
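As a sanity check (a random underdetermined system of the same shape, not my actual data), np.linalg.pinv and np.linalg.lstsq should agree on the minimum-norm solution as long as no intercept is involved:
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(9, 50)          # 9 samples, 50 features: underdetermined
y = rng.randn(9)

w_pinv = np.linalg.pinv(X.T @ X) @ (X.T @ y)      # normal equations with a pseudoinverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # LAPACK minimum-norm least squares

print(np.allclose(w_pinv, w_lstsq))               # True (up to floating-point error)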

Check out the docs of sklearn's LinearRegression again.
By default (which is how you call it), it also fits an intercept term!
Demo:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
X, y = load_boston(return_X_y=True)
""" OLS custom """
w = np.dot(np.linalg.pinv(X), y)
print('custom')
print(w)
""" sklearn's LinearRegression (default) """
clf = LinearRegression()
print('sklearn default')
print(clf.fit(X, y).coef_)
""" sklearn's LinearRegression (no intercept-fitting) """
print('sklearn fit_intercept=False')
clf = LinearRegression(fit_intercept=False)
print(clf.fit(X, y).coef_)
Output:
custom
[ -9.16297843e-02 4.86751203e-02 -3.77930006e-03 2.85636751e+00
-2.88077933e+00 5.92521432e+00 -7.22447929e-03 -9.67995240e-01
1.70443393e-01 -9.38925373e-03 -3.92425680e-01 1.49832102e-02
-4.16972624e-01]
sklearn default
[ -1.07170557e-01 4.63952195e-02 2.08602395e-02 2.68856140e+00
-1.77957587e+01 3.80475246e+00 7.51061703e-04 -1.47575880e+00
3.05655038e-01 -1.23293463e-02 -9.53463555e-01 9.39251272e-03
-5.25466633e-01]
sklearn fit_intercept=False
[ -9.16297843e-02 4.86751203e-02 -3.77930006e-03 2.85636751e+00
-2.88077933e+00 5.92521432e+00 -7.22447929e-03 -9.67995240e-01
1.70443393e-01 -9.38925373e-03 -3.92425680e-01 1.49832102e-02
-4.16972624e-01]
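As a follow-up sketch: to make the closed form reproduce the default (intercept-fitting) coefficients, append a column of ones to X before taking the pseudoinverse; for this full-rank, overdetermined dataset the two should agree.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # add a bias column
w_b = np.dot(np.linalg.pinv(Xb), y)

print(w_b[:-1])   # should match the default LinearRegression coef_
print(w_b[-1])    # should match the default LinearRegression intercept_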

Related

How does Python calculate predictions with linear regression?

I'm having trouble getting the formula that Python uses for linear predictions. I did a linear regression using:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_tr_pre_close,Y_tr_pre_close)
then I made predictions using:
predictions=lm.predict(X_te_pre_close)
I had great results with this model, but now the problem is that I can't figure out how the lm.predict() formula works; the model should be ordinary least squares, as I read in the documentation.
In this case the prediction formula is supposed to be x'b (the vector of explanatory variables times the vector of coefficients), but it doesn't match my results.
LinearRegression doesn't store the intercept as one of the coefficients, but as intercept_.
So you can reproduce the predict function like this:
# using sklearn
pred_sklearn = lm.predict(X_te_pre_close)
# using coefficients directly:
pred_coef = X_te_pre_close @ lm.coef_.T + lm.intercept_
assert all(pred_coef == pred_sklearn)

SKLearn Linear Regression but setting certain coefficients before starting

I'd like to run a linear regression using SKLearn on a dataset with say 50 variables. However, I'd like to set the coefficients for say 2 of the variables before it starts training. Is that possible?
You are looking to provide an initial value or guess for the coefficients, and this is not possible for LinearRegression because it calls scipy.linalg.lstsq under the hood.
I am not very sure what the purpose of providing an initial guess would be, because for linear regression you can fit the model, i.e. find the least-squares solution, directly via QR decomposition or SVD; there is no need for an initial guess.
If you want to try it for some purpose, you can use something like lsmr or curve_fit, but bear in mind that these are not the commonly known linear regression:
import numpy as np
from scipy.sparse.linalg import lsmr
from scipy.optimize import curve_fit
from sklearn import datasets, linear_model
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

regr = linear_model.LinearRegression()
regr.fit(X, y)
regr.coef_
array([ -0.47623169, -11.40703082,  24.72625713,  15.42967916,
        -37.68035801,  22.67648701,   4.80620008,   8.422084  ,
         35.73471316,   3.21661161])

# lsmr with an initial guess x0
lsmr(X, y, x0=np.repeat(2.0, X.shape[1]))
(array([ -0.4762317 , -11.40703083,  24.72625712,  15.42967915,
         -37.68035803,  22.67648699,   4.8062001 ,   8.42208398,
          35.73471314,   3.21661159])   # rest of lsmr's return tuple omitted

# non-linear least squares with an initial guess p0
def func(x, *params):
    return x @ params

coef_, cov_ = curve_fit(func, X, y, p0=np.repeat(2, X.shape[1]))
coef_
array([ -0.47623371, -11.40702964,  24.72625986,  15.42967394,
        -37.68022801,  22.67639202,   4.8061298 ,   8.42205138,
         35.73466837,   3.21661273])
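If the actual goal is to pin two coefficients to known values rather than to provide a starting guess, one workaround (plain linear algebra, not an sklearn feature) is to subtract the contribution of those columns from the target and fit on the remaining columns. A rough sketch, assuming hypothetically that the first two columns are the ones to fix:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

fixed_idx = [0, 1]                    # columns whose coefficients we pin (assumption)
fixed_coef = np.array([1.0, -2.0])    # chosen values, purely illustrative
free_idx = [i for i in range(X.shape[1]) if i not in fixed_idx]

# regress the residual target on the free columns only
reg = linear_model.LinearRegression()
reg.fit(X[:, free_idx], y - X[:, fixed_idx] @ fixed_coef)

# assemble the full coefficient vector
coef_full = np.empty(X.shape[1])
coef_full[fixed_idx] = fixed_coef
coef_full[free_idx] = reg.coef_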

Sum of fitted models with sklearn

I am trying to do something that involves taking the sum of two fitted models such that the output is another LinearRegression type object. I have fitted the two models using the standard LinearRegression method from sklearn.
from sklearn.linear_model import LinearRegression
reg_1 = LinearRegression().fit(X1, y)
reg_2 = LinearRegression().fit(X2, y)
and I want to be able to produce something like
reg = reg_1 + reg_2
such that I can still do standard operations such as
reg.predict(X3)
Is there an easy way to do this? Clearly I can obtain the coefficients of both reg_1 and reg_2, so if I could define reg using those, it would work, but I couldn't see a way to do this.
Since your reason for doing this is that "they are just different datasets with the same features" I would recommend simply appending the datasets and creating one model on all data.
But if this isn't possible for some reason you could do this by manually setting the coef_ and intercept_ attributes of a third linear model as the averages of the first two, such as:
import numpy as np

reg = LinearRegression()
reg.coef_ = np.array([np.mean(t) for t in zip(reg_1.coef_, reg_2.coef_)])
reg.intercept_ = np.mean([reg_1.intercept_, reg_2.intercept_])
Then you can just use the reg.predict(X3) method as usual to make predictions from the combined averages of the 2 linear models' terms.
There are dangers in this approach, though. If, for example, one of the datasets used to fit the original models is much larger than the other, then the smaller dataset's intercept and coefficient terms would be over-weighted in the combined model, and you would probably want to weight by sample size when averaging the intercept and coefficient terms, as sketched below.
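A minimal sketch of such weighting, assuming the weight of each model should be the number of rows it was trained on (X1, X2, reg_1, reg_2 as in the question):
from sklearn.linear_model import LinearRegression

# weight each model by its training-set size
n1, n2 = len(X1), len(X2)
w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)

reg = LinearRegression()
reg.coef_ = w1 * reg_1.coef_ + w2 * reg_2.coef_
reg.intercept_ = w1 * reg_1.intercept_ + w2 * reg_2.intercept_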

How to perform logistic lasso in python?

The scikit-learn package provides the functions Lasso() and LassoCV(), but no option to fit a logistic function instead of a linear one... How do I perform logistic lasso in Python?
The Lasso optimizes a least-squares problem with an L1 penalty.
By definition you can't optimize a logistic function with the Lasso.
If you want to optimize a logistic function with an L1 penalty, you can use the LogisticRegression estimator with the L1 penalty:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
log = LogisticRegression(penalty='l1', solver='liblinear')
log.fit(X, y)
Note that only the LIBLINEAR and SAGA (added in v0.19) solvers handle the L1 penalty.
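As a quick follow-up (continuing the snippet above; the exact count depends on C and on the data), you can check how many coefficients the L1 penalty drove to zero:
import numpy as np

print(log.coef_)                                        # per-class coefficient rows
print("zeroed coefficients:", np.sum(log.coef_ == 0))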
You can use glmnet in Python. Glmnet uses warm starts and active-set convergence, so it is extremely efficient. Those techniques make glmnet faster than other lasso implementations. You can download it from https://web.stanford.edu/~hastie/glmnet_python/
1 scikit-learn: sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegression from scikit-learn is probably the best:
as @TomDLT said, Lasso is for the least squares (regression) case, not logistic (classification).
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    penalty='l1',
    solver='saga',  # or 'liblinear'
    C=regularization_strength)
model.fit(x, y)
2 python-glmnet: glmnet.LogitNet
You can also use Civis Analytics' python-glmnet library. This implements the scikit-learn BaseEstimator API:
# source: https://github.com/civisanalytics/python-glmnet#regularized-logistic-regression
from glmnet import LogitNet

m = LogitNet(
    alpha=1,  # 0 <= alpha <= 1; 0 for ridge, 1 for lasso
)
m = m.fit(x, y)
I'm not sure how to adjust the penalty with LogitNet, but I'll let you figure that out.
3 other
PyMC
You can also take a fully Bayesian approach: rather than using L1-penalized optimization to find a point estimate for your coefficients, you can approximate the distribution of your coefficients given your data. This gives you the same answer as L1-penalized maximum likelihood estimation if you use a Laplace prior for your coefficients; the Laplace prior induces sparsity.
The PyMC folks have a tutorial here on setting something like that up. Good luck.
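A rough sketch of that idea, assuming the current PyMC API (pm.Laplace, pm.Bernoulli, pm.math.sigmoid) and synthetic data in place of yours; this is not the linked tutorial, just an illustration:
import numpy as np
import pymc as pm   # older code uses `import pymc3 as pm`

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([2.0, 0.0, 0.0, -1.5, 0.0])   # sparse "truth", purely illustrative
y = (rng.random(100) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(int)

with pm.Model():
    intercept = pm.Normal("intercept", mu=0.0, sigma=10.0)
    beta = pm.Laplace("beta", mu=0.0, b=1.0, shape=X.shape[1])   # sparsity-inducing prior
    p = pm.math.sigmoid(intercept + pm.math.dot(X, beta))
    pm.Bernoulli("obs", p=p, observed=y)
    trace = pm.sample(1000, tune=1000)               # posterior over the coefficients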

How to find the importance of the features for a logistic regression model?

I have a binary prediction model trained by the logistic regression algorithm. I want to know which features (predictors) are more important for the decision of the positive or negative class. I know there is a coef_ parameter that comes from the scikit-learn package, but I don't know whether it is enough to judge the importance. Another thing is how I can evaluate the coef_ values in terms of their importance for the negative and positive classes. I also read about standardized regression coefficients, and I don't know what they are.
Let's say there are features like size of tumor, weight of tumor, etc. to make a decision for a test case like malignant or not malignant. I want to know which of the features are more important for the malignant and not-malignant prediction. Does that make some sort of sense?
One of the simplest options to get a feeling for the "influence" of a given parameter in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.
Consider this example:
import numpy as np
from sklearn.linear_model import LogisticRegression
x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0   # per-sample noise
X = np.column_stack([x1, x2, x3])
m = LogisticRegression()
m.fit(X, y)
# The estimated coefficients will all be around 1:
print(m.coef_)
# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)
An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:
m.fit(X / np.std(X, 0), y)
print(m.coef_)
Note that this is the most basic approach and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc).
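For example, a minimal bootstrap sketch (continuing with X and y from the example above): refit on resampled rows and look at the spread of each coefficient.
# bootstrap sketch of coefficient variability
rng = np.random.RandomState(0)
boot_coefs = []
for _ in range(200):
    idx = rng.randint(0, len(y), len(y))   # resample rows with replacement
    boot_coefs.append(LogisticRegression().fit(X[idx], y[idx]).coef_.ravel())
boot_coefs = np.array(boot_coefs)

print(boot_coefs.mean(axis=0))   # average coefficient over resamples
print(boot_coefs.std(axis=0))    # spread of each coefficient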
I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.
