The scikit-learn package provides the functions Lasso() and LassoCV() but no option to fit a logistic function instead of a linear one. How can I perform logistic lasso in Python?
The Lasso optimizes a least-squares problem with an L1 penalty.
By definition you can't optimize a logistic function with the Lasso.
If you want to optimize a logistic function with an L1 penalty, you can use the LogisticRegression estimator with the L1 penalty:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
log = LogisticRegression(penalty='l1', solver='liblinear')
log.fit(X, y)
Note that only the LIBLINEAR and SAGA (added in v0.19) solvers handle the L1 penalty.
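To see the lasso-like behaviour, it can help to check how the C parameter controls sparsity. The sketch below is only an illustration: the synthetic data from make_classification, the helper count_nonzero_coefs and the two C values are assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (an assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)


def count_nonzero_coefs(C):
    """Fit an L1-penalized logistic regression and count non-zero coefficients."""
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(X, y)
    return int(np.count_nonzero(model.coef_))


# C is the inverse of the regularization strength:
# a smaller C means a stronger penalty and a sparser model.
nnz_strong = count_nonzero_coefs(1e-4)
nnz_weak = count_nonzero_coefs(10.0)
print(nnz_strong, nnz_weak)
```

With a very small C the penalty dominates and the coefficients are driven to zero; with a large C most of them stay in the model.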
You can use glmnet in Python. Glmnet uses warm starts and active-set convergence, so it is extremely efficient; those techniques often make glmnet faster than other lasso implementations. You can download it from https://web.stanford.edu/~hastie/glmnet_python/
1 scikit-learn: sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegression from scikit-learn is probably the best:
As @TomDLT said, Lasso is for the least squares (regression) case, not logistic (classification).
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(
    penalty='l1',
    solver='saga',  # or 'liblinear'
    C=regularization_strength)  # note: C is the inverse of the regularization strength
model.fit(x, y)
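Since the question also mentions LassoCV, the closest scikit-learn counterpart for the logistic case is probably LogisticRegressionCV, which picks C by cross-validation. A minimal sketch, assuming synthetic data and an arbitrary grid size and scoring choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary data (an assumption, for illustration only).
X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=0)

model = LogisticRegressionCV(
    Cs=10,                   # grid of 10 candidate C values on a log scale
    cv=5,                    # 5-fold cross-validation
    penalty='l1',
    solver='liblinear',
    scoring='neg_log_loss',  # illustrative choice of selection metric
)
model.fit(X, y)

best_C = float(np.ravel(model.C_)[0])
print('selected C:', best_C)
print('non-zero coefficients:', int(np.count_nonzero(model.coef_)))
```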
2 python-glmnet: glmnet.LogitNet
You can also use Civis Analytics' python-glmnet library. This implements the scikit-learn BaseEstimator API:
# source: https://github.com/civisanalytics/python-glmnet#regularized-logistic-regression
from glmnet import LogitNet
m = LogitNet(
    alpha=1,  # 0 <= alpha <= 1, 0 for ridge, 1 for lasso
)
m = m.fit(x, y)
I'm not sure how to adjust the penalty with LogitNet, but I'll let you figure that out.
3 other
PyMC
You can also take a fully Bayesian approach. Rather than use L1-penalized optimization to find a point estimate for your coefficients, you can approximate the posterior distribution of your coefficients given your data. If you use a Laplace prior for your coefficients, the posterior mode (the MAP estimate) is the same answer as L1-penalized maximum likelihood estimation. This is the sense in which the Laplace prior induces sparsity.
The PyMC folks have a tutorial here on setting something like that up. Good luck.
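The link between the Laplace prior and the L1 penalty can be written out in a few lines. The notation here (coefficients \beta, Laplace scale b, penalty weight \lambda) is introduced only for this sketch:

```latex
\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta} \Big[ \log p(y \mid X, \beta) + \log p(\beta) \Big],
\qquad
p(\beta) = \prod_{j} \frac{1}{2b} \exp\!\left( -\frac{|\beta_j|}{b} \right)

\log p(\beta) = -\frac{1}{b} \sum_{j} |\beta_j| + \text{const}

\hat{\beta}_{\mathrm{MAP}}
  = \arg\min_{\beta} \Big[ -\log p(y \mid X, \beta) + \lambda \sum_{j} |\beta_j| \Big],
\qquad \lambda = \frac{1}{b}
```

So the posterior mode under a Laplace prior with scale b is the L1-penalized maximum likelihood estimate with penalty weight 1/b. The full posterior carries more information than this single point.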
Related
I'm having trouble working out the formula that Python uses for linear predictions. I did a linear regression using:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_tr_pre_close,Y_tr_pre_close)
then I made predictions using:
predictions=lm.predict(X_te_pre_close)
I had great results with this model, but now the problem is that I can't figure out how the lm.predict() formula works. The model should be ordinary least squares, as I read in the documentation.
In this case, the prediction formula is supposed to be x'b (vector of coefficients * vector of explanatory variables), but it doesn't fit my results.
LinearRegression doesn't store the intercept as one of the coefficients, but as intercept_.
So you can reproduce the predict function like this:
import numpy as np
# using sklearn
pred_sklearn = lm.predict(X_te_pre_close)
# using the coefficients directly:
pred_coef = X_te_pre_close @ lm.coef_.T + lm.intercept_
# compare with a tolerance rather than exact floating-point equality
assert np.allclose(pred_coef, pred_sklearn)
I'd like to run a linear regression using SKLearn on a dataset with say 50 variables. However, I'd like to set the coefficients for say 2 of the variables before it starts training. Is that possible?
You are looking to provide an initial value or guess for the coefficients, and this is not possible for LinearRegression because it calls scipy.linalg.lstsq from scipy, which takes no starting point.
I'm not sure what the purpose of an initial guess would be: for linear regression, you can fit the model, that is, find the least-squares solution, using a QR decomposition or SVD, so there is no need to provide an initial guess.
If you want to try it anyway, you can use something like lsmr or curve_fit, but bear in mind that this is not the usual way to fit a linear regression:
import numpy as np
from sklearn import datasets, linear_model
from scipy.optimize import curve_fit
from scipy.sparse.linalg import lsmr
from sklearn.preprocessing import StandardScaler
X, y = datasets.load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)
regr = linear_model.LinearRegression()
regr.fit(X,y)
regr.coef_
array([ -0.47623169, -11.40703082, 24.72625713, 15.42967916,
-37.68035801, 22.67648701, 4.80620008, 8.422084 ,
35.73471316, 3.21661161])
# lsmr (iterative solver; x0 is the initial guess)
lsmr(X,y,x0 = np.repeat(2.0,X.shape[1]))
(array([ -0.4762317 , -11.40703083, 24.72625712, 15.42967915,
-37.68035803, 22.67648699, 4.8062001 , 8.42208398,
35.73471314, 3.21661159])
# non-linear least squares (p0 is the initial guess)
def func(x, *params):
    return x @ np.asarray(params)
coef_, cov_ = curve_fit(func, X, y, p0=np.repeat(2, X.shape[1]))
coef_
array([ -0.47623371, -11.40702964, 24.72625986, 15.42967394,
-37.68022801, 22.67639202, 4.8061298 , 8.42205138,
35.73466837, 3.21661273])
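If the goal is instead to hold two coefficients fixed at known values (a different reading of the question from the initial-guess one answered above), one common workaround is to subtract the fixed columns' contribution from the target and fit only the remaining columns. This is a sketch on synthetic data; the names fixed_idx, fixed_coef and free_idx are hypothetical, and the technique is not something LinearRegression offers as an option.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noiseless data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.5, 3.0, -1.0])
y = X @ true_coef + 4.0  # intercept of 4.0

fixed_idx = [0, 1]                  # columns whose coefficients are held fixed
fixed_coef = np.array([1.5, -2.0])  # values chosen in advance (hypothetical)
free_idx = [2, 3, 4]                # columns left for the regression to estimate

# Remove the fixed columns' contribution from the target, then fit the rest.
y_adjusted = y - X[:, fixed_idx] @ fixed_coef
lm = LinearRegression().fit(X[:, free_idx], y_adjusted)

# Reassemble the full coefficient vector.
full_coef = np.empty(X.shape[1])
full_coef[fixed_idx] = fixed_coef
full_coef[free_idx] = lm.coef_
print(full_coef, lm.intercept_)
```

Predictions then have to be computed from full_coef and lm.intercept_ by hand, since the fitted estimator only knows about the free columns.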
I am trying to fit a multivariable linear regression on a dataset to find out how well the model explains the data. My predictors have 120 dimensions and I have 177 samples:
X.shape=(177,120), y.shape=(177,)
Using statsmodels, I get a very good R-squared of 0.76 with a Prob(F-statistic) of 0.06 which trends towards significance and indicates a good model for the data.
When I use scikit-learn's linear regression and try to compute 5-fold cross validation r2 score, I get an average r2 score of -5.06 which shows very poor generalization performance.
The two models should be exactly the same, since their training r2 scores are. So why are the performance evaluations from these libraries so different? Which one should I use? I would greatly appreciate your comments on this.
Here is my code for your reference:
# using statsmodel:
import statsmodels.api as sm
X = sm.add_constant(X)
est = sm.OLS(y, X)
est2 = est.fit()
print(est2.summary())
# using scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print('train r2 score:', lin_reg.score(X, y))
cv_results = cross_val_score(lin_reg, X, y, cv=5, scoring='r2')
msg = "%s: %f (%f)" % ('r2 score', cv_results.mean(), cv_results.std())
print(msg)
The difference in R-squared arises because one value is computed on the training sample and the other on the left-out cross-validation sample.
You are most likely strongly overfitting, with 121 regressors (including the constant) and only 177 observations, and without regularization or variable selection.
Statsmodels only reports R-squared (R2) for the training sample; there is no cross-validation. Scikit-learn has to reduce the training sample size for cross-validation, which makes overfitting even worse.
A low cross-validation score, as reported by scikit-learn, then means that the overfitted estimates do not generalize to the left-out data and are instead matching idiosyncratic features of the training sample.
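The effect is easy to reproduce. The sketch below uses the shapes from the question but pure-noise data (an assumption, so there is nothing real to learn), and compares the training R2 with the mean 5-fold cross-validated R2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 177, 120  # shapes taken from the question
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)    # pure noise, unrelated to X

# R2 on the same data the model was fitted to.
lin_reg = LinearRegression().fit(X, y)
train_r2 = lin_reg.score(X, y)

# R2 on held-out folds, averaged over 5 folds.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2').mean()

print('train r2:', train_r2)
print('mean 5-fold CV r2:', cv_r2)
```

Even though y is unrelated to X, the training R2 comes out high simply because there are almost as many regressors as observations, while the cross-validated R2 is negative.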
I am trying to apply linear regression to a dataset of 9 samples with around 50 features using Python. I have tried different methods, i.e. closed-form OLS (ordinary least squares), LR (linear regression), HR (Huber regression) and NNLS (non-negative least squares), and each of them gives different weights.
I can see intuitively why HR and NNLS have different solutions, but LR and closed-form OLS have the same objective function: minimizing the sum of the squared differences between the observed values in the given sample and those predicted by a linear function of a set of features. Since the training set is singular, I had to use the pseudoinverse to perform closed-form OLS.
w = np.dot(train_features.T, train_features)
w1 = np.dot(np.linalg.pinv(w), np.dot(train_features.T,train_target))
For LR I used scikit-learn's LinearRegression, which uses the LAPACK library from www.netlib.org to solve the least-squares problem:
linear_model.LinearRegression()
A system of linear equations (or of polynomial equations) is called underdetermined if the number of equations is less than the number of unknown parameters. Each unknown parameter can be counted as an available degree of freedom, and each equation acts as a constraint that restricts one degree of freedom. As a result, an underdetermined system can have infinitely many solutions or no solution at all. Since in our case study the system is underdetermined and also singular, there are many solutions.
Both the pseudoinverse and the LAPACK library try to find the minimum-norm solution of an underdetermined system when the number of samples is less than the number of features. Then why do the closed form and LR give completely different solutions to the same system of linear equations? Am I missing something here that can explain the behavior of both approaches? For example, if the pseudoinverse is computed in different ways (SVD, QR/LQ factorization), can they produce different solutions for the same set of equations?
Check out the docs of sklearn's LinearRegression again.
By default (which is how you call it), it also fits an intercept term!
Demo:
import numpy as np
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2, so this needs an older version
from sklearn.linear_model import LinearRegression
X, y = load_boston(return_X_y=True)
""" OLS custom """
w = np.dot(np.linalg.pinv(X), y)
print('custom')
print(w)
""" sklearn's LinearRegression (default) """
clf = LinearRegression()
print('sklearn default')
print(clf.fit(X, y).coef_)
""" sklearn's LinearRegression (no intercept-fitting) """
print('sklearn fit_intercept=False')
clf = LinearRegression(fit_intercept=False)
print(clf.fit(X, y).coef_)
Output:
custom
[ -9.16297843e-02 4.86751203e-02 -3.77930006e-03 2.85636751e+00
-2.88077933e+00 5.92521432e+00 -7.22447929e-03 -9.67995240e-01
1.70443393e-01 -9.38925373e-03 -3.92425680e-01 1.49832102e-02
-4.16972624e-01]
sklearn default
[ -1.07170557e-01 4.63952195e-02 2.08602395e-02 2.68856140e+00
-1.77957587e+01 3.80475246e+00 7.51061703e-04 -1.47575880e+00
3.05655038e-01 -1.23293463e-02 -9.53463555e-01 9.39251272e-03
-5.25466633e-01]
sklearn fit_intercept=False
[ -9.16297843e-02 4.86751203e-02 -3.77930006e-03 2.85636751e+00
-2.88077933e+00 5.92521432e+00 -7.22447929e-03 -9.67995240e-01
1.70443393e-01 -9.38925373e-03 -3.92425680e-01 1.49832102e-02
-4.16972624e-01]
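The same comparison can be run on an underdetermined problem of the size described in the question (9 samples, 50 features). This is a sketch on random data, which is an assumption; it only illustrates that the pseudoinverse solution and LinearRegression(fit_intercept=False) agree, while the default with an intercept does not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 50))  # 9 samples, 50 features, as in the question
y = rng.normal(size=9)

# Closed form via the pseudoinverse: the minimum-norm least-squares solution.
w_pinv = np.linalg.pinv(X) @ y

# scikit-learn without an intercept solves the same problem.
w_no_intercept = LinearRegression(fit_intercept=False).fit(X, y).coef_

# The default also fits an intercept, which changes the coefficients.
w_default = LinearRegression().fit(X, y).coef_

print('pinv vs fit_intercept=False match:', np.allclose(w_pinv, w_no_intercept))
print('pinv vs default match:', np.allclose(w_pinv, w_default))
```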
I have a dataframe X which is comprised of 60 features and ~ 450k outcomes. My response variable y is categorical (survival, no survival).
I would like to use RFECV to reduce the number of significant features for my estimator (right now, logistic regression) on Xtrain, which I would like to score by area under the ROC curve. features_selected is a list of all features.
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm
# Create train and test datasets to evaluate each model
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.70)
# Use RFECV to reduce features
# Create a logistic regression estimator
logreg = lm.LogisticRegression()
# Use RFECV to pick best features, using stratified k-fold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(n_splits=10), scoring='roc_auc')
# Fit the features to the response variable
X_new = rfecv.fit_transform(Xtrain[features_selected], ytrain)
I have a few questions:
a) X_new returns different features when run on separate occasions (one time it returned 5 features, on another run it returned 9, and one is not a subset of the other). Why would this be?
b) Does this imply an unstable solution? While using the same seed for StratifiedKFold should solve this problem, does this mean I need to reconsider the approach in totality?
c) In general, how do I approach tuning? E.g., features are selected BEFORE tuning in my current implementation. Would tuning affect the significance of certain features? Or should I tune simultaneously?
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized sub-samples. Therefore, it's not surprising to get different results every time you execute the algorithm unless the random seed is fixed. Source
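To check whether the run-to-run differences come only from the random splits, every source of randomness can be tied to a seed. The sketch below uses synthetic data and a hypothetical helper select_features; the seed value and fold count are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic binary data (an assumption, for illustration only).
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)


def select_features(seed):
    """Run RFECV with the train/test split and the CV folds tied to `seed`."""
    Xtrain, _, ytrain, _ = train_test_split(
        X, y, train_size=0.70, stratify=y, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    rfecv = RFECV(estimator=LogisticRegression(max_iter=1000),
                  cv=cv, scoring='roc_auc')
    rfecv.fit(Xtrain, ytrain)
    return rfecv.support_  # boolean mask of the selected features


first = select_features(seed=0)
second = select_features(seed=0)
print('same features selected:', np.array_equal(first, second))
```

Fixing the seed makes a single run reproducible. It does not by itself show that the selected subset is stable, so it is still worth comparing the selections across several different seeds.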
Another approach uses Pearson's correlation coefficient. With this method, you calculate the correlation coefficient between each pair of features and remove one feature from each highly correlated pair. This method could be considered a more stable solution to such a problem. Source
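A correlation filter of this kind can be sketched with pandas. The data, the column names and the 0.9 threshold below are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
df = pd.DataFrame({
    'a': a,
    'b': b,
    'c': a + 0.01 * rng.normal(size=500),  # near-duplicate of 'a'
})

threshold = 0.9  # hypothetical cut-off for "highly correlated"

# Absolute Pearson correlation between every pair of features.
corr = df.corr(method='pearson').abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)
```

This filter looks only at the features and never at the response, so it is deterministic, but it may keep features that are uncorrelated with each other yet uninformative about y.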