I'd like to run a linear regression using SKLearn on a dataset with say 50 variables. However, I'd like to set the coefficients for say 2 of the variables before it starts training. Is that possible?
You are looking to provide an initial value or guess for the coefficients, and this is not possible with LinearRegression because it calls scipy.linalg.lstsq under the hood.
I'm not sure what the purpose of an initial guess would be here: for linear regression the least-squares solution is found directly via QR decomposition or SVD, so there is no iterative procedure that needs a starting point.
If you still want to try it for some reason, you can use something like lsmr or curve_fit, which do accept a starting vector, but bear in mind that this is no longer the commonly known linear regression:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.preprocessing import StandardScaler
from scipy.optimize import curve_fit
from scipy.sparse.linalg import lsmr

X, y = datasets.load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)
regr = linear_model.LinearRegression()
regr.fit(X,y)
regr.coef_
array([ -0.47623169, -11.40703082, 24.72625713, 15.42967916,
-37.68035801, 22.67648701, 4.80620008, 8.422084 ,
35.73471316, 3.21661161])
# lsmr: iterative least-squares solver that accepts an initial guess x0
lsmr(X, y, x0=np.repeat(2.0, X.shape[1]))
(array([ -0.4762317 , -11.40703083, 24.72625712, 15.42967915,
-37.68035803, 22.67648699, 4.8062001 , 8.42208398,
35.73471314, 3.21661159])
# non-linear least squares via curve_fit, with an initial guess p0
def func(x, *params):
    return x @ params

coef_, cov_ = curve_fit(func, X, y, p0=np.repeat(2.0, X.shape[1]))
coef_
array([ -0.47623371, -11.40702964, 24.72625986, 15.42967394,
-37.68022801, 22.67639202, 4.8061298 , 8.42205138,
35.73466837, 3.21661273])
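If you would rather stay inside scikit-learn, one related option is SGDRegressor, whose fit method accepts a starting coefficient vector via coef_init; note that it is stochastic gradient descent with an L2 penalty by default, not an exact least-squares solve. A minimal sketch, reusing the scaled diabetes X, y from above:
import numpy as np
from sklearn.linear_model import SGDRegressor

# warm-start the optimisation from coefficients of your choosing;
# the final fit is still determined by the (penalised) SGD objective
sgd = SGDRegressor(max_iter=10000, tol=1e-6)
sgd.fit(X, y, coef_init=np.repeat(2.0, X.shape[1]),
        intercept_init=np.array([y.mean()]))
print(sgd.coef_)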
I'm having trouble working out the formula that Python uses for linear predictions. I did a linear regression using:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_tr_pre_close,Y_tr_pre_close)
then I made predictions using:
predictions=lm.predict(X_te_pre_close)
I had great results with this model, but now the problem is that I can't figure out how the lm.predict() formula works. The model should be ordinary least squares, as I read in the documentation.
In that case the prediction formula is supposed to be x'b (vector of explanatory variables times vector of coefficients), but it doesn't match my results.
LinearRegression doesn't store the intercept as one of the coefficients, but as intercept_.
So you can reproduce the predict function like that:
import numpy as np

# using sklearn
pred_sklearn = lm.predict(X_te_pre_close)
# using the coefficients and intercept directly: X @ coef + intercept
pred_coef = X_te_pre_close @ lm.coef_.T + lm.intercept_
# identical up to floating-point rounding
assert np.allclose(pred_coef, pred_sklearn)
What exactly is calculated when we pass something with no predict method to cross_val_score, like here
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA
# X is some data, say two dimensional numpy array of reals
cross_val_score(PCA(n_components=10), X)
That is, using cross_val_score without y, and without predict.
I asked it previously here, but there was no reply.
Thanks!
In this case, PCA has a score method (see the docs): it returns "the average log-likelihood of all samples". So cross_val_score returns that score, computed on the held-out data of each cross-validation fold.
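A minimal sketch of what that means in practice (toy random X, purely for illustration): each fold fits the PCA on the training part and calls its score method on the held-out part, and you can reproduce the numbers by hand:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score

X = np.random.RandomState(0).randn(200, 20)  # toy data, illustration only

scores = cross_val_score(PCA(n_components=10), X, cv=5)

# reproduce the same numbers manually: fit on the training fold,
# score (average log-likelihood) on the held-out fold
manual = []
for train, test in KFold(n_splits=5).split(X):
    pca = PCA(n_components=10).fit(X[train])
    manual.append(pca.score(X[test]))

print(np.allclose(scores, manual))  # True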
I'm new to machine learning and I just learned KNN and SVM with sklearn. How do I make a prediction for new data using SVM or KNN? I have tried both to make prediction. They make good prediction only when the data is already known. But when I try to predict new data, they give an incorrect prediction.
Here is my code:
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVC(kernel='linear')
clf.fit(x, y)
print(clf.predict([[20]]))
print(clf.score(x, y))
Output:
[12.]
1.0
This code makes good predictions as long as the input lies within the range of x_train. But when I try to predict, for example, 20, or anything above the range of x_train, the output is always 12, which is the last element of y. I don't know what I'm doing wrong in the code.
The code is behaving as mathematically described by a support vector machine.
You must understand how your data are being interpreted by the algorithm. You have 11 data points, and you are giving each one a different class. The SVM ends up basically dividing the number line into 11 segments (for the 11 classes you defined):
import numpy as np
import matplotlib.pyplot as plt

# predict on a fine grid to visualise the 11 decision regions
data = [(x, clf.predict([[x]])[0]) for x in np.linspace(1, 20, 300)]
plt.scatter([p[0] for p in data], [p[1] for p in data])
plt.show()
The answer by AILearning tells you how to fit your given toy problem, but make sure you also understand why your code wasn't doing what you thought it was. For any finite set of examples there are infinitely many functions that fit the data. Your fundamental issue is that you are confusing regression and classification: from the sounds of it, you want a regression model that extrapolates a fitted function beyond the data points, but your code builds a classification model.
You have to use a regression model rather than a classification model. For svm based regression use svm.SVR()
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVR(kernel='linear')
clf.fit(x, y)
print(clf.predict([[50]]))
print(clf.score(x, y))
output:
[50.12]
0.9996
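For what it's worth, on this particular toy data (y is exactly x + 1) an ordinary regression model recovers the relationship and extrapolates exactly; a minimal sketch with LinearRegression:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(x, y)                      # same x, y as above
print(lr.predict([[50]]))         # ~[51.]
print(lr.coef_, lr.intercept_)    # ~[1.] and ~1.0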
I am trying to apply linear regression to a dataset of 9 samples with around 50 features using Python. I have tried different methodologies, i.e. closed-form OLS (ordinary least squares), LR (LinearRegression), HR (Huber regression) and NNLS (non-negative least squares), and each of them gives different weights.
I can see why HR and NNLS give different solutions, but LR and closed-form OLS share the same objective: minimizing the sum of squared differences between the observed values and those predicted by a linear function of the features. Since the training matrix is singular, I had to use the pseudoinverse for the closed-form OLS.
# closed-form OLS via the normal equations, using a pseudoinverse because X^T X is singular
w = np.dot(train_features.T, train_features)
w1 = np.dot(np.linalg.pinv(w), np.dot(train_features.T, train_target))
For LR I have used scikit-learn's LinearRegression, which uses the LAPACK library from www.netlib.org to solve the least-squares problem:
linear_model.LinearRegression()
A system of linear (or polynomial) equations is called underdetermined if the number of equations is smaller than the number of unknown parameters. Each unknown parameter counts as an available degree of freedom, and each equation acts as a constraint that removes one degree of freedom. As a result, an underdetermined system can have infinitely many solutions or no solution at all. Since in our case the system is underdetermined and also singular, there are infinitely many solutions.
Both the pseudoinverse and LAPACK try to find the minimum-norm solution of an underdetermined system when the number of samples is smaller than the number of features. Why, then, do the closed form and LR give completely different solutions for the same system of linear equations? Am I missing something that explains the behaviour of the two approaches? For example, if the pseudoinverse is computed in different ways (SVD, QR/LQ factorization), can they produce different solutions for the same set of equations?
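As a quick sanity check of the minimum-norm claim itself, on a toy underdetermined system (random data, purely for illustration) the pseudoinverse and SciPy's LAPACK-based lstsq do return the same solution:
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
A = rng.randn(9, 50)            # 9 equations, 50 unknowns: underdetermined
b = rng.randn(9)

w_pinv = np.linalg.pinv(A) @ b  # minimum-norm least-squares solution via SVD
w_lstsq, *_ = lstsq(A, b)       # default gelsd driver, also minimum-norm

print(np.allclose(w_pinv, w_lstsq))  # True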
Check out the docs of sklearn's LinearRegression again.
By default (which is how you are calling it), it also fits an intercept term!
Demo:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
X, y = load_boston(return_X_y=True)  # note: load_boston was removed in scikit-learn 1.2
""" OLS custom """
w = np.dot(np.linalg.pinv(X), y)
print('custom')
print(w)
""" sklearn's LinearRegression (default) """
clf = LinearRegression()
print('sklearn default')
print(clf.fit(X, y).coef_)
""" sklearn's LinearRegression (no intercept-fitting) """
print('sklearn fit_intercept=False')
clf = LinearRegression(fit_intercept=False)
print(clf.fit(X, y).coef_)
Output:
custom
[ -9.16297843e-02 4.86751203e-02 -3.77930006e-03 2.85636751e+00
-2.88077933e+00 5.92521432e+00 -7.22447929e-03 -9.67995240e-01
1.70443393e-01 -9.38925373e-03 -3.92425680e-01 1.49832102e-02
-4.16972624e-01]
sklearn default
[ -1.07170557e-01 4.63952195e-02 2.08602395e-02 2.68856140e+00
-1.77957587e+01 3.80475246e+00 7.51061703e-04 -1.47575880e+00
3.05655038e-01 -1.23293463e-02 -9.53463555e-01 9.39251272e-03
-5.25466633e-01]
sklearn fit_intercept=False
[ -9.16297843e-02 4.86751203e-02 -3.77930006e-03 2.85636751e+00
-2.88077933e+00 5.92521432e+00 -7.22447929e-03 -9.67995240e-01
1.70443393e-01 -9.38925373e-03 -3.92425680e-01 1.49832102e-02
-4.16972624e-01]
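To see that the intercept really is the whole difference, you can also reproduce sklearn's default fit with the pseudoinverse by appending a constant column to X; a quick sketch, continuing the snippet above (X is full rank here, so the solutions coincide exactly):
# append a column of ones so the intercept becomes one more coefficient
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.dot(np.linalg.pinv(X1), y)

print(w_aug[:-1])  # matches the 'sklearn default' coefficients above
print(w_aug[-1])   # matches the corresponding intercept_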
The scikit-learn package provides the classes Lasso() and LassoCV() but no option to fit a logistic function instead of a linear one... How do I perform logistic lasso in Python?
The Lasso optimizes a least-squares problem with an L1 penalty.
By definition you can't optimize a logistic function with the Lasso.
If you want to optimize a logistic function with an L1 penalty, you can use the LogisticRegression estimator with the L1 penalty:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
log = LogisticRegression(penalty='l1', solver='liblinear')
log.fit(X, y)
Note that only the LIBLINEAR and SAGA (added in v0.19) solvers handle the L1 penalty.
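A small usage sketch of what the L1 penalty buys you, continuing with the iris X, y above (the exact counts are not the point, just the trend): shrinking C strengthens the penalty and pushes more coefficients exactly to zero, which is the lasso-like sparsity you are after:
import numpy as np

for C in [1.0, 0.1, 0.01]:
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C).fit(X, y)
    print(C, int(np.sum(clf.coef_ == 0)), 'coefficients at zero')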
You can use glmnet in Python. Glmnet uses warm starts and active-set convergence, so it is extremely efficient. Those techniques make glmnet faster than other lasso implementations. You can download it from https://web.stanford.edu/~hastie/glmnet_python/
1 scikit-learn: sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegression from scikit-learn is probably the best:
as #TomDLT said, Lasso is for the least squares (regression) case, not logistic (classification).
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(
    penalty='l1',
    solver='saga',  # or 'liblinear'
    C=regularization_strength)
model.fit(x, y)
2 python-glmnet: glmnet.LogitNet
You can also use Civis Analytics' python-glmnet library. This implements the scikit-learn BaseEstimator API:
# source: https://github.com/civisanalytics/python-glmnet#regularized-logistic-regression
from glmnet import LogitNet
m = LogitNet(
    alpha=1,  # 0 <= alpha <= 1; 0 for ridge, 1 for lasso
)
m = m.fit(x, y)
I'm not sure how to adjust the penalty with LogitNet, but I'll let you figure that out.
3 other
PyMC
You can also take a fully Bayesian approach. Rather than using L1-penalized optimization to find a point estimate for your coefficients, you can approximate the distribution of your coefficients given your data. This gives you the same answer as L1-penalized maximum-likelihood estimation if you use a Laplace prior for your coefficients, and the Laplace prior induces sparsity.
The PyMC folks have a tutorial here on setting something like that up. Good luck.
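A minimal sketch of that idea (assuming a recent PyMC; the names, priors and toy data are mine, and the b parameter of the Laplace prior plays the role of the regularization strength):
import numpy as np
import pymc as pm  # for older installs: import pymc3 as pm

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                                      # toy features
y = (X[:, 0] - X[:, 1] + rng.randn(100) > 0).astype(int)   # toy binary target

with pm.Model():
    # Laplace prior on the weights: the Bayesian analogue of an L1 penalty
    w = pm.Laplace('w', mu=0.0, b=1.0, shape=X.shape[1])
    b0 = pm.Normal('b0', mu=0.0, sigma=10.0)
    p = pm.math.sigmoid(pm.math.dot(X, w) + b0)
    pm.Bernoulli('obs', p=p, observed=y)
    trace = pm.sample(1000, tune=1000, chains=2)  # draws posterior samples for w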