How does Python calculate predictions with linear regression? - python

I'm having trouble working out the formula that Python uses for linear predictions. I fit a linear regression using:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_tr_pre_close,Y_tr_pre_close)
then I made predictions using:
predictions=lm.predict(X_te_pre_close)
I had great results with this model, but now the problem is that I can't figure out how the lm.predict() formula works. The model should be ordinary least squares, as I read in the documentation.
In that case, the prediction formula is supposed to be x'b (the vector of explanatory variables times the vector of coefficients), but that doesn't match my results.

LinearRegression doesn't store the intercept as one of the coefficients, but as intercept_.
So you can reproduce the predict function like this:
import numpy as np
# using sklearn
pred_sklearn = lm.predict(X_te_pre_close)
# using the coefficients and intercept directly ('@' is matrix multiplication)
pred_coef = X_te_pre_close @ lm.coef_.T + lm.intercept_
# compare with a tolerance to allow for floating-point rounding
assert np.allclose(pred_coef, pred_sklearn)

Related

SKLearn Linear Regression but setting certain coefficients before starting

I'd like to run a linear regression using SKLearn on a dataset with say 50 variables. However, I'd like to set the coefficients for say 2 of the variables before it starts training. Is that possible?
You are looking to provide an initial value or guess for the coefficients, and this is not possible with LinearRegression because it calls scipy.linalg.lstsq from SciPy.
I'm also not sure what the purpose of an initial guess would be: for linear regression you can find the least-squares solution directly, via QR decomposition or SVD, so there is no need for a starting point (a small sketch of this follows).
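For reference, here is a minimal sketch of my own (not part of the original answer) showing that the direct least-squares solve behind LinearRegression takes no starting values; the diabetes dataset is used purely as an example:
import numpy as np
from scipy.linalg import lstsq
from sklearn import datasets
from sklearn.linear_model import LinearRegression
X, y = datasets.load_diabetes(return_X_y=True)
lr = LinearRegression().fit(X, y)
# solve the same least-squares problem directly; note there is no x0 argument
X_aug = np.column_stack([X, np.ones(len(X))])  # extra column of ones for the intercept
coefs, *_ = lstsq(X_aug, y)
assert np.allclose(coefs[:-1], lr.coef_)
assert np.isclose(coefs[-1], lr.intercept_)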
If you still want to try it for some purpose, you can look at something like lsmr or curve_fit, but bear in mind that these are not the commonly known linear regression:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.preprocessing import StandardScaler
from scipy.sparse.linalg import lsmr
from scipy.optimize import curve_fit

X, y = datasets.load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# ordinary least squares via sklearn (no initial guess possible)
regr = linear_model.LinearRegression()
regr.fit(X, y)
regr.coef_
array([ -0.47623169, -11.40703082,  24.72625713,  15.42967916,
       -37.68035801,  22.67648701,   4.80620008,   8.422084  ,
        35.73471316,   3.21661161])

# lsmr, an iterative solver that accepts an initial guess x0;
# the first element of the returned tuple is the solution vector
lsmr(X, y, x0=np.repeat(2.0, X.shape[1]))[0]
array([ -0.4762317 , -11.40703083,  24.72625712,  15.42967915,
       -37.68035803,  22.67648699,   4.8062001 ,   8.42208398,
        35.73471314,   3.21661159])

# nonlinear least squares with an initial guess p0
def func(x, *params):
    return x @ params

coef_, cov_ = curve_fit(func, X, y, p0=np.repeat(2.0, X.shape[1]))
coef_
array([ -0.47623371, -11.40702964,  24.72625986,  15.42967394,
       -37.68022801,  22.67639202,   4.8061298 ,   8.42205138,
        35.73466837,   3.21661273])

Why can't I predict new data using SVM and KNN?

I'm new to machine learning and I just learned KNN and SVM with sklearn. How do I make a prediction for new data using SVM or KNN? I have tried both. They make good predictions only for data that is already known; when I try to predict new data, they give an incorrect prediction.
Here is my code:
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVC(kernel='linear')
clf.fit(x, y)
print(clf.predict([[20]]))
print(clf.score(x, y))
Output:
[12.]
1.0
This code makes a good prediction as long as the value to predict is within the range of x_train. But when I try to predict, for example, 20, or anything above the range of x_train, the output is always 12, which is the last element of y. I don't know what I'm doing wrong in the code.
The code is behaving as mathematically described by a support vector machine.
You must understand how your data are being interpreted by the algorithm. You have 11 data points, and you are giving each one a different class. The SVM ends up basically dividing the number line into 11 segments (for the 11 classes you defined):
import numpy as np
import matplotlib.pyplot as plt
# probe the fitted classifier across 1..20 and plot its predictions
data = [(x, clf.predict([[x]])[0]) for x in np.linspace(1, 20, 300)]
plt.scatter([p[0] for p in data], [p[1] for p in data])
plt.show()
The answer by AILearning tells you how to fit your given toy problem, but make sure you also understand why your code wasn't doing what you thought it was. For any finite set of examples there are infinitely many functions that fit the data. Your fundamental issue is you are confusing regression and classification. From the sounds of it, you want a simple regression model to extrapolate a fit function from the data points, but your code is for a classification model.
You have to use a regression model rather than a classification model. For SVM-based regression, use svm.SVR():
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVR(kernel='linear')
clf.fit(x, y)
print(clf.predict([[50]]))
print(clf.score(x, y))
Output:
[50.12]
0.9996
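As a side note (my own sketch, not from the original answers): since this toy data follows exactly y = x + 1, a plain LinearRegression would also fit it and extrapolate correctly:
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]], dtype=np.float64)
y = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=np.float64)
reg = LinearRegression().fit(x, y)
print(reg.predict([[50]]))  # expected ~[51.]
print(reg.score(x, y))      # expected 1.0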

How to compute AIC for linear regression model in Python?

I want to compute AIC for linear models to compare their complexity. I did it as follows:
import numpy as np
from sklearn import linear_model

def aic(y, y_pred, k):
    resid = y - y_pred.ravel()
    sse = sum(resid ** 2)
    AIC = 2*k - 2*np.log(sse)
    return AIC

regr = linear_model.LinearRegression()
regr.fit(X, y)
aic_intercept_slope = aic(y, regr.coef_[0] * X.as_matrix() + regr.intercept_, k=1)
But I receive a divide by zero encountered in log error.
sklearn's LinearRegression is good for prediction but pretty barebones as you've discovered. (It's often said that sklearn stays away from all things statistical inference.)
statsmodels.regression.linear_model.OLS has a property attribute AIC and a number of other pre-canned attributes.
However, note that you'll need to manually add a unit vector to your X matrix to include an intercept in your model.
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant
regr = OLS(y, add_constant(X)).fit()
print(regr.aic)
The source is here if you are looking for an alternative way to compute it manually while still using sklearn.
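If you do prefer the manual route with sklearn, here is a minimal sketch of my own (not from the original answer), using the common Gaussian-likelihood form AIC = n*ln(SSE/n) + 2k and assuming the same X and y as above. Note that statsmodels' .aic includes additional constant terms from the full log-likelihood, so absolute values differ, but comparisons between models on the same data do not:
import numpy as np
from sklearn.linear_model import LinearRegression

regr = LinearRegression().fit(X, y)
y_pred = regr.predict(X)

n = len(y)
k = X.shape[1] + 1                         # slope coefficients plus the intercept
sse = np.sum((np.ravel(y) - y_pred) ** 2)  # sum of squared residuals
aic = n * np.log(sse / n) + 2 * k
print(aic)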

Input format for logistic regression in scikit-learn as in R

When using logistic regression in R, the data input for the glm function (family = binomial) can be given in several formats (see ?family), and specifically in this one:
...
For the binomial and quasibinomial families the response can be specified in one of three ways:
...
As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights)....
I have aggregated data that represents the proportion of successes out of trials (a number between 0 and 1) together with the equivalent weights, and I'm interested in fitting a logistic regression with it, which would be trivial in R.
Unfortunately I can't use R in this project and would like to use scikit-learn to estimate the logistic regression coefficients. More precisely, I'm looking to apply sklearn.linear_model.LogisticRegression with a form of input that lets me pass the proportions and weights, in a similar fashion to what is available in R.
Example:
from sklearn import linear_model
import pandas as pd
df = pd.DataFrame([[1,1,1,0], [1,1,1,0],[1,1,1,1],[2,2,1,1] , [2,2,1,1],[2,2,1,0] , [3,3,1,0] ],columns=['a', 'b','Trials','Success'])
logistic = linear_model.LogisticRegression()
#this works
logistic.fit(X=df[['a','b','Trials']] , y=df.Success)
logistic.predict_proba(df[['a','b','Trials']])
prob_to_success = logistic.predict_proba(df[['a','b','Trials']])[:,1]
prob_to_success
Out[51]: array([ 0.45535843, 0.45535843, 0.45535843, 0.42212169, 0.42212169,
0.42212169, 0.38957565])
# How can I use the following data?
df_agg = df.groupby(['a', 'b'], as_index=False)[['Trials', 'Success']].sum()
df_agg["Prop"] = df_agg.Success / df_agg.Trials
df_agg
# I want to use Prop & Trials as weights in df_agg
Thanks in advance!
One option is to convert to log-odds form and use linear regression on the transformed proportions; sklearn doesn't have a quasi-binomial interface for logistic regression. As you said, this is trivial in R, but sklearn has nothing of the sort built in.
If you want to use weights, you can use them in the fit function of LogisticRegression:
fit(X, y, sample_weight=None)
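To make that concrete, here is a minimal sketch of my own (not from the original answer). It expands the aggregated df_agg from the question into one "success" row and one "failure" row per group and passes the counts as sample_weight:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = df_agg[['a', 'b']].values
successes = df_agg['Success'].values
failures = (df_agg['Trials'] - df_agg['Success']).values

# duplicate each group: once with outcome 1, once with outcome 0,
# weighted by the number of successes and failures respectively
X_rep = np.vstack([X, X])
y_rep = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
w_rep = np.concatenate([successes, failures])

logistic = LogisticRegression()
logistic.fit(X_rep, y_rep, sample_weight=w_rep)
print(logistic.predict_proba(X)[:, 1])  # estimated success probability per group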

Scikit Learn: Logistic Regression model coefficients: Clarification

I need to know how to return the logistic regression coefficients in such a manner that I can generate the predicted probabilities myself.
My code looks like this:
lr = LogisticRegression()
lr.fit(training_data, binary_labels)
# Generate probabilities automatically
predicted_probs = lr.predict_proba(training_data)
I had assumed the lr.coef_ values would follow typical logistic regression, so that I could return the predicted probabilities like this:
sigmoid( dot([val1, val2, offset], lr.coef_.T) )
But this is not the appropriate formulation. Does anyone have the proper format for generating predicted probabilities from Scikit Learn LogisticRegression?
Thanks!
Take a look at the documentation (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html); the offset (intercept) coefficient isn't stored in lr.coef_:
coef_ : array, shape = [n_classes-1, n_features]
    Coefficient of the features in the decision function. coef_ is a readonly property derived from raw_coef_ that follows the internal memory layout of liblinear.
intercept_ : array, shape = [n_classes-1]
    Intercept (a.k.a. bias) added to the decision function. It is available only when the intercept parameter is set to True.
try:
sigmoid( dot([val1, val2], lr.coef_.T) + lr.intercept_ )
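A small self-contained sketch of that (my own illustration, assuming a binary classifier and the training_data / binary_labels arrays from the question):
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = LogisticRegression()
lr.fit(training_data, binary_labels)

# probability of the positive class, reproduced from coef_ and intercept_
X = np.asarray(training_data)
manual_probs = sigmoid(X @ lr.coef_.T + lr.intercept_).ravel()
assert np.allclose(manual_probs, lr.predict_proba(X)[:, 1])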
The easiest way is to read the coef_ attribute of the LR classifier.
For the definition of coef_, check the Scikit-Learn documentation.
Example:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(x_train,y_train)
weight = clf.coef_
