Difference between statsmodels OLS and scikit-learn linear regression - python

I am practicing linear regression with the iris dataset.
from sklearn import datasets
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
# load iris data
train = sns.load_dataset('iris')
train
# one-hot-encoding
species_encoded = pd.get_dummies(train["species"], prefix = "speceis")
species_encoded
train = pd.concat([train, species_encoded], axis = 1)
train
# Split by feature and target
feature = ["sepal_length", "petal_length", "speceis_setosa", "speceis_versicolor", "speceis_virginica"]
target = ["petal_width"]
X_train = train[feature]
y_train = train[target]
case 1 : statsmodels
# model
X_train_constant = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_constant).fit()
print("const : {:.6f}".format(model.params[0]))
print(model.params[1:])
result :
const : 0.253251
sepal_length -0.001693
petal_length 0.231921
speceis_setosa -0.337843
speceis_versicolor 0.094816
speceis_virginica 0.496278
case 2 : scikit-learn
# model
model = LinearRegression()
model.fit(X_train, y_train)
print("const : {:.6f}".format(model.intercept_[0]))
print(pd.Series(model.coef_[0], model.feature_names_in_))
result :
const : 0.337668
sepal_length -0.001693
petal_length 0.231921
speceis_setosa -0.422260
speceis_versicolor 0.010399
speceis_virginica 0.411861
Why are the results of statsmodels and sklearn different?
Note that the two sets of results agree on the continuous features and differ only in the constant and in the coefficients of the one-hot-encoded features.

You included a full set of one-hot encoded dummies as regressors. Their sum is a column of ones, which is exactly the constant, so you have perfect multicollinearity: the matrix X'X that OLS needs to invert is singular and has no ordinary inverse.
Under the hood, both statsmodels and sklearn rely on the Moore-Penrose pseudoinverse (or an equivalent minimum-norm least-squares solve), so they handle singular matrices without raising an error. The problem is that the coefficients obtained in the singular case are not uniquely determined, so they carry no meaning on their own. The implementations differ a bit between packages (sklearn relies on scipy.linalg.lstsq, while statsmodels uses its own helper statsmodels.tools.tools.pinv_extended, which is basically numpy.linalg.svd with minimal changes), so at the end of the day they both display «nonsense» (since no meaningful coefficients can be obtained); it is just a design choice which kind of «nonsense» to display.
If you sum the coefficients of the one-hot encoded dummies, you can see that for statsmodels the sum equals the constant (-0.337843 + 0.094816 + 0.496278 = 0.253251), while for sklearn it equals 0 and the constant differs from the statsmodels constant. The coefficients of the variables that are not «responsible» for the perfect multicollinearity (sepal_length and petal_length) are unaffected.
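The two conventions described above can be reproduced with NumPy alone. The following is a minimal sketch on synthetic data (the variable names and the data are made up for illustration, and it mimics rather than calls either library): a minimum-norm solve on the design matrix with an explicit constant stands in for statsmodels, and a minimum-norm solve on centred data stands in for sklearn with fit_intercept=True. A third fit drops one dummy, which removes the collinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 90
group = np.repeat([0, 1, 2], n // 3)          # three categories, 30 rows each
dummies = np.eye(3)[group]                    # full one-hot encoding
x = rng.normal(size=n)                        # one continuous feature
y = 0.5 * x + np.array([1.0, 2.0, 4.0])[group] + rng.normal(scale=0.1, size=n)

# statsmodels-like: explicit constant column, minimum-norm least squares.
X_const = np.column_stack([np.ones(n), x, dummies])
beta_sm, *_ = np.linalg.lstsq(X_const, y, rcond=1e-10)

# sklearn-like: centre X and y, solve, then recover the intercept.
X = np.column_stack([x, dummies])
coef_sk, *_ = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=1e-10)
intercept_sk = y.mean() - X.mean(axis=0) @ coef_sk

# Remedy: drop one dummy so the design matrix has full column rank.
X_drop = np.column_stack([np.ones(n), x, dummies[:, 1:]])
beta_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=1e-10)

print("statsmodels-like: const", beta_sm[0], "dummy sum", beta_sm[2:].sum())
print("sklearn-like:     const", intercept_sk, "dummy sum", coef_sk[1:].sum())
print("slope on x:", beta_sm[1], coef_sk[0], beta_drop[1])

# All three parameterisations give the same fitted values.
fitted_sm = X_const @ beta_sm
fitted_sk = X @ coef_sk + intercept_sk
fitted_drop = X_drop @ beta_drop
```

In this sketch the dummy coefficients sum to the constant in the first fit and to zero in the second, while the slope on x and the fitted values are identical across all three fits. Only the way the shared level is split between the constant and the dummies changes.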

Related

Multivariate quantile regression with splines in python

I want to use multivariate quantile regression with splines to analyze my data. The data contains three independent variables and one dependent variable. I divided the data into a training set and a validation set, fitted the model on the training set, and wanted to use the validation set to verify it. I used quantreg() from statsmodels.formula.api and bs() from patsy to achieve this, but an error occurred as soon as I called predict().
1. I don't know whether this is the right way to implement my idea.
2. How do I use predict() in this situation?
import pandas as pd
import statsmodels.formula.api as smf
import patsy
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(
    data.iloc[:, :3], data.total, test_size=0.1, random_state=1)
train=train_x.join(train_y)
vel = train['vel']
salmean = train['salmean']
em = train['em']
total = train['total']
model = smf.quantreg(
    'total ~ bs(vel, df=3, degree=3) + bs(salmean, df=3, degree=3) + bs(em, df=3, degree=3)',
    train).fit(0.9)
y_pre = model.predict(valid_x)
The information of the error:
PatsyError: predict requires that you use a DataFrame when predicting from a model that was created using the formula api.
The original error message returned by patsy is:
Error evaluating factor: NotImplementedError: some data points fall outside the outermost knots, and I'm not sure how to handle them. (Patches accepted!)
total ~ bs(vel, df=3, degree=3) + bs(salmean, df=3, degree=3) + bs(em, df=3, degree=3)

how python calculates predictions with linear regression?

I'm having trouble working out the formula that Python uses for linear predictions. I did a linear regression using:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_tr_pre_close,Y_tr_pre_close)
then I made predictions using:
predictions=lm.predict(X_te_pre_close)
I had great results with this model, but I can't figure out how lm.predict() computes its output. According to the documentation, the model should be ordinary least squares.
In that case the prediction formula is supposed to be x'b (the vector of coefficients times the vector of explanatory variables), but that doesn't reproduce my results.
LinearRegression doesn't store the intercept as one of the coefficients; it is kept separately in intercept_.
So you can reproduce the predict function like this:
import numpy as np
# using sklearn
pred_sklearn = lm.predict(X_te_pre_close)
# using the coefficients directly: X @ coef + intercept
pred_coef = X_te_pre_close @ lm.coef_.T + lm.intercept_
# compare with a tolerance, since the two routes may differ by floating-point rounding
assert np.allclose(pred_coef, pred_sklearn)

Logistic regression results of sklearn and statsmodels don't match

I tried to do logistic regression using both the sklearn and statsmodels libraries. Their results are close, but not the same. For example, the (slope, intercept) pair obtained by sklearn is (-0.84371207, 1.43255005), while the pair obtained by statsmodels is (-0.8501, 1.4468). Why is that, and how can I make them match?
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model
# Part I: sklearn logistic
url = "https://github.com/pcsanwald/kaggle-titanic/raw/master/train.csv"
titanic_train = pd.read_csv(url)
train_X = pd.DataFrame([titanic_train["pclass"]]).T
train_Y = titanic_train["survived"]
model_1 = linear_model.LogisticRegression(solver = 'lbfgs')
model_1.fit(train_X, train_Y)
print(model_1.coef_) # print slopes
print(model_1.intercept_ ) # print intercept
# Part II: statsmodels logistic
train_X['intercept'] = 1
model_2 = sm.Logit(train_Y, train_X)
result = model_2.fit(method='lbfgs')  # the optimizer is an argument of fit(), not of Logit()
print(result.summary2())
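Sklearn uses L2 regularisation by default and statsmodels does not. Try specifying penalty='none' in the sklearn model parameters and rerun (recent scikit-learn versions expect penalty=None instead of the string 'none'; a very large C has a similar effect).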
See the documentation for more information on logistic regression in sklearn:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
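To illustrate why the default penalty moves the estimates, here is a minimal NumPy sketch, not a call into either library. It assumes sklearn's documented L2 objective, C * sum(log-loss) + 0.5 * slope**2 with an unpenalised intercept, so that l2=1.0 below plays the role of the default C=1. The helper fit_logistic and the synthetic data are hypothetical stand-ins for the Titanic example.

```python
import numpy as np

def fit_logistic(x, y, l2=0.0, n_iter=50):
    """Newton-Raphson for one-feature logistic regression.

    Minimises sum(log-loss) + 0.5 * l2 * slope**2.
    The intercept is not penalised. Returns [slope, intercept].
    """
    X = np.column_stack([x, np.ones_like(x)])   # columns: feature, constant
    penalty = np.diag([l2, 0.0])                # penalise the slope only
    w = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
        grad = X.T @ (p - y) + penalty @ w
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + penalty
        w -= np.linalg.solve(hess, grad)
    return w

rng = np.random.default_rng(0)
x = rng.integers(1, 4, size=200).astype(float)  # stand-in for pclass (1, 2, 3)
true_prob = 1.0 / (1.0 + np.exp(-(1.5 - 0.9 * x)))
y = (rng.uniform(size=200) < true_prob).astype(float)

slope_mle, intercept_mle = fit_logistic(x, y, l2=0.0)  # no penalty, as in statsmodels
slope_l2, intercept_l2 = fit_logistic(x, y, l2=1.0)    # L2 penalty, as in sklearn's default

print("unpenalised:", slope_mle, intercept_mle)
print("L2-penalised:", slope_l2, intercept_l2)
```

The penalised slope is pulled toward zero relative to the unpenalised one, which is the same direction as the small gap reported in the question (-0.8437 versus -0.8501).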

Multiple Linear Regression. Coeffs don't match

I have this small dataset and I want to perform multiple linear regression on it.
First I dropped the deliveries column because of its high correlation with miles. Although gasprice is supposed to be removed as well, I kept it so that I can perform multiple linear regression rather than simple linear regression.
Finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
The code ends here. I get different coefficients every time I print them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not split into train and test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
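The comparison of the three cases can be sketched with NumPy alone. This is an illustration under stated assumptions, not a rerun of the question's data: the columns are synthetic stand-ins for miles and gasprice, the helper ols is hypothetical, and the 80% subset is drawn by hand rather than with train_test_split.

```python
import numpy as np

def ols(X, y):
    """Return (intercept, coefficients) of an ordinary least-squares fit."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

rng = np.random.default_rng(1)
n = 50
# Synthetic stand-ins for the miles and gasprice columns.
X = np.column_stack([rng.uniform(50, 500, n), rng.uniform(2, 4, n)])
y = 1.0 + 0.02 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Case 1: fit on a random 80% subset only.
train_idx = rng.permutation(n)[: int(0.8 * n)]
b0_train, b_train = ols(X[train_idx], y[train_idx])

# Case 2: fit on all rows with an explicit constant column.
b0_full, b_full = ols(X, y)

# Case 3: fit on all rows by centring the data and recovering the intercept.
b_centered, *_ = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=None)
b0_centered = y.mean() - X.mean(axis=0) @ b_centered

print("80% subset:", b0_train, b_train)
print("full data, constant column:", b0_full, b_full)
print("full data, centred:", b0_centered, b_centered)
```

Cases 2 and 3 agree to floating-point precision because they see the same rows. Case 1 differs slightly because it is estimated from a different sample, not because either library is wrong.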

Probability calibration with predicted probability & actual outcome in python

I have been reading about calibration methods for two days but still do not understand how they work. As I understand it, there are two types of calibration:
Platt scaling - the prediction space is partitioned into bins, and for each bin the mean predicted value is plotted against the true fraction of positive cases
Isotonic regression - mathematically, it fits a weighted least-squares problem via quadratic programming, subject to the constraint that each fitted value is non-decreasing with respect to the previous one.
I have written a Python module for calibration based on logistic regression (I know LogisticRegression returns well-calibrated predictions by default, since it directly optimizes log-loss; I built this to check my understanding)
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from pandas import DataFrame
class logistic_Calibration:
    def __init__(self, data, response):
        self.data = data
        self.response = response

    def Calibration(self):
        Xtrain, Xtest, ytrain, ytest = train_test_split(self.data, self.response, test_size=0.20, random_state=36)
        logreg = linear_model.LogisticRegression()
        logreg.fit(Xtrain, np.array(ytrain).flatten())
        PredWO_calibration = logreg.predict_proba(Xtest)
        lossWO_calibration = log_loss(ytest, PredWO_calibration)
        clf_sigmoid = CalibratedClassifierCV(logreg, cv=5, method='sigmoid')
        clf_sigmoid.fit(Xtrain, np.array(ytrain).flatten())
        PredWITH_calibration = clf_sigmoid.predict_proba(Xtest)
        lossWITH_calibration = log_loss(ytest, PredWITH_calibration)
        Loss_difference_WO_minus_W = lossWO_calibration - lossWITH_calibration
        return [lossWO_calibration, lossWITH_calibration, Loss_difference_WO_minus_W]
But I am still unclear on the following points:
1. How does isotonic regression map the scores to probabilities?
2. Platt scaling does not work for real-time data, because there we have no class assigned, which means the Brier score cannot be calculated. If, after fitting a model, I have the predicted probability and the actual outcome on the training data, how can I use calibration with only these two inputs to assign classes for the test data? This is the most important part I would like to know.
Please guide.
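One way to picture the isotonic mapping asked about above is the pool-adjacent-violators procedure. It is a common way to solve the monotone least-squares problem, though not necessarily what any particular library does internally. The sketch below uses only predicted scores and actual outcomes; the helper names fit_isotonic and predict_isotonic are hypothetical, the weights are all equal, and the scores are assumed to be distinct.

```python
import numpy as np

def fit_isotonic(scores, outcomes):
    """Pool adjacent violators: non-decreasing least-squares fit of outcomes on scores.

    Returns the sorted scores and the fitted (calibrated) value at each of them.
    """
    order = np.argsort(scores, kind="stable")
    x = np.asarray(scores, dtype=float)[order]
    y = np.asarray(outcomes, dtype=float)[order]
    blocks = []                                  # each block holds [sum of outcomes, count]
    for value in y:
        blocks.append([value, 1.0])
        # Merge while the previous block's mean exceeds the current block's mean.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = np.concatenate([np.full(int(c), s / c) for s, c in blocks])
    return x, fitted

def predict_isotonic(x_fit, y_fit, new_scores):
    """Map new scores to calibrated probabilities by linear interpolation."""
    return np.interp(new_scores, x_fit, y_fit)   # clipped to the end values outside the range

# Predicted scores and actual outcomes from a held-out calibration set (made-up numbers).
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
outcomes = np.array([0, 1, 0, 0, 1, 1])

x_fit, y_fit = fit_isotonic(scores, outcomes)
calibrated = predict_isotonic(x_fit, y_fit, np.array([0.25, 0.55]))
predicted_class = (calibrated >= 0.5).astype(int)
print(y_fit, calibrated, predicted_class)
```

Once the mapping is fitted, new data needs no labels: each new score is passed through the mapping and thresholded. Labels are only needed for the data used to fit the mapping, which should ideally be held out from the data the original model was trained on.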
