Logistic regression results of sklearn and statsmodels don't match - python

I tried to do logistic regression using both the sklearn and statsmodels libraries. Their results are close, but not the same. For example, the (slope, intercept) pair obtained by sklearn is (-0.84371207, 1.43255005), while the pair obtained by statsmodels is (-0.8501, 1.4468). Why, and how can I make them the same?
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model
# Part I: sklearn logistic
url = "https://github.com/pcsanwald/kaggle-titanic/raw/master/train.csv"
titanic_train = pd.read_csv(url)
train_X = pd.DataFrame([titanic_train["pclass"]]).T
train_Y = titanic_train["survived"]
model_1 = linear_model.LogisticRegression(solver = 'lbfgs')
model_1.fit(train_X, train_Y)
print(model_1.coef_) # print slopes
print(model_1.intercept_ ) # print intercept
# Part II: statsmodels logistic
train_X['intercept'] = 1
model_2 = sm.Logit(train_Y, train_X)
result = model_2.fit(method='lbfgs')  # the optimizer belongs to fit(), not to the Logit constructor
print(result.summary2())

scikit-learn applies L2 regularisation by default, while statsmodels does not. Try specifying penalty='none' in the sklearn model parameters and rerun.
See the documentation for more information on logistic regression in sklearn:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
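For example, a minimal sketch of the suggested fix (an assumption about your scikit-learn version: releases from 1.2 onwards spell the unpenalized option penalty=None, while older ones accept the string penalty='none'):
# Refit without regularization so the maximum-likelihood estimates match statsmodels' Logit
model_1 = linear_model.LogisticRegression(penalty='none', solver='lbfgs')
model_1.fit(train_X[["pclass"]], train_Y)
print(model_1.coef_, model_1.intercept_)  # should now agree with result.summary2()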

Related

Difference between statsmodels OLS and scikit-learn linear regression

I tried to practice building a linear regression model with the iris dataset.
from sklearn import datasets
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
# load iris data
train = sns.load_dataset('iris')
train
# one-hot-encoding
species_encoded = pd.get_dummies(train["species"], prefix = "species")
species_encoded
train = pd.concat([train, species_encoded], axis = 1)
train
# Split by feature and target
feature = ["sepal_length", "petal_length", "species_setosa", "species_versicolor", "species_virginica"]
target = ["petal_width"]
X_train = train[feature]
y_train = train[target]
case 1 : statsmodels
# model
X_train_constant = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_constant).fit()
print("const : {:.6f}".format(model.params[0]))
print(model.params[1:])
result :
const : 0.253251
sepal_length -0.001693
petal_length 0.231921
species_setosa -0.337843
species_versicolor 0.094816
species_virginica 0.496278
case 2 : scikit-learn
# model
model = LinearRegression()
model.fit(X_train, y_train)
print("const : {:.6f}".format(model.intercept_[0]))
print(pd.Series(model.coef_[0], model.feature_names_in_))
result :
const : 0.337668
sepal_length -0.001693
petal_length 0.231921
species_setosa -0.422260
species_versicolor 0.010399
species_virginica 0.411861
Why are the results of statsmodels and sklearn different?
Additionally, the coefficients of the two models are the same except for some or all of the one-hot-encoded features (and the constant).
You included a full set of one-hot encoded dummies as regressors; they sum to the constant column, so you have perfect multicollinearity: your covariance matrix is singular and you can't take its inverse.
Under the hood both statsmodels and sklearn rely on the Moore-Penrose pseudoinverse and can invert singular matrices just fine; the problem is that the coefficients obtained in the singular-covariance-matrix case don't mean anything in any physical sense. The implementations differ a bit between packages (sklearn relies on scipy.linalg.lstsq, statsmodels has a custom procedure statsmodels.tools.pinv_extended, which is basically numpy.linalg.svd with minimal changes), so at the end of the day they both display «nonsense» (since no meaningful coefficients can be obtained); it's just a design choice of what kind of «nonsense» to display.
If you take the sum of the coefficients of the one-hot encoded dummies, you can see that for statsmodels it is equal to the constant, while for sklearn it is equal to 0 and the constant differs from the statsmodels constant. The coefficients of variables that are not «responsible» for the perfect multicollinearity are unaffected.
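A minimal sketch of one way to remove the ambiguity, assuming the same train and y_train as above: drop one dummy level so the design matrix has full rank, after which both packages report the same slope coefficients.
# Drop one dummy level (and force float dtype for statsmodels) to remove the perfect multicollinearity
species_encoded = pd.get_dummies(train["species"], prefix="species", drop_first=True, dtype=float)
X_full_rank = pd.concat([train[["sepal_length", "petal_length"]], species_encoded], axis=1)
sm_fit = sm.OLS(y_train, sm.add_constant(X_full_rank)).fit()
sk_fit = LinearRegression().fit(X_full_rank, y_train)
print(sm_fit.params)                    # statsmodels constant and slopes
print(sk_fit.intercept_, sk_fit.coef_)  # the slopes now match the statsmodels slopes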

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and I want to perform multiple linear regression on it.
First I drop the deliveries column because of its high correlation with miles. Although gasprice is arguably supposed to be removed as well, I keep it so that I am doing multiple linear regression rather than simple linear regression.
Finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coefficients every time I printed them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
In order to evaluate whether any of them are "correct", you will need to evaluate the model on unseen data and look at the adjusted R^2 score, etc. (hence you need the holdout (test) set). If you want to further improve the model, you can try to better understand the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
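As a rough sketch of that point (using the seaborn iris columns purely as an illustration, not your dfafter, and reusing the imports already in your script), fitting both libraries on exactly the same rows gives matching coefficients:
# Hypothetical reproduction with iris: same rows in, same coefficients out
iris = sns.load_dataset('iris')
X_full = iris[['sepal_length', 'sepal_width']]
y_full = iris['petal_length']
sk_coef = LinearRegression().fit(X_full, y_full).coef_
sm_coef = sm.OLS(y_full, sm.add_constant(X_full)).fit().params[1:].values
print(np.allclose(sk_coef, sm_coef))  # True: coefficients only differ when the training rows differ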

Multivariate Multiple Regression using Python libraries

Is there any library to perform a Multivariate Multiple Regression (a Multiple Regression with multiple dependent variables) in Python?
Greetings and thanks in advance
You can try the models in sklearn: the response variable can be 2 or more dimensional, and I think this works for OLS (linear regression), lasso, ridge, etc. The models in statsmodels can only handle one response (just checked).
Example dataset:
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data= iris['data'],
columns= iris['feature_names'] )
df.shape
(150, 4)
Now we do the fit:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(df[['sepal length (cm)']],df[['petal length (cm)','petal width (cm)']])
clf.coef_
array([[1.85843298],
[0.75291757]])
You can see the coefficients are the same as when you fit one response in this case:
clf.fit(df[['sepal length (cm)']],df[['petal width (cm)']])
clf.coef_
array([[0.75291757]])
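As a quick usage sketch, the multi-output fit also predicts both responses at once; each prediction row has one column per dependent variable, in the order the columns were passed to fit():
# Refit on both responses and predict: the output has shape (n_samples, n_targets)
clf.fit(df[['sepal length (cm)']], df[['petal length (cm)', 'petal width (cm)']])
print(clf.predict(df[['sepal length (cm)']].head(3)))  # shape (3, 2): one column per response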

coefficient from logistic regression to write function in python

I just completed logistic regression. The data can be downloaded from below link:
Please click this link to download the data.
Below is the code for the logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()
data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values
X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lr = LogisticRegression()
lr.fit(X_train,y_train)
# Predict the probability of the testing samples to belong to 0 or 1 class
predicted_probs = lr.predict_proba(X_test)
print(predicted_probs[0:3])
print(lr.coef_)
I can print the coefficients of the logistic regression and I can compute the probability of an event being 1 or 0.
But when I write a Python function using those coefficients to compute the probability of the outcome being 1, I don't get the same answer as lr.predict_proba(X_test).
The function I wrote is as follows:
def xG(bodyPart, shotQuality, defPressure, numDefPlayers, numAttPlayers, shotdist, angle, chanceRating, type):
    coeff = [0.09786083, 2.30523761, -0.05875112, 0.07905136,
             -0.1663424, -0.73930942, -0.10385882, 0.98845481, 0.13175622]
    return (coeff[0]*bodyPart + coeff[1]*shotQuality + coeff[2]*defPressure + coeff[3]*numDefPlayers +
            coeff[4]*numAttPlayers + coeff[5]*shotdist + coeff[6]*angle + coeff[7]*chanceRating + coeff[8]*type)
I got weird answers, so I know something is wrong in the function's calculation.
May I seek your advice, as I am new to machine learning and statistics?
I think you missed the intercept in your xG function. You can retrieve it from lr.intercept_, and it should be added inside the final formula, together with the sigmoid that turns the linear combination into a probability:
return 1 / (1 + math.exp(-(intercept + coeff[0]*bodyPart + coeff[1]*shotQuality + coeff[2]*defPressure + coeff[3]*numDefPlayers + coeff[4]*numAttPlayers + coeff[5]*shotdist + coeff[6]*angle + coeff[7]*chanceRating + coeff[8]*type)))
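A minimal sketch of the same idea as a vectorized helper (predicted_prob is a hypothetical name; it assumes x_scaled is a row of the already-standardized features the model was trained on):
import numpy as np
def predicted_prob(lr, x_scaled):
    # x_scaled: 1-D array of features in the same order and scaling used for training
    z = np.dot(lr.coef_[0], x_scaled) + lr.intercept_[0]
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid gives the probability of class 1
# Example: should match lr.predict_proba(X_test)[0, 1]
print(predicted_prob(lr, X_test[0]))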

How can I create a Linear Regression Model from a split dataset?

I've just split my data into a training and testing set and my plan is to train a Linear Regression model and be able to check what the performance is like using my testing split.
My current code is:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Dataset.csv')
df['split'] = np.random.randn(df.shape[0], 1)
split = np.random.rand(len(df)) <= 0.75
training_set = df[split]
testing_set = df[~split]
Is there a proper method I should be using to plot a Linear Regression model from an external file such as a .csv?
Since you want to use scikit-learn, here's an approach using sklearn.linear_model.LinearRegression:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
# x_vars is your list of feature column names, y_var is the name of the target column
X_train, y_train = training_set[x_vars], training_set[y_var]
X_test, y_test = testing_set[x_vars], testing_set[y_var]
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Depending on whether you need more descriptive output, you might also look into using statsmodels for linear regression.
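A minimal sketch of that alternative, assuming the same x_vars / y_var placeholders for your column names:
import statsmodels.api as sm
X_train_sm = sm.add_constant(training_set[x_vars])  # statsmodels needs an explicit intercept column
ols_result = sm.OLS(training_set[y_var], X_train_sm).fit()
print(ols_result.summary())                         # coefficients, standard errors, p-values, R^2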
