Logistic regression with intercept only - python

I need to fit a logistic regression with sklearn, but with no x vector, just the intercept-only model. How can this be done? I cannot find any working solution.
Thanks
Edit: I want to find an equivalent in sklearn of R's intercept-only regression y ~ 1.

I did not find a way to run a logit only on the intercept, so I created one constant column and ran the model without the intercept.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
### Create the data
a = np.array([1] * 20 + [0] * 180)
df = pd.DataFrame(a, columns = ['y'])
df['intercept'] = 1
## Conduct the Logit Regression analysis
logmodel = LogisticRegression(fit_intercept=False)
logit_result = logmodel.fit(df.loc[:, ~df.columns.isin(['y'])],df['y'])
# Print the coefficients: with fit_intercept=False, intercept_ stays at 0
# and the intercept estimate ends up in coef_ (the constant column's coefficient)
print(logit_result.intercept_)
print(logit_result.coef_)
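As a sanity check, here is a minimal sketch comparing the result against the closed form: the maximum-likelihood intercept-only fit is just the log-odds of the sample mean. Note that LogisticRegression applies L2 regularization by default, which shrinks the estimate slightly unless you disable the penalty (penalty=None requires a recent scikit-learn; older versions use penalty='none' or a very large C).
p = df['y'].mean()  # 20/200 = 0.1
print(np.log(p / (1 - p)))  # closed-form intercept, about -2.197
# refit without regularization so the coefficient matches the closed form
# (penalty=None needs scikit-learn >= 1.2; use penalty='none' on older versions)
unpenalized = LogisticRegression(fit_intercept=False, penalty=None)
unpenalized.fit(df[['intercept']], df['y'])
print(unpenalized.coef_)  # should be close to -2.197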

Related

Scaling gives me worse result (lower R-squared) in Linear Regression

(I link my dataset at the bottom.) I'm trying to use Linear Regression on a dataset where the predictors are the product ID, weight, type, Outlet_Establishment_Year, etc., and the target variable is Item_Outlet_Sales. I use R-squared as the metric. The predictors have different units, so I think I need to scale them. If I do so:
X = cleaned_data.iloc[:, :-1] # predictors
X = pd.get_dummies(data = X, drop_first = True) # convert categorical variables to numerical variables
Y = cleaned_data.iloc[:, -1] # target
Then I scale the data, perform Linear Regression and calculate R-squared, which yields 0.57:
from sklearn.preprocessing import StandardScaler
concat_data = pd.concat([X, Y], axis = 1)
scaled_data = StandardScaler().fit_transform(concat_data)
X_scaled = scaled_data[:, :-1]
Y_scaled = scaled_data[:, -1]
print(X_scaled.shape, Y_scaled.shape)
from sklearn.linear_model import LinearRegression
LR_scaled_model = LinearRegression()
LR_scaled_model.fit(X_scaled, Y_scaled)
from sklearn.metrics import r2_score
predicted_sales = LR_scaled_model.predict(X_scaled)
print('R-squared:', r2_score(Y_scaled, predicted_sales))
And if I just run Linear Regression without scaling, the R-squared is 0.67:
LR_non_scaling_model = LinearRegression()
LR_non_scaling_model.fit(X, Y)
predicted_sales = LR_non_scaling_model.predict(X)
print('R-squared:', r2_score(Y, predicted_sales))
How would you explain this? And, in linear regression tasks, when should I and when shouldn't I scale my data?
Dataset: https://drive.google.com/file/d/1AeK2aCnKtr0xMHz1B_Vfq4HnIkd2pxW_/view?usp=share_link
It seems like the scaling is also applied to the one-hot-encoded dummy variables, which IMO should not happen. If you only scale the continuous variables, does that change the behavior?
Generally, scaling only affects the interpretation of the coefficients and not the quality of the model. After standard scaling, a coefficient $\beta_1$ can be interpreted as:
A one standard deviation change in the independent variable is associated with a $\beta_1$ change in the dependent variable
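If you want to try scaling only the continuous predictors, as suggested above, a ColumnTransformer inside a pipeline leaves the dummy columns untouched. A minimal sketch, assuming X is the DataFrame produced by get_dummies above; the continuous column names below are placeholders, so adjust them to your actual columns:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
continuous_cols = ['Item_Weight', 'Item_Visibility', 'Item_MRP']  # placeholder names
preprocess = ColumnTransformer([('scale', StandardScaler(), continuous_cols)],
                               remainder='passthrough')  # dummies pass through unscaled
pipe = make_pipeline(preprocess, LinearRegression())
pipe.fit(X, Y)
print('R-squared:', pipe.score(X, Y))  # R-squared on the training data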

How does python calculate predictions with linear regression?

I'm having trouble working out the formula that python uses for linear predictions. I did a linear regression using:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_tr_pre_close,Y_tr_pre_close)
then I made predictions using:
predictions=lm.predict(X_te_pre_close)
I had great results with this model, but now the problem is that I can't figure out how the lm.predict() formula works. The model should be ordinary least squares, as I read in the documentation.
In that case the prediction formula should be x'b (vector of explanatory variables times vector of coefficients), but that doesn't match my results.
LinearRegression doesn't store the intercept as one of the coefficients, but as intercept_.
So you can reproduce the predict function like that:
import numpy as np
# using sklearn
pred_sklearn = lm.predict(X_te_pre_close)
# using the coefficients directly: X @ coef + intercept
pred_coef = X_te_pre_close @ lm.coef_.T + lm.intercept_
assert np.allclose(pred_coef, pred_sklearn)

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and I want to perform multiple linear regression on it.
First I drop the deliveries column because of its high correlation with miles. Although gasprice should arguably be removed as well, I keep it so that I'm doing multiple linear regression rather than simple linear regression.
Finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. Each time I printed the coefficients I got different values. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
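For reference, here is a minimal sketch of the comparison described above, using the iris dataset as a stand-in for your data: cases #2 and #3 are fit on exactly the same data, so their coefficients should agree.
import statsmodels.api as sm
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
iris = load_iris(as_frame=True).frame
X_full = iris.iloc[:, :3]  # first three feature columns
y_full = iris.iloc[:, 3]   # predict petal width
sk_full = LinearRegression().fit(X_full, y_full)           # case #3: sklearn, full data
ols_full = sm.OLS(y_full, sm.add_constant(X_full)).fit()   # case #2: statsmodels, full data
print(sk_full.coef_)
print(ols_full.params.iloc[1:].values)  # drop the constant; matches sk_full.coef_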

Using coefficients from logistic regression to write a function in python

I just completed a logistic regression. The data can be downloaded from the link below:
please click this link to download the data
Below is the code for the logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()
data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values
X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lr = LogisticRegression()
lr.fit(X_train,y_train)
# Predict the probability of the testing samples to belong to 0 or 1 class
predicted_probs = lr.predict_proba(X_test)
print(predicted_probs[0:3])
print(lr.coef_)
I can print the coefficients of the logistic regression and compute the probability of an event being 1 or 0.
But when I write a python function using those coefficients and compute the probability of 1, I don't get the same answer as lr.predict_proba(X_test).
The function I wrote is as follows:
def xG(bodyPart, shotQuality, defPressure, numDefPlayers, numAttPlayers, shotdist, angle, chanceRating, type):
    coeff = [0.09786083, 2.30523761, -0.05875112, 0.07905136,
             -0.1663424, -0.73930942, -0.10385882, 0.98845481, 0.13175622]
    return (coeff[0]*bodyPart + coeff[1]*shotQuality + coeff[2]*defPressure + coeff[3]*numDefPlayers
            + coeff[4]*numAttPlayers + coeff[5]*shotdist + coeff[6]*angle + coeff[7]*chanceRating + coeff[8]*type)
I get a weird answer, so I know something is wrong in the function's calculation.
May I seek your advice, as I am new to machine learning and statistics?
I think you missed the intercept_ in your xG. You can retrieve it from lr.intercept_, and it should be summed with the other terms inside the sigmoid in the final formula:
return 1/(1 + np.exp(-(intercept + coeff[0]*bodyPart + coeff[1]*shotQuality + coeff[2]*defPressure + coeff[3]*numDefPlayers + coeff[4]*numAttPlayers + coeff[5]*shotdist + coeff[6]*angle + coeff[7]*chanceRating + coeff[8]*type)))
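For a complete check, here is a minimal sketch that reproduces lr.predict_proba for one row, using the fitted lr from the question's code. Note that the features were standardized with StandardScaler before fitting, so whatever you feed your xG function must be on that same standardized scale.
import numpy as np
def manual_prob(row, lr):
    # logit = intercept + dot product of the (scaled) features with the coefficients
    z = lr.intercept_[0] + row @ lr.coef_.ravel()
    return 1.0 / (1.0 + np.exp(-z))  # probability of class 1
print(manual_prob(X_test[0], lr))
print(lr.predict_proba(X_test[:1])[:, 1])  # should match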

Multivariate polynomial regression for python

In extension of: scikit learn coefficients polynomialfeatures
What is a straightforward way of doing multivariate polynomial regression for python?
Say, we have N samples with each 3 features and we have for each sample 40 (may as well be any number, of course, but it is 40 in my case) response variables. We want to make a function that relates the 3 independent variables to the 40 response variables. For this, we train a polynomial model on N-1 of our samples, and estimate the 40 response variables of the remaining one sample. The dimensionalities of independent variable (X) and response variable (y) training and test data:
X_train = [(N-1) * 3], y_train = [(N-1) * 40], X_test = [1 * 3], y_test = [1 * 40]
As I would expect, such an approach should yield:
y = intercept + a x1 + b x1^2 + c x2 + d x2^2 + e x3 + f x3^2 + g x1 x2 + h x1 x3 + i x2 x3
That is a total of 9 coefficients plus one intercept per response variable to describe the polynomial. If I use the method proposed earlier by David Maust in 2015:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
model = make_pipeline(PolynomialFeatures(degree=2),LinearRegression())
y_poly = model.fit(X_train,y_train)
coefficients = model.steps[1][1].coef_
intercepts = model.steps[1][1].intercept_
coefficients.shape
[Output: (40, 10)]
For every response variable, it appears we end up with 10 coefficients + one intercept, which is one more coefficient than I would expect. Therefore it is unclear to me what these coefficients mean and how to make up the polynomial that describes our response variable. I really hope StackOverflow could help me out! Hopefully I defined my problem well enough.
As you pointed out there are 9 coefficients and a bias term after the polynomial transformation. However when you pass this N by 10 matrix to sklearn's LinearRegression this is interpreted as a 10 dimensional dataset. In addition, by default, sklearn fits the regression line with an intercept, therefore you have 10 coefficients and one intercept. I think the first coefficient will most likely be 0 though (at least that is what I obtained after testing my answers below with the data from here).
To get your expected behaviour I think you have two options:
1. Disable the bias term in PolynomialFeatures:
model = make_pipeline(PolynomialFeatures(degree=2,include_bias=False), LinearRegression())
2. Tell LinearRegression not to fit an intercept; then your first coefficient (the coefficient of the bias term) will be the intercept. In this case your intercept is model.steps[1][1].coef_[0].
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression(fit_intercept=False))
I hope this helps! Out of curiosity what is the value you get for model.steps[1][1].coef_[0]?
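As a side note, here is a minimal sketch (assuming scikit-learn >= 1.0; older versions use get_feature_names instead of get_feature_names_out) that maps each coefficient to its polynomial term, which makes it easier to write out the polynomial for any of the 40 response variables. The feature names 'x1', 'x2', 'x3' are placeholders:
poly = model.steps[0][1]  # the PolynomialFeatures step
lin = model.steps[1][1]   # the LinearRegression step
terms = poly.get_feature_names_out(['x1', 'x2', 'x3'])  # e.g. '1', 'x1', 'x1^2', 'x1 x2', ...
for name, coef in zip(terms, lin.coef_[0]):  # coefficients of the first response variable
    print(name, coef)
print('intercept:', lin.intercept_[0])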
