Difference in polynomial regression coefficients between R and Python - python

I am currently working on a project where I have to translate R code to Python.
I came across an issue with polynomial regression: there is a difference between the coefficients I get from R and from Python.
Here's my data:
stress_immo['stress immo'] = [0.0 , -0.2 ,-0.4]
stress_immo['Choc A - EQ T1'] = [-0.021951,-0.021951,-0.021951]
The code given to me in R is the following:
Reg_GF_S_cEQT1_A_a_RE <- lm(Choc.A...EQ.T1~stress.immo+ I(stress.immo^2), data=stress_immo)
The result of this is:
(Intercept) -2.195e-02 NA NA NA
stress.immo -9.014e-17 NA NA NA
I(stress.immo^2) -1.502e-16 NA NA NA
Here's my code in Python (very likely to be wrong):
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
x = (stress_immo['stress immo'].values).reshape(-1,1)
qfit = PolynomialFeatures(degree=2)
xq = qfit.fit_transform(x)
y = (stress_immo['Choc A - EQ T1'].values).reshape(-1,1)
qr = LinearRegression()
model = qr.fit(xq,y)
and here are my results:
print(model.coef_)
[[0. 0. 0.]]
print(model.intercept_)
[-0.02195108]
As you can see, the intercept is correct but the coefficients are always 0 (no matter what data I choose). I also tried doing a linear regression manually, like so:
x =stress_immo['stress immo'].values
x2 = np.power(stress_immo['stress immo'].values,2)
vector_row = np.array([x,x2]).reshape(-1, 2)
y = stress_immo['Choc A - EQ T1'].values
model = LinearRegression().fit(vector_row,y)
but the result is always the same: 0 coefficients.
I would be grateful if someone could help. Thanks!
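For reference, here is a minimal sketch (not the original poster's code) of the same quadratic fit using statsmodels' formula API, which mirrors R's lm call; the columns are renamed to valid Python identifiers for the formula:

# Illustrative sketch: fit the same quadratic with statsmodels' formula API.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "stress_immo": [0.0, -0.2, -0.4],
    "choc_a_eq_t1": [-0.021951, -0.021951, -0.021951],
})

# Same model as lm(Choc.A...EQ.T1 ~ stress.immo + I(stress.immo^2), data=stress_immo)
fit = smf.ols("choc_a_eq_t1 ~ stress_immo + I(stress_immo**2)", data=df).fit()
print(fit.params)  # intercept ~ -0.021951; the slope terms come out numerically zero because y is constant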

Related

How to find regression curve equation for a fitted PolynomialFeatures model

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
data=pd.DataFrame(
{"input":
[0.001,0.015,0.066,0.151,0.266,0.402,0.45,0.499,0.598,0.646,0.738,0.782,0.86,0.894,0.924,0.95],
"output":[0.5263157894736842,0.5789473684210524,0.6315789473684206,0.6842105263157897,
0.6315789473684206, 0.7894736842105263, 0.8421052631578945, 0.7894736842105263, 0.736842105263158,
0.6842105263157897, 0.736842105263158, 0.736842105263158,0.6842105263157897, 0.6842105263157897,
0.6315789473684206,0.5789473684210524]})
I have the above data, which includes input and output values, and I want to make a curve that properly fits this data. First, a plot of the input and output values is here:
I have written this code:
X=data.iloc[:,0].to_numpy()
X=X.reshape(-1,1)
y=data.iloc[:,1].to_numpy()
y=y.reshape(-1,1)
poly=PolynomialFeatures(degree=2)
poly.fit(X,y)
X_poly=poly.transform(X)
reg=LinearRegression().fit(X_poly,y)
plt.scatter(X,y,color="blue")
plt.plot(X,reg.predict(X_poly),color="orange",label="Polynomial Linear Regression")
plt.xlabel("Temperature")
plt.ylabel("Pressure")
plt.legend(loc="upper left")
The plot is:
But I can't find the equation of the above curve (the orange one). How can I find it?
Your plot actually corresponds to your code run with
poly=PolynomialFeatures(degree=7)
and not to degree=2. Indeed, running your code with the above change, we get:
Now, your polynomial features are:
poly.get_feature_names()
# ['1', 'x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7']
and the respective coefficients of your linear regression are:
reg.coef_
# array([[ 0. , 5.43894411, -68.14277256, 364.28508827,
# -941.70924401, 1254.89358662, -831.27091422, 216.43304954]])
plus the intercept:
reg.intercept_
# array([0.51228593])
Given the above, and setting
coef = reg.coef_[0]
since here we have a single feature in the initial data, your regression equation is:
y = reg.intercept_ + coef[0] + coef[1]*x + coef[2]*x**2 + coef[3]*x**3 + coef[4]*x**4 + coef[5]*x**5 + coef[6]*x**6 + coef[7]*x**7
For visual verification, we can plot the above function with some x data in [0, 1]
x = np.linspace(0, 1, 15)
Running the above expression for y and
plt.plot(x, y)
gives:
Using some randomly generated data x, we can verify that the results of the equation y_eq are indeed equal to the results produced by the regression model y_reg within the limits of numerical precision:
x = np.random.rand(1,10)
y_eq = reg.intercept_ + coef[0] + coef[1]*x + coef[2]*x**2 + coef[3]*x**3 + coef[4]*x**4 + coef[5]*x**5 + coef[6]*x**6 + coef[7]*x**7
y_reg = np.concatenate(reg.predict(poly.transform(x.reshape(-1,1))))
y_eq
# array([[0.72452703, 0.64106819, 0.67394222, 0.71756648, 0.71102853,
# 0.63582055, 0.54243177, 0.71104983, 0.71287962, 0.6311952 ]])
y_reg
# array([0.72452703, 0.64106819, 0.67394222, 0.71756648, 0.71102853,
# 0.63582055, 0.54243177, 0.71104983, 0.71287962, 0.6311952 ])
np.allclose(y_reg, y_eq)
# True
Irrelevant to the question: I guess you already know that trying to fit such high-order polynomials to so few data points is not a good idea, and you should probably stick to a low degree of 2 or 3...
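As a follow-up sketch (assuming the reg and poly objects fitted above), the same equation can be packed into a single numpy polynomial by folding the intercept into the constant term:

import numpy as np

coefs = reg.coef_[0].copy()        # [0, c1, ..., c7]; position 0 belongs to the bias column
coefs[0] += reg.intercept_[0]      # fold the model intercept into the constant term
poly_eq = np.poly1d(coefs[::-1])   # np.poly1d expects the highest-degree coefficient first

print(poly_eq)                                              # printable form of the fitted equation
print(poly_eq(0.5), reg.predict(poly.transform([[0.5]])))   # the two evaluations should agree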
Not sure how you produced the plot shown in the question. When I ran your code I got the following (degree=2) polynomial fitted to the data, as expected:
Now that you have fitted the data you can see the coefficients of the model thus:
print(reg.coef_)
print(reg.intercept_)
# [[ 0. 0.85962436 -0.83796885]]
# [0.5523586]
Note that the data that was used to fit this model is equivalent to the following:
X_poly = np.concatenate([np.ones((16,1)), X, X**2], axis=1)
Therefore a single data point is a vector created as follows:
temp = 0.5
x = np.array([1, temp, temp**2]).reshape((1,3))
Your polynomial model is simply a linear model of the polynomial features:
y = A.x + B
or
y = reg.coef_.dot(x.T) + reg.intercept_
print(y) # [[0.77267856]]
Verification:
print(reg.predict(x)) # array([[0.77267856]])

Non-Linearity Test NaN error

I wanted to try statsmodels' linear_harvey_collier test with an easy example. However, I get NaN as a result. Can you see where my error lies?
import numpy as np
import statsmodels.stats.api as sms
from statsmodels.regression.linear_model import OLS
np.random.seed(44)
n_samples, n_features = 50, 4
X = np.random.randn(n_samples, n_features)
coef=np.random.uniform(-12,12,4)
y = np.dot(X, coef)
var = 400
y += var**(1/2) * np.random.normal(size=n_samples)
regr=OLS(y, X).fit()
print(regr.params)
print(regr.summary())
sms.linear_harvey_collier(regr)
I get the result Ttest_1sampResult(statistic=nan, pvalue=nan).
If I perform the test while excluding one variable, I get a result:
X3=X[:,:3]
regr3=OLS(y, X3).fit()
In [1]: sms.linear_harvey_collier(regr3)
Out[2]: Ttest_1sampResult(statistic=0.2447803429683807, pvalue=0.806727747845282)
Is there a problem with not adding a constant and intercept? This is just a feeling and if there is indeed a problem, I don't understand why.
There is a bug in linear_harvey_collier that hard-codes the number of initial observations to 3.
https://github.com/statsmodels/statsmodels/pull/6727
linear_harvey_collier has only two lines of code.
A workaround is to compute the test directly:
res = regr
from scipy import stats
skip = len(res.params) # bug in linear_harvey_collier
rr = sms.recursive_olsresiduals(res, skip=skip, alpha=0.95, order_by=None)
stats.ttest_1samp(rr[3][skip:], 0)
Ttest_1sampResult(statistic=0.03092937323130299, pvalue=0.9754626388210277)
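As a usage sketch, the same workaround can be wrapped as a small helper (the function name is mine, not part of statsmodels):

from scipy import stats
from statsmodels.stats.diagnostic import recursive_olsresiduals

def harvey_collier(res):
    # skip as many initial observations as there are parameters, instead of the
    # hard-coded 3 that produces the NaN result in the question
    skip = len(res.params)
    rr = recursive_olsresiduals(res, skip=skip, alpha=0.95)
    # rr[3] holds the recursive residuals that the t-test is applied to
    return stats.ttest_1samp(rr[3][skip:], 0)

print(harvey_collier(regr))  # regr is the OLS fit from the question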

Is there something similar to R's brglm to help deal with quasi-separation in Python using statsmodels Logit?

I am using Logit from statsmodels to create a regression model.
I get the error LinAlgError: Singular matrix. Then, when I remove one variable at a time from my dataset, I eventually get a different error: PerfectSeparationError: Perfect separation detected, results not available.
I suspect that the original error (LinAlgError) is related to perfect separation, because I had the same problem in R and got around it using brglm (bias-reduced GLM).
I have a boolean y variable and 23 numeric and boolean x variables.
I have already run a VIF function to remove any variables with high multicollinearity scores (I started with 26 variables).
I have tried using firth_regression.py (https://gist.github.com/johnlees/3e06380965f367e4894ea20fbae2b90d) instead to account for perfect separation, but I got a MemoryError.
I have tried LogisticRegression from sklearn, but it cannot give me the p-values, which is no good to me.
I even tried removing one variable at a time from my dataset; when I got down to 4 variables left (out of 23), I got PerfectSeparationError: Perfect separation detected, results not available.
Has anyone experienced this and how do you get around it?
Appreciate any advice!
X = df.loc[:, df.columns != 'VehicleMake']
y = df.iloc[:,0]
# Split data
X_train, X_test, y_train, y_test = skl.model_selection.train_test_split(X, y, test_size=0.3)
Code in question:
# Perform logistic regression and get p values
logit_model = sm.Logit(y_train, X_train.astype(float))
result = logit_model.fit()
This is the Firth regression code I tried instead, which gave me the memory error:
# For the firth_regression
import sys
import warnings
import math
import statsmodels
from scipy import stats
import statsmodels.formula.api as smf

def firth_likelihood(beta, logit):
    return -(logit.loglike(beta) + 0.5*np.log(np.linalg.det(-logit.hessian(beta))))

step_limit=1000
convergence_limit=0.0001

logit_model = smf.Logit(y_train, X_train.astype(float))

start_vec = np.zeros(X.shape[1])

beta_iterations = []
beta_iterations.append(start_vec)
for i in range(0, step_limit):
    pi = logit_model.predict(beta_iterations[i])
    W = np.diagflat(np.multiply(pi, 1-pi))
    var_covar_mat = np.linalg.pinv(-logit_model.hessian(beta_iterations[i]))

    # build hat matrix
    rootW = np.sqrt(W)
    H = np.dot(np.transpose(X_train), np.transpose(rootW))
    H = np.matmul(var_covar_mat, H)
    H = np.matmul(np.dot(rootW, X), H)

    # penalised score
    U = np.matmul(np.transpose(X_train), y - pi + np.multiply(np.diagonal(H), 0.5 - pi))
    new_beta = beta_iterations[i] + np.matmul(var_covar_mat, U)

    # step halving
    j = 0
    while firth_likelihood(new_beta, logit_model) > firth_likelihood(beta_iterations[i], logit_model):
        new_beta = beta_iterations[i] + 0.5*(new_beta - beta_iterations[i])
        j = j + 1
        if (j > step_limit):
            sys.stderr.write('Firth regression failed\n')
            None

    beta_iterations.append(new_beta)
    if i > 0 and (np.linalg.norm(beta_iterations[i] - beta_iterations[i-1]) < convergence_limit):
        break

return_fit = None
if np.linalg.norm(beta_iterations[i] - beta_iterations[i-1]) >= convergence_limit:
    sys.stderr.write('Firth regression failed\n')
else:
    # Calculate stats
    fitll = -firth_likelihood(beta_iterations[-1], logit_model)
    intercept = beta_iterations[-1][0]
    beta = beta_iterations[-1][1:].tolist()
    bse = np.sqrt(np.diagonal(-logit_model.hessian(beta_iterations[-1])))
    return_fit = intercept, beta, bse, fitll
#print(return_fit)
I fixed my problem by changing the default method in the logit regression to method='bfgs':
result = logit_model.fit(method = 'bfgs')
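Putting the fix in context, here is a minimal sketch (assuming statsmodels is imported as sm and the X_train/y_train split from the question):

import statsmodels.api as sm

logit_model = sm.Logit(y_train, X_train.astype(float))
result = logit_model.fit(method='bfgs')
print(result.summary())  # p-values are shown here and available via result.pvalues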
Few years late for this question, but I'm working on a Python implementation of Firth logistic regression using the procedure detailed in the R logistf package and Heinze and Schemper, 2002. There are a few implementation differences compared to the gist you linked that make it much more memory efficient, and p-values are calculated using penalized likelihood ratio tests. Confidence intervals are also calculated.
Obviously I don't have your data, so let's use the sex2 dataset included with the logistf R package.
>>> from firthlogist import FirthLogisticRegression, load_sex2
>>> fl = FirthLogisticRegression()
>>> X, y, feature_names = load_sex2()
>>> fl.fit(X, y)
FirthLogisticRegression()
>>> fl.summary(xname=feature_names)
coef std err [0.025 0.975] p-value
--------- ---------- --------- --------- ---------- -----------
age -1.10598 0.42366 -1.97379 -0.307427 0.00611139
oc -0.0688167 0.443793 -0.941436 0.789202 0.826365
vic 2.26887 0.548416 1.27304 3.43543 1.67219e-06
vicl -2.11141 0.543082 -3.26086 -1.11774 1.23618e-05
vis -0.788317 0.417368 -1.60809 0.0151846 0.0534899
dia 3.09601 1.67501 0.774568 8.03028 0.00484687
Intercept 0.120254 0.485542 -0.818559 1.07315 0.766584
Log-Likelihood: -132.5394
Newton-Raphson iterations: 8
Compare results with brglm:
> library(brglm)
Loading required package: profileModel
'brglm' will gradually be superseded by the 'brglm2' R package (https://cran.r-project.org/package=brglm2), which provides utilities for mean and median bias reduction for all GLMs.
Methods for the detection of separation and infinite estimates in binomial-response models are provided by the 'detectseparation' R package (https://cran.r-project.org/package=detectseparation).
> fit <- brglm(case~age+oc+vic+vicl+vis+dia, data=logistf::sex2)
> summary(fit)
Call:
brglm(formula = case ~ age + oc + vic + vicl + vis + dia, data = logistf::sex2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.12025 0.48554 0.248 0.804390
age -1.10598 0.42366 -2.611 0.009040 **
oc -0.06882 0.44379 -0.155 0.876770
vic 2.26887 0.54842 4.137 3.52e-05 ***
vicl -2.11141 0.54308 -3.888 0.000101 ***
vis -0.78832 0.41737 -1.889 0.058921 .
dia 3.09601 1.67501 1.848 0.064551 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 304.61 on 238 degrees of freedom
Residual deviance: 276.91 on 232 degrees of freedom
Penalized deviance: 265.0788
AIC: 290.91
The p-values are slightly different because they are calculated by penalized likelihood ratio tests, whereas brglm uses Wald tests. firthlogist can also use Wald:
>>> fl = FirthLogisticRegression(wald=True)
>>> fl.fit(X, y)
FirthLogisticRegression(wald=True)
>>> fl.summary(xname=feature_names)
coef std err [0.025 0.975] p-value
--------- ---------- --------- --------- ---------- -----------
age -1.10598 0.42366 -1.93634 -0.275623 0.00903995
oc -0.0688167 0.443793 -0.938636 0.801002 0.87677
vic 2.26887 0.548416 1.194 3.34375 3.51659e-05
vicl -2.11141 0.543082 -3.17583 -1.04699 0.000101147
vis -0.788317 0.417368 -1.60634 0.0297084 0.0589208
dia 3.09601 1.67501 -0.186943 6.37896 0.0645508
Intercept 0.120254 0.485542 -0.83139 1.0719 0.80439
Log-Likelihood: -132.5394
Newton-Raphson iterations: 8

How to do Naive Bayes modelling (using sklearn MultinomialNB) in python

I am currently learning how to do Naive Bayes modelling and attempting to apply it in Python and R. However, using a toy example, I am struggling to recreate in Python the same numbers that I get from doing the calculations either in R or by hand.
Help in figuring out why I am getting different numbers would be appreciated!
The toy data is
Class (y)   A  A  A  A  B  B  B  B  B  B
var x1      2  1  1  0  0  1  1  0  0  0
var x2      0  0  1  0  0  1  1  1  1  1
That is to say, my dependent variable y has 2 levels (A and B), explanatory variable x1 has 3 levels (0, 1, 2), and x2 has 2 levels (0 and 1).
My current objective is to predict, using a multinomial naivebayes model, the class probabilities of a new data point with values x1=1 & x2=1.
My current python code is:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
dat = pd.DataFrame({
"class" : ["A", "A","A","A", "B","B","B","B","B","B"],
"x1" : [2,1,1,0,0,1,1,0,0,0],
"x2" : [0,0,1,0,1,0,1,1,1,1]
})
mnb = MultinomialNB(alpha= 0)
x = mnb.fit(dat[["x1","x2"]], dat["class"])
x.predict_proba( pd.DataFrame( [[1,1]] , columns=["x1","x2"]) )
## Out[160]: array([[ 0.34325744, 0.65674256]])
However, attempting the same in R, I get:
library(dplyr)
library(e1071)
dat = data_frame(
"class" = c("A", "A","A","A", "B","B","B","B","B","B"),
"x1" = c(2,1,1,0,0,1,1,0,0,0),
"x2" = c(0,0,1,0,1,0,1,1,1,1)
)
model <- naiveBayes(class ~ . , data = table(dat) )
predict(
model,
newdata = data_frame(
x1 = factor(1, levels = c(0,1,2)) ,
x2 = factor(1, levels = c(0,1))),
type = "raw"
)
## A B
## [1,] 0.2307692 0.7692308
And by hand I get the following:
The model is

P(y = c | x1, x2) ∝ P(y = c) · P(x1 | y = c) · P(x2 | y = c)

From the data we get the following probability estimates (treating each variable as categorical):

P(A) = 4/10,  P(x1 = 1 | A) = 2/4,  P(x2 = 1 | A) = 1/4
P(B) = 6/10,  P(x1 = 1 | B) = 2/6,  P(x2 = 1 | B) = 5/6

Thus plugging the numbers in we get

P(A | x1 = 1, x2 = 1) = (4/10 · 2/4 · 1/4) / (4/10 · 2/4 · 1/4 + 6/10 · 2/6 · 5/6) = 0.05 / 0.2166667 ≈ 0.2307692
P(B | x1 = 1, x2 = 1) ≈ 0.7692308
Which matches the results from R. So again I'm confused as to what I am doing wrong in the python example. Any help would be appreciated.
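For comparison, here is a minimal sketch using scikit-learn's CategoricalNB (available since 0.22), which treats each column as a categorical variable the way e1071's naiveBayes does, whereas MultinomialNB treats the values as counts; with near-zero smoothing it should reproduce the R and by-hand numbers on this toy data:

from sklearn.naive_bayes import CategoricalNB

cnb = CategoricalNB(alpha=1e-10)   # near-zero smoothing, i.e. plain relative frequencies
cnb.fit(dat[["x1", "x2"]], dat["class"])
print(cnb.predict_proba(pd.DataFrame([[1, 1]], columns=["x1", "x2"])))
# expected ≈ [[0.2307692, 0.7692308]]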

PyMC3: PositiveDefiniteError when sampling a Categorical variable

I am trying to sample a simple model of a categorical distribution with a Dirichlet prior. Here is my code:
import numpy as np
from scipy import optimize
from pymc3 import *
k = 6
alpha = 0.1 * np.ones(k)
with Model() as model:
    p = Dirichlet('p', a=alpha, shape=k)
    categ = Categorical('categ', p=p, shape=1)
    tr = sample(10000)
And I get this error:
PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [0 1 2 3 4]
The problem is that NUTS is failing to initialize properly. One solution is to use another sampler like this:
import pymc3 as pm

with pm.Model() as model:
    p = pm.Dirichlet('p', a=alpha)
    categ = pm.Categorical('categ', p=p)
    step = pm.Metropolis(vars=p)
    tr = pm.sample(1000, step=step)
Here I am manually assigning p to Metropolis, and letting PyMC3 assign categ to a proper sampler.
