I'm running linear regressions with statsmodels and, because I tend to distrust my results, I also ran the same regression with scipy. The underlying dataset has about 80,000 observations. Unfortunately, I cannot provide the data for you to reproduce the errors.
I run two rounds of regressions: first a simple OLS, then the same simple OLS with standardized variables.
Surprisingly, the results differ a lot. While R² and the p-value seem to be the same, the coefficients, intercept and standard error are all over the place. Interestingly, after standardizing, the results align much better: now there is only a slight difference in the constant, which I am happy to attribute to rounding issues.
The exact numbers can be found in the appended screenshots.
Any idea where these differences come from and why they disappear after standardizing? What did I do wrong? Should I be extra worried, given that I run most of my regressions with sklearn (I only switched to statsmodels because I needed p-values) and even more differences may occur?
Thanks for your help! If you need any additional information, feel free to ask. Code and screenshots are provided below.
My code in short looks like this:
# package import
import numpy as np
from scipy.stats import linregress
from scipy.stats.mstats import zscore
import statsmodels.api as sma
import statsmodels.formula.api as smf
# adding constant
train_IV_cons = sma.add_constant(train_IV)
# run regression
(coefficients, intercept, rvalue, pvalue, stderr) = linregress(train_IV[:,0], train_DV)
print(coefficients, intercept, rvalue, pvalue, stderr)
est = sma.OLS(train_DV, train_IV_cons[:,[0,1]])
model_results = est.fit()
print(model_results.summary())
# standardize variables
train_IV_norm = train_IV
train_IV_norm[:,0] = np.array(zscore(train_IV_norm[:,0]))
train_DV_norm = np.array(zscore(train_DV))
train_IV_norm_cons = sma.add_constant(train_IV_norm)
# run regressions
(coefficients, intercept, rvalue, pvalue, stderr) = linregress(train_IV_norm[:,0], train_DV_norm)
print(coefficients, intercept, rvalue, pvalue, stderr)
est = sma.OLS(train_DV_norm, train_IV_norm_cons[:,[0,1]])
model_results = est.fit()
print(model_results.summary())
First regression (not standardized data): [screenshot]
Second regression (standardized data): [screenshot]
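For context, a minimal sanity check on synthetic data (not the original dataset, which cannot be shared), showing that scipy.stats.linregress and a statsmodels OLS with an added constant should agree for a single regressor:
import numpy as np
import statsmodels.api as sm
from scipy.stats import linregress
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 1.0 + rng.normal(size=1000)
# scipy: slope and intercept from simple linear regression
slope, intercept, rvalue, pvalue, stderr = linregress(x, y)
# statsmodels: params are [constant, slope] because add_constant prepends the constant
ols = sm.OLS(y, sm.add_constant(x)).fit()
print(slope, intercept)
print(ols.params)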
I'm new to Python and machine learning, so my question may be trivial.
I typed the code below in a Jupyter notebook:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
X_poly[:5]
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
plt.scatter(X, y)
plt.plot(X, lin_reg.predict(poly_reg.fit_transform(X)))
plt.show()
Then I deleted the code below:
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
But the graph and the regression line are still generated normally.
So is that code not essential?
ChatGPT said that "without the training and fitting of the linear regression model, the predicted line would not be accurate and would not reflect the relationship between the input and target data."
But to me the resulting graph and regression look accurate, and even
lin_reg.predict(poly_reg.fit_transform(X[[2]]))
still works.
So are the two deleted lines
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
meaningless? Or does deleting them actually break something?
P.S. Please let me know if the way I've asked this question isn't appropriate.
Until you restart the runtime environment, your fitted model is still in memory. You are addressing the model that was fitted before you deleted the lines, so there will be no difference in the output. Once you restart the runtime environment, you will get an error: NameError: name 'lin_reg' is not defined.
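A minimal sketch illustrating the point (a hypothetical two-cell notebook session with made-up data):
# Cell 1: define and fit the model
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + 1
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Cell 2: even if the code in Cell 1 is deleted afterwards, the fitted object
# still lives in the kernel's memory, so this keeps working:
print(lin_reg.predict([[5]]))
# Only after restarting the kernel (without re-running Cell 1) does the call fail
# with: NameError: name 'lin_reg' is not defined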
I'm going through a tutorial on mixed-effects models in Python.
I'm building a model where litter is the random effect. In the tutorial, the output contains the variance across the litter intercepts. However, in Bayesian hierarchical modeling, I'm also able to see the intercepts for every level of the random effect variable.
How would I see that here?
import pandas as pd
import statsmodels.api as sm
import scipy.stats as stats
import statsmodels.formula.api as smf
df = pd.read_csv("http://www-personal.umich.edu/~bwest/rat_pup.dat", sep = "\t")
model = smf.mixedlm("weight ~ litsize + C(treatment) + C(sex, Treatment('Male')) + C(treatment):C(sex, Treatment('Male'))",
                    df,
                    groups="litter").fit()
model.summary()
I would also ideally like to see the estimate of the intercept across all litters. Then, how would I interpret that overall intercept compared to the intercept for each single litter?
If there's a better Python package for what I'm trying to do, please suggest it.
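For reference, a minimal sketch of how one might inspect this on the statsmodels MixedLM results object fitted above (attribute names assume the current statsmodels API):
# Fixed effects, including the overall intercept shared by all litters
print(model.fe_params)
# Estimated random effects: one intercept deviation per litter, which is added
# to the overall intercept to obtain that litter's specific intercept
print(model.random_effects)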
I perform many logistic regression analyses with different parameters. From time to time I get an annoying message that the iteration limit has been reached:
/home/arnold/bin/anaconda/envs/vehicles/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
I don't want this message; I get thousands of them in my project during a single run. Is there a way to suppress it?
What I'd like instead is some indication that something went wrong, e.g. an exception being raised, so that I can check afterwards which analyses were okay and which were not. Is there a way to do that?
The message is a custom warning defined in sklearn.exceptions. You can suppress it (as noted in the comments), and you can also catch it as if it were an error. Catching it lets you record the message, which should help you check afterwards which analyses were okay.
The following code sample should help you get started. It is based on the Python warnings documentation. The with block catches and records the warning produced by the logistic regression.
import warnings
from sklearn import datasets, linear_model,exceptions
import matplotlib.pyplot as plt
#>>>Start: Create dummy data
blob = datasets.make_blobs(n_samples=100,centers=1)[0]
x = blob[:,0].reshape(-1,1)
# y needs to be integer for logistic regression
y = blob[:,1].astype(int)
plt.scatter(x,y)
#<<<End: Create dummy data
#>>>Start: Create logistic regression; set max_iter to a low number
lr = linear_model.LogisticRegression(max_iter=2)
with warnings.catch_warnings(record=True) as w:
    # Cause all warnings to always be triggered.
    warnings.simplefilter("always")
    # Trigger the warning by fitting the model.
    lr.fit(x, y)
After running the code, you can check the contents of variable w.
print(type(w))
print(w[-1].category)
print(w[-1].message)
Output:
<class 'list'>
<class 'sklearn.exceptions.ConvergenceWarning'>
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
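If an exception is preferable to a warning, the standard warnings machinery can also escalate just this warning class into an error. A minimal sketch, reusing lr, x and y from the example above:
import warnings
from sklearn.exceptions import ConvergenceWarning
# Turn only ConvergenceWarning into an exception; other warnings are unaffected.
warnings.filterwarnings("error", category=ConvergenceWarning)
failed_runs = []
try:
    lr.fit(x, y)
except ConvergenceWarning as exc:
    # Record the failure so the affected analysis can be reviewed afterwards.
    failed_runs.append(str(exc))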
I am currently using from pandas.stats.plm import PanelOLS to run panel regressions. I need to switch to statsmodels so that I can output heteroskedasticity-robust results. I have been unable to find the syntax for calling a panel regression in statsmodels. In general, I find the statsmodels documentation not very user friendly. Is someone familiar with panel regression syntax in statsmodels?
The linearmodels package was created to extend the statsmodels package to panel OLS (see https://github.com/bashtage/linearmodels). Here is the example from the package documentation:
import numpy as np
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data.year = data.year.astype(np.int64)
# MultiIndex, entity - time
data = data.set_index(['firm','year'])
from linearmodels import PanelOLS
mod = PanelOLS(data.invest, data[['value','capital']], entity_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)
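Since the goal is heteroskedasticity-robust output, the same model can also be fit with White-style robust standard errors; a minimal variation on the example above:
# Heteroskedasticity-robust (White) covariance instead of clustered errors
res_robust = mod.fit(cov_type='robust')
print(res_robust.summary)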
Best Daniel
When I want to fit a model in Python, I often use the fit() method in statsmodels.
In some cases I write a script to automate the fitting:
import statsmodels.formula.api as smf
import pandas as pd
df = pd.read_csv('mydata.csv') # contains column x and y
fitted = smf.poisson('y ~ x', df).fit()
My question is how to silence the fit() method.
In my environment it prints some information about the fit to standard output, like:
Optimization terminated successfully.
Current function value: 2.397867
Iterations 11
but I don't need it.
I couldn't find the argument which controls standard output printing.
How can I silence the fit() method?
Python 3.3.4, IPython 2.0.0, pandas 0.13.1, statsmodels 0.5.0.
Use the disp argument of fit(); it controls the verbosity of the scipy optimizers.
mod.fit(disp=0)
See the documentation for fit.
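Applied to the script from the question (same hypothetical mydata.csv with columns x and y), a minimal sketch:
import statsmodels.formula.api as smf
import pandas as pd
df = pd.read_csv('mydata.csv')  # contains columns x and y
# disp=0 suppresses the optimizer's convergence output
fitted = smf.poisson('y ~ x', df).fit(disp=0)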