Is there a Python package (statsmodels/scipy/pandas/etc.) with functionality for estimating the coefficients of a linear regression model with autoregressive errors, such as the SAS implementation described here? http://support.sas.com/documentation/cdl/en/etsug/63348/HTML/default/viewer.htm#etsug_autoreg_sect003.htm
statsmodels http://www.statsmodels.org/dev/index.html has ARMA, ARIMA and SARIMAX models that take explanatory variables to model the mean. This corresponds to a linear model, y = X b + e, where the error term e follows an ARMA or seasonal ARMA process. AR errors are a special case when the moving average term has no lags.
statsmodels also has an autoregressive AR class but it does not allow for explanatory variables.
In these time series models, prediction is a conditional prediction that takes the history into account for forecasting.
statsmodels also has a GLSAR class which is a linear model that removes the effect of autocorrelated AR residuals. This uses feasible generalized least squares estimation and can only predict the unconditional term X b.
http://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_model.ARMA.html#statsmodels.tsa.arima_model.ARMA
http://www.statsmodels.org/dev/statespace.html#seasonal-autoregressive-integrated-moving-average-with-exogenous-regressors-sarimax
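For example, a minimal sketch of a regression with AR(2) errors fitted through SARIMAX, with simulated placeholder data standing in for your own y and X:

import numpy as np
import statsmodels.api as sm

# simulated placeholder data: two regressors plus AR(2) errors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
e = sm.tsa.arma_generate_sample(ar=[1, -0.6, 0.2], ma=[1], nsample=200)
y = 1.0 + X @ [2.0, -1.0] + e

# order=(2, 0, 0) specifies AR(2) errors; exog carries the regression part
model = sm.tsa.SARIMAX(y, exog=sm.add_constant(X), order=(2, 0, 0))
res = model.fit(disp=False)
print(res.summary())  # regression coefficients plus the AR parameters

Forecasting from the fitted results is then conditional on the history, as described above, and requires future values of the exogenous regressors.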
Related
I tried to do a univariate analysis (binary logistic regression, one feature at a time) in Python with statsmodels to calculate the p-value for each feature.
import pandas as pd
import statsmodels.api as sm

features, pvals = [], []
for f_col in f_cols:
    model = sm.Logit(y, df[f_col].astype(float))
    result = model.fit()
    features.append(f_col)
    pvals.append(result.pvalues[f_col])

df_pvals = pd.DataFrame(list(zip(features, pvals)),
                        columns=['features', 'pvals'])
df_pvals
However, the results in SPSS are different. For example, the p-value of NYHA from sm.Logit is 0, and all of the other p-values differ as well.
Is it right to use sm.Logit in statsmodels for binary logistic regression?
Why is there a difference between the results? Does sm.Logit perhaps apply L1 regularization?
How can I get the same results as SPSS?
Many thanks!
SPSS regression modeling procedures include a constant or intercept term automatically, unless they're told not to. As Josef mentions, statsmodels appears to require you to add an intercept explicitly.
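For example, a minimal sketch of adding the intercept inside the loop from the question (df, y, and f_cols are the question's own variables):

import statsmodels.api as sm

for f_col in f_cols:
    # add_constant appends the intercept column that SPSS includes by default
    X = sm.add_constant(df[f_col].astype(float))
    result = sm.Logit(y, X).fit()
    print(f_col, result.pvalues)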
I am trying to evaluate a logistic regression model with a residual plot in Python.
I searched on the internet but could not find the information.
It seems that we can calculate the deviance residual, based on this answer:
from sklearn.metrics import log_loss

def deviance(X_test, y_true, model):
    # twice the negative log-likelihood of the predicted probabilities
    return 2 * log_loss(y_true, model.predict_proba(X_test))
This returns a numeric value.
However, a residual plot can be examined when fitting a GLM.
It seems that there are no Python packages for plotting logistic regression residuals, Pearson or deviance.
Moreover, I found an interesting package, ResidualsPlot, but I'm not sure whether it can be used for logistic regression.
Any suggestions for plotting the residuals?
In addition, I also found a resource here, which is for OLS rather than logit; it seems that the residual calculations are a little different.
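One possible approach, sketched here as an assumption rather than a canonical recipe: refit the model as a Binomial GLM in statsmodels, whose results expose resid_deviance and resid_pearson, and plot them against the fitted probabilities.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# simulated placeholder data standing in for your own X and binary y
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = (X[:, 1] + rng.normal(size=200) > 0).astype(int)

res = sm.GLM(y, X, family=sm.families.Binomial()).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(res.fittedvalues, res.resid_deviance, s=10)
axes[0].set(xlabel='fitted probability', ylabel='deviance residual')
axes[1].scatter(res.fittedvalues, res.resid_pearson, s=10)
axes[1].set(xlabel='fitted probability', ylabel='Pearson residual')
plt.show()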
Based on the logistic regression function

    p(x) = 1 / (1 + exp(-(β0 + β1 x)))

I'm trying to extract the values β0 and β1 from my model in scikit-learn, where β0 is the intercept and β1 is the regression coefficient (as per the Wikipedia article).
Now, I think I can get β0 by doing model.intercept_, but I've been struggling to get β1. Any ideas?
You can access the coefficients of the features using model.coef_.
It gives an array of values that correspond to β1, β2 and so on. The size of the array depends on the number of explanatory variables your logistic regression uses.
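For example, a minimal sketch on simulated placeholder data (scikit-learn stores coef_ as a 2D array, so the first row is taken below):

import numpy as np
from sklearn.linear_model import LogisticRegression

# simulated placeholder data standing in for your own X and y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)
beta0 = model.intercept_[0]   # β0, the intercept
betas = model.coef_[0]        # β1, β2, ... one coefficient per feature
print(beta0, betas)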
Is there a way to specify an L2 penalty for the logistic regression model in statsmodels, through a parameter or otherwise? I have only found the L1 penalty in the docs, but nothing for the L2 penalty.
The models in statsmodels.discrete, like Logit, Poisson and MNLogit, currently have only L1 penalization. However, elastic net for GLM and a few other models has recently been merged into statsmodels master.
GLM with family binomial with a binary response is the same model as discrete.Logit although the implementation differs. See my answer for L2 penalization in Is ridge binomial regression available in Python?
What has not yet been merged into statsmodels is L2 penalization with a structured penalty matrix, as used, for example, as a roughness penalty in generalized additive models (GAM) and in spline fitting.
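For illustration, a minimal sketch of ridge (L2) logistic regression through GLM's elastic net fit, with the L1 weight set to zero; the data are simulated placeholders:

import numpy as np
import statsmodels.api as sm

# simulated placeholder design matrix and binary response
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 3)))
y = (X[:, 1] - X[:, 2] + rng.normal(size=200) > 0).astype(int)

# Binomial GLM with an elastic net penalty; L1_wt=0 makes the penalty pure L2 (ridge)
model = sm.GLM(y, X, family=sm.families.Binomial())
res = model.fit_regularized(method='elastic_net', alpha=0.1, L1_wt=0.0)
print(res.params)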
If you look closely at the documentation for statsmodels.regression.linear_model.OLS.fit_regularized, you'll see that the current version of statsmodels allows for Elastic Net regularization, which is basically just a convex combination of the L1 and L2 penalties (though more robust implementations employ some post-processing to diminish undesired behaviors of the naive implementations; see "Elastic Net" on Wikipedia for details). The objective that is minimized is

    0.5 * RSS / n + alpha * ( lambda_1 * ||beta||_1 + (1 - lambda_1) * ||beta||_2^2 / 2 )
If you take a look at the parameters for fit_regularized in the documentation:
OLS.fit_regularized(method='elastic_net', alpha=0.0, L1_wt=1.0, start_params=None, profile_scale=False, refit=False, **kwargs)
you'll realize that L1_wt is just lambda_1 in the equation above. So to get the L2 penalty you're looking for, you just pass L1_wt=0 as an argument when you call the function. As an example:
import statsmodels.api as sm

model = sm.OLS(y, X)
results = model.fit_regularized(method='elastic_net', alpha=1.0, L1_wt=0.0)
print(results.params)  # penalized coefficient estimates (summary() is not available on regularized results)
should give you an L2-penalized regression predicting the target y from the input X.
Hope that helps!
P.S. Three final comments:
1) statsmodels currently only implements elastic_net as an option to the method argument. So that gives you L1 and L2 and any linear combination of them but nothing else (for OLS at least);
2) L1 Penalized Regression = LASSO (least absolute shrinkage and selection operator);
3) L2 Penalized Regression = Ridge Regression, the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, and the method of linear regularization.
I am running a logistic regression using statsmodels and am trying to find the score of my regression. The documentation doesn't provide much information about the score method, unlike sklearn, which lets the user pass a test dataset and the target values, i.e. lr.score(test_data, target). What parameters should I pass to statsmodels' score function, and how? Documentation: http://statsmodels.sourceforge.net/stable/generated/statsmodels.discrete.discrete_model.Logit.score.html#statsmodels.discrete.discrete_model.Logit.score
In statistics and econometrics, "score" usually refers to the derivative of the log-likelihood function. That's the definition used in statsmodels.
Prediction performance measures for classification or regression with binary dependent variables have largely been neglected in statsmodels.
An open pull request is here
https://github.com/statsmodels/statsmodels/issues/1577
statsmodels does have performance measures for continuous dependent variables.
You pass it model parameters, i.e. the coefficients for the predictors. However, that method doesn't do what you think it does: it returns the score vector for the model, not the accuracy of its predictions (which is what the scikit-learn score method returns).
But you can always check the fitted results' pseudo R-squared via result.prsquared.
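For illustration, a minimal sketch on simulated placeholder data contrasting the score vector with a scikit-learn-style accuracy:

import numpy as np
import statsmodels.api as sm

# simulated placeholder data standing in for your own X and y
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = (X[:, 1] - X[:, 2] + rng.normal(size=100) > 0).astype(int)

result = sm.Logit(y, X).fit(disp=False)

grad = result.model.score(result.params)        # score vector: gradient of the log-likelihood, ~0 at the MLE
acc = ((result.predict(X) > 0.5) == y).mean()   # sklearn-style classification accuracy, computed by hand
print(grad)
print(acc, result.prsquared)                    # accuracy and McFadden's pseudo R-squared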