logistic regression residuals plot/distribution - python

I am trying to evaluate a logistic regression model with a residual plot in Python.
I searched online but could not find much information.
It seems that the deviance can be calculated following this answer:
from sklearn.metrics import log_loss

def deviance(X_test, y_true, model):
    # log_loss expects predicted probabilities; normalize=False sums rather than averages
    return 2 * log_loss(y_true, model.predict_proba(X_test), normalize=False)
This returns a single numeric value, though, not per-observation residuals.
However, a residual plot is something we can usually examine when fitting a GLM.
It seems that there are no Python packages for plotting logistic regression residuals, whether Pearson or deviance.
Moreover, I found an interesting package, ResidualsPlot. But I'm not sure whether it can be used for logistic regression.
Any suggestions for plotting such residuals?
In addition, I also found a resource here, but it is for OLS rather than logit, and the residual calculations seem to be a little different.
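One possible workaround (not from the thread itself): refit the model with statsmodels, whose GLM results expose per-observation resid_pearson and resid_deviance attributes. A minimal sketch with synthetic stand-in data; replace X and y with your own design matrix and binary target:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Synthetic stand-in data; replace X and y with your own.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(0.5 + X @ [1.0, -2.0])))).astype(int)

# Fit the logistic model as a GLM with a Binomial family.
X_const = sm.add_constant(X)                         # explicit intercept
fit = sm.GLM(y, X_const, family=sm.families.Binomial()).fit()

pearson_resid = fit.resid_pearson                    # Pearson residuals
deviance_resid = fit.resid_deviance                  # deviance residuals

# Plot deviance residuals against the fitted probabilities.
plt.scatter(fit.fittedvalues, deviance_resid, s=10)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted probability")
plt.ylabel("Deviance residual")
plt.show()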

Related

Converting an sklearn logistic regression to a statsmodel logistic regression

I'm optimizing a logistic regression in sklearn through repeated k-fold cross-validation. I want to check out the confidence intervals and, based on other Stack Exchange answers, it seems easier to get that information from statsmodels.
Coming from an sklearn background, though, statsmodels is opaque. How can I translate the optimized settings for my logistic regression into statsmodels: things like the L2 penalty, the C value, the intercept, etc.?
I've done some research and it looks like statsmodels supports L2 indirectly through a Binomial GLM. The C value will need to be converted into alpha (whatever that means), and I have only a vague idea of how to specify the intercept (it looks like it has something to do with the add_constant function).
Can someone drop an example of how to do this kind of translation into statsmodels? I'm sure once I see it done, a lot of it will naturally fall into place in my head.
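Not an authoritative mapping, but here is a sketch of the translation that is usually suggested: add the constant explicitly with add_constant, use GLM with a Binomial family, and express the ridge penalty through fit_regularized with L1_wt=0. The conversion from C to alpha used below (alpha = 1 / (C * n_samples), with the constant left unpenalized) is an assumption about how the two penalty scalings relate and should be verified against your own setup:

import numpy as np
import statsmodels.api as sm

# Placeholder data and a placeholder C from the sklearn tuning; substitute your own.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (rng.uniform(size=500) < 0.5).astype(int)
C = 1.0

X_const = sm.add_constant(X)    # sklearn adds the intercept implicitly; statsmodels does not
n_samples = X_const.shape[0]

# Assumed C -> alpha conversion; the leading 0.0 leaves the constant unpenalized,
# mirroring sklearn's behaviour of not penalizing the intercept. Verify this scaling.
alpha = np.r_[0.0, np.full(X.shape[1], 1.0 / (C * n_samples))]

model = sm.GLM(y, X_const, family=sm.families.Binomial())
result = model.fit_regularized(alpha=alpha, L1_wt=0.0)   # L1_wt=0 -> pure L2 (ridge) penalty
print(result.params)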

Different p-value of logistic regression in SPSS and statsmodels

I tried to do a univariate analysis (binary logistic regression, one feature at a time) in Python with statsmodels to calculate a p-value for each feature.
import pandas as pd
import statsmodels.api as sm

features, pvals = [], []
for f_col in f_cols:
    model = sm.Logit(y, df[f_col].astype(float))
    result = model.fit()
    # result.pvalues is a pandas Series indexed by the feature name
    features.append(result.pvalues.index[0])
    pvals.append(result.pvalues.iloc[0])

df_pvals = pd.DataFrame(list(zip(features, pvals)),
                        columns=['features', 'pvals'])
df_pvals
However, the results in SPSS are different. The p-value of NYHA from the sm.Logit method is 0, and all of the p-values differ.
Is it right to use sm.Logit in statsmodels to do binary logistic regression?
Why is there a difference between the results? Does sm.Logit perhaps use L1 regularization?
How can I get the same results?
Many thanks!
SPSS regression modeling procedures include constant or intercept terms automatically, unless they're told not to do so. As Josef mentions, statsmodels appears to require you to explicitly add an intercept.
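A minimal sketch of that fix, using toy stand-ins for the question's y and df[f_col]:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy stand-ins; replace with your own data.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=100), name="NYHA")
y = pd.Series((rng.uniform(size=100) < 0.5).astype(int))

# Adding an explicit constant makes sm.Logit comparable to SPSS, which
# includes the intercept term by default.
X = sm.add_constant(x.astype(float))
result = sm.Logit(y, X).fit(disp=0)
print(result.pvalues)   # p-values for the constant and for the feature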

p-values for each coefficient ElasticNetCV()

I know this one: Find p-value (significance) in scikit-learn LinearRegression
I've never extended a class in Python, and I'm not sure whether this is the right solution for me (I tried but got a TypeError). I'm calculating an elastic net regression with scikit-learn. Since my regressors are in a sparse matrix, the statsmodels package is not an option. Thus, I'm looking for a reliable way to calculate p-values for each coefficient in my elastic net regression. Is there a solution provided by scikit-learn nowadays?

What are the initial estimates taken in Logistic regression in Scikit-learn for the first iteration?

I am trying out logistic regression from scratch in Python (finding the probability estimates and the cost function, and applying gradient descent to maximize the likelihood). But I am confused about which estimates to use for the first iteration. I set all of the estimates to 0 (including the intercept), but the results differ from what we get in Scikit-learn. I want to know which initial estimates Scikit-learn uses for logistic regression.
First of all, scikit-learn's LogisticRegression uses regularization, so unless you apply that too, it is unlikely you will get exactly the same estimates. If you really want to test your method against scikit-learn's, it is better to use their gradient descent implementation of logistic regression, which is called SGDClassifier. Make certain you set loss='log' for logistic regression and alpha=0 to remove regularization, but you will still need to adjust the number of iterations and the learning rate (eta0), as their implementation is likely to be slightly different from yours.
To answer the question about the initial estimates specifically: I don't think it matters, but most commonly you set everything to 0 (including the intercept) and it should converge just fine.
Also bear in mind that GD (gradient descent) models can be hard to tune, and you may need to apply some scaling (like StandardScaler) to your data beforehand, as very large feature values are likely to destabilize the gradient steps. Scikit-learn's implementation adjusts for that.
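A small sketch of that comparison; note that recent scikit-learn versions spell the logistic loss loss='log_loss' (older versions used loss='log' as mentioned above):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)   # scale the features so gradient descent behaves well

# Plain gradient-descent logistic regression: logistic loss, alpha=0 removes the penalty,
# constant learning rate eta0 so the updates resemble a hand-rolled implementation.
clf = SGDClassifier(loss="log_loss", alpha=0.0, max_iter=10000, tol=1e-6,
                    eta0=0.01, learning_rate="constant", random_state=0)
clf.fit(X, y)
print(clf.intercept_, clf.coef_)   # compare against your from-scratch estimates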

Evaluating a Gaussian Fit

I'd like to know ways to determine how well a Gaussian function is fitting my data.
Here are a few plots I've been testing methods against. Currently, I'm just using the RMSE of the fit versus the sample (red is fit, blue is sample).
For instance, here are 2 good fits:
And here are 2 terrible fits that should be flagged as bad data:
In general, I'm looking for suggestions of additional metrics to measure the goodness of fit. Additionally, as you can see in the second 'good' fit, there can sometimes be other peaks outside the data. Currently, these are penalized by the RMSE method, though they should not be.
I'm looking for suggestions of additional metrics to measure the goodness of fit.
The one-sample Kolmogorov-Smirnov (KS) test would be a good starting point.
I'd suggest the Wikipedia article as an introduction.
The test is available in SciPy as scipy.stats.kstest. The function computes and returns both the KS test statistic and the p-value.
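A minimal sketch of that call, here with a synthetic sample standing in for your data and the Gaussian parameters estimated from that sample:

import numpy as np
from scipy import stats

# Synthetic stand-in sample; replace with your own data.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=300)

# Parameters of the fitted Gaussian (here simply estimated from the sample).
mu, sigma = np.mean(data), np.std(data)

# One-sample KS test of the data against that Gaussian.
# Note: estimating mu and sigma from the same data makes the p-value somewhat optimistic.
statistic, p_value = stats.kstest(data, "norm", args=(mu, sigma))
print(statistic, p_value)   # a small p-value suggests a poor Gaussian fit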
You can try quantile-quantile (Q-Q) plots using probplot from scipy.stats:
import matplotlib.pyplot as plt
from scipy.stats import probplot

probplot(data, dist='norm', plot=plt)
plt.show()
Calculate quantiles for a probability plot, and optionally show the plot.
Generates a probability plot of sample data against the quantiles of a specified theoretical distribution (the normal distribution by default). probplot optionally calculates a best-fit line for the data and plots the results using Matplotlib or a given plot function.
There are other ways of evaluating a good fit, but most of them are not robust to outliers.
There is MSE (mean squared error), which you already know, and RMSE, which is its square root.
But you can also measure the fit using MAE (mean absolute error) and MAPE (mean absolute percentage error).
Also, there is the Kolmogorov-Smirnov test, which is far more complex and for which you would probably need a library, while MAE, MAPE and MSE you can implement yourself quite easily, as sketched below.
(If you are dealing with unsupervised data and/or classification, which apparently is not your case, ROC curves and confusion matrices are also accuracy metrics.)
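A quick sketch of those metrics computed directly with NumPy, using toy stand-ins for the sample and the fitted curve:

import numpy as np

# Toy stand-ins: y_sample are the observed values, y_fit is the fitted Gaussian
# evaluated at the same x positions; replace both with your own arrays.
x = np.linspace(-2, 2, 100)
y_fit = np.exp(-x ** 2 / 2)
rng = np.random.default_rng(0)
y_sample = y_fit + rng.normal(scale=0.02, size=x.size)

err = y_sample - y_fit
mse = np.mean(err ** 2)                        # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error
mae = np.mean(np.abs(err))                     # Mean Absolute Error
mape = np.mean(np.abs(err / y_sample)) * 100   # Mean Absolute Percentage Error (assumes no zeros)

print(mse, rmse, mae, mape)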
