Likelihood Ratio Test and pseudo R-squared differ significantly (Logistic Regression) - python

I am fitting a logistic regression and would like to test the statistical significance of my overall model.
Now, McFadden's pseudo R-squared, R-squared = 1 - L(c)/L(null), is commonly read as the share of variation explained by the model, where L(c) denotes the maximized log-likelihood of the fitted model and L(null) the corresponding value for the null model (intercept only, no covariates).
The likelihood-ratio test statistic is LR = 2 * (L(c) - L(null)), which follows a chi-squared distribution and can be tested for significance using the model's degrees of freedom.
Anyway, I use the chi-squared distribution to calculate a p-value, which turns out to be highly significant, yet the pseudo R-squared is only around 0.021.
Why do the R-squared and the overall p-value differ so much?
Using an accuracy calculation on some test data, metrics.accuracy_score(y_test, y_pred), I see that the accuracy on the test data is only around 55% (on the training data it's around 60%).
Can someone help me to interpret my results?
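For reference, a minimal sketch of how both quantities can be obtained from a fitted statsmodels Logit model (X and y are placeholders for the feature matrix and binary outcome):

import statsmodels.api as sm

# X: feature matrix, y: binary outcome (placeholders for your data)
result = sm.Logit(y, sm.add_constant(X)).fit()

print(result.prsquared)    # McFadden's pseudo R-squared: 1 - L(c)/L(null)
print(result.llr)          # LR statistic: 2 * (L(c) - L(null))
print(result.llr_pvalue)   # p-value from the chi-squared distribution
print(result.df_model)     # degrees of freedom of the test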

Maybe there is a correlation that is significant, but whose impact is still small: since you are doing classification, you could check whether examples with this variable equal to 1 (in the binary case) have a slightly higher or lower probability of belonging to class 1 than those with the variable equal to 0. For example:
examples with the variable equal to 1 have a 50% chance of belonging to class 1, while examples with the variable equal to 0 have a 48% chance of belonging to class 1.
If there are many examples with that variable, the effect can still be significant (p-value), but the variable alone will hardly predict the right class (explain the variance, i.e. R-squared).
This reference might help you understand the issue graphically for a similar problem: https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-a-regression-model-with-low-r-squared-and-low-p-values
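To illustrate the point with simulated (not your) data: with many observations, even a weak binary predictor produces a highly significant LR test while McFadden's R-squared stays tiny:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20000
x = rng.integers(0, 2, n)                 # binary covariate
p = np.where(x == 1, 0.52, 0.48)          # small shift in class-1 probability
y = rng.binomial(1, p)

res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(res.llr_pvalue)   # typically far below 0.05
print(res.prsquared)    # typically well below 0.01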

Related

In a linear regression model, is it always necessary that the residual mean is close or equal to zero?

resid = df['Actual'] - df['Predicted']
resid_mean = resid.mean()
print(resid_mean)
Output:
250.8173868583906
Is my model predicting values correctly or not?
Linear regression involves minimising the mean squared error (Q) to find the best-fitting slope (a) and intercept (b). That is, Q is minimized at the values of a and b for which ∂Q / ∂a = 0 and ∂Q / ∂b = 0.
The sum of the residuals, and therefore their mean, is always zero for the data you regressed on; this follows from one of the two conditions above (the one for the intercept, ∂Q / ∂b = 0).
So, unless you are checking the residual mean on data not used in training, there appears to be some mistake in the linear regression procedure you employed.
A detailed proof is available here: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
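A quick way to see this in code (a sketch with synthetic data; the key point is that the residuals are computed on the same data the model was fitted on):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200)

model = LinearRegression().fit(X, y)   # the intercept is fitted by default
resid = y - model.predict(X)
print(resid.mean())                    # ~0, up to floating-point error

If df['Predicted'] came from the same fit, with an intercept, its training-data residual mean should likewise be essentially zero.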

Random Forest Regression MAE

I'm trying to create a random forest regression model to predict a continuous variable with the following characteristics (after log-transformation):
count 2868.000000
mean 4.566630
std 1.233496
min -0.356675
25% 3.745966
50% 4.604670
75% 5.452325
max 8.141889
The best result I get in terms of negative mean absolute error is -0.64. How should I evaluate this result? Is it good/decent/bad?
Thank you very much.
Here is a link to the docs.
The simple answer to your question is yes, an NMAE of -0.64 can be a decent result.
NMAE is just the negative of MAE, so you can drop the minus sign and treat it as an ordinary MAE of 0.64. A large MAE corresponds to a strongly negative NMAE and is a poor result, while a small MAE corresponds to an NMAE close to zero and is a good one; whether 0.64 counts as small depends on the scale of your target and on what you need from the model.
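As a sketch of how the scorer relates to a plain MAE, and one way to put the 0.64 in context against the target's spread (X and y are placeholders for your features and log-transformed target):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(random_state=0)
nmae = cross_val_score(rf, X, y, scoring="neg_mean_absolute_error", cv=5)
mae = -nmae.mean()        # drop the sign to get an ordinary MAE

print(mae)                # e.g. about 0.64 in the question
print(mae / np.std(y))    # MAE relative to the target's standard deviation (~1.23 here)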

Prediction using RandomForest Algorithm, Sensitivity and Specificity

I'm new to ML and developing a simple Prediction model using Random Forest algorithm. I'm also evaluating my model using Confusion Matrix and computing the Sensitivity and Specificity scores of my built model.
While I conceptually understand these two, I'm now stuck at a point where my model always results in the sensitivity score of 1 and specificity score of 0. It gets an accuracy score of 91% (91/100 training samples are predicted correctly).
Now what do I infer from this? Is my model too sensitive and not specific? Should I conclude that my model is prone to emitting false positives but never false negatives? (This may even be acceptable in my scenario, since the domain I'm dealing with is fraud prediction, where false positives are preferable.)
As a side note, how are sensitivity and specificity scores typically used in an ML workflow in practice? Are they used to set some sort of threshold on the client systems so that predictions are taken with a pinch of salt? What parameters can I tune to get different values for these? At the moment I have only tweaked the random forest variables such as the number of trees and the number of nodes, but these scores always stay the same.
I would argue that the best inference you can make from your specificity and sensitivity numbers is that your model is somehow misspecified, or that there is a data error or some other non-parameter aspect of the data or your code causing this behaviour.
As you know (using TP = true positives, FN = false negatives, and so on):
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
To get sensitivity = 1, TP must equal TP + FN, which means FN = 0 (since TP cannot be 0): the model never labels a truly positive sample as negative.
Additionally, a specificity of 0 implies that TN = 0, regardless of the value of FP. But here too I would look at how FP and FN are distributed.
If TN = 0, then for an accuracy of about 0.9 it must be that TP ≈ 0.9 as a fraction of all samples, which means that FP + FN ≈ 0.1 (since the four fractions must sum to 1).
One possibility, mentioned by @desertnaut, is that you have a serious class-imbalance issue. That is, the minority class (the one you want to predict) is a very small fraction of all observations in your data. To look into this, simply plot the distribution of your target. If the minority class is less than, say, 10% of the total (in general, if one category dominates the other 10:1), you are likely to start seeing problems in your model.
Here's a good tutorial on dealing with imbalanced data.
And here's a paper if you're more technically minded and want to confront an imbalance in a more academic way.
That said, I would also look at the code you have for your model. Check into how you're separating training and testing:
Is the balance of classes in the training set representative of the balance in the whole set?
How are you doing cross-validation, if at all? Does changing the method give you different results for spec/sens?
Are the data types for the categories correct throughout (and for the rest of the variables, for that matter)? Since RF can deal with categorical variables, it might not throw an error if a column you expect to be an int is coded as a string (just to pick an example), and that can produce odd results.
Seeing results with perfect scores (1/0) is usually a sign that something other than the model parameters (or hyperparameters) is specified incorrectly. Not that it can't happen; it just rarely does.
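A minimal sketch of the checks above (y_test and y_pred are placeholders for your true labels and predictions, assumed to be 0/1 integers):

import numpy as np
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes (labels=[0, 1])
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(sensitivity, specificity)

# Class-balance check: a heavily skewed target often explains 1/0 scores
print(np.bincount(y_test) / len(y_test))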

Intuition behind nloglikelihood value in xgboost poisson run

When I use count:poisson instead of rmse, I see nloglikelihood values, and I am not sure how to compare those numbers with RMSE or MAE.
A lower value is definitely better, but I don't get the intuitive sense of the error size that RMSE or MAE gives me.
For example -> train-poisson-nloglik:2.01885 val-poisson-nloglik:2.02898
Can we say here that the actual values differ by an error of about 2.02?
Can someone explain with a small example?
Thanks.
There is a good post on the computation of the value here
Just to be more exhaustive, the value is:
mean(log(factorial(label)) + preds - label*log(preds))
If you compare this with the exact formula for the negative log-likelihood, it should be a sum rather than a mean. I guess they chose the mean so that the train and test values are more comparable.
Finally, to answer the question: the likelihood is the probability of observing the data under a distribution with a specific parameter. In the Poisson model, the parameters are just the set of predictions, so the better your predictions, the higher that probability and the smaller the associated negative log-likelihood.
RMSE and MAE are based on the expected difference between the prediction and the truth, whereas the negative log-likelihood looks at a probability.
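A small numpy sketch of the per-sample value described above, using gammaln for log(label!); xgboost's internal implementation may differ in numerical details:

import numpy as np
from scipy.special import gammaln

def poisson_nloglik(labels, preds):
    # Mean Poisson negative log-likelihood: preds - labels*log(preds) + log(labels!)
    labels = np.asarray(labels, dtype=float)
    preds = np.asarray(preds, dtype=float)
    return np.mean(preds - labels * np.log(preds) + gammaln(labels + 1.0))

print(poisson_nloglik([2, 0, 5], [2.1, 0.3, 4.5]))   # lower is better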

How does sklearn.svm.SVC's predict_proba() function work internally?

I am using sklearn.svm.SVC from scikit-learn for binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() calculates the probability internally?
Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.
Platt scaling requires first training the SVM as usual, then optimizing parameter vectors A and B such that
P(y|X) = 1 / (1 + exp(A * f(X) + B))
where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = .475. I'm pulling these numbers out of thin air, but you've noticed that this can occur in practice.
Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross validation, meaning that training SVMs with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
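A hedged sketch of how this relationship can be probed in scikit-learn: the internal Platt fit uses cross-validation, so a logistic fit on the decision_function values only approximates predict_proba (synthetic data, illustrative only):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
svc = SVC(probability=True, random_state=0).fit(X, y)

f = svc.decision_function(X).reshape(-1, 1)   # signed distance f(X)
platt = LogisticRegression().fit(f, y)        # rough Platt-style sigmoid fit
p_manual = platt.predict_proba(f)[:, 1]
p_svc = svc.predict_proba(X)[:, 1]

print(np.corrcoef(p_manual, p_svc)[0, 1])     # close to, but not exactly, 1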
Actually, I found a slightly different answer: this is the code they use to convert a decision value into a probability:
// numerically stable evaluation of 1 / (1 + exp(A * decision_value + B))
double fApB = decision_value * A + B;
if (fApB >= 0)
    return Math.exp(-fApB) / (1.0 + Math.exp(-fApB));
else
    return 1.0 / (1.0 + Math.exp(fApB));
Here, the A and B values can be found in the model file (probA and probB).
It also offers a way to convert a probability back into a decision value, and thus into a hinge loss. In that conversion, ln(0) is treated as -200 (a clamp that avoids infinities when the probability is exactly 0 or 1).
