I'm currently working on a project that estimates a signal using classification algorithms, such as logistic regression and random forest, in scikit-learn.
I'm now using the confusion matrix to evaluate how well the different algorithms predict, and I found a problem common to both. In every case, although the accuracy looks relatively good (around 90%-93%), the total number of false negatives (FN) is very high compared to true positives (TP), i.e. the recall/TPR is below 3%. Does anyone have a clue why I'm running into this issue in my prediction problem? If possible, can you give me some hints on how to solve it?
Thanks in advance for any replies and help.
Updates:
The dataset is extremely imbalanced (8:1), with around 180K observations in total. I have already tested several re-sampling methods, such as OSS and SMOTE (+Tomek or +ENN), but none of them gave good results. In each case, although the recall goes up from 2.5% to 20%, the precision drops significantly (from 60% to 20%).
You probably have an imbalanced dataset, where one of your classes has many more examples than your other class.
One solution is to assign a higher cost to misclassifying the class with fewer examples.
This question in Cross Validated covers many approaches to your problem:
https://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
EDIT:
Given that you are using scikit-learn, as a first approach you can set the class_weight parameter to 'balanced' in your LogisticRegression.
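A minimal sketch of what that looks like (the imbalanced dataset here is synthetic, generated with make_classification purely for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic ~9:1 imbalanced dataset, purely for illustration
X, y = make_classification(n_samples=10000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights errors inversely to class frequency,
# so mistakes on the rare class cost more during training
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)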
Related
I used the "Stroke" data set from kaggle to compare the accuracy of the following different models of classification:
K-Nearest-Neighbor (KNN).
Decision Trees.
Adaboost.
Logistic Regression.
I did not implement the models myself, but used the sklearn library's implementations.
After training the models, I ran them on the test data and printed each model's accuracy; these were the results:
As you can see, KNN, Adaboost, and Logistic Regression gave me the exact same accuracy.
My question is: does it make sense that there is not even a small difference between them, or did I make a mistake somewhere along the way (even though I only used sklearn's implementations)?
In general, achieving exactly the same scores is unlikely, and the explanation is usually one of:
bug in actual reporting
bug in the data processing
score reported corresponds to a degenerate solution
And the last explanation is probably the case here. The Stroke dataset has 249 positive samples out of about 5,000 data points, so a model that always says "no stroke" will get roughly 95% accuracy. So my best guess is that all of your models failed to learn anything and are just constantly outputting "0".
In general, accuracy is not the right metric for highly imbalanced datasets. Consider balanced accuracy, F1, etc.
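One quick sanity check is to compare each model against a constant baseline and to look at balanced accuracy instead of plain accuracy. A sketch, assuming train/test splits and a fitted model already exist:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# X_train, y_train, X_test, y_test and model are assumed to already exist
dummy = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
dummy.fit(X_train, y_train)

print('dummy accuracy:', accuracy_score(y_test, dummy.predict(X_test)))
print('model accuracy:', accuracy_score(y_test, model.predict(X_test)))
# Balanced accuracy averages recall over the classes, so a constant
# predictor scores only 0.5 no matter how imbalanced the data is
print('model balanced accuracy:', balanced_accuracy_score(y_test, model.predict(X_test)))

If the model's accuracy matches the dummy's, it has almost certainly learned nothing.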
There has been a lot of discussion about this topic.
But I do not have enough reputation (i.e. 50) to comment on those posts, hence I am creating this one.
As far as I understand, accuracy is not an appropriate metric when the data is imbalanced.
My question is: is it still inappropriate if we have applied resampling methods, class weights, or an initial bias?
Reference here.
Thanks!
Indeed, it is always a good idea to test resampling techniques such as oversampling the minority class and undersampling the majority class. My advice is to start with this excellent walkthrough of resampling techniques using the imblearn package in Python. Ultimately, what seems to work best in most cases is intelligently combining both under- and over-samplers.
For example, undersample the majority class by 70% and then oversample the minority class to match the new size of the majority class.
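A rough sketch of that with imblearn (the ratios below are made up for illustration, and X_train, y_train are placeholders):

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Undersample the majority class first, then SMOTE the minority class up
# to the new majority size; the ratios here are illustrative only
resample_then_fit = Pipeline(steps=[
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=0)),
    ('over', SMOTE(sampling_strategy=1.0, random_state=0)),
    ('clf', LogisticRegression(max_iter=1000)),
])
# The imblearn Pipeline applies the samplers only during fit, so the
# (untouched) test set keeps its natural imbalance
resample_then_fit.fit(X_train, y_train)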
To answer your question regarding accuracy: no, it should not be used. The main reason is that when you apply resampling techniques, you should never apply them to the test set. Your test set, just like the data you will see in real life and in production, is never known in advance, so the imbalance will always be there.
As far as evaluation metrics go, the question you need to ask is: how much more important is the minority class than the majority class, and how many false positives are you willing to tolerate? The best way to check for class separability is a ROC curve. The better it looks (a curve sitting well above the diagonal), the better the model is at separating positive classes from negative classes, even if the data is imbalanced.
To get a single score that allows you to compare models, and if false negatives are more important than false positives (which is almost always the case in imbalanced classification), use the F2 measure, which gives more weight to recall (i.e. more importance to the true positives detected by your model). However, the way we have always done it in my company is by examining the classification report in detail to know exactly how much recall we get for each class (so yes, we mainly aim for high recall and occasionally look at the precision, which reflects the number of false positives).
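For reference, both are one-liners in scikit-learn (a sketch, assuming y_test and the model's y_pred already exist):

from sklearn.metrics import fbeta_score, classification_report

# F2 weights recall twice as heavily as precision (beta=2)
print('F2:', fbeta_score(y_test, y_pred, beta=2))
# Per-class precision/recall/F1, which is what we read in detail
print(classification_report(y_test, y_pred))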
Conclusion:
Always check multiple scores, such as the classification report, focusing mainly on recall
If you want a single score, use F2.
Use the ROC curve to evaluate the model visually with regard to class separability
Never apply resampling to your test set!
Finally, it would be wise to apply a cost-sensitive learning technique to your model, such as class weighting during training!
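For example, a minimal sketch of class weighting with scikit-learn (X_train and y_train are hypothetical placeholders):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

# Derive weights inversely proportional to the class frequencies in y_train
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)

clf = LogisticRegression(class_weight=dict(zip(classes, weights)), max_iter=1000)
clf.fit(X_train, y_train)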
I hope this helps!
I would prefer to use the g-mean or the Brier score, as Prof. Harrell wrote a nice discussion on this topic. See this: https://www.fharrell.com/post/class-damage/. Here is another one that discusses the limitations of using improper metrics: https://www.sciencedirect.com/science/article/pii/S2666827022000585
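Both are straightforward to compute; a sketch, assuming y_test, hard predictions y_pred, and predicted positive-class probabilities y_prob already exist (geometric_mean_score comes from imblearn):

from sklearn.metrics import brier_score_loss
from imblearn.metrics import geometric_mean_score

# Brier score: mean squared error of the predicted probabilities (lower is better)
print('Brier score:', brier_score_loss(y_test, y_prob))
# g-mean: geometric mean of the per-class recalls (sensitivity and specificity)
print('g-mean:', geometric_mean_score(y_test, y_pred))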
Sorry for the vague title; I don't know how to better express my problem. I'm working with an insurance dataset to predict future claim costs for a given policy.
For anyone who has worked with insurance claim data, you know that the claims are heavily 0-weighted. I've run into the issue before where regression on the entire dataset does not perform well, due to the skew of the data, and the continuous-discrete distribution mix.
I've tried some Tweedie distributions in R to help with this disconnect, but I ended up going a different route.
I first decided to classify the data into "claim amount = 0" and "claim amount != 0" using a support vector classifier, sklearn.svm.SVC (with 98% training and 95% test accuracy); if a claim amount is predicted to be != 0, it is fed into a regression model to predict the incurred claim amount. I went with ridge regression, sklearn.linear_model.Ridge, for this part, and achieved a relatively good $R^2$ of 0.67 on the test set (real-world data, so I'm not expecting anything extraordinary).
So my question is, what is the best way to evaluate this composite model, specifically in python? Do you think the MSE would be a good metric? The only other model I can compare it to is a basic linear regression (on the entire dataset, without the pre-classification).
Of course, feel free to suggest alternatives to this two-part classification-regression model.
EDIT: To clarify, I chose these specific models (over neural networks, for example) because of their ability to be translated into simple math for different applications.
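For context, the kind of evaluation I had in mind with MSE looks roughly like this (clf is the fitted SVC, reg the fitted Ridge; all names are placeholders):

import numpy as np
from sklearn.metrics import mean_squared_error

# Composite prediction: 0 where the classifier predicts "no claim",
# otherwise the ridge regression estimate of the claim amount
is_claim = clf.predict(X_test)   # fitted sklearn.svm.SVC (placeholder)
amounts = reg.predict(X_test)    # fitted sklearn.linear_model.Ridge (placeholder)
y_pred = np.where(is_claim == 1, amounts, 0.0)

# MSE of the two-stage model over the full, un-split test set
print('composite MSE:', mean_squared_error(y_test, y_pred))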
I have data of dimension (13961, 48) initially, and after one-hot encoding and some basic massaging of the data the dimension is around (13961, 862). The data is imbalanced, with two categories: 'Retained' at around 6% and 'not Retained' at around 94%.
When running algorithms such as logistic regression, KNN, decision tree, or random forest, the data gives very high accuracy even without any feature selection, and the accuracy crosses 94% in most cases, except for the Naive Bayes classifier.
This seems odd: even picking any two features at random gives accuracy of more than 94%, which does not seem realistic.
Applying SMOTE also gives more than 94% accuracy, even for the baseline model of any of the algorithms mentioned above (logistic regression, KNN, decision tree, random forest).
Even after removing the top 20 features, the accuracy is still above 94% (checked to verify the genuineness of the result).
import pandas as pd

# Show the count and percentage of each target class
g = data[Target_col_Y_name]
df = pd.concat([g.value_counts(),
                g.value_counts(normalize=True).mul(100)],
               axis=1, keys=('counts', 'percentage'))
print('The % distribution between the retention and non-retention flag\n')
print(df)
# The code output showing the imbalance is:
The % distribution between the retention and non-retention flag
counts percentage
Non Retained 13105 93.868634
Retained 856 6.131366
My data has 7 numerical variables, such as month, amount, and interest rate, and all the others (around 855) are one-hot-encoded categorical variables.
Is there any methodology for handling this kind of data in terms of baselines, feature selection, or imbalance-optimization techniques? Please advise, keeping in mind the dimensionality and the imbalance count for each level.
I would like to add something in addition to Elias's answer.
Firstly, you have to understand that even if you created a "dumb classifier" that always predicts "not retained", you would still be correct 94% of the time. So accuracy is clearly a weak metric in this case.
You should definitely learn about the confusion matrix and the metrics that come along with it (like AUC).
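For example, a minimal sketch (y_test, hard predictions y_pred, and probability scores y_score are placeholders):

from sklearn.metrics import confusion_matrix, roc_auc_score

# Rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))
# AUC needs probabilities/scores rather than hard labels
print('AUC:', roc_auc_score(y_test, y_score))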
One of these metrics is the F1 score, which is the harmonic mean of precision and recall. It is better than accuracy in an imbalanced-class setting, but... it doesn't have to be the best. F1 favors classifiers that have similar precision and recall, and that is not necessarily what matters to you.
For instance, if you built an SFW content filter, you would be OK with labeling some SFW content as NSFW (the negative class), which would increase the false negative rate (and decrease recall), but you would want to be sure that you kept only safe content (high precision).
In your case you can reason about what is worse: retaining something bad or abandoning something good, and pick the metric accordingly.
As far as strategy is concerned, there are plenty of ways to handle class imbalance: sampling techniques (try down-sampling and up-sampling besides SMOTE or ROSE) and check whether your validation score improves (training metrics alone are almost useless). Just remember to apply sampling/augmentation techniques only after the train-validation split.
Moreover, some models have special hyperparameters to focus more on the rare class (for instance, xgboost has the scale_pos_weight parameter). From my experience, tuning this hyperparameter is far more effective than SMOTE.
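A sketch of that in xgboost (X_train and y_train are placeholders; the ratio is typically set to roughly n_negative / n_positive):

import numpy as np
from xgboost import XGBClassifier

# Weight positive-class errors by the negative/positive ratio of the training set
ratio = np.sum(y_train == 0) / np.sum(y_train == 1)
clf = XGBClassifier(scale_pos_weight=ratio, eval_metric='logloss')
clf.fit(X_train, y_train)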
Good luck
Accuracy is not a very good measure in general, particularly for imbalanced classes. I would recommend this other Stack Overflow answer, which explains when to use the F1 score and when to use AUROC; both are far better measures than accuracy, and in this case F1 is better.
A few points just to clear up:
For models such as random forest, you should not have to remove features to improve the accuracy, as the model will simply treat them as insignificant. I recommend random forests because they tend to be very accurate (except in some cases) and can show significant features just by using clf.feature_importances_ (if using the sklearn random forest); see the sketch after these points.
Decision trees will almost always perform worse than random forests, since a random forest is an aggregation of many decision trees.
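For illustration, a sketch of pulling the most important features out of a fitted sklearn random forest (X_train, y_train, and feature_names are placeholders):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Rank features by impurity-based importance and print the top 10
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:10]:
    print(f'{name}: {score:.3f}')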
Recently I have been working on some projects for which I obtained 30 positive samples and 30 negative samples. Each of them has 128 features (i.e. is 128-dimensional).
I used "LeaveOneOut" and "sklearn.linear_model.LogisticRegression" to classify these samples and obtained a satisfactory result (AUC 0.87).
I told my friend about the results, and he asked how I could compute the parameters with only 60 samples, since the dimension of the feature vectors is larger than the number of samples.
Now I have the same question. I checked the source code of the toolkit and still have no idea about this question. Could someone help me with this question? Thanks!
The situation you have laid out is a common one in machine learning applications: you have a limited number of training examples compared to your number of features (i.e. m < n). You are dealing with a classification problem, so your algorithm outputs either a positive or negative hypothesis given your feature input. It would help to know the training set error compared to the cross-validation set error and test set error for your analysis. If you could post your code, that would help in explaining some further details.
Based on a quick Google search of sklearn.linear_model.LogisticRegression, it appears that it implements regularized logistic regression using L2 regularization by default. I would encourage you to look into the following page on regularization:
https://en.wikipedia.org/wiki/Tikhonov_regularization
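To see the mechanics, here is a self-contained sketch of L2-regularized logistic regression with leave-one-out cross-validation in the same p > n regime (the data is random noise, so the AUC will hover around 0.5; the point is only that the fit is well defined):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

# 60 samples with 128 features, as in the question, but filled with noise
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))
y = np.array([0] * 30 + [1] * 30)

# L2 regularization (sklearn's default, strength controlled by C) keeps the
# weights finite even though there are more features than samples
clf = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
proba = cross_val_predict(clf, X, y, cv=LeaveOneOut(), method='predict_proba')[:, 1]
print('LOO AUC:', roc_auc_score(y, proba))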
I would also recommend reading into the bias/variance discussion as it pertains to underfitting and overfitting your dataset:
https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted