Sorry about all the text, but I think the background of this project would help:
I've been working on a binary classification project. The original dataset consisted of about 28,000 instances of class 0 and 650 of class 1, so it was highly imbalanced. I was given an under- and over-sampled dataset to work with that had 5,000 instances of each class (the class 1 instances were simply duplicated 9 times). After training models on this and getting sub-par results (an AUC of about 0.85, when it needed to be better), I started wondering whether these sampling techniques were actually a good idea, so I took out the original, highly imbalanced dataset again. I plugged it straight into a default GradientBoostingClassifier, trained it on 80% of the data, and immediately got something like this:
Accuracy: 0.997367035282
AUC: 0.9998
Confusion matrix:
[[5562    7]
 [   8  120]]
Now, I know a high accuracy can be an artefact of imbalanced classes, but I did not expect an AUC like this or that kind of performance! So I am very confused and feel there must be an error in my technique somewhere... but I have no idea what it is. I've tried a couple of different classifiers too and gotten similarly ridiculous levels of performance. I did not leave the class labels in the data array, and the training data is completely different from the testing data. Each observation has about 130 features, so this isn't a simple classification problem. It very much seems like something is wrong; I'm sure the classifier cannot be this good. Is there anything else I could be overlooking? Any other common pitfalls people run into with unbalanced data like this?
I can provide the code, probability plots, example data points, etc. if they would be helpful, but I didn't want this to get too long for now. Thanks to anybody who can help!
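For concreteness, the evaluation described above looks roughly like this (a sketch on a small synthetic stand-in for the real 130-feature dataset; all names and sizes here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the real ~28,000 vs ~650 dataset
X, y = make_classification(n_samples=3000, n_features=30, weights=[0.95],
                           random_state=0)

# Stratify so the 20% test split keeps the class imbalance intact
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

clf = GradientBoostingClassifier().fit(X_tr, y_tr)

# AUC must be computed from scores/probabilities, not hard label predictions
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
cm = confusion_matrix(y_te, clf.predict(X_te))
print(auc)
print(cm)
```

One thing worth double-checking in a setup like this is that the AUC is fed `predict_proba` scores rather than `predict` labels; passing hard labels silently computes a different (and misleading) number.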
Accuracy may not be the best performance metric in your case. Maybe you can think of using precision, recall, and the F1 score, and perform some debugging via learning curves, overfitting detection, etc.
I used the "Stroke" dataset from Kaggle to compare the accuracy of the following classification models:
K-Nearest-Neighbor (KNN).
Decision Trees.
Adaboost.
Logistic Regression.
I did not implement the models myself, but used sklearn library's implementations.
After training the models I ran the test data and printed the level of accuracy of each of the models and these are the results:
As you can see, KNN, Adaboost, and Logistic Regression gave me the exact same accuracy.
My question is: does it make sense that there is not even a small difference between them, or did I make a mistake somewhere along the way (even though I only used sklearn's implementations)?
In general, achieving exactly the same scores is unlikely, and the explanation is usually one of:
bug in actual reporting
bug in the data processing
score reported corresponds to a degenerate solution
And the last explanation is probably the case here. The Stroke dataset has 249 positive samples out of 5,000 datapoints, so if your model always says "no stroke" it will be right roughly 95% of the time. So my best guess is that all your models failed to learn anything and are just constantly outputting "0".
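That hypothesis is easy to test against a majority-class baseline; a sketch with class counts mimicking the Stroke dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# ~249 positives out of 5,000, as in the Stroke dataset
y = np.array([1] * 249 + [0] * 4751)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# A model that always predicts the majority class ("no stroke")
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = baseline.score(X, y)
print(round(acc, 3))  # 0.95: constant "0" already scores ~95% accuracy
```

If your real models' accuracies all match this baseline, they have almost certainly collapsed to the constant prediction.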
In general, accuracy is not the right metric for highly imbalanced datasets. Consider balanced accuracy, F1, etc.
There has been a lot of discussion about this topic.
But I don't have enough reputation (i.e. 50) to comment on those posts, hence I am creating this one.
As far as I understand, accuracy is not an appropriate metric when the data is imbalanced.
My question is: is it still inappropriate if we have applied a resampling method, class weights, or an initial bias?
Reference here.
Thanks!
Indeed, it is always a good idea to test resampling techniques such as oversampling the minority class and undersampling the majority class. My advice is to start with this excellent walkthrough of resampling techniques using the imblearn package in Python. Eventually, what seems to work best in most cases is intelligently combining both under- and over-samplers.
For example, undersample the majority class by 70%, then oversample the minority class to match the new size of the majority class.
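A plain-NumPy sketch of that combined strategy (in practice imblearn's RandomUnderSampler and RandomOverSampler/SMOTE do this more conveniently via a pipeline); the 70% figure and class sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative imbalanced training set: 1,000 majority (0) vs 100 minority (1)
X = rng.normal(size=(1100, 5))
y = np.array([0] * 1000 + [1] * 100)

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# 1) Undersample the majority class by 70% (keep 30% of it)
keep_maj = rng.choice(maj_idx, size=len(maj_idx) * 3 // 10, replace=False)

# 2) Oversample the minority class (with replacement) to the new majority size
boost_min = rng.choice(min_idx, size=len(keep_maj), replace=True)

idx = np.concatenate([keep_maj, boost_min])
X_res, y_res = X[idx], y[idx]
print(np.bincount(y_res))  # [300 300]: balanced after resampling
```

This is only applied to the training split; as stressed below, the test set must keep its natural imbalance.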
To answer your question regarding accuracy: no, it should not be used. The main reason is that when you apply resampling techniques, you should never apply them to the test set. Your test set, as in real life and in production, is something you never see in advance, so the imbalance will always be there.
As for evaluation metrics, the question you need to ask is: how much more important is the minority class than the majority class? How many false positives are you willing to tolerate? The best way to check class separability is a ROC curve. The better it looks (a curve sitting high above the diagonal), the better the model is at separating the positive class from the negative class, even when the data is imbalanced.
To get a single score that allows you to compare models, and if false negatives are more important than false positives (which is almost always the case in imbalanced classification), use the F2 measure, which gives more weight to recall (i.e. more importance to the true positives your model detects). However, the way we have always done it in my company is to examine the classification report in detail, so we know exactly how much recall we get for each class (so yes, we mainly aim for high recall and occasionally look at precision, which reflects the number of false positives).
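For reference, the F2 score is sklearn's fbeta_score with beta=2; a minimal sketch with made-up labels and predictions:

```python
from sklearn.metrics import classification_report, fbeta_score

# Hypothetical test labels and predictions on an imbalanced problem
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# beta=2 weights recall twice as heavily as precision
f2 = fbeta_score(y_true, y_pred, beta=2)
print(round(f2, 3))  # 0.5 here (precision = recall = 0.5 on class 1)

# Per-class precision/recall, i.e. the classification report discussed above
print(classification_report(y_true, y_pred))
```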
Conclusion:
Always check multiple scores, e.g. the classification report, mainly recall
If you want a single score, use F2.
Use a ROC curve to evaluate the model visually with regard to class separability
Never apply resampling to your test set!
Finally, it would be wise to apply a cost-sensitive learning technique to your model, such as class weighting during training!
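A minimal sketch of class weighting in sklearn (the dataset and sizes are illustrative, not from the question):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Illustrative 95/5 imbalanced problem
X, y = make_classification(n_samples=4000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' makes minority-class mistakes cost more in training
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(r_plain, r_weighted)
```

The weighted model typically trades some precision for higher minority-class recall, which is exactly the trade-off discussed above.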
I hope this helps!
I would prefer to use the g-mean or the Brier score. Prof. Harrell wrote a nice discussion on this topic; see https://www.fharrell.com/post/class-damage/. Here is another one discussing the limitations of using improper metrics: https://www.sciencedirect.com/science/article/pii/S2666827022000585
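A small sketch of both metrics with hypothetical labels and probabilities (the g-mean is computed by hand here as the geometric mean of per-class recall):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, recall_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1])
p_pos = np.array([0.1, 0.2, 0.3, 0.1, 0.8, 0.6])

# Brier score: mean squared error of the probabilities (lower is better)
brier = brier_score_loss(y_true, p_pos)

# G-mean: geometric mean of the recall on each class, from hard predictions
y_pred = (p_pos >= 0.5).astype(int)
sens = recall_score(y_true, y_pred)               # recall on class 1
spec = recall_score(y_true, y_pred, pos_label=0)  # recall on class 0
gmean = np.sqrt(sens * spec)
print(round(brier, 3), gmean)  # 0.058 1.0
```

Unlike accuracy, the Brier score rewards well-calibrated probabilities, which is the point Harrell's post argues for.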
After finalizing the architecture of my model I decided to train the model on the entire dataset by setting validation_split = 0 in fit(). I thought this would improve the results based on these sources:
What is validation data used for in a Keras Sequential model?
Your model doesn't "see" your validation set and isn't in any way trained on it
https://machinelearningmastery.com/train-final-machine-learning-model/
What about the cross-validation models or the train-test datasets?
They’ve been discarded. They are no longer needed.
They have served their purpose to help you choose a procedure to finalize.
However, I got worse results without the validation set (compared to validation_split = 0.2), leaving all other parameters the same.
Is there an explanation for this? Or was it just chance that my model happened to perform better on the fixed test data when part of the training data was excluded (and used for validation)?
Well, that's really a very good question; it covers a lot of machine-learning concepts, especially the bias-variance tradeoff.
As #CrazyBarzillian hinted in the comments, more data might be leading to overfitting, and yes, we would need more info about your data to reach a definite answer. But more broadly, I would like to explain a few points that might help you understand why it happened.
EXPLANATION
Whenever your data has a large number of features, your model learns a very complex function to fit it; in short, the model is too complicated for the amount of data available. This situation, known as high variance, leads to overfitting. We know we are facing a high-variance issue when the training error is much lower than the test error. High-variance problems can be addressed by reducing the number of features (by applying PCA, outlier removal, etc.) or by increasing the number of data points, that is, adding more data.
Sometimes you have too few features in your data, and the model therefore learns a very simple function. This is known as high bias. In this case, adding more data won't help; less data will do the job, or adding more features will help.
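The training-vs-test gap described above can be checked directly; here is a sketch contrasting a deliberately flexible model with a constrained one on synthetic, noisy data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise so the models cannot be perfect
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorises the noisy training set: high variance
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))  # large train/test gap

# A depth-limited tree is less flexible: smaller gap, possibly higher bias
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```

Comparing the two pairs of scores is the practical way to tell whether you are in the high-variance or high-bias regime.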
MY ASSUMPTION
I guess your model is suffering from high bias if it performs worse when you add more data. But to check whether the claim "adding more data leads to poorer results" is actually correct in your case, you can do the following:
play with some hyperparameters
try other machine learning models
instead of accuracy scores, look at R² scores or mean absolute error in the case of regression, or F1, precision, and recall in the case of classification
If after doing these things you still find that more data leads to poorer results, then you can be fairly sure of high bias and can either increase the number of features or reduce the data.
SOLUTION
By reducing the data, I mean using less but better data. For example, suppose you are doing a classification problem with three classes (A, B, and C); better data would have the datapoints balanced between the three classes. Your data should be balanced. If it is unbalanced, that is, class A has a high number of samples while classes B and C have only 3-4 samples each, then you can apply random sampling techniques to overcome it.
How to make BETTER DATA
Balance the data
Remove outliers
Scale (Normalize) the data
CONCLUSION
It is a myth that more data always leads to a better model. More than the quantity, the quality of the data matters; data should have both quantity and quality. This game of matching model complexity to the data available is known as the bias-variance tradeoff.
Suppose we have 1,000 beads: 900 red and 100 blue. When I run the problem through sklearn classifier ensembles,
score = clf.score(X_test, y_test)
they come up with scores of around 0.9. However, when I look at the predictions, I see that it has predicted all of them to be red, and this is how it comes up with 90% accuracy! Please tell me what I'm doing wrong. Better yet, what does it mean when this happens? Is there a better way to measure accuracy?
This might happen when you have an imbalanced dataset, and you chose accuracy as your metric. The reason is that by always deciding red, the model is actually doing OK in terms of accuracy, but as you noticed, the model is useless!
In order to overcome this issue, you have some alternatives such as:
1. Use another metric, like AUC (area under the ROC curve), etc.
2. Use different weights for classes, and put more weight on the minority class.
3. Use simple over-sampling or under-sampling methods, or other more sophisticated ones like SMOTE, ADASYN, etc.
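For alternative 1, the beads example shows why AUC catches what accuracy misses; a sketch with a hypothetical always-red model:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 900 red (0) and 100 blue (1) beads, as in the question
y_true = np.array([0] * 900 + [1] * 100)

# A useless model: predicts "red" with the same probability for every bead
p_blue = np.full(1000, 0.1)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.9: looks fine
print(roc_auc_score(y_true, p_blue))   # 0.5: no better than random ranking
```

AUC looks at how well the model ranks positives above negatives, so a constant predictor cannot hide behind the class imbalance.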
You can also take a look at this article.
This problem you face is quite common in real-world applications.
You have an imbalanced classification problem. You are right: by default, score measures accuracy, but for imbalanced data it is recommended to look at recall and precision.
This video explains it better than I could
The video above demonstrates what you should do to measure classification performance on your data. To deal with the class imbalance, check out the imblearn library:
https://imbalanced-learn.readthedocs.io/en/stable/api.html
I am using sklearn's random forests module to predict a binary target variable based on 166 features.
When I increase the number of dimensions to 175, the accuracy of the model decreases (accuracy from 0.86 to 0.81, and recall from 0.37 to 0.32).
I would expect more data to only make the model more accurate, especially when the added features have business value.
I built the model using sklearn in python.
Why did the new features not get weight 0, leaving the accuracy as it was?
Basically, you may be "confusing" your model with useless features. MORE FEATURES or MORE DATA WILL NOT ALWAYS MAKE YOUR MODEL BETTER. The new features will also not get zero weight, because the model will try hard to use them! With so many features (175!), the random forest is simply not able to get back to the previous "pristine" model with better accuracy and recall (maybe those 9 extra features are really not adding anything useful).
Think about how a decision tree essentially works: the new features will cause some new splits that can worsen the results. Try to build up from the basics, slowly adding new information and always checking the performance. In addition, pay attention to, for example, the number of features considered per split (mtry). With so many features, you would need a fairly high mtry (to allow a big sample of features to be considered at every split). Have you considered adding just 1 or 2 more features and checking how the accuracy responds?
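In sklearn, mtry corresponds to the max_features parameter of RandomForestClassifier; a sketch comparing two settings on illustrative data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative dataset: many features, only a few of them informative
X, y = make_classification(n_samples=600, n_features=175, n_informative=10,
                           random_state=0)

# max_features is sklearn's name for mtry: features considered at each split
scores = {}
for mf in ["sqrt", 0.5]:
    rf = RandomForestClassifier(n_estimators=50, max_features=mf,
                                random_state=0)
    scores[mf] = cross_val_score(rf, X, y, cv=3).mean()
    print(mf, round(scores[mf], 3))
```

With many mostly-uninformative features, a larger max_features raises the chance that an informative feature is available at each split, which is the effect described above.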
More data does not always make the model more accurate. A random forest is a traditional machine-learning method where the practitioner has to do the feature selection. If the model is given a lot of bad data, it will try to make sense of that bad data too and end up messing things up. More data tends to help neural networks, since those networks can learn the most useful features from the data on their own.
Also, 175 features is a lot, and you should definitely look into dimensionality-reduction techniques and select the features that are most strongly related to the target. There are several methods in sklearn for this: you can try PCA if your data is numerical, or RFE to remove weak features, etc.
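A minimal RFE sketch on synthetic data (the dataset and feature counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 30 features, only 5 informative; keep the 5 best by recursive elimination
X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
X_small = selector.fit_transform(X, y)

print(X_small.shape)            # (300, 5)
print(selector.support_.sum())  # 5 features kept
```

RFE repeatedly fits the estimator and drops the weakest features, so it can prune exactly the kind of low-value columns discussed in the question.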