Say I have 6 different categories I'm trying to classify my data into using a NN. How important is it to training that I have an equal number of instances for each class? Presently I have like 50k for one class, 6k for another, 300 for another... you get the picture. How big of a problem is this? I'm thinking I might nix some of the classes with low representation, but I'm not sure what a good cutoff would be, or if it would really matter.
Imbalanced data is generally a problem for machine learning, particularly when the classes are severely imbalanced (as in your case). In a nutshell, the algorithm won't be able to learn the right associations between the features and the categories for all classes; it will most likely miss the rules and/or rely too much on the majority class(es). Have a look at the imblearn package. General solutions for imbalanced data (a minimal sketch follows the list) are to:
Downsample the majority class (reduce the number of samples/instances in the majority class to match one of the minority classes).
Upsample the minority classes (look for SMOTE, the synthetic minority oversampling technique). This increases the number of samples in the minority classes to match some target (e.g. the size of the majority class).
A combination of both.
Drop classes with very, very low representation (not the best idea, but justifiable in some cases). 300 might still be usable if you upsample, but it probably isn't ideal.
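As a minimal sketch of the first two options, assuming the imblearn package is installed (the toy data from make_classification stands in for your real features and labels):

```python
# Toy stand-in for a 6-class imbalanced dataset.
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=20000, n_classes=6, n_informative=10,
    weights=[0.55, 0.2, 0.1, 0.08, 0.05, 0.02], random_state=0,
)
print("original    :", Counter(y))

# Option 1: downsample the larger classes to the size of the smallest one.
X_down, y_down = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_down))

# Option 2: upsample the smaller classes with SMOTE to match the largest one.
X_up, y_up = SMOTE(random_state=0).fit_resample(X, y)
print("oversampled :", Counter(y_up))
```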
Other considerations include changing your performance metric, for example to precision/recall rather than accuracy.
This link should provide some further examples that might be helpful.
There has been a lot of discussion about this topic.
But I don't have enough reputation (i.e. 50) to comment on those posts, hence I am creating this one.
As far as I understand, accuracy is not an appropriate metric when the data is imbalanced.
My question is: is it still inappropriate if we have applied either resampling, class weights, or an initial bias?
Reference here.
Thanks!
Indeed, it is always a good idea to test resampling techniques such as oversampling the minority class and undersampling the majority class. My advice is to start with this excellent walk-through of resampling techniques using the imblearn package in Python. Ultimately, what seems to work best in most cases is intelligently combining under- and over-samplers.
For example, undersample the majority class by 70% and then apply oversampling to the minority class to match the new distribution of the majority class.
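A rough sketch of that idea with imblearn's pipeline (binary case; the 0.5 and 1.0 ratios below are illustrative choices rather than the exact 70% above, and X_train/y_train are assumed to be your training split):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

model = Pipeline(steps=[
    # Shrink the majority class until minority/majority = 0.5 ...
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=0)),
    # ... then oversample the minority class up to a 1:1 ratio.
    ("over", SMOTE(sampling_strategy=1.0, random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])
# model.fit(X_train, y_train)  # resampling happens only during fit, never at predict time
```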
To answer your question regarding accuracy: no, it should not be used. The main reason is that when you apply resampling techniques, you should never apply them to the test set. The test set, just like the data you will see in production, is not known in advance, so the imbalance will always be there.
As far as evaluation metrics go, the question you need to ask is "how much more important is the minority class than the majority class?", i.e. how many false positives are you willing to tolerate? The best way to check for class separability is a ROC curve: the better it looks (a curve sitting well above the diagonal), the better the model is at separating the positive class from the negative class, even if the data is imbalanced.
To get a single score that lets you compare models, and if false negatives are more important than false positives (which is almost always the case in imbalanced classification), use the F2 measure, which gives more weight to recall (i.e. to the true positives detected by your model). However, the way we have always done it in my company is to examine the classification report in detail to know exactly how much recall we get for each class (so yes, we mainly aim for high recall and occasionally look at the precision, which reflects the number of false positives).
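For example (toy arrays standing in for the real test-set labels, predictions, and predicted probabilities):

```python
from sklearn.metrics import classification_report, fbeta_score, roc_auc_score

y_test = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]                         # imbalanced ground truth
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]                         # model predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.45]   # predicted P(class 1)

print(classification_report(y_test, y_pred))                 # per-class precision/recall
print("F2 :", fbeta_score(y_test, y_pred, beta=2))           # weights recall over precision
print("AUC:", roc_auc_score(y_test, y_score))                # class separability
```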
Conclusion:
Always check multiple scores, such as the classification report, with an emphasis on recall.
If you want a single score, use F2.
Use a ROC curve to visually evaluate class separability.
Never apply resampling to your test set!
Finally, it would be wise to apply a cost-sensitive learning technique to your model, such as class weighting during training!
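One common way to obtain such weights is sklearn's compute_class_weight utility; the resulting dict can be passed to many sklearn estimators via their class_weight parameter or to Keras' model.fit (toy labels below, your own y_train in practice):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 90 + [1] * 10)          # toy imbalanced labels
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
print(class_weight)                               # {0: 0.555..., 1: 5.0} - minority class weighted up

# e.g. in Keras: model.fit(X_train, y_train, class_weight=class_weight)
```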
I hope this helps!
I would prefer to use the G-mean or the Brier score, as Prof. Harrell wrote a nice discussion on this topic. See this: https://www.fharrell.com/post/class-damage/. Here is another one which discusses the limitations of using improper metrics: https://www.sciencedirect.com/science/article/pii/S2666827022000585
I'm training a U-Net CNN in Keras and one of the image classes is significantly under-represented in the training dataset. I'm using a class-weighted loss function to account for this, but my worry is that, with such a low batch size and so few class instances, only 1 in 10 batches is likely to include an image of this class. So even though the class is weighted, the network rarely sees it during training. Therefore, would it be bad practice to force the data generator to include at least one instance of this class while it's selecting random pieces of data for the batch? I could then avoid a situation where the majority of training is unable to access a class of data that's vital to overall task accuracy.
I would recommend three possible techniques to handle this kind of problem:
Equalize the probability of drawing an image of a given class: for example this for PyTorch (I don't know which framework you are using, please provide it). (Easy, but least efficient.)
Adapt the loss by giving more weight to the under-represented classes (also easy; gives a similar result to the previous method, so start with whichever of the two is easier to implement).
Do some data augmentation (harder, but nowadays a lot of libraries provide efficient ways to do this)
EDIT: Sorry, I did not see that it is for Keras. A few useful links: for data augmentation, class balancing and loss adaptation.
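As a minimal sketch of the first option for Keras (my own illustration, not from the links above), here is a generator that draws every batch with equal per-class probability, assuming image-level integer labels y and an image array X (for a U-Net with per-pixel labels you would derive an image-level label first):

```python
import numpy as np

def balanced_batch_generator(X, y, batch_size=8, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    idx_by_class = {c: np.flatnonzero(y == c) for c in classes}
    while True:
        # Pick a class uniformly for every slot in the batch, then a random
        # sample from that class, so the rare class is seen regularly.
        chosen = rng.choice(classes, size=batch_size)
        batch_idx = np.array([rng.choice(idx_by_class[c]) for c in chosen])
        yield X[batch_idx], y[batch_idx]

# model.fit(balanced_batch_generator(X_train, y_train), steps_per_epoch=...)
```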
I'm using Auto-Sklearn and have a dataset with 42 classes that are heavily imbalanced. What is the best way to handle this imbalance? As far as I know, two approaches to handling imbalanced data in machine learning exist: either using a resampling mechanism such as over- or under-sampling (or a combination of both), or solving it at the algorithmic level by choosing an inductive bias, which would require in-depth knowledge of the algorithms used within Auto-Sklearn. I'm not quite sure how to handle this problem. Is it possible to address the imbalance directly within Auto-Sklearn, or do I need to use resampling strategies as offered by e.g. imbalanced-learn? Which evaluation metric should be used after the models have been computed? The roc_auc_score for multiple classes is available since sklearn==0.22.1. However, Auto-Sklearn only supports sklearn up to version 0.21.3. Thanks in advance!
Another method is to set class weights according to class size. The effort is very small and it seems to work fine. I was looking into setting weights in auto-sklearn and this is what I found:
https://github.com/automl/auto-sklearn/issues/113
For example, in scikit-learn's SVM you have the parameter class_weight:
https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
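A minimal sketch of the same idea directly in scikit-learn (the weights in the comment are illustrative):

```python
from sklearn.svm import SVC

# "balanced" sets weights inversely proportional to class frequencies;
# a manual dict such as {0: 1, 1: 10} also works.
clf = SVC(kernel="rbf", class_weight="balanced")
# clf.fit(X_train, y_train)   # X_train / y_train: your training split
```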
I hope this helps :)
One way that has worked for me in the past to handle highly imbalanced datasets is the Synthetic Minority Oversampling Technique (SMOTE). Here is the paper for a better understanding:
SMOTE Paper
This works by synthetically oversampling the minority class (or classes, for that matter). To quote the paper:
The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen.
This moves you closer to a balanced dataset. There is an implementation of SMOTE in the imblearn package in Python.
Here is a good read about different oversampling algorithms. It includes oversampling using ADASYN as well as SMOTE.
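A short sketch of both samplers in imblearn, using toy data in place of a real training set:

```python
from collections import Counter

from imblearn.over_sampling import ADASYN, SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the real training set.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

# SMOTE interpolates between a minority sample and its k nearest minority
# neighbours; ADASYN does the same but generates more samples in regions
# where the minority class is harder to learn.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
X_ad, y_ad = ADASYN(n_neighbors=5, random_state=42).fit_resample(X, y)
print("SMOTE :", Counter(y_sm))
print("ADASYN:", Counter(y_ad))
```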
I hope this helps.
For those interested and as an addition to the answers given, I can highly recommend the following paper:
Lemnaru, C., & Potolea, R. (2011, June). Imbalanced classification problems: systematic study, issues and best practices. In International Conference on Enterprise Information Systems (pp. 35-50). Springer, Berlin, Heidelberg.
The authors argue that:
In terms of solutions, since the performance is not expected to improve significantly with a more sophisticated sampling strategy, more focus should be allocated to algorithm related improvements, rather than to data improvements.
Since, e.g., the ChaLearn AutoML Challenge 2015 used balanced accuracy, sklearn argues that it is a suitable metric for imbalanced data, and Auto-Sklearn was able to compute well-fitting models with it, I'm going to give it a try. Even without resampling, the results were much "better" (in terms of prediction quality) than when just using accuracy.
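For reference, balanced accuracy in sklearn is the macro-average of per-class recall, so a degenerate majority-class predictor is exposed immediately (toy example, replace with real test labels and predictions):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_test = [0] * 90 + [1] * 10
y_pred = [0] * 100                 # degenerate majority-class predictor

print("accuracy         :", accuracy_score(y_test, y_pred))           # 0.90
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))  # 0.50
```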
I have a binary classification problem where one class represents 99.1% of all observations (210,000). As a strategy to deal with the imbalanced data, I chose sampling techniques. But I don't know what to do: undersample the majority class or oversample the less-represented class.
Does anybody have any advice?
Thank you.
P.S. I use the random forest algorithm from sklearn.
I think that there is a typo in the accepted answer above. You should not "undersample the minority" and "oversample the majority"; rather, you should undersample the majority and oversample the minority.
If you're familiar with Weka, you can experiment using different data imbalance techniques and different classifiers easily to investigate which method works best. For undersampling in Weka, see this post: combination of smote and undersampling on weka.
For oversampling in Weka, you can try the SMOTE algorithm (some information is available here: http://weka.sourceforge.net/doc.packages/SMOTE/weka/filters/supervised/instance/SMOTE.html). Of course, creating 20,811 synthetic minority samples (i.e., if you're aiming for balanced data) is more computationally expensive than undersampling because: (1) there is a computational cost associated with creating the synthetic data; and (2) there is a greater computational cost associated with training on 42,000 samples (including the 20,811 synthetic samples created for the minority class) as opposed to 21,000 samples.
In my experience, both data imbalance approaches you've mentioned work well, but I typically experiment first with undersampling because I feel that it is a little cheaper from a resource perspective.
There are Python packages for undersampling and oversampling here:
Undersampling: http://glemaitre.github.io/imbalanced-learn/auto_examples/ensemble/plot_easy_ensemble.html
Oversampling: http://contrib.scikit-learn.org/imbalanced-learn/stable/auto_examples/over-sampling/plot_smote.html
You can also investigate cost-sensitive classification techniques to penalize misclassifications of the minority class via a cost matrix.
Here is a link to a nice Weka package: https://weka.wikispaces.com/CostSensitiveClassifier
Here is a link to a Python package: https://wwwen.uni.lu/snt/research/sigcom/computer_vision_lab/costcla_a_cost_sensitive_classification_library_in_python
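Since you mention sklearn's random forest, a low-effort cost-sensitive option (a sketch with illustrative parameters, not from the packages linked above) is its built-in class_weight:

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,
    class_weight="balanced_subsample",  # reweight classes within each bootstrap sample
    random_state=0,
)
# clf.fit(X_train, y_train)   # X_train / y_train: your training split
# imblearn also offers BalancedRandomForestClassifier, which undersamples the
# majority class in every bootstrap instead of reweighting.
```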
Whether to use:
oversampling, or
undersampling, or
oversampling the minority and undersampling the majority
is a hyperparameter. Do cross-validation to see which one works best.
But use a proper training/test/validation split.
Undersampling:
Undersampling is typically performed when we have billions (lots) of data points and not enough compute or memory (RAM) to process them. In some cases undersampling may lead to worse performance than training on the full data or on oversampled data; in other cases there may be no significant loss in performance.
Undersampling is mainly performed to make model training manageable and feasible when working under limited compute, memory and/or storage constraints.
Oversampling:
Oversampling tends to work well, as, unlike undersampling, it involves no loss of information.
Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn's SVC with an RBF kernel to classify them.
I need the confidence of the prediction along with the predicted class. I used the predict_proba method of SVC.
I was getting weird results with that. I searched a bit and found out that it makes sense only for larger datasets.
Found this question on Stack Overflow: Scikit-learn predict_proba gives wrong answers.
The author of the question verified this by multiplying the dataset, i.e. duplicating its samples.
My questions:
1) If I multiply my dataset by, let's say, 100, so that each sample appears 100 times, it increases the "correctness" of predict_proba. What side effects will it have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier? Like distance from the hyperplanes?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVMs are mainly popular in high-dimensional settings. It is currently unclear whether that applies to your project. They build hyperplanes on a handful of (or even single) supporting instances, and are often outperformed by neural nets in situations with large training sets. A priori they might not be your worst choice.
Oversampling your data will do little for an approach using SVM. SVM is based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vectors (I am assuming you are already using the train set as test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artifacts constructed by unbalanced oversampling, since the instances will be exact copies and no distribution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique). You will basically generate synthetic instances based on the ones you have. In theory this will provide you with new instances that won't be exact copies of the ones you have, and might thus fall a little outside the normal classification. Note: by definition, all these examples will lie in between the original examples in your sample space. This does not mean that they will lie in between them in your projected SVM space, so you may be learning effects that aren't really there.
Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
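A small sketch of that approach (toy data standing in for your real samples): SVC's decision_function returns the signed distance to the decision boundary, which can serve as an unscaled confidence measure without relying on predict_proba's Platt scaling.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, random_state=0)   # toy stand-in data
clf = SVC(kernel="rbf").fit(X, y)                           # no probability=True needed

# Sign gives the predicted side of the boundary, magnitude is a rough confidence proxy.
margins = clf.decision_function(X[:5])
print(margins)
```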