Imbalanced data: undersampling or oversampling? - python

I have a binary classification problem where one class represents 99.1% of all observations (210,000). As a strategy to deal with the imbalanced data, I chose sampling techniques. But I don't know which to do: undersample my majority class or oversample the less represented class.
Does anybody have any advice?
Thank you.
P.S. I use the random forest algorithm from sklearn.

I think that there is a typo in the accepted answer above. You should not "undersample the minority" and "oversample the majority"; rather, you should undersample the majority and oversample the minority.
If you're familiar with Weka, you can experiment using different data imbalance techniques and different classifiers easily to investigate which method works best. For undersampling in Weka, see this post: combination of smote and undersampling on weka.
For oversampling in Weka, you can try the SMOTE algorithm (some information is available here: http://weka.sourceforge.net/doc.packages/SMOTE/weka/filters/supervised/instance/SMOTE.html). Of course, creating 20,811 synthetic minority samples (i.e., if you're looking for balanced data) is more computationally expensive than undersampling because: (1) there is a computational cost associated with creating the synthetic data; and (2) there is a greater computational cost associated with training on 42,000 samples (including the 20,811 synthetic samples created for the minority class) as opposed to 21,000 samples.
In my experience, both data imbalance approaches you've mentioned work well, but I typically experiment first with undersampling because I feel that it is a little cheaper from a resource perspective.
There are Python packages for undersampling and oversampling here:
Undersampling: http://glemaitre.github.io/imbalanced-learn/auto_examples/ensemble/plot_easy_ensemble.html
Oversampling: http://contrib.scikit-learn.org/imbalanced-learn/stable/auto_examples/over-sampling/plot_smote.html
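For illustration, here is a minimal imbalanced-learn sketch of both options on synthetic data; the dataset and parameters below are placeholders, not the asker's actual setup:

```python
# Toy data roughly matching the question's 99:1 ratio; replace with your own X, y.
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=210_000, weights=[0.991, 0.009], random_state=42)

# Undersampling: randomly drop majority-class rows until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

# Oversampling: create synthetic minority-class rows with SMOTE.
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
```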
You can also investigate cost-sensitive classification techniques that penalize misclassifications of the minority class via a cost matrix.
Here is a link to a nice Weka package: https://weka.wikispaces.com/CostSensitiveClassifier
Here is a link to a Python package: https://wwwen.uni.lu/snt/research/sigcom/computer_vision_lab/costcla_a_cost_sensitive_classification_library_in_python
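As a rough sklearn-side sketch of the cost-sensitive idea (using the asker's random forest rather than the packages linked above, with purely illustrative weights), the class_weight parameter penalizes minority-class mistakes more heavily without any resampling:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5_000, weights=[0.99, 0.01], random_state=0)

# 'balanced' weights classes inversely to their frequency; an explicit dict
# such as {0: 1, 1: 100} (values here are purely illustrative) also works.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X, y)
```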

Oversampling, undersampling, or oversampling the minority while undersampling the majority: which one to use is a hyperparameter. Do cross-validation to see which one works best.
But use a proper training/test/validation split.
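A rough sketch of that idea, assuming an imbalanced-learn pipeline and synthetic data in place of the real problem: GridSearchCV swaps the sampler step in and out, so the resampling choice is tuned like any other hyperparameter and applied only inside each training fold:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([("sampler", RandomUnderSampler(random_state=0)),
                 ("clf", RandomForestClassifier(random_state=0))])
param_grid = {"sampler": [RandomUnderSampler(random_state=0),
                          RandomOverSampler(random_state=0),
                          SMOTE(random_state=0)]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)            # resampling happens inside each CV fold only
print(search.best_params_, search.score(X_test, y_test))
```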

Undersampling:
Undersampling is typically performed when we have billions (lots) of data points and insufficient compute or memory (RAM) to process them. In some cases undersampling may lead to worse performance than training on the full data or on oversampled data; in other cases the loss in performance is not significant.
Undersampling is mainly performed to make model training manageable and feasible when working within limited compute, memory, and/or storage constraints.
Oversampling:
Oversampling tends to work well because, unlike undersampling, no information is discarded.
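A quick illustration of that tradeoff on toy data (not the asker's): undersampling shrinks the training set, which is cheaper but discards majority rows, while oversampling keeps every original row:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=100_000, weights=[0.99, 0.01], random_state=0)
print("original:    ", Counter(y))

_, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_under))   # small balanced set -> cheap to train on

_, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled: ", Counter(y_over))    # large balanced set -> nothing discarded
```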

Related

several category classification with Keras

Say I have 6 different categories I'm trying to classify my data into using a NN. How important is it for training that I have an equal number of instances for each class? Presently I have about 50k for one class, 6k for another, 300 for another... you get the picture. How big of a problem is this? I'm thinking I might nix some of the classes with low representation, but I'm not sure what a good cutoff would be, or if it would really be important.
Imbalanced data is generally a problem for machine learning, particularly when the classes are severely imbalanced (as in your case). In a nutshell, the algorithm won't be able to learn the right associations between the features and the categories for all classes; it will most likely miss rules and/or rely too much on the majority class(es). Have a look at the imblearn package. General solutions for imbalanced data are to either:
Downsample the majority class (reduce the number of samples/instances in the majority class to match one of the minority classes).
Upsample the minority classes (look for SMOTE, the synthetic minority oversampling technique; this increases the number of samples in the minority classes to match some target, e.g. the size of the majority class).
A combination of both.
Drop classes with very, very low representation (not the best idea, but justifiable in some cases). 300 samples might still be usable if you upsample, but it probably isn't ideal.
Other considerations include changing your performance metric to include precision/recall rather than accuracy (for example).
This link should provide some further examples that might be helpful.
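As a hedged sketch of options 1-2 and the metric suggestion, on synthetic data with class counts loosely modelled on the question (a scikit-learn classifier stands in for the Keras network here, and the target counts passed to SMOTE are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=56_300, n_classes=3, n_informative=6,
                           weights=[0.888, 0.107, 0.005], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Upsample only the two smaller classes; the majority class is left untouched.
smote = SMOTE(sampling_strategy={1: 20_000, 2: 5_000}, random_state=0)
X_res, y_res = smote.fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))   # per-class precision/recall
```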

Dealing with highly imbalanced datasets using Tensorflow Dataset and Keras Tuner

I have a highly imbalanced dataset (3% Yes, 87% No) of textual documents, containing title and abstract features. I have transformed these documents into tf.data.Dataset entities with padded batches. Now, I am trying to train on this dataset using deep learning. With model.fit() in TensorFlow, you have the class_weight parameter to deal with class imbalance; however, I am searching for the best hyperparameters using the keras-tuner library, and its hyperparameter tuners do not have such an option. Therefore, I am looking for other ways to deal with class imbalance.
Is there an option to use class weights in keras-tuner? To add, I am already using the precision-at-recall metric. I could also try a data resampling method, such as imblearn.over_sampling.SMOTE, but as this Kaggle post mentions:
It appears that SMOTE does not help improve the results. However, it makes the network learn faster. Moreover, there is one big problem: this method is not compatible with larger datasets. You have to apply SMOTE on embedded sentences, which takes way too much memory.
If you are looking for other methods to deal with imbalanced data, you may consider generating synthetic data using SMOTE or ADASYN. This usually works. I see you have already considered this as an option to explore.
You could also change the evaluation metric to fbeta_score (a weighted F-score).
Or if the dataset is large enough, you can try undersampling.
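For the metric suggestion, the scikit-learn function is fbeta_score; here is a tiny sketch with toy labels (not the poster's data), where beta > 1 weights recall more heavily than precision:

```python
from sklearn.metrics import fbeta_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# beta=2 counts recall on the rare class twice as much as precision.
print(fbeta_score(y_true, y_pred, beta=2))
```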

Best way to handle imbalanced dataset for multi-class classification in Auto-Sklearn

I'm using Auto-Sklearn and have a dataset with 42 classes that are heavily imbalanced. What is the best way to handle this imbalance? As far as I know, there are two approaches to handling imbalanced data in machine learning: either use a resampling mechanism such as over- or under-sampling (or a combination of both), or solve it at the algorithmic level by choosing an inductive bias, which would require in-depth knowledge of the algorithms used within Auto-Sklearn. I'm not quite sure how to handle this problem. Is it possible to address the imbalance directly within Auto-Sklearn, or do I need to use resampling strategies as offered by e.g. imbalanced-learn? Which evaluation metric should be used after the models have been computed? The roc_auc_score for multiple classes has been available since sklearn==0.22.1; however, Auto-Sklearn only supports sklearn up to version 0.21.3. Thanks in advance!
Another method is to set class weights according to class size. The effort is minimal and it seems to work fine. I was looking into how to set weights in auto-sklearn and this is what I found:
https://github.com/automl/auto-sklearn/issues/113
For example, scikit-learn's SVM has the parameter class_weight:
https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
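A minimal sketch of that parameter on toy binary data (the real problem has 42 classes, in which case a per-class weight dict works the same way):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)

# 'balanced' reweights every class inversely to its frequency.
clf = SVC(kernel="linear", class_weight="balanced").fit(X, y)
```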
I hope this helps :)
One way that has worked for me in the past to handle highly imbalanced datasets is Synthetic Minority Oversampling Technique (SMOTE). Here is the paper for better understanding:
SMOTE Paper
This works by synthetically oversampling the minority class or classes for that matter. To quote the paper:
The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen.
This then moves you closer to a balanced dataset. There is an implementation of SMOTE in the imblearn package in Python.
Here is a good read about different oversampling algorithms. It includes oversampling using ADASYN as well as SMOTE.
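For reference, ADASYN is also available in imblearn; here is a short sketch on toy data standing in for the real dataset:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=10_000, weights=[0.97, 0.03], random_state=0)
X_res, y_res = ADASYN(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```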
I hope this helps.
For those interested and as an addition to the answers given, I can highly recommend the following paper:
Lemnaru, C., & Potolea, R. (2011, June). Imbalanced classification problems: systematic study, issues and best practices. In International Conference on Enterprise Information Systems (pp. 35-50). Springer, Berlin, Heidelberg.
The authors argue that:
In terms of solutions, since the performance is not expected to improve significantly with a more sophisticated sampling strategy, more focus should be allocated to algorithm related improvements, rather than to data improvements.
Since, e.g., the ChaLearn AutoML Challenge 2015 used balanced accuracy, sklearn argues that it is a fitting metric for imbalanced data, and Auto-Sklearn was able to compute well-fitting models with it, I'm going to give it a try. Even without resampling, the results were much "better" (in terms of prediction quality) than with plain accuracy.
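A hedged sketch of that metric comparison, with a plain scikit-learn model standing in for the Auto-Sklearn output: balanced accuracy averages per-class recall, so majority classes cannot mask poor minority-class performance the way plain accuracy does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_classes=5, n_informative=8,
                           weights=[0.8, 0.1, 0.05, 0.03, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:         ", accuracy_score(y_test, pred))
print("balanced accuracy:", balanced_accuracy_score(y_test, pred))
```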

How to calculate probability(confidence) of SVM classification for small data set?

Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn's SVC with an RBF kernel to classify them.
I need the confidence of the prediction along with the predicted class. I used predict_proba method of SVC.
I was getting weird results with that. I searched a bit and found out that it makes sense only for larger datasets.
I found this question on Stack Overflow: Scikit-learn predict_proba gives wrong answers.
The author of that question verified this by multiplying the dataset, i.e., duplicating every sample.
My questions:
1) If I multiply my dataset by, let's say, 100, so that each sample appears 100 times, it increases the "correctness" of predict_proba. What side effects will this have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier? Like the distance from the hyperplane?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVMs are mainly popular in high-dimensional settings; it is currently unclear whether that applies to your project. They build planes on a handful of (or even single) supporting instances, and are often outperformed by neural nets in situations with large training sets. A priori, they might not be your worst choice.
Oversampling your data will do little for an approach using SVM. SVM is based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vectors (I am assuming you are already using the train set as the test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artifacts constructed by unbalanced oversampling, since the instances will be exact copies and no distribution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique), which basically generates synthetic instances based on the ones you have. In theory this will provide you with new instances that won't be exact copies of the ones you have, and might thus fall slightly outside the normal classification. Note: by definition, all these examples will lie in between the original examples in your sample space. This does not mean that they will lie in between them in your projected SVM space, so the model may learn effects that aren't really true.
Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
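A small sketch of that last suggestion, on toy data standing in for the tiny real set: SVC.decision_function returns a signed score whose magnitude grows with the distance from the decision boundary, which can serve as a rough confidence.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (8, 2)), rng.normal(3, 1, (8, 2))])  # two tiny classes
y = np.array([0] * 8 + [1] * 8)

clf = SVC(kernel="rbf").fit(X, y)
scores = clf.decision_function(X)   # sign -> predicted side, |value| -> rough confidence
print(np.c_[clf.predict(X), scores])
```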

Decrease the False Negative Rate in signal prediction

I'm currently working on a project estimating a signal using classification algorithms, such as logistic regression and random forests, in scikit-learn.
I'm now using the confusion matrix to estimate the performance of the different algorithms, and I found a common problem for both. That is, in all cases, although the accuracy of the algorithms seems relatively good (around 90%-93%), the total number of false negatives is pretty high compared to true positives (FNR < 3%). Does anyone have a clue why I'm having this kind of issue in my prediction problem? If possible, can you give me some hints on how to solve it?
Thanks in advance for any replies and help.
Updates:
The dataset is extremely imbalanced (8:1), with around 180K observations in total. I have already tested several re-sampling methods, such as OSS and SMOTE (+Tomek or +ENN), but none of them returned good results. In both cases, although the recall goes up from 2.5% to 20%, the precision decreases significantly (from 60% to 20%).
You probably have an imbalanced dataset, where one of your classes has many more examples than your other class.
One solution is to give a higher cost to misclassifying the class with fewer examples.
This question in Cross Validated covers many approaches to your problem:
https://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
EDIT:
Given that you are using scikit-learn, you can, as a first approach, set the parameter class_weight to 'balanced' on your logistic regression.
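A minimal sketch of that first approach on toy data with roughly the 8:1 ratio described above; expect the reweighting to trade precision for recall:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=180_000, weights=[8 / 9, 1 / 9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```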
