Dealing with highly imbalanced datasets using Tensorflow Dataset and Keras Tuner - python

I have a highly imbalanced dataset (3% Yes, 97% No) of textual documents, each containing a title and an abstract feature. I have transformed these documents into tf.data.Dataset entities with padded batches. Now I am trying to train a deep learning model on this dataset. With model.fit() in TensorFlow, you have the class_weight parameter to deal with class imbalance; however, I am searching for the best hyperparameters using the keras-tuner library, and its tuners do not expose such an option. Therefore, I am looking for other ways to deal with class imbalance.
Is there an option to use class weights in keras-tuner? To add, I am already using a precision-at-recall metric. I could also try a data resampling method, such as imblearn.over_sampling.SMOTE, but as this Kaggle post mentions:
It appears that SMOTE does not help improve the results. However, it makes the network learn faster. Moreover, there is one big problem: this method is not compatible with larger datasets. You have to apply SMOTE on embedded sentences, which takes way too much memory.

If you are looking for other methods to deal with imbalanced data, you may consider generating synthetic data using the SMOTE or ADASYN packages. This usually works well. I see you have already considered this as an option to explore.

You could change the evaluation metric to fbeta_score (a weighted F-score).
Or if the dataset is large enough, you can try undersampling.
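If you go the resampling route directly in the tf.data pipeline, here is a rough sketch of on-the-fly rebalancing (the toy data is a stand-in for your real pipeline; in older TF versions sample_from_datasets lives under tf.data.experimental):

import tensorflow as tf

# Toy stand-in for the real pipeline: 97 negatives, 3 positives.
texts = tf.constant(["no"] * 97 + ["yes"] * 3)
labels = tf.constant([0] * 97 + [1] * 3)
ds = tf.data.Dataset.from_tensor_slices((texts, labels))

# Split the pipeline by class, then draw from both streams with equal
# probability, so every batch is roughly class-balanced. Repeating both
# sides makes the stream endless, so set steps_per_epoch in model.fit.
pos = ds.filter(lambda text, label: label == 1).repeat()
neg = ds.filter(lambda text, label: label == 0).repeat()
balanced = tf.data.Dataset.sample_from_datasets([pos, neg], weights=[0.5, 0.5])

for text, label in balanced.take(6):
    print(text.numpy(), label.numpy())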

Related

Is it bad practice to force each Keras batch to contain at least one image from each class?

I'm training a U-Net CNN in Keras, and one of the image classes is significantly under-represented in the training dataset. I'm using a class-weighted loss function to account for this, but my worry is that with such a small batch size and so few instances of the class, only 1 in 10 batches is likely to include an image of this class. So even though the class is weighted, the network rarely sees it during training. Would it therefore be bad practice to force the data generator to include at least one instance of this class while it is selecting random pieces of data for the batch? I could then avoid a situation where the majority of training is unable to access a class of data that's vital to overall task accuracy.
I would recommend three possible techniques to handle this kind of problem:
Uniformize the probability of drawing an image of a given class: for example this for PyTorch (I don't know which framework you are using, please provide it). (Easy, but least efficient; a sketch of this idea follows after this list.)
Adapt the loss by giving more weight to under-represented classes (also easy, and will give a similar result to the previous method; consider whichever of the two is easiest to implement first).
Do some data augmentation (harder, but nowadays a lot of libraries provide efficient ways to do this).
EDIT: Sorry, I did not see that you are using Keras. A few useful links: for data augmentation, class balancing and loss adaptation.
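If you do decide to force minority representation, here is a minimal sketch of a class-uniform batch generator; the array names are illustrative and it assumes the images fit in memory as numpy arrays:

import numpy as np

def balanced_batch_generator(images, labels, batch_size, rng=None):
    # Group example indices by class so classes can be sampled uniformly.
    rng = rng or np.random.default_rng(0)
    classes = np.unique(labels)
    by_class = {c: np.flatnonzero(labels == c) for c in classes}
    while True:
        # Draw the class first (uniformly), then an example within it, so
        # rare classes appear in batches as often as common ones.
        chosen = rng.choice(classes, size=batch_size)
        idx = np.array([rng.choice(by_class[c]) for c in chosen])
        yield images[idx], labels[idx]

Note that class-uniform sampling already oversamples the rare class heavily, so combining it with a class-weighted loss would boost the rare class twice; you probably want one mechanism or the other.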

Averaging linear separators obtained from SVM

For research purposes, I find myself needing to train an SVM via SGD on a large dataset (that is, one with a large number of examples). This makes using scikit-learn's implementation (SGDClassifier) problematic, as it requires loading the entire dataset at once.
The algorithm I am familiar with uses n steps of SGD to obtain n different separators w_i and then averages them (specifics can be seen in slide 12 of https://www.cse.huji.ac.il/~shais/Lectures2014/lecture8.pdf).
This made me think that maybe I can use scikit-learn to train multiple such classifiers and then take the average of the resulting linear separators (assume no bias).
Is this a reasonable line of thinking, or does scikit-learn's implementation not fall under my logic?
Edit: I am well aware of the alternatives for training SVMs in different ways, but this is for a specific research purpose. I would just like to know whether this line of thinking is possible with scikit-learn's implementation, or whether you are aware of an alternative that will allow me to train an SVM using SGD without loading an entire dataset into memory.
SGDClassifier has a partial_fit method, and one of the primary purposes of partial_fit is to scale sklearn models to large datasets. Using it, you can load a part of the dataset into RAM, feed it to SGD, and keep repeating this until the full dataset has been used.
In the code below, I use KFold mainly to imitate loading chunks of the dataset.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

class GD_SVM(BaseEstimator, ClassifierMixin):
    def __init__(self):
        self.sgd = SGDClassifier(loss='hinge', random_state=42,
                                 fit_intercept=True, l1_ratio=0, tol=.001)

    def fit(self, X, y):
        # Each KFold split stands in for one chunk of data loaded from disk.
        cv = KFold(n_splits=10, random_state=42, shuffle=True)
        for _, chunk_idx in cv.split(X, y):
            xt, yt = X[chunk_idx], y[chunk_idx]
            # partial_fit needs the full set of classes on every call.
            self.sgd = self.sgd.partial_fit(xt, yt, classes=np.unique(y))
        return self

    def predict(self, X):
        return self.sgd.predict(X)
To test this against a regular (linear) SVM:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # for simplicity; a Pipeline is the better choice
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=43)
sgd = GD_SVM()
svm = LinearSVC(loss='hinge', max_iter=1, random_state=42,
                C=1.0, fit_intercept=True, tol=.001)
r = cross_val_score(sgd, X, y, cv=cv)  # cross_val_score(svm, X, y, cv=cv)
print(r.mean())
This returned 95% accuracy for the GD_SVM above, and 96% for the SVM. On the Digits dataset the SVM had 93% accuracy, while GD_SVM had 91%. So the two perform broadly similarly, but not identically. This is expected, since they use quite different optimization procedures, and I think careful hyperparameter tuning would reduce the gap.
Regarding the concern of loading all of the data into memory: if you have access to more compute resources, you may want to use PySpark's SVM implementation: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#linear-support-vector-machine, since Spark is built for large-scale data processing. I don't know whether averaging the separators from multiple Scikit-Learn models would work as expected; based on the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), there isn't a clean way to instantiate a new model with given separators, so it would probably have to be implemented as an ensemble approach.
If you insist on using the whole dataset for training instead of sampling (which, by the way, is what the slides describe) and computational performance is not a concern, I would train n classifiers, then select only their support vectors and retrain a final version on those support vectors only. This way you effectively dismiss most of the data and concentrate only on the points that are important for the classification.
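A minimal sketch of that support-vector pooling idea, assuming a kernel SVC (which exposes the support_ indices) and using array_split as a stand-in for loading the dataset in chunks:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
chunks = np.array_split(np.arange(len(X)), 5)  # stand-in for disk chunks

sv_X, sv_y = [], []
for idx in chunks:
    clf = SVC(kernel='linear', C=1.0).fit(X[idx], y[idx])
    sv_X.append(X[idx][clf.support_])  # keep only the support vectors
    sv_y.append(y[idx][clf.support_])

# Retrain the final model on the pooled support vectors only.
final = SVC(kernel='linear', C=1.0).fit(np.vstack(sv_X), np.concatenate(sv_y))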

Best way to handle imbalanced dataset for multi-class classification in Auto-Sklearn

I'm using Auto-Sklearn and have a dataset with 42 classes that are heavily imbalanced. What is the best way to handle this imbalance? As far as I know, two approaches to handling imbalanced data in machine learning exist: either use a resampling mechanism such as over- or under-sampling (or a combination of both), or solve it on the algorithmic level by choosing an inductive bias, which would require in-depth knowledge about the algorithms used within Auto-Sklearn. I'm not quite sure how to handle this problem. Is it possible to address the imbalance directly within Auto-Sklearn, or do I need to use resampling strategies as offered by e.g. imbalanced-learn? Also, which evaluation metric should be used after the models have been computed? roc_auc_score for multiple classes has been available since sklearn==0.22.1; however, Auto-Sklearn only supports sklearn up to version 0.21.3. Thanks in advance!
Another method is to set class weights according to class sizes. The effort is minimal and it seems to work fine. I was looking into setting weights in auto-sklearn and this is what I found:
https://github.com/automl/auto-sklearn/issues/113
For example, in scikit-learn's SVM you have the parameter class_weight:
https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
I hope this helps :)
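As a small, hedged illustration of that class_weight parameter on toy data:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced toy data: roughly 95% of samples in class 0, 5% in class 1.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# 'balanced' reweights each class inversely to its frequency; an explicit
# dict such as {0: 1, 1: 10} also works.
clf = SVC(kernel='linear', class_weight='balanced').fit(X, y)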
One way that has worked for me in the past to handle highly imbalanced datasets is the Synthetic Minority Over-sampling Technique (SMOTE). Here is the paper for a better understanding:
SMOTE Paper
This works by synthetically oversampling the minority class, or classes for that matter. To quote the paper:
The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen.
This will then move your dataset closer to being balanced. There is an implementation of SMOTE in the imblearn package in Python.
Here is a good read about different oversampling algorithms. It includes oversampling using ADASYN as well as SMOTE.
I hope this helps.
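For completeness, a minimal sketch of SMOTE with imblearn (>= 0.4, where the method is called fit_resample); note that SMOTE operates on numeric feature vectors, not on raw text or images:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print(Counter(y))        # e.g. Counter({0: 897, 1: 103})

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))    # both classes now equally represented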
For those interested and as an addition to the answers given, I can highly recommend the following paper:
Lemnaru, C., & Potolea, R. (2011, June). Imbalanced classification problems: systematic study, issues and best practices. In International Conference on Enterprise Information Systems (pp. 35-50). Springer, Berlin, Heidelberg.
The authors argue that:
In terms of solutions, since the performance is not expected to improve significantly with a more sophisticated sampling strategy, more focus should be allocated to algorithm related improvements, rather than to data improvements.
Since e.g. the ChaLearn AutoML Challenge 2015 used balanced accuracy, sklearn argues that it is a fitting metric for imbalanced data, and Auto-Sklearn was able to compute well-fitting models with it, I'm going to give it a try. Even without resampling, the results were much "better" (in terms of prediction quality) than when just using accuracy.
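For reference, balanced accuracy is available in sklearn >= 0.20 and averages per-class recall, so the majority class cannot dominate the score:

from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
# Plain accuracy would be 5/6; balanced accuracy instead averages the
# per-class recalls (1.0 for class 0, 0.5 for class 1).
print(balanced_accuracy_score(y_true, y_pred))  # 0.75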

How to clean a large image dataset for deep learning purposes?

I have a large image dataset with 477 classes (about 500,000 images). Each class contains some irrelevant images, so when a model is trained on it, the model's accuracy is not acceptable. Given the number of classes, it would take a human far too long to clean the dataset manually. Is there any way to remove such images automatically (e.g., a machine learning method or algorithm)?
I believe that for now the best (most reliable) way to clean image datasets is manually, though there are some techniques that could be applied. Services like Azure and Amazon ML have some ways to clean data; however, I don't know whether they apply them to images (https://learn.microsoft.com/en-us/azure/machine-learning/team-data-science-process/prepare-data). There are certainly companies that have well-developed ways of doing this.
Maybe you can get inspired by this paper: https://stefan.winklerbros.net/Publications/icip2014a.pdf
One possible way is to use a classifier to remove unwanted images from your dataset, but this is only worthwhile for huge datasets, and it is not as reliable as the normal way (manual cleaning). For example, an SVM classifier can be trained to recognize the images belonging to each class and flag the rest. More details will be added after testing this method.
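One hedged way to sketch that idea: embed each class's images with a pretrained CNN, then flag outliers in the feature space with a one-class SVM. The array imgs below is a random stand-in for one class's images, and nu is an assumed outlier fraction to tune per class:

import numpy as np
import tensorflow as tf
from sklearn.svm import OneClassSVM

imgs = np.random.rand(64, 224, 224, 3) * 255  # stand-in for one class's images

# Embed the images with a pretrained CNN, then look for outliers in that
# feature space rather than in raw pixels.
base = tf.keras.applications.MobileNetV2(include_top=False, pooling='avg')
feats = base.predict(tf.keras.applications.mobilenet_v2.preprocess_input(imgs))

detector = OneClassSVM(nu=0.05, gamma='scale').fit(feats)  # nu ~ assumed outlier share
clean_imgs = imgs[detector.predict(feats) == 1]  # -1 marks likely outliers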

What's the best way to select features independent of the model being used?

I am using TensorFlow's DNNRegressor to model a multivariate regression problem. I want to form an optimal feature set from a mixed bag of categorical and continuous features. What would be the best way to proceed? The reason I want this approach to be independent of the model is that I couldn't find much about feature selection/evaluation in the direct context of TensorFlow.
TensorFlow is mostly a library for machine learning algorithms, so you need to use other libraries for preprocessing.
scikit-learn is good in many cases. You should try it; it contains feature selection methods. I'm not sure whether they handle categorical features directly, but if not, you can always convert them to numerical ones.
They suggest:
For regression: f_regression, mutual_info_regression
And for any problem, you can use their baseline method, VarianceThreshold.
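A minimal sketch of those selectors on synthetic regression data:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

# Drop (near-)constant features first...
X = VarianceThreshold(threshold=0.0).fit_transform(X)
# ...then keep the k features whose linear relation to the target is strongest.
X_best = SelectKBest(f_regression, k=10).fit_transform(X, y)
print(X_best.shape)  # (200, 10)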
