ML with imbalanced binary dataset - python

I have a problem I am trying to solve:
- imbalanced dataset with 2 classes
- one class dwarfs the other one (923 vs 38)
- when the dataset is used as-is to train a RandomForestClassifier, the f1_macro score stays in the 0.6-0.65 range for both TRAIN and TEST
While doing research on the topic yesterday, I educated myself on resampling, and especially the SMOTE algorithm. It seems to have worked wonders for my TRAIN score: after balancing the dataset with it, my score went from ~0.6 up to ~0.97. The way I applied it was as follows:
- I split my TEST set away from the rest of the data at the beginning (10% of the whole data)
- I applied SMOTE to the TRAIN set only (class balance 618 vs 618)
- I trained a RandomForestClassifier on the TRAIN set and achieved f1_macro = 0.97
- when testing on the TEST set, the f1_macro score remained in the ~0.6-0.65 range
What I assume happened is that the TEST set held observations which were vastly different from the pre-SMOTE minority-class observations in the TRAIN set. SMOTE ended up teaching the model to recognize the TRAIN-set cases really well, but these few outliers in the TEST set threw the model off balance.
What are the common strategies to deal with this problem? Common sense would dictate that I should try to capture a very representative sample of the minority class in the TRAIN set, but I do not think sklearn has any automated tools to make that happen?

Your assumption is correct. Your machine learning model is basically overfitting on your training data, which has the same pattern repeated for one class; the model learns that pattern and misses the other patterns that are present in the test data. This means the model will not perform well in the wild.
If SMOTE is not working, you can experiment by testing different machine learning models. Random forest generally performs well on this type of dataset, so try to tune your RF model by pruning it or tuning its hyperparameters. Another way is to assign class weights when training the model, as in the sketch below. You can also try penalized models, which impose an additional cost on the model when it misclassifies the minority class.
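For illustration, here is a minimal, self-contained sketch of class weighting with scikit-learn; the synthetic data merely mimics the 923-vs-38 split from the question:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the 923-vs-38 dataset described in the question
X, y = make_classification(n_samples=961, weights=[0.96], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# class_weight='balanced' reweights each class inversely to its frequency,
# so errors on the rare class are penalized much more heavily
clf = RandomForestClassifier(class_weight='balanced', random_state=0)
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average='macro'))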
You can also try undersampling, since you have already tested oversampling, though most probably your undersampling will suffer from the same problem. Also try simple random oversampling instead of SMOTE to see how your results change; both are sketched below.
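A sketch of both resamplers using the imbalanced-learn package (an assumption on my part, since it is the library SMOTE usually comes from); as with SMOTE, resample the TRAIN set only:

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# plain oversampling: duplicate minority rows until the classes match
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# undersampling: discard majority rows until the classes match
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)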
Another, more advanced method you should experiment with is batching. Take all of your minority-class samples plus an equal number of entries from the majority class and train a model. Keep doing this for all the batches of your majority class, and in the end you will have multiple machine learning models, which you can then use together to vote. A sketch follows.
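A minimal sketch of that batching idea, assuming numpy arrays X_train / y_train with the minority class labeled 1 (imbalanced-learn's BalancedBaggingClassifier automates roughly the same scheme):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
minority = np.where(y_train == 1)[0]
majority = rng.permutation(np.where(y_train == 0)[0])

# each model sees every minority row plus a disjoint, roughly
# equal-sized slice of the majority class
models = []
for chunk in np.array_split(majority, len(majority) // len(minority)):
    idx = np.concatenate([minority, chunk])
    models.append(RandomForestClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# soft voting: average the predicted probabilities across the ensemble
proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
y_pred = proba.argmax(axis=1)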

Related

Identical accuracy in different ML Classification models

I used the "Stroke" data set from kaggle to compare the accuracy of the following different models of classification:
K-Nearest-Neighbor (KNN).
Decision Trees.
Adaboost.
Logistic Regression.
I did not implement the models myself, but used sklearn library's implementations.
After training the models I ran the test data and printed each model's accuracy; these are the results:
As you can see, KNN, AdaBoost, and Logistic Regression gave me the exact same accuracy.
My question is: does it make sense that there is not even a small difference between them, or did I make a mistake somewhere along the way (even though I only used sklearn's implementations)?
In general achieving the same scores is unlikely, and the explanation is usually one of:
- a bug in the actual reporting
- a bug in the data processing
- a score that corresponds to a degenerate solution
And the last explanation is probably the case here. The Stroke dataset has 249 positive samples in 5000 datapoints, so if your model always says "no stroke" it will get roughly 95% accuracy. So my best guess is that all your models failed to learn anything and are just constantly outputting "0".
In general, accuracy is not the right metric for highly imbalanced datasets. Consider balanced accuracy, F1, etc.
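One quick way to check for such a degenerate solution is to compare every model against scikit-learn's majority-class baseline (a sketch; X_train etc. are assumed to come from your existing split of the Stroke data):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# a "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
y_pred = baseline.predict(X_test)
print(accuracy_score(y_test, y_pred))           # ~0.95 on this data
print(balanced_accuracy_score(y_test, y_pred))  # 0.5, i.e. no skill

If your real models only match these numbers, they have learned nothing.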

Is the performance of a deep learning model affected if it has "seen" the same test images before?

I am working with the YOLOv3 model for an object detection task. I am using pre-trained weights that were generated for the COCO dataset, however, I have my own data for the problem I am working on. According to my knowledge, using those trained weights as a starting point for my own model should not have any effect on the performance of the model once it is trained on an entirely different dataset (right?).
My question is: will the model give "honest" results if I train it multiple times and test it on the same test set each time, or would it have better performance since it has already been exposed to those test images during an earlier experiment? I've heard people say things like "the model has already seen that data", does that apply in my case?
For hyper-parameter selection, evaluation of different models, or evaluating during training, you should always use a validation set.
You are not allowed to use the test set until the end!
The whole purpose of the test data is to get an estimate of the performance after deployment. When you use it during training, whether to train your model or to evaluate it, you expose that data: for example, based on the test-set accuracy you decide to increase the number of layers.
Now your performance on the test set will increase. However, it comes at a price!
Your estimate on the test set becomes biased, and you are no longer able to use that estimate to talk about data your model sees after deployment.
For example, say you want to train an object detector for self-driving cars, and you exposed the test set during training. You can then no longer use the accuracy on the test set to talk about the performance of the object detector when you put it in a car and sell it to a customer.
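A minimal sketch of the resulting three-way split with scikit-learn (the split fractions are illustrative, not prescribed by the answer):

from sklearn.model_selection import train_test_split

# hold the test set out first and never touch it until the very end
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)
# carve a validation set out of the remainder for model selection
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, random_state=0)

All hyper-parameter decisions are made against (X_val, y_val); (X_test, y_test) is evaluated exactly once.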
There is an old saying related to this matter:
If you torture the data enough, it will confess.

Augmenting a classification model to predict "Unknown" instead of a wrong classification

I am working on a multi-class classification problem; it contains some class imbalance (100 classes, a handful of which only have 1 or 2 samples associated).
I have been able to get a LinearSVC (& CalibratedClassifierCV) model to achieve ~98% accuracy, which is great.
The problem is that for all of the misclassified predictions the business will incur a monetary loss: for each misclassification, we would incur a $1,000 loss. A solution would be to classify a datapoint as "Unknown" instead of risking a complete misclassification (these unknowns could then be human-classified, which would cost roughly $10 per "Unknown" prediction). Clearly, this is cheaper than the $1,000-per-misclassification loss.
Any suggestions for how I would go about incorporating this "Unknown" class?
I currently have:
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm, cv=3)
# fit the calibrated model
clf.fit(X_train, y_train)
# get class probabilities for each prediction
decision_probabilities = clf.predict_proba(X_test)
# confidence = probability of the highest-scoring class
confidence = [np.amax(x) for x in decision_probabilities]
I was planning to use the predict_proba method from the CalibratedClassifierCV model, and for any sample whose max probability fell under a threshold (yet to be determined) I would predict "Unknown" instead of the class that probability is actually associated with.
The problem is that when I've checked correct predictions, there are confidence values as low as 30%. Similarly, there are incorrect predictions with confidence values as high as 95%. If I were to just set a threshold of, say, 50%, my accuracy would go down significantly, I would have quite a few "Unknown" predictions (a loss), and still a fair number of misclassifications (an even bigger loss).
Is there a way to incorporate another loss function on this back-end classification (predicted class vs 'unknown' class)?
Any help would be greatly appreciated!
A few suggestions right off the bat:
Accuracy is not the correct metric for evaluating imbalanced datasets. For example, if 90% of samples belong to one class, 90% accuracy is achieved by a dumb model which always predicts that class. Precision and recall are generally better metrics for such cases; opting between the two is generally a business decision.
Given the input signals, it may be difficult to do better than 98%, especially since some classes have too few samples. What you can do is group the minority classes together and give them a single label, e.g. 'other'. This way, the model will hopefully have enough samples to learn that these samples are different from all other classes and will classify them as 'other'.
Often when you try to replace a manual business process by ML, you generally do not completely remove human intervention. The goal is to use the model on cases/classes/parts of the input space where it does well and use the manual process for the rest. One way to do this is via the 'other' label: once the model has predicted 'other', a human may manually classify those samples. Another method is to find a threshold on the predicted probability above which the model has high accuracy and sufficient population coverage. For example, say you have 100% (typically 90-100%) accuracy whenever the output probability is above 0.70. If this covers enough of the input population, you only use the ML model on such cases; for everything else, the manual process is followed. A sketch of that threshold search, using the cost figures from the question, follows.
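A hedged sketch of sweeping the threshold to minimise expected cost ($1,000 per silent misclassification vs $10 per human-reviewed "Unknown"); clf, X_test and y_test are the objects from the question, with y_test assumed to be a numpy array:

import numpy as np

y_pred = clf.predict(X_test)
confidence = clf.predict_proba(X_test).max(axis=1)

for t in np.arange(0.30, 1.00, 0.05):
    unknown = confidence < t               # sent to a human at $10 each
    wrong = (y_pred != y_test) & ~unknown  # silent errors at $1,000 each
    cost = 1000 * wrong.sum() + 10 * unknown.sum()
    print(f"threshold={t:.2f}  unknowns={unknown.sum()}  cost=${cost}")

Picking the threshold that minimises this cost on a held-out validation set (not the test set) directly encodes the business trade-off instead of raw accuracy.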

Advice for my plan - large dataset of students and grades, looking to classify bottom 2%

I have a dataset which includes socioeconomic indicators for students nationwide as well as their grades. More specifically, this dataset has 36 variables for about 30 million students as predictors, and the students' grades as the response.
My goal is to be able to predict whether a student will fail out (i.e., be in the bottom 2nd percentile of the nation in terms of grades). I understand that classification with an imbalanced dataset (98% : 2%) will introduce a bias. Based on some research, I planned to account for this by increasing the cost of an incorrect classification in the minority class.
Can someone please confirm that this is the correct approach (and that there isn't a better one, I'm assuming there is)? And also, given the nature of this dataset, could someone please help me choose a machine learning algorithm to accomplish this?
I am working with TensorFlow 2.0 in a Google Colab. I've compiled all the data together into a .feather file using pandas.
For an imbalanced dataset, using class weights is the most common approach. But with such a large dataset (30M training examples) for a binary classification problem with a 2%/98% split, I'd say it is hard to keep the model unbiased against the first class using class weights alone, since it is not that different from shrinking the training set to a balanced one.
Here are some steps for evaluating model accuracy:
Split your dataset into train, evaluation, and test sets.
For the evaluation metric I suggest these alternatives:
a. Make sure the first class represents at least 20% of both the evaluation and test sets.
b. Set the evaluation metric to precision and recall (rather than the F1 score).
c. Set the evaluation metric to Cohen's kappa score (coefficient).
From my own perspective, I prefer b; a snippet computing b and c follows.
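Computing the suggested metrics with scikit-learn (a sketch; y_true and y_pred stand for your evaluation-set labels and predictions):

from sklearn.metrics import precision_score, recall_score, cohen_kappa_score

print(precision_score(y_true, y_pred))    # of predicted "fail out", how many were right
print(recall_score(y_true, y_pred))       # of actual "fail out", how many were caught
print(cohen_kappa_score(y_true, y_pred))  # agreement corrected for chance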
Since you are using TensorFlow, I assume you are familiar with deep learning, so use deep learning in addition to classical machine learning; it gives you many additional alternatives. Here are some options for both approaches.
For machine learning algorithms:
1. Decision-tree algorithms (especially Random Forest).
2. If the features are essentially uncorrelated (correlation close to zero, e.g. 0.01), try Complement Naive Bayes for multinomial features, or Gaussian Naive Bayes with class weights for continuous features (see the sketch after this list).
3. Some nonparametric learning algorithms. You may not be able to fit this training set with Support Vector Machines (SVM) easily because the dataset is rather large, but you could try.
4. Unsupervised learning algorithms (these sometimes give you a more generic model).
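A small sketch of option 2, assuming non-negative, count-like features in X_train (a requirement of the multinomial/complement family):

from sklearn.naive_bayes import ComplementNB

# ComplementNB is a Naive Bayes variant designed for imbalanced data
nb = ComplementNB().fit(X_train, y_train)
probs = nb.predict_proba(X_test)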
For deep learning algorithms:
1. Encoder-decoder architectures, or simply generative adversarial networks (GANs).
2. Siamese networks.
3. Training a model with 1D convolution layers.
4. Class weights.
5. Balanced batches of the training set, randomly chosen.
You have many other alternatives; from my own perspective, I would try hardest with 1, 3, or 5. For deep learning, the 5th approach sometimes works very well, and I recommend trying it together with 1 and 3. A sketch of it follows.
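A hedged sketch of option 5 in TensorFlow 2; X and y are placeholder numpy arrays for the features and 0/1 labels, and 'model' stands for any compiled Keras model:

import tensorflow as tf

pos = tf.data.Dataset.from_tensor_slices((X[y == 1], y[y == 1])).repeat().shuffle(10000)
neg = tf.data.Dataset.from_tensor_slices((X[y == 0], y[y == 0])).repeat().shuffle(10000)

# draw each batch 50/50 from the two classes
balanced = tf.data.experimental.sample_from_datasets(
    [pos, neg], weights=[0.5, 0.5]).batch(1024)
model.fit(balanced, steps_per_epoch=1000, epochs=5)

Option 4 is even simpler in Keras: pass class_weight={0: 1.0, 1: 49.0} (the 98/2 ratio) to model.fit.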

Is the test set used to update the weights of a deep learning model with Keras?

I'm wondering whether the result on the test set is used to optimize the model's weights. I'm trying to build a model, but the issue I have is that I don't have much data, because these are medical research patients. The number of patients in my case is limited (61), and I have 5 feature vectors per patient. What I tried is to create a deep learning model excluding one subject, and I used the excluded subject as the test set. My problem is that there is large variability in subject features, and my model fits the training set (60 subjects) well, but not the 1 excluded subject.
So I'm wondering if the test set (in my case, the excluded subject) could be used in some way to make the model converge so that it better classifies the excluded subject?
You should not use the test data of your dataset in your training process. If your training data is not enough, one approach used a lot these days (especially for medical images) is data augmentation, so I highly recommend you use this technique in your training process. "How to use Deep Learning when you have Limited Data" is a good tutorial about data augmentation.
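For image data, a minimal augmentation sketch with Keras (the transform ranges are illustrative, and 'model' stands for your compiled network; for feature vectors like yours, other perturbations would be needed):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# generate randomly perturbed copies of the training images on the fly
augmenter = ImageDataGenerator(rotation_range=15,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               horizontal_flip=True)
model.fit(augmenter.flow(X_train, y_train, batch_size=32), epochs=20)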
No, you shouldn't use your test set for training; that is how you prevent fooling yourself with overfitting. If you follow cross-validation principles, you need to split your data into three datasets: a train set, which you'll use to train your model; a validation set, to test different values of your hyperparameters; and a test set, to finally test your model. If you use all your data for training, your model will obviously overfit.
One thing to remember: deep learning works well when you have large and very rich datasets.
