Classification: skewed data within a class

Classification: skewed data within a class - python

I'm trying to build a multilabel-classifier to predict the probabilities of some input data being either 0 or 1. I'm using a neural network and Tensorflow + Keras (maybe a CNN later).
The problem is the following:
The data is highly skewed. There are a lot more negative examples than positive maybe 90:10. So my neural network nearly always outputs very low probabilities for positive examples. Using binary numbers it would predict 0 in most of the cases.
The performance is > 95% for nearly all classes, but this is due to the fact that it nearly always predicts zero...
Therefore the number of false negatives is very high.
Some suggestions how to fix this?
Here are the ideas I considered so far:
Punishing false negatives more with a customized loss function (my first attempt failed). Similar to class weighting positive examples inside a class more than negative ones. This is similar to class weights but within a class.
How would you implement this in Keras?
Oversampling positive examples by cloning them and then overfitting the neural network such that positive and negative examples are balanced.
Thanks in advance!

You're on the right track.
Usually, you would either balance your data set before training, i.e. reducing the over-represented class or generate artificial (augmented) data for the under-represented class to boost its occurrence.
Reduce over-represented class
This one is simpler, you would just randomly pick as many samples as there are in the under-represented class, discard the rest and train with the new subset. The disadvantage of course is that you're losing some learning potential, depending on how complex (how many features) your task has.
Augment data
Depending on the kind of data you're working with, you can "augment" data. That just means that you take existing samples from your data and slightly modify them and use them as additional samples. This works very well with image data, sound data. You could flip/rotate, scale, add-noise, in-/decrease brightness, scale, crop etc.
The important thing here is that you stay within bounds of what could happen in the real world. If for example you want to recognize a "70mph speed limit" sign, well, flipping it doesn't make sense, you will never encounter an actual flipped 70mph sign. If you want to recognize a flower, flipping or rotating it is permissible. Same for sound, changing volume / frequency slighty won't matter much. But reversing the audio track changes its "meaning" and you won't have to recognize backwards spoken words in the real world.
Now if you have to augment tabular data like sales data, metadata, etc... that's much trickier as you have to be careful not to implicitly feed your own assumptions into the model.

I think your two suggestions are already quite good.
You can also simply undersample the negativ class, of course.
def balance_occurences(dataframe, zielspalte=target_name, faktor=1):
least_frequent_observation=dataframe[zielspalte].value_counts().idxmin()
bottleneck=len(dataframe[dataframe[zielspalte]==least_frequent_observation])
balanced_indices=dataframe.index[dataframe[zielspalte]==least_frequent_observation].tolist()
for value in (set(dataframe[zielspalte])-{least_frequent_observation}):
full_list=dataframe.index[dataframe[zielspalte]==value].tolist()
selection=np.random.choice(a=full_list,size=bottleneck*faktor, replace=False)
balanced_indices=np.append(balanced_indices,selection)
df_balanced=dataframe[dataframe.index.isin(balanced_indices)]
return df_balanced
Your loss function could look into the recall of the positive class combined with some other measurement.

Related

Accuracy metric on imbalanced classification data

There has been a lot of discussion about this topic.
But I have no enough reputation i.e. 50 to comment on those posts. Hence, I am creating this one.
As far as I understand, accuracy is not an appropriate metric when the data is imbalanced.
My question is, is it still inppropriate if we have applied either the resampling method, class weights or initial bias?
Reference here.
Thanks!

Indeed, it is always a good idea to test resampling techniques such as over sampling the minority class and under sampling the majority class. My advice is to start with this excellent walk through for resampling techniques using the imblearn package in python. Eventually, what seems to work best in most cases is intelligently combining both under and over samplers
For example, undersampling the majority class by 70% and then apply over sampling to the minority class to match the new distribution of the majority class.
To answer your question regarding accuracy: No. It should not be used. The main reason is that when you apply resampling techniques, you should never apply it on the test set. Your test set, same as in real life and in production, you never know it in advance. So the imbalance will always be there.
As far as evaluation metrics, well the question you need to ask is 'how much more important is the minority class than the majority class?' how much false positives are you willing to tolerate? The best way to check for class separability is using a ROC curve. The better it looks (a nice high above the diagonal line curve) the better the model is at separating positive classes from negative classes even if it is imbalanced.
To get a single score that allows you to compare models, and if false negatives are more important than false positives (which is almost always the case in imbalanced classification), then use F2 measure which gives more importance to recall (i.e. more importance to true positives detected by your model). However, the way we have always done it in my company is by examining in detail the classification report to know exactly how much recall we get for each class (so yes, we mainly aim for high recall and occasionally look at the precision which reflects the amount of false positives).
Conclusion :
Always check multiple scores such as classification report, mainly recall
If you want a single score, use F2.
Use ROC curve to evaluate the model visually regarding class separability
Never apply resampling to your test set!
Finally, it would be wise to apply a cost sensitive learning technique to your model such as class weighting during training!
I hope this helps!

I would prefer to use g-mean or brier score as Prof. Harrel wrote a nice discussion on this topic. See this: https://www.fharrell.com/post/class-damage/. Here is another one which provided a limitations of using in proper metrics. https://www.sciencedirect.com/science/article/pii/S2666827022000585

Worse results when training on entire dataset

After finalizing the architecture of my model I decided to train the model on the entire dataset by setting validation_split = 0 in fit(). I thought this would improve the results based on these sources:
What is validation data used for in a Keras Sequential model?
Your model doesn't "see" your validation set and isn´t in any way trained on it
https://machinelearningmastery.com/train-final-machine-learning-model/
What about the cross-validation models or the train-test datasets?
They’ve been discarded. They are no longer needed.
They have served their purpose to help you choose a procedure to finalize.
However, I got worse results without the validation set (compared to validation_split = 0.2), leaving all other parameters the same.
Is there an explanation for this? Or was it just by chance that my model happened to perform better on the fixed test data when a part of the training data was excluded (and used as validation).

Well that's really a very good question that covers a lots of machine learning related concepts specially Bias-Variance Tradeoff
As in the comment #CrazyBarzillian hinted that more data might be leading to over-fitting and yes we need more info about your data to come to a solution. But in a broader way I would like to explain you few points, that might help you to understand as it why it happened.
EXPLAINATION
Whenever your data has more number of features, your model learns a very complex equation to solve it. In short model is too complicated for the amount of data we have. This situation, known as high variance, leads to model over-fitting. We know that we are facing a high variance issue when the training error is much lower than the test error. High variance problems can be addressed by reducing the number of features (by applying PCA , outlier removal etc.), by increasing the number of data points that is adding more data.
Sometimes, you have lesser features in your data and hence model learns a very simple equation to solve it. This is known as high bias. In this case , adding more data won't help. In this case less data will do the work or adding more features will help.
MY ASSUMPTION
I guess your model is suffering from high bias if its performing poor on adding more data. But to check whether the statement adding more data leading to poor results is correct or not in your case you can do the following things:-
play with some hyperparameters
try other machine learning models
instead of accuracy scores , look for r2 scores or mean absolute error in case of regression or F1, precision, recall in case of classification
If after doing both things you are still getting the same results that more data is leading to poor results, then you can be sure of high bias and can either increase the number of features or reduce the data.
SOLUTION
By reducing the data, I mean use small data but better data. Better data means suppose you are doing a classification problem and you have three classes (A, B and C) , a better data would be if all the datapoints are balanced between three classes. Your data should be balanced. If it is unbalanced that is class A has high number of samples while class B and C has only 3-4 samples then you can apply Ramdom Sampling techniques to overcome it.
How to make BETTER DATA
Balance the data
Remove outliers
Scale (Normalize) the data
CONCLUSION
It is a myth that more data is always leads to good model. Actually more than the quantity , quality of the data also matters. Data should have both quantity as well as quality. This game of maintaining quality and quantity is known as Bias-Variance Tradeoff.

Is it possible to use Keras to classify images between one option or another?

To explain the title better, I am looking to classify pictures between two classes. For example, let's say that 0 is white, and black is 1. I train and validate the system with pictures that are gray, some lighter than others. In other words, none of the training/validation (t/v) pictures are 0, and none are 1. The t/v pictures range between 0 and 1 depending of how dark the gray is.
Of course, this is just a hypothetical situation, but I want to apply a similar scenario for my work. All of the information I have found online is based on a binary classification (either 1 or 0), rather than a spectrum classification (between 1 and 0).
I assume that this is possible, but I have no idea where to start. Although, I do have a binary code written with good accuracy.

Based on your given example, maybe a classification approach is not the best one. I think that what you have is a regression problem, as you want your output to be a continuous value in some range, that has a meaning itself (as higher or lower values have a proper meaning).
Regression tasks usually have an output with linear activation, and they expect to have a continuous value as the ground truth.
I think you could start by taking a look at this tutorial.
Hope this helps!

If I understand you correctly, it's definitely possible.
The creator of Keras, François Chollet, wrote Deep Learning with Python which is worth reading. In it he describes how you could accomplish what you would like.
I have worked through examples in his book and shared the code: whyboris/ml-with-python-and-keras
There are many approaches, but a fast one is to use a pre-trained model that can recognize a wide variety of images (for example, classify 1,000 different categories). You will use it "headless" (without the last classification layer that takes the vectors and decides which of the 1,000 categories it falls most into). And you will train just the "last step" in the model (freezing all the previous layers) while training your binary classifier.
Alternatively you could train your own classifier from scratch. Specifically glance at my example (based off the book) cat-dog-classifier which trains its own binary classifier.

How to calculate probability(confidence) of SVM classification for small data set?

Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn SVC to classify those with rbf kernel.
I need the confidence of the prediction along with the predicted class. I used predict_proba method of SVC.
I was getting weird results with that. I searched a bit and found out that it makes sense only for larger datasets.
Found this question on stack Scikit-learn predict_proba gives wrong answers.
The author of the question verified this by multiplying the dataset, thereby duplicating the dataset.
My questions:
1) If I multiply my dataset by lets say 100, having each sample 100 times, it increases the "correctness" of "predict_proba". What sideeffects will it have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier? Like distance from the hyperplanes?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?

First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVM's are mainly popular in high dimensional settings. It is currently unclear whether that applies to your project. They build planes on a handful of (or even single) supporting instances, and are often outperformed in situation with large trainingsets by Neural Nets. A priori they might not be your worse choice.
Oversampling your data will do little for an approach using SVM. SVM is based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vector (I am assuming you are already using the train set as test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artififacts constructed by unbalanced oversampling, since the instances will be exact copies and no distibution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique). You will basically generate synthetic instances based of the ones you have. In theory this will provide you with new instances, that won't be exact copies of the ones you have, and might thusly fall a little out of the normal classification. Note: By definition all these examples will lie in between the original examples in your sample space. This will not mean that they will lie in between your projected SVM-space, possibly learning effects that aren't really true.
Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline

Fitting the training error of a neural network

I am attempting to curve fit the training error of a neural network as a function of the number of training iterations. An example is shown in red in the image below. Here I've trained for 3000 iterations. What I'm interested in is whether I can find a function that I can fit on the first 1000 (or so) iterations to extrapolate out to 3000 iterations with some reasonable accuracy.
However, I don't know what functional form would be best for me to use. At first I tried an exponential of the form f(x)=A+Bexp(-Cx), which is shown in blue. Obviously this doesn't work too well. The exponential dies off way too fast and then basically just becomes the constant term.
Perhaps it's just difficult, since the beginning of the training shows a very sharp drop off of the error but then transitions to something much more gradual for higher iterations. But maybe someone with experience in neural network training and/or experience in fitting unknown functions might have some ideas. I've been trying various exponential forms and polynomial fits within scipy/numpy but with no success. I've varied the number of iterations used in the fit as well (including throwing out the small iteration numbers).
Any thoughts?

I think exponential fitting may work. In your f(x)=A+B*exp(-C*x), I choose A = 0.005, B = 0.045, and C = 1/250, I will get,
It's just about the parameter tuning. Yet I am trying to understand the motivation that you want to fit the learning curve. I think the interpolation method includes the 'extrapolation' option that you can used to predict the error after more epochs. If you want to precisely learn the curve, you can use another neural network with linear hidden layer and output to 'learn' the curve again, though I didn't try whether it works.

Check out this page: http://www.astroml.org/sklearn_tutorial/practical.html
The useful thing for your situation it describes is diagnosing whether your algorithms appear to be high bias or high variance on your data set, and offering specific directions to go in for either case.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.