Using Precision and Recall in training of skewed dataset - python

I have a skewed dataset (5,000,000 positive examples and only 8000 negative [binary classified]) and thus, I know, accuracy is not a useful model evaluation metric. I know how to calculate precision and recall mathematically but I am unsure how to implement them in python code.
When I train the model on all the data I get 99% accuracy overall but 0% accuracy on the negative examples (i.e. it classifies everything as positive).
I have built my current model in PyTorch with criterion = nn.CrossEntropyLoss() and optimiser = optim.Adam().
So, my question is, how do I implement precision and recall into my training to produce the best model possible?
Thanks in advance

Implementations of precision, recall, F1 score and other metrics are usually imported from the scikit-learn library in Python.
link: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
Regarding your classification task, the number of positive training samples simply eclipses the number of negative samples. Try training with a reduced number of positive samples or generating more negative samples. I am not sure deep neural networks could provide you with an optimal result considering the class skewness.
Negative samples can be generated using the Synthetic Minority Over-sampling Technique (SMOTE). This link is a good place to start.
Link: https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/
Try using simple models such as logistic regression or random forest first and check if there is any improvement in the F1 score of the model.
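To make the metrics concrete, here is a minimal pure-Python sketch of precision, recall and F1 for one chosen class. The function name `binary_metrics` and the toy labels are illustrative assumptions, not part of the question; scikit-learn's `precision_score`, `recall_score` and `f1_score` compute the same quantities and would normally be used instead.

```python
def binary_metrics(y_true, y_pred, target):
    """Precision, recall and F1 for the class `target` (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (assumed): 1 = positive (majority), 0 = negative (minority).
y_true = [1] * 95 + [0] * 5
y_all_positive = [1] * 100  # a model that classifies everything as positive

accuracy = sum(t == p for t, p in zip(y_true, y_all_positive)) / len(y_true)
minority = binary_metrics(y_true, y_all_positive, target=0)

print(accuracy)  # high, despite the model being useless on the minority
print(minority)  # precision, recall, F1 on the negative class
```

Evaluating the minority class explicitly is what exposes the "predict everything as positive" failure that overall accuracy hides.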

To add to the other answer, some classifiers have a parameter called class_weight which lets you modify the loss function. By penalizing wrong predictions on the minority class more, you can train your classifier to learn to predict both classes.
For a PyTorch-specific answer, you can refer to this link
As mentioned in the other answer, over and undersampling strategies can be used. If you are looking for something better, take a look at this paper
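As a rough sketch of the class-weighting idea, the "balanced" heuristic used by scikit-learn sets each class weight to n_samples / (n_classes * class_count). The helper name `balanced_class_weights` and the scaled-down label counts below are assumptions for illustration. In PyTorch, weights like these could be passed as a tensor to the `weight` argument of `nn.CrossEntropyLoss`, ordered by class index.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Same 625:1 ratio as 5,000,000 positives vs 8,000 negatives, scaled down.
labels = [1] * 5000 + [0] * 8
weights = balanced_class_weights(labels)

print(weights)  # the minority class gets a far larger weight
```

Weights this extreme can make training unstable, so a softened version (for example the square root of each weight) is sometimes tried; whether that helps here is something to check empirically.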

Related

Augmenting classification model to predict "Unknown" instead of a wrong classification

I am working on a multi-class classification problem, it contains some class imbalance (100 classes, a handful of which only have 1 or 2 samples associated).
I have been able to get a LinearSVC (& CalibratedClassifierCV) model to achieve ~98% accuracy, which is great.
The problem is that for all of the misclassified predictions - the business will incur a monetary loss. That is, for each misclassification - we would incur a $1,000 loss. A solution to this would be to classify a datapoint as "Unknown" instead of a complete misclassification (these unknowns could then be human-classified which would cost roughly $10 per "Unknown" prediction). Clearly, this is cheaper than the $1,000/misclassification loss.
Any suggestions for how I would go about incorporating this "Unknown" class?
I currently have:
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm, cv=3)
# fit model
clf.fit(X_train, y_train)
# get probabilities for each decision
decision_probabilities = clf.predict_proba(X_test)
# get the confidence for the highest class:
confidence = np.amax(decision_probabilities, axis=1)
I was planning to use the predict_proba method from the CalibratedClassifierCV model, and for any max probabilities that were under a threshold (yet to be determined) I would instead classify that sample as "Unknown" instead of the class that the probability is actually associated with.
The problem is that when I've checked correct predictions, there are confidence values as low as 30%. Similarly, there are incorrect predictions with confidence values as high as 95%. If I were to just create a threshold of, say, 50%, my accuracy would go down significantly, I would have quite a bit of "Unknown" classes (loss), and still a few misclassifications (even bigger loss).
Is there a way to incorporate another loss function on this back-end classification (predicted class vs 'unknown' class)?
Any help would be greatly appreciated!
A few suggestions right off the bat:
Accuracy is not the correct metric to evaluate imbalanced datasets. For example, if 90% of samples belong to one class, 90% accuracy is achieved by a dumb model which always predicts the majority class. Precision and recall are generally better metrics for such cases. Opting between the two is generally a business decision.
Given the input signals, it may be difficult to do better than 98%, especially since for some classes you will have too few samples. What you can do is group minority classes together and give them a single label, e.g. 'other'. In this way, the model will hopefully have enough samples to learn that these samples are different from all other classes and will classify them as 'other'.
Often when you try to replace a manual business process with ML, you do not completely remove human intervention. The goal is to use the model on cases/classes/input space where your model does well and use the manual process for the rest. One way to do it is by using the 'other' label. Once your model has predicted 'other', a human may manually classify these samples. Another method is to find a threshold on predicted probability above which the model has a high accuracy and sufficient population coverage. For example, let's say you have 100% (typically 90-100%) accuracy whenever the output probability is above 0.70. If this covers enough of the input population, you use the ML model only on such cases. For everything else, the manual process is followed.
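One way to pick such a threshold is to choose it by cost rather than by accuracy, using the figures from the question ($1,000 per misclassification, $10 per "Unknown"). The sketch below is only an illustration: the function names, the validation confidences and the candidate thresholds are all made up, and in practice they would come from a held-out set.

```python
def total_cost(confidences, is_correct, threshold,
               cost_wrong=1000.0, cost_unknown=10.0):
    """Cost of abstaining below `threshold` and predicting otherwise."""
    cost = 0.0
    for conf, ok in zip(confidences, is_correct):
        if conf < threshold:
            cost += cost_unknown  # routed to a human as "Unknown"
        elif not ok:
            cost += cost_wrong    # confident prediction that was wrong
    return cost


def best_threshold(confidences, is_correct, candidates):
    """Candidate threshold with the lowest total cost on the given data."""
    return min(candidates, key=lambda t: total_cost(confidences, is_correct, t))


# Hypothetical validation results: max predicted probability per sample
# and whether the predicted class was correct.
confidences = [0.30, 0.55, 0.62, 0.80, 0.91, 0.95, 0.97, 0.99]
is_correct = [True, False, True, True, True, False, True, True]
candidates = [0.0, 0.5, 0.6, 0.7, 0.9, 0.96, 1.0]

threshold = best_threshold(confidences, is_correct, candidates)
print(threshold, total_cost(confidences, is_correct, threshold))
```

If the probabilities were perfectly calibrated, predicting at confidence p would cost (1 - p) * 1000 in expectation, so abstaining would pay off whenever p < 0.99. With these costs, a high threshold and many "Unknown" predictions can still be the cheapest option even though accuracy on the remaining cases looks worse on paper.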

Why does implementing class weights make the model worse

I am trying to do binary classification, and one class (0) is approximately one third the size of the other class (1). When I run the raw data through a normal feed-forward neural network, the accuracy is about 0.78. However, when I implement class_weights, the accuracy drops to about 0.49. The ROC curve also seems to do better without the class_weights. Why does this happen, and how can I fix it?
I have already tried changing the model and implementing regularization, dropout, etc., but nothing seems to change the overall accuracy.
this is how i get my weights:
class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = dict(enumerate(class_weights))
Here are the results without the weights:
Here are the results with the weights:
I would expect the results to be better with the class_weights but the opposite seems to be true. Even the roc does not seem to do any better with the weights.
Due to the class imbalance, a very weak baseline of always selecting the majority class will get an accuracy of approximately 75%.
The validation curve of the network that was trained without class weights appears to show that it is picking a solution close to always selecting the majority class. This can be seen from the network not improving much over the validation accuracy it gets in the 1st epoch.
I would recommend looking into the confusion matrix, precision and recall metrics to get more information about which model is better.
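As a small illustration of why the confusion matrix is more informative here, the sketch below counts (true, predicted) pairs and derives per-class recall for the always-majority baseline. The helper names and the 75/25 toy split are assumptions chosen to mirror the question.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def recall_per_class(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    support = Counter(y_true)
    return {cls: counts[(cls, cls)] / n for cls, n in support.items()}


# Toy validation set (assumed): 75% class 1, 25% class 0.
y_true = [1] * 75 + [0] * 25
y_majority = [1] * 100  # baseline that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
recalls = recall_per_class(y_true, y_majority)

print(accuracy)
print(recalls)
```

A model trained with class weights may score lower on accuracy than this baseline while having much better recall on class 0, which is why comparing the two models on accuracy alone can be misleading.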
This answer seems too late, but I hope it is helpful anyway. I just want to add four points:
Your data is split roughly 25% minority and 75% majority, and accuracy is computed as:
accuracy = (true positive + true negative) / (true positive + true negative + false positive + false negative)
Thus, if you look at accuracy as a metric, most likely any model could achieve around 75% accuracy by simply predicting the majority class all the time. That's why on the validation set, the model was not able to predict correctly.
With class weights, the learning curve was not smooth, but the model actually started to learn, and it failed from time to time on the validation set.
As already stated, switching to a metric such as F1 score may help. I saw that you are using TensorFlow; TensorFlow has an F1 score metric in its Addons package, which you can find in their documentation here. For me, I looked at the classification report in scikit-learn. Let's say you want to see the model's performance on the validation set (X_val, y_val):
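from sklearn.metrics import classification_report
# model.predict returns probabilities; threshold at 0.5 to get labels (assumes a single sigmoid output)
y_predict = (model.predict(X_val, batch_size=64, verbose=1) > 0.5).astype(int).ravel()
print(classification_report(y_val, y_predict))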
Other techniques you might want to try, such as combining upsampling and downsampling, or SMOTE, can also help.
Best of luck!
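For reference, plain random oversampling (duplicating minority samples until the classes are equal in size) can be sketched in a few lines. This is not SMOTE, which synthesizes new points by interpolation; the function name and toy data are assumptions, and resampling should be applied to the training split only.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate random samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw the missing number of samples with replacement.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy training data (assumed): 75 samples of class 1, 25 of class 0.
samples = list(range(100))
labels = [1] * 75 + [0] * 25
X_res, y_res = random_oversample(samples, labels)

print(y_res.count(0), y_res.count(1))
```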

PCA and Naive Bayes Classifier

I'm experimenting with PCA and Naive Bayes Classifier in Python.
In short, using a database of gray-scale images of digits, I'm reducing dimensions with PCA and then using Naive Bayes to classify.
I use 2,4,10,30,60,200,500,784 components respectively.
The classification error rates I get respectively are:
0.25806452, 0.15322581, 0.06290323, 0.06451613, 0.06451613, 0.10322581, 0.28064516 and 0.31774194. I thought that taking more components always improved the accuracy of classification. Is this true? If so then I am doing something wrong.
I don't think there is a single valid answer to your question, but reducing the dimensionality of your input can prevent overfitting. More features do not always make your classifier more accurate. You can look here for a detailed explanation: http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/
It is true that reducing dimensions reduces overfitting, but there is always an optimal number of components which gives the best accuracy if you are not adding additional data to the dataset. In your case, it is 10, since it gives the lowest error rate of 0.06290323. So, if you are increasing dimensionality you should also increase the size of the training dataset in order to expect more accuracy. Otherwise, you should try a grid search near that value for more accuracy.
Also, if your dataset is balanced then accuracy may be a good measure for evaluating your performance. In the case of an imbalanced dataset, try precision, recall or F-score.
If you are still not satisfied with the classifier, use some other classification algorithm.
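Using the numbers from the question, picking the best component count and a finer grid around it could look like the sketch below. The variable names are illustrative, and the finer grid is simply every integer between the two neighbouring tested values, which is one arbitrary choice among many.

```python
# Component counts and error rates as reported in the question.
n_components = [2, 4, 10, 30, 60, 200, 500, 784]
error_rates = [0.25806452, 0.15322581, 0.06290323, 0.06451613,
               0.06451613, 0.10322581, 0.28064516, 0.31774194]

# Coarse search: the tested value with the lowest error rate.
best_error, best_k = min(zip(error_rates, n_components))

# Finer grid strictly between the neighbours of the best tested value.
idx = n_components.index(best_k)
lower = n_components[max(idx - 1, 0)]
upper = n_components[min(idx + 1, len(n_components) - 1)]
fine_grid = list(range(lower + 1, upper))

print(best_k, best_error)
print(fine_grid[0], fine_grid[-1])
```

Each value in `fine_grid` would then be evaluated with cross-validation rather than on a single split, since the differences between neighbouring error rates here are small.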

What are good metrics to evaluate the performance of a multi-class classifier?

I'm trying to run a classifier in a set of about 1000 objects, each with 6 floating point variables. I've used scikit-learn's cross validation features to generate an array of the predicted values for several different models. I've then used sklearn.metrics to compute the accuracy of my classifiers, and the confusion table. Most classifiers have around 20-30% accuracy. Below is the confusion table for the SVC classifier (25.4% accuracy).
Since I'm new to machine learning, I'm not sure how to interpret that result, and whether there are other good metrics to evaluate the problem. Intuitively speaking, even with 25% accuracy, and given that the classifier got 25% of the predictions right, I believe it is at least somewhat effective, right? How can I express that with statistical arguments?
If this table is a confusion table, I think that your classifier predicts class E most of the time. I suspect class E is overrepresented in your dataset; accuracy is not a good metric if your classes do not have the same number of instances.
For example, if you have 3 classes A, B, C and class A is overrepresented in the test dataset (90%), a classifier that always predicts class A will have 90% accuracy.
A good metric to use is log loss; logistic regression is a good algorithm that optimizes this metric.
see https://stats.stackexchange.com/questions/113301/multi-class-logarithmic-loss-function-per-class
Another solution is to oversample your small classes.
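For readers unfamiliar with it, multi-class log loss is the average negative log of the probability assigned to the true class. Below is a minimal sketch with made-up probabilities; the function name and the dictionary-based input format are assumptions, and scikit-learn's `log_loss` provides the same metric for array inputs.

```python
import math


def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to each sample's true class."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        # Clip to avoid log(0) when a model is confidently wrong.
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


y_true = ["A", "B", "C"]

# A model that knows nothing: uniform probabilities over 3 classes.
uniform = [{"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}] * 3
# A model that puts 0.8 on the correct class each time.
confident = [
    {"A": 0.8, "B": 0.1, "C": 0.1},
    {"A": 0.1, "B": 0.8, "C": 0.1},
    {"A": 0.1, "B": 0.1, "C": 0.8},
]

uniform_loss = multiclass_log_loss(y_true, uniform)
confident_loss = multiclass_log_loss(y_true, confident)
print(uniform_loss, confident_loss)
```

The uniform model's loss equals log(number of classes), which gives a natural reference point: a classifier is extracting signal only if its log loss is below that value.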
First of all, I find it very difficult to look at confusion tables. Plotting it as an image would give a lot better intuitive understanding about what is going on.
It is advisable to have a single-number metric to optimize since it is easier and faster. When you find that your system doesn't perform as you expect it to, revise your selection of metric.
Accuracy is usually a good metric to use if you have the same number of examples in every class. Otherwise (which seems to be the case here) I'd advise using F1 score, which takes into account both precision and recall of your estimator.
EDIT: However, it is up to you to decide if the ~25% accuracy, or whatever metric, is "good enough". If you are classifying whether a robot should shoot a person you should probably revise your algorithm, but if you are deciding whether data is pseudo-random or random, 25% accuracy could be more than enough to prove the point.

When using multiple classifiers - How to measure the ensemble's performance? [SciKit Learn]

I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics - classifying protein sequences as being Neuropeptide precursors sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor) .
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc' on the training set for 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (Linear kernel), SVM (RBF kernel) and GRB) , and to use a simple majority vote.
My question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and to get it from a combination of classifiers. (That is, to see what the ROC curve is just for each classifier alone [already known], then to see the ROC curve or AUC for the training data using combinations of classifiers).
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then I filter arbitrarily for results with a predicted score below '0.85'. An additional layer of filtering is "how many classifiers agree on this protein's positive classification").
Thank you very much!!
(The website implementation, where we're using the multiple classifiers - http://neuropid.cs.huji.ac.il/ )
The whole shebang is implemented using SciKit learn and python. Citations and all!)
To evaluate the performance of the ensemble, simply follow the same approach as you would normally. However, you will want to get the 10-fold data set partitions first, and for each fold, train all members of your ensemble on that same fold, measure the accuracy, rinse and repeat with the other folds, and then compute the accuracy of the ensemble. So the key difference is to not train the individual algorithms using k-fold cross-validation when evaluating the ensemble. The important thing is not to let the ensemble see the test data, either directly or by letting one of its algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
An alternative approach (again making sure the ensemble does not see the test data) is to take the probabilities and/or labels output by your classifiers and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "stacking".
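For the simple majority-vote ensemble from the question, the per-fold evaluation described above boils down to scoring the voted predictions with the same metric as the individual classifiers. The sketch below assumes the held-out predictions for one test fold are already available; the classifier names and prediction values are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Most common label per sample across classifiers (use an odd count to avoid ties)."""
    votes_per_sample = zip(*predictions_per_classifier)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictions on ONE held-out test fold; every classifier
# was trained on the same training folds and never saw these samples.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
preds = {
    "rf":  [1, 0, 1, 0, 0, 0, 1, 1],
    "svm": [1, 0, 0, 1, 0, 1, 1, 0],
    "et":  [0, 0, 1, 1, 0, 0, 1, 0],
}

ensemble = majority_vote(list(preds.values()))
scores = {name: accuracy(y_test, p) for name, p in preds.items()}
scores["vote"] = accuracy(y_test, ensemble)
print(scores)
```

Repeating this for every fold and averaging gives directly comparable numbers for each classifier and for any combination of them. Averaging predicted probabilities instead of labels would also allow a ROC curve for the ensemble.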
You can use a linear regression for stacking. With 10 folds, you can split the data into:
8 training sets
1 validation set
1 test set
Optimise the hyper-parameters for each algorithm using the training set and validation set, then stack your predictions by using a linear regression (or a logistic regression) over the validation set. Your final model will be p = a_0 + a_1 p_1 + … + a_K p_K, where K is the number of classifiers, p_k is the probability given by model k and a_k is the weight of model k. You can also directly use the predicted outcomes if a model doesn't give you probabilities.
If your models are the same, you can optimise the parameters of the models and the weights at the same time.
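As an illustration of the linear blend, the sketch below fits the weights a_0 … a_K by gradient descent on squared error over a validation set. Everything here is an assumption made for the example: the function names, the learning rate, the number of epochs and the validation probabilities are invented, and in practice a library linear or logistic regression would do the fitting.

```python
def blend(a, row):
    """p = a_0 + a_1 * p_1 + ... + a_K * p_K for one sample."""
    return a[0] + sum(w * p for w, p in zip(a[1:], row))


def mse(a, P, y):
    return sum((blend(a, row) - t) ** 2 for row, t in zip(P, y)) / len(y)


def fit_linear_blend(P, y, lr=0.1, epochs=2000):
    """Least-squares blend weights via batch gradient descent."""
    k = len(P[0])
    a = [0.0] * (k + 1)
    n = len(P)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, target in zip(P, y):
            err = blend(a, row) - target
            grad[0] += err                 # gradient for the intercept a_0
            for j, p in enumerate(row):
                grad[j + 1] += err * p     # gradient for weight a_(j+1)
        a = [w - lr * g / n for w, g in zip(a, grad)]
    return a


# Hypothetical validation-set probabilities from K = 2 models, with true labels.
P_val = [[0.9, 0.8], [0.2, 0.4], [0.7, 0.9], [0.1, 0.3], [0.6, 0.2], [0.4, 0.1]]
y_val = [1, 0, 1, 0, 1, 0]

weights = fit_linear_blend(P_val, y_val)
print(weights, mse(weights, P_val, y_val))
```

The fitted blend should then be scored on the separate test set, since the weights were tuned on the validation set.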
If you have obvious differences, you can use different bins with different parameters for each. For example, one bin could be short sequences and the other long sequences, or different types of proteins.
You can use whatever metric you want, as long as it makes sense, just as for non-blended algorithms.
You may want to look at the 2007 BellKor solution to the Netflix challenge, section Blending. In 2008 and 2009 they used more advanced techniques, which may also be interesting for you.
