Suppose we have 1,000 beads 900 red and 100 blue ones. When I run the problem through SKlearn classifier ensembles,
score = clf.score(X_test, y_test)
They come up with scores of around .9 however, when I look at the predictions I see that it has predicted all of them to be Red and this is how it comes up with %90 accuracy! Please tell me what I'm doing wrong? Better yet, what does it mean when this happens? Is there a better way to measure accuracy?
This might happen when you have an imbalanced dataset, and you chose accuracy as your metric. The reason is that by always deciding red, the model is actually doing OK in terms of accuracy, but as you noticed, the model is useless!
In order to overcome this issue, you have some alternatives such as:
1. Use another metric, like AUC (area under roc curve), etc.
2. Use different weights for classes, and put more weight on the minority class.
3. Use simple over-sampling or under-sampling methods, or other more sophisticated ones like SMOTE, ADASYN, etc.
You can also take a look at this article.
This problem you face is quite common in real world applications.
You have an imbalanced classification problem. You are write, by default score measures accuracy, but it is recommended to look at recall and precision for imbalanced data.
This video explains it better than I could
The video above demonstrates you what you should do in order to measure classification performance in your data. To deal with data imbalance, you check imblearn library:
https://imbalanced-learn.readthedocs.io/en/stable/api.html
Related
Is it correct to use 'accuracy' as a metric for an imbalanced data set after using oversampling methods such as SMOTE or we have to use other metrics such as AUROC or other presicion-recall related metrics?
You can use accuracy for the dataset after using SMOTE since now it shouldn't be imbalanced as far as I know. You should try the other metrics though for a more detailed evaluation (classification_report_imbalenced combines some metrics)
SMOTE and similar imbalance treatment techniques will be only be applied to you training data. When you have a largely imbalanced data set, say 99% against 1%, accuracy on the TEST set might still give you a value of 99% by always choosing the larger class.
Therefore, you should definitely switch to another metric.
Popular variants are the F1 score, but there is also a balanced version of the accuracy, see scikit-learn BA page.
As mentioned by #Nocry, applying several evaluation measures, might give you a better feeling. For example, check how accuracy (the regular variant) and balanced accuracy perform with and without using SMOTE, then you should see the difference.
I used the "Stroke" data set from kaggle to compare the accuracy of the following different models of classification:
K-Nearest-Neighbor (KNN).
Decision Trees.
Adaboost.
Logistic Regression.
I did not implement the models myself, but used sklearn library's implementations.
After training the models I ran the test data and printed the level of accuracy of each of the models and these are the results:
As you can see, KNN, Adaboost, and Logistic Regression gave me the exact same accuracy.
My question is, does it make sense that there is not even a small difference between them or did I make a mistake somewhere along the way (Even though I only used sklearn's implementations?
In general achieving the same scores is unlikely, and the explanation is usually:
bug in actual reporting
bug in the data processing
score reported corresponds to a degenerate solution
And the last explanation is probably the case. Stroke dataset has 249 positive samples in 5000 datapoints, so if your model always says "no stroke" it will get roughly 95%. So my best guess is that all your models failed to learn anything and are just constantly outputting "0".
In general accuracy is not a right metric for highly imabalnced datasets. Consider balanced accuracy, f1, etc.
I am working on a multi-class classification problem, it contains some class imbalance (100 classes, a handful of which only have 1 or 2 samples associated).
I have been able to get a LinearSVC (& CalibratedClassifierCV) model to achieve ~98% accuracy, which is great.
The problem is that for all of the misclassified predictions - the business will incur a monetary loss. That is, for each misclassification - we would incur a $1,000 loss. A solution to this would be to classify a datapoint as "Unknown" instead of a complete misclassification (these unknowns could then be human-classified which would cost roughly $10 per "Unknown" prediction). Clearly, this is cheaper than the $1,000/misclassification loss.
Any suggestions for would I go about incorporating this "Unknown" class?
I currently have:
svm = LinearSCV()
clf = CalibratedClassifierCV(svm, cv=3)
# fit model
clf.fit(X_train, y_train)
# get probabilities for each decision
decision_probabilities = clf.predict_proba(X_test)
# get the confidence for the highest class:
confidence = [np.amax(x) for x in decision_probabilities]
I was planning to use the predict_proba method from the CalibratedClassifierCV model, and for any max probabilities that were under a threshold (yet to be determined) I would instead classify that sample as "Unknown" instead of the class that the probability is actually associated with.
The problem is that when I've checked correct predictions, there are confidence values as low as 30%. Similarly, there are incorrect predictions with confidence values as high as 95%. If I were to just create a threshold of say, 50%, my accuracy would go down significantly, I would have quite of bit of "Unknown" classes (loss), and still a bit of misclassifications (even bigger loss).
Is there a way to incorporate another loss function on this back-end classification (predicted class vs 'unknown' class)?
Any help would be greatly appreciated!
A few suggestions right off the bat:
Accuracy is not the correct metric to evaluate imbalanced datasets. For example, if 90% of samples belong to 1 class 90% accuracy is achieved by a dumb model which always predicts the dumb class. Precision and recall are generally better metrics for such cases. Opting between the two is generally a business decision.
Given the input signals, it may be difficult to better than 98%, especially for some classes you will have two few samples. What you can do is group minority classes together and give them a single label e.g 'other'. In this way, the model will hopefully have enough samples to learn that these samples are different from all other classes and will classify them as 'other'
Often when you try to replace a manual business process by ML, you generally do not completely remove human intervention. The goal is to use the model on cases/classes/input space where your model does well and use the manual process for the rest. One way to do it is by using the 'other' label. Once your model has predicted 'other', a human may manually classify these samples. Another method is to find a threshold on predicted probability above which the model has a high accuracy and sufficient population coverage. For example, let say you have 100% (typically 90-100%) accuracy whenever the output prbability is above 0.70. If this covers enough of the input population, you only use the ML model on such cases. For everything else, the manual process is followed.
I have a data of dimension (13961,48 ) initially, and after one hot encoding and also basic massaging of data the dimension observed around (13961,862). the data is imbalance with two categories of 'Retained' around 6% and 'not Retained' around 94%.
While running any algorithms such as logistic,knn,decision tree,random forest, the data results in very high accuracy even without any feature selection process carried out and the accuracy crosses more than 94% mostly except 'Naive bias classifier'.
This seems like odd and even by having any two features randomly also--> that gives accuracy more than 94% , which seems non reality in general.
Applying SMOTE also, provide result of more than 94% of accuracy even for baseline model of any algorithms said above such as logistic,knn,decision tree,random forest,
After removing the top 20 features also , this gives accuracy of good result more than 94% ( checked for understanding the genuineness )
g = data[Target_col_Y_name]
df = pd.concat([g.value_counts(),
g.value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))
print('The % distribution between the retention and non-retention flag\n')
print (df)
# The code o/p to show the imbalance is
The % distribution between the retention and non-retention flag
counts percentage
Non Retained 13105 93.868634
Retained 856 6.131366
My data have 7 numerical variables such as month, amount, interest rate and all others ( around 855) as one-hot-encoding transformed categorical variables.
Any methodology , to handle this kind of data on baseline,feature selection or imbalance optimization techniques ? please guide by looking at the dimensionality and the imbalance count for each levels.
I would like to add something in addition to Elias answer.
Firstly, you have to understand that even if you's create "dumb classifier", which always predicts "not retained", you'd still be correct 94% of times. So accuracy is clearly weak metric in this case.
You should definitely learn about confusion matrix and metrics that come along with it (like AUC).
One of these metrics is F1 score, which is harmonic average of precision and recall. It is better that accuracy in imbalanced class setting, but... it doesn't have to be the best. F1 will favor these classifiers that have similar precision and recall. But this is not necessary something that is important for you.
For instance, if you'd build sfw content filter, you would be ok with labeling some SFW content as nsfw (negative class), which would increase false negative rate (and decrease recall), but you would like to be sure that you kept only safe ones (high precision).
In your case you can reason what is worse: retaining something bad or abandoning something good, and pick the metric in that way.
As far as strategy is concerned: there are plenty of ways to handle class imbalance: sampling techniques (try down-sampling, up-sampling besides SMOTE or ROSE) and check out whether your validation score (training metrics alone are almost useless) improved. Just remember to apply sampling/augmentations techniques after the train-validation split.
Moreover, some models have special hyperparametrs to focus more on rare class (for instance in xgboost there is scale_pos_weight parameter). From my experience, tunning this hyperparam is way more effective than SMOTE.
Good luck
Accuracy is not a very good measure in general, particularly for imbalanced classes. I would recommend this other stackoverflow answer, that explains when to use F1 score and when to use AUROC, which are both far better measures than accuracy; in this case F1 is better.
Few points just to clear up:
For models such as random forest, you should not have to remove features to improve the accuracy, as it will just regard them as insignificant features. I recommend random forests as it tends to be very accurate (except in some cases) and can show significant features just by using clf.feature_significances_ (if using the scipy random forest).
Decision trees will almost always perform worse than random forests, as random forests are many aggregated decision trees.
I'm trying to run a classifier in a set of about 1000 objects, each with 6 floating point variables. I've used scikit-learn's cross validation features to generate an array of the predicted values for several different models. I've then used sklearn.metrics to compute the accuracy of my classifiers, and the confusion table. Most classifiers have around 20-30% accuracy. Below is the confusion table for the SVC classifier (25.4% accuracy).
Since I'm new to machine learning, I'm not sure how to interpret that result, and whether there are other good metrics to evaluate the problem. Intuitively speaking, even with 25% accuracy, and given that the classifier got 25% of the predictions right, I believe it is at least somewhat effective, right? How can I express that with statistical arguments?
If this table is a confusion table, I think that your classifier predicts in majority of the time the class E. I think that your class E is overrepresented in your dataset, accuracy is not a good metric if your classes have not the same number of instances,
Example, If you have 3 classes, A,B,C and in the test dataset the class A is over represented (90%) if your classifier predicts all time class A, you will have 90% of accuracy,
A good metric is to use log loss, logistic regression is a good algorithm that optimize this metric
see https://stats.stackexchange.com/questions/113301/multi-class-logarithmic-loss-function-per-class
An other solution, is to do oversampling of your small classes
First of all, I find it very difficult to look at confusion tables. Plotting it as an image would give a lot better intuitive understanding about what is going on.
It is advisory to have single number metric to optimize since it is easier and faster. When you find that your system doesn't perform as you expect it to, revise your selection of metric.
Accuracy is usually a good metric to use if you have same amount of examples in every class. Otherwise (which seems to be the case here) I'd advise to use F1 score which takes into account both precision and recall of your estimator.
EDIT: However it is up to you to decide if the ~25% accuracy, or whatever metric is "good enough". If you are classifying if robot should shoot a person you should probably revise your algorithm but if you are deciding if it is a pseudo-random or random data, 25% percent accuracy could be more than enough to prove the point.