imbalabced data set score after smote - python

Is it correct to use 'accuracy' as a metric for an imbalanced data set after using oversampling methods such as SMOTE or we have to use other metrics such as AUROC or other presicion-recall related metrics?

You can use accuracy for the dataset after using SMOTE since now it shouldn't be imbalanced as far as I know. You should try the other metrics though for a more detailed evaluation (classification_report_imbalenced combines some metrics)

SMOTE and similar imbalance treatment techniques will be only be applied to you training data. When you have a largely imbalanced data set, say 99% against 1%, accuracy on the TEST set might still give you a value of 99% by always choosing the larger class.
Therefore, you should definitely switch to another metric.
Popular variants are the F1 score, but there is also a balanced version of the accuracy, see scikit-learn BA page.
As mentioned by #Nocry, applying several evaluation measures, might give you a better feeling. For example, check how accuracy (the regular variant) and balanced accuracy perform with and without using SMOTE, then you should see the difference.

Related

GridSearchCV does not improve my test accuracy

I am making multiple classifier models and the test accuracy for all of them is 0.508.
I find it weird that multiple models have the same accuracy. The models I used are Logistic Regressor,DesicionTreeClassifier, MLPClassifier, RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, XGBClassifier, SVC, and VotingClassifier.
After using GridSearchCV to improve the models, all of their test accuracy scores improved. But the test accuracy scores did not change.
I wish I could say I changed something, but I don't know why the test scores did not change. After using gridsearch, I expected the test scores to improve but it didn't
I would like to confirm, you mean your training scores improve but you testing scores did not change? If yes, there are a lot of possibility behind this.
You might want to reconfigure and add your hyper parameter range for example if using KNN you can increase the number of k or by adding more distance metric calculation
If you want to you can change the hyper parameter optimization technique like randomized search or bayesian search
I don't have any information about your data but sometimes turn on or turn off the shuffle mode when splitting can affect the scores for instance if you have time series data you have not to shuffle the dataset
There can be several reasons why the test accuracy didn't change after using GridSearchCV:
The best parameters found by GridSearchCV might not be optimal for the test data.
The test data may have a different distribution than the training data, leading to low test accuracy.
The models might be overfitting to the training data and not generalizing well to the test data.
The test data size might be small, leading to high variance in test accuracy scores.
The problem itself might be challenging, and a test accuracy of 0.508 might be the best that can be achieved with the current models and data.
It would be useful to have more information about the data, the problem, and the experimental setup to diagnose the issue further.
Looking at your accuracy, first of all I would say: are you performing a binary classification task? Because if it is the case, your models are almost not better than random on the test set, which may suggest that something is wrong with your training.
Otherwise, GridSearchCV, like RandomSearchCV and other hyperparameters optimization techniques try to find optimal parameters among a range that you define. If, after optimization, your optimal parameter has the value of one bound of your range, it may suggest that you need to explore beyond this bound, that is to say set another range on purpose and run the optimization again.
By the way, I don't know the size of your dataset but if it is big I would recommend you to use RandomSearchCV instead of GridSearchCV. As it is not exhaustive, it takes less time and gives results that are (nearly) optimized.

Curse of the SKlearn Classifiers

Suppose we have 1,000 beads 900 red and 100 blue ones. When I run the problem through SKlearn classifier ensembles,
score = clf.score(X_test, y_test)
They come up with scores of around .9 however, when I look at the predictions I see that it has predicted all of them to be Red and this is how it comes up with %90 accuracy! Please tell me what I'm doing wrong? Better yet, what does it mean when this happens? Is there a better way to measure accuracy?
This might happen when you have an imbalanced dataset, and you chose accuracy as your metric. The reason is that by always deciding red, the model is actually doing OK in terms of accuracy, but as you noticed, the model is useless!
In order to overcome this issue, you have some alternatives such as:
1. Use another metric, like AUC (area under roc curve), etc.
2. Use different weights for classes, and put more weight on the minority class.
3. Use simple over-sampling or under-sampling methods, or other more sophisticated ones like SMOTE, ADASYN, etc.
You can also take a look at this article.
This problem you face is quite common in real world applications.
You have an imbalanced classification problem. You are write, by default score measures accuracy, but it is recommended to look at recall and precision for imbalanced data.
This video explains it better than I could
The video above demonstrates you what you should do in order to measure classification performance in your data. To deal with data imbalance, you check imblearn library:
https://imbalanced-learn.readthedocs.io/en/stable/api.html

xgboost: Sample Weights for Imbalanced Data?

I have a highly unbalanced dataset of 3 classes. To address this, I applied the sample_weight array in the XGBClassifier, but I'm not noticing any changes in the modelling results? All of the metrics in the classification report (confusion matrix) are the same. Is there an issue with the implementation?
The class ratios:
military: 1171
government: 34852
other: 20869
Example:
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=process_text)), # convert strings to integer counts
('tfidf', TfidfTransformer()), # convert integer counts to weighted TF-IDF scores
('classifier', XGBClassifier(sample_weight=compute_sample_weight(class_weight='balanced', y=y_train))) # train on TF-IDF vectors w/ Naive Bayes classifier
])
Sample of Dataset:
data = pd.DataFrame({'entity_name': ['UNICEF', 'US Military', 'Ryan Miller'],
'class': ['government', 'military', 'other']})
Classification Report
First, most important: use a multiclass eval_metric. eval_metric=merror or mlogloss, then post us the results. You showed us ['precision','recall','f1-score','support'], but that's suboptimal, or outright broken unless you computed them in a multi-class-aware, imbalanced-aware way.
Second, you need weights. Your class ratio is military: government: other 1:30:18, or as percentages 2:61:37%.
You can manually set per-class weights with xgb.DMatrix..., weights)
Look inside your pipeline (use print or verbose settings, dump values), don't just blindly rely on boilerplate like sklearn.utils.class_weight.compute_sample_weight('balanced', ...) to give you optimal weights.
Experiment with manually setting per-class weights, starting with 1 : 1/30 : 1/18 and try more extreme values. Reciprocals so the rarer class gets higher weight.
Also try setting min_child_weight much higher, so it requires a few exemplars (of the minority classes). Start with min_child_weight >= 2(* weight of rarest class) and try going higher. Beware of overfitting to the very rare minority class (this is why people use StratifiedKFold crossvalidation, for some protection, but your code isn't using CV).
We can't see your other parameters for xgboost classifier (how many estimators? early stopping on or off? what was learning_rate/eta? etc etc.). Seems like you used the defaults - they'll be terrible. Or else you're not showing your code. Distrust xgboost's defaults, esp. for multiclass, don't expect xgboost to give good out-of-the-box results. Read the doc and experiment with values.
Do all that experimentation, post your results, check before concluding "it doesn't work". Don't expect optimal results from out-of-the-box. Distrust or double-check the sklearn util functions, try manual alternatives. (Often, just because sklearn has a function to do something, doesn't mean it's good or best or suitable for all use-cases, like imbalanced multiclass)

How to check machine learning accuracy without cross validation

I have training sample X_train, and Y_train to train and X_estimated.
I got task to make my classificator learn as accurate as it can, and then predict vector of results over X_estimated to get close results to Y_estimated (which i have now, and I have to be as much precise as it can). If I split my training data to like 75/25 to train and test it, I can get accuracy using sklearn.metrics.accuracy_score and confusion matrix. But I am losing that 25% of samples, that would make my predictions more accurate.
Is there any way, I could learn by using 100% of the data, and still be able to see accuracy score (or percentage), so I can predict it many times, and save best (%) result?
I am using random forest with 500 estimators, and usually get like 90% accuracy. I want to save best prediction vector as possible for my task, without splitting any data (not wasting anything), but still be able to calculate accuracy (so I can save best prediction vector) from multiple attempts (random forest always shows different results)
Thank you
Splitting your data is critical for evaluation.
There is no way that you could train your model on 100% of the data and be able to get a correct evaluation accuracy unless you expand your dataset. I mean, you could change your train/test split, or try to optimize your model in other ways, but i guess the simple answer to your question would be no.
As per your requirement, you can try K Fold Cross Validation. If you split it in 90|10 i.e for Train|Test. Achieving to take 100% data for training is not possible as you have to test the data then only you can validate the same that how good your model is. K Fold CV takes your whole train data into consideration in each fold and randomly takes test data sample from the train data. And lastly calculates the accuracy by taking summation of all the folds. Then finally you can test the accuracy by using 10% of the data.
More you can read here and here
K Fold Cross Validation
Skearn provides simple methods for performing K fold cross validation. Simply you have to pass no of folds in the method. But then remember, more the folds, it takes more time to train the model. More you can check here
It is not necessary to do 75|25 split of your data all the time. 75
|25 is kind of old school now. It greatly depends on the amount of data that you have. For example, if you have 1 billion sentences for training a language model, it is not necessary to reserve 25% for testing.
Also, I second the previous answer of trying K-fold cross-validation. As a side note, you could consider looking at the other metrics like precision and recall as well.
In general splitting your data set is critical for evaluation. So I would recommend you always do that.
Said that, there are methods that in some sense allow you to train on all your data and still get an estimate of your performance or to estimate the generalization accuracy.
One particularly prominent method is leveraging out-of-bag samples of models based on bootstrapping, i.e. RandomForests.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True)
rf.fit(X, y)
print(rf.oob_score_)
if you are doing classification always go with stratified k-fold cv(https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/).
if you're doing regression then go with simple k-fold cv or you can divide the target as bins and do stratified k-fold cv. by this way you can use your data completely in model training.

What are good metrics to evaluate the performance of a multi-class classifier?

I'm trying to run a classifier in a set of about 1000 objects, each with 6 floating point variables. I've used scikit-learn's cross validation features to generate an array of the predicted values for several different models. I've then used sklearn.metrics to compute the accuracy of my classifiers, and the confusion table. Most classifiers have around 20-30% accuracy. Below is the confusion table for the SVC classifier (25.4% accuracy).
Since I'm new to machine learning, I'm not sure how to interpret that result, and whether there are other good metrics to evaluate the problem. Intuitively speaking, even with 25% accuracy, and given that the classifier got 25% of the predictions right, I believe it is at least somewhat effective, right? How can I express that with statistical arguments?
If this table is a confusion table, I think that your classifier predicts in majority of the time the class E. I think that your class E is overrepresented in your dataset, accuracy is not a good metric if your classes have not the same number of instances,
Example, If you have 3 classes, A,B,C and in the test dataset the class A is over represented (90%) if your classifier predicts all time class A, you will have 90% of accuracy,
A good metric is to use log loss, logistic regression is a good algorithm that optimize this metric
see https://stats.stackexchange.com/questions/113301/multi-class-logarithmic-loss-function-per-class
An other solution, is to do oversampling of your small classes
First of all, I find it very difficult to look at confusion tables. Plotting it as an image would give a lot better intuitive understanding about what is going on.
It is advisory to have single number metric to optimize since it is easier and faster. When you find that your system doesn't perform as you expect it to, revise your selection of metric.
Accuracy is usually a good metric to use if you have same amount of examples in every class. Otherwise (which seems to be the case here) I'd advise to use F1 score which takes into account both precision and recall of your estimator.
EDIT: However it is up to you to decide if the ~25% accuracy, or whatever metric is "good enough". If you are classifying if robot should shoot a person you should probably revise your algorithm but if you are deciding if it is a pseudo-random or random data, 25% percent accuracy could be more than enough to prove the point.

Categories