I have data of dimension (13961, 48) initially; after one-hot encoding and some basic massaging, the dimension is around (13961, 862). The data is imbalanced, with two categories: 'Retained' at around 6% and 'Not Retained' at around 94%.
When running any of the algorithms such as logistic regression, kNN, decision tree, or random forest, the accuracy exceeds 94% even without any feature selection, the only exception being the Naive Bayes classifier.
This seems odd: even picking any two features at random gives accuracy above 94%, which does not seem realistic.
Applying SMOTE also gives accuracy above 94%, even for the baseline model of any of the algorithms listed above (logistic regression, kNN, decision tree, random forest).
Even after removing the top 20 features, the accuracy stays above 94% (checked to see whether the result was genuine).
import pandas as pd

g = data[Target_col_Y_name]  # data, Target_col_Y_name: your DataFrame and target column name
df = pd.concat([g.value_counts(),
                g.value_counts(normalize=True).mul(100)],
               axis=1, keys=('counts', 'percentage'))
print('The % distribution between the retention and non-retention flag\n')
print(df)
# The code output showing the imbalance is
The % distribution between the retention and non-retention flag
counts percentage
Non Retained 13105 93.868634
Retained 856 6.131366
My data has 7 numerical variables, such as month, amount, and interest rate; all the others (around 855) are one-hot-encoded categorical variables.
Is there any methodology for handling this kind of data in terms of baselines, feature selection, or imbalance optimization techniques? Please advise, keeping in mind the dimensionality and the imbalance counts for each level.
I would like to add something to Elias' answer.
Firstly, you have to understand that even if you created a "dumb classifier" that always predicts "not retained", you would still be correct 94% of the time. So accuracy is clearly a weak metric in this case.
You should definitely learn about the confusion matrix and the metrics that come along with it (like AUC).
One of these metrics is the F1 score, which is the harmonic mean of precision and recall. It is better than accuracy in an imbalanced-class setting, but... it does not have to be the best choice. F1 favors classifiers that have similar precision and recall, and that is not necessarily what matters to you.
For instance, if you were building an SFW content filter, you would be OK with labeling some SFW content as NSFW (the negative class), which would increase the false negative rate (and decrease recall), but you would want to be sure that what you kept is actually safe (high precision).
In your case you can reason about which is worse, retaining something bad or abandoning something good, and pick the metric accordingly.
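As a minimal sketch of these metrics in scikit-learn (y_true, y_pred, and y_score are assumed names for your validation labels, hard predictions, and predicted positive-class probabilities; labels are assumed to be 0 = Not Retained, 1 = Retained):

from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

# y_true / y_pred / y_score are placeholders for your own validation arrays
print(confusion_matrix(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall:   ', recall_score(y_true, y_pred))
print('F1:       ', f1_score(y_true, y_pred))
print('ROC AUC:  ', roc_auc_score(y_true, y_score))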
As far as strategy is concerned, there are plenty of ways to handle class imbalance: sampling techniques (try down-sampling and up-sampling besides SMOTE or ROSE), after which you check whether your validation score improves (training metrics alone are almost useless). Just remember to apply sampling/augmentation techniques after the train-validation split.
Moreover, some models have dedicated hyperparameters to focus more on the rare class (for instance, xgboost has the scale_pos_weight parameter). In my experience, tuning this hyperparameter is far more effective than SMOTE.
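For example, with the roughly 94:6 split above, a common starting point is to set scale_pos_weight near the negative/positive count ratio and tune it on validation data. A sketch using the xgboost scikit-learn wrapper (X_train, y_train are assumed names for your training split, with labels 0 = Not Retained, 1 = Retained):

from xgboost import XGBClassifier

# ratio of negative to positive examples (about 13105/856 ≈ 15 here) as a starting value
ratio = (y_train == 0).sum() / (y_train == 1).sum()

clf = XGBClassifier(scale_pos_weight=ratio)
clf.fit(X_train, y_train)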
Good luck
Accuracy is not a very good measure in general, particularly for imbalanced classes. I would recommend this other Stack Overflow answer, which explains when to use the F1 score and when to use AUROC; both are far better measures than accuracy, and in this case F1 is the better of the two.
A few points just to clear up:
For models such as random forest, you should not have to remove features to improve the accuracy, as the model will simply treat them as insignificant. I recommend random forests as they tend to be very accurate (except in some cases) and can surface the significant features via clf.feature_importances_ (if you are using the scikit-learn random forest); a short sketch follows the next point.
Decision trees will almost always perform worse than random forests, as a random forest is an aggregate of many decision trees.
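A rough sketch of pulling the importances out of a fitted scikit-learn forest (X and y are assumed names for your feature DataFrame and target):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X, y)  # X, y: your feature DataFrame and target series

# rank the features by impurity-based importance
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))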
Related
There has been a lot of discussion about this topic.
But I do not have enough reputation (i.e. 50) to comment on those posts, hence I am creating this one.
As far as I understand, accuracy is not an appropriate metric when the data is imbalanced.
My question is: is it still inappropriate if we have applied resampling methods, class weights, or an initial bias?
Reference here.
Thanks!
Indeed, it is always a good idea to test resampling techniques such as oversampling the minority class and undersampling the majority class. My advice is to start with this excellent walkthrough of resampling techniques using the imblearn package in Python. Eventually, what seems to work best in most cases is to intelligently combine under- and over-sampling.
For example, undersample the majority class by 70% and then oversample the minority class to match the new distribution of the majority class.
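A sketch of that combination with imblearn (the sampling_strategy values are only illustrative, chosen to roughly match the 70% reduction mentioned above; X_train, y_train, X_valid, y_valid are assumed names for your split):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

pipe = Pipeline(steps=[
    ('under', RandomUnderSampler(sampling_strategy=0.2, random_state=0)),  # shrink the majority class
    ('over', SMOTE(sampling_strategy=1.0, random_state=0)),                # grow the minority to match it
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)      # the samplers are applied only during fit
y_pred = pipe.predict(X_valid)  # the validation set is never resampled
print(f1_score(y_valid, y_pred))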
To answer your question regarding accuracy: no, it should not be used. The main reason is that while you may resample the training data, you should never apply resampling to the test set. Your test set, just as in real life and in production, is not known in advance, so the imbalance will always be there.
As far as evaluation metrics go, the question you need to ask is 'how much more important is the minority class than the majority class?', i.e. how many false positives are you willing to tolerate? A good way to check class separability is a ROC curve: the better it looks (a curve rising well above the diagonal), the better the model is at separating positive from negative classes, even when the data is imbalanced.
To get a single score that allows you to compare models, and given that false negatives are usually more important than false positives in imbalanced classification, use the F2 measure, which gives more weight to recall (i.e. more importance to the true positives detected by your model). However, the way we have always done it in my company is by examining the classification report in detail, to know exactly how much recall we get for each class (so yes, we mainly aim for high recall and occasionally look at the precision, which reflects the number of false positives).
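A short sketch of both, assuming held-out arrays y_valid and y_pred with the minority class encoded as 1:

from sklearn.metrics import classification_report, fbeta_score

# F2 weights recall more heavily than precision (beta=2)
print('F2:', fbeta_score(y_valid, y_pred, beta=2))

# per-class precision/recall/F1, so you can see exactly how the minority class fares
print(classification_report(y_valid, y_pred))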
Conclusion:
Always check multiple scores, such as the classification report, with a focus on recall
If you want a single score, use F2.
Use the ROC curve to evaluate the model visually with regard to class separability
Never apply resampling to your test set!
Finally, it would be wise to apply a cost-sensitive learning technique to your model, such as class weighting during training!
I hope this helps!
I would prefer to use the g-mean or the Brier score. Prof. Harrell wrote a nice discussion on this topic; see https://www.fharrell.com/post/class-damage/. Here is another reference that discusses the limitations of using improper metrics: https://www.sciencedirect.com/science/article/pii/S2666827022000585
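A quick sketch of both scores, assuming a fitted probabilistic classifier clf and a held-out set X_valid, y_valid (geometric_mean_score comes from imblearn, brier_score_loss from scikit-learn):

from imblearn.metrics import geometric_mean_score
from sklearn.metrics import brier_score_loss

# Brier score: mean squared error of the predicted probabilities (lower is better)
print('Brier:', brier_score_loss(y_valid, clf.predict_proba(X_valid)[:, 1]))

# g-mean: geometric mean of the per-class recalls, robust to class imbalance
print('G-mean:', geometric_mean_score(y_valid, clf.predict(X_valid)))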
After finalizing the architecture of my model, I decided to train it on the entire dataset by setting validation_split = 0 in fit(). I thought this would improve the results, based on these sources:
What is validation data used for in a Keras Sequential model?
Your model doesn't "see" your validation set and isn't in any way trained on it
https://machinelearningmastery.com/train-final-machine-learning-model/
What about the cross-validation models or the train-test datasets?
They’ve been discarded. They are no longer needed.
They have served their purpose to help you choose a procedure to finalize.
However, I got worse results without the validation set (compared to validation_split = 0.2), leaving all other parameters the same.
Is there an explanation for this? Or was it just by chance that my model happened to perform better on the fixed test data when part of the training data was excluded (and used for validation)?
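For reference, a toy sketch of the two setups being compared (the data shapes, model, and epochs are placeholders, not the actual architecture from the question):

import numpy as np
from tensorflow import keras

# stand-in data with hypothetical shapes
X_train = np.random.rand(1000, 20).astype('float32')
y_train = np.random.randint(0, 2, size=1000)

def build_model():
    model = keras.Sequential([
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# setup A: hold out 20% of the training data for validation during training
build_model().fit(X_train, y_train, epochs=10, validation_split=0.2, verbose=0)

# setup B: train on all of the data, no validation split
build_model().fit(X_train, y_train, epochs=10, validation_split=0.0, verbose=0)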
Well, that's really a very good question, and it touches on a lot of machine learning concepts, especially the bias-variance tradeoff.
As @CrazyBarzillian hinted in the comments, more data might be leading to over-fitting, and yes, we would need more info about your data to reach a definitive answer. But in broader terms I would like to explain a few points that might help you understand why this happened.
EXPLANATION
Whenever your data has a large number of features, your model learns a very complex function to fit it; in short, the model is too complicated for the amount of data you have. This situation, known as high variance, leads to over-fitting. We know we are facing a high-variance issue when the training error is much lower than the test error. High-variance problems can be addressed by reducing the number of features (by applying PCA, outlier removal, etc.) or by increasing the number of data points, i.e. adding more data.
Sometimes your data has too few features, and the model therefore learns a very simple function to fit it. This is known as high bias. In this case adding more data won't help; less data will do, or adding more features will help.
MY ASSUMPTION
I guess your model is suffering from high bias if it performs worse when more data is added. But to check whether the statement "adding more data leads to poor results" really holds in your case, you can do the following:
play with some hyperparameters
try other machine learning models
instead of accuracy scores, look at R² scores or mean absolute error for regression, or F1, precision, and recall for classification
If after doing these things you still find that more data leads to poorer results, then you can be fairly sure of high bias and can either increase the number of features or reduce the data.
SOLUTION
By reducing the data, I mean using less data but better data. Better data means, for example, that if you are doing a classification problem with three classes (A, B and C), the data points are balanced across the three classes. Your data should be balanced. If it is unbalanced, i.e. class A has a high number of samples while classes B and C have only 3-4 samples each, you can apply random sampling techniques to overcome this (a short sketch follows the list below).
How to make BETTER DATA
Balance the data
Remove outliers
Scale (Normalize) the data
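A minimal sketch of the first and third points, assuming training arrays X_train and y_train (outlier handling is data-specific and left out):

from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler

# balance the classes by randomly duplicating minority-class samples
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# scale the features so they sit on comparable ranges
X_bal = StandardScaler().fit_transform(X_bal)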
CONCLUSION
It is a myth that more data always leads to a better model. Actually, more than the quantity, the quality of the data matters; data should have both quantity and quality. The balancing act of fitting model complexity to the data you have is known as the bias-variance tradeoff.
I am using sklearn's random forest module to predict a binary target variable based on 166 features.
When I increase the number of features to 175, the accuracy of the model decreases (accuracy drops from 0.86 to 0.81 and recall from 0.37 to 0.32).
I would expect more data to only make the model more accurate, especially when the added features have business value.
I built the model using sklearn in python.
Why did the new features not get weight 0, leaving the accuracy as it was?
Basically, you may be "confusing" your model with useless features. MORE FEATURES OR MORE DATA WILL NOT ALWAYS MAKE YOUR MODEL BETTER. The new features will also not get weight zero, because the model will try hard to use them! With so many features (175!), the random forest is simply not able to get back to the previous "pristine" model with better accuracy and recall (maybe those 9 extra features really are not adding anything useful).
Think about how a decision tree essentially works: these new features will cause some new splits that can worsen the results. Try to build the model up from the basics, slowly adding new information and always checking the performance. In addition, pay attention to, for example, the number of features considered per split (mtry). With so many features, you would need a very high mtry (to allow a big sample of features to be considered at every split). Have you considered adding just 1 or 2 more features and checking how the accuracy responds? Also, don't forget mtry!
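In scikit-learn the mtry analogue is max_features; a quick way to probe it is to compare a few values on a held-out split (X_train, y_train, X_valid, y_valid are assumed names, and the candidate values are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

for max_features in ['sqrt', 0.3, 0.6]:  # fraction of the 175 features tried at each split
    clf = RandomForestClassifier(n_estimators=500, max_features=max_features,
                                 random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_valid)
    print(max_features, accuracy_score(y_valid, y_pred), recall_score(y_valid, y_pred))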
More data does not always make the model more accurate. Random forest is a traditional machine learning method where the practitioner has to do the feature selection. If the model is given a lot of data that is bad, it will try to make sense of that bad data too and will end up messing things up. More data tends to help neural networks, as those networks select the best possible features from the data on their own.
Also, 175 features is a lot, and you should definitely look into dimensionality reduction techniques and select the features that are most strongly related to the target. There are several methods in sklearn to do that: you can try PCA if your data is numerical, or RFE to remove weak features, etc.
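A sketch of RFE with scikit-learn (the choice of 50 remaining features and the names X_train, y_train, X_valid are illustrative):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# recursively drop the weakest features until 50 remain
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=50)
X_train_sel = selector.fit_transform(X_train, y_train)
X_valid_sel = selector.transform(X_valid)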
I'm trying to run a classifier on a set of about 1000 objects, each with 6 floating-point variables. I've used scikit-learn's cross-validation features to generate an array of predicted values for several different models. I've then used sklearn.metrics to compute the accuracy of my classifiers and the confusion table. Most classifiers have around 20-30% accuracy. Below is the confusion table for the SVC classifier (25.4% accuracy).
Since I'm new to machine learning, I'm not sure how to interpret that result, or whether there are other good metrics for evaluating the problem. Intuitively speaking, even with only 25% accuracy, given that the classifier got 25% of the predictions right, I believe it is at least somewhat effective, right? How can I express that with statistical arguments?
If this table is a confusion table, I think that your classifier predicts class E the majority of the time. Class E is probably overrepresented in your dataset, and accuracy is not a good metric when your classes do not have the same number of instances.
For example, if you have 3 classes, A, B and C, and in the test dataset class A is overrepresented (90%), then a classifier that always predicts class A will achieve 90% accuracy.
A good metric to use is log loss; logistic regression is a good algorithm that optimizes this metric.
see https://stats.stackexchange.com/questions/113301/multi-class-logarithmic-loss-function-per-class
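A small sketch of computing log loss with scikit-learn (X_train, y_train, X_valid, y_valid are assumed names for a train/validation split):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# log loss is computed on predicted probabilities, not on hard labels
print('log loss:', log_loss(y_valid, clf.predict_proba(X_valid)))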
Another solution is to oversample your small classes.
First of all, I find it very difficult to read confusion tables. Plotting the table as an image gives a much better intuitive understanding of what is going on.
It is advisable to have a single-number metric to optimize, since that is easier and faster. When you find that your system doesn't perform as you expect, revise your choice of metric.
Accuracy is usually a good metric to use if you have the same number of examples in every class. Otherwise (which seems to be the case here) I'd advise using the F1 score, which takes into account both the precision and the recall of your estimator.
EDIT: However, it is up to you to decide whether ~25% accuracy, or whatever metric you use, is "good enough". If you are classifying whether a robot should shoot a person, you should probably revise your algorithm, but if you are deciding whether data is pseudo-random or random, 25% accuracy could be more than enough to prove the point.
I'm currently working on a project estimating a signal using classification algorithms, such as logistic regression and random forest, in scikit-learn.
I'm now using the confusion matrix to evaluate the prediction performance of the different algorithms, and I found a problem common to both. In all cases, although the accuracy seems relatively good (around 90% - 93%), the total number of FN is quite high compared to TP (FNR < 3%). Does anyone have a clue about why I'm having this kind of issue in my prediction problem? If possible, can you give me some hints on how to solve it?
Thanks in advance for your replies and help.
Updates:
The dataset is extremely imbalanced (8:1), with around 180K observations in total. I have already tested several re-sampling methods, such as OSS and SMOTE (+Tomek or +ENN), but none of them returns good results. In both cases, although the recall goes up from 2.5% to 20%, the precision decreases significantly (from 60% to 20%).
You probably have an imbalanced dataset, where one of your classes has many more examples than the other.
One solution is to assign a higher cost to misclassifying the class with fewer examples.
This question in Cross Validated covers many approaches to your problem:
https://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
EDIT:
Given that you are using scikit-learn you can, as a first approach, set the class_weight parameter to 'balanced' in your logistic regression.
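A minimal sketch (X_train and y_train are assumed names for your training split):

from sklearn.linear_model import LogisticRegression

# errors on the rare class are weighted inversely proportional to its frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)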