Why does implementing class weights make the model worse

I am trying to do binary classification, and one class (0) has approximately one third as many samples as the other class (1). When I run the raw data through a normal feed-forward neural network, the accuracy is about 0.78. However, when I implement class_weights, the accuracy drops to about 0.49. The ROC curve also seems to do better without the class_weights. Why does this happen, and how can I fix it?
I have already tried changing the model and implementing regularization, dropout, etc., but nothing seems to change the overall accuracy.
This is how I get my weights:
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = dict(enumerate(class_weights))
Here are the results without the weights:
Here are the results with the weights:
I would expect the results to be better with the class_weights, but the opposite seems to be true. Even the ROC curve does not seem to do any better with the weights.

Due to the class imbalance, a very weak baseline of always selecting the majority class will get an accuracy of approximately 75%.
The validation curve of the network that was trained without class weights appears to show that it is picking a solution close to always selecting the majority class. This can be seen from the network not improving much over the validation accuracy it gets in the 1st epoch.
I would recommend looking into the confusion matrix, precision and recall metrics to get more information about which model is better.
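To make the baseline argument concrete, here is a minimal sketch using scikit-learn's metrics. The label array is hypothetical (25 minority and 75 majority samples, matching the ratio described in the question), and the "model" is just a constant prediction.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical validation labels: 25% class 0 (minority), 75% class 1 (majority).
y_val = np.array([0] * 25 + [1] * 75)

# A "model" that ignores its input and always predicts the majority class.
y_majority = np.ones_like(y_val)

baseline_acc = accuracy_score(y_val, y_majority)
cm = confusion_matrix(y_val, y_majority, labels=[0, 1])
minority_recall = recall_score(y_val, y_majority, pos_label=0)

print(baseline_acc)     # 0.75
print(cm)               # rows = true class, columns = predicted class
print(minority_recall)  # 0.0: no minority sample is ever found
```

An accuracy close to 0.75 combined with a near-zero minority recall would suggest that a network is behaving much like this constant baseline.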

This answer seems too late, but I hope it is helpful anyway. I just want to add four points:
Since the proportion of your data is minority: 25% and majority: 75%, accuracy is computed as:
accuracy = (true positive + true negative) / (true positive + true negative + false positive + false negative)
Thus, if you look at accuracy as a metric, almost any model could achieve around 75% accuracy by simply predicting the majority class all the time. That is most likely what happened on the validation set without class weights: the model was not actually learning to separate the classes.
With class weights, the learning curve was not smooth, but the model actually started to learn, even though it failed from time to time on the validation set.
As already stated, switching to a metric such as the F1 score may help. I saw that you are using TensorFlow; TensorFlow has an F1 score metric in its Addons package, which you can find in their documentation here. Personally, I looked at the classification report in scikit-learn. Let's say you want to see the model's performance on the validation set (X_val, y_val):
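from sklearn.metrics import classification_report
# model.predict returns probabilities, so threshold them to get class labels
# (assumes a single sigmoid output unit and a 0.5 cut-off)
y_predict = (model.predict(X_val, batch_size=64, verbose=1) > 0.5).astype(int)
print(classification_report(y_val, y_predict))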
Other techniques you might want to try, such as combining upsampling and downsampling, or SMOTE, can also help.
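A minimal sketch of combined upsampling and downsampling, using `sklearn.utils.resample`. The data, the class sizes, and the target of 200 samples per class are all made up for illustration; in practice this would be applied to the training split only, so that the validation set keeps the original class distribution.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical training data: 100 minority (0) and 300 majority (1) samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = np.array([0] * 100 + [1] * 300)

X_min, X_maj = X[y == 0], X[y == 1]
target = 200  # assumed per-class size after resampling

# Upsample the minority class with replacement, downsample the majority without.
X_min_up = resample(X_min, replace=True, n_samples=target, random_state=0)
X_maj_down = resample(X_maj, replace=False, n_samples=target, random_state=0)

X_bal = np.vstack([X_min_up, X_maj_down])
y_bal = np.array([0] * target + [1] * target)

print(X_bal.shape, np.bincount(y_bal))
```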
Best of luck!

Related

Augmenting classification model to predict "Unknown" instead of a wrong classification

I am working on a multi-class classification problem, it contains some class imbalance (100 classes, a handful of which only have 1 or 2 samples associated).
I have been able to get a LinearSVC (& CalibratedClassifierCV) model to achieve ~98% accuracy, which is great.
The problem is that for all of the misclassified predictions - the business will incur a monetary loss. That is, for each misclassification - we would incur a $1,000 loss. A solution to this would be to classify a datapoint as "Unknown" instead of a complete misclassification (these unknowns could then be human-classified which would cost roughly $10 per "Unknown" prediction). Clearly, this is cheaper than the $1,000/misclassification loss.
Any suggestions for how I would go about incorporating this "Unknown" class?
I currently have:
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

svm = LinearSVC()
clf = CalibratedClassifierCV(svm, cv=3)
# fit model
clf.fit(X_train, y_train)
# get probabilities for each decision
decision_probabilities = clf.predict_proba(X_test)
# get the confidence for the highest class:
confidence = decision_probabilities.max(axis=1)
I was planning to use the predict_proba method from the CalibratedClassifierCV model, and for any max probabilities that were under a threshold (yet to be determined) I would instead classify that sample as "Unknown" instead of the class that the probability is actually associated with.
The problem is that when I've checked correct predictions, there are confidence values as low as 30%. Similarly, there are incorrect predictions with confidence values as high as 95%. If I were to just create a threshold of, say, 50%, my accuracy would go down significantly, I would have quite a bit of "Unknown" classes (loss), and still a bit of misclassifications (even bigger loss).
Is there a way to incorporate another loss function on this back-end classification (predicted class vs 'unknown' class)?
Any help would be greatly appreciated!

A few suggestions right off the bat:
Accuracy is not the correct metric to evaluate imbalanced datasets. For example, if 90% of samples belong to one class, 90% accuracy is achieved by a dumb model that always predicts the majority class. Precision and recall are generally better metrics for such cases. Choosing between the two is generally a business decision.
Given the input signals, it may be difficult to do better than 98%, especially since for some classes you will have too few samples. What you can do is group the minority classes together and give them a single label, e.g. 'other'. In this way, the model will hopefully have enough samples to learn that these samples are different from all other classes and will classify them as 'other'.
Often when you try to replace a manual business process with ML, you do not completely remove human intervention. The goal is to use the model on the cases/classes/input space where it does well and use the manual process for the rest. One way to do this is by using the 'other' label: once your model has predicted 'other', a human may manually classify these samples. Another method is to find a threshold on the predicted probability above which the model has high accuracy and sufficient population coverage. For example, let's say you have 100% (typically 90-100%) accuracy whenever the output probability is above 0.70. If this covers enough of the input population, you only use the ML model on such cases. For everything else, the manual process is followed.
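One way to pick such a threshold is to evaluate the total cost on held-out data for a range of candidate thresholds. The sketch below assumes the $1,000 and $10 costs from the question; the `confidence` and `is_correct` arrays are hypothetical stand-ins for the maximum predicted probability and the correctness of each held-out prediction, and the helper names are illustrative.

```python
import numpy as np

COST_WRONG = 1000.0   # cost of an accepted but wrong prediction (from the question)
COST_UNKNOWN = 10.0   # cost of sending a sample to a human (from the question)

def total_cost(confidence, is_correct, threshold):
    """Cost if every prediction below `threshold` is labelled "Unknown"."""
    accepted = confidence >= threshold
    n_unknown = np.sum(~accepted)
    n_wrong = np.sum(accepted & ~is_correct)
    return COST_UNKNOWN * n_unknown + COST_WRONG * n_wrong

def best_threshold(confidence, is_correct, candidates):
    """Return the candidate threshold with the lowest total cost."""
    costs = [total_cost(confidence, is_correct, t) for t in candidates]
    i = int(np.argmin(costs))
    return candidates[i], costs[i]

# Hypothetical held-out predictions.
confidence = np.array([0.96, 0.91, 0.86, 0.62, 0.57, 0.41, 0.36, 0.31])
is_correct = np.array([True, True, True, True, False, True, False, True])

candidates = np.linspace(0.0, 1.0, 21)
threshold, cost = best_threshold(confidence, is_correct, candidates)
print(threshold, cost)
```

If the calibrated probabilities could be trusted, the same costs would imply accepting a prediction only when (1 - p) * 1000 < 10, i.e. p > 0.99. Since the question reports that confidence and correctness are only loosely related, choosing the threshold empirically on held-out data is probably the safer option.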

low training (~64%) and test accuracy (~14%) with 5 different models

I'm struggling to find a learning algorithm that works for my dataset.
I am working on a typical regression problem. There are 6 features in the dataset that I am concerned with, and about 800 data points. The features and the predicted values have a high non-linear correlation, so the features are not useless (as far as I understand). The predicted values have a bimodal distribution, so I disregarded linear models pretty quickly.
So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and XGB regressor. The training dataset returns about 64% accuracy and the test data returns 11%-14%. Both numbers scare me, haha. I tried tuning the parameters for the random forest, but nothing seems to make a drastic difference.
Function to tune the parameters
def hyperparatuning(model, train_features, train_labels, param_grid={}):
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_
Function to evaluate the model
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees. '.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%. '.format(accuracy))
I expected the output to be at least acceptable, but instead I got 64% on the training data and 12-14% on the test data. It is a real horror to look at these numbers!

There are several issues with your question.
For starters, you are trying to use accuracy in what seems to be a regression problem, which is meaningless.
Although you don't provide the exact models (it would arguably be a good idea), this line in your evaluation function
errors = abs(predictions - test_labels)
is actually the basis of the mean absolute error (MAE - although you should actually take its mean, as the name implies). MAE, like MAPE, is indeed a performance metric for regression problems; but the formula you use next
accuracy = 100 - mape
does not actually hold, nor is it used in practice.
It is true that, intuitively, one might want to get the 1-MAPE quantity; but this is not a good idea, as MAPE itself has a lot of drawbacks which seriously limit its use. Here is a partial list from Wikipedia:
It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero.
For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
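The asymmetry described above can be checked with a few lines of NumPy, alongside the regression metrics that scikit-learn does provide (MAE, RMSE and R²). All arrays here are small made-up examples, and `mape` is a local helper written for this sketch, not a library function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (undefined if y_true has zeros)."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# MAPE is bounded at 100% for forecasts that are too low (down to 0),
# but unbounded for forecasts that are too high.
y_true = np.array([100.0, 100.0])
under = mape(y_true, np.array([0.0, 0.0]))       # 100.0
over = mape(y_true, np.array([400.0, 400.0]))    # 300.0
naive_accuracy = 100.0 - over                    # -200.0, not a meaningful "accuracy"

# Standard regression metrics on a hypothetical test set.
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(under, over, naive_accuracy)
print(mae, rmse, r2)
```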

It is an overfitting problem. You are fitting the hypothesis very well on your training data, but it does not generalize to the test data.
Possible solutions to your problem:
You can try getting more training data (more samples, not more features).
Try a less complex model, such as a single decision tree, since highly complex models (random forests, neural networks, etc.) can fit the training data very closely.
Cross-validation: it allows you to tune hyperparameters with only your original training set. This allows you to keep your test set as a truly unseen dataset for evaluating your final model.
Regularization: the method will depend on the type of learner you're using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression.
I would suggest you use scikit-learn's Pipeline, since it lets you chain preprocessing and model steps and tune their parameters together.
An example of that:
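# pca and logistic are assumed to be estimators defined beforehand, e.g. PCA()
# and an SGD-based logistic regression whose regularization strength is `alpha`
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__alpha': np.logspace(-4, 4, 5),
}
# the iid argument was removed from GridSearchCV in recent scikit-learn releases
search = GridSearchCV(pipe, param_grid, cv=5)
# fit on the training features and the training labels
search.fit(X_train, y_train)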

I would suggest improving the results by preprocessing the data better. Try to manually remove outliers, and look into Cook's distance to find samples that have a strong negative influence on your model. You could also scale the data differently than with standard scaling: use log scaling if the values in your data are too big or too small, or use feature transformations such as a DCT or SVD transform.
Or, most simply, you could create your own features from the existing data. For example, if you have yesterday's closing price and today's opening price as two features in stock price prediction, you can create a new feature holding the percentage difference between them, which could help your accuracy a lot.
Do some linear regression analysis to find the beta values, to get a better understanding of which features contribute most to the target value. You can use feature_importances_ in random forests for the same purpose, and then try to improve those features as much as possible so that the model can make better use of them.
This is just the tip of the iceberg of what could be done. I hope this helps.

Currently, you are overfitting so what you are looking for is regularization. For example, to reduce the capacity of models that are ensembles of trees, you can limit the maximum depth of the trees (max_depth), increase the minimum required samples at a node to split (min_samples_split), reduce the number of learners (n_estimators), etc.
When performing cross-validation, you should fit on the training set and evaluate on your validation set and the best configuration should be the one that performs the best on the validation set. You should also keep a test set in order to evaluate your model on completely new observations.
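As a rough sketch of that workflow, the example below tunes `max_depth` and `min_samples_split` of a random forest with cross-validation on the training set and keeps a separate test set for the final check. The synthetic data (800 samples, 6 features, mirroring the sizes in the question), the target function and the grid values are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the dataset: 800 samples, 6 features, non-linear target.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(800, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=800)

# Hold out a test set that is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Smaller max_depth / larger min_samples_split mean lower model capacity.
param_grid = {
    "max_depth": [3, 6, None],
    "min_samples_split": [2, 10, 30],
}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)  # cross-validation happens inside the training set

test_r2 = search.best_estimator_.score(X_test, y_test)  # R^2 on unseen data
print(search.best_params_, test_r2)
```

Comparing the cross-validated score with the training score for each configuration gives a direct view of how much each setting overfits.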

Using Precision and Recall in training of skewed dataset

I have a skewed dataset (5,000,000 positive examples and only 8000 negative [binary classified]) and thus, I know, accuracy is not a useful model evaluation metric. I know how to calculate precision and recall mathematically but I am unsure how to implement them in python code.
When I train the model on all the data, I get 99% accuracy overall but 0% accuracy on the negative examples (i.e. it classifies everything as positive).
I have built my current model in PyTorch with criterion = nn.CrossEntropyLoss() and optimiser = optim.Adam().
So, my question is, how do I implement precision and recall into my training to produce the best model possible?
Thanks in advance

Implementations of precision, recall, F1 score and other metrics are usually imported from the scikit-learn library in Python.
link: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
Regarding your classification task, the number of positive training samples simply eclipses the number of negative samples. Try training with a reduced number of positive samples or generating more negative samples. I am not sure deep neural networks could provide you with an optimal result considering the class skewness.
Negative samples can be generated using the Synthetic Minority Over-sampling Technique (SMOTE). This link is a good place to start.
Link: https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/
Try using simple models such as logistic regression or random forest first and check if there is any improvement in the F1 score of the model.
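A minimal sketch of computing these metrics with scikit-learn, treating the rare negative class (label 0) as the class of interest. The `logits` and `y_true` arrays are made up; with a PyTorch model, the predicted labels would presumably come from taking the argmax over the class dimension of the model output and converting it to a NumPy array, a step that is assumed here rather than shown.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical model outputs: one row per sample, one column per class.
logits = np.array([
    [2.0, 0.1],
    [0.2, 1.5],
    [0.3, 2.2],
    [0.1, 3.0],
    [1.2, 0.4],
    [0.5, 0.9],
])
y_true = np.array([0, 0, 1, 1, 0, 1])

# Predicted class = index of the largest score in each row.
y_pred = logits.argmax(axis=1)

# pos_label=0 reports the metrics for the rare negative class.
precision_neg = precision_score(y_true, y_pred, pos_label=0)
recall_neg = recall_score(y_true, y_pred, pos_label=0)
f1_neg = f1_score(y_true, y_pred, pos_label=0)

print(precision_neg, recall_neg, f1_neg)
```

These values can be computed on a validation set after each epoch and used for model selection or early stopping, while the network itself is still trained with the usual loss.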

To add to the other answer, some classifiers have a parameter called class_weight which lets you modify the loss function. By penalizing wrong predictions on the minority class more, you can train your classifier to learn to predict both classes.
For a PyTorch-specific answer, you can refer to this link.
As mentioned in the other answer, over- and undersampling strategies can be used. If you are looking for something better, take a look at this paper.
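To illustrate the effect of `class_weight`, here is a small sketch with scikit-learn's `LogisticRegression` on a synthetic imbalanced dataset (roughly 5% minority class). The dataset and its parameters are assumptions for demonstration only and are much less skewed than the data in the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 5% of the samples belong to class 0.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.05, 0.95], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Same model, with and without reweighting the loss by inverse class frequency.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

# Recall on the minority class (label 0) for both models.
recall_plain = recall_score(y_test, plain.predict(X_test), pos_label=0)
recall_weighted = recall_score(y_test, weighted.predict(X_test), pos_label=0)

print(recall_plain, recall_weighted)
```

Reweighting typically raises minority-class recall at the price of more false alarms on the majority class, so overall accuracy may well go down even though the model has become more useful.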

How to determine if the predicted probabilities from sklearn logistic regression are accurate?

I am totally new to machine learning and I'm trying to use scikit-learn to make a simple logistic regression model with 1 input variable (X) and a binary outcome (Y). My data consists of 325 samples, with 39 successes and 286 failures. The data was split into a training and test (30%) set.
My goal is actually to obtain the predicted probabilities of success for any given X based on my data, not for classification prediction per se. That is, I will be taking the predicted probabilities for use in a separate model I'm building and won't be using the logistic regression as a classifier at all. So it's important that the predicted probabilities actually fit the data.
However, I am having some trouble understanding whether or not my model is a good fit to the data, or if the computed probabilities are actually accurate.
I am getting the following metrics:
Classification accuracy: metrics.accuracy_score(Y_test, predicted) = 0.92.
My understanding of this metric is that the model has a high chance of making correct predictions, so it looks to me like the model is a good fit.
Log loss: cross_val_score(LogisticRegression(), X, Y, scoring='neg_log_loss', cv=10) = -0.26
This is probably the most confusing metric for me, and apparently the most important as it is the accuracy of the predicted probabilities. I know that the closer to zero the score is the better - but how close is close enough?
AUC: metrics.roc_auc_score(Y_test, probs[:, 1]) = 0.9. Again, this looks good, since the closer the ROC score is to 1 the better.
Confusion Matrix: metrics.confusion_matrix(Y_test, predicted) =
[[88,  0],
 [ 8,  2]]
My understanding here is that the diagonal gives the numbers of correct predictions in the training set so this looks ok.
Report: metrics.classification_report(Y_test, predicted) =
             precision    recall  f1-score   support
        0.0       0.92      1.00      0.96        88
        1.0       1.00      0.20      0.33        10
avg / total       0.93      0.92      0.89        98
According to this classification report, the model has good precision so it is a good fit.
I am not sure how to interpret the recall or whether this report is bad news for my model. The sklearn documentation states that recall is a model's ability to find all positive samples, so a score of 0.2 for a prediction of 1 would mean that it only finds the positives 20% of the time? That sounds like a really bad fit to the data.
I'd really appreciate it if someone could clarify whether I am interpreting these metrics the right way, and perhaps shed some light on whether my model is good or bogus. Also, if there are any other tests I could do to determine if the computed probabilities are accurate, please let me know.
If these aren't good metric scores, I'd really appreciate some direction on where to go next in terms of improvement.
Thanks!!

Your data set is unbalanced, since there are far more failures than successes. A classifier that just guesses failure all the time would get about 88% accuracy (286 of 325 samples), so 92% accuracy isn't that impressive.
The confusion matrix shows what's happening. 88 times it correctly predicts failure and 8 times it incorrectly predicts failure. Only twice does it actually predict success correctly.
Accuracy is the fraction of guesses that are correct: (88 + 2)/98 = 0.92 overall. The recall for success is only 2 out of the (8 + 2) total successes (20%).
So the model isn't a great fit. There are many ways to deal with unbalanced data sets, like weighting the examples or applying a prior to the predictions. The confusion matrix is a good way to see what's really happening.
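For the "how close is close enough" question about log loss, one reasonable reference point is the log loss of a model that ignores X and always predicts the base rate. The sketch below computes that baseline from the class counts given in the question (39 successes, 286 failures); treating it as the yardstick is a suggestion, not something stated in the original posts.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Class counts from the question: 39 successes (1) and 286 failures (0).
y = np.array([1] * 39 + [0] * 286)

# A no-information model predicts the overall success rate for every sample.
base_rate = y.mean()  # 39 / 325 = 0.12
baseline_probs = np.full(len(y), base_rate)

baseline_log_loss = log_loss(y, baseline_probs)
baseline_brier = brier_score_loss(y, baseline_probs)

print(base_rate, baseline_log_loss, baseline_brier)
```

The baseline log loss comes out at roughly 0.37, so the reported cross-validated value of 0.26 does indicate that the model carries information beyond the base rate. To check whether the probabilities themselves are reliable, a reliability diagram (for example via `sklearn.calibration.calibration_curve`) would be a natural next step, although with only 39 successes it will be noisy.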

Your data suffers from a class imbalance problem, and you have not specified any way of dealing with it while training your classifier. Even though your accuracy is high, that may simply be because the number of failure samples is quite large, so your test set is dominated by them too.
To deal with this, you can use a stratified split in sklearn to shuffle and split your data in a way that preserves the class proportions.
You can also try other techniques to improve your classifier, such as GridSearch. You can read more about model evaluation here in this link. For model-specific cross-validation techniques, check this section in sklearn.
One more thing you can do is that, instead of using accuracy as the metric for tuning your classifier, you can focus on recall and precision (or even the true positive rate in your case). You will need to use make_scorer in sklearn. An example can be found here and here. You might also want to check out the F1 score or F-beta score as well.
You can also check out this GitHub repository for various sampling techniques to tackle the class imbalance problem in sklearn.
You can also check out this answer for more techniques.
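A minimal sketch of the stratified split and the `make_scorer` suggestion together. The labels use the class counts from the question, but the single feature `X` is synthetic and made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Labels with the class counts from the question; the feature is synthetic.
rng = np.random.default_rng(0)
y = np.array([1] * 39 + [0] * 286)
X = (rng.normal(size=325) + 1.5 * y).reshape(-1, 1)

# stratify=y keeps the success/failure ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Score cross-validation folds by recall on the success class instead of accuracy.
recall_scorer = make_scorer(recall_score, pos_label=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(), X_train, y_train, scoring=recall_scorer, cv=cv
)

print(y_train.mean(), y_test.mean())  # both close to the overall rate of 0.12
print(scores)
```

The same scorer object can be passed as the `scoring` argument of `GridSearchCV` if hyperparameters should be tuned for recall.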

About the specific shapes of learning curves

My model throws up learning curves as I have shown below. Are these fine? I am a beginner and all across the internet I see that as training examples increase the Training score should decrease and then converge. But here the training score is increasing and then converging. Therefore I would like to know does this indicate a bug in my code / something wrong with my input?
Okay I figured out what was wrong with my code.
train_sizes , train_accuracy , cv_accuracy = lc(linear_model.LogisticRegression(solver='lbfgs',penalty='l2',multi_class='ovr'),trainData,multiclass_response_train,train_sizes=np.array([0.1,0.33,0.5,0.66,1.0]),cv=5)
I had not entered a regularization parameter for Logistic Regression.
But now,
train_sizes , train_accuracy , cv_accuracy = lc(linear_model.LogisticRegression(C=1000,solver='lbfgs',penalty='l2',multi_class='ovr'),trainData,multiclass_response_train,train_sizes=np.array([0.1,0.33,0.5,0.66,1.0]),cv=5)
The learning curve looks alright.
Can anybody tell me why this is so? That is, why does the training score increase with the default regularization term and decrease with weaker regularization?
Data details: 10 classes. Images of varying sizes. (Digit Classification - street view digits)

You need to be more precise regarding your metrics. What metrics are used here?
Loss in general means: lower is better, while Score usually means: higher is better.
This also means, that the interpretation of your plot is dependent on the used metrics during training and cross-validation.

Have a look at the related webpage of scikit-learn:
http://scikit-learn.org/stable/modules/learning_curve.html
The score is typically some measure that needs to be maximized (ROCAUC, accuracy,...). Intuitively you could expect that the more training examples you see the better your model gets and hence the higher the score is. There are however some subtleties regarding overfitting and underfitting that you should keep in mind.

Building off of Alex's answer, it looks like the default regularization parameter for your model underfits the data a bit, because when you relaxed regularization, you see 'more appropriate' learning curves. It doesn't matter how many examples you throw at a model that underfits.
As for your concern of why the training score increases in the first case rather than decreases -- it's probably a consequence of the multiclass data you're using. With fewer training examples, you have fewer numbers of images of each class (because lc tries to keep the same class distribution in each fold of the cv), so with regularization (if you call C=1 regularization, that is), it may be harder for your model to accurately guess some of the classes.
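To see how the regularization strength changes the shape of the curves, the sketch below runs scikit-learn's `learning_curve` for a strongly and a weakly regularized logistic regression. It uses the small built-in digits dataset as a stand-in for the street view digits, and the two values of C are arbitrary choices for illustration (in scikit-learn, a smaller C means stronger regularization).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in 8x8 digit images, used here only as a convenient stand-in dataset.
X, y = load_digits(return_X_y=True)
train_fractions = np.array([0.1, 0.33, 0.5, 0.66, 1.0])

results = {}
for C in (0.01, 100.0):  # small C = strong regularization, large C = weak
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=2000))
    sizes, train_scores, cv_scores = learning_curve(
        model, X, y, train_sizes=train_fractions, cv=5
    )
    # Average the per-fold accuracies for each training-set size.
    results[C] = (sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1))

for C, (sizes, train_mean, cv_mean) in results.items():
    print(C, sizes, train_mean.round(3), cv_mean.round(3))
```

Plotting `train_mean` and `cv_mean` against `sizes` for each C reproduces the kind of comparison discussed in the question.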
