I have a dataset to train the model and another dataset that I use to test its performance on a weekly basis. However, the model does not seem stable: there are noticeable differences between the training scores and the weekly test scores. It is a fraud-detection problem and I am using XGBoost. How can I make the model more stable? I am open to different algorithms and parameters.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

parameters = {
    'n_estimators': [100],
    'max_depth': [5],
    'learning_rate': [0.1],
    # the two keys below only apply inside a Pipeline step named 'classifier'
    # and are not XGBoost parameters, so they are commented out here:
    # 'classifier__min_sample_leaf': [5],
    # 'classifier__criterion': ['gini']
}
xgboost = XGBClassifier(scale_pos_weight=30)
xgboost_gs = GridSearchCV(xgboost, parameters, scoring='recall', cv=5, verbose=False)
xgboost_gs.fit(X_train, y_train)
I also worked on a similar project, and it is very difficult to improve the model's kappa or F1 score. This is a problem a lot of people face (data imbalance), especially in this field. I tried several models, feature engineering and data cleaning, and nothing seemed to work. I managed to improve kappa by 2% by oversampling the minority class (SMOTE and other synthetic data creation did not help).
But it's not all bad news! What I found out is that different models yield different results in terms of false positives/false negatives.
So the question is: what do you/your company want to prioritise? A model with fewer false negatives (fraud that gets classified as not fraud; probably this one, the more conservative choice) or fewer false positives (non-fraud that gets flagged as fraud)? It's a trade-off: play around and find the model that solves your problem, and do not look only at accuracy, kappa or F1! A confusion matrix will help you here.
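For example, a minimal sketch with scikit-learn (y_test and y_pred here stand in for your own labels and your model's predictions):

from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)   # rows = actual class, columns = predicted class
print(cm)                               # [[TN, FP], [FN, TP]] for binary 0/1 labels
print(classification_report(y_test, y_pred, digits=3))  # precision/recall/F1 per class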
You only have 24 items for class 1. This is too little, so you will have to do some sampling to get both classes to roughly the same amount. This is typical for fraud detection, where you can easily get thousands of non-fraud cases but only a handful of fraud cases.
You can use a sampling method like SMOTE to oversample the class with fewer observations, optionally combined with under-sampling the class with more observations, so that both classes end up with a similar number of events.
So, in short, you need a well-balanced dataset for training. I am assuming you had too few cases of class 1 in the training set.
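As a rough sketch of that idea with the imbalanced-learn package (the sampling ratios are assumptions to tune, and resampling is applied to the training data only):

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority class up to half the majority size, then trim the majority class.
X_over, y_over = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X_train, y_train)
X_res, y_res = RandomUnderSampler(sampling_strategy=1.0, random_state=42).fit_resample(X_over, y_over)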
I am working on a multi-class classification problem that contains some class imbalance (100 classes, a handful of which have only 1 or 2 samples associated with them).
I have been able to get a LinearSVC (& CalibratedClassifierCV) model to achieve ~98% accuracy, which is great.
The problem is that for all of the misclassified predictions - the business will incur a monetary loss. That is, for each misclassification - we would incur a $1,000 loss. A solution to this would be to classify a datapoint as "Unknown" instead of a complete misclassification (these unknowns could then be human-classified which would cost roughly $10 per "Unknown" prediction). Clearly, this is cheaper than the $1,000/misclassification loss.
Any suggestions for how I would go about incorporating this "Unknown" class?
I currently have:
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm, cv=3)
# fit model
clf.fit(X_train, y_train)
# get probabilities for each decision
decision_probabilities = clf.predict_proba(X_test)
# get the confidence for the highest class:
confidence = [np.amax(x) for x in decision_probabilities]
I was planning to use the predict_proba method from the CalibratedClassifierCV model, and for any maximum probability below a threshold (yet to be determined) I would classify that sample as "Unknown" rather than the class the probability is actually associated with.
The problem is that when I checked the correct predictions, there are confidence values as low as 30%. Similarly, there are incorrect predictions with confidence values as high as 95%. If I were to just use a threshold of, say, 50%, my accuracy would go down significantly, I would have quite a bit of "Unknown" classes (a loss), and still a fair number of misclassifications (an even bigger loss).
Is there a way to incorporate another loss function on this back-end classification (predicted class vs 'unknown' class)?
Any help would be greatly appreciated!
A few suggestions right off the bat:
Accuracy is not the correct metric for evaluating imbalanced datasets. For example, if 90% of samples belong to one class, 90% accuracy is achieved by a dumb model that always predicts that majority class. Precision and recall are generally better metrics for such cases; choosing between the two is generally a business decision.
Given the input signals, it may be difficult to do better than 98%, especially since some classes will have too few samples. What you can do is group the minority classes together and give them a single label, e.g. 'other'. That way the model will hopefully have enough samples to learn that these samples are different from all the other classes and will classify them as 'other'.
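A quick sketch of that grouping with pandas (the 5-sample cutoff and the label Series y are assumptions):

import pandas as pd

counts = y.value_counts()                     # samples per class
rare = counts[counts < 5].index               # classes with too few samples
y_grouped = y.where(~y.isin(rare), 'other')   # relabel them all as 'other'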
Often when you try to replace a manual business process with ML, you do not completely remove human intervention. The goal is to use the model on the cases/classes/parts of the input space where it does well and keep the manual process for the rest. One way to do this is via the 'other' label: once the model predicts 'other', a human can manually classify those samples. Another method is to find a threshold on the predicted probability above which the model has high accuracy and sufficient population coverage. For example, say you have close to 100% (typically 90-100%) accuracy whenever the output probability is above 0.70. If this covers enough of the input population, you only use the ML model on such cases; for everything else, the manual process is followed.
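To make the threshold idea concrete, a hedged sketch building on the code in the question (clf is the fitted CalibratedClassifierCV; the 0.70 cutoff is an assumption to tune on a validation set, and string class labels are assumed so 'Unknown' fits in the same array):

import numpy as np

proba = clf.predict_proba(X_test)
pred = clf.classes_[np.argmax(proba, axis=1)]   # most likely class per sample
confidence = proba.max(axis=1)                  # its predicted probability

threshold = 0.70
final_pred = np.where(confidence >= threshold, pred, 'Unknown')  # low confidence -> manual queue
coverage = (confidence >= threshold).mean()
print(f'model keeps {coverage:.1%} of cases; the rest go to human review')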
I'm working on predicting whether a task breaches a given deadline or not (a binary classification problem).
I've used Logistic Regression, Random Forest and XGBoost. All of them give an F1 score of around 56% for class label 1 (i.e. the F1 score of the positive class only).
To try to improve the F1 score of this model, I've used:
-StandardScaler()
-GridSearchCV for hyperparameter tuning
-Recursive Feature Elimination (for feature selection)
-SMOTE (the dataset is imbalanced, so I used SMOTE to create new examples from existing ones)
I've also created an ensemble model using EnsembleVoteClassifier. As you can see from the picture, the weighted F1 score is 94%, yet the F1 score for class 1 (i.e. the positive class, which says the task will cross the deadline) is just 57%.
After applying all the methods mentioned above, I have been able to improve the F1 score of label 1 from 6% to 57%. However, I'm not sure what else to do to further improve the F1 score of label 1.
You should also experiment with under-sampling. In general, you won't get much improvement by simply changing the algorithm; look into more advanced ensemble-based techniques specifically designed for dealing with class imbalance.
You can also try out the approach used in this paper: https://www.sciencedirect.com/science/article/abs/pii/S0031320312001471
Alternatively, you could look into more advanced data synthesis methods.
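As one concrete example of an imbalance-aware ensemble, a sketch with imbalanced-learn (the ensemble size is an arbitrary choice):

from imblearn.ensemble import BalancedBaggingClassifier

# Each bagged estimator is trained on a bootstrap sample where the majority class is under-sampled.
clf = BalancedBaggingClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)
print(clf.predict(X_test)[:10])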
Clearly, the fact that you have relatively few true 1 samples in your dataset affects the performance of your classifier.
You have "imbalanced data": many more 0 samples than 1 samples.
There are multiple ways to deal with imbalanced data. Each learner you have applied has its own "trick" for it. However, a general thing you can try is to resample the 1 samples, that is, artificially increase the proportion of 1s in your dataset.
You can read more about different options here:
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
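For example, the built-in "tricks" for two of the learners you used look roughly like this (a sketch assuming a 0/1 target in y_train):

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

n_neg, n_pos = (y_train == 0).sum(), (y_train == 1).sum()

rf = RandomForestClassifier(class_weight='balanced', random_state=42)  # re-weights classes internally
xgb = XGBClassifier(scale_pos_weight=n_neg / n_pos)                    # up-weights the rare positive class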
Consider 3 data sets: train/val/test. Sklearn's GridSearchCV by default chooses as the best model the one with the highest cross-validation score. In a real-world setting where the predictions need to be accurate, this is a horrible approach to choosing the best model. The reason is that this is how the sets are supposed to be used:
-Train set for the model to learn the dataset
-Val set to validate what the model has learned from the train set and to update parameters/hyperparameters to maximize the validation score.
-Test set to test your model on unseen data.
-Finally, use the model in a live setting and log the results to see if they are good enough to make decisions. It's surprising how many data scientists impulsively put their trained model into production based only on selecting the model with the highest validation score. I find that grid search chooses models that are painfully overfit and do a worse job of predicting unseen data than the default parameters.
My approaches:
-Manually train the models and look at the results for each one (in a sort of loop; not very efficient). It's very manual and time-consuming, but I get significantly better results than with grid search. I want this to be completely automated.
-Plot the validation curve for each hyperparameter I want to tune, and then pick the value that shows the smallest difference between the train and val scores while maximizing both (i.e. train=98%, val=78% is really bad, but train=72%, val=70% is acceptable), roughly as sketched below.
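For reference, the kind of check I mean, as a rough sketch with sklearn's validation_curve (the parameter name and range are placeholders):

import numpy as np
from sklearn.model_selection import validation_curve

param_range = [2, 4, 6, 8, 10]
train_scores, val_scores = validation_curve(model, X_train, y_train,
                                            param_name='max_depth',
                                            param_range=param_range, cv=3)
for p, tr, va in zip(param_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f'max_depth={p}: train={tr:.2f}, val={va:.2f}, gap={tr - va:.2f}')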
Like I said, I want a better (automated) method for choosing the best model.
What kind of answer I'm looking for:
I want to maximize the score in the train and validation set, while minimizing the score difference between the train and val sets. Consider the following example from a grid search algorithm:
There are two models:
Model A: train score = 99%, val score = 89%
Model B: train score = 80%, val score = 79%
Model B is a much more reliable model and I would choose Model B over Model A any day. It is less overfit and its predictions are consistent; we know what to expect. However, grid search will choose Model A since its val score is higher. I find this to be a common problem and haven't found a solution anywhere on the internet. People tend to be so focused on what they learned in school that they don't actually think about the consequences of choosing an overfit model. I see redundant posts about how to use sklearn's and caret's grid search packages and have them choose the model for you, but not about how to actually choose the best model.
My approach so far has been very manual. I want an automated way of doing this.
What I do currently is this:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

gs = GridSearchCV(model, params, cv=3).fit(X_train, y_train)  # X_train/y_train include the validation data here, since GridSearchCV creates its own CV splits
final_model = gs.best_estimator_
train_predictions = final_model.predict(X_train)
val_predictions = final_model.predict(X_val)
test_predictions = final_model.predict(X_test)
print('Train Score:', accuracy_score(train_predictions, y_train)) # .99
print('Val Score:', accuracy_score(val_predictions, y_val)) # .89
print('Test Score:', accuracy_score(test_predictions, y_test)) # .8
If I see something like the above, I'll rule out that model and try different hyperparameters until I get consistent results. By manually fitting different models and looking at all three of these results, the validation curves, etc., I can decide which model is best. I don't want to do this manually; I want this process to be automated. The grid search algorithm returns overfit models every time. I look forward to hearing some answers.
Another big issue is the difference between the val and test sets. Since many problems have a time dependency, I'd like to know a reliable way to test a model's performance as time goes on. It's crucial to split the dataset by time, otherwise we are introducing data leakage. One method I'm familiar with is discriminative analysis (fitting a model to see whether it can predict which dataset an example came from: train, val or test). Another is KS/KL tests, looking at the distribution of the target variable, or looping through each feature and comparing its distributions.
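For concreteness, a rough sketch of the per-feature comparison I mean (assuming pandas DataFrames and scipy's ks_2samp; the p-value cutoff is arbitrary):

from scipy.stats import ks_2samp

for col in X_train.columns:
    stat, p = ks_2samp(X_train[col], X_test[col])
    if p < 0.01:
        print(f'{col}: train/test distributions differ (KS={stat:.3f}, p={p:.4f})')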
I agree with the comments that using the test set to choose hyperparameters obviates the need for the validation set (/folds), and makes the test set scores no longer representative of future performance. You fix that by "testing the model on a live feed," so that's fine.
I'll even give you the scenario where I take out the test set: it's the same problem. The grid search algorithm picks the model with the highest validation score. It doesn't look at the difference between the train score and the val score, which should be close to 0. A train score of 99% and a val score of 88% is not a good model, but grid search will take that over a train score of 88% and a val score of 87%. I would choose the second model.
Now this is something that's more understandable: there are reasons outside of raw performance to want the train/test score gap to be small. See e.g. https://datascience.stackexchange.com/q/66350/55122. And sklearn actually does accommodate this since v0.20: by using return_train_score=True and refit as a callable that consumes cv_results_ and returns the best index:
refit : bool, str, or callable, default=True
...
Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.
...
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Of course, that requires that you can write your manual process of looking at scores and their differences down as a function, and it probably doesn't admit anything like validation curves, but at least it's something.
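To make that concrete, a minimal sketch of such a refit callable (the 0.05 gap tolerance and the choice of mean scores are my assumptions, not from the docs):

import numpy as np
from sklearn.model_selection import GridSearchCV

def pick_small_gap(cv_results):
    train = np.asarray(cv_results['mean_train_score'])
    val = np.asarray(cv_results['mean_test_score'])
    ok = (train - val) <= 0.05                          # keep candidates with a small train/val gap
    candidates = np.flatnonzero(ok) if ok.any() else np.arange(len(val))
    return candidates[np.argmax(val[candidates])]       # best validation score among them

gs = GridSearchCV(model, params, cv=3,
                  return_train_score=True, refit=pick_small_gap)
gs.fit(X_train, y_train)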
I have a problem I am trying to solve:
- imbalanced dataset with 2 classes
- one class dwarfs the other one (923 vs 38)
- f1_macro score, when the dataset is used as-is to train a RandomForestClassifier, stays in the 0.6 - 0.65 range for both TRAIN and TEST
While doing research on the topic yesterday, I educated myself on resampling, and especially the SMOTE algorithm. It seems to have worked wonders for my TRAIN score: after balancing the dataset with it, my score went from ~0.6 up to ~0.97. The way I applied it was as follows:
- I split my TEST set away from the rest of the data at the beginning (10% of the whole data)
- I applied SMOTE on the TRAIN set only (class balance 618 vs 618)
- I trained a RandomForestClassifier on the TRAIN set and achieved f1_macro = 0.97
- when testing on the TEST set, the f1_macro score remained in the ~0.6 - 0.65 range
What I assume happened is that the TEST set held observations of the minority class that were vastly different from the pre-SMOTE minority observations in the TRAIN set, which ended up teaching the model to recognize the TRAIN-set cases really well but threw it off balance on these few outliers in the TEST set.
What are the common strategies to deal with this problem? Common sense would dictate that I should try to capture a very representative sample of the minority class in the TRAIN set, but I do not think sklearn has any automated tools that allow that to happen?
Your assumption is correct. Your machine learning model is basically overfitting on your training data, which has the same pattern repeated for one class; the model learns that pattern and misses the other patterns that are present in the test data. This means the model will not perform well out in the wild.
If SMOTE is not working, you can experiment with different machine learning models. Random forest generally performs well on this type of dataset, so try to tune your RF model by pruning it or tuning its hyperparameters. Another option is to assign class weights when training the model. You can also try penalized models, which impose an additional cost on the model when it misclassifies the minority class.
You can also try undersampling, since you have already tested oversampling, but most probably undersampling will suffer from the same problem. Please also try simple random oversampling instead of SMOTE to see how your results change.
Another, more advanced, method you should experiment with is batching. Take all of your minority class and an equal number of entries from the majority class and train a model. Keep doing this for all the batches of your majority class, and in the end you will have multiple machine learning models, which you can then use together to vote.
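A rough sketch of that batching scheme (assuming numpy arrays, a 0/1 target, and averaged probabilities as the vote):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

minority_idx = np.flatnonzero(y_train == 1)
majority_idx = np.flatnonzero(y_train == 0)
rng = np.random.default_rng(42)
rng.shuffle(majority_idx)

models = []
n_batches = len(majority_idx) // len(minority_idx)
for batch in np.array_split(majority_idx, n_batches):
    idx = np.concatenate([minority_idx, batch])        # one balanced batch
    models.append(RandomForestClassifier(random_state=42).fit(X_train[idx], y_train[idx]))

# Vote by averaging the predicted probabilities of the positive class
avg_proba = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
y_pred = (avg_proba >= 0.5).astype(int)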
I am studying ensemble machine learning and, while reading some articles online, I ran into 2 questions.
1.
In this article, it mentions
Instead, model 2 may have a better overall performance on all the data points, but it has worse performance on the very set of points where model 1 is better. The idea is to combine these two models where they perform the best. This is why creating out-of-sample predictions have a higher chance of capturing distinct regions where each model performs the best.
But I still cannot get the point: why wouldn't training on all the training data avoid the problem?
2.
From this article, in the prediction section, it mentions
Simply, for a given input data point, all we need to do is to pass it through the M base-learners and get M number of predictions, and send those M predictions through the meta-learner as inputs
But in the training process we use k-fold train data to train the M base-learners, so should I also train the M base-learners on all the train data before using them for prediction?
Assume red and blue were the best models you could find.
One works better in region 1, the other on region 2.
Now you would also train a classifier to predict which model to use, i.e., you would try to learn the two regions.
Do the validation on the outside. You can overfit if you give the two inner models access to data that the meta model does not see.
The idea in ensembles is that a group of weak predictors can outperform a single strong predictor. So, if we train different models with different predictive results and use the majority rule as the final result of our ensemble, this result can be better than just training one single model. Assume, for example, that the data consist of two distinct patterns, one linear and one quadratic. Then a single classifier can either overfit or produce inaccurate results.
You can read this tutorial to learn more about ensembles and bagging and boosting.
1) "But I still cannot get the point, why not train all training data can avoid the problem?" - We will hold that data for validation purpose, just like the way we do in K-fold
2) "so should I also train M base-learner based on all train data for the input to predict?" - If you give same data to all the learners then the output of all of them would be same and there is no use in creating them. So we will give a subset of data to each learner.
For question 1, I will argue by contradiction why we train two models.
Suppose you train one model on all the data points. During training, whenever the model sees a data point belonging to the red class, it will try to fit itself so that it can classify red points with minimal error. The same is true for data points belonging to the blue class. So during training the model keeps leaning towards specific data points (either red or blue), and in the end it will try to fit itself so that it does not make too many mistakes on either kind of point; the final model will be an average model.
But if instead you train two models on the two different datasets, then each model is trained on a specific dataset and does not have to care about the data points that belong to the other class.
The following metaphor may make this clearer.
Suppose there are two people who are specialized in two completely different jobs. Now, when a job comes in, if you tell them that both of them have to do it and each must do 50% of the work, think about what kind of result you will get. Now think about what the result would be if you instead told each person to work only on the job they are best at.
For question 2, you split the train dataset into M datasets and, during training, give the M datasets to the M base-learners.
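For reference, a common variant instead keeps every base learner on the full training set and stacks their out-of-fold predictions; a minimal sketch with scikit-learn (the two base learners and the binary 0/1 target are assumptions for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

base_learners = [RandomForestClassifier(random_state=0),
                 GradientBoostingClassifier(random_state=0)]

# Out-of-fold predictions on the train set become the meta-learner's input features.
meta_X = np.column_stack([cross_val_predict(m, X_train, y_train, cv=5,
                                            method='predict_proba')[:, 1]
                          for m in base_learners])
meta_learner = LogisticRegression().fit(meta_X, y_train)

# At prediction time, refit each base learner on all the train data and stack again.
meta_X_test = np.column_stack([m.fit(X_train, y_train).predict_proba(X_test)[:, 1]
                               for m in base_learners])
final_pred = meta_learner.predict(meta_X_test)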