How to improve F1 score for classification

How to improve F1 score for classification - python

I'm working on predicting if any task breaches a given deadline or not(Binary Classification Problem)
I've used Logistic Regression, Random Forest and XGBoost. All of them give an F1 score of around 56% for the class label 1(i.e the F1 score of the positive class only).
I've used:
StandardScaler()
GridSearchCV for Hyperparameter Tuning
Recursive Feature Elimination(for feature selection)
SMOTE(the dataset is imbalanced so I used SMOTE to create new examples from existing examples)
to try and improve the F score of this model.
I've also created an ensemble model using EnsembleVoteClassifier.As you can see from the picture, the weighted F score is 94% however the F score for class 1 (i.e positive class which says that the task will cross the deadline) is just 57%.
After applying all those methods mentioned above, I have been able to improve the f1 score of label 1 from 6% to 57%. However, I'm not sure what else to do to further improve the F score of the label 1.

You should also experiment with Under-Sampling. In general, you won't get much improvement by simply changing the algorithm. You should look into more advanced ensemble based techniques specifically designed for dealing with class imbalance.
You can also try out the approach used in this paper: https://www.sciencedirect.com/science/article/abs/pii/S0031320312001471
Alternatively, you could look into more advanced data synthesis methods.

Clearly, the fact that you have a relatively small number of True 1s samples in you datasets affects the performance of your classifier.
You have an "imbalanced data", you have much more of the 0s samples than of 1s.
There are multiple way to deal with imbalanced data. Each learner you have applied have its own "trick" for it. However, a general thing you can try is to resample the 1s samples. That is, artificially increase the proportion of the 1s in your dataset.
You can read more about different options here:
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18

Related

GridSearchCV does not improve my test accuracy

I am making multiple classifier models and the test accuracy for all of them is 0.508.
I find it weird that multiple models have the same accuracy. The models I used are Logistic Regressor,DesicionTreeClassifier, MLPClassifier, RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, XGBClassifier, SVC, and VotingClassifier.
After using GridSearchCV to improve the models, all of their test accuracy scores improved. But the test accuracy scores did not change.
I wish I could say I changed something, but I don't know why the test scores did not change. After using gridsearch, I expected the test scores to improve but it didn't

I would like to confirm, you mean your training scores improve but you testing scores did not change? If yes, there are a lot of possibility behind this.
You might want to reconfigure and add your hyper parameter range for example if using KNN you can increase the number of k or by adding more distance metric calculation
If you want to you can change the hyper parameter optimization technique like randomized search or bayesian search
I don't have any information about your data but sometimes turn on or turn off the shuffle mode when splitting can affect the scores for instance if you have time series data you have not to shuffle the dataset

There can be several reasons why the test accuracy didn't change after using GridSearchCV:
The best parameters found by GridSearchCV might not be optimal for the test data.
The test data may have a different distribution than the training data, leading to low test accuracy.
The models might be overfitting to the training data and not generalizing well to the test data.
The test data size might be small, leading to high variance in test accuracy scores.
The problem itself might be challenging, and a test accuracy of 0.508 might be the best that can be achieved with the current models and data.
It would be useful to have more information about the data, the problem, and the experimental setup to diagnose the issue further.

Looking at your accuracy, first of all I would say: are you performing a binary classification task? Because if it is the case, your models are almost not better than random on the test set, which may suggest that something is wrong with your training.
Otherwise, GridSearchCV, like RandomSearchCV and other hyperparameters optimization techniques try to find optimal parameters among a range that you define. If, after optimization, your optimal parameter has the value of one bound of your range, it may suggest that you need to explore beyond this bound, that is to say set another range on purpose and run the optimization again.
By the way, I don't know the size of your dataset but if it is big I would recommend you to use RandomSearchCV instead of GridSearchCV. As it is not exhaustive, it takes less time and gives results that are (nearly) optimized.

Augmenting classification model to prediction "Unknown" instead of a wrong classfication

I am working on a multi-class classification problem, it contains some class imbalance (100 classes, a handful of which only have 1 or 2 samples associated).
I have been able to get a LinearSVC (& CalibratedClassifierCV) model to achieve ~98% accuracy, which is great.
The problem is that for all of the misclassified predictions - the business will incur a monetary loss. That is, for each misclassification - we would incur a $1,000 loss. A solution to this would be to classify a datapoint as "Unknown" instead of a complete misclassification (these unknowns could then be human-classified which would cost roughly $10 per "Unknown" prediction). Clearly, this is cheaper than the $1,000/misclassification loss.
Any suggestions for would I go about incorporating this "Unknown" class?
I currently have:
svm = LinearSCV()
clf = CalibratedClassifierCV(svm, cv=3)
# fit model
clf.fit(X_train, y_train)
# get probabilities for each decision
decision_probabilities = clf.predict_proba(X_test)
# get the confidence for the highest class:
confidence = [np.amax(x) for x in decision_probabilities]
I was planning to use the predict_proba method from the CalibratedClassifierCV model, and for any max probabilities that were under a threshold (yet to be determined) I would instead classify that sample as "Unknown" instead of the class that the probability is actually associated with.
The problem is that when I've checked correct predictions, there are confidence values as low as 30%. Similarly, there are incorrect predictions with confidence values as high as 95%. If I were to just create a threshold of say, 50%, my accuracy would go down significantly, I would have quite of bit of "Unknown" classes (loss), and still a bit of misclassifications (even bigger loss).
Is there a way to incorporate another loss function on this back-end classification (predicted class vs 'unknown' class)?
Any help would be greatly appreciated!

A few suggestions right off the bat:
Accuracy is not the correct metric to evaluate imbalanced datasets. For example, if 90% of samples belong to 1 class 90% accuracy is achieved by a dumb model which always predicts the dumb class. Precision and recall are generally better metrics for such cases. Opting between the two is generally a business decision.
Given the input signals, it may be difficult to better than 98%, especially for some classes you will have two few samples. What you can do is group minority classes together and give them a single label e.g 'other'. In this way, the model will hopefully have enough samples to learn that these samples are different from all other classes and will classify them as 'other'
Often when you try to replace a manual business process by ML, you generally do not completely remove human intervention. The goal is to use the model on cases/classes/input space where your model does well and use the manual process for the rest. One way to do it is by using the 'other' label. Once your model has predicted 'other', a human may manually classify these samples. Another method is to find a threshold on predicted probability above which the model has a high accuracy and sufficient population coverage. For example, let say you have 100% (typically 90-100%) accuracy whenever the output prbability is above 0.70. If this covers enough of the input population, you only use the ML model on such cases. For everything else, the manual process is followed.

How can I stabilize a machine learning model?

I have a data to train the model. Also, I have another data to test the performance of the model as weekly. However, it seems that the model is not stable. There are some difference between training scores and weekly test scores. On the other hand, it is a fraud problem and I am using XGBoosting method. How can I make stable the model ? I can use different algorithms and parameters.
parameters = {
'n_estimators':[100],
'max_depth':[5],
'learning_rate':[0.1],
'classifier__min_sample_leaf':[5],
'classifier__criterion':['gini']
}
xgboost = XGBClassifier(scale_pos_weight=30)
xgboost_gs = GridSearchCV(xgboost, parameters, scoring='recall', cv=5, verbose=False)
xgboost_gs.fit(X_train, y_train)

I also worked on a similar project, and it's very difficult to improve the model's kappa or f1 score .... This is a problem that a lot of people face (data imbalance), specially in this field. I tried several models, feature engineering data cleaning and nothing seemed to work,I managed to improve kappa by 2 % by oversampling the unbalanced class (smote did not improve or any synthetic data creation)
But it's not all bad news! What I found out is that different models yield different results in terms of false positives/false negatives.
So the question is, what do you/your company want to prioritise on? A model that has less false negatives (classified fraud but it's not actually fraud, probably this one, more conservative) or less false positives (Classified as not fraud but it's actually fraud ) It's a trade play around and find the model that solves your problem, do not only look to accuracy on kappa or F1! Confusion Matrix in this case will help you!

You only have 24 items for the 1 class. This is too little so you will have to do some sampling to get both classes close to the same amount. This is to so fraud detection where you can easily get 1000s of non-fraud cases but only a hand full of fraud cases.
You can use some sampling method like SMOTE where you oversample the class with fewer observations and under-sample the class with more observations to let them have the same number of events for each class.
So in short you need a good balanced dataset for training. I am assuming that you had too few cases of class 1 in the training set

What are good metrics to evaluate the performance of a multi-class classifier?

I'm trying to run a classifier in a set of about 1000 objects, each with 6 floating point variables. I've used scikit-learn's cross validation features to generate an array of the predicted values for several different models. I've then used sklearn.metrics to compute the accuracy of my classifiers, and the confusion table. Most classifiers have around 20-30% accuracy. Below is the confusion table for the SVC classifier (25.4% accuracy).
Since I'm new to machine learning, I'm not sure how to interpret that result, and whether there are other good metrics to evaluate the problem. Intuitively speaking, even with 25% accuracy, and given that the classifier got 25% of the predictions right, I believe it is at least somewhat effective, right? How can I express that with statistical arguments?

If this table is a confusion table, I think that your classifier predicts in majority of the time the class E. I think that your class E is overrepresented in your dataset, accuracy is not a good metric if your classes have not the same number of instances,
Example, If you have 3 classes, A,B,C and in the test dataset the class A is over represented (90%) if your classifier predicts all time class A, you will have 90% of accuracy,
A good metric is to use log loss, logistic regression is a good algorithm that optimize this metric
see https://stats.stackexchange.com/questions/113301/multi-class-logarithmic-loss-function-per-class
An other solution, is to do oversampling of your small classes

First of all, I find it very difficult to look at confusion tables. Plotting it as an image would give a lot better intuitive understanding about what is going on.
It is advisory to have single number metric to optimize since it is easier and faster. When you find that your system doesn't perform as you expect it to, revise your selection of metric.
Accuracy is usually a good metric to use if you have same amount of examples in every class. Otherwise (which seems to be the case here) I'd advise to use F1 score which takes into account both precision and recall of your estimator.
EDIT: However it is up to you to decide if the ~25% accuracy, or whatever metric is "good enough". If you are classifying if robot should shoot a person you should probably revise your algorithm but if you are deciding if it is a pseudo-random or random data, 25% percent accuracy could be more than enough to prove the point.

When using multiple classifiers - How to measure the ensemble's performance? [SciKit Learn]

I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics - classifying protein sequences as being Neuropeptide precursors sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor) .
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc' on the training set for 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (Linear kernel), SVM (RBF kernel) and GRB) , and to use a simple majority vote.
MY question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and to get it from a combination of classifiers. (That is, to see what the ROC curve is just for each classifier alone [already known], then to see the ROC curve or AUC for the training data using combinations of classifiers).
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then I filter arbitrarily for results with a predicted score below '0.85'. An additional layer of filtering is "how many classifiers agree on this protein's positive classification").
Thank you very much!!
(The website implementation, where we're using the multiple classifiers - http://neuropid.cs.huji.ac.il/ )
The whole shebang is implemented using SciKit learn and python. Citations and all!)

To evaluate the performance of the ensemble, simply follow the same approach as you would normally. However, you will want to get the 10 fold data set partitions first, and for each fold, train all of your ensemble on that same fold, measure the accuracy, rinse and repeat with the other folds and then compute the accuracy of the ensemble. So the key difference is to not train the individual algorithms using k fold cross-validation when evaluating the ensemble. The important thing is not to let the ensemble see the test data either directly or by letting one of it's algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
An alternative approach (again making sure the ensemble approach) is to take the probabilities and \ or labels output by your classifiers, and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "Stacking"

You can use a linear regression for stacking. For each 10-fold, you can split the data with:
8 training sets
1 validation set
1 test set
Optimise the hyper-parameters for each algorithm using the training set and validation set, then stack yours predictions by using a linear regression - or a logistic regression - over the validation set. Your final model will be p = a_o + a_1 p_1 + … + a_k p_K, where K is the number of classifier, p_k is the probability given by model k and a_k is the weight of the model k. You can also directly use the predicted outcomes, if the model doesn't give you probabilities.
If yours models are the same, you can optimise for the parameters of the models and the weights in the same time.
If you have obvious differences, you can do different bins with different parameters for each. For example one bin could be short sequences and the other long sequences. Or different type of proteins.
You can use the metric whatever metric you want, as long as it makes sens, like for not blended algorithms.
You may want to look at the 2007 Belkor solution of the Netflix challenges, section Blending. In 2008 and 2009 they used more advances technics, it may also be interesting for you.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.