I'm having trouble optimizing the threshold for binary classification. I am using 3 models: Logistic Regression, CatBoost, and scikit-learn's RandomForestClassifier.
For each model I am doing the following steps:
1) Fit the model.
2) Get 0.0 recall for the first class (which makes up 5% of the dataset) and 1.0 recall for the zero class. (This can't be fixed with grid search or the class_weight='balanced' parameter.) >:(
3) Find the optimal threshold:
fpr, tpr, thresholds = roc_curve(y_train, model.predict_proba(X_train)[:, 1])
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
4) Enjoy ~70% recall for both classes.
5) Predict probabilities for the test dataset and use the optimal_threshold calculated above to get classes.
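In code, step 5 is roughly the following (using the same model and the optimal_threshold from step 3):
proba_test = model.predict_proba(X_test)[:, 1]           # probability of class 1
y_pred = (proba_test >= optimal_threshold).astype(int)   # apply the tuned threshold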
Here comes the question: when I rerun the code again and again without fixing random_state, the optimal threshold varies and shifts quite dramatically. This leads to dramatic changes in the accuracy metrics on the test sample.
Do I need to calculate some average threshold and use it as a hard-coded constant? Or do I have to fix random_state everywhere? Or maybe the method of finding optimal_threshold isn't correct?
If you do not set random_state to a fixed value, results will be different in every run. To get reproducible results, set random_state to a fixed value everywhere it is required, or use a fixed NumPy random seed via numpy.random.seed.
https://scikit-learn.org/stable/faq.html#how-do-i-set-a-random-state-for-an-entire-execution
The scikit-learn FAQ recommends setting random_state where required instead of relying on the global random state.
Global Random State Example:
import numpy as np
np.random.seed(42)
Some examples of setting random_state locally:
X_train, X_test, y_train, y_test = train_test_split(sample.data, sample.target, test_size=0.3, random_state=0)
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
classifierAlgorithm = LGBMClassifier(objective='binary', random_state=0)
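Applied to the three models from the question, locally setting the seed might look like this (a rough sketch; the catboost package is assumed to be installed, and random_seed is CatBoost's name for the parameter):
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier  # assumes the catboost package is installed

# Fixing the seed for every model keeps the fitted models, the predicted
# probabilities and therefore the optimal threshold reproducible between runs.
log_reg = LogisticRegression(class_weight='balanced', random_state=0)
forest = RandomForestClassifier(class_weight='balanced', random_state=0)
cat = CatBoostClassifier(random_seed=0, verbose=0)  # random_seed is CatBoost's seed parameter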
Related
I'm working on a classification project, where I try out various types of models like logistic regression, decision trees etc, to see which model can most accurately predict if a patient is at risk for heart disease (given an existing data set of over 3600 rows).
I'm currently trying to work on my decision tree, and have plotted ROC curves to find the optimized values for tuning the max_depth and min_samples_split hyperparameters. However, when I try to create my new model, I get the warning:
"UndefinedMetricWarning: Precision is ill-defined and being set to 0.0
due to no predicted samples. Use zero_division parameter to control
this behavior."
I have already googled the warning, and partly understand why it's happening, but not how to fix it. I don't want to just get rid of the warning or ignore the values that weren't predicted; I want to actually fix the issue. From my understanding, it has something to do with how I processed my data. However, I'm not sure where I went wrong with my data processing.
I started off with doing a train-test split, then used StandardScaler like so:
#Let's split the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df.drop("TenYearCHD", axis = 1)
y = df["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
#Let's scale our data
SS = StandardScaler()
X_train = SS.fit_transform(X_train)
X_test = SS.transform(X_test)
I then created my initial decision tree, and received no warnings:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion = "entropy")
#Fit our model and predict
dtc.fit(X_train, y_train)
dtc_pred = dtc.predict(X_test)
After looking at my ROC curve and AUC scores, I attempted to create another, more optimized decision tree, which is where I then received my warning:
dtc3 = DecisionTreeClassifier(criterion = "entropy", max_depth = 4, min_samples_split= .25)
dtc3.fit(X_train, y_train)
dtc3_pred = dtc3.predict(X_test)
Essentially I'm at a loss as to what to do. Should I use a different method like StratifiedKFold in addition to the train-test split to process my data? Should I do something else entirely? Any help would be greatly appreciated.
I am testing RandomForestClassifier on a simple dataset from sklearn. When I split the data with train_test_split, I get accuracy = 0.89. If I use cross-validation with cross_val_score with the same classifier parameters, the accuracy is smaller, about 0.83. Why?
Here is the code:
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_circles
np.random.seed(42)
#create dataset:
x, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)
#initialize stratified split:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
#create classifier:
clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                             oob_score=True, n_estimators=100, min_samples_leaf=10)
#average accuracy on cross-validation:
results = np.mean(cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score)))
print("ACCURACY WITH CV = ", results)  #prints 0.832
#use train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                             oob_score=True, n_estimators=100, min_samples_leaf=10)
clf.fit(xtrain,ytrain)
ypred=clf.predict(xtest)
print("ACCURACY WITHOUT CV = ",accuracy_score(ytest,ypred))#prints 0.89
what I got:
ACCURACY WITH CV = 0.83
ACCURACY WITHOUT CV = 0.89
Cross validation is used to run multiple experiments on different splits of data and then average their results. This is to ensure that the result of the experiment is not biased by one split, as it is in your case.
Your chosen seed, along with some luck, gave you a train-test split which has higher accuracy than the average. The higher accuracy is an artifact of random sampling when making a split and not an indicator of better model performance.
Simply put:
Cross Validation makes multiple splits of data. Your model is trained on all of these different splits and then the performance is averaged.
If you pick one of these splits, you may get lucky and there might be good overlap between the data points in your test and train set. Your model will have high accuracy in this case.
Or you may get unlucky and there might not be a high overlap between the data points in test and train set. Your model will have a lower accuracy in this case.
Thus, cross validation is used to average the results of various such splits (5 in your case).
Here is your code run in a google colab notebook:
https://colab.research.google.com/drive/16-NotF-_WVLESmvGMONSGSZigxrT3KLx?usp=sharing
The last cell makes 5 different splits and then averages their accuracies. Notice how that is the same as the one you got from cross validation. Also notice how some splits have higher and some splits have a lower accuracy.
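For reference, a minimal sketch of that repeated-split experiment (same data and classifier settings as in your code; the seeds 0-4 are arbitrary):
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)

# Train and evaluate on 5 different random splits, then average the accuracies.
scores = []
for seed in range(5):
    xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=seed)
    clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                                 oob_score=True, n_estimators=100, min_samples_leaf=10)
    clf.fit(xtrain, ytrain)
    scores.append(accuracy_score(ytest, clf.predict(xtest)))

print("individual split accuracies:", scores)
print("mean accuracy over 5 splits:", np.mean(scores))  # should land near the cross-validated 0.83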
To further convince yourself, look at the output of:
cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score))
The output is a list of scores (accuracies in your case) for the 5 different splits. You can see that they have varying values around 0.83
This is just down to chance for the split and the random state of the Random Forest Classifier. Try leaving random_state=42 out and letting it fit several times, and you'll get a spread of different accuracies. By chance, I had one run without CV of "just" 0.78! In contrast, CV will give you an average (your calculated mean) PLUS an idea about how much your accuracy could vary around that.
Problem: Scikit-learn's GridSearchCV is returning the parameter which results in the worst score (Root MSE) rather than the best.
I think it is possible the problem is that I am not using train-test split to create a hold-out test set, because it is time series data and I do not want to disrupt the time order. Another possible cause is that I have over 7,000 features but only 50 observations. But clarification from anyone who knows whether these could be the problems, and what I might do to remedy these potential issues, would be greatly appreciated.
I start with the following code (and have imported Ridge, GridSearchCV, make_pipeline, TimeSeriesSplit, numpy, pandas, etc.):
ridge_pipe = make_pipeline(Ridge(random_state=42, max_iter=100000))
tscv = TimeSeriesSplit(n_splits=5)
param_grid = {'ridge__alpha': np.logspace(1e-300, 1e-1, 500)}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv, scoring='neg_root_mean_squared_error',
n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1.2589254117941673}
-4.067235334106922
Skeptical that this would be the best Root MSE, I next tried finding the score when considering an alpha value of 1e-300 alone:
param_grid = {'ridge__alpha': [1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1e-300}
-2.0906161667718835e-13
Clearly then, an alpha value of 1e-300 has a better Root MSE (approx. -2e-13) than does an alpha value of 1e-1 (approx. -4) since negative Root MSE using GridSearchCV means the same thing - as I understand it - as positive Root MSE in all other contexts. So a Root MSE of -2e-13 is really 2e-13 and -4 is really 4. And the lower the Root MSE the better.
To see if np.logspace could be the culprit, I instead provide just a list of values:
param_grid = {'ridge__alpha': [1e-1, 1e-50, 1e-60, 1e-70, 1e-80, 1e-90, 1e-100, 1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv, scoring='neg_root_mean_squared_error',
n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
And the output shows the same problem:
{'ridge__alpha': 0.1}
-2.0419740158869386
And I don't think it's because I'm using TimeSeriesSplit, because I have tried using cv=5 instead of cv=tscv inside GridSearchCV() and it results in the same problem.
The same issue happens when I try Lasso instead of Ridge. Any thoughts?
This appears to be fine. The problem is that you're comparing the final outputs on the same dataset that the best_estimator_ was trained on (the search's score method delegates to the score method of search.best_estimator_, which is the model with the best hyperparameters refitted on the entire training set). The grid search, however, selects based on cross-validated scores, which are a better indicator of future performance.
Specifically, with alpha=1e-300 (practically zero), the model overfits badly to the training data, so the RMSE on that training data is very small (2e-13). Meanwhile, with alpha=1.26, the model performs worse on the training data (RMSE 4) but performs better on unseen data. You can see those cross-validation scores in the grid search's cv_results_ attribute.
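For example, one quick way to inspect those cross-validated scores (a sketch using the grid object you already fitted; pandas is used only for readability):
import pandas as pd

# Cross-validated (held-out) scores for every alpha the grid search tried.
cv_results = pd.DataFrame(grid.cv_results_)
print(cv_results[['param_ridge__alpha', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))
# mean_test_score is the negated RMSE averaged over the TimeSeriesSplit folds;
# the alpha ranked 1 here is the one reported by best_params_.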
I am using xgboost for a classification problem with an imbalanced dataset. I plan on using some combination of an f1-score or roc-auc as my primary criteria for judging the model.
Currently the default value returned from the score method is accuracy, but I would really like to have a specific evaluation metric returned instead. My big motivation for doing this is that I presume the feature_importances_ attribute from the model is determined from what's affecting the score method, and the columns that impact predictive accuracy might very well be different from the columns that impact roc-auc. Right now I am passing in values to eval_metric but it does not seem to be making a difference.
Here is some sample code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
data = load_breast_cancer()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
mod = XGBClassifier()
mod.fit(X_train, y_train)
Now at this point, mod.score(X_test, y_test) will return a value of ~ 0.96, and the roc_auc_score is ~ 0.99.
I was hoping the following snippet:
mod.fit(X_train, y_train, eval_metric='auc')
Would then allow mod.score(X_test, y_test) to return the roc_auc_score value, but it is still returning predictive accuracy, not roc_auc.
The purpose of this exercise is estimating the influence of different columns on the outcome, so if I could get feature_importances_ returned using f1 or roc_auc as the measure of impact this would be a huge boon, but I do not seem to be on the right path as of now.
Thank you.
There are two parts to your question. To use eval_metric, you need to provide data to evaluate on via eval_set:
mod = XGBClassifier()
mod.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")
You can check the AUC using evals_result(), which gives the AUC for every boosting iteration:
mod.evals_result()
{'validation_0': OrderedDict([('auc',
[0.965939,
0.9833,
0.984788,
[...]
0.991402,
0.991071,
0.991402,
0.991733])])}
The importance score is calculated based on the average gain across all the splits the feature is used in (see the help page). From your question, I suppose you need the model to maximize AUC, as in cross-validation, but you cannot use AUC as an objective in xgboost; gradient boosting methods require a differentiable loss function.
With an imbalanced dataset, you can try adjusting the parameter scale_pos_weight to control the balance of positive and negative weights. This is discussed on the xgboost website.
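A minimal sketch of that suggestion combined with gain-based importances (importance_type='gain' is assumed to be supported by your xgboost version; newer releases expect eval_metric in the constructor rather than in fit):
import numpy as np

# Weight the positive class by the negative/positive ratio of the training labels.
neg, pos = np.bincount(y_train)
mod = XGBClassifier(scale_pos_weight=neg / pos, importance_type='gain')
mod.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='auc')

# With importance_type='gain', feature_importances_ reflects average gain
# rather than how often a feature is used to split.
print(mod.feature_importances_)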
I am building a sentiment analysis classifier with scikit-learn. It has 3 labels: positive, neutral, and negative. The shape of my training data is (14640, 15), with the following label counts:
negative 9178
neutral 3099
positive 2363
I have pre-processed the data and applied bag-of-words vectorization to the tweet text (there are many other attributes too), which then has shape (14640, 1000).
Since the Y (the labels) is in text form, I applied LabelEncoder to it. This is how I split my dataset:
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, random_state=42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
out: (10248, 1000) (10248,)
(4392, 1000) (4392,)
And this is my classifier
svc = svm.SVC(kernel='linear', C=1, probability=True).fit(X_train, Y_train)
prediction = svc.predict_proba(X_test)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(np.int)
print('Precision score: ', precision_score(Y_test, prediction_int, average=None))
print('Accuracy Score: ', accuracy_score(Y_test, prediction_int))
out:Precision score: [0.73980398 0.48169243 0. ]
Accuracy Score: 0.6675774134790529
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
Now I am not sure why the third precision score is 0. I applied average=None in order to get a separate precision score for every class. Also, I am not sure whether the prediction step is right, because I wrote it for binary classification. Can you please help me debug it and make it better? Thanks in advance.
As the warning explains:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
it seems that one of your 3 classes is missing from your predictions prediction_int (i.e. you never predict it); you can easily check if this is the case with
set(Y_test) - set(prediction_int)
which should be the empty set {} if this is not the case.
If this is indeed the case, and the above operation gives {1} or {2}, the most probable reason is that your dataset is imbalanced (you have many more negative samples) and you did not ask for a stratified split; modify your train_test_split to
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, stratify=Y, random_state=42)
and try again.
UPDATE (after comments):
As it turns out, you have a class imbalance problem (and not a coding issue) which prevents your classifier from successfully predicting your 3rd class (positive). Class imbalance is a huge sub-topic in itself, and there are several remedies proposed. Although going into more detail is arguably beyond the scope of a single SO thread, the first thing you should try (on top of the suggestions above) is to use the class_weight='balanced' argument in the definition of your classifier, i.e.:
svc = svm.SVC(kernel='linear', C=1, probability=True, class_weight='balanced').fit(X_train, Y_train)
For more options, have a look at the dedicated imbalanced-learn Python library (part of the scikit-learn-contrib projects).
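For instance, a minimal oversampling sketch with that library (RandomOverSampler is one of its simplest tools; note that only the training set is resampled, so the test set keeps the original class distribution):
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class samples in the training set only.
ros = RandomOverSampler(random_state=42)
X_train_res, Y_train_res = ros.fit_resample(X_train, Y_train)

svc = svm.SVC(kernel='linear', C=1, probability=True).fit(X_train_res, Y_train_res)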