Is it possible to change the threshold of a DecisionTreeClassifier? I'm studying the precision/recall trade-off and want to move the threshold to favor recall. I'm following Hands-On ML, but there the book uses an SGDClassifier and at some point calls cross_val_predict() with method="decision_function", which does not exist for DecisionTreeClassifier. I'm using a pipeline and cross-validation.
My study is with this dataset:
https://www.kaggle.com/datasets/imnikhilanand/heart-attack-prediction
features = df_heart.drop(['output'], axis=1).copy()
labels = df_heart.output
#split
X_train, X_test, y_train, y_test= train_test_split(features, labels,
train_size=0.7,
random_state=42,
stratify=features["sex"]
)
# categorical features
cat = ['sex', 'tipo_de_dor', 'ang_indz_exerc', 'num_vasos', 'acuc_sang_jejum', 'eletrc_desc', 'pico_ST_exerc', 'talassemia']
# treatment of categorical variables
t = [('cat', OneHotEncoder(handle_unknown='ignore'), cat)]
preprocessor = ColumnTransformer(transformers=t, remainder='passthrough')
#pipeline
pipe = Pipeline(steps=[('preprocessor', preprocessor),
('clf', DecisionTreeClassifier(min_samples_leaf=8, random_state=42),)
]
)
pipe.fit(X_train, y_train)
valid_cruz_strat = StratifiedKFold(n_splits=14, shuffle=True, random_state=42)
y_train_pred = cross_val_predict(pipe['clf'], X_train, y_train, cv=valid_cruz_strat)
conf_mat = confusion_matrix(y_train, y_train_pred)
ConfusionMatrixDisplay(confusion_matrix=conf_mat,
display_labels=pipe['clf'].classes_).plot()
plt.grid(False)
plt.show()
valid_cruz_strat = StratifiedKFold(n_splits=14, shuffle=True, random_state=42)
y_score = cross_val_predict(pipe['clf'], X_train, y_train, cv=valid_cruz_strat)
precisions, recalls, thresholds = precision_recall_curve(y_train, y_score)
threshold = 0  # only to support the graph
idx = (thresholds >= threshold).argmax()  # first index ≥ threshold
plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")
plt.title('Precision x Recall', fontsize=14)
plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-.5, 1.5, 0, 1.1])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.show()
threshold = 0.75  # only to support the graph
idx = (thresholds >= threshold).argmax()
plt.figure(figsize=(6, 5))
plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall curve")
plt.plot([recalls[idx], recalls[idx]], [0., precisions[idx]], "k:")
plt.plot([0.0, recalls[idx]], [precisions[idx], precisions[idx]], "k:")
plt.plot([recalls[idx]], [precisions[idx]], "ko",
label="Point at threshold "+str(threshold))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.axis([0, 1, 0, 1])
plt.grid()
plt.legend(loc="lower left")
plt.show()
When I check the arrays generated by precision_recall_curve() I see that they contain only 3 elements. Is this the correct behavior? When I run cross_val_predict() for an SGDClassifier, as in the book, without method='decision_function' and pass the output to precision_recall_curve(), it also generates arrays with 3 elements; with method='decision_function' it generates arrays with many elements.
My main question is how to choose the threshold for the DecisionTreeClassifier, and whether there is a way to generate the precision x recall curve with several points. I only manage to get these three points and can't see how to improve the recall.
Move the threshold to improve recall, and understand how to do it with a DecisionTreeClassifier.
This topic usually falls under the name "model calibration." scikit-learn supports a few kinds of probability calibration which could be informative to read about as well.
One way to "change the threshold" in a DecisionTreeClassifier would involve invoking .predict_proba(X) and observing a metric(s) over possible thresholds:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score
import numpy as np
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=10000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)
prob_pred = clf.predict_proba(X_test)[:, 1]
thresholds = np.arange(0.0, 1.0, step=0.01)
recall_scores = [recall_score(y_test, prob_pred > t) for t in thresholds]
precis_scores = [precision_score(y_test, prob_pred > t) for t in thresholds]
Now we have a set of thresholds between 0.0 and 1.0, and we've computed precision and recall over each threshold (Side note: this problem is less-well-defined for multilabel or multiclass prediction—usually these metrics are averaged over each class or similar).
Then we'll plot:
fig, ax = plt.subplots(1, 1)
ax.plot(thresholds, recall_scores, label="Recall @ t")
ax.plot(thresholds, precis_scores, label="Precision @ t")
ax.axvline(0.5, c="gray", linestyle="--", label="Default Threshold")
ax.set_xlabel("Threshold")
ax.set_ylabel("Metric @ Threshold")
ax.set_box_aspect(1)
ax.legend()
plt.show()
Which results in a figure like this:
This shows us that the default threshold of 0.5 used by .predict() may not be the best in all circumstances. In fact, there is a range of thresholds where precision and recall are fairly close, but one is favored over the other. In this case, lowering the threshold slightly will tend to favor recall, while increasing it will tend to favor precision.
In practice: the threshold appropriate for the problem comes down to domain knowledge since there's always a trade-off between precision and recall.
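Once a threshold is chosen, applying it is just a comparison against the predicted probabilities instead of calling .predict(). A minimal sketch, continuing from the code above (the value 0.3 is only an illustrative, hypothetical choice read off the plot):
# Hypothetical threshold picked from the precision/recall-vs-threshold plot above.
chosen_threshold = 0.3
# .predict() implicitly uses 0.5; here we threshold the probabilities ourselves.
custom_pred = (clf.predict_proba(X_test)[:, 1] >= chosen_threshold).astype(int)
print("Recall at default threshold (0.5):", recall_score(y_test, clf.predict(X_test)))
print("Recall at chosen threshold (0.3): ", recall_score(y_test, custom_pred))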
Related
Lasso, although it's a regression algorithm, can be used as a classifier. Therefore, there should be a way to make a ROC curve and find its AUC.
This is my code for making the model, scaling and standardizing it:
X = data.drop(['Response'], axis = 1)
Y = data.Response
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state = 42)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
pipline = Pipeline([
('scaler', StandardScaler()),
('model', Lasso(normalize=True))
])
lasso_model = GridSearchCV(pipline,
{'model__alpha': np.arange(0, 3, 0.05)},
cv = 10 ,
scoring = 'roc_auc',
verbose = 3,
n_jobs = -1,
error_score = 'raise')
lasso_model.fit(scaled_X_train, Y_train)
Now trying to make the ROC curves, one for the training set and one for the test set:
# Make a line for the random classification, AUC = 0.5:
r_probs = [0 for _ in range(len(Y_test))]
r_auc = roc_auc_score(Y_test,r_probs)
r_fpr, r_tpr , _ = roc_curve(Y_test, r_probs)
y_pred_proba = lasso_model.predict_proba(scaled_X_train)[::,1]
fpr, tpr, _ = roc_curve(Y_train, y_pred_proba)
auc = roc_auc_score(Y_train, y_pred_proba)
#create ROC curve
plt.plot(r_fpr, r_tpr, linestyle = '--', label = 'Random Prediction (AUROC = %0.3f)' %r_auc)
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.title('Train set ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()
I get the error that predict_proba does not exist for lasso regression, since it is not a classification algorithm. So how can I plot a ROC curve?
Lasso is basically a linear model with L1 regularization. You can enable the same kind of regularization for LogisticRegression() using penalty='l1' for the same effect (see this question, for example).
Alternatively, you can try to duct tape a sigmoid function to your Lasso() output, but that'd be doing the same thing as above with much more effort.
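A minimal sketch of that L1-penalized logistic regression route, reusing the variable names from your code (scaled_X_train, scaled_X_test, Y_train, Y_test); C plays roughly the inverse role of Lasso's alpha, and C=1.0 is just a placeholder you could tune with GridSearchCV as you already do:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# L1-penalized logistic regression: same sparsity effect as Lasso, but a real classifier with predict_proba()
clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
clf.fit(scaled_X_train, Y_train)
y_pred_proba = clf.predict_proba(scaled_X_test)[:, 1]
fpr, tpr, _ = roc_curve(Y_test, y_pred_proba)
plt.plot(fpr, tpr, label='AUC = %.3f' % roc_auc_score(Y_test, y_pred_proba))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()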
I have a problem. Is there an option to use early stopping? I can see on a plot that I start overfitting after a while, so I want to find the most optimal model.
dfListingsFeature_regression = pd.read_csv(r"https://raw.githubusercontent.com/Coderanker3/dataset4/main/listings_cleaned.csv")
d = {True: 1, False: 0, np.nan : np.nan}
dfListingsFeature_regression['host_is_superhost'] = dfListingsFeature_regression[
'host_is_superhost'].map(d).astype('int')
X = dfListingsFeature_regression.drop(columns=['host_id', 'id', 'price']) # Features
y = dfListingsFeature_regression['price'] # Target variable
print(dfListingsFeature_regression.shape)
steps = [('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=1000))),
('lasso', Lasso(alpha=0.1))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=30)
parameteres = { }
grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
grid.fit(X_train, y_train)
print("score = %3.2f" %(grid.score(X_test,y_test)))
print('Training set score: ' + str(grid.score(X_train,y_train)))
print('Test set score: ' + str(grid.score(X_test,y_test)))
# Prediction
y_pred = grid.predict(X_test)
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))
y_train_predict = grid.predict(X_train)
print("Train:" , metrics.mean_squared_error(y_train, y_train_predict , squared=False))
r2 = metrics.r2_score(y_test, y_pred)
print(r2)
I think you mean applying regularization. In this case, we can reduce the chance of overfitting with L1 regularization, i.e. Lasso regression.
This regularization strategy is a kind of "feature selection" when you have several features, as it shrinks the coefficients of non-informative features toward zero.
In this case, you want to find the alpha value that gives the best score on the test dataset. Additionally, you can plot the gap between the train and test scores to guide your decision.
The stronger the alpha value, the stronger the regularization. See the code example below.
Full Example
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import Lasso
import numpy as np
import matplotlib.pyplot as plt
X, y = make_regression(noise=4, random_state=0)
# Alphas to search over
alphas = list(np.linspace(2e-2, 1, 20))
results = {}
for alpha in alphas:
    print(f'Fitting Lasso(alpha={alpha})')
    estimator = Lasso(alpha=alpha, random_state=0)
    cv_results = cross_validate(
        estimator, X, y, cv=5, return_train_score=True, scoring='neg_root_mean_squared_error'
    )
    # Compute average metric value (flip the sign: the scorer returns negative RMSE)
    avg_train_score = np.mean(cv_results['train_score']) * -1
    avg_test_score = np.mean(cv_results['test_score']) * -1
    results[alpha] = (avg_train_score, avg_test_score)
train_scores = [v[0] for v in results.values()]
test_scores = [v[1] for v in results.values()]
gap_scores = [v[1] - v[0] for v in results.values()]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.set_title('Alpha values vs Avg score')
ax1.plot(list(results.keys()), train_scores, label='Train Score')
ax1.plot(list(results.keys()), test_scores, label='Test Score')
ax1.legend()
ax2.set_title('Train/Test Score Gap')
ax2.plot(list(results.keys()), gap_scores)
plt.show()
Notice that when alpha is close to zero the model overfits, and when alpha gets bigger it underfits. However, around alpha=0.4 we can find a balance between underfitting and overfitting the data.
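As a follow-up (a minimal sketch that continues the example above and assumes you settle on alpha=0.4 after reading the plot), you would refit a single Lasso at that value on a training split and evaluate it once on held-out data:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Reuses X, y and Lasso from the example above; alpha=0.4 is just the value read off the plot.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
final_model = Lasso(alpha=0.4, random_state=0)
final_model.fit(X_train, y_train)
y_pred = final_model.predict(X_test)
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))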
I'd like to evaluate my machine learning model. I computed the area under the ROC curve with roc_auc_score() and plotted the ROC curve with the plot_roc_curve() function of sklearn. In the second function the AUC is also computed and shown in the plot. Now my problem is that I get different results for the two AUCs.
Here's the reproducible code with sample dataset:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import plot_roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
model = MLPClassifier(random_state=42)
model.fit(X_train, y_train)
yPred = model.predict(X_test)
print(roc_auc_score(y_test, yPred))
plot_roc_curve(model, X_test, y_test)
plt.show()
The roc_auc_score function gives me 0.979 and the plot shows 1.00.
Although the second function takes the model as an argument and predicts yPred again, the outcome should not differ. It is not a round-off error: if I decrease the training iterations to get a worse predictor, the values still differ.
With my real dataset I "achieved" a difference of 0.1 between the two methods.
Where does this discrepancy come from?
You should pass the prediction probabilities to roc_auc_score, and not the predicted classes. Like this:
yPred_p = model.predict_proba(X_test)[:,1]
print(roc_auc_score(y_test, yPred_p))
# output: 0.9983354140657512
When you pass the predicted classes, this is actually the curve for which AUC is being calculated (which is wrong):
Code to regenerate:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, yPred)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='AUC = ' + str(round(roc_auc, 2)))
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
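For comparison, a sketch of the same plot built from the predicted probabilities (yPred_p from above), which should match what plot_roc_curve shows:
fpr_p, tpr_p, _ = roc_curve(y_test, yPred_p)
plt.plot(fpr_p, tpr_p, label='AUC = ' + str(round(auc(fpr_p, tpr_p), 2)))
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance level')  # diagonal reference
plt.legend(loc='lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()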
May I know why I get the error message -
NameError: name 'X_train_std' is not defined
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std,
y_combined, classifier=lr,
test_idx=range(105,150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
lr.predict_proba(X_test_std[0,:])
weights, params = [], []
for c in np.arange(-5, 5):
lr = LogisticRegression(C=10**c, random_state=0)
lr.fit(X_train_std, y_train)
weights.append(lr.coef_[1])
params.append(10**c)
weights = np.array(weights)
plt.plot(params, weights[:, 0],
label='petal length')
plt.plot(params, weights[:, 1], linestyle='--',
label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')
plt.show()
Please see the links -
https://www.freecodecamp.org/forum/t/how-to-modify-my-python-logistic-regression/265795
https://bytes.com/topic/python/answers/972352-why-i-get-x_train_std-not-defined#post3821849
https://www.researchgate.net/post/Why_I_get_the_X_train_std_is_not_defined
Well, X_train_std is not defined/declared. You need to declare the variable and give it a value before using it.
Like:
X_train_std = 3
You didn't copy enough of the sample code. Somewhere above this, there is likely a call to train_test_split.
Basically, to do what you want, you need a set of X variables and your Y variable (what will be predicted). You normally split them into a training set and a test set; in addition, many algorithms work better on standardized data (zero mean, unit standard deviation), which is probably what the _std suffix in your variable names means.
The code that comes before your snippet probably looks something like:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
my_df = pd.DataFrame(....this is your data for the test...)
X = my_df[[X_variable_column_names_here]]
Y = my_df[Y_variable_column_name]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
Edit: It looks from the axis labels on your plot that you're trying to do logistic regression against the Iris dataset. The fully worked example is here:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
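For completeness, a minimal sketch of what the missing setup likely looks like, assuming the Iris data with only the petal features (to match your plot labels and test_idx=range(105, 150)); the exact split parameters are an assumption:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X = iris.data[:, [2, 3]]  # petal length and petal width
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
# combined arrays used later in your snippet (105 training + 45 test samples)
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))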
So, basically, I'm using an RF for descriptive modelling as follows:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import class_weight
class_weights = class_weight.compute_class_weight('balanced', np.unique(y), y)
class_weights = dict(enumerate(class_weights))
class_weights
{0: 0.5561096747856852, 1: 4.955559597429368}
clf = RandomForestClassifier(class_weight=class_weights, random_state=0)
cross_val_score(clf, X, y, cv=10, scoring='f1').mean()
And plotting variables importance as:
import matplotlib.pyplot as plt
def plot_importances(clf, features, n):
    importances = clf.feature_importances_
    indices = np.argsort(importances)[::-1]
    if n:
        indices = indices[:n]
    plt.figure(figsize=(10, 5))
    plt.title("Feature importances")
    plt.bar(range(len(indices)), importances[indices], align='center')
    plt.xticks(range(len(indices)), features[indices], rotation=90)
    plt.xlim([-1, len(indices)])
    plt.show()
    return features[indices]
imp = plot_importances(clf, X.columns, 30)
I was expecting the variable importances to be the same across multiple runs. However, they change whenever I re-run the notebook.
I don't understand why that is. Is it somehow related to the cross_val_score method?
I cannot reproduce the problem. For me the variable importances do remain the same over multiple runs when I produce some data using:
from sklearn.datasets import make_classification
import pandas as pd
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)
X = pd.DataFrame(X)
Also changing the data to have an uneven weighting by selecting the first 750 y/X data points does not lead to differences in importances.
What data do you use?
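For reference, a minimal reproducibility check along those lines (a sketch: it fits the classifier directly on the generated data twice, with the same fixed random_state, and compares the importances):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
clf_a = RandomForestClassifier(random_state=0).fit(X, y)
clf_b = RandomForestClassifier(random_state=0).fit(X, y)
# With identical data and a fixed random_state, the importances match exactly.
print(np.allclose(clf_a.feature_importances_, clf_b.feature_importances_))  # True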