I have a problem. Is there an option for early stopping? I saw on a plot that the model starts overfitting after a while, so I want to find the optimal point to stop.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

dfListingsFeature_regression = pd.read_csv(r"https://raw.githubusercontent.com/Coderanker3/dataset4/main/listings_cleaned.csv")

d = {True: 1, False: 0, np.nan: np.nan}
dfListingsFeature_regression['host_is_superhost'] = dfListingsFeature_regression['host_is_superhost'].map(d).astype('int')

X = dfListingsFeature_regression.drop(columns=['host_id', 'id', 'price'])  # Features
y = dfListingsFeature_regression['price']  # Target variable

print(dfListingsFeature_regression.shape)
steps = [('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=1000))),
         ('lasso', Lasso(alpha=0.1))]
pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

parameters = {}
grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)
print("score = %3.2f" %(grid.score(X_test,y_test)))
print('Training set score: ' + str(grid.score(X_train,y_train)))
print('Test set score: ' + str(grid.score(X_test,y_test)))
# Prediction
y_pred = grid.predict(X_test)
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))
y_train_predict = grid.predict(X_train)
print("Train:" , metrics.mean_squared_error(y_train, y_train_predict , squared=False))
r2 = metrics.r2_score(y_test, y_pred)
print(r2)
I think you mean applying regularization. In this case, we can reduce the chance of overfitting with L1 regularization, i.e. Lasso regression.
This regularization strategy acts as a kind of "feature selection" when you have several features, since it shrinks the coefficients of non-informative features toward zero.
In this case, you want to find the alpha value that gives the best score on the test dataset. Additionally, you can plot the gap between the train and test scores to guide your decision.
The stronger the alpha value, the stronger the regularization. See the code example below.
Full Example
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import Lasso
import numpy as np
import matplotlib.pyplot as plt

X, y = make_regression(noise=4, random_state=0)

# Alphas to search over
alphas = list(np.linspace(2e-2, 1, 20))

results = {}

for alpha in alphas:
    print(f'Fitting Lasso(alpha={alpha})')
    estimator = Lasso(alpha=alpha, random_state=0)
    cv_results = cross_validate(
        estimator, X, y, cv=5, return_train_score=True, scoring='neg_root_mean_squared_error'
    )
    # Compute average metric value (flip the sign: the scorer returns negative RMSE)
    avg_train_score = np.mean(cv_results['train_score']) * -1
    avg_test_score = np.mean(cv_results['test_score']) * -1
    results[alpha] = (avg_train_score, avg_test_score)

train_scores = [v[0] for v in results.values()]
test_scores = [v[1] for v in results.values()]
gap_scores = [v[1] - v[0] for v in results.values()]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

ax1.set_title('Alpha values vs Avg score')
ax1.plot(list(results.keys()), train_scores, label='Train Score')
ax1.plot(list(results.keys()), test_scores, label='Test Score')
ax1.legend()

ax2.set_title('Train/Test Score Gap')
ax2.plot(list(results.keys()), gap_scores)

plt.show()
Notice that when alpha is close to zero the model is overfitting, and as alpha gets bigger it starts underfitting. Around alpha=0.4, however, we can find a balance between underfitting and overfitting the data.
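If you would rather let scikit-learn pick alpha for you, LassoCV runs this kind of cross-validated search internally and keeps the best value. Here is a minimal sketch on the same synthetic data as above (the alpha grid mirrors the one in the example; LassoCV ranks alphas by cross-validated mean squared error, which orders them the same way as RMSE):
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
import numpy as np

X, y = make_regression(noise=4, random_state=0)

# Cross-validate every alpha on the grid, then refit on the full data with the best one
model = LassoCV(alphas=np.linspace(2e-2, 1, 20), cv=5)
model.fit(X, y)

print(f'Best alpha: {model.alpha_:.3f}')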
Related
I have been trying to build a model that uses an independent variable such as "age" to predict the level of efficacy of an individual. I have achieved an R^2 score of 0.049 by using polynomial features with a decision tree, but it is still really low. What would be the best way to tackle this issue?
Here is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load data into a pandas DataFrame
df = pd.read_csv("workers.csv")

# Split data into training and testing sets
X = df[['sub_age']]
y = df["actual_efficacy_h"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create polynomial features (fit on the training data only)
poly = PolynomialFeatures(degree=4)
X_train_trans = poly.fit_transform(X_train)
X_test_trans = poly.transform(X_test)

# Train a decision tree regressor on the transformed training data
model = DecisionTreeRegressor()
model.fit(X_train_trans, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test_trans)
# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error: ", mse)
print("R-squared score: ", r2)
X_new = np.linspace(18, 65, 200).reshape(200, 1)
X_new_poly = poly.transform(X_new)
y_new = model.predict(X_new_poly)
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.plot(X_train, y_train, "b.",label='Training points')
plt.plot(X_test, y_test, "g.",label='Testing points')
plt.xlabel("Age")
plt.ylabel("Degree of Efficacy")
plt.legend()
plt.show()
This code achieves the following:
Mean squared error: 0.15122979391134947
R-squared score: 0.04171398183828967
And plots a graph like this:
Plot Produced
I would appreciate any kind of help or guidance.
Thank you
Is it possible to change the threshold of a DecisionTreeClassifier? I'm studying the precision/recall trade-off and want to change the threshold to favor recall. I'm working through Hands-On ML, but the book uses an SGDClassifier; at some point it calls cross_val_predict() with the method="decision_function" attribute, which does not exist for DecisionTreeClassifier. I'm using a pipeline and cross-validation.
My study is with this dataset:
https://www.kaggle.com/datasets/imnikhilanand/heart-attack-prediction
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             precision_recall_curve)

# df_heart is the heart attack DataFrame loaded from the Kaggle CSV above
features = df_heart.drop(['output'], axis=1).copy()
labels = df_heart.output

# split
X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                     train_size=0.7,
                                                     random_state=42,
                                                     stratify=features["sex"])

# categorical features
cat = ['sex', 'tipo_de_dor', 'ang_indz_exerc', 'num_vasos', 'acuc_sang_jejum',
       'eletrc_desc', 'pico_ST_exerc', 'talassemia']

# treatment of categorical variables
t = [('cat', OneHotEncoder(handle_unknown='ignore'), cat)]
preprocessor = ColumnTransformer(transformers=t, remainder='passthrough')

# pipeline
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('clf', DecisionTreeClassifier(min_samples_leaf=8, random_state=42))])
pipe.fit(X_train, y_train)
valid_cruz_strat = StratifiedKFold(n_splits=14, shuffle=True, random_state=42)
y_train_pred = cross_val_predict(pipe['clf'], X_train, y_train, cv=valid_cruz_strat)
conf_mat = confusion_matrix(y_train, y_train_pred)
ConfusionMatrixDisplay(confusion_matrix=conf_mat,
display_labels=pipe['clf'].classes_).plot()
plt.grid(False)
plt.show()
threshold = 0  # this is only to support the graph

idx = (thresholds >= threshold).argmax()  # first index ≥ threshold
plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
plt.vlines(threshold, 0, 1.0, "k", "dotted", label="threshold")
plt.title('Precision x Recall', fontsize=14)
plt.plot(thresholds[idx], precisions[idx], "bo")
plt.plot(thresholds[idx], recalls[idx], "go")
plt.axis([-.5, 1.5, 0, 1.1])
plt.grid()
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.show()
valid_cruz_strat = StratifiedKFold(n_splits=14, shuffle=True, random_state=42)
y_score = cross_val_predict(pipe['clf'], X_train, y_train, cv=valid_cruz_strat)
precisions, recalls, thresholds = precision_recall_curve(y_train, y_score)
threshold = 0.75  # this is only to support the graph
idx = (thresholds >= threshold).argmax()
plt.figure(figsize=(6, 5))
plt.plot(recalls, precisions, linewidth=2, label="Precision/Recall curve")
plt.plot([recalls[idx], recalls[idx]], [0., precisions[idx]], "k:")
plt.plot([0.0, recalls[idx]], [precisions[idx], precisions[idx]], "k:")
plt.plot([recalls[idx]], [precisions[idx]], "ko",
label="Point at threshold "+str(threshold))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.axis([0, 1, 0, 1])
plt.grid()
plt.legend(loc="lower left")
plt.show()
When I check the arrays generated by the precision_recall_curve() function, I see that they only contain 3 elements. Is this correct behavior? When I run cross_val_predict() for an SGDClassifier, for example, as in the book, without the method='decision_function' attribute and feed the output to precision_recall_curve(), it also generates arrays with 3 elements; with method='decision_function' it generates arrays with many elements.
My main question is how to choose the threshold for the DecisionTreeClassifier, and whether there is a way to generate the precision x recall curve with several points. I only manage to get these three points and I am not able to work out how to improve the recall.
In short: how do I move the threshold to improve recall, and how do I do that with a DecisionTreeClassifier?
This topic usually falls under the name "model calibration." scikit-learn supports a few kinds of probability calibration which could be informative to read about as well.
One way to "change the threshold" in a DecisionTreeClassifier would involve invoking .predict_proba(X) and observing a metric(s) over possible thresholds:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
import numpy as np
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=10000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)
prob_pred = clf.predict_proba(X_test)[:, 1]
thresholds = np.arange(0.0, 1.0, step=0.01)
recall_scores = [recall_score(y_test, prob_pred > t) for t in thresholds]
precis_scores = [precision_score(y_test, prob_pred > t) for t in thresholds]
Now we have a set of thresholds between 0.0 and 1.0, and we've computed precision and recall over each threshold (Side note: this problem is less-well-defined for multilabel or multiclass prediction—usually these metrics are averaged over each class or similar).
Then we'll plot:
fig, ax = plt.subplots(1, 1)
ax.plot(thresholds, recall_scores, label="Recall @ t")
ax.plot(thresholds, precis_scores, label="Precision @ t")
ax.axvline(0.5, c="gray", linestyle="--", label="Default Threshold")
ax.set_xlabel("Threshold")
ax.set_ylabel("Metric # Threshold")
ax.set_box_aspect(1)
ax.legend()
plt.show()
Which results in a figure like this:
This shows us that the default threshold of 0.5 used by .predict() may not be the best in all circumstances. In fact, there is a range of thresholds where precision and recall are fairly close, but each choice favors one over the other. In this case, lowering the threshold slightly will tend to favor recall, while increasing it will tend to favor precision.
In practice: the threshold appropriate for the problem comes down to domain knowledge since there's always a trade-off between precision and recall.
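Once you have settled on a threshold, applying it is just a comparison against the predicted probabilities; .predict() is equivalent to thresholding at 0.5. Here is a minimal sketch that continues from the example above (it assumes clf, X_test, y_test, thresholds, recall_scores and recall_score are already defined as shown there; the target recall of 0.9 is an arbitrary illustration, not a recommendation):
# A hypothetical target recall, purely for illustration
target_recall = 0.9

# Recall falls as the threshold rises, so take the largest threshold that still
# reaches the target (this tends to give the best precision under that constraint)
eligible = [t for t, r in zip(thresholds, recall_scores) if r >= target_recall]
chosen = max(eligible) if eligible else 0.5  # fall back to the default threshold

# Apply the chosen threshold by hand instead of relying on clf.predict()
y_pred_custom = (clf.predict_proba(X_test)[:, 1] > chosen).astype(int)

print(f"Chosen threshold: {chosen:.2f}")
print(f"Recall at that threshold: {recall_score(y_test, y_pred_custom):.3f}")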
Lasso, although it's a regression algorithm, can be used as a classifier. Therefore, there should be a way to make a ROC curve and find its AUC.
This is my code for making the model, scaling and standardizing it:
X = data.drop(['Response'], axis = 1)
Y = data.Response
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state = 42)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
pipline = Pipeline([
('scaler', StandardScaler()),
('model', Lasso(normalize=True))
])
lasso_model = GridSearchCV(pipline,
{'model__alpha': np.arange(0, 3, 0.05)},
cv = 10 ,
scoring = 'roc_auc',
verbose = 3,
n_jobs = -1,
error_score = 'raise')
lasso_model.fit(scaled_X_train, Y_train)
Now trying to make the ROC curves, one for the training set and one for the test set:
# Make a line for the random classification, AUC = 0.5:
r_probs = [0 for _ in range(len(Y_test))]
r_auc = roc_auc_score(Y_test,r_probs)
r_fpr, r_tpr , _ = roc_curve(Y_test, r_probs)
y_pred_proba = lasso_model.predict_proba(scaled_X_train)[::,1]
fpr, tpr, _ = roc_curve(Y_train, y_pred_proba)
auc = roc_auc_score(Y_train, y_pred_proba)
#create ROC curve
plt.plot(r_fpr, r_tpr, linestyle = '--', label = 'Random Prediction (AUROC = %0.3f)' %r_auc)
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.title('Train set ROC')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()
I get the error that predict_proba does not exist for lasso regression, since it is not a classification algorithm. So how can I plot a ROC curve?
Lasso is basically a linear model with L1 regularization. You can enable the same kind of regularization for LogisticRegression() using penalty='l1' (see this question, for example).
Alternatively, you could try to duct-tape a sigmoid function to your Lasso() output, but that would be doing the same thing as above with much more effort.
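Here is a minimal sketch of that idea, using synthetic data from make_classification in place of your DataFrame (the alpha grid is replaced by a grid over C, the inverse of the regularization strength; the grid values and sample size are arbitrary choices for illustration):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-in for the real data
X, Y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

# L1-penalized logistic regression plays the role of a "classifying Lasso";
# smaller C means stronger regularization
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(penalty='l1', solver='liblinear'))
])
model = GridSearchCV(pipe, {'model__C': np.logspace(-2, 2, 20)}, cv=10, scoring='roc_auc')
model.fit(X_train, Y_train)

# predict_proba now exists, so the ROC curve can be drawn as usual
y_pred_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(Y_test, y_pred_proba)
print("AUC:", roc_auc_score(Y_test, y_pred_proba))

plt.plot(fpr, tpr, label="L1 logistic regression")
plt.plot([0, 1], [0, 1], linestyle='--', label="Random prediction (AUROC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()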
I've done DecisionTreeRegression as well as RandomForestRegression on the same dataset.
For RandomForest I've used the 5 best random combinations, and the results were all similar, as you'd expect. I've calculated the average of R^2, RMSE and MAE and got
R^2 : 0.7, MAE: 145716, RMSE: 251828.
For DecisionTree I've used repeated K-fold cross-validation, calculated the averages and got:
R^2: 0.29, MAE: 121791, RMSE: 198280.
No transformations or scaling have been done on the response variable which is Home Prices.
I'm new to statistics, but I'm pretty sure R^2 should be higher if MAE and RMSE are lower on the same dataset when no scaling is done. That being said, the dataset in question is pretty low in quality compared to the other datasets I'm using, which do yield the expected proportions between error scores.
My question is: since this dataset is poor in quality, and I'm sure there will be negative R^2 values as well as values above one for the DecisionTree model on this dataset, is it possible that calculating the mean of scores after cross-validation gives arbitrary results for R^2 if some of the R^2 values are not in the 0-1 interval, or is it more likely that there's an issue with the logic of my code (or something else)?
import numpy as np
from numpy import absolute, mean
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score

def decisionTreeRegression(df, features):
    df = df.sample(frac=1, random_state=0)
    scaler = StandardScaler()
    X = df[features]
    y = df[['Price']]

    param_grid = {'max_depth': np.arange(1, 40, 3)}
    tree = GridSearchCV(DecisionTreeRegressor(), param_grid, return_train_score=False)
    tree.fit(X, y)
    tree_final = DecisionTreeRegressor(max_depth=tree.best_params_['max_depth'])

    cv = RepeatedKFold(n_splits=5, n_repeats=100)
    mae_scores = cross_val_score(tree_final, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    mse_scores = cross_val_score(tree_final, X, y, scoring='neg_mean_squared_error', cv=cv, n_jobs=-1)
    r2_scores = cross_val_score(tree_final, X, y, scoring='r2', cv=cv, n_jobs=-1)

    return makeScoresCV(mae_scores, mse_scores, r2_scores)

def makeScoresCV(mae_scores, mse_scores, r2_scores):
    # convert scores to positive
    mae_scores = absolute(mae_scores)
    mse_scores = absolute(mse_scores)
    # summarize the result
    s_mean = mean(mae_scores)
    s_mean2 = mean(mse_scores)
    s_mean3 = mean(r2_scores)
    return s_mean, np.sqrt(s_mean2), s_mean3

mae, rmse, r2 = decisionTreeRegression(df_de, fe_de)
print("mae : " + str(mae))
print("rmse : " + str(rmse))
print("r2 : " + str(r2))
Console:
mae : 153189.34673362423
rmse : 253284.5137707182
r2 : 0.30183525616923246
Random Forest (separate notebook):
import math
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

scaler = StandardScaler()

X = df.drop('Price', axis=1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    rmse = np.sqrt(mean_squared_error(test_labels, predictions))
    r2 = r2_score(test_labels, predictions)  # from sklearn.metrics
    mae = np.sum(np.absolute(test_labels - predictions)) / len(predictions)
    return mae, r2, rmse

maes = []
rmses = []
r2s = []

# rf_random (the randomized hyperparameter search over the forest) is defined earlier in the notebook
for i in range(10):
    rf_random.fit(X_train, y_train)
    best_random = rf_random.best_estimator_
    mae, r2, rmse = evaluate(best_random, X_test, y_test)
    maes.append(mae)
    rmses.append(rmse)
    r2s.append(r2)

print("MAE")
print(math.fsum(maes) / len(maes))
print("RMSE")
print(math.fsum(rmses) / len(rmses))
print("R2")
print(math.fsum(r2s) / len(r2s))
Console:
MAE
145716.7264983288
RMSE
251828.40328030512
R2
0.7082730127977784
I wish to append the results to the list data for each of the models used; however, the function calc only appends the results from the last model. I am sure it is something really simple I am missing here!
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
classifiers =[LogisticRegression(solver='liblinear', penalty='l2', C=200),
LogisticRegression(penalty='l2', C=1),
DecisionTreeClassifier(),
BernoulliNB()]
class_names = ['Logistic Regression', 'Logistic Regression Regularized',
               'CART', 'Naive Bayes (Bernoulli)']
# import some data to play with
iris = datasets.load_iris()
Xdata = iris.data[:, :2] # we only take the first two features
ydata = iris.target
def calc(classifier_names, classifier_models, Xdata, ydata):
    X_train, X_test, y_train, y_test = \
        train_test_split(Xdata, ydata, test_size=0.50, stratify=ydata,
                         random_state=42)
    X_scaler = StandardScaler()
    X_train = X_scaler.fit_transform(X_train)
    X_test = X_scaler.transform(X_test)
    data = []
    for name, clf in zip(classifier_names, classifier_models):
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        y_pred = clf.predict(X_test)
        ROC_AUC = plot_ROC_AUC(clf, X_test, y_test)
        Accuracy = metrics.accuracy_score(y_test, y_pred)
        Brier_Score = metrics.brier_score_loss(y_test, y_pred)
        data.append((ROC_AUC,
                     Accuracy,
                     Brier_Score))
        cols = ['ROC_AUC', 'Accuracy', 'Brier_Score']
        result = pd.DataFrame(data, columns=cols, index=classifier_names)
        return result
output = calc(class_names, classifiers, Xdata, ydata)
output
ROC_AUC Accuracy Brier_Score
Logistic Regression 0.925517 0.855072 0.144928
Logistic Regression Regularized 0.925517 0.855072 0.144928
CART 0.925517 0.855072 0.144928
Naive Bayes (Bernoulli) 0.925517 0.855072 0.144928
#want this to change here
#function within the calc function
def plot_ROC_AUC(fit_model, X_test, y_test):
    probs = fit_model.predict_proba(X_test)
    preds = probs[:, 1]
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)

    # plot ROC
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

    return roc_auc
I'm uncertain about the specifics of what you're attempting, but I see an issue here:
def calc(classifier_names, classifier_models, X, y):
    X_train, X_test, y_train, y_test = \
        train_test_split(Xdata, ydata, test_size=0.50, stratify=ydata,
                         random_state=42)
    X_scaler = StandardScaler()
    X_train = X_scaler.fit_transform(X_train)
    X_test = X_scaler.transform(X_test)
    data = []
    for name, clf in zip(classifier_names, classifier_models):
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        y_pred = clf.predict(X_test)
        ROC_AUC = plot_ROC_AUC(clf, X_test, y_test)
        Accuracy = metrics.accuracy_score(y_test, y_pred)
        Brier_Score = metrics.brier_score_loss(y_test, y_pred)
        data.append((ROC_AUC,
                     Accuracy,
                     Brier_Score))
        cols = ['ROC_AUC', 'Accuracy', 'Brier_Score']
        result = pd.DataFrame(data, columns=cols, index=classifier_names)
        return result
or simplified:
def func(something, darkside):
    for i in range(some_int):
        return some_other_func(i)
This loop will only go through one step, as the return statement will break out of the function.
I think what you should attempt to do is aggregate the results of the for loop in some DataFrame and then return the aggregate. At this point I could say it's an indentation issue, but looking higher I see you overwrite result on each loop too, so I would start there.
Maybe move the loop outside the function and do this instead:
def func(something, darkside):
    return some_expression_of(something, darkside)

for name, clf in zip(classifier_names, classifier_models):
    func(name, clf)
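For concreteness, here is a sketch of how the original calc could be restructured so that every classifier ends up with its own row: the append stays inside the loop, and the DataFrame is built and returned only after the loop has finished. It reuses the names from your post (plot_ROC_AUC, metrics, StandardScaler, train_test_split, pd) and keeps your metric choices unchanged, so it is a structural fix rather than a standalone script:
def calc(classifier_names, classifier_models, Xdata, ydata):
    X_train, X_test, y_train, y_test = train_test_split(
        Xdata, ydata, test_size=0.50, stratify=ydata, random_state=42)
    X_scaler = StandardScaler()
    X_train = X_scaler.fit_transform(X_train)
    X_test = X_scaler.transform(X_test)

    data = []
    for name, clf in zip(classifier_names, classifier_models):
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        # collect one row of metrics per classifier
        data.append((plot_ROC_AUC(clf, X_test, y_test),
                     metrics.accuracy_score(y_test, y_pred),
                     metrics.brier_score_loss(y_test, y_pred)))

    # build the DataFrame once, after all models have been fitted
    cols = ['ROC_AUC', 'Accuracy', 'Brier_Score']
    return pd.DataFrame(data, columns=cols, index=classifier_names)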