Integrated Brier Score for sklearn's GridSearchCV - python

I have a custom scorer function whose inputs depend on the specific train and validation fold, additionally, the estimator's .predict_survival_function output is also needed. To give a more concrete example:
I am trying to run a GridSearch for a Random Survival Forest (scikit-survival package) with the Integrated Brier Score (IBS)
as the scoring method. The challenge is in the fact that the domain of the IBS is data- (and therefore, fold-) specific as it relies on the Kaplan-Meyer estimate at some point. Moreover, the .predict_survival_function method needs to be called every time during the scoring evaluation step and not only at the end of it.
It seems that I managed to to deal with the first issue by creating the following function:
def IB_time_interval(y_train, y_test):
y_times_tr = [i[2] for i in y_train]
y_times_te = [i[2] for i in y_test]
T1 = np.percentile(y_times_tr, 5, interpolation='higher')
T2 = np.percentile(y_times_tr, 95, interpolation='lower')
T3 = np.percentile(y_times_te, 5, interpolation='higher')
T4 = np.percentile(y_times_te, 95, interpolation='lower')
return np.linspace(np.maximum(T1,T3), np.minimum(T2, T4))
That is robust enough to work throughout all the folds. However, I am unable to retrieve the estimator's predictions during the grid search phase, as a non-fitted copy of it seems to passed instead every time the custom scorer function is called. The workaround that I I tried is to re-fit the estimator inside the scoring function, but not only this is conceptually wrong, it also raises errors.
The custom scorer function looks like the following:
def IB_scorer(y_true, y_pred, times=times_linspace, y=y, clf=rsf):
rsf.fit(X_train,y_train) #<--- = conceptually wrong
survs_test = rsf.predict_survival_function(X_test, return_array=False) #<---
T1, T2 = survs_test[0].x.min(), survs_test[0].x.max()
mask2 = np.logical_or(times >= T2, times < T1) # mask outer interval
times = times[~mask2]
#preds has shape (n_y-s, n_times)
preds_test = np.asarray([[fn(t) for t in times] for fn in survs_test])
return integrated_brier_score(y, y_true, preds_test, times)
and I create the scoring object immediately afterwards:
trial_IB_scorer = make_scorer(IB_scorer, greater_is_better=False)
Any suggestions? It would be great to be able to use GridSearch with more complex scoring functions, especially for the survival analysis case!
PS. I will paste the rest of the minimal working code here:
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sksurv.metrics import integrated_brier_score
from sksurv.datasets import load_breast_cancer
X, y = load_breast_cancer()
X = X.drop(["er", "grade"], axis=1)
y_cens = np.array([i[0] for i in y]) #censoring status 1 or 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2,
shuffle=True,
random_state=0,
stratify = y_cens)
param_grid = {
'max_depth': [4, 20, None],
'max_features': ["sqrt", None]
}
rsf = RandomSurvivalForest(n_jobs=1, random_state=0)
times_linspace = IB_time_interval(y_train, y_test)
clf = GridSearchCV(rsf, param_grid, refit=True, n_jobs=1,
scoring=trial_IB_scorer)
clf.fit(X_train, y_train)
print("final score clf", clf.score(X_train, y_train))
print(clf.best_params_)

Related

When using HalvingRandomSearchCV I get the error "UserWarning: One or more of the test scores are non-finite"

I'm trying to predict Age (→ Y). Previously, I used ElasticNet, however to no avail.
Currently I'm trying to use sklearn's RandomForestRegressor (which works when I'm using RandomizedSearchCV), however HalvingRandomSearchCV turns out to be substantially faster.
That's why I'm trying to upgrade my program to use HalvingRandomSearchCV but doing so creates the error "UserWarning: One or more of the test scores are non-finite" - 4 times in total. (See error message)
Here it's suggested that this is due to the parameters in random_grid. I manipulated these and couldn't resolve the error that way.
Next I looked into rf_random.cv_results_. I found that mean_test_score, split2_test_score and std_test_score are the test scores with NaN values. In all cases the first 72 entries are NaN and the values match the printed error messages.
Yet, I can't find the 1st array to raise the error, which should contain only NaNs.
What's also weird is that the same parameters result in different outcomes.
When looking at rf_random.cv_results_["params"] I find that using the same parameters results in NaN for the 3rd setting of parameters and (e.g.) -9.7 for the 72th setting of parameters.
Final remarks:
X is a (656,91) DataFrame and contains z-score transformed data.
Y is a (656, 1) DataFrame and contains the Age as a float in years.
Right now my code also raises the "UndefinedMetricWarning: R^2 score is not well-defined with less than two samples." warning. To my understanding this is a separate issue and will be the next thing I'll fix.
Thanks
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_halving_search_cv # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.model_selection import KFold
def RandomForestCVRandomSearch(X, Y, n_splits_outer_cv = 3, n_splits_inner_cv = 3):
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 100, num = 10)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
cv_outer = KFold(n_splits = n_splits_outer_cv, random_state=1, shuffle=True)
for i, (train_index, test_index) in enumerate(cv_outer.split(X)):
x_train, x_test = X.loc[train_index,:], X.loc[test_index,:]
y_train, y_test = Y.loc[train_index,:], Y.loc[test_index,:]
rf = RandomForestRegressor(random_state=42)
rf_random = HalvingRandomSearchCV(estimator = rf, param_distributions = random_grid,
cv = n_splits_inner_cv, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(x_train.to_numpy(), y_train.to_numpy().ravel())
rf_score = rf_random.score(x_test.to_numpy(), y_test.to_numpy())
Update:
While the error itself is not clear to me - yet - I overlooked the importance of the "UndefinedMetricWarning: R^2 score is not well-defined with less than two samples." warning.
A fix to the problem is to change the scoring strategy:
rf_random = HalvingRandomSearchCV(estimator = rf, param_distributions = random_grid,
cv = n_splits_inner_cv, verbose=2, random_state=42,
n_jobs = -1, scoring = 'explained_variance')
Update 2
It might be that a bad interaction between my nested CV and the internal CV of HalvingRandomSearchCV caused the issue. As I wrote before n_resources is 6 in the first iteration and we believe that the resources are split across the CV (=3). So that we end up with 2 resources in each fold - which is not enough to calculate an R2-score. That would also explain why in the second iteration, where n_resources increases, the issue doesn't appear.

Python Scikit-Learn RandomizedSearchCV with custom scoring functions

I am using Scikit-Learn's Random Forest Regressor, Pipeline, and RandomizedSearchCV to predict the target variable using some features in my dataset. I need to use my own custom scoring functions that calculate weighted scores using weights (signifying the importance of observations) from the dataset. My code seems to work but I am getting a warning when the grid runs:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for examples using ravel(). self.__final_estimator.fit(Xt, y, **fit_params)
This is related to .fit(X_train, y_train). Based on this warning, if I change the code to .fit(X_train, y_train.values.ravel()) then I cannot get my weighted scores to work. I have tried editing the code in different/appropriate ways to get the weighted scores to work but to no avail.
I am including my code below that runs on a test data in test.csv. The file has four columns: two feature columns ('x1', 'x2'), target ('y') and weight ('weight') columns. The custom scoring functions below are simple functions that calculate weighted rmse_score and mean_abs_error_score. How can I use .fit(X_train, y_train.values.ravel()) and still compute the scores?
import pandas as pd
import numpy as np
import sklearn.model_selection as skms
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def rmse_score(y_true, y_pred, weight):
weight = weight.loc[y_true.index.values]
rmse = np.sqrt(np.mean(weight*(y_true.values-y_pred)**2))
return rmse
def mean_abs_error_score(y_true, y_pred, weight):
weight = weight.loc[y_true.index.values]
mae = np.mean(weight*np.absolute(y_true.values-y_pred))
return mae
#---- reading data
heart_df = pd.read_csv('data\\test.csv')
#---- splitting into training & testing sets
y = heart_df['y']
X = heart_df[['x1', 'x2']]
X_train, X_test, y_train, y_test = skms.train_test_split(X, y, test_size=0.20)
X_train_weights = heart_df['weight'].loc[X_train.index.values]
params = {"weight": X_train_weights}
my_scorer1 = make_scorer(rmse_score, greater_is_better=False, **params)
my_scorer2 = make_scorer(mean_abs_error_score, greater_is_better=False, **params)
#---- random forest training with hyperparameter tuning
pipe = Pipeline([("scaler", StandardScaler()), ("rfr", RandomForestRegressor())])
random_grid = { "rfr__n_estimators": [10, 100, 500, 1000],
"rfr__max_depth": [10, 20, 30, 40, 50, None],
"rfr__max_features": [0.25, 0.50, 0.75],
"rfr__min_samples_split": [5, 10, 20],
"rfr__min_samples_leaf": [3, 5, 10],
"rfr__bootstrap": [True, False]
}
rfr_cv = skms.RandomizedSearchCV(pipe,
param_distributions=random_grid,
n_iter = 15,
cv = 3,
verbose=3,
scoring={'rmse': my_scorer1, 'mae':my_scorer2},
refit = 'rmse',
random_state=42,
n_jobs = -1)
rfr_cv.fit(X_train, y_train)
best_params = rfr_cv.best_params_
best_score = rfr_cv.best_score_
print(f'best hyperparameters = {best_params}')
print(f'best score = {best_score}')

Sklearn: Correct procedure for ElasticNet hyperparameter tuning

I am using ElasticNet to obtain a fit of my data. To determine the hyperparameters (l1, alpha), I am using ElasticNetCV. With the obtained hyperparamers, I refit the model to the whole dataset for production use. I am unsure if this is correct in both, the machine learning aspect and - if so - how I do it. The code "works" and presumably does what it should, but I wanted to be certain that it is also correct.
My procedure is:
X_tr, X_te, y_tr, y_te = train_test_split(X,y)
optimizer = ElasticNetCV(l1_ratio = [.1,.5,.7,.9,.99,1], n_alphas=400, cv=5, normalize=True)
optimizer.fit(X_tr, y_tr)
best = ElasticNet(alpha=optimizer.alpha_, l1_ratio=optimizer.l1_ratio_, normalize=True)
best.fit(X,y)
Thank you in advance
I am a beginner on this but I would love to share my approach to ElasticNet hyperparameters tuning. I would suggest to use RandomizedSearchCV instead. Here is part of the current code I am writing now:
#-----------------------------------------------
# input:
# X_train, X_test, Y_train, Y_test: datasets
# Returns:
# R² and RMSE Scores
#-----------------------------------------------
# Standardize data before
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# define grid
params = dict()
# values for alpha: 100 values between e^-5 and e^5
params['alpha'] = np.logspace(-5, 5, 100, endpoint=True)
# values for l1_ratio: 100 values between 0 and 1
params['l1_ratio'] = np.arange(0, 1, 0.01)
Warning: you are testing 100 x 100 = 10 000 possible combinations.
# Create an instance of the Elastic Net Regressor
regressor = ElasticNet()
# Call the RanddomizedSearch with Cross Validation using the chosen regressor
rs_cv= RandomizedSearchCV(regressor, params, n_iter = 100, scoring=None, cv=5, verbose=0, refit=True)
rs_cv.fit(X_train, Y_train.values.ravel())
# Results
Y_pred = rs_cv.predict(X_test)
R2_score = rs_cv.score(X_test, Y_test)
RMSE_score = np.sqrt(mean_squared_error(Y_test, Y_pred))
return R2_score, RMSE_score, rs_cv.best_params_
The advantage is that in RandomizedSearchCV the number of iterations can be predetermined in advance. The choices of points to be tested are random but 90% (in some cases) faster than GridSearchCV (that tests all possible combinations).
I am using this same approach for other regressors like RandomForests and GradientBoosting who parameters grids are far more complicated and demand much more computer power to run.
As I said at the beginning I am new to this field, so any constructive comment will be welcomed.
Johnny

How to get the prediction probabilities using cross validation in scikit-learn

I am using RandomForestClassifier as follows using cross validation for a binary classification (class labels are 0 and 1).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring = 'accuracy')
print("Accuracy: " + str(round(100*accuracy.mean(), 2)) + "%")
f1 = cross_val_score(clf, X, y, cv=k_fold, scoring = 'f1_weighted')
print("F Measure: " + str(round(100*f1.mean(), 2)) + "%")
Now I want to order my data using prediction probabilities of class 1 with cross validation results. For that I tried the following two ways.
pred = clf.predict_proba(X)[:,1]
print(pred)
probs = clf.predict_proba(X)
best_n = np.argsort(probs, axis=1)[:,-6:]
I get the following error
NotFittedError: This RandomForestClassifier instance is not fitted
yet. Call 'fit' with appropriate arguments before using this method.
for both the situations.
I am just wondering where I am making things wrong.
I am happy to provide more details if needed.
In case, you want to use the CV model for a unseen data point/s, use the following approach.
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = RandomForestClassifier(n_estimators=10, random_state = 42, class_weight="balanced")
cv_results = cross_validate(clf, X, y, cv=3, return_estimator=True)
clf_fold_0 = cv_results['estimator'][0]
clf_fold_0.predict_proba([iris.data[133]])
# array([[0. , 0.5, 0.5]])
Have a look at the documentation it specifies that the probability is calculated based on the mean results from the trees.
In your case, you first need to call the fit() method to generate the tress in the model. Once you fit the model on the training data, you can call the predict_proba() method.
This is also specified in the error.
# Fit model
model = RandomForestClassifier(...)
model.fit(X_train, Y_train)
# Probabilty
model.predict_proba(X)[:,1]
I solved my problem using the following code:
proba = cross_val_predict(clf, X, y, cv=k_fold, method='predict_proba')
print(proba[:,1])
print(np.argsort(proba[:,1]))

Using explicit (predefined) validation set for grid search with sklearn

I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms.
I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to input the validation set explicitly into sklearn.grid_search.GridSearchCV(). Below is some code I've previously used for doing K-fold cross-validation on the training set. However, for this problem I need to use the validation set as given. How can I do that?
from sklearn import svm, cross_validation
from sklearn.grid_search import GridSearchCV
# (some code left out to simplify things)
skf = cross_validation.StratifiedKFold(y_train, n_folds=5, shuffle = True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
class_weight=penalty_weights),
param_grid=tuned_parameters,
n_jobs=2,
pre_dispatch="n_jobs",
cv=skf,
scoring=scorer)
clf.fit(X_train, y_train)
Use PredefinedSplit
ps = PredefinedSplit(test_fold=your_test_fold)
then set cv=ps in GridSearchCV
test_fold : “array-like, shape (n_samples,)
test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.
Also see here
when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
Consider using the hypopt Python package (pip install hypopt) for which I am an author. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
param_grid = [
{'C': [1, 10, 100], 'kernel': ['linear']},
{'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model = SVR(), param_grid = param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
EDIT: I (think I) received -1's on this response because I'm suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.
# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit
# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, stratify = y,random_state = 2020)
# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]
# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator = estimator,
cv=pds,
param_grid=param_grid)
# Fit with all data
clf.fit(X, y)
To add to the #Vinubalan's answer, when the train-valid-test split is not done with Scikit-learn's train_test_split() function, i.e., the dataframes are already split manually beforehand and scaled/normalized so as to prevent leakage from training data, the numpy arrays can be concatenated.
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
from sklearn.model_selection import PredefinedSplit, GridSearchCV
split_index = [-1]*len(X_train) + [0]*len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)
pds = PredefinedSplit(test_fold = split_index)
clf = GridSearchCV(estimator = estimator,
cv=pds,
param_grid=param_grid)
# Fit with all data
clf.fit(X, y)
I wanted to provide some reproducible code that creates a validation split using the last 20% of observations.
from sklearn import datasets
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
# load data
df_train = datasets.fetch_california_housing(as_frame=True).data
y = datasets.fetch_california_housing().target
param_grid = {"max_depth": [5, 6],
'learning_rate': [0.03, 0.06],
'subsample': [.5, .75]
}
model = GradientBoostingRegressor()
# Create a single validation split
val_prop = .2
n_val_rows = round(len(df_train) * val_prop)
val_starting_index = len(df_train) - n_val_rows
cv = PredefinedSplit([-1 if i < val_starting_index else 0 for i in df_train.index])
# Use PredefinedSplit in GridSearchCV
results = GridSearchCV(estimator = model,
cv=cv,
param_grid=param_grid,
verbose=True,
n_jobs=-1)
# Fit with all data
results.fit(df_train, y)
results.best_params_
The cv argument of the SearchCV i.e. Grid or Random can just be an iterable of indices too for train and validation split i.e. cv=((train_idcs, val_idcs),).
Note that the data on which the search classifier will be fit should be the train+val set and the indices specified will be used by the sklearn to separate them internally. Additionally, when working with dataframes, the indices specified should be accessible as ilocs, so reset indices (don't drop them if they will be required later).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
train_test_split,
RandomizedSearchCV,
)
data = load_iris(as_frame=True)["frame"]
# These indices will serves as explicit and predefined split
train_idcs, val_idcs = train_test_split(
data.index,
random_state=42,
stratify=data.target,
)
param_grid = dict(
n_estimators=[50,100,150,200],
max_samples=[0.85,0.9,0.95,1],
max_depth=[3,5,7,10],
max_features=["sqrt", "log2", 0.85, 0.9, 0.95, 1],
)
search_clf = RandomizedSearchCV(
estimator=RandomForestClassifier(),
param_distributions=param_grid,
n_iter=50,
cv=((train_idcs, val_idcs),), # explicit predefined split in terms of indices
random_state=42,
)
# X is the first 4 columns i.e. the sepal and petal widths and lengths
# and y is the 5th column i.e. target column
search_clf.fit(X=data.iloc[:,:4], y=data.target)
Also, be mindful if you want to refit on the whole data or only on the train data and thus retrain the classifier using the best fit parameters accordingly.

Categories