Python Scikit-Learn RandomizedSearchCV with custom scoring functions - python

I am using Scikit-Learn's Random Forest Regressor, Pipeline, and RandomizedSearchCV to predict the target variable using some features in my dataset. I need to use my own custom scoring functions that calculate weighted scores using weights (signifying the importance of observations) from the dataset. My code seems to work but I am getting a warning when the grid runs:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for examples using ravel(). self.__final_estimator.fit(Xt, y, **fit_params)
This is related to .fit(X_train, y_train). Based on this warning, if I change the code to .fit(X_train, y_train.values.ravel()) then I cannot get my weighted scores to work. I have tried editing the code in different/appropriate ways to get the weighted scores to work but to no avail.
I am including my code below; it runs on test data in test.csv. The file has four columns: two feature columns ('x1', 'x2'), a target column ('y') and a weight column ('weight'). The custom scoring functions below are simple functions that calculate a weighted rmse_score and mean_abs_error_score. How can I use .fit(X_train, y_train.values.ravel()) and still compute the scores?
import pandas as pd
import numpy as np
import sklearn.model_selection as skms
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def rmse_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    rmse = np.sqrt(np.mean(weight*(y_true.values-y_pred)**2))
    return rmse
def mean_abs_error_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    mae = np.mean(weight*np.absolute(y_true.values-y_pred))
    return mae
#---- reading data
heart_df = pd.read_csv('data\\test.csv')
#---- splitting into training & testing sets
y = heart_df['y']
X = heart_df[['x1', 'x2']]
X_train, X_test, y_train, y_test = skms.train_test_split(X, y, test_size=0.20)
X_train_weights = heart_df['weight'].loc[X_train.index.values]
params = {"weight": X_train_weights}
my_scorer1 = make_scorer(rmse_score, greater_is_better=False, **params)
my_scorer2 = make_scorer(mean_abs_error_score, greater_is_better=False, **params)
#---- random forest training with hyperparameter tuning
pipe = Pipeline([("scaler", StandardScaler()), ("rfr", RandomForestRegressor())])
random_grid = { "rfr__n_estimators": [10, 100, 500, 1000],
"rfr__max_depth": [10, 20, 30, 40, 50, None],
"rfr__max_features": [0.25, 0.50, 0.75],
"rfr__min_samples_split": [5, 10, 20],
"rfr__min_samples_leaf": [3, 5, 10],
"rfr__bootstrap": [True, False]
}
rfr_cv = skms.RandomizedSearchCV(pipe,
param_distributions=random_grid,
n_iter = 15,
cv = 3,
verbose=3,
scoring={'rmse': my_scorer1, 'mae':my_scorer2},
refit = 'rmse',
random_state=42,
n_jobs = -1)
rfr_cv.fit(X_train, y_train)
best_params = rfr_cv.best_params_
best_score = rfr_cv.best_score_
print(f'best hyperparameters = {best_params}')
print(f'best score = {best_score}')
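One possible way around the problem described above, offered as a sketch rather than a confirmed fix for the exact data in the question: instead of make_scorer, pass scorer callables with the signature (estimator, X, y). As long as X is passed to fit as a DataFrame, each validation slice keeps its index, so the weights can be looked up from X.index and y can be a plain 1-d array. The synthetic data, the reduced parameter grid and the helper names (make_weighted_rmse, make_weighted_mae) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for test.csv (hypothetical data, same four columns).
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "x1": rng.rand(60),
    "x2": rng.rand(60),
    "weight": rng.rand(60) + 0.5,
})
df["y"] = 2 * df["x1"] - df["x2"] + 0.1 * rng.randn(60)

X = df[["x1", "x2"]]      # DataFrame: keeps its index inside each CV fold
y = df["y"].to_numpy()    # plain 1-d array, no index needed
weights = df["weight"]    # Series indexed like X


def make_weighted_rmse(weights):
    """Build a scorer(estimator, X, y) that looks weights up via X.index."""
    def scorer(estimator, X_fold, y_fold):
        w = weights.loc[X_fold.index].to_numpy()
        y_pred = estimator.predict(X_fold)
        # Negated so that "greater is better" holds for the search.
        return -np.sqrt(np.mean(w * (y_fold - y_pred) ** 2))
    return scorer


def make_weighted_mae(weights):
    def scorer(estimator, X_fold, y_fold):
        w = weights.loc[X_fold.index].to_numpy()
        y_pred = estimator.predict(X_fold)
        return -np.mean(w * np.abs(y_fold - y_pred))
    return scorer


pipe = Pipeline([("scaler", StandardScaler()),
                 ("rfr", RandomForestRegressor(random_state=0))])
search = RandomizedSearchCV(
    pipe,
    param_distributions={"rfr__n_estimators": [10, 20],
                         "rfr__min_samples_leaf": [1, 3, 5]},
    n_iter=3,
    cv=3,
    scoring={"rmse": make_weighted_rmse(weights),
             "mae": make_weighted_mae(weights)},
    refit="rmse",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The scores are returned already negated, which mirrors what greater_is_better=False does in make_scorer.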

Related

Integrated Brier Score for sklearn's GridSearchCV

I have a custom scorer function whose inputs depend on the specific train and validation fold, additionally, the estimator's .predict_survival_function output is also needed. To give a more concrete example:
I am trying to run a GridSearch for a Random Survival Forest (scikit-survival package) with the Integrated Brier Score (IBS) as the scoring method. The challenge is that the domain of the IBS is data- (and therefore fold-) specific, as it relies on the Kaplan-Meier estimate at some point. Moreover, the .predict_survival_function method needs to be called every time during the scoring evaluation step and not only at the end of it.
It seems that I managed to deal with the first issue by creating the following function:
def IB_time_interval(y_train, y_test):
    y_times_tr = [i[2] for i in y_train]
    y_times_te = [i[2] for i in y_test]
    T1 = np.percentile(y_times_tr, 5, interpolation='higher')
    T2 = np.percentile(y_times_tr, 95, interpolation='lower')
    T3 = np.percentile(y_times_te, 5, interpolation='higher')
    T4 = np.percentile(y_times_te, 95, interpolation='lower')
    return np.linspace(np.maximum(T1,T3), np.minimum(T2, T4))
That is robust enough to work throughout all the folds. However, I am unable to retrieve the estimator's predictions during the grid search phase, as a non-fitted copy of it seems to be passed every time the custom scorer function is called. The workaround that I tried is to re-fit the estimator inside the scoring function, but not only is this conceptually wrong, it also raises errors.
The custom scorer function looks like the following:
def IB_scorer(y_true, y_pred, times=times_linspace, y=y, clf=rsf):
    rsf.fit(X_train,y_train) #<--- = conceptually wrong
    survs_test = rsf.predict_survival_function(X_test, return_array=False) #<---
    T1, T2 = survs_test[0].x.min(), survs_test[0].x.max()
    mask2 = np.logical_or(times >= T2, times < T1) # mask outer interval
    times = times[~mask2]
    #preds has shape (n_y-s, n_times)
    preds_test = np.asarray([[fn(t) for t in times] for fn in survs_test])
    return integrated_brier_score(y, y_true, preds_test, times)
and I create the scoring object immediately afterwards:
trial_IB_scorer = make_scorer(IB_scorer, greater_is_better=False)
Any suggestions? It would be great to be able to use GridSearch with more complex scoring functions, especially for the survival analysis case!
PS. I will paste the rest of the minimal working code here:
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sksurv.metrics import integrated_brier_score
from sksurv.datasets import load_breast_cancer
X, y = load_breast_cancer()
X = X.drop(["er", "grade"], axis=1)
y_cens = np.array([i[0] for i in y]) #censoring status 1 or 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2,
shuffle=True,
random_state=0,
stratify = y_cens)
param_grid = {
'max_depth': [4, 20, None],
'max_features': ["sqrt", None]
}
rsf = RandomSurvivalForest(n_jobs=1, random_state=0)
times_linspace = IB_time_interval(y_train, y_test)
clf = GridSearchCV(rsf, param_grid, refit=True, n_jobs=1,
scoring=trial_IB_scorer)
clf.fit(X_train, y_train)
print("final score clf", clf.score(X_train, y_train))
print(clf.best_params_)
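A hedged note on the mechanism involved: besides make_scorer objects, the scoring argument also accepts a plain callable with the signature (estimator, X, y). During the search that callable is handed the estimator already fitted on the training part of the current fold, together with that fold's validation data, so any prediction method can be called inside it without re-fitting. The sketch below shows this with plain scikit-learn only; LogisticRegression and predict_proba are stand-ins (assumptions for illustration) for RandomSurvivalForest and predict_survival_function, and the Brier-type score is a simplified binary version, not the IBS.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.utils.validation import check_is_fitted

X, y = make_classification(n_samples=200, n_features=5, random_state=0)


def neg_brier_scorer(estimator, X_fold, y_fold):
    """Scorer with the (estimator, X, y) signature.

    The estimator arrives fitted on the fold's training data, so this
    check passes and no re-fitting is needed inside the scorer.
    """
    check_is_fitted(estimator)
    proba = estimator.predict_proba(X_fold)[:, 1]
    # Negated because the search maximises the returned value.
    return -np.mean((proba - y_fold) ** 2)


search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring=neg_brier_scorer,
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

In the survival setting the same shape would apply: compute the fold-specific time grid from y_fold inside the scorer and call the estimator's survival-function method there. That part is left out here because it depends on scikit-survival.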

Optimizing learning rate and number of estimators for multioutput gradient boosting

I have a dataset with multiple outputs and am trying to use gradient boosting to predict all the values at once. I imported MultiOutputRegressor so multiple outputs can be predicted at once; I'm able to make it work for the default gradient boosting function. However, I'm running into an error when I try to optimize the gradient boosting function for each output.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn import ensemble
params = {'max_depth': 3, 'n_estimators': 100, 'learning_rate': 0.1}
gradient_regressor = MultiOutputRegressor(ensemble.GradientBoostingRegressor(**params))
GradBoostModel = gradient_regressor.fit(X_train, y_train)
prediction_GradBoost = GradBoostModel.predict(X_test)
LR = {'learning_rate':[0.15, 0.125, 0.1, 0.75, 0.05], 'n_estimators':[50, 75, 100, 150, 200, 250, 300, 400]}
tuning = GridSearchCV(estimator = GradBoostModel, param_grid = LR, scoring = 'r2')
tuning.fit(X_train, y_train)
tuning.best_params_, tuning.best_score_
I'm trying to use GridSearchCV to cycle through the listed learning rates and number of estimators to find the optimal values. But, I get the following error:
Invalid parameter learning_rate for estimator MultiOutputRegressor.
Check the list of available parameters with `estimator.get_params().keys()`
I think I understand the reason for the error: when I try to optimize the gradient boosting parameters, they are passed through the MultiOutputRegressor, which doesn't recognize them. Is this the case? Also, how can I change my code, such that I can optimize these parameters for each output?
Indeed, the parameters of the wrapped estimator are prefixed with estimator__. In general, to find out which parameter names to use for nested estimators or pipeline steps, call .get_params().keys() on your model, e.g.:
print(GradBoostModel.get_params().keys())
dict_keys(['estimator__alpha', 'estimator__ccp_alpha', 'estimator__criterion', 'estimator__init', 'estimator__learning_rate',...
Full working example with the linnerud dataset:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
# Data
rng = np.random.RandomState(0)
X, y = load_linnerud(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
# Model
params = {'max_depth': 3, 'n_estimators': 100, 'learning_rate': 0.1}
gradient_regressor = MultiOutputRegressor(GradientBoostingRegressor(**params))
GradBoostModel = gradient_regressor.fit(X_train, y_train)
prediction_GradBoost = GradBoostModel.predict(X_test)
LR = {'estimator__learning_rate': [0.15, 0.125, 0.1, 0.75, 0.05], 'estimator__n_estimators': [50, 75, 100, 150, 200, 250, 300, 400]}
print('Params from GradBoostModel', GradBoostModel.get_params().keys())
tuning = GridSearchCV(estimator=GradBoostModel, param_grid=LR, scoring='r2')
tuning.fit(X_train, y_train)
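The search above tunes one shared set of parameters for all outputs. If the goal is a separate optimum per output, as the question asks, one possible arrangement is to invert the nesting and wrap a GridSearchCV inside MultiOutputRegressor, so that each output column gets its own search. This is a minimal sketch under that assumption; the small grid and cv=3 are illustrative choices, not values from the original post.

```python
from sklearn.datasets import load_linnerud
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor

X, y = load_linnerud(return_X_y=True)  # y has three output columns

# No "estimator__" prefix here: the grid applies directly to the
# GradientBoostingRegressor that GridSearchCV wraps.
grid = {"learning_rate": [0.05, 0.1], "n_estimators": [50, 100]}
inner_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0), grid, scoring="r2", cv=3
)

# MultiOutputRegressor clones inner_search once per output column.
per_output = MultiOutputRegressor(inner_search).fit(X, y)

per_output_params = [s.best_params_ for s in per_output.estimators_]
for i, best in enumerate(per_output_params):
    print(f"output {i}: {best}")
```

Each entry of per_output.estimators_ is a fitted GridSearchCV, so best_score_ and cv_results_ are available per output as well.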

How to perform cross-validation of a random-forest model in scikit-learn?

I need to perform leave-one-out cross validation of RF model.
I successfully built a model with high predictive ability.
Now I need to perform LOO test prior to the publication.
Here is my code:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
FC_data = pd.read_excel('C:\\Users\\Dre\\Desktop\\My Papers\\Furocoumarins_paper_2018\\Furocoumarins_NEW1.xlsx', index_col=0)
FC_data.head()
# Create correlation matrix
corr_matrix = FC_data.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
FC_data1 = FC_data.drop(FC_data[to_drop], axis=1)
y = FC_data1.LogFiT
X = FC_data1.drop(['LogFiT', 'LogS'], axis=1)
X_train = X.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
X_train.head(21)
y_train = y.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
y_train.head(21)
X_test = X.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"]]
X_test.head(5)
y_test = y.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"]]
y_test.head(5)
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
randomforest = RandomForestRegressor(n_jobs=-1)
selector = SelectFromModel(randomforest)
features_important = selector.fit_transform(X_train, y_train)
model = randomforest.fit(features_important, y_train)
from sklearn.model_selection import GridSearchCV
clf_rf = RandomForestRegressor()
parameters = {"n_estimators":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100], "max_depth":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100]}
grid_search_cv_clf = GridSearchCV(clf_rf, parameters, cv=5)
grid_search_cv_clf.fit(features_important, y_train)
from sklearn.metrics import r2_score
y_pred = grid_search_cv_clf.predict(features_important)
r2_score(y_train, y_pred)
grid_search_cv_clf.best_params_
best_clf = grid_search_cv_clf.best_estimator_
X_test_filtered = X_test.iloc[:,selector.get_support()]
best_clf.score(X_test_filtered, y_test)
feature_importances = best_clf.feature_importances_
feature_importances_df = pd.DataFrame({'features': X_test_filtered.columns.values,
'feature_importances':feature_importances})
importances = feature_importances_df.sort_values('feature_importances', ascending=False)
importances.head(25)
Now I need q2 value.
Finally, I wrote this code and got a reasonably high score: 0.9071543776303185.
from sklearn.model_selection import LeaveOneOut
parameters = {"n_estimators":[4], "max_depth":[20]}
loo_clf = GridSearchCV(best_clf, parameters, cv=LeaveOneOut())
loo_clf.fit(features_important, y_train)
loo_clf.score(features_important, y_train)
I'm not sure if it is q2 or not. What do you think?
I also decided to obtain 5-fold cross-validation score. However, it gives ridiculous values like, for example: -36.58997717, 0.76801832, -1.59900448, 0.1834304 , -2.38256389 and a mean of -7.924019361863889.
from sklearn.model_selection import cross_val_score
cvs=cross_val_score(best_clf, features_important, y_train)
mean_cross_val_score = cvs.mean()
mean_cross_val_score
Probably, there is a way to fix it?
You should not run the hyper-parameter search first and then evaluate the model on the same data. Instead, you should nest the two cross-validations; otherwise, you are leaking some information. To learn more about this, look at the following example from the scikit-learn documentation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py
Therefore, in your particular use-case, you should use: GridSearchCV, SelectFromModel, and cross_val_score:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
X, y = make_regression(n_samples=100)
feature_selector = SelectFromModel(
RandomForestRegressor(n_jobs=-1), threshold="mean"
)
pipe = make_pipeline(
feature_selector, RandomForestRegressor(n_jobs=-1)
)
param_grid = {
# define the grid of the random-forest for the feature selection
"selectfrommodel__estimator__n_estimators": [10, 20],
"selectfrommodel__estimator__max_depth": [3, 5],
# define the grid of the random-forest for the prediction
"randomforestregressor__n_estimators": [10, 20],
"randomforestregressor__max_depth": [5, 8],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1, cv=3)
# You can use LOO in this way. Be aware that this is not good practice:
# it leads to large variance when evaluating your model.
# scores = cross_val_score(grid_search, X, y, cv=LeaveOneOut(), error_score='raise')
# Pass the grid search (not the bare pipeline) so that the two CVs are nested.
scores = cross_val_score(grid_search, X, y, cv=2, error_score='raise')
scores.mean()
You need to specify the scoring and the cv arguments.
Use this:
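from sklearn.model_selection import LeaveOneOut, cross_val_score
mycv = LeaveOneOut()
cvs=cross_val_score(best_clf, features_important, y_train, scoring='r2',cv = mycv)
mean_cross_val_score = cvs.mean()
print(mean_cross_val_score)
This is intended to return the mean cross-validated R2 score using LOOCV. Note, however, that R2 is not well defined on a test fold containing a single sample, so with LeaveOneOut the per-fold scores may come back as NaN; pooling the held-out predictions and scoring them once avoids that.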
For more scoring options see here: https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
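For the q2 value asked about in the question, one common definition is 1 - PRESS / SS_tot, computed from the pooled leave-one-out predictions. A minimal sketch under that assumption follows; the synthetic data and the forest settings are placeholders, not the furocoumarin data from the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Placeholder data standing in for (features_important, y_train).
X, y = make_regression(n_samples=40, n_features=5, noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Each sample is predicted by a model fitted on all the other samples.
y_loo = cross_val_predict(model, X, y, cv=LeaveOneOut())

# R2 of the pooled held-out predictions: 1 - PRESS / SS_tot.
q2 = r2_score(y, y_loo)
print(f"q2 (LOO) = {q2:.3f}")
```

Scoring the pooled predictions once sidesteps the single-sample folds on which a per-fold R2 cannot be computed.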

SKLearn Naive Bayes: add feature after tfidf vectorization

So I have been tasked with training a model on phone call transcripts. The following code does this. A little background info:
- x is a list of strings, each ith element is an entire transcript
- y is a list of booleans, stating the outcome of a call being positive or negative.
The following code works, but here is my issue. I want to include call duration as a feature to train on. I'd assume after the TFIDF transformer that vectorizes the transcripts, I would just concatenate the call duration feature to the TFIDF output right? Maybe this is easier than I think, but I have the transcripts and the durations all in the pandas data frame you see at the beginning of the code. So if I have that data frame column (numpy array) of durations, what do I need to do to add that feature into my model?
Additional Questions:
Am I missing a fundamental assumption about Naive Bayes model that limits me to vectorized strings?
At which step in my pipeline do I add the new feature?
Can this even be done in a pipeline or do I have to break it apart to do something like this?
Code:
import numpy as np
import pandas as pd
import random
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import SelectPercentile
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import chi2
def main():
    filename = 'QA_training.pkl'
    splitRatio = 0.67
    dataframe = loadData(filename)
    x, y = getTrainingData(dataframe)
    print(len(x), len(y))
    x_train, x_test = splitDataset(x, splitRatio)
    y_train, y_test = splitDataset(y, splitRatio)
    #x_train = np.asarray(x_train)
    percentiles = [10, 15, 20, 25, 30, 35, 40, 45, 50]
    MNNB_pipe = Pipeline([('vec', CountVectorizer()),('tfidf', TfidfTransformer()),('select', SelectPercentile(score_func=chi2)),('clf', MultinomialNB())])
    MNNB_param_grid = {
        #'vec__max_features': (10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000),
        'tfidf__use_idf': (True, False),
        'tfidf__sublinear_tf': (True, False),
        'vec__binary': (True, False),
        'tfidf__norm': ('l1', 'l2'),
        'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
        'select__percentile': percentiles
    }
    MNNB_search = GridSearchCV(MNNB_pipe, param_grid=MNNB_param_grid, cv=10, scoring='roc_auc', n_jobs=-1, verbose=1)
    MNNB_search = MNNB_search.fit(x_train, y_train)
    MNNB_search_best_cv = cross_val_score(MNNB_search.best_estimator_, x_train, y_train, cv=10, scoring='roc_auc', n_jobs=-1, verbose=10)
    SGDC_pipe = Pipeline([('vec', CountVectorizer()),('tfidf', TfidfTransformer()),('select', SelectPercentile(score_func=chi2)),('clf', SGDClassifier())])
    SGDC_param_grid = {
        #'vec__max_features': [10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000],
        'tfidf__use_idf': [True, False],
        'tfidf__sublinear_tf': [True, False],
        'vec__binary': [True, False],
        'tfidf__norm': ['l1', 'l2'],
        'clf__loss': ['modified_huber','log'],
        'clf__penalty': ['l1','l2'],
        'clf__alpha': [1e-3],
        'clf__n_iter': [5,10],
        'clf__random_state': [42],
        'select__percentile': percentiles
    }
    SGDC_search = GridSearchCV(SGDC_pipe, param_grid=SGDC_param_grid, cv=10, scoring='roc_auc', n_jobs=-1, verbose=1)
    SGDC_search = SGDC_search.fit(x_train, y_train)
    SGDC_search_best_cv = cross_val_score(SGDC_search.best_estimator_, x_train, y_train, cv=10, scoring='roc_auc', n_jobs=-1, verbose=10)
    # pre_SGDC = SGDC_clf.predict(x_test)
    # print (np.mean(pre_SGDC == y_test))
    mydata = [{'model': MNNB_search.best_estimator_.named_steps['clf'],'features': MNNB_search.best_estimator_.named_steps['select'], 'mean_cv_scores': MNNB_search_best_cv.mean()},
              #{'model': GNB_search.best_estimator_.named_steps['classifier'],'features': GNB_search.best_estimator_.named_steps['select'], 'mean_cv_scores': GNB_search_best_cv.mean()},
              {'model': SGDC_search.best_estimator_.named_steps['clf'],'features': SGDC_search.best_estimator_.named_steps['select'], 'mean_cv_scores': SGDC_search_best_cv.mean()}]
    model_results_df = pd.DataFrame(mydata)
    model_results_df.to_csv("best_model_results.csv")
As far as I'm aware, sklearn pipelines are API driven -- There's no real magic that happens in the pipeline itself. So, from that perspective, you should be able to create your own wrapper around TfidfVectorizer that does what you want it to do. For example, let's assume that you have a DataFrame that looks like this:
df = pd.DataFrame({'text': ['foo text', 'bar text'], 'duration': [1, 2]})
you could probably implement your transform as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
class MyVectorizer(object):
    def __init__(self, tfidf_kwargs=None):
        # "or {}" (not "or None") so that ** unpacking works without kwargs
        self._tfidf = TfidfVectorizer(**(tfidf_kwargs or {}))
    def fit(self, X, y=None):
        self._tfidf.fit(X['text'], y)
        return self
    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)
    def transform(self, X):
        result = self._tfidf.transform(X['text'])
        # result is a sparse matrix. I'm not sure of a clean way
        # to add a column to a sparse matrix. If you have the
        # memory, you can use a dense matrix instead...
        return np.column_stack((result.toarray(), X['duration']))
And then I think you should be all set to use this instead of the original tfidf vectorizer.
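As an alternative to a hand-written wrapper, scikit-learn's ColumnTransformer can apply TfidfVectorizer to the text column and pass the duration column through unchanged, then stack the two blocks (sparse or dense) automatically. The following is a minimal sketch, assuming a DataFrame with 'text' and 'duration' columns as in the example above; the four toy transcripts and labels are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): transcripts, call durations, call outcomes.
df = pd.DataFrame({
    "text": ["great call thanks", "bad call angry",
             "thanks great service", "angry bad service"],
    "duration": [120, 45, 300, 30],
})
labels = [True, False, True, False]

features = ColumnTransformer([
    # A string selector hands the vectorizer a 1-d column of documents.
    ("tfidf", TfidfVectorizer(), "text"),
    # A list selector keeps the numeric column as a 2-d block.
    ("duration", "passthrough", ["duration"]),
])

model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(df, labels)

n_features = model.named_steps["features"].transform(df).shape[1]
predictions = model.predict(df)
print(n_features, predictions)
```

MultinomialNB only requires non-negative features, so a raw duration is accepted, but it is on a very different scale from the TF-IDF values; rescaling or binning it first may be worth trying.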

Using explicit (predefined) validation set for grid search with sklearn

I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms.
I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to input the validation set explicitly into sklearn.grid_search.GridSearchCV(). Below is some code I've previously used for doing K-fold cross-validation on the training set. However, for this problem I need to use the validation set as given. How can I do that?
from sklearn import svm, cross_validation
from sklearn.grid_search import GridSearchCV
# (some code left out to simplify things)
skf = cross_validation.StratifiedKFold(y_train, n_folds=5, shuffle = True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
class_weight=penalty_weights),
param_grid=tuned_parameters,
n_jobs=2,
pre_dispatch="n_jobs",
cv=skf,
scoring=scorer)
clf.fit(X_train, y_train)
Use PredefinedSplit:
ps = PredefinedSplit(test_fold=your_test_fold)
Then set cv=ps in GridSearchCV. From the documentation:
test_fold : array-like, shape (n_samples,)
test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.
When using a validation set, set test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
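To make the -1 / 0 convention concrete, here is a small sketch on five hypothetical samples, where the first three are training rows and the last two form the validation set:

```python
from sklearn.model_selection import PredefinedSplit

# -1: always in the training part; 0: belongs to validation fold 0.
test_fold = [-1, -1, -1, 0, 0]
ps = PredefinedSplit(test_fold=test_fold)

splits = list(ps.split())
train_idx, val_idx = splits[0]
print(ps.get_n_splits(), train_idx, val_idx)
```

Only one split is generated, because 0 is the only non-negative fold label.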
Consider using the hypopt Python package (pip install hypopt) for which I am an author. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR
param_grid = [
{'C': [1, 10, 100], 'kernel': ['linear']},
{'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model = SVR(), param_grid = param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
EDIT: I (think I) received -1's on this response because I'm suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.
# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit
# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, stratify = y,random_state = 2020)
# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]
# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator = estimator,
cv=pds,
param_grid=param_grid)
# Fit with all data
clf.fit(X, y)
To add to @Vinubalan's answer: when the train-valid-test split is not done with Scikit-learn's train_test_split() function, i.e., the dataframes are already split manually beforehand and scaled/normalized so as to prevent leakage from the training data, the numpy arrays can be concatenated.
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
from sklearn.model_selection import PredefinedSplit, GridSearchCV
split_index = [-1]*len(X_train) + [0]*len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)
pds = PredefinedSplit(test_fold = split_index)
clf = GridSearchCV(estimator = estimator,
cv=pds,
param_grid=param_grid)
# Fit with all data
clf.fit(X, y)
I wanted to provide some reproducible code that creates a validation split using the last 20% of observations.
from sklearn import datasets
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
# load data
df_train = datasets.fetch_california_housing(as_frame=True).data
y = datasets.fetch_california_housing().target
param_grid = {"max_depth": [5, 6],
'learning_rate': [0.03, 0.06],
'subsample': [.5, .75]
}
model = GradientBoostingRegressor()
# Create a single validation split
val_prop = .2
n_val_rows = round(len(df_train) * val_prop)
val_starting_index = len(df_train) - n_val_rows
cv = PredefinedSplit([-1 if i < val_starting_index else 0 for i in df_train.index])
# Use PredefinedSplit in GridSearchCV
results = GridSearchCV(estimator = model,
cv=cv,
param_grid=param_grid,
verbose=True,
n_jobs=-1)
# Fit with all data
results.fit(df_train, y)
results.best_params_
The cv argument of the search classes (GridSearchCV or RandomizedSearchCV) can also simply be an iterable of (train, validation) index pairs, i.e. cv=((train_idcs, val_idcs),).
Note that the data on which the search is fit should be the train+val set; sklearn uses the specified indices to separate the two internally. Additionally, when working with dataframes, the specified indices must be positional (usable with iloc), so reset the index first (keep the old index as a column if it will be required later).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
train_test_split,
RandomizedSearchCV,
)
data = load_iris(as_frame=True)["frame"]
# These indices will serve as the explicit, predefined split
train_idcs, val_idcs = train_test_split(
data.index,
random_state=42,
stratify=data.target,
)
param_grid = dict(
n_estimators=[50,100,150,200],
max_samples=[0.85,0.9,0.95,1],
max_depth=[3,5,7,10],
max_features=["sqrt", "log2", 0.85, 0.9, 0.95, 1],
)
search_clf = RandomizedSearchCV(
estimator=RandomForestClassifier(),
param_distributions=param_grid,
n_iter=50,
cv=((train_idcs, val_idcs),), # explicit predefined split in terms of indices
random_state=42,
)
# X is the first 4 columns i.e. the sepal and petal widths and lengths
# and y is the 5th column i.e. target column
search_clf.fit(X=data.iloc[:,:4], y=data.target)
Also, be mindful of whether you want the final model refit on the whole data (train + validation) or only on the training data, and retrain the classifier with the best parameters accordingly.
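With the default refit=True, the search refits the best configuration on everything passed to fit, i.e. train and validation rows together. If the final model should see the training rows only, one option is to set refit=False and fit a fresh clone manually. This is a sketch under that assumption; it uses GridSearchCV and a small grid on iris purely for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)

# Positional indices of the predefined train / validation split.
train_idx, val_idx = train_test_split(
    np.arange(len(X)), stratify=y, random_state=42
)

base = RandomForestClassifier(random_state=0)
search = GridSearchCV(
    base,
    param_grid={"n_estimators": [20, 50], "max_depth": [3, 5]},
    cv=[(train_idx, val_idx)],  # single explicit split
    refit=False,                # do not refit on train + validation
)
search.fit(X, y)

# Retrain on the training rows only, using the selected parameters.
final_model = clone(base).set_params(**search.best_params_)
final_model.fit(X[train_idx], y[train_idx])
val_accuracy = final_model.score(X[val_idx], y[val_idx])
print(search.best_params_, val_accuracy)
```

With refit=False the search object has no best_estimator_ attribute, so best_params_ is the thing to carry forward.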
