Performing grid search with a predefined validation set Sklearn - python

This question has been asked several times before, but I get an error when following the answer.
First I specify which part is the training set and which is the validation set, as follows:
my_test_fold = []
for i in range(len(train_x)):
    my_test_fold.append(-1)
for i in range(len(test_x)):
    my_test_fold.append(0)
Then the grid search is performed.
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from xgboost import XGBClassifier

param = {
    'n_estimators': [200],
    'max_depth': [5],
    'min_child_weight': [3],
    'reg_alpha': [6],
    'gamma': [0.6],
    'scale_pos_weight': [1],
    'learning_rate': [0.09]
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(objective='reg:linear',
                                                seed=1),
                        param_grid=param,
                        scoring='roc_auc',
                        cv=PredefinedSplit(test_fold=my_test_fold),
                        verbose=1)
gsearch1.fit(new_data_df, df_y)
But I get the following error:
object of type 'PredefinedSplit' has no len()

Try to replace
cv = PredefinedSplit(test_fold=my_test_fold)
with
cv = list(PredefinedSplit(test_fold=my_test_fold).split(new_data_df, df_y))
The reason is that you may need to call the split method to actually generate the training/validation split (and then convert the resulting generator into a list).
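For illustration, a minimal, self-contained sketch of that fix (the toy data and the LogisticRegression estimator are placeholders, not from the question; the relevant part is materializing the split with .split() and passing the resulting list as cv):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Toy data: the first 80 rows act as the training part, the last 20 as the validation part.
X = np.random.rand(100, 5)
y = np.array([0, 1] * 50)
my_test_fold = [-1] * 80 + [0] * 20

# Materialize the single train/validation split as a list of (train_idx, val_idx) pairs.
cv = list(PredefinedSplit(test_fold=my_test_fold).split(X, y))

gsearch = GridSearchCV(estimator=LogisticRegression(),
                       param_grid={'C': [0.1, 1, 10]},
                       scoring='roc_auc',
                       cv=cv,
                       verbose=1)
gsearch.fit(X, y)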

The hypopt Python package (pip install hypopt), for which I am an author, was created for this exact purpose: parameter optimization with a validation set. It works with scikit-learn models and can be used with Tensorflow, PyTorch, Caffe2, etc.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))

Related

How to use F1 score as an evaluation metric for XGBoost validation?

I'm trying to validate a model using GridSearchCV and XGBoost, and I want my evaluation metric to be F1 score. I've seen many people use scoring='f1' and eval_metric=f1_score and other variations, and I'm confused about a couple of points. Why are some people using scoring= and others using eval_metric=?
In the XGBoost documentation, there's no F1 score evaluation metric (which seems strange, btw, considering some of the others they do have). But I see lots of advice online to "just use XGBoost's built-in F1 score evaluator." Where??
No matter what I put here, my code throws an error on the eval_metric line.
Here is my code:
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

params = {
    'max_depth': range(2, 10, 2),
    'learning_rate': np.linspace(.1, .6, 6),
    'min_child_weight': range(1, 10, 2),
}
grid = GridSearchCV(
    estimator=XGBClassifier(n_jobs=-1,
                            n_estimators=500,
                            random_state=0),
    param_grid=params,
)
eval_set = [(X_tr, y_tr),
            (X_val, y_val)]
grid.fit(X_tr, y_tr,
         eval_set=eval_set,
         eval_metric='f1',  # <------ What do I put here to make this evaluate based on F1 score???
         early_stopping_rounds=25,
         )
Thanks!
You can achieve it with this:
from sklearn.model_selection import cross_val_score

result = cross_val_score(
    estimator=your_xgboost_model,
    X=X_dataframe,
    y=Y_dataframe,
    scoring='f1',
    cv=10
)
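To address the scoring= vs. eval_metric= confusion briefly: scoring= is a scikit-learn argument that tells GridSearchCV (or cross_val_score) which metric to rank parameter combinations by, while eval_metric= is an XGBoost argument that controls the metric monitored during boosting (e.g., for early stopping), and XGBoost does not ship a built-in F1 metric. If the goal is simply to have the grid search select parameters by F1, a minimal sketch that stays within scikit-learn's API (X_tr, y_tr and the parameter grid are assumed from the question):
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

params = {
    'max_depth': range(2, 10, 2),
    'learning_rate': np.linspace(.1, .6, 6),
    'min_child_weight': range(1, 10, 2),
}

# scoring='f1' makes GridSearchCV rank parameter combinations by F1 score,
# computed on the internal cross-validation folds.
grid = GridSearchCV(
    estimator=XGBClassifier(n_jobs=-1, n_estimators=500, random_state=0),
    param_grid=params,
    scoring='f1',
    cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.best_score_)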

Finding accuracy, precision and recall of a model after hyperparameter tuning in sklearn

I have a binary classification problem, for which I've chosen 3 algorithms: Logistic Regression, SVM and AdaBoost. I'm using grid search and k-fold cross-validation on each of these to find the optimal set of hyperparameters. Now, based on accuracy, precision and recall, I need to choose the best model. The problem is that I can't find a suitable way to retrieve this information. My code is given below:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

# TODO: Initialize the classifiers
clfr_A = LogisticRegression(random_state=128)
clfr_B = SVC(random_state=128)
clfr_C = AdaBoostClassifier(random_state=128)

lr_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
svc_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}
adb_param_grid = {'n_estimators': [50, 100, 150, 200, 250, 500],
                  'learning_rate': [.5, .75, 1.0, 1.25, 1.5, 1.75, 2.0]}

# TODO: Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta=0.5)

# TODO: Perform grid search on each classifier using 'scorer' as the scoring method
clfrs = [clfr_A, clfr_B, clfr_C]
params = [lr_param_grid, svc_param_grid, adb_param_grid]
for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scorer, refit=True)
    grid_fit = grid_obj.fit(features_raw, target_raw)
    print(grid_fit.best_estimator_)
    print(grid_fit.cv_results_)
The problem is that cv_results_ gives a lot of info, but I can't find anything relevant except mean_test_score. Moreover, I don't see any accuracy, precision or recall related metrics there.
I can think of one way to achieve it. I can change the for loop to look something like the following:
from sklearn.model_selection import cross_val_score

score_params = ["accuracy", "precision", "recall"]
for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scorer, refit=True)
    grid_fit = grid_obj.fit(features_raw, target_raw)
    best_clf = grid_fit.best_estimator_
    for score in score_params:
        print(score, ":", cross_val_score(best_clf, features_raw, target_raw, scoring=score, cv=3).mean())
But is there any better way of doing it? It seems I'm doing the operations multiple times for each model. Any help is appreciated.
GridSearchCV is doing exactly what you told it to do: you passed the F-beta scorer, so mean_test_score reports that F-beta score for each parameter combination.
If you want to access other metrics, you need to tell GridSearchCV explicitly to compute them.
GridSearchCV in newer versions of scikit-learn supports multi-metric scoring, so you can pass multiple scorers. As per the documentation:
scoring : string, callable, list/tuple, dict or None, default: None
...
...
For evaluating multiple metrics, either give a list of (unique)
strings or a dict with names as keys and callables as values.
See this example here:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#running-gridsearchcv-using-multiple-evaluation-metrics
And change your scoring param as:
scoring = {'Accuracy': 'accuracy',
'FBeta': make_scorer(fbeta_score, beta = 0.5),
# ... Add others here as you want.
}
But then you also need to change the refit param. Since the different metrics will give different scores for each parameter combination, you need to decide which one to use when refitting the estimator, so choose one of the keys from the scoring dict for refit:
for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scoring, refit='FBeta')
    ...
    ...
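With multi-metric scoring, cv_results_ then contains one column per scorer (for the dict above, keys such as mean_test_Accuracy and mean_test_FBeta), so accuracy, precision and recall can be read off directly without re-running cross_val_score. A sketch under the same setup as the question, adding the standard 'precision' and 'recall' string scorers:
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import GridSearchCV

scoring = {'Accuracy': 'accuracy',
           'Precision': 'precision',
           'Recall': 'recall',
           'FBeta': make_scorer(fbeta_score, beta=0.5)}

for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scoring, refit='FBeta')
    grid_fit = grid_obj.fit(features_raw, target_raw)
    best_idx = grid_fit.best_index_  # row of the parameter combination chosen by the 'FBeta' refit metric
    for name in ('Accuracy', 'Precision', 'Recall', 'FBeta'):
        print(name, ':', grid_fit.cv_results_['mean_test_%s' % name][best_idx])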

Held out training and validation set in gridsearchcv sklearn

I see that in GridSearchCV the best parameters are determined based on cross-validation, but what I really want to do is to determine the best parameters based on one held-out validation set instead of cross-validation.
I'm not sure if there is a way to do that. I found some similar posts about customizing the cross-validation folds. However, what I really need is to train on one set and validate the parameters on a separate validation set.
One more piece of information about my dataset: it is basically a text Series created by pandas.
I did come up with an answer to my own question through the use of PredefinedSplit:
import numpy as np
from sklearn.model_selection import PredefinedSplit

train_ind = np.empty(len(doc_train))   # -1 marks training samples
val_ind = np.empty(len(doc_val))       # 0 marks validation samples
for i in range(len(doc_train)):
    train_ind[i] = -1
for i in range(len(doc_val)):
    val_ind[i] = 0
ps = PredefinedSplit(test_fold=np.concatenate((train_ind, val_ind)))
and then in the GridSearchCV arguments:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1, cv=ps)
Use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))

Grid-search with specific validation data

I'm looking for a way to grid-search for hyperparameters in sklearn without using K-fold validation, i.e. I want my grid to train on a specific dataset (X1, y1 in the example below) and validate itself on a specific hold-out dataset (X2, y2 in the example below).
X1, y1 = train data
X2, y2 = validation data
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

clf_ = SVC(kernel='rbf', cache_size=1000)
Cs = [1, 10.0, 50, 100.0]
Gammas = [0.4, 0.42, 0.44, 0.46, 0.48, 0.5, 0.52, 0.54, 0.56]
clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=???,  # validate on X2, y2
                   n_jobs=8, verbose=10)
clf.fit(X1, y1)
Use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=???,  # validate on X2, y2
                   n_jobs=8, verbose=10)
Setting n_jobs higher than the number of cores on your machine does not make sense. If n_jobs=-1, the processing will use all the cores on your machine; if it is 1, only one core will be used.
If cv=5, it will run five-fold cross-validation for every parameter combination.
In your case the total number of fits will be 4 (size of Cs) * 9 (size of Gammas) * 5 (value of cv) = 180.
If you are using cross-validation, it does not make much sense to hold out data for rechecking your model; if you are not confident about the performance, you can just increase cv to get a better estimate.
This will be very time consuming, especially for an SVM, so I would rather suggest you use RandomizedSearchCV, which lets you specify how many parameter combinations you want it to sample randomly.
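That said, the setup the question asks for (train on X1/y1, score on X2/y2, no K-fold) can be expressed directly with PredefinedSplit, as in the other questions on this page. A minimal sketch, assuming X1, X2, y1, y2 are NumPy arrays and Cs, Gammas are the lists from the question:
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# -1 = always in the training fold, 0 = the single validation fold
test_fold = np.concatenate([np.full(len(X1), -1), np.full(len(X2), 0)])
ps = PredefinedSplit(test_fold=test_fold)

X = np.concatenate([X1, X2])
y = np.concatenate([y1, y2])

clf = GridSearchCV(SVC(kernel='rbf', cache_size=1000),
                   dict(C=Cs, gamma=Gammas),
                   cv=ps, n_jobs=8, verbose=10)
clf.fit(X, y)   # GridSearchCV fits on X1/y1 and scores on X2/y2 internally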

Using explicit (predefined) validation set for grid search with sklearn

I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms.
I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to input the validation set explicitly into sklearn.grid_search.GridSearchCV(). Below is some code I've previously used for doing K-fold cross-validation on the training set. However, for this problem I need to use the validation set as given. How can I do that?
from sklearn import svm, cross_validation
from sklearn.grid_search import GridSearchCV

# (some code left out to simplify things)

skf = cross_validation.StratifiedKFold(y_train, n_folds=5, shuffle=True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
                           class_weight=penalty_weights),
                   param_grid=tuned_parameters,
                   n_jobs=2,
                   pre_dispatch="n_jobs",
                   cv=skf,
                   scoring=scorer)
clf.fit(X_train, y_train)
Use PredefinedSplit
ps = PredefinedSplit(test_fold=your_test_fold)
then set cv=ps in GridSearchCV
test_fold : array-like, shape (n_samples,)
test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.
Also see here
when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
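In concrete terms, a minimal sketch (X_train, X_val, y_train, y_val are the predefined sets from the question, and tuned_parameters and scorer are assumed from the question's code; fuller variants appear in the answers below):
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# -1 = always in the training fold, 0 = the single validation fold
test_fold = [-1] * len(X_train) + [0] * len(X_val)
ps = PredefinedSplit(test_fold=test_fold)

clf = GridSearchCV(svm.SVC(), param_grid=tuned_parameters, cv=ps, scoring=scorer)
clf.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))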
Consider using the hypopt Python package (pip install hypopt) for which I am an author. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit

# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, stratify=y, random_state=2020)

# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold=split_index)

# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all data
clf.fit(X, y)
To add to @Vinubalan's answer: when the train/validation/test split is not done with scikit-learn's train_test_split() function, i.e., the dataframes are already split manually beforehand and scaled/normalized so as to prevent leakage from the training data, the NumPy arrays can be concatenated.
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

from sklearn.model_selection import PredefinedSplit, GridSearchCV

split_index = [-1] * len(X_train) + [0] * len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)

pds = PredefinedSplit(test_fold=split_index)

clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all data
clf.fit(X, y)
I wanted to provide some reproducible code that creates a validation split using the last 20% of observations.
from sklearn import datasets
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# load data
df_train = datasets.fetch_california_housing(as_frame=True).data
y = datasets.fetch_california_housing().target

param_grid = {"max_depth": [5, 6],
              'learning_rate': [0.03, 0.06],
              'subsample': [.5, .75]
              }
model = GradientBoostingRegressor()

# Create a single validation split
val_prop = .2
n_val_rows = round(len(df_train) * val_prop)
val_starting_index = len(df_train) - n_val_rows
cv = PredefinedSplit([-1 if i < val_starting_index else 0 for i in df_train.index])

# Use PredefinedSplit in GridSearchCV
results = GridSearchCV(estimator=model,
                       cv=cv,
                       param_grid=param_grid,
                       verbose=True,
                       n_jobs=-1)

# Fit with all data
results.fit(df_train, y)
results.best_params_
The cv argument of the SearchCV classes (Grid or Randomized) can also just be an iterable of (train indices, validation indices) pairs, i.e. cv=((train_idcs, val_idcs),).
Note that the data on which the search classifier is fit should be the combined train+validation set, and the indices specified will be used by sklearn to separate them internally. Additionally, when working with dataframes, the indices specified should be accessible as ilocs, so reset the indices (don't drop them if they will be needed later).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split,
    RandomizedSearchCV,
)

data = load_iris(as_frame=True)["frame"]

# These indices will serve as the explicit, predefined split
train_idcs, val_idcs = train_test_split(
    data.index,
    random_state=42,
    stratify=data.target,
)

param_grid = dict(
    n_estimators=[50, 100, 150, 200],
    max_samples=[0.85, 0.9, 0.95, 1],
    max_depth=[3, 5, 7, 10],
    max_features=["sqrt", "log2", 0.85, 0.9, 0.95, 1],
)

search_clf = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=50,
    cv=((train_idcs, val_idcs),),  # explicit predefined split in terms of indices
    random_state=42,
)

# X is the first 4 columns i.e. the sepal and petal widths and lengths
# and y is the 5th column i.e. the target column
search_clf.fit(X=data.iloc[:, :4], y=data.target)
Also, be mindful of whether you want to refit on the whole data (train+validation) or only on the training data, and retrain the classifier with the best-found parameters accordingly.
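For example, to avoid refitting on the validation rows, one option (a sketch continuing the snippet above; the choice of refit=False is an assumption, not part of the original answer) is to disable the automatic refit and retrain manually on the training indices only:
# Continue from the snippet above, but skip the automatic refit on train+val.
search_clf = RandomizedSearchCV(
    estimator=RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=50,
    cv=((train_idcs, val_idcs),),
    refit=False,          # do not automatically refit on the full data
    random_state=42,
)
search_clf.fit(X=data.iloc[:, :4], y=data.target)

# Retrain on the training split only, using the best parameters found.
best_clf = RandomForestClassifier(**search_clf.best_params_, random_state=42)
best_clf.fit(data.loc[train_idcs].iloc[:, :4], data.loc[train_idcs, "target"])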
