I'm trying to find the best estimator using GridSearchCV, and I'm using refit=True, which is the default. Given that the documentation states:
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance
Should I call .fit on the training data afterwards, like this:
classifier = GridSearchCV(estimator=model, param_grid=parameter_grid['param_grid'],
                          scoring='balanced_accuracy', cv=5, verbose=3, n_jobs=4,
                          return_train_score=True, refit=True)
classifier.fit(x_training, y_train_encoded_local)
predictions = classifier.predict(x_testing)
balanced_error = balanced_accuracy_score(y_true=y_test_encoded_local,y_pred=predictions)
Or should I do it like this instead:
classifier = GridSearchCV(estimator=model, param_grid=parameter_grid['param_grid'],
                          scoring='balanced_accuracy', cv=5, verbose=3, n_jobs=4,
                          return_train_score=True, refit=True)
predictions = classifier.predict(x_testing)
balanced_error = balanced_accuracy_score(y_true=y_test_encoded_local,y_pred=predictions)
You should do it like your first version. You always need to call classifier.fit; otherwise nothing is trained at all. refit=True means that, once cross-validation has selected the best hyperparameters, the estimator is refit with those hyperparameters on the entire training set.
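As a minimal sketch (reusing x_training, y_train_encoded_local and x_testing from your snippet), note that predict on the fitted GridSearchCV delegates to the refitted best_estimator_, so the two sets of predictions below are identical:
import numpy as np

classifier.fit(x_training, y_train_encoded_local)

# predict() on the search object uses the refitted best_estimator_ under the hood
preds_via_search = classifier.predict(x_testing)
preds_via_best = classifier.best_estimator_.predict(x_testing)
assert np.array_equal(preds_via_search, preds_via_best)

print(classifier.best_params_)  # hyperparameters chosen by cross-validation
print(classifier.best_score_)   # mean cross-validated balanced accuracy for those hyperparameters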
cross_val = StratifiedKFold(n_splits=split_number)
index_iterator = cross_val.split(features_dataframe, classes_dataframe)
clf = RandomForestClassifier()
random_grid = _create_hyperparameter_finetuning_grid()
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid, n_iter=100,
                                cv=cross_val, verbose=2, random_state=42, n_jobs=-1)
clf_random.fit(X, y)
I found this code for RandomizedSearchCV on some site. I am not sure what "random_grid = _create_hyperparameter_finetuning_grid()" does in this code.
Please enlighten me on this.
It is passed to RandomizedSearchCV as the param_distributions argument, and the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) says it should be a "Dictionary with parameters names (str) as keys and distributions or lists of parameters to try".
A random forest builds its set of decision trees according to a number of hyperparameters (number of trees, tree depth, and so on), and the value passed as param_distributions determines which hyperparameter values the search samples and evaluates. So _create_hyperparameter_finetuning_grid() is simply a helper function, defined elsewhere in the code you found, that builds and returns such a dictionary.
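As an illustration only (the actual helper is not shown in the code you found), such a function might look roughly like this, with hypothetical RandomForestClassifier hyperparameter ranges:
from scipy.stats import randint

def _create_hyperparameter_finetuning_grid():
    # Hypothetical example: map RandomForestClassifier hyperparameter names
    # to lists or scipy distributions for RandomizedSearchCV to sample from.
    return {
        'n_estimators': randint(100, 1000),
        'max_depth': [None, 10, 20, 50],
        'min_samples_split': randint(2, 11),
        'max_features': ['sqrt', 'log2'],
    }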
There is also an explanation here:
How to correctly implement StratifiedKFold with RandomizedSearchCV
My question is similar to the one here, but the answer there does not explain why we must fit to the data before getting the best parameters; it just states that we must. In my understanding, GridSearchCV picks the best model parameters using cross-validation and then returns this best model in the .best_estimator_ attribute, and we can then fit that model to our data. But shouldn't we be able to access the parameters it picked and the .best_estimator_ model before fitting to the data?
As an example, the code below works fine:
logreg = LogisticRegression(solver='liblinear')
params = {'penalty':['l1','l2'], 'C':np.logspace(-3,3,7),}
grid = GridSearchCV(estimator=logreg, param_grid=params, cv=4)
grid.fit(X_train,y_train)
best_model_params = grid.best_params_
y_pred = grid.predict(X_test)
But the following code does not work:
logreg = LogisticRegression(solver='liblinear')
params = {'penalty':['l1','l2'], 'C':np.logspace(-3,3,7),}
grid = GridSearchCV(estimator=logreg, param_grid=params, cv=4)
best_model_params = grid.best_params_
grid.fit(X_train,y_train)
y_pred = grid.predict(X_test)
It gives AttributeError: 'GridSearchCV' object has no attribute 'best_estimator_'.
On a related note, if grid.best_estimator_ is the best LogisticRegression model (the model with the set of hyperparameters found to be best through cross-validation in GridSearchCV), then why do we fit the grid object instead of the grid.best_estimator_ object? E.g. if I could figure out how to access the best_estimator_ attribute before fitting, would the following code work?
logreg = LogisticRegression(solver='liblinear')
params = {'penalty':['l1','l2'], 'C':np.logspace(-3,3,7),}
grid = GridSearchCV(estimator=logreg, param_grid=params, cv=4)
best_model = <somehow get the model GridSearchCV has picked>
best_model.fit(X_train,y_train)
y_pred = best_model.predict(X_test)
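For what it's worth, a quick check on a small toy dataset (sklearn's load_iris, used here purely for illustration) shows that the best_* attributes simply do not exist until fit has run the search:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_train, y_train = load_iris(return_X_y=True)
logreg = LogisticRegression(solver='liblinear')
params = {'penalty': ['l1', 'l2'], 'C': np.logspace(-3, 3, 7)}
grid = GridSearchCV(estimator=logreg, param_grid=params, cv=4)

print(hasattr(grid, 'best_params_'))  # False: the search has not happened yet
grid.fit(X_train, y_train)            # the grid search itself runs inside fit
print(hasattr(grid, 'best_params_'))  # True: the attributes are set during fit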
I use 20% of the data set as my testing set and use GridSearchCV to implement K-fold cross-validation to tune hyperparameters.
By using a pipeline, we can put the column transformer and the machine learning algorithm into GridSearchCV together. If I set up 5-fold cross-validation for GridSearchCV, the function will use 5 different training and validation sets to train and validate each combination of hyperparameters. As far as I know, GridSearchCV uses the mean of the 5 fold scores to choose the best model.
Then my question is, how does it transform the testing set?
I'm very confused about this because to avoid data leakage, we should use only the training set to fit the transformer, but in this case, we have 5 different training sets and I don't know which one the GridSearchCV function uses to fit and transform the validation and testing set.
My code is given below
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size = 0.2, random_state = i)
kf = KFold(n_splits = 4, shuffle = True, random_state = i)
pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", ML_algo)])
grid = GridSearchCV(estimator=pipe, param_grid=param_grid,
                    scoring=make_scorer(mean_squared_error, greater_is_better=False, squared=False),
                    cv=kf, return_train_score=True, n_jobs=-1, verbose=False)
grid.fit(X_other, y_other)
test_score = mean_squared_error(y_test, grid.predict(X_test), squared=False)
Short answer: there is no data leakage; the test set is not used (and should not be used) for training the model in your code.
Long answer: k-fold cross-validation randomly divides your X_other and y_other (the training set) into k splits. In each iteration of cross-validation, k-1 folds are used to fit the pipeline (preprocessor included), and that fitted pipeline is then evaluated on the one remaining fold using the metric you specified in scoring=. (See the illustration in the sklearn user guide for details: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)
After GridSearchCV() has found the best set of hyperparameters, the whole training set is used to refit a final pipeline with those hyperparameters. It is this refitted pipeline's preprocessor, fitted on all of X_other, that transforms X_test when you call grid.predict(X_test). Note that X_test and y_test are not used, and should not be used, for anything other than this final evaluation.
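A minimal sketch of how you could check this yourself (assuming the pipeline step names from your code above):
# After grid.fit(X_other, y_other), the refitted pipeline is available here:
final_pipe = grid.best_estimator_

# Its "preprocessor" step was fitted on ALL of X_other (not on any single CV fold),
# and it is the one applied to X_test inside grid.predict(X_test):
fitted_preprocessor = final_pipe.named_steps["preprocessor"]
X_test_transformed = fitted_preprocessor.transform(X_test)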
I just started using MLFlow and I am happy with what it can do. However, I cannot find a way to log the individual runs of a GridSearchCV from scikit-learn.
For example, I can do this manually
params = ['l1', 'l2']

for param in params:
    with mlflow.start_run(experiment_id=1):
        clf = LogisticRegression(penalty=param).fit(X_train, y_train)
        y_predictions = clf.predict(X_test)

        precision = precision_score(y_test, y_predictions)
        recall = recall_score(y_test, y_predictions)
        f1 = f1_score(y_test, y_predictions)

        mlflow.log_param("penalty", param)
        mlflow.log_metric("Precision", precision)
        mlflow.log_metric("Recall", recall)
        mlflow.log_metric("F1", f1)
        mlflow.sklearn.log_model(clf, "model")
But when I want to use the GridSearchCV like that
pipe = Pipeline([('classifier', RandomForestClassifier())])

param_grid = [
    {'classifier': [LogisticRegression()],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': list(range(10, 101, 10)),
     'classifier__max_features': list(range(6, 32, 5))}
]

clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train, y_train)
I cannot think of any way to log all the individual models that GridSearchCV tests. Is there any way to do it, or do I have to keep using the manual process?
I'd recommend hyperopt instead of scikit-learn's GridSearchCV. Hyperopt can search the space with Bayesian optimization using hyperopt.tpe.suggest. It will arrive at good parameters faster than a grid search and you can limit the number of iterations no matter the space size, so it's definitely better for large spaces. Since you're interested in the artifacts from the individual runs, you may prefer hyperopt's random search, which still has the advantage of being able to choose how many runs you perform.
You can parallelize the search very easily with Spark using hyperopt.SparkTrials (here's a more complete example). Note that you can keep using scikit's cross validation, just put it inside the objective function (you can even keep track of the variance of the cross validation using loss_variance).
Now, to actually answer the question, I believe you can log the model, parameters, metrics, or whatever inside the objective function that you pass to hyperopt.fmin. MLFlow will store each run as a child of the main run, and each run can have its own artifacts.
So you want something like this:
from hyperopt import SparkTrials, fmin, hp, tpe
from sklearn.model_selection import cross_validate
import mlflow
import mlflow.sklearn

def objective(params):
    metrics = ...
    classifier = SomeClassifier(**params)
    cv = cross_validate(classifier, X_train, y_train, scoring=metrics)
    scores = {metric: cv[f'test_{metric}'] for metric in metrics}
    # log all the stuff here
    mlflow.log_metric('...', scores[...])
    mlflow.sklearn.log_model(classifier.fit(X_train, y_train), 'model')
    return scores['some_loss'].mean()

space = hp.choice(...)
trials = SparkTrials(parallelism=...)
with mlflow.start_run() as run:
    best_result = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
I agree with the other answer that using hyperopt would be ideal to log experiments with MLFlow, especially in a Spark environment. One way to log individual model fits within GridSearchCV would be to extend the sklearn estimator’s fit method and pass a callback function to GridSearchCV’s fit.
Any parameter passed to GridSearchCV’s fit is cascaded down to the fit method of the estimators within GridSearchCV. This allows us to pass a logger function to store parameters, metrics, models etc. with MLFlow.
Here is an example with RandomForestClassifier as the estimator, however this approach should work with any other estimator as well:
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

class CustomRandomForestClassifier(RandomForestClassifier):
    '''
    A custom random forest classifier.
    The RandomForestClassifier class is extended by adding a callback function within its fit method.
    '''
    def fit(self, X, y, **kwargs):
        super().fit(X, y)
        # if a "callback" key is passed, call the "callback" function by passing the fitted estimator
        if 'callback' in kwargs:
            kwargs['callback'](self)
        return self
class Logger:
    '''
    Logger class stores the test dataset,
    and logs the sklearn random forest estimator in its rf_logger method.
    '''
    def __init__(self, test_X, test_y):
        self.test_X = test_X
        self.test_y = test_y

    def rf_logger(self, model):
        # log the random forest model in a nested mlflow run
        with mlflow.start_run(nested=True):
            mlflow.log_param("n_estimators", model.n_estimators)
            mlflow.log_param("max_leaf_nodes", model.max_leaf_nodes)
            mlflow.log_metric("score", model.score(self.test_X, self.test_y))
            mlflow.sklearn.log_model(model, 'rf_model')
        return None
crf = CustomRandomForestClassifier(random_state=9)

param_grid = {
    'n_estimators': [10, 20],
    'max_leaf_nodes': [25, 50]
}

# Use the custom random forest classifier when defining the estimator for grid search
grid = GridSearchCV(crf, param_grid, cv=2, refit=True)

# Instantiate Logger with the test dataset
logger = Logger(test_X, test_y)

# Start the outer mlflow run and perform grid search with cross-validation
with mlflow.start_run(run_name="grid_search"):
    # While calling the GridSearchCV object's fit method, pass logger.rf_logger,
    # which takes care of logging each fitted model during the grid search
    grid.fit(train_X, train_y, callback=logger.rf_logger)

    # Log the best estimator found by the grid search in the outer mlflow run
    mlflow.log_param("n_estimators", grid.best_params_['n_estimators'])
    mlflow.log_param("max_leaf_nodes", grid.best_params_['max_leaf_nodes'])
    mlflow.log_metric("score", grid.score(test_X, test_y))
    mlflow.sklearn.log_model(grid.best_estimator_, 'best_rf_model')
Using RandomizedSearchCV, I managed to find a RandomForestRegressor with the best hyperparameters.
To do this, I used a custom score function matching my specific needs.
Now, I don't know how to use the best_estimator_ - a RandomForestRegressor - returned by the search with my custom scoring function.
Is there a way to pass a custom scoring function to a RandomForestRegressor?
The scoring function in RandomizedSearchCV is only used to score the model's predictions on the held-out folds for each combination of hyperparameters sampled from the grid, and the combination with the highest average score across the test folds wins.
It does not in any way alter the behaviour of the RandomForest algorithm itself (other than selecting its hyperparameters, of course).
Now you have best_estimator_ (a RandomForestRegressor), with the best found hyperparameters already set and the model already trained on the whole data you sent to RandomizedSearchCV (if you used refit=True, which is True by default).
So I'm not sure what you want to do by passing that scorer to the model. The best_estimator_ model can be used directly to get predictions on new data via its predict() method. After that, your custom scoring can be used to compare the predictions with the actual values. There's nothing more to it.
A simple example of this would be:
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import r2_score, make_scorer
# Note: load_boston was removed in scikit-learn 1.2; with newer versions,
# substitute another regression dataset such as fetch_california_housing.
X, y = load_boston().data, load_boston().target
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = RandomForestRegressor()
# Your custom scoring strategy
def my_custom_score(y_true, y_pred):
    return r2_score(y_true, y_pred)
# Wrapping it in make_scorer to able to use in RandomizedSearch
my_scorer = make_scorer(my_custom_score)
# Hyper Parameters to be tuned
param_dist = {"max_depth": [3, None],
"max_features": sp_randint(1, 11),
"min_samples_split": sp_randint(2, 11),}
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=20, scoring=my_scorer)
random_search.fit(X_train, y_train)
# Best found parameters set and model trained on X_train, y_train
best_clf = random_search.best_estimator_
# Get predictions on your new data
y_test_pred = best_clf.predict(X_test)
# Calculate your score on the predictions with respect to actual values
print(my_custom_score(y_test, y_test_pred))