XGB hyperparameter tuning using RandomizedSearchCV - Python

cross_val = StratifiedKFold(n_splits=split_number)
index_iterator = cross_val.split(features_dataframe, classes_dataframe)
clf = RandomForestClassifier()
random_grid = _create_hyperparameter_finetuning_grid()
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid, n_iter=100,
                                cv=cross_val, verbose=2, random_state=42, n_jobs=-1)
clf_random.fit(X, y)
I found this code for RandomizedSearchCV on some site. I am not sure what "random_grid = _create_hyperparameter_finetuning_grid()" does in the code.
Please enlighten me on this.

It is passed as the param_distributions argument to RandomizedSearchCV, and the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) states that it should be a "Dictionary with parameters names (str) as keys and distributions or lists of parameters to try".
Based on its hyperparameters, the random forest model builds a set of decision trees to perform the classification. The param_distributions value determines which of these hyperparameter combinations are sampled and tested.
There is also an explanation here:
How to correctly implement StratifiedKFold with RandomizedSearchCV
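The helper _create_hyperparameter_finetuning_grid() itself is not shown in the question, so its exact contents are unknown; presumably it just builds and returns such a dictionary. A minimal sketch of what it might return for a RandomForestClassifier (the keys are real RandomForestClassifier parameters, the candidate values are only illustrative):
def _create_hyperparameter_finetuning_grid():
    # Hypothetical reconstruction: the keys must match RandomForestClassifier
    # parameter names; the lists of candidate values below are illustrative only.
    return {
        'n_estimators': [100, 200, 500, 1000],
        'max_features': ['sqrt', 'log2'],
        'max_depth': [10, 20, 50, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'bootstrap': [True, False],
    }
RandomizedSearchCV then samples n_iter=100 combinations from this dictionary and cross-validates each one.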


Sklearn - Best estimator from GridSearchCV with refit = True

I'm trying to find the best estimator using GridSearchCV and I'm using refit = True as per the default. Given that the documentation states:
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance
Should I do .fit on the training data afterwards as such:
classifier = GridSearchCV(estimator=model, param_grid=parameter_grid['param_grid'], scoring='balanced_accuracy', cv=5, verbose=3, n_jobs=4, return_train_score=True, refit=True)
classifier.fit(x_training, y_train_encoded_local)
predictions = classifier.predict(x_testing)
balanced_error = balanced_accuracy_score(y_true=y_test_encoded_local,y_pred=predictions)
Or should I do it like this instead:
classifier = GridSearchCV(estimator=model, param_grid=parameter_grid['param_grid'], scoring='balanced_accuracy', cv=5, verbose=3, n_jobs=4, return_train_score=True, refit=True)
predictions = classifier.predict(x_testing)
balanced_error = balanced_accuracy_score(y_true=y_test_encoded_local,y_pred=predictions)
You should do it like your first version. You always need to call classifier.fit, otherwise nothing is trained. refit=True means that, once the cross-validation is done, the best estimator is refitted on the entire training set.
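So the first version is the complete workflow. A condensed sketch, assuming model, parameter_grid and the train/test arrays from the question are already defined:
# Running fit performs the grid search; with refit=True the best parameter
# combination is then refitted on all of (x_training, y_train_encoded_local).
classifier.fit(x_training, y_train_encoded_local)

# predict() on the GridSearchCV object is delegated to the refitted best estimator.
predictions = classifier.predict(x_testing)
balanced_error = balanced_accuracy_score(y_true=y_test_encoded_local, y_pred=predictions)

# The refitted model is also available directly if you need it later:
best_model = classifier.best_estimator_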

Error when trying to run RandomForestClassifier with Pipeline and GridSearch

I am trying to run a RandomForestClassifier using Pipeline, GridSearchCV and cross-validation.
I am getting an error when I fit the data, and I am not sure how to fix it.
I found a similar question with a solution (https://stackoverflow.com/a/34890246/9592484), but it didn't work for me.
Will appreciate any help on this.
My code is:
column_trans = make_column_transformer((OneHotEncoder(), ['CategoricalData']),
                                       remainder='passthrough')
RF = RandomForestClassifier()
pipe = make_pipeline(column_trans, RF)

# Set grid search params
grid_params = [{'randomforestclassifier_criterion': ['gini', 'entropy'],
                'randomforestclassifier_min_samples_leaf': [5, 10, 20, 30, 50, 80, 100],
                'randomforestclassifier_max_depth': [3, 4, 6, 8, 10],
                'randomforestclassifier_min_samples_split': [2, 4, 6, 8, 10]}]

# Construct grid search
gs = GridSearchCV(estimator=pipe,
                  param_grid=grid_params,
                  scoring='accuracy',
                  cv=5)

gs.fit(train_features, train_target)  # ---- This is where I get the error
ValueError: Invalid parameter randomforestclassifier_criterion for estimator Pipeline(steps=[('columntransformer',
ColumnTransformer(remainder='passthrough',
transformers=[('onehotencoder',
OneHotEncoder(),
['saleschanneltypeid'])])),
('randomforestclassifier', RandomForestClassifier())]). Check the list of available parameters with `estimator.get_params().keys()`.
The make_pipeline utility function derives step names from the transformer/estimator class names. For example, the RandomForestClassifier is mapped to the randomforestclassifier step.
Please adjust your grid search parameter prefixes accordingly (i.e. from RF to randomforestclassifier). For example, RF__criterion should become randomforestclassifier__criterion.
You are not prepending the keyword correctly. You have to use two underscores (__) before each and every parameter name. You are using only one underscore (_), which won't work.
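Putting both answers together, the grid from the question would look like this (same candidate values, only the separator between the pipeline step name and the parameter name changes to a double underscore):
# Corrected grid: step name + '__' (double underscore) + parameter name
grid_params = [{'randomforestclassifier__criterion': ['gini', 'entropy'],
                'randomforestclassifier__min_samples_leaf': [5, 10, 20, 30, 50, 80, 100],
                'randomforestclassifier__max_depth': [3, 4, 6, 8, 10],
                'randomforestclassifier__min_samples_split': [2, 4, 6, 8, 10]}]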

Perform GridSearchCV with MLFlow

I just started using MLFlow and I am happy with what it can do. However, I cannot find a way to log the individual runs inside a GridSearchCV from scikit-learn.
For example, I can do this manually
params = ['l1', 'l2']

for param in params:
    with mlflow.start_run(experiment_id=1):
        clf = LogisticRegression(penalty=param).fit(X_train, y_train)
        y_predictions = clf.predict(X_test)

        precision = precision_score(y_test, y_predictions)
        recall = recall_score(y_test, y_predictions)
        f1 = f1_score(y_test, y_predictions)

        mlflow.log_param("penalty", param)
        mlflow.log_metric("Precision", precision)
        mlflow.log_metric("Recall", recall)
        mlflow.log_metric("F1", f1)
        mlflow.sklearn.log_model(clf, "model")
But when I want to use the GridSearchCV like that
pipe = Pipeline([('classifier', RandomForestClassifier())])

param_grid = [
    {'classifier': [LogisticRegression()],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': list(range(10, 101, 10)),
     'classifier__max_features': list(range(6, 32, 5))}
]

clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train, y_train)
I cannot think of any way to log all the individual models that the GridSearch tests. Is there any way to do it, or do I have to keep using the manual process?
I'd recommend hyperopt instead of scikit-learn's GridSearchCV. Hyperopt can search the space with Bayesian optimization using hyperopt.tpe.suggest. It will arrive at good parameters faster than a grid search and you can limit the number of iterations no matter the space size, so it's definitely better for large spaces. Since you're interested in the artifacts from the individual runs, you may prefer hyperopt's random search, which still has the advantage of being able to choose how many runs you perform.
You can parallelize the search very easily with Spark using hyperopt.SparkTrials (here's a more complete example). Note that you can keep using scikit's cross validation, just put it inside the objective function (you can even keep track of the variance of the cross validation using loss_variance).
Now, to actually answer the question, I believe you can log the model, parameters, metrics, or whatever inside the objective function that you pass to hyperopt.fmin. MLFlow will store each run as a child of the main run, and each run can have its own artifacts.
So you want something like this:
def objective(params):
    metrics = ...
    classifier = SomeClassifier(**params)
    cv = cross_validate(classifier, X_train, y_train, scoring=metrics)
    scores = {metric: cv[f'test_{metric}'] for metric in metrics}

    # log all the stuff here
    mlflow.log_metric('...', scores[...])
    mlflow.sklearn.log_model(classifier.fit(X_train, y_train), 'model')

    return scores['some_loss'].mean()

space = hp.choice(...)
trials = SparkTrials(parallelism=...)
with mlflow.start_run() as run:
    best_result = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
I agree with the other answer that hyperopt would be ideal for logging experiments with MLFlow, especially in a Spark environment. One way to log individual model fits within GridSearchCV would be to extend the sklearn estimator's fit method and pass a callback function to GridSearchCV's fit.
Any parameter passed to GridSearchCV's fit is cascaded down to the fit method of the estimators within GridSearchCV. This allows us to pass a logger function to store parameters, metrics, models etc. with MLFlow.
Here is an example with RandomForestClassifier as the estimator; however, this approach should work with any other estimator as well:
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


class CustomRandomForestClassifier(RandomForestClassifier):
    '''
    A custom random forest classifier.
    The RandomForestClassifier class is extended by adding a callback function within its fit method.
    '''
    def fit(self, X, y, **kwargs):
        super().fit(X, y)
        # if a "callback" key is passed, call the "callback" function by passing the fitted estimator
        if 'callback' in kwargs:
            kwargs['callback'](self)
        return self


class Logger:
    '''
    Logger class stores the test dataset,
    and logs the sklearn random forest estimator in the rf_logger method.
    '''
    def __init__(self, test_X, test_y):
        self.test_X = test_X
        self.test_y = test_y

    def rf_logger(self, model):
        # log the random forest model in nested mlflow runs
        with mlflow.start_run(nested=True):
            mlflow.log_param("n_estimators", model.n_estimators)
            mlflow.log_param("max_leaf_nodes", model.max_leaf_nodes)
            mlflow.log_metric("score", model.score(self.test_X, self.test_y))
            mlflow.sklearn.log_model(model, 'rf_model')
        return None


crf = CustomRandomForestClassifier(random_state=9)
param_grid = {
    'n_estimators': [10, 20],
    'max_leaf_nodes': [25, 50]
}

# Use the custom random forest classifier while defining the estimator for grid search
grid = GridSearchCV(crf, param_grid, cv=2, refit=True)

# Instantiate Logger with the test dataset
logger = Logger(test_X, test_y)

# start the outer mlflow run and perform grid search with cross-validation
with mlflow.start_run(run_name="grid_search"):
    # while calling the GridSearchCV object's fit method, pass logger.rf_logger
    # logger.rf_logger takes care of logging each fitted model during the grid search
    grid.fit(train_X, train_y, callback=logger.rf_logger)

    # log the best estimator found by grid search in the outer mlflow run
    mlflow.log_param("n_estimators", grid.best_params_['n_estimators'])
    mlflow.log_param("max_leaf_nodes", grid.best_params_['max_leaf_nodes'])
    mlflow.log_metric("score", grid.score(test_X, test_y))
    mlflow.sklearn.log_model(grid.best_estimator_, 'best_rf_model')

Custom scoring function RandomForestRegressor

Using RandomizedSearchCV, I managed to find a RandomForestRegressor with the best hyperparameters.
To do this, I used a custom score function matching my specific needs.
Now, I don't know how to use the best_estimator_ (a RandomForestRegressor) returned by the search with my custom scoring function.
Is there a way to pass a custom scoring function to a RandomForestRegressor?
The scoring function in RandomizedSearchCV only calculates the score of the data predicted by the model for each combination of hyperparameters specified in the grid, and the hyperparameter combination with the highest average score on the test folds wins.
It does not in any way alter the behaviour of the internal algorithm of the RandomForest (other than finding the hyperparameters, of course).
Now you have best_estimator_ (a RandomForestRegressor), with the best found hyperparameters already set and the model already trained on the whole data you sent to RandomizedSearchCV (if you used refit=True, which is True by default).
So I'm not sure what you want to achieve by passing that scorer to the model. The best_estimator_ model can be used directly to get predictions on new data via the predict() method. After that, the custom scoring you used can be applied to compare the predictions with the actual values. There's nothing more to it.
A simple example of this would be:
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
# Note: load_boston has been removed from recent scikit-learn releases;
# any regression dataset works the same way here.
from sklearn.datasets import load_boston
from sklearn.metrics import r2_score, make_scorer

X, y = load_boston().data, load_boston().target
X_train, X_test, y_train, y_test = train_test_split(X, y)

clf = RandomForestRegressor()

# Your custom scoring strategy
def my_custom_score(y_true, y_pred):
    return r2_score(y_true, y_pred)

# Wrapping it in make_scorer to be able to use it in RandomizedSearchCV
my_scorer = make_scorer(my_custom_score)

# Hyperparameters to be tuned
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11)}

random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=20, scoring=my_scorer)
random_search.fit(X_train, y_train)

# Best found parameter set and model trained on X_train, y_train
best_clf = random_search.best_estimator_

# Get predictions on your new data
y_test_pred = best_clf.predict(X_test)

# Calculate your score on the predictions with respect to the actual values
print(my_custom_score(y_test, y_test_pred))

scikit-learn GridSearchCV with multiple repetitions

I'm trying to get the best set of parameters for an SVR model.
I'd like to use the GridSearchCV over different values of C.
However, from previous tests I noticed that the training/test split highly influences the overall performance (r2 in this instance).
To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?
Quick solution, following the idea presented in the scikit-learn official documentation:
NUM_TRIALS = 10
scores = []

for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # the search has to be fitted before best_score_ is available
    scores.append(clf.best_score_)

print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))
This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.
You can adapt the steps below to suit your needs:
svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100, ... ]}

# CV technique: "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.
# (i is the trial index if you repeat the procedure with different random seeds)

# To be used within GridSearch (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)

# To be used in the outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearch estimator to cross_val_score
# This will be your required 10 x 5 CV:
# 10 for the outer CV and 5 for GridSearch's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()
1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
2. Pass clf, X, y, outer_cv to cross_val_score.
3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train, X_outer_test using outer_cv. Same for y.
4. X_outer_test will be held back and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and assume y_outer_train is y_inner.
5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
6. Now the GridSearch estimator will be trained using X_inner_train and y_inner_train and scored using X_inner_test and y_inner_test.
7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
8. The hyperparameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_ and fitted on all of the data, i.e. X_outer_train.
9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here) and an array of scores will be returned from cross_val_score.
11. We then use mean() to get back nested_score.
You can supply different cross-validation generators to GridSearchCV. The default for binary or multiclass classification problems is StratifiedKFold. Otherwise, it uses KFold. But you can supply your own. In your case, it looks like you want RepeatedKFold or RepeatedStratifiedKFold.
from sklearn.model_selection import GridSearchCV, RepeatedKFold, RepeatedStratifiedKFold

# Define svr here
...

# Specify the cross-validation generator, in this case (10 x 5CV)
cv = RepeatedKFold(n_splits=5, n_repeats=10)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)

# Continue as usual
clf.fit(...)
