Perform GridSearchCV with MLFlow - python

I just started using MLFlow and I am happy with what it can do. However, I cannot find a way to log the different runs of a GridSearchCV from scikit-learn.
For example, I can do this manually:
params = ['l1', 'l2']

for param in params:
    with mlflow.start_run(experiment_id=1):
        clf = LogisticRegression(penalty=param).fit(X_train, y_train)
        y_predictions = clf.predict(X_test)

        precision = precision_score(y_test, y_predictions)
        recall = recall_score(y_test, y_predictions)
        f1 = f1_score(y_test, y_predictions)

        mlflow.log_param("penalty", param)
        mlflow.log_metric("Precision", precision)
        mlflow.log_metric("Recall", recall)
        mlflow.log_metric("F1", f1)
        mlflow.sklearn.log_model(clf, "model")
But when I want to use GridSearchCV like this:
pipe = Pipeline([('classifier', RandomForestClassifier())])

param_grid = [
    {'classifier': [LogisticRegression()],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': list(range(10, 101, 10)),
     'classifier__max_features': list(range(6, 32, 5))}
]

clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train, y_train)
I cannot think of any way to log all the individual models that the grid search tests. Is there any way to do it, or do I have to keep using the manual process?

I'd recommend hyperopt instead of scikit-learn's GridSearchCV. Hyperopt can search the space with Bayesian optimization using hyperopt.tpe.suggest. It will arrive at good parameters faster than a grid search and you can limit the number of iterations no matter the space size, so it's definitely better for large spaces. Since you're interested in the artifacts from the individual runs, you may prefer hyperopt's random search, which still has the advantage of being able to choose how many runs you perform.
You can parallelize the search very easily with Spark using hyperopt.SparkTrials (here's a more complete example). Note that you can keep using scikit's cross validation, just put it inside the objective function (you can even keep track of the variance of the cross validation using loss_variance).
Now, to actually answer the question, I believe you can log the model, parameters, metrics, or whatever inside the objective function that you pass to hyperopt.fmin. MLFlow will store each run as a child of the main run, and each run can have its own artifacts.
So you want something like this:
def objective(params):
    metrics = ...
    classifier = SomeClassifier(**params)
    cv = cross_validate(classifier, X_train, y_train, scoring=metrics)
    scores = {metric: cv[f'test_{metric}'] for metric in metrics}

    # log all the stuff here
    mlflow.log_metric('...', scores[...])
    mlflow.sklearn.log_model(classifier.fit(X_train, y_train), 'model')

    return scores['some_loss'].mean()

space = hp.choice(...)
trials = SparkTrials(parallelism=...)

with mlflow.start_run() as run:
    best_result = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
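If you also want to record the spread of the cross-validation scores that hyperopt can use via loss_variance, the objective can return a dictionary instead of a bare number. A minimal sketch of that variant, assuming a single RMSE-style scorer and an explicitly nested child run (the scorer and metric names here are illustrative, not from the original code):

from hyperopt import STATUS_OK
from sklearn.model_selection import cross_validate
import mlflow
import mlflow.sklearn

def objective(params):
    classifier = SomeClassifier(**params)  # placeholder estimator, as above
    cv = cross_validate(classifier, X_train, y_train,
                        scoring='neg_root_mean_squared_error')  # assumed scorer
    losses = -cv['test_score']  # positive RMSE, one value per fold

    # log into a child run of the run opened around fmin
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        mlflow.log_metric('cv_rmse_mean', losses.mean())
        mlflow.sklearn.log_model(classifier.fit(X_train, y_train), 'model')

    # hyperopt minimizes 'loss'; 'loss_variance' is optional extra information
    return {'loss': losses.mean(), 'loss_variance': losses.var(), 'status': STATUS_OK}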

I agree with the other answer that using hyperopt would be ideal to log experiments with MLFlow, especially in a Spark environment. One way to log individual model fits within GridSearchCV would be to extend the sklearn estimator’s fit method and pass a callback function to GridSearchCV’s fit.
Any parameter passed to GridSearchCV’s fit is cascaded down to the fit method of the estimators within GridSearchCV. This allows us to pass a logger function to store parameters, metrics, models etc. with MLFlow.
Here is an example with RandomForestClassifier as the estimator; the approach should work with any other estimator as well:
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

class CustomRandomForestClassifier(RandomForestClassifier):
    '''
    A custom random forest classifier.
    The RandomForestClassifier class is extended by adding a callback function within its fit method.
    '''
    def fit(self, X, y, **kwargs):
        super().fit(X, y)
        # if a "callback" key is passed, call the "callback" function with the fitted estimator
        if 'callback' in kwargs:
            kwargs['callback'](self)
        return self

class Logger:
    '''
    Logger class stores the test dataset,
    and logs the sklearn random forest estimator in the rf_logger method.
    '''
    def __init__(self, test_X, test_y):
        self.test_X = test_X
        self.test_y = test_y

    def rf_logger(self, model):
        # log the random forest model in nested mlflow runs
        with mlflow.start_run(nested=True):
            mlflow.log_param("n_estimators", model.n_estimators)
            mlflow.log_param("max_leaf_nodes", model.max_leaf_nodes)
            mlflow.log_metric("score", model.score(self.test_X, self.test_y))
            mlflow.sklearn.log_model(model, 'rf_model')
        return None

crf = CustomRandomForestClassifier(random_state=9)
param_grid = {
    'n_estimators': [10, 20],
    'max_leaf_nodes': [25, 50]
}

# Use the custom random forest classifier while defining the estimator for grid search
grid = GridSearchCV(crf, param_grid, cv=2, refit=True)

# Instantiate Logger with the test dataset
logger = Logger(test_X, test_y)

# start the outer mlflow run and perform grid search with cross-validation
with mlflow.start_run(run_name="grid_search"):
    # while calling the GridSearchCV object's fit method, pass logger.rf_logger,
    # which takes care of logging each fitted model during the grid search
    grid.fit(train_X, train_y, callback=logger.rf_logger)

    # log the best estimator found by grid search in the outer mlflow run
    mlflow.log_param("n_estimators", grid.best_params_['n_estimators'])
    mlflow.log_param("max_leaf_nodes", grid.best_params_['max_leaf_nodes'])
    mlflow.log_metric("score", grid.score(test_X, test_y))
    mlflow.sklearn.log_model(grid.best_estimator_, 'best_rf_model')

Related

How to get the coefficients in Lasso Regression at every split while performing 10 fold cross validation?

I am doing a RandomizedSearchCV to find the alpha value for Lasso regression, and I am performing 10-fold cross-validation. Is there a way to get the coefficient values for every split, just like we get the scores from cv_results_?
There is no direct way to do this via RandomizedSearchCV. But you can work around this by defining your own class that e.g. prints the coefficients to the console when the predict function is called:
from sklearn.linear_model import Lasso

class MyLasso(Lasso):
    def predict(self, X):
        print(self.coef_)
        return super().predict(X)
MyLasso behaves the same as Lasso and can be used as usual:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, RandomizedSearchCV

X, y = make_regression(n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_distributions = {'alpha': [0.01, 0.1, 1]}

rs = RandomizedSearchCV(
    MyLasso(),
    param_distributions=param_distributions,
    cv=2,
    n_iter=3,
    random_state=42
)
rs.fit(X_train, y_train)
Output for the example above (three iterations of 2-fold cross-validation gives six results):
[64.57650818 98.64237403 57.07123743 60.56898095 35.59985227]
[64.57001187 98.63679695 57.06557977 60.56304163 35.59888746]
[64.43774582 98.55938568 57.01219706 60.49221968 35.51151313]
[64.37690435 98.49805298 56.95345309 60.43375789 35.5018112 ]
[63.05012223 97.72950224 56.42179336 59.72460697 34.62812171]
[62.44582912 97.11061327 55.83218634 59.14092054 34.53104869]
It seemed to me that saving the coefficients as additional scores would be slicker than modifying the estimator itself as in #afsharov's answer. Defining a scorer and passing it to the search as
def coefs_scorer(estimator, X, y):
    return estimator.coef_

rs = RandomizedSearchCV(
    ...
    scoring={'r2': 'r2', 'coefs': coefs_scorer},
    refit='r2',
)
fails because there's a check that scorers return single numbers. So you need to unpack the coefficients, and I ended up with this:
from functools import partial

def coefs_scorer(estimator, X, y, i):
    return estimator.coef_[i]

scoring = {'r2': 'r2'}
for i in range(X_train.shape[1]):
    scoring[f'coef{i}'] = partial(coefs_scorer, i=i)

param_distributions = {'alpha': [0.01, 0.1, 1]}

rs = RandomizedSearchCV(
    Lasso(),
    param_distributions=param_distributions,
    cv=2,
    n_iter=3,
    random_state=42,
    scoring=scoring,
    refit='r2',
)
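After fitting, the coef0, coef1, ... entries created in the loop above show up in cv_results_ like any other metric, as mean_test_coef{i} and std_test_coef{i} per candidate. A quick sketch of pulling them out (pandas is used here only for display):

rs.fit(X_train, y_train)

import pandas as pd

# one row per sampled candidate, coefficient means averaged over the folds
results = pd.DataFrame(rs.cv_results_)
coef_cols = [f'mean_test_coef{i}' for i in range(X_train.shape[1])]
print(results[['param_alpha'] + coef_cols])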
Note that with multiple metrics you need to specify which to use for refitting. Because of all the additional work, I'm not so sure this is better than the custom class. It does have a few advantages though:
If you want to pickle the best estimator, you don't need to put the custom class into an importable module.
The scores are programmatically saved rather than just printed.
Since they're scores, you get the average and standard deviation of the coefficients across folds stored in cv_results_ (of course, calculating them yourself wouldn't be difficult).
Disadvantages:
We had to specify a metric per feature. It's ugly, but worse it assumes you know in advance the number of features (it would fail if your estimator was a pipeline that had a feature selection or certain feature engineering steps).
If you return train scores, you'll duplicate the coefficients in cv_results_.
These aren't actually scores, so semantically this is hacky.
The scorer assumes that coef_ exists and is one-dimensional.

Scoring strategy of sklearn.model_selection.GridSearchCV for LatentDirichletAllocation

I am trying to apply GridSearchCV to LatentDirichletAllocation using the sklearn library.
The current pipeline looks like this:
vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,
                             stop_words='english',
                             lowercase=True,
                             token_pattern='[a-zA-Z0-9]{3,}'
                             )
data_vectorized = vectorizer.fit_transform(doc_clean)  # where doc_clean is the processed text

lda_model = LatentDirichletAllocation(n_components=number_of_topics,
                                      max_iter=10,
                                      learning_method='online',
                                      random_state=100,
                                      batch_size=128,
                                      evaluate_every=-1,
                                      n_jobs=-1,
                                      )

search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}
model = GridSearchCV(lda_model, param_grid=search_params)
model.fit(data_vectorized)
Currently, GridSearchCV uses the approximate log-likelihood as the score to determine which model is best. What I would like to do is change my scoring method to be based on the approximate perplexity of the model instead.
According to sklearn's documentation of GridSearchCV, there is a scoring argument that I can use. However, I do not know how to apply perplexity as a scoring method, and I cannot find any examples online of people applying it. Is this possible?
By default, GridSearchCV uses the score() function of the final estimator in the pipeline.
make_scorer can be used here, but for calculating perplexity you will need other data from the fitted model as well, which could be a little complex to provide through make_scorer.
You can make a wrapper over your LDA in which you re-implement the score() function to return the perplexity. Something along these lines:
class MyLDAWithPerplexityScorer(LatentDirichletAllocation):
    def score(self, X, y=None):
        # You can change the options passed to perplexity here
        score = super(MyLDAWithPerplexityScorer, self).perplexity(X, sub_sampling=False)

        # perplexity is lower-is-better, so negate it to turn it into a score
        return -1 * score
You can then use this in place of LatentDirichletAllocation in your code, like:
...
...
...
lda_model = MyLDAWithPerplexityScorer(n_components=number_of_topics,
                                      ....
                                      ....
                                      n_jobs=-1,
                                      )
...
...
Note: the score and perplexity values seem to be buggy and dependent on the number of topics, so the grid search will end up favouring the lowest number of topics (GitHub issue).

Custom scoring function RandomForestRegressor

Using RandomizedSearchCV, I managed to find a RandomForestRegressor with the best hyperparameters.
But for this, I used a custom scoring function matching my specific needs.
Now, I don't know how to use the best_estimator_ (a RandomForestRegressor) returned by the search with my custom scoring function.
Is there a way to pass a custom scoring function to a RandomForestRegressor?
The scoring function in RandomizedSearchCV only calculates the score of the model's predictions for each combination of hyper-parameters specified in the grid, and the combination with the highest average score on the test folds wins.
It does not in any way alter the behaviour of the internal algorithm of RandomForest (other than finding the hyper-parameters, of course).
Now you have best_estimator_ (a RandomForestRegressor) with the best found hyper-parameters already set, and the model already trained on the whole data you sent to RandomizedSearchCV (if you used refit=True, which is True by default).
So I'm not sure what you want to do by passing that scorer to the model. The best_estimator_ model can be used directly to get predictions on new data via the predict() method. After that, the custom scoring you used can be applied to compare the predictions with the actual values. There's nothing more to it.
A simple example of this would be:
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import r2_score, make_scorer

X, y = load_boston().data, load_boston().target
X_train, X_test, y_train, y_test = train_test_split(X, y)

clf = RandomForestRegressor()

# Your custom scoring strategy
def my_custom_score(y_true, y_pred):
    return r2_score(y_true, y_pred)

# Wrap it in make_scorer to be able to use it in RandomizedSearchCV
my_scorer = make_scorer(my_custom_score)

# Hyper-parameters to be tuned
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11)}

random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=20, scoring=my_scorer)
random_search.fit(X_train, y_train)

# Best found parameter set and model trained on X_train, y_train
best_clf = random_search.best_estimator_

# Get predictions on your new data
y_test_pred = best_clf.predict(X_test)

# Calculate your score on the predictions with respect to the actual values
print(my_custom_score(y_test, y_test_pred))

Basic Sklearn: How to Pass Scoring Function to Fit Method

I'm using sklearn to do some machine learning. I often use GridSearchCV to explore hyperparameters and perform cross-validation. Using this, I can specify a scoring function, like this:
scores = -cross_val_score(svr, X, Y, cv=10, scoring='neg_mean_squared_error')
However, I want to train my SVR model using mean squared error. Unfortunately, there's no scoring parameter in either the constructor for SVR or the fit method.
How should I do this?
Thanks!
I typically use a Pipeline to do this. You can create a list of pipelines, including one for the SVR model (and others if you want), and then apply GridSearchCV, passing the pipeline in as its estimator.
You can then add a params_grid in which the search space keys take the form pipelinestepname__paramname (double underscore in between). For example, if the pipeline step is named svr and I want to search over the parameter C, I put the key svr__C in my parameter dictionary.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVR

c_range = np.arange(1, 10, 1)

pipeline = Pipeline([('svr', SVR())])
params_grid = {'svr__C': c_range}

# grid search with 3-fold cross-validation
gridsearch_model = GridSearchCV(pipeline, params_grid,
                                cv=3, scoring='neg_mean_squared_error')
Then you follow the same procedure: fit the training data and check the best score and parameters:
gridsearch_model.fit(X_train, y_train)
print(gridsearch_model.best_params_, gridsearch_model.best_score_)
You can also use cross_val_score to find the score:
cross_val_score(gridsearch_model, X_train, y_train,
                cv=3, scoring='neg_mean_squared_error')
Hope this helps!

scikit-learn GridSearchCV with multiple repetitions

I'm trying to get the best set of parameters for an SVR model.
I'd like to use the GridSearchCV over different values of C.
However, from the previous test, I noticed that the split into the Training/Test set highly influences the overall performance (r2 in this instance).
To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?
Quick solution, following the idea presented in the official scikit-learn documentation:

NUM_TRIALS = 10
scores = []

for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # fit before reading best_score_
    scores.append(clf.best_score_)

print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))
This is called nested cross-validation. You can look at the official documentation example to guide you in the right direction, and also have a look at my other answer here for a similar approach.
You can adapt the steps to suit your needs:
svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100, ... ]}

# CV technique: "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.

# To be used within GridSearch (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)

# To be used in the outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)

# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_

# Pass the GridSearchCV estimator to cross_val_score
# This will be your required 10 x 5 CV:
# 10 for the outer CV and 5 for GridSearchCV's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
Edit - Description of nested cross-validation with cross_val_score() and GridSearchCV()

1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
2. Pass clf, X, y, outer_cv to cross_val_score.
3. As seen in the source code of cross_val_score, this X will be divided into X_outer_train and X_outer_test using outer_cv. Same for y.
4. X_outer_test will be held back, and X_outer_train will be passed on to clf for fit() (GridSearchCV in our case). Assume X_outer_train is called X_inner from here on, since it is passed to the inner estimator, and that y_outer_train is y_inner.
5. X_inner will now be split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
6. Now the GridSearchCV estimator will be trained using X_inner_train and y_inner_train, and scored using X_inner_test and y_inner_test.
7. Steps 5 and 6 will be repeated for inner_cv_iters (5 in this case).
8. The hyper-parameters for which the average score over all inner iterations (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_, which is fitted on all the data, i.e. X_outer_train.
9. This clf (gridsearch.best_estimator_) will then be scored using X_outer_test and y_outer_test.
10. Steps 3 to 9 will be repeated for outer_cv_iters (10 here), and an array of scores will be returned from cross_val_score.
11. We then use mean() to get nested_score.
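If you also want to see which hyper-parameters each outer fold ends up with (step 8 above), cross_validate with return_estimator=True keeps the fitted GridSearchCV object for every outer split. A small sketch, reusing the names from the snippet above:

from sklearn.model_selection import cross_validate

clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_results = cross_validate(clf, X=X_iris, y=y_iris, cv=outer_cv,
                                return_estimator=True)
nested_score = nested_results['test_score'].mean()

# each entry is the GridSearchCV fitted on one outer training split,
# so best_params_ shows what that fold selected
for i, fitted_search in enumerate(nested_results['estimator']):
    print(i, fitted_search.best_params_, fitted_search.best_score_)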
You can supply different cross-validation generators to GridSearchCV. The default for binary or multiclass classification problems is StratifiedKFold. Otherwise, it uses KFold. But you can supply your own. In your case, it looks like you want RepeatedKFold or RepeatedStratifiedKFold.
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Define svr here
...

# Specify the cross-validation generator, in this case 10 x 5CV
cv = RepeatedKFold(n_splits=5, n_repeats=10)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)

# Continue as usual
clf.fit(...)
