I am trying to run a RandomForestClassifier using Pipeline, GridSearchCV and cross-validation.
I am getting an error when I fit the data, and I am not sure how to fix it.
I found a similar question with a solution (https://stackoverflow.com/a/34890246/9592484), but it didn't work for me.
I will appreciate any help on this.
My code is:
column_trans = make_column_transformer((OneHotEncoder(), ['CategoricalData']),
                                       remainder='passthrough')
RF = RandomForestClassifier()
pipe = make_pipeline(column_trans, RF)

# Set grid search params
grid_params = [{'randomforestclassifier_criterion': ['gini', 'entropy'],
                'randomforestclassifier_min_samples_leaf': [5, 10, 20, 30, 50, 80, 100],
                'randomforestclassifier_max_depth': [3, 4, 6, 8, 10],
                'randomforestclassifier_min_samples_split': [2, 4, 6, 8, 10]}]

# Construct grid search
gs = GridSearchCV(estimator=pipe,
                  param_grid=grid_params,
                  scoring='accuracy',
                  cv=5)

gs.fit(train_features, train_target)  # <-- This is where I get the error
ValueError: Invalid parameter randomforestclassifier_criterion for estimator Pipeline(steps=[('columntransformer',
ColumnTransformer(remainder='passthrough',
transformers=[('onehotencoder',
OneHotEncoder(),
['saleschanneltypeid'])])),
('randomforestclassifier', RandomForestClassifier())]). Check the list of available parameters with `estimator.get_params().keys()`.
The make_pipeline utility function derives step names from the transformer/estimator class names. For example, RandomForestClassifier is mapped to the randomforestclassifier step.
Please adjust your grid search parameter prefixes accordingly (i.e. from RF to randomforestclassifier). For example, RF__criterion should become randomforestclassifier__criterion.
You are not prepending the step name correctly: you have to use two underscores (__) between the step name and each parameter. You are using only one underscore (_), which won't work.
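For reference, here is the corrected parameter grid from the question with the double-underscore prefixes (only the prefixes change; the value ranges are the ones you already have):

grid_params = [{'randomforestclassifier__criterion': ['gini', 'entropy'],
                'randomforestclassifier__min_samples_leaf': [5, 10, 20, 30, 50, 80, 100],
                'randomforestclassifier__max_depth': [3, 4, 6, 8, 10],
                'randomforestclassifier__min_samples_split': [2, 4, 6, 8, 10]}]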
Related
cross_val = StratifiedKFold(n_splits=split_number)
index_iterator = cross_val.split(features_dataframe, classes_dataframe)
clf = RandomForestClassifier()
random_grid = _create_hyperparameter_finetuning_grid()
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                                n_iter=100, cv=cross_val, verbose=2,
                                random_state=42, n_jobs=-1)
clf_random.fit(X, y)
I found this code for RandomizedSearchCV on some site. I am not sure what random_grid = _create_hyperparameter_finetuning_grid() does in this code.
Please enlighten me on this hyperparameter tuning step.
random_grid is passed as the param_distributions argument of RandomizedSearchCV, and the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) says it should be a "Dictionary with parameters names (str) as keys and distributions or lists of parameters to try".
Based on its hyperparameters, the random forest model builds a set of decision trees to perform the classification. The param_distributions value determines which hyperparameter combinations are sampled and tested during the search.
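_create_hyperparameter_finetuning_grid() is simply a helper defined elsewhere on that site that returns such a dictionary. A hypothetical sketch of what it could look like for a RandomForestClassifier (the parameter names are real RandomForestClassifier arguments, but the value ranges here are made up for illustration):

import numpy as np

def _create_hyperparameter_finetuning_grid():
    # Hypothetical example: candidate values for RandomizedSearchCV to sample from
    return {
        'n_estimators': [int(x) for x in np.linspace(100, 1000, 10)],
        'max_depth': [None, 10, 20, 30, 50],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['sqrt', 'log2'],
        'bootstrap': [True, False],
    }

random_grid = _create_hyperparameter_finetuning_grid()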
There is also an explanation here:
How to correctly implement StratifiedKFold with RandomizedSearchCV
I just started using MLFlow and I am happy with what it can do. However, I cannot find a way to log the individual runs of a GridSearchCV from scikit-learn.
For example, I can do this manually:
params = ['l1', 'l2']

for param in params:
    with mlflow.start_run(experiment_id=1):
        clf = LogisticRegression(penalty=param).fit(X_train, y_train)
        y_predictions = clf.predict(X_test)

        precision = precision_score(y_test, y_predictions)
        recall = recall_score(y_test, y_predictions)
        f1 = f1_score(y_test, y_predictions)

        mlflow.log_param("penalty", param)
        mlflow.log_metric("Precision", precision)
        mlflow.log_metric("Recall", recall)
        mlflow.log_metric("F1", f1)
        mlflow.sklearn.log_model(clf, "model")
But when I want to use GridSearchCV like this:
pipe = Pipeline([('classifier', RandomForestClassifier())])

param_grid = [
    {'classifier': [LogisticRegression()],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': list(range(10, 101, 10)),
     'classifier__max_features': list(range(6, 32, 5))}
]

clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train, y_train)
I cannot think of any way to log all the individual models that the grid search tests. Is there any way to do it, or do I have to keep using the manual process?
I'd recommend hyperopt instead of scikit-learn's GridSearchCV. Hyperopt can search the space with Bayesian optimization using hyperopt.tpe.suggest. It will arrive at good parameters faster than a grid search and you can limit the number of iterations no matter the space size, so it's definitely better for large spaces. Since you're interested in the artifacts from the individual runs, you may prefer hyperopt's random search, which still has the advantage of being able to choose how many runs you perform.
You can parallelize the search very easily with Spark using hyperopt.SparkTrials (here's a more complete example). Note that you can keep using scikit's cross validation, just put it inside the objective function (you can even keep track of the variance of the cross validation using loss_variance).
Now, to actually answer the question, I believe you can log the model, parameters, metrics, or whatever inside the objective function that you pass to hyperopt.fmin. MLFlow will store each run as a child of the main run, and each run can have its own artifacts.
So you want something like this:
def objective(params):
    metrics = ...  # e.g. a list of scoring names understood by cross_validate
    classifier = SomeClassifier(**params)
    cv = cross_validate(classifier, X_train, y_train, scoring=metrics)
    scores = {metric: cv[f'test_{metric}'] for metric in metrics}

    # log all the stuff here
    mlflow.log_metric('...', scores[...])
    mlflow.sklearn.log_model(classifier.fit(X_train, y_train), 'model')

    return scores['some_loss'].mean()

space = hp.choice(...)
trials = SparkTrials(parallelism=...)
with mlflow.start_run() as run:
    best_result = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
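If you later need the concrete best parameters, note that fmin returns index positions for hp.choice spaces; hyperopt's space_eval helper translates the result back into actual values. Assuming the space and best_result above:

from hyperopt import space_eval

# Map the fmin result (indices for hp.choice) back to concrete parameter values
best_params = space_eval(space, best_result)
print(best_params)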
I agree with the other answer that using hyperopt would be ideal to log experiments with MLFlow, especially in a Spark environment. One way to log individual model fits within GridSearchCV would be to extend the sklearn estimator’s fit method and pass a callback function to GridSearchCV’s fit.
Any parameter passed to GridSearchCV’s fit is cascaded down to the fit method of the estimators within GridSearchCV. This allows us to pass a logger function to store parameters, metrics, models etc. with MLFlow.
Here is an example with RandomForestClassifier as the estimator, however this approach should work with any other estimator as well:
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


class CustomRandomForestClassifier(RandomForestClassifier):
    '''
    A custom random forest classifier.
    The RandomForestClassifier class is extended by adding a callback function within its fit method.
    '''
    def fit(self, X, y, **kwargs):
        super().fit(X, y)
        # if a "callback" key is passed, call the "callback" function with the fitted estimator
        if 'callback' in kwargs:
            kwargs['callback'](self)
        return self


class Logger:
    '''
    Logger class stores the test dataset
    and logs the sklearn random forest estimator in its rf_logger method.
    '''
    def __init__(self, test_X, test_y):
        self.test_X = test_X
        self.test_y = test_y

    def rf_logger(self, model):
        # log the random forest model in a nested mlflow run
        with mlflow.start_run(nested=True):
            mlflow.log_param("n_estimators", model.n_estimators)
            mlflow.log_param("max_leaf_nodes", model.max_leaf_nodes)
            mlflow.log_metric("score", model.score(self.test_X, self.test_y))
            mlflow.sklearn.log_model(model, 'rf_model')
        return None


crf = CustomRandomForestClassifier(random_state=9)

param_grid = {
    'n_estimators': [10, 20],
    'max_leaf_nodes': [25, 50]
}

# Use the custom random forest classifier while defining the estimator for grid search
grid = GridSearchCV(crf, param_grid, cv=2, refit=True)

# Instantiate Logger with the test dataset
logger = Logger(test_X, test_y)

# Start the outer mlflow run and perform grid search with cross-validation.
# While calling the GridSearchCV object's fit method, pass logger.rf_logger,
# which takes care of logging each fitted model during the grid search.
with mlflow.start_run(run_name="grid_search"):
    grid.fit(train_X, train_y, callback=logger.rf_logger)

    # log the best estimator found by grid search in the outer mlflow run
    mlflow.log_param("n_estimators", grid.best_params_['n_estimators'])
    mlflow.log_param("max_leaf_nodes", grid.best_params_['max_leaf_nodes'])
    mlflow.log_metric("score", grid.score(test_X, test_y))
    mlflow.sklearn.log_model(grid.best_estimator_, 'best_rf_model')
I am trying to use SMOTE from imblearn inside a Pipeline. My data sets are text data stored in a pandas DataFrame. Please see the code snippet below:
text_clf = Pipeline([('vect', TfidfVectorizer()),
                     ('scale', StandardScaler(with_mean=False)),
                     ('smt', SMOTE(random_state=5)),
                     ('clf', LinearSVC(class_weight='balanced'))])
After this I am using GridSearchCV:
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring='accuracy')
where parameters is nothing but tuning parameters, mostly for the TfidfVectorizer().
I am getting the following error:
All intermediate steps should be transformers and implement fit and transform. 'SMOTE
After this error, I changed the code as follows:
vect = TfidfVectorizer(use_idf=True, smooth_idf=True, max_df=0.25, sublinear_tf=True, ngram_range=(1, 2))
X = vect.fit_transform(X).todense()
Y = vect.fit_transform(Y).todense()
X_Train, X_Test, Y_Train, y_test = train_test_split(X, Y, random_state=0, test_size=0.33, shuffle=True)

text_clf = make_pipeline([('smt', SMOTE(random_state=5)),
                          ('scale', StandardScaler(with_mean=False)),
                          ('clf', LinearSVC(class_weight='balanced'))])

grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring='accuracy')
where parameters is nothing but tuning C in the SVC classifier.
This time I am getting the following error:
Last step of Pipeline should implement fit. SMOTE(...) doesn't
What is going here? Can anyone please help?
imblearn.SMOTE has no transform method; the docs are here.
But all steps except the last in a pipeline need it, along with fit.
To use SMOTE with an sklearn pipeline you would have to implement a custom transformer calling SMOTE.fit_sample() (fit_resample() in newer imblearn versions) in its transform method.
A much easier option is just to use the imblearn pipeline:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline

# This doesn't work with sklearn.pipeline.Pipeline because
# SMOTE doesn't have a .transform() method.
# (It has .fit_sample() or .sample().)
pipe = imbPipeline([
    ...
    ('oversample', SMOTE(random_state=5)),
    ('clf', LinearSVC(class_weight='balanced'))
])
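For instance, the original pipeline from the question could be rebuilt with the imblearn Pipeline roughly like this (a sketch using the same steps and the same parameters dict as the question):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Same steps as the sklearn pipeline, but imblearn's Pipeline knows how to
# call SMOTE's resampling step during fit (and skips it at predict time).
text_clf = imbPipeline([
    ('vect', TfidfVectorizer()),
    ('scale', StandardScaler(with_mean=False)),
    ('smt', SMOTE(random_state=5)),
    ('clf', LinearSVC(class_weight='balanced')),
])

grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring='accuracy')
grid.fit(X, Y)  # raw text features and labels, as in the original setup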
I want to use the BernoulliNB() classifier, and my data is not binarized, so I want to choose the best binarization threshold with GridSearchCV().
My code looks like:
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import Binarizer
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([('binarizer', Binarizer()), ('classifier', BernoulliNB())])
params = {'estimator__binarizer__threshold': np.logspace(0, 5, 20)}

clf = GridSearchCV(pipeline, param_grid=params, cv=5, refit=True)
clf.fit(X_train, y_train)
clf.best_estimator_.score(X_test, y_test)
It gives me this error:
ValueError: Check the list of available parameters with estimator.get_params().keys().
I don't know what's wrong.
Yes, my bad. In the comment I only spotted the spelling mistake in 'treshold' and, in a hurry, did not pay attention to the estimator part.
For a pipeline, a parameter is addressed by combining two parts with a double underscore (__):
1. The name of the step, like binarizer or classifier here.
2. The actual parameter name for that step.
You don't need to prepend estimator to these parts. So in your case, you need to use the following:
params = {'binarizer__threshold': np.logspace(0, 5, 20)}
to access the 'threshold' param of the 'binarizer' step of the pipeline.
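Putting it together with the pipeline from the question, only the parameter key changes:

params = {'binarizer__threshold': np.logspace(0, 5, 20)}

clf = GridSearchCV(pipeline, param_grid=params, cv=5, refit=True)
clf.fit(X_train, y_train)
clf.best_estimator_.score(X_test, y_test)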
I'm using sklearn to do some machine learning. I often use GridSearchCV to explore hyperparameters and perform cross-validation. Using this, I can specify a scoring function, like this:
scores = -cross_val_score(svr, X, Y, cv=10, scoring='neg_mean_squared_error')
However, I want to train my SVR model using mean squared error. Unfortunately, there's no scoring parameter in either the constructor for SVR or the fit method.
How should I do this?
Thanks!
I typically use Pipeline to do this. You can create a list of pipelines including the SVR model (and others if you want). Then you can apply GridSearchCV, passing the pipeline in as its estimator argument.
Here you can add a params_grid where the search space is defined as pipelinename__paramname (double underscore in between). For example, if my pipeline step is named svr and I want to search over the parameter C, I put the key in my parameter dictionary as svr__C.
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVR

c_range = np.arange(1, 10, 1)
pipeline = Pipeline([('svr', SVR())])
params_grid = {'svr__C': c_range}

# grid search with 3-fold cross-validation, scored by (negative) mean squared error
gridsearch_model = GridSearchCV(pipeline, params_grid,
                                cv=3, scoring='neg_mean_squared_error')
Then you can follow the same procedure: fit the training data and inspect the best score and parameters.
gridsearch_model.fit(X_train, y_train)
print(gridsearch_model.best_params_, gridsearch_model.best_score_)
You can also use cross_val_score to find the score:
cross_val_score(gridsearch_model, X_train, y_train,
                cv=3, scoring='neg_mean_squared_error')
Hope this helps!