GridSearchCV final model - python

If I use GridSearchCV in the scikit-learn library to find the best model, what will the final model it returns be? That is, for each set of hyper-parameters we train as many models as there are CV folds (say 3). Will the function return the best of those 3 models for the best parameter setting?

GridSearchCV returns an object with quite a lot of information. It does return the model that performed best on the left-out data:
best_estimator_ : estimator or dict
Estimator that was chosen by the search, i.e. estimator which gave
highest score (or smallest loss if specified) on the left out data.
Not available if refit=False.
Note that this is not a model trained on the entire data. That means that once you are confident this is the model you want, you will need to retrain it on the entire dataset yourself.
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

This is stated in the sklearn documentation:
“The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.”
So, you don't need to fit the model again. You can get the best model directly from the best_estimator_ attribute.
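A minimal sketch of that workflow (the iris data, the logistic regression estimator, and the parameter grid below are illustrative assumptions, not part of the original question): with the default refit=True, the fitted search object exposes the refit model directly.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# 3-fold CV over a small grid; refit=True (the default) retrains the best
# parameter combination on the whole of X, y after the search.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10]},
    cv=3,
)
grid.fit(X, y)

print(grid.best_params_)      # best hyper-parameters found by the search
print(grid.best_estimator_)   # the model refit on the entire dataset
print(grid.predict(X[:5]))    # predict() delegates to best_estimator_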

Related

Why doesn't RandomForestClassifier have a cost_complexity_pruning_path method?

In trying to prevent my Random Forest model from overfitting on the training dataset, I looked at the ccp_alpha parameter.
I noticed that it is possible to tune it with a hyper-parameter search method (such as GridSearchCV).
I discovered that there is a Scikit-Learn tutorial for tuning this ccp_alpha parameter for Decision Tree models.
The methodology described uses the cost_complexity_pruning_path method of the Decision Tree model. That section explains well how the method works: it seeks a sub-tree of the fitted model that reduces overfitting, using candidate values of ccp_alpha determined by cost_complexity_pruning_path.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
# Effective alphas and corresponding total leaf impurities along the pruning path
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
However, I wonder why the Random Forest model in Scikit-Learn does not implement this ccp_alpha selection and pruning concept.
Would it be possible to do this with a little tinkering?
It seems more logical to me than trying to find a good value with a hyper-parameter search (whichever search method you use).
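One possible direction for that tinkering, sketched only as an illustration (the dataset, forest size, and the idea of pooling the per-tree pruning alphas are my own assumptions, not an established scikit-learn recipe): each fitted tree in estimators_ is a DecisionTreeClassifier, so its pruning path can still be computed tree by tree.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

# The ensemble itself has no cost_complexity_pruning_path, but each member of
# estimators_ is a DecisionTreeClassifier, which does.
all_alphas = [
    tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
    for tree in forest.estimators_
]
# De-duplicated candidate values for the forest-level ccp_alpha parameter,
# e.g. to feed into a grid search over RandomForestClassifier(ccp_alpha=...).
candidate_alphas = np.unique(np.concatenate(all_alphas))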

Why do we need to fit the model again in order to get score?

I'm testing the embedded methods for feature selection.
I understand (maybe I misunderstand) that with embedded methods we can get the best features (based on feature importance) while training the model.
If so, I want to get the score of the trained model (the one that was trained to select the features).
I'm testing this on a classification problem with the Lasso method.
When I try to get the score, I get an error saying that I need to fit the model again.
Why do I need to do that? It seems a waste of time if the model was already fitted during feature selection.
Why can't we do it (select features and get the model score) in one shot?
If we are using an embedded method, why do we need two phases? Why can't we train the model while choosing the best features in a single fit?
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel
estimator = LogisticRegression(C=1, penalty='l1', solver='liblinear')
selection = SelectFromModel(estimator)
selection.fit(x_train, y_train)
print(estimator.score(x_test, y_test))
Error:
sklearn.exceptions.NotFittedError: This LogisticRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
The fitted estimator is returned as selection.estimator_ (see the docs); so, after fitting selection, you can simply do:
selection.estimator_.score(x_test, y_test)
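For completeness, a self-contained sketch of the same fix (the breast-cancer toy dataset and the train/test split are illustrative assumptions): SelectFromModel fits a clone of the estimator you pass in, which is why the original estimator object stays unfitted, while the fitted clone is available as estimator_.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = LogisticRegression(C=1, penalty='l1', solver='liblinear')
selection = SelectFromModel(estimator)
selection.fit(x_train, y_train)

# The fitted clone is stored on the selector, so no second fit is needed.
print(selection.estimator_.score(x_test, y_test))  # score using all features
print(selection.transform(x_test).shape)           # data reduced to the selected features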

Does the refit option in gridsearchcv re-select features?

I'm using GridSearchCV to train a logistic regression classifier. What I want to know is whether the refit step re-selects features based on the chosen hyper-parameter C, or simply reuses the features selected during the cross-validation procedure and only re-fits the coefficient values without re-selecting features.
As per the documentation of GridSearchCV:
1. Refit an estimator using the best found parameters on the whole dataset.
2. The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
From the answer here, Confused with respect to working of GridSearchCV, you can get the significance of the refit parameter below:
refit : boolean
Refit the best estimator with the entire dataset.
If “False”, it is impossible to make predictions using
this GridSearchCV instance after fitting.
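In other words, refit re-runs whatever estimator you passed to GridSearchCV on the whole dataset. A minimal sketch (the pipeline, dataset, and grid below are illustrative assumptions): if feature selection is a step inside that estimator, it is re-run during the refit rather than copied from a CV fold.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# L1-based feature selection followed by the final classifier; both steps are
# part of the estimator that GridSearchCV clones, fits per fold, and refits.
pipe = Pipeline([
    ("select", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)

# best_estimator_ is the whole pipeline refit on all of X, y, so feature
# selection was performed again on the full dataset.
mask = grid.best_estimator_.named_steps["select"].get_support()
print(mask.sum(), "features selected during the refit")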

Does GridSearchCV in sklearn train the model with whole data set?

I know that GridSearchCV will find the 'best' hyper-parameters by using k-fold CV. But after finding those hyper-parameters, will GridSearchCV train the model again on the whole data set to get the trainable parameters? Or does it only train the model on the folds that produced the best hyper-parameters?
According to the sklearn documentation, look up the keyword refit, which defaults to True: the best estimator is refit on the whole dataset.
I believe this answers your question.

How do I take a sklearn post-cross_val_predict model to predict on another (scaled) data set? And can the model be serialized?

I came across this question while working on a sklearn ML case with heavily imbalanced data. The line below provides the basis for assessing the model from confusion-matrix and precision-recall perspectives, but ... it is a combined train/predict method:
y_pred = model_selection.cross_val_predict(model, X, Y, cv=kfold)
The question is how do I leverage this 'cross-val-trained' model to:
1) predict on another data set (scaled) instead of having to train/predict each time?
2) export/serialize/deploy the model to predict on live data?
model.predict() #--> nope. need a fit() first
model.fit() #--> nope. a different model which does not take advantage of the cross_val_xxx methods
Any help is appreciated.
You can fit a new model with the data.
Cross-validation is about validating the way the model is built, not the model itself. So if the cross-validation results look OK, you can train a new model on all the data.
(See my response here as well for more details: Fitting sklearn GridSearchCV model.)
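A minimal sketch of that workflow (the dataset, classifier, and file name are illustrative assumptions; any new data such as a hypothetical X_new must be preprocessed/scaled the same way as the training data):

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, Y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0)

# 1) cross_val_predict only assesses the modelling procedure.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, Y, cv=kfold)
print(classification_report(Y, y_pred))

# 2) If the assessment looks good, fit a final model on all the data,
#    serialize it, and load it elsewhere to predict on new data.
model.fit(X, Y)
joblib.dump(model, "model.joblib")
loaded = joblib.load("model.joblib")
# predictions = loaded.predict(X_new)  # X_new: hypothetical new, identically-scaled data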
