Does the refit option in gridsearchcv re-select features? - python

I'm using GridSearchCV to train a logistic regression classifier. What I want to know is whether the refit step re-selects features based on the chosen hyper-parameter C, OR simply uses the features selected during the cross-validation procedure and only re-fits the coefficient values without re-selecting features?

As per the documentation of GridSearchCV:
Refit an estimator using the best found parameters on the whole dataset.
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
From Confused with respect to working of GridSearchCV you can get the significance of the refit parameter:
refit : boolean
Refit the best estimator with the entire dataset.
If “False”, it is impossible to make predictions using
this GridSearchCV instance after fitting.
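To make this concrete: if feature selection is part of the estimator passed to GridSearchCV (for example inside a Pipeline), then refit=True re-runs the whole pipeline, including the feature-selection step, on the entire dataset with the best parameters — it does not reuse a feature mask from any CV fold. A minimal sketch (the toy dataset and parameter grid are illustrative):

```python
# Sketch: put feature selection and the classifier in one Pipeline, so that
# refit=True re-fits BOTH steps -- including feature selection -- on the
# whole dataset with the best C found by the search.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline([
    ("select", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))),
    ("clf", LogisticRegression(solver="liblinear")),
])

gs = GridSearchCV(pipe, {"select__estimator__C": [0.5, 1.0]}, cv=3)
gs.fit(X, y)  # refit=True by default: best pipeline is refit on all of X, y

# The selection step of the refit pipeline was fit fresh on the full dataset,
# so this mask does not come from any individual CV fold.
mask = gs.best_estimator_.named_steps["select"].get_support()
print(mask.sum(), "features selected by the refit pipeline")
```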

Related

How to use GridSearchCV, cross_val_score and a model

I need to find the best hyperparams for an ANN and then run prediction on the best model. I use KerasRegressor. I find conflicting examples and advice. Please help me understand the right sequence and which params to use when.
I split my data into Train and Test datasets
I look for the best hyperparams using GridSearchCV on Train dataset
GridSearchCV.fit(X_Train, Y_Train)
I take GridSearchCV.best_estimator_ and use it in cross_val_score on the Test dataset, i.e.
cross_val_score(model.best_estimator_, X_Test, Y_Test, scoring='r2')
I'm not sure if I need this step. In theory, it should show an r2 score similar to what GridSearchCV reported for this best_estimator_, shouldn't it?
I use model.best_estimator_.predict(X_Test) on the Test data to predict the results, i.e. I pass best_estimator_ from GridSearchCV to run the actual prediction.
Is this correct ?
Do I need to fit model.best_estimator_ again on the Train data before doing a prediction? Or does it keep all the weights found during GridSearchCV?
Do I need to save weights to be able to reuse it later ?
Usually when you use GridSearchCV on your training set, you end up with an object which contains the best trained model with the best parameters:
gs = GridSearchCV(estimator, param_grid)
gs.fit(X_train, y_train)
This is also evident from running gs.best_params_, which will print out the best parameters of the model after cross-validation. Now you can make predictions on your test set directly by running gs.predict(X_test) — note that predict takes only the features, not the labels — which will use the best selected model.
For question 3, you don't need to use cross_val_score again, as this is a helper function that performs cross-validation on your dataset and returns the score of each fold of the data split.
For question 4, I believe this answer is quite explanatory: https://stats.stackexchange.com/a/26535
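A minimal sketch of the sequence discussed above, using a plain sklearn regressor in place of KerasRegressor (the dataset and parameter grid are illustrative):

```python
# Search on the training set only, then evaluate the refit best model on the
# held-out test set -- no extra cross_val_score and no manual refit needed.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gs = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3, scoring="r2")
gs.fit(X_train, y_train)     # search + refit on the training set only

# predict takes only the features; labels are used by score, not by predict
y_pred = gs.predict(X_test)
test_r2 = gs.score(X_test, y_test)
print(gs.best_params_, round(test_r2, 3))
```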

How to access the predictions made in scikit-learn's GridSearchCV-function?

I'm using scikit-learn's GridSearchCV to implement hyperparameter tuning for a classifier model. As I've understood from the documentation of GridSearchCV, you can query for attributes such as best estimator, best score, et cetera, but I would be interested in getting the predicted y-class labels which were used to calculate the best score attribute in GridSearchCV.
Is there a way to access these predictions?
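GridSearchCV itself keeps only the per-fold scores, not the per-fold predictions. One common workaround (a sketch — it assumes the default unshuffled stratified splits, so the folds line up with what the search used) is to regenerate out-of-fold predictions with cross_val_predict using the best estimator:

```python
# Recover out-of-fold predicted labels for the best parameter setting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_classification(n_samples=150, random_state=0)

gs = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}, cv=5)
gs.fit(X, y)

# Re-run the same 5-fold split with the best estimator; GridSearchCV itself
# discards the fold predictions, so we regenerate them here.
oof_pred = cross_val_predict(gs.best_estimator_, X, y, cv=5)
print(oof_pred[:10])
```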

Why do we need to fit the model again in order to get score?

I'm testing the embedded methods for feature selection.
I understand (maybe I misunderstand) that with embedded methods we can get the best features (based on feature importance) while training the model.
If so, I want to get the score of the trained model (the one that was trained to select features).
I'm testing this on a classification problem with the Lasso method.
When I try to get the score, I get an error saying that I need to fit the model again.
Why do I need to do that? It seems a waste of time if the model was already fitted during feature selection.
Why can't we do it (select features and get the model score) in one shot?
If we are using an embedded method, why do we need two phases? Why can't we train the model while choosing the best features in one fit?
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel
estimator = LogisticRegression(C=1, penalty='l1', solver='liblinear')
selection = SelectFromModel(estimator)
selection.fit(x_train, y_train)
print(estimator.score(x_test, y_test))
Error:
sklearn.exceptions.NotFittedError: This LogisticRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
The fitted estimator is returned as selection.estimator_ (see the docs); so, after fitting selection, you can simply do:
selection.estimator_.score(x_test, y_test)
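The snippet above, completed with a toy dataset so it runs end to end (the dataset is illustrative): SelectFromModel fits a clone of the estimator you pass in, so the original `estimator` object stays unfitted — the fitted clone lives on `selection.estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimator = LogisticRegression(C=1, penalty='l1', solver='liblinear')
selection = SelectFromModel(estimator)
selection.fit(x_train, y_train)   # fits a clone internally, not `estimator`

# `estimator` itself is still unfitted; the fitted clone is estimator_
score = selection.estimator_.score(x_test, y_test)
print(round(score, 3))
```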

Does GridSearchCV in sklearn train the model with whole data set?

I know that GridSearchCV will find the 'best' hyper-parameters by using k-fold CV. But after finding those hyper-parameters, will GridSearchCV train the model again on the whole data set to learn the model parameters? Or does it only train the model on the folds that produced the best hyper-parameters?
According to the sklearn documentation, look up the keyword refit, which defaults to True: the best estimator is refit on the whole dataset.
I believe this answers your question.
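A quick sketch of what refit controls (the estimator and grid are illustrative): with the default refit=True the best model is refit on the whole dataset and exposed as best_estimator_; with refit=False that final model is never built.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, random_state=0)
grid = {"C": [0.1, 1.0]}

gs_refit = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3)  # refit=True
gs_refit.fit(X, y)
print(hasattr(gs_refit, "best_estimator_"))      # True: refit on all of X, y

gs_no_refit = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3, refit=False)
gs_no_refit.fit(X, y)
print(hasattr(gs_no_refit, "best_estimator_"))   # False: no final model was built
```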

GridSearchCV final model

If I use GridSearchCV in the scikit-learn library to find the best model, what will be the final model it returns? That is, for each set of hyper-parameters, we train one model per CV fold (say 3). Will the function return the best of those 3 models for the best parameter setting?
The GridSearchCV will return an object with quite a lot information. It does return the model that performs the best on the left-out data:
best_estimator_ : estimator or dict
Estimator that was chosen by the search, i.e. estimator which gave
highest score (or smallest loss if specified) on the left out data.
Not available if refit=False.
Note that with the default refit=True, this estimator is refit on the entire dataset after the search finishes, so you do not need to retrain it yourself; with refit=False, it is not available at all and you would have to train the final model on the entire data yourself.
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
This is given in sklearn:
“The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.”
So, you don’t need to fit the model again. You can directly get the best model from best_estimator_ attribute
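This can be checked directly: with the default refit=True, best_estimator_ matches a fresh model fit with the best parameters on the entire dataset, confirming it is not one of the per-fold models (a sketch; the toy data is illustrative):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=150, random_state=0)

gs = GridSearchCV(LogisticRegression(solver="liblinear"), {"C": [0.1, 1.0]}, cv=3)
gs.fit(X, y)

# Manually refit a fresh model with the best params on the whole dataset:
manual = clone(gs.estimator).set_params(**gs.best_params_).fit(X, y)

# Coefficients match, so best_estimator_ was trained on the entire dataset.
print(np.allclose(gs.best_estimator_.coef_, manual.coef_))
```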
