I am running a Python 3 classification script on a server using the following code:
# define knn classifier for transformed data
knn_classifier = neighbors.KNeighborsClassifier()
# define KNN parameters
knn_parameters = [{
'n_neighbors': [1, 3, 5, 7, 9, 11],
'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'n_jobs': [-1],
'weights': ['uniform', 'distance']}]
# Stratified k-fold (default for classifier)
# n = 5 folds is default
knn_models = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy')
# fit grid search models to transformed training data
knn_models.fit(X_train_transformed, y_train)
I then save the GridSearchCV object using pickle:
# save model
with open('knn_models.pickle', 'wb') as f:
    pickle.dump(knn_models, f)
So I can test the classifiers on smaller datasets on my local machine by running:
knn_models = pickle.load(open("knn_models.pickle", "rb"))
validation_knn_model = knn_models.best_estimator_
This is great if I only want to evaluate the best estimator on a validation set. But what I'd actually like to do is:
1. pull the original data out of the GridSearchCV object (I'm assuming it's stored somewhere in the object, since it would be needed to classify a new validation set)
2. try a few specific classifiers with almost all of the best parameters as determined by the grid search, but changing one specific input parameter, i.e. k = 3, 5, 7
3. retrieve y_pred, i.e. the predictions on the validation set for each of the new classifiers tested above
GridSearchCV does not include the original data (and it would arguably be absurd if it did). The only data it includes is its own bookkeeping, i.e. the detailed scores and parameters tried for each CV fold. The returned best_estimator_ is the only thing needed to apply the model to any new data encountered, but if, as you say, you would like to dig deeper into the details, the full results are returned in its cv_results_ attribute.
Adapting the example from the documentation to the knn classifier with your own knn_parameters grid (but removing n_jobs, which only affects the fitting speed and is not a real hyperparameter of the algorithm), and keeping cv=3 for simplicity, we have:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
import pandas as pd
iris = load_iris()
knn_parameters = [{
'n_neighbors': [1, 3, 5, 7, 9, 11],
'leaf_size': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
'weights': ['uniform', 'distance']}]
knn_classifier = KNeighborsClassifier()
clf = GridSearchCV(estimator = knn_classifier, param_grid = knn_parameters, scoring = 'accuracy', n_jobs=-1, cv=3)
clf.fit(iris.data, iris.target)
clf.best_estimator_
# result:
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
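As a side note, if you only need the winning parameter values as a plain dict, rather than the fitted estimator, they are also available in the best_params_ attribute:
clf.best_params_
# result (matching the printout above):
{'algorithm': 'auto', 'leaf_size': 5, 'n_neighbors': 5, 'weights': 'uniform'}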
So, as noted, the best_estimator_ above tells you all you need to know to apply the algorithm to any new data (validation, test, data from deployment, etc.). Also, you may find that actually removing the n_jobs entry from the knn_parameters grid and asking instead for n_jobs=-1 in the GridSearchCV object results in a much faster CV procedure. Nevertheless, if you want to use n_jobs=-1 in your final model, you can easily manipulate the best_estimator_ to do so:
clf.best_estimator_.n_jobs = -1
clf.best_estimator_
# result
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
weights='uniform')
This actually answers your second question: you can similarly manipulate the best_estimator_ to change other hyperparameters, too.
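For instance, here is a minimal sketch for points 2 and 3 of your question; X_val_transformed is assumed to be your transformed validation set, while the other names come from your own snippets:
from sklearn.base import clone

y_preds = {}
for k in [3, 5, 7]:
    est = clone(knn_models.best_estimator_)   # copy the tuned settings, unfitted
    est.set_params(n_neighbors=k)             # change only the parameter of interest
    est.fit(X_train_transformed, y_train)     # the training data must be supplied again
    y_preds[k] = est.predict(X_val_transformed)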
So, having found the best model is where most people would stop. But if, for any reason, you want to dig further into the details of the whole grid-search process, the detailed results are returned in the cv_results_ attribute, which you can even import into a pandas dataframe for easier inspection:
cv_results = pd.DataFrame.from_dict(clf.cv_results_)
For example, the cv_results dataframe includes a column rank_test_score which, as its name clearly implies, contains the rank of each parameter combination:
cv_results['rank_test_score']
# result:
0 481
1 481
2 145
3 145
4 1
...
571 1
572 145
573 145
574 433
575 1
Name: rank_test_score, Length: 576, dtype: int32
Here 1 means best, and you can readily see that more than one combination is ranked 1 - so in fact we have more than one "best" model (i.e. parameter combination)! Although here this is most probably due to the relative simplicity of the iris dataset, there is no reason in principle why it cannot happen in a real case, too. In such cases, the returned best_estimator_ is just the first of these occurrences - here combination number 4:
cv_results.iloc[4]
# result:
mean_fit_time 0.000669559
std_fit_time 1.55811e-05
mean_score_time 0.00474652
std_score_time 0.000488042
param_algorithm auto
param_leaf_size 5
param_n_neighbors 5
param_weights uniform
params {'algorithm': 'auto', 'leaf_size': 5, 'n_neigh...
split0_test_score 0.98
split1_test_score 0.98
split2_test_score 0.98
mean_test_score 0.98
std_test_score 0
rank_test_score 1
Name: 4, dtype: object
which, as you can easily see, has the same parameters as our best_estimator_ above. But now you can inspect all the "best" models, simply by:
cv_results.loc[cv_results['rank_test_score']==1]
which, in my case, results in no fewer than 144 models (out of the 6*12*4*2 = 576 combinations tried)! So you can in fact select among more choices, or even use additional criteria, say the standard deviation of the returned score (the lower the better, although here it is already at its minimum value of 0), instead of relying simply on the maximum mean score, which is what the automatic procedure returns.
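For instance, a quick way to list the tied combinations, ordered by how stable their scores are across folds:
best_rows = cv_results.loc[cv_results['rank_test_score'] == 1]
# sort the tied "best" combinations by the spread of their fold scores
print(best_rows.sort_values('std_test_score')[['params', 'mean_test_score', 'std_test_score']])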
Hopefully these will be enough to get you started...
Though I understand the potential benefits, especially in combination with GridSearchCV, I wonder why it is always used like this (or at least that is how I understand it):
Pipeline steps are set for each classifier (with 'passthrough' for the clf step). Then, GridSearchCV equips the pipeline with multiple parameters and classifiers.
I am not sure if this is true, but from my point of view, it seems as if this causes the steps before the classifier to run multiple times, even when they are always used with the same parameters.
This leads me to the question of why it is not done the other way around... or whether that would even be possible?
Here is an example configuration illustrating the situation I have in mind:
First let's create a dataset
from sklearn.datasets import make_classification
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# generate some data to play with
X, y = make_classification(n_features=64, n_informative=5, n_redundant=0, random_state=42)  # 64 features, so every n_components value below is valid
Now, the usual way of working with a grid search is to try different parameters for all steps.
As an example let's use PCA and SVC.
pipe = Pipeline(steps=[('pca', PCA()), ('svm', svm.SVC())])
# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
'pca__n_components': [5, 15, 30, 45, 64],
'svm__C': [1, 5, 10],
}
gs = GridSearchCV(pipe, param_grid, n_jobs=-1)
gs.fit(X, y)
However, if you want, you can apply the preprocessing steps to the data beforehand and only perform the grid search on the classifier:
pca = PCA()
X_pca = pca.fit_transform(X)  # fit_transform returns only the transformed X
parameters = {'C':[1, 5, 10]}
svc = svm.SVC()
gs = GridSearchCV(svc, parameters)
gs.fit(X_pca, y)
The problem is that this way you can't test parameter correlations between different steps.
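If you do keep everything in the pipeline, you can still inspect how the two steps interact afterwards. A small sketch (assuming the gs fitted above) that tabulates the mean CV score for every (n_components, C) pair:
import pandas as pd
results = pd.DataFrame(gs.cv_results_)
# rows: PCA dimensionality, columns: SVM C, cells: mean CV accuracy
print(results.pivot_table(index='param_pca__n_components',
                          columns='param_svm__C',
                          values='mean_test_score'))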
Let's say we tune an SVM with GridSearch like this:
algorithm = SVC()
parameters = {'kernel': ['rbf', 'sigmoid'], 'C': [0.1, 1, 10]}
grid = GridSearchCV(algorithm, parameters)
grid.fit(X, y)
You then wish to use the best-fit parameters/estimator in a cross_val_score. My question is: which model is grid at this point? Is it the best-performing one? In other words, can we just do
cross_val_scores = cross_val_score(grid, X=X, y=y)
or should we use
cross_val_scores = cross_val_score(grid.best_estimator_, X=X, y=y)
When I run both, I find that they do not return the same scores, so I am curious what the correct approach is here. (I would assume using the best_estimator_.) That raises another question, though: what does just grid use as a model then? The first one?
You don't need cross_val_score after fitting a GridSearchCV. It already has attributes that let you access the cross-validation scores: cv_results_ gives you all of them. You can index into it with the best_index_ attribute if you want to see only that specific estimator's results.
cv_results = pd.DataFrame(grid.cv_results_)
cv_results.iloc[grid.best_index_]
mean_fit_time 0.00046916
std_fit_time 1.3785e-05
mean_score_time 0.000251055
std_score_time 1.19038e-05
param_C 10
param_kernel rbf
params {'C': 10, 'kernel': 'rbf'}
split0_test_score 0.966667
split1_test_score 1
split2_test_score 0.966667
split3_test_score 0.966667
split4_test_score 1
mean_test_score 0.98
std_test_score 0.0163299
rank_test_score 1
Name: 5, dtype: object
Most of the methods you call on a fitted GridSearchCV use the best model (grid.predict(...) gets you the predictions of the best model, for example). This is not true when you pass the object itself to cross_val_score: cross_val_score clones and refits whatever estimator you give it, so passing grid re-runs the whole grid search inside each outer fold (nested cross-validation), whereas passing grid.best_estimator_ cross-validates that single, already-chosen model. The difference in scores you see comes from that.
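A quick way to convince yourself of this, as a sketch reusing the grid and cv_results from above and assuming X, y are the data it was fitted on:
import numpy as np
# best_score_ is just the mean_test_score of the best cv_results_ row
print(grid.best_score_ == cv_results.loc[grid.best_index_, 'mean_test_score'])
# predict() on the fitted GridSearchCV delegates to the refit best_estimator_
print(np.array_equal(grid.predict(X), grid.best_estimator_.predict(X)))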
How does one determine what features/columns/attributes to drop using GridSearch results?
In other words, if GridSearch returns that max_features should be 3, can we determine which EXACT 3 features should be used?
Let's take the classic Iris data set with 4 features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
iris = datasets.load_iris()
all_inputs = iris.data
all_labels = iris.target
decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
'max_features': [1, 2, 3, 4]}
cross_validation = StratifiedKFold(n_splits=10)
grid_search = GridSearchCV(decision_tree_classifier,
param_grid=parameter_grid,
cv=cross_validation)
grid_search.fit(all_inputs, all_labels)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
Let's say we get that max_features is 3. How do I find out which 3 features were the most appropriate here?
Putting in max_features = 3 will work for fitting, but I want to know which attributes were the right ones.
Do I have to generate the list of all possible feature combinations myself to feed GridSearch, or is there an easier way?
max_features is one hyperparameter of your decision tree.
It does not drop any of your features before training, nor does it find good or bad features.
Your decision tree looks at the available features to find the best feature to split your data on, based on your labels. If you set max_features to 3, as in your example, your decision tree considers only three randomly chosen features at each split and takes the best of those to make the split. This makes your training faster and adds some randomness to your classifier (which might also help against overfitting).
Your classifier decides which feature to split on using a criterion (such as Gini impurity or information gain). So you can either use such a measure for feature importance, or
use an estimator that has the feature_importances_ attribute,
as #gorjan mentioned.
If you use an estimator that has the attribute feature_importances_ you can simply do:
feature_importances = grid_search.best_estimator_.feature_importances_
This will return an array of length n_features with the importance of each feature for the best estimator found by the grid search. Additionally, if you want to use, say, a linear classifier (logistic regression) that doesn't have the feature_importances_ attribute, what you could do is:
# Get the best estimator's coefficients
estimator_coeff = grid_search.best_estimator_.coef_
# Multiply the model coefficients by the standard deviation of the data
coeff_magnitude = np.std(all_inputs, 0) * estimator_coeff
which is also an indication of the feature importance. If a model's coefficient is >> 0 or << 0, it means, in layman's terms, that the model is trying hard to capture the signal present in that feature.
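For example, with the decision-tree grid search from the question, a small sketch that pairs each importance with its column name (iris.feature_names ships with the dataset):
import pandas as pd

importances = pd.Series(grid_search.best_estimator_.feature_importances_,
                        index=iris.feature_names).sort_values(ascending=False)
print(importances)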
I'm trying to set up an instance of GridSearchCV to determine which set of hyperparameters will produce the lowest mean absolute error. This scikit-learn documentation indicates that score metrics can be passed into the grid upon creation of a GridSearchCV (below).
param_grid = {
'hidden_layer_sizes' : [(20,),(21,),(22,),(23,),(24,),(25,),(26,),(27,),(28,),(29,),(30,),(31,),(32,),(33,),(34,),(35,),(36,),(37,),(38,),(39,),(40,)],
'activation' : ['relu'],
'random_state' : [0]
}
gs = GridSearchCV(model, param_grid, scoring='neg_mean_absolute_error')
gs.fit(X_train, y_train)
print(gs.scorer_)
[1] make_scorer(mean_absolute_error, greater_is_better=False)
However, the grid search is not selecting the best-performing model in terms of mean absolute error:
model = gs.best_estimator_.fit(X_train, y_train)
print(metrics.mean_squared_error(y_test, model.predict(X_test)))
print(gs.best_params_)
[2] 125.0
[3] Best parameters found by grid search are: {'hidden_layer_sizes': (28,), 'learning_rate': 'constant', 'learning_rate_init': 0.01, 'random_state': 0, 'solver': 'lbfgs'}
After running the above code and determining the so-called 'best parameters', I delete one of the values found in gs.best_params_, and find that, when I run my program again, the mean squared error will sometimes decrease.
param_grid = {
'hidden_layer_sizes' : [(20,),(21,),(22,),(23,),(24,),(25,),(26,),(31,),(32,),(33,),(34,),(35,),(36,),(37,),(38,),(39,),(40,)],
'activation' : ['relu'],
'random_state' : [0]
}
[4] 122.0
[5] Best parameters found by grid search are: {'hidden_layer_sizes': (23,), 'learning_rate': 'constant', 'learning_rate_init': 0.01, 'random_state': 0, 'solver': 'lbfgs'}
To clarify: I changed the set that was fed into my grid search so that it did not contain an option to select a hidden layer size of 28. When that change was made, I ran the code again, and this time it picked a hidden layer size of 23 and the mean absolute error decreased (even though the size of 23 had been available from the start). Why didn't it just pick this option from the start if it is evaluating the mean absolute error?
Grid search and model fitting essentially depend on random number generators for different purposes. In scikit-learn this is controlled by the random_state parameter. See my other answers to learn more about it:
https://stackoverflow.com/a/42477052/3374996
https://stackoverflow.com/a/42197534/3374996
Now in your case, I can think of these places where this random number generation affects the training:
1) GridSearchCV will by default use a KFold with 3 folds for regression tasks, which may split the data differently on different runs. It may happen that the splits in the two grid-search runs are different, and hence the scores differ.
2) You are using separate test data to calculate the error, which GridSearchCV doesn't have access to. So it will find the parameters appropriate for the supplied training data, which may or may not be perfectly valid for the separate dataset.
Update:
I see now that you have used random_state in the param grid for the model, so point 3 below no longer applies.
3) You have not shown which model you are using. But if the model uses sub-samples of the data during training (like selecting a smaller number of features, or a smaller number of rows for iterations, or for different internal estimators), then you need to fix that too to get the same scores.
Check the results after fixing that first.
Recommendation Example
You can take ideas from this example:
# Define a custom kfold
from sklearn.model_selection import KFold
kf = KFold(n_splits=3, shuffle=True, random_state=0)
# Check if the model you chose supports random_state
model = WhateverYouChoseClassifier(..., random_state=0, ...)
# Pass these to grid-search
gs = GridSearchCV(model, param_grid, scoring='neg_mean_absolute_error', cv = kf)
Then repeat the two experiments you did, changing only the param grid.
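As a concrete sketch, assuming from the hidden_layer_sizes grid that your model is an MLPRegressor (substitute whatever estimator you actually use; the grid below is trimmed for brevity), and scoring the hold-out set with the same metric the search optimises:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPRegressor

kf = KFold(n_splits=3, shuffle=True, random_state=0)   # fixed, reproducible splits
model = MLPRegressor(random_state=0)                   # fixed weight initialisation
param_grid = {
    'hidden_layer_sizes': [(20,), (25,), (30,), (35,), (40,)],
    'activation': ['relu'],
}
gs = GridSearchCV(model, param_grid, scoring='neg_mean_absolute_error', cv=kf)
gs.fit(X_train, y_train)
print(gs.best_params_, -gs.best_score_)   # best mean absolute error over the fixed folds
# evaluate the hold-out set with MAE, the metric that was optimised
print(mean_absolute_error(y_test, gs.best_estimator_.predict(X_test)))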
I'm looking for a way to grid-search for hyperparameters in sklearn without using k-fold validation, i.e. I want the grid to train on a specific dataset (X1, y1 in the example below) and validate itself on a specific hold-out dataset (X2, y2 in the example below).
X1,y1 = train data
X2,y2 = validation data
clf_ = SVC(kernel='rbf',cache_size=1000)
Cs = [1,10.0,50,100.0,]
Gammas = [ 0.4,0.42,0.44,0.46,0.48,0.5,0.52,0.54,0.56]
clf = GridSearchCV(clf_,dict(C=Cs,gamma=Gammas),
cv=???, # validate on X2,y2
n_jobs=8,verbose=10)
clf.fit(X1, y1)
Use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out of the box and can also be used with TensorFlow, PyTorch, Caffe2, etc.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
param_grid = [
{'C': [1, 10, 100], 'kernel': ['linear']},
{'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model = SVR(), param_grid = param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=???,  # validate on X2, y2
                   n_jobs=8, verbose=10)
Setting n_jobs higher than the number of cores on your machine does not make sense. If n_jobs=-1, the processing will use all the cores on your machine; if it is 1, only one core will be used.
If cv=5, it will run 5-fold cross-validation for every parameter combination.
In your case the total number of fits will be 4 (size of Cs) * 9 (size of Gammas) * 5 (value of cv) = 180.
If you are using cross-validation, it does not make much sense to also hold out data for rechecking your model. If you are not confident about the performance, you can just increase cv to get a better estimate.
This will be very time-consuming, especially for an SVM, so I would rather suggest you use RandomizedSearchCV, which lets you specify how many parameter combinations you want to sample at random.
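A minimal sketch of that, reusing the C and gamma lists from the question (the n_iter and random_state values are just illustrative):
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

clf_ = SVC(kernel='rbf', cache_size=1000)
Cs = [1, 10.0, 50, 100.0]
Gammas = [0.4, 0.42, 0.44, 0.46, 0.48, 0.5, 0.52, 0.54, 0.56]
# sample only 10 of the 36 possible (C, gamma) combinations
rs = RandomizedSearchCV(clf_, dict(C=Cs, gamma=Gammas), n_iter=10,
                        cv=5, n_jobs=8, verbose=10, random_state=0)
rs.fit(X1, y1)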