I need to find the best hyperparameters for an ANN and then run predictions with the best model. I use KerasRegressor. I keep finding conflicting examples and advice, so please help me understand the right sequence and which parameters to use when.
1. I split my data into Train and Test datasets.
2. I look for the best hyperparameters using GridSearchCV on the Train dataset:
GridSearchCV.fit(X_Train, Y_Train)
3. I take GridSearchCV.best_estimator_ and use it in cross_val_score on the Test dataset, i.e.
cross_val_score(model.best_estimator_, X_Test, Y_Test, scoring='r2')
I'm not sure whether I need this step. In theory, it should show r2 scores similar to the ones GridSearchCV reported for this best_estimator_, shouldn't it?
4. I use model.best_estimator_.predict(X_Test) on the Test data to predict the results, i.e. I pass the best_estimator_ from GridSearchCV to run the actual prediction. Is this correct? Do I need to fit model.best_estimator_ again on the Train data before predicting, or does it keep all the weights found during GridSearchCV? Do I need to save the weights to be able to reuse them later?
Usually, when you run GridSearchCV on your training set, you end up with an object that contains the best model trained with the best parameters.
gs = GridSearchCV(model, param_grid)
gs.fit(X_train, y_train)
This is also evident from gs.best_params_, which shows the best parameters found during cross validation. Now you can make predictions on your test set directly by running gs.predict(X_test), which uses the best selected model (refit on the whole training set, since refit=True by default) to predict on your test data.
For question 3, you don't need to use cross_val_score again; it is a helper function that performs cross validation on a dataset and returns the score for each fold of the split. Evaluating the selected model once on the held-out test set is sufficient.
For question 4, I believe this answer is quite explanatory: https://stats.stackexchange.com/a/26535
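Putting the steps together, here is a minimal sketch of the sequence described above. It uses a plain sklearn regressor and synthetic data in place of KerasRegressor, purely to illustrate the order of operations:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression

# synthetic data standing in for the real problem
X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 1) search for hyperparameters on the Train set only
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 5]}
gs = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, scoring='r2', cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)  # best parameters and their mean CV r2

# 2) gs has already been refit on the whole Train set (refit=True by default),
#    so no extra fitting is needed before predicting
y_pred = gs.predict(X_test)             # same as gs.best_estimator_.predict(X_test)
print(r2_score(y_test, y_pred))         # single held-out estimate of generalisation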
What I want to do is derive a classifier which is optimal in its parameters with respect to a given metric (for example the recall score) but also calibrated (in the sense that the output of the predict_proba method can be directly interpreted as a confidence level; see https://scikit-learn.org/stable/modules/calibration.html).
Does it make sense to use sklearn's GridSearchCV together with CalibratedClassifierCV, that is, to fit a classifier via GridSearchCV and then pass the GridSearchCV output to the CalibratedClassifierCV object?
If I'm correct, the CalibratedClassifierCV object would fit a given estimator cv times, and the probabilities for each of the folds are then averaged for prediction. However, the results of the GridSearchCV could be different for each of the folds.
Yes, you can do this and it would work. I don't know whether it makes sense to do it, but the least I can do is explain what I believe would happen.
We can compare doing this with the alternative, which is getting the best estimator from the grid search and feeding that to the calibration.
Simply getting the best estimator and feeding it to CalibratedClassifierCV:
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()

# run the grid search first, then calibrate only the winning estimator
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)

calibration_clf = CalibratedClassifierCV(clf.best_estimator_)
calibration_clf.fit(iris.data, iris.target)
calibration_clf.predict_proba(iris.data[0:10])
array([[0.91887427, 0.07441489, 0.00671085],
[0.91907451, 0.07417992, 0.00674558],
[0.91914982, 0.07412815, 0.00672202],
[0.91939591, 0.0738401 , 0.00676399],
[0.91894279, 0.07434967, 0.00670754],
[0.91910347, 0.07414268, 0.00675385],
[0.91944594, 0.07381277, 0.0067413 ],
[0.91903299, 0.0742324 , 0.00673461],
[0.91951618, 0.07371877, 0.00676505],
[0.91899007, 0.07426733, 0.00674259]])
Feeding the grid search into CalibratedClassifierCV:
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()

# pass the (unfitted) grid search object itself as the estimator to calibrate
clf = GridSearchCV(svc, parameters)
cal_clf = CalibratedClassifierCV(clf)
cal_clf.fit(iris.data, iris.target)
cal_clf.predict_proba(iris.data[0:10])
array([[0.900434 , 0.0906832 , 0.0088828 ],
[0.90021418, 0.09086583, 0.00891999],
[0.90206035, 0.08900572, 0.00893393],
[0.9009212 , 0.09012478, 0.00895402],
[0.90101953, 0.0900889 , 0.00889158],
[0.89868497, 0.09242412, 0.00889091],
[0.90214948, 0.08889812, 0.0089524 ],
[0.8999936 , 0.09110965, 0.00889675],
[0.90204193, 0.08896843, 0.00898964],
[0.89985101, 0.09124147, 0.00890752]])
Notice that the output probabilities are slightly different between the two approaches.
The difference between the two methods is:
Using the best estimator only does the calibration across 5 splits (the default cv), and it uses the same estimator in all 5 splits.
Passing the grid search object means CalibratedClassifierCV fits a full grid search on each of its 5 calibration splits. You are essentially doing cross validation on 4/5 of the data each time, choosing the best estimator for that 4/5, and then doing the calibration with that best estimator on the remaining fifth. You could therefore end up with slightly different models for each set of test data, depending on what the grid search chooses.
I think grid search and calibration serve different goals, so in my opinion I would work on each separately and go with the first way specified above: get the model that works best, and then feed it into the calibration.
However, I don't know your specific goals, so I can't say that the 2nd way described here is the WRONG way. You can always try both and see which gives you better performance, and go with the one that works best.
I think your approach is a little different from your objective. Your objective says "Find a model with the best recall, whose confidence should be unbiased", but what you do is "Find a model with the best recall, then make the confidence unbiased". So a better (but slower) way to do that is:
Wrap your model with CalibratedClassifierCV and treat this wrapped model as the final model you should optimize;
Modify your parameter grid so that you are tuning the model inside CalibratedClassifierCV (change param to something like base_estimator__param, since base_estimator is the property CalibratedClassifierCV uses to hold the base estimator);
Feed the CalibratedClassifierCV model into your final GridSearchCV, then fit;
Get best_estimator_, which is your unbiased model with the best recall (see the sketch below).
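A minimal sketch of those steps, assuming a scikit-learn version where CalibratedClassifierCV exposes the wrapped model as base_estimator (newer releases rename it to estimator, so the prefix becomes estimator__):
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn import svm, datasets

iris = datasets.load_iris()

# wrap the base estimator in the calibrator first, then tune the wrapped model
cal_svc = CalibratedClassifierCV(svm.SVC())
param_grid = {'base_estimator__kernel': ('linear', 'rbf'), 'base_estimator__C': [1, 10]}

gs = GridSearchCV(cal_svc, param_grid, scoring='recall_macro')
gs.fit(iris.data, iris.target)

calibrated_best = gs.best_estimator_  # calibrated model with the best recall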
I would advise that you calibrate on a separate set so as not to bias the estimate.
I see two options. Either you cross-validate within a fraction of the folds generated for calibrating, as suggested above, or you set apart an ad-hoc evaluation set that you use only for calibration, after performing cross validation on the training set.
In any case, I would recommend that you finally evaluate on a test set.
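A sketch of the second option, assuming your scikit-learn version supports cv='prefit' (the names and split sizes here are arbitrary):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn import svm, datasets

X, y = datasets.load_iris(return_X_y=True)

# three-way split: tuning set, calibration set, final test set
X_tune, X_rest, y_tune, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# cross-validated tuning on the first set only
gs = GridSearchCV(svm.SVC(), {'kernel': ('linear', 'rbf'), 'C': [1, 10]})
gs.fit(X_tune, y_tune)

# calibrate the already-fitted best model on data it has never seen
cal = CalibratedClassifierCV(gs.best_estimator_, cv='prefit')
cal.fit(X_cal, y_cal)

# final evaluation on the untouched test set
print(cal.score(X_test, y_test))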
I'm currently working on an in-class competition on Kaggle.
I have read the official Python API reference, and I'm somewhat confused by the two kinds of interfaces, especially regarding grid search, cross-validation and early stopping.
In the XGBoost API, I can use xgb.cv(), which splits the whole dataset into train and validation parts for each fold, to tune good hyperparameters and then get the best_iteration.
Thus I can set num_boost_round to the best_iteration. To make maximal use of the data, I then train on the whole dataset again with the well-tuned hyperparameters and use that model to classify. The only drawback is that I have to write the grid-search code myself.
ATTENTION: the cross-validation set changes at each fold, so the training result has no particular bias toward any one part of the data.
But in sklearn, it seems that I cannot get the best_iteration from clf.fit() the way I can with the xgb model. Instead, the fit() method takes early_stopping_rounds and eval_set to implement early stopping. Many people implement the code like this:
X_train, X_test, y_train, y_test = train_test_split(train, target_label, test_size=0.2, random_state=0)

clf = GridSearchCV(xgb_model, para_grid, scoring='roc_auc', cv=5,
                   verbose=True, refit=True, return_train_score=False)
clf.fit(X_train, y_train, early_stopping_rounds=30, eval_set=[(X_test, y_test)])
....
clf.predict(something)
But the problem is that I split the data into two parts at the very start, so the cross-validation set does not change at each fold. The result may therefore lean toward this one random part of the whole dataset. The same problem occurs in the grid search: the final parameters may tend to fit X_test and y_test better.
I'm fond of GridSearchCV in sklearn, but I also want the eval_set to change at each fold, just like xgb.cv does. I believe that would make better use of the data while still preventing overfitting.
What should I do?
I have thought of two ways:
use the XGBoost API and write the grid search myself;
use the sklearn API and change the eval_set manually at each fold (sketched below).
Are there any more convenient methods?
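For concreteness, option 2 would look roughly like the loop below. This is only a sketch; it assumes train and target_label are NumPy arrays and that your xgboost version accepts early_stopping_rounds in fit(), as in the snippet above:
import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

best_iterations = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(train):
    X_tr, X_va = train[train_idx], train[valid_idx]
    y_tr, y_va = target_label[train_idx], target_label[valid_idx]

    # eval_set is rebuilt from the current fold, so it changes every iteration
    model = XGBClassifier(n_estimators=1000, learning_rate=0.1)
    model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], early_stopping_rounds=30, verbose=False)
    best_iterations.append(model.best_iteration)

# an aggregate of the per-fold stopping points (e.g. their mean) can then be used
# as n_estimators when refitting on the full training data
print(int(np.mean(best_iterations)))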
As you have summarised, both approaches have advantages and disadvantages.
xgb.cv will use the left-out fold for early stopping, so you do not need an additional validation/train split to decide when to trigger early stopping.
GridSearchCV (or maybe you try RandomizedSearchCV) will handle the parameter grid and the choice of the optimum for you.
Note that it is not a problem to use a fixed sub-sample for early stopping in all CV folds, so I do not think you have to do anything like "change the eval_set manually at each fold". The evaluation sample used for early stopping does not directly affect the model parameters; it is only used to decide when the evaluation metric on a hold-out sample stops improving. For the final model you can drop early stopping: observe where the model stops with the optimal hyperparameters using the aforementioned split, and then use that number of trees as a fixed parameter in the final model fit.
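For example, determining the number of trees with xgb.cv and then fixing it for the final fit could look like this (a sketch; X, y, params and the metric are placeholders for your own setup):
import xgboost as xgb

# dtrain wraps the full training data; params holds the tuned hyperparameters
dtrain = xgb.DMatrix(X, label=y)
cv_results = xgb.cv(params, dtrain, num_boost_round=1000, nfold=5,
                    metrics='auc', early_stopping_rounds=30)

# with early stopping the returned table is truncated at the best round,
# so its length gives the number of rounds worth keeping
best_num_rounds = len(cv_results)

# refit on all the data with early stopping dropped and the round count fixed
final_model = xgb.train(params, dtrain, num_boost_round=best_num_rounds)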
So in the end it is a matter of taste, as in both cases you will need to compromise on something. IMO the sklearn API is the better choice, as it allows you to use the rest of the sklearn tools (e.g. for data pre-processing) naturally in a pipeline and in CV, and it gives a homogeneous interface to model training across approaches. But in the end it is up to you.
I would like to know the purpose of the two functions GridSearchCV and KFold. I know that GridSearchCV performs a grid search over a set of given hyperparameters for a model (let's say an SVC) and validates each resulting model using CV.
KFold, on the other hand, just splits the data into k parts, and we can use that to evaluate a model for which we have already fixed the hyperparameters?
I have seen some code where both functions are used together (roughly like the sketch at the end of this question); I had assumed that using just GridSearchCV already gives you the CV error?
So do you train the model first, for which you use the KFold method, and then use GridSearchCV to tune the hyperparameters of the trained model? Which is the best way to reduce bias and variance?
I'm extremely confused, so any help is appreciated!
Thank you!
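For reference, the kind of code I have seen looks roughly like this, with a KFold object passed as the cv argument of GridSearchCV (a minimal sketch):
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# an explicit KFold splitter handed to GridSearchCV through cv
cv = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {'C': [1, 10], 'kernel': ['linear', 'rbf']}

gs = GridSearchCV(SVC(), param_grid, cv=cv)
gs.fit(X, y)

print(gs.best_params_)  # hyperparameters chosen by the grid search
print(gs.best_score_)   # mean cross-validated score of that combination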
Can someone explain how to use the oob_decision_function_ attribute of the scikit-learn RandomForestClassifier? I want to use it to plot learning curves comparing training and validation error against different training set sizes, in order to identify overfitting and other problems. I can't seem to find any information about how to do this.
You can pass a custom scoring function into any of the scoring parameters in the model evaluation utilities; it needs to have the signature (classifier, X, y_true) -> score.
For your case you could use something like:
from sklearn.model_selection import learning_curve  # sklearn.learning_curve in very old versions

# r must be a RandomForestClassifier created with oob_score=True
learning_curve(r, X, y, cv=3, scoring=lambda c, x, y: c.oob_score_)
This will compute 3-fold cross-validated OOB scores for different training set sizes. By the way, I don't think you should see much overfitting with random forests; that's one of their benefits.
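For completeness, a tiny self-contained look at the OOB attributes themselves (iris as a stand-in dataset); the learning_curve call above simply plugs the OOB score in as the scoring callable:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True is required for the OOB attributes to be populated
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print(rf.oob_score_)                  # accuracy estimated from out-of-bag samples
print(rf.oob_decision_function_[:5])  # per-sample OOB class probabilities, shape (n_samples, n_classes)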