I recently tested many hyperparameter combinations using sklearn.model_selection.GridSearchCV. I want to know if there is a way to call all previous estimators that were trained in the process.
search = GridSearchCV(estimator=my_estimator, param_grid=parameters)
# `my_estimator` is a gradient boosting classifier object
# `parameters` is a dictionary containing all the hyperparameters I want to try
I know I can call the best estimator with search.best_estimator_, but I would like to call all other estimators as well so I can test their individual performance.
The search took around 35 hours to complete, so I really hope I do not have to do this all over again.
NOTE: This was asked a few years ago (here), but sklearn has been updated multiple times since and the anwer may be different now (I hope).
No, none of the tested models are saved, except (optionally, but by default) one final one trained on the entire training set, your best_estimator_. Especially when models store significant amounts of data (e.g. KNNs), saving all the fitted estimators would be very memory-expensive, and usually not of much use. (cross_validate does have a parameter return_estimator, but the hyperparameter tuners do not. If you have a compelling reason to add it, it probably wouldn't take much work and you could open a GitHub Issue at sklearn.)
However, you do have the cv_results_ attribute that documents the scores of all of the tested estimators. That's usually enough for inspection purposes.
Related
I have 2 questions that I would like to ascertain if possible (questions are bolded):
I've recently understood (I hope) the random forest classification algorithm, and have tried to apply it using sklearn on Python on a rather large dataset of pixels derived from satellite images (with the features being the different bands, and the labels being specific features that I outlined by myself, i.e., vegetation, cloud, etc). I then wanted to understand if the model was experiencing a variance problem, and so the first thought that came to my mind was to compare between the training and testing data.
Now this is where the confusion kicks in for me - I understand that there have been many different posts about:
How CV error should/should not be used compared to the out of bag (OOB) error
How by design, the training error of a random forest classifier is almost always ~0 (i.e., fitting my model on the training data and using it to predict on the same set of training data) - seems to be the case regardless of the tree depth
Regarding point 2, it seems that I can never compare my training and test error as the former will always be low, and so I decided to use the OOB error as my 'representative' training error for the entire model. I then realized that the OOB error might be a pseudo test error as it essentially tests trees on points that they did not specifically learn (in the case of bootstrapped trees), and so I defaulted to CV error being my new 'representative' training error for the entire model.
Looking back at the usage of CV error, I initially used it for hyperparameter tuning (e.g., max tree depth, number of trees, criterion type, etc), and so I was again doubting myself if I should use it as my official training error to be compared against my test error.
What makes this worse is its hard for me to validate what I think is true based on posts across the web because each answers only a small part and might contradict each other, and so would anyone kindly help me with my predicament on what to use as my official training error that will be compared to my test error?
My second question revolves around how the OOB error might be a pseudo test error based on datapoints not selected during bootstrapping. If that were true, would it be fair to say this does not hold if bootstrapping is disabled (the algorithm is technically still a random forest as features are still randomly subsampled for each tree, its just that the correlation between trees are probably higher)?
Thank you!!!!
Generally, you want to distinctly break a dataset into training, validation, and test. Training is data fed into the model, validation is to monitor progress of the model as it learns, and test data is to see how well your model is generalizing to unseen data. As you've discovered, depending on the application and the algorithm, you can mix-up training and validation data or even forgo validation data entirely. For random forest, if you want to forgo having a distinct validation set and just use OOB to monitor progress that is fine. If you have enough data, I think it still makes sense to have a distinct validation set. No matter what, you should still reserve some data for testing. Depending on your data, you may even need to be careful about how you split up the data (e.g. if there's unevenness in the labels).
As to your second point about comparing training and test sets, I think you may be confused. The test set is really all you care about. You can compare the two to see if you're overfitting, so that you can change hyperparameters to generalize more, but otherwise the whole point is that the test set is to the sole truthful evaluation. If you have a really small dataset, you may need to bootstrap a number of models with a CV scheme like stratified CV to generate a more accurate test evaluation.
Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP) which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, How can I use shap.Explainer() function of shap library to present the performance of a machine learning model? It can be noted that I have checked several tutorials (e.g. this one, this one) and also several questions (e.g. this one) of SO. But I failed to find the answer of the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there is 250 participants' data, if I apply shap here for each model, there will be 250 output! Thus, I want to get a single output which will present the performance of the 250 models.
You seem to train model on a 250 datapoints while doing LOOCV. This is about choosing a model with hyperparams that will ensure best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparams -- note, 250 LOOCV is already overkill. Will you do that with 250'000 rows? -- you are rather trying to understand which features influence output in what direction and by how much.
Training has it's own limitations (availability of data, if new data resembles the data the model was trained on, if the model good enough to pick up peculiarities of data and generalize well etc), but don't overestimate explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values. But do you expect the result to be much more different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features, hyperparams may be different, depending on your definition of iteration). It's still the same dataset (same features)
Also, the model will be different as I am doing feature selection in each iteration.
Doesn't matter. Feed resulting model to SHAP explainer and you'll get what you want.
This question already has an answer here:
Retrieving specific classifiers and data from GridSearchCV
(1 answer)
Closed 2 years ago.
GridSearchCV and RandomizedSearchCV has best_estimator_ that :
Returns only the best estimator/model
Find the best estimator via one of the simple scoring methods : accuracy, recall, precision, etc.
Evaluate based on training sets only
I would like to enrich those limitations with
My own definition of scoring methods
Evaluate further on test set rather than training as done by GridSearchCV. Eventually it's the test set performance that counts. Training set tends to give almost perfect accuracy on my Grid Search.
I was thinking of achieving it by :
Get the individual estimators/models in GridSearchCV and RandomizedSearchCV
With every estimator/model, predict on test set and evaluate with my customized score
My question is:
Is there a way to get all individual models from GridSearchCV ?
If not, what is your thought to achieve the same thing as what I wanted ? Initially I wanted to exploit existing GridSearchCV because it handles automatically multiple parameter grid, CV and multi-threading. Any other recommendation to achieve the similar result is welcome.
You can use custom scoring methods already in the XYZSearchCVs: see the scoring parameter and the documentation's links to the User Guide for how to write a custom scorer.
You can use a fixed train/validation split to evaluate the hyperparameters (see the cv parameter), but this will be less robust than a k-fold cross-validation. The test set should be reserved for scoring only the final model; if you use it to select hyperparameters, then the scores you receive will not be unbiased estimates of future performance.
There is no easy way to retrieve all the models built by GridSearchCV. (It would generally be a lot of models, and saving them all would generally be a waste of memory.)
The parallelization and parameter grid parts of GridSearchCV are surprisingly simple; if you need to, you can copy out the relevant parts of the source code to produce your own approach.
Training set tends to give almost perfect accuracy on my Grid Search.
That's a bit surprising, since the CV part of the searches means the models are being scored on unseen data. If you get very high best_score_ but low performance on the test set, then I would suspect your training set is not actually a representative sample, and that'll require a much more nuanced understanding of the situation.
I am pretty new to doc2vec then I made small research and found a couple of things. Here is my story: I am trying to learn using doc2vec 2.4 million documents. At first, I tried only doing so with a small model of 12 documents. I checked the results with infer vector of the first document and found it to be similar indeed to the first document by 0.97-0.99 cosine similarity measure. Which I found good, even though when I tried to enter a new document of completely different words I received a high score of 0.8 measure similarity. However, I had put it aside and tried to go on and build the full model with the 2.4 million documents. In this point, my problems began. The result was complete nonsense, I received in the most_similar function results with a similarity of 0.4-0.5 which were completely different from the new document checked. I tried to tune parameters but no result yet. I tried also to remove randomness both from the small and big model, however, I still got different vectors. Then I had tried to use get_latest_training_loss on each epoch in order to see how the loss changes over each epoch. This is my code:
model = Doc2Vec(vector_size=300, alpha=0.025, min_alpha=0.025, pretrained_emb=".../glove.840B.300D/glove.840B.300d.txt", seed=1, workers=1, compute_loss=True)
workers=1, compute_loss=True)
model.build_vocab(documents)
for epoch in range(10):
for i in range(model_glove.epochs):
model.train(documents, total_examples = token_count, epochs=1)
training_loss = model.get_latest_training_loss()
print("Training Loss: " + str(training_loss))
model.alpha -= 0.002 # decrease the learning rate
model.min_alpha = model.alpha # fix the learning rate, no decay
I know this code is a bit awkward, but it is used here only to follow the loss.
The error I receive is:
AttributeError: 'Doc2Vec' object has no attribute 'get_latest_training_loss'
I tried looking at model. and auto-complete and found that indeed there is no such function, I found something similar name training_loss, but it gives me the same error.
Anyone here can give me an idea?
Thanks in Advance
Especially as a beginner, there's no pressing need to monitor training-loss. For a long time, gensim didn't report it in any way for any models – and it was still possible to evaluate & tune models.
Even now, running-loss-reporting in gensim kind of a rough, incomplete, advanced/experimental feature – and after a recent refactoring it doesn't seem to have full support in Doc2Vec. (Notably, while having the loss level reach a plateau can be a helpful indicator that further training can't help, it is most definitely not the case that a model with arbitrarily-lower-loss is better than others. In particular, a model that achieves near-zero loss would likely be extremely overfit, and probably of little use for downstream applications.)
Regarding your general aim, of getting good vectors, with regard to the process you've described/shown:
Tiny tests (as with your 12 documents) don't really work with these algorithms, except to check that you're calling the steps with legal parameters. You shouldn't expect the similarities in such toy-sized tests to mean anything, even if they superficially meet expectations in some cases. The algorithms need lots of training data & large vocabularies to train sensible models. (So, your full 2.4 million docs should work well.)
You generally shouldn't be changing the default alpha/min_alpha values, or call train() multiple times in a loop. You can just leave those at their defaults, and call train() with your desired number of training epochs – and it will do the right thing. The approach in your shown code is a suboptimal and fragile anti-pattern – whichever online source you learned it from is misguided and severely outdated.
You haven't shown your inference code, but note that it will re-use the epochs, alpha, and min_alpha cached in the model instance from original initialization, unless you supply other values. And, the default epochs if not-specified is a value inherited from shared code with Word2Vec of just 5. Doing a mere 5 epochs, and leaving the effective alpha at 0.025 the whole time (as alpha=0.025, min_alpha=0.025 does to inference), is unlikely to give good results, especially on short docs. Common epochs values from published work are 10-20 - and doing at least as many for inference as were used for training is typical.
You are showing the use of a pretrained_emb initialization parameter that is not part of the standard gensim library, so perhaps you're using some other fork, based on some older version of gensim.. Note that it's not typical to initialize a Doc2Vec model with word-embeddings from elsewhere before training, so if doing that, you're already in advanced/experimental territory – which is premature if you're still trying to get some basic doc-vectors into reasonable shape. (And, usually people seek tricks like re-used word-vectors if they have a small corpus. With 2.4 million docs, you probably don't have such corpus problems – any word-vectors can be learned from your corpus along with doc-vectors, in the default way.)
According to the resources online "train_test_split" function from sklearn.cross_validation module returns data in a random state.
Does this mean if I train a model with the same data twice, I am getting two different models since the training data points used in the learning process is different in each case?
In practice can the accuracy of such two models differ a lot? Is that a possible scenario?
You can set random_state parameter to some constant value to reproduce data splits. On the other hand, it's generally a good idea to test exactly what you are trying to know - i.e. run your training at least twice with different randoms states and compare the results. If they differ a lot it's a sign that something is wrong and your solution is not reliable.