According to resources online, the train_test_split function from the sklearn.cross_validation module returns data in a random state.
Does this mean that if I train a model with the same data twice, I get two different models, since the training data points used in the learning process are different in each case?
In practice, can the accuracy of two such models differ a lot? Is that a possible scenario?
You can set the random_state parameter to a constant value to reproduce data splits. On the other hand, it's generally a good idea to test exactly what you are trying to find out, i.e. run your training at least twice with different random states and compare the results. If they differ a lot, it's a sign that something is wrong and your solution is not reliable.
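A minimal sketch of that experiment; the dataset and classifier below are placeholders (not from the question), and it uses the current sklearn.model_selection import path rather than the deprecated sklearn.cross_validation:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

for seed in (0, 1):
    # Fixing random_state makes the split (and hence the fitted model) reproducible;
    # changing it changes which rows land in the training set.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    print(f"random_state={seed}: test accuracy = {accuracy_score(y_te, model.predict(X_te)):.3f}")

If the two printed accuracies differ substantially, the conclusion from the answer above applies: the solution is not reliable.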
I am making multiple classifier models and the test accuracy for all of them is 0.508.
I find it weird that multiple models have the same accuracy. The models I used are LogisticRegression, DecisionTreeClassifier, MLPClassifier, RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, XGBClassifier, SVC, and VotingClassifier.
After using GridSearchCV to tune the models, all of their training accuracy scores improved, but the test accuracy scores did not change.
I wish I could say I changed something, but I don't know why the test scores did not change. After using grid search, I expected the test scores to improve, but they didn't.
I would like to confirm: you mean your training scores improved but your testing scores did not change? If yes, there are many possible reasons behind this.
You might want to reconfigure and widen your hyperparameter ranges; for example, if you are using KNN, you can increase the range of k or add more distance metrics (see the sketch after this list).
If you want, you can change the hyperparameter optimization technique, e.g. to randomized search or Bayesian search.
I don't have any information about your data, but sometimes turning shuffling on or off when splitting can affect the scores; for instance, if you have time-series data, you must not shuffle the dataset.
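A hedged sketch of the first two suggestions above; the estimator, dataset, and grid values are illustrative placeholders, not taken from the question:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Wider KNN search space: a larger range of k and more distance metrics.
param_distributions = {
    "n_neighbors": list(range(1, 51)),
    "metric": ["euclidean", "manhattan", "chebyshev"],
    "weights": ["uniform", "distance"],
}

# Randomized search samples 30 combinations instead of trying the full grid.
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_distributions,
    n_iter=30,
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)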
There can be several reasons why the test accuracy didn't change after using GridSearchCV:
The best parameters found by GridSearchCV might not be optimal for the test data.
The test data may have a different distribution than the training data, leading to low test accuracy.
The models might be overfitting to the training data and not generalizing well to the test data.
The test data size might be small, leading to high variance in test accuracy scores.
The problem itself might be challenging, and a test accuracy of 0.508 might be the best that can be achieved with the current models and data.
It would be useful to have more information about the data, the problem, and the experimental setup to diagnose the issue further.
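One additional, hedged sanity check related to the last two points: compare the 0.508 against a majority-class baseline. The names X_train, y_train, X_test, y_test below are assumptions standing in for your existing split.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Majority-class baseline: predicts the most frequent training label for every sample.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline test accuracy:", accuracy_score(y_test, baseline.predict(X_test)))

# If this baseline is also close to 0.5, a test accuracy of 0.508 means the models are
# essentially guessing, which points to the data/problem issues above rather than to
# the tuning procedure itself.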
Looking at your accuracy, the first thing I would ask is: are you performing a binary classification task? If so, your models are barely better than random on the test set, which may suggest that something is wrong with your training.
Otherwise, GridSearchCV, like RandomizedSearchCV and other hyperparameter optimization techniques, tries to find optimal parameters within a range that you define. If, after optimization, the optimal value of a parameter lies at one of the bounds of your range, it may suggest that you need to explore beyond that bound, that is to say, deliberately set a wider range and run the optimization again.
By the way, I don't know the size of your dataset, but if it is big I would recommend using RandomizedSearchCV instead of GridSearchCV. As it is not exhaustive, it takes less time and gives results that are (nearly) optimal.
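A minimal sketch of that boundary check, assuming search is your fitted GridSearchCV and param_grid is the dictionary of ranges you passed to it:

# Flag any tuned numeric parameter whose best value sits at the edge of the searched range.
for name, values in param_grid.items():
    numeric = sorted(v for v in values if isinstance(v, (int, float)))
    best = search.best_params_[name]
    if numeric and best in (numeric[0], numeric[-1]):
        print(f"{name}={best} is at the edge of {numeric}; consider widening the range and re-running.")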
It's not clear to me why some resources online demonstrate a multi-target Random Forest regression as being instantiated as either
model = MultiOutputRegressor(RandomForestRegressor())
versus:
model = RandomForestRegressor()
when both seemingly generate multiple regressed outputs. Can anyone clarify?
The internal models are different, but they are both multioutput regressors.
MultiOutputRegressor fits one random forest per target. Every tree inside a given forest then predicts only that one output.
Without the wrapper, RandomForestRegressor fits trees targeting all the outputs at once. The split criteria are based on the average impurity reduction across the outputs. See the User Guide.
The latter may be better computationally, since fewer trees are being built. It can also make use of the fact that the several outputs for a given input may well be correlated. That's all discussed in the user guide as well.
Some conjecture on my part: on the other hand, if the several outputs for a given input are not correlated, internal splits that are good for one output may be lousy for the other outputs, so simply averaging the impurity reductions might not work as well. I think in that case increasing the tree complexity can alleviate the issue (but will also take more computation).
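A small comparative sketch on a toy multi-target problem; the dataset and settings are illustrative only:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_samples=500, n_features=10, n_targets=3, noise=1.0, random_state=0)

# One forest per target: 3 forests x 100 trees = 300 trees in total.
wrapped = MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=0))

# One forest whose splits average the impurity reduction over all 3 targets: 100 trees in total.
native = RandomForestRegressor(n_estimators=100, random_state=0)

for name, model in [("MultiOutputRegressor(RandomForestRegressor())", wrapped),
                    ("RandomForestRegressor()", native)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")

On correlated targets the native multi-output forest often does comparably well with far fewer trees; on uncorrelated targets the wrapper may have the edge, in line with the conjecture above.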
I have 2 questions that I would like to ascertain if possible (questions are bolded):
I've recently understood (I hope) the random forest classification algorithm, and have tried to apply it using sklearn in Python on a rather large dataset of pixels derived from satellite images (with the features being the different bands, and the labels being specific classes that I outlined myself, e.g., vegetation, cloud, etc.). I then wanted to understand whether the model was experiencing a variance problem, and so the first thought that came to my mind was to compare performance on the training and testing data.
Now this is where the confusion kicks in for me - I understand that there have been many different posts about:
How CV error should/should not be used compared to the out of bag (OOB) error
How, by design, the training error of a random forest classifier is almost always ~0 (i.e., fitting my model on the training data and using it to predict on that same training data); this seems to be the case regardless of the tree depth
Regarding point 2, it seems that I can never compare my training and test error, as the former will always be low, and so I decided to use the OOB error as my 'representative' training error for the entire model. I then realized that the OOB error might be a pseudo test error, as it essentially tests trees on points that they were not trained on (in the case of bootstrapped trees), and so I defaulted to CV error being my new 'representative' training error for the entire model.
Looking back at the usage of CV error, I initially used it for hyperparameter tuning (e.g., max tree depth, number of trees, criterion type, etc), and so I was again doubting myself if I should use it as my official training error to be compared against my test error.
What makes this worse is that it's hard for me to validate what I think is true based on posts across the web, because each one answers only a small part and they might contradict each other. So would anyone kindly help me with my predicament on what to use as my official training error that will be compared to my test error?
My second question revolves around how the OOB error might be a pseudo test error based on data points not selected during bootstrapping. If that were true, would it be fair to say this does not hold if bootstrapping is disabled (the algorithm is technically still a random forest as features are still randomly subsampled for each tree; it's just that the correlation between trees is probably higher)?
Thank you!!!!
Generally, you want to distinctly break a dataset into training, validation, and test. Training data is fed into the model, validation data is for monitoring the model's progress as it learns, and test data is for seeing how well your model generalizes to unseen data. As you've discovered, depending on the application and the algorithm, you can mix up training and validation data or even forgo validation data entirely. For random forest, if you want to forgo having a distinct validation set and just use OOB to monitor progress, that is fine. If you have enough data, I think it still makes sense to have a distinct validation set. No matter what, you should still reserve some data for testing. Depending on your data, you may even need to be careful about how you split it up (e.g. if there's unevenness in the labels).
As to your second point about comparing training and test sets, I think you may be confused. The test set is really all you care about. You can compare the two to see if you're overfitting, so that you can change hyperparameters to generalize more, but otherwise the whole point is that the test set is the sole truthful evaluation. If you have a really small dataset, you may need to bootstrap a number of models with a CV scheme like stratified CV to generate a more accurate test evaluation.
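A rough sketch of the numbers discussed above (training, OOB, CV, and held-out test accuracy) for a single random forest; the dataset and settings are placeholders:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

print("training accuracy: ", rf.score(X_tr, y_tr))                        # ~1.0 by design
print("OOB accuracy:      ", rf.oob_score_)                               # pseudo test estimate
print("5-fold CV accuracy:", cross_val_score(rf, X_tr, y_tr, cv=5).mean())
print("test accuracy:     ", rf.score(X_te, y_te))                        # the evaluation that matters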
Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP) which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to present the performance of a machine learning model? It can be noted that I have checked several tutorials (e.g. this one, this one) and also several SO questions (e.g. this one), but I failed to find an answer to the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' data, if I apply SHAP here for each model, there will be 250 outputs! Thus, I want to get a single output that presents the performance of the 250 models.
You seem to train a model on 250 data points while doing LOOCV. This is about choosing a model with hyperparameters that will ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparameters -- note, 250-fold LOOCV is already overkill; will you do that with 250,000 rows? -- you are rather trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up the peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values, but do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features; hyperparameters may be different, depending on your definition of iteration). It's still the same dataset (same features).
Also, the model will be different as I am doing feature selection in each iteration.
That doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
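A hedged sketch of the aggregation idea mentioned above: explain only the held-out row in each LOOCV fold, stack the rows, and draw one summary plot instead of 250. It assumes X and y are NumPy arrays and that the feature set is fixed across folds; if your per-fold feature selection changes the columns, you would need to align them (e.g. zero-fill unselected features) before stacking.

import numpy as np
import shap
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

held_out_rows, held_out_shap = [], []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = XGBRegressor().fit(X[train_idx], y[train_idx])
    explainer = shap.Explainer(model)
    held_out_shap.append(explainer(X[test_idx]).values)  # SHAP values for the one held-out participant
    held_out_rows.append(X[test_idx])

# One summary plot over all folds instead of 250 separate outputs.
shap.summary_plot(np.vstack(held_out_shap), np.vstack(held_out_rows))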
I recently tested many hyperparameter combinations using sklearn.model_selection.GridSearchCV. I want to know if there is a way to call all previous estimators that were trained in the process.
search = GridSearchCV(estimator=my_estimator, param_grid=parameters)
# `my_estimator` is a gradient boosting classifier object
# `parameters` is a dictionary containing all the hyperparameters I want to try
I know I can call the best estimator with search.best_estimator_, but I would like to call all other estimators as well so I can test their individual performance.
The search took around 35 hours to complete, so I really hope I do not have to do this all over again.
NOTE: This was asked a few years ago (here), but sklearn has been updated multiple times since and the answer may be different now (I hope).
No, none of the tested models are saved, except (optionally, but by default) one final one trained on the entire training set, your best_estimator_. Especially when models store significant amounts of data (e.g. KNNs), saving all the fitted estimators would be very memory-expensive, and usually not of much use. (cross_validate does have a parameter return_estimator, but the hyperparameter tuners do not. If you have a compelling reason to add it, it probably wouldn't take much work and you could open a GitHub Issue at sklearn.)
However, you do have the cv_results_ attribute that documents the scores of all of the tested estimators. That's usually enough for inspection purposes.
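A sketch of that inspection, plus re-fitting one particular candidate if you really need a fitted copy of it; search is your fitted GridSearchCV and X_train, y_train stand in for your training data:

import pandas as pd
from sklearn.base import clone

results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]].head())

# Re-fitting a single candidate is cheap compared with the 35-hour search:
runner_up = results.iloc[1]["params"]
estimator = clone(search.estimator).set_params(**runner_up).fit(X_train, y_train)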