I would like to know the purpose of GridSearchCV and KFold. I know that GridSearchCV performs a grid search over a set of specified hyperparameters for a model (let's say an SVC), and validates each resulting model using CV.
KFold, on the other hand, just splits the data into k parts, and we can use that to evaluate a model for which we have already fixed the hyperparameters?
I have seen some code where both are used together; I assumed before that using just GridSearchCV also gives you the CV error?
So do you first train the model, using KFold to evaluate it, and then use GridSearchCV to tune the hyperparameters of the trained model? Which is the best way to reduce bias and variance?
I am extremely confused; any help is appreciated!
Thank you!
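For what it's worth, here is a minimal sketch of how the two are commonly combined (the SVC, the toy data and the parameter values are just assumptions for illustration): KFold only describes the splitting strategy, and GridSearchCV consumes it through its cv argument.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Toy data, just so the snippet runs end to end.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# KFold only defines how the data is split; it trains nothing by itself.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# GridSearchCV evaluates every hyperparameter combination with that
# splitting strategy and keeps the best-scoring one.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
So KFold is not a separate training step; it is just the splitting scheme that GridSearchCV (or cross_val_score) uses internally.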
I have a quick question about the following short snippet of code (the version of sklearn from which cross_val_score and LinearDiscriminantAnalysis are imported is 1.1.1):
cv_results = cross_val_score(LinearDiscriminantAnalysis(),data,isTarget,cv=kfold,scoring='accuracy')
I am trying to train a LinearDiscriminantAnalysis ML algorithm on the 'data' and 'isTarget' variables, which are numpy arrays of the features of the samples in my ML dataset and a list of which samples are targets (1) or non-targets (0), respectively. kfold is just a method for scoring the algorithm; it isn't important here.
My question is this: I am trying to score this algorithm by training it on 'data' and 'isTarget', but I would like to test it on a different dataset, 'data_val' and 'isTarget_val', and cross_val_score does not have parameters for training an algorithm on one dataset and testing it on another. I've been searching for other functions that will do this, and I feel that it has a really simple answer that I just can't find.
Can someone help me out? Thanks :)
This is how cross-validation is designed to work. The cv argument you are supplying specifies that you want to do K-Fold cross-validation, which means that the entirety of your dataset will be used for both training and testing in K different folds.
You can read up more on cross-validation here.
You can accomplish this using a PredefinedSplit (docs) as the cv argument.
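A rough sketch of that idea, reusing the variable names from the question (data/isTarget for training, data_val/isTarget_val for testing, all assumed to be numpy arrays): concatenate the two datasets and mark the training samples with -1 so they never end up in a test fold.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import PredefinedSplit, cross_val_score

# Stack the training set and the separate validation set.
X_all = np.concatenate([data, data_val])
y_all = np.concatenate([isTarget, isTarget_val])

# -1 means "always in the training split"; 0 puts a sample in test fold 0.
test_fold = np.concatenate([np.full(len(data), -1),
                            np.zeros(len(data_val), dtype=int)])
ps = PredefinedSplit(test_fold)

# With this cv there is exactly one split: train on data, test on data_val.
cv_results = cross_val_score(LinearDiscriminantAnalysis(), X_all, y_all,
                             cv=ps, scoring='accuracy')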
I need to find the best hyperparameters for an ANN and then run predictions with the best model. I use KerasRegressor. I find conflicting examples and advice. Please help me understand the right sequence and which parameters to use when.
I split my data into Train and Test datasets
I look for the best hyperparameters using GridSearchCV on the Train dataset
GridSearchCV.fit(X_Train, Y_Train)
I take GridSearchCV.best_estimator_ and use it in cross_val_score on the Test dataset, i.e.
cross_val_score(model.best_estimator_, X_Test, Y_Test , scoring='r2')
I'm not sure if I need to do this step. In theory, it should show r2 scores similar to what GridSearchCV reported for this best_estimator_, shouldn't it?
I use model.best_estimator_.predict(X_Test) on the Test data to predict the results, i.e. I pass best_estimator_ from GridSearchCV to run the actual prediction.
Is this correct ?
Do I need to fit model.best_estimator_ again on the Train data before doing a prediction, or does it keep all the weights found during GridSearchCV?
Do I need to save the weights to be able to reuse them later?
Usually when you use GridSearchCV on your training set, you will have an object which contains the best trained model with the best parameters.
gs = GridSearchCV(estimator, param_grid, cv=5)  # estimator and param_grid defined beforehand
gs.fit(X_train, y_train)
This is also evident from inspecting gs.best_params_, which holds the best parameters of the model after cross-validation. Now you can make predictions on your test set directly by running gs.predict(X_test), which will use the best selected model to predict on your test set.
For question 3, you don't need to use cross_val_score again, as it is a helper function that performs cross-validation on your dataset and returns the score of each fold of the data split.
For question 4, I believe this answer is quite explanatory: https://stats.stackexchange.com/a/26535
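To make the sequence concrete, here is a rough end-to-end sketch; RandomForestRegressor and its parameter grid are stand-ins for the KerasRegressor setup in the question, just so the snippet runs on its own.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Tune hyperparameters with cross-validation on the training set only.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
gs = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                  cv=5, scoring="r2")
gs.fit(X_train, y_train)

# 2. With refit=True (the default), best_estimator_ is already refit on
#    the whole training set, so no extra fitting is needed.
y_pred = gs.best_estimator_.predict(X_test)

# 3. Evaluate once on the held-out test set; no cross_val_score needed here.
print(gs.best_params_, r2_score(y_test, y_pred))
The key points: tuning happens only on the training set, best_estimator_ is already refit, and the test set is used exactly once for the final score.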
from sklearn.model_selection import GridSearchCV
parameters = {?????}
search = GridSearchCV(_pipeline, n_jobs=1, cv=5, param_grid=parameters)
#multi_target_linear = MultiOutputClassifier(search)
search.fit(X, y)
#search.get_params().keys()
search.best_params_
Hyperparameters in a case like this vary from case to case; they are exactly what you want to solve for, with the goal of arriving at a combination that is both accurate and efficient.
From what it seems, you aim to create a parameter grid called parameters that includes the hyperparameters you want to narrow down. GridSearchCV then tries every combination of those hyperparameters and finds the best-performing one. The CV is cross-validation, a way of rotating the training and test splits for a fairer evaluation; here you have 5 different segments, hence cv=5.
You are also using MultiOutputClassifier, which adapts other classifiers so they can handle multiple targets, but you don't define which classifier that is.
The valid keys of GridSearchCV's param_grid are determined by its estimator, but you didn't show how _pipeline is constructed.
So I cannot answer this question, but this answer will help.
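As a hedged illustration of how the keys could look (the StandardScaler/LogisticRegression pipeline below is invented, since the question doesn't show how _pipeline is built): param_grid keys follow the pattern <step name>__<parameter>, and MultiOutputClassifier adds an extra estimator__ level for the wrapped classifier.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_multilabel_classification(n_samples=200, n_classes=3, random_state=0)

# Hypothetical pipeline; the real _pipeline in the question is unknown.
_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", MultiOutputClassifier(LogisticRegression(max_iter=1000))),
])

# <step>__<param>, with an extra estimator__ level for the wrapped classifier.
parameters = {"clf__estimator__C": [0.1, 1, 10]}

search = GridSearchCV(_pipeline, n_jobs=1, cv=5, param_grid=parameters)
search.fit(X, y)
print(search.best_params_)
If in doubt, _pipeline.get_params().keys() lists every key that is valid inside parameters.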
I know that GridSearchCV will find the 'best' hyperparameters by using k-fold CV. But after finding those hyperparameters, will GridSearchCV train the model again on the whole dataset to get the trainable parameters? Or does it only train the model on the folds that generated the best hyperparameters?
According to the sklearn documentation: look up the keyword refit, which defaults to True. With refit=True, GridSearchCV refits an estimator with the best found parameters on the whole dataset.
I believe this answers your question.
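A small sketch of what that means in practice (the toy data and the SVC grid are placeholders): with refit=True, once the search finishes, the best parameter combination is refit on the whole dataset passed to fit, so best_estimator_ can predict directly.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# refit=True is the default: after the grid search, the best parameter
# combination is refit on the *whole* X, y (not just the winning folds).
gs = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5, refit=True)
gs.fit(X, y)

print(gs.best_params_)
gs.best_estimator_.predict(X[:5])  # already trained, no extra fit needed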
Reading documentation and procedures while using machine learning techniques for both classification and regression, I came across a topic that is actually new to me. It seems that a recommended procedure for splitting the data before training and testing is to split it into three different sets: training, validation and testing. Since this procedure makes sense to me, I was wondering how I should proceed with it. Let's say we split the data into these three sets, since I came across this while reading sklearn approaches and tips.
If we follow some interesting approaches like what I found here:
Stratified Train/Validation/Test-split in scikit-learn
Taking this into account, let's say we want to build a classifier using LogisticRegression (any classifier, actually). The procedure, as far as I understand, should be something like this, right?:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now if we want to make predictions we could use:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
When one has to estimate the accuracy of the model, a common approach is:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
And here is where my question comes in. Should the validation set that was split off earlier be used for calculating accuracy, or for validating somehow using K-fold CV instead? For instance:
# Perform 10-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(logreg, df, y, cv=10)
Any hint about the procedure with these three sets would be really appreciated. What I was thinking was that the validation set should be used together with the training set, but I do not really know in which way.
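For reference, a minimal sketch of one common way to use the three sets (LogisticRegression, the C values and the split proportions are just examples): fit candidate models on the training set, pick the best one by its score on the validation set, and touch the test set only once at the end. K-fold CV, as in cross_val_score, is essentially a replacement for a fixed validation set when data is scarce.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 60% train, 20% validation, 20% test (proportions are arbitrary here).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Use the validation set to compare candidate models / hyperparameters.
best_model, best_score = None, -1.0
for C in [0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_model, best_score = model, score

# The test set is used only once, for the final unbiased estimate.
print(accuracy_score(y_test, best_model.predict(X_test)))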