Theory says to split the dataset into three sets: a training set to train the model, a validation set to tune the hyperparameters, and a test set to evaluate performance.
However, GridSearchCV already does cross-validation on the training set to find the optimal parameters. But how do I use my own validation set to tune parameters?
I have 10 classes and for the train data there are 1017 samples for each class.
In validation and test sets I have 300 samples for each class.
I have trained my classifier on the training data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=97)
clf.fit(train, np.array(train_lab))
How do I tune the parameters using my validation set? I have only found examples that use GridSearchCV with cross-validation. However, I would like to avoid that and tune the model on my own validation set. How can I do it?
You can pass a cross-validation object into GridSearchCV. Pass in a PredefinedSplit object, which lets you decide what the training and validation sets are.
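A minimal sketch of that idea, assuming valid and valid_lab hold your validation samples and labels (analogous to train and train_lab from the question); the parameter grid below is only illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Concatenate training and validation data; mark training samples with -1
# (never used for scoring) and validation samples with 0 (the single split).
X = np.concatenate([train, valid])
y = np.concatenate([train_lab, valid_lab])
test_fold = np.concatenate([-np.ones(len(train), dtype=int), np.zeros(len(valid), dtype=int)])
ps = PredefinedSplit(test_fold)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=97), param_grid, cv=ps)
search.fit(X, y)
print(search.best_params_)

With this setup every candidate parameter combination is fit on your training samples and scored once on your fixed validation set, instead of on cross-validation folds.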
Hey, I am training a CNN model and was wondering what will happen if I use the same data for validation and test?
Does the model train on the validation data as well? (Does my model see the validation data?) Or are just the error and accuracy calculated on it and taken into account during training?
You use your validation set to tune your model. That means you don't train on this data, but the model still takes it into account; for example, you use it to tune the model's hyperparameters.
In order to get a good evaluation, the test set should be data that is completely unknown to the model.
Take a look at this article for more information; here are the parts most relevant to your question:
A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning the model's hyperparameters.
The validation dataset is different from the test dataset, which is also held back from the training of the model, but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.
If you use the same set for validation and test, your final evaluation will be biased toward optimism, since the model has effectively already seen the test data (through tuning) before the final test stage.
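As a small illustration (not from the original answer), two calls to train_test_split can carve out separate validation and test sets; X and y stand for the full feature matrix and labels:

from sklearn.model_selection import train_test_split

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# Tune hyperparameters against (X_val, y_val); touch (X_test, y_test) only once,
# for the final evaluation.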
I am new to Python and Keras, so please bear with my question.
I recently created a model in Keras, trained it, and computed the mean squared error (MSE) on its predictions. I used the train_test_split function on the data set.
Next I created a loop with 50 iterations and applied it to the above model. However, I kept the train_test_split function (with random_state not specified) inside the loop, so that in every iteration I would get a new set of X_train, y_train, X_test and y_test values. I obtained 50 MSE values as output and calculated their mean and standard deviation.
My query is: did I do the right thing by placing the train_test_split function inside the loop? Does it affect my goal, which was to see the different MSE values generated for my data set?
If I had placed the train_test_split function outside my loop and performed the above activity, wouldn't the X_train, y_train, X_test and y_test values remain the same throughout all 50 iterations? Wouldn't this cause an overfitting problem for my model?
I would really appreciate your feedback.
My code snippet:
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np

MSE = np.zeros(50)
for i in range(50):
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3)
    model = regression_model()
    model.fit(predictors_train, target_train, validation_data=(predictors_test, target_test), epochs=50, verbose=0)
    model.evaluate(predictors_test, target_test, verbose=0)
    target_predicted = model.predict(predictors_test)
    MSE[i] = metrics.mean_squared_error(target_test, target_predicted)
    print("Test set MSE for {} cycle: {}".format(i+1, MSE[i]))
The method you are implementing is a form of cross-validation (repeated random sub-sampling). It gives your model a better "view" of your data and reduces the chance that your particular train/test split was "too perfect" or "too noisy".
So putting train_test_split inside the loop will generate new training/test splits from your original data at every iteration, and by averaging the outputs you get what you want.
If you put train_test_split outside the loop, the training data will remain the same for the whole loop, resulting in overfitting, as you said.
However, train_test_split is random, so two splits can turn out very similar, which makes this method suboptimal.
A better way is to use k-fold cross-validation:
from sklearn.model_selection import StratifiedKFold

seed = 97  # any fixed integer, for reproducibility
# Note: StratifiedKFold expects discrete class labels; for a continuous
# regression target, use KFold instead.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
MSE = []
for fold, (train, test) in enumerate(kfold.split(X, y), start=1):
    model = regression_model()
    model.fit(X[train], y[train], validation_data=(X[test], y[test]), epochs=50, verbose=0)
    model.evaluate(X[test], y[test], verbose=0)
    target_predicted = model.predict(X[test])
    MSE.append(metrics.mean_squared_error(y[test], target_predicted))
    print("Test set MSE for {} cycle: {}".format(fold, MSE[-1]))
print("Mean MSE for {}-fold cross validation: {}".format(len(MSE), np.mean(MSE)))
This method will create 10 folds of your training data and will fit your model using a different one for validation at each iteration.
You can have more info here : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
Hope this will help you !
EDIT FOR PRECISION
Indeed, don't use this method on your TEST data, only on your VALIDATION data!
Your model must never see your TEST data before the final evaluation!
You don't want to use the test set during training at all. You will end up tweaking the model to the point where it starts "overfitting" even the test set, and your error estimates will be too optimistic.
Yes, if you place train_test_split outside of that for loop, your sets will stay the same for the whole training, and that can lead to overfitting. That is why you have a validation set, which is not used for training but for validation, mostly to find out whether your model is overfitting the training set or not. If it is overfitting, you should address it by tweaking your model (making it less complex, adding regularization, early stopping...).
But don't train your model on the same data you use for testing. Training on the validation set is a different story, and it is normally what happens when implementing K-fold cross-validation.
So the general steps to follow are:
1. Split your dataset into a test set and the "other" set; put the test set away and don't show it to your model until you are ready for final testing => only when you have already trained and tuned your model.
2. Choose whether you want to implement k-fold cross-validation or not. If not, then split your data into a training set and a validation set and use them throughout the whole training => training set for training and validation set for validating.
3. If you want to implement k-fold cross-validation, then follow step 2, measure the error metric that you want to track, pick the "other" set again, split it into a different training set and validation set, and do the whole training again. Repeat this multiple times and take the average of the error metrics measured during these cycles to get a better (average) error estimate.
4. Tune your model and repeat steps 2 and 3 until you are happy with the results.
5. Measure the error of your final model on the test set to see whether it generalizes well.
Note that while implementing k-fold cross validation is generally a good idea, this approach might be infeasible for larger neural networks because it can dramatically increase the time it takes to train them. If that is the case, you might want to stick with just one training set and one validation set or set k (in k-folds) to some low number such as 3.
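For reference, here is a hedged sketch of the procedure above. build_model and candidate_params are hypothetical placeholders for your model factory and the hyperparameter settings being compared, and X, y are assumed to be NumPy arrays:

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error

# Step 1: hold out a test set and put it away.
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 2-4: k-fold cross-validation on the remaining data for tuning.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for params in candidate_params:
    fold_mse = []
    for tr_idx, val_idx in kf.split(X_other):
        model = build_model(**params)
        model.fit(X_other[tr_idx], y_other[tr_idx])
        preds = model.predict(X_other[val_idx])
        fold_mse.append(mean_squared_error(y_other[val_idx], preds))
    print(params, np.mean(fold_mse))

# Step 5: retrain the best configuration on all of X_other, then score it on X_test once.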
As per my understanding, cross_val_score, cross_val_predict, and cross_validate can use K-fold validation. This means that the training set is iteratively used in part as a training set and in part as a test set. However, I have not come across any information on how validation is taken care of. It appears that the data is not divided into three sets: training, validation and test sets. How do cross_val_score, cross_val_predict, and cross_validate take care of training, validation and testing?
cross_val_score is used to estimate a model's accuracy in a more robust way than a single train-test split. It does the same job, but repeats it many times. These repetitions can be done in many different ways: k-fold CV, repeated CV, leave-one-out (LOO), etc. See section 3.1.2 of the sklearn User Guide.
In case you need to cross-validate hyperparameters, then you should run a nested cross-validation, with one outer loop to estimate the model's accuracy and one inner loop to get the best parameters. The inner CV loop will split the training set of the outer loop further into train and validation sets. The procedure goes something like this:
Outer loop:
    Split train - test
    Inner loop:
        Fix parameters
        Split train into train2 - validation
        Train with train2 set
        Score with validation set
        Repeat inner loop for all parameters
    Train with train set and best parameters from inner loop
    Score with test
    Repeat outer loop until CV ends
Return test scores
Fortunately, sklearn allows you to nest a GridSearchCV inside a cross_val_score.
from sklearn.model_selection import GridSearchCV, cross_val_score
validation = GridSearchCV(estimator, param_grid)
score = cross_val_score(validation, X, y)
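A runnable sketch of that nesting, using an SVC and a toy dataset purely as stand-ins (the original answer does not specify an estimator or grid):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}

# Inner loop: GridSearchCV picks hyperparameters on each outer training split.
inner = GridSearchCV(SVC(), param_grid, cv=3)
# Outer loop: cross_val_score estimates the accuracy of the whole procedure.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())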
cross_val_score does take care of validation insofar as the process splits the dataset into K parts (5 by default in current scikit-learn, 3 in older versions) and performs fitting and validation K times. The sklearn documentation talks about splitting the dataset into a train/test set, but do not misunderstand the name: that test set is in fact a validation set.
By using cross_val_score you can tune model hyperparameters and get the best configuration.
Therefore, the general procedure should be to divide the dataset yourself into a training set and a test set.
Use the training set for cross-validation (invoking cross_val_score), in order to tune model hyperparameters and get the best configuration.
Then use the test set to evaluate the model. Note that the test set should be large enough and representative of the population in order to get an unbiased estimate of the generalization error.
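A minimal sketch of that workflow, with a RandomForestClassifier and a hand-picked pair of hyperparameter values used purely for illustration (X and y are assumed to be loaded already):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Split off the test set yourself.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune on the training set with cross-validation.
for n in (100, 300):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    print(n, cross_val_score(clf, X_train, y_train, cv=5).mean())

# Suppose 300 scored best: refit it on all of X_train, then evaluate once on the test set.
best = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
print(best.score(X_test, y_test))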
I know that GridSearchCV will find the 'best' hyper-parameters by using k-fold CV. But after finding those hyper-parameters, will GridSearchCV train the model again on the whole data set to get the trainable parameters? Or does it only train the model on the folds that generated the best hyper-parameters?
According to the sklearn documentation:
Look up the keyword refit, which defaults to True: with refit=True, GridSearchCV refits the estimator using the best found parameters on the whole dataset passed to fit.
I believe this answers your question.
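A small illustration of what refit=True does (the dataset and grid here are arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
search = GridSearchCV(RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}, cv=3, refit=True)
search.fit(X, y)
# With refit=True, the best hyperparameters are used to retrain the estimator
# on the whole dataset passed to fit(), so best_estimator_ is ready to use.
print(search.best_params_)
print(search.best_estimator_.predict(X[:5]))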
I am new to Python and sklearn. I understand the basics of cross-validation. If I split the data into 3 folds by default, sklearn will train the model 3 times with different training and testing subsets of the data. I assume it produces 3 different models, I mean different ŵ and d̂. Is this right? Should I just get 1 model back? If I use model.predict() to predict an input, which model am I using?
Cross validation evaluates model setup, not model parameters.
I.e. if I use a bad setup, like a linear regression with 20 parameters fit on 10 data points, cross-validation will report low scores because the model in this setup does not generalize, not because the models' parameters were wrong.
If after cross validation you conclude the model generalizes well, all trained models will be pretty similar. It is safe to use any of them or even get the final model by training over the entire dev dataset.
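For instance (a toy sketch, not from the original answer): the cross-validation scores judge the setup, and the model you actually deploy is then fit once on the entire development set:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X, y, cv=3))  # three scores from three throwaway fits
final_model = model.fit(X, y)              # the single model you keep for predict()
print(final_model.predict(X[:3]))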