Linear Regression + Cross Validation model training with sklearn - python

I am new to Python and sklearn. I understand the basics of cross-validation. If I split the data into 3 folds (the default), sklearn will train the model 3 times with different training and testing subsets of the data. I assume it produces 3 different models, i.e. different ŵ and d̂. Is this right? Or should I just get 1 model back? If I use model.predict() on an input, which model am I using?

Cross validation evaluates model setup, not model parameters.
That is, if I use a bad setup, like a linear regression with 20 parameters fit on only 10 data points, cross validation will report low scores because a model built this way does not generalize, not because the individual models' parameters were wrong.
If after cross validation you conclude that the model generalizes well, all the trained models will be pretty similar. It is safe to use any of them, or to obtain the final model by training over the entire dev dataset.
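Here is a minimal sketch of that workflow, with a made-up toy regression dataset standing in for your dev set:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# toy data standing in for your dev set
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=100)

model = LinearRegression()

# 3-fold CV: sklearn fits 3 temporary models internally and returns their scores
scores = cross_val_score(model, X, y, cv=3)
print("R^2 per fold:", scores)

# if the scores are good and stable, fit the single final model on all the data;
# this fitted object is what model.predict() will use
model.fit(X, y)
print(model.predict(X[:5]))

Note that cross_val_score does not modify model in place: the 3 models it fits internally are discarded, so the fit on the full dev set is the single model your later model.predict() calls use.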

Related

Reset the weights in K-fold cross validation

In k-fold cross-validation, why do we need to reset the weights after each fold?
We use this function:
def reset_weights(m):
    if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
        m.reset_parameters()
We reset the weights of the model so that each cross-validation fold starts from a random initial state and does not learn from the previous folds.
Why is that important? I would have thought that, if we don't do that, it would be better: the model would learn from all the folds and update its parameters from all of them, rather than from each one on its own.
K-fold cross-validation is meant to check whether the model's performance is consistent and robust across different subsamples of train and test data, and to tune hyperparameters in a less biased way.
If the model performs well, with low variance across the (usually 5 or 10) folds of train and test data, it means that its performance does not depend on some particular subsample of the data.
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
After validating the model, you can train it on the whole dataset, without splitting it, to improve performance.
But this approach alone can't tell you whether your model has overfitted, so take note of CNN regularization and validation methods.
https://www.analyticsvidhya.com/blog/2020/09/overfitting-in-cnn-show-to-treat-overfitting-in-convolutional-neural-networks/
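Here is a minimal sketch of how the reset fits into a k-fold loop; the data, the small network and the training step are placeholders, not part of the original question:

import torch
from torch import nn
from sklearn.model_selection import KFold

def reset_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        m.reset_parameters()

# placeholder data and model
X = torch.randn(200, 10)
y = torch.randint(0, 2, (200,))
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))

for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=True, random_state=0).split(X)):
    model.apply(reset_weights)  # every fold starts from fresh random weights
    # ... train on X[train_idx], y[train_idx], then evaluate on X[val_idx] ...
    # record this fold's validation score; the fold scores, not the weights, are what you keep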

Understanding Cross Validation for Machine learning

Is the following correct about cross validation?:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning. Once the most optimal hyperparameters have been chosen the test data is applied to the model to give a result which is then compared to other models that have undergone a similar process but with different combinations of training data sets. The model with the best results on the test data is then chosen.
I don't think it is correct. You wrote:
Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning
You tune the model by picking (manually, or with a method like grid search or random search) a set of the model's hyperparameters (parameters whose values are set by you before you even fit the model to the data). Then, for a selected set of hyperparameter values, you calculate the validation set error using cross-validation.
So it should be like this:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to ...
... calculate the error. At the end of the cross-validation you will have k errors, one for each left-out set. What you do next is take the mean of these k errors, which gives you a single value: the validation set error.
If you have n sets of hyperparameters, you simply repeat the procedure n times, which gives you n validation set errors. You then pick the set that gave you the smallest validation error.
At the end, you will typically calculate the test set error to see how the model performs on unseen data, which simulates putting the model into production, and to see whether there is a difference between the test set error and the validation set error. A significant difference indicates over-fitting.
Just to add something on cross-validation itself: the reason we use k-fold CV or LOOCV is that it is a good estimate of the test set error, which means that when I tweak the hyperparameters and the validation set error drops, I know that I really improved the model rather than getting lucky and simply fitting the training set better.
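Here is a minimal sketch of that procedure; the data, the classifier and the three candidate hyperparameter sets are invented for illustration:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((500, 5)), rng.integers(0, 2, 500)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = [{"n_estimators": 50}, {"n_estimators": 200}, {"max_depth": 3}]
val_errors = []
for params in candidates:
    scores = cross_val_score(RandomForestClassifier(**params, random_state=0),
                             X_dev, y_dev, cv=5)   # k accuracies, one per fold
    val_errors.append(1 - scores.mean())           # mean validation error for this set

best = candidates[int(np.argmin(val_errors))]      # smallest validation error wins
final = RandomForestClassifier(**best, random_state=0).fit(X_dev, y_dev)
test_error = 1 - final.score(X_test, y_test)       # compare against the validation error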

Use same data for test and validation

Hey, I am training a CNN model and was wondering what will happen if I use the same data for validation and test.
Does the model train on the validation data as well (does my model see the validation data)? Or are just the error and accuracy calculated and taken into account during training?
You use your validation set to tune your model. That means you don't train on this data, but the model still takes it into account; for example, you use it to tune the model's hyperparameters.
In order to get a good evaluation, the test set should be data that is totally unknown to the model.
Take a look at this article for more information; here I point out the parts most relevant to your question:
A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning model’s hyperparameters.
The validation dataset is different from the test dataset that is also held back from the training of the model, but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.
If you use the same set for validation and test, your model may overfit (since it has seen the test data before the final test stage).
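Here is a minimal sketch of keeping the three splits separate; the toy data and the 60/20/20 proportions are just an example:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# tune hyperparameters against (X_val, y_val); touch (X_test, y_test) only once, at the very end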

How to take a sklearn post-cross_val_predict model to do prediction on another scaled data set? And whether the model can be serialized?

I came across this question while working on an sklearn ML case with heavily imbalanced data. The line below provides the basis for assessing the model from confusion-matrix and precision-recall perspectives, but ... it is a combined train/predict method:
y_pred = model_selection.cross_val_predict(model, X, Y, cv=kfold)
The question is how do I leverage this 'cross-val-trained' model to:
1) predict on another data set (scaled) instead of having to train/predict each time?
2) export/serialize/deploy the model to predict on live data?
model.predict()  # --> nope, needs a fit() first
model.fit()      # --> nope, a different model that does not take advantage of the cross_val_xxx methods
Any help is appreciated.
You can fit a new model with the data.
The cross validation aspect is about validating the way the model is built, not the model itself. So if the cross validation is OK, then you can train a new model with all the data.
(See my answer to Fitting sklearn GridSearchCV model as well for more details.)
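Here is a minimal sketch of that advice; the toy data and LogisticRegression stand in for your scaled X, Y and your model, and joblib is the usual way to serialize a fitted sklearn estimator:

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# toy stand-ins for the (already scaled) X, Y from the question
rng = np.random.default_rng(0)
X, Y = rng.random((300, 8)), rng.integers(0, 2, 300)

model = LogisticRegression(max_iter=1000)

# assessment only: the models fitted inside cross_val_predict are discarded
y_pred = cross_val_predict(model, X, Y, cv=5)

# the model you actually keep and deploy: a fresh fit on all the data
final_model = LogisticRegression(max_iter=1000).fit(X, Y)
joblib.dump(final_model, "model.joblib")

# later, on another data set scaled the same way:
loaded = joblib.load("model.joblib")
predictions = loaded.predict(X[:10])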

python: tune parameters of the model on a validation set

Theory says to split the dataset into three sets: a train set to train the model, a validation set to tune the parameters, and a test set to evaluate the performance.
However, there is already GridSearchCV, which does cross-validation on the training set to find the optimal parameters. But how do I use my own validation set to tune the parameters?
I have 10 classes, and in the training data there are 1017 samples for each class.
In the validation and test sets I have 300 samples per class.
I have trained my classifier on the training data.
clf = RandomForestClassifier(random_state=97)
clf.fit(train, np.array(train_lab))
How do I tune the parameters using my validation set? I have only found examples that use GridSearchCV with cross-validation. However, I would like to avoid that and tune the model on my own validation set. How can I do it?
You can pass a cross-validation object into GridSearchCV. Pass in a PredefinedSplit object, which lets you decide what the training and validation sets are.
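Here is a minimal sketch; the toy arrays stand in for the train and train_lab data from the question, and the names val and val_lab for the separate validation set are assumed:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# toy stand-ins for train/train_lab and the separate validation arrays
rng = np.random.default_rng(97)
train, train_lab = rng.random((1017, 4)), rng.integers(0, 10, 1017)
val, val_lab = rng.random((300, 4)), rng.integers(0, 10, 300)

X = np.concatenate([train, val])
y = np.concatenate([train_lab, val_lab])

# -1 marks samples that always stay in training; 0 puts a sample in the single validation fold
test_fold = np.concatenate([np.full(len(train), -1), np.zeros(len(val))])
cv = PredefinedSplit(test_fold)

grid = GridSearchCV(RandomForestClassifier(random_state=97),
                    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
                    cv=cv)
grid.fit(X, y)
clf = grid.best_estimator_   # refit (by default) on X, y with the best parameters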
