How to use data augmentation with cross validation


I need to apply data augmentation to what would be my training data in each cross-validation split. The problem is that I am using cross-validation, and I can't find a reference on how to adjust my model to use data augmentation with it. My cross-validation is done somewhat by hand, by indexing my data.
There are articles and general content about data augmentation, but very little about combining it with cross-validation, and nothing that generalizes.
I need to augment the training data by simply rotating and zooming, cross-validate for the best weights and save them, but I don't know how.
This example can be copy-pasted for better reproducibility. In short, how would I employ data augmentation and also save the weights with the best accuracy?

When training machine learning models, you should not test the model on samples that were used during training (if you care about realistic results).
Cross validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), then use one part as a test set while training the model on all the rest, and repeat this procedure for every part in turn. This way you essentially test your model on all the available data without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds. As a rule of thumb, the number of cross validation folds is usually 5 or 7. This depends on the amount of labeled data at your disposal: if you have lots of data, you can afford to leave less data for training and increase the test set size. The higher the number of folds, the better the accuracy estimate, because each training part gets larger, but the more time the procedure takes. In the extreme case you have leave-one-out: train on everything but a single sample, which makes the number of folds equal to the number of data samples.
So for a 5-fold CV you train 5 different models, which have a large overlap in their training data. As a result, you should get 5 models with similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models and train a new model on all the available data, assuming its performance will be the mean of the values you got during the CV phase.
Now about the augmented data. You should not allow data obtained by augmenting the training part to leak into the test part. Each data point created from the training part should be used only for training; the same applies to the test set.
So you should split your original data into k-folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate them to the original. Then you follow regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through augmenting procedure before concatenating them, and you should be fine.
Please note that splitting the data linearly (in its original order) can lead to unbalanced data sets, so it is not the best way to split. You should consider the functions I've mentioned above.
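As a rough sketch of the procedure described in this answer, the loop below splits first and augments only the training part of each fold. The `augment` helper and the toy 8x8 arrays are made up for illustration, and a horizontal flip stands in for the rotation and zoom mentioned in the question; a real pipeline would plug in its own augmentation and model.

```python
import numpy as np
from sklearn.model_selection import KFold


def augment(images, labels):
    """Stand-in augmentation: add a horizontally flipped copy of every image."""
    flipped = images[:, :, ::-1]
    return np.concatenate([images, flipped]), np.concatenate([labels, labels])


# Toy data: 20 grayscale "images" of 8x8 pixels with binary labels (made up).
rng = np.random.default_rng(0)
X = rng.random((20, 8, 8))
y = rng.integers(0, 2, size=20)

fold_sizes = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X):
    # Split first, on the original samples only.
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Augment the training part only; the test fold stays untouched.
    X_train_aug, y_train_aug = augment(X_train, y_train)

    # A model would be fitted on (X_train_aug, y_train_aug)
    # and scored on (X_test, y_test) at this point.
    fold_sizes.append((len(X_train_aug), len(X_test)))

print(fold_sizes)
```

Because the split happens before augmentation, no flipped copy of a test image can end up in the training data of the same fold.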

Related

Understanding Cross Validation for Machine learning

Is the following correct about cross validation?:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning. Once the most optimal hyperparameters have been chosen the test data is applied to the model to give a result which is then compared to other models that have undergone a similar process but with different combinations of training data sets. The model with the best results on the test data is then chosen.

I don't think it is correct. You wrote:
Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning
You tune the model by picking (manually, or with a method such as grid search or random search) a set of the model's hyperparameters (parameters whose values you set before you even fit the model to the data). Then, for the selected hyperparameter values, you calculate the validation set error using cross-validation.
So it should be like this:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to ...
... calculate the error. At the end of the cross validation, you will have k errors calculated on the k left-out sets. Next, you calculate the mean of these k errors, which gives you a single value: the validation set error.
If you have n sets of hyperparameters, you simply repeat the procedure n times, which gives you n validation set errors. You then pick the set that gave you the smallest validation error.
At the end, you typically calculate the test set error to see how the model performs on unseen data, which simulates putting the model into production, and to check whether the test set error differs from the validation set error. A significant difference indicates over-fitting.
Just to add something on cross-validation itself: the reason we use k-fold CV or LOOCV is that it gives a good estimate of the test set error. When I change the hyperparameters and the validation set error drops, I know that I have really improved the model rather than being lucky and simply fitting the training set better.
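The selection loop described in this answer can be sketched in a few lines. This is only an illustration under assumptions: the model (closed-form ridge regression), the candidate `alpha` values, and the synthetic data are all invented for the example, and the helper names `ridge_fit` and `cv_error` are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression weights (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def cv_error(X, y, alpha, k=5):
    """Mean of the k validation errors (MSE) for one hyperparameter value."""
    folds = np.array_split(np.arange(len(X)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], alpha)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))


# Synthetic regression data (made up for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

candidates = [0.01, 1.0, 100.0]  # n = 3 hyperparameter settings
val_errors = {alpha: cv_error(X, y, alpha) for alpha in candidates}
best_alpha = min(val_errors, key=val_errors.get)
print(val_errors, best_alpha)
```

Each candidate gets one number, the mean over its k folds, and the candidate with the smallest mean is kept. A separate test set, untouched by this loop, would then give the final error estimate.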

Split into test and train set before or after generating document term matrix?

I'm working on simple machine learning problems and I am trying to build a classifier that can differentiate between spam and non-spam SMS. I'm confused as to whether I should generate the document-term matrix before splitting into test and train sets or after splitting.
I tried it both ways and found that the accuracy is slightly higher when I split the data before generating the document-term matrix. But to me, this makes no sense. Shouldn't the accuracy be the same? Does the order of these operations make any difference?

Qualitatively, you don't need to do it either way. However, proper procedure requires that you keep your training and test data entirely separate. The overall concept is that the test data are not directly represented in the training; this helps reduce over-fitting. The test data (and later validation data) are samples that the trained model has never encountered during training.
Therefore, the test data should not be included in your pre-processing -- the document-term matrix. This breaks the separation, in that the model has, in one respect, "seen" the test data during training.
Quantitatively, you need to do the split first, because that matrix is to be used for training the model against only the training set. When you included the test data in the matrix, you obtained a matrix that is slightly inaccurate in representing the training data: it no longer properly represents the data you're actually training against. This is why your model isn't quite as good as the one that followed proper separation procedures.
It's a subtle difference, most of all because the training and test sets are supposed to be random samples of the same population of possible inputs. Random differences provide the small surprise you encountered.
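A minimal sketch of the split-first order, assuming scikit-learn's `CountVectorizer` is used to build the document-term matrix (the question does not say which tool was used). The six messages and their labels are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

messages = [
    "win a free prize now",
    "free entry in a weekly competition",
    "are we still meeting for lunch",
    "call me when you get home",
    "claim your free cash prize",
    "see you at the office tomorrow",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam (toy labels)

# Split the raw text first.
train_msgs, test_msgs, y_train, y_test = train_test_split(
    messages, labels, test_size=2, random_state=0, stratify=labels
)

vectorizer = CountVectorizer()
# The vocabulary is learned from the training text only.
X_train = vectorizer.fit_transform(train_msgs)
# The test text is mapped onto that fixed vocabulary; unseen words are dropped.
X_test = vectorizer.transform(test_msgs)
print(X_train.shape, X_test.shape)
```

Both matrices end up with the same columns, but those columns were chosen without looking at the test messages.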

How to check machine learning accuracy without cross validation

I have training samples X_train and Y_train to train on, and X_estimated to predict.
My task is to make my classifier learn as accurately as it can, and then predict a vector of results over X_estimated that is as close as possible to Y_estimated (which I have now, and I have to be as precise as I can). If I split my training data, say 75/25, to train and test, I can get the accuracy using sklearn.metrics.accuracy_score and a confusion matrix. But then I am losing the 25% of samples that would make my predictions more accurate.
Is there any way I could train on 100% of the data and still see an accuracy score (or percentage), so that I can predict many times and save the best (%) result?
I am using a random forest with 500 estimators and usually get around 90% accuracy. I want to save the best possible prediction vector for my task without splitting any data (not wasting anything), but still be able to calculate the accuracy (so I can save the best prediction vector) across multiple attempts (the random forest gives different results every time).
Thank you

Splitting your data is critical for evaluation.
There is no way that you could train your model on 100% of the data and still get a correct evaluation accuracy, unless you expand your dataset. You could change your train/test split, or try to optimize your model in other ways, but I guess the simple answer to your question would be no.

As per your requirement, you can try K-fold cross-validation, for example after a 90|10 train|test split. Using 100% of the data for training is not possible, because you have to test on some data before you can say how good your model is. K-fold CV uses your whole training data: in each fold it holds out a different part of the training data as the test sample and trains on the rest. Finally, it calculates the accuracy by averaging over all the folds. You can then test the accuracy once more on the remaining 10% of the data.
K Fold Cross Validation
Sklearn provides simple methods for performing K-fold cross-validation. You simply pass the number of folds to the method. But remember: the more folds, the more time it takes to train the model.
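One way this could look with scikit-learn is sketched below: 10% is held out, `cross_val_score` runs 5 folds on the remaining 90%, and the held-out part is used once at the end. The synthetic dataset, the fold count, and the forest size are assumptions chosen to keep the example quick, not values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the asker's data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Keep 10% aside as a final test set; cross-validate on the remaining 90%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=5)  # one accuracy per fold
print(scores.mean())

# Refit on all of the training data, then check once against the held-out 10%.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```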

It is not necessary to do a 75|25 split of your data all the time. 75|25 is kind of old school now. It greatly depends on the amount of data that you have. For example, if you have 1 billion sentences for training a language model, it is not necessary to reserve 25% for testing.
Also, I second the previous answer of trying K-fold cross-validation. As a side note, you could consider looking at the other metrics like precision and recall as well.

In general, splitting your data set is critical for evaluation, so I would recommend you always do that.
That said, there are methods that in some sense allow you to train on all your data and still get an estimate of your performance, that is, of the generalization accuracy.
One particularly prominent method is to use the out-of-bag samples of models based on bootstrapping, e.g. random forests.
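from sklearn.ensemble import RandomForestClassifier
# X, y: the full training features and labels (no split needed for the OOB estimate)
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True)
rf.fit(X, y)
# Accuracy estimated on the samples each tree did not see during bootstrapping
print(rf.oob_score_)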

If you are doing classification, always go with stratified k-fold CV (https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/).
If you are doing regression, go with simple k-fold CV, or divide the target into bins and do stratified k-fold CV on the bins. This way you can use all of your data in model training.
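Both suggestions can be sketched with scikit-learn's `StratifiedKFold`. The imbalanced labels, the random regression target, and the choice of quartile bins are assumptions made for the example; other binning schemes would work the same way.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Classification: an imbalanced label vector (80 negatives, 20 positives).
y_cls = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [
    int(y_cls[test_idx].sum()) for _, test_idx in skf.split(X, y_cls)
]
print(positives_per_fold)

# Regression: bin the continuous target into quartiles,
# then stratify on the bin label instead of the raw value.
y_reg = rng.normal(size=100)
bin_edges = np.quantile(y_reg, [0.25, 0.5, 0.75])
y_bins = np.digitize(y_reg, bin_edges)  # bin labels 0..3
bins_per_fold = [
    np.bincount(y_bins[test_idx], minlength=4).tolist()
    for _, test_idx in skf.split(X, y_bins)
]
print(bins_per_fold)
```

In the regression case the model is still trained on the continuous `y_reg`; the bins are only used to decide which rows go into which fold.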

python - Best techniques to split dataset to get high performance accuracy

I have applied these 4 methods:
Train and Test Sets.
K-fold Cross Validation.
Leave One Out Cross Validation.
Repeated Random Test-Train Splits.
The "Train and Test Sets" method achieves high accuracy, while the remaining methods all achieve the same accuracy, which is lower than that of the first approach.
I want to know which method I should choose.

Train and Test Sets and Cross Validation are each used in certain cases. Cross Validation is used if you want to compare different models. Accuracy usually increases with more training data, which is why Leave One Out Cross Validation sometimes performs better than K-fold Cross Validation; it depends on your dataset size and sometimes on the algorithm you are using. On the other hand, Train and Test Sets is usually used if you aren't comparing different models and if the time required to run cross validation isn't worth it, meaning cross validation isn't needed in that case. In most cases Cross Validation is preferred. But which method should you choose? That usually depends on the choices you make while training, such as how you handle the data and which algorithm you use. For example, if you are training Random Forests, cross validation is usually not needed when you use the Out of Bag estimate, although you can still do it if you need to.

Training a model comprises tuning both model accuracy and model generalization. If the model does not generalize, it may be an underfit or overfit model.
In the overfit case, the model may perform well on training data while its accuracy drops on test or unknown data.
We use training data to improve the accuracy of the model. As the training data size increases, model accuracy may also increase.
Similarly, we use different training samples to generalize the model.
So train-test splitting methods depend on the size of the available data and the algorithm used for the model.
First, the train-test method has fixed-size training and testing data, so on each iteration we use the same training data to train the model and the same test data to assess its accuracy.
Second, the k-fold method also has fixed-size train and test data, but the train and test data change on each iteration. So it may be a better approach irrespective of data size.
The leave-one-out approach is useful only if the data size is small. Here we use almost the whole data for training. So the training accuracy of the model will be better, but the model may not generalize.
The randomised train-test method is also a good approach for training and testing a model's performance. Here we randomly select the train and test data each time. So it may perform better than the leave-one-out method if the data size is small.
Lastly, each splitting approach has pros and cons, so which splitting method is good for your model is up to you. It also depends on the data size and on data selection, that is, how we select data from the sample while splitting.
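To make the comparison concrete, the four schemes from the question can be run side by side on the same data. This is only a sketch: the synthetic dataset, the logistic regression model, and the split sizes are assumptions, and the numbers it prints say nothing about the asker's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    ShuffleSplit,
    cross_val_score,
    train_test_split,
)

X, y = make_classification(n_samples=60, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# 1. Single train/test split: one number that depends on
#    which rows happen to land in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
results = {"train/test split": model.fit(X_tr, y_tr).score(X_te, y_te)}

# 2-4. Resampling schemes: the mean accuracy over many splits.
schemes = {
    "k-fold (k=5)": KFold(n_splits=5, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
    "repeated random splits": ShuffleSplit(
        n_splits=10, test_size=0.25, random_state=0
    ),
}
for name, cv in schemes.items():
    results[name] = cross_val_score(model, X, y, cv=cv).mean()

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

A single split that scores higher than the three averaged schemes is often just a favourable draw of test rows, which is one reason the averaged estimates tend to be trusted more.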

How do I train the Convolutional Neural Network with negative and positive elements as the input of the first layer?

I am just curious why I have to scale the testing set using the testing set, and not using the training set, when I'm training a model such as a CNN.
Or am I wrong, and I still have to scale it using the training set?
Also, can I train a CNN on a dataset that contains both positive and negative elements as the first input of the network?
Any answers with reference will be really appreciated.

We usually have 3 types of datasets for getting a model trained,
Training Dataset
Validation Dataset
Test Dataset
Training Dataset
This should be an evenly distributed data set that covers all varieties of data. If you train for too many epochs, the model gets used to the training dataset and only gives proper predictions on the training dataset; this is called overfitting. The only way to keep a check on overfitting is to have other datasets that the model has never been trained on.
Validation Dataset
This can be used to fine-tune the model's hyperparameters.
Test Dataset
This is the dataset that the model has not been trained on and that has never been part of deciding the hyperparameters; it shows how the model really performs.

If scaling or normalization is used, the testing set should be transformed with the same parameters that were computed during training.
A good answer that links to that: https://datascience.stackexchange.com/questions/27615/should-we-apply-normalization-to-test-data-as-well
Also, some models tend to require normalization and others do not.
Neural network architectures can be fairly robust and might not strictly need normalization, although scaled inputs usually make training easier.
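A small sketch of reusing the training parameters on the test set, assuming scikit-learn's `StandardScaler` as the scaling method (the question does not name one). The random arrays are placeholders for real features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder features; real image or tabular data would go here.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
# Mean and standard deviation are estimated on the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# The same mean and standard deviation are reused for the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # essentially zero by construction
print(X_test_scaled.mean(axis=0))   # near zero, but not exactly
print((X_train_scaled < 0).any())   # standardized inputs include negative values
```

Standardized features are centered on zero, so roughly half of the values fed to the first layer are negative. That is normal input for a network, CNNs included.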

Scaling depends on the requirement and on the data you receive. Only the input features are scaled; the target variable, which the test data usually lacks, is not part of the scaler. The test features should therefore be transformed with the scaling parameters learned on the training features, not with statistics computed from the test data itself.
