sklearn.model_selection.train_test_split random state - python

I am training a computer vision model.
I divide the images into 3 datasets: training, validation and testing.
So that I always get the same images in training, validation and testing, I use the random_state parameter of the train_test_split function.
However, I have a problem:
I am training and testing on two different computers (linux and windows).
I thought that the results for a given random state would be the same, but they aren't.
Is there a way to get the same results on both computers?
I can't simply divide the images into 3 folders (training, validation and testing), since I want to change the test size and validation size across different experiments.

On a practical note, training the models may require the use of a remote computer or server (e.g. Microsoft Azure, Google Colaboratory, etc.), and it is important to be aware that random seeds can behave differently across Python versions and operating systems. Thus, when dividing the original dataset into training, validation and testing datasets, relying on splitting functions with random seeds should be avoided, as it could lead to overlapping testing and training datasets. A way to avoid this is to keep separate .csv files listing the images to be used for training, validation, or testing.
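For instance, a minimal sketch of that idea, assuming the image file names sit in a Python list called filenames (the CSV file names below are made up for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split

# Split once, on one machine, and persist the file lists.
train_files, test_files = train_test_split(filenames, test_size=0.2, random_state=42)
train_files, val_files = train_test_split(train_files, test_size=0.25, random_state=42)

pd.DataFrame({"image": train_files}).to_csv("train_split.csv", index=False)
pd.DataFrame({"image": val_files}).to_csv("val_split.csv", index=False)
pd.DataFrame({"image": test_files}).to_csv("test_split.csv", index=False)

# On any other machine, read the same lists back instead of re-splitting.
train_files = pd.read_csv("train_split.csv")["image"].tolist()

Changing the test or validation size for a new experiment then just means regenerating the CSVs once, rather than relying on every machine reproducing the same random split.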


Types of Training vs Test Error for Random Forest Classification Algorithm (Assessing Variance)

I have 2 questions that I would like to ascertain if possible (questions are bolded):
I've recently understood (I hope) the random forest classification algorithm, and have tried to apply it using sklearn in Python on a rather large dataset of pixels derived from satellite images (the features being the different bands, and the labels being specific classes that I outlined myself, e.g., vegetation, cloud, etc.). I then wanted to understand whether the model was experiencing a variance problem, and so the first thought that came to my mind was to compare the training and test errors.
Now this is where the confusion kicks in for me - I understand that there have been many different posts about:
How CV error should/should not be used compared to the out of bag (OOB) error
How, by design, the training error of a random forest classifier is almost always ~0 (i.e., fitting my model on the training data and using it to predict on that same training data) - this seems to be the case regardless of the tree depth
Regarding point 2, it seems that I can never compare my training and test error as the former will always be low, and so I decided to use the OOB error as my 'representative' training error for the entire model. I then realized that the OOB error might be a pseudo test error as it essentially tests trees on points that they did not specifically learn (in the case of bootstrapped trees), and so I defaulted to CV error being my new 'representative' training error for the entire model.
Looking back at the usage of CV error, I initially used it for hyperparameter tuning (e.g., max tree depth, number of trees, criterion type, etc.), and so I was again doubting whether I should use it as my official training error to be compared against my test error.
What makes this worse is that it's hard for me to validate what I think is true based on posts across the web, because each one answers only a small part and they might contradict each other. So would anyone kindly help me with my predicament about what to use as my official training error to compare against my test error?
My second question revolves around how the OOB error might be a pseudo test error based on data points not selected during bootstrapping. If that were true, would it be fair to say this does not hold if bootstrapping is disabled (the algorithm is technically still a random forest as features are still randomly subsampled for each tree; it's just that the correlation between trees is probably higher)?
Thank you!!!!
Generally, you want to distinctly break a dataset into training, validation, and test sets. Training data is fed into the model, validation data is used to monitor the progress of the model as it learns, and test data is used to see how well your model generalizes to unseen data. As you've discovered, depending on the application and the algorithm, you can mix up training and validation data or even forgo validation data entirely. For random forests, if you want to forgo having a distinct validation set and just use OOB to monitor progress, that is fine. If you have enough data, I think it still makes sense to have a distinct validation set. No matter what, you should still reserve some data for testing. Depending on your data, you may even need to be careful about how you split it up (e.g. if there's unevenness in the labels).
As to your second point about comparing training and test sets, I think you may be confused. The test set is really all you care about. You can compare the two to see if you're overfitting, so that you can change hyperparameters to generalize more, but otherwise the whole point is that the test set is the sole truthful evaluation. If you have a really small dataset, you may need to bootstrap a number of models with a CV scheme like stratified CV to generate a more accurate test evaluation.
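As a rough sketch of how the three numbers discussed here (CV estimate, OOB score, held-out test score) could be computed side by side with sklearn; X, y and the hyperparameter values below are placeholders, not taken from the original post:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# X, y: your pixel features and class labels (placeholders here)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)

# Cross-validated accuracy on the training portion (one candidate "training" estimate)
cv_acc = cross_val_score(rf, X_train, y_train, cv=5).mean()

rf.fit(X_train, y_train)
oob_acc = rf.oob_score_               # OOB accuracy, the pseudo-test estimate
test_acc = rf.score(X_test, y_test)   # held-out test accuracy, the final yardstick

print(cv_acc, oob_acc, test_acc)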

Training & Validation loss and dataset size

I'm new to neural networks and I am doing a project in which I have to define a NN and train it. I've defined a NN with 2 hidden layers, each with 17 inputs and 17 outputs; overall the NN has 21 inputs and 3 outputs.
I have a dataset of 10 million labels and a dataset of another 10 million samples. My first issue is about the size of the validation set and the training set. I'm using PyTorch and batches, and from what I've read, the batches shouldn't be too large. But I don't know approximately how large the sets should be.
I've tried with large and small numbers, but I cannot find a pattern that tells me whether I'm right to choose a large or a small set for either of them (apart from the time it takes to process a very large set).
My second issue is about the training and validation loss, which I've read can tell me whether I'm overfitting or underfitting depending on which one is bigger. Ideally both should have the same value, and it also depends on the epochs. But I am not able to tune the network parameters like the batch size and learning rate, or to choose how much data I should use for training and validation. If I use 80% of the set (8 million), it takes hours to finish, and I'm afraid that if I choose a smaller dataset, it won't learn.
If anything is badly explained, please feel free to ask me for more information. As I said, the data is given, and I only have to define the network and train it with PyTorch.
Thanks!
For your first question about batch size, there is no fixed rule for what value it should have. You have to try and see which one works best; when your NN starts performing badly, don't go above or below that value for the batch size. There is no hard rule to follow here.
For your second question, first of all, having the same training and validation loss doesn't mean your NN is performing nicely; it is just an indication that its performance will probably be good enough on a test set if that is the case, but it largely depends on many other things like your train and test set distributions.
And with a NN you need to try as many things as you can: different parameter values, train and validation split sizes, etc. You cannot just assume that it won't work.
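For illustration, a minimal PyTorch sketch of choosing a split ratio and a batch size; the 80/20 ratio, the batch size of 256 and the name dataset are assumptions to adjust, not values from the question:

import torch
from torch.utils.data import DataLoader, random_split

# dataset is assumed to be a torch.utils.data.Dataset wrapping the 10M samples and labels
n_train = int(0.8 * len(dataset))
n_val = len(dataset) - n_train
train_set, val_set = random_split(dataset, [n_train, n_val],
                                  generator=torch.Generator().manual_seed(0))

train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
val_loader = DataLoader(val_set, batch_size=256)

# Compute the average loss on train_loader and val_loader each epoch;
# a validation loss that keeps rising while the training loss falls suggests overfitting.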

How to use data augmentation with cross validation

I need to use data augmentation on what would be my training data in each split. The problem is that I am using cross-validation, so I can't find a reference on how to adjust my model to use data augmentation. My cross-validation is somewhat indexing my data by hand.
There are articles and general content about data augmentation, but very little of it generalizes to cross-validation with data augmentation.
I need to use data augmentation on the training data by simply rotating and adding zoom, cross-validate for the best weights and save them, but I wouldn't know how.
This example can be copy-pasted for better reproducibility; in short, how would I employ data augmentation and also save the weights with the best accuracy?
When training machine learning models, you should not test the model on the samples used during the training phase (if you care about realistic results).
Cross validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), then use one part as a test set while training the model on all the rest, and repeat this procedure for all parts one by one. This way you essentially test your model on all the available data, without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds.
As a rule of thumb, the number of cross validation folds is usually 5 or 7. This depends on the amount of labeled data at one's disposal: if you have lots of data, you can afford to leave less data for training the model and increase the test set size. The higher the number of folds, the better the accuracy estimate you can achieve, as the training part grows, but also the more time you have to invest in the procedure. In the extreme case you have a leave-one-out procedure: train on everything but one single sample, effectively making the number of folds equal to the number of data samples.
So for a 5-fold CV you train 5 different models, which have a large overlap in their training data. As a result, you should get 5 models with similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you have trained and train a new model on all the available data, assuming its performance will be around the mean of the values you got during the CV phase.
Now about the augmented data. You should not allow data obtained by augmenting the training part to leak into the test part. Each data point created from the training part should be used only for training; the same applies to the test set.
So you should split your original data into k folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate it with the original. Then you follow the regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through the augmentation procedure before concatenating them, and you should be fine.
Please note that splitting the data in a linear way can lead to unbalanced datasets, and it is not the best way to split the data. You should consider the functions I mentioned above; a sketch of the whole procedure follows below.
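A sketch of that procedure, assuming X and y are NumPy arrays and augment is a placeholder for whatever rotation/zoom routine you use (it is not a real library function):

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_test_fold, y_test_fold = X[test_idx], y[test_idx]   # never augmented

    # Augment only the training part of this fold, then add it to the originals.
    X_aug, y_aug = augment(X[train_idx], y[train_idx])
    X_train_fold = np.concatenate([X[train_idx], X_aug])
    y_train_fold = np.concatenate([y[train_idx], y_aug])

    # train on (X_train_fold, y_train_fold), evaluate on (X_test_fold, y_test_fold),
    # and keep the weights of whichever fold's model scores best if that is what you need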

Tensorflow Object Detetection training best practice questions

Training on large scale images:
I'm trying to train a vehicle detector on images with 4K resolution, with about 100 small vehicles per image (vehicle size about 100x100 pixels).
I'm currently using the full resolution, which costs me a lot of memory. I'm training using 32 cores and 128 GB RAM. The current architecture is Faster RCNN. I can train with a second stage batch size of 12 and a first_stage_mini_batch_size of 50. (I scaled both down until my memory was sufficient).
I assume that I should increase the max number of RPN proposals. What value would be appropriate?
Does this approach make sense?
Difficulty, truncated, labels and poses:
I currently separated my dataset only into three classes (cars, trucks, vans).
I assume giving additional information like:
difficult (for mostly hidden vehicles), and
truncated (I currently did not select truncated objects, but I could)
would improve the training process.
Would truncated include overlapped vehicles?
Would additional Information like views/poses and other labels also improve the training process, or would it make the training harder?
Adding new data to the training set:
Is it possible to add new images and objects into the training and validation record files and automatically resume the training using the latest checkpoint file from the training directory? Or is the option "fine_tune_checkpoint" with "from_detection_checkpoint" necessary?
Would it hurt if a random separation of training and validation data picked different splits than in the previous training?
For your problem, the out-of-the-box config files won't work so well due to the high resolutions of the images and the small cars. I recommend:
Training on crops --- cut your image into smaller crops, keeping the cars roughly at about the same resolution as they are now.
Eval on crops --- at inference time, cut up your image into a bunch of overlapping crops, and run inference on each one of those crops. Usually people combine the detections across the multiple crops using non-max suppression. See slide 25 here for an illustration of this; a rough cropping sketch is included after this list.
I highly recommend training using a GPU or better yet, multiple GPUs.
Avoid tweaking the batch_size parameters to begin with --- they are set up to work quite well out of the box and changing them will often make it difficult to debug.
Currently the difficult/truncated/pose fields are not used during training, so including them won't make a difference.
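To make the crop-based recommendation above a bit more concrete, here is a rough sketch of tiling a large image into overlapping crops, assuming the image is a NumPy array of shape (H, W, C); the crop size and stride are arbitrary values to tune:

def overlapping_crops(image, crop_size=1024, stride=768):
    """Yield (x_offset, y_offset, crop) tiles that cover the image with overlap."""
    h, w = image.shape[:2]
    ys = sorted(set(list(range(0, max(h - crop_size, 1), stride)) + [max(h - crop_size, 0)]))
    xs = sorted(set(list(range(0, max(w - crop_size, 1), stride)) + [max(w - crop_size, 0)]))
    for y in ys:
        for x in xs:
            yield x, y, image[y:y + crop_size, x:x + crop_size]

# Run the detector on each crop, shift the resulting boxes by (x, y),
# then merge all detections with non-max suppression.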
I switched the evaluation and training data (in the config) and training continues as normal with exactly the same command used to start it:
there's a log message about restoring parameters from the last checkpoint
as I switch the test/train data, mAP immediately shoots to the moon
the Images tab in TensorBoard gets updated
So it looks like changing the data works correctly. I'm not sure how it can affect the model; basically it's pretrained without these examples and fine-tuned with them.
LOG:
INFO:tensorflow:Restoring parameters from /home/.../train_output/model.ckpt-3190
This results in train/test contamination, and the real model performance is supposed to be lower than the one calculated on the contaminated validation dataset. You shouldn't worry that much unless you want to present some well-defined results.
A real-life example from https://arxiv.org/abs/1311.2901:
The ImageNet and Caltech datasets have some images in common. When evaluating how well a model trained on ImageNet performs with Caltech as validation, you should remove the duplicates from ImageNet before training.

Error in joblib.load file loading

I am using the Random Forest Regressor from Python's scikit-learn module to predict some values. I used joblib.dump to save the models. There are 24 joblib.dump files, and each weighs 45 megabytes (sum of all files = 931 MB). My problem is:
I want to load all these 24 files in one program to predict 24 values - but I cannot do it. It gives a MemoryError. How can I load all 24 joblib files in one program without any errors?
Thanks in advance...
There are a few options, depending on where exactly you are running out of memory.
Since you are predicting 24 different values based on the same input data, you can do the predictions sequentially, so that you keep only one RFR in memory at a time.
e.g.:
import joblib

predictions = []
for regressor_file in all_regressors:
    # Load one model at a time so only a single RFR is held in memory.
    regressor = joblib.load(regressor_file)
    predictions.append(regressor.predict(X))
(This might not apply to your case, but the problem is very common.)
You might be running out of memory when loading a large batch of input data. To solve this issue, you can split your input data and run prediction on sub-batches. That helped us when we moved from running predictions locally to EC2. Try to run your code on a smaller input dataset to test whether this helps.
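As a small sketch of the sub-batch idea, assuming X is a NumPy array (the choice of 10 chunks is arbitrary):

import numpy as np

# Score the input in chunks so the full prediction never has to fit in memory at once.
preds = np.concatenate(
    [regressor.predict(chunk) for chunk in np.array_split(X, 10)])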
You may want to optimise the parameters of the RFR. You may find that you can get the same predictive power with shallower trees or a smaller number of trees (or both). It is very easy to build a Random Forest that is just unnecessarily big. This is, of course, problem specific. I had to reduce the number of trees and make the trees smaller to make the model run efficiently in production. In my case, the AUC was the same before and after the optimisations. This last step of model tuning is sometimes omitted from tutorials.
