How to select a subset of the MNIST training set - Python

I am having trouble selecting a subset of the MNIST training set containing M points to train a 1-NN classifier, because the number of original training points is too large.
That is, I need to figure out a scheme that takes as input a labeled training set as well as a number M, and returns a subset of the training set of size M.
Uniform-random selection (that is, just picking M of the training points at random) is not allowed.

One option could be to train your network with a data generator.
It loads only one batch of data at a time, so you will not have memory issues with your data anymore. Furthermore, it can use multithreading,
so loading and perhaps preprocessing your data is not a bottleneck.
Here is a good example:
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
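For reference, here is a minimal sketch of such a generator in the style of that post, using the Keras Sequence API; the load_sample function and the id/label containers are placeholders for whatever storage format you actually use:

import numpy as np
from tensorflow.keras.utils import Sequence

class MnistBatchGenerator(Sequence):
    # Loads one batch of (images, labels) at a time instead of the full training set
    def __init__(self, sample_ids, labels, batch_size=32):
        self.sample_ids = sample_ids          # e.g. indices or file names
        self.labels = labels                  # label lookup keyed the same way as sample_ids
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.sample_ids) / self.batch_size))

    def __getitem__(self, idx):
        batch_ids = self.sample_ids[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.stack([load_sample(i) for i in batch_ids])   # load_sample: placeholder reader
        y = np.array([self.labels[i] for i in batch_ids])
        return x, y

# model.fit(MnistBatchGenerator(train_ids, train_labels), epochs=5)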
I hope this helps.


Training & Validation loss and dataset size

I'm new to neural networks and I am doing a project in which I have to define an NN and train it. I've defined an NN with 2 hidden layers of 17 units each; the network has 21 inputs and 3 outputs.
I have a dataset of 10 million samples with 10 million corresponding labels. My first issue is the size of the validation set versus the training set. I'm using PyTorch with mini-batches, and from what I've read, the batches shouldn't be too large, but I don't know approximately how large each set should be.
I've tried larger and smaller numbers, but I cannot find a correlation that tells me whether I'm right in choosing a large or a small set for either of them (apart from the time it takes to process a very large set).
My second issue is about the training and validation loss. I've read that they can tell me whether I'm overfitting or underfitting depending on which is bigger or smaller; ideally they should be roughly equal, and this also depends on the number of epochs. But I am not able to tune the network parameters like batch size and learning rate, or to decide how much data I should use for training and validation. If I use 80% of the set (8 million), it takes hours to finish, and I'm afraid that if I choose a smaller dataset, it won't learn.
If anything is badly explained, please feel free to ask me for more information. As I said, the data is given, and I only have to define the network and train it with PyTorch.
Thanks!
For your first question, about batch size: there is no fixed rule for what value it should have. You have to try and see which one works best; when your NN starts performing badly, don't go above or below that value for the batch size. There is no hard rule here to follow.
For your second question: first of all, having the same training and validation loss doesn't mean your NN is performing well. It is only an indication that, if that is the case, its performance will probably carry over to a test set, but that also depends on many other things, such as how similar your train and test set distributions are.
With NNs you need to try as many things as you can: different parameter values, different train/validation split sizes, and so on. You cannot just assume that something won't work.
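As a rough starting point for such experiments, here is a sketch in PyTorch; the samples and labels tensors are placeholders, the layer sizes follow the description above (21 inputs, two hidden layers of 17, 3 outputs), and every constant (split ratio, batch size, learning rate, epochs) is just something to vary:

import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

dataset = TensorDataset(samples, labels)      # labels as a LongTensor of class indices
train_size = int(0.8 * len(dataset))          # try other train/validation ratios as well
train_set, val_set = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_set, batch_size=256, shuffle=True)   # batch size is one of the knobs to tune
val_loader = DataLoader(val_set, batch_size=256)

model = torch.nn.Sequential(torch.nn.Linear(21, 17), torch.nn.ReLU(),
                            torch.nn.Linear(17, 17), torch.nn.ReLU(),
                            torch.nn.Linear(17, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    train_loss = 0.0
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * len(x)
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for x, y in val_loader:
            val_loss += loss_fn(model(x), y).item() * len(x)
    # Comparing these two curves over epochs is what hints at over- or underfitting
    print(epoch, train_loss / len(train_set), val_loss / len(val_set))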

How to use data augmentation with cross validation

I need to apply data augmentation to what would be my training data. The problem is that I am using cross-validation, so I can't find a reference on how to adjust my model to use data augmentation; my cross-validation currently indexes my data more or less by hand.
There are articles and general content about data augmentation, but very little on how to combine cross-validation with data augmentation.
I need to augment the training data by simply rotating and adding zoom, cross-validate for the best weights and save them, but I don't know how.
This example can be copy-pasted for better reproducibility. In short, how would I employ data augmentation and also save the weights with the best accuracy?
When training machine learning models, you should not test the model on samples that were used during the training phase (if you care about realistic results).
Cross-validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), then use one part as a test set while training the model on all the rest, and repeat this procedure for each part in turn. This way you essentially test your model on all the available data without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds. As a rule of thumb, the number of cross-validation folds is usually 5 or 7. This depends on the amount of labeled data at your disposal: if you have lots of data, you can afford to leave less data for training and increase the test set size. The higher the number of folds, the better the accuracy estimate you can achieve (since the training portion increases), but also the more time you have to invest in the procedure. In the extreme case you have a leave-one-out procedure: train on everything but one single sample, effectively making the number of folds equal to the number of data samples.
So for a 5-fold CV you train 5 different models, whose training data overlaps heavily. As a result, you should get 5 models with similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you have trained and train a new model on all the available data, assuming its performance will be roughly the mean of the values you got during the CV phase.
Now about the augmented data. You should not allow data obtained by augmenting the training part to leak into the test set. Each data point created from the training part should be used only for training; the same applies to the test part.
So you should split your original data into k folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate it with the original. Then you follow the regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through the augmenting procedure before concatenating them, and you should be fine.
Please note that splitting data in a linear way can lead to unbalanced datasets, and it is not the best way of splitting the data. You should consider the functions I mentioned above.
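A rough sketch of that split-then-augment order, assuming X and y are NumPy arrays holding the original (non-augmented) samples and labels, and augment() and build_model() are hypothetical stand-ins for your rotation/zoom routine and model constructor:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    x_train, y_train = X[train_idx], y[train_idx]
    x_test, y_test = X[test_idx], y[test_idx]

    # Augment only the training fold, then add the new samples to it
    x_aug, y_aug = augment(x_train, y_train)      # hypothetical rotation/zoom routine
    x_train = np.concatenate([x_train, x_aug])
    y_train = np.concatenate([y_train, y_aug])

    model = build_model()                         # hypothetical model constructor
    model.fit(x_train, y_train)
    scores.append(model.score(x_test, y_test))    # the test fold stays un-augmented

print(np.mean(scores))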

How do I train the Convolutional Neural Network with negative and positive elements as the input of the first layer?

I am just curious why I have to scale the test set using the test set itself, and not the training set, when I'm training a model such as a CNN.
Or am I wrong, and do I still have to scale it using the training set?
Also, can I train a CNN on a dataset that contains both positive and negative elements as the input of the first layer?
Any answers with references will be really appreciated.
We usually have 3 types of datasets when training a model:
Training Dataset
Validation Dataset
Test Dataset
Training Dataset
This should be an evenly distributed dataset that covers all varieties of data. If you train with too many epochs, the model will get used to the training dataset and will only give proper predictions on the training dataset; this is called overfitting. The only way to keep a check on overfitting is to have other datasets that the model has never been trained on.
Validation Dataset
This can be used to fine-tune the model hyperparameters.
Test Dataset
This is the dataset that the model has not been trained on, that has never been part of deciding the hyperparameters, and that will show how the model really performs.
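For illustration, one common way to carve out the three sets with scikit-learn (the 80/10/10 proportions are only an example, and X, y are placeholders for your data):

from sklearn.model_selection import train_test_split

# Split off 10% as the test set, then 1/9 of the remaining 90% (i.e. another 10%) as validation
x_tmp, x_test, y_tmp, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_tmp, y_tmp, test_size=1/9, random_state=0)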
If scaling or normalization is used, the test set should be transformed with the same parameters that were computed on the training set.
A good answer that links to that: https://datascience.stackexchange.com/questions/27615/should-we-apply-normalization-to-test-data-as-well
Also, some models tend to require normalization and others do not.
Neural network architectures are normally robust and might not need normalization.
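A minimal sketch of what "the same parameters" means in practice, using scikit-learn's StandardScaler as one possible example (the question does not name a library, and x_train / x_test are placeholders):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)   # mean and std are computed on the training set only
x_test_scaled = scaler.transform(x_test)         # the test set reuses those same training statistics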
Scaling the data depends on the requirements as well as on the data you have. Test data gets scaled with test data only, because the test data doesn't have the target variable (one less feature than the training data). If we scaled our training data together with new test data, our model would not be able to correlate with any target variable and would thus fail to learn. So the key difference is the existence of the target variable.

How to cancel the huge negative effect of my training data distribution on subsequent neural network classification function?

I need to train my network on data that has a normal distribution. I've noticed that my neural net has a very strong tendency to predict only the most frequently occurring class label (comparing its predictions with the actual labels in a CSV file I exported).
What are some suggestions (other than cleaning the data to produce an evenly distributed training set) that would help my neural net not to predict only the most frequent label?
UPDATE: Just wanted to mention that the suggestions made in the comments did indeed work. I found, however, that adding an extra layer to my NN also mitigated the problem.
Assuming the NN is trained using mini-batches, it is possible to simulate (instead of generate) an evenly distributed training set by making sure each mini-batch is evenly distributed.
For example, assuming a 3-class classification problem and a mini-batch size of 30, construct each mini-batch by randomly selecting 10 samples per class (with repetition, if necessary).
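A small sketch of that per-class sampling with NumPy, assuming X_train and y_train hold your training data and y_train contains integer class labels:

import numpy as np

def balanced_minibatch(X, y, per_class=10):
    # Build one mini-batch with the same number of samples from every class
    xs, ys = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        # Sample with replacement so rare classes can still fill their quota
        picked = np.random.choice(idx, size=per_class, replace=True)
        xs.append(X[picked])
        ys.append(y[picked])
    return np.concatenate(xs), np.concatenate(ys)

# For a 3-class problem this returns a batch of 30 samples, 10 per class
xb, yb = balanced_minibatch(X_train, y_train, per_class=10)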

Sklearn-GMM on large datasets

I have a large dataset (I can't fit the entire data in memory). I want to fit a GMM on this dataset.
Can I use GMM.fit() (sklearn.mixture.GMM) repeatedly on mini-batches of data?
There is no reason to fit it repeatedly.
Just randomly sample as many data points as you think your machine can compute in a reasonable time. If variation is not very high, the random sample will have approximately the same distribution as the full dataset.
idx = np.random.choice(len(full_dataset), size=10000, replace=False)  # sample row indices; np.random.choice needs 1-D input
randomly_sampled = full_dataset[idx]
# If the data does not fit in memory, you can find a way to randomly sample it as you read it
gmm = sklearn.mixture.GMM()  # GaussianMixture in recent scikit-learn versions
gmm.fit(randomly_sampled)
And then use
gmm.predict(full_dataset)
# Again, you can predict one sample or one batch at a time if the full dataset does not fit in memory
on the rest to classify them.
fit will always forget previous data in scikit-learn. For incremental fitting, there is the partial_fit function. Unfortunately, GMM doesn't have a partial_fit (yet), so you can't do that.
As Andreas Mueller mentioned, GMM doesn't have partial_fit yet, which would allow you to train the model in an iterative fashion. However, you can make use of warm_start by setting its value to True when you create the GMM object. This allows you to iterate over batches of data and continue training the model from where you left off in the last iteration.
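A minimal sketch of that warm_start loop with the current GaussianMixture class (iterate_batches() is a placeholder for however you stream chunks of the data, and the number of components is arbitrary):

from sklearn.mixture import GaussianMixture   # successor of the old GMM class

gmm = GaussianMixture(n_components=5, warm_start=True, max_iter=10)
for batch in iterate_batches():               # placeholder: yields arrays that fit in memory
    # With warm_start=True, each fit() call starts from the previous solution instead of re-initializing
    gmm.fit(batch)

# Prediction can likewise be done batch by batch: gmm.predict(batch)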
I think you can set init_params to the empty string '' when you create the GMM object; then you might be able to train on the whole dataset.
