Batch size for a neural net on a large data set? - python

I have 1,000,000 points of training data. What would be a good batch size to use?
I was thinking 32, but I think that would take ages. It's running on CPU as well, so I don't want to use too high a batch size.

First of all, this is a hyperparameter of your training. Choosing the right batch size is non-trivial and depends on several factors.
If you can afford it, you could try a hyperparameter optimization approach (e.g. grid search, random search, evolutionary strategies, or Bayesian optimization methods such as TPE) to find the "optimal" batch size.
If you cannot afford it, I would suggest considering the insights from this paper and finding a good trade-off between your computational constraints and the smallest batch size possible.
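If you do go the search route, here is a minimal sketch of a random search over batch size. `train_and_validate` is a hypothetical helper (not from any library) that you would replace with a few epochs of real training returning a validation loss:

```python
import random

def train_and_validate(batch_size):
    """Hypothetical helper: train the model for a few epochs with the
    given batch size and return the validation loss. The body below is a
    stand-in so the sketch runs; replace it with real training code."""
    return random.random()

candidates = [32, 64, 128, 256, 512, 1024]
random.shuffle(candidates)
# Random search: evaluate a handful of sampled candidates
results = {bs: train_and_validate(bs) for bs in candidates[:4]}

best = min(results, key=results.get)
print(f"best batch size: {best} (val loss {results[best]:.4f})")
```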

Related

Training & Validation loss and dataset size

I'm new to neural networks and I am doing a project that requires defining an NN and training it. I've defined an NN with 2 hidden layers of 17 nodes each; the network itself has 21 inputs and 3 outputs.
I have a dataset of 10 million samples and another of 10 million labels. My first issue is about the sizes of the validation set and the training set. I'm using PyTorch and batches, and from what I've read the batches shouldn't be too large, but I don't know approximately how big the sets should be.
I've tried with larger and smaller numbers, but I cannot find a correlation that shows me whether choosing a large or a small set for either of them is right (apart from the time required to process a very large set).
My second issue is about the training and validation loss, which I've read can tell me whether I'm overfitting or underfitting depending on which is bigger. Ideally they should have the same value, and this also depends on the epochs. But I am not able to tune the network parameters like batch size or learning rate, or to choose how much data I should use for training and validation. If I use 80% of the set (8 million), it takes hours to finish, and I'm afraid that if I choose a smaller dataset it won't learn.
If anything is badly explained, please feel free to ask me for more information. As I said, the data is given, and I only have to define the network and train it with PyTorch.
Thanks!
For your first question about batch size: there is no fixed rule for what value it should have. You have to try and see which one works best; once your NN starts performing badly, don't push the batch size further above or below that value. There is no hard rule here to follow.
For your second question: first of all, having the same training and validation loss doesn't mean your NN is performing well. It is just an indication that its performance will likely carry over to a test set, and even that depends largely on other things, like your train and test set distributions.
And with NNs you need to try as many things as you can: different parameter values, different train/validation split sizes, etc. You cannot just assume that something won't work.
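To experiment with split sizes and batch sizes concretely, here is a minimal PyTorch sketch. It assumes your samples and labels are already tensors; the 80/20 split and batch size of 256 are illustrative, not recommendations:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Illustrative stand-ins for the real data (samples with 21 features, 3 outputs)
X = torch.randn(10_000, 21)   # use your real samples here
y = torch.randn(10_000, 3)    # use your real labels here

dataset = TensorDataset(X, y)

# 80/20 train/validation split -- vary this and compare validation loss
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
val_loader = DataLoader(val_set, batch_size=256)
```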

Effect of max sequence length on Grover

I have been working on the Grover model by rowanz. I was able to train Grover's large model with a batch size of 4, but I was getting a memory-allocation error while fine-tuning the mega model. I then reduced the batch size to 1 and training is now ongoing. I also tried reducing max_seq_length to 512 with batch_size set to 4, and that worked.
My question is: which parameter affects performance more, reducing the batch size or reducing max_seq_length?
Also, can I set the value of max_seq_length to something other than a power of 2, like some value between 512 and 1024?
My question is: which parameter affects performance more,
reducing the batch size or reducing max_seq_length?
Effects of batch size:
On performance: None. It is a big misconception that batch size affects the end metrics (e.g. accuracy) in any meaningful way. A smaller batch size does mean metrics are reported at shorter intervals, which gives the illusion of much larger variability than there actually is; the effect is highly noticeable with batch size = 1, for obvious reasons. Larger batch sizes give smoother metric estimates because they are averaged over more data points. The end metrics are usually the same (up to the random initialization of the weights).
On efficiency: A larger batch size means metrics are calculated less often, but it also takes more memory, since activations and metric aggregates are held for more data points at once. That is exactly the issue you were facing. So batch size is more of an efficiency concern than a performance one, plus a question of how often you want to check the model's output.
Effects of max_seq_length:
On performance: Probably the most important parameter for the performance of language models like Grover. The reason is that the perplexity of human-written text is lower than that of randomly sampled text, and this gap increases with sequence length. Generally, the longer the sequence, the easier it is for a language model to stay consistent over the whole course of the output. So yes, it does help model performance. However, you might want to look into the documentation of your particular model for "Goldilocks zones" of sequence lengths, and for whether sequence lengths in powers of 2 are more desirable than others.
On efficiency: Larger sequence lengths of course require more processing power and memory, so the higher you go with the sequence length, the more resources you will need.
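As a rough intuition (assuming Grover's GPT-2-style full self-attention), the attention term of the memory grows quadratically with sequence length, so going from 512 to 1024 tokens roughly quadruples it. The head count and batch size below are illustrative, not Grover's exact configuration:

```python
# Back-of-envelope: each layer holds a (seq_len x seq_len) attention
# matrix per head. Numbers are illustrative, not Grover's actual config.
def attn_floats(seq_len, n_heads=16, batch_size=4):
    return batch_size * n_heads * seq_len * seq_len

for length in (512, 1024):
    print(length, f"~{attn_floats(length) / 1e6:.0f}M floats per layer")
# 512  -> ~17M floats per layer
# 1024 -> ~67M floats per layer (4x)
```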
Also, can I set the value of max_seq_length to something other than a power of 2,
like some value between 512 and 1024?
Yeah, why not? No model is designed to work only with a fixed set of values. Experiment with different sequence lengths and see which works best for you. Adjusting some parameters in powers of two has been a classical practice for gaining a small computational advantage thanks to their simple binary representations, but that advantage is negligible for large models as of today.
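For example, clipping or padding token sequences to an arbitrary length such as 768 is straightforward. The sketch below is plain Python with a placeholder pad_id, not Grover's actual preprocessing code:

```python
def pad_or_truncate(token_ids, max_seq_length=768, pad_id=0):
    """Clip sequences longer than max_seq_length and pad shorter ones.
    768 is deliberately not a power of 2; pad_id is a placeholder."""
    clipped = token_ids[:max_seq_length]
    return clipped + [pad_id] * (max_seq_length - len(clipped))

print(len(pad_or_truncate(list(range(1000)))))  # -> 768
print(len(pad_or_truncate([1, 2, 3])))          # -> 768
```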

How do you know if your dataset suffers from high-dimensionality problems?

There seem to be many techniques for reducing dimensionality (PCA, SVD, etc.) in order to escape the curse of dimensionality. But how do you know that your dataset in fact suffers from high-dimensionality problems? Is there a best practice, like visualizations, or can one even use KNN to find out?
I have a dataset with 99 features, 1 continuous label (price), and 30,000 instances.
The curse of dimensionality refers to the relation between feature dimension and data size: as the number of features grows, the amount of data needed to model the problem successfully grows exponentially with it.
The practical problem is exactly this exponential growth in the data you need, because you then have to think about how to handle it properly (storage and computation power).
So we usually experiment to figure out the right number of dimensions for the problem (for example, using cross-validation) and select only those features. Also keep in mind that using lots of features comes with a high risk of overfitting.
You can use either feature selection or feature extraction for dimensionality reduction: LASSO can be used for feature selection; PCA or LDA for feature extraction.
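As an illustration of the feature-selection route, here is a minimal scikit-learn sketch with LassoCV on a dataset shaped like yours (99 features, continuous price label). The data here is synthetic, so it only shows the mechanics:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 30,000 instances, 99 features, a continuous label
rng = np.random.default_rng(0)
X = rng.normal(size=(30_000, 99))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(scale=0.1, size=30_000)

X_scaled = StandardScaler().fit_transform(X)  # LASSO is scale-sensitive
lasso = LassoCV(cv=5).fit(X_scaled, y)

kept = np.flatnonzero(lasso.coef_ != 0)  # features with nonzero coefficients
print(f"LASSO kept {len(kept)} of {X.shape[1]} features:", kept[:10])
```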

Keras - Using large numbers of features

I'm developing a Keras NN that predicts a label using 20,000 features. I can build the network, but I have to use system RAM since the model is too large to fit on my GPU, which has meant it takes days to run the model on my machine. The input is currently (500, 20000, 1) and the output (500, 1, 1).
- I'm using 5,000 nodes in the first fully connected (Dense) layer. Is this sufficient for the number of features?
- Is there a way of reducing the dimensionality so as to run it on my GPU?
I suppose each input entry has size (20000, 1) and you have 500 entries that make up your dataset?
In that case you can start by reducing the batch_size, but I also suppose you mean that even the network weights don't fit in your GPU memory. In that case, the only thing (that I know of) you can do is dimensionality reduction.
You have 20,000 features, but it is highly unlikely that all of them are important for the output value. With PCA (Principal Component Analysis) you can check the importance of all your features, and you will probably see that only a few of them combined account for 90% or more of the variance that matters for the end result. In that case you can disregard the unimportant features and create a network that predicts the output based on, let's say, only 1,000 (or even fewer) features.
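A minimal scikit-learn sketch of that idea, assuming your 500 samples sit in a (500, 20000) array; passing n_components=0.90 asks PCA to keep just enough components to explain 90% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 20_000)  # stand-in for your (500, 20000) inputs

pca = PCA(n_components=0.90)     # keep components explaining 90% of variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)           # (500, k) with k far below 20,000
# Feed X_reduced to the Keras model; apply pca.transform() to new data too.
```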
An important note: the only reason I can think of where you would need that many features is if you are dealing with an image, a spectrum (you can see a spectrum as a 1D image), or similar. In that case I recommend looking into convolutional neural networks: they are not fully connected, which removes a lot of trainable parameters while probably performing even better.

Figuring out TensorFlow's BoostedTrees layer-by-layer approach

I've been reading the paper associated with the implementation of boosted trees in TensorFlow, where a layer-by-layer approach is mentioned:
... and novel Layer-by-Layer boosting, which allows for stronger trees
(leading to faster convergence) and deeper models.
However, nowhere in the article is this approach actually discussed.
I am pretty sure that the n_batches_per_layer parameter passed to BoostedTreesClassifier/Regressor is related to this concept.
My questions are:
What is this approach? Any source to read more about it?
What is the meaning of the n_batches_per_layer parameter?
What should I set the n_batches_per_layer parameter to follow the standard training scheme of boosted trees?
n_batches_per_layer is how many batches you want to use to train each layer (i.e. a given depth in your tree). It is basically the portion of the data used to build one layer, measured in batches. For example, if you set batch_size = len(train_set) and n_batches_per_layer = 1, then you will use the entire train set for each layer.
So I would recommend: if your dataset fits into memory, set batch_size = len(train_set) and n_batches_per_layer = 1. Otherwise, set it to int(len(train_data) / batch_size), though you could experiment with a smaller number for faster training.
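Putting those two settings together, here is a minimal sketch with the tf.estimator API, assuming a small in-memory dataset; the feature names, tree count, and depth are illustrative:

```python
import tensorflow as tf

# Illustrative feature columns; replace with your own feature names
feature_columns = [tf.feature_column.numeric_column(name)
                   for name in ("feat_a", "feat_b")]

def input_fn():
    features = {"feat_a": [0.1, 0.9, 0.4, 0.7],
                "feat_b": [1.0, 0.2, 0.5, 0.3]}
    labels = [0, 1, 0, 1]
    # Whole train set in one batch, so n_batches_per_layer=1 sees all data
    return (tf.data.Dataset.from_tensor_slices((features, labels))
            .batch(4).repeat())

est = tf.estimator.BoostedTreesClassifier(
    feature_columns=feature_columns,
    n_batches_per_layer=1,   # one full-dataset batch per tree layer
    n_trees=50,
    max_depth=4)
est.train(input_fn, max_steps=50)
```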
