I have been working on rowanz's Grover model. I was able to train Grover's large model with a batch size of 4, but I was getting a memory-allocation error while fine-tuning the mega model. I then reduced the batch size to 1 and training is now ongoing. I also tried reducing max_seq_length to 512 with batch_size set to 4, and that worked as well.
My question is: which parameter will affect performance more, reducing batch size or reducing max_seq_length?
Also, can I set max_seq_length to a value other than a power of 2, like some value between 512 and 1024?
Which parameter will affect performance more:
reducing batch size or reducing max_seq_length?
Effects of batch size:
On performance: None. It is a common misconception that batch size affects the end metrics (e.g. accuracy) in any meaningful way. A smaller batch size only means metrics are reported over shorter intervals, which gives the illusion of much larger variability than there actually is; the effect is especially noticeable at batch size = 1, for obvious reasons. Larger batch sizes give more stable metric readings because each value is averaged over more data points. The end metrics usually come out about the same (allowing for the random initialization of weights).
On efficiency: A larger batch size means metrics are calculated less often, but it also requires more memory, since each step processes (and aggregates metrics over) more data points at once. That is exactly the issue you were facing. So batch size is more of an efficiency concern than a performance one, plus a matter of how often you want to check the model's output.
Effects of max_seq_length:
On performance: This is probably the most important parameter for the performance of language models like Grover. The reason is that the perplexity of human-written text is lower than that of randomly sampled text, and this gap widens with sequence length. Generally, the longer the sequence, the easier it is for a language model to stay consistent over the whole course of the output. So yes, it does help model performance. However, you might want to check the documentation of your particular model for "Goldilocks zones" of sequence length, and for whether powers of 2 are preferred over other values.
On efficiency: Larger sequence lengths of course require more processing power and memory, so the longer the sequences, the more resources you will need.
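For intuition, here is a rough back-of-the-envelope sketch of why sequence length tends to dominate memory: activation memory grows roughly linearly with batch size, but self-attention adds a term that is quadratic in sequence length. The hidden size, head count, and layer count below are illustrative placeholders, not Grover's actual configuration.

```python
# Back-of-the-envelope activation-memory estimate for a stack of transformer layers.
# All constants are illustrative placeholders, not Grover's actual configuration.
def approx_activation_mb(batch_size, seq_len, hidden=1024, heads=16,
                         layers=24, bytes_per_value=4):
    # Feed-forward / projection activations scale with batch_size * seq_len.
    linear_part = batch_size * seq_len * hidden * 8      # ~8 such tensors per layer
    # Attention score matrices scale with batch_size * heads * seq_len ** 2.
    attention_part = batch_size * heads * seq_len ** 2
    return layers * (linear_part + attention_part) * bytes_per_value / 2 ** 20

for bs, sl in [(4, 1024), (1, 1024), (4, 512)]:
    print(f"batch_size={bs}, seq_len={sl}: ~{approx_activation_mb(bs, sl):,.0f} MB")
```

With these toy numbers, halving seq_len frees more memory than halving batch_size, which is consistent with your observation that max_seq_length = 512 let you keep batch_size at 4.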
Also, can I set max_seq_length to a value other than a power of 2,
like some value between 512 and 1024?
Yes, why not? No model is designed to work with only a fixed set of values. Experiment with different sequence lengths and see which works best for you. Adjusting some parameters in powers of two has been a classical practice because their simple binary representations give a small computational advantage, but for today's large models that advantage is negligible.
Related
What parameters, when training a gensim FastText model, have the biggest effect on the resulting model's size in memory?
gojomo's answer to this question mentions ways to reduce a model's size during training, apart from reducing embedding dimensionality.
There seem to be a few parameters that might have an effect: thresholds for including words in the vocabulary, especially. Do the other parameters also influence model size, for example the n-gram range, and which parameters have the largest effect?
I hope this is not too lazy of a question :-)
The main parameters affecting FastText model size are:
vector_size (dimensionality) - the size of the model is overwhelmingly a series of vectors (both whole-word and n-gram) of this length. Thus, reducing vector_size has a direct, large effect on total model size.
min_count and/or max_final_vocab - by affecting how many whole words are considered 'known' (in-vocabulary) for the model, these directly influence how many bulk vectors are in the model. Especially if you have large enough training data that model size is an issue – & are using FastText – you should be considering higher values than the default min_count=5. Very-rare words with just a handful of usage examples typically don't learn good generalizable representations in word2vec-like models. (Good vectors come from many subtly-contrasting usage examples.) But because of the Zipfian distribution of word frequencies, there are typically a lot of such words in natural language data, so they do wind up taking a lot of the training time, & tug against other words' training, & push more-frequent words out of each-other's context windows. Hence this is a case where, counter to many peoples' intuition, throwing away some data (the rarest words) can often improve the final model.
bucket – which specifies exactly how many n-gram vectors will be learned by the model, because they all share a collision-oblivious hashmap. That is, no matter how many unique n-grams there really are in the training data, they'll all be forced into exactly this many vectors. (Essentially, rarer n-grams will often collide with more-frequent ones, and be just background noise.)
Notably, because of the collisions tolerated by the bucket-sized hashmap, the parameters min_n & max_n actually don't affect the model size at all. Whether they allow for lots of n-grams of many sizes, or much fewer of a single/smaller range of sizes, they'll be shoehorned into the same number of buckets. (If more n-grams are used, a larger bucket value may help reduce collisions, and with more n-grams, training time will be longer. But the model will only grow with a larger bucket, not different min_n & max_n values.)
You can get a sense of a model's RAM size by using .save() to save it to disk - the size of the multiple related files created (without compression) will roughly be of a similar magnitude as the RAM needed by the model. So, you can improve your intuition for how varying parameters changes the model size, by running varied-parameter experiments with smaller models, and watching their different .save()-sizes. (Note that you don't actually have to .train() these models - they'll take up their full allocated size once the .build_vocab() step has completed.)
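As a minimal sketch of that kind of experiment (assuming gensim 4.x; the toy corpus and the small bucket values are purely illustrative, so substitute your real sentences and realistic settings):

```python
import glob
import os
from gensim.models import FastText

# Toy corpus purely for illustration; substitute your real tokenized sentences.
sentences = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]] * 1000

def saved_size_mb(model, path):
    """Save the model and sum the sizes of every file it writes to disk."""
    model.save(path)
    return sum(os.path.getsize(f) for f in glob.glob(path + "*")) / 2 ** 20

for vector_size, min_count, bucket in [(100, 5, 200_000), (100, 5, 50_000), (50, 5, 50_000)]:
    model = FastText(vector_size=vector_size, min_count=min_count, bucket=bucket)
    model.build_vocab(corpus_iterable=sentences)   # no .train() needed to see the allocated size
    size = saved_size_mb(model, f"ft_{vector_size}_{min_count}_{bucket}.model")
    print(f"vector_size={vector_size} min_count={min_count} bucket={bucket}: ~{size:.1f} MB")
```

Even with a tiny vocabulary, the bucket-sized n-gram matrix dominates the saved size, which makes the relative impact of vector_size and bucket easy to see.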
I have 1,000,000 points of training data. What would be a good batch size to use?
I was thinking 32, but I think that would take ages. It's on CPU as well, so I don't want to use too high a batch size.
First of all, this is a hyperparameter of your training. Choosing the right batch size is non-trivial and depends on several factors.
If you can afford it, you could use a hyperparameter-optimization approach (e.g. grid search, random search, evolutionary strategies, Bayesian optimization methods such as TPE) to find the "optimal" batch size.
If you cannot afford it, I would suggest considering the insights from this paper and finding a good trade-off between your computational constraints and the smallest feasible batch size.
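If you do go the search route, even a crude sweep can be informative. A minimal sketch with Keras (the tiny synthetic dataset and model below are placeholders for your own data and architecture):

```python
import time
import numpy as np
from tensorflow import keras

# Synthetic stand-in data; replace with your real 1,000,000-point training set.
x_train = np.random.rand(10_000, 20).astype("float32")
y_train = (x_train.sum(axis=1) > 10).astype("float32")

def build_model():
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Crude sweep over batch size: compare validation accuracy against wall-clock time.
for batch_size in [32, 128, 512, 2048]:
    model = build_model()
    start = time.time()
    history = model.fit(x_train, y_train, batch_size=batch_size,
                        epochs=3, validation_split=0.2, verbose=0)
    print(f"batch_size={batch_size}: "
          f"val_acc={history.history['val_accuracy'][-1]:.3f}, "
          f"time={time.time() - start:.1f}s")
```

On CPU, the timing column is usually what settles the choice: pick the largest batch size whose per-epoch time and final accuracy you are happy with.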
I have started to dig a little deeper into TensorFlow and neural-net training with a focus on optimizing run time, and I keep bumping up against a data-ingestion problem.
Let's say I have 30,000 256x256 images for which I have created an efficient TensorFlow pipeline (including prefetching and parallel data calls) that randomly chips each image to 64x64 pixels and randomly flips the chips in both directions. So the model accepts tensors of shape (batch_size, 64, 64), and via augmentation there are 30000 * ((256/64)**2) * 4 = 1,920,000 examples at minimum. Minimum because there are at least 16 unique non-overlapping chips per image, but many more ways to randomly chip the whole image. The 4 comes from the four possible flip states: (Same, Same), (Flipped, Same), (Same, Flipped), (Flipped, Flipped).
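A simplified sketch of that kind of pipeline (the file pattern and the decode_image helper are placeholders for the actual reading/decoding step, assuming single-channel PNGs and leaving out label handling):

```python
import tensorflow as tf

def decode_image(path):
    # Placeholder for the actual reading/decoding step; assumes single-channel PNGs.
    image = tf.io.decode_png(tf.io.read_file(path), channels=1)
    return tf.cast(image, tf.float32) / 255.0

def random_chip_and_flip(image):
    chip = tf.image.random_crop(image, size=[64, 64, 1])   # random 64x64 chip
    chip = tf.image.random_flip_left_right(chip)
    chip = tf.image.random_flip_up_down(chip)
    return chip

dataset = (
    tf.data.Dataset.list_files("images/*.png", shuffle=True)
    .map(decode_image, num_parallel_calls=tf.data.AUTOTUNE)
    .map(random_chip_and_flip, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)
)
```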
I have this model distributed across several GPUs, so batch size isn't limited by available memory, just by the trade-offs with generalization and how accurate the model will be. In this scenario, I was able to get through a single epoch in ~35 seconds with a batch size of 128 (32 examples per GPU).
Now let's imagine another scenario in which I have pre-chipped the data (deterministically, i.e. every image has 16 non-overlapping chips extracted) and saved the chips locally as 64x64-pixel images. In this case I don't do any random sampling, but otherwise the input pipeline is the same. The number of individual files has gone up considerably, from 30,000 to 480,000, with the maximum number of unique examples equal to the minimum number of unique examples from the previous setup. Now, because the number of files has gone up, for the same batch size the number of training steps per epoch has gone up considerably. Even if I double the batch size, training now takes 2-3 minutes per epoch.
I am curious whether there is a consensus between these two scenarios. In scenario 1, I would imagine that I would need to train longer than in scenario 2, but could also potentially get away with an even bigger batch size, since the training data changes slightly every single epoch (so there is less worry about the model failing to generalize).
In scenario 2, I imagine that I could get away with training for fewer epochs, at the cost of each individual epoch taking longer. Since the model sees every single example each epoch, there is a real limit to how big I can make the batch size without making the model poorer at generalizing.
Is there a consensus on which scenario is best? It seems like scenario 1 is better in nearly every way, but something keeps nagging at me that I am missing something.
I'm developing a Keras NN that predicts a label using 20,000 features. I can build the network, but I have to use system RAM since the model is too large to fit on my GPU, which has meant it takes days to run the model on my machine. The input is currently (500, 20000, 1), mapped to an output of (500, 1, 1).
- I'm using 5,000 nodes in the first fully connected (Dense) layer. Is this sufficient for the number of features?
- Is there a way of reducing the dimensionality so as to run it on my GPU?
I suppose each input entry has size (20000, 1) and you have 500 entries which make up your dataset?
In that case you can start by reducing the batch_size, but I suppose you also mean that even the network weights don't fit in your GPU memory. In that case, the only thing (that I know of) that you can do is dimensionality reduction.
You have 20,000 features, but it is highly unlikely that all of them are important for the output value. With PCA (Principal Component Analysis) you can check the importance of all your features, and you will probably see that only a few of them combined carry 90% or more of the importance for the end result. In that case you can disregard the unimportant features and create a network that predicts the output based on, let's say, only 1,000 (or even fewer) features.
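A rough sketch of that check with scikit-learn (the random matrix below stands in for your real (500, 20000) feature array):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the (500, 20000) feature matrix; replace with your real data.
X = np.random.rand(500, 20000).astype("float32")

# With only 500 samples, PCA can return at most 500 components.
pca = PCA(n_components=0.90)       # keep enough components to explain 90% of the variance
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("reduced shape:", X_reduced.shape)   # (500, k) with k far below 20000
```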
An important note: the only reason I can think of where you would need that many features is if you are dealing with an image, a spectrum (you can view a spectrum as a 1D image), and so on. In that case I recommend looking into convolutional neural networks. They are not fully connected, which removes a lot of trainable parameters while probably performing even better.
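If the 20,000 features do have that kind of 1D ordering, a small Conv1D stack (a rough, untuned sketch) already has orders of magnitude fewer parameters than a single Dense(5000) layer on 20,000 inputs:

```python
from tensorflow import keras

# Rough 1D-convolutional alternative for ordered inputs of shape (20000, 1).
model = keras.Sequential([
    keras.Input(shape=(20000, 1)),
    keras.layers.Conv1D(16, kernel_size=9, strides=4, activation="relu"),
    keras.layers.MaxPooling1D(pool_size=4),
    keras.layers.Conv1D(32, kernel_size=9, strides=4, activation="relu"),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
model.summary()   # roughly 7,000 parameters vs ~100 million for Dense(5000) on 20,000 inputs
```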
I'm new to TensorFlow and I am trying to understand what the batch size should be.
The shape of my data is (119396, 12955). How can I choose the best batch_size for my data?
And how does batch_size depend on the data shape or the algorithm used?
The batch size is the number of input samples you feed into the model at once. It is very important during training, and secondary when testing. For a standard machine-learning/deep-learning algorithm, choosing a batch size will have an impact on several aspects:
The bigger the batch size, the more data you will feed at once into the model. Thus, RAM consumption will be almost linear in batch size, and there will always be a limit, based on your system specs and the size of your model, above which you will run out of memory.
The bigger the batch size, the fewer update steps per epoch, so each pass over your dataset completes faster.
A bigger batch size slows down the rate of model updates, meaning it takes more data (and therefore longer) for your model to get a single update, since each update depends on more data.
A bigger batch size averages over more data for each update of the model, hence training should be smoother: smoother training/test accuracy curves.
Note that the size of the data is only related to the batch size in the sense that the bigger each data sample is, the smaller the maximum batch size becomes (a limit set by RAM). The size of the model has a similar relation.
In practice, you should follow the rule of thumb "in powers of 2, and the larger the better, provided that the batch fits into your (GPU) memory". For more in-depth details, check https://stackoverflow.com/a/46655895/9670056.
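As a quick sanity check for your shape (119396, 12955), you can estimate the input memory per batch before picking a value (a rough sketch assuming float32 inputs):

```python
# Rough per-batch input memory for data of shape (119396, 12955), assuming float32 (4 bytes).
n_features = 12955
bytes_per_value = 4

for batch_size in [32, 64, 128, 256, 512, 1024]:
    mb = batch_size * n_features * bytes_per_value / 2 ** 20
    print(f"batch_size={batch_size}: ~{mb:.1f} MB of input per batch")
```

Here even a batch of 1024 rows is only about 50 MB of input, so the model's weights and activations, not the raw batch, usually set the real limit.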