I am quite new and could not find a direct answer to this question. I am wondering what default strategy TensorFlow 2.0 uses to deal with an incomplete last batch in training (e.g. the 23 leftover samples in a training set of 1023 samples with batch size 100).
I am curious because, intuitively, if the same 23 samples are always placed in the last batch of each epoch, then these 23 samples would have a disproportionate influence on the gradient descent (i.e. 1/23 each) compared to the other 1000 samples (i.e. 1/100 each). I am wondering whether TensorFlow automatically shuffles the samples every epoch.
Many thanks for your help!
Two points regarding your question:
tf.keras.Model.fit() has a keyword argument (kwarg) shuffle, which defaults to True. You can see the documentation at https://www.tensorflow.org/api_docs/python/tf/keras/Model?version=stable#fit. The examples are shuffled at the beginning of every epoch, so which examples end up in the last batch is randomized in each training epoch. In this regard, no example gets special treatment or undue influence.
The loss- and metric-calculation mechanism of the fit() method takes into account the batch sizes internally. The final loss and metric values output by the method are weighted averages across batches, with the batch size being the weight.
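As a rough illustration of both points, here is a minimal pure-Python sketch (not TensorFlow's actual implementation) using the 1023-sample, batch-size-100 example from the question. The helper names batch_indices and weighted_epoch_loss are made up for this sketch, and the per-sample losses are random placeholder numbers. It reshuffles the sample order for every epoch, so the 23-sample batch holds different samples each time, and it shows that weighting each batch mean by its batch size reproduces the plain per-sample mean.

```python
import random


def batch_indices(n_samples, batch_size, rng):
    """Shuffle the sample indices, then cut them into consecutive batches."""
    order = list(range(n_samples))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n_samples, batch_size)]


def weighted_epoch_loss(per_sample_loss, batches):
    """Combine per-batch mean losses, using each batch's size as its weight."""
    weighted_sum = 0.0
    for batch in batches:
        batch_mean = sum(per_sample_loss[i] for i in batch) / len(batch)
        weighted_sum += len(batch) * batch_mean
    return weighted_sum / sum(len(batch) for batch in batches)


rng = random.Random(0)
# Hypothetical per-sample losses, just to have numbers to average.
losses = [rng.random() for _ in range(1023)]

epoch1 = batch_indices(1023, 100, rng)
epoch2 = batch_indices(1023, 100, rng)

print(len(epoch1), len(epoch1[-1]))  # 11 batches, the last one has 23 samples
print(sorted(epoch1[-1])[:5], sorted(epoch2[-1])[:5])  # members of the last batch
print(weighted_epoch_loss(losses, epoch1), sum(losses) / len(losses))
```

An unweighted mean of the 11 batch means would instead give each sample in the short batch more influence on the reported value than a sample in a full batch.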
I'm new to neural networks and am working on a project in which I have to define a NN and train it. I've defined a NN with 2 hidden layers, each with 17 inputs and 17 outputs. The NN as a whole has 21 inputs and 3 outputs.
I have a dataset of 10 million samples and a matching set of 10 million labels. My first issue is about the size of the validation set and the training set. I'm using PyTorch and batches, and from what I've read, the batches shouldn't be too large. But I don't know approximately how large the sets should be.
I've tried larger and smaller numbers, but I cannot find a correlation that shows me whether choosing a large or a small set for either of them is right (apart from the time required to process a very large set).
My second issue is about the training and validation loss, which, from what I've read, can tell me whether I'm overfitting or underfitting depending on which one is bigger. Ideally both should have the same value, and it also depends on the number of epochs. But I am not able to tune network parameters such as batch size or learning rate, or to choose how much data to use for training and validation. If I use 80% of the set (8 million) for training, it takes hours to finish, and I'm afraid that if I choose a smaller dataset, it won't learn.
If anything is badly explained, please feel free to ask me for more information. As I said, the data is given, and I only have to define the network and train it with PyTorch.
Thanks!
For your first question, about batch size: there is no fixed rule for what value it should have. You have to try and see which one works best. Once your NN starts performing badly, don't go above or below that value for batch size. There is no hard rule here to follow.
For your second question: first of all, having the same training and validation loss doesn't mean your NN is performing well. It is only an indication that its performance will be good enough on a test set, and even that depends on many other things, such as the distributions of your train and test sets.
And with NNs you need to try as many things as you can: different parameter values, different train and validation split sizes, etc. You cannot just assume that something won't work.
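One cheap way to act on this advice is to experiment on a random subset first and only scale up once a configuration looks promising. A minimal sketch, assuming the data can be addressed by index; split_indices and its parameters are hypothetical names, not part of PyTorch, and the sizes below are arbitrary examples:

```python
import random


def split_indices(n_samples, val_fraction, subset_size=None, seed=0):
    """Return (train_idx, val_idx) drawn from a shuffled index range.

    subset_size optionally limits the experiment to a random subset, so a
    trial run does not have to touch every sample.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    if subset_size is not None:
        indices = indices[:subset_size]
    n_val = round(len(indices) * val_fraction)
    return indices[n_val:], indices[:n_val]


# Try a few split sizes on a 100,000-sample subset of a 1,000,000-sample set.
for val_fraction in (0.1, 0.2, 0.3):
    train_idx, val_idx = split_indices(1_000_000, val_fraction, subset_size=100_000)
    print(val_fraction, len(train_idx), len(val_idx))
```

The index lists could then be used to select rows for a training run and a validation run, so that each candidate setting is compared on the same split.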
I have been working with rowanz's Grover model. I was able to train Grover's large model with a batch size of 4, but got a memory allocation error while fine-tuning the mega model. I then reduced the batch size to 1, and training is now ongoing. I also tried reducing max_seq_length to 512 with batch_size set to 4, and that worked.
My question is: which will affect performance more, reducing the batch size or reducing max_seq_length?
Also, can I set max_seq_length to a value other than a power of 2, such as some value between 512 and 1024?
My question is: which parameter will affect performance more,
reducing batch size or reducing max_seq_length?
Effects of batch size:
On performance: None. It is a big misconception that batch size in any way affects the end metrics (e.g. accuracy). A smaller batch size does mean that metrics are reported over shorter intervals, which gives the illusion of much larger variability than there actually is. The effect is most noticeable with batch size = 1, for obvious reasons. Larger batch sizes tend to report more reliable metric values, since they are calculated over several data points. End metrics are usually the same (allowing for random initialization of weights).
On efficiency: Larger batch sizes mean that metrics are calculated less often, but they also need more memory, because everything is aggregated over all the data points in a batch. That is the issue you were facing. So batch size is more of an efficiency concern than a performance one. It also determines how often you get to check the model's output.
Effects of max_seq_length:
On performance: Probably the most important parameter for the performance of language-based models like Grover. The reason is that the perplexity of human-written text is lower than that of randomly sampled text, and this gap increases with sequence length. Generally, the longer the sequence length, the easier it is for a language model to stay consistent over the whole course of the output. So yes, it does help model performance. However, you might want to look into the documentation for your particular model for “Goldilocks zones” of sequence lengths and for whether sequence lengths that are powers of 2 are more desirable than others.
On efficiency: Longer sequences of course require more processing power and memory, so the higher you go with the sequence length, the more compute you will need.
Also can I set the value of max_seq_length other than the power of 2
like some value between 512 and 1024?
Yeah, why not? No model is designed to work with only a fixed set of values. Experiment with different sequence lengths and see which works best for you. Setting some parameters to powers of two is a classical practice that gives a small computational advantage because of their simple binary representations, but the advantage is negligible for today's large models.
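To put rough numbers on the two configurations from the question, here is a back-of-the-envelope sketch. It assumes a standard transformer with full self-attention, where each attention head stores one score per pair of positions, so that part of the memory grows with batch_size * max_seq_length^2. It also assumes the batch-size-1 run used a max_seq_length of 1024, which the question does not state. The function names are made up, and real memory use also depends on model size, optimizer state and implementation details.

```python
def tokens_per_step(batch_size, max_seq_length):
    """Number of tokens the model sees in one training step."""
    return batch_size * max_seq_length


def attention_scores_per_head(batch_size, max_seq_length):
    """Entries in one head's attention score matrix, summed over the batch."""
    return batch_size * max_seq_length ** 2


# (batch_size, max_seq_length); 768 is an example of a non-power-of-2 length.
for batch_size, max_seq_length in [(4, 512), (1, 1024), (2, 768)]:
    print(
        batch_size,
        max_seq_length,
        tokens_per_step(batch_size, max_seq_length),
        attention_scores_per_head(batch_size, max_seq_length),
    )
```

Under these assumptions, 4 x 512 and 1 x 1024 need the same number of attention scores per head, but the first configuration processes twice as many tokens per step.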
I have an imbalanced two-class (binary) dataset where the labels are either 0 or 1 and the prediction output is between 0 and 1. The positive class has 10000 samples, while the negative class has 90000 samples. I'm using a batch size of 100 when training.
When calculating the binary cross-entropy loss (BCELoss in PyTorch), it's possible to supply a weight for each batch element.
My question is:
To calculate the class weight, does it make more sense to calculate it once at the start (so 1/(10000/100000) for the positive class) and scale the loss of each sample with this value,
or:
to calculate the weight at the batch level, by first finding the class imbalance within the batch (e.g. the batch might contain 25 positives and 75 negatives, hence 1/(25/(25+75)) for the positive class)?
I'm asking this because the loss is averaged across the batch.
If you wish to do it this way, you should calculate per batch class imbalance.
On the other hand, you should probably make sure that each batch preserves the label statistics (e.g. for a batch size of 64 in your case, you should have 6 positive samples and the rest negative). This way, it would be enough to calculate the class imbalance once and pass it to torch.nn.BCELoss for every batch.
I would suggest a different approach though, e.g. oversampling or undersampling using PyTorch's Sampler class (don't do it by copying examples, which wastes space unnecessarily). You can implement it manually or use a third-party library that does it for you, for example torchdata (disclosure: I'm the author) and its torchdata.samplers.RandomOverSampler.
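The two weighting schemes from the question can be compared with a few lines of plain Python. This is a sketch, not PyTorch code: inverse_frequency_weight and per_sample_weights are hypothetical helpers, and the resulting list is what one would convert to a tensor and pass as the weight argument of torch.nn.BCELoss.

```python
def inverse_frequency_weight(n_class, n_total):
    """Weight for one class: n_total / n_class, i.e. 1 / (n_class / n_total)."""
    return n_total / n_class


def per_sample_weights(labels, pos_weight, neg_weight):
    """Map each 0/1 label to the weight of its class."""
    return [pos_weight if label == 1 else neg_weight for label in labels]


# Option 1: computed once from the whole dataset (10000 positives, 90000 negatives).
global_pos = inverse_frequency_weight(10_000, 100_000)
global_neg = inverse_frequency_weight(90_000, 100_000)

# Option 2: computed from the batch at hand (here 25 positives, 75 negatives).
batch_labels = [1] * 25 + [0] * 75
n_pos = sum(batch_labels)
batch_pos = inverse_frequency_weight(n_pos, len(batch_labels))
batch_neg = inverse_frequency_weight(len(batch_labels) - n_pos, len(batch_labels))

print(global_pos, batch_pos)  # 10.0 4.0
weights = per_sample_weights(batch_labels, batch_pos, batch_neg)
```

Note that the per-batch variant divides by zero when a batch happens to contain no samples of one class, which is one more reason to keep the label statistics of each batch fixed.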
I have been following the TF 2.0 tutorial for convolutional VAEs, located here.
Since it is eager, the gradients are computed by hand and then applied manually, using tf.GradientTape().
for epoch in epochs:
    for x in x_train:
        with tf.GradientTape() as tape:
            loss = compute_loss(model, x)
        apply_gradients(tape.gradient(loss, model.trainable_variables))
The problem with that code is that it is pretty slow, taking around 40-50 seconds per epoch.
If I increase the batch size by a lot (to around 2048), then it ends up taking around 8 seconds per epoch, but the model's performance decreases by quite a lot.
On the other hand, if I do a more traditional model (i.e., that uses the lazy graph-based model instead of eagerness), such as the one here, then it takes 8 seconds per epoch even with a small batch size.
model.add_loss(lazy_graph_loss)
model.fit(x_train, epochs=epochs)
Based on this information, my guess would be that the problem with the TF2.0 code is the manual computation of losses and gradients.
Is there any way to speed up the TF2.0 code so that it comes closer to the normal code?
I found the solution: TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code.
The usage is pretty straightforward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with @tf.function.
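As a sketch of what that looks like, the example below wraps one training step in @tf.function. The toy linear model, the learning rate and the names w, compute_loss and train_step are assumptions made for illustration; they stand in for the VAE, compute_loss and apply_gradients from the question.

```python
import tensorflow as tf

# Toy stand-in for the model: a single weight matrix (assumption for this sketch).
w = tf.Variable(tf.zeros([3, 1]))
LEARNING_RATE = 0.1


def compute_loss(x, y):
    """Mean squared error of a linear model."""
    return tf.reduce_mean(tf.square(tf.matmul(x, w) - y))


@tf.function  # traced into a graph on the first call, reused afterwards
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = compute_loss(x, y)
    grad = tape.gradient(loss, w)
    w.assign_sub(LEARNING_RATE * grad)  # plain gradient descent update
    return loss


# Synthetic data whose targets come from known weights.
x_train = tf.random.normal([256, 3], seed=0)
y_train = tf.matmul(x_train, tf.constant([[1.0], [-2.0], [0.5]]))
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)

for epoch in range(5):
    for x, y in dataset:
        loss = train_step(x, y)
    print(epoch, float(loss))
```

Here only train_step carries the decorator: a plain Python function such as compute_loss is traced as part of the graph when it is called from inside a decorated function.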
The main reason your code takes so long is that the loop is a normal Python for loop that does not iterate over a tensor. Intuitively we might think that the same training graph is reused at every iteration, but the graph that actually gets constructed is a chain in which each node is a copy of the training subgraph, and the number of nodes in the chain equals the number of iterations of the loop. In a nutshell, TensorFlow unrolls the iterations and then constructs the graph, so building the graph takes a tremendous amount of time and the graph alone carries a lot of redundancy in both space and time. It is bad enough that just adding two tensors repeatedly in a normal Python loop, a billion times or so, takes almost half an hour.
To get around this problem in your specific case, we can use the .repeat transformation of the tf.data.Dataset API. Instead of writing
for i in range(epochs):
we can write
for x in x_train.repeat(epochs):
    # train here
I am trying to train neural networks using TensorFlow 1.12.0 and Keras API.
The setup is as follows. I have a large number of data points: each point consists of a context (call it 24 floats) and a label (1 float). The total amount of data is O(10^7) points. Test networks vary, but a fairly simple one might look like so
v=[keras.layers.Input(shape=input_shape)]
v.append(keras.layers.Dense(4,use_bias=False,activation=tf.nn.tanh)(v[-1]))
v.append(keras.layers.Dense(1,use_bias=False)(v[-1]))
model=keras.models.Model(inputs=v[0],outputs=v[-1])
optimizer = tf.train.RMSPropOptimizer(0.001)
model.compile(loss='mse', optimizer=optimizer, metrics=['mae'])
(...)
history=model.fit(x=train_data,y=train_labels,epochs=EPOCHS,verbose=1,batch_size=10000, shuffle=True)
I've been getting decent results keeping the data in numpy arrays and passing them to model.fit() as-is. However, I am not happy with the performance. It appears to bottleneck in Python code (I even tried to profile it and the Python method slice_arrays in python/keras/utils/generic_utils.py comes up as the primary bottleneck, with up to half of all time spent there.) The GPU (GeForce 1080 Ti) is in use, but its utilization (as reported by nvidia-smi) is rarely above 10-15%.
Looking for a way to speed things up, I tried to convert the data into tensors:
features=tf.convert_to_tensor(train_data)
labels=tf.convert_to_tensor(train_labels)
history=model.fit(x=features,y=labels,epochs=EPOCHS,verbose=1,steps_per_epoch=20, shuffle=True)
model.fit requires an argument batch_size when passed numpy arrays, but steps_per_epoch when passed tensors. Documentation isn't clear, but it seems that I should set steps_per_epoch to the number of data points divided by batch_size. This way I get comparable convergence rates. Whatever the value, GPU utilization is 100%, which is good.
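The relationship assumed in the previous paragraph is simple arithmetic. A small sketch, where steps_per_epoch is a helper written for this illustration and the dataset size of 10 million is an assumed round figure for the O(10^7) points mentioned earlier. Rounding up covers a possibly incomplete last batch. This only illustrates the arithmetic; it does not show how model.fit treats tensor inputs.

```python
def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (rounded up)."""
    return (n_samples + batch_size - 1) // batch_size


n_samples = 10_000_000  # assumed round figure
for batch_size in (10_000, 100_000, 1_000_000):
    print(batch_size, steps_per_epoch(n_samples, batch_size))
```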
But now there's a problem. With numpy arrays, run time per epoch is relatively independent of the batch size. I see 8 seconds per epoch at batch size 10k, 5 seconds at batch size 100k, and 7 seconds per epoch at batch size 1M. Convergence is generally way better if the batch size is small. So, I'd typically start with batch size 100k or less. On the other hand, with tensor input, run time per epoch goes up exponentially with steps_per_epoch. With batch size 1M (10-20 steps per epoch), I'm at 2 s / epoch, but the convergence rate is abysmal. With batch size 10k, the convergence rate is good, but the time is up to 30 s / epoch (actually slower than with numpy, despite 100% GPU utilization.)
Basically, the only scenario where tensor input actually ends up faster is the one where I don't care to use it in the first place.
What is going on and is there any way to get around this?