training by batches leads to more over-fitting - python

I'm training a sequence to sequence (seq2seq) model and I have different values to train on for the input_sequence_length.
For values 10 and 15, I get acceptable results but when I try to train with 20, I get memory errors so I switched the training to train by batches but the model over-fit and the validation loss explodes, and even with the accumulated gradient I get the same behavior, so I'm looking for hints and leads to more accurate ways to do the update.
Here is my training function (only with batch section) :
if batch_size is not None:
k=len(list(np.arange(0,(X_train_tensor_1.size()[0]//batch_size-1), batch_size )))
for epoch in range(num_epochs):
optimizer.zero_grad()
epoch_loss=0
for i in list(np.arange(0,(X_train_tensor_1.size()[0]//batch_size-1), batch_size )): # by using equidistant batch till the last one it becomes much faster than using the X.size()[0] directly
sequence = X_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, input_size).to(device)
labels = y_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, output_size).to(device)
# Forward pass
outputs = model(sequence)
loss = criterion(outputs, labels)
epoch_loss+=loss.item()
# Backward and optimize
loss.backward()
optimizer.step()
epoch_loss=epoch_loss/k
model.eval
validation_loss,_= evaluate(model,X_test_hard_tensor_1,y_test_hard_tensor_1)
model.train()
training_loss_log.append(epoch_loss)
print ('Epoch [{}/{}], Train MSELoss: {}, Validation : {} {}'.format(epoch+1, num_epochs,epoch_loss,validation_loss))
EDIT:
here are the parameters that I'm training with :
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')

Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smoothes everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() for every pass through the inner loop, then you call optimizer.step() in the outer loop before resetting. So the gradient has been accumulated, that is summed, over all examples in the batch and not just one example at a time.
Under most assumptions, that will make the batch-accumulated gradient larger than the gradient for a single example. If the gradients are all aligned, for B batches, it will be larger by B times. If the gradients are i.i.d. then it will be more like sqrt(B) times larger.
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.

Related

Neural Network optimization using epoch and batch

I am trying to optimize a given neural network (ex Perceptron Multilayer, with 2 hidden layers), by finding the number of epoch and batch that give the highest accuracy.
for epoch from 10 to 200 (in steps of 10):
for batch from 40 to 200 (in steps of 20):
modele.fit (X_train, Y_train, epochs = epoch, batch_size = batch)
I save batch, epoch, Accuracy;
Afterwards I kept the smallest epoch with the smallest corresponding batch which has the highest recognition
ex best_params: epoch = 10, batch = 150 => Accuracy = 94%
My problem is that when I re-run my model with the best_params, it doesn't give me the same results (loss, accuracy), even sometimes very low accuracy (eg 10%).
i try to fix seed, but no best result
Regards
Djam75
df=pd.DataFrame(columns=['Nb_Batch','Nb_Epoch','Accuracy'])
i=0
lst_loss=[]
lst_accuracy=[]
lst_epoch=list(np.arange(10,200,10))
lst_batch=list(np.arange(100,400,20))
for epoch in lst_epoch:
print ('---------------- Epoch ' + str(epoch)+ '------------------')
for batch in lst_batch:
modelSimple.fit(X_train, Y_train, nb_epoch = epoch, batch_size = batch, verbose = 0)
score = modelSimple.evaluate(X_test, Y_test)
df.loc[i,"Nb_Batch"]=batch
df.loc[i,"Nb_Epoch"]=epoch
df.loc[i,"Accuracy"]=score[1]*100
i=i+1
This might be happening due to random parameter initialization. Because if you are building an end-to-end model without transfer learn the weights, every time you training architecture get random values for its parameters.
In this case, a good practice is to use batch normalization layers after some layers according to your architecture.
tensoflow-implementation
pytorch-implmentation
extra idea:
Do not use any 'for', 'while' loops in the model implementation.
you can follow templates in TensorFlow or PyTorch.
OR, if you build a complete model from scratch, vectorize operations by using NumPy like metrics operation library.
Thanks for the update.
I resolve my probelm by saving a model and load it after.
thaks for idea (batch normalization ) and extra idea : not user any for ;-)
regards
I think you might not be updating the weight matrix after completing the training for certain batch sizes and epochs.
Please include the code as well in order to see the problem

Losses keep increasing within iteration

I am just a little confused on the following:
I am training a neural network and have it print out the losses. I am training it over 4 iterations just to try it out, and use batches. I normally see loss functions as parabolas, where the losses would decrease to a minimum point before increasing again. But my losses keep increasing as the iteration progresses.
For example, let's say there are 100 batches in each iteration. In iteration 0, losses started at 26.3 (batch 0) and went up to 1500.7 (batch 100). In iteration 1, it started at 2.4e-14 and went up to 80.8.
I am following an example from spacy (https://spacy.io/usage/examples#training-ner). Should I be comparing the losses across batches instead (i.e. if I take the points from all of the batch 0s it should resemble a parabola)?
If you are using the exact same code as linked, this behaviour is to be expected.
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(
texts, # batch of texts
annotations, # batch of annotations
drop=0.5, # dropout - make it harder to memorise data
losses=losses,
)
print("Losses", losses)
An "iteration" is the outer loop: for itn in range(n_iter). And from the sample code you can also infer that losses is being reset every iteration. The nlp.update call will actually increment the appropriate loss in each call, i.e. with each batch that it processes.
So yes: the loss increases WITHIN an iteration, for each batch that you process. To check whether your model is actually learning anything, you need to check the loss across iterations, similar as how the print statement in the original snippet only prints after looping through the batches, not during.
Hope that helps!

How to compute and sum gradients over multiple mini-batches in Keras?

I want to increase the mini batch-size for my neural network during training (instead of decaying the learning rate), but the upper limit for the mini batch-size is 8, due to my GPU memory.
I found this article
https://medium.com/#davidlmorton/increasing-mini-batch-size-without-increasing-memory-6794e10db672
on how to increase the mini-batch-size without increasing the memory, and it is doing that by implementing a DataLoader in PyTorch.
The technique is simple, you just compute and sum gradients over
multiple mini-batches. Only after the specified number of mini-batches
do you update the model parameters.
count = 0
for inputs, targets in training_data_loader:
if count == 0:
optimizer.step()
optimizer.zero_grad()
count = batch_multiplier
outputs = model(inputs)
loss = loss_function(outputs, targets) / batch_multiplier
loss.backward()
count -= 1
However I couldn't find any examples for this in Keras. I assume that it would have to be done using a data_generator(Sequence), in the function __getitem__ ?
But I have no idea how to implement it, or if it is even possible in Keras. I tried looking at examples of DataGenerators as well, but none of them involved optimizers.
Would appreciate if anybody can help me out!

Set "training=False" of "tf.layers.batch_normalization" when training will get a better validation result

I use TensorFlow to train DNN. I learned that Batch Normalization is very helpful for DNN , so I used it in DNN.
I use "tf.layers.batch_normalization" and follow the instructions of the API document to build the network: when training, set its parameter "training=True", and when validate, set "training=False". And add tf.get_collection(tf.GraphKeys.UPDATE_OPS).
Here is my code:
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
input_node_num=257*7
output_node_num=257
tf_X = tf.placeholder(tf.float32,[None,input_node_num])
tf_Y = tf.placeholder(tf.float32,[None,output_node_num])
dropout_rate=tf.placeholder(tf.float32)
flag_training=tf.placeholder(tf.bool)
hid_node_num=2048
h1=tf.contrib.layers.fully_connected(tf_X, hid_node_num, activation_fn=None)
h1_2=tf.nn.relu(tf.layers.batch_normalization(h1,training=flag_training))
h1_3=tf.nn.dropout(h1_2,dropout_rate)
h2=tf.contrib.layers.fully_connected(h1_3, hid_node_num, activation_fn=None)
h2_2=tf.nn.relu(tf.layers.batch_normalization(h2,training=flag_training))
h2_3=tf.nn.dropout(h2_2,dropout_rate)
h3=tf.contrib.layers.fully_connected(h2_3, hid_node_num, activation_fn=None)
h3_2=tf.nn.relu(tf.layers.batch_normalization(h3,training=flag_training))
h3_3=tf.nn.dropout(h3_2,dropout_rate)
tf_Y_pre=tf.contrib.layers.fully_connected(h3_3, output_node_num, activation_fn=None)
loss=tf.reduce_mean(tf.square(tf_Y-tf_Y_pre))
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for i1 in range(3000*num_batch):
train_feature=... # Some processing
train_label=... # Some processing
sess.run(train_step,feed_dict={tf_X:train_feature,tf_Y:train_label,flag_training:True,dropout_rate:1}) # when train , set "training=True" , when validate ,set "training=False" , get a bad result . However when train , set "training=False" ,when validate ,set "training=False" , get a better result .
if((i1+1)%277200==0):# print validate loss every 0.1 epoch
validate_feature=... # Some processing
validate_label=... # Some processing
validate_loss = sess.run(loss,feed_dict={tf_X:validate_feature,tf_Y:validate_label,flag_training:False,dropout_rate:1})
print(validate_loss)
Is there any error in my code ?
if my code is right , I think I get a strange result:
when training, I set "training = True", when validate, set "training = False", the result is not good . I print validate loss every 0.1 epoch , the validate loss in 1st to 3st epoch is
0.929624
0.992692
0.814033
0.858562
1.042705
0.665418
0.753507
0.700503
0.508338
0.761886
0.787044
0.817034
0.726586
0.901634
0.633383
0.783920
0.528140
0.847496
0.804937
0.828761
0.802314
0.855557
0.702335
0.764318
0.776465
0.719034
0.678497
0.596230
0.739280
0.970555
However , when I change the code "sess.run(train_step,feed_dict={tf_X:train_feature,tf_Y:train_label,flag_training:True,dropout_rate:1})" , that : set "training=False" when training, set "training=False" when validate . The result is good . The validate loss in 1st epoch is
0.474313
0.391002
0.369357
0.366732
0.383477
0.346027
0.336518
0.368153
0.330749
0.322070
0.335551
Why does this result appear ? Is it necessary to set "training=True" when training, set "training=False" when validate ?
TL;DR: Use smaller than the default momentum for the normalization layers like this:
tf.layers.batch_normalization( h1, momentum = 0.9, training=flag_training )
TS;WM:
When you set training = False that means the batch normalization layer will use its internally stored average of mean and variance to normalize the batch, not the batch's own mean and variance. When training = False, those internal variables also don't get updated. Since they are initialized to mean = 0 and variance = 1 it means that batch normalization is effectively turned off - the layer subtracts zero and divides the result by 1.
So if you train with training = False and evaluate like that, that just means you're training your network without any batch normalization whatsoever. It will still yield reasonable results, because hey, there was life before batch normalization, albeit admittedly not that glamorous...
If you turn on batch normalization with training = True that will start to normalize the batches within themselves and collect a moving average of the mean and variance of each batch. Now here's the tricky part. The moving average is an exponential moving average, with a default momentum of 0.99 for tf.layers.batch_normalization(). The mean starts at 0, the variance at 1 again. But since each update is applied with a weight of ( 1 - momentum ), it will asymptotically reach the actual mean and variance in infinity. For example in 100 steps it will reach about 73.4% of the real value, because 0.99100 is 0.366. If you have numerically large values, the difference can be enormous.
So if you have a relatively small number of batches you processed, then the internally stored mean and variance can still be significantly off by the time you're running the test. Then your network is trained on properly normalized data and is tested on mis-normalized data.
In order to speed up the convergence of the internal batch normalization values, you can apply a smaller momentum, like 0.9:
tf.layers.batch_normalization( h1, momentum = 0.9, training=flag_training )
(repeat for all batch normalization layers.) Please note that there is a downside to this, however. Random fluctuations in your data will "tug" on your stored mean and variance a lot more with a small momentum like this and the resulting values (later used in inference) can be greatly influenced by where you exactly stop the training, which is clearly not optimal. It is useful to have as large a momentum as possible. Depending on the number of training steps, we generally use 0.9, 0.99, 0.999 for 100, 1,000, 10,000 training steps respectively. No point in going over 0.999.
Another important thing is proper randomization of the training data. If you're training first with let's say the smaller numeric values of your whole data set, then the normalization will converge even slower. Best to completely randomize the order of training data and making sure you use a batch size of at least 14 (rule of thumb.)
Side note: it is known that zero debiasing the values can speed up convergence significantly, and the ExponentialMovingAverage class has this feature. But the batch normalization layers don't have this feature, save for tf.slim's batch_norm, if you're willing to restructure your code for slim.
The reason that you set Training = False improves performance is that Batch normalization has four variables (beta, gamma, mean, variance). It is true that mean and variance don't get updated when Training = False. However, gamma and beta still get updated. So your model has two extra variables and thus has a better performance.
Also, I guess that your model has a relatively good performance without batch normalization.

Why does recognition rate drop after multiple online training epochs?

I am using tensorflow to do image recognition on the MNIST dataset. In each training epoch, I picked 10,000 random images and conducted online training with batch size of 1. The recognition rate increased for the first few epochs, however, after several epochs the recognition rate started to drop greatly. (In the first 20 epochs, the recognition rate goes up to ~94%. Afterwards, the recognition rate went from 90->50->40->30->20). What is the reason for this?
Also, with a batch size of 1, the performance is worse than when using a batch size of 100 (max recognition rate 94% vs. 96%). I looked through several references but there seems to be contradictory results on whether small or large batch sizes achieve better performance. What would be this case in this situation?
Edit: I also added a figure of the recognition rate of the training dataset and the test dataset.Recognition rate vs. epoch
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot = True)
#parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')
#model of neural network
def neural_network_model(data):
hidden_1_layer = {'weights':tf.Variable(tf.random_normal([784, n_nodes_hl1]) , name='l1_w'),
'biases': tf.Variable(tf.random_normal([n_nodes_hl1]) , name='l1_b')}
hidden_2_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]) , name='l2_w'),
'biases' :tf.Variable(tf.random_normal([n_nodes_hl2]) , name='l2_b')}
hidden_3_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]) , name='l3_w'),
'biases' :tf.Variable(tf.random_normal([n_nodes_hl3]) , name='l3_b')}
output_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]) , name='lo_w'),
'biases' :tf.Variable(tf.random_normal([n_classes]) , name='lo_b')}
l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['biases'])
l1 = tf.nn.relu(l1)
l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['biases'])
l2 = tf.nn.relu(l2)
l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['biases'])
l3 = tf.nn.relu(l3)
output = tf.matmul(l3,output_layer['weights']) + output_layer['biases']
return output
#train neural network
def train_neural_network(x):
prediction = neural_network_model(x)
cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
optimizer = tf.train.AdamOptimizer().minimize(cost)
hm_epoches = 100
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(hm_epoches):
epoch_loss=0
for batch in range (10000):
epoch_x, epoch_y=mnist.train.next_batch(batch_size)
_,c =sess.run([optimizer, cost], feed_dict = {x:epoch_x, y:epoch_y})
epoch_loss += c
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
print(epoch_loss)
print('Accuracy_test:', accuracy.eval({x:mnist.test.images, y:mnist.test.labels}))
print('Accuracy_train:', accuracy.eval({x:mnist.train.images, y:mnist.train.labels}))
train_neural_network(x)
DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot", optimum performance followed by shortest training time. In my experience, it's been best to increase the batch size until my time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size the best used available RAM.
There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.

Categories