Why does recognition rate drop after multiple online training epochs?


I am using TensorFlow to do image recognition on the MNIST dataset. In each training epoch, I pick 10,000 random images and train online with a batch size of 1. The recognition rate increased for the first few epochs, but after several epochs it started to drop sharply. (In the first 20 epochs, the recognition rate rises to ~94%. Afterwards, it goes 90->50->40->30->20.) What is the reason for this?
Also, with a batch size of 1, the performance is worse than with a batch size of 100 (max recognition rate 94% vs. 96%). I looked through several references, but there seem to be contradictory results on whether small or large batch sizes achieve better performance. Which is the case in this situation?
Edit: I also added a figure of the recognition rate on the training dataset and the test dataset (figure: recognition rate vs. epoch).
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')

# model of neural network
def neural_network_model(data):
    hidden_1_layer = {'weights': tf.Variable(tf.random_normal([784, n_nodes_hl1]), name='l1_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]), name='l1_b')}
    hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]), name='l2_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl2]), name='l2_b')}
    hidden_3_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]), name='l3_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl3]), name='l3_b')}
    output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]), name='lo_w'),
                    'biases': tf.Variable(tf.random_normal([n_classes]), name='lo_b')}
    l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)
    l3 = tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)
    output = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
    return output

# train neural network
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epoches = 100
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epoches):
            epoch_loss = 0
            for batch in range(10000):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
            print(epoch_loss)
            print('Accuracy_test:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
            print('Accuracy_train:', accuracy.eval({x: mnist.train.images, y: mnist.train.labels}))

train_neural_network(x)

DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot": optimum performance first, then shortest training time. In my experience, it's been best to increase the batch size until my time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size that best used available RAM.

There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
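The stopping rule described above can be sketched in a few lines of plain Python. This is only an illustration: the function name early_stopping_epoch, the patience parameter, and the accuracy history are made up for the example and are not taken from the question's code.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return (best_epoch, best_accuracy), stopping the scan once the
    validation accuracy has not improved for `patience` evaluations."""
    best_epoch = 0
    best_acc = float("-inf")
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            # New best validation score: remember where it happened.
            best_acc = acc
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` evaluations: stop here.
            break
    return best_epoch, best_acc


# Hypothetical validation accuracies, one value per evaluation.
history = [0.80, 0.88, 0.92, 0.94, 0.93, 0.90, 0.50, 0.40]
best_epoch, best_acc = early_stopping_epoch(history, patience=3)
print(best_epoch, best_acc)
```

In a real run you would save a checkpoint whenever the best score improves and restore that checkpoint after stopping.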
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
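A hard limit on the gradient magnitude is usually implemented as clipping by norm. Here is a minimal plain-Python sketch of the idea; clip_by_norm and the example numbers are illustrative assumptions, and in TensorFlow you would normally reach for the built-in tf.clip_by_global_norm instead of writing this yourself.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale `grad` (a list of floats) so its L2 norm is at most `max_norm`.
    The direction of the gradient is preserved; only its length changes."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


# One pathological example produces a huge gradient (norm 500).
huge_grad = [300.0, -400.0]
clipped = clip_by_norm(huge_grad, max_norm=5.0)
print(clipped)  # roughly [3.0, -4.0], i.e. norm 5
```

Small gradients pass through unchanged, so clipping only affects the rare updates that would otherwise wreck the weights.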
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.
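To make the difference concrete, here is a small plain-Python comparison of the two derivatives. The helper names are my own; in the question's code the actual swap would be tf.nn.relu to tf.nn.softplus.

```python
import math


def relu_grad(x):
    """Derivative of ReLU: exactly zero for every non-positive input."""
    return 1.0 if x > 0 else 0.0


def softplus_grad(x):
    """Derivative of softplus log(1 + exp(x)), which is the logistic sigmoid.
    Written in a form that does not overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-3.0, 0.0, 3.0):
    print(x, relu_grad(x), softplus_grad(x))
```

At x = -3 the ReLU passes no gradient at all, while softplus still passes a small positive one, so a unit that has drifted into the negative region can still recover.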

Related

Neural Network optimization using epoch and batch

I am trying to optimize a given neural network (e.g. a multilayer perceptron with 2 hidden layers) by finding the number of epochs and the batch size that give the highest accuracy.
for epoch from 10 to 200 (in steps of 10):
    for batch from 40 to 200 (in steps of 20):
        modele.fit(X_train, Y_train, epochs = epoch, batch_size = batch)
        I save batch, epoch, Accuracy
Afterwards I keep the smallest epoch count, with the smallest corresponding batch size, that has the highest recognition rate,
e.g. best_params: epoch = 10, batch = 150 => Accuracy = 94%
My problem is that when I re-run my model with the best_params, it doesn't give me the same results (loss, accuracy), and sometimes the accuracy is even very low (e.g. 10%).
I tried fixing the seed, but the results did not improve.
Regards
Djam75

import numpy as np
import pandas as pd

# modelSimple, X_train, Y_train, X_test, Y_test are defined earlier
df = pd.DataFrame(columns=['Nb_Batch', 'Nb_Epoch', 'Accuracy'])
i = 0
lst_loss = []
lst_accuracy = []
lst_epoch = list(np.arange(10, 200, 10))
lst_batch = list(np.arange(100, 400, 20))
for epoch in lst_epoch:
    print('---------------- Epoch ' + str(epoch) + '------------------')
    for batch in lst_batch:
        modelSimple.fit(X_train, Y_train, nb_epoch=epoch, batch_size=batch, verbose=0)
        score = modelSimple.evaluate(X_test, Y_test)
        # one row per (batch, epoch) combination
        df.loc[i, "Nb_Batch"] = batch
        df.loc[i, "Nb_Epoch"] = epoch
        df.loc[i, "Accuracy"] = score[1] * 100
        i = i + 1

This might be happening due to random parameter initialization. If you build an end-to-end model without transfer learning, the weights start from random values every time you train the architecture.
In this case, a good practice is to add batch normalization layers after some of the layers, depending on your architecture.
TensorFlow implementation
PyTorch implementation
Extra idea:
Do not use any 'for' or 'while' loops in the model implementation.
You can follow the templates in TensorFlow or PyTorch.
Or, if you build a complete model from scratch, vectorize the operations with a matrix-operation library such as NumPy.

Thanks for the update.
I resolved my problem by saving the model and loading it afterwards.
Thanks for the idea (batch normalization) and the extra idea: not using any for loops ;-)
Regards

I think you might not be updating the weight matrix after completing the training for certain batch sizes and epochs.
Please include the code as well in order to see the problem
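One thing worth checking along these lines: in the loop posted in the question, modelSimple appears to be created once and then passed to fit() for every combination. A Keras model keeps its weights between fit() calls, so each saved accuracy would then reflect all the training done for earlier combinations as well, and re-running a single "best" combination from scratch would not reproduce it. Below is a hedged sketch of a grid search that builds a fresh model per combination. build_model is a hypothetical factory that you would write yourself to return a newly compiled model; the rest mirrors the names in the question.

```python
def grid_search(build_model, X_train, Y_train, X_test, Y_test,
                epoch_values, batch_values):
    """Train a brand-new model for every (epochs, batch_size) pair so that
    results do not depend on training done for earlier pairs."""
    results = []
    for epochs in epoch_values:
        for batch_size in batch_values:
            # build_model is a hypothetical factory returning an
            # untrained, compiled model (not part of the original code).
            model = build_model()
            model.fit(X_train, Y_train, epochs=epochs,
                      batch_size=batch_size, verbose=0)
            score = model.evaluate(X_test, Y_test)
            results.append({"Nb_Batch": batch_size,
                            "Nb_Epoch": epochs,
                            "Accuracy": score[1] * 100})
    return results
```

The list of dictionaries can be turned into the same table as in the question with pd.DataFrame(results). This assumes evaluate() returns [loss, accuracy], as it does in the question's code.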

Can a trained ANN (tensorflow) model be made predictable?

I'm new to ANNs, but I've managed to train a convolutional model successfully (using some legacy TensorFlow v1 code) up to ~90% accuracy or so on my data. But when I evaluate (test) it on any given batch, the result is somewhat random, even though it's 90% correct. I've tried re-evaluating the data N times and averaging (using N between 1 and 25), but each evaluation still differs from the others on 3% to 10% of the data points.
Is there any way to make the evaluation predictable, so that the evaluation of an input batch X always yield the exact same result Y every time I run it (once training is done)?
I'm not sure if it's relevant, but my layers are batch normalized like so:
inp = tf.identity(inp)
channels = inp.get_shape()[-1]
offset = tf.compat.v1.get_variable(
    'offset', [channels],
    dtype=tf.float32,
    initializer=tf.compat.v1.zeros_initializer())
scale = tf.compat.v1.get_variable(
    'scale', [channels],
    dtype=tf.float32,
    initializer=tf.compat.v1.random_normal_initializer(1.0, 0.02))
mean, variance = tf.nn.moments(x=inp, axes=[0, 1], keepdims=False)
variance_epsilon = 1e-5
normalized = tf.nn.batch_normalization(
    inp, mean, variance, offset, scale, variance_epsilon=variance_epsilon)
The scale part is initialized with random data, but I assume that gets loaded when I do tf.compat.v1.train.Saver().restore(session, checkpoint_fname)?

I am assuming you are testing the model on your training batches?
You can't equate the accuracy of a portion of your total training dataset to the accuracy of the whole.
Think of it like a regression problem. If you only take a part of the dataset, there is no guarantee that it would average out close to the full dataset.
If you want consistent accuracy, evaluate on the full dataset.
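To illustrate the point about partial datasets, here is a small plain-Python simulation. Everything in it is made up for the example: it assumes a classifier that is right on exactly 9,000 of 10,000 examples and looks at how the measured accuracy moves around when you only score 50 examples at a time.

```python
import random

rng = random.Random(0)  # fixed seed so the simulation itself is repeatable

# Hypothetical per-example results: 1 = correct, 0 = wrong (90% correct overall).
correct = [1] * 9000 + [0] * 1000
rng.shuffle(correct)


def accuracy(flags):
    """Fraction of correct predictions in a list of 0/1 flags."""
    return sum(flags) / len(flags)


# Accuracy measured on consecutive batches of 50 examples.
batch_accs = [accuracy(correct[i:i + 50]) for i in range(0, len(correct), 50)]
full_acc = accuracy(correct)

print("full dataset:", full_acc)
print("per-batch range:", min(batch_accs), "to", max(batch_accs))
```

The full-dataset figure is the same on every run, while the per-batch figures scatter around it.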

training by batches leads to more over-fitting

I'm training a sequence to sequence (seq2seq) model and I have different values to train on for the input_sequence_length.
For values 10 and 15, I get acceptable results, but when I try to train with 20, I get memory errors, so I switched to training in batches. Now the model over-fits and the validation loss explodes, and I get the same behavior even with accumulated gradients. I'm looking for hints and leads on more accurate ways to do the update.
Here is my training function (only with batch section) :
if batch_size is not None:
    k=len(list(np.arange(0,(X_train_tensor_1.size()[0]//batch_size-1), batch_size )))
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        epoch_loss=0
        for i in list(np.arange(0,(X_train_tensor_1.size()[0]//batch_size-1), batch_size )): # by using equidistant batch till the last one it becomes much faster than using the X.size()[0] directly
            sequence = X_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, input_size).to(device)
            labels = y_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, output_size).to(device)
            # Forward pass
            outputs = model(sequence)
            loss = criterion(outputs, labels)
            epoch_loss+=loss.item()
            # Backward and optimize
            loss.backward()
        optimizer.step()
        epoch_loss=epoch_loss/k
        model.eval()
        validation_loss,_= evaluate(model,X_test_hard_tensor_1,y_test_hard_tensor_1)
        model.train()
        training_loss_log.append(epoch_loss)
        print ('Epoch [{}/{}], Train MSELoss: {}, Validation : {}'.format(epoch+1, num_epochs,epoch_loss,validation_loss))
EDIT:
here are the parameters that I'm training with :
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')

Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smooths everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() for every pass through the inner loop, then you call optimizer.step() in the outer loop before resetting. So the gradient has been accumulated, that is summed, over all examples in the batch and not just one example at a time.
Under most assumptions, that will make the batch-accumulated gradient larger than the gradient for a single example. If the gradients are all aligned, for B batches, it will be larger by B times. If the gradients are i.i.d. then it will be more like sqrt(B) times larger.
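The B versus sqrt(B) claim is easy to check numerically. The sketch below uses plain Python and synthetic gradients (random Gaussian vectors, which is an assumption for illustration and not what a real network produces) and compares the norm of the summed gradient with the norm of a typical single gradient.

```python
import math
import random


def norm(vec):
    """L2 norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def summed(grads):
    """Element-wise sum of several gradient vectors, as accumulation does."""
    return [sum(column) for column in zip(*grads)]


rng = random.Random(0)
B, dim = 64, 1000  # number of accumulated gradients, gradient dimension

# Case 1: every gradient points in the same direction.
aligned = [[1.0] * dim for _ in range(B)]
aligned_ratio = norm(summed(aligned)) / norm(aligned[0])

# Case 2: independent, identically distributed gradients.
iid = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(B)]
mean_single_norm = sum(norm(g) for g in iid) / B
iid_ratio = norm(summed(iid)) / mean_single_norm

print("aligned:", aligned_ratio, "vs B =", B)
print("i.i.d.: ", iid_ratio, "vs sqrt(B) =", math.sqrt(B))
```

With B = 64 the aligned case grows by a factor of 64 and the i.i.d. case by roughly 8, which is the range the effective learning rate moves in if nothing else is adjusted.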
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.

Training E-net on human segmentation

I am trying to train a semantic-segmentation network (E-Net) in particular for high-quality human segmentation. For that, I have collected the "Supervisely Person" data-set and extracted the annotation masks using the provided API. This data-set holds high quality masks, thus I think it will provide better results in comparison to e.g. COCO data-set.
Supervisely - Example below : original image - ground truth.
First I want to give some details of the model. The network itself (Enet_arch) returns logits from the last convolution layer and probabilities which are produced through tf.nn.sigmoid(logits,name='logits_to_softmax').
I am using sigmoid cross-entropy on the ground truth and the returned logits, with momentum and exponential decay on the learning rate. The model instance and the training pipeline are as follows.
self.global_step = tf.Variable(0, name='global_step', trainable=False)
self.momentum = tf.Variable(0.9, trainable=False)
# introducing weight decay
#with slim.arg_scope(ENet_arg_scope(weight_decay=2e-4)):
self.logits, self.probabilities = Enet_arch(inputs=self.input_data, num_classes=self.num_classes, batch_size=self.batch_size) # returns logits (2d), probabilities (2d)
#self.gt is int32 with values 0 or 1 (coming from read_tfrecords.Read_TFRecords annotation images + placeholder defined to int)
self.gt = self.input_masks
# self.probabilities is output of sigmoid, pixel-wise between probablities [0, 1].
# self.predictions is filtered probabilities > 0.5 = 1 else 0
self.predictions = tf.to_int32(self.probabilities > 0.5)
# capture segmentation accuracy
self.accuracy, self.accuracy_update = tf.metrics.accuracy(labels=self.gt, predictions=self.predictions)
# losses and updates
# calculate cross entropy loss on logits
loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=self.gt, logits=self.logits)
# add the loss to total loss and average (?)
self.total_loss = tf.losses.get_total_loss()
# decay_steps = depend on the number of epochs
self.learning_rate = tf.train.exponential_decay(self.starter_learning_rate, global_step=self.global_step, decay_steps=123893, decay_rate=0.96, staircase=True)
#Now we can define the optimizer
#optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate, epsilon=1e-8)
optimizer = tf.train.MomentumOptimizer(self.learning_rate, self.momentum)
#Create the train_op.
self.train_op = optimizer.minimize(loss, global_step=self.global_step)
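As a side note on the schedule above: with staircase=True, tf.train.exponential_decay multiplies the starting rate by decay_rate once every decay_steps steps. The plain-Python sketch below reproduces that formula to show how much decay these settings give; staircase_decay is a helper written for this illustration, and the starting rate of 1.0 is a placeholder so the printed value is just the decay factor.

```python
def staircase_decay(start_lr, global_step, decay_steps=123893, decay_rate=0.96):
    """Staircase exponential decay:
    start_lr * decay_rate ** floor(global_step / decay_steps)."""
    return start_lr * decay_rate ** (global_step // decay_steps)


# Decay factor after 743360 iterations (160 epochs of 4646 images).
factor = staircase_decay(1.0, 743360)
print(factor)  # 0.96 ** 6, about 0.78
```

So over the whole run described below, the learning rate only drops to about 78% of its starting value, which may be worth keeping in mind when looking at the loss oscillations.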
I first tried to over-fit the model on a single image to identify the depth of details that this network can capture. To increase the output quality I resized all the images to 1080p before feeding them to the network. On this trial I trained the network for 10K iterations and the total error reached ~30% (captured from tf.losses.get_total_loss() ).
The results while training on a single image are pretty good as you can see below.
Supervisely - Example below : (1) Loss (2) input (before resizing) | ground truth (before resizing) | 1080p out
Later, I tried to train on the whole data-set, but the training loss oscillates a lot. That means the network performs well on some images and poorly on others. As a result, after 743360 iterations (which is 160 epochs, since the training set holds 4646 images) I stopped training, since there is obviously something wrong with the hyper-parameters I selected.
Supervisely - Example below : (1) Loss (2) learning rate (3) input (before resizing) | ground truth (before resizing) | 1080p out
On the other hand, on some instances of the training set the network produces fair (though not very good) results, like below.
Supervisely - Example below : input (before resizing) | ground truth (before resizing) | 1080p out
Why do I have those differences on these training instances? Are there any obvious changes that I should do on the model or on the hyper-parameters? Is it possible that this model is just not suitable for this use-case (e.g. low network capacity) ?
Thanks in advance.

It turns out that the problem here is indeed the E-Net architecture. I replaced it with DeepLabV3 and saw a big difference in loss behaviour and performance, even at small resolution!

Softmax Cross Entropy loss explodes

I am creating a deep convolutional neural network for pixel-wise classification. I am using adam optimizer, softmax with cross entropy.
Github Repository
I asked a similar question found here, but the answer I was given did not solve the problem. I also have a more detailed graph of what is going wrong. Whenever I use softmax, the problem in the graph occurs. I have tried many things, such as adjusting the training and epsilon rates, trying different optimizers, etc. The loss never decreases past 500. I do not shuffle my data at the moment. Using sigmoid in place of softmax results in this problem not occurring. However, my problem has multiple classes, so the accuracy of sigmoid is not very good. It should also be mentioned that when the loss is low, my accuracy is only about 80%; I need much better than this. Why would my loss suddenly spike like this?
x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])
#Many Convolutions and Relus omitted
final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)
W_final = weight_variable([7168,7168,3])
b_final = bias_variable([7168,3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

You need label smoothing.
I just had the same problem. I was training with tf.nn.sparse_softmax_cross_entropy_with_logits which is the same as if you use tf.nn.softmax_cross_entropy_with_logits with one-hot labels. My dataset predicts the occurrence of rare events so the labels in the training set are 99% class 0 and 1% class 1. My loss would start to fall, then stagnate (but with reasonable predictions), then suddenly explode and then the predictions also went bad.
Using the tf.summary ops to log internal network state into Tensorboard, I observed that the logits were growing and growing in absolute value. Eventually at >1e8, tf.nn.softmax_cross_entropy_with_logits became numerically unstable and that's what generated those weird loss spikes.
In my opinion, the reason why this happens is with the softmax function itself, which is in line with Jai's comment that putting a sigmoid in there before the softmax will fix things. But that will quite surely also make it impossible for the softmax likelihoods to be accurate, as it limits the value range of the logits. But in doing so, it prevents the overflow.
Softmax is defined as likelihood[i] = tf.exp(logit[i]) / tf.reduce_sum(tf.exp(logit)), where the sum runs over all classes. Cross-entropy is defined as tf.reduce_sum(-label_likelihood[i] * tf.log(likelihood[i])), so if your labels are one-hot, that reduces to just the negative logarithm of your target likelihood. In practice, that means you're pushing likelihood[true_class] as close to 1.0 as you can. And due to the softmax, the only way to do that is if tf.exp(logit[!=true_class]) becomes as close to 0.0 as possible relative to tf.exp(logit[true_class]).
So in effect, you have asked the optimizer to produce tf.exp(x) == 0.0 and the only way to do that is by making x == - infinity. And that's why you get numerical instability.
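You can see this pressure toward ever larger logits with a few lines of plain Python. This is a sketch with made-up three-class logits, not the network from the question: with a one-hot label, the loss keeps shrinking as the gap between the true-class logit and the others grows, but it never reaches zero, so the optimizer is always rewarded for pushing the logits further apart.

```python
import math


def softmax(logits):
    """Softmax with the usual max-subtraction trick for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(labels, logits):
    """Cross-entropy between a label distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(l * math.log(p) for l, p in zip(labels, probs) if l > 0)


one_hot = [0.0, 0.0, 1.0]
# Loss for increasing gaps between the true-class logit and the rest.
losses = [cross_entropy(one_hot, [0.0, 0.0, gap]) for gap in (1.0, 10.0, 30.0)]
print(losses)
```

Each step up in the gap lowers the loss a little more, which is exactly the incentive that drives the logits toward the range where the computation becomes unstable.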
The solution is to "blur" the labels so instead of [0,0,1] you use [0.01,0.01,0.98]. Now the optimizer works to reach tf.exp(x) == 0.01 which results in x == -4.6 which is safely inside the numerical range where GPU calculations are accurate and reliable.
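A minimal plain-Python sketch of that blurring, assuming the common formulation in which a fraction eps of the probability mass is spread evenly over all classes (the function name and the eps value are my choices; eps = 0.03 happens to reproduce the numbers above for three classes):

```python
import math


def smooth_labels(one_hot, eps):
    """Label smoothing: keep (1 - eps) of the original label and spread
    eps evenly over all classes, so no target is exactly 0 or 1."""
    k = len(one_hot)
    return [(1.0 - eps) * v + eps / k for v in one_hot]


smoothed = smooth_labels([0.0, 0.0, 1.0], eps=0.03)
print(smoothed)        # roughly [0.01, 0.01, 0.98]
print(math.log(0.01))  # about -4.6, the finite target mentioned above
```

The smoothed vector still sums to 1, so it can be passed as labels to tf.nn.softmax_cross_entropy_with_logits in place of the one-hot labels.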

Not sure what causes it exactly. I had the same issue a few times. A few things generally help: you might reduce the learning rate, i.e. the bound of the learning rate for Adam (e.g. 1e-5 to 1e-7 or so), or try stochastic gradient descent. Adam tries to estimate learning rates, which can lead to unstable training: see "Adam optimizer goes haywire after 200k batches, training loss grows".
Once I also removed batchnorm and that actually helped, but this was for a "specially" designed network for stroke data (= point sequences), which was not very deep with Conv1d layers.
