Tensorflow: Is it possible to modify the global step in checkpoints - python

I am trying to modify the global_step in checkpoints so that I can move the training from one machine to another.
Let's say I was training for several days on machine A. Now I have bought a new machine B with a better graphics card and more GPU memory and would like to move the training from machine A to machine B.
To make the checkpoints restorable on machine B, I had specified the global_step in Saver.save on machine A, where I used a smaller batch_size and a larger sub_iterations than I plan to use on machine B.
batch_size = 10
sub_iterations = 500
for (...):
    for i in range(sub_iterations):
        batch_inputs, batch_labels = next_batch(batch_size)
        session.run(optimizer, feed_dict={inputs: batch_inputs, labels: batch_labels})
saver = tf.train.Saver()
saver.save(session, checkpoints_path, global_step)
Now I copied all the files including the checkpoints from machine A to machine B. Because machine B has more GPU memory, I can modify the batch_size to a larger value but use fewer sub_iterations.
batch_size = 100
sub_iterations = 50  # = 500 / (100/10)
for (...):
    for i in range(sub_iterations):
        batch_inputs, batch_labels = next_batch(batch_size)
        session.run(optimizer, feed_dict={inputs: batch_inputs, labels: batch_labels})
However, we cannot directly restore the copied checkpoints, because the saved global_step no longer matches machine B's setup. For example, tf.train.exponential_decay will produce an incorrect learning_rate because the number of sub_iterations is reduced on machine B.
learning_rate = tf.train.exponential_decay(..., global_step, sub_iterations, decay_rate, staircase=True)
Is it possible to modify the global_step in checkpoints? Or is there an alternative, more appropriate way to handle this situation?
Edit 1
In addition to calculating the learning_rate, I also use the global_step to work out how many iterations remain, so that training can resume where it left off.
while i < iterations:
    j = 0
    while j < sub_iterations:
        batch_inputs, batch_labels = next_batch(batch_size)
        feed_dict_train = {inputs: batch_inputs, labels: batch_labels}
        _, gstep = session.run([optimizer, global_step], feed_dict=feed_dict_train)
        if (i == 0) and (j == 0):
            # resume i and j from the global_step loaded with the checkpoint
            i, j = int(gstep / sub_iterations), numpy.mod(gstep, sub_iterations)
        j = j + 1
    i = i + 1
And we then start the iterations from the new i and j. Please feel free to comment on this, as it might not be a good approach to restoring the checkpoints and continuing training from them.
Edit 2
On machine A, let's say iterations is 10,000, sub_iterations is 500 and batch_size is 10. So the total number of samples we aim to train on is 10,000 x 500 x 10 = 50,000,000. Assume we have trained for several days and global_step has reached 501. So the total number of samples trained is 501 x 10 = 5,010, and the remaining samples, from 5,011 to 50,000,000, have not yet been trained. If we apply i, j = int(gstep / sub_iterations), numpy.mod(gstep, sub_iterations), the last trained value of i is 501 / 500 = 1 and j is 501 % 500 = 1.
Now all the files including the checkpoints have been copied to machine B. Since B has more GPU memory and can train more samples per sub-iteration, we set batch_size to 100 and adjust sub_iterations to 50, but leave iterations at 10,000. The total number of samples to train on is still 50,000,000. So the question is: how can we start training from sample 5,011 up to 50,000,000 and not train the first 5,010 samples again?
In order to resume from sample 5,011 on machine B, we should set i to 1 and j to 0, because the total number of samples counted as trained will then be (1 * 50 + 0) * 100 = 5,000, which is close to 5,010 (since the batch_size is 100 on machine B as opposed to 10 on machine A, we cannot resume at exactly 5,010; we can only choose 5,000 or 5,100).
If we do not adjust the global_step (as suggested by @coder3101) and reuse i, j = int(gstep / sub_iterations), numpy.mod(gstep, sub_iterations) on machine B, i becomes 501 / 50 = 10 and j becomes 501 % 50 = 1. So we would start training from sample 50,100 (= 501 * batch_size = 501 * 100), which is incorrect (not close to 5,010).
The formula i, j = int(gstep / sub_iterations), numpy.mod(gstep, sub_iterations) was introduced so that if we stop the training on machine A at some point, we can restore the checkpoints and continue training on machine A. However, it seems this formula is not applicable when we move the training from machine A to machine B. Therefore I was hoping to modify the global_step in the checkpoints to deal with this situation, and would like to know whether this is possible.
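To make the arithmetic above concrete, here is a tiny hypothetical helper (the function name and signature are made up for illustration) that maps machine A's global_step to the resume point on machine B:

# Hypothetical helper illustrating the mapping described above.
def resume_indices(gstep_A, batch_size_A, batch_size_B, sub_iterations_B):
    samples_trained = gstep_A * batch_size_A     # 501 * 10 = 5,010
    gstep_B = samples_trained // batch_size_B    # 5,010 // 100 = 50
    i = gstep_B // sub_iterations_B              # 50 // 50 = 1
    j = gstep_B % sub_iterations_B               # 50 % 50 = 0
    return i, j

print(resume_indices(501, 10, 100, 50))  # (1, 0), i.e. resume near sample 5,000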

Yes. It is possible.
To modify the global_step in machine B, you have to perform the following steps:
Calculate the corresponding global_step
In the above example, the global_step on machine A is 501 and the total number of trained samples is 501 x 10 = 5,010. So the corresponding global_step on machine B is 5,010 / 100 ≈ 50.
Modify the checkpoint filenames
Modify the suffix of the model_checkpoint_path and all_model_checkpoint_paths values in the checkpoint file so that they use the correct step number (e.g. 50 in the above example).
Rename the index, meta and data files of the checkpoint so that they use the correct step number.
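A minimal sketch of these two renaming steps, assuming the checkpoints live in a hypothetical checkpoint_dir and were saved with the prefix model.ckpt (adjust both to your setup); tf.train.update_checkpoint_state rewrites the bookkeeping checkpoint file:

import glob
import os
import tensorflow as tf

checkpoint_dir = '/path/to/checkpoints'  # hypothetical location
old_step, new_step = 501, 50

# Rename model.ckpt-501.index / .meta / .data-* to model.ckpt-50.*
for old_path in glob.glob(os.path.join(checkpoint_dir, 'model.ckpt-%d.*' % old_step)):
    os.rename(old_path, old_path.replace('-%d.' % old_step, '-%d.' % new_step))

# Point model_checkpoint_path (and all_model_checkpoint_paths) at the renamed files.
tf.train.update_checkpoint_state(
    checkpoint_dir,
    model_checkpoint_path=os.path.join(checkpoint_dir, 'model.ckpt-%d' % new_step))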
Override global_step after restoring checkpoint
saver.restore(session, model_checkpoint_path)
initial_global_step = global_step.assign(50)
session.run(initial_global_step)
...
# do the training
Now if you restore the checkpoints (including global_step) and override global_step, the training will use the updated global_step and adjust i and j and learning_rate correctly.

If you want to keep the decayed learning rate on machine B the same as it was on machine A, you can tweak the other parameters of tf.train.exponential_decay. For example, you can change the decay_rate on the new machine.
In order to find that decay_rate you need to know how exponential_decay is computed:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
where global_step / decay_steps is truncated to an integer when staircase=True, decay_steps is your sub_iterations, and global_step comes from machine A.
You can change the decay_rate so that the new learning rate on machine B is the same as, or close to, what you would expect from machine A at that global_step.
Alternatively, you can change the initial learning rate for machine B so as to achieve a uniform exponential decay from machine A to B.
So, in short, when moving from A to B you have changed one variable (sub_iterations) and kept global_step the same. You can adjust the other two variables of exponential_decay(...) so that the learning rate output by the function is the same as you would expect from machine A at that global_step.
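To make that concrete, here is a small hypothetical calculation (the 0.96 decay rate is made up; sub_iterations are 500 on A and 50 on B as in the question) that solves for a decay_rate on machine B reproducing machine A's staircased learning rate at global_step 501:

global_step = 501
decay_rate_A, decay_steps_A = 0.96, 500   # machine A
decay_steps_B = 50                        # machine B

# staircase=True: decayed lr = lr0 * decay_rate ** (global_step // decay_steps)
exp_A = global_step // decay_steps_A      # 1
exp_B = global_step // decay_steps_B      # 10

# want decay_rate_B ** exp_B == decay_rate_A ** exp_A
decay_rate_B = decay_rate_A ** (float(exp_A) / exp_B)
print(decay_rate_B)                       # ~0.9959

Note that this only matches the two schedules at that particular step; as training continues they drift apart again, which is why adjusting global_step itself (as in the other answer above) is usually cleaner.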

Related

training by batches leads to more over-fitting

I'm training a sequence-to-sequence (seq2seq) model and I have different values of input_sequence_length to train on.
For values 10 and 15 I get acceptable results, but when I try to train with 20 I get memory errors, so I switched to training in batches. However, the model over-fits and the validation loss explodes, and even with accumulated gradients I get the same behavior, so I'm looking for hints and leads towards more accurate ways to do the update.
Here is my training function (batch section only):
if batch_size is not None:
    k = len(list(np.arange(0, (X_train_tensor_1.size()[0] // batch_size - 1), batch_size)))
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        epoch_loss = 0
        # Using equidistant batch starts up to the last one is much faster than
        # iterating over X.size()[0] directly.
        for i in list(np.arange(0, (X_train_tensor_1.size()[0] // batch_size - 1), batch_size)):
            sequence = X_train_tensor[i:i + batch_size, :, :].reshape(-1, sequence_length, input_size).to(device)
            labels = y_train_tensor[i:i + batch_size, :, :].reshape(-1, sequence_length, output_size).to(device)
            # Forward pass
            outputs = model(sequence)
            loss = criterion(outputs, labels)
            epoch_loss += loss.item()
            # Backward and optimize
            loss.backward()
            optimizer.step()
        epoch_loss = epoch_loss / k
        model.eval()  # note: eval() needs parentheses to actually switch modes
        validation_loss, _ = evaluate(model, X_test_hard_tensor_1, y_test_hard_tensor_1)
        model.train()
        training_loss_log.append(epoch_loss)
        print('Epoch [{}/{}], Train MSELoss: {}, Validation : {}'.format(
            epoch + 1, num_epochs, epoch_loss, validation_loss))
EDIT:
here are the parameters that I'm training with :
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')
Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smoothes everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() and optimizer.step() on every pass through the inner loop, but optimizer.zero_grad() only once per epoch, in the outer loop. So by the time each step runs, the gradient has been accumulated, that is summed, over every batch since the start of the epoch, not just the current batch.
Under most assumptions, that will make the batch-accumulated gradient larger than the gradient for a single example. If the gradients are all aligned, for B batches, it will be larger by B times. If the gradients are i.i.d. then it will be more like sqrt(B) times larger.
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
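For what it's worth, here is a sketch of the usual per-batch pattern, reusing the names from the question's code: zero the gradient before every batch so each optimizer.step() acts on the current batch's gradient only (if you really want gradient accumulation over N batches, divide the loss by N and call step() and zero_grad() once every N batches instead):

for epoch in range(num_epochs):
    epoch_loss = 0
    for i in np.arange(0, X_train_tensor.size()[0] // batch_size - 1, batch_size):
        sequence = X_train_tensor[i:i + batch_size].reshape(-1, sequence_length, input_size).to(device)
        labels = y_train_tensor[i:i + batch_size].reshape(-1, sequence_length, output_size).to(device)

        optimizer.zero_grad()                # reset per batch, not per epoch
        loss = criterion(model(sequence), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()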
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.

TensorFlow on multiple GPU

Recently, I have been trying to learn how to use TensorFlow on multiple GPUs by reading the official tutorial. However, there is something that I am confused about. The following code is part of the official tutorial, and it calculates the loss on a single GPU.
def tower_loss(scope, images, labels):
    # Build inference Graph.
    logits = cifar10.inference(images)
    # Build the portion of the Graph calculating the losses. Note that we will
    # assemble the total_loss using a custom function below.
    _ = cifar10.loss(logits, labels)
    # Assemble all of the losses for the current tower only.
    losses = tf.get_collection('losses', scope)
    # Calculate the total loss for the current tower.
    total_loss = tf.add_n(losses, name='total_loss')
    # Attach a scalar summary to all individual losses and the total loss; do the
    # same for the averaged version of the losses.
    for l in losses + [total_loss]:
        # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
        # session. This helps the clarity of presentation on tensorboard.
        loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
        tf.summary.scalar(loss_name, l)
    return total_loss
The training process is as follows.
def train():
    with tf.device('/cpu:0'):
        # Create a variable to count the number of train() calls. This equals the
        # number of batches processed * FLAGS.num_gpus.
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)
        # Calculate the learning rate schedule.
        num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                                 FLAGS.batch_size / FLAGS.num_gpus)
        decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
        # Decay the learning rate exponentially based on the number of steps.
        lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                        global_step,
                                        decay_steps,
                                        cifar10.LEARNING_RATE_DECAY_FACTOR,
                                        staircase=True)
        # Create an optimizer that performs gradient descent.
        opt = tf.train.GradientDescentOptimizer(lr)
        # Get images and labels for CIFAR-10.
        images, labels = cifar10.distorted_inputs()
        batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
            [images, labels], capacity=2 * FLAGS.num_gpus)
        # Calculate the gradients for each model tower.
        tower_grads = []
        with tf.variable_scope(tf.get_variable_scope()):
            for i in xrange(FLAGS.num_gpus):
                with tf.device('/gpu:%d' % i):
                    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                        # Dequeues one batch for the GPU
                        image_batch, label_batch = batch_queue.dequeue()
                        # Calculate the loss for one tower of the CIFAR model. This function
                        # constructs the entire CIFAR model but shares the variables across
                        # all towers.
                        loss = tower_loss(scope, image_batch, label_batch)
                        # Reuse variables for the next tower.
                        tf.get_variable_scope().reuse_variables()
                        # Retain the summaries from the final tower.
                        summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the for loop 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new batch of images from batch_queue and calculate each gradient in turn, so the process looks serialized rather than parallel. Is there anything wrong with my understanding? By the way, can I also use an iterator to feed images to my model rather than the dequeue?
Thank you everybody!
This is a common misconception with Tensorflow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
    - build the operations needed to dequeue a batch
    - build the operations needed to run the batch through the network and compute the loss
Note how, via tf.get_variable_scope().reuse_variables(), you're telling the graph that the variables used by each GPU's tower must be shared among all of them (i.e., all the graphs on the multiple devices "reuse" the same variables).
None of this actually runs the network (note how there is no sess.run()): you're just giving instructions on how data must flow.
Then, when you start the actual training (I guess you missed that piece of the code when copying it here), each GPU will pull its own batch and produce its per-tower loss. I guess these losses are averaged somewhere in the subsequent code, and the average is the loss passed to the optimizer.
Up to the point where the tower losses are averaged together, everything is independent of the other devices, so fetching the batch and computing the loss can happen in parallel. After that, the gradient computation and parameter update are done only once, the variables are updated, and the cycle repeats.
So, to answer your question: no, the per-batch loss computation is not serialized. But since this is synchronous distributed computation, you need to collect all the losses from all the GPUs before being allowed to continue with the gradient computation and parameter update, so there is still a part of the graph that cannot be independent.
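For reference, the averaging the answer alludes to is done in the tutorial by a helper roughly like this (a condensed sketch of the tutorial's average_gradients, not the verbatim code):

def average_gradients(tower_grads):
    # tower_grads: one list of (gradient, variable) pairs per tower.
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars is ((grad0_gpu0, var0_gpu0), ..., (grad0_gpuN, var0_gpuN))
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        # The variables are shared across towers, so the first tower's pointer suffices.
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads

The single averaged gradient list is then passed to opt.apply_gradients(...), which is the one parameter update per step mentioned above.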

Online or batch training by default in tensorflow

I have the following question: I'm trying to learn TensorFlow and I still can't find where to set the training as online or batch. For example, if I have the following code to train a neural network:
loss_op = tf.reduce_mean(tf.pow(neural_net(X) - Y, 2))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)
sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})
If I give all the data at the same time (i.e. batch_x contains all the data), does that mean it is training as batch training? Or does the TensorFlow optimizer optimize in a different way behind the scenes? Is it wrong if I do a for loop feeding one data sample at a time? Does that count as single-step (online) training? Thank you for your help.
There are mainly 3 Types of Gradient Descent. Specifically,
Stochastic Gradient Descent
Batch Gradient Descent
Mini Batch Gradient Descent
Here is a good tutorial (https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/) on the above three methods, with their upsides and downsides.
For your question, the following is a standard sample of TensorFlow training code:
N_EPOCHS = ...    # need to define here
BATCH_SIZE = ...  # need to define here

with tf.Session() as sess:
    train_count = len(train_x)
    for i in range(1, N_EPOCHS + 1):
        for start, end in zip(range(0, train_count, BATCH_SIZE),
                              range(BATCH_SIZE, train_count + 1, BATCH_SIZE)):
            sess.run(train_op, feed_dict={X: train_x[start:end],
                                          Y: train_y[start:end]})
Here N_EPOCHS means the number of passes over the whole training dataset, and you can set BATCH_SIZE according to your gradient descent method.
For Stochastic Gradient Descent, BATCH_SIZE = 1.
For Batch Gradient Descent, BATCH_SIZE = training dataset size.
For Mini-Batch Gradient Descent, 1 < BATCH_SIZE < training dataset size.
Among the three methods, the most popular is Mini-Batch Gradient Descent. However, you need to set the BATCH_SIZE parameter according to your requirements. A good default for BATCH_SIZE might be 32.
Hope this helps.
Normally, the first dimension of the data placeholders in TensorFlow is the batch_size, and TensorFlow does not fix the training strategy by default. You can set that first dimension to determine whether training is online (first dimension is 1) or mini-batch (typically tens). For example:
self.enc_batch = tf.placeholder(tf.int32, [hps.batch_size, None], name='enc_batch')
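A related sketch (placeholder names and sizes are hypothetical): leaving the batch dimension as None lets the same graph accept either one example at a time (online) or a mini-batch:

X = tf.placeholder(tf.float32, [None, n_features], name='X')
Y = tf.placeholder(tf.float32, [None, n_outputs], name='Y')

sess.run(train_op, feed_dict={X: batch_x[:1], Y: batch_y[:1]})    # online: one example
sess.run(train_op, feed_dict={X: batch_x[:32], Y: batch_y[:32]})  # mini-batch of 32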

Why does recognition rate drop after multiple online training epochs?

I am using TensorFlow to do image recognition on the MNIST dataset. In each training epoch, I picked 10,000 random images and conducted online training with a batch size of 1. The recognition rate increased for the first few epochs; however, after several epochs it started to drop greatly (in the first 20 epochs the recognition rate climbs to ~94%; afterwards it went from 90% -> 50% -> 40% -> 30% -> 20%). What is the reason for this?
Also, with a batch size of 1, the performance is worse than with a batch size of 100 (maximum recognition rate 94% vs. 96%). I looked through several references, but there seem to be contradictory results on whether small or large batch sizes achieve better performance. Which would be the case in this situation?
Edit: I also added a figure of the recognition rate on the training and test datasets (figure: recognition rate vs. epoch).
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1

x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')

# model of neural network
def neural_network_model(data):
    hidden_1_layer = {'weights': tf.Variable(tf.random_normal([784, n_nodes_hl1]), name='l1_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]), name='l1_b')}
    hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]), name='l2_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl2]), name='l2_b')}
    hidden_3_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]), name='l3_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl3]), name='l3_b')}
    output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]), name='lo_w'),
                    'biases': tf.Variable(tf.random_normal([n_classes]), name='lo_b')}

    l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)
    l3 = tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)
    output = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
    return output

# train neural network
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epoches = 100
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epoches):
            epoch_loss = 0
            for batch in range(10000):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
            print(epoch_loss)
            print('Accuracy_test:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
            print('Accuracy_train:', accuracy.eval({x: mnist.train.images, y: mnist.train.labels}))

train_neural_network(x)
DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot": optimum performance followed by the shortest training time. In my experience, it's been best to increase the batch size until the time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) lost very little accuracy once we reached a rather minimal batch size; from there, training speed was usually a matter of choosing the batch size that best used the available RAM.
There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
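A rough sketch of that held-out-validation idea, reusing the names from the question's code (the MNIST reader already provides a mnist.validation split; the patience of 3 epochs is an arbitrary choice):

correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

best_val, bad_epochs = 0.0, 0
for epoch in range(hm_epoches):
    for _ in range(10000):
        epoch_x, epoch_y = mnist.train.next_batch(batch_size)
        sess.run(optimizer, feed_dict={x: epoch_x, y: epoch_y})
    train_acc = accuracy.eval({x: mnist.train.images, y: mnist.train.labels})
    val_acc = accuracy.eval({x: mnist.validation.images, y: mnist.validation.labels})
    print(epoch, train_acc, val_acc)
    if val_acc > best_val:
        best_val, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= 3:   # validation stopped improving: stop training
            break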
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
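If you do want such a hard limit, here is a sketch of the usual TF1 pattern applied to the question's cost (the clip_norm of 5.0 is an arbitrary value):

optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(cost)
grads, variables = zip(*grads_and_vars)
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_op = optimizer.apply_gradients(zip(clipped_grads, variables))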
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.

regarding setting the global step information in mini-batch optimization

In the MNIST example, the optimizer is set up as follows:
# Optimizer: set up a variable that's incremented once per batch and
# controls the learning rate decay.
batch = tf.Variable(0, dtype=data_type())
# Decay once per epoch, using an exponential schedule starting at 0.01.
learning_rate = tf.train.exponential_decay(
    0.01,                # Base learning rate.
    batch * BATCH_SIZE,  # Current index into the dataset.
    train_size,          # Decay step.
    0.95,                # Decay rate.
    staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
                                       0.9).minimize(loss,
                                                     global_step=batch)
And in the training process,
for step in xrange(int(num_epochs * train_size) // BATCH_SIZE):
    # skip some code here
    sess.run(optimizer, feed_dict=feed_dict)
My question is that, when defining learning_rate, they use batch * BATCH_SIZE to define the global step. However, in the training loop we only have the variable step. How does the code connect (or pass) the step information to the global_step parameter of tf.train.exponential_decay? I am not very clear on how this Python parameter-passing mechanism works.
From the code you have linked, batch is the global step. Its value is updated by the optimizer, and the learning-rate node takes it as input.
The naming may be an issue. batch merely means the number of the current batch used for training (of size BATCH_SIZE). Perhaps a better name could have been step or even global_step.
Most of the global_step code seems to be in a single source file. It is quite short and perhaps a good way to see how the pieces work together.
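A minimal, self-contained sketch (with a made-up toy loss) showing the mechanism: the optimizer itself increments the variable passed as global_step every time the training op runs, and exponential_decay reads that same variable, so nothing needs to be passed in from the Python loop:

import tensorflow as tf

w = tf.Variable(5.0)
loss = tf.square(w)
batch = tf.Variable(0, trainable=False)   # the "global step"
learning_rate = tf.train.exponential_decay(0.01, batch, 100, 0.95, staircase=True)
train_op = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(loss, global_step=batch)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(train_op)
        print(sess.run(batch))   # prints 1, 2, 3: incremented by the optimizer, not by the loop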
