regarding setting the global step information in mini-batch optimization

regarding setting the global step information in mini-batch optimization - python

In the MNIST example, the optimizer is setup as follows
# Optimizer: set up a variable that's incremented once per batch and
# controls the learning rate decay.
batch = tf.Variable(0, dtype=data_type())
# Decay once per epoch, using an exponential schedule starting at 0.01.
learning_rate = tf.train.exponential_decay(
0.01, # Base learning rate.
batch * BATCH_SIZE, # Current index into the dataset.
train_size, # Decay step.
0.95, # Decay rate.
staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
0.9).minimize(loss,
global_step=batch)
And in the training process,
for step in xrange(int(num_epochs * train_size) // BATCH_SIZE):
# skip some code here
sess.run(optimizer, feed_dict=feed_dict)
My question is that when defining learning_rate, they use batch * batch_sizeto define global step. However, in the training iteration, we only have variable step. How does the code connect(or pass) the step information to the global step parameter in tf.train.exponential_decay I am not very clear how does this python parameter passing mechanism work.

From the code you have linked, batch is the global step. Its value is updated by the optimizer. The learning node takes it as input.
The naming may be an issue. batch merely means the number of the current batch used for training (of size BATCH_SIZE). Perhaps a better name could have been step or even global_step.
Most of the global_step code seems to be in a single source file. It is quite short and perhaps a good way to see how the pieces work together.

Related

Is it possible to perform step according to batch size in pytorch?

I am iterating over training samples in batches, however last batch always returns fewer samples.
Is it possible to specify step size in torch according to the current batch length?
For example most batch are of size 64, last batch only 6 samples.
If I do the usual routine:
optimizer.zero_grad()
loss.backward()
optimizer.step()
It seems that the last 6 samples carry the same weight when updating the gradients as the 64 sized batches, but in fact they should only carry about 1/10 weight due to fewer samples.
In Mxnet I could specify the step size accordingly but I don't know how to do it in torch.

You can define a custom loss function and then e.g. reweight it based on batch size
def reweighted_cross_entropy(my_outputs, my_labels):
# compute batch size
my_batch_size = my_outputs.size()[0]
original_loss = nn.CrossEntropyLoss()
loss = original_loss (my_outputs, my_labels)
# reweight accordingly
return my_batch_size * loss
if you are using something like gradient descent then it is easy to see that
[1/10 * lr] grad [loss] = lr * grad [ 1/10 loss]
so reweighting the loss will be equivalent to reweighting your learning rate. This won't be exactly true for more comlpex optimisers though but can be good enough in practise.

I suggest just ignore the last batch. Pytorch Dataloader has parameter to implement that behavior:
drop_last = True #(False by default)

Updating based on two different loss functions, but with a different optimizer learning rate after each one (pytorch)?

I have a set up as follows, where I have an outer for loop iterating over epochs, and an inner for loop iterating over batches.
In the inner for loop, over batches, I'm usin a cross entropy loss, and using an Adam optimizer with a certain learning rate.
After the inner for loop (after all batches are evaluated), I'm then calculating another loss function based off of the output (a custom loss function), and optimizing.
However, I notice that when I define a different optimizer with a different learning rate, it doesn't seem to be training. When I keep the same optimizer, it seems like things change, but when I replace it, it doesn't. Example as follows:
net = <my defined model, from another function>
optimizer_1 = torch.optim.Adam(net.parameters(), lr=0.1)
optimizer_2 = torch.optim.Adam(net.parameters(), lr=0.01)
for epoch in range(num_epochs):
for data in training_data: # these are the batches
<get output here>
loss1 = <compute loss function>
optimizer_1.zero_grad()
loss1.backward()
optimizer_1.step()
loss2 = <compute a different loss function here>
optimizer_2.zero_grad() #use a second optimizer with a different learning rate
loss.backward()
loss.step()
When I do this, it seems like it doesn't actually carry through with the second optimization on the second loss function. Why is this? I want the second optimization to have a different learning rate than the first one. However, it seems like only continuing to use the first optimizer, optimizer_1, with its respective learning rate seems to work.

First you can accumulate the loss1 in the inner loop.
Next, You might want to consider merging two loss functions.
(sum(accumulated_loss1) + loss2).backward()
This ensures both are losses are considered during training and all the gradients are propagated in the backward pass

training by batches leads to more over-fitting

I'm training a sequence to sequence (seq2seq) model and I have different values to train on for the input_sequence_length.
For values 10 and 15, I get acceptable results but when I try to train with 20, I get memory errors so I switched the training to train by batches but the model over-fit and the validation loss explodes, and even with the accumulated gradient I get the same behavior, so I'm looking for hints and leads to more accurate ways to do the update.
Here is my training function (only with batch section) :
if batch_size is not None:
k=len(list(np.arange(0,(X_train_tensor_1.size()[0]//batch_size-1), batch_size )))
for epoch in range(num_epochs):
optimizer.zero_grad()
epoch_loss=0
for i in list(np.arange(0,(X_train_tensor_1.size()[0]//batch_size-1), batch_size )): # by using equidistant batch till the last one it becomes much faster than using the X.size()[0] directly
sequence = X_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, input_size).to(device)
labels = y_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, output_size).to(device)
# Forward pass
outputs = model(sequence)
loss = criterion(outputs, labels)
epoch_loss+=loss.item()
# Backward and optimize
loss.backward()
optimizer.step()
epoch_loss=epoch_loss/k
model.eval
validation_loss,_= evaluate(model,X_test_hard_tensor_1,y_test_hard_tensor_1)
model.train()
training_loss_log.append(epoch_loss)
print ('Epoch [{}/{}], Train MSELoss: {}, Validation : {} {}'.format(epoch+1, num_epochs,epoch_loss,validation_loss))
EDIT:
here are the parameters that I'm training with :
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')

Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smoothes everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() for every pass through the inner loop, then you call optimizer.step() in the outer loop before resetting. So the gradient has been accumulated, that is summed, over all examples in the batch and not just one example at a time.
Under most assumptions, that will make the batch-accumulated gradient larger than the gradient for a single example. If the gradients are all aligned, for B batches, it will be larger by B times. If the gradients are i.i.d. then it will be more like sqrt(B) times larger.
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.

TensorFlow on multiple GPU

Recently, I try to learn how to use Tensorflow on multiple GPU by reading the official tutorial. However, there is something that I am confused about. The following code is part of the official tutorial, which calculates the loss on single GPU.
def tower_loss(scope, images, labels):
# Build inference Graph.
logits = cifar10.inference(images)
# Build the portion of the Graph calculating the losses. Note that we will
# assemble the total_loss using a custom function below.
_ = cifar10.loss(logits, labels)
# Assemble all of the losses for the current tower only.
losses = tf.get_collection('losses', scope)
# Calculate the total loss for the current tower.
total_loss = tf.add_n(losses, name='total_loss')
# Attach a scalar summary to all individual losses and the total loss; do the
# same for the averaged version of the losses.
for l in losses + [total_loss]:
# Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
# session. This helps the clarity of presentation on tensorboard.
loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
tf.summary.scalar(loss_name, l)
return total_loss
The training process is as the following.
def train():
with tf.device('/cpu:0'):
# Create a variable to count the number of train() calls. This equals the
# number of batches processed * FLAGS.num_gpus.
global_step = tf.get_variable(
'global_step', [],
initializer=tf.constant_initializer(0), trainable=False)
# Calculate the learning rate schedule.
num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
FLAGS.batch_size / FLAGS.num_gpus)
decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
# Decay the learning rate exponentially based on the number of steps.
lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
global_step,
decay_steps,
cifar10.LEARNING_RATE_DECAY_FACTOR,
staircase=True)
# Create an optimizer that performs gradient descent.
opt = tf.train.GradientDescentOptimizer(lr)
# Get images and labels for CIFAR-10.
images, labels = cifar10.distorted_inputs()
batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
[images, labels], capacity=2 * FLAGS.num_gpus)
# Calculate the gradients for each model tower.
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
# Dequeues one batch for the GPU
image_batch, label_batch = batch_queue.dequeue()
# Calculate the loss for one tower of the CIFAR model. This function
# constructs the entire CIFAR model but shares the variables across
# all towers.
loss = tower_loss(scope, image_batch, label_batch)
# Reuse variables for the next tower.
tf.get_variable_scope().reuse_variables()
# Retain the summaries from the final tower.
summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the for loop about 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new batch image from batch_queue and calculate every gradient. I think this process is serialized instead of parallel. If there anything wrong with my own understanding? By the way, I can also use the iterator to feed image to my model rather than the dequeue right?
Thank you everybody!

This is a common misconception with Tensorflow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
# Dequeues one batch for the GPU
image_batch, label_batch = batch_queue.dequeue()
# Calculate the loss for one tower of the CIFAR model. This function
# constructs the entire CIFAR model but shares the variables across
# all towers.
loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
- build operations needed to dequeue a batch
- build operations needed to run the batch through the network and compute the loss
Note how via tf.get_variable_scope().reuse_variables() you're telling the graph that the variables used for the graph GPU must be shared among all (i.e., all graphs on the multiple devices "reuse" the same variables).
None of this actually runs the network once (note how there is no sess.run()): you're just giving instructions on how data must flow.
Then, when you'll start the actual training (I guess you missed that piece of the code when copying it here) each GPU will pull its own batch and produce the per-tower loss. I guess these losses are averaged somewhere in the subsequent code and the average is the loss passed to the optimizer.
Up until the point where the tower losses are averaged together, everything is independent from the other devices, so getting the batch and computing the loss can be done in parallel. Then the gradients and parameter update is done only once, variables are updated and the cycle repeats.
So, to answer your question, no, per-batch loss computation is not serialized, but since this is synchronous distributed computation you need to collect all losses from all GPUs before being allowed to continue with gradients computation and parameters update, so you still have some part of the graph that cannot be independent.

Global Step for Differential Learning Rate

Based on this question, I am trying to implement differential learning rates as follows:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
#Create Two Separate Optimizers
opt1 = tf.train.AdamOptimizer(0.00001)
opt2 = tf.train.AdamOptimizer(0.0001)
# Compute Gradients for eacch set of variables
grads1, variables1 = zip(*opt1.compute_gradients(loss, var_list1))
grads2, variables2 = zip(*opt2.compute_gradients(loss, var_list2))
# Apply Gradients
train_op1 = opt1.apply_gradients(zip(grads1, variables1))
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
train_op = tf.group(train_op1, train_op2)
I am unsure if global_step should be included in each apply_gradients call or if it should only be included in 1? My understanding is that when apply_gradients is called, global_step is incremented by 1 if it is supplied (code here). Based on this, I believe that I should only include global_step in one of my apply_gradients() calls. Can anybody confirm that this is the correct approach?
The alternative to what I have above would be to do the following:
train_op1 = opt1.apply_gradients(zip(grads1, variables1), global_step=global_step)
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
While technically each call to apply_gradients is a step, my understanding is that global_step should represent the number of mini-batches that have been completed so if I were to reference it in both apply_gradients() calls then the global step would increase twice per mini-batch. So, based onthis I believe the more accurate implementation would be the first implementation where it is called once. Would others agree this is the correct implementation? Does it matter which apply_gradients() the global_step is included in?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.