Based on this question, I am trying to implement differential learning rates as follows:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
# Create two separate optimizers
opt1 = tf.train.AdamOptimizer(0.00001)
opt2 = tf.train.AdamOptimizer(0.0001)
# Compute gradients for each set of variables
grads1, variables1 = zip(*opt1.compute_gradients(loss, var_list1))
grads2, variables2 = zip(*opt2.compute_gradients(loss, var_list2))
# Apply Gradients
train_op1 = opt1.apply_gradients(zip(grads1, variables1))
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
train_op = tf.group(train_op1, train_op2)
I am unsure whether global_step should be included in each apply_gradients call or only in one of them. My understanding is that when apply_gradients is called, global_step is incremented by 1 if it is supplied (code here). Based on this, I believe I should include global_step in only one of my apply_gradients() calls. Can anybody confirm that this is the correct approach?
The alternative to what I have above would be to do the following:
train_op1 = opt1.apply_gradients(zip(grads1, variables1), global_step=global_step)
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
While technically each call to apply_gradients is a step, my understanding is that global_step should represent the number of mini-batches that have been completed, so if I referenced it in both apply_gradients() calls, the global step would increase twice per mini-batch. Based on this, I believe the first implementation, where it is passed only once, is the more accurate one. Would others agree, and does it matter which apply_gradients() call the global_step is included in?
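To check this, one could run the grouped op and watch the counter. A minimal sketch, assuming the graph built above plus a standard TF1 session:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(train_op)            # runs both apply_gradients ops
        print(sess.run(global_step))  # expect 1, 2, 3: once per mini-batch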
Following the Progressive GANs paper (https://arxiv.org/abs/1710.10196), I am implementing a keras.Model that needs to grow in size (layers). I first initialize the full model, but at inference time I use only part of the model with the same trainable_variables, e.g. the 4x4 stage, then the 8x8 stage. As a result, the trainable_variables passed to train_step, which is decorated with tf.function, differ between calls. This works properly for computing gradients etc., but not for optimizer.apply_gradients.
The code looks something like this:
strategy = tf.distribute.MirroredStrategy()
G = Generator()
with strategy.scope():
    Opt = keras.optimizers.Adam()
    G.initialize_model()  # initialize the full model

@tf.function
def train_step(optimizer, model, var_to_train):
    with tf.GradientTape() as tape:
        Loss = loss(model(datasets))
    grads = tape.gradient(Loss, var_to_train)
    optimizer.apply_gradients(zip(grads, var_to_train))  # this will raise ValueError
res = 4  # resolution of 4x4
for ep in range(epochs):
    if ep % 100 == 0:
        res = res * 2
    cur_model = G.forward(output_shape=(res, res, 3))  # output for the given image resolution
    var = cur_model.trainable_variables  # this list grows as the model grows
    strategy.run(train_step, args=(Opt, cur_model, var))
Note, however, that this works fine when train_step is not used in a tf.function context or under MirroredStrategy. The last section of the tf.function guide [1] does not seem to solve the problem. I tried tf.distribute.ReplicaContext.all_reduce and equivalent methods for obtaining local results from all replicas, but they won't work, since the trainable_variables are created inside strategy.scope(), so every update must happen in the replica context.
The only naive solution I can think of is to train, for example, the 4x4 model and save it, then use transfer learning to load it back into the 8x8 model.
I want to use a usual Keras optimizer that supports any dynamic set of trainable_variables passed through a tf.function context.
[1]: https://www.tensorflow.org/guide/function#creating_tfvariables:~:text=shape%3D()%2C%20dtype%3Dfloat32)
First call optimizer._create_all_weights(var), where the argument var covers the full model's variables. This makes the optimizer create all of its variables, and it must be done at the beginning, before any updates are applied. During subsequent gradient updates it won't create them again, so you can still pass in a subset. This works in the context of tf.function too.
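A minimal sketch of this fix, reusing the question's names; final_res is a hypothetical final resolution, and _create_all_weights is a private Keras optimizer method, so this leans on implementation details that may change between versions:

with strategy.scope():
    Opt = keras.optimizers.Adam()
    G.initialize_model()  # initialize the full model
    # Build the full-resolution model once so its complete variable list exists.
    full_model = G.forward(output_shape=(final_res, final_res, 3))
    # Force the optimizer to create its slot variables (Adam's m and v)
    # for every trainable variable up front, before any updates:
    Opt._create_all_weights(full_model.trainable_variables)

Later apply_gradients calls that pass a subset of these variables no longer try to create optimizer state, so they work inside tf.function and under MirroredStrategy.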
I have a setup as follows, with an outer for loop iterating over epochs and an inner for loop iterating over batches.
In the inner loop, over batches, I'm using a cross-entropy loss and an Adam optimizer with a certain learning rate.
After the inner loop (once all batches are evaluated), I then compute another, custom loss function based on the output, and optimize.
However, I notice that when I define a different optimizer with a different learning rate, it doesn't seem to train. When I keep the same optimizer, things seem to change, but when I replace it, they don't. Example as follows:
net = <my defined model, from another function>
optimizer_1 = torch.optim.Adam(net.parameters(), lr=0.1)
optimizer_2 = torch.optim.Adam(net.parameters(), lr=0.01)

for epoch in range(num_epochs):
    for data in training_data:  # these are the batches
        <get output here>
        loss1 = <compute loss function>
        optimizer_1.zero_grad()
        loss1.backward()
        optimizer_1.step()

    loss2 = <compute a different loss function here>
    optimizer_2.zero_grad()  # use a second optimizer with a different learning rate
    loss2.backward()
    optimizer_2.step()
When I do this, it seems like the second optimization on the second loss function doesn't actually get carried out. Why is this? I want the second optimization to have a different learning rate from the first one, but only continuing to use the first optimizer, optimizer_1, with its respective learning rate seems to work.
First, you can accumulate loss1 in the inner loop.
Next, you might want to consider merging the two loss functions:
(sum(accumulated_loss1) + loss2).backward()
This ensures both losses are considered during training and all the gradients are propagated in the backward pass.
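A minimal sketch of that suggestion, reusing the question's names; compute_loss1 and compute_loss2 are hypothetical placeholders for the elided loss code:

optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

for epoch in range(num_epochs):
    accumulated_loss1 = []
    for data in training_data:
        output = net(data)
        # store the loss tensors (not .item()) so the autograd graph stays alive
        accumulated_loss1.append(compute_loss1(output))

    loss2 = compute_loss2(output)  # hypothetical custom loss on the output
    optimizer.zero_grad()
    (sum(accumulated_loss1) + loss2).backward()  # one backward pass for both
    optimizer.step()

Note that keeping the unreduced loss tensors holds every batch's autograd graph in memory until the single backward() call, so this trades memory for the combined update.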
Recently, I have been trying to learn how to use TensorFlow on multiple GPUs by reading the official tutorial. However, there is something I am confused about. The following code is part of the official tutorial; it calculates the loss on a single GPU.
def tower_loss(scope, images, labels):
  # Build inference Graph.
  logits = cifar10.inference(images)

  # Build the portion of the Graph calculating the losses. Note that we will
  # assemble the total_loss using a custom function below.
  _ = cifar10.loss(logits, labels)

  # Assemble all of the losses for the current tower only.
  losses = tf.get_collection('losses', scope)

  # Calculate the total loss for the current tower.
  total_loss = tf.add_n(losses, name='total_loss')

  # Attach a scalar summary to all individual losses and the total loss; do the
  # same for the averaged version of the losses.
  for l in losses + [total_loss]:
    # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
    # session. This helps the clarity of presentation on tensorboard.
    loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
    tf.summary.scalar(loss_name, l)

  return total_loss
The training process is as follows:
def train():
  with tf.device('/cpu:0'):
    # Create a variable to count the number of train() calls. This equals the
    # number of batches processed * FLAGS.num_gpus.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)

    # Calculate the learning rate schedule.
    num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                             FLAGS.batch_size / FLAGS.num_gpus)
    decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)

    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                    global_step,
                                    decay_steps,
                                    cifar10.LEARNING_RATE_DECAY_FACTOR,
                                    staircase=True)

    # Create an optimizer that performs gradient descent.
    opt = tf.train.GradientDescentOptimizer(lr)

    # Get images and labels for CIFAR-10.
    images, labels = cifar10.distorted_inputs()
    batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
        [images, labels], capacity=2 * FLAGS.num_gpus)

    # Calculate the gradients for each model tower.
    tower_grads = []
    with tf.variable_scope(tf.get_variable_scope()):
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
          with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)

            # Reuse variables for the next tower.
            tf.get_variable_scope().reuse_variables()

            # Retain the summaries from the final tower.
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the loop 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new image batch from batch_queue and calculate every gradient in turn, so I think this process is serialized rather than parallel. Is there anything wrong with my understanding? By the way, could I also use an iterator to feed images to my model rather than dequeuing?
Thank you everybody!
This is a common misconception with Tensorflow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
      # Dequeues one batch for the GPU
      image_batch, label_batch = batch_queue.dequeue()
      # Calculate the loss for one tower of the CIFAR model. This function
      # constructs the entire CIFAR model but shares the variables across
      # all towers.
      loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
- build operations needed to dequeue a batch
- build operations needed to run the batch through the network and compute the loss
Note how, via tf.get_variable_scope().reuse_variables(), you're telling the graph that the variables used by each GPU's tower must be shared among all of them (i.e., all the graphs on the multiple devices "reuse" the same variables).
None of this actually runs the network even once (note how there is no sess.run()): you're just describing how data must flow.
Then, when you start the actual training (I guess you missed that piece of the code when copying it here), each GPU will pull its own batch and produce its per-tower loss. These losses are presumably averaged somewhere in the subsequent code, and the average is what is passed to the optimizer.
Up until the point where the tower losses are averaged together, everything is independent of the other devices, so fetching the batch and computing the loss can happen in parallel. After that, the gradient computation and parameter update are done only once, the variables are updated, and the cycle repeats.
So, to answer your question: no, the per-batch loss computation is not serialized. But since this is synchronous distributed computation, you need to collect all the losses from all GPUs before being allowed to continue with the gradient computation and parameter update, so some part of the graph still cannot be independent.
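For reference, here is a hedged sketch of that aggregation step in the tutorial's style; in this pattern it is usually the per-tower gradients collected in tower_grads that get averaged before a single apply_gradients (the name average_gradients is an assumption, not necessarily the tutorial's exact code):

def average_gradients(tower_grads):
  """tower_grads: a list (one entry per tower) of lists of
  (gradient, variable) pairs, as returned by opt.compute_gradients()."""
  average_grads = []
  for grad_and_vars in zip(*tower_grads):
    # All towers share the variable, so average its gradient across towers.
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
    grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
    average_grads.append((grad, grad_and_vars[0][1]))
  return average_grads

# Back outside the per-GPU loop: one update to the shared variables.
grads = average_gradients(tower_grads)
apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)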
I am trying to implement a crude method based on the Mixture-of-Experts paper in TensorFlow: https://arxiv.org/abs/1701.06538
There would be n models defined:
model_1:
    var_11
    var_12
    loss_1
    optimizer_1
model_2:
    var_21
    var_22
    loss_2
    optimizer_2
model_3:
    var_31
    var_32
    loss_3
    optimizer_3
At every iteration, I want to train only the model with the least loss, keeping the other variables constant. Is it possible to place a switch that executes only one of the optimizers?
P.S.: The basis of this problem is similar to one I asked previously: http://stackoverflow.com/questions/42073239/tf-get-collection-to-extract-variables-of-one-scope/42074009?noredirect=1#comment71359330_42074009
Since the suggestion there did not work, I am trying to approach the problem differently.
Thanks in advance!
This seems to be doable with tf.cond:
import tensorflow as tf

def make_conditional_train_op(
    should_update, optimizers, variable_lists, losses):
  """Conditionally trains variables.

  Each argument is a Python list of Tensors, and each list must have the same
  length. Variables are updated based on their optimizer only if the
  corresponding `should_update` boolean Tensor is True at a given step.

  Returns a single train op which performs the conditional updates.
  """
  assert len(optimizers) == len(variable_lists)
  assert len(variable_lists) == len(losses)
  assert len(should_update) == len(variable_lists)
  conditional_updates = []
  for model_number, (update_boolean, optimizer, variables, loss) in enumerate(
      zip(should_update, optimizers, variable_lists, losses)):
    conditional_updates.append(
        tf.cond(update_boolean,
                lambda: tf.group(
                    optimizer.minimize(loss, var_list=variables),
                    tf.Print(0, ["Model {} updating".format(model_number),
                                 loss])),
                lambda: tf.no_op()))
  return tf.group(*conditional_updates)
The basic strategy is to make sure the optimizer's variable updates are defined in the lambda of one of the cond branches, in which case there is true conditional op execution: the assignments to variables (and to the optimizer's accumulators) only happen if that branch of the cond is triggered.
As an example, we can construct some models:
def make_model_and_optimizer():
  scalar_variable = tf.get_variable("scalar", shape=[])
  vector_variable = tf.get_variable("vector", shape=[3])
  loss = tf.reduce_sum(scalar_variable * vector_variable)
  optimizer = tf.train.AdamOptimizer(0.1)
  return optimizer, [scalar_variable, vector_variable], loss

# Construct each model
optimizers = []
variable_lists = []
losses = []
for i in range(10):
  with tf.variable_scope("model_{}".format(i)):
    optimizer, variables, loss = make_model_and_optimizer()
    optimizers.append(optimizer)
    variable_lists.append(variables)
    losses.append(loss)
Then determine a conditional update strategy, in this case only training the model with the maximum loss (just because that results in more switching; the output is rather boring if only one model ever updates):
# Determine which model should be updated (in this case, the one with the
# maximum loss)
integer_one_hot = tf.one_hot(
    tf.argmax(tf.stack(losses), axis=0),
    depth=len(losses))
is_max = tf.equal(
    integer_one_hot,
    tf.ones_like(integer_one_hot))
Finally, we can call the make_conditional_train_op function to create the train op, then do some training iterations:
train_op = make_conditional_train_op(
    tf.unstack(is_max), optimizers, variable_lists, losses)

# Repeatedly call the conditional train op
with tf.Session():
  tf.global_variables_initializer().run()
  for i in range(20):
    print("Iteration {}".format(i))
    train_op.run()
This prints the index of the model that is updated and its loss at each iteration, confirming the conditional execution:
Iteration 0
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][2.7271919]
Iteration 1
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][2.1755948]
Iteration 2
I tensorflow/core/kernels/logging_ops.cc:79] [Model 2 updating][1.9858969]
Iteration 3
I tensorflow/core/kernels/logging_ops.cc:79] [Model 6 updating][1.6859927]
In the MNIST example, the optimizer is set up as follows:
# Optimizer: set up a variable that's incremented once per batch and
# controls the learning rate decay.
batch = tf.Variable(0, dtype=data_type())
# Decay once per epoch, using an exponential schedule starting at 0.01.
learning_rate = tf.train.exponential_decay(
    0.01,                # Base learning rate.
    batch * BATCH_SIZE,  # Current index into the dataset.
    train_size,          # Decay step.
    0.95,                # Decay rate.
    staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
                                       0.9).minimize(loss,
                                                     global_step=batch)
And in the training process,
for step in xrange(int(num_epochs * train_size) // BATCH_SIZE):
    # skip some code here
    sess.run(optimizer, feed_dict=feed_dict)
My question is that when defining learning_rate, they use batch * BATCH_SIZE to define the global step. However, in the training loop we only have the Python variable step. How does the code connect (or pass) the step information to the global_step parameter of tf.train.exponential_decay? I am not clear on how this parameter-passing mechanism works.
From the code you have linked, batch is the global step. Its value is updated by the optimizer each time minimize() runs. The learning-rate node takes it as input.
The naming may be an issue: batch merely means the number of the current batch used for training (of size BATCH_SIZE). Perhaps a better name would have been step or even global_step.
Most of the global_step code seems to be in a single source file. It is quite short and perhaps a good way to see how the pieces work together.
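To make the mechanism concrete, here is a small self-contained sketch (with made-up constants) showing that minimize() itself increments batch on every sess.run, and that the learning-rate node re-reads it; the Python step variable is never involved:

import tensorflow as tf

batch = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    0.01,        # base learning rate
    batch * 64,  # made-up BATCH_SIZE of 64
    1000,        # made-up decay step
    0.95,
    staircase=True)

w = tf.Variable(5.0)
loss = tf.square(w)
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(
    loss, global_step=batch)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(3):
        sess.run(optimizer)
        # `batch` advances on its own; no Python value is passed in.
        print(sess.run([batch, learning_rate]))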