Recently, I have been trying to learn how to use TensorFlow on multiple GPUs by reading the official tutorial. However, there is something I am confused about. The following code is part of the official tutorial, which calculates the loss on a single GPU.
def tower_loss(scope, images, labels):
  # Build inference Graph.
  logits = cifar10.inference(images)
  # Build the portion of the Graph calculating the losses. Note that we will
  # assemble the total_loss using a custom function below.
  _ = cifar10.loss(logits, labels)
  # Assemble all of the losses for the current tower only.
  losses = tf.get_collection('losses', scope)
  # Calculate the total loss for the current tower.
  total_loss = tf.add_n(losses, name='total_loss')
  # Attach a scalar summary to all individual losses and the total loss; do the
  # same for the averaged version of the losses.
  for l in losses + [total_loss]:
    # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
    # session. This helps the clarity of presentation on tensorboard.
    loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
    tf.summary.scalar(loss_name, l)
  return total_loss
The training process is as follows.
def train():
  with tf.device('/cpu:0'):
    # Create a variable to count the number of train() calls. This equals the
    # number of batches processed * FLAGS.num_gpus.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)
    # Calculate the learning rate schedule.
    num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                             FLAGS.batch_size / FLAGS.num_gpus)
    decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                    global_step,
                                    decay_steps,
                                    cifar10.LEARNING_RATE_DECAY_FACTOR,
                                    staircase=True)
    # Create an optimizer that performs gradient descent.
    opt = tf.train.GradientDescentOptimizer(lr)
    # Get images and labels for CIFAR-10.
    images, labels = cifar10.distorted_inputs()
    batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
        [images, labels], capacity=2 * FLAGS.num_gpus)
    # Calculate the gradients for each model tower.
    tower_grads = []
    with tf.variable_scope(tf.get_variable_scope()):
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
          with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
            # Reuse variables for the next tower.
            tf.get_variable_scope().reuse_variables()
            # Retain the summaries from the final tower.
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the for loop 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new batch of images from batch_queue and calculate every gradient in turn. I think this process is serialized rather than parallel. Is there anything wrong with my understanding? By the way, can I also use an iterator to feed images to my model rather than the dequeue?
Thank you everybody!
This is a common misconception with TensorFlow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
      # Dequeues one batch for the GPU
      image_batch, label_batch = batch_queue.dequeue()
      # Calculate the loss for one tower of the CIFAR model. This function
      # constructs the entire CIFAR model but shares the variables across
      # all towers.
      loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
- build operations needed to dequeue a batch
- build operations needed to run the batch through the network and compute the loss
Note how, via tf.get_variable_scope().reuse_variables(), you're telling the graph that the variables used by each GPU tower must be shared among all of them (i.e., all the graphs on the multiple devices "reuse" the same variables).
None of this actually runs the network even once (note that there is no sess.run()): you're just giving instructions on how data must flow.
Then, when you start the actual training (I guess you missed that piece of the code when copying it here), each GPU will pull its own batch and produce a per-tower loss. I guess these losses (or rather their gradients) are averaged somewhere in the subsequent code, and the averaged result is what is passed to the optimizer.
Up until the point where the tower losses are averaged together, everything is independent of the other devices, so getting the batch and computing the loss can be done in parallel. After that, the gradient aggregation and parameter update are done only once, the variables are updated, and the cycle repeats.
So, to answer your question: no, the per-batch loss computation is not serialized. But since this is synchronous distributed computation, you need to collect all losses from all GPUs before being allowed to continue with the gradient computation and parameter update, so some part of the graph still cannot be independent.
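To make the construction-versus-execution distinction concrete, here is a minimal self-contained sketch (not part of the tutorial): two independent per-device sub-graphs are built in an ordinary Python loop, and nothing is computed until sess.run() is called.
import tensorflow as tf

# Graph construction: nothing is executed here, we are only describing the computation.
losses = []
for i in range(2):
    with tf.device('/gpu:%d' % i):
        x = tf.random_normal([128, 10])
        losses.append(tf.reduce_mean(tf.square(x)))
total_loss = tf.add_n(losses) / 2.0

# Graph execution: only now does TensorFlow run the ops. The two device
# sub-graphs have no dependencies on each other, so the runtime is free to
# evaluate them in parallel before add_n combines their results.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(total_loss))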
I'm building a multi-model neural network for reinforcement learning that includes an action network, a world-model network, and a critic. The idea is to train the world model to emulate whatever simulation you are trying to master based on input from the action network and the previous state, to train the critic to maximize the Bellman equation (total reinforcement over time) based on the world-model output, and then to backpropagate the critic value through the world model to provide gradient targets for training the actions. So, from some state, the action network outputs an action, which is fed into the model to generate the next state, and that state feeds into the critic network for evaluation against some goal state.
For all of this to work, I must use 3 separate loss functions, one for each network; they all add something to the gradients of one or more networks, but they can be in conflict. For example, to train the world model I use a target from an environmental simulation, and for the critic I use a target of the current state reward + discount * next-state forecast value. However, to train the actor I just use the negative critic value as a loss and backpropagate all the way through all three models to calibrate the best action.
I can make this work without any batching by zeroing out gradients incrementally, but that is inefficient and doesn't let me accumulate gradients for any kind of "time-series batching" optimizer update step. Each model has its own trainable parameters, but the execution graph flows through all three networks. So inside the calibration loop after firing the networks in sequence:
...
if self.actor.calibrating:
    self.actor.optimizer.zero_grad()
    # Pick loss for maximizing the value of all actions
    loss = -self.critic.value
    # Backpropagate through all three networks to train actor output
    # How do I stop the critic and model networks from incrementing their gradient values?
    loss.backward(retain_graph=True)
    self.actor.optimizer.step()
if self.model.calibrating:
    self.model.optimizer.zero_grad()
    # Reduce loss for ambiguous actions
    loss = self.model.get_loss() * self.actor.get_confidence()**2
    # How can I block this from backpropagating through the action network?
    loss.backward(retain_graph=True)
    self.model.optimizer.step()
if self.critic.calibrating:
    self.critic.optimizer.zero_grad()
    # Reduce loss for ambiguous actions
    loss = self.critic.get_loss(self.goal) * self.actor.get_confidence()**2
    # How do I stop this from backpropagating through the model and action networks?
    loss.backward(retain_graph=True)
    self.critic.optimizer.step()
...
Finally - my question is in two parts:
How can I temporarily stop loss.backward() at a given layer without detaching it forever?
How can I block loss.backward() from updating some gradients where I'm just flowing through a model to get gradients for another model?
Got this figured out thanks to a suggestion from a colleague to try the requires_grad setting. (I had assumed that would break the execution graph, but it doesn't)
So - to answer my own two questions:
If you calibrate the chained models in the correct order, you can detach them one at a time so that loss.backward() doesn't run over models that aren't needed. I was thinking that this would break the graph, but... this is PyTorch, not TensorFlow 1.x, and the graph is regenerated on every forward pass anyway. Silly me for missing this yesterday.
If you set requires_grad to False for a model (or a layer or an individual weight) then loss.backward() will STILL traverse the entire connected graph but it will leave those individual gradients as they were while still setting any gradients earlier in the graph. Exactly what I wanted.
This code works to minimize the execution of unnecessary graph traversals and gradient updates. I still need to refactor it for staggered updates over time so that it can accumulate gradients for several cycles before stepping the optimizers, but this definitely works as intended.
# Step through all models in a chain to create gradient paths from the critic back through the world model, to the actor.
def step(self):
    # Get the current state from the simulation
    state = self.world.state
    # Fire the actor to select a softmax action.
    self.actor(state)
    # Run the world simulation on that action.
    self.world.step(self.actor.action)
    # Combine the action and starting state as input to the world model.
    if self.actor.calibrating:
        action_state = torch.cat([self.actor.value, state], dim=0)
    else:
        # Push softmax action closer to 1.0
        action_state = torch.cat([self.actor.hard_value, state], dim=0)
    # Run the model and then the critic on the action_state
    self.critic(self.model(action_state))
    if self.actor.calibrating:
        self.actor.optimizer.zero_grad()
        self.model.requires_grad = False
        self.critic.requires_grad = False
        # Pick loss for maximizing the value of the action choice
        loss = -self.critic.value * self.actor.get_confidence()
        loss.backward(retain_graph=True)
        self.actor.optimizer.step()
    if self.model.calibrating:
        # Don't need to backpropagate through the actor again
        self.actor.value.detach_()
        self.model.optimizer.zero_grad()
        self.model.requires_grad = True
        # Reduce loss for ambiguous actions
        loss = self.model.get_loss() * self.actor.get_confidence()**2
        loss.backward(retain_graph=True)
        self.model.optimizer.step()
    if self.critic.calibrating:
        # Don't need to backpropagate through the model or actor again
        self.model.value.detach_()
        self.critic.optimizer.zero_grad()
        self.critic.requires_grad = True
        # Reduce loss for ambiguous actions
        loss = self.critic.get_loss(self.goal) * self.actor.get_confidence()**2
        loss.backward(retain_graph=True)
        self.critic.optimizer.step()
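As a side note, with plain nn.Module subclasses the conventional way to get that behaviour is to flip requires_grad on the parameters themselves. Here is a minimal sketch (the two Linear modules are just stand-ins, not the classes from the code above):
import torch
import torch.nn as nn

actor = nn.Linear(4, 2)
critic = nn.Linear(2, 1)

# Freeze the critic's parameters: backward() still flows through the critic's
# ops, but its parameter gradients are left untouched.
for p in critic.parameters():
    p.requires_grad = False

value = critic(actor(torch.randn(1, 4)))
loss = -value.sum()
loss.backward()

print(actor.weight.grad is not None)   # True  - gradients reached the actor
print(critic.weight.grad)              # None  - the frozen parameters were skipped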
Here's a more precise and fuller example.
import torch
import torch.nn as nn
from torch.autograd import Variable

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
        ])

    def forward(self, x):
        self.output = []
        self.input = []
        for layer in self.layers:
            # Detach from previous history.
            # You can add this line after each layer to stop backpropagation of that layer.
            x = Variable(x.data, requires_grad=True)
            self.input.append(x)
            # Compute output
            x = layer(x)
            # Add to list of outputs
            self.output.append(x)
        return x

    def backward(self, g):
        for i, output in reversed(list(enumerate(self.output))):
            if i == (len(self.output) - 1):
                # For the last node, use g
                output.backward(g)
            else:
                output.backward(self.input[i + 1].grad.data)
                print(i, self.input[i + 1].grad.data.sum())

model = Net()
inp = Variable(torch.randn(4, 10))
output = model(inp)
gradients = torch.randn(*output.size())
model.backward(gradients)
I want to run a small training operation inside another training operation as follows:
def get_alphas(weights, filters):
    alphas = tf.Variable(...)
    # Define some loss and training_op here
    with tf.Session() as sess:
        for some_epochs:
            sess.run(training_op)
        return tf.convert_to_tensor(sess.run(alphas))

def get_updated_weights(default_weights):
    weights = tf.Variable(default_weights)
    # Some operation on weights to get filters
    # Now, the following will produce errors since weights is not initialized
    alphas = get_alphas(weights, filters)
    # Other option is to initialize it here as follows
    with tf.Session() as sess:
        sess.run(tf.variables_initializer([weights]))
        calculated_filters = sess.run(filters)
        alphas = get_alphas(default_weights, calculated_filters)
    return Some operation on alphas and filters
So, what I want to do is create a Variable named weights. alphas and filters depend dynamically (through some training) on weights. Now, as weights are trained, filters will change since they are created through some operations on weights, but alphas also need to change, and they can only be found through another training operation.
I will provide the exact functions if the intention is not clear from the above.
The trick you describe won't work, because tf.Session.close releases all associated resources, such as variables, queues, and readers. So the result of get_alphas won't be a valid tensor.
The best course of action is to define several losses and training ops (affecting different parts of the graph) and run them within a single session, when you need to.
alphas = tf.Variable(...)
# Define some loss and training_op here

def get_alphas(sess, weights, filters):
    for some_epochs:
        sess.run(training_op)
    # The rest of the training...
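For example, here is a rough, self-contained sketch of that structure (the losses and shapes are illustrative placeholders, not your actual objectives): both variables live in one graph, each optimizer only touches its own variable via var_list, and the "inner" training is simply an inner Python loop inside the same session.
import tensorflow as tf

weights = tf.Variable(tf.random_normal([10, 10]), name='weights')
alphas = tf.Variable(tf.zeros([10]), name='alphas')
filters = tf.nn.relu(weights)                       # stand-in for "some operation on weights"

alpha_loss = tf.reduce_mean(tf.square(tf.reduce_sum(filters, 1) - alphas))
weight_loss = tf.reduce_mean(tf.square(weights))    # stand-in for the outer objective

train_alphas = tf.train.GradientDescentOptimizer(0.1).minimize(alpha_loss, var_list=[alphas])
train_weights = tf.train.GradientDescentOptimizer(0.1).minimize(weight_loss, var_list=[weights])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for outer_step in range(100):
        for inner_epoch in range(20):               # "some_epochs" of alpha training
            sess.run(train_alphas)
        sess.run(train_weights)                     # then one update of the weights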
I am trying to classify the ETH Food-101 dataset using squeezenet in Caffe2. My model is imported from the Model Zoo and I made two types of modifications to the model:
1) Changing the dimensions of the last layer to have 101 outputs
2) The images from the database are in NHWC form and I just flipped the dimensions of the weights to match. (I plan on changing this)
The Food101 dataset has 75,000 images for training and I am currently using a batch size of 128 and a starting learning rate of -0.01 with a gamma of 0.999 and stepsize of 1. What I noticed is that for the first 2000 iterations of the network the accuracy hovered around 1/128 and this took an hour or so to complete.
I added all the weights to model.params so they can get updated during gradient descent (except for data), and reinitialized all weights with Xavier and biases to constant. I would expect the accuracy to grow fairly quickly in the first hundred to thousand iterations and then tail off as the number of iterations grows. In my case, the accuracy stays roughly constant around 0.
When I look at the gradient file I find that the average is on the order of 10^-6 with a standard deviation of 10^-7. This explains the slow learning rate, but I haven't been able to get the gradient to start much higher.
These are the gradient statistics for the first convolution after a few iterations
Min Max Avg Sdev
-1.69821e-05 2.10922e-05 1.52149e-06 5.7707e-06
-1.60263e-05 2.01478e-05 1.49323e-06 5.41754e-06
-1.62501e-05 1.97764e-05 1.49046e-06 5.2904e-06
-1.64293e-05 1.90508e-05 1.45681e-06 5.22742e-06
Here are the core parts of my code:
# init_path is the path to the init_net protobuf
# pred_path is the path to the pred_net protobuf
def main(init_path, pred_path):
    ws.ResetWorkspace()
    data_folder = '/home/myhome/food101/'
    # some debug code here
    arg_scope = {"order": "NCHW"}
    train_model = model_helper.ModelHelper(name="food101_train", arg_scope=arg_scope)
    if not debug:
        data, label = AddInput(
            train_model, batch_size=128,
            db=os.path.join(data_folder, 'food101-train-nchw-leveldb'),
            db_type='leveldb')
    init_net_def, pred_net_def = update_squeeze_net(init_path, pred_path)
    # print str(init_net_def)
    train_model.param_init_net.AppendNet(core.Net(init_net_def))
    train_model.net.AppendNet(core.Net(pred_net_def))
    ws.RunNetOnce(train_model.param_init_net)
    add_params(train_model, init_net_def)
    AddTrainingOperators(train_model, 'softmaxout', 'label')
    AddBookkeepingOperators(train_model)
    ws.RunNetOnce(train_model.param_init_net)
    if debug:
        ws.FeedBlob('data', data)
        ws.FeedBlob('label', label)
    ws.CreateNet(train_model.net)
    total_iters = 10000
    accuracy = np.zeros(total_iters)
    loss = np.zeros(total_iters)
    # Now, we will manually run the network for total_iters iterations.
    for i in range(total_iters):
        #try:
        conv1_w = ws.FetchBlob('conv1_w')
        print conv1_w[0][0]
        ws.RunNet("food101_train")
        #except RuntimeError:
        #    print ws.FetchBlob('conv1').shape
        #    print ws.FetchBlob('pool1').shape
        #    print ws.FetchBlob('fire2/squeeze1x1_w').shape
        #    print ws.FetchBlob('fire2/squeeze1x1_b').shape
        #softmax = ws.FetchBlob('softmaxout')
        #print softmax[i]
        #print softmax[i][0][0]
        #print softmax[i][0][:5]
        #print softmax[64*i]
        accuracy[i] = ws.FetchBlob('accuracy')
        loss[i] = ws.FetchBlob('loss')
        print accuracy[i], loss[i]
My add_params function initializes the weights as follows
# ops allows me to only initialize the weights of specific ops, because I initially was going to do last-layer training
def add_params(model, init_net_def, ops=[]):
    def add_param(op):
        for output in op.output:
            if "_w" in output:
                weight_shape = []
                for arg in op.arg:
                    if arg.name == 'shape':
                        weight_shape = arg.ints
                weight_initializer = initializers.update_initializer(
                    None,
                    None,
                    ("XavierFill", {}))
                model.create_param(
                    param_name=output,
                    shape=weight_shape,
                    initializer=weight_initializer,
                    tags=ParameterTags.WEIGHT)
            elif "_b" in output:
                weight_shape = []
                for arg in op.arg:
                    if arg.name == 'shape':
                        weight_shape = arg.ints
                weight_initializer = initializers.update_initializer(
                    None,
                    None,
                    ("ConstantFill", {}))
                model.create_param(
                    param_name=output,
                    shape=weight_shape,
                    initializer=weight_initializer,
                    tags=ParameterTags.BIAS)
I find that my loss function fluctuates when I use the full training set, but if I use just one batch and iterate over it several times, the loss function goes down, though very slowly.
While SqueezeNet has 50x fewer parameters than AlexNet, it is still a very large network. The original paper does not mention a training time, but the SqueezeNet-based SQ required 22 hours to train using two Titan X graphics cards - and that was with some of the weights pre-trained! I haven't gone over your code in detail, but what you describe is expected behavior - your network is able to learn on the single batch, just not as quickly as you expected.
I suggest reusing as many of the weights as possible instead of reinitializing them, just as the creators of SQ did. This is known as transfer learning, and it works because many of the lower-level features (lines, curves, basic shapes) in an image are the same regardless of the image's content, and reusing the weights for these layers saves the network from having to re-learn them from scratch.
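For example, if param_init_net has already loaded the pretrained values (as in the code above), one rough way to keep them and reinitialize only the final 1x1 convolution is to overwrite just that layer's blobs. This is only a sketch of the idea; 'conv10_w' / 'conv10_b' are assumed blob names, so adjust them to whatever the zoo protobuf actually calls the last layer.
import numpy as np
from caffe2.python import workspace as ws

def reinit_last_layer(num_classes=101):
    # The pretrained weights are already in the workspace at this point.
    old_w = ws.FetchBlob('conv10_w')                  # e.g. shape (1000, C, 1, 1)
    fan_in = int(np.prod(old_w.shape[1:]))
    scale = np.sqrt(3.0 / fan_in)                     # Xavier-style uniform range
    new_w = np.random.uniform(
        -scale, scale, (num_classes,) + tuple(old_w.shape[1:])).astype(np.float32)
    new_b = np.zeros(num_classes, dtype=np.float32)
    # Replace only the last layer; every other layer keeps its pretrained values.
    ws.FeedBlob('conv10_w', new_w)
    ws.FeedBlob('conv10_b', new_b)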
In the MNIST example, the optimizer is set up as follows:
# Optimizer: set up a variable that's incremented once per batch and
# controls the learning rate decay.
batch = tf.Variable(0, dtype=data_type())
# Decay once per epoch, using an exponential schedule starting at 0.01.
learning_rate = tf.train.exponential_decay(
    0.01,                # Base learning rate.
    batch * BATCH_SIZE,  # Current index into the dataset.
    train_size,          # Decay step.
    0.95,                # Decay rate.
    staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
                                       0.9).minimize(loss,
                                                     global_step=batch)
And in the training process,
for step in xrange(int(num_epochs * train_size) // BATCH_SIZE):
    # skip some code here
    sess.run(optimizer, feed_dict=feed_dict)
My question is that when defining learning_rate, they use batch * BATCH_SIZE to define the global step. However, in the training loop we only have the variable step. How does the code connect (or pass) the step information to the global_step parameter in tf.train.exponential_decay? I am not very clear on how this Python parameter-passing mechanism works.
From the code you have linked, batch is the global step. Its value is updated by the optimizer. The learning-rate node takes it as input.
The naming may be an issue. batch merely means the number of the current batch used for training (of size BATCH_SIZE). Perhaps a better name could have been step or even global_step.
Most of the global_step code seems to be in a single source file. It is quite short and perhaps a good way to see how the pieces work together.
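To see the wiring in action, here is a tiny self-contained TF1-style sketch (not the MNIST code itself): the op returned by minimize() increments batch on every sess.run, and exponential_decay reads that counter each time it is evaluated.
import tensorflow as tf

batch = tf.Variable(0, trainable=False)                      # plays the role of global_step
learning_rate = tf.train.exponential_decay(
    0.01, batch * 64, 1000, 0.95, staircase=True)            # 64 stands in for BATCH_SIZE
w = tf.Variable(1.0)
loss = tf.square(w)
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(
    loss, global_step=batch)                                 # minimize() adds a "batch += 1" op

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(3):
        sess.run(optimizer)                                  # each run bumps batch by one
        print(sess.run([batch, learning_rate]))              # [1, 0.01], [2, 0.01], [3, 0.01]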
I use TensorBoard to visualize the learning of my neural networks. Currently, I am working with datasets of such size that an 80%/20% split into training and development sets leaves me with development data that does not fit into memory during processing, so I have to use batches for the development step as well.
When I do so, I end up with the following shapes on the TensorBoard charts:
The light-green line represents the batched development summary, while the dark-green one represents unbatched development processing. I used datasets of the same size and split for both charts.
Here is the code that is used to populate the summaries:
# accuracy definition in network
network.correct_predictions = tf.equal(predictions, tf.argmax(input_y, 1))
network.accuracy = tf.reduce_mean(tf.cast(network.correct_predictions, "float"), name="accuracy")
network.loss = tf.reduce_mean(_losses)

# accuracy summary
acc_summary = tf.scalar_summary("accuracy", network.accuracy)
loss_summary = tf.scalar_summary("loss", network.loss)

# dev data summary writer
dev_summary_operation = tf.merge_summary([loss_summary, acc_summary])
dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, session.graph)

# dev step
def dev_step(x_batch, y_batch):
    feed_dict = {network.input_x: x_batch, network.input_y: y_batch}
    step, summaries, loss, accuracy = session.run(
        [global_step, dev_summary_operation, network.loss, network.accuracy], feed_dict)
    dev_summary_writer.add_summary(summaries, step)
The same approach during the training step gives normal lines (those that are not jagged), since training is batched in both cases.
In both cases (batched or not), the development summary gets updated once every 100 training batches.
Suggestions about the reasons for this behaviour, and possible fixes, are more than welcome and highly appreciated.
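One common way to smooth the batched curve (this is my suggestion, not something from the original post; it assumes numpy is imported as np and reuses network, session, global_step and dev_summary_writer from the snippet above) is to run the whole development set batch by batch, average the metrics in Python, and write a single summary value per evaluation:
def dev_epoch(dev_batches):
    losses, accuracies = [], []
    for x_batch, y_batch in dev_batches:
        feed_dict = {network.input_x: x_batch, network.input_y: y_batch}
        loss, accuracy = session.run([network.loss, network.accuracy], feed_dict)
        losses.append(loss)
        accuracies.append(accuracy)
    # One summary point for the whole dev set instead of one per dev batch.
    step = session.run(global_step)
    summary = tf.Summary(value=[
        tf.Summary.Value(tag="accuracy", simple_value=float(np.mean(accuracies))),
        tf.Summary.Value(tag="loss", simple_value=float(np.mean(losses))),
    ])
    dev_summary_writer.add_summary(summary, step)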