Tensorflow: Why Doesn't Out of Graph Cost Calculation Work - python

I have a standard experiment loop that looks like this:
cross_entropy_target = tf.reduce_mean(tf.reduce_mean(tf.square(target_pred - target)))
cost = cross_entropy_target
opt_target = tf.train.AdamOptimizer(learning_rate=0.00001).minimize(cost)
for epoch in range(num_epochs):
    for mini_batch in range(num_samples // batch_size):
        mb_train_x, mb_train_target = get_mini_batch_stuffs()
        sess.run(opt_target, feed_dict={x: mb_train_x, target: mb_train_target})
This runs and converges to a good prediction loss. Now, same code with a slight modification:
cross_entropy_target = tf.reduce_mean(tf.reduce_mean(tf.square(target_pred - target)))
cross_entropy_target_variable = tf.Variable(0.0)
cost = cross_entropy_target_variable
opt_target = tf.train.AdamOptimizer(learning_rate=0.00001).minimize(cost)
for epoch in range(num_epochs):
    for mini_batch in range(num_samples // batch_size):
        mb_train_x, mb_train_target = get_mini_batch_stuffs()
        new_target_cost = sess.run(cross_entropy_target, feed_dict={x: mb_train_x, time: mb_train_time, target: mb_train_target})
        sess.run(tf.assign(cross_entropy_target_variable, new_target_cost))
        sess.run(opt_target, feed_dict={x: mb_train_x, target: mb_train_target})
Now, instead of the cross_entropy_target being calculated as part of the opt_target graph, I am pre-calculating it, assigning it to a tensorflow variable, and expecting it to make use of that value. This doesn't work at all. The network's outputs never change.
I would expect these two code snippets to have equivalent outcomes. In both cases a feed forward is used to populate the values of target and target_pred, which is then reduced to the scalar value cross_entropy_target. This scalar value is used to inform the magnitude and direction of the gradient updates on the optimizer's .minimize().
In this toy example there is no advantage to my calculating the cross_entropy_target "out of graph" and then assigning it to an in-graph tf.Variable for use in the opt_target run. However, I have a real use case where my cost function is very complex and I have not been able to define it in terms of Tensorflow's existing tensor transforms. Either way, I'd like to understand why using a tf.Variable for an optimizer's cost is incorrect use.
An interesting oddity that may be a byproduct of the solution to this:
If I set cross_entropy_target_variable = tf.Variable(0.0, trainable=False), running the opt_target will crash. It requires that the cost value is modifiable. Indeed, printing out its value before and after running the opt_target produces different values:
cross_entropy_target before = 0.345796853304
cross_entropy_target after = 0.344796866179
Why does running minimize() modify the value of the cost variable?

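In your tf.train.AdamOptimizer(...).minimize(cost) line, the optimizer looks at cost, which is cross_entropy_target_variable, a plain tf.Variable. That variable does not depend on any of the network's weights, so there is no gradient path from the cost to them and the optimizer never updates them. Assigning a new value to the variable later has no effect on this, because the graph of gradient operations was fixed when minimize() was called. The only trainable variable the cost depends on is the cost variable itself, which is why minimize() changes its value and why marking it trainable=False leaves the optimizer with nothing to optimize.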
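To make this concrete without TensorFlow, here is a minimal pure-Python sketch. It assumes a hypothetical one-weight model (prediction = w * x) and estimates derivatives with finite differences; the names loss_in_graph and stored_cost are illustrative and not part of any library. A cost that is computed from w has a non-zero derivative, whereas a number that was evaluated once and stored has derivative zero with respect to w, so an optimizer has nothing to push on.

```python
# Toy illustration (no TensorFlow): a stored number carries no gradient.
# Assumption: a one-weight model, prediction = w * x, with squared error.

def loss_in_graph(w, x, target):
    """Cost expressed as a function of the weight, like the first snippet."""
    return (w * x - target) ** 2

def finite_difference(f, w, eps=1e-6):
    """Central-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

x, target, w = 2.0, 3.0, 0.5

# Case 1: the cost depends on w, so the derivative is informative.
grad_connected = finite_difference(lambda w_: loss_in_graph(w_, x, target), w)

# Case 2: the cost was evaluated once and stored; changing w no longer
# changes it, which mirrors assigning the value to a separate tf.Variable.
stored_cost = loss_in_graph(w, x, target)
grad_disconnected = finite_difference(lambda w_: stored_cost, w)

print(grad_connected)     # close to 2 * (w*x - target) * x = -8
print(grad_disconnected)  # 0.0
```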

Related

Numerical equivalence of PyTorch backpropagation

After I had written a simple neural network with NumPy, I wanted to compare it numerically with a PyTorch implementation. Run on its own, my network converges, so it appears to be free of errors.
I have also checked that the forward pass matches PyTorch, so the basic setup is correct.
But something differs in the backward pass, because the weights after one backpropagation step are different.
I don't want to post the full code here because it is spread over several .py files, and most of it is irrelevant to the question. I just want to know whether PyTorch does "basic" gradient descent or something different.
I am looking at the simplest example, the fully connected weights of the last layer, because if those differ, everything further back will differ too:
self.weight += self.learning_rate * hidden_layer.T.dot(output_delta )
where
output_delta = self.expected - self.output
self.expected is the expected value,
self.output is the forward pass result.
There is no activation or anything else here.
The torch part is:
optimizer = torch.optim.SGD(nn.parameters() , lr = 1.0)
criterion = torch.nn.MSELoss(reduction='sum')
output = nn.forward(x_train)
loss = criterion(output, y_train)
loss.backward()
optimizer.step()
optimizer.zero_grad()
So is it possible that with the SGD optimizer and MSELoss it uses a different delta or backpropagation function than the basic one mentioned above? If so, I'd like to know how to check my NumPy solution numerically against PyTorch.
I just want to know whether PyTorch does "basic" gradient descent or something different.
If you set torch.optim.SGD, this means stochastic gradient descent.
There are different variants of GD; torch.optim.SGD simply applies the update to whatever batch the gradients were computed on, which in practice is usually a mini-batch.
Some GD variants update the parameters only after a full epoch. As you may guess, they are very "slow", though they may be fine to try on a supercomputer. Other variants update after every single sample; as you may guess, their weakness is "huge" gradient fluctuations.
These are all relative terms, which is why I put them in quotes.
Note that a learning rate like lr = 1.0 is very large, which suggests you have not normalized your data first, but this is a skill you will pick up over time.
So is it possible that with the SGD optimizer and MSELoss it uses a different delta or backpropagation function than the basic one mentioned above?
It uses what you told it to.
Here is an example in PyTorch and in plain Python showing that gradient computation (as used in backpropagation) works as expected:
import torch
x = torch.tensor([5.], requires_grad=True)
print(x) # tensor([5.], requires_grad=True)
y = 3*x**2
y.backward()
print(x.grad) # tensor([30.])
How would you get this value 30 in plain python?
def y(x):
    return 3*x**2
x=5
e=0.01 # finite-difference step size (epsilon)
g=(y(x+e)-y(x))/e
print(g) # approximately 30.03
As expected we get about 30; the estimate would be even closer with a smaller step size.
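One thing worth checking numerically is the scale of the update. The sketch below is pure Python and assumes a single scalar weight with no bias or activation (a simplification of the last layer in the question); the function names are made up for illustration. It compares the gradient of a summed squared error, which is what MSELoss(reduction='sum') computes, with the update direction hidden_layer.T.dot(expected - output) from the question. Under these assumptions the two differ by a factor of 2, because the derivative of (output - expected)^2 is 2 * (output - expected).

```python
# Assumption: one scalar weight w, output = w * x, no bias, no activation.

def sum_squared_error(w, xs, ys):
    """Same quantity as MSELoss(reduction='sum') for this toy model."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def analytic_grad(w, xs, ys):
    """d/dw of the summed squared error."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

def question_update_direction(w, xs, ys):
    """Scalar version of hidden_layer.T.dot(expected - output)."""
    return sum(x * (y - w * x) for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]   # hypothetical inputs to the last layer
ys = [2.0, 4.0, 7.0]   # hypothetical expected values
w = 0.5
eps = 1e-6

# Central finite difference as an independent check of the analytic gradient.
numeric_grad = (sum_squared_error(w + eps, xs, ys)
                - sum_squared_error(w - eps, xs, ys)) / (2 * eps)

sgd_step = -analytic_grad(w, xs, ys)                  # SGD moves along -grad
numpy_step = question_update_direction(w, xs, ys)     # the question's delta

print(numeric_grad, analytic_grad(w, xs, ys))  # both close to -48
print(sgd_step, numpy_step)                    # 48.0 versus 24.0
```

If the same factor shows up in a real comparison, halving the PyTorch learning rate or multiplying the NumPy delta by 2 should bring the two updates into agreement, provided nothing else differs.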

TensorFlow on multiple GPU

Recently I have been trying to learn how to use TensorFlow on multiple GPUs by reading the official tutorial. However, there is something I am confused about. The following code is part of the official tutorial and calculates the loss on a single GPU.
def tower_loss(scope, images, labels):
    # Build inference Graph.
    logits = cifar10.inference(images)
    # Build the portion of the Graph calculating the losses. Note that we will
    # assemble the total_loss using a custom function below.
    _ = cifar10.loss(logits, labels)
    # Assemble all of the losses for the current tower only.
    losses = tf.get_collection('losses', scope)
    # Calculate the total loss for the current tower.
    total_loss = tf.add_n(losses, name='total_loss')
    # Attach a scalar summary to all individual losses and the total loss; do the
    # same for the averaged version of the losses.
    for l in losses + [total_loss]:
        # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
        # session. This helps the clarity of presentation on tensorboard.
        loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
        tf.summary.scalar(loss_name, l)
    return total_loss
The training process is as follows.
def train():
    with tf.device('/cpu:0'):
        # Create a variable to count the number of train() calls. This equals the
        # number of batches processed * FLAGS.num_gpus.
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)
        # Calculate the learning rate schedule.
        num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                                 FLAGS.batch_size / FLAGS.num_gpus)
        decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
        # Decay the learning rate exponentially based on the number of steps.
        lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                        global_step,
                                        decay_steps,
                                        cifar10.LEARNING_RATE_DECAY_FACTOR,
                                        staircase=True)
        # Create an optimizer that performs gradient descent.
        opt = tf.train.GradientDescentOptimizer(lr)
        # Get images and labels for CIFAR-10.
        images, labels = cifar10.distorted_inputs()
        batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
            [images, labels], capacity=2 * FLAGS.num_gpus)
        # Calculate the gradients for each model tower.
        tower_grads = []
        with tf.variable_scope(tf.get_variable_scope()):
            for i in xrange(FLAGS.num_gpus):
                with tf.device('/gpu:%d' % i):
                    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                        # Dequeues one batch for the GPU
                        image_batch, label_batch = batch_queue.dequeue()
                        # Calculate the loss for one tower of the CIFAR model. This function
                        # constructs the entire CIFAR model but shares the variables across
                        # all towers.
                        loss = tower_loss(scope, image_batch, label_batch)
                        # Reuse variables for the next tower.
                        tf.get_variable_scope().reuse_variables()
                        # Retain the summaries from the final tower.
                        summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the for loop 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new image batch from batch_queue and calculate each gradient in turn. I think this process is serialized rather than parallel. Is there anything wrong with my understanding? By the way, I can also use an iterator to feed images to my model instead of the dequeue, right?
Thank you everybody!
This is a common misconception with Tensorflow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
- build operations needed to dequeue a batch
- build operations needed to run the batch through the network and compute the loss
Note how, via tf.get_variable_scope().reuse_variables(), you're telling TensorFlow that the variables created for the first GPU's tower must be shared among all towers (i.e., the sub-graphs on the different devices "reuse" the same variables).
None of this actually runs the network once (note how there is no sess.run()): you're just giving instructions on how data must flow.
Then, when you start the actual training (I guess you left that piece of the code out when copying it here), each GPU pulls its own batch and produces its per-tower loss. I guess these losses are averaged somewhere in the subsequent code and the average is the loss passed to the optimizer (the tower_grads list in your snippet suggests the code may instead collect per-tower gradients and average those; either way there is a single synchronization point).
Up to the point where the per-tower results are averaged, each device works independently of the others, so fetching the batch and computing the loss can be done in parallel. The parameter update is then done only once, the variables are updated, and the cycle repeats.
So, to answer your question: no, per-batch loss computation is not serialized. But since this is synchronous distributed computation, you need to collect the results from all GPUs before continuing with the parameter update, so part of the graph still cannot run independently.
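The build-versus-run distinction can be mimicked in plain Python. The following is only an analogy, not TensorFlow code: build_tower, call_log, and the batch values are hypothetical, and Python threads stand in for devices. The loop that "builds" the towers runs serially but computes nothing; the work happens later, when the deferred functions are executed, and only the final averaging has to wait for every tower.

```python
from concurrent.futures import ThreadPoolExecutor

call_log = []  # records when a tower actually computes something

def build_tower(device_index, batches):
    """Describe the work for one device; nothing is computed yet."""
    def run_tower():
        call_log.append(device_index)
        batch = batches[device_index]
        return sum(x * x for x in batch) / len(batch)
    return run_tower

batches = [[1.0, 2.0], [3.0, 4.0]]  # one hypothetical batch per device

# "Graph construction": the loop runs serially, but only builds descriptions.
towers = [build_tower(i, batches) for i in range(len(batches))]
calls_after_build = len(call_log)  # still 0, no tower has run

# "sess.run": the independent towers may now execute concurrently.
with ThreadPoolExecutor(max_workers=len(towers)) as pool:
    tower_losses = list(pool.map(lambda tower: tower(), towers))

# Synchronization point: all per-tower results are needed before combining.
mean_loss = sum(tower_losses) / len(tower_losses)
print(calls_after_build, tower_losses, mean_loss)
```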

Global Step for Differential Learning Rate

Based on this question, I am trying to implement differential learning rates as follows:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
# Create two separate optimizers
opt1 = tf.train.AdamOptimizer(0.00001)
opt2 = tf.train.AdamOptimizer(0.0001)
# Compute gradients for each set of variables
grads1, variables1 = zip(*opt1.compute_gradients(loss, var_list1))
grads2, variables2 = zip(*opt2.compute_gradients(loss, var_list2))
# Apply Gradients
train_op1 = opt1.apply_gradients(zip(grads1, variables1))
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
train_op = tf.group(train_op1, train_op2)
I am unsure if global_step should be included in each apply_gradients call or if it should only be included in 1? My understanding is that when apply_gradients is called, global_step is incremented by 1 if it is supplied (code here). Based on this, I believe that I should only include global_step in one of my apply_gradients() calls. Can anybody confirm that this is the correct approach?
The alternative to what I have above would be to do the following:
train_op1 = opt1.apply_gradients(zip(grads1, variables1), global_step=global_step)
train_op2 = opt2.apply_gradients(zip(grads2, variables2), global_step=global_step)
While technically each call to apply_gradients is a step, my understanding is that global_step should represent the number of mini-batches that have been completed, so if I were to reference it in both apply_gradients() calls the global step would increase twice per mini-batch. Based on this, I believe the more accurate implementation is the first one, where it is passed only once. Would others agree this is the correct implementation? Does it matter which apply_gradients() call the global_step is included in?

Manually changing learning_rate in tf.train.AdamOptimizer

The question is, whether just changing the learning_rate argument in tf.train.AdamOptimizer actually results in any changes in behaviour:
Let's say the code looks like this:
myLearnRate = 0.001
...
output = tf.someDataFlowGraph
trainLoss = tf.losses.someLoss(output)
trainStep = tf.train.AdamOptimizer(learning_rate=myLearnRate).minimize(trainLoss)
with tf.Session() as session:
    #first trainstep
    session.run(trainStep, feed_dict = {input:someData, target:someTarget})
    myLearnRate = myLearnRate * 0.1
    #second trainstep
    session.run(trainStep, feed_dict = {input:someData, target:someTarget})
Would the decreased myLearnRate now be applied in the second trainStep? That is, is the creation of the node trainStep only evaluated once:
trainStep = tf.train.AdamOptimizer(learning_rate=myLearnRate).minimize(trainLoss)
Or is it evaluated on every session.run(trainStep)? How could I have checked whether my AdamOptimizer in TensorFlow actually changed the learning rate?
Disclaimer 1: I'm aware manually changing the LearnRate is bad practice.
Disclaimer 2: I'm aware there is a similar question, but it was solved with inputting a tensor as learnRate, which is updated in every trainStep (here). It makes me lean towards assuming it would only work with a tensor as input for the learning_rate in AdamOptimizer, but neither am I sure of that, nor can I understand the reasoning behind it.
The short answer is no, your new learning rate is not applied. TF builds the graph when the graph-construction code executes (here, when AdamOptimizer(...).minimize(...) is called), and changing a Python variable afterwards does not change the graph at run time. You can, however, feed a new learning rate into your graph pretty easily:
# Use a placeholder in the graph for your user-defined learning rate instead
learning_rate = tf.placeholder(tf.float32)
# ...
trainStep = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(trainLoss)
applied_rate = 0.001 # we will update this every training step
with tf.Session() as session:
    #first trainstep, feeding our applied rate to the graph
    session.run(trainStep, feed_dict = {input: someData,
                                        target: someTarget,
                                        learning_rate: applied_rate})
    applied_rate *= 0.1 # update the rate we feed to the graph
    #second trainstep
    session.run(trainStep, feed_dict = {input: someData,
                                        target: someTarget,
                                        learning_rate: applied_rate})
Yes, the optimizer is created only once:
tf.train.AdamOptimizer(learning_rate=myLearnRate)
It remembers the learning rate that was passed in (in fact, it creates a tensor for it if you pass a floating-point number), and later changes to myLearnRate don't affect it.
Yes, you can create a placeholder and pass it to session.run(), if you really want to. But, as you said, it's pretty uncommon and probably means you are solving your original problem in the wrong way.
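The same capture-at-construction behaviour can be reproduced in plain Python. This is a toy stand-in, not TensorFlow: ToyOptimizer and its step method are hypothetical names, and the optional per-call learning_rate argument plays the role of a value fed through a placeholder. Rebinding the Python name after construction changes nothing inside the object, while a rate supplied at call time is picked up.

```python
class ToyOptimizer:
    """Hypothetical stand-in for an optimizer node built once."""

    def __init__(self, learning_rate):
        # The float is copied here; later rebinding of the caller's
        # variable cannot reach this attribute.
        self.learning_rate = learning_rate

    def step(self, weight, grad, learning_rate=None):
        # A rate passed at call time mimics feeding a placeholder.
        lr = self.learning_rate if learning_rate is None else learning_rate
        return weight - lr * grad

my_learn_rate = 0.001
optimizer = ToyOptimizer(my_learn_rate)

my_learn_rate = my_learn_rate * 0.1   # rebinds the name only

unchanged = optimizer.step(1.0, 2.0)                         # uses 0.001
fed = optimizer.step(1.0, 2.0, learning_rate=my_learn_rate)  # uses 0.0001

print(optimizer.learning_rate, unchanged, fed)
```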

stopping condition on gradient value tensorflow

I would like to implement a stopping condition based on the value of the gradient of the loss function w.r.t. the weights.
For example, let's say I have something like this:
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
train_op = optimizer.apply_gradients(grads_and_vars)
then I would like to run the graph with something like this:
for step in range(TotSteps):
    output = sess.run([input], feed_dict=some_dict)
    if(grad_taken_in_some_way < some_treshold):
        print("Training finished.")
        break
I am not sure what I should pass to sess.run() in order to also get the gradient as output (besides everything else I need). I am not even sure whether this is the correct approach or whether I should do it differently. I made several attempts but failed every time. I hope someone has some hints.
Thank you in advance!
EDIT: English correction
EDIT2: The answer by Iballes is exactly what I wanted to do. Still, I am not sure how to take the norm of all the gradients and sum them. Since I have different layers in my CNN and weights with different shapes, if I just do what you suggested I get an error on the add_n() operation (since I am trying to add together matrices with different shapes). So probably I should do something like:
grad_norms = [tf.nn.l2_normalize(g[0], 0) for g in grads_and_vars]
grad_norm = [tf.reduce_sum(grads) for grads in grad_norms]
final_grad = tf.reduce_sum(grad_norm)
Can anyone confirm this?
Your line output = sess.run([input], feed_dict=some_dict) makes me think that you have a slight misunderstanding of the sess.run command. What you call [input] is supposed to be a list of tensors that are to be fetched by the sess.run command. Hence, it is an output rather than an input. To tackle your question, let's assume that you are doing something like output = sess.run(loss, feed_dict=some_dict) instead (in order to monitor the training loss).
Also, I suppose you want to formulate your stopping criterion using the norm of the gradient (the gradient itself is a multi-dimensional quantity). Hence, what you want to do is to fetch the norm of the gradient each time you execute the graph. To that end, you have to do two things. 1) Add the gradient norm to the computation graph. 2) Fetch it in each call to sess.run in your training loop.
Ad 1) You have added the gradients to the graph via
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(a_loss_function)
and now have the tensors holding the gradients in grads_and_vars (one for each trained variable in the graph). Let's take the norm of each gradient and then sum it up:
grad_norms = [tf.nn.l2_loss(g) for g, v in grads_and_vars]
grad_norm = tf.add_n(grad_norms)
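There you have a scalar measure of the gradient's size. Note that tf.nn.l2_loss(g) returns sum(g ** 2) / 2, so grad_norm is half the squared L2 norm over all gradients rather than the norm itself; it still shrinks as the gradients shrink, so it works as a stopping criterion as long as the threshold is chosen on that scale.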
Ad 2) Inside your loop, fetch the gradient norm alongside the loss by telling the sess.run command to do so.
for step in range(TotSteps):
    l, gn = sess.run([loss, grad_norm], feed_dict=some_dict)
    if(gn < some_treshold):
        print("Training finished.")
        break
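On the follow-up about gradients with different shapes: tf.nn.l2_loss reduces each gradient to a scalar (half its sum of squares), so the per-variable terms can be summed regardless of shape. The pure-Python sketch below mirrors that arithmetic on nested lists standing in for gradient tensors; flatten, half_squared_norm, and global_norm are hypothetical helper names, not TensorFlow functions, and the gradient values are made up.

```python
import math

def flatten(values):
    """Yield every scalar from an arbitrarily nested list or tuple."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield float(v)

def half_squared_norm(grad):
    """Same arithmetic as tf.nn.l2_loss: sum of squares divided by 2."""
    return 0.5 * sum(x * x for x in flatten(grad))

def global_norm(grads):
    """L2 norm over all gradients, recovered from the half-squared terms."""
    return math.sqrt(2.0 * sum(half_squared_norm(g) for g in grads))

# Two made-up gradients with different shapes: a 2x2 matrix and a vector.
grads = [[[3.0, 0.0], [0.0, 0.0]], [4.0]]

summed_l2_loss = sum(half_squared_norm(g) for g in grads)
norm = global_norm(grads)

print(summed_l2_loss)  # 12.5, the analogue of grad_norm in the answer
print(norm)            # 5.0, the actual L2 norm of all entries together
```

A threshold meant for the true norm can be converted to this scale by squaring it and dividing by 2.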
