Strange ordering of evaluations of variables and loss

Strange ordering of evaluations of variables and loss - python

I want to use tf.identity to copy the loss and variables either before or after the optimization step:
Here is the before case:
Copy the current loss and variables (save variables and associated loss)
Run one step of optimization (changes loss and values)
Repeat
Here is the after case:
Run one step of optimization (changes loss and values)
Copy the current loss and variables (save variables and associated loss)
Repeat
By "copy", I mean to create nodes in the computation graph to store the current values of loss and variables with tf.identity.
Somehow, this is what actually happens:
Copy loss
Run one step of optimization (changes loss and values)
Copy variable
(this value doesn't correspond to the loss saved in step 1)
Repeat
How to fix this?
I could evaluate the loss again right after step 3. But that means wasting one evaluation of the loss in every cycle.
Test:
Copy loss and variables
Run optimization step and copy loss and variables.
If copying of loss and variables always happen before the optimization step, then the copies made in step 1 and step 2 would be the same.
Otherwise, copies made in step 1 and step 2 can be different.
import numpy as np
import tensorflow as tf
x = tf.get_variable('x', initializer=np.array([1], dtype=np.float64))
loss = x * x
optim = tf.train.AdamOptimizer(1)
## Control Dependencies ##
loss_ident = tf.identity(loss) # <-- copy loss
x_ident = tf.identity(x) # <-- copy variable
with tf.control_dependencies([loss_ident, x_ident]):
train_op = optim.minimize(loss)
## Run ##
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init_op)
for i in range(1000):
# step 1
a_, x1_ = sess.run([loss, x_ident])
# step 2
b_, x2_ = sess.run([loss_ident, x_ident, train_op])[:-1]
print("loss:", a_, b_)
assert np.allclose(a_, b_)
print("variables:", x1_, x2_)
assert np.allclose(x1_, x2_)
Result:
step 1 step 2
loss: [1.] [1.]
variables: [1.] [1.58114875e-07] # <-- not the same
AssertionError
Unfortunately, the copies of the variable in step 1 and step 2 are different. Therefore, copying of the variable does not always happen before the optimization

I am not entirely clear why the control dependencies don't work with the Tensors. But you can get it to work with Variables and tf.assign(). Here's my solution. From my understanding all you need is the copy to happen before train_op. From the quick few tests I did, this seems to work.
import tensorflow as tf
tf.reset_default_graph()
x = tf.get_variable('x', initializer=np.array([1], dtype=np.float64))
x_ident = tf.get_variable('x_ident', initializer=np.array([1], dtype=np.float64))
loss = x * x
loss_ident = tf.get_variable('loss', initializer=np.array([1.0]), dtype=tf.float64)
optim = tf.train.AdamOptimizer(1)
## Control Dependencies ##
loss_ident = tf.assign(loss_ident, loss, name='loss_assign') # <-- copy loss
x_ident = tf.assign(x_ident, x, name='x_assign') # <-- copy variable
with tf.control_dependencies([x_ident, loss_ident]):
train_op = optim.minimize(loss)
## Run ##
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init_op)
for i in range(10):
# step 1
a, x1 = sess.run([loss_ident, x_ident])
# step 2
b, x2, _ = sess.run([loss_ident, x_ident, train_op])
#print("loss:", a_, b_)
print('ab',a,b)
print('x1x2',x1, x2)
assert np.allclose(a, b)
#print("variables:", x1_, x2_)
assert np.allclose(x1, x2)
Hopefully, this is what you're looking for.

Related

Pytorch backward does not compute the gradients for requested variables

I'm trying to train a resnet18 model on pytorch (+pytorch-lightning) with the use of Virtual Adversarial Training. During the computations required for this type of training I need to obtain the gradient of D (ie. the cross-entropy loss of the model) with regard to tensor r.
This should, in theory, happen in the following code snippet:
def generic_step(self, train_batch, batch_idx, step_type):
x, y = train_batch
unlabeled_idx = y is None
d = torch.rand(x.shape).to(x.device)
d = d/(torch.norm(d) + 1e-8)
pred_y = self.classifier(x)
y[unlabeled_idx] = pred_y[unlabeled_idx]
l = self.criterion(pred_y, y)
R_adv = torch.zeros_like(x)
for _ in range(self.ip):
r = self.xi * d
r.requires_grad = True
pred_hat = self.classifier(x + r)
# pred_hat = F.log_softmax(pred_hat, dim=1)
D = self.criterion(pred_hat, pred_y)
self.classifier.zero_grad()
D.requires_grad=True
D.backward()
R_adv += self.eps * r.grad / (torch.norm(r.grad) + 1e-8)
R_adv /= 32
loss = l + R_adv * self.a
loss.backward()
self.accuracy[step_type] = self.acc_metric(torch.argmax(pred_y, 1), y)
return loss
Here, to my understanding, r.grad should in theory be the gradient of D with respect to r. However, the code throws this at D.backward():
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
(full traceback excluded because this error is not helpful and technically "solved" as I know the cause for it, explained just below)
After some research and debugging it seems that in this situation D.backward() attempts to calculate dD/dD disregarding any previous mention of requires_grad=True. This is confirmed when I add D.requires_grad=True and I get D.grad=Tensor(1.,device='cuda:0') but r.grad=None.
Does anyone know why this may be happening?

In Lightning, .backward() and optimizer step are all handled under the hood. If you do it yourself like in the code above, it will mess with Lightning because it doesn't know you called backward yourself.
You can enable manual optimization in the LightningModule:
def __init__(self):
super().__init__()
# put this in your init
self.automatic_optimization = False
This tells Lightning that you are taking over calling backward and handling optimizer step + zero grad yourself. Don't forget to add that in your code above. You can access the optimizer and scheduler like so in your training step:
def training_step(self, batch, batch_idx):
optimizer = self.optimizers()
scheduler = self.lr_schedulers()
# do your training step
# don't forget to call:
# 1) backward 2) optimizer step 3) zero grad
Read more about manual optimization here.

linear regression by tensorflow gets noticeable mean square error

I am new to tensorflow and I am trying to implement a simple feed-forward network for regression, just for learning purposes. The complete executable code is as follows.
The regression mean squared error is around 6, which is quite large. It is a little unexpected because the function to regress is linear and simple 2*x+y, and I expect a better performance.
I am asking for help to check if I did anything wrong in the code. I carefully checked the matrix dimensions so that should be good, but it is possible that I misunderstand something so the network or the session is not properly configured (like, should I run the training session multiple times, instead of just one time (the code below enclosed by #TRAINING#)? I see in some examples they input data piece by piece, and run the training progressively. I run the training just one time and input all data).
If the code is good, maybe this is a modeling issue, but I really don't expect to use a complicated network for such a simple regression.
import tensorflow as tf
import numpy as np
from sklearn.metrics import mean_squared_error
# inputs are points from a 100x100 grid in domain [-2,2]x[-2,2], total 10000 points
lsp = np.linspace(-2,2,100)
gridx,gridy = np.meshgrid(lsp,lsp)
inputs = np.dstack((gridx,gridy))
inputs = inputs.reshape(-1,inputs.shape[-1]) # reshpaes the grid into a 10000x2 matrix
feature_size = inputs.shape[1] # feature_size is 2, features are the 2D coordinates of each point
input_size = inputs.shape[0] # input_size is 10000
# a simple function f(x)=2*x[0]+x[1] to regress
f = lambda x: 2 * x[0] + x[1]
label_size = 1
labels = f(inputs.transpose()).reshape(-1,1) # reshapes labels as a column vector
ph_inputs = tf.placeholder(tf.float32, shape=(None, feature_size), name='inputs')
ph_labels = tf.placeholder(tf.float32, shape=(None, label_size), name='labels')
# just one hidden layer with 16 units
hid1_size = 16
w1 = tf.Variable(tf.random_normal([hid1_size, feature_size], stddev=0.01), name='w1')
b1 = tf.Variable(tf.random_normal([hid1_size, label_size]), name='b1')
y1 = tf.nn.relu(tf.add(tf.matmul(w1, tf.transpose(ph_inputs)), b1))
# the output layer
wo = tf.Variable(tf.random_normal([label_size, hid1_size], stddev=0.01), name='wo')
bo = tf.Variable(tf.random_normal([label_size, label_size]), name='bo')
yo = tf.transpose(tf.add(tf.matmul(wo, y1), bo))
# defines optimizer and predictor
lr = tf.placeholder(tf.float32, shape=(), name='learning_rate')
loss = tf.losses.mean_squared_error(ph_labels,yo)
optimizer = tf.train.GradientDescentOptimizer(lr).minimize(loss)
predictor = tf.identity(yo)
# TRAINING
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
_, c = sess.run([optimizer, loss], feed_dict={lr:0.05, ph_inputs: inputs, ph_labels: labels})
# TRAINING
# gets the regression results
predictions = np.zeros((input_size,1))
for i in range(input_size):
predictions[i] = sess.run(predictor, feed_dict={ph_inputs: inputs[i, None]}).squeeze()
# prints regression MSE
print(mean_squared_error(predictions, labels))

You're right, you understood the problem by yourself.
The problem is, in fact, that you're running the optimization step only one time. Hence you're doing one single update step of your network parameter and therefore the cost won't decrease.
I just changed the training session of your code in order to make it work as expected (100 training steps):
# TRAINING
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(100):
_, c = sess.run(
[optimizer, loss],
feed_dict={
lr: 0.05,
ph_inputs: inputs,
ph_labels: labels
})
print("Train step {} loss value {}".format(i, c))
# TRAINING
and at the end of the training step I go:
Train step 99 loss value 0.04462708160281181
0.044106700712455045

Calculation of cost function by Iteration using Tensorflow

I have the following code extraction for calculation of cost function with iteration. Before that, feature scaling, resharp, lstm and training have been done and using the same set of data and variables to perform the calculation of cost function.
# learning parameter
learning_rate = 0.01
# iterative parameters
EPOCHS = 1000 # number of iterations
PRINT_STEP = 100 # the interval of printing validation result
# read data and data preprocessings
read_data_pd = pd.read_csv('./price.csv')
input_pd = read_data_pd.drop(['year','month','day'], axis=1)
temp_pd = feature_scaling(input_pd[feature_to_scale],sacling_meathod) # call the function feature scaling
input_pd[feature_to_scale] = temp_pd
x_ = tf.placeholder(tf.float32, [None, batch_size, n_features])
y_ = tf.placeholder(tf.float32, [None, 1])
# call the lstm-rnn function
lstm_output = lstm(x_, n_features, batch_size, n_lstm_layers, lstm_scope_name)
# linear regressor
# w is the weight vector
W = tf.Variable(tf.random_normal([n_features, n_pred_class]))
# b is the bias
b = tf.Variable(tf.random_normal([n_pred_class]))
# Y = WX + b
y = tf.matmul(lstm_output, W) + b
#define the cost function
cost_func = tf.reduce_mean(tf.square(y - y_))
train_op = tf.train.AdamOptimizer(learning_rate).minimize(cost_func)
# initialize all variables
init = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init)
for ii in range(EPOCHS):
sess.run(train_op, feed_dict={x_:train_input_nparr, y_:train_target_nparr})
if ii % PRINT_STEP == 0:
cost = sess.run(cost_func, feed_dict={x_:train_input_nparr, y_:train_target_nparr})
print 'iteration =', ii, 'training cost:', cost
When I run the program, the print of calculation results of cost function is work. Yet, each time when the program runs, the results is different. For example, the results of 100 iteration sometimes will print 0.868856, but sometime might be 0.905526, which means the code have some problem.
One thing I noticed is the lines with initialize all variables: tf.initialize_all_variables() as message said that
initialize_all_variables is deprecated and will be removed after 2017-03-02.
Instructions for updating: Use `tf.global_variables_initializer` instead.
I follow the instruction, but it has no effect to amend the calculation error.
Therefore, I would like to know what is wrong with the codes and also how to correct it?

Can one only implement gradient descent like optimizers with the code example from processing gradients in TensorFlow?

I was looking at the example code for processing gradients that TensorFlow has:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
however, I noticed that the apply_gradients function was derived from the GradientDescentOptimizer. Does that mean that using the example code from above, one can only implement gradient like descent rules (notice we could change the opt = GradientDescentOptimizer or Adam or any of the the other optimizers)? In particular, what does apply_gradients do? I definitively check the code in the tf github page but it was a bunch of python that had nothing to do with mathematical expressions, so it was hard to tell what that was doing and how it changed from optimizer to optimizer.
For example, if I wanted to implement my own custom optimizer that might use gradients (or might not e.g. just change the weights directly with some rule, maybe more biologically plausible rule), its not possible with the above example code?
In particular I wanted to implement a gradient descent version that is artificially restricted in a compact domain. In particular I wanted to implement the following equation:
w := (w - mu*grad + eps) mod B
in TensorFlow. I realized that the following is true:
w := w mod B - mu*grad mod B + eps mod B
so I thought that I could just implement it by doing:
def Process_grads(g,mu_noise,stddev_noise,B):
return (g+tf.random_normal(tf.shape(g),mean=mu_noise,stddev=stddev_noise) ) % B
and then just having:
processed_grads_and_vars = [(Process_grads(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the processed gradients.
opt.apply_gradients(processed_grads_and_vars)
however, I realized that that wasn't good enough because I don't actually have access to w so I can't implement:
w mod B
at least not the way I tried. Is there a way to do this? i.e. to actually directly change the update rule? At least the way I tried?
I know its sort of a hacky update rule, but my point is more to change the update equation than actually caring to much about that update rule (so don't get hung up on it if its a bit weird).
I came up with super hacky solution:
def manual_update_GDL(arg,learning_rate,g,mu_noise,stddev_noise):
with tf.variable_scope(arg.mdl_scope_name,reuse=True):
W_var = tf.get_variable(name='W')
eps = tf.random_normal(tf.shape(g),mean=mu_noise,stddev=stddev_noise)
#
W_new = tf.mod( W_var - learning_rate*g + eps , 20)
sess.run( W_var.assign(W_new) )
def manual_GDL(arg,loss,learning_rate,mu_noise,stddev_noise,compact,B):
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss)
# process gradients
processed_grads_and_vars = [(manual_update_GDL(arg,learning_rate,g,mu_noise,stddev_noise), v) for g,v in grads_and_vars]
not sure if it works but something like that should work in general. The idea is to just write down the equation one wants to use (in TensorFlow) for the learning rate and then update the weights manually using a session.
Unfortunately, such a solution means we have to take care of the annealing (decaying learning rate manually which seems annoying). This solution probably has many other problems, feel free to point them out (and give solutions if you can).
For this very simple problem I realized one can just do the normal optimizer update rule and then just take the mod of the weights and re-assign them to their value:
sess.run(fetches=train_step)
if arg.compact:
# apply w := ( w - mu*g + eps ) mod B
W_val = W_var.eval()
W_new = tf.mod(W_var,arg.B).eval()
W_var.assign(W_new).eval()
but in this case its a coincidence that such a simple solution exists (unfortunately, bypasses the whole point of my question).
Actually, this solutions slows down the code a lot. For the moment is the best that I've got.
As a reference, I have seen this question: How to create an optimizer in Tensorflow , but didn't find it responded directly to my question.

Your solution slows down the code because you use the sess.run and .eval() code during your "train_step" creation. Instead you should create the train_step graph using only internal tensorflow functions (without using sess.run and .eval()). Thereafter you only evaluate the train_step in a loop.
If you don't want to use any standard optimizer you can write your own "apply gradient" graph. Here is one possible solution for that:
learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01
#add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(some_loss, train_w_vars_list)
assign_list = []
for g, v in zip(grad, train_w_vars_list):
eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))
#also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))
train_step = tf.group(*assign_list)
You can also use one of the standard optimizer to create the grads_and_vars list (use it instead of zip(grad, train_w_vars_list) then).
Here is a simple example for MNIST with your loss:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
# Import data
mnist = input_data.read_data_sets('PATH TO MNIST_data', one_hot=True)
# Create the model
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
y = tf.matmul(x, W)
# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01
#add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(cross_entropy, train_w_vars_list)
assign_list = []
for g, v in zip(grad, train_w_vars_list):
eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))
#also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))
train_step = tf.group(*assign_list)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
# Train
for _ in range(1000):
batch_xs, batch_ys = mnist.train.next_batch(100)
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
# Test trained model
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images,
y_: mnist.test.labels}))

Indeed you are somewhat restricted and cannot do anything. However, what you wish to do can easily be done by making your daughter class of the tensorflow Optimizer class.
All you need to do is write an _apply_dense method for your class. The _apply_dense method takes grad and w as arguments, so anything you want to do with these to variables you can do.
Look here for example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py
That is the implementation of Adam in tensorflow, all you need to do is change the _apply_dense at line 131 as well as the _prepare and _finish methods.
So for example:
def _apply_dense(self, grad, var):
B = math_ops.cast(self.B, var.dtype.base_dtype)
eps = math_ops.cast(self.eps, var.dtype.base_dtype)
mu = math_ops.cast(self.mu, var.dtype.base_dtype)
var_update = state_ops.assign(var, tf.floormod(var - mu*grad + eps,B),
use_locking=self._use_locking)
return var_update

How to set layer-wise learning rate in Tensorflow?

I am wondering if there is a way that I can use different learning rate for different layers like what is in Caffe. I am trying to modify a pre-trained model and use it for other tasks. What I want is to speed up the training for new added layers and keep the trained layers at low learning rate in order to prevent them from being distorted. for example, I have a 5-conv-layer pre-trained model. Now I add a new conv layer and fine tune it. The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?

It can be achieved quite easily with 2 optimizers:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
train_op1 = GradientDescentOptimizer(0.00001).minimize(loss, var_list=var_list1)
train_op2 = GradientDescentOptimizer(0.0001).minimize(loss, var_list=var_list2)
train_op = tf.group(train_op1, train_op2)
One disadvantage of this implementation is that it computes tf.gradients(.) twice inside the optimizers and thus it might not be optimal in terms of execution speed. This can be mitigated by explicitly calling tf.gradients(.), splitting the list into 2 and passing corresponding gradients to both optimizers.
Related question: Holding variables constant during optimizer
EDIT: Added more efficient but longer implementation:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
opt1 = tf.train.GradientDescentOptimizer(0.00001)
opt2 = tf.train.GradientDescentOptimizer(0.0001)
grads = tf.gradients(loss, var_list1 + var_list2)
grads1 = grads[:len(var_list1)]
grads2 = grads[len(var_list1):]
tran_op1 = opt1.apply_gradients(zip(grads1, var_list1))
train_op2 = opt2.apply_gradients(zip(grads2, var_list2))
train_op = tf.group(train_op1, train_op2)
You can use tf.trainable_variables() to get all training variables and decide to select from them.
The difference is that in the first implementation tf.gradients(.) is called twice inside the optimizers. This may cause some redundant operations to be executed (e.g. gradients on the first layer can reuse some computations for the gradients of the following layers).

Tensorflow 1.7 introduced tf.custom_gradient that greatly simplifies setting learning rate multipliers, in a way that is now compatible with any optimizer, including those accumulating gradient statistics. For example,
import tensorflow as tf
def lr_mult(alpha):
#tf.custom_gradient
def _lr_mult(x):
def grad(dy):
return dy * alpha * tf.ones_like(x)
return x, grad
return _lr_mult
x0 = tf.Variable(1.)
x1 = tf.Variable(1.)
loss = tf.square(x0) + tf.square(lr_mult(0.1)(x1))
step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
tf.local_variables_initializer().run()
for _ in range(5):
sess.run([step])
print(sess.run([x0, x1, loss]))

Update Jan 22: recipe below is only a good idea for GradientDescentOptimizer , other optimizers that keep a running average will apply learning rate before the parameter update, so recipe below won't affect that part of the equation
In addition to Rafal's approach, you could use compute_gradients, apply_gradients interface of Optimizer. For instance, here's a toy network where I use 2x the learning rate for second parameter
x = tf.Variable(tf.ones([]))
y = tf.Variable(tf.zeros([]))
loss = tf.square(x-y)
global_step = tf.Variable(0, name="global_step", trainable=False)
opt = tf.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss, [x, y])
ygrad, _ = grads_and_vars[1]
train_op = opt.apply_gradients([grads_and_vars[0], (ygrad*2, y)], global_step=global_step)
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
for i in range(5):
sess.run([train_op, loss, global_step])
print sess.run([x, y])
You should see
[0.80000001, 0.40000001]
[0.72000003, 0.56]
[0.68800002, 0.62400001]
[0.67520005, 0.64960003]
[0.67008007, 0.65984005]

Collect learning rate multipliers for each variable like:
self.lr_multipliers[var.op.name] = lr_mult
and then apply them during before applying the gradients like:
def _train_op(self):
tf.scalar_summary('learning_rate', self._lr_placeholder)
opt = tf.train.GradientDescentOptimizer(self._lr_placeholder)
grads_and_vars = opt.compute_gradients(self._loss)
grads_and_vars_mult = []
for grad, var in grads_and_vars:
grad *= self._network.lr_multipliers[var.op.name]
grads_and_vars_mult.append((grad, var))
tf.histogram_summary('variables/' + var.op.name, var)
tf.histogram_summary('gradients/' + var.op.name, grad)
return opt.apply_gradients(grads_and_vars_mult)
You can find the whole example here.

A slight variation of Sergey Demyanov answer, where you only have to specify the learning rates you would like to change
from collections import defaultdict
self.learning_rates = defaultdict(lambda: 1.0)
...
x = tf.layers.Dense(3)(x)
self.learning_rates[x.op.name] = 2.0
...
optimizer = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads_and_vars_mult = []
for grad, var in grads_and_vars:
grad *= self.learning_rates[var.op.name]
grads_and_vars_mult.append((grad, var))
train_op = optimizer.apply_gradients(grads_and_vars_mult, tf.train.get_global_step())

The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
There is an easy way to do that using tf.stop_gradient.
Here is an example with 3 layers:
x = layer1(input)
x = layer2(x)
output = layer3(x)
You can shrink your gradient in the first two layers by a ratio of 1/100:
x = layer1(input)
x = layer2(x)
x = 1/100*x + (1-1/100)*tf.stop_gradient(x)
output = layer3(x)
On the layer2, the "flow" is split in two branches: one which has a contribution of 1/100 computes its gradient regularly but with a gradient magnitude shrinked by a proportion of 1/100, the other branch provides the remaining "flow" without contributing to the gradient because of the tf.stop_gradient operator. As a result, if you use a learning rate of 0.001 on your model optimizer, the first two layers will virtually have a learning rate of 0.00001.

If you happen to be using tf.slim + slim.learning.create_train_op there is a nice example here:
https://github.com/google-research/tf-slim/blob/master/tf_slim/learning.py#L65
# Create the train_op and scale the gradients by providing a map from variable
# name (or variable) to a scaling coefficient:
gradient_multipliers = {
'conv0/weights': 1.2,
'fc8/weights': 3.4,
}
train_op = slim.learning.create_train_op(
total_loss,
optimizer,
gradient_multipliers=gradient_multipliers)
Unfortunately it doesn't seem possible to use a tf.Variable instead of a float value if you want to gradually modify the multiplier.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Strange ordering of evaluations of variables and loss - python

Related

Pytorch backward does not compute the gradients for requested variables

linear regression by tensorflow gets noticeable mean square error

Calculation of cost function by Iteration using Tensorflow

Can one only implement gradient descent like optimizers with the code example from processing gradients in TensorFlow?

How to set layer-wise learning rate in Tensorflow?

Categories

Resources