How to set layer-wise learning rate in Tensorflow?

I am wondering whether there is a way to use different learning rates for different layers, as in Caffe. I am trying to modify a pre-trained model and use it for other tasks. I want to speed up training for the newly added layers while keeping the trained layers at a low learning rate so that they are not distorted. For example, I have a pre-trained model with 5 conv layers. Now I add a new conv layer and fine-tune it. The first 5 layers would have a learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?

It can be achieved quite easily with 2 optimizers:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
train_op1 = tf.train.GradientDescentOptimizer(0.00001).minimize(loss, var_list=var_list1)
train_op2 = tf.train.GradientDescentOptimizer(0.0001).minimize(loss, var_list=var_list2)
train_op = tf.group(train_op1, train_op2)
One disadvantage of this implementation is that it computes tf.gradients(.) twice inside the optimizers and thus it might not be optimal in terms of execution speed. This can be mitigated by explicitly calling tf.gradients(.), splitting the list into 2 and passing corresponding gradients to both optimizers.
Related question: Holding variables constant during optimizer
EDIT: Added more efficient but longer implementation:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
opt1 = tf.train.GradientDescentOptimizer(0.00001)
opt2 = tf.train.GradientDescentOptimizer(0.0001)
grads = tf.gradients(loss, var_list1 + var_list2)
grads1 = grads[:len(var_list1)]
grads2 = grads[len(var_list1):]
train_op1 = opt1.apply_gradients(zip(grads1, var_list1))
train_op2 = opt2.apply_gradients(zip(grads2, var_list2))
train_op = tf.group(train_op1, train_op2)
You can use tf.trainable_variables() to get all trainable variables and select from them.
The difference is that in the first implementation tf.gradients(.) is called twice inside the optimizers. This may cause some redundant operations to be executed (e.g. gradients on the first layer can reuse some computations for the gradients of the following layers).
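The slicing logic of the more efficient version can be checked without TensorFlow. The following is a minimal, framework-free sketch: the toy loss sum(p ** 2), the helper `gradients`, and the parameter values are illustrative assumptions, not part of the answer. It computes all gradients in one pass, splits them by group, and applies a different learning rate to each group.

```python
def gradients(params):
    """Gradient of the toy loss sum(p ** 2) with respect to every parameter."""
    return [2.0 * p for p in params]

# Hypothetical stand-ins for var_list1 (pre-trained layers) and var_list2 (new layer).
pretrained = [1.0, -1.0]
new_layer = [1.0]
lr_pretrained, lr_new = 0.00001, 0.0001

# One gradient pass over all parameters, then split by group.
grads = gradients(pretrained + new_layer)
grads1 = grads[:len(pretrained)]
grads2 = grads[len(pretrained):]

# Each group is updated with its own learning rate.
pretrained = [p - lr_pretrained * g for p, g in zip(pretrained, grads1)]
new_layer = [p - lr_new * g for p, g in zip(new_layer, grads2)]

print(pretrained, new_layer)
```

The pre-trained parameters barely move, while the new layer moves ten times further per step, which is the behaviour the two-optimizer setup is meant to produce.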

Tensorflow 1.7 introduced tf.custom_gradient that greatly simplifies setting learning rate multipliers, in a way that is now compatible with any optimizer, including those accumulating gradient statistics. For example,
import tensorflow as tf

def lr_mult(alpha):
    @tf.custom_gradient
    def _lr_mult(x):
        # Identity in the forward pass; the gradient is scaled by alpha.
        def grad(dy):
            return dy * alpha * tf.ones_like(x)
        return x, grad
    return _lr_mult

x0 = tf.Variable(1.)
x1 = tf.Variable(1.)
loss = tf.square(x0) + tf.square(lr_mult(0.1)(x1))
step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
tf.local_variables_initializer().run()
for _ in range(5):
    sess.run([step])
    print(sess.run([x0, x1, loss]))

Update Jan 22: the recipe below is only a good idea for GradientDescentOptimizer. Other optimizers that keep a running average will apply the learning rate before the parameter update, so the recipe below won't affect that part of the equation.
In addition to Rafal's approach, you could use the compute_gradients / apply_gradients interface of Optimizer. For instance, here's a toy network where I use 2x the learning rate for the second parameter:
x = tf.Variable(tf.ones([]))
y = tf.Variable(tf.zeros([]))
loss = tf.square(x-y)
global_step = tf.Variable(0, name="global_step", trainable=False)
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss, [x, y])
ygrad, _ = grads_and_vars[1]
# Doubling the gradient of y doubles its effective learning rate.
train_op = opt.apply_gradients([grads_and_vars[0], (ygrad*2, y)], global_step=global_step)
init_op = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init_op)
for i in range(5):
    sess.run([train_op, loss, global_step])
    print(sess.run([x, y]))
You should see
[0.80000001, 0.40000001]
[0.72000003, 0.56]
[0.68800002, 0.62400001]
[0.67520005, 0.64960003]
[0.67008007, 0.65984005]
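One way to read the caveat in the update note is that optimizers which normalise by gradient statistics largely cancel a constant gradient multiplier. The sketch below is framework-free and is not TensorFlow's implementation: `sgd_step` and `adam_first_step` are hypothetical helpers that apply the textbook update rules to a single scalar, and the gradient value is made up.

```python
import math

def sgd_step(grad, lr=0.1):
    """Plain gradient descent: the step is proportional to the gradient."""
    return -lr * grad

def adam_first_step(grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step for one scalar (textbook form with bias correction)."""
    m = (1 - beta1) * grad           # first-moment estimate
    v = (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1)          # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

g = 0.5  # illustrative gradient value
sgd_ratio = sgd_step(2 * g) / sgd_step(g)
adam_ratio = adam_first_step(2 * g) / adam_first_step(g)

print(sgd_ratio)   # the multiplier carries through to the update
print(adam_ratio)  # the multiplier is normalised away, up to eps
```

With plain gradient descent the step doubles when the gradient is doubled, so a gradient multiplier behaves like a learning rate multiplier. With the Adam-style step the ratio stays close to 1, so scaling the gradient has almost no effect on the update size.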

Collect learning rate multipliers for each variable like:
self.lr_multipliers[var.op.name] = lr_mult
and then apply them before applying the gradients, like:
def _train_op(self):
    tf.scalar_summary('learning_rate', self._lr_placeholder)
    opt = tf.train.GradientDescentOptimizer(self._lr_placeholder)
    grads_and_vars = opt.compute_gradients(self._loss)
    grads_and_vars_mult = []
    for grad, var in grads_and_vars:
        grad *= self._network.lr_multipliers[var.op.name]
        grads_and_vars_mult.append((grad, var))
        tf.histogram_summary('variables/' + var.op.name, var)
        tf.histogram_summary('gradients/' + var.op.name, grad)
    return opt.apply_gradients(grads_and_vars_mult)
You can find the whole example here.

A slight variation of Sergey Demyanov's answer, where you only have to specify the learning rates you would like to change:
from collections import defaultdict
self.learning_rates = defaultdict(lambda: 1.0)
...
x = tf.layers.Dense(3)(x)
self.learning_rates[x.op.name] = 2.0
...
optimizer = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads_and_vars_mult = []
for grad, var in grads_and_vars:
    grad *= self.learning_rates[var.op.name]
    grads_and_vars_mult.append((grad, var))
train_op = optimizer.apply_gradients(grads_and_vars_mult, tf.train.get_global_step())

The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
There is an easy way to do that using tf.stop_gradient.
Here is an example with 3 layers:
x = layer1(input)
x = layer2(x)
output = layer3(x)
You can shrink your gradient in the first two layers by a ratio of 1/100:
x = layer1(input)
x = layer2(x)
x = 1/100*x + (1-1/100)*tf.stop_gradient(x)
output = layer3(x)
At layer2, the "flow" is split into two branches. One branch, with a contribution of 1/100, computes its gradient as usual, but with the gradient magnitude shrunk by a factor of 1/100. The other branch provides the remaining "flow" without contributing to the gradient, because of the tf.stop_gradient operator. As a result, if you use a learning rate of 0.001 in your model optimizer, the first two layers will effectively have a learning rate of 0.00001.
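The effect of the `1/100*x + (1-1/100)*tf.stop_gradient(x)` line can be illustrated without TensorFlow. The sketch below uses a tiny forward-mode "dual number" class; `Dual`, `stop_gradient`, and `scale_gradient` are hypothetical stand-ins written for this illustration, and the numbers are made up. It shows that the forward value is unchanged while the derivative is multiplied by the ratio.

```python
class Dual:
    """Minimal forward-mode value: a number together with its derivative."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule for the derivative part.
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__

def stop_gradient(x):
    """Keep the forward value, drop the derivative (stand-in for tf.stop_gradient)."""
    return Dual(x.value, 0.0)

def scale_gradient(x, ratio):
    """Same forward value as x, derivative multiplied by ratio."""
    return ratio * x + (1 - ratio) * stop_gradient(x)

# x is a layer output that depends on an upstream weight w, with dx/dw = 3.0.
x = Dual(2.0, 3.0)
y = scale_gradient(x, 1 / 100)
print(y.value, y.deriv)
```

The printed value matches `x.value` up to rounding, and the derivative is 1/100 of `x.deriv`, which is why the layers upstream of that line see a smaller effective learning rate under plain gradient descent.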

If you happen to be using tf.slim + slim.learning.create_train_op there is a nice example here:
https://github.com/google-research/tf-slim/blob/master/tf_slim/learning.py#L65
# Create the train_op and scale the gradients by providing a map from variable
# name (or variable) to a scaling coefficient:
gradient_multipliers = {
    'conv0/weights': 1.2,
    'fc8/weights': 3.4,
}
train_op = slim.learning.create_train_op(
    total_loss,
    optimizer,
    gradient_multipliers=gradient_multipliers)
Unfortunately it doesn't seem possible to use a tf.Variable instead of a float value if you want to gradually modify the multiplier.

Related

Pytorch backward does not compute the gradients for requested variables

I'm trying to train a resnet18 model on pytorch (+pytorch-lightning) with the use of Virtual Adversarial Training. During the computations required for this type of training I need to obtain the gradient of D (ie. the cross-entropy loss of the model) with regard to tensor r.
This should, in theory, happen in the following code snippet:
def generic_step(self, train_batch, batch_idx, step_type):
    x, y = train_batch
    unlabeled_idx = y is None
    d = torch.rand(x.shape).to(x.device)
    d = d/(torch.norm(d) + 1e-8)
    pred_y = self.classifier(x)
    y[unlabeled_idx] = pred_y[unlabeled_idx]
    l = self.criterion(pred_y, y)
    R_adv = torch.zeros_like(x)
    for _ in range(self.ip):
        r = self.xi * d
        r.requires_grad = True
        pred_hat = self.classifier(x + r)
        # pred_hat = F.log_softmax(pred_hat, dim=1)
        D = self.criterion(pred_hat, pred_y)
        self.classifier.zero_grad()
        D.requires_grad=True
        D.backward()
        R_adv += self.eps * r.grad / (torch.norm(r.grad) + 1e-8)
    R_adv /= 32
    loss = l + R_adv * self.a
    loss.backward()
    self.accuracy[step_type] = self.acc_metric(torch.argmax(pred_y, 1), y)
    return loss
Here, to my understanding, r.grad should in theory be the gradient of D with respect to r. However, the code throws this at D.backward():
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
(full traceback excluded because this error is not helpful and technically "solved" as I know the cause for it, explained just below)
After some research and debugging it seems that in this situation D.backward() attempts to calculate dD/dD disregarding any previous mention of requires_grad=True. This is confirmed when I add D.requires_grad=True and I get D.grad=Tensor(1.,device='cuda:0') but r.grad=None.
Does anyone know why this may be happening?

In Lightning, .backward() and optimizer step are all handled under the hood. If you do it yourself like in the code above, it will mess with Lightning because it doesn't know you called backward yourself.
You can enable manual optimization in the LightningModule:
def __init__(self):
    super().__init__()
    # put this in your init
    self.automatic_optimization = False
This tells Lightning that you are taking over calling backward and handling optimizer step + zero grad yourself. Don't forget to add that in your code above. You can access the optimizer and scheduler like so in your training step:
def training_step(self, batch, batch_idx):
    optimizer = self.optimizers()
    scheduler = self.lr_schedulers()
    # do your training step
    # don't forget to call:
    # 1) backward 2) optimizer step 3) zero grad
Read more about manual optimization here.

Why PyTorch optimizer might fail to update its parameters?

I am trying to do a simple loss-minimization for a specific variable coeff using PyTorch optimizers. This variable is supposed to be used as an interpolation coefficient for two vectors w_foo and w_bar to find a third vector, w_target.
w_target = w_foo + coeff * (w_bar - w_foo)
With w_foo and w_bar set as constant, at each optimization step I calculate w_target for the given coeff. Loss is determined from w_target using a fairly complex process beyond the scope of this question.
# w_foo.shape = [1, 16, 512]
# w_bar.shape = [1, 16, 512]
# num_layers = 16
# num_steps = 10000
vgg_loss = VGGLoss()
coeff = torch.randn([num_layers, ])
optimizer = torch.optim.Adam([coeff], lr=initial_learning_rate)
for step in range(num_steps):
    w_target = w_foo + torch.matmul(coeff, (w_bar - w_foo))
    optimizer.zero_grad()
    target_image = generator.synthesis(w_target)
    processed_target_image = process(target_image)
    loss = vgg_loss(processed_target_image, source_image)
    loss.backward()
    optimizer.step()
However, when running this optimizer, I am met with query_opt not changing from one step to another, optimizer being essentially useless. I would like to ask for some advice on what I am doing wrong here.
Edit:
As suggested, I will try to elaborate on the loss function. Essentially, w_target is used to generate an image, and VGGLoss uses VGG feature extractor to compare this synthetic image with a certain exemplar source image.
class VGGLoss(torch.nn.Module):
    def __init__(self, device, vgg):
        super().__init__()
        for param in self.parameters():
            param.requires_grad = True
        self.vgg = vgg # VGG16 in eval mode

    def forward(self, source, target):
        loss = 0
        source_features = self.vgg(source, resize_images=False, return_lpips=True)
        target_features = self.vgg(target, resize_images=False, return_lpips=True)
        loss += (source_features - target_features).square().sum()
        return loss

Cost remains same at 0.6932 in Siamese Network

I am trying to implement a Siamese Network, as in this paper
In this paper, they have used cross entropy for the Loss function
I am using STL-10 Dataset for training and instead of the 3 layer network used in the paper, I replaced it with VGG-13 CNN network, except the last logit layer.
Here is my loss function code
def loss(pred,true_pred):
    cross_entropy_loss = tf.multiply(-1.0,tf.reduce_mean(tf.add(tf.multiply(true_pred,tf.log(pred)),tf.multiply((1-true_pred),tf.log(tf.subtract(1.0,pred))))))
    total_loss = tf.add(tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)),cross_entropy_loss,name='total_loss')
    return cross_entropy_loss,total_loss

with tf.device('/gpu:0'):
    h1 = siamese(feed_image1)
    h2 = siamese(feed_image2)
    l1_dist = tf.abs(tf.subtract(h1,h2))
    with tf.variable_scope('pred') as scope:
        predictions = tf.contrib.layers.fully_connected(l1_dist,1,activation_fn = tf.sigmoid,weights_initializer = tf.contrib.layers.xavier_initializer(uniform=False),weights_regularizer = tf.contrib.layers.l2_regularizer(tf.constant(0.001, dtype=tf.float32)))
        celoss,cost = loss(predictions,feed_labels)
    with tf.variable_scope('adam_optimizer') as scope:
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
        opt = optimizer.minimize(cost)
However, when I run the training, the cost remains almost constant at 0.6932
I have used Adam Optimizer here.
But previously I used Momentum Optimizer.
I have tried changing the learning rate but the cost still behaves the same.
And all the prediction values converge to 0.5 after a few iterations.
After taking the output for two batches of images (input1 and input2), I take their L1 distance and to that I have connected a fully connected layer with a single output and sigmoid activation function.
[h1 and h2 contains the output of the last fully connected layer(not the logit layer) of the VGG-13 network]
Since the output activation function is sigmoid, and since the prediction values are around 0.5, we can calculate and say that the sum of the weighted L1 distance of the output of the two networks is near to zero.
I can't understand where I am going wrong.
A little help will be very much appreciated.

I think the non-convergence may be caused by vanishing gradients. You can trace the gradients using tf.contrib.layers.optimize_loss and TensorBoard. You can refer to this answer for more details.
Several possible improvements:
1) Don't write the cross entropy yourself.
You can use the sigmoid cross entropy with logits API, since it ensures numerical stability, as documented:
max(x, 0) - x * z + log(1 + exp(-abs(x)))
2) Some weight normalization may help.
3) Keep the regularization loss small.
You can read this answer for more information.
4) I don't see the need to apply tf.abs to the L1 distance.
And here is the code I modified. Hope it helps.
mode = "training"
rl_rate = .1
with tf.device('/gpu:0'):
    h1 = siamese(feed_image1)
    h2 = siamese(feed_image2)
    l1_dist = tf.subtract(h1, h2)
    # is it necessary to use abs?
    l1_dist_norm = tf.layers.batch_normalization(l1_dist, training=(mode=="training"))
    with tf.variable_scope('logits') as scope:
        # tf.get_variable takes `initializer` / `regularizer` and needs a static shape
        w = tf.get_variable('fully_connected_weights', [l1_dist.get_shape().as_list()[-1], 1],
                initializer = tf.contrib.layers.xavier_initializer(uniform=False), regularizer = tf.contrib.layers.l2_regularizer(tf.constant(0.001, dtype=tf.float32))
                )
        logits = tf.tensordot(l1_dist_norm, w, axes=1)
        # reduce to a scalar so it can be combined with the regularization term
        xent_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=feed_labels))
        total_loss = tf.add(tf.reduce_sum(rl_rate * tf.abs(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))), (1-rl_rate) * xent_loss, name='total_loss')
        # or:
        # weights = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
        # l1_regularizer = tf.contrib.layers.l1_regularizer()
        # regularization_loss = tf.contrib.layers.apply_regularization(l1_regularizer, weights)
        # total_loss = xent_loss + regularization_loss
    with tf.variable_scope('adam_optimizer') as scope:
        optimizer = tf.train.AdamOptimizer(learning_rate=0.0005)
        # global_step, learning_rate and max_grad_norm are assumed to be defined elsewhere
        opt = tf.contrib.layers.optimize_loss(total_loss, global_step, learning_rate=learning_rate, optimizer="Adam", clip_gradients=max_grad_norm, summaries=["gradients"])
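The stable formula quoted in point 1, and the value 0.6932 from the question, can both be checked with plain Python. This is a minimal sketch, not TensorFlow's code: `naive_xent` and `stable_xent` are hypothetical helper names, with x the logit and z the label as in the formula above.

```python
import math

def naive_xent(logit, label):
    """Cross entropy computed from the sigmoid output, as in the question."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def stable_xent(logit, label):
    """max(x, 0) - x * z + log(1 + exp(-abs(x))), the form quoted above."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# A prediction of exactly 0.5 (logit 0) costs ln 2, whatever the label is.
print(stable_xent(0.0, 1.0), math.log(2))

# For moderate logits both forms agree.
print(naive_xent(2.0, 1.0), stable_xent(2.0, 1.0))

# For a large logit the sigmoid rounds to 1.0 and log(1 - p) fails.
try:
    naive_xent(40.0, 0.0)
except ValueError as err:
    print("naive form failed:", err)
print(stable_xent(40.0, 0.0))
```

Since ln 2 is about 0.6931, a cost that sits near 0.6932 is consistent with the observation that all predictions have collapsed to 0.5.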

Can one only implement gradient descent like optimizers with the code example from processing gradients in TensorFlow?

I was looking at the example code for processing gradients that TensorFlow has:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
However, I noticed that the apply_gradients function was derived from the GradientDescentOptimizer. Does that mean that, using the example code above, one can only implement gradient-descent-like rules (notice we could change opt to GradientDescentOptimizer, Adam, or any of the other optimizers)? In particular, what does apply_gradients do? I did check the code on the TF GitHub page, but it was a bunch of Python that had nothing to do with mathematical expressions, so it was hard to tell what it was doing and how it changed from optimizer to optimizer.
For example, if I wanted to implement my own custom optimizer that might use gradients (or might not, e.g. just change the weights directly with some rule, maybe a more biologically plausible rule), is it not possible with the above example code?
Specifically, I wanted to implement a version of gradient descent that is artificially restricted to a compact domain, i.e. the following equation:
w := (w - mu*grad + eps) mod B
in TensorFlow. I realized that the following is true:
w := w mod B - mu*grad mod B + eps mod B
so I thought that I could just implement it by doing:
def Process_grads(g,mu_noise,stddev_noise,B):
    return (g+tf.random_normal(tf.shape(g),mean=mu_noise,stddev=stddev_noise) ) % B
and then just having:
processed_grads_and_vars = [(Process_grads(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the processed gradients.
opt.apply_gradients(processed_grads_and_vars)
however, I realized that that wasn't good enough because I don't actually have access to w so I can't implement:
w mod B
at least not the way I tried. Is there a way to do this, i.e. to directly change the update rule?
I know it's sort of a hacky update rule, but my point is more about changing the update equation than about that particular rule (so don't get hung up on it if it's a bit weird).
I came up with super hacky solution:
def manual_update_GDL(arg,learning_rate,g,mu_noise,stddev_noise):
    with tf.variable_scope(arg.mdl_scope_name,reuse=True):
        W_var = tf.get_variable(name='W')
        eps = tf.random_normal(tf.shape(g),mean=mu_noise,stddev=stddev_noise)
        #
        W_new = tf.mod( W_var - learning_rate*g + eps , 20)
        sess.run( W_var.assign(W_new) )

def manual_GDL(arg,loss,learning_rate,mu_noise,stddev_noise,compact,B):
    # Compute the gradients for a list of variables.
    grads_and_vars = opt.compute_gradients(loss)
    # process gradients
    processed_grads_and_vars = [(manual_update_GDL(arg,learning_rate,g,mu_noise,stddev_noise), v) for g,v in grads_and_vars]
I'm not sure if it works, but something like that should work in general. The idea is to just write down the equation one wants to use (in TensorFlow) for the learning rule and then update the weights manually using a session.
Unfortunately, such a solution means we have to take care of the annealing (decaying the learning rate) manually, which seems annoying. This solution probably has many other problems; feel free to point them out (and give solutions if you can).
For this very simple problem I realized one can just do the normal optimizer update rule and then just take the mod of the weights and re-assign them to their value:
sess.run(fetches=train_step)
if arg.compact:
    # apply w := ( w - mu*g + eps ) mod B
    W_val = W_var.eval()
    W_new = tf.mod(W_var,arg.B).eval()
    W_var.assign(W_new).eval()
but in this case it's a coincidence that such a simple solution exists (unfortunately, it bypasses the whole point of my question).
Actually, this solution slows down the code a lot. For the moment it is the best that I've got.
As a reference, I have seen this question: How to create an optimizer in Tensorflow , but didn't find it responded directly to my question.
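A note on the modular arithmetic used in the question: the term-by-term expression equals the target update only after one more reduction modulo B. The following plain-Python sketch uses made-up values (chosen to be exact in binary floating point) to show the difference.

```python
B = 20.0
w, mu_grad, eps = 0.5, 1.0, 0.25   # illustrative values for w, mu*grad and eps

# The intended update: w := (w - mu*grad + eps) mod B
target = (w - mu_grad + eps) % B

# Reducing each term separately can leave the result outside [0, B) ...
termwise = (w % B) - (mu_grad % B) + (eps % B)

# ... so an outer mod is still required to land in the compact domain.
termwise_wrapped = termwise % B

print(target)            # 19.75
print(termwise)          # -0.25
print(termwise_wrapped)  # 19.75
```

This is one reason why processing only the gradient is not enough: the wrap-around has to be applied to the updated variable itself.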

Your solution slows down the code because you use the sess.run and .eval() code during your "train_step" creation. Instead you should create the train_step graph using only internal tensorflow functions (without using sess.run and .eval()). Thereafter you only evaluate the train_step in a loop.
If you don't want to use any standard optimizer you can write your own "apply gradient" graph. Here is one possible solution for that:
learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01
#add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(some_loss, train_w_vars_list)
assign_list = []
for g, v in zip(grad, train_w_vars_list):
    eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
    assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))
#also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))
train_step = tf.group(*assign_list)
You can also use one of the standard optimizer to create the grads_and_vars list (use it instead of zip(grad, train_w_vars_list) then).
Here is a simple example for MNIST with your loss:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
# Import data
mnist = input_data.read_data_sets('PATH TO MNIST_data', one_hot=True)
# Create the model
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
y = tf.matmul(x, W)
# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01
#add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(cross_entropy, train_w_vars_list)
assign_list = []
for g, v in zip(grad, train_w_vars_list):
    eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
    assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))
#also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))
train_step = tf.group(*assign_list)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
# Train
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
# Test trained model
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                    y_: mnist.test.labels}))

Indeed you are somewhat restricted and cannot do just anything. However, what you wish to do can easily be done by writing your own subclass of the TensorFlow Optimizer class.
All you need to do is write an _apply_dense method for your class. The _apply_dense method takes grad and var (the weight w) as arguments, so anything you want to do with these two variables you can do.
Look here for example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py
That is the implementation of Adam in TensorFlow; all you need to do is change _apply_dense at line 131, as well as the _prepare and _finish methods.
So for example:
def _apply_dense(self, grad, var):
    B = math_ops.cast(self.B, var.dtype.base_dtype)
    eps = math_ops.cast(self.eps, var.dtype.base_dtype)
    mu = math_ops.cast(self.mu, var.dtype.base_dtype)
    var_update = state_ops.assign(var, tf.floormod(var - mu*grad + eps,B),
                                  use_locking=self._use_locking)
    return var_update

How to add regularizations in TensorFlow?

I have found that in many available neural network implementations using TensorFlow, regularization terms are added by manually appending an extra term to the loss value.
My questions are:
Is there a more elegant or recommended way of regularization than doing it manually?
I also find that get_variable has an argument regularizer. How should it be used? According to my observation, if we pass a regularizer to it (such as tf.contrib.layers.l2_regularizer), a tensor representing the regularization term will be computed and added to a graph collection named tf.GraphKeys.REGULARIZATION_LOSSES. Will that collection be automatically used by TensorFlow (e.g. used by optimizers when training)? Or is it expected that I should use that collection myself?

As you say in the second point, using the regularizer argument is the recommended way. You can use it in get_variable, or set it once in your variable_scope and have all your variables regularized.
The losses are collected in the graph, and you need to manually add them to your cost function like this.
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
reg_constant = 0.01 # Choose an appropriate one.
loss = my_normal_loss + reg_constant * sum(reg_losses)

A few aspects of the existing answer were not immediately clear to me, so here is a step-by-step guide:
Define a regularizer. This is where the regularization constant can be set, e.g.:
regularizer = tf.contrib.layers.l2_regularizer(scale=0.1)
Create variables via:
weights = tf.get_variable(
    name="weights",
    regularizer=regularizer,
    ...
)
Equivalently, variables can be created via the regular weights = tf.Variable(...) constructor, followed by tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, weights).
Define some loss term and add the regularization term:
reg_variables = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
reg_term = tf.contrib.layers.apply_regularization(regularizer, reg_variables)
loss += reg_term
Note: It looks like tf.contrib.layers.apply_regularization is implemented as an AddN, so more or less equivalent to sum(reg_variables).

I'll provide a simple correct answer since I didn't find one. You need two simple steps, the rest is done by tensorflow magic:
Add regularizers when creating variables or layers:
tf.layers.dense(x, kernel_regularizer=tf.contrib.layers.l2_regularizer(0.001))
# or
tf.get_variable('a', regularizer=tf.contrib.layers.l2_regularizer(0.001))
Add the regularization term when defining loss:
loss = ordinary_loss + tf.losses.get_regularization_loss()

Another option to do this with the contrib.learn library is as follows, based on the Deep MNIST tutorial on the Tensorflow website. First, assuming you've imported the relevant libraries (such as import tensorflow.contrib.layers as layers), you can define a network in a separate method:
def easier_network(x, reg):
    """ A network based on tf.contrib.learn, with input `x`. """
    with tf.variable_scope('EasyNet'):
        out = layers.flatten(x)
        out = layers.fully_connected(out,
                num_outputs=200,
                weights_initializer = layers.xavier_initializer(uniform=True),
                weights_regularizer = layers.l2_regularizer(scale=reg),
                activation_fn = tf.nn.tanh)
        out = layers.fully_connected(out,
                num_outputs=200,
                weights_initializer = layers.xavier_initializer(uniform=True),
                weights_regularizer = layers.l2_regularizer(scale=reg),
                activation_fn = tf.nn.tanh)
        out = layers.fully_connected(out,
                num_outputs=10, # Because there are ten digits!
                weights_initializer = layers.xavier_initializer(uniform=True),
                weights_regularizer = layers.l2_regularizer(scale=reg),
                activation_fn = None)
        return out
Then, in a main method, you can use the following code snippet:
def main(_):
    mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])
    # Make a network with regularization
    y_conv = easier_network(x, FLAGS.regu)
    weights = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'EasyNet')
    print("")
    for w in weights:
        shp = w.get_shape().as_list()
        print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
    print("")
    reg_ws = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES, 'EasyNet')
    for w in reg_ws:
        shp = w.get_shape().as_list()
        print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
    print("")
    # Make the loss function `loss_fn` with regularization.
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
    loss_fn = cross_entropy + tf.reduce_sum(reg_ws)
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss_fn)
To get this to work you need to follow the MNIST tutorial I linked to earlier and import the relevant libraries, but it's a nice exercise to learn TensorFlow and it's easy to see how the regularization affects the output. If you apply a regularization as an argument, you can see the following:
- EasyNet/fully_connected/weights:0 shape:[784, 200] size:156800
- EasyNet/fully_connected/biases:0 shape:[200] size:200
- EasyNet/fully_connected_1/weights:0 shape:[200, 200] size:40000
- EasyNet/fully_connected_1/biases:0 shape:[200] size:200
- EasyNet/fully_connected_2/weights:0 shape:[200, 10] size:2000
- EasyNet/fully_connected_2/biases:0 shape:[10] size:10
- EasyNet/fully_connected/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_1/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_2/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
Notice that the regularization portion gives you three items, based on the items available.
With regularizations of 0, 0.0001, 0.01, and 1.0, I get test accuracy values of 0.9468, 0.9476, 0.9183, and 0.1135, respectively, showing the dangers of high regularization terms.

If anyone's still looking, I'd just like to add on that in tf.keras you may add weight regularization by passing them as arguments in your layers. An example of adding L2 regularization taken wholesale from the Tensorflow Keras Tutorials site:
model = keras.models.Sequential([
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu, input_shape=(NUM_WORDS,)),
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
There's no need to manually add in the regularization losses with this method as far as I know.
Reference: https://www.tensorflow.org/tutorials/keras/overfit_and_underfit#add_weight_regularization

I tested tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES) and tf.losses.get_regularization_loss() with one l2_regularizer in the graph, and found that they return the same value. Judging by the magnitude of the value, I guess the regularization constant has already been applied through the scale parameter of tf.contrib.layers.l2_regularizer.
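The observation above can be illustrated with plain arithmetic. The sketch below assumes that tf.contrib.layers.l2_regularizer(scale) computes scale * sum(w ** 2) / 2 for each weight tensor, which is my understanding of its definition; `l2_penalty`, the weights, and the loss value are made up for the illustration.

```python
def l2_penalty(weights, scale):
    """scale * sum(w ** 2) / 2 (assumed to mirror l2_regularizer(scale))."""
    return scale * sum(w * w for w in weights) / 2.0

# Hypothetical flattened weights for two layers.
layer_weights = [[1.0, -2.0], [0.5, 0.5, 1.0]]
scale = 0.1

# One scalar per regularized variable, like the entries of REGULARIZATION_LOSSES.
reg_losses = [l2_penalty(w, scale) for w in layer_weights]
reg_term = sum(reg_losses)

data_loss = 0.75  # placeholder for the ordinary loss
total_loss = data_loss + reg_term

# Multiplying by a second constant scales the penalty twice.
reg_constant = 0.01
double_scaled_loss = data_loss + reg_constant * reg_term

print(reg_losses, reg_term, total_loss, double_scaled_loss)
```

Under that assumption, the entries in the collection already include the scale, so an extra reg_constant in the loss acts as a second multiplier rather than as the only regularization strength.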

If you have CNN you may do the following:
In your model function:
conv = tf.layers.conv2d(inputs=input_layer,
                        filters=32,
                        kernel_size=[3, 3],
                        kernel_initializer='xavier',
                        kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-5),
                        padding="same",
                        activation=None)
...
In your loss function:
onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=num_classes)
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
regularization_losses = tf.losses.get_regularization_losses()
loss = tf.add_n([loss] + regularization_losses)

cross_entropy = tf.losses.softmax_cross_entropy(
    logits=logits, onehot_labels=labels)
l2_loss = weight_decay * tf.add_n(
    [tf.nn.l2_loss(tf.cast(v, tf.float32)) for v in tf.trainable_variables()])
loss = cross_entropy + l2_loss

Some of the answers confused me further, so here are two methods to make it clear.
#1. adding all regs by hand
var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
var2 = tf.Variable(name='v2',initial_value=1.0,dtype=tf.float32)
regularizer = tf.contrib.layers.l1_regularizer(0.1)
reg_term = tf.contrib.layers.apply_regularization(regularizer,[var1,var2])
#here reg_term is a scalar
#2. auto added and read, but using get_variable
with tf.variable_scope('x',
        regularizer=tf.contrib.layers.l2_regularizer(0.1)):
    var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
    var2 = tf.get_variable(name='v2',shape=[1],dtype=tf.float32)
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
#here reg_losses is a list, should be summed
Then it can be added to the total loss.

tf.GraphKeys.REGULARIZATION_LOSSES will not be added automatically, but there is a simple way to add them:
reg_loss = tf.losses.get_regularization_loss()
total_loss = loss + reg_loss
tf.losses.get_regularization_loss() uses tf.add_n to sum the entries of tf.GraphKeys.REGULARIZATION_LOSSES element-wise. tf.GraphKeys.REGULARIZATION_LOSSES will typically be a list of scalars, calculated using regularizer functions. It gets entries from calls to tf.get_variable that have the regularizer parameter specified. You can also add to that collection manually. That would be useful when using tf.Variable and also when specifying activity regularizers or other custom regularizers. For instance:
#This will add an activity regularizer on y to the regloss collection
regularizer = tf.contrib.layers.l2_regularizer(0.1)
y = tf.nn.sigmoid(x)
act_reg = regularizer(y)
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, act_reg)
(In this example it would presumably be more effective to regularize x, as y really flattens out for large x.)
