How to add regularizations in TensorFlow?

How to add regularizations in TensorFlow? - python

I found in many available neural network code implemented using TensorFlow that regularization terms are often implemented by manually adding an additional term to loss value.
My questions are:
Is there a more elegant or recommended way of regularization than doing it manually?
I also find that get_variable has an argument regularizer. How should it be used? According to my observation, if we pass a regularizer to it (such as tf.contrib.layers.l2_regularizer, a tensor representing regularized term will be computed and added to a graph collection named tf.GraphKeys.REGULARIZATOIN_LOSSES. Will that collection be automatically used by TensorFlow (e.g. used by optimizers when training)? Or is it expected that I should use that collection by myself?

As you say in the second point, using the regularizer argument is the recommended way. You can use it in get_variable, or set it once in your variable_scope and have all your variables regularized.
The losses are collected in the graph, and you need to manually add them to your cost function like this.
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
reg_constant = 0.01 # Choose an appropriate one.
loss = my_normal_loss + reg_constant * sum(reg_losses)

A few aspects of the existing answer were not immediately clear to me, so here is a step-by-step guide:
Define a regularizer. This is where the regularization constant can be set, e.g.:
regularizer = tf.contrib.layers.l2_regularizer(scale=0.1)
Create variables via:
weights = tf.get_variable(
name="weights",
regularizer=regularizer,
...
)
Equivalently, variables can be created via the regular weights = tf.Variable(...) constructor, followed by tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, weights).
Define some loss term and add the regularization term:
reg_variables = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
reg_term = tf.contrib.layers.apply_regularization(regularizer, reg_variables)
loss += reg_term
Note: It looks like tf.contrib.layers.apply_regularization is implemented as an AddN, so more or less equivalent to sum(reg_variables).

I'll provide a simple correct answer since I didn't find one. You need two simple steps, the rest is done by tensorflow magic:
Add regularizers when creating variables or layers:
tf.layers.dense(x, kernel_regularizer=tf.contrib.layers.l2_regularizer(0.001))
# or
tf.get_variable('a', regularizer=tf.contrib.layers.l2_regularizer(0.001))
Add the regularization term when defining loss:
loss = ordinary_loss + tf.losses.get_regularization_loss()

Another option to do this with the contrib.learn library is as follows, based on the Deep MNIST tutorial on the Tensorflow website. First, assuming you've imported the relevant libraries (such as import tensorflow.contrib.layers as layers), you can define a network in a separate method:
def easier_network(x, reg):
""" A network based on tf.contrib.learn, with input `x`. """
with tf.variable_scope('EasyNet'):
out = layers.flatten(x)
out = layers.fully_connected(out,
num_outputs=200,
weights_initializer = layers.xavier_initializer(uniform=True),
weights_regularizer = layers.l2_regularizer(scale=reg),
activation_fn = tf.nn.tanh)
out = layers.fully_connected(out,
num_outputs=200,
weights_initializer = layers.xavier_initializer(uniform=True),
weights_regularizer = layers.l2_regularizer(scale=reg),
activation_fn = tf.nn.tanh)
out = layers.fully_connected(out,
num_outputs=10, # Because there are ten digits!
weights_initializer = layers.xavier_initializer(uniform=True),
weights_regularizer = layers.l2_regularizer(scale=reg),
activation_fn = None)
return out
Then, in a main method, you can use the following code snippet:
def main(_):
mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
# Make a network with regularization
y_conv = easier_network(x, FLAGS.regu)
weights = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'EasyNet')
print("")
for w in weights:
shp = w.get_shape().as_list()
print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
print("")
reg_ws = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES, 'EasyNet')
for w in reg_ws:
shp = w.get_shape().as_list()
print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
print("")
# Make the loss function `loss_fn` with regularization.
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
loss_fn = cross_entropy + tf.reduce_sum(reg_ws)
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss_fn)
To get this to work you need to follow the MNIST tutorial I linked to earlier and import the relevant libraries, but it's a nice exercise to learn TensorFlow and it's easy to see how the regularization affects the output. If you apply a regularization as an argument, you can see the following:
- EasyNet/fully_connected/weights:0 shape:[784, 200] size:156800
- EasyNet/fully_connected/biases:0 shape:[200] size:200
- EasyNet/fully_connected_1/weights:0 shape:[200, 200] size:40000
- EasyNet/fully_connected_1/biases:0 shape:[200] size:200
- EasyNet/fully_connected_2/weights:0 shape:[200, 10] size:2000
- EasyNet/fully_connected_2/biases:0 shape:[10] size:10
- EasyNet/fully_connected/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_1/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_2/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
Notice that the regularization portion gives you three items, based on the items available.
With regularizations of 0, 0.0001, 0.01, and 1.0, I get test accuracy values of 0.9468, 0.9476, 0.9183, and 0.1135, respectively, showing the dangers of high regularization terms.

If anyone's still looking, I'd just like to add on that in tf.keras you may add weight regularization by passing them as arguments in your layers. An example of adding L2 regularization taken wholesale from the Tensorflow Keras Tutorials site:
model = keras.models.Sequential([
keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
activation=tf.nn.relu, input_shape=(NUM_WORDS,)),
keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
activation=tf.nn.relu),
keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
There's no need to manually add in the regularization losses with this method as far as I know.
Reference: https://www.tensorflow.org/tutorials/keras/overfit_and_underfit#add_weight_regularization

I tested tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES) and tf.losses.get_regularization_loss() with one l2_regularizer in the graph, and found that they return the same value. By observing the value's quantity, I guess reg_constant has already make sense on the value by setting the parameter of tf.contrib.layers.l2_regularizer.

If you have CNN you may do the following:
In your model function:
conv = tf.layers.conv2d(inputs=input_layer,
filters=32,
kernel_size=[3, 3],
kernel_initializer='xavier',
kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-5),
padding="same",
activation=None)
...
In your loss function:
onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=num_classes)
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
regularization_losses = tf.losses.get_regularization_losses()
loss = tf.add_n([loss] + regularization_losses)

cross_entropy = tf.losses.softmax_cross_entropy(
logits=logits, onehot_labels=labels)
l2_loss = weight_decay * tf.add_n(
[tf.nn.l2_loss(tf.cast(v, tf.float32)) for v in tf.trainable_variables()])
loss = cross_entropy + l2_loss

Some answers make me more confused.Here I give two methods to make it clearly.
#1.adding all regs by hand
var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
var2 = tf.Variable(name='v2',initial_value=1.0,dtype=tf.float32)
regularizer = tf.contrib.layers.l1_regularizer(0.1)
reg_term = tf.contrib.layers.apply_regularization(regularizer,[var1,var2])
#here reg_term is a scalar
#2.auto added and read,but using get_variable
with tf.variable_scope('x',
regularizer=tf.contrib.layers.l2_regularizer(0.1)):
var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
var2 = tf.get_variable(name='v2',shape=[1],dtype=tf.float32)
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
#here reg_losses is a list,should be summed
Then,it can be added into the total loss

tf.GraphKeys.REGULARIZATION_LOSSES will not be added automatically, but there is a simple way to add them:
reg_loss = tf.losses.get_regularization_loss()
total_loss = loss + reg_loss
tf.losses.get_regularization_loss() uses tf.add_n to sum the entries of tf.GraphKeys.REGULARIZATION_LOSSES element-wise. tf.GraphKeys.REGULARIZATION_LOSSES will typically be a list of scalars, calculated using regularizer functions. It gets entries from calls to tf.get_variable that have the regularizer parameter specified. You can also add to that collection manually. That would be useful when using tf.Variable and also when specifying activity regularizers or other custom regularizers. For instance:
#This will add an activity regularizer on y to the regloss collection
regularizer = tf.contrib.layers.l2_regularizer(0.1)
y = tf.nn.sigmoid(x)
act_reg = regularizer(y)
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, act_reg)
(In this example it would presumably be more effective to regularize x, as y really flattens out for large x.)

Related

How to override gradient for the nonlinearity functions in lasagne?

I have a model, for which i need to compute the gradients of output w.r.t the model's input. But I want to apply some custom gradients for some of the nonlinearity functions applied on some of the model's layers. So i tried the idea explained here, which computes the nonlinear rectifier (RELU) in the forward pass but modifies the gradients of Relu in the backward pass. I added the following two classes:
The helper class that allows us to replace a nonlinearity with an Op
that has the same output, but a custom gradient
class ModifiedBackprop(object):
def __init__(self, nonlinearity):
self.nonlinearity = nonlinearity
self.ops = {} # memoizes an OpFromGraph instance per tensor type
def __call__(self, x):
# OpFromGraph is oblique to Theano optimizations, so we need to move
# things to GPU ourselves if needed.
if theano.sandbox.cuda.cuda_enabled:
maybe_to_gpu = theano.sandbox.cuda.as_cuda_ndarray_variable
else:
maybe_to_gpu = lambda x: x
# We move the input to GPU if needed.
x = maybe_to_gpu(x)
# We note the tensor type of the input variable to the nonlinearity
# (mainly dimensionality and dtype); we need to create a fitting Op.
tensor_type = x.type
# If we did not create a suitable Op yet, this is the time to do so.
if tensor_type not in self.ops:
# For the graph, we create an input variable of the correct type:
inp = tensor_type()
# We pass it through the nonlinearity (and move to GPU if needed).
outp = maybe_to_gpu(self.nonlinearity(inp))
# Then we fix the forward expression...
op = theano.OpFromGraph([inp], [outp])
# ...and replace the gradient with our own (defined in a subclass).
op.grad = self.grad
# Finally, we memoize the new Op
self.ops[tensor_type] = op
# And apply the memoized Op to the input we got.
return self.ops[tensor_type](x)
The subclass that does guided backpropagation through a nonlinearity:
class GuidedBackprop(ModifiedBackprop):
def grad(self, inputs, out_grads):
(inp,) = inputs
(grd,) = out_grads
dtype = inp.dtype
print('It works')
return (grd * (inp > 0).astype(dtype) * (grd > 0).astype(dtype),)
Then i used them in my code as follows:
import lasagne as nn
model_in = T.tensor3()
# model_in = net['input'].input_var
nn.layers.set_all_param_values(net['l_out'], model['param_values'])
relu = nn.nonlinearities.rectify
relu_layers = [layer for layer in
nn.layers.get_all_layers(net['l_out']) if getattr(layer,
'nonlinearity', None) is relu]
modded_relu = GuidedBackprop(relu)
for layer in relu_layers:
layer.nonlinearity = modded_relu
prop = nn.layers.get_output(
net['l_out'], model_in, deterministic=True)
for sample in range(ini, batch_len):
model_out = prop[sample, 'z'] # get prop for label 'z'
gradients = theano.gradient.jacobian(model_out, wrt=model_in)
# gradients = theano.grad(model_out, wrt=model_in)
get_gradients = theano.function(inputs=[model_in],
outputs=gradients)
grads = get_gradients(X_batch) # gradient dimension: X_batch == model_in(64, 20, 32)
grads = np.array(grads)
grads = grads[sample]
Now when i run the code, it works without any error, and the shape of the output is also correct. But that's because it executes the default theano.grad function and not the one supposed to override it. In other words, the grad() function in the class GuidedBackprop never been invoked.
I can't understand what is the issue?
is there's a solution?
If this is an unresolved issue, is there's an implementation for a Theano Op that can achieve such a functionality or some other way to override gradient for specific nonlinearity functions applied on some of the model's layers?

Are you try to set it back the value of model output into model layer input, all gradients calculation
group_1_ShoryuKen_Left = tf.constant([ 0,0,0,0,0,1,0,0,0,0,0,0, 0,0,0,0,0,1,0,1,0,0,0,0, 0,0,0,0,0,0,0,1,0,0,0,0, 0,0,0,0,0,0,0,0,0,1,0,0 ], shape=(1, 1, 48), dtype=tf.float32)
## layer_2 = tf.keras.layers.Dense(256, kernel_initializer=tf.constant_initializer(1.))
layer_2 = tf.keras.layers.LSTM(32, kernel_initializer=tf.constant_initializer(1.))
b_out = layer_2(group_1_ShoryuKen_Left)
layer_2.set_weights(layer_1.get_weights())

Cost remains same at 0.6932 in Siamese Network

I am trying to implement a Siamese Network, as in this paper
In this paper, they have used cross entropy for the Loss function
I am using STL-10 Dataset for training and instead of the 3 layer network used in the paper, I replaced it with VGG-13 CNN network, except the last logit layer.
Here is my loss function code
def loss(pred,true_pred):
cross_entropy_loss = tf.multiply(-1.0,tf.reduce_mean(tf.add(tf.multiply(true_pred,tf.log(pred)),tf.multiply((1-true_pred),tf.log(tf.subtract(1.0,pred))))))
total_loss = tf.add(tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)),cross_entropy_loss,name='total_loss')
return cross_entropy_loss,total_loss
with tf.device('/gpu:0'):
h1 = siamese(feed_image1)
h2 = siamese(feed_image2)
l1_dist = tf.abs(tf.subtract(h1,h2))
with tf.variable_scope('pred') as scope:
predictions = tf.contrib.layers.fully_connected(l1_dist,1,activation_fn = tf.sigmoid,weights_initializer = tf.contrib.layers.xavier_initializer(uniform=False),weights_regularizer = tf.contrib.layers.l2_regularizer(tf.constant(0.001, dtype=tf.float32)))
celoss,cost = loss(predictions,feed_labels)
with tf.variable_scope('adam_optimizer') as scope:
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
opt = optimizer.minimize(cost)
However, when I run the training, the cost remains almost constant at 0.6932
I have used Adam Optimizer here.
But previously I used Momentum Optimizer.
I have tried changing the learning rate but the cost still behaves the same.
And all the prediction values converge to 0.5 after a few iterations.
After taking the output for two batches of images (input1 and input2), I take their L1 distance and to that I have connected a fully connected layer with a single output and sigmoid activation function.
[h1 and h2 contains the output of the last fully connected layer(not the logit layer) of the VGG-13 network]
Since the output activation function is sigmoid, and since the prediction values are around 0.5, we can calculate and say that the sum of the weighted L1 distance of the output of the two networks is near to zero.
I can't understand where I am going wrong.
A little help will be very much appreciated.

I thought the nonconvergence may be caused by the gradient vanishing. You can trace the gradients using tf.contrib.layers.optimize_loss and the tensorboard. You can refer to this answer for more details.
Several optimizations(maybe):
1) don't write the cross entropy yourself.
You can employ the sigmoid cross entropy with logits API, since it ensures stability as documented:
max(x, 0) - x * z + log(1 + exp(-abs(x)))
2) do some weigh normalization may would hlep.
3) keep the regularization loss small.
You can read this answer for more information.
4) I don't see the necessity of tf.abs the L1 distance.
And here is the code I modified. Hope it helps.
mode = "training"
rl_rate = .1
with tf.device('/gpu:0'):
h1 = siamese(feed_image1)
h2 = siamese(feed_image2)
l1_dist = tf.subtract(h1, h2)
# is it necessary to use abs?
l1_dist_norm = tf.layers.batch_normalization(l1_dist, training=(mode=="training"))
with tf.variable_scope('logits') as scope:
w = tf.get_variable('fully_connected_weights', [tf.shape(l1_dist)[-1], 1],
weights_initializer = tf.contrib.layers.xavier_initializer(uniform=False), weights_regularizer = tf.contrib.layers.l2_regularizer(tf.constant(0.001, dtype=tf.float32))
)
logits = tf.tensordot(l1_dist_norm, w, axis=1)
xent_loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=feed_labels)
total_loss = tf.add(tf.reduce_sum(rl_rate * tf.abs(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))), (1-rl_rate) * xent_loss, name='total_loss')
# or:
# weights = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
# l1_regularizer = tf.contrib.layers.l1_regularizer()
# regularization_loss = tf.contrib.layers.apply_regularization(l1_regularizer, weights)
# total_loss = xent_loss + regularization_loss
with tf.variable_scope('adam_optimizer') as scope:
optimizer = tf.train.AdamOptimizer(learning_rate=0.0005)
opt = tf.contrib.layers.optimize_loss(total_loss, global_step, learning_rate=learning_rate, optimizer="Adam", clip_gradients=max_grad_norm, summaries=["gradients"])

TensorFlow dynamic RNN not training

Problem statement
I am trying to train a dynamic RNN in TensorFlow v1.0.1 on Linux RedHat 7.3 (problem also manifests on Windows 7), and no matter what I try, I get the exact same training and validation error at every epoch, i.e. my weights are not updating.
I appreciate any help you can offer.
Example
I tried to reduce this to a minimum example that shows my issue, but the minimum example is still pretty large. I based the network structure largely on this gist.
Network definition
import functools
import numpy as np
import tensorflow as tf
def lazy_property(function):
attribute = '_' + function.__name__
#property
#functools.wraps(function)
def wrapper(self):
if not hasattr(self, attribute):
setattr(self, attribute, function(self))
return getattr(self, attribute)
return wrapper
class MyNetwork:
"""
Class defining an RNN for labeling a time series.
"""
def __init__(self, data, target, num_hidden=64):
self.data = data
self.target = target
self._num_hidden = num_hidden
self._num_steps = int(self.target.get_shape()[1])
self._num_classes = int(self.target.get_shape()[2])
self._weight_and_bias() # create weight and bias tensors
self.prediction
self.error
self.optimize
#lazy_property
def prediction(self):
"""Defines the recurrent neural network prediction scheme."""
# Dynamic LSTM.
network = tf.contrib.rnn.BasicLSTMCell(self._num_hidden)
output, _ = tf.nn.dynamic_rnn(network, data, dtype=tf.float32)
# Flatten and apply same weights to all time steps.
output = tf.reshape(output, [-1, self._num_hidden])
prediction = tf.nn.softmax(tf.matmul(output, self.weight) + self.bias)
prediction = tf.reshape(prediction,
[-1, self._num_steps, self._num_classes])
return prediction
#lazy_property
def cost(self):
"""Defines the cost function for the network."""
cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction),
axis=[1, 2])
cross_entropy = tf.reduce_mean(cross_entropy)
return cross_entropy
#lazy_property
def optimize(self):
"""Defines the optimization scheme."""
learning_rate = 0.003
optimizer = tf.train.RMSPropOptimizer(learning_rate)
return optimizer.minimize(self.cost)
#lazy_property
def error(self):
"""Defines a measure of prediction error."""
mistakes = tf.not_equal(tf.argmax(self.target, 2),
tf.argmax(self.prediction, 2))
return tf.reduce_mean(tf.cast(mistakes, tf.float32))
def _weight_and_bias(self):
"""Returns appropriately sized weight and bias tensors for the output layer."""
self.weight = tf.Variable(tf.truncated_normal(
[self._num_hidden, self._num_classes],
mean=0.0,
stddev=0.01,
dtype=tf.float32))
self.bias = tf.Variable(tf.constant(0.1, shape=[self._num_classes]))
Training
Here is my training process. The all_data class just holds my data and labels, and uses a batch generator class to spit out batches for training when I call all_data.train.next() and all_data.train_labels.next(). You can reproduce with any batch generation scheme you like, and I can add the code if you think it is relevant; I felt like this was getting too long as it is.
tf.reset_default_graph()
data = tf.placeholder(tf.float32,
[None, all_data.num_steps, all_data.num_features])
target = tf.placeholder(tf.float32,
[None, all_data.num_steps, all_data.num_outputs])
model = MyNetwork(data, target, NUM_HIDDEN)
print('Training the model...')
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
print('Initialized.')
for epoch in range(3):
print('Epoch {} |'.format(epoch), end='', flush=True)
for step in range(all_data.train_size // BATCH_SIZE):
# Generate the next training batch and train.
d = all_data.train.next()
t = all_data.train_labels.next()
sess.run(model.optimize,
feed_dict={data: d, target: t})
# Update the user periodically.
if step % summary_frequency == 0:
print('.', end='', flush=True)
# Show training and validation error at the end of each epoch.
print('|', flush=True)
train_error = sess.run(model.error,
feed_dict={data: d, target: t})
valid_error = sess.run(model.error,
feed_dict={
data: all_data.valid,
target: all_data.valid_labels
})
print('Training error: {}%'.format(100 * train_error))
print('Validation error: {}%'.format(100 * valid_error))
# Check testing error after everything.
test_error = sess.run(model.error,
feed_dict={
data: all_data.test,
target: all_data.test_labels
})
print('Testing error after {} epochs: {}%'.format(epoch + 1, 100 * test_error))
For a simple example, I generated random data and labels, where data has shape [num_samples, num_steps, num_features], and each sample has a single label associated with the whole thing:
data = np.random.rand(5000, 1000, 2)
labels = np.random.randint(low=0, high=2, size=[5000])
I then converted my labels to one-hot vectors and tiled them so that the resulting labels tensor was the same size as the data tensor.
Results
No matter what I do, I get results like this:
Training the model...
Initialized.
Epoch 0 |.......................................................|
Training error: 56.25%
Validation error: 53.39999794960022%
Epoch 1 |.......................................................|
Training error: 56.25%
Validation error: 53.39999794960022%
Epoch 2 |.......................................................|
Training error: 56.25%
Validation error: 53.39999794960022%
Testing error after 3 epochs: 49.000000953674316%
Where I have exactly the same error at every epoch. Even if my weights were randomly walking around this should change. For the example shown here, I used random data with random labels, so I do not expect much improvement, but I do expect some change, and I am getting the exact same results every epoch. When I do this with my actual data set, I get the same behavior.
Insight
I hesitate to include this in case it proves to be a red herring, but I believe that my optimizer is calculating cost function gradients of None. When I tried a different optimizer and attempted to clip the gradients, I went ahead and used tf.Print to output the gradients as well. The network crashed with an error that tf.Print could not handle None-type values.
Attempted fixes
I have tried the following things, and the problem persists in all cases:
Using different optimizers, e.g. AdamOptimizer with and without modifications to the gradients (clipping).
Adjusting batch sizes.
Using many more and many fewer hidden nodes.
Running for more epochs.
Initializing my weights with different values assigned to stddev.
Initializing my biases to zeros (using tf.zeros) and to different constants.
Using weights and biases that are defined within the prediction method and are not member variables of the class, and a _weight_and_bias method that is defined as a #staticmethod like in this gist.
Determining logits in the prediction function instead of softmax predictions, i.e. predictions = tf.matmul(output, self.weights) + self.bias, and then using tf.nn.softmax_cross_entropy_with_logits. This requires some reshaping because the method wants its labels and targets given with shape [batch_size, num_classes], so the cost method becomes:
(line added to get code to format...)
#lazy_property
def cost(self):
"""Defines the cost function for the network."""
targs = tf.reshape(self.target, [-1, self._num_classes])
logits = tf.reshape(self.predictions, [-1, self._num_classes])
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=targs, logits=logits)
cross_entropy = tf.reduce_mean(cross_entropy)
return cross_entropy
Changing which size dimension I leave as None when I create my placeholders as suggested in this answer, which requires a bit of rewriting in the network definition. Basically setting size = [all_data.batch_size, -1, all_data.num_features] and size = [all_data.batch_size, -1, all_data.num_classes].
Using tf.contrib.rnn.DropoutWrapper in my network definition and passing a dropout value set to 0.5 in training and 1.0 in validation and testing.

The problem went away when I used
output = tf.contrib.layers.flatten(output)
logits = tf.contrib.layers.fully_connected(output, some_size, activation_fn=None)
instead of flattening my network output, defining weights, and performing the tf.matmul(output, weight) + bias manually. I then used logits (instead of predictions in the question) in my cost function with
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=target,
logits=logits)
If you want to get the network prediction, you will still need to do prediction = tf.nn.softmax(logits).
I have no idea why this helped, but the network would not train even on random made-up data until I made these changes.

Can one only implement gradient descent like optimizers with the code example from processing gradients in TensorFlow?

I was looking at the example code for processing gradients that TensorFlow has:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
however, I noticed that the apply_gradients function was derived from the GradientDescentOptimizer. Does that mean that using the example code from above, one can only implement gradient like descent rules (notice we could change the opt = GradientDescentOptimizer or Adam or any of the the other optimizers)? In particular, what does apply_gradients do? I definitively check the code in the tf github page but it was a bunch of python that had nothing to do with mathematical expressions, so it was hard to tell what that was doing and how it changed from optimizer to optimizer.
For example, if I wanted to implement my own custom optimizer that might use gradients (or might not e.g. just change the weights directly with some rule, maybe more biologically plausible rule), its not possible with the above example code?
In particular I wanted to implement a gradient descent version that is artificially restricted in a compact domain. In particular I wanted to implement the following equation:
w := (w - mu*grad + eps) mod B
in TensorFlow. I realized that the following is true:
w := w mod B - mu*grad mod B + eps mod B
so I thought that I could just implement it by doing:
def Process_grads(g,mu_noise,stddev_noise,B):
return (g+tf.random_normal(tf.shape(g),mean=mu_noise,stddev=stddev_noise) ) % B
and then just having:
processed_grads_and_vars = [(Process_grads(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the processed gradients.
opt.apply_gradients(processed_grads_and_vars)
however, I realized that that wasn't good enough because I don't actually have access to w so I can't implement:
w mod B
at least not the way I tried. Is there a way to do this? i.e. to actually directly change the update rule? At least the way I tried?
I know its sort of a hacky update rule, but my point is more to change the update equation than actually caring to much about that update rule (so don't get hung up on it if its a bit weird).
I came up with super hacky solution:
def manual_update_GDL(arg,learning_rate,g,mu_noise,stddev_noise):
with tf.variable_scope(arg.mdl_scope_name,reuse=True):
W_var = tf.get_variable(name='W')
eps = tf.random_normal(tf.shape(g),mean=mu_noise,stddev=stddev_noise)
#
W_new = tf.mod( W_var - learning_rate*g + eps , 20)
sess.run( W_var.assign(W_new) )
def manual_GDL(arg,loss,learning_rate,mu_noise,stddev_noise,compact,B):
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss)
# process gradients
processed_grads_and_vars = [(manual_update_GDL(arg,learning_rate,g,mu_noise,stddev_noise), v) for g,v in grads_and_vars]
not sure if it works but something like that should work in general. The idea is to just write down the equation one wants to use (in TensorFlow) for the learning rate and then update the weights manually using a session.
Unfortunately, such a solution means we have to take care of the annealing (decaying learning rate manually which seems annoying). This solution probably has many other problems, feel free to point them out (and give solutions if you can).
For this very simple problem I realized one can just do the normal optimizer update rule and then just take the mod of the weights and re-assign them to their value:
sess.run(fetches=train_step)
if arg.compact:
# apply w := ( w - mu*g + eps ) mod B
W_val = W_var.eval()
W_new = tf.mod(W_var,arg.B).eval()
W_var.assign(W_new).eval()
but in this case its a coincidence that such a simple solution exists (unfortunately, bypasses the whole point of my question).
Actually, this solutions slows down the code a lot. For the moment is the best that I've got.
As a reference, I have seen this question: How to create an optimizer in Tensorflow , but didn't find it responded directly to my question.

Your solution slows down the code because you use the sess.run and .eval() code during your "train_step" creation. Instead you should create the train_step graph using only internal tensorflow functions (without using sess.run and .eval()). Thereafter you only evaluate the train_step in a loop.
If you don't want to use any standard optimizer you can write your own "apply gradient" graph. Here is one possible solution for that:
learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01
#add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(some_loss, train_w_vars_list)
assign_list = []
for g, v in zip(grad, train_w_vars_list):
eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))
#also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))
train_step = tf.group(*assign_list)
You can also use one of the standard optimizer to create the grads_and_vars list (use it instead of zip(grad, train_w_vars_list) then).
Here is a simple example for MNIST with your loss:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
# Import data
mnist = input_data.read_data_sets('PATH TO MNIST_data', one_hot=True)
# Create the model
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
y = tf.matmul(x, W)
# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01
#add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(cross_entropy, train_w_vars_list)
assign_list = []
for g, v in zip(grad, train_w_vars_list):
eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
assign_list.append(v.assign(tf.mod(v - learning_rate*g + eps, 20)))
#also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))
train_step = tf.group(*assign_list)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
# Train
for _ in range(1000):
batch_xs, batch_ys = mnist.train.next_batch(100)
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
# Test trained model
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images,
y_: mnist.test.labels}))

Indeed you are somewhat restricted and cannot do anything. However, what you wish to do can easily be done by making your daughter class of the tensorflow Optimizer class.
All you need to do is write an _apply_dense method for your class. The _apply_dense method takes grad and w as arguments, so anything you want to do with these to variables you can do.
Look here for example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py
That is the implementation of Adam in tensorflow, all you need to do is change the _apply_dense at line 131 as well as the _prepare and _finish methods.
So for example:
def _apply_dense(self, grad, var):
B = math_ops.cast(self.B, var.dtype.base_dtype)
eps = math_ops.cast(self.eps, var.dtype.base_dtype)
mu = math_ops.cast(self.mu, var.dtype.base_dtype)
var_update = state_ops.assign(var, tf.floormod(var - mu*grad + eps,B),
use_locking=self._use_locking)
return var_update

How to set layer-wise learning rate in Tensorflow?

I am wondering if there is a way that I can use different learning rate for different layers like what is in Caffe. I am trying to modify a pre-trained model and use it for other tasks. What I want is to speed up the training for new added layers and keep the trained layers at low learning rate in order to prevent them from being distorted. for example, I have a 5-conv-layer pre-trained model. Now I add a new conv layer and fine tune it. The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?

It can be achieved quite easily with 2 optimizers:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
train_op1 = GradientDescentOptimizer(0.00001).minimize(loss, var_list=var_list1)
train_op2 = GradientDescentOptimizer(0.0001).minimize(loss, var_list=var_list2)
train_op = tf.group(train_op1, train_op2)
One disadvantage of this implementation is that it computes tf.gradients(.) twice inside the optimizers and thus it might not be optimal in terms of execution speed. This can be mitigated by explicitly calling tf.gradients(.), splitting the list into 2 and passing corresponding gradients to both optimizers.
Related question: Holding variables constant during optimizer
EDIT: Added more efficient but longer implementation:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
opt1 = tf.train.GradientDescentOptimizer(0.00001)
opt2 = tf.train.GradientDescentOptimizer(0.0001)
grads = tf.gradients(loss, var_list1 + var_list2)
grads1 = grads[:len(var_list1)]
grads2 = grads[len(var_list1):]
tran_op1 = opt1.apply_gradients(zip(grads1, var_list1))
train_op2 = opt2.apply_gradients(zip(grads2, var_list2))
train_op = tf.group(train_op1, train_op2)
You can use tf.trainable_variables() to get all training variables and decide to select from them.
The difference is that in the first implementation tf.gradients(.) is called twice inside the optimizers. This may cause some redundant operations to be executed (e.g. gradients on the first layer can reuse some computations for the gradients of the following layers).

Tensorflow 1.7 introduced tf.custom_gradient that greatly simplifies setting learning rate multipliers, in a way that is now compatible with any optimizer, including those accumulating gradient statistics. For example,
import tensorflow as tf
def lr_mult(alpha):
#tf.custom_gradient
def _lr_mult(x):
def grad(dy):
return dy * alpha * tf.ones_like(x)
return x, grad
return _lr_mult
x0 = tf.Variable(1.)
x1 = tf.Variable(1.)
loss = tf.square(x0) + tf.square(lr_mult(0.1)(x1))
step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
tf.local_variables_initializer().run()
for _ in range(5):
sess.run([step])
print(sess.run([x0, x1, loss]))

Update Jan 22: recipe below is only a good idea for GradientDescentOptimizer , other optimizers that keep a running average will apply learning rate before the parameter update, so recipe below won't affect that part of the equation
In addition to Rafal's approach, you could use compute_gradients, apply_gradients interface of Optimizer. For instance, here's a toy network where I use 2x the learning rate for second parameter
x = tf.Variable(tf.ones([]))
y = tf.Variable(tf.zeros([]))
loss = tf.square(x-y)
global_step = tf.Variable(0, name="global_step", trainable=False)
opt = tf.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss, [x, y])
ygrad, _ = grads_and_vars[1]
train_op = opt.apply_gradients([grads_and_vars[0], (ygrad*2, y)], global_step=global_step)
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
for i in range(5):
sess.run([train_op, loss, global_step])
print sess.run([x, y])
You should see
[0.80000001, 0.40000001]
[0.72000003, 0.56]
[0.68800002, 0.62400001]
[0.67520005, 0.64960003]
[0.67008007, 0.65984005]

Collect learning rate multipliers for each variable like:
self.lr_multipliers[var.op.name] = lr_mult
and then apply them during before applying the gradients like:
def _train_op(self):
tf.scalar_summary('learning_rate', self._lr_placeholder)
opt = tf.train.GradientDescentOptimizer(self._lr_placeholder)
grads_and_vars = opt.compute_gradients(self._loss)
grads_and_vars_mult = []
for grad, var in grads_and_vars:
grad *= self._network.lr_multipliers[var.op.name]
grads_and_vars_mult.append((grad, var))
tf.histogram_summary('variables/' + var.op.name, var)
tf.histogram_summary('gradients/' + var.op.name, grad)
return opt.apply_gradients(grads_and_vars_mult)
You can find the whole example here.

A slight variation of Sergey Demyanov answer, where you only have to specify the learning rates you would like to change
from collections import defaultdict
self.learning_rates = defaultdict(lambda: 1.0)
...
x = tf.layers.Dense(3)(x)
self.learning_rates[x.op.name] = 2.0
...
optimizer = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads_and_vars_mult = []
for grad, var in grads_and_vars:
grad *= self.learning_rates[var.op.name]
grads_and_vars_mult.append((grad, var))
train_op = optimizer.apply_gradients(grads_and_vars_mult, tf.train.get_global_step())

The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
There is an easy way to do that using tf.stop_gradient.
Here is an example with 3 layers:
x = layer1(input)
x = layer2(x)
output = layer3(x)
You can shrink your gradient in the first two layers by a ratio of 1/100:
x = layer1(input)
x = layer2(x)
x = 1/100*x + (1-1/100)*tf.stop_gradient(x)
output = layer3(x)
On the layer2, the "flow" is split in two branches: one which has a contribution of 1/100 computes its gradient regularly but with a gradient magnitude shrinked by a proportion of 1/100, the other branch provides the remaining "flow" without contributing to the gradient because of the tf.stop_gradient operator. As a result, if you use a learning rate of 0.001 on your model optimizer, the first two layers will virtually have a learning rate of 0.00001.

If you happen to be using tf.slim + slim.learning.create_train_op there is a nice example here:
https://github.com/google-research/tf-slim/blob/master/tf_slim/learning.py#L65
# Create the train_op and scale the gradients by providing a map from variable
# name (or variable) to a scaling coefficient:
gradient_multipliers = {
'conv0/weights': 1.2,
'fc8/weights': 3.4,
}
train_op = slim.learning.create_train_op(
total_loss,
optimizer,
gradient_multipliers=gradient_multipliers)
Unfortunately it doesn't seem possible to use a tf.Variable instead of a float value if you want to gradually modify the multiplier.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to add regularizations in TensorFlow? - python

cross_entropy = tf.losses.softmax_cross_entropy( logits=logits, onehot_labels=labels) l2_loss = weight_decay * tf.add_n( [tf.nn.l2_loss(tf.cast(v, tf.float32)) for v in tf.trainable_variables()]) loss = cross_entropy + l2_loss

Related

How to override gradient for the nonlinearity functions in lasagne?

Cost remains same at 0.6932 in Siamese Network

TensorFlow dynamic RNN not training

Can one only implement gradient descent like optimizers with the code example from processing gradients in TensorFlow?

How to set layer-wise learning rate in Tensorflow?

Categories

Resources