Tensorflow - adding L2 regularization loss simple example

Tensorflow - adding L2 regularization loss simple example - python

I am familiar with machine learning, but I am learning Tensorflow on my own by reading some slides from universities. Below I'm setting up the loss function for linear regression with only one feature. I'm adding an L2 loss to the total loss, but I am not sure if I'm doing it correctly:
# Regularization
reg_strength = 0.01
# Create the loss function.
with tf.variable_scope("linear-regression"):
W = tf.get_variable("W", shape=(1, 1), initializer=tf.contrib.layers.xavier_initializer())
b = tf.get_variable("b", shape=(1,), initializer=tf.constant_initializer(0.0))
yhat = tf.matmul(X, W) + b
error_loss = tf.reduce_sum(((y - yhat)**2)/number_of_examples)
#reg_loss = reg_strength * tf.nn.l2_loss(W) # reg 1
reg_loss = reg_strength * tf.reduce_sum(W**2) # reg 2
loss = error_loss + reg_loss
# Set up the optimizer.
opt_operation = tf.train.GradientDescentOptimizer(0.001).minimize(loss)
My specific questions are:
I have two lines (commented as reg 1 and reg 2) that compute the L2 loss of the weight W. The line marked with reg 1 uses the Tensorflow built-in function. Are these two L2 implementations equivalent?
Am I adding the regularization loss reg_loss correctly to the final loss function?

Almost
According to the L2Loss operation code
output.device(d) = (input.square() * static_cast<T>(0.5)).sum();
It multiplies also for 0.5 (or in other words it divides by 2)

Are these two L2 implementations equivalent?
Almost, as #fabrizioM pointed out, you can see here for the introduction to the l2_loss in TensorFlow docs.
Am I adding the regularization loss reg_loss correctly to the final loss function?
So far so good : )

Related

Why is softmax classifier gradient divided by batch size (CS231n)?

Question
In CS231 Computing the Analytic Gradient with Backpropagation which is first implementing a Softmax Classifier, the gradient from (softmax + log loss) is divided by the batch size (number of data being used in a cycle of forward cost calculation and backward propagation in the training).
Please help me understand why it needs to be divided by the batch size.
The chain rule to get the gradient should be below. Where should I incorporate the division?
Derivative of Softmax loss function
Code
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
#Train a Linear Classifier
# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
# gradient descent loop
num_examples = X.shape[0]
for i in range(200):
# evaluate class scores, [N x K]
scores = np.dot(X, W) + b
# compute the class probabilities
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]
# compute the loss: average cross-entropy loss and regularization
correct_logprobs = -np.log(probs[range(num_examples),y])
data_loss = np.sum(correct_logprobs)/num_examples
reg_loss = 0.5*reg*np.sum(W*W)
loss = data_loss + reg_loss
if i % 10 == 0:
print "iteration %d: loss %f" % (i, loss)
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples # <---------------------- Why?
# backpropate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # regularization gradient
# perform a parameter update
W += -step_size * dW
b += -step_size * db

It's because you are averaging the gradients instead of taking directly the sum of all the gradients.
You could of course not divide for that size, but this division has a lot of advantages. The main reason is that it's a sort of regularization (to avoid overfitting). With smaller gradients the weights cannot grow out of proportions.
And this normalization allows comparison between different configuration of batch sizes in different experiments (How can I compare two batch performances if they are dependent to the batch size?)
If you divide for that size the gradients sum it could be useful to work with greater learning rates to make the training faster.
This answer in the crossvalidated community is quite useful.

Came to notice that the dot in dW = np.dot(X.T, dscores) for the gradient at W is Σ over the num_sample instances. Since the dscore, which is probability (softmax output), was divided by the num_samples, did not understand that it was normalization for dot and sum part later in the code. Now understood divide by num_sample is required (may still work without normalization if the learning rate is trained though).
I believe the code below explains better.
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
# backpropate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores) / num_examples
db = np.sum(dscores, axis=0, keepdims=True) / num_examples

Large WGAN-GP train loss

This is the loss function of WGAN-GP
gen_sample = model.generator(input_gen)
disc_real = model.discriminator(real_image, reuse=False)
disc_fake = model.discriminator(gen_sample, reuse=True)
disc_concat = tf.concat([disc_real, disc_fake], axis=0)
# Gradient penalty
alpha = tf.random_uniform(
shape=[BATCH_SIZE, 1, 1, 1],
minval=0.,
maxval=1.)
differences = gen_sample - real_image
interpolates = real_image + (alpha * differences)
gradients = tf.gradients(model.discriminator(interpolates, reuse=True), [interpolates])[0] # why [0]
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1]))
gradient_penalty = tf.reduce_mean((slopes-1.)**2)
d_loss_real = tf.reduce_mean(disc_real)
d_loss_fake = tf.reduce_mean(disc_fake)
disc_loss = -(d_loss_real - d_loss_fake) + LAMBDA * gradient_penalty
gen_loss = - d_loss_fake
This is the training loss
The generator loss is oscillating, and the value is so big.
My question is:
is the generator loss normal or abnormal?

One thing to note is that your gradient penalty calculation is wrong. The following line:
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1]))
should actually be:
slopes = tf.sqrt(tf.reduce_sum(tf.square(gradients), reduction_indices=[1,2,3]))
You are reducing on the first axis, but the gradient is based on an image as shown by the alpha values and therefore you have to reduce on the axis [1,2,3].
Another error in your code is that the generator loss is:
gen_loss = d_loss_real - d_loss_fake
For the gradient calculation this makes no difference, due to the parameters of the generator only being contained in d_loss_fake. However, for the value of the generator loss this makes all the difference in the world and is the reason why this oszillates this much.
At the end of the day you should look at your actual performance metric you care about to determine the quality of your GAN like the inception score or the Fréchet Inception Distance (FID), because the loss of discriminator and generator are only mildly descriptive.

Cost remains same at 0.6932 in Siamese Network

I am trying to implement a Siamese Network, as in this paper
In this paper, they have used cross entropy for the Loss function
I am using STL-10 Dataset for training and instead of the 3 layer network used in the paper, I replaced it with VGG-13 CNN network, except the last logit layer.
Here is my loss function code
def loss(pred,true_pred):
cross_entropy_loss = tf.multiply(-1.0,tf.reduce_mean(tf.add(tf.multiply(true_pred,tf.log(pred)),tf.multiply((1-true_pred),tf.log(tf.subtract(1.0,pred))))))
total_loss = tf.add(tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)),cross_entropy_loss,name='total_loss')
return cross_entropy_loss,total_loss
with tf.device('/gpu:0'):
h1 = siamese(feed_image1)
h2 = siamese(feed_image2)
l1_dist = tf.abs(tf.subtract(h1,h2))
with tf.variable_scope('pred') as scope:
predictions = tf.contrib.layers.fully_connected(l1_dist,1,activation_fn = tf.sigmoid,weights_initializer = tf.contrib.layers.xavier_initializer(uniform=False),weights_regularizer = tf.contrib.layers.l2_regularizer(tf.constant(0.001, dtype=tf.float32)))
celoss,cost = loss(predictions,feed_labels)
with tf.variable_scope('adam_optimizer') as scope:
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
opt = optimizer.minimize(cost)
However, when I run the training, the cost remains almost constant at 0.6932
I have used Adam Optimizer here.
But previously I used Momentum Optimizer.
I have tried changing the learning rate but the cost still behaves the same.
And all the prediction values converge to 0.5 after a few iterations.
After taking the output for two batches of images (input1 and input2), I take their L1 distance and to that I have connected a fully connected layer with a single output and sigmoid activation function.
[h1 and h2 contains the output of the last fully connected layer(not the logit layer) of the VGG-13 network]
Since the output activation function is sigmoid, and since the prediction values are around 0.5, we can calculate and say that the sum of the weighted L1 distance of the output of the two networks is near to zero.
I can't understand where I am going wrong.
A little help will be very much appreciated.

I thought the nonconvergence may be caused by the gradient vanishing. You can trace the gradients using tf.contrib.layers.optimize_loss and the tensorboard. You can refer to this answer for more details.
Several optimizations(maybe):
1) don't write the cross entropy yourself.
You can employ the sigmoid cross entropy with logits API, since it ensures stability as documented:
max(x, 0) - x * z + log(1 + exp(-abs(x)))
2) do some weigh normalization may would hlep.
3) keep the regularization loss small.
You can read this answer for more information.
4) I don't see the necessity of tf.abs the L1 distance.
And here is the code I modified. Hope it helps.
mode = "training"
rl_rate = .1
with tf.device('/gpu:0'):
h1 = siamese(feed_image1)
h2 = siamese(feed_image2)
l1_dist = tf.subtract(h1, h2)
# is it necessary to use abs?
l1_dist_norm = tf.layers.batch_normalization(l1_dist, training=(mode=="training"))
with tf.variable_scope('logits') as scope:
w = tf.get_variable('fully_connected_weights', [tf.shape(l1_dist)[-1], 1],
weights_initializer = tf.contrib.layers.xavier_initializer(uniform=False), weights_regularizer = tf.contrib.layers.l2_regularizer(tf.constant(0.001, dtype=tf.float32))
)
logits = tf.tensordot(l1_dist_norm, w, axis=1)
xent_loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=feed_labels)
total_loss = tf.add(tf.reduce_sum(rl_rate * tf.abs(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))), (1-rl_rate) * xent_loss, name='total_loss')
# or:
# weights = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
# l1_regularizer = tf.contrib.layers.l1_regularizer()
# regularization_loss = tf.contrib.layers.apply_regularization(l1_regularizer, weights)
# total_loss = xent_loss + regularization_loss
with tf.variable_scope('adam_optimizer') as scope:
optimizer = tf.train.AdamOptimizer(learning_rate=0.0005)
opt = tf.contrib.layers.optimize_loss(total_loss, global_step, learning_rate=learning_rate, optimizer="Adam", clip_gradients=max_grad_norm, summaries=["gradients"])

CS231n: How to calculate gradient for Softmax loss function?

I am watching some videos for Stanford CS231: Convolutional Neural Networks for Visual Recognition but do not quite understand how to calculate analytical gradient for softmax loss function using numpy.
From this stackexchange answer, softmax gradient is calculated as:
Python implementation for above is:
num_classes = W.shape[0]
num_train = X.shape[1]
for i in range(num_train):
for j in range(num_classes):
p = np.exp(f_i[j])/sum_i
dW[j, :] += (p-(j == y[i])) * X[:, i]
Could anyone explain how the above snippet work? Detailed implementation for softmax is also included below.
def softmax_loss_naive(W, X, y, reg):
"""
Softmax loss function, naive implementation (with loops)
Inputs:
- W: C x D array of weights
- X: D x N array of data. Data are D-dimensional columns
- y: 1-dimensional array of length N with labels 0...K-1, for K classes
- reg: (float) regularization strength
Returns:
a tuple of:
- loss as single float
- gradient with respect to weights W, an array of same size as W
"""
# Initialize the loss and gradient to zero.
loss = 0.0
dW = np.zeros_like(W)
#############################################################################
# Compute the softmax loss and its gradient using explicit loops. #
# Store the loss in loss and the gradient in dW. If you are not careful #
# here, it is easy to run into numeric instability. Don't forget the #
# regularization! #
#############################################################################
# Get shapes
num_classes = W.shape[0]
num_train = X.shape[1]
for i in range(num_train):
# Compute vector of scores
f_i = W.dot(X[:, i]) # in R^{num_classes}
# Normalization trick to avoid numerical instability, per http://cs231n.github.io/linear-classify/#softmax
log_c = np.max(f_i)
f_i -= log_c
# Compute loss (and add to it, divided later)
# L_i = - f(x_i)_{y_i} + log \sum_j e^{f(x_i)_j}
sum_i = 0.0
for f_i_j in f_i:
sum_i += np.exp(f_i_j)
loss += -f_i[y[i]] + np.log(sum_i)
# Compute gradient
# dw_j = 1/num_train * \sum_i[x_i * (p(y_i = j)-Ind{y_i = j} )]
# Here we are computing the contribution to the inner sum for a given i.
for j in range(num_classes):
p = np.exp(f_i[j])/sum_i
dW[j, :] += (p-(j == y[i])) * X[:, i]
# Compute average
loss /= num_train
dW /= num_train
# Regularization
loss += 0.5 * reg * np.sum(W * W)
dW += reg*W
return loss, dW

Not sure if this helps, but:
is really the indicator function , as described here. This forms the expression (j == y[i]) in the code.
Also, the gradient of the loss with respect to the weights is:
where
which is the origin of the X[:,i] in the code.

I know this is late but here's my answer:
I'm assuming you are familiar with the cs231n Softmax loss function.
We know that:
So just as we did with the SVM loss function the gradients are as follows:
Hope that helped.

A supplement to this answer with a small example.

I came across this post and still was not 100% clear how to arrive at the partial derivatives.
For that reason I took another approach to get to the same results - maybe it is helpful to others too.

How to add regularizations in TensorFlow?

I found in many available neural network code implemented using TensorFlow that regularization terms are often implemented by manually adding an additional term to loss value.
My questions are:
Is there a more elegant or recommended way of regularization than doing it manually?
I also find that get_variable has an argument regularizer. How should it be used? According to my observation, if we pass a regularizer to it (such as tf.contrib.layers.l2_regularizer, a tensor representing regularized term will be computed and added to a graph collection named tf.GraphKeys.REGULARIZATOIN_LOSSES. Will that collection be automatically used by TensorFlow (e.g. used by optimizers when training)? Or is it expected that I should use that collection by myself?

As you say in the second point, using the regularizer argument is the recommended way. You can use it in get_variable, or set it once in your variable_scope and have all your variables regularized.
The losses are collected in the graph, and you need to manually add them to your cost function like this.
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
reg_constant = 0.01 # Choose an appropriate one.
loss = my_normal_loss + reg_constant * sum(reg_losses)

A few aspects of the existing answer were not immediately clear to me, so here is a step-by-step guide:
Define a regularizer. This is where the regularization constant can be set, e.g.:
regularizer = tf.contrib.layers.l2_regularizer(scale=0.1)
Create variables via:
weights = tf.get_variable(
name="weights",
regularizer=regularizer,
...
)
Equivalently, variables can be created via the regular weights = tf.Variable(...) constructor, followed by tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, weights).
Define some loss term and add the regularization term:
reg_variables = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
reg_term = tf.contrib.layers.apply_regularization(regularizer, reg_variables)
loss += reg_term
Note: It looks like tf.contrib.layers.apply_regularization is implemented as an AddN, so more or less equivalent to sum(reg_variables).

I'll provide a simple correct answer since I didn't find one. You need two simple steps, the rest is done by tensorflow magic:
Add regularizers when creating variables or layers:
tf.layers.dense(x, kernel_regularizer=tf.contrib.layers.l2_regularizer(0.001))
# or
tf.get_variable('a', regularizer=tf.contrib.layers.l2_regularizer(0.001))
Add the regularization term when defining loss:
loss = ordinary_loss + tf.losses.get_regularization_loss()

Another option to do this with the contrib.learn library is as follows, based on the Deep MNIST tutorial on the Tensorflow website. First, assuming you've imported the relevant libraries (such as import tensorflow.contrib.layers as layers), you can define a network in a separate method:
def easier_network(x, reg):
""" A network based on tf.contrib.learn, with input `x`. """
with tf.variable_scope('EasyNet'):
out = layers.flatten(x)
out = layers.fully_connected(out,
num_outputs=200,
weights_initializer = layers.xavier_initializer(uniform=True),
weights_regularizer = layers.l2_regularizer(scale=reg),
activation_fn = tf.nn.tanh)
out = layers.fully_connected(out,
num_outputs=200,
weights_initializer = layers.xavier_initializer(uniform=True),
weights_regularizer = layers.l2_regularizer(scale=reg),
activation_fn = tf.nn.tanh)
out = layers.fully_connected(out,
num_outputs=10, # Because there are ten digits!
weights_initializer = layers.xavier_initializer(uniform=True),
weights_regularizer = layers.l2_regularizer(scale=reg),
activation_fn = None)
return out
Then, in a main method, you can use the following code snippet:
def main(_):
mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
# Make a network with regularization
y_conv = easier_network(x, FLAGS.regu)
weights = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'EasyNet')
print("")
for w in weights:
shp = w.get_shape().as_list()
print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
print("")
reg_ws = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES, 'EasyNet')
for w in reg_ws:
shp = w.get_shape().as_list()
print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
print("")
# Make the loss function `loss_fn` with regularization.
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
loss_fn = cross_entropy + tf.reduce_sum(reg_ws)
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss_fn)
To get this to work you need to follow the MNIST tutorial I linked to earlier and import the relevant libraries, but it's a nice exercise to learn TensorFlow and it's easy to see how the regularization affects the output. If you apply a regularization as an argument, you can see the following:
- EasyNet/fully_connected/weights:0 shape:[784, 200] size:156800
- EasyNet/fully_connected/biases:0 shape:[200] size:200
- EasyNet/fully_connected_1/weights:0 shape:[200, 200] size:40000
- EasyNet/fully_connected_1/biases:0 shape:[200] size:200
- EasyNet/fully_connected_2/weights:0 shape:[200, 10] size:2000
- EasyNet/fully_connected_2/biases:0 shape:[10] size:10
- EasyNet/fully_connected/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_1/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_2/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
Notice that the regularization portion gives you three items, based on the items available.
With regularizations of 0, 0.0001, 0.01, and 1.0, I get test accuracy values of 0.9468, 0.9476, 0.9183, and 0.1135, respectively, showing the dangers of high regularization terms.

If anyone's still looking, I'd just like to add on that in tf.keras you may add weight regularization by passing them as arguments in your layers. An example of adding L2 regularization taken wholesale from the Tensorflow Keras Tutorials site:
model = keras.models.Sequential([
keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
activation=tf.nn.relu, input_shape=(NUM_WORDS,)),
keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
activation=tf.nn.relu),
keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
There's no need to manually add in the regularization losses with this method as far as I know.
Reference: https://www.tensorflow.org/tutorials/keras/overfit_and_underfit#add_weight_regularization

I tested tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES) and tf.losses.get_regularization_loss() with one l2_regularizer in the graph, and found that they return the same value. By observing the value's quantity, I guess reg_constant has already make sense on the value by setting the parameter of tf.contrib.layers.l2_regularizer.

If you have CNN you may do the following:
In your model function:
conv = tf.layers.conv2d(inputs=input_layer,
filters=32,
kernel_size=[3, 3],
kernel_initializer='xavier',
kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-5),
padding="same",
activation=None)
...
In your loss function:
onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=num_classes)
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
regularization_losses = tf.losses.get_regularization_losses()
loss = tf.add_n([loss] + regularization_losses)

cross_entropy = tf.losses.softmax_cross_entropy(
logits=logits, onehot_labels=labels)
l2_loss = weight_decay * tf.add_n(
[tf.nn.l2_loss(tf.cast(v, tf.float32)) for v in tf.trainable_variables()])
loss = cross_entropy + l2_loss

Some answers make me more confused.Here I give two methods to make it clearly.
#1.adding all regs by hand
var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
var2 = tf.Variable(name='v2',initial_value=1.0,dtype=tf.float32)
regularizer = tf.contrib.layers.l1_regularizer(0.1)
reg_term = tf.contrib.layers.apply_regularization(regularizer,[var1,var2])
#here reg_term is a scalar
#2.auto added and read,but using get_variable
with tf.variable_scope('x',
regularizer=tf.contrib.layers.l2_regularizer(0.1)):
var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
var2 = tf.get_variable(name='v2',shape=[1],dtype=tf.float32)
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
#here reg_losses is a list,should be summed
Then,it can be added into the total loss

tf.GraphKeys.REGULARIZATION_LOSSES will not be added automatically, but there is a simple way to add them:
reg_loss = tf.losses.get_regularization_loss()
total_loss = loss + reg_loss
tf.losses.get_regularization_loss() uses tf.add_n to sum the entries of tf.GraphKeys.REGULARIZATION_LOSSES element-wise. tf.GraphKeys.REGULARIZATION_LOSSES will typically be a list of scalars, calculated using regularizer functions. It gets entries from calls to tf.get_variable that have the regularizer parameter specified. You can also add to that collection manually. That would be useful when using tf.Variable and also when specifying activity regularizers or other custom regularizers. For instance:
#This will add an activity regularizer on y to the regloss collection
regularizer = tf.contrib.layers.l2_regularizer(0.1)
y = tf.nn.sigmoid(x)
act_reg = regularizer(y)
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, act_reg)
(In this example it would presumably be more effective to regularize x, as y really flattens out for large x.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tensorflow - adding L2 regularization loss simple example - python

Almost According to the L2Loss operation code output.device(d) = (input.square() * static_cast<T>(0.5)).sum(); It multiplies also for 0.5 (or in other words it divides by 2)

Are these two L2 implementations equivalent? Almost, as #fabrizioM pointed out, you can see here for the introduction to the l2_loss in TensorFlow docs. Am I adding the regularization loss reg_loss correctly to the final loss function? So far so good : )

Related

Why is softmax classifier gradient divided by batch size (CS231n)?

Large WGAN-GP train loss

Cost remains same at 0.6932 in Siamese Network

CS231n: How to calculate gradient for Softmax loss function?

How to add regularizations in TensorFlow?

Categories

Resources