Regularization for softmax in gradient descent - python

I'm writing a gradient descent function for a multi-class classifier using softmax. I'm a bit confused about how regularization should work in the gradient function. I've set up my matrix X so that the first column is all ones, and w is a matrix where each row holds the weights for one feature and each column corresponds to a label. I understand that the bias term/intercept should not be regularized. However, I'm not clear on how to leave the bias term out.
Some of the code I'm learning from has the following in the function to calculate the gradient:
scores = np.dot(X, w)                    # raw class scores, shape (n_samples, n_classes)
predictions = softmax_function(scores)   # predicted class probabilities
gradient = -np.dot(X.T, y_actual - predictions) / len(y_actual)
regularizer = np.hstack((np.zeros((w.shape[0], 1)), w[:, 1:w.shape[1]]))
return (gradient, regularizer)
Then, when w is updated at the end of the epoch:
w_new = w_old - learning_rate*(gradient+regularizer*lambd)
So, here's my question. In the code above, why is hstack() used to populate the first column in the regularization term with zeros? It seems like we'd want to use vstack() to make the first row in the regularizer zeros, since the bias weights are going to be the first row.
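To make the shapes concrete, here is a minimal sketch of the layout I'm assuming (hypothetical sizes; the column of ones in X comes first, so the bias weights sit in the first row of w), contrasting what the hstack version zeros with a vstack version that zeros the first row:

import numpy as np

n_samples, n_features, n_classes = 5, 3, 4
X = np.hstack((np.ones((n_samples, 1)), np.random.randn(n_samples, n_features)))  # (5, 4), ones first
w = np.random.randn(n_features + 1, n_classes)                                    # (4, 4), bias row first

# the code I'm learning from: zeros the first COLUMN (i.e. one label's weights)
reg_hstack = np.hstack((np.zeros((w.shape[0], 1)), w[:, 1:]))

# what I would expect: zero the first ROW (the bias weight of every label)
reg_vstack = np.vstack((np.zeros((1, w.shape[1])), w[1:, :]))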

Related

Keras Categorical Cross Entropy

I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where the delta = output - target
is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, as calculated by:
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
<etc>
This just returns an array of per-sample losses; each one measures the discrepancy between the true label and what your model thinks the label should be.
The next step after calculating the loss (part of the forward propagation phase) is to then start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss you've calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz, i.e. the derivative of the loss function with respect to the linear function (y = Wx + b), which itself is the product dL/da * da/dz (i.e. the derivative of the loss w.r.t. the activation times the derivative of the activation w.r.t. the linear function).
The link you posted is the derivative of the activation function w.r.t. the linear function. This blog does a decent job of explaining how all the parts fit together; the activation function they use is a sigmoid rather than a softmax, but the overall pieces that fit together are the same.
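For what it's worth, here is a small numeric check (not part of the Keras source, just an illustration) that once the softmax and the categorical cross entropy are combined, dL/dz collapses to exactly the output - target delta you were looking for:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])        # logits for one sample
target = np.array([0.0, 1.0, 0.0])   # one-hot label

def loss(z):
    return -np.sum(target * np.log(softmax(z)))

# central-difference numerical gradient of the loss w.r.t. the logits
eps = 1e-6
num_grad = np.array([(loss(z + eps * np.eye(3)[i]) - loss(z - eps * np.eye(3)[i])) / (2 * eps)
                     for i in range(3)])

print(num_grad)              # approximately equal to...
print(softmax(z) - target)   # ...output - target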

PyTorch: Is retain_graph=True necessary in alternating optimization?

I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that is changing the representation of my data (i.e. a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model that operates on the f(x) points, i.e. in the neural network space (rather than clustering points in the input space). I am optimizing the GMM using expectation maximization, so the parameter updates are analytically derived, rather than using gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (ie how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)

# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))

for i in range(1000):
    net_optim.zero_grad()

    pairs, pair_labels = pairGenerator(data, labels)     # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])        # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)   # loss function based on pairwise dist.

    embedding = net.forward_one(data)                    # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)

    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push pairs of points through the NN and compute network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute variational loss (ELBO) based on clustering parameters
5. Update neural network parameters using both the variational loss and the network loss
However, to perform (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this: with retain_graph=True, by around iteration 400 each iteration is taking ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance; I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss with respect to the GMM parameters is zero, so summing the two losses will not have any unintended effect.
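For example, the training loop from the question could then look like this (a sketch reusing the question's names; with a single graph and a single backward call, retain_graph is no longer needed):

for i in range(1000):
    net_optim.zero_grad()

    pairs, pair_labels = pairGenerator(data, labels)
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])
    net_loss = NeuralNetworkLoss(outputs, pair_labels)

    embedding = net.forward_one(data)
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)

    total_loss = net_loss + gmm_loss   # sum the two losses
    total_loss.backward()              # single backward pass, no retain_graph
    net_optim.step()

If the iterations still slow down over time, it may be because the EM outputs (means, precisions, etc.) carry autograd history from previous iterations; detaching them (e.g. means.detach()) before feeding them back into expectation_step would keep each iteration's graph independent.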
Here is a good thread on pytorch regarding the retain_graph. https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24

Optimize sparse softmax cross entropy with L2 regularization

I was training my network using tf.losses.sparse_softmax_cross_entropy as the classification function in the last layer and everything was working fine.
I have now simply added L2 regularization over my weights, and my loss is not getting optimized anymore. What could be happening?
reg = tf.nn.l2_loss(w1) + tf.nn.l2_loss(w2)
loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(y, logits)) + reg*beta
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
It is hard to answer with certainty given the provided information, but here is a possible cause:
tf.nn.l2_loss is computed as a sum over the elements, while your cross-entropy loss is reduced to its mean (c.f. tf.reduce_mean), hence a numerical imbalance between the two terms.
Try, for instance, dividing each L2 loss by the number of elements it is computed over (e.g. tf.size(w1)).
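As a sketch (reusing w1, w2, beta, y and logits from the question, with the same TF1-style API), that rescaling could look like:

# scale each L2 term by the number of weights it sums over, so that it is on a
# comparable scale to the mean cross-entropy
reg = (tf.nn.l2_loss(w1) / tf.cast(tf.size(w1), tf.float32)
       + tf.nn.l2_loss(w2) / tf.cast(tf.size(w2), tf.float32))
loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(y, logits)) + reg * beta
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)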

Sparse Cross Entropy in Tensorflow

Using tf.nn.sparse_softmax_cross_entropy_with_logits in TensorFlow, it's possible to calculate the loss only for specific rows by setting the class label to -1 (otherwise the label is expected to be in the range 0 to numclasses-1).
Unfortunately this breaks the gradient computations (as is mentioned in the comments in the source nn_ops.py).
What I would like to do is something like the following:
raw_classification_output1 = [0,1,0]
raw_classification_output2 = [0,0,1]
classification_output =tf.concat(0,[raw_classification_output1,raw_classification_output2])
classification_labels = [1,-1]
classification_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(classification_output,classification_labels)
total_loss = tf.reduce_sum(classification_loss) + tf.reduce_sum(other_loss)
optimizer = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(total_loss)
changed_grads_and_vars = #do something to 0 the incorrect gradients
optimizer.apply_gradients(changed_grads_and_vars)
What's the most straightforward way to zero those gradients?
The easiest method is to just multiply the classification loss by a similar tensor that is 1 where the loss is desired and 0 where it isn't. This is made easier by the fact that the loss is already zero where you don't want it to be updated. This is basically a workaround for the fact that the sparse softmax still produces some odd gradient behavior even where its loss is zero.
Add this line after the tf.nn.sparse_softmax_cross_entropy_with_logits call:
classification_loss_zeroed = tf.mul(classification_loss, tf.to_float(tf.not_equal(classification_loss, 0)))
It should zero out the gradients as well.
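Another option (an assumption on my part, not something the answer above requires) is to derive the mask from the labels rather than from the loss values, so that any row labelled -1 contributes neither loss nor gradient:

# mask built from the labels themselves, using the same old-style TF API as above
valid = tf.to_float(tf.not_equal(classification_labels, -1))
classification_loss_zeroed = tf.mul(classification_loss, valid)
total_loss = tf.reduce_sum(classification_loss_zeroed) + tf.reduce_sum(other_loss)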

Using the softmax layer within the objective function itself

This is going to be long and hard to describe so apologies in advance.
I have a regular CNN-like network with standard MLP layers on top of it. On top of the MLP I also have a softmax layer; however, unlike conventional networks, this is NOT fully connected to the MLP below and it consists of subgroups.
To further describe the softmax, it looks like this:
Neur1A Neur2A ... NeurNA Neur1B Neur2B ... NeurNB Neur1C Neur2C ...NeurNC
Group A Group B Group C
There are many more groups. Each group has a softmax that is independent from the other groups. So it is, in a way, several independent classifications (even though it actually is not).
What I need is for the index of the activated neuron to be monotonically increasing between groups. For example, if Neuron5 in Group A is activated, I want the activated neuron in Group B to be >= 5. The same goes for Group B and Group C, and so on.
This softmax layer containing all the neurons for all groups is actually NOT my last layer; interestingly, it is an intermediate one.
To achieve this monotonicity, I add another term to my loss function that penalizes non monotonic activated neuron indices. Here is some of the code:
The code for softmax layer and its output:
def compute_image_estimate(layer2_input):
    estimated_yps = tf.zeros([FLAGS.batch_size, 0], dtype=tf.int64)
    for pix in xrange(NUM_CLASSES):
        pixrow = int(pix / width)
        rowdata = image_pixels[:, pixrow*width:(pixrow+1)*width]
        with tf.variable_scope('layer2_' + '_' + str(pix)) as scope:
            weights = _variable_with_weight_decay('weights', shape=[layer2_input.get_shape()[1], width],
                                                  stddev=0.04, wd=0.0000000)
            biases = _variable_on_cpu('biases', [width], tf.constant_initializer(0.1))
            y = tf.nn.softmax(tf.matmul(layer2_input, weights) + biases)
            argyp = width - 1 - tf.argmax(y, 1)
            argyp = tf.reshape(argyp, [FLAGS.batch_size, 1])
            estimated_yps = tf.concat(1, [estimated_yps, argyp])
    return estimated_yps
The estimated_yps are passed onto a function that quantifies monotonicity:
def compute_monotonicity(yp):
    sm = tf.zeros([FLAGS.batch_size])
    for curr_row in xrange(height):
        for curr_col in xrange(width-1):
            pix = curr_row*width + curr_col
            sm = sm + alpha * tf.to_float(tf.square(tf.minimum(0, tf.to_int32(yp[:, pix] - yp[:, pix+1]))))
    return sm
and the loss function is:
def loss(estimated_yp, SOME_OTHER_THINGS):
    tf.add_to_collection('losses', SOME_OTHER_THINGS)
    monotonicity_metric = tf.reduce_mean(compute_monotonicity(estimated_yp))
    tf.add_to_collection('losses', monotonicity_metric)
    return tf.add_n(tf.get_collection('losses'), name='total_loss')
Now my problem is, when I do not use SOME_OTHER_THINGS (which are conventional metrics), I get ValueError: No gradients provided for any variable for the monotonicity metric.
Seems like gradients are not defined when the softmax layer outputs are used like this.
Am I doing something wrong? Any help would be appreciated.
Apologies... I realized that the problem is that the tf.argmax function obviously does not have a gradient defined.
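In case it helps anyone else, one possible workaround (my own sketch, not tested against the original model) is to replace the hard argmax with a differentiable "soft argmax", i.e. the expected index under the softmax probabilities:

# differentiable alternative to width - 1 - tf.argmax(y, 1): take the expected
# index under the softmax output y (shape [batch_size, width]) instead of the hard max
indices = tf.cast(tf.range(width), tf.float32)            # [0, 1, ..., width-1]
soft_argyp = width - 1 - tf.reduce_sum(y * indices, 1)    # float-valued, has gradients
soft_argyp = tf.reshape(soft_argyp, [FLAGS.batch_size, 1])

The monotonicity penalty would then also need to stay in floats (i.e. drop the tf.to_int32 cast) so that gradients can flow through it.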
