How does PyTorch compute the backward pass when optimizing triplet loss? - python

I am implementing a triplet network in PyTorch where the 3 instances (sub-networks) share the same weights. Since the weights are shared, I implemented it as a single-instance network that is called three times to produce the anchor, positive, and negative embeddings. The embeddings are learned by optimizing the triplet loss. Here is a small snippet for illustration:
from dependencies import *

model = SingleSubNet()  # represents each instance in the triplet net
for epoch in epochs:
    # each batch yields an (anchor, positive, negative) triplet
    for anch, pos, neg in train_loader:
        optimizer.zero_grad()
        fa, fp, fn = model(anch), model(pos), model(neg)
        loss = triplet_loss(fa, fp, fn)
        loss.backward()
        optimizer.step()
    # Do more stuff ...
My complete code works as expected. However, I do not understand how loss.backward() computes the gradient(s) in this case. I am confused because there are 3 gradients of the loss in each learning step (the gradient formulas are here). I assume the gradients are summed before optimizer.step() is performed. But it looks from the equations as if the summed gradients would cancel each other out and yield a zero update term. Of course, this cannot be true, as the network learns meaningful embeddings in the end.
Thanks in advance

Late answer, but hope this helps someone.
The gradients that you linked are the gradients of the loss with respect to the embeddings (the anchor, positive embedding and negative embedding). To update the model parameters, you use the gradient of the loss with respect to the model parameters. This does not sum to zero.
The reason for this is that when calculating the gradient of the loss with respect to the model parameters, the formula makes use of the activations from the forward pass, and the 3 different inputs (anchor image, positive example and negative example) have different activations in the forward pass.
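A minimal sketch of this point, assuming a small stand-in network in place of SingleSubNet and a squared-distance triplet loss with the hinge dropped (that is, assuming the triplet is active). The layer sizes, the margin, and the random inputs are illustrative choices, not taken from the question. The gradients with respect to the three embeddings cancel, while the gradient with respect to the shared first-layer weights does not:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for SingleSubNet: one network, shared by all three inputs.
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))

# Three different inputs, each of shape (1, 4).
anch, pos, neg = torch.randn(3, 1, 4)
fa, fp, fn = model(anch), model(pos), model(neg)

# Squared-distance triplet loss; the hinge is omitted (active triplet assumed).
margin = 1.0
loss = ((fa - fp) ** 2).sum() - ((fa - fn) ** 2).sum() + margin

# Gradients of the loss with respect to the embeddings: these sum to zero.
g_a, g_p, g_n = torch.autograd.grad(loss, (fa, fp, fn), retain_graph=True)
emb_grad_sum = g_a + g_p + g_n

# Gradient with respect to the shared parameters: each embedding gradient is
# propagated through the activations of its own forward pass, so nothing cancels.
loss.backward()
first_weight_grad = model[0].weight.grad

print(emb_grad_sum)             # approximately zero
print(first_weight_grad.norm()) # non-zero
```

Autograd accumulates the three contributions into the same .grad tensors because the same parameters appear three times in the graph; no manual summation is needed.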

Related

How to implement gradient ascent in a Keras DQN

I have built a Reinforcement Learning DQN with variable-length sequences as inputs, and positive and negative rewards calculated for actions. Some problem with my DQN model in Keras means that, although the model runs, average rewards decrease over time, over single and multiple cycles of epsilon. This does not change even after a significant period of training.
My thinking is that this is due to using MeanSquaredError in Keras as the loss function (minimising error). So I am trying to implement gradient ascent (to maximise reward). How can I do this in Keras? My current model is:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Masking, LSTM, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
inp = (env.NUM_TIMEPERIODS, env.NUM_FEATURES)
model.add(Input(shape=inp))  # a shape tuple (integers), not including the batch size
model.add(Masking(mask_value=0., input_shape=inp))
model.add(LSTM(env.NUM_FEATURES, input_shape=inp, return_sequences=True))
model.add(LSTM(env.NUM_FEATURES))
model.add(Dense(env.NUM_FEATURES))
model.add(Dense(4))
model.compile(loss='mse',
              optimizer=Adam(lr=LEARNING_RATE, decay=DECAY),
              metrics=[tf.keras.losses.MeanSquaredError()])
In trying to implement gradient ascent, by 'flipping' the gradient (as negative or inverse loss?), I have tried various loss definitions:
loss=-'mse'
loss=-tf.keras.losses.MeanSquaredError()
loss=1/tf.keras.losses.MeanSquaredError()
but these all generate bad operand [for unary] errors.
How can I adapt my current Keras model to maximise rewards?
Or is this gradient ascent not even the problem? Could it be some issue with the action policy?
Writing a custom loss function
Here is the loss function you want:
#tf.function
def positive_mse(y_true, y_pred):
    return -1 * tf.keras.losses.MSE(y_true, y_pred)
And then your compile line becomes:
model.compile(loss=positive_mse,
              optimizer=Adam(lr=LEARNING_RATE, decay=DECAY),
              metrics=[tf.keras.losses.MeanSquaredError()])
Please note: use loss=positive_mse and not loss=positive_mse(). That's not a typo. You need to pass the function itself, not the result of calling it.

Use Hamming Distance Loss Function with Tensorflow GradientTape: no gradients. Is it not differentiable?

I'm using Tensorflow 2.1 and Python 3, creating my custom training model following the tutorial "Tensorflow - Custom training: walkthrough".
I'm trying to use Hamming Distance on my loss function:
import tensorflow as tf
import tensorflow_addons as tfa

def my_loss_hamming(model, x, y):
    global output
    output = model(x)
    return tfa.metrics.hamming.hamming_loss_fn(y, output, threshold=0.5, mode='multilabel')

def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        tape.watch(model.trainable_variables)
        loss_value = my_loss_hamming(model, inputs, targets)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
When I call it:
loss_value, grads = grad(model, feature, label)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
the grads variable is a list of 38 None values.
And I get the error:
No gradients provided for any variable: ['conv1_1/kernel:0', ...]
Is there any way to use Hamming Distance without interrupting "the gradient chain registered by the gradient tape"?
Apologies if I'm saying something obvious, but backpropagation as a fitting algorithm for neural networks works through gradients: for each batch of training data you compute how much the loss function will improve or degrade if you move a particular trainable weight by a very small amount delta.
Hamming loss is by definition not differentiable: it thresholds the predictions, so it is piecewise constant, and small movements of the trainable weights produce no change in the loss. I imagine it is provided for final measurements of a trained model's performance rather than for training.
If you want to train a neural net through backpropagation, you need a differentiable loss, one that can help the model move the weights in the right direction. Sometimes people use various techniques to smooth losses such as Hamming loss and create approximations. Here, for example, it could be something that penalizes predictions less the closer they are to the target answer, rather than just giving out 1 for everything above the threshold and 0 for everything else.
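One possible smooth approximation, sketched under stated assumptions: instead of thresholding, use the predicted probabilities directly and average the per-label error. The function name soft_hamming_loss and the toy tensors are hypothetical and are not part of tensorflow_addons. When the probabilities are exactly 0 or 1, this quantity coincides with the fraction of wrong labels.

```python
import tensorflow as tf

def soft_hamming_loss(y_true, y_prob):
    """Differentiable surrogate: expected fraction of wrong labels.

    y_true holds 0/1 labels, y_prob holds probabilities in [0, 1].
    """
    y_true = tf.cast(y_true, y_prob.dtype)
    # A positive label contributes (1 - p), a negative label contributes p.
    return tf.reduce_mean(y_true * (1.0 - y_prob) + (1.0 - y_true) * y_prob)

# Toy multilabel example with hypothetical logits.
logits = tf.Variable([[2.0, -1.0, 0.5]])
y_true = tf.constant([[1.0, 0.0, 0.0]])

with tf.GradientTape() as tape:
    y_prob = tf.sigmoid(logits)
    loss = soft_hamming_loss(y_true, y_prob)

# No thresholding happens inside the tape, so the gradient is defined.
grad = tape.gradient(loss, logits)
print(loss.numpy(), grad.numpy())
```

The thresholded Hamming loss can still be reported as an evaluation metric alongside a surrogate like this one, or alongside a standard choice such as binary cross-entropy.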

PyTorch: Is retain_graph=True necessary in alternating optimization?

I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that changes the representation of my data (i.e. a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model that operates on the f(x) points, i.e. in the neural network space (rather than clustering points in the input space). I am optimizing the GMM using expectation maximization, so the parameter updates are analytically derived rather than obtained by gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (i.e. how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)
# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))
for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.
    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)
    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push pairs of points through the NN and compute network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute variational loss (ELBO) based on clustering parameters
5. Update neural network parameters using both the variational loss and the network loss
However, to perform (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this, as with retain_graph=True, around iteration 400, each iteration is taking ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance – I am still very new to automatic differentiation.
I would recommend summing the two losses and calling backward once:
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss w.r.t. the GMM weights is 0, so summing the losses has no effect on them.
Here is a good thread on the PyTorch forum regarding retain_graph: https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24
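A small sketch of why summing is equivalent, using a toy parameter vector and two made-up losses that share part of the graph (none of these names come from the question). Calling backward on each loss separately accumulates into .grad exactly what a single backward on the sum produces:

```python
import torch

torch.manual_seed(0)

w = torch.randn(3, requires_grad=True)  # toy shared parameters
x = torch.randn(5, 3)                   # toy data

def two_losses(w):
    h = x @ w  # intermediate result shared by both losses
    return (h ** 2).mean(), h.abs().sum()

# Option 1: two backward calls; the first must keep the shared graph alive.
loss_a, loss_b = two_losses(w)
loss_a.backward(retain_graph=True)
loss_b.backward()
grad_separate = w.grad.clone()

# Option 2: one backward call on the sum; no retain_graph needed.
w.grad = None
loss_a, loss_b = two_losses(w)
(loss_a + loss_b).backward()
grad_summed = w.grad.clone()

print(torch.allclose(grad_separate, grad_summed))
```

If tensors carried over from one iteration to the next still reference the previous iteration's graph, detaching them with .detach() is the usual way to stop the graph from growing. Whether that applies to the GMM parameters returned by maximization_step is an assumption, since the full code is not shown.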

Numerical equivalence of PyTorch backpropagation

After writing a simple neural network with numpy, I wanted to compare it numerically with a PyTorch implementation. Run on its own, my neural network implementation converges, so it seems to have no errors.
I have also checked that the forward pass matches PyTorch, so the basic setup is correct.
But something different happens during the backward pass, because the weights after one backpropagation step are different.
I don't want to post the full code here because it is spread over several .py files, and most of it is irrelevant to the question. I just want to know whether PyTorch does "basic" gradient descent or something different.
I am looking at the simplest example, the fully connected weights of the last layer, because if those differ, everything before them will differ too:
self.weight += self.learning_rate * hidden_layer.T.dot(output_delta )
where
output_delta = self.expected - self.output
self.expected holds the expected values,
self.output is the forward pass result.
There is no activation or anything else here.
The torch part is:
optimizer = torch.optim.SGD(nn.parameters() , lr = 1.0)
criterion = torch.nn.MSELoss(reduction='sum')
output = nn.forward(x_train)
loss = criterion(output, y_train)
loss.backward()
optimizer.step()
optimizer.zero_grad()
So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation function, not the basic one mentioned above? If so, I'd like to know how to check my numpy solution numerically against PyTorch.
You asked: "I just want to know whether PyTorch does "basic" gradient descent or something different."
If you set torch.optim.SGD, this means stochastic gradient descent.
There are different variants of GD, and the one typically used in PyTorch is applied to mini-batches.
Some GD variants update the parameters only after the full epoch. As you may guess, they are very "slow"; they may be fine to try on supercomputers. Other variants update after every single sample; as you may guess, their weakness is "huge" gradient fluctuations.
These are all relative terms, which is why I am using quotation marks.
Note that you are using a learning rate as large as lr = 1.0, which suggests you have not normalized your data first, but this is a skill you will pick up over time.
You also asked: "So is it possible that with the SGD optimizer and MSELoss it uses some different delta or backpropagation function, not the basic one mentioned above?"
It uses what you told it to use.
Here is an example in PyTorch and in plain Python showing that the gradient computation (as used in backpropagation) works as expected:
import torch

x = torch.tensor([5.], requires_grad=True)
print(x)  # tensor([5.], requires_grad=True)
y = 3*x**2
y.backward()
print(x.grad)  # tensor([30.])
How would you get this value of 30 in plain Python?
def y(x):
    return 3*x**2

x = 5
e = 0.01  # small step for the finite difference
g = (y(x+e) - y(x)) / e
print(g)  # 30.0299
As expected, we get ~30; a smaller step would bring it even closer.
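For the specific comparison in the question, one detail is worth checking: the derivative of the summed squared error sum((output - expected)**2) carries a factor of 2, which the update rule in the question does not include. The sketch below uses made-up shapes and random data, and assumes a single weight matrix with no bias; with the factor of 2 the manual numpy step matches torch.optim.SGD, and without it the result differs.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 3))    # hypothetical hidden-layer activations
expected = rng.standard_normal((4, 2))  # hypothetical targets
W0 = rng.standard_normal((3, 2))        # hypothetical last-layer weights
lr = 0.1

# Manual step: d/dW sum((hidden @ W - expected)**2) = 2 * hidden.T @ (output - expected)
output = hidden @ W0
W_manual = W0 + lr * 2.0 * hidden.T @ (expected - output)
W_no_factor = W0 + lr * hidden.T @ (expected - output)  # update rule from the question

# Same step with autograd and plain SGD (no momentum, no weight decay).
W_t = torch.tensor(W0, requires_grad=True)
opt = torch.optim.SGD([W_t], lr=lr)
criterion = torch.nn.MSELoss(reduction='sum')
loss = criterion(torch.tensor(hidden) @ W_t, torch.tensor(expected))
loss.backward()
opt.step()

W_torch = W_t.detach().numpy()
print(np.allclose(W_manual, W_torch))     # manual step with the factor of 2
print(np.allclose(W_no_factor, W_torch))  # manual step without it
```

When comparing against torch.nn.Linear rather than a raw matrix, keep in mind that it stores its weight with shape (out_features, in_features) and has a bias by default, so the numpy weights need to be transposed and the bias accounted for.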

Optimize sparse softmax cross entropy with L2 regularization

I was training my network using tf.losses.sparse_softmax_cross_entropy as the classification loss on the last layer, and everything was working fine.
I have now simply added an L2 regularization over my weights, and my loss is not getting optimized anymore. What could be happening?
reg = tf.nn.l2_loss(w1) + tf.nn.l2_loss(w2)
loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(y, logits)) + reg*beta
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
It is hard to answer with certainty given the provided information, but here is a possible cause:
tf.nn.l2_loss is computed as a sum over the elements, while your cross-entropy loss is reduced to its mean (cf. tf.reduce_mean), hence a numerical imbalance between the two terms.
Try, for instance, dividing each L2 loss by the number of elements it is computed over (e.g. tf.size(w1), cast to the weights' float type).
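A short sketch of that normalization, assuming TensorFlow 2 eager execution; the helper name mean_l2_penalty and the toy weight tensor are hypothetical. Note that tf.nn.l2_loss returns sum(w**2) / 2, and that tf.size returns an integer tensor, so it has to be cast before the division:

```python
import tensorflow as tf

def mean_l2_penalty(w):
    """tf.nn.l2_loss (sum(w**2) / 2) divided by the number of elements in w."""
    return tf.nn.l2_loss(w) / tf.cast(tf.size(w), w.dtype)

w1 = tf.ones((10, 20))  # toy weight matrix with 200 elements

summed = tf.nn.l2_loss(w1)      # grows with the size of the layer
averaged = mean_l2_penalty(w1)  # stays on a per-element scale

print(float(summed), float(averaged))
```

With the penalty on a per-element scale, beta can be tuned without having to compensate for the number of weights in each layer.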
