Optimizing Values that are on GPU - python

I am trying to optimize a PyTorch tensor which I am also using as input to a network. Let's call this tensor "shape". My optimizer is as follows:
optimizer = torch.optim.Adam(
[shape],
lr=0.0001
)
I also compute vertex values from this "shape" tensor:
vertices = model(shape)
And my loss function computes the loss as the difference between the inferred vertices and the ground truth vertices:
loss = torch.sqrt(((gt_vertices - vertices) ** 2).sum(2)).mean(1).mean()
So what I am actually doing is estimating the shape values; they are the only thing I am interested in. This works perfectly fine when everything is on the CPU. However, when I put my shape and model on the GPU by calling to("cuda"), I get the classic non-leaf Tensor error:
ValueError: can't optimize a non-leaf Tensor
Calling .detach().cpu() on shape inside the optimizer solves the issue, but then gradients cannot flow as they should and my values are not updated. How can I make this work?

When you call .to('cuda'), e.g. shape_p = shape.to('cuda'), you are making a copy of shape. While shape remains a leaf tensor, shape_p does not, because its 'parent' tensor is shape. Therefore shape_p is not a leaf, and the optimizer raises the error when asked to optimize it.
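Sending it to the CUDA device after setting up the optimizer avoids the error, because the optimizer then holds the original leaf tensor (there are certain instances when this isn't possible though, see here). Keep in mind that the optimizer updates that original CPU tensor, so the CUDA copy has to be re-created from it on every iteration:
>>> optimizer = torch.optim.Adam([shape], lr=0.0001)
>>> shape_cuda = shape.cuda()  # non-leaf copy; redo this after each optimizer.step()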
The best option though, in my opinion, is to create it directly on the device:
>>> shape = torch.rand(1, requires_grad=True, device='cuda')
>>> optimizer = torch.optim.Adam([shape], lr=0.0001)
This way it remains a leaf tensor.
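To make the leaf/non-leaf distinction concrete, here is a minimal sketch of the second approach. The tiny frozen Linear layer and the random gt_vertices are hypothetical stand-ins for the question's model and data, and the learning rate is chosen only so that the toy problem converges quickly. The device falls back to CPU so the sketch runs anywhere.

```python
import torch

# Use the GPU when one is available; the sketch also runs on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the question's model and ground truth.
torch.manual_seed(0)
model = torch.nn.Linear(4, 6).to(device)
for p in model.parameters():
    p.requires_grad_(False)  # only `shape` is being estimated
gt_vertices = torch.randn(1, 2, 3, device=device)

# Created directly on the device, so it is a leaf tensor.
shape = torch.zeros(1, 4, requires_grad=True, device=device)
optimizer = torch.optim.Adam([shape], lr=0.01)

losses = []
for _ in range(200):
    optimizer.zero_grad()
    vertices = model(shape).reshape(1, 2, 3)
    loss = torch.sqrt(((gt_vertices - vertices) ** 2).sum(2)).mean(1).mean()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Any tensor produced from a leaf by an operation is not a leaf itself.
derived = shape * 1.0
print(shape.is_leaf, derived.is_leaf)
```

Passing derived to the optimizer instead of shape would reproduce the "can't optimize a non-leaf Tensor" error, which is the same situation as optimizing the result of shape.to('cuda').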

Related

No gradients provided for any variable error

I'm creating a model using the Keras functional API.
The layer architecture is as follows:
n = tf.keras.layers.Dense(1)(input)
for i in tf.range(n):
    output = tf.keras.layers.Dense(4)(input)
I then concatenate the outputs and return a tensor with shape [1, None, 4], where [1] is the batch dimension, [None] is n, and [4] is the output from the second dense layer.
My loss function involves comparing the shape of the expected output, and comparing the outputs.
loss = tf.convert_to_tensor(abs(tf.shape(logits)[1] - tf.shape(expected)[1])) * 100.
When running this on a custom training loop, I'm getting the error
ValueError: No gradients provided for any variable: (['while/dense/kernel:0',
'while/dense/bias:0', 'while/while/dense_1/kernel:0', 'while/while/dense_1/bias:0'],).
Provided `grads_and_vars` is ((None, <tf.Variable 'while/dense/kernel:0' shape=(786432, 1)
Shape is not differentiable, so you cannot do things like this with gradient-based learning. Problems like this need to be tackled with more powerful tools, e.g. reinforcement learning, where one treats n as an action and uses a policy gradient for it.
A rule of thumb to remember is that you cannot really backprop through discrete objects. You need to produce floats, because gradients require smooth functions. In your case n has to be an integer (what would a loop over a float mean?), so this should be your first warning sign. The other is the shape itself, which is also an integer. A target can be discrete, but the prediction cannot. Note that even in classification we do not output a class; we output a probability, because probability is smooth.
You could build your model by assuming some maximum value of N and treating the problem more like a classification in which you supervise N directly, using some form of masking to keep all the results around.
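The fixed-maximum idea can be sketched in plain NumPy. Everything here is an illustrative assumption rather than part of the original model: MAX_N, the loss_fn helper, and the way the two loss terms are combined. The point is only that the count becomes a smooth probability vector, and the padded rows are masked out of the loss.

```python
import numpy as np

MAX_N = 5  # assumed upper bound on the number of output rows


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def loss_fn(count_logits, slot_outputs, true_n, target_rows):
    """Cross-entropy on the count plus masked squared error on the rows."""
    probs = softmax(count_logits)        # smooth prediction over 0..MAX_N
    count_loss = -np.log(probs[true_n])  # discrete target, smooth prediction

    # 1.0 for rows that really exist, 0.0 for padding rows.
    mask = (np.arange(MAX_N) < true_n).astype(float)
    padded = np.zeros((MAX_N, 4))
    padded[:true_n] = target_rows
    row_err = ((slot_outputs - padded) ** 2).sum(axis=1)
    row_loss = (mask * row_err).sum() / max(true_n, 1)
    return count_loss + row_loss


rng = np.random.default_rng(0)
count_logits = rng.normal(size=MAX_N + 1)  # one logit per possible count
slot_outputs = rng.normal(size=(MAX_N, 4))  # always MAX_N rows of 4 values
target_rows = rng.normal(size=(2, 4))       # this example has n = 2

loss = loss_fn(count_logits, slot_outputs, 2, target_rows)

# At inference time the count is read off with argmax and used to slice.
n_hat = int(np.argmax(count_logits))
prediction = slot_outputs[:n_hat]
```

Both terms are differentiable with respect to the network outputs, whereas the original loss on tf.shape was not.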

MultiHeadAttention giving very different values between versions (Pytorch/Tensorflow)

I'm trying to recreate a transformer that was written in Pytorch and make it Tensorflow. Everything was going pretty well until each version of MultiHeadAttention started giving extremely different outputs. Both methods are an implementation of multi-headed attention as described in the paper "Attention is all you Need", so they should be able to achieve the same output.
I'm converting
self_attn = nn.MultiheadAttention(dModel, nheads, dropout=dropout)
to
self_attn = MultiHeadAttention(num_heads=nheads, key_dim=dModel, dropout=dropout)
For my tests, dropout is 0.
I'm calling them with:
self_attn(x,x,x)
where x is a tensor with shape=(10, 128, 50)
As expected from the documentation, the Pytorch version returns a tuple, (the target sequence length, embedding dimension), both with dimensions [10, 128, 50].
I'm having trouble getting the TensorFlow version to do the same thing. With TensorFlow I only get one tensor back (size [10, 128, 50]), and it looks like neither the target sequence length nor the embedding dimension tensor from PyTorch.
Based on the Tensorflow documentation I should be getting something comparable.
How can I get them to operate the same way? I'm guessing I'm doing something wrong with Tensorflow but I can't figure out what.
By default, nn.MultiheadAttention returns a tuple of two tensors:
attn_output -- result of self-attention operation
attn_output_weights -- attention weights averaged(!) over heads
tf.keras.layers.MultiHeadAttention, by contrast, returns only one tensor by default, attention_output (which corresponds to PyTorch's attn_output). The attention weights of all heads are also returned if the parameter return_attention_scores is set to True:
output, scores = self_attn(x, x, x, return_attention_scores=True)
The scores tensor should also be averaged over the heads axis to fully match PyTorch:
scores = tf.math.reduce_mean(scores, 1)
While porting, keep in mind that by default (as in the snippet in the question) nn.MultiheadAttention expects input of the form (seq_length, batch_size, embed_dim), whereas tf.keras.layers.MultiHeadAttention expects (batch_size, seq_length, embed_dim).
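The two shape conversions can be checked without either framework. The following NumPy sketch uses random arrays as stand-ins for the real input and for the per-head scores (the number of heads, 5, is an assumption); it only illustrates the axis bookkeeping, not the attention computation itself.

```python
import numpy as np

seq_len, batch, embed_dim, heads = 10, 128, 50, 5
rng = np.random.default_rng(0)

# PyTorch's default layout: (seq_length, batch_size, embed_dim).
x_torch_layout = rng.random((seq_len, batch, embed_dim))

# Keras layout: (batch_size, seq_length, embed_dim).
x_keras_layout = np.transpose(x_torch_layout, (1, 0, 2))

# Stand-in for Keras attention scores: (batch, heads, target_len, source_len).
scores = rng.random((batch, heads, seq_len, seq_len))

# Averaging over the heads axis gives (batch, target_len, source_len),
# the shape of PyTorch's attn_output_weights.
scores_avg = scores.mean(axis=1)

print(x_keras_layout.shape, scores_avg.shape)
```

One more difference worth checking when comparing outputs: in Keras, key_dim is the size of each head, while PyTorch splits embed_dim across the heads, so key_dim=dModel does not give the same per-head size as the PyTorch layer. The outputs will also only agree numerically if the weights are copied across.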

Why results are different from call and predict in a Keras model? It seems predict ignore any random generated value

I am seeing different behaviours between calling a model, and calling the predict method. It seems predict would ignore all randomly generated values.
In this notebook I am trying to introduce a stochastic process into my network.
Basically, for every entry, I duplicate it 10 times, and for each slice, I add some random noise.
When calling the model with a tensor, I am seeing expected output, where an input entry yields some noise.
When calling predict on the same data, I am seeing only the same output.
So I save the model weights, and load the weights to a similar model without any noise to verify my hypothesis. Indeed, without noise, it yields the same outputs for call and predict, and the same outputs with the previous noisy model when calling predict.
Why am I seeing this behaviour? Does it mean that when training the network with fit, it will ignore random values as well?
When you call predict, Keras uses a TensorFlow compiled graph to run the model which, among other things, means that the batch dimension of the data tensor will generally be None (because you can predict on batches of any size). In your foo function that adds the noise to the input:
def foo(x):
    B, D = K.int_shape(x)
    if B is None:
        return x
    else:
        mask = tf.random.normal((B,D))
        return x + mask
You use int_shape to get the shape of x as Python integers, or None for unknown dimensions. This works as expected with eager tensors, where all dimensions are always known, but in graph mode the returned batch dimension B is None, so the conditional goes through the first branch and the input remains untouched.
The simplest solution is to use K.shape instead, which will give you another tensor (symbolic or eager) containing the full shape of x, and which you can use to generate the random noise:
def foo(x):
    return x + tf.random.normal(K.shape(x))
This should always work as expected.
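The difference between the static and the dynamic shape can be observed directly. This is a small sketch assuming TensorFlow 2.x; add_noise and traced_batch_dims are illustrative names, and tf.shape is used in place of K.shape (they serve the same purpose here). The input_signature forces an unknown batch dimension, which is the situation described above for predict.

```python
import tensorflow as tf

traced_batch_dims = []


@tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
def add_noise(x):
    # This append runs only while the function is traced into a graph,
    # where the static batch dimension is unknown (None).
    traced_batch_dims.append(x.shape[0])
    # tf.shape(x) is a tensor that is evaluated when the graph runs,
    # so it always holds the real batch size.
    return x + tf.random.normal(tf.shape(x))


x = tf.zeros([4, 3])
y = add_noise(x)

print(traced_batch_dims)  # static batch dimension seen during tracing
print(y.shape)            # shape of the actual result
```

A branch such as "if B is None: return x" is decided once at trace time, which is why the noise silently disappears in graph mode.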

Why prediction on activation values (Softmax) gives incorrect results?

I've implemented a basic neural network from scratch using TensorFlow and trained it on the Fashion-MNIST dataset. It trains correctly and reaches a test accuracy of around 88-90% over 10 classes.
Now I've written a predict() function which predicts the class of a given image using the trained weights. Here is the code:
def predict(images, trained_parameters):
    Ws, bs = [], []
    parameters = {}
    for param in trained_parameters.keys():
        parameters[param] = tf.convert_to_tensor(trained_parameters[param])
    X = tf.placeholder(tf.float32, [images.shape[0], None], name = 'X')
    Z_L = forward_propagation(X, trained_parameters)
    p = tf.argmax(Z_L) # Working fine
    # p = tf.argmax(tf.nn.softmax(Z_L)) # not working if softmax is applied
    with tf.Session() as session:
        prediction = session.run(p, feed_dict={X: images})
    return prediction
This uses the forward_propagation() function, which returns the weighted sum of the last layer (Z) and not the activations (A), because TensorFlow's tf.nn.softmax_cross_entropy_with_logits() requires Z instead of A: it computes A itself by applying softmax. Refer to this link for details.
Now in the predict() function, when I make predictions using Z instead of A (the activations), it works correctly. But if I compute softmax on Z (which gives the activations A of the last layer), it gives incorrect predictions.
Why does it give correct predictions on the weighted sums Z? Are we not supposed to apply the softmax activation first (and compute A) and then make predictions?
Here is the link to my colab notebook if anyone wants to look at my entire code: Link to Notebook Gist
So what am I missing here?
Most TF functions, such as tf.nn.softmax, assume by default that the batch dimension is the first one - that is a common practice. Now, I noticed in your code that your batch dimension is the second, i.e. your output shape is (output_dim=10, batch_size=?), and as a result, tf.nn.softmax is computing the softmax activation along the batch dimension.
There is nothing wrong in not following the conventions - one just needs to be aware of them. Computing the argmax of the softmax along the first axis should yield the desired results (it is equivalent to taking the argmax of the logits):
p = tf.argmax(tf.nn.softmax(Z_L, axis=0))
I would also recommend passing the axis to tf.argmax explicitly (axis=0 here), in case more than one image is fed into the network.
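A small NumPy sketch shows why the axis matters. The logits below are made-up values laid out as (classes, batch), as in the question, and softmax is a hand-written helper rather than the TensorFlow function. Normalising over the class axis preserves the argmax of the logits; normalising over the batch axis does not.

```python
import numpy as np


def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


# Rows are classes, columns are images: shape (3 classes, 2 images).
Z_L = np.array([[5.0, 5.0],
                [4.0, 0.0],
                [0.0, 1.0]])

pred_logits = Z_L.argmax(axis=0)                     # [0, 0]
pred_correct = softmax(Z_L, axis=0).argmax(axis=0)   # [0, 0], over classes
pred_wrong = softmax(Z_L, axis=1).argmax(axis=0)     # [1, 2], over the batch

print(pred_logits, pred_correct, pred_wrong)
```

Softmax is monotonic along the axis it normalises, so the class ranking for each image survives only when that axis is the class axis.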

Why Tensorflow is unable to compute the gradient wrt the reshaped parameters?

I'd like to compute the gradient of the loss wrt all the network params. The problem arises when I try to reshape each weight matrix in order to be 1 dimensional (it is useful for computations that I do later with the gradients).
At this point Tensorflow outputs a list of None (which means that there is no path from the loss to those tensors while there should be as they are the model parameters reshaped).
Here is the code:
all_tensors = list()
for dir in ["fw", "bw"]:
    for mtype in ["kernel"]:
        t = tf.get_default_graph().get_tensor_by_name("encoder/bidirectional_rnn/%s/lstm_cell/%s:0" % (dir, mtype))
        all_tensors.append(t)
# classifier tensors:
for mtype in ["kernel", "bias"]:
    t = tf.get_default_graph().get_tensor_by_name("encoder/dense/%s:0" % (mtype))
    all_tensors.append(t)
all_tensors = [tf.reshape(x, [-1]) for x in all_tensors]
tf.gradients(self.loss, all_tensors)
At the end of the for loops, all_tensors is a list of 4 matrices of different shapes. This code outputs [None, None, None, None].
If I remove the reshape line all_tensors = [tf.reshape(x, [-1]) for x in all_tensors]
the code works fine and returns 4 tensors containing the gradients wrt each param.
Why does it happen? I'm pretty sure that reshape doesn't break any dependency in the graph, otherwise it couldn't be used in any network at all.
Well, the fact is that there is no path from your tensors to the loss. If you think of the computation graph in TensorFlow, self.loss is defined through a series of operations that at some point use the tensors you are interested in. However, when you do:
all_tensors = [tf.reshape(x, [-1]) for x in all_tensors]
You are creating new nodes in the graph and new tensors that are not being used by anyone. Yes, there is a relationship between those tensors and the loss value, but from the point of view of TensorFlow that reshaping is an independent computation.
If you want to do something like that, you would have to do the reshaping first and then compute the loss using the reshaped tensors. Or, alternatively, you can just compute the gradients with respect to the original tensors and then reshape the result.
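The second alternative can be sketched as follows. Note the assumptions: this uses TensorFlow 2.x with tf.GradientTape rather than the graph-mode tf.gradients call from the question, and kernel, bias and the toy loss are hypothetical stand-ins for the real parameters. The behaviour it illustrates is the same: a reshaped copy that the loss never uses has no gradient, while gradients with respect to the original variables can be flattened afterwards.

```python
import tensorflow as tf

# Hypothetical stand-ins for the model parameters.
kernel = tf.Variable(tf.ones([3, 2]))
bias = tf.Variable(tf.zeros([2]))
x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.matmul(x, kernel) + bias)
    # A reshaped copy created on the side; the loss does not depend on it.
    flat_kernel = tf.reshape(kernel, [-1])

# No path from the loss to the reshaped copy, so the result is None.
no_path = tape.gradient(loss, flat_kernel)

# Differentiate with respect to the original variables, then flatten.
grads = tape.gradient(loss, [kernel, bias])
flat_grads = [tf.reshape(g, [-1]) for g in grads]

print(no_path)
print([g.shape for g in flat_grads])
```

Flattening the gradients after the fact keeps the model code untouched, which is usually simpler than rebuilding the loss on top of reshaped parameters.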
