I am trying to manually calculate a gradient using the output of my network, which I will then use in a loss function. I have managed to get an example working in Keras, but converting it to PyTorch has proven more difficult.
I have a model like:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1, 50)
        self.fc2 = nn.Linear(50, 10)
        self.fc3 = nn.Linear(10, 1)

    def forward(self, x):
        x = F.sigmoid(self.fc1(x))
        x = F.sigmoid(self.fc2(x))
        x = self.fc3(x)
        return x
and some data:
x = torch.unsqueeze(torch.linspace(-1, 1, 101), dim=1)
x = Variable(x)
I can then try to find a gradient like:
output = net(x)
grad = torch.autograd.grad(outputs=output, inputs=x, retain_graph=True)[0]
I want to be able to find the gradient of each point, then do something like:
err_sqr = (grad - x)**2
loss = torch.mean(err_sqr)**2
However, at the moment if I try to do this I get the error:
grad can be implicitly created only for scalar outputs
I have tried changing the shape of my network output to fix this, but if I change it too much it says it's not part of the graph. I can get rid of that error by allowing unused inputs, but then it says my gradient is None. I've managed to get this working in Keras, so I'm confident that it's possible here too, I just need a hand!
My question is:
Is there a way to "fix" what I have so that I can calculate the gradient?
PyTorch expects an upstream gradient in the grad call. For usual (scalar) loss functions, the upstream gradient is implicitly assumed to be 1.
You can do a similar thing by passing ones as the upstream gradient:
grad = torch.autograd.grad(outputs=output, inputs=x, grad_outputs=torch.ones_like(output), retain_graph=True)[0]
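For completeness, here is a minimal end-to-end sketch of my own (not from the question) under the question's setup. Note two additional points: x must require gradients for the input gradient not to come back as None, and create_graph=True is needed if the computed gradient is itself part of a loss you want to backpropagate through.

# assumes the Net class and imports from the question above
net = Net()

# mark x as requiring gradients so autograd tracks it as an input
x = torch.unsqueeze(torch.linspace(-1, 1, 101), dim=1).requires_grad_(True)

output = net(x)
# ones_like(output) plays the role of the implicit upstream gradient
grad = torch.autograd.grad(outputs=output, inputs=x,
                           grad_outputs=torch.ones_like(output),
                           create_graph=True)[0]

err_sqr = (grad - x)**2
loss = torch.mean(err_sqr)**2
loss.backward()  # gradients of the loss w.r.t. the network weights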
Hey I have been struggling with this weird problem. Here is my code for the Neural Net:
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv_3d_ = nn.Sequential(
            nn.Conv3d(1, 1, 9, 1, 4),
            nn.LeakyReLU(),
            nn.Conv3d(1, 1, 9, 1, 4),
            nn.LeakyReLU(),
            nn.Conv3d(1, 1, 9, 1, 4),
            nn.LeakyReLU()
        )
        self.linear_layers_ = nn.Sequential(
            nn.Linear(batch_size*32*32*32, batch_size*32*32*3),
            nn.LeakyReLU(),
            nn.Linear(batch_size*32*32*3, batch_size*32*32*3),
            nn.Sigmoid()
        )

    def forward(self, x, y, z):
        conv_layer = x + y + z
        conv_layer = self.conv_3d_(conv_layer)
        conv_layer = torch.flatten(conv_layer)
        conv_layer = self.linear_layers_(conv_layer)
        conv_layer = conv_layer.view((batch_size, 3, input_sizes, input_sizes))
        return conv_layer
The weird problem I am facing is that running this NN gives me an error
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3072]], which is output 0 of SigmoidBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The stack trace shows that the issue is in line
conv_layer = self.linear_layers_(conv_layer)
However, if I replace the last activation function of my FCN from nn.Sigmoid() to nn.LeakyReLU(), the NN executes properly.
Can anyone tell me why Sigmoid activation function is causing my backward computation to break?
I found the problem with my code. I delved deeper into what in-place actually means. So, if you check the line
conv_layer = self.linear_layers_(conv_layer)
the assignment changes the values of conv_layer in place; the values that autograd saved for the backward pass get overwritten, and because of this, gradient computation fails. An easy solution to this problem is to use the clone() function,
i.e.
conv_layer = self.linear_layers_(conv_layer).clone()
This creates a copy of the right-hand-side computation, and autograd is able to keep its reference in the computation graph.
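Applied to the forward pass from the question, the fix (as I understand it) would look like this:

    def forward(self, x, y, z):
        conv_layer = x + y + z
        conv_layer = self.conv_3d_(conv_layer)
        conv_layer = torch.flatten(conv_layer)
        # clone() so the Sigmoid output saved for backward is not overwritten in place
        conv_layer = self.linear_layers_(conv_layer).clone()
        conv_layer = conv_layer.view((batch_size, 3, input_sizes, input_sizes))
        return conv_layer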
I want to multiply a Keras layer with my own Variable.
Then, I want to compute the gradients of some loss relative to the variables I have defined.
Here is a simplified MWE of what I am trying to do:
import tensorflow as tf

x = input_shape = tf.keras.layers.Input((10,))
x = tf.keras.layers.Dense(5)(x)
s = tf.Variable(tf.ones((5,)))
x = x * s
model = tf.keras.models.Model(input_shape, x)

X = tf.random.normal((50, 10))  # random sample

with tf.GradientTape() as tape:
    tape.watch(s)
    y = model(X)
    loss = y**2

print(tape.gradient(loss, s))  # why None ??
The print prints None... why?
Notice that I am using eager-execution (TF version 2.0.0).
I managed to fix my problem by sub-classing Model and creating my variable inside the model:
class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(5)
        self.s = tf.Variable(tf.ones((5,)))

    def call(self, inputs):
        x = self.dense(inputs)
        x = x * self.s
        return x
Alternatively, defining my own custom layer also works (see the sketch below).
There must be some magic going on whereby variables not inside a model are not backpropagated (like in PyTorch).
I will leave the question open because I am curious as to why my code was not working and what a simpler fix would look like.
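For reference, this is roughly what the custom-layer alternative might look like (ScaleLayer is a made-up name; the point is that the variable is created by the layer via add_weight, so Keras tracks it as a trainable weight):

import tensorflow as tf

class ScaleLayer(tf.keras.layers.Layer):
    def build(self, input_shape):
        # the variable is created by the layer itself, so Keras tracks it
        self.s = self.add_weight(name="s", shape=(input_shape[-1],),
                                 initializer="ones", trainable=True)

    def call(self, inputs):
        return inputs * self.s

inp = tf.keras.layers.Input((10,))
x = tf.keras.layers.Dense(5)(inp)
out = ScaleLayer()(x)
model = tf.keras.models.Model(inp, out)
# s now appears in model.trainable_variables, so
# tape.gradient(loss, model.trainable_variables) includes it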
This might be the explanation. Based on reviewing the documentation, I suspect that the issue is that differentiating with respect to "s" (or any other tensor outside the model, say "x") might not be a meaningful calculation. For example, it is possible to do this:
print(tape.gradient(loss, model.variables))
and obtain the gradients with respect to the model weights/parameters, but differentiating the model with respect to a "layer" is not appropriate. This is my speculation at this point. I hope this helps.
I am trying to implement a LSTM VAE (following this example I found), but also have it accept variable length sequences using Masking Layers. I tried to combine the above code with the ideas from this SO question that seems to deal with it the "best way" by cropping the gradients to get the most accurate loss as possible, however my implementation does not seem to be able to reproduce sequences on a small set of data. I am thus relatively confident that there is something amiss with my implementation, but I cannot seem to pinpoint what exactly is wrong. The relevant part is here:
x = Input(shape=(None, input_dim))
x_masked = Masking(mask_value=0.0, input_shape=(None, input_dim))(x)

h = LSTM(intermediate_dim)(x_masked)

z_mean = Dense(latent_dim)(h)
z_log_sigma = Dense(latent_dim)(h)

def sampling(args):
    z_mean, z_log_sigma = args
    epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0., stddev=epsilon_std)
    return z_mean + z_log_sigma * epsilon

z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_sigma])

decoder_h = LSTM(intermediate_dim, return_sequences=True)
decoder_mean = LSTM(latent_dim, return_sequences=True)

h_decoded = RepeatVector(max_timesteps)(z)
h_decoded = decoder_h(h_decoded)

x_decoded_mean = decoder_mean(h_decoded)

def crop_outputs(x):
    padding = K.cast(K.not_equal(x[1], 0), dtype=K.floatx())
    return x[0] * padding

x_decoded_mean = Lambda(crop_outputs, output_shape=(max_timesteps, input_dim))([x_decoded_mean, x])

vae = Model(x, x_decoded_mean)

def vae_loss(x, x_decoded_mean):
    xent_loss = objectives.mse(x, x_decoded_mean)
    kl_loss = -0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma))
    loss = xent_loss + kl_loss
    return loss

vae.compile(optimizer='adam', loss=vae_loss)

# Here, X is variable length time series data of shape
# (num_examples, max_timesteps, input_dim) and is zero padded
# on the right for all the examples of length less than max_timesteps.
# X has been appropriately scaled using the StandardScaler.

vae.fit(X, X, epochs=num_epochs, batch_size=batch_size)
As always, any help is much appreciated. Thank you!
I came across your question while trying to do exactly the same thing. I gave up on VAEs, but I found a solution for applying masking to layers that don't support it. What I did was predefine a binary mask (you can do this with numpy, see Code 1) and then multiply my output by the mask. During backpropagation the algorithm will take the derivative of the multiplication and end up propagating the value or not. It is not as clever as the masking layer in Keras, but it worked for me.
# Code 1
# making a numpy binary mask
# expecting a sequence with shape (Time_Steps, Features)
# let's say that my sequence has Features = 10 and a max length of 15
import numpy as np

max_Len = 15
seq = np.linspace(0, 1, 100).reshape((10, 10))  # 10 time steps, 10 features

# You must pad/truncate the sequence here
mask = np.concatenate([np.ones(seq.shape[0]), np.zeros(max_Len - seq.shape[0])], axis=-1)

# This mask can be fed to the model as an extra input afterwards
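As an illustration of the "multiply the output by the mask" idea (my own sketch, not the answerer's code, using the old-style keras imports from the question; layer sizes are made up):

from keras.layers import Input, Dense, Lambda
from keras.models import Model
from keras import backend as K

seq_in = Input(shape=(15, 10))   # padded sequences: (max_Len, Features)
mask_in = Input(shape=(15,))     # the binary mask built in Code 1

x = Dense(10)(seq_in)            # stand-in for the real network
# broadcast the mask over the feature axis and zero out padded time steps
masked = Lambda(lambda t: t[0] * K.expand_dims(t[1], -1))([x, mask_in])

model = Model([seq_in, mask_in], masked)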
A few considerations:
1- It resulted in a weak regression model. I don't know the impact on VAEs, since I never tested it, but I think it will generate a lot of noise.
2- The computational resource demand went up, so it is a good idea to try to estimate the cost of propagating and backpropagating this workaround (or "gambiarra", as we say here) if you are on a budget like me.
3- It won't solve the problem completely; you could delve deeper into this and implement a more stable solution using pure TensorFlow.
4- A more "accurate" solution would be to implement a custom masking layer (Code 2).
Regarding point 4, it is easy: you define a standard layer and use its call function, which receives a mask, to output the multiplication of the mask and the input. Like this:
# Code 2
import tensorflow as tf

class MyCoolMaskingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        # init stuff here
        super(MyCoolMaskingLayer, self).__init__(**kwargs)

    def compute_mask(self, inputs, mask=None):
        # pass the mask on to downstream layers unchanged
        return mask

    def call(self, input, mask=None):
        # broadcast the boolean mask over the feature axis before multiplying
        bc_mask = tf.expand_dims(tf.cast(mask, "float32"), -1) if mask is not None else np.asarray([[1]])
        return input * bc_mask
This layer might not work for you as-is, it is really problem specific and comes from a noob (me), but it worked for me. I just cannot share the entire code because my Master's tutor doesn't allow it.
(A little bit of context: I wrap it around a TimeDistributed so that each time step of an LSTM output is individually processed by this masking layer, because inside call I perform some transformations on the data.)
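One way the layer above might be wired up (my own sketch, not the answerer's code; exact behaviour can depend on your Keras version): the mask argument of call is filled in automatically when a Masking layer sits upstream and the intermediate layers propagate it.

inputs = tf.keras.layers.Input(shape=(None, 10))
x = tf.keras.layers.Masking(mask_value=0.0)(inputs)      # creates the mask
x = tf.keras.layers.LSTM(32, return_sequences=True)(x)   # propagates the mask
x = MyCoolMaskingLayer()(x)                               # receives it in call(..., mask=...)
model = tf.keras.models.Model(inputs, x)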
In PyTorch, a classification network model is defined like this:
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.out = torch.nn.Linear(n_hidden, n_output)       # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))   # activation function for hidden layer
        x = self.out(x)
        return x
Is softmax applied here? In my understanding, it should be like this:
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.relu = torch.nn.ReLU(inplace=True)
        self.out = torch.nn.Linear(n_hidden, n_output)       # output layer
        self.softmax = torch.nn.Softmax(dim=-1)               # softmax over the class dimension

    def forward(self, x):
        x = self.hidden(x)
        x = self.relu(x)   # activation function for hidden layer
        x = self.out(x)
        x = self.softmax(x)
        return x
I understand that F.relu(self.hidden(x)) also applies ReLU, but the first block of code doesn't apply softmax, right?
Latching on to what @jodag was already saying in his comment, and extending it a bit to form a full answer:
No, PyTorch does not automatically apply softmax, and you can at any point apply torch.nn.Softmax() as you want. But softmax has some issues with numerical stability, which we want to avoid as much as we can. One solution is to use log-softmax, but this tends to be slower than a direct computation.
Especially when we are using Negative Log Likelihood as a loss function (in PyTorch, this is torch.nn.NLLLoss), we can utilize the fact that the derivative of (log-)softmax + NLL is mathematically quite nice and simple, which is why it makes sense to combine both into a single function/module. The result is torch.nn.CrossEntropyLoss. Again, note that this only applies to the last layer of your network; any other computation is not affected by this.
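As a quick illustration of that last point (my own sketch, not from the original answer): feeding raw logits to CrossEntropyLoss gives the same result as applying LogSoftmax yourself and then using NLLLoss:

import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # raw network outputs, no softmax applied
targets = torch.tensor([0, 2, 1, 0])  # class indices

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(torch.allclose(ce, nll))  # True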
I am trying to perform the most basic function minimisation possible in TensorFlow 2.0, exactly as in the question Tensorflow 2.0: minimize a simple function, however I cannot get the solution described there to work. Here is my attempt, mostly copy-pasted but with some bits that seemed to be missing added in.
import tensorflow as tf

x = tf.Variable(2, name='x', trainable=True, dtype=tf.float32)

with tf.GradientTape() as t:
    y = tf.math.square(x)

# Is the tape that computes the gradients!
trainable_variables = [x]

#### Option 2
# To use minimize you have to define your loss computation as a function
def compute_loss():
    y = tf.math.square(x)
    return y

opt = tf.optimizers.Adam(learning_rate=0.001)
train = opt.minimize(compute_loss, var_list=trainable_variables)

print("x:", x)
print("y:", y)
Output:
x: <tf.Variable 'x:0' shape=() dtype=float32, numpy=1.999>
y: tf.Tensor(4.0, shape=(), dtype=float32)
So it says the minimum is at x=1.999, but obviously that is wrong. So what happened? I suppose it only performed one loop of the minimiser or something? If so then "minimize" seems like a terrible name for the function. How is this supposed to work?
On a side note, I also need to know the values of intermediate variables that are calculated in the loss function (the example only has y, but imagine that it took several steps to compute y and I want all those numbers). I don't think I am using the gradient tape correctly either, it is not obvious to me that it has anything to do with the computations in the loss function (I just copied this stuff from the other question).
You need to call minimize multiple times, because minimize only performs a single step of your optimisation.
The following should work:
import tensorflow as tf

x = tf.Variable(2, name='x', trainable=True, dtype=tf.float32)

# Is the tape that computes the gradients!
trainable_variables = [x]

# To use minimize you have to define your loss computation as a function
class Model():
    def __init__(self):
        self.y = 0

    def compute_loss(self):
        self.y = tf.math.square(x)
        return self.y

opt = tf.optimizers.Adam(learning_rate=0.01)
model = Model()
for i in range(1000):
    train = opt.minimize(model.compute_loss, var_list=trainable_variables)

print("x:", x)
print("y:", model.y)