PyTorch backward does not compute the gradients for requested variables - python

I'm trying to train a resnet18 model in PyTorch (+ pytorch-lightning) using Virtual Adversarial Training. During the computations required for this type of training I need to obtain the gradient of D (i.e. the cross-entropy loss of the model) with respect to tensor r.
This should, in theory, happen in the following code snippet:
def generic_step(self, train_batch, batch_idx, step_type):
    x, y = train_batch
    unlabeled_idx = y is None

    d = torch.rand(x.shape).to(x.device)
    d = d / (torch.norm(d) + 1e-8)

    pred_y = self.classifier(x)
    y[unlabeled_idx] = pred_y[unlabeled_idx]
    l = self.criterion(pred_y, y)

    R_adv = torch.zeros_like(x)
    for _ in range(self.ip):
        r = self.xi * d
        r.requires_grad = True
        pred_hat = self.classifier(x + r)
        # pred_hat = F.log_softmax(pred_hat, dim=1)
        D = self.criterion(pred_hat, pred_y)
        self.classifier.zero_grad()
        D.requires_grad = True
        D.backward()
        R_adv += self.eps * r.grad / (torch.norm(r.grad) + 1e-8)
    R_adv /= 32

    loss = l + R_adv * self.a
    loss.backward()

    self.accuracy[step_type] = self.acc_metric(torch.argmax(pred_y, 1), y)
    return loss
Here, to my understanding, r.grad should in theory be the gradient of D with respect to r. However, the code throws this at D.backward():
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
(full traceback omitted, because the error itself is not helpful and is technically "solved" in the sense that I know its cause, explained just below)
After some research and debugging it seems that in this situation D.backward() attempts to calculate dD/dD, disregarding any previous mention of requires_grad=True. This is confirmed when I add D.requires_grad=True: I get D.grad=Tensor(1.,device='cuda:0') but r.grad=None.
Does anyone know why this may be happening?

In Lightning, .backward() and the optimizer step are handled under the hood. If you call them yourself, as in the code above, it will mess with Lightning, because it doesn't know you already called backward yourself.
You can enable manual optimization in the LightningModule:
def __init__(self):
    super().__init__()
    # put this in your init
    self.automatic_optimization = False
This tells Lightning that you are taking over calling backward and handling the optimizer step and zero_grad yourself. Don't forget to add those calls to your code above. You can access the optimizer and scheduler like so in your training step:
def training_step(self, batch, batch_idx):
    optimizer = self.optimizers()
    scheduler = self.lr_schedulers()

    # do your training step
    # don't forget to call:
    # 1) backward 2) optimizer step 3) zero grad
Read more about manual optimization here.
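For concreteness, here is a minimal sketch of a manually optimized training step (not the full VAT step from the question; names such as self.classifier and self.criterion follow the code above):
def training_step(self, batch, batch_idx):
    optimizer = self.optimizers()

    x, y = batch
    pred_y = self.classifier(x)
    loss = self.criterion(pred_y, y)

    optimizer.zero_grad()
    # self.manual_backward(loss) is preferred over loss.backward() so that
    # Lightning can still handle things like precision scaling for you
    self.manual_backward(loss)
    optimizer.step()

    return loss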

Related

One of the variables modified by an inplace operation

I am relatively new to PyTorch. I want to use this model to generate some images; however, it was written before PyTorch 1.5, and since the gradient calculation was fixed in that release, I get this error message.
RuntimeError: one of the variables needed for gradient computation has been
modified by an inplace operation: [torch.cuda.FloatTensor [1, 512, 4, 4]]
is at version 2; expected version 1 instead.
Hint: enable anomaly detection to find the operation that
failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
I have looked at past examples and am not sure what the problem is here. I believe it is happening within this region, but I don't know where! Any help would be greatly appreciated!
def process(self, images, edges, masks):
    self.iteration += 1

    # zero optimizers
    self.gen_optimizer.zero_grad()
    self.dis_optimizer.zero_grad()

    # process outputs
    outputs = self(images, edges, masks)
    gen_loss = 0
    dis_loss = 0

    # discriminator loss
    dis_input_real = torch.cat((images, edges), dim=1)
    dis_input_fake = torch.cat((images, outputs.detach()), dim=1)
    dis_real, dis_real_feat = self.discriminator(dis_input_real)  # in: (grayscale(1) + edge(1))
    dis_fake, dis_fake_feat = self.discriminator(dis_input_fake)  # in: (grayscale(1) + edge(1))
    dis_real_loss = self.adversarial_loss(dis_real, True, True)
    dis_fake_loss = self.adversarial_loss(dis_fake, False, True)
    dis_loss += (dis_real_loss + dis_fake_loss) / 2

    # generator adversarial loss
    gen_input_fake = torch.cat((images, outputs), dim=1)
    gen_fake, gen_fake_feat = self.discriminator(gen_input_fake)  # in: (grayscale(1) + edge(1))
    gen_gan_loss = self.adversarial_loss(gen_fake, True, False)
    gen_loss += gen_gan_loss

    # generator feature matching loss
    gen_fm_loss = 0
    for i in range(len(dis_real_feat)):
        gen_fm_loss += self.l1_loss(gen_fake_feat[i], dis_real_feat[i].detach())
    gen_fm_loss = gen_fm_loss * self.config.FM_LOSS_WEIGHT
    gen_loss += gen_fm_loss

    # create logs
    logs = [
        ("l_d1", dis_loss.item()),
        ("l_g1", gen_gan_loss.item()),
        ("l_fm", gen_fm_loss.item()),
    ]

    return outputs, gen_loss, dis_loss, logs

def forward(self, images, edges, masks):
    edges_masked = (edges * (1 - masks))
    images_masked = (images * (1 - masks)) + masks
    inputs = torch.cat((images_masked, edges_masked, masks), dim=1)
    outputs = self.generator(inputs)  # in: [grayscale(1) + edge(1) + mask(1)]
    return outputs

def backward(self, gen_loss=None, dis_loss=None):
    if dis_loss is not None:
        dis_loss.backward()
        self.dis_optimizer.step()

    if gen_loss is not None:
        gen_loss.backward()
        self.gen_optimizer.step()
Thank you!
You can't compute the loss for the discriminator and for the generator in one go and then have both back-propagations back-to-back like this:
if dis_loss is not None:
    dis_loss.backward()
    self.dis_optimizer.step()

if gen_loss is not None:
    gen_loss.backward()
    self.gen_optimizer.step()
Here's the reason why: when you call self.dis_optimizer.step(), you modify the parameters of the discriminator in place, the very same parameters that were used to compute gen_loss, which you are trying to backpropagate on. This is not possible.
You have to compute dis_loss, backpropagate, update the weights of the discriminator, and clear the gradients. Only then can you compute gen_loss with the newly updated discriminator weights. Finally, backpropagate on the generator.
This tutorial is a good walkthrough over a typical GAN training.
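A minimal sketch of that ordering, reusing the names from the code above (the discriminator loss is still computed on outputs.detach(), exactly as before):
# 1) discriminator: backpropagate, step, and clear gradients first
self.dis_optimizer.zero_grad()
dis_loss.backward()
self.dis_optimizer.step()

# 2) only now build the generator loss against the updated discriminator
gen_input_fake = torch.cat((images, outputs), dim=1)
gen_fake, gen_fake_feat = self.discriminator(gen_input_fake)
gen_loss = self.adversarial_loss(gen_fake, True, False)

# 3) generator: backpropagate and step
self.gen_optimizer.zero_grad()
gen_loss.backward()
self.gen_optimizer.step()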
This might not be exactly an answer to your question, but I got this error when trying to use a "custom" distributed optimizer: I was using Cherry's optimizer and accidentally wrapping the model in DDP at the same time. Once I only moved the model to the device the way Cherry expects, I stopped getting this issue.
context: https://github.com/learnables/learn2learn/issues/263
This worked for me. For more details, please see here.
def backward(self, gen_loss=None, dis_loss=None):
    if dis_loss is not None:
        dis_loss.backward(retain_graph=True)  # modified here
        self.dis_optimizer.step()

    if gen_loss is not None:
        gen_loss.backward()
        self.gen_optimizer.step()

Why PyTorch optimizer might fail to update its parameters?

I am trying to do a simple loss-minimization for a specific variable coeff using PyTorch optimizers. This variable is supposed to be used as an interpolation coefficient for two vectors w_foo and w_bar to find a third vector, w_target.
w_target = w_foo + coeff * (w_bar - w_foo)
With w_foo and w_bar set as constant, at each optimization step I calculate w_target for the given coeff. Loss is determined from w_target using a fairly complex process beyond the scope of this question.
# w_foo.shape = [1, 16, 512]
# w_bar.shape = [1, 16, 512]
# num_layers = 16
# num_steps = 10000

vgg_loss = VGGLoss()

coeff = torch.randn([num_layers, ])
optimizer = torch.optim.Adam([coeff], lr=initial_learning_rate)

for step in range(num_steps):
    w_target = w_foo + torch.matmul(coeff, (w_bar - w_foo))

    optimizer.zero_grad()

    target_image = generator.synthesis(w_target)
    processed_target_image = process(target_image)

    loss = vgg_loss(processed_target_image, source_image)
    loss.backward()
    optimizer.step()
However, when running this, query_opt does not change from one step to the next, so the optimizer is essentially useless. I would like some advice on what I am doing wrong here.
Edit:
As suggested, I will try to elaborate on the loss function. Essentially, w_target is used to generate an image, and VGGLoss uses a VGG feature extractor to compare this synthetic image with a certain exemplar source image.
class VGGLoss(torch.nn.Module):
    def __init__(self, device, vgg):
        super().__init__()
        for param in self.parameters():
            param.requires_grad = True
        self.vgg = vgg  # VGG16 in eval mode

    def forward(self, source, target):
        loss = 0
        source_features = self.vgg(source, resize_images=False, return_lpips=True)
        target_features = self.vgg(target, resize_images=False, return_lpips=True)
        loss += (source_features - target_features).square().sum()
        return loss

Can one only implement gradient descent like optimizers with the code example from processing gradients in TensorFlow?

I was looking at the example code for processing gradients that TensorFlow has:
# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
however, I noticed that the apply_gradients function was derived from the GradientDescentOptimizer. Does that mean that, using the example code from above, one can only implement gradient-descent-like rules (notice we could change opt to GradientDescentOptimizer or Adam or any of the other optimizers)? In particular, what does apply_gradients do? I did check the code on the tf GitHub page, but it was a bunch of Python that had nothing to do with mathematical expressions, so it was hard to tell what it was doing and how it changed from optimizer to optimizer.
For example, if I wanted to implement my own custom optimizer that might use gradients (or might not, e.g. one that just changes the weights directly with some rule, maybe a more biologically plausible one), is that not possible with the above example code?
In particular, I wanted to implement a gradient descent variant that is artificially restricted to a compact domain; namely, the following update:
w := (w - mu*grad + eps) mod B
in TensorFlow. I realized that the following is true:
w := w mod B - mu*grad mod B + eps mod B
so I thought that I could just implement it by doing:
def Process_grads(g, mu_noise, stddev_noise, B):
    return (g + tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)) % B
and then just having:
processed_grads_and_vars = [(Process_grads(gv[0]), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the processed gradients.
opt.apply_gradients(processed_grads_and_vars)
however, I realized that that wasn't good enough because I don't actually have access to w so I can't implement:
w mod B
at least not the way I tried. Is there a way to do this? i.e. to actually directly change the update rule? At least the way I tried?
I know it's sort of a hacky update rule, but my point is more about changing the update equation than caring too much about that particular rule (so don't get hung up on it if it's a bit weird).
I came up with a super hacky solution:
def manual_update_GDL(arg, learning_rate, g, mu_noise, stddev_noise):
    with tf.variable_scope(arg.mdl_scope_name, reuse=True):
        W_var = tf.get_variable(name='W')
        eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
        #
        W_new = tf.mod(W_var - learning_rate * g + eps, 20)
        sess.run(W_var.assign(W_new))

def manual_GDL(arg, loss, learning_rate, mu_noise, stddev_noise, compact, B):
    # Compute the gradients for a list of variables.
    grads_and_vars = opt.compute_gradients(loss)
    # process gradients
    processed_grads_and_vars = [(manual_update_GDL(arg, learning_rate, g, mu_noise, stddev_noise), v)
                                for g, v in grads_and_vars]
I'm not sure if it works, but something like that should work in general. The idea is to just write down the update equation one wants (in TensorFlow) and then update the weights manually using a session.
Unfortunately, such a solution means we have to take care of the annealing (decaying the learning rate manually, which seems annoying). This solution probably has many other problems; feel free to point them out (and give solutions if you can).
For this very simple problem I realized one can just do the normal optimizer update rule and then take the mod of the weights and re-assign them:
sess.run(fetches=train_step)
if arg.compact:
    # apply w := ( w - mu*g + eps ) mod B
    W_val = W_var.eval()
    W_new = tf.mod(W_var, arg.B).eval()
    W_var.assign(W_new).eval()
but in this case it's a coincidence that such a simple solution exists (unfortunately, it bypasses the whole point of my question).
Actually, this solution slows down the code a lot. For the moment it's the best that I've got.
As a reference, I have seen this question: How to create an optimizer in Tensorflow, but I didn't find that it directly answered my question.
Your solution slows down the code because you use sess.run and .eval() during your "train_step" creation. Instead you should create the train_step graph using only internal TensorFlow functions (without sess.run and .eval()). Afterwards you only evaluate the train_step in a loop.
If you don't want to use any standard optimizer you can write your own "apply gradient" graph. Here is one possible solution for that:
learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01

# add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(some_loss, train_w_vars_list)

assign_list = []
for g, v in zip(grad, train_w_vars_list):
    eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
    assign_list.append(v.assign(tf.mod(v - learning_rate * g + eps, 20)))

# also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))

train_step = tf.group(*assign_list)
You can also use one of the standard optimizers to create the grads_and_vars list (and use it instead of zip(grad, train_w_vars_list)).
Here is a simple example for MNIST with your loss:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

# Import data
mnist = input_data.read_data_sets('PATH TO MNIST_data', one_hot=True)

# Create the model
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
y = tf.matmul(x, W)

# Define loss and optimizer
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

learning_rate = tf.Variable(tf.constant(0.1))
mu_noise = 0.
stddev_noise = 0.01

# add all your W variables here when you have more than one:
train_w_vars_list = [W]
grad = tf.gradients(cross_entropy, train_w_vars_list)

assign_list = []
for g, v in zip(grad, train_w_vars_list):
    eps = tf.random_normal(tf.shape(g), mean=mu_noise, stddev=stddev_noise)
    assign_list.append(v.assign(tf.mod(v - learning_rate * g + eps, 20)))

# also update the learning rate here if you want to:
assign_list.append(learning_rate.assign(learning_rate - 0.001))

train_step = tf.group(*assign_list)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

# Train
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# Test trained model
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                    y_: mnist.test.labels}))
Indeed you are somewhat restricted and cannot do just anything. However, what you wish to do can easily be done by making your own subclass of the TensorFlow Optimizer class.
All you need to do is write an _apply_dense method for your class. The _apply_dense method takes grad and var as arguments, so anything you want to do with these two variables you can do.
Look here for example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/adam.py
That is the implementation of Adam in TensorFlow; all you need to do is change _apply_dense at line 131, as well as the _prepare and _finish methods.
So for example:
def _apply_dense(self, grad, var):
    B = math_ops.cast(self.B, var.dtype.base_dtype)
    eps = math_ops.cast(self.eps, var.dtype.base_dtype)
    mu = math_ops.cast(self.mu, var.dtype.base_dtype)

    var_update = state_ops.assign(var, tf.floormod(var - mu * grad + eps, B),
                                  use_locking=self._use_locking)
    return var_update
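For reference, here is a minimal sketch of such a subclass under the old tf.train.Optimizer API; the hyperparameter names mu, eps_std and B are illustrative, not part of any existing optimizer:
import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.ops import math_ops, state_ops
from tensorflow.python.training import optimizer

class ModGradientDescentOptimizer(optimizer.Optimizer):
    def __init__(self, mu=0.1, eps_std=0.01, B=20.0, use_locking=False, name="ModGD"):
        super(ModGradientDescentOptimizer, self).__init__(use_locking, name)
        self._mu = mu
        self._eps_std = eps_std
        self._B = B

    def _prepare(self):
        # turn the Python hyperparameters into tensors once per apply_gradients call
        self._mu_t = ops.convert_to_tensor(self._mu, name="mu")
        self._eps_std_t = ops.convert_to_tensor(self._eps_std, name="eps_std")
        self._B_t = ops.convert_to_tensor(self._B, name="B")

    def _apply_dense(self, grad, var):
        mu = math_ops.cast(self._mu_t, var.dtype.base_dtype)
        eps_std = math_ops.cast(self._eps_std_t, var.dtype.base_dtype)
        B = math_ops.cast(self._B_t, var.dtype.base_dtype)
        eps = tf.random_normal(tf.shape(var), stddev=eps_std, dtype=var.dtype.base_dtype)
        # w := (w - mu * grad + eps) mod B
        return state_ops.assign(var, tf.floormod(var - mu * grad + eps, B),
                                use_locking=self._use_locking)
It can then be used like any other optimizer, e.g. train_step = ModGradientDescentOptimizer(mu=0.1).minimize(cross_entropy); minimize() calls compute_gradients and then _apply_dense for each dense variable.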

How to set layer-wise learning rate in Tensorflow?

I am wondering if there is a way to use different learning rates for different layers, like what is available in Caffe. I am trying to modify a pre-trained model and use it for other tasks. What I want is to speed up the training for the newly added layers and keep the trained layers at a low learning rate in order to prevent them from being distorted. For example, I have a 5-conv-layer pre-trained model. Now I add a new conv layer and fine-tune it. The first 5 layers would have a learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
It can be achieved quite easily with 2 optimizers:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
train_op1 = GradientDescentOptimizer(0.00001).minimize(loss, var_list=var_list1)
train_op2 = GradientDescentOptimizer(0.0001).minimize(loss, var_list=var_list2)
train_op = tf.group(train_op1, train_op2)
One disadvantage of this implementation is that it computes tf.gradients(.) twice inside the optimizers and thus it might not be optimal in terms of execution speed. This can be mitigated by explicitly calling tf.gradients(.), splitting the list into 2 and passing corresponding gradients to both optimizers.
Related question: Holding variables constant during optimizer
EDIT: Added more efficient but longer implementation:
var_list1 = [variables from first 5 layers]
var_list2 = [the rest of variables]
opt1 = tf.train.GradientDescentOptimizer(0.00001)
opt2 = tf.train.GradientDescentOptimizer(0.0001)
grads = tf.gradients(loss, var_list1 + var_list2)
grads1 = grads[:len(var_list1)]
grads2 = grads[len(var_list1):]
train_op1 = opt1.apply_gradients(zip(grads1, var_list1))
train_op2 = opt2.apply_gradients(zip(grads2, var_list2))
train_op = tf.group(train_op1, train_op2)
You can use tf.trainable_variables() to get all training variables and decide to select from them.
The difference is that in the first implementation tf.gradients(.) is called twice inside the optimizers. This may cause some redundant operations to be executed (e.g. gradients on the first layer can reuse some computations for the gradients of the following layers).
TensorFlow 1.7 introduced tf.custom_gradient, which greatly simplifies setting learning rate multipliers in a way that is compatible with any optimizer, including those accumulating gradient statistics. For example,
import tensorflow as tf

def lr_mult(alpha):
    @tf.custom_gradient
    def _lr_mult(x):
        def grad(dy):
            return dy * alpha * tf.ones_like(x)
        return x, grad
    return _lr_mult

x0 = tf.Variable(1.)
x1 = tf.Variable(1.)
loss = tf.square(x0) + tf.square(lr_mult(0.1)(x1))

step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
tf.local_variables_initializer().run()

for _ in range(5):
    sess.run([step])
print(sess.run([x0, x1, loss]))
Update Jan 22: the recipe below is only a good idea for GradientDescentOptimizer; other optimizers, which keep a running average, apply the learning rate before the parameter update, so the recipe below won't affect that part of the equation.
In addition to Rafal's approach, you could use the compute_gradients / apply_gradients interface of Optimizer. For instance, here's a toy network where I use 2x the learning rate for the second parameter:
x = tf.Variable(tf.ones([]))
y = tf.Variable(tf.zeros([]))
loss = tf.square(x - y)
global_step = tf.Variable(0, name="global_step", trainable=False)

opt = tf.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss, [x, y])
ygrad, _ = grads_and_vars[1]
train_op = opt.apply_gradients([grads_and_vars[0], (ygrad * 2, y)], global_step=global_step)

init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
for i in range(5):
    sess.run([train_op, loss, global_step])
    print sess.run([x, y])
You should see
[0.80000001, 0.40000001]
[0.72000003, 0.56]
[0.68800002, 0.62400001]
[0.67520005, 0.64960003]
[0.67008007, 0.65984005]
Collect learning rate multipliers for each variable like:
self.lr_multipliers[var.op.name] = lr_mult
and then apply them before applying the gradients, like:
def _train_op(self):
    tf.scalar_summary('learning_rate', self._lr_placeholder)
    opt = tf.train.GradientDescentOptimizer(self._lr_placeholder)
    grads_and_vars = opt.compute_gradients(self._loss)
    grads_and_vars_mult = []
    for grad, var in grads_and_vars:
        grad *= self._network.lr_multipliers[var.op.name]
        grads_and_vars_mult.append((grad, var))
        tf.histogram_summary('variables/' + var.op.name, var)
        tf.histogram_summary('gradients/' + var.op.name, grad)
    return opt.apply_gradients(grads_and_vars_mult)
You can find the whole example here.
A slight variation of Sergey Demyanov's answer, where you only have to specify the learning rates you would like to change:
from collections import defaultdict

self.learning_rates = defaultdict(lambda: 1.0)
...
x = tf.layers.Dense(3)(x)
self.learning_rates[x.op.name] = 2.0
...
optimizer = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(loss)
grads_and_vars_mult = []
for grad, var in grads_and_vars:
    grad *= self.learning_rates[var.op.name]
    grads_and_vars_mult.append((grad, var))
train_op = optimizer.apply_gradients(grads_and_vars_mult, tf.train.get_global_step())
The first 5 layers would have learning rate of 0.00001 and the last one would have 0.001. Any idea how to achieve this?
There is an easy way to do that using tf.stop_gradient.
Here is an example with 3 layers:
x = layer1(input)
x = layer2(x)
output = layer3(x)
You can shrink your gradient in the first two layers by a ratio of 1/100:
x = layer1(input)
x = layer2(x)
x = 1/100*x + (1-1/100)*tf.stop_gradient(x)
output = layer3(x)
On layer2, the "flow" is split into two branches: one, which has a contribution of 1/100, computes its gradient regularly but with its magnitude shrunk by a factor of 1/100; the other branch provides the remaining "flow" without contributing to the gradient, because of the tf.stop_gradient operator. As a result, if you use a learning rate of 0.001 on your model optimizer, the first two layers will virtually have a learning rate of 0.00001.
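If you want to reuse this trick in several places, a tiny helper along these lines (illustrative, not from the answer above) keeps it readable:
def scale_lr(x, factor):
    # forward pass is unchanged; gradients flowing back through x are multiplied by factor
    return factor * x + (1 - factor) * tf.stop_gradient(x)

x = layer1(input)
x = layer2(x)
x = scale_lr(x, 0.01)
output = layer3(x)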
If you happen to be using tf.slim + slim.learning.create_train_op there is a nice example here:
https://github.com/google-research/tf-slim/blob/master/tf_slim/learning.py#L65
# Create the train_op and scale the gradients by providing a map from variable
# name (or variable) to a scaling coefficient:
gradient_multipliers = {
    'conv0/weights': 1.2,
    'fc8/weights': 3.4,
}
train_op = slim.learning.create_train_op(
    total_loss,
    optimizer,
    gradient_multipliers=gradient_multipliers)
Unfortunately it doesn't seem possible to use a tf.Variable instead of a float value if you want to gradually modify the multiplier.

Issue with computing gradient for Rnn in Theano

I am playing with vanilla RNNs, training with gradient descent (non-batch version), and I am having an issue with the gradient computation for the (scalar) cost; here's the relevant portion of my code:
class Rnn(object):
    # ............ [skipping the trivial initialization]
    def recurrence(x_t, h_tm_prev):
        h_t = T.tanh(T.dot(x_t, self.W_xh) +
                     T.dot(h_tm_prev, self.W_hh) + self.b_h)
        return h_t

    h, _ = theano.scan(
        recurrence,
        sequences=self.input,
        outputs_info=self.h0
    )

    y_t = T.dot(h[-1], self.W_hy) + self.b_y
    self.p_y_given_x = T.nnet.softmax(y_t)
    self.y_pred = T.argmax(self.p_y_given_x, axis=1)

    def negative_log_likelihood(self, y):
        return -T.mean(T.log(self.p_y_given_x)[:, y])


def testRnn(dataset, vocabulary, learning_rate=0.01, n_epochs=50):
    # ............ [skipping the trivial initialization]
    index = T.lscalar('index')
    x = T.fmatrix('x')
    y = T.iscalar('y')
    rnn = Rnn(x, n_x=27, n_h=12, n_y=27)
    nll = rnn.negative_log_likelihood(y)
    cost = T.lscalar('cost')
    gparams = [T.grad(cost, param) for param in rnn.params]
    updates = [(param, param - learning_rate * gparam)
               for param, gparam in zip(rnn.params, gparams)]

    train_model = theano.function(
        inputs=[index],
        outputs=nll,
        givens={
            x: train_set_x[index],
            y: train_set_y[index]
        },
    )
    sgd_step = theano.function(
        inputs=[cost],
        outputs=[],
        updates=updates
    )

    done_looping = False
    while (epoch < n_epochs) and (not done_looping):
        epoch += 1
        tr_cost = 0.
        for idx in xrange(n_train_examples):
            tr_cost += train_model(idx)
        # perform sgd step after going through the complete training set
        sgd_step(tr_cost)
For some reason I don't want to pass the complete (training) data to train_model(..); instead I want to pass individual examples one at a time. The problem is that each call to train_model(..) returns me the cost (negative log-likelihood) of that particular example, and then I have to aggregate all the costs (of the complete training data set), take the derivative, and perform the relevant update to the weight parameters in sgd_step(..). For obvious reasons, with my current implementation I am getting this error: theano.gradient.DisconnectedInputError: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: W_xh. Now I don't understand how to make 'cost' a part of the computational graph (since in my case I have to wait for it to be aggregated), or whether there is a better/more elegant way to achieve the same thing?
Thanks.
It turns out one cannot bring a symbolic variable into the Theano graph if it is not part of the computational graph. Therefore, I had to change the way data is passed to train_model(..); passing the complete training data instead of individual examples fixed the issue.
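To illustrate the principle (a minimal sketch, not the author's exact fix, which aggregated the cost over the full training set): whatever is handed to T.grad must be a symbolic expression built from the model's graph, not a fresh T.lscalar, e.g.:
# the gradient is taken of the symbolic nll, which is connected to rnn.params
nll = rnn.negative_log_likelihood(y)
gparams = [T.grad(nll, param) for param in rnn.params]
updates = [(param, param - learning_rate * gparam)
           for param, gparam in zip(rnn.params, gparams)]

train_model = theano.function(
    inputs=[index],
    outputs=nll,
    updates=updates,
    givens={
        x: train_set_x[index],
        y: train_set_y[index]
    },
)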
