I have trained the neural network shown below. It works, or at least appears to work, but the problem is with the training. I'm trying to train it to act as an OR gate, but it never seems to get there. The output tends to look like this:
prior to training:
[[0.50181624]
[0.50183743]
[0.50180414]
[0.50182533]]
post training:
[[0.69641759]
[0.754652 ]
[0.75447178]
[0.79431198]]
expected output:
[[0]
[1]
[1]
[1]]
I have this loss graph:
It's strange: it appears to be training, but at the same time it doesn't quite get to the expected output. I know that it would never really reach exact 0s and 1s, but I'd expect it to get a little closer to the expected output.
I had some issues figuring out how to backpropagate the error, as I wanted this network to support any number of hidden layers, so I stored the local gradient in each layer, alongside the weights, and sent the error back from the end.
The main functions I suspect are the culprits are NeuralNetwork.train and both forward methods.
import sys
import math
import numpy as np
import matplotlib.pyplot as plt
from itertools import product

class NeuralNetwork:
    class __Layer:
        def __init__(self,args):
            self.__epsilon = 1e-6
            self.localGrad = 0
            self.__weights = np.random.randn(
                args["previousLayerHeight"],
                args["height"]
            )*0.01
            self.__biases = np.zeros(
                (args["biasHeight"],1)
            )

        def __str__(self):
            return str(self.__weights)

        def forward(self,X):
            a = np.dot(X, self.__weights) + self.__biases
            self.localGrad = np.dot(X.T,self.__sigmoidPrime(a))
            return self.__sigmoid(a)

        def adjustWeights(self, err):
            self.__weights -= (err * self.__epsilon)

        def __sigmoid(self, z):
            return 1/(1 + np.exp(-z))

        def __sigmoidPrime(self, a):
            return self.__sigmoid(a)*(1 - self.__sigmoid(a))

    def __init__(self,args):
        self.__inputDimensions = args["inputDimensions"]
        self.__outputDimensions = args["outputDimensions"]
        self.__hiddenDimensions = args["hiddenDimensions"]
        self.__layers = []
        self.__constructLayers()

    def __constructLayers(self):
        self.__layers.append(
            self.__Layer(
                {
                    "biasHeight": self.__inputDimensions[0],
                    "previousLayerHeight": self.__inputDimensions[1],
                    "height": self.__hiddenDimensions[0][0]
                        if len(self.__hiddenDimensions) > 0
                        else self.__outputDimensions[0]
                }
            )
        )
        for i in range(len(self.__hiddenDimensions)):
            self.__layers.append(
                self.__Layer(
                    {
                        "biasHeight": self.__hiddenDimensions[i + 1][0]
                            if i + 1 < len(self.__hiddenDimensions)
                            else self.__outputDimensions[0],
                        "previousLayerHeight": self.__hiddenDimensions[i][0],
                        "height": self.__hiddenDimensions[i + 1][0]
                            if i + 1 < len(self.__hiddenDimensions)
                            else self.__outputDimensions[0]
                    }
                )
            )

    def forward(self,X):
        out = self.__layers[0].forward(X)
        for i in range(len(self.__layers) - 1):
            out = self.__layers[i+1].forward(out)
        return out

    def train(self,X,Y,loss,epoch=5000000):
        for i in range(epoch):
            YHat = self.forward(X)
            delta = -(Y-YHat)
            loss.append(sum(Y-YHat))
            err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
            err.shape = (self.__hiddenDimensions[-1][0],1)
            self.__layers[-1].adjustWeights(err)
            i=0
            for l in reversed(self.__layers[:-1]):
                err = np.dot(l.localGrad, err)
                l.adjustWeights(err)
                i += 1

    def printLayers(self):
        print("Layers:\n")
        for l in self.__layers:
            print(l)
            print("\n")

def main(args):
    X = np.array([[x,y] for x,y in product([0,1],repeat=2)])
    Y = np.array([[0],[1],[1],[1]])
    nn = NeuralNetwork(
        {
            #(height,width)
            "inputDimensions": (4,2),
            "outputDimensions": (1,1),
            "hiddenDimensions":[
                (6,1)
            ]
        }
    )
    print("input:\n\n",X,"\n")
    print("expected output:\n\n",Y,"\n")
    nn.printLayers()
    print("prior to training:\n\n",nn.forward(X), "\n")
    loss = []
    nn.train(X,Y,loss)
    print("post training:\n\n",nn.forward(X), "\n")
    nn.printLayers()
    fig,ax = plt.subplots()
    x = np.array([x for x in range(5000000)])
    loss = np.array(loss)
    ax.plot(x,loss)
    ax.set(xlabel="epoch",ylabel="loss",title="logic gate training")
    plt.show()

if(__name__=="__main__"):
    main(sys.argv[1:])
Could someone please point out what I'm doing wrong here? I strongly suspect it has to do with the way I'm dealing with matrices, but I don't have the slightest idea what's going on.
Thanks for taking the time to read my question, and taking the time to respond (if relevant).
edit:
Actually, quite a lot is wrong with this, but I'm still a bit confused about how to fix it. Although the loss graph looks like it's training, and it kind of is, the math I've done above is wrong.
Look at the training function.
def train(self,X,Y,loss,epoch=5000000):
    for i in range(epoch):
        YHat = self.forward(X)
        delta = -(Y-YHat)
        loss.append(sum(Y-YHat))
        err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
        err.shape = (self.__hiddenDimensions[-1][0],1)
        self.__layers[-1].adjustWeights(err)
        i=0
        for l in reversed(self.__layers[:-1]):
            err = np.dot(l.localGrad, err)
            l.adjustWeights(err)
            i += 1
Note how I get delta = -(Y-Yhat) and then dot product it with the "local gradient" of the last layer. The "local gradient" is the local W gradient.
def forward(self,X):
    a = np.dot(X, self.__weights) + self.__biases
    self.localGrad = np.dot(X.T,self.__sigmoidPrime(a))
    return self.__sigmoid(a)
I'm skipping a step in the chain rule. I should really be multiplying by W* sigprime(XW + b) first as that's the local gradient of X, then by the local W gradient. I tried that, but I'm still getting issues, here is the new forward method (note the __init__ for layers needs to be initialised for the new vars, and I changed the activation function to tanh)
def forward(self, X):
    a = np.dot(X, self.__weights) + self.__biases
    self.localPartialGrad = self.__tanhPrime(a)
    self.localWGrad = np.dot(X.T, self.localPartialGrad)
    self.localXGrad = np.dot(self.localPartialGrad,self.__weights.T)
    return self.__tanh(a)
and updated the training method to look something like this:
def train(self, X, Y, loss, epoch=5000):
for e in range(epoch):
Yhat = self.forward(X)
err = -(Y-Yhat)
loss.append(sum(err))
print("loss:\n",sum(err))
for l in self.__layers[::-1]:
l.adjustWeights(err)
if(l != self.__layers[0]):
err = np.multiply(err,l.localPartialGrad)
err = np.multiply(err,l.localXGrad)
The new graphs I'm getting are all over the place, and I have no idea what's going on. Here is the final bit of code I changed:
def adjustWeights(self, err):
    perr = np.multiply(err, self.localPartialGrad)
    werr = np.sum(np.dot(self.__weights,perr.T),axis=1)
    werr = werr * self.__epsilon
    werr.shape = (self.__weights.shape[0],1)
    self.__weights = self.__weights - werr
Your network is learning, as can be seen from the loss chart, so the backprop implementation is correct (congrats!). The main problem with this particular architecture is the choice of activation function: sigmoid. I replaced sigmoid with tanh and it immediately worked much better.
From this discussion on CV.SE:
There are two reasons for that choice (assuming you have normalized
your data, and this is very important):
Having stronger gradients: since data is centered around 0, the
derivatives are higher. To see this, calculate the derivative of the
tanh function and notice that input values are in the range [0,1]. The
range of the tanh function is [-1,1] and that of the sigmoid function
is [0,1]
Avoiding bias in the gradients. This is explained very well in the
paper, and it is worth reading it to understand these issues.
Though I'm sure a sigmoid-based NN can be trained as well, it looks much more sensitive to the input values (note that they are not zero-centered), because the activation itself is not zero-centered. For hidden layers tanh is usually the better default of the two, so the simpler approach is just to use that activation function.
The key change is this:
def __tanh(self, z):
    return np.tanh(z)

def __tanhPrime(self, a):
    return 1 - self.__tanh(a) ** 2
... instead of __sigmoid and __sigmoidPrime.
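To make the "stronger gradients" point from the quote concrete, here is a minimal sketch that compares the two derivatives at a few pre-activation values. The helper names (sigmoid_prime, tanh_prime) are made up for this illustration and are not part of the network code above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the logistic sigmoid, expressed through its own output
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    # derivative of tanh
    return 1.0 - np.tanh(z) ** 2

z = np.array([-1.0, 0.0, 1.0])
sig_grad = sigmoid_prime(z)    # largest at z = 0, where it equals 0.25
tanh_grad = tanh_prime(z)      # largest at z = 0, where it equals 1.0

print("sigmoid'(z):", sig_grad)
print("tanh'(z):   ", tanh_grad)
```

Near zero the tanh derivative is four times larger than the sigmoid one, so with small initial weights each layer passes back a larger share of the error signal.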
I have also tuned hyperparameters a little bit, so that the network now learns in 100k epochs, instead of 5m:
prior to training:
[[ 0. ]
[-0.00056925]
[-0.00044885]
[-0.00101794]]
post training:
[[0. ]
[0.97335842]
[0.97340917]
[0.98332273]]
The complete code is in this gist.
Well I'm an idiot. I was right about being wrong but I was wrong about how wrong I was. Let me explain.
Within the backward training method I got the last layer trained correctly, but every layer after it (going backwards) wasn't trained correctly. That is why the above network was coming up with a result: it was indeed training, but only one layer.
So what did I do wrong? I was only multiplying by the local gradient with respect to the weights, so the chain rule was only partially applied.
Let's say the loss function was this:
t = Y-X2
loss = 1/2*(t)^2
a2 = X1W2 + b
X2 = activation(a2)
a1 = X0W1 + b
X1 = activation(a1)
We know that the derivative of the loss with respect to W2 would be -(Y-X2)*X1 (leaving the activation derivative aside). This was done in the first part of my training function:
def train(self,X,Y,loss,epoch=5000000):
    for i in range(epoch):
        #First part
        YHat = self.forward(X)
        delta = -(Y-YHat)
        loss.append(sum(Y-YHat))
        err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
        err.shape = (self.__hiddenDimensions[-1][0],1)
        self.__layers[-1].adjustWeights(err)
        i=0
        #Second part
        for l in reversed(self.__layers[:-1]):
            err = np.dot(l.localGrad, err)
            l.adjustWeights(err)
            i += 1
However the second part is where I screwed up. In order to calculate the loss with respect to W1, I must multiply the original error -(Y-X2) by W2 as W2 is the local X Gradient of the last layer, and due to the chain rule this must be done first. Then I could multiply by the local W gradient (X1) to get the loss with respect to W1. I failed to do the multiplication of the local X gradient first, so the last layer was indeed training, but all layers after that had an error that magnified as the layer increased.
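One way to check a chain-rule derivation like this is to compare the analytic gradients against finite differences. The sketch below does that for the two-layer setup described above (X0, W1, X1, W2, X2), assuming tanh activations and the loss 1/2*(Y-X2)^2 summed over the samples, so each step also multiplies by the activation derivative. The layer sizes and function names are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [1]], dtype=float)
W1 = rng.normal(size=(2, 3))
b1 = np.zeros((1, 3))
W2 = rng.normal(size=(3, 1))
b2 = np.zeros((1, 1))

def loss_fn(W1, W2):
    X1 = np.tanh(X0 @ W1 + b1)
    X2 = np.tanh(X1 @ W2 + b2)
    return 0.5 * np.sum((Y - X2) ** 2)

def analytic_grads(W1, W2):
    X1 = np.tanh(X0 @ W1 + b1)
    X2 = np.tanh(X1 @ W2 + b2)
    # error at the output pre-activation: dLoss/da2
    delta2 = -(Y - X2) * (1 - X2 ** 2)
    dW2 = X1.T @ delta2
    # push the error back through W2 first (local X gradient),
    # then through the hidden activation: dLoss/da1
    delta1 = (delta2 @ W2.T) * (1 - X1 ** 2)
    dW1 = X0.T @ delta1
    return dW1, dW2

def numeric_grad(W, which, eps=1e-6):
    # central finite differences, one weight at a time
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp = W.copy()
        Wp[idx] += eps
        Wm = W.copy()
        Wm[idx] -= eps
        if which == 1:
            grad[idx] = (loss_fn(Wp, W2) - loss_fn(Wm, W2)) / (2 * eps)
        else:
            grad[idx] = (loss_fn(W1, Wp) - loss_fn(W1, Wm)) / (2 * eps)
    return grad

dW1, dW2 = analytic_grads(W1, W2)
max_err = max(
    np.abs(dW1 - numeric_grad(W1, 1)).max(),
    np.abs(dW2 - numeric_grad(W2, 2)).max(),
)
print("largest difference between analytic and numeric gradient:", max_err)
```

If the multiplication by W2 is left out of delta1, the difference for W1 stops being tiny, which is the kind of mistake described above.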
To solve this I updated the train method:
def train(self,X,Y,loss,epoch=10000):
    for i in range(epoch):
        YHat = self.forward(X)
        err = -(Y-YHat)
        loss.append(sum(Y-YHat))
        werr = np.sum(np.dot(self.__layers[-1].localWGrad,err.T), axis=1)
        werr.shape = (self.__hiddenDimensions[-1][0],1)
        self.__layers[-1].adjustWeights(werr)
        for l in reversed(self.__layers[:-1]):
            err = np.multiply(err, l.localXGrad)
            werr = np.sum(np.dot(l.weights,err.T),axis=1)
            l.adjustWeights(werr)
Now the loss graph I got looks like this:
I'm trying to train a resnet18 model on pytorch (+pytorch-lightning) with the use of Virtual Adversarial Training. During the computations required for this type of training I need to obtain the gradient of D (ie. the cross-entropy loss of the model) with regard to tensor r.
This should, in theory, happen in the following code snippet:
def generic_step(self, train_batch, batch_idx, step_type):
x, y = train_batch
unlabeled_idx = y is None
d = torch.rand(x.shape).to(x.device)
d = d/(torch.norm(d) + 1e-8)
pred_y = self.classifier(x)
y[unlabeled_idx] = pred_y[unlabeled_idx]
l = self.criterion(pred_y, y)
R_adv = torch.zeros_like(x)
for _ in range(self.ip):
r = self.xi * d
r.requires_grad = True
pred_hat = self.classifier(x + r)
# pred_hat = F.log_softmax(pred_hat, dim=1)
D = self.criterion(pred_hat, pred_y)
self.classifier.zero_grad()
D.requires_grad=True
D.backward()
R_adv += self.eps * r.grad / (torch.norm(r.grad) + 1e-8)
R_adv /= 32
loss = l + R_adv * self.a
loss.backward()
self.accuracy[step_type] = self.acc_metric(torch.argmax(pred_y, 1), y)
return loss
Here, to my understanding, r.grad should in theory be the gradient of D with respect to r. However, the code throws this at D.backward():
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
(full traceback excluded because this error is not helpful and technically "solved" as I know the cause for it, explained just below)
After some research and debugging it seems that in this situation D.backward() attempts to calculate dD/dD disregarding any previous mention of requires_grad=True. This is confirmed when I add D.requires_grad=True and I get D.grad=Tensor(1.,device='cuda:0') but r.grad=None.
Does anyone know why this may be happening?
In Lightning, .backward() and optimizer step are all handled under the hood. If you do it yourself like in the code above, it will mess with Lightning because it doesn't know you called backward yourself.
You can enable manual optimization in the LightningModule:
def __init__(self):
    super().__init__()
    # put this in your init
    self.automatic_optimization = False
This tells Lightning that you are taking over calling backward and handling optimizer step + zero grad yourself. Don't forget to add that in your code above. You can access the optimizer and scheduler like so in your training step:
def training_step(self, batch, batch_idx):
    optimizer = self.optimizers()
    scheduler = self.lr_schedulers()
    # do your training step
    # don't forget to call:
    # 1) backward 2) optimizer step 3) zero grad
Read more about manual optimization here.
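For reference, here is what those three calls look like in plain PyTorch, together with one way to get the gradient of a loss with respect to a perturbation tensor r. This is only a sketch on made-up data: the nn.Linear model stands in for the classifier, and none of it is Lightning-specific. Inside a LightningModule with manual optimization, the backward call would go through self.manual_backward(loss) rather than loss.backward().

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(4, 3)                       # hypothetical stand-in for the classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 4)                         # made-up batch
y = torch.randint(0, 3, (8,))

# Gradient of a loss with respect to a perturbation r.
# r is a leaf tensor that requires grad, and the forward pass runs with
# gradient tracking enabled, so the graph from d_loss back to r exists.
r = torch.zeros_like(x, requires_grad=True)
d_loss = criterion(model(x + r), y)
(r_grad,) = torch.autograd.grad(d_loss, r)    # leaves the parameters' .grad untouched

# The three manual steps: 1) backward 2) optimizer step 3) zero grad.
weight_before = model.weight.detach().clone()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()

print(r_grad.shape)
```

torch.autograd.grad is used here instead of d_loss.backward() so that the perturbation gradient does not accumulate into the parameters' .grad fields before the real training step.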
The code below is my main code block:
iter_pos=0
max_iter=120
iter_cost=[]
parameters=generate_parameters()
while iter_pos<max_iter:
    y_pred = forward_prop(x_train, parameters)
    cost_value = cost(y_hat=y_pred, y=multi_class_y_train)
    iter_cost.append(cost_value)
    delta_para = back_prop(parameters, multi_class_y_train, y_pred)
    parameters=update_parameters(parameters,delta_para)
    print(iter_pos, cost_value)
    iter_pos+=1
This is my forward prop algorithm:
def forward_prop(x_input, parameter):
    a=x_input
    nodes_values[f'l{1}']=a
    for pos in range(1,n_layers):
        w = parameter[f'w{pos}']
        b=parameter[f'b{pos}']
        z=np.dot(w,a)+b
        a=sigmoid(z)
        nodes_values[f'l{pos+1}']=a
    return a
Now comes the main back prop. I guess my mistake is here:
def back_prop(parameters, y_true, y_pred):
    delta = (nodes_values[f'l{n_layers}']-y_true)
    delta_para={}
    delta_para[f'delW{n_layers-1}']=np.dot(delta, nodes_values[f'l{n_layers-1}'].T)*lr/m
    delta_para[f'delB{n_layers-1}']=(np.sum(delta, axis=1, keepdims=True))*lr/m
    for pos in range(n_layers-1,1,-1):
        a=nodes_values[f'l{pos}']
        x=nodes_values[f'l{pos-1}']
        delta=np.dot(parameters[f'w{pos}'].T, delta)*((a)*(1-a))
        delta_para[f'delW{pos-1}']=np.dot(delta, x.T)*lr/m
        delta_para[f'delB{pos-1}']=np.sum(delta, axis=1, keepdims=True)*lr/m
    return delta_para
After getting all my gradients, I update the parameters:
def update_parameters(parameters, delta_para):
    for pos in range(n_layers-1,0,-1):
        parameters[f'w{pos}']-=delta_para[f'delW{pos}']
        parameters[f'b{pos}']-=delta_para[f'delB{pos}']
    return parameters
These are my main code blocks; if required, I will provide my complete code. Could someone please suggest what the issue might be?
As discussed in the comments, your issue is with using sigmoid on the final layer instead of softmax on a multi-class mutually-exclusive classification problem. A quick fix will be to just import the softmax function from scipy.special and use it in the last layer:
from scipy.special import softmax

def forward_prop(x_input, parameter):
    a = x_input
    nodes_values[f'l{1}'] = a
    for pos in range(1, n_layers):
        w = parameter[f'w{pos}']
        b = parameter[f'b{pos}']
        z = np.dot(w, a) + b
        # Use softmax if this is the last layer; axis=0 normalizes over the
        # classes, because each column of z holds one sample
        if pos == n_layers - 1:
            a = softmax(z, axis=0)
        # Use your choice of activation function otherwise (sigmoid in your case)
        else:
            a = sigmoid(z)
        nodes_values[f'l{pos+1}'] = a
    return a
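You can, of course, define your own softmax, as it's pretty simple:
def softmax(z, axis=0):
    # Subtract the maximum along the axis for numerical stability
    exp = np.exp(z - np.max(z, axis=axis, keepdims=True))
    return exp / np.sum(exp, axis=axis, keepdims=True)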
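A side benefit of softmax on the last layer is that the output delta used in back_prop above (prediction minus one-hot target) is then exactly the gradient of the cross-entropy loss with respect to the last layer's z. The sketch below checks that numerically on made-up data. It assumes one column per sample, as in the code above; the cross_entropy helper is hypothetical and not taken from the question.

```python
import numpy as np

def softmax(z, axis=0):
    # numerically stable softmax along the class axis
    shifted = z - np.max(z, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def cross_entropy(z, y_true):
    # summed cross-entropy of softmax(z) against one-hot targets
    return -np.sum(y_true * np.log(softmax(z, axis=0)))

rng = np.random.default_rng(0)
z = rng.normal(size=(3, 4))             # 3 classes, 4 samples (one column each)
y_true = np.eye(3)[:, [0, 2, 1, 0]]     # one-hot target columns

probs = softmax(z, axis=0)
analytic = probs - y_true               # same form as the delta in back_prop

# central finite differences on every entry of z
numeric = np.zeros_like(z)
eps = 1e-6
for idx in np.ndindex(z.shape):
    zp = z.copy()
    zp[idx] += eps
    zm = z.copy()
    zm[idx] -= eps
    numeric[idx] = (cross_entropy(zp, y_true) - cross_entropy(zm, y_true)) / (2 * eps)

print(np.abs(analytic - numeric).max())
```

Each column of probs sums to 1, which is what a mutually exclusive multi-class target needs and what an element-wise sigmoid does not guarantee.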
I'm learning about policy gradients and I'm having a hard time understanding how does the gradient passes through a random operation. From here: "It is not possible to directly backpropagate through random samples. However, there are two main methods for creating surrogate functions that can be backpropagated through."
They have an example of the score function:
probs = policy_network(state)
# Note that this is equivalent to what used to be called multinomial
m = Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
Which I tried to create an example of:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Normal
import matplotlib.pyplot as plt
from tqdm import tqdm

softplus = torch.nn.Softplus()

class Model_RL(nn.Module):
    def __init__(self):
        super(Model_RL, self).__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 30)
        self.fc3 = nn.Linear(30, 2)

    def forward(self, x):
        x1 = self.fc1(x)
        x = torch.relu(x1)
        x2 = self.fc2(x)
        x = torch.relu(x2)
        x3 = softplus(self.fc3(x))
        return x3, x2, x1

# basic
net_RL = Model_RL()
features = torch.tensor([1.0])
x = torch.tensor([1.0])
y = torch.tensor(3.0)
baseline = 0
baseline_lr = 0.1
epochs = 3
opt_RL = optim.Adam(net_RL.parameters(), lr=1e-3)
losses = []
xs = []
for _ in tqdm(range(epochs)):
    out_RL = net_RL(x)
    mu, std = out_RL[0]
    dist = Normal(mu, std)
    print(dist)
    a = dist.sample()
    log_p = dist.log_prob(a)
    out = features * a
    reward = -torch.square((y - out))
    baseline = (1-baseline_lr)*baseline + baseline_lr*reward
    loss = -(reward-baseline)*log_p
    opt_RL.zero_grad()
    loss.backward()
    opt_RL.step()
    losses.append(loss.item())
This seems to work magically fine, which again I don't understand: they mention that the gradient can't pass through the random operation, but then somehow it does.
Since the gradient can't flow through the random operation, I tried to replace mu, std = out_RL[0] with mu, std = out_RL[0].detach(), and that caused the error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
If the gradient doesn't pass through the random operation, I don't understand why detaching a tensor before the operation would matter.
It is indeed true that sampling is not a differentiable operation per se. However, there are two (broad) ways to work around this: [1] the REINFORCE way and [2] the reparameterization way. Since your example is related to [1], I will restrict my answer to REINFORCE.
What REINFORCE does is remove the sampling operation from the computation graph entirely. The sampling still happens, but outside the graph. So, your statement
.. how does the gradient passes through a random operation ..
isn't correct. It does not pass through any random operation. Let's see your example
mu, std = out_RL[0]
dist = Normal(mu, std)
a = dist.sample()
log_p = dist.log_prob(a)
Computation of a does not involve creating a computation graph. It is technically equivalent to plugging in some offline data from a dataset (as in supervised learning)
mu, std = out_RL[0]
dist = Normal(mu, std)
# a = dist.sample()
a = torch.tensor([1.23, 4.01, -1.2, ...], device='cuda')
log_p = dist.log_prob(a)
Since we don't have offline data beforehand, we create them on the fly and the .sample() method does merely that.
So, there is no random operation on the graph. The log_p depends on mu and std deterministically, just like any standard computation graph. If you cut the connection like this
mu, std = out_RL[0].detach()
.. of course it is going to complain: nothing that log_p depends on requires a gradient any more.
Also, do not get confused by this operation
dist = Normal(mu, std)
log_p = dist.log_prob(a)
as it does not contain any randomness by itself. This is merely a shortcut for writing the tedious log-likelihood formula for Normal distribution.
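A small check of both points, as a sketch with made-up scalar values for mu and std (in the question they come from the network): the sample carries no gradient information, while log_prob is an ordinary differentiable formula in mu and std.

```python
import math
import torch
from torch.distributions import Normal

torch.manual_seed(0)
mu = torch.tensor(0.5, requires_grad=True)    # made-up parameters for illustration
std = torch.tensor(1.2, requires_grad=True)
dist = Normal(mu, std)

a = dist.sample()            # plain data: not connected to mu or std in the graph
log_p = dist.log_prob(a)     # deterministic function of mu and std, given a

# the same log-likelihood written out by hand
manual = -((a - mu) ** 2) / (2 * std ** 2) - torch.log(std) - 0.5 * math.log(2 * math.pi)

log_p.backward()
# closed-form derivative of the Normal log-density with respect to mu
expected_mu_grad = (a - mu.detach()) / std.detach() ** 2

print(a.requires_grad, log_p.requires_grad)
print(mu.grad, expected_mu_grad)
```

The gradient reaches mu and std through log_p only; a itself is treated as a constant, exactly like a value read from a dataset.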
@ayandas explained the first way very well.
The second way, the reparameterization method, is quite different.
In contrast to sample(), reparameterization with rsample() returns a sample that stays connected to the computation graph.
It does so by expressing the sample as a deterministic function of the distribution's parameters plus independent random noise, so gradients can flow back to the model parameters.
Check this explanation with the simple code.
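As a minimal sketch of the difference, with made-up scalar parameters: for a Normal distribution the reparameterized sample has the form mu + std * eps with eps drawn from a standard normal, so its derivative with respect to mu is 1 and with respect to std is eps.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
mu = torch.tensor(0.5, requires_grad=True)    # made-up parameters for illustration
std = torch.tensor(1.2, requires_grad=True)
dist = Normal(mu, std)

s = dist.sample()     # detached from mu and std
r = dist.rsample()    # still connected to mu and std

r.backward()          # differentiate the sample itself

# recover the noise term from r = mu + std * eps
eps = ((r - mu) / std).detach()

print(s.requires_grad, r.requires_grad)
print(mu.grad, std.grad, eps)
```

With rsample() a loss computed directly from the sample can be backpropagated, whereas with sample() the gradient has to come through log_prob as in the REINFORCE approach.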
I wanted to predict heart disease using the backpropagation algorithm for neural networks. For this I used the UCI heart disease data set linked here: processed cleveland. To do this, I used the code found in the following blog post: Build a flexible Neural Network with Backpropagation in Python, and changed it a little bit for my own dataset. My code is as follows:
import numpy as np
import csv

reader = csv.reader(open("cleveland_data.csv"), delimiter=",")
x = list(reader)
result = np.array(x).astype("float")
X = result[:, :13]
y0 = result[:, 13]
y1 = np.array([y0])
y = y1.T
# scale units
X = X / np.amax(X, axis=0)  # maximum of X array

class Neural_Network(object):
    def __init__(self):
        # parameters
        self.inputSize = 13
        self.outputSize = 1
        self.hiddenSize = 13
        # weights
        self.W1 = np.random.randn(self.inputSize, self.hiddenSize)
        self.W2 = np.random.randn(self.hiddenSize, self.outputSize)

    def forward(self, X):
        # forward propagation through our network
        self.z = np.dot(X, self.W1)
        self.z2 = self.sigmoid(self.z)  # activation function
        self.z3 = np.dot(self.z2, self.W2)
        o = self.sigmoid(self.z3)  # final activation function
        return o

    def sigmoid(self, s):
        # activation function
        return 1 / (1 + np.exp(-s))

    def sigmoidPrime(self, s):
        # derivative of sigmoid
        return s * (1 - s)

    def backward(self, X, y, o):
        # backward propgate through the network
        self.o_error = y - o  # error in output
        self.o_delta = self.o_error * self.sigmoidPrime(o)  # applying derivative of sigmoid to error
        self.z2_error = self.o_delta.dot(
            self.W2.T)  # z2 error: how much our hidden layer weights contributed to output error
        self.z2_delta = self.z2_error * self.sigmoidPrime(self.z2)  # applying derivative of sigmoid to z2 error
        self.W1 += X.T.dot(self.z2_delta)  # adjusting first set (input --> hidden) weights
        self.W2 += self.z2.T.dot(self.o_delta)  # adjusting second set (hidden --> output) weights

    def train(self, X, y):
        o = self.forward(X)
        self.backward(X, y, o)

NN = Neural_Network()
for i in range(100):  # trains the NN 100 times
    print("Input: \n" + str(X))
    print("Actual Output: \n" + str(y))
    print("Predicted Output: \n" + str(NN.forward(X)))
    print("Loss: \n" + str(np.mean(np.square(y - NN.forward(X)))))  # mean sum squared loss
    print("\n")
    NN.train(X, y)
But when I run this code, all my predicted outputs become 1 after a few iterations and then stay the same for all 100 iterations. What is the problem in the code?
A few mistakes that I've noticed:
The output of your network is a sigmoid, i.e. a value in the range [0, 1], which suits predicting probabilities. But the target seems to be a value in the range [0, 4]. This explains why the network pushes its output as high as possible to get close to the large labels. But it can't go above 1.0, so it gets stuck.
You should either get rid of the final sigmoid or pre-process the labels and scale them to [0, 1]. Both options will make it learn better.
You don't use a learning rate (effectively setting it to 1.0), which is probably a bit high, so it's possible for the NN to diverge. My experiments showed that 0.01 is a good learning rate, but you can play around with that.
Other than this, your backprop seems to be working right.
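Here is a rough sketch of both fixes together, on synthetic data rather than the Cleveland file: the labels in {0, ..., 4} are divided by 4 so they fit the sigmoid's range, and the weight updates are multiplied by a learning rate of 0.01. The network sizes and the way the fake labels are generated are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))                           # synthetic stand-in features
labels = np.floor(X.sum(axis=1) / 3 * 5).clip(0, 4)     # synthetic labels in {0, ..., 4}
y = (labels / 4.0).reshape(-1, 1)                       # fix 1: scale targets into [0, 1]

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 1))
learning_rate = 0.01                                    # fix 2: explicit learning rate

def mse(W1, W2):
    return np.mean((y - sigmoid(sigmoid(X @ W1) @ W2)) ** 2)

initial_loss = mse(W1, W2)
for _ in range(2000):
    z2 = sigmoid(X @ W1)                                # hidden activations
    o = sigmoid(z2 @ W2)                                # network output
    o_delta = (y - o) * o * (1 - o)                     # output error times sigmoid derivative
    z2_delta = (o_delta @ W2.T) * z2 * (1 - z2)         # error pushed back to the hidden layer
    W1 += learning_rate * X.T @ z2_delta
    W2 += learning_rate * z2.T @ o_delta
final_loss = mse(W1, W2)

print(initial_loss, final_loss)
```

To turn a prediction back into the original label scale, multiply the output by 4.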
I am playing with vanilla RNNs, training with gradient descent (non-batch version), and I am having an issue with the gradient computation for the (scalar) cost. Here's the relevant portion of my code:
class Rnn(object):
# ............ [skipping the trivial initialization]
def recurrence(x_t, h_tm_prev):
h_t = T.tanh(T.dot(x_t, self.W_xh) +
T.dot(h_tm_prev, self.W_hh) + self.b_h)
return h_t
h, _ = theano.scan(
recurrence,
sequences=self.input,
outputs_info=self.h0
)
y_t = T.dot(h[-1], self.W_hy) + self.b_y
self.p_y_given_x = T.nnet.softmax(y_t)
self.y_pred = T.argmax(self.p_y_given_x, axis=1)
def negative_log_likelihood(self, y):
return -T.mean(T.log(self.p_y_given_x)[:, y])
def testRnn(dataset, vocabulary, learning_rate=0.01, n_epochs=50):
# ............ [skipping the trivial initialization]
index = T.lscalar('index')
x = T.fmatrix('x')
y = T.iscalar('y')
rnn = Rnn(x, n_x=27, n_h=12, n_y=27)
nll = rnn.negative_log_likelihood(y)
cost = T.lscalar('cost')
gparams = [T.grad(cost, param) for param in rnn.params]
updates = [(param, param - learning_rate * gparam)
for param, gparam in zip(rnn.params, gparams)
]
train_model = theano.function(
inputs=[index],
outputs=nll,
givens={
x: train_set_x[index],
y: train_set_y[index]
},
)
sgd_step = theano.function(
inputs=[cost],
outputs=[],
updates=updates
)
done_looping = False
while(epoch < n_epochs) and (not done_looping):
epoch += 1
tr_cost = 0.
for idx in xrange(n_train_examples):
tr_cost += train_model(idx)
# perform sgd step after going through the complete training set
sgd_step(tr_cost)
For some reasons I don't want to pass the complete training data to train_model(..); instead I want to pass individual examples one at a time. The problem is that each call to train_model(..) returns the cost (negative log-likelihood) of that particular example, so I have to aggregate the costs over the complete training set, then take the derivative and apply the update to the weight parameters in sgd_step(..). For obvious reasons, my current implementation gives this error: theano.gradient.DisconnectedInputError: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: W_xh. I don't understand how to make 'cost' part of the computational graph (in my case I have to wait for it to be aggregated). Is there a better or more elegant way to achieve the same thing?
Thanks.
It turns out that a symbolic variable cannot be differentiated in Theano if it is not part of the computational graph. Here, cost was a fresh T.lscalar with no connection to the parameters, so T.grad had nothing to work with. Therefore, I had to change the way I pass data to train_model(..); passing the complete training data instead of individual examples fixed the issue.
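For what it's worth, the original goal (one update after a pass over individual examples) can also be reached by accumulating per-example gradients instead of per-example costs, because the gradient of a sum of costs is the sum of the gradients. The sketch below shows that identity in plain NumPy on a made-up linear least-squares model. It is not Theano code and not what the fix above does.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))     # 5 made-up training examples
y = rng.normal(size=5)
w = rng.normal(size=3)

def example_grad(x_i, y_i, w):
    # gradient of 0.5 * (x_i @ w - y_i) ** 2 with respect to w
    return (x_i @ w - y_i) * x_i

# accumulate gradients one example at a time
accumulated = np.zeros_like(w)
for x_i, y_i in zip(X, y):
    accumulated += example_grad(x_i, y_i, w)

# gradient of the summed cost computed on the whole data set at once
full_batch = X.T @ (X @ w - y)

# a single update after the full pass (0.01 is an arbitrary step size)
w_new = w - 0.01 * accumulated

print(np.abs(accumulated - full_batch).max())
```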