I am working on an image restoration task and am considering multiple loss functions. My plan was to pursue three routes:
1: Use several losses for monitoring, but only a few of them for training itself.
2: Give each loss used for training a weight. Currently I specify the weights by hand; I would like to make them adaptive.
3: If I observe saturation mid-training, change the loss function or its components. So far I have only considered retraining the network (if the first run saturated): train with a particular loss for the first, say, M epochs, then switch the loss.
For all but the last case I have written code that computes these losses, but I am not sure whether it will work, i.e. whether it will backpropagate (code given below).
Is it possible to set the weights adaptively when using a combination of loss functions, i.e. can the network be trained so that these weights are also learned?
Can this implementation be used for case 3 above, changing the loss function mid-training?
Sorry if anything here is unclear or wrong; please let me know if I should improve the question. (I am fairly new to PyTorch.)
criterion = _criterion()  # instantiate; `criterion = _criterion` would assign the class itself
# -- training
prediction = model(input)
loss, loss_1, loss_2, loss_3, loss_4 = criterion(prediction, target)  # forward returns the training loss plus the monitored losses
loss.backward()
class _criterion(nn.Module):
    def __init__(self, model_type="CNN"):
        super(_criterion, self).__init__()  # was super(_criterion).__init__(), which skips nn.Module's init
        self.model_type = model_type

    def forward(self, pred, ref):
        # monitoring losses, always computed (reduction='sum' replaces the deprecated size_average=False)
        loss_1 = lambda x, y: nn.MSELoss(reduction='sum')(x, y)
        loss_2 = lambda x, y: nn.L1Loss(reduction='sum')(x, y)
        loss_3 = lambda x, y: nn.SmoothL1Loss(reduction='sum')(x, y)
        loss_4 = lambda x, y: L1_Charbonnier_loss_()(x, y)  # user-defined

        # `opt` holds the parsed command-line options
        if opt.loss_function_order == 1:
            loss_function_1 = get_loss_function(opt.loss_function_1)
            loss = lambda x, y: 1 * loss_function_1(x, y)
        elif opt.loss_function_order == 2:
            loss_function_1 = get_loss_function(opt.loss_function_1)
            loss_function_2 = get_loss_function(opt.loss_function_2)
            weight_1 = opt.loss_function_1_weight
            weight_2 = opt.loss_function_2_weight
            loss = lambda x, y: weight_1 * loss_function_1(x, y) + weight_2 * loss_function_2(x, y)
        elif opt.loss_function_order == 3:
            loss_function_1 = get_loss_function(opt.loss_function_1)
            loss_function_2 = get_loss_function(opt.loss_function_2)
            loss_function_3 = get_loss_function(opt.loss_function_3)
            weight_1 = opt.loss_function_1_weight
            weight_2 = opt.loss_function_2_weight
            weight_3 = opt.loss_function_3_weight
            loss = lambda x, y: weight_1 * loss_function_1(x, y) + weight_2 * loss_function_2(x, y) + weight_3 * loss_function_3(x, y)
        elif opt.loss_function_order == 4:
            loss_function_1 = get_loss_function(opt.loss_function_1)
            loss_function_2 = get_loss_function(opt.loss_function_2)
            loss_function_3 = get_loss_function(opt.loss_function_3)
            loss_function_4 = get_loss_function(opt.loss_function_4)
            weight_1 = opt.loss_function_1_weight
            weight_2 = opt.loss_function_2_weight
            weight_3 = opt.loss_function_3_weight
            weight_4 = opt.loss_function_4_weight
            loss = lambda x, y: weight_1 * loss_function_1(x, y) + weight_2 * loss_function_2(x, y) + weight_3 * loss_function_3(x, y) + weight_4 * loss_function_4(x, y)
        else:
            raise Exception("_criterion : unable to interpret loss_function_order")
        return loss(pred, ref), loss_1(pred, ref), loss_2(pred, ref), loss_3(pred, ref), loss_4(pred, ref)
def get_loss_function(loss):
    if loss == "MSE":
        criterion = nn.MSELoss(reduction='sum')
    elif loss == "MAE":
        criterion = nn.L1Loss(reduction='sum')
    elif loss == "Smooth-L1":
        criterion = nn.SmoothL1Loss(reduction='sum')
    elif loss == "Charbonnier":
        criterion = L1_Charbonnier_loss_()
    else:
        raise Exception("not implemented")
    return criterion
class L1_Charbonnier_loss_(nn.Module):
    """Charbonnier (pseudo-Huber) loss: a differentiable variant of L1."""
    def __init__(self):
        super(L1_Charbonnier_loss_, self).__init__()
        self.eps = 1e-6

    def forward(self, X, Y):
        diff = X - Y
        error = self.eps * (torch.sqrt(1 + (diff * diff) / self.eps) - 1)
        return torch.sum(error)
If I understand your question correctly: yes, your loss computation will backpropagate, since every term is built from differentiable operations on the prediction. But be careful when combining error functions, as each behaves differently depending on the situation. Regarding reusing weights across runs, you would save the weights of this network and then load them into another one, so that the weights from other executions can be used; PyTorch's transfer learning tutorial shows how to do this.
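To the follow-up about learning the combination weights themselves (case 2): yes, they can be made trainable by registering them as an nn.Parameter and handing them to the optimizer alongside the model. Below is a minimal sketch, loosely following the uncertainty-weighting idea of Kendall et al. (2018), where each weight is exp(-s_i) and the added +s_i term keeps the optimizer from driving every weight to zero. The `model` variable is assumed to exist as in your training snippet; the specific losses are just examples.

import torch
import torch.nn as nn

class AdaptiveWeightedLoss(nn.Module):
    """Combine several losses with learnable weights w_i = exp(-s_i)."""
    def __init__(self, loss_fns):
        super().__init__()
        self.loss_fns = loss_fns
        # one learnable log-variance per loss term, updated by the optimizer
        self.log_vars = nn.Parameter(torch.zeros(len(loss_fns)))

    def forward(self, pred, ref):
        total = 0.0
        for s, fn in zip(self.log_vars, self.loss_fns):
            # exp(-s) * loss + s: without the +s regularizer the optimizer
            # would simply push every weight to zero
            total = total + torch.exp(-s) * fn(pred, ref) + s
        return total

criterion = AdaptiveWeightedLoss([nn.MSELoss(reduction='sum'),
                                  nn.L1Loss(reduction='sum')])
optimizer = torch.optim.Adam(list(model.parameters()) +
                             list(criterion.parameters()), lr=1e-4)

For case 3 (switching losses mid-training), nothing special is needed on the autograd side: the loss graph is rebuilt on every forward pass, so you can simply swap `criterion` (or zero out individual weights) at epoch M inside the training loop, and gradients will flow through whichever terms are active at that step.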
I am trying to do a simple loss-minimization for a specific variable coeff using PyTorch optimizers. This variable is supposed to be used as an interpolation coefficient for two vectors w_foo and w_bar to find a third vector, w_target.
w_target = w_foo + coeff * (w_bar - w_foo)
With w_foo and w_bar set as constant, at each optimization step I calculate w_target for the given coeff. Loss is determined from w_target using a fairly complex process beyond the scope of this question.
# w_foo.shape = [1, 16, 512]
# w_bar.shape = [1, 16, 512]
# num_layers = 16
# num_steps = 10000

vgg_loss = VGGLoss()
coeff = torch.randn([num_layers, ])
optimizer = torch.optim.Adam([coeff], lr=initial_learning_rate)

for step in range(num_steps):
    w_target = w_foo + torch.matmul(coeff, (w_bar - w_foo))
    optimizer.zero_grad()
    target_image = generator.synthesis(w_target)
    processed_target_image = process(target_image)
    loss = vgg_loss(processed_target_image, source_image)
    loss.backward()
    optimizer.step()
However, when running this optimizer, coeff does not change from one step to the next, so the optimizer is essentially useless. I would like to ask for some advice on what I am doing wrong here.
Edit:
As suggested, I will try to elaborate on the loss function. Essentially, w_target is used to generate an image, and VGGLoss uses VGG feature extractor to compare this synthetic image with a certain exemplar source image.
class VGGLoss(torch.nn.Module):
    def __init__(self, device, vgg):
        super().__init__()
        for param in self.parameters():
            param.requires_grad = True
        self.vgg = vgg  # VGG16 in eval mode

    def forward(self, source, target):
        loss = 0
        source_features = self.vgg(source, resize_images=False, return_lpips=True)
        target_features = self.vgg(target, resize_images=False, return_lpips=True)
        loss += (source_features - target_features).square().sum()
        return loss
I am trying to implement the OSME MAMC model described in the paper https://arxiv.org/abs/1806.05372.
I'm stuck where I have to add a cost that depends not on y_true and y_pred but on hidden layers and y_true. That doesn't fit a standard TensorFlow custom loss, which must be a function of y_true and y_pred only.
I wrote the model as a class, then tried to use a gradient tape to add NPairLoss to the softmax output loss, but the gradient is NaN during training. I think my approach isn't good, but I have no idea how to design or write it otherwise.
Here is my model:
class OSME_network(tf.keras.Model):
    def __init__(self, nbrclass=10, weight="imagenet", input_tensor=(32,32,3)):
        super(OSME_network, self).__init__()
        self.nbrclass = nbrclass
        self.weight = weight
        self.input_tensor = input_tensor
        self.Resnet_50 = ResNet50(include_top=False, weights=self.weight, input_shape=self.input_tensor)
        self.Resnet_50.trainable = False
        self.split = Lambda(lambda x: tf.split(x, num_or_size_splits=2, axis=-1))
        self.s_1 = OSME_Layer(ch=1024, ratio=16)
        self.s_2 = OSME_Layer(ch=1024, ratio=16)
        self.fl1 = tf.keras.layers.Flatten()
        self.fl2 = tf.keras.layers.Flatten()
        self.d1 = tf.keras.layers.Dense(1024, name='fc1')
        self.d2 = tf.keras.layers.Dense(1024, name='fc2')
        self.fc = Concatenate()
        self.preds = tf.keras.layers.Dense(self.nbrclass, activation='softmax')

    #tf.function
    def call(self, x):  # builds the model sequentially
        x = self.Resnet_50(x)
        x_1, x_2 = self.split(x)
        xx_1 = self.s_1(x_1)
        xx_2 = self.s_2(x_2)
        xxx_1 = self.d1(xx_1)
        xxx_2 = self.d2(xx_2)
        xxxx_1 = self.fl1(xxx_1)
        xxxx_2 = self.fl2(xxx_2)
        fc = self.fc([xxxx_1, xxxx_2])  # fc1 + fc2
        ret = self.preds(fc)
        return xxxx_1, xxxx_2, ret
class OSME_Layer(tf.keras.layers.Layer):
    def __init__(self, ch, ratio):
        super(OSME_Layer, self).__init__()
        self.GloAvePool2D = GlobalAveragePooling2D()
        self.Dense1 = Dense(ch // ratio, activation='relu')
        self.Dense2 = Dense(ch, activation='sigmoid')
        self.Mult = Multiply()
        self.ch = ch

    def call(self, inputs):
        squeeze = self.GloAvePool2D(inputs)
        se_shape = (1, 1, self.ch)
        se = Reshape(se_shape)(squeeze)
        excitation = self.Dense1(se)
        excitation = self.Dense2(excitation)
        scale = self.Mult([inputs, excitation])
        return scale
class NPairLoss():
    def __init__(self):
        self._inputs = None
        self._y = None

    #tf.function
    def __call__(self, inputs, y):
        targets = tf.argmax(y, axis=1)
        b, p, _ = inputs.shape
        n = b * p
        inputs = tf.reshape(inputs, [n, -1])
        targets = tf.repeat(targets, repeats=p)
        parts = tf.tile(tf.range(p), [b])
        prod = tf.linalg.matmul(inputs, inputs, transpose_a=False, transpose_b=True)
        same_class_mask = tf.math.equal(tf.broadcast_to(targets, [n, n]),
                                        tf.transpose(tf.broadcast_to(targets, (n, n))))
        same_atten_mask = tf.math.equal(tf.broadcast_to(parts, [n, n]),
                                        tf.transpose(tf.broadcast_to(parts, (n, n))))
        s_sasc = same_class_mask & same_atten_mask
        s_sadc = (~same_class_mask) & same_atten_mask
        s_dasc = same_class_mask & (~same_atten_mask)
        s_dadc = (~same_class_mask) & (~same_atten_mask)

        loss_sasc = 0
        loss_sadc = 0
        loss_dasc = 0
        for i in range(n):
            # loss_sasc
            pos = prod[i][s_sasc[i]]
            neg = prod[i][s_sadc[i] | s_dasc[i] | s_dadc[i]]
            n_pos = tf.shape(pos)[0]
            n_neg = tf.shape(neg)[0]
            pos = tf.transpose(tf.broadcast_to(pos, [n_neg, n_pos]))
            neg = tf.broadcast_to(neg, [n_pos, n_neg])
            exp = tf.clip_by_value(tf.math.exp(neg - pos), clip_value_min=0, clip_value_max=9e6)  # need to clip value, else inf
            loss_sasc += tf.reduce_sum(tf.math.log(1 + tf.reduce_sum(exp, axis=1)))

            # loss_sadc
            pos = prod[i][s_sadc[i]]
            neg = prod[i][s_dadc[i]]
            n_pos = tf.shape(pos)[0]
            n_neg = tf.shape(neg)[0]
            pos = tf.transpose(tf.broadcast_to(pos, [n_neg, n_pos]))
            neg = tf.broadcast_to(neg, [n_pos, n_neg])
            exp = tf.clip_by_value(tf.math.exp(neg - pos), clip_value_min=0, clip_value_max=9e6)
            loss_sadc += tf.reduce_sum(tf.math.log(1 + tf.reduce_sum(exp, axis=1)))

            # loss_dasc
            pos = prod[i][s_dasc[i]]
            neg = prod[i][s_dadc[i]]
            n_pos = tf.shape(pos)[0]
            n_neg = tf.shape(neg)[0]
            pos = tf.transpose(tf.broadcast_to(pos, [n_neg, n_pos]))
            neg = tf.broadcast_to(neg, [n_pos, n_neg])
            exp = tf.clip_by_value(tf.math.exp(neg - pos), clip_value_min=0, clip_value_max=9e6)
            loss_dasc += tf.reduce_sum(tf.math.log(1 + tf.reduce_sum(exp, axis=1)))

        return (loss_sasc + loss_sadc + loss_dasc) / n
Then, for training:
#tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        fc1, fc2, y_pred = model(x, training=True)
        stacked = tf.stack([fc1, fc2], axis=1)
        layerLoss = npair(stacked, y)
        loss = cce(y, y_pred) + 0.001 * layerLoss
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss

model = OSME_network(weight="imagenet", nbrclass=10, input_tensor=(32, 32, 3))
model.compile(optimizer=opt, loss=categorical_crossentropy, metrics=["acc"])
model.build(input_shape=(None, 32, 32, 3))

cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True, name='categorical_crossentropy')
npair = NPairLoss()
Then, for each batch:
x = tf.Variable(x_train[start:end])
y = tf.Variable(y_train[start:end])
train_loss = train_step(x, y)
Thanks for any help :)
You can use TensorFlow's add_loss.
model.compile() loss functions in TensorFlow always take two parameters, y_true and y_pred. model.add_loss() has no such restriction and allows you to write much more complex losses that depend on many other tensors, at the cost of being more dependent on the model, whereas the standard loss functions work with any model.
You can find the official documentation of add_loss here. From the docs: "Add loss tensor(s), potentially dependent on layer inputs." This method can be used inside a subclassed layer or model's call function, in which case losses should be a Tensor or list of Tensors. It can also be called directly on a Functional Model during construction; in that case, any loss Tensors passed to the Model must be symbolic and traceable back to the model's Inputs. These losses become part of the model's topology and are tracked in get_config. There are a few examples in the documentation explaining add_loss.
Example:
inputs = tf.keras.Input(shape=(10,))
x = tf.keras.layers.Dense(10)(inputs)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)
# Activity regularization.
model.add_loss(tf.abs(tf.reduce_mean(x)))
You can call self.add_loss(loss_value) from inside the call method of a custom layer. Here's a simple example that adds activity regularization.
Example:
class ActivityRegularizationLayer(layers.Layer):
    def call(self, inputs):
        self.add_loss(tf.reduce_sum(inputs) * 0.1)
        return inputs  # Pass-through layer.

inputs = keras.Input(shape=(784,), name='digits')
x = layers.Dense(64, activation='relu', name='dense_1')(inputs)

# Insert activity regularization as a layer
x = ActivityRegularizationLayer()(x)

x = layers.Dense(64, activation='relu', name='dense_2')(x)
outputs = layers.Dense(10, name='predictions')(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# The displayed loss will be much higher than before
# due to the regularization component.
model.fit(x_train, y_train,
          batch_size=64,
          epochs=1)
You can find good examples using add_loss here and here, with explanations.
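For a label-dependent auxiliary term like the N-pair loss above, one possible pattern (a sketch only; the layers, shapes, and the auxiliary term itself are illustrative, not taken from the model above) is to feed the labels in as a second input, so that add_loss can see both a hidden tensor and y_true:

import tensorflow as tf

# feed the labels in as a second input so an auxiliary loss can combine
# a hidden tensor with y_true
image_in = tf.keras.Input(shape=(32, 32, 3), name="image")
label_in = tf.keras.Input(shape=(10,), name="label")

hidden = tf.keras.layers.Flatten()(tf.keras.layers.Conv2D(8, 3)(image_in))
logits = tf.keras.layers.Dense(10)(hidden)

model = tf.keras.Model([image_in, label_in], logits)

# placeholder auxiliary term depending on both `hidden` and the labels;
# substitute the NPairLoss computation here in the real model
aux = tf.reduce_mean(tf.reduce_sum(label_in, axis=-1) *
                     tf.reduce_mean(tf.square(hidden), axis=-1))
model.add_loss(0.001 * aux)

model.compile(optimizer="adam",
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True))
# training then passes the labels twice: once as input, once as target
# model.fit([x_train, y_train], y_train, batch_size=32, epochs=1)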
Hope this answers your question. Happy Learning.
I'm trying to port the BoundingLayer function from this file to the DDPG.py agent in keras-rl but I'm having some trouble with the implementation.
I modified the get_gradients(loss, params) method in DDPG.py to add this:
action_bounds = [-30, 50]
inverted_grads = []
for g, p in zip(modified_grads, params):
    is_above_upper_bound = K.greater(p, K.constant(action_bounds[1], dtype='float32'))
    is_under_lower_bound = K.less(p, K.constant(action_bounds[0], dtype='float32'))
    is_gradient_positive = K.greater(g, K.constant(0, dtype='float32'))
    is_gradient_negative = K.less(g, K.constant(0, dtype='float32'))
    invert_gradient = tf.logical_or(
        tf.logical_and(is_above_upper_bound, is_gradient_negative),
        tf.logical_and(is_under_lower_bound, is_gradient_positive)
    )
    inverted_grads.extend(K.switch(invert_gradient, -g, g))
modified_grads = inverted_grads[:]
But I get an error about the shape:
ValueError: Shape must be rank 0 but is rank 2 for 'cond/Switch' (op: 'Switch') with input shapes: [2,400], [2,400].
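The error itself likely comes from K.switch: in this Keras version a tensor condition is routed through tf.cond, which requires a rank-0 (scalar) predicate, hence "Shape must be rank 0 but is rank 2". An elementwise select sidesteps that; a minimal sketch with the shapes from the traceback (the mask here is a stand-in for invert_gradient):

import tensorflow as tf

g = tf.random.normal([2, 400])              # gradient tensor from the traceback
invert = tf.random.normal([2, 400]) > 0.0   # stand-in for invert_gradient

# tf.where selects elementwise, so a rank-2 boolean mask is fine,
# unlike tf.cond, which needs a scalar predicate
flipped = tf.where(invert, -g, g)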
keras-rl "get_gradients" function uses gradients calculated with a combined actor-critic model, but you need the gradient of the critic output wrt the action input to apply the inverting gradients feature.
I've recently implemented it on a RDPG prototype I'm working on, using keras-rl. Still testing, the code can be optimized and is not bug free for sure, but I've put the inverting gradient to work by modifying some keras-rl lines of code. In order to modify the gradient of the critic output wrt the action input, I've followed the original formula to compute the actor gradient, with the help of this great post from Patrick Emami: http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html.
I'm putting here the entire "compile" function, redefined in a class that inherits from "DDPAgent", where the inverting gradient feature is implemented.
def compile(self, optimizer, metrics=[]):
    metrics += [mean_q]

    if type(optimizer) in (list, tuple):
        if len(optimizer) != 2:
            raise ValueError('More than two optimizers provided. Please only provide a maximum of two optimizers, the first one for the actor and the second one for the critic.')
        actor_optimizer, critic_optimizer = optimizer
    else:
        actor_optimizer = optimizer
        critic_optimizer = clone_optimizer(optimizer)
    if type(actor_optimizer) is str:
        actor_optimizer = optimizers.get(actor_optimizer)
    if type(critic_optimizer) is str:
        critic_optimizer = optimizers.get(critic_optimizer)
    assert actor_optimizer != critic_optimizer

    if len(metrics) == 2 and hasattr(metrics[0], '__len__') and hasattr(metrics[1], '__len__'):
        actor_metrics, critic_metrics = metrics
    else:
        actor_metrics = critic_metrics = metrics

    def clipped_error(y_true, y_pred):
        return K.mean(huber_loss(y_true, y_pred, self.delta_clip), axis=-1)

    # Compile target networks. We only use them in feed-forward mode, hence we can pass any
    # optimizer and loss since we never use them anyway.
    self.target_actor = clone_model(self.actor, self.custom_model_objects)
    self.target_actor.compile(optimizer='sgd', loss='mse')
    self.target_critic = clone_model(self.critic, self.custom_model_objects)
    self.target_critic.compile(optimizer='sgd', loss='mse')

    # We also compile the actor. We never optimize the actor using Keras but instead compute
    # the policy gradient ourselves. However, we need the actor in feed-forward mode, hence
    # we also compile it with any optimizer and loss; they are never used.
    self.actor.compile(optimizer='sgd', loss='mse')

    # Compile the critic.
    if self.target_model_update < 1.:
        # We use the `AdditionalUpdatesOptimizer` to efficiently soft-update the target model.
        critic_updates = get_soft_target_model_updates(self.target_critic, self.critic, self.target_model_update)
        critic_optimizer = AdditionalUpdatesOptimizer(critic_optimizer, critic_updates)
    self.critic.compile(optimizer=critic_optimizer, loss=clipped_error, metrics=critic_metrics)

    clipnorm = getattr(actor_optimizer, 'clipnorm', 0.)
    clipvalue = getattr(actor_optimizer, 'clipvalue', 0.)

    critic_gradients_wrt_action_input = tf.gradients(self.critic.output, self.critic_action_input)
    critic_gradients_wrt_action_input = [g / float(self.batch_size) for g in critic_gradients_wrt_action_input]  # since TF sums over the batch

    action_bounds = [(-1., 1.) for i in range(self.nb_actions)]

    def calculate_inverted_gradient():
        """
        Applies the "inverting gradient" feature to the action-value gradients.
        """
        gradient_wrt_action = -critic_gradients_wrt_action_input[0]
        inverted_gradients = []
        for n in range(self.batch_size):
            inverted_gradient = []
            for i in range(gradient_wrt_action[n].shape[0].value):
                action = self.critic_action_input[n][i]
                is_gradient_negative = K.less(gradient_wrt_action[n][i], K.constant(0, dtype='float32'))
                adjust_for_upper_bound = gradient_wrt_action[n][i] * ((action_bounds[i][1] - action) / (action_bounds[i][1] - action_bounds[i][0]))
                adjust_for_lower_bound = gradient_wrt_action[n][i] * ((action - action_bounds[i][0]) / (action_bounds[i][1] - action_bounds[i][0]))
                modified_gradient = K.switch(is_gradient_negative, adjust_for_upper_bound, adjust_for_lower_bound)
                inverted_gradient.append(modified_gradient)
            inverted_gradients.append(inverted_gradient)
        gradient_wrt_action = tf.stack(inverted_gradients)
        return gradient_wrt_action

    actor_gradients_wrt_weights = tf.gradients(self.actor.output, self.actor.trainable_weights, grad_ys=calculate_inverted_gradient())
    actor_gradients_wrt_weights = [g / float(self.batch_size) for g in actor_gradients_wrt_weights]  # since TF sums over the batch

    def get_gradients(loss, params):
        """ Used by the actor optimizer.
        Returns the gradients to train the actor.
        These gradients are obtained by multiplying the gradients of the actor output w.r.t. its weights
        with the gradients of the critic output w.r.t. its action input. """
        # Apply clipping if defined
        modified_grads = [g for g in actor_gradients_wrt_weights]
        if clipnorm > 0.:
            norm = K.sqrt(sum([K.sum(K.square(g)) for g in modified_grads]))
            modified_grads = [optimizers.clip_norm(g, clipnorm, norm) for g in modified_grads]
        if clipvalue > 0.:
            modified_grads = [K.clip(g, -clipvalue, clipvalue) for g in modified_grads]
        return modified_grads

    actor_optimizer.get_gradients = get_gradients

    # get_updates is the optimizer function that changes the weights of the network
    updates = actor_optimizer.get_updates(self.actor.trainable_weights, self.actor.constraints, None)
    if self.target_model_update < 1.:
        # Include soft target model updates.
        updates += get_soft_target_model_updates(self.target_actor, self.actor, self.target_model_update)
    updates += self.actor.updates  # include other updates of the actor, e.g. for BN

    # Finally, combine it all into a callable function.
    # The inputs will be all the necessary placeholders to compute the gradients (actor and critic inputs).
    inputs = self.actor.inputs[:] + [self.critic_action_input, self.critic_history_input]
    self.actor_train_fn = K.function(inputs, [self.actor.output], updates=updates)
    self.actor_optimizer = actor_optimizer
    self.compiled = True
When training the actor, you should now pass 3 inputs instead of 2: the observation inputs + the action input (with a prediction from the actor network), so you must also modify the "backward" function. In my case:
...
if self.episode > self.nb_steps_warmup_actor:
    action = self.actor.predict_on_batch(history_batch)
    inputs = [history_batch, action, history_batch]
    actor_train_result = self.actor_train_fn(inputs)
    action_values = actor_train_result[0]
    assert action_values.shape == (self.batch_size, self.nb_actions)
...
After that you can have your actor with a linear activation in the output.
I am playing with vanilla RNNs, training with gradient descent (non-batch version), and I am having an issue with the gradient computation for the (scalar) cost; here's the relevant portion of my code:
class Rnn(object):
    # ............ [skipping the trivial initialization]
    def recurrence(x_t, h_tm_prev):
        h_t = T.tanh(T.dot(x_t, self.W_xh) +
                     T.dot(h_tm_prev, self.W_hh) + self.b_h)
        return h_t

    h, _ = theano.scan(
        recurrence,
        sequences=self.input,
        outputs_info=self.h0
    )

    y_t = T.dot(h[-1], self.W_hy) + self.b_y
    self.p_y_given_x = T.nnet.softmax(y_t)
    self.y_pred = T.argmax(self.p_y_given_x, axis=1)

    def negative_log_likelihood(self, y):
        return -T.mean(T.log(self.p_y_given_x)[:, y])
def testRnn(dataset, vocabulary, learning_rate=0.01, n_epochs=50):
    # ............ [skipping the trivial initialization]
    index = T.lscalar('index')
    x = T.fmatrix('x')
    y = T.iscalar('y')
    rnn = Rnn(x, n_x=27, n_h=12, n_y=27)
    nll = rnn.negative_log_likelihood(y)
    cost = T.lscalar('cost')
    gparams = [T.grad(cost, param) for param in rnn.params]
    updates = [(param, param - learning_rate * gparam)
               for param, gparam in zip(rnn.params, gparams)]
    train_model = theano.function(
        inputs=[index],
        outputs=nll,
        givens={
            x: train_set_x[index],
            y: train_set_y[index]
        },
    )
    sgd_step = theano.function(
        inputs=[cost],
        outputs=[],
        updates=updates
    )
    done_looping = False
    while (epoch < n_epochs) and (not done_looping):
        epoch += 1
        tr_cost = 0.
        for idx in xrange(n_train_examples):
            tr_cost += train_model(idx)
        # perform sgd step after going through the complete training set
        sgd_step(tr_cost)
For some reason I don't want to pass the complete training data to train_model(..); instead I want to pass individual examples one at a time. The problem is that each call to train_model(..) returns the cost (negative log-likelihood) of that particular example, so I have to aggregate the costs over the complete training set, then take the derivative and perform the weight update in sgd_step(..). With my current implementation I get this error:
theano.gradient.DisconnectedInputError: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: W_xh
I don't understand how to make 'cost' a part of the computational graph (given that in my case I have to wait for it to be aggregated), or whether there is a better/more elegant way to achieve the same thing.
Thanks.
It turns out that you cannot take gradients with respect to a symbolic variable that is not part of the computational graph. I therefore had to change the way data is passed to train_model(..); passing the complete training set instead of individual examples fixed the issue.
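For reference, a sketch of the per-example alternative, reusing the names from the question: differentiate the symbolic nll, which is part of the graph, instead of the fresh disconnected scalar, and fold the update into train_model itself. Note this performs one SGD step per example rather than one per pass over the training set, so it changes the optimization schedule:

import theano
import theano.tensor as T

# take the gradient of the symbolic per-example cost (nll), which *is*
# part of the computational graph
gparams = [T.grad(nll, param) for param in rnn.params]
updates = [(param, param - learning_rate * gparam)
           for param, gparam in zip(rnn.params, gparams)]

train_model = theano.function(
    inputs=[index],
    outputs=nll,
    updates=updates,  # one SGD step per example
    givens={x: train_set_x[index], y: train_set_y[index]},
)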
I've implemented a neural network (a deep autoencoder) on which I'm trying to perform backpropagation. The network consists of sigmoid activation functions and a softmax activation function at the output layer. To calculate the error, I use the cross-entropy error function. The input data are bag-of-words matrices, where the word counts are divided by the length of the document to normalize the data.
I'm using the conjugate gradient method to find a local minimum. My problem is that the error rises during backpropagation; I believe it has something to do with me calculating the gradient wrong.
The code to calculate the error and gradient is given below:
def get_grad_and_error(self, weights, weight_sizes, x):
    weights = self.__convert__(weights, weight_sizes)
    x = append(x, ones((len(x), 1), dtype=float64), axis=1)
    xout, z_values = self.__generate_output_data__(x, weights)
    f = -sum(x[:, :-1] * log(xout))  # Cross-entropy error function

    # Gradient
    number_of_weights = len(weights)
    gradients = []
    delta_k = None
    for i in range(len(weights) - 1, -1, -1):
        if i == number_of_weights - 1:
            delta = (xout - x[:, :-1])
            grad = dot(z_values[i - 1].T, delta)
        elif i == 0:
            delta = dot(delta_k, weights[i + 1].T) * z_values[i] * (1 - z_values[i])
            delta = delta[:, :-1]
            grad = dot(x.T, delta)
        else:
            delta = dot(delta_k, weights[i + 1].T) * z_values[i] * (1 - z_values[i])
            delta = delta[:, :-1]
            grad = dot(z_values[i - 1].T, delta)
        delta_k = delta
        gradients.append(grad)
    gradients.reverse()
    gradients_formatted = []
    for g in gradients:
        gradients_formatted = append(gradients_formatted, reshape(g, (1, len(g) * len(g[0])))[0])
    return f, gradients_formatted
To calculate the output of the network, I use the following method:
def __generate_output_data__(self, x, weight_matrices_added_biases):
    z_values = []
    for i in range(len(weight_matrices_added_biases) - 1):
        if i == 0:
            z = dbn.sigmoid(dot(x, weight_matrices_added_biases[i]))
        else:
            z = dbn.sigmoid(dot(z_values[i - 1], weight_matrices_added_biases[i]))
        z = append(z, ones((len(x), 1), dtype=float64), axis=1)
        z_values.append(z)
    xout = dbn.softmax(dot(z_values[-1], weight_matrices_added_biases[-1]))
    return xout, z_values
I calculate the sigmoid and softmax values as follows:
def sigmoid(x):
    return 1. / (1 + exp(-x))

def softmax(x):
    numerator = exp(x)
    denominator = numerator.sum(axis=1)
    denominator = denominator.reshape((x.shape[0], 1))
    softmax = numerator / denominator
    return softmax
I would really appreciate it if anyone could be of assistance. Please let me know if you need me to elaborate on any of the above. Thanks.
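One standard diagnostic when a hand-derived gradient is suspect is a centered finite-difference check. Below is a sketch, assuming f wraps get_grad_and_error with weight_sizes and x held fixed, so that it maps a flat weight vector to (error, flat_gradient):

import numpy as np

def check_gradient(f, w0, eps=1e-5, n_checks=20, seed=0):
    """Compare an analytic gradient against centered finite differences
    at a few random coordinates; rel_err should be around 1e-7 or smaller."""
    _, grad = f(w0)
    rng = np.random.default_rng(seed)
    for idx in rng.choice(w0.size, size=n_checks, replace=False):
        w_plus, w_minus = w0.copy(), w0.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        numeric = (f(w_plus)[0] - f(w_minus)[0]) / (2 * eps)
        rel_err = abs(numeric - grad[idx]) / max(abs(numeric), abs(grad[idx]), 1e-12)
        print(f"coord {idx}: analytic={grad[idx]:+.6g} "
              f"numeric={numeric:+.6g} rel_err={rel_err:.2e}")

Separately, the softmax above can overflow for large activations; subtracting the per-row maximum from x before exponentiating (mathematically a no-op) is the usual safeguard.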