outputs after back propagation converge to a small value 0.01

outputs after back propagation converge to a small value 0.01 - python

the below code is my main code block
iter_pos=0
max_iter=120
iter_cost=[]
parameters=generate_parameters()
while iter_pos<max_iter:
y_pred = forward_prop(x_train, parameters)
cost_value = cost(y_hat=y_pred, y=multi_class_y_train)
iter_cost.append(cost_value)
delta_para = back_prop(parameters, multi_class_y_train, y_pred)
parameters=update_parameters(parameters,delta_para)
print(iter_pos, cost_value)
iter_pos+=1
now this is my forward prop algorithm
def forward_prop(x_input, parameter):
a=x_input
nodes_values[f'l{1}']=a
for pos in range(1,n_layers):
w = parameter[f'w{pos}']
b=parameter[f'b{pos}']
z=np.dot(w,a)+b
a=sigmoid(z)
nodes_values[f'l{pos+1}']=a
return a
now comes the main back prop I guess I have done mistake here only
def back_prop(parameters, y_true, y_pred):
delta = (nodes_values[f'l{n_layers}']-y_true)
delta_para={}
delta_para[f'delW{n_layers-1}']=np.dot(delta, nodes_values[f'l{n_layers-1}'].T)*lr/m
delta_para[f'delB{n_layers-1}']=(np.sum(delta, axis=1, keepdims=True))*lr/m
for pos in range(n_layers-1,1,-1):
a=nodes_values[f'l{pos}']
x=nodes_values[f'l{pos-1}']
delta=np.dot(parameters[f'w{pos}'].T, delta)*((a)*(1-a))
delta_para[f'delW{pos-1}']=np.dot(delta, x.T)*lr/m
delta_para[f'delB{pos-1}']=np.sum(delta, axis=1, keepdims=True)*lr/m
return delta_para
after getting all my gradients I am going to update them
def update_parameters(parameters, delta_para):
for pos in range(n_layers-1,0,-1):
parameters[f'w{pos}']-=delta_para[f'delW{pos}']
parameters[f'b{pos}']-=delta_para[f'delB{pos}']
return parameters
these are my main code blocks if required I will provide my complete code, please someone suggest what might be the issue

As discussed in the comments, your issue is with using sigmoid on the final layer instead of softmax on a multi-class mutually-exclusive classification problem. A quick fix will be to just import the softmax function from scipy.special and use it in the last layer:
def forward_prop(x_input, parameter):
a=x_input
nodes_values[f'l{1}']=a
for pos in range(1,n_layers):
w = parameter[f'w{pos}']
b=parameter[f'b{pos}']
z=np.dot(w,a)+b
# Use softmax if this is the last layer
if pos == n_layers - 1:
a = softmax(z)
# Use your choice of activation function otherwise (sigmoid in your case)
else:
a=sigmoid(z, axis=0)
nodes_values[f'l{pos+1}']=a
return a
You can of course, define your own softmax as its pretty simple:
def softmax(z, axis=0):
exp = np.exp(z)
return exp / np.sum(exp, axis=0)

Related

How to override gradient for the nonlinearity functions in lasagne?

I have a model, for which i need to compute the gradients of output w.r.t the model's input. But I want to apply some custom gradients for some of the nonlinearity functions applied on some of the model's layers. So i tried the idea explained here, which computes the nonlinear rectifier (RELU) in the forward pass but modifies the gradients of Relu in the backward pass. I added the following two classes:
The helper class that allows us to replace a nonlinearity with an Op
that has the same output, but a custom gradient
class ModifiedBackprop(object):
def __init__(self, nonlinearity):
self.nonlinearity = nonlinearity
self.ops = {} # memoizes an OpFromGraph instance per tensor type
def __call__(self, x):
# OpFromGraph is oblique to Theano optimizations, so we need to move
# things to GPU ourselves if needed.
if theano.sandbox.cuda.cuda_enabled:
maybe_to_gpu = theano.sandbox.cuda.as_cuda_ndarray_variable
else:
maybe_to_gpu = lambda x: x
# We move the input to GPU if needed.
x = maybe_to_gpu(x)
# We note the tensor type of the input variable to the nonlinearity
# (mainly dimensionality and dtype); we need to create a fitting Op.
tensor_type = x.type
# If we did not create a suitable Op yet, this is the time to do so.
if tensor_type not in self.ops:
# For the graph, we create an input variable of the correct type:
inp = tensor_type()
# We pass it through the nonlinearity (and move to GPU if needed).
outp = maybe_to_gpu(self.nonlinearity(inp))
# Then we fix the forward expression...
op = theano.OpFromGraph([inp], [outp])
# ...and replace the gradient with our own (defined in a subclass).
op.grad = self.grad
# Finally, we memoize the new Op
self.ops[tensor_type] = op
# And apply the memoized Op to the input we got.
return self.ops[tensor_type](x)
The subclass that does guided backpropagation through a nonlinearity:
class GuidedBackprop(ModifiedBackprop):
def grad(self, inputs, out_grads):
(inp,) = inputs
(grd,) = out_grads
dtype = inp.dtype
print('It works')
return (grd * (inp > 0).astype(dtype) * (grd > 0).astype(dtype),)
Then i used them in my code as follows:
import lasagne as nn
model_in = T.tensor3()
# model_in = net['input'].input_var
nn.layers.set_all_param_values(net['l_out'], model['param_values'])
relu = nn.nonlinearities.rectify
relu_layers = [layer for layer in
nn.layers.get_all_layers(net['l_out']) if getattr(layer,
'nonlinearity', None) is relu]
modded_relu = GuidedBackprop(relu)
for layer in relu_layers:
layer.nonlinearity = modded_relu
prop = nn.layers.get_output(
net['l_out'], model_in, deterministic=True)
for sample in range(ini, batch_len):
model_out = prop[sample, 'z'] # get prop for label 'z'
gradients = theano.gradient.jacobian(model_out, wrt=model_in)
# gradients = theano.grad(model_out, wrt=model_in)
get_gradients = theano.function(inputs=[model_in],
outputs=gradients)
grads = get_gradients(X_batch) # gradient dimension: X_batch == model_in(64, 20, 32)
grads = np.array(grads)
grads = grads[sample]
Now when i run the code, it works without any error, and the shape of the output is also correct. But that's because it executes the default theano.grad function and not the one supposed to override it. In other words, the grad() function in the class GuidedBackprop never been invoked.
I can't understand what is the issue?
is there's a solution?
If this is an unresolved issue, is there's an implementation for a Theano Op that can achieve such a functionality or some other way to override gradient for specific nonlinearity functions applied on some of the model's layers?

Are you try to set it back the value of model output into model layer input, all gradients calculation
group_1_ShoryuKen_Left = tf.constant([ 0,0,0,0,0,1,0,0,0,0,0,0, 0,0,0,0,0,1,0,1,0,0,0,0, 0,0,0,0,0,0,0,1,0,0,0,0, 0,0,0,0,0,0,0,0,0,1,0,0 ], shape=(1, 1, 48), dtype=tf.float32)
## layer_2 = tf.keras.layers.Dense(256, kernel_initializer=tf.constant_initializer(1.))
layer_2 = tf.keras.layers.LSTM(32, kernel_initializer=tf.constant_initializer(1.))
b_out = layer_2(group_1_ShoryuKen_Left)
layer_2.set_weights(layer_1.get_weights())

Neural network backprop not fully training

I have this neural network that I've trained seen bellow, it works, or at least appears to work, but the problem is with the training. I'm trying to train it to act as an OR gate, but it never seems to get there, the output tends to looks like this:
prior to training:
[[0.50181624]
[0.50183743]
[0.50180414]
[0.50182533]]
post training:
[[0.69641759]
[0.754652 ]
[0.75447178]
[0.79431198]]
expected output:
[[0]
[1]
[1]
[1]]
I have this loss graph:
Its strange it appears to be training, but at the same time not quite getting to the expected output. I know that it would never really achieve the 0s and 1s, but at the same time I expect it to manage and get something a little bit closer to the expected output.
I had some issues trying to figure out how to back prop the error as I wanted to make this network have any number of hidden layers, so I stored the local gradient in a layer, along side the weights, and sent the error from the end back.
The main functions I suspect are the culprits are NeuralNetwork.train and both forward methods.
import sys
import math
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
class NeuralNetwork:
class __Layer:
def __init__(self,args):
self.__epsilon = 1e-6
self.localGrad = 0
self.__weights = np.random.randn(
args["previousLayerHeight"],
args["height"]
)*0.01
self.__biases = np.zeros(
(args["biasHeight"],1)
)
def __str__(self):
return str(self.__weights)
def forward(self,X):
a = np.dot(X, self.__weights) + self.__biases
self.localGrad = np.dot(X.T,self.__sigmoidPrime(a))
return self.__sigmoid(a)
def adjustWeights(self, err):
self.__weights -= (err * self.__epsilon)
def __sigmoid(self, z):
return 1/(1 + np.exp(-z))
def __sigmoidPrime(self, a):
return self.__sigmoid(a)*(1 - self.__sigmoid(a))
def __init__(self,args):
self.__inputDimensions = args["inputDimensions"]
self.__outputDimensions = args["outputDimensions"]
self.__hiddenDimensions = args["hiddenDimensions"]
self.__layers = []
self.__constructLayers()
def __constructLayers(self):
self.__layers.append(
self.__Layer(
{
"biasHeight": self.__inputDimensions[0],
"previousLayerHeight": self.__inputDimensions[1],
"height": self.__hiddenDimensions[0][0]
if len(self.__hiddenDimensions) > 0
else self.__outputDimensions[0]
}
)
)
for i in range(len(self.__hiddenDimensions)):
self.__layers.append(
self.__Layer(
{
"biasHeight": self.__hiddenDimensions[i + 1][0]
if i + 1 < len(self.__hiddenDimensions)
else self.__outputDimensions[0],
"previousLayerHeight": self.__hiddenDimensions[i][0],
"height": self.__hiddenDimensions[i + 1][0]
if i + 1 < len(self.__hiddenDimensions)
else self.__outputDimensions[0]
}
)
)
def forward(self,X):
out = self.__layers[0].forward(X)
for i in range(len(self.__layers) - 1):
out = self.__layers[i+1].forward(out)
return out
def train(self,X,Y,loss,epoch=5000000):
for i in range(epoch):
YHat = self.forward(X)
delta = -(Y-YHat)
loss.append(sum(Y-YHat))
err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
err.shape = (self.__hiddenDimensions[-1][0],1)
self.__layers[-1].adjustWeights(err)
i=0
for l in reversed(self.__layers[:-1]):
err = np.dot(l.localGrad, err)
l.adjustWeights(err)
i += 1
def printLayers(self):
print("Layers:\n")
for l in self.__layers:
print(l)
print("\n")
def main(args):
X = np.array([[x,y] for x,y in product([0,1],repeat=2)])
Y = np.array([[0],[1],[1],[1]])
nn = NeuralNetwork(
{
#(height,width)
"inputDimensions": (4,2),
"outputDimensions": (1,1),
"hiddenDimensions":[
(6,1)
]
}
)
print("input:\n\n",X,"\n")
print("expected output:\n\n",Y,"\n")
nn.printLayers()
print("prior to training:\n\n",nn.forward(X), "\n")
loss = []
nn.train(X,Y,loss)
print("post training:\n\n",nn.forward(X), "\n")
nn.printLayers()
fig,ax = plt.subplots()
x = np.array([x for x in range(5000000)])
loss = np.array(loss)
ax.plot(x,loss)
ax.set(xlabel="epoch",ylabel="loss",title="logic gate training")
plt.show()
if(__name__=="__main__"):
main(sys.argv[1:])
Could someone please point out what I'm doing wrong here, I strongly suspect it has to do with the way I'm dealing with matrices but at the same time I don't have the slightest idea what's going on.
Thanks for taking the time to read my question, and taking the time to respond (if relevant).
edit:
Actually quite a lot is wrong with this but I'm still a bit confused over how to fix it. Although the loss graph looks like its training, and it kind of is, the math I've done above is wrong.
Look at the training function.
def train(self,X,Y,loss,epoch=5000000):
for i in range(epoch):
YHat = self.forward(X)
delta = -(Y-YHat)
loss.append(sum(Y-YHat))
err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
err.shape = (self.__hiddenDimensions[-1][0],1)
self.__layers[-1].adjustWeights(err)
i=0
for l in reversed(self.__layers[:-1]):
err = np.dot(l.localGrad, err)
l.adjustWeights(err)
i += 1
Note how I get delta = -(Y-Yhat) and then dot product it with the "local gradient" of the last layer. The "local gradient" is the local W gradient.
def forward(self,X):
a = np.dot(X, self.__weights) + self.__biases
self.localGrad = np.dot(X.T,self.__sigmoidPrime(a))
return self.__sigmoid(a)
I'm skipping a step in the chain rule. I should really be multiplying by W* sigprime(XW + b) first as that's the local gradient of X, then by the local W gradient. I tried that, but I'm still getting issues, here is the new forward method (note the __init__ for layers needs to be initialised for the new vars, and I changed the activation function to tanh)
def forward(self, X):
a = np.dot(X, self.__weights) + self.__biases
self.localPartialGrad = self.__tanhPrime(a)
self.localWGrad = np.dot(X.T, self.localPartialGrad)
self.localXGrad = np.dot(self.localPartialGrad,self.__weights.T)
return self.__tanh(a)
and updated the training method to look something like this:
def train(self, X, Y, loss, epoch=5000):
for e in range(epoch):
Yhat = self.forward(X)
err = -(Y-Yhat)
loss.append(sum(err))
print("loss:\n",sum(err))
for l in self.__layers[::-1]:
l.adjustWeights(err)
if(l != self.__layers[0]):
err = np.multiply(err,l.localPartialGrad)
err = np.multiply(err,l.localXGrad)
The new graphs I'm getting are all over the place, I have no idea what's going on. Here is the final bit of code I changed:
def adjustWeights(self, err):
perr = np.multiply(err, self.localPartialGrad)
werr = np.sum(np.dot(self.__weights,perr.T),axis=1)
werr = werr * self.__epsilon
werr.shape = (self.__weights.shape[0],1)
self.__weights = self.__weights - werr

Your network is learning, as can be seen from the loss chart, so backprop implementation is correct (congrats!). The main problem with this particular architecture is the choice of the activation function: sigmoid. I have replaced sigmoid with tanh and it works much better instantly.
From this discussion on CV.SE:
There are two reasons for that choice (assuming you have normalized
your data, and this is very important):
Having stronger gradients: since data is centered around 0, the
derivatives are higher. To see this, calculate the derivative of the
tanh function and notice that input values are in the range [0,1]. The
range of the tanh function is [-1,1] and that of the sigmoid function
is [0,1]
Avoiding bias in the gradients. This is explained very well in the
paper, and it is worth reading it to understand these issues.
Though I'm sure sigmoid-based NN can be trained as well, looks like it's much more sensitive to input values (note that they are not zero-centered), because the activation itself is not zero-centered. tanh is better than sigmoid by all means, so a simpler approach is just use that activation function.
The key change is this:
def __tanh(self, z):
return np.tanh(z)
def __tanhPrime(self, a):
return 1 - self.__tanh(a) ** 2
... instead of __sigmoid and __sigmoidPrime.
I have also tuned hyperparameters a little bit, so that the network now learns in 100k epochs, instead of 5m:
prior to training:
[[ 0. ]
[-0.00056925]
[-0.00044885]
[-0.00101794]]
post training:
[[0. ]
[0.97335842]
[0.97340917]
[0.98332273]]
A complete code is in this gist.

Well I'm an idiot. I was right about being wrong but I was wrong about how wrong I was. Let me explain.
Within the backwards training method I got the last layer trained correctly, but all layers after that wasn't trained correctly, hence why the above network was coming up with a result, it was indeed training, but only one layer.
So what did i do wrong? Well I was only multiplying by the local graident of the Weights with respect to the output, and thus the chain rule was partially correct.
Lets say the loss function was this:
t = Y-X2
loss = 1/2*(t)^2
a2 = X1W2 + b
X2 = activation(a2)
a1 = X0W1 + b
X1 = activation(a1)
We know that the the derivative of loss with respect to W2 would be -(Y-X2)*X1. This was done in the first part of my training function:
def train(self,X,Y,loss,epoch=5000000):
for i in range(epoch):
#First part
YHat = self.forward(X)
delta = -(Y-YHat)
loss.append(sum(Y-YHat))
err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
err.shape = (self.__hiddenDimensions[-1][0],1)
self.__layers[-1].adjustWeights(err)
i=0
#Second part
for l in reversed(self.__layers[:-1]):
err = np.dot(l.localGrad, err)
l.adjustWeights(err)
i += 1
However the second part is where I screwed up. In order to calculate the loss with respect to W1, I must multiply the original error -(Y-X2) by W2 as W2 is the local X Gradient of the last layer, and due to the chain rule this must be done first. Then I could multiply by the local W gradient (X1) to get the loss with respect to W1. I failed to do the multiplication of the local X gradient first, so the last layer was indeed training, but all layers after that had an error that magnified as the layer increased.
To solve this I updated the train method:
def train(self,X,Y,loss,epoch=10000):
for i in range(epoch):
YHat = self.forward(X)
err = -(Y-YHat)
loss.append(sum(Y-YHat))
werr = np.sum(np.dot(self.__layers[-1].localWGrad,err.T), axis=1)
werr.shape = (self.__hiddenDimensions[-1][0],1)
self.__layers[-1].adjustWeights(werr)
for l in reversed(self.__layers[:-1]):
err = np.multiply(err, l.localXGrad)
werr = np.sum(np.dot(l.weights,err.T),axis=1)
l.adjustWeights(werr)
Now the loss graph I got looks like this:

Inverting Gradients in Keras

I'm trying to port the BoundingLayer function from this file to the DDPG.py agent in keras-rl but I'm having some trouble with the implementation.
I modified the get_gradients(loss, params) method in DDPG.py to add this:
action_bounds = [-30, 50]
inverted_grads = []
for g,p in zip(modified_grads, params):
is_above_upper_bound = K.greater(p, K.constant(action_bounds[1], dtype='float32'))
is_under_lower_bound = K.less(p, K.constant(action_bounds[0], dtype='float32'))
is_gradient_positive = K.greater(g, K.constant(0, dtype='float32'))
is_gradient_negative = K.less(g, K.constant(0, dtype='float32'))
invert_gradient = tf.logical_or(
tf.logical_and(is_above_upper_bound, is_gradient_negative),
tf.logical_and(is_under_lower_bound, is_gradient_positive)
)
inverted_grads.extend(K.switch(invert_gradient, -g, g))
modified_grads = inverted_grads[:]
But I get an error about the shape:
ValueError: Shape must be rank 0 but is rank 2 for 'cond/Switch' (op: 'Switch') with input shapes: [2,400], [2,400].

keras-rl "get_gradients" function uses gradients calculated with a combined actor-critic model, but you need the gradient of the critic output wrt the action input to apply the inverting gradients feature.
I've recently implemented it on a RDPG prototype I'm working on, using keras-rl. Still testing, the code can be optimized and is not bug free for sure, but I've put the inverting gradient to work by modifying some keras-rl lines of code. In order to modify the gradient of the critic output wrt the action input, I've followed the original formula to compute the actor gradient, with the help of this great post from Patrick Emami: http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html.
I'm putting here the entire "compile" function, redefined in a class that inherits from "DDPAgent", where the inverting gradient feature is implemented.
def compile(self, optimizer, metrics=[]):
metrics += [mean_q]
if type(optimizer) in (list, tuple):
if len(optimizer) != 2:
raise ValueError('More than two optimizers provided. Please only provide a maximum of two optimizers, the first one for the actor and the second one for the critic.')
actor_optimizer, critic_optimizer = optimizer
else:
actor_optimizer = optimizer
critic_optimizer = clone_optimizer(optimizer)
if type(actor_optimizer) is str:
actor_optimizer = optimizers.get(actor_optimizer)
if type(critic_optimizer) is str:
critic_optimizer = optimizers.get(critic_optimizer)
assert actor_optimizer != critic_optimizer
if len(metrics) == 2 and hasattr(metrics[0], '__len__') and hasattr(metrics[1], '__len__'):
actor_metrics, critic_metrics = metrics
else:
actor_metrics = critic_metrics = metrics
def clipped_error(y_true, y_pred):
return K.mean(huber_loss(y_true, y_pred, self.delta_clip), axis=-1)
# Compile target networks. We only use them in feed-forward mode, hence we can pass any
# optimizer and loss since we never use it anyway.
self.target_actor = clone_model(self.actor, self.custom_model_objects)
self.target_actor.compile(optimizer='sgd', loss='mse')
self.target_critic = clone_model(self.critic, self.custom_model_objects)
self.target_critic.compile(optimizer='sgd', loss='mse')
# We also compile the actor. We never optimize the actor using Keras but instead compute
# the policy gradient ourselves. However, we need the actor in feed-forward mode, hence
# we also compile it with any optimzer and
self.actor.compile(optimizer='sgd', loss='mse')
# Compile the critic.
if self.target_model_update < 1.:
# We use the `AdditionalUpdatesOptimizer` to efficiently soft-update the target model.
critic_updates = get_soft_target_model_updates(self.target_critic, self.critic, self.target_model_update)
critic_optimizer = AdditionalUpdatesOptimizer(critic_optimizer, critic_updates)
self.critic.compile(optimizer=critic_optimizer, loss=clipped_error, metrics=critic_metrics)
clipnorm = getattr(actor_optimizer, 'clipnorm', 0.)
clipvalue = getattr(actor_optimizer, 'clipvalue', 0.)
critic_gradients_wrt_action_input = tf.gradients(self.critic.output, self.critic_action_input)
critic_gradients_wrt_action_input = [g / float(self.batch_size) for g in critic_gradients_wrt_action_input] # since TF sums over the batch
action_bounds = [(-1.,1.) for i in range(self.nb_actions)]
def calculate_inverted_gradient():
"""
Applies "inverting gradient" feature to the action-value gradients.
"""
gradient_wrt_action = -critic_gradients_wrt_action_input[0]
inverted_gradients = []
for n in range(self.batch_size):
inverted_gradient = []
for i in range(gradient_wrt_action[n].shape[0].value):
action = self.critic_action_input[n][i]
is_gradient_negative = K.less(gradient_wrt_action[n][i], K.constant(0, dtype='float32'))
adjust_for_upper_bound = gradient_wrt_action[n][i] * ((action_bounds[i][1] - action) / (action_bounds[i][1] - action_bounds[i][0]))
adjust_for_lower_bound = gradient_wrt_action[n][i] * ((action - action_bounds[i][0]) / (action_bounds[i][1] - action_bounds[i][0]))
modified_gradient = K.switch(is_gradient_negative, adjust_for_upper_bound, adjust_for_lower_bound)
inverted_gradient.append( modified_gradient )
inverted_gradients.append(inverted_gradient)
gradient_wrt_action = tf.stack(inverted_gradients)
return gradient_wrt_action
actor_gradients_wrt_weights = tf.gradients(self.actor.output, self.actor.trainable_weights, grad_ys=calculate_inverted_gradient())
actor_gradients_wrt_weights = [g / float(self.batch_size) for g in actor_gradients_wrt_weights] # since TF sums over the batch
def get_gradients(loss, params):
""" Used by the actor optimizer.
Returns the gradients to train the actor.
These gradients are obtained by multiplying the gradients of the actor output w.r.t. its weights
with the gradients of the critic output w.r.t. its action input. """
# Aplly clipping if defined
modified_grads = [g for g in actor_gradients_wrt_weights]
if clipnorm > 0.:
norm = K.sqrt(sum([K.sum(K.square(g)) for g in modified_grads]))
modified_grads = [optimizers.clip_norm(g, clipnorm, norm) for g in modified_grads]
if clipvalue > 0.:
modified_grads = [K.clip(g, -clipvalue, clipvalue) for g in modified_grads]
return modified_grads
actor_optimizer.get_gradients = get_gradients
# get_updates is the optimizer function that changes the weights of the network
updates = actor_optimizer.get_updates(self.actor.trainable_weights, self.actor.constraints, None)
if self.target_model_update < 1.:
# Include soft target model updates.
updates += get_soft_target_model_updates(self.target_actor, self.actor, self.target_model_update)
updates += self.actor.updates # include other updates of the actor, e.g. for BN
# Finally, combine it all into a callable function.
# The inputs will be all the necessary placeholders to compute the gradients (actor and critic inputs)
inputs = self.actor.inputs[:] + [self.critic_action_input, self.critic_history_input]
self.actor_train_fn = K.function(inputs, [self.actor.output], updates=updates)
self.actor_optimizer = actor_optimizer
self.compiled = True
When training the actor, you should now pass 3 inputs instead of 2: the observation inputs + the action input (with a prediction from the actor network), so you must also modify the "backward" function. In my case:
...
if self.episode > self.nb_steps_warmup_actor:
action = self.actor.predict_on_batch(history_batch)
inputs = [history_batch, action, history_batch]
actor_train_result = self.actor_train_fn(inputs)
action_values = actor_train_result[0]
assert action_values.shape == (self.batch_size, self.nb_actions)
...
After that you can have your actor with a linear activation in the output.

Custom loss function implementation

I'm trying to implement a new loss function of my own.
When I tried to debug it (or print in it) I've noticed it is called only once at the model creating section of the code.
How can I know what y_pred and y_true contains (shapes, data etc..) if I cannot run my code into this function while fitting the model?
I wrote this loss function:
def my_loss(y_true, y_pred):
# run over the sequence, jump by 3
# calc the label
# if the label incorrect punish
y_pred = K.reshape(y_pred, (1, 88, 3))
y_pred = K.argmax(y_pred, axis=1)
zero_count = K.sum(K.clip(y_pred, 0, 0))
one_count = K.sum(K.clip(y_pred, 1, 1))
two_count = K.sum(K.clip(y_pred, 2, 2))
zero_punish = 1 - zero_count / K.count_params(y_true)
one_punish = 1- one_count/ K.count_params(y_true)
two_punish = 1- two_count/ K.count_params(y_true)
false_arr = K.not_equal(y_true, y_pred)
mask0 = K.equal(y_true, K.zeros_like(y_pred))
mask0_miss = K.dot(false_arr, mask0) * zero_punish
mask1 = K.equal(y_true, K.ones_like(y_pred))
mask1_miss = K.dot(false_arr, mask1) * one_punish
mask2 = K.equal(y_true, K.zeros_like(y_pred)+2)
mask2_miss = K.dot(false_arr, mask2) * two_punish
return K.sum(mask0_miss) + K.sum(mask1_miss) + K.sum(mask2_miss)
It fails on:
theano.gof.fg.MissingInputError: A variable that is an input to the graph was
neither provided as an input to the function nor given a value. A chain of
variables leading from this input to an output is [/dense_1_target, Shape.0].
This chain may not be unique
Backtrace when the variable is created:
How can I fix it?

You have to understand that Theano is a symbolic language. For example, when we define the following loss function in Keras:
def myLossFn(y_true, y_pred):
return K.mean(K.abs(y_pred - y_true), axis=-1)
Theano is just making a symbolic rule in a computational graph, which would be executed when it gets values i.e. when you train the model with some mini-batches.
As far as your question on how to debug your model goes, you can use theano.function for that. Now, you want to know if your loss calculation is correct. You do the following.
You can implement the python/numpy version of your loss function. Pass two random vectors to your numpy-loss-function and get a number. To verify if theano gives nearly identical result, define something as follows.
import theano
from theano import tensor as T
from keras import backend as K
Y_true = T.frow('Y_true')
Y_pred = T.fcol('Y_pred')
out = K.mean(K.abs(Y_pred - Y_true), axis=-1)
f = theano.function([Y_true, Y_pred], out)
# creating some values
y_true = np.random.random((10,))
y_pred = np.random.random((10,))
numpy_loss_result = np.mean(np.abs(y_true-y_pred))
theano_loss_result = f(y_true, y_pred)
# check if both are close enough
print numpy_loss_result-theano_loss_result # should be less than 1e-5
Basically, theano.function is a way to put values and evaluate those symbolic expressions. I hope this helps.

Issue with computing gradient for Rnn in Theano

I am playing with vanilla Rnn's, training with gradient descent (non-batch version), and I am having an issue with the gradient computation for the (scalar) cost; here's the relevant portion of my code:
class Rnn(object):
# ............ [skipping the trivial initialization]
def recurrence(x_t, h_tm_prev):
h_t = T.tanh(T.dot(x_t, self.W_xh) +
T.dot(h_tm_prev, self.W_hh) + self.b_h)
return h_t
h, _ = theano.scan(
recurrence,
sequences=self.input,
outputs_info=self.h0
)
y_t = T.dot(h[-1], self.W_hy) + self.b_y
self.p_y_given_x = T.nnet.softmax(y_t)
self.y_pred = T.argmax(self.p_y_given_x, axis=1)
def negative_log_likelihood(self, y):
return -T.mean(T.log(self.p_y_given_x)[:, y])
def testRnn(dataset, vocabulary, learning_rate=0.01, n_epochs=50):
# ............ [skipping the trivial initialization]
index = T.lscalar('index')
x = T.fmatrix('x')
y = T.iscalar('y')
rnn = Rnn(x, n_x=27, n_h=12, n_y=27)
nll = rnn.negative_log_likelihood(y)
cost = T.lscalar('cost')
gparams = [T.grad(cost, param) for param in rnn.params]
updates = [(param, param - learning_rate * gparam)
for param, gparam in zip(rnn.params, gparams)
]
train_model = theano.function(
inputs=[index],
outputs=nll,
givens={
x: train_set_x[index],
y: train_set_y[index]
},
)
sgd_step = theano.function(
inputs=[cost],
outputs=[],
updates=updates
)
done_looping = False
while(epoch < n_epochs) and (not done_looping):
epoch += 1
tr_cost = 0.
for idx in xrange(n_train_examples):
tr_cost += train_model(idx)
# perform sgd step after going through the complete training set
sgd_step(tr_cost)
For some reasons I don't want to pass complete (training) data to the train_model(..), instead I want to pass individual examples at a time. Now the problem is that each call to train_model(..) returns me the cost (negative log-likelihood) of that particular example and then I have to aggregate all the cost (of the complete (training) data-set) and then take derivative and perform the relevant update to the weight parameters in the sgd_step(..), and for obvious reasons with my current implementation I am getting this error: theano.gradient.DisconnectedInputError: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: W_xh. Now I don't understand how to make 'cost' a part of computational graph (as in my case when I have to wait for it to be aggregated) or is there any better/elegant way to achieve the same thing ?
Thanks.

It turns out one cannot bring the symbolic variable into Theano graph if they are not part of computational graph. Therefore, I have to change the way to pass data to the train_model(..); passing the complete training data instead of individual example fix the issue.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

outputs after back propagation converge to a small value 0.01 - python

Related

How to override gradient for the nonlinearity functions in lasagne?

Neural network backprop not fully training

Inverting Gradients in Keras

Custom loss function implementation

Issue with computing gradient for Rnn in Theano

Categories

Resources