How to update weights in keras for reinforcement learning?

I am working on a reinforcement learning program and I am using this article as the reference. I am using Python with Keras (Theano backend) to build the neural network, and the pseudocode I am following is:
1. Do a feedforward pass for the current state s to get predicted Q-values for all actions.
2. Do a feedforward pass for the next state s' and calculate the maximum over all network outputs, max_a' Q(s', a').
3. Set the Q-value target for the action taken to r + γ max_a' Q(s', a') (use the max calculated in step 2). For all other actions, set the Q-value target to the same value originally returned from step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
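The target construction in the steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper name build_target, the discount factor 0.9 and the Q-values are all made up, and the network is assumed to output one Q-value per action.

```python
GAMMA = 0.9  # hypothetical discount factor, not taken from the question


def build_target(q_current, q_next, action, reward, gamma=GAMMA):
    """Return the per-action training target for a single transition."""
    # Start from the network's own predictions so that every action
    # other than the one taken contributes zero error.
    target = list(q_current)
    # Overwrite only the taken action with r + gamma * max_a' Q(s', a').
    target[action] = reward + gamma * max(q_next)
    return target


q_current = [0.10, 0.70, 0.30]  # made-up Q(s, a) for three actions
q_next = [0.20, 0.50, 0.40]     # made-up Q(s', a')
target = build_target(q_current, q_next, action=1, reward=1.0)
print(target)
```

The resulting list is what would be passed as the training target for the state s; only the entry for the chosen action differs from the network's prediction.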
The loss function is
$$L = \frac{1}{2}\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]^2$$
where my reward is r = +1, max_a' Q(s', a') = 0.8375 and Q(s, a) = 0.6892.
With γ = 1, my L would be 1/2 * (1 + 0.8375 - 0.6892)^2 = 0.659296445.
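The arithmetic above can be checked directly. The sketch below assumes no discounting (gamma = 1), since that is what the quoted numbers imply; the variable names are only illustrative.

```python
reward = 1.0
max_q_next = 0.8375  # max over a' of Q(s', a')
q_sa = 0.6892        # Q(s, a)
gamma = 1.0          # assumption: the quoted arithmetic applies no discount

# Temporal-difference error, then half its square.
td_error = reward + gamma * max_q_next - q_sa
loss = 0.5 * td_error ** 2
print(td_error, loss)
```

The loss agrees with the 0.659296445 figure quoted in the question.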
Now, how should I update my model's weights using the above loss value if my model structure is this:
model = Sequential()
model.add(Dense(150, input_dim=150))
model.add(Dense(10))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='mse', optimizer='adam')

Assuming the NN is modeling the Q-value function, you would just pass the target to the network, e.g.:
model.train_on_batch(state_action_vector, target)
Where state_action_vector is some preprocessed vector representing the state-action input to your network. Since your network is using an MSE loss function, it will compute the prediction term using the state-action on the forward pass and then update the weights according to your target.

Related

How to implement gradient ascent in a Keras DQN

I have built a reinforcement learning DQN with variable-length sequences as inputs, and positive and negative rewards calculated for actions. Some problem with my DQN model in Keras means that, although the model runs, average rewards decrease over time, over single and multiple cycles of epsilon. This does not change even after a significant period of training.
My thinking is that this is due to using MeanSquaredError in Keras as the loss function (minimising error). So I am trying to implement gradient ascent (to maximise reward). How can I do this in Keras? My current model is:
model = Sequential()
inp = (env.NUM_TIMEPERIODS, env.NUM_FEATURES)
model.add(Input(shape=inp))  # a shape tuple (integers), not including the batch size
model.add(Masking(mask_value=0., input_shape=inp))
model.add(LSTM(env.NUM_FEATURES, input_shape=inp, return_sequences=True))
model.add(LSTM(env.NUM_FEATURES))
model.add(Dense(env.NUM_FEATURES))
model.add(Dense(4))
model.compile(loss='mse',
              optimizer=Adam(lr=LEARNING_RATE, decay=DECAY),
              metrics=[tf.keras.losses.MeanSquaredError()])
In trying to implement gradient ascent, by 'flipping' the gradient (as negative or inverse loss?), I have tried various loss definitions:
loss=-'mse'
loss=-tf.keras.losses.MeanSquaredError()
loss=1/tf.keras.losses.MeanSquaredError()
but these all generate "bad operand [for unary]" errors.
How can I adapt my current Keras model to maximise rewards?
Or is gradient ascent not even the problem? Could it be some issue with the action policy?

Writing a custom loss function
Here is the loss function you want
#tf.function
def positive_mse(y_true, y_pred):
    return -1 * tf.keras.losses.MSE(y_true, y_pred)
And then your compile line becomes
model.compile(loss=positive_mse,
              optimizer=Adam(lr=LEARNING_RATE, decay=DECAY),
              metrics=[tf.keras.losses.MeanSquaredError()])
Please note: use loss=positive_mse and not loss=positive_mse(). That's not a typo. You need to pass the function itself, not the result of calling it.
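One caution before relying on a negated MSE: minimising -MSE pushes predictions away from their targets rather than towards higher reward. The toy sketch below is not the asker's model; it is a single made-up prediction updated by plain gradient descent, with hypothetical names throughout, and it only illustrates the direction of the update. In a DQN, reward maximisation normally enters through the Bellman targets, with the loss still minimised.

```python
def mse_grad(pred, target):
    # Derivative of (pred - target)**2 with respect to pred.
    return 2.0 * (pred - target)


def run(sign, pred=0.0, target=1.0, lr=0.1, steps=20):
    # Plain gradient descent on sign * MSE for one scalar prediction.
    for _ in range(steps):
        pred -= lr * sign * mse_grad(pred, target)
    return pred


pred_min = run(+1.0)  # descent on  MSE: prediction approaches the target
pred_max = run(-1.0)  # descent on -MSE: prediction runs away from the target
print(pred_min, pred_max)
```

If average reward keeps falling, the exploration schedule, target computation and action selection may be worth checking before changing the sign of the loss.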

How can I disable gradient updates for some modules in autograd backpropagation?

I'm building a multi-model neural network for reinforcement learning that includes an action network, a world model network, and a critic. The idea is to train the world model to emulate whatever simulation you are trying to master based on input from the action network and the previous state, to train the critic to maximize the Bellman equation (total reinforcement over time) based on the world model output, and then to backpropagate the critic value through the world model to provide gradient targets for training the actions. So - from some state, the action network outputs an action which is fed into the model to generate the next state, and that state feeds into the critic network for evaluation against some goal state.
For all this to work, I must use 3 separate loss functions, one for each network, and they all add something to the gradients in one or more networks, but they can be in conflict. For example - to train the world model I use a target from an environmental simulation, and for the critic I use a target of the current state reward + discount * next state forecast value. However, to train the actor I just use the negative critic value as a loss and backpropagate all the way through all three models to calibrate the best action.
I can make this work without any batching by zeroing out gradients incrementally, but that is inefficient and doesn't let me accumulate gradients for any kind of "time-series batching" optimizer update step. Each model has its own trainable parameters, but the execution graph flows through all three networks. So inside the calibration loop after firing the networks in sequence:
...
if self.actor.calibrating:
    self.actor.optimizer.zero_grad()
    #Pick loss For maximizing the value of all actions
    loss = -self.critic.value
    #Backpropagate through all three networks to train actor output
    #How do I stop the critic and model networks from incrementing their gradient values?
    loss.backward(retain_graph=True)
    self.actor.optimizer.step()
if self.model.calibrating:
    self.model.optimizer.zero_grad()
    #Reduce loss for ambiguous actions
    loss = self.model.get_loss() * self.actor.get_confidence()**2
    #How can I block this from backpropagating through action network?
    loss.backward(retain_graph=True)
    self.model.optimizer.step()
if self.critic.calibrating:
    self.critic.optimizer.zero_grad()
    #Reduce loss for ambiguous actions
    loss = self.critic.get_loss(self.goal) * self.actor.get_confidence()**2
    #How do I stop this from backpropagating through the model and action networks?
    loss.backward(retain_graph=True)
    self.critic.optimizer.step()
...
Finally - my question is in two parts:
1. How can I temporarily stop loss.backward() at a given layer without detaching it forever?
2. How can I block loss.backward() from updating some gradients where I'm just flowing through a model to get gradients for another model?

Got this figured out thanks to a suggestion from a colleague to try the requires_grad setting. (I had assumed that would break the execution graph, but it doesn't.)
So - to answer my own two questions:
1. If you calibrate the chained models in the correct order, you can detach them one at a time so that loss.backward() doesn't run over models that aren't needed. I was thinking that this would break the graph but... this is PyTorch, not TensorFlow 1.x, and the graph is regenerated on every forward pass anyway. Silly me for missing this yesterday.
2. If you set requires_grad to False for a model (or a layer or an individual weight) then loss.backward() will STILL traverse the entire connected graph, but it will leave those individual gradients as they were while still setting any gradients earlier in the graph. Exactly what I wanted.
This code works to minimize the execution of unnecessary graph traversals and gradient updates. I still need to refactor it for staggered updates over time so that it can accumulate gradients for several cycles before stepping the optimizers, but this definitely works as intended.
#Step through all models in a chain to create gradient paths from critic back through the world model, to the actor.
def step(self):
    #Get the current state from the simulation
    state = self.world.state
    #Fire the actor to select a softmax action.
    self.actor(state)
    #run the world simulation on that action.
    self.world.step(self.actor.action)
    #Combine the action and starting state as input to the world model.
    if self.actor.calibrating:
        action_state = torch.cat([self.actor.value, state], dim=0)
    else:
        #Push softmax action closer to 1.0
        action_state = torch.cat([self.actor.hard_value, state], dim=0)
    #Run the model and then the critic on the action_state
    self.critic(self.model(action_state))
    if self.actor.calibrating:
        self.actor.optimizer.zero_grad()
        #requires_grad_() toggles every parameter (assuming model and critic are nn.Module instances);
        #assigning .requires_grad on a Module would only set a plain Python attribute.
        self.model.requires_grad_(False)
        self.critic.requires_grad_(False)
        #Pick loss For maximizing the value of the action choice
        loss = -self.critic.value * self.actor.get_confidence()
        loss.backward(retain_graph=True)
        self.actor.optimizer.step()
    if self.model.calibrating:
        #Don't need to backpropagate through actor again
        self.actor.value.detach_()
        self.model.optimizer.zero_grad()
        self.model.requires_grad_(True)
        #Reduce loss for ambiguous actions
        loss = self.model.get_loss() * self.actor.get_confidence()**2
        loss.backward(retain_graph=True)
        self.model.optimizer.step()
    if self.critic.calibrating:
        #Don't need to backpropagate through the model or actor again
        self.model.value.detach_()
        self.critic.optimizer.zero_grad()
        self.critic.requires_grad_(True)
        #Reduce loss for ambiguous actions
        loss = self.critic.get_loss(self.goal) * self.actor.get_confidence()**2
        loss.backward(retain_graph=True)
        self.critic.optimizer.step()
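The freezing behaviour described in this answer can be reproduced in isolation. The sketch below is a minimal stand-in, not the author's classes: two nn.Linear layers play the actor and critic, and the sizes are arbitrary. It assumes the networks are nn.Module instances, in which case the flag has to be set with requires_grad_() (or per parameter), because assigning module.requires_grad = False only creates an ordinary attribute on the module.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
actor = nn.Linear(4, 3)   # stand-in for the action network
critic = nn.Linear(3, 1)  # stand-in for the critic
state = torch.randn(2, 4)

# Pitfall: plain assignment on a Module leaves its parameters trainable.
probe = nn.Linear(3, 1)
probe.requires_grad = False
probe_still_trainable = probe.weight.requires_grad

# requires_grad_() flips the flag on every parameter of the module.
critic.requires_grad_(False)
loss = -critic(actor(state)).mean()
loss.backward()  # still flows through the critic to reach the actor

actor_got_grads = all(p.grad is not None for p in actor.parameters())
critic_got_grads = any(p.grad is not None for p in critic.parameters())

critic.requires_grad_(True)  # re-enable before the critic's own update
print(probe_still_trainable, actor_got_grads, critic_got_grads)
```

The backward pass passes through the frozen critic to populate the actor's gradients while leaving the critic's own .grad fields untouched.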

Here's a more precise and fuller example.
import torch
import torch.nn as nn
from torch.autograd import Variable
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.ModuleList([
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
            nn.Linear(10, 10),
        ])
    def forward(self, x):
        self.output = []
        self.input = []
        for layer in self.layers:
            # detach from previous history
            x = Variable(x.data, requires_grad=True)
            #you can add this line after each layer to stop back propagation of that layer
            self.input.append(x)
            # compute output
            x = layer(x)
            # add to list of outputs
            self.output.append(x)
        return x
    def backward(self, g):
        for i, output in reversed(list(enumerate(self.output))):
            if i == (len(self.output) - 1):
                # for last node, use g
                output.backward(g)
            else:
                output.backward(self.input[i+1].grad.data)
                print(i, self.input[i+1].grad.data.sum())
model = Net()
inp = Variable(torch.randn(4, 10))
output = model(inp)
gradients = torch.randn(*output.size())
model.backward(gradients)

Keras Neural Network. Preprocessing

I have a doubt about fitting a neural network for a regression problem. I preprocessed the predictors (features) of my train and test data using the Imputer and scale utilities from sklearn.preprocessing, but I did not preprocess the target of my train or test data.
In the architecture of my neural network, all the layers have relu as the activation function except the last layer, which has the sigmoid function. I have chosen the sigmoid function for the last layer because the values of the predictions are between 0 and 1.
tl;dr: In summary, my question is: should I de-process the output of my neural net? If I don't use the sigmoid function, some output values fall below 0 or above 1. In this case, how should I do it?
Thanks

Usually, if you are doing regression you should use a 'linear' activation in the last layer. A sigmoid function will 'favor' values closer to 0 and 1, so it would be harder for your model to output intermediate values.
If the distribution of your targets is Gaussian or uniform, I would go with a linear output layer. De-processing shouldn't be necessary unless you have very large targets.
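For the case where the targets are large enough that scaling them is worthwhile, "de-processing" just means inverting whatever transform was applied to the training targets. A minimal sketch using standardisation follows; the target values and the helper names scale and unscale are made up for illustration, and any invertible transform would work the same way.

```python
from statistics import mean, pstdev

y_train = [120000.0, 95000.0, 143000.0, 110000.0]  # made-up large targets

# Fit the transform on the training targets only.
mu = mean(y_train)
sigma = pstdev(y_train)


def scale(values):
    # Standardise targets before training.
    return [(v - mu) / sigma for v in values]


def unscale(values):
    # Map network outputs back to the original units.
    return [v * sigma + mu for v in values]


y_scaled = scale(y_train)
y_restored = unscale(y_scaled)
print(y_scaled, y_restored)
```

The network would be trained on y_scaled with a linear output layer, and unscale would be applied to its predictions.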

tensorflow basic word2vec example: Shouldn't we be using weights [nce_weight Transpose] for the representation and not embedding matrix?

I am referring to this sample code,
in the code snippet below:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Construct the variables for the NCE loss
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                              stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
Now, the NCE loss function is nothing but a single-hidden-layer neural network with softmax at the output layer [knowing it takes only a few negative samples].
This part of the graph will only update the weights of the network; it is not doing anything to the "embeddings" matrix/tensor.
So ideally, once the network is trained, we must pass the input once more through the embeddings matrix and then multiply by the transpose of "nce_weights" [treating it as an auto-encoder with the same weights at the input and output layers] to reach the hidden-layer representation of each word, which we are calling word2vec (?)
But if you look at the later part of the code, the value of the embeddings matrix is being used as the word representation.
Even the TensorFlow doc for NCE loss mentions inputs (to which we are passing embed, which uses embeddings) as just the first-layer input activation values:
inputs: A Tensor of shape [batch_size, dim]. The forward activations of the input network.
Normal backpropagation stops at the first layer of the network.
Does this implementation of NCE loss go beyond that and propagate the loss to the input values (and hence to the embeddings)?
This seems like an extra step.
Refer to this for why I am calling it an extra step; the author gives the same explanation.

What I have figured out from reading and going through TensorFlow is that,
though the entire thing is a single-hidden-layer neural network, an auto-encoder indeed, the weights are not tied, as I had assumed.
The encoder is made of the weight matrix embeddings and the decoder is made of nce_weights. And embed is nothing but the hidden layer output, given by multiplying the input with embeddings.
So with this, embeddings and nce_weights will both be updated in the graph. We can choose either of the two weight matrices; embeddings is preferred here.
Edit1:
Actually, for both tf.nn.nce_loss and tf.nn.sampled_softmax_loss, the weights and biases parameters define the input, Weights(transpose) X + bias, to the objective function, which can be logistic regression or a softmax function [refer].
But backpropagation/gradient descent continues to the very base of the graph you are building and does not stop at the weights and biases of the loss function. Hence the inputs parameter of both tf.nn.nce_loss and tf.nn.sampled_softmax_loss is also updated, and it is in turn built from the embeddings matrix.
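The point that the gradient reaches both weight matrices can be checked numerically on a tiny case. The sketch below is not NCE itself: it swaps in a plain logistic loss for a single positive (input word, context word) pair, with made-up three-dimensional vectors and hypothetical function names, and uses finite differences to show that the loss depends on the embedding row as well as on the output-side weights.

```python
import math


def pair_loss(embedding_row, output_row, bias):
    # Logistic loss for one positive pair: -log(sigmoid(w . e + b)).
    logit = sum(e * w for e, w in zip(embedding_row, output_row)) + bias
    return -math.log(1.0 / (1.0 + math.exp(-logit)))


def numeric_grad(f, vec, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grads = []
    for i in range(len(vec)):
        up = vec[:i] + [vec[i] + eps] + vec[i + 1:]
        down = vec[:i] + [vec[i] - eps] + vec[i + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


embedding_row = [0.2, -0.1, 0.4]  # plays the role of one row of embeddings
output_row = [0.5, 0.3, -0.2]     # plays the role of one row of nce_weights
bias = 0.0

grad_embedding = numeric_grad(lambda e: pair_loss(e, output_row, bias), embedding_row)
grad_output = numeric_grad(lambda w: pair_loss(embedding_row, w, bias), output_row)
print(grad_embedding, grad_output)
```

Both gradients are non-zero, so an optimizer minimising this loss would update the embedding row and the output row together.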

Image retraining in tensorflow, changing the simple softmax layer to multilayer CNN

TensorFlow has released a tutorial for transfer learning named Image retraining that can be found here:
https://www.tensorflow.org/tutorials/image_retraining
What they do is take a pre-trained Inception v3 model, replace only the very last layer (the softmax regression layer) and train it on the new dataset. This is very understandable and in fact a common practice in transfer learning.
I have tried their method on my dataset (which is small) and I have applied all their suggestions for getting a better result, from data augmentation to changing the number of steps, but I did not modify their code in any way. The accuracy I got is relatively bad, ~70%.
I am thinking of training a small neural network on top of the given model, namely, changing the last layer from a simple regression to a more sophisticated network.
Here is the part of their code where they modify the softmax layer:
def add_final_training_ops(class_count, final_tensor_name, bottleneck_tensor):
    """Adds a new softmax and fully-connected layer for training.
    We need to retrain the top layer to identify our new classes, so this function
    adds the right operations to the graph, along with some variables to hold the
    weights, and then sets up all the gradients for the backward pass.
    The set up for the softmax and fully-connected layers is based on:
    https://tensorflow.org/versions/master/tutorials/mnist/beginners/index.html
    Args:
        class_count: Integer of how many categories of things we're trying to
            recognize.
        final_tensor_name: Name string for the new final node that produces results.
        bottleneck_tensor: The output of the main CNN graph.
    Returns:
        The tensors for the training and cross entropy results, and tensors for the
        bottleneck input and ground truth input.
    """
    with tf.name_scope('input'):
        bottleneck_input = tf.placeholder_with_default(
            bottleneck_tensor, shape=[None, BOTTLENECK_TENSOR_SIZE],
            name='BottleneckInputPlaceholder')
        ground_truth_input = tf.placeholder(tf.float32,
                                            [None, class_count],
                                            name='GroundTruthInput')
    # Organizing the following ops as `final_training_ops` so they're easier
    # to see in TensorBoard
    layer_name = 'final_training_ops'
    with tf.name_scope(layer_name):
        with tf.name_scope('weights'):
            layer_weights = tf.Variable(tf.truncated_normal([BOTTLENECK_TENSOR_SIZE, class_count], stddev=0.001), name='final_weights')
            variable_summaries(layer_weights)
        with tf.name_scope('biases'):
            layer_biases = tf.Variable(tf.zeros([class_count]), name='final_biases')
            variable_summaries(layer_biases)
        with tf.name_scope('Wx_plus_b'):
            logits = tf.matmul(bottleneck_input, layer_weights) + layer_biases
            tf.summary.histogram('pre_activations', logits)
    final_tensor = tf.nn.softmax(logits, name=final_tensor_name)
    tf.summary.histogram('activations', final_tensor)
    with tf.name_scope('cross_entropy'):
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
            labels=ground_truth_input, logits=logits)
        with tf.name_scope('total'):
            cross_entropy_mean = tf.reduce_mean(cross_entropy)
    tf.summary.scalar('cross_entropy', cross_entropy_mean)
    with tf.name_scope('train'):
        train_step = tf.train.GradientDescentOptimizer(FLAGS.learning_rate).minimize(
            cross_entropy_mean)
    return (train_step, cross_entropy_mean, bottleneck_input, ground_truth_input,
            final_tensor)
def add_evaluation_step(result_tensor, ground_truth_tensor):
    """Inserts the operations we need to evaluate the accuracy of our results.
    Args:
        result_tensor: The new final node that produces results.
        ground_truth_tensor: The node we feed ground truth data
            into.
    Returns:
        Tuple of (evaluation step, prediction).
    """
    with tf.name_scope('accuracy'):
        with tf.name_scope('correct_prediction'):
            prediction = tf.argmax(result_tensor, 1)
            correct_prediction = tf.equal(
                prediction, tf.argmax(ground_truth_tensor, 1))
        with tf.name_scope('accuracy'):
            evaluation_step = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    tf.summary.scalar('accuracy', evaluation_step)
    return evaluation_step, prediction
However, I am facing two main problems. First, I am not sure whether this is a good idea: would I just be wasting my effort on something useless? Second, they are using the simple MNIST tutorial as a model for the last layer. Say that I wanted to use their Expert MNIST tutorial instead (https://www.tensorflow.org/get_started/mnist/pros): I am lost on what to do or how to configure it.
Any suggestions on what I can do?
