I've been trying to build a model using 'Deep Q-Learning' where I have a large number of actions (2908). After some limited success with standard DQN:
(https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), I decided to do some more research because I figured the action space was too large to do effective exploration.
I then discovered this paper: https://arxiv.org/pdf/1512.07679.pdf where they use an actor-critic model and policy gradients, which then led me to: https://arxiv.org/pdf/1602.01783.pdf where they use policy gradients to get much better results than DQN overall.
I've found a few sites where they have implemented policy gradients in Keras, https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html and https://oshearesearch.com/index.php/2016/06/14/kerlym-a-deep-reinforcement-learning-toolbox-in-keras/, however I'm confused about how they are implemented. In the former (and when I read the papers) it seems like instead of providing an input-output pair for the actor network, you provide the gradients for all the weights and use them to update the network, whereas in the latter they just compute an input-output pair.
Have I just confused myself? Am I just supposed to train the network by providing an input-output pair and using the standard 'fit', or do I have to do something special? If it's the latter, how do I do it with the Theano backend? (The examples above use TensorFlow.)
TL;DR
Learn how to implement custom loss functions and gradients using Keras.backend. You will need it for more advanced algorithms, and it's actually much easier once you get the hang of it.
One CartPole example using keras.backend is https://gist.github.com/kkweon/c8d1caabaf7b43317bc8825c226045d2 (it uses the TensorFlow backend, but the Theano version should be very similar if not identical).
Problem
When playing,
the agent needs a policy, which is basically a function that maps a state to a probability distribution over actions. The agent then chooses an action according to its policy.
i.e., policy = f(state)
When training,
Policy Gradient does not have a conventional loss function. Instead, it tries to maximize the expected return. To do so, we need to compute the gradients of log(action_prob) * advantage, where:
advantage is a function of rewards.
advantage = f(rewards)
action_prob is a function of states and action_taken: we need to know which action we took so that we can update the parameters to increase/decrease the probability of that action.
action_prob = sum(policy * action_onehot) = f(states, action_taken)
I'm assuming something like this
policy = [0.1, 0.9]
action_onehot = action_taken = [0, 1]
then action_prob = sum(policy * action_onehot) = 0.9
Summary
We need two functions
update function: f(state, action_taken, reward)
choose action function: f(state)
By now you can see it's not as easy to implement as a typical classification problem, where you can just model.compile(...) -> model.fit(X, y).
However,
In order to fully utilize Keras, you should be comfortable defining custom loss functions and gradients. This is basically the approach the author of the former implementation took.
You should read more of the documentation for the Keras functional API and keras.backend.
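For illustration, here is a rough sketch (my own, not taken from either linked implementation) of what the two functions can look like with the Keras 2 functional API and keras.backend. The layer sizes and the optimizer are arbitrary placeholders, and the exact get_updates signature varies slightly between Keras versions; since only keras.backend ops are used, the same approach should also work with the Theano backend.

import keras.backend as K
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam

n_states, n_actions = 4, 2  # placeholder dimensions

# choose-action function: policy = f(state)
state_input = Input(shape=(n_states,))
hidden = Dense(32, activation='relu')(state_input)
policy = Dense(n_actions, activation='softmax')(hidden)
model = Model(inputs=state_input, outputs=policy)

# update function: f(state, action_taken, advantage)
action_onehot = K.placeholder(shape=(None, n_actions), name='action_onehot')
advantage = K.placeholder(shape=(None,), name='advantage')

action_prob = K.sum(model.output * action_onehot, axis=1)   # probability of the action taken
loss = -K.mean(K.log(action_prob + 1e-10) * advantage)      # maximize log(action_prob) * advantage

updates = Adam(lr=1e-3).get_updates(loss=loss, params=model.trainable_weights)
train_fn = K.function(inputs=[model.input, action_onehot, advantage],
                      outputs=[loss],
                      updates=updates)

# usage: probs = model.predict(state_batch), sample an action from probs,
# then later call train_fn([state_batch, action_onehot_batch, advantage_batch])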
Plus, there are many different kinds of policy gradients.
The former one is called DDPG, which is actually quite different from regular policy gradients.
The latter one is a traditional REINFORCE policy gradient (pg.py), based on Karpathy's policy gradient example. It's very simple: for example, it assumes only one action. That's why it could be implemented using model.fit(...) instead.
References
Schulman, "Policy Gradient Methods", http://rll.berkeley.edu/deeprlcourse/docs/lec2.pdf
The seemingly conflicting implementations you encountered are both valid. They are two equivalent ways to implement policy gradients.
In the vanilla implementation, you calculate the gradients of the policy network w.r.t. rewards and directly update the weights in the direction of the gradients. This would require you to do the steps described by Mo K.
The second option is arguably a more convenient implementation for autodiff frameworks like Keras/TensorFlow. The idea is to implement an input-output (state-action) function like in supervised learning, but with a loss function whose gradient is identical to the policy gradient. For a softmax policy, this simply means predicting the 'true action' and multiplying the (cross-entropy) loss with the observed returns/advantage. Aleksis Pirinen has some useful notes about this [1].
The modified loss function for option 2 in Keras looks like this:
import keras.backend as K

def policy_gradient_loss(Returns):
    def modified_crossentropy(action, action_probs):
        cost = K.categorical_crossentropy(action, action_probs, from_logits=False, axis=1) * Returns
        return K.mean(cost)
    return modified_crossentropy
where 'action' is the true action taken during the episode (y_true) and action_probs is the predicted probability (y_pred). This is based on another Stack Overflow question [2].
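A hedged usage sketch of that loss (my own; n_states, n_actions, the layer sizes and the name returns_in are placeholders): the observed returns are passed in as an extra Input so that they can vary per batch, and the one-hot encoded actions play the role of the labels. This pattern is known to work with graph-mode Keras/TF1; eager tf.keras may need adjustments.

import keras.backend as K
from keras.layers import Input, Dense
from keras.models import Model

n_states, n_actions = 4, 2              # placeholder dimensions

states_in = Input(shape=(n_states,))
returns_in = Input(shape=(1,))          # observed returns/advantages, fed alongside the states
hidden = Dense(32, activation='relu')(states_in)
action_probs = Dense(n_actions, activation='softmax')(hidden)

model = Model(inputs=[states_in, returns_in], outputs=action_probs)
# flatten the returns so they multiply the per-sample cross-entropy elementwise
model.compile(optimizer='adam', loss=policy_gradient_loss(K.flatten(returns_in)))

# the one-hot encoded actions act as y_true:
# model.train_on_batch([states_batch, returns_batch], actions_onehot_batch)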
References
https://aleksispi.github.io/assets/pg_autodiff.pdf
Make a custom loss function in keras
I am currently trying to implement custom optimization for a custom tensorflow layer.
Without going into too much detail, I have added a small code sample which illustrates how my current code works. The important part is that calculate_gradients(variables, gradients, momentum) is a function that requires the variable values and gradients of all the variables in the layer. Furthermore, this calculation contains intermediate results which have to be stored during optimization; this explains the illustrative momentum variable. This behaviour seems to rule out tf.custom_gradient, since it does not allow me to propagate these intermediate results to the optimizer, which would then have to return them to the custom gradient function for use in the calculation of the next set of gradients. Unless someone knows how this would work (question one), I have not found a way around this.
model = build_model()

for data, target in dataset:
    with tf.GradientTape() as tape:
        loss_value = loss(model(data), target)
    gradients = tape.gradient(loss_value, model.trainable_variables)
    for layer in model.layers:
        layer_gradients = gradients[indices]  # actual indexing is not important
        new_gradients = calculate_gradients(layer.variables, layer_gradients, momentum)
        for variable, grad in zip(layer.variables, new_gradients):
            variable.assign(grad)
Trying to implement this in a TensorFlow optimizer, particularly by replacing _resource_apply_dense as shown in the documentation [1], I am running into some trouble with the layer-wise behaviour.
In particular, _resource_apply_dense takes a single variable and a single gradient. The second code snippet illustrates what I am trying to do, but I have currently not found a way to implement the get_other_variables_and_gradients(var) behaviour. Furthermore, this solution would calculate the gradients three times for each layer, which is very suboptimal.
def _resource_apply_dense(self, grad, var, apply_state=None):
    other_vars_and_grads = get_other_variables_and_gradients(var)
    calculate_gradients(zip((var, grad), other_vars_and_grads))
    var.assign(grad)
In short, my second question is: does anyone have an idea how to implement this behaviour, ideally without redundant calculations, or perhaps in an entirely better way? Currently the optimization works when I do everything in a training loop as shown in the first code snippet, so this is mainly a matter of integrating with the TensorFlow optimizer paradigm and of performance, since doing everything in pure Python with lists in a large for loop is slow.
[1] https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer
I have answered my own question and am posting the answer here for the record. After revisiting this issue and going through the tf.keras.optimizers.Optimizer code on GitHub [1], I discovered a private method '_transform_unaggregated_gradients'. This method is called before the gradients are applied and allows the necessary behaviour. In my case it looks something like this:
def _transform_unaggregated_gradients(self, grads_and_vars):
    for layer in model.layers:
        layer_gradients, layer_variables = grads_and_vars[indices]  # actual indexing is not important
        new_gradients = calculate_gradients(layer.variables, layer_gradients, momentum)
        grads_and_vars[indices] = (new_gradients, layer_variables)
    return grads_and_vars
[1] https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L115-L1414
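For completeness, here is a hedged sketch (mine, not from the original post) of how this override might sit inside an optimizer subclass. layer_slices, calculate_gradients and the momentum bookkeeping are assumptions standing in for the poster's own indexing and intermediate state, and _transform_unaggregated_gradients is private API that may change between TensorFlow versions.

import tensorflow as tf

class LayerwiseOptimizer(tf.keras.optimizers.SGD):
    """SGD variant that rewrites gradients layer by layer before they are applied."""

    def __init__(self, layer_slices, **kwargs):
        super().__init__(**kwargs)
        self.layer_slices = layer_slices    # one slice of the (grad, var) list per layer
        self.momentum_state = {}            # hypothetical storage for intermediate results

    def _transform_unaggregated_gradients(self, grads_and_vars):
        grads_and_vars = list(grads_and_vars)
        for sl in self.layer_slices:
            grads = [g for g, _ in grads_and_vars[sl]]
            variables = [v for _, v in grads_and_vars[sl]]
            new_grads = calculate_gradients(variables, grads, self.momentum_state)  # poster's function
            grads_and_vars[sl] = list(zip(new_grads, variables))
        return grads_and_vars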
I'm learning about Actor-Critic reinforcement learning techniques, in particular the A2C algorithm.
I've found a good description of a simple version of the algorithm (i.e. without experience replay, batching or other tricks) with implementation here: https://link.medium.com/yi55uKWwV2. The complete code from that article is available on GitHub.
I think I understand ok-ish what's happening here, but to make sure I actually do, I'm trying to reimplement it from scratch using the higher-level tf.keras APIs. Where I'm getting stuck is how to implement the training loop correctly, and how to formulate the actor's loss function.
What is the correct way to pass action and advantage into the loss function?
The actor's loss function involves computing the probability of the action taken under a normal distribution. How can I ensure that the mu and sigma of the normal distribution used in the loss function computation actually match the ones used during prediction?
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems wrong, but I'm not sure why.
The code I have so far: https://gist.github.com/nevkontakte/beb59f29e0a8152d99003852887e7de7
Edit: I suppose some of my confusion stems from a poor understanding of the magic behind gradient computation in Keras/TensorFlow, so any pointers there would be appreciated.
First, credit where credit is due: information provided by ralf htp and Simon was instrumental in helping me to figure out the right answers eventually.
Before I go into detailed answers to my own questions, here's the original code I was trying to rewrite in tf.keras terms, and here's my result.
What is the correct way to pass action and advantage into a loss function in Keras?
There is a difference between what a raw TF optimizer considers a loss function and what Keras does. When using an optimizer directly, it simply expects a tensor (lazy or eager, depending on your configuration), which will be evaluated under tf.GradientTape() to compute the gradient and update the weights.
Example from https://medium.com/#asteinbach/actor-critic-using-deep-rl-continuous-mountain-car-in-tensorflow-4c1fb2110f7c:
# Below norm_dist is the output tensor of the neural network we are training.
loss_actor = -tfc.log(norm_dist.prob(action_placeholder) + 1e-5) * delta_placeholder
training_op_actor = tfc.train.AdamOptimizer(
    lr_actor, name='actor_optimizer').minimize(loss_actor)

# Later, in the training loop...
_, loss_actor_val = sess.run([training_op_actor, loss_actor],
                             feed_dict={action_placeholder: np.squeeze(action),
                                        state_placeholder: scale_state(state),
                                        delta_placeholder: td_error})
In this example it evaluates the whole graph, including making an inference, computing the gradients and adjusting the weights. So to pass whatever values you need into the loss function/gradient computation, you simply feed them into the computation graph.
Keras is a bit more formal in what loss function should look like:
loss: String (name of objective function), objective function or tf.keras.losses.Loss instance. See tf.keras.losses. An objective function is any callable with the signature scalar_loss = fn(y_true, y_pred). If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses.
Keras will do the inference (forward pass) for you and pass the output into the loss function. The loss function is supposed to do some extra computation on the predicted value and y_true label, and return the result. This whole process will be tracked for the purpose of gradient computation.
Although it is very convenient for traditional training, this is a bit restrictive when we want to pass some extra data in, like the TD error. It is possible to work around that by shoving all the extra data into y_true and pulling it apart inside the loss function (I found this trick somewhere on the web, but unfortunately lost the link to the source).
Here's how I rewrote the above in the end:
def loss(y_true, y_pred):
    action_true = y_true[:, :n_outputs]
    advantage = y_true[:, n_outputs:]
    return -tfc.log(y_pred.prob(action_true) + 1e-5) * advantage

# Below, in the training loop...
# A trick to pass the TD error *and* the actual action to the loss function:
# join them into a tensor and split them apart inside the loss function.
annotated_action = tf.concat([action, td_error], axis=1)
actor_model.train_on_batch([scale_state(state)], [annotated_action])
The actor's loss function involves computing the probability of the action taken under a normal distribution. How can I ensure that the mu and sigma of the normal distribution used in the loss function computation actually match the ones used during prediction?
When I asked this question, I didn't understand well enough how the TF compute graph works. The answer is simple: every time sess.run() is invoked, it computes the whole graph from scratch, so the parameters of the distribution will be the same (or similar) as long as the graph inputs (e.g. observed state) and the NN weights are the same (or similar).
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems wrong, but I'm not sure why.
What's wrong is the assumption that "the actor's loss function doesn't care about y_pred" :) The actor's loss function involves norm_dist (the action probability distribution), which is effectively the analog of y_pred in this context.
As far as I understand, A2C is the machine-learning implementation of activator-inhibitor systems, also called two-component reaction-diffusion systems (https://en.wikipedia.org/wiki/Reaction%E2%80%93diffusion_system). Activator-inhibitor models are important in many fields of science as they describe pattern formation, e.g. the Turing mechanism (simply search the net for activator-inhibitor model and you will find a vast amount of information; a very common application is predator-prey models). Also cf. the graphic
Source of the graphic: https://www.researchgate.net/figure/Activator-Inhibitor-System_fig1_23671770/
with the explanatory graphic of the A2C algorithm in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69
Activator-inhibitor models are closely linked to the theory of nonlinear dynamical systems (or 'chaos theory'); this also becomes obvious when comparing the bifurcation tree-like structure in https://medium.com/#asteinbach/rl-introduction-simple-actor-critic-for-continuous-actions-4e22afb712 with the bifurcation tree of a nonlinear dynamical system such as the logistic map (https://en.wikipedia.org/wiki/Logistic_map; the logistic map is one of the simplest predator-prey or activator-inhibitor models). Another similarity is the sensitivity to initial conditions in A2C models, which is described as
This introduces in inherent high variability in log probabilities (log of the policy distribution) and cumulative reward values, because each trajectories during training can deviate from each other at great degrees.
in https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f. The curse of dimensionality also appears in chaos theory, e.g. in attractor reconstruction.
From the viewpoint of systems theory, the A2C algorithm tries to adapt the initial value (start state) in such a way that it ends up at a given endpoint when the growth rate of a dynamical system such as the logistic map is increased (the r-value is increased and the initial value (start state) is constantly re-adapted to choose the correct bifurcations (actions) in the bifurcation tree).
So A2C tries to numerically solve a chaos-theory problem, namely finding the initial value for a given outcome of a nonlinear dynamical system in its chaotic region. Analytically this problem is in most cases not solvable.
The actions are the bifurcation points in the bifurcation tree, and the states are the future bifurcations.
Both actions and states are modeled by two coupled neural networks, and this coupling of two neural nets is the great innovation of A2C algorithms.
In https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 there is well-documented Keras code for implementing A2C, so you have a possible implementation there.
The loss function here is defined as the temporal difference (TD) function, i.e. the difference between the state at the actual bifurcation point and the state at the estimated future one. Although this is mathematically exactly defined, it is prone to stochastic error (or noise), because in the end machine learning is based on stochastic systems, i.e. systems composed of a deterministic and a stochastic component. To drive this error to zero, stochastic gradient descent is used; in Keras this is simply implemented by choosing optimizer='sgd'.
This interaction of the actual and the future step is implemented as memory in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 in the function remember, and this function also links the actor and the critic network (or activator and inhibitor network). This general structure of trial (action), call predict (TD function), remember and train (i.e. stochastic gradient descent) is fundamental to all reinforcement learning algorithms, and is linked to the structure of current state, action, reward, new state:
The prediction code is also very much the same as it was in previous reinforcement learning algorithms. That is, we just have to iterate through the trial and call predict, remember, and train on the agent:
In this implementation, your first question is solved by applying remember on the critic and then training the critic with these values (this is done in the main function); training always evaluates the loss function, so the action and reward are passed to the loss function by remember in this implementation:
actor_critic.remember(cur_state, action, reward, new_state, done)
actor_critic.train()
Regarding your second question: I am not sure, but I think this is achieved by the optimization algorithm (i.e. stochastic gradient descent).
Third question: in the predator-prey model, the actor or activator is the prey, and the behavior of the prey is determined only by the size or capacity of the habitat (the amount of grass) and the size of the predator (inhibitor) population, so modelling it this way is again consistent with nature, i.e. with an activator-inhibitor system. In the main function of https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69, only the critic or inhibitor/predator is trained.
I'm learning MXNet at the moment and I'm working on a problem using neural nets. I'm interested in observing the curvature of my loss function with respect to the network weights but as best I can tell higher order gradients are not supported for neural network functions at the moment. Is there any (possibly hacky) way that I could still do this?
You can follow the discussion here
The gist of it is that not all operators support higher order gradients at the moment.
In Gluon you can try the following:
with mx.autograd.record():
    output = net(x)
    loss = loss_func(output)
    dz = mx.autograd.grad(loss, [z], create_graph=True)  # where [z] is the parameter(s) you want

dz[0].backward()  # now the actual parameters should have second order gradients
Taken from this forum thread
I read this good article about the Proximal Policy Optimization algorithm, and now I want to update my VanillaPG agent to a PPO agent to learn more about it. However, I'm still not sure how to implement this in real code, especially since I'm using a simple discrete action space.
So what I do with my VPG Agent is, if there are 3 actions, the network outputs 3 values (out), on which I use softmax (p) and use the result as a distribution to choose one of the actions. For training, I take the states, actions and advantages and use this loss function:
loss = -tf.reduce_sum(advantages * tf.log(ch_action_p_values))
How can I extend this algorithm to use PPO for discrete actions? All of the implementations I found work with continuous action spaces. I'm not sure if I have to change my loss function to the first one used in the article, and I'm not even sure which probabilities I need for calculating the KLD. Are prob_s_a_* and D_KL single values for the whole batch, or one value for each sample? How can I calculate them in TF for my agent?
You should be able to do it with discrete actions as well without any problem (I never tried, though). The probabilities prob_s_a_* you are talking about are the probabilities of drawing the sampled actions under the current policy (one value per sample).
PPO does not use D_KL (the KL divergence), as in their experiments it performed worse (they just clip the probability ratio instead).
So you just need to add a placeholder for the old log probabilities and clip the ratio between the new log probabilities (tf.log(ch_action_p_values)) and the old ones.
Here is an example (e_clip is the clipping value, in the paper they use 0.2)
vanilla_loss = -tf.reduce_sum(advantages * tf.log(ch_action_p_values))

old_log_probs = tf.placeholder(...)
log_probs = tf.log(ch_action_p_values)
prob_ratio = tf.exp(log_probs - old_log_probs)
clip_prob = tf.clip_by_value(prob_ratio, 1. - e_clip, 1. + e_clip)
ppo_loss = -tf.reduce_mean(tf.minimum(tf.multiply(prob_ratio, advantages), tf.multiply(clip_prob, advantages)))
Besides the usual advantages and ch_action_p_values, you need to feed the loss with old_log_probs, computed as the log probabilities that the current policy assigned to the sampled actions at the time they were collected.
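A hedged sketch of the feeding step (placeholder names such as states_ph, actions_ph, train_op and adv are my assumptions, not from the question): the log probabilities are recorded once when the batch is collected and then reused, unchanged, for every PPO update on that batch.

import numpy as np

# while collecting the rollout, with the policy that generates the samples:
p = sess.run(ch_action_p_values, feed_dict={states_ph: states, actions_ph: actions})
old_log_p = np.log(p + 1e-10)                   # kept fixed for all updates on this batch

# later, for each PPO update on the same batch:
sess.run(train_op, feed_dict={states_ph: states,
                              actions_ph: actions,
                              advantages: adv,
                              old_log_probs: old_log_p})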
I know that one can simply set a single learning rate for all variables using something as in the tutorials:
opt = tf.train.GradientDescentOptimizer(learning_rate)
however, it would be nice if one could pass a dictionary that maps each variable name to its corresponding learning rate. Is that possible?
I know that one could simply use compute_gradients() followed by apply_gradients() and do it manually, but that seems silly. Is there a smarter way to assign specific learning rates to specific variables?
Is the only way to do this to create a specific optimizer, as in:
# Create an optimizer with the desired parameters.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Add Ops to the graph to minimize a cost by updating a list of variables.
# "cost" is a Tensor, and the list of variables contains tf.Variable
# objects.
opt_op = opt.minimize(cost, var_list=<list of variables>)
and simply give a specific learning rate to each optimizer? But that would mean we have a list of optimizers, and hence we would need to apply the learning rule with sess.run for each optimizer. Right?
As far as I can tell, this is not possible, mostly because it would no longer really be valid gradient descent. There are plenty of optimizers which learn variable-specific scaling factors on their own (like Adam or AdaGrad). Specifying a (constant) per-variable learning rate would mean that you no longer follow the gradient; while this makes sense for mathematically well-formulated methods, simply setting the rates to pre-defined values is just a heuristic, which I believe is one reason for not implementing this in core TF.
As you said, you can always do it on your own: define your own optimizer, or iterate over the variables between computing the gradients and applying them, which takes around 3-4 lines of code (one to compute the gradients, one to iterate and add the multiplication ops, and one to apply them back). As far as I know, this is the simplest solution to achieve your goal.
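To make that concrete, here is a minimal sketch of that approach with the TF1-style API used in the question; the lr_multipliers dictionary mapping variable names to scaling factors is my own assumption:

import tensorflow as tf

# hypothetical per-variable scaling factors; variables not listed keep the base learning rate
lr_multipliers = {'layer1/weights': 0.1, 'layer2/weights': 1.0}

opt = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = opt.compute_gradients(cost)
scaled = [(grad * lr_multipliers.get(var.op.name, 1.0), var)
          for grad, var in grads_and_vars if grad is not None]
train_op = opt.apply_gradients(scaled)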