Accessing gradient of multiple variables when applying resource [Tensorflow] - python

I am currently trying to implement custom optimization for a custom TensorFlow layer.
Without going into too much detail, I have added a small code sample that illustrates how my current code works. The important part is that calculate_gradients(variables, gradients, momentum) is a function that requires the variable values and gradients of all the variables in the layer. Furthermore, this calculation produces intermediate results that have to be stored between optimization steps, which explains the illustrative momentum variable. This behaviour makes tf.custom_gradient unusable for me, since it does not let me propagate these intermediate results to the optimizer, which would then have to return them to the custom gradient function for use in calculating the next set of gradients. Unless someone knows how this would work (question one), I have not found a way around this.
model = build_model()

for data, target in dataset:
    with tf.GradientTape() as tape:
        loss_value = loss(model(data), target)
    gradients = tape.gradient(loss_value, model.trainable_variables)
    for layer in model.layers:
        layer_gradients = gradients[indices]  # actual indexing is not important
        new_gradients = calculate_gradients(layer.variables, layer_gradients, momentum)
        for variable, grad in zip(layer.variables, new_gradients):
            variable.assign(grad)
Trying to implement this in a TensorFlow optimizer, particularly by overriding _resource_apply_dense as shown in the documentation [1], I am running into some trouble with the layer-wise behaviour.
In particular, _resource_apply_dense only takes a single variable and its gradient. The second code snippet illustrates what I am trying to do, but I have currently not found a way to implement the get_other_variables_and_gradients(var) behaviour. Furthermore, this solution would calculate the gradients three times for each layer, which is very suboptimal.
def _resource_apply_dense(self, var, grad, apply_state):
    other_vars_and_grads = get_other_variables_and_gradients(var)
    new_grad = calculate_gradients(zip((var, grad), other_vars_and_grads))
    var.assign(new_grad)
In short, my second question is: does anyone have an idea how to implement this behaviour, ideally without redundant calculations, or even a whole different and better approach? Currently the optimization works when I do everything in a training loop as shown in the first code snippet, so this is merely a matter of integrating with the TensorFlow optimizer paradigm and of performance, since doing everything very 'pythony' with lists in a large for loop is slow.
[1] https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer

I have answered my own question and am posting the answer here for the record. After revisiting this issue and going through the tf.keras.optimizers.Optimizer code on GitHub [1], I discovered the private method _transform_unaggregated_gradients. This method is called before the gradients are applied and allows the necessary behaviour. In my case it looks something like this:
def _transform_unaggregated_gradients(self, grads_and_vars):
    for layer in model.layers:
        layer_gradients, layer_variables = grads_and_vars[indices]  # actual indexing is not important
        new_gradients = calculate_gradients(layer_variables, layer_gradients, momentum)
        grads_and_vars[indices] = (new_gradients, layer_variables)
    return grads_and_vars
[1] https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L115-L1414
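For completeness, here is a rough, untested sketch of how this hook could be wired into an optimizer subclass. It assumes TF 2.5 (where _transform_unaggregated_gradients exists as a private method on optimizer_v2), reuses the question's calculate_gradients and per-layer state as placeholders, and receives the layer-wise grouping of variables explicitly; the transformed gradients are then applied by the base optimizer's normal update rule:

import tensorflow as tf

class LayerwiseOptimizer(tf.keras.optimizers.SGD):
    def __init__(self, layer_variable_groups, **kwargs):
        # layer_variable_groups: a list of lists of tf.Variable, one list per layer
        super().__init__(**kwargs)
        self._groups = layer_variable_groups
        self._state = {}  # intermediate results kept between steps ("momentum")

    def _transform_unaggregated_gradients(self, grads_and_vars):
        grads_and_vars = list(grads_and_vars)
        index_of = {v.ref(): i for i, (_, v) in enumerate(grads_and_vars)}
        for group in self._groups:
            idx = [index_of[v.ref()] for v in group]
            grads = [grads_and_vars[i][0] for i in idx]
            variables = [grads_and_vars[i][1] for i in idx]
            # replace each variable's gradient with the layer-wise result
            new_grads = calculate_gradients(variables, grads, self._state)
            for i, new_grad in zip(idx, new_grads):
                grads_and_vars[i] = (new_grad, grads_and_vars[i][1])
        return grads_and_vars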

Related

Calling .backward() function for two different neural networks but getting retain_graph=True error

I have an Actor-Critic neural network where the Actor is its own class and the Critic is its own class, each with its own neural network and .forward() function. I then create an object of each of these classes in a larger Model class. My setup is as follows:
self.actor = Actor().to(device)
self.actor_opt = optim.Adam(self.actor.parameters(), lr=lr)
self.critic = Critic().to(device)
self.critic_opt = optim.Adam(self.critic.parameters(), lr=lr)
I then calculate two different loss functions and want to update each neural network separately. For the critic:
loss_critic = F.smooth_l1_loss(value, expected)
self.critic_opt.zero_grad()
loss_critic.backward()
self.critic_opt.step()
and for the actor:
loss_actor = -self.critic(state, action)
self.actor_opt.zero_grad()
loss_actor.backward()
self.actor_opt.step()
However, when doing this, I get the following error:
RuntimeError: Trying to backward through the graph a second time, but the saved
intermediate results have already been freed. Specify retain_graph=True when
calling backward the first time.
When reading up on this, I understood that I only need retain_graph=True when calling backward twice on the same graph, and that in most cases it is not good to set it to True as I will run out of GPU memory. Moreover, when I comment out one of the .backward() calls, the error goes away, leading me to believe that for some reason the code thinks both backward() calls are made on the same neural network, even though I think I am keeping them separate. What could be the reason for this? Is there a way to specify which neural network I am calling the backward function on?
Edit:
For reference, the optimize() function in this code here https://github.com/wudongming97/PyTorch-DDPG/blob/master/train.py uses backward() twice with no issue (I've cloned the repo and tested it). I'd like my code to operate similarly where I backprop through critic and actor separately.
Yes, you shouldn't do it like that. What you should do instead is propagate through only parts of the graph.
What the graph contains
Right now, the graph contains both the actor and the critic. If the computations pass through the same part of the graph twice (say, twice through the actor), it will raise this error.
And they will, as you clearly use the actor and the critic jointly in the loss value (this line: loss_actor = -self.critic(state, action)).
Using different optimizers does not change anything here, as it is a backward problem (optimizers simply apply the calculated gradients to the models).
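Here is a toy reproduction of the issue (not the question's actual models, just two tiny linear layers standing in for the actor and the critic), showing that two backward() calls on losses that share a subgraph trigger exactly this error:

import torch

actor_like = torch.nn.Linear(4, 2)
critic_like = torch.nn.Linear(2, 1)

action = actor_like(torch.randn(1, 4))      # shared subgraph (the "actor")
loss_critic = critic_like(action).sum()
loss_actor = (-critic_like(action)).sum()   # also passes through actor_like

loss_critic.backward()   # frees the buffers of the shared subgraph
loss_actor.backward()    # RuntimeError: trying to backward through the graph a second time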
Trying to fix it
This is how one would fix it for GANs, but not in this case; see the Actual solution paragraph below. Read on if you are curious about the topic.
If part of a neural network (critic in this case) does not take part in the current optimization step, it should be treated as a constant (and vice versa).
To do that, you could disable gradient tracking using the torch.no_grad context manager (documentation) and set the critic to eval mode (documentation), something along these lines:
self.critic.eval()
with torch.no_grad():
loss_actor = -self.critic(state, action)
...
But, here is a problem:
We are turning off gradient recording (the tape) for action and thereby breaking the graph!
Hence this is not a viable solution.
Actual solution
It is much simpler than you think; one can also see this pattern in PyTorch's repository:
Do not backpropagate right after each individual critic/actor loss. Instead:
calculate all losses (for both critic and actor),
sum them together,
call zero_grad on both optimizers,
backpropagate this summed value,
call critic_optimizer.step() and actor_optimizer.step() at this point.
Something along those lines:
self.critic_opt.zero_grad()
self.actor_opt.zero_grad()
loss_critic = F.smooth_l1_loss(value, expected)
loss_actor = -self.critic(state, action)
total_loss = loss_actor + loss_critic
total_loss.backward()
self.critic_opt.step()
self.actor_opt.step()

How does PyTorch's loss.backward() work when "retain_graph=True" is specified?

I'm a newbie with PyTorch and adversarial networks. I've tried to find an answer in the PyTorch documentation and in previous discussions both in the PyTorch and Stack Overflow forums, but I couldn't find anything useful.
I'm trying to train a GAN with a Generator and a Discriminator, but I cannot understand whether the whole process is working or not. As far as I understand, I should train the Generator first and then update the Discriminator's weights (similarly to this). My code for updating the weights of both models is:
# computing loss_g and loss_d...
optim_g.zero_grad()
loss_g.backward()
optim_g.step()
optim_d.zero_grad()
loss_d.backward()
optim_d.step()
where loss_g is the generator loss, loss_d is the discriminator loss, optim_g is the optimizer referring to the generator's parameters and optim_d is the discriminator optimizer.
If I run the code like this, I get an error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
So I specify loss_g.backward(retain_graph=True), and here comes my doubt: why should I specify retain_graph=True if there are two networks with two different graphs? Am I getting something wrong?
Having two different networks doesn't necessarily mean that the computational graph is different. The computational graph only tracks the operations that were performed from the input to the output and it doesn't matter where the operation takes place. In other words, if you use the output of the first model in the second model (e.g. model2(model1(input))), you have the same sequential operations as if they were part of the same model. In fact, that is no different from having different parts of the model, such as multiple convolutions, that you apply one after the other.
The error you get indicates that you are trying to backpropagate from the discriminator through the generator, which would mean that the discriminator's output directly adapts the generator's parameters for the discriminator to be successful. In an adversarial setting that is precisely what you want to avoid; they should be independent of each other. By setting retain_graph=True you incorrectly hide this bug. In nearly all cases retain_graph=True is not the solution and should be avoided.
To resolve the issue, the two models need to be made independent of each other. The crossover between the two models happens when you use the generator's output for the discriminator, since it should decide whether that output was real or fake. Something along these lines:
fake = generator(noise)
real_prediction = discriminator(real)
# Using the output of the generator directly continues the graph.
fake_prediction = discriminator(fake)
Even though fake comes from the generator, as far as the discriminator is concerned, it's merely another input, just like real. Therefore fake should be treated the same as real, where it is not attached to any computational graph. That can easily be done with torch.Tensor.detach, which decouples the tensor from the graph.
fake = generator(noise)
real_prediction = discriminator(real)
# Detach to make it independent of the generator
fake_prediction = discriminator(fake.detach())
That is also done in the code you referenced, from erikqu/EnhanceNet-PyTorch - train.py:
hr_imgs = torch.cat([discriminator(hr), discriminator(generated_hr.detach())], dim=0)
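Putting it together for the question's training loop: a rough sketch of the usual two-step GAN update, reusing generator, discriminator, optim_g and optim_d from above (criterion, real_labels and fake_labels are assumed placeholders):

fake = generator(noise)

# Discriminator step: detach fake so gradients stop at the generator.
loss_d = criterion(discriminator(real), real_labels) + \
         criterion(discriminator(fake.detach()), fake_labels)
optim_d.zero_grad()
loss_d.backward()
optim_d.step()

# Generator step: no detach here, gradients flow through the discriminator
# into the generator, but only optim_g updates its parameters.
loss_g = criterion(discriminator(fake), real_labels)
optim_g.zero_grad()
loss_g.backward()
optim_g.step()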

Train two consecutive models in tensorflow

I am trying to build a model in TensorFlow that uses two consecutive models. Unfortunately I can't include them within one model. The first model is basically an encoder; the second returns the value I need.
out = Model_a(image_input)
value = Model_b(out)
loss = f(value)
I can train Model_b using the given loss, but would then need the gradients of the first layer (of Model_b) with respect to the loss, in order to continue the gradient calculation in Model_a. Furthermore, I would somehow need a function that calculates gradients based on these upstream gradients instead of on a loss. Does anyone have an idea whether TensorFlow already has such functionality, or has anyone tackled similar problems?
Cheers
I found a working solution, for anyone who has similar problems. Using TensorFlow 2.0 and Keras eager mode (with GradientTape) one can construct any loss function as desired, even one that includes consecutive models. The important point is that predict() will not work here; one needs to call the models directly.
Now the gradients can be calculated for each model with respect to that loss function, which seems to work so far. It is important that the models themselves are used inside the loss computation and not a copy of the output, or at least that the copy is generated within the tape context. An example is given below:
optimizer = tf.keras.optimizers.Adam(lr=0.1)

with tf.GradientTape(persistent=True) as tape:
    error = (model2(model1(x)) - y)
    loss_value = tf.reduce_mean(tf.square(error))

gradients1 = tape.gradient(loss_value, model1.variables)
gradients2 = tape.gradient(loss_value, model2.variables)
optimizer.apply_gradients(zip(gradients1, model1.variables))
optimizer.apply_gradients(zip(gradients2, model2.variables))
If anyone finds a more efficient or prettier solution, I would be happy if they shared it.
Cheers
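As an addendum, the chain-rule split the question originally asked for (feeding Model_b's input gradient into Model_a) is also possible, because tf.GradientTape.gradient accepts an output_gradients argument that seeds the backward pass with an upstream gradient instead of a scalar loss. A rough, untested sketch reusing model1, model2, x and y from the snippet above:

with tf.GradientTape(persistent=True) as tape:
    intermediate = model1(x)
    loss_value = tf.reduce_mean(tf.square(model2(intermediate) - y))

gradients2 = tape.gradient(loss_value, model2.variables)
# gradient of the loss w.r.t. model1's output, i.e. model2's input
upstream = tape.gradient(loss_value, intermediate)
# propagate that upstream gradient into model1 without needing the loss again
gradients1 = tape.gradient(intermediate, model1.variables, output_gradients=upstream)
del tape  # release the persistent tape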

Policy Gradients in Keras

I've been trying to build a model using 'Deep Q-Learning' where I have a large number of actions (2908). After some limited success with standard DQN
(https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), I decided to do some more research because I figured the action space was too large for effective exploration.
I then discovered this paper: https://arxiv.org/pdf/1512.07679.pdf, where they use an actor-critic model and policy gradients, which then led me to: https://arxiv.org/pdf/1602.01783.pdf, where they use policy gradients to get much better results than DQN overall.
I've found a few sites where they have implemented policy gradients in Keras, https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html and https://oshearesearch.com/index.php/2016/06/14/kerlym-a-deep-reinforcement-learning-toolbox-in-keras/, however I'm confused about how they are implemented. In the former (and when I read the papers) it seems that, instead of providing an input-output pair for the actor network, you provide the gradients for all the weights and then use them to update the network, whereas in the latter they just calculate an input-output pair.
Have I just confused myself? Am I just supposed to be training the network by providing an input-output pair and using the standard fit(), or do I have to do something special? If it's the latter, how do I do it with the Theano backend? (The examples above use TensorFlow.)
TL;DR
Learn how to implement custom loss functions and gradients using keras.backend. You will need them for more advanced algorithms, and it is actually much easier once you get the hang of it.
One CartPole example using keras.backend is https://gist.github.com/kkweon/c8d1caabaf7b43317bc8825c226045d2 (its backend is TensorFlow, but it should be very similar, if not identical, with Theano).
Problem
When playing,
the agent needs a policy, which is basically a function that maps a state to a probability distribution over actions. The agent then chooses an action according to its policy.
i.e., policy = f(state)
When training,
Policy Gradient does not have a loss function in the usual sense. Instead, it tries to maximize the expected return. To do so, we need to compute the gradients of log(action_prob) * advantage, where
advantage is a function of rewards.
advantage = f(rewards)
action_prob is a function of states and action_taken. For example, we need to know which action we took so that we can update the parameters to increase/decrease the probability of that action.
action_prob = sum(policy * action_onehot) = f(states, action_taken)
I'm assuming something like this
policy = [0.1, 0.9]
action_onehot = action_taken = [0, 1]
then action_prob = sum(policy * action_onehot) = 0.9
Summary
We need two functions
update function: f(state, action_taken, reward)
choose action function: f(state)
You already know it's not easy to implement like a typical classification problem, where you can just model.compile(...) -> model.fit(X, y).
However,
In order to fully utilize Keras, you should be comfortable with defining custom loss functions and gradients. This is basically the approach the author of the former example took.
You should read more of the documentation on the Keras functional API and keras.backend.
Plus, there are many kinds of policy gradients.
The former one is DDPG, which is actually quite different from regular policy gradients.
The latter one is a traditional REINFORCE policy gradient (pg.py) based on Karpathy's policy gradient example. It is very simple though; for example, it assumes only a single action. That's why it could be implemented with model.fit(...) instead.
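A minimal, untested sketch of the "update function" from the summary above, in the spirit of the CartPole gist linked earlier. It assumes a graph-mode Keras 2.x setup (Theano or TF1-style backend) where K.placeholder, K.function and Optimizer.get_updates(loss, params) are available; model is the policy network with a softmax output, and all other names are placeholders:

import keras.backend as K
from keras import optimizers

def build_train_fn(model, n_actions):
    action_onehot = K.placeholder(shape=(None, n_actions), name="action_onehot")
    advantage = K.placeholder(shape=(None,), name="advantage")

    # probability of the action actually taken, as described above
    action_prob = K.sum(model.output * action_onehot, axis=1)
    loss = -K.mean(K.log(action_prob + 1e-10) * advantage)

    adam = optimizers.Adam()
    updates = adam.get_updates(loss=loss, params=model.trainable_weights)
    # train_fn(states, onehot_actions, advantages) performs one policy update
    return K.function(inputs=[model.input, action_onehot, advantage],
                      outputs=[loss], updates=updates)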
References
Schulman, "Policy Gradient Methods", http://rll.berkeley.edu/deeprlcourse/docs/lec2.pdf
The seemingly conflicting implementations you encountered are both valid. They are two equivalent ways to implement policy gradients.
In the vanilla implementation, you compute the policy gradient (the gradient of log(action_prob) weighted by the returns) with respect to the network weights yourself and update the weights directly in that direction. This requires the steps described by Mo K.
The second option is arguably a more convenient implementation for autodiff frameworks like Keras/TensorFlow. The idea is to set up an input-output (state-action) mapping like in supervised learning, but with a loss function whose gradient is identical to the policy gradient. For a softmax policy, this simply means predicting the 'true action' and multiplying the (cross-entropy) loss with the observed returns/advantage. Aleksis Pirinen has some useful notes about this [1].
The modified loss function for option 2 in Keras looks like this:
import keras.backend as K

def policy_gradient_loss(Returns):
    def modified_crossentropy(action, action_probs):
        cost = K.categorical_crossentropy(action, action_probs, from_logits=False, axis=1) * Returns
        return K.mean(cost)
    return modified_crossentropy
where 'action' is the true action taken in the episode (y_true) and action_probs is the predicted probability (y_pred). This is based on another Stack Overflow question [2].
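A related, mathematically equivalent trick (names such as policy_model, batch_states, batch_onehot_actions and batch_returns are placeholders): since the modified loss is just the per-sample cross-entropy weighted by the return, you can also compile with plain categorical cross-entropy and pass the returns as Keras sample weights:

policy_model.compile(optimizer='adam', loss='categorical_crossentropy')
policy_model.fit(batch_states, batch_onehot_actions,
                 sample_weight=batch_returns, verbose=0)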
References
[1] https://aleksispi.github.io/assets/pg_autodiff.pdf
[2] Make a custom loss function in keras (Stack Overflow)

How does one set different learning rates for different layers or variables in TensorFlow?

I know that one can simply set it for all of them using something like in the tutorials:
opt = tf.train.GradientDescentOptimizer(learning_rate)
however, it would be nice if one could pass a dictionary that maps a variable name to its corresponding learning rate. Is that possible?
I know that one could simply use compute_gradients() followed by apply_gradients() and do it manually, but that seems silly. Is there a smarter way to assign specific learning rates to specific variables?
Is the only way to do this to create specific optimizer as in:
# Create an optimizer with the desired parameters.
opt = GradientDescentOptimizer(learning_rate=0.1)
# Add Ops to the graph to minimize a cost by updating a list of variables.
# "cost" is a Tensor, and the list of variables contains tf.Variable
# objects.
opt_op = opt.minimize(cost, var_list=<list of variables>)
and simply give a specific learning rate to each optimizer? But that would mean we have a list of optimizers and hence would need to apply the learning rule with sess.run for each optimizer. Right?
As far as I can tell this is not possible, mostly because it would not really be valid gradient descent anymore. There are plenty of optimizers that learn their own variable-specific scaling factors (like Adam or AdaGrad). Specifying a constant per-variable learning rate would mean that you no longer follow the gradient, and while that makes sense for mathematically well-formulated methods, simply setting them to pre-defined values is just a heuristic, which I believe is the reason this is not implemented in core TF.
As you said, you can always do it on your own: define your own optimizer, or iterate over the variables between computing the gradients and applying them, which is around 3-4 lines of code (one to compute the gradients, one to iterate and add the multiplication ops, and one to apply them back). As far as I know, this is the simplest way to achieve your goal.
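For illustration, a short untested sketch of that manual approach in the TF1-style API the question uses; cost and the variable list are the question's own placeholders, and lr_map is an assumed dict mapping variable names to per-variable scaling factors:

opt = tf.train.GradientDescentOptimizer(learning_rate=1.0)
grads_and_vars = opt.compute_gradients(cost, var_list=variables)
# scale each gradient by its variable-specific factor before applying
scaled = [(g * lr_map.get(v.op.name, 1.0), v)
          for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(scaled)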
