Using an A2C agent from this article, how can I get the numerical values of value_loss, policy_loss and entropy_loss while the weights are being updated?
The model I'm using is double-headed; both heads share the same trunk. The policy head's output shape is [number of actions, batch size] and the value head's shape is [1, batch size]. Compiling this model returns a size-incompatibility error when these loss functions are given as metrics:
self.model.compile(optimizer=self.optimizer,
                   metrics=[self._logits_loss, self._value_loss],
                   loss=[self._logits_loss, self._value_loss])
Both self._value_loss and self._policy_loss are executed as graphs, meaning that all variables inside them are only pointers to graph nodes. I found some examples where Tensor objects are evaluated with eval() to get the values out of those nodes. I don't understand them, because to eval() a Tensor object you need to give it a Session, and in TensorFlow 2.x Sessions are deprecated.
Another lead: when calling train_on_batch() from the Keras Model API to train the model, the method returns losses. I don't understand why, but the only losses it returns are from the policy head. The loss for that head is calculated as policy_loss - entropy_loss, but my goal is to get all three losses separately so I can plot them.
Any help is welcome, I'm stuck.
I found the answer to my problem. In Keras, the built-in metrics functionality provides an interface for measuring the performance and losses of a model, be it custom or standard.
When compiling a model as follows:
self.model.compile(optimizer=ko.RMSprop(lr=lr),
                   metrics=dict(output_1=self._entropy_loss),
                   loss=dict(output_1=self._logits_loss, output_2=self._value_loss))
... self.model.train_on_batch([...]) returns a list of [total_loss, logits_loss, value_loss, entropy_loss]. Since logits_loss was defined as policy_loss - entropy_loss, policy_loss can be recovered as logits_loss + entropy_loss. Beware that this solution results in self._entropy_loss() being called twice.
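For illustration, a minimal sketch of unpacking that list (the batch names observations, acts_and_advs and returns are assumptions about the surrounding training loop, not taken from the article):

# Hypothetical training step; the returned order follows the compile() call above.
total_loss, logits_loss, value_loss, entropy_loss = self.model.train_on_batch(
    observations, [acts_and_advs, returns])
policy_loss = logits_loss + entropy_loss  # logits_loss was defined as policy_loss - entropy_loss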
I have an Actor-Critic setup where the Actor is its own class and the Critic is its own class, each with its own neural network and .forward() function. I then create an object of each of these classes in a larger Model class. My setup is as follows:
self.actor = Actor().to(device)
self.actor_opt = optim.Adam(self.actor.parameters(), lr=lr)
self.critic = Critic().to(device)
self.critic_opt = optim.Adam(self.critic.parameters(), lr=lr)
I then calculate two different loss functions and want to update each neural network separately. For the critic:
loss_critic = F.smooth_l1_loss(value, expected)
self.critic_opt.zero_grad()
loss_critic.backward()
self.critic_opt.step()
and for the actor:
loss_actor = -self.critic(state, action)
self.actor_opt.zero_grad()
loss_actor.backward()
self.actor_opt.step()
However, when doing this, I get the following error:
RuntimeError: Trying to backward through the graph a second time, but the saved
intermediate results have already been freed. Specify retain_graph=True when
calling backward the first time.
When reading up on this, I understood that retain_graph=True is only needed when calling backward twice on the same graph, and that in most cases setting it to True is not a good idea because I will run out of GPU memory. Moreover, when I comment out one of the .backward() calls, the error goes away, which leads me to believe that for some reason the code thinks both backward() calls are made on the same neural network, even though I think I am doing it separately. What could be the reason for this? Is there a way to specify which neural network I am calling the backward function on?
Edit:
For reference, the optimize() function in this code here https://github.com/wudongming97/PyTorch-DDPG/blob/master/train.py uses backward() twice with no issue (I've cloned the repo and tested it). I'd like my code to operate similarly where I backprop through critic and actor separately.
Yes, you shouldn't do it like that. What you should do instead is backpropagate through separate parts of the graph.
What the graph contains
Currently, the graph contains both the actor and the critic. If the computations pass through the same part of the graph twice (say, twice through the actor), it will raise this error.
And they will, as you clearly use the actor and the critic joined in the loss value (this line: loss_actor = -self.critic(state, action)).
Using different optimizers does not change anything here, as it's a backward() problem (optimizers simply apply the computed gradients to the models).
Trying to fix it
This is how one would fix it in GANs, but not in this case; see the Actual solution section below. Read on if you are curious about the topic.
If part of a neural network (critic in this case) does not take part in the current optimization step, it should be treated as a constant (and vice versa).
To do that, you could disable gradient tracking using the torch.no_grad context manager (documentation) and set the critic to eval mode (documentation), something along these lines:
self.critic.eval()
with torch.no_grad():
    loss_actor = -self.critic(state, action)
    ...
But here is the problem: we are turning off gradient recording (the tape) for action, which breaks the graph, hence this is not a viable solution.
Actual solution
It is much simpler than you might think; one can see the same pattern in PyTorch's repository as well:
Do not backpropagate right after each critic/actor loss. Instead:
calculate all losses (for both the critic and the actor),
sum them together,
call zero_grad() on both optimizers,
backpropagate once with this summed value,
and call critic_optimizer.step() and actor_optimizer.step() at this point.
Something along those lines:
self.critic_opt.zero_grad()
self.actor_opt.zero_grad()
loss_critic = F.smooth_l1_loss(value, expected)
loss_actor = -self.critic(state, action)
total_loss = loss_actor + loss_critic
total_loss.backward()
self.critic_opt.step()
self.actor_opt.step()
I'm a newbie with PyTorch and adversarial networks. I've tried to look for an answer on the PyTorch documentation and from previous discussions both in the PyTorch and StackOverflow forums, but I couldn't find anything useful.
I'm trying to train a GAN with a Generator and a Discriminator, but I cannot tell whether the whole process is working or not. As far as I understand, I should train the Generator first and then update the Discriminator's weights (similar to this). My code for updating the weights of both models is:
# computing loss_g and loss_d...
optim_g.zero_grad()
loss_g.backward()
optim_g.step()
optim_d.zero_grad()
loss_d.backward()
optim_d.step()
where loss_g is the generator loss, loss_d is the discriminator loss, optim_g is the optimizer referring to the generator's parameters and optim_d is the discriminator optimizer.
If I run the code like this, I get an error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
So I specify loss_g.backward(retain_graph=True), and here comes my doubt: why should I specify retain_graph=True if there are two networks with two different graphs? Am I getting something wrong?
Having two different networks doesn't necessarily mean that the computational graph is different. The computational graph only tracks the operations that were performed from the input to the output and it doesn't matter where the operation takes place. In other words, if you use the output of the first model in the second model (e.g. model2(model1(input))), you have the same sequential operations as if they were part of the same model. In fact, that is no different from having different parts of the model, such as multiple convolutions, that you apply one after the other.
The error you get indicates that you are trying to backpropagate from the discriminator through the generator, which would mean that the discriminator's output directly adapts the generator's parameters for the discriminator to be successful. In an adversarial setting that is precisely what you want to avoid; they should be independent of each other. By setting retain_graph=True you incorrectly hide this bug. In nearly all cases retain_graph=True is not the solution and should be avoided.
To resolve the issue, the two models need to be made independent of each other. The crossover between the two models happens when you use the generator's output for the discriminator, since it has to decide whether that was real or fake. Something along these lines:
fake = generator(noise)
real_prediction = discriminator(real)
# Using the output of the generator, continues the graph.
fake_prediction = discriminator(fake)
Even though fake comes from the generator, as far as the discriminator is concerned it's merely another input, just like real. Therefore fake should be treated the same as real, i.e. not attached to any computational graph. That can easily be done with torch.Tensor.detach, which decouples the tensor from the graph.
fake = generator(noise)
real_prediction = discriminator(real)
# Detach to make it independent of the generator
fake_prediction = discriminator(fake.detach())
That is also done in the code you referenced, from erikqu/EnhanceNet-PyTorch - train.py:
hr_imgs = torch.cat([discriminator(hr), discriminator(generated_hr.detach())], dim=0)
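For completeness, here is a minimal sketch of how the two updates could fit together. The model, optimizer and loss choices (binary cross-entropy on discriminator logits) are assumptions, not code from the referenced repository:

import torch
import torch.nn.functional as F

# generator, discriminator, optim_g, optim_d, real and noise are assumed to exist.
fake = generator(noise)

# Discriminator update: fake is detached, so no gradients reach the generator.
real_prediction = discriminator(real)
fake_prediction = discriminator(fake.detach())
loss_d = (F.binary_cross_entropy_with_logits(real_prediction, torch.ones_like(real_prediction))
          + F.binary_cross_entropy_with_logits(fake_prediction, torch.zeros_like(fake_prediction)))
optim_d.zero_grad()
loss_d.backward()
optim_d.step()

# Generator update: no detach here, so gradients flow back through the generator.
fake_prediction = discriminator(fake)
loss_g = F.binary_cross_entropy_with_logits(fake_prediction, torch.ones_like(fake_prediction))
optim_g.zero_grad()
loss_g.backward()
optim_g.step()

Because the discriminator step only backpropagated through the detached branch, the generator's graph is still intact for the second backward, so no retain_graph is needed.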
Being new to Tensorflow, I am trying to understand the difference between underlying functionality of tf.gradients and tf.keras.backend.gradients.
The latter finds the gradients of the cost function with respect to the input feature values.
But I couldn't get a clear idea about the former: does it compute the gradient over the cost function or over the output probabilities? (For example, consider binary classification using a simple feed-forward network. The output probability here refers to the sigmoid activation of the final layer with a single neuron, and the cost is given by binary cross-entropy.)
I have referred to the official documentation for tf.gradients, but it is short and vague (for me), and I did not get a clear picture - the documentation just calls it 'y' - is it the cost or the output probability?
Why do I need the gradients?
To implement basic gradient-based feature attribution.
They are basically the same. tf.keras is TensorFlow's high-level API for building and training deep learning models, used for fast prototyping, state-of-the-art research, and production. tf.keras uses TensorFlow as its backend. By looking at the tf.keras source code here, we can see that tf.keras.backend.gradients indeed uses tf.gradients:
# Part of Keras.backend.py
from tensorflow.python.ops import gradients as gradients_module

@keras_export('keras.backend.gradients')
def gradients(loss, variables):
  """Returns the gradients of `loss` w.r.t. `variables`.

  Arguments:
      loss: Scalar tensor to minimize.
      variables: List of variables.

  Returns:
      A gradients tensor.
  """
  # ========
  # Uses tensorflow's gradient function
  # ========
  return gradients_module.gradients(
      loss, variables, colocate_gradients_with_ops=True)
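For the feature-attribution use case, the quantity needed from either API is the gradient of the cost with respect to the inputs. A minimal sketch in TF 2.x eager mode, assuming a hypothetical binary classifier model with inputs x and labels y (graph-mode tf.gradients computes the same quantity):

import tensorflow as tf

x = tf.convert_to_tensor(x, dtype=tf.float32)
with tf.GradientTape() as tape:
    tape.watch(x)                               # x is not a Variable, so watch it explicitly
    probs = model(x)                            # sigmoid output of the final single-neuron layer
    loss = tf.keras.losses.binary_crossentropy(y, probs)
# gradient of the cost w.r.t. the input features - a basic saliency/attribution map
saliency = tape.gradient(loss, x)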
I am trying to build a model in TensorFlow in which I use two consecutive models. Unfortunately I can't include them within one model. The first model is basically an encoder, the second returns the value I need.
out = Model_a(image_input)
value = Model_b(out)
loss = f(value)
I can train Model_b using the given loss, but I would then need the gradients of the first layer (of Model_b) with respect to the loss in order to continue the gradient calculation in Model_a. Furthermore, I would somehow need a function that calculates gradients based on these gradients instead of on a loss function. Does anyone know whether TensorFlow already has such functionality, or has anyone tackled similar problems?
Cheers
I found a working solution, for anyone who has similar problems. Using TensorFlow 2.0 and Keras eager mode (with GradientTape), one can construct any loss function as desired, even one involving consecutive models. The important point is that the predict function will not work; one needs to call the models directly.
The gradients can now be calculated for each model with respect to that loss function, which seems to work so far. It is important that the models themselves are used inside the loss computation and not a copy of their output, or at least that the copy is generated within the tape context. An example is given below:
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(lr=0.1)

# persistent=True allows tape.gradient() to be called more than once
with tf.GradientTape(persistent=True) as tape:
    error = model2(model1(x)) - y              # call the models directly, not predict()
    loss_value = tf.reduce_mean(tf.square(error))

gradients1 = tape.gradient(loss_value, model1.variables)
gradients2 = tape.gradient(loss_value, model2.variables)
optimizer.apply_gradients(zip(gradients1, model1.variables))
optimizer.apply_gradients(zip(gradients2, model2.variables))
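A possibly tidier variant, which avoids the persistent tape, is to compute all gradients in a single call over the concatenated variable lists (trainable_variables is used here, which excludes any non-trainable weights):

variables = model1.trainable_variables + model2.trainable_variables
with tf.GradientTape() as tape:
    loss_value = tf.reduce_mean(tf.square(model2(model1(x)) - y))
gradients = tape.gradient(loss_value, variables)
optimizer.apply_gradients(zip(gradients, variables))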
If anyone finds a more efficient or "prettier" solution, I would be happy if they shared it.
Cheers
I've been given a fully trained model by another researcher that has inputs as placeholders. Regarding it as a function f(x), I would like to find x to minimize my distance metric (loss function) dist(x, f(x)). This could be something like the euclidean distance between the two points.
I tried to use TensorFlow's built-in optimizer functions. The issue is that tf.train.AdamOptimizer(1e-4).minimize(loss, var_list=[input_placeholder]) fails, complaining that input_placeholder isn't of a supported type. Thus, I cannot get gradients for my input.
How can I optimize a function in TensorFlow when the inputs have to be specified in this way? Unfortunately, these placeholders are not passed through a Variable first, and I have to treat that model as a black box.
Using the Keras functional API detailed in this question, I created a dense layer with no bias to sit right before the model I was given. Holding its input as a constant all-ones vector, I optimized the joined model using only the Variable in the dense layer, giving me the optimal vector as the output of that layer.
All TensorFlow Optimizer subclasses allow you to minimize while only modifying a particular set of Variables, which I got out of Keras fairly simply.
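A rough sketch of that trick, adapted to TF 2.x eager execution for brevity (the original answer used a TF 1.x optimizer's var_list argument; names such as given_model and input_dim are assumptions, not from the original code):

import tensorflow as tf

input_dim = 10                                             # dimensionality of x (assumption)
dense = tf.keras.layers.Dense(input_dim, use_bias=False)   # its kernel is the vector being optimized
dense.build((1, 1))                                        # create the kernel before the loop
ones = tf.ones((1, 1))                                     # constant all-ones input
optimizer = tf.keras.optimizers.Adam(1e-4)

for _ in range(1000):
    with tf.GradientTape() as tape:
        x = dense(ones)                          # candidate input, shape (1, input_dim)
        fx = given_model(x)                      # the black-box model f(x)
        loss = tf.reduce_sum(tf.square(x - fx))  # dist(x, f(x)): squared Euclidean distance
    # only the dense layer's kernel is updated; the given model's weights stay fixed
    grads = tape.gradient(loss, dense.trainable_variables)
    optimizer.apply_gradients(zip(grads, dense.trainable_variables))

optimal_x = dense(ones)                          # the optimized input vector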