I am trying to build a model in TensorFlow that uses two consecutive models. Unfortunately, I can't combine them into a single model. The first model is basically an encoder; the second returns the value I need.
out = Model_a(image_input)
value = Model_b(out)
loss = f(value)
I can train Model_b using the given loss, but I would then need the gradients of the first layer of Model_b with respect to the loss in order to continue the gradient calculation in Model_a. Furthermore, I would somehow need a function that calculates gradients based on those incoming gradients rather than on a loss function. Does anyone know whether TensorFlow already offers such functionality, or has anyone tackled a similar problem?
Cheers
I found a working solution, for anyone with a similar problem. Using TensorFlow 2.0 and Keras in eager mode (via GradientTape) one can construct any loss function desired, even one spanning consecutive models. The important point is that predict() will not work here; one needs to call the models directly.
The gradients can then be calculated for each model with respect to that loss function, which seems to work so far. It is important that the models themselves are called inside the loss computation, not a copy of their output, or that the copy is at least generated within the tape context. An example is found below:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

# Both forward passes must happen inside the tape; persistent=True allows
# tape.gradient() to be called more than once.
with tf.GradientTape(persistent=True) as tape:
    error = model2(model1(x)) - y
    loss_value = tf.reduce_mean(tf.square(error))

gradients1 = tape.gradient(loss_value, model1.trainable_variables)
gradients2 = tape.gradient(loss_value, model2.trainable_variables)
del tape  # release the persistent tape's resources

optimizer.apply_gradients(zip(gradients1, model1.trainable_variables))
optimizer.apply_gradients(zip(gradients2, model2.trainable_variables))
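A slightly leaner variant (my own sketch, not tested on the original models) avoids the persistent tape by making a single gradient call over the combined variable list:

all_vars = model1.trainable_variables + model2.trainable_variables
with tf.GradientTape() as tape:
    loss_value = tf.reduce_mean(tf.square(model2(model1(x)) - y))
gradients = tape.gradient(loss_value, all_vars)
optimizer.apply_gradients(zip(gradients, all_vars))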
If anyone finds a more efficient or "prettier" solution, I would be happy if they shared it.
Cheers
I have an actor-critic neural network where the Actor is its own class and the Critic is its own class, each with its own network and .forward() function. I then create an object of each of these classes in a larger Model class. My setup is as follows:
self.actor = Actor().to(device)
self.actor_opt = optim.Adam(self.actor.parameters(), lr=lr)
self.critic = Critic().to(device)
self.critic_opt = optim.Adam(self.critic.parameters(), lr=lr)
I then calculate two different loss functions and want to update each neural network separately. For the critic:
loss_critic = F.smooth_l1_loss(value, expected)
self.critic_opt.zero_grad()
loss_critic.backward()
self.critic_opt.step()
and for the actor:
loss_actor = -self.critic(state, action)
self.actor_opt.zero_grad()
loss_actor.backward()
self.actor_opt.step()
However, when doing this, I get the following error:
RuntimeError: Trying to backward through the graph a second time, but the saved
intermediate results have already been freed. Specify retain_graph=True when
calling backward the first time.
When reading up on this, I understood that I only need retain_graph=True when calling backward twice on the same graph, and that in most cases it is not good to set it to True, as I will run out of GPU memory. Moreover, when I comment out one of the .backward() calls, the error goes away, leading me to believe that for some reason the code thinks both backward() calls are being made on the same network, even though I think I am doing it separately. What could be the reason for this? Is there a way to specify which network I am calling the backward function on?
Edit:
For reference, the optimize() function in this code here https://github.com/wudongming97/PyTorch-DDPG/blob/master/train.py uses backward() twice with no issue (I've cloned the repo and tested it). I'd like my code to operate similarly, where I backprop through the critic and the actor separately.
Yes, you shouldn't do it like that. What you should do instead is propagate through the appropriate parts of the graph.
What the graph contains
Right now, the graph contains both the actor and the critic. If the backward computations pass through the same part of the graph twice (say, twice through the actor), it will raise this error.
And they will, as you clearly use the actor and the critic joined through the loss value (this line: loss_actor = -self.critic(state, action)).
Using different optimizers does not change anything here, as it's a backward() problem (optimizers simply apply the calculated gradients to the models).
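A minimal standalone sketch of the failure mode (illustrative, not your code): two losses share one module, and each gets its own backward() call:

import torch

net = torch.nn.Linear(4, 4)
h = net(torch.randn(1, 4))  # both losses pass through net's part of the graph
loss_a = h.sum()
loss_b = (h * h).sum()

loss_a.backward()  # frees the intermediate buffers saved for net's backward pass
loss_b.backward()  # RuntimeError: Trying to backward through the graph a second time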
Trying to fix it
This is how one would fix it in GANs, but it does not work in this case; see the Actual solution section below. Read on if you are curious about the topic.
If part of a neural network (critic in this case) does not take part in the current optimization step, it should be treated as a constant (and vice versa).
To do that, you could disable gradient recording with the torch.no_grad context manager (documentation) and set the critic to eval mode (documentation), something along these lines:
self.critic.eval()
with torch.no_grad():
    loss_actor = -self.critic(state, action)
    ...
But here is the problem:
We are turning off gradient recording for action and breaking the graph!
Hence this is not a viable solution.
Actual solution
It is much simpler than you might think; one can see it in PyTorch's repository as well:
Do not backpropagate right after each loss. Instead:
Calculate both losses (for the critic and for the actor).
Sum them together.
Call zero_grad() on both optimizers.
Backpropagate once with the summed value.
Call critic_optimizer.step() and actor_optimizer.step() at this point.
Something along those lines:
self.critic_opt.zero_grad()
self.actor_opt.zero_grad()
loss_critic = F.smooth_l1_loss(value, expected)
loss_actor = -self.critic(state, action)
total_loss = loss_actor + loss_critic
total_loss.backward()
self.critic_opt.step()
self.actor_opt.step()
I am currently trying to implement a custom optimization scheme for a custom TensorFlow layer.
Without going into too much detail, I have added a small code sample which illustrates how my current code works. The important part is that calculate_gradients(variables, gradients, momentum) is a function that requires the values and gradients of all the variables in the layer. Furthermore, this calculation produces intermediate results which have to be stored between optimization steps, which explains the illustrative momentum variable. This behaviour, to me, rules out using @custom_gradient, since it does not allow me to propagate these intermediate results to the optimizer, which would then have to return them to the custom gradient function for use in calculating the next set of gradients. Unless someone knows how this would work (question one), I have not found a way around this.
model = build_model()

for data, target in dataset:
    # record the forward pass, then compute the gradients for all variables
    with tf.GradientTape() as tape:
        loss_value = loss(model(data), target)
    gradients = tape.gradient(loss_value, model.trainable_variables)

    for layer in model.layers:
        layer_gradients = gradients[indices]  # actual indexing is not important
        new_gradients = calculate_gradients(layer.variables, layer_gradients, momentum)
        for variable, grad in zip(layer.variables, new_gradients):
            variable.assign(grad)  # here the transformed "gradients" are the new variable values
Trying to implement this in a TensorFlow optimizer, particularly by overriding _resource_apply_dense as shown in the documentation [1], I am running into some trouble with the layer-wise behaviour.
In particular, _resource_apply_dense takes a single variable and its gradient. The second code snippet illustrates what I am trying to do, but I have currently not found a way to implement the get_other_variables_and_gradients(var) behaviour. Furthermore, this solution would calculate the gradients three times for each layer, which is very suboptimal.
def _resource_apply_dense(self, var, grad, apply_state):
    # pseudocode: there is no obvious way to look up the sibling
    # variables and gradients belonging to the same layer as `var`
    other_vars_and_grads = get_other_variables_and_gradients(var)
    new_grad = calculate_gradients(zip((var, grad), other_vars_and_grads))
    var.assign(new_grad)
In short, my second question is: does anyone have an idea how to implement this behaviour, ideally without redundant calculations, or perhaps a whole new and better approach? Currently the optimization works when I do everything in a training loop, as shown in the first code snippet, so this is merely a matter of integrating with the TensorFlow optimizer paradigm, and of performance, since doing everything very 'pythony' with lists in a large for loop is slow.
[1] https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Optimizer
I have answered my own question and am replying in order to archive it. After revisiting this issue and going through the tf.keras.optimizers.Optimizer code on GitHub [1], I discovered the private method _transform_unaggregated_gradients. This method is called before the gradients are applied and allows the necessary behaviour. In my case it looks something like this:
def _transform_unaggregated_gradients(self, grads_and_vars):
    for layer in model.layers:
        layer_gradients, layer_variables = grads_and_vars[indices]  # actual indexing is not important
        new_gradients = calculate_gradients(layer_variables, layer_gradients, momentum)
        grads_and_vars[indices] = (new_gradients, layer_variables)
    return grads_and_vars
[1] https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L115-L1414
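For completeness, a sketch of how this could be wired into a subclass (my own, relying on the private hook above, so it may break between TF versions; layer_slices and calculate_gradients are stand-ins for the indexing and math from the question):

import tensorflow as tf

class LayerwiseSGD(tf.keras.optimizers.SGD):
    def __init__(self, layer_slices, momentum_state, **kwargs):
        super().__init__(**kwargs)
        self.layer_slices = layer_slices      # one slice of the gradient list per layer
        self.momentum_state = momentum_state  # intermediate results kept between steps

    def _transform_unaggregated_gradients(self, grads_and_vars):
        grads_and_vars = list(grads_and_vars)
        for s in self.layer_slices:
            grads, variables = zip(*grads_and_vars[s])
            # calculate_gradients as defined in the question
            new_grads = calculate_gradients(variables, grads, self.momentum_state)
            grads_and_vars[s] = list(zip(new_grads, variables))
        return grads_and_vars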
Using an A2C agent from this article, how can I get the numerical values of value_loss, policy_loss and entropy_loss while the weights are being updated?
The model I'm using is double-headed: both heads share the same trunk. The policy head's output shape is [number of actions, batch size] and the value head's is [1, batch_size]. Compiling this model returns a size-incompatibility error when these loss functions are given as metrics:
self.model.compile(optimizer=self.optimizer,
                   metrics=[self._logits_loss, self._value_loss],
                   loss=[self._logits_loss, self._value_loss])
Both self._value_loss and self._policy_loss are executed as graphs, meaning that all variables inside them are only pointers to graph nodes. I found some examples where Tensor objects are evaluated (with eval()) to get the values out of those nodes. I don't understand them, because in order to eval() a Tensor object you need to give it a Session, but in TensorFlow 2.x Sessions are deprecated.
Another lead: when calling train_on_batch() from the Keras Model API to train the model, the method returns losses. I don't understand why, but the only losses it returns are from the policy head. Losses from that head are calculated as policy_loss - entropy_loss, but my goal is to get all three losses separately to visualize them in a graph.
Any help is welcome, I'm stuck.
I found the answer to my problem. In Keras, the built-in metrics functionality provides an interface for measuring the performance and losses of the model, be it a custom or a standard one.
When compiling a model as follows:
self.model.compile(optimizer=ko.RMSprop(lr=lr),
                   metrics=dict(output_1=self._entropy_loss),
                   loss=dict(output_1=self._logits_loss, output_2=self._value_loss))
... self.model.train_on_batch([...]) returns a list of [total_loss, logits_loss, value_loss, entropy_loss]. By computing logits_loss + entropy_loss, the value of policy_loss can be obtained. Beware that this solution results in self._entropy_loss() being called twice.
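For illustration, the unpacking then looks something like this (a sketch; x, y_policy and y_value are placeholders for an actual batch):

losses = self.model.train_on_batch(x, dict(output_1=y_policy, output_2=y_value))
total_loss, logits_loss, value_loss, entropy_loss = losses
policy_loss = logits_loss + entropy_loss  # recover the full policy loss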
Assume I create model A, which has a similar, but not exactly the same, architecture as the compiled model B. Can I compile model A as follows?
model_A.compile(model_B.optimizer,
                loss=model_B.loss,
                metrics=model_B.metrics,
                )
I am most worried that some values stored in the optimizer (e.g. updates, weights, ...) are specific to the model architecture and may yield a mismatch. Can somebody explain what exactly happens when I perform such a copy? I couldn't extract helpful information from the source code (l37ff).
P.s.: Is the state of the optimizer also copied this way? If not, can you copy it somehow?
You can use the optimizer from one model in another. Most optimizers take the learning rate, momentum, decay, etc. as arguments. model.compile configures the model for training according to your arguments; the optimizer only determines how the loss is propagated back into the weights after it has been calculated.
You would typically change the optimizer only to make the model converge faster on the given data.
But you may not be able to use the same loss function for different models (model B could use MSE while model A has softmax as its last layer). The same holds true for the metrics, such as accuracy.
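If you want model B's optimizer settings without sharing the instance itself, one option (a sketch with small stand-in models) is to rebuild a fresh optimizer from model B's configuration; slot variables such as Adam's moment estimates are created lazily per model, so a fresh instance carries no state tied to model B's weights:

import tensorflow as tf

# stand-ins for the real models from the question
model_B = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
model_B.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss='mse')
model_A = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])

# same optimizer type and hyper-parameters, but no accumulated state
fresh_opt = type(model_B.optimizer).from_config(model_B.optimizer.get_config())
model_A.compile(optimizer=fresh_opt, loss='mse')  # pick a loss that fits model A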
I'm trying to implement Google's FaceNet paper:
First of all, is it possible to implement this paper using the Sequential API of Keras, or should I go for the functional (graph) API?
In either case, could you please tell me how to pass the custom loss function tripletLoss to model.compile(), and how to receive the anchor, positive and negative embeddings as parameters in order to calculate the loss?
Also, what should the second parameter y in model.fit() be? I do not have any labels in this case...
This issue explains how to create a custom objective (loss) in Keras:
def dummy_objective(y_true, y_pred):
    return 0.5  # your implementation of tripletLoss here

model.compile(loss=dummy_objective, optimizer='adadelta')
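As a concrete sketch of what tripletLoss itself could look like (my own, assuming you arrange the model so that y_pred concatenates the anchor, positive and negative embeddings along the last axis, with margin 0.2 as in the paper):

import tensorflow as tf

def triplet_loss(y_true, y_pred, alpha=0.2):
    # y_pred: [batch, 3 * embedding_dim] -> recover the three embeddings
    anchor, positive, negative = tf.split(y_pred, 3, axis=-1)
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + alpha, 0.0))

model.compile(loss=triplet_loss, optimizer='adadelta')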
Regarding the y parameter of .fit(): since you are the one handling it in the end (the y_true parameter of the objective function is taken from it), I would say you can pass whatever you need that fits through Keras' plumbing, and maybe a dummy vector to pass the dimension checks if you really don't need any supervision.
Finally, as to how to implement this particular paper: searching for triplet or facenet in the Keras docs didn't return anything, so you'll probably have to either implement it yourself or find someone who has.