TensorFlow 2 takes about 15 minutes to build its static graph (or whatever it is doing before the first pass). The training time after this is normal, but it is obviously hard to experiment when every bit of feedback takes 15 minutes.
The generator encoder and discriminator are RNNs (not unrolled) with GRU cells in a Keras model.
The generator decoder is defined and called like this:
class GeneratorDecoder(tf.keras.layers.Layer):
    def __init__(self, feature_dim):
        super(GeneratorDecoder, self).__init__()
        self.cell = tf.keras.layers.GRUCell(
            GRUI_DIM, activation='tanh', recurrent_activation='sigmoid',
            dropout=DROPOUT, recurrent_dropout=DROPOUT)
        self.batch_normalization = tf.keras.layers.BatchNormalization()
        self.dense = tf.keras.layers.Dense(
            feature_dim, activation='tanh')

    @tf.function
    def __call__(self, z, timesteps, training):
        # z has shape (batch_size, features)
        outputs = []
        output, state = z, z
        for i in range(timesteps):
            output, state = self.cell(inputs=output, states=state,
                                      training=training)
            dense_output = self.dense(
                self.batch_normalization(output))
            outputs.append(dense_output)
        return outputs
Here is my training loop (the mask_gt and missing_data variables are cast using tf.cast, so they should already be tensors):
for it in tqdm(range(NO_ITERATIONS)):
    print(it)
    train_step()


@tf.function
def train_step():
    with tf.GradientTape(persistent=True) as tape:
        generator_output = generator(missing_data, training=True)
        imputed_data = get_imputed_data(missing_data, generator_output)
        mask_pred = discriminator(imputed_data)
        D_loss = discriminator.loss(mask_pred, mask_gt)
        G_loss = generator.loss(missing_data, mask_gt,
                                generator_output, mask_pred)

    gen_enc_grad = tape.gradient(
        G_loss, generator.encoder.trainable_variables)
    gen_dec_grad = tape.gradient(
        G_loss, generator.decoder.trainable_variables)
    disc_grad = tape.gradient(
        D_loss, discriminator.model.trainable_variables)
    del tape

    generator.optimizer.apply_gradients(
        zip(gen_enc_grad, generator.encoder.trainable_variables))
    generator.optimizer.apply_gradients(
        zip(gen_dec_grad, generator.decoder.trainable_variables))
    discriminator.optimizer.apply_gradients(
        zip(disc_grad, discriminator.model.trainable_variables))
Note that "0" is printed within a few seconds, so the slow part is definitely not earlier.
And this is the get_imputed_data function that is called:
def get_imputed_data(incomplete_series, generator_output):
    return tf.where(tf.math.is_nan(incomplete_series), generator_output,
                    incomplete_series)
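To make the behaviour concrete, here is a toy example (the values are made up): entries that are NaN in the incomplete series are replaced by the corresponding generator output, and everything else is kept.

# Toy illustration with made-up values: NaNs are filled from the generator output.
incomplete = tf.constant([[1.0, float('nan')], [float('nan'), 4.0]])
generated = tf.constant([[9.0, 8.0], [7.0, 6.0]])
print(get_imputed_data(incomplete, generated))
# [[1. 8.]
#  [7. 4.]]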
Thanks for any answers! Hope I provided just enough code to give a sense of where the problem lies. This is my first time posting here after reading for at least five years :)
I use Python 3.6 and TensorFlow 2.1.
The problem was solved by removing the tf.function decorator from the functions that call the generator and discriminator. I was using a single global Python scalar (the iteration number) in two of the tf.function-decorated functions. This caused a new graph to be traced every time (see the caution in the tf.function docs).
The fix is to drop the Python variables or convert them to TensorFlow tensors or variables.
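To illustrate the retracing behaviour, here is a minimal sketch (the function and values are hypothetical, not my actual code): a tf.function traces a new graph for every distinct Python argument value, but only once for tensor arguments of a given shape and dtype.

import tensorflow as tf

@tf.function
def step(iteration):
    # Python-level side effects only run while tracing, so this print
    # reveals every retrace.
    print("tracing for", iteration)
    return iteration + 1

for i in range(3):
    step(i)                # retraces on every call: i is a new Python value each time
for i in range(3):
    step(tf.constant(i))   # traces once and reuses the graph afterwards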
I am building a cascade of neural networks, and I would like to backpropagate the main loss to each of the DNNs and also compute an auxiliary loss for each DNN.
I am trying to figure out the best practice when building such a model and how to make sure that my losses are computed properly. Do I build a single torch.nn.Module and a single optimizer, or do I have to create separate modules and optimizers for each network? I am also likely to have more than three cascaded DNNs.
Approach a)
import torch
from torch import nn, optim


class MasterNetwork(nn.Module):
    def __init__(self):
        super(MasterNetwork, self).__init__()
        # placeholder sub-networks
        self.dnn1 = nn.ModuleList()
        self.dnn2 = nn.ModuleList()
        self.dnn3 = nn.ModuleList()

    def forward(self, x, z1, z2):
        out1 = self.dnn1(x)
        out2 = self.dnn2(out1 + z1)
        out3 = self.dnn3(out2 + z2)
        return [out1, out2, out3]


def LossFunction(x):
    # do stuff
    return loss  # loss is a scalar value


def ac_loss_1_fn(x):
    # do stuff
    return loss  # loss is a scalar value


def ac_loss_2_fn(x):
    # do stuff
    return loss  # loss is a scalar value


def ac_loss_3_fn(x):
    # do stuff
    return loss  # loss is a scalar value


model = MasterNetwork()
optimizer = optim.Adam(model.parameters())

input = torch.tensor()
z1 = torch.tensor()
z2 = torch.tensor()

outputs = model(input, z1, z2)

main_loss = LossFunction(outputs[2])
ac1_loss = ac_loss_1_fn(outputs[0])
ac2_loss = ac_loss_2_fn(outputs[1])
ac3_loss = ac_loss_3_fn(outputs[2])

optimizer.zero_grad()
'''
This is where I am uncertain about how to backpropagate the AC losses for each DNN
in addition to the main loss.
'''
optimizer.step()
Approach b)
This would mean creating an nn.Module class and an optimizer for each DNN and then forwarding the loss to the next DNN.
I would prefer a solution along the lines of approach a), since it is less tedious and I don't have to deal with tuning multiple optimizers. However, I am not sure whether it is possible. There was a similar question about backpropagating multiple losses; however, I was not able to understand how combining the losses would work for the distinct components.
The solution you are looking for is likely to use some form of the following (note that torch.stack keeps the individual losses attached to the graph, whereas wrapping them in a new torch.tensor would detach them):

y = torch.stack([main_loss, ac1_loss, ac2_loss, ac3_loss])
y.backward(gradient=torch.tensor([1.0, 1.0, 1.0, 1.0]))
See https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients for confirmation.
A similar question exists, but it uses different phrasing and is the question I found first when hitting this issue: "Pytorch. Can autograd be used when the final tensor has more than a single value in it?"
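For what it's worth, an equivalent and arguably simpler pattern is to combine the scalar losses into one objective and call backward() once; a minimal sketch under the same setup as approach a) (the loss weights are placeholders):

# Weighted sum of the main and auxiliary losses; one backward pass covers all DNNs.
total_loss = main_loss + 1.0 * ac1_loss + 1.0 * ac2_loss + 1.0 * ac3_loss

optimizer.zero_grad()
total_loss.backward()   # gradients flow into dnn1, dnn2 and dnn3 through the cascade
optimizer.step()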
I am trying to benchmark specific blocks in my tf model. Therefore, I am trying to use tf.timestamp(). I need to include it in the graph execution so that it will be executed every time I call the model.
I can actually do it by using tf.compat.v1.Print() as follows:
x = self.mp1(x)
x = tf.compat.v1.Print(x, [tf.timestamp()])
x = self.c3(x)
But this prints the value, which adds some overhead. Instead, I want to store the timestamp in a variable so that I can work with it after execution. Is there any other way to embed tf.timestamp() into the graph in TF2?
If you are using this graph for optimization (e.g. as a deep learning model), you will probably run the model.fit() function.
Then you can use a callback function to save after every epoch.
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint

EPOCHS = 10
checkpoint_filepath = '/tmp/checkpoint'
model_checkpoint_callback = ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_acc',
    mode='max',
    save_best_only=True)

# Model weights are saved at the end of every epoch, if it's the best seen
# so far.
model.fit(epochs=EPOCHS, callbacks=[model_checkpoint_callback])

# The model weights (that are considered the best) are loaded into the model.
model.load_weights(checkpoint_filepath)
I solved it after trying a couple of things. What I did is:
I defined a custom layer:
import tensorflow as tf
from tensorflow.keras import layers


class TimeStamp(layers.Layer):
    def __init__(self):
        super(TimeStamp, self).__init__()
        self.ts = tf.Variable(initial_value=0., dtype=tf.float64, trainable=False)

    def call(self, inputs):
        # record the current wall-clock time, then pass the input through unchanged
        self.ts = tf.timestamp()
        return tf.identity(inputs)

    def getTs(self):
        return self.ts
Then I used the layer multiple times in a model and found the elapsed time by subtracting the stored self.ts values.
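A rough eager-mode usage sketch (the surrounding layers and shapes are assumptions, not the original model): wrap the block you want to time between two TimeStamp layers and subtract their stored values after a call.

# Hypothetical usage sketch: time a Dense block between two TimeStamp layers.
ts_before = TimeStamp()
ts_after = TimeStamp()
block = tf.keras.layers.Dense(64)    # placeholder for the block being benchmarked

x = tf.random.normal((8, 32))
x = ts_before(x)                     # records a timestamp, passes x through unchanged
x = block(x)
x = ts_after(x)
print("elapsed seconds:", float(ts_after.getTs() - ts_before.getTs()))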
I recently did a massive refactor of my PyTorch LSTM code in order to support multitask learning. I created an MTLWrapper that holds a BaseModel (which can be one of several variations on a regular LSTM network). The BaseModel remained the same as before the refactor, minus a linear hidden2tag layer (which takes the hidden sequence and converts it to tag space); that layer now sits in the wrapper. The reason for this is that for multitask learning all the parameters are shared except for the final linear layer, of which I have one per task. These are stored in an nn.ModuleList, not just a regular Python list.
What happens now is that my forward pass returns a list of tag-score tensors (one per task), rather than a single tensor of the tag scores for a single task. I compute the losses for each of these tasks and then try to backpropagate with the average of these losses (technically also averaged over all the sentences of a batch, but this was true before the refactor too). I call model.zero_grad() before running the forward pass on each sentence in a batch.
I don't know exactly where it happened, but after this refactor, I started getting this error (on the second batch):
RuntimeError: Trying to backward through the graph a second time, but
the buffers have already been freed. Specify retain_graph=True when
calling backward the first time.
Following the advice, I added the retain_graph=True flag, but now I get the following error instead (also on the second backward step):
RuntimeError: one of the variables needed for gradient computation has
been modified by an inplace operation: [torch.FloatTensor [100, 400]],
which is output 0 of TBackward, is at version 2; expected version 1
instead. Hint: the backtrace further above shows the operation that
failed to compute its gradient. The variable in question was changed
in there or anywhere later. Good luck!
The hint in the backtrace is not actually helpful, because I have no idea where a tensor of shape [100, 400] even came from - I don't have any parameters of size 400.
I have a sneaking suspicion that the problem is actually that I shouldn't need retain_graph=True at all, but I have no way to confirm that versus finding the mystery variable that is being changed according to the second error. Either way, I'm at a complete loss as to how to solve this issue. Any help is appreciated!
Code snippets:
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTLWrapper(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, dropout, ..., directions=1, device='cpu', model_type):
        super(MTLWrapper, self).__init__()
        self.base_model = model_type(embedding_dim, hidden_dim, dropout, ..., directions, device)
        self.linear_taggers = []
        for tagset_size in tagset_sizes:
            self.linear_taggers.append(nn.Linear(hidden_dim*directions, tagset_size))
        self.linear_taggers = nn.ModuleList(self.linear_taggers)

    def init_hidden(self, hidden_dim):
        return self.base_model.init_hidden(hidden_dim)

    def forward(self, sentence):
        lstm_out = self.base_model.forward(sentence)
        tag_scores = []
        for linear_tagger in self.linear_taggers:
            tag_space = linear_tagger(lstm_out.view(len(sentence), -1))
            tag_scores.append(F.log_softmax(tag_space))
        tag_scores = torch.stack(tag_scores)
        return tag_scores
Inside the train function:
for i in range(math.ceil(len(train_sents)/batch_size)):
    batch = r[i*batch_size:(i+1)*batch_size]
    losses = []
    for j in batch:
        sentence = train_sents[j]
        tags = train_tags[j]

        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden(hidden_dim)

        sentence_in = sentence
        targets = tags

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        loss = [loss_function(tag_scores[i], targets[i]) for i in range(len(tag_scores))]
        loss = torch.stack(loss)
        avg_loss = sum(loss)/len(loss)
        losses.append(avg_loss)
    losses = torch.stack(losses)
    total_loss = sum(losses)/len(losses)  # average over all sentences in batch
    total_loss.backward(retain_graph=True)
    running_loss += total_loss.item()
    optimizer.step()
    count += 1
And code for one possible BaseModel (the others are practically identical):
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, dropout, vocab_size, alphabet_size,
                 directions=1, device='cpu'):
        super(LSTMTagger, self).__init__()
        self.device = device
        self.hidden_dim = hidden_dim
        self.directions = directions
        self.dropout = nn.Dropout(dropout)

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, dropout=dropout, bidirectional=directions == 2)

        # The linear layer that maps from hidden state space to tag space
        self.hidden = self.init_hidden(hidden_dim)

    def init_hidden(self, dim):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the PyTorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(self.directions, 1, dim).to(device=self.device),
                torch.zeros(self.directions, 1, dim).to(device=self.device))

    def forward(self, sentence):
        word_idxs = []
        for word in sentence:
            word_idxs.append(word[0])
        embeds = self.word_embeddings(torch.LongTensor(word_idxs).to(device=self.device))

        lstm_out, self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1), self.hidden)
        lstm_out = self.dropout(lstm_out)
        return lstm_out
The problem was that when I reset the hidden state of the model (model.hidden = model.init_hidden(hidden_dim)), I didn't actually reassign the reinitialized hidden state to the BaseModel, but only set it on the MTLWrapper (which doesn't technically even use the hidden state itself).
I amended my MTLWrapper's init_hidden() function as follows:
class MTLWrapper(nn.Module):
    def init_hidden(self, hidden_dim):
        self.base_model.hidden = self.base_model.init_hidden(hidden_dim)
        return self.base_model.init_hidden(hidden_dim)
This resolved the first error, and my code runs without the retain_graph=True flag.
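As an aside, a common alternative to reinitializing with zeros (a sketch, not the fix I actually used) is to detach the carried hidden state before each instance, which also breaks the connection to the graph of the previous backward pass:

# Sketch: keep the hidden values but drop their autograd history.
def repackage_hidden(h):
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

# inside the training loop, instead of reinitializing with zeros:
# model.base_model.hidden = repackage_hidden(model.base_model.hidden)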
I am currently using TensorFlow version 1.14.
In the code below, I am trying to create a dummy model that takes 2 inputs and provides 2 outputs, with all weights set to one and biases to zero (a single-layer perceptron). I am defining a custom loss function that computes the Jacobian of the outputs with respect to the inputs.
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tqdm import tqdm


# Prior function
def f_i(x):
    x1 = np.arctanh(x)
    return np.exp(-x1**2)


B = np.random.choice(x, (10000, 2), p=f_i(x)/np.sum(f_i(x)))


def my_loss(y_pred, y_true):
    jacobian_tf = jacobian_tensorflow3(sim.output, sim.input)
    loss = tf.abs(tf.linalg.det(jacobian_tf))
    return K.mean(loss)


def jacobian_tensorflow3(x, y, verbose=False):
    jacobian_matrix = []
    it = tqdm(range(ndim)) if verbose else range(ndim)
    for o in it:
        grad_func = tf.gradients(x[:, o], y)
        jacobian_matrix.append(grad_func[0])
    jacobian_matrix = tf.stack(jacobian_matrix)
    jacobian_matrix1 = tf.transpose(jacobian_matrix, perm=[1, 0, 2])
    return jacobian_matrix1


sim = Sequential()
sim.add(Dense(2, kernel_initializer='ones', bias_initializer='zeros', activation='linear', input_dim=2))
sim.compile(optimizer='adam', loss=my_loss)

sim.fit(B, np.random.random(B.shape), batch_size=100, epochs=2)
While this model computes the Jacobian matrix without issue and compiles fine, when I run sim.fit I get the following error:
ValueError: Variable <tf.Variable 'dense_14/bias:0' shape=(2,) dtype=float32> has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
I have been stuck at this step for a long time and am not able to proceed. Any help or suggestions would be appreciated.
I want to multiply the output of a Keras layer by my own tf.Variable.
Then I want to compute the gradients of some loss with respect to the variables I have defined.
Here is a simplified MWE of what I am trying to do:
import tensorflow as tf

x = input_shape = tf.keras.layers.Input((10,))
x = tf.keras.layers.Dense(5)(x)
s = tf.Variable(tf.ones((5,)))
x = x*s
model = tf.keras.models.Model(input_shape, x)

X = tf.random.normal((50, 10))  # random sample

with tf.GradientTape() as tape:
    tape.watch(s)
    y = model(X)
    loss = y**2

print(tape.gradient(loss, s))  # why None ??
The print prints None... why?
Notice that I am using eager-execution (TF version 2.0.0).
I managed to fix my problem by sub-classing Model and creating my variable inside the model:
class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(5)
        self.s = tf.Variable(tf.ones((5,)))

    def call(self, inputs):
        x = self.dense(inputs)
        x = x * self.s
        return x
Alternatively, defining my own custom layer also works.
There must be some magic going on whereby variables not inside a model are not backpropagated (like in PyTorch).
I will leave the question open because I am curious as to why my code was not working and what a simpler fix would look like.
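With the sub-classed model, the gradient with respect to s is no longer None; a short sketch of the check:

# The variable now lives inside the model, so the tape tracks it automatically.
model = MyModel()
X = tf.random.normal((50, 10))

with tf.GradientTape() as tape:
    y = model(X)
    loss = tf.reduce_sum(y**2)

print(tape.gradient(loss, model.s))   # a (5,)-shaped tensor, not None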
This might be the explanation. Based on reviewing the documentation, I suspect the issue is that differentiating with respect to the model layer "s" (or any other layer, say "x") might not be a meaningful calculation. For example, it is possible to do this:
print(tape.gradient(loss, model.variables))
and obtain the gradients with respect to the model weights/parameters, but differentiating the model with respect to a "layer" is not appropriate. This is my speculation at this point. I hope this helps.