I am training a model in TensorFlow that works perfectly when I evaluate from the trained model.
However, at various points, I am saving a checkpoint and then loading that checkpoint to run evaluations on it. The loaded network will just output NaNs.
Using tfdbg and running the filter "has_inf_or_nan" when feeding input ends up showing the first NaNs in the network appearing in the moving_mean and moving_variance variables in one of the batch normalization layers.
Saving is being done with the following code:
with self.graph.as_default():
    if not self.saver:
        self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=10000)
    save_dir = create_save_dir(path, name)
    return self.saver.save(self.session, save_dir, global_step=iteration, write_meta_graph=True)
Loading is being done with the following code:
with self.graph.as_default():
    save_dir = create_save_dir(load_dir, load_name)
    self.saver = tf.train.import_meta_graph(save_dir + "-" + str(iteration) + ".meta")
    self.saver.restore(self.session, save_dir + "-" + str(iteration))
    self.input_layer = self.graph.get_tensor_by_name("network/input_layer:0")
    self.out_policy_layer = self.graph.get_tensor_by_name("network/out_policy_layer:0")
    self.out_value_layer = self.graph.get_tensor_by_name("network/out_value_layer/Tanh:0")
    self.is_training = self.graph.get_tensor_by_name("network/is_training:0")
Again, the thing that has me suspecting some issue with my save/load routine is the fact that the network is outputting valid results if I run through the network that has been trained. I am only getting the NaNs when I run something through a network that was loaded.
Editing to add that my batch norm is being created with the following code:
def _conv_block(self, name, input_layer, filter_size, num_input_channels, num_output_channels):
    weights = self._create_weights_for_layer(f"{name}_weights",
                                             shape=[filter_size[0],
                                                    filter_size[1],
                                                    num_input_channels,
                                                    num_output_channels],
                                             use_regularizer=self._config.l2_regularizer)
    conv = self._conv2d(input_layer, weights, strides=[1, 1, 1, 1], padding="SAME", name=f"{name}_conv")
    bn = self._conv_batch_norm(conv, f"{name}_batch_norm")
    return tf.nn.relu(bn, name=f"{name}_act")

def _conv_batch_norm(self, input_layer, name):
    return tf.layers.batch_normalization(input_layer, axis=CHANNEL_SHAPE_INDEX, center=True, scale=True, training=self.is_training,
                                         momentum=self._config.batch_norm_momentum,
                                         name=name)
Long story short, if you are getting random NaNs in your model, and you have looked at the usual suspects, consider the fact that your hardware could be failing before you waste hundreds of hours.
This was caused by a video card with dying memory. I assumed we had a software problem and did not consider the hardware. The training was happening on a different PC and experienced no issues.
This is how it ended up occurring to us that we had a hardware problem. We previously had experienced randomly occurring NaNs on the old PC. We spent hundreds of hours debugging the model thinking we had an issue. After we made a change we also happened to move to an upgraded PC so we thought our changes fixed it since the NaNs stopped. Then we started running evaluations a month later using that old PC and ran into NaNs again. That is when I posted this. Shortly afterwards I had the realization that there might be a hardware problem.
I have recently been using a RobertaLarge model, which I fine-tune on a downstream task using the Trainer class.
All goes well: I see the loss going down, and I manually compare some results against the validation dataset.
The problem appears when I try to save the model and reload it afterwards.
I keep seeing this warning when reloading the model:
Some weights of the model checkpoint at Roberta_trained_1epoch were not used when initializing RobertaPreTrainedModel: ['module.roberta.encoder.layer.10.output.dense.bias', [........................................340_LAYERS_..................................]
'module.roberta.encoder.layer.6.attention.self.key.bias', 'module.roberta.encoder.layer.22.output.dense.weight', 'module.roberta.encoder.layer.3.attention.self.key.bias', 'module.roberta.encoder.layer.15.attention.self.value.bias', 'module.roberta.encoder.layer.15.attention.self.query.bias', 'module.roberta.encoder.layer.2.attention.self.value.bias']
I looked extensively for an explanation of this problem and so far couldn't find a solution. Some claim this is just a warning and nothing is wrong, but I was suspicious, so I did some manual checks, and indeed the model seems... untrained.
I'm using Trainer.save_model('save_here') after training, and RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True) to reload the model.
However, the results show me that the model is clearly not loading correctly.
Training code:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=ds_train,
    eval_dataset=ds_valid,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.evaluate()
trainer.save_model('save_here')
This results in an evaluation loss of 0.002.
Reloading and re-evaluation:
model = RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained('tokenizers_saved')
dl_valid = DataLoader(ds_valid, batch_size=Config.batch_size, shuffle=True)
with torch.no_grad():
    for index, data in enumerate(dl_valid):
        batch_input_ids = data['input_ids'].to(device, dtype=torch.long)
        batch_att_mask = data['attention_mask'].to(device, dtype=torch.long)
        batch_target = data['label_ids'].to(device, dtype=torch.long)
        output = model(batch_input_ids, token_type_ids=None, attention_mask=batch_att_mask, labels=batch_target)
        step_loss, eval_prediction = output['loss'], output['logits']
        eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
        predictions.append(eval_prediction)
        reals.append(batch_target)
        eval_loss += step_loss
print(eval_loss)
This results in loss: 1.2 - 0.9 (randomly after loading)
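I found out what was wrong, and I am sharing it in case others run into the same issue.
My problem was that I had wrapped my model in DataParallel: model = nn.DataParallel(model)
DataParallel keeps the real model under its .module attribute, which is why every key in the warning above starts with "module.". The saved names therefore do not match what from_pretrained expects, so Trainer can't save the model properly and get it back the usual way.
As a workaround: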
model = trainer.model
model.module.save_pretrained('save_here')
....
# afterwards in another machine
....
model = RobertaForTokenClassification.from_pretrained('save_here')
Still think that this should be handled differently.
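If a checkpoint has already been written from the wrapped model, its keys carry the "module." prefix seen in the warning. A minimal sketch for renaming them is below. It works on any plain mapping of parameter names to values; strip_module_prefix is a hypothetical helper name, not part of transformers or PyTorch, and the values here are stand-ins for tensors.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key.

    Keys that do not start with the prefix are copied unchanged.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in values; in practice these would be tensors loaded from the checkpoint.
saved = {
    "module.roberta.encoder.layer.10.output.dense.bias": 0.1,
    "classifier.weight": 0.2,
}
print(list(strip_module_prefix(saved)))
```

The cleaned mapping could then be handed to model.load_state_dict(...) on an unwrapped model. Saving through model.module.save_pretrained, as in the workaround above, avoids the need for this in the first place.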
I'm trying to run a loop over a set of parameters, and I want to create a new network for each parameter and let it train for a few epochs.
Currently my code looks like this:
def optimize_scale(self, epochs=5, comp_scale=100, scale_list=[1, 100]):
    trainer = pyli.Trainer(gpus=1, max_epochs=epochs)
    for scale in scale_list:
        test_model = CustomNN(num_layers=1, scale=scale, lr=1, pad=True, batch_size=1)
        trainer.fit(test_model)
        trainer.test(verbose=True)
        del test_model
Everything works fine for the first element of scale_list, the network learns 5 epochs and completes the test. All this can be seen in the console. However for all following elements of scale_list it doesn't work as the old network is not overwritten, but instead an old checkpoint is loaded automatically when trainer.fit(model) is called. In the console this is indicated through:
C:\Users\XXXX\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:623: UserWarning:
Checkpoint directory D:\XXXX\src\lightning_logs\version_0\checkpoints exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
train_size = 8 val_size = 1 test_size = 1
Restoring states from the checkpoint path at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt
The consequence is that the second test outputs the same result, because the checkpoint from the old network, which had already finished all 5 epochs, was loaded. I thought that adding del test_model might help drop the model completely, but that did not work.
While searching I found a few closely related issues, for example: https://github.com/PyTorchLightning/pytorch-lightning/issues/368. However, I did not manage to fix my problem. I assume it has something to do with the fact that the new network, which should overwrite the old one, has the same name/version and therefore looks for the same checkpoints.
If anyone has an idea or knows how to circumvent this I would be very grateful.
I think, in your settings, you want to disable automatic checkpointing:
trainer = pyli.Trainer(gpus=1, max_epochs=epochs, enable_checkpointing=False)
You may need to explicitly save a checkpoint (with a different name) for each training session you are running.
You can manually save a checkpoint via:
trainer.save_checkpoint(f'checkpoint_for_scale_{scale}.pth')
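Putting the two suggestions together, the loop could look roughly like the sketch below. It takes the trainer and model constructors as arguments so that it runs without Lightning installed; run_scale_sweep, make_trainer, make_model and ckpt_dir are hypothetical names. Building a fresh trainer for every scale is my own assumption, not something the answer above requires.

```python
from pathlib import Path


def run_scale_sweep(scale_list, make_trainer, make_model, ckpt_dir="sweep_checkpoints"):
    """Train one fresh model per scale and save each under its own file name."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for scale in scale_list:
        model = make_model(scale)
        trainer = make_trainer()  # new trainer per scale (assumption), so no state carries over
        trainer.fit(model)
        trainer.test(model)       # test the model that was just trained
        path = ckpt_dir / f"checkpoint_for_scale_{scale}.ckpt"
        trainer.save_checkpoint(str(path))
        saved[scale] = path
    return saved
```

With the names from the question, make_trainer would be something like lambda: pyli.Trainer(gpus=1, max_epochs=epochs, enable_checkpointing=False) and make_model would be lambda scale: CustomNN(num_layers=1, scale=scale, lr=1, pad=True, batch_size=1).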
I'm creating an LSTM model for text generation using Keras. The dataset I'm using (around 25 novels, about 1.4 million words) can't be processed at once because converting my outputs with to_categorical() runs out of memory, so I created a custom generator function to read the data in.
# Data generator for fit and evaluate
def generator(batch_size):
    start = 0
    end = batch_size
    while True:
        x = sequences[start:end, :-1]
        # print(x)
        y = sequences[start:end, -1]
        y = to_categorical(y, num_classes=vocab_size)
        # print(y)
        yield x, y
        if batch_size == len(lines):
            break
        else:
            start += batch_size
            end += batch_size
When I execute the model.fit() method, the following error is thrown after one epoch of training.
UnknownError: [_Derived_] CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1459): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
[[{{node CudnnRNN}}]]
[[sequential/lstm/StatefulPartitionedCall]] [Op:__inference_train_function_25138]
Function call stack:
train_function -> train_function -> train_function
Does anyone know how to solve this issue? Thanks.
According to many reports online, this issue seems to occur when an LSTM layer is used together with a Masking layer while training on a GPU.
The following workarounds may help:
1. If you can compromise on speed, train your model on the CPU rather than the GPU. It works without any error.
2. As per this comment, check whether any of your input sequences consist entirely of zeros, as the Masking layer may mask all of the inputs.
3. If possible, disable eager execution. As per this comment, it works without any error.
4. Instead of using a Masking layer, try the alternatives mentioned in this link:
   a. Add the argument mask_zero=True to the Embedding layer, or
   b. Pass a mask argument manually when calling layers that support this argument.
5. As a last resort, remove the Masking layer, if that is possible.
If none of the above workarounds solves your problem, the TensorFlow team at Google is working to resolve this error, and we may have to wait until it is fixed.
Hope this information helps. Happy Learning!
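Separately from the workarounds above, one thing that may be worth ruling out: the generator in the question never resets start and end, so once they run past the end of sequences the slices come back empty. Whether that is what triggers the error after the first epoch is an assumption on my part, but a generator that wraps around never yields an empty batch. The sketch below only produces the index windows; batch_bounds is a hypothetical helper name.

```python
def batch_bounds(n_samples, batch_size):
    """Yield (start, end) index pairs forever, wrapping back to 0 after the last batch.

    The final batch of a pass may be shorter than batch_size, but it is never empty.
    """
    if n_samples <= 0 or batch_size <= 0:
        raise ValueError("n_samples and batch_size must be positive")
    start = 0
    while True:
        end = min(start + batch_size, n_samples)
        yield start, end
        # Restart from the beginning once the whole dataset has been covered.
        start = 0 if end >= n_samples else end
```

Inside the original generator this would replace the manual bookkeeping: loop over batch_bounds(len(sequences), batch_size), slice sequences[start:end, :-1] and sequences[start:end, -1] as before, and yield x, y.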
I'm working on a variational auto-encoder and I'd like the prior used in the KL-divergence regularization of the latent distribution to have its loc (mean) and scale (stddev) updated.
The below snippet is a contrived minimal example demonstrating what I'm trying to achieve. This starts to work but then just freezes after some random number of epochs (sometimes 1, sometimes 200, but usually around 7 or 8). There's no error message or anything.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors
tfkl = tf.keras.layers
tfpl = tfp.layers

# ndim, ds and N_EPOCHS are defined elsewhere
loc = tf.Variable(tf.random.normal([ndim], stddev=0.1, dtype=tf.float32))
scale = tfp.util.TransformedVariable(
    tf.random.normal([ndim], mean=1.0, stddev=0.1, dtype=tf.float32),
    bijector=tfb.Chain([tfb.Shift(1e-5), tfb.Softplus(), tfb.Shift(0.5413)]))
prior = tfd.Independent(tfd.Normal(loc=loc, scale=scale), reinterpreted_batch_ndims=1)
_input = tfkl.Input(shape=(1,))
_loc = tfkl.Dense(ndim, name="loc_params")(_input)
_scale = tfkl.Dense(ndim, name="untransformed_scale_params")(_input)
_scale = tf.math.softplus(_scale + np.log(np.exp(1) - 1)) + 1e-5
_output = tfpl.DistributionLambda(
    make_distribution_fn=lambda t: tfd.Independent(tfd.Normal(loc=t[0], scale=t[1])),
    activity_regularizer=tfpl.KLDivergenceRegularizer(prior, use_exact_kl=True, weight=0.1)
)([_loc, _scale])
model = tf.keras.Model(_input, _output)
model.compile(optimizer='adam', loss=lambda y_true, model_out: -model_out.log_prob(y_true))
hist = model.fit(ds, epochs=N_EPOCHS, verbose=2)
I have a runnable gist here.
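For readers wondering about the constants in the snippet: 0.5413 is approximately log(e - 1), the input at which softplus returns 1. Both the prior's scale and _scale are therefore set up so that a raw value of 0 maps to a scale of about 1, plus the 1e-5 floor. A quick check using only the standard library follows; softplus here is a local helper, not the TensorFlow op.

```python
import math


def softplus(x):
    """Plain softplus, log(1 + exp(x)); fine for small x."""
    return math.log1p(math.exp(x))


offset = math.log(math.e - 1.0)  # the value hard-coded as 0.5413 in the snippet
print(round(offset, 4))
print(softplus(0.0 + offset) + 1e-5)  # scale produced by a raw value of 0
```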
A more concrete example, and an architecture close to what I'm trying to update and simplify, is the tfp example for disentangled_vae. In its manual training loop, a new tfd.MultivariateNormalDiag is instantiated on every loop, though it is parameterized using persistent tf.Variables. I'm trying my best to avoid manual training loops, and I'm also trying to move to more Keras-like syntax, so I'd rather not do a direct port of this example.
Any advice is greatly appreciated. Thanks!
Edit: The activity_regularizer seems to work fine when attached to a latent (bottleneck) distribution. I have a more complete example in this Colab notebook. As this works in my architecture, I'm no longer in need of an answer.
However, I highly doubt having model fitting freeze is desirable behaviour, so this remains a problem.
As the machinery works in most circumstances, just not the contrived freezing example above, I no longer consider this a question that needs an answer.
I have reported the errorless freezing behaviour via the tensorflow-probability repository issues page. See here.
Today I ran into something really weird.
I load a Caffe model, feed it input, call net.forward, and check the output data: perfect.
Then I write the labels into the diff of the bottom layer's blob, call net.backward, and compare the gradients (params.diff) with the result from a C++ Caffe program using the same model. They are different.
Further, when I keep running net.backward several times in Python, I get different gradients each time. This is not the case for the C++ program: the gradients stay the same no matter how many times you run net.backward, as long as you do not change the bottom diff.
I checked the bottom layer's blobs and diff; they stay unchanged in both the Python and the C++ program, and the weights are also unchanged. This is really weird.
Can anyone provide some hints? I can provide more code if necessary.
Here is part of the code:
def train_one_step(X, y, lr):
    net.blobs['data'].data[...] = X
    # Forward, to get the softmax output
    output = net.forward()
    prob = output['prob']
    # Calculate the loss of cross entropy loss function
    net.blobs['prob'].diff[:] = y[:] - prob[:]
    # Calculate the gradients of net parameter
    net.backward()
    # Renew weights based on gradients and learning rate
    for key in net.params:
        net.params[key][0].data[:] += lr * net.params[key][0].diff[:]
        net.params[key][1].data[:] += lr * net.params[key][1].diff[:]
    return loss, prob
I just want to write my own step function (the solver's step), so that I can apply some tricks to the loss before the backward pass, among other things. I know this is quite inefficient, since a lot of data is exchanged between the GPU and CPU.
To test it, I kept feeding the same sample (same X, y), yet I got different diff data each time. That means this function cannot work.
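One possible explanation, stated as an assumption since nothing above confirms it: in Caffe, layers add into the parameter diffs during the backward pass rather than overwriting them, and the solver zeroes those diffs at the start of each iteration. A hand-written step that skips that reset would see the diffs grow with every net.backward() call. The toy class below is not Caffe; it only imitates that accumulate-then-clear behaviour, and ToyParam and its methods are hypothetical names.

```python
class ToyParam:
    """Stand-in for a parameter blob whose diff accumulates across backward calls."""

    def __init__(self):
        self.diff = 0.0

    def backward(self, grad):
        # Accumulate instead of overwrite, like a "+=" into the diff buffer.
        self.diff += grad

    def clear_diff(self):
        self.diff = 0.0


param = ToyParam()
grad = 0.25  # same input, so the same gradient on every call

without_reset = []
for _ in range(3):
    param.backward(grad)
    without_reset.append(param.diff)

with_reset = []
for _ in range(3):
    param.clear_diff()
    param.backward(grad)
    with_reset.append(param.diff)

print(without_reset)  # grows on every call
print(with_reset)     # identical on every call
```

If this is the cause, the corresponding change in train_one_step would be to zero each parameter diff (for example net.params[key][0].diff[...] = 0 and likewise for index 1) before calling net.backward().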