I'm trying to run a loop over a set of parameters, and I want to create a new network for each parameter and let it learn for a few epochs.
Currently my code looks like this:
def optimize_scale(self, epochs=5, comp_scale=100, scale_list=[1, 100]):
    trainer = pyli.Trainer(gpus=1, max_epochs=epochs)
    for scale in scale_list:
        test_model = CustomNN(num_layers=1, scale=scale, lr=1, pad=True, batch_size=1)
        trainer.fit(test_model)
        trainer.test(verbose=True)
        del test_model
Everything works fine for the first element of scale_list: the network learns for 5 epochs and completes the test, all of which can be seen in the console. However, for all following elements of scale_list it doesn't work, because the old network is not replaced; instead, an old checkpoint is loaded automatically when trainer.fit(model) is called. In the console this is indicated by:
C:\Users\XXXX\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:623: UserWarning:
Checkpoint directory D:\XXXX\src\lightning_logs\version_0\checkpoints exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
train_size = 8 val_size = 1 test_size = 1
Restoring states from the checkpoint path at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt
The consequence is that the second test outputs the same result, since the checkpoint from the old network, which had already finished all 5 epochs, was loaded. I thought that adding del test_model might help in dropping the model completely, but that did not work.
While searching I found a few closely related issues, for example: https://github.com/PyTorchLightning/pytorch-lightning/issues/368. However, I did not manage to fix my problem. I assume it has something to do with the fact that the new network, which should replace the old one, has the same name/version and therefore looks for the same checkpoints.
If anyone has an idea or knows how to circumvent this I would be very grateful.
I think, in your setup, you want to disable automatic checkpointing:
trainer = pyli.Trainer(gpus=1, max_epochs=epochs, enable_checkpointing=False)
You may need to explicitly save a checkpoint (with a different name) for each training session you are running.
You can manually save a checkpoint via:
trainer.save_checkpoint(f'checkpoint_for_scale_{scale}.pth')
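For reference, here is a minimal sketch of how the loop from the question might look with both suggestions applied. CustomNN, pyli, and the hyperparameters are taken from the question; moving the Trainer construction inside the loop and passing the model to trainer.test() are assumptions on my part, not part of the fix itself:
def optimize_scale(self, epochs=5, comp_scale=100, scale_list=[1, 100]):
    for scale in scale_list:
        # a fresh Trainer per scale, with automatic checkpointing disabled
        trainer = pyli.Trainer(gpus=1, max_epochs=epochs, enable_checkpointing=False)
        test_model = CustomNN(num_layers=1, scale=scale, lr=1, pad=True, batch_size=1)
        trainer.fit(test_model)
        trainer.test(test_model, verbose=True)
        # save each run under its own name so nothing is silently restored later
        trainer.save_checkpoint(f'checkpoint_for_scale_{scale}.pth')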
I have recently been using a RobertaLarge model, on which I perform downstream training using the "Trainer" package.
All goes well: I see the loss going down and I manually compare some results against the validation dataset.
The problem arises when I try to save the model and reload it afterwards.
I keep seeing the warning when trying to reload the model:
Some weights of the model checkpoint at Roberta_trained_1epoch were not used when initializing RobertaPreTrainedModel: ['module.roberta.encoder.layer.10.output.dense.bias', [........................................340_LAYERS_..................................]
'module.roberta.encoder.layer.6.attention.self.key.bias', 'module.roberta.encoder.layer.22.output.dense.weight', 'module.roberta.encoder.layer.3.attention.self.key.bias', 'module.roberta.encoder.layer.15.attention.self.value.bias', 'module.roberta.encoder.layer.15.attention.self.query.bias', 'module.roberta.encoder.layer.2.attention.self.value.bias']
I looked extensively for an answer to this problem and so far couldn't find a solution. Some claim this is just a warning and there's nothing wrong; however, being suspicious, I did some manual checks, and indeed the model seems... virgin.
I'm using Trainer.save_model('save_here') after training, and RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True) to reload it.
However, the results clearly show that the model is not loading correctly.
training code:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=ds_train,
    eval_dataset=ds_valid,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.evaluate()
trainer.save_model('save_here')
This results in an evaluation loss of 0.002.
Reloading and re-evaluation:
model = RobertaForTokenClassification.from_pretrained('save_here', local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained('tokenizers_saved')
dl_valid = DataLoader(ds_valid, batch_size=Config.batch_size, shuffle=True)
with torch.no_grad():
    for index, data in enumerate(dl_valid):
        batch_input_ids = data['input_ids'].to(device, dtype=torch.long)
        batch_att_mask = data['attention_mask'].to(device, dtype=torch.long)
        batch_target = data['label_ids'].to(device, dtype=torch.long)
        output = model(batch_input_ids, token_type_ids=None, attention_mask=batch_att_mask, labels=batch_target)
        step_loss, eval_prediction = output['loss'], output['logits']
        eval_prediction = np.argmax(eval_prediction.detach().to('cpu').numpy(), axis=2)
        predictions.append(eval_prediction)
        reals.append(batch_target)
        eval_loss += step_loss
print(eval_loss)
This results in a loss of 0.9 to 1.2 (varying randomly after loading).
I found out what was wrong.
Will share with others, given others may have the same issue.
My problem was that I had wrapped my model in DataParallel: model = nn.DataParallel(model)
So it seems that Trainer can't save it properly and load it back the usual way.
As a workaround:
model = trainer.model
model.module.save_pretrained('save_here')
....
# afterwards in another machine
....
model = RobertaForTokenClassification.from_pretrained('save_here')
Still think that this should be handled differently.
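As an aside, the same workaround can be written so it also works when the model happens not to be wrapped; this is just a sketch of the common unwrap-before-save pattern, not something Trainer does for you:
# unwrap nn.DataParallel if present before saving, so from_pretrained()
# can later load the weights without the 'module.' prefix
model = trainer.model
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained('save_here')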
I am a TensorFlow newbie, so bear with me if my question is too basic or stupid ;)
I tried to reduce the size of the training dataset in the "Transformer model for language understanding" tutorial on the TensorFlow website (https://www.tensorflow.org/tutorials/text/transformer). My intention was to make my test runs faster when playing around with the code.
I thought I could use the dataset.take(n) method to shorten the training datasets. I added two lines right after the original dataset is read from file:
...
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True, as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']
# lines added to reduce dataset size
train_examples = train_examples.take(1000)
val_examples = val_examples.take(1000)
...
The resulting datasets (i.e., train_examples, val_examples) seem to have the intended size, and they seem to work, e.g., with the tokenizer, which comes next in the tutorial.
However, I get tons of error messages and warnings when I execute the code, more specifically when it enters training (i.e., train_step(inp, tar)). The messages are too long to copy here, but perhaps an important part is this:
...
/home/kst/python/tf/tf_env/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:1105 set_shape
raise ValueError(
ValueError: Tensor's shape (8216, 128) is not compatible with supplied shape (4870, 128)
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.beta_1
...
Some of the tensors in the training part seem to have an inappropriate size or shape.
Is there a good reason why .take(n) is not a good method to shorten datasets in Tensorflow?
Is there a better way to do it?
Thanks!
:)
I found the problem! It was not the .take() method that caused the error message; it was the checkpoint manager:
checkpoint_path = "./checkpoints/train"

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')
I seem to have had old checkpoints on my disk, which were created from a run with a larger dataset. The checkpoint manager automatically recovered the old checkpoints and wanted to continue from there. Of course, the size and shape of the old tensors did not match the new ones (e.g., the vocabulary sizes were different). This created the error message.
When I deleted the old checkpoints (i.e., the .ipynb_checkpoints directory) it all worked smoothly!
:-)
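For anyone who prefers to do this programmatically rather than deleting the folder by hand, here is a minimal sketch (my own addition, not part of the tutorial) that clears the stale checkpoint directory before training:
import os
import shutil

checkpoint_path = "./checkpoints/train"

# remove checkpoints left over from a run with a different dataset/vocabulary size,
# so the CheckpointManager starts fresh instead of restoring incompatible tensors
if os.path.isdir(checkpoint_path):
    shutil.rmtree(checkpoint_path)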
I am trying to continue training from a saved checkpoint using the colab setup for GPT-2-simple at:
https://colab.research.google.com/drive/1SvQne5O_7hSdmPvUXl5UzPeG5A6csvRA#scrollTo=aeXshJM-Cuaf
But I just can't get it to work. Loading the saved checkpoint from my Google Drive works fine, and I can use it to generate text, but I can't continue training from that checkpoint. In gpt2.finetune() I am passing restore_from='latest' and overwrite=True, and I have tried using both the same run_name and a different one, with and without overwrite=True. I have also tried restarting the runtime in between, as was suggested, but it doesn't help; I keep getting the following error:
"ValueError: Variable model/wpe already exists, disallowed. Did you mean to set reuse=True
or reuse=tf.AUTO_REUSE in VarScope?"
I assume that I need to run gpt2.load_gpt2(sess, run_name='myRun') before continuing training, but whenever I run this first, gpt2.finetune() throws this error.
You don't need to (and can't) run load_gpt2() before finetuning. You instead simply need to give run_name to finetune(). I agree that this is confusing as hell; I had the same trouble.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              file_name,
              model_name=model_name,
              checkpoint_dir=checkpoint_dir,
              run_name=run_name,
              steps=25,
              )
This will automatically grab the latest checkpoint from your checkpoint/run-name folder, load its weights, and continue training where it left off. You can confirm this by checking the epoch number - it doesn't start again from 0. E.g., if you'd previously trained 25 epochs, it'll start at 26:
Training...
[26 | 7.48] loss=0.49 avg=0.49
Also note that to run finetuning multiple times (or to load another model) you normally have to restart the Python runtime. You can instead run this before each finetune command:
tf.reset_default_graph()
I've tried the following and it works fine:
tf.reset_default_graph()
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              steps=n,
              dataset=file_name,
              model_name='model',
              print_every=z,
              run_name='run_name',
              restore_from='latest',
              sample_every=x,
              save_every=y
              )
You must indicate the same 'run_name' as the model you want to resume training from, and set the hyperparameter restore_from='latest'.
I have a model that contains multiple variables, including a global step. I've been able to successfully use a MonitoredSession to save checkpoints and summaries every 100 steps. I was expecting the MonitoredSession to automatically restore all my variables when the session is run in multiple passes (based on this documentation); however, this does not happen. If I take a look at the global step after running the training session again, I find that it starts back at zero. This is a simplified version of my code without the actual model. Let me know if more code is needed to solve this problem.
train_graph = tf.Graph()
with train_graph.as_default():
    # I create some datasets using the Dataset API
    # ...
    global_step = tf.train.create_global_step()

    # Create all the other variables and the model here
    # ...

    saver_hook = tf.train.CheckpointSaverHook(
        checkpoint_dir='checkpoint/',
        save_secs=None,
        save_steps=100,
        saver=tf.train.Saver(),
        checkpoint_basename='model.ckpt',
        scaffold=None)

    summary_hook = tf.train.SummarySaverHook(
        save_steps=100,
        save_secs=None,
        output_dir='summaries/',
        summary_writer=None,
        scaffold=None,
        summary_op=train_step_summary)

    num_steps_hook = tf.train.StopAtStepHook(num_steps=500)  # Just for testing

    with tf.train.MonitoredSession(
            hooks=[saver_hook, summary_hook, num_steps_hook]) as sess:
        while not sess.should_stop():
            step = sess.run(global_step)
            if step % 100 == 0:
                print(step)
            sess.run(optimizer)
When I run this code the first time, I get this output
0
100
200
300
400
The checkpoint folder at this point has checkpoints for every hundredth step up to 500. If I run the program again, I would expect the counter to start at 500 and then increase up to 900, but instead I just get the same output and all of my checkpoints get overwritten. Any ideas?
Alright, I figured it out. It was actually really simple. First, it's easier to use a MonitoredTrainingSession() instead of a MonitoredSession(). This wrapper session takes 'checkpoint_dir' as an argument. I thought that the saver_hook would take care of restoring, but that's not the case. In order to fix my problem I just had to change the line where I define the session, like so:
with tf.train.MonitoredTrainingSession(hooks=[saver_hook, summary_hook], checkpoint_dir='checkpoint'):
It can also be done with the MonitoredSession directly, but you need to set up a session_creator instead.
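For completeness, a minimal sketch of that session_creator variant (TF 1.x API; the hooks are the ones defined in the question):
# ChiefSessionCreator restores the latest checkpoint from checkpoint_dir before
# the hooks start running, which is what MonitoredTrainingSession does under the hood
session_creator = tf.train.ChiefSessionCreator(checkpoint_dir='checkpoint/')
with tf.train.MonitoredSession(session_creator=session_creator,
                               hooks=[saver_hook, summary_hook, num_steps_hook]) as sess:
    while not sess.should_stop():
        sess.run(optimizer)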
I'm trying to use google cloud platform to deploy a model to support prediction.
I train the model (locally) with the following instruction
~/$ gcloud ml-engine local train --module-name trainer.task --package-path trainer
and everything works fine (...):
INFO:tensorflow:Restoring parameters from gs://my-bucket1/test2/model.ckpt-45000
INFO:tensorflow:Saving checkpoints for 45001 into gs://my-bucket1/test2/model.ckpt.
INFO:tensorflow:loss = 17471.6, step = 45001
[...]
Loss: 144278.046875
average_loss: 1453.68
global_step: 50000
loss: 144278.0
INFO:tensorflow:Restoring parameters from gs://my-bucket1/test2/model.ckpt-50000
Mean Square Error of Test Set = 593.1018482
But, when I run the following command to create a version,
gcloud ml-engine versions create Mo1 --model mod1 --origin gs://my-bucket1/test2/ --runtime-version 1.3
Then I get the following error.
ERROR: (gcloud.ml-engine.versions.create) FAILED_PRECONDITION: Field: version.deployment_uri
Error: SavedModel directory gs://my-bucket1/test2/ is expected to contain exactly one
of: [saved_model.pb, saved_model.pbtxt].- '#type': type.googleapis.com/google.rpc.BadRequest
fieldViolations:- description: 'SavedModel directory gs://my-bucket1/test2/ is expected
to contain exactly one of: [saved_model.pb, saved_model.pbtxt].'
field: version.deployment_uri
Here is a screenshot of my bucket. I have a saved model in 'pbtxt' format.
my-bucket-image
Finally, I add the piece of code where I save the model in the bucket.
regressor = tf.estimator.DNNRegressor(feature_columns=feature_columns,
                                      hidden_units=[40, 30, 20],
                                      model_dir="gs://my-bucket1/test2",
                                      optimizer='RMSProp'
                                      )
You'll notice that the file in your screenshot is graph.pbtxt whereas saved_model.pb{txt} is needed.
Note that just renaming the file generally will not be sufficient. The training process outputs checkpoints periodically in case restarts happen and recovery is needed. However, those checkpoints (and corresponding graphs) are the training graph. Training graphs tend to have things like file readers, input queues, dropout layers, etc. which are not appropriate for serving.
Instead, TensorFlow requires you to explicitly export a separate graph for serving. You can do this in one of two ways:
During training (typically, after training is complete)
As a separate process after training.
During/After Training
For this, I'll refer you to the Census sample.
First, you'll need a "Serving Input Function", such as:
def serving_input_fn():
    """Build the serving inputs."""
    inputs = {}
    for feat in INPUT_COLUMNS:
        inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)

    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in inputs.iteritems()
    }
    return tf.contrib.learn.InputFnOps(features, None, inputs)
Then you can simply call:
regressor.export_savedmodel("path/to/model", serving_input_fn)
Or, if you're using learn_runner/Experiment, you'll need to pass an ExportStrategy like the following to the constructor of Experiment:
export_strategies=[saved_model_export_utils.make_export_strategy(
    serving_input_fn,
    exports_to_keep=1,
    default_output_alternative_key=None,
)]
After Training
Almost exactly the same steps as above, but just in a separate Python script you can run after training is over (in your case, this is beneficial because you won't have to retrain). The basic idea is to construct the Estimator with the same model_dir used in training, then to call export as above, something like:
def serving_input_fn():
    """Build the serving inputs."""
    inputs = {}
    for feat in INPUT_COLUMNS:
        inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)

    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in inputs.iteritems()
    }
    return tf.contrib.learn.InputFnOps(features, None, inputs)

regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[40, 30, 20],
    model_dir="gs://my-bucket1/test2",
    optimizer='RMSProp'
)

regressor.export_savedmodel("my_model", serving_input_fn)
EDIT 09/12/2017
One slight change is needed to your training code. You are using tf.estimator.DNNRegressor, but that was introduced in TensorFlow 1.3; CloudML Engine only officially supports TensorFlow 1.2, so you'll need to use tf.contrib.learn.DNNRegressor instead. They are very similar, but one notable difference is that you'll need to use the fit method instead of train.
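To illustrate that last point, here is a minimal sketch of the TF 1.2-compatible training call (train_input_fn and the step count are placeholders, not taken from your code):
# tf.contrib.learn estimators are trained with fit() rather than train()
regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[40, 30, 20],
    model_dir="gs://my-bucket1/test2",
    optimizer='RMSProp')
regressor.fit(input_fn=train_input_fn, steps=50000)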
I had the same error message; in my case there were two problems:
The path to the bucket was misspelled.
A wrong saved_model.pbtxt file (after the first error message, I had put another, renamed .pbtxt file into the same bucket alongside my model classes, which made the problem persist even after the path was corrected).
The command worked after deleting the wrong file and correcting the path.
I hope this helps too.