I am trying to continue training from a saved checkpoint using the colab setup for GPT-2-simple at:
https://colab.research.google.com/drive/1SvQne5O_7hSdmPvUXl5UzPeG5A6csvRA#scrollTo=aeXshJM-Cuaf
But I just can't get it to work. Loading the saved checkpoint from my Google Drive works fine, and I can use it to generate text, but I can't continue training from that checkpoint. In gpt2.finetune() I am passing restore_from='latest' and overwrite=True, and I have tried both the same run_name and a different one, with and without overwrite=True. I have also tried restarting the runtime in between, as was suggested, but it doesn't help; I keep getting the following error:
"ValueError: Variable model/wpe already exists, disallowed. Did you mean to set reuse=True
or reuse=tf.AUTO_REUSE in VarScope?"
I assume that I need to run gpt2.load_gpt2(sess, run_name='myRun') before continuing training, but whenever I run this first, gpt2.finetune() throws this error.
You don't need to (and can't) run load_gpt2() before finetuning. You instead simply need to give run_name to finetune(). I agree that this is confusing as hell; I had the same trouble.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              file_name,
              model_name=model_name,
              checkpoint_dir=checkpoint_dir,
              run_name=run_name,
              steps=25,
              )
This will automatically grab the latest checkpoint from your checkpoint/<run_name> folder, load its weights, and continue training where it left off. You can confirm this by checking the step counter - it doesn't start again from 0. E.g., if you'd previously trained for 25 steps, it'll start at 26:
Training...
[26 | 7.48] loss=0.49 avg=0.49
Also note that to run finetuning multiple times (or to load another model) you normally have to restart the Python runtime. You can instead run this before each finetune command:
tf.reset_default_graph()
I've tried the following and it works fine:
tf.reset_default_graph()
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              steps=n,
              dataset=file_name,
              model_name='model',
              print_every=z,
              run_name='run_name',
              restore_from='latest',
              sample_every=x,
              save_every=y
              )
You must pass the same run_name as the run you want to resume training from, and set restore_from='latest'.
Related
I'm trying to run a loop over a set of parameters, and I want to create a new network for each parameter and let it learn for a few epochs.
Currently my code looks like this:
def optimize_scale(self, epochs=5, comp_scale=100, scale_list=[1, 100]):
    trainer = pyli.Trainer(gpus=1, max_epochs=epochs)
    for scale in scale_list:
        test_model = CustomNN(num_layers=1, scale=scale, lr=1, pad=True, batch_size=1)
        trainer.fit(test_model)
        trainer.test(verbose=True)
        del test_model
Everything works fine for the first element of scale_list: the network trains for 5 epochs and completes the test. All this can be seen in the console. However, for all following elements of scale_list it doesn't work, as the old network is not overwritten; instead, an old checkpoint is loaded automatically when trainer.fit(model) is called. In the console this is indicated through:
C:\Users\XXXX\AppData\Roaming\Python\Python39\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:623: UserWarning:
Checkpoint directory D:\XXXX\src\lightning_logs\version_0\checkpoints exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
train_size = 8 val_size = 1 test_size = 1
Restoring states from the checkpoint path at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at D:\XXXX\src\lightning_logs\version_0\checkpoints\epoch=4-step=39.ckpt
The consequence is that the second test outputs the same result, as the checkpoint from the old network, which had already finished all 5 epochs, was loaded. I thought that adding del test_model might help in dropping the model completely, but that did not work.
While searching I found a few closely related issues, for example: https://github.com/PyTorchLightning/pytorch-lightning/issues/368. However, I did not manage to fix my problem. I assume it has something to do with the fact that the new network, which should overwrite the old one, has the same name/version and therefore looks for the same checkpoints.
If anyone has an idea or knows how to circumvent this I would be very grateful.
I think, in your settings, you want to disable automatic checkpointing:
trainer = pyli.Trainer(gpus=1, max_epochs=epochs, enable_checkpointing=False)
You may need to explicitly save a checkpoint (with a different name) for each training session you are running.
You can manually save a checkpoint via:
trainer.save_checkpoint(f'checkpoint_for_scale_{scale}.pth')
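Putting both suggestions together, the loop from the question could look roughly like this (just a sketch, reusing pyli, CustomNN, and the parameter names from the question):
def optimize_scale(self, epochs=5, comp_scale=100, scale_list=[1, 100]):
    # Disable automatic checkpointing so trainer.fit() never silently
    # restores a checkpoint left over from a previous scale.
    trainer = pyli.Trainer(gpus=1, max_epochs=epochs,
                           enable_checkpointing=False)
    for scale in scale_list:
        test_model = CustomNN(num_layers=1, scale=scale, lr=1, pad=True,
                              batch_size=1)
        trainer.fit(test_model)
        trainer.test(verbose=True)
        # Explicitly save each run under its own name.
        trainer.save_checkpoint(f'checkpoint_for_scale_{scale}.pth')
        del test_model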
System information
Google Colab
When I run the example from the official TensorFlow basic text classification tutorial, everything runs fine until the model is saved, but when I load the model it gives me this error.
RuntimeError: Unable to restore a layer of class TextVectorization. Layers of class TextVectorization require that the class be provided to the model loading code, either by registering the class using @keras.utils.register_keras_serializable on the class def and including that file in your program, or by passing the class in a keras.utils.CustomObjectScope that wraps this load call.
Expected Behavior: Model should be loaded successfully and process the raw input
https://colab.research.google.com/gist/amahendrakar/8b65a688dc87ce9ca07ffb0ce50b84c7/44199.ipynb#scrollTo=fEjmSrKIqiiM
Example Link: https://tensorflow.google.cn/tutorials/keras/text_classification
I also ran into this error message (RuntimeError: Unable to restore a layer of class TextVectorization. [...]) when I implemented (and customized) the code from the "Basic Text Classification" tutorial.
Instead of running the code in a notebook, I have two scripts, one for building, training and saving the model and the other one for loading it and making predictions. (Thus, the error does not seem to be limited to Google Colab).
This is what I had to do (see https://github.com/tensorflow/tensorflow/issues/45231):
First, I added this line in the first script before the function definition and built, trained and saved the model again:
@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_data):
[...]
# Save model as SavedModel
export_model.save(model_path, save_format='tf')
Secondly, I also had to add the same line and the whole function definition in the second script to make sure that it works if I restart(!) ipython (where I currently run the scripts) and only run the second script:
@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation),
                                    '')
[...]
# Load model
reloaded_model = tf.keras.models.load_model(model_path)
# Make predictions
predictions = reloaded_model.predict(examples)
Note: If I run the second script without restarting ipython after running the first script, I get this error:
ValueError: Custom>custom_standardization has already been registered [...]
Alternatively, you can just use the default standardization method in the vectorizer layer when building the model:
vectorize_layer = TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)
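For completeness, the error message also mentions keras.utils.CustomObjectScope as an option. A rough, untested sketch of that variant (custom_standardization must still be defined or imported in the loading script):
# Sketch: wrap the load call in a CustomObjectScope instead of registering
# the function globally.
with tf.keras.utils.CustomObjectScope(
        {'custom_standardization': custom_standardization}):
    reloaded_model = tf.keras.models.load_model(model_path)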
I got something working as Hassan describes it, I think. Not sure it's the right way, but it seems to work for me...
I define, train, and archive the model in one notebook
I un-archive it, load it, and use it for predictions from another notebook.
See here: https://github.com/OlivierLD/oliv-ai/tree/master/JupyterNotebooks/tf-tutorials/sentiment-analysis
I have a model that contains multiple variables, including a global step. I've been able to successfully use a MonitoredSession to save checkpoints and summaries every 100 steps. I was expecting the MonitoredSession to automatically restore all my variables when the session is run in multiple passes (based on this documentation), but this does not happen. If I look at the global step after running the training session again, I find that it starts back from zero. This is a simplified version of my code without the actual model; let me know if more code is needed to solve this problem.
train_graph = tf.Graph()
with train_graph.as_default():
    # I create some datasets using the Dataset API
    # ...
    global_step = tf.train.create_global_step()
    # Create all the other variables and the model here
    # ...
    saver_hook = tf.train.CheckpointSaverHook(
        checkpoint_dir='checkpoint/',
        save_secs=None,
        save_steps=100,
        saver=tf.train.Saver(),
        checkpoint_basename='model.ckpt',
        scaffold=None)
    summary_hook = tf.train.SummarySaverHook(
        save_steps=100,
        save_secs=None,
        output_dir='summaries/',
        summary_writer=None,
        scaffold=None,
        summary_op=train_step_summary)
    num_steps_hook = tf.train.StopAtStepHook(num_steps=500)  # Just for testing

    with tf.train.MonitoredSession(
            hooks=[saver_hook, summary_hook, num_steps_hook]) as sess:
        while not sess.should_stop():
            step = sess.run(global_step)
            if step % 100 == 0:
                print(step)
            sess.run(optimizer)
When I run this code the first time, I get this output
0
100
200
300
400
The checkpoint folder at this point has checkpoints for every hundredth step up to 500. If I run the program again I would expect to see the counter start at 500 and then increase up to 900, but instead I just get the same thing and all of my checkpoints get overwritten. Any ideas?
Alright, I figured it out. It was actually really simple. First, it's easier to use a MonitoredTrainingSession() instead of a MonitoredSession(). This wrapper session takes 'checkpoint_dir' as an argument. I thought that the saver_hook would take care of restoring, but that's not the case. In order to fix my problem I just had to change the line where I define the session like so:
with tf.train.MonitoredTrainingSession(hooks=[saver_hook, summary_hook], checkpoint_dir='checkpoint') as sess:
It can also be done with the MonitoredSession directly, but you need to set up a session_creator instead.
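For reference, a rough sketch of that MonitoredSession variant (reusing the hooks from the question and assuming the same checkpoint directory):
# Sketch: a session creator pointed at the checkpoint directory restores
# the variables (including the global step) when the session starts.
session_creator = tf.train.ChiefSessionCreator(checkpoint_dir='checkpoint/')
with tf.train.MonitoredSession(session_creator=session_creator,
                               hooks=[saver_hook, summary_hook]) as sess:
    while not sess.should_stop():
        sess.run(optimizer)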
I'm trying to use Google Cloud Platform to deploy a model to support prediction.
I train the model (locally) with the following command:
~/$ gcloud ml-engine local train --module-name trainer.task --package-path trainer
and everything works fine (...):
INFO:tensorflow:Restoring parameters from gs://my-bucket1/test2/model.ckpt-45000
INFO:tensorflow:Saving checkpoints for 45001 into gs://my-bucket1/test2/model.ckpt.
INFO:tensorflow:loss = 17471.6, step = 45001
[...]
Loss: 144278.046875
average_loss: 1453.68
global_step: 50000
loss: 144278.0
INFO:tensorflow:Restoring parameters from gs://my-bucket1/test2/model.ckpt-50000
Mean Square Error of Test Set = 593.1018482
But, when I run the following command to create a version,
gcloud ml-engine versions create Mo1 --model mod1 --origin gs://my-bucket1/test2/ --runtime-version 1.3
Then I get the following error.
ERROR: (gcloud.ml-engine.versions.create) FAILED_PRECONDITION: Field: version.deployment_uri
Error: SavedModel directory gs://my-bucket1/test2/ is expected to contain exactly one
of: [saved_model.pb, saved_model.pbtxt].
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: 'SavedModel directory gs://my-bucket1/test2/ is expected
      to contain exactly one of: [saved_model.pb, saved_model.pbtxt].'
    field: version.deployment_uri
Here is a screenshot of my bucket. I have a saved model with 'pbtxt' format
my-bucket-image
Finally, here is the piece of code where I save the model to the bucket.
regressor = tf.estimator.DNNRegressor(feature_columns=feature_columns,
                                      hidden_units=[40, 30, 20],
                                      model_dir="gs://my-bucket1/test2",
                                      optimizer='RMSProp'
                                      )
You'll notice that the file in your screenshot is graph.pbtxt whereas saved_model.pb{txt} is needed.
Note that just renaming the file generally will not be sufficient. The training process outputs checkpoints periodically in case restarts happen and recovery is needed. However, those checkpoints (and corresponding graphs) are the training graph. Training graphs tend to have things like file readers, input queues, dropout layers, etc. which are not appropriate for serving.
Instead, TensorFlow requires you to explicitly export a separate graph for serving. You can do this in one of two ways:
During training (typically, after training is complete)
As a separate process after training.
During/After Training
For this, I'll refer you to the Census sample.
First, you'll need a "Serving Input Function", such as:
def serving_input_fn():
    """Build the serving inputs."""
    inputs = {}
    for feat in INPUT_COLUMNS:
        inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)
    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in inputs.iteritems()
    }
    return tf.contrib.learn.InputFnOps(features, None, inputs)
Then you can simply call:
regressor.export_savedmodel("path/to/model", serving_input_fn)
Or, if you're using learn_runner/Experiment, you'll need to pass an ExportStrategy like the following to the constructor of Experiment:
export_strategies=[saved_model_export_utils.make_export_strategy(
    serving_input_fn,
    exports_to_keep=1,
    default_output_alternative_key=None,
)]
After Training
Almost exactly the same steps as above, but just in a separate Python script you can run after training is over (in your case, this is beneficial because you won't have to retrain). The basic idea is to construct the Estimator with the same model_dir used in training, then to call export as above, something like:
def serving_input_fn():
    """Build the serving inputs."""
    inputs = {}
    for feat in INPUT_COLUMNS:
        inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)
    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in inputs.iteritems()
    }
    return tf.contrib.learn.InputFnOps(features, None, inputs)

regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[40, 30, 20],
    model_dir="gs://my-bucket1/test2",
    optimizer='RMSProp'
)

regressor.export_savedmodel("my_model", serving_input_fn)
EDIT 09/12/2017
One slight change is needed to your training code. You are using tf.estimator.DNNRegressor, but that was introduced in TensorFlow 1.3; CloudML Engine only officially supports TensorFlow 1.2, so you'll need to use tf.contrib.learn.DNNRegressor instead. They are very similar, but one notable difference is that you'll need to use the fit method instead of train.
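A rough sketch of that change (feature_columns and a train_input_fn are assumed to already exist in your training code):
regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_columns,
    hidden_units=[40, 30, 20],
    model_dir="gs://my-bucket1/test2",
    optimizer='RMSProp')

# tf.contrib.learn estimators expose fit() rather than train().
regressor.fit(input_fn=train_input_fn, steps=50000)  # or however many steps you need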
I had the same error message here; in my case there were two problems:
The path to the bucket was misspelled.
A wrong saved_file.pbtxt: after the first error message I had put another, renamed .pbtxt file in the same bucket together with my model classes, which made the problem persist even after the path was corrected.
The command worked after deleting the wrong file and correcting the path.
I hope this helps too.
I'm training a classification CNN using TensorFlow v0.12, and then want to create labels for new data using the trained model.
At the end of the training script, I added those lines of code:
saver = tf.train.Saver()
save_path = saver.save(sess,'/home/path/to/model/model.ckpt')
After the training completed, the files appearing in the folder are: 1. checkpoint ; 2. model.ckpt.data-00000-of-00001 ; 3. model.ckpt.index ; 4. model.ckpt.meta
Then I tried to restore the model using the .meta file. Following this tutorial, I added the following line into my classification code:
saver=tf.train.import_meta_graph(savepath+'model.ckpt.meta') #line1
and then:
saver.restore(sess, save_path=savepath+'model.ckpt') #line2
Before that change, I needed to build the graph again, and then write (instead of line1):
saver = tf.train.Saver()
But deleting the graph building and using line1 in order to restore it raised an error. The error was that I used a variable from the graph inside my code, and Python didn't recognize it:
predictions = sess.run(y_conv, feed_dict={x: patches,keep_prob: 1.0})
Python didn't recognize the y_conv parameter. Is there a way to restore the variables using the meta graph? If not, what is this restore good for, if I can't use variables from the original graph?
I know this question isn't so clear, but it was hard for me to express the problem in words. Sorry about it...
Thanks for answering, appreciate your help! Roi.
It is possible, don't worry. Assuming you don't want to touch the graph anymore, do something like this:
saver = tf.train.import_meta_graph('model/export/{}.meta'.format(model_name))
saver.restore(sess, 'model/export/{}'.format(model_name))
graph = tf.get_default_graph()
y_conv = graph.get_operation_by_name('y_conv').outputs[0]
predictions = sess.run(y_conv, feed_dict={x: patches,keep_prob: 1.0})
A preferred way would however be adding the ops into collections when you build the graph and then referring to them. So when you define the graph, you would add the line:
tf.add_to_collection("y_conv", y_conv)
And then after you import the metagraph and restore it, you would call:
y_conv = tf.get_collection("y_conv")[0]
It is actually explained in the documentation - the exact page you linked - but perhaps you missed it.
Btw, no need for the .ckpt extension, it might create some confusion as that is the old way of saving models.
Just to add to Roberts's answer - after obtaining a saver from the meta graph, and using it to restore the variables in the current session, you can also use:
y_conv = graph.get_tensor_by_name('y_conv:0')
This will work if you created y_conv with an explicit name="y_conv" argument (all TF ops accept a name).
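For example (a hypothetical snippet; logits stands for whatever your model's last layer produces):
# When building the graph, give the op an explicit name...
y_conv = tf.nn.softmax(logits, name="y_conv")

# ...then, after import_meta_graph() and restore(), fetch it by tensor name:
graph = tf.get_default_graph()
y_conv = graph.get_tensor_by_name("y_conv:0")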