How does tf.train.Saver work exactly?

I am a bit confused by how tf.train.Saver() works. I have the following code to save only trainable variables:
import tensorflow as tf
# Create some variables.
v1 = tf.get_variable("v1", shape=[3], initializer = tf.zeros_initializer)
v2 = tf.get_variable("v2", shape=[5], initializer = tf.zeros_initializer)
inc_v1 = v1.assign(v1+1)
dec_v2 = v2.assign(v2-1)
# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver(tf.trainable_variables())
print([x.name for x in tf.trainable_variables()])
# Later, launch the model, initialize the variables, do some work, and save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    inc_v1.op.run()
    dec_v2.op.run()
    # Save the variables to disk.
    save_path = saver.save(sess, "./model.ckpt")
    print("Model saved in file: %s" % save_path)
And the following code just to see them:
import tensorflow as tf
sess = tf.Session()
saver = tf.train.import_meta_graph('model.ckpt.meta')
saver.restore(sess,'model.ckpt')
print([v.name for v in tf.get_default_graph().as_graph_def().node])
The first script outputs ['v1:0', 'v2:0'], as expected. I expected the second script to produce the same result, but I see this:
['v1/Initializer/zeros', 'v1', 'v1/Assign', 'v1/read', 'v2/Initializer/zeros', 'v2', 'v2/Assign', 'v2/read', 'add/y', 'add', 'Assign', 'sub/y', 'sub', 'Assign_1', 'init', 'save/Const', 'save/SaveV2/tensor_names', 'save/SaveV2/shape_and_slices', 'save/SaveV2', 'save/control_dependency', 'save/RestoreV2/tensor_names', 'save/RestoreV2/shape_and_slices', 'save/RestoreV2', 'save/Assign', 'save/RestoreV2_1/tensor_names', 'save/RestoreV2_1/shape_and_slices', 'save/RestoreV2_1', 'save/Assign_1', 'save/restore_all']
I am not sure why TensorFlow appears to save everything instead of only the two variables I specified. How can I save just those two?

Try the following code from the tensorflow wiki
tf.reset_default_graph()
# Create some variables.
v1 = tf.get_variable("v1", shape=[3])
v2 = tf.get_variable("v2", shape=[5])
saver = tf.train.Saver(var_list=[v1, v2])  # list of TF variables that are to be restored
with tf.Session() as sess:
    # Restore variables from disk.
    saver.restore(sess, "./model.ckpt")
    print("Model restored.")
    # Check the values of the variables
    print("v1 : %s" % v1.eval())
    print("v2 : %s" % v2.eval())
I hope this helps!
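A note on what var_list controls: it selects which variable values are written to the checkpoint data. The .meta file written alongside it stores the whole graph definition, so listing graph nodes after import_meta_graph shows every op (initializers, assigns, the save/... ops), not the set of saved variables. Here is a minimal sketch for checking the checkpoint contents directly. It assumes a TensorFlow install that provides tensorflow.compat.v1; the non-trainable variable counter and the temporary directory are hypothetical additions for illustration.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()
tf.reset_default_graph()

# Two trainable variables, as in the question.
v1 = tf.get_variable("v1", shape=[3], initializer=tf.zeros_initializer())
v2 = tf.get_variable("v2", shape=[5], initializer=tf.zeros_initializer())
# Hypothetical extra variable that is NOT trainable, so it is left out of var_list.
counter = tf.get_variable("counter", shape=[], initializer=tf.zeros_initializer(),
                          trainable=False)

# Only the trainable variables are handed to the Saver.
saver = tf.train.Saver(tf.trainable_variables())

ckpt_dir = tempfile.mkdtemp()  # hypothetical location for this sketch
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    save_path = saver.save(sess, os.path.join(ckpt_dir, "model.ckpt"))

# list_variables reads the checkpoint itself and yields (name, shape) pairs.
saved_names = [name for name, shape in tf.train.list_variables(save_path)]
print(saved_names)
```

If the checkpoint lists only v1 and v2, the Saver did what was asked; the longer node list in the question comes from the graph definition, not from the saved values.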

Related

Can't restore tensorflow model with Saver

I'm following this guide to using the Saver class in Tensorflow version 1.
I'm first saving the model:
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    saver.save(sess, "./saved_model/tf/model", global_step=0)
which gives me these files:
$ ls saved_model/tf
checkpoint model-0.data-00000-of-00001 model-0.index model-0.meta
But when I try to restore the session, I get an error:
with tf.Session() as sess:
    saver.restore(sess, "./saved_model/tf/model")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-37-01cbbefb52af> in <module>()
1 with tf.Session() as sess:
----> 2 saver.restore(sess, "./saved_model/tf/model")
/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py in restore(self, sess, save_path)
1280 if not checkpoint_management.checkpoint_exists_internal(checkpoint_prefix):
1281 raise ValueError("The passed save_path is not a valid checkpoint: " +
-> 1282 checkpoint_prefix)
1283
1284 logging.info("Restoring parameters from %s", checkpoint_prefix)
ValueError: The passed save_path is not a valid checkpoint: ./saved_model/tf/model
What am I doing wrong? Unfortunately, the TF documentation on this feature does not help much.

ValueError: The passed save_path is not a valid checkpoint: ./saved_model/tf/model
The error says that no checkpoint exists at the path you passed, so it is not a valid checkpoint.
I was able to reproduce your problem. It is caused by global_step=0 in the save call, which appends the step number to the file names (test-0 instead of test). To make this visible, I print the save path at the end of the program; it shows where the files were written and how they are named with this option.
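The naming rule can be sketched in plain Python without TensorFlow. This is a simplified model, not TensorFlow's own code: the helper names checkpoint_prefix and looks_like_checkpoint are hypothetical, and the check only tests for the .index file that a V2 checkpoint prefix is expected to have.

```python
import os
import tempfile


def checkpoint_prefix(save_path, global_step=None):
    """Model of the naming rule: a given step is appended as '-<step>'."""
    if global_step is None:
        return save_path
    return "{}-{}".format(save_path, global_step)


def looks_like_checkpoint(prefix):
    """Simplified stand-in for the validity check: '<prefix>.index' must exist."""
    return os.path.exists(prefix + ".index")


ckpt_dir = tempfile.mkdtemp()  # hypothetical directory for this sketch
requested = os.path.join(ckpt_dir, "model")
written_prefix = checkpoint_prefix(requested, global_step=0)

# Create empty stand-ins for the three files listed in the question.
for suffix in (".index", ".meta", ".data-00000-of-00001"):
    open(written_prefix + suffix, "w").close()

print(sorted(os.listdir(ckpt_dir)))
print(looks_like_checkpoint(requested))       # False: there is no 'model.index'
print(looks_like_checkpoint(written_prefix))  # True: 'model-0.index' exists
```

The path passed to restore has to be the prefix that was actually written, including the step suffix.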
Model Save:
%tensorflow_version 1.x
import tensorflow as tf
# Create some variables.
v1 = tf.get_variable("v1", shape=[3], initializer = tf.zeros_initializer)
v2 = tf.get_variable("v2", shape=[5], initializer = tf.zeros_initializer)
inc_v1 = v1.assign(v1+1)
dec_v2 = v2.assign(v2-1)
# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables, do some work, and save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    inc_v1.op.run()
    dec_v2.op.run()
    # Save the variables to disk.
    save_path = saver.save(sess, "/content/gdrive/My Drive/checkpoint/test", global_step=0)
    print("Model saved in path: %s" % save_path)
Output:
TensorFlow 1.x selected.
Model saved in path: /content/gdrive/My Drive/checkpoint/test-0
Listing the contents of a directory:
!ls "/content/gdrive/My Drive/checkpoint/"
checkpoint test-0.data-00000-of-00001 test-0.index test-0.meta
Model Restore:
%tensorflow_version 1.x
import tensorflow as tf
with tf.Session() as sess:
    saver.restore(sess, "/content/gdrive/My Drive/checkpoint/test")
Output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-313790e7866b> in <module>()
4
5 with tf.Session() as sess:
----> 6 saver.restore(sess, "/content/gdrive/My Drive/checkpoint/test")
/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py in restore(self, sess, save_path)
1280 if not checkpoint_management.checkpoint_exists_internal(checkpoint_prefix):
1281 raise ValueError("The passed save_path is not a valid checkpoint: " +
-> 1282 checkpoint_prefix)
1283
1284 logging.info("Restoring parameters from %s", checkpoint_prefix)
ValueError: The passed save_path is not a valid checkpoint: /content/gdrive/My Drive/checkpoint/test
Solution:
Remove global_step=0 from the save call and observe where the files are created and how they are named; this resolves the problem.
%tensorflow_version 1.x
import tensorflow as tf
# Create some variables.
v1 = tf.get_variable("v1", shape=[3], initializer = tf.zeros_initializer)
v2 = tf.get_variable("v2", shape=[5], initializer = tf.zeros_initializer)
inc_v1 = v1.assign(v1+1)
dec_v2 = v2.assign(v2-1)
# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables, do some work, and save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    inc_v1.op.run()
    dec_v2.op.run()
    # Save the variables to disk.
    save_path = saver.save(sess, "/content/gdrive/My Drive/checkpoint/test")
    print("Model saved in path: %s" % save_path)
Output:
TensorFlow 1.x selected.
Model saved in path: /content/gdrive/My Drive/checkpoint/test
Listing the contents of a directory:
!ls "/content/gdrive/My Drive/checkpoint/"
checkpoint test.data-00000-of-00001 test.index test.meta
Model Restore: the recommended way to restore the model is shown below.
%tensorflow_version 1.x
import tensorflow as tf
tf.reset_default_graph()
# Create some variables.
v1 = tf.get_variable("v1", shape=[3])
v2 = tf.get_variable("v2", shape=[5])
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
    # Restore variables from disk.
    saver.restore(sess, "/content/gdrive/My Drive/checkpoint/test")
    print("Model restored.")
    # Check the values of the variables
    print("v1 : %s" % v1.eval())
    print("v2 : %s" % v2.eval())
Output:
TensorFlow 1.x selected.
INFO:tensorflow:Restoring parameters from /content/gdrive/My Drive/checkpoint/test
Model restored.
v1 : [1. 1. 1.]
v2 : [-1. -1. -1. -1. -1.]
Please refer to the Save and Restore explanation and code for TensorFlow version 1.x here.
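If you would rather keep the step number in the file names, an alternative is to restore from the exact prefix that was written, for example the one reported by tf.train.latest_checkpoint. A minimal sketch, assuming a TensorFlow install that provides tensorflow.compat.v1; the temporary directory and the step value 7 are hypothetical.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()
ckpt_dir = tempfile.mkdtemp()  # hypothetical checkpoint directory

# Save with a step number; the files are named 'test-7.*'.
tf.reset_default_graph()
v1 = tf.get_variable("v1", shape=[3], initializer=tf.zeros_initializer())
inc_v1 = v1.assign(v1 + 1)
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(inc_v1)
    save_path = saver.save(sess, os.path.join(ckpt_dir, "test"), global_step=7)

# Restore: rebuild the variable, then ask for the newest prefix in the directory.
tf.reset_default_graph()
v1 = tf.get_variable("v1", shape=[3])
saver = tf.train.Saver()
latest = tf.train.latest_checkpoint(ckpt_dir)
with tf.Session() as sess:
    saver.restore(sess, latest)
    restored_v1 = sess.run(v1)

print(latest)
print(restored_v1)
```

Passing the value returned by saver.save (save_path above) to restore works the same way, since it already includes the step suffix.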

Tensorflow - can't initialize saved variables unless I recreate the "saver" object. Why?

I'm pretty sure I'm missing something about how tensorflow works because my solution doesn't make any sense.
I'm trying to train a neural network (from scratch, without using Estimators or other abstractions), save it, and load a simplified version of it for inference.
The following code trains but gives me the error:
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value hidden0/biases/Variable
[[Node: hidden0/biases/Variable/read = Identity[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](hidden0/biases/Variable)]]
If I uncomment the commented line, that is, if I recreate a saver object that I neither use nor return, the code works just fine.
Why do I need to create a (useless) saver object in order to restore the saved weights?
import tensorflow as tf
import numpy as np
def add_fc_layer(input_tensor, input_dimensions, output_dimensions, layer_name, activation=None):
    with tf.variable_scope(layer_name):
        with tf.variable_scope('weights'):
            weights = tf.Variable(tf.truncated_normal([input_dimensions, output_dimensions]))
        with tf.variable_scope('biases'):
            biases = tf.Variable(tf.zeros([output_dimensions]))
        with tf.variable_scope('Wx_plus_b'):
            preactivate = tf.matmul(input_tensor, weights) + biases
        if activation is None:
            return preactivate
        with tf.variable_scope('activation'):
            activations = activation(preactivate)
        return activations
def make_network(model_phase):
    if model_phase not in {"train", "test"}:
        raise ValueError("invalid type")
    hidden0_units = 25
    hidden1_units = 15
    hidden2_units = 10
    input_size = 10
    output_size = 4
    with tf.variable_scope('InputVector'):
        inputs = tf.placeholder(shape=[1, input_size], dtype=tf.float32)
    hidden0_out = add_fc_layer(inputs, input_size, hidden0_units, "hidden0", activation=tf.nn.sigmoid)
    hidden1_out = add_fc_layer(hidden0_out, hidden0_units, hidden1_units, "hidden1", activation=tf.nn.sigmoid)
    hidden2_out = add_fc_layer(hidden1_out, hidden1_units, hidden2_units, "hidden2", activation=tf.nn.sigmoid)
    out = add_fc_layer(hidden2_out, hidden2_units, output_size, "regression")
    if model_phase == "test":
        # UNCOMMENTING THIS LINE MAKES THE SCRIPT WORK
        # saver = tf.train.Saver(var_list=tf.trainable_variables())
        return inputs, out
    saver = tf.train.Saver(var_list=tf.trainable_variables())
    with tf.variable_scope('training'):
        with tf.variable_scope('groundTruth'):
            ground_truth = tf.placeholder(shape=[1, output_size], dtype=tf.float32)
        with tf.variable_scope('loss'):
            loss = tf.reduce_sum(tf.square(ground_truth - out))
            tf.summary.scalar('loss', loss)
        with tf.variable_scope('optimizer'):
            trainer = tf.train.AdamOptimizer(learning_rate=0.001)
        with tf.variable_scope('gradient'):
            updateModel = trainer.minimize(loss)
    with tf.variable_scope('predict'):
        predict = tf.random_shuffle(tf.boolean_mask(out, tf.equal(out, tf.reduce_max(out, axis=None))))[0]
    writer = tf.summary.FileWriter('/tmp/test', tf.get_default_graph())
    return inputs, out, ground_truth, updateModel, writer, saver
train_graph = tf.Graph()
with tf.Session(graph=train_graph) as sess:
    tf.set_random_seed(42)
    inputs, out, ground_truth, updateModel, writer, saver = make_network(model_phase='train')
    init = tf.initialize_all_variables()
    sess.run(init)
    print('\nLearning...')
    for _ in range(10):
        sess.run([updateModel], feed_dict={inputs:np.arange(10)+np.random.random((1,10)), ground_truth:np.arange(4).reshape(1, 4)})
    saver.save(sess,'./tensorflowModel.ckpt')
new_graph = tf.Graph()
with tf.Session(graph=new_graph) as sess:
    inputs, out = make_network(model_phase='test')
    saver = tf.train.import_meta_graph('./tensorflowModel.ckpt.meta')
    saver.restore(sess, tf.train.latest_checkpoint('./'))
    # evaluation
    print('\nEvaluation...')
    for _ in range(10):
        _ = sess.run(out, feed_dict={inputs:np.arange(10).reshape(1,10)})

I don't know why creating an unused Saver makes the problem go away, but the code betrays a misunderstanding.
When you restore, you create the model graph twice. First, you call make_network(), which creates the computation graph and its variables. Then you call import_meta_graph, which creates a second copy of the graph and variables. Since make_network() has already built the graph, create the saver with a plain saver = tf.train.Saver() instead of saver = tf.train.import_meta_graph('./tensorflowModel.ckpt.meta').
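The two restore styles can be put side by side: either rebuild the graph in code and use a plain Saver, or skip the rebuild and let import_meta_graph create the graph, then look tensors up by name. The sketch below is a minimal illustration, assuming a TensorFlow install that provides tensorflow.compat.v1; build_model is a hypothetical one-layer stand-in for make_network(), and the tensor names x and out are chosen for this example only.

```python
import os
import tempfile

import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()


def build_model():
    """Hypothetical one-layer model standing in for make_network()."""
    x = tf.placeholder(tf.float32, shape=[1, 2], name="x")
    w = tf.get_variable("w", shape=[2, 1], initializer=tf.ones_initializer())
    out = tf.matmul(x, w, name="out")
    return x, out


ckpt_prefix = os.path.join(tempfile.mkdtemp(), "model.ckpt")
feed = np.array([[1.0, 2.0]], dtype=np.float32)

# Training-time graph: build, initialize, save.
with tf.Graph().as_default():
    x, out = build_model()
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, ckpt_prefix)

# Option A: rebuild the graph in code, then restore with a plain Saver.
with tf.Graph().as_default():
    x, out = build_model()
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, ckpt_prefix)
        result_a = sess.run(out, feed_dict={x: feed})

# Option B: do not rebuild; import the graph from .meta and fetch tensors by name.
with tf.Graph().as_default() as graph:
    saver = tf.train.import_meta_graph(ckpt_prefix + ".meta")
    x = graph.get_tensor_by_name("x:0")
    out = graph.get_tensor_by_name("out:0")
    with tf.Session() as sess:
        saver.restore(sess, ckpt_prefix)
        result_b = sess.run(out, feed_dict={x: feed})

print(result_a, result_b)
```

Each option builds the graph exactly once. Mixing them, as in the question, leaves two copies of every variable in the same graph.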

MonitoredTrainingSession save and restore model

I'm trying to extend the example outlined at https://www.tensorflow.org/deploy/distributed, but I'm having trouble saving the model. I'm running this in the Docker container available at gcr.io/tensorflow/tensorflow:1.5.0-gpu-py3. I started two processes, one for 'ps' and one for 'worker'. The ps process is simply this code:
import tensorflow as tf
def main(_):
    cluster = tf.train.ClusterSpec({"ps":["localhost:2222"],"worker":["localhost:2223"]})
    server = tf.train.Server(cluster, job_name="ps", task_index=0)
    server.join()
if __name__ == "__main__":
    tf.app.run()
The worker code is the following and is based on the mnist examples and the distributed article above:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
data_dir = "/data"
checkpoint_dir = "/tmp/train_logs"
def main(_):
    cluster = tf.train.ClusterSpec({"ps":["localhost:2222"],"worker":["localhost:2223"]})
    server = tf.train.Server(cluster, job_name="worker", task_index=0)
    mnist = input_data.read_data_sets(data_dir, one_hot=True)
    with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:0", cluster=cluster)):
        x = tf.placeholder(tf.float32, [None,784], name="x_input")
        W = tf.Variable(tf.zeros([784,10]))
        b = tf.Variable(tf.zeros([10]))
        y = tf.placeholder(tf.float32, [None,10])
        model = tf.matmul(x, W) + b
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=model))
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.5).minimize(cost, global_step=global_step)
        prediction = tf.equal(tf.argmax(model,1), tf.argmax(y,1))
        accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
    hooks = [tf.train.StopAtStepHook(last_step=101)]
    with tf.train.MonitoredTrainingSession(master=server.target, is_chief=True, checkpoint_dir=checkpoint_dir, hooks=hooks) as sess:
        while not sess.should_stop():
            batch_xs, batch_ys = mnist.train.next_batch(1000)
            sess.run(train_op, feed_dict={x: batch_xs, y: batch_ys})
    latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
    #saver = tf.train.Saver()
    saver = tf.train.import_meta_graph(latest_checkpoint+".meta", clear_devices=True)
    with tf.Session() as sess:
        saver.restore(sess,latest_checkpoint) # "/tmp/train_logs/model.ckpt"
        acc = sess.run(accuracy, feed_dict={x: mnist.test.images,y: mnist.test.labels})
        print("Test accuracy = "+"{:5f}".format(acc))
if __name__ == "__main__":
    tf.app.run()
The examples I've found all seem to end without showing how to use the model. The above code fails on the saver.restore() line with the following error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'save/RestoreV2_2':
Operation was explicitly assigned to /job:ps/task:0/device:CPU:0
but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0 ].
Make sure the device specification refers to a valid device.
Also, as shown above I tried both saver = tf.train.Saver() and saver = tf.train.import_meta_graph(latest_checkpoint+".meta", clear_devices=True) with no success. Same error is shown in either case.
I don't really understand the with tf.device(...): statement. In one iteration I commented out this line (and unindented the statements below it) and the code ran without errors. But I think this is not correct and would like to understand the correct way for this to work.

TensorFlow error when save/restore dynamic_RNN model

I can save and restore a model if it is a CNN, but I can't restore an RNN.
I built the RNN as shown below.
I want to save the trained weights and biases (or the whole model) and then predict without training again. The following is main.py:
#main.py
tf_x = tf.placeholder(tf.float32, [None, seq_length, data_dim], name='tf_x')
tf_y = tf.placeholder(tf.int32, [None, output_dim], name='tf_y')
rnn_cell = tf.contrib.rnn.BasicLSTMCell(num_units=hidden_dim)
outputs, (h_c, h_n) = tf.nn.dynamic_rnn(rnn_cell,
                                        tf_x,
                                        initial_state=None,
                                        dtype=tf.float32,
                                        time_major=False)
output = tf.layers.dense(outputs[:, -1, :], output_dim, name='dense_output')
loss = tf.losses.softmax_cross_entropy(onehot_labels=tf_y, logits=output)
train_op = tf.train.AdamOptimizer(LR).minimize(loss)
accuracy = tf.metrics.accuracy( labels=tf.argmax(tf_y, axis=1), predictions=tf.argmax(output, axis=1),)[1]
with tf.Session() as sess:
    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer()) # the local var is for accuracy_op
    sess.run(init_op) # initialize var in graph
    ...(training)
    saver = tf.train.Saver()
    save_path = saver.save(sess, "Save data/RNN-model")
    saver.export_meta_graph(filename="Save Data/RNN-model.meta", as_text=True)
In run.py I tried to load the saved model:
#run.py
...(same as main.py)
saver = tf.train.Saver()
with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state('Save data/')
    saver.restore(sess, ckpt.model_checkpoint_path)
    saver = tf.train.import_meta_graph("Save data/RNN-model.meta")
    ... (prediction)
The result is:
tensorflow.python.framework.errors_impl.NotFoundError: Key dense/bias not found in checkpoint
What do you think is the problem?

std::system_error after restoring model in Tensorflow

I'm trying to implement a simple saver/restorer like so: (copied from Tensorflow website)
Saver:
import tensorflow as tf
# Create some variables.
v1 = tf.get_variable("v1", shape=[3], initializer = tf.zeros_initializer)
v2 = tf.get_variable("v2", shape=[5], initializer = tf.zeros_initializer)
inc_v1 = v1.assign(v1+1)
dec_v2 = v2.assign(v2-1)
# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables, do some work, and save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    inc_v1.op.run()
    dec_v2.op.run()
    # Save the variables to disk.
    save_path = saver.save(sess, "/tmp/model.ckpt")
    print("Model saved in file: %s" % save_path)
Restorer:
import tensorflow as tf
tf.reset_default_graph()
# Create some variables.
v1 = tf.get_variable("v1", shape=[3])
v2 = tf.get_variable("v2", shape=[5])
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
    # Restore variables from disk.
    saver.restore(sess, "/tmp/model.ckpt")
    print("Model restored.")
    # Check the values of the variables
    print("v1 : %s" % v1.eval())
    print("v2 : %s" % v2.eval())
It seems to save the model fine, but when restoring it gets stuck on the line saver.restore(sess, "/tmp/model.ckpt") and I end up with the error message:
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
I don't see how it can be a memory error as I am running on my work server which has 100s of GB of memory.
Python Version 3.5, Tensorflow version 1.2.1
