training operation not getting loaded after restoring model Tensorflow

I have tried this:
saver = tf.train.import_meta_graph(tf.train.latest_checkpoint(model_path) + ".meta")
sess = tf.Session()
if tf.train.checkpoint_exists(tf.train.latest_checkpoint(model_path)):
    saver.restore(sess, tf.train.latest_checkpoint(model_path))
    print(tf.train.latest_checkpoint(model_path) + " Session Loaded for Testing")
graph = tf.get_default_graph()
X = graph.get_tensor_by_name('input:0')
y = graph.get_tensor_by_name('output:0')
loss = tf.reduce_mean(tf.square(outputs - y))  # loss function = mean squared error
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
I got the following error:
ValueError: Duplicate node name in graph: 'rnn/multi_rnn_cell/cell_0/layer0/kernel/Adam'
I tried using the name as mentioned in the error:
optimizer = graph.get_operation_by_name("rnn/multi_rnn_cell/cell_0/layer0/kernel/Adam")
Then I received this error:
AttributeError: 'Operation' object has no attribute 'minimize'
How can I get access to the training_op from the restored model?
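The thread has no answer here, so what follows is a minimal sketch of one common approach, assuming you can re-run the script that saves the model. The Duplicate node name error most likely appears because import_meta_graph() has already recreated the Adam slot variables (such as .../kernel/Adam), so a second minimize() call tries to create nodes with the same names. Instead, register the training op in a collection before saving; collections are written to the .meta file, so after import_meta_graph() the existing op can be fetched with tf.get_collection() and run directly. If the model was already saved without a collection, the op may still be reachable with graph.get_operation_by_name(), but its exact name depends on how it was built. The tiny linear model, the tensor names input, target and loss, and the collection key "training_op" are illustrative assumptions; tf.compat.v1 is used so the snippet runs on TF 1.15 and on 2.x in graph mode.

```python
import os
import tempfile

import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # make TF 2.x behave like 1.x graph mode

prefix = os.path.join(tempfile.mkdtemp(), "model")

# --- Training script: build the graph and register the training op. ---
with tf.Graph().as_default():
    X = tf.placeholder(tf.float32, [None, 3], name="input")
    y = tf.placeholder(tf.float32, [None, 1], name="target")
    W = tf.get_variable("W", [3, 1], initializer=tf.ones_initializer())
    outputs = tf.matmul(X, W)
    loss = tf.reduce_mean(tf.square(outputs - y), name="loss")
    training_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)
    # Collections are written to the .meta file together with the graph.
    tf.add_to_collection("training_op", training_op)
    saver = tf.train.Saver()  # also saves the Adam slot variables
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, prefix)

# --- Resuming script: no model code, only the .meta file and checkpoint. ---
with tf.Graph().as_default() as graph:
    saver = tf.train.import_meta_graph(prefix + ".meta")
    training_op = tf.get_collection("training_op")[0]  # reuse it, do not call minimize() again
    X = graph.get_tensor_by_name("input:0")
    y = graph.get_tensor_by_name("target:0")
    loss = graph.get_tensor_by_name("loss:0")
    with tf.Session() as sess:
        saver.restore(sess, prefix)
        feed = {X: np.ones((4, 3)), y: np.zeros((4, 1))}
        before = sess.run(loss, feed)
        for _ in range(20):
            sess.run(training_op, feed)
        after = sess.run(loss, feed)
print(before, after)
```

Because the optimizer's slot variables are part of the checkpoint, training resumes with the saved Adam state rather than a fresh one.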

Related

Can't restore tensorflow model with Saver

I'm following this guide to using the Saver class in Tensorflow version 1.
I'm first saving the model:
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    saver.save(sess, "./saved_model/tf/model", global_step=0)
which gives me these files:
$ ls saved_model/tf
checkpoint model-0.data-00000-of-00001 model-0.index model-0.meta
But when I try to restore the session, I get an error:
with tf.Session() as sess:
    saver.restore(sess, "./saved_model/tf/model")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-37-01cbbefb52af> in <module>()
1 with tf.Session() as sess:
----> 2 saver.restore(sess, "./saved_model/tf/model")
/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py in restore(self, sess, save_path)
1280 if not checkpoint_management.checkpoint_exists_internal(checkpoint_prefix):
1281 raise ValueError("The passed save_path is not a valid checkpoint: " +
-> 1282 checkpoint_prefix)
1283
1284 logging.info("Restoring parameters from %s", checkpoint_prefix)
ValueError: The passed save_path is not a valid checkpoint: ./saved_model/tf/model
What am I doing wrong? Unfortunately, the TF documentation on this feature does not help much.

ValueError: The passed save_path is not a valid checkpoint: ./saved_model/tf/model
The error means that no checkpoint exists at the given save_path prefix, so it is not a valid checkpoint.
I was able to reproduce your problem: it is caused by global_step=0 in the save call. When global_step is given, saver.save() appends "-<global_step>" to the path, so the files are written with the prefix model-0 and nothing exists at ./saved_model/tf/model. To make this visible, the program below prints the save path at the end, which shows where and how the files were created.
Model Save:
%tensorflow_version 1.x
import tensorflow as tf
# Create some variables.
v1 = tf.get_variable("v1", shape=[3], initializer=tf.zeros_initializer)
v2 = tf.get_variable("v2", shape=[5], initializer=tf.zeros_initializer)
inc_v1 = v1.assign(v1+1)
dec_v2 = v2.assign(v2-1)
# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables, do some work, and save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    inc_v1.op.run()
    dec_v2.op.run()
    # Save the variables to disk.
    save_path = saver.save(sess, "/content/gdrive/My Drive/checkpoint/test", global_step=0)
    print("Model saved in path: %s" % save_path)
Output:
TensorFlow 1.x selected.
Model saved in path: /content/gdrive/My Drive/checkpoint/test-0
Listing the contents of a directory:
!ls "/content/gdrive/My Drive/checkpoint/"
checkpoint test-0.data-00000-of-00001 test-0.index test-0.meta
Model Restore:
%tensorflow_version 1.x
import tensorflow as tf
with tf.Session() as sess:
    saver.restore(sess, "/content/gdrive/My Drive/checkpoint/test")
Output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-313790e7866b> in <module>()
4
5 with tf.Session() as sess:
----> 6 saver.restore(sess, "/content/gdrive/My Drive/checkpoint/test")
/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/saver.py in restore(self, sess, save_path)
1280 if not checkpoint_management.checkpoint_exists_internal(checkpoint_prefix):
1281 raise ValueError("The passed save_path is not a valid checkpoint: " +
-> 1282 checkpoint_prefix)
1283
1284 logging.info("Restoring parameters from %s", checkpoint_prefix)
ValueError: The passed save_path is not a valid checkpoint: /content/gdrive/My Drive/checkpoint/test
Solution:
Remove global_step=0 from the save call and observe where and how the files are created; the checkpoint prefix then matches the path passed to saver.restore(), which resolves the problem.
%tensorflow_version 1.x
import tensorflow as tf
# Create some variables.
v1 = tf.get_variable("v1", shape=[3], initializer=tf.zeros_initializer)
v2 = tf.get_variable("v2", shape=[5], initializer=tf.zeros_initializer)
inc_v1 = v1.assign(v1+1)
dec_v2 = v2.assign(v2-1)
# Add an op to initialize the variables.
init_op = tf.global_variables_initializer()
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, initialize the variables, do some work, and save the
# variables to disk.
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    inc_v1.op.run()
    dec_v2.op.run()
    # Save the variables to disk.
    save_path = saver.save(sess, "/content/gdrive/My Drive/checkpoint/test")
    print("Model saved in path: %s" % save_path)
Output:
TensorFlow 1.x selected.
Model saved in path: /content/gdrive/My Drive/checkpoint/test
Listing the contents of a directory:
!ls "/content/gdrive/My Drive/checkpoint/"
checkpoint test.data-00000-of-00001 test.index test.meta
Model restore: the ideal way to restore the model is shown below.
%tensorflow_version 1.x
import tensorflow as tf
tf.reset_default_graph()
# Create some variables.
v1 = tf.get_variable("v1", shape=[3])
v2 = tf.get_variable("v2", shape=[5])
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
    # Restore variables from disk.
    saver.restore(sess, "/content/gdrive/My Drive/checkpoint/test")
    print("Model restored.")
    # Check the values of the variables
    print("v1 : %s" % v1.eval())
    print("v2 : %s" % v2.eval())
Output:
TensorFlow 1.x selected.
INFO:tensorflow:Restoring parameters from /content/gdrive/My Drive/checkpoint/test
Model restored.
v1 : [1. 1. 1.]
v2 : [-1. -1. -1. -1. -1.]
Please refer to the Save and Restore explanation and code for TensorFlow 1.x here: https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/saved_model.md
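If you would rather keep global_step (for example to keep several numbered checkpoints), a sketch of the other option is to restore from the prefix that saver.save() actually used. tf.train.latest_checkpoint() reads the checkpoint file in the directory and returns the newest prefix, which here ends in test-0. The temporary directory stands in for the Drive path, and tf.compat.v1 is an assumption so that the snippet also runs on TF 2.x.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

ckpt_dir = tempfile.mkdtemp()  # stand-in for "/content/gdrive/My Drive/checkpoint"

with tf.Graph().as_default():
    v1 = tf.get_variable("v1", shape=[3], initializer=tf.zeros_initializer())
    inc_v1 = v1.assign(v1 + 1)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(inc_v1)
        # With global_step=0 the files are written under the prefix ".../test-0".
        save_path = saver.save(sess, os.path.join(ckpt_dir, "test"), global_step=0)

# Ask the directory's "checkpoint" file for the newest prefix instead of guessing it.
latest = tf.train.latest_checkpoint(ckpt_dir)

with tf.Graph().as_default():
    v1 = tf.get_variable("v1", shape=[3])
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, latest)  # restoring ".../test" here would raise the ValueError
        restored = sess.run(v1)
print(latest, restored)
```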

I don't know how to fix this on Ubuntu with tensorflow-gpu: "E tensorflow/core/util/events_writer.cc:104]"

I need to write a checkpoint for this deep learning problem, but the error below prevents the file from being written. (The error occurs at tf.summary.FileWriter.)
Error:
"tensorflow/core/util/events_writer.cc:104] Write failed because file could not be opened."
I tried reinstalling, deleting, and granting permissions for tensorflow-gpu, tensorflow, and tensorboard.
with tf.Session() as sess:
    dir = pickledir
    dic_0 = pat_level_arr(dir, 0, [0, 1])
    dic_1 = pat_level_arr(dir, 1, [1, 0])
    seed = 42
    abc = range(len(dic_0))
    abcd = range(len(dic_1))
    dic_0_train, dic_0_test, _, _ = train_test_split(
        dic_0, abc, test_size=0.244, random_state=seed)
    dic_1_train, dic_1_test, _, _ = train_test_split(
        dic_1, abcd, test_size=0.35, random_state=seed)
    dic_train = np.concatenate((dic_0_train, dic_1_train), axis=0)
    dic_test = np.concatenate((dic_0_test, dic_1_test), axis=0)
    summaries_dir = './logs_level'
    # Here is the problem: "tensorflow/core/util/events_writer.cc:104] Write failed because file could not be opened."
    # ======================================================================
    print("here is start\n")
    train_writer = tf.summary.FileWriter(summaries_dir + '/train', sess.graph)
    test_writer = tf.summary.FileWriter(summaries_dir + '/test')
    print("here is end\n")
    # ======================================================================
    init = tf.global_variables_initializer()
    sess.run(init)
    # For train
    try:
        saver.restore(sess, './modelckpt/inception.ckpt')
        print('Model restored')
        epoch_saved = data_saved['var_epoch_saved'].eval()
    except tf.errors.NotFoundError:
        print('No saved model found')
        epoch_saved = 1
    except tf.errors.InvalidArgumentError:
        print('Model structure has changed. Rebuild model')
        epoch_saved = 1
E tensorflow/core/util/events_writer.cc:104] Write failed because file could not be opened.
ValueError : the passed save_path is not a valid checkpoint: ./modelckpt/inception.ckpt
The tensorflow-gpu version is 1.10.0.
The Python version is 3.5 (I think).
TensorBoard is already installed.

Basically, the error means that no checkpoint exists at ./modelckpt/inception.ckpt, so it is not a valid checkpoint.
You need to save the model, for example with the code below, before calling saver.restore(), because restore() loads the variables from files on disk.
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()  # assumes the model's variables are already defined
with tf.Session() as sess:
    sess.run(init_op)
    # Do some work with the model.
    # Save the variables to disk.
    save_path = saver.save(sess, "/tmp/model.ckpt")
Please refer to the Save and Restore explanation and code for TensorFlow 1.x at the link below:
https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/saved_model.md
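A related pattern for the training loop in the question, sketched under the assumption that starting from scratch is acceptable when nothing has been saved yet: ask tf.train.latest_checkpoint() for the newest checkpoint and only call restore() when it returns a path. This avoids relying on the except clauses, which in the question did not catch anything because restore() raised ValueError rather than NotFoundError. The helper name restore_or_init, the epoch counter, and the use of tf.compat.v1 are hypothetical choices for illustration.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()


def restore_or_init(sess, saver, ckpt_dir):
    """Restore the newest checkpoint in ckpt_dir, or initialize if there is none.

    Returns True when the variables were restored from disk.
    """
    latest = tf.train.latest_checkpoint(ckpt_dir)  # None if nothing was saved yet
    if latest is None:
        sess.run(tf.global_variables_initializer())
        return False
    saver.restore(sess, latest)
    return True


ckpt_dir = tempfile.mkdtemp()  # stand-in for ./modelckpt
results = []
for run in range(2):  # simulate starting the script twice
    with tf.Graph().as_default():
        epoch = tf.get_variable("epoch", initializer=tf.constant(1))
        bump = epoch.assign_add(1)
        saver = tf.train.Saver()
        with tf.Session() as sess:
            restored = restore_or_init(sess, saver, ckpt_dir)
            sess.run(bump)
            saver.save(sess, os.path.join(ckpt_dir, "inception.ckpt"))
            results.append((restored, int(sess.run(epoch))))
print(results)
```

The first run finds no checkpoint and initializes; the second run restores the counter saved by the first.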

Tensorflow - can't initialize saved variables unless I recreate the "saver" object. Why?

I'm pretty sure I'm missing something about how tensorflow works because my solution doesn't make any sense.
I'm trying to train a neural network (from scratch, without using Estimators or other abstractions), save it, and load a simplified version of it for inference.
The following code trains, but then fails with the error: FailedPreconditionError (see above for traceback): Attempting to use uninitialized value hidden0/biases/Variable
[[Node: hidden0/biases/Variable/read = Identity[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](hidden0/biases/Variable)]]. If I uncomment the commented line, i.e. if I recreate a saver object that I'm not going to use or return, the code works just fine.
Why do I need to create a (useless) saver object in order to restore the saved weights?
import tensorflow as tf
import numpy as np
def add_fc_layer(input_tensor, input_dimensions, output_dimensions, layer_name, activation=None):
    with tf.variable_scope(layer_name):
        with tf.variable_scope('weights'):
            weights = tf.Variable(tf.truncated_normal([input_dimensions, output_dimensions]))
        with tf.variable_scope('biases'):
            biases = tf.Variable(tf.zeros([output_dimensions]))
        with tf.variable_scope('Wx_plus_b'):
            preactivate = tf.matmul(input_tensor, weights) + biases
        if activation is None:
            return preactivate
        with tf.variable_scope('activation'):
            activations = activation(preactivate)
        return activations
def make_network(model_phase):
    if model_phase not in {"train", "test"}:
        raise ValueError("invalid type")
    hidden0_units = 25
    hidden1_units = 15
    hidden2_units = 10
    input_size = 10
    output_size = 4
    with tf.variable_scope('InputVector'):
        inputs = tf.placeholder(shape=[1, input_size], dtype=tf.float32)
    hidden0_out = add_fc_layer(inputs, input_size, hidden0_units, "hidden0", activation=tf.nn.sigmoid)
    hidden1_out = add_fc_layer(hidden0_out, hidden0_units, hidden1_units, "hidden1", activation=tf.nn.sigmoid)
    hidden2_out = add_fc_layer(hidden1_out, hidden1_units, hidden2_units, "hidden2", activation=tf.nn.sigmoid)
    out = add_fc_layer(hidden2_out, hidden2_units, output_size, "regression")
    if model_phase == "test":
        # UNCOMMENTING THIS LINE MAKES THE SCRIPT WORK
        # saver = tf.train.Saver(var_list=tf.trainable_variables())
        return inputs, out
    saver = tf.train.Saver(var_list=tf.trainable_variables())
    with tf.variable_scope('training'):
        with tf.variable_scope('groundTruth'):
            ground_truth = tf.placeholder(shape=[1, output_size], dtype=tf.float32)
        with tf.variable_scope('loss'):
            loss = tf.reduce_sum(tf.square(ground_truth - out))
            tf.summary.scalar('loss', loss)
        with tf.variable_scope('optimizer'):
            trainer = tf.train.AdamOptimizer(learning_rate=0.001)
        with tf.variable_scope('gradient'):
            updateModel = trainer.minimize(loss)
    with tf.variable_scope('predict'):
        predict = tf.random_shuffle(tf.boolean_mask(out, tf.equal(out, tf.reduce_max(out, axis=None))))[0]
    writer = tf.summary.FileWriter('/tmp/test', tf.get_default_graph())
    return inputs, out, ground_truth, updateModel, writer, saver
train_graph = tf.Graph()
with tf.Session(graph=train_graph) as sess:
    tf.set_random_seed(42)
    inputs, out, ground_truth, updateModel, writer, saver = make_network(model_phase='train')
    init = tf.global_variables_initializer()
    sess.run(init)
    print('\nLearning...')
    for _ in range(10):
        sess.run([updateModel], feed_dict={inputs: np.arange(10) + np.random.random((1, 10)), ground_truth: np.arange(4).reshape(1, 4)})
    saver.save(sess, './tensorflowModel.ckpt')
new_graph = tf.Graph()
with tf.Session(graph=new_graph) as sess:
    inputs, out = make_network(model_phase='test')
    saver = tf.train.import_meta_graph('./tensorflowModel.ckpt.meta')
    saver.restore(sess, tf.train.latest_checkpoint('./'))
    # evaluation
    print('\nEvaluation...')
    for _ in range(10):
        _ = sess.run(out, feed_dict={inputs: np.arange(10).reshape(1, 10)})

I don't know why creating an unused Saver makes the problem go away, but the code betrays a misunderstanding.
When restoring, you are creating the model graph twice. First, you call make_network(), which creates the computation graph and its variables. Then you call import_meta_graph(), which imports a second copy of the graph and its variables into the same graph. You should create the saver with a simple saver = tf.train.Saver() instead of saver = tf.train.import_meta_graph('./tensorflowModel.ckpt.meta').
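A likely explanation for the odd "unused Saver" effect, not verified against this exact script: import_meta_graph() renames imported nodes that clash with existing ones, but the Saver it returns still looks up its restore op by the recorded name save/restore_all. Without a Saver in make_network(), that name belongs to the imported copy, which restores the duplicate variables and leaves the ones `out` reads uninitialized; with one, the name happens to resolve to the Saver created by make_network(). The sketch below shows the suggested fix: build the graph once and restore with a plain Saver. build_net() is a hypothetical, reduced stand-in for make_network(model_phase="test"), and tf.compat.v1 is used so it also runs on TF 2.x.

```python
import os
import tempfile

import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()


def build_net():
    """Hypothetical, reduced stand-in for make_network(model_phase="test")."""
    inputs = tf.placeholder(tf.float32, [1, 10], name="inputs")
    with tf.variable_scope("regression"):
        w = tf.get_variable("weights", [10, 4])
        b = tf.get_variable("biases", [4], initializer=tf.zeros_initializer())
    return inputs, tf.matmul(inputs, w) + b


ckpt = os.path.join(tempfile.mkdtemp(), "tensorflowModel.ckpt")
x = np.arange(10, dtype=np.float32).reshape(1, 10)

# Training side: build once, save only the trainable variables.
with tf.Graph().as_default():
    inputs, out = build_net()
    saver = tf.train.Saver(var_list=tf.trainable_variables())
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        trained = sess.run(out, {inputs: x})
        saver.save(sess, ckpt)

# Inference side: rebuild the graph once (no import_meta_graph) and let a
# plain Saver map checkpoint keys onto the variables that `out` reads.
with tf.Graph().as_default():
    inputs, out = build_net()
    saver = tf.train.Saver(var_list=tf.trainable_variables())
    with tf.Session() as sess:
        saver.restore(sess, ckpt)  # no initializer needed
        restored = sess.run(out, {inputs: x})
print(np.allclose(trained, restored))
```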

MonitoredTrainingSession save and restore model

I'm trying to extend the example outlined at https://www.tensorflow.org/deploy/distributed, but I'm having trouble saving the model. I'm running this in the Docker container available at gcr.io/tensorflow/tensorflow:1.5.0-gpu-py3. I started two processes, one for 'ps' and one for 'worker'; the ps process is simply this code:
import tensorflow as tf
def main(_):
    cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"], "worker": ["localhost:2223"]})
    server = tf.train.Server(cluster, job_name="ps", task_index=0)
    server.join()
if __name__ == "__main__":
    tf.app.run()
The worker code is the following and is based on the mnist examples and the distributed article above:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
data_dir = "/data"
checkpoint_dir = "/tmp/train_logs"
def main(_):
    cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"], "worker": ["localhost:2223"]})
    server = tf.train.Server(cluster, job_name="worker", task_index=0)
    mnist = input_data.read_data_sets(data_dir, one_hot=True)
    with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:0", cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784], name="x_input")
        W = tf.Variable(tf.zeros([784, 10]))
        b = tf.Variable(tf.zeros([10]))
        y = tf.placeholder(tf.float32, [None, 10])
        model = tf.matmul(x, W) + b
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=model))
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.5).minimize(cost, global_step=global_step)
        prediction = tf.equal(tf.argmax(model, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
    hooks = [tf.train.StopAtStepHook(last_step=101)]
    with tf.train.MonitoredTrainingSession(master=server.target, is_chief=True, checkpoint_dir=checkpoint_dir, hooks=hooks) as sess:
        while not sess.should_stop():
            batch_xs, batch_ys = mnist.train.next_batch(1000)
            sess.run(train_op, feed_dict={x: batch_xs, y: batch_ys})
    latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
    # saver = tf.train.Saver()
    saver = tf.train.import_meta_graph(latest_checkpoint + ".meta", clear_devices=True)
    with tf.Session() as sess:
        saver.restore(sess, latest_checkpoint)  # "/tmp/train_logs/model.ckpt"
        acc = sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
        print("Test accuracy = " + "{:5f}".format(acc))
if __name__ == "__main__":
    tf.app.run()
The examples I've found all seem to end without showing how to use the model. The above code fails on the saver.restore() line with the following error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'save/RestoreV2_2':
Operation was explicitly assigned to /job:ps/task:0/device:CPU:0
but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0 ].
Make sure the device specification refers to a valid device.
Also, as shown above I tried both saver = tf.train.Saver() and saver = tf.train.import_meta_graph(latest_checkpoint+".meta", clear_devices=True) with no success. Same error is shown in either case.
I don't really understand the with tf.device(...): statement. In one iteration I commented out this line (and unindented the statements below it) and the code ran without errors. But I think this is not correct and would like to understand the correct way for this to work.
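The thread has no answer here, so the following is a hedged sketch rather than a confirmed fix. The error says the restore op is pinned to /job:ps/task:0, a device that a plain local tf.Session() cannot see. A likely reason is that import_meta_graph() is called on the same default graph that still holds the original, ps-pinned save ops, so the returned saver can end up running the original save/restore op and clear_devices=True has no effect on it. One way around that is to import into a fresh graph with clear_devices=True and look the tensors up by name. The names y_input and accuracy are assumptions: the question's code would need name="y_input" on the label placeholder and name="accuracy" on the accuracy op. The tiny two-class model is a hypothetical stand-in so the sketch runs locally. Another option, not shown, is to restore in a session connected to the cluster with tf.Session(server.target), so that the ps device can be resolved.

```python
import os
import tempfile

import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()


def evaluate_checkpoint(ckpt_prefix, images, labels):
    """Restore into a fresh graph with device pins stripped and return accuracy."""
    with tf.Graph().as_default() as graph:
        # clear_devices=True drops placements such as /job:ps/task:0 recorded
        # by replica_device_setter, so a plain local Session can run the ops.
        saver = tf.train.import_meta_graph(ckpt_prefix + ".meta", clear_devices=True)
        x = graph.get_tensor_by_name("x_input:0")
        y = graph.get_tensor_by_name("y_input:0")  # hypothetical name
        accuracy = graph.get_tensor_by_name("accuracy:0")  # hypothetical name
        with tf.Session() as sess:
            saver.restore(sess, ckpt_prefix)
            return sess.run(accuracy, {x: images, y: labels})


# Hypothetical stand-in for the MNIST model, saved locally so the sketch runs.
prefix = os.path.join(tempfile.mkdtemp(), "model.ckpt")
with tf.Graph().as_default():
    x = tf.placeholder(tf.float32, [None, 2], name="x_input")
    y = tf.placeholder(tf.float32, [None, 2], name="y_input")
    W = tf.get_variable("W", initializer=tf.constant([[1.0, 0.0], [0.0, 1.0]]))
    model = tf.matmul(x, W)
    prediction = tf.equal(tf.argmax(model, 1), tf.argmax(y, 1))
    tf.reduce_mean(tf.cast(prediction, tf.float32), name="accuracy")
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        tf.train.Saver().save(sess, prefix)

images = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=np.float32)
labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1]], dtype=np.float32)
acc = evaluate_checkpoint(prefix, images, labels)
print("Test accuracy = {:5f}".format(acc))
```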

TensorFlow error when save/restore dynamic_RNN model

I can save and restore a CNN model, but I can't restore an RNN model.
I built the RNN network like this.
I want to save the trained weights and biases (or the whole model), so that I can predict without training. The following is main.py:
#main.py
tf_x = tf.placeholder(tf.float32, [None, seq_length, data_dim], name='tf_x')
tf_y = tf.placeholder(tf.int32, [None, output_dim], name='tf_y')
rnn_cell = tf.contrib.rnn.BasicLSTMCell(num_units=hidden_dim)
outputs, (h_c, h_n) = tf.nn.dynamic_rnn( rnn_cell,
tf_x,
initial_state=None,
dtype=tf.float32,
time_major=False )
output = tf.layers.dense(outputs[:, -1, :], output_dim, name='dense_output')
loss = tf.losses.softmax_cross_entropy(onehot_labels=tf_y, logits=output)
train_op = tf.train.AdamOptimizer(LR).minimize(loss)
accuracy = tf.metrics.accuracy( labels=tf.argmax(tf_y, axis=1), predictions=tf.argmax(output, axis=1),)[1]
with tf.Session as sess:
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer()) # the local var is for accuracy_op
sess.run(init_op) # initialize var in graph
...(training)
saver = tf.train.Saver()
save_path = saver.save(sess, "Save data/RNN-model")
saver.export_meta_graph(filename="Save Data/RNN-model.meta", as_text=True)
In run.py I tried to load that data:
# run.py
...(same as main.py)
saver = tf.train.Saver()
with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state('Save data/')
    saver.restore(sess, ckpt.model_checkpoint_path)
    saver = tf.train.import_meta_graph("Save data/RNN-model.meta")
    ... (prediction)
The result is:
tensorflow.python.framework.errors_impl.NotFoundError: Key dense/bias not found in checkpoint
What do you think is the problem?
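The thread has no answer here. The error means the graph built in run.py contains a variable named dense/bias that the checkpoint does not have. In main.py the dense layer is created with name='dense_output', so its variables are saved as dense_output/kernel and dense_output/bias; if the elided copy of the model in run.py omits name='dense_output', tf.layers.dense falls back to the default name dense. That is a guess, since run.py is elided, but it can be checked by comparing the graph's variable names with the checkpoint keys. Below is a minimal diagnostic sketch: missing_from_checkpoint() and dense_vars() are hypothetical helpers (dense_vars only mimics the kernel/bias variables tf.layers.dense creates), and tf.compat.v1 is used so it also runs on TF 2.x.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()


def missing_from_checkpoint(ckpt_prefix):
    """Names of variables in the current graph that have no key in the checkpoint."""
    ckpt_keys = {name for name, _ in tf.train.list_variables(ckpt_prefix)}
    return sorted(v.op.name for v in tf.global_variables() if v.op.name not in ckpt_keys)


def dense_vars(scope):
    # Hypothetical stand-in for the kernel/bias that tf.layers.dense(..., name=scope) creates.
    with tf.variable_scope(scope):
        tf.get_variable("kernel", [4, 2])
        tf.get_variable("bias", [2], initializer=tf.zeros_initializer())


prefix = os.path.join(tempfile.mkdtemp(), "RNN-model")
with tf.Graph().as_default():
    dense_vars("dense_output")  # as in main.py
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        tf.train.Saver().save(sess, prefix)

with tf.Graph().as_default():
    dense_vars("dense")  # what run.py appears to build
    missing = missing_from_checkpoint(prefix)
print(missing)
```

If the names differ, giving the layer the same name in both scripts should let a plain tf.train.Saver() restore it; the extra import_meta_graph() call after restore() is then unnecessary.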
