I'm trying to save my model at the end of triaining and restore it every time the training begins. I just followed what this link did.
saver = tf.train.Saver()
with tf.Session(graph=graph) as session:
# Initializate the weights and biases
tf.global_variables_initializer().run()
new_saver = tf.train.import_meta_graph('model.meta')
new_saver.restore(sess,tf.train.latest_checkpoint('./'))
W1 = session.run(W)
print(W1)
for curr_epoch in range(num_epochs):
train_cost = train_ler = 0
start = time.time()
for batch in range(num_batches_per_epoch):
...Some training...
W2 = session.run(W)
print(W2)
save_path = saver.save(session, "models/model")
But it gives error below:
---> new_saver.restore(session, tf.train.latest_checkpoint('./'))
SystemError: <built-in function TF_Run> returned a result with an error set
Can anyone help me please? Many thanks!
If you're gonna load with ./ you have to make sure, that your console (that you use to start the python program) is actually set on that directory (models/).
But in that case, it will save your new data in a new directory. So load with ./models/ instead
(Also you don't need to initiate variables, the restore does that for you.)
Related
Before marking my question as duplicate, I want you to understand that I have went through a lot of questions, but none of the solutions there were able to clear my doubts and solve my problem. I have a trained neural network which I want to save, and later use this model to test this model against test dataset.
I tried saving and restoring it, but I am not getting the expected results. Restoring doesn't seem to work, maybe I am using it wrongly, it is just using the values given by the global variable initializer.
This is the code I am using for saving the model.
sess.run(tf.initializers.global_variables())
#num_epochs = 7
for epoch in range(num_epochs):
start_time = time.time()
train_accuracy = 0
train_loss = 0
val_loss = 0
val_accuracy = 0
for bid in range(int(train_data_size/batch_size)):
X_train_batch = X_train[bid*batch_size:(bid+1)*batch_size]
y_train_batch = y_train[bid*batch_size:(bid+1)*batch_size]
sess.run(optimizer, feed_dict = {x:X_train_batch, y:y_train_batch,prob:0.50})
train_accuracy = train_accuracy + sess.run(model_accuracy, feed_dict={x : X_train_batch,y:y_train_batch,prob:0.50})
train_loss = train_loss + sess.run(loss_value, feed_dict={x : X_train_batch,y:y_train_batch,prob:0.50})
for bid in range(int(val_data_size/batch_size)):
X_val_batch = X_val[bid*batch_size:(bid+1)*batch_size]
y_val_batch = y_val[bid*batch_size:(bid+1)*batch_size]
val_accuracy = val_accuracy + sess.run(model_accuracy,feed_dict = {x:X_val_batch, y:y_val_batch,prob:0.75})
val_loss = val_loss + sess.run(loss_value, feed_dict = {x:X_val_batch, y:y_val_batch,prob:0.75})
train_accuracy = train_accuracy/int(train_data_size/batch_size)
val_accuracy = val_accuracy/int(val_data_size/batch_size)
train_loss = train_loss/int(train_data_size/batch_size)
val_loss = val_loss/int(val_data_size/batch_size)
end_time = time.time()
saver.save(sess,'./blood_model_x_v2',global_step = epoch)
After saving the model, the files are written in my working directory something like this.
blood_model_x_v2-2.data-0000-of-0001
blood_model_x_v2-2.index
blood_model_x_v2-2.meta
Similarly, v2-3, so on to v2-6, and then a 'checkpoint' file. I then tried restoring it using this code snippet (after initializing),but getting different results from the expected one. What am I doing wrong ?
saver = tf.train.import_meta_graph('blood_model_x_v2-5.meta')
saver.restore(test_session,tf.train.latest_checkpoint('./'))
According to tensorflow docs:
Restore
Restores previously saved variables.
This method runs the ops added by the constructor for restoring
variables. It requires a session in which the graph was launched. The
variables to restore do not have to have been initialized, as
restoring is itself a way to initialize variables.
Let's see an example:
We save the model similar to this:
import tensorflow as tf
# Prepare to feed input, i.e. feed_dict and placeholders
w1 = tf.placeholder("float", name="w1")
w2 = tf.placeholder("float", name="w2")
b1 = tf.Variable(2.0, name="bias")
feed_dict = {w1: 4, w2: 8}
# Define a test operation that we will restore
w3 = tf.add(w1, w2)
w4 = tf.multiply(w3, b1, name="op_to_restore")
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Create a saver object which will save all the variables
saver = tf.train.Saver()
# Run the operation by feeding input
print (sess.run(w4, feed_dict))
# Prints 24 which is sum of (w1+w2)*b1
# Now, save the graph
saver.save(sess, './ckpnt/my_test_model', global_step=1000)
And then load the trained model with:
import tensorflow as tf
sess = tf.Session()
# First let's load meta graph and restore weights
saver = tf.train.import_meta_graph('./ckpnt/my_test_model-1000.meta')
saver.restore(sess, tf.train.latest_checkpoint('./ckpnt'))
# Now, let's access and create placeholders variables and
# create feed-dict to feed new data
graph = tf.get_default_graph()
w1 = graph.get_tensor_by_name("w1:0")
w2 = graph.get_tensor_by_name("w2:0")
feed_dict = {w1: 13.0, w2: 17.0}
# Now, access the op that you want to run.
op_to_restore = graph.get_tensor_by_name("op_to_restore:0")
print (sess.run(op_to_restore, feed_dict))
# This will print 60 which is calculated
# using new values of w1 and w2 and saved value of b1.
As you can see we do not initialize our session in the restoring part. There is better way to save and restore model with Checkpoint which allows you to check whether the model is restored correctly or not.
I have downloaded an network with its pretrained model. I added several layers and parameters to the network, I want to use this pretrained model to initialize the original parameters,and random initialize new added parameters by myself.I use this code:
saver = tf.train.Saver()
with tf.Session() as sess:
saver.restore(sess, "output/saver-test")
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
but I met the error:"Key global_step not found in checkpoint",this error because I have some new parameters that didn't exist in pretrained model.But how can I solve this problem? What's more,I want to use this code "sess.run(tf.global_variables_initializer())" to initialize the new added parameters,but the extracted parameters from pretrained model will be covered by it?
It happens because of your network is not perfectly match to the loaded one.
You can use selective checkpoint loader something like that:
reader = tf.train.NewCheckpointReader(os.path.join(checkpoint_dir, ckpt_name))
restore_dict = dict()
for v in tf.trainable_variables():
tensor_name = v.name.split(':')[0]
if reader.has_tensor(tensor_name):
print('has tensor ', tensor_name)
restore_dict[tensor_name] = v
restore_dict['my_new_var_scope/my_new_var'] = self.get_my_new_var_variable()
Where get_my_new_var_variable() is something like that:
def get_my_new_var_variable(self):
with tf.variable_scope("my_new_var_scope",reuse=tf.AUTO_REUSE):
my_new_var = tf.get_variable("my_new_var", dtype=tf.int32,initializer=tf.constant([23, 42]))
return my_new_var
Loading the weights:
self.saver = tf.train.Saver(restore_dict)
self.saver.restore(self.sess, os.path.join(checkpoint_dir, ckpt_name))
Edited:
Note that in order to avoid override the loaded variables you can use this method:
def initialize_uninitialized(sess):
global_vars = tf.global_variables()
is_not_initialized = sess.run([tf.is_variable_initialized(var) for var in global_vars])
not_initialized_vars = [v for (v, f) in zip(global_vars, is_not_initialized) if not f]
if len(not_initialized_vars):
sess.run(tf.variables_initializer(not_initialized_vars))
Or simply calling tf.global_variables_initializer() before loading the variables should work here.
I am using the sample to build a CNN as per this article: https://www.tensorflow.org/tutorials/layers
However, I am unable to find a sample to predict by feeding in a sample image. Any help here would be highly appreciated.
Below is what I have tried, and not able to find the output tensor name
img = <load from file>
sess = tf.Session()
saver = tf.train.import_meta_graph('/tmp/mnist_convnet_model/model.ckpt-2000.meta')
saver.restore(sess, tf.train.latest_checkpoint('/tmp/mnist_convnet_model/'))
input_place_holder = sess.graph.get_tensor_by_name("enqueue_input/Placeholder:0")
out_put = <not sure what the tensor output name in the graph>
current_input = img
result = sess.run(out_put, feed_dict={input_place_holder: current_input})
print(result)
You can use the inspect_checkpoint tool in Tensorflow to find the tensors inside a checkpoint file.
from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file
print_tensors_in_checkpoint_file(file_name="tmp/mnist_convnet_model/model.ckpt-2000.meta", tensor_name='')
There are nice instructions on how to save and restore in tensorflows programming guide. Here is a small example inspired from the latter link. Just make sure that the ./tmp dir exists
import tensorflow as tf
# Create some variables.
variable = tf.get_variable("variable_1", shape=[3], initializer=tf.zeros_initializer)
inc_v1=variable.assign(variable + 1)
# Operation to initialize variables if we do not restore from checkpoint
init_op = tf.global_variables_initializer()
# Create the saver
saver = tf.train.Saver()
with tf.Session() as sess:
# Setting to decide wether or not to restore
DO_RESTORE=True
# Where to save the data file
save_path="./tmp/model.ckpt"
if DO_RESTORE:
# If we want to restore, load the variables from the saved file
saver.restore(sess, save_path)
else:
# If we don't want to restore, then initialize variables
# using their specified initializers.
sess.run(init_op)
# Print the initial values of variable
initial_var_value=sess.run(variable)
print("Initial:", initial_var_value)
# Do some work with the model.
incremented=sess.run(inc_v1)
print("Incremented:", incremented)
# Save the variables to disk.
save_path = saver.save(sess, save_path)
print("Model saved in path: %s" % save_path)
When we specify the global_step in the Saver.save, it will store the global_step as the checkpoint suffix.
# save the checkpoint
saver = tf.train.Saver()
saver.save(session, checkpoints_path, global_step)
We can restore the checkpoint and obtain the last global step stored in the checkpoints like this:
# restore the checkpoint and obtain the global step
saver.restore(session, ckpt.model_checkpoint_path)
...
_, gstep = session.run([optimizer, global_step], feed_dict=feed_dict_train)
If we use tf.train.MonitoredTrainingSession, what is the equivalent way to save the global step to the checkpoint and obtain gstep?
Edit 1
Followed by Maxim's suggestion, I have created the global_step variable before tf.train.MonitoredTrainingSession, and added a CheckpointSaverHook like this:
global_step = tf.train.get_or_create_global_step()
save_checkpoint_hook = tf.train.CheckpointSaverHook(checkpoint_dir=checkpoints_abs_path,
save_steps=5,
checkpoint_basename=(checkpoints_prefix + ".ckpt"))
with tf.train.MonitoredTrainingSession(master=server.target,
is_chief=is_chief,
hooks=[sync_replicas_hook, save_checkpoint_hook],
config=config) as session:
_, gstep = session.run([optimizer, global_step], feed_dict=feed_dict_train)
print("current global step=" + str(gstep))
I can see that it generates the checkpoint files similar to what Saver.saver does. However, it is unable retrieve the global step from the checkpoint. Please kindly advise how should I fix this?
You can get the current global step via tf.train.get_global_step() or via tf.train.get_or_create_global_step() function. The latter should be called before training starts.
For the monitored session, add tf.train.CheckpointSaverHook to the hooks, which internally uses the defined global step tensor to save the model after every N steps.
I have a model that's trains my network using an Iterator; following the new Dataset API pipeline model that's now recommended by Google.
I read tfrecord files, feed data to the network, train nicely, and all is going well, I save my model in the end of the training so I can run Inference on it later. A simplified version of the code is as following:
""" Training and saving """
training_dataset = tf.contrib.data.TFRecordDataset(training_record)
training_dataset = training_dataset.map(ds._path_records_parser)
training_dataset = training_dataset.batch(BATCH_SIZE)
with tf.name_scope("iterators"):
training_iterator = Iterator.from_structure(training_dataset.output_types, training_dataset.output_shapes)
next_training_element = training_iterator.get_next()
training_init_op = training_iterator.make_initializer(training_dataset)
def train(num_epochs):
# compute for the number of epochs
for e in range(1, num_epochs+1):
session.run(training_init_op) #initializing iterator here
while True:
try:
images, labels = session.run(next_training_element)
session.run(optimizer, feed_dict={x: images, y_true: labels})
except tf.errors.OutOfRangeError:
saver_name = './saved_models/ucf-model'
print("Finished Training Epoch {}".format(e))
break
""" Restoring """
# restoring the saved model and its variables
session = tf.Session()
saver = tf.train.import_meta_graph(r'saved_models\ucf-model.meta')
saver.restore(session, tf.train.latest_checkpoint('.\saved_models'))
graph = tf.get_default_graph()
# restoring relevant tensors/ops
accuracy = graph.get_tensor_by_name("accuracy/Mean:0") #the tensor that when evaluated returns the mean accuracy of the batch
testing_iterator = graph.get_operation_by_name("iterators/Iterator") #my iterator used in testing.
next_testing_element = graph.get_operation_by_name("iterators/IteratorGetNext") #the GetNext operator for my iterator
# loading my testing set tfrecords
testing_dataset = tf.contrib.data.TFRecordDataset(testing_record_path)
testing_dataset = testing_dataset.map(ds._path_records_parser, num_threads=4, output_buffer_size=BATCH_SIZE*20)
testing_dataset = testing_dataset.batch(BATCH_SIZE)
testing_init_op = testing_iterator.make_initializer(testing_dataset) #to initialize the dataset
with tf.Session() as session:
session.run(testing_init_op)
while True:
try:
images, labels = session.run(next_testing_element)
accuracy = session.run(accuracy, feed_dict={x: test_images, y_true: test_labels}) #error here, x, y_true not defined
except tf.errors.OutOfRangeError:
break
My problem is mainly when I restore the model. How to feed testing data to the network?
When I restore my Iterator using testing_iterator = graph.get_operation_by_name("iterators/Iterator"), next_testing_element = graph.get_operation_by_name("iterators/IteratorGetNext"), I get the following error:
GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
So I did try to initialize my dataset using: testing_init_op = testing_iterator.make_initializer(testing_dataset)). I got this error: AttributeError: 'Operation' object has no attribute 'make_initializer'
Another issue is, since an iterator is being used, there's no need to use placeholders in the training_model, as an iterator feed data directly to the graph. But this way, how to restore my feed_dict keys in the 3rd to last line, when I feed data to the "accuracy" op?
EDIT: if someone could suggest a way to add placeholders between the Iterator and the network input, then I could try running the graph by evaluating the "accuracy" tensor while feeding data to the placeholders and ignoring the iterator altogether.
When restoring a saved meta graph, you can restore the initialization operation with name and then use it again to initialize the input pipeline for inference.
That is, when creating the graph, you can do
dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
And then restore this operation by doing:
dataset_init_op = graph.get_operation_by_name('dataset_init')
Here is a self contained code snippet that compares results of a randomly initialized model before and after restoring.
Saving an Iterator
np.random.seed(42)
data = np.random.random([4, 4])
X = tf.placeholder(dtype=tf.float32, shape=[4, 4], name='X')
dataset = tf.data.Dataset.from_tensor_slices(X)
iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
dataset_next_op = iterator.get_next()
# name the operation
dataset_init_op = iterator.make_initializer(dataset, name='dataset_init')
w = np.random.random([1, 4])
W = tf.Variable(w, name='W', dtype=tf.float32)
output = tf.multiply(W, dataset_next_op, name='output')
sess = tf.Session()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())
sess.run(dataset_init_op, feed_dict={X:data})
while True:
try:
print(sess.run(output))
except tf.errors.OutOfRangeError:
saver.save(sess, 'tmp/', global_step=1002)
break
And then you can restore the same model for inference as follows:
Restoring saved iterator
np.random.seed(42)
data = np.random.random([4, 4])
tf.reset_default_graph()
sess = tf.Session()
saver = tf.train.import_meta_graph('tmp/-1002.meta')
ckpt = tf.train.get_checkpoint_state(os.path.dirname('tmp/checkpoint'))
saver.restore(sess, ckpt.model_checkpoint_path)
graph = tf.get_default_graph()
# Restore the init operation
dataset_init_op = graph.get_operation_by_name('dataset_init')
X = graph.get_tensor_by_name('X:0')
output = graph.get_tensor_by_name('output:0')
sess.run(dataset_init_op, feed_dict={X:data})
while True:
try:
print(sess.run(output))
except tf.errors.OutOfRangeError:
break
I would suggest to use tf.contrib.data.make_saveable_from_iterator, which has been designed precisely for this purpose. It is much less verbose and does not require you to change existing code, in particular how you define your iterator.
Working example, when we save everything after step 5 has completed. Note how I don't even bother knowing what seed is used.
import tensorflow as tf
iterator = (
tf.data.Dataset.range(100)
.shuffle(10)
.make_one_shot_iterator())
batch = iterator.get_next(name='batch')
saveable_obj = tf.contrib.data.make_saveable_from_iterator(iterator)
tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable_obj)
saver = tf.train.Saver()
with tf.Session() as sess:
tf.global_variables_initializer().run()
for step in range(10):
print('{}: {}'.format(step, sess.run(batch)))
if step == 5:
saver.save(sess, './foo', global_step=step)
# 0: 1
# 1: 6
# 2: 7
# 3: 3
# 4: 8
# 5: 10
# 6: 12
# 7: 14
# 8: 5
# 9: 17
Then later, if we resume from step 6, we get the same output.
import tensorflow as tf
saver = tf.train.import_meta_graph('./foo-5.meta')
with tf.Session() as sess:
saver.restore(sess, './foo-5')
for step in range(6, 10):
print('{}: {}'.format(step, sess.run('batch:0')))
# 6: 12
# 7: 14
# 8: 5
# 9: 17
I couldn't solve the problem related to initializing the iterator, but since I pre-process my dataset using map method, and I apply transformations defined by Python operations wrapped with py_func, which cannot be serialized for storing\restoring, I'll have to initialize my dataset when I want to restore it anyway.
So, the problem that remains is how to feed data to my graph when I restore it. I placed a tf.identity node between the iterator output and my network input. Upon restoring, I feed my data to the identity node. A better solution that I discovered later is using placeholder_with_default(), as described in this answer.
I would suggest having a look at CheckpointInputPipelineHook CheckpointInputPipelineHook, which implements saving iterator state for further training with tf.Estimator.