When we specify the global_step in the Saver.save, it will store the global_step as the checkpoint suffix.
# save the checkpoint
saver = tf.train.Saver()
saver.save(session, checkpoints_path, global_step)
We can restore the checkpoint and obtain the last global step stored in the checkpoints like this:
# restore the checkpoint and obtain the global step
saver.restore(session, ckpt.model_checkpoint_path)
...
_, gstep = session.run([optimizer, global_step], feed_dict=feed_dict_train)
If we use tf.train.MonitoredTrainingSession, what is the equivalent way to save the global step to the checkpoint and obtain gstep?
Edit 1
Followed by Maxim's suggestion, I have created the global_step variable before tf.train.MonitoredTrainingSession, and added a CheckpointSaverHook like this:
global_step = tf.train.get_or_create_global_step()
save_checkpoint_hook = tf.train.CheckpointSaverHook(checkpoint_dir=checkpoints_abs_path,
save_steps=5,
checkpoint_basename=(checkpoints_prefix + ".ckpt"))
with tf.train.MonitoredTrainingSession(master=server.target,
is_chief=is_chief,
hooks=[sync_replicas_hook, save_checkpoint_hook],
config=config) as session:
_, gstep = session.run([optimizer, global_step], feed_dict=feed_dict_train)
print("current global step=" + str(gstep))
I can see that it generates the checkpoint files similar to what Saver.saver does. However, it is unable retrieve the global step from the checkpoint. Please kindly advise how should I fix this?
You can get the current global step via tf.train.get_global_step() or via tf.train.get_or_create_global_step() function. The latter should be called before training starts.
For the monitored session, add tf.train.CheckpointSaverHook to the hooks, which internally uses the defined global step tensor to save the model after every N steps.
Related
Before marking my question as duplicate, I want you to understand that I have went through a lot of questions, but none of the solutions there were able to clear my doubts and solve my problem. I have a trained neural network which I want to save, and later use this model to test this model against test dataset.
I tried saving and restoring it, but I am not getting the expected results. Restoring doesn't seem to work, maybe I am using it wrongly, it is just using the values given by the global variable initializer.
This is the code I am using for saving the model.
sess.run(tf.initializers.global_variables())
#num_epochs = 7
for epoch in range(num_epochs):
start_time = time.time()
train_accuracy = 0
train_loss = 0
val_loss = 0
val_accuracy = 0
for bid in range(int(train_data_size/batch_size)):
X_train_batch = X_train[bid*batch_size:(bid+1)*batch_size]
y_train_batch = y_train[bid*batch_size:(bid+1)*batch_size]
sess.run(optimizer, feed_dict = {x:X_train_batch, y:y_train_batch,prob:0.50})
train_accuracy = train_accuracy + sess.run(model_accuracy, feed_dict={x : X_train_batch,y:y_train_batch,prob:0.50})
train_loss = train_loss + sess.run(loss_value, feed_dict={x : X_train_batch,y:y_train_batch,prob:0.50})
for bid in range(int(val_data_size/batch_size)):
X_val_batch = X_val[bid*batch_size:(bid+1)*batch_size]
y_val_batch = y_val[bid*batch_size:(bid+1)*batch_size]
val_accuracy = val_accuracy + sess.run(model_accuracy,feed_dict = {x:X_val_batch, y:y_val_batch,prob:0.75})
val_loss = val_loss + sess.run(loss_value, feed_dict = {x:X_val_batch, y:y_val_batch,prob:0.75})
train_accuracy = train_accuracy/int(train_data_size/batch_size)
val_accuracy = val_accuracy/int(val_data_size/batch_size)
train_loss = train_loss/int(train_data_size/batch_size)
val_loss = val_loss/int(val_data_size/batch_size)
end_time = time.time()
saver.save(sess,'./blood_model_x_v2',global_step = epoch)
After saving the model, the files are written in my working directory something like this.
blood_model_x_v2-2.data-0000-of-0001
blood_model_x_v2-2.index
blood_model_x_v2-2.meta
Similarly, v2-3, so on to v2-6, and then a 'checkpoint' file. I then tried restoring it using this code snippet (after initializing),but getting different results from the expected one. What am I doing wrong ?
saver = tf.train.import_meta_graph('blood_model_x_v2-5.meta')
saver.restore(test_session,tf.train.latest_checkpoint('./'))
According to tensorflow docs:
Restore
Restores previously saved variables.
This method runs the ops added by the constructor for restoring
variables. It requires a session in which the graph was launched. The
variables to restore do not have to have been initialized, as
restoring is itself a way to initialize variables.
Let's see an example:
We save the model similar to this:
import tensorflow as tf
# Prepare to feed input, i.e. feed_dict and placeholders
w1 = tf.placeholder("float", name="w1")
w2 = tf.placeholder("float", name="w2")
b1 = tf.Variable(2.0, name="bias")
feed_dict = {w1: 4, w2: 8}
# Define a test operation that we will restore
w3 = tf.add(w1, w2)
w4 = tf.multiply(w3, b1, name="op_to_restore")
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Create a saver object which will save all the variables
saver = tf.train.Saver()
# Run the operation by feeding input
print (sess.run(w4, feed_dict))
# Prints 24 which is sum of (w1+w2)*b1
# Now, save the graph
saver.save(sess, './ckpnt/my_test_model', global_step=1000)
And then load the trained model with:
import tensorflow as tf
sess = tf.Session()
# First let's load meta graph and restore weights
saver = tf.train.import_meta_graph('./ckpnt/my_test_model-1000.meta')
saver.restore(sess, tf.train.latest_checkpoint('./ckpnt'))
# Now, let's access and create placeholders variables and
# create feed-dict to feed new data
graph = tf.get_default_graph()
w1 = graph.get_tensor_by_name("w1:0")
w2 = graph.get_tensor_by_name("w2:0")
feed_dict = {w1: 13.0, w2: 17.0}
# Now, access the op that you want to run.
op_to_restore = graph.get_tensor_by_name("op_to_restore:0")
print (sess.run(op_to_restore, feed_dict))
# This will print 60 which is calculated
# using new values of w1 and w2 and saved value of b1.
As you can see we do not initialize our session in the restoring part. There is better way to save and restore model with Checkpoint which allows you to check whether the model is restored correctly or not.
I have downloaded an network with its pretrained model. I added several layers and parameters to the network, I want to use this pretrained model to initialize the original parameters,and random initialize new added parameters by myself.I use this code:
saver = tf.train.Saver()
with tf.Session() as sess:
saver.restore(sess, "output/saver-test")
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
but I met the error:"Key global_step not found in checkpoint",this error because I have some new parameters that didn't exist in pretrained model.But how can I solve this problem? What's more,I want to use this code "sess.run(tf.global_variables_initializer())" to initialize the new added parameters,but the extracted parameters from pretrained model will be covered by it?
It happens because of your network is not perfectly match to the loaded one.
You can use selective checkpoint loader something like that:
reader = tf.train.NewCheckpointReader(os.path.join(checkpoint_dir, ckpt_name))
restore_dict = dict()
for v in tf.trainable_variables():
tensor_name = v.name.split(':')[0]
if reader.has_tensor(tensor_name):
print('has tensor ', tensor_name)
restore_dict[tensor_name] = v
restore_dict['my_new_var_scope/my_new_var'] = self.get_my_new_var_variable()
Where get_my_new_var_variable() is something like that:
def get_my_new_var_variable(self):
with tf.variable_scope("my_new_var_scope",reuse=tf.AUTO_REUSE):
my_new_var = tf.get_variable("my_new_var", dtype=tf.int32,initializer=tf.constant([23, 42]))
return my_new_var
Loading the weights:
self.saver = tf.train.Saver(restore_dict)
self.saver.restore(self.sess, os.path.join(checkpoint_dir, ckpt_name))
Edited:
Note that in order to avoid override the loaded variables you can use this method:
def initialize_uninitialized(sess):
global_vars = tf.global_variables()
is_not_initialized = sess.run([tf.is_variable_initialized(var) for var in global_vars])
not_initialized_vars = [v for (v, f) in zip(global_vars, is_not_initialized) if not f]
if len(not_initialized_vars):
sess.run(tf.variables_initializer(not_initialized_vars))
Or simply calling tf.global_variables_initializer() before loading the variables should work here.
I am using the sample to build a CNN as per this article: https://www.tensorflow.org/tutorials/layers
However, I am unable to find a sample to predict by feeding in a sample image. Any help here would be highly appreciated.
Below is what I have tried, and not able to find the output tensor name
img = <load from file>
sess = tf.Session()
saver = tf.train.import_meta_graph('/tmp/mnist_convnet_model/model.ckpt-2000.meta')
saver.restore(sess, tf.train.latest_checkpoint('/tmp/mnist_convnet_model/'))
input_place_holder = sess.graph.get_tensor_by_name("enqueue_input/Placeholder:0")
out_put = <not sure what the tensor output name in the graph>
current_input = img
result = sess.run(out_put, feed_dict={input_place_holder: current_input})
print(result)
You can use the inspect_checkpoint tool in Tensorflow to find the tensors inside a checkpoint file.
from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file
print_tensors_in_checkpoint_file(file_name="tmp/mnist_convnet_model/model.ckpt-2000.meta", tensor_name='')
There are nice instructions on how to save and restore in tensorflows programming guide. Here is a small example inspired from the latter link. Just make sure that the ./tmp dir exists
import tensorflow as tf
# Create some variables.
variable = tf.get_variable("variable_1", shape=[3], initializer=tf.zeros_initializer)
inc_v1=variable.assign(variable + 1)
# Operation to initialize variables if we do not restore from checkpoint
init_op = tf.global_variables_initializer()
# Create the saver
saver = tf.train.Saver()
with tf.Session() as sess:
# Setting to decide wether or not to restore
DO_RESTORE=True
# Where to save the data file
save_path="./tmp/model.ckpt"
if DO_RESTORE:
# If we want to restore, load the variables from the saved file
saver.restore(sess, save_path)
else:
# If we don't want to restore, then initialize variables
# using their specified initializers.
sess.run(init_op)
# Print the initial values of variable
initial_var_value=sess.run(variable)
print("Initial:", initial_var_value)
# Do some work with the model.
incremented=sess.run(inc_v1)
print("Incremented:", incremented)
# Save the variables to disk.
save_path = saver.save(sess, save_path)
print("Model saved in path: %s" % save_path)
I'm having trouble loading a model to resume training.
I'm using a simple two-layered-NN (Fully connected) on a cifar data set for practice.
NN Setup:
#full_connected_layers
import tensorflow as tf
import numpy as np
#input _-> hidden ->
def inference(data_samples, image_pixels, hidden_units, classes, reg_constant):
with tf.variable_scope('Layer1'):
# Define the variables
weights = tf.get_variable(
name='weights',
shape=[image_pixels, hidden_units],
initializer=tf.truncated_normal_initializer(
stddev=1.0 / np.sqrt(float(image_pixels))),
regularizer=tf.contrib.layers.l2_regularizer(reg_constant)
)
biases = tf.Variable(tf.zeros([hidden_units]), name='biases')
# Define the layer's output
hidden = tf.nn.relu(tf.matmul(data_samples, weights) + biases)
with tf.variable_scope('Layer2'):
# Define variables
weights = tf.get_variable('weights', [hidden_units, classes],
initializer=tf.truncated_normal_initializer(
stddev=1.0 / np.sqrt(float(hidden_units))),
regularizer=tf.contrib.layers.l2_regularizer(reg_constant))
biases = tf.Variable(tf.zeros([classes]), name='biases')
# Define the layer's output
logits = tf.matmul(hidden, weights) + biases
# Define summery-operation for 'logits'-variable
tf.summary.histogram('logits', logits)
return logits
def loss(logits, labels):
'''Calculates the loss from logits and labels.
Args:
logits: Logits tensor, float - [batch size, number of classes].
labels: Labels tensor, int64 - [batch size].
Returns:
loss: Loss tensor of type float.
'''
with tf.name_scope('Loss'):
# Operation to determine the cross entropy between logits and labels
cross_entropy = tf.reduce_mean(
tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=logits, labels=labels, name='cross_entropy'))
# Operation for the loss function
loss = cross_entropy + tf.add_n(tf.get_collection(
tf.GraphKeys.REGULARIZATION_LOSSES))
# Add a scalar summary for the loss
tf.summary.scalar('loss', loss)
return loss
def training(loss, learning_rate):
# Create a variable to track the global step
global_step = tf.Variable(0, name='global_step', trainable=False)
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
loss, global_step=global_step)
#train_step = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon).minimize(
#loss, global_step=global_step)
return train_step
def evaluation(logits, labels):
with tf.name_scope('Accuracy'):
# Operation comparing prediction with true label
correct_prediction = tf.equal(tf.argmax(logits,1), labels)
# Operation calculating the accuracy of the predictions
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# Summary operation for the accuracy
tf.summary.scalar('train_accuracy', accuracy)
return accuracy
Saved model like this:
if (i + 1) % 500 == 0:
saver.save(sess, MODEL_DIR, global_step=i)
print('Saved checkpoint')
Saved model files
Within this directory:
C:\Users\Moondra\Desktop\CIFAR - PROJECT\parameters_no_changes
I have the following files as well as model.ckpt-499.index etc:
model.ckpt-999.meta
model.ckpt-999.index
model.ckpt-999.data-00000-of-00001
My attempt at loading the model
import numpy as np
import tensorflow as tf
import time
from datetime import datetime
import os
import data_helpers
import full_connected_layers
import itertools
learning_rate = .0001
max_steps = 3000
batch_size = 400
checkpoint = r'C:\Users\Moondra\Desktop\CIFAR - PROJECT\parameters_no_changes\model.ckpt-999'
with tf.Session() as sess:
saver = tf.train.import_meta_graph(r'C:\Users\Moondra\Desktop\CIFAR - PROJECT' +
'\\parameters_no_changes\model.ckpt-999.meta')
saver.restore(sess, checkpoint)
data_sets = data_helpers.load_data()
images = tf.get_default_graph().get_tensor_by_name('images:0') #image placeholder
labels = tf.get_default_graph().get_tensor_by_name('image-labels:0') #placeholder
loss = tf.get_default_graph().get_tensor_by_name('Loss/add:0')
#global_step = tf.get_default_graph().get_tensor_by_name('global_step/initial_value_1:0')
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
loss)
accuracy = tf.get_default_graph().get_tensor_by_name('Accuracy/Mean:0')
with tf.Session() as sess:
#sess.run(tf.global_variables_initializer())
zipped_data = zip(data_sets['images_train'], data_sets['labels_train'])
batches = data_helpers.gen_batch(list(zipped_data), batch_size,
max_steps)
for i in range(max_steps):
# Get next input data batch
batch = next(batches)
images_batch, labels_batch = zip(*batch)
feed_dict = {
images: images_batch,
labels: labels_batch
}
if i % 100 == 0:
train_accuracy = sess.run(accuracy, feed_dict=feed_dict)
print('Step {:d}, training accuracy {:g}'.format(i, train_accuracy))
ts,loss_ =sess.run([train_step, loss], feed_dict=feed_dict)
Errors and confusion
1) Should I be using this command latest_checkpoint to restore:
`
saver.restore(sess,tf.train.latest_checkpoint('./'))`
I see some tutorials that just point to the folder holding the
.data, .index files.
2) Which brings me to the second question: What should I be using as the second parameter of saver.restore.
Currently I'm just pointing to the folder/dir that holds those files
3) I'm not purposely initializing any variables as I was told, that would overwrite the stored weight and bias values. This seems to be leading to this error:
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Layer1/weights
[[Node: Layer1/weights/read = Identity[T=DT_FLOAT, _class=["loc:#Layer1/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](Layer1/weights)]]
4) However, If I do initialize all variables via this code:
sess.run(tf.global_variables_initializer())
My model seems to start training from scratch (and not resuming training)
Does that mean I'm supposed to load all weights and biases via
get_tensor explicitly? If so , how I deal with layers with 20 plus layers?
5) When I run this command
for i in tf.get_default_graph().get_operations():
print(i.values)
I see many global_steps tensors/operations,
'global_step/initial_value' type=Const>>
'global_step' type=VariableV2>>
<'global_step/Assign' type=Assign>>
global_step/read' type=Identity>>
I was trying to load this variable into my current graph, but
didn't know which one I'm supposed to get using the command
get_tensor_by_name. Most of them were resulting in a does not exist error.
6) Same with loss which loss am I supposed to load into my graph with get_tensor
These are the options:
<bound method Operation.values of <tf.Operation 'Loss/Const' type=Const>>
<bound method Operation.values of <tf.Operation 'Loss/Mean' type=Mean>>
<bound method Operation.values of <tf.Operation 'Loss/AddN' type=AddN>>
<bound method Operation.values of <tf.Operation 'Loss/add' type=Add>>
<bound method Operation.values of <tf.Operation 'Loss/loss/tags' type=Const>>
<bound method Operation.values of <tf.Operation 'Loss/loss' type=ScalarSummary>>
6) Lastly, I see a lot of gradient operations when I look at all
the nodes of the graph but I don't see any nodes related to train_step (the
python variable I created that points to the Gradient Dsecent Optimizer). Does that mean I don't need to load it into this graph via get_tensor?
Thank you.
I usually did this sequence of operations:
Initialize
Restore
This translates to this kind of code:
saver = tf.train.Saver()
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
saver.restore(sess, tf.train.latest_checkpoint('./'))
...
It will avoid the non-initialized error, and the restore will overwrite with the values from the checkpoint.
1/ In the folder where you save your checkpoint, there should be a file named 'checkpoint' which contains the name of your latest checkpoint.
I normally read this file to find latest checkpoint.
2/ I use checkpoint_directory/global_step.
With this, tf will create 4 files in the checkpoint_directory:
global_step.data-00000-of-00001
global_step.index
global_step.meta
checkpoint
3/ 4/ I'm pretty sure you don't need to pre-initialize the graph before loading, at least I don't do it.
There is some difference: instead of import_meta_graph, I rebuild the whole graph every time I load, but I'm sure it's not an issue to load before you initialize.
5/ Be careful not to mis-take operations for tensors and you are good to go. Tensor name should be op_name:0, which mean this tensor is the output[0] of the operation op_name.
6/ 7/ Well, let me just tell you how I resume my checkpoint. This is probably not the correct way, but it really saves me from the burden of get_tensor_by_name. Seriously get_tensor_by_name can be a real pita sometimes.
Normally my loading process will go through: rebuild graph, load checkpoint, create some new tensors if needed, initialize tensors that is not in the checkpoint.
build_net()
saver = tf.train.Saver()
saver.restore(session, checkpoint_dir/global_step)
add_loss_and_optimizer()
initialize_all_uninitialized_tensor
checkpoint_dir/global_step is from the checkpoint file if you want the latest checkpoint, or you can use different global_step to get the specific checkpoint that you wanna load.
I'm trying to save my model at the end of triaining and restore it every time the training begins. I just followed what this link did.
saver = tf.train.Saver()
with tf.Session(graph=graph) as session:
# Initializate the weights and biases
tf.global_variables_initializer().run()
new_saver = tf.train.import_meta_graph('model.meta')
new_saver.restore(sess,tf.train.latest_checkpoint('./'))
W1 = session.run(W)
print(W1)
for curr_epoch in range(num_epochs):
train_cost = train_ler = 0
start = time.time()
for batch in range(num_batches_per_epoch):
...Some training...
W2 = session.run(W)
print(W2)
save_path = saver.save(session, "models/model")
But it gives error below:
---> new_saver.restore(session, tf.train.latest_checkpoint('./'))
SystemError: <built-in function TF_Run> returned a result with an error set
Can anyone help me please? Many thanks!
If you're gonna load with ./ you have to make sure, that your console (that you use to start the python program) is actually set on that directory (models/).
But in that case, it will save your new data in a new directory. So load with ./models/ instead
(Also you don't need to initiate variables, the restore does that for you.)