Tensorflow: saver.restore not restoring - python

When I try to restore a learned model I have a problem:
The first time my program runs, it doesn't seem to load the variables; the second time I run it, the variables are loaded; the third time I get a huge error on the saver.restore(sess, 'model.ckpt') line, starting with "NotFoundError: Key beta2_power_2 not found in checkpoint".
Here is the beginning of my code:
with tf.Session() as sess:
    myModel = SoundCNN(8)  # classes
    tf.global_variables_initializer().run()
    saver = tf.train.Saver(tf.global_variables())
    saver.restore(sess, 'model.ckpt')
You can see the SoundCNN class here, in the model.py file of the GitHub project.
I'm new to TensorFlow and ML and wanted to use awjuliani's project to learn to use TF for sound-oriented ML.
edit: here is the full code:
print ("start")
bpm = 240
samplingRate = 44100
mypath = "instruments/drums/"
iterations = 1000
batchSize = 240
with tf.Session() as sess:
myModel = SoundCNN(8)#classes
tf.global_variables_initializer().run()
saver = tf.train.Saver(tf.global_variables())
print("loading session ...")
saver.restore(sess, 'model.ckpt')
print("session loaded")
print("processing audio ...")
classes,trainX,trainYa,valX,valY,testX,testY = util.processAudio(bpm,samplingRate,mypath)
print("audio processed")
fullTrain = np.concatenate((trainX,trainYa),axis=1)
quitFlag = False
inputsize = fullTrain.shape[0]-1 #6607
print("entering loop...")
while (not quitFlag):
indexstr = input("Type the index (0< _ <" + str(inputsize) + ") of the sample to test then press enter.\nYou can press enter without text for random index.\nType q to quit.\n")
if (indexstr == "q" or indexstr == "Q"):
quitFlag = True
else:
if(indexstr ==""):
index = randint(0, inputsize)
print("Index : " + str(index))
else:
index = int(indexstr)
tensors,labels_ = np.hsplit(fullTrain,[-1])
labels = util.oneHotIt(labels_)
tensor, label = tensors[index,:], labels[index]
tensor = tensor.reshape(1,1024)
result = myModel.prediction.eval(session=sess,feed_dict={myModel.x: tensor, myModel.keep_prob: 1.0})
print("Model found sound: n°"+ str(result) + ".\nActual sound: n°" + str(np.argmax(label)) + ".\n" )
Thanks!
edit2: Okay, I tried with this code:
print ("start")
bpm = 240
samplingRate = 44100
mypath = "instruments/drums/"
iterations = 1000
batchSize = 240
tf.reset_default_graph()
myModel = SoundCNN(8)
saver = tf.train.Saver()
with tf.Session() as sess:
print("loading session ...")
saver.restore(sess, 'model.ckpt')
print("session loaded")
And the variables aren't loaded (the predictions are bad), but the strange thing is that I can make the code work by adding:
    myModel = SoundCNN(8)
    saver = tf.train.Saver()
    print("loading session ...")
    saver.restore(sess, 'model.ckpt')
    print("session loaded")
after the first saver.restore(sess, 'model.ckpt')
So I made the code work but it's a nasty ...

OK, so first of all, separate the training and testing of the model.
Run a conditional if statement using tf.train.checkpoint_exists and tf.train.latest_checkpoint.
Something like:
if tf.train.checkpoint_exists(tf.train.latest_checkpoint(".")):
    test()
else:
    trainNetConv(iterations)
    test()
You might as well use only latest_checkpoint, since it returns a path if a checkpoint was found and None otherwise.
Run tf.reset_default_graph() whenever you know you'll be loading a model, to clear any existing graphs. From what I have experienced, TensorFlow otherwise stacks copies of the graphs, which slows the runtime and might lead to other problems, especially if you plan to do this multiple times during runtime.
Assuming you already have a trained model, you must first create it as you normally would, by calling SoundCNN with the same number of classes as the model that you wish to load. Make sure you create the EXACT same model, i.e. the same number of classes. In the code you provided, you create the model with 8 classes, but the number of classes of the model created in trainNetConv is determined by util.processAudio. It is worth checking that the number of classes is indeed 8 for any given directory of sound files it is being trained on.
The key difference when you load a model is that you don't initialize the variables, i.e. you do not create the saver object with the global variables or run the global variables initializer.
All you have to do is:
Make sure to run tf.reset_default_graph().
Create the model by calling SoundCNN.
Create a saver object with no arguments.
Create a session like you do.
Call the restore function of the saver object with the path to the latest checkpoint, using tf.train.latest_checkpoint with the base directory of the model.
And you're done.
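For reference, a minimal sketch of that loading sequence (the checkpoint directory "." is an assumption; use wherever the model was saved):
tf.reset_default_graph()                      # clear any previously built graphs
myModel = SoundCNN(8)                         # recreate the EXACT same model (8 classes)
saver = tf.train.Saver()                      # no arguments, no variable initialization
with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(".")    # "." is an assumption: the model's base dir
    saver.restore(sess, ckpt)                 # restores every variable from the checkpoint
    # myModel is now ready for inference, e.g. myModel.prediction.eval(...)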
Check my GitHub for complete examples of the training and testing phases. Make sure to start with the 'mnist' one, since it is only one file and the simplest there.
Assuming you wish to define additional variables for your own use, say some variable Counter and an operator that increments Counter if the prediction is correct, they need to be placed after you have loaded the model using restore, and then you initialize only those additional variables. Again, I think my examples might help in this case.
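A minimal sketch of that idea (the counter variable and its increment op are purely illustrative):
tf.reset_default_graph()
myModel = SoundCNN(8)
saver = tf.train.Saver()                      # covers only the model's variables
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint("."))
    # extra bookkeeping defined AFTER the restore (illustrative names)
    counter = tf.Variable(0, name="counter", trainable=False)
    increment_counter = tf.assign_add(counter, 1)
    # initialize only the new variable; the restored weights stay untouched
    sess.run(tf.variables_initializer([counter]))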
If you have any more questions please ask, I'll try to help.

Related

Loading Tensorflow model in different session

I'm a bit new to all this, so could you please help me? I tried to find the answer to this question but found nothing.
I'm trying to load a TensorFlow model in Python in a separate function, so I can use the model in a loop without having to load it in every iteration of the for loop.
This is my code now:
def load_network():
    prediction = neural_network_model(x)
    return prediction

def use_neural_network(data, prediction):
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver = tf.train.import_meta_graph(model_name + '.meta')
        saver.restore(sess, model_name)
        pred = sess.run(prediction, feed_dict={x: data})
        pred = np.asarray(pred)
        return pred

if __name__ == '__main__':
    result = []
    Load = start_network()
    for i in data:
        result.append(use_neural_network(i, Load))
And I would like to get something like this:
def load_network():
    prediction = neural_network_model(x)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver = tf.train.import_meta_graph(model_name + '.meta')
        saver.restore(sess, model_name)
    return prediction

def use_neural_network(data, prediction):
    with tf.Session() as sess:
        pred = sess.run(prediction, feed_dict={x: data})
        pred = np.asarray(pred)
        return pred

if __name__ == '__main__':
    result = []
    Load = start_network()
    for i in data:
        result.append(use_neural_network(i, Load))
Generally, what you're trying to achieve is easily doable and you're on the right track. In the main block you call start_network() instead of load_network(), as defined in your first line. I'd also recommend against using Load as a variable name, but that should not be a problem. Also, the TensorFlow session (sess in your code) should either be a global variable, or you should create it in the main block or in the load_network() function and then pass it on to the use_neural_network() function. The way it's currently written, the two sess variables in the two functions are local and therefore refer to different sessions.
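A sketch of that restructuring (it assumes neural_network_model, x, model_name and data exist as in your code, and uses tf.train.Saver() rather than import_meta_graph, since neural_network_model already rebuilds the graph in Python):
def load_network():
    prediction = neural_network_model(x)
    sess = tf.Session()
    saver = tf.train.Saver()
    saver.restore(sess, model_name)       # restore once, reuse the session afterwards
    return sess, prediction

def use_neural_network(sess, prediction, data):
    # no per-call loading; just run the already-restored graph
    return np.asarray(sess.run(prediction, feed_dict={x: data}))

if __name__ == '__main__':
    sess, prediction = load_network()
    result = [use_neural_network(sess, prediction, i) for i in data]
    sess.close()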
If you want to avoid having to use the neural_network_model(x) function, that is, building the model at the beginning, you might want to freeze the model and load it that way, with the architecture embedded as well. It is easiest to follow a guide on that, like this one.
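A rough sketch of loading a frozen graph (the file name frozen_model.pb and the tensor names are assumptions; they depend on how the model was frozen):
import tensorflow as tf

with tf.gfile.GFile('frozen_model.pb', 'rb') as f:         # assumed file name
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    # tensor names are assumptions; inspect the graph to find the real ones
    x_in = graph.get_tensor_by_name('input:0')
    prediction = graph.get_tensor_by_name('prediction:0')

with tf.Session(graph=graph) as sess:
    pred = sess.run(prediction, feed_dict={x_in: data})    # data: your input batch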

restoring sub-graph variables fails with 'Cannot interpret feed_dict key'

The context is that I'm trying to incrementally grow an RNN autoencoder, by first training a single-cell encoder/decoder and then extending it. I'd like to load the parameters of the preceding cells.
The code here is a minimal example in which I'm investigating how to do this, and it fails with:
TypeError: Cannot interpret feed_dict key as Tensor: The name 'save_1/Const:0' refers to a Tensor which does not exist. The operation, 'save_1/Const', does not exist in the graph.
I've searched and found nothing; this thread and this thread are not the same problem.
MVCE
import tensorflow as tf
import numpy as np

with tf.Session(graph=tf.Graph()) as sess:
    cell1 = tf.nn.rnn_cell.LSTMCell(1, name='lstm_cell1')
    cell = tf.nn.rnn_cell.MultiRNNCell([cell1])
    inputs = tf.random_normal((5, 10, 1))
    rnn1 = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    vars0 = tf.trainable_variables()
    saver = tf.train.Saver(vars0, max_to_keep=1)
    sess.run(tf.initialize_all_variables())
    saver.save(sess, './save0')
    vars0_val = sess.run(vars0)

# creating a new graph/session because it is not given that it'll be in the same session.
with tf.Session(graph=tf.Graph()) as sess:
    cell1 = tf.nn.rnn_cell.LSTMCell(1, name='lstm_cell1')
    # one extra cell
    cell2 = tf.nn.rnn_cell.LSTMCell(1, name='lstm_cell2')
    cell = tf.nn.rnn_cell.MultiRNNCell([cell1, cell2])
    inputs = tf.random_normal((5, 10, 1))
    rnn1 = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    sess.run(tf.initialize_all_variables())
    # new saver with first cell variables
    saver = tf.train.Saver(vars0, max_to_keep=1)
    # fails
    saver.restore(sess, './save0')
    # Should be the same
    vars0_val1 = sess.run(vars0)
    assert np.all(vars0_val1 == vars0_val)
The mistake comes from the line
saver = tf.train.Saver(vars0, max_to_keep=1)
in the second session. vars0 refers to actual tensor objects that existed in the previous graph (not the current one). Saver's var_list requires an actual set of tensors (not strings, which I assumed would be good enough).
To make it work, the second Saver object should be initialized with the corresponding tensors in the current graph.
Something like:
vars0_names = [v.name for v in vars0]
load_vars = [sess.graph.get_tensor_by_name(n) for n in vars0_names]
saver = tf.train.Saver(load_vars,max_to_keep=1)

TensorFlow print input tensors?

I'm building a TF training program and attempting to diagnose some issues we are seeing with it. The root problem is that the gradients are always nan. This is against the CIFAR10 data set (we wrote our own program from scratch to ensure we understand all of the mechanics properly).
It's too much code to post here, so it is here: https://github.com/drcrook1/CIFAR10
At this point we are fairly certain the issue is not the learning rate (we took that sucker down to 1e-25 and still got nans; we also simplified the network to a single MLP layer).
What we think is likely happening is that the values being read in by the input pipeline are wrong; therefore we want to print the values from a TFRecordReader pipeline to double check that it is in fact reading and decoding the samples properly. As you know, you can only print a TF value if you know its name or have it captured as a variable, so that brings up the point: how does one print an input tensor from a mini batch?
Thanks for any tips!
It turns out you can return examples and labels as operations and then simply print them during graph execution.
def create_sess_ops():
    '''
    Creates and returns operations needed for running
    a tensorflow training session
    '''
    GRAPH = tf.Graph()
    with GRAPH.as_default():
        examples, labels = Inputs.read_inputs(CONSTANTS.RecordPaths,
                                              batch_size=CONSTANTS.BATCH_SIZE,
                                              img_shape=CONSTANTS.IMAGE_SHAPE,
                                              num_threads=CONSTANTS.INPUT_PIPELINE_THREADS)
        examples = tf.reshape(examples, [CONSTANTS.BATCH_SIZE, CONSTANTS.IMAGE_SHAPE[0],
                                         CONSTANTS.IMAGE_SHAPE[1], CONSTANTS.IMAGE_SHAPE[2]])
        logits = Vgg3CIFAR10.inference(examples)
        loss = Vgg3CIFAR10.loss(logits, labels)
        OPTIMIZER = tf.train.AdamOptimizer(CONSTANTS.LEARNING_RATE)
        #OPTIMIZER = tf.train.RMSPropOptimizer(CONSTANTS.LEARNING_RATE)
        gradients = OPTIMIZER.compute_gradients(loss)
        apply_gradient_op = OPTIMIZER.apply_gradients(gradients)
        gradients_summary(gradients)
        summaries_op = tf.summary.merge_all()
        return [apply_gradient_op, summaries_op, loss, logits, examples, labels], GRAPH
Notice in the above code we use the input queue runners to grab examples and labels and feed them into the graph. We then return examples and labels as operations alongside all of our other operations, which can then be fetched during a session run:
def main():
    '''
    Run and Train CIFAR 10
    '''
    print('starting...')
    ops, GRAPH = create_sess_ops()
    total_duration = 0.0
    with tf.Session(graph=GRAPH) as SESSION:
        COORDINATOR = tf.train.Coordinator()
        THREADS = tf.train.start_queue_runners(SESSION, COORDINATOR)
        SESSION.run(tf.global_variables_initializer())
        SUMMARY_WRITER = tf.summary.FileWriter('Tensorboard/' + CONSTANTS.MODEL_NAME)
        GRAPH_SAVER = tf.train.Saver()
        for EPOCH in range(CONSTANTS.EPOCHS):
            duration = 0
            error = 0.0
            start_time = time.time()
            for batch in range(CONSTANTS.MINI_BATCHES):
                # fetch all six ops, including the decoded examples and labels
                _, summaries, cost_val, prediction, examples, labels = SESSION.run(ops)
                print(np.where(np.isnan(prediction)))
                print(prediction[0])
                print(labels[0])
                plt.imshow(examples[0])
                plt.show()
                error += cost_val
            duration += time.time() - start_time
            total_duration += duration
            SUMMARY_WRITER.add_summary(summaries, EPOCH)
            print('Epoch %d: loss = %.2f (%.3f sec)' % (EPOCH, error, duration))
            if EPOCH == CONSTANTS.EPOCHS - 1 or error < 0.005:
                print(
                    'Done training for %d epochs. (%.3f sec)' % (EPOCH, total_duration)
                )
                break
Notice in the above code we take the examples and labels operations, and we can now print a variety of things. We print whether anything is nan; along with that we print the prediction array itself and the label, and we even use matplotlib to plot an example image from each mini batch.
This is exactly what I was looking to do. I needed to do this to verify my issue. The root cause was the labels being read incorrectly and therefore producing infinite gradients, because the labels did not match the examples.
Have you looked at the tf.Print operator?
https://www.tensorflow.org/api_docs/python/tf/Print
If you add this to your graph with an input from one of the nodes you suspect of causing the problem, you should be able to see the results in stderr.
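For instance, a small sketch of wrapping the decoded batch tensors from the code above so their values are printed on every run:
# tf.Print passes its first argument through unchanged and logs the listed
# tensors to stderr each time this node is evaluated
examples = tf.Print(examples, [examples], message="examples: ", summarize=10)
labels = tf.Print(labels, [labels], message="labels: ", summarize=10)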
You may also find the check_numerics operator useful for debugging your problem:
How to check NaN in gradients in Tensorflow when updating?
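A sketch of that approach, assuming the logits, loss and apply_gradient_op tensors from the training code above:
# fail fast with a clear error message as soon as a NaN or Inf shows up
logits = tf.check_numerics(logits, message="logits contain NaN/Inf")
loss = tf.check_numerics(loss, message="loss contains NaN/Inf")

# or add an assertion for every floating-point tensor in the graph at once
check_op = tf.add_check_numerics_ops()
sess.run([apply_gradient_op, check_op])       # sess: your training session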
This looks like an ideal use-case for the official TensorFlow Debugger.
From the first example on the page:
from tensorflow.python import debug as tf_debug
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
From your description, it seems that you too need the tf_debug.has_inf_or_nan tensor filter to start your debugging.
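Once the session is wrapped, the tfdbg guide's workflow is to launch the script and, at the debugger prompt, run with that filter (run -f has_inf_or_nan) so execution stops at the first tensor containing an inf or nan; check the exact command against the tfdbg version you have installed.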

Tensorflow: How to use a trained model in a application?

I have trained a TensorFlow model, and now I want to export the "function" to use it in my Python program. Is that possible, and if yes, how? Any help would be nice; I could not find much in the documentation. (I don't want to save a session!)
I have now stored the session as you suggested. I am loading it now like this:
f = open('batches/batch_9.pkl', 'rb')
input = pickle.load(f)
f.close()

sess = tf.Session()
saver = tf.train.Saver()
saver.restore(sess, 'trained_network.ckpt')

y_pred = []
sess.run(y_pred, feed_dict={x: input})
print(y_pred)
However, I get the error "no Variables to save" when I try to initialize the saver.
What I want to do is this: I am writing a bot for a board game, and the input is the situation on the board formatted into a tensor. Now I want to return a tensor which gives me the best position to play next, i.e. a tensor with 0 everywhere and a 1 at one position.
I don't know if there is any other way to do it, but you can use your model in another Python program by saving your session:
Your training code:
# build your model
sess = tf.Session()
# train your model
saver = tf.train.Saver()
saver.save(sess, 'model/model.ckpt')
In your application:
# build your model (same as training)
sess = tf.Session()
saver = tf.train.Saver()
saver.restore(sess, 'model/model.ckpt')
You can then evaluate any tensor in your model using a feed_dict. This obviously depends on your model. For example:
#evaluate tensor
sess.run(y_pred, feed_dict={x: input_data})
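For the board-game use case described in the question, one possible follow-up (a sketch with illustrative names; board_size is assumed to be the number of board positions) is to turn the network output into the desired 0/1 tensor:
best_move = tf.argmax(y_pred, axis=1)                     # index of the best position
one_hot_move = tf.one_hot(best_move, depth=board_size)    # 1 at that position, 0 elsewhere
print(sess.run(one_hot_move, feed_dict={x: input_data}))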

Restored model in tensorflow and predictions

I created a neural network model in TensorFlow.
I saved the model and restored it in another Python file.
The code is below:
def restoreModel():
    prediction = neuralNetworkModel(x)
    tf_p = tensorFlow.nn.softmax(prediction)
    temp = np.array([2, 1, 541, 161124, 3, 3])
    temp = np.vstack(temp)
    with tensorFlow.Session() as sess:
        new_saver = tensorFlow.train.import_meta_graph('model.ckpt.meta')
        new_saver.restore(sess, tensorFlow.train.latest_checkpoint('./'))
        all_vars = tensorFlow.trainable_variables()
        tensorFlow.initialize_all_variables().run()
        sess.run(tensorFlow.initialize_all_variables())
        predict = sess.run([tf_p], feed_dict={
            tensorFlow.transpose(x): temp,
            y: ***
        })
when "temp" variable in what I want to predict!
X is the vector shape, and I "transposed" it to match the shapes.
I dont understand what I need to write in feed_dict variable.
I am answering late, but maybe it can still be useful. feed_dict is used to give TensorFlow the values you want your placeholders to take. fetches (the first argument of run) is the list of results you want. The keys of feed_dict and the elements of fetches must be either the names of the tensors (I didn't try it, though) or variables you can get by:
graph = tf.get_default_graph()
var = graph.get_operation_by_name('name_of_operation').outputs[0]
Maybe graph.get_tensor_by_name('name_of_operation:0') works too; I didn't try.
By default, placeholders are simply named 'Placeholder', 'Placeholder_1', etc., following their order of creation in the graph definition.
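Putting that together for the code in the question, a minimal sketch (the tensor names below are assumptions; inspect the restored graph to find the real ones):
graph = tf.get_default_graph()
# names are assumptions: unnamed placeholders default to 'Placeholder', 'Placeholder_1', ...
x_ph = graph.get_tensor_by_name('Placeholder:0')
softmax_out = graph.get_tensor_by_name('Softmax:0')       # assumed name of the softmax output
predict = sess.run(softmax_out, feed_dict={x_ph: temp.reshape(1, -1)})  # reshape to match the placeholder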
