Since updating to TensorFlow 1.0, which introduced the new Saver V2, TF no longer deletes old checkpoint files according to the 'max_to_keep' argument. This is a problem on my system since my models are pretty big but my free space is limited.
Using the dummy program below, I end up with the following files for every number from 1 to 10, while I expect only the last three (8, 9, 10) to actually be there:
testfile-1.data-00000-of-00001
testfile-1.index
testfile-1.meta
program:
import tensorflow as tf
a = tf.Variable(name='a', initial_value=0)
addops = a+1
saver = tf.train.Saver(max_to_keep=3)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())
for i in range(10):
    sess.run(addops)
    save_path = saver.save(sess, 'testfile', global_step=i+1)
sess.close()
Is this just me, or is this a known bug?
What are possible problems that could lead to this misbehavior?
Is there any log or something similar I could get more information from?
I can reproduce this. It seems to be a bug.
However, the problem goes away once I save to a different location (one different from the directory of the executed .py file):
save_path = saver.save(sess, 'data/testfile', global_step=i+1)
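If the 'data' directory does not exist yet, the save call may fail, so it is worth creating it up front; a minimal sketch (assuming the 'data' subdirectory from the line above):
import os

# Create the checkpoint directory before the training loop if it is missing
if not os.path.isdir('data'):
    os.makedirs('data')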
I've looked at many questions about saving a trained neural network, including Tensorflow: how to save/restore a model? and https://cv-tricks.com/tensorflow-tutorial/save-restore-tensorflow-models-quick-complete-tutorial/, but none of them saves a model without explicitly saving specific variables along with it, which is my situation. Here is my case:
# In session "sesh"
saver = tf.train.Saver()
saver.save(sesh, os.getcwd(), latest_filename='RNN_plasma.ckpt')
Now, I quit the session and want to restore the model I just saved. How can I do this? When trying:
import tensorflow as tf
with tf.Session() as session1:
    # First let's load the meta graph and restore the weights
    saver = tf.train.import_meta_graph('RNN_plasma.ckpt')  # error line
    saver.restore(session1, tf.train.latest_checkpoint('./'))
The tf.train.import_meta_graph() call raises:
raise IOError("Cannot parse file %s: %s." % (filename, str(e)))
IOError: Cannot parse file RNN_plasma.ckpt: 1:1 : Message type "tensorflow.MetaGraphDef" has no field named "model_checkpoint_path"..
Can anyone give any insight as to what is going on here, and how to solve it?
(My version of TensorFlow doesn't come with tf.python.saved_model.simple_save(); I have git_version 1.5.0.)
Save:
saver = tf.train.Saver()
saver.save(sess, "/tmp/network")
Restore:
sess = tf.Session()
saver = tf.train.import_meta_graph('/tmp/network.meta')
saver.restore(sess, tf.train.latest_checkpoint('/tmp'))
graph = tf.get_default_graph()
You saved a plain checkpoint, but then you are trying to load it as a meta graph. That cannot work.
There is a write-up on the TensorFlow website explaining the differences between the saved file formats:
https://www.tensorflow.org/mobile/prepare_models#what_is_up_with_all_the_different_saved_file_formats
There must be a file ending with .meta.
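As a side note on the question above: the latest_filename argument only names the small checkpoint-state file (the protobuf that holds model_checkpoint_path, which is presumably why the parser complains about exactly that field), not the model itself. A minimal sketch of a save call that actually produces a loadable .meta file, assuming the training session is called sesh:

import os
import tensorflow as tf

saver = tf.train.Saver()
# Pass a checkpoint *prefix*, not a bare directory; this writes
# RNN_plasma.ckpt.meta, RNN_plasma.ckpt.index and RNN_plasma.ckpt.data-*
saver.save(sesh, os.path.join(os.getcwd(), 'RNN_plasma.ckpt'))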
Here is a great question on how to find the first occurrence of a NaN in a TensorFlow graph:
Debugging nans in the backward pass
The answer there is quite helpful; here is the code from it:
train_op = ...
check_op = tf.add_check_numerics_ops()
sess = tf.Session()
sess.run([train_op, check_op]) # Runs training and checks for NaNs
Apparently, running the training and the numerical check at the same time will produce an error report as soon as a NaN is encountered for the first time.
How do I integrate this into Keras?
In the documentation, I can't find anything that looks like this.
I checked the code, too.
The update step is executed here:
https://github.com/fchollet/keras/blob/master/keras/engine/training.py
There is a function called _make_train_function where an operation to compute the loss and apply updates is created. This is later called to train the network.
I could change the code like this (always assuming that we're running on the TensorFlow backend):
check_op = tf.add_check_numerics_ops()
self.train_function = K.function(
    inputs,
    [self.total_loss] + self.metrics_tensors + [check_op],
    updates=updates, name='train_function', **self._function_kwargs)
I'm currently trying to set this up properly and am not sure whether the code above actually works.
Maybe there is an easier way?
I've been running into the exact same problem and found an alternative to the add_check_numerics_ops() function. Instead of going that route, I use the TensorFlow Debugger to step through my model, following the example in https://www.tensorflow.org/guide/debugger, to figure out exactly where my code produces NaNs. This snippet should work for replacing the TensorFlow session that Keras is using with a debugging session, allowing you to use tfdbg:
from tensorflow.python import debug as tf_debug
sess = K.get_session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
K.set_session(sess)
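On top of that, you can register tfdbg's built-in inf/nan tensor filter on the wrapped session; typing run -f has_inf_or_nan in the CLI then runs the model until the first tensor containing an inf or NaN appears:

# Register the built-in filter so `run -f has_inf_or_nan` stops at the
# first tensor that contains an inf or a NaN
sess.add_tensor_filter('has_inf_or_nan', tf_debug.has_inf_or_nan)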
I'm training a classification CNN using TensorFlow v0.12, and then want to create labels for new data using the trained model.
At the end of the training script, I added these lines of code:
saver = tf.train.Saver()
save_path = saver.save(sess, '/home/path/to/model/model.ckpt')
After the training completed, the files appearing in the folder are: 1. checkpoint; 2. model.ckpt.data-00000-of-00001; 3. model.ckpt.index; 4. model.ckpt.meta
Then I tried to restore the model using the .meta file. Following this tutorial, I added the following line into my classification code:
saver = tf.train.import_meta_graph(savepath + 'model.ckpt.meta')  # line1
and then:
saver.restore(sess, save_path=savepath + 'model.ckpt')  # line2
Before that change, I needed to build the graph again and then write (instead of line1):
saver = tf.train.Saver()
But deleting the graph-building code and using line1 to restore the graph raised an error. The error was that I used a variable from the graph inside my code, and Python didn't recognize it:
predictions = sess.run(y_conv, feed_dict={x: patches, keep_prob: 1.0})
Python didn't recognize the y_conv variable. Is there a way to restore the variables using the meta graph? If not, how is this restore helping if I can't use variables from the original graph?
I know this question isn't so clear, but it was hard for me to express the problem in words. Sorry about that...
Thanks for answering, appreciate your help! Roi.
It is possible, don't worry. Assuming you don't want to touch the graph anymore, do something like this:
saver = tf.train.import_meta_graph('model/export/{}.meta'.format(model_name))
saver.restore(sess, 'model/export/{}'.format(model_name))
graph = tf.get_default_graph()
y_conv = graph.get_operation_by_name('y_conv').outputs[0]
predictions = sess.run(y_conv, feed_dict={x: patches, keep_prob: 1.0})
A preferred way, however, would be to add the ops to collections when you build the graph and then refer to them. So when you define the graph, you would add the line:
tf.add_to_collection("y_conv", y_conv)
And then after you import the metagraph and restore it, you would call:
y_conv = tf.get_collection("y_conv")[0]
It is actually explained in the documentation - on the exact page you linked - but perhaps you missed it.
By the way, there's no need for the .ckpt extension; it might create some confusion, as that is the old way of saving models.
Just to add to Roberts's answer - after obtaining a saver from the meta graph, and using it to restore the variables in the current session, you can also use:
y_conv = graph.get_tensor_by_name('y_conv:0')
This will work if you created y_conv while explicitly passing the name="y_conv" argument (all TF ops accept this).
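For instance, at graph-construction time (a hypothetical creation site; logits stands in for whatever feeds the final layer):

# Naming the op at creation time makes it retrievable as 'y_conv:0' later
y_conv = tf.nn.softmax(logits, name='y_conv')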
I'm running TensorFlow 0.10.0rc0 on OS X 10.9.5 Mavericks.
There are approximately 25k training examples, 250 features (x), and 15 classes (y_); the prediction (y) comes from a single-hidden-layer perceptron.
The following snippet of a simple training loop seems to have a massive memory leak (on the order of tens of GBs over ~200 iterations - it brings down my MBP :( ):
import tensorflow as tf
# Initialize placeholders and variables etc...
...
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y,y_))
train_step = tf.train.GradientDescentOptimizer(lrate).minimize(cost)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for i in range(niter):
    # Train
    _, c = sess.run([train_step, cost])
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    sess.run(correct_prediction)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print sess.run(accuracy)
    # EDIT: Calculate test error
    ytest = sess.run(y[itrain:itrain+itest, :])
    ytest_ = sess.run(y_[itrain:itrain+itest, :])
    test_prediction = tf.equal(tf.argmax(ytest, 1), tf.argmax(ytest_, 1))
    test_accuracy = tf.reduce_mean(tf.cast(test_prediction, tf.float32))
    print sess.run(test_accuracy)
sess.close()
Am I doing something obviously wrong, or is this perhaps a bug? Thanks!
PS: If this is fixed in a later TensorFlow build, note that Bazel requires Yosemite or higher, so I can't build my own .whl from source (AFAIK); is a nightly .whl available? I would rather not be forced into an OS upgrade right now.
It's unnecessary to run sess.run(correct_prediction) -- it's a node in the TensorFlow graph on which the accuracy node depends, which means it will be evaluated anyway during the call to sess.run(accuracy).
You're probably also modifying your graph by creating new correct_prediction and accuracy ops on each iteration. This too is unnecessary -- they can be moved outside the loop and simply evaluated each iteration with calls to sess.run. So your inner loop will be something like:
for i in range(niter):
    # Train
    _, c = sess.run([train_step, cost])
    print sess.run(accuracy)
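Putting both points together, a minimal sketch of the restructured code (keeping the question's names; y, y_, train_step, cost, and niter are assumed to be defined as before):

# Build the evaluation ops once, outside the training loop
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

for i in range(niter):
    _, c = sess.run([train_step, cost])  # train
    print(sess.run(accuracy))            # evaluate without adding new ops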
I have finished running a big model in TensorFlow/Python, but I did not save it inside the session. Now that the training is over, I want to save the variables. I am doing the following:
saver = tf.train.Saver()
with tf.Session(graph=graph) as sess:
    save_path = saver.save(sess, "86_model.ckpt")
    print("Model saved in file: %s" % save_path)
This returns: ValueError: No variables to save. According to their website, what is missing is initialize_all_variables(). The documentation says little about what exactly that does. The word "initialize" scares me - I do not want to reset all my trained values. Is there any way to save my model without re-running it?
It seems from the TensorFlow documentation that the "session" is the thing that holds the information about the trained model. So presumably somewhere you called sess.run() to train your model - what you want to do is call saver.save(sess, ...) using THAT session, not a new one you create alongside this saver object.
I believe it's because you are not initializing all of your variables for the saver. This should work:
with tf.Session() as sess:
    tf.initialize_all_variables().run()
    saver = tf.train.Saver(tf.all_variables())

    # ------- everything your session does -------------

    checkpoint_path = os.path.join(save_dir, 'model.ckpt')
    saver.save(sess, checkpoint_path, global_step=your_global_step)
How about using skflow? With skflow (now integrated into TensorFlow), you can specify the parameter model_dir on your estimator's constructor, and it will automatically save your model while training (it saves checkpoints, so if something goes wrong during training, you can restart from the last checkpoint). See the sketch below.
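A rough sketch with tf.contrib.learn (the estimator type, feature column, and hyperparameters here are illustrative assumptions, not taken from the question; train_x and train_y stand in for your training data):

import tensorflow as tf

# Hypothetical feature setup; adapt the dimension to your data
feature_columns = [tf.contrib.layers.real_valued_column("x", dimension=250)]

classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[64],
    n_classes=15,
    model_dir="/tmp/my_model")  # checkpoints are written here automatically

classifier.fit(x=train_x, y=train_y, steps=1000)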