Tensorflow Out of Memory when saving?

Hi, I'm running the Linux CPU version of TensorFlow on Ubuntu 14.04, and I'm running out of memory when I try to save my model. I'm following the Deep MNIST tutorial, which builds a convolutional network. You can find it here:
https://www.tensorflow.org/versions/r0.9/tutorials/mnist/pros/index.html#deep-mnist-for-experts
I changed a couple of things and tried to add a Saver to export the model weights. However, when I run it, I get an error saying I am out of memory. This doesn't make sense to me: it can train on the data indefinitely, but saving it somehow uses too much memory?
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
step 0, training accuracy 0.06
W tensorflow/core/framework/op_kernel.cc:909] Resource exhausted: OOM when allocating tensor with shape[10000,28,28,32]
Traceback (most recent call last):
File "/home/mgump/Lambda_Project/MNIST_TRAINER.py", line 66, in <module>
x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 555, in eval
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3498, in _eval_using_default_session
return session.run(tensors, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[10000,28,28,32]
[[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Reshape, Variable/read)]]
Caused by op u'Conv2D', defined at:
File "/home/mgump/Lambda_Project/MNIST_TRAINER.py", line 28, in <module>
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "/home/mgump/Lambda_Project/MNIST_TRAINER.py", line 18, in conv2d
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
File "/usr/local/lib/python2.7/dist- packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
data_format=data_format, name=name)
File "/usr/local/lib/python2.7/dist- packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist- packages/tensorflow/python/framework/ops.py", line 2260, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist- packages/tensorflow/python/framework/ops.py", line 1230, in __init__
self._traceback = _extract_stack()`
This is what it outputs when I run it. Thanks so much!
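Note on the trace: the OOM is raised while computing test accuracy over all 10,000 test images in one call (the [10000, 28, 28, 32] tensor is the first conv layer's output for the whole test set), not by the Saver itself. A sketch of evaluating in chunks instead, assuming the tutorial's accuracy, x, y_, and keep_prob names and an active default session:

import numpy as np

def batched_test_accuracy(mnist, batch_size=500):
    # Evaluate accuracy over the test set in fixed-size chunks to bound memory.
    accs = []
    for i in range(0, len(mnist.test.images), batch_size):
        accs.append(accuracy.eval(feed_dict={
            x: mnist.test.images[i:i + batch_size],
            y_: mnist.test.labels[i:i + batch_size],
            keep_prob: 1.0}))
    return np.mean(accs)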

Related

Tensorflow: Error changing batch size of deep CNN

I have replicated a deep CNN from a research paper. When I originally constructed the model, I assumed that the batch size would be one. However, now that I have learned more about batch sizes, I want to use a batch size of 40.
Here is the Github Repository
This is a very deep network, so I will show a more basic version of the project below:
x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])
# MANY CONVOLUTIONS OMITTED HERE
# One of many transpose convolutions; the 40 here is a change I made for the batch size
w = tf.Variable(tf.constant(1., shape=[2, 2, 4, 1, 192]))
DeConnv1 = tf.nn.conv3d_transpose(layer1, filter=w, output_shape=[40, 32, 32, 7, 1], strides=[1, 2, 2, 2, 1], padding='SAME')
# I reshape the final convolution's output because I was getting batch-size errors
final = tf.reshape(final, [40, 7168])
# Accuracy and loss functions omitted because they do not deal with batch size
# Lastly, I train the model, where a has shape [40, 7168] and b has shape [40, 7168, 3]; 40 is the batch size
train_step.run(feed_dict={x: a, y_: b, keep_prob: .5})
When I run the code from the repository, I get this error. What more changes do I need to make so that the batch size is 40?
Traceback (most recent call last):
File "<stdin>", line 31, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2042, in run
_run_using_default_session(self, feed_dict, self.graph, session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 4490, in _run_using_default_session
session.run(operation, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cuDNN Backward Data function launch failure : input shape([320,4,4,1,896]) filter shape([3,3,1,896,800])
[[Node: gradients/conv3d_2/Conv3D_grad/Conv3DBackpropInputV2 = Conv3DBackpropInputV2[T=DT_FLOAT, data_format="NDHWC", padding="VALID", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/conv3d_2/Conv3D_grad/Shape, conv3d_1/kernel/read, gradients/conv3d_2/BatchToSpaceND_grad/SpaceToBatchND)]]
Caused by op u'gradients/conv3d_2/Conv3D_grad/Conv3DBackpropInputV2', defined at:
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 343, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 414, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 581, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 353, in _MaybeCompile
return grad_fn() # Exit early
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py", line 581, in <lambda>
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 82, in _Conv3DGrad
data_format=data_format),
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1084, in conv3d_backprop_input_v2
data_format=data_format, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
...which was originally created as op u'conv3d_2/Conv3D', defined at:
File "<stdin>", line 2, in <module>
File "<stdin>", line 2, in conv3d_dilation
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 809, in conv3d
return layer.apply(inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 671, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 575, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 167, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 835, in __call__
return self.conv_op(inp, filter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 499, in __call__
return self.call(inp, filter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 492, in _with_space_to_batch_call
result = self.op(input_converted, filter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 187, in __call__
name=self.name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 847, in conv3d
padding=padding, data_format=data_format, name=name)
InternalError (see above for traceback): cuDNN Backward Data function launch failure : input shape([320,4,4,1,896]) filter shape([3,3,1,896,800])
[[Node: gradients/conv3d_2/Conv3D_grad/Conv3DBackpropInputV2 = Conv3DBackpropInputV2[T=DT_FLOAT, data_format="NDHWC", padding="VALID", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/conv3d_2/Conv3D_grad/Shape, conv3d_1/kernel/read, gradients/conv3d_2/BatchToSpaceND_grad/SpaceToBatchND)]]
Try this:
shape = tf.shape(tf.reshape(x, [-1, 32, 32, 7, 1]))
DeConnv1 = tf.nn.conv3d_transpose(layer1, filter=w, output_shape=shape, strides=[1,2,2,2,1], padding='SAME')
final = tf.reshape(final, [-1, 7168])
This way you don't hard-code 40 into the model, but can feed any batch size you want, including 40.
Best practice is to avoid hard-coding the batch size in the graph. As discussed here (see "How do I build a graph that works with variable batch sizes?"), you should specify shapes as [None, nx, ny, nz] and retrieve the batch size with tf.shape(input)[0]. For reshaping, use a form like tf.reshape(input, [-1, nx, ny, nz]), where the -1 tells TensorFlow to infer the batch dimension at runtime.
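A minimal, self-contained sketch of that pattern (TF 1.x; the 32×32×7×1 = 7168 shape comes from the question, the rest is illustrative):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 7168])  # batch dimension left as None
x_vol = tf.reshape(x, [-1, 32, 32, 7, 1])           # -1: infer the batch size at runtime
batch_size = tf.shape(x)[0]                         # dynamic batch size as a tensor

with tf.Session() as sess:
    feed = np.zeros((40, 7168), dtype=np.float32)
    print(sess.run([batch_size, tf.shape(x_vol)], feed_dict={x: feed}))
    # -> [40, array([40, 32, 32, 7, 1], dtype=int32)]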

OutOfRangeError (see above for traceback): FIFOQueue '_1_batch/fifo_queue' is closed and has insufficient elements (requested 5, current size 0)

I don't know how to solve this problem; the error message doesn't help me locate it. Thanks for helping!
Here is the data in e.csv, D.csv and F.csv:
e.csv:
1,2,3
4,5,6
7,8,9
D.csv:
11,12,13
14,15,16
17,18,19
F.csv:
21,22,23
24,25,26
27,28,29
Here is my code:
import tensorflow as tf
import os

file_dir = './KDD2'
fileNameQueue = []
for file in os.listdir(file_dir):
    fileNameQueue.append(file)
print fileNameQueue

filename_queue = tf.train.string_input_producer(fileNameQueue, shuffle=False)
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
col1, col2, label = tf.decode_csv(value, record_defaults=[[1], [1], [1]])
example = tf.pack([col1, col2])
example_batch, label_batch = tf.train.batch([example, label], batch_size=5)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        print example_batch.eval()
    coord.request_stop()
    coord.join(threads)
Here is the error message:
root@ubuntumagiclab:/home/magiclab/SAE# python try.py
['e.csv', 'D.csv', 'F.csv']
Traceback (most recent call last):
File "try.py", line 30, in <module>
print example_batch.eval()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 575, in eval
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3633, in _eval_using_default_session
return session.run(tensors, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_0_batch/fifo_queue' is closed and has insufficient elements (requested 5, current size 0)
[[Node: batch = QueueDequeueMany[_class=["loc:@batch/fifo_queue"], component_types=[DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
Caused by op u'batch', defined at:
File "try.py", line 24, in <module>
example_batch, label_batch = tf.train.batch([example, label], batch_size=5)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 692, in batch
dequeued = queue.dequeue_many(batch_size, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 458, in dequeue_many
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1099, in _queue_dequeue_many
timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
OutOfRangeError (see above for traceback): FIFOQueue '_0_batch/fifo_queue' is closed and has insufficient elements (requested 5, current size 0)
[[Node: batch = QueueDequeueMany[_class=["loc:@batch/fifo_queue"], component_types=[DT_INT32, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch/fifo_queue, batch/n)]]
The problem is with the file paths. Provide complete paths to fileNameQueue, as shown below.
This works for me:
fileNameQueue.append('/home/****/Desktop/stackoverflow/data/' + file)
Hope this helps.
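An equivalent way to build the full paths (a sketch, not from the original answer) is to join each name with the directory that was listed:

import os

file_dir = './KDD2'
fileNameQueue = [os.path.join(file_dir, f) for f in os.listdir(file_dir)]
print(fileNameQueue)  # e.g. ['./KDD2/e.csv', './KDD2/D.csv', './KDD2/F.csv']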

Tensorflow Attempting to use uninitialized value AUC/AUC/auc/false_positives

I'm training a CNN for image classification. Due to the limited size of my data set, I'm using transfer learning. Basically, I'm using the pre-trained network Google provides in its retrain example (https://www.tensorflow.org/tutorials/image_retraining).
The model works great and gives very good accuracy, but my dataset is highly imbalanced, which means accuracy is not the best metric to judge the performance of the model.
Looking into different solutions, some suggested changing the sampling method or the performance metric. I chose to go with the latter.
TensorFlow provides a good variety of metrics, including AUC, precision, recall, etc.
Now, here is the code of the retraining model:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py
I'm adding the following to the add_evaluation_step(result_tensor, ground_truth_tensor) function:
with tf.name_scope('AUC'):
    with tf.name_scope('prediction'):
        prediction = tf.argmax(result_tensor, 1)
    with tf.name_scope('AUC'):
        auc_value = tf.metrics.auc(tf.argmax(ground_truth_tensor, 1), prediction, curve='ROC')
tf.summary.scalar('accuracy', evaluation_step)
tf.summary.scalar('AUC', auc_value)
But I'm getting this error:
Traceback (most recent call last):
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 1135, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 911, in main
ground_truth_input: train_ground_truth})
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value AUC/AUC/auc/false_positives
[[Node: AUC/AUC/auc/false_positives/read = Identity[T=DT_FLOAT, _class=["loc:@AUC/AUC/auc/false_positives"], _device="/job:localhost/replica:0/task:0/cpu:0"](AUC/AUC/auc/false_positives)]]
Caused by op u'AUC/AUC/auc/false_positives/read', defined at:
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 1135, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 874, in main
final_tensor, ground_truth_input)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 806, in add_evaluation_step
auc_value, update_op = tf.metrics.auc(tf.argmax(ground_truth_tensor, 1), prediction, curve='ROC')
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/metrics_impl.py", line 555, in auc
labels, predictions, thresholds, weights)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/metrics_impl.py", line 473, in _confusion_matrix_at_thresholds
false_p = _create_local('false_positives', shape=[num_thresholds])
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/metrics_impl.py", line 177, in _create_local
validate_shape=validate_shape)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/variables.py", line 226, in __init__
expected_shape=expected_shape)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/variables.py", line 344, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/gen_array_ops.py", line 1490, in identity
result = _op_def_lib.apply_op("Identity", input=input, name=name)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/framework/ops.py", line 2402, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value AUC/AUC/auc/false_positives
[[Node: AUC/AUC/auc/false_positives/read = Identity[T=DT_FLOAT, _class=["loc:@AUC/AUC/auc/false_positives"], _device="/job:localhost/replica:0/task:0/cpu:0"](AUC/AUC/auc/false_positives)]]
But I don't understand why this happens, because in main I have this:
init = tf.global_variables_initializer()
sess.run(init)
Try this:
init = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
sess.run(init)
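The reason (this explanation is consistent with the traceback but was not spelled out in the original answer): tf.metrics.auc creates its accumulators (true_positives, false_positives, etc.) as local variables, which tf.global_variables_initializer() does not initialize. A minimal sketch (TF 1.x) showing where those counters live:

import tensorflow as tf

labels = tf.placeholder(tf.bool, shape=[None])
predictions = tf.placeholder(tf.float32, shape=[None])
auc_value, update_op = tf.metrics.auc(labels, predictions, curve='ROC')

print(tf.global_variables())  # the AUC counters are not listed here...
print(tf.local_variables())   # ...they appear here (true_positives, false_positives, ...)

init = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
with tf.Session() as sess:
    sess.run(init)  # after this, reading auc_value no longer raises FailedPreconditionError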

NotFoundError when restoring tensorflow session

Here is the code:
import tensorflow as tf

def save(checkpoint_file='hello.chk'):
    with tf.Session() as session:
        x = tf.Variable(initial_value=[1, 2, 3], name="x")
        y = tf.Variable(initial_value=[[1.0, 2.0], [3.0, 4.0]], name="y")
        not_saved = tf.Variable(initial_value=[[11.0, 2.0], [3.0, 4.0]], name="not_saved")
        session.run(tf.global_variables_initializer())
        print(session.run(tf.global_variables()))
        saver = tf.train.Saver([x, y])
        saver.save(session, checkpoint_file)
        print(session.run(tf.global_variables()))
        print("saved!!!!!!!!!!")

def restore(checkpoint_file='hello.chk'):
    with tf.Session() as session:
        saver = tf.train.Saver()
        saver.restore(sess=session, save_path=checkpoint_file)
        print(session.run(tf.global_variables()))

def reset():
    tf.reset_default_graph()

save()
restore("/home/kaiyin/PycharmProjects/text-classify/hello.chk")
I am just trying to save and restore some simple variables here, nothing complicated. The saving part seems fine, but I get the following error when restoring:
Traceback (most recent call last):
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: Key not_saved not found in checkpoint
[[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/kaiyin/PycharmProjects/text-classify/restore.py", line 25, in <module>
restore("/home/kaiyin/PycharmProjects/text-classify/hello.chk")
File "/home/kaiyin/PycharmProjects/text-classify/restore.py", line 18, in restore
saver.restore(sess=session, save_path=checkpoint_file)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1428, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key not_saved not found in checkpoint
[[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]
Caused by op 'save_1/RestoreV2', defined at:
File "/home/kaiyin/PycharmProjects/text-classify/restore.py", line 25, in <module>
restore("/home/kaiyin/PycharmProjects/text-classify/hello.chk")
File "/home/kaiyin/PycharmProjects/text-classify/restore.py", line 17, in restore
saver = tf.train.Saver()
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1040, in __init__
self.build()
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1070, in build
restore_sequentially=self._restore_sequentially)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 675, in build
restore_sequentially, reshape)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 242, in restore_op
[spec.tensor.dtype])[0])
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
dtypes=dtypes, name=name)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Key not_saved not found in checkpoint
[[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]
Process finished with exit code 1
Tensorflow version:
>>> print(tf.__version__)
1.0.1
Deleting the list of variables passed to tf.train.Saver() solves the problem. Here is the working code:
import tensorflow as tf

filepath = "/home/kaiyin/PycharmProjects/text-classify/hello.chk"

def save(checkpoint_file=filepath):
    with tf.Session() as session:
        x = tf.Variable(initial_value=[1, 2, 3], name="x")
        y = tf.Variable(initial_value=[[1.0, 2.0], [3.0, 4.0]], name="y")
        not_saved = tf.Variable(initial_value=[[11.0, 2.0], [3.0, 4.0]], name="not_saved")
        session.run(tf.global_variables_initializer())
        print(session.run(tf.global_variables()))
        saver = tf.train.Saver()  # no variable list: the Saver covers every variable
        saver.save(session, checkpoint_file)
        print(session.run(tf.global_variables()))
        print("saved!!!!!!!!!!")

def restore(checkpoint_file='hello.chk'):
    with tf.Session() as session:
        saver = tf.train.Saver()
        saver.restore(sess=session, save_path=checkpoint_file)
        print(session.run(tf.global_variables()[0]))

def reset():
    tf.reset_default_graph()

save()
restore(filepath)
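Why the original version failed: save() built its Saver over [x, y] only, so not_saved never made it into the checkpoint, while restore() built tf.train.Saver() with no arguments, which tries to restore every variable in the graph, including not_saved. The code above saves everything instead; the mirror-image fix (a sketch, not from the original answer) is to build the restoring Saver over the same subset that was saved:

import tensorflow as tf

def restore_subset(checkpoint_file=filepath):
    tf.reset_default_graph()  # start from a clean graph
    with tf.Session() as session:
        # Recreate only the variables present in the checkpoint, and build
        # the Saver over that same list.
        x = tf.Variable(initial_value=[1, 2, 3], name="x")
        y = tf.Variable(initial_value=[[1.0, 2.0], [3.0, 4.0]], name="y")
        saver = tf.train.Saver([x, y])
        saver.restore(sess=session, save_path=checkpoint_file)
        print(session.run([x, y]))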

tf.nn.dynamic_rnn raised Attempting to use uninitialized value error

My graph looks like this:
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[None, None])
    with tf.device('/cpu:0'):
        embeddings = tf.Variable(tf.zeros([vocab_size, options.embed_size]))
    restorer = tf.train.Saver({'embeddings': embeddings})
    init = tf.variables_initializer([embeddings])
    uninit = tf.report_uninitialized_variables()
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    # length() returns a [batch_size,] tensor of the true lengths of sentences
    # (lengths before zero-padding)
    sequence_length = length(embed)
    lstm = tf.nn.rnn_cell.LSTMCell(options.rnn_size)
    output, _ = tf.nn.dynamic_rnn(
        lstm,
        embed,
        dtype=tf.float32,
        sequence_length=sequence_length
    )
And my session:
with tf.Session(graph=graph) as session:
    restorer.restore(session, options.restore_path)
    # tf.global_variables_initializer().run()
    init.run()
    print session.run([uninit])
    while len(data.ids):
        # data.generate_batch returns a list of size [batch_size, max_length];
        # zero-padding is used when the sentences are shorter than max_length.
        # For example, batch_inputs = [[1,2,3,4], [3,2,1,0], [1,2,0,0]]
        batch_inputs, _ = data.generate_batch(options.batch_size)
        feed_dict = {train_inputs: batch_inputs}
        test = session.run([tf.shape(output)], feed_dict=feed_dict)
        print test
Function length():
def length(self, sequence):
    length = tf.sign(sequence)
    length = tf.reduce_sum(length, reduction_indices=1)
    length = tf.cast(length, tf.int32)
    return length
The error I got:
Traceback (most recent call last):
File "rnn.py", line 103, in <module>
test = session.run([tf.shape(output)], feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value RNN/LSTMCell/W_0
[[Node: RNN/LSTMCell/W_0/read = Identity[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](RNN/LSTMCell/W_0)]]
Caused by op u'RNN/LSTMCell/W_0/read', defined at:
File "rnn.py", line 75, in <module>
sequence_length=sequence_length,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 845, in dynamic_rnn
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 1012, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2636, in while_loop
result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2469, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2419, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 995, in _time_step
skip_conditionals=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 403, in _rnn_step
new_output, new_state = call_cell()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 983, in <lambda>
call_cell = lambda: cell(input_t, state)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn_cell.py", line 496, in __call__
dtype, self._num_unit_shards)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn_cell.py", line 329, in _get_concat_variable
sharded_variable = _get_sharded_variable(name, shape, dtype, num_shards)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn_cell.py", line 359, in _get_sharded_variable
dtype=dtype))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
validate_shape=validate_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 677, in _get_single_variable
expected_shape=shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 224, in __init__
expected_shape=expected_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 367, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1424, in identity
result = _op_def_lib.apply_op("Identity", input=input, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value RNN/LSTMCell/W_0
[[Node: RNN/LSTMCell/W_0/read = Identity[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](RNN/LSTMCell/W_0)]]
However, when I printed out the uninitialized variables, I got [array([], dtype=object)].
When I replaced init.run() with tf.global_variables_initializer().run(), it worked.
Any idea why init.run() doesn't work?
You defined init as follows:
init = tf.variables_initializer([embeddings])
This definition means that init initializes only the embeddings variable. Calling the tf.nn.dynamic_rnn() function creates more variables, representing the various internal weights in the LSTM, and these are not initialized by init.
By contrast, tf.global_variables_initializer() returns an operation that, when run, will initialize all of the (global) variables in your model, including those created for the LSTM.
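If you want to keep the restore-then-initialize structure, one option (a sketch using the question's names; this is an assumption, not part of the original answer) is to build an initializer over everything except the restored embeddings, after the whole graph, including the LSTM, has been constructed:

with graph.as_default():
    # Must run after tf.nn.dynamic_rnn, so the LSTM variables already exist.
    rest = [v for v in tf.global_variables() if v is not embeddings]
    init_rest = tf.variables_initializer(rest)

with tf.Session(graph=graph) as session:
    restorer.restore(session, options.restore_path)  # restores embeddings only
    init_rest.run()  # initializes the LSTM weights without touching embeddings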
