Sorry for my lack of knowledge, but I am trying to run the following TensorFlow example:
import numpy as np
import tensorflow as tf

feature_columns = [tf.feature_column.numeric_column("x", shape=[1])]
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])
x_eval = np.array([2., 5., 8., 1.])
y_eval = np.array([-1.01, -4.1, -7, 0.])

input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=None, shuffle=True)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=1000, shuffle=False)
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_eval}, y_eval, batch_size=4, num_epochs=1000, shuffle=False)

estimator.train(input_fn=input_fn, steps=1000)
train_metrics = estimator.evaluate(input_fn=train_input_fn)
eval_metrics = estimator.evaluate(input_fn=eval_input_fn)
print("train metrics: %r" % train_metrics)
print("eval metrics: %r" % eval_metrics)
I got the following error message:
PermissionDeniedError: Failed to delete a file: C:\Users\Jeff\AppData\Local\Temp\tmpgpmjek44\graph.pbtxt.tmpe31b9f4677cb426fbaef32dadeaf1a4d; Permission denied
The error comes from the line "estimator.train(input_fn=input_fn, steps=1000)". I checked the folder and the file, and both already grant full control. This may be a stupid question, but what could be the cause, and how do I fix it? Thank you so much in advance!
UPDATE:
I ran it from the root and got the following:
(C:\Users\Jeff\Anaconda3) C:\Users\Jeff>python test.py
WARNING:tensorflow:Using temporary folder as model directory:
C:\Users\Jeff\AppData\Local\Temp\tmp0yywjv30 2017-11-10
22:54:59.808636: I
C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137]
Your CPU supports instructions that this TensorFlow binary was not
compiled to use: AVX AVX2 2017-11-10 22:55:00.096842: I
C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030]
Found device 0 with properties: name: GeForce GTX 1060 major: 6 minor:
1 memoryClockRate(GHz): 1.6705 pciBusID: 0000:01:00.0 totalMemory:
6.00GiB freeMemory: 4.99GiB 2017-11-10 22:55:00.096927: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120]
Creating TensorFlow device (/device:GPU:0) -> (device: 0, name:
GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
2017-11-10 22:55:02.512317: E
C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:366]
failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 2017-11-10
22:55:02.513461: E
C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:366]
failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 2017-11-10
22:55:02.513601: E
C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:366]
failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 2017-11-10
22:55:02.514975: E
C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:366]
failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED 2017-11-10
22:55:02.515067: W
C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\stream.cc:1901]
attempting to perform BLAS operation using StreamExecutor without BLAS
support Traceback (most recent call last): File
"C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py",
line 1323, in _do_call
return fn(*args) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py",
line 1302, in _run_fn
status, run_metadata) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py",
line 473, in __exit__
c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.InternalError: Blas GEMV
launch failed: m=1, n=4
[[Node: linear/linear_model/x/weighted_sum = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false,
_device="/job:localhost/replica:0/task:0/device:GPU:0"](linear/linear_model/x/Reshape,
linear/linear_model/x/weights)]]
[[Node: linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1/_85
= _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0",
send_device="/job:localhost/replica:0/task:0/device:GPU:0",
send_device_incarnation=1,
tensor_name="edge_184_linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1",
tensor_type=DT_FLOAT,
_device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "test.py", line 39, in <module>
estimator.train(input_fn=input_fn, steps=1000) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py",
line 302, in train
loss = self._train_model(input_fn, hooks, saving_listeners) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py",
line 783, in _train_model
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File
"C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py",
line 521, in run
run_metadata=run_metadata) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py",
line 892, in run
run_metadata=run_metadata) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py",
line 967, in run
raise six.reraise(*original_exc_info) File "C:\Users\Jeff\Anaconda3\lib\site-packages\six.py", line 693, in
reraise
raise value File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py",
line 952, in run
return self._sess.run(*args, **kwargs) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py",
line 1024, in run
run_metadata=run_metadata) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py",
line 827, in run
return self._sess.run(*args, **kwargs) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py",
line 889, in run
run_metadata_ptr) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py",
line 1120, in _run
feed_dict_tensor, options, run_metadata) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py",
line 1317, in _do_run
options, run_metadata) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py",
line 1336, in _do_call
raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InternalError: Blas GEMV
launch failed: m=1, n=4
[[Node: linear/linear_model/x/weighted_sum = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false,
_device="/job:localhost/replica:0/task:0/device:GPU:0"](linear/linear_model/x/Reshape,
linear/linear_model/x/weights)]]
[[Node: linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1/_85
= _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0",
send_device="/job:localhost/replica:0/task:0/device:GPU:0",
send_device_incarnation=1,
tensor_name="edge_184_linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1",
tensor_type=DT_FLOAT,
_device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'linear/linear_model/x/weighted_sum', defined at: File
"test.py", line 39, in <module>
estimator.train(input_fn=input_fn, steps=1000) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py",
line 302, in train
loss = self._train_model(input_fn, hooks, saving_listeners) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py",
line 711, in _train_model
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py",
line 694, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs) File
"C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\canned\linear.py",
line 348, in _model_fn
config=config) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\canned\linear.py",
line 118, in _linear_model_fn
logits = logit_fn(features=features) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\canned\linear.py",
line 70, in linear_logit_fn
features=features, feature_columns=feature_columns, units=units) File
"C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\feature_column\feature_column.py",
line 321, in linear_model
column, builder, units, weight_collections, trainable)) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\feature_column\feature_column.py",
line 1376, in _create_dense_column_weighted_sum
return math_ops.matmul(tensor, weight, name='weighted_sum') File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py",
line 1891, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name) File
"C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py",
line 2436, in _mat_mul
name=name) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py",
line 787, in _apply_op_helper
op_def=op_def) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py",
line 2956, in create_op
op_def=op_def) File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py",
line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): Blas GEMV launch failed:
m=1, n=4
[[Node: linear/linear_model/x/weighted_sum = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false,
_device="/job:localhost/replica:0/task:0/device:GPU:0"](linear/linear_model/x/Reshape,
linear/linear_model/x/weights)]]
[[Node: linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1/_85
= _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0",
send_device="/job:localhost/replica:0/task:0/device:GPU:0",
send_device_incarnation=1,
tensor_name="edge_184_linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1",
tensor_type=DT_FLOAT,
_device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
It's a PermissionDeniedError:
As far as I can see, you should run this script from the root.
Try that and update the question.
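One way to narrow down a PermissionDeniedError like the one in the question is to check whether the process can create and delete files in a directory you choose, and then pass that directory to the estimator as model_dir instead of relying on a temporary folder. This is only a sketch: the directory name and the helper can_create_and_delete are hypothetical, and it is not confirmed to fix the error above.

```python
import os
import tempfile


def can_create_and_delete(directory):
    """Return True if a file can be created and then removed in `directory`."""
    try:
        os.makedirs(directory, exist_ok=True)
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        os.remove(path)
        return True
    except OSError:
        # PermissionError is a subclass of OSError, so it is caught here too.
        return False


# Hypothetical location; any directory you own would do.
model_dir = os.path.join(tempfile.gettempdir(), "linear_regressor_model")
print(can_create_and_delete(model_dir))

# If the check passes, the directory can be handed to the estimator:
# estimator = tf.estimator.LinearRegressor(
#     feature_columns=feature_columns, model_dir=model_dir)
```

If the check fails even in a directory you own, something outside the script (for example another process holding the file open) may be involved.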
Related
Traceback (most recent call last):
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([202027,64,1,1])
[[Node: bn_fm_1/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NCHW", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bn_fm_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, bn_fm/gamma/read, bn_fm/beta/read, bn_fm/moving_mean/read, bn_fm/moving_variance/read)]]
[[Node: AddN/_31 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_202_AddN", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "NeuralFM.py", line 350, in <module>
model.train(data.Train_data, data.Validation_data, data.Test_data)
File "NeuralFM.py", line 266, in train
init_train = self.evaluate(Train_data)
File "NeuralFM.py", line 311, in evaluate
predictions = self.sess.run((self.out), feed_dict=feed_dict)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([202027,64,1,1])
[[Node: bn_fm_1/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NCHW", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bn_fm_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, bn_fm/gamma/read, bn_fm/beta/read, bn_fm/moving_mean/read, bn_fm/moving_variance/read)]]
[[Node: AddN/_31 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_202_AddN", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'bn_fm_1/FusedBatchNorm', defined at:
File "NeuralFM.py", line 349, in <module>
model = NeuralFM(data.features_M, args.hidden_factor, eval(args.layers), args.loss_type, args.pretrain, args.epoch, args.batch_size, args.lr, args.lamda, eval(args.keep_prob), args.optimizer, args.batch_norm, activation_function, args.verbose, args.early_stop)
File "NeuralFM.py", line 89, in __init__
self._init_graph()
File "NeuralFM.py", line 123, in _init_graph
self.FM = self.batch_norm_layer(self.FM, train_phase=self.train_phase, scope_bn='bn_fm')
File "NeuralFM.py", line 224, in batch_norm_layer
is_training=False, reuse=True, trainable=True, scope=scope_bn)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 596, in batch_norm
scope=scope)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 382, in _fused_batch_norm
is_training, _fused_batch_norm_training, _fused_batch_norm_inference)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/utils.py", line 214, in smart_cond
return static_cond(pred_value, fn1, fn2)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/utils.py", line 194, in static_cond
return fn2()
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 379, in _fused_batch_norm_inference
data_format=data_format)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 906, in fused_batch_norm
name=name)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 3465, in _fused_batch_norm
is_training=is_training, name=name)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): cuDNN launch failure : input shape ([202027,64,1,1])
[[Node: bn_fm_1/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NCHW", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bn_fm_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, bn_fm/gamma/read, bn_fm/beta/read, bn_fm/moving_mean/read, bn_fm/moving_variance/read)]]
[[Node: AddN/_31 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_202_AddN", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
I keep getting this error. I've tried everything, including downgrading CUDA, cuDNN, and tensorflow-gpu.
I'm currently on CUDA 9.0, cuDNN v7.4.2 for CUDA 9.0, and tensorflow-gpu 1.9, and nothing I do seems to help. I'm running out of ideas; I've installed every dependency I could think of.
I'm trying to run this:
https://github.com/hexiangnan/neural_factorization_machine
EDIT: I have a feeling this is connected to https://github.com/tensorflow/tensorflow/issues/8090 but as I'm fairly new to all this, I'm not sure whether I'm right or how to address it.
I hit the same error. In my case the cause was that my GPU did not have enough memory for the process.
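To check whether memory is the limiting factor, it can help to look at how much GPU memory is already in use before the script starts. A minimal sketch, assuming nvidia-smi is on the PATH; the function names are hypothetical, and the parser is run here on a made-up sample line, not on real output:

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of tuples."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory (MiB) of every GPU."""
    output = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        universal_newlines=True)
    return parse_gpu_memory(output)


# Made-up sample in the format requested above: 1024 MiB used of 6144 MiB.
sample = "1024, 6144\n"
for used, total in parse_gpu_memory(sample):
    free_percent = 100.0 * (total - used) / total
    print("used %d MiB of %d MiB (%.0f%% free)" % (used, total, free_percent))
```

On a machine with a GPU, calling query_gpu_memory() before and during a run shows whether another process already holds most of the memory.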
I'm probably a few years late to be of any help, Alex, but I ran into this issue on Windows with a specific GPU. Don't ask me why, but adding
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '/gpu:0'
works for me on a machine with a single GPU.
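A note on the value used above: as far as the CUDA documentation describes it, CUDA_VISIBLE_DEVICES expects a comma-separated list of device indices (or GPU UUIDs), not a TensorFlow device string. A value such as '/gpu:0' is therefore likely treated as invalid, which would hide the GPU and let TensorFlow fall back to the CPU; that may be why the error disappears. A minimal sketch, assuming a single GPU at index 0 (the helper name select_gpus is hypothetical):

```python
import os


def select_gpus(indices):
    """Expose only the given GPU indices to CUDA (hypothetical helper).

    Call this before TensorFlow initialises the GPU; doing it before
    `import tensorflow` is the safest place.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


select_gpus([0])  # expose only the first GPU
# import tensorflow as tf  # import only after the variable is set
```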
I solved it by adding this to the script, after the imports (os must be among them):
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
I am unable to run TensorFlow on the GPU. The same code works on the CPU.
Debian version 9.8
1 GPU Nvidia Tesla V100
TensorFlow-GPU 1.12
Nvidia Driver: NVIDIA-Linux-x86_64-390.46.run
CUDA: cuda_9.0.176_384.81_linux-run
CuDNN: cudnn-9.0-linux-x64-v7.4.1.5.tgz
NCCL: nccl_2.3.7-1+cuda9.0_x86_64.txz
Update:
Tested with cuDNN 7.1.4 and got the same problem.
Patches
cuda_9.0.176.1_linux-run
cuda_9.0.176.2_linux-run
cuda_9.0.176.3_linux-run
cuda_9.0.176.4_linux-run
Error:
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:119) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1_S43}} = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-1302637405089825922, tensor_name="edge_273_adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'conv1/Conv2D', defined at:
File "mnist_distributed.py", line 237, in <module>
tf.app.run()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "mnist_distributed.py", line 196, in main
features, labels, keep_prob, global_step, train_step, accuracy, merged = create_model()
File "mnist_distributed.py", line 149, in create_model
y_conv, keep_prob = deepnn(x)
File "mnist_distributed.py", line 77, in deepnn
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "mnist_distributed.py", line 119, in conv2d
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:119) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1_S43}} = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-1302637405089825922, tensor_name="edge_273_adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
Code here
Libraries:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda
Versions
CUDA
cat /usr/local/cuda/version.txt
CUDA Version 9.0.176
CUDA Patch Version 9.0.176.1
CUDA Patch Version 9.0.176.2
CUDA Patch Version 9.0.176.3
CUDA Patch Version 9.0.176.4
CuDNN
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
Similar:
https://github.com/tensorflow/tensorflow/issues/24828
Which TensorFlow and CUDA version combinations are compatible?
Looking at the logs in detail, I found I was getting OOM errors. I then changed the config passed to tf.train.Server as follows, which made it work:
config_proto = tf.ConfigProto(log_device_placement=True)
config_proto.gpu_options.allow_growth = True
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index, config=config_proto)
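By default TensorFlow reserves nearly all GPU memory when a process starts, while allow_growth makes it claim memory on demand. That matters when several processes use the same card: the ps and worker tasks in the traceback each use a device named GPU:0, and the question mentions a single GPU, so they may be sharing it. A related option, shown here as a sketch and not something the answer above used, is to cap each process with per_process_gpu_memory_fraction. The helper names and the 10% headroom are assumptions for illustration; the TensorFlow calls are the TF 1.x API.

```python
def memory_fraction_per_process(num_processes, headroom=0.1):
    """Split GPU memory evenly, leaving `headroom` unallocated (assumed policy)."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return (1.0 - headroom) / num_processes


def make_config(num_processes):
    """Build a TF 1.x config that caps this process's share of GPU memory."""
    import tensorflow as tf  # imported lazily; requires TensorFlow 1.x

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = (
        memory_fraction_per_process(num_processes))
    return config


# Example: one ps task and two workers on the same GPU.
fraction = memory_fraction_per_process(3)
print("each process may use %.0f%% of GPU memory" % (100 * fraction))
# server = tf.train.Server(cluster, job_name=job_name, task_index=task_index,
#                          config=make_config(3))
```

A fixed fraction gives predictable limits per process, whereas allow_growth is simpler when the memory needs are not known in advance.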
Errors:
2019-02-20 04:27:30.580666: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 836.47M (877106944 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-02-20 04:27:30.612909: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.619060: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.625466: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.630800: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.636172: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.641168: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.723663: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-02-20 04:27:30.726611: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node Mean_G10}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-8510199717243775654, tensor_name="edge_245_Mean", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:1/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mnist_distributed.py", line 234, in <module>
tf.app.run()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "mnist_distributed.py", line 222, in main
feed_dict={features: batch[0], labels: batch[1], keep_prob: 1.0})
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:118) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node Mean_G10}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-8510199717243775654, tensor_name="edge_245_Mean", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:1/device:CPU:0"]()]]
Caused by op 'conv1/Conv2D', defined at:
File "mnist_distributed.py", line 234, in <module>
tf.app.run()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "mnist_distributed.py", line 195, in main
features, labels, keep_prob, global_step, train_step, accuracy, merged = create_model()
File "mnist_distributed.py", line 148, in create_model
y_conv, keep_prob = deepnn(x)
File "mnist_distributed.py", line 76, in deepnn
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "mnist_distributed.py", line 118, in conv2d
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:118) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node Mean_G10}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-8510199717243775654, tensor_name="edge_245_Mean", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:1/device:CPU:0"]()]]
I find TensorFlow error messages hard to read, especially those raised at run time (i.e. in sess.run()), and there is little documentation explaining how to interpret them.
For example, here is one such error message:
Traceback (most recent call last):
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 10669 values, but the requested shape has 11172
[[Node: optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Reshape/tensor, optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Shape)]]
[[Node: cond/getRefinementLoss/posLoss/getPosLoss/Reshape/_1897 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4151_cond/getRefinementLoss/posLoss/getPosLoss/Reshape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/hyh/projects/RFCN-tensorflow/main.py", line 155, in <module>
res = runManager.modRun(i)
File "/home/hyh/projects/RFCN-tensorflow/Utils/RunManager.py", line 97, in modRun
return self.runAndMerge(feed_dict, options=options if options is not None else self.options, run_metadata=run_metadata if run_metadata is not None else self.run_metadata)
File "/home/hyh/projects/RFCN-tensorflow/Utils/RunManager.py", line 71, in runAndMerge
res = self.sess.run(self.inputTensors, feed_dict=feed_dict, options=options, run_metadata=run_metadata)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 10669 values, but the requested shape has 11172
[[Node: optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Reshape/tensor, optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Shape)]]
[[Node: cond/getRefinementLoss/posLoss/getPosLoss/Reshape/_1897 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4151_cond/getRefinementLoss/posLoss/getPosLoss/Reshape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Reshape', defined at:
File "/home/hyh/projects/RFCN-tensorflow/main.py", line 118, in <module>
trainOp = createUpdateOp()
File "/home/hyh/projects/RFCN-tensorflow/main.py", line 104, in createUpdateOp
grads = optimizer.compute_gradients(totalLoss, var_list=net.getVariables())
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 526, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 494, in gradients
gate_gradients, aggregation_method, stop_gradients)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 636, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 385, in _MaybeCompile
return grad_fn() # Exit early
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 636, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 521, in _ReshapeGrad
return [array_ops.reshape(grad, array_ops.shape(op.inputs[0])), None]
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6113, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
...which was originally created as op 'RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2', defined at:
File "/home/hyh/projects/RFCN-tensorflow/main.py", line 96, in <module>
tf.losses.add_loss(net.getLoss(boxes, classes))
File "/home/hyh/projects/RFCN-tensorflow/BoxEngine/BoxNetwork.py", line 50, in getLoss
return self.rpn.loss(refBoxes) + self.boxRefiner.loss(self.proposals, refBoxes, refClasses)
File "/home/hyh/projects/RFCN-tensorflow/BoxEngine/RPN.py", line 186, in loss
return tf.cond(tf.shape(refBoxes)[0] > 0, lambda: calcLoss(), lambda: tf.constant(0.0))
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
return func(*args, **kwargs)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2063, in cond
orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1913, in BuildCondBranch
original_result = fn()
File "/home/hyh/projects/RFCN-tensorflow/BoxEngine/RPN.py", line 186, in <lambda>
return tf.cond(tf.shape(refBoxes)[0] > 0, lambda: calcLoss(), lambda: tf.constant(0.0))
File "/home/hyh/projects/RFCN-tensorflow/BoxEngine/RPN.py", line 173, in calcLoss
positiveLosses, negativeLosses = calcAllLosses(inAnchros, inBoxes, inRawSizes, inScores, inBoxSizes)
File "/home/hyh/projects/RFCN-tensorflow/BoxEngine/RPN.py", line 145, in calcAllLosses
classificationLoss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=scores, labels=refScores, name="classification_loss")
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 1878, in softmax_cross_entropy_with_logits_v2
cost = array_ops.reshape(cost, output_shape)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6113, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "/home/hyh/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 10669 values, but the requested shape has 11172
[[Node: optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Reshape/tensor, optimizer/gradients/RPNloss/cond/calcRPNLoss/calcAllRPNLosses/classification_loss/Reshape_2_grad/Shape)]]
[[Node: cond/getRefinementLoss/posLoss/getPosLoss/Reshape/_1897 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4151_cond/getRefinementLoss/posLoss/getPosLoss/Reshape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Process finished with exit code 1
I have two questions:
Why are there so many call stacks? First there is a Traceback, then "During handling of the above exception, another exception occurred:", then "Caused by op ...", and finally "...which was originally created as op". What does each of them mean?
Why are there so many error nodes? In the message above, it seems that two nodes have gone wrong. What does that mean, and which node actually caused the error?
TensorFlow error messages are always quite verbose, and this is mainly due to how TF works: it builds a computation graph, so an error reports both where the op was run and where it was defined.
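Part of the length is ordinary Python behaviour rather than anything TensorFlow-specific. The frame `raise type(e)(node_def, op, message)` in your traceback appears to re-raise the error from inside an `except` block, and in that situation Python prints the original traceback, then the line "During handling of the above exception, another exception occurred:", then the new traceback. A minimal sketch in plain Python, where `low_level_call` and `do_call` are hypothetical stand-ins and not TensorFlow's actual code:

```python
import traceback


def low_level_call():
    # Stand-in for the runtime call that fails first.
    raise ValueError("Input to reshape is a tensor with 10669 values")


def do_call():
    try:
        return low_level_call()
    except ValueError as e:
        # Raising a new exception inside an except block makes Python
        # keep the original one in __context__ and print both tracebacks.
        raise type(e)("re-raised with extra detail: " + str(e))


try:
    do_call()
except ValueError as err:
    chained = err
    report = "".join(
        traceback.format_exception(type(err), err, err.__traceback__))

print(report)
```

Both tracebacks in the printed report describe the same failure; the second one is simply the re-raised copy that reaches your own code.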
In your case, it seems that you are reshaping a tensor with the wrong shape:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 10669 values, but the requested shape has 11172
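To check whether that is the case, try printing the shape of the tensor passed to the reshape op, e.g.:
import tensorflow as tf
inputs = tf.placeholder(tf.float32, [None, 28, 28, 1])
x = tf.layers.dense(inputs, units=64, activation=tf.nn.relu)
# tf.Print logs its second argument each time x is evaluated; pass the shape, not the values
x = tf.Print(x, [tf.shape(x)], message="shape before reshape: ")
x_rs = tf.reshape(x, [-1, 28*28])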
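The rule behind the message is that a reshape must keep the number of elements unchanged. Here is a small plain-Python sketch of that check. The helper `reshape_is_valid` is a hypothetical name, not a TensorFlow function, and the flat shape `[11172]` only stands in for the requested shape, which the error does not spell out:

```python
from functools import reduce
from operator import mul


def reshape_is_valid(num_values, target_shape):
    """Return True if num_values elements can fill target_shape."""
    unknown = [d for d in target_shape if d == -1]
    if len(unknown) > 1:
        raise ValueError("at most one dimension may be -1")
    # Product of all explicitly given dimensions.
    known = reduce(mul, (d for d in target_shape if d != -1), 1)
    if unknown:
        # The -1 dimension is inferred, so the known part must divide evenly.
        return known != 0 and num_values % known == 0
    return num_values == known


# The mismatch reported in the error: 10669 values vs. 11172 requested.
print(reshape_is_valid(10669, [11172]))
# The dense output above for a batch of one: 28 * 28 * 64 values.
print(reshape_is_valid(28 * 28 * 64, [-1, 28 * 28]))
```

If the printed shape and the requested shape fail this kind of check, the mismatch most likely comes from the tensors feeding the op, not from the reshape itself.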
I've gotten stuck on this issue for a little while. I'm trying to run the code below with the tf_cnnvis (https://github.com/InFoCusp/tf_cnnvis) package for visualising learnt features in the network, where I import my protobuf model and then try and provide it a tensor containing some image data (which I believe is provided as a feed_dict, although I could be mistaken).
import numpy as np
import tensorflow as tf
import keras as k
import cv2
import tf_cnnvis as tfv
from tensorflow.python.platform import gfile
from keras import backend as K
model_filename = "saved_model.pb"
image = "test.jpg"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8, allow_growth=False)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
K.set_session(sess)
K._LEARNING_PHASE = tf.constant(0)
K.set_learning_phase(0)
with gfile.FastGFile(model_filename, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def)
X = tf.placeholder(tf.float32, shape = [None, 48, 64, 3],name = "input") # placeholder for input images
y = tf.placeholder(tf.float32, shape = [None, 8])
im = np.array(cv2.imread(image))
im = np.expand_dims(im, 0)
layers = ['r', 'p', 'c']
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
sess.run(init_op)
with sess.as_default():
    is_success = tfv.activation_visualization(sess_graph_path=tf.get_default_graph(), value_feed_dict = {X : im}, layers=layers)
sess.close()
When I run my code, I get an "InvalidArgumentError" with this traceback:
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'import/batch_normalization_1_input' with dtype float and shape [?,48,64,3]
[[{{node import/batch_normalization_1_input}} = Placeholder[_class=["loc:#import/batch_normalization/cond/FusedBatchNorm_1/Switch"], dtype=DT_FLOAT, shape=[?,48,64,3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
[[{{node import/conv2d/Relu/_5}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_50_import/conv2d/Relu", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "vis2.py", line 36, in <module>
is_success = tfv.activation_visualization(sess_graph_path=tf.get_default_graph(), value_feed_dict = {X : im}, layers=layers)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 406, in activation_visualization
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 169, in _get_visualization
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 227, in _visualization_by_layer_type
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 288, in _visualization_by_layer_name
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 315, in _activation
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'import/batch_normalization_1_input' with dtype float and shape [?,48,64,3]
[[{{node import/batch_normalization_1_input}} = Placeholder[_class=["loc:#import/batch_normalization/cond/FusedBatchNorm_1/Switch"], dtype=DT_FLOAT, shape=[?,48,64,3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
[[{{node import/conv2d/Relu/_5}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_50_import/conv2d/Relu", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'import/batch_normalization_1_input', defined at:
File "vis2.py", line 36, in <module>
is_success = tfv.activation_visualization(sess_graph_path=tf.get_default_graph(), value_feed_dict = {X : im}, layers=layers)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 406, in activation_visualization
path_logdir = path_logdir, path_outdir = path_outdir)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 159, in _get_visualization
s = _graph_import_function(PATH,s)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 177, in _graph_import_function
new_saver = tf.train.import_meta_graph(PATH) # Import graph
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1650, in import_meta_graph
meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1672, in _import_meta_graph_with_return_elements
**kwargs))
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
_ProcessNewOps(graph)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3426, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3426, in <listcomp>
for c_op in c_api_util.new_tf_operations(self)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3285, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'import/batch_normalization_1_input' with dtype float and shape [?,48,64,3]
[[{{node import/batch_normalization_1_input}} = Placeholder[_class=["loc:#import/batch_normalization/cond/FusedBatchNorm_1/Switch"], dtype=DT_FLOAT, shape=[?,48,64,3], _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
[[{{node import/conv2d/Relu/_5}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_50_import/conv2d/Relu", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Now, I've looked around and I've arrived (tentatively) at the conclusion that this is due to a learning phase variable that's set in the BatchNormalization layer that I have in the model. I'm unclear as to how to set the learning phase when you've imported the model. Some people set the learning phase before initializing the model (which as you can see, I have attempted), but in most examples of this they're using one of the large, pre-provided models (such as MNIST). Others provide the learning phase in the feed_dict, which I have also tried, like so:
with sess.as_default():
    is_success = tfv.activation_visualization(sess_graph_path=tf.get_default_graph(), value_feed_dict = {X : im, K.learning_phase(): 0}, layers=layers)
But this gives me a different error message:
Traceback (most recent call last):
File "vis2.py", line 36, in <module>
is_success = tfv.activation_visualization(sess_graph_path=tf.get_default_graph(), value_feed_dict = {X : im, K.learning_phase(): 0}, layers=layers)
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 406, in activation_visualization
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 169, in _get_visualization
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 227, in _visualization_by_layer_type
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/tf_cnnvis.py", line 270, in _visualization_by_layer_name
File "/usr/local/anaconda3/lib/python3.6/site-packages/tf_cnnvis-1.0.0-py3.6.egg/tf_cnnvis/utils.py", line 79, in parse_tensors_dict
AttributeError: 'int' object has no attribute 'name'
At this stage, seeing as I'm still not completely sure if the problem I'm trying to fix is even the right one, I would very much appreciate some input. If there's anything else you need me to provide, please ask.
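One thing that can be read directly off the first traceback: the placeholder the graph complains about is `import/batch_normalization_1_input`, whereas the feed_dict only contains the newly created placeholder named "input". Here is a small plain-Python sketch for pulling that name out of the message and comparing it with what was fed. The helper name `missing_placeholder` and the `fed_names` list are assumptions for illustration, and the regex only targets the message format shown above:

```python
import re

# Matches the wording of the InvalidArgumentError shown in the traceback.
PLACEHOLDER_RE = re.compile(
    r"You must feed a value for placeholder tensor '([^']+)'")


def missing_placeholder(message):
    """Return the placeholder name mentioned in the error, or None."""
    match = PLACEHOLDER_RE.search(message)
    return match.group(1) if match else None


message = ("You must feed a value for placeholder tensor "
           "'import/batch_normalization_1_input' with dtype float "
           "and shape [?,48,64,3]")
fed_names = ["input"]  # name given to the new placeholder X in the script

name = missing_placeholder(message)
print(name)
print(name in fed_names)
```

If the names differ, as they do here, the value is being fed to a tensor the imported graph never reads, which may be worth ruling out before looking at the learning phase.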
I am trying to add batch norm to a VGG-style model in Keras. When I add the batch norm layers, I get this error:
FailedPreconditionError: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
Without the batch norm layers the script runs without errors; it only throws the error when I add the BatchNormalization layers.
model = Sequential()
model.add(ZeroPadding2D((1, 1), input_shape=(1, conf['image_shape'][0], conf['image_shape'][1]), dim_ordering=conf['dim_ordering']))
model.add(Convolution2D(conf['level_1_filters'], 3, 3, dim_ordering=conf['dim_ordering']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(ZeroPadding2D((1, 1), dim_ordering=conf['dim_ordering']))
model.add(Convolution2D(conf['level_1_filters'], 3, 3, dim_ordering=conf['dim_ordering']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2), dim_ordering=conf['dim_ordering']))
model.add(ZeroPadding2D((1, 1), dim_ordering=conf['dim_ordering']))
model.add(Convolution2D(conf['level_2_filters'], 3, 3, dim_ordering=conf['dim_ordering']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(ZeroPadding2D((1, 1), dim_ordering=conf['dim_ordering']))
model.add(Convolution2D(conf['level_2_filters'], 3, 3, dim_ordering=conf['dim_ordering']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2), dim_ordering=conf['dim_ordering']))
model.add(Flatten())
model.add(Dense(conf['dense_layer_size']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(conf['dropout_value']))
model.add(Dense(conf['dense_layer_size']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(conf['dropout_value']))
model.add(Dense(2, activation='softmax'))
# sgd = SGD(lr=conf['learning_rate'], decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Is this the correct syntax to use batch norm in Keras? I followed the example in this thread.
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Train patients: 699
Valid patients: 698
Create and compile model...
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:03:00.0
Total memory: 7.92GiB
Free memory: 7.07GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0)
Number of train files: 123111
Number of valid files: 125469
Fit model...
Samples train: 5000, Samples valid: 5000
Epoch 1/40
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:#batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:#batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:#batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:#batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:#batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:#batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
Traceback (most recent call last):
File "keras-v2.py", line 197, in <module>
model = create_single_model()
File "keras-v2.py", line 173, in create_single_model
callbacks=callbacks)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 882, in fit_generator
pickle_safe=pickle_safe)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1461, in fit_generator
class_weight=class_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1239, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 1040, in __call__
updated = session.run(self.outputs + [self.updates_op], feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:#batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
[[Node: Mean_3/_49 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2152_Mean_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'batchnormalization_1_running_mean/biased/read', defined at:
File "keras-v2.py", line 197, in <module>
model = create_single_model()
File "keras-v2.py", line 145, in create_single_model
model = get_custom_CNN()
File "keras-v2.py", line 111, in get_custom_CNN
model.add(BatchNormalization(axis=-1))
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 312, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 514, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 572, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 149, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/dist-packages/keras/layers/normalization.py", line 140, in call
self.updates = [K.moving_average_update(self.running_mean, mean, self.momentum),
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 329, in moving_average_update
variable, value, momentum)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/moving_averages.py", line 70, in assign_moving_average
update_delta = _zero_debias(variable, value, decay)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/moving_averages.py", line 177, in _zero_debias
trainable=False)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
validate_shape=validate_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 677, in _get_single_variable
expected_shape=shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 224, in __init__
expected_shape=expected_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 370, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1424, in identity
result = _op_def_lib.apply_op("Identity", input=input, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:@batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
[[Node: Mean_3/_49 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2152_Mean_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 433, in data_generator_task
generator_output = next(generator)
File "keras-v2.py", line 71, in batch_generator_train
image = load_and_normalize_dicom(f, conf['image_shape'][0], conf['image_shape'][1])
File "keras-v2.py", line 58, in load_and_normalize_dicom
dicom_img = cv2.resize(dicom_img, (x, y), interpolation=cv2.INTER_CUBIC)
AttributeError: 'NoneType' object has no attribute 'resize'
Try running keras.backend.get_session().run(tf.global_variables_initializer()) before calling fit. There is a reported issue about this behavior.
Try keras.backend.get_session().run(tf.local_variables_initializer()). For me the global initializer didn't work, but the local one did. This is probably no longer an issue with the latest TF/Keras versions, though.
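As a rough sketch of the two suggestions above: the snippet below uses a plain TF session as a stand-in for keras.backend.get_session(), and a hypothetical variable named running_mean_biased in place of the one from the traceback. It assumes the TF 1.x graph API (reached through tf.compat.v1 on newer installs) and only shows that a variable is reported as uninitialized until the initializers are run.

```python
import tensorflow as tf

# Assumption: TF 1.x graph/session API. Newer releases expose it under
# tf.compat.v1; older 1.x releases use the top-level module directly.
tf1 = getattr(tf.compat, "v1", tf)

with tf1.Graph().as_default():
    # Hypothetical stand-in for the variable named in the error message.
    running_mean = tf1.Variable(tf1.zeros([3]), trainable=False,
                                name="running_mean_biased")

    # Stand-in for keras.backend.get_session().
    with tf1.Session() as sess:
        # Names of variables that have not been initialized yet.
        before = sess.run(tf1.report_uninitialized_variables())

        # The two calls suggested in the answers above.
        sess.run(tf1.global_variables_initializer())
        sess.run(tf1.local_variables_initializer())

        after = sess.run(tf1.report_uninitialized_variables())

print("uninitialized before:", len(before))
print("uninitialized after:", len(after))
```

Note that the global initializer resets every global variable to its initial value, so it should be run before loading weights or training, not after.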
If your input images use the "channels_last" data format and the input_shape is Image_Height x Image_Width x Image_Channel, then try using BatchNormalization(axis=3).
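To illustrate why axis 3 is the channel axis here, a small NumPy sketch (not the Keras implementation) can mimic per-channel normalization on a "channels_last" batch. The batch shape and the epsilon value are made up for the example.

```python
import numpy as np

# Hypothetical batch in "channels_last" layout:
# (batch, Image_Height, Image_Width, Image_Channel)
batch = np.random.RandomState(0).rand(8, 32, 32, 3)

channel_axis = 3  # the same value passed as BatchNormalization(axis=3)

# Per-channel statistics are computed by reducing over every other axis.
reduce_axes = tuple(i for i in range(batch.ndim) if i != channel_axis)

mean = batch.mean(axis=reduce_axes, keepdims=True)  # one mean per channel
var = batch.var(axis=reduce_axes, keepdims=True)    # one variance per channel

# Assumed epsilon of 1e-3 for numerical stability.
normalized = (batch - mean) / np.sqrt(var + 1e-3)

print(mean.shape)
print(normalized.mean(axis=reduce_axes))
```

With a "channels_first" layout the channel axis would be 1 instead, so the axis argument has to match the data format.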