Unable to run Distributed TensorFlow using V100 GPU

I am unable to run TensorFlow on the GPU. The same code works on the CPU.
Debian version 9.8
1 GPU Nvidia Tesla V100
TensorFlow-GPU 1.12
Nvidia Driver: NVIDIA-Linux-x86_64-390.46.run
CUDA: cuda_9.0.176_384.81_linux-run
CuDNN: cudnn-9.0-linux-x64-v7.4.1.5.tgz
NCCL: nccl_2.3.7-1+cuda9.0_x86_64.txz
Update:
Tested with cuDNN 7.1.4 and got the same problem.
Patches
cuda_9.0.176.1_linux-run
cuda_9.0.176.2_linux-run
cuda_9.0.176.3_linux-run
cuda_9.0.176.4_linux-run
Error:
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:119) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1_S43}} = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-1302637405089825922, tensor_name="edge_273_adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'conv1/Conv2D', defined at:
File "mnist_distributed.py", line 237, in <module>
tf.app.run()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "mnist_distributed.py", line 196, in main
features, labels, keep_prob, global_step, train_step, accuracy, merged = create_model()
File "mnist_distributed.py", line 149, in create_model
y_conv, keep_prob = deepnn(x)
File "mnist_distributed.py", line 77, in deepnn
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "mnist_distributed.py", line 119, in conv2d
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550476352470_0004/container_1550476352470_0004_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:119) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1_S43}} = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-1302637405089825922, tensor_name="edge_273_adam_optimizer/gradients/conv2/add_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
Libraries:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda
Versions
CUDA
cat /usr/local/cuda/version.txt
CUDA Version 9.0.176
CUDA Patch Version 9.0.176.1
CUDA Patch Version 9.0.176.2
CUDA Patch Version 9.0.176.3
CUDA Patch Version 9.0.176.4
CuDNN
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
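The header output above encodes the installed cuDNN version as CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL. As a minimal sketch, the same arithmetic can be done in Python on the header text; the helper name cudnn_version_from_header is hypothetical, and the sample string is just the grep output quoted above.

```python
import re

def cudnn_version_from_header(header_text):
    """Return (major, minor, patch, combined) parsed from cudnn.h text."""
    fields = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 7".
        match = re.search(r"#define\s+CUDNN_%s\s+(\d+)" % name, header_text)
        if match is None:
            raise ValueError("CUDNN_%s not found" % name)
        fields[name] = int(match.group(1))
    # Same formula as the CUDNN_VERSION macro in the header.
    combined = (fields["MAJOR"] * 1000
                + fields["MINOR"] * 100
                + fields["PATCHLEVEL"])
    return fields["MAJOR"], fields["MINOR"], fields["PATCHLEVEL"], combined

# Sample taken from the grep output in this question.
sample = """#define CUDNN_MAJOR 7
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
print(cudnn_version_from_header(sample))  # (7, 4, 1, 7401)
```

To check a real installation, the contents of /usr/local/cuda/include/cudnn.h could be read and passed in instead of the sample string.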
Similar:
https://github.com/tensorflow/tensorflow/issues/24828
Which TensorFlow and CUDA version combinations are compatible?

Looking at the logs in detail, I found that I was getting OOM errors. Setting allow_growth in the config passed to tf.train.Server made it work:
config_proto = tf.ConfigProto(log_device_placement=True)
config_proto.gpu_options.allow_growth = True
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index, config=config_proto)
Errors:
2019-02-20 04:27:30.580666: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 836.47M (877106944 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-02-20 04:27:30.612909: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.619060: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.625466: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.630800: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.636172: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.641168: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-02-20 04:27:30.723663: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-02-20 04:27:30.726611: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node Mean_G10}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-8510199717243775654, tensor_name="edge_245_Mean", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:1/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mnist_distributed.py", line 234, in <module>
tf.app.run()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "mnist_distributed.py", line 222, in main
feed_dict={features: batch[0], labels: batch[1], keep_prob: 1.0})
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:118) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node Mean_G10}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-8510199717243775654, tensor_name="edge_245_Mean", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:1/device:CPU:0"]()]]
Caused by op 'conv1/Conv2D', defined at:
File "mnist_distributed.py", line 234, in <module>
tf.app.run()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "mnist_distributed.py", line 195, in main
features, labels, keep_prob, global_step, train_step, accuracy, merged = create_model()
File "mnist_distributed.py", line 148, in create_model
y_conv, keep_prob = deepnn(x)
File "mnist_distributed.py", line 76, in deepnn
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "mnist_distributed.py", line 118, in conv2d
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1550484758208_0014/container_1550484758208_0014_01_000004/venv/tf/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv1/Conv2D (defined at mnist_distributed.py:118) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:worker/replica:0/task:1/device:GPU:0"](adam_optimizer/gradients/conv1/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv1/Variable/read_S15)]]
[[{{node Mean_G10}} = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:1/device:CPU:0", send_device="/job:worker/replica:0/task:1/device:GPU:0", send_device_incarnation=-8510199717243775654, tensor_name="edge_245_Mean", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:1/device:CPU:0"]()]]
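The "Failed to get convolution algorithm" message is only the last symptom; the lower-level lines printed before it (out of memory, cuBLAS and cuDNN handle failures) are what pointed to the cause here. A rough sketch of filtering a worker log for those lines follows. The function name and the pattern list are assumptions built from the messages quoted in this thread, not an exhaustive list of TensorFlow errors.

```python
import re

# Patterns copied from the error lines quoted in this thread.
ROOT_CAUSE_PATTERNS = [
    r"CUDA_ERROR_OUT_OF_MEMORY",
    r"failed to create cublas handle",
    r"Could not create cudnn handle",
]

def find_root_cause_lines(log_text):
    """Return the log lines that match any known low-level GPU error pattern."""
    combined = re.compile("|".join(ROOT_CAUSE_PATTERNS))
    return [line for line in log_text.splitlines() if combined.search(line)]

# Three lines taken from the logs above, shortened to one per message type.
sample_log = "\n".join([
    "2019-02-20 04:27:30.580666: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] "
    "failed to allocate 836.47M (877106944 bytes) from device: "
    "CUDA_ERROR_OUT_OF_MEMORY: out of memory",
    "2019-02-20 04:27:30.612909: E tensorflow/stream_executor/cuda/cuda_blas.cc:464] "
    "failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED",
    "tensorflow.python.framework.errors_impl.UnknownError: "
    "Failed to get convolution algorithm.",
])

for line in find_root_cause_lines(sample_log):
    print(line)
```

With the sample above, only the first two lines are printed; the final UnknownError line is skipped because it matches none of the patterns.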

Related

cuDNN launch failure (tensorflow-gpu/CUDA)

Traceback (most recent call last):
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([202027,64,1,1])
[[Node: bn_fm_1/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NCHW", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bn_fm_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, bn_fm/gamma/read, bn_fm/beta/read, bn_fm/moving_mean/read, bn_fm/moving_variance/read)]]
[[Node: AddN/_31 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_202_AddN", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "NeuralFM.py", line 350, in <module>
model.train(data.Train_data, data.Validation_data, data.Test_data)
File "NeuralFM.py", line 266, in train
init_train = self.evaluate(Train_data)
File "NeuralFM.py", line 311, in evaluate
predictions = self.sess.run((self.out), feed_dict=feed_dict)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([202027,64,1,1])
[[Node: bn_fm_1/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NCHW", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bn_fm_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, bn_fm/gamma/read, bn_fm/beta/read, bn_fm/moving_mean/read, bn_fm/moving_variance/read)]]
[[Node: AddN/_31 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_202_AddN", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'bn_fm_1/FusedBatchNorm', defined at:
File "NeuralFM.py", line 349, in <module>
model = NeuralFM(data.features_M, args.hidden_factor, eval(args.layers), args.loss_type, args.pretrain, args.epoch, args.batch_size, args.lr, args.lamda, eval(args.keep_prob), args.optimizer, args.batch_norm, activation_function, args.verbose, args.early_stop)
File "NeuralFM.py", line 89, in __init__
self._init_graph()
File "NeuralFM.py", line 123, in _init_graph
self.FM = self.batch_norm_layer(self.FM, train_phase=self.train_phase, scope_bn='bn_fm')
File "NeuralFM.py", line 224, in batch_norm_layer
is_training=False, reuse=True, trainable=True, scope=scope_bn)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 596, in batch_norm
scope=scope)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 382, in _fused_batch_norm
is_training, _fused_batch_norm_training, _fused_batch_norm_inference)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/utils.py", line 214, in smart_cond
return static_cond(pred_value, fn1, fn2)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/utils.py", line 194, in static_cond
return fn2()
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 379, in _fused_batch_norm_inference
data_format=data_format)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 906, in fused_batch_norm
name=name)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 3465, in _fused_batch_norm
is_training=is_training, name=name)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): cuDNN launch failure : input shape ([202027,64,1,1])
[[Node: bn_fm_1/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NCHW", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bn_fm_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, bn_fm/gamma/read, bn_fm/beta/read, bn_fm/moving_mean/read, bn_fm/moving_variance/read)]]
[[Node: AddN/_31 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_202_AddN", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
I keep getting this error. I've tried everything, including downgrading CUDA, cuDNN, and tensorflow-gpu.
I'm currently on CUDA 9.0, cuDNN v7.4.2 for CUDA 9.0, and tensorflow-gpu 1.9, and nothing I do seems to help. I'm running out of ideas; I have every dependency I can think of installed.
I'm trying to run this:
https://github.com/hexiangnan/neural_factorization_machine
EDIT: I have a feeling this is connected to https://github.com/tensorflow/tensorflow/issues/8090 but as I'm a little new to all this, I'm not sure if I'm right or how to address this.

I hit the same error. In my case, the cause was that my GPU did not have enough memory for the process.
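One way to check whether memory is the problem is to look at how much of the GPU is already in use before starting the job. The sketch below assumes the nvidia-smi tool that ships with the NVIDIA driver is on the PATH; the function names are hypothetical, and the numbers passed to the parser at the end are made up purely to show the output format.

```python
import subprocess

# nvidia-smi query that prints "total, used, free" in MiB, one row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]

def parse_memory_csv(csv_text):
    """Parse 'total, used, free' rows (MiB) into a list of dicts, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        total, used, free = (int(field.strip()) for field in line.split(","))
        rows.append({"total_mib": total, "used_mib": used, "free_mib": free})
    return rows

def query_gpu_memory():
    """Run nvidia-smi and return the parsed figures; needs an NVIDIA driver."""
    output = subprocess.check_output(QUERY).decode("utf-8")
    return parse_memory_csv(output)

# Hypothetical single-GPU output, used only to demonstrate the parsing step.
print(parse_memory_csv("16160, 15900, 260\n"))
```

If free_mib is small before the script even starts, another process is probably holding the memory.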

I'm probably a few years late to be of any help, Alex, but I ran into this issue on Windows with a specific GPU. Don't ask me why, but adding
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '/gpu:0'
works for me if you have a single GPU.
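For comparison, CUDA_VISIBLE_DEVICES is normally given as a comma-separated list of device indices (for example 0 or 0,1) rather than a TensorFlow device name. A minimal sketch of setting it that way is below; the helper restrict_visible_gpus is hypothetical, and whether this matches what the answer above intended is an assumption.

```python
import os

def restrict_visible_gpus(indices):
    """Expose only the given GPU indices to CUDA.

    Call this before TensorFlow (or any other CUDA library) initializes
    the GPU, otherwise the setting has no effect on that process.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

restrict_visible_gpus([0])  # sets CUDA_VISIBLE_DEVICES=0
```

Passing an empty list sets the variable to an empty string, which hides every GPU and makes the process fall back to the CPU.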

I solved it by adding the following to the script, right after the imports (os must be among them):
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

Getting "PermissionDeniedError" when running the example program on Tensorflow

Sorry for my lack of knowledge, but I am trying to run the following TensorFlow example:
import numpy as np
import tensorflow as tf
feature_columns = [tf.feature_column.numeric_column("x", shape=[1])]
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)
x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])
x_eval = np.array([2., 5., 8., 1.])
y_eval = np.array([-1.01, -4.1, -7, 0.])
input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=None, shuffle=True)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=1000, shuffle=False)
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_eval}, y_eval, batch_size=4, num_epochs=1000, shuffle=False)
estimator.train(input_fn=input_fn, steps=1000)
train_metrics = estimator.evaluate(input_fn=train_input_fn)
eval_metrics = estimator.evaluate(input_fn=eval_input_fn)
print("train metrics: %r" % train_metrics)
print("eval metrics: %r" % eval_metrics)
I got the following error message:
PermissionDeniedError: Failed to delete a file: C:\Users\Jeff\AppData\Local\Temp\tmpgpmjek44\graph.pbtxt.tmpe31b9f4677cb426fbaef32dadeaf1a4d; Permission denied
I found that the error comes from the line estimator.train(input_fn=input_fn, steps=1000). I looked at the folder and the file, and both already grant full control. This may be a stupid question, but what could be the cause, and what is the solution? Thank you so much in advance!
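Since the message is about failing to delete a file in the temp folder, one quick check is whether the Python process itself can create and then delete a file there. This is only a sketch using the standard library; the helper name can_create_and_delete is hypothetical, and a True result would suggest looking for the cause elsewhere rather than in folder permissions.

```python
import os
import tempfile

def can_create_and_delete(directory):
    """Return True if a file can be created and then deleted inside directory."""
    try:
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)     # release the handle first; Windows refuses to delete open files
        os.remove(path)  # the operation that failed in the error message
        return True
    except OSError:
        # Covers PermissionError, FileNotFoundError, and similar failures.
        return False

temp_dir = tempfile.gettempdir()
print(temp_dir, can_create_and_delete(temp_dir))
```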
UPDATE:
I ran it from the root and got the following:
(C:\Users\Jeff\Anaconda3) C:\Users\Jeff>python test.py
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\Jeff\AppData\Local\Temp\tmp0yywjv30
2017-11-10 22:54:59.808636: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2017-11-10 22:55:00.096842: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties: name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705 pciBusID: 0000:01:00.0 totalMemory: 6.00GiB freeMemory: 4.99GiB
2017-11-10 22:55:00.096927: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
2017-11-10 22:55:02.512317: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2017-11-10 22:55:02.513461: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2017-11-10 22:55:02.513601: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2017-11-10 22:55:02.514975: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2017-11-10 22:55:02.515067: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\stream.cc:1901] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1323, in _do_call
return fn(*args)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1302, in _run_fn
status, run_metadata)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMV launch failed: m=1, n=4
[[Node: linear/linear_model/x/weighted_sum = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](linear/linear_model/x/Reshape, linear/linear_model/x/weights)]]
[[Node: linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_184_linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 39, in <module>
estimator.train(input_fn=input_fn, steps=1000)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py", line 302, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py", line 783, in _train_model
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 521, in run
run_metadata=run_metadata)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 892, in run
run_metadata=run_metadata)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 967, in run
raise six.reraise(*original_exc_info)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 952, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1024, in run
run_metadata=run_metadata)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\training\monitored_session.py", line 827, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 889, in run
run_metadata_ptr)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1317, in _do_run
options, run_metadata)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMV launch failed: m=1, n=4
[[Node: linear/linear_model/x/weighted_sum = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](linear/linear_model/x/Reshape, linear/linear_model/x/weights)]]
[[Node: linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_184_linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'linear/linear_model/x/weighted_sum', defined at:
File "test.py", line 39, in <module>
estimator.train(input_fn=input_fn, steps=1000)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py", line 302, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py", line 711, in _train_model
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\estimator.py", line 694, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\canned\linear.py", line 348, in _model_fn
config=config)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\canned\linear.py", line 118, in _linear_model_fn
logits = logit_fn(features=features)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\estimator\canned\linear.py", line 70, in linear_logit_fn
features=features, feature_columns=feature_columns, units=units)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\feature_column\feature_column.py", line 321, in linear_model
column, builder, units, weight_collections, trainable))
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\feature_column\feature_column.py", line 1376, in _create_dense_column_weighted_sum
return math_ops.matmul(tensor, weight, name='weighted_sum')
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1891, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2436, in _mat_mul
name=name)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2956, in create_op
op_def=op_def)
File "C:\Users\Jeff\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InternalError (see above for traceback): Blas GEMV launch failed: m=1, n=4
[[Node: linear/linear_model/x/weighted_sum = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](linear/linear_model/x/Reshape, linear/linear_model/x/weights)]]
[[Node: linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_184_linear/gradients/linear/linear_model/x/weighted_sum_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

It's a PermissionDeniedError. As far as I can see, you should run this script from root.
Try it and post an update.

NotFoundError when restoring tensorflow session

Here is the code:
import tensorflow as tf

def save(checkpoint_file='hello.chk'):
    with tf.Session() as session:
        x = tf.Variable(initial_value=[1, 2, 3], name="x")
        y = tf.Variable(initial_value=[[1.0, 2.0], [3.0, 4.0]], name="y")
        not_saved = tf.Variable(initial_value=[[11.0, 2.0], [3.0, 4.0]], name="not_saved")
        session.run(tf.global_variables_initializer())
        print(session.run(tf.global_variables()))
        saver = tf.train.Saver([x, y])
        saver.save(session, checkpoint_file)
        print(session.run(tf.global_variables()))
        print("saved!!!!!!!!!!")

def restore(checkpoint_file='hello.chk'):
    with tf.Session() as session:
        saver = tf.train.Saver()
        saver.restore(sess=session, save_path=checkpoint_file)
        print(session.run(tf.global_variables()))

def reset():
    tf.reset_default_graph()

save()
restore("/home/kaiyin/PycharmProjects/text-classify/hello.chk")
I am just trying to save and restore some simple variables here, nothing complicated. The saving part seems fine, but I got the following errors when restoring:
Traceback (most recent call last):
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: Key not_saved not found in checkpoint
[[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/kaiyin/PycharmProjects/text-classify/restore.py", line 25, in <module>
restore("/home/kaiyin/PycharmProjects/text-classify/hello.chk")
File "/home/kaiyin/PycharmProjects/text-classify/restore.py", line 18, in restore
saver.restore(sess=session, save_path=checkpoint_file)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1428, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key not_saved not found in checkpoint
[[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]
Caused by op 'save_1/RestoreV2', defined at:
File "/home/kaiyin/PycharmProjects/text-classify/restore.py", line 25, in <module>
restore("/home/kaiyin/PycharmProjects/text-classify/hello.chk")
File "/home/kaiyin/PycharmProjects/text-classify/restore.py", line 17, in restore
saver = tf.train.Saver()
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1040, in __init__
self.build()
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1070, in build
restore_sequentially=self._restore_sequentially)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 675, in build
restore_sequentially, reshape)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 242, in restore_op
[spec.tensor.dtype])[0])
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
dtypes=dtypes, name=name)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/kaiyin/virtualenvs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Key not_saved not found in checkpoint
[[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]
Process finished with exit code 1
Tensorflow version:
>>> print(tf.__version__)
1.0.1

Deleting the list of vars in tf.train.Saver() solves the problem. With no argument, the Saver created in restore() covers every variable in the graph, including not_saved, so the checkpoint has to contain all of them as well. Here is the working code:
import tensorflow as tf

filepath = "/home/kaiyin/PycharmProjects/text-classify/hello.chk"

def save(checkpoint_file=filepath):
    with tf.Session() as session:
        x = tf.Variable(initial_value=[1, 2, 3], name="x")
        y = tf.Variable(initial_value=[[1.0, 2.0], [3.0, 4.0]], name="y")
        not_saved = tf.Variable(initial_value=[[11.0, 2.0], [3.0, 4.0]], name="not_saved")
        session.run(tf.global_variables_initializer())
        print(session.run(tf.global_variables()))
        # No var_list: every variable in the graph is written to the checkpoint.
        saver = tf.train.Saver()
        saver.save(session, checkpoint_file)
        print(session.run(tf.global_variables()))
        print("saved!!!!!!!!!!")

def restore(checkpoint_file='hello.chk'):
    with tf.Session() as session:
        saver = tf.train.Saver()
        saver.restore(sess=session, save_path=checkpoint_file)
        print(session.run(tf.global_variables()[0]))
        # x is local to save(), so look the variable up by name instead.
        print(session.run(tf.get_default_graph().get_tensor_by_name("x:0")))

def reset():
    tf.reset_default_graph()

save()
restore(filepath)
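
To make the mismatch easier to see, here is a plain-Python toy model of the name lookup. It is only a sketch and does not use TensorFlow: toy_restore, graph_variables and the two checkpoint dicts are made-up names that stand in for the Saver, the graph and the checkpoint file. The assumption is that restoring asks the checkpoint for every variable name the Saver was built with.

```python
def toy_restore(requested_names, checkpoint):
    """Toy stand-in for a restore: fetch each requested name from the checkpoint."""
    missing = [name for name in requested_names if name not in checkpoint]
    if missing:
        raise KeyError("Key %s not found in checkpoint" % missing[0])
    return {name: checkpoint[name] for name in requested_names}


# The three variables that exist in the graph (values copied from the question).
graph_variables = {
    "x": [1, 2, 3],
    "y": [[1.0, 2.0], [3.0, 4.0]],
    "not_saved": [[11.0, 2.0], [3.0, 4.0]],
}

# Saving with an explicit list [x, y] writes only those two names.
partial_checkpoint = {name: graph_variables[name] for name in ("x", "y")}
# Saving without a list writes every variable.
full_checkpoint = dict(graph_variables)

# Restoring without a list requests every variable in the graph.
try:
    toy_restore(list(graph_variables), partial_checkpoint)
    error_message = None
except KeyError as err:
    error_message = err.args[0]

print(error_message)

# Two ways out: save everything, or request only what was saved.
restored_all = toy_restore(list(graph_variables), full_checkpoint)
restored_xy = toy_restore(["x", "y"], partial_checkpoint)
```

In the real script the second way out would correspond to giving the Saver in restore() the same variable list that was used when saving, instead of changing what gets saved.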

Keras BatchNormalization uninitialized value

I am trying to add batch norm to a vgg style model in Keras. When I add the batch norm layers I get the error:
FailedPreconditionError: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
Without the batch norm layers the script runs without errors; it only throws the error when I add the BatchNormalization layers.
model = Sequential()
model.add(ZeroPadding2D((1, 1), input_shape=(1, conf['image_shape'][0], conf['image_shape'][1]), dim_ordering=conf['dim_ordering']))
model.add(Convolution2D(conf['level_1_filters'], 3, 3, dim_ordering=conf['dim_ordering']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(ZeroPadding2D((1, 1), dim_ordering=conf['dim_ordering']))
model.add(Convolution2D(conf['level_1_filters'], 3, 3, dim_ordering=conf['dim_ordering']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2), dim_ordering=conf['dim_ordering']))
model.add(ZeroPadding2D((1, 1), dim_ordering=conf['dim_ordering']))
model.add(Convolution2D(conf['level_2_filters'], 3, 3, dim_ordering=conf['dim_ordering']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(ZeroPadding2D((1, 1), dim_ordering=conf['dim_ordering']))
model.add(Convolution2D(conf['level_2_filters'], 3, 3, dim_ordering=conf['dim_ordering']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2), dim_ordering=conf['dim_ordering']))
model.add(Flatten())
model.add(Dense(conf['dense_layer_size']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(conf['dropout_value']))
model.add(Dense(conf['dense_layer_size']))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(conf['dropout_value']))
model.add(Dense(2, activation='softmax'))
# sgd = SGD(lr=conf['learning_rate'], decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Is this the correct syntax to use batch norm in Keras? I followed the example in this thread.
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Train patients: 699
Valid patients: 698
Create and compile model...
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:03:00.0
Total memory: 7.92GiB
Free memory: 7.07GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0)
Number of train files: 123111
Number of valid files: 125469
Fit model...
Samples train: 5000, Samples valid: 5000
Epoch 1/40
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:@batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:@batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:@batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:@batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:@batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
W tensorflow/core/framework/op_kernel.cc:975] Failed precondition: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:@batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
Traceback (most recent call last):
File "keras-v2.py", line 197, in <module>
model = create_single_model()
File "keras-v2.py", line 173, in create_single_model
callbacks=callbacks)
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 882, in fit_generator
pickle_safe=pickle_safe)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1461, in fit_generator
class_weight=class_weight)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1239, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 1040, in __call__
updated = session.run(self.outputs + [self.updates_op], feed_dict=feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:@batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
[[Node: Mean_3/_49 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2152_Mean_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'batchnormalization_1_running_mean/biased/read', defined at:
File "keras-v2.py", line 197, in <module>
model = create_single_model()
File "keras-v2.py", line 145, in create_single_model
model = get_custom_CNN()
File "keras-v2.py", line 111, in get_custom_CNN
model.add(BatchNormalization(axis=-1))
File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 312, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 514, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 572, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 149, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/dist-packages/keras/layers/normalization.py", line 140, in call
self.updates = [K.moving_average_update(self.running_mean, mean, self.momentum),
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 329, in moving_average_update
variable, value, momentum)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/moving_averages.py", line 70, in assign_moving_average
update_delta = _zero_debias(variable, value, decay)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/moving_averages.py", line 177, in _zero_debias
trainable=False)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
validate_shape=validate_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 677, in _get_single_variable
expected_shape=shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 224, in __init__
expected_shape=expected_shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 370, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1424, in identity
result = _op_def_lib.apply_op("Identity", input=input, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value batchnormalization_1_running_mean/biased
[[Node: batchnormalization_1_running_mean/biased/read = Identity[T=DT_FLOAT, _class=["loc:@batchnormalization_1_running_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](batchnormalization_1_running_mean/biased)]]
[[Node: Mean_3/_49 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2152_Mean_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 433, in data_generator_task
generator_output = next(generator)
File "keras-v2.py", line 71, in batch_generator_train
image = load_and_normalize_dicom(f, conf['image_shape'][0], conf['image_shape'][1])
File "keras-v2.py", line 58, in load_and_normalize_dicom
dicom_img = cv2.resize(dicom_img, (x, y), interpolation=cv2.INTER_CUBIC)
AttributeError: 'NoneType' object has no attribute 'resize'

Try keras.backend.get_session().run(tf.global_variables_initializer()) before fit. There is an issue here

Try keras.backend.get_session().run(tf.local_variables_initializer()). For me, the global initializer didn't work, but the local one did, although this is probably not an issue with the latest TF/Keras versions.

If your input image's data format is "channels_last" and the input_shape is Image_Height x Image_Width x Image_Channel, then try using BatchNormalization(axis=3).
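
As a rough illustration of where that 3 comes from, the small helper below returns the index of the feature axis for a batched tensor. It is a sketch in plain Python, not part of Keras: batchnorm_axis is a made-up name, and it assumes the usual layouts (batch, channels, height, width) for "channels_first" and (batch, height, width, channels) for "channels_last".

```python
def batchnorm_axis(data_format, rank=4):
    """Return the index of the feature (channel) axis of a batched tensor.

    rank counts the batch dimension: 4 for image feature maps, 2 for the
    output of a dense layer.
    """
    if data_format not in ("channels_first", "channels_last"):
        raise ValueError("unknown data_format: %r" % (data_format,))
    if rank < 2:
        raise ValueError("rank must be at least 2 (batch axis plus features)")
    # channels_first: (batch, channels, ...) -> the features sit on axis 1.
    # channels_last:  (batch, ..., channels) -> the features sit on the last axis.
    return 1 if data_format == "channels_first" else rank - 1


print(batchnorm_axis("channels_last"))          # image feature maps
print(batchnorm_axis("channels_first"))         # image feature maps
print(batchnorm_axis("channels_last", rank=2))  # output of a Dense layer
```

One thing to watch: the model in the question also puts BatchNormalization after Dense layers, whose output has only two axes, so axis=3 cannot apply there. The feature axis for those layers is 1.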

Tensorflow Out of Memory when saving?

Hi, I'm running the Linux CPU version of TensorFlow on Ubuntu 14.04, and I run out of memory when I try to save my model. I'm using the Deep MNIST tutorial, which builds a convolutional network. You can find it here:
https://www.tensorflow.org/versions/r0.9/tutorials/mnist/pros/index.html#deep-mnist-for-experts
I changed a couple of things and added a Saver to export the model weights. However, when I run it I get an error saying I am out of memory. That doesn't make sense to me: the script can train on the data indefinitely, yet saving it somehow uses too much memory?
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
step 0, training accuracy 0.06
W tensorflow/core/framework/op_kernel.cc:909] Resource exhausted: OOM when allocating tensor with shape[10000,28,28,32]
Traceback (most recent call last):
File "/home/mgump/Lambda_Project/MNIST_TRAINER.py", line 66, in <module>
x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 555, in eval
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3498, in _eval_using_default_session
return session.run(tensors, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 372, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 636, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 708, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 728, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[10000,28,28,32]
[[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Reshape, Variable/read)]]
Caused by op u'Conv2D', defined at:
File "/home/mgump/Lambda_Project/MNIST_TRAINER.py", line 28, in <module>
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
File "/home/mgump/Lambda_Project/MNIST_TRAINER.py", line 18, in conv2d
return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
data_format=data_format, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 704, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2260, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1230, in __init__
self._traceback = _extract_stack()
This is what it outputs when I run it. Thanks so much!
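
For what it's worth, the traceback above seems to point at the evaluation on mnist.test.images (line 66 of the script) rather than at the Saver. The failing allocation has shape [10000,28,28,32], which would be the first convolution's output for all 10000 test images at once. The sketch below works out the size of that one tensor, assuming 4-byte floats (DT_FLOAT), and shows one possible way to evaluate in smaller batches. tensor_bytes, batched_accuracy and accuracy_fn are hypothetical names; accuracy_fn stands for whatever call computes accuracy on one batch in the script.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape (4 bytes = float32)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def batched_accuracy(images, labels, accuracy_fn, batch_size=500):
    """Average accuracy over a dataset, evaluated batch_size examples at a time.

    accuracy_fn(batch_images, batch_labels) must return the accuracy on that
    batch as a float between 0 and 1.
    """
    total = len(images)
    if total == 0:
        raise ValueError("cannot evaluate an empty dataset")
    correct = 0.0
    for start in range(0, total, batch_size):
        batch_images = images[start:start + batch_size]
        batch_labels = labels[start:start + batch_size]
        # Weight each batch by its size so a short final batch counts correctly.
        correct += accuracy_fn(batch_images, batch_labels) * len(batch_images)
    return correct / total


size = tensor_bytes((10000, 28, 28, 32))
print(size, "bytes, about %.2f GiB" % (size / 2.0 ** 30))  # 1003520000 bytes, about 0.93 GiB
```

The later layers need their own activations on top of this, so feeding the test set in batches of a few hundred images should keep the peak well below what a single 10000-image pass needs.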
