Inception retraining issue "Nan in summary histogram for: HistogramSummary" - python

I'm trying to retrain Inception v3 on my Raspberry Pi 3, and I'm getting this histogram error message:
python /home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py --bottleneck_dir=/home/pi/Documents/Machine\ Learning/Inception/tf_files/bottlenecks --how_many_training_steps 500 --model_dir=/home/pi/Documents/Machine\ Learning/Inception/tf_files/inception --output_graph=/home/pi/Documents/Machine\ Learning/Inception/tf_files/retrained_graph.pb --output_labels=/home/pi/Documents/Machine\ Learning/Inception/tf_files/retrained_labels.txt --image_dir /home/pi/Documents/Machine\ Learning/Inception/Retraining_Images
Looking for images in 'Granny Smith Apple'
Looking for images in 'Red Delicious'
100 bottleneck files created.
200 bottleneck files created.
2017-01-07 11:30:22.180768: Step 0: Train accuracy = 56.0%
2017-01-07 11:30:22.242166: Step 0: Cross entropy = nan
2017-01-07 11:30:22.850969: Step 0: Validation accuracy = 50.0%
Traceback (most recent call last):
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 938, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 887, in main
ground_truth_input: train_ground_truth})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, final_result)]]
Caused by op u'HistogramSummary', defined at:
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 938, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 846, in main
bottleneck_tensor)
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 764, in add_final_training_ops
tf.histogram_summary(final_tensor_name + '/activations', final_tensor)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py", line 100, in histogram_summary
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 100, in _histogram_summary
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, final_result)]]
I tried changing merged = tf.merge_all_summaries() in retrain.py, as suggested elsewhere, but it didn't work.
Also, the first time I tried to retrain, I got different results for step 0 before hitting an error:
2017-01-07 11:13:36.548913: Step 0: Train accuracy = 89.0%
2017-01-07 11:13:36.555770: Step 0: Cross entropy = 0.590778
2017-01-07 11:13:37.052190: Step 0: Validation accuracy = 76.0%

It sounds like it would help to know where the NaN values are coming from. For that, take a look at the TensorFlow debugger (tfdbg):
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/debugger/index.md
In your retrain.py, you can make a change like
from tensorflow.python import debug as tf_debug
# ...
# In def main(_)
if debug:
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
# ...
Then, when sess.run() is called for training and evaluation, you will drop into the debugger's command-line interface. At the tfdbg> prompt, you can enter a command that lets the code run until any NaNs or infinities appear in the TensorFlow graph:
tfdbg> run -f has_inf_or_nan
When the tensor filter has_inf_or_nan is hit, the interface will show a list of tensors containing Infs or NaNs, sorted in time order. The one at the top should be the "culprit", i.e., the one that first generated the bad numerical values. If its name is, say, node_1, you can use the following tfdbg commands to look at its inputs and node attributes:
tfdbg> li -r node_1
tfdbg> ni -a node_1
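To make the idea of the filter concrete, here is a minimal plain-Python sketch of the kind of check a "has inf or nan" filter performs, and of how the earliest offending tensor is picked out. It is not the tfdbg implementation: the helper names, node names, and values below are hypothetical and only loosely mirror the log above, where the cross entropy is nan.

```python
import math


def has_inf_or_nan(values):
    """Return True if any float in a (possibly nested) list is NaN or infinite."""
    stack = [values]
    while stack:
        item = stack.pop()
        if isinstance(item, (list, tuple)):
            stack.extend(item)
        elif isinstance(item, float) and (math.isnan(item) or math.isinf(item)):
            return True
    return False


def first_bad_tensor(dumps):
    """dumps is a list of (node_name, values) pairs in execution order."""
    for name, values in dumps:
        if has_inf_or_nan(values):
            return name
    return None


# Hypothetical tensor dumps, listed in the order the nodes were executed.
dumps = [
    ("input/bottleneck", [[0.1, 0.2], [0.3, 0.4]]),
    ("final_training_ops/logits", [[1.0, float("inf")]]),
    ("cross_entropy", [float("nan")]),
]
print(first_bad_tensor(dumps))
```

The point of scanning in execution order is that later NaNs are usually just consequences of the first bad value, so the first hit is the one worth inspecting.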

If you're using tf.contrib.learn you'll want to use the following:
debug_hook = tf_debug.LocalCLIDebugHook()
debug_hook.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
hooks = [debug_hook]
...
classifier.fit(..., monitors=hooks)

Related

Tensorflow: PartialTensorShape: Incompatible ranks during merge: 2 vs. 1

I'm using Keras with TensorFlow 2.2 as the backend, and it raises this error:
Traceback (most recent call last):
File "run.py", line 97, in <module>
task_entry_function()
File "/data-crystina/src/capreolus-unpublished/capreolus/task/rerank.py", line 47, in train
return self.rerank_run(best_search_run, self.get_results_path())
File "/data-crystina/src/capreolus-unpublished/capreolus/task/rerank.py", line 85, in rerank_run
self.benchmark.relevance_level,
File "/data-crystina/src/capreolus-unpublished/capreolus/trainer/__init__.py", line 578, in train
use_multiprocessing=True,
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
return method(self, *args, **kwargs)
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 855, in fit
callbacks.on_train_batch_end(step, logs)
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 389, in on_train_batch_end
logs = self._process_logs(logs)
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 265, in _process_logs
return tf_utils.to_numpy_or_python_type(logs)
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py", line 523, in to_numpy_or_python_type
return nest.map_structure(_to_single_numpy_or_python_type, tensors)
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 617, in map_structure
structure[0], [func(*x) for x in entries],
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 617, in <listcomp>
structure[0], [func(*x) for x in entries],
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py", line 519, in _to_single_numpy_or_python_type
x = t.numpy()
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 961, in numpy
maybe_arr = self._numpy() # pylint: disable=protected-access
File "/data-crystina/anaconda3/envs/maxp/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 929, in _numpy
six.raise_from(core._status_to_exception(e.code, e.message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_train_function_100056}} PartialTensorShape: Incompatible ranks during merge: 2 vs. 1
[[{{node map_6/TensorArrayV2Stack/TensorListStack}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
2020-07-03 07:19:03.088112: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: {{function_node __inference_train_function_100056}} PartialTensorShape: Incompatible ranks during merge: 2 vs. 1
[[{{node map_6/TensorArrayV2Stack/TensorListStack}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
Apologies for not having a small snippet that reproduces this. I did look inside ..python3.7/site-packages/tensorflow/python/keras/callbacks.py, at this function:
def on_train_batch_end(self, batch, logs=None):
    """Calls the `on_train_batch_end` methods of its callbacks.

    Arguments:
        batch: integer, index of batch within the current epoch.
        logs: dict. Metric results for this batch.
    """
    if self._should_call_train_batch_hooks:
        # print("<<<<", logs.keys())
        # print(">>>", type(list(logs.values())[0]))
        logs = self._process_logs(logs)
        self._call_batch_hook(ModeKeys.TRAIN, 'end', batch, logs=logs)
I printed out logs and found that it is a dictionary containing only one key, loss, whose value has type <class 'tensorflow.python.framework.ops.EagerTensor'>. However, logs["loss"] cannot be printed directly because doing so raises the same error, and the same goes for logs["loss"].shape. I could not find any similar case on the internet. Has anyone run into this?
Problem solved: it was entirely because I was parsing the TFRecord with the wrong shape/data type in a callback passed to TensorFlow.

InvalidArgumentError: Input to reshape is a tensor with 0 values, but the requested shape has 54912

This is very much a beginner question; I hope that's fine.
I'm trying to train this model from GitHub with the MAPS dataset, and I created new .tfrecords for the train set with the code below. It is based on the code here, but I changed a few things to accommodate an additional input (another MIDI file that I'm calling the "tempo MIDI").
def create_train_set(tempopath, train_list, outdir, min_length, max_length):
    # train_list = list of wav paths selected for
    train_file_pairs = []
    # find matching midi files
    for wav_path in train_list:
        midi_file = ''
        tempo_midi_file = ''
        if os.path.isfile(wav_path + '.mid'):
            midi_file = wav_path + '.mid'
        if os.path.isfile(wav_path + '.midi'):
            midi_file = wav_path + '.midi'
        if os.path.isfile(tempopath + os.path.basename(wav_path) + '_tempo.mid'):
            tempo_midi_file = tempopath + os.path.basename(wav_path) + '_tempo.mid'
        if os.path.isfile(tempopath + os.path.basename(wav_path) + '_tempo.midi'):
            tempo_midi_file = tempopath + os.path.basename(wav_path) + '_tempo.midi'
        wav_file = wav_path + '.wav'
        train_file_pairs.append((wav_file, midi_file, tempo_midi_file))
    train_output_name = os.path.join(outdir, 'train.tfrecord')
    with tf.python_io.TFRecordWriter(train_output_name) as writer:
        for idx, pair in enumerate(train_file_pairs):
            print('{} of {}: {}'.format(idx, len(train_file_pairs), pair[0]))
            # load the wav data
            wav_data = tf.gfile.Open(pair[0], 'rb').read()
            # load the midi data and convert to a notesequence
            ns = midi_io.midi_file_to_note_sequence(pair[1])
            tempo = midi_io.midi_file_to_note_sequence(pair[2])
            # aldu = audio_label_data_utils.py
            for example in aldu.process_record(
                    wav_data, ns, tempo, pair[0], min_length, max_length,
                    sample_rate):
                writer.write(example.SerializeToString())
with the tf.Example as follows:
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            'id':
                tf.train.Feature(
                    bytes_list=tf.train.BytesList(
                        value=[example_id.encode('utf-8')])),
            'sequence':
                tf.train.Feature(
                    bytes_list=tf.train.BytesList(
                        value=[ns.SerializeToString()])),
            'audio':
                tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[wav_data])),
            'tempo':
                tf.train.Feature(
                    bytes_list=tf.train.BytesList(
                        value=[velocity_range.SerializeToString()])),
            'velocity_range':
                tf.train.Feature(
                    bytes_list=tf.train.BytesList(
                        value=[velocity_range.SerializeToString()])),
        }))
However, when I try to train the model, I get this error message (I marked the py scripts with a print line so I know where everything's going):
Running wav_to_spec from data.py
Running _wav_to_mel in data.py
Running wav_to_num_frames from data.py
Running wav_to_spec from data.py
Running _wav_to_mel in data.py
Running wav_to_num_frames from data.py
E0611 07:56:55.419340 8436 error_handling.py:70] Error recorded from training_loop: Input to reshape is a tensor with 0 values, but the requested shape has 54912
[[{{node Reshape_8}}]]
[[IteratorGetNext]]
I0611 07:56:55.420338 8436 error_handling.py:96] training_loop marked as finished
W0611 07:56:55.421335 8436 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
return fn(*args)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 0 values, but the requested shape has 54912
[[{{node Reshape_8}}]]
[[IteratorGetNext]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "onsets_frames_transcription_train.py", line 128, in <module>
console_entry_point()
File "onsets_frames_transcription_train.py", line 124, in console_entry_point
tf.app.run(main)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\absl\app.py", line 300, in run
_run_main(main, args)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\absl\app.py", line 251, in _run_main
sys.exit(main(argv))
File "onsets_frames_transcription_train.py", line 120, in main
additional_trial_info=additional_trial_info)
File "onsets_frames_transcription_train.py", line 95, in run
num_steps=FLAGS.num_steps)
File "C:\Users\User\magenta\magenta\models\onsets_frames_transcription\train_util.py", line 134, in train
estimator.train(input_fn=transcription_data, max_steps=num_steps)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_estimator\python\estimator\tpu\tpu_estimator.py", line 2876, in train
rendezvous.raise_errors()
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_estimator\python\estimator\tpu\error_handling.py", line 131, in raise_errors
six.reraise(typ, value, traceback)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_estimator\python\estimator\tpu\tpu_estimator.py", line 2871, in train
saving_listeners=saving_listeners)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1192, in _train_model_default
saving_listeners)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1484, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1353, in run
raise six.reraise(*original_exc_info)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1411, in run
run_metadata=run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1169, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
run_metadata)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 0 values, but the requested shape has 54912
[[{{node Reshape_8}}]]
[[IteratorGetNext]]
From that, I figured the problem lies in wav_to_num_frames but this is the only code for it.
def wav_to_num_frames(wav_audio, frames_per_second):
    """Transforms a wav-encoded audio string into number of frames."""
    print("Running wav_to_num_frames from data")
    w = wave.open(six.BytesIO(wav_audio))
    return np.int32(w.getnframes() / w.getframerate() * frames_per_second)
I didn't get this problem back when I tried training the model with tfrecords created with the original code, so I don't know what's wrong.
It turns out that the problem wasn't the created .tfrecords themselves but rather the size of the tensors I assigned for the newly added data. There isn't a more concrete answer than that, since it's very specific to this situation.
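For readers hitting the same message, the check behind it is simple: a reshape only succeeds when the number of values in the input equals the product of the requested dimensions. The sketch below reproduces that check in plain Python. The function name is hypothetical, and the shape (624, 88) is only an assumed factorization of 54912 chosen for illustration; the actual shape used by the model is not stated in the post.

```python
from functools import reduce
from operator import mul


def check_reshape(num_values, requested_shape):
    """Raise if num_values cannot fill requested_shape exactly."""
    requested = reduce(mul, requested_shape, 1)
    if num_values != requested:
        raise ValueError(
            "Input to reshape is a tensor with {} values, "
            "but the requested shape has {}".format(num_values, requested))
    return tuple(requested_shape)


# An empty parsed feature (0 values) cannot fill the assumed 624 x 88 target.
try:
    check_reshape(0, (624, 88))
except ValueError as err:
    print(err)

# A correctly sized input passes the check.
print(check_reshape(54912, (624, 88)))
```

A tensor with 0 values typically means the feature being reshaped came out of parsing empty, so comparing the declared sizes of every newly added feature against what was actually written is a reasonable first step.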

How to apply Non max suppression on batch of images in tensorflow 1.14?

I have a batch of images cropped from an original image, and I need to run object detection on each of them, so I am trying to apply TensorFlow's NMS operation.
I looked through the TensorFlow API docs and found tf.image.combined_non_max_suppression(), but I don't understand how to use it properly.
My pipeline has two steps:
1. I take an image and run object detection to get the desired regions of interest (ROIs).
2. I run object detection again on each of these ROIs, passing them in as a batch.
For the first step I use a plain tf.image.non_max_suppression() followed by tf.gather(), but I can't work out how to do the same for the second step.
Please refer to the code snippet below:
with tf.Session(graph = self.detection_graph) as sess:
    # input image tensor
    image_tensor1 = self.detection_graph.get_tensor_by_name('import/image_tensor:0')
    # boxes, scores and classes for first step
    boxesop1 = self.detection_graph.get_tensor_by_name('import/detection_boxes:0')
    scoresop1 = self.detection_graph.get_tensor_by_name('import/detection_scores:0')
    classesop1 = self.detection_graph.get_tensor_by_name('import/detection_classes:0')
    # getting first values, since we are predicting on single image
    boxesop1 = boxesop1[0]
    classesop1 = classesop1[0]
    scoresop1 = scoresop1[0]
    # applying NMS for the first step
    selected_indices1 = tf.image.non_max_suppression(
        boxesop1, scoresop1, 20, iou_threshold = 0.5
    )
    boxesop1 = tf.gather(boxesop1, selected_indices1)
    classesop1 = tf.gather(classesop1, selected_indices1)
    scoresop1 = tf.gather(scoresop1, selected_indices1)
    # boxes, scores and classes for second step
    boxesop2 = self.detection_graph.get_tensor_by_name('import_1/detection_boxes:0')
    scoresop2 = self.detection_graph.get_tensor_by_name('import_1/detection_scores:0')
    classesop2 = self.detection_graph.get_tensor_by_name('import_1/detection_classes:0')
    # applying NMS for the second step
    boxesop2, scoresop2, classesop2, valid_detections = tf.image.combined_non_max_suppression(
        boxesop2, scoresop2, max_output_size_per_class = 10, max_total_size = 30,
        iou_threshold = 0.5
    )
    # predicting for each images
    for imgPath, imgID in img_files:
        # reading image data
        img = cv2.imread(imgPath)
        imageHeight, imageWidth = img.shape[:2]
        # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
        image_np_expanded = np.expand_dims(img, axis=0)
        # Run inference
        (boxes1, scores1, classes1, boxes2, scores2, classes2) = sess.run(
            [boxesop1, scoresop1, classesop1, boxesop2, scoresop2, classesop2],
            feed_dict={image_tensor1: image_np_expanded}
        )
But I got the following error when I tried running the code above:
Traceback (most recent call last):
File "../env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "../env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "../env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: boxes must be 4-D[20,300,4]
[[{{node combined_non_max_suppression/CombinedNonMaxSuppression}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/prediction.py", line 159, in predict
feed_dict={image_tensor1: image_np_expanded}
File "../env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "../env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "../env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "../env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: boxes must be 4-D[20,300,4]
[[node combined_non_max_suppression/CombinedNonMaxSuppression (defined at /home/prediction.py:130) ]]
Errors may have originated from an input operation.
Input Source operations connected to node combined_non_max_suppression/CombinedNonMaxSuppression:
import_1/detection_boxes (defined at /home/prediction.py:94)
Original stack trace for 'combined_non_max_suppression/CombinedNonMaxSuppression':
File "/home/prediction.py", line 130, in predict
iou_threshold = 0.5
File "../env/lib/python3.5/site-packages/tensorflow/python/ops/image_ops_impl.py", line 3707, in combined_non_max_suppression
score_threshold, pad_per_class, clip_boxes)
File "../env/lib/python3.5/site-packages/tensorflow/python/ops/gen_image_ops.py", line 431, in combined_non_max_suppression
clip_boxes=clip_boxes, name=name)
File "../env/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "../env/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "../env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "../env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
How can I solve this and apply NMS to a batch of images in TensorFlow?
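The error message indicates that the op received a 3-D boxes tensor of shape [20, 300, 4] where it requires a 4-D one. As far as the API documentation describes it, tf.image.combined_non_max_suppression expects boxes shaped [batch_size, num_boxes, q, 4] and scores shaped [batch_size, num_boxes, num_classes]. To show what "NMS over a batch" computes, independent of any TensorFlow version, here is a plain-Python sketch that runs greedy, class-agnostic NMS on each image of a batch separately. It is an illustration under those assumptions, not the TensorFlow implementation; all function names and sample boxes are made up.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, max_output_size, iou_threshold=0.5):
    """Greedy NMS for one image; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if len(keep) == max_output_size:
            break
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def batched_nms(batch_boxes, batch_scores, max_output_size, iou_threshold=0.5):
    """Apply NMS independently to every image in the batch."""
    return [nms(b, s, max_output_size, iou_threshold)
            for b, s in zip(batch_boxes, batch_scores)]


# Two hypothetical crops, each with its own candidate boxes and scores.
batch_boxes = [
    [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 0.9, 0.9), (2.0, 2.0, 3.0, 3.0)],
    [(0.0, 0.0, 1.0, 1.0), (0.5, 0.5, 1.5, 1.5)],
]
batch_scores = [[0.9, 0.8, 0.7], [0.6, 0.95]]
print(batched_nms(batch_boxes, batch_scores, max_output_size=20))
```

In the first crop the second box overlaps the first heavily and is suppressed; in the second crop the two boxes overlap only slightly, so both are kept, in descending score order.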

Tensorflow Attempting to use uninitialized value AUC/AUC/auc/false_positives

I'm training a CNN for image classification. Because my dataset is small, I'm using transfer learning. Basically, I'm using the pre-trained network Google provides in its retrain example (https://www.tensorflow.org/tutorials/image_retraining).
The model works well and gives very good accuracy. But my dataset is highly imbalanced, which means accuracy is not the best metric for judging the model's performance.
Looking into different solutions, some people suggested changing the sampling method and others the performance metric. I'm choosing to go with the latter.
TensorFlow provides a good variety of metrics, including AUC, precision, recall, etc.
Here is the code of the retraining model:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py
I'm adding the following to add_evaluation_step(result_tensor, ground_truth_tensor) function:
with tf.name_scope('AUC'):
    with tf.name_scope('prediction'):
        prediction = tf.argmax(result_tensor, 1)
    with tf.name_scope('AUC'):
        auc_value, update_op = tf.metrics.auc(tf.argmax(ground_truth_tensor, 1), prediction, curve='ROC')
tf.summary.scalar('accuracy', evaluation_step)
tf.summary.scalar('AUC', auc_value)
But I'm getting this error:
Traceback (most recent call last):
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 1135, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 911, in main
ground_truth_input: train_ground_truth})
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value AUC/AUC/auc/false_positives
[[Node: AUC/AUC/auc/false_positives/read = Identity[T=DT_FLOAT, _class=["loc:@AUC/AUC/auc/false_positives"], _device="/job:localhost/replica:0/task:0/cpu:0"](AUC/AUC/auc/false_positives)]]
Caused by op u'AUC/AUC/auc/false_positives/read', defined at:
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 1135, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 874, in main
final_tensor, ground_truth_input)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 806, in add_evaluation_step
auc_value, update_op = tf.metrics.auc(tf.argmax(ground_truth_tensor, 1), prediction, curve='ROC')
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/metrics_impl.py", line 555, in auc
labels, predictions, thresholds, weights)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/metrics_impl.py", line 473, in _confusion_matrix_at_thresholds
false_p = _create_local('false_positives', shape=[num_thresholds])
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/metrics_impl.py", line 177, in _create_local
validate_shape=validate_shape)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/variables.py", line 226, in __init__
expected_shape=expected_shape)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/variables.py", line 344, in _init_from_args
self._snapshot = array_ops.identity(self._variable, name="read")
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/ops/gen_array_ops.py", line 1490, in identity
result = _op_def_lib.apply_op("Identity", input=input, name=name)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/framework/ops.py", line 2402, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/user_2/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value AUC/AUC/auc/false_positives
[[Node: AUC/AUC/auc/false_positives/read = Identity[T=DT_FLOAT, _class=["loc:@AUC/AUC/auc/false_positives"], _device="/job:localhost/replica:0/task:0/cpu:0"](AUC/AUC/auc/false_positives)]]
But I don't understand why this happens, because in main I have this:
init = tf.global_variables_initializer()
sess.run(init)
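Try this. tf.metrics.auc creates local variables (such as false_positives), which tf.global_variables_initializer() does not initialize, so initialize the local variables as well: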
init = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
sess.run(init)

Training with TFRecords gets gradually slower

I am trying to use a TFRecord file to train a network in TensorFlow. The problem is that training starts out fine, but after some time it becomes really slow. GPU utilization even drops to 0% for stretches of time.
I have measured the time between iterations, and it is clearly increasing.
I have read that this might be caused by adding operations to the graph inside the training loop, and that it can be diagnosed by using graph.finalize().
My code is like this:
self.inputMR_,self.CT_GT_ = read_and_decode_single_example("data.tfrecords")
self.inputMR, self.CT_GT = tf.train.shuffle_batch([self.inputMR_, self.CT_GT_], batch_size=self.batch_size, num_threads=2,
capacity=500*self.batch_size,min_after_dequeue=2000)
batch_size_tf = tf.shape(self.inputMR)[0] #variable batchsize so we can test here
self.train_phase = tf.placeholder(tf.bool, name='phase_train')
self.G = self.Network(self.inputMR,batch_size_tf)# create the network
self.g_loss=lp_loss(self.G, self.CT_GT, self.l_num, batch_size_tf)
print 'learning rate ',self.learning_rate
self.g_optim = tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.g_loss)
self.saver = tf.train.Saver()
Then I have a training stage that looks like this:
def train(self, config):
    init=tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init)
        coord = tf.train.Coordinator()
        threads=tf.train.start_queue_runners(sess=sess, coord=coord)
        sess.graph.finalize()# **WHERE SHOULD I PUT THIS?**
        try:
            while not coord.should_stop():
                _,loss_eval = sess.run([self.g_optim, self.g_loss],feed_dict={self.train_phase: True})
                .....
        except:
            e = sys.exc_info()[0]
            print "Exception !!!", e
        finally:
            coord.request_stop()
        coord.join(threads)
        sess.close()
When I add graph.finalize(), there is an exception that says: type 'exceptions.RuntimeError'
Could anyone explain the correct way to use a TFRecord file during training, and how to use graph.finalize() without interfering with the QueueRunner execution?
The full error is:
File "main.py", line 37, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "main.py", line 35, in main
gen_model.train(FLAGS)
File "/home/dongnie/Desktop/gan/TF_record_MR_CT/model.py", line 143, in train
self.global_step.assign(it).eval() # set and update(eval) global_step with index, i
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 505, in assign
return state_ops.assign(self._variable, value, use_locking=use_locking)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
use_locking=use_locking, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 490, in apply_op
preferred_dtype=default_dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 657, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 180, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 167, in constant
attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2337, in create_op
self._check_not_finalized()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2078, in _check_not_finalized
raise RuntimeError("Graph is finalized and cannot be modified.")
RuntimeError: Graph is finalized and cannot be modified.
The problem is that you are modifying the graph between session.run calls. You can pinpoint where this happens by calling finalize() on the default graph, which raises an error on any later graph modification. In your case the traceback shows that the modification comes from calling global_step.assign(it), which creates an additional assign op on every iteration. Instead, create the assign op once before the loop (for example, with the step value fed through a placeholder), save the result in a variable, and reuse it.
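The difference between the two patterns can be shown with a toy model in plain Python. This is not TensorFlow: ToyGraph and both train functions are hypothetical stand-ins, written only to illustrate why creating an op inside the loop makes the graph grow, and why building it once lets the graph be finalized safely. In TensorFlow 1.x terms, the assumption is that the "build once" step corresponds to creating a tf.placeholder and a single global_step.assign(placeholder) op before the loop.

```python
class ToyGraph:
    """Minimal stand-in for a graph that can be frozen against changes."""

    def __init__(self):
        self.ops = []
        self.finalized = False

    def add_op(self, name):
        if self.finalized:
            raise RuntimeError("Graph is finalized and cannot be modified.")
        self.ops.append(name)
        return name

    def finalize(self):
        self.finalized = True


def train_growing(graph, steps):
    """Anti-pattern: a new assign op is added on every iteration."""
    for it in range(steps):
        graph.add_op("assign_{}".format(it))
    return len(graph.ops)


def train_fixed(graph, steps):
    """Build the assign op once, finalize, then only reuse it in the loop."""
    assign_op = graph.add_op("assign")
    graph.finalize()
    for it in range(steps):
        # Stands in for running the prebuilt op with the step value fed in.
        _ = (assign_op, it)
    return len(graph.ops)


print(train_growing(ToyGraph(), 1000))
print(train_fixed(ToyGraph(), 1000))
```

In the first pattern the number of ops grows with the number of steps, which matches the gradual slowdown described in the question; in the second the graph stays at a fixed size, so finalizing it before the loop no longer raises.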
