Training with TFRecords gets slower gradually - python

I am trying to use a TFRecord file for training a network in TensorFlow. The problem is that it starts running fine, but after some time it becomes really slow. Even the GPU utilization drops to 0% at times.
I have measured the time between iterations, and it is clearly increasing.
I have read that this might be caused by operations being added to the graph inside the training loop, and that this can be solved by using graph.finalize().
My code is like this:
self.inputMR_, self.CT_GT_ = read_and_decode_single_example("data.tfrecords")
self.inputMR, self.CT_GT = tf.train.shuffle_batch([self.inputMR_, self.CT_GT_], batch_size=self.batch_size, num_threads=2,
                                                  capacity=500*self.batch_size, min_after_dequeue=2000)
batch_size_tf = tf.shape(self.inputMR)[0] #variable batchsize so we can test here
self.train_phase = tf.placeholder(tf.bool, name='phase_train')
self.G = self.Network(self.inputMR,batch_size_tf)# create the network
self.g_loss=lp_loss(self.G, self.CT_GT, self.l_num, batch_size_tf)
print 'learning rate ',self.learning_rate
self.g_optim = tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.g_loss)
self.saver = tf.train.Saver()
Then I have a training stage that looks like this:
def train(self, config):
    init = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        sess.graph.finalize()  # **WHERE SHOULD I PUT THIS?**
        try:
            while not coord.should_stop():
                _, loss_eval = sess.run([self.g_optim, self.g_loss], feed_dict={self.train_phase: True})
                .....
        except:
            e = sys.exc_info()[0]
            print "Exception !!!", e
        finally:
            coord.request_stop()
            coord.join(threads)
            sess.close()
When I add graph.finalize(), I get an exception of type <type 'exceptions.RuntimeError'>.
Could anyone explain the correct way to use a TFRecord file during training, and how to call graph.finalize() without interfering with the QueueRunner execution?
The full error is:
File "main.py", line 37, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "main.py", line 35, in main
gen_model.train(FLAGS)
File "/home/dongnie/Desktop/gan/TF_record_MR_CT/model.py", line 143, in train
self.global_step.assign(it).eval() # set and update(eval) global_step with index, i
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 505, in assign
return state_ops.assign(self._variable, value, use_locking=use_locking)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
use_locking=use_locking, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 490, in apply_op
preferred_dtype=default_dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 657, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 180, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 167, in constant
attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2337, in create_op
self._check_not_finalized()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2078, in _check_not_finalized
raise RuntimeError("Graph is finalized and cannot be modified.")
RuntimeError: Graph is finalized and cannot be modified.

The problem is that you are modifying the graph between session.run calls. Calling finalize() on the default graph pinpoints where this happens, because it makes any further graph modification raise an error. In your case, the modification comes from calling global_step.assign(it), which creates a new assign op on every iteration. You should instead create that op once at the beginning, save the result to a variable, and reuse that value.
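A minimal sketch of that fix, reusing names from the question (it is the asker's iteration counter; the placeholder is an assumption added for illustration, not part of the original code):

# During graph construction, before finalize():
it_placeholder = tf.placeholder(tf.int64, shape=[], name='iteration')  # dtype should match self.global_step
assign_global_step = self.global_step.assign(it_placeholder)           # created exactly once

sess.graph.finalize()  # safe now: nothing below adds ops to the graph

# Inside the training loop, only run the pre-built ops:
_, loss_eval = sess.run([self.g_optim, self.g_loss], feed_dict={self.train_phase: True})
sess.run(assign_global_step, feed_dict={it_placeholder: it})  # reuses the same assign op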

Related

Loss key needs to be present - PyTorch

I am following this repo:
https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/entity_linking
Here is a small tutorial:
https://colab.research.google.com/github/NVIDIA/NeMo/blob/v1.0.2/tutorials/nlp/Entity_Linking_Medical.ipynb
Before starting this tutorial, change the branch to r1.10.0.
When I train this model on the entire UMLS dataset with the given commands, it fails with the following error:
In automatic_optimization, when `training_step` returns a dict, the 'loss' key needs to be present
I checked the training_step method and it looks fine:
def training_step(self, batch, batch_idx):
    """
    Lightning calls this inside the training loop with the data from the training dataloader
    passed in as `batch`.
    """
    input_ids, token_type_ids, attention_mask, concept_ids = batch
    logits = self.forward(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
    train_loss = self.loss(logits=logits, labels=concept_ids)

    # No hard examples found in batch,
    # shouldn't use this batch to update model weights
    if train_loss == 0:
        train_loss = None
        lr = None
    else:
        lr = self._optimizer.param_groups[0]["lr"]
        self.log("train_loss", train_loss)
        self.log("lr", lr, prog_bar=True)

    return {"loss": train_loss, "lr": lr}
Here is a full stacktrace:
[NeMo I 2022-07-29 18:29:27 multi_similarity_loss:91] Encountered zero loss in multisimloss, loss = 0.0. No hard examples found in the batch
Error executing job with overrides: ['project_dir=.']
Traceback (most recent call last):
File "self_alignment_pretraining.py", line 38, in main
trainer.fit(model)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 769, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 719, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
results = self._run_stage()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
return self._run_train()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
self.fit_loop.run()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 268, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 207, in advance
self.optimizer_idx,
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 378, in _optimizer_step
using_lbfgs=is_lbfgs,
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1644, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 85, in optimizer_step
closure_result = closure()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
step_output = self._step_fn()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 437, in _training_step
training_step_output, self.trainer.accumulate_grad_batches
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 75, in from_training_step_output
"In automatic_optimization, when `training_step` returns a dict, the 'loss' key needs to be present"
pytorch_lightning.utilities.exceptions.MisconfigurationException: In automatic_optimization, when `training_step` returns a dict, the 'loss' key needs to be present
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
You get this error message about "loss key needs to be present" because in some training steps you return the dict {"loss": None}. This happens in your code here:
if train_loss == 0:
    train_loss = None
    lr = None
where you set train_loss = None. Lightning does not like that, because it wants loss to be a tensor with a graph attached.
If you wish to skip the optimization step completely, just return None from the training_step method, like this:
if train_loss == 0:
    return None
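For completeness, a minimal sketch of the adjusted training_step under that suggestion (names follow the snippet above; this is illustrative, not the exact NeMo code):

def training_step(self, batch, batch_idx):
    input_ids, token_type_ids, attention_mask, concept_ids = batch
    logits = self.forward(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
    train_loss = self.loss(logits=logits, labels=concept_ids)

    if train_loss == 0:
        # No hard examples in this batch: skip the optimizer step entirely.
        return None

    lr = self._optimizer.param_groups[0]["lr"]
    self.log("train_loss", train_loss)
    self.log("lr", lr, prog_bar=True)
    return {"loss": train_loss, "lr": lr}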

Keras Value Error in Custom Loss Function

EDIT:
Following this link: TensorFlow: using a tensor to index another tensor
I've gotten close to my solution:
def my_loss(yTrue, yPred):
    pred_casted = K.cast(yPred, "int32")
    real_pred = K.gather(t_best_x, pred_casted)
    true_casted = K.cast(yTrue, "int32")
    real_true = K.gather(t_best_x, true_casted)
    real_loss = K.difference(real_pred, real_true)  # I don't actually know the name of this function, but I'm positive something like this exists
    mean_loss = K.mean(real_loss, axis=-1)
    return mean_loss
But when I try to train the model, I get the following error:
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_util.py", line 371, in make_tensor_proto
raise ValueError("None values not supported.")
Any suggestion on how to solve this, and why it happens, is greatly appreciated. Note: with the commented-out "classical_loss_mse", everything works.
The full error stack is at the end of the post.
ORIGINAL TEXT
I have a classic Keras NN with a custom loss function:
network.compile(loss=my_loss, optimizer='Adam', metrics=['mse'])
I am trying to create a custom loss function (my_loss):
def my_loss(yTrue,yPred):
The value of each prediction is a float between 0 and 1000.
I have an array containing, for each prediction of my NN, a "fitness" of the result.
Like this:
fitness[int(prediction)] = 0.8 #example
len(my_array)
> 1000
The concept of the loss function I want to create (I need this) is:
1 - my_array[int(yPred)]
Now, it's not like I can just do int(yPred)... yPred is a Tensor. How can I make this work?
I will try to explain more precisely what I want with a pseudo (non-working) example:
def my_loss(yTrue, yPred):
    true_loss = []
    for prediction in yPred:
        true_loss.append(1 - fitness[round(prediction)])  # fitness[round(any yTrue)] = 1
    return true_loss
Note: fitness is a normal list containing normal floats
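For reference, here is a minimal sketch of the gather-based idea from the edit above, assuming fitness is the plain Python list described here and first wrapping it in a constant tensor (fitness_t, fitness_loss, and the clipping bound are illustrative, not from the original code). Note that rounding and gathering produce no gradient with respect to yPred, which is likely why the optimizer ends up receiving a None value (the ValueError in the stack below):

from keras import backend as K

fitness_t = K.constant(fitness)  # lookup table of fitness values, shape (1000,)

def fitness_loss(yTrue, yPred):
    # Round predictions to integer indices and keep them inside the table.
    idx = K.cast(K.clip(K.round(yPred), 0, 999), "int32")
    # Loss is 1 - fitness of the predicted value, averaged over the batch.
    return K.mean(1.0 - K.gather(fitness_t, idx), axis=-1)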
FULL STACK ERROR
File "/home/federico/.local/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/home/federico/.local/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/federico/CLionProjects/monte/EXP2-STILLNOTWORKING.py", line 201, in <module>
callbacks=[TrainValTensorBoard(log_dir='./logs/'+time.strftime("%Y-%m-%d_%H-%M-%S-") + name, histogram_freq=0, write_graph=False, write_images=False), callbacks.EarlyStopping(monitor='val_loss', patience=200, mode='auto')]) # Data for evaluation
File "/usr/local/lib/python3.5/dist-packages/keras/models.py", line 960, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1634, in fit
self._make_train_function()
File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 990, in _make_train_function
loss=self.total_loss)
File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/keras/optimizers.py", line 432, in get_updates
m_t = (self.beta_1 * m) + (1. - self.beta_1) * g
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 885, in binary_op_wrapper
y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y")
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 836, in convert_to_tensor
as_ref=False)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 926, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/constant_op.py", line 208, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_util.py", line 371, in make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.

Transfer learning with tf.estimator.Estimator framework

I'm trying to do transfer learning with an Inception-ResNet v2 model pretrained on ImageNet, using my own dataset and classes.
My original codebase was a modification of a tf.slim sample which I can't find anymore, and now I'm trying to rewrite the same code using the tf.estimator.* framework.
I am running, however, into the problem of loading only some of the weights from the pretrained checkpoint, initializing the remaining layers with their default initializers.
Researching the problem, I found this GitHub issue and this question, both mentioning the need to use tf.train.init_from_checkpoint in my model_fn. I tried, but given the lack of examples in both, I guess I got something wrong.
This is my minimal example:
import sys
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf
import numpy as np
import inception_resnet_v2

NUM_CLASSES = 900
IMAGE_SIZE = 299

def input_fn(mode, num_classes, batch_size=1):
    # some code that loads images, reshapes them to 299x299x3 and batches them
    return tf.constant(np.zeros([batch_size, 299, 299, 3], np.float32)), tf.one_hot(tf.constant(np.zeros([batch_size], np.int32)), NUM_CLASSES)

def model_fn(images, labels, num_classes, mode):
    with tf.contrib.slim.arg_scope(inception_resnet_v2.inception_resnet_v2_arg_scope()):
        logits, end_points = inception_resnet_v2.inception_resnet_v2(images,
                                                                     num_classes,
                                                                     is_training=(mode==tf.estimator.ModeKeys.TRAIN))
    predictions = {
        'classes': tf.argmax(input=logits, axis=1),
        'probabilities': tf.nn.softmax(logits, name='softmax_tensor')
    }

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    exclude = ['InceptionResnetV2/Logits', 'InceptionResnetV2/AuxLogits']
    variables_to_restore = tf.contrib.slim.get_variables_to_restore(exclude=exclude)
    scopes = { os.path.dirname(v.name) for v in variables_to_restore }
    tf.train.init_from_checkpoint('inception_resnet_v2_2016_08_30.ckpt',
                                  {s+'/':s+'/' for s in scopes})

    tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
    total_loss = tf.losses.get_total_loss()  # obtain the regularization losses as well

    # Configure the training op
    if mode == tf.estimator.ModeKeys.TRAIN:
        global_step = tf.train.get_or_create_global_step()
        optimizer = tf.train.AdamOptimizer(learning_rate=0.00002)
        train_op = optimizer.minimize(total_loss, global_step)
    else:
        train_op = None

    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=predictions,
        loss=total_loss,
        train_op=train_op)

def main(unused_argv):
    # Create the Estimator
    classifier = tf.estimator.Estimator(
        model_fn=lambda features, labels, mode: model_fn(features, labels, NUM_CLASSES, mode),
        model_dir='model/MCVE')

    # Train the model
    classifier.train(
        input_fn=lambda: input_fn(tf.estimator.ModeKeys.TRAIN, NUM_CLASSES, batch_size=1),
        steps=1000)

    # Evaluate the model and print results
    eval_results = classifier.evaluate(
        input_fn=lambda: input_fn(tf.estimator.ModeKeys.EVAL, NUM_CLASSES, batch_size=1))
    print()
    print('Evaluation results:\n    %s' % eval_results)

if __name__ == '__main__':
    tf.app.run(main=main, argv=[sys.argv[0]])
where inception_resnet_v2 is the model implementation in TensorFlow's models repository.
If I run this script, I get a bunch of info logs from init_from_checkpoint, but then, at session creation time, it seems to attempt to load the Logits weights from the checkpoint and fails because of incompatible shapes. This is the full traceback:
Traceback (most recent call last):
File "<ipython-input-6-06fadd69ae8f>", line 1, in <module>
runfile('C:/Users/1/Desktop/transfer_learning_tutorial-master/MCVE.py', wdir='C:/Users/1/Desktop/transfer_learning_tutorial-master')
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/1/Desktop/transfer_learning_tutorial-master/MCVE.py", line 77, in <module>
tf.app.run(main=main, argv=[sys.argv[0]])
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "C:/Users/1/Desktop/transfer_learning_tutorial-master/MCVE.py", line 68, in main
steps=1000)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 302, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 780, in _train_model
log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 368, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 673, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 493, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 851, in __init__
_WrappedSession.__init__(self, self._create_session())
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 856, in _create_session
return self._sess_creator.create_session()
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 554, in create_session
self.tf_sess = self._session_creator.create_session()
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 428, in create_session
init_fn=self._scaffold.init_fn)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\session_manager.py", line 279, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 889, in run
run_metadata_ptr)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1317, in _do_run
options, run_metadata)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [900] rhs shape= [1001] [[Node: Assign_1145 = Assign[T=DT_FLOAT,
_class=["loc:#InceptionResnetV2/Logits/Logits/biases"], use_locking=true, validate_shape=true,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](InceptionResnetV2/Logits/Logits/biases, checkpoint_initializer_1145)]]
What am I doing wrong when using init_from_checkpoint? How exactly are we supposed to "use" it in our model_fn? And why is the estimator trying to load the Logits' weights from the checkpoint when I'm explicitly telling it not to?
Update:
After the suggestion in the comments, I tried alternative ways to call tf.train.init_from_checkpoint.
Using {v.name: v.name}
If, as suggested in the comment, I replace the call with {v.name:v.name for v in variables_to_restore}, I get this error:
ValueError: Assignment map with scope only name InceptionResnetV2/Conv2d_2a_3x3 should map
to scope only InceptionResnetV2/Conv2d_2a_3x3/weights:0. Should be 'scope/': 'other_scope/'.
Using {v.name: v}
If, instead, I try using the name:variable mapping, I get the following error:
ValueError: Tensor InceptionResnetV2/Conv2d_2a_3x3/weights:0 is not found in
inception_resnet_v2_2016_08_30.ckpt checkpoint
{'InceptionResnetV2/Repeat_2/block8_4/Branch_1/Conv2d_0c_3x1/BatchNorm/moving_mean': [256],
'InceptionResnetV2/Repeat/block35_9/Branch_0/Conv2d_1x1/BatchNorm/beta': [32], ...
The error continues listing what I think are all the variable names in the checkpoint (or could it be the scopes instead?).
Update (2)
After inspecting the latest error here above, I see that InceptionResnetV2/Conv2d_2a_3x3/weights is in the list of the checkpointed variables. The problem is that :0 at the end!
I'll now verify if this does indeed solve the problem and post an answer if that's the case.
Thanks to @KathyWu's comment, I got on the right track and found the problem.
Indeed, the way I was computing the scopes would include the InceptionResnetV2/ scope, which would trigger loading of all variables "under" that scope (i.e., all variables in the network). Replacing this with the correct dictionary, however, was not trivial.
Of the possible scope modes init_from_checkpoint accepts, the one I had to use was the 'scope_variable_name': variable one, but without using the actual variable.name attribute.
The variable.name looks like 'some_scope/variable_name:0'. That :0 is not in the checkpointed variable's name, so using scopes = {v.name: v.name for v in variables_to_restore} will raise a "Variable not found" error.
The trick to make it work was stripping the tensor index from the name:
tf.train.init_from_checkpoint('inception_resnet_v2_2016_08_30.ckpt',
                              {v.name.split(':')[0]: v for v in variables_to_restore})
I found that {s+'/': s+'/' for s in scopes} didn't work because variables_to_restore includes something like "global_step", so the computed scopes include the global scope, which could pull in everything. You need to print variables_to_restore, find the "global_step" entry, and add it to "exclude".
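Putting the answer and the comment above together, the assignment map might end up looking roughly like this (the extra 'global_step' entry in exclude follows the comment; this is an illustrative sketch, not verified against the checkpoint):

exclude = ['InceptionResnetV2/Logits', 'InceptionResnetV2/AuxLogits', 'global_step']
variables_to_restore = tf.contrib.slim.get_variables_to_restore(exclude=exclude)

# Checkpoint variable names carry no ':0' suffix, so strip the tensor index before mapping.
assignment_map = {v.name.split(':')[0]: v for v in variables_to_restore}
tf.train.init_from_checkpoint('inception_resnet_v2_2016_08_30.ckpt', assignment_map)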

How to reuse slim.arg_scope in TensorFlow?

I'm trying to load the inception_resnet_v2_2016_08_30.ckpt file and do testing.
The code works well with a single image (entering the oneFile() function only once).
If I call the oneFile() function twice, the following error occurs:
ValueError: Variable InceptionResnetV2/Conv2d_1a_3x3/weights already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:
I found a related solution on Sharing Variables.
If tf.variable_scope meets the same problem, you can call scope.reuse_variables() to resolve it.
But I can't find the slim.arg_scope equivalent for reusing the scope.
def oneFile(filepath):
    imgPath = filepath
    testImage_string = tf.gfile.FastGFile(imgPath, 'rb').read()
    testImage = tf.image.decode_jpeg(testImage_string, channels=3)
    processed_image = inception_preprocessing.preprocess_image(testImage, image_size, image_size, is_training=False)
    processed_images = tf.expand_dims(processed_image, 0)

    # Create the model, use the default arg scope to configure the batch norm parameters.
    with slim.arg_scope(inception_resnet_v2_arg_scope()):
        #logits, end_points = inception_resnet_v2(images, num_classes = dataset.num_classes, is_training = False)
        logits, _ = inception_resnet_v2(processed_images, num_classes=16, is_training=False)

    probabilities = tf.nn.softmax(logits)

    init_fn = slim.assign_from_checkpoint_fn(
        checkpoint_file,
        slim.get_model_variables(model_name))

    with tf.Session() as sess:
        init_fn(sess)

        np_image, probabilities = sess.run([processed_images, probabilities])
        probabilities = probabilities[0, 0:]
        sorted_inds = [i[0] for i in sorted(enumerate(-probabilities), key=lambda x: x[1])]

        #print(probabilities)
        print(probabilities.argmax(axis=0))
        #names = imagenet.create_readable_names_for_imagenet_labels()
        #for i in range(15):
        #    index = sorted_inds[i]
        #    print((probabilities[index], names[index]))

def main():
    for image_file in os.listdir(dataset_dir):
        try:
            image_type = imghdr.what(os.path.join(dataset_dir, image_file))
            if not image_type:
                continue
        except IsADirectoryError:
            continue

        #image = Image.open(os.path.join(dataset_dir, image_file))
        filepath = os.path.join(dataset_dir, image_file)
        oneFile(filepath)
inception_resnet_v2_arg_scope
def inception_resnet_v2_arg_scope(weight_decay=0.00004,
                                  batch_norm_decay=0.9997,
                                  batch_norm_epsilon=0.001):
    """Yields the scope with the default parameters for inception_resnet_v2.

    Args:
      weight_decay: the weight decay for weights variables.
      batch_norm_decay: decay for the moving average of batch_norm momentums.
      batch_norm_epsilon: small float added to variance to avoid dividing by zero.

    Returns:
      a arg_scope with the parameters needed for inception_resnet_v2.
    """
    # Set weight_decay for weights in conv2d and fully_connected layers.
    with slim.arg_scope([slim.conv2d, slim.fully_connected],
                        weights_regularizer=slim.l2_regularizer(weight_decay),
                        biases_regularizer=slim.l2_regularizer(weight_decay)):
        batch_norm_params = {
            'decay': batch_norm_decay,
            'epsilon': batch_norm_epsilon,
        }
        # Set activation_fn and parameters for batch_norm.
        with slim.arg_scope([slim.conv2d], activation_fn=tf.nn.relu,
                            normalizer_fn=slim.batch_norm,
                            normalizer_params=batch_norm_params) as scope:
            return scope
Complete error message:
./data/test/teeth/1/7070.jpg
Traceback (most recent call last):
File "testing.py", line 111, in <module>
main()
File "testing.py", line 106, in main
cal(processed_images)
File "testing.py", line 67, in cal
logits, _ = inception_resnet_v2(processed_images, num_classes=16, is_training=False)
File "/notebooks/transfer_learning_tutorial/inception_resnet_v2.py", line 123, in inception_resnet_v2
scope='Conv2d_1a_3x3')
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 918, in convolution
outputs = layer.apply(inputs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 320, in apply
return self.call(inputs, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 286, in call
self.build(input_shapes[0])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/convolutional.py", line 138, in build
dtype=self.dtype)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 1049, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 948, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 349, in get_variable
validate_shape=validate_shape, use_resource=use_resource)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 1389, in wrapped_custom_getter
*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 275, in variable_getter
variable_getter=functools.partial(getter, **kwargs))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 228, in _add_variable
trainable=trainable and self.trainable)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1334, in layer_variable_getter
return _model_variable_getter(getter, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1326, in _model_variable_getter
custom_getter=getter, use_resource=use_resource)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 262, in model_variable
use_resource=use_resource)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 217, in variable
use_resource=use_resource)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 341, in _true_getter
use_resource=use_resource)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 653, in _get_single_variable
name, "".join(traceback.format_list(tb))))
ValueError: Variable InceptionResnetV2/Conv2d_1a_3x3/weights already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 217, in variable
use_resource=use_resource)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 262, in model_variable
use_resource=use_resource)
It seems that calling tf.reset_default_graph() before processing each image in your oneFile() function will solve this problem, as I encountered the same issue in very similar example code. My understanding is that once you have fed an image to the neural network (NN), because of the variable-scope concept TensorFlow uses, it needs to be told that the variables can be reused before you can apply the NN to another image.
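A minimal sketch of that suggestion (only the first lines of oneFile() change; the rest of the function stays as posted):

def oneFile(filepath):
    tf.reset_default_graph()  # start from a fresh graph so the variable names don't collide
    imgPath = filepath
    testImage_string = tf.gfile.FastGFile(imgPath, 'rb').read()
    # ... rest of the function unchanged ...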
My guess would be that you specified the same scope for multiple variables in the graph. This error occurs when TensorFlow finds multiple variables under the same scope, regardless of the next image or the next batch. When you create the graph, you should create it with only one image or batch in mind. If everything works well with the first batch or first image, TensorFlow will take care of the next iterations, including the scoping.
So check all the scopes in your model file. I am pretty sure you used the same name twice.
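For reference, the variable-reuse pattern that the error message and the Sharing Variables guide refer to looks roughly like this (build_net stands in for the network-building call and is purely illustrative):

with tf.variable_scope('model') as scope:
    logits_1 = build_net(image_1)  # first call creates the variables
    scope.reuse_variables()        # allow the following calls to reuse them
    logits_2 = build_net(image_2)  # second call reuses instead of raising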

Inception retraining issue "Nan in summary histogram for: HistogramSummary"

I'm trying to retrain Inception v3 on my RPi3, and I'm getting this histogram error message.
python /home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py --bottleneck_dir=/home/pi/Documents/Machine\ Learning/Inception/tf_files/bottlenecks --how_many_training_steps 500 --model_dir=/home/pi/Documents/Machine\ Learning/Inception/tf_files/inception --output_graph=/home/pi/Documents/Machine\ Learning/Inception/tf_files/retrained_graph.pb --output_labels=/home/pi/Documents/Machine\ Learning/Inception/tf_files/retrained_labels.txt --image_dir /home/pi/Documents/Machine\ Learning/Inception/Retraining_Images
Looking for images in 'Granny Smith Apple'
Looking for images in 'Red Delicious'
100 bottleneck files created.
200 bottleneck files created.
2017-01-07 11:30:22.180768: Step 0: Train accuracy = 56.0%
2017-01-07 11:30:22.242166: Step 0: Cross entropy = nan
2017-01-07 11:30:22.850969: Step 0: Validation accuracy = 50.0%
Traceback (most recent call last):
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 938, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 887, in main
ground_truth_input: train_ground_truth})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, final_result)]]
Caused by op u'HistogramSummary', defined at:
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 938, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 846, in main
bottleneck_tensor)
File "/home/pi/Tensorflow/tensorflow/tensorflow/examples/image_retraining/retrain.py", line 764, in add_final_training_ops
tf.histogram_summary(final_tensor_name + '/activations', final_tensor)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py", line 100, in histogram_summary
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 100, in _histogram_summary
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, final_result)]]
I tried changing merged = tf.merge_all_summaries() in retrain.py after reading this, but it didn't work.
Also, the first time I tried to retrain, I got different results for step 0 before hitting an error:
2017-01-07 11:13:36.548913: Step 0: Train accuracy = 89.0%
2017-01-07 11:13:36.555770: Step 0: Cross entropy = 0.590778
2017-01-07 11:13:37.052190: Step 0: Validation accuracy = 76.0%
It sounds like it might help to know where the NaN values are coming from. For that, take a look at the TensorFlow debugger (tfdbg):
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/debugger/index.md
In your retrain.py, you can make a change like this:
from tensorflow.python import debug as tf_debug

# ...
# In def main(_)
if debug:
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
# ...
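The snippet assumes a debug switch that retrain.py does not define out of the box; one hypothetical way to wire it in (the flag name is made up for illustration) is:

# Near the other flag definitions in retrain.py (hypothetical flag):
tf.app.flags.DEFINE_boolean('debug', False, 'Wrap the session in the tfdbg CLI.')
FLAGS = tf.app.flags.FLAGS

# Later, right after the session is created in main(_):
if FLAGS.debug:
    sess = tf_debug.LocalCLIDebugWrapperSession(sess)
    sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)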
Then, when sess.run() happens for the training and evaluation, you will drop into the command-line interface of the debugger. At the tfdbg> prompt, you can enter a command to let the code run until any NaNs or infinities appear in the TensorFlow graph:
tfdbg> run -f has_inf_or_nan
When the tensor filter has_inf_or_nan is hit, the interface will give you a list of tensors containing Infs or NaNs, sorted in time order. The one at the top should be the "culprit", i.e., the one that first generated the bad numerical values. Say its name is node_1; you can then use the following tfdbg commands to look at its inputs and node attributes:
tfdbg> li -r node_1
tfdbg> ni -a node_1
If you're using tf.contrib.learn you'll want to use the following:
debug_hook = tf_debug.LocalCLIDebugHook()
debug_hook.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
hooks = [debug_hook]
...
classifier.fit(..., monitors=hooks)
