ML Engine Experiment: eval tf.summary.scalar not displaying in TensorBoard

I am trying to output some summary scalars in an ML Engine experiment at both train and eval time. tf.summary.scalar('loss', loss) correctly outputs the summary scalars for both training and evaluation on the same plot in TensorBoard. However, I am also trying to output other metrics at both train and eval time, and they only appear at train time. The code immediately follows tf.summary.scalar('loss', loss) but does not appear to work. For example, the code below only outputs for TRAIN, not EVAL. The only difference is that these use custom accuracy functions, yet they do work for TRAIN:
if mode in (Modes.TRAIN, Modes.EVAL):
    loss = tf.contrib.legacy_seq2seq.sequence_loss(logits, outputs, weights)
    tf.summary.scalar('loss', loss)
    sequence_accuracy = sequence_accuracy(targets, predictions, weights)
    tf.summary.scalar('sequence_accuracy', sequence_accuracy)
Does it make sense that loss would plot in TensorBoard for both TRAIN and EVAL, while sequence_accuracy would only plot for TRAIN?
Could this behavior somehow be related to the warning I received "Found more than one metagraph event per run. Overwriting the metagraph with the newest event."?

Because the summary node in the graph is just a node. It still needs to be evaluated (outputting a protobuf string), and that string still needs to be written to a file. It's not evaluated in training mode because it's not upstream of the train_op in your graph, and even if it were evaluated, it wouldn't be written to a file unless you specified a tf.train.SummarySaverHook as one of your training_chief_hooks in your EstimatorSpec. Because the Estimator class doesn't assume you want any extra evaluation during training, evaluation is normally only done during the EVAL phase, and you just increase min_eval_frequency or checkpoint_frequency to get more evaluation data points.
If you really, really want to log a summary during training, here's how you'd do it:
def model_fn(mode, features, labels, params):
    ...
    if mode == Modes.TRAIN:
        # loss is already written out during training, don't duplicate the summary op
        loss = tf.contrib.legacy_seq2seq.sequence_loss(logits, outputs, weights)
        sequence_accuracy = sequence_accuracy(targets, predictions, weights)
        seq_sum_op = tf.summary.scalar('sequence_accuracy', sequence_accuracy)
        with tf.control_dependencies([seq_sum_op]):
            train_op = optimizer.minimize(loss)
        return tf.estimator.EstimatorSpec(
            loss=loss,
            mode=mode,
            train_op=train_op,
            training_chief_hooks=[tf.train.SummarySaverHook(
                save_steps=100,
                output_dir='./summaries',
                summary_op=seq_sum_op
            )]
        )
But it's better to just increase your eval frequency and add an eval_metric_ops entry for accuracy, e.g. with tf.metrics.accuracy (the streaming accuracy metric, formerly tf.contrib.metrics.streaming_accuracy), as in the sketch below.
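A minimal sketch of that approach, reusing the tensors from the question and assuming an optimizer is already defined; tf.metrics.accuracy stands in here for the custom sequence accuracy:

def model_fn(mode, features, labels, params):
    # ... build logits, outputs, targets, predictions, weights as before ...
    loss = tf.contrib.legacy_seq2seq.sequence_loss(logits, outputs, weights)
    tf.summary.scalar('loss', loss)  # written during TRAIN; eval metrics cover EVAL

    eval_metric_ops = None
    if mode == Modes.EVAL:
        # tf.metrics.accuracy returns a (value, update_op) pair that the Estimator
        # runs over the whole eval set and writes to the eval events file, so it
        # shows up in TensorBoard next to the loss. Swap in your own sequence-level
        # metric if per-element accuracy isn't what you want.
        eval_metric_ops = {
            'sequence_accuracy': tf.metrics.accuracy(
                labels=targets, predictions=predictions, weights=weights)
        }

    train_op = None
    if mode == Modes.TRAIN:
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,
        train_op=train_op,
        eval_metric_ops=eval_metric_ops)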

Related

Callback not working in TensorFlow to stop the training

I have written a callback which stops training when accuracy reaches 99%. But the problem is that I get the error below. Sometimes, if I resolve the error, the callback still does not get called even though accuracy reaches 100%.
'>' not supported between instances of 'NoneType' and 'float'
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if(logs.get('accuracy') > 0.99):
            self.model.stop_training = True

def train_mnist():
    # Please write your code only where you are indicated.
    # please do not remove # model fitting inline comments.
    # YOUR CODE SHOULD START HERE
    # YOUR CODE SHOULD END HERE
    call = myCallback()
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data(path=path)
    # YOUR CODE SHOULD START
    x_train = x_train/255
    y_train = y_train/255
    # YOUR CODE SHOULD END HERE
    model = tf.keras.models.Sequential([
        # YOUR CODE SHOULD START HERE
        keras.layers.Flatten(input_shape=(28,28)),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(10, activation='softmax')
        # YOUR CODE SHOULD END HERE
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # model fitting
    history = model.fit(  # YOUR CODE SHOULD START HERE
        x_train, y_train, epochs=9, callbacks=[call])
    # model fitting
    return history.epoch, history.history['acc'][-1]
Two major problems with the above code:
Getting to 100% accuracy on the training set almost always means that your model is overfitting. That's bad. What you want to do instead is specify the validation_split=.2 parameter in the .fit method and look for high accuracy on the validation set.
What you are trying to build in your custom callback already exists as keras.callbacks.EarlyStopping, which even has an option (restore_best_weights) to restore the best model seen over all epochs. By default it monitors the validation loss (val_loss) rather than a training metric, and you can point it at the validation accuracy instead when you use a validation split.
So, here's what you should do:
Stop using custom callbacks; they take some mastery to get working. Use EarlyStopping with restore_best_weights instead, as in the sketch below.
Always use validation_split and look for high accuracy on the validation set, as in the same sketch.
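A minimal sketch of both suggestions; the layer sizes, patience, and epoch count are placeholders rather than values from the original code, and on older TF/Keras versions the metric key is 'val_acc' instead of 'val_accuracy':

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0  # normalize the images only, never the labels

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Stop when validation accuracy stops improving and keep the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy', patience=2, restore_best_weights=True)

history = model.fit(x_train, y_train,
                    epochs=20,
                    validation_split=0.2,
                    callbacks=[early_stop])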
Did using built-in callbacks resolve your problem?

OutOfRangeError: tensorflow iterator not reinitializing between runs

I am fine-tuning an Inception model via TensorFlow with the setup below, feeding batches through the tf.data Dataset API. However, every time I attempt to train this model (before successfully retrieving any batches), I get an OutOfRangeError claiming that the iterator is exhausted:
Caught OutOfRangeError. Stopping Training. End of sequence
[[node IteratorGetNext (defined at <ipython-input-8-c768436e70d8>:13) = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]
I created a function to feed in hard-coded batches as the result of get_batch, and this runs and converges without any issues, leading me to believe that the graph and session code is working properly. I also tested the get_batch function by iterating in a session, and this causes no errors. The behavior I would expect is that restarting training (especially after resetting the notebook, etc.) would produce a fresh iterator over the dataset.
Code to train model:
with tf.Graph().as_default():
    tf.logging.set_verbosity(tf.logging.INFO)
    images, labels = get_batch(filenames=tf_train_record_path+train_file)

    # Create the model, use the default arg scope to configure the batch norm parameters.
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        logits, ax = inception.inception_v1(images, num_classes=1, is_training=True)

    # Specify the loss function:
    tf.losses.mean_squared_error(labels, logits)
    total_loss = tf.losses.get_total_loss()
    tf.summary.scalar('losses/Total_Loss', total_loss)

    # Specify the optimizer and create the train op:
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    train_op = slim.learning.create_train_op(total_loss, optimizer)

    # Run the training:
    final_loss = slim.learning.train(
        train_op,
        logdir=train_dir,
        init_fn=get_init_fn(),
        number_of_steps=1)
Code to get batches using the Dataset API:
def get_batch(filenames):
    dataset = tf.data.TFRecordDataset(filenames=filenames)
    dataset = dataset.map(parse)
    dataset = dataset.batch(2)
    iterator = dataset.make_one_shot_iterator()
    data_X, data_y = iterator.get_next()
    return data_X, data_y
This previously asked question resembles the issue I am experiencing; however, I am not using a batch_join call. I am not sure if this is an issue with slim.learning.train, restoring from a checkpoint, or scope. Any help would be appreciated!
Your input pipeline looks OK. The problem might be a damaged TFRecords file. You can try your code with random data, or feed your images as numpy arrays with tf.data.Dataset.from_tensor_slices().
Your parse function may also cause problems; try printing your image/label with sess.run (see the sketch below).
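A minimal sketch of both checks, assuming the question's get_batch, tf_train_record_path, and train_file are in scope; the shapes and batch size here are placeholders:

import numpy as np
import tensorflow as tf

# 1) Swap in random in-memory data to rule out a damaged TFRecords file.
def get_batch_random():
    images = np.random.rand(16, 224, 224, 3).astype(np.float32)
    labels = np.random.rand(16, 1).astype(np.float32)
    dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(2)
    return dataset.make_one_shot_iterator().get_next()

# 2) Inspect what the real pipeline actually yields.
data_X, data_y = get_batch(filenames=tf_train_record_path + train_file)
with tf.Session() as sess:
    img, lbl = sess.run([data_X, data_y])
    print(img.shape, img.dtype, lbl)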
I'd also advise using the Estimator API instead of the slim train_op. It is much more convenient, and slim will be deprecated soon.

How does one use tflite_convert to quantize a network trained with a tf.estimator.Estimator?

tflite_convert is a Python script used to invoke TOCO (TensorFlow Lite Optimizing Converter) to convert files from TensorFlow's formats to tflite-compatible files.
I am trying to generate a quantized TFLite model starting from a network I trained with an Estimator. The training code is pretty straightforward, and I added the modifications for fine-tuning the model required by the Fixed Point Quantization guide:
def input_fn(mode, num_classes, batch_size=1):
    # [...]
    return {'images': images}, labels


def model_fn(features, labels, num_classes, mode):
    images = features['images']
    with tf.contrib.slim.arg_scope(net_arg_scope()):
        logits, end_points = build_net(...)

    if FLAGS.with_quantization:
        tf.logging.info("Applying quantization to the graph.")
        if mode == tf.estimator.ModeKeys.EVAL:
            tf.contrib.quantize.create_eval_graph()

    tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
    total_loss = tf.losses.get_total_loss()  # obtain the regularization losses as well

    if FLAGS.with_quantization:
        tf.logging.info("Applying quantization to the graph.")
        if mode == tf.estimator.ModeKeys.TRAIN:
            tf.contrib.quantize.create_training_graph()

    # Configure the training op, etc. [...]
    return tf.estimator.EstimatorSpec(...)


def main(unused_argv):
    regex = FINETUNE_LAYER_RE if not FLAGS.with_quantization else '^((?!_quant).)*$'
    ws_settings = tf.estimator.WarmStartSettings(FLAGS.pretrained_checkpoint, regex)

    # Create the Estimator
    estimator = tf.estimator.Estimator(
        model_fn=lambda features, labels, mode: model_fn(features, labels, NUM_CLASSES, mode),
        model_dir=FLAGS.model_dir,
        #config=run_config,
        warm_start_from=ws_settings)

    # Set up input functions for training and evaluation
    train_input_fn = lambda: input_fn(tf.estimator.ModeKeys.TRAIN, NUM_CLASSES, FLAGS.batch_size)
    eval_input_fn = lambda: input_fn(tf.estimator.ModeKeys.EVAL, NUM_CLASSES, FLAGS.batch_size)

    # [...]
    train_spec = tf.estimator.TrainSpec(...)
    eval_spec = tf.estimator.EvalSpec(...)

    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
The first problem I ran into is that it's impossible to simply continue the training from the latest checkpoint after adding the quantization operations, because quantization adds extra variables that won't be found in the checkpoint. I solved that by writing a warm-start spec that filters out all the new variables by name and uses the latest training checkpoint as the warm-start checkpoint.
Now, I want to generate an evaluation graph to save (with the related variables) in order to then feed it to TOCO via the tflite_convert script.
I have tried converting one of the SavedModels exported after every evaluation, but that raises the following error:
Array conv0_bn/FusedBatchNorm, which is an input to the Relu operator
producing the output array cell_stem_0/Relu, is lacking min/max data,
which is necessary for quantization. Either target a non-quantized
output format, or change the input graph to contain min/max
information, or pass --default_ranges_min= and --default_ranges_max=
if you do not care about the accuracy of results.
Aborted (core dumped)
I don't know how to get the right SavedModel or GraphDef + checkpoint pair (although the SavedModel is preferable).
Has anyone tried to quantize an Estimator model? How do you generate the quantized evaluation graph?
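For reference, a minimal sketch of exporting a SavedModel from an Estimator; the 'images' input name follows the input_fn above, while the input shape and FLAGS.export_dir are assumptions. Note that export_savedmodel builds the graph in PREDICT mode, so the exported graph only contains the fake-quantization min/max ranges if create_eval_graph() is applied to the graph that actually gets exported:

def serving_input_fn():
    # Raw placeholder matching the 'images' feature used by model_fn; the shape is a guess.
    images = tf.placeholder(tf.float32, shape=[None, 224, 224, 3], name='images')
    return tf.estimator.export.ServingInputReceiver({'images': images},
                                                    {'images': images})

estimator.export_savedmodel(FLAGS.export_dir, serving_input_fn)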

Log accuracy metric while training a tf.estimator

What's the simplest way to print accuracy metrics along with the loss when training a pre-canned estimator?
Most tutorials and documentations seem to address the issue of when you're creating a custom estimator -- which seems overkill if the intention is to use one of the available ones.
tf.contrib.learn had a few (now deprecated) Monitor hooks. TF now suggests using the hook API, but it appears that it doesn't actually come with anything that can utilize the labels and predictions to generate an accuracy number.
Have you tried tf.contrib.estimator.add_metrics(estimator, metric_fn)? It takes an initialized estimator (which can be pre-canned) and adds the metrics defined by metric_fn to it.
Usage Example:
def custom_metric(labels, predictions):
    # This function will be called by the Estimator, passing its predictions.
    # Let's suppose you want to add the "mean" metric...

    # Accessing the class predictions (careful, the key name may change from one canned Estimator to another)
    predicted_classes = predictions["class_ids"]

    # Defining the metric (value and update tensors):
    custom_metric = tf.metrics.mean(labels, predicted_classes, name="custom_metric")

    # Returning as a dict:
    return {"custom_metric": custom_metric}

# Initializing your canned Estimator:
classifier = tf.estimator.DNNClassifier(feature_columns=columns_feat,
                                        hidden_units=[10, 10],
                                        n_classes=NUM_CLASSES)

# Adding your custom metrics:
classifier = tf.contrib.estimator.add_metrics(classifier, custom_metric)

# Training/Evaluating:
tf.logging.set_verbosity(tf.logging.INFO)  # Just to have some logs to display for demonstration

train_spec = tf.estimator.TrainSpec(input_fn=lambda: your_train_dataset_function(),
                                    max_steps=TRAIN_STEPS)
eval_spec = tf.estimator.EvalSpec(input_fn=lambda: your_test_dataset_function(),
                                  steps=EVAL_STEPS,
                                  start_delay_secs=EVAL_DELAY,
                                  throttle_secs=EVAL_INTERVAL)

tf.estimator.train_and_evaluate(classifier, train_spec, eval_spec)
Logs:
...
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [20/200]
INFO:tensorflow:Evaluation [40/200]
...
INFO:tensorflow:Evaluation [200/200]
INFO:tensorflow:Finished evaluation at 2018-04-19-09:23:03
INFO:tensorflow:Saving dict for global step 1: accuracy = 0.5668, average_loss = 0.951766, custom_metric = 1.2442, global_step = 1, loss = 95.1766
...
As you can see, the custom_metric is returned along with the default metrics and loss.
In addition to @Aldream's answer, you can also use TensorBoard to see graphs of the custom_metric. To do that, add it to a TensorFlow summary like this:
tf.summary.scalar('custom_metric', custom_metric)
The cool thing about using tf.estimator.Estimator is that you don't need to add the summaries to a FileWriter, since it's done automatically (they are merged and saved every 100 steps by default; see the sketch below for changing that).
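If you want a different summary frequency, a minimal sketch using tf.estimator.RunConfig (save_summary_steps defaults to 100); the classifier arguments are the same placeholders as above:

run_config = tf.estimator.RunConfig(save_summary_steps=20)  # write summaries every 20 steps
classifier = tf.estimator.DNNClassifier(feature_columns=columns_feat,
                                        hidden_units=[10, 10],
                                        n_classes=NUM_CLASSES,
                                        config=run_config)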
To view them in TensorBoard, open a new terminal and type:
tensorboard --logdir={$MODEL_DIR}
After that you will be able to see the graphs in your browser at localhost:6006.

Issue with TensorFlow saving

I am training neural nets with TensorFlow, and the model trains correctly using a custom implementation of batch gradient descent. I have a logging function which records validation error, and it gets down to about 2.6%. I'm saving the model every 10 epochs using a tf.train.Saver.
However, when I load the variables into memory again using a tf.train.Saver with the same script, the model performs poorly, at about the level it reaches when the weights are randomly initialized. I have inspected the convolutional filters in the checkpoint, and they don't seem to be random, however.
I have not included all of my code, since it's around 400 lines long, but I've included what seem to be the important sections here and summarized the other functionality.
class ModelTrainer:
    def __init__(self, ...hyperparameters...):
        # Initialize datasets and hyperparameters
        for each gpu:
            # Create loss function and gradient assigned to this gpu using tf.device("/gpu:n")
        with tf.device("/cpu:0"):
            # Average and clip gradients from the GPUs
            # Create this batch gradient descent operation for each trainable variable
            variable.assign_sub(learning_rate * averaged_and_clipped_gradient).op

    def train(self, ...hyperparameters...):
        saver = tf.train.Saver(tf.all_variables(), max_to_keep=30)
        init = tf.initialize_all_variables()
        sess = tf.Session()
        if starting_point is not None:  # Used to evaluate existing models
            saver.restore(sess, starting_point)
        else:
            sess.run(init)
        for i in range(number_of_batches):
            # ... Get training batch ...
            gradients = sess.run(calculate_gradients, feed_dict=training_batch)
            # Average the "gradients" variable across multiple batches
            # (must be done because of GPU memory limitations)
            if i % meta_batch_size == 0:
                sess.run(apply_gradients_operators,
                         feed_dict=gradients_that_have_been_averaged_across_multiple_batches)
            # Log validation error
            if i % save_after_n_batches == 0:
                saver.save(sess, "some-filename", global_step=self.iter_num)
As expected, running these two functions creates a set of checkpoint files named "some-filename-40001", or whatever other iteration number the training is at when that file is saved. Unfortunately, when I load these checkpoints back in using the starting_point parameter, they perform on par with random initialization.
Initially I assumed it was something to do with the way I'm training the model, since I haven't found anyone else with this issue, but the validation error behaves as expected.
Edit: More odd results. After more experimentation, I have found that when I load the saved model using the code:
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("saved-checkpoint-40.meta")
    saver.restore(sess, "saved-checkpoint-40")
    # ... Use model in some way ...
I get different, but still incorrect results.
