I'm building a TF training program and attempting to diagnose some issues we are seeing with it. The root problem is that the gradients are always NaN. This is against the CIFAR-10 dataset (we wrote our own program from scratch to ensure we understand all of the mechanics properly).
It's too much code to post here, so it is here: https://github.com/drcrook1/CIFAR10
At this point we are fairly certain the issue is not the learning rate (we took that sucker down to 1e-25 and still got NaNs; we also simplified the network to a single MLP layer).
What we think is likely happening is that the values being read in by the input pipeline are wrong; therefore we want to print the values from a TFRecordReader pipeline to double-check that it is in fact reading and decoding the samples properly. As you know, you can only print a TF value if you know its name or have it captured as a variable, which brings up the point: how does one print an input tensor from a mini-batch?
Thanks for any tips!
It turns out you can return examples and labels as operations and then simply print them during graph execution.
def create_sess_ops():
    '''
    Creates and returns operations needed for running
    a tensorflow training session
    '''
    GRAPH = tf.Graph()
    with GRAPH.as_default():
        examples, labels = Inputs.read_inputs(CONSTANTS.RecordPaths,
                                              batch_size=CONSTANTS.BATCH_SIZE,
                                              img_shape=CONSTANTS.IMAGE_SHAPE,
                                              num_threads=CONSTANTS.INPUT_PIPELINE_THREADS)
        examples = tf.reshape(examples, [CONSTANTS.BATCH_SIZE, CONSTANTS.IMAGE_SHAPE[0],
                                         CONSTANTS.IMAGE_SHAPE[1], CONSTANTS.IMAGE_SHAPE[2]])
        logits = Vgg3CIFAR10.inference(examples)
        loss = Vgg3CIFAR10.loss(logits, labels)
        OPTIMIZER = tf.train.AdamOptimizer(CONSTANTS.LEARNING_RATE)
        #OPTIMIZER = tf.train.RMSPropOptimizer(CONSTANTS.LEARNING_RATE)
        gradients = OPTIMIZER.compute_gradients(loss)
        apply_gradient_op = OPTIMIZER.apply_gradients(gradients)
        gradients_summary(gradients)
        summaries_op = tf.summary.merge_all()
    return [apply_gradient_op, summaries_op, loss, logits, examples, labels], GRAPH
Notice that in the above code we use the input queue runners to grab examples and labels and feed them into the graph. We then return examples and labels as operations alongside all of our other operations, so they can be fetched during a session run.
def main():
    '''
    Run and Train CIFAR 10
    '''
    print('starting...')
    ops, GRAPH = create_sess_ops()
    total_duration = 0.0
    with tf.Session(graph=GRAPH) as SESSION:
        COORDINATOR = tf.train.Coordinator()
        THREADS = tf.train.start_queue_runners(SESSION, COORDINATOR)
        SESSION.run(tf.global_variables_initializer())
        SUMMARY_WRITER = tf.summary.FileWriter('Tensorboard/' + CONSTANTS.MODEL_NAME)
        GRAPH_SAVER = tf.train.Saver()
        for EPOCH in range(CONSTANTS.EPOCHS):
            duration = 0
            error = 0.0
            start_time = time.time()
            for batch in range(CONSTANTS.MINI_BATCHES):
                # ops is [apply_gradient_op, summaries_op, loss, logits, examples, labels]
                _, summaries, cost_val, prediction, examples, labels = SESSION.run(ops)
                print(np.where(np.isnan(prediction)))
                print(prediction[0])
                print(labels[0])
                plt.imshow(examples[0])
                plt.show()
                error += cost_val
            duration += time.time() - start_time
            total_duration += duration
            SUMMARY_WRITER.add_summary(summaries, EPOCH)
            print('Epoch %d: loss = %.2f (%.3f sec)' % (EPOCH, error, duration))
            if EPOCH == CONSTANTS.EPOCHS - 1 or error < 0.005:
                print(
                    'Done training for %d epochs. (%.3f sec)' % (EPOCH, total_duration)
                )
                break
Notice in the above code we fetch the examples and labels operations and can now print a variety of things. We print whether anything is NaN; along with that we print the prediction array itself and the label, and we even use matplotlib to plot an example image from each mini-batch.
This is exactly what I was looking to do; I needed it to verify my issues. The root cause was that the labels were being read incorrectly, which produced infinite gradients because the labels did not match the examples.
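A cheap guard against this class of bug (sketch only; `label` here stands for the decoded int32 label tensor inside the TFRecord pipeline, which is not shown above) is to assert the label range right after decoding, so misaligned records fail fast instead of silently producing NaNs later:
# Sketch only: fail fast if a decoded label falls outside the 10 CIFAR-10 classes.
assert_op = tf.Assert(
    tf.reduce_all(tf.logical_and(label >= 0, label < 10)), [label])
with tf.control_dependencies([assert_op]):
    label = tf.identity(label)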
Have you looked at the tf.Print operator?
https://www.tensorflow.org/api_docs/python/tf/Print
If you add this to your graph with an input from one of the nodes you suspect of causing the problem, you should be able to see the results in stderr.
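For example, a minimal sketch (assuming `examples` is the batch tensor produced by your input pipeline; substitute whichever tensor you suspect): wrapping it in tf.Print makes every evaluation of the returned tensor log the listed values to standard error.
# Sketch only: swap `examples` for the tensor you want to inspect.
examples = tf.Print(examples,
                    [tf.reduce_min(examples), tf.reduce_max(examples), examples],
                    message='input batch: ',
                    summarize=10)  # print at most 10 elements per tensor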
You may also find the check_numerics operator useful for debugging your problem:
How to check NaN in gradients in Tensorflow when updating?
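For instance (again just a sketch, reusing the `logits` and `gradients` names from your create_sess_ops snippet), tf.check_numerics raises an InvalidArgumentError as soon as a NaN or Inf appears, which tells you which tensor produced it first:
# Sketch only: wrap the tensors you suspect before building apply_gradients.
logits = tf.check_numerics(logits, 'NaN or Inf in logits')
checked_gradients = [(tf.check_numerics(g, 'NaN or Inf in gradient of ' + v.name), v)
                     for g, v in gradients if g is not None]
apply_gradient_op = OPTIMIZER.apply_gradients(checked_gradients)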
This looks like an ideal use-case for the official TensorFlow Debugger.
From the first example on the page:
from tensorflow.python import debug as tf_debug
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
From your description, it seems that you, too, need the tf_debug.has_inf_or_nan tensor filter to start your debugging.
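Once the session is wrapped, running `run -f has_inf_or_nan` at the tfdbg prompt executes the graph until the first tensor containing an inf or nan shows up (that command is from the debugger guide; double-check it against the docs for your TF version).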
Related
Recently I have started playing around with TensorBoard. To begin with, I just wanted to do a simple visualization of the loss function over a few hundred steps, using the tf.contrib.summary API.
My code works except for a slight annoyance. Say I want to perform 250 optimizer steps and record the loss on each of these steps; I do something like this (some chunks of code are missing):
graph = tf.Graph()
sess = tf.Session(graph=graph)

with sess.graph.as_default():
    ...  # lines that define the computation graph as well as input dataset and predictions

    global_step = tf.train.create_global_step()
    rmse = tf.math.sqrt(tf.losses.mean_squared_error(labels=Y, predictions=Y_PRED))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(rmse, global_step=global_step)

    # create summary writer, tensor for recording scalar and initialize everything
    summary_writer = tf.contrib.summary.create_file_writer(args.logdir, flush_millis=10 * 1000)
    summaries = {}
    with summary_writer.as_default(), tf.contrib.summary.always_record_summaries():
        summaries["train_rmse"] = tf.contrib.summary.scalar("train/RMSE", rmse)

    sess.run(tf.global_variables_initializer())
    with summary_writer.as_default():
        tf.contrib.summary.initialize(session=sess, graph=graph)

for i in range(250):
    train_X_batch, train_Y_batch = ...  # retrieve batch of data from dataset
    sess.run(optimizer, feed_dict={X: train_X_batch, Y: train_Y_batch})
    sess.run(summaries["train_rmse"], {X: train_X, Y: train_Y})
But when I do this and then visualize the results in TensorBoard, my train_rmse was recorded only 241 times instead of the 250 times I would expect given tf.contrib.summary.always_record_summaries() (see the image).
This issue seems to be data dependent. When I try a similar thing on the MNIST dataset and record some scalars for the same number of steps, the number of recorded steps is something like 200.
I've tried to find the answer in the TensorFlow documentation, but without success. I've also checked things like not having enough data for the 250 steps; this should not be an issue.
One more thing: this happens even when I use the record_summaries_every_n_global_steps(n) call. For example, calling it with n = 5 records steps only up to the 215th step.
Could anyone help me with this please?
I am trying to learn the dynamics of TensorFlow 2.0 by converting my TensorFlow 1.13 script (below) into a TensorFlow 2.0 script. However, I am struggling to do this.
I think the main reason I am struggling is that the TensorFlow 2.0 examples I have seen train neural networks, so they have a model which they compile and fit. However, in my simple example below I am not using a neural network, so I can't see how to adapt this code to TensorFlow 2.0 (for example, how do I replace the session?). Help is much appreciated, and thanks in advance.
data = tf.placeholder(tf.int32)
theta = tf.Variable(np.zeros(100))
p_s = tf.nn.softmax(theta)
loss = tf.reduce_mean(-tf.log(tf.gather(p_s, data)))
train_step = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        for datum in sample_data():  # sample_data() is a list of integer datapoints
            _ = sess.run([train_step], feed_dict={data: datum})
    print(sess.run(p_s))
I have looked at this (which is most relevant), and so far I have come up with the following:
#data = tf.placeholder(tf.int32)
theta = tf.Variable(np.zeros(100))
p_s = tf.nn.softmax(theta)
loss = tf.reduce_mean(-tf.math.log(tf.gather(p_s, data)))
optimizer = tf.keras.optimizers.Adam()

for epoch in range(10):
    for datum in sample_data():
        optimizer.apply_gradients(loss)
print(p_s)
However, the above obviously does not run, because the placeholder data inside the loss function does not exist anymore, and I am not sure how to replace it. :S
Anyone? Note that I don't have a def forward(x), because my input datum isn't transformed; it is used directly to calculate the loss.
Instead of using the conversion tool (it exists, but I don't like it since it more or less just prefixes the API calls with tf.compat.v1 and keeps the old TensorFlow 1.x API), I'll help you convert your code to the new version.
Sessions are gone, and so are the placeholders. The reason? The code is executed line by line; that is TensorFlow's eager mode.
To train a model, you correctly use an optimizer. If you want to use the minimize method, in TensorFlow 2.0 you have to define the function to minimize (the loss) as a Python callable.
# This is your "model"
theta = tf.Variable(np.zeros(100))
p_s = tf.nn.softmax(theta)
# Define the optimizer
optimizer = tf.keras.optimizers.Adam()
# Define the training loop with the loss inside (because we use the
# .minimnize method that requires a callable with no arguments)
trainable_variables = [theta]
for epoch in range(10):
for datum in sample_data():
# The loss must be callable and return the value to minimize
def loss_fn():
loss = tf.reduce_mean(-tf.math.log(tf.gather(p_s, datum)))
return loss
optimizer.minimize(loss_fn, var_list=trainable_variables)
tf.print("epoch ", epoch, " finished. ps: ", p_s)
Disclaimer: I haven't tested the code, but it should work (or at least give you an idea of how to implement what you're trying to achieve in TF 2).
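If you prefer to see the gradient step spelled out, here is an equivalent (equally untested) sketch that uses an explicit tf.GradientTape instead of .minimize, reusing sample_data() from the question:
import numpy as np
import tensorflow as tf

theta = tf.Variable(np.zeros(100))
optimizer = tf.keras.optimizers.Adam()

for epoch in range(10):
    for datum in sample_data():  # sample_data() as defined in the question
        with tf.GradientTape() as tape:
            p_s = tf.nn.softmax(theta)
            loss = tf.reduce_mean(-tf.math.log(tf.gather(p_s, datum)))
        grads = tape.gradient(loss, [theta])
        optimizer.apply_gradients(zip(grads, [theta]))
    tf.print("epoch", epoch, "finished. p_s:", tf.nn.softmax(theta))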
Recently, I have been trying to learn how to use TensorFlow on multiple GPUs by reading the official tutorial. However, there is something that I am confused about. The following code is part of the official tutorial; it calculates the loss on a single GPU.
def tower_loss(scope, images, labels):
    # Build inference Graph.
    logits = cifar10.inference(images)

    # Build the portion of the Graph calculating the losses. Note that we will
    # assemble the total_loss using a custom function below.
    _ = cifar10.loss(logits, labels)

    # Assemble all of the losses for the current tower only.
    losses = tf.get_collection('losses', scope)

    # Calculate the total loss for the current tower.
    total_loss = tf.add_n(losses, name='total_loss')

    # Attach a scalar summary to all individual losses and the total loss; do the
    # same for the averaged version of the losses.
    for l in losses + [total_loss]:
        # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
        # session. This helps the clarity of presentation on tensorboard.
        loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
        tf.summary.scalar(loss_name, l)

    return total_loss
The training process is as follows.
def train():
    with tf.device('/cpu:0'):
        # Create a variable to count the number of train() calls. This equals the
        # number of batches processed * FLAGS.num_gpus.
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)

        # Calculate the learning rate schedule.
        num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                                 FLAGS.batch_size / FLAGS.num_gpus)
        decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)

        # Decay the learning rate exponentially based on the number of steps.
        lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                        global_step,
                                        decay_steps,
                                        cifar10.LEARNING_RATE_DECAY_FACTOR,
                                        staircase=True)

        # Create an optimizer that performs gradient descent.
        opt = tf.train.GradientDescentOptimizer(lr)

        # Get images and labels for CIFAR-10.
        images, labels = cifar10.distorted_inputs()
        batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
            [images, labels], capacity=2 * FLAGS.num_gpus)

        # Calculate the gradients for each model tower.
        tower_grads = []
        with tf.variable_scope(tf.get_variable_scope()):
            for i in xrange(FLAGS.num_gpus):
                with tf.device('/gpu:%d' % i):
                    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                        # Dequeues one batch for the GPU
                        image_batch, label_batch = batch_queue.dequeue()
                        # Calculate the loss for one tower of the CIFAR model. This function
                        # constructs the entire CIFAR model but shares the variables across
                        # all towers.
                        loss = tower_loss(scope, image_batch, label_batch)

                        # Reuse variables for the next tower.
                        tf.get_variable_scope().reuse_variables()

                        # Retain the summaries from the final tower.
                        summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the for loop over 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new batch of images from batch_queue and calculate every gradient. I think this process is serialized instead of parallel. Is there anything wrong with my understanding? By the way, can I also use an iterator to feed images to my model rather than the dequeue?
Thank you everybody!
This is a common misconception with Tensorflow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
    - build the operations needed to dequeue a batch
    - build the operations needed to run the batch through the network and compute the loss
Note how, via tf.get_variable_scope().reuse_variables(), you're telling the graph that the variables used by each GPU's tower must be shared among all of them (i.e., all the graphs on the multiple devices "reuse" the same variables).
None of this actually runs the network even once (note how there is no sess.run()): you're just giving instructions on how data must flow.
Then, when you start the actual training (I guess you missed that piece of the code when copying it here), each GPU will pull its own batch and produce the per-tower loss. I guess these losses are averaged somewhere in the subsequent code, and the average is the loss passed to the optimizer.
Up until the point where the tower losses are averaged together, everything is independent of the other devices, so fetching the batch and computing the loss can happen in parallel. Then the gradient computation and parameter update are done only once, the variables are updated, and the cycle repeats.
So, to answer your question: no, the per-batch loss computation is not serialized. But since this is synchronous distributed computation, you still need to collect all the losses from all GPUs before being allowed to continue with the gradient computation and parameter update, so some part of the graph cannot be independent.
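For reference, here is a hedged sketch of what typically happens next in the tutorial (the snippet above stops before that part): the (gradient, variable) pairs collected per tower in tower_grads are averaged across GPUs before a single apply_gradients call. The helper below is a paraphrase of the tutorial's average_gradients, not the exact code:
import tensorflow as tf

def average_gradients(tower_grads):
    """tower_grads: a list over GPUs, each entry a list of (gradient, variable) pairs."""
    average_grads = []
    # zip(*tower_grads) groups the entries for the same variable across towers.
    for grad_and_vars in zip(*tower_grads):
        # Stack the per-tower gradients for this variable and average them.
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
        # The variable is shared across towers, so take it from the first tower.
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads

# grads = average_gradients(tower_grads)
# apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)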
I'm trying to use TensorBoard to display some graphs of a neural network training run (that is, graphs of test and validation accuracy during training, not just of the network structure). There is some example code, as well as some questions on this site, all of which seem to follow the same pattern as the example code. That is, the pattern always revolves around something like
summary, _ = sess.run([merged, train_step], ...
So basically, the operation of running a training step and that of recording statistics for display are being conflated.
This is fine as far as it goes, but I'm trying to retrofit graphing onto an existing program that inevitably does things in a slightly different way, so the example code won't work as is. What I really want to do is isolate some code that just records the statistics, separate from the existing code that does the training.
How do you record statistics for TensorBoard within the main training loop, but separate from the code that does the training?
You can manually create a tf.Summary object that stores the scalar value and pass it to a tf.summary.FileWriter, as in the following example:
summary_writer = tf.summary.FileWriter("path_to_log_dir")
# ...
for i in range(max_training_steps):
    # compute the values of interest
    scalar_value_1 = ...
    # ...
    scalar_value_n = ...

    # manually create a tf.Summary object
    summary = tf.Summary(
        value=[tf.Summary.Value(tag="Metrics_1", simple_value=scalar_value_1),
               # ...
               tf.Summary.Value(tag="Metrics_n", simple_value=scalar_value_n)])
    summary_writer.add_summary(summary, i)
# ...
summary_writer.close()
Alternatively, you can define a tf.summary.scalar() operation with a tf.placeholder as its tensor and feed the actual value at run time:
scalar_pl_1 = tf.placeholder(tf.float32)
tf.summary.scalar("Metrics_1", scalar_pl_1)
# ...
scalar_pl_n = tf.placeholder(tf.float32)
tf.summary.scalar("Metrics_n", scalar_pl_n)

# Merge all summaries
merged = tf.summary.merge_all()

summary_writer = tf.summary.FileWriter("path_to_log_dir")
with tf.Session() as sess:
    for i in range(max_training_steps):
        # compute scalar values of interest
        scalar_value_1 = ...
        scalar_value_n = ...
        feed_dict = {scalar_pl_1: scalar_value_1, scalar_pl_n: scalar_value_n}
        summary = sess.run(merged, feed_dict=feed_dict)
        summary_writer.add_summary(summary, i)
    # ...
summary_writer.close()
I have been trying to do the MNIST tutorial with PNG files, and have gotten most things to the point where they make sense.
The gist of the code is here; however, I'm going to walk through what it does and where the issue is happening.
I have a function that generates file names that I can give to the slice_input_producer.
def gen_file_names_and_labels(rootDir):
    """goes through the directory structure and extracts images and labels from each image."""
    file_names = []
    labels = []
    for file_name in glob.glob(rootDir + '/*/*'):
        file_type_removed = file_name.split('.')[0]
        split_by_dir = file_type_removed.split('/')
        file_names.append(file_name)
        labels.append(int(split_by_dir[2]))  # getting the folder it's in, turning it into an int, and using it as the label
    return file_names, labels
This behaves as expected.
In the body I run this function for training and testing, turn the results into tensors, and pass those tensors into a slice_input_producer.
sess = tf.InteractiveSession()
# THERE IS A PIPELINE FOR BOTH TESTING AND TRAINING. THEY COME IN PAIRS
image_list_train, label_list_train = gen_file_names_and_labels('mnist_png/training')
image_list_test, label_list_test = gen_file_names_and_labels('mnist_png/testing')
images_train = tf.convert_to_tensor(image_list_train,dtype=tf.string)
images_test = tf.convert_to_tensor(image_list_test,dtype=tf.string)
#remember that these aren't the actual images, just file_names
labels_train = tf.convert_to_tensor(label_list_train,dtype=tf.int32)
labels_test = tf.convert_to_tensor(label_list_test,dtype=tf.int32)
input_queue_train = tf.train.slice_input_producer([images_train ,labels_train] , shuffle=True)
input_queue_test = tf.train.slice_input_producer([images_train ,labels_train] , shuffle=True)
This part also works correctly.
This is where things get strange.
asdf = tf.placeholder(tf.int32)
input_queue = tf.cond( asdf>0, lambda: input_queue_train, lambda: input_queue_test)
# input_queue = input_queue_test
image, label = read_images_from_disk(input_queue)
image_reshaped = tf.reshape( image, [28,28,1])
image_batch, label_batch = tf.train.batch([image_reshaped,label],batch_size=50)
The variable asdf was renamed in anger, as it was the bearer of bad news.
See, the plan here was to use different queues for training and testing.
I planned to feed_dict a single int that would work as an ad hoc boolean for switching between the two.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
sess.run(tf.initialize_all_variables())

print(label_batch.eval(feed_dict={asdf: 0, keep_prob: 1.0}))
for i in range(500):
    # batch = mnist.train.next_batch(50)
    if i % 20 == 0:
        train_accuracy = accuracy.eval(feed_dict={keep_prob: 1.0, asdf: 0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    train_step.run(feed_dict={keep_prob: 0.9, asdf: 0})
When running it, though, I get the error:
"You must feed a value for placeholder tensor 'Placeholder' with dtype int32"
which is strange, because I am feeding it.
Using print(foo.eval(feed_dict={asdf: 0, keep_prob: 1.0})) I was able to notice some interesting phenomena: the switching works fine when I evaluate the individual image and label variables that come out of read_images_from_disk(input_queue).
However, if I try to evaluate the batching that comes right after, I get the aforementioned error.
What am I doing wrong with batching to make this happen? Is there a better way to do this switching between testing and training sets? What is the meaning of life, the universe, and everything? I'm counting on you, Stack Overflow. You're my only hope.
In answer to your question, "Is there a better way to do this switching between testing and training sets?": yes, there is. tf.cond() evaluates both functions at each step (see here) and therefore unnecessarily accesses both queues. This SO discussion and the associated links provide a couple of better alternatives:
use tf.placeholder_with_default() for your test data (see the sketch after this list)
use make_template
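As a rough illustration of the first option (untested sketch; train_image_batch, train_label_batch, test_images and test_labels are assumed names, not taken from your code, while keep_prob, train_step and accuracy are the ones from your snippet): the graph reads from the training pipeline by default, and you only override the inputs via feed_dict when evaluating.
# Hypothetical names: train_image_batch / train_label_batch come from the
# training queue (e.g. tf.train.batch); test_images / test_labels are plain
# NumPy arrays loaded separately.
images = tf.placeholder_with_default(train_image_batch, shape=[None, 28, 28, 1])
labels = tf.placeholder_with_default(train_label_batch, shape=[None])

# Training step: nothing is fed, so the defaults (the training queue) are used.
sess.run(train_step, feed_dict={keep_prob: 0.9})

# Evaluation: the placeholders are overridden with the test data.
sess.run(accuracy, feed_dict={images: test_images,
                              labels: test_labels,
                              keep_prob: 1.0})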