TensorFlow issue with feed_dict not being noticed when using batching - python

I have been trying to do the MNIST tutorial with PNG files, and have gotten most things to the point where they make sense.
The gist of the code is here; however, I'm going to walk through what it does and where the issue is happening.
I have a function that generates filenames that I can give to the slice_input_producer.
def gen_file_names_and_labels(rootDir):
    """Walks the directory structure and extracts the file name and label for each image."""
    file_names = []
    labels = []
    for file_name in glob.glob(rootDir+'/*/*'):
        file_type_removed = file_name.split('.')[0]
        split_by_dir = file_type_removed.split('/')
        file_names.append(file_name)
        labels.append(int(split_by_dir[2])) # the folder the file sits in is the digit, so use it as the label
    return file_names, labels
This behaves as expected.
In the body I run this function for training and testing, convert the results into tensors, and pass those tensors into a slice_input_producer.
sess = tf.InteractiveSession()
# THERE IS A PIPELINE FOR BOTH TRAINING AND TESTING. THEY COME IN PAIRS.
image_list_train, label_list_train = gen_file_names_and_labels('mnist_png/training')
image_list_test, label_list_test = gen_file_names_and_labels('mnist_png/testing')
images_train = tf.convert_to_tensor(image_list_train,dtype=tf.string)
images_test = tf.convert_to_tensor(image_list_test,dtype=tf.string)
#remember that these aren't the actual images, just file_names
labels_train = tf.convert_to_tensor(label_list_train,dtype=tf.int32)
labels_test = tf.convert_to_tensor(label_list_test,dtype=tf.int32)
input_queue_train = tf.train.slice_input_producer([images_train ,labels_train] , shuffle=True)
input_queue_test = tf.train.slice_input_producer([images_test, labels_test], shuffle=True)
This part also works correctly.
This is where things get strange.
asdf = tf.placeholder(tf.int32)
input_queue = tf.cond( asdf>0, lambda: input_queue_train, lambda: input_queue_test)
# input_queue = input_queue_test
image, label = read_images_from_disk(input_queue)
image_reshaped = tf.reshape( image, [28,28,1])
image_batch, label_batch = tf.train.batch([image_reshaped,label],batch_size=50)
The variable asdf was renamed in anger as it was the bearer of bad news.
See, the plan here was to use different queues for training and testing.
I planned to feed_dict a single int that would work as an ad-hoc boolean for switching between the two.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
sess.run(tf.initialize_all_variables())
print(label_batch.eval(feed_dict={asdf:0,keep_prob:1.0}))
for i in range(500):
    # batch = mnist.train.next_batch(50)
    if i % 20 == 0:
        train_accuracy = accuracy.eval(feed_dict={keep_prob:1.0, asdf:0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    train_step.run(feed_dict={keep_prob:0.9, asdf:0})
When running it though, I get the error:
"You must feed a value for placeholder tensor 'Placeholder' with dtype int32"
which is strange because I am feeding it.
using the "print(foo.eval(feed_dict={asdf:0,keep_prob:1.0)) I was able to notice some interesting phenomena. It seems that the switching works fine when I evaluate the individual variables declared "image, label" that come out of "read_images_from_disk(input_queue)"
However if I try to evaluate the batching that comes right after, I get the aforementioned error.
What am I doing wrong with batching to make this happen? Is there a better way to do this switching between testing and training sets? What is the meaning of life, the universe, and everything? I'm counting on you, StackOverflow. You're my only hope.

In answer to your question, "Is there a better way to do this switching between testing and training sets?", yes there is. tf.cond() evaluates both functions at each step (see here) and therefore unnecessarily accesses both queues. This SO discussion and the associated links provide a couple of better alternatives:
use tf.placeholder_with_default() for your test data
use make_template
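A minimal sketch of the placeholder_with_default idea, assuming the training and test batch tensors have already been built with tf.train.batch as in the question (the helper name and shapes below are illustrative, not from the original code):
import tensorflow as tf

def build_switchable_inputs(train_image_batch, train_label_batch):
    # The returned tensors default to the queue-fed training batches; at test
    # time you fetch a concrete test batch and feed it in through feed_dict.
    images = tf.placeholder_with_default(train_image_batch, shape=[None, 28, 28, 1])
    labels = tf.placeholder_with_default(train_label_batch, shape=[None])
    return images, labels

# Training step: no feed_dict needed, the defaults (training batches) are used.
#   sess.run(train_step)
# Test step: dequeue a test batch first, then override the defaults:
#   test_imgs, test_lbls = sess.run([test_image_batch, test_label_batch])
#   sess.run(accuracy, feed_dict={images: test_imgs, labels: test_lbls})
This avoids tf.cond entirely, so only the queue you actually read from is touched on each step.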

Related

Tensorflow returns ValueError: Cannot create a tensor proto whose content is larger than 2GB

def loadData():
    images_dir = os.path.join(current_dir, 'image_data')
    images = []
    for each in os.listdir(images_dir):
        images.append(os.path.join(images_dir, each))
    all_images = tf.convert_to_tensor(images, dtype=tf.string)
    images_batch = tf.train.shuffle_batch(
        [all_images], batch_size=BATCH_SIZE)
    return images_batch
returns
ValueError: Cannot create a tensor proto whose content is larger than 2GB.
I'm trying to load about 11GB of images. How can I overcome this limitation?
Edit: Possible duplicate:
Splitting the output classes into multiple operations and concatenating them at the end is suggested there, but I do not have multiple classes I can split.
Edit2:
Solutions to this problem suggest using placeholders. So now I'm not sure how to use placeholders in this case, or where I can feed the array of images to TensorFlow.
Here's a minimal version of my train function to show how I initialize the session.
def train():
    images_batch = loadData()
    sess = tf.Session()
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    for i in range(EPOCH):
        train_image = sess.run(images_batch)
Using convert_to_tensor has the unexpected effect of adding your images to the computational graph, which has a hard limit of 2GB. If you hit this limit, you should reconsider how to feed images for the training process.
We already have a simple solution in TensorFlow: just use placeholders (tf.placeholder) and feed_dict in session.run. The only disadvantage in this case is that you have to produce batches of your data manually.
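A minimal sketch of that placeholder approach (the image shape, decoding stub, and helper names here are purely illustrative assumptions, not from the original post):
import numpy as np
import tensorflow as tf

BATCH_SIZE = 64

# Pixel data enters through a placeholder, so it is never baked into the
# GraphDef (which is what runs into the 2GB protobuf limit).
images_ph = tf.placeholder(tf.float32, shape=[None, 64, 64, 3])
# ... build the rest of the model on top of images_ph ...

def batch_paths(image_paths, batch_size):
    # Manual batching: yield successive slices of the path list.
    for i in range(0, len(image_paths), batch_size):
        yield image_paths[i:i + batch_size]

def load_batch(paths):
    # Decode the files outside the graph with an image library of your
    # choice; this stub just returns zeros of the right shape.
    return np.zeros((len(paths), 64, 64, 3), dtype=np.float32)

# with tf.Session() as sess:
#     for paths in batch_paths(all_image_paths, BATCH_SIZE):
#         sess.run(train_op, feed_dict={images_ph: load_batch(paths)})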

Tensorflow contrib.summary API - recording scalars every n-th step does not work properly

Recently I have started playing around with TensorBoard. To begin with, I just wanted to do a simple visualization of the loss function over a few hundred steps. For that, I wanted to use the tf.contrib.summary API.
My code works except for a slight annoyance. Let's say that I want to perform 250 optimizer steps and record the loss at each of them, so I do something like this (some chunks of code are missing).
graph = tf.Graph()
sess = tf.Session(graph=graph)
with sess.graph.as_default():
    ... # lines that define the computation graph as well as input dataset and predictions

    global_step = tf.train.create_global_step()
    rmse = tf.math.sqrt(tf.losses.mean_squared_error(labels=Y, predictions=Y_PRED))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(rmse, global_step=global_step)

    # create summary writer, tensor for recording scalar and initialize everything
    summary_writer = tf.contrib.summary.create_file_writer(args.logdir, flush_millis=10 * 1000)
    summaries = {}
    with summary_writer.as_default(), tf.contrib.summary.always_record_summaries():
        summaries["train_rmse"] = tf.contrib.summary.scalar("train/RMSE", rmse)

    sess.run(tf.global_variables_initializer())
    with summary_writer.as_default():
        tf.contrib.summary.initialize(session=sess, graph=graph)

    for i in range(250):
        train_X_batch, train_Y_batch = ...  # retrieve batch of data from dataset
        sess.run(optimizer, feed_dict={X: train_X_batch, Y: train_Y_batch})
        sess.run(summaries["train_rmse"], {X: train_X, Y: train_Y})
But when I do this and then visualize the results in TensorBoard, my train_rmse was recorded only 241 times instead of 250, even though I used tf.contrib.summary.always_record_summaries() (see the image).
This issue seems to be data dependent. When I tried a similar thing on the MNIST dataset and recorded some scalars for the same number of steps, the number of recorded steps was around 200.
I've tried to find the answer in the tensorflow documentation but without success. I've also checked things like not having enough data for the 250 steps - this should not be an issue.
One more thing is that this happens even when I use the record_summaries_every_n_global_steps(n) call. For example, calling it with n = 5 records steps only up to the 215th step.
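For reference, a minimal sketch of how that every-n-steps variant would be wired in (a small helper under assumed TF 1.x contrib.summary semantics; the helper name is illustrative):
import tensorflow as tf

def scalar_every_n(summary_writer, name, tensor, n=5):
    # Builds a scalar summary op that only records when the global step is a
    # multiple of n; on other steps, evaluating the op records nothing.
    with summary_writer.as_default(), \
         tf.contrib.summary.record_summaries_every_n_global_steps(n):
        return tf.contrib.summary.scalar(name, tensor)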
Could anyone help me with this please?

When using TFRecord, how can I run intermediate validation check? (a better way?)

Let's say I defined a network Net and the example code below runs well.
# ... input processing using TFRecord ... # reading from TFRecord
x, y = tf.train.batch([image, label]) # encode batch
net = Net(x,y) # connect to network
# ... initialize and session ...
for iteration:
    loss, _ = sess.run([net.loss, net.train_op])
The Net does not have tf.placeholder, since input is provided by tensors from TFRecord provider. What if I would like to run validation set as well, e.g., every 500 steps? How can I switch input flow?
x, y = tf.train.batch([image, label], ...) # training set
vx, vy = tf.train.batch([vimage, vlabel], ...) # validation set
net = Net(x,y)
for iteration:
    loss, _ = sess.run([net.loss, net.train_op])
    if step % 500 == 0:
        # graph is already defined from input to loss.
        # how can I run net.loss with vx and vy??
The only thing I can imagine is modifying Net to use placeholders, and then running something like this every time:
sess.run([...], feed_dict = {Net.x:sess.run(x), Net.y:sess.run(y)})
sess.run([...], feed_dict = {Net.x:sess.run(vx), Net.y:sess.run(vy)})
However, it seems to me that I would lose the benefits of using TFRecord (e.g., full TF integration). In the middle of the computation flow, I would have to stop the flow, call sess.run to fetch the data, and then continue (doesn't this slow things down by forcing a round trip through the CPU in the middle?).
I am wondering
if there is a better way, and
if my solution is really as bad as I imagine.
Thanks in advance.
There is a better way (than placeholders). I ran into this issue with the CIFAR10 tutorial in TensorFlow, which I adjusted to check accuracy on the test set every 500 batches or so, alongside training. This is where sharing variables comes in handy.
x, y = tf.train.batch([image, label], ...) # training set
vx, vy = tf.train.batch([vimage, vlabel], ...) # validation set
with tf.variable_scope("model") as scope:
net = Net(x,y)
scope.reuse_variables()
vnet = Net(vx,vy)
for iteration:
loss, _ = sess.run([net.loss, net.train_op])
if step % 500 == 0:
loss, acc = sess.run([vnet.loss, vnet.accuracy])
By setting the scope to reuse variables on the second call to Net(), you will use the same variables and values created in the first call, but with a different set of inputs. Just make sure that vimage and vlabel aren't reusing tensors from image and label (which could possibly be solved by creating their own variable scopes).
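For that last point, a rough sketch of keeping the two input pipelines in separate name scopes (read_and_decode is a stand-in for whatever TFRecord-reading function produces the (image, label) tensors; all names here are illustrative):
import tensorflow as tf

def make_pipelines(read_and_decode, train_queue, valid_queue, batch_size=128):
    # Each pipeline gets its own name scope so its reader and queue ops
    # don't collide with the other's.
    with tf.name_scope("train_input"):
        image, label = read_and_decode(train_queue)
        x, y = tf.train.batch([image, label], batch_size=batch_size)
    with tf.name_scope("valid_input"):
        vimage, vlabel = read_and_decode(valid_queue)
        vx, vy = tf.train.batch([vimage, vlabel], batch_size=batch_size)
    return (x, y), (vx, vy)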

TensorFlow print input tensors?

I'm building a TF training program and attempting to diagnose some issues we are seeing with it. The root problem is that the gradients are always NaN. This is against the CIFAR10 data set (we wrote our own program from scratch to ensure we understand all of the mechanics properly).
It's too much code to post here, so it lives here: https://github.com/drcrook1/CIFAR10
At this point we are fairly certain the issue is not the learning rate (we took that sucker down to 1e-25 and still got NaNs; we also simplified the network to a single MLP layer).
What we think is likely happening is that the values being read in by the input pipeline are wrong; therefore we want to print the values from the TFRecordReader pipeline to double-check that it is in fact reading and decoding the samples properly. As you know, you can only print a TF value if you know its name or have it captured as a variable, so that brings up the point: how does one print an input tensor from a mini batch?
Thanks for any tips!
It turns out you can return examples and labels as operations and then simply print them during graph execution.
def create_sess_ops():
    '''
    Creates and returns operations needed for running
    a tensorflow training session
    '''
    GRAPH = tf.Graph()
    with GRAPH.as_default():
        examples, labels = Inputs.read_inputs(CONSTANTS.RecordPaths,
                                              batch_size=CONSTANTS.BATCH_SIZE,
                                              img_shape=CONSTANTS.IMAGE_SHAPE,
                                              num_threads=CONSTANTS.INPUT_PIPELINE_THREADS)
        examples = tf.reshape(examples, [CONSTANTS.BATCH_SIZE, CONSTANTS.IMAGE_SHAPE[0],
                                         CONSTANTS.IMAGE_SHAPE[1], CONSTANTS.IMAGE_SHAPE[2]])
        logits = Vgg3CIFAR10.inference(examples)
        loss = Vgg3CIFAR10.loss(logits, labels)
        OPTIMIZER = tf.train.AdamOptimizer(CONSTANTS.LEARNING_RATE)
        #OPTIMIZER = tf.train.RMSPropOptimizer(CONSTANTS.LEARNING_RATE)
        gradients = OPTIMIZER.compute_gradients(loss)
        apply_gradient_op = OPTIMIZER.apply_gradients(gradients)
        gradients_summary(gradients)
        summaries_op = tf.summary.merge_all()
        return [apply_gradient_op, summaries_op, loss, logits, examples, labels], GRAPH
Notice that in the above code we use the input queue runners to grab examples and labels and feed them into the graph. We then return examples and labels as operations alongside all of our other operations, which can then be used during a session run:
def main():
    '''
    Run and Train CIFAR 10
    '''
    print('starting...')
    ops, GRAPH = create_sess_ops()
    total_duration = 0.0
    with tf.Session(graph=GRAPH) as SESSION:
        COORDINATOR = tf.train.Coordinator()
        THREADS = tf.train.start_queue_runners(SESSION, COORDINATOR)
        SESSION.run(tf.global_variables_initializer())
        SUMMARY_WRITER = tf.summary.FileWriter('Tensorboard/' + CONSTANTS.MODEL_NAME)
        GRAPH_SAVER = tf.train.Saver()
        for EPOCH in range(CONSTANTS.EPOCHS):
            duration = 0
            error = 0.0
            start_time = time.time()
            for batch in range(CONSTANTS.MINI_BATCHES):
                # unpack every fetched op, including the example and label values
                _, summaries, cost_val, prediction, examples, labels = SESSION.run(ops)
                print(np.where(np.isnan(prediction)))
                print(prediction[0])
                print(labels[0])
                plt.imshow(examples[0])
                plt.show()
                error += cost_val
            duration += time.time() - start_time
            total_duration += duration
            SUMMARY_WRITER.add_summary(summaries, EPOCH)
            print('Epoch %d: loss = %.2f (%.3f sec)' % (EPOCH, error, duration))
            if EPOCH == CONSTANTS.EPOCHS - 1 or error < 0.005:
                print(
                    'Done training for %d epochs. (%.3f sec)' % (EPOCH, total_duration)
                )
                break
Notice that in the above code we take the examples and labels operations and can now print a variety of things: we print out whether anything is NaN, we print the prediction array itself and the labels, and we even use matplotlib to plot an example image from each mini batch.
This is exactly what I was looking to do; I needed it to verify my issue. The root cause turned out to be the labels being read incorrectly and therefore not matching the examples, which produced infinite gradients.
Have you looked at the tf.Print operator?
https://www.tensorflow.org/api_docs/python/tf/Print
If you add this to your graph with an input from one of the nodes you suspect of causing the problem, you should be able to see the results in stderr.
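For example, a rough sketch of how tf.Print could be spliced into the input pipeline right after the batch op (the helper name is illustrative; summarize caps how many elements are dumped):
import tensorflow as tf

def attach_print(examples, labels, n=10):
    # Identity pass-through that logs the first n values of each batch to
    # stderr every time the returned tensor is evaluated.
    examples = tf.Print(examples, [examples, labels],
                        message="batch examples/labels: ", summarize=n)
    return examples, labels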
You may also find the check_numerics operator useful for debugging your problem:
How to check NaN in gradients in Tensorflow when updating?
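In TF 1.x that roughly looks like the sketch below (tf.check_numerics wraps a single tensor, tf.add_check_numerics_ops covers every float tensor in the graph; the helper name is illustrative):
import tensorflow as tf

def guarded_loss(loss):
    # Evaluation fails with a clear error message the moment the loss
    # contains a NaN or Inf, instead of silently propagating it.
    return tf.check_numerics(loss, message="loss contains NaN/Inf")

# Or check every floating-point tensor in the graph in one go:
#   check_op = tf.add_check_numerics_ops()
#   sess.run([train_op, check_op])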
This looks like an ideal use-case for the official TensorFlow Debugger.
From the first example on the page:
from tensorflow.python import debug as tf_debug
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
From your description, it seems that you, too, need the tf_debug.has_inf_or_nan filter to start your debugging.
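Once the session is wrapped this way, starting the training script drops you into the tfdbg command-line interface; per the debugger documentation, running "run -f has_inf_or_nan" there executes the graph until the registered filter fires and lists the first tensors containing an inf or nan.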

TensorFlow: Does each session run initiate a different batch of data in a graph?

Say, for example, I have the following graph:
images, labels = load_batch(...)
with slim.arg_scope(inception_resnet_v2_arg_scope()):
    logits, end_points = inception_resnet_v2(images, num_classes = dataset.num_classes, is_training = True)
predictions = tf.argmax(end_points['Predictions'], 1)
accuracy, accuracy_update = tf.contrib.metrics.streaming_accuracy(predictions, labels)
....
train_op = slim.learning.create_train_op(...)
and in a supervisor managed_session as sess within the graph context, I run the following every once in a while:
print sess.run(logits)
print sess.run(end_points['Predictions'])
print sess.run(predictions)
print sess.run(labels)
Does each of these sess.run calls actually pull in a different batch, given that the batch tensor must flow from load_batch onward before it ever reaches logits, predictions, or labels? I ask because when I run each of these sessions I get very confusing results: the printed predictions do not match tf.argmax(end_points['Predictions'], 1), and despite the model reporting a high accuracy, the predictions don't remotely match the labels in a way that would produce such an accuracy. I therefore suspect that each result from sess.run comes from a different batch of data.
This brings me to my next question: is there a way to inspect the results of different parts of the graph while a single batch from load_batch flows all the way to the train_op, i.e. within the sess.run that is actually executed? In other words, is there a way to do what I want without calling another sess.run?
Also, if I were to check the results using sess.run in such a way, would it affect my training in that some batches of data will be skipped and not reach the train_op?
I realized there is a problem with running separate sess.run calls: the data loaded is different each time. Instead, when I did something like:
logits, probabilities, predictions, labels = sess.run([logits, probabilities, predictions, labels])
print 'logits: \n', logits
print 'Probabilities: \n', probabilities
print 'predictions: \n', predictions
print 'Labels:\n:', labels
All the quantities coincide very well with what I expected. I also tried using tf.Print by writing something like:
logits = tf.Print(logits, [logits], message = 'logits: \n', summarize = 100)
immediately after defining the logits, so that they get printed within the same session that runs the train_op. However, the printing is rather messy, so I prefer the first method of running everything in one session call to obtain the values and then printing them normally as numpy arrays.
