Integrating directory of TFRecord examples into model training

Integrating directory of TFRecord examples into model training - python

What is the most efficient way to feed data from multiple TFRecord files for purposes of training a Tensorflow model? With my current process, I iterate over the examples from TFRecords, separately extracting examples into Python variables, but I don't believe this is the proper way to do this.
I am migrating from Keras to Tensorflow hoping to see some speed improvements in my workflow. Towards that end, I've moved my data into TFRecord, and now I am trying to understand how to run basic linear regression models with a directory of TFRecord files. I have gotten to the point where I can read the TFRecord out into a Tensor and train in batches like so (code is taken from the Tensorflow getting started example and then modified):
# Model parameters
W = tf.Variable([.1], dtype=tf.float32)
b = tf.Variable([.1], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)
# Transforms a scalar string `example_proto` into a pair of a scalar string and
# a scalar integer, representing an image and its label, respectively.
def _parse_function(example_proto):
keys_to_features = {
"X": tf.FixedLenFeature([40], tf.float32),
"Y": tf.FixedLenFeature([10], tf.float32)
}
example = tf.parse_single_example(example_proto, keys_to_features)
return example["X"][0], example["Y"][0]
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()
dataset = dataset.batch(1024)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
sess.run(iterator.initializer, feed_dict = { filenames: training_filenames })
for i in range(10):
**x_train, y_train = sess.run(iterator.get_next())**
sess.run(train, {x: x_train, y: y_train})
My problem is that I do not believe this follows the intended, most efficient dataset workflow possible with Tensorflow. In particular, what is the point of extracting the data from binary into a python variable and then feeding it into the training process? (the line below)
**x_train, y_train = sess.run(iterator.get_next())**
I was under the impression there should be a way that feeds the binary data into the session for training more directly, but after reading the TF tutorials, examples, and other stack overflow posts, I am not finding anything.

The dataset API is very versatile and flexible. It can be used to input as dictionaries as you did. However, a better way is to incorporate the dataset within the graph and make it process all at once.
def model_function(input, label)
# Model parameters
W = tf.Variable([None, input.shape[1]], dtype=tf.float32)
b = tf.Variable([input.shape[1]], dtype=tf.float32)
# Model input and output
x = input
linear_model = W*x + b
y = label
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)
return train
---<Previous dataset related code>---
iterator.make_initializable_iterator()
next_example, next_label = iterator.get_next()
train_op = model_function(next_example, next label)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for steps in range(1000):
_ = sess.run([train_op], feeddict={filenames: training_filenames})
In this way the dataset operations are part of the main graph. This would also use the queuing structure of the dataset better. Since only one sess.run is used, the overhead of the run function is minimised.
For more information have a look at this part of the documentation: Importing data | Tensorflow 1.4
If you need training filenames which are specified at graph runtime you can only specify that placeholder in the feeddict. However, i suggest against that though. Filenames are rather static. I would use a resources file such as config.py and place all the config properties in that file. The filenames are then loaded on graph construction.
To specify the filenames, there are two approaches.
The first one:
...
filenames = tf.constant([filename1.tfrecords, filename2.tfrecords], dtype=tf.String)
dataset = tf.data.Dataset(filenames, "ZLIB")
...
Or else a more proper approach would be to create a new directory in the main folder called resources, place and empty __init__.py file inside and another one called config.py.
Inside config.py:
--- inside config.py ---
FILENAMES = ["filename1.tfrecord", "filename2.tfrecord"]
Inside the main tensorflow function where the dataset is being created:
--- inside tensorflow file ---
from resources import config
...
filenames = tf.constant(config.FILENAMES, dtype=tf.String)
dataset = tf.data.Dataset(filenames, "ZLIB")
...

Related

tf.data.Dataset feedable iterator for training and inference

I have a TensorFlow model that uses tf.data.Dataset feedable iterators to switch between training and validation. Both dataset share the same structure, that is they have a features matrix and the corresponding labels vector. In order to use the same model and iterator for inference (no labels vector only featurex matrix) I need to ideally supply a zero labels vector. Is there a more efficient and elegant way to use the dataset API for both training (validation) and inference?
In code:
training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
validation_dataset = tf.data.Dataset.from_tensor_slices((X_validation, y_validation))
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()
Features and lables are used inside the model as input placeholders.
In order to switch between dataset I need to create one iterator for each dataset:
training_iterator = training_dataset.make_initializable_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()
then create the handle
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())
And use the handle to select which dataset to use, for example:
sess.run(next_element, feed_dict={handle: training_handle})
Now, what happens if I have inference data with no labels?
inference_dataset = tf.data.Dataset.from_tensor_slices(X_inference) # NO y values
inferece_iterator = inference_dataset.make_initializable_iterator()
If I add this iterator it will throw and exception because "Number of components does not match: expected 2 types but got 1."
Any suggestions?
This post How to use tf.Dataset design in both training and inferring? is related to this question, but tf.data.Dataset does not have an unzip method.
What are the best practices for this problem?

If your graph code I assume you are trying to extract a value for labels y from the dataset right? At inference time that was probably baked into the tensorflow dependency graph.
You have a few choices here. Probably the easiest solution is to recreate the graph from code (run your build_graph() function, then load the weights using something like saver.restore(sess, "/tmp/model.ckpt")). If you do it this way you can re-create the graph without the labels y. I assume there are no other dependencies on y (sometimes tensorboard summaries add dependencies you need to check too). Your problem should now be solved.
However, now that I've written the above comment (which I'll leave as-is because it's still useful information), I realize you might not even need that. At inference time you should not be using the labels anywhere (again, double check tensorboard summaries). If you don't need y then tensorflow should not run any of the operations that use y. This should include not trying to extract them from the dataset. Double check that you are not asking tensorflow to use your labels anywhere at inference time.

I think that the first solution proposed by David Parks looks like this, and I think is better than messing with tf.cond in the code.
import tensorflow as tf
import numpy as np
def build_model(features, labels=None, train=False):
linear_model = tf.layers.Dense(units=1)
y_pred = linear_model(features)
if train:
loss = tf.losses.mean_squared_error(labels=labels, predictions=y_pred)
optimizer = tf.train.GradientDescentOptimizer(1e-4)
train = optimizer.minimize(loss)
return train, loss
else:
return y_pred
X_train = np.random.random(100).reshape(-1, 1)
y_train = np.random.random(100).reshape(-1, 1)
training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
training_dataset = training_dataset.batch(10)
training_dataset = training_dataset.shuffle(20)
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()
training_iterator = training_dataset.make_one_shot_iterator()
train, loss = build_model(features, labels, train=True)
saver = tf.train.Saver()
init = tf.global_variables_initializer()
sess = tf.Session()
training_handle = sess.run(training_iterator.string_handle())
sess.run(init)
for i in range(10):
_, loss_value = sess.run((train, loss), feed_dict={handle: training_handle})
print(loss_value)
saver.save(sess, "tmp/model.ckpt")
sess.close()
tf.reset_default_graph()
X_test = np.random.random(10).reshape(-1, 1)
inference_dataset = tf.data.Dataset.from_tensor_slices(X_test)
inference_dataset = inference_dataset.batch(5)
handle = tf.placeholder(tf.string, shape=[])
iterator_inference = tf.data.Iterator.from_string_handle(handle, inference_dataset.output_types, inference_dataset.output_shapes)
inference_iterator = inference_dataset.make_one_shot_iterator()
features_inference = iterator_inference.get_next()
y_pred = build_model(features_inference)
saver = tf.train.Saver()
sess = tf.Session()
inference_handle = sess.run(inference_iterator.string_handle())
saver.restore(sess, "tmp/model.ckpt") # Restore variables from disk.
print(sess.run(y_pred, feed_dict={handle: inference_handle}))
sess.close()

How to run prediction (using image as input) for a saved model?

Problem:
I am very new to Tensorflow. My specific question is what particular arguments should I put inside sess.run(fetches, feed_dict) function. For instance, how could find out what the values of the arguments?
Steps:
Here is my understanding of the steps after looking at other posts.
Save tranied tensorflow model, it should consists of 4 files, below are my outputs:
checkpoint
Inception_resnet_v2.ckpt.data-00000-of-00001
Inception_resnet_v2.ckpt.index
Inception_resnet_v2.ckpt.meta
Resize the input image to whatever format required by the neural network.
Start tensorflow session.
Retrive the Graph and associated parameters, tensors...
Predict the input image.
Code：
Traning code:
https://github.com/taki0112/SENet-Tensorflow/blob/master/SE_Inception_resnet_v2.py
[Solved] Test code:
import tensorflow as tf
import numpy as np
import cv2
labels = ["airplane","automobile","bird","cat","deer","dog","frog","horse","ship","truck"]
# Load graph and parameters, etc.
sess=tf.Session()
saver = tf.train.import_meta_graph('./model/Inception_resnet_v2.ckpt.meta')
saver.restore(sess, tf.train.latest_checkpoint("./model/"))
graph = tf.get_default_graph()
# Get tensor names
x = graph.get_tensor_by_name("Placeholder:0")
training_flag = graph.get_tensor_by_name("Placeholder_2:0")
op_to_restore = graph.get_tensor_by_name("final_fully_connected/dense/BiasAdd:0")
# Preprocess imgae imput
src = cv2.imread("./input/car3.jpg")
dst = cv2.resize(src, (32, 32), interpolation=cv2.INTER_CUBIC)
b,g,r = cv2.split(dst)
b = (b - np.mean(b)) / np.std(b) * .1
g = (g - np.mean(g)) / np.std(g) * .1
r = (r - np.mean(r)) / np.std(r) * .1
src = cv2.merge((b,g,r))
picture = dst.reshape(1, 32, 32, 3)
feed_dict ={x: picture, training_flag:False}
result_index = sess.run(op_to_restore,feed_dict)
print(result_index)
print (labels[np.argmax(result_index)])

the arguments actually depend on what you're doing, but mostly the first argument is the weights and placeholders. Whenever you are working with Tensorflow, you define a graph which is fed examples(training data) and some hyperparameters like learning rate, global step etc. It’s a standard practice to feed all the training data and hyperparameters using placeholders. when you build a network using placeholders and save it the network is saved, however, values of the placeholders are not saved.
Let's see a toy example:
import tensorflow as tf
#Prepare to feed input, i.e. feed_dict and placeholders
w1 = tf.placeholder("float", name="w1")
w2 = tf.placeholder("float", name="w2")
b1= tf.Variable(2.0,name="bias")
feed_dict ={w1:4,w2:8}
#Define a test operation that we will restore
w3 = tf.add(w1,w2)
w4 = tf.multiply(w3,b1,name="op_to_restore")
sess = tf.Session()
sess.run(tf.global_variables_initializer())
#Create a saver object which will save all the variables
saver = tf.train.Saver()
#Run the operation by feeding input
print sess.run(w4,feed_dict)
#Prints 24 which is sum of (w1+w2)*b1
#Now, save the graph
saver.save(sess, 'my_test_model',global_step=1000)
Now, when we want to restore it, we not only have to restore the graph and weights, but also prepare a new feed_dict that will feed the new training data to the network. We can get reference to these saved operations and placeholder variables via graph.get_tensor_by_name() method. So if you want to train the same model with further new data, then you would have to utilize those weigtages, if however you just want to get the prediction from the model you trained, you could utilize the op_to_restore and the feed_dict as new data. Something like this, if you follow the above example:
import tensorflow as tf
sess=tf.Session()
#First let's load meta graph and restore weights
saver = tf.train.import_meta_graph('my_test_model-1000.meta')
saver.restore(sess,tf.train.latest_checkpoint('./'))
# Now, let's access and create placeholders variables and
# create feed-dict to feed new data
graph = tf.get_default_graph()
w1 = graph.get_tensor_by_name("w1:0")
w2 = graph.get_tensor_by_name("w2:0")
feed_dict ={w1:13.0,w2:17.0}
#Now, access the op that you want to run.
op_to_restore = graph.get_tensor_by_name("op_to_restore:0")
print sess.run(op_to_restore,feed_dict)
#This will print 60 which is calculated
#using new values of w1 and w2 and saved value of b1.
So, this is how it works, in your case, since you're trying to load the Inception model, your op_to_restore should depend on what you're trying to restore if you could tell us what you're trying to do, then only it's possible to suggest something. However in the other parameter feed_dict , it's just the numpy array of image pixel, of you, you're trying to classify/predict or whatever you're doing.
I took the code from the following article. This will help you as well. http://cv-tricks.com/tensorflow-tutorial/save-restore-tensorflow-models-quick-complete-tutorial/
Update: For your particular case, you may like to try the following code to predict the classes in the new images.
import tensorflow as tf
slim = tf.contrib.slim
from inception_resnet_v2 import *
#Well, since you're using resnet_v2, this may be equivalent to you.
checkpoint_file = 'inception_resnet_v2_2016_08_30.ckpt'
sample_images = ['dog.jpg', 'panda.jpg']
#Load the model
sess = tf.Session()
arg_scope = inception_resnet_v2_arg_scope()
with slim.arg_scope(arg_scope):
logits, end_points = inception_resnet_v2(input_tensor, is_training=False)
#With this, you could consider the op_variable with the following
predict_values, logit_values = sess.run([end_points['Predictions'], logits], feed_dict={input_tensor: im})
#Here im is the normalized numpy array of the image pixels.
Furthermore, the following resources may help you even more:
Using pre-trained inception_resnet_v2 with Tensorflow
https://github.com/tensorflow/tensorflow/issues/7172

Train test dataset in Data Pipeline

I am new to tensorflow, I am building a data pipeline, in which I built two iterators for train, test set from tfrecord. The training works fine, but the problem occurs when inputting test set to graph.
if __name__ == '__main__':
X_video_train,X_audio_train,y = dataset('frame_sample/train.tfrecord')
X_video_test,X_audio_test,y = dataset('frame_sample/test.tfrecord')
#Input:Train Set
logits_train = graph(X_video_train,X_audio_train,training=True)
train = training(logits_train)
This code just fine, after this when I call sess.run and train it. It trains the model, and by using logits of logits_train, I get train accuracy.
But to get test accuracy when I call
logits_test,y = graph(X_video_test,X_audio_test,training=False)
acc,predict_proba = evaluation(logits_test,y)
It give me error
ValueError: Variable bidirectional_rnn/fw/fwd_lstm_1/kernel already
exists, disallowed. Did you mean to set reuse=True or
reuse=tf.AUTO_REUSE in VarScope? :
Then i passed a train test parameter in graph, which creates a new variable for train and test. But I think that creating a whole new graph for test set.
I am thinking of using Varscope Reuse, but does it also create new graph?, instead of getting logits from trained graph?
I just dont understand how I input test data to graph.

This error is thrown because you are re defining the graph in your test function.
The fact that you are training or testing a model should not be related to the graph. The graph should be defined once with a placeholder as input. Then you can populate this placeholder with either train or test data.
Some operations like batch normalization change their behaviour when testing. If your model contains these OPs you should pass a boolean to your feed dictionary like so:
# Model definition
...
h = tf.layers.batch_normalization(h, training=is_training_pl)
...
# Training
_, l = sess.run([train_op, loss], {x_pl: x_train_batch,
y_pl: y_train_batch,
is_training_pl: True})
...
# Testing
l = sess.run(loss, {x_pl: x_test_batch,
is_training_pl: False})
In the case you are using the new tf.data.Dataset API, here is an adapted code snippet using a feedable iterator:
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset ...
validation_dataset = tf.data.Dataset ...
# A feedable iterator is defined by a handle placeholder and its structure. We
# could use the `output_types` and `output_shapes` properties of either
# `training_dataset` or `validation_dataset` here, because they have
# identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next() # THIS WILL BE USED AS OUR INPUT
# You can use feedable iterators with a variety of different kinds of iterator
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()
# The `Iterator.string_handle()` method returns a tensor that can be evaluated
# and used to feed the `handle` placeholder.
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())
...
# Model definition
input = next_element
...
h = tf.layers.batch_normalization(h, training=is_training_pl)
...
# Training
_, l = sess.run([train_op, loss], {is_training_pl: True,
handle: training_handle})
# Validation
sess.run(validation_iterator.initializer)
l = sess.run(loss, {is_training_pl: False,
handle: validation_handle})

Predicting single images with tensorflow dataset api

I am trying to create a prediction script using the tensorflow dataset api. Previously I did this using the low-level API and feed_dict:
#import graph
saver = tf.train.import_meta_graph('...')
# Select variables to feed
x = graph.get_tensor_by_name("X:0")
predictions = graph.get_tensor_by_name("pred:0")
with tf.Session() as sess:
p = sess.run(predictions, feed_dict={x:x_feed})
Now I am using the dataset API in the fashion below:
iterator =
tf.data.Iterator.from_structure(training_dataset.output_types,
training_dataset.output_shapes)
next_element = iterator.get_next()
training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)
for _ in range(20):
# Initialize an iterator over the training dataset.
sess.run(training_init_op)
for _ in range(100):
sess.run(next_element
# Initialize an iterator over the validation dataset.
sess.run(validation_init_op)
for _ in range(50):
sess.run(next_element)
I am saving a .meta and .data file. How do I use these to create a prediction script? I am unable to extract operations from the graph and feed in desired vales an there are no placeholders defined. One way would be to use the same script and use test data, but there must be a better way?
Thanks

Tensorflow Dataset API doubles graph protobuff filesize

Summary: Using the new tf.contrib.data.Dataset doubles the size of my graph protobuff file and I'm unable to visualize the graph in Tensorboard.
The details:
I'm trying out the new TensorFlow tf.contrib.data.Dataset functionality together with the tf.contrib.learn.Experiment framework. My input data is defined as input functions which return tensors of features and labels.
If I create my input function with the tf.train.slice_input_producer function like in the following codeblock (full code here), then my resulting graph.pbtxt file is 620M and the .meta files are around 165M in size.
def train_inputs():
with tf.name_scope('Training_data'):
x = tf.constant(mnist.train.images.reshape([-1, 28, 28, 1]))
y = tf.constant(mnist.train.labels)
sliced_input = tf.train.slice_input_producer(
tensor_list=[x, y], shuffle=True)
return tf.train.shuffle_batch(
sliced_input, batch_size=batch_size,
capacity=10000, min_after_dequeue=batch_size*10)
Now if I create my input function with the new tf.contrib.data.Dataset.from_tensor_slices like in the following codeblock (full code here), then my resulting graph.pbtxt file doubles in size to 1.3G and the .meta files double in size to 330M.
def train_inputs():
with tf.name_scope('Training_data'):
images = mnist.train.images.reshape([-1, 28, 28, 1])
labels = mnist.train.labels
dataset = tf.contrib.data.Dataset.from_tensor_slices(
(images, labels))
dataset = dataset.repeat(None) # Infinite
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
next_example, next_label = iterator.get_next()
return next_example, next_label
Now because the graph.pbtxt file is so big TensorBoard takes ages to parse this file, and I'm unable to debug my model graph visually.
I found in the Dataset documentation that this increase in size comes from: "the contents of the array will be copied multiple times" and the solution would be to use placeholders. However, in this case, I would need to feed in the numpy arrays into the placeholders with an active session to initialize the iterator:
sess.run(iterator.initializer, feed_dict={features_placeholder: features, labels_placeholder: labels})
This seems, however, to be out of my control when using the tf.contrib.learn.Experiment framework.
How can I initialize the iterator's initialiser with the Experiment framework? Or find a workaround to using the Dataset API without increasing my graph size?

I found a solution to my problem using tf.train.SessionRunHook. I create a SessionRunHook object that initialises the iterator after the session is created:
class IteratorInitializerHook(tf.train.SessionRunHook):
def __init__(self):
super(IteratorInitializerHook, self).__init__()
self.iterator_initiliser_func = None
def after_create_session(self, session, coord):
self.iterator_initiliser_func(session)
The initializer function is set when creating the Dataset Iterator:
iterator_initiliser_hook.iterator_initiliser_func = \
lambda sess: sess.run(
iterator.initializer,
feed_dict={images_placeholder: images,
labels_placeholder: labels})
And I pass in the hook objects to train_monitors and eval_hooks parameters of tf.contrib.learn.Experiment.
The resulting graph.pbtxt file is now only 500K while the .meta files are only 244K.
Full example here.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Integrating directory of TFRecord examples into model training - python

Related

tf.data.Dataset feedable iterator for training and inference

How to run prediction (using image as input) for a saved model?

Train test dataset in Data Pipeline

Predicting single images with tensorflow dataset api

Tensorflow Dataset API doubles graph protobuff filesize

Categories

Resources