tf.data.Dataset feedable iterator for training and inference

I have a TensorFlow model that uses tf.data.Dataset feedable iterators to switch between training and validation. Both datasets share the same structure: a features matrix and the corresponding labels vector. To use the same model and iterator for inference (features matrix only, no labels vector), I would have to supply a dummy all-zero labels vector. Is there a more efficient and elegant way to use the Dataset API for both training (validation) and inference?
In code:
training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
validation_dataset = tf.data.Dataset.from_tensor_slices((X_validation, y_validation))
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()
Features and labels are used inside the model as inputs, in place of placeholders.
In order to switch between datasets I need to create one iterator for each dataset:
training_iterator = training_dataset.make_initializable_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()
Then create the handles:
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())
And use the handle to select which dataset to use, for example:
sess.run([features, labels], feed_dict={handle: training_handle})
Now, what happens if I have inference data with no labels?
inference_dataset = tf.data.Dataset.from_tensor_slices(X_inference) # NO y values
inference_iterator = inference_dataset.make_initializable_iterator()
If I add this iterator it will throw an exception: "Number of components does not match: expected 2 types but got 1."
Any suggestions?
The post "How to use tf.Dataset design in both training and inferring?" is related to this question, but tf.data.Dataset does not have an unzip method.
What are the best practices for this problem?

In your graph code, I assume you are extracting a value for the labels y from the dataset, right? At inference time that dependency is probably still baked into the TensorFlow graph.
You have a few choices here. Probably the easiest solution is to recreate the graph from code (run your build_graph() function, then load the weights using something like saver.restore(sess, "/tmp/model.ckpt")). If you do it this way you can re-create the graph without the labels y. I assume there are no other dependencies on y (TensorBoard summaries sometimes add dependencies, so check those too). Your problem should now be solved.
However, now that I've written the above (which I'll leave as-is because it's still useful information), I realize you might not even need that. At inference time you should not be using the labels anywhere (again, double-check TensorBoard summaries). If you don't need y, TensorFlow should not run any of the operations that use y. Double-check that you are not asking TensorFlow to use your labels anywhere at inference time. Note, however, that a feedable iterator declared with two components still expects every handle you feed it to have that same structure, which is what the "expected 2 types but got 1" error is about.

I think the first solution proposed by David Parks looks like this, and I think it is better than messing with tf.cond in the code.
import tensorflow as tf
import numpy as np
def build_model(features, labels=None, train=False):
    linear_model = tf.layers.Dense(units=1)
    y_pred = linear_model(features)
    if train:
        loss = tf.losses.mean_squared_error(labels=labels, predictions=y_pred)
        optimizer = tf.train.GradientDescentOptimizer(1e-4)
        train = optimizer.minimize(loss)
        return train, loss
    else:
        return y_pred
X_train = np.random.random(100).reshape(-1, 1)
y_train = np.random.random(100).reshape(-1, 1)
training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
training_dataset = training_dataset.batch(10)
training_dataset = training_dataset.shuffle(20)
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()
training_iterator = training_dataset.make_one_shot_iterator()
train, loss = build_model(features, labels, train=True)
saver = tf.train.Saver()
init = tf.global_variables_initializer()
sess = tf.Session()
training_handle = sess.run(training_iterator.string_handle())
sess.run(init)
for i in range(10):
    _, loss_value = sess.run((train, loss), feed_dict={handle: training_handle})
    print(loss_value)
saver.save(sess, "tmp/model.ckpt")
sess.close()
tf.reset_default_graph()
X_test = np.random.random(10).reshape(-1, 1)
inference_dataset = tf.data.Dataset.from_tensor_slices(X_test)
inference_dataset = inference_dataset.batch(5)
handle = tf.placeholder(tf.string, shape=[])
iterator_inference = tf.data.Iterator.from_string_handle(handle, inference_dataset.output_types, inference_dataset.output_shapes)
inference_iterator = inference_dataset.make_one_shot_iterator()
features_inference = iterator_inference.get_next()
y_pred = build_model(features_inference)
saver = tf.train.Saver()
sess = tf.Session()
inference_handle = sess.run(inference_iterator.string_handle())
saver.restore(sess, "tmp/model.ckpt") # Restore variables from disk.
print(sess.run(y_pred, feed_dict={handle: inference_handle}))
sess.close()
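If rebuilding the graph for inference is not an option, the zero-labels idea from the question can be kept inside the input pipeline instead of the NumPy arrays. The following is only a sketch under assumptions: it uses the tf.compat.v1 graph API so it also runs on TensorFlow 2.x, the array shapes are made up, and the zero label is a dummy whose only purpose is to make the element structure match the training dataset.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style graph API, as used throughout this thread

# Made-up data (assumption): one feature column, one label column.
X_train = np.random.random((100, 1)).astype(np.float32)
y_train = np.random.random((100, 1)).astype(np.float32)
X_inference = np.random.random((10, 1)).astype(np.float32)

with tf.Graph().as_default():
    training_dataset = tf.data.Dataset.from_tensor_slices(
        (X_train, y_train)).batch(10)

    # Pair every inference example with a dummy zero label so that the
    # element structure becomes (features, labels), like the training set.
    inference_dataset = tf.data.Dataset.from_tensor_slices(X_inference)
    inference_dataset = inference_dataset.map(
        lambda x: (x, tf.zeros([1], dtype=tf.float32))).batch(10)

    handle = tf1.placeholder(tf.string, shape=[])
    iterator = tf1.data.Iterator.from_string_handle(
        handle,
        tf1.data.get_output_types(training_dataset),
        tf1.data.get_output_shapes(training_dataset))
    features, labels = iterator.get_next()

    inference_iterator = tf1.data.make_one_shot_iterator(inference_dataset)

    with tf1.Session() as sess:
        inference_handle = sess.run(inference_iterator.string_handle())
        # The same feedable iterator now accepts the inference handle.
        x_batch, y_batch = sess.run([features, labels],
                                    feed_dict={handle: inference_handle})
```

The dummy labels are produced lazily per element by map, so no full-size zero array has to be allocated up front; anything in the model that consumes labels should still be left out of the fetches at inference time.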

Related

Train test dataset in Data Pipeline

I am new to TensorFlow and am building a data pipeline in which I create two iterators, one for the train set and one for the test set, from TFRecord files. Training works fine, but a problem occurs when I feed the test set to the graph.
if __name__ == '__main__':
    X_video_train,X_audio_train,y = dataset('frame_sample/train.tfrecord')
    X_video_test,X_audio_test,y = dataset('frame_sample/test.tfrecord')
    #Input:Train Set
    logits_train = graph(X_video_train,X_audio_train,training=True)
    train = training(logits_train)
This code works fine: when I then call sess.run, the model trains, and using logits_train I get the training accuracy.
But to get the test accuracy, when I call
logits_test,y = graph(X_video_test,X_audio_test,training=False)
acc,predict_proba = evaluation(logits_test,y)
It gives me this error:
ValueError: Variable bidirectional_rnn/fw/fwd_lstm_1/kernel already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope?
Then I passed a train/test parameter to graph, which creates new variables for train and test. But I think that creates a whole new graph for the test set.
I am thinking of using variable scope reuse, but does that also create a new graph instead of getting the logits from the trained graph?
I just don't understand how to feed test data into the graph.

This error is thrown because you are redefining the graph in your test function.
Whether you are training or testing a model should not affect the graph. The graph should be defined once, with a placeholder as input. You can then populate this placeholder with either train or test data.
Some operations, such as batch normalization, change their behaviour when testing. If your model contains such ops, you should pass a boolean through your feed dictionary like so:
# Model definition
...
h = tf.layers.batch_normalization(h, training=is_training_pl)
...
# Training
_, l = sess.run([train_op, loss], {x_pl: x_train_batch,
                                   y_pl: y_train_batch,
                                   is_training_pl: True})
...
# Testing (the loss still needs the labels, so y_pl is fed here too)
l = sess.run(loss, {x_pl: x_test_batch,
                    y_pl: y_test_batch,
                    is_training_pl: False})
In the case you are using the new tf.data.Dataset API, here is an adapted code snippet using a feedable iterator:
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset ...
validation_dataset = tf.data.Dataset ...
# A feedable iterator is defined by a handle placeholder and its structure. We
# could use the `output_types` and `output_shapes` properties of either
# `training_dataset` or `validation_dataset` here, because they have
# identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next() # THIS WILL BE USED AS OUR INPUT
# You can use feedable iterators with a variety of different kinds of iterator
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()
# The `Iterator.string_handle()` method returns a tensor that can be evaluated
# and used to feed the `handle` placeholder.
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())
...
# Model definition
input = next_element
...
h = tf.layers.batch_normalization(h, training=is_training_pl)
...
# Training
_, l = sess.run([train_op, loss], {is_training_pl: True,
                                   handle: training_handle})
# Validation
sess.run(validation_iterator.initializer)
l = sess.run(loss, {is_training_pl: False,
                    handle: validation_handle})
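On the variable-scope question raised in the post: reusing a scope does not create a new graph. It adds a second set of ops to the same graph and makes them share the existing variables. A minimal sketch, assuming a hypothetical one-layer graph_fn in place of the asker's graph() and using the tf.compat.v1 API so it also runs on TensorFlow 2.x:

```python
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style graph API

def graph_fn(x):
    # Hypothetical stand-in for the asker's graph(): one linear layer.
    # AUTO_REUSE creates the variables on the first call and reuses
    # them on every later call.
    with tf1.variable_scope("model", reuse=tf1.AUTO_REUSE):
        w = tf1.get_variable("w", shape=[3, 1],
                             initializer=tf1.zeros_initializer())
        b = tf1.get_variable("b", shape=[1],
                             initializer=tf1.zeros_initializer())
    return tf.matmul(x, w) + b

with tf.Graph().as_default():
    x_train = tf1.placeholder(tf.float32, shape=[None, 3])
    x_test = tf1.placeholder(tf.float32, shape=[None, 3])
    logits_train = graph_fn(x_train)  # creates model/w and model/b
    logits_test = graph_fn(x_test)    # reuses them, no "already exists" error
    variable_names = sorted(v.name for v in tf1.global_variables())
    same_graph = logits_train.graph is logits_test.graph
```

Both logits tensors live in one graph and read the same two variables, so the test branch sees whatever the training branch has learned. Defining the model once on a single input, as the answer suggests, avoids the duplicate ops altogether.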

Inference with a model trained with tf.Dataset

I have trained a model using the tf.data.Dataset API, so my training code looks something like this
with graph.as_default():
    dataset = tf.data.TFRecordDataset(tfrecord_path)
    dataset = dataset.map(scale_features, num_parallel_calls=n_workers)
    dataset = dataset.shuffle(10000)
    dataset = dataset.padded_batch(batch_size, padded_shapes={...})
    handle = tf.placeholder(tf.string, shape=[])
    iterator = tf.data.Iterator.from_string_handle(handle,
                                                   dataset.output_types,
                                                   dataset.output_shapes)
    batch = iterator.get_next()
    ...
    # Model code
    ...
    train_iterator = dataset.make_initializable_iterator()
with tf.Session(graph=graph) as sess:
    train_handle = sess.run(train_iterator.string_handle())
    sess.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        sess.run(train_iterator.initializer)
        while True:
            try:
                sess.run(optimizer, feed_dict={handle: train_handle})
            except tf.errors.OutOfRangeError:
                break
Now that the model is trained, I want to run inference on examples that are not in the datasets, and I am not sure how to go about it.
Just to be clear, I know how to use another dataset: for example, I just pass a handle to my test set when testing.
The question is this: given the scaling scheme and the fact that the network expects a handle, how would I make a prediction for a new example that is not written to a TFRecord?
If I modified the batch directly, I would be responsible for the scaling beforehand, which is something I would like to avoid if possible.
So how should I run inference on single examples with a model trained the tf.data.Dataset way?
(This is not for production purposes it is for evaluating what will happen if I change specific features)

Actually, when you use the Dataset API there is a tensor named "IteratorGetNext:0" in the graph, so you can set the input directly in the following way:
# Get the input tensor from the graph
input = graph.get_tensor_by_name("IteratorGetNext:0")
# Define the target tensor you want to evaluate for your prediction
predictions = ...
# Finally, call the session to run it
sess.run(predictions, feed_dict={input: np.asanyarray(images), ...})
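A small end-to-end sketch of that idea follows. It rests on assumptions: the "model" is a hypothetical op that doubles its input, the data is made up, and the tf.compat.v1 API is used so the snippet also runs on TensorFlow 2.x. The tensor is looked up through features.name rather than a hard-coded string, because the exact op name can depend on how the graph was built.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style graph API

X = np.arange(6, dtype=np.float32).reshape(-1, 1)  # made-up training data

with tf.Graph().as_default() as graph:
    dataset = tf.data.Dataset.from_tensor_slices(X).batch(2)
    iterator = tf1.data.make_initializable_iterator(dataset)
    features = iterator.get_next()
    # Hypothetical "model": doubles its input.
    predictions = tf.multiply(features, 2.0, name="pred")

    # Look the input tensor up by name, as one would in a restored graph.
    input_tensor = graph.get_tensor_by_name(features.name)

    with tf1.Session() as sess:
        new_example = np.array([[10.0]], dtype=np.float32)
        # Feeding the get_next() output overrides it, so the iterator
        # itself never runs and does not need to be initialized.
        result = sess.run(predictions,
                          feed_dict={input_tensor: new_example})
```

Feeding this tensor bypasses everything upstream of it, including any map step such as scale_features, so the array passed in must already be in the form the model expects.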

Predicting single images with tensorflow dataset api

I am trying to create a prediction script using the tensorflow dataset api. Previously I did this using the low-level API and feed_dict:
#import graph
saver = tf.train.import_meta_graph('...')
# Select variables to feed
x = graph.get_tensor_by_name("X:0")
predictions = graph.get_tensor_by_name("pred:0")
with tf.Session() as sess:
    p = sess.run(predictions, feed_dict={x:x_feed})
Now I am using the dataset API in the fashion below:
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()
training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)
for _ in range(20):
    # Initialize an iterator over the training dataset.
    sess.run(training_init_op)
    for _ in range(100):
        sess.run(next_element)
    # Initialize an iterator over the validation dataset.
    sess.run(validation_init_op)
    for _ in range(50):
        sess.run(next_element)
I am saving a .meta and a .data file. How do I use these to create a prediction script? I am unable to extract operations from the graph and feed in the desired values, and there are no placeholders defined. One way would be to reuse the same script with test data, but there must be a better way?
Thanks

tensorflow dataset tf.estimator.inputs.numpy_input_fn

I'm writing code that reads images and labels from disk in TensorFlow and then calls tf.estimator.inputs.numpy_input_fn. How can I pass the whole dataset instead of a single image? My code looks like this:
filenames = tf.constant(filenames)
labels = tf.constant(labels)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
dataset_batched = dataset.batch(10)
iterator = dataset_batched.make_one_shot_iterator()
features, labels = iterator.get_next()
with tf.Session() as sess:
    print(dataset_batched)
    print(np.shape(sess.run(features)))
    print(np.shape(sess.run(labels)))
    mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_mk, model_dir=dir)
    train_input_fn = tf.estimator.inputs.numpy_input_fn(x={"x": np.array(sess.run(features))},
                                                        y=np.array(sess.run(labels)),
                                                        batch_size=1,
                                                        num_epochs=None,
                                                        shuffle=False)
    mnist_classifier.train(input_fn=train_input_fn, steps=1)
My question is: how can I pass the dataset here, in x={"x": np.array(sess.run(features))}?

There is no need for numpy_input_fn here. You should wrap the code at the top into a function (say, my_input_fn) that returns iterator.get_next(), and then pass input_fn=my_input_fn to the train call. This would pass the full dataset to the training code in batches of 10.
numpy_input_fn is for when you have the full dataset available in an array already and want a quick way to do batching/shuffling/repeating etc.
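The wrapping described above might look roughly like the sketch below. Several things here are assumptions: the in-memory arrays stand in for the image files read from disk, _parse_function is a hypothetical pass-through rather than the asker's decoder, and the tf.compat.v1 API is used so the snippet also runs on TensorFlow 2.x. The function is exercised directly in a session here instead of through an Estimator.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style graph API

# Stand-in data (assumption): the original code reads image files from disk.
images = np.random.random((25, 28, 28, 1)).astype(np.float32)
labels = np.random.randint(0, 10, size=25).astype(np.int32)

def _parse_function(image, label):
    # Hypothetical pass-through; the asker's version decodes an image file.
    return {"x": image}, label

def my_input_fn():
    # All dataset ops are created inside the function, so they end up in
    # whichever graph is the default one when the function is called.
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    dataset = dataset.map(_parse_function).batch(10)
    iterator = tf1.data.make_one_shot_iterator(dataset)
    return iterator.get_next()  # (features dict, labels)

# Quick check outside of an Estimator: pull one batch.
with tf.Graph().as_default():
    features, batch_labels = my_input_fn()
    with tf1.Session() as sess:
        x_batch, y_batch = sess.run([features["x"], batch_labels])
```

With an Estimator the function itself is passed, not its result, as in mnist_classifier.train(input_fn=my_input_fn), so that the Estimator can call it inside its own graph.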

Integrating directory of TFRecord examples into model training

What is the most efficient way to feed data from multiple TFRecord files for purposes of training a Tensorflow model? With my current process, I iterate over the examples from TFRecords, separately extracting examples into Python variables, but I don't believe this is the proper way to do this.
I am migrating from Keras to Tensorflow hoping to see some speed improvements in my workflow. Towards that end, I've moved my data into TFRecord, and now I am trying to understand how to run basic linear regression models with a directory of TFRecord files. I have gotten to the point where I can read the TFRecord out into a Tensor and train in batches like so (code is taken from the Tensorflow getting started example and then modified):
# Model parameters
W = tf.Variable([.1], dtype=tf.float32)
b = tf.Variable([.1], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)
# Parses a serialized `example_proto` into a pair of scalar floats:
# the first entry of "X" and the first entry of "Y".
def _parse_function(example_proto):
    keys_to_features = {
        "X": tf.FixedLenFeature([40], tf.float32),
        "Y": tf.FixedLenFeature([10], tf.float32)
    }
    example = tf.parse_single_example(example_proto, keys_to_features)
    return example["X"][0], example["Y"][0]
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()
dataset = dataset.batch(1024)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
sess.run(iterator.initializer, feed_dict = { filenames: training_filenames })
for i in range(10):
    x_train, y_train = sess.run(iterator.get_next())
    sess.run(train, {x: x_train, y: y_train})
My problem is that I do not believe this follows the intended, most efficient dataset workflow possible with Tensorflow. In particular, what is the point of extracting the data from binary into a python variable and then feeding it into the training process? (the line below)
x_train, y_train = sess.run(iterator.get_next())
I was under the impression there should be a way that feeds the binary data into the session for training more directly, but after reading the TF tutorials, examples, and other stack overflow posts, I am not finding anything.

The Dataset API is very versatile and flexible. It can be used to pull batches into Python and feed them back through feed_dict, as you did. However, a better way is to incorporate the dataset into the graph so that everything is processed in one pass.
def model_function(input, label):
    # Model parameters (same initial values as in the question)
    W = tf.Variable([.1], dtype=tf.float32)
    b = tf.Variable([.1], dtype=tf.float32)
    # Model input and output
    x = input
    linear_model = W*x + b
    y = label
    # loss
    loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
    # optimizer
    optimizer = tf.train.GradientDescentOptimizer(0.1)
    train = optimizer.minimize(loss)
    return train
# ... previous dataset-related code ...
iterator = dataset.make_initializable_iterator()
next_example, next_label = iterator.get_next()
train_op = model_function(next_example, next_label)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # The filenames placeholder is consumed by the iterator initializer.
    sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
    for steps in range(1000):
        _ = sess.run([train_op])
In this way the dataset operations are part of the main graph. This would also use the queuing structure of the dataset better. Since only one sess.run is used, the overhead of the run function is minimised.
For more information have a look at this part of the documentation: Importing data | Tensorflow 1.4
If you need training filenames that are only known at run time, you have to pass them through that placeholder in feed_dict. However, I would advise against that. Filenames are rather static. I would use a resources file such as config.py and place all the config properties in that file. The filenames are then loaded at graph construction.
To specify the filenames, there are two approaches.
The first one:
...
filenames = tf.constant(["filename1.tfrecords", "filename2.tfrecords"], dtype=tf.string)
dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
...
Alternatively, a cleaner approach is to create a new directory called resources in the main folder, and place an empty __init__.py file and a config.py file inside it.
Inside config.py:
# resources/config.py
FILENAMES = ["filename1.tfrecord", "filename2.tfrecord"]
Inside the main TensorFlow file where the dataset is created:
# main TensorFlow file
from resources import config
...
filenames = tf.constant(config.FILENAMES, dtype=tf.string)
dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
...
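Since the question is about a whole directory of TFRecord files, the FILENAMES list in config.py could also be built from the directory contents instead of being typed out. This is a sketch using only the standard library; the helper name list_tfrecords and the "*.tfrecord" pattern are assumptions, and the demo uses empty placeholder files in a temporary directory rather than real records.

```python
import glob
import os
import tempfile

def list_tfrecords(directory, pattern="*.tfrecord"):
    """Return the sorted paths of all files in `directory` matching `pattern`."""
    return sorted(glob.glob(os.path.join(directory, pattern)))

# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as data_dir:
    for name in ("b.tfrecord", "a.tfrecord", "notes.txt"):
        open(os.path.join(data_dir, name), "w").close()
    FILENAMES = list_tfrecords(data_dir)
    basenames = [os.path.basename(path) for path in FILENAMES]
```

Sorting keeps the file order deterministic between runs; the resulting list of paths can be passed to tf.constant exactly like the hard-coded one.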
