Avoiding tf.data.Dataset.from_tensor_slices with estimator api - python

I am trying to figure out the recommended way to use the dataset API together with the estimator API. Everything I have seen online is some variation of this:
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    return dataset
which can then be passed to the estimator's train function:
classifier.train(
    input_fn=train_input_fn,
    #...
)
but the dataset guide warns that:
the above code snippet will embed the features and labels arrays in your TensorFlow graph as tf.constant() operations. This works well for a small dataset, but wastes memory---because the contents of the array will be copied multiple times---and can run into the 2GB limit for the tf.GraphDef protocol buffer.
and then describes a method that involves defining placeholders, which are then filled via feed_dict:
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)
dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))

sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
But if you're using the estimator api, you're not manually running the session. So how do you use the dataset api with estimators while avoiding the problems associated with from_tensor_slices()?

To use either an initializable or a reinitializable iterator, you must create a class that inherits from tf.train.SessionRunHook, which has access to the session at several points during the training and evaluation steps.
You can then use this new class to initialize the iterator as you would normally do in a classic setting. You simply need to pass this newly created hook to the training/evaluation functions or to the correct train spec.
Here is a quick example that you can adapt to your needs:
class IteratorInitializerHook(tf.train.SessionRunHook):
    def __init__(self):
        super(IteratorInitializerHook, self).__init__()
        self.iterator_initializer_func = None  # Will be set in the input_fn

    def after_create_session(self, session, coord):
        # Initialize the iterator with the data feed_dict
        self.iterator_initializer_func(session)


def get_inputs(X, y):
    iterator_initializer_hook = IteratorInitializerHook()

    def input_fn():
        X_pl = tf.placeholder(X.dtype, X.shape)
        y_pl = tf.placeholder(y.dtype, y.shape)

        dataset = tf.data.Dataset.from_tensor_slices((X_pl, y_pl))
        dataset = ...
        ...

        iterator = dataset.make_initializable_iterator()
        next_example, next_label = iterator.get_next()

        iterator_initializer_hook.iterator_initializer_func = lambda sess: sess.run(
            iterator.initializer, feed_dict={X_pl: X, y_pl: y})

        return next_example, next_label

    return input_fn, iterator_initializer_hook

...

train_input_fn, train_iterator_initializer_hook = get_inputs(X_train, y_train)
test_input_fn, test_iterator_initializer_hook = get_inputs(X_test, y_test)

...

estimator.train(input_fn=train_input_fn,
                hooks=[train_iterator_initializer_hook])  # Don't forget to pass the hook!
estimator.evaluate(input_fn=test_input_fn,
                   hooks=[test_iterator_initializer_hook])
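The same hooks also plug into tf.estimator.train_and_evaluate if you prefer that entry point. A minimal sketch, assuming the estimator and the input functions/hooks built above (max_steps is an arbitrary placeholder):

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                    max_steps=10000,  # placeholder value
                                    hooks=[train_iterator_initializer_hook])
eval_spec = tf.estimator.EvalSpec(input_fn=test_input_fn,
                                  hooks=[test_iterator_initializer_hook])
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)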

Related

VGG16 preprocessing dataset generator to dataset mapping

I have a VGG16 model implemented with Keras/tensorflow.
When I call model.fit, I pass in a generator of data. The generator does transforms necessary for a VGGNet:
Preprocess the images with vgg16.preprocess_input
Convert the label to a one-hot vector via to_categorical
The generator can be seen below and works. Unfortunately, since there are multiple epochs, I have to set dataset.repeat(-1) (repeat infinitely) so the generator doesn't run out. This in turn requires passing steps_per_epoch so a given iteration of training can complete. As you're probably thinking, this is brittle (it hinges on knowing the dataset's cardinality)!
I have decided it's best to preprocess the training Dataset once up front using Dataset.map. However, I am struggling with the construction of the mapping function; it seems to_categorical doesn't work with a tf.Tensor. Below is what I have right now, but I am not sure whether there's a latent bug.
How can I correctly translate the below Dataset generator into a Dataset.map function?
Current Dataset Generator
This is implemented (and known to work) with Python 3.8 and tensorflow==2.4.4.
from typing import Iterable, Tuple

import numpy as np
import tensorflow as tf


def make_vgg_preprocessing_generator(
    dataset: tf.data.Dataset, num_repeat: int = -1
) -> Iterable[Tuple[tf.Tensor, np.ndarray]]:
    num_classes = len(dataset.class_names)
    for batch_images, batch_labels in dataset.repeat(num_repeat):
        pre_images = tf.keras.applications.vgg16.preprocess_input(batch_images)
        pre_labels = tf.keras.utils.to_categorical(batch_labels, num_classes)
        yield pre_images, pre_labels
train_ds: tf.data.Dataset # Not provided in this sample
model.fit(
    make_vgg_preprocessing_generator(train_ds),
    epochs=10,
    steps_per_epoch=10,  # Required since the default num_repeat repeats indefinitely
)
Dataset.map Function
Here is my current translation that I would like to improve.
def vgg_preprocess_dataset(dataset: tf.data.Dataset) -> tf.data.Dataset:
    num_classes = len(dataset.class_names)

    def _preprocess(x: tf.Tensor, y: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
        pre_x = tf.keras.applications.vgg16.preprocess_input(x)
        pre_y = tf.one_hot(y, depth=num_classes)
        return pre_x, pre_y

    return dataset.map(_preprocess)
Yes, you're on the right track! You'll want to replace to_categorical with tf.one_hot, just as you have, since tf.one_hot operates on tensors and is designed for exactly this context. Next, you might want to play around with some of the other tf.data.Dataset methods and add them to your pipeline. Right now your batch size will be a single sample, and the data is un-shuffled. An example of some other processing you might do:
def vgg_preprocess_dataset(dataset: tf.data.Dataset, batch_size=32, shuffle_buffer=1000) -> tf.data.Dataset:
    num_classes = len(dataset.class_names)

    def _preprocess(x: tf.Tensor, y: tf.Tensor):
        pre_x = tf.keras.applications.vgg16.preprocess_input(x)
        pre_y = tf.one_hot(y, depth=num_classes)
        # pre_y = to_categorical(y, num_classes)
        return pre_x, pre_y

    # A bigger buffer shuffles better but is slower
    dataset = dataset.shuffle(shuffle_buffer)
    # Do your mapping after the shuffle
    dataset = dataset.map(_preprocess)
    # Next, batch it
    dataset = dataset.batch(batch_size)
    # This allows your CPU to fetch the next batch (i.e. do the above shuffling,
    # mapping, etc.) during the current GPU pass, so that the GPU has minimal downtime
    dataset = dataset.prefetch(2)
    return dataset


ds = vgg_preprocess_dataset(ds)
# and you just pass it right to fit!
model.fit(ds)
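As a small follow-up, the map and prefetch steps can also be tuned automatically. A sketch under the stated tensorflow==2.4.4 assumption, where tf.data.AUTOTUNE is available:

# Let tf.data pick the map parallelism and prefetch buffer size at runtime
dataset = dataset.map(_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)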

tf.data.Dataset feedable iterator for training and inference

I have a TensorFlow model that uses tf.data.Dataset feedable iterators to switch between training and validation. Both datasets share the same structure, that is, they have a features matrix and the corresponding labels vector. In order to use the same model and iterator for inference (no labels vector, only a features matrix), I ideally need to supply a zero labels vector. Is there a more efficient and elegant way to use the Dataset API for both training (validation) and inference?
In code:
training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
validation_dataset = tf.data.Dataset.from_tensor_slices((X_validation, y_validation))

handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()
Features and labels are used inside the model as input placeholders.
In order to switch between datasets I need to create one iterator for each dataset:
training_iterator = training_dataset.make_initializable_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()
then create the handles:
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())
And use the handle to select which dataset to use, for example:
sess.run(next_element, feed_dict={handle: training_handle})
Now, what happens if I have inference data with no labels?
inference_dataset = tf.data.Dataset.from_tensor_slices(X_inference)  # NO y values
inference_iterator = inference_dataset.make_initializable_iterator()
If I add this iterator, it will throw an exception because "Number of components does not match: expected 2 types but got 1."
Any suggestions?
This post How to use tf.Dataset design in both training and inferring? is related to this question, but tf.data.Dataset does not have an unzip method.
What are the best practices for this problem?
In your graph code I assume you are trying to extract a value for the labels y from the dataset, right? At inference time that was probably baked into the TensorFlow dependency graph.
You have a few choices here. Probably the easiest solution is to recreate the graph from code (run your build_graph() function, then load the weights using something like saver.restore(sess, "/tmp/model.ckpt")). If you do it this way you can re-create the graph without the labels y. I assume there are no other dependencies on y (sometimes TensorBoard summaries add dependencies you need to check too). Your problem should now be solved.
However, now that I've written the above comment (which I'll leave as-is because it's still useful information), I realize you might not even need that. At inference time you should not be using the labels anywhere (again, double-check TensorBoard summaries). If you don't need y, then TensorFlow should not run any of the operations that use y. This should include not trying to extract them from the dataset. Double-check that you are not asking TensorFlow to use your labels anywhere at inference time.
I think that the first solution proposed by David Parks looks like this, and I think it is better than messing with tf.cond in the code.
import tensorflow as tf
import numpy as np


def build_model(features, labels=None, train=False):
    linear_model = tf.layers.Dense(units=1)
    y_pred = linear_model(features)
    if train:
        loss = tf.losses.mean_squared_error(labels=labels, predictions=y_pred)
        optimizer = tf.train.GradientDescentOptimizer(1e-4)
        train = optimizer.minimize(loss)
        return train, loss
    else:
        return y_pred


X_train = np.random.random(100).reshape(-1, 1)
y_train = np.random.random(100).reshape(-1, 1)

training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
training_dataset = training_dataset.batch(10)
training_dataset = training_dataset.shuffle(20)

handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()

training_iterator = training_dataset.make_one_shot_iterator()

train, loss = build_model(features, labels, train=True)
saver = tf.train.Saver()
init = tf.global_variables_initializer()

sess = tf.Session()
training_handle = sess.run(training_iterator.string_handle())
sess.run(init)

for i in range(10):
    _, loss_value = sess.run((train, loss), feed_dict={handle: training_handle})
    print(loss_value)

saver.save(sess, "tmp/model.ckpt")
sess.close()

tf.reset_default_graph()

X_test = np.random.random(10).reshape(-1, 1)

inference_dataset = tf.data.Dataset.from_tensor_slices(X_test)
inference_dataset = inference_dataset.batch(5)

handle = tf.placeholder(tf.string, shape=[])
iterator_inference = tf.data.Iterator.from_string_handle(
    handle, inference_dataset.output_types, inference_dataset.output_shapes)
inference_iterator = inference_dataset.make_one_shot_iterator()
features_inference = iterator_inference.get_next()

y_pred = build_model(features_inference)
saver = tf.train.Saver()

sess = tf.Session()
inference_handle = sess.run(inference_iterator.string_handle())
saver.restore(sess, "tmp/model.ckpt")  # Restore variables from disk.
print(sess.run(y_pred, feed_dict={handle: inference_handle}))
sess.close()

Train test dataset in Data Pipeline

I am new to TensorFlow. I am building a data pipeline in which I built two iterators for the train and test sets from a TFRecord. Training works fine, but the problem occurs when feeding the test set to the graph.
if __name__ == '__main__':
    X_video_train, X_audio_train, y = dataset('frame_sample/train.tfrecord')
    X_video_test, X_audio_test, y = dataset('frame_sample/test.tfrecord')

    # Input: Train Set
    logits_train = graph(X_video_train, X_audio_train, training=True)
    train = training(logits_train)
This code works just fine; after this, when I call sess.run and train it, the model trains, and by using the logits from logits_train I get the training accuracy.
But when, to get the test accuracy, I call
logits_test, y = graph(X_video_test, X_audio_test, training=False)
acc, predict_proba = evaluation(logits_test, y)
it gives me the error:
ValueError: Variable bidirectional_rnn/fw/fwd_lstm_1/kernel already
exists, disallowed. Did you mean to set reuse=True or
reuse=tf.AUTO_REUSE in VarScope? :
Then I passed a train/test parameter to graph, which creates a new set of variables for train and test. But I think that creates a whole new graph for the test set.
I am thinking of using variable scope reuse, but does that also create a new graph instead of getting the logits from the trained graph?
I just don't understand how to feed test data into the graph.
This error is thrown because you are redefining the graph in your test function.
Whether you are training or testing a model should not be related to the graph. The graph should be defined once, with a placeholder as input. Then you can populate this placeholder with either train or test data.
Some operations, like batch normalization, change their behaviour when testing. If your model contains these ops, you should pass a boolean to your feed dictionary like so:
# Model definition
...
h = tf.layers.batch_normalization(h, training=is_training_pl)
...

# Training
_, l = sess.run([train_op, loss], {x_pl: x_train_batch,
                                   y_pl: y_train_batch,
                                   is_training_pl: True})
...

# Testing
l = sess.run(loss, {x_pl: x_test_batch,
                    y_pl: y_test_batch,
                    is_training_pl: False})
In case you are using the new tf.data.Dataset API, here is an adapted code snippet using a feedable iterator:
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset ...
validation_dataset = tf.data.Dataset ...

# A feedable iterator is defined by a handle placeholder and its structure. We
# could use the `output_types` and `output_shapes` properties of either
# `training_dataset` or `validation_dataset` here, because they have
# identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()  # THIS WILL BE USED AS OUR INPUT

# You can use feedable iterators with a variety of different kinds of iterator
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

# The `Iterator.string_handle()` method returns a tensor that can be evaluated
# and used to feed the `handle` placeholder.
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())
...

# Model definition
input = next_element
...
h = tf.layers.batch_normalization(h, training=is_training_pl)
...

# Training
_, l = sess.run([train_op, loss], {is_training_pl: True,
                                   handle: training_handle})

# Validation
sess.run(validation_iterator.initializer)
l = sess.run(loss, {is_training_pl: False,
                    handle: validation_handle})

Inference with a model trained with tf.Dataset

I have trained a model using the tf.data.Dataset API, so my training code looks something like this
with graph.as_default():
    dataset = tf.data.TFRecordDataset(tfrecord_path)
    dataset = dataset.map(scale_features, num_parallel_calls=n_workers)
    dataset = dataset.shuffle(10000)
    dataset = dataset.padded_batch(batch_size, padded_shapes={...})

    handle = tf.placeholder(tf.string, shape=[])
    iterator = tf.data.Iterator.from_string_handle(handle,
                                                   train_dataset.output_types,
                                                   train_dataset.output_shapes)
    batch = iterator.get_next()

    ...
    # Model code
    ...

    iterator = dataset.make_initializable_iterator()

with tf.Session(graph=graph) as sess:
    train_handle = sess.run(iterator.string_handle())
    sess.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        sess.run(train_iterator.initializer)
        while True:
            try:
                sess.run(optimizer, feed_dict={handle: train_handle})
            except tf.errors.OutOfRangeError:
                break
Now, after the model is trained, I want to run inference on examples that are not in the datasets, and I am not sure how to go about doing it.
Just to be clear, I know how to use another dataset; for example, I just pass a handle to my test set when testing.
The question is: given the scaling scheme and the fact that the network expects a handle, if I want to make a prediction for a new example that is not written to a TFRecord, how would I go about doing that?
If I modified the batch I'd be responsible for the scaling beforehand, which is something I would like to avoid if possible.
So how should I run inference on single examples with a model trained the tf.data.Dataset way?
(This is not for production purposes; it is for evaluating what will happen if I change specific features.)
Actually, when you use the Dataset API there is a tensor in the graph named "IteratorGetNext:0", so you can set the input directly in the following way:

# Get the input tensor from the graph
input = graph.get_tensor_by_name("IteratorGetNext:0")
# Define the target tensor you want to evaluate for your prediction
predictions = ...
# Finally, call session.run
sess.run(predictions, feed_dict={input: np.asanyarray(images), ...})
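Putting that together with restoring a trained checkpoint, a minimal sketch might look like the following. It assumes the model was saved with tf.train.Saver under "/tmp/model.ckpt", that the prediction op was given the name "predictions", and a 28x28x1 input shape; those names, the shape, and the manual preprocessing note are placeholders to adapt to your own graph:

import numpy as np
import tensorflow as tf

with tf.Session() as sess:
    # Rebuild the graph from the .meta file and restore the trained weights
    saver = tf.train.import_meta_graph("/tmp/model.ckpt.meta")  # assumed checkpoint path
    saver.restore(sess, "/tmp/model.ckpt")
    graph = tf.get_default_graph()

    # Feeding "IteratorGetNext:0" overrides the Dataset pipeline entirely, so any
    # preprocessing done in the pipeline (e.g. a scale_features map) must be applied by hand
    input_tensor = graph.get_tensor_by_name("IteratorGetNext:0")
    predictions = graph.get_tensor_by_name("predictions:0")  # assumed op name

    new_example = np.zeros((1, 28, 28, 1), dtype=np.float32)  # assumed input shape
    print(sess.run(predictions, feed_dict={input_tensor: new_example}))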

Tensorflow Dataset API doubles graph protobuff filesize

Summary: Using the new tf.contrib.data.Dataset doubles the size of my graph protobuf file and I'm unable to visualize the graph in TensorBoard.
The details:
I'm trying out the new TensorFlow tf.contrib.data.Dataset functionality together with the tf.contrib.learn.Experiment framework. My input data is defined as input functions which return tensors of features and labels.
If I create my input function with the tf.train.slice_input_producer function like in the following codeblock (full code here), then my resulting graph.pbtxt file is 620M and the .meta files are around 165M in size.
def train_inputs():
    with tf.name_scope('Training_data'):
        x = tf.constant(mnist.train.images.reshape([-1, 28, 28, 1]))
        y = tf.constant(mnist.train.labels)
        sliced_input = tf.train.slice_input_producer(
            tensor_list=[x, y], shuffle=True)
        return tf.train.shuffle_batch(
            sliced_input, batch_size=batch_size,
            capacity=10000, min_after_dequeue=batch_size*10)
Now if I create my input function with the new tf.contrib.data.Dataset.from_tensor_slices like in the following codeblock (full code here), then my resulting graph.pbtxt file doubles in size to 1.3G and the .meta files double in size to 330M.
def train_inputs():
    with tf.name_scope('Training_data'):
        images = mnist.train.images.reshape([-1, 28, 28, 1])
        labels = mnist.train.labels
        dataset = tf.contrib.data.Dataset.from_tensor_slices(
            (images, labels))
        dataset = dataset.repeat(None)  # Infinite
        dataset = dataset.shuffle(buffer_size=10000)
        dataset = dataset.batch(batch_size)
        iterator = dataset.make_one_shot_iterator()
        next_example, next_label = iterator.get_next()
        return next_example, next_label
Now because the graph.pbtxt file is so big TensorBoard takes ages to parse this file, and I'm unable to debug my model graph visually.
I found in the Dataset documentation that this increase in size comes from: "the contents of the array will be copied multiple times" and the solution would be to use placeholders. However, in this case, I would need to feed in the numpy arrays into the placeholders with an active session to initialize the iterator:
sess.run(iterator.initializer, feed_dict={features_placeholder: features, labels_placeholder: labels})
This seems, however, to be out of my control when using the tf.contrib.learn.Experiment framework.
How can I run the iterator's initializer with the Experiment framework? Or find a workaround for using the Dataset API without increasing my graph size?
I found a solution to my problem using tf.train.SessionRunHook. I create a SessionRunHook object that initialises the iterator after the session is created:
class IteratorInitializerHook(tf.train.SessionRunHook):
    def __init__(self):
        super(IteratorInitializerHook, self).__init__()
        self.iterator_initiliser_func = None

    def after_create_session(self, session, coord):
        self.iterator_initiliser_func(session)
The initializer function is set when creating the Dataset Iterator:
iterator_initiliser_hook.iterator_initiliser_func = \
    lambda sess: sess.run(
        iterator.initializer,
        feed_dict={images_placeholder: images,
                   labels_placeholder: labels})
And I pass the hook objects to the train_monitors and eval_hooks parameters of tf.contrib.learn.Experiment.
The resulting graph.pbtxt file is now only 500K while the .meta files are only 244K.
Full example here.
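For reference, the wiring into the Experiment constructor looks roughly like this. A sketch, assuming the estimator, the input functions, and the two hook instances are already defined (their names here are placeholders):

experiment = tf.contrib.learn.Experiment(
    estimator=estimator,
    train_input_fn=train_input_fn,
    eval_input_fn=eval_input_fn,
    train_monitors=[train_iterator_initializer_hook],  # hooks run during training
    eval_hooks=[eval_iterator_initializer_hook],       # hooks run during evaluation
)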
