Inference with a model trained with tf.data.Dataset - python

I have trained a model using the tf.data.Dataset API, so my training code looks something like this
with graph.as_default():
    train_dataset = tf.data.TFRecordDataset(tfrecord_path)
    train_dataset = train_dataset.map(scale_features, num_parallel_calls=n_workers)
    train_dataset = train_dataset.shuffle(10000)
    train_dataset = train_dataset.padded_batch(batch_size, padded_shapes={...})

    handle = tf.placeholder(tf.string, shape=[])
    iterator = tf.data.Iterator.from_string_handle(handle,
                                                   train_dataset.output_types,
                                                   train_dataset.output_shapes)
    batch = iterator.get_next()

    ...
    # Model code
    ...

    train_iterator = train_dataset.make_initializable_iterator()

with tf.Session(graph=graph) as sess:
    train_handle = sess.run(train_iterator.string_handle())
    sess.run(tf.global_variables_initializer())
    for epoch in range(n_epochs):
        sess.run(train_iterator.initializer)
        while True:
            try:
                sess.run(optimizer, feed_dict={handle: train_handle})
            except tf.errors.OutOfRangeError:
                break
Now that the model is trained, I want to run inference on examples that are not in the datasets, and I am not sure how to go about it.
Just to be clear, I know how to use another dataset; for example, I just pass a handle to my test set at test time.
The question is: given the scaling scheme and the fact that the network expects a handle, how would I make a prediction for a new example that is not written to a TFRecord?
If I fed the batch myself I would be responsible for the scaling beforehand, which is something I would like to avoid if possible.
So how should I infer single examples from a model trained the tf.data.Dataset way?
(This is not for production purposes; it is for evaluating what will happen if I change specific features.)

When you use the Dataset API there is actually a tensor in the graph named "IteratorGetNext:0", so you can feed your input to it directly:
# get the input tensor from the graph
input = graph.get_tensor_by_name("IteratorGetNext:0")
# define the target tensor you want to evaluate for your prediction
predictions = ...
# finally call session.run, feeding the input tensor directly
sess.run(predictions, feed_dict={input: np.asanyarray(images), ...})
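To make that concrete, here is a rough sketch for the setup in the question. Note that feeding "IteratorGetNext:0" bypasses the dataset pipeline, so the scaling done by scale_features has to be reproduced in plain Python first; scale_features_np, raw_example, predictions and the checkpoint path are assumed names, not part of the original code. Also, if the iterator yields several components (e.g. a dict from padded_batch), there is one tensor per component ("IteratorGetNext:0", "IteratorGetNext:1", ...).
import numpy as np
import tensorflow as tf

# Sketch only: assumes the training graph above, a saved checkpoint, and a
# plain-Python re-implementation of the scaling (scale_features_np);
# these are hypothetical names not present in the original code.
with graph.as_default():
    saver = tf.train.Saver()
batch_tensor = graph.get_tensor_by_name("IteratorGetNext:0")

with tf.Session(graph=graph) as sess:
    saver.restore(sess, "model.ckpt")                  # assumed checkpoint path
    example = scale_features_np(raw_example)           # same scaling as the pipeline
    prediction = sess.run(predictions,                 # your output tensor
                          feed_dict={batch_tensor: np.expand_dims(example, 0)})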

Related

tf.data.Dataset feedable iterator for training and inference

I have a TensorFlow model that uses tf.data.Dataset feedable iterators to switch between training and validation. Both datasets share the same structure, that is, they have a features matrix and the corresponding labels vector. In order to use the same model and iterator for inference (no labels vector, only a features matrix) I ideally need to supply a zero labels vector. Is there a more efficient and elegant way to use the Dataset API for both training (validation) and inference?
In code:
training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
validation_dataset = tf.data.Dataset.from_tensor_slices((X_validation, y_validation))
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()
Features and labels are used inside the model as input placeholders.
In order to switch between datasets I need to create one iterator for each dataset:
training_iterator = training_dataset.make_initializable_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()
then create the handles:
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())
And use the handle to select which dataset to use, for example:
sess.run(next_element, feed_dict={handle: training_handle})
Now, what happens if I have inference data with no labels?
inference_dataset = tf.data.Dataset.from_tensor_slices(X_inference) # NO y values
inference_iterator = inference_dataset.make_initializable_iterator()
If I add this iterator it will throw an exception, because "Number of components does not match: expected 2 types but got 1."
Any suggestions?
This post How to use tf.Dataset design in both training and inferring? is related to this question, but tf.data.Dataset does not have an unzip method.
What are the best practices for this problem?
From your graph code, I assume you are trying to extract a value for the labels y from the dataset, right? At inference time that is probably baked into the TensorFlow dependency graph.
You have a few choices here. Probably the easiest solution is to recreate the graph from code (run your build_graph() function, then load the weights using something like saver.restore(sess, "/tmp/model.ckpt")). If you do it this way you can re-create the graph without the labels y. I assume there are no other dependencies on y (sometimes TensorBoard summaries add dependencies you need to check too). Your problem should now be solved.
However, now that I've written the above comment (which I'll leave as-is because it's still useful information), I realize you might not even need that. At inference time you should not be using the labels anywhere (again, double check TensorBoard summaries). If you don't need y then TensorFlow should not run any of the operations that use y. This should include not trying to extract them from the dataset. Double check that you are not asking TensorFlow to use your labels anywhere at inference time.
I think the first solution proposed by David Parks looks like this, and I think it is better than messing with tf.cond in the code.
import tensorflow as tf
import numpy as np

def build_model(features, labels=None, train=False):
    linear_model = tf.layers.Dense(units=1)
    y_pred = linear_model(features)
    if train:
        loss = tf.losses.mean_squared_error(labels=labels, predictions=y_pred)
        optimizer = tf.train.GradientDescentOptimizer(1e-4)
        train = optimizer.minimize(loss)
        return train, loss
    else:
        return y_pred

# Training: build the graph with features and labels
X_train = np.random.random(100).reshape(-1, 1)
y_train = np.random.random(100).reshape(-1, 1)

training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
training_dataset = training_dataset.batch(10)
training_dataset = training_dataset.shuffle(20)

handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()

training_iterator = training_dataset.make_one_shot_iterator()

train, loss = build_model(features, labels, train=True)
saver = tf.train.Saver()
init = tf.global_variables_initializer()

sess = tf.Session()
training_handle = sess.run(training_iterator.string_handle())
sess.run(init)
for i in range(10):
    _, loss_value = sess.run((train, loss), feed_dict={handle: training_handle})
    print(loss_value)
saver.save(sess, "tmp/model.ckpt")
sess.close()

# Inference: rebuild the graph without labels and restore the trained weights
tf.reset_default_graph()

X_test = np.random.random(10).reshape(-1, 1)
inference_dataset = tf.data.Dataset.from_tensor_slices(X_test)
inference_dataset = inference_dataset.batch(5)

handle = tf.placeholder(tf.string, shape=[])
iterator_inference = tf.data.Iterator.from_string_handle(handle, inference_dataset.output_types, inference_dataset.output_shapes)
inference_iterator = inference_dataset.make_one_shot_iterator()
features_inference = iterator_inference.get_next()

y_pred = build_model(features_inference)
saver = tf.train.Saver()

sess = tf.Session()
inference_handle = sess.run(inference_iterator.string_handle())
saver.restore(sess, "tmp/model.ckpt")  # Restore variables from disk.
print(sess.run(y_pred, feed_dict={handle: inference_handle}))
sess.close()

How to feed scalar with settings while using Tensorflow Dataset API

I'm using the TF Dataset API with a placeholder for file names, which I feed when initializing the iterator (different files depending on whether it's the training or the validation set). I would also like to use an additional placeholder indicating whether we're training or validating (to control the dropout layers). However, I'm unable to feed values to this placeholder through the dataset initializer (which makes sense, since it is not part of the dataset). So how do I feed an additional variable while using the Dataset API?
Key code pieces:
filenames_placeholder = tf.placeholder(tf.string, shape=(None))
is_training = tf.placeholder(tf.bool, shape=())  # Error: You must feed a value for placeholder tensor 'Placeholder_1' with dtype bool

dataset = tf.data.TFRecordDataset(filenames_placeholder)
# (...) Many other dataset operations
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Model code using "next_element" as inputs, including the dropout layer at some point
# where I would like to let the model know if we're training or validating
tf.layers.dropout(x, training=is_training)

# Model execution
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(iterator.initializer, feed_dict={filenames_placeholder: training_files, is_training: True})
    # (...) Performing training
    sess.run(iterator.initializer, feed_dict={filenames_placeholder: training_files, is_training: False})
    # (...) Performing validation
What I do in this case is have an additional placeholder with a default value:
keep_prob = tf.placeholder_with_default(1.0, shape=())
And in the graph:
tf.layers.dropout(inputs, rate=1 - keep_prob, training=True)  # training=True so the rate tensor actually takes effect
Then while training:
sess.run(...,feed_dict={keep_prob:0.5})
While evaluating:
sess.run(...) # No feed_dict here since the keep_prob placeholder has a default value of 1
Note that feeding this additional float placeholder while training doesn't slow your training down at all.
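Putting the pieces together, here is a small self-contained sketch (TF 1.x) of that pattern; the toy data and layer sizes are made up purely for illustration. The point is just that dropout is driven by a placeholder_with_default, so evaluation runs need no feed_dict for it:
import numpy as np
import tensorflow as tf

# Toy data and pipeline, only to make the example runnable
X = np.random.random((100, 4)).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices(X).batch(10)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

keep_prob = tf.placeholder_with_default(1.0, shape=())
h = tf.layers.dense(next_element, 8, activation=tf.nn.relu)
h = tf.layers.dropout(h, rate=1.0 - keep_prob, training=True)  # rate is the drop probability
out = tf.layers.dense(h, 1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(iterator.initializer)
    print(sess.run(out, feed_dict={keep_prob: 0.5}))  # "training" run: dropout active
    print(sess.run(out))                              # "evaluation" run: keep_prob defaults to 1.0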

Train test dataset in Data Pipeline

I am new to TensorFlow. I am building a data pipeline in which I build two iterators, for the train and test sets, from TFRecord files. The training works fine, but the problem occurs when feeding the test set to the graph.
if __name__ == '__main__':
    X_video_train, X_audio_train, y = dataset('frame_sample/train.tfrecord')
    X_video_test, X_audio_test, y = dataset('frame_sample/test.tfrecord')

    # Input: Train Set
    logits_train = graph(X_video_train, X_audio_train, training=True)
    train = training(logits_train)
This code works just fine. After this, when I call sess.run and train it, it trains the model, and by using the logits from logits_train I get the train accuracy.
But when I try to get the test accuracy by calling
logits_test,y = graph(X_video_test,X_audio_test,training=False)
acc,predict_proba = evaluation(logits_test,y)
it gives me the error:
ValueError: Variable bidirectional_rnn/fw/fwd_lstm_1/kernel already
exists, disallowed. Did you mean to set reuse=True or
reuse=tf.AUTO_REUSE in VarScope? :
Then I passed a train/test parameter to graph, which creates separate variables for train and test. But I think that creates a whole new graph for the test set.
I am thinking of using variable scope reuse, but does that also create a new graph instead of getting the logits from the trained graph?
I just don't understand how to feed test data into the graph.
This error is thrown because you are redefining the graph in your test function.
Whether you are training or testing a model should not be reflected in the graph. The graph should be defined once, with a placeholder as input. Then you can populate this placeholder with either train or test data.
Some operations, like batch normalization, change their behaviour when testing. If your model contains these ops, you should pass a boolean to your feed dictionary, like so:
# Model definition
...
h = tf.layers.batch_normalization(h, training=is_training_pl)
...

# Training
_, l = sess.run([train_op, loss], {x_pl: x_train_batch,
                                   y_pl: y_train_batch,
                                   is_training_pl: True})
...

# Testing
l = sess.run(loss, {x_pl: x_test_batch,
                    y_pl: y_test_batch,
                    is_training_pl: False})
In the case you are using the new tf.data.Dataset API, here is an adapted code snippet using a feedable iterator:
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset ...
validation_dataset = tf.data.Dataset ...
# A feedable iterator is defined by a handle placeholder and its structure. We
# could use the `output_types` and `output_shapes` properties of either
# `training_dataset` or `validation_dataset` here, because they have
# identical structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()  # THIS WILL BE USED AS OUR INPUT

# You can use feedable iterators with a variety of different kinds of iterator
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

# The `Iterator.string_handle()` method returns a tensor that can be evaluated
# and used to feed the `handle` placeholder.
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())

...
# Model definition
input = next_element
...
h = tf.layers.batch_normalization(h, training=is_training_pl)
...

# Training
_, l = sess.run([train_op, loss], {is_training_pl: True,
                                   handle: training_handle})

# Validation
sess.run(validation_iterator.initializer)
l = sess.run(loss, {is_training_pl: False,
                    handle: validation_handle})
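One practical detail worth adding here (it is not part of the original answer): in TF 1.x, tf.layers.batch_normalization registers its moving-average updates in the UPDATE_OPS collection, and they are not run automatically. If the training op does not depend on them, the statistics used when is_training_pl is False never get updated. A minimal sketch, assuming loss is the model's loss tensor and using Adam purely as an example optimizer:
# Make the batch-norm moving averages update alongside each training step
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)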

tensorflow how to change dataset

I have a Dataset API doohickey which is part of my tensorflow graph. How do I swap it out when I want to use different data?
dataset = tf.data.Dataset.range(3)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

variable = tf.Variable(3, dtype=tf.int64)
model = variable * next_element

# pretend like this is me training my model, or something
with tf.Session() as sess:
    sess.run(variable.initializer)
    try:
        while True:
            print(sess.run(model))  # (0, 3, 6)
    except:
        pass

dataset = tf.data.Dataset.range(2)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

### HOW TO DO THIS THING?

with tf.Session() as sess:
    sess.run(variable.initializer)  # This would be a saver restore operation, normally...
    try:
        while True:
            print(sess.run(model))  # (0, 3)... hopefully
    except:
        pass
I do not believe this is possible. You are asking to change the computation graph itself, which is not allowed in TensorFlow. Rather than explain it myself, I find the accepted answer to this post particularly clear in explaining that point: Is it possible to modify an existing TensorFlow computation graph?
Now, that said, I think there is a fairly simple/clean way to accomplish what you seek. Essentially, you want to reset the graph and rebuild the Dataset part. Of course you want to reuse the model part of the code. Thus just put that model in a class or function to allow reuse. A simple example built on your code:
import tensorflow as tf

# the part of the graph you want to reuse
def get_model(next_element):
    variable = tf.Variable(3, dtype=tf.int64)
    return variable * next_element

# the first graph you want to build
tf.reset_default_graph()

# the part of the graph you don't want to reuse
dataset = tf.data.Dataset.range(3)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# reusable part
model = get_model(next_element)

# pretend like this is me training my model, or something
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    try:
        while True:
            print(sess.run(model))  # (0, 3, 6)
    except:
        pass

# now the second graph
tf.reset_default_graph()

# the part of the graph you don't want to reuse
dataset = tf.data.Dataset.range(2)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# reusable part
model = get_model(next_element)

### HOW TO DO THIS THING?
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    try:
        while True:
            print(sess.run(model))  # (0, 3)... hopefully
    except:
        pass
Final note: you will also see some references here and there to tf.contrib.graph_editor (docs here). They specifically say that you can't accomplish exactly what you want with the graph_editor (see in that link: "Here is an example of what you cannot do"; but you can get pretty close). Even so, it's not good practice; they had good reason to make the graph append-only, and I think the method I suggest above is the cleaner way to accomplish what you seek.
One way I would suggest, though it will make things slower, is to use a placeholder followed by tf.data.Dataset.from_tensor_slices. You would then have the following:
train_data_ph = tf.placeholder(dtype=tf.float32, shape=[None, None, 1])  # just an example
# Then build the tf.data.Dataset on top of the placeholder
train_dataset = tf.data.Dataset.from_tensor_slices(train_data_ph).shuffle(10000).batch(batch_size)
Now when running the graph within a session you have to feed in the data through the placeholder (when initializing the dataset's iterator), so you can feed whatever you like.
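Applied to the toy example from the question, a runnable sketch (TF 1.x) of this placeholder-backed approach could look like the following; the graph is built once, and swapping the data is just a matter of re-running the iterator initializer with a different array:
import numpy as np
import tensorflow as tf

data_ph = tf.placeholder(tf.int64, shape=[None])
dataset = tf.data.Dataset.from_tensor_slices(data_ph)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

variable = tf.Variable(3, dtype=tf.int64)
model = variable * next_element

with tf.Session() as sess:
    sess.run(variable.initializer)
    for data in (np.arange(3), np.arange(2)):  # first the "training" data, then the new data
        sess.run(iterator.initializer, feed_dict={data_ph: data})
        try:
            while True:
                print(sess.run(model))  # 0, 3, 6 on the first pass, then 0, 3
        except tf.errors.OutOfRangeError:
            pass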
Hope this helps!!

Integrating directory of TFRecord examples into model training

What is the most efficient way to feed data from multiple TFRecord files for the purpose of training a TensorFlow model? With my current process, I iterate over the examples from the TFRecords, separately extracting them into Python variables, but I don't believe this is the proper way to do it.
I am migrating from Keras to TensorFlow hoping to see some speed improvements in my workflow. Towards that end, I've moved my data into TFRecords, and now I am trying to understand how to run basic linear regression models with a directory of TFRecord files. I have gotten to the point where I can read the TFRecords into a tensor and train in batches like so (code is taken from the TensorFlow getting started example and then modified):
# Model parameters
W = tf.Variable([.1], dtype=tf.float32)
b = tf.Variable([.1], dtype=tf.float32)

# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y))  # sum of the squares

# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)

# Transforms a scalar string `example_proto` into a pair of a scalar string and
# a scalar integer, representing an image and its label, respectively.
def _parse_function(example_proto):
    keys_to_features = {
        "X": tf.FixedLenFeature([40], tf.float32),
        "Y": tf.FixedLenFeature([10], tf.float32)
    }
    example = tf.parse_single_example(example_proto, keys_to_features)
    return example["X"][0], example["Y"][0]

filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()
dataset = dataset.batch(1024)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)  # reset values to wrong
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
for i in range(10):
    x_train, y_train = sess.run(iterator.get_next())
    sess.run(train, {x: x_train, y: y_train})
My problem is that I do not believe this follows the intended, most efficient dataset workflow possible with TensorFlow. In particular, what is the point of extracting the data from binary into a Python variable and then feeding it into the training process? (the line below)
x_train, y_train = sess.run(iterator.get_next())
I was under the impression there should be a way that feeds the binary data into the session for training more directly, but after reading the TF tutorials, examples, and other Stack Overflow posts, I am not finding anything.
The Dataset API is very versatile and flexible. It can be used to feed data through feed_dict as you did. However, a better way is to incorporate the dataset within the graph and have it process everything at once.
def model_function(input, label):
    # Model parameters
    W = tf.Variable([.1], dtype=tf.float32)
    b = tf.Variable([.1], dtype=tf.float32)
    # Model input and output
    x = input
    linear_model = W * x + b
    y = label
    # loss
    loss = tf.reduce_sum(tf.square(linear_model - y))  # sum of the squares
    # optimizer
    optimizer = tf.train.GradientDescentOptimizer(0.1)
    train = optimizer.minimize(loss)
    return train

# ---<Previous dataset related code>---
iterator = dataset.make_initializable_iterator()
next_example, next_label = iterator.get_next()
train_op = model_function(next_example, next_label)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
    for step in range(1000):
        _ = sess.run(train_op)
In this way the dataset operations are part of the main graph. This also makes better use of the dataset's internal queueing and prefetching. Since only one sess.run is needed per step, the overhead of the run function is minimised.
For more information have a look at this part of the documentation: Importing data | TensorFlow 1.4
If the training filenames are only known at runtime, you can keep the placeholder and feed it when running the iterator's initializer. However, I suggest against that. Filenames are rather static. I would use a resources file such as config.py and place all the config properties in that file. The filenames are then loaded on graph construction.
To specify the filenames, there are two approaches.
The first one:
...
filenames = tf.constant(["filename1.tfrecords", "filename2.tfrecords"], dtype=tf.string)
dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
...
Or, a more proper approach would be to create a new directory in the main folder called resources, place an empty __init__.py file inside, and another file called config.py.
Inside config.py:
--- inside config.py ---
FILENAMES = ["filename1.tfrecord", "filename2.tfrecord"]
Inside the main TensorFlow file where the dataset is being created:
--- inside tensorflow file ---
from resources import config
...
filenames = tf.constant(config.FILENAMES, dtype=tf.string)
dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
...
