I'm writing a code for reading images and labels from disc in tensorflow and then trying to call tf.estimator.inputs.numpy_input_fn. How can I pass the whole dataset instead of single image. My code looks like:
filenames = tf.constant(filenames)
labels = tf.constant(labels)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
dataset_batched = dataset.batch(10)
iterator = dataset_batched.make_one_shot_iterator()
features, labels = iterator.get_next()
with tf.Session() as sess:
print(dataset_batched)
print(np.shape(sess.run(features)))
print(np.shape(sess.run(labels)))
mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_mk, model_dir=dir)
train_input_fn = tf.estimator.inputs.numpy_input_fn(x={"x": np.array(sess.run(features))},
y=np.array(sess.run(labels)),
batch_size=1,
num_epochs=None,
shuffle=False)
mnist_classifier.train(input_fn=train_input_fn, steps=1)
And my question is how can I pass dataset here x={"x": np.array(sess.run(features))}
There is no need/use for numpy_input_fn here. You should wrap the code at the top into a function (say, my_input_fn) that returns iterator.get_next() and, then pass input_fn=my_input_fn into the train call. This would pass the full dataset to the training code in batches of 10.
numpy_input_fn is for when you have the full dataset available in an array already and want a quick way to do batching/shuffling/repeating etc.
Related
Code
All of the 4 columns are float64. I'm not sure what to do about this error as I've looked at similar stack overflow issues and nothing regarding converting it to numpy array and/or float32 seem to solve this.
This code is base on:
Google Colab
I just replaced the housing data with mlb pitching data and applied dropna() to the train and test dataframes.
Thank you.
In the function train_model(), specify the array .astype(float).
So the line you need to edit:
features = {name:np.array(value) for name, value in dataset.items()}
... change to:
features = {name:np.array(value).astype(float) for name, value in dataset.items()}
You'll also need to do it in the next block as well:
test_features = {name:np.array(value).astype(float) for name, value in cut_test_df_norm.items()}
Full function code:
def train_model(model, dataset, epochs, label_name,
batch_size=None):
"""Train the model by feeding it data."""
# Split the dataset into features and label.
features = {name:np.array(value).astype(float) for name, value in dataset.items()}
label = np.array(features.pop(label_name))
history = model.fit(x=features, y=label, batch_size=batch_size,
epochs=epochs, shuffle=True)
print(features)
# The list of epochs is stored separately from the rest of history.
epochs = history.epoch
# To track the progression of training, gather a snapshot
# of the model's mean squared error at each epoch.
hist = pd.DataFrame(history.history)
mse = hist["mean_squared_error"]
return epochs, mse
And:
# After building a model against the training set, test that model
# against the test set.
test_features = {name:np.array(value).astype(float) for name, value in cut_test_df_norm.items()}
test_label = np.array(test_features.pop(label_name)) # isolate the label
print("\n Evaluate the new model against the test set:")
my_model.evaluate(x = test_features, y = test_label, batch_size=batch_size)
I have a certain dataset loaded into a dataloader. For example, if I wanted to save 100 images from this dataloader, how should I iterate over the dataloader to save them?
Im not exactly sure what you are trying to do (maybe edit your question) but maybe this helps:
dataset = Dataset()
dataloader = torch.utils.data.DataLoader(
dataloader,
batch_size=32,
num_workers=1,
shuffle=True)
for samples, targets in dataloader:
# 'sample' now is a batch of 32 (see batch-size above) elements of your dataset
Is this what you wanted? Hope so :)
Take dl as your dataloader.
If you want to print only the first batch you can do this:
for data, targets in dl:
print #whatever you want
break
But if you want the n-th batch you can do this:
for batch_idx, (data, target) in enumerate(train_loader):
if batch_idx == n:
print #whatever you want
break
I have a TensorFlow model that uses tf.data.Dataset feedable iterators to switch between training and validation. Both dataset share the same structure, that is they have a features matrix and the corresponding labels vector. In order to use the same model and iterator for inference (no labels vector only featurex matrix) I need to ideally supply a zero labels vector. Is there a more efficient and elegant way to use the dataset API for both training (validation) and inference?
In code:
training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
validation_dataset = tf.data.Dataset.from_tensor_slices((X_validation, y_validation))
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()
Features and lables are used inside the model as input placeholders.
In order to switch between dataset I need to create one iterator for each dataset:
training_iterator = training_dataset.make_initializable_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()
then create the handle
training_handle = sess.run(training_iterator.string_handle())
validation_handle = sess.run(validation_iterator.string_handle())
And use the handle to select which dataset to use, for example:
sess.run(next_element, feed_dict={handle: training_handle})
Now, what happens if I have inference data with no labels?
inference_dataset = tf.data.Dataset.from_tensor_slices(X_inference) # NO y values
inferece_iterator = inference_dataset.make_initializable_iterator()
If I add this iterator it will throw and exception because "Number of components does not match: expected 2 types but got 1."
Any suggestions?
This post How to use tf.Dataset design in both training and inferring? is related to this question, but tf.data.Dataset does not have an unzip method.
What are the best practices for this problem?
If your graph code I assume you are trying to extract a value for labels y from the dataset right? At inference time that was probably baked into the tensorflow dependency graph.
You have a few choices here. Probably the easiest solution is to recreate the graph from code (run your build_graph() function, then load the weights using something like saver.restore(sess, "/tmp/model.ckpt")). If you do it this way you can re-create the graph without the labels y. I assume there are no other dependencies on y (sometimes tensorboard summaries add dependencies you need to check too). Your problem should now be solved.
However, now that I've written the above comment (which I'll leave as-is because it's still useful information), I realize you might not even need that. At inference time you should not be using the labels anywhere (again, double check tensorboard summaries). If you don't need y then tensorflow should not run any of the operations that use y. This should include not trying to extract them from the dataset. Double check that you are not asking tensorflow to use your labels anywhere at inference time.
I think that the first solution proposed by David Parks looks like this, and I think is better than messing with tf.cond in the code.
import tensorflow as tf
import numpy as np
def build_model(features, labels=None, train=False):
linear_model = tf.layers.Dense(units=1)
y_pred = linear_model(features)
if train:
loss = tf.losses.mean_squared_error(labels=labels, predictions=y_pred)
optimizer = tf.train.GradientDescentOptimizer(1e-4)
train = optimizer.minimize(loss)
return train, loss
else:
return y_pred
X_train = np.random.random(100).reshape(-1, 1)
y_train = np.random.random(100).reshape(-1, 1)
training_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
training_dataset = training_dataset.batch(10)
training_dataset = training_dataset.shuffle(20)
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)
features, labels = iterator.get_next()
training_iterator = training_dataset.make_one_shot_iterator()
train, loss = build_model(features, labels, train=True)
saver = tf.train.Saver()
init = tf.global_variables_initializer()
sess = tf.Session()
training_handle = sess.run(training_iterator.string_handle())
sess.run(init)
for i in range(10):
_, loss_value = sess.run((train, loss), feed_dict={handle: training_handle})
print(loss_value)
saver.save(sess, "tmp/model.ckpt")
sess.close()
tf.reset_default_graph()
X_test = np.random.random(10).reshape(-1, 1)
inference_dataset = tf.data.Dataset.from_tensor_slices(X_test)
inference_dataset = inference_dataset.batch(5)
handle = tf.placeholder(tf.string, shape=[])
iterator_inference = tf.data.Iterator.from_string_handle(handle, inference_dataset.output_types, inference_dataset.output_shapes)
inference_iterator = inference_dataset.make_one_shot_iterator()
features_inference = iterator_inference.get_next()
y_pred = build_model(features_inference)
saver = tf.train.Saver()
sess = tf.Session()
inference_handle = sess.run(inference_iterator.string_handle())
saver.restore(sess, "tmp/model.ckpt") # Restore variables from disk.
print(sess.run(y_pred, feed_dict={handle: inference_handle}))
sess.close()
I have trained a model using the tf.data.Dataset API, so my training code looks something like this
with graph.as_default():
dataset = tf.data.TFRecordDataset(tfrecord_path)
dataset = dataset.map(scale_features, num_parallel_calls=n_workers)
dataset = dataset.shuffle(10000)
dataset = dataset.padded_batch(batch_size, padded_shapes={...})
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle,
train_dataset.output_types,
train_dataset.output_shapes)
batch = iterator.get_next()
...
# Model code
...
iterator = dataset.make_initializable_iterator()
with tf.Session(graph=graph) as sess:
train_handle = sess.run(iterator.string_handle())
sess.run(tf.global_variables_initializer())
for epoch in range(n_epochs):
sess.run(train_iterator.initializer)
while True:
try:
sess.run(optimizer, feed_dict={handle: train_handle})
except tf.errors.OutOfRangeError:
break
Now after the model is trained I want to infer on examples that are not in the datasets and I am not sure how to go about doing it.
Just to be clear, I know how to use another dataset, for example I just pass a handle to my test set upon testing.
The question is about given the scaling scheme and the fact that the network expects a handle, if I want to make a prediction to a new example which is not written to a TFRecord, how would I go about doing that?
If I'd modify the batch I'd be responsible for the scaling beforehand which is something I would like to avoid if possible.
So how should I infer single examples from a model traiend the tf.data.Dataset way?
(This is not for production purposes it is for evaluating what will happen if I change specific features)
actually there is a tensor name called "IteratorGetNext:0" in the graph
when you use dataset api, so you can using following way to directly set
input:
#get a tensor from a graph
input tensor : input = graph.get_tensor_by_name("IteratorGetNext:0")
# difine the target tensor you want evaluate for your prediction
prediction tensor: predictions=...
# finally call session to run
then sess.run(predictions, feed_dict={input: np.asanyarray(images), ...})
What is the most efficient way to feed data from multiple TFRecord files for purposes of training a Tensorflow model? With my current process, I iterate over the examples from TFRecords, separately extracting examples into Python variables, but I don't believe this is the proper way to do this.
I am migrating from Keras to Tensorflow hoping to see some speed improvements in my workflow. Towards that end, I've moved my data into TFRecord, and now I am trying to understand how to run basic linear regression models with a directory of TFRecord files. I have gotten to the point where I can read the TFRecord out into a Tensor and train in batches like so (code is taken from the Tensorflow getting started example and then modified):
# Model parameters
W = tf.Variable([.1], dtype=tf.float32)
b = tf.Variable([.1], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)
# Transforms a scalar string `example_proto` into a pair of a scalar string and
# a scalar integer, representing an image and its label, respectively.
def _parse_function(example_proto):
keys_to_features = {
"X": tf.FixedLenFeature([40], tf.float32),
"Y": tf.FixedLenFeature([10], tf.float32)
}
example = tf.parse_single_example(example_proto, keys_to_features)
return example["X"][0], example["Y"][0]
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()
dataset = dataset.batch(1024)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
sess.run(iterator.initializer, feed_dict = { filenames: training_filenames })
for i in range(10):
**x_train, y_train = sess.run(iterator.get_next())**
sess.run(train, {x: x_train, y: y_train})
My problem is that I do not believe this follows the intended, most efficient dataset workflow possible with Tensorflow. In particular, what is the point of extracting the data from binary into a python variable and then feeding it into the training process? (the line below)
**x_train, y_train = sess.run(iterator.get_next())**
I was under the impression there should be a way that feeds the binary data into the session for training more directly, but after reading the TF tutorials, examples, and other stack overflow posts, I am not finding anything.
The dataset API is very versatile and flexible. It can be used to input as dictionaries as you did. However, a better way is to incorporate the dataset within the graph and make it process all at once.
def model_function(input, label)
# Model parameters
W = tf.Variable([None, input.shape[1]], dtype=tf.float32)
b = tf.Variable([input.shape[1]], dtype=tf.float32)
# Model input and output
x = input
linear_model = W*x + b
y = label
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)
return train
---<Previous dataset related code>---
iterator.make_initializable_iterator()
next_example, next_label = iterator.get_next()
train_op = model_function(next_example, next label)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for steps in range(1000):
_ = sess.run([train_op], feeddict={filenames: training_filenames})
In this way the dataset operations are part of the main graph. This would also use the queuing structure of the dataset better. Since only one sess.run is used, the overhead of the run function is minimised.
For more information have a look at this part of the documentation: Importing data | Tensorflow 1.4
If you need training filenames which are specified at graph runtime you can only specify that placeholder in the feeddict. However, i suggest against that though. Filenames are rather static. I would use a resources file such as config.py and place all the config properties in that file. The filenames are then loaded on graph construction.
To specify the filenames, there are two approaches.
The first one:
...
filenames = tf.constant([filename1.tfrecords, filename2.tfrecords], dtype=tf.String)
dataset = tf.data.Dataset(filenames, "ZLIB")
...
Or else a more proper approach would be to create a new directory in the main folder called resources, place and empty __init__.py file inside and another one called config.py.
Inside config.py:
--- inside config.py ---
FILENAMES = ["filename1.tfrecord", "filename2.tfrecord"]
Inside the main tensorflow function where the dataset is being created:
--- inside tensorflow file ---
from resources import config
...
filenames = tf.constant(config.FILENAMES, dtype=tf.String)
dataset = tf.data.Dataset(filenames, "ZLIB")
...