Summary: Using the new tf.contrib.data.Dataset doubles the size of my graph protobuff file and I'm unable to visualize the graph in Tensorboard.
The details:
I'm trying out the new TensorFlow tf.contrib.data.Dataset functionality together with the tf.contrib.learn.Experiment framework. My input data is defined as input functions which return tensors of features and labels.
If I create my input function with the tf.train.slice_input_producer function like in the following codeblock (full code here), then my resulting graph.pbtxt file is 620M and the .meta files are around 165M in size.
def train_inputs():
with tf.name_scope('Training_data'):
x = tf.constant(mnist.train.images.reshape([-1, 28, 28, 1]))
y = tf.constant(mnist.train.labels)
sliced_input = tf.train.slice_input_producer(
tensor_list=[x, y], shuffle=True)
return tf.train.shuffle_batch(
sliced_input, batch_size=batch_size,
capacity=10000, min_after_dequeue=batch_size*10)
Now if I create my input function with the new tf.contrib.data.Dataset.from_tensor_slices like in the following codeblock (full code here), then my resulting graph.pbtxt file doubles in size to 1.3G and the .meta files double in size to 330M.
def train_inputs():
with tf.name_scope('Training_data'):
images = mnist.train.images.reshape([-1, 28, 28, 1])
labels = mnist.train.labels
dataset = tf.contrib.data.Dataset.from_tensor_slices(
(images, labels))
dataset = dataset.repeat(None) # Infinite
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
next_example, next_label = iterator.get_next()
return next_example, next_label
Now because the graph.pbtxt file is so big TensorBoard takes ages to parse this file, and I'm unable to debug my model graph visually.
I found in the Dataset documentation that this increase in size comes from: "the contents of the array will be copied multiple times" and the solution would be to use placeholders. However, in this case, I would need to feed in the numpy arrays into the placeholders with an active session to initialize the iterator:
sess.run(iterator.initializer, feed_dict={features_placeholder: features, labels_placeholder: labels})
This seems, however, to be out of my control when using the tf.contrib.learn.Experiment framework.
How can I initialize the iterator's initialiser with the Experiment framework? Or find a workaround to using the Dataset API without increasing my graph size?
I found a solution to my problem using tf.train.SessionRunHook. I create a SessionRunHook object that initialises the iterator after the session is created:
class IteratorInitializerHook(tf.train.SessionRunHook):
def __init__(self):
super(IteratorInitializerHook, self).__init__()
self.iterator_initiliser_func = None
def after_create_session(self, session, coord):
self.iterator_initiliser_func(session)
The initializer function is set when creating the Dataset Iterator:
iterator_initiliser_hook.iterator_initiliser_func = \
lambda sess: sess.run(
iterator.initializer,
feed_dict={images_placeholder: images,
labels_placeholder: labels})
And I pass in the hook objects to train_monitors and eval_hooks parameters of tf.contrib.learn.Experiment.
The resulting graph.pbtxt file is now only 500K while the .meta files are only 244K.
Full example here.
Related
I want to train a convolutional neural network (using tf.keras from Tensorflow version 1.13) using numpy arrays as input data. The training data (which I currently store in a single >30GB '.npz' file) does not fit in RAM all at once. What is the best way to save and load large data-sets into a neural network for training? Since I didn't manage to find a good answer to this (surely ubiquitous?) problem, I'm hoping to hear one here. Thank you very much in advance for any help!
Sources
Similar questions seem to have been asked many times (e.g. training-classifier-from-tfrecords-in-tensorflow, tensorflow-synchronize-readings-from-tfrecord, how-to-load-data-parallelly-in-tensorflow) but are several years old and usually contain no conclusive answer.
My current understanding is that using TFRecord files is a good way to approach this problem. The most promising tutorial I found so far explaining how to use TFRecord files with keras is medium.com. Other helpful sources were machinelearninguru.com and medium.com_source2 and sources therin.
The official tensorflow documentation and tutorials (on tf.data.Dataset, Importing Data, tf_records etc.) did not help me. In particular, several of the examples given there didn't work for me even without modifications.
My Attempt at using TFRecord files
I'm assuming TFRecords are a good way to solve my problem but I'm having a hard time using them. Here is an example I made based on the tutorial medium.com. I stripped down the code as much as I could.
# python 3.6, tensorflow 1.13.
# Adapted from https://medium.com/#moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
import tensorflow as tf
import numpy as np
from tensorflow.python import keras as keras
# Helper functions (see also https://www.tensorflow.org/tutorials/load_data/tf_records)
def _int64_feature(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def writeTFRecords():
number_of_samples = 100 # create some random data to play with
images, labels = (np.random.sample((number_of_samples, 256, 256, 1)), np.random.randint(0, 30, number_of_samples))
writer = tf.python_io.TFRecordWriter("bla.tfrecord")
for index in range(images.shape[0]):
image = images[index]
label = labels[index]
feature = {'image': _bytes_feature(tf.compat.as_bytes(image.tostring())),
'label': _int64_feature(int(label))}
example = tf.train.Example(features=tf.train.Features(feature=feature))
writer.write(example.SerializeToString())
writer.close()
def loadTFRecord(data_path):
with tf.Session() as sess:
feature = {'train/image': tf.FixedLenFeature([], tf.string),
'train/label': tf.FixedLenFeature([], tf.int64)}
# Create a list of filenames and pass it to a queue
filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
# Define a reader and read the next record
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
# Decode the record read by the reader
features = tf.parse_single_example(serialized_example, features=feature)
# Convert the image data from string back to the numbers
image = tf.decode_raw(features['train/image'], tf.float32)
# Cast label data into int32
label = tf.cast(features['train/label'], tf.int32)
# Reshape image data into the original shape
image = tf.reshape(image, [256, 256, 1])
return image, label # I'm not 100% sure that's how this works...
# ######### generate a TFRecords file in the working directory containing random data. #################################
writeTFRecords()
# ######## Load the TFRecords file and use it to train a simple example neural network. ################################
image, label = loadTFRecord("bla.tfrecord")
model_input = keras.layers.Input(tensor=image)
model_output = keras.layers.Flatten(input_shape=(-1, 256, 256, 1))(model_input)
model_output = keras.layers.Dense(16, activation='relu')(model_output)
train_model = keras.models.Model(inputs=model_input, outputs=model_output)
train_model.compile(optimizer=keras.optimizers.RMSprop(lr=0.0001),
loss='mean_squared_error',
target_tensors=[label])
print("\n \n start training \n \n") # Execution gets stuck on fitting
train_model.fit(epochs=1, steps_per_epoch=10) # no output or error messages.
The code creates a TFRecord file and starts fitting, then just gets stuck with no output or error messages. I don't know what the problem is or how I could try to fix it.
While this is no real answer to the original question (i.e. "what is the optimal way to train on large datasets"), I managed to get tfrecords and datasets to work. Of particular help was this tutorial on YouTube. I include a minimal example with working code for anyone struggling with the same problem.
# Developed using python 3.6, tensorflow 1.14.0.
# This code writes data (pairs (label, image) where label is int64 and image is np.ndarray) into .tfrecord files and
# uses them for training a simple neural network. It is meant as a minimal working example of how to use tfrecords. This
# solution is likely not optimal. If you know how to improve it, please comment on
# https://stackoverflow.com/q/57717004/9988487. Refer to links therein for further information.
import tensorflow as tf
import numpy as np
from tensorflow.python import keras as keras
# Helper functions (see also https://www.tensorflow.org/tutorials/load_data/tf_records)
def _int64_feature(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def write_tfrecords_file(out_path: str, images: np.ndarray, labels: np.ndarray) -> None:
"""Write all image-label pairs into a single .tfrecord file.
:param out_path: File path of the .tfrecord file to generate or overwrite.
:param images: array with first dimension being the image index. Every images[i].tostring() is
serialized and written into the file as 'image': wrap_bytes(img_bytes)
:param labels: 1d array of integers. labels[i] is the label of images[i]. Written as 'label': wrap_int64(label)"""
assert len(images) == len(labels)
with tf.io.TFRecordWriter(out_path) as writer: # could use writer_options parameter to enable compression
for i in range(len(labels)):
img_bytes = images[i].tostring() # Convert the image to raw bytes.
label = labels[i]
data = {'image': _bytes_feature(img_bytes), 'label': _int64_feature(label)}
feature = tf.train.Features(feature=data) # Wrap the data as TensorFlow Features.
example = tf.train.Example(features=feature) # Wrap again as a TensorFlow Example.
serialized = example.SerializeToString() # Serialize the data.
writer.write(serialized) # Write the serialized data to the TFRecords file.
def parse_example(serialized, shape=(256, 256, 1)):
features = {'image': tf.io.FixedLenFeature([], tf.string), 'label': tf.io.FixedLenFeature([], tf.int64)}
# Parse the serialized data so we get a dict with our data.
parsed_example = tf.io.parse_single_example(serialized=serialized, features=features)
label = parsed_example['label']
image_raw = parsed_example['image'] # Get the image as raw bytes.
image = tf.decode_raw(image_raw, tf.float32) # Decode the raw bytes so it becomes a tensor with type.
image = tf.reshape(image, shape=shape)
return image, label # this function will be called once (to add it to tf graph; then parse images individually)
# create some arbitrary data to play with: 1000 images sized 256x256 with one colour channel. Use your custom np-arrays
IMAGE_WIDTH, NUM_OF_IMAGES, NUM_OF_CLASSES, COLOUR_CHANNELS = 256, 10_000, 10, 1
# using float32 to save memory. Must match type in parse_example(), tf.decode_raw(image_raw, tf.float32)
features_train = np.random.sample((NUM_OF_IMAGES, IMAGE_WIDTH, IMAGE_WIDTH, COLOUR_CHANNELS)).astype(np.float32)
labels_train = np.random.randint(low=0, high=NUM_OF_CLASSES, size=NUM_OF_IMAGES) # one random label for each image
features_eval = features_train[:200] # use the first 200 images as evaluation data for simplicity.
labels_eval = labels_train[:200]
write_tfrecords_file("train.tfrecord", features_train, labels_train) # normal: split the data files of several GB each
write_tfrecords_file("eval.tfrecord", features_eval, labels_eval) # this may take a while. Consider a progressbar
# The files are complete. Now define a model and use datasets to feed the data from the .tfrecord files into the model.
model = keras.Sequential([keras.layers.Flatten(input_shape=(256, 256, 1)),
keras.layers.Dense(128, activation='relu'),
keras.layers.Dense(10, activation='softmax')])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# Check docs for parameters (compression, buffer size, thread count. Also www.tensorflow.org/guide/performance/datasets
train_dataset = tf.data.TFRecordDataset("train.tfrecord") # specify a list (or dataset) of file names for large data
train_dataset = train_dataset.map(parse_example) # parse tfrecords. Parameter num_parallel_calls may help performance.
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
validation_dataset = tf.data.TFRecordDataset("eval.tfrecord")
validation_dataset = validation_dataset.map(parse_example).batch(64)
model.fit(train_dataset, epochs=3)
# evaluate the results
results = model.evaluate(validation_dataset)
print('\n\nvalidation loss, validation acc:', results)
Note that it's tricky to use some_keras_model.fit(..., validation_data=some_dataset) with dataset objects. It may result in
TypeError: 'DatasetV1Adapter' object does not support indexing.
This seems to be a bug (see github.com/tensorflow/tensorflow/issues/28995) and is supposedly fixed as of tf-nightly version '1.15.0-dev20190808'; The official tutorial uses this too, although it doesn't work in most versions. An easy but dirty-ish fix is to use verbose=0 (which only suppresses program output) and plot the validation results using tensorboard. Also see Keras model.fit() with tf.dataset API + validation_data.
def loadData():
images_dir = os.path.join(current_dir, 'image_data')
images = []
for each in os.listdir(images_dir):
images.append(os.path.join(images_dir,each))
all_images = tf.convert_to_tensor(images, dtype = tf.string)
images_batch = tf.train.shuffle_batch(
[all_images], batch_size = BATCH_SIZE)
return images_batch
returns
ValueError: Cannot create a tensor proto whose content is larger than 2GB.
I'm trying to load about 11GB of images. How can I overcome those limitation?
Edit: Possbile duplicate:
You can split the output classes into multiple operations and concatenate them at the end is suggest, but I do not have multiple classes I can split.
Edit2:
Solutions to this problem suggest using placeholders. So now I'm not sure who to use placeholders in that case and where I can feed the array of images to tensorflow.
Here's a minimal version of my train function to show how I initialize the session.
def train():
images_batch = loadData()
sess = tf.Session()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
for i in range(EPOCH):
train_image = sess.run(image_batch)
Using convert_to_tensor has the unexpected effect of adding your images to the computational graph, which has a hard limit of 2GB. If you hit this limit, you should reconsider how to feed images for the training process.
We already have a simple solution in TensorFlow, just use placeholders (tf.placeholder) and feed_dict in session.run. The only disadvantage in this case is that you have to produce batches of your data manually.
I'm am trying to figure out the recommended way to use the dataset api together with the estimator api. Everything I have seen online is some variation of this:
def train_input_fn():
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
return dataset
which can then be passed to the estimator's train function:
classifier.train(
input_fn=train_input_fn,
#...
)
but the dataset guide warns that:
the above code snippet will embed the features and labels arrays in your TensorFlow graph as tf.constant() operations. This works well for a small dataset, but wastes memory---because the contents of the array will be copied multiple times---and can run into the 2GB limit for the tf.GraphDef protocol buffer.
and then describes a method that involves defining placeholders which are then filled with the feed_dict:
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)
dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
sess.run(iterator.initializer, feed_dict={features_placeholder: features,
labels_placeholder: labels})
But if you're using the estimator api, you're not manually running the session. So how do you use the dataset api with estimators while avoiding the problems associated with from_tensor_slices()?
To use either initializable or reinitializable iterators, you must create a class that inherits from tf.train.SessionRunHook, which has access to the session at multiple times during training and evaluation steps.
You can then use this new class to initialize the iterator has you would normally do in a classic setting. You simply need to pass this newly created hook to the training/evaluation functions or to the correct train spec.
Here is quick example that you can adapt to your needs :
class IteratorInitializerHook(tf.train.SessionRunHook):
def __init__(self):
super(IteratorInitializerHook, self).__init__()
self.iterator_initializer_func = None # Will be set in the input_fn
def after_create_session(self, session, coord):
# Initialize the iterator with the data feed_dict
self.iterator_initializer_func(session)
def get_inputs(X, y):
iterator_initializer_hook = IteratorInitializerHook()
def input_fn():
X_pl = tf.placeholder(X.dtype, X.shape)
y_pl = tf.placeholder(y.dtype, y.shape)
dataset = tf.data.Dataset.from_tensor_slices((X_pl, y_pl))
dataset = ...
...
iterator = dataset.make_initializable_iterator()
next_example, next_label = iterator.get_next()
iterator_initializer_hook.iterator_initializer_func = lambda sess: sess.run(iterator.initializer,
feed_dict={X_pl: X, y_pl: y})
return next_example, next_label
return input_fn, iterator_initializer_hook
...
train_input_fn, train_iterator_initializer_hook = get_inputs(X_train, y_train)
test_input_fn, test_iterator_initializer_hook = get_inputs(X_test, y_test)
...
estimator.train(input_fn=train_input_fn,
hooks=[train_iterator_initializer_hook]) # Don't forget to pass the hook !
estimator.evaluate(input_fn=test_input_fn,
hooks=[test_iterator_initializer_hook])
I have trained a model using the tf.data.Dataset API, so my training code looks something like this
with graph.as_default():
dataset = tf.data.TFRecordDataset(tfrecord_path)
dataset = dataset.map(scale_features, num_parallel_calls=n_workers)
dataset = dataset.shuffle(10000)
dataset = dataset.padded_batch(batch_size, padded_shapes={...})
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(handle,
train_dataset.output_types,
train_dataset.output_shapes)
batch = iterator.get_next()
...
# Model code
...
iterator = dataset.make_initializable_iterator()
with tf.Session(graph=graph) as sess:
train_handle = sess.run(iterator.string_handle())
sess.run(tf.global_variables_initializer())
for epoch in range(n_epochs):
sess.run(train_iterator.initializer)
while True:
try:
sess.run(optimizer, feed_dict={handle: train_handle})
except tf.errors.OutOfRangeError:
break
Now after the model is trained I want to infer on examples that are not in the datasets and I am not sure how to go about doing it.
Just to be clear, I know how to use another dataset, for example I just pass a handle to my test set upon testing.
The question is about given the scaling scheme and the fact that the network expects a handle, if I want to make a prediction to a new example which is not written to a TFRecord, how would I go about doing that?
If I'd modify the batch I'd be responsible for the scaling beforehand which is something I would like to avoid if possible.
So how should I infer single examples from a model traiend the tf.data.Dataset way?
(This is not for production purposes it is for evaluating what will happen if I change specific features)
actually there is a tensor name called "IteratorGetNext:0" in the graph
when you use dataset api, so you can using following way to directly set
input:
#get a tensor from a graph
input tensor : input = graph.get_tensor_by_name("IteratorGetNext:0")
# difine the target tensor you want evaluate for your prediction
prediction tensor: predictions=...
# finally call session to run
then sess.run(predictions, feed_dict={input: np.asanyarray(images), ...})
What is the most efficient way to feed data from multiple TFRecord files for purposes of training a Tensorflow model? With my current process, I iterate over the examples from TFRecords, separately extracting examples into Python variables, but I don't believe this is the proper way to do this.
I am migrating from Keras to Tensorflow hoping to see some speed improvements in my workflow. Towards that end, I've moved my data into TFRecord, and now I am trying to understand how to run basic linear regression models with a directory of TFRecord files. I have gotten to the point where I can read the TFRecord out into a Tensor and train in batches like so (code is taken from the Tensorflow getting started example and then modified):
# Model parameters
W = tf.Variable([.1], dtype=tf.float32)
b = tf.Variable([.1], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)
# Transforms a scalar string `example_proto` into a pair of a scalar string and
# a scalar integer, representing an image and its label, respectively.
def _parse_function(example_proto):
keys_to_features = {
"X": tf.FixedLenFeature([40], tf.float32),
"Y": tf.FixedLenFeature([10], tf.float32)
}
example = tf.parse_single_example(example_proto, keys_to_features)
return example["X"][0], example["Y"][0]
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames, "ZLIB")
dataset = dataset.map(_parse_function)
dataset = dataset.repeat()
dataset = dataset.batch(1024)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
sess.run(iterator.initializer, feed_dict = { filenames: training_filenames })
for i in range(10):
**x_train, y_train = sess.run(iterator.get_next())**
sess.run(train, {x: x_train, y: y_train})
My problem is that I do not believe this follows the intended, most efficient dataset workflow possible with Tensorflow. In particular, what is the point of extracting the data from binary into a python variable and then feeding it into the training process? (the line below)
**x_train, y_train = sess.run(iterator.get_next())**
I was under the impression there should be a way that feeds the binary data into the session for training more directly, but after reading the TF tutorials, examples, and other stack overflow posts, I am not finding anything.
The dataset API is very versatile and flexible. It can be used to input as dictionaries as you did. However, a better way is to incorporate the dataset within the graph and make it process all at once.
def model_function(input, label)
# Model parameters
W = tf.Variable([None, input.shape[1]], dtype=tf.float32)
b = tf.Variable([input.shape[1]], dtype=tf.float32)
# Model input and output
x = input
linear_model = W*x + b
y = label
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.1)
train = optimizer.minimize(loss)
return train
---<Previous dataset related code>---
iterator.make_initializable_iterator()
next_example, next_label = iterator.get_next()
train_op = model_function(next_example, next label)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for steps in range(1000):
_ = sess.run([train_op], feeddict={filenames: training_filenames})
In this way the dataset operations are part of the main graph. This would also use the queuing structure of the dataset better. Since only one sess.run is used, the overhead of the run function is minimised.
For more information have a look at this part of the documentation: Importing data | Tensorflow 1.4
If you need training filenames which are specified at graph runtime you can only specify that placeholder in the feeddict. However, i suggest against that though. Filenames are rather static. I would use a resources file such as config.py and place all the config properties in that file. The filenames are then loaded on graph construction.
To specify the filenames, there are two approaches.
The first one:
...
filenames = tf.constant([filename1.tfrecords, filename2.tfrecords], dtype=tf.String)
dataset = tf.data.Dataset(filenames, "ZLIB")
...
Or else a more proper approach would be to create a new directory in the main folder called resources, place and empty __init__.py file inside and another one called config.py.
Inside config.py:
--- inside config.py ---
FILENAMES = ["filename1.tfrecord", "filename2.tfrecord"]
Inside the main tensorflow function where the dataset is being created:
--- inside tensorflow file ---
from resources import config
...
filenames = tf.constant(config.FILENAMES, dtype=tf.String)
dataset = tf.data.Dataset(filenames, "ZLIB")
...