Using TFRecords with keras

I have transformed an image database into two TFRecords, one for training and the other for validation. I want to train a simple model with keras using these two files for data input but I obtain an error I can't understand related to the shape of the data.
Here is the code (all-capital variables are defined elsewhere):
def _parse_function(proto):
    f = {
        "x": tf.FixedLenSequenceFeature([IMG_SIZE[0] * IMG_SIZE[1]], tf.float32, default_value=0., allow_missing=True),
        "label": tf.FixedLenSequenceFeature([1], tf.int64, default_value=0, allow_missing=True)
    }
    parsed_features = tf.parse_single_example(proto, f)
    x = tf.reshape(parsed_features['x'] / 255, (IMG_SIZE[0], IMG_SIZE[1], 1))
    y = tf.cast(parsed_features['label'], tf.float32)
    return x, y
def load_dataset(input_path, batch_size, shuffle_buffer):
    dataset = tf.data.TFRecordDataset(input_path)
    dataset = dataset.shuffle(shuffle_buffer).repeat()  # shuffle and repeat
    dataset = dataset.map(_parse_function, num_parallel_calls=16)
    dataset = dataset.batch(batch_size).prefetch(1)  # batch and prefetch
    return dataset.make_one_shot_iterator()
train_iterator = load_dataset(TRAIN_TFRECORDS, BATCH_SIZE, SHUFFLE_BUFFER)
val_iterator = load_dataset(VALIDATION_TFRECORDS, BATCH_SIZE, SHUFFLE_BUFFER)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(IMG_SIZE[0], IMG_SIZE[1], 1)))
model.add(tf.keras.layers.Dense(1, 'sigmoid'))
model.compile(
    optimizer=tf.train.AdamOptimizer(),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.fit(
    train_iterator,
    epochs=N_EPOCHS,
    steps_per_epoch=N_TRAIN // BATCH_SIZE,
    validation_data=val_iterator,
    validation_steps=N_VALIDATION // BATCH_SIZE
)
And here is the error I obtain:
tensorflow.python.framework.errors_impl.InvalidArgumentError: data[0].shape = [3] does not start with indices[0].shape = [2]
[[Node: training/TFOptimizer/gradients/loss/dense_loss/Mean_grad/DynamicStitch = DynamicStitch[N=2, T=DT_INT32, _class=["loc:@training/TFOptimizer/gradients/loss/dense_loss/Mean_grad/floordiv"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/TFOptimizer/gradients/loss/dense_loss/Mean_grad/range, training/TFOptimizer/gradients/loss/dense_loss/Mean_3_grad/Maximum, training/TFOptimizer/gradients/loss/dense_loss/Mean_grad/Shape/_35, training/TFOptimizer/gradients/loss/dense_loss/Mean_3_grad/Maximum/_41)]]
(I know that the model defined here is not a good model for image analysis; I just took the simplest possible architecture that reproduces the error.)

Change:
"label": tf.FixedLenSequenceFeature([1]...
into:
"label": tf.FixedLenSequenceFeature([]...
This is unfortunately not explained in the documentation on the website, but some explanation can be found in the docstring of FixedLenSequenceFeature on GitHub. Basically, if your data consists of a single dimension (plus a batch dimension), you don't need to specify it.
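Applied to the code in the question, the corrected feature spec would look like this (only the "label" entry changes):
f = {
    "x": tf.FixedLenSequenceFeature([IMG_SIZE[0] * IMG_SIZE[1]], tf.float32,
                                    default_value=0., allow_missing=True),
    "label": tf.FixedLenSequenceFeature([], tf.int64,
                                        default_value=0, allow_missing=True)
}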

You forgot this line from the example:
parsed_features = tf.parse_single_example(proto, f)
Add it to _parse_function.
You can also return just the dataset object: Keras accepts iterators as well as instances of tf.data.Dataset. It also looks a bit weird to shuffle and repeat first, and only then to parse the tf.Examples. Here is example code that works for me:
def dataset(filenames, batch_size, img_height, img_width, is_training=False):
    decoder = TfExampleDecoder()

    def preprocess(image, boxes, classes):
        image = preprocess_image(image, resize_height=img_height, resize_width=img_width)
        return image, (boxes, classes)

    ds = tf.data.TFRecordDataset(filenames)
    ds = ds.map(decoder.decode, num_parallel_calls=8)
    if is_training:
        ds = ds.shuffle(1000 + 3 * batch_size)
    ds = ds.apply(tf.contrib.data.map_and_batch(map_func=preprocess, batch_size=batch_size, num_parallel_calls=8))
    ds = ds.repeat()
    ds = ds.prefetch(buffer_size=batch_size)
    return ds
train_dataset = dataset(args.train_data, args.batch_size,
                        args.img_height, args.img_width,
                        is_training=True)

model.fit(train_dataset,
          steps_per_epoch=args.steps_per_epoch,
          epochs=args.max_epochs,
          callbacks=callbacks,
          initial_epoch=0)
It seems like an issue with your data or preprocessing pipeline rather than with Keras. Try to inspect what you are getting out of the dataset with debugging code like:
ds = dataset(args.data, args.batch_size, args.img_height, args.img_width, is_training=True)
image_t, classes_t = ds.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    while True:
        image, classes = sess.run([image_t, classes_t])
        # Do something with the data: display, log etc.
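In TF 2.x (or TF 1.x with eager execution enabled), the same inspection works without a session, since datasets are directly iterable; a minimal sketch:
ds = dataset(args.data, args.batch_size, args.img_height, args.img_width, is_training=True)
for images, targets in ds.take(2):
    # Inspect shapes, dtypes, value ranges etc.
    print(images.shape)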

Related

Long initialization time for model.fit when using tensorflow dataset from generator

This is my first question on Stack Overflow. I apologise in advance for the poor formatting and indentation due to my troubles with the interface.
Environment specifications:
Tensorflow version - 2.7.0 GPU (tested and working properly)
Python version - 3.9.6
CPU - Intel Core i7 7700HQ
GPU - NVIDIA GTX 1060 3GB
RAM - 16GB DDR4 2400MHz
HDD - 1TB 5400 RPM
Problem Statement:
I wish to train a TensorFlow 2.7.0 model to perform multilabel classification with six classes on CT scans stored as DICOM images. The dataset is from Kaggle, link here. The training labels are stored in a CSV file, and the DICOM image names are of the format ID_"random characters".dcm. The images have a combined size of 368 GB.
Approach used:
The CSV file containing the labels is imported into a pandas DataFrame, and the image filenames are set as the index.
A simple data generator is created to read the DICOM image and the labels by iterating over the rows of the DataFrame. This generator is used to create a training dataset using tf.data.Dataset.from_generator. The images are pre-processed using bsb_window().
The training dataset is shuffled and split into a training (90%) and validation (10%) set.
The model is created using Keras Sequential, compiled, and fit using the training and validation datasets created earlier.
Code:
def train_generator():
    for row in df.itertuples():
        image = pydicom.dcmread(train_images_dir + row.Index + ".dcm")
        try:
            image = bsb_window(image)
        except:
            image = np.zeros((256,256,3))
        labels = row[1:]
        yield image, labels
train_images = tf.data.Dataset.from_generator(train_generator,
                                              output_signature=(
                                                  tf.TensorSpec(shape=(256,256,3)),
                                                  tf.TensorSpec(shape=(6,))
                                              ))

train_images = train_images.batch(4)

TRAIN_NUM_FILES = 752803
train_images = train_images.shuffle(40)
val_size = int(TRAIN_NUM_FILES * 0.1)
val_images = train_images.take(val_size)
train_images = train_images.skip(val_size)
def create_model():
    model = Sequential([
        InceptionV3(include_top=False, input_shape=(256,256,3), weights="imagenet"),
        GlobalAveragePooling2D(name="avg_pool"),
        Dense(6, activation="sigmoid", name="dense_output"),
    ])
    model.compile(loss="binary_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(5e-4),
                  metrics=["accuracy", tf.keras.metrics.SpecificityAtSensitivity(0.8)])
    return model

model = create_model()

history = model.fit(train_images,
                    batch_size=4,
                    epochs=5,
                    verbose=1,
                    validation_data=val_images)
Issue:
When executing this code, there is a delay of a few hours of high disk usage (~30MB/s reads) before training begins. When a DataGenerator is made using tf.keras.utils.Sequence, training commences within seconds of calling model.fit().
Potential causes:
Iterating over a pandas DataFrame in train_generator(). I am not sure how to avoid this issue.
The use of external functions to pre-process and load the data.
The usage of the take() and skip() methods to create training and validation datasets.
How do I optimise this code to run faster? I've heard splitting the data generator into label creation, image pre-processing functions and parallelising operations would improve performance. Still, I'm not sure how to apply those concepts in my case. Any advice would be highly appreciated.
I FOUND THE ANSWER
The problem was in the following code:
TRAIN_NUM_FILES = 752803
train_images = train_images.shuffle(40)
val_size = int(TRAIN_NUM_FILES * 0.1)
val_images = train_images.take(val_size)
train_images = train_images.skip(val_size)
It takes an inordinate amount of time to split the dataset into training and validation sets after loading the images. This step should be done early in the process, before any images are loaded. Hence, I split the image-path loading from the actual image loading, then parallelized the functions using the recommendations given here. The final optimized code is as follows:
def train_generator():
    for row in df.itertuples():
        image_path = f"{train_images_dir}{row.Index}.dcm"
        labels = np.reshape(row[1:], (1,6))
        yield image_path, labels

def test_generator():
    for row in test_df.itertuples():
        image_path = f"{test_images_dir}{row.Index}.dcm"
        labels = np.reshape(row[1:], (1,6))
        yield image_path, labels

def image_loading(image_path):
    image_path = tf.compat.as_str_any(tf.strings.reduce_join(image_path).numpy())
    dcm = pydicom.dcmread(image_path)
    try:
        image = bsb_window(dcm)
    except:
        image = np.zeros((256,256,3))
    return image

def wrap_img_load(image_path):
    return tf.numpy_function(image_loading, [image_path], [tf.double])

def set_shape(image, labels):
    image = tf.reshape(image, [256,256,3])
    labels = tf.reshape(labels, [1,6])
    labels = tf.squeeze(labels)
    return image, labels

train_images = tf.data.Dataset.from_generator(train_generator, output_signature=(tf.TensorSpec(shape=(), dtype=tf.string), tf.TensorSpec(shape=(None,6)))).prefetch(tf.data.AUTOTUNE)
test_images = tf.data.Dataset.from_generator(test_generator, output_signature=(tf.TensorSpec(shape=(), dtype=tf.string), tf.TensorSpec(shape=(None,6)))).prefetch(tf.data.AUTOTUNE)

TRAIN_NUM_FILES = 752803
train_images = train_images.shuffle(40)
val_size = int(TRAIN_NUM_FILES * 0.1)
val_images = train_images.take(val_size)
train_images = train_images.skip(val_size)

train_images = train_images.map(lambda image_path, labels: (wrap_img_load(image_path), labels), num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
test_images = test_images.map(lambda image_path, labels: (wrap_img_load(image_path), labels), num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
val_images = val_images.map(lambda image_path, labels: (wrap_img_load(image_path), labels), num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

train_images = train_images.map(lambda image, labels: set_shape(image, labels), num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
test_images = test_images.map(lambda image, labels: set_shape(image, labels), num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
val_images = val_images.map(lambda image, labels: set_shape(image, labels), num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

train_images = train_images.batch(4).prefetch(tf.data.AUTOTUNE)
test_images = test_images.batch(4).prefetch(tf.data.AUTOTUNE)
val_images = val_images.batch(4).prefetch(tf.data.AUTOTUNE)
def create_model():
    model = Sequential([
        InceptionV3(include_top=False, input_shape=(256,256,3), weights='imagenet'),
        GlobalAveragePooling2D(name='avg_pool'),
        Dense(6, activation="sigmoid", name='dense_output'),
    ])
    model.compile(loss="binary_crossentropy", optimizer=tf.keras.optimizers.Adam(5e-4), metrics=["accuracy"])
    return model

model = create_model()

history = model.fit(train_images,
                    epochs=5,
                    verbose=1,
                    callbacks=[checkpointer, scheduler],
                    validation_data=val_images)
The CPU, GPU, and HDD are utilized very efficiently, and training is much faster than with a tf.keras.utils.Sequence data generator.
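For reference, the three map/prefetch stages per split can also be chained into a single expression; a minimal sketch for the training split, assuming wrap_img_load and set_shape as defined above:
train_images = (train_images
                .map(lambda path, labels: (wrap_img_load(path), labels),
                     num_parallel_calls=tf.data.AUTOTUNE)
                .map(set_shape, num_parallel_calls=tf.data.AUTOTUNE)
                .batch(4)
                .prefetch(tf.data.AUTOTUNE))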

Separate trainX and trainY using Tensorflow 2 input pipeline for multi-loss training

I am building a network that takes one picture as input and outputs two predictions, and the loss function is calculated from both predictions. So:
Input(trainX): RGBD
Output1(trainGT): GT
Output2(trainError): error
As I am using TF's standard input pipeline, it zips the inputs and targets. However, when I define the loss functions, I need to separate the two targets. Here is my code for the input pipeline:
@tf.function
def load_image(point_file):
    name = tf.strings.split(point_file, '/')[-1]
    GT = tf.io.read_file(GT_path + name)
    RGBD = tf.io.read_file(RGBD_path + name)
    error = tf.io.read_file(error_path + name)
    GT = tf.image.decode_png(GT)
    RGBD = tf.image.decode_png(RGBD)
    error = tf.image.decode_png(error)[:, :, 0]
    GT = tf.cast(GT, tf.float32) // 255.0
    RGBD = tf.cast(RGBD, tf.float32) / 255.0
    error = tf.cast(error, tf.float32)
    error = tf.reshape(error, [1024])
    return RGBD, GT, error
train_dataset = tf.data.Dataset.list_files(point_path + 'train/*.png')
train_dataset = train_dataset.map(load_image,
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.batch(BATCH_SIZE).repeat()

test_dataset = tf.data.Dataset.list_files(point_path + 'test/*.png')
test_dataset = test_dataset.map(load_image)
test_dataset = test_dataset.batch(BATCH_SIZE)
Later in the fit function what I want to do is:
losses = {
    "GT_output": loss_functions.weighted_dice_loss,
    "error_output": tf.keras.losses.MeanSquaredError(),
}
lossWeights = {"GT_output": 0.9, "error_output": 0.1}

model.compile(optimizer='adam',
              loss=losses, loss_weights=lossWeights)

model.fit(x=trainX,
          y={"GT_output": trainGT, "error_output": trainError},
          validation_data=(testX,
                           {"GT_output": testGT, "error_output": testError}),
          epochs=EPOCHS,
          verbose=1)
But is there a way to separate trainX, trainGT and trainError from train_dataset?
Thank you...
You can do:
take_n = 1000
trainGT = train_dataset.take(take_n)
trainError = train_dataset.skip(take_n)
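Note that take() and skip() split the dataset into two subsets of examples rather than separating inputs from targets. If the goal is to feed the two targets to the named losses directly from train_dataset, Keras also accepts dataset elements of the form (input, dict_of_targets); a minimal sketch, assuming the model's output layers are named GT_output and error_output:
def to_inputs_and_targets(RGBD, GT, error):
    # Key the targets by output-layer name so Keras can match each to its loss
    return RGBD, {"GT_output": GT, "error_output": error}

train_dataset = train_dataset.map(to_inputs_and_targets)
model.fit(train_dataset, epochs=EPOCHS, verbose=1)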

MultiWorkerMirroredStrategy hanging before first epoch?

I'm trying to run a simple MNIST neural net on multiple cluster nodes (3 nodes with 1 GPU each), but it keeps stopping before the first epoch prints. I'm able to get all the nodes to sync, but right before it starts (maybe in the model.fit function) it just stops and doesn't do anything.
Any help is appreciated!
My TF_CONFIG looks like this:
TF_CONFIG='{"cluster": {"worker": ["ip1:88888", "ip2:88888", "ip3:88888"]}, "task": {"index": 0, 1, or 2, "type": "worker"}}' python fileName.py
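(That is, each node runs the same command with its own task index; for example, on the first worker:)
TF_CONFIG='{"cluster": {"worker": ["ip1:88888", "ip2:88888", "ip3:88888"]}, "task": {"index": 0, "type": "worker"}}' python fileName.py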
And my code looks like this:
import tensorflow as tf
from tensorflow import keras
import time

def get_compiled_model():
    # Make a simple 2-layer densely-connected neural network.
    inputs = keras.Input(shape=(784,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    x = keras.layers.Dense(256, activation="relu")(x)
    outputs = keras.layers.Dense(10)(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    return model

def get_dataset():
    batch_size = 32
    num_val_samples = 10000
    # Return the MNIST dataset in the form of a `tf.data.Dataset`.
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    # Preprocess the data (these are Numpy arrays)
    x_train = x_train.reshape(-1, 784).astype("float32") / 255
    x_test = x_test.reshape(-1, 784).astype("float32") / 255
    y_train = y_train.astype("float32")
    y_test = y_test.astype("float32")
    # Reserve num_val_samples samples for validation
    x_val = x_train[-num_val_samples:]
    y_val = y_train[-num_val_samples:]
    x_train = x_train[:-num_val_samples]
    y_train = y_train[:-num_val_samples]
    return (
        tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size),
    )

# Create a MultiWorkerMirroredStrategy.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    model = get_compiled_model()

# Train the model on all available devices.
train_dataset, val_dataset, test_dataset = get_dataset()
start = time.time()
print("Fit")
model.fit(train_dataset, epochs=1, verbose=1, validation_data=val_dataset, steps_per_epoch=25)
end = time.time()
print("Time:", end - start)

# Test the model on all available devices.
model.evaluate(test_dataset)

Validation data leads to training data in keras.fit using TFRecordDataset

I want to use TFRecords in order to train a model on ImageNet (code below). All is well, and training goes just fine, except that validation seems to be performed on the training set and not the validation set.
I tested this by dumbing down the training set and leaving the validation set untouched, which led to both training and validation accuracy being 100%.
When I use fit_generator and flow_from_directory there is no such issue.
This is how I create the TFRecordDatasets:
class ImageNetTFRecordDataset:
    def __init__(self, dir, batch_size):
        self._dir = dir
        self._batch_size = batch_size

    @staticmethod
    def _parse_function(proto):
        # read image
        keys_to_features = {'image/encoded': tf.FixedLenFeature([], tf.string),
                            'image/class/label': tf.FixedLenFeature([], tf.int64)}
        # Load one example
        parsed_features = tf.parse_single_example(proto, keys_to_features)
        # Turn your saved image string into an array
        parsed_features['image/encoded'] = tf.image.decode_jpeg(parsed_features['image/encoded'])
        parsed_features['image/encoded'] = tf.image.convert_image_dtype(parsed_features['image/encoded'], tf.float32)
        parsed_features['image/encoded'] = tf.image.per_image_standardization(parsed_features['image/encoded'])
        return parsed_features['image/encoded'], parsed_features["image/class/label"]

    def create_dataset(self):
        dir_files = os.listdir(self._dir)
        dataset = tf.data.TFRecordDataset([os.path.join(self._dir, f) for f in dir_files])
        # This dataset will go on forever
        dataset = dataset.repeat()
        # Set the number of datapoints you want to load and shuffle
        # dataset = dataset.shuffle(self._batch_size * 10)
        # Maps the parser on every filepath in the array. You can set the number of parallel loaders here
        dataset = dataset.map(self._parse_function, num_parallel_calls=8)
        # Set the batchsize
        dataset = dataset.batch(self._batch_size)
        dataset = dataset.prefetch(4)
        # Create an iterator
        iterator = dataset.make_one_shot_iterator()
        # Create your tf representation of the iterator
        image, label = iterator.get_next()
        # Bring your picture back in shape
        image = tf.reshape(image, [-1, 224, 224, 3])
        # Create a one hot array for your labels
        label = tf.one_hot(label, 1000)
        return image, label
And this is how I train the model:
train_dataset = ImageNetTFRecordDataset(train_dir, BATCH_SIZE)
val_dataset = ImageNetTFRecordDataset(val_dir, BATCH_SIZE)

x, y = train_dataset.create_dataset()
x_val, y_val = val_dataset.create_dataset()

model = get_model(x)  # this model's first layer is Input(tensor=x)
model.compile(optimizer=OPTIMIZER, loss=categorical_crossentropy,
              metrics=get_metrics(), target_tensors=[y])
model.fit(epochs=EPOCHS, verbose=1, validation_data=(x_val, y_val),
          steps_per_epoch=150, callbacks=get_callbacks(),
          validation_steps=VAL_STEPS)
Any ideas where things go wrong? Why does the validation part feed off the training set?
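One way to check which split each pair of tensors actually reads from is to evaluate a few label batches in a session and compare them against the known contents of each directory; a minimal sketch reusing the tensors created above:
with tf.Session() as sess:
    for _ in range(3):
        # Each run pulls one fresh batch from each one-shot iterator
        train_labels, val_labels = sess.run([y, y_val])
        print(train_labels.argmax(axis=1)[:10], val_labels.argmax(axis=1)[:10])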

Tensorflow Keras use tfrecords also for validation

Right now I'm using keras with tensorflow backend.
The dataset is stored in the tfrecords format. Training without any validation set works, but how do I integrate my validation tfrecords as well?
Let's assume this code as a coarse skeleton:
def _ds_parser(proto):
    features = {
        'X': tf.FixedLenFeature([], tf.string),
        'Y': tf.FixedLenFeature([], tf.string)
    }
    parsed_features = tf.parse_single_example(proto, features)

    # get the data back as float32
    parsed_features['X'] = tf.decode_raw(parsed_features['X'], tf.float32)
    parsed_features['Y'] = tf.decode_raw(parsed_features['Y'], tf.float32)

    return parsed_features['X'], parsed_features['Y']
def datasetLoader(dataSetPath, batchSize):
    dataset = tf.data.TFRecordDataset(dataSetPath)

    # Maps the parser on every filepath in the array. You can set the number of parallel loaders here
    dataset = dataset.map(_ds_parser, num_parallel_calls=8)

    # This dataset will go on forever
    dataset = dataset.repeat()

    # Set the batchsize
    dataset = dataset.batch(batchSize)

    # Create an iterator
    iterator = dataset.make_one_shot_iterator()

    # Create your tf representation of the iterator
    X, Y = iterator.get_next()

    # Bring the data back in shape
    X = tf.reshape(X, [-1, 66, 198, 3])
    Y = tf.reshape(Y, [-1, 1])

    return X, Y
X, Y = datasetLoader('PATH-TO-DATASET', 264)

model_X = keras.layers.Input(tensor=X)

model_output = keras.layers.Conv2D(filters=16, kernel_size=3, strides=1, padding='valid', activation='relu',
                                   input_shape=(-1, 66, 198, 3))(model_X)
model_output = keras.layers.Dense(units=1, activation='linear')(model_output)

model = keras.models.Model(inputs=model_X, outputs=model_output)

model.compile(
    optimizer=optimizer,
    loss='mean_squared_error',
    target_tensors=[Y]
)

model.fit(
    epochs=epochs,
    steps_per_epoch=stepPerEpoch,
    shuffle=False,
    validation_data=????
)
The question is, how to pass the validation set?
I have found something related here: gcloud-ml-engine-with-keras, but I'm not sure how to fit this into my problem.
First, you don't need to use an iterator. A Keras model will accept a dataset object instead of separate data/labels parameters and will handle the iteration itself. You only need to specify steps_per_epoch, hence you need to know the dataset size. If you have a separate tfrecords file for train and for validation, then you can just create a dataset object for each and pass the validation one to validation_data. If you have one file and you'd like to split it, you can do
dataset = tf.data.TFRecordDataset('file.tfrecords')
dataset_train = dataset.take(size)
dataset_val = dataset.skip(size)
...
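Putting it together, a minimal sketch (the parse/batch steps mirror the question's datasetLoader, and validationSteps is a hypothetical count you would derive from the validation set size):
dataset_train = dataset_train.map(_ds_parser).repeat().batch(264)
dataset_val = dataset_val.map(_ds_parser).repeat().batch(264)

model.fit(dataset_train,
          epochs=epochs,
          steps_per_epoch=stepPerEpoch,
          validation_data=dataset_val,
          validation_steps=validationSteps)  # hypothetical, e.g. val_size // 264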
OK, I found the answer myself: basically, it's done by simply changing import keras to import tensorflow.keras as keras. tf.keras allows you to pass the validation set as tensors, too:
X, Y = datasetLoader('PATH-TO-DATASET', 264)
X_val, Y_val = datasetLoader('PATH-TO-VALIDATION-DATASET', 264)

# ... define and compile the model like above

model.fit(
    epochs=epochs,
    steps_per_epoch=STEPS_PER_EPOCH,
    shuffle=False,
    validation_data=(X_val, Y_val),
    validation_steps=STEPS_PER_VALIDATION_EPOCH
)
