VGG16 preprocessing dataset generator to dataset mapping - python

I have a VGG16 model implemented with Keras/tensorflow.
When I call model.fit, I pass in a generator of data. The generator does transforms necessary for a VGGNet:
Preprocess the images with vgg16.preprocess_input
Convert the label to a one-hot vector via to_categorical
The generator can be seen below and works. Unfortunately, since there are multiple epochs, I have to set dataset.repeat(-1) (infinitely repeat) so the generator doesn't run out. This in turn requires one to pass a steps_per_epoch so a given iteration of training can complete. As you're probably thinking, this is brittle, (hinges on a known dataset cardinality)!
I have decided it's best to preprocess the training Dataset once up front using Dataset.map. However, I am struggling with the construction of a mapping function, it seems to_categorical doesn't work with a tf.Tensor. Down below is what I have right now, but I am not sure if there's a latent bug.
How can I correctly translate the below Dataset generator into a Dataset.map function?
Current Dataset Generator
This is implemented (and known to work) with Python 3.8 and tensorflow==2.4.4.
from typing import Iterable, Tuple
import numpy as np
import tensorflow as tf
def make_vgg_preprocessing_generator(
dataset: tf.data.Dataset, num_repeat: int = -1
) -> Iterable[Tuple[tf.Tensor, np.ndarray]]:
num_classes = len(dataset.class_names)
for batch_images, batch_labels in dataset.repeat(num_repeat):
pre_images = tf.keras.applications.vgg16.preprocess_input(batch_images)
pre_labels = tf.keras.utils.to_categorical(batch_labels, num_classes)
yield pre_images, pre_labels
train_ds: tf.data.Dataset # Not provided in this sample
model.fit(
make_vgg_preprocessing_generator(train_ds)
epochs=10,
steps_per_epoch=10, # Required since default num_repeat is indefinitely
)
Dataset.map Function
Here is my current translation that I would like to improve.
def vgg_preprocess_dataset(dataset: tf.data.Dataset) -> tf.data.Dataset:
num_classes = len(dataset.class_names)
def _preprocess(x: tf.Tensor, y: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
pre_x = tf.keras.applications.vgg16.preprocess_input(x)
pre_y = tf.one_hot(y, depth=num_classes)
return pre_x, pre_y
return dataset.map(_preprocess)

Yes, you're on the right track! You'll want to replace to_categorical with tf.one_hot, just as you have, as tf.one_hot is specifically for tensors, and is designed for this context. Next, you might want to play around with some of the other tf.data.Dataset methods here and add them to your pipeline. Right now, your batch size will be one sample, and un-shuffled. An example of some other processing you might do:
def vgg_preprocess_dataset(dataset: tf.data.Dataset, batch_size=32, shuffle_buffer=1000) -> tf.data.Dataset:
num_classes = len(dataset.class_names)
def _preprocess(x: tf.Tensor, y: tf.Tensor):
pre_x = tf.keras.applications.vgg16.preprocess_input(x)
pre_y = tf.one_hot(y, depth=num_classes)
# pre_y = to_categorical(y, num_classes)
return pre_x, pre_y
# bigger buffer is better but slower
dataset = dataset.shuffle(shuffle_buffer)
# do your mapping after shuffle
dataset = dataset.map(_preprocess)
# next batch it
dataset = dataset.batch(batch_size)
# this allows your CPU to fetch the next batch (do the above shuffling, mapping, etc) during the
# current GPU pass, so that the GPU has minimal downtime
dataset = dataset.prefetch(2)
return dataset
ds = vgg_preprocess_dataset(ds)
# and you just pass it right to fit!
model.fit(ds)

Related

How do you pass validation data to Keras using a sequence?

Looking at the TensorFlow documentation it says that model.fit(validation_data) cannot be used with keras.utils.Sequence
Note that validation_data does not support all the data types that are supported in x, eg, dict, generator or keras.utils.Sequence.
My validation set is probably just small enough to fit into RAM, but I'd like to avoid loading it all into RAM in case my dataset grows.
To get an idea of how my current Sequence is working, here is the code:
NOTES:
The sequence currently only processes train_data which is a normalized array containing my examples and labels. I have similar arrays for val_data and test_data.
This loop may look a bit odd because I am working with time-series data that pulls a window for each example.
class MyGenerator(tf.keras.utils.Sequence):
'Generates data for Keras'
def __init__(self, ids, train_dir):
'Initialization'
self.ids = ids
self.train_dir = train_dir
def __len__(self):
'Denotes the number of batches per epoch'
return len(self.ids)
def __getitem__(self, index):
batch_id = self.ids[index]
# load data
X_train, y_train = [], []
start_index = seq_len*batch_id
end_index = start_index + seq_len
for i in range(start_index, end_index):
start_seq = i + start_index
X_train.append(train_data[i-seq_len:i])
y_train.append(train_data[:, 4][i])
# Save our batch
X = np.array(X_train)
y = np.array(y_train)
return X, y
Is there a way for me to process my validation set in batches? I would prefer to use Sequence, but if that is not possible I'm open to other options.
According to the documentstion:
validation_data could be:
tuple (x_val, y_val) of Numpy arrays or tensors
tuple (x_val, y_val, val_sample_weights) of Numpy arrays
dataset
For the first two cases, batch_size must be provided. For the last case, validation_steps could be provided.
So by providing batch_size or validation_steps the validation_data would be processed in batches.

ValueError: `validation_split` is only supported for Tensors or NumPy arrays, found: (keras.preprocessing.sequence.TimeseriesGenerator object)

When I tried to add validation_split in my LSTM model, I got this error
ValueError: `validation_split` is only supported for Tensors or NumPy arrays, found: (<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator object)
This is the code
from keras.preprocessing.sequence import TimeseriesGenerator
train_generator = TimeseriesGenerator(df_scaled, df_scaled, length=n_timestamp, batch_size=1)
model.fit(train_generator, epochs=50,verbose=2,callbacks=[tensorboard_callback], validation_split=0.1)
----------
ValueError: `validation_split` is only supported for Tensors or NumPy arrays, found: (<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator object)
One reason I could think of is, to use validation_split a tensor or numpy array is expected, as mentioned in the error, however, when passing train data through TimeSeriesGenerator, it changes the dimension of the train data to a 3D array
And since TimeSeriesGenerator is mandatory to be used when using LSTM, does this means for LSTM we can't use validation_split
Your first intution is right that you can't use the validation_split when using dataset generator.
You will have to understand how the functioninig of dataset generator happens. The model.fit API does not know how many records or batch your dataset has in its first epoch. As the data is generated or supplied for each batch one at a time to the model for training. So there is no way to for the API to know how many records are initially there and then making a validation set out of it. Due to this reason you cannot use the validation_split when using dataset generator. You can read it in their documentation.
Float between 0 and 1. Fraction of the training data to be used as
validation data. The model will set apart this fraction of the
training data, will not train on it, and will evaluate the loss and
any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling. This argument is not supported when x is a
dataset, generator or keras.utils.Sequence instance.
You need to read the last two lines where they have said that it is not supported for dataset generator.
What you can instead do is use the following code to split the dataset. You can read in detail here. I am just writing the important part from the link below.
# Splitting the dataset for training and testing.
def is_test(x, _):
return x % 4 == 0
def is_train(x, y):
return not is_test(x, y)
recover = lambda x, y: y
# Split the dataset for training.
test_dataset = dataset.enumerate() \
.filter(is_test) \
.map(recover)
# Split the dataset for testing/validation.
train_dataset = dataset.enumerate() \
.filter(is_train) \
.map(recover)
I hope my answer helps you.
y = np.array(y)
This fixed it for me.
The error says it only supports numpy arrays, so turn it into an array.

what is the best practices to train model on BIG dataset

I need to train a model on a dataset that required more memory than my GPU has. what is the best practice for feeding the dataset to model?
here is my steps:
first of all, I load dataset using batch_size
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
the second step i prepare data
for record in raw_train_ds.take(1):
train_images, train_labels = record['image'], record['label']
print(train_images.shape)
train_images = train_images.numpy().astype(np.float32) / 255.0
train_labels = tf.keras.utils.to_categorical(train_labels)
and then i feed data to the model
history = model.fit(train_images,train_labels, epochs=NUM_EPOCHS, validation_split=0.2)
but at step 2 I prepared data for the first batch and missed the rest batches because the model.fit is out of the loop scope (which, as I understand, works for one, first batch only).
On the other hand, I can't remove take(1) and move the model.fit method under the cycle. Because yes, in this case, I will handle all batches, but at the same time model.fill will be called at the end on each iteration and in this case, it also will not work properly
so, how I should change my code to be able to work appropriately with a big dataset using model.fit? could you point article, any documents, or just advise how to deal with it? thanks
Update
In my post below (approach 1) I describe one approach on how to solve the problem - are there any other better approaches or it is only one way how to solve this?
You can pass the whole dataset to fit for training. As you can see in the documentation, one of the possible values of the first parameter is:
A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
So you just need to convert your dataset to that format (a tuple with input and target) and pass it to fit:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
raw_train_ds = datasets['train']
train_dataset_fit = raw_train_ds.map(
lambda x: (tf.cast.dtypes(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
One problem with this is that it does not support a validation_split parameter but, as shown in this guide, tfds already gives you the functionality to have the splits of the data. So you would just need to get the test split dataset, transform it as above and pass it as validation_data to fit.
Approach 1
Thank #jdehesa I changed my code :
load dataset - in reality, it doesn't load data into memory till the first call 'next' from the dataset iterator. and even then, I think the iterator will load a portion of data (batch) with a size equal in BATCH_SIZE
raw_train_ds, raw_validation_ds = builder.as_dataset(split=["train[:90%]", "train[10%:]"], batch_size=BATCH_SIZE)
collected all required transformation into one method
def prepare_data(x):
train_images, train_labels = x['image'], x['label']
# TODO: resize image
train_images = tf.cast(train_images,tf.float32)/ 255.0
# train_labels = tf.keras.utils.to_categorical(train_labels,num_classes=NUM_CLASSES)
train_labels = tf.one_hot(train_labels,NUM_CLASSES)
return (train_images, train_labels)
applied these transformations to each element in batch (dataset) using the method td.data.Dataset.map
train_dataset_fit = raw_train_ds.map(prepare_data)
and then fed this dataset into model.fit - as I understand the model.fit will iterate through all batches in the dataset.
train_dataset_fit = raw_train_ds.map(prepare_data)
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)

Avoiding tf.data.Dataset.from_tensor_slices with estimator api

I'm am trying to figure out the recommended way to use the dataset api together with the estimator api. Everything I have seen online is some variation of this:
def train_input_fn():
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
return dataset
which can then be passed to the estimator's train function:
classifier.train(
input_fn=train_input_fn,
#...
)
but the dataset guide warns that:
the above code snippet will embed the features and labels arrays in your TensorFlow graph as tf.constant() operations. This works well for a small dataset, but wastes memory---because the contents of the array will be copied multiple times---and can run into the 2GB limit for the tf.GraphDef protocol buffer.
and then describes a method that involves defining placeholders which are then filled with the feed_dict:
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)
dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
sess.run(iterator.initializer, feed_dict={features_placeholder: features,
labels_placeholder: labels})
But if you're using the estimator api, you're not manually running the session. So how do you use the dataset api with estimators while avoiding the problems associated with from_tensor_slices()?
To use either initializable or reinitializable iterators, you must create a class that inherits from tf.train.SessionRunHook, which has access to the session at multiple times during training and evaluation steps.
You can then use this new class to initialize the iterator has you would normally do in a classic setting. You simply need to pass this newly created hook to the training/evaluation functions or to the correct train spec.
Here is quick example that you can adapt to your needs :
class IteratorInitializerHook(tf.train.SessionRunHook):
def __init__(self):
super(IteratorInitializerHook, self).__init__()
self.iterator_initializer_func = None # Will be set in the input_fn
def after_create_session(self, session, coord):
# Initialize the iterator with the data feed_dict
self.iterator_initializer_func(session)
def get_inputs(X, y):
iterator_initializer_hook = IteratorInitializerHook()
def input_fn():
X_pl = tf.placeholder(X.dtype, X.shape)
y_pl = tf.placeholder(y.dtype, y.shape)
dataset = tf.data.Dataset.from_tensor_slices((X_pl, y_pl))
dataset = ...
...
iterator = dataset.make_initializable_iterator()
next_example, next_label = iterator.get_next()
iterator_initializer_hook.iterator_initializer_func = lambda sess: sess.run(iterator.initializer,
feed_dict={X_pl: X, y_pl: y})
return next_example, next_label
return input_fn, iterator_initializer_hook
...
train_input_fn, train_iterator_initializer_hook = get_inputs(X_train, y_train)
test_input_fn, test_iterator_initializer_hook = get_inputs(X_test, y_test)
...
estimator.train(input_fn=train_input_fn,
hooks=[train_iterator_initializer_hook]) # Don't forget to pass the hook !
estimator.evaluate(input_fn=test_input_fn,
hooks=[test_iterator_initializer_hook])

How to manually specify class labels in keras flow_from_directory?

Problem: I am training a model for multilabel image recognition. My images are therefore associated with multiple y labels. This is conflicting with the convenient keras method "flow_from_directory" of the ImageDataGenerator, where each image is supposed to be in the folder of the corresponding label (https://keras.io/preprocessing/image/).
Workaround: Currently, I am reading all images into a numpy array and use the "flow" function from there. But this results in heavy memory loads and a slow read-in process.
Question: Is there a way to use the "flow_from_directory" method and to supply manually the (multiple) class labels?
Update: I ended up extending the DirectoryIterator class for the multilabel case. You can now set the attribute "class_mode" to the value "multilabel" and provide a dictionary "multlabel_classes" which maps filenames to their labels. Code: https://github.com/tholor/keras/commit/29ceafca3c4792cb480829c5768510e4bdb489c5
You could simply use the flow_from_directory and extend it to a multiclass in a following manner:
def multiclass_flow_from_directory(flow_from_directory_gen, multiclasses_getter):
for x, y in flow_from_directory_gen:
yield x, multiclasses_getter(x, y)
Where multiclasses_getter is assigning a multiclass vector / your multiclass representation to your images. Note that x and y are not a single examples but batches of examples, so this should be included in your multiclasses_getter design.
You could write a custom generator class that would read the files in from the directory and apply the labeling. That custom generator could also take in an ImageDataGenerator instance which would produce the batches using flow().
I am imagining something like this:
class Generator():
def __init__(self, X, Y, img_data_gen, batch_size):
self.X = X
self.Y = Y # Maybe a file that has the appropriate label mapping?
self.img_data_gen = img_data_gen # The ImageDataGenerator Instance
self.batch_size = batch_size
def apply_labels(self):
# Code to apply labels to each sample based on self.X and self.Y
def get_next_batch(self):
"""Get the next training batch"""
self.img_data_gen.flow(self.X, self.Y, self.batch_size)
Then simply:
img_gen = ImageDataGenerator(...)
gen = Generator(X, Y, img_gen, 128)
model.fit_generator(gen.get_next_batch(), ...)
*Disclaimer: I haven't actually tested this, but it should work in theory.
# Training the model
history = model.fit(train_generator, steps_per_epoch=steps_per_epoch, epochs=3, validation_data=val_generator,validation_steps=validation_steps, verbose=1,
callbacks= keras.callbacks.ModelCheckpoint(filepath='/content/results',monitor='val_accuracy', save_best_only=True,save_weights_only=False))
The validation_steps or the steps_per_epoch might be exceeding than that of the original parameters.
steps_per_epoch= (int(num_of_training_examples/batch_size) might help.
Similarly validation_steps= (int(num_of_val_examples/batch_size) will help

Categories