Problem: I am training a model for multilabel image recognition, so my images are associated with multiple y labels. This conflicts with the convenient Keras method "flow_from_directory" of the ImageDataGenerator, where each image is supposed to live in the folder of its corresponding label (https://keras.io/preprocessing/image/).
Workaround: Currently, I read all images into a numpy array and use the "flow" function from there. But this results in heavy memory load and a slow read-in process.
Question: Is there a way to use the "flow_from_directory" method and manually supply the (multiple) class labels?
Update: I ended up extending the DirectoryIterator class for the multilabel case. You can now set the attribute "class_mode" to the value "multilabel" and provide a dictionary "multilabel_classes" which maps filenames to their labels. Code: https://github.com/tholor/keras/commit/29ceafca3c4792cb480829c5768510e4bdb489c5
You could simply use flow_from_directory and extend it to the multilabel case in the following manner:
def multiclass_flow_from_directory(flow_from_directory_gen, multiclasses_getter):
    for x, y in flow_from_directory_gen:
        yield x, multiclasses_getter(x, y)
Here multiclasses_getter assigns a multilabel vector / your multilabel representation to your images. Note that x and y are not single examples but batches of examples, so this should be accounted for in your multiclasses_getter design.
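For instance, a minimal sketch of such a getter (the lookup table and its contents are hypothetical, not part of the original answer) could ignore x and map each one-hot row of y to a multi-hot vector:

import numpy as np

# Hypothetical lookup table: row i is the multi-hot label vector for class index i.
multi_hot_lookup = np.array([
    [1, 0, 1, 0],   # images in folder 0 carry labels 0 and 2
    [0, 1, 0, 0],   # images in folder 1 carry label 1 only
    [0, 0, 1, 1],   # images in folder 2 carry labels 2 and 3
])

def my_multiclasses_getter(x_batch, y_batch):
    # y_batch is the one-hot batch produced by flow_from_directory;
    # convert each row to its class index and look up the multi-hot vector.
    class_indices = np.argmax(y_batch, axis=1)
    return multi_hot_lookup[class_indices]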
You could write a custom generator class that would read the files in from the directory and apply the labeling. That custom generator could also take in an ImageDataGenerator instance which would produce the batches using flow().
I am imagining something like this:
class Generator():
    def __init__(self, X, Y, img_data_gen, batch_size):
        self.X = X
        self.Y = Y  # Maybe a file that has the appropriate label mapping?
        self.img_data_gen = img_data_gen  # The ImageDataGenerator instance
        self.batch_size = batch_size

    def apply_labels(self):
        # Code to apply labels to each sample based on self.X and self.Y
        pass

    def get_next_batch(self):
        """Get the next training batch"""
        return self.img_data_gen.flow(self.X, self.Y, batch_size=self.batch_size)
Then simply:
img_gen = ImageDataGenerator(...)
gen = Generator(X, Y, img_gen, 128)
model.fit_generator(gen.get_next_batch(), ...)
*Disclaimer: I haven't actually tested this, but it should work in theory.
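As a related sketch (not the answer above, and equally untested): in current Keras you could instead subclass keras.utils.Sequence and look up a multi-hot label per file. The file list, label dictionary, and image size below are hypothetical placeholders.

import numpy as np
from tensorflow import keras

class MultiLabelSequence(keras.utils.Sequence):
    def __init__(self, filepaths, labels_by_path, batch_size, target_size=(224, 224)):
        self.filepaths = filepaths            # list of image file paths (hypothetical)
        self.labels_by_path = labels_by_path  # dict: file path -> multi-hot numpy vector
        self.batch_size = batch_size
        self.target_size = target_size

    def __len__(self):
        return int(np.ceil(len(self.filepaths) / self.batch_size))

    def __getitem__(self, idx):
        paths = self.filepaths[idx * self.batch_size:(idx + 1) * self.batch_size]
        # load and stack the images for this batch
        images = np.stack([
            keras.preprocessing.image.img_to_array(
                keras.preprocessing.image.load_img(p, target_size=self.target_size))
            for p in paths])
        # look up the multi-hot label vector for each file
        labels = np.stack([self.labels_by_path[p] for p in paths])
        return images, labels

A Sequence like this can be passed directly to model.fit.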
# Training the model
history = model.fit(train_generator, steps_per_epoch=steps_per_epoch, epochs=3, validation_data=val_generator, validation_steps=validation_steps, verbose=1,
                    callbacks=[keras.callbacks.ModelCheckpoint(filepath='/content/results', monitor='val_accuracy', save_best_only=True, save_weights_only=False)])
The validation_steps or the steps_per_epoch you pass might exceed what the original data can actually provide.
steps_per_epoch = int(num_of_training_examples / batch_size) might help.
Similarly, validation_steps = int(num_of_val_examples / batch_size) will help.
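As a concrete sketch with hypothetical counts (replace with your own dataset sizes):

# Hypothetical example numbers, not from the question
num_of_training_examples = 2000
num_of_val_examples = 500
batch_size = 32

steps_per_epoch = int(num_of_training_examples / batch_size)   # 62
validation_steps = int(num_of_val_examples / batch_size)       # 15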
I have a VGG16 model implemented with Keras/tensorflow.
When I call model.fit, I pass in a generator of data. The generator does transforms necessary for a VGGNet:
Preprocess the images with vgg16.preprocess_input
Convert the label to a one-hot vector via to_categorical
The generator can be seen below and works. Unfortunately, since there are multiple epochs, I have to set dataset.repeat(-1) (repeat indefinitely) so the generator doesn't run out. This in turn requires passing steps_per_epoch so a given iteration of training can complete. As you're probably thinking, this is brittle (it hinges on knowing the dataset cardinality)!
I have decided it's best to preprocess the training Dataset once up front using Dataset.map. However, I am struggling with the construction of a mapping function, it seems to_categorical doesn't work with a tf.Tensor. Down below is what I have right now, but I am not sure if there's a latent bug.
How can I correctly translate the below Dataset generator into a Dataset.map function?
Current Dataset Generator
This is implemented (and known to work) with Python 3.8 and tensorflow==2.4.4.
from typing import Iterable, Tuple

import numpy as np
import tensorflow as tf

def make_vgg_preprocessing_generator(
    dataset: tf.data.Dataset, num_repeat: int = -1
) -> Iterable[Tuple[tf.Tensor, np.ndarray]]:
    num_classes = len(dataset.class_names)
    for batch_images, batch_labels in dataset.repeat(num_repeat):
        pre_images = tf.keras.applications.vgg16.preprocess_input(batch_images)
        pre_labels = tf.keras.utils.to_categorical(batch_labels, num_classes)
        yield pre_images, pre_labels
train_ds: tf.data.Dataset  # Not provided in this sample

model.fit(
    make_vgg_preprocessing_generator(train_ds),
    epochs=10,
    steps_per_epoch=10,  # Required since the default num_repeat=-1 repeats indefinitely
)
Dataset.map Function
Here is my current translation that I would like to improve.
def vgg_preprocess_dataset(dataset: tf.data.Dataset) -> tf.data.Dataset:
    num_classes = len(dataset.class_names)

    def _preprocess(x: tf.Tensor, y: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
        pre_x = tf.keras.applications.vgg16.preprocess_input(x)
        pre_y = tf.one_hot(y, depth=num_classes)
        return pre_x, pre_y

    return dataset.map(_preprocess)
Yes, you're on the right track! You'll want to replace to_categorical with tf.one_hot, just as you have; tf.one_hot works on tensors and is designed for this context. Next, you might want to play around with some of the other tf.data.Dataset methods and add them to your pipeline. Right now your batches will be single, un-shuffled samples. An example of some other processing you might do:
def vgg_preprocess_dataset(dataset: tf.data.Dataset, batch_size=32, shuffle_buffer=1000) -> tf.data.Dataset:
    num_classes = len(dataset.class_names)

    def _preprocess(x: tf.Tensor, y: tf.Tensor):
        pre_x = tf.keras.applications.vgg16.preprocess_input(x)
        pre_y = tf.one_hot(y, depth=num_classes)
        # pre_y = to_categorical(y, num_classes)
        return pre_x, pre_y

    # a bigger buffer is better but slower
    dataset = dataset.shuffle(shuffle_buffer)

    # do your mapping after the shuffle
    dataset = dataset.map(_preprocess)

    # then batch it
    dataset = dataset.batch(batch_size)

    # this allows your CPU to fetch the next batch (doing the above shuffling,
    # mapping, etc.) during the current GPU pass, so the GPU has minimal downtime
    dataset = dataset.prefetch(2)

    return dataset
ds = vgg_preprocess_dataset(ds)
# and you just pass it right to fit!
model.fit(ds)
After looking at the docs and tutorials, it seems very easy to define a hyperparameter for your model. That covers the code that constructs it out of layers, as well as compile-related settings such as the learning rate. What I am (also) looking for is a way to run a hyperparameter search over non-model parameters. Here are some examples:
data augmentation, if you build it as part of the tf dataset pipeline, e.g. the amount of random translation
over/undersampling, which is often used to handle unbalanced classes; one way of doing this is tf.data.Dataset.sample_from_datasets, whose "weights" argument is a hyperparameter
number of epochs. Maybe I am missing something, but this should be supported by keras_tuner in the most straightforward way. A workaround is to use schedule callbacks and achieve this in compile
Is the tuner framework missing all of these? They seem like common things one would want to tune.
This will help: https://keras.io/guides/keras_tuner/custom_tuner/ A custom tuner can be the way to "hyperparameterize" the tf dataset pipeline. Here's the code snippet I used, and it works.
class MyTuner(kt.BayesianOptimization):
    def run_trial(self, trial, train_ds, *args, **kwargs):
        hp = trial.hyperparameters
        # batch_size and AUTO (tf.data.AUTOTUNE) are defined elsewhere
        train_ds = train_ds.shuffle(batch_size * 8).repeat().batch(batch_size).prefetch(buffer_size=AUTO)
        hp_contrast_factor = hp.Float('contrast_factor', min_value=0.01, max_value=0.2, sampling='log')
        random_flip = tf.keras.layers.experimental.preprocessing.RandomFlip('horizontal')
        random_contrast = tf.keras.layers.experimental.preprocessing.RandomContrast(hp_contrast_factor)
        train_ds = train_ds.map(lambda x, y: (random_flip(x, training=True), y), num_parallel_calls=AUTO)
        train_ds = train_ds.map(lambda x, y: (random_contrast(x, training=True), y), num_parallel_calls=AUTO)
        return super(MyTuner, self).run_trial(trial, train_ds, *args, **kwargs)
tuner = MyTuner(
    model_builder,
    objective='val_sparse_categorical_accuracy',
    max_trials=50,
    executions_per_trial=1,
    directory='keras_tuner',
    project_name='classifier_data_aug_custom_tuner'
)
tuner.search(...)
I am trying to follow the PyTorch tutorial HERE, but there seems to be a problem. I have created a custom dataset named training_data that returns an object, as required HERE, which is a dictionary
{"image": image, "label": label}
where image is a tensor and label is a string. I then follow the tutorial and create a DataLoader as follows:
train_dataloader = DataLoader(training_data, batch_size=batch_size)
and use that DataLoader in the method train:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader)
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
However, when I call the training method for a batch
train(train_dataloader, model, loss_fn, optimizer)
I get an error
Traceback (most recent call last):
File "train_network.py", line 110, in <module>
train(train_dataloader, model, loss_fn, optimizer)
File "train_network.py", line 76, in train
X, y = X.to(device), y.to(device)
AttributeError: 'str' object has no attribute 'to'
since y is a string with the content "label". What am I doing wrong?
Your labels y need to be torch tensors. Since you currently have strings, and assuming you are doing classification among n classes, you can simply map them using a list. For example, with three classes, inside the __init__ of your Dataset class:
self.label_names = ["class1", "class2", "class3"]
Then, in __getitem__, you could add:
label = torch.tensor(self.label_names.index(label))
where label previously stored a string.
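A minimal sketch of how these two pieces could fit into a custom Dataset (the sample storage here is hypothetical):

import torch
from torch.utils.data import Dataset

class MyImageDataset(Dataset):
    def __init__(self, samples):
        # samples: hypothetical list of (image_tensor, label_string) pairs
        self.samples = samples
        self.label_names = ["class1", "class2", "class3"]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, label = self.samples[idx]
        # map the string label to an integer class index tensor
        label = torch.tensor(self.label_names.index(label))
        return {"image": image, "label": label}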
A dictionary is a valid type for your dataset to return (though to be fair the more common practice is to have your dataset's __getitem__ method return data, label as two separate tensors.)
In any case, when your dataset __getitem__ method returns an object of type dict, the collate_fn of pytorch returns a dict as well, with the same keys. Your issue is that you are trying to move the key (a string), rather than the value (a tensor) to the GPU.
Instead do:
for batch, datum in enumerate(dataloader):
    X = datum["image"].to(device)
    y = datum["label"].to(device)
    ...
However, as noted by @GoodDeeds, I am fairly confident that you need to convert the string labels to integer class labels for classification, as detailed in that answer.
# from GoodDeeds' answer, included here to condense both relevant changes into one place
self.label_names = ["class1", "class2", "class3"]      # in __init__
label = torch.tensor(self.label_names.index(label))    # in __getitem__
If you don't want to edit the dataset at all you could define a custom collate_fn for your dataloader that replaces string class names with integers, or else perform this type conversion after loading the data batch (not recommended).
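For illustration, a custom collate_fn along those lines might look like this (the class-name list is hypothetical):

import torch
from torch.utils.data import DataLoader

label_names = ["class1", "class2", "class3"]  # hypothetical class names

def collate_with_int_labels(batch):
    # batch is a list of {"image": tensor, "label": string} dicts from the dataset
    images = torch.stack([item["image"] for item in batch])
    labels = torch.tensor([label_names.index(item["label"]) for item in batch])
    return {"image": images, "label": labels}

# train_dataloader = DataLoader(training_data, batch_size=batch_size,
#                               collate_fn=collate_with_int_labels)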
I'm trying to fit my deep learning model with a custom generator.
When I fit the model, it shows me this error:
I tried to find similar questions, but all the answers were about converting lists to numpy arrays. I don't think that's the issue in this error; my lists are all already numpy arrays. This custom generator is based on a custom generator from here.
This is the part of code where I fit the model:
train_generator = RepresentationGenerator(representation_path=representations_path, target_path=target_path,
                                          filenames=training_filenames, batch_size=batch_size)
val_generator = RepresentationGenerator(representation_path=representations_path, target_path=target_path,
                                        filenames=validation_filenames, batch_size=batch_size)

self.model_semantic.fit_generator(train_generator,
                                  epochs=10,
                                  verbose=1,
                                  validation_data=val_generator,
                                  )
return 0
where the variables are:
representations_path - a string with the path to the directory where I store the training files; each file is an input to the model
target_path - a string with the path to the directory where I store the target files; each file is a target of the model (its output)
training_filenames - a list with the names of the training and target files (both have the same names, but they are in different folders)
batch_size - an integer with the size of the batch. It has the value 7.
My generator class is below:
import numpy as np
from tensorflow_core.python.keras.utils.data_utils import Sequence

class RepresentationGenerator(Sequence):
    def __init__(self, representation_path, target_path, filenames, batch_size):
        self.filenames = np.array(filenames)
        self.batch_size = batch_size
        self.representation_path = representation_path
        self.target_path = target_path

    def __len__(self):
        return (np.ceil(len(self.filenames) / float(self.batch_size))).astype(np.int)

    def __getitem__(self, idx):
        files_to_batch = self.filenames[idx * self.batch_size: (idx + 1) * self.batch_size]
        batch_x, batch_y = [], []
        for file in files_to_batch:
            batch_x.append(np.load(self.representation_path + file + ".npy", allow_pickle=True))
            batch_y.append(np.load(self.target_path + file + ".npy", allow_pickle=True))
        return np.array(batch_x), np.array(batch_y)
These are the values, when the method fit is called:
How can I fix this error?
Thank you mates!
When I call the method fit_generator, it calls the method fit.
The method fit calls the method func.fit, and it passes the variable Y, which is set to None.
The error occurs in this line:
Final solution:
Import from the correct place:
from tensorflow.keras.utils import Sequence
Old answers:
If __getitem__ is never called, the problem might be in __len__. You're not returning an int, you're returning a np.int.
I suggest you try:
def __len__(self):
    length = len(self.filenames) // self.batch_size
    if len(self.filenames) % self.batch_size > 0:
        length += 1
    return length
But if __getitem__ is being called and your data returned, then you should inspect your arrays.
Get an item from the generator yourself and check the content:
x, y = train_generator[0]
Are they single arrays? Or are they arrays of arrays? (Must be single)
What are their shapes? Do they have the expected shapes?
What are their types? Usually they should be float, sometimes int (for inputs to embedding layers), very rarely string (for inputs to custom layers that know how to treat strings).
The outputs must always be float, at most int (for sparse losses)
Another supposition: you're using fit with a batch_size while also using a generator. This is strange, and the "if" clauses inside the method may not be well prepared for it; you might be falling into another training case.
Go straight to the usual options:
self.model_semantic.fit_generator(train_generator,
                                  epochs=10,
                                  verbose=1,
                                  validation_data=val_generator)
Your generator is a Sequence, it already has a __len__, you don't need to specify steps_per_epoch or validation_steps.
Every generator has automatic batch sizes, every step is a batch and that's it. You don't need to specify batch_size in fit_generator.
If you're going to use fit, go like this:
...fit(train_generator, steps_per_epoch = len(train_generator),
epochs = 10, verbose = 1,
validation_data = val_generator, validation_steps = len(val_generator))
Finally, you should be hunting for anything that might be None (as the error message suggests) in your code.
Check if every function has a return line.
Check all inputs of your generator in __init__.
Print the filenames inside the generator.
Get the __len__ of the generator.
Try to get an item from the generator: x, y = train_generator[0]
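A quick diagnostic along those lines (names assume the generator from the question):

# sanity-check the generator before training
print(len(train_generator))      # should be an integer greater than zero

x, y = train_generator[0]
print(type(x), type(y))          # should be numpy arrays, not None
print(x.shape, x.dtype)          # e.g. (batch_size, ...), a float dtype
print(y.shape, y.dtype)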
I'm trying to figure out the recommended way to use the Dataset API together with the Estimator API. Everything I have seen online is some variation of this:
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    return dataset
which can then be passed to the estimator's train function:
classifier.train(
    input_fn=train_input_fn,
    #...
)
but the dataset guide warns that:
the above code snippet will embed the features and labels arrays in your TensorFlow graph as tf.constant() operations. This works well for a small dataset, but wastes memory---because the contents of the array will be copied multiple times---and can run into the 2GB limit for the tf.GraphDef protocol buffer.
and then describes a method that involves defining placeholders which are then filled with the feed_dict:
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))

sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
But if you're using the estimator api, you're not manually running the session. So how do you use the dataset api with estimators while avoiding the problems associated with from_tensor_slices()?
To use either initializable or reinitializable iterators, you must create a class that inherits from tf.train.SessionRunHook, which has access to the session at multiple points during the training and evaluation steps.
You can then use this new class to initialize the iterator as you would normally do in a classic setting. You simply need to pass this newly created hook to the training/evaluation functions or to the correct train spec.
Here is a quick example that you can adapt to your needs:
class IteratorInitializerHook(tf.train.SessionRunHook):
    def __init__(self):
        super(IteratorInitializerHook, self).__init__()
        self.iterator_initializer_func = None  # Will be set in the input_fn

    def after_create_session(self, session, coord):
        # Initialize the iterator with the data feed_dict
        self.iterator_initializer_func(session)


def get_inputs(X, y):
    iterator_initializer_hook = IteratorInitializerHook()

    def input_fn():
        X_pl = tf.placeholder(X.dtype, X.shape)
        y_pl = tf.placeholder(y.dtype, y.shape)

        dataset = tf.data.Dataset.from_tensor_slices((X_pl, y_pl))
        dataset = ...
        ...

        iterator = dataset.make_initializable_iterator()
        next_example, next_label = iterator.get_next()

        iterator_initializer_hook.iterator_initializer_func = lambda sess: sess.run(
            iterator.initializer,
            feed_dict={X_pl: X, y_pl: y})

        return next_example, next_label

    return input_fn, iterator_initializer_hook
...
train_input_fn, train_iterator_initializer_hook = get_inputs(X_train, y_train)
test_input_fn, test_iterator_initializer_hook = get_inputs(X_test, y_test)
...
estimator.train(input_fn=train_input_fn,
                hooks=[train_iterator_initializer_hook])  # Don't forget to pass the hook!
estimator.evaluate(input_fn=test_input_fn,
                   hooks=[test_iterator_initializer_hook])