Looking at the TensorFlow documentation, it says that model.fit's validation_data argument cannot be used with keras.utils.Sequence:
Note that validation_data does not support all the data types that are supported in x, eg, dict, generator or keras.utils.Sequence.
My validation set is probably just small enough to fit into RAM, but I'd like to avoid loading it all into RAM in case my dataset grows.
To get an idea of how my current Sequence is working, here is the code:
NOTES:
The Sequence currently only processes train_data, which is a normalized array containing my examples and labels. I have similar arrays for val_data and test_data.
This loop may look a bit odd because I am working with time-series data that pulls a window for each example.
class MyGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'

    def __init__(self, ids, train_dir):
        'Initialization'
        self.ids = ids
        self.train_dir = train_dir

    def __len__(self):
        'Denotes the number of batches per epoch'
        return len(self.ids)

    def __getitem__(self, index):
        batch_id = self.ids[index]

        # Load data: build one sliding window per example in this batch
        X_train, y_train = [], []
        start_index = seq_len * batch_id
        end_index = start_index + seq_len
        for i in range(start_index, end_index):
            start_seq = i + start_index                # note: currently unused
            X_train.append(train_data[i - seq_len:i])  # window of the previous seq_len rows
            y_train.append(train_data[:, 4][i])        # label is column 4 of the current row

        # Save our batch
        X = np.array(X_train)
        y = np.array(y_train)
        return X, y
Is there a way for me to process my validation set in batches? I would prefer to use Sequence, but if that is not possible I'm open to other options.
According to the documentation:
validation_data could be:
tuple (x_val, y_val) of Numpy arrays or tensors
tuple (x_val, y_val, val_sample_weights) of Numpy arrays
dataset
For the first two cases, batch_size must be provided. For the last case, validation_steps could be provided.
So by providing batch_size or validation_steps, the validation data will be processed in batches.
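For illustration, a minimal sketch of the dataset option, assuming a compiled model; the arrays x_val/y_val and the generator arguments train_ids/train_dir are placeholders, not names from the question. For data that outgrows RAM, the same dataset could be built from files instead of in-memory arrays:

import numpy as np
import tensorflow as tf

# Hypothetical validation arrays; in practice these would come from val_data.
x_val = np.random.rand(200, 30, 5).astype("float32")
y_val = np.random.rand(200, 1).astype("float32")

# A batched tf.data.Dataset is consumed batch by batch during validation,
# so the whole validation set is never fed to the model as one tensor.
val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(32)

model.fit(
    MyGenerator(train_ids, train_dir),  # the Sequence from the question
    validation_data=val_ds,             # processed in batches
    epochs=10,
)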
Related
I have a VGG16 model implemented with Keras/tensorflow.
When I call model.fit, I pass in a generator of data. The generator does transforms necessary for a VGGNet:
Preprocess the images with vgg16.preprocess_input
Convert the label to a one-hot vector via to_categorical
The generator can be seen below and works. Unfortunately, since there are multiple epochs, I have to set dataset.repeat(-1) (repeat infinitely) so the generator doesn't run out. This in turn requires passing steps_per_epoch so a given iteration of training can complete. As you're probably thinking, this is brittle (it hinges on a known dataset cardinality)!
I have decided it's best to preprocess the training Dataset once up front using Dataset.map. However, I am struggling with the construction of the mapping function; it seems to_categorical doesn't work with a tf.Tensor. Below is what I have right now, but I am not sure whether there's a latent bug.
How can I correctly translate the below Dataset generator into a Dataset.map function?
Current Dataset Generator
This is implemented (and known to work) with Python 3.8 and tensorflow==2.4.4.
from typing import Iterable, Tuple
import numpy as np
import tensorflow as tf
def make_vgg_preprocessing_generator(
    dataset: tf.data.Dataset, num_repeat: int = -1
) -> Iterable[Tuple[tf.Tensor, np.ndarray]]:
    num_classes = len(dataset.class_names)
    for batch_images, batch_labels in dataset.repeat(num_repeat):
        pre_images = tf.keras.applications.vgg16.preprocess_input(batch_images)
        pre_labels = tf.keras.utils.to_categorical(batch_labels, num_classes)
        yield pre_images, pre_labels

train_ds: tf.data.Dataset  # Not provided in this sample
model.fit(
    make_vgg_preprocessing_generator(train_ds),
    epochs=10,
    steps_per_epoch=10,  # Required since the default num_repeat repeats indefinitely
)
Dataset.map Function
Here is my current translation that I would like to improve.
def vgg_preprocess_dataset(dataset: tf.data.Dataset) -> tf.data.Dataset:
    num_classes = len(dataset.class_names)

    def _preprocess(x: tf.Tensor, y: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
        pre_x = tf.keras.applications.vgg16.preprocess_input(x)
        pre_y = tf.one_hot(y, depth=num_classes)
        return pre_x, pre_y

    return dataset.map(_preprocess)
Yes, you're on the right track! You'll want to replace to_categorical with tf.one_hot, just as you have: tf.one_hot works on tensors and is designed for exactly this context. Next, you might want to play around with some of the other tf.data.Dataset methods here and add them to your pipeline. Right now, your batches will be single, un-shuffled samples. An example of some other processing you might do:
def vgg_preprocess_dataset(dataset: tf.data.Dataset, batch_size=32, shuffle_buffer=1000) -> tf.data.Dataset:
    num_classes = len(dataset.class_names)

    def _preprocess(x: tf.Tensor, y: tf.Tensor):
        pre_x = tf.keras.applications.vgg16.preprocess_input(x)
        pre_y = tf.one_hot(y, depth=num_classes)
        # pre_y = to_categorical(y, num_classes)
        return pre_x, pre_y

    # a bigger buffer shuffles better but is slower
    dataset = dataset.shuffle(shuffle_buffer)
    # do your mapping after the shuffle
    dataset = dataset.map(_preprocess)
    # then batch it
    dataset = dataset.batch(batch_size)
    # this allows your CPU to fetch the next batch (do the above shuffling, mapping, etc.)
    # during the current GPU pass, so that the GPU has minimal downtime
    dataset = dataset.prefetch(2)
    return dataset
ds = vgg_preprocess_dataset(ds)
# and you just pass it right to fit!
model.fit(ds)
I want to implement this situation for the torchvision MNIST dataset, loading data with DataLoader:
batch A (unaugmented images): 5, 0, 4, ...
batch B (augmented images): 5*, 5+, 5-, 0*, 0+, 0-, 4*, 4+, 4-, ...
... where for every image in A there are 3 augmentations in batch B, so len(B) = 3*len(A). These batches should be used within a single iteration to compare the original images of batch A with their augmented versions in batch B to build a loss.
class MyMNIST(Dataset):
    def __init__(self, mnist_dir, train, augmented, transform=None, repeat=1):
        self.mnist_dir = mnist_dir
        self.train = train
        self.augmented = augmented
        self.repeat = repeat
        self.transform = transform
        self.dataset = None
        if augmented and train:
            self.dataset = datasets.MNIST(self.mnist_dir, train=train, download=True, transform=transform)
            # repeat every image/label so each original has `repeat` augmented copies
            self.dataset.data = torch.repeat_interleave(self.dataset.data, repeats=self.repeat, dim=0)
            self.dataset.targets = torch.repeat_interleave(self.dataset.targets, repeats=self.repeat, dim=0)
        elif augmented and not train:
            raise Exception("Test set should not be augmented.")
        else:
            self.dataset = datasets.MNIST(self.mnist_dir, train=train, download=True, transform=transform)
With this class, I want to initialize two different dataloaders:
orig_train = MyMNIST(MNIST_DIR, train=True, augmented=False, transform=orig_transforms)
orig_train_loader = torch.utils.data.DataLoader(orig_train.dataset, batch_size=100, shuffle=True)
aug_train = MyMNIST(MNIST_DIR, train=True, augmented=True, transform=aug_transforms, repeat=3)
aug_train_loader = torch.utils.data.DataLoader(aug_train.dataset, batch_size=300, shuffle=True)
My problem now is that I also need to shuffle on each iteration while keeping the order between A and B in correspondence, which is not possible with the above code, as the two DataLoaders yield different orders. So I tried to work with a single DataLoader and manually build a repeated batch:
for batch_no, (images, labels) in enumerate(orig_train_loader):
    repeat_images = torch.repeat_interleave(images, 3, dim=0)
This way, I get the order of batch B (repeat_images) right, but now I'm missing the augmentation transforms, which I would need to apply within a batch/iteration. This doesn't seem to be the PyTorch paradigm; at least I did not find a way to do it.
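To make the missing step concrete, this is roughly what I would like to end up with (sketch only: it assumes aug_transforms can operate on tensors, which not every torchvision transform does, and my_loss is just a placeholder for the actual comparison loss):

for batch_no, (images, labels) in enumerate(orig_train_loader):
    repeat_images = torch.repeat_interleave(images, 3, dim=0)
    # the missing piece: augment each repeated copy inside the iteration
    aug_images = torch.stack([aug_transforms(img) for img in repeat_images])
    loss = my_loss(images, aug_images)  # placeholder comparison loss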
I would be happy if somebody could help me. I am quite new to PyTorch (and also to Stack Overflow), so please feel free to criticize my whole approach, point out performance issues that could arise, etc.
Thanks a lot!
model.fit(x,y, epochs=10000, batch_size=1)
The above code works fine. When I use a function to feed the data into the model, something goes wrong.
model.fit(GData(), epochs=10000, batch_size=1)
per_sample_losses = loss_fn.call(targets[i], outs[i])
IndexError: list index out of range
The GData() function is given below:
def GData():
    return (x, y)
x is a numpy array with dimension (2, 63, 85)
y is a numpy array with dimension (2, 63, 41000)
This is the whole code:
import os
import tensorflow as tf
import numpy as np

def MSE(y_true, y_pred):
    error = tf.math.reduce_mean(tf.math.square(y_true - y_pred))
    return error

data = np.load("Data.npz")
x = data['x']  # (2, 63, 85)
y = data['y']  # (2, 63, 41000)

frame = x.shape[1]
InSize = x.shape[2]
OutSize = y.shape[2]

def GData():
    return (x, y)

model = tf.keras.Sequential()
model.add(tf.keras.layers.GRU(1000, return_sequences=True, input_shape=(frame, InSize)))
model.add(tf.keras.layers.Dense(OutSize))

model.compile(optimizer='adam',
              loss=MSE)  # 'mean_squared_error'
model.fit(GData(), epochs=10000, batch_size=1)
First, your function GData is not actually a generator, as it returns a value rather than yielding one. Regardless, we should take a look at the fit() method and its documentation, which you can find here.
From this, we see that the first two arguments to fit() are x and y. Going further, we see that x is limited to a few types, namely generators, numpy arrays, tf.data.Datasets, and a few others. An important thing to note in the documentation is that if x is a generator, it must be "a generator or keras.utils.Sequence returning (inputs, targets)". I am assuming this is what you are looking for. If this is the case, you will need to modify your GData function so that it is actually a generator. This can be done like so:
batch_size = 1
EPOCHS = 10000

def GData():
    # Iterate through epochs. This could instead be `while True` so the generator yields
    # indefinitely; the model stops after the number of epochs you specify in fit().
    for _ in range(EPOCHS):
        for i in range(0, len(x), batch_size):                 # Iterate through batches
            yield (x[i:i + batch_size], y[i:i + batch_size])   # Yield one batch for training
Then, you have to specify the number of steps per epoch in your fit() call so your model knows when to stop at each epoch.
model.fit(GData(), epochs=EPOCHS, steps_per_epoch=x.shape[0]//batch_size)
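Alternatively, since the documentation quoted above also allows a tf.data.Dataset for x, a sketch of that route with the same arrays (skipping the generator entirely, and only as an illustration) could look like this:

import tensorflow as tf

# Slice the in-memory arrays into (input, target) pairs and batch them;
# batch(1) mirrors batch_size=1 from the original fit call.
train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(1)

model.fit(train_ds, epochs=10000)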
I'm trying to fit my deep learning model with a custom generator.
When I fit the model, it shows me this error:
I tried to find similar questions, but all the answers were about converting lists to numpy arrays. I don't think that is the issue behind this error; my lists are all already numpy arrays. This custom generator is based on a custom generator from here.
This is the part of the code where I fit the model:
train_generator = RepresentationGenerator(representation_path=representations_path, target_path=target_path,
                                          filenames=training_filenames, batch_size=batch_size)
val_generator = RepresentationGenerator(representation_path=representations_path, target_path=target_path,
                                        filenames=validation_filenames, batch_size=batch_size)

self.model_semantic.fit_generator(train_generator,
                                  epochs=10,
                                  verbose=1,
                                  validation_data=val_generator,
                                  )
return 0
where the variables are:
representations_path - a string with the path to the directory where I store the training files; each file is an input to the model
target_path - a string with the path to the directory where I store the target files; each file is a target (output) of the model
training_filenames - a list with the names of the training and target files (both have the same names, but they are in different folders)
batch_size - an integer with the size of the batch. It has the value 7.
My generator class is below:
import numpy as np
from tensorflow_core.python.keras.utils.data_utils import Sequence

class RepresentationGenerator(Sequence):

    def __init__(self, representation_path, target_path, filenames, batch_size):
        self.filenames = np.array(filenames)
        self.batch_size = batch_size
        self.representation_path = representation_path
        self.target_path = target_path

    def __len__(self):
        return (np.ceil(len(self.filenames) / float(self.batch_size))).astype(np.int)

    def __getitem__(self, idx):
        files_to_batch = self.filenames[idx * self.batch_size: (idx + 1) * self.batch_size]
        batch_x, batch_y = [], []
        for file in files_to_batch:
            batch_x.append(np.load(self.representation_path + file + ".npy", allow_pickle=True))
            batch_y.append(np.load(self.target_path + file + ".npy", allow_pickle=True))
        return np.array(batch_x), np.array(batch_y)
These are the values when the method fit is called:
How can I fix this error?
Thank you mates!
When I call the method fit_generator, it calls the method fit.
The method fit then calls func.fit and passes the variable y, which is set to None.
The error occurs in this line:
Final solution:
Import from the correct place:
from tensorflow.keras.utils import Sequence
Old answers:
If __getitem__ is never called, the problem might be in __len__. You're not returning an int, you're returning a np.int.
I suggest you try:
def __len__(self):
    length = len(self.filenames) // self.batch_size
    if len(self.filenames) % self.batch_size > 0:
        length += 1
    return length
But if __getitem__ is being called and your data is returned, then you should inspect your arrays.
Get an item from the generator yourself and check the content:
x, y = train_generator[0]
Are they single arrays? Or are they arrays of arrays? (Must be single)
What are their shapes? Do they have the expected shapes?
What are their types? Usually they should be float, sometimes int (for inputs to embedding layers), very rarely string (for inputs to custom layers that know how to treat strings).
The outputs must always be float, at most int (for sparse losses)
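For example, a quick inspection along those lines (sketch only; the expected shapes depend on your model and data):

x, y = train_generator[0]

# should be plain numeric numpy arrays, not arrays of arrays (dtype must not be object)
print(type(x), x.dtype, x.shape)
print(type(y), y.dtype, y.shape)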
Another supposition: you're using fit with batch_size while using a generator... this is strange, the "if" clauses inside the method may not be well prepared for it, and you might be falling into another training case.
Go straight to the usual options:
self.model_semantic.fit_generator(train_generator,
                                  epochs=10,
                                  verbose=1,
                                  validation_data=val_generator)
Your generator is a Sequence; it already has a __len__, so you don't need to specify steps_per_epoch or validation_steps.
Every generator handles batch sizes automatically: every step is one batch, and that's it. You don't need to specify batch_size in fit_generator.
If you're going to use fit, go like this:
...fit(train_generator, steps_per_epoch=len(train_generator),
       epochs=10, verbose=1,
       validation_data=val_generator, validation_steps=len(val_generator))
Finally, you should be hunting for anything that might be None (as the error message suggests) in your code.
Check if every function has a return line.
Check all inputs of your generator in __init__.
Print the filenames inside the generator.
Get the __len__ of the generator.
Try to get an item from the generator: x, y = train_generator[0]
Without arguing the pros and cons of whether to actually do this, I'm curious if anyone has created or knows of a simple way to mutate the training data between epochs during the fitting of a model using keras.
Example: I have 100 vectors and output features that I'm using to train a model. I randomly pick 80 of them for the training set, setting the other 20 aside for validation, and then run:
model.fit(train_vectors,train_features,validation_data=(test_vectors,test_features))
Keras fitting allows one to shuffle the order of the training data with shuffle=True but this just randomly changes the order of the training data. It might be fun to randomly pick just 40 vectors from the training set, run an epoch, then randomly pick another 40 vectors, run another epoch, etc.
https://keras.io/models/model/#fit
model.fit() has an argument steps_per_epoch. If you set shuffle=True and choose steps_per_epoch small enough you will get the behaviour that you describe.
In your example with 80 training examples: you could for instance set batch_size to 20 and steps_per_epoch to 4, or batch_size to 10 and steps_per_epoch to 8 etc.
I found that specifying both steps_per_epoch and batch_size raises an error. You can find the corresponding lines in the code linked below (search for "if steps is not None and batch_size is not None:"). Thus, we need to implement a data generator in order to realize such behavior.
https://github.com/keras-team/keras/blob/1cf5218edb23e575a827ca4d849f1d52d21b4bb0/keras/engine/training_utils.py
The only way (at the moment) to achieve such a result is to use a generator:
from tensorflow.python.keras.utils import Sequence
import numpy as np

class mygenerator(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        # read your data here using the batch lists, batch_x and batch_y
        x = [my_readfunction(filename) for filename in batch_x]
        y = [my_readfunction(filename) for filename in batch_y]
        return np.array(x), np.array(y)
To obtain the behavior you want, you may change the function __getitem__ so that it returns a random batch each time.
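A minimal sketch of that idea, assuming x_set and y_set are in-memory numpy arrays rather than filenames; the samples_per_epoch argument and the use of on_epoch_end to draw a fresh random subset are additions for illustration, not part of the original answer:

from tensorflow.python.keras.utils import Sequence
import numpy as np

class RandomSubsetGenerator(Sequence):

    def __init__(self, x_set, y_set, batch_size, samples_per_epoch):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.samples_per_epoch = samples_per_epoch
        self.on_epoch_end()  # draw the first random subset

    def on_epoch_end(self):
        # Keras calls this after every epoch: pick a new random subset of the data
        self.indices = np.random.choice(len(self.x), self.samples_per_epoch, replace=False)

    def __len__(self):
        return int(np.ceil(self.samples_per_epoch / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_idx = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.x[batch_idx], self.y[batch_idx]

# e.g. 40 random training vectors per epoch, as described in the question
gen = RandomSubsetGenerator(train_vectors, train_features, batch_size=10, samples_per_epoch=40)
model.fit(gen, epochs=50, validation_data=(test_vectors, test_features))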