Getting x_test, y_test from generator in Keras?

Getting x_test, y_test from generator in Keras? - python

For certain problems, the validation data can't be a generator, e.g.: TensorBoard histograms:
If printing histograms, validation_data must be provided, and cannot be a generator.
My current code looks like:
image_data_generator = ImageDataGenerator()
training_seq = image_data_generator.flow_from_directory(training_dir)
validation_seq = image_data_generator.flow_from_directory(validation_dir)
testing_seq = image_data_generator.flow_from_directory(testing_dir)
model = Sequential(..)
# ..
model.compile(..)
model.fit_generator(training_seq, validation_data=validation_seq, ..)
How do I provide it as validation_data=(x_test, y_test)?

Python 2.7 and Python 3.* solution:
from platform import python_version_tuple
if python_version_tuple()[0] == '3':
xrange = range
izip = zip
imap = map
else:
from itertools import izip, imap
import numpy as np
# ..
# other code as in question
# ..
x, y = izip(*(validation_seq[i] for i in xrange(len(validation_seq))))
x_val, y_val = np.vstack(x), np.vstack(y)
Or to support class_mode='binary', then:
from keras.utils import to_categorical
x_val = np.vstack(x)
y_val = np.vstack(imap(to_categorical, y))[:,0] if class_mode == 'binary' else y
Full runnable code: https://gist.github.com/AlecTaylor/7f6cc03ed6c3dd84548a039e2e0fd006

Update (22/06/2018): Read the answer provided by the OP for a concise and efficient solution. Read mine to understand what's going on.
In python you can get all the generators data using:
data = [x for x in generator]
But, ImageDataGenerators does not terminate and therefor the approach above would not work. But we can use the same approach with some modifications to work in this case:
data = [] # store all the generated data batches
labels = [] # store all the generated label batches
max_iter = 100 # maximum number of iterations, in each iteration one batch is generated; the proper value depends on batch size and size of whole data
i = 0
for d, l in validation_generator:
data.append(d)
labels.append(l)
i += 1
if i == max_iter:
break
Now we have two lists of tensor batches. We need to reshape them to make two tensors, one for data (i.e X) and one for labels (i.e. y):
data = np.array(data)
data = np.reshape(data, (data.shape[0]*data.shape[1],) + data.shape[2:])
labels = np.array(labels)
labels = np.reshape(labels, (labels.shape[0]*labels.shape[1],) + labels.shape[2:])

Related

Custom generator runs out of data even when steps_per_epoch specified

I am training a model using custom generators, but just before finishing the first epoch, the model runs out of data. It gives me the following error:
Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least (steps_per_epoch * epochs) batches (in this case, 8740 batches). You may need to use the repeat() function when building your dataset
I have four generators (one for the train data, and another for the train label. Same thing with validation). I then zip train & label together. This is the prototype of my generators. I got the idea from here:
import numpy as np
import nibabel as nib
from tensorflow import keras
import os
def weirddivision(n,d):
return np.array(n)/np.array(d) if d else 0
class ImgDataGenerator(keras.utils.Sequence):
def __init__(self, file_list, batch_size=8, shuffle=True):
"""Constructor can be expanded,
with batch size, dimentation etc.
"""
self.file_list = file_list
self.batch_size = batch_size
self.shuffle = shuffle
self.on_epoch_end()
def __len__(self):
'Take all batches in each iteration'
return int(np.floor(len(self.file_list) / self.batch_size))
def __getitem__(self, index):
'Get next batch'
# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# single file
file_list_temp = [self.file_list[k] for k in indexes]
# Set of X_train and y_train
X = self.__data_generation(file_list_temp)
return X
def on_epoch_end(self):
'Updates indexes after each epoch'
self.indexes = np.arange(len(self.file_list))
if self.shuffle == True:
np.random.shuffle(self.indexes)
def __data_generation(self, file_list_temp):
'Generates data containing batch_size samples'
train_loc = '/home/faruk/Desktop/BrainSeg/Dataset/Train/'
X = np.empty((self.batch_size,224,256,1))
# Generate data
for i, ID in enumerate(file_list_temp):
x_file_path = os.path.join(train_loc, ID)
img = np.load(x_file_path)
img = np.pad(img, pad_width=((14,13),(12,11)), mode='constant')
img = np.expand_dims(img,-1)
img = weirddivision(img, img.max())
# Store sample
X[i,] = img
return X
As mentioned, here I create four generators and zip them:
training_img_generator = ImgDataGenerator(train)
training_label_generator = LabelDataGenerator(train)
train_generator = zip(training_img_generator,training_label_generator)
val_img_generator = ValDataGenerator(val)
val_label_generator = ValLabelDataGenerator(val)
val_generator = zip(val_img_generator,val_label_generator)
Because the generator is generating data dynamically, I thought that maybe it was trying to generate more than what is actually available. Hence, I calculated the steps per epoch as follows and passed it to fit_generator:
batch_size = 8
spe = len(train)//batch_size # len(train) = 34965
val_spe = len(val)//batch_size # len(val) = 4347
History=model.fit_generator(generator=train_generator, validation_data=val_generator, epochs=2, steps_per_epoch=spe, validation_steps = val_spe, shuffle=True, verbose=1)
But still, this is not working. I have tried reducing the number of steps per epoch, and I am able to finish the first epoch, but the error then appears at the beginning of the second epoch. Apparently the generator needs to be repeated infinitely, but I don't know how to achieve this. Can I use an infinite while loop? If yes, where?

Try this:
train_generator = train_generator.repeat()
val_generator = val_generator.repeat()

I solved this. I was defining my Generator class as follows:
class ImgDataGenerator(keras.utils.Sequence)
However, my model was not sequential... It was functional. I solved this by creating my own custom generator without inheriting from the keras.utils.sequence.
I hope this is helpful to someone.

model.evaluate() changes results depending on batch size, when fed by generator

Working in colab, with default tensorflow and keras versions (which print tensorflow 2.2.0-rc2, keras 2.3.0-tf )
I've got a superweird error. Basically, the results of model.evaluate() depend on the batch size I'm using and they change after I shuffle the data. Which makes no sense. I've been able to reproduce this in a minimally working example. In my full program (which works in 3D with bigger datasets) the variations are even more significant. I don't know whether this might depend on batch normalization... But I expect it to be fixed when I'm predicting! My full program is doing multiclass segmentation, my minimal example takes a black image with a white square in a random position, with some little noise, and tries to segment the same white square out of it.
I'm using keras sequence as generators to feed data to the model, which I guess might be relevant as I don't see the behaviour when evaluating the data directly.
Here's the code with its output:
#environment setup
%tensorflow_version 2.x
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input,Conv2D, Activation, BatchNormalization
from tensorflow.keras import metrics
#set up a toy model
K.set_image_data_format("channels_last")
inputL = Input([64,64,1])
l1 = Conv2D(4,[3,3],padding='same')(inputL)
l1N = BatchNormalization(axis=-1,momentum=0.9) (l1)
l2 = Activation('relu') (l1N)
l3 = Conv2D(32,[3,3],padding='same')(l2)
l3N = BatchNormalization(axis=-1,momentum=0.9) (l3)
l4 = Activation('relu') (l3N)
l5 = Conv2D(1,[1,1],padding='same',dtype='float32')(l4)
l6 = Activation('sigmoid') (l5)
model = Model(inputs=inputL,outputs=l6)
model.compile(optimizer='sgd',loss='mse',metrics='accuracy' )
#Create random images
import numpy as np
import random
X_train = np.zeros([96,64,64,1])
for imIdx in range(96):
centPoin = random.randrange(7,50)
X_train[imIdx,centPoin-5:centPoin+5,centPoin-5:centPoin+5,0]=1
X_val = X_train[:32,:,:,:]
X_train = X_train[32:,:,:,:]
Y_train = X_train.copy()
X_train = np.random.normal(0.,0.1,size=X_train.shape)+X_train
for imIdx in range(64):
X_train[imIdx,:,:,:] = X_train[imIdx,:,:,:]+np.random.normal(0,0.2,size=1)
from tensorflow.keras.utils import Sequence
import random
import tensorflow as tf
#setup the data generator
class dataGen (Sequence):
def __init__ (self,x_set,y_set,batch_size):
self.x, self.y = x_set, y_set
self.batch_size = batch_size
nSamples = self.x.shape[0]
patList = np.array(range(nSamples),dtype='int16')
patList = patList.reshape(nSamples,1)
np.random.shuffle(patList)
self.patList = patList
def __len__ (self):
return round(self.patList.shape[0] / self.batch_size)
def __getitem__ (self, idx):
patStart = idx
batchS = self.batch_size
listLen = self.patList.shape[0]
Xout = np.zeros((batchS,64,64,1))
Yout = np.zeros((batchS,64,64,1))
for patIdx in range(batchS):
curPat = (patStart+patIdx) % listLen
patInd = self.patList[curPat]
Xout[patIdx,:,:] = self.x[patInd,:,:,:]
Yout[patIdx,:,:] = self.y[patInd,:,:,:]
return Xout, Yout
def on_epoch_end(self):
np.random.shuffle(self.patList)
def setBatchSize(self,batchS):
self.batch_size = batchS
#load the data in the generator
trainGen = dataGen(X_train,Y_train,16)
valGen = dataGen(X_val,X_val,16)
# train the model for two epochs, so that the loss is bad
trainSteps = len(trainGen)
model.fit(trainGen,steps_per_epoch=trainSteps,epochs=32,validation_data=valGen,validation_steps=len(valGen))
trainGen.setBatchSize(4)
model.evaluate(trainGen)
[0.16259156167507172, 0.9870567321777344]
trainGen.setBatchSize(16)
model.evaluate(trainGen)
[0.17035068571567535, 0.9617958068847656]
trainGen.on_epoch_end()
trainGen.setBatchSize(16)
model.evaluate(trainGen)
[0.16663715243339539, 0.9710426330566406]
If I do model.evaluate(Xtrain,Ytrain,batch_size=16) instead the result is not dependent from the batch size.
If I train the model until convergence, where the loss gets to 0.05, the same thing still happens. With the accuracy fluctuating from one evaluation to the other from 0.95 to 0.99.
Why would this happen?
I'd expect the prediction to be super easy, am I wrong?

You made a small mistake inside the __getitem__ function.
curPat = (patStart+patIdx)
should be changed to
curPat = (patStart*batchS+patIdx)
patStart is equal to idx, the current batch number. If your data set contains 64 samples and your batch size is set to 16, the possible values for idx will be 0, 1, 2 and 3.
curPat on the other hand refers to the index of the current sample number in the shuffled list of sample numbers. curPat should therefore be able to take on all values from 0 to 63. In your code, that is not the case. By making the aforementioned change, this issue is fixed.

Keras: How to expand validation_split to generate a third set i.e. test set?

I am using Keras with a TensorFlow backend. I am using the ImageDataGenerator with the validation_split argument to split my data into train set and validation set. As such, I use flow_from_directory with the subset set to "training" and "testing" like so:
total_gen = ImageDataGenerator(validation_split=0.3)
train_gen = data_generator.flow_from_directory(my_dir, target_size=(input_size, input_size), shuffle=False, seed=13,
class_mode='categorical', batch_size=BATCH_SIZE, subset="training")
valid_gen = data_generator.flow_from_directory(my_dir, target_size=(input_size, input_size), shuffle=False, seed=13,
class_mode='categorical', batch_size=32, subset="validation")
This is amazingly convenient, as it allows me to use only one directory instead of two (one for training and one for validation). Now I wonder if it is possible to expand this process in order to generating a third set i.e. test set?

This is not possible out of the box. You should be able to do it with some minor modifications to the source code of ImageDataGenerator:
if subset is not None:
if subset not in {'training', 'validation'}: # add a third subset here
raise ValueError('Invalid subset name:', subset,
'; expected "training" or "validation".') # adjust message
split_idx = int(len(x) * image_data_generator._validation_split)
# you'll need two split indices here
if subset == 'validation':
x = x[:split_idx]
x_misc = [np.asarray(xx[:split_idx]) for xx in x_misc]
if y is not None:
y = y[:split_idx]
elif subset == '...' # add extra case here
else:
x = x[split_idx:]
x_misc = [np.asarray(xx[split_idx:]) for xx in x_misc] # change slicing
if y is not None:
y = y[split_idx:] # change slicing
Edit: this is how you could modify the code:
if subset is not None:
if subset not in {'training', 'validation', 'test'}:
raise ValueError('Invalid subset name:', subset,
'; expected "training" or "validation" or "test".')
split_idxs = (int(len(x) * v) for v in image_data_generator._validation_split)
if subset == 'validation':
x = x[:split_idxs[0]]
x_misc = [np.asarray(xx[:split_idxs[0]]) for xx in x_misc]
if y is not None:
y = y[:split_idxs[0]]
elif subset == 'test':
x = x[split_idxs[0]:split_idxs[1]]
x_misc = [np.asarray(xx[split_idxs[0]:split_idxs[1]]) for xx in x_misc]
if y is not None:
y = y[split_idxs[0]:split_idxs[1]]
else:
x = x[split_idxs[1]:]
x_misc = [np.asarray(xx[split_idxs[1]:]) for xx in x_misc]
if y is not None:
y = y[split_idxs[1]:]
Basically, validation_split is now expected to be a tuple of two floats instead of a single float. The validation data will be the fraction of data between 0 and validation_split[0], test data between validation_split[0] and validation_split[1] and training data between validation_split[1] and 1. This is how you can use it:
import keras
# keras_custom_preprocessing is how i named my directory
from keras_custom_preprocessing.image import ImageDataGenerator
generator = ImageDataGenerator(validation_split=(0.1, 0.5))
# First 10%: validation data - next 40% test data - rest: training data
gen = generator.flow_from_directory(directory='./data/', subset='test')
# Finds 40% of the images in the dir
You will need to modify the file in two or three additional lines (there is a typecheck you will have to change), but that's it and that should work. I have the modified file, let me know if you are interested, I can host it on my github.

Keras custom data generator for large hdf5 file which does not fit into memory

I'm trying to use the pretrained InceptionV3 model to classify the food-101 dataset, which containts food images for 101 categories, 1000 per category. I've preprocessed this dataset into a single hdf5 file (I assumed this is beneficial compared to loading images on the go when training) so far, which has the following tables inside:
The data split is the standard 70% train, 20% validation, 10% test, so for example the valid_img has a size of 20200*299*299*3. The labels are onehotencoded for Keras, so valid_labels has a size of 20200*101.
This hdf5 file has a size of 27.1 GB, so it will not fit into my memory. (Have 8 GB of it, although effectively only probably 4-5 gigs is usable while running Ubuntu. Also my GPU is GTX 960 with 2 GB of VRAM, and so far it looked like 1.5 GB is available for python when I try to start the training script). I'm using Tensorflow backend.
The first idea I had is to use model.train_on_batch() with a double nested for loop like this:
#Loading InceptionV3, adding my fully connected layers, compiling model...
dataset = h5py.File('/home/uzoltan/PycharmProjects/food-101/food-101_299x299.hdf5', 'r')
epoch = 50
for i in range(epoch):
for i in range(100): #1000 images can fit in the memory easily, this could probably be range(10) too
train_images = dataset["train_img"][i * 706:(i + 1) * 706, ...]
train_labels = dataset["train_labels"][i * 706:(i + 1) * 706, ...]
val_images = dataset["valid_img"][i * 202:(i + 1) * 202, ...]
val_labels = dataset["valid_labels"][i * 202:(i + 1) * 202, ...]
model.train_on_batch(x=train_images, y=train_labels, class_weight=None,
sample_weight=None, )
My problem with this approach is that train_on_batch provides 0 support for validation or batch shuffling, so that the batches are not in the same order every epoch.
So I looked towards model.fit_generator() which has the nice property of providing all the same functionality as fit(), plus with the built in ImageDataGenerator you can do image augmentations (rotations, horizontal flips, etc.) at the same time with the CPU, so that your model can be more robust. My problem here is, that if I understand it correctly, the ImageDataGenerator.flow(x,y) method needs all the samples and labels at once, but my training/validation data wont fit into my RAM.
Here is where I think custom data generators come into the picture, but after looking extensively at some examples I could find on the Keras GitHub/Issues page, I still dont really get how should I implement a custom generator, which would read in batches of data from my hdf5 file. Can someone provide me with a good example or pointers? How could I couple the custom batch generator with the image augmentations? Or maybe is it easier to implement some kind of manual validation and batch shuffling for train_on_batch()? If so, I could use some pointer there too.

For anyone still looking for an answer, I made the following "crude wrapper" around ImageDataGeneator's apply_transform method.
from numpy.random import uniform, randint
from tensorflow.python.keras.preprocessing.image import ImageDataGenerator
import numpy as np
class CustomImagesGenerator:
def __init__(self, x, zoom_range, shear_range, rescale, horizontal_flip, batch_size):
self.x = x
self.zoom_range = zoom_range
self.shear_range = shear_range
self.rescale = rescale
self.horizontal_flip = horizontal_flip
self.batch_size = batch_size
self.__img_gen = ImageDataGenerator()
self.__batch_index = 0
def __len__(self):
# steps_per_epoch, if unspecified, will use the len(generator) as a number of steps.
# hence this
return np.floor(self.x.shape[0]/self.batch_size)
def next(self):
return self.__next__()
def __next__(self):
start = self.__batch_index*self.batch_size
stop = start + self.batch_size
self.__batch_index += 1
if stop > len(self.x):
raise StopIteration
transformed = np.array(self.x[start:stop]) # loads from hdf5
for i in range(len(transformed)):
zoom = uniform(self.zoom_range[0], self.zoom_range[1])
transformations = {
'zx': zoom,
'zy': zoom,
'shear': uniform(-self.shear_range, self.shear_range),
'flip_horizontal': self.horizontal_flip and bool(randint(0,2))
}
transformed[i] = self.__img_gen.apply_transform(transformed[i], transformations)
return transformed * self.rescale
It can be called like so:
import h5py
f = h5py.File("my_heavy_dataset_file.hdf5", 'r')
images = f['mydatasets/images']
my_gen = CustomImagesGenerator(
images,
zoom_range=[0.8, 1],
shear_range=6,
rescale=1./255,
horizontal_flip=True,
batch_size=64
)
model.fit_generator(my_gen)

If I understood you correctly, you want to use the data (which does not fit in the memory) from HDF5 and at the same time use data augmentation on it.
I'm in the same situation as you, and I found this code that maybe can be helpful with some few modifications:
https://gist.github.com/wassname/74f02bc9134897e3fe4e60784f5aaa15

this is my solution for shuffle data per epoch with h5 file.
indices means train or val index list.
def generator(h5path, indices, batchSize=128, is_train=True, aug=None):
db = h5py.File(h5path, "r")
with open("mean.json") as f:
mean = json.load(f)
meanV = np.array([mean["R"], mean["G"], mean["B"]])
while True:
np.random.shuffle(indices)
for i in range(0, len(indices), batchSize):
t0 = time()
batch_indices = indices[i:i+batchSize]
batch_indices.sort()
by = db["labels"][batch_indices,:]
bx = db["images"][batch_indices,:,:,:]
bx[:,:,:,0] -= meanV[0]
bx[:,:,:,1] -= meanV[1]
bx[:,:,:,2] -= meanV[2]
t1=time()
if is_train:
#bx = random_crop(bx, (224,224))
if aug is not None:
bx,by = next(aug.flow(bx,by,batchSize))
yield (bx,by)
h5path='all_224.hdf5'
model.fit_generator(generator(h5path, train_indices, batchSize=batchSize, is_train=True, aug=aug),
steps_per_epoch = 20000//batchSize,
validation_data= generator(h5path, test_indices, is_train=False, batchSize=batchSize),
validation_steps = 2424//batchSize,
epochs=args.epoch,
max_queue_size=100,
callbacks=[checkpoint, early_stop])

You want to write a function which loads images from the HDF5 and then yields (not returns) them as a numpy array. Here is a simple example which uses OpenCV to load images directly from .png/.jpg files in a given directory:
def generate_data(directory, batch_size):
"""Replaces Keras' native ImageDataGenerator."""
i = 0
file_list = os.listdir(directory)
while True:
image_batch = []
for b in range(batch_size):
if i == len(file_list):
i = 0
random.shuffle(file_list)
sample = file_list[i]
i += 1
image = cv2.resize(cv2.imread(sample[0]), INPUT_SHAPE)
image_batch.append((image.astype(float) - 128) / 128)
yield np.array(image_batch)
Obviously you will have to modify it to read from the HDF5 instead.
Once you have written your function, the usage is simply:
model.fit_generator(
generate_data('~/my_data', batch_size),
steps_per_epoch=len(os.listdir('~/my_data')) // batch_size)
Again modified to reflect the fact that you are reading from an HDF5 and not a directory.

How to get list of values in ImageDataGenerator.flow_from_directory Keras?

We can generate image dataset using ImageDataGenerator with flow_from_directory method. For calling list of class, we can use oject.classes. But, how to call list of values? I've searched and still not found any.
Thanks :)

The ImageDataGenerator is a python generator, it would yield a batch of data with the shape same with your model inputs(like(batch_size,width,height,channels)) each time. The benefit of the generator is when your data set is too big, you can't put all the data to your limited memory, but, with the generator you can generate one batch data each time. and the ImageDataGenerator works with model.fit_generator(), model.predict_generator().
If you want to get the numeric data, you can use the next() function of the generator:
import numpy as np
data_gen = ImageDataGenerator(rescale = 1. / 255)
data_generator = datagen.flow_from_directory(
data_dir,
target_size=(img_height, img_width),
batch_size=batch_size,
class_mode='categorical')
data_list = []
batch_index = 0
while batch_index <= data_generator.batch_index:
data = data_generator.next()
data_list.append(data[0])
batch_index = batch_index + 1
# now, data_array is the numeric data of whole images
data_array = np.asarray(data_list)
Alternatively, you can use PIL and numpy process the image by yourself:
from PIL import Image
import numpy as np
def image_to_array(file_path):
img = Image.open(file_path)
img = img.resize((img_width,img_height))
data = np.asarray(img,dtype='float32')
return data
# now data is a tensor with shape(width,height,channels) of a single image.
Then, you can loop all your images with this function to get the numeric data.
Notice, I recommend you to use generator instead of get all the data directly, or, you might run out of memory.

'But, how to call list of values' - If I understood correctly, I guess you wish to know what all files are there in your data set - if that's correct, (or if not), there are various ways you can get values from your generator:
use object.filenames.
Object.filenames returns the list of all files in your target folder. I just use the len(object.filename) function to get the total number of files in my test folder. Then pass that number back into my generator and run it again.
generator.n
Other way to get number of all items in your test folder is generator.n
x , y = test_generator.next() to load my array and classes ( if inferred).
Or a = test_generator.next(), where your array and classes will be returned as tuple.
I only used this as my test data set was really small ( 60 images) and I was using extracted features to train and predict my model( that is feature array and not the image array).
If you are building a normal model, using generator to yield batches is much better way.
Create a function using generator
def generate_test_data_from_directory(folder_path, image_target_size = 224, batch_size = 5, channels = 3, class_mode = 'sparse' ):
'''fetch all out test data from directory'''
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(
folder_path ,
target_size = (image_target_size, image_target_size),
batch_size = batch_size,
class_mode = class_mode)
total_images = test_generator.n
steps = total_images//batch_size
#iterations to cover all data, so if batch is 5, it will take total_images/5 iteration
x , y = [] , []
for i in range(steps):
a , b = test_generator.next()
x.extend(a)
y.extend(b)
return np.array(x), np.array(y)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Getting x_test, y_test from generator in Keras? - python

Related

Custom generator runs out of data even when steps_per_epoch specified

model.evaluate() changes results depending on batch size, when fed by generator

Keras: How to expand validation_split to generate a third set i.e. test set?

Keras custom data generator for large hdf5 file which does not fit into memory

How to get list of values in ImageDataGenerator.flow_from_directory Keras?

Categories

Resources