My scenario is that we have multiple peers, each with their own data, located in different directories that share the same sub-directory structure. I want to train the model on all of that data, but if I copy everything into one folder, I can't keep track of which data came from whom (new data is also created occasionally, so it isn't practical to keep copying the files every time).
My data is now stored like this:
-user01
-user02
-user03
...
(all of them have similar sub-directory structure)
I have searched for a solution, but I only found the multi-input case here and here, where multiple inputs are concatenated into one parallel input, which is not my case.
I know that flow_from_directory() can only be fed one directory at a time, so how can I build a custom generator that can be fed multiple directories at a time?
If my question is low-quality, please give me advice on how to improve it. I have also searched the Keras GitHub but didn't find anything I could adapt.
Thank you.
The Keras ImageDataGenerator flow_from_directory method has a follow_links parameter.
Maybe you can create one directory which is populated with symlinks to files in all the other directories.
This Stack Overflow question discusses using symlinks with the Keras ImageDataGenerator: Understanding 'follow_links' argument in Keras's ImageDataGenerator?
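For example, here is a minimal sketch of building such a symlink tree, assuming each peer directory (user01, user02, ...) contains the same class subfolders; the "merged" directory name is an arbitrary choice:

import os

peer_dirs = ["user01", "user02", "user03"]
merged_dir = "merged"

for peer in peer_dirs:
    for class_name in os.listdir(peer):
        class_dir = os.path.join(peer, class_name)
        if not os.path.isdir(class_dir):
            continue
        target_dir = os.path.join(merged_dir, class_name)
        os.makedirs(target_dir, exist_ok=True)
        for fname in os.listdir(class_dir):
            # prefix links with the peer name so provenance is kept
            # and file names from different peers never collide
            link = os.path.join(target_dir, peer + "_" + fname)
            if not os.path.exists(link):
                os.symlink(os.path.abspath(os.path.join(class_dir, fname)), link)

# then: ImageDataGenerator().flow_from_directory("merged", follow_links=True, ...)

Re-running this script when new files appear only adds the missing links, so the original per-peer layout stays the single source of truth.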
After so many days I hope you have found a solution to the problem, but I will share another idea here so that people like me who face the same problem in the future can get help.
A few days ago I had this kind of problem. follow_links will be a solution to your question, as user3731622 said. Also, I think the idea of merging two data generators will work. However, in that case, the batch size of each sub-generator has to be set in proportion to the amount of data in its directory.
Batch size of sub-generators:

b = (B * n) / sum(n)

where
b = batch size of any sub-generator,
B = desired batch size of the merged generator,
n = number of images in that sub-generator's directory, and
sum(n) = total number of images in all directories.
See the code below, this may help:
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import Sequence
import matplotlib.pyplot as plt
import numpy as np
import os
class MergedGenerators(Sequence):

    def __init__(self, batch_size, generators=[], sub_batch_size=[]):
        self.generators = generators
        self.sub_batch_size = sub_batch_size
        self.batch_size = batch_size

    def __len__(self):
        return int(
            sum([(len(self.generators[idx]) * self.sub_batch_size[idx])
                 for idx in range(len(self.sub_batch_size))]) /
            self.batch_size)

    def __getitem__(self, index):
        """Getting items from the generators and packing them"""
        X_batch = []
        Y_batch = []
        for generator in self.generators:
            if generator.class_mode is None:
                x1 = generator[index % len(generator)]
                X_batch = [*X_batch, *x1]
            else:
                x1, y1 = generator[index % len(generator)]
                X_batch = [*X_batch, *x1]
                Y_batch = [*Y_batch, *y1]

        if self.generators[0].class_mode is None:
            return np.array(X_batch)
        return np.array(X_batch), np.array(Y_batch)
def build_datagenerator(dir1=None, dir2=None, batch_size=32):
    n_images_in_dir1 = sum([len(files) for r, d, files in os.walk(dir1)])
    n_images_in_dir2 = sum([len(files) for r, d, files in os.walk(dir2)])

    # Have to set a different batch size for the two generators, as the
    # number of images in the two directories is not the same and we want
    # each generator's share of the merged batch to match its share of the data
    generator1_batch_size = int((n_images_in_dir1 * batch_size) /
                                (n_images_in_dir1 + n_images_in_dir2))
    generator2_batch_size = batch_size - generator1_batch_size

    generator1 = ImageDataGenerator(
        rescale=1. / 255,
        shear_range=0.2,
        zoom_range=0.2,
        rotation_range=5.,
        horizontal_flip=True,
    )

    generator2 = ImageDataGenerator(
        rescale=1. / 255,
        zoom_range=0.2,
        horizontal_flip=False,
    )
    # generator2 has different image augmentation attributes than generator1

    generator1 = generator1.flow_from_directory(
        dir1,
        target_size=(128, 128),
        color_mode='rgb',
        class_mode=None,
        batch_size=generator1_batch_size,
        shuffle=True,
        seed=42,
        interpolation="bicubic",
    )
    generator2 = generator2.flow_from_directory(
        dir2,
        target_size=(128, 128),
        color_mode='rgb',
        class_mode=None,
        batch_size=generator2_batch_size,
        shuffle=True,
        seed=42,
        interpolation="bicubic",
    )

    return MergedGenerators(
        batch_size,
        generators=[generator1, generator2],
        sub_batch_size=[generator1_batch_size, generator2_batch_size])
def test_datagen(batch_size=32):
    datagen = build_datagenerator(dir1="./asdf",
                                  dir2="./asdf2",
                                  batch_size=batch_size)

    print("Datagenerator length (Batch count):", len(datagen))

    for batch_count, image_batch in enumerate(datagen):
        if batch_count == 1:
            break

        print("Images: ", image_batch.shape)

        plt.figure(figsize=(10, 10))
        for i in range(image_batch.shape[0]):
            plt.subplot(1, batch_size, i + 1)
            plt.imshow(image_batch[i], interpolation='nearest')
            plt.axis('off')
            plt.tight_layout()


test_datagen(4)
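The example above handles two directories; the same proportional batch-size formula extends to any number of directories. A minimal sketch (proportional_batch_sizes is a hypothetical helper, not part of the example above):

import os

def proportional_batch_sizes(dirs, batch_size):
    """Apply b = (B * n) / sum(n) to each directory, fixing up rounding."""
    counts = [sum(len(files) for _, _, files in os.walk(d)) for d in dirs]
    total = sum(counts)
    sizes = [int(batch_size * c / total) for c in counts]
    # hand the rounding remainder to the last sub-generator so sizes sum to B
    sizes[-1] += batch_size - sum(sizes)
    return sizes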
I am training a model using custom generators, but just before finishing the first epoch, the model runs out of data. It gives me the following error:
Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least (steps_per_epoch * epochs) batches (in this case, 8740 batches). You may need to use the repeat() function when building your dataset
I have four generators (one for the train data and another for the train labels; same thing for validation). I then zip train and label together. This is the prototype of my generators; I got the idea from here:
import numpy as np
import nibabel as nib
from tensorflow import keras
import os
def weirddivision(n, d):
    return np.array(n) / np.array(d) if d else 0
class ImgDataGenerator(keras.utils.Sequence):

    def __init__(self, file_list, batch_size=8, shuffle=True):
        """Constructor can be expanded
        with batch size, dimensions etc.
        """
        self.file_list = file_list
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Take all batches in each iteration'
        return int(np.floor(len(self.file_list) / self.batch_size))

    def __getitem__(self, index):
        'Get next batch'
        # Generate indexes of the batch
        indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        # single file
        file_list_temp = [self.file_list[k] for k in indexes]
        # Set of X_train and y_train
        X = self.__data_generation(file_list_temp)
        return X

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.file_list))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, file_list_temp):
        'Generates data containing batch_size samples'
        train_loc = '/home/faruk/Desktop/BrainSeg/Dataset/Train/'
        X = np.empty((self.batch_size, 224, 256, 1))
        # Generate data
        for i, ID in enumerate(file_list_temp):
            x_file_path = os.path.join(train_loc, ID)
            img = np.load(x_file_path)
            img = np.pad(img, pad_width=((14, 13), (12, 11)), mode='constant')
            img = np.expand_dims(img, -1)
            img = weirddivision(img, img.max())
            # Store sample
            X[i, ] = img
        return X
As mentioned, here I create four generators and zip them:
training_img_generator = ImgDataGenerator(train)
training_label_generator = LabelDataGenerator(train)
train_generator = zip(training_img_generator,training_label_generator)
val_img_generator = ValDataGenerator(val)
val_label_generator = ValLabelDataGenerator(val)
val_generator = zip(val_img_generator,val_label_generator)
Because the generator is generating data dynamically, I thought that maybe it was trying to generate more than what is actually available. Hence, I calculated the steps per epoch as follows and passed it to fit_generator:
batch_size = 8
spe = len(train)//batch_size # len(train) = 34965
val_spe = len(val)//batch_size # len(val) = 4347
History=model.fit_generator(generator=train_generator, validation_data=val_generator, epochs=2, steps_per_epoch=spe, validation_steps = val_spe, shuffle=True, verbose=1)
But still, this is not working. I have tried reducing the number of steps per epoch, and I am able to finish the first epoch, but the error then appears at the beginning of the second epoch. Apparently the generator needs to be repeated infinitely, but I don't know how to achieve this. Can I use an infinite while loop? If yes, where?
Try this (note that .repeat() is a method of tf.data.Dataset, so this applies if your generators are wrapped in datasets rather than plain Python zip objects):
train_generator = train_generator.repeat()
val_generator = val_generator.repeat()
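If your train_generator stays a plain Python zip (which is exhausted after one pass), a wrapper built on an infinite while loop, as the question suggests, can serve the same purpose. A minimal sketch, reusing the generator classes from the question (repeat_generator is a hypothetical helper, not a Keras API):

def repeat_generator(make_gen):
    # make_gen is a zero-argument callable that builds a fresh (finite)
    # generator; we rebuild it whenever it is exhausted, so Keras can pull
    # (steps_per_epoch * epochs) batches without running out of data
    while True:
        for batch in make_gen():
            yield batch

train_generator = repeat_generator(
    lambda: zip(ImgDataGenerator(train), LabelDataGenerator(train)))
val_generator = repeat_generator(
    lambda: zip(ValDataGenerator(val), ValLabelDataGenerator(val)))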
I solved this. I was defining my Generator class as follows:
class ImgDataGenerator(keras.utils.Sequence):
However, my model was not sequential... It was functional. I solved this by creating my own custom generator, without inheriting from keras.utils.Sequence.
I hope this is helpful to someone.
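For reference, such a custom generator could be as simple as the following sketch, where load_batch is a hypothetical helper that reads one batch of files:

import numpy as np

def my_generator(file_list, batch_size):
    # a plain Python generator (no keras.utils.Sequence): the while True
    # loop makes it repeat forever, so training never runs out of data
    while True:
        np.random.shuffle(file_list)
        for i in range(0, len(file_list) - batch_size + 1, batch_size):
            yield load_batch(file_list[i:i + batch_size])  # hypothetical helper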
I am working on a multilabel classification model where I am trying to combine two models, a CNN and a text-classifier into one model using Keras and train them together, like so:
#cnn_model is a vgg16 model
#text_model looks as follows:
### takes the vectorized text as input
text_model = Sequential()
text_model.add(Dense(vec_size, input_shape=(vec_size,), name='aux_input'))
## merging both models
merged = Merge([cnn_model, text_model], mode='concat')
### final_model takes the combined models and adds a softmax classifier to it
final_model = Sequential()
final_model.add(merged)
final_model.add(Dense(n_classes, activation='softmax'))
As such, I am working with an ImageDataGenerator to process the images and the respective labels.
For the images I am using a custom helper function that reads images into the model via paths provided by pandas dataframes - one for training (df_train) and one for validation (df_validation). The dataframes also provide the final labels for the model in the "label_vec" column:
# From https://github.com/keras-team/keras/issues/5152
import os
import numpy

def flow_from_dataframe(img_data_gen, in_df, path_col, y_col, **dflow_args):
    base_dir = os.path.dirname(in_df[path_col].values[0])
    print('## Ignore next message from keras, values are replaced anyways')
    df_gen = img_data_gen.flow_from_directory(base_dir, class_mode='sparse', **dflow_args)
    df_gen.filenames = in_df[path_col].values
    df_gen.classes = numpy.stack(in_df[y_col].values)
    df_gen.samples = in_df.shape[0]
    df_gen.n = in_df.shape[0]
    df_gen._set_index_array()
    df_gen.directory = ''  # since we have the full path
    print('Reinserting dataframe: {} images'.format(in_df.shape[0]))
    return df_gen
from keras.applications.vgg16 import preprocess_input
train_datagen = keras.preprocessing.image.ImageDataGenerator(preprocessing_function=preprocess_input,
                                                              horizontal_flip=True)
validation_datagen = keras.preprocessing.image.ImageDataGenerator(preprocessing_function=preprocess_input)  # rescale=1./255
train_generator = flow_from_dataframe(train_datagen, df_train,
                                      path_col='filename',
                                      y_col='label_vec',
                                      target_size=(224, 224), batch_size=128, shuffle=False)
validation_generator = flow_from_dataframe(validation_datagen, df_validation,
                                           path_col='filename',
                                           y_col='label_vec',
                                           target_size=(224, 224), batch_size=64, shuffle=False)
Now I am trying to provide my one-hot-encoded text vectors (i.e. [0,0,0,1,0,0]) to the model, which are also stored in a pandas dataframe.
Since my train_generator provides me with the image and label data, I am now looking for a way to combine this generator with one that additionally feeds the respective text vector.
You might want to consider writing your own generator (making use of Keras' Sequence object to allow for multiprocessing) instead of modifying the ImageDataGenerator code. From the Keras docs:
import numpy as np
from skimage.io import imread
from skimage.transform import resize
from keras.utils import Sequence

class CIFAR10Sequence(Sequence):

    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return np.array([
            resize(imread(file_name), (200, 200))
            for file_name in batch_x]), np.array(batch_y)
You could keep your labels, the paths to the images, and the paths to the text files in a single pandas dataframe, and modify the __getitem__ method above so that your generator yields all three simultaneously: one list of numpy arrays X containing all the inputs, and one numpy array Y containing the outputs.
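A minimal sketch of that idea, assuming a dataframe with 'filename', 'text_vec' and 'label_vec' columns (the column names are taken from the question; the class itself is illustrative):

import numpy as np
from keras.utils import Sequence
from keras.preprocessing.image import load_img, img_to_array

class ImageTextSequence(Sequence):
    """Sketch only: df is assumed to hold 'filename', 'text_vec' and 'label_vec' columns."""

    def __init__(self, df, batch_size=32, target_size=(224, 224)):
        self.df = df.reset_index(drop=True)
        self.batch_size = batch_size
        self.target_size = target_size

    def __len__(self):
        return int(np.ceil(len(self.df) / float(self.batch_size)))

    def __getitem__(self, idx):
        rows = self.df.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        images = np.stack([img_to_array(load_img(f, target_size=self.target_size))
                           for f in rows['filename']])
        texts = np.stack(rows['text_vec'].values)    # the one-hot text vectors
        labels = np.stack(rows['label_vec'].values)
        # two inputs (image + text) and one output, matching the merged model
        return [images, texts], labels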
I work with many images (10M+) stored in a single directory (no subfolders per class) and use a pandas DataFrame to keep track of the class labels. The images do not fit in memory, so I must read minibatches from disk. So far I have used Keras' .flow_from_directory(), but it requires me to move images into one subfolder per class (and per train/validation split). It works great, but it becomes very impractical when I want to use different subsets of images and define classes in various ways. Does anyone have an alternative strategy that uses a database (e.g. a pandas.DataFrame) to keep track of the minibatch reads instead of moving images into subfolders?
You need a custom data generator.
import numpy as np
import cv2
from keras.utils import to_categorical

# batch_size, dpath, df_train and classes are assumed to be defined in the surrounding scope

def batch_generator(ids):
    while True:
        for start in range(0, len(ids), batch_size):
            x_batch = []
            y_batch = []
            end = min(start + batch_size, len(ids))
            ids_batch = ids[start:end]
            for id in ids_batch:
                img = cv2.imread(dpath + 'train/{}.jpg'.format(id))
                # img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_AREA)
                labelname = df_train.loc[df_train.id == id, 'column_name'].values
                labelnum = classes.index(labelname)
                x_batch.append(img)
                y_batch.append(labelnum)
            x_batch = np.array(x_batch, np.float32)
            y_batch = to_categorical(y_batch, 120)
            yield x_batch, y_batch
Then you can call the generator with just a numpy array of ids (or image names), like this:
model.fit_generator(generator=batch_generator(ids_train_split),
                    steps_per_epoch=np.ceil(float(len(ids_train_split)) / float(batch_size)),
                    epochs=epochs, verbose=1, callbacks=callbacks,
                    validation_data=batch_generator(ids_valid_split),
                    validation_steps=np.ceil(float(len(ids_valid_split)) / float(batch_size)))
I'm trying to use the pretrained InceptionV3 model to classify the food-101 dataset, which contains food images for 101 categories, 1000 per category. So far I've preprocessed this dataset into a single hdf5 file (I assumed this is beneficial compared to loading images on the fly during training), which has the following tables inside:
The data split is the standard 70% train, 20% validation, 10% test, so for example valid_img has a size of 20200*299*299*3. The labels are one-hot encoded for Keras, so valid_labels has a size of 20200*101.
This hdf5 file has a size of 27.1 GB, so it will not fit into my memory (I have 8 GB, although effectively only about 4-5 GB is usable while running Ubuntu; also, my GPU is a GTX 960 with 2 GB of VRAM, and so far it looks like about 1.5 GB is available for Python when I start the training script). I'm using the TensorFlow backend.
The first idea I had is to use model.train_on_batch() with a double nested for loop like this:
#Loading InceptionV3, adding my fully connected layers, compiling model...

dataset = h5py.File('/home/uzoltan/PycharmProjects/food-101/food-101_299x299.hdf5', 'r')
epoch = 50
for i in range(epoch):
    # 1000 images can fit in the memory easily, this could probably be range(10) too
    # (inner loop variable renamed from i to j so it no longer shadows the epoch counter)
    for j in range(100):
        train_images = dataset["train_img"][j * 706:(j + 1) * 706, ...]
        train_labels = dataset["train_labels"][j * 706:(j + 1) * 706, ...]
        val_images = dataset["valid_img"][j * 202:(j + 1) * 202, ...]
        val_labels = dataset["valid_labels"][j * 202:(j + 1) * 202, ...]
        model.train_on_batch(x=train_images, y=train_labels, class_weight=None,
                             sample_weight=None)
My problem with this approach is that train_on_batch provides no support for validation or batch shuffling, so there is no way to keep the batches from coming in the same order every epoch.
So I looked towards model.fit_generator(), which has the nice property of providing all the same functionality as fit(); plus, with the built-in ImageDataGenerator, you can do image augmentation (rotations, horizontal flips, etc.) on the CPU at the same time, so your model can be more robust. My problem here is that, if I understand correctly, the ImageDataGenerator.flow(x, y) method needs all the samples and labels at once, but my training/validation data won't fit into my RAM.
Here is where I think custom data generators come into the picture, but after looking extensively at the examples I could find on the Keras GitHub/issues page, I still don't really understand how to implement a custom generator that reads batches of data from my hdf5 file. Can someone provide me with a good example or pointers? How could I couple a custom batch generator with the image augmentations? Or maybe it is easier to implement some kind of manual validation and batch shuffling for train_on_batch()? If so, I could use some pointers there too.
For anyone still looking for an answer, I made the following "crude wrapper" around ImageDataGenerator's apply_transform method.
from numpy.random import uniform, randint
from tensorflow.python.keras.preprocessing.image import ImageDataGenerator
import numpy as np
class CustomImagesGenerator:

    def __init__(self, x, zoom_range, shear_range, rescale, horizontal_flip, batch_size):
        self.x = x
        self.zoom_range = zoom_range
        self.shear_range = shear_range
        self.rescale = rescale
        self.horizontal_flip = horizontal_flip
        self.batch_size = batch_size
        self.__img_gen = ImageDataGenerator()
        self.__batch_index = 0

    def __len__(self):
        # steps_per_epoch, if unspecified, will use len(generator) as the
        # number of steps, hence this (must return an int, not a float)
        return int(np.floor(self.x.shape[0] / self.batch_size))

    def next(self):
        return self.__next__()

    def __next__(self):
        start = self.__batch_index * self.batch_size
        stop = start + self.batch_size
        self.__batch_index += 1
        if stop > len(self.x):
            raise StopIteration
        transformed = np.array(self.x[start:stop])  # loads from hdf5
        for i in range(len(transformed)):
            zoom = uniform(self.zoom_range[0], self.zoom_range[1])
            transformations = {
                'zx': zoom,
                'zy': zoom,
                'shear': uniform(-self.shear_range, self.shear_range),
                'flip_horizontal': self.horizontal_flip and bool(randint(0, 2))
            }
            transformed[i] = self.__img_gen.apply_transform(transformed[i], transformations)
        return transformed * self.rescale
It can be called like so:
import h5py
f = h5py.File("my_heavy_dataset_file.hdf5", 'r')
images = f['mydatasets/images']
my_gen = CustomImagesGenerator(
    images,
    zoom_range=[0.8, 1],
    shear_range=6,
    rescale=1. / 255,
    horizontal_flip=True,
    batch_size=64
)
model.fit_generator(my_gen)
If I understood you correctly, you want to use the data from HDF5 (which does not fit in memory) and use data augmentation on it at the same time.
I'm in the same situation as you, and I found this code that may be helpful with a few modifications:
https://gist.github.com/wassname/74f02bc9134897e3fe4e60784f5aaa15
This is my solution for shuffling data per epoch with an h5 file.
indices is the train or validation index list.
import json
import h5py
import numpy as np

def generator(h5path, indices, batchSize=128, is_train=True, aug=None):
    db = h5py.File(h5path, "r")
    with open("mean.json") as f:
        mean = json.load(f)
    meanV = np.array([mean["R"], mean["G"], mean["B"]])

    while True:
        np.random.shuffle(indices)
        for i in range(0, len(indices), batchSize):
            batch_indices = indices[i:i + batchSize]
            batch_indices.sort()  # h5py fancy indexing requires sorted indices

            by = db["labels"][batch_indices, :]
            # cast to float so the in-place mean subtraction below works
            bx = db["images"][batch_indices, :, :, :].astype("float32")
            bx[:, :, :, 0] -= meanV[0]
            bx[:, :, :, 1] -= meanV[1]
            bx[:, :, :, 2] -= meanV[2]

            if is_train:
                # bx = random_crop(bx, (224, 224))
                if aug is not None:
                    bx, by = next(aug.flow(bx, by, batchSize))

            yield (bx, by)
h5path = 'all_224.hdf5'
model.fit_generator(generator(h5path, train_indices, batchSize=batchSize, is_train=True, aug=aug),
                    steps_per_epoch=20000 // batchSize,
                    validation_data=generator(h5path, test_indices, is_train=False, batchSize=batchSize),
                    validation_steps=2424 // batchSize,
                    epochs=args.epoch,
                    max_queue_size=100,
                    callbacks=[checkpoint, early_stop])
You want to write a function which loads images from the HDF5 and then yields (not returns) them as a numpy array. Here is a simple example which uses OpenCV to load images directly from .png/.jpg files in a given directory:
import os
import random
import cv2
import numpy as np

def generate_data(directory, batch_size):
    """Replaces Keras' native ImageDataGenerator."""
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch = []
        for b in range(batch_size):
            if i == len(file_list):
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            # INPUT_SHAPE (e.g. (224, 224)) is assumed to be defined elsewhere;
            # joining with the directory fixes the original's sample[0] indexing bug
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            image_batch.append((image.astype(float) - 128) / 128)
        yield np.array(image_batch)
Obviously you will have to modify it to read from the HDF5 instead.
Once you have written your function, the usage is simply:
model.fit_generator(
    generate_data('~/my_data', batch_size),
    steps_per_epoch=len(os.listdir('~/my_data')) // batch_size)
Again modified to reflect the fact that you are reading from an HDF5 and not a directory.
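For instance, a minimal sketch of such a modification; the "train_img" dataset key and the normalization are assumptions to adapt to your file:

import h5py
import numpy as np

def generate_data_hdf5(h5_path, batch_size):
    # reads batches straight from the HDF5 dataset instead of a directory
    f = h5py.File(h5_path, 'r')
    n = f["train_img"].shape[0]
    while True:
        for start in range(0, n - batch_size + 1, batch_size):
            batch = f["train_img"][start:start + batch_size]
            yield (batch.astype(np.float32) - 128) / 128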
We can generate an image dataset using ImageDataGenerator with the flow_from_directory method. To get the list of classes, we can use object.classes. But how do we get the list of values (the numeric image data)? I've searched and still haven't found anything.
Thanks :)
The ImageDataGenerator is a Python generator: each time, it yields a batch of data with the same shape as your model inputs (like (batch_size, width, height, channels)). The benefit of a generator is that when your data set is too big to fit entirely into your limited memory, you can generate one batch of data at a time. The ImageDataGenerator works with model.fit_generator() and model.predict_generator().
If you want to get the numeric data, you can use the next() function of the generator:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# data_dir, img_height, img_width and batch_size are assumed to be defined elsewhere
datagen = ImageDataGenerator(rescale=1. / 255)
data_generator = datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')

data_list = []
batch_index = 0
# data_generator.batch_index wraps back to 0 after a full pass, which ends the loop
while batch_index <= data_generator.batch_index:
    data = data_generator.next()
    data_list.append(data[0])
    batch_index = batch_index + 1

# now, data_array holds the numeric data of the whole image set
data_array = np.concatenate(data_list)
Alternatively, you can use PIL and numpy to process the images yourself:
from PIL import Image
import numpy as np
def image_to_array(file_path):
    img = Image.open(file_path)
    img = img.resize((img_width, img_height))  # img_width and img_height are assumed to be defined elsewhere
    data = np.asarray(img, dtype='float32')
    return data

# now data is a tensor with shape (width, height, channels) for a single image
Then you can loop over all your images with this function to get the numeric data.
Note that I recommend using a generator instead of getting all the data at once, or you might run out of memory.
'But, how to call list of values' - if I understood correctly, you wish to know which files are in your data set. If that's correct (or even if not), there are various ways you can get values from your generator:
-use object.filenames. It returns the list of all files in your target folder. I just use len(object.filenames) to get the total number of files in my test folder, then pass that number back into my generator and run it again.
-generator.n is another way to get the number of items in your test folder.
-x, y = test_generator.next() to load my array and classes (if inferred). Or a = test_generator.next(), where your array and classes will be returned as a tuple.
I only used this because my test data set was really small (60 images) and I was using extracted features to train and predict with my model (that is, a feature array and not the image array).
If you are building a normal model, using a generator to yield batches is a much better way.
Create a function using a generator:

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

def generate_test_data_from_directory(folder_path, image_target_size=224, batch_size=5, channels=3, class_mode='sparse'):
    '''fetch all our test data from the directory'''
    test_datagen = ImageDataGenerator(rescale=1. / 255)
    test_generator = test_datagen.flow_from_directory(
        folder_path,
        target_size=(image_target_size, image_target_size),
        batch_size=batch_size,
        class_mode=class_mode)

    total_images = test_generator.n
    # iterations to cover all data, so if the batch size is 5, it will take total_images/5 iterations
    steps = total_images // batch_size

    x, y = [], []
    for i in range(steps):
        a, b = test_generator.next()
        x.extend(a)
        y.extend(b)
    return np.array(x), np.array(y)
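Hypothetical usage (the './test' path is just a placeholder for your test folder):

x_test, y_test = generate_test_data_from_directory('./test', image_target_size=224, batch_size=5)
print(x_test.shape, y_test.shape)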