I need to load two identical datasets. One contains RGB images and the other contains the same images after different processing (greyscale), in the same order and at the same size:
datasetA = [1.jpg, 2.jpg, ..., n.jpg]     // RGB
datasetB = [g1.jpg, g2.jpg, ..., gn.jpg]  // grey
I need to feed the images, in the same order, to two independent networks using DataLoader with random_split. How should I use the following?
rgb = datasets.ImageFolder(rgb images)
grey = datasets.ImageFolder(gray images)
train_data1, test_data1 = random_split(rgb, [train_size, test_size])
train_data2, test_data2 = random_split(grey, [train_size, test_size])
train_loader1 = DataLoader(train_data1, batch_size=batch_size, shuffle=True)
train_loader2 = DataLoader(train_data2, batch_size=batch_size, shuffle=True)
So I need to load the images as same-order tuples such as (1.jpg, g1.jpg) in order to train both networks independently.
And how do I use
trainiter1 = iter(train_loader1)
features, labels = next(trainiter1)
Please explain the process.
I think the easiest way to go about this is to construct a custom Dataset that handles both:
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
class JointImageDataset(torch.utils.data.Dataset):
    def __init__(self, args_rgb_dict, args_grey_dict):
        # construct the two individual datasets
        self.rgb_dataset = ImageFolder(**args_rgb_dict)
        self.grey_dataset = ImageFolder(**args_grey_dict)
    def __len__(self):
        return min(len(self.rgb_dataset), len(self.grey_dataset))
    def __getitem__(self, index):
        # the same index is used for both datasets, so each pair stays aligned
        rgb_x, rgb_y = self.rgb_dataset[index]
        grey_x, grey_y = self.grey_dataset[index]
        return rgb_x, grey_x, rgb_y, grey_y
Now you can construct a single DataLoader from the JointImageDataset and iterate over the joint batches. Because every item already carries both images, applying random_split or shuffle=True to the joint dataset keeps each RGB/grey pair together:
joint_data = JointImageDataset(...)
train_loader = DataLoader(joint_data, batch_size=...)
for rgb_batch, grey_batch, rgb_ys, grey_ys in train_loader:
    # do your stuff here...
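The reason a joint dataset survives shuffling and splitting is that a single permutation of indices is applied to both image lists at once. A minimal, framework-free sketch of that idea follows; paired_split and the file names are hypothetical and only stand in for what random_split does on a joint dataset:

```python
import random

def paired_split(rgb_files, grey_files, train_fraction=0.8, seed=0):
    """Split two aligned file lists using one shared permutation of indices."""
    if len(rgb_files) != len(grey_files):
        raise ValueError("both lists must have the same length")
    indices = list(range(len(rgb_files)))
    random.Random(seed).shuffle(indices)   # one permutation, used for both lists
    pairs = [(rgb_files[i], grey_files[i]) for i in indices]
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]

# Hypothetical file names that follow the naming used in the question.
rgb_files = [f"{i}.jpg" for i in range(1, 11)]
grey_files = [f"g{i}.jpg" for i in range(1, 11)]
train_pairs, test_pairs = paired_split(rgb_files, grey_files)
print(len(train_pairs), len(test_pairs))  # 8 2
```

Splitting the two datasets separately, as in the question, draws two unrelated permutations, which is why the pairs fall apart there.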
I am working on image binarization using UNet and have a dataset of 150 images together with their binarized versions. My idea is to augment the images randomly so that they look different, so I have written a function that applies any of 4-5 types of noise, skew, shear and so on to an image. I could easily have used
ImageDataGenerator(preprocessing_function=my_aug_function) to augment the images, but the problem is that my y target is also an image. I could also have used something like:
train_dataset = (
    train_dataset.map(
        encode_single_sample, num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
    .batch(batch_size)
    .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
)
But it has 2 problems:
1. With a larger dataset it will blow up the memory, because the data needs to already be in memory.
2. The crucial part: I need to augment the images on the fly, so that it looks like I have a huge dataset.
Another solution would be to save augmented images to a directory until there are 30-40K of them and then load those, but that would be a silly thing to do.
Now, the idea is that I can use Sequence as the parent class, but how can I keep augmenting and generating new images on the fly together with their respective binarized Y images?
My idea is the code below. Can somebody help me with the augmentation and the generation of the y images? I have X_DIR and Y_DIR, where the original and binarized images have the same names but are stored in different directories.
class DataGenerator(tensorflow.keras.utils.Sequence):
    def __init__(self, files_path, labels_path, batch_size=32, shuffle=True, random_state=42):
        'Initialization'
        self.files = files_path
        self.labels = labels_path
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.random_state = random_state
        self.on_epoch_end()
    def on_epoch_end(self):
        'Updates indexes after each epoch'
        # Shuffle the data here
    def __len__(self):
        return int(np.floor(len(self.files) / self.batch_size))
    def __getitem__(self, index):
        # What do I do here?
        pass
    def __data_generation(self, files):
        # I think this is responsible for augmentation, but I have no idea how to implement it or how it works.
        pass
Custom Image Data Generator
load Directory data into dataframe for CustomDataGenerator
import os
import pandas as pd
def data_to_df(data_dir, subset=None, validation_split=None):
    # name_to_idx is assumed to be a dict mapping class folder names to integer labels
    df = pd.DataFrame()
    filenames = []
    labels = []
    for dataset in os.listdir(data_dir):
        img_list = os.listdir(os.path.join(data_dir, dataset))
        label = name_to_idx[dataset]
        for image in img_list:
            filenames.append(os.path.join(data_dir, dataset, image))
            labels.append(label)
    df["filenames"] = filenames
    df["labels"] = labels
    if subset == "train":
        # shuffle first, otherwise the validation rows all come from the first class folders
        df = df.sample(frac=1, random_state=0).reset_index(drop=True)
        split_indexes = int(len(df) * validation_split)
        train_df = df[split_indexes:]
        val_df = df[:split_indexes]
        return train_df, val_df
    return df
train_df, val_df = data_to_df(train_dir, subset="train", validation_split=0.2)
Custom Data Generator
import math
import numpy as np
import tensorflow as tf
from PIL import Image
from sklearn.utils import shuffle  # assumed source of the shuffle() used below
# preprocess_input is assumed to be imported from the tf.keras.applications module of your model
class CustomDataGenerator(tf.keras.utils.Sequence):
    ''' Custom DataGenerator to load img
    Arguments:
        data_frame = pandas data frame in filenames and labels format
        batch_size = divide data in batches
        shuffle = shuffle data before loading
        img_shape = image shape in (h, w, d) format
        augmentation = data augmentation to make model robust to overfitting
    Output:
        Img: numpy array of image
        label : output label for image
    '''
    def __init__(self, data_frame, batch_size=10, img_shape=None, augmentation=True, num_classes=None):
        self.data_frame = data_frame
        self.train_len = len(data_frame)
        self.batch_size = batch_size
        self.img_shape = img_shape
        self.augmentation = augmentation
        self.num_classes = num_classes
        print(f"Found {self.data_frame.shape[0]} images belonging to {self.num_classes} classes")
    def __len__(self):
        ''' return total number of batches '''
        self.data_frame = shuffle(self.data_frame)
        return math.ceil(self.train_len/self.batch_size)
    def on_epoch_end(self):
        ''' shuffle data after every epoch '''
        # fix on epoch end it's not working, adding shuffle in len for alternative
        pass
    def __data_augmentation(self, img):
        ''' function for apply some data augmentation '''
        # random_shift defaults to channels-first axes, so pass the axes of an (h, w, d) image explicitly
        img = tf.keras.preprocessing.image.random_shift(img, 0.2, 0.3, row_axis=0, col_axis=1, channel_axis=2)
        img = tf.image.random_flip_left_right(img)
        img = tf.image.random_flip_up_down(img)
        return img
    def __get_image(self, file_id):
        """ open image with file_id path and apply data augmentation """
        # PIL's resize takes (width, height); np.resize would only repeat or truncate the raw pixel values
        img = Image.open(file_id).resize((self.img_shape[1], self.img_shape[0]))
        img = np.asarray(img)
        if self.augmentation:
            img = self.__data_augmentation(img)
        img = preprocess_input(img)
        return img
    def __get_label(self, label_id):
        """ uncomment the below line to convert label into categorical format """
        #label_id = tf.keras.utils.to_categorical(label_id, num_classes)
        return label_id
    def __getitem__(self, idx):
        batch_x = self.data_frame["filenames"][idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.data_frame["labels"][idx * self.batch_size:(idx + 1) * self.batch_size]
        # read your data here using the batch lists, batch_x and batch_y
        x = [self.__get_image(file_id) for file_id in batch_x]
        y = [self.__get_label(label_id) for label_id in batch_y]
        return tf.convert_to_tensor(x), tf.convert_to_tensor(y)
You can use libraries like albumentations and imgaug. Both are good, though I have heard of issues with random seeds in albumentations.
Here's an example of imgaug, taken from its documentation:
import imgaug.augmenters as iaa
seq = iaa.Sequential([
    iaa.Dropout([0.05, 0.2]),      # drop 5% or 20% of all pixels
    iaa.Sharpen((0.0, 1.0)),       # sharpen the image
    iaa.Affine(rotate=(-45, 45)),  # rotate by -45 to 45 degrees (affects segmaps)
    iaa.ElasticTransformation(alpha=50, sigma=5)  # apply water effect (affects segmaps)
], random_order=True)
# Augment images and segmaps.
images_aug = []
segmaps_aug = []
for _ in range(len(input_data)):
    images_aug_i, segmaps_aug_i = seq(image=image, segmentation_maps=segmap)
    images_aug.append(images_aug_i)
    segmaps_aug.append(segmaps_aug_i)
You are on the right track with the custom generator. In __getitem__, build a batch with batch_x = self.files[index*self.batch_size:(index+1)*self.batch_size], and do the same for batch_y. Then augment them with X, y = self.__data_generation(batch_x, batch_y), which loads the images (using any library you like; I prefer OpenCV) and returns the augmented pairs (plus any other manipulation).
Your __getitem__ then returns the tuple (X, y).
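The key point when the target is itself an image is that each random decision is drawn once and applied to both the input and its target. Here is a minimal NumPy-only sketch of that batching logic; augment_pair and paired_batch are hypothetical helpers, and the flips merely stand in for whatever noise, skew or shear function you actually use:

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply the same random flips to an image and to its target mask."""
    if rng.random() < 0.5:   # horizontal flip, decided once for both arrays
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:   # vertical flip, decided once for both arrays
        image, mask = image[::-1], mask[::-1]
    return image, mask

def paired_batch(images, masks, index, batch_size, rng):
    """Build batch number `index`, the way a Sequence.__getitem__ would."""
    start = index * batch_size
    stop = start + batch_size
    pairs = [augment_pair(x, y, rng)
             for x, y in zip(images[start:stop], masks[start:stop])]
    xs, ys = zip(*pairs)
    return np.stack(xs), np.stack(ys)

rng = np.random.default_rng(0)
images = rng.random((6, 4, 4))   # toy stand-ins for the loaded X images
masks = images > 0.5             # toy "binarized" targets
x_batch, y_batch = paired_batch(images, masks, index=1, batch_size=2, rng=rng)
print(x_batch.shape, y_batch.shape)  # (2, 4, 4) (2, 4, 4)
```

Since the same flips hit both arrays, thresholding the augmented inputs still reproduces the augmented masks, which is a quick sanity check that the pairs stayed aligned. Purely photometric changes such as added noise would normally be applied to the input only.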
You can use ImageDataGenerator even if your label is an image.
Here is a simple example of how you can do that:
Code:
# Specifying your data augmentation here for both image and label
image_datagen = tf.keras.preprocessing.image.ImageDataGenerator()
mask_datagen = tf.keras.preprocessing.image.ImageDataGenerator()
# Provide the same seed and keyword arguments to the flow methods
seed = 1
image_generator = image_datagen.flow_from_directory(
    data_dir,
    class_mode=None,
    seed=seed)
mask_generator = mask_datagen.flow_from_directory(
    data_dir,  # in practice, the directory that holds the label images
    class_mode=None,
    seed=seed)
# Combine the image and label generator.
train_generator = zip(image_generator, mask_generator)
Now, if you iterate over it you will get:
for image, label in train_generator:
    print(image.shape, label.shape)
    break
Output:
(32, 256, 256, 3) (32, 256, 256, 3)
You can then train on this train_generator.
Code:
model.fit_generator(
    train_generator,
    steps_per_epoch=2000,
    epochs=50)
With flow_from_directory the data is read from disk batch by batch, so your memory won't be cluttered, and ImageDataGenerator takes care of the augmentation part.
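Why the shared seed matters: two independent random streams started from the same seed make the same decisions, so the image and its label receive identical transformations. A NumPy-only sketch of that effect follows; random_flips is a hypothetical stand-in for the augmentation done by the two generators, not Keras code:

```python
import numpy as np

def random_flips(batch, seed):
    """Flip each array left-right with probability 0.5, driven only by `seed`."""
    rng = np.random.default_rng(seed)
    flips = rng.random(len(batch)) < 0.5
    return np.stack([x[:, ::-1] if flip else x for x, flip in zip(batch, flips)])

images = np.random.default_rng(123).random((8, 4, 4))
masks = images > 0.5                             # toy label images

aug_images = random_flips(images, seed=1)
aug_masks = random_flips(masks, seed=1)          # same seed, so the same flips
print(np.array_equal(aug_masks, aug_images > 0.5))  # True
```

With different seeds the two streams would generally disagree, and the labels would no longer match their images.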
My scenario is that we have multiple peers, each with their own data, located in different directories that share the same sub-directory structure. I want to train the model on all of that data, but if I copy everything into one folder I can no longer keep track of which data came from whom (new data is also created occasionally, so copying the files every time is not practical).
My data is now stored like this:
-user01
-user02
-user03
...
(all of them have similar sub-directory structure)
I have searched for a solution, but I only found the multi-input case here and here, where multiple inputs are concatenated into a single parallel input, which is not my case.
I know that flow_from_directory() can only be fed one directory at a time, so how can I build a custom version that can be fed multiple directories at once?
If my question is low-quality, please advise me on how to improve it. I have also searched the Keras GitHub but didn't find anything I could adapt.
Thank you.
The Keras ImageDataGenerator flow_from_directory method has a follow_links parameter.
Maybe you can create one directory which is populated with symlinks to files in all the other directories.
This stack question discusses using symlinks with Keras ImageDataGenerator: Understanding 'follow_links' argument in Keras's ImageDataGenerator?
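A minimal sketch of the symlink idea, assuming each user directory contains one sub-directory per class. link_user_dirs and the user__file naming are hypothetical choices made here so that the owner of every file stays visible in the merged folder. The sketch links individual files; linking whole directories instead is the case where follow_links=True would be needed.

```python
import os
import tempfile

def link_user_dirs(user_dirs, merged_root):
    """Create merged_root/<class>/<user>__<file> symlinks for every user directory."""
    for user_dir in user_dirs:
        user = os.path.basename(os.path.normpath(user_dir))
        for class_name in sorted(os.listdir(user_dir)):
            class_dir = os.path.join(user_dir, class_name)
            if not os.path.isdir(class_dir):
                continue
            target_dir = os.path.join(merged_root, class_name)
            os.makedirs(target_dir, exist_ok=True)
            for fname in sorted(os.listdir(class_dir)):
                link = os.path.join(target_dir, f"{user}__{fname}")
                if not os.path.lexists(link):   # re-running only adds links for new files
                    os.symlink(os.path.abspath(os.path.join(class_dir, fname)), link)

with tempfile.TemporaryDirectory() as root:
    # Tiny fake layout: user01/cats/a.jpg, user02/cats/a.jpg, user02/dogs/b.jpg
    for user, class_name, fname in [("user01", "cats", "a.jpg"),
                                    ("user02", "cats", "a.jpg"),
                                    ("user02", "dogs", "b.jpg")]:
        os.makedirs(os.path.join(root, user, class_name), exist_ok=True)
        open(os.path.join(root, user, class_name, fname), "w").close()
    merged = os.path.join(root, "merged")
    link_user_dirs([os.path.join(root, "user01"), os.path.join(root, "user02")], merged)
    merged_listing = {c: sorted(os.listdir(os.path.join(merged, c)))
                      for c in sorted(os.listdir(merged))}
print(merged_listing)
```

The merged folder can then be passed to flow_from_directory as a single directory, and because no files are copied, re-running the function after new data arrives only creates the missing links.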
After so many days I hope you have found a solution to the problem, but I will share another idea here so that new people like me who face the same problem in the future can get help.
A few days ago I had this kind of problem. follow_links will be a solution to your question, as user3731622 said. I also think the idea of merging two data generators will work. In that case, however, the batch size of each sub-generator has to be set in proportion to the amount of data in its directory.
Batch size of sub-generators:
b = (B × n) / (the sum of n)
Where,
b = Batch Size Of Any Sub-generator
B = Desired Batch Size Of The Merged Generator
n = Number Of Images In That Directory Of Sub-generator
the sum of n = Total Number Of Images In All Directories
See the code below, this may help:
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import Sequence
import matplotlib.pyplot as plt
import numpy as np
import os
class MergedGenerators(Sequence):
    def __init__(self, batch_size, generators=[], sub_batch_size=[]):
        self.generators = generators
        self.sub_batch_size = sub_batch_size
        self.batch_size = batch_size
    def __len__(self):
        return int(
            sum([(len(self.generators[idx]) * self.sub_batch_size[idx])
                 for idx in range(len(self.sub_batch_size))]) /
            self.batch_size)
    def __getitem__(self, index):
        """Getting items from the generators and packing them"""
        X_batch = []
        Y_batch = []
        for generator in self.generators:
            if generator.class_mode is None:
                x1 = generator[index % len(generator)]
                X_batch = [*X_batch, *x1]
            else:
                x1, y1 = generator[index % len(generator)]
                X_batch = [*X_batch, *x1]
                Y_batch = [*Y_batch, *y1]
        if self.generators[0].class_mode is None:
            return np.array(X_batch)
        return np.array(X_batch), np.array(Y_batch)
def build_datagenerator(dir1=None, dir2=None, batch_size=32):
    n_images_in_dir1 = sum([len(files) for r, d, files in os.walk(dir1)])
    n_images_in_dir2 = sum([len(files) for r, d, files in os.walk(dir2)])
    # The two generators need different batch sizes because the number of images
    # in the two directories is not the same, and each directory should contribute
    # its proportional share to every merged batch.
    generator1_batch_size = int((n_images_in_dir1 * batch_size) /
                                (n_images_in_dir1 + n_images_in_dir2))
    generator2_batch_size = batch_size - generator1_batch_size
    generator1 = ImageDataGenerator(
        rescale=1. / 255,
        shear_range=0.2,
        zoom_range=0.2,
        rotation_range=5.,
        horizontal_flip=True,
    )
    generator2 = ImageDataGenerator(
        rescale=1. / 255,
        zoom_range=0.2,
        horizontal_flip=False,
    )
    # generator2 has different image augmentation attributes than generator1
    generator1 = generator1.flow_from_directory(
        dir1,
        target_size=(128, 128),
        color_mode='rgb',
        class_mode=None,
        batch_size=generator1_batch_size,
        shuffle=True,
        seed=42,
        interpolation="bicubic",
    )
    generator2 = generator2.flow_from_directory(
        dir2,
        target_size=(128, 128),
        color_mode='rgb',
        class_mode=None,
        batch_size=generator2_batch_size,
        shuffle=True,
        seed=42,
        interpolation="bicubic",
    )
    return MergedGenerators(
        batch_size,
        generators=[generator1, generator2],
        sub_batch_size=[generator1_batch_size, generator2_batch_size])
def test_datagen(batch_size=32):
    datagen = build_datagenerator(dir1="./asdf",
                                  dir2="./asdf2",
                                  batch_size=batch_size)
    print("Datagenerator length (Batch count):", len(datagen))
    for batch_count, image_batch in enumerate(datagen):
        if batch_count == 1:
            break
        print("Images: ", image_batch.shape)
        plt.figure(figsize=(10, 10))
        for i in range(image_batch.shape[0]):
            plt.subplot(1, batch_size, i + 1)
            plt.imshow(image_batch[i], interpolation='nearest')
            plt.axis('off')
        plt.tight_layout()
test_datagen(4)
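The proportional batch-size rule used for the two directories above extends to any number of directories. A small sketch of the arithmetic follows; sub_batch_sizes is a hypothetical helper, and handing the rounding leftovers to the largest remainders is a choice made here so that the shares always add up to the requested batch size:

```python
def sub_batch_sizes(image_counts, batch_size):
    """Split batch_size across directories in proportion to their image counts."""
    total = sum(image_counts)
    sizes = [(n * batch_size) // total for n in image_counts]
    remainders = [(n * batch_size) % total for n in image_counts]
    leftover = batch_size - sum(sizes)
    # Give the slots lost to rounding to the directories with the largest remainders.
    by_remainder = sorted(range(len(sizes)), key=lambda i: remainders[i], reverse=True)
    for idx in by_remainder[:leftover]:
        sizes[idx] += 1
    return sizes

print(sub_batch_sizes([300, 100], 32))       # [24, 8]
print(sub_batch_sizes([500, 300, 200], 32))  # [16, 10, 6]
```

A very small directory can still end up with a share of 0, in which case its generator would need special handling.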
I'm trying to use the pretrained InceptionV3 model to classify the food-101 dataset, which contains food images for 101 categories, 1000 per category. So far I've preprocessed this dataset into a single hdf5 file (I assumed this is beneficial compared to loading images on the go when training), which has the following tables inside:
The data split is the standard 70% train, 20% validation, 10% test, so for example valid_img has a size of 20200*299*299*3. The labels are one-hot encoded for Keras, so valid_labels has a size of 20200*101.
This hdf5 file has a size of 27.1 GB, so it will not fit into my memory. (I have 8 GB of RAM, although probably only 4-5 GB is effectively usable while running Ubuntu. My GPU is a GTX 960 with 2 GB of VRAM, and so far it looks like 1.5 GB is available for Python when I try to start the training script.) I'm using the Tensorflow backend.
The first idea I had is to use model.train_on_batch() with a nested for loop like this:
# Loading InceptionV3, adding my fully connected layers, compiling model...
dataset = h5py.File('/home/uzoltan/PycharmProjects/food-101/food-101_299x299.hdf5', 'r')
epoch = 50
for e in range(epoch):
    for i in range(100):  # 1000 images can fit in the memory easily, this could probably be range(10) too
        train_images = dataset["train_img"][i * 706:(i + 1) * 706, ...]
        train_labels = dataset["train_labels"][i * 706:(i + 1) * 706, ...]
        val_images = dataset["valid_img"][i * 202:(i + 1) * 202, ...]
        val_labels = dataset["valid_labels"][i * 202:(i + 1) * 202, ...]
        model.train_on_batch(x=train_images, y=train_labels, class_weight=None,
                             sample_weight=None)
My problem with this approach is that train_on_batch provides no support for validation or for batch shuffling, so the batches arrive in the same order every epoch.
So I looked at model.fit_generator(), which has the nice property of providing all the same functionality as fit(). In addition, with the built-in ImageDataGenerator you can do image augmentation (rotations, horizontal flips, etc.) on the CPU at the same time, so that your model can be more robust. My problem here is that, if I understand it correctly, the ImageDataGenerator.flow(x, y) method needs all the samples and labels at once, but my training/validation data won't fit into my RAM.
This is where I think custom data generators come into the picture, but after looking extensively at the examples I could find on the Keras GitHub issues page, I still don't really get how I should implement a custom generator that reads batches of data from my hdf5 file. Can someone provide me with a good example or pointers? How could I couple the custom batch generator with the image augmentations? Or is it perhaps easier to implement some kind of manual validation and batch shuffling for train_on_batch()? If so, I could use some pointers there too.
For anyone still looking for an answer, I made the following "crude wrapper" around ImageDataGenerator's apply_transform method.
from numpy.random import uniform, randint
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import numpy as np
class CustomImagesGenerator:
    def __init__(self, x, zoom_range, shear_range, rescale, horizontal_flip, batch_size):
        self.x = x
        self.zoom_range = zoom_range
        self.shear_range = shear_range
        self.rescale = rescale
        self.horizontal_flip = horizontal_flip
        self.batch_size = batch_size
        self.__img_gen = ImageDataGenerator()
        self.__batch_index = 0
    def __len__(self):
        # steps_per_epoch, if unspecified, will use the len(generator) as a number of steps.
        # hence this (__len__ must return an int, so use integer division)
        return self.x.shape[0] // self.batch_size
    def next(self):
        return self.__next__()
    def __next__(self):
        start = self.__batch_index*self.batch_size
        stop = start + self.batch_size
        self.__batch_index += 1
        if stop > len(self.x):
            raise StopIteration
        transformed = np.array(self.x[start:stop])  # loads from hdf5
        for i in range(len(transformed)):
            zoom = uniform(self.zoom_range[0], self.zoom_range[1])
            transformations = {
                'zx': zoom,
                'zy': zoom,
                'shear': uniform(-self.shear_range, self.shear_range),
                'flip_horizontal': self.horizontal_flip and bool(randint(0,2))
            }
            transformed[i] = self.__img_gen.apply_transform(transformed[i], transformations)
        return transformed * self.rescale
It can be called like so:
import h5py
f = h5py.File("my_heavy_dataset_file.hdf5", 'r')
images = f['mydatasets/images']
my_gen = CustomImagesGenerator(
    images,
    zoom_range=[0.8, 1],
    shear_range=6,
    rescale=1./255,
    horizontal_flip=True,
    batch_size=64
)
model.fit_generator(my_gen)
If I understood you correctly, you want to use data from an HDF5 file (which does not fit in memory) and apply data augmentation to it at the same time.
I'm in the same situation as you, and I found this code, which may be helpful with a few modifications:
https://gist.github.com/wassname/74f02bc9134897e3fe4e60784f5aaa15
This is my solution for shuffling the data every epoch with an h5 file.
indices is the list of train or validation indices.
import json
from time import time
import h5py
import numpy as np
def generator(h5path, indices, batchSize=128, is_train=True, aug=None):
    db = h5py.File(h5path, "r")
    with open("mean.json") as f:
        mean = json.load(f)
    meanV = np.array([mean["R"], mean["G"], mean["B"]])
    while True:
        np.random.shuffle(indices)
        for i in range(0, len(indices), batchSize):
            t0 = time()
            batch_indices = indices[i:i+batchSize]
            batch_indices.sort()  # h5py expects index lists in increasing order
            by = db["labels"][batch_indices,:]
            bx = db["images"][batch_indices,:,:,:]
            bx[:,:,:,0] -= meanV[0]
            bx[:,:,:,1] -= meanV[1]
            bx[:,:,:,2] -= meanV[2]
            t1=time()
            if is_train:
                #bx = random_crop(bx, (224,224))
                if aug is not None:
                    bx,by = next(aug.flow(bx,by,batchSize))
            yield (bx,by)
h5path='all_224.hdf5'
model.fit_generator(generator(h5path, train_indices, batchSize=batchSize, is_train=True, aug=aug),
                    steps_per_epoch = 20000//batchSize,
                    validation_data= generator(h5path, test_indices, is_train=False, batchSize=batchSize),
                    validation_steps = 2424//batchSize,
                    epochs=args.epoch,
                    max_queue_size=100,
                    callbacks=[checkpoint, early_stop])
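The shuffle-then-sort step in that generator is easy to miss: the batch membership is random, but each batch is read with its indices in increasing order, which is what h5py requires for index lists. Below is a NumPy-only sketch of one epoch of this pattern; shuffled_batches is a hypothetical helper, and plain arrays stand in for the HDF5 datasets:

```python
import numpy as np

def shuffled_batches(images, labels, indices, batch_size, rng):
    """Yield one epoch of (x, y) batches with random membership and sorted reads."""
    order = rng.permutation(indices)   # new random order for this epoch
    for start in range(0, len(order), batch_size):
        batch = np.sort(order[start:start + batch_size])  # increasing order for the read
        yield images[batch], labels[batch]

# Plain arrays stand in for db["images"] and db["labels"]; row i is filled with i.
images = np.arange(10, dtype=float).reshape(10, 1) * np.ones((1, 3))
labels = np.arange(10)
rng = np.random.default_rng(0)
batches = list(shuffled_batches(images, labels, np.arange(10), batch_size=4, rng=rng))
print([len(y) for _, y in batches])  # [4, 4, 2]
```

Wrapping the loop in while True, as the answer does, turns it into the endless generator that fit_generator expects.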
You want to write a function that loads images from the HDF5 file and then yields (not returns) them as a numpy array. Here is a simple example that uses OpenCV to load images directly from .png/.jpg files in a given directory:
import os
import random
import cv2
import numpy as np
def generate_data(directory, batch_size):
    """Replaces Keras' native ImageDataGenerator."""
    # INPUT_SHAPE is assumed to be defined elsewhere as the (width, height) expected by cv2.resize
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch = []
        for b in range(batch_size):
            if i == len(file_list):
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            # os.listdir returns bare file names, so join them with the directory
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            image_batch.append((image.astype(float) - 128) / 128)
        yield np.array(image_batch)
Obviously you will have to modify it to read from the HDF5 instead.
Once you have written your function, the usage is simply:
data_dir = os.path.expanduser('~/my_data')  # os.listdir does not expand '~' by itself
model.fit_generator(
    generate_data(data_dir, batch_size),
    steps_per_epoch=len(os.listdir(data_dir)) // batch_size)
Again, this has to be modified to reflect the fact that you are reading from an HDF5 file and not a directory.
We can generate an image dataset using ImageDataGenerator with the flow_from_directory method. To get the list of classes, we can use object.classes. But how do I get the list of values? I've searched and still haven't found anything.
Thanks :)
flow_from_directory returns a Python iterator that yields one batch of data at a time, with the same shape as your model inputs, e.g. (batch_size, height, width, channels). The benefit of a generator is that when your dataset is too big to fit into your limited memory, you can still produce it one batch at a time. It works with model.fit_generator() and model.predict_generator().
If you want to get the numeric data, you can use the next() function of the generator:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(rescale = 1. / 255)
data_generator = datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')
data_list = []
batch_index = 0
while batch_index <= data_generator.batch_index:
    data = next(data_generator)
    data_list.append(data[0])
    batch_index = batch_index + 1
# now, data_array is the numeric data of whole images
# (concatenate the batches, since the last batch may be smaller than the others)
data_array = np.concatenate(data_list)
Alternatively, you can use PIL and numpy to process the images yourself:
from PIL import Image
import numpy as np
def image_to_array(file_path):
    img = Image.open(file_path)
    img = img.resize((img_width,img_height))
    data = np.asarray(img,dtype='float32')
    return data
# now data is an array with shape (height, width, channels) for a single image.
Then you can loop over all your images with this function to get the numeric data.
Note that I recommend using the generator instead of loading all the data directly; otherwise you might run out of memory.
'But, how to call list of values' - If I understood correctly, you want to know which files are in your dataset. Whether or not that is the case, there are various ways to get values from your generator:
use object.filenames
object.filenames returns the list of all files in your target folder. I just use len(object.filenames) to get the total number of files in my test folder, then pass that number back into my generator and run it again.
generator.n
Another way to get the number of items in your test folder is generator.n.
x, y = test_generator.next() loads the image array and the classes (if inferred).
Or a = test_generator.next(), where the array and classes are returned as a tuple.
I only used this because my test dataset was really small (60 images) and I was using extracted features to train and predict with my model (that is, a feature array and not the image array).
If you are building a normal model, using the generator to yield batches is a much better way.
Create a function using generator
import math
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
def generate_test_data_from_directory(folder_path, image_target_size = 224, batch_size = 5, channels = 3, class_mode = 'sparse' ):
    '''fetch all our test data from directory'''
    test_datagen = ImageDataGenerator(rescale=1./255)
    test_generator = test_datagen.flow_from_directory(
        folder_path ,
        target_size = (image_target_size, image_target_size),
        batch_size = batch_size,
        class_mode = class_mode)
    total_images = test_generator.n
    # iterations to cover all data, so if batch is 5, it will take total_images/5 iterations (rounded up)
    steps = math.ceil(total_images / batch_size)
    x , y = [] , []
    for i in range(steps):
        a , b = test_generator.next()
        x.extend(a)
        y.extend(b)
    return np.array(x), np.array(y)
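How many steps does it take to see every image exactly once? A short sketch of the arithmetic, using the 60-image test set mentioned above and a hypothetical batch size of 8 (steps_to_cover is an illustrative helper, not part of Keras):

```python
import math

def steps_to_cover(total_images, batch_size):
    """Number of batches needed so that every image is seen once."""
    return math.ceil(total_images / batch_size)

print(60 // 8)                # 7 full batches, which leaves 4 images unseen
print(steps_to_cover(60, 8))  # 8, the last batch holds the remaining 4 images
print(steps_to_cover(60, 5))  # 12, no remainder in this case
```

Integer division only matches this count when the number of images is an exact multiple of the batch size.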