Tensorflow: data api for big datasets - python

I'm learning about neural networks from the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow". On page 410 the author shows the following function and describes it as a small helper function that creates and returns a dataset that will efficiently load data from multiple CSV files, then shuffle it, preprocess it and batch it. It is presented as a good input pipeline for large datasets that don't fit in memory (RAM).
I tried to run the function with 3 small files, pretending they are the training set, and the entire "training set" (all three files together) was loaded into memory. More precisely, tf.data.TextLineDataset() appears to read the whole file. I thought it would load the data in batches of, say, 32 instances at a time from the hard drive into RAM, but that is not what is happening.
So I don't understand what is happening here. Why is it reading the whole dataset?
import tensorflow as tf
from tensorflow import keras

X_mean, X_std = [...]  # mean and scale of each feature in the training set
n_inputs = 8

def preprocess(line):
    # n_inputs float features, then one float target with no default (so it is required)
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y

# `repeat` was used but never declared in the listing; it is assumed here to be a parameter defaulting to 1.
def csv_reader_dataset(filepaths, repeat=1, n_readers=5, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

train_set = csv_reader_dataset(train_filepaths)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)
batch_size = 32  # assumed: same value as the csv_reader_dataset default
model = keras.models.Sequential([...])
model.compile([...])
model.fit(train_set, steps_per_epoch=len(X_train) // batch_size,
          epochs=10, validation_data=valid_set,
          validation_steps=len(X_valid) // batch_size)

For reading multiple CSV files from disk as necessary you can use the make_csv_dataset function.
train_ds = tf.data.experimental.make_csv_dataset(
    tf.io.gfile.glob("data/multiple/*.csv"),
    batch_size=10,
    column_names=None,
    column_defaults=[0, 0, 0, 0, 0., "a", 0],
    label_name='target',
    select_columns=['age', 'sex', 'cp', 'trestbps', 'oldpeak', 'thal', 'target'],
    shuffle=True,
    shuffle_seed=101,
)
For more information see here.
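One thing worth checking in the question's pipeline, whichever reader is used: shuffle(buffer_size) keeps up to buffer_size elements in memory, so with shuffle_buffer_size=10000 and three small files the buffer alone can hold every line. Below is a minimal pure-Python sketch of a streaming shuffle buffer. It is not TensorFlow's implementation; shuffle_buffer and the stats dictionary are names made up for illustration.

```python
import random

def shuffle_buffer(source, buffer_size, stats, seed=0):
    """Yield the items of `source` in shuffled order using a bounded buffer.

    `stats["peak"]` records the largest number of items held at once.
    """
    rng = random.Random(seed)
    buffer = []
    stats["peak"] = 0
    for item in source:
        buffer.append(item)
        stats["peak"] = max(stats["peak"], len(buffer))
        if len(buffer) >= buffer_size:
            # Buffer is full: emit one random element to make room.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Source exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

lines = range(300)  # stands in for the lines of three small CSV files

big, small = {}, {}
out_big = list(shuffle_buffer(lines, 10000, big))
out_small = list(shuffle_buffer(lines, 32, small))
print(big["peak"], small["peak"])  # 300 32
```

With the large buffer all 300 items sit in memory before the first one is emitted; with a buffer of 32 at most 32 do. A small test dataset therefore cannot show whether the pipeline streams, because it fits inside the buffer.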

Related

Is memory supposed to be this high during model.fit using a generator?

The TensorFlow versions in which I can reproduce this behavior are: 2.7.0, 2.7.3, 2.8.0, 2.9.0. These are all the versions I've tried; I wasn't able to resolve the issue in any of them.
OS: Ubuntu 20
GPU: RTX 2060
RAM: 16GB
I am trying to feed my data to a model using a generator:
class DataGen(tf.keras.utils.Sequence):
    def __init__(self, indices, batch_size):
        self.X = X
        self.y = y
        self.indices = indices
        self.batch_size = batch_size

    def __getitem__(self, index):
        X_batch = self.X[self.indices][
            index * self.batch_size : (index + 1) * self.batch_size
        ]
        y_batch = self.y[self.indices][
            index * self.batch_size : (index + 1) * self.batch_size
        ]
        return X_batch, y_batch

    def __len__(self):
        return len(self.y[self.indices]) // self.batch_size

train_gen = DataGen(train_indices, 32)
val_gen = DataGen(val_indices, 32)
test_gen = DataGen(test_indices, 32)
where X, y is my dataset loaded from a .h5 file using h5py, and train_indices, val_indices, test_indices are the indices for each set that will be used on X and y.
I am creating the model and feeding the data using:
# setup model
base_model = tf.keras.applications.MobileNetV2(input_shape=(128, 128, 3),
                                               include_top=False)
base_model.trainable = False
mobilenet1 = Sequential([
    base_model,
    Flatten(),
    Dense(27, activation='softmax')
])
mobilenet1.compile(optimizer=tf.keras.optimizers.Adam(),
                   loss=tf.keras.losses.CategoricalCrossentropy(),
                   metrics=['accuracy'])
# model training
hist_mobilenet = mobilenet1.fit(train_gen, validation_data=val_gen, epochs=1)
Memory usage right before training is 8%, but the moment training starts it climbs to between 30% and 60%. Since I am using a generator and loading the data in small batches of 32 observations at a time, it seems odd to me that memory usage climbs this high. Also, even when training stops, memory usage stays above 30%. I checked all global variables, but none of them is that large. If I start another training session, memory usage goes even higher and eventually the Jupyter notebook kernel dies.
Is something wrong with my implementation, or is this normal?
Edit 1: some additional info.
Whenever training stops, memory usage drops a little, and I can decrease it further by calling the garbage collector. However, I cannot bring it back down to 8%, even when I delete the history created by fit.
The x and y batches' sizes sum up to 48 bytes; this outrages me! How can loading 48 bytes of data at a time cause memory usage to increase that much? Supposedly I am using an HDF5 dataset so that I can handle the data without overloading RAM. The next thing that comes to mind is that fit creates some variables, but it doesn't make sense that it needs so many GBs of memory to store them.
Strictly speaking, this is not a generator. When you instantiate DataGen, you create a full object that holds the complete index list (def __init__(self, indices, batch_size)) and the datasets (self.X, self.y), inherits from Sequence, and so on.
The simplest real generator for tensorflow looks something like this:
import tensorflow as tf
from sklearn.model_selection import train_test_split

BATCH_SIZE = 32
AUTOTUNE = tf.data.AUTOTUNE  # not defined in the original snippet; available in TF 2.4+
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Hold out the last 20% of the training split for validation.
X_val = X_train[int(len(X_train) * 0.8):]
X_train = X_train[:int(len(X_train) * 0.8)]
y_val = y_train[int(len(y_train) * 0.8):]
y_train = y_train[:int(len(y_train) * 0.8)]

def gen_reader(X_train, y_train):
    for data, label in zip(X_train, y_train):
        yield data, label

train_ds = tf.data.Dataset.from_generator(gen_reader, args=[X_train, y_train], output_types=(tf.float64, tf.int8)).batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
val_ds = tf.data.Dataset.from_generator(gen_reader, args=[X_val, y_val], output_types=(tf.float64, tf.int8)).batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
test_ds = tf.data.Dataset.from_generator(gen_reader, args=[X_test, y_test], output_types=(tf.float64, tf.int8)).batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
...
hist_mobilenet = mobilenet1.fit(train_ds, validation_data=val_ds, epochs=1)
How to minimize RAM usage
From the very helpful comments and answers of our fellow friends, I came to this conclusion:
First, we have to save the data to an HDF5 file, so we would not have to load the whole dataset in memory.
import h5py as h5
import gc
file = h5.File('data.h5', 'r')
X = file['X']
y = file['y']
gc.collect()
I am using garbage collector just to be safe.
Then, we would not have to pass the data to the generator, as the X and y will be same for training, validation and testing. In order to differentiate between the different data, we will use index maps
# split data for validation and testing
val_split, test_split = 0.2, 0.1
train_indices = np.arange(len(X))[:-int(len(X) * (val_split + test_split))]
val_indices = np.arange(len(X))[-int(len(X) * (val_split + test_split)) : -int(len(X) * test_split)]
test_indices = np.arange(len(X))[-int(len(X) * test_split):]
class DataGen(tf.keras.utils.Sequence):
    def __init__(self, index_map, batch_size):
        self.X = X
        self.y = y
        self.index_map = index_map
        self.batch_size = batch_size

    def __getitem__(self, index):
        X_batch = self.X[self.index_map[
            index * self.batch_size : (index + 1) * self.batch_size
        ]]
        y_batch = self.y[self.index_map[
            index * self.batch_size : (index + 1) * self.batch_size
        ]]
        return X_batch, y_batch

    def __len__(self):
        return len(self.index_map) // self.batch_size

train_gen = DataGen(train_indices, 32)
val_gen = DataGen(val_indices, 32)
test_gen = DataGen(test_indices, 32)
The last thing to notice is how I implemented the data fetching inside __getitem__.
Correct solution:
X_batch = self.X[self.index_map[
    index * self.batch_size : (index + 1) * self.batch_size
]]
Wrong solution:
X_batch = self.X[self.index_map][
    index * self.batch_size : (index + 1) * self.batch_size
]
The same applies to y.
Notice the difference? In the wrong solution I am loading the whole dataset (training, validation or testing) in memory! Instead, in the correct solution I am only loading the batch meant to feed in the fit method.
With this setup, I managed to raise RAM only to 2.88 GB, which is pretty cool!
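To make the difference between the two indexing orders concrete, here is a small pure-Python sketch. CountingDataset is a made-up stand-in for an on-disk dataset such as an h5py dataset; it only counts how many rows each access pattern asks for. The sizes (10000 rows, 7000 training indices) are arbitrary.

```python
class CountingDataset:
    """Stand-in for an on-disk dataset: counts how many rows are read."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.rows_read = 0

    def __getitem__(self, key):
        # Indexing with a list of indices reads exactly those rows.
        rows = list(key)
        self.rows_read += len(rows)
        return rows

index_map = list(range(7000))   # e.g. the training indices
batch_size, index = 32, 5
batch = slice(index * batch_size, (index + 1) * batch_size)

wrong = CountingDataset(10000)
x_wrong = wrong[index_map][batch]   # reads all 7000 rows, then slices in memory

right = CountingDataset(10000)
x_right = right[index_map[batch]]   # slices the index list first, reads 32 rows

print(wrong.rows_read, right.rows_read)  # 7000 32
```

Both expressions return the same 32 rows; the only difference is how much is pulled from disk to get them.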
Make use of fit_generator instead of the fit method
I mean instead of
hist_mobilenet = mobilenet1.fit(train_gen, validation_data=val_gen, epochs=1)
Use
hist_mobilenet = mobilenet1.fit_generator(train_gen, validation_data=val_gen, epochs=1)
According to this answer:
Keras' fit method loads all the data into memory at once, meaning
changing your batch size will have no effect on the RAM it takes up.
Have a look at using fit_generator, which is designed for use with a large dataset.
I think fit_generator will load the data batch-wise and not take up the whole RAM at once.

converting numpy memmap (non-image) numeric file into tfrecords for training?

My numeric (non-image) input data to a neural network (df_input = np.load(memmap file)) is a 350 GB NumPy memmap file with about 230 million rows and 150 columns (one of the columns is the model target). I am using TensorFlow Keras.
Even though I am using a data generator with batches (as shown below), training is extremely slow. Any feedback on what I should do? I set the number of epochs to 10,000, and after 4 hours it had only reached mini-batch 34/24863 of epoch 1/10000!
I was wondering whether there is a way to convert my numeric (non-image) memmap input data to TFRecords, and whether doing so could reduce the training time by changing the way data is read and loaded for training. I could not find any examples that deal with non-image data.
class data_generator(Sequence):
    'Generates data for Keras'
    def __init__(self, df_input, batch_size_gen, target_col_idx,
                 input_cols_idx, partition_idx, partition_set=None,
                 gen_indices=False, shuffle=False):
        self.batch_size = batch_size_gen
        self.shuffle = shuffle
        self.partition_idx = partition_idx
        self.on_epoch_end()
        self.partition_set = partition_set
        self.gen_indices = gen_indices
        self.df_input = df_input
        self.input_cols_idx = input_cols_idx
        self.target_col_idx = target_col_idx

    def __len__(self):
        'Denotes the number of times file batches are retrieved and fed to ANN per epoch'
        #return int(np.floor(len(self.partition_idx)/ self.batch_size))
        return int(np.ceil(len(self.partition_idx)/ self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data by retrieving a batch of files'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of batch files to process
        current_batch_idx = [self.partition_idx[k] for k in indexes]
        current_batch = self.df_array[current_batch_idx,:]
        #print(current_batch)
        # Generate data
        X, y = self.Data_Generator(current_batch)
        return X, y
Here is the training & calibration data:
batch_size_gen = 20000
training_generator = data_generator(df_input, batch_size_gen, target_col_idx, input_cols_idx, train_idx, partition_set='Training', gen_indices=True, shuffle=False)
calibration_generator = data_generator(df_input, batch_size_gen, target_col_idx, input_cols_idx, calibration_idx, partition_set='Calibration', gen_indices=False)
And this is how I call the fit:
hist = model.fit_generator(generator=training_generator, validation_data=calibration_generator,
                           epochs=n_epoch, use_multiprocessing=False,
                           callbacks=[reduce, earlystop, checkpointer, csv_logger])

Splitting custom binary dataset in train/test subsets using tensorflow io

I am trying to use local binary data to train a network to perform regression inference.
Each local binary data has the following layout:
and the whole data consists of several *.bin files with the layout above. Each file has a variable number of sequences of 403*4 bytes. I was able to read one of those files using the following code:
import tensorflow as tf
RAW_N = 2 + 20*20 + 1
def convert_binary_to_float_array(register):
return tf.io.decode_raw(register, out_type=tf.float32)
raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'],record_bytes=RAW_N*4)
raw_dataset = raw_dataset.map(map_func=convert_binary_to_float_array)
Now, I need to create 4 datasets train_data, train_labels, test_data, test_labels as follows:
train_data, train_labels, test_data, test_labels = prepare_ds(raw_dataset, 0.8)
and use them to train & evaluate:
model = build_model()
history = model.fit(train_data, train_labels, ...)
loss, mse = model.evaluate(test_data, test_labels)
My question is: how to implement function prepare_ds(dataset, frac)?
def prepare_ds(dataset, frac):
...
I have tried to use tf.shape, tf.reshape, tf.slice and subscripting with [:], with no success. I realized that those functions don't work as I expected because after the map() call raw_dataset is a MapDataset (as a result of the eager execution concerns).
If the metadata is supposed to be part of your inputs, which I am assuming, you could try something like this:
import random
import struct
import tensorflow as tf
import numpy as np
RAW_N = 2 + 20*20 + 1
# Write RAW_N*4 random 32-bit integers, i.e. 4 dummy records of RAW_N*4 bytes each.
bytess = random.sample(range(1, 5000), RAW_N*4)
with open('mydata.bin', 'wb') as f:
    f.write(struct.pack('1612i', *bytess))

def decode_and_prepare(register):
    register = tf.io.decode_raw(register, out_type=tf.float32)
    inputs = register[:402]
    label = register[402:]
    return inputs, label

total_data_entries = 8
raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['/content/mydata.bin', '/content/mydata.bin'], record_bytes=RAW_N*4)
raw_dataset = raw_dataset.map(decode_and_prepare)
# reshuffle_each_iteration=False keeps the shuffled order fixed between iterations,
# so take() and skip() should keep selecting disjoint records.
raw_dataset = raw_dataset.shuffle(buffer_size=total_data_entries, reshuffle_each_iteration=False)
train_ds_size = int(0.8 * total_data_entries)
test_ds_size = int(0.2 * total_data_entries)
train_ds = raw_dataset.take(train_ds_size)
remaining_data = raw_dataset.skip(train_ds_size)
test_ds = remaining_data.take(test_ds_size)
Note that I am using the same bin file twice for demonstration purposes. After running that code snippet, you could feed the datasets to your model like this:
model = build_model()
history = model.fit(train_ds, ...)
loss, mse = model.evaluate(test_ds)
as each dataset contains the inputs and the corresponding labels.
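For checking the record layout and the split outside of TensorFlow, the same idea can be sketched with the standard library only. This is an illustration under assumptions, not the method above: read_records and split are hypothetical helper names, the bytes are assumed to be little-endian float32 (which is what tf.io.decode_raw assumes by default), and the records are synthetic rather than read from a real *.bin file.

```python
import random
import struct

RAW_N = 2 + 20 * 20 + 1      # floats per record, as in the question
RECORD_BYTES = RAW_N * 4     # float32 takes 4 bytes

def read_records(blob):
    """Split a bytes object into (inputs, label) pairs of floats."""
    records = []
    for offset in range(0, len(blob) - RECORD_BYTES + 1, RECORD_BYTES):
        values = struct.unpack_from(f"<{RAW_N}f", blob, offset)
        records.append((values[:-1], values[-1]))  # 402 inputs, 1 label
    return records

def split(records, frac, seed=0):
    """Shuffle once with a fixed seed, then cut into train and test."""
    order = list(range(len(records)))
    random.Random(seed).shuffle(order)
    cut = int(frac * len(records))
    train = [records[i] for i in order[:cut]]
    test = [records[i] for i in order[cut:]]
    return train, test

# Ten synthetic records; record r is filled with the value float(r).
blob = b"".join(
    struct.pack(f"<{RAW_N}f", *([float(r)] * RAW_N)) for r in range(10)
)
train, test = split(read_records(blob), 0.8)
print(len(train), len(test))  # 8 2
```

Because the shuffle happens exactly once before the cut, no record can end up in both subsets.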

Fine-tuning a neural network in tensorflow

I've been working on this neural network with the intent to predict TBA (time based availability) of simulated windmill parks based on certain attributes. The neural network runs just fine, and gives me some predictions, however I'm not quite satisfied with the results. It fails to notice some very obvious correlations that I can clearly see by myself. Here is my current code:
`# Import
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
maxi = 0.96
mini = 0.7
# Make data a np.array
data = pd.read_csv('datafile_ML_no_avg.csv')
data = data.values
# Shuffle the data
shuffle_indices = np.random.permutation(np.arange(len(data)))
data = data[shuffle_indices]
# Training and test data
data_train = data[0:int(len(data)*0.8),:]
data_test = data[int(len(data)*0.8):int(len(data)),:]
# Scale data
scaler = MinMaxScaler(feature_range=(mini, maxi))
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)
# Build X and y
X_train = data_train[:, 0:5]
y_train = data_train[:, 6:7]
X_test = data_test[:, 0:5]
y_test = data_test[:, 6:7]
# Number of stocks in training data
n_args = X_train.shape[1]
multi = int(8)
# Neurons
n_neurons_1 = 8*multi
n_neurons_2 = 4*multi
n_neurons_3 = 2*multi
n_neurons_4 = 1*multi
# Session
net = tf.InteractiveSession()
# Placeholder
X = tf.placeholder(dtype=tf.float32, shape=[None, n_args])
Y = tf.placeholder(dtype=tf.float32, shape=[None,1])
# Initialize1s
sigma = 1
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg",
distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()
# Hidden weights
W_hidden_1 = tf.Variable(weight_initializer([n_args, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))
# Output weights
W_out = tf.Variable(weight_initializer([n_neurons_4, 1]))
bias_out = tf.Variable(bias_initializer([1]))
# Hidden layer
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2),
bias_hidden_2))
hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3),
bias_hidden_3))
hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4),
bias_hidden_4))
# Output layer (transpose!)
out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))
# Cost function
mse = tf.reduce_mean(tf.squared_difference(out, Y))
# Optimizer
opt = tf.train.AdamOptimizer().minimize(mse)
# Init
net.run(tf.global_variables_initializer())
# Fit neural net
batch_size = 10
mse_train = []
mse_test = []
# Run
epochs = 10
for e in range(epochs):
    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]
    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        net.run(opt, feed_dict={X: batch_x, Y: batch_y})
        # Show progress
        if np.mod(i, 50) == 0:
            mse_train.append(net.run(mse, feed_dict={X: X_train, Y: y_train}))
            mse_test.append(net.run(mse, feed_dict={X: X_test, Y: y_test}))
            pred = net.run(out, feed_dict={X: X_test})
            print(pred)`
I have tried tweaking the number of hidden layers, the number of nodes per layer and the number of epochs, and I have tried different activation functions and optimizers. However, I am quite new to neural networks, so there might be something very obvious that I'm missing.
Thanks in advance to anyone who managed to read through all of that.
It would be much easier to help if you shared a small dataset that illustrates the problem. However, I will describe some of the issues with non-standard datasets and how to overcome them.
Possible solutions
Regularization and validation-based optimization - methods that are always good to try when looking for some extra accuracy. See dropout methods here (original paper), and some overview here.
Unbalanced data - Sometimes the categories/events in a time series behave like anomalies, or are simply unbalanced. If you read a book, words like "the" or "it" will appear far more often than "warehouse". This can become a problem if your main task is to detect the word "warehouse" and you train your network (even LSTMs) in the traditional way. One way to overcome this problem is to balance the samples (create balanced datasets) or to give more weight to low-frequency categories.
Model structure - sometimes fully connected layers are not enough. See computer vision problems, for instance, where we train using convolution layers. The convolution and pooling layers enforce structure on the model, which is suitable for images. This is also a form of regularization, since those layers have fewer parameters. In time-series problems, convolutions are also possible and turn out to work just fine. See the example in Conditional Time Series Forecasting with Convolutional Neural Networks.
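As an illustration of "giving more weight to low-frequency categories", here is a minimal sketch that derives per-class weights from label counts. The function name inverse_frequency_weights and the particular formula (total / (n_classes * count)) are my own choices for the example, not something taken from the question; other weighting schemes are equally valid.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return a weight per class that is inversely proportional to its count."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    # A class that occurs with exactly the average frequency gets weight 1.0.
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

labels = ["the"] * 90 + ["warehouse"] * 10
weights = inverse_frequency_weights(labels)
print(weights["warehouse"])  # 5.0
```

In Keras, a dictionary like this (keyed by integer class index rather than by word) can be passed to fit through its class_weight argument.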
The above suggestions are presented in the order I would suggest to try.
Good luck!
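One more thing that may be worth checking in the question's code (an observation from reading it, not something I have run): out is transposed to shape (1, batch) while Y has shape (batch, 1), so tf.squared_difference(out, Y) broadcasts to a (batch, batch) matrix and the loss compares every prediction with every target. The pure-Python sketch below reproduces that arithmetic on four made-up values; mse_elementwise and mse_broadcast are illustrative names.

```python
def mse_elementwise(pred, target):
    """Mean of (pred_i - target_i)^2: the intended loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def mse_broadcast(pred, target):
    """Mean of (pred_j - target_i)^2 over all pairs: what (1, n) vs (n, 1) computes."""
    n = len(target)
    return sum((p - t) ** 2 for p in pred for t in target) / (n * n)

target = [0.0, 1.0, 2.0, 3.0]
perfect = list(target)            # predictions equal to the targets
constant = [1.5] * len(target)    # always predict the target mean

print(mse_elementwise(perfect, target), mse_broadcast(perfect, target))    # 0.0 2.5
print(mse_elementwise(constant, target), mse_broadcast(constant, target))  # 1.25 1.25
```

Under the broadcast loss, predicting the mean everywhere scores better (1.25) than predicting every target exactly (2.5), which would push the network toward ignoring its inputs. If this applies, dropping the transpose so that out keeps shape (batch, 1) makes the loss element-wise again.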

Keras custom data generator for large hdf5 file which does not fit into memory

I'm trying to use the pretrained InceptionV3 model to classify the food-101 dataset, which contains food images for 101 categories, 1000 per category. So far I've preprocessed this dataset into a single hdf5 file (I assumed this is beneficial compared to loading images on the fly when training), which has the following tables inside:
The data split is the standard 70% train, 20% validation, 10% test, so for example valid_img has a size of 20200*299*299*3. The labels are one-hot encoded for Keras, so valid_labels has a size of 20200*101.
This hdf5 file has a size of 27.1 GB, so it will not fit into my memory. (I have 8 GB of RAM, although probably only 4-5 GB is effectively usable while running Ubuntu. My GPU is a GTX 960 with 2 GB of VRAM, and so far it looks like 1.5 GB is available to Python when I try to start the training script.) I'm using the TensorFlow backend.
The first idea I had is to use model.train_on_batch() with a double nested for loop like this:
#Loading InceptionV3, adding my fully connected layers, compiling model...
dataset = h5py.File('/home/uzoltan/PycharmProjects/food-101/food-101_299x299.hdf5', 'r')
epoch = 50
for e in range(epoch):  # renamed from i so the inner loop variable does not shadow it
    for i in range(100): #1000 images can fit in the memory easily, this could probably be range(10) too
        train_images = dataset["train_img"][i * 706:(i + 1) * 706, ...]
        train_labels = dataset["train_labels"][i * 706:(i + 1) * 706, ...]
        val_images = dataset["valid_img"][i * 202:(i + 1) * 202, ...]
        val_labels = dataset["valid_labels"][i * 202:(i + 1) * 202, ...]
        model.train_on_batch(x=train_images, y=train_labels, class_weight=None,
                             sample_weight=None)
My problem with this approach is that train_on_batch provides no support for validation or for batch shuffling (so that the batches are not in the same order every epoch).
So I looked towards model.fit_generator(), which has the nice property of providing all the same functionality as fit(); plus, with the built-in ImageDataGenerator you can do image augmentations (rotations, horizontal flips, etc.) on the CPU at the same time, so that your model can be more robust. My problem here is that, if I understand it correctly, the ImageDataGenerator.flow(x,y) method needs all the samples and labels at once, but my training/validation data won't fit into my RAM.
Here is where I think custom data generators come into the picture, but after looking extensively at some examples I could find on the Keras GitHub/Issues page, I still don't really get how I should implement a custom generator that would read batches of data from my hdf5 file. Can someone provide me with a good example or pointers? How could I couple the custom batch generator with the image augmentations? Or is it maybe easier to implement some kind of manual validation and batch shuffling for train_on_batch()? If so, I could use some pointers there too.
For anyone still looking for an answer, I made the following "crude wrapper" around ImageDataGenerator's apply_transform method.
from numpy.random import uniform, randint
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import numpy as np

class CustomImagesGenerator:
    def __init__(self, x, zoom_range, shear_range, rescale, horizontal_flip, batch_size):
        self.x = x
        self.zoom_range = zoom_range
        self.shear_range = shear_range
        self.rescale = rescale
        self.horizontal_flip = horizontal_flip
        self.batch_size = batch_size
        self.__img_gen = ImageDataGenerator()
        self.__batch_index = 0

    def __len__(self):
        # steps_per_epoch, if unspecified, will use the len(generator) as a number of steps.
        # hence this (integer division, because __len__ must return an int)
        return self.x.shape[0] // self.batch_size

    def next(self):
        return self.__next__()

    def __next__(self):
        start = self.__batch_index*self.batch_size
        stop = start + self.batch_size
        self.__batch_index += 1
        if stop > len(self.x):
            raise StopIteration
        transformed = np.array(self.x[start:stop])  # loads from hdf5
        for i in range(len(transformed)):
            zoom = uniform(self.zoom_range[0], self.zoom_range[1])
            transformations = {
                'zx': zoom,
                'zy': zoom,
                'shear': uniform(-self.shear_range, self.shear_range),
                'flip_horizontal': self.horizontal_flip and bool(randint(0,2))
            }
            transformed[i] = self.__img_gen.apply_transform(transformed[i], transformations)
        return transformed * self.rescale
It can be called like so:
import h5py
f = h5py.File("my_heavy_dataset_file.hdf5", 'r')
images = f['mydatasets/images']
my_gen = CustomImagesGenerator(
    images,
    zoom_range=[0.8, 1],
    shear_range=6,
    rescale=1./255,
    horizontal_flip=True,
    batch_size=64
)
model.fit_generator(my_gen)
If I understood you correctly, you want to use the data (which does not fit in the memory) from HDF5 and at the same time use data augmentation on it.
I'm in the same situation as you, and I found this code that maybe can be helpful with some few modifications:
https://gist.github.com/wassname/74f02bc9134897e3fe4e60784f5aaa15
This is my solution for shuffling the data every epoch with an h5 file.
indices is the list of train or validation indices.
import json
from time import time
import h5py
import numpy as np

def generator(h5path, indices, batchSize=128, is_train=True, aug=None):
    db = h5py.File(h5path, "r")
    with open("mean.json") as f:
        mean = json.load(f)
    meanV = np.array([mean["R"], mean["G"], mean["B"]])
    while True:
        np.random.shuffle(indices)
        for i in range(0, len(indices), batchSize):
            t0 = time()
            batch_indices = indices[i:i+batchSize]
            batch_indices.sort()  # h5py fancy indexing needs indices in increasing order
            by = db["labels"][batch_indices,:]
            bx = db["images"][batch_indices,:,:,:]
            bx[:,:,:,0] -= meanV[0]
            bx[:,:,:,1] -= meanV[1]
            bx[:,:,:,2] -= meanV[2]
            t1=time()
            if is_train:
                #bx = random_crop(bx, (224,224))
                if aug is not None:
                    bx,by = next(aug.flow(bx,by,batchSize))
            yield (bx,by)

h5path='all_224.hdf5'
model.fit_generator(generator(h5path, train_indices, batchSize=batchSize, is_train=True, aug=aug),
                    steps_per_epoch = 20000//batchSize,
                    validation_data= generator(h5path, test_indices, is_train=False, batchSize=batchSize),
                    validation_steps = 2424//batchSize,
                    epochs=args.epoch,
                    max_queue_size=100,
                    callbacks=[checkpoint, early_stop])
You want to write a function which loads images from the HDF5 and then yields (not returns) them as a numpy array. Here is a simple example which uses OpenCV to load images directly from .png/.jpg files in a given directory:
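import os
import random
import cv2
import numpy as np

def generate_data(directory, batch_size):
    """Replaces Keras' native ImageDataGenerator."""
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch = []
        for b in range(batch_size):
            if i == len(file_list):
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            # os.listdir returns bare file names, so join them with the directory.
            # INPUT_SHAPE is your own constant, e.g. the (width, height) the model expects.
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            image_batch.append((image.astype(float) - 128) / 128)
        yield np.array(image_batch)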
Obviously you will have to modify it to read from the HDF5 instead.
Once you have written your function, the usage is simply:
data_dir = os.path.expanduser('~/my_data')  # os.listdir does not expand '~' by itself
model.fit_generator(
    generate_data(data_dir, batch_size),
    steps_per_epoch=len(os.listdir(data_dir)) // batch_size)
Again modified to reflect the fact that you are reading from an HDF5 and not a directory.
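A rough sketch of what the HDF5 variant could look like is below. It assumes only that images and labels support len() and slice indexing, which h5py datasets do (slicing one reads just the requested rows). batch_generator and its parameters are names invented for this example, and plain lists stand in for the question's "train_img" and "train_labels" datasets so that the snippet runs without a file.

```python
import random

def batch_generator(images, labels, batch_size, shuffle=True, seed=None):
    """Yield (images, labels) batches forever, reading one slice at a time."""
    rng = random.Random(seed)
    # Start offsets of all full batches; a trailing partial batch is dropped.
    starts = list(range(0, len(images) - batch_size + 1, batch_size))
    while True:
        if shuffle:
            rng.shuffle(starts)   # new batch order every epoch
        for start in starts:
            stop = start + batch_size
            yield images[start:stop], labels[start:stop]

# Plain lists stand in for f["train_img"] and f["train_labels"].
images = list(range(100))
labels = [i % 10 for i in images]

gen = batch_generator(images, labels, batch_size=32, seed=0)
first_epoch = [next(gen) for _ in range(3)]   # 100 // 32 = 3 steps per epoch
```

This shuffles the order of the batches but not the samples inside a batch, which keeps every read a contiguous slice. Augmentation could be applied to each batch just before the yield.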
