Big HDF5 dataset, how to efficienly shuffle after each epoch

Big HDF5 dataset, how to efficienly shuffle after each epoch - python

I'm currently working with a big image dataset (~60GB) to train a CNN (Keras/Tensorflow) for a simple classification task.
The images are video frames, and thus highly correlated in time, so I shuffled the data already once when generating the huge .hdf5 file...
To feed the data into the CNN without having to load the whole set at once into memory I wrote a simple batch generator (see code below).
Now my question:
Usually it is recommended to shuffle the data after each training epoch right? (for SGD convergence reasons?) But to do so I'd have to load the whole dataset after each epoch and shuffle it, which is exactly what I wanted to avoid using the batch generator...
So: Is it really that important to shuffle the dataset after each epoch and if yes how could I do that as efficiently as possible?
Here is the current code of my batch generator:
def generate_batches_from_hdf5_file(hdf5_file, batch_size, dimensions, num_classes):
"""
Generator that returns batches of images ('xs') and labels ('ys') from a h5 file.
"""
filesize = len(hdf5_file['labels'])
while 1:
# count how many entries we have read
n_entries = 0
# as long as we haven't read all entries from the file: keep reading
while n_entries < (filesize - batch_size):
# start the next batch at index 0
# create numpy arrays of input data (features)
xs = hdf5_file['images'][n_entries: n_entries + batch_size]
xs = np.reshape(xs, dimensions).astype('float32')
# and label info. Contains more than one label in my case, e.g. is_dog, is_cat, fur_color,...
y_values = hdf5_file['labels'][n_entries:n_entries + batch_size]
#ys = keras.utils.to_categorical(y_values, num_classes)
ys = to_categorical(y_values, num_classes)
# we have read one more batch from this file
n_entries += batch_size
yield (xs, ys)

Yeah, shuffling improves performance since running the data in the same order each time may get you stuck in suboptimal areas.
Don't shuffle the entire data. Create a list of indices into the data, and shuffle that instead. Then move sequentially over the index list and use its values to pick data from the data set.

Related

Loading & splitting same training data but getting different results

So I'm trying to manually split my training data into separate batches such that I can easily access them via indexing, and not relying on DataLoader to split them up for me, since that way I won't be able to access the individual batches by indexing. So I attempted the following:
train_data = datasets.ANY(root='data', transform=T_train, download=True)
BS = 200
num_batches = len(train_data) // BS
sequence = list(range(len(train_data)))
np.random.shuffle(sequence) # To shuffle the training data
subsets = [Subset(train_data, sequence[i * BS: (i + 1) * BS]) for i in range(num_batches)]
train_loader = [DataLoader(sub, batch_size=BS) for sub in subsets] # Create multiple batches, each with BS number of samples
Which works during training just fine.
However, when I attempted another way to manually split the training data I got different end results, even with all the same parameters and the following settings ensured:
device = torch.device('cuda')
torch.manual_seed(0)
np.random.seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.cuda.empty_cache()
I only split the training data the following way this time:
train_data = list(datasets.ANY(root='data', transform=T_train, download=True)) # Cast into a list
BS = 200
num_batches = len(train_data) // BS
np.random.shuffle(train_data) # To shuffle the training data
train_loader = [DataLoader(train_data[i*BS: (i+1)*BS], batch_size=BS) for i in range(num_batches)]
But this gives me different results than the first approach, even though (I believe) that both approaches are identical in manually splitting the training data into batches. I even tried not shuffling at all and loading the data just as it is, but I still got different results (85.2% v.s 81.98% accuracy). I even manually checked that the loaded images from the batches match; and are the same using both methods.
The training scheme used in both ways:
for e in trange(epochs):
for loader in train_loader:
for x, y in loader:
x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
loss = F.cross_entropy(m1(x), y)
loss.backward()
optim.step()
scheduler.step()
optim.zero_grad()
Can somebody please explain to me why these differences arise (and if there's a better way)?
UPDATE:
T_train transformation contains some random transformations (H_flip, crop) and when using it along with the first train_loader the time taken during training was: 24.79s/it, while the second train_loader took: 10.88s/it (even though both have the exact same number of parameters updates/steps). So I decided to remove the random transformations from T_train; then the time taken using the first train_loader was: 16.99s/it, while the second train_loader took: 10.87s/it. So somehow, the second train_loader still took the same time (with or without the random transformations). Thus, I decided to visualize the image outputs from the second train_loader to make sure that the transformations were applied, and indeed they were! So this is really confusing and I'm not quite why they're giving different results.

How does tf.Data Input Pipeline with Data Augmentation every epoch works?

I used this code (https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/segmentation_blogpost/image_segmentation.ipynb#scrollTo=tkNqQaR2HQbd) for my data tensorflow pipeline. But I don't understand how it works. They are telling that "During training time, our model would never see twice the exact same picture". But how does this work? I only use tf.data Map-Function with _augment-Function once. Does this happen every step at my model.fit Function?
I tried to verify my _augment function with printing out something. But this will only occur at the first time and not every epoch.
def get_baseline_dataset(filenames,
labels,
preproc_fn=functools.partial(_augment),
threads=5,
batch_size=batch_size,
shuffle=True):
num_x = len(filenames)
# Create a dataset from the filenames and labels
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
# Map our preprocessing function to every element in our dataset, taking
# advantage of multithreading
dataset = dataset.map(_process_pathnames, num_parallel_calls=threads)
if preproc_fn.keywords is not None and 'resize' not in preproc_fn.keywords:
assert batch_size == 1, "Batching images must be of the same size"
dataset = dataset.map(preproc_fn, num_parallel_calls=threads)
if shuffle:
dataset = dataset.shuffle(num_x)
# It's necessary to repeat our data for all epochs
dataset = dataset.repeat().batch(batch_size)
return dataset
tr_cfg = {
'resize': [img_shape[0], img_shape[1]],
'scale': 1 / 255.,
'hue_delta': 0.1,
'horizontal_flip': True,
'width_shift_range': 0.1,
'height_shift_range': 0.1
}
tr_preprocessing_fn = functools.partial(_augment, **tr_cfg)
train_ds = get_baseline_dataset(x_train_filenames,
y_train_filenames,
preproc_fn=tr_preprocessing_fn,
batch_size=batch_size)

I am quoting the essential steps from https://cs230-stanford.github.io/tensorflow-input-data
I suggest you to skim through the article once for details.
"
To summarize, one good order for the different transformations is:
create the dataset
shuffle (with a big enough buffer size)
repeat
map with the actual work (preprocessing, augmentation…) using multiple parallel calls.
batch
prefetch
"
This should give what you desire because "augmentation" is after "repeat".
Hope it helps.

How to write an efficient custom Keras data generator

I would like to train a convolutional recurrent neural network for video frame prediction. The individual frames are quite big so it is challenging to fit the entire training data in memory at once. As such, I followed some tutorials online to create a custom data generator. When testing it, it seems to work but it is slower by a factor of at least 100 than using the pre-loaded data directly. Since I can only fit about a batch size of 8 on the GPU I understand that the data needs to be generated really fast, however, this does not seem to be the case.
I train my model on a single P100 and have 32 GB of memory available to be used by up to 16 cores.
class DataGenerator(tf.keras.utils.Sequence):
def __init__(self, images, input_images=5, predict_images=5, batch_size=16, image_size=(200, 200),
channels=1):
self.images = images
self.input_images = input_images
self.predict_images = predict_images
self.batch_size = batch_size
self.image_size = image_size
self.channels = channels
self.nr_images = int(len(self.images)-input_images-predict_images)
def __len__(self):
return int(np.floor(self.nr_images) / self.batch_size)
def __getitem__(self, item):
# Randomly select the beginning image of each batch
batch_indices = random.sample(range(0, self.nr_images), self.batch_size)
# Allocate the output images
x = np.empty((self.batch_size, self.input_images,
*self.image_size, self.channels), dtype='uint8')
y = np.empty((self.batch_size, self.predict_images,
*self.image_size, self.channels), dtype='uint8')
# Get the list of input an prediction images
for i in range(self.batch_size):
list_images_input = range(batch_indices[i], batch_indices[i]+self.input_images)
list_images_predict = range(batch_indices[i]+self.input_images,
batch_indices[i]+self.input_images+self.predict_images)
for j, ID in enumerate(list_images_input):
x[i, ] = np.load(np.reshape(self.images[ID], (*self.imagesize, self.channels))
# Read in the prediction images
for j, ID in enumerate(list_images_predict):
y[i, ] = np.load(np.reshape(self.images[ID], (*self.imagesize, self.channels))
return x, y
# Training the model using fit_generator
params = {'batch_size': 8,
'input_images': 5,
'predict_images': 5,
'image_size': (100, 100),
'channels': 1
}
data_path = "input_frames/"
input_images = sorted(glob.glob(data_path + "*.png"))
training_generator = DataGenerator(input_images, **params)
model.fit_generator(generator=training_generator, epochs=10, workers=6)
I would have expected that Keras will prepare the next data batch while the current batch is being processed on the GPU but it does not seem to catch up. In other words, preparing the data before sending it to the GPU seems to be the bottleneck.
Any idea on how to improve the performance of a data generator like this? Is there something missing that guarantees that the data is being prepared in a timely manner?
Thanks a lot!

When you use fit_generator, there is a workers= setting that can be used to scale up the number of generator workers. However you should ensure that the 'item' parameter in getitem is taken into account in order to ensure that the different workers (which are not synchronised) return different values depending on item index. i.e. instead of random sample, perhaps just return a slice of the data based on the index. You can shuffle the entire dataset before starting in order to make sure the dataset order is randomised.

Can you please try with use_multiprocessing=True? These are the numbers I observe on my GTX 1080Ti based system with the data generator you provided.
model.fit_generator(generator=training_generator, epochs=10, workers=6)
148/148 [==============================] - 9s 60ms/step
model.fit_generator(generator=training_generator, epochs=10, workers=6, use_multiprocessing=True)
148/148 [==============================] - 2s 11ms/step

You can try the prefetching of tf.data.Dataset. The prefetching allows you to compute the next batch(es) using your CPU while your GPU computes the gradient descent in the same time. Be careful: you need to change the numpy array into tf.constant in the data generator. Then try:
import tensoflow as tf
generator = DataGenerator(images)
spec = [tf.TypeSpec(shape=(generator.batch_size, generator.input_images,
*generator.image_size, generator.channels), dtype='uint8'),
tf.TypeSpec(shape=(generator.batch_size, generator.predict_images,
*generator.image_size, generator.channels), dtype='uint8')
dataset = tf.data.Dataset.from_generator(DataGenerator, output_signature=spec)
dataset.batch(batch_size).prefetch(-1) # this order is important
# a custom training loop is better than model.fit() otherwise prefetching can fail
def train_loop():
...
You can change the "-1" in prefetch() to another value like 1, 2 or more to get the maximum speed depending on your machine and the batch size.

this blog helps in setting up input data pipeline with tf.data and it also is much more efficient than using ImageDataGenerators and the code is also explained by using a custom data directory.
It also enhances the performance with prefetch, cache.
Prefetch processes the next batch while the current batch is being used.

Multiple independent iterators for a single dataset

Assume I have a training dataset "data_train", I would like to make two independent iterators that both iterate over data_train. The first iterator I will use to train my network "iter_train", where the output of iter_train.get_next() will be a batch that I train on. The second iterator will be used to evaluate the entire training dataset while I train, "iter_eval", to monitor the training progress.
Currently, if I only have a single iterator "iter_single", and I want to evaluate the training loss halfway through an epoch I will have to reset the iterator, evaluate the entire dataset with iter_single, and then begin training at the start of the dataset with iter_single. I will consequently not finish my previous epoch and ignore half the dataset unless I waste time iterating through the data without acting on it.
I already tried making two iterators for one dataset, however, by resetting one iterator it reset the other, which makes having two iterators pointless.

Just in case your dataset size is not huge (by huge I mean that both training and validation data can be held into the memory during the training and evaluation), you can use the following code to:
First, read and parse your data and pass them through a Tensorflow dataset object:
def get_image_dataset(dir_path, batch_size, split=0.7):
# Parse data and return them in array format (Numpy)
train_data, val_data = parse_data(dir_path, split)
# Create the dataset for our train data
train_data = tf.data.Dataset.from_tensor_slices(train_data)
train_data = train_data.batch(batch_size)
# Create the dataset for our test data
val_data = tf.data.Dataset.from_tensor_slices(val_data)
val_data = val_data.batch(batch_size)
return train_data, val_data
Second, define an iterator and initializer for your train and validation data:
def get_data():
with tf.name_scope('data'):
train_data, test_data = get_image_dataset(self.batch_size)
iterator = tf.data.Iterator.from_structure(output_types=train_data.output_types, output_shapes=train_data.output_shapes)
# Define one iterator for your data
img, self.label = iterator.get_next()
# Example of application on MNIST dataset
img = tf.reshape(img, [-1, CNN_INPUT_HEIGHT, CNN_INPUT_WIDTH, CNN_INPUT_CHANNELS])
# Define two initializers for either train or test (validation) data
self.train_init = iterator.make_initializer(train_data)
self.test_init = iterator.make_initializer(test_data)
Third and Fourth, while training/testing your network, initialize your Tensorflow graph with train/test dataset like so:
Train:
def train_network_one_epoch(...):
# Initialize training
sess.run(self.train_init)
# Run training graph nodes
return something
Test:
def evaluate_network(...):
# Initialize testing
sess.run(self.test_init)
# Run evaluation graph nodes
return something
You can have a look at this example which clearly demonstrates this procedure.

Classifying images using adjusted CNN tutorial - system is jumbling the output

Sorry, this is a long one!
I am 80% sure that the problem is that I don't fully understand how tensorflow uses the tf.train.batch function to queue data.
I am trying to adapt one of the tensorflow tutorials to classify a large number of images.
Tutorial can be found here: https://www.tensorflow.org/tutorials/deep_cnn
I have built some modules which can encode my raw data in the same format that cifar10 uses. I am using this to construct training and evaluation data which the program is able to evaluate to a high degree of accuracy. Accuracy varies depending on the quality of the imagesets I put in. To keep things simple I have trained it using 32x32 monochrome tiles of either yellow or blue (category 0 and 1 respectively). Conveniently the network is able to identify whether it is being given a yellow or blue tile with 100% accuracy.
I have also been able to adapt cifar10_eval.py to output predictions rather than an accuracy percentage. This allows me to feed in un-classified data and output predictions as a list. To do this I have exchanged the statement:
top_k_op = tf.nn.in_top_k(logits, labels, 1)
for:
output_2 = tf.argmax(logits, 1)
I have added a variable and a boolean to the eval_once function call to allow it to access the definition for "output_2" and to let me switch between this and "top_k_op" depending on whether I am in evaluation mode or if I am predicting new data.
So far so good. This method works for small amounts of input data but fails as soon as I want to output more than 128 classifications. Not coincidentally 128 is the batch size.
In theory the first item (3073 bytes) in the binary should correspond to the first item in the list which is churned out when I am predicting new data. This happens for inputs of up to 128 images but the data gets jumbled up when I try to categorise more images. Actually, some of the predictions are lost completely!
There are a couple of reasons that this happens. The tutorial isn't designed to care about the order in which data is read or processed, just that individual images correspond with their labels. Originally the data loss was randomised(!) but I have managed to remove the random element by removing multi-threading (threads = 1 rather than 16) and stopped it from shuffling filenames.
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
string_input_producer has a hidden/optional argument which shuffles the file names. For model evaluation I have set this to false as above.
However.... I am still stuck with jumbled data loss when evaluating data larger than a single batch.
Does anyone know why this happens and have any ideas about how it could be fixed?
In theory I could redesign the code to rebuild the graph and evaluate it for 128 images at a time. However, I want to classify millions of images and feel that I'd be asking for trouble trying to open a new graph instance per batch.
PS, I've done my homework:
I have verified that my initial data to binary conversion works by running a program which can read cifar10-style files and interpret it as a big tile of images. I have run this code on both the original cifar10 binaries and my own binaries and am able to reconstruct both perfectly.
When I encode uncategorised data I add a category label of zero to make sure the tutorial can read the file. However, I make sure that this label is chucked away at the file reading stage and thus is not used when generating a list of predictions.
I have verified the output predictions by printing the list directly onto the screen as a python output and also by using it to assemble a PNG image which can be compared with the original inputs. This verification works perfectly for small batch sizes and starts to fall apart in larger batch sizes.
I've also made some modifications to the tutorial not discussed in this post. These are simple modifications such as changing the number of categories to 2 rather than 10. Am confident that this is not the issue.
PPS, here is a copy of some functions from the modified script. I haven't pasted everything because this question is already huge:
from cifar10_eval:
def eval_once(saver, summary_writer, top_k_op, output_2, summary_op, mapping=False):
"""Run Eval once.
Args:
saver: Saver.
summary_writer: Summary writer.
top_k_op: Top K op.
summary_op: Summary op.
"""
with tf.Session() as sess:
ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)
if ckpt and ckpt.model_checkpoint_path:
# Restores from checkpoint
saver.restore(sess, ckpt.model_checkpoint_path)
# Assuming model_checkpoint_path looks something like:
# /my-favorite-path/cifar10_train/model.ckpt-0,
# extract global_step from it.
global_step = ckpt.model_checkpoint_path.split('/')[-1].split('-')[-1]
else:
print('No checkpoint file found')
return
# Start the queue runners.
coord = tf.train.Coordinator()
try:
threads = []
for qr in tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS):
threads.extend(qr.create_threads(sess, coord=coord, daemon=True,
start=True))
num_iter = int(math.ceil(FLAGS.num_examples / FLAGS.batch_size))
true_count = 0 # Counts the number of correct predictions.
total_sample_count = num_iter * FLAGS.batch_size
step = 0
output=[]
if mapping: # if in mapping mode generate a map, if in default mode (variable set to False by default) then tally predictions instead.
while step < num_iter and not coord.should_stop():
step += 1
hold = sess.run(output_2)
print(hold)
for i in range (len(hold)):
output.append(hold[i])
return(output)
from cifar10_input:
def inputs(mapping, data_dir, batch_size):
"""Construct input for CIFAR evaluation using the Reader ops.
Args:
mapping: bool, indicating if one should use the raw or pre-classified eval data set.
data_dir: Path to the CIFAR-10 data directory.
batch_size: Number of images per batch.
Returns:
images: Images. 4D tensor of [batch_size, IMAGE_SIZE, IMAGE_SIZE, 3] size.
labels: Labels. 1D tensor of [batch_size] size.
"""
filelist = os.listdir(data_dir)
filenames = []
if mapping:
# from Raw_Image_Processor import file_name
for f in filelist:
if f.startswith("raw_batch"):
filenames.append(os.path.join(data_dir, f))
num_examples_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN
else:
for f in filelist:
if f.startswith("eval_batch"):
filenames.append(os.path.join(data_dir, f))
num_examples_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_EVAL
for f in filenames:
if not tf.gfile.Exists(f):
raise ValueError('Failed to find file: ' + f)
# Create a queue that produces the filenames to read.
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
# Read examples from files in the filename queue.
read_input = read_cifar10(filename_queue)
reshaped_image = tf.cast(read_input.uint8image, tf.float32)
height = IMAGE_SIZE
width = IMAGE_SIZE
# Image processing for evaluation.
# Crop the central [height, width] of the image.
resized_image = tf.image.resize_image_with_crop_or_pad(reshaped_image,
height, width)
# Subtract off the mean and divide by the variance of the pixels.
float_image = tf.image.per_image_standardization(resized_image)
# Set the shapes of tensors.
float_image.set_shape([height, width, 3])
read_input.label.set_shape([1])
# Ensure that the random shuffling has good mixing properties.
min_fraction_of_examples_in_queue = 0.4
min_queue_examples = int(num_examples_per_epoch *
min_fraction_of_examples_in_queue)
# Generate a batch of images and labels by building up a queue of examples.
return _generate_image_and_label_batch(float_image, read_input.label,
min_queue_examples, batch_size,
shuffle=False)
from cifar10_input:
def _generate_image_and_label_batch(image, label, min_queue_examples,
batch_size, shuffle):
"""Construct a queued batch of images and labels.
Args:
image: 3-D Tensor of [height, width, 3] of type.float32.
label: 1-D Tensor of type.int32
min_queue_examples: int32, minimum number of samples to retain
in the queue that provides of batches of examples.
batch_size: Number of images per batch.
shuffle: boolean indicating whether to use a shuffling queue.
Returns:
images: Images. 4D tensor of [batch_size, height, width, 3] size.
labels: Labels. 1D tensor of [batch_size] size.
"""
# Create a queue that shuffles the examples, and then
# read 'batch_size' images + labels from the example queue.
num_preprocess_threads = 16
if shuffle:
images, label_batch = tf.train.shuffle_batch(
[image, label],
batch_size=batch_size,
num_threads=num_preprocess_threads,
capacity=min_queue_examples + 3 * batch_size,
min_after_dequeue=min_queue_examples)
else:
images, label_batch = tf.train.batch(
[image, label],
batch_size=batch_size,
num_threads=1,
capacity=1,
enqueue_many = False)
# Display the training images in the visualizer.
tf.summary.image('images', images)
return images, tf.reshape(label_batch, [batch_size])
Edit:
Partial solution is given in the comments below. Information loss is dependent on batch size so it turns out that increasing the batch size (in mapping mode only) is an effective fix.
However, I'm still unsure why it loses and/or scrambles information when the batch size is exceeded. Presumably the batches are taken is some non-sequential order. I don't need it to take the project forwards but if someone could explain how or why this happens it would be greatly appreciated.
Edit 2:
It's back! I've set the batch size to be equivalent to one binary file (in my case roughly 10,000 images). Data is not lost or jumbled within this batch but when I try to process multiple files (about 30) it mixes up the batches a little rather than outputting them on a FIFO basis.
A picture is probably the easiest way for you to see what is going on:
classification map
This is a reconstructed image of a rock face from which the classifier has been trained to recognize three categories. As you can see the reconstruction is mostly smooth. However, there are two clean breaks near the top of the image where a batch (or 3) has been outputted in non-chronological order. These should have appeared at the bottom of the image rather than near the top.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Big HDF5 dataset, how to efficienly shuffle after each epoch - python

Related

Loading & splitting same training data but getting different results

How does tf.Data Input Pipeline with Data Augmentation every epoch works?

How to write an efficient custom Keras data generator

Multiple independent iterators for a single dataset

Classifying images using adjusted CNN tutorial - system is jumbling the output

Categories

Resources