Multiple independent iterators for a single dataset - python

Assume I have a training dataset "data_train", I would like to make two independent iterators that both iterate over data_train. The first iterator I will use to train my network "iter_train", where the output of iter_train.get_next() will be a batch that I train on. The second iterator will be used to evaluate the entire training dataset while I train, "iter_eval", to monitor the training progress.
Currently, if I only have a single iterator "iter_single", and I want to evaluate the training loss halfway through an epoch I will have to reset the iterator, evaluate the entire dataset with iter_single, and then begin training at the start of the dataset with iter_single. I will consequently not finish my previous epoch and ignore half the dataset unless I waste time iterating through the data without acting on it.
I already tried making two iterators for one dataset, however, by resetting one iterator it reset the other, which makes having two iterators pointless.

Just in case your dataset size is not huge (by huge I mean that both training and validation data can be held into the memory during the training and evaluation), you can use the following code to:
First, read and parse your data and pass them through a Tensorflow dataset object:
def get_image_dataset(dir_path, batch_size, split=0.7):
# Parse data and return them in array format (Numpy)
train_data, val_data = parse_data(dir_path, split)
# Create the dataset for our train data
train_data = tf.data.Dataset.from_tensor_slices(train_data)
train_data = train_data.batch(batch_size)
# Create the dataset for our test data
val_data = tf.data.Dataset.from_tensor_slices(val_data)
val_data = val_data.batch(batch_size)
return train_data, val_data
Second, define an iterator and initializer for your train and validation data:
def get_data():
with tf.name_scope('data'):
train_data, test_data = get_image_dataset(self.batch_size)
iterator = tf.data.Iterator.from_structure(output_types=train_data.output_types, output_shapes=train_data.output_shapes)
# Define one iterator for your data
img, self.label = iterator.get_next()
# Example of application on MNIST dataset
img = tf.reshape(img, [-1, CNN_INPUT_HEIGHT, CNN_INPUT_WIDTH, CNN_INPUT_CHANNELS])
# Define two initializers for either train or test (validation) data
self.train_init = iterator.make_initializer(train_data)
self.test_init = iterator.make_initializer(test_data)
Third and Fourth, while training/testing your network, initialize your Tensorflow graph with train/test dataset like so:
Train:
def train_network_one_epoch(...):
# Initialize training
sess.run(self.train_init)
# Run training graph nodes
return something
Test:
def evaluate_network(...):
# Initialize testing
sess.run(self.test_init)
# Run evaluation graph nodes
return something
You can have a look at this example which clearly demonstrates this procedure.

Related

Loading & splitting same training data but getting different results

So I'm trying to manually split my training data into separate batches such that I can easily access them via indexing, and not relying on DataLoader to split them up for me, since that way I won't be able to access the individual batches by indexing. So I attempted the following:
train_data = datasets.ANY(root='data', transform=T_train, download=True)
BS = 200
num_batches = len(train_data) // BS
sequence = list(range(len(train_data)))
np.random.shuffle(sequence) # To shuffle the training data
subsets = [Subset(train_data, sequence[i * BS: (i + 1) * BS]) for i in range(num_batches)]
train_loader = [DataLoader(sub, batch_size=BS) for sub in subsets] # Create multiple batches, each with BS number of samples
Which works during training just fine.
However, when I attempted another way to manually split the training data I got different end results, even with all the same parameters and the following settings ensured:
device = torch.device('cuda')
torch.manual_seed(0)
np.random.seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.cuda.empty_cache()
I only split the training data the following way this time:
train_data = list(datasets.ANY(root='data', transform=T_train, download=True)) # Cast into a list
BS = 200
num_batches = len(train_data) // BS
np.random.shuffle(train_data) # To shuffle the training data
train_loader = [DataLoader(train_data[i*BS: (i+1)*BS], batch_size=BS) for i in range(num_batches)]
But this gives me different results than the first approach, even though (I believe) that both approaches are identical in manually splitting the training data into batches. I even tried not shuffling at all and loading the data just as it is, but I still got different results (85.2% v.s 81.98% accuracy). I even manually checked that the loaded images from the batches match; and are the same using both methods.
The training scheme used in both ways:
for e in trange(epochs):
for loader in train_loader:
for x, y in loader:
x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
loss = F.cross_entropy(m1(x), y)
loss.backward()
optim.step()
scheduler.step()
optim.zero_grad()
Can somebody please explain to me why these differences arise (and if there's a better way)?
UPDATE:
T_train transformation contains some random transformations (H_flip, crop) and when using it along with the first train_loader the time taken during training was: 24.79s/it, while the second train_loader took: 10.88s/it (even though both have the exact same number of parameters updates/steps). So I decided to remove the random transformations from T_train; then the time taken using the first train_loader was: 16.99s/it, while the second train_loader took: 10.87s/it. So somehow, the second train_loader still took the same time (with or without the random transformations). Thus, I decided to visualize the image outputs from the second train_loader to make sure that the transformations were applied, and indeed they were! So this is really confusing and I'm not quite why they're giving different results.

what is the best practices to train model on BIG dataset

I need to train a model on a dataset that required more memory than my GPU has. what is the best practice for feeding the dataset to model?
here is my steps:
first of all, I load dataset using batch_size
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
the second step i prepare data
for record in raw_train_ds.take(1):
train_images, train_labels = record['image'], record['label']
print(train_images.shape)
train_images = train_images.numpy().astype(np.float32) / 255.0
train_labels = tf.keras.utils.to_categorical(train_labels)
and then i feed data to the model
history = model.fit(train_images,train_labels, epochs=NUM_EPOCHS, validation_split=0.2)
but at step 2 I prepared data for the first batch and missed the rest batches because the model.fit is out of the loop scope (which, as I understand, works for one, first batch only).
On the other hand, I can't remove take(1) and move the model.fit method under the cycle. Because yes, in this case, I will handle all batches, but at the same time model.fill will be called at the end on each iteration and in this case, it also will not work properly
so, how I should change my code to be able to work appropriately with a big dataset using model.fit? could you point article, any documents, or just advise how to deal with it? thanks
Update
In my post below (approach 1) I describe one approach on how to solve the problem - are there any other better approaches or it is only one way how to solve this?
You can pass the whole dataset to fit for training. As you can see in the documentation, one of the possible values of the first parameter is:
A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
So you just need to convert your dataset to that format (a tuple with input and target) and pass it to fit:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
raw_train_ds = datasets['train']
train_dataset_fit = raw_train_ds.map(
lambda x: (tf.cast.dtypes(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
One problem with this is that it does not support a validation_split parameter but, as shown in this guide, tfds already gives you the functionality to have the splits of the data. So you would just need to get the test split dataset, transform it as above and pass it as validation_data to fit.
Approach 1
Thank #jdehesa I changed my code :
load dataset - in reality, it doesn't load data into memory till the first call 'next' from the dataset iterator. and even then, I think the iterator will load a portion of data (batch) with a size equal in BATCH_SIZE
raw_train_ds, raw_validation_ds = builder.as_dataset(split=["train[:90%]", "train[10%:]"], batch_size=BATCH_SIZE)
collected all required transformation into one method
def prepare_data(x):
train_images, train_labels = x['image'], x['label']
# TODO: resize image
train_images = tf.cast(train_images,tf.float32)/ 255.0
# train_labels = tf.keras.utils.to_categorical(train_labels,num_classes=NUM_CLASSES)
train_labels = tf.one_hot(train_labels,NUM_CLASSES)
return (train_images, train_labels)
applied these transformations to each element in batch (dataset) using the method td.data.Dataset.map
train_dataset_fit = raw_train_ds.map(prepare_data)
and then fed this dataset into model.fit - as I understand the model.fit will iterate through all batches in the dataset.
train_dataset_fit = raw_train_ds.map(prepare_data)
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)

How does tf.Data Input Pipeline with Data Augmentation every epoch works?

I used this code (https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/segmentation_blogpost/image_segmentation.ipynb#scrollTo=tkNqQaR2HQbd) for my data tensorflow pipeline. But I don't understand how it works. They are telling that "During training time, our model would never see twice the exact same picture". But how does this work? I only use tf.data Map-Function with _augment-Function once. Does this happen every step at my model.fit Function?
I tried to verify my _augment function with printing out something. But this will only occur at the first time and not every epoch.
def get_baseline_dataset(filenames,
labels,
preproc_fn=functools.partial(_augment),
threads=5,
batch_size=batch_size,
shuffle=True):
num_x = len(filenames)
# Create a dataset from the filenames and labels
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
# Map our preprocessing function to every element in our dataset, taking
# advantage of multithreading
dataset = dataset.map(_process_pathnames, num_parallel_calls=threads)
if preproc_fn.keywords is not None and 'resize' not in preproc_fn.keywords:
assert batch_size == 1, "Batching images must be of the same size"
dataset = dataset.map(preproc_fn, num_parallel_calls=threads)
if shuffle:
dataset = dataset.shuffle(num_x)
# It's necessary to repeat our data for all epochs
dataset = dataset.repeat().batch(batch_size)
return dataset
tr_cfg = {
'resize': [img_shape[0], img_shape[1]],
'scale': 1 / 255.,
'hue_delta': 0.1,
'horizontal_flip': True,
'width_shift_range': 0.1,
'height_shift_range': 0.1
}
tr_preprocessing_fn = functools.partial(_augment, **tr_cfg)
train_ds = get_baseline_dataset(x_train_filenames,
y_train_filenames,
preproc_fn=tr_preprocessing_fn,
batch_size=batch_size)
I am quoting the essential steps from https://cs230-stanford.github.io/tensorflow-input-data
I suggest you to skim through the article once for details.
"
To summarize, one good order for the different transformations is:
create the dataset
shuffle (with a big enough buffer size)
repeat
map with the actual work (preprocessing, augmentation…) using multiple parallel calls.
batch
prefetch
"
This should give what you desire because "augmentation" is after "repeat".
Hope it helps.

Keras create your own generator

I am building my own generator to use with fit_generator() and predict_generator() functions from keras library. My generator works but I wondering if it has been build correctly. Especially for validation and test sets.
For these two sets, I disable the data augmentation processing since it is only used for the training phase but I am still using randomness to select data from my inputs. Thus I would like to know if is it correct to still using randomness selection of data for validation set?
I think It is but I am not totally sure.
def generator(inputs, labels, validation=False):
batch_inputs = np.zeros((batch_size, *input_shape))
batch_labels = np.zeros((batch_size, num_classes))
indexes = list(range(0,len(inputs))
while True:
for i in range(self.batch_size):
# choose random index in inputs
if validation:
index = indexes.pop()
else:
index = random.randint(0, len(inputs) - 1)
batch_inputs[i] = rgb_processing(inputs[index], validation) # data_augmentation processing functions validation=true --> disable data augmentation
batch_labels[i] = to_categorical(labels[index], num_classes=self.num_classes)
yield batch_inputs, batch_labels
train_batches = generator(train.X.values, train.y.values)
validate_batches = generator(validate.X.values, validate.y.values, validation=True)
In the validation, the order of the image should not affect your results. So in theory, there is no problem to give the validation images in a random order. You just want to be sure that all your validation images are used only once so your results are comparable.

Big HDF5 dataset, how to efficienly shuffle after each epoch

I'm currently working with a big image dataset (~60GB) to train a CNN (Keras/Tensorflow) for a simple classification task.
The images are video frames, and thus highly correlated in time, so I shuffled the data already once when generating the huge .hdf5 file...
To feed the data into the CNN without having to load the whole set at once into memory I wrote a simple batch generator (see code below).
Now my question:
Usually it is recommended to shuffle the data after each training epoch right? (for SGD convergence reasons?) But to do so I'd have to load the whole dataset after each epoch and shuffle it, which is exactly what I wanted to avoid using the batch generator...
So: Is it really that important to shuffle the dataset after each epoch and if yes how could I do that as efficiently as possible?
Here is the current code of my batch generator:
def generate_batches_from_hdf5_file(hdf5_file, batch_size, dimensions, num_classes):
"""
Generator that returns batches of images ('xs') and labels ('ys') from a h5 file.
"""
filesize = len(hdf5_file['labels'])
while 1:
# count how many entries we have read
n_entries = 0
# as long as we haven't read all entries from the file: keep reading
while n_entries < (filesize - batch_size):
# start the next batch at index 0
# create numpy arrays of input data (features)
xs = hdf5_file['images'][n_entries: n_entries + batch_size]
xs = np.reshape(xs, dimensions).astype('float32')
# and label info. Contains more than one label in my case, e.g. is_dog, is_cat, fur_color,...
y_values = hdf5_file['labels'][n_entries:n_entries + batch_size]
#ys = keras.utils.to_categorical(y_values, num_classes)
ys = to_categorical(y_values, num_classes)
# we have read one more batch from this file
n_entries += batch_size
yield (xs, ys)
Yeah, shuffling improves performance since running the data in the same order each time may get you stuck in suboptimal areas.
Don't shuffle the entire data. Create a list of indices into the data, and shuffle that instead. Then move sequentially over the index list and use its values to pick data from the data set.

Categories