Validation set with TensorFlow Dataset - python

From Train and evaluate with Keras:
The argument validation_split (generating a holdout set from the training data) is not supported when training from Dataset objects, since this feature requires the ability to index the samples of the datasets, which is not possible in general with the Dataset API.
Is there a workaround? How can I still use a validation set with TF datasets?

No, you can't use validation_split (as the documentation clearly states), but you can pass validation_data instead and build the validation Dataset "manually".
You can see an example in the same tensorflow tutorial:
# Prepare the training dataset
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
# Prepare the validation dataset
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
val_dataset = val_dataset.batch(64)
model.fit(train_dataset, epochs=3, validation_data=val_dataset)
You could create those two datasets from numpy arrays ((x_train, y_train) and (x_val, y_val)) using simple slicing as shown there:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]
There are also other ways to create tf.data.Dataset objects, see tf.data.Dataset documentation and related tutorials/notebooks.
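If the data only exists as a tf.data.Dataset (so there are no NumPy arrays to slice), one possible workaround is to carve out a hold-out set with take() and skip(). The sketch below uses a toy dataset and an arbitrary val_size, both assumptions for illustration. It shuffles before splitting with reshuffle_each_iteration=False so that the two subsets see the same permutation and stay disjoint.

```python
import tensorflow as tf

# Toy dataset standing in for a real pipeline: 100 (feature, label) pairs.
full_ds = tf.data.Dataset.range(100).map(
    lambda i: (tf.cast(i, tf.float32), i % 2)
)

val_size = 20  # assumed hold-out size, chosen only for illustration

# Shuffle once with a fixed order; without reshuffle_each_iteration=False
# the take() and skip() subsets could overlap between iterations.
full_ds = full_ds.shuffle(100, seed=0, reshuffle_each_iteration=False)

val_ds = full_ds.take(val_size).batch(10)    # first 20 shuffled elements
train_ds = full_ds.skip(val_size).batch(10)  # remaining 80 elements
```

The two datasets can then be passed to model.fit(train_ds, validation_data=val_ds) in the same way as in the tutorial snippet.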

Related

How to split dataset into K-fold without loading the whole dataset at once?

I can't load all of my dataset at once, so I used tf.keras.preprocessing.image_dataset_from_directory() in order to load batches of images during training. It works well if I want to split my dataset into 2 subsets (train and validation), however, I'd like to divide my dataset into K-folds in order to make cross validation. (5 folds would be nice)
How can I make K-folds without loading my whole dataset ?
Do I have to give up using tf.keras.preprocessing.image_dataset_from_directory() ?
Personally I recommend that you switch to tf.data.Dataset().
Not only is it more efficient but it gives you more flexibility in terms of what you can implement.
Say you have images(image_paths) and labels as an example.
In that way, you could create a pipeline like:
import numpy as np
from sklearn.model_selection import KFold
# images (file paths) and labels must be NumPy arrays so they can be indexed with the fold indices
training_data = []
validation_data = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, val_index in kf.split(images, labels):
    X_train, X_val = images[train_index], images[val_index]
    y_train, y_val = labels[train_index], labels[val_index]
    training_data.append([X_train, y_train])
    validation_data.append([X_val, y_val])
Then you could create something like:
# mapping_function is a user-defined function that loads and preprocesses one image from its path
for index, _ in enumerate(training_data):
    x_train, y_train = training_data[index][0], training_data[index][1]
    x_valid, y_valid = validation_data[index][0], validation_data[index][1]
    train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    train_dataset = train_dataset.map(mapping_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    train_dataset = train_dataset.batch(batch_size)
    train_dataset = train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    validation_dataset = tf.data.Dataset.from_tensor_slices((x_valid, y_valid))
    validation_dataset = validation_dataset.map(mapping_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    validation_dataset = validation_dataset.batch(batch_size)
    validation_dataset = validation_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    # Build a fresh model for each fold so that weights do not carry over between folds
    model.fit(train_dataset,
              validation_data=validation_dataset,
              epochs=epochs,
              verbose=2)
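The fold-building step above can also be written as a generator, so that each fold is produced only when it is needed. This is a minimal sketch, assuming the images are referenced by file path; image_paths, labels, and the helper iter_folds are hypothetical names introduced for illustration. Only the small arrays of paths and labels are held in memory, never the images themselves.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical file paths and labels; only these small arrays live in memory.
image_paths = np.array([f"img_{i}.png" for i in range(10)])
labels = np.array([i % 2 for i in range(10)])


def iter_folds(paths, labels, n_splits=5, seed=42):
    """Yield (train_paths, train_labels, val_paths, val_labels) one fold at a time."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(paths):
        yield paths[train_idx], labels[train_idx], paths[val_idx], labels[val_idx]


for fold, (p_tr, y_tr, p_val, y_val) in enumerate(iter_folds(image_paths, labels)):
    print(f"fold {fold}: {len(p_tr)} train paths, {len(p_val)} validation paths")
```

Each yielded tuple can be fed to tf.data.Dataset.from_tensor_slices in the same way as the lists in the loop above.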

Why does shuffling sequences of data in tf.keras.dataset affect the order of sequences differently between tf.fit and tf.predict?

I am training an LSTM deep learning model with time series sequences and labels.
I generate a tensorflow dataset "train_data" and "test_data"
train_data = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=data,
    targets=None,
    sequence_length=total_window_size,
    sequence_stride=1,
    batch_size=batch_size,
    shuffle=is_shuffle).map(split_window).prefetch(tf.data.AUTOTUNE)
I then train the model with the above datasets
model.fit(train_data, epochs=epochs, validation_data = test_data, callbacks=callbacks)
And then run predictions to obtain the predicted values
train_labels = np.concatenate([y for x, y in train_data], axis=0)
train_predictions = model.predict(train_data)
test_labels = np.concatenate([y for x, y in test_data], axis=0)
test_predictions = model.predict(test_data)
Here is my question: When I plot the train/test label data against the predicted values I get the following plot when I do not shuffle the sequences in the dataset building step:
Here the output with shuffling:
Question: Why is this the case? I use the exact same source dataset for training and prediction. The dataset should be shuffled. Is there a chance that TensorFlow shuffles the data twice randomly, once during training and another time for predictions? I tried to supply a shuffle seed but that did not change things either.
The dataset gets shuffled every time you iterate through it, so the order you get from your list comprehension isn't the same as the order predict sees. If you don't want that, pass:
shuffle(buffer_size=BUFFER_SIZE, reshuffle_each_iteration=False)
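The effect of that flag can be checked on a toy dataset. This is a small sketch with an arbitrary range dataset, buffer size, and seed (all assumptions for illustration). With reshuffle_each_iteration=False, two passes over the same dataset object should return the elements in the same shuffled order.

```python
import tensorflow as tf

# Toy dataset; buffer_size covers all elements so the shuffle is complete.
ds = tf.data.Dataset.range(10).shuffle(
    buffer_size=10, seed=1, reshuffle_each_iteration=False
)

# Iterate twice, as happens when labels are collected and then predict is called.
first_pass = [int(v) for v in ds.as_numpy_iterator()]
second_pass = [int(v) for v in ds.as_numpy_iterator()]

print(first_pass == second_pass)
```

Another option is to gather labels and predictions inside the same loop over the dataset, which keeps them aligned regardless of how the dataset is shuffled.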

How to load images and labels separately in a dataset loaded by tensorflow_datasets

import tensorflow_datasets as tfds
train_ds = tfds.load('cifar100', split='train[:90%]').shuffle(1024).batch(32)
val_ds = tfds.load('cifar100', split='train[-10%:]').shuffle(1024).batch(32)
I want to convert train_ds and val_ds into something like this: x_train, y_train and x_val, y_val (x for images and y for labels).
The Keras API uses train and test data split (this seems to be the case in sklearn too), but I do not want to use any test data at all here.
I have tried this, but it didn't work (and I do understand why this doesn't work, but I don't know how else can I convert my training data to images and labels):
x_train = train_ds['image']
# TypeError: 'BatchDataset' object is not subscriptable
This is not the best way; I first created lists so I could inspect them. I think you want something like:
train_ds = tfds.load('mnist', split='train[:90%]')
train_examples_labels = tfds.as_numpy(train_ds)
x_train = []
y_train = []
for features_labels in train_examples_labels:
    x_train.append(features_labels['image'])
    y_train.append(features_labels['label'])
features_labels is a dictionary here:
features_labels.keys()
dict_keys(['image', 'label'])
Afterwards, you can convert them into NumPy arrays.
x_train = np.array(x_train, dtype = 'float32')
y_train = np.array(y_train, dtype = 'float32')
I found a better solution:
train_ds, val_ds = tfds.load(name="cifar100", split=('train[:90%]', 'train[-10%:]'), batch_size=-1, as_supervised=True)
# With batch_size=-1 and as_supervised=True, each split is a single (images, labels) tuple
x_train, y_train = tfds.as_numpy(train_ds)
x_val, y_val = tfds.as_numpy(val_ds)

Add validation data on pytorch

Below I split my data into train and test sets and then load them into a TensorDataset. What is a straightforward way to add a validation split?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
train_data = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_data = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test))
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
There isn't a more straightforward way than the one you are using right now, at least not without another framework that sits on top of PyTorch (such as fastai). The alternative is to manually compute the cut points, shuffle your data, and make the corresponding splits, but that simple procedure does not handle stratification.
Using your approach, you can further split the train set into 2 (train and validation)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.2, random_state = 0, stratify = y_train)
One thing to note here is that you will need to adjust test_size according to your needs: using 0.2 twice in a row gives a 64/16/20 split rather than 60/20/20, because the second 0.2 is taken from the remaining 80%.
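To illustrate the arithmetic: a 60/20/20 split needs test_size=0.25 in the second call, since 0.25 of the remaining 80% is 20% of the full data. The sketch below uses a synthetic X and y of 100 samples, which are assumptions made only so that the sizes are easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, two balanced classes (illustration only).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First split: hold out 20% of all samples as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Second split: 0.25 of the remaining 80% equals 20% of the original data.
# Note that stratify must use the labels of the array being split (y_train).
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0, stratify=y_train
)

print(len(X_train), len(X_valid), len(X_test))
```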
Once you have X_train, X_valid, and X_test splits, you can simply create the 3rd DataLoader for your validation set.
valid_data = TensorDataset(torch.from_numpy(X_valid), torch.from_numpy(y_valid))
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
Just a few notes:
- Unlike the training set, the validation and test sets don't need to be shuffled, so you can set shuffle=False in their DataLoaders.
- You don't need to use the same batch size in each DataLoader. Larger batch sizes for validation are preferable if your hardware can handle them: evaluation will be faster without side effects, since no gradient descent is performed on this set.
- Unless you know that you need a DataLoader for your test set, you can simply run inference on the whole test set at once. The test set is not part of the training loop, and inference is usually performed on the CPU (unless you know you need a GPU, e.g. for real-time inference, or the test set is so large that it does not fit into RAM, but this isn't usually the case).
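The notes above can be sketched as a small evaluation helper. Everything here is an assumption for illustration: the random tensors, the nn.Linear stand-in model, the batch size of 32, and the helper name evaluate. The point is only the combination of shuffle=False, model.eval(), and torch.no_grad() for the validation pass.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data and model, used only to make the sketch runnable.
X_valid = torch.randn(50, 4)
y_valid = torch.randint(0, 2, (50,))
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()

# No shuffling for validation; the batch size can differ from the training one.
valid_loader = DataLoader(TensorDataset(X_valid, y_valid), shuffle=False, batch_size=32)


def evaluate(model, loader, criterion):
    """Return the average loss over all samples in the loader."""
    model.eval()  # switch layers such as dropout to inference behaviour
    total_loss, n_samples = 0.0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for xb, yb in loader:
            loss = criterion(model(xb), yb)
            total_loss += loss.item() * len(xb)
            n_samples += len(xb)
    return total_loss / n_samples


val_loss = evaluate(model, valid_loader, criterion)
```

When training continues after validation, model.train() should be called again before the next training epoch.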

How to add data via directories for training images

I have been going through the Git repository by flyyufelix, "https://github.com/flyyufelix/cnn_finetune", to fine-tune an Inception v3 network. I want to train the network to detect a disease, so I have 2 sets of images: one with the disease and one without.
The repository says X_train, Y_train, X_valid, Y_valid = load_data(), where the author loads the CIFAR dataset, and asks us to create our own load_data() function. The author's code is below:
import cv2
import numpy as np
from keras.datasets import cifar10
from keras import backend as K
from keras.utils import np_utils
nb_train_samples = 3000 # 3000 training samples
nb_valid_samples = 100 # 100 validation samples
num_classes = 10
def load_cifar10_data(img_rows, img_cols):
    # Load cifar10 training and validation sets
    (X_train, Y_train), (X_valid, Y_valid) = cifar10.load_data()
    # Resize training images
    if K.image_dim_ordering() == 'th':
        X_train = np.array([cv2.resize(img.transpose(1,2,0), (img_rows,img_cols)).transpose(2,0,1) for img in X_train[:nb_train_samples,:,:,:]])
        X_valid = np.array([cv2.resize(img.transpose(1,2,0), (img_rows,img_cols)).transpose(2,0,1) for img in X_valid[:nb_valid_samples,:,:,:]])
    else:
        X_train = np.array([cv2.resize(img, (img_rows,img_cols)) for img in X_train[:nb_train_samples,:,:,:]])
        X_valid = np.array([cv2.resize(img, (img_rows,img_cols)) for img in X_valid[:nb_valid_samples,:,:,:]])
    # Transform targets to keras compatible format
    Y_train = np_utils.to_categorical(Y_train[:nb_train_samples], num_classes)
    Y_valid = np_utils.to_categorical(Y_valid[:nb_valid_samples], num_classes)
    return X_train, Y_train, X_valid, Y_valid
How can I write a function that loads the data as X_train, Y_train, X_valid, Y_valid = load_data() when my images are in directories on my PC?
Use Keras' ImageDataGenerator() class and call flow_from_directory() on it. The labels will be automatically inferred from the directory names. So if you have a directory titled "disease," then Keras would infer that all images within that directory are labeled as "disease," and the same thing would be true for another directory titled "no disease," for example.
I demonstrate how to prepare image data for training a CNN in Keras in this video. The first half of the video is about image organization on disk, and then the second half goes through the process described above.
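To make the directory-to-label idea concrete, here is a rough sketch of how labels can be inferred from subdirectory names using only the standard library. It is not Keras's implementation: the helper name list_images_with_labels and the set of accepted file suffixes are assumptions for illustration. Class indices follow the sorted order of the directory names.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the files actually on disk.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def list_images_with_labels(root):
    """Return (paths, labels, class_names) for a root/<class_name>/<image> layout."""
    root = Path(root)
    # One class per subdirectory, indexed in sorted order of the names.
    class_names = sorted(d.name for d in root.iterdir() if d.is_dir())
    class_index = {name: i for i, name in enumerate(class_names)}

    paths, labels = [], []
    for name in class_names:
        for f in sorted((root / name).iterdir()):
            if f.suffix.lower() in IMAGE_SUFFIXES:  # skip non-image files
                paths.append(str(f))
                labels.append(class_index[name])
    return paths, labels, class_names
```

Called on a folder containing "disease" and "no_disease" subdirectories, this would label the images 0 and 1 respectively; the resulting paths can then be loaded and resized to build the arrays that load_data() is expected to return.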
Follow this tutorial once and all your doubts will get cleared.
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
