What are the best practices for training a model on a BIG dataset?

I need to train a model on a dataset that requires more memory than my GPU has. What is the best practice for feeding the dataset to the model?
Here are my steps:
First, I load the dataset using batch_size:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
In the second step, I prepare the data:
for record in raw_train_ds.take(1):
    train_images, train_labels = record['image'], record['label']
    print(train_images.shape)
    train_images = train_images.numpy().astype(np.float32) / 255.0
    train_labels = tf.keras.utils.to_categorical(train_labels)
Then I feed the data to the model:
history = model.fit(train_images,train_labels, epochs=NUM_EPOCHS, validation_split=0.2)
But in step 2 I prepared the data for the first batch only and missed the remaining batches, because model.fit is outside the loop (so, as I understand it, it works on the first batch only).
On the other hand, I can't just remove take(1) and move model.fit inside the loop. In that case I would handle all batches, but model.fit would be called at the end of each iteration, and that would not work properly either.
So how should I change my code so that it works properly with a big dataset using model.fit? Could you point me to an article or any documentation, or just advise how to deal with it? Thanks.
Update
In my post below (Approach 1) I describe one way to solve the problem. Are there any better approaches, or is this the only way to solve it?

You can pass the whole dataset to fit for training. As you can see in the documentation, one of the possible values of the first parameter is:
A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
So you just need to convert your dataset to that format (a tuple with input and target) and pass it to fit:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
raw_train_ds = datasets['train']
train_dataset_fit = raw_train_ds.map(
    lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
One problem with this is that it does not support a validation_split parameter but, as shown in this guide, tfds already gives you the functionality to have the splits of the data. So you would just need to get the test split dataset, transform it as above and pass it as validation_data to fit.

Approach 1
Thanks to @jdehesa, I changed my code:
Load the dataset. In reality, this doesn't load data into memory until the first call to next on the dataset iterator. Even then, I think the iterator loads only a portion of the data (one batch) with a size equal to BATCH_SIZE.
raw_train_ds, raw_validation_ds = builder.as_dataset(split=["train[:90%]", "train[90%:]"], batch_size=BATCH_SIZE)
I collected all the required transformations into one function:
def prepare_data(x):
    train_images, train_labels = x['image'], x['label']
    # TODO: resize image
    train_images = tf.cast(train_images, tf.float32) / 255.0
    # train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=NUM_CLASSES)
    train_labels = tf.one_hot(train_labels, NUM_CLASSES)
    return (train_images, train_labels)
I applied these transformations to each batch of the dataset using tf.data.Dataset.map:
train_dataset_fit = raw_train_ds.map(prepare_data)
Then I fed this dataset into model.fit. As I understand it, model.fit iterates through all batches in the dataset.
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
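To make the lazy, batch-at-a-time behaviour described in this approach concrete, here is a small plain-Python analogy. It is not TensorFlow code and does not reproduce tf.data internals; read_batches and prepare_batch are hypothetical stand-ins for the batched dataset and for prepare_data.

```python
# Plain-Python analogy of a lazy, batched input pipeline (not TensorFlow code).
# All names are hypothetical and only illustrate the idea.

loaded = []  # records which batches were actually materialised


def read_batches(num_samples, batch_size):
    """Yield one batch (a list of sample ids) at a time, like a batched dataset."""
    for start in range(0, num_samples, batch_size):
        batch = list(range(start, min(start + batch_size, num_samples)))
        loaded.append(start // batch_size)
        yield batch


def prepare_batch(batch):
    """Stand-in for prepare_data: a transformation applied per batch."""
    return [x / 255.0 for x in batch]


# Building the pipeline does no work yet: nothing has been read.
pipeline = map(prepare_batch, read_batches(num_samples=100, batch_size=32))
batches_before_iteration = len(loaded)

# A consumer (model.fit in the real code) pulls batches one by one.
first = next(pipeline)
batches_after_first = len(loaded)

# Draining the rest of the pipeline visits the remaining batches.
remaining = sum(1 for _ in pipeline)
```

Only one batch exists in memory at a time here, which is the property that lets the real pipeline handle datasets larger than the available memory.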

Related

Training a network for machine learning purpose, dividing the dataset in portions

I have a big dataset that can't be loaded into RAM because there isn't enough memory.
What I am trying to do is train the model on x portions of the dataset, so that the final model is trained on the whole dataset, as follows:
num_divisione_dataset=4
div_tr = int(int(len(x_tr))/num_divisione_dataset)
div_val = int(2160/num_divisione_dataset)
num_training = int(math.ceil(100/num_divisione_dataset))
for i in range(0,num_divisione_dataset-1):
    model.fit(
        x_tr[div_tr*i:div_tr*(i+1)], y_tr[div_tr*i:div_tr*(i+1)],
        batch_size = 32,
        callbacks=[model_checkpoint_callback],
        validation_data = (x_val, y_val),
        epochs = 25
    )
Is this the right way to train a model?

Setting batch_size = 32 is already a way to train the model in batches of 32 samples. It seems you have two levels of batching: one that you built yourself and another that's provided by TensorFlow.
The problem with your batching is epochs=25. TensorFlow cycles through all of its batches within an epoch, and in the next epoch it loops over them again. But you first train for 25 epochs on your first portion, then 25 epochs on your second portion, and so on.
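The difference in ordering can be sketched in plain Python. This is only an illustration of the two loop nestings, with hypothetical function names and a small epoch count for readability (the question uses 25); no model is trained.

```python
# Hypothetical illustration: which portion of the data the model sees at each pass.
NUM_CHUNKS = 4
EPOCHS = 3  # kept small for readability; the question uses 25


def chunk_then_epochs(num_chunks, epochs):
    """Order produced by calling fit(chunk, epochs=N) once per chunk."""
    return [chunk for chunk in range(num_chunks) for _ in range(epochs)]


def epochs_then_chunk(num_chunks, epochs):
    """Order produced by looping over epochs and visiting every chunk in each one."""
    return [chunk for _ in range(epochs) for chunk in range(num_chunks)]


question_order = chunk_then_epochs(NUM_CHUNKS, EPOCHS)
interleaved_order = epochs_then_chunk(NUM_CHUNKS, EPOCHS)
print(question_order)     # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
print(interleaved_order)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
```

In the first ordering the model finishes all its passes over one portion before it ever sees the next. The second ordering is closer to what a single fit call over the whole dataset does.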
I'm not sure this is best solved in software. It might be easier to just ignore the lack of RAM and let the OS swap to disk. Buying more RAM could be another viable route. But a possible software route would be an input pipeline.

Put your data in a CSV file. Then use make_csv_dataset to load it in batches and pass it to model.fit. Make sure to set num_epochs=1; otherwise the dataset will repeat forever.
Here you can find an example of how to use it.
A minimal example would be:
DATASET_PATH = ...  # path of the CSV file
LABEL_COLUMN = ...  # name of the CSV column representing the output
COLUMNS = ["a", "b", "c", "d"]  # names of the CSV columns representing the input
BATCH_SIZE = int(len(x_tr) / num_divisione_dataset)
def get_dataset(batch_size=5):
    return tf.data.experimental.make_csv_dataset(
        DATASET_PATH, batch_size=batch_size, label_name=LABEL_COLUMN, num_epochs=1)
dataset = get_dataset(batch_size=BATCH_SIZE)
train_size = ...  # put the train_dataset size here (counted in batches, because the dataset is batched)
train_dataset = dataset.take(train_size)
val_dataset = dataset.skip(train_size)
columns = []
for c in COLUMNS:
    cln = tf.feature_column.numeric_column(c, shape=())
    columns.append(cln)
feature_layer = tf.keras.layers.DenseFeatures(columns)
model = Sequential()
model.add(feature_layer)
model.add...  # add your NN layers
model.compile...  # parameters to compile
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    callbacks=[model_checkpoint_callback],
    epochs=25,
)

Loading & splitting same training data but getting different results

So I'm trying to manually split my training data into separate batches so that I can easily access them by index, rather than relying on DataLoader to split them up for me, since that way I wouldn't be able to access individual batches by index. I attempted the following:
train_data = datasets.ANY(root='data', transform=T_train, download=True)
BS = 200
num_batches = len(train_data) // BS
sequence = list(range(len(train_data)))
np.random.shuffle(sequence) # To shuffle the training data
subsets = [Subset(train_data, sequence[i * BS: (i + 1) * BS]) for i in range(num_batches)]
train_loader = [DataLoader(sub, batch_size=BS) for sub in subsets] # Create multiple batches, each with BS number of samples
Which works during training just fine.
However, when I attempted another way to manually split the training data I got different end results, even with all the same parameters and the following settings ensured:
device = torch.device('cuda')
torch.manual_seed(0)
np.random.seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.cuda.empty_cache()
I only split the training data the following way this time:
train_data = list(datasets.ANY(root='data', transform=T_train, download=True)) # Cast into a list
BS = 200
num_batches = len(train_data) // BS
np.random.shuffle(train_data) # To shuffle the training data
train_loader = [DataLoader(train_data[i*BS: (i+1)*BS], batch_size=BS) for i in range(num_batches)]
But this gives me different results than the first approach, even though (I believe) both approaches split the training data into batches identically. I even tried not shuffling at all and loading the data just as it is, but I still got different results (85.2% vs. 81.98% accuracy). I even manually checked that the images loaded in the batches match and are the same with both methods.
The training scheme used in both ways:
for e in trange(epochs):
    for loader in train_loader:
        for x, y in loader:
            x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
            loss = F.cross_entropy(m1(x), y)
            loss.backward()
            optim.step()
            scheduler.step()
            optim.zero_grad()
Can somebody please explain to me why these differences arise (and if there's a better way)?
UPDATE:
T_train contains some random transformations (H_flip, crop). When using it with the first train_loader, training took 24.79s/it, while the second train_loader took 10.88s/it (even though both perform exactly the same number of parameter updates/steps). So I decided to remove the random transformations from T_train; the first train_loader then took 16.99s/it, while the second took 10.87s/it. So somehow the second train_loader took the same time with or without the random transformations. I therefore visualized the image outputs from the second train_loader to make sure that the transformations were applied, and indeed they were! So this is really confusing, and I'm not quite sure why they give different results.

ValueError: `validation_split` is only supported for Tensors or NumPy arrays, found: (keras.preprocessing.sequence.TimeseriesGenerator object)

When I tried to add validation_split to my LSTM model, I got this error:
ValueError: `validation_split` is only supported for Tensors or NumPy arrays, found: (<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator object)
This is the code:
from keras.preprocessing.sequence import TimeseriesGenerator
train_generator = TimeseriesGenerator(df_scaled, df_scaled, length=n_timestamp, batch_size=1)
model.fit(train_generator, epochs=50,verbose=2,callbacks=[tensorboard_callback], validation_split=0.1)
----------
ValueError: `validation_split` is only supported for Tensors or NumPy arrays, found: (<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator object)
One reason I can think of is that validation_split expects a tensor or NumPy array, as the error says, whereas passing the training data through TimeseriesGenerator turns it into a 3D array.
And since TimeseriesGenerator is mandatory when using an LSTM, does this mean we can't use validation_split with an LSTM?

Your first intuition is right: you can't use validation_split when using a dataset generator.
You have to understand how a dataset generator works. The model.fit API does not know how many records or batches your dataset has in its first epoch, because the data is generated or supplied to the model one batch at a time. So there is no way for the API to know how many records there are up front and carve a validation set out of them. For this reason you cannot use validation_split with a dataset generator. This is stated in the documentation:
Float between 0 and 1. Fraction of the training data to be used as
validation data. The model will set apart this fraction of the
training data, will not train on it, and will evaluate the loss and
any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling. This argument is not supported when x is a
dataset, generator or keras.utils.Sequence instance.
Note the last two lines, which say that it is not supported for datasets, generators, or keras.utils.Sequence instances.
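The reasoning can be sketched in plain Python. This is an analogy for a plain generator only, with hypothetical helper names, and it is not how Keras is implemented: holding out the last fraction of the samples requires knowing their count, which an in-memory sequence provides and a generator does not.

```python
# Plain-Python sketch (hypothetical names): a fractional split needs a known length.

def split_last_fraction(samples, fraction):
    """Hold out the last `fraction` of an in-memory sequence for validation."""
    cut = len(samples) - int(len(samples) * fraction)
    return samples[:cut], samples[cut:]


def batch_generator():
    """Stand-in for a generator that produces batches on demand."""
    for i in range(10):
        yield [i]


# Works for an in-memory sequence: the last 20% become validation data.
train, val = split_last_fraction(list(range(10)), fraction=0.2)

# A generator has no length, so the same split cannot be computed up front.
try:
    len(batch_generator())
    generator_has_length = True
except TypeError:
    generator_has_length = False
```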
What you can do instead is split the dataset with the following code. You can read about it in detail here. I am only reproducing the important part from the link.
# Splitting the dataset for training and testing.
def is_test(x, _):
    return x % 4 == 0
def is_train(x, y):
    return not is_test(x, y)
recover = lambda x, y: y
# Split the dataset for testing/validation.
test_dataset = dataset.enumerate() \
    .filter(is_test) \
    .map(recover)
# Split the dataset for training.
train_dataset = dataset.enumerate() \
    .filter(is_train) \
    .map(recover)
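To see which elements end up where, the same index rule can be applied to an ordinary Python list. This is a plain-Python analogy of the enumerate/filter/map chain, not tf.data code, and the sample values are made up.

```python
# Plain-Python analogy (not tf.data) of the enumerate/filter/map split.
samples = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]


def is_test(index, _):
    """Every fourth element (index 0, 4, 8, ...) goes to the test set."""
    return index % 4 == 0


def is_train(index, sample):
    """Everything else goes to the training set."""
    return not is_test(index, sample)


test_samples = [s for i, s in enumerate(samples) if is_test(i, s)]
train_samples = [s for i, s in enumerate(samples) if is_train(i, s)]
print(test_samples)   # ['a', 'e', 'i']
print(train_samples)  # ['b', 'c', 'd', 'f', 'g', 'h']
```

Roughly one quarter of the elements are held out. If the dataset is already batched, the elements being enumerated are batches, so the split happens per batch rather than per sample.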
I hope my answer helps you.

y = np.array(y)
This fixed it for me.
The error says it only supports numpy arrays, so turn it into an array.

Can I use "model.fit()" in "for" loop to change train data in each iteration

I have a large dataset that doesn't fit in memory, so the SSD is used during training and epochs take too long.
I saved my dataset as 9 parts in .npz files. I chose the first part (part 0) as the validation set and did not use it in training.
I use the code below, and the acc and val_acc results were fine. But I feel I am making a big mistake somewhere, as I haven't seen any example like this.
for part in range(1, 9):
    X_Train, Y_Train = loadPart(part)
    history = model.fit(X_Train, Y_Train, batch_size=128, epochs=1, verbose=1)
I also load part 0 as test data:
val_loss, val_acc = model.evaluate(X_Test, Y_Test)
I checked val_acc after training on each part of the dataset and observed that val_acc was increasing.
Could you please let me know whether this usage is valid, and why?
EDIT:
I tried fit_generator, but it still uses the disk during training and the ETA was about 2,500 hours (with model.fit on the whole dataset it was about 30 minutes per epoch). I use the code below:
model.fit_generator(generate_batches(), steps_per_epoch=196000, epochs=10)
def generate_batches():
    for part in range(1, 9):
        x, y = loadPart(part)
        yield (x, y)
def loadPart(part):
    data = np.load('C:/FOLDER_PATH/' + str(part) + '.npz')
    return [data['x'], data['y']]
and the X data shape is (196000,1536,1)
EDIT 2:
I found an answer on GitHub (https://github.com/keras-team/keras/issues/4446). It says it is OK to call model.fit() multiple times in a for loop, but I'm still not sure what happens behind the scenes. What is the difference between calling model.fit() multiple times and calling it once with the whole dataset?

If your dataset does not fit in RAM, the Keras documentation suggests the following (https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory):
You can do batch training using model.train_on_batch(x, y) and model.test_on_batch(x, y). See the models documentation.
Alternatively, you can write a generator that yields batches of training data and use the method model.fit_generator(data_generator, steps_per_epoch, epochs).
This means you could try to further split your training data into batches of 128 on your SSD and then do something like:
import glob
import numpy as np
def generate_batches(data_folder):
    while True:
        batches_paths = glob.glob("%s/*.npz" % data_folder)
        for batch_path in batches_paths:
            with np.load(batch_path) as batch:
                x, y = preprocess_batch(batch)
                yield (x, y)
model.fit_generator(generate_batches("/your-data-folder"), steps_per_epoch=10000, epochs=10)
The preprocess_batch function would be responsible for extracting your x and y from each .npz file and the steps_per_epoch argument in the fit_generator function should be the rounded up value of your number of data samples divided by your batch size.
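The relationship between the endless generator and steps_per_epoch can be checked with a small plain-Python sketch. It assumes one pre-split batch per file and uses in-memory lists as hypothetical stand-ins for the .npz files; no Keras call is involved.

```python
import itertools
import math

BATCH_SIZE = 4
NUM_SAMPLES = 10

# Hypothetical stand-in for the .npz files: one pre-split batch per "file".
samples = list(range(NUM_SAMPLES))
batch_files = [samples[i:i + BATCH_SIZE] for i in range(0, NUM_SAMPLES, BATCH_SIZE)]


def generate_batches(files):
    """Yield batches forever; the consumer decides when an epoch ends."""
    while True:
        for batch in files:
            yield batch


# Number of samples divided by the batch size, rounded up.
steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)

# Taking exactly steps_per_epoch batches covers every sample once.
one_epoch = list(itertools.islice(generate_batches(batch_files), steps_per_epoch))
seen = [x for batch in one_epoch for x in batch]
```

If steps_per_epoch were larger than this value, one "epoch" would wrap around and revisit samples. If it were smaller, some samples would be pushed into the next epoch.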
More info:
https://keras.io/models/sequential/#fit_generator
https://www.pyimagesearch.com/2018/12/24/how-to-use-keras-fit-and-fit_generator-a-hands-on-tutorial/

You can also use Dask, which by default chunks the data into smaller sections when the dataset doesn't fit into RAM.

If you are training as described in your question and the training happens in one session, then there is no difference. But if you are training in multiple sessions and continuing from previous training, you should save your model either after every epoch (i.e., after training on all 9 sets) or, in your case, after every set of the dataset (i.e., after each of the 9 sets), and in every session load the weights using model.load_weights("path to model") before you continue training.
You can save the model after every epoch using model.save("path to directory").

How does a tf.data input pipeline with data augmentation every epoch work?

I used this code (https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/segmentation_blogpost/image_segmentation.ipynb#scrollTo=tkNqQaR2HQbd) for my TensorFlow data pipeline, but I don't understand how it works. They say that "During training time, our model would never see twice the exact same picture". But how does this work? I only apply the tf.data map function with the _augment function once. Does this happen at every step of my model.fit call?
I tried to verify my _augment function by printing something out, but the output only appears the first time and not every epoch.
def get_baseline_dataset(filenames,
                         labels,
                         preproc_fn=functools.partial(_augment),
                         threads=5,
                         batch_size=batch_size,
                         shuffle=True):
    num_x = len(filenames)
    # Create a dataset from the filenames and labels
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    # Map our preprocessing function to every element in our dataset, taking
    # advantage of multithreading
    dataset = dataset.map(_process_pathnames, num_parallel_calls=threads)
    if preproc_fn.keywords is not None and 'resize' not in preproc_fn.keywords:
        assert batch_size == 1, "Batching images must be of the same size"
    dataset = dataset.map(preproc_fn, num_parallel_calls=threads)
    if shuffle:
        dataset = dataset.shuffle(num_x)
    # It's necessary to repeat our data for all epochs
    dataset = dataset.repeat().batch(batch_size)
    return dataset
tr_cfg = {
    'resize': [img_shape[0], img_shape[1]],
    'scale': 1 / 255.,
    'hue_delta': 0.1,
    'horizontal_flip': True,
    'width_shift_range': 0.1,
    'height_shift_range': 0.1
}
tr_preprocessing_fn = functools.partial(_augment, **tr_cfg)
train_ds = get_baseline_dataset(x_train_filenames,
                                y_train_filenames,
                                preproc_fn=tr_preprocessing_fn,
                                batch_size=batch_size)

I am quoting the essential steps from https://cs230-stanford.github.io/tensorflow-input-data
I suggest you skim through the article once for details.
"To summarize, one good order for the different transformations is:
1. create the dataset
2. shuffle (with a big enough buffer size)
3. repeat
4. map with the actual work (preprocessing, augmentation…) using multiple parallel calls.
5. batch
6. prefetch"
This should give you what you want because "augmentation" comes after "repeat".
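The underlying idea, that a mapped transformation is applied lazily each time an element is produced rather than once up front, can be sketched in plain Python. This is an analogy with hypothetical names (LazyMapped, augment), not tf.data code. It also does not model TensorFlow's function tracing, which is probably why a Python print inside a mapped function shows up only once.

```python
import random


class LazyMapped:
    """Re-applies `fn` to every element each time the object is iterated."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __iter__(self):
        return (self.fn(item) for item in self.source)


def augment(x):
    """Stand-in for _augment: adds random jitter in [0, 1) to the input."""
    return x + random.random()


random.seed(0)
images = [10, 20, 30]
dataset = LazyMapped(images, augment)

# Each pass over the dataset re-runs augment, so the values differ per epoch.
epoch_1 = list(dataset)
epoch_2 = list(dataset)
```

Both epochs are derived from the same three source items, yet the augmented values differ between passes. This matches the quoted claim that the model never sees exactly the same picture twice.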
Hope it helps.
