Dataset too big for RAM, how to do efficient epochs - python

I am currently working with a dataset of about 2 million objects. Before I train on them I have to load them from disk and perform some preprocessing (which makes the dataset much bigger, so it would be inefficient to save the post-processed data).
Right now I just load and train in small batches, but if I want to train for multiple epochs on the full dataset, I would have to reload and re-preprocess all the data for every epoch, which ends up taking a lot of time. The alternative is training for multiple epochs on each small batch of data before moving on to the next batch. Could the second method cause any issues (like overfitting)? And is there any other, better way to do this? I'm using tflearn with Python 3, in case there are any built-in methods for this.
tl;dr: Is it okay to train for multiple epochs on subsets of the data before training for a single epoch on all the data?
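One common pattern for this, sketched below with TensorFlow's tf.data API (which the related answers further down also use) rather than anything tflearn-specific, is to stream and preprocess objects on the fly so the post-processed data never has to fit in RAM; object_paths, load_from_disk, preprocess, and the feature shape are hypothetical placeholders:

import numpy as np
import tensorflow as tf

def generate_examples():
    # Hypothetical helpers: load one object from disk, preprocess it, yield it.
    for path in object_paths:
        features, label = preprocess(load_from_disk(path))
        yield features.astype(np.float32), np.int64(label)

dataset = (
    tf.data.Dataset.from_generator(
        generate_examples,
        output_types=(tf.float32, tf.int64),
        output_shapes=((128,), ()),          # assumed feature shape
    )
    .shuffle(10_000)                         # shuffle within a bounded buffer
    .batch(32)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
# Iterating over `dataset` repeatedly gives one full pass (epoch) per iteration,
# re-running the loading and preprocessing each time instead of caching it.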

Related

TensorFlow keras model fit() parameters steps_per_epoch and epochs behavior on train set

I'm using a tf.data dataset containing my training data consisting of (let's say) 100k images.
I'm also using a tf.data dataset containing my validation set.
Since an epoch over all 100k images takes quite a long time (in my case approximately one hour) before I get any feedback on performance on the validation set, I set the steps_per_epoch parameter in tf.keras.Model.fit() to 10000.
With a batch size of 1, this gives 10 validation scores by the time all 100k images have been seen.
In order to complete one pass over the 100k images of my entire training dataset, I set the epochs parameter to 10.
However, I'm not sure if using steps_per_epoch and epochs this way has any other consequences. Is it correct to use these parameters in order to get more frequent feedback on performance?
And also a more specific question, does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
I already dug into the TensorFlow docs and read several different Stack Overflow questions, but I couldn't find anything conclusive to answer my own question. Hope you can help!
The TensorFlow version I'm using is 2.2.0.
Is it correct to use these parameters in order to get more frequent feedback on performance?
Yes, it is correct to use these parameters. Here is the code that I used to fit the model.
model.fit(
    train_data,
    steps_per_epoch = train_samples // batch_size,
    epochs = epochs,
    validation_data = test_data,
    verbose = 1,
    validation_steps = test_samples // batch_size)
does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
It uses all the images in your training data.
For better understanding: an epoch is the number of times the learning algorithm works through the entire training dataset,
whereas steps_per_epoch is the total number of samples in your training dataset divided by the batch size.
For example, if you have 100000 training samples and use a batch size of 100, one epoch is equivalent to 1000 steps_per_epoch.
Note: batch sizes are usually chosen as powers of 2, because optimized matrix-operation libraries work most efficiently with such sizes.
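Applied to the setup in the question above (100k training images, batch size 1, validation feedback every 10k images), a minimal sketch of the fit call could look like this; model, train_ds, and val_ds are assumed to already exist, and repeat() is added so the training iterator never runs out mid-epoch:

batch_size = 1
images_per_epoch = 10_000               # validation feedback every 10k images
total_images = 100_000

model.fit(
    train_ds.repeat(),                                  # keep streaming across "epochs"
    steps_per_epoch = images_per_epoch // batch_size,   # 10,000 steps per "epoch"
    epochs = total_images // images_per_epoch,          # 10 "epochs" = one full pass
    validation_data = val_ds)                           # evaluated after each 10k-step "epoch"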

How does shuffle and batch work in tf.data.dataset?

I'm working on a large dataset with around 10 million data points, so I've decided to use the tf.data.Dataset API for fetching the data.
train_dataset = tf.data.Dataset.from_tensor_slices((data))
train = train_dataset.shuffle(100000).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
I have a few doubts that aren't clear from the TensorFlow docs. I hope someone can address them.
How does shuffle work in my case? Because I have 10 million data points, should I shuffle all 10 million, or is a buffer of 100k enough? Does choosing a large shuffle buffer have any performance impact?
Are batches drawn only from the shuffled dataset, or from the original dataset?
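For intuition, a toy sketch (not from the original post): shuffle() keeps a buffer of buffer_size elements, emits one element chosen at random from that buffer and refills it from the source, and batch() then groups whatever shuffle() emits, so batches are always drawn from the shuffled stream:

import tensorflow as tf

# Toy pipeline: shuffle() maintains a buffer of 4 elements and emits one at
# random from it; batch() then groups whatever shuffle() emits into batches of 3.
toy = tf.data.Dataset.range(10)
pipeline = toy.shuffle(buffer_size=4).batch(3)

for batch in pipeline:
    print(batch.numpy())   # e.g. [2 0 4], [1 6 3], [5 8 7], [9] - order varies per run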

How can I control which samples are read using tfrecords and steps_per_epoch

I am currently transitioning my TF code towards TFRecords and tf datasets. In my application, the trained model usually converges long before it has seen all training samples. I therefore usually set the data generator length myself to the number of batches that I want to fit in one epoch, and ensure in my generator that in the next epoch it picks up right after the last sample from the previous epoch. This ensures that all callbacks work as desired (especially early stopping) while I can still train my models on unseen data in each epoch.
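As a rough sketch of that resume-where-it-left-off behaviour with a plain Python generator (names and data are made up):

import itertools

def cycling_batches(samples, batch_size):
    # Endless stream over the data; the caller decides how many batches make up
    # one "epoch", so each new epoch resumes right after the previous one stopped.
    stream = itertools.cycle(samples)
    while True:
        yield [next(stream) for _ in range(batch_size)]

gen = cycling_batches(range(10), batch_size=3)
epoch_1 = [next(gen) for _ in range(2)]   # [[0, 1, 2], [3, 4, 5]]
epoch_2 = [next(gen) for _ in range(2)]   # [[6, 7, 8], [9, 0, 1]]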
How can I achieve this behaviour with tf datasets and tfrecords? I have read through the dataset definitions on the tensorflow Github but am unsure on whether this will be possible.
I think there are two possible solutions to this if I set steps_per_epoch:
Overwriting the part of the code that specifies from where the next sample is read to just pick up at the sample one after the last one from the previous epoch.
Trying to mimic the behaviour described above with a custom tf dataset implementation. I would be worried that this could have unforeseen impacts on parallelisations and performance.
However, I do not know how to accomplish either. So if you have any insights on this, I would be very grateful.
For now I can use an inelegant work-around in which I always train for one epoch and then initialise a new dataset with new tfrecord files, but I hope there is a better way, especially with regards to callbacks.
I am not sure I fully understand what you are trying to achieve. You want that:
During an epoch, your model does NOT see the whole dataset
The following epochs do not use samples from the previous ones
That's it?
From my point of view, the steps_per_epoch argument is your best bet. If you have a Dataset with, for example, 100 items (samples or batches) and you set steps_per_epoch=20, then your model will see items 0 to 19 during the first epoch, items 20 to 39 during the second epoch, and so on. No need to overwrite any part of the code.
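As a standalone sketch of this behaviour (tiny random data and a throwaway model, purely illustrative): with 100 batches and steps_per_epoch=20, the five "epochs" below together make exactly one pass over the data, because Keras keeps reading from the same iterator across epochs.

import tensorflow as tf

# 100 batches of size 1; steps_per_epoch=20 with epochs=5 consumes batches
# 0-19, 20-39, ..., 80-99 in consecutive "epochs".
ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((100, 4)), tf.zeros((100,), dtype=tf.int32))
).batch(1)

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(2, activation="softmax", input_shape=(4,))])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

model.fit(ds, steps_per_epoch=20, epochs=5)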
Trying to mimic the Dataset behavior is probably not a good idea (too many things to take care of, and a lot of hard work involved).
From your last paragraph, I understand that you want each epoch to be fed with data from specific TFRecord files. Maybe you can look at tf.data.Dataset.flat_map. Build a list of your TFRecord files (the same file can appear multiple times) and "flat_map" TFRecordDataset over it:
files = tf.data.Dataset.from_tensor_slices([
    "file1.tfrecord", "file2.tfrecord",
    "file1.tfrecord", "file3.tfrecord",
])
dataset = files.flat_map(tf.data.TFRecordDataset)
Iterating over dataset will give you Examples from file1, then from file2, then from file1 again and then from file3.
Hope this can help.

Will excessive steps during training mess up the training process in Machine Learning?

I have a dataset of 3372149 rows, and I batch them every 3751 rows, as shown in the code below:
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_features_numpy},
    y=train_labels_numpy,
    batch_size = 3751,
    num_epochs = 1,
    shuffle = False)
# Train
nn.train(input_fn=train_input_fn)  # , steps=10000)
If I set num_epochs=1 as in the code above, the training process goes through the dataset once, right? That makes the total number of steps 3372149 / 3751 = 899.
If I uncomment the steps=10000 part and set num_epochs=None, training is forced to run all the way to step 10000.
I have two questions then:
Since I only have 899 steps' worth of valid data but set steps to 10000, what is TensorFlow training on after step 899? Does it just go back to the top and repeat the training?
If I train for more than 899 steps, is that going to mess up the model that relates the features and labels? Or is it just redundant, since the training loop goes over the same dataset again and again?
I asked about the loss not decreasing during training in my other posts, and I am now wondering whether I have too little data to train on and thus all the excessive steps are useless.
Iterating over a dataset many times is quite common and normal. Each "step" of your model (that is, each batch) takes one gradient update step. In intuitive terms, it has taken one step towards the goal in the direction dictated by that mini-batch. It does NOT learn everything it can about a particular sample by seeing it once; it just takes a step closer to the goal, and the size of that step is dictated by the learning rate (and other, more complex factors). If you cut your learning rate in half, you'd need twice as many steps to get there. Notice how that has nothing to do with epochs, just "update steps" (aka batches).
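As a generic illustration of that last point, here is a toy gradient-descent loop (plain Python, unrelated to the estimator code above): halving the learning rate roughly doubles the number of update steps needed to get equally close to the minimum.

# Toy loss f(w) = (w - 3)^2, minimised from w = 0; its gradient is 2 * (w - 3).
def steps_to_converge(lr, tol=1e-3):
    w, steps = 0.0, 0
    while abs(w - 3.0) > tol:
        grad = 2.0 * (w - 3.0)   # "gradient for this batch"
        w -= lr * grad           # one update step
        steps += 1
    return steps

print(steps_to_converge(0.10))   # 36 steps
print(steps_to_converge(0.05))   # 76 steps, roughly twice as many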
The typical way of knowing when it's time to stop is to plot test-data accuracy over time as you train your model. It is certainly possible that your model will begin to overfit at some point. If it does, test accuracy will start to get worse; that is an obvious optimal stopping point.
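If you train with tf.keras rather than the estimator API, the built-in EarlyStopping callback automates that idea; a sketch, assuming model, train_ds, and val_ds already exist:

import tensorflow as tf

# Stop once the monitored validation metric has not improved for 5 epochs,
# and roll the weights back to the best epoch seen.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)
model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])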
Also note that, when shuffling is enabled, batches of data are not sequential; each batch is randomly selected by permuting the data. The next time through the dataset ends up with different batches, and each of those batches produces a different gradient update, so even going through the dataset twice will not produce the same set of updates in each epoch. (The code in the question sets shuffle=False, in which case the batches repeat in the same order on every pass.)
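With tf.data, this per-epoch reshuffling is the default behaviour of shuffle(); a standalone toy sketch:

import tensorflow as tf

ds = tf.data.Dataset.range(6)
# reshuffle_each_iteration=True (the default) re-permutes the data on every
# pass, so each epoch yields different batches and hence different gradient updates.
shuffled = ds.shuffle(buffer_size=6, reshuffle_each_iteration=True).batch(2)

for epoch in range(2):
    print([batch.numpy().tolist() for batch in shuffled])   # different order each pass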
I don't actually know the answer to question #1 because I don't use the estimator API much, but I'm 90% sure it simply permutes the samples and iterates through them again after each epoch. That's the most common approach.

Train TensorFlow model with larger-than-memory dataset (Python 2.7)

I have a dataset in CSV format that's way larger than memory (a few hundred gigabytes) that I have to use as a training set for a TensorFlow model.
Using small example datasets is no problem: just load everything in memory and go; but what is the best strategy to handle this problem?
I'm guessing the only way is to process the file in chunks; the problem is that the cost should be calculated on the whole training set. As a solution I figured I could do a few epochs on the largest chunk possible (calculating the cost only on the data in that chunk), then do the next few epochs on the next chunk, and so on (maybe making the model "see" each chunk more than once).
Is this the only solution, and is it reasonable? Or is there any other, better approach?
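For reference: in later TensorFlow versions, the tf.data API (so not the Python 2.7 / older TF setup in the question) can stream a CSV from disk in batches without ever loading it fully into memory; a minimal sketch with a hypothetical file name and label column:

import tensorflow as tf

# Streams and decodes the CSV in batches; only shuffle_buffer_size rows are
# held in memory at a time. "train.csv" and "target" are placeholders.
dataset = tf.data.experimental.make_csv_dataset(
    "train.csv",
    batch_size=256,
    label_name="target",
    num_epochs=1,
    shuffle=True,
    shuffle_buffer_size=100_000,
).prefetch(tf.data.experimental.AUTOTUNE)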
