How do shuffle and batch work in tf.data.Dataset? - python

I'm working on a large dataset with around 10 million datapoints, so I've decided to use the tf.data.Dataset API for fetching the data.
train_dataset = tf.data.Dataset.from_tensor_slices((data))
train = train_dataset.shuffle(100000).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
I have a few doubts that aren't clear from the TensorFlow docs. I hope someone can address them.
How does the shuffle work in my case? Since I have 10 million datapoints, should I shuffle all 10 million, or is a buffer of 100k enough? Is there any performance impact to choosing a large shuffle buffer?
Are batches drawn only from the shuffled dataset, or from the original dataset?
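For reference, here is a minimal runnable sketch (using a tiny made-up dataset of 10 elements and an arbitrary buffer size) of how the documented buffer-based shuffle interacts with batch: shuffle() keeps buffer_size elements in memory and samples each output element from that buffer, and batch() then groups the already-shuffled stream, so batches are drawn from the shuffled data rather than from the original order.
import tensorflow as tf

# Toy stand-in for the real 10-million-point data.
data = tf.range(10)

dataset = (
    tf.data.Dataset.from_tensor_slices(data)
    .shuffle(buffer_size=5)   # keeps 5 elements in memory and samples from them
    .batch(3)                 # groups the already-shuffled stream into batches of 3
    .prefetch(tf.data.experimental.AUTOTUNE)
)

for batch in dataset:
    print(batch.numpy())      # e.g. [4 0 2], then [5 1 7], ... - order changes per run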

Related

Exclude percentage of Tensorflow dataset samples randomly

I have a Tensorflow dataset generated from a set of .tfrecord files, which I then process using a .map() call:
dataset = tf.data.TFRecordDataset(
    filepaths,
    compression_type="GZIP",
    num_parallel_reads=math.ceil(os.cpu_count() * 2)
)
dataset = dataset.map(transform_func, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.unbatch()
However, the dataset in question is very large, so I would like to randomly exclude a percentage of it (after preprocessing), for example:
dataset = dataset.exclude_percentage(0.5)
This hypothetical .exclude_percentage(percent) function call would randomly drop 50% of the dataset.
Alternatives I have considered:
steps_per_epoch: My initial dataset is already very large before the unbatch call (a complicated story, but in short a necessary step), so with steps_per_epoch the model might never see all of the data.
.take(), .skip(): These are for extracting only a specific contiguous run from a dataset as far as I can tell, which is not what I want to do.
atrous/dilated convolution: In short, my .map() call convolves over some 2d data, which is then .unbatch()ed. Doing this would mean that there are gaps which the model never sees in training or validation.
.shuffle().take(): .shuffle() maintains an internal buffer, so it is extremely unlikely that a significant percentage of my dataset (before processing, let alone after) would fit in memory to make this viable.
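There is no built-in exclude_percentage in tf.data; one way the behaviour described above is commonly approximated is a .filter() with a random predicate. A minimal sketch, applied to the dataset built above (the keep_fraction and keep_randomly names are made up):
import tensorflow as tf

keep_fraction = 0.5  # matches the hypothetical exclude_percentage(0.5) above

def keep_randomly(*example):
    # Independent uniform draw per element; keeps roughly keep_fraction of the records.
    return tf.random.uniform([]) < keep_fraction

dataset = dataset.filter(keep_randomly)
Note that this drops a different random ~50% on each pass over the dataset, which may or may not be acceptable here.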

TensorFlow keras model fit() parameters steps_per_epoch and epochs behavior on train set

I'm using a tf.data dataset containing my training data, consisting of (let's say) 100k images.
I'm also using a tf.data dataset containing my validation set.
Since an epoch of all 100k images takes quite long (in my case approximately one hour) before I get any feedback on performance on the validation set, I set the steps_per_epoch parameter in tf.keras.Model fit() to 10000.
Using a batch size of 1, this results in 10 validation scores by the time 100k images have been seen.
In order to complete one epoch over all 100k images of my training dataset, I set the epochs parameter to 10.
However, I'm not sure if using steps_per_epoch and epochs this way has any other consequences. Is it correct to use these parameters in order to get more frequent feedback on performance?
And also a more specific question, does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
I already dug into the TensorFlow docs and read several different stack overflow questions, but I couldn't find anything conclusive to answer my own question. Hope you can help!
Tensorflow version I'm using is 2.2.0.
Is it correct to use these parameters in order to get more frequent feedback on performance?
Yes, it is correct to use these parameters. Here is the code that I used to fit the model.
model.fit(
    train_data,
    steps_per_epoch=train_samples // batch_size,
    epochs=epochs,
    validation_data=test_data,
    verbose=1,
    validation_steps=test_samples // batch_size)
does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
It uses all of the images in your training data.
For better understanding: epochs is the number of times the learning algorithm will work through the entire training data set, whereas steps_per_epoch is the total number of samples in your training data set divided by the batch size.
For example, if you have 100,000 training samples and use a batch size of 100, one epoch will consist of 1000 steps.
Note: batch sizes are generally chosen to be powers of 2, because optimized matrix-operation libraries work most efficiently with them.
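As a minimal sketch of the setup in the question (the train_dataset, val_dataset and model names are hypothetical), a common pattern when using steps_per_epoch with a tf.data input is to .repeat() the training dataset so the iterator keeps advancing across the shorter 'epochs' and all 100k images are eventually seen:
batch_size = 1
train_samples = 100_000      # as in the question
steps_per_epoch = 10_000     # 10 short "epochs" per full pass over the data

train_data = train_dataset.batch(batch_size).repeat()  # iterator never runs out

model.fit(
    train_data,
    steps_per_epoch=steps_per_epoch,
    epochs=(train_samples // batch_size) // steps_per_epoch,  # 10 here
    validation_data=val_dataset.batch(batch_size))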

How to select a subset of mnist training set

I'm having trouble selecting a subset of the MNIST training set containing M points to train a 1-NN classifier, because the number of original training points is too large.
That is, I need to figure out a scheme that takes as input a labeled training set as well as a number M, and returns a subset of the training set of size M.
Also, uniform-random selection is not allowed (that is, just picking M of the training points at random).
One option could be to train your network with a data generator.
It loads only one batch of data at a time, so you won't have memory issues with your data anymore. Furthermore, it is able to use multithreading, so loading (and possibly preprocessing) your data is not a bottleneck.
Here is a good example:
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
I hope this helps.
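As a rough sketch of the kind of generator the linked post describes, here is a tf.keras.utils.Sequence that serves only a chosen subset of indices, one batch at a time (all array and variable names are hypothetical):
import numpy as np
import tensorflow as tf

class SubsetSequence(tf.keras.utils.Sequence):
    # Yields batches from a chosen subset of indices, loading only one batch at a time.
    def __init__(self, x, y, indices, batch_size=32):
        self.x, self.y = x, y                  # full training arrays
        self.indices = np.asarray(indices)     # the M selected training points
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.indices) / self.batch_size))

    def __getitem__(self, i):
        idx = self.indices[i * self.batch_size:(i + 1) * self.batch_size]
        return self.x[idx], self.y[idx]

# Usage (hypothetical): model.fit(SubsetSequence(x_train, y_train, selected_indices), epochs=5)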

Question about creating a Tensorflow Dataset from data that is too big for RAM (with shuffling)

I have 60 GB of .npy files spread across 20 files. I want to build a neural net in tensorflow to learn on this data.
I plan to train on 19 files and test on the remaining file. Each file has roughly 80 columns of x data and 1 column of categorical y data. The data types are np.float64 and np.int64. I cannot reduce the data types to smaller sizes because I would lose valuable data to rounding errors.
I have no trouble loading the data into my neural net when I load a single file, but I am having trouble with training because I need to learn across all of the data. I cannot learn on the files in sequential order (for example, train on files 1-19 in order 1, 2, 3, ..., 19). I need to somehow shuffle all of the data for each epoch.
I've read posts like this one, which looks almost identical to my question. However, my question is different because I need to shuffle across multiple files. I have not seen a question like this answered on Stack Overflow.
The post you linked to explains how to get a TFRecordDataset for each of the 19 data files. Then you can use tf.data.Dataset.zip to combine the TFRecordDatasets into one dataset. On this dataset you can apply shuffle. See this TensorFlow tutorial for details.
The way shuffle works on a tf.data.Dataset is by loading a buffer of data and shuffling it. Once it is consumed, the next buffer-size chunk of data is loaded and shuffled. I guess you can increase the randomness if needed by dividing your 19 files into smaller files, but you will pay in computational efficiency.
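As an alternative sketch to the zip approach mentioned in the answer above, a commonly used pattern for reading many TFRecord files as a single shuffled stream is to shuffle the filenames and interleave the files, then apply a record-level shuffle buffer; the filenames, buffer size and batch size below are placeholders:
import tensorflow as tf

# Placeholder list standing in for the 19 converted TFRecord training files.
filenames = [f"train_{i:02d}.tfrecord" for i in range(19)]

dataset = (
    tf.data.Dataset.from_tensor_slices(filenames)
    .shuffle(len(filenames))                      # reshuffle file order every epoch
    .interleave(tf.data.TFRecordDataset,
                cycle_length=4,                   # read 4 files concurrently
                num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=10_000)                  # record-level shuffle buffer
    .batch(256)
    .prefetch(tf.data.AUTOTUNE))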

Dataset too big for RAM, how to do efficient epochs

I am currently working with a dataset of about 2 million objects. Before I train on them, I have to load them from disk and perform some preprocessing (which makes the dataset much bigger, so it would be inefficient to save the post-processed data).
Right now I just load and train in small batches, but if I want to train for multiple epochs on the full dataset, I would have to reload all of the data for every epoch, which ends up taking a lot of time. The alternative is training for multiple epochs on each smaller batch of data before moving on to the next batch. Could the second method cause any issues (like overfitting)? And is there any other, better way to do this? I'm using tflearn with Python 3, if there are any built-in methods for that.
tl;dr: Is it okay to train for multiple epochs on subsets of the data before training for a single epoch on all the data?
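To make the two schedules in the question concrete, here is a toy sketch contrasting them; load_chunk, preprocess and train_on are made-up stand-ins for the real loading, preprocessing and fitting code:
import numpy as np

def load_chunk(i):                # would read chunk i from disk in the real setup
    return np.random.rand(1000, 32)

def preprocess(x):                # the expensive step that inflates the data
    return np.repeat(x, 4, axis=1)

def train_on(model, x, epochs):   # placeholder for the actual tflearn/Keras fit call
    pass

model, n_chunks, n_epochs = None, 20, 5

# Schedule A (current approach): every epoch reloads and re-preprocesses every chunk.
for epoch in range(n_epochs):
    for i in range(n_chunks):
        train_on(model, preprocess(load_chunk(i)), epochs=1)

# Schedule B (the alternative being asked about): each chunk is loaded once and
# trained on for several epochs before moving on to the next chunk.
for i in range(n_chunks):
    train_on(model, preprocess(load_chunk(i)), epochs=n_epochs)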
