Splitting a dataset file into many data examples with the PyTorch API

I would like to transform my data using a preprocessing pipeline in PyTorch in order to train a model. My dataset consists of many ~GB size files. Each file is effectively a series of 3D images (so 4D total). As my model uses 3D convolutions, it's infeasible to keep the original images intact, so crucially I need to split each image into many different dataset examples (effectively many series of 3D patches). In addition, I need to shuffle the 4th dimension using a custom shuffling function and split that dimension into different dataset examples too.
To achieve this in TensorFlow I would:
1. Save the data to the .tfrecord format.
2. Load each large image as a tf.data.Dataset.
3. Apply a series of mapping functions using tf.data.Dataset.map.
4. Split the dataset into many sub-examples using tf.data.Dataset.from_tensor_slices.
My question is, how can I achieve the same thing in PyTorch? The patch-splitting step could be precomputed and saved to disk rather than done on the fly, but this would be disadvantageous in terms of data flexibility. Critically, the dimension-shuffling step needs to be applied at each epoch and therefore cannot be saved to disk.
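One way to get this behaviour in PyTorch is an IterableDataset that loads one large file at a time, applies the per-epoch shuffle, and yields patches on the fly. Here is a minimal sketch, assuming .npy files of shape (X, Y, Z, T) and non-overlapping patches; the file format, patch size, and shuffle logic are placeholders for your own:

```python
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class PatchDataset(IterableDataset):
    def __init__(self, file_paths, patch_size=(32, 32, 32)):
        self.file_paths = file_paths
        self.patch_size = patch_size

    def shuffle_series(self, volume):
        # Custom shuffle of the 4th dimension; because __iter__ runs
        # anew each epoch, the shuffle is re-applied every epoch.
        perm = np.random.permutation(volume.shape[-1])
        return volume[..., perm]

    def __iter__(self):
        px, py, pz = self.patch_size
        for path in self.file_paths:
            volume = self.shuffle_series(np.load(path))  # one ~GB file at a time
            X, Y, Z, _ = volume.shape
            # Split the 4D volume into many examples: one per
            # (3D patch, element of the shuffled 4th dimension).
            for x in range(0, X - px + 1, px):
                for y in range(0, Y - py + 1, py):
                    for z in range(0, Z - pz + 1, pz):
                        patch = volume[x:x + px, y:y + py, z:z + pz, :]
                        for t in range(patch.shape[-1]):
                            yield torch.from_numpy(patch[..., t].copy())

loader = DataLoader(PatchDataset(["volume_0.npy"]), batch_size=8)
```

With num_workers > 0 you would additionally shard self.file_paths across workers (via torch.utils.data.get_worker_info()) so that each file is read by exactly one worker.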

Related

Custom Dataset, Dataloader, Sampler, or something else?

I'm working on a project that requires training a PyTorch NN on a very large dataset of images. Some of these images are completely irrelevant to the problem, but these irrelevant images are not labelled as such. However, there are some metrics I can use to decide whether an image is irrelevant (e.g. summing all the pixel values would give me a good sense of which images are relevant and which are not). What I would ideally like is a DataLoader that can take in a Dataset class and create batches only with the relevant images. The Dataset class would just know the list of images and their labels, and the DataLoader would decide whether the image it is adding to a batch is relevant, and would then only make batches with relevant images.
To apply this to an example, let's say I have a dataset of black and white images. The white images are irrelevant, but they are not labelled as such. I want to be able to load batches from a file location and have these batches contain only the black images. I could filter at some point by summing all the pixels and checking that the sum equals 0.
What I am wondering is whether a custom Dataset, DataLoader, or Sampler would be able to solve this task for me. I have already written a custom Dataset that stores the directory of all the saved images and a list of all the images in that directory, and can return an image with its label in the __getitem__ function. Is there something more I should add there to filter out certain images? Or should that filter be applied in a custom DataLoader or Sampler?
Thank you!
I'm assuming that your image dataset belongs to two classes (0 or 1) but is unlabelled. As @PranayModukuru mentioned, you can determine the similarity using some measure (e.g. aggregating all the pixel intensity values of an image, as you mentioned) in the __getitem__ function of your custom Dataset class.
However, determining the similarity in the __getitem__ function while training your model will make the training process very slow. So I would recommend approximating the similarity before training starts (not in the __getitem__ function). Moreover, if your image dataset is comprised of complex images (not black-and-white images), it's better to use a pretrained deep learning model (e.g. a ResNet or an autoencoder) for dimensionality reduction, followed by a clustering approach (e.g. agglomerative clustering) to label your images.
With the second approach you only need to label your images once, and if you apply augmentation while training you don't need to re-determine the similarity (label) in the __getitem__ function. With the first approach, on the other hand, you need to determine the similarity (label) every time (after applying transformations to the images) in the __getitem__ function, which is redundant, unnecessary, and time-consuming.
Hope this helps.
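A rough sketch of the second approach, using a torchvision ResNet-18 as the feature extractor and scikit-learn's agglomerative clustering; both choices and the file locations are my assumptions, not something the question fixes:

```python
import glob
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.cluster import AgglomerativeClustering

# Pretrained backbone with the classifier removed: a fixed feature extractor.
backbone = models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

@torch.no_grad()
def embed(paths):
    # For large datasets, embed in mini-batches instead of one big stack.
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return backbone(batch).numpy()

image_paths = sorted(glob.glob("images/*.png"))  # hypothetical location
features = embed(image_paths)

# Two clusters: relevant vs. irrelevant. Run this once before training,
# inspect a few images per cluster, and store the resulting labels.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(features)
```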
It sounds like your goal is to remove the irrelevant images from training entirely.
The best way to deal with this would be to figure out the filenames of all the relevant images up front and save them to a CSV or similar. Then pass only the good filenames to your Dataset.
The reason is that you will run through your dataset multiple times during training. This means you would be loading, analyzing and discarding irrelevant images over and over again, which is a waste of compute.
It's better to do this sort of preprocessing/filtering once up front.
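A minimal sketch of that one-time filter, using the pixel-sum criterion from the question (the directory layout and file names are assumptions):

```python
import csv
import glob
import numpy as np
from PIL import Image

# One-off preprocessing pass: keep only the "black" (relevant) images.
relevant = []
for path in glob.glob("images/*.png"):          # hypothetical location
    pixels = np.asarray(Image.open(path), dtype=np.int64)
    if pixels.sum() == 0:                       # black image => relevant
        relevant.append(path)

with open("relevant_images.csv", "w", newline="") as f:
    csv.writer(f).writerows([p] for p in relevant)
```

Your existing Dataset then reads relevant_images.csv instead of listing the directory, so no filtering has to happen in __getitem__, in the DataLoader, or in a Sampler.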

Efficiently accessing arbitrary NDArray batches in mxnet

Fitting convnets such as ResNet and VGG benefits from the ImageRecordIter Python class, which allows efficient loading of batches from large collections of RGB images stored in RecordIO .rec files.
Does anybody know of equivalent facilities for large arbitrary 2D or 3D input matrices (for 2D, rows = items and cols = features, plus channels in 3D)?
NDArrayIter requires loading the whole dataset into memory, which is to be avoided in my case (>40 GB data file). CSVIter does not allow straightforward shuffling, and works only for 2D matrices.
You can keep each data sample as a separate file and implement a very simple custom Dataset class, following the reference code of ImageFolderDataset.
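For instance, a minimal Gluon Dataset that stores one sample per .npy file and reads it only in __getitem__, so the full >40 GB never sits in memory. A sketch only; the file layout and labels source are assumptions:

```python
import os
import numpy as np
from mxnet import nd
from mxnet.gluon.data import Dataset, DataLoader

class NpyFolderDataset(Dataset):
    def __init__(self, data_dir, labels):
        self.files = sorted(
            os.path.join(data_dir, f)
            for f in os.listdir(data_dir) if f.endswith(".npy")
        )
        self.labels = labels  # one label per file, same order

    def __getitem__(self, idx):
        # Only this one sample is read from disk.
        return nd.array(np.load(self.files[idx])), self.labels[idx]

    def __len__(self):
        return len(self.files)

labels = np.load("labels.npy")  # hypothetical labels file
loader = DataLoader(NpyFolderDataset("data/", labels),
                    batch_size=32, shuffle=True)
```

Shuffling then happens over sample indices rather than over rows of one huge file, which sidesteps the CSVIter limitation.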

Creating images from time series features during graph execution using Tensorflow tf.data

Setup
The data are one-dimensional time series. A sliding window (frame) runs across the data and extracts a 100-dimensional feature vector f_i(t_j), where i is a file index and j is the time index of the frame. Energies from n consecutive frames are stacked to form a 2D "image", which is the input to the classifier.
Basic Approach
Just save the stacked vectors in a tfrecords file and you are good to go. The drawback is that the stacked vectors contain a lot of duplicated data and consume a lot of space. If we have 10 gigs of feature vectors, then stacking 10 frames of history would require 50 gigs (+ overhead) of space. However, we can let tf.data do the shuffling, prefetching, batch creation and so on. Very convenient!
Desired Approach
An implementation where we save only the feature vectors to a tfrecords file and create the stacked feature vectors (images) on the fly.
Question
Has this been encountered and dealt with before? Would most people just not care about the high level of repetition in their data or the amount of space consumed?
How difficult is it to get the desired implementation? Even naively shuffling the data would mess up the time ordering. I've also only used tf.data in a way where I call iterator.get_next() to get the features and label from one example and then feed them to the graph. I suppose I'd also need to figure out how to get not only the current example's features and label, but also those of its predecessors in time. So I'm looking for some tips on how to accomplish this.
Here is an image of the process for images formed from 2 frames of feature vectors. Note that in this case, each feature vector would appear twice in the tfrecords data file:
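For the desired approach, tf.data's window transformation can build the n-frame stacks on the fly from a record file that stores only the individual vectors. A sketch under an assumed record schema; the feature key, frame count, and file name are mine:

```python
import tensorflow as tf

N_FRAMES = 10  # frames of history per "image" (assumption)

def parse_fn(record):
    # Hypothetical schema: one 100-dim float vector per record, in time order.
    spec = {"feat": tf.io.FixedLenFeature([100], tf.float32)}
    return tf.io.parse_single_example(record, spec)["feat"]

stacked = (
    tf.data.TFRecordDataset("features.tfrecord")
    .map(parse_fn)
    .window(N_FRAMES, shift=1, drop_remainder=True)  # sliding window over time
    .flat_map(lambda w: w.batch(N_FRAMES))           # window -> (N_FRAMES, 100) image
    .shuffle(10_000)                                 # shuffle whole images, so the
    .batch(32)                                       # time order inside each one
    .prefetch(tf.data.AUTOTUNE)                      # is preserved
)
```

Only the raw vectors ever hit disk, so the duplication from stacking never materializes, and because shuffling happens after the windows are formed, the time ordering inside each image survives. Labels could be handled similarly, e.g. windowed alongside the features and reduced to the label of the last frame.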

Map different images into single input's channels Keras

I have data stored in this format
img01_blue.tif
img01_yellow.tif
img01_red.tif
...
imgn_blue.tif
imgn_yellow.tif
imgn_red.tif
with each image being split into 3 files with different channels, indicated by their suffixes.
Now I want to feed them to a CNN built with Keras in Python.
Because the data is large and already structured, I feed it in batches via ImageDataGenerator and flow_from_directory, without preprocessing beforehand.
I want to merge multiple files into a single input, one file per channel. Can I do that with Keras tools, or do I have to preprocess the data with other packages first?
ImageDataGenerator.flow_from_directory assumes you have single image files. You will have to pre-process your data and merge the files into a single one. If you would like to keep the files separate, then you will have to write your own data generator that handles the data you have. But it would be wiser to pre-process; here is a post that provides a starting point.
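If you do keep the files separate, a custom keras.utils.Sequence can do the merge at load time. A sketch assuming the naming scheme from the question; the labels, batch size, and directory are placeholders:

```python
import numpy as np
from PIL import Image
from tensorflow.keras.utils import Sequence

class MultiFileChannelSequence(Sequence):
    """Merges img{i}_blue/_yellow/_red.tif into one 3-channel input."""

    def __init__(self, image_ids, labels, batch_size=32, data_dir="."):
        self.image_ids = image_ids
        self.labels = labels
        self.batch_size = batch_size
        self.data_dir = data_dir

    def __len__(self):
        return int(np.ceil(len(self.image_ids) / self.batch_size))

    def _load(self, image_id):
        # Stack the three single-channel files along the channel axis.
        channels = [
            np.asarray(Image.open(f"{self.data_dir}/{image_id}_{c}.tif"))
            for c in ("blue", "yellow", "red")
        ]
        return np.stack(channels, axis=-1)

    def __getitem__(self, idx):
        ids = self.image_ids[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.stack([self._load(i) for i in ids])
        y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        return x, y

# model.fit(MultiFileChannelSequence(ids, labels), epochs=10)
```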

Which is the correct way to do image normalization?

When preparing a training set for neural network training, I find two possible ways.
The traditional way: calculate the mean over the whole training set, and subtract this fixed mean value from each image before sending it to the network. Process the standard deviation in a similar way.
I find that TensorFlow provides a function, tf.image.per_image_standardization, that does normalization on a single image.
I wonder which way is more appropriate?
Both ways are possible and the choice mostly depends on the way you read the data.
Whole training set normalization is convenient when you can load the whole dataset at once into a numpy array. E.g., the MNIST dataset is usually loaded fully into memory. This way is also preferable in terms of convergence when the individual images vary significantly: two training images (one mostly white, the other mostly black) will have very different means.
Per-image normalization is convenient when the images are loaded one by one or in small batches, for example from a TFRecord. It's also the only viable option when the dataset is too large to fit in memory. In this case, it's better to organize the input pipeline in TensorFlow and transform the image tensors just like other tensors in the graph. I've seen pretty good accuracy with this normalization on CIFAR-10, so it's a viable way, despite the issues stated earlier. Also note that you can reduce the negative effect via batch normalization.
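Both options side by side, as a small sketch (the in-memory array is stand-in data; tf.image.per_image_standardization is the real TensorFlow op):

```python
import numpy as np
import tensorflow as tf

images = np.random.rand(1000, 32, 32, 3).astype("float32")  # stand-in training set

# 1) Whole-training-set normalization: compute the statistics once,
#    then reuse the same mean/std for validation and test data.
mean = images.mean(axis=0)
std = images.std(axis=0) + 1e-7          # epsilon guards against division by zero
images_normalized = (images - mean) / std

# 2) Per-image standardization inside the input pipeline: each image is
#    independently scaled to zero mean and unit variance.
ds = (
    tf.data.Dataset.from_tensor_slices(images)
    .map(tf.image.per_image_standardization)
    .batch(32)
)
```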
