I have data stored in this format
img01_blue.tif
img01_yellow.tif
img01_red.tif
...
imgn_blue.tif
imgn_yellow.tif
imgn_red.tif
with each image split into three single-channel images, as indicated by the suffixes.
Now I want to feed them to a CNN built with Keras (Python).
Because the data is large and already structured, I feed it in batches with ImageDataGenerator and flow_from_directory, without preprocessing beforehand.
I want to merge the multiple files into one single input, each as a different channel. Can I do that using a Keras tool, or do I have to preprocess the data with other packages first?
ImageDataGenerator.flow_from_directory assumes you have single image files. You will have to pre-process your data and merge the files into a single one. If you would like to keep the files separate, then you will have to write your own data generator that handles the data you have. But it would be wiser to pre-process; here is a post that provides a starting point.
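For example, a minimal pre-processing sketch (assuming 8-bit images and that the blue/yellow/red files are simply stacked as the three channels; the prefix and channel order are placeholders):

import numpy as np
from PIL import Image

def merge_channels(prefix):
    # Load the three single-channel TIFFs and stack them into one (H, W, 3) array.
    channels = [np.asarray(Image.open(f"{prefix}_{c}.tif")) for c in ("blue", "yellow", "red")]
    merged = np.stack(channels, axis=-1).astype(np.uint8)
    # Save the merged image so flow_from_directory sees a single file per sample.
    Image.fromarray(merged).save(f"{prefix}_merged.tif")

merge_channels("img01")  # placeholder prefix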
Related
I would like to transform my data using a preprocessing pipeline in PyTorch in order to train a model. My dataset consists of many ~GB size files. Each file is effectively a series of 3D images (so 4D total). As my model is using 3D convolutions, it's a bit infeasible to keep the original images intact and so crucially I need to split up each image into many different dataset examples. (Effectively many series of 3D patches). In addition to this I need to shuffle the 4th dimension using a custom shuffling function, and split that dimension into different dataset examples too.
To achieve this in TensorFlow I would:
Save the data to the .tfrecord format
Load each large image as a tf.data.Dataset
Apply a series of mapping functions using tf.data.Dataset.map
Split the dataset into many sub examples using tf.data.Dataset.from_tensor_slices
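A rough sketch of that TensorFlow pipeline (the feature key, filename and patch size below are placeholders):

import tensorflow as tf

def parse_example(serialized):
    # Placeholder feature spec; the real one depends on how the .tfrecord was written.
    features = tf.io.parse_single_example(
        serialized, {"volume": tf.io.FixedLenFeature([], tf.string)})
    return tf.io.parse_tensor(features["volume"], out_type=tf.float32)

def to_patches(volume):
    # Split one large 4D volume into many smaller examples.
    patches = tf.reshape(volume, [-1, 32, 32, 32])  # placeholder patch size
    return tf.data.Dataset.from_tensor_slices(patches)

dataset = (tf.data.TFRecordDataset(["subject01.tfrecord"])  # placeholder filename
           .map(parse_example)
           .flat_map(to_patches)
           .shuffle(1024)
           .batch(8))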
My question is: how can I achieve the same thing in PyTorch? The patch-splitting step could be done ahead of time and saved to disk rather than on the fly, but this would be disadvantageous in terms of data flexibility. Critically, the shuffled-dimension step needs to be applied at each epoch and therefore cannot be saved to disk.
I have a set of 20,000 images that I am importing from disk like below.
import os
import numpy as np
from PIL import Image

imgs_dict = {}
path = "Documents/data/img"
valid_images = [".png"]
for f in os.listdir(path):
    # Skip anything that is not a .png file.
    ext = os.path.splitext(f)[1]
    if ext.lower() not in valid_images:
        continue
    img_name = os.path.splitext(os.path.basename(f))[0]
    img = np.asarray(Image.open(os.path.join(path, f)))
    imgs_dict[img_name] = img
The reason I am converting this to a dictionary at the end is that I also have two other dictionaries specifying, for each image id, its classification and whether it is part of the training or validation set. One of these dictionaries covers all the data that should be part of the training set and the other the data that should be part of the validation set. After I separate them out, I need to get them back into the standard array format for images (height, width, channels). How can I take a dictionary of images and convert it back into the format I want here? When I do the below, it produces an array with shape (8500,), which is the number of images in my training set but obviously does not reflect the height, width and channels.
x_train=np.array(list(training_images.values()))
np.shape(x_train)
(8500,)
Or, alternatively, am I going about this all wrong? Is there an easier way to handle images than this? It would seem much nicer to just keep the images in a NumPy array from the beginning, but as far as I can tell there's no way for an array to carry a key or label of any sort, so I can't pull out specific images.
Edit: For some more context, what I'm essentially trying to do is get my data into a format like what is described in the following link.
https://elitedatascience.com/keras-tutorial-deep-learning-in-python
The specific part in question I'm having trouble with is this:
from keras.datasets import mnist
# Load pre-shuffled MNIST data into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
When we load the MNIST data, how is the relation between X_train and y_train determined? How can I replicate that with my data?
Yes, there is an easier way of handling image data in Keras. Specifically, when dealing with a large dataset you want to use a generator instead of loading all of the images into memory, so refer to the ImageDataGenerator class. This class is a data generator already implemented in Keras, so unless you need any special operations it can be the "go-to guy", at least for basic projects. It also lets you define basic augmentations and normalization (for example rescaling, normalization, rotation, etc.).
Specifically, you can automatically load images per class either by arranging them in subdirectories (put all the images from a single label under the same subdirectory), or by creating a data frame that indicates, for each image path, what its label is. Refer to flow_from_directory and flow_from_dataframe accordingly.
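A minimal sketch of the subdirectory approach (the path, image size and batch size are placeholders):

from keras.preprocessing.image import ImageDataGenerator

# Rescaling is one of the basic normalizations the generator can apply on the fly.
datagen = ImageDataGenerator(rescale=1. / 255)

# Labels are inferred from the subdirectory names under data/train.
train_gen = datagen.flow_from_directory(
    "data/train",            # placeholder path
    target_size=(150, 150),  # placeholder image size
    batch_size=32,
    class_mode="categorical")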
For train/test splitting, the easiest way is to keep your train and test sets in different directories (e.g. data/train and data/test) and create two different generators; this tutorial shows an example of that layout.
In case you don't want to put the train and test data in different directories, you can use the validation_split argument when initializing the generator (e.g. validation_split=0.2); then, when invoking flow_from_directory, add the argument subset='validation' or subset='training'.
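For example (a sketch; the path and sizes are placeholders):

from keras.preprocessing.image import ImageDataGenerator

# Hold out 20% of the images found under data/all for validation.
datagen = ImageDataGenerator(rescale=1. / 255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    "data/all", target_size=(150, 150), batch_size=32, subset="training")
valid_gen = datagen.flow_from_directory(
    "data/all", target_size=(150, 150), batch_size=32, subset="validation")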
Having said all that, in case you want to load all of the images into memory as you did and just split them easily, you can use scikit-learn's train_test_split, as described here, for example.
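A sketch, assuming x is an array of images and y the matching labels:

from sklearn.model_selection import train_test_split

# Hold 20% of the samples out as a validation set; shuffling keeps x and y aligned.
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=42)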
PS
Regarding MNIST: this is a well-established benchmark with a strictly defined train and test set, so that everyone can compare their evaluations on the exact same images. This is the reason it is already split in advance.
As an engineering student working in the DSP and ML fields, I am working on an audio classification project whose inputs are short clips (4 sec.) of instruments like bass, keyboard, guitar, etc. (the NSynth dataset by the Magenta team at Google).
The idea is to convert all the short clips (.wav files) to spectrograms or mel spectrograms and then apply a CNN to train the model.
However, my question is: since the entire dataset is large (approximately 23 GB), should I first convert all the audio files to images like PNG and then apply the CNN? I feel like this can take a lot of time, and it will double the storage space for my input data, since it would then be audio + images (maybe up to 70 GB).
Thus, I wonder if there is any workaround here that can speed up the process.
Thanks in advance.
Preprocessing is totally worth it. You will very likely end up running multiple experiments before your network works as you want it to, and you don't want to waste time pre-processing the features every time you want to change a few hyper-parameters.
Rather than using PNG, I would rather save PyTorch tensors directly (torch.save, which uses Python's standard pickling protocol) or NumPy arrays (numpy.savez saves serialized arrays into a zip file). If you are concerned with disk space, you can consider numpy.savez_compressed.
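A minimal sketch of that caching idea (assuming librosa is used to compute the mel spectrogram; the filename and sample rate are placeholders):

import librosa
import numpy as np

# Compute the mel spectrogram once and cache it on disk.
audio, sr = librosa.load("bass_001.wav", sr=16000)      # placeholder file / sample rate
mel = librosa.feature.melspectrogram(y=audio, sr=sr)

# savez_compressed keeps the extra disk usage small compared to saving PNGs.
np.savez_compressed("bass_001.npz", mel=mel)

# Later, during training, just load the cached array instead of recomputing it.
mel = np.load("bass_001.npz")["mel"]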
I am unable to decide how to feed video data to a Keras model. I'd like to use a DataGenerator for this case, like ImageDataGenerator. From this answer I gather that ImageDataGenerator would not be suitable for this.
I have looked at this GitHub repo for a VideoGenerator in Keras which uses .npy files in directories. But the downside is that data augmentation is absent at the moment. How do I go about accomplishing this?
Is there no way I can use ImageDataGenerator?
Suppose I split all videos into frames and then load directories of .jpg files instead; how would that fare?
If I write a custom data generator using this data generator tutorial, how do I arrange the partition dict? My data consists of .avi files.
You might find this example helpful.
First you create your tensor of timestep data and then you reshape it for the LSTM network.
Assuming that your dataset consists of sorted frames arranged as follows:
data/
class0/
img001.jpg
img002.jpg
...
class1/
img001.jpg
img002.jpg
...
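A rough sketch of turning such sorted frames into a timestep tensor for an LSTM (the path, frame size and number of timesteps are placeholders):

import os
import numpy as np
from PIL import Image

def load_sequence(class_dir, timesteps=16, size=(64, 64)):
    # Read the first `timesteps` sorted frames of one clip and stack them
    # into an array of shape (timesteps, height, width, channels).
    frames = sorted(os.listdir(class_dir))[:timesteps]
    return np.stack([np.asarray(Image.open(os.path.join(class_dir, f)).resize(size))
                     for f in frames])

x = load_sequence("data/class0")    # placeholder path
x = x.reshape((1,) + x.shape)       # add a batch dimension before feeding the network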
I have around 13 NumPy arrays stored as files that take around 24 gigabytes on disk. Each file is for a single subject and consists of two arrays: one containing the input data (a list of 2D matrices, with rows representing sequential time), and the other containing the labels of the data.
My final goal is to feed all the data to a deep learning network I've written in Keras to classify new data. But I don't know how to do it without running out of memory.
I've read about Keras's data generators, but cannot find a way to use it for my situation.
I've also looked up HDF5 and h5py, but I don't know how I can add all the data to a single array (a dataset in HDF5) without running out of memory.
What you need to do is implement a generator that feeds the data little by little to your model. Keras does have a TimeseriesGenerator, but I don't think you can use it, as it requires you to first load the whole dataset into memory. Thankfully, Keras has a generator for images (called ImageDataGenerator), which we will use as the base for our custom generator.
First, two words on how it works. You have two main classes: the ImageDataGenerator class (which mostly deals with any preprocessing you want to perform on each image) and the DirectoryIterator class, which actually does all the work. The latter is what we will modify to get what we want. What it essentially does is:
Inherits from keras.preprocessing.image.Iterator, which implements many methods that initialize and generate an array called index_array containing the indices of the images used in each batch. This array is changed at each iteration, while the data it draws from is shuffled at each epoch. We will build our generator upon this to maintain its functionality.
Searches for all images under a directory; the labels are deduced from the directory structure. It stores the path to each image and its label in two class variables called filenames and classes respectively. We will use these same variables to store the locations of the timeseries and their classes.
Has a method called _get_batches_of_transformed_samples() that accepts an index_array, loads the images whose indices correspond to those of the array, and returns a batch of these images and one containing their classes.
What I'd suggest you do is:
Write a script that structures your timeseries data the way you are supposed to structure images when using the ImageDataGenerator. This involves creating a directory for each class and placing each timeseries separately inside this directory. While this will probably require more storage than your current option, the data won't be loaded into memory while training the model.
Get acquainted with how the DirectoryIterator works.
Define your own generator class (e.g. MyTimeseriesGenerator). Make sure it inherits from the Iterator class mentioned above.
Modify it so that it searches for the file format you want (e.g. HDF5, npy) and not image formats (e.g. png, jpeg) like it currently does. This is done in lines 1733-1763. You don't need to make it work on multiple threads like Keras' DirectoryIterator does, as this procedure is done only once.
Change the _get_batches_of_transformed_samples() method so that it reads the file type that you want instead of reading images (lines 1774-1788). Remove any other image-related functionality the DirectoryIterator has (transforming the images, standardizing them, saving them, etc.).
Make sure that the array returned by the method above matches what you want your model to accept. I'm guessing it should be along the lines of (batch_size, n_timesteps) or (batch_size, n_timesteps, n_features) for the data, and (batch_size, n_classes) for the labels.
That's about all! It sounds more difficult than it actually is. Once you get acquainted with the DirectoryIterator class, everything else is trivial.
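For illustration only, the overridden method might look roughly like this (a sketch, assuming .npy files and that directory, filenames, classes and num_classes are populated the same way DirectoryIterator populates them):

import os
import numpy as np
from keras.preprocessing.image import Iterator

class MyTimeseriesGenerator(Iterator):
    # ... __init__ is omitted here; it should fill self.directory, self.filenames,
    # self.classes and self.num_classes, then call Iterator.__init__ ...

    def _get_batches_of_transformed_samples(self, index_array):
        # Load one .npy timeseries per index and stack them into a batch.
        batch_x = np.stack([np.load(os.path.join(self.directory, self.filenames[i]))
                            for i in index_array])
        # One-hot encode the corresponding class labels.
        batch_y = np.zeros((len(index_array), self.num_classes), dtype="float32")
        for row, i in enumerate(index_array):
            batch_y[row, self.classes[i]] = 1.0
        return batch_x, batch_y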
Intended use (after modifications to the code):
# assuming you named your class MyTimeseriesGenerator and wrote it
# in a Python file named custom_generator.py
from custom_generator import MyTimeseriesGenerator
train_dir = 'path/to/your/properly/structured/train/directory'
valid_dir = 'path/to/your/properly/structured/validation/directory'
train_gen = MyTimeseriesGenerator(train_dir, batch_size=..., ...)
valid_gen = MyTimeseriesGenerator(valid_dir, batch_size=..., ...)
# instantiate and compile model, define hyper-parameters, callbacks, etc.
model.fit_generator(train_gen, validation_data=valid_gen, epochs=..., ...)