Fitting convnets such as ResNet and VGG benefits from the ImageRecordIter Python class, which allows efficient loading of batches from large collections of RGB images stored in RecordIO .rec files.
Does anybody know of equivalent facilities for large arbitrary 2D or 3D input matrices (for 2D, rows = items and cols = features, plus channels in 3D)?
NDArrayIter requires loading the whole dataset into memory, which I need to avoid in my case (a >40 GB data file). CSVIter does not allow straightforward shuffling and works only for 2D matrices.
You can keep each data sample as a separate file and implement a very simple custom Dataset class, following the reference code of ImageFolderDataset.
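For illustration, a minimal sketch of such a Dataset for MXNet Gluon, assuming one sample per .npy file (the folder layout and file format are assumptions, not part of the original answer):

import os
import numpy as np
from mxnet.gluon.data import Dataset, DataLoader

class NpyFolderDataset(Dataset):
    # Minimal sketch: one .npy file per sample, loaded lazily on access.
    def __init__(self, root):
        self._paths = sorted(os.path.join(root, f)
                             for f in os.listdir(root) if f.endswith('.npy'))

    def __getitem__(self, idx):
        # Only this one sample is read from disk.
        return np.load(self._paths[idx])

    def __len__(self):
        return len(self._paths)

# The DataLoader handles shuffling and batching without ever holding
# the full dataset in memory.
loader = DataLoader(NpyFolderDataset('data/train'), batch_size=32, shuffle=True)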
Related
I would like to transform my data using a preprocessing pipeline in PyTorch in order to train a model. My dataset consists of many files of ~GB size. Each file is effectively a series of 3D images (so 4D in total). As my model uses 3D convolutions, it is infeasible to keep the original images intact, so crucially I need to split up each image into many different dataset examples (effectively many series of 3D patches). In addition, I need to shuffle the 4th dimension using a custom shuffling function and split that dimension into different dataset examples too.
To achieve this in TensorFlow I would (a rough sketch follows the list below):
Save the data to the .tfrecord format
Load each large image as a tf.data.Dataset
Apply a series of mapping functions using tf.data.Dataset.map
Split the dataset into many sub examples using tf.data.Dataset.from_tensor_slices
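A rough sketch of that pipeline, assuming each record holds one serialized volume; extract_patches is a hypothetical user-defined function, not a TensorFlow API:

import tensorflow as tf

def parse_and_patch(record):
    # Assumption: each record was written with tf.io.serialize_tensor.
    volume = tf.io.parse_tensor(record, out_type=tf.float32)  # the 4D volume
    return extract_patches(volume)  # hypothetical: (n_patches, d, h, w)

ds = tf.data.TFRecordDataset(tf.io.gfile.glob('data/*.tfrecord'))
ds = ds.map(parse_and_patch, num_parallel_calls=tf.data.AUTOTUNE)
# Split each large image into many sub-examples:
ds = ds.flat_map(tf.data.Dataset.from_tensor_slices)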
My question is: how can I achieve the same thing in PyTorch? The splitting-into-patches step could be done ahead of time and saved to disk rather than performed on the fly, but this would be disadvantageous in terms of data flexibility. Critically, the shuffled-dimension step needs to be applied at each epoch and therefore cannot be saved to disk.
I am trying to create a large numpy array, say
import numpy as np

S = 0.5
a = np.random.normal(size=(100000, 10000))
x = np.maximum(S - a, 1)
# This is just an example; the real calculation is more complicated.
But it is too large for memory. After creating this array, I also need to manipulate it and use it as training data in machine learning (e.g. XGBoost or CART).
So my questions are:
1. How can I create such a big array without getting a memory error, in a way that still lets me do calculations on it? Could you recommend some packages or links where I could learn to do this?
2. Suppose this array is already saved in a file. How do I load it and then train my model without causing a memory error?
I have read https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/ but it didn't say how to write data onto disk. Could anyone help, please? Thanks a lot.
Dask can help with large NumPy arrays, but it does not support the full NumPy API.
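As a minimal sketch of the example above done out of core with Dask (writing to zarr assumes the zarr package is installed; the chunk size is illustrative):

import dask.array as da

S = 0.5
a = da.random.normal(size=(100000, 10000), chunks=(1000, 10000))
x = da.maximum(S - a, 1)  # lazy: nothing is computed yet
x.to_zarr('x.zarr')       # streams the result to disk chunk by chunk
# later, reload it lazily with: x = da.from_zarr('x.zarr')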
Since you have mentioned in your question that your final target is to use the data for training a machine learning model, let's look at the problem from the other end.
Assuming that you somehow managed to load the data into memory, how do you plan to pass it to the underlying ML models? Most of the available classical ML models work on NumPy arrays, so even if you manage to load the data in some format, you cannot pass it to the ML model for training unless it is a NumPy array.
In case your data is sparse, you can store it in sparse NumPy arrays, and some classical models can handle sparse arrays.
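A small sketch of that route (illustrative shapes; scikit-learn estimators such as LogisticRegression accept CSR matrices directly):

import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

X = sparse.random(100000, 10000, density=0.001, format='csr')  # toy sparse data
y = np.random.randint(0, 2, size=100000)

sparse.save_npz('X_sparse.npz', X)   # compact on-disk storage
X = sparse.load_npz('X_sparse.npz')
model = LogisticRegression(max_iter=1000).fit(X, y)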
This is the general case when the data is too large to fit into memory. Ideally, you should look at ML models that can be trained one batch at a time: that way you can load a batch of data, train on it, and move on to the next batch. Any ML model that can be trained with a gradient descent algorithm can be trained batch by batch. Deep learning models are trained using gradient descent, so they all work on a batch of data at a time.
So if you decide to use deep learning models, you will normally end up writing a data loader that loads one batch of data at a time.
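To make the batch-at-a-time idea concrete, here is a minimal sketch with a classical model that supports it, scikit-learn's SGDClassifier and its partial_fit, fed from memory-mapped .npy files (the file names are assumptions):

import numpy as np
from sklearn.linear_model import SGDClassifier

# Memory-mapped arrays read only the slices you actually touch.
X = np.load('features.npy', mmap_mode='r')  # shape (n_samples, n_features)
y = np.load('labels.npy', mmap_mode='r')

model = SGDClassifier()
classes = np.unique(y[:])  # partial_fit needs the full label set up front
batch = 4096
for start in range(0, X.shape[0], batch):
    xb = np.asarray(X[start:start + batch])  # materialize just this batch
    yb = np.asarray(y[start:start + batch])
    model.partial_fit(xb, yb, classes=classes)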
If you do not want to use batch-based training models, the same consideration applies:
The bottom line is: since your final target is to train an ML model, first find out the data representation expected by the ML model you want to use, then try to solve the problem of fitting your data into that format. It would be a waste of your time and effort to figure out how to fit the data into memory, only to realize that your ML model cannot work with that representation.
My deep learning model will use some supervised learning data.
The data will be loaded from .npz (NumPy array) files; each file will contain a list of 30 arrays.
I have to train a model using these datasets.
Each file will be loaded when I call it in a Python function; then the model will start training on that dataset.
The correlation between these datasets is important.
This is the subject of this question.
I will load every file in the folder, and the model will keep learning until data loading is complete.
So, can I use these dataset files in training?
And how do I train my model?
I'm not sure exactly what you mean, but you should be able to convert certain data structures, such as NumPy arrays, to tensors:
import numpy as np
import torch

a = np.ones(5)
b = torch.from_numpy(a)  # NumPy array to tensor (shares the same memory)
This is a simple article for beginners if that helps with training:
https://medium.com/coinmonks/create-a-neural-network-in-pytorch-and-make-your-life-simpler-ec5367895199
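Beyond the conversion itself, a minimal sketch of a PyTorch Dataset that loads one .npz file per item, so the whole folder never sits in memory (the folder layout and the 'x'/'y' key names are assumptions):

import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NpzFolderDataset(Dataset):
    # Minimal sketch: one .npz file per item, loaded lazily.
    def __init__(self, folder):
        self.paths = sorted(glob.glob(folder + '/*.npz'))

    def __getitem__(self, idx):
        with np.load(self.paths[idx]) as f:
            return torch.from_numpy(f['x']), torch.from_numpy(f['y'])

    def __len__(self):
        return len(self.paths)

# The DataLoader takes care of shuffling and batching across files.
loader = DataLoader(NpzFolderDataset('data'), batch_size=4, shuffle=True)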
I have around 13 NumPy arrays stored as files that take around 24 gigabytes on disk. Each file is for a single subject and consists of two arrays: one containing input data (a list of 2D matrices, where rows represent sequential time), and the other containing the labels of the data.
My final goal is to feed all the data to a deep learning network I've written in Keras to classify new data. But I don't know how to do it without running out of memory.
I've read about Keras's data generators, but cannot find a way to use it for my situation.
I've also looked up HDF5 and h5py, but I don't know how I can add all the data to a single array (a dataset in HDF5) without running out of memory.
What you need to do is implement a generator to feed the data little by little to your model. Keras does have a TimeseriesGenerator, but I don't think you can use it, as it requires you to first load the whole dataset into memory. Thankfully, Keras has a generator for images (called ImageDataGenerator), which we will base our custom generator on.
First, a few words on how it works. There are two main classes: the ImageDataGenerator class (which mostly deals with any preprocessing you want to perform on each image) and the DirectoryIterator class, which actually does all the work. The latter is what we will modify to get what we want. What it essentially does is:
Inherits from keras.preprocessing.image.Iterator, which implements many methods that initialize and generate an array called index_array containing the indices of the images used in each batch. This array is changed at each iteration, while the data it draws from is shuffled at each epoch. We will build our generator upon this class to maintain its functionality.
Searches for all images under a directory; the labels are deduced from the directory structure. It stores the path to each image and its label in two class variables called filenames and classes, respectively. We will use these same variables to store the locations of the timeseries files and their classes.
Has a method called _get_batches_of_transformed_samples() that accepts an index_array, loads the images whose indices correspond to those of the array, and returns a batch of these images together with an array containing their classes.
What I'd suggest you do is:
Write a script that structures your timeseries data the way you are supposed to structure images when using the ImageDataGenerator. This involves creating a directory for each class and placing each timeseries separately inside this directory. While this will probably require more storage than your current option, the data won't be loaded into memory while training the model.
Get acquainted with how the DirectoryIterator works.
Define your own generator class (e.g. MyTimeseriesGenerator). Make sure it inherits from the Iterator class mentioned above.
Modify it so that it searches for the file formats you want (e.g. HDF5, npy) rather than image formats (e.g. png, jpeg) as it currently does. This is done in lines 1733-1763. You don't need to make it work on multiple threads like Keras' DirectoryIterator does, since this procedure is done only once.
Change the _get_batches_of_transformed_samples() method so that it reads the file type you want instead of reading images (lines 1774-1788). Remove any other image-related functionality the DirectoryIterator has (transforming the images, standardizing them, saving them, etc.).
Make sure that the array returned by the method above matches what you want your model to accept. I'm guessing it should be along the lines of (batch_size, n_timesteps) or (batch_size, n_timesteps, n_features) for the data and (batch_size, n_classes) for the labels.
That's about all! It sounds more difficult than it actually is. Once you get acquainted with the DirectoryIterator class, everything else is trivial.
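To make it concrete, here is a stripped-down sketch of such a generator; it drops all image-specific functionality, assumes one .npy file per timeseries, and has not been tested against every Keras version:

import os
import numpy as np
from keras.preprocessing.image import Iterator

class MyTimeseriesGenerator(Iterator):
    # Minimal sketch: walks a class-per-subdirectory tree of .npy files.
    def __init__(self, directory, batch_size=32, shuffle=True, seed=None):
        self.filenames, labels = [], []
        class_names = sorted(os.listdir(directory))
        self.class_indices = {name: i for i, name in enumerate(class_names)}
        for name in class_names:
            class_dir = os.path.join(directory, name)
            for fname in sorted(os.listdir(class_dir)):
                if fname.endswith('.npy'):
                    self.filenames.append(os.path.join(class_dir, fname))
                    labels.append(self.class_indices[name])
        self.classes = np.array(labels)
        self.num_classes = len(class_names)
        super(MyTimeseriesGenerator, self).__init__(
            len(self.filenames), batch_size, shuffle, seed)

    def _get_batches_of_transformed_samples(self, index_array):
        # Load only the files belonging to this batch.
        batch_x = np.stack([np.load(self.filenames[i]) for i in index_array])
        # One-hot encode the labels.
        batch_y = np.zeros((len(index_array), self.num_classes), dtype='float32')
        batch_y[np.arange(len(index_array)), self.classes[index_array]] = 1.
        return batch_x, batch_y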
Intended use (after modifications to the code):
# assuming you named your class MyTimeseriesGenerator and wrote it
# in a Python file named custom_generator.py
from custom_generator import MyTimeseriesGenerator
train_dir = 'path/to/your/properly/structured/train/directory'
valid_dir = 'path/to/your/properly/structured/validation/directory'
train_gen = MyTimeseriesGenerator(train_dir, batch_size=..., ...)
valid_gen = MyTimeseriesGenerator(valid_dir, batch_size=..., ...)
# instantiate and compile model, define hyper-parameters, callbacks, etc.
model.fit_generator(train_gen, validation_data=valid_gen, epochs=..., ...)
Setup
The data are one-dimensional time series. A sliding window (frame) runs across the data and extracts a 100-dimensional vector of features f_i(t_j), where i is a file index and j is the time index of the frame. The feature vectors from n consecutive frames are stacked to form a 2D "image", which is the input to the classifier.
Basic Approach
Just save the stacked vectors in a tfrecords file and you are good to go. The drawback is that the stacked vectors contain a lot of duplicate data and consume a lot of space: if we have 10 gigs of feature vectors, then stacking 10 frames of history would require roughly 100 gigs (+ overhead) of space, since each vector appears in 10 stacks. However, we can let tf.data do the shuffling, prefetching, batch creation and so on. Very convenient!
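For illustration, writing the pre-stacked vectors could look roughly like this (make_stacks is a hypothetical iterator over (stack, label) pairs, not part of the original setup):

import tensorflow as tf

with tf.io.TFRecordWriter('stacked.tfrecords') as writer:
    for stack, label in make_stacks():  # hypothetical: yields (np.ndarray, int)
        example = tf.train.Example(features=tf.train.Features(feature={
            'x': tf.train.Feature(float_list=tf.train.FloatList(value=stack.ravel())),
            'y': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())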
Desired Approach
An implementation where we save only the individual feature vectors in a tfrecords file and create the stacked feature vectors (images) on the fly.
Question
Has this been encountered and dealt with before? Would most people just not care about the high level of repetition in their data or the amount of space consumed?
How difficult is it to get the desired implementation? Even naively shuffling the data would mess up the time ordering. I've also only used tf.data in a way where I call iterator.get_next() to get the features and label of one example and then feed them to the graph. I suppose I'd also need to figure out how to get not only the current example's features and label, but also those of its predecessors in time. So I'm looking for some tips on how to accomplish this.
Here is an image of the process for images formed from 2 frames of feature vectors. Note that in this case, each feature vector would appear twice in the tfrecords data file:
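On the desired approach: one way to build the stacks on the fly is tf.data's window transformation. A minimal sketch, under the assumption that the frames are parsed from the tfrecords file in time order (stand-in data used here; labels can be carried along with tf.data.Dataset.zip):

import tensorflow as tf

N_FRAMES = 10     # frames per stacked "image" (assumption)
N_FEATURES = 100  # dimensionality of each frame's feature vector

# Stand-in for the dataset of single frames parsed from the tfrecords file.
frames = tf.data.Dataset.from_tensor_slices(tf.random.normal([1000, N_FEATURES]))

# Sliding window over time: each element becomes an (N_FRAMES, N_FEATURES)
# stack, so duplication happens in memory, on the fly, not on disk.
stacked = (frames
           .window(N_FRAMES, shift=1, drop_remainder=True)
           .flat_map(lambda w: w.batch(N_FRAMES)))

# Shuffle the stacked examples; the frame order inside each stack is preserved.
pipeline = stacked.shuffle(500).batch(32).prefetch(tf.data.AUTOTUNE)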