I have the following problem: I have many files of 3D volumes that I open to extract a bunch of numpy arrays.
I want to access those arrays in random order, which means that in the worst case I open as many 3D volumes as there are numpy arrays I want to get, if those arrays all live in separate files.
The IO here isn't great: I open a big file only to get a small numpy array from it.
Any idea how I can store all these arrays so that the IO is better?
I can't pre-read all the arrays and save them in one file, because that file would then be too big to fit in RAM.
I looked up LMDB but it all seems to be about Caffe.
Any idea how I can achieve this?
I iterated through my dataset, created an HDF5 file, and stored the elements in it. It turns out that when an HDF5 file is opened, it doesn't load all the data into RAM; it only loads the metadata (header).
The metadata is then used to fetch the data on request, and that's how I solved my problem.
Reference:
http://www.machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html
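A minimal sketch of this approach with h5py (file and dataset names are placeholders, and the dummy arrays just stand in for the real extracted volumes):

import h5py
import numpy as np

arrays = [np.random.rand(16, 16, 16) for _ in range(100)]   # stand-ins for the extracted arrays

# Write each array as its own dataset.
with h5py.File('arrays.h5', 'w') as f:
    for i, arr in enumerate(arrays):
        f.create_dataset(f'array_{i}', data=arr)

# Opening the file later only reads the metadata; the data of a dataset
# is fetched from disk only when it is sliced.
with h5py.File('arrays.h5', 'r') as f:
    sample = f['array_42'][:]            # reads just this one array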
One simple solution is to pre-process your dataset and save multiple smaller crops of the original 3D volumes as separate files. This way you sacrifice some disk space for more efficient IO.
Note that you can make a trade-off with the crop size here: saving bigger crops than you need as input allows you to still do random-crop augmentation on the fly, as sketched below. If you save overlapping crops in the pre-processing step, you can ensure that all possible random crops of the original dataset can still be produced.
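A minimal sketch of such a pre-processing step plus the on-the-fly random crop (the crop size, stride and file naming are just example choices):

import numpy as np

def save_overlapping_crops(volume, crop_size, stride, out_prefix):
    # Slide a window over the volume and save each (larger-than-needed) crop
    # to its own small .npy file.
    d, h, w = volume.shape
    idx = 0
    for z in range(0, d - crop_size + 1, stride):
        for y in range(0, h - crop_size + 1, stride):
            for x in range(0, w - crop_size + 1, stride):
                crop = volume[z:z+crop_size, y:y+crop_size, x:x+crop_size]
                np.save(f'{out_prefix}_{idx}.npy', crop)
                idx += 1

def random_crop(crop, out_size):
    # At training time, load one small file and take a random sub-crop from it.
    z, y, x = (np.random.randint(0, s - out_size + 1) for s in crop.shape)
    return crop[z:z+out_size, y:y+out_size, x:x+out_size]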
Alternatively, you may try a custom data loader that keeps the full volumes in memory for a few batches. Be careful: this may create some correlation between batches. Since many machine learning algorithms rely on i.i.d. samples (e.g. Stochastic Gradient Descent), correlated batches can easily cause serious problems.
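A rough sketch of what such a loader could look like (load_volume and random_crop are hypothetical helpers, and the cache/refresh sizes are arbitrary):

import random

def cached_volume_batches(volume_paths, load_volume, random_crop,
                          volumes_in_memory=4, batches_per_refresh=10, batch_size=8):
    # Keep a handful of full volumes in memory and serve several batches
    # from them before swapping in new ones. Batches drawn from the same
    # few volumes are correlated, as warned above.
    while True:
        chosen = random.sample(volume_paths, volumes_in_memory)
        volumes = [load_volume(p) for p in chosen]
        for _ in range(batches_per_refresh):
            yield [random_crop(random.choice(volumes)) for _ in range(batch_size)]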
Related
I wrote some code to generate a large dataset of complex numpy matrices for ML applications, which I would like to somehow store on disk.
The most suitable idea seems to be saving the matrices as separate binary files. However, functions such as bytearray() seem to flatten the matrices into 1D arrays, thus losing the information about the matrix shape.
I guess I might need to fill each line independently, maybe using an additional for loop, but this would also require a for loop when loading and re-assembling the matrix.
What would be the correct procedure for storing those matrices in a way that minimizes the amount of space on disk and loading time?
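For reference, numpy's own .npy/.npz format stores the shape and dtype in a small header, so no manual row-by-row loop is needed; a minimal sketch (file names are placeholders):

import numpy as np

m = (np.random.rand(64, 64) + 1j * np.random.rand(64, 64)).astype(np.complex128)

# Raw buffers such as m.tobytes() or bytearray(m) keep only the flat data,
# but np.save records shape and dtype in the file header as well.
np.save('matrix_0.npy', m)
loaded = np.load('matrix_0.npy')
assert loaded.shape == m.shape and loaded.dtype == m.dtype

# Many matrices can go into one (optionally compressed) archive:
np.savez_compressed('matrices.npz', m0=m)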
I have a project that uses HDF5. There are file structures as well as HDF5 data structures for each dataset.
Think of a large video. Each frame is divided up equally and written to multiple files as well as multiple HDF5 data chunks. A single 'video' may have 20+ files (representing the temporal dimension and slices), and then more files representing additional slices. The datasets aren't very large (under 30 GB) but are still cumbersome.
My initial idea for associating (stitching) the pieces back together was to build an array of pointers to the individual frames, and then stack them along the temporal axis of the video. This would be (fairly) small since I would only be pointing to the locations on disk where everything lives. It would also limit the amount of data I'd have to hold in memory (always a bonus) for when I scale to the 'larger' datasets.
However, the way to accomplish this in Python eludes me, especially considering that I want to tie in the metadata for each frame (pixels, their locations, etc.).
Is there a method I should be following to better reference the data and 'stitch' it back together? My current method is to create numpy arrays of the raw data, which has the drawback of reading all of the data in and storing it in memory (and on disk).
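One possibility that seems to fit the 'array of pointers' idea is to keep h5py dataset objects (and their attributes) instead of the raw arrays; they only touch the disk when sliced. A rough sketch, with file layout and dataset names assumed:

import h5py

# Hypothetical layout: one file per temporal chunk, each holding a 'frames' dataset.
frame_file_paths = ['chunk_00.h5', 'chunk_01.h5']        # placeholder paths
files = [h5py.File(p, 'r') for p in frame_file_paths]
frame_refs = [f['frames'] for f in files]                # lazy handles, no data read yet

# Per-frame metadata (pixel size, slice location, ...) can live in HDF5 attributes.
meta = [dict(ds.attrs) for ds in frame_refs]

# Data is only read when a particular frame is actually indexed.
frame = frame_refs[0][10]                                # frame 10 of the first chunk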
So I have this large HDF5 file that contains multiple large datasets. I am accessing it with h5py and want to read parts of each of those datasets into a common ndarray. Unfortunately, slicing across datasets is not supported, so I was wondering what the most efficient way is to assemble the ndarray from those datasets under these circumstances.
Currently, I am using something along the following lines:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(assemble_array, range(NrDatasets))
where the function assemble_array reads the data into a predefined ndarray buffer of appropriate size, but this is not fast enough :/
Can anyone help?
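For what it's worth, a sketch of how assemble_array could copy straight into a preallocated buffer with read_direct (file name, dataset layout and slice sizes are all assumptions); note that h5py generally serialises access to the HDF5 library internally, so a thread pool may not parallelise the raw reads much:

import concurrent.futures
import h5py
import numpy as np

f = h5py.File('big_file.h5', 'r')                        # placeholder file name
names = sorted(f.keys())
rows = 1000                                              # assumed slice length per dataset
cols = f[names[0]].shape[1]
buffer = np.empty((len(names), rows, cols), dtype=f[names[0]].dtype)

def assemble_array(i):
    # read_direct writes the selected hyperslab straight into the buffer slot,
    # avoiding an intermediate temporary array.
    f[names[i]].read_direct(buffer[i], source_sel=np.s_[:rows, :])

with concurrent.futures.ThreadPoolExecutor() as executor:
    list(executor.map(assemble_array, range(len(names))))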
For a Deep Learning application I am building, I have a dataset of about 50k grayscale images, ranging from about 300*2k to 300*10k pixels. Loading all this data into memory is not possible, so I am looking for a proper way to handle reading random batches of data. One extra complication is that I need to know the width of each image before building my Deep Learning model, in order to define different size buckets within the data (for example: [2k-4k, 4k-6k, 6k-8k, 8k-10k]).
Currently, I am working with a smaller dataset and just load each image from a png file, bucket them by size and start learning. When I want to scale up this is no longer possible.
To train the model, each batch of the data should be (ideally) fully random from a random bucket. A naive way of doing this would be saving the sizes of the images beforehand, and just loading each random batch when it is needed. However, this would result in a lot of extra loading of data and not very efficient memory management.
Does anyone have a suggestion how to handle this problem efficiently?
Cheers!
Why not add a preprocessing step, where you would either (a) physically move the images to folders associated with their bucket and/or rename them, or (b) first scan through all images (headers only) to build an in-memory table of image filenames and their sizes/buckets? The random sampling step would then be quite simple to implement.
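A sketch of option (b), assuming the images are PNG files and Pillow is available (Image.open is lazy, so only the header is read to get the size):

import os
from collections import defaultdict
from PIL import Image

bucket_edges = [(2000, 4000), (4000, 6000), (6000, 8000), (8000, 10000)]
buckets = defaultdict(list)

image_dir = 'images'                         # placeholder directory
for name in os.listdir(image_dir):
    if not name.lower().endswith('.png'):
        continue
    path = os.path.join(image_dir, name)
    with Image.open(path) as img:            # reads only the header here
        width = img.size[0]
    for lo, hi in bucket_edges:
        if lo <= width < hi:
            buckets[(lo, hi)].append(path)
            break

# A random batch is then: pick a bucket, sample some paths, and load only those files.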
My situation is like this:
1. I have around ~70 million integer values distributed across various files for ~10 categories of data (the exact number is not known).
2. I read those files and create a Python object from that data. This obviously involves reading each file line by line and appending to the object, so I'll have an array with 70 million sub-arrays, with 10 values in each.
3. I do some statistical processing on that data. This involves appending several values (say, a percentile rank) to each 'row' of data.
4. I store this object in a database.
Now, I have never worked with data at this scale. My first instinct was to use NumPy for more memory-efficient arrays. But then I've heard that appending to NumPy arrays is discouraged because it's inefficient.
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.
EDIT: Edited for clarity about size and type of data.
If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints that would still only come to about 6GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
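A minimal np.memmap sketch under the shape described above (the file name is a placeholder; 70 million rows of 10 int64 values is about 5.6 GB on disk):

import numpy as np

# Create a disk-backed array; only the pages that are touched end up in RAM.
data = np.memmap('values.dat', dtype=np.int64, mode='w+', shape=(70_000_000, 10))

data[0, :] = np.arange(10)                   # write like a normal array
col_mean = data[:, 0].mean()                 # column statistics stream from disk

# Reopen later (read-only) with the same dtype and shape.
data_ro = np.memmap('values.dat', dtype=np.int64, mode='r', shape=(70_000_000, 10))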
Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.
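And a comparable sketch with h5py using chunked, compressed storage (the dataset name, chunk size and compression settings are just example choices):

import h5py
import numpy as np

with h5py.File('values.h5', 'w') as f:
    dset = f.create_dataset('values', shape=(70_000_000, 10), dtype='int64',
                            chunks=(100_000, 10), compression='gzip')
    # Fill the dataset block by block while parsing the input files
    # (zeros are just a stand-in for real rows here).
    dset[:100_000] = np.zeros((100_000, 10), dtype='int64')

with h5py.File('values.h5', 'r') as f:
    block = f['values'][:1_000_000]          # decompressed on the fly, block by block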