Proper way of loading large amounts of image data - python

For a Deep Learning application I am building, I have a dataset of about 50k grayscale images, ranging from about 300*2k to 300*10k pixels. Loading all this data into memory is not possible, so I am looking for a proper way to handle reading in random batches of data. One extra complication: I need to know the width of each image before building my Deep Learning model, in order to define different size buckets within the data (for example: [2k-4k, 4k-6k, 6k-8k, 8k-10k]).
Currently, I am working with a smaller dataset and just load each image from a png file, bucket them by size and start learning. When I want to scale up this is no longer possible.
To train the model, each batch of data should (ideally) be fully random, drawn from a random bucket. A naive way of doing this would be to save the sizes of the images beforehand and load each random batch from disk when it is needed. However, this would result in a lot of redundant loading of data and rather inefficient memory management.
Does anyone have a suggestion how to handle this problem efficiently?
Cheers!

Why not add a preprocessing step, where you would either (a) physically move the images into folders associated with their bucket and/or rename them, or (b) first scan through all images (headers only) to build an in-memory table of image filenames and their sizes/buckets? The random sampling step would then be quite simple to implement.
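If it helps, here is a minimal sketch of option (b), assuming PNG files in a single directory and made-up bucket edges; PIL's Image.open only parses the file header, so scanning 50k files for their widths is cheap:

```python
import os
import random
from collections import defaultdict
from PIL import Image

def build_bucket_index(image_dir, bucket_edges=(4000, 6000, 8000, 10000)):
    """Group image paths by width bucket, reading only the file headers."""
    buckets = defaultdict(list)
    for name in os.listdir(image_dir):
        if not name.lower().endswith(".png"):
            continue
        path = os.path.join(image_dir, name)
        with Image.open(path) as img:      # header only, pixel data is not decoded
            width, _height = img.size
        for edge in bucket_edges:          # first edge the width falls under
            if width <= edge:
                buckets[edge].append(path)
                break
    return buckets

buckets = build_bucket_index("images/")
batch_paths = random.sample(buckets[6000], 16)   # a random batch from one bucket
```

Only the sampled batch then needs to be decoded in full before training.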

Related

The most space efficient way of storing images in memory in Python?

I'm currently working on an image server project in Python and I've been unable to find information on the most efficient way of storing the images in memory. I've tried using sys.getsizeof to get the size of an object in memory, but it doesn't return an accurate size for numpy arrays or PIL images. I've also thought about bypassing Python modules altogether and storing the images in a tmpfs ramdisk, loading them back in when the images are requested. I'd appreciate any insights into best practices for this problem.
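As a side note on the measurement problem (a sketch, not specific to the original project): for numpy arrays the `.nbytes` attribute reports the exact size of the underlying buffer, and converting a decoded PIL image to an array gives a comparable figure:

```python
import numpy as np
from PIL import Image

arr = np.zeros((3000, 4000), dtype=np.uint8)   # one 12 MB grayscale frame
print(arr.nbytes)                              # 12000000: size of the raw pixel buffer

img = Image.fromarray(arr)
print(np.asarray(img).nbytes)                  # same figure for the decoded PIL image
```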

Pythonic way to access HDF5 data structures similar to pointers

I have a project that is utilizing HDF5. There are file structures as well as HDF5 data structures for each dataset.
Think of a large video. Each frame is divided up equally and written to multiple files as well as multiple HDF5 data chunks. A single 'video' may have 20+ files (representing temporal and spatial slices), and then more files to represent additional slices. The datasets aren't very large (under 30 GB) but are still cumbersome.
My initial idea for associating (stitching) the pieces back together was to build an array of pointers to the individual frames, and then stack them for the temporal aspect of the video. This would be (fairly) small, since I would only be pointing to the locations on disk where everything is. This would also limit the amount of data I'd have to hold in memory (always a bonus) for when I scale to the 'larger' datasets.
However, the way to accomplish this in Python eludes me, especially considering that I want to tie in the metadata for each frame (pixels, their locations, etc.).
Is there a method I should be following to better reference the data and 'stitch' it back together? My current approach is to create numpy arrays of the raw data, which has the drawback of reading all of the data in and storing it in memory (and on disk).
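One hedged sketch of the "array of pointers" idea with h5py (file and dataset names here are hypothetical): an open h5py Dataset is already a lazy handle to data on disk, so a plain Python list of (dataset, local_index) pairs acts as the pointer table, and only the requested frame is read when you slice it:

```python
import h5py

# Hypothetical layout: each chunk file holds a dataset named "frames"
# with shape (n_frames, height, width).
chunk_files = ["video_part0.h5", "video_part1.h5", "video_part2.h5"]

# Build the "pointer table": open handles plus frame offsets, no pixel data read yet.
handles = [h5py.File(path, "r") for path in chunk_files]
frame_index = []  # (dataset, local_index) pairs, in temporal order
for f in handles:
    ds = f["frames"]
    frame_index.extend((ds, i) for i in range(ds.shape[0]))

def read_frame(global_idx):
    """Read a single frame from disk on demand."""
    ds, i = frame_index[global_idx]
    return ds[i]          # h5py slices lazily: only this frame is loaded

frame_42 = read_frame(42)
```

Per-frame metadata could live alongside the frames as HDF5 attributes or as a small parallel dataset, so the pointer table stays lightweight.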

Faster pytorch dataset file

I have the following problem: I have many files of 3D volumes that I open to extract a bunch of numpy arrays.
I want to access those arrays randomly, which means in the worst case I open as many 3D volumes as the number of numpy arrays I want, if all those arrays are in separate files.
The IO here isn't great: I open a big file only to get a small numpy array from it.
Any idea how I can store all these arrays so that the IO is better?
I can't pre-read all the arrays and save them all in one file, because then that file would be too big to fit into RAM.
I looked up LMDB but it all seems to be about Caffe.
Any idea how I can achieve this?
I iterated through my dataset, created an HDF5 file and stored the elements in it. It turns out that when the HDF5 file is opened, it doesn't load all the data into RAM; it loads the header instead.
The header is then used to fetch the data on request; that's how I solved my problem.
Reference:
http://www.machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html
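A minimal sketch of that approach wrapped as a PyTorch Dataset (the file name, dataset name, and layout are assumptions, not from the original answer): the HDF5 file is opened lazily and each __getitem__ reads only the requested array from disk.

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5ArrayDataset(Dataset):
    """Serves numpy arrays stored in one HDF5 dataset, reading lazily per item."""
    def __init__(self, h5_path, dataset_name="arrays"):
        self.h5_path = h5_path
        self.dataset_name = dataset_name
        self._file = None  # opened lazily so each DataLoader worker gets its own handle
        with h5py.File(h5_path, "r") as f:
            self._len = f[dataset_name].shape[0]

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        item = self._file[self.dataset_name][idx]  # only this slice is read from disk
        return torch.from_numpy(item)

loader = DataLoader(H5ArrayDataset("dataset.h5"), batch_size=8, shuffle=True)
```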
One trivial solution is to pre-process your dataset, saving multiple smaller crops of each original 3D volume separately. This way you sacrifice some disk space for more efficient IO.
Note that you can make a trade-off with the crop size here: saving bigger crops than you need for input allows you to still do random crop augmentation on the fly. If you save overlapping crops in the pre-processing step, then you can ensure that still all possible random crops of the original dataset can be produced.
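As an illustration of that overlap trade-off (the sizes below are made up): if the saved crops are taken with a stride no larger than crop_size - input_size + 1 along each axis, every possible random crop of the model's input size is guaranteed to lie inside at least one saved crop.

```python
def crop_starts(length, crop_size, input_size):
    """Start offsets for overlapping saved crops such that every possible
    random crop of `input_size` fits inside at least one saved crop."""
    stride = crop_size - input_size + 1          # largest stride that still covers everything
    starts = list(range(0, length - crop_size + 1, stride))
    if starts[-1] != length - crop_size:         # make sure the tail is covered too
        starts.append(length - crop_size)
    return starts

# e.g. a 512-voxel axis, saving 128-voxel crops for 96-voxel random crops at train time
print(crop_starts(512, 128, 96))
```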
Alternatively, you may try using a custom data loader that retains the full volumes for a few batches. Be careful: this might create some correlation between batches. Since many machine learning algorithms rely on i.i.d. samples (e.g. Stochastic Gradient Descent), correlated batches can easily cause serious problems.

Pdf to image conversion takes an enormous amount of space

I have a quick and dirty python script that takes a pdf as input and saves the pages as an array of images (using pdf2image).
What I don't understand: 72 images take up 920MB of memory. However, if I save the images to file and then reload them, I get to barely over 30-40MB (combined size of the images is 29MB). Does that make sense?
I also tried to dump the array using pickle, and it reaches about 3GB before crashing with a MemoryError. I'm at a complete loss as to what is eating up so much memory...
The reason for the huge memory usage is most likely an excessive amount of metadata, uncompressed image data (raw color data), or a lossless image codec within the library/tool itself.
It might also depend on the size and number of images, etc.
On the last remark, regarding pickle: pickle is essentially a memory dump format used by Python to preserve certain variable states. Dumping memory to a session state on disk is quite a heavy task. Not only does Python need to convert everything to a format that enables the saved state, it must also copy all the data to a known state upon saving it. Therefore it can use up quite a lot of RAM and disk in order to do so. (Usually the only way around this is to chunk up the data.)
Following up on some comments, one solution would be to pass the parameter fmt=jpg, which keeps the images in a compressed state and lowers the resource usage a bit.
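A hedged sketch of that suggestion with pdf2image (the DPI, file names, and folder are illustrative): asking for JPEG output and an output_folder writes each page to disk instead of accumulating raw bitmaps in memory.

```python
import os
from pdf2image import convert_from_path

os.makedirs("pages_out", exist_ok=True)

# Each page is written to pages_out/ as a JPEG; the returned PIL images are
# opened from those files rather than kept as large uncompressed bitmaps.
pages = convert_from_path(
    "input.pdf",
    dpi=150,                    # lower DPI means smaller page bitmaps
    fmt="jpeg",                 # compressed output instead of raw PPM
    output_folder="pages_out",
)
```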

Creating a Training/Validation LMDB for NVIDIA Digits in Python

I'm trying to make a Training/Validation LMDB set for use with NVIDIA Digits, but I can't find any good examples/tutorials.
I understand how to create an LMDB database, but I'm uncertain how to correctly format the data. I get how to create an image entry using the caffe_pb2 Datum by setting channels/width/height/data and saving it.
But, how do I create the Labels LMDB? Do I still use a Caffe Datum? If so, what do I set the channels/width/height to? Will it work if I have a single value label?
Thanks
DIGITS only really supports data in LMDBs for now. Each value in the LMDB key/val store must be a Caffe Datum, which limits the number of dimensions to 3.
Even though Caffe Datums allow for a single numeric label (datum.label), when uploading a prebuilt LMDB to DIGITS you need to specify a separate database for the labels. That's inefficient if you only have a single numeric label (since you could have done it all in one DB), but it's more generic and scalable to other label types.
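For what it's worth, a minimal sketch of writing such a pair of LMDBs in Python (assuming the lmdb package and the Caffe protobuf bindings; the keys, shapes, and the way the label is packed are illustrative, so check the DIGITS source files below for the exact format it expects):

```python
import lmdb
import numpy as np
from caffe.proto import caffe_pb2

def write_datum(txn, key, array):
    """Store a (channels, height, width) array as a serialized Caffe Datum."""
    datum = caffe_pb2.Datum()
    datum.channels, datum.height, datum.width = array.shape
    datum.data = array.tobytes()
    txn.put(key.encode("ascii"), datum.SerializeToString())

# Data LMDB: grayscale images stored as (1, H, W) uint8
data_env = lmdb.open("train_data_lmdb", map_size=1 << 30)
with data_env.begin(write=True) as txn:
    write_datum(txn, "00000000", np.zeros((1, 64, 64), dtype=np.uint8))

# Label LMDB: same keys, each label packed as a tiny (1, 1, 1) Datum
label_env = lmdb.open("train_labels_lmdb", map_size=1 << 30)
with label_env.begin(write=True) as txn:
    write_datum(txn, "00000000", np.array([[[3]]], dtype=np.uint8))
```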
Sorry, you're right that this isn't documented very well right now. Here are some source files you can browse for inspiration if you're so inclined:
Data are images, labels are (gradientX, gradientY)
Data are image pairs, labels are 1 or 0
Data are text snippets, labels are numeric classes
Generic script for creating LMDBs from any data blobs coming from extensions/plugins
