Creating large .npy files with memory efficiency - python

I'm trying to create very large .npy files and I'm having a bit of difficulty. For instance, I need to create a (500, 1586, 2048, 3) matrix and save it to an .npy file, and preferably put it in an npz_compressed file. I also need this to be memory efficient so it can run on low-memory systems. I've tried a few methods, but none have worked so far. I've written and re-written things so many times that I don't have code snippets for everything, but I'll describe the methods as best I can, with code snippets where I can. Also, apologies for bad formatting.
Method 1: Create an ndarray with all my data in it, then use savez_compressed to export it.
This gets all my data into the array, but it's terrible for memory efficiency. I filled all 8 GB of RAM, plus 5 GB of swap space. I got it to save my file, but it doesn't scale, as my matrix could get significantly larger.
Use " np.memmap('file_name_to_create', mode='w+', shape=(500,1586,2048,3)) " to create the large, initial npy file, then add my data.
This method worked for getting my data in, and it's pretty memory efficient. However, i can no longer use np.load to open the file (get errors associated with pickle, regardless of if allow_pickle is true or false), which means I can't put it into compressed. I'd be happy with this format, if I can get it into the compressed format, but I just can't figure it out. I'm trying to avoid using gzip if possible.
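Roughly, a minimal sketch of what this method looks like (the filename, dtype, and per-frame loader are placeholders):

import numpy as np

shape = (500, 1586, 2048, 3)

# np.memmap writes a raw buffer with no .npy header, which may be why
# np.load later falls back to pickle and errors out.
arr = np.memmap('big_file.dat', dtype=np.uint8, mode='w+', shape=shape)

for i in range(shape[0]):
    arr[i] = load_one_frame(i)   # hypothetical per-frame data source
arr.flush()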
Method 3: Create a (1, 1, 1, 1) zeros array and save it with np.save, then open it with np.memmap using the same shape as before.
This runs into the same issue as method 2: I can no longer use np.load to read it back afterwards.
Method 4: Create five (100, ...) npy files with method 1 and save them with np.save, then read two in using np.load(mmap_mode='r+') and merge them into one large npy file.
Creating the individual npy files wasn't bad on memory, maybe 1 GB to 1.5 GB. However, I couldn't figure out how to merge the npy files without actually loading them entirely into RAM. I read in another Stack Overflow answer that npy files aren't really designed for this at all, and that it would be better to use an .h5 file for this kind of appending.
Those are the main methods I've used. I'm looking for feedback on whether any of these methods would work, which one would work 'best' for memory efficiency, and maybe some guidance on getting that method to work. I also wouldn't be opposed to moving to .h5 if that would be the best method; I just haven't tried it yet.
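For comparison, here is a rough, untested sketch of what the .h5 route might look like with h5py, writing one batch at a time (the dataset name, dtype, chunk layout, and placeholder batches are assumptions on my part):

import h5py
import numpy as np

shape = (500, 1586, 2048, 3)

# One chunk per leading index keeps writes incremental; gzip compression
# is applied per chunk, so memory use stays near one chunk's size.
with h5py.File('big_data.h5', 'w') as f:
    dset = f.create_dataset('data', shape=shape, dtype='uint8',
                            chunks=(1,) + shape[1:], compression='gzip')
    for i in range(shape[0]):
        dset[i] = np.zeros(shape[1:], dtype='uint8')   # placeholder batch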

Try it on Google Colab, which uses a GPU to run it.

Related

When is it better to use npz files instead of csv?

I'm looking at some machine learning/forecasting code using Keras, and the input data sets are stored in npz files instead of the usual csv format.
Why would the authors go with this format instead of csv? What advantages does it have?
It depends on the expected usage. If a file is expected to have broad use cases, including direct access from ordinary client machines, then csv is fine because it can be loaded directly into Excel or LibreOffice Calc, which are widely deployed. But it is just a good old text file with no indexes or any additional features.
On the other hand, if a file is only expected to be used by data scientists or, generally speaking, numpy-aware users, then npz is a much better choice because of the additional features (compression, lazy loading, etc.).
Long story short, you trade a larger audience for more features.
From https://kite.com/python/docs/numpy.lib.npyio.NpzFile
A dictionary-like object with lazy-loading of files in the zipped archive provided on construction.
So, it is a zipped archive (smaller size than CSV on disk, and more than one array can be stored), and individual arrays are loaded from disk only when needed (with CSV, when you only need one column, you still have to read and parse the whole file).
=> advantages are: performance and more features
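A small illustration of that lazy loading (the array names here are made up):

import numpy as np

# Save two arrays into one compressed archive.
np.savez_compressed('dataset.npz',
                    features=np.random.rand(1000, 50),
                    labels=np.arange(1000))

# Opening the archive only reads the zip directory; each member array
# is decompressed from disk the first time it is accessed by key.
archive = np.load('dataset.npz')
labels = archive['labels']   # only this member is read now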

Numpy memory mapping issue

I've been working recently with large matrices. My inputs are stored as 15 GB .npz files, which I am trying to read incrementally, in small batches.
I am familiar with memory mapping, and since numpy also seems to support these kinds of operations, it looked like a perfect solution. However, the problem I am facing is as follows:
I first load the matrix with:
foo = np.load('matrix.npz', mmap_mode="r+")
foo has a single key: data.
When I try to, for example do:
foo['data'][1][1]
numpy seems to endlessly consume the available RAM, almost as if there were no memory mapping. Am I doing anything wrong?
My goal would be, for example, to read 30 lines at a time:
for x in np.arange(0, matrix.shape[0], 30):
    batch = matrix[x:(x + 30), :]
    do_something_with(batch)
Thank you!
My guess would be that mmap_mode="r+" is ignored when the file in question is a zipped numpy file. I haven't used numpy in this way, so some of what follows is my best guess. The documentation for load states
If the file is a .npz file, then a dictionary-like object is returned, containing {filename: array} key-value pairs, one for each file in the archive.
No mention of what it does with mmap_mode. However, in the code for loading .npz files, no use is made of the mmap_mode keyword:
if magic.startswith(_ZIP_PREFIX):
    # zip-file (assume .npz)
    # Transfer file ownership to NpzFile
    tmp = own_fid
    own_fid = False
    return NpzFile(fid, own_fid=tmp, allow_pickle=allow_pickle,
                   pickle_kwargs=pickle_kwargs)
So, your initial guess is indeed correct: numpy uses all of the RAM because there is no memmapping occurring. This is a limitation of the implementation of load; since the npz format is an uncompressed zip archive, it should be possible to memmap the variables (unless, of course, your files were created with savez_compressed).
Implementing a load function that memmaps npz would be quite a bit of work, though, so you may want to take a look at structured arrays. They provide similar usage (access of fields by keys) and are already compatible with memmapping.
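For example, a plain .npy file holding a structured array can be memory-mapped directly with np.load, which gives key-style access to fields without reading everything into RAM (the field names, shapes, and sizes below are made up):

import numpy as np

# A structured dtype gives named-field access, somewhat like npz keys.
dt = np.dtype([('data', np.float32, (2048,)), ('label', np.int64)])
np.save('matrix.npy', np.zeros(1000, dtype=dt))

# .npy files (unlike .npz archives) honour mmap_mode, so only the
# slices you touch are actually read from disk.
mm = np.load('matrix.npy', mmap_mode='r')
for x in range(0, mm.shape[0], 30):
    batch = mm['data'][x:x + 30]
    # do_something_with(batch)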

Pdf to image conversion takes an enormous amount of space

I have a quick and dirty python script that takes a pdf as input and saves the pages as an array of images (using pdf2image).
What I don't understand: 72 images take up 920 MB of memory. However, if I save the images to file and then reload them, I get to barely over 30-40 MB (the combined size of the images is 29 MB). Does that make sense?
I also tried to dump the array using pickle and I get to about 3 GB before it crashes with a MemoryError. I'm at a complete loss as to what is eating up so much memory...
The reason for the huge memory usage is most likely an excessive amount of metadata, uncompressed image data (raw color data), or a lossless image codec within the library/tool itself.
It might also depend on the size, the number of images, etc.
On the last remark, regarding pickle: pickle is essentially a memory-dump format used by Python to preserve certain variable states. Dumping memory to a session state on disk is quite a heavy task. Not only does Python need to convert everything to a format that enables the saved state, it must also copy all the data to a known state while saving it. Therefore it can use up quite a lot of RAM and disk in order to do so. (The only way around this is usually to chunk up the data.)
In response to some comments, one solution would be to pass the parameter fmt='jpg', which keeps the images in a compressed state, lowering resource usage a bit.
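A hedged sketch of what that might look like with pdf2image (the input path and output folder are placeholders, and output_folder is optional; when given, pages are written to disk instead of being kept as in-memory objects):

from pdf2image import convert_from_path

# fmt='jpeg' keeps each page JPEG-compressed rather than as raw image data;
# output_folder (an existing directory) writes the pages to disk.
pages = convert_from_path('input.pdf', dpi=200, fmt='jpeg',
                          output_folder='/tmp/pages')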

Manipulating Large Amounts of Image Data in Python

I have a large number of images of different categories, e.g. "Cat", "Dog", "Bird". The images have some hierarchical structure, like a dict. So for example the key is the animal name and the value is a list of animal images, e.g. animalPictures[animal][index].
I want to manipulate each image (e.g. compute histogram) and then save the manipulated data in an identical corresponding structure, e.g. animalPictures['bird'][0] has its histogram stored in animalHistograms['bird'][0].
The only issue is I do not have enough memory to load all the images, perform all the manipulations, and create an additional structure of the transformed images.
Is it possible to load an image from disk, manipulate the image, and stream the data to a dict on disk? This way I could work on a per-image basis and not worry about loading everything into memory at once.
Thanks!
I would suggest using the shelve module, which provides dictionary-like persistent objects backed by a file. For the most part you can use them like ordinary dictionaries, so converting existing code to use them is relatively easy. There are a couple of additional methods for controlling the write-back of cached entries and for closing the object.
Basically it lets you transparently store your Python dictionary data in an underlying file-based database. All entries accessed are also cached in memory. However, you can prevent the cache from consuming too much memory by manually calling the sync() method, which writes any modified entries back to disk and empties the cache.
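A minimal sketch of that pattern (the histogram function, file names, and the placeholder animalPictures dict are illustrative, not a drop-in implementation):

import shelve
import numpy as np

# Placeholder for the question's existing structure.
animalPictures = {'bird': [np.zeros((64, 64))], 'cat': [np.zeros((64, 64))]}

def compute_histogram(image):
    # Placeholder manipulation: a 256-bin intensity histogram.
    return np.histogram(image, bins=256)[0]

with shelve.open('animalHistograms', writeback=True) as histograms:
    for animal, images in animalPictures.items():
        histograms[animal] = [compute_histogram(img) for img in images]
        histograms.sync()   # flush modified entries to disk, empty the cache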
It is possible if you use NumPy, and especially numpy.memmap, to store the image data. That way the image data looks as if it were in memory, but lives on disk via the mmap mechanism. The nice thing is that numpy.memmap arrays are no more difficult to handle than ordinary arrays.
There is some performance overhead, as all memmap arrays are saved to disk. They could be described as "disk-backed arrays", i.e. the data is also kept in RAM as long as possible. This means that if you access some data array very often, it is most likely in memory, and there is no disk-read overhead.
So, keep your metadata in a dict in memory, but memmap your bigger arrays.
This is probably the easiest way. However, make sure you have a 64-bit Python in use, as the 32-bit one runs out of addresses at 2 GiB.
Of course, there are a lot of ways to compress image data. If your data can be compressed, you might consider using compression to save memory.
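A rough sketch of that split between in-memory metadata and disk-backed pixel arrays (shapes, dtypes, and file names are illustrative):

import numpy as np

n_images, height, width, channels = 1000, 480, 640, 3

# Disk-backed storage for the pixel data; only the pages you actually
# touch need to be resident in RAM at any given time.
pixels = np.memmap('bird_images.dat', dtype=np.uint8, mode='w+',
                   shape=(n_images, height, width, channels))

# Small metadata stays in an ordinary in-memory dict.
metadata = {'bird': {'count': n_images, 'file': 'bird_images.dat'}}

pixels[0] = 0                                  # write one image's pixels
hist = np.histogram(pixels[0], bins=256)[0]    # manipulate a single image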

efficient array concatenation

I'm trying to concatenate several hundred arrays totaling almost 25 GB of data. I am testing on a 56 GB machine, but I receive a memory error. I reckon the way I do my process is inefficient and is sucking up lots of memory. This is my code:
import os
import numpy

for dirname, dirnames, filenames in os.walk('/home/extra/AllData'):
    filenames.sort()
    BigArray = numpy.zeros((1, 200))
    for filename in filenames:
        newArray = numpy.load(os.path.join(dirname, filename))
        BigArray = numpy.concatenate((BigArray, newArray))
Any ideas, thoughts or solutions?
Thanks
Your process is horribly inefficient. When handling such huge amounts of data, you really need to know your tools.
For your problem, np.concatenate is forbidden - it needs at least twice the memory of the inputs. Plus it will copy every bit of data, so it's slow, too.
Use numpy.memmap to load the arrays. That will use only a few bytes of memory while still being pretty efficient.
Join them using np.vstack. Call this only once (i.e. don't do bigArray = vstack((bigArray, newArray)) in a loop!). Load all the arrays into a list allArrays and then call bigArray = vstack(allArrays).
If that is really too slow, you need to know the size of the array in advance, create an array of this size once and then load the data into the existing array (instead of creating a new one every time).
Depending on how often the files on disk change, it might be much more efficient to concatenate them with the OS tools to create one huge file and then load that (or use numpy.memmap)
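A sketch of the suggested approach, reusing the question's directory path (everything else is illustrative):

import os
import numpy as np

all_arrays = []
for dirname, dirnames, filenames in os.walk('/home/extra/AllData'):
    for filename in sorted(filenames):
        # mmap_mode='r' keeps each array disk-backed instead of reading
        # it fully into RAM at load time.
        all_arrays.append(np.load(os.path.join(dirname, filename),
                                  mmap_mode='r'))

# A single vstack allocates the output once and copies each input once.
big_array = np.vstack(all_arrays)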
