Python: Memory usage keeps increasing while reading TIFFArray

I have a list of "TIFFFiles", where each "TIFFFile" contains a "TIFFArray" with 60 TIFF images, each 2776x2080 pixels. The images are read as numpy.memmap objects.
I want to access all intensities of the images (shape of imgs: (60,2776,2080)). I use the following code:
for i in xrange(18):
    # get an instance of type TIFFArray from tiff_list
    tiffs = get_tiff_arrays(smp_ppx, type_subfile, tiff_list[i])
    # access all intensities from tiffs
    imgs = tiffs[:, :, :]
Even though "tiffs" and "imgs" are overwritten in each iteration, my memory usage grows by 2.6 GB per iteration. How can I avoid the data being copied in each iteration? Is there any way the 2.6 GB of memory can be reused?

I know this is probably not an answer, but it might help anyway and was too long for a comment.
Some time ago I had a memory problem while reading large (>1 GB) ASCII files with numpy: reading the file with numpy.loadtxt used all the memory (8 GB) plus some swap.
From what I understood, if you know the size of the array to fill in advance, you can allocate it yourself and pass it to, e.g., loadtxt. This should prevent numpy from allocating temporary objects and might be better memory-wise.
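A minimal sketch of that preallocation idea, assuming the shape is known up front (the file name and shape are hypothetical; as far as I know, loadtxt does not accept a preallocated output array, so the rows are parsed manually here):

import numpy as np

n_rows, n_cols = 1000000, 10                         # assumed to be known in advance
arr = np.empty((n_rows, n_cols), dtype=np.float64)   # allocate the full array once
with open("bigfile.txt") as fh:                      # hypothetical ASCII file
    for i, line in enumerate(fh):
        arr[i, :] = [float(x) for x in line.split()]  # fill in place, no large temporaries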
mmap, or similar approaches, can help improve memory usage, but I've never used them.
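For completeness, a minimal sketch of the memmap approach for a raw binary file (the file name, dtype and shape are hypothetical); a memory map only pages in the parts you actually touch:

import numpy as np

# map an existing binary file without reading it all into RAM
data = np.memmap("bigfile.dat", dtype=np.float32, mode="r", shape=(60, 2776, 2080))
frame = np.array(data[0])  # only this slice is read from disk and copied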
edit
The problem of memory usage and release is something I wondered about while trying to solve my large-file problem. Basically I had:
def read_f(fname):
    arr = np.loadtxt(fname)  # this uses a lot of memory
    # do operations
    return something

for f in ["verylargefile", "smallerfile", "evensmallerfile"]:
    result = read_f(f)
From the memory profiling I did, no memory was released when loadtxt returned, nor when read_f returned and was called again with a smaller file.

Related

Why do np.uint8 and np.int8 free up memory while other numpy formats do not?

I am working on a project where I read a huge number of images, and I'm struggling with memory management. I was reading the images using int32/float32 (I know the default format should be uint8, but I wanted to try other dtypes since I am going to fuse different kinds of data that may come in different types).
To free up memory, I delete the heavy objects I am no longer using after I have read all the images. For that, I use del and explicitly call the garbage collector:
del object
gc.collect()
Here is where the weird stuff starts. If I use np.int32 or np.float32 (for instance np.zeros((598, 598, 3), dtype=np.int32)) as the dtype for the arrays, the memory is not freed, as can be seen in the image below:
However, if I use np.uint8 or np.int8 as the dtype, the memory is freed, as can be seen in the image below:
I have also noticed that if I limit the number of images I read (up to 8.61 GB of memory), the memory is freed. The two cases can be seen in the images below:
Why might this be happening?
Thank you all in advance!

PDF to image conversion takes an enormous amount of space

I have a quick and dirty python script that takes a pdf as input and saves the pages as an array of images (using pdf2image).
What I don't understand: 72 images take up 920MB of memory. However, if I save the images to file and then reload them, I get to barely over 30-40MB (combined size of the images is 29MB). Does that make sense?
I also tried to dump the array using pickle, and I got to about 3 GB before it crashed with a MemoryError. I'm at a complete loss as to what is eating up so much memory...
The huge memory usage is most likely due to an excessive amount of metadata, uncompressed image data (raw color data), or a lossless image codec inside the library/tool itself.
It may also depend on the size and number of the images, etc.
On the last remark, regarding pickle: pickle is essentially a memory-dump format used by Python to preserve variable state. Dumping memory to a saved state on disk is quite a heavy task. Not only does Python need to convert everything into a format that allows the state to be saved, it must also copy all the data into a known state while saving it. It can therefore use quite a lot of RAM and disk to do so. (Usually the only way around this is to chunk up the data.)
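A minimal sketch of that chunking idea, assuming the converted pages are held in a Python list called pages (a hypothetical name): instead of pickling the whole list at once, each image is dumped separately, so only one is serialized at a time.

import pickle

# 'pages' is the hypothetical list of page images returned by pdf2image
with open("pages.pkl", "wb") as fh:
    for page in pages:
        pickle.dump(page, fh)  # serialize one image at a time

# read them back one at a time instead of loading the whole list
def iter_pages(path):
    with open(path, "rb") as fh:
        while True:
            try:
                yield pickle.load(fh)
            except EOFError:
                return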
Following up on some comments: one solution would be to pass the parameter fmt=jpg, which keeps the images in a compressed state and lowers resource usage a bit.
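A hedged sketch of how that might look with pdf2image (the input and output paths are hypothetical; convert_from_path also accepts an output_folder, so the converted pages land on disk rather than only in memory):

from pdf2image import convert_from_path

# convert the PDF pages using JPEG as the intermediate format,
# writing the converted pages into the pages_out directory
pages = convert_from_path("input.pdf", fmt="jpeg", output_folder="pages_out")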

Creating very large NUMPY arrays in small chunks (PyTables vs. numpy.memmap)

There are a bunch of questions on SO that appear to be the same, but they don't really answer my question fully. I think this is a pretty common use-case for computational scientists, so I'm creating a new question.
QUESTION:
I read in several small numpy arrays from files (~10 MB each) and do some processing on them. I want to create a larger array (~1 TB) where each dimension of the array contains the data from one of these smaller files. Any method that tries to create the whole larger array (or a substantial part of it) in RAM is not suitable, since it fills up the RAM and brings the machine to a halt. So I need to be able to initialize the larger array and fill it in small batches, so that each batch gets written to the larger array on disk.
I initially thought that numpy.memmap is the way to go, but when I issue a command like
mmapData = np.memmap(mmapFile,mode='w+', shape=(large_no1,large_no2))
the RAM fills up and the machine slows to a halt.
After poking around a bit it seems like PyTables might be well suited for this sort of thing, but I'm not really sure. Also, it was hard to find a simple example in the doc or elsewhere which illustrates this common use-case.
If anyone knows how this can be done using PyTables, or if there's a more efficient/faster way to do it, please let me know! Any references to examples are appreciated!
That's weird. np.memmap should work. I've been using it with 250 GB of data on a 12 GB RAM machine without problems.
Does the system really run out of memory at the very moment the memmap file is created? Or does it happen further along in the code? If it happens at file creation, I really don't know what the problem would be.
When I started using memmap I made some mistakes that led to running out of memory. For me, something like the code below should work:
mmapData = np.memmap(mmapFile, mode='w+', shape=(smallarray_size, number_of_arrays), dtype='float64')
for k in range(number_of_arrays):
    smallarray = np.fromfile(list_of_files[k])  # list_of_files is the list of file names
    smallarray = do_something_with_array(smallarray)
    mmapData[:, k] = smallarray
It may not be the most efficient way, but it seems to me that it would have the lowest memory usage.
PS: Be aware that the default dtype values for memmap (uint8) and fromfile (float) are different!
HDF5 is a C library that can efficiently store large on-disk arrays. Both PyTables and h5py are Python libraries on top of HDF5. If you're using tabular data then PyTables might be preferred; if you have just plain arrays then h5py is probably more stable/simpler.
There are out-of-core numpy array solutions that handle the chunking for you. Dask.array would give you plain numpy semantics on top of your collection of chunked files (see docs on stacking.)
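To illustrate the chunked, on-disk pattern the question asks about, here is a minimal h5py sketch (the file names, directory and array size are hypothetical); each small array is written straight into the on-disk dataset, so only one batch lives in RAM at a time:

import os
import h5py
import numpy as np

small_size = 1250000                        # hypothetical length of each small array
files = sorted(os.listdir("small_arrays"))  # hypothetical directory of ~10 MB files
with h5py.File("big.h5", "w") as f:
    dset = f.create_dataset("data", shape=(small_size, len(files)),
                            dtype="float64", chunks=True)
    for k, fname in enumerate(files):
        small = np.fromfile(os.path.join("small_arrays", fname), dtype="float64")
        dset[:, k] = small  # written to disk; only this one array is held in RAM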

Writing into a NumPy memmap still loads into RAM memory

I'm testing NumPy's memmap through IPython Notebook, with the following code
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5e6, 4e4))
As you can see, Ymap's shape is pretty large. I'm trying to fill up Ymap like a sparse matrix. I'm not using scipy.sparse matrices because I will eventually need to dot-product it with another dense matrix, which will definitely not fit into memory.
Anyways, I'm performing a very long series of indexing operations:
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5e6, 4e4))

with open("somefile.txt", 'rb') as somefile:
    for i in xrange(5e6):
        # Read a line
        line = somefile.readline()
        # For each token in the line, look up its j value
        # Assign the value 1.0 to Ymap[i, j]
        for token in line.split():
            j = some_dictionary[token]
            Ymap[i, j] = 1.0
These operations somehow quickly eat up my RAM. I thought mem-mapping was basically an out-of-core numpy.ndarray. Am I mistaken? Why is my memory usage sky-rocketing like crazy?
A (non-anonymous) mmap is a link between a file and RAM that, roughly, guarantees that when RAM of the mmap is full, data will be paged to the given file instead of to the swap disk/file, and when you msync or munmap it, the whole region of RAM gets written out to the file. Operating systems typically follow a lazy strategy wrt. disk accesses (or eager wrt. RAM): data will remain in memory as long as it fits. This means a process with large mmaps will eat up as much RAM as it can/needs before spilling over the rest to disk.
So you're right that an np.memmap array is an out-of-core array, but it is one that will grab as much RAM cache as it can.
As the docs say:
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.
There's no true magic in computers ;-) If you access very little of a giant array, a memmap gimmick will require very little RAM; if you access very much of a giant array, a memmap gimmick will require very much RAM.
One workaround that may or may not be helpful in your specific code: create new mmap objects periodically (and get rid of old ones), at logical points in your workflow. Then the amount of RAM needed should be roughly proportional to the number of array items you touch between such steps. Against that, it takes time to create and destroy new mmap objects. So it's a balancing act.
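A hedged sketch of that workaround applied to the loop above (chunk_size is a hypothetical tuning value; some_dictionary is the lookup from the question); the map is flushed and reopened every chunk_size rows, so the operating system can drop pages that have already been written out:

import numpy as np

chunk_size = 100000  # hypothetical; tune to the available RAM
shape = (5000000, 40000)
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=shape)
with open("somefile.txt", 'rb') as somefile:
    for i in xrange(shape[0]):
        line = somefile.readline()
        for token in line.split():
            Ymap[i, some_dictionary[token]] = 1.0
        if (i + 1) % chunk_size == 0:
            Ymap.flush()  # push dirty pages out to Y.dat
            del Ymap      # drop the old map and its cached pages
            Ymap = np.memmap('Y.dat', dtype='float32', mode='r+', shape=shape)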

efficient array concatenation

I'm trying to concatenate several hundred arrays totaling almost 25 GB of data. I am testing on a 56 GB machine, but I receive a memory error. I reckon the way I do my process is inefficient and is sucking up lots of memory. This is my code:
import os
import numpy

BigArray = numpy.zeros((1, 200))
for dirname, dirnames, filenames in os.walk('/home/extra/AllData'):
    filenames.sort()
    for fname in filenames:
        # load the next array and append it to the big one
        newArray = numpy.load(os.path.join(dirname, fname))
        BigArray = numpy.concatenate((BigArray, newArray))
Any ideas, thoughts or solutions?
Thanks
Your process is horribly inefficient. When handling such huge amounts of data, you really need to know your tools.
For your problem, np.concatenate is forbidden - it needs at least twice the memory of the inputs. Plus it will copy every bit of data, so it's slow, too.
Use numpy.memmap to load the arrays. That will use only a few bytes of memory while still being pretty efficient.
Join them using np.vstack. Call this only once (i.e. don't do bigArray = vstack(bigArray, newArray)!!!). Load all the arrays into a list allArrays and then call bigArray = vstack(allArrays).
If that is really too slow, you need to know the size of the array in advance, create an array of this size once and then load the data into the existing array (instead of creating a new one every time).
Depending on how often the files on disk change, it might be much more efficient to concatenate them with the OS tools to create one huge file and then load that (or use numpy.memmap)
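A minimal sketch of the memmap-plus-single-vstack recipe described above (it assumes each file in the directory is a raw float64 dump with 200 values per row, which is hypothetical):

import os
import numpy as np

data_dir = '/home/extra/AllData'  # directory from the question
allArrays = []
for fname in sorted(os.listdir(data_dir)):
    # memory-map each file instead of reading it fully into RAM
    arr = np.memmap(os.path.join(data_dir, fname), dtype='float64', mode='r')
    allArrays.append(arr.reshape(-1, 200))  # assumed row width of 200
bigArray = np.vstack(allArrays)  # one single concatenation at the end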
