I'm testing NumPy's memmap through IPython Notebook, with the following code:
import numpy as np

Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5000000, 40000))
As you can see, Ymap's shape is pretty large. I'm trying to fill up Ymap like a sparse matrix. I'm not using scipy.sparse matrices because I will eventually need to dot-product it with another dense matrix, which will definitely not fit into memory.
Anyways, I'm performing a very long series of indexing operations:
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5000000, 40000))

with open("somefile.txt", 'rb') as somefile:
    for i in xrange(5000000):
        # Read a line
        line = somefile.readline()
        # For each token in the line, look up its j value
        # and assign the value 1.0 to Ymap[i, j]
        for token in line.split():
            j = some_dictionary[token]
            Ymap[i, j] = 1.0
These operations somehow quickly eat up my RAM. I thought mem-mapping was basically an out-of-core numpy.ndarray. Am I mistaken? Why is my memory usage sky-rocketing like crazy?
A (non-anonymous) mmap is a link between a file and RAM that, roughly, guarantees that when the RAM backing the mmap is full, data will be paged out to the given file instead of to the swap disk/file, and that when you msync or munmap it, the whole region of RAM gets written out to the file. Operating systems typically follow a lazy strategy with respect to disk accesses (or an eager one with respect to RAM): data will remain in memory as long as it fits. This means a process with large mmaps will eat up as much RAM as it can/needs before spilling the rest over to disk.
So you're right that an np.memmap array is an out-of-core array, but it is one that will grab as much RAM cache as it can.
As the docs say:
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.
There's no true magic in computers ;-) If you access very little of a giant array, a memmap gimmick will require very little RAM; if you access very much of a giant array, a memmap gimmick will require very much RAM.
One workaround that may or may not be helpful in your specific code: create new mmap objects periodically (and get rid of old ones), at logical points in your workflow. Then the amount of RAM needed should be roughly proportional to the number of array items you touch between such steps. Against that, it takes time to create and destroy new mmap objects. So it's a balancing act.
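A minimal sketch of that workaround, assuming the array is filled one block of rows at a time (the chunk size and the fill_rows helper are hypothetical):

import numpy as np

n_rows, n_cols = 5000000, 40000
chunk = 100000  # rows to touch before remapping; tune to taste

# Create the file once, then reopen it in 'r+' mode inside the loop.
np.memmap('Y.dat', dtype='float32', mode='w+', shape=(n_rows, n_cols)).flush()

for start in range(0, n_rows, chunk):
    Ymap = np.memmap('Y.dat', dtype='float32', mode='r+',
                     shape=(n_rows, n_cols))
    fill_rows(Ymap, start, min(start + chunk, n_rows))  # hypothetical filler
    Ymap.flush()
    del Ymap  # drop the mapping so the OS can reclaim its page cache

The RAM needed is then roughly proportional to the rows touched per chunk, at the cost of re-creating the mapping once per iteration.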
Related
I am dealing with large numpy arrays and I am trying out memmap as it could help.
big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000))
The above works fine and it creates a file on my hard drive of about 140GB.
1000000 is just a random number I used - not the one I am actually using.
I want to fill the matrix with values. Currently it is just set to zero.
for i in tqdm(range(len(big_matrix))):
    modified_row = get_row(i)
    big_matrix[i, :] = modified_row
At this point, I have big_matrix filled with the values I want.
The problem is that from this point on I can't operate on this memmap.
For example, I want to multiply column-wise (broadcast).
I run this:
big_matrix * weights[:, np.newaxis]
Where weights is a vector of the same length.
It just hangs and throws an out-of-memory error as my RAM and swap are all used up.
My understanding was that the memmap will keep everything on the hard drive.
For example save the results directly there.
So I tried this then:
for i in tqdm(range(big_matrix.shape[1])):
    temp = big_matrix[:, i].tolist()
    temp = np.array(temp) * weights
The above loads only one column into memory and multiplies it by the weights.
Then I will save that column back in big_matrix.
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
At this point I am thinking of switching to sqlite.
I wanted to get some insight into why my code is not working.
Do I need to flush the memmap every time I change it?
np.memmap maps a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for fast reuse. The memory is generally not flushed unless it is reclaimed (e.g. by another process, or the same process). When this happens, the OS typically (partially) flushes data to the storage device and (partially) frees the physical memory used for the mapping. That being said, this behaviour depends on the actual OS: it works that way on Windows; on Linux, you can use madvise to tune it, but madvise is a low-level C function not yet supported by Numpy (though it is apparently supported for Python, see this issue for more information). In fact, Numpy does not even support closing the memory-mapped space (which is leaky). The solution is generally to flush data manually so as not to lose it. There are alternative solutions, but none of them is great yet.
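For instance, a minimal sketch of the manual flush. The madvise line relies on the private ._mmap attribute of np.memmap and Python 3.8+ on Linux, so treat it as an assumption about Numpy's internals:

import mmap

big_matrix.flush()  # np.memmap.flush() writes dirty pages back to the file

# Hint to the kernel that the cached pages may be dropped (Linux, 3.8+):
big_matrix._mmap.madvise(mmap.MADV_DONTNEED)

del big_matrix  # Numpy has no close(); dropping the last reference unmaps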
big_matrix * weights[:, np.newaxis]
It just hangs and throws an out-of-memory error as my RAM and swap are all used up
This is normal, since Numpy creates a new temporary array stored in RAM. There is no way to tell Numpy to store temporary arrays on the storage device. That being said, you can tell Numpy where the output data is stored using the out parameter of some functions (e.g. np.multiply supports it). The output array can itself be created with memmap so as not to use too much memory (subject to the caching behaviour of the OS).
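A minimal sketch of that approach (the file names and the stand-in weights array are assumptions; big_matrix.dat is assumed to already exist):

import numpy as np

n = 1000000  # placeholder size from the question
big_matrix = np.memmap('big_matrix.dat', dtype=np.float16, mode='r',
                       shape=(n, n))
weights = np.ones(n, dtype=np.float16)  # stand-in weights

# Store the product in a second memmap instead of a temporary RAM array.
result = np.memmap('result.dat', dtype=np.float16, mode='w+', shape=(n, n))
np.multiply(big_matrix, weights[:, np.newaxis], out=result)
result.flush()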
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
This is also expected, especially if you use a HDD and not an SSD. Indeed, the array is stored (virtually) contiguously on the storage device, so big_matrix[:, i] has to fetch data with a huge stride. For each item, with a size of only 2 bytes, the OS will perform an IO request to the storage device. Storage devices are optimized for contiguous reads, so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally fetch at least a page (typically 4096 bytes, that is, 512 times more than what is actually needed). Moreover, there is a limit on the number of IO requests that can be completed per second: HDDs can typically do about 20-200 IO requests per second, while the fastest NVMe SSDs reach 100_000-600_000 IO requests per second.
Note that the cache helps avoid reloading data for the next column, unless there are too many loaded pages and the OS has to flush them. Reading a matrix of size (1_000_000, 1_000_000) causes up to 1_000_000*1_000_000 = 1_000_000_000_000 fetches, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient, since the processor cannot do that (due to the limited number of entries in the TLB). This typically results in TLB misses, that is, expensive kernel calls for each item to be read. Because a kernel call typically takes (at least) about ~1 µs on a mainstream PC, the whole computation would take more than a week.
If you want to read columns efficiently, then you need to read large chunks of columns at once. For example, you certainly need at least several hundred columns per read even on a fast NVMe SSD; on a HDD, it is at least several tens of thousands of columns to get a proper throughput. This means you certainly cannot read full columns efficiently, given the high amount of RAM that would require. Using another data layout (tiles + transposed data) is critical in this case.
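A minimal sketch of tile-based processing for the column-wise multiply from the question, assuming big_matrix was opened with mode='r+' (the tile sizes are arbitrary; tune them so one tile fits comfortably in RAM):

import numpy as np

row_tile, col_tile = 16384, 4096

for r in range(0, big_matrix.shape[0], row_tile):
    for c in range(0, big_matrix.shape[1], col_tile):
        # One mostly-contiguous read per tile instead of one IO per item.
        tile = np.array(big_matrix[r:r + row_tile, c:c + col_tile])
        tile *= weights[r:r + row_tile, np.newaxis]
        big_matrix[r:r + row_tile, c:c + col_tile] = tile  # write back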
My question is simple, and I could not find a resource that answers it. Somewhat similar links are using asarray, on numbers in general, and the most succinct one here.
How can I "calculate" the overhead of loading a numpy array into RAM (if there is any overhead)? Or, how to determine the least amount of RAM needed to hold all arrays in memory (without time-consuming trial and error)?
In short, I have several numpy arrays of shape (x, 1323000, 1), with x being as high as 6000. This leads to a disk usage of 30GB for the largest file.
All files together need 50 GB. Is it therefore enough if I use slightly more than 50 GB of RAM (using Kubernetes)? I want to use the RAM as efficiently as possible, so just requesting 100 GB is not an option.
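For what it's worth, there is essentially no per-array overhead: the footprint of an ndarray is prod(shape) * itemsize, plus roughly a hundred bytes for the array object itself. A quick sketch of the arithmetic, assuming float32 (which matches the ~30 GB on-disk size of the largest file):

import numpy as np

shape = (6000, 1323000, 1)
nbytes = np.prod(shape) * np.dtype(np.float32).itemsize
print(nbytes / 2**30)  # ~29.6 GiB, matching the ~30 GB file on disk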
For my research I am working with large numpy arrays consisting of complex data.
import numpy as np

arr = np.empty((15000, 25400), dtype='complex128')
np.save('array.npy', arr)
When stored, they are about 3 GB each. Loading these arrays is a time-consuming process, which made me wonder if there are ways to speed it up.
One of the things I was thinking of was splitting the array into its complex and real part:
arr_real = arr.real
arr_im = arr.imag
and saving each part separately. However, this didn't seem to improve processing speed significantly. There is some documentation about working with large arrays, but I haven't found much information on working with complex data. Are there smart(er) ways to work with large complex arrays?
If you only need parts of the array in memory, you can load it using memory mapping:
arr = np.load('array.npy', mmap_mode='r')
From the docs:
A memory-mapped array is kept on disk. However, it can be accessed and sliced like any ndarray. Memory mapping is especially useful for accessing small fragments of large files without reading the entire file into memory.
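For example, slicing the mapped array reads only the touched fragment from disk:

arr = np.load('array.npy', mmap_mode='r')
first_rows = np.array(arr[:100])  # copies just these rows into RAM; the rest stays on disk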
Currently I am working with a NumPy memmap array with 2,000,000 × 33 × 33 × 4 (N × W × H × C) entries. My program reads random (N) indices from this array.
I have 8 GB of RAM and a 2 TB HDD. The HDD read IO is only around 20 MB/s, and RAM usage stays at 2.5 GB. It seems that there is a HDD bottleneck, because I am retrieving random indices that are obviously not in the memmap cache. Therefore, I would like the memmap cache to use as much RAM as possible.
Is there a way for me to tell memmap to maximize IO and RAM usage?
(Checking my python 2.7 source)
As far as I can tell NumPy memmap uses mmap.
mmap does define:
# Variables with simple values
...
ALLOCATIONGRANULARITY = 65536
PAGESIZE = 4096
However, I am not sure it would be wise (or even possible) to change those.
Furthermore, this may not solve your problem and would definitely not give you the most efficient solution, because there is caching and page reading at the OS level and at the hardware level (for the hardware, it takes more or less the same time to read a single value or a whole page).
A much better solution would probably be to sort your requests (I suppose here that N is large; otherwise, just sort them once):
Gather a bunch of them (say, one or ten million?) and sort them before making the request. Then issue the ordered queries, and after getting the answers, put them back in their original order...
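A minimal sketch of that batching trick, assuming the memmap is indexed by its first axis (names are made up):

import numpy as np

def read_sorted(mm, indices):
    """Read mm[indices] in disk-friendly order, preserving the caller's order."""
    order = np.argsort(indices)    # permutation that sorts the requests
    gathered = mm[indices[order]]  # mostly-sequential reads from disk
    undo = np.argsort(order)       # inverse permutation
    return gathered[undo]          # back in the original request order

Sorting a few million integer indices takes a fraction of a second with np.argsort, which is negligible next to the random-IO time it saves on a HDD.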
I have a list of "TIFFFiles", where each "TIFFFile" contains a "TIFFArray" with 60 TIFF images, each with a size of 2776x2080 pixels. The images are read as numpy.memmap objects.
I want to access all intensities of the images (shape of imgs: (60,2776,2080)). I use the following code:
for i in xrange(18):
    # get an instance of type TIFFArray from tiff_list
    tiffs = get_tiff_arrays(smp_ppx, type_subfile, tiff_list[i])
    # access all intensities from tiffs
    imgs = tiffs[:, :, :]
Even though I overwrite "tiffs" and "imgs" in each iteration step, my memory usage increases by 2.6 GB. How can I avoid the data being copied in each iteration step? Is there any way that the 2.6 GB of memory can be reused?
I know this is probably not an answer, but it might help anyway, and it was too long for a comment.
Some time ago I had a memory problem while reading large (>1 GB) ASCII files with numpy: basically, to read the file with numpy.loadtxt, the code was using all the memory (8 GB) plus some swap.
From what I've understood, if you know in advance the size of the array to fill, you can allocate it and pass it to, e.g., loadtxt. This should prevent numpy from allocating temporary objects, and it might be better memory-wise.
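As far as I know, np.loadtxt itself has no parameter for a preallocated output array, so a hand-rolled version of the same idea might look like this (the whitespace-separated file layout is an assumption):

import numpy as np

def load_known_shape(fname, n_rows, n_cols):
    # Allocate the full array once, then fill it row by row, avoiding the
    # large temporary lists that loadtxt builds internally.
    arr = np.empty((n_rows, n_cols), dtype=np.float64)
    with open(fname) as fh:
        for i, line in enumerate(fh):
            arr[i] = [float(tok) for tok in line.split()]
    return arr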
mmap, or similar approaches, can help improve memory usage, but I've never used them.
edit
The problem of memory usage and release is something I wondered about while trying to solve my large-file problem. Basically I had:
def read_f(fname):
    arr = np.loadtxt(fname)  # this uses a lot of memory
    # do operations
    return something

for f in ["verylargefile", "smallerfile", "evensmallerfile"]:
    result = read_f(f)
From the memory profiling I did, there was no memory release when loadtxt returned, nor when read_f returned and was called again with a smaller file.