I am dealing with large numpy arrays and I am trying out memmap as it could help.
big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000))
The above works fine and it creates a file on my hard drive of about 140GB.
1000000 is just a random number I used - not the one I am actually using.
I want to fill the matrix with values. Currently it is just set to zero.
for i in tqdm(range(len(big_matrix))):
    modified_row = get_row(i)
    big_matrix[i, :] = modified_row
At this point now, I have a big_matrix filled with the values I want.
The problem is that from this point on I can't operate on this memmap.
For example I want to multiply column wise (broadcast).
I run this:
big_matrix * weights[:, np.newaxis]
Where weights has the same length.
It just hangs and throws an out-of-memory error as my RAM and swap are all used up.
My understanding was that the memmap will keep everything on the hard drive.
For example save the results directly there.
So I tried this then:
for i in tqdm(range(big_matrix.shape[1])):
    temp = big_matrix[:, i].tolist()
    temp = np.array(temp) * weights
The above loads only one column into memory and multiplies it by the weights.
Then I will save that column back in big_matrix.
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
At this point I am thinking of switching to sqlite.
I wanted to get some insight into why my code is not working.
Do I need to flush the memmap every time I change it?
np.memmap maps a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for fast reuse. The memory is generally not flushed unless it is reclaimed (e.g. by another process or the same process). When this happens, the OS typically (partially) flushes data to the storage device and (partially) frees the physical memory used for the mapping. That being said, this behaviour depends on the actual OS. It works that way on Windows. On Linux, you can use madvise to tune this behaviour, but madvise is a low-level C function not yet supported by Numpy (though it is apparently supported for Python, see this issue for more information). Actually, Numpy does not even support closing the memmapped space (which is leaky). The solution is generally to flush data manually so as not to lose it. There are alternative solutions, but none of them is great yet.
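A minimal sketch of that manual flush (np.memmap exposes a flush() method; big_matrix and get_row are the names from the question):

for i in tqdm(range(len(big_matrix))):
    big_matrix[i, :] = get_row(i)

# Force the dirty pages to be written back to the file on disk.
big_matrix.flush()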
big_matrix * weights[:, np.newaxis]
It just hangs and throws an out-of-memory error as my RAM and swap are all used up
This is normal since Numpy creates a new temporary array stored in RAM. There is no way to tell Numpy to store temporary arrays on the storage device. That being said, you can tell Numpy where the output data is stored using the out parameter of some functions (e.g. np.multiply supports it). The output array can be created using memmap so as not to use too much memory (subject to the behaviour of the OS).
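For example, a sketch under the assumption that a second file (here a hypothetical 'big_matrix_times_weights.dat') can hold the result:

result = np.memmap('big_matrix_times_weights.dat', dtype=np.float16, mode='w+',
                   shape=big_matrix.shape)
# The product is written straight into the second memmap instead of a RAM temporary.
np.multiply(big_matrix, weights[:, np.newaxis], out=result)
result.flush()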
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
This is also expected, especially if you use an HDD and not an SSD. Indeed, the array is stored (virtually) contiguously on the storage device. big_matrix[:, i] has to fetch data with a huge stride. For each item, with a size of only 2 bytes, the OS performs an IO request to the storage device. Storage devices are optimized for contiguous reads, so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally fetch at least a page (typically 4096 bytes, that is 512 times more than what is actually needed). Moreover, there is a limit on the number of IO requests that can be completed per second. HDDs can typically do about 20-200 IO requests per second while the fastest NVMe SSDs reach 100_000-600_000 IO requests per second. Note that the cache helps to not reload data for the next column unless there are too many loaded pages and the OS has to flush them. Reading a matrix of size (1_000_000, 1_000_000) causes up to 1_000_000*1_000_000 = 1_000_000_000_000 fetches, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient since the processor cannot do that (due to a limited number of entries in the TLB). This typically results in TLB misses, that is, expensive kernel calls for each item to be read. Because a kernel call typically takes (at least) about ~1 us on a mainstream PC, this means more than a week to do the whole computation.
If you want to efficiently read columns, then you need to read large chunks of columns. For example, you certainly need at least several hundred columns per read even on a fast NVMe SSD. For an HDD, it is at least several tens of thousands of columns to get a proper throughput. This means you certainly cannot read full columns efficiently due to the high amount of RAM required. Using another data layout (tiled + transposed data) is critical in this case.
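As an aside, for the specific broadcast in the question (row i of big_matrix scaled by weights[i]), the update can be done in blocks of rows, which are contiguous on disk for this C-ordered memmap. A minimal sketch, with rows_per_block as an assumed tuning knob:

rows_per_block = 256   # assumed value; tune to your available RAM
n_rows = big_matrix.shape[0]
for start in tqdm(range(0, n_rows, rows_per_block)):
    stop = min(start + rows_per_block, n_rows)
    # Each block is a (mostly) sequential read/write on the storage device.
    big_matrix[start:stop, :] *= weights[start:stop, np.newaxis]
big_matrix.flush()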
Related
I recently saw that when creating a numpy array via np.empty or np.zeros, the memory of that numpy array is not actually allocated by the operating system as discussed in this answer (and this question), because numpy utilizes calloc to allocate the array's memory.
In fact, the OS isn't even "really" allocating that memory until you try to access it.
Therefore,
l = np.zeros(2**28)
does not increase the utilized memory the system reports, e.g., in htop.
Only once I touch the memory, for instance by executing
np.add(l, 0, out=l)
the utilized memory is increased.
Because of that behaviour I have got a couple of questions:
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
i = 100
f[:i] = 3
while True:
    ...        # Do stuff
    f[i] = ... # Once the memory "behind" the already allocated chunk of memory is filled
               # with other stuff, does the operating system reallocate the memory and
               # copy the already filled part of the array to the new location?
    i = i + 1
2. Touching the last element
As the memory of the numpy array is contiguous in memory, I thought
f[-1] = 3
might require the entire block of memory to be allocated (without touching the entire memory).
However, it does not, the utilized memory in htop does not increase by the size of the array.
Why is that not the case?
OS isn't even "really" allocating that memory until you try to access it
This depends on the target platform (typically the OS and its configuration). Some platforms directly allocate pages in physical memory (e.g. AFAIK the XBox does, as do some embedded platforms). However, mainstream platforms do indeed defer the physical allocation until first access.
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
Allocations are performed in virtual memory. When a first touch is done on a given memory page (a chunk of fixed size, e.g. 4 KiB), the OS maps the virtual page to a physical one. So only one page will be physically mapped when you set only one item of the array (unless the item crosses two pages, which only happens in pathological cases).
The physical pages may not be contiguous for a contiguous set of virtual pages. However, this is not a problem and you should not care about it. This is mainly the job of the OS. That being said, modern processors have a dedicated unit called the TLB to translate virtual addresses (the ones you could see with a debugger) to physical ones (since this translation is relatively expensive and performance critical).
The content of the Numpy array is not reallocated nor copied thanks to paging (at least from the user's point of view, i.e. in virtual memory).
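A small sketch of how this can be observed (assuming Linux, where the stdlib resource module reports ru_maxrss in KiB; exact numbers will vary with the allocator):

import resource
import numpy as np

def peak_rss_kib():
    # Peak resident set size of this process (KiB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

l = np.zeros(2**28)              # ~2 GiB of virtual memory, almost no physical memory yet
print("after np.zeros:   ", peak_rss_kib())

l[0] = 1.0                       # first touch maps roughly one 4 KiB page
print("after one touch:  ", peak_rss_kib())

l[:] = 1.0                       # touching everything maps all the pages
print("after full touch: ", peak_rss_kib())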
2. Touching the last element
I thought f[-1] = 3 might require the entire block of memory to be allocated (without touching the entire memory). However, it does not, the utilized memory in htop does not increase by the size of the array. Why is that not the case?
Only the last page in virtual memory associated with the Numpy array is mapped, thanks to paging. This is why you do not see a big change in htop. However, you should see a slight change (the size of a page on your platform) if you look carefully. Otherwise, this would mean the page has already been mapped due to previous recycled allocations. Indeed, the allocation library can preallocate memory areas to speed up allocations (by reducing the number of slow requests to the OS). The library could also keep the memory mapped when it is freed by Numpy in order to speed up the next allocations (since the memory does not have to be unmapped and then mapped again). This is unlikely to occur for huge arrays in practice because the impact on memory consumption would be too expensive.
I seem to be having an issue with some basic astronomical image processing/calibration using the python package ccdproc.
I'm currently compiling 30 bias frames into a single image average of the component frames. Before going through the combination I iterate over each image in order to subtract the overscan region using subtract_overscan() and then select the image dimensions I want to retain using trim_image().
I suppose my indexing is correct but when I get to the combination, it takes extremely long (more than a couple of hours). I'm not sure that this is normal. I suspect something might be being misinterpreted by my computer. I've created the averaged image before without any of the other processing and it didn't take long (5-10 mins or so) which is why I'm thinking it might be an issue with my indexing.
If anyone can verify that my code is correct and/or comment on what might be the issue it'd be a lot of help.
Image dimensions: NAXIS1 = 3128 , NAXIS2 = 3080 and allfiles is a ccdproc.ImageFileCollection.
from astropy.io import fits
from astropy import units as u
import ccdproc as cp

biasImages = []

for filename in allfiles.files_filtered(NAXIS1=3128, NAXIS2=3080, OBSTYPE='BIAS'):
    ccd = fits.getdata(allfiles.location + filename)
    # print(ccd)
    ccd = cp.CCDData(ccd, unit=u.adu)
    # print(ccd)
    ccd = cp.subtract_overscan(ccd, overscan_axis=1, fits_section='[3099:3124,:]')
    # print(ccd)
    ccd = cp.trim_image(ccd, fits_section='[27:3095,3:3078]')
    # print(ccd)
    biasImages.append(ccd)

master_bias = cp.combine(biasImages, output_file=path + 'mbias_avg.fits', method='average')
The code looks similar to my own code for combining biases together (see this example), so there is nothing jumping out immediately as a red flag. I rarely do such a large number of biases and the ccdproc.combine task could be far more optimized, so I'm not surprised it is very slow.
One thing that I sometimes run into is issues with garbage collection. So if you are running this in a notebook or as part of a large script, there may be a problem with the memory not being cleared. It is useful to watch what is happening in memory, and I sometimes delete the biasImages object (or any other list of ccd objects) once it has been used and isn't needed any further.
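A minimal sketch of that clean-up step (gc.collect() is optional; dropping the last reference is usually enough for NumPy-backed data):

import gc

master_bias = cp.combine(biasImages, output_file=path + 'mbias_avg.fits', method='average')
del biasImages   # drop the per-frame CCDData objects once the combination is done
gc.collect()     # optionally nudge the garbage collector to release the memory right away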
I'm happy to respond further here, or if you have further issues please open an issue at the github repo.
In case you're just looking for a solution, skip ahead to the end of this answer, but in case you're interested in why that (probably) happens, don't skip ahead.
it takes extremely long (more than a couple of hours).
That seems like you're running out of RAM and your computer then starts using swap memory. That means it will save part (or all) of the objects on your hard disk and remove them from RAM so it can load them again when needed. In some cases swap memory can be very efficient because it only needs to reload from the hard disk rarely, but in some cases it has to reload lots of times and then you're going to notice a "whole system slow down" and "never ending operations".
After some investigations I think the problem is mainly because the numpy.array created by ccdproc.combine is stacked along the first axis and the operation is along the first axis. The first axis would be good in case it's a FORTRAN-contiguous array but ccdproc doesn't specify any "order" and then it's going to be C-contiguous. That means the elements on the last axis are stored next to each other in memory (if it were FORTRAN-contiguous the elements in the first axis would be next to each other). So if you run out of RAM and your computer starts using swap memory it puts parts of the array on the disk but because the operation is performed along the first axis - the memory addresses of the elements that are used in each operation are "far away from each other". That means it cannot utilize the swap memory in a useful way because it has to basically reload parts of the array from the hard disk for "each" next item.
It's not very important to know that actually, I just included it in case you're interested in what the reason for the observed behavior was. The main point to take away is that if you notice the system becoming very slow when you run any program and it doesn't seem to make much progress, that's probably because you are running out of RAM!
The easiest solution (although it has nothing to do with programming) is to buy more RAM.
The complicated solution would be to reduce the memory footprint of your program.
Let's first do a small calculation how much memory we're dealing with:
Your images are 3128 * 3080 that's 9634240 elements. They might be any type when you read them but when you use ccdproc.subtract_overscan they will be floats later. One float (well, actually np.float64) uses 8 bytes, so we're dealing with 77073920 bytes. That's roughly 73 MB per bias image. You have 30 bias images, so we're dealing with roughly 2.2 GB of data here. That's assuming your images don't have uncertainty or mask. If they have that would add another 2.2 GB for the uncertainties or 0.26 GB for the masks.
2.2 GB sounds like a small enough number but ccdproc.combine stacks the NumPy arrays. That means it will create a new array and copy the data of your ccds into the new array. That will double the memory right there. It makes sense to stack them because even though it will take more memory it will be much faster when it actually does the "combining" but it's not there yet.
All in all 4.4 GB could already exhaust your RAM. Some computers will only have 4GB RAM and don't forget that your OS and the other programs need some RAM as well. However, it's unlikely that you run out of RAM if you have 8GB or more but given the numbers and your observations I assume that you only have 4-6GB of RAM.
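The same back-of-the-envelope estimate, spelled out so it is easy to re-run with your own frame sizes (plain Python, numbers as discussed above):

n_images = 30
pixels_per_image = 3128 * 3080            # 9,634,240 pixels per frame
bytes_per_image = pixels_per_image * 8    # np.float64 after subtract_overscan, roughly 73 MB
data_total = n_images * bytes_per_image   # roughly 2.2 GB for the 30 frames
with_stacking = 2 * data_total            # roughly 4.4 GB once combine stacks the arrays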
The interesting question is actually how to avoid the problem. That really depends on the amount of memory you have:
Less than 4GB RAM
That's tricky because you won't have much free RAM after you deduct the size of all the CCDData objects and what you OS and the other processes need. In that case it would be best to process for example 5 bias images at a time and then combine the results of the first combinations. That's probably going to work because you use average as method (it wouldn't work if you used median) because (A+B+C+D) / 4 is equal to ((A+B)/2 + (C+D)/2)/2.
That would be (I haven't actually checked this code, so please inspect it carefully before you run it):
biasImages = []
biasImagesCombined = []

for idx, filename in enumerate(allfiles.files_filtered(NAXIS1=3128, NAXIS2=3080, OBSTYPE='BIAS')):
    ccd = fits.getdata(allfiles.location + filename)
    ccd = cp.CCDData(ccd, unit=u.adu)
    ccd = cp.subtract_overscan(ccd, overscan_axis=1, fits_section='[3099:3124,:]')
    ccd = cp.trim_image(ccd, fits_section='[27:3095,3:3078]')
    biasImages.append(ccd)

    # Combine every 5 bias images. This only works correctly if the amount of
    # images is a multiple of 5 and you use average as combine-method.
    if (idx + 1) % 5 == 0:
        tmp_bias = cp.combine(biasImages, method='average')
        biasImages = []
        biasImagesCombined.append(tmp_bias)

master_bias = cp.combine(biasImagesCombined, output_file=path + 'mbias_avg.fits', method='average')
4GB of RAM
In that case you probably have 500 MB to spare, so you could simply use mem_limit to limit the amount of RAM the combine will take additionally. In that case just change your last line to:
# To account for additional memory usage I chose 100 MB of additional memory
# you could try to adapt the actual number.
master_bias = cp.combine(biasImages, mem_limit=1024*1024*100, output_file=path + 'mbias_avg.fits', method='average')
More than 4GB of RAM but less than 8GB
In that case you probably have 1 GB of free RAM that could be used. It's still the same approach as the 4GB option but you could use a much higher mem_limit. I would start with 500 MB: mem_limit=1024*1024*500.
More than 8GB of RAM
In that case I must have missed something because using ~4.5GB of RAM shouldn't actually exhaust your RAM.
I have been running some code, a part of which loads in a large 1D numpy array from a binary file, and then alters the array using the numpy.where() method.
Here is an example of the operations performed in the code:
import numpy as np
num = 2048
threshold = 0.5
with open(file, 'rb') as f:
    arr = np.fromfile(f, dtype=np.float32, count=num**3)
arr *= threshold
arr = np.where(arr >= 1.0, 1.0, arr)
vol_avg = np.sum(arr)/(num**3)
# both arr and vol_avg needed later
I have run this many times (on a free machine, i.e. no other inhibiting CPU or memory usage) with no issue. But recently I have noticed that sometimes the code hangs for an extended period of time, making the runtime an order of magnitude longer. On these occasions I have been monitoring %CPU and memory usage (using gnome system monitor), and found that python's CPU usage drops to 0%.
Using basic prints in between the above operations to debug, it seems to be arbitrary as to which operation causes the pausing (i.e. open(), np.fromfile(), np.where() have each separately caused a hang on a random run). It is as if I am being throttled randomly, because on other runs there are no hangs.
I have considered things like garbage collection or this question, but I cannot see any obvious relation to my problem (for example keystrokes have no effect).
Further notes: the binary file is 32GB, the machine (running Linux) has 256GB memory. I am running this code remotely, via an ssh session.
EDIT: This may be incidental, but I have noticed that there are no hang ups if I run the code after the machine has just been rebooted. It seems they begin to happen after a couple of runs, or at least other usage of the system.
np.where is creating a copy there and assigning it back into arr. So, we could optimize on memory there by avoiding a copying step, like so -
vol_avg = (np.sum(arr) - (arr[arr >= 1.0] - 1.0).sum())/(num**3)
We are using boolean indexing to select the elements that are greater than or equal to 1.0, getting their offsets from 1.0, summing those up and subtracting that from the total sum. Hopefully the number of such exceeding elements is small, so this won't incur any more noticeable memory requirement. I am assuming this hanging-up issue with large arrays is a memory-based one.
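If the clipped array itself is needed later (as the comment in the question suggests), an in-place clip also avoids the copy; a sketch using np.minimum with out=:

# Clip in place instead of allocating a second 32 GB array with np.where.
np.minimum(arr, 1.0, out=arr)
vol_avg = arr.sum() / num**3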
The drops in CPU usage were unrelated to python or numpy, but were indeed a result of reading from a shared disk, and network I/O was the real culprit. For such large arrays, reading into memory can be a major bottleneck.
Did you click or select the Console window? This behavior can "hang" the process. Console enters "QuickEditMode". Pressing any key can resume the process.
Currently I am working with a NumPy memmap array with 2,000,000 * 33 * 33 * 4 (N * W * H * C) data. My program reads random (N) indices from this array.
I have 8GB of RAM and a 2TB HDD. The HDD read IO is only around 20 MB/s and RAM usage stays at 2.5GB. It seems that there is an HDD bottleneck because I am retrieving random indices that are obviously not in the memmap cache. Therefore, I would like the memmap cache to use RAM as much as possible.
Is there a way for me to tell memmap to maximize IO and RAM usage?
(Checking my python 2.7 source)
As far as I can tell NumPy memmap uses mmap.
mmap does define:
# Variables with simple values
...
ALLOCATIONGRANULARITY = 65536
PAGESIZE = 4096
However, I am not sure it would be wise (or even possible) to change those.
Furthermore, this may not solve your problem and would definitely not give you the most efficient solution, because there is caching and page reading at OS level and at hardware level (because for hardware it takes more or less the same time to read a single value or the whole page).
A much better solution would probably be to sort your requests (I suppose here that N is large, otherwise just sort them once):
Gather a bunch of them (say one or ten million?) and, before doing the request, sort them. Then issue the ordered queries. Then, after getting the answers, put them back in their original order...
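A minimal sketch of that sort-and-restore idea (arr stands for the memmapped (N, 33, 33, 4) array and the batch size is an assumption):

import numpy as np

wanted = np.random.randint(0, arr.shape[0], size=1000000)   # the random indices to fetch

order = np.argsort(wanted)          # sort so the reads walk the file mostly sequentially
batch_sorted = arr[wanted[order]]   # fancy indexing copies the requested rows into RAM
batch = np.empty_like(batch_sorted)
batch[order] = batch_sorted         # restore the original request order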
I'm testing NumPy's memmap through IPython Notebook, with the following code
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5e6, 4e4))
As you can see, Ymap's shape is pretty large. I'm trying to fill up Ymap like a sparse matrix. I'm not using scipy.sparse matrices because I will eventually need to dot-product it with another dense matrix, which will definitely not fit into memory.
Anyways, I'm performing a very long series of indexing operations:
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(5e6, 4e4))

with open("somefile.txt", 'rb') as somefile:
    for i in xrange(5e6):
        # Read a line
        line = somefile.readline()
        # For each token in the line, lookup its j value
        # Assign the value 1.0 to Ymap[i,j]
        for token in line.split():
            j = some_dictionary[token]
            Ymap[i,j] = 1.0
These operations somehow quickly eat up my RAM. I thought mem-mapping was basically an out-of-core numpy.ndarray. Am I mistaken? Why is my memory usage sky-rocketing like crazy?
A (non-anonymous) mmap is a link between a file and RAM that, roughly, guarantees that when RAM of the mmap is full, data will be paged to the given file instead of to the swap disk/file, and when you msync or munmap it, the whole region of RAM gets written out to the file. Operating systems typically follow a lazy strategy wrt. disk accesses (or eager wrt. RAM): data will remain in memory as long as it fits. This means a process with large mmaps will eat up as much RAM as it can/needs before spilling over the rest to disk.
So you're right that an np.memmap array is an out-of-core array, but it is one that will grab as much RAM cache as it can.
As the docs say:
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.
There's no true magic in computers ;-) If you access very little of a giant array, a memmap gimmick will require very little RAM; if you access very much of a giant array, a memmap gimmick will require very much RAM.
One workaround that may or may not be helpful in your specific code: create new mmap objects periodically (and get rid of old ones), at logical points in your workflow. Then the amount of RAM needed should be roughly proportional to the number of array items you touch between such steps. Against that, it takes time to create and destroy new mmap objects. So it's a balancing act.
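A sketch of that workaround applied to the loop from the question (keeping its Python 2 names; flush_every is an assumed interval, and whether the OS actually returns the cached pages is up to the OS):

flush_every = 100000   # assumed interval; tune to how much RAM you can spare
shape = (5000000, 40000)

Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=shape)
with open("somefile.txt", 'rb') as somefile:
    for i in xrange(shape[0]):
        line = somefile.readline()
        for token in line.split():
            Ymap[i, some_dictionary[token]] = 1.0
        if (i + 1) % flush_every == 0:
            Ymap.flush()
            del Ymap   # drop the old mapping so its pages can be reclaimed
            Ymap = np.memmap('Y.dat', dtype='float32', mode='r+', shape=shape)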