numpy memmap memory usage - want to iterate once - python

Let's say I have some big matrix saved on disk. Storing it all in memory is not really feasible, so I use memmap to access it:
A = np.memmap(filename, dtype='float32', mode='r', shape=(3000000,162))
Now let's say I want to iterate over this matrix (not necessarily in an ordered fashion) such that each row is accessed exactly once.
p = some_permutation_of_0_to_2999999()
I would like to do something like this:
start = 0
end = 3000000
num_rows_to_load_at_once = some_size_that_will_fit_in_memory()
while start < end:
    indices_to_access = p[start:start+num_rows_to_load_at_once]
    do_stuff_with(A[indices_to_access, :])
    start = min(end, start+num_rows_to_load_at_once)
As this process goes on, my computer becomes slower and slower, and my RAM and virtual memory usage explodes.
Is there some way to force np.memmap to use up to a certain amount of memory? (I know I won't need more than the number of rows I'm planning to read at a time, and that caching won't really help me since I'm accessing each row exactly once.)
Alternatively, is there some other (generator-like) way to iterate over an np array in a custom order? I could write it manually using file.seek, but that happens to be much slower than the np.memmap implementation.
do_stuff_with() does not keep any reference to the array it receives, so there are no "memory leaks" in that respect.
Thanks.

This has been an issue that I've been trying to deal with for a while. I work with large image datasets and numpy.memmap offers a convenient solution for working with these large sets.
However, as you've pointed out, if I need to access each frame (or row in your case) to perform some operation, RAM usage will max out eventually.
Fortunately, I recently found a solution that will allow you to iterate through the entire memmap array while capping the RAM usage.
Solution:
import numpy as np

# create a memmap array
input = np.memmap('input', dtype='uint16', shape=(10000,800,800), mode='w+')

# create a memmap array to store the output
output = np.memmap('output', dtype='uint16', shape=(10000,800,800), mode='w+')

def iterate_efficiently(input, output, chunk_size):
    # create an empty array to hold each chunk
    # the size of this array will determine the amount of RAM usage
    holder = np.zeros([chunk_size,800,800], dtype='uint16')

    # iterate through the input in chunks, operate on each chunk, write to output
    for i in range(input.shape[0]):
        if i % chunk_size == 0:
            holder[:] = input[i:i+chunk_size] # read in chunk from input
            holder += 5 # perform some operation
            output[i:i+chunk_size] = holder # write chunk to output

def iterate_inefficiently(input, output):
    output[:] = input[:] + 5
Timing Results:
In [11]: %timeit iterate_efficiently(input,output,1000)
1 loop, best of 3: 1min 48s per loop
In [12]: %timeit iterate_inefficiently(input,output)
1 loop, best of 3: 2min 22s per loop
The size of the array on disk is ~12 GB. The iterate_efficiently function keeps memory usage at 1.28 GB, whereas iterate_inefficiently eventually reaches 12 GB in RAM.
This was tested on macOS.
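For the permuted-row access in the original question, the same chunking idea applies. Here is a minimal sketch of my own (not from the benchmark above), assuming the question's file and shape; sorting each batch of indices keeps the disk reads mostly sequential, and fancy indexing on a memmap copies the selected rows into an ordinary in-memory array:

import numpy as np

A = np.memmap('input.dat', dtype='float32', mode='r', shape=(3000000, 162))
p = np.random.permutation(A.shape[0])  # stand-in for some_permutation_of_0_to_2999999()
chunk_size = 100000                    # rows that comfortably fit in RAM

for start in range(0, len(p), chunk_size):
    idx = np.sort(p[start:start + chunk_size])  # sorted indices -> mostly sequential reads
    rows = A[idx, :]                            # fancy indexing copies these rows into RAM
    # do_stuff_with(rows)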

I've been experimenting with this problem for a couple of days now, and it appears there are two ways to control memory consumption with np.memmap. The first is reliable, while the second requires some testing and is OS dependent.
Option 1 - reconstruct the memory map with each read / write:
def MoveMMapNPArray(data, output_filename):
    CHUNK_SIZE = 4096
    for idx in range(0, data.shape[1], CHUNK_SIZE):
        # reconstruct both memmaps on every iteration so the previous
        # mappings (and their cached pages) can be freed
        x = np.memmap(data.filename, dtype=data.dtype, mode='r', shape=data.shape, order='F')
        y = np.memmap(output_filename, dtype=data.dtype, mode='r+', shape=data.shape, order='F')
        end = min(idx + CHUNK_SIZE, data.shape[1])
        y[:, idx:end] = x[:, idx:end]
Where data is of type np.memmap. Discarding and reconstructing the memmap object on each read keeps the array from accumulating in memory, and keeps memory consumption very low as long as the chunk size is small. It likely introduces some CPU overhead, but this was found to be small on my setup (macOS).
Option 2 - construct the mmap buffer yourself and provide memory advice
If you look at the np.memmap source code here, you can see that it is relatively simple to create your own memmapped numpy array. Specifically, with this snippet:
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap_np_array = ndarray.__new__(subtype, shape, dtype=descr, buffer=mm, offset=array_offset, order=order)
Note that this Python mmap instance is stored in the np.memmap's private _mmap attribute.
With access to the Python mmap object, on Python 3.8+ you can use its madvise method, described here.
This allows you to advise the OS to free memory where available. The various madvise constants are described here for Linux, with some generic cross-platform options specified.
The MADV_DONTNEED constant looks promising for this, but I haven't tested memory consumption with it the way I have for option 1.
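As a minimal sketch of option 2 (my own, untested, and relying on the private _mmap attribute): after copying each chunk out of the mapping, advise the kernel that the mapped pages may be dropped. mmap.madvise requires Python 3.8+, and MADV_DONTNEED is Linux-specific:

import mmap
import numpy as np

A = np.memmap('input.dat', dtype='float32', mode='r', shape=(3000000, 162))
chunk_size = 100000

for start in range(0, A.shape[0], chunk_size):
    rows = np.array(A[start:start + chunk_size])  # copy the chunk out of the mapping
    # ... process rows ...
    # tell the kernel it may reclaim all pages of this mapping (Linux, Python 3.8+)
    A._mmap.madvise(mmap.MADV_DONTNEED)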

Related

Running out of memory and process killed during Cholesky decomposition with Dask

I'm trying to compute the Cholesky decomposition of a matrix with 49152x49152 elements (around 19 GB) on a laptop with only 16 GB of RAM. I've seen that with dask it's possible to work with matrices that do not fit in memory, but when I run my code it always runs out of memory (no matter how I change the number of chunks).
Here is where my code breaks:
M3 is a (49152, 49152) matrix, positive definite and non-singular.
M3 = da.rechunk(M3, chunks = (Chunks, Chunks))
chol_x2 = da.linalg.cholesky(M3, lower=True)
da.to_npy_stack("./Prueba_Nside_64", chol_x2)
old_name = "./Prueba_Nside_64/0.npy"
new_name = "./Prueba_Nside_64/Cholesky_Decomposition_Nside_64.npy"
os.rename(old_name, new_name)
I've tried different Chunks, always checking that each chunk is smaller than 1 GB. I'm not sure if the problem is the Cholesky decomposition or the to_npy_stack function. Which function should I use to save the Cholesky decomposition?
I've just realized the code breaks even if I try to compute a single element of the Cholesky factor. How could I compute the Cholesky decomposition on my own laptop? Or is it maybe not even possible?
I did a few tests on my laptop, which also has 16 GiB of RAM.
I reproduced the out-of-memory error on the workers with just this code:
import dask
import dask.array as da
x = da.random.random((49152,49152))
x3 = da.tril(x, k = -1)
da.to_npy_stack("./Prueba_Nside_64", x3)
At first I thought this came from the tril function, but it is actually the to_npy_stack call that is causing issues.
Indeed, storing as a Numpy stack requires chunking along only one axis, so it triggers a rechunking that fills worker memory.
Just using another format (one that can directly write the dask array chunks) did the trick for me:
x3.to_zarr("./Prueba_zarr")
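As a small usage note of my own: the zarr store can also be reopened lazily with matching chunks, so downstream steps don't need to fit the result in memory either:

import dask.array as da

chol = da.from_zarr("./Prueba_zarr")    # chunks are read and decompressed on demand
corner = chol[:1000, :1000].compute()   # materialize only a small block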

RAM usage stacking up with each iteration

I am opening a list of asm files and closing them after extracting arrays from them. I had an issue with RAM usage which is solved now.
When I was appending the extracted arrays (as numpy arrays) to a list, the RAM usage kept stacking up with each iteration. However, when I changed my code to convert the extracted arrays to lists before appending, the issue was resolved. See the line i_arrays.append(h.tolist()). I'm just trying to understand why the RAM usage was stacking up when I was storing np arrays.
Code:
import os
import array
import numpy as np
import psutil
from time import process_time
from tqdm import tqdm

t1_start = process_time()
files = os.listdir('asmFiles')
i_arrays = []
file_names = []
for i in tqdm(files[0:1501]):
    f_name = i.split('.')[0]
    file_names.append(f_name)
    b = 'asmFiles/' + str(i)
    f = open(b, 'rb')
    ln = os.path.getsize(b)
    width = int(ln**0.5)
    rem = ln % width
    a = array.array("B")
    a.fromfile(f, ln-rem)
    f.close()
    g = np.reshape(a, (int(len(a)/width), width))
    g = np.uint(g)
    h = g[0][0:800]
    i_arrays.append(h.tolist())
    print(psutil.virtual_memory()[2])
t1_stop = process_time()
print("Elapsed time during the whole program in seconds:", t1_stop-t1_start)

how to put numpy array entirely on RAM using numpy memmap?

I would like to use a memmap-allocated numpy array that can be processed in parallel using joblib, i.e. with shared memory between different processes. But I also want the big array to be stored entirely in RAM, to avoid the disk reads and writes that memmap does. I have enough RAM to store the whole array, but using np.zeros() instead of memmap complicates parallelization, since the former allocates memory local to a single process. How do I achieve my goal?
Example:
x_memmap = os.path.join(folder, 'x_memmap')
x_shared = np.memmap(x_memmap,dtype=np.float32,shape=(100000,8,8,32),mode='w+')
Later:
n = -(-N // number_of_cores)  # ceiling division; N / cores is a float in Python 3 and breaks slice()
slices = [slice(id*n, min(N, (id+1)*n)) for id in range(number_of_cores)]
Parallel(n_jobs=number_of_cores)(delayed(my_job)(x_shared[sl, :]) for sl in slices)
If I allocate x_shared with np.zeros instead as shown below, I can't use parallelization.
x_shared = np.zeros(dtype=np.float32,shape=(100000,8,8,32))
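One approach worth testing (my suggestion, not part of the original question): keep the np.memmap API that joblib shares between workers, but place the backing file on a RAM-backed tmpfs such as /dev/shm on Linux, so the reads and writes never touch a physical disk:

import os
import numpy as np

# /dev/shm is a tmpfs (RAM-backed filesystem) on most Linux systems, so this
# memmap lives entirely in RAM yet is still shareable across processes by name
folder = '/dev/shm'
x_memmap = os.path.join(folder, 'x_memmap')
x_shared = np.memmap(x_memmap, dtype=np.float32, shape=(100000, 8, 8, 32), mode='w+')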

What's an intelligent way of loading a compressed array completely from disk into memory - also (identically) compressed?

I am experimenting with a 3-dimensional zarr-array, stored on disk:
Name: /data
Type: zarr.core.Array
Data type: int16
Shape: (102174, 1100, 900)
Chunk shape: (12, 220, 180)
Order: C
Read-only: True
Compressor: Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Store type: zarr.storage.DirectoryStore
No. bytes: 202304520000 (188.4G)
No. bytes stored: 12224487305 (11.4G)
Storage ratio: 16.5
Chunks initialized: 212875/212875
As I understand it, zarr arrays can also reside in memory - compressed, as if they were on disk. So I thought, why not try to load the entire thing into RAM on a machine with 32 GByte of memory? Compressed, the dataset would require approximately 50% of the available RAM. Uncompressed, it would require about 6 times more RAM than is available.
Preparation:
import os
import zarr
from numcodecs import Blosc
import tqdm
zpath = '...' # path to zarr data folder
disk_array = zarr.open(zpath, mode = 'r')['data']
c = Blosc(cname = 'zstd', clevel=3, shuffle = Blosc.BITSHUFFLE)
memory_array = zarr.zeros(
    disk_array.shape, chunks = disk_array.chunks,
    dtype = disk_array.dtype, compressor = c
)
The following experiment fails almost immediately with an out of memory error:
memory_array[:, :, :] = disk_array[:, :, :]
As I understand it, disk_array[:, :, :] will try to create an uncompressed, full-size numpy array, which will obviously fail.
Second attempt, which works but is agonizingly slow:
chunk_lines = disk_array.chunks[0]
chunk_number = disk_array.shape[0] // disk_array.chunks[0]
chunk_remain = disk_array.shape[0] % disk_array.chunks[0] # unhandled ...
for chunk in tqdm.trange(chunk_number):
    chunk_slice = slice(chunk * chunk_lines, (chunk + 1) * chunk_lines)
    memory_array[chunk_slice, :, :] = disk_array[chunk_slice, :, :]
Here, I am trying to read a certain number of chunks at a time and put them into my in-memory array. It works, but it is about 6 to 7 times slower than what it took to write this thing to disk in the first place. EDIT: Yes, it's still slow, but the factor of 6 to 7 turned out to be due to a disk issue.
What's an intelligent and fast way of achieving this? I'd guess that, besides possibly not using the right approach, my chunks might also be too small - but I am not sure.
EDIT: Shape, chunk size and compression are supposed to be identical for the on-disk array and the in-memory array. It should therefore be possible to eliminate the decompress-compress procedure in my example above.
I found zarr.convenience.copy but it is marked as an experimental feature, subject to further change.
Related issue on GitHub
You could conceivably try fsspec.implementations.memory.MemoryFileSystem, which has a get_mapper() method with which you can make the kind of object expected by zarr.
However, this is really just a dict of path:io.BytesIO, which you could make yourself, if you want.
There are a couple of ways one might solve this issue today:
1. Use LRUStoreCache to cache (some) compressed data in memory.
2. Coerce your underlying store into a dict and use that as your store.
The first option might be appropriate if you only want some frequently used data in memory. How much you load into memory is of course something you can configure, so this could be the whole array; data is only cached on demand, which may be useful for you.
The second option just creates a new in-memory copy of the array by pulling all of the compressed data from disk. The one downside is that if you intend to write back to disk, you will need to do so manually, but it is not too difficult. The update method is pretty handy for facilitating this copying of data between different stores.
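Here is a hedged sketch of both options (mine, assuming zarr v2's store API and the question's layout of a group containing a 'data' array at zpath):

import zarr
from zarr.storage import DirectoryStore

disk_store = DirectoryStore(zpath)

# Option 1: LRU cache of compressed chunks, capped here at 16 GiB
cached = zarr.open(zarr.LRUStoreCache(disk_store, max_size=2**34), mode='r')['data']

# Option 2: pull every compressed chunk into a plain dict up front
mem_store = {}
mem_store.update(disk_store)  # dict.update copies key -> compressed bytes
memory_array = zarr.open(mem_store, mode='r')['data']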

Huge memory usage in pyROOT

My pyROOT analysis code is using huge amounts of memory. I have reduced the problem to the example code below:
from ROOT import TChain, TH1D

# Load file, chain
chain = TChain("someChain")
inFile = "someFile.root"
chain.Add(inFile)
nentries = chain.GetEntries()

# Declare histograms
h_nTracks = TH1D("h_nTracks", "h_nTracks", 16, -0.5, 15.5)
h_E = TH1D("h_E", "h_E", 100, -0.1, 6.0)
h_p = TH1D("h_p", "h_p", 100, -0.1, 6.0)
h_ECLEnergy = TH1D("h_ECLEnergy", "h_ECLEnergy", 100, -0.1, 14.0)

# Loop over entries
for jentry in range(nentries):
    # Load entry
    entry = chain.GetEntry(jentry)
    # Define variables
    cands = chain.__ncandidates__
    nTracks = chain.nTracks
    E = chain.useCMSFrame__boE__bc
    p = chain.useCMSFrame__bop__bc
    ECLEnergy = chain.useCMSFrame__boECLEnergy__bc
    # Fill histos
    h_nTracks.Fill(nTracks)
    h_ECLEnergy.Fill(ECLEnergy)
    for cand in range(cands):
        h_E.Fill(E[cand])
        h_p.Fill(p[cand])
where someFile.root is a root file with 700,000 entries and multiple particle candidates per entry.
When I run this script it uses ~600 MB of memory. If I remove the line
h_p.Fill(p[cand])
it uses ~400 MB.
If I also remove the line
h_E.Fill(E[cand])
it uses ~150 MB.
If I also remove the lines
h_nTracks.Fill(nTracks)
h_ECLEnergy.Fill(ECLEnergy)
there is no further reduction in memory usage.
It seems that for every extra histogram that I fill of the form
h_variable.Fill(variable[cand])
(i.e. histograms that are filled once per candidate per entry, as opposed to histograms that are just filled once per entry) I use an extra ~200 MB of memory. This becomes a serious problem when I have 10 or more histograms because I am using GBs of memory and I am exceeding the limits of my computing system. Does anybody have a solution?
Update: I think this is a python3 problem.
If I take the script in my original post (above) and run it using python2 the memory usage is ~200 MB, compared to ~600 MB with python3. Even if I try to replicate Problem 2 by using the long variable names, the job still only uses ~200 MB of memory with python2, compared to ~1.3 GB with python3.
During my Googling I came across a few other accounts of people encountering memory leaks when using pyROOT with python3. It seems this is still an issue as of Python 3.6.2 and ROOT 6.08/06, and that for the moment you must use python2 if you want to use pyROOT.
So, using python2 appears to be my "solution" for now, but it's not ideal. If anybody has any further information or suggestions I'd be grateful to hear from you!
I'm glad you figured out Python3 was the problem. But if you (or anyone) continues to have memory usage issues when working with histograms in the future, here are some potential solutions I hope you'll find helpful!
THnSparse
Use THnSparse, an efficient multidimensional histogram that shows its strengths in histograms where only a small fraction of the total bins are filled. You can read more about it here.
TTree
TTrees are data structures in ROOT that are, quite frankly, glorified tables. However, they are highly optimized. A TTree is composed of branches and leaves containing data that, through ROOT, can be accessed speedily and efficiently. If you put your data into a TTree first and then read it into a histogram, I guarantee you will see lower memory usage and faster run times.
Here is some example TTree code.
import ROOT

root_file_path = "../hadd_www.root"
muon_ps = ROOT.TFile(root_file_path)
muon_ps_tree = muon_ps.Get("WWWNtuple")
muon_ps_branches = muon_ps_tree.GetListOfBranches()
canv = ROOT.TCanvas()
num_of_events = 5000
ttvhist = ROOT.TH1F('Statistics2', 'Jet eta for ttV (aqua) vs WWW (white); Pseudorapidity', 100, -3, 3)

i = 0
muon_ps_tree.GetEntry(i)
print(len(muon_ps_tree.jet_eta))

while muon_ps_tree.GetEntry(i):
    if i > num_of_events:
        break
    for k in range(len(muon_ps_tree.jet_eta)):
        ttvhist.Fill(float(muon_ps_tree.jet_eta[k]), 1)  # one entry per jet
    i += 1

ttvhist.Write()
ttvhist.Draw("hist")
ttvhist.SetFillColor(70)
And here's a resource where you can learn about how fantastic TTrees are:
TTree ROOT documentation
For more reading, here is a discussion on speeding up ROOT histogram building on the CERN help forum:
Memory-conservative histograms for usage in DQ monitoring
Best of luck with your data analysis, and happy coding!
