Currently I am working with a NumPy memmap array of shape 2,000,000 × 33 × 33 × 4 (N × W × H × C). My program reads random (N) indices from this array.
I have 8GB of RAM and a 2TB HDD. The HDD read IO is only around 20 MB/s, and RAM usage stays at 2.5GB. It seems that there is an HDD bottleneck because I am retrieving random indices that are obviously not in the memmap cache. Therefore, I would like the memmap cache to use as much RAM as possible.
Is there a way for me to tell memmap to maximize IO and RAM usage?
(Checking my python 2.7 source)
As far as I can tell NumPy memmap uses mmap.
mmap does define:
# Variables with simple values
...
ALLOCATIONGRANULARITY = 65536
PAGESIZE = 4096
However, I am not sure it would be wise (or even possible) to change these.
Furthermore, this may not solve your problem and would definitely not give you the most efficient solution, because caching and page reading happen both at the OS level and at the hardware level (for the hardware, reading a single value takes more or less the same time as reading a whole page).
A much better solution would probably be to sort your requests. (I assume here that N is large; otherwise just sort them once.)
Gather a bunch of them (say one or ten million), sort them before issuing the request, run the ordered queries, and then put the answers back in their original order, as sketched below.
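A minimal sketch of this sort-query-unsort idea, assuming arr is the memmapped array and idx is one batch of random indices (the file name, dtype and batch size are placeholders; tune the batch size to the available RAM):

import numpy as np

arr = np.memmap('data.dat', dtype=np.float32, mode='r',
                shape=(2000000, 33, 33, 4))
idx = np.random.randint(0, arr.shape[0], size=100000)

order = np.argsort(idx)            # order that makes the reads mostly sequential
sorted_answers = arr[idx[order]]   # one large, ascending-index read

answers = np.empty_like(sorted_answers)
answers[order] = sorted_answers    # scatter back to the original request order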
Related
I am dealing with large numpy arrays and I am trying out memmap as it could help.
big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000))
The above works fine and it creates a file on my hard drive of about 140GB.
1000000 is just a random number I used - not the one I am actually using.
I want to fill the matrix with values. Currently it is just set to zero.
for i in tqdm(range(len(big_matrix))):
    modified_row = get_row(i)
    big_matrix[i, :] = modified_row
At this point now, I have a big_matrix filled with the values I want.
The problem is that from this point on I can't operate on this memmap.
For example I want to multiply column wise (broadcast).
I run this:
big_matrix * weights[:, np.newaxis]
Where weights has the same length.
It just hangs and throws an out-of-memory error as my RAM and swap are all used up.
My understanding was that the memmap will keep everything on the hard drive.
For example save the results directly there.
So I tried this then:
for i in tqdm(range(big_matrix.shape[1])):
    temp = big_matrix[:, i].tolist()
    temp = np.array(temp) * weights
The above loads only one column into memory and multiplies it with the weights.
Then I will save that column back into big_matrix.
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
At this point I am thinking of switching to sqlite.
I wanted to get some insight into why my code is not working.
Do I need to flush the memmap every time I change it?
np.memmap maps a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for fast reuse. The memory is generally not flushed unless it is reclaimed (e.g. by another process or by the same process). When this happens, the OS typically (partially) flushes data to the storage device and (partially) frees the physical memory used for the mapping. That being said, this behaviour depends on the actual OS: it works that way on Windows, while on Linux you can use madvise to tune it. However, madvise is a low-level C function not yet supported by NumPy (though it is apparently supported for Python, see this issue for more information). Actually, NumPy does not even support closing the memmapped space (which is leaky). The solution is generally to flush data manually so as not to lose it. There are alternative solutions, but none of them is great yet.
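A minimal sketch of flushing manually; the file name and shape are placeholders:

import numpy as np

big_matrix = np.memmap('big_matrix.dat', dtype=np.float16, mode='w+',
                       shape=(100000, 10000))
big_matrix[0, :] = 1.0    # modify some data
big_matrix.flush()        # force the dirty pages to be written to the file
del big_matrix            # drop the reference so the mapping can be released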
big_matrix * weights[:, np.newaxis]
It just hangs and throws an out-of-memory error as my RAM and swap are all used up.
This is normal since NumPy creates a new temporary array stored in RAM. There is no way to tell NumPy to store temporary arrays on the storage device. That being said, you can tell NumPy where the output data is stored using the out parameter of some functions (e.g. np.multiply supports it). The output array can itself be created with memmap so as not to use too much memory (subject to the caching behaviour of the OS).
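A minimal sketch of this out= approach; the file names are placeholders and the shape is reduced so the example stays small:

import numpy as np

n = 10000
big_matrix = np.memmap('big_matrix.dat', dtype=np.float16, mode='r+', shape=(n, n))
weights = np.ones(n, dtype=np.float16)

# The product is written directly into another memmapped array instead of a
# temporary array allocated in RAM.
result = np.memmap('result.dat', dtype=np.float16, mode='w+', shape=(n, n))
np.multiply(big_matrix, weights[:, np.newaxis], out=result)
result.flush()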
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
This is also expected, especially if you use an HDD and not an SSD. Indeed, the array is stored (virtually) contiguously on the storage device, so big_matrix[:, i] has to fetch data with a huge stride. For each item, with a size of only 2 bytes, the OS will perform an IO request to the storage device. Storage devices are optimized for contiguous reads, so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally fetch at least a page (typically 4096 bytes, that is, 512 times more than what is actually needed). Moreover, there is a limit on the number of IO requests that can be completed per second: HDDs can typically do about 20-200 IO requests per second, while the fastest NVMe SSDs reach 100_000-600_000 IO requests per second. Note that the cache helps avoid reloading data for the next column, unless there are too many loaded pages and the OS has to flush them.

Reading a matrix of size (1_000_000, 1_000_000) column by column causes up to 1_000_000 * 1_000_000 = 1_000_000_000_000 fetches, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient since the processor cannot do that (due to the limited number of entries in the TLB). This typically results in TLB misses, that is, expensive kernel calls for each item to be read. Because a kernel call typically takes (at least) about ~1 µs on a mainstream PC, this means more than a week for the whole computation.
If you want to read columns efficiently, then you need to read large chunks of columns at once. For example, you certainly need at least several hundred columns per read even on a fast NVMe SSD; on an HDD, it is at least several tens of thousands of columns to get a proper throughput. This means you certainly cannot read the full columns efficiently due to the large amount of RAM that would require. Using another data layout (tiled and/or transposed data) is critical in this case.
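For illustration, here is a minimal sketch of processing the matrix tile by tile so that each chunk read from the memmap fits comfortably in RAM; the file name, tile sizes and the weights are assumptions to be tuned:

import numpy as np

n_rows, n_cols = 1000000, 1000000
row_tile, col_tile = 16384, 16384   # ~512 MiB per float16 tile
big_matrix = np.memmap('big_matrix.dat', dtype=np.float16, mode='r+',
                       shape=(n_rows, n_cols))
weights = np.ones(n_rows, dtype=np.float16)

for r in range(0, n_rows, row_tile):
    for c in range(0, n_cols, col_tile):
        # Materialize one tile in RAM, update it, and write it back.
        tile = np.asarray(big_matrix[r:r+row_tile, c:c+col_tile])
        big_matrix[r:r+row_tile, c:c+col_tile] = tile * weights[r:r+row_tile, np.newaxis]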
My sincere apologies in advance if this question seems quite basic.
Given:
>>> import numpy as np
>>> import time
>>> A = np.random.rand( int(1e5), int(5e4) ) # large numpy array
Goal:
>>> bt=time.time(); B=np.argsort(A,axis=1);et=time.time();print(f"Took {(et-bt):.2f} s")
However, it takes quite a long time to calculate the array of indices:
# Took 316.90 s
Question:
Is there any other time efficient ways to do this?
Cheers,
The input array A has a shape of (100_000, 50_000) and contains np.float64 values by default. This means that you need 8 * 100_000 * 50_000 / 1024**3 = 37.3 GiB of memory just for this array. You also likely need the same amount of space for the output matrix B (which should contain items of type np.int64). This means you need a machine with at least 74.5 GiB, not to mention the space required for the operating system (OS) and running software (so probably at least 80 GiB). If you do not have that much memory, then your OS will use your storage device as swap memory, which is much, much slower.
Assuming you have that much memory available, such a computation is still very expensive. This is mainly due to page faults when filling the B array, the fact that B is pretty huge, and the fact that the default NumPy implementation does the computation sequentially. You can speed the computation up using parallel Numba code and a smaller output type. Here is an example:
import numpy as np
import numba as nb

@nb.njit('int32[:,::1](float64[:,::1])', parallel=True)
def fastSort(a):
    b = np.empty(a.shape, dtype=np.int32)
    for i in nb.prange(a.shape[0]):
        b[i, :] = np.argsort(a[i, :])
    return b
Note that Numba's implementation of argsort is less efficient than NumPy's, but the parallel version should be much faster if the target machine has many cores and a good memory bandwidth.
Here are the results on my 6-core machine on a matrix of size (10_000, 50_000) (10 times smaller because I do not have 80 GiB of RAM):
Original implementation: 28.14 s
Sequential Numba: 38.70 s
Parallel Numba: 6.79 s
The resulting solution is thus 4.1 times faster.
Note that you could even use the uint16 type for the items of B in this specific case, since the size of each line is less than 2**16 = 65536. This will likely not be significantly faster, but it should save some additional memory: the resulting required memory would be about 46.5 GiB. You can further reduce the amount of memory needed by using the np.float32 type for A (often at the expense of a loss of accuracy).
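A minimal sketch of that variant, assuming a float32 input and uint16 indices (the function name is illustrative):

import numpy as np
import numba as nb

@nb.njit('uint16[:,::1](float32[:,::1])', parallel=True)
def fastSortCompact(a):
    # uint16 is enough because each row has fewer than 2**16 items.
    b = np.empty(a.shape, dtype=np.uint16)
    for i in nb.prange(a.shape[0]):
        b[i, :] = np.argsort(a[i, :]).astype(np.uint16)
    return b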
If you want to improve the execution time further, then you need to write a faster implementation of argsort for your specific needs in a low-level language like C or C++. But be aware that beating NumPy is far from easy if you are not an experienced programmer in such a language or are not familiar with low-level optimizations. If you are interested in such a solution, a good starting point is probably to implement a radix sort.
My question is simple; and I could not find a resource that answers it. Somewhat similar links are using asarray, on numbers in general, and the most succinct one here.
How can I "calculate" the overhead of loading a numpy array into RAM (if there is any overhead)? Or, how to determine the least amount of RAM needed to hold all arrays in memory (without time-consuming trial and error)?
In short, I have several numpy arrays of shape (x, 1323000, 1), with x being as high as 6000. This leads to a disk usage of 30GB for the largest file.
All files together need 50GB. Is it therefore enough if I use slightly more than 50GB of RAM (using Kubernetes)? I want to use the RAM as efficiently as possible, so just requesting 100GB is not an option.
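As a baseline, the raw in-memory footprint of one array can be computed from its shape and dtype; a minimal sketch, where the shape and dtype below are assumptions based on the figures above:

import numpy as np

shape = (6000, 1323000, 1)       # assumed largest array
dtype = np.dtype(np.float32)     # assumed dtype (4 bytes per element)

n_bytes = np.prod(shape, dtype=np.int64) * dtype.itemsize
print(n_bytes / 1024**3, 'GiB')  # ~29.6 GiB, matching the ~30GB file on disk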
I'm testing NumPy's memmap through IPython Notebook, with the following code
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(int(5e6), int(4e4)))
As you can see, Ymap's shape is pretty large. I'm trying to fill up Ymap like a sparse matrix. I'm not using scipy.sparse matrices because I will eventually need to dot-product it with another dense matrix, which will definitely not fit into memory.
Anyways, I'm performing a very long series of indexing operations:
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(int(5e6), int(4e4)))
with open("somefile.txt", 'rb') as somefile:
    for i in xrange(int(5e6)):
        # Read a line
        line = somefile.readline()
        # For each token in the line, look up its j value
        # and assign the value 1.0 to Ymap[i, j]
        for token in line.split():
            j = some_dictionary[token]
            Ymap[i, j] = 1.0
These operations somehow quickly eat up my RAM. I thought mem-mapping was basically an out-of-core numpy.ndarray. Am I mistaken? Why is my memory usage sky-rocketing like crazy?
A (non-anonymous) mmap is a link between a file and RAM that, roughly, guarantees that when the RAM backing the mmap is full, data will be paged out to the given file instead of to the swap disk/file, and that when you msync or munmap it, the whole region of RAM gets written out to the file. Operating systems typically follow a lazy strategy with respect to disk accesses (or an eager one with respect to RAM): data will remain in memory as long as it fits. This means a process with large mmaps will eat up as much RAM as it can/needs before spilling the rest over to disk.
So you're right that an np.memmap array is an out-of-core array, but it is one that will grab as much RAM cache as it can.
As the docs say:
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.
There's no true magic in computers ;-) If you access very little of a giant array, a memmap gimmick will require very little RAM; if you access very much of a giant array, a memmap gimmick will require very much RAM.
One workaround that may or may not be helpful in your specific code: create new mmap objects periodically (and get rid of the old ones) at logical points in your workflow. Then the amount of RAM needed should be roughly proportional to the number of array items you touch between such steps. Against that, it takes time to create and destroy new mmap objects, so it's a balancing act.
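A minimal sketch of that balancing act; the file name, shape and renewal interval are placeholders to be tuned:

import numpy as np

shape = (int(5e6), int(4e4))
Ymap = np.memmap('Y.dat', dtype='float32', mode='r+', shape=shape)

for i in range(shape[0]):
    Ymap[i, 0] = 1.0                    # ...whatever per-row work you do...
    if i > 0 and i % 100000 == 0:
        Ymap.flush()                    # write dirty pages out to the file
        del Ymap                        # drop the old mapping so its pages can be freed
        Ymap = np.memmap('Y.dat', dtype='float32', mode='r+', shape=shape)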
I am using numpy and trying to create a huge matrix.
While doing this, I receive a memory error.
Because the matrix itself is not important, I will just show how to easily reproduce the error.
import numpy as np

a = 10000000000
data = np.array([float('nan')] * a)
Not surprisingly, this throws a MemoryError.
There are two things I would like to point out:
I really need to create and to use a big matrix
I think I have enough RAM to handle this matrix (I have 24 GB of RAM)
Is there an easy way to handle big matrices in numpy?
Just to be on the safe side, I previously read these posts (which sounds similar):
Very large matrices using Python and NumPy
Python/Numpy MemoryError
Processing a very very big data set in python - memory error
P.S. Apparently I have some problems with multiplication and division of numbers, which made me think that I have enough memory. So I think it is time for me to go to sleep, review my math, and maybe buy some memory.
Maybe during this time some genius will come up with an idea for how to actually create this matrix using only 24 GB of RAM.
Why I need this big matrix
I am not going to do any manipulations with this matrix. All I need to do with it is to save it into pytables.
Assuming each floating point number is 4 bytes, you'd have
(10000000000 * 4) / (2**30.0) = 37.25290298461914
or about 37.3 GiB that you need to store in memory, so I don't think 24 GB of RAM is enough. (Note that np.array([float('nan')] * a) actually gives float64 values, i.e. 8 bytes each, which roughly doubles this to about 74.5 GiB.)
If you can't afford to create such a matrix but still wish to do some computations, try sparse matrices, for example:
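A minimal sketch, assuming most entries stay zero (the shape and values are illustrative); a sparse matrix only stores the non-zero entries, so even a huge logical shape fits easily in RAM:

import numpy as np
from scipy import sparse

mat = sparse.lil_matrix((100000, 100000), dtype=np.float32)
mat[0, 12345] = 1.0    # set individual entries as needed
mat = mat.tocsr()      # convert to CSR for fast arithmetic afterwards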
If you wish to pass it to another Python package that uses duck typing, you may create your own class with __getitem__ implementing dummy access.
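A minimal sketch of such a duck-typed stand-in; the class name and the constant-NaN behaviour are illustrative assumptions, and elements are produced lazily in __getitem__ so nothing huge is ever allocated:

import numpy as np

class LazyNaNArray(object):
    def __init__(self, shape, dtype=np.float64):
        self.shape = shape
        self.dtype = np.dtype(dtype)

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, index):
        # Materialize only the requested part, filled with NaN.
        requested = np.broadcast_to(np.nan, self.shape)[index]
        return np.array(requested, dtype=self.dtype)

data = LazyNaNArray((10000000000,))
print(data[5], data[10:13])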
If you use the PyCharm editor for Python, you can change its memory settings in
C:\Program Files\JetBrains\PyCharm 2018.2.4\bin\pycharm64.exe.vmoptions
Reducing the memory PyCharm reserves for itself leaves more memory for your program to allocate.
You need to edit these lines:
-Xms1024m
-Xmx2048m
-XX:ReservedCodeCacheSize=960m
So you can change them to -Xms512m -Xmx1024m and your program may then be able to run,
but it will affect debugging performance in PyCharm.