CuPy random - how to generate new random set in same memory? - python

I am generating a large array of random numbers, totaling more than half the available memory on a GPU. I am doing this in a loop.
When I call cupy.random the second time (or third time...), assigning to the same variable name, it does not free the memory for the first array. It tries to allocate more memory, which causes an out of memory error.
Explicitly freeing the memory before generating a new random array is very slow, and seems inefficient.
Is there a way to generate a new set of numbers, but in the same memory space?
Edit: cupy.random.shuffle() is letting me work around the problem, but I wonder if there is a better way?
Edit 2: on further review, shuffle() does not address the problem, and appears to need even more memory than allocating a second block (before freeing the first) of memory... I am back to restricting ndarray size to less than half the remaining memory, so two ndarrays can be allocated alternately

As user2357112 suggests, cupy.random.random() does not appear to support “re-randomizing“ an existing ndarray, even though cuRand does. Writing C to modify an existing cupy array somewhat defeats the point of using python / cupy in the first place.
Curiously, having an array about 1/3rd the size of available memory, while increasing the number of loops, is faster in total execution time (versus larger arrays/fewer loops). I was not able to determine when cupy (or python or cuda?) does garbage collection on the disused array, but it seems to happen asynchronously.
If GPU garbage collection uses cuda cores (I presume it does?), it does not appear to materially effect my code execution time. Nvidia-smi reports “P2” GPU usage when my code calculations are running, suggesting there are still cores available for cupy / cuda to free memory outside of my code?
I don’t like answering my own question... just sharing what I found in case it helps someone else

Related

Numpy's memmap acting strangely?

I am dealing with large numpy arrays and I am trying out memmap as it could help.
big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000)
The above works fine and it creates a file on my hard drive of about 140GB.
1000000 is just a random number I used - not the one I am actually using.
I want to fill the matrix with values. Currently it is just set to zero.
for i in tqdm(range(len(big_matrix))):
modified_row = get_row(i)
big_matrix[i, :] = modified_row
At this point now, I have a big_matrix filled with the values I want.
The problem is that from this point on I can't operate on this memmap.
For example I want to multiply column wise (broadcast).
I run this:
big_matrix * weights[:, np.newaxis]
Where weights has the same length.
It just hangs and throws and out of memory error as my RAM and SWAP is all used.
My understanding was that the memmap will keep everything on the hard drive.
For example save the results directly there.
So I tried this then:
for i in tqdm(range(big_matrix.shape[1])):
temp = big_matrix[:, i].tolist()
temp = np.array(temp) * weights
The above loads only 1 column in memory, and multiply that with the weights.
Then I will save that column back in big_matrix.
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
At this point I am thinking of switching to sqlite.
I wanted to get some insights why my code is not working?
Do I need to flush the memmap everytime I change it ?
np.memmap map a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for a fast reuse. The memory is generally not flushed unless it is reclaimed (eg. by another process or the same process). When this happen, the OS typically (partially) flush data to the storage device and (partially) free the physical memory used for the mapping. That being said, this behaviour is dependent of the actual OS. It work that way on Windows. On Linux, you can use madvise to tune this behaviour but madvise is a low-level C function not yet supported by Numpy (though it is apparently supported for Python, see this issue for more information). Actually, Numpy does not even support closing the memmaped space (which is leaky). The solution is generally to flush data manually not to lose it. There are alternative solutions but none of them is great yet.
big_matrix * weights[:, np.newaxis]
It just hangs and throws and out of memory error as my RAM and SWAP is all used
This is normal since Numpy creates a new temporary array stored in RAM. There is no way to tell to Numpy to store temporary array in on the storage device. That being said, you can tell to Numpy where the output data is stored using the out parameter on some function (eg. np.multiply supports it). The output array can be created using memmap so not to use too much memory (regarding the behaviour of the OS).
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
This is also expected, especially if you use a HDD and not and SSD. Indeed, the array is stored (virtually) contiguously on the storage device. big_matrix[:, i] has to fetch data with a huge stride. For each item, with a size of only 2 bytes, the OS will perform an IO request to the storage device. Storage devices are optimized for contiguous reads so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally at least fetch a page (typically 4096 bytes, that is 512 times more than what is actually needed). Moreover, there is a limit of the number of IO requests that can be completed per second. HDDs can typically do about 20-200 IO requests per seconds while the fastest Nvme SSDs reach 100_000-600_000 UI requests per seconds. Note that the cache help not not reload data for the next column unless there are too many loaded pages and the OS has to flush them. Reading a matrix of size (1_000_000,1_000_000) causes up to 1_000_000*1_000_000=1_000_000_000_000 fetch, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient since the processor cannot do that (due to a limited number of entries in the TLB). This will typically results in TLB misses, that is expensive kernel calls for each item to be read. Because a kernel call typically take (at least) about ~1 us on mainstream PC, this means more than a week to to the whole computation.
If you want to efficiently read columns, then you need to read large chunk of columns. For example, you certainly need at least several hundred of columns to be read even on a fast Nvme SSD. For HDD, it is at least several dozens of thousand columns to get a proper throughput. This means you certainly cannot read the full columns efficiently due to the high amount of requested RAM. Using another data layout (tile + transposed data) is critical in this case.

How is memory handled once touched for the first time in numpy.zeros?

I recently saw that when creating a numpy array via np.empty or np.zeros, the memory of that numpy array is not actually allocated by the operating system as discussed in this answer (and this question), because numpy utilizes calloc to allocate the array's memory.
In fact, the OS isn't even "really" allocating that memory until you try to access it.
Therefore,
l = np.zeros(2**28)
does not increase the utilized memory the system reports, e.g., in htop.
Only once I touch the memory, for instance by executing
np.add(l, 0, out=l)
the utilized memory is increased.
Because of that behaviour I have got a couple of questions:
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
i = 100
f[:i] = 3
while True:
... # Do stuff
f[i] = ... # Once the memory "behind" the already allocated chunk of memory is filled
# with other stuff, does the operating system reallocate the memory and
# copy the already filled part of the array to the new location?
i = i + 1
2. Touching the last element
As the memory of the numpy array is continguous in memory, I tought
f[-1] = 3
might require the enitre block of memory to be allocated (without touching the entire memory).
However, it does not, the utilized memory in htop does not increase by the size of the array.
Why is that not the case?
OS isn't even "really" allocating that memory until you try to access it
This is dependent of the target platform (typically the OS and its configuration). Some platform directly allocates page in physical memory (eg. AFAIK the XBox does as well as some embedded platforms). However, mainstream platforms actually do that indeed.
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
Allocations are perform in virtual memory. When a first touch is done on a given memory page (chunk of fixed sized, eg. 4 KiB), the OS maps the virtual page to a physical one. So only one page will be physically map when you set only one item of the array (unless the item cross two pages which only happens in pathological cases).
The physical pages may not be contiguous for a contiguous set of virtual pages. However, this is not a problem and you should not care about it. This is mainly the job of the OS. That being said, modern processors have a dedicated unit called TLB to translate virtual address (the one you could see with a debugger) to physical ones (since this translation is relatively expensive and performance critical).
The content of the Numpy array is not reallocated nor copied thanks to paging (at least from the user point-of-view, ie. in virtual memory).
2. Touching the last element
I thought f[-1] = 3 might require the entire block of memory to be allocated (without touching the entire memory). However, it does not, the utilized memory in htop does not increase by the size of the array. Why is that not the case?
Only the last page in virtual memory associated to the Numpy array is mapped thanks to paging. This is why you do not see a big change in htop. However, you should see a slight change (the size of a page on your platform) if you look carefully. Otherwise, this should mean the page has been already mapped due to other previous recycled allocations. Indeed, the allocation library can preallocate memory area to speed up allocations (by reducing the number of slow requests to the OS). The library could also keep the memory mapped when it is freed by Numpy in order to speed the next allocations up (since the memory do not have to be unmapped to be then mapped again). This is unlikely to occur for huge arrays in practice because the impact on memory consumption would be too expensive.

Python matrix multiplication index exchange memory usage example

in c there is an example of importance of memory utilization: naive matrix multiplication using 3 for loops (i,j,k). And one can show that using (i,k,j) is much faster than (i,j,k) due to memory coalescence. In python numpy the order if indexes (again naive 3 loops, not np.dot) does not matter. Why is that?
First of all you need to know why the loop (i,k,j) is faster than (i,j,k) on C. That's happens because the memory usage optimization, in your computer memory the matrix is allocated in a linear way, so if you iterate using (i,k,j) you are using this in your favor where each loop takes a block of memory and load to your RAM. If you use (i,j,k) you are working against it, and each step will take a block of memory load to your RAM and discard on next step because you are iterating jumping blocks.
The implementation of numpy handle it for you, so even if you use the worst order numpy will order it to work faster.
The event of throwing away the cache, and keep changing it all the time is called Cache miss
At this link you can see a much better explanation about how the memory is allocated and why is it faster in some specific itartion way.

Slow GPU comparison in Cupy

I want to test using cupy whether a float is positive, e.g.:
import cupy as cp
u = cp.array(1.3)
u < 2.
>>> array(True)
My problem is that this operation is extremely slow:
%timeit u < 2. gives 26 micro seconds on my computer. It is orders of magnitude greater than what I get in CPU. I suspect it is because u has to be cast on the CPU...
I'm trying to find a faster way to do this operation.
Thanks !
Edit for clarification
My code is something like:
import cupy as cp
n = 100000
X = cp.random.randn(n) # can be greater
for _ in range(100): # There may be more iterations
result = X.dot(X)
if result < 1.2:
break
And it seems like the bottleneck of this code (for this n) is the evaluation of result < 1.2. It is still much faster than on CPU since the dot costs way less.
Running a single operation on the GPU is always a bad idea. To get performance gains out of your GPU, you need to realize a good 'compute intensity'; that is, the amount of computation performed relative to movement of memory; either from global ram to gpu mem, or from gpu mem into the cores themselves. If you dont have at least a few hunderd flops per byte of compute intensity, you can safely forget about realizing any speedup on the gpu. That said your problem may lend itself to gpu acceleration, but you certainly cannot benchmark statements like this in isolation in any meaningful way.
But even if your algorithm consists of chaining a number of such simple low-compute intensity operations on the gpu, you still will be disappointed by the speedup. Your bottleneck will be your gpu memory bandwidth; which really isnt that great compared to cpu memory bandwidth as it may look on paper. Unless you will be writing your own compute-intense kernels, or have plans for running some big ffts or such using cupy, dont think that it will give you any silver-bullet speedups by just porting your numpy code.
This may be because, when using CUDA, the array must be copied to the GPU before processing. Therefore, if your array has only one element, it can be slower in GPU than in CPU. You should try a larger array and see if this keeps happening
I think the problem here is your just leveraging one GPU device. Consider using say 100 to do all the for computations in parallel (although in the case of your simple example code it would only need doing once). https://docs-cupy.chainer.org/en/stable/tutorial/basic.html
Also there is a cupy greater function you could use to do the comparison in the GPU
Also the first time the dot gets called the kernel function will need to be compiled for the GPU which will take significantly longer than subsequent calls.

Initializing a big matrix with numpy taking way too long

I am working in python and I have encountered a problem: I have to initialize a huge array (21 x 2000 x 4000 matrix) so that I can copy a submatrix on it.
The problem is that I want it to be really quick since it is for a real-time application, but when I run numpy.ones((21,2000,4000)), it takes about one minute to create this matrix.
When I run numpy.zeros((21,2000,4000)), it is instantaneous, but as soon as I copy the submatrix, it takes one minute, while in the first case the copying part was instantaneous.
Is there a faster way to initialize a huge array?
I guess there's not a faster way. The matrix you're building is quite large (8 byte float64 x 21 x 2000 x 4000 = 1.25 GB), and might be using up a large fraction of the physical memory on your system; thus, the one minute that you're waiting might be because the operating system has to page other stuff out to make room. You could check this by watching top or similar (e.g., System Monitor) while you're doing your allocation and watching memory usage and paging.
numpy.zeros seems to be instantaneous when you call it, because memory is allocated lazily by the OS. However, as soon as you try to use it, the OS actually has to fit that data somewhere. See Why the performance difference between numpy.zeros and numpy.zeros_like?
Can you restructure your code so that you only create the submatrices that you were intending to copy, without making the big matrix?

Categories