Dealing with memory fragmentation of GPUs in Theano

To allocate space for a variable in GPU memory, there must be enough room in a single contiguous memory region. In other words, unlike with RAM, a variable on the GPU cannot be spread across fragmented memory regions. Storing several shared variables in GPU memory and continuously updating them could cause memory fragmentation. Therefore, even if enough memory is free in terms of bytes, you may not be able to use it because it does not form a contiguous block.
My question is: how does Theano deal with this problem?
Does shared_var.set_value([]) release all the memory assigned to that shared variable, so that the next update (shared_var.set_value(newDataPoints)) allocates only the amount of memory the new data needs and therefore avoids memory fragmentation?
Here it is explained that updating a shared variable may still cause memory fragmentation. So I wonder whether the problem persists if the borrow parameter or allow_gc (in theanorc) is set to True.
How can one keep track of the amount of free memory in a contiguous block on a GPU?
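The fragmentation scenario described in this question can be illustrated with a toy model. The sketch below is not Theano's or CUDA's allocator: ToyAllocator and its methods are hypothetical names, and the first-fit policy is an assumption chosen for simplicity. It only shows how the total free space and the largest contiguous free block can diverge.

```python
class ToyAllocator:
    """First-fit allocator over a fixed address range (a toy model, not Theano's)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}  # name -> (offset, size)

    def free_regions(self):
        """Return the free gaps as (offset, size) pairs, in address order."""
        regions, cursor = [], 0
        for offset, size in sorted(self.blocks.values()):
            if offset > cursor:
                regions.append((cursor, offset - cursor))
            cursor = offset + size
        if cursor < self.capacity:
            regions.append((cursor, self.capacity - cursor))
        return regions

    def total_free(self):
        return sum(size for _, size in self.free_regions())

    def largest_free_block(self):
        return max((size for _, size in self.free_regions()), default=0)

    def alloc(self, name, size):
        """Place `name` in the first gap that is large enough, or raise MemoryError."""
        for offset, gap in self.free_regions():
            if gap >= size:
                self.blocks[name] = (offset, size)
                return offset
        raise MemoryError(f"no contiguous block of {size} units")

    def free(self, name):
        del self.blocks[name]


gpu = ToyAllocator(capacity=100)
for name in ("a", "b", "c", "d"):
    gpu.alloc(name, 25)   # fill the device with four 25-unit buffers
gpu.free("a")
gpu.free("c")             # 50 units are free, but in two separate 25-unit gaps

print(gpu.total_free())          # 50
print(gpu.largest_free_block())  # 25
```

In this model a request for 40 units raises MemoryError even though 50 units are free, which is the situation the question is worried about.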

Related

How is memory handled once touched for the first time in numpy.zeros?

I recently saw that when creating a numpy array via np.empty or np.zeros, the memory of that numpy array is not actually allocated by the operating system as discussed in this answer (and this question), because numpy utilizes calloc to allocate the array's memory.
In fact, the OS isn't even "really" allocating that memory until you try to access it.
Therefore,
l = np.zeros(2**28)
does not increase the utilized memory the system reports, e.g., in htop.
Only once I touch the memory, for instance by executing
np.add(l, 0, out=l)
the utilized memory is increased.
Because of that behaviour I have got a couple of questions:
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
i = 100
f[:i] = 3
while True:
    ...  # Do stuff
    f[i] = ...  # Once the memory "behind" the already allocated chunk of memory is filled
                # with other stuff, does the operating system reallocate the memory and
                # copy the already filled part of the array to the new location?
    i = i + 1
2. Touching the last element
As the numpy array is contiguous in memory, I thought
f[-1] = 3
might require the entire block of memory to be allocated (without touching the entire memory).
However, it does not: the utilized memory in htop does not increase by the size of the array.
Why is that not the case?

OS isn't even "really" allocating that memory until you try to access it
This depends on the target platform (typically the OS and its configuration). Some platforms allocate pages in physical memory directly (e.g. AFAIK the Xbox does, as do some embedded platforms). Mainstream platforms, however, do defer the allocation in this way.
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
Allocations are performed in virtual memory. When a memory page (a fixed-size chunk, e.g. 4 KiB) is touched for the first time, the OS maps the virtual page to a physical one. So only one page is physically mapped when you set a single item of the array (unless the item straddles two pages, which only happens in pathological cases).
The physical pages backing a contiguous set of virtual pages may not be contiguous. However, this is not a problem and you should not care about it: it is mainly the job of the OS. That being said, because this translation is relatively expensive and performance critical, modern processors have a dedicated unit called the TLB that caches translations from virtual addresses (the ones you see in a debugger) to physical ones.
The content of the Numpy array is neither reallocated nor copied, thanks to paging (at least from the user's point of view, i.e. in virtual memory).
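To put numbers on the first-touch behaviour for the array from the question, here is a minimal sketch of the page arithmetic. It assumes 4 KiB pages, 8-byte float64 items and a buffer that starts on a page boundary; pages_touched is a hypothetical helper, not a NumPy or OS function, and platforms using larger pages will give different counts.

```python
import mmap


def pages_touched(first_byte, last_byte, page_size=mmap.PAGESIZE):
    """Number of pages spanned by the inclusive byte range [first_byte, last_byte].

    Offsets are measured from the start of the buffer, which is assumed
    to be page-aligned.
    """
    return last_byte // page_size - first_byte // page_size + 1


ITEMSIZE = 8     # bytes per float64
N_ITEMS = 2**28  # array length used in the question
PAGE = 4096      # assumed page size for the worked numbers

# Pages backing the whole array once every element has been written.
total_pages = pages_touched(0, N_ITEMS * ITEMSIZE - 1, PAGE)
# Pages mapped by f[:100] = 3.
first_100 = pages_touched(0, 100 * ITEMSIZE - 1, PAGE)
# Pages mapped by f[-1] = 3.
last_item = pages_touched((N_ITEMS - 1) * ITEMSIZE, N_ITEMS * ITEMSIZE - 1, PAGE)

print(total_pages, first_100, last_item)  # 524288 1 1
```

Under these assumptions, writing the first 100 items or only the last item maps a single 4 KiB page out of 524288, which is far too small a change to stand out in htop.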
2. Touching the last element
I thought f[-1] = 3 might require the entire block of memory to be allocated (without touching the entire memory). However, it does not, the utilized memory in htop does not increase by the size of the array. Why is that not the case?
Thanks to paging, only the last page in virtual memory associated with the Numpy array is mapped. This is why you do not see a big change in htop. However, you should see a slight change (the size of a page on your platform) if you look carefully. If you do not, the page was probably already mapped because of a previous, recycled allocation. Indeed, the allocation library can preallocate memory areas to speed up allocations (by reducing the number of slow requests to the OS). The library can also keep memory mapped after Numpy frees it, to speed up later allocations (since the memory does not have to be unmapped and then mapped again). This is unlikely to occur for huge arrays in practice because the cost in memory consumption would be too high.

CuPy random - how to generate new random set in same memory?

I am generating a large array of random numbers, totaling more than half the available memory on a GPU. I am doing this in a loop.
When I call cupy.random the second time (or third time...), assigning to the same variable name, it does not free the memory for the first array. It tries to allocate more memory, which causes an out of memory error.
Explicitly freeing the memory before generating a new random array is very slow, and seems inefficient.
Is there a way to generate a new set of numbers, but in the same memory space?
Edit: cupy.random.shuffle() is letting me work around the problem, but I wonder if there is a better way?
Edit 2: on further review, shuffle() does not address the problem, and appears to need even more memory than allocating a second block (before freeing the first) of memory... I am back to restricting ndarray size to less than half the remaining memory, so two ndarrays can be allocated alternately

As user2357112 suggests, cupy.random.random() does not appear to support “re-randomizing” an existing ndarray, even though cuRAND does. Writing C to modify an existing cupy array somewhat defeats the point of using python / cupy in the first place.
Curiously, having an array about 1/3rd the size of available memory, while increasing the number of loops, is faster in total execution time (versus larger arrays/fewer loops). I was not able to determine when cupy (or python or cuda?) does garbage collection on the disused array, but it seems to happen asynchronously.
If GPU garbage collection uses cuda cores (I presume it does?), it does not appear to materially affect my code execution time. nvidia-smi reports the “P2” performance state while my calculations are running, suggesting there are still cores available for cupy / cuda to free memory outside of my code?
I don’t like answering my own question... just sharing what I found in case it helps someone else

Over-high memory usage during reading parquet in Python

I have a parquet file of around 10+GB whose columns are mainly strings. While loading it into memory, usage can peak at 110G; once loading has finished, usage drops back to around 40G.
I'm working on a high-performance computer with allocated memory, so I do have access to large memory. However, it seems a waste to me that I have to apply for 128G of memory just to load the data when 64G is sufficient afterwards. Also, 128G memory is more often out of order.
My naive conjecture is that the Python interpreter mistakes the 512G of physical memory on the HPC for the total available memory, so it does not garbage-collect as often as it actually needs to. For example, when I load the data with 64G of memory, it never throws a MemoryError; instead the kernel is killed outright and restarted.
I was wondering whether the over-high memory usage when loading is regular pyarrow behavior or is due to the special setup of my environment. If the latter, is it possible to somehow limit the available memory during loading?

We fixed a memory use bug that's present in 0.14.0/0.14.1 (which is probably what you're using right now).
https://issues.apache.org/jira/browse/ARROW-6060
We also are introducing an option to read string columns as categorical (aka DictionaryArray in Arrow parlance) which also will reduce memory usage. See https://issues.apache.org/jira/browse/ARROW-3325 and discussion in
https://ursalabs.org/blog/2019-06-07-monthly-report/

How to know if python process is garbage collecting too often?

I am tuning the performance of a python job (that uses PyTorch and Nvidia CUDA). My python process runs on a shared cluster where the maximum RAM needed must be explicitly specified. Since lower is better, we tend to set the lowest RAM that doesn't cause an out-of-memory error.
In particular, I noticed that the RAM setting can make quite a difference to performance. For instance, if I set max RAM=6GB, my job takes 27 hours. If I set max RAM=10GB (keeping all other variables the same), the same job takes about 16 hours. Most of the work is done on the GPU using GPU RAM, so the CPU and CPU RAM are only used for housekeeping and moving tensors.
My suspicion is that the garbage collector runs too often when I set less RAM. I have observed this kind of behavior with the Java VM, where I had tools to inspect how much time my JVM process spent in the garbage collector.
However, with python, I am clueless.
Are there any ways to inspect the memory management, in particular the garbage collector time (as a fraction of runtime)?
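One possible way to get at the number asked for here, sketched under assumptions: CPython exposes gc.callbacks, a list of callables invoked with phase "start" or "stop" around each run of the cyclic collector. GCTimer and workload below are hypothetical names introduced for illustration. The sketch only measures the cyclic collector, not memory freed by reference counting and not PyTorch's or CUDA's own allocators. Note also that CPython triggers collections based on allocation counts rather than memory pressure, so a lower RAM limit would not by itself be expected to make the collector run more often.

```python
import gc
import time


class GCTimer:
    """Accumulate wall-clock time spent in CPython's cyclic garbage collector."""

    def __init__(self):
        self.total_seconds = 0.0
        self.collections = 0
        self._started_at = None

    def __call__(self, phase, info):
        # CPython calls each entry of gc.callbacks with phase "start" or "stop".
        if phase == "start":
            self._started_at = time.perf_counter()
        elif phase == "stop" and self._started_at is not None:
            self.total_seconds += time.perf_counter() - self._started_at
            self.collections += 1
            self._started_at = None

    def __enter__(self):
        gc.callbacks.append(self)
        return self

    def __exit__(self, *exc):
        gc.callbacks.remove(self)
        return False


def workload():
    """Hypothetical stand-in for the real job: builds and drops reference cycles."""
    for _ in range(10_000):
        a, b = [], []
        a.append(b)
        b.append(a)


wall_start = time.perf_counter()
with GCTimer() as timer:
    workload()
    gc.collect()  # force one collection so the demo always records at least one
wall = time.perf_counter() - wall_start

print(f"{timer.collections} collections, "
      f"{timer.total_seconds:.4f}s of {wall:.4f}s spent in GC")
```

Wrapping the real training loop in the with block instead of workload() would give the fraction of runtime spent in the collector; if that fraction is small at both RAM settings, the slowdown likely has another cause.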

Python: minimizing memory usage with functions

I am writing code in which, at some point, I need to solve several generalized eigenvalue problems for large sparse matrices. Because these operations are essentially similar (only the names of the matrices change), I made a function:
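import scipy.io
from scipy.sparse.linalg import eigsh

def eig_prob(myvariables):
    # this is just a simplified example
    name = 'iteration_' + myvariables["i"]
    A = myvariables["A"]
    B = myvariables["B"]
    N = myvariables["nb_eig"]
    # N eigenpairs of A z = lambda B z closest to sigma = 1 (shift-invert mode)
    Z, V = eigsh(A, N, B, sigma=1)
    # save in Matlab format; `files` is defined elsewhere in the script
    scipy.io.savemat(files["exec"] + name + ".mat", {"Z": Z, "V": V})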
As I do not return anything to my main function, I would expect the amount of RAM in use to be the same before and after the call to eig_prob.
In fact, I observe that RAM consumption increases by about 800 MB during the call to eig_prob, which is expected, but this memory is not freed after the call, which seems surprising to me.
Is there any explanation for such behavior? Can it be avoided? Do I need to run my function as a subprocess to avoid this overconsumption of memory?
edit: a colleague of mine indicated that gc.collect() [1] may help, and it does! When called after the function, gc.collect() frees the 800 MB.
[1] https://docs.python.org/2/library/gc.html
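One mechanism that would explain why an explicit collection helps is a reference cycle: objects that refer to each other are not released by reference counting alone and wait for the cyclic collector. Whether that is what happens inside eig_prob is an assumption, since the question does not show it. The sketch below uses a hypothetical Node class and a weak reference purely to observe when the objects disappear.

```python
import gc
import weakref


class Node:
    """Small object that can participate in a reference cycle."""

    def __init__(self):
        self.other = None


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # a <-> b: reference counts never reach zero on their own
    return weakref.ref(a)    # a weak reference lets us observe whether `a` still exists


gc.disable()                        # keep the automatic collector out of the demo
probe = make_cycle()
alive_before = probe() is not None  # the cycle outlives the function call
gc.collect()                        # explicit collection, as in the edit above
alive_after = probe() is not None   # the collector has reclaimed the cycle
gc.enable()

print(alive_before, alive_after)  # True False
```

In a real program the automatic collector would eventually reclaim such cycles as well; calling it explicitly right after the function only makes the release happen at a known point.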

When a Python object is allocated, it is placed on the heap of the program.
If it is a quite large object, memory will be allocated via mmap() for as long as it is needed and freed again afterwards. I am not sure if that happens immediately...
For smaller objects, the brk() boundary of the process is shifted to allocate the memory. If other objects are added afterwards and the earlier objects are freed, their memory becomes free on the heap but cannot be returned to the OS. Only after the end-most object on the heap is freed can part of the free area be returned to the OS.
You talk about 800 MB, which is clearly so large that the mmap() method should be used, but if the data consists of thousands of smaller objects, chances are that they land on the brk() heap.
