I'm using theano to do some computation involving a large (about 300,000 x 128) matrix.
The theano function quit with a MemoryError, similar to this question.
I think it's probably because the large matrix gets processed step by step, and each step leaves a large GpuArray (of the same shape as the first one) in memory.
So my questions are:
1. In the case of temporary (non-output) variables with the same shape and dtype, does theano make any effort to reuse already-allocated memory? For example, does a pool exist for each array shape?
2. If the answer to 1 is yes, can I inspect which nodes in the function graph reuse memory?
3. I know shared variables allow explicit sharing, but I suspect using them would make the (already hard to follow) computation code even harder to understand.
UPDATE
A simplified example of such situation:
import theano
from theano import tensor as T
a0 = T.matrix() # the initial
# op1 op2 op3 are valid, complicated operations,
# whose output's shape are identical to a0's
a1 = op1(a0)
a2 = op2(a1)
a3 = op3(a2)
f = theano.function([a0], a3)
If none of op1, op2, op3 can be optimized to work in-place, will theano try to reuse memory to reduce the footprint, e.g. could a1 and a3 "share" the same address, since a1 is no longer needed by the time op3 runs?
Thanks!
You need to use in-place operations as much as possible. For the most part this is not under user control (they are automatically used by the optimizer when circumstances allow) but there are a few things you can do to encourage their use.
Take a look at the documentation on this issue:
memory profiler
Information on the optimizations that enable in-place operation
How to create custom operations that can work in-place
As indicated in the documentation, do not set the inplace parameter of the inc_subtensor operation yourself. The docs also state that in-place operators are not supported for user specification; the optimizer will apply them automatically when possible.
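For example, here is a minimal sketch (plain theano.tensor API) of writing such an update without any inplace flag, leaving the decision to the optimizer:

import theano
import theano.tensor as T

x = T.vector()
y = T.vector()   # must match the length of the slice below at call time
# Build the update without any inplace flag; when it is safe to do so, the
# optimizer rewrites this into an in-place IncSubtensor node on its own.
z = T.inc_subtensor(x[:3], y)
f = theano.function([x, y], z)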
You could use Theano's memory profiler to help track down which operation(s) are using up the memory.
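For instance, roughly like this (the same can be enabled globally with THEANO_FLAGS=profile=True,profile_memory=True; a0 and a3 are the variables from the question and some_input is a placeholder for your real data):

import theano

f = theano.function([a0], a3, profile=True)
f(some_input)          # run at least once so there is something to report
f.profile.summary()    # per-node timing report; includes memory figures when
                       # the profile_memory config flag is enabled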
UPDATE
I'm not an expert in this area but I believe Theano uses a garbage collection mechanism to free up memory that is no longer needed. This is discussed in the documentation.
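As far as I know, the relevant knob is the allow_gc configuration flag; a minimal sketch of the trade-off:

import theano

# allow_gc=True (the default) frees each intermediate buffer as soon as it is
# no longer needed, which minimizes the memory footprint of a single call.
# allow_gc=False keeps those buffers allocated so they can be reused by the
# next call to the same function: faster, but with a larger footprint.
theano.config.allow_gc = True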
I have a huge numpy array with shape (50000000, 3) and I'm using:
x = array[np.where((array[:,0] == value) | (array[:,1] == value))]
to get the part of the array that I want. But this way seems to be quite slow.
Is there a more efficient way of performing the same task with numpy?
np.where is highly optimized and I doubt anyone can write faster code than what is implemented in the latest Numpy version (disclaimer: I am one of the people who optimized it). That being said, the main issue here is not so much np.where as the conditionals, which create temporary boolean arrays. This is unfortunately the way to do it in Numpy, and there is not much you can do as long as you use only Numpy with the same input layout.
One reason it is not very efficient is that the input data layout is inefficient. Indeed, assuming array is contiguously stored in memory using the default row-major ordering, array[:,0] == value reads only 1 item out of every 3 items of the array in memory. Due to the way CPU caches work (i.e. cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. In fact, the output boolean array also needs to be written, and filling a newly-created array is a bit slow due to page faults. Note that array[:,1] == value will certainly reload data from RAM due to the size of the input (which cannot fit in most CPU caches). RAM is slow and it is getting slower relative to the computational speed of CPUs and caches. This problem, called the "memory wall", was observed a few decades ago and is not expected to be fixed any time soon. Also note that the logical-or creates yet another array read from / written to RAM. A better data layout is a (3, 50000000) transposed array contiguous in memory (note that np.transpose does not produce a contiguous array).
Another reason for the performance issue is that Numpy tends not to be optimized to operate on very small axes.
One main solution is to create the input in a transposed layout if possible. Another solution is to write Numba or Cython code. Here is a Numba implementation for the non-transposed input:
import numpy as np
import numba as nb

# Compilation for the most frequent types.
# Please pick the right ones to speed up the compilation time.
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    # Parallel loop over the rows; checks the first two columns of each row.
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]
Note that I used a parallel implementation since the or operator is sub-optimal with Numba (the only solution seems to be to use native code or Cython) and also because the RAM cannot be fully saturated with one thread on some platforms like computing servers. Also note that it can be faster to use array[np.where(select(array, value))[0]] depending on the result of select. Indeed, if the result is random or very small, then np.where can be faster since it has special optimizations for these cases that boolean indexing does not perform. Note that np.where is not particularly optimized in the context of a Numba function since Numba uses its own implementation of Numpy functions and they are sometimes not as optimized for large arrays. A faster implementation consists in creating x in parallel, but this is not trivial to do with Numba since the number of output items is not known ahead of time and threads must know where to write data, not to mention Numpy is already fairly fast at doing this sequentially as long as the output is predictable.
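For reference, here is a small plain-NumPy sketch of the transposed-layout idea mentioned above; np.ascontiguousarray is needed to actually materialize the (3, n) layout, since a bare transpose is only a strided view (and the sizes here are smaller than the question's 50M rows just to keep the example light; ideally the data would be produced in this layout in the first place rather than copied):

import numpy as np

array = np.random.randint(0, 100, (1_000_000, 3))
value = 42

array_t = np.ascontiguousarray(array.T)      # real (3, n) row-major copy
assert array_t.flags['C_CONTIGUOUS']

# Each comparison now streams over contiguous memory instead of every 3rd item.
mask = (array_t[0] == value) | (array_t[1] == value)
x = array[mask]                              # or array[np.where(mask)[0]], as discussed above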
I'm trying to understand how out= parameters work in numpy, pytorch and similar libraries.
Intuitively I would expect it to allow allocation-free arithmetic. That is, where C = A @ B would allocate a new matrix, np.matmul(A, B, out=C) would use the space already allocated for C, thus requiring less memory management and faster code.
What confuses me is that the code
np.matmul(A, B, out=A) seems to work correctly.
That is, it computes A = A @ B without error.
So if this is really done without allocating new memory, then numpy (or pytorch) must be using some sort of in-place matrix multiplication algorithm. But I don't think those really exist? (See e.g. Is there an algorithm to multiply square matrices in-place?)
So am I wrong that out= doesn't allocate memory? Or is it just that numpy and pytorch are smart enough to realize that the output is the same as one of the inputs, and allocate memory only in this particular case?
I tried reading the documentation (https://pytorch.org/docs/stable/generated/torch.matmul.html) but it doesn't give any details of what's really going on.
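A minimal snippet reproducing the behaviour described above (plain NumPy; I'd expect it to print True on any recent version):

import numpy as np

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
expected = A @ B                  # allocates a fresh result array

np.matmul(A, B, out=A)            # writes the product back into A...
print(np.allclose(A, expected))   # ...and still prints True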
I was wondering if it is possible to write an iterative algorithm without using a for loop using as_strided and some operation that edits the memory in place.
For example, say I want to write an algorithm that replaces each number in an array with the sum of its neighbors. I came up with this abomination (yes, it's summing an element with its 2 right neighbors, but it's just to get the idea across):
import numpy as np
a = np.arange(10)
ops = 2
a_view_window = np.lib.stride_tricks.as_strided(a, shape = (ops,a.size - 2, 3), strides=(0,) + 2*a.strides)
a_view = np.lib.stride_tricks.as_strided(a, shape = (ops,a.size - 2), strides=(0,) + a.strides)
np.add.reduce(a_view_window, axis = -1, out=a_view)
print(a)
So I am taking an array of 10 numbers and creating this strange view which increases the dimensionality without changing the strides. My thinking is that the reduction will run over the fake new dimension and write over the previous values, so when it gets to the next major dimension it will have to read the data it just overwrote and thus perform the addition iteratively.
Sadly this does not work :(
(yes I know this is a terrible way to do things but I am curious about how the underlying numpy stuff works and if it can be hacked in this way)
This code results in undefined behavior prior to Numpy 1.13 and works out-of-place in newer versions so as to avoid overlapping/aliasing issues. Indeed, you cannot assume Numpy iterates in a given order over the input/output array views. In fact, Numpy often uses SIMD instructions to speed up the code and sometimes tells compilers that views do not overlap/alias each other (using the restrict keyword) so they can generate much more efficient code. For more information you can read the doc on ufuncs (and this issue):
Operations where ufunc input and output operands have memory overlap produced undefined results in previous NumPy versions, due to data dependency issues. In NumPy 1.13.0, results from such operations are now defined to be the same as for equivalent operations where there is no memory overlap.
Operations affected now make temporary copies, as needed to eliminate data dependency. As detecting these cases is computationally expensive, a heuristic is used, which may in rare cases result to needless temporary copies. For operations where the data dependency is simple enough for the heuristic to analyze, temporary copies will not be made even if the arrays overlap, if it can be deduced copies are not necessary.
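To see this behaviour concretely, one can compare the overlapping call with an explicitly out-of-place reduction. Here is a sketch reusing the arrays from the question; I would expect it to print True on NumPy >= 1.13:

import numpy as np

a = np.arange(10)
ops = 2
a_view_window = np.lib.stride_tricks.as_strided(
    a, shape=(ops, a.size - 2, 3), strides=(0,) + 2 * a.strides)
a_view = np.lib.stride_tricks.as_strided(
    a, shape=(ops, a.size - 2), strides=(0,) + a.strides)

# Out-of-place reference: copy the window first, then reduce.
reference = np.add.reduce(a_view_window.copy(), axis=-1)

# Overlapping call: the input, the output and `a` all share memory.
np.add.reduce(a_view_window, axis=-1, out=a_view)
print(np.array_equal(a_view, reference))   # expected: True on NumPy >= 1.13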
I am computing an eigenvalue decomposition of a symmetric matrix with scipy.linalg.cython_lapack.syev. From the doc I found, I need to pass an array called WORK:
WORK is DOUBLE PRECISION array, dimension (MAX(1,LWORK))
On exit, if INFO = 0, WORK(1) returns the optimal LWORK.
However, I can't see what it does (I don't understand what the values it holds after execution mean), nor what it's used for. What is the purpose of this parameter?
Using the Cython interface to dsyev() from scipy.linalg.cython_lapack makes sense: numpy's eigh wraps dsyevd() and scipy's eigh wraps dsyevr(). But, following the Fortran prototype of dsyev(), an array WORK must be provided.
The array WORK is required by syev for internal use (except when LWORK = -1).
LAPACK is written in Fortran 77, and this language does not support dynamic allocation on the heap in its standard! Dynamic allocation might have been platform-dependent or provided by specific compiler extensions. Consequently, LAPACK is written so that the user can use whatever she/he wants: static arrays, arrays allocated on the stack, or arrays allocated on the heap.
Indeed, hard-coding the size of the WORK array in the library would lead to two awkward situations. Either the array is too big, increasing the memory footprint for nothing, or the array is too small, leading to bad performance or out-of-bound errors (segmentation faults...). As a result, memory management is left to the user of the library. Some help is provided: the optimal size for the array is returned if LWORK = -1.
If dynamic allocation is available, the most common use of LAPACK functions is to first perform a workspace query using LWORK = -1, then use the returned value to allocate a WORK array of the correct size, and finally call the LAPACK routine to get the expected result. High-level wrappers of LAPACK such as LAPACKE feature functions doing just that: take a look at the source of LAPACKE for the function LAPACKE_dsyev()! It calls the function LAPACKE_dsyev_work() twice, which in turn calls LAPACK_dsyev (wrapping dsyev()).
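For completeness, the same two-step pattern can be written from Python using the f2py wrappers in scipy.linalg.lapack instead of the Cython interface. This is only a sketch and assumes your SciPy version exposes dsyev and its dsyev_lwork workspace-query helper:

import numpy as np
from scipy.linalg import lapack

n = 1000
a = np.random.rand(n, n)
a = a + a.T                                   # any symmetric matrix

# Step 1: workspace query -- this wraps the LWORK = -1 call.
lwork_opt, info = lapack.dsyev_lwork(n)
assert info == 0
lwork_opt = int(lwork_opt)                    # the query returns WORK(1) as a float

# Step 2: the actual decomposition; the wrapper allocates a WORK array of the
# queried size internally.
w, v, info = lapack.dsyev(a, lwork=lwork_opt)
assert info == 0                              # w: eigenvalues, v: eigenvectors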
Wrappers still feature functions such as LAPACKE_dsyev_work(), where the arguments work and lwork are still required. The number of allocations can therefore be reduced, if the routine is called multiple times on similar sizes, by not deallocating WORK between calls, but the user must do that himself (see this example). In addition, the source of ILAENV, the LAPACK function called to compute the optimized size of WORK, features the following text:
This version provides a set of parameters which should give good,
but not optimal, performance on many of the currently available
computers. Users are encouraged to modify this subroutine to set
the tuning parameters for their particular machine using the option
and problem size information in the arguments.
As a result, testing sizes of WORK larger than the size returned by the workspace query could improve performance.
Indeed, lots of functions in LAPACK feature the WORK and LWORK arguments. If you search for alloc in the folder lapack-3.7.1/SRC with grep -r "alloc" ., the output only features comment lines:
./zgejsv.f:*> Length of CWORK to confirm proper allocation of workspace.
./zgejsv.f:*> In both cases, the allocated CWORK can accommodate blocked runs
./zgejsv.f:*> Length of RWORK to confirm proper allocation of workspace.
./zgesdd.f:* minimal amount of workspace allocated at that point in the code,
./zhseqr.f:* ==== NL allocates some local workspace to help small matrices
./dhseqr.f:* ==== NL allocates some local workspace to help small matrices
./dgesdd.f:* minimal amount of workspace allocated at that point in the code,
./shseqr.f:* ==== NL allocates some local workspace to help small matrices
./chseqr.f:* ==== NL allocates some local workspace to help small matrices
./sgesdd.f:* minimal amount of workspace allocated at that point in the code,
./sgejsv.f:*> Length of WORK to confirm proper allocation of work space.
./cgejsv.f:*> Length of CWORK to confirm proper allocation of workspace.
./cgejsv.f:*> In both cases, the allocated CWORK can accommodate blocked runs
./cgejsv.f:*> Length of RWORK to confirm proper allocation of workspace.
./dgejsv.f:*> Length of WORK to confirm proper allocation of work space.
./cgesdd.f:* minimal amount of workspace allocated at that point in the code,
It shows that the core of LAPACK does not perform dynamic memory allocation on the heap via commands like allocate, which would be useful for large arrays: the user must take care of that himself.
syev needs additional space during the calculation and the caller must provide this memory (work array).
There is a minimal amount of additional memory necessary for the calculation; however, if you can afford to allocate more memory, the calculation will profit from it and be faster.
The minimal size of the work array should be (nb + 2)*n, where nb is the block size (which can be calculated via ilaenv).
Usually, one gives more memory: either the optimal size is known because the structure of the matrix is known, or syev is called with lwork = -1, which returns the optimal size (in the first element of the work array):
double query;
int lwork = -1;
dsyev(..., &query, lwork); // query the work-size (error handling missing)
lwork = (int) query;       // cast the returned double to int to get the size
....
Most of the content of work is just junk; I guess you could gain some insight into the workings of the algorithm by looking at the data, but this is implementation-dependent.
However, the first entry of work will be the recommended size of the work array; this is also the case when you query the recommended size by setting lwork = -1.
Here is the link to the official docs:
https://www.tensorflow.org/versions/r1.3/api_docs/python/tf/colocate_with
It's a context manager to make sure that the operation or tensor you're about to create will be placed on the same device the reference operation is on. Consider this piece of code (tested):
import tensorflow as tf

with tf.device("/cpu:0"):
    a = tf.constant(0.0, name="a")
with tf.device("/gpu:0"):
    b = tf.constant(0.0, name="b")
    with tf.colocate_with(a):
        c = tf.constant(0.0, name="c")
    d = tf.constant(0.0, name="d")

for operation in tf.get_default_graph().get_operations():
    print(operation.name, operation.device)
Outputs:
(u'a', u'/device:CPU:0')
(u'b', u'/device:GPU:0')
(u'c', u'/device:CPU:0')
(u'd', u'/device:GPU:0')
So it places tensor c on the same device as a, regardless of the GPU device context that is active when c is created. This can be very important for multi-GPU training. Imagine you're not careful and you have a graph with tensors that depend on each other scattered randomly across 8 devices. A complete disaster efficiency-wise. tf.colocate_with() makes sure this doesn't happen.
It is not explained in the docs because it's meant to be used by internal libraries only, so there are no guarantees it will stay. (Very likely it will, however. If you want to know more, you can look it up in the source code as of May 2018; it might move around as the code changes.)
You're not likely to need this unless you're working on some low-level stuff. Most people use only one GPU, and even if you use multiple, you're generally building your graph one GPU at a time, that is within one tf.device() context manager at a time.
One example of where it's used is the tf.train.ExponentialMovingAverage class. It clearly seems like a good idea to colocate the decay and moving-average variables with the value tensor they are tracking.
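For context, here is a rough sketch of how that class is used in TF1-style graph code; per the above, the shadow (moving-average) variables it creates are colocated with the variables they track:

import tensorflow as tf

with tf.device("/gpu:0"):
    var = tf.Variable(0.0, name="my_var")

ema = tf.train.ExponentialMovingAverage(decay=0.999)
# apply() creates the shadow (moving-average) variable for `var`; it is
# colocated with `var`, so the running average lives on the same GPU.
maintain_op = ema.apply([var])
avg = ema.average(var)   # tensor holding the tracked moving average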