I'm trying to understand how out= parameters work in numpy, pytorch and similar libraries.
Intuitively I would expect it to allow allocation-free arithmetic. That is, where C = A @ B would allocate a new matrix, np.matmul(A, B, out=C) would use the space already allocated for C, thus requiring less memory management and yielding faster code.
What confuses me is that the code
np.matmul(A, B, out=A) seems to work correctly.
That is, it computes A = A @ B without error.
So if this is really done without allocating new memory, then numpy (or pytorch) must be using some sort of in-place matrix multiplication algorithm. But I don't think those really exist? (See e.g. Is there an algorithm to multiply square matrices in-place?)
So am I wrong that out= avoids allocating memory? Or is it just that numpy and pytorch are smart enough to realize that the output is the same as one of the inputs, and then allocate memory in this particular case?
I tried reading the documentation (https://pytorch.org/docs/stable/generated/torch.matmul.html) but it doesn't give any details of what's really going on.
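A minimal reproduction of the behaviour described above (the array sizes are arbitrary):

import numpy as np

# matmul with out= aliased to an input still gives the same result as the
# ordinary out-of-place product, at least on recent NumPy versions.
A = np.random.rand(4, 4)
B = np.random.rand(4, 4)

expected = A @ B                 # out-of-place reference
np.matmul(A, B, out=A)           # output aliases the first input
print(np.allclose(A, expected))  # True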
Related
I have a huge numpy array with shape (50000000, 3) and I'm using:
x = array[np.where((array[:,0] == value) | (array[:,1] == value))]
to get the part of the array that I want. But this way seems to be quite slow.
Is there a more efficient way of performing the same task with numpy?
np.where is highly optimized and I doubt someone can write faster code than the one implemented in the latest Numpy version (disclaimer: I was the one who optimized it). That being said, the main issue here is not so much np.where as the condition, which creates temporary boolean arrays. This is unfortunately the way to do it in Numpy, and there is not much you can do about it as long as you use only Numpy with the same input layout.
One reason it is not very efficient is that the input data layout is inefficient. Indeed, assuming array is contiguously stored in memory using the default row-major ordering, array[:,0] == value will read only 1 item out of every 3 of the array in memory. Due to the way CPU caches work (ie. cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. In fact, the output boolean array also needs to be written, and filling a newly-created array is a bit slow due to page faults. Note that array[:,1] == value will certainly reload data from RAM due to the size of the input (which cannot fit in most CPU caches). RAM is slow and it is getting slower compared to the computational speed of the CPU and caches. This problem, called the "memory wall", was observed a few decades ago and it is not expected to be fixed any time soon. Also note that the logical-or will also create a new array that is read from and written to RAM. A better data layout is a (3, 50000000) transposed array contiguous in memory (note that np.transpose does not produce a contiguous array).
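For illustration, here is a minimal sketch of that transposed layout (sizes reduced and value arbitrary; to get the full benefit the data should be created in this layout from the start, since the copy itself costs a full pass over the array):

import numpy as np

# .T only swaps strides, so np.ascontiguousarray is needed to actually
# reorder the data in memory into a contiguous (3, n) layout.
array = np.random.randint(0, 1000, size=(1_000_000, 3))
value = 42
array_t = np.ascontiguousarray(array.T)

# Each comparison now scans a contiguous row instead of a strided column.
mask = (array_t[0] == value) | (array_t[1] == value)
x = array[mask]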
Another reason explaining the performance issue is that Numpy tends not to be optimized for operating on very small axes.
One main solution is to create the input in a transposed way if possible. Another solution is to write Numba or Cython code. Here is an implementation for the non-transposed input:
import numpy as np
import numba as nb

# Compilation for the most frequent types.
# Please pick the right ones so as to speed up the compilation time.
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]
Note that I used a parallel implementation since the or operator is sub-optimal with Numba (the only solution seems to be native code or Cython) and also because the RAM cannot be fully saturated with one thread on some platforms like computing servers. Also note that it can be faster to use array[np.where(select(array, value))[0]] depending on the result of select. Indeed, if the result is random or very small, then np.where can be faster since it has special optimizations for these cases that boolean indexing does not perform. Note that np.where is not particularly optimized in the context of a Numba function since Numba uses its own implementation of Numpy functions and they are sometimes not as well optimized for large arrays. A faster implementation consists in creating x in parallel, but this is not trivial to do with Numba since the number of output items is not known ahead of time and threads must know where to write data, not to mention Numpy is already fairly fast at doing that sequentially as long as the output is predictable.
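For illustration, a small continuation of the snippet above comparing the two indexing strategies (which one is faster depends heavily on how selective value is):

# Both forms return the same rows; only the indexing strategy differs.
mask = select(array, value)
x_bool = array[mask]                  # boolean indexing
x_where = array[np.where(mask)[0]]    # integer indexing via np.where
assert np.array_equal(x_bool, x_where)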
I was wondering if it is possible to write an iterative algorithm, without using a for loop, by combining as_strided with some operation that edits the memory in place.
For example, say I want to write an algorithm that replaces each number in an array with the sum of its neighbors. I came up with this abomination (yep, it's summing an element with its 2 right neighbors, but it's just to get the idea):
import numpy as np
a = np.arange(10)
ops = 2
a_view_window = np.lib.stride_tricks.as_strided(a, shape = (ops,a.size - 2, 3), strides=(0,) + 2*a.strides)
a_view = np.lib.stride_tricks.as_strided(a, shape = (ops,a.size - 2), strides=(0,) + a.strides)
np.add.reduce(a_view_window, axis = -1, out=a_view)
print(a)
So I am taking an array of 10 numbers and creating this strange view which increases dimensionality without changing the strides. My thinking is that the reduction will run over the fake new dimension and write over the previous values, so when it gets to the next major dimension it will have to read the data it just overwrote, and thus perform the addition iteratively.
Sadly this does not work :(
(yes I know this is a terrible way to do things but I am curious about how the underlying numpy stuff works and if it can be hacked in this way)
This code results in undefined behavior prior to Numpy 1.13 and works out-of-place in newer versions so as to avoid overlapping/aliasing issues. Indeed, you cannot assume Numpy iterates in a given order over the input/output array views. In fact, Numpy often uses SIMD instructions to speed up the code and sometimes tells compilers that views do not overlap/alias each other (using the restrict keyword) so they can generate much more efficient code. For more information you can read the doc on ufuncs (and this issue):
Operations where ufunc input and output operands have memory overlap produced undefined results in previous NumPy versions, due to data dependency issues. In NumPy 1.13.0, results from such operations are now defined to be the same as for equivalent operations where there is no memory overlap.
Operations affected now make temporary copies, as needed to eliminate data dependency. As detecting these cases is computationally expensive, a heuristic is used, which may in rare cases result in needless temporary copies. For operations where the data dependency is simple enough for the heuristic to analyze, temporary copies will not be made even if the arrays overlap, if it can be deduced copies are not necessary.
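A small sketch of the defined-overlap behaviour the quote describes (assuming NumPy >= 1.13; on older versions the result was undefined):

import numpy as np

a = np.arange(6)
expected = a[1:] + a[:-1]          # out-of-place reference, computed first

# The output view overlaps both inputs with a real data dependency. NumPy
# detects the overlap and works on a temporary copy, so the result is
# defined to match the out-of-place computation.
np.add(a[1:], a[:-1], out=a[1:])
print(np.array_equal(a[1:], expected))   # True on NumPy >= 1.13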
I have a set of large, very sparse matrices for which I'm trying to find the eigenvector corresponding to eigenvalue 0. From the structure of the underlying problem I know a solution to this must exist and that zero is also the largest eigenvalue.
To solve the problem I use scipy.sparse.linalg.eigs with arguments:
val, rho = eigs(L, k=1, which = 'LM', v0=init, sigma=-0.001)
where L is the matrix. I then call this multiple times for different matrices. At first I was having a severe problem where the memory usage just kept increasing as I called the function more and more times. It seems eigs doesn't free all the memory it should. I solved this by calling gc.collect() each time I use eigs.
But now I worry that memory isn't being freed internally either. Naively, I would expect that using something like Arnoldi shouldn't use more memory as the algorithm progresses; it should just be storing the matrix and the current set of Lanczos vectors. Yet I find that memory usage increases while the code is still inside the eigs function.
Any ideas?
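For reference, a rough sketch of the workaround described above, with small random matrices standing in for the real problem (sizes, density and the number of matrices are placeholders):

import gc
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

rng = np.random.default_rng(0)
matrices = [sp.random(200, 200, density=0.01, random_state=i, format='csr')
            for i in range(5)]
init = rng.random(200)

results = []
for L in matrices:
    val, rho = eigs(L, k=1, which='LM', v0=init, sigma=-0.001)
    results.append((val, rho))
    gc.collect()   # work around memory that eigs seems not to release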
I'm using theano to do some computation involving a large (about 300,000 x 128) matrix.
The theano function quit after outputting a MemoryError, similar to this question.
I think it's probably because the large matrix gets processed step by step, and each step leaves a large GpuArray (with the same shape as the first one) in memory.
So my questions are:
1. In the case of temporary (non-output) variables with the same shape and dtype, does theano make any effort to reuse such allocated memory? E.g., does a pool for each array shape exist?
2. If 1 holds, can I inspect which node in the function graph reuses memory?
I know there's shared, which can do explicit sharing, but I suspect that using it would make the (already hard) computation code even harder to understand.
UPDATE
A simplified example of such situation:
import theano
from theano import tensor as T
a0 = T.matrix() # the initial
# op1 op2 op3 are valid, complicated operations,
# whose output's shape are identical to a0's
a1 = op1(a0)
a2 = op2(a1)
a3 = op3(a2)
f = theano.function([a0], a3)
If none of op1, op2, op3 can be optimized as in-place, will theano try to reuse memory, e.g. might a1 and a3 "share" the same address to reduce the memory footprint, since a1 is no longer used at the time of op3?
Thanks!
You need to use in-place operations as much as possible. For the most part this is not under user control (they are automatically used by the optimizer when circumstances allow) but there are a few things you can do to encourage their use.
Take a look at the documentation on this issue:
memory profiler
Information on the optimizations that enable in-place operation
How to create custom operations that can work in-place
Do not use the inplace parameter of the inc_subtensor operation, as indicated in the documentation. The docs also indicate that inplace operators are not supported for user specification (the optimizer will automatically apply them when possible).
You could use Theano's memory profiler to help track down which operation(s) are using up the memory.
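For reference, a hedged sketch of turning the memory profiler on via THEANO_FLAGS (flag names per the Theano profiling docs; the toy graph and sizes below are placeholders, not from the question):

import os
# Must be set before theano is imported.
os.environ.setdefault('THEANO_FLAGS', 'profile=True,profile_memory=True')

import numpy as np
import theano
from theano import tensor as T

a0 = T.matrix()
a3 = (a0 * 2 + 1).sum()                 # placeholder for the real op1/op2/op3 chain
f = theano.function([a0], a3)
f(np.ones((1000, 128), dtype=a0.dtype))
# Per-apply-node timing and memory statistics are printed when the program exits.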
UPDATE
I'm not an expert in this area but I believe Theano uses a garbage collection mechanism to free up memory that is no longer needed. This is discussed in the documentation.
Given the thread here
It seems that numpy is not ideal for ultra-fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?
Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
Of course, you can obviously get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.
Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
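As a sketch of the difference (the buffer names x2 and tmp are just illustrative):

import numpy as np

n = 1_000_000
a, b, c, d, e = (np.random.rand(n) for _ in range(5))

# Naive expression: temporaries for a * b, c / d and the intermediate sum.
x = a * b + c / d + e

# Explicitly reusing preallocated buffers with out= and augmented assignment.
x2 = np.empty(n)
tmp = np.empty(n)
np.multiply(a, b, out=x2)   # x2 = a * b, no extra allocation
np.divide(c, d, out=tmp)    # tmp = c / d, no extra allocation
x2 += tmp                   # in-place add
x2 += e                     # in-place add

assert np.allclose(x, x2)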
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.
I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.
Your sub-question: for a = sin(x), how many roundtrips are there?
The trick is to pass a numpy array to sin(x), then there is only one 'roundtrip' for the whole array, since numpy will return an array of sin-values. There is no python for loop involved in this operation.
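A small illustration of the single-roundtrip idea (the array size is arbitrary):

import math
import numpy as np

x = np.linspace(0, 2 * np.pi, 1_000_000)

# One "roundtrip": the whole array is handed to NumPy once and the loop
# over elements runs in C.
a = np.sin(x)

# For contrast, a pure-Python loop crosses the interpreter once per element.
a_slow = np.array([math.sin(v) for v in x])

assert np.allclose(a, a_slow)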