NumPy tensordot MemoryError - python

I have two matrices -- A is 3033x3033, and X is 3033x20. I am running the following lines (as suggested in the answer to another question I asked):
n, d = X.shape
c = X.reshape(n, -1, d) - X.reshape(-1, n, d)
return np.tensordot(A.reshape(n, n, -1) * c, c, axes=[(0,1),(0,1)])
On the final line, Python simply stops and says "MemoryError". How can I get around this, either by changing some setting in Python or performing this operation in a more memory-efficient way?

Here is a function that does the calculation without any for loops and without any large temporary array. See the related question for a longer answer, complete with a test script.
def fbest(A, X):
    """Compute the contraction without loops or large temporaries."""
    KA_best = np.tensordot(A.sum(1)[:, None] * X, X, axes=[(0,), (0,)])
    KA_best += np.tensordot(A.sum(0)[:, None] * X, X, axes=[(0,), (0,)])
    KA_best -= np.tensordot(np.dot(A, X), X, axes=[(0,), (0,)])
    KA_best -= np.tensordot(X, np.dot(A, X), axes=[(0,), (0,)])
    return KA_best
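As a quick sanity check (my own sketch, with small random inputs in place of the 3033-sized ones), fbest agrees with a direct einsum formulation:
import numpy as np
n, d = 50, 4
A = np.random.rand(n, n)
X = np.random.rand(n, d)
c = X.reshape(n, 1, d) - X.reshape(1, n, d)
ref = np.einsum('ij,ijk,ijl->kl', A, c, c)
print(np.allclose(fbest(A, X), ref))  # should print True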
I profiled the code with arrays of your sizes; fbest came out fastest, hence the name.
I love sp.einsum by the way. It is a great place to start when speeding up array operations by removing for loops. You can do SOOOO much with one call to sp.einsum.
The advantage of np.tensordot is that it links to whatever fast numerical library you have installed (e.g. MKL). So tensordot will run faster, and in parallel, when you have the right libraries installed.

If you replace the final line with
return np.einsum('ij,ijk,ijl->kl',A,c,c)
you avoid creating the A.reshape(n, n, -1) * c (3033 by 3033 by 20) intermediate that I think is your main problem.
My impression is that the version I give is probably slower (for cases where it doesn't run out of memory), but I haven't rigorously timed it.
It's possible you could go further and avoid creating c, but I can't immediately see how to do it. It would be a case of writing the whole thing out in terms of sums over matrix indices and seeing what it simplifies to.
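To put a number on that intermediate (my own back-of-the-envelope arithmetic, assuming float64):
n, d = 3033, 20
print(n * n * d * 8 / 1e9)  # ~1.47 GB for the (n, n, d) product, on top of c itself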

You can employ a two-nested loop format iterating along the last dimension of X. Now, that last dimension is 20, so hopefully it would still be efficient enough and more importantly leave minimum memory footprint. Here's the implementation -
n, d = X.shape
c = X.reshape(n, -1, d) - X.reshape(-1, n, d)
out = np.empty((d, d)) # d is a small number: 20
for i in range(d):
    for j in range(d):
        out[i,j] = (A * c[:,:,i] * c[:,:,j]).sum()
return out
You can replace the last line with np.einsum -
out[i,j] = np.einsum('ij->',A*c[:,:,i]*c[:,:,j])
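A further tweak (my addition, not from the answer above): passing all three operands to einsum in a single call avoids the explicit n-by-n temporaries that A*c[:,:,i]*c[:,:,j] creates:
out[i,j] = np.einsum('ij,ij,ij->', A, c[:,:,i], c[:,:,j])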

Related

How to make two arrays contiguous so that Numba can speed up np.dot()

I have the following code:
import numpy as np
from numba import jit
Nx = 15
Ny = 1000
v = np.ones((Nx,Ny))
v = np.reshape(v,(Nx*Ny))
A = np.random.rand(Nx*Ny,Nx*Ny,5)
B = np.random.rand(Nx*Ny,Nx*Ny,5)
C = np.random.rand(Nx*Ny,5)
@jit(nopython=True)
def dotplus(B, v, C):
    return np.dot(B, v) + C
k = 2
D = dotplus(B[:,:,k], v, C[:,k])
I get the following warning, which I guess refers to arrays B[:,:,k] and v:
NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 1d, C))
return np.dot(B, v0) + C
Is there a way to make the two arrays contiguous, so that Numba can speed up the code?
P.S. In case you're wondering about the meaning of k, note this is just an MRE. In the actual code, dotplus is called multiple times inside a for loop for different values of k (so, different slices of B and C). The for loop updates the values of v, but B and C don't change.
Flawr is correct. B[..., k] returns a view into B; it does not actually copy any data. In memory, two neighbouring elements of the view are B.strides[1] bytes apart, which evaluates to B.shape[-1]*B.itemsize and is greater than B.itemsize. Consequently, the view is not contiguous.
The best optimization is to vectorize the dotplus loop and write
D = np.tensordot(B, v, axes=(1, 0)) + C
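A quick sanity check of that expression (my own sketch, with a small assumed size in place of Nx*Ny):
import numpy as np
m = 30                          # small stand-in for Nx*Ny
B = np.random.rand(m, m, 5)
v = np.random.rand(m)
C = np.random.rand(m, 5)
D = np.tensordot(B, v, axes=(1, 0)) + C   # shape (m, 5)
print(np.allclose(D[:, 2], np.dot(B[:, :, 2], v) + C[:, 2]))  # should print True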
The second best optimization is to refactor and let the batch dimension be the first dimension of the array. This can be done on top of the above vectorization and is generally advisable. It would look something like
A = np.random.rand(5, Nx*Ny,Nx*Ny)
# rather than
A = np.random.rand(Nx*Ny,Nx*Ny,5)
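With the batch axis first, each slice is itself C-contiguous, so no copy is needed before np.dot (a quick check, my own, with small assumed sizes):
A = np.random.rand(5, 100, 100)
print(A[2].flags['C_CONTIGUOUS'])  # True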
If you can't refactor the code, you need to start profiling. You can easily swap axes temporarily via
B = np.moveaxis(B, -1, 0)
some_op(B[k, ...], ...)
B = np.moveaxis(B, 0, -1)
Contrary to max9111's comment, this will not net you anything compared to np.ascontiguousarray(), because the data has to be copied in both cases. That said, copying a slice is O((Nx*Ny)^2) plus a buffer allocation, and the matrix-vector multiplication itself is also O((Nx*Ny)^2); on the non-contiguous view, the multiplication has to gather scattered elements first, which is really expensive. It comes down to your specific architecture and concrete problem, so profiling is the way to go.
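A minimal profiling sketch along those lines (my own; the sizes are assumed, and which variant wins depends on your BLAS and cache):
import numpy as np
from timeit import timeit

m = 1500
B = np.random.rand(m, m, 5)
v = np.random.rand(m)
k = 2
t_view = timeit(lambda: np.dot(B[:, :, k], v), number=20)                        # dot on the non-contiguous view
t_copy = timeit(lambda: np.dot(np.ascontiguousarray(B[:, :, k]), v), number=20)  # copy first, then dot
print(t_view, t_copy)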

Efficient ways to iterate over several numpy arrays and process current and previous elements?

I've read a lot about different techniques for iterating over numpy arrays recently and it seems that consensus is not to iterate at all (for instance, see a comment here). There are several similar questions on SO, but my case is a bit different as I have to combine "iterating" (or not iterating) and accessing previous values.
Let's say there are N (N is small, usually 4, might be up to 7) 1-D numpy arrays of float128 in a list X, all arrays are of the same size. To give you a little insight, these are data from PDE integration, each array stands for one function, and I would like to apply a Poincare section. Unfortunately, the algorithm should be both memory- and time-efficient since these arrays are sometimes ~1Gb each, and there are only 4Gb of RAM on board (I've just learnt about memmap'ing of numpy arrays and now consider using them instead of regular ones).
One of these arrays is used for "filtering" the others, so I start with secaxis = X.pop(idx). Now I have to locate pairs of indices where (secaxis[i-1] > 0 and secaxis[i] < 0) or (secaxis[i-1] < 0 and secaxis[i] > 0) and then apply simple algebraic transformations to the remaining arrays in X (and save the results). Worth mentioning: no data should be lost during this operation.
There are multiple ways for doing that, but none of them seem efficient (and elegant enough) to me. One is a C-like approach, where you just iterate in a for-loop:
import array # better than lists
res = [ array.array('d') for _ in X ]
for i in xrange(1, secaxis.size):
    if condition: # see above
        co = -secaxis[i-1]/secaxis[i]
        for j in xrange(N):
            res[j].append( (X[j][i-1] + co*X[j][i])/(1+co) )
This is clearly very inefficient and besides not a Pythonic way.
Another way is to use numpy.nditer, but I haven't figured out yet how one accesses the previous value, though it allows iterating over several arrays at once:
# without secaxis = X.pop(idx)
it = numpy.nditer(X)
for vec in it:
    pass  # vec[idx] is the current value; how do you get the previous (or next) one?
Third possibility is to first find sought indices with efficient numpy slices, and then use them for bulk multiplication/addition. I prefer this one for now:
res = []
inds, = numpy.where((secaxis[:-1] < 0) * (secaxis[1:] > 0) +
                    (secaxis[:-1] > 0) * (secaxis[1:] < 0))
coefs = -secaxis[inds] / secaxis[inds+1] # array of coefficients
for f in X: # loop is done only N-1 times, that is, 3 to 6
    res.append( (f[inds] + coefs*f[inds+1]) / (1+coefs) )
But this seemingly takes 7 + 2*(N - 1) passes over the data. Moreover, I'm not sure about the secaxis[inds] kind of addressing (it is fancy indexing, not slicing, and it has to find all elements by index, just like in the first method, doesn't it?).
Finally, I've also tried using itertools and it resulted in monstrous and obscure structures, which might stem from the fact that I'm not very familiar with functional programming:
def filt(x):
    return (x[0] < 0 and x[1] > 0) or (x[0] > 0 and x[1] < 0)

import array
from itertools import izip, tee, ifilter
res = [ array.array('d') for _ in X ]
iters = [iter(x) for x in X] # N-1 iterators in a list
prev, curr = tee(izip(*iters)) # 2 similar iterators, each of which
                               # consists of N-1 iterators
next(curr, None) # one of them is now for the current value
seciter = tee(iter(secaxis))
next(seciter[1], None)
for x in ifilter(filt, izip(seciter[0], seciter[1], prev, curr)):
    co = -x[0]/x[1]
    for r, p, c in zip(res, x[2], x[3]):
        r.append( (p + co*c) / (1+co) )
Not only does this look very ugly, it also takes an awful lot of time to complete.
So, I have following questions:
Of all these methods, is the third one indeed the best? If so, what can be done to improve it?
Are there any other, better ones yet?
Out of sheer curiosity, is there a way to solve the problem using nditer?
Finally, will I be better off using memmap versions of numpy arrays, or will it probably slow things down a lot? Maybe I should only load secaxis array into RAM, keep others on disk and use third method?
(bonus question) The list of equal-length 1-D numpy arrays comes from loading N .npy files whose sizes aren't known beforehand (but N is). Would it be more efficient to read one array, then allocate memory for one 2-D numpy array (slight memory overhead here) and read the remaining ones into that 2-D array?
The numpy.where() version is fast enough, and you can speed it up a little with method3(). If the > condition can be changed to >=, you can also use method4().
import numpy as np
a = np.random.randn(100000)
def method1(a):
    idx = []
    for i in range(1, len(a)):
        if (a[i-1] > 0 and a[i] < 0) or (a[i-1] < 0 and a[i] > 0):
            idx.append(i)
    return idx

def method2(a):
    inds, = np.where((a[:-1] < 0) * (a[1:] > 0) +
                     (a[:-1] > 0) * (a[1:] < 0))
    return inds + 1

def method3(a):
    m = a < 0
    p = a > 0
    return np.where((m[:-1] & p[1:]) | (p[:-1] & m[1:]))[0] + 1

def method4(a):
    return np.where(np.diff(a >= 0))[0] + 1
assert np.allclose(method1(a), method2(a))
assert np.allclose(method2(a), method3(a))
assert np.allclose(method3(a), method4(a))
%timeit method1(a)
%timeit method2(a)
%timeit method3(a)
%timeit method4(a)
The %timeit results:
1 loop, best of 3: 294 ms per loop
1000 loops, best of 3: 1.52 ms per loop
1000 loops, best of 3: 1.38 ms per loop
1000 loops, best of 3: 1.39 ms per loop
I'll need to read your post in more detail, but will start with some general observations (from previous iteration questions).
There isn't an efficient way of iterating over arrays in Python, though some approaches slow things down more than others. I like to distinguish between the iteration mechanism (nditer, for x in A:) and the action (alist.append(...), x[i+1] += 1). The big time consumer is usually the action, done many times, not the iteration mechanism itself.
Letting numpy do the iteration in compiled code is the fastest.
xdiff = x[1:] - x[:-1]
is much faster than
xdiff = np.zeros(x.shape[0]-1)
for i in range(x.shape[0]-1):
    xdiff[i] = x[i+1] - x[i]
The np.nditer isn't any faster.
nditer is recommended as a general iteration tool for compiled code. But its main value lies in handling broadcasting and in coordinating iteration over several arrays (input/output). And you need buffering and C-like code to get the best speed from nditer; a recent SO answer covers this:
https://stackoverflow.com/a/39058906/901925
Don't use nditer without studying the relevant iteration tutorial page (the one that ends with a cython example).
=========================
Just judging from experience, this approach will be fastest. Yes, it iterates over secaxis a number of times, but those passes are all done in compiled code and will be much faster than any iteration in Python. And the for f in X: loop runs only a few times.
res = []
inds, = numpy.where((secaxis[:-1] < 0) * (secaxis[1:] > 0) +
                    (secaxis[:-1] > 0) * (secaxis[1:] < 0))
coefs = -secaxis[inds] / secaxis[inds+1] # array of coefficients
for f in X:
    res.append( (f[inds] + coefs*f[inds+1]) / (1+coefs) )
@HYRY has explored alternatives for making the where step faster. But as you can see, the differences aren't that big. Other possible tweaks:
inds1 = inds + 1
coefs = -secaxis[inds] / secaxis[inds1]
coefs1 = coefs + 1
for f in X:
    res.append( (f[inds] + coefs*f[inds1]) / coefs1 )
If X was an array, res could be an array as well.
res = (X[:,inds] + coefs*X[:,inds1])/coefs1
But for small N I suspect the list res is just as good. Don't need to make the arrays any bigger than necessary. The tweaks are minor, just trying to avoid recalculating things.
=================
This use of np.where (with a single argument) is just np.nonzero. It actually makes two passes over the array: one with np.count_nonzero to determine how many values it will return and to create the return structure (a list of arrays of now-known length), and a second pass to fill in those indices. So multiple passes are fine if they keep the action simple.
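For illustration (my own minimal example):
import numpy as np
mask = np.array([False, True, False, True])
print(np.where(mask))    # (array([1, 3]),)
print(np.nonzero(mask))  # identical result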

Using nested for loops to evaluate function at every coordinate - python

I am trying to use a nested for loop to evaluate function f(r) at every combination of 2D polar coordinates.
My current code is:
r = np.linspace(0, 5, 10)
phi = np.linespace(0,2*np.pi, 10)
for i in r:
    for j in phi:
        X = f(r) * np.cos(phi)
        print X
When I run this as-is, it returns X as a 1D array of f(r), with cos(phi) equal to 1 (i.e. phi = 0). However, I want f(r) at every value of r, multiplied by its corresponding cos(phi) value. This would be a 2D array (10 by 10) in which every combination of r and phi is evaluated.
If you have any suggestions about possible efficiencies I would appreciate it as eventually I will be running this with a resolution much greater than 10 (maybe as high as 10,000) and looping it many thousands of times.
It's np.linspace not np.linespace.
You're not iterating over the actual data at all; i and j are not even used in your inner loop. Right now, you're executing the same statement with the same arguments and the same result over and over again. You could have executed the inner statement a single time and X would be the same.
You need to preallocate a 2D-array for the result.
Iterate over both axes, grab the items of r and phi, do your calculation and put it into the right field of your output array.
Somehow the first point makes me think you never even tried to run your code, as it gives an obvious error message. Anyway, here is a solution:
def f(x):
    return x

r = np.linspace(0, 5, 10)
phi = np.linspace(0, 2*np.pi, 10)
X = np.zeros((len(r), len(phi)))
for i in xrange(len(r)):
    for j in xrange(len(phi)):
        X[i,j] = f(r[i]) * np.cos(phi[j])
print X
P.S. Don't go with solutions that put the results into ordinary lists; stick with NumPy arrays for mathematical problems.
When I run this as-is, it returns X as a 1D array of f(r), with cos(phi) equal to 1 (i.e. phi = 0).
That's because you don't store, print, or do anything else with all those intermediate values you generate. All you do is rebind X to the latest value over and over (forgetting whatever used to be in X), so at the end, it's bound to the last one.
If you want all of those X values, you have to actually store them in some way. For example:
Xs = []
for i in r:
    for j in phi:
        X = f(r) * np.cos(phi)
        Xs.append(X)
Now, you'll have a list of each X instead of just the last one.
Of course in general, the whole reason you use NumPy is to "vectorize" operations instead of looping in the first place. (This allows the loops and the arithmetic to be done in C/C++/Fortran/whatever instead of Python, typically making them an order of magnitude or so faster.) So, if you can rewrite your code to create a 2D array from your 1D array by broadcasting, instead of creating a list of 1D arrays, it will be better…
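For instance, a minimal broadcasting sketch (my own; it assumes f is vectorized, e.g. a NumPy ufunc):
X = f(r)[:, None] * np.cos(phi)[None, :]   # shape (len(r), len(phi)), no Python loops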
Instead of assigning to X, you should be adding the most recent 1D array to it. (You'll need to initialize X before the outer loop.)
Edit: You don't need the inner loop; it's computing across all of phi (which is why you get the 1D array) repeatedly. (Note how it doesn't use j). That'll speed things up a bit, AND get you the correct answer!

python numpy - optimize chisq function by removing explicit python loop?

I'm trying to evaluate a chi squared function, i.e. compare an arbitrary (blackbox) function to a numpy vector array of data. At the moment I'm looping over the array in python but something like this is very slow:
n=len(array)
sigma=1.0
chisq=0.0
for i in range(n):
    data = array[i]
    model = f(i,a,b,c)
    chisq += 0.5*((data-model)/sigma)**2.0
return chisq
array is a 1-d numpy array and a,b,c are scalars. Is there a way to speed this up by using numpy.sum() or some sort of lambda function etc.? I can see how to remove one loop (over chisq) like this:
numpy.sum(0.5*((array-model_vec)/sigma)**2.0)
but then I still need to explicitly populate the array model_vec, which will presumably be just as slow; how do I do that without an explicit loop like this:
model_vec = numpy.zeros(n)
for i in range(n):
    model_vec[i] = f(i,a,b,c)
return numpy.sum(0.5*((array-model_vec)/sigma)**2.0)
?
Thanks!
You can use np.vectorize to 'vectorize' your function f if you don't have control over its definition:
g = np.vectorize(f)
But this is not as good as vectorizing the function yourself manually to support arrays, as it doesn't really do much more than internalize the loop, and it might not work well with certain functions. In fact, from the documentation:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
You should instead focus on making f accept a vector instead of i:
def f(i, a, b, x):
    return a*x[i] + b

def g(a, b, x):
    x = np.asarray(x)
    return a*x + b
Then, instead of calling f(i, a, b, x), call g(a, b, x)[i] if you only want the ith value; but for operations on the entire array, use g(a, b, x) and it will be much faster:
model_vec = g(a, b, x)
return numpy.sum(0.5*((array-model_vec)/sigma)**2.0)
It seems that your code is slow because what executes inside the loop is slow (your model generation). Turning this into a one-liner won't speed things up by itself. If you have access to a machine with more than one CPU, you could try to run this loop in parallel, for example using the multiprocessing module:
from multiprocessing import Pool

def model_eval(args):
    # hypothetical helper so pool.map can unpack the argument tuple
    i, a, b, c = args
    return f(i, a, b, c)

if __name__ == '__main__':
    # snip set up code
    pool = Pool(processes=4) # start 4 worker processes
    inputs = [(i, a, b, c) for i in range(n)]
    model_array = pool.map(model_eval, inputs)
    for i in range(n):
        data = array[i]
        model = model_array[i]
        chisq += 0.5*((data-model)/sigma)**2.0

Combining element-wise and matrix multiplication with multi-dimensional arrays in NumPy

I have two multidimensional NumPy arrays, A and B, with A.shape = (K, d, N) and B.shape = (K, N, d). I would like to perform an element-wise operation over axis 0 (K), with that operation being matrix multiplication over axes 1 and 2 (d, N and N, d). So the result should be a multidimensional array C with C.shape = (K, d, d), so that C[k] = np.dot(A[k], B[k]). A naive implementation would look like this:
C = np.vstack([np.dot(A[k], B[k])[np.newaxis, :, :] for k in xrange(K)])
but this implementation is slow. A slightly faster approach looks like this:
C = np.dot(A, B)[:, :, 0, :]
which uses the default behaviour of np.dot on multidimensional arrays, giving me an array with shape (K, d, K, d). However, this approach computes the required answer K times (each of the entries along axis 2 are the same). Asymptotically it will be slower than the first approach, but the overhead is much less. I am also aware of the following approach:
from numpy.core.umath_tests import matrix_multiply
C = matrix_multiply(A, B)
but I am not guaranteed that this function will be available. My question is thus, does NumPy provide a standard way of doing this efficiently? An answer which applies to multidimensional arrays in general would be perfect, but an answer specific to only this case would be great too.
Edit: As pointed out by @Juh_, the second approach is incorrect. The correct version is:
C = np.dot(A, B).diagonal(axis1=0, axis2=2).transpose(2, 0, 1)
but the overhead added makes it slower than the first approach, even for small matrices. The last approach is winning by a long shot on all my timing tests, for small and large matrices. I'm now strongly considering using this if no better solution crops up, even if that would mean copying the numpy.core.umath_tests library (written in C) into my project.
A possible solution to your problem is:
C = np.sum(A[:,:,:,np.newaxis]*B[:,np.newaxis,:,:],axis=2)
However:
it is quicker than the vstack approach only if K is much bigger than d and N
there might be some memory issue: in the above solution a K x d x N x d array is allocated (i.e. all possible product pairs, before summing). Actually I could not test with big K, d and N, as I ran out of memory.
btw, note that:
C = np.dot(A, B)[:, :, 0, :]
does not give the correct result. It tricked me at first, because I checked my method by comparing its results to those given by this np.dot command.
I have this same issue in my project. The best I've been able to come up with is the following, which I think is a little faster (maybe 10%) than using vstack:
K, d, N = A.shape
C = np.empty((K, d, d))
for k in xrange(K):
    C[k] = np.dot(A[k], B[k])
I'd love to see a better solution, I can't quite see how one would use tensordot to do this.
A very flexible, compact, and fast solution:
C = np.einsum('Kab,Kbc->Kac', A, B, optimize=True)
Confirmation:
import numpy as np
K = 10
d = 5
N = 3
A = np.random.rand(K,d,N)
B = np.random.rand(K,N,d)
C_old = np.dot(A, B).diagonal(axis1=0, axis2=2).transpose(2, 0, 1)
C_new = np.einsum('Kab,Kbc->Kac', A, B)
print(np.max(C_old-C_new)) # should be 0 or a very small number
For large multi-dimensional arrays, the optional parameter optimize=True can save you a lot of time.
You can learn about einsum here:
https://ajcr.net/Basic-guide-to-einsum/
https://rockt.github.io/2018/04/30/einsum
https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
Quote:
The Einstein summation convention can be used to compute many multi-dimensional, linear algebraic array operations. einsum provides a succinct way of representing these. A non-exhaustive list of these operations is:
Trace of an array, numpy.trace.
Return a diagonal, numpy.diag.
Array axis summations, numpy.sum.
Transpositions and permutations, numpy.transpose.
Matrix multiplication and dot product, numpy.matmul numpy.dot.
Vector inner and outer products, numpy.inner numpy.outer.
Broadcasting, element-wise and scalar multiplication, numpy.multiply.
Tensor contractions, numpy.tensordot.
Chained array operations, in efficient calculation order, numpy.einsum_path.
You can do
np.matmul(A, B)
Look at https://numpy.org/doc/stable/reference/generated/numpy.matmul.html.
Should be faster than einsum for big enough K.
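As a quick check (my own sketch, reusing the A and B from the einsum confirmation above), matmul broadcasts over the leading K axis and matches the einsum result:
C_mm = np.matmul(A, B)   # shape (K, d, d)
print(np.allclose(C_mm, np.einsum('Kab,Kbc->Kac', A, B)))  # should print True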
