I have code running operations on NumPy arrays. While the linear algebra operations seem fast, I am now finding a bottleneck elsewhere: the summation of two distinct arrays.
In the example below, wE3 and T1 are two 1000x1000x1000 arrays.
First I compute wE3 using a NumPy operation, then I sum the two arrays.
import numpy as np
import scipy as sp
import time
N = 100
n = 1000
X = np.random.uniform(size = (N,n))
wE = np.mean(X,0)
wE3 = np.einsum('i,j,k->ijk', wE, wE, wE) #22 secs
T1 = np.random.uniform(size = (n,n,n))
a = wE3 + T1 #115 secs
The calculation of wE3 takes about 22 seconds, while the addition of wE3 and T1 takes 115 seconds.
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3? They should have more or less the same complexity.
Is there a way to speed up that code?
Is there any known reason why the summation of those arrays is so much slower than the calculation of wE3?
The arrays wE3, T1 and a each require 8 gigabytes of memory (1000³ float64 elements at 8 bytes each). You are probably running out of physical memory, and swapping to disk is killing your performance.
Is there a way to speed up that code?
Get more physical memory (i.e. RAM).
If that is not possible, look at what you are going to do with these arrays, and see if you can work in batches such that the total memory required while processing a batch stays within the limits of your physical memory.
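For instance, here is a minimal sketch (not from the original answer) of that batching idea applied to the wE3 + T1 computation: it forms and consumes one (n, n) slice of the result at a time, so the peak footprint is T1 plus a single slice rather than three full (n, n, n) arrays. consume is a hypothetical stand-in for whatever downstream processing you do.
import numpy as np

def process_in_batches(wE, T1, consume):
    # wE: (n,), T1: (n, n, n); consume(i, slab) is a hypothetical callback
    n = wE.shape[0]
    outer2 = wE[:, None] * wE          # (n, n) outer product, cheap to keep
    for i in range(n):
        slab = wE[i] * outer2          # i-th (n, n) slice of wE3
        slab += T1[i]                  # i-th slice of wE3 + T1
        consume(i, slab)               # process it; the slab is then freed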
That np.einsum('i,j,k->ijk', wE, wE, wE) part isn't doing any sum-reduction and is essentially just broadcasted elementwise multiplication. So, we can replace that with something like this -
wE[:,None,None] * wE[:,None] * wE
Runtime test -
In [9]: # Setup inputs at 1/5th of original dataset sizes
...: N = 20
...: n = 200
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [10]: %timeit np.einsum('i,j,k->ijk', wE, wE, wE)
10 loops, best of 3: 45.7 ms per loop
In [11]: %timeit wE[:,None,None] * wE[:,None] * wE
10 loops, best of 3: 26.1 ms per loop
Next up, we have wE3 + T1. Since T1 = np.random.uniform(size = (n,n,n)) has to be created anyway and the rest is plain elementwise addition, there isn't much room for a big improvement there. We can, however, use np.add, which lets us write the result back into one of the existing arrays, wE3 or T1. Let's say we choose T1, if it's okay for it to be modified. This brings a slight memory saving, as we avoid allocating a third array for the result.
Thus, we could do -
np.add(wE3,T1,out=T1)
Runtime test -
In [58]: def func1(wE3):
...:     T1 = np.random.uniform(size = (n,n,n))
...:     return wE3 + T1
...:
...: def func2(wE3):
...:     T1 = np.random.uniform(size = (n,n,n))
...:     np.add(wE3,T1,out=T1)
...:     return T1
...:
In [59]: # Setup inputs at 1/4th of original dataset sizes
...: N = 25
...: n = 250
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...: wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
...:
In [60]: %timeit func1(wE3)
1 loops, best of 3: 390 ms per loop
In [61]: %timeit func2(wE3)
1 loops, best of 3: 363 ms per loop
Using @Aaron's suggestion, we can use a loop and, assuming that writing the results back into wE3 is okay, do the addition one (n, n) slice at a time, keeping the temporaries small -
wE3 = wE[:,None,None] * wE[:,None] * wE
for x in wE3:
    np.add(x, np.random.uniform(size = (n,n)), out=x)
Final results
Thus, putting all the suggested improvements together, the final runtime test results were -
In [97]: def func1(wE):
...:     wE3 = np.einsum('i,j,k->ijk', wE, wE, wE)
...:     T1 = np.random.uniform(size = (n,n,n))
...:     return wE3 + T1
...:
...: def func2(wE):
...:     wE3 = wE[:,None,None] * wE[:,None] * wE
...:     for x in wE3:
...:         np.add(x, np.random.uniform(size = (n,n)), out=x)
...:     return wE3
...:
In [98]: # Setup inputs at 1/3rd of original dataset sizes
...: N = 33
...: n = 330
...: X = np.random.uniform(size = (N,n))
...: wE = np.mean(X,0)
...:
In [99]: %timeit func1(wE)
1 loops, best of 3: 1.09 s per loop
In [100]: %timeit func2(wE)
1 loops, best of 3: 879 ms per loop
You should really use Numba's JIT (just-in-time compiler) for this. Yours is a pure NumPy pipeline, which is perfect for Numba.
All you have to do is throw the above code into a function and put a @jit decorator on top. It gets speedups close to Cython.
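As a rough sketch of that suggestion (with the caveat that np.einsum is not supported in Numba's nopython mode, so the outer product is written out as explicit loops; build_a is an illustrative name):
import numpy as np
from numba import jit

@jit(nopython=True)
def build_a(wE, T1):
    # computes wE3 + T1 without materializing wE3 as a separate array
    n = wE.shape[0]
    a = np.empty((n, n, n))
    for i in range(n):
        for j in range(n):
            wij = wE[i] * wE[j]
            for k in range(n):
                a[i, j, k] = wij * wE[k] + T1[i, j, k]
    return a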
However, as others have pointed out, it appears you're trying to work with data too large for your local machine, and Numba would not solve that problem.
Related
My problem is the following. I have two arrays X and Y of shape (n, p), where p >> n (e.g. n = 50, p = 10000).
I also have a mask mask (a 1-d boolean array of size p) with respect to p, of small density (e.g. np.mean(mask) is 0.05).
I am trying to compute, as fast as possible, the inner product of X and Y with respect to mask: the output inner is an array of shape (n, n) such that inner[i, j] = np.sum(X[i, np.logical_not(mask)] * Y[j, np.logical_not(mask)]).
I have tried using the numpy.ma library, but it is quite slow for my use:
import numpy as np
import numpy.ma as ma
n, p = 50, 10000
density = 0.05
mask = np.array(np.random.binomial(1, density, size=p), dtype=np.bool_)
mask_big = np.ones(n)[:, None] * mask[None, :]
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
X_ma = ma.array(X, mask=mask_big)
Y_ma = ma.array(Y, mask=mask_big)
But then, on my machine, X_ma.dot(Y_ma.T) is about 5 times slower than X.dot(Y.T)...
To begin with, I think it is a problem that .dot does not know that the mask only applies along p, but I don't know if it's possible to use this information.
I'm looking for a way to perform the computation without being much slower than the naive dot.
Thanks a lot!
We can use matrix multiplication with and without the mask: since the dot product is a sum over the p columns, subtracting the product over the masked columns from the full product yields the desired output -
inner = X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
Or simply use the inverted mask; for a sparse mask this would be slower, though -
inner = X[:,~mask].dot(Y[:,~mask].T)
Timings -
In [34]: np.random.seed(0)
...: p,n = 10000,50
...: X = np.random.rand(n,p)
...: Y = np.random.rand(n,p)
...: mask = np.random.rand(p)>0.95
In [35]: mask.mean()
Out[35]: 0.0507
In [36]: %timeit X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
100 loops, best of 3: 2.54 ms per loop
In [37]: %timeit X[:,~mask].dot(Y[:,~mask].T)
100 loops, best of 3: 4.1 ms per loop
In [39]: %%timeit
...: inner = np.empty((n,n))
...: for i in range(X.shape[0]):
...:     for j in range(X.shape[0]):
...:         inner[i, j] = np.sum(X[i, ~mask] * Y[j, ~mask])
1 loop, best of 3: 302 ms per loop
I am struggling with a slow numpy operation, using python 3.
I have the following operation:
np.sum(np.log(X.T * b + a).T, 1)
where
(30000,1000) = X.shape
(1000,1) = b.shape
(1000,1) = a.shape
My problem is that this operation is pretty slow (around 1.5 seconds), and it sits inside a loop that runs around 100 times, which makes the running time of my code very long.
I am wondering if there is a faster implementation of this function.
Possibly useful fact: X is extremely sparse (only 0.08% of the entries are nonzero), but it is a regular (dense) NumPy array.
We can optimize the logarithm, which seems to be the bottleneck: as a transcendental function it can be sped up with the numexpr module, while the sum-reduction is left to NumPy, which does that part better. That gives us a hybrid approach, like so -
import numexpr as ne
def numexpr_app(X, a, b):
    XT = X.T
    return ne.evaluate('log(XT * b + a)').sum(0)
Looking closely at the broadcasting operation XT * b + a, we see two stages of broadcasting, which we can reduce to one stage with some division: log(X.T*b + a) = log(b*(X.T + a/b)) = log(b) + log(a/b + X.T), so the log(b) term can be summed separately. This gives us a slightly modified version, shown below -
def numexpr_app2(X, a, b):
    ab = (a/b)
    XT = X.T
    return np.log(b).sum() + ne.evaluate('log(ab + XT)').sum(0)
Runtime test and verification
Original approach -
def numpy_app(X, a, b):
    return np.sum(np.log(X.T * b + a).T, 1)
Timings -
In [111]: # Setup inputs
...: density = 0.08/100 # 0.08 % sparse
...: m,n = 30000, 1000
...: X = scipy.sparse.rand(m,n,density=density,format="csr").toarray()
...: a = np.random.rand(n,1)
...: b = np.random.rand(n,1)
...:
In [112]: out0 = numpy_app(X, a, b)
...: out1 = numexpr_app(X, a, b)
...: out2 = numexpr_app2(X, a, b)
...: print np.allclose(out0, out1)
...: print np.allclose(out0, out2)
...:
True
True
In [114]: %timeit numpy_app(X, a, b)
1 loop, best of 3: 691 ms per loop
In [115]: %timeit numexpr_app(X, a, b)
10 loops, best of 3: 153 ms per loop
In [116]: %timeit numexpr_app2(X, a, b)
10 loops, best of 3: 149 ms per loop
Just to prove the observation stated at the start, that the log part is the bottleneck of the original NumPy approach, here's the timing for it -
In [44]: %timeit np.log(X.T * b + a)
1 loop, best of 3: 682 ms per loop
And the improvement on that part was significant -
In [120]: XT = X.T
In [121]: %timeit ne.evaluate('log(XT * b + a)')
10 loops, best of 3: 142 ms per loop
It's a bit unclear why you would do np.sum(your_array.T, axis=1) instead of np.sum(your_array, axis=0).
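A quick check of that equivalence:
import numpy as np

arr = np.random.rand(3, 4)
# transposing and summing along axis 1 equals summing the original along axis 0
assert np.allclose(np.sum(arr.T, axis=1), np.sum(arr, axis=0))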
You can use a scipy sparse matrix (use the column-compressed format for X, so that X.T is row-compressed, since you multiply by b, which has the shape of one column of X.T):
X_sparse = scipy.sparse.csc_matrix(X)
and replace X.T * b by:
X_sparse.T.multiply(b)
However, if a is not sparse, it will not help you as much as it could (a sketch of one way around that appears at the end of this answer).
These are the speed ups I obtain for this operation:
In [16]: %timeit X_sparse.T.multiply(b)
The slowest run took 10.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 374 µs per loop
In [17]: %timeit X.T * b
10 loops, best of 3: 44.5 ms per loop
with:
import numpy as np
from scipy import sparse
X = np.random.randn(30000, 1000)
a = np.random.randn(1000, 1)
b = np.random.randn(1000, 1)
X[X < 3] = 0
print(np.sum(X != 0))
X_sparse = sparse.csc_matrix(X)
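As for that caveat about a not being sparse: every zero entry of X contributes exactly log(a_i) to its row's sum, so the expensive logs are only needed at the nonzero entries. Here is a minimal sketch of that idea (not from the original answer; sparse_log_sum is an illustrative name):
import numpy as np

def sparse_log_sum(X_sparse, a, b):
    # X_sparse: (m, n) scipy sparse matrix; a, b: (n, 1) dense
    # computes np.sum(np.log(X.T * b + a).T, 1) while exploiting sparsity
    m, n = X_sparse.shape
    base = np.log(a).sum()           # each row's sum if X were all zeros
    out = np.full(m, base)
    Xc = X_sparse.tocoo()
    i, j, v = Xc.row, Xc.col, Xc.data
    # at each nonzero, swap the log(a_j) contribution for log(v*b_j + a_j)
    np.add.at(out, i, np.log(v * b[j, 0] + a[j, 0]) - np.log(a[j, 0]))
    return out
At 0.08% density, that is about 24 thousand log evaluations instead of 30 million.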
Given two matrices, I want to compute the pairwise differences between all rows. Each matrix has 1000 rows and 100 columns, so they are fairly large. I tried using a for loop and pure broadcasting, but the for loop seems to be working faster. Am I doing something wrong? Here is the code:
from numpy import *
import time

A = random.randn(1000,100)
B = random.randn(1000,100)

start = time.time()
for a in A:
    sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
The broadcasting method takes about 1 second longer, and the gap grows for larger matrices. Any idea how to speed this up using only NumPy?
Here's another way to perform the computation, based on the identity
(a-b)^2 = a^2 + b^2 - 2ab
with np.einsum for the first two terms and dot-product for the third one -
import numpy as np
np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)
Runtime test
Approaches -
def loopy_app(A,B):
    m,n = A.shape[0], B.shape[0]
    out = np.empty((m,n))
    for i,a in enumerate(A):
        out[i] = np.sum((a - B)**2,1)
    return out

def broadcasting_app(A,B):
    return ((A[:,np.newaxis,:] - B)**2).sum(-1)

# @Paul Panzer's soln
def outer_sum_dot_app(A,B):
    return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)

# @Daniel Forsman's soln
def einsum_all_app(A,B):
    return np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], \
                     A[:,None,:] - B[None,:,:])

# Proposed in this post
def outer_einsum_dot_app(A,B):
    return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - \
           2*np.dot(A,B.T)
Timings -
In [51]: A = np.random.randn(1000,100)
...: B = np.random.randn(1000,100)
...:
In [52]: %timeit loopy_app(A,B)
...: %timeit broadcasting_app(A,B)
...: %timeit outer_sum_dot_app(A,B)
...: %timeit einsum_all_app(A,B)
...: %timeit outer_einsum_dot_app(A,B)
...:
10 loops, best of 3: 136 ms per loop
1 loops, best of 3: 302 ms per loop
100 loops, best of 3: 8.51 ms per loop
1 loops, best of 3: 341 ms per loop
100 loops, best of 3: 8.38 ms per loop
Here is a solution which avoids both the loop and the large intermediates:
from numpy import *
import time

A = random.randn(1000,100)
B = random.randn(1000,100)

start = time.time()
for a in A:
    sum((a - B)**2,1)
print time.time() - start

# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start

# matmul
start = time.time()
add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*dot(A,B.T)
print time.time() - start
Prints:
0.546781778336
0.674743175507
0.10723400116
Another job for np.einsum
np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], A[:,None,:] - B[None,:,:])
Similar to @Paul Panzer's answer, a general way to compute pairwise differences of arrays of arbitrary dimension is broadcasting, as follows:
Let v be a NumPy array of size (n, d):
import numpy as np
v_tiled_across = np.tile(v[:, np.newaxis, :], (1, v.shape[0], 1))
v_tiled_down = np.tile(v[np.newaxis, :, :], (v.shape[0], 1, 1))
result = v_tiled_across - v_tiled_down
For a better picture of what's happening, imagine each d-dimensional row of v being propped up like a flagpole, then copied across and down. When you do component-wise subtraction, you get every pairwise combination.
There's also scipy.spatial.distance.pdist, which computes a metric in a pairwise fashion.
from scipy.spatial.distance import pdist, squareform
pairwise_L2_norms = squareform(pdist(v))
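A side note not in the original answer: pdist works within a single array; for pairwise metrics between two different arrays, like A and B in the question above, scipy.spatial.distance.cdist applies directly -
import numpy as np
from scipy.spatial.distance import cdist

A = np.random.randn(1000, 100)
B = np.random.randn(1000, 100)
# squared Euclidean distances between all rows of A and all rows of B;
# equivalent to ((A[:,None,:] - B)**2).sum(-1) from the question
D = cdist(A, B, 'sqeuclidean')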
This question relates to one I posted awhile back:
Python, numpy, einsum multiply a stack of matrices
I am trying to understand why I get the speedups I get with Numba when it is used in a particular manner when multiplying a stack of stacks of matrices. As before, I am putting in a (500,201,2,2) array, multiplying the (2x2) matrices at the end along the first axis (so 500 multiplications), to get a (201,2,2) array as the result.
Here is the Python code:
import numpy as np
from numpy.random import rand
from numba import jit  # numba 0.24, numpy 1.9.3, python 2.7.11

Arr = rand(500,201,2,2)

def loopMult(Arr):
    ArrMult = Arr[0]
    for i in range(1,len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult

@jit(nopython=True)
def loopMultJit(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            ArrMult[i] = np.dot(ArrMult[i], Arr[j, i])
    return ArrMult

@jit(nopython=True)
def loopMultJit_2X2(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            x1 = ArrMult[i,0,0] * Arr[j,i,0,0] + ArrMult[i,0,1] * Arr[j,i,1,0]
            y1 = ArrMult[i,0,0] * Arr[j,i,0,1] + ArrMult[i,0,1] * Arr[j,i,1,1]
            x2 = ArrMult[i,1,0] * Arr[j,i,0,0] + ArrMult[i,1,1] * Arr[j,i,1,0]
            y2 = ArrMult[i,1,0] * Arr[j,i,0,1] + ArrMult[i,1,1] * Arr[j,i,1,1]
            ArrMult[i,0,0] = x1
            ArrMult[i,0,1] = y1
            ArrMult[i,1,0] = x2
            ArrMult[i,1,1] = y2
    return ArrMult
A1 = loopMult(Arr)
A2 = loopMultJit(Arr)
A3 = loopMultJit_2X2(Arr)
print np.allclose(A1, A2)
print np.allclose(A1, A3)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_2X2(Arr)
Here is the output:
True
True
10 loops, best of 3: 40.5 ms per loop
10 loops, best of 3: 36 ms per loop
1000 loops, best of 3: 808 µs per loop
In the prior question, the accepted answer showed that with f2py there was a speedup of 8x without detailed optimization. Here, with Numba, I get about a 10% speedup over an einsum loop, but I get a 45x speedup if, instead of using np.dot in the loop, I simply do the 2x2 matrix multiplication by hand. Why is this? I should mention I have also implemented both of these jit functions, with proper type signatures, as guvectorize versions, which give basically the same speedup factors, so I left them out. The speedup from iterating over a (201,500,2,2) array instead is also minimal.
Two comments responded that the speedup is just due to Python overhead, and I think that's right. The overhead is mostly function calls, but also for loops; np.dot has some extra overhead on top of that. I set up a naive dot product function:
@jit(nopython=True)
def dot(mat1, mat2):
    # result shape is (rows of mat1, cols of mat2)
    mat = np.empty(shape=(mat1.shape[0], mat2.shape[1]), dtype=mat1.dtype)
    for r1 in range(mat1.shape[0]):
        for c2 in range(mat2.shape[1]):
            s = 0
            for j in range(mat2.shape[0]):
                s += mat1[r1,j] * mat2[j,c2]
            mat[r1,c2] = s
    return mat
Then I set up two functions to multiply the arrays: one which calls the dot function, and one with the dot product built into the loop so that it executes without an extra function call:
@jit(nopython=True)
def loopMultJit_dot(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            ArrMult[i] = dot(ArrMult[i], Arr[j, i])
    return ArrMult

@jit(nopython=True)
def loopMultJit_dotInternal(Arr):
    ArrMult = np.empty(shape=Arr.shape[1:], dtype=Arr.dtype)
    tmp = np.empty(shape=Arr.shape[2:], dtype=Arr.dtype)
    for i in range(0, Arr.shape[1]):
        ArrMult[i] = Arr[0, i]
        for j in range(1, Arr.shape[0]):
            # accumulate into a temporary so the update does not read
            # entries of ArrMult[i] that were already overwritten
            for r1 in range(ArrMult.shape[1]):
                for c2 in range(Arr.shape[3]):
                    s = 0.0
                    for r2 in range(Arr.shape[2]):
                        s += ArrMult[i,r1,r2] * Arr[j,i,r2,c2]
                    tmp[r1,c2] = s
            ArrMult[i] = tmp
    return ArrMult
Then I can run two comparisons: 2x2 arrays and 10x10 arrays. With these I get some idea of the penalty paid for function calls in general, for the np.dot function call in particular, and of the gains from BLAS optimizations in np.dot:
print "2x2 Time Test:"
Arr = rand(500,201,2,2)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_2X2(Arr)
%timeit loopMultJit_dot(Arr)
%timeit loopMultJit_dotInternal(Arr)
print "10x10 Time Test:"
Arr = rand(500,201,10,10)
%timeit loopMult(Arr)
%timeit loopMultJit(Arr)
%timeit loopMultJit_dot(Arr)
%timeit loopMultJit_dotInternal(Arr)
which yields:
2x2 Time Test:
10 loops, best of 3: 55.8 ms per loop # einsum
10 loops, best of 3: 48.7 ms per loop # np.dot
1000 loops, best of 3: 1.09 ms per loop # 2x2
10 loops, best of 3: 28.3 ms per loop # naive dot, separate function
100 loops, best of 3: 2.58 ms per loop # naive dot internal
10x10 Time Test:
1 loop, best of 3: 499 ms per loop # einsum
10 loops, best of 3: 91.3 ms per loop # np.dot
10 loops, best of 3: 170 ms per loop # naive dot, separate function
10 loops, best of 3: 161 ms per loop # naive dot internal
I suppose the take-home messages are:
einsum is nice if you're not using numba or you need one-liners, but for matrix multiplication there are faster options
if you're working with small matrices, it can be faster to do things by hand and not call separate functions
for large matrices, there is a reason BLAS was invented; in fact, speedups are quite noticeable at sizes as small as 10x10.
I have written a Python function that computes pairwise electromagnetic interactions between a largish number (N ~ 10^3) of particles and stores the results in an NxN complex128 ndarray. It runs, but it is the slowest part of a larger program, taking about 40 seconds when N=900 [corrected]. The original code looks like this:
import numpy as np
def interaction(s,alpha,kprop): # s is an Nx3 real array
                                # alpha is complex
                                # kprop is float
    ndipoles = s.shape[0]
    Amat = np.zeros((ndipoles,3, ndipoles, 3), dtype=np.complex128)
    I = np.array([[1,0,0],[0,1,0],[0,0,1]])
    im = complex(0,1)
    k2 = kprop*kprop

    for i in range(ndipoles):
        xi = s[i,:]
        for j in range(ndipoles):
            if i != j:
                xj = s[j,:]
                dx = xi-xj
                R = np.sqrt(dx.dot(dx))
                n = dx/R
                kR = kprop*R
                kR2 = kR*kR
                A = ((1./kR2) - im/kR)
                nxn = np.outer(n, n)
                nxn = (3*A-1)*nxn + (1-A)*I
                nxn *= -alpha*(k2*np.exp(im*kR))/R
            else:
                nxn = I
            Amat[i,:,j,:] = nxn

    return(Amat.reshape((3*ndipoles,3*ndipoles)))
I had never previously used Cython, but that seemed like a good place to start in my effort to speed things up, so I pretty much blindly adapted the techniques I found in online tutorials. I got some speedup (30 seconds vs. 40 seconds), but not nearly as dramatic as I expected, so I'm wondering whether I'm doing something wrong or am missing a critical step. The following is my best attempt at cythonizing the above routine:
import numpy as np
cimport numpy as np

DTYPE = np.complex128
ctypedef np.complex128_t DTYPE_t

def interaction(np.ndarray s, DTYPE_t alpha, float kprop):
    cdef float k2 = kprop*kprop
    cdef int i,j
    cdef np.ndarray xi, xj, dx, n, nxn
    cdef float R, kR, kR2
    cdef DTYPE_t A
    cdef int ndipoles = s.shape[0]
    cdef np.ndarray Amat = np.zeros((ndipoles,3, ndipoles, 3), dtype=DTYPE)
    cdef np.ndarray I = np.array([[1,0,0],[0,1,0],[0,0,1]])
    cdef DTYPE_t im = complex(0,1)

    for i in range(ndipoles):
        xi = s[i,:]
        for j in range(ndipoles):
            if i != j:
                xj = s[j,:]
                dx = xi-xj
                R = np.sqrt(dx.dot(dx))
                n = dx/R
                kR = kprop*R
                kR2 = kR*kR
                A = ((1./kR2) - im/kR)
                nxn = np.outer(n, n)
                nxn = (3*A-1)*nxn + (1-A)*I
                nxn *= -alpha*(k2*np.exp(im*kR))/R
            else:
                nxn = I
            Amat[i,:,j,:] = nxn

    return(Amat.reshape((3*ndipoles,3*ndipoles)))
The real power of NumPy lies in performing an operation across a huge number of elements in a vectorized manner, instead of applying that operation in small chunks spread across loops. In your case, you are using two nested loops and an IF conditional. I would propose extending the dimensions of the intermediate arrays, which brings NumPy's powerful broadcasting into play, so the same operations can be applied to all elements in one go instead of to small chunks of data inside the loops.
For extending the dimensions, None/np.newaxis can be used. A vectorized implementation following that premise would look like this -
def vectorized_interaction(s,alpha,kprop):
    N = s.shape[0]   # the original used a global N; define it locally instead
    im = complex(0,1)
    I = np.array([[1,0,0],[0,1,0],[0,0,1]])
    k2 = kprop*kprop

    # Vectorized calculations for dx, R, n, kR, A
    sd = s[:,None] - s
    Rv = np.sqrt((sd**2).sum(2))
    nv = sd/Rv[:,:,None]   # the diagonal of Rv is 0; those entries are
    kRv = Rv*kprop         # overwritten by the masked setting below
    Av = (1./(kRv*kRv)) - im/kRv

    # Vectorized calculation for: "nxn = np.outer(n, n)"
    nxnv = nv[:,:,:,None]*nv[:,:,None,:]

    # Vectorized calculation for: "(3*A-1)*nxn + (1-A)*I"
    P = (3*Av[:,:,None,None]-1)*nxnv + (1-Av[:,:,None,None])*I

    # Vectorized calculation for: "-alpha*(k2*np.exp(im*kR))/R"
    multv = -alpha*(k2*np.exp(im*kRv))/Rv

    # Vectorized calculation for: "nxn *= -alpha*(k2*np.exp(im*kR))/R"
    outv = P*multv[:,:,None,None]

    # Simulate the ELSE part of the conditional "if i != j:"
    # by setting the diagonal (i == j) blocks to I
    outv[np.eye(N,dtype=bool)] = I
    return outv.transpose(0,2,1,3).reshape(N*3,-1)
Runtime tests and output verification -
Case #1:
In [703]: N = 10
...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3)
...: alpha = 3j
...: kprop = 5.4
...:
In [704]: out_org = interaction(s,alpha,kprop)
...: out_vect = vectorized_interaction(s,alpha,kprop)
...: print np.allclose(np.real(out_org),np.real(out_vect))
...: print np.allclose(np.imag(out_org),np.imag(out_vect))
...:
True
True
In [705]: %timeit interaction(s,alpha,kprop)
100 loops, best of 3: 7.6 ms per loop
In [706]: %timeit vectorized_interaction(s,alpha,kprop)
1000 loops, best of 3: 304 µs per loop
Case #2:
In [707]: N = 100
...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3)
...: alpha = 3j
...: kprop = 5.4
...:
In [708]: out_org = interaction(s,alpha,kprop)
...: out_vect = vectorized_interaction(s,alpha,kprop)
...: print np.allclose(np.real(out_org),np.real(out_vect))
...: print np.allclose(np.imag(out_org),np.imag(out_vect))
...:
True
True
In [709]: %timeit interaction(s,alpha,kprop)
1 loops, best of 3: 826 ms per loop
In [710]: %timeit vectorized_interaction(s,alpha,kprop)
100 loops, best of 3: 14 ms per loop
Case #3:
In [711]: N = 900
...: s = np.random.rand(N,3) + complex(0,1)*np.random.rand(N,3)
...: alpha = 3j
...: kprop = 5.4
...:
In [712]: out_org = interaction(s,alpha,kprop)
...: out_vect = vectorized_interaction(s,alpha,kprop)
...: print np.allclose(np.real(out_org),np.real(out_vect))
...: print np.allclose(np.imag(out_org),np.imag(out_vect))
...:
True
True
In [713]: %timeit interaction(s,alpha,kprop)
1 loops, best of 3: 1min 7s per loop
In [714]: %timeit vectorized_interaction(s,alpha,kprop)
1 loops, best of 3: 1.59 s per loop