Exterior product in NumPy : Vectorizing six nested loops - python

In a research paper, the author introduces an exterior product between two 3x3 matrices A and B, resulting in C:
C(i, j) = sum over k, l, m, n = 1..3 of eps(i,k,l) * eps(j,m,n) * A(k,m) * B(l,n)
where eps(a, b, c) is the Levi-Civita symbol.
I am wondering how to vectorize such a mathematical operator in NumPy instead of naively implementing six nested loops (over i, j, k, l, m, n).

It looks like a purely sum-reduction based problem, with no requirement to keep any axis aligned between the inputs. So, I would suggest a matrix-multiplication based solution for tensors using np.tensordot.
Thus, one solution could be implemented in three steps -
# Matrix-multiplication between first eps and A.
# Thus losing second axis from eps and first from A : k
parte1 = np.tensordot(eps,A,axes=((1),(0)))
# Matrix-multiplication between second eps and B.
# Thus losing third axis from eps and second from B : n
parte2 = np.tensordot(eps,B,axes=((2),(1)))
# Finally, we are left with two products: ilm & jml.
# We need to lose lm and ml from these inputs respectively to get ij.
# So, we need to lose the last two dims from the products, but flipped.
out = np.tensordot(parte1,parte2,axes=((1,2),(2,1)))
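For reference, here is a direct naive transcription of the six-loop definition from the question (not part of the original answer; the function name is mine), handy for checking the vectorized results with np.allclose:
import numpy as np

def naive_exterior(eps, A, B):
    # C(i,j) = sum over k,l,m,n of eps[i,k,l]*eps[j,m,n]*A[k,m]*B[l,n]
    d = A.shape[0]
    C = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            for k in range(d):
                for l in range(d):
                    for m in range(d):
                        for n in range(d):
                            C[i, j] += eps[i, k, l] * eps[j, m, n] * A[k, m] * B[l, n]
    return C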
Runtime test
Approaches -
def einsum_based1(eps, A, B):  # @unutbu's soln1
    return np.einsum('ikl,jmn,km,ln->ij', eps, eps, A, B)

def einsum_based2(eps, A, B):  # @unutbu's soln2
    return np.einsum('ilm,jml->ij',
                     np.einsum('ikl,km->ilm', eps, A),
                     np.einsum('jmn,ln->jml', eps, B))

def tensordot_based(eps, A, B):
    parte1 = np.tensordot(eps,A,axes=((1),(0)))
    parte2 = np.tensordot(eps,B,axes=((2),(1)))
    return np.tensordot(parte1,parte2,axes=((1,2),(2,1)))
Timings -
In [5]: # Setup inputs
...: N = 20
...: eps = np.random.rand(N,N,N)
...: A = np.random.rand(N,N)
...: B = np.random.rand(N,N)
...:
In [6]: %timeit einsum_based1(eps, A, B)
1 loops, best of 3: 773 ms per loop
In [7]: %timeit einsum_based2(eps, A, B)
1000 loops, best of 3: 972 µs per loop
In [8]: %timeit tensordot_based(eps, A, B)
1000 loops, best of 3: 214 µs per loop
Bigger dataset -
In [12]: # Setup inputs
...: N = 100
...: eps = np.random.rand(N,N,N)
...: A = np.random.rand(N,N)
...: B = np.random.rand(N,N)
...:
In [13]: %timeit einsum_based2(eps, A, B)
1 loops, best of 3: 856 ms per loop
In [14]: %timeit tensordot_based(eps, A, B)
10 loops, best of 3: 49.2 ms per loop

You could use einsum which implements Einstein summation notation:
C = np.einsum('ikl,jmn,km,ln->ij', eps, eps, A, B)
or for better performance, apply einsum to two arrays at a time:
C = np.einsum('ilm,jml->ij',
              np.einsum('ikl,km->ilm', eps, A),
              np.einsum('jmn,ln->jml', eps, B))
np.einsum computes a sum of products.
The subscript specifier 'ikl,jmn,km,ln->ij' tells np.einsum that
the first eps has subscripts i,k,l,
the second eps has subscripts j,m,n,
A has subscripts k,m,
B has subscripts l,n,
the output array has subscripts i,j
Thus, the summation is over products of the form
eps(i,k,l) * eps(j,m,n) * A(k,m) * B(l,n)
All subscripts not in the output array are summed over.
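Both answers assume eps is already available as a (3, 3, 3) array. As a small aside (not from the original answers), one straightforward way to build the Levi-Civita symbol and run the einsum solution:
import numpy as np

eps = np.zeros((3, 3, 3))
eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1     # even permutations
eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1    # odd permutations

A = np.random.rand(3, 3)
B = np.random.rand(3, 3)
C = np.einsum('ikl,jmn,km,ln->ij', eps, eps, A, B)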

Related

Fast inner product of two 2-d masked arrays in numpy

My problem is the following. I have two arrays X and Y of shape n, p where p >> n (e.g. n = 50, p = 10000).
I also have a mask mask (1-d array of booleans of size p) with respect to p, of small density (e.g. np.mean(mask) is 0.05).
I try to compute, as fast as possible, the inner product of X and Y with respect to mask: the output inner is an array of shape n, n, and is such that inner[i, j] = np.sum(X[i, np.logical_not(mask)] * Y[j, np.logical_not(mask)]).
I have tried using the numpy.ma library, but it is quite slow for my use:
import numpy as np
import numpy.ma as ma
n, p = 50, 10000
density = 0.05
mask = np.array(np.random.binomial(1, density, size=p), dtype=np.bool_)
mask_big = np.ones(n)[:, None] * mask[None, :]
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
X_ma = ma.array(X, mask=mask_big)
Y_ma = ma.array(Y, mask=mask_big)
But then, on my machine, X_ma.dot(Y_ma.T) is about 5 times slower than X.dot(Y.T)...
To begin with, I think the problem is that .dot does not know that the mask only applies along p, but I don't know if it's possible to use this information.
I'm looking for a way to perform the computation without being much slower than the naive dot.
Thanks a lot !
We can use matrix-multiplication on both the full arrays and on just the masked columns; subtracting the masked-column product from the full product yields the desired output -
inner = X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
Or simply use the inverted mask, though this would be slower for a sparse mask -
inner = X[:,~mask].dot(Y[:,~mask].T)
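As a quick sanity check (not part of the original answer), the subtraction identity can be verified against the inverted-mask product directly:
import numpy as np

n, p = 50, 10000
X = np.random.randn(n, p)
Y = np.random.randn(n, p)
mask = np.random.rand(p) > 0.95

inner_sub = X.dot(Y.T) - X[:, mask].dot(Y[:, mask].T)
inner_ref = X[:, ~mask].dot(Y[:, ~mask].T)
print(np.allclose(inner_sub, inner_ref))   # True, up to floating-point error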
Timings -
In [34]: np.random.seed(0)
...: p,n = 10000,50
...: X = np.random.rand(n,p)
...: Y = np.random.rand(n,p)
...: mask = np.random.rand(p)>0.95
In [35]: mask.mean()
Out[35]: 0.0507
In [36]: %timeit X.dot(Y.T)-X[:,mask].dot(Y[:,mask].T)
100 loops, best of 3: 2.54 ms per loop
In [37]: %timeit X[:,~mask].dot(Y[:,~mask].T)
100 loops, best of 3: 4.1 ms per loop
In [39]: %%timeit
    ...: inner = np.empty((n,n))
    ...: for i in range(X.shape[0]):
    ...:     for j in range(X.shape[0]):
    ...:         inner[i, j] = np.sum(X[i, ~mask] * Y[j, ~mask])
1 loop, best of 3: 302 ms per loop

Time- and space-efficient array multiplication in NumPy [duplicate]

Given NumPy arrays R and S with shapes (m, d) and (m, n, d) respectively, I would like to compute an array P of shape (m, n) whose (i, j)-th entry is np.dot(R[i, :] , S[i, j, :]).
Doing a double for-loop would not need any extra space (apart from the m * n space for P), but would not be time-efficient.
Using broadcasting, I could do P = np.sum(R[:, np.newaxis, :] * S, axis=2), but this would cost extra m * n * d space.
What is the most time- and space-efficient way to do this?
einsum is another of the usual suspects
>>> m, n, d = 100, 100, 100
>>> R = np.random.random((m, d))
>>> S = np.random.random((m, n, d))
>>> np.einsum('md,mnd->mn', R, S)
>>> np.allclose(np.einsum('md,mnd->mn', R, S), (R[:,None,:]*S).sum(axis=-1))
True
>>> from timeit import repeat
>>> repeat('np.einsum("md,mnd->mn", R, S)', globals=globals(), number=1000)
[0.7004671019967645, 0.6925274690147489, 0.6952172230230644]
>>> repeat('(R[:,None,:]*S).sum(axis=-1)', globals=globals(), number=1000)
[3.0512512560235336, 3.0466731210472062, 3.044075728044845]
Some indirect evidence that einsum isn't too wasteful with the RAM:
>>> m, n, d = 1000, 1001, 1002
>>> # Too much for broadcasting:
>>> np.zeros((m, n, d))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
>>> R = np.random.random((m, d))
>>> S = np.random.random((n, d))
>>> np.einsum('md,nd->mn', R, S).shape
(1000, 1001)
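Another option, not covered in the answers here, is batched matrix multiplication with the @ operator, which likewise avoids the (m, n, d) intermediate by multiplying each S[i] (shape (n, d)) with R[i] treated as a length-d column vector; a minimal sketch:
import numpy as np

m, n, d = 100, 100, 100
R = np.random.random((m, d))
S = np.random.random((m, n, d))

P = (S @ R[:, :, None])[:, :, 0]   # batched matmul, result has shape (m, n)
print(np.allclose(P, np.einsum('md,mnd->mn', R, S)))   # True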
In these cases, it is always good to consider numba, which can provide the best of both worlds:
import numpy as np
from numba import jit
def vanilla_mult(R, S):
    m, n = R.shape[0], S.shape[1]
    result = np.empty((m, n), dtype=R.dtype)
    for i in range(m):
        for j in range(n):
            result[i, j] = np.dot(R[i, :], S[i, j, :])
    return result

def broadcast_mult(R, S):
    return np.sum(R[:, np.newaxis, :] * S, axis=2)

@jit(nopython=True)
def jit_mult(R, S):
    m, n = R.shape[0], S.shape[1]
    result = np.empty((m, n), dtype=R.dtype)
    for i in range(m):
        for j in range(n):
            result[i, j] = np.dot(R[i, :], S[i, j, :])
    return result
Note that vanilla_mult and jit_mult have the exact same implementation; however, the latter is just-in-time compiled. Let's test this out:
In [1]: import test # the above is in test.py
In [2]: import numpy as np
In [3]: m, n, d = 100, 100, 100
In [4]: R = np.random.rand(m, d)
In [5]: S = np.random.rand(m, n, d)
OK...
In [6]: %timeit test.broadcast_mult(R, S)
100 loops, best of 3: 1.95 ms per loop
In [7]: %timeit test.vanilla_mult(R, S)
100 loops, best of 3: 11.7 ms per loop
Ouch, yeah, roughly a 6-fold increase in computation time compared to broadcasting. However...
In [8]: %timeit test.jit_mult(R, S)
The slowest run took 760.57 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 870 µs per loop
Nice! We can cut our runtime in half by simply JITing! How does this scale?
In [12]: m, n, d = 1000, 1000, 100
In [13]: R = np.random.rand(m, d)
In [14]: S = np.random.rand(m, n, d)
In [15]: %timeit test.vanilla_mult(R, S)
1 loop, best of 3: 1.22 s per loop
In [16]: %timeit test.broadcast_mult(R, S)
1 loop, best of 3: 666 ms per loop
In [17]: %timeit test.jit_mult(R, S)
The slowest run took 7.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 83.6 ms per loop
This scales very well. Broadcasting is starting to be held back by having to create large intermediate arrays: it takes only half the time of the vanilla approach, but about 8 times as long as the JIT approach!
Edit to Add
And finally, we compare the np.einsum approach:
In [19]: %timeit np.einsum('md,mnd->mn', R, S)
10 loops, best of 3: 59.5 ms per loop
And it is clearly the winner in speed. I am not familiar enough with it to comment on the space requirements, though.

Optimizing an operation with a sparse numpy array

I am struggling with a slow numpy operation, using python 3.
I have the following operation:
np.sum(np.log(X.T * b + a).T, 1)
where
X.shape == (30000, 1000)
b.shape == (1000, 1)
a.shape == (1000, 1)
My problem is that this operation is pretty slow (around 1.5 seconds), and it sits inside a loop that repeats around 100 times, which makes the running time of my code very long.
I am wondering if there is a faster implementation of this function.
A possibly useful fact: X is extremely sparse (only 0.08% of the entries are nonzero), but it is stored as a dense NumPy array.
The logarithm operation seems to be the bottleneck. Being one of the transcendental functions, it can be sped up with the numexpr module, while the sum-reduction is left to NumPy, which does that much better. This gives us a hybrid approach, like so -
import numexpr as ne
def numexpr_app(X, a, b):
    XT = X.T
    return ne.evaluate('log(XT * b + a)').sum(0)
Looking closely at the broadcasting operations in XT * b + a, we see that there are two stages of broadcasting, which we can optimize further. The intention is to reduce them to a single stage, and that is possible here with some division: since log(X.T * b + a) = log(b) + log(a/b + X.T) for positive b, the log(b) terms can be summed separately. This gives us a slightly modified version, shown below -
def numexpr_app2(X, a, b):
    ab = (a/b)
    XT = X.T
    return np.log(b).sum() + ne.evaluate('log(ab + XT)').sum(0)
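A quick pure-NumPy check of the identity used above (not part of the original answer): for positive b, log(X.T * b + a) equals log(b) + log(a/b + X.T) elementwise, which is what lets the log(b) terms be summed separately:
import numpy as np

m, n = 200, 50
X = np.random.rand(m, n)
a = np.random.rand(n, 1)
b = np.random.rand(n, 1)

lhs = np.log(X.T * b + a)
rhs = np.log(b) + np.log(a / b + X.T)
print(np.allclose(lhs, rhs))   # True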
Runtime test and verification
Original approach -
def numpy_app(X, a, b):
    return np.sum(np.log(X.T * b + a).T, 1)
Timings -
In [111]: # Setup inputs
...: density = 0.08/100 # 0.08 % sparse
...: m,n = 30000, 1000
...: X = scipy.sparse.rand(m,n,density=density,format="csr").toarray()
...: a = np.random.rand(n,1)
...: b = np.random.rand(n,1)
...:
In [112]: out0 = numpy_app(X, a, b)
...: out1 = numexpr_app(X, a, b)
...: out2 = numexpr_app2(X, a, b)
...: print np.allclose(out0, out1)
...: print np.allclose(out0, out2)
...:
True
True
In [114]: %timeit numpy_app(X, a, b)
1 loop, best of 3: 691 ms per loop
In [115]: %timeit numexpr_app(X, a, b)
10 loops, best of 3: 153 ms per loop
In [116]: %timeit numexpr_app2(X, a, b)
10 loops, best of 3: 149 ms per loop
Just to prove the observation stated at the start that the log part is the bottleneck in the original NumPy approach, here's the timing for it -
In [44]: %timeit np.log(X.T * b + a)
1 loop, best of 3: 682 ms per loop
On which the improvement was significant -
In [120]: XT = X.T
In [121]: %timeit ne.evaluate('log(XT * b + a)')
10 loops, best of 3: 142 ms per loop
It's a bit unclear why you would do np.sum(your_array.T, axis=1) instead of np.sum(your_array, axis=0).
You can use a scipy sparse matrix (use column-compressed format for X, so that X.T is row-compressed, since you multiply by b, which has one entry per row of X.T):
X_sparse = scipy.sparse.csc_matrix(X)
and replace X.T * b by:
X_sparse.T.multiply(b)
However if a is not sparse it will not help you as much as it could.
These are the speed ups I obtain for this operation:
In [16]: %timeit X_sparse.T.multiply(b)
The slowest run took 10.80 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 374 µs per loop
In [17]: %timeit X.T * b
10 loops, best of 3: 44.5 ms per loop
with:
import numpy as np
from scipy import sparse
X = np.random.randn(30000, 1000)
a = np.random.randn(1000, 1)
b = np.random.randn(1000, 1)
X[X < 3] = 0
print(np.sum(X != 0))
X_sparse = sparse.csc_matrix(X)

Cross product all points in 3D space

I have a vector, a, which I wish to cross with every point in a defined 3D space.
import numpy as np
# Grid
x = np.arange(-4,4,0.1)
y = np.arange(-4,4,0.1)
z = np.arange(-4,4,0.1)
a = [1,0,0]
result = [[] for i in range(3)]
for j in range(len(x)):            # loop on x coords
    for k in range(len(y)):        # loop on y coords
        for l in range(len(z)):    # loop on z coords
            r = [x[j], y[k], z[l]]
            result[0].append(np.cross(a, r)[0])
            result[1].append(np.cross(a, r)[1])
            result[2].append(np.cross(a, r)[2])
This produces an array containing the cross product of a with every point in space. However, the process takes far too long due to the nested loops. Is there any way to exploit vectorization (meshgrid perhaps?) to make this process faster?
Here's one vectorized approach -
np.cross(a, np.array(np.meshgrid(x,y,z)).transpose(2,1,3,0)).reshape(-1,3).T
Sample run -
In [403]: x = np.random.rand(4)
...: y = np.random.rand(5)
...: z = np.random.rand(6)
...:
In [404]: result = original_app(x,y,z,a)
In [405]: out = np.cross(a, np.array(np.meshgrid(x,y,z)).\
transpose(2,1,3,0)).reshape(-1,3).T
In [406]: np.allclose(result[0], out[0])
Out[406]: True
In [407]: np.allclose(result[1], out[1])
Out[407]: True
In [408]: np.allclose(result[2], out[2])
Out[408]: True
Runtime test -
# Original setup used in the question
In [393]: # Grid
...: x = np.arange(-4,4,0.1)
...: y = np.arange(-4,4,0.1)
...: z = np.arange(-4,4,0.1)
...:
# Original approach
In [397]: %timeit original_app(x,y,z,a)
1 loops, best of 3: 21.5 s per loop
# @Denziloe's soln
In [395]: %timeit [np.cross(a, r) for r in product(x, y, z)]
1 loops, best of 3: 7.34 s per loop
# Proposed in this post
In [396]: %timeit np.cross(a, np.array(np.meshgrid(x,y,z)).\
transpose(2,1,3,0)).reshape(-1,3).T
100 loops, best of 3: 16 ms per loop
More than 1000x speedup over the original approach and more than 450x over the loopy approach from the other answer.
This takes a couple of seconds to run on my machine:
from itertools import product
result = [np.cross(a, r) for r in product(x, y, z)]
I don't know if that's fast enough for you, but there are a lot of calculations involved. It's certainly cleaner, and it removes at least some redundancy (the original calculates np.cross(a, r) three times per point). It also gives the result in a slightly different format, but this is the natural way to store the result and is hopefully fine for your purposes.
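As a footnote to the meshgrid-based answer above, a perhaps more readable variant of the same idea (my phrasing, not from the original answer) builds an (nx, ny, nz, 3) grid of points with 'ij' indexing and lets np.cross broadcast over the last axis, which should also match the x-then-y-then-z ordering of the original loops:
import numpy as np

x = np.arange(-4, 4, 0.1)
y = np.arange(-4, 4, 0.1)
z = np.arange(-4, 4, 0.1)
a = [1, 0, 0]

pts = np.stack(np.meshgrid(x, y, z, indexing='ij'), axis=-1)   # (nx, ny, nz, 3)
out = np.cross(a, pts)                                         # (nx, ny, nz, 3)
out_flat = out.reshape(-1, 3).T                                # (3, nx*ny*nz)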

NumPy - vectorizing matrix-matrix column correlation coefficient [duplicate]

I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).
What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.
I'm expecting my output to be an array with the shape N X M.
N.B. When I say "correlation coefficient," I mean the Pearson product-moment correlation coefficient.
Here are some things to note:
The numpy function correlate requires input arrays to be one-dimensional.
The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
The scipy.stats function pearsonr requires input arrays to be one-dimensional.
Correlation (default 'valid' case) between two 2D arrays:
You can simply use matrix-multiplication np.dot like so -
out = np.dot(arr_one,arr_two.T)
With the default "valid" mode, the correlation between each pairwise row combination (row1, row2) of the two input arrays corresponds to the multiplication result at the (row1, row2) position.
Row-wise Correlation Coefficient calculation for two 2D arrays:
def corr2_coeff(A, B):
    # Row-wise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]

    # Sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)

    # Finally get corr coeff
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
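A quick consistency check (not part of the original answer): np.corrcoef on the stacked inputs holds the same cross-correlations in its top-right block, so corr2_coeff can be validated against it on a small example:
import numpy as np

A = np.random.rand(4, 7)
B = np.random.rand(5, 7)
ref = np.corrcoef(A, B)[:4, 4:]              # rows of A vs rows of B
print(np.allclose(corr2_coeff(A, B), ref))   # True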
Benchmarking
This section compares the runtime performance of the proposed approach against the generate_correlation_map & loopy pearsonr based approaches listed in the other answer (taken from the function test_generate_correlation_map(), without the value-correctness verification code at the end of it). Please note that the timings for the proposed approach also include a check at the start for an equal number of columns in the two input arrays, as is also done in that other answer. The runtimes are listed next.
Case #1:
In [106]: A = np.random.rand(1000, 100)
In [107]: B = np.random.rand(1000, 100)
In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop
In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop
Case #2:
In [110]: A = np.random.rand(5000, 100)
In [111]: B = np.random.rand(5000, 100)
In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop
In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop
Case #3:
In [114]: A = np.random.rand(10000, 10)
In [115]: B = np.random.rand(10000, 10)
In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop
In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop
The other, loopy pearsonr based approach seemed too slow, but here are the runtimes for one small data size -
In [118]: A = np.random.rand(1000, 100)
In [119]: B = np.random.rand(1000, 100)
In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop
In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop
In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
@Divakar provides a great option for computing the unscaled correlation, which is what I originally asked for.
In order to calculate the correlation coefficient, a bit more is required:
import numpy as np
def generate_correlation_map(x, y):
    """Correlate each n with each m.

    Parameters
    ----------
    x : np.array
      Shape N X T.
    y : np.array
      Shape M X T.

    Returns
    -------
    np.array
      N X M array in which each element is a correlation coefficient.
    """
    mu_x = x.mean(1)
    mu_y = y.mean(1)
    n = x.shape[1]
    if n != y.shape[1]:
        raise ValueError('x and y must ' +
                         'have the same number of timepoints.')
    s_x = x.std(1, ddof=n - 1)
    s_y = y.std(1, ddof=n - 1)
    cov = np.dot(x,
                 y.T) - n * np.dot(mu_x[:, np.newaxis],
                                   mu_y[np.newaxis, :])
    return cov / np.dot(s_x[:, np.newaxis], s_y[np.newaxis, :])
Here's a test of this function, which passes:
from scipy.stats import pearsonr
def test_generate_correlation_map():
    x = np.random.rand(10, 10)
    y = np.random.rand(20, 10)
    desired = np.empty((10, 20))
    for n in range(x.shape[0]):
        for m in range(y.shape[0]):
            desired[n, m] = pearsonr(x[n, :], y[m, :])[0]
    actual = generate_correlation_map(x, y)
    np.testing.assert_array_almost_equal(actual, desired)
For those interested in computing the Pearson correlation coefficient between a 1D and 2D array, I wrote the following function, where x is a 1D array and y a 2D array.
def pearsonr_2D(x, y):
    """Compute the Pearson correlation coefficient
    where x is a 1D and y a 2D array."""
    upper = np.sum((x - np.mean(x)) * (y - np.mean(y, axis=1)[:, None]), axis=1)
    lower = np.sqrt(np.sum(np.power(x - np.mean(x), 2)) *
                    np.sum(np.power(y - np.mean(y, axis=1)[:, None], 2), axis=1))
    rho = upper / lower
    return rho
Example run:
>>> x
Out[1]: array([1, 2, 3])
>>> y
Out[2]: array([[ 1,  2,  3],
               [ 6,  7, 12],
               [ 9,  3,  1]])
>>> pearsonr_2D(x, y)
Out[3]: array([ 1. , 0.93325653, -0.96076892])
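A quick consistency check (not part of the original answer) against scipy.stats.pearsonr applied row by row:
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3])
y = np.array([[1, 2, 3], [6, 7, 12], [9, 3, 1]])

ref = np.array([pearsonr(x, row)[0] for row in y])
print(np.allclose(pearsonr_2D(x, y), ref))   # True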
