Replace looping-over-axes with broadcasting, pt 2 - python

Earlier I asked a similar question where the answer used np.dot, taking advantage of the fact that a dot product involves a sum of products. (To my understanding.)
Now I have a similar issue where I don't think dot will apply, because in place of a sum I want to take an element-wise diagonal. If it does, I haven't been able to apply it correctly.
Given a matrix x and array err:
x = np.matrix([[ 0.02984406, -0.00257266],
[-0.00257266, 0.00320312]])
err = np.array([ 7.6363226 , 13.16548267])
My current implementation with loop is:
res = np.array([np.sqrt(np.diagonal(x * err[i])) for i in range(err.shape[0])])
print(res)
[[ 0.47738755 0.15639712]
[ 0.62682649 0.20535487]]
which takes the square root of the diagonal of x * err[i] for each entry err[i] of err. Could this be vectorized? In other words, can the output of x * err be 3-dimensional, with np.diagonal then yielding a 2d array, with one row for each diagonal?

Program:
import numpy as np
x = np.matrix([[ 0.02984406, -0.00257266],
[-0.00257266, 0.00320312]])
err = np.array([ 7.6363226 , 13.16548267])
diag = np.diagonal(x)
ans = np.sqrt(diag*err[:,np.newaxis]) # sqrt of outer product
print(ans)
# Use the out keyword to avoid allocating a new array on every iteration.
ans = np.empty((err.shape[0], diag.shape[0]), dtype=diag.dtype)
for i in range(100):
    np.multiply(diag, err[:, np.newaxis], out=ans)
    np.sqrt(ans, out=ans)
Result:
[[ 0.47738755 0.15639712]
[ 0.62682649 0.20535487]]
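As a quick sanity check, the loop from the question and the broadcast expression can be compared directly. This is only a sketch: it assumes x can be a plain ndarray rather than np.matrix, and the names looped and broadcast are made up for illustration.

```python
import numpy as np

# Same values as in the question, stored as a plain ndarray
# (assumption: np.matrix is not needed for this step).
x = np.array([[0.02984406, -0.00257266],
              [-0.00257266, 0.00320312]])
err = np.array([7.6363226, 13.16548267])

# Loop version from the question: one result row per entry of err.
looped = np.array([np.sqrt(np.diagonal(x * e)) for e in err])

# Broadcast version: (2, 1) * (2,) -> (2, 2); no 3-D intermediate is built.
broadcast = np.sqrt(err[:, np.newaxis] * np.diagonal(x))

print(np.allclose(looped, broadcast))
```

Scaling a matrix by a scalar scales its diagonal by the same scalar, so only the diagonal of x ever needs to be multiplied.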

Here's an approach making use of diagonal-view with ndarray.flat into x and then use broadcasting for element-wise multiplication, like so -
np.sqrt(x.flat[::x.shape[1]+1].A1 * err[:,None])
Sample run -
In [108]: x = np.matrix([[ 0.02984406, -0.00257266],
...: [-0.00257266, 0.00320312]])
...:
...: err = np.array([ 7.6363226 , 13.16548267])
...:
In [109]: np.sqrt(x.flat[::x.shape[1]+1].A1 * err[:,None])
Out[109]:
array([[ 0.47738755, 0.15639712],
[ 0.62682649, 0.20535487]])
Runtime test to see how a view helps over np.diagonal that creates a copy -
In [104]: x = np.matrix(np.random.rand(5000,5000))
In [105]: err = np.random.rand(5000)
In [106]: %timeit np.diagonal(x)*err[:,np.newaxis]
10 loops, best of 3: 66.8 ms per loop
In [107]: %timeit x.flat[::x.shape[1]+1].A1 * err[:,None]
10 loops, best of 3: 37.7 ms per loop

Related

How to vectorize a 2 level loop in NumPy

Based on the comments, I have revised the example:
Consider the following code
import numpy as np
def subspace_angle(A, B):
    M = A.T @ B
    s = np.linalg.svd(M, compute_uv=False)
    return s[0]
def principal_angles(bases):
    k = bases.shape[0]
    r = np.zeros((k, k))
    for i in range(k):
        x = bases[i]
        r[i, i] = subspace_angle(x, x)
        for j in range(i):
            y = bases[j]
            r[i, j] = subspace_angle(x, y)
            r[j, i] = r[i, j]
    r = np.minimum(1, r)
    return np.rad2deg(np.arccos(r))
Following is an example use:
bases = []
# number of subspaces
k = 5
# ambient dimension
n = 8
# subspace dimension
m = 4
for i in range(k):
    X = np.random.randn(n, m)
    Q,R = np.linalg.qr(X)
    bases.append(Q)
# combine the orthonormal bases for all the subspaces
bases = np.array(bases)
# Compute the smallest principal angles between each pair of subspaces.
print(np.round(principal_angles(bases), 2))
Is there a way to avoid the two-level for loops in the principal_angles function, so that the code could be sped up?
As a result of this code, the matrix r is symmetric. Since subspace_angle could be compute-heavy depending on the array size, it is important to avoid computing it twice for r[i,j] and r[j,i].
On the comment about JIT, actually, I am writing the code with Google/JAX. The two-level loop does get JIT compiled giving performance benefits. However, the JIT compilation time is pretty high (possibly due to two levels of for-loop). I am wondering if there is a better way to write this code so that it may compile faster.
I started to copy your code to ipython session, getting a (5,8,4) shaped bases. But then realized that func is undefined. So by commenting that out, I get:
In [6]: def principal_angles(bases):
...:     k = bases.shape[0]
...:     r = np.zeros((k, k))
...:     for i in range(k):
...:         x = bases[i]
...:         # r[i, i] = func(x, x)
...:         for j in range(i):
...:             y = bases[j]
...:             r[i, j] = subspace_angle(x, y)
...:             #r[j, i] = r[i, j]
...:     return r
...:     #r = np.minimum(1, r)
...:     #return np.rad2deg(np.arccos(r))
...:
In [7]: r=principal_angles(bases)
In [8]: r.shape
Out[8]: (5, 5)
Since both matmul and svd can work with higher dimensions, i.e. batches, I wonder if it's possible to call subspace_angle with all bases, rather than iteratively.
We have to think carefully about what shapes we pass it, and how they evolve.
def subspace_angle(A, B):
    M = A.T @ B
    s = np.linalg.svd(M, compute_uv=False)
    return s[0]
(Oops, my OS just crashed the terminal, so I'll have to get back to this later.)
So A and B are (8,4), A.T is (4,8), and A.T@B is (4,4)
If they were (5,8,4), A.transpose(0,2,1) would be (5,4,8), and M would be (5,4,4).
I believe np.linalg.svd accepts that M; with compute_uv=False it returns the singular values with shape (5,4).
In [29]: r=principal_angles(bases)
In [30]: r
Out[30]:
array([[0. , 0. , 0. , 0. , 0. ],
[0.99902153, 0. , 0. , 0. , 0. ],
[0.99734371, 0.95318936, 0. , 0. , 0. ],
[0.99894054, 0.99790422, 0.87577343, 0. , 0. ],
[0.99840093, 0.92809283, 0.99896121, 0.98286429, 0. ]])
Let's try that with the whole bases. Use broadcasting to get the 'outer' product on the first dimension:
In [31]: M=bases[:,None,:,:].transpose(0,1,3,2)@bases
In [32]: r1=np.linalg.svd(M, compute_uv=False)
In [33]: M.shape
Out[33]: (5, 5, 4, 4)
In [34]: r1.shape
Out[34]: (5, 5, 4)
To match your s[0] I have to use (need to review the svd docs):
In [35]: r1[:,:,0]
Out[35]:
array([[1. , 0.99902153, 0.99734371, 0.99894054, 0.99840093],
[0.99902153, 1. , 0.95318936, 0.99790422, 0.92809283],
[0.99734371, 0.95318936, 1. , 0.87577343, 0.99896121],
[0.99894054, 0.99790422, 0.87577343, 1. , 0.98286429],
[0.99840093, 0.92809283, 0.99896121, 0.98286429, 1. ]])
The time savings aren't massive, but they may be larger if the first dimension is larger than 5:
In [36]: timeit r=principal_angles(bases)
320 µs ± 554 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [37]: %%timeit
...: M=bases[:,None,:,:].transpose(0,1,3,2)@bases
...: r1=np.linalg.svd(M, compute_uv=False)[:,:,0]
...:
...:
190 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This may be enough to get you started with a more refined "vectorization".
After some more thinking and experimenting with np.triu_indices function, I have come up with the following solution which avoids extra unnecessary computation.
def vectorized_principal_angles(subspaces):
    # number of subspaces
    n = subspaces.shape[0]
    # Indices for upper triangular matrix
    i, j = np.triu_indices(n, k=1)
    # prepare all the possible pairs of A and B
    A = subspaces[i]
    B = subspaces[j]
    # Compute the Hermitian transpose of each matrix in A array
    AH = np.conjugate(np.transpose(A, axes=(0,2,1)))
    # Compute M = A^H B for each matrix pair
    M = np.matmul(AH, B)
    # Compute the SVD for each matrix in M
    s = np.linalg.svd(M, compute_uv=False)
    # keep only the first singular value for each M
    s = s[:, 0]
    # prepare the result matrix
    # It is known in advance that diagonal elements will be 1.
    r = 0.5 * np.eye(n)
    r[i, j] = s
    # Symmetrize the matrix
    r = r + r.T
    # Final result
    return r
Here is what is going on:
np.triu_indices(n, k=1) gives the index pairs (i, j) with i < j, i.e. the n(n-1)/2 distinct pairs of matrices.
All the remaining computation is limited to those n(n-1)/2 pairs.
Finally, the array of singular values is put back into a square symmetric result matrix.
Thank you @hpaulj for your solution. It helped me a lot in getting the right direction.
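To see that restricting the work to the upper-triangular pairs loses nothing, the two approaches can be compared on random data. The following is a sketch only; the seed, the use of np.random.default_rng, and the names full and tri are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
k, n, m = 5, 8, 4

# Orthonormal bases for k random m-dimensional subspaces of R^n: shape (k, n, m).
bases = np.stack([np.linalg.qr(rng.standard_normal((n, m)))[0] for _ in range(k)])

# All k*k pairs via broadcasting: (k, 1, m, n) @ (k, n, m) -> (k, k, m, m).
M_full = bases[:, None].transpose(0, 1, 3, 2) @ bases
full = np.linalg.svd(M_full, compute_uv=False)[..., 0]

# Only the k(k-1)/2 pairs above the diagonal: (pairs, m, n) @ (pairs, n, m).
i, j = np.triu_indices(k, k=1)
M_pairs = bases[i].transpose(0, 2, 1) @ bases[j]
s = np.linalg.svd(M_pairs, compute_uv=False)[:, 0]

# Scatter the pairwise values into a symmetric matrix with a unit diagonal.
tri = np.eye(k)
tri[i, j] = s
tri[j, i] = s

print(np.allclose(full, tri))
```

The full version runs the SVD on k*k small matrices, the triangular one on k(k-1)/2, so the second does roughly half the work for the same result.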

How to vectorize multiple matrix multiplications in numpy?

For a conceptual idea of what I mean, I have 2 data points:
x_0 = np.array([0.6, 1.4])[:, None]
x_1 = np.array([2.6, 3.4])[:, None]
And a 2x2 matrix:
y = np.array([[2, 2], [2, 2]])
If I perform x_0.T @ y @ x_0, I get array([[ 8.]]). Similarly, x_1.T @ y @ x_1 returns array([[ 72.]]).
But is there a way to perform both of these calculations in one go, without a for loop? Obviously the speed-up here is negligible, but I am working with many more data points than presented here.
With x as the column stacked version of x_0, x_1 and so on, we can use np.einsum -
np.einsum('ji,jk,ki->i',x,y,x)
With a mix of np.einsum and matrix multiplication -
np.einsum('ij,ji->i',x.T.dot(y),x)
As stated earlier, x was assumed to be column-stacked, like so :
x = np.column_stack((x_0, x_1))
Runtime test -
In [236]: x = np.random.randint(0,255,(3,100000))
In [237]: y = np.random.randint(0,255,(3,3))
# Proposed in @titipata's post/comments under this post
In [238]: %timeit (x.T.dot(y)*x.T).sum(1)
100 loops, best of 3: 3.45 ms per loop
# Proposed earlier in this post
In [239]: %timeit np.einsum('ji,jk,ki->i',x,y,x)
1000 loops, best of 3: 832 µs per loop
# Proposed earlier in this post
In [240]: %timeit np.einsum('ij,ji->i',x.T.dot(y),x)
100 loops, best of 3: 2.6 ms per loop
Basically, you want to do the operation (x.T).dot(y).dot(x) for every x that you have.
x_0 = np.array([0.6, 1.4])[:, None]
x_1 = np.array([2.6, 3.4])[:, None]
x = np.hstack((x_0, x_1)) # [[ 0.6 2.6], [ 1.4 3.4]]
The easy way to think about it is to do multiplication for all x_i that you have with y as
[x_i.dot(y).dot(x_i) for x_i in x.T]
>> [8.0, 72.0]
But of course this is not very efficient. Instead, you can take the dot product of x with y first, multiply the result element-wise by x, and sum over the columns, i.e. do the second dot product by hand. This makes the calculation much faster:
x = x.T
(x.dot(y) * x).sum(axis=1)
>> array([ 8., 72.])
Note that I transpose x first, so that each row of x is one data point and gets multiplied against the columns of y.
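The three formulations from this thread can be checked against each other on the two data points from the question. This is a minimal sketch; the names looped, via_einsum and via_sum are made up for illustration.

```python
import numpy as np

x_0 = np.array([0.6, 1.4])[:, None]
x_1 = np.array([2.6, 3.4])[:, None]
y = np.array([[2, 2], [2, 2]])

# One data point per column: shape (2, 2).
x = np.column_stack((x_0, x_1))

# Reference: evaluate the quadratic form one point at a time.
looped = np.array([(c.T @ y @ c).item() for c in (x_0, x_1)])

# einsum: sum over j and k of x[j, i] * y[j, k] * x[k, i] for each column i.
via_einsum = np.einsum('ji,jk,ki->i', x, y, x)

# Manual version: (x.T @ y) has one row per point; multiply by x.T and sum each row.
via_sum = ((x.T @ y) * x.T).sum(axis=1)

print(looped, via_einsum, via_sum)
```

All three should agree up to floating-point rounding, since each computes x_i^T y x_i for every column x_i.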

Multiply array of vectors with array of matrices; return array of vectors?

I've got a numpy array of row vectors of shape (n,3) and another numpy array of matrices of shape (n,3,3). I would like to multiply each of the n vectors with the corresponding matrix and return an array of shape (n,3) of the resulting vectors.
By now I've been using a for loop to iterate through the n vectors/matrices and do the multiplication item by item.
I would like to know if there's a more numpy-ish way of doing this. A way without the for loop that might even be faster.
//edit 1:
As requested, here's my loopy code (with n = 10):
arr_in = np.random.randn(10, 3)
matrices = np.random.randn(10, 3, 3)
arr_out = np.empty_like(arr_in)
for i in range(arr_in.shape[0]): # 10 iterations
    arr_out[i] = np.asarray(np.dot(arr_in[i], matrices[i]))
That dot-product is essentially performing reduction along axis=1 of the two input arrays. The dimensions could be represented like so -
arr_in : n 3
matrices : n 3 3
So, one way to solve it would be to "push" the dimensions of arr_in to front by one axis/dimension, thus creating a singleton dimension at axis=2 in a 3D array version of it. Then, sum-reducing the elements along axis = 1 would give us the desired output. Let's show it -
arr_in : n [3] 1
matrices : n [3] 3
Now, this could be achieved through two ways.
1) With np.einsum -
np.einsum('ij,ijk->ik',arr_in,matrices)
2) With NumPy broadcasting -
(arr_in[...,None]*matrices).sum(1)
Runtime test and verify output (for einsum version) -
In [329]: def loop_based(arr_in,matrices):
...:     arr_out = np.zeros((arr_in.shape[0], 3))
...:     for i in range(arr_in.shape[0]):
...:         arr_out[i] = np.dot(arr_in[i], matrices[i])
...:     return arr_out
...:
...: def einsum_based(arr_in,matrices):
...:     return np.einsum('ij,ijk->ik',arr_in,matrices)
...:
In [330]: # Inputs
...: N = 16935
...: arr_in = np.random.randn(N, 3)
...: matrices = np.random.randn(N, 3, 3)
...:
In [331]: np.allclose(einsum_based(arr_in,matrices),loop_based(arr_in,matrices))
Out[331]: True
In [332]: %timeit loop_based(arr_in,matrices)
10 loops, best of 3: 49.1 ms per loop
In [333]: %timeit einsum_based(arr_in,matrices)
1000 loops, best of 3: 714 µs per loop
You could use np.einsum. To get v.dot(M) for each vector-matrix pair, use np.einsum("...i,...ij", arr_in, matrices). To get M.dot(v) use np.einsum("...ij,...i", matrices, arr_in)
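Besides einsum and explicit broadcasting, np.matmul (the @ operator) also treats leading axes as a batch, so it can do the same job. The sketch below compares all of these on small random inputs; the seed and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
arr_in = rng.standard_normal((10, 3))
matrices = rng.standard_normal((10, 3, 3))

# Reference: one vector-matrix product per index.
looped = np.stack([v @ M for v, M in zip(arr_in, matrices)])

# einsum with explicit indices, and with an ellipsis for the batch axis.
via_einsum = np.einsum('ij,ijk->ik', arr_in, matrices)
via_ellipsis = np.einsum('...i,...ij', arr_in, matrices)

# Broadcasting: (n, 3, 1) * (n, 3, 3), then sum over axis 1.
via_broadcast = (arr_in[..., None] * matrices).sum(axis=1)

# Batched matmul: (n, 1, 3) @ (n, 3, 3) -> (n, 1, 3); drop the singleton axis.
via_matmul = (arr_in[:, None, :] @ matrices)[:, 0, :]

print(np.allclose(looped, via_matmul))
```

Which variant is fastest depends on the array sizes and the NumPy build, so it is worth timing them on the real data.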

Applying a function for all pairwise rows in two matrices under Numpy

I have two matrices:
import numpy as np
def create(n):
    M = np.array([[ 0.33840224, 0.25420152, 0.40739624],
                  [ 0.35087337, 0.40939274, 0.23973389],
                  [ 0.40168642, 0.29848413, 0.29982946],
                  [ 0.17442095, 0.50982272, 0.31575633]])
    return np.concatenate([M] * n)
A = create(1)
nof_type = A.shape[1]
I = np.eye(nof_type)
Matrix A dimension is 4 x 3 and I is 3 x 3.
What I want to do is:
calculate a distance score for every row in A against every row in I;
for every row in A, report the row id of I with the maximum score, together with that score.
So in the end we have a 4 x 2 matrix.
How can I achieve that?
This is the function that computes the distance score between two NumPy arrays:
def jsd(x,y): #Jensen-shannon divergence
    import warnings
    warnings.filterwarnings("ignore", category = RuntimeWarning)
    x = np.array(x)
    y = np.array(y)
    d1 = x*np.log2(2*x/(x+y))
    d2 = y*np.log2(2*y/(x+y))
    d1[np.isnan(d1)] = 0
    d2[np.isnan(d2)] = 0
    d = 0.5*np.sum(d1+d2)
    return d
In the actual case A has around 40K rows, so we would really like this to be fast.
Using loopy way:
def scoreit (A, I):
    aoa = []
    for i, x in enumerate(A):
        maxscore = -10000
        id = -1
        for j, y in enumerate(I):
            distance = jsd(x, y)
            #print "\t", i, j, distance
            if distance > maxscore:
                maxscore = distance
                id = j
        #print "MAX", maxscore, id
        aoa.append([maxscore,id])
    return aoa
It prints this result:
In [56]: scoreit(A,I)
Out[56]:
[[0.54393736529629078, 1],
[0.56083720679952753, 2],
[0.49502813447483673, 1],
[0.64408263453965031, 0]]
Current timing:
In [57]: %timeit scoreit(create(1000),I)
1 loops, best of 3: 3.31 s per loop
You can extend I's dimensions to a 3D array version at various places to bring in powerful broadcasting into play. We keep A as it is, because it's a huge array and we don't want to incur performance loss moving its elements around. Also, you can avoid that costly affair of checking for NaNs and summing with a single operation of np.nansum that does summing over non-NaNs. Thus, the vectorized solution would look something like this -
def jsd_vectorized(A,I):
    # Perform "(x+y)" in a vectorized manner
    AI = A+I[:,None]
    # Calculate d1 and d2 using AI again in vectorized manner
    d1 = A*np.log2(2*A/AI)
    d2 = I[:,None,:]*np.log2((2*I[:,None,:])/AI)
    # Use np.nansum to ignore NaNs & sum along rows to get all distances
    dists = np.nansum(d1,2) + np.nansum(d2,2)
    # Pack the argmax IDs and the corresponding scores as final output
    ID = dists.argmax(0)
    return np.vstack((0.5*dists[ID,np.arange(dists.shape[1])],ID)).T
Sample run
Loopy function to run original function code -
def jsd_loopy(A,I):
    dists = np.empty((A.shape[0],I.shape[0]))
    for i, x in enumerate(A):
        for j, y in enumerate(I):
            dists[i,j] = jsd(x, y)
    ID = dists.argmax(1)
    return np.vstack((dists[np.arange(dists.shape[0]),ID],ID)).T
Run and verify -
In [511]: A = np.array([[ 0.33840224, 0.25420152, 0.40739624],
...: [ 0.35087337, 0.40939274, 0.23973389],
...: [ 0.40168642, 0.29848413, 0.29982946],
...: [ 0.17442095, 0.50982272, 0.31575633]])
...: nof_type = A.shape[1]
...: I = np.eye(nof_type)
...:
In [512]: jsd_loopy(A,I)
Out[512]:
array([[ 0.54393737, 1. ],
[ 0.56083721, 2. ],
[ 0.49502813, 1. ],
[ 0.64408263, 0. ]])
In [513]: jsd_vectorized(A,I)
Out[513]:
array([[ 0.54393737, 1. ],
[ 0.56083721, 2. ],
[ 0.49502813, 1. ],
[ 0.64408263, 0. ]])
Runtime tests
In [514]: A = np.random.rand(1000,3)
In [515]: nof_type = A.shape[1]
...: I = np.eye(nof_type)
...:
In [516]: %timeit jsd_loopy(A,I)
1 loops, best of 3: 782 ms per loop
In [517]: %timeit jsd_vectorized(A,I)
1000 loops, best of 3: 1.17 ms per loop
In [518]: np.allclose(jsd_loopy(A,I),jsd_vectorized(A,I))
Out[518]: True
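The shapes involved in the broadcast are easy to lose track of, so here is a small walk-through on the 4 x 3 example from the question. It is a sketch rather than a replacement for the function above: the name I3 and the use of np.errstate to silence the expected log2(0) warnings are assumptions added for illustration.

```python
import numpy as np

A = np.array([[0.33840224, 0.25420152, 0.40739624],
              [0.35087337, 0.40939274, 0.23973389],
              [0.40168642, 0.29848413, 0.29982946],
              [0.17442095, 0.50982272, 0.31575633]])
I = np.eye(A.shape[1])

I3 = I[:, None, :]   # (3, 1, 3): a new middle axis to pair with the rows of A
AI = A + I3          # (3, 4, 3): every row of I added to every row of A

# log2(0) and 0 * -inf are expected where I is zero; silence those warnings.
with np.errstate(divide='ignore', invalid='ignore'):
    d1 = A * np.log2(2 * A / AI)     # finite everywhere, since A > 0 here
    d2 = I3 * np.log2(2 * I3 / AI)   # NaN wherever I is 0

# nansum treats the NaN terms as 0, matching the isnan masking in jsd().
dists = 0.5 * (np.nansum(d1, axis=2) + np.nansum(d2, axis=2))   # (3, 4)

best = dists.argmax(axis=0)   # best row of I for each row of A
scores = dists.max(axis=0)
print(best, scores)
```

Note that dists here is indexed as [row of I, row of A], which is why the reduction runs over axis 0.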

Constructing a 3D cube of points from a list

I have a list pts containing N points (Python floats). I wish to construct a NumPy array of dimension N*N*N*3 such that the array is equivalent to:
for i in xrange(0, N):
    for j in xrange(0, N):
        for k in xrange(0, N):
            arr[i,j,k,0] = pts[i]
            arr[i,j,k,1] = pts[j]
            arr[i,j,k,2] = pts[k]
I am wondering how I can exploit the array broadcasting rules of NumPy and functions such as tile to simplify this.
I think that the following should work:
pts = np.array(pts) #Skip if pts is a numpy array already
lp = len(pts)
arr = np.zeros((lp,lp,lp,3))
arr[:,:,:,0] = pts[:,None,None] #None is the same as np.newaxis
arr[:,:,:,1] = pts[None,:,None]
arr[:,:,:,2] = pts[None,None,:]
A quick test:
import numpy as np
import timeit
def meth1(pts):
    pts = np.array(pts) #Skip if pts is a numpy array already
    lp = len(pts)
    arr = np.zeros((lp,lp,lp,3))
    arr[:,:,:,0] = pts[:,None,None] #None is the same as np.newaxis
    arr[:,:,:,1] = pts[None,:,None]
    arr[:,:,:,2] = pts[None,None,:]
    return arr
def meth2(pts):
    lp = len(pts)
    N = lp
    arr = np.zeros((lp,lp,lp,3))
    for i in xrange(0, N):
        for j in xrange(0, N):
            for k in xrange(0, N):
                arr[i,j,k,0] = pts[i]
                arr[i,j,k,1] = pts[j]
                arr[i,j,k,2] = pts[k]
    return arr
pts = range(10)
a1 = meth1(pts)
a2 = meth2(pts)
print np.all(a1 == a2)
NREPEAT = 10000
print timeit.timeit('meth1(pts)','from __main__ import meth1,pts',number=NREPEAT)
print timeit.timeit('meth2(pts)','from __main__ import meth2,pts',number=NREPEAT)
results in:
True
0.873255968094 #my way
11.4249279499 #original
So this new method is an order of magnitude faster as well.
import numpy as np
N = 10
pts = xrange(0,N)
l = [ [ [ [ pts[i],pts[j],pts[k] ] for k in xrange(0,N) ] for j in xrange(0,N) ] for i in xrange(0,N) ]
x = np.array(l, np.int32)
print x.shape # (10,10,10,3)
This can be done in two lines:
def meth3(pts):
    arrs = np.broadcast_arrays(*np.ix_(pts, pts, pts))
    return np.concatenate([a[...,None] for a in arrs], axis=3)
However, this method is not as fast as mgilson's answer, because concatenate is annoyingly slow. A generalized version of his answer performs roughly as well, though, and can generate the result you want (i.e. an n-dimensional cartesian product contained within an n-dimensional grid) for any set of arrays.
def meth4(arrs): # or meth4(*arrs) for a simplified interface
    arr = np.empty([len(a) for a in arrs] + [len(arrs)])
    for i, a in enumerate(np.ix_(*arrs)):
        arr[...,i] = a
    return arr
This accepts any sequence of sequences, as long as it can be converted into a sequence of numpy arrays:
>>> meth4([[0, 1], [2, 3]])
array([[[ 0., 2.],
[ 0., 3.]],
[[ 1., 2.],
[ 1., 3.]]])
And the cost of this generality isn't too high -- it's only twice as slow for small pts arrays:
>>> (meth4([pts, pts, pts]) == meth1(pts)).all()
True
>>> %timeit meth4([pts, pts, pts])
10000 loops, best of 3: 27.4 us per loop
>>> %timeit meth1(pts)
100000 loops, best of 3: 13.1 us per loop
And it's actually a bit faster for larger ones (although the speed gain is probably due to my use of empty instead of zeros):
>>> pts = np.linspace(0, 1, 100)
>>> %timeit meth4([pts, pts, pts])
100 loops, best of 3: 13.4 ms per loop
>>> %timeit meth1(pts)
100 loops, best of 3: 16.7 ms per loop
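Another way to build the same cube, not proposed in the answers above, is np.meshgrid with indexing='ij' followed by np.stack. The sketch below checks it against the broadcast-assignment approach; the 4-point pts array and the names ref and grid are chosen only for illustration, and no claim is made about its relative speed.

```python
import numpy as np

pts = np.linspace(0.0, 1.0, 4)
n = len(pts)

# Reference: fill each coordinate slot by broadcast assignment.
ref = np.empty((n, n, n, 3))
ref[..., 0] = pts[:, None, None]
ref[..., 1] = pts[None, :, None]
ref[..., 2] = pts[None, None, :]

# indexing='ij' keeps the axis order (i, j, k); stack adds the trailing axis of size 3.
grid = np.stack(np.meshgrid(pts, pts, pts, indexing='ij'), axis=-1)

print(grid.shape, np.array_equal(grid, ref))
```

With the default indexing='xy', meshgrid swaps the first two axes, so the explicit 'ij' matters here.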
