Based on the comments, I have revised the example:
Consider the following code:
import numpy as np

def subspace_angle(A, B):
    M = A.T @ B
    s = np.linalg.svd(M, compute_uv=False)
    # largest singular value of A^T B = cosine of the smallest principal angle
    return s[0]

def principal_angles(bases):
    k = bases.shape[0]
    r = np.zeros((k, k))
    for i in range(k):
        x = bases[i]
        r[i, i] = subspace_angle(x, x)
        for j in range(i):
            y = bases[j]
            r[i, j] = subspace_angle(x, y)
            r[j, i] = r[i, j]
    r = np.minimum(1, r)
    return np.rad2deg(np.arccos(r))
The following is an example of its use:
bases = []
# number of subspaces
k = 5
# ambient dimension
n = 8
# subspace dimension
m = 4
for i in range(5):
    X = np.random.randn(n, m)
    Q, R = np.linalg.qr(X)
    bases.append(Q)
# combine the orthonormal bases for all the subspaces
bases = np.array(bases)
# Compute the smallest principal angles between each pair of subspaces.
print(np.round(principal_angles(bases), 2))
Is there a way to avoid the two-level for loops in the principal_angles function, so that the code could be sped up?
The resulting matrix r is symmetric. Since subspace_angle could be compute-heavy depending on the array size, it is important to avoid computing it twice, once for r[i,j] and once for r[j,i].
Regarding the comment about JIT: I am actually writing this code with Google JAX. The two-level loop does get JIT-compiled, giving performance benefits. However, the JIT compilation time is quite high (possibly due to the two levels of for loops). I am wondering if there is a better way to write this code so that it compiles faster.
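For context, here is a minimal sketch (my own illustration, untested against your setup) of how the pairwise computation could be expressed in JAX without Python-level nested loops, using jnp.triu_indices and jax.vmap so that only one batched call gets traced:
import jax
import jax.numpy as jnp

def subspace_angle(A, B):
    # cosine of the smallest principal angle = largest singular value of A^T B
    s = jnp.linalg.svd(A.T @ B, compute_uv=False)
    return s[0]

@jax.jit
def principal_angles_pairs(bases):
    k = bases.shape[0]
    i, j = jnp.triu_indices(k)                        # upper triangle, diagonal included
    s = jax.vmap(subspace_angle)(bases[i], bases[j])  # one batched call, no Python loops
    r = jnp.zeros((k, k)).at[i, j].set(s)
    r = r + jnp.triu(r, k=1).T                        # mirror the strict upper triangle
    return jnp.rad2deg(jnp.arccos(jnp.minimum(1.0, r)))
Since the loop disappears from the traced program, the compiled graph stays small regardless of k.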
I started by copying your code into an ipython session, getting a (5,8,4) shaped bases. But then I realized that func was undefined, so after commenting that out I get:
In [6]: def principal_angles(bases):
   ...:     k = bases.shape[0]
   ...:     r = np.zeros((k, k))
   ...:     for i in range(k):
   ...:         x = bases[i]
   ...:         # r[i, i] = func(x, x)
   ...:         for j in range(i):
   ...:             y = bases[j]
   ...:             r[i, j] = subspace_angle(x, y)
   ...:             #r[j, i] = r[i, j]
   ...:     return r
   ...:     #r = np.minimum(1, r)
   ...:     #return np.rad2deg(np.arccos(r))
   ...:
In [7]: r=principal_angles(bases)
In [8]: r.shape
Out[8]: (5, 5)
Since both matmul and svd can work with higher dimensions, i.e. batches, I wonder if it's possible to call subspace_angle with all bases, rather than iteratively.
We have to think carefully about what shapes we pass it, and how they evolve.
def subspace_angle(A, B):
    M = A.T @ B
    s = np.linalg.svd(M, compute_uv=False)
    return s[0]
(Oops, my os just crashed the terminal, so I'll have to get back to this later.)
So A and B are (8,4), A.T is (4,8), and A.T @ B is (4,4).
If they were (5,8,4), A.transpose(0,2,1) would be (5,4,8), and M would be (5,4,4).
I believe np.linalg.svd accepts that M, returning a (5,4) array of singular values.
In [29]: r=principal_angles(bases)
In [30]: r
Out[30]:
array([[0. , 0. , 0. , 0. , 0. ],
[0.99902153, 0. , 0. , 0. , 0. ],
[0.99734371, 0.95318936, 0. , 0. , 0. ],
[0.99894054, 0.99790422, 0.87577343, 0. , 0. ],
[0.99840093, 0.92809283, 0.99896121, 0.98286429, 0. ]])
Let's try that with the whole bases. Use broadcasting to get the 'outer' product on the first dimension:
In [31]: M = bases[:,None,:,:].transpose(0,1,3,2) @ bases
In [32]: r1=np.linalg.svd(M, compute_uv=False)
In [33]: M.shape
Out[33]: (5, 5, 4, 4)
In [34]: r1.shape
Out[34]: (5, 5, 4)
To match your s[0] I have to use (need to review the svd docs):
In [35]: r1[:,:,0]
Out[35]:
array([[1. , 0.99902153, 0.99734371, 0.99894054, 0.99840093],
[0.99902153, 1. , 0.95318936, 0.99790422, 0.92809283],
[0.99734371, 0.95318936, 1. , 0.87577343, 0.99896121],
[0.99894054, 0.99790422, 0.87577343, 1. , 0.98286429],
[0.99840093, 0.92809283, 0.99896121, 0.98286429, 1. ]])
Time savings aren't massive, but they may be better if the first dimension is larger than 5:
In [36]: timeit r=principal_angles(bases)
320 µs ± 554 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [37]: %%timeit
...: M = bases[:,None,:,:].transpose(0,1,3,2) @ bases
...: r1=np.linalg.svd(M, compute_uv=False)[:,:,0]
...:
...:
190 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This may be enough to get you started with a more refined "vectorization".
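To make that concrete, here is a small wrapper (my own sketch built on the timings above, not part of the original answer) that turns the broadcasted computation back into the degrees output of principal_angles:
def principal_angles_batched(bases):
    # (k,1,m,n) @ (k,n,m) broadcasts to (k,k,m,m): all pairwise A^T B products at once
    M = bases[:, None, :, :].transpose(0, 1, 3, 2) @ bases
    s = np.linalg.svd(M, compute_uv=False)[:, :, 0]   # largest singular value per pair
    return np.rad2deg(np.arccos(np.minimum(1, s)))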
After some more thinking and experimenting with the np.triu_indices function, I have come up with the following solution, which avoids unnecessary extra computation.
def vectorized_principal_angles(subspaces):
    # number of subspaces
    n = subspaces.shape[0]
    # Indices for upper triangular matrix
    i, j = np.triu_indices(n, k=1)
    # prepare all the possible pairs of A and B
    A = subspaces[i]
    B = subspaces[j]
    # Compute the Hermitian transpose of each matrix in A array
    AH = np.conjugate(np.transpose(A, axes=(0, 2, 1)))
    # Compute M = A^H B for each matrix pair
    M = np.matmul(AH, B)
    # Compute the SVD for each matrix in M
    s = np.linalg.svd(M, compute_uv=False)
    # keep only the first singular value for each M
    s = s[:, 0]
    # prepare the result matrix
    # It is known in advance that diagonal elements will be 1.
    r = 0.5 * np.eye(n)
    r[i, j] = s
    # Symmetrize the matrix
    r = r + r.T
    # Final result
    return r
Here is what is going on:
np.triu_indices(n, k=1) gives me the indices for the n(n-1)/2 possible pairs of matrices.
All the remaining computation is limited to those n(n-1)/2 pairs only.
Finally, the array of scalar values is put back into a square, symmetric result matrix.
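A short usage example, mirroring the setup from the question (note that this function returns cosines, so the arccos/rad2deg step from the original principal_angles still applies if angles in degrees are wanted):
bases = np.array(bases)                       # shape (k, n, m), as built in the question
cosines = vectorized_principal_angles(bases)  # symmetric (k, k) matrix of cosines
print(np.round(np.rad2deg(np.arccos(np.minimum(1, cosines))), 2))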
Thank you @hpaulj for your solution. It helped me a lot in finding the right direction.
I have an array
import numpy as np
X = np.array([[0.7513, 0.6991, 0.5472, 0.2575],
[0.2551, 0.8909, 0.1386, 0.8407],
[0.5060, 0.9593, 0.1493, 0.2543],
[0.5060, 0.9593, 0.1493, 0.2543]])
y = np.array([[1,2,3,4]])
How can I replace the diagonal of X with y? We can write a loop, but is there any faster way?
A fast and reliable method is np.einsum:
>>> diag_view = np.einsum('ii->i', X)
This creates a view of the diagonal:
>>> diag_view
array([0.7513, 0.8909, 0.1493, 0.2543])
This view is writable:
>>> diag_view[None] = y
>>> X
array([[1. , 0.6991, 0.5472, 0.2575],
[0.2551, 2. , 0.1386, 0.8407],
[0.506 , 0.9593, 3. , 0.2543],
[0.506 , 0.9593, 0.1493, 4. ]])
This works for contiguous and non-contiguous arrays and is very fast:
contiguous:
loop 21.146424998732982
diag_indices 2.595232878000388
einsum 1.0271988900003635
flatten 1.5372659160002513
non contiguous:
loop 20.133818001340842
diag_indices 2.618005960001028
einsum 1.0305795049989683
Traceback (most recent call last): <- flatten does not work here
...
How does it work? Under the hood einsum does an advanced version of @Julien's trick: it adds the strides of arr:
>>> arr.strides
(3200, 16)
>>> np.einsum('ii->i', arr).strides
(3216,)
One can convince oneself that this will always work as long as arr is organized in strides, which is the case for numpy arrays.
While this use of einsum is pretty neat it is also almost impossible to find if one doesn't know. So spread the word!
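To see the same trick spelled out by hand, here is a small sketch of my own using numpy's as_strided (the exact stride numbers depend on dtype and shape) that builds an equivalent writable diagonal view directly:
import numpy as np
from numpy.lib.stride_tricks import as_strided

arr = np.zeros((100, 100))
# a single stride equal to row stride + column stride steps along the diagonal
diag = as_strided(arr, shape=(min(arr.shape),), strides=(sum(arr.strides),))
diag[...] = np.arange(100)   # writes straight onto the diagonal of arr
assert (np.diagonal(arr) == np.arange(100)).all()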
Code to recreate the timings and the crash:
import numpy as np
n = 100
arr = np.zeros((n, n))
replace = np.ones(n)
def loop():
    for i in range(len(arr)):
        arr[i,i] = replace[i]
def other():
    l = len(arr)
    arr.shape = -1
    arr[::l+1] = replace
    arr.shape = l,l
def di():
    arr[np.diag_indices(arr.shape[0])] = replace
def es():
    np.einsum('ii->i', arr)[...] = replace
from timeit import timeit
print('\ncontiguous:')
print('loop ', timeit(loop, number=1000)*1000)
print('diag_indices ', timeit(di))
print('einsum ', timeit(es))
print('flatten ', timeit(other))
arr = np.zeros((2*n, 2*n))[::2, ::2]
print('\nnon contiguous:')
print('loop ', timeit(loop, number=1000)*1000)
print('diag_indices ', timeit(di))
print('einsum ', timeit(es))
print('flatten ', timeit(other))
This should be pretty fast (especially for bigger arrays; for your example it's about twice as slow):
arr = np.zeros((4,4))
replace = [1,2,3,4]
l = len(arr)
arr.shape = -1
arr[::l+1] = replace
arr.shape = l,l
Test on bigger array:
n = 100
arr = np.zeros((n,n))
replace = np.ones(n)
def loop():
    for i in range(len(arr)):
        arr[i,i] = replace[i]
def other():
    l = len(arr)
    arr.shape = -1
    arr[::l+1] = replace
    arr.shape = l,l
%timeit(loop())
%timeit(other())
14.7 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.55 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Use diag_indices for a vectorized solution:
X[np.diag_indices(X.shape[0])] = y
array([[1. , 0.6991, 0.5472, 0.2575],
[0.2551, 2. , 0.1386, 0.8407],
[0.506 , 0.9593, 3. , 0.2543],
[0.506 , 0.9593, 0.1493, 4. ]])
Earlier I asked a similar question where the answer used np.dot, taking advantage of the fact that a dot product involves a sum of products. (To my understanding.)
Now I have a similar issue where I don't think dot will apply, because in place of a sum I want to take an element-wise diagonal. If it does, I haven't been able to apply it correctly.
Given a matrix x and array err:
x = np.matrix([[ 0.02984406, -0.00257266],
[-0.00257266, 0.00320312]])
err = np.array([ 7.6363226 , 13.16548267])
My current implementation with a loop is:
res = np.array([np.sqrt(np.diagonal(x * err[i])) for i in range(err.shape[0])])
print(res)
[[ 0.47738755 0.15639712]
[ 0.62682649 0.20535487]]
which takes the diagonal of x.dot(i) for each i in err. Could this be vectorized? In other words, can the output of x * err be 3-dimensional, with np.diagonal then yielding a 2d array, with one element for each diagonal?
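For what it's worth, here is a minimal sketch of my own (not from the original post) confirming that the product can indeed be made 3-dimensional and then reduced with np.diagonal's axis1/axis2 arguments:
import numpy as np

x = np.asarray([[ 0.02984406, -0.00257266],
                [-0.00257266,  0.00320312]])
err = np.array([7.6363226, 13.16548267])

stack = x * err[:, None, None]                        # shape (2, 2, 2): x scaled by each err[i]
res = np.sqrt(np.diagonal(stack, axis1=1, axis2=2))   # shape (2, 2), one row per err[i]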
Program:
import numpy as np
x = np.matrix([[ 0.02984406, -0.00257266],
[-0.00257266, 0.00320312]])
err = np.array([ 7.6363226 , 13.16548267])
diag = np.diagonal(np.asarray(x))          # np.asarray avoids np.matrix's "*" meaning matmul
ans = np.sqrt(diag * err[:, np.newaxis])   # sqrt of outer product
print(ans)
# use the out keyword to avoid making a new numpy array many times
ans = np.empty(x.shape, dtype=x.dtype)
for i in range(100):
    ans = np.multiply(diag, err[:, np.newaxis], out=ans)
    ans = np.sqrt(ans, out=ans)
Result:
[[ 0.47738755 0.15639712]
[ 0.62682649 0.20535487]]
Here's an approach that makes use of a diagonal view into x via ndarray.flat and then uses broadcasting for the element-wise multiplication, like so -
np.sqrt(x.flat[::x.shape[1]+1].A1 * err[:,None])
Sample run -
In [108]: x = np.matrix([[ 0.02984406, -0.00257266],
...: [-0.00257266, 0.00320312]])
...:
...: err = np.array([ 7.6363226 , 13.16548267])
...:
In [109]: np.sqrt(x.flat[::x.shape[1]+1].A1 * err[:,None])
Out[109]:
array([[ 0.47738755, 0.15639712],
[ 0.62682649, 0.20535487]])
Runtime test to see how a view helps over np.diagonal that creates a copy -
In [104]: x = np.matrix(np.random.rand(5000,5000))
In [105]: err = np.random.rand(5000)
In [106]: %timeit np.diagonal(x)*err[:,np.newaxis]
10 loops, best of 3: 66.8 ms per loop
In [107]: %timeit x.flat[::x.shape[1]+1].A1 * err[:,None]
10 loops, best of 3: 37.7 ms per loop
I have two matrices:
import numpy as np
def create(n):
    M = np.array([[ 0.33840224,  0.25420152,  0.40739624],
                  [ 0.35087337,  0.40939274,  0.23973389],
                  [ 0.40168642,  0.29848413,  0.29982946],
                  [ 0.17442095,  0.50982272,  0.31575633]])
    return np.concatenate([M] * n)
A = create(1)
nof_type = A.shape[1]
I = np.eye(nof_type)
Matrix A has dimensions 4 x 3 and I is 3 x 3.
What I want to do is:
calculate a distance score for every row in A against every row in I;
for every row in A, report the row id of I and the maximum score.
So at the end of the day we have a 4 x 2 matrix.
How can I achieve that?
This is the function that computes the distance score between two numpy arrays.
def jsd(x, y):  # Jensen-Shannon divergence
    import warnings
    warnings.filterwarnings("ignore", category=RuntimeWarning)
    x = np.array(x)
    y = np.array(y)
    d1 = x*np.log2(2*x/(x+y))
    d2 = y*np.log2(2*y/(x+y))
    d1[np.isnan(d1)] = 0
    d2[np.isnan(d2)] = 0
    d = 0.5*np.sum(d1+d2)
    return d
In the actual case, A has around 40K rows, so we would really like it to be fast.
Using the loopy way:
def scoreit(A, I):
    aoa = []
    for i, x in enumerate(A):
        maxscore = -10000
        id = -1
        for j, y in enumerate(I):
            distance = jsd(x, y)
            #print "\t", i, j, distance
            if distance > maxscore:
                maxscore = distance
                id = j
        #print "MAX", maxscore, id
        aoa.append([maxscore, id])
    return aoa
It gives this result:
In [56]: scoreit(A,I)
Out[56]:
[[0.54393736529629078, 1],
[0.56083720679952753, 2],
[0.49502813447483673, 1],
[0.64408263453965031, 0]]
Current timing:
In [57]: %timeit scoreit(create(1000),I)
1 loops, best of 3: 3.31 s per loop
You can extend I's dimensions to a 3D array version at various places to bring powerful broadcasting into play. We keep A as it is, because it's a huge array and we don't want to incur a performance loss by moving its elements around. Also, you can avoid the costly affair of checking for NaNs and then summing by using a single np.nansum operation, which sums over the non-NaNs. Thus, the vectorized solution would look something like this -
def jsd_vectorized(A,I):
    # Perform "(x+y)" in a vectorized manner
    AI = A+I[:,None]
    # Calculate d1 and d2 using AI again in vectorized manner
    d1 = A*np.log2(2*A/AI)
    d2 = I[:,None,:]*np.log2((2*I[:,None,:])/AI)
    # Use np.nansum to ignore NaNs & sum along rows to get all distances
    dists = np.nansum(d1,2) + np.nansum(d2,2)
    # Pack the argmax IDs and the corresponding scores as final output
    ID = dists.argmax(0)
    return np.vstack((0.5*dists[ID,np.arange(dists.shape[1])],ID)).T
Sample run
Loopy function to run original function code -
def jsd_loopy(A,I):
    dists = np.empty((A.shape[0],I.shape[0]))
    for i, x in enumerate(A):
        for j, y in enumerate(I):
            dists[i,j] = jsd(x, y)
    ID = dists.argmax(1)
    return np.vstack((dists[np.arange(dists.shape[0]),ID],ID)).T
Run and verify -
In [511]: A = np.array([[ 0.33840224, 0.25420152, 0.40739624],
...: [ 0.35087337, 0.40939274, 0.23973389],
...: [ 0.40168642, 0.29848413, 0.29982946],
...: [ 0.17442095, 0.50982272, 0.31575633]])
...: nof_type = A.shape[1]
...: I = np.eye(nof_type)
...:
In [512]: jsd_loopy(A,I)
Out[512]:
array([[ 0.54393737, 1. ],
[ 0.56083721, 2. ],
[ 0.49502813, 1. ],
[ 0.64408263, 0. ]])
In [513]: jsd_vectorized(A,I)
Out[513]:
array([[ 0.54393737, 1. ],
[ 0.56083721, 2. ],
[ 0.49502813, 1. ],
[ 0.64408263, 0. ]])
Runtime tests
In [514]: A = np.random.rand(1000,3)
In [515]: nof_type = A.shape[1]
...: I = np.eye(nof_type)
...:
In [516]: %timeit jsd_loopy(A,I)
1 loops, best of 3: 782 ms per loop
In [517]: %timeit jsd_vectorized(A,I)
1000 loops, best of 3: 1.17 ms per loop
In [518]: np.allclose(jsd_loopy(A,I),jsd_vectorized(A,I))
Out[518]: True
I have an array of size MxN and I'd like to compute the entropy value of each row. What would be the fastest way to do so?
scipy.special.entr computes -x*log(x) for each element in an array. After calling that, you can sum the rows.
Here's an example. First, create an array p of positive values whose rows sum to 1:
In [23]: np.random.seed(123)
In [24]: x = np.random.rand(3, 10)
In [25]: p = x/x.sum(axis=1, keepdims=True)
In [26]: p
Out[26]:
array([[ 0.12798052, 0.05257987, 0.04168536, 0.1013075 , 0.13220688,
0.07774843, 0.18022149, 0.1258417 , 0.08837421, 0.07205402],
[ 0.08313743, 0.17661773, 0.1062474 , 0.01445742, 0.09642919,
0.17878489, 0.04420998, 0.0425045 , 0.12877228, 0.1288392 ],
[ 0.11793032, 0.15790292, 0.13467074, 0.11358463, 0.13429674,
0.06003561, 0.06725376, 0.0424324 , 0.05459921, 0.11729367]])
In [27]: p.shape
Out[27]: (3, 10)
In [28]: p.sum(axis=1)
Out[28]: array([ 1., 1., 1.])
Now compute the entropy of each row. entr uses the natural logarithm, so to get the base-2 log, divide the result by log(2).
In [29]: from scipy.special import entr
In [30]: entr(p).sum(axis=1)
Out[30]: array([ 2.22208731, 2.14586635, 2.22486581])
In [31]: entr(p).sum(axis=1)/np.log(2)
Out[31]: array([ 3.20579434, 3.09583074, 3.20980287])
If you don't want the dependency on scipy, you can use the explicit formula:
In [32]: (-p*np.log2(p)).sum(axis=1)
Out[32]: array([ 3.20579434, 3.09583074, 3.20980287])
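One caveat worth noting (my addition, not from the answer): the explicit formula yields nan wherever p is exactly zero, whereas entr returns 0 there by definition. A guarded version of the formula could look like this:
p_safe = np.where(p > 0, p, 1.0)             # log2(1) == 0, so zero-probability terms vanish
entropy_bits = -(p * np.log2(p_safe)).sum(axis=1)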
As @Warren pointed out, it's unclear from your question whether you are starting out from an array of probabilities, or from the raw samples themselves. In my answer I've assumed the latter, in which case the main bottleneck will be computing the bin counts over each row.
Assuming that each vector of samples is relatively long, the fastest way to do this will probably be to use np.bincount:
import numpy as np

def entropy(x):
    """
    x is assumed to be an (nsignals, nsamples) array containing integers between
    0 and n_unique_vals
    """
    x = np.atleast_2d(x)
    nrows, ncols = x.shape
    nbins = x.max() + 1
    # count the number of occurrences for each unique integer between 0 and x.max()
    # in each row of x
    counts = np.vstack([np.bincount(row, minlength=nbins) for row in x])
    # divide by number of columns to get the probability of each unique value
    p = counts / float(ncols)
    # compute Shannon entropy in bits
    return -np.sum(p * np.log2(p), axis=1)
Although Warren's method of computing the entropies from the probability values using entr is slightly faster than using the explicit formula, in practice this is likely to represent a tiny fraction of the total runtime compared to the time taken to compute the bin counts.
Test correctness for a single row:
vals = np.arange(3)
prob = np.array([0.1, 0.7, 0.2])
row = np.random.choice(vals, p=prob, size=1000000)
print("theoretical H(x): %.6f, empirical H(x): %.6f" %
(-np.sum(prob * np.log2(prob)), entropy(row)[0]))
# theoretical H(x): 1.156780, empirical H(x): 1.157532
Test speed:
In [1]: %%timeit x = np.random.choice(vals, p=prob, size=(1000, 10000))
....: entropy(x)
....:
10 loops, best of 3: 34.6 ms per loop
If your data don't consist of integer indices between 0 and the number of unique values, you can convert them into this format using np.unique:
y = np.random.choice([2.5, 3.14, 42], p=prob, size=(1000, 10000))
unq, x = np.unique(y, return_inverse=True)
x.shape = y.shape
I have a list pts containing N points (Python floats). I wish to construct a NumPy array of dimension N*N*N*3 such that the array is equivalent to:
for i in xrange(0, N):
    for j in xrange(0, N):
        for k in xrange(0, N):
            arr[i,j,k,0] = pts[i]
            arr[i,j,k,1] = pts[j]
            arr[i,j,k,2] = pts[k]
I am wondering how I can exploit the array broadcasting rules of NumPy and functions such as tile to simplify this.
I think that the following should work:
pts = np.array(pts) #Skip if pts is a numpy array already
lp = len(pts)
arr = np.zeros((lp,lp,lp,3))
arr[:,:,:,0] = pts[:,None,None] #None is the same as np.newaxis
arr[:,:,:,1] = pts[None,:,None]
arr[:,:,:,2] = pts[None,None,:]
A quick test:
import numpy as np
import timeit
def meth1(pts):
    pts = np.array(pts) #Skip if pts is a numpy array already
    lp = len(pts)
    arr = np.zeros((lp,lp,lp,3))
    arr[:,:,:,0] = pts[:,None,None] #None is the same as np.newaxis
    arr[:,:,:,1] = pts[None,:,None]
    arr[:,:,:,2] = pts[None,None,:]
    return arr
def meth2(pts):
    lp = len(pts)
    N = lp
    arr = np.zeros((lp,lp,lp,3))
    for i in xrange(0, N):
        for j in xrange(0, N):
            for k in xrange(0, N):
                arr[i,j,k,0] = pts[i]
                arr[i,j,k,1] = pts[j]
                arr[i,j,k,2] = pts[k]
    return arr
pts = range(10)
a1 = meth1(pts)
a2 = meth2(pts)
print np.all(a1 == a2)
NREPEAT = 10000
print timeit.timeit('meth1(pts)','from __main__ import meth1,pts',number=NREPEAT)
print timeit.timeit('meth2(pts)','from __main__ import meth2,pts',number=NREPEAT)
results in:
True
0.873255968094 #my way
11.4249279499 #original
So this new method is an order of magnitude faster as well.
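As an aside (my addition, not part of the answer above), on newer NumPy the same (lp, lp, lp, 3) array can also be built in one line with meshgrid and stack:
pts = np.linspace(0.0, 1.0, 10)
arr2 = np.stack(np.meshgrid(pts, pts, pts, indexing='ij'), axis=-1)
assert arr2.shape == (10, 10, 10, 3)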
import numpy as np
N = 10
pts = xrange(0,N)
l = [ [ [ [ pts[i],pts[j],pts[k] ] for k in xrange(0,N) ] for j in xrange(0,N) ] for i in xrange(0,N) ]
x = np.array(l, np.int32)
print x.shape # (10,10,10,3)
This can be done in two lines:
def meth3(pts):
    arrs = np.broadcast_arrays(*np.ix_(pts, pts, pts))
    return np.concatenate([a[...,None] for a in arrs], axis=3)
However, this method is not as fast as mgilson's answer, because concatenate is annoyingly slow. A generalized version of his answer performs roughly as well, though, and can generate the result you want (i.e. an n-dimensional cartesian product contained within an n-dimensional grid) for any set of arrays.
def meth4(arrs): # or meth4(*arrs) for a simplified interface
    arr = np.empty([len(a) for a in arrs] + [len(arrs)])
    for i, a in enumerate(np.ix_(*arrs)):
        arr[...,i] = a
    return arr
This accepts any sequence of sequences, as long as it can be converted into a sequence of numpy arrays:
>>> meth4([[0, 1], [2, 3]])
array([[[ 0., 2.],
[ 0., 3.]],
[[ 1., 2.],
[ 1., 3.]]])
And the cost of this generality isn't too high -- it's only twice as slow for small pts arrays:
>>> (meth4([pts, pts, pts]) == meth1(pts)).all()
True
>>> %timeit meth4([pts, pts, pts])
10000 loops, best of 3: 27.4 us per loop
>>> %timeit meth1(pts)
100000 loops, best of 3: 13.1 us per loop
And it's actually a bit faster for larger ones (although the speed gain is probably due to my use of empty instead of zeros):
>>> pts = np.linspace(0, 1, 100)
>>> %timeit meth4([pts, pts, pts])
100 loops, best of 3: 13.4 ms per loop
>>> %timeit meth1(pts)
100 loops, best of 3: 16.7 ms per loop