Given a 2-d numpy array, X, of shape [m,m], I wish to apply a function and obtain a new 2-d numpy matrix P, also of shape [m,m], whose [i,j]th element is obtained as follows:
P[i][j] = exp(-||X[i] - X[j]||**2)
where ||.|| represents the standard L-2 norm of a vector. Is there any way faster than a simple nested for loop?
For example,
X = [[1,1,1],[2,3,4],[5,6,7]]
Then, at diagonal entries the rows accessed will be the same and the norm/magnitude of their difference will be 0. Hence,
P[0][0] = P[1][1] = P[2][2] = exp (0) = 1.0
Also,
P[0][1] = exp (- || X[0] - X[1] ||**2) = exp (- || [-1,-2,-3] || ** 2) = exp (-14)
etc.
The most trivial solution using a nested for loop is as follows:
import numpy as np

X = np.array([[1,2,3],[4,5,6],[7,8,9]])
P = np.zeros(shape=[len(X), len(X)])
for i in range(len(X)):
    for j in range(len(X)):
        P[i][j] = np.exp(-np.linalg.norm(X[i] - X[j])**2)
print(P)
This prints:
P = [[1.00000000e+00 1.87952882e-12 1.24794646e-47]
[1.87952882e-12 1.00000000e+00 1.87952882e-12]
[1.24794646e-47 1.87952882e-12 1.00000000e+00]]
Here, m is of the order of 5e4.
In [143]: X = np.array([[1,2,3],[4,5,6],[7,8,9]])
     ...: P = np.zeros(shape=[len(X), len(X)])
     ...: for i in range(len(X)):
     ...:     for j in range(len(X)):
     ...:         P[i][j] = np.exp(-np.linalg.norm(X[i] - X[j]))
     ...:
In [144]: P
Out[144]:
array([[1.00000000e+00, 5.53783071e-03, 3.06675690e-05],
[5.53783071e-03, 1.00000000e+00, 5.53783071e-03],
[3.06675690e-05, 5.53783071e-03, 1.00000000e+00]])
A no-loop version:
In [145]: np.exp(-np.sqrt(((X[:,None,:]-X[None,:,:])**2).sum(axis=2)))
Out[145]:
array([[1.00000000e+00, 5.53783071e-03, 3.06675690e-05],
[5.53783071e-03, 1.00000000e+00, 5.53783071e-03],
[3.06675690e-05, 5.53783071e-03, 1.00000000e+00]])
I had to drop your **2 to match values.
With the norm applied to the 3d difference array:
In [148]: np.exp(-np.linalg.norm(X[:,None,:]-X[None,:,:], axis=2))
Out[148]:
array([[1.00000000e+00, 5.53783071e-03, 3.06675690e-05],
[5.53783071e-03, 1.00000000e+00, 5.53783071e-03],
[3.06675690e-05, 5.53783071e-03, 1.00000000e+00]])
In scipy.spatial.distance there's a cdist that may handle this sort of thing faster.
As hpaulj mentioned, cdist does this better. Try the following:
from scipy.spatial.distance import cdist
import numpy as np
np.exp(-cdist(X,X,'sqeuclidean'))
Notice the sqeuclidean. This means that scipy does not take the square root so you don't have to square like you did above with the norm.
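As a quick sanity check (a sketch of my own, not part of the original answer), the cdist result matches the nested-loop P from the question on the same 3x3 example:

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
P_cdist = np.exp(-cdist(X, X, 'sqeuclidean'))

P_loop = np.zeros((len(X), len(X)))
for i in range(len(X)):
    for j in range(len(X)):
        P_loop[i, j] = np.exp(-np.linalg.norm(X[i] - X[j])**2)

print(np.allclose(P_cdist, P_loop))  # True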
This would be easier if you provided a sample array. You can create an array Q of size [m, m, m] where Q[i, j, k] = X[i, k] - X[j, k] by using
X[:, None, :] - X[None, :, :]
At this point, you're performing simple numpy operations against the third axis.
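A minimal sketch of finishing that computation along the third axis (my own illustration of the idea above, using the question's example array): square the differences, sum over the last axis, then negate and exponentiate.

import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Q = X[:, None, :] - X[None, :, :]   # shape (m, m, m), Q[i, j, k] = X[i, k] - X[j, k]
P = np.exp(-(Q**2).sum(axis=2))     # P[i, j] = exp(-||X[i] - X[j]||**2)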
Related
I am aware of the scipy.spatial.distance.pdist function and how to compute the mean from the resulting matrix/ndarray.
>>> x = np.random.rand(10000, 2)
>>> y = pdist(x, metric='euclidean')
>>> y.mean()
0.5214255824176626
In the example above y gets quite large (nearly 2,500 times as large as the input array):
>>> y.shape
(49995000,)
>>> from sys import getsizeof
>>> getsizeof(x)
160112
>>> getsizeof(y)
399960096
>>> getsizeof(y) / getsizeof(x)
2498.0019986009793
But since I am only interested in the mean pairwise distance, the distance matrix doesn't have to be kept in memory. Instead the mean of each row (or column) can be computed separately. The final mean value can then be computed from the row mean values.
Is there already a function which exploit this property or is there an easy way to extend/combine existing functions to do so?
If you use the square version of distance, it is equivalent to using the variance with n-1:
from scipy.spatial.distance import pdist
import numpy as np

x = np.random.rand(10000, 2)
print(pdist(x, 'sqeuclidean').mean())
print(np.var(x, 0, ddof=1).sum()*2)

# 0.331474285845873
# 0.33147428584587346
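To see why (a short note added for clarity, not part of the original answer): the sum of squared pairwise distances satisfies sum_{i<j} ||x_i - x_j||**2 = n * sum_i ||x_i - x_mean||**2, and there are n*(n-1)/2 pairs, so the mean squared pairwise distance equals 2 * sum_i ||x_i - x_mean||**2 / (n-1), which is exactly twice the per-dimension variances (computed with ddof=1) summed over dimensions.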
You will have to weight each row by the number of observations that make up the mean. For example the pdist of a 3 x 2 matrix is the flattened upper triangle (offset of 1) of the squareform 3 x 3 distance matrix.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import pairwise_distances

arr = np.arange(6).reshape(3, 2)
arr
array([[0, 1],
       [2, 3],
       [4, 5]])

pdist(arr)
array([2.82842712, 5.65685425, 2.82842712])

square = pairwise_distances(arr)
square
array([[0.        , 2.82842712, 5.65685425],
       [2.82842712, 0.        , 2.82842712],
       [5.65685425, 2.82842712, 0.        ]])

square[np.triu_indices(square.shape[0], 1)]
array([2.82842712, 5.65685425, 2.82842712])
There is the pairwise_distances_chunked function that can be used to iterate over the distance matrix in chunks of rows, but you will need to keep track of the row index to make sure you only take the mean of values in the upper/lower triangle of the matrix (the distance matrix is symmetric). This isn't complicated, but I imagine it will introduce a significant slowdown.
from sklearn.metrics import pairwise_distances_chunked

tot = (arr.shape[0]**2 - arr.shape[0]) / 2          # number of unique pairs
weighted_means = 0
r = 0                                                # index of the current row
for chunk in pairwise_distances_chunked(arr):
    for row in chunk:
        if r < arr.shape[0] - 1:
            sm = row[r + 1:].mean()                  # mean over this row's upper-triangle entries
            wgt = (row.shape[0] - r - 1) / tot       # weight by the number of pairs in the row
            weighted_means += sm * wgt
        r += 1
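As a quick sanity check (my own addition, assuming the loop above has been run on arr), the weighted row means should reproduce the plain pdist mean:

from scipy.spatial.distance import pdist
print(weighted_means, pdist(arr).mean())  # the two values should agree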
I have a set of data in Python with columns:
x y angle
I want to calculate the distance between every pair of points and plot those distances against the difference between the corresponding pairs of angles.
import numpy as np
import matplotlib.pyplot as plt

x, y, a = np.loadtxt('w51e2-pa-2pk.log', unpack=True)
n = 0
f = ((x[n] - x[n+1:])**2 + (y[n] - y[n+1:])**2)**0.5
d = a[n] - a[n+1:]
plt.scatter(f, d)
There are 255 points in my data.
f is the distance and d is the difference between two angles.
My question is: can I set n = 1, 2, 3, ..., 255 and repeat the calculation to get f and d for all possible pairs?
You can obtain the pairwise distances through broadcasting by considering it as an outer operation on the array of 2-dimensional vectors as follows:
vecs = np.stack((x, y)).T
np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
For example,
In [1]: import numpy as np
...: x = np.array([1, 2, 3])
...: y = np.array([3, 4, 6])
...: vecs = np.stack((x, y)).T
...: np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
...:
Out[1]:
array([[ 0. , 1.41421356, 3.60555128],
[ 1.41421356, 0. , 2.23606798],
[ 3.60555128, 2.23606798, 0. ]])
Here, the (i, j)'th entry is the distance between the i'th and j'th vectors.
The case of the pairwise differences between angles is similar, but simpler, as you only have one dimension to deal with:
In [2]: a = np.array([10, 12, 15])
...: a[np.newaxis, :] - a[: , np.newaxis]
...:
Out[2]:
array([[ 0, 2, 5],
[-2, 0, 3],
[-5, -3, 0]])
Moreover, plt.scatter does not care that the results are given as matrices, so putting everything together using the notation of the question, you can obtain the plot of angle differences against distances with something like
vecs = np.stack((x, y)).T
f = np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
d = a[np.newaxis, :] - a[:, np.newaxis]
plt.scatter(f, d)
You have to use a for loop and range() to iterate over n, e.g. like this:
n = len(x)
for i in range(n):
    # do something with the current index,
    # e.g. print the points
    print(x[i])
    print(y[i])
But note that if you use i+1 inside the last iteration, this will already be outside of your list.
Also, there are errors in your calculation: x[n] - x[n+1:] would not work with plain Python lists, because x[n] is a single value while x[n+1:] is a list starting at the (n+1)'th element, and you cannot subtract a list from a number. (NumPy arrays, as returned by np.loadtxt, do broadcast this subtraction, though.)
Maybe you will have to even use two nested loops to do what you want. I guess that you want to calculate the distance between each point so a two dimensional array may be the data structure you want.
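A minimal two-nested-loops sketch of that idea (my own illustration; the random arrays below are placeholders standing in for the x, y, a loaded in the question):

import numpy as np

x = np.random.rand(255)
y = np.random.rand(255)
a = np.random.rand(255) * 360

n = len(x)
f = np.zeros((n, n))   # pairwise distances
d = np.zeros((n, n))   # pairwise angle differences
for i in range(n):
    for j in range(n):
        f[i, j] = ((x[i] - x[j])**2 + (y[i] - y[j])**2)**0.5
        d[i, j] = a[i] - a[j]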
If you are interested in all combinations of the points in x and y I suggest using itertools, which will give you all possible combinations. Then you can do it as follows:
import itertools
f = [((x[i]-x[j])**2 + (y[i]-y[j])**2)**0.5
     for i, j in itertools.product(range(len(x)), repeat=2) if i != j]
# and similar for the angles
But maybe there is even an easier way...
Is there a method that I can call to create a random orthonormal matrix in python? Possibly using numpy? Or is there a way to create a orthonormal matrix using multiple numpy methods? Thanks.
Version 0.18 of scipy has scipy.stats.ortho_group and scipy.stats.special_ortho_group. The pull request where it was added is https://github.com/scipy/scipy/pull/5622
For example,
In [24]: from scipy.stats import ortho_group # Requires version 0.18 of scipy
In [25]: m = ortho_group.rvs(dim=3)
In [26]: m
Out[26]:
array([[-0.23939017, 0.58743526, -0.77305379],
[ 0.81921268, -0.30515101, -0.48556508],
[-0.52113619, -0.74953498, -0.40818426]])
In [27]: np.set_printoptions(suppress=True)
In [28]: m.dot(m.T)
Out[28]:
array([[ 1., 0., -0.],
[ 0., 1., 0.],
[-0., 0., 1.]])
You can obtain a random n x n orthogonal matrix Q, (uniformly distributed over the manifold of n x n orthogonal matrices) by performing a QR factorization of an n x n matrix with elements i.i.d. Gaussian random variables of mean 0 and variance 1. Here is an example:
import numpy as np
from scipy.linalg import qr
n = 3
H = np.random.randn(n, n)
Q, R = qr(H)
print (Q.dot(Q.T))
[[ 1.00000000e+00 -2.77555756e-17 2.49800181e-16]
[ -2.77555756e-17 1.00000000e+00 -1.38777878e-17]
[ 2.49800181e-16 -1.38777878e-17 1.00000000e+00]]
EDIT: (Revisiting this answer after the comment by @g g.) The claim above on the QR decomposition of a Gaussian matrix providing a uniformly distributed (over the so-called Stiefel manifold) orthogonal matrix is suggested by Theorems 2.3.18-19 of this reference. Note that the statement of the result suggests a "QR-like" decomposition, however, with the triangular matrix R having positive elements.
Apparently, the qr function of scipy (and numpy) does not guarantee positive diagonal elements for R, and the corresponding Q is then not uniformly distributed. This has been observed in this monograph, Sec. 4.6 (the discussion refers to MATLAB, but I guess both MATLAB and scipy use the same LAPACK routines). It is suggested there that the matrix Q provided by qr be modified by post-multiplying it with a random unitary diagonal matrix.
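A minimal sketch of that correction for the real case (my own illustration of the reference's suggestion, not code from this answer): rescale the columns of Q by the signs of R's diagonal so that R ends up with a positive diagonal.

import numpy as np
from scipy.linalg import qr

n = 3
H = np.random.randn(n, n)
Q, R = qr(H)
# Flip the sign of each column of Q so the corresponding diagonal entry of R is
# positive; the resulting Q should then be uniformly (Haar) distributed.
Q_haar = Q @ np.diag(np.sign(np.diag(R)))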
Below I reproduce the experiment in the above reference, plotting the empirical distribution (histogram) of phases of eigenvalues of the "direct" Q matrix provided by qr, as well as the "modified" version, where it is seen that the modified version does indeed have a uniform eigenvalue phase, as would be expected from a uniformly distributed orthogonal matrix.
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import qr, eigvals
from seaborn import distplot

n = 50
repeats = 10000

angles = []
angles_modified = []
for rp in range(repeats):
    H = np.random.randn(n, n)
    Q, R = qr(H)
    angles.append(np.angle(eigvals(Q)))
    Q_modified = Q @ np.diag(np.exp(1j * np.pi * 2 * np.random.rand(n)))
    angles_modified.append(np.angle(eigvals(Q_modified)))

fig, ax = plt.subplots(1, 2, figsize=(10, 3))
distplot(np.asarray(angles).flatten(), kde=False,
         hist_kws=dict(edgecolor="k", linewidth=2), ax=ax[0])
ax[0].set(xlabel='phase', title='direct')
distplot(np.asarray(angles_modified).flatten(), kde=False,
         hist_kws=dict(edgecolor="k", linewidth=2), ax=ax[1])
ax[1].set(xlabel='phase', title='modified');
This is the rvs method pulled from https://github.com/scipy/scipy/pull/5622/files, with minimal change - just enough to run as a stand-alone numpy function.
import numpy as np

def rvs(dim=3):
    random_state = np.random
    H = np.eye(dim)
    D = np.ones((dim,))
    for n in range(1, dim):
        x = random_state.normal(size=(dim-n+1,))
        D[n-1] = np.sign(x[0])
        x[0] -= D[n-1]*np.sqrt((x*x).sum())
        # Householder transformation
        Hx = (np.eye(dim-n+1) - 2.*np.outer(x, x)/(x*x).sum())
        mat = np.eye(dim)
        mat[n-1:, n-1:] = Hx
        H = np.dot(H, mat)
    # Fix the last sign such that the determinant is 1
    D[-1] = (-1)**(1-(dim % 2))*D.prod()
    # Equivalent to np.dot(np.diag(D), H) but faster, apparently
    H = (D*H.T).T
    return H
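A quick usage check (my own addition, assuming the rvs function above has been defined):

m = rvs(3)
print(np.allclose(m @ m.T, np.eye(3)))   # True: the matrix is orthogonal
print(np.linalg.det(m))                  # ~1.0, thanks to the last-sign fix above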
It matches Warren's test, https://stackoverflow.com/a/38426572/901925
An easy way to create an orthogonal matrix of any shape (n x m):
import numpy as np

n, m = 3, 5
H = np.random.rand(n, m)
u, s, vh = np.linalg.svd(H, full_matrices=False)
mat = u @ vh
print(mat @ mat.T)  # -> eye(n)

Note that if n > m, it would instead satisfy mat.T @ mat = eye(m).
from scipy.stats import special_ortho_group
num_dim=3
x = special_ortho_group.rvs(num_dim)
Documentation
If you want a non-square matrix with orthonormal column vectors, you could create a square one with any of the mentioned methods and drop some columns.
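A minimal sketch of that idea (my own illustration): build a square orthogonal matrix and keep only the first k columns; those columns remain orthonormal.

import numpy as np

Q, _ = np.linalg.qr(np.random.randn(5, 5))
Q_k = Q[:, :3]                              # 5 x 3 matrix with orthonormal columns
print(np.allclose(Q_k.T @ Q_k, np.eye(3)))  # True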
Numpy also has qr factorization. https://numpy.org/doc/stable/reference/generated/numpy.linalg.qr.html
import numpy as np
a = np.random.rand(3, 3)
q, r = np.linalg.qr(a)
q @ q.T
# array([[ 1.00000000e+00, 8.83206468e-17, 2.69154044e-16],
# [ 8.83206468e-17, 1.00000000e+00, -1.30466244e-16],
# [ 2.69154044e-16, -1.30466244e-16, 1.00000000e+00]])
I have an array of size MxN and I would like to compute the entropy of each row. What would be the fastest way to do so?
scipy.special.entr computes -x*log(x) for each element in an array. After calling that, you can sum the rows.
Here's an example. First, create an array p of positive values whose rows sum to 1:
In [23]: np.random.seed(123)
In [24]: x = np.random.rand(3, 10)
In [25]: p = x/x.sum(axis=1, keepdims=True)
In [26]: p
Out[26]:
array([[ 0.12798052, 0.05257987, 0.04168536, 0.1013075 , 0.13220688,
0.07774843, 0.18022149, 0.1258417 , 0.08837421, 0.07205402],
[ 0.08313743, 0.17661773, 0.1062474 , 0.01445742, 0.09642919,
0.17878489, 0.04420998, 0.0425045 , 0.12877228, 0.1288392 ],
[ 0.11793032, 0.15790292, 0.13467074, 0.11358463, 0.13429674,
0.06003561, 0.06725376, 0.0424324 , 0.05459921, 0.11729367]])
In [27]: p.shape
Out[27]: (3, 10)
In [28]: p.sum(axis=1)
Out[28]: array([ 1., 1., 1.])
Now compute the entropy of each row. entr uses the natural logarithm, so to get the base-2 log, divide the result by log(2).
In [29]: from scipy.special import entr
In [30]: entr(p).sum(axis=1)
Out[30]: array([ 2.22208731, 2.14586635, 2.22486581])
In [31]: entr(p).sum(axis=1)/np.log(2)
Out[31]: array([ 3.20579434, 3.09583074, 3.20980287])
If you don't want the dependency on scipy, you can use the explicit formula:
In [32]: (-p*np.log2(p)).sum(axis=1)
Out[32]: array([ 3.20579434, 3.09583074, 3.20980287])
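One caveat worth noting (my own addition, not from the original answer): if p contains zeros, the explicit formula produces nan, because 0 * log2(0) evaluates to nan in NumPy, whereas scipy.special.entr defines entr(0) = 0.

import numpy as np
from scipy.special import entr

p0 = np.array([0.5, 0.5, 0.0])
print((-p0 * np.log2(p0)).sum())   # nan (with a RuntimeWarning)
print(entr(p0).sum() / np.log(2))  # 1.0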
As @Warren pointed out, it's unclear from your question whether you are starting out from an array of probabilities, or from the raw samples themselves. In my answer I've assumed the latter, in which case the main bottleneck will be computing the bin counts over each row.
Assuming that each vector of samples is relatively long, the fastest way to do this will probably be to use np.bincount:
import numpy as np

def entropy(x):
    """
    x is assumed to be an (nsignals, nsamples) array containing integers between
    0 and n_unique_vals
    """
    x = np.atleast_2d(x)
    nrows, ncols = x.shape
    nbins = x.max() + 1

    # count the number of occurrences for each unique integer between 0 and x.max()
    # in each row of x
    counts = np.vstack([np.bincount(row, minlength=nbins) for row in x])

    # divide by number of columns to get the probability of each unique value
    p = counts / float(ncols)

    # compute Shannon entropy in bits
    return -np.sum(p * np.log2(p), axis=1)
Although Warren's method of computing the entropies from the probability values using entr is slightly faster than using the explicit formula, in practice this is likely to represent a tiny fraction of the total runtime compared to the time taken to compute the bin counts.
Test correctness for a single row:
vals = np.arange(3)
prob = np.array([0.1, 0.7, 0.2])
row = np.random.choice(vals, p=prob, size=1000000)
print("theoretical H(x): %.6f, empirical H(x): %.6f" %
      (-np.sum(prob * np.log2(prob)), entropy(row)[0]))
# theoretical H(x): 1.156780, empirical H(x): 1.157532
Test speed:
In [1]: %%timeit x = np.random.choice(vals, p=prob, size=(1000, 10000))
....: entropy(x)
....:
10 loops, best of 3: 34.6 ms per loop
If your data don't consist of integer indices between 0 and the number of unique values, you can convert them into this format using np.unique:
y = np.random.choice([2.5, 3.14, 42], p=prob, size=(1000, 10000))
unq, x = np.unique(y, return_inverse=True)
x.shape = y.shape
I'm in almost exactly the same situation as the asker here from over a year ago:
fast way to invert or dot kxnxn matrix
So I have a tensor with indices a[n,i,j] of dimensions (N,M,M) and I want to invert the M*M square matrix part for each n in N.
For example, suppose I have
In [1]: a = np.arange(12)
   ...: a.shape = (3,2,2)
   ...: a
Out[1]:
array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])
Then a for loop inversion would go like this:
In [2]: inv_a = np.zeros([3,2,2])
for m in xrange(0,3):
inv_a[m] = np.linalg.inv(a[m])
inv_a
Out[2]: array([[[-1.5, 0.5],
[ 1. , 0. ]],
[[-3.5, 2.5],
[ 3. , -2. ]],
[[-5.5, 4.5],
[ 5. , -4. ]]])
This will apparently be implemented in NumPy 2.0, according to this issue on github...
I guess I need to install the dev version as seberg noted in the github issue thread, but is there another way to do this in vectorized manner right now?
Update:
In NumPy 1.8 and later, the functions in numpy.linalg are generalized universal functions.
Meaning that you can now do something like this:
import numpy as np
a = np.random.rand(12, 3, 3)
np.linalg.inv(a)
This will invert each 3x3 array and return the result as a 12x3x3 array.
See the numpy 1.8 release notes.
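A quick check of the broadcast behaviour (my own sketch, not from the release notes): each slice of the result should be the inverse of the corresponding input slice.

import numpy as np

a = np.random.rand(12, 3, 3)
inv_a = np.linalg.inv(a)
print(np.allclose(a @ inv_a, np.eye(3)))  # True; matmul broadcasts over the first axis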
Original Answer:
Since the individual matrices are relatively small, how about we compute the LU decomposition manually for all the matrices at once.
This ensures that the for loops involved are relatively short.
Here's how this can be done with normal NumPy syntax:
import numpy as np
from numpy.random import rand

def pylu3d(A):
    N = A.shape[1]
    for j in range(N-1):
        for i in range(j+1, N):
            # change to L
            A[:,i,j] /= A[:,j,j]
            # change to U
            A[:,i,j+1:] -= A[:,i,j:j+1] * A[:,j,j+1:]

def pylusolve(A, B):
    N = A.shape[1]
    for j in range(N-1):
        for i in range(j+1, N):
            B[:,i] -= A[:,i,j] * B[:,j]
    for j in range(N-1, -1, -1):
        B[:,j] /= A[:,j,j]
        for i in range(j):
            B[:,i] -= A[:,i,j] * B[:,j]
# usage
A = rand(1000000, 3, 3)
b = rand(3)
b = np.tile(b, (1000000, 1))
pylu3d(A)
# A has been replaced with the LU decompositions
pylusolve(A, b)
# b has been replaced with the solutions of
# A[i] x = b[i] for each A[i] and b[i]
As I have written it, pylu3d modifies A in place to compute the LU decomposition.
After replacing each NxN matrix with its LU decomposition, pylusolve can be used to solve an MxN array b representing the right hand sides of your matrix systems.
It modifies b in place and does the proper back substitutions to solve the system.
As it is written, this implementation does not include pivoting, so it isn't numerically stable, but it should work well enough in most cases.
Depending on how your array is arranged in memory, it is probably still a good bit faster to use Cython.
Here are two Cython functions that do the same thing, but they iterate along M first.
It's not vectorized, but it is relatively fast.
from numpy cimport ndarray as ar
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def lu3d(ar[double, ndim=3] A):
    cdef int n, i, j, k, N=A.shape[0], h=A.shape[1], w=A.shape[2]
    for n in range(N):
        for j in range(h-1):
            for i in range(j+1, h):
                # change to L
                A[n,i,j] /= A[n,j,j]
                # change to U
                for k in range(j+1, w):
                    A[n,i,k] -= A[n,i,j] * A[n,j,k]

@cython.boundscheck(False)
@cython.wraparound(False)
def lusolve(ar[double, ndim=3] A, ar[double, ndim=2] b):
    cdef int n, i, j, N=A.shape[0], h=A.shape[1]
    for n in range(N):
        for j in range(h-1):
            for i in range(j+1, h):
                b[n,i] -= A[n,i,j] * b[n,j]
        for j in range(h-1, -1, -1):
            b[n,j] /= A[n,j,j]
            for i in range(j):
                b[n,i] -= A[n,i,j] * b[n,j]
You could also try using Numba, though I couldn't get it to run as fast as Cython in this case.
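A minimal Numba sketch of the same in-place LU idea (my own illustration, not the answer author's benchmarked code; it assumes numba is installed):

import numpy as np
from numba import njit

@njit
def lu3d_numba(A):
    # In-place LU decomposition (no pivoting) of each h x w matrix in the stack A.
    N, h, w = A.shape
    for n in range(N):
        for j in range(h - 1):
            for i in range(j + 1, h):
                A[n, i, j] /= A[n, j, j]                    # L part
                for k in range(j + 1, w):
                    A[n, i, k] -= A[n, i, j] * A[n, j, k]   # U part

A = np.random.rand(1000, 3, 3)
lu3d_numba(A)  # A now holds the LU factors, as with the Cython version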