Vectorized calculation of scaled/rotated pairwise squared euclidean distance - python

Given a set of n vectors of dimension d stored in an (n,d) array and a second set of m vectors of the same dimension stored in an (m,d) array, I want to calculate the squared pairwise distance between the vectors, scaled by some matrix A of size (d,d).
The output should be a (n,m) array.
I expect the input range to be somewhere between 1 and 10,000 for m and n, and between 1 and 100 for d.
The distance between two points x and y is given by:
d(x, y) = (x - y)^T A (x - y)
In the non-optimized, but working python code this looks like this:
import numpy as np
v1 = np.array([[1, 2],
[3, 4],
[4, 5]])
v2 = np.array([[1,1],
[2, 2],
[2, 2],
[0, 0]])
A = np.array([[1,0], [2, 3]])
d = np.zeros((3, 4))
for i in range(0,3):
    for j in range(0,4):
        d[i,j] = (v1[i,:] - v2[j,:]).T @ A @ (v1[i,:] - v2[j,:])
The squared distance between the example points is:
d = [[ 3. 1. 1. 17.]
[ 43. 17. 17. 81.]
[ 81. 43. 43. 131.]]
Is there a version of this that avoids the nested loop in Python, e.g. using broadcasting black magic?
EDIT:
For the case
A = np.array([[1,0], [0, 1]])
this is the normal squared euclidean distance which can be calculated e.g.
from scipy.spatial.distance import cdist
cdist(v1,v2,'sqeuclidean')

We can use np.einsum -
V = v1[:,None,:]-v2
d_out = np.einsum('ijk,kl,ijl->ij',V,A,V)
Also, play around with the optimize flag in np.einsum by setting it as True to use BLAS.
Explanation on the vectorized method
Original code was -
d[i,j] = (v1[i,:] - v2[j,:]).T @ A @ (v1[i,:] - v2[j,:])
I. We are translating :
v1[i,:] - v2[j,:]
to the outer operation with broadcasting :
v1[:,None,:]-v2
Schematically put :
v1[:,None,:] : n x 1 x d
v2 : m x d
output, V : n x m x d
More information on outer operations and on broadcasting can be found in the NumPy docs.
II. Next up, (v1[i,:] - v2[j,:]).T @ A @ (v1[i,:] - v2[j,:]) with the new V becomes np.einsum('ijk,kl,ijl->ij',V,A,V) using einsum's string notation. More info can be found in the docs.
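To make the two steps above concrete, here is a minimal sketch that wraps the einsum approach in a function and checks it against the example from the question. The second function is not part of the answer: it is an assumed alternative that expands the quadratic form as x^T A x + y^T A y - x^T (A + A^T) y, which avoids building the (n, m, d) intermediate array V and may be preferable when n, m and d are all large. The function names are made up for illustration.

```python
import numpy as np

def scaled_sqdist_einsum(v1, v2, A):
    """Broadcast the differences to shape (n, m, d), then contract with A."""
    V = v1[:, None, :] - v2[None, :, :]
    return np.einsum('ijk,kl,ijl->ij', V, A, V)

def scaled_sqdist_expanded(v1, v2, A):
    """Same result without the (n, m, d) intermediate (assumed alternative)."""
    q1 = np.einsum('ik,kl,il->i', v1, A, v1)   # x^T A x for every row of v1
    q2 = np.einsum('jk,kl,jl->j', v2, A, v2)   # y^T A y for every row of v2
    cross = v1 @ (A + A.T) @ v2.T              # x^T A y + y^T A x
    return q1[:, None] + q2[None, :] - cross

v1 = np.array([[1, 2], [3, 4], [4, 5]])
v2 = np.array([[1, 1], [2, 2], [2, 2], [0, 0]])
A = np.array([[1, 0], [2, 3]])

d_einsum = scaled_sqdist_einsum(v1, v2, A)
d_expanded = scaled_sqdist_expanded(v1, v2, A)
print(d_einsum)
```

Both functions should reproduce the (3, 4) table of distances given in the question.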

Related

Memory efficient mean pairwise distance

I am aware of the scipy.spatial.distance.pdist function and how to compute the mean from the resulting matrix/ndarray.
>>> x = np.random.rand(10000, 2)
>>> y = pdist(x, metric='euclidean')
>>> y.mean()
0.5214255824176626
In the example above y gets quite large (nearly 2,500 times as large as the input array):
>>> y.shape
(49995000,)
>>> from sys import getsizeof
>>> getsizeof(x)
160112
>>> getsizeof(y)
399960096
>>> getsizeof(y) / getsizeof(x)
2498.0019986009793
But since I am only interested in the mean pairwise distance, the distance matrix doesn't have to be kept in memory. Instead, the mean of each row (or column) can be computed separately. The final mean value can then be computed from the row mean values.
Is there already a function which exploits this property, or is there an easy way to extend/combine existing functions to do so?
If you use the squared version of the distance, the mean is equivalent to twice the sum of the per-column variances computed with n-1 (ddof=1):
from scipy.spatial.distance import pdist, squareform
import numpy as np
x = np.random.rand(10000, 2)
y = np.array([[1,1], [0,0], [2,0]])
print(pdist(x, 'sqeuclidean').mean())
print(np.var(x, 0, ddof=1).sum()*2)
>>0.331474285845873
0.33147428584587346
You will have to weight each row by the number of observations that make up the mean. For example the pdist of a 3 x 2 matrix is the flattened upper triangle (offset of 1) of the squareform 3 x 3 distance matrix.
arr = np.arange(6).reshape(3,2)
arr
array([[0, 1],
[2, 3],
[4, 5]])
pdist(arr)
array([2.82842712, 5.65685425, 2.82842712])
from sklearn.metrics import pairwise_distances
square = pairwise_distances(arr)
square
array([[0. , 2.82842712, 5.65685425],
[2.82842712, 0. , 2.82842712],
[5.65685425, 2.82842712, 0. ]])
square[np.triu_indices(square.shape[0], 1)]
array([2.82842712, 5.65685425, 2.82842712])
There is the pairwise_distances_chunked function, which can be used to iterate over the distance matrix row by row, but you will need to keep track of the row index to make sure you only take the mean of values in the upper/lower triangle of the matrix (the distance matrix is symmetric). This isn't complicated, but I imagine you will introduce a significant slowdown.
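from sklearn.metrics import pairwise_distances_chunked
# assumption: gen and r were not defined in the snippet; working_memory=0 yields one row per chunk
gen = pairwise_distances_chunked(arr, working_memory=0)
tot = ((arr.shape[0]**2) - arr.shape[0]) / 2  # number of unique pairs
weighted_means = 0
r = 1  # column where the upper triangle starts in the current row
for i in gen:
    if r < arr.shape[0]:
        sm = i[0, r:].mean()
        wgt = (i.shape[1] - r) / tot
        weighted_means += sm * wgt
    r += 1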
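For comparison, the same idea can be sketched with NumPy alone by processing the rows in blocks, so that only a (chunk_size, n) slab of distances exists at any time. This is an illustrative sketch, not an existing library function; the name mean_pairwise_distance and the chunk_size parameter are assumptions made for this example.

```python
import numpy as np

def mean_pairwise_distance(x, chunk_size=1024):
    """Mean Euclidean distance over all unordered pairs of rows of x."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    total = 0.0
    for start in range(0, n, chunk_size):
        block = x[start:start + chunk_size]
        # Distances from the rows of this block to every row: (len(block), n)
        diff = block[:, None, :] - x[None, :, :]
        total += np.sqrt((diff ** 2).sum(axis=2)).sum()
    # Each unordered pair was summed twice and the diagonal adds zero,
    # so dividing by n * (n - 1) gives the mean over the n * (n - 1) / 2 pairs.
    return total / (n * (n - 1))

arr = np.arange(6).reshape(3, 2)
print(mean_pairwise_distance(arr, chunk_size=1))
```

A smaller chunk_size lowers peak memory at the cost of more Python-level iterations; summing the full rows instead of only the upper triangle trades some redundant work for simpler bookkeeping.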

Why is my SVD calculation different than numpy's SVD calculation of this matrix?

I'm trying to manually compute the SVD of the matrix A defined below but I am having some problems. Computing it manually and with the svd method in numpy yields two different results.
Computed manually below:
import numpy as np
A = np.array([[3,2,2], [2,3,-2]])
V = np.linalg.eig(A.T @ A)[1]
U = np.linalg.eig(A @ A.T)[1]
S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0,0]]
print(A)
print(U @ S @ V.T)
And computed via numpy's svd method:
X,Y,Z = np.linalg.svd(A)
Y = np.c_[np.diag(Y), [0,0]]
print(A)
print(X @ Y @ Z)
When these two snippets are run, the manual calculation doesn't equal the result of the svd method. Why is there a discrepancy between these two calculations?
Look at the eigenvalues returned by np.linalg.eig(A.T @ A):
In [57]: evals, evecs = np.linalg.eig(A.T @ A)
In [58]: evals
Out[58]: array([2.50000000e+01, 3.61082692e-15, 9.00000000e+00])
So (ignoring the normal floating point imprecision), it computed [25, 0, 9]. The eigenvectors associated with those eigenvalues are in the columns of evecs, in the same order. But your construction of S doesn't match that order; here's your S:
In [60]: S
Out[60]:
array([[5., 0., 0.],
[0., 3., 0.]])
When you compute U @ S @ V.T, the values in S @ V.T are not correctly aligned.
As a quick fix, you can rerun your code with S set explicitly as follows:
S = np.array([[5, 0, 0],
[0, 0, 3]])
With that change, your code outputs
[[ 3 2 2]
[ 2 3 -2]]
[[-3. -2. -2.]
[-2. -3. 2.]]
That's better, but why are the signs wrong? Now the problem is that you have independently computed U and V. Eigenvectors are not unique; they are the basis of an eigenspace, and such a basis is not unique. If the eigenvalue is simple, and if the vector is normalized to have length one (which numpy.linalg.eig does), there is still a choice of the sign to be made. That is, if v is an eigenvector, then so is -v. The choices made by eig when computing U and V won't necessarily result in restoring the sign of A when U @ S @ V.T is computed.
It turns out that you can get the result that you expect by simply reversing all the signs in either U or V. Here is a modified version of your script that generates the output that you expected:
import numpy as np
A = np.array([[3, 2, 2],
[2, 3, -2]])
U = np.linalg.eig(A @ A.T)[1]
V = -np.linalg.eig(A.T @ A)[1]
# S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0,0]]
S = np.array([[5, 0, 0],
[0, 0, 3]])
print(A)
print(U @ S @ V.T)
Output:
[[ 3 2 2]
[ 2 3 -2]]
[[ 3. 2. 2.]
[ 2. 3. -2.]]
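The ordering and sign problems described above can also be avoided by not computing U independently. The following is a minimal sketch of that idea, not the code from the answer: it takes V from the eigenvectors of A.T @ A, sorts the eigenvalues in descending order, and derives each column of U as A v_i / s_i. It assumes A has full rank, so that the leading singular values are nonzero; the variable names are chosen for illustration.

```python
import numpy as np

A = np.array([[3, 2, 2],
              [2, 3, -2]], dtype=float)

# eigh is intended for symmetric matrices; it returns ascending eigenvalues.
evals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(evals)[::-1]          # reorder to descending
evals, V = evals[order], V[:, order]

r = min(A.shape)                         # assumes A has full rank
s = np.sqrt(evals[:r])                   # singular values
U = (A @ V[:, :r]) / s                   # u_i = A v_i / s_i fixes the signs

A_rebuilt = U @ np.diag(s) @ V[:, :r].T
print(s)
print(A_rebuilt)
```

Because each u_i is computed from the matching v_i, whatever sign eigh happens to choose for v_i cancels out in the product.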

Range of index in NumPy correlation function

I am looking into the NumPy correlation function
numpy.correlate(a, v, mode='valid')
Cross-correlation of two 1-dimensional sequences.
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
Then for the example:
a = [1, 2, 3]
v = [0, 1, 0.5]
np.correlate([1, 2, 3], [0, 1, 0.5], "full")
array([ 0.5, 2. , 3.5, 3. , 0. ])
So the k in the output array runs from 0 to 4 in this example. However, I am wondering how a[n+k] is defined when (n+k) > 2 in this case.
Also, how is conjugate(v(n)) defined, and how is each element in the array computed?
The formula c_{av}[k] = sum_n a[n+k] * conj(v[n]) is a little misleading because k on the left is not necessarily the Python index of the output array. In the 'full' mode, the possible values of k are those for which there exists at least one n such that a[n+k] * conj(v[n]) is defined (that is, both n+k and n fall in the ranges of respective arrays).
In your examples, k in sum_n a[n+k] * conj(v[n]) can be -2, -1, 0, 1, 2. These generate 5 values that you see. For example, k being -2 results in a[2-2]*conj(v[2]) which is 0.5, and so on.
In general, the range of k in the 'full' mode is from 1-len(v) to len(a)-1 inclusive. So, if k is really understood as the Python index of the output array, then the formula should be
c_{av}[k] = sum_n a[n+k-len(v)+1] * conj(v[n])
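To see how each output element arises, here is a small sketch that evaluates the defining sum directly, letting k run over every shift for which the two sequences overlap and skipping terms whose index n+k falls outside a. The helper name correlate_full is made up for this example. np.conj is the complex conjugate, which leaves real inputs unchanged.

```python
import numpy as np

def correlate_full(a, v):
    """Evaluate c[k] = sum_n a[n+k] * conj(v[n]) for every overlapping shift k."""
    a = np.asarray(a)
    v = np.asarray(v)
    out = []
    for k in range(1 - len(v), len(a)):
        total = 0
        for n in range(len(v)):
            if 0 <= n + k < len(a):      # terms outside a are simply omitted
                total += a[n + k] * np.conj(v[n])
        out.append(total)
    return np.array(out)

result = correlate_full([1, 2, 3], [0, 1, 0.5])
print(result)
print(np.correlate([1, 2, 3], [0, 1, 0.5], "full"))
```

The first output element corresponds to k = 1 - len(v), which is why the Python index of the output and the k of the formula differ by a constant offset.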

Getting eigenvalues from 3x3 matrix in Python using Power method

I'm trying to get all eigenvalues of a 3x3 matrix by using the power method in Python. However, my method returns different eigenvalues from the correct ones for some reason.
My matrix: A = [[1, 2, 3], [2, 4, 5], [3, 5,-1]]
Correct eigenvalues: [ 8.54851285, -4.57408723, 0.02557437 ]
Eigenvalues returned by my method: [ 8.5485128481521926, 4.5740872291939381, 9.148174458392436 ]
So the first one is correct, the second one has the wrong sign and the third one is completely wrong. I don't know what I'm doing wrong and I can't see where I have made a mistake.
Here's my code:
import numpy as np
import numpy.linalg as la

eps = 1e-8 # Precision of eigenvalue

def trans(v): # transposes vector (v^T)
    v_1 = np.copy(v)
    return v_1.reshape((-1, 1))

def power(A):
    eig = []
    Ac = np.copy(A)
    lamb = 0
    for i in range(3):
        x = np.array([1, 1, 1])
        while True:
            x_1 = Ac.dot(x) # y_n = A*x_(n-1)
            x_norm = la.norm(x_1)
            x_1 = x_1/x_norm # x_n = y_n/||y_n||
            if(abs(lamb - x_norm) <= eps): # If precision is reached, it returns eigenvalue
                break
            else:
                lamb = x_norm
                x = x_1
        eig.append(lamb)
        # Matrix Deflation: A - Lambda * norm[V]*norm[V]^T
        v = x_1/la.norm(x_1)
        R = v * trans(v)
        R = eig[i]*R
        Ac = Ac - R
    return eig

def main():
    A = np.array([1, 2, 3, 2, 4, 5, 3, 5, -1]).reshape((3, 3))
    print(power(A))

if __name__ == '__main__':
    main()
PS. Is there a simpler way to get the second and third eigenvalues from the power method instead of matrix deflation?
With
lamb = x_norm
you only ever compute the absolute value of the eigenvalues. It is better to compute them as
lamb = np.dot(x, x_1)
where x is assumed to be normalized and x_1 = Ac.dot(x) is taken before it is normalized.
Because the sign is lost, you do not remove the negative eigenvalue -4.57408723 in the deflation step but effectively add it instead, so the largest eigenvalue in the third stage is 2*(-4.574...) = -9.148..., of which you again computed the absolute value.
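Putting that fix together, here is a minimal sketch of power iteration with a signed eigenvalue estimate followed by deflation. It is not the asker's code: the function names, the tolerance and the iteration cap are assumptions for illustration, and the deflation step as written relies on A being symmetric so that its eigenvectors are orthogonal.

```python
import numpy as np

def power_iteration(M, tol=1e-12, max_iter=10_000):
    """Return (eigenvalue, unit eigenvector) for the dominant eigenpair of M."""
    x = np.ones(M.shape[0]) / np.sqrt(M.shape[0])
    lamb = np.inf                        # guarantees at least two iterations
    for _ in range(max_iter):
        y = M @ x
        new_lamb = x @ y                 # Rayleigh quotient, keeps the sign
        norm = np.linalg.norm(y)
        if norm == 0.0:                  # x lies in the null space of M
            return 0.0, x
        x = y / norm
        converged = abs(new_lamb - lamb) <= tol
        lamb = new_lamb
        if converged:
            break
    return lamb, x

def eigenvalues_by_deflation(A):
    """All eigenvalues of a symmetric matrix via repeated power iteration."""
    M = np.array(A, dtype=float)
    eigs = []
    for _ in range(M.shape[0]):
        lamb, v = power_iteration(M)
        eigs.append(lamb)
        M = M - lamb * np.outer(v, v)    # remove the eigenpair just found
    return eigs

A = np.array([[1, 2, 3], [2, 4, 5], [3, 5, -1]])
eigs = eigenvalues_by_deflation(A)
print(eigs)
print(np.linalg.eigvalsh(A))
```

The two printed lists should agree up to ordering and a small numerical error; for anything beyond an exercise, np.linalg.eigvalsh is the more reliable choice.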
I didn't know this method, so I googled it and found this:
http://ergodic.ugr.es/cphys/LECCIONES/FORTRAN/power_method.pdf
According to it, the method is only valid for finding the leading (largest-magnitude) eigenvalue. So it seems to be working fine for you in that respect, and there is no guarantee that the subsequent eigenvalues will be correct.
Btw. numpy.linalg.eig() works faster than your code for this matrix, but I am guessing you implemented it as an exercise.

How to plot pairwise distances of two-dimensional vectors?

I have a set of data in Python with columns like:
x y angle
I want to calculate the distance between every possible pair of points and plot the distances against the difference between the two angles.
x, y, a = np.loadtxt('w51e2-pa-2pk.log', unpack=True)
n = 0
f=(((x[n])-x[n+1:])**2+((y[n])-y[n+1:])**2)**0.5
d = a[n]-a[n+1:]
plt.scatter(f,d)
There are 255 points in my data.
f is the distance and d is the difference between two angles.
My question is can I set n = [1,2,3,.....255] and do the calculation again to get the f and d of all possible pairs?
You can obtain the pairwise distances through broadcasting by considering it as an outer operation on the array of 2-dimensional vectors as follows:
vecs = np.stack((x, y)).T
np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
For example,
In [1]: import numpy as np
...: x = np.array([1, 2, 3])
...: y = np.array([3, 4, 6])
...: vecs = np.stack((x, y)).T
...: np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
...:
Out[1]:
array([[ 0. , 1.41421356, 3.60555128],
[ 1.41421356, 0. , 2.23606798],
[ 3.60555128, 2.23606798, 0. ]])
Here, the (i, j)'th entry is the distance between the i'th and j'th vectors.
The case of the pairwise differences between angles is similar, but simpler, as you only have one dimension to deal with:
In [2]: a = np.array([10, 12, 15])
...: a[np.newaxis, :] - a[: , np.newaxis]
...:
Out[2]:
array([[ 0, 2, 5],
[-2, 0, 3],
[-5, -3, 0]])
Moreover, plt.scatter does not care that the results are given as matrices, and putting everything together using the notation of the question, you can obtain the plot of angles by distances by doing something like
vecs = np.stack((x, y)).T
f = np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
d = a[np.newaxis, :] - a[:, np.newaxis]
plt.scatter(f, d)
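The full matrices contain every pair twice, plus the zero diagonal. If each unordered pair should appear only once in the plot, one option is to index with np.triu_indices. This is a sketch of that variant rather than part of the answer above, and the helper name unique_pair_values is made up for illustration.

```python
import numpy as np

def unique_pair_values(x, y, a):
    """Distance and angle difference for every pair (i, j) with i < j."""
    x, y, a = (np.asarray(v, dtype=float) for v in (x, y, a))
    i, j = np.triu_indices(len(x), k=1)      # index pairs above the diagonal
    f = np.hypot(x[i] - x[j], y[i] - y[j])   # Euclidean distance per pair
    d = a[i] - a[j]                          # same sign convention as a[n] - a[n+1:]
    return f, d

f, d = unique_pair_values([1, 2, 3], [3, 4, 6], [10, 12, 15])
print(f)
print(d)
```

The resulting one-dimensional arrays can be passed to plt.scatter(f, d) in the same way as the matrices.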
You have to use a for loop and range() to iterate over n, e.g. like this:
n = len(x)
for i in range(n):
    # do something with the current index
    # e.g. print the points
    print(x[i])
    print(y[i])
But note that if you use i+1 inside the last iteration, this will already be outside of your list.
Also note that (x[n])-x[n+1:] only works because x is a NumPy array: x[n] is a single value, x[n+1:] is an array starting from the (n+1)'th element, and NumPy broadcasts the subtraction. With plain Python lists this would raise a TypeError.
Maybe you will have to even use two nested loops to do what you want. I guess that you want to calculate the distance between each point so a two dimensional array may be the data structure you want.
If you are interested in all combinations of the points in x and y I suggest to use itertools, which will give you all possible combinations. Then you can do it like follows:
import itertools
f = [((x[i]-x[j])**2 + (y[i]-y[j])**2)**0.5 for i,j in itertools.product(range(255), range(255)) if i!=j]
# and similar for the angles
But maybe there is even an easier way...
