Compute distances in k-means (Lloyd's algorithm) - python

I'm trying to compute the distance between each point (row) of matrix X (shape N,D) and each centroid (row) of matrix mu (shape K,D) using numpy:
np.array([[np.linalg.norm(x - m) for m in mu] for x in X])
This is very slow. Is there a faster way to get the same result?

You can add a third axis to one of the matrices so that the subtraction broadcasts over every (centroid, point) pair, then take the norm over the last axis:
np.linalg.norm(X - mu[:,None], axis=-1, ord=2).T
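As a rough check, the sketch below compares the nested-loop version from the question with a broadcast version that yields the (N, K) layout directly, so no transpose is needed. The sizes and random data are made up for illustration, and the final argmin line is only an assumption about how the distances would be used in the assignment step of Lloyd's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 200, 5, 3            # hypothetical sizes, for illustration only
X = rng.normal(size=(N, D))    # data points, one per row
mu = rng.normal(size=(K, D))   # centroids, one per row

# Reference: the nested-loop version from the question, shape (N, K).
slow = np.array([[np.linalg.norm(x - m) for m in mu] for x in X])

# Broadcasting: (N, 1, D) - (1, K, D) -> (N, K, D), then norm over D.
fast = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)

# Assignment step: index of the nearest centroid for each point.
labels = fast.argmin(axis=1)
```

The broadcast version builds an intermediate array of shape (N, K, D), so memory use grows with all three sizes.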

Related

Scipy cdist maximum distance

New to scipy. I am trying to use the cdist function to pick the greatest distance between vectors. My attempt is
dm = cdist(XA, XB, lambda u, v: np.max(np.sqrt(((u-v)**2).sum())))
but it doesn't seem to produce the correct result. Any suggestions?
The cdist function returns an N x M matrix containing all distances between the N vectors of XA and the M vectors of XB. If you want the maximum distance, regardless of which pair of vectors produces it, you can ravel() the 2D array into a 1D array and then take its max():
dm = cdist(XA, XB, metric='euclidean').ravel().max()
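If it also matters which pair of vectors is farthest apart, something like the following might help. It is a minimal sketch on made-up data; np.unravel_index converts the flat argmax position back to a (row, column) pair.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
XA = rng.normal(size=(4, 3))   # hypothetical data: 4 vectors
XB = rng.normal(size=(6, 3))   # hypothetical data: 6 vectors

dist = cdist(XA, XB, metric="euclidean")   # shape (4, 6)
dm = dist.max()                            # largest pairwise distance

# Row index into XA and column index into XB of the farthest pair.
i, j = np.unravel_index(dist.argmax(), dist.shape)
```

Calling max() on the 2D array gives the same value as ravel().max().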

Numpy cross covariance

Let X be a (d_x,n) matrix containing n observations of a d_x-dimensional variable x, and let w be a vector of weights (probabilities) of dimension n. The weighted covariance is given in numpy by
CX = numpy.cov(X, ddof=0, aweights=w)
Now let Y be a (d_y,n) matrix containing n observations of a d_y-dimensional variable y. Is there a clever way to compute the weighted cross covariance? In pseudocode, summing over the observations i:
CXY = sum(w[i] * numpy.outer((X[:, i] - X_mean), (Y[:, i] - Y_mean)))
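One possible approach, sketched below, is to centre both matrices with the weighted means and replace the sum of outer products with a single matrix product. The helper name weighted_cross_cov is hypothetical, and the division by the sum of the weights is an assumption chosen to match what numpy.cov does with ddof=0 and aweights.

```python
import numpy as np

def weighted_cross_cov(X, Y, w):
    """Weighted cross covariance of X (d_x, n) and Y (d_y, n); hypothetical helper."""
    w = np.asarray(w, dtype=float)
    X_mean = np.average(X, axis=1, weights=w)   # weighted mean, shape (d_x,)
    Y_mean = np.average(Y, axis=1, weights=w)   # weighted mean, shape (d_y,)
    Xc = X - X_mean[:, None]                    # centred observations
    Yc = Y - Y_mean[:, None]
    # sum_i w[i] * outer(Xc[:, i], Yc[:, i]), normalised by the total weight.
    return (Xc * w) @ Yc.T / w.sum()

rng = np.random.default_rng(0)
d_x, d_y, n = 3, 2, 50                          # made-up sizes
X = rng.normal(size=(d_x, n))
Y = rng.normal(size=(d_y, n))
w = rng.random(n)
w /= w.sum()                                    # weights as probabilities

CXY = weighted_cross_cov(X, Y, w)               # shape (d_x, d_y)
```

The same block also appears as the off-diagonal part of numpy.cov applied to X and Y stacked vertically, which can serve as a cross-check.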

How to normalize in numpy?

I have the following task: produce a numpy array Y of shape (N, M) where Y[i] contains the same data as X[i], but normalized to have mean 0 and standard deviation 1.
I have mapped the array like this:
(X - np.mean(X)) / np.std(X)
but it doesn't give me the correct answer.
You want to normalize along a specific dimension, for instance:
(X - np.mean(X, axis=0)) / np.std(X, axis=0)
Otherwise you're calculating the statistics over the whole matrix, i.e. subtracting the global mean of all points/features and the same with the standard deviation.
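If each row Y[i] is meant to have mean 0 and standard deviation 1, as the wording of the question suggests, the statistics would be taken along axis=1 instead. The sketch below uses made-up data and assumes no row is constant (a zero standard deviation would cause a division by zero). keepdims=True keeps the statistics at shape (N, 1) so they broadcast against X.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(4, 6))   # hypothetical (N, M) data

mean = X.mean(axis=1, keepdims=True)   # per-row mean, shape (N, 1)
std = X.std(axis=1, keepdims=True)     # per-row standard deviation, shape (N, 1)
Y = (X - mean) / std                   # each row now has mean 0 and std 1
```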
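Use norm from linalg:
https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html
import numpy as np
from numpy import linalg as LA
a = np.arange(9) - 4
LA.norm(a)
>>> 7.745966692414834
Then divide the array by the norm:
a / LA.norm(a)
Note that this scales a to unit Euclidean norm; on its own it does not give mean 0 and standard deviation 1.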

Implement simple linear algebra operation in tensorflow

Given some (n,m) matrix X with columns x_1, ..., x_m, I am trying to find an op that gives me either the 3-mode tensor [x_1 x_1^T, ..., x_m x_m^T] with shape (m, n,n) or the (n**2, m) matrix with columns vec(x_1 x_1^T),...,vec(x_m x_m^T) where vec is the vectorization of the matrices x_i x_i^T.
In other words, I am trying to generalize
tf.tensordot(a,a,axes=0)
or
tf.reshape(tf.tensordot(a,a,axes=0), (-1,1))
from vectors a to the columns of a matrix. Is there a way to get this done without having to rely on a loop?
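You can do that with:
tf.expand_dims(a, 2) @ tf.expand_dims(a, 1)
Or use tf.linalg.matmul instead of the @ operator if you prefer. Note that this forms the outer product of each row of a, so for an (n,m) matrix X whose vectors are the columns, pass a = tf.transpose(X) to get the (m, n,n) result.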
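The shape logic can be checked with NumPy, whose @ operator also performs a batched matrix product over the leading axis. This is a NumPy stand-in on made-up data, not TensorFlow code; the final reshape to the (n**2, m) form is an assumption about how the vectorized columns would be laid out.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4                              # hypothetical sizes
X = rng.normal(size=(n, m))              # columns x_1, ..., x_m

a = X.T                                  # (m, n): row i is x_i
# Batched product: (m, n, 1) @ (m, 1, n) -> (m, n, n), one outer product per vector.
outer = a[:, :, None] @ a[:, None, :]

# Flatten each (n, n) block into a column: (m, n*n) -> (n*n, m).
vec = outer.reshape(m, n * n).T
```

Because each x_i x_i^T is symmetric, row-major and column-major flattening give the same vec here.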

Sum of Gaussians into fast Numpy?

Here is my problem:
I have two sets of 3D points. Let's call them "Gausspoints" and "XYZ". I define a function which is a sum of Gaussians, with each Gaussian centered at one of the Gausspoints. Now I want to evaluate this function at the XYZ points. My approach works, but it is rather slow. Any idea how to speed it up by making better use of numpy?
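import numpy as np

def sumgaus(r):
    t = r - Gausspoints                 # differences to every Gausspoint, shape (M, 3)
    t = list(map(np.linalg.norm, t))    # Euclidean norm of each difference
    t = -np.power(t, 2.0)
    t = np.exp(t)
    res = np.sum(t)
    return res

result = list(map(sumgaus, XYZ))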
Thanks for any help
Edit:
XYZ has shape (N, 3) and Gausspoints has shape (M, 3), with M and N being different integers.
Edit2: I want to apply the following function to each item r in XYZ: f(r) = sum_j exp(-|r - g_j|^2), where the g_j are the Gausspoints.
The tricky part is how to vectorize the computation of all the differences between your points without any explicit Python looping or mapping. You can roll your own implementation using broadcasting by doing something like:
dist2 = XYZ[:, np.newaxis, :] - Gausspoints  # pairwise differences, shape (n, m, 3)
dist2 *= dist2                               # square each component in place
dist2 = np.sum(dist2, axis=-1)               # squared distances, shape (n, m)
If XYZ has shape (n, 3) and Gausspoints has shape (m, 3), then dist2 will have shape (n, m), with dist2[i, j] being the squared distance between points XYZ[i] and Gausspoints[j].
It may be easier to understand using scipy.spatial.distance.cdist:
from scipy.spatial.distance import cdist
dist2 = cdist(XYZ, Gausspoints)
dist2 *= dist2
But once you have your array of squared distances, it's child's play:
f = np.sum(np.exp(-dist2), axis=1)
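Put together, a self-contained sketch might look like the following. The data are random and the sizes arbitrary; sumgaus_loop is a hypothetical per-point reference in the spirit of the question's function, included only to check that the vectorized result agrees.

```python
import numpy as np

rng = np.random.default_rng(0)
XYZ = rng.normal(size=(50, 3))           # hypothetical evaluation points, (n, 3)
Gausspoints = rng.normal(size=(7, 3))    # hypothetical Gaussian centres, (m, 3)

def sumgaus_loop(r):
    """Per-point reference: sum of exp(-|r - g|^2) over all centres g."""
    t = np.linalg.norm(r - Gausspoints, axis=1)
    return np.sum(np.exp(-t ** 2))

reference = np.array([sumgaus_loop(r) for r in XYZ])

# Vectorized: pairwise differences via broadcasting, shape (n, m, 3).
diff = XYZ[:, np.newaxis, :] - Gausspoints
dist2 = np.sum(diff * diff, axis=-1)     # squared distances, shape (n, m)
f = np.sum(np.exp(-dist2), axis=1)       # one value per point in XYZ
```

The broadcast version allocates an (n, m, 3) intermediate, so for very large point sets it may be worth processing XYZ in chunks.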
