Hierarchical agglomerative clustering: how to update distance matrix? - python

I would like to implement the simple hierarchical agglomerative clustering according to the pseudocode:
I got stuck at the last part where I need to update the distance matrix. So far I have:
import numpy as np

X = np.array([[1, 2],
              [0, 3],
              [2, 3]])

# Clusters
C = np.zeros((X.shape[0], X.shape[0]))
# Keeps track of active clusters
I = np.zeros(X.shape[0])

# For all n datapoints
for n in range(X.shape[0]):
    for i in range(X.shape[0]):
        # Compute the similarity of all N x N pairs of images
        C[n][i] = np.linalg.norm(X[n] - X[i])
    I[n] = 1

# Collects clustering as a sequence of merges
A = []

# In each of N iterations
for k in range(X.shape[0] - 1):
    # TODO: Find the indices of the smallest distance
    # Updated distance matrix
I would like to implement the single-linkage clustering, so I would like to find the argmin of the distance matrix. I originally thought about doing something like:
i, m = np.where(C == np.min(C[np.nonzero(C)]))
i, m = i[0], m[0]
A.append((i, m))
to find the argmin, but I think it is incorrect as it doesn't specify a condition on the active clusters in I. I am also confused because I should just be looking at the upper or lower triangle of the matrix, so if I use the above method I could get the same argmin twice due to symmetry.
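One idea (just a sketch, using C and I as defined above) would be to mask the inactive clusters and everything outside the upper triangle with np.inf before taking the argmin:

# Sketch: restrict the argmin to active clusters and the upper triangle only
active = I.astype(bool)
mask = np.triu(np.ones_like(C, dtype=bool), k=1)  # upper triangle, diagonal excluded
mask &= active[:, None] & active[None, :]         # keep only pairs of active clusters
masked = np.where(mask, C, np.inf)                # everything else is ignored
i, m = np.unravel_index(np.argmin(masked), C.shape)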
I was also thinking about first creating the rows and columns of the new merged cluster:
C = np.vstack((C, np.zeros((1, C.shape[1]))))
C = np.hstack((C, np.zeros((C.shape[0], 1))))
Then somehow update it like:
for j in range(X.shape[0]):
    C[i][j] = min(C[i][j], C[m][j])
    C[j][i] = min(C[i][j], C[m][j])
I am not sure if this is the right approach. Is there a simpler way to find the argmin, merge the rows and columns, and update the values?

If you are confused about how to find the row and column indices of the minimum distance, here is one way to do it.
Firstly,
to avoid getting the argmin twice due to symmetry, you can construct your initial distance matrix as a lower triangular matrix:
import math

def euclidean_distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

distance_matrix = np.zeros((X.shape[0], X.shape[0]))
for i in range(len(distance_matrix)):
    for j in range(i):
        distance_matrix[i][j] = euclidean_distance(X[i], X[j])
Secondly,
you can do the min search in this matrix by hand if you don't want to use numpy tools or you are looking for a simple way:
min_value = np.inf
for i in range(len(distance_matrix)):
    for j in range(i):
        if distance_matrix[i][j] < min_value:
            min_value = distance_matrix[i][j]
            # keep min_i < min_j so the update step below stays consistent
            min_i, min_j = j, i
Finally,
update the distance matrix and merge the clusters as follows:
for i in range(len(distance_matrix)):
    if i > min_i and i < min_j:
        distance_matrix[i][min_i] = min(distance_matrix[i][min_i], distance_matrix[min_j][i])
    elif i > min_j:
        distance_matrix[i][min_i] = min(distance_matrix[i][min_i], distance_matrix[i][min_j])
for j in range(len(distance_matrix)):
    if j < min_i:
        distance_matrix[min_i][j] = min(distance_matrix[min_i][j], distance_matrix[min_j][j])

# remove one of the old clusters' data from the distance matrix
distance_matrix = np.delete(distance_matrix, min_j, axis=1)
distance_matrix = np.delete(distance_matrix, min_j, axis=0)

# here A is assumed to hold the clusters themselves (one list of point indices
# per cluster, e.g. A = [[0], [1], [2], ...]), not the list of merges
A[min_i] = A[min_i] + A[min_j]
A.pop(min_j)

Related

Why is the non-linear response to random values always positive?

I'm creating a non-linear response to a series of random values from {-1, +1} using a simple second-order Volterra kernel of the form

r(k) = \sum_{1 \le i \le j \le M} w_{ij} a(k-i) a(k-j)

With a zero mean for the a(k) values I would expect r(k) to have a zero mean as well, for arbitrary w values. However, I get r(k) with an always positive mean value, while the mean of a(k) behaves as expected: it is close to zero and changes sign from run to run.
Why don't I get a similar behavior for r(k)? Is it because a(k) are pseudo-random and two different values from a are not actually independent?
Here is the code that I use:

import numpy as np
import matplotlib.pyplot as plt
import itertools

# array of random values {-1, 1}
A = np.random.randint(2, size=10000)
A = [x*2 - 1 for x in A]

# array of random weights
M = 3
w = np.random.rand(int(M*(M+1)/2))

# non-linear response to random values
R = []
for i in range(M, len(A)):
    vals = np.asarray([np.prod(x) for x in itertools.combinations_with_replacement(A[i-M:i], 2)])
    R.append(np.dot(vals, w))

print(np.mean(A), np.var(A))
print(np.mean(R), np.var(R))
Edit:
A check of whether the quadratic form employed by the kernel is positive definite fails (i.e. there are negative principal minors). The code for the check:
import scipy.linalg as lin

wm = np.zeros((M, M))
w_index = 0
# check Sylvester's criterion
# reconstruct weights for quadratic form
for r in range(0, M):
    for c in range(r, M):
        wm[r, c] += w[w_index]/2
        wm[c, r] += w[w_index]/2
        w_index += 1
# check principal minors
for r in range(0, M):
    if lin.det(wm[:r+1, :r+1]) < 0:
        print('found negative principal minor of order', r)
I'm not certain whether this holds for Volterra kernels, but many kernels are positive definite, and some kernels, such as covariance functions, do not admit values less than zero (e.g. Squared Exponential/RBF, Rational Quadratic, Matern kernels).
If that is not the case for the Volterra kernel, you can also try seeding the RNG differently to check whether the behaviour persists. Here is a looped version of your code that iterates over different random seeds:
import numpy as np
import matplotlib.pyplot as plt
import itertools

# Loop over random seeds
for seed in range(10):
    # Seed the RNG
    np.random.seed(seed)
    # array of random values {-1, 1}
    A = np.random.randint(2, size=10000)
    A = [x*2 - 1 for x in A]
    # array of random weights
    M = 3
    w = np.random.rand(int(M*(M+1)/2))
    # non-linear response to random values
    R = []
    for i in range(M, len(A)):
        vals = np.asarray([np.prod(x) for x in itertools.combinations_with_replacement(A[i-M:i], 2)])
        R.append(np.dot(vals, w))
    # Convert R to a numpy array to allow boolean slicing
    R = np.array(R)
    print("A: ", np.mean(A), np.var(A))
    print("R <= 0: ", R[R <= 0])
    print("R: ", np.mean(R), np.var(R))
Running this, I get the following values:
A: 0.017 0.9997109999999997
R <= 0: []
R: 1.487637375177384 0.14880206863520892
A: -0.0012 0.9999985600000002
R <= 0: []
R: 2.28108226352669 0.5926651729251319
A: 0.0104 0.9998918400000001
R <= 0: []
R: 1.6138015284426408 0.9526360372883802
A: -0.0064 0.9999590399999999
R <= 0: []
R: 0.988332642595828 0.9650456000380685
A: 0.0026 0.9999932399999998
R <= 0: [-0.75835076 -0.75835076 -0.75835076 ... -0.75835076 -0.75835076
-0.75835076]
R: 0.7352258581171865 1.2668744674748733
A: -0.0048 0.9999769599999996
R <= 0: [-0.02201476 -0.29894937 -0.29894937 ... -0.02201476 -0.29894937
-0.02201476]
R: 0.7396699663779303 1.3844391355510492
A: -0.0012 0.9999985600000002
R <= 0: []
R: 2.4343947709617475 1.6377776468054106
A: -0.0052 0.99997296
R <= 0: []
R: 0.8778918601676095 0.07656607914368625
A: 0.0086 0.99992604
R <= 0: []
R: 2.3490174001719937 0.059871902764070624
A: 0.0046 0.9999788399999996
R <= 0: []
R: 1.7699147798471178 1.8049209966313247
So, as you can see, R still takes some negative values for certain seeds. My guess is that the mostly positive values occur because your kernel is positive definite.
This question ended up being about math, and not programming. Nevertheless, this is my own answer.
Simply put, when the indices of the a(k-i) factors in a product are equal, the factors are not independent (they are the same value); since a(k-i) is either -1 or +1, such a squared term is always equal to 1. These terms do not have a zero mean, hence the mean value of the whole expression is shifted into the positive range.
Formally, the implemented function is a quadratic form, for which the mean value can be calculated as

E[A^T W A] = tr(W \Sigma) + \mu^T W \mu

where \mu and \Sigma are the vector of expected values and the covariance matrix of the vector A, respectively, and W is the matrix of weights (wm in the code above).
Having a zero vector \mu leaves only the first term of this equation. Since the elements of A are assumed independent with equal variance, \Sigma = var(A) * I and the trace term reduces to var(A) times the sum of the diagonal weights. The resulting estimate can be computed with the following code, and it indeed gives values close to the statistical results in the question.
# Estimate R mean
# sum the weights on the main diagonal of the quadratic form (matrix trace)
w_sum = 0
w_index = 0
for r in range(0, M):
    for c in range(r, M):
        if r == c:
            w_sum += w[w_index]
        w_index += 1
Rmean_est = np.var(A) * w_sum
print(Rmean_est)
This estimate relies on the assumption that elements of a with different indices are independent. Any implicit dependency due to the nature of the pseudo-random generator, if present, probably changes the resulting estimate only slightly.

Iterate through numpy array testing multiple elements efficiently

I have the following code, which iterates through a 2D numpy array named m. It runs extremely slowly. How can I transform this code using numpy functions so that I avoid using the for loops?

pairs = []
for i in range(size):
    for j in range(size):
        if i >= j:
            continue
        if m[i][j] + m[j][i] >= 0.75:
            pairs.append([i, j, m[i][j] + m[j][i]])
You can use a vectorised approach with NumPy. The idea is:
First, initialize the matrix m and create summ = m + m.T, which is equivalent to m[i][j] + m[j][i] (m.T is the matrix transpose).
np.triu(summ, k=1) returns the upper triangular part of the matrix (this is equivalent to skipping the lower part with continue in your code), so the explicit if i >= j check is not needed. Here you have to use k=1 to exclude the diagonal elements; by default k=0, which includes the diagonal as well.
Then you get the indices of the pairs using np.argwhere wherever the sum m + m.T is greater than or equal to 0.75.
Finally, you store those indices and the corresponding values in a list for later processing/printing purposes.
Verifiable example (using a small 3x3 random dataset)
import numpy as np
np.random.seed(0)
m = np.random.rand(3,3)
summ = m + m.T
index = np.argwhere(np.triu(summ, k=1)>=0.75)
pairs = [(x,y, summ[x,y]) for x,y in index]
print (pairs)
# [(0, 1, 1.2600725493693163), (0, 2, 1.0403505873343364), (1, 2, 1.537667113848736)]
Further performance improvement
I just worked out an even faster approach to generate the final pairs list, avoiding explicit for loops:
pairs = list(zip(index[:, 0], index[:, 1], summ[index[:,0], index[:,1]]))
One way to optimize your code is to avoid the if i >= j comparison. To traverse only the upper triangle of the array without that check, make the inner loop start at i + 1. That way, you avoid size x size if comparisons.
import numpy as np

size = 5000
m = np.random.rand(size, size)

pairs = []
for i in range(size):
    for j in range(i + 1, size):
        if m[i][j] + m[j][i] >= 0.75:
            pairs.append([i, j, m[i][j] + m[j][i]])

How can I get rid of the loops and make my Correlation matrix function more efficient

I have this function that computes the correlation matrix and it works as expected. However, I am trying to make it more efficient and get rid of the loops, but I'm having trouble doing so. My function is below:

def correlation(X):
    N = X.shape[0]  # num of rows
    D = X.shape[1]  # num of cols
    covarianceMatrix = np.cov(X)  # start with the covariance matrix
    # use covarianceMatrix to create the size of M
    M = np.zeros([covarianceMatrix.shape[0], covarianceMatrix.shape[1]])
    for i in range(covarianceMatrix.shape[0]):
        for j in range(covarianceMatrix.shape[1]):
            corr = covarianceMatrix[i, j] / np.sqrt(np.dot(covarianceMatrix[i, i], covarianceMatrix[j, j]))
            M[i, j] = corr
    return M

What would be a more efficient way to perform this computation using numpy, without using its built-in functions such as corrcoef()?
Once you have the covariance matrix, you simply need to scale it by the inverse square roots of its diagonal, i.e. divide entry (i, j) by sqrt(C[i, i] * C[j, j]). Using bits of your code as a starting point:
covarianceMatrix = np.cov(X)
tmp = 1.0 / np.sqrt(np.diag(covarianceMatrix))
corr = covarianceMatrix.copy()
corr *= tmp[:, None]
corr *= tmp[None, :]
A bit more difficult if you have complex values, and you should probably clip between -1 and 1 via:
np.clip(corr, -1, 1, out=corr)
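For a quick sanity check (a small sketch, not part of the original answer), the result should agree with np.corrcoef on real-valued data:

import numpy as np

X = np.random.rand(4, 10)
covarianceMatrix = np.cov(X)
tmp = 1.0 / np.sqrt(np.diag(covarianceMatrix))
corr = covarianceMatrix * tmp[:, None] * tmp[None, :]
print(np.allclose(corr, np.corrcoef(X)))  # expected: True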

Find closest k points for every point in row of numpy array

I have a np array, X that is size 1000 x 1000 where each element is a real number. I want to find the 5 closest points for every point in each row of this np array. Here the distance metric can just be abs(x-y). I have tried to do
from sklearn.neighbors import NearestNeighbors

for i in range(X.shape[0]):
    knn = NearestNeighbors(n_neighbors=5)
    knn.fit(X[i])
    for j in range(X.shape[1]):
        d = knn.kneighbors(X[i, j], return_distance=False)
However, this does not work for me and I am not sure how efficient this is. Is there a way around this? I have seen a lot of methods for comparing vectors but not any for comparing single elements. I know that I could use a for loop and loop over and find the k smallest, but this would be computationally expensive. Could a KD tree work for this? I have tried a method similar to
Finding index of nearest point in numpy arrays of x and y coordinates
However, I can not get this to work. Is there some numpy function I don't know about that could accomplish this?
Construct a kdtree with scipy.spatial.cKDTree for each row of your data.
import numpy as np
import scipy.spatial

def nearest_neighbors(arr, k):
    k_lst = list(range(k + 2))[2:]  # [2, 3, ..., k+1]
    neighbors = []
    for row in arr:
        # stack the data so each element is in its own row
        data = np.vstack(row)
        # construct a kd-tree
        tree = scipy.spatial.cKDTree(data)
        # find k nearest neighbors for each element of data, squeezing out the zero
        # result (the first nearest neighbor is always the point itself)
        dd, ii = tree.query(data, k=k_lst)
        # apply an index filter on data to get the nearest neighbor elements
        closest = data[ii].reshape(-1, k)
        neighbors.append(closest)
    return np.stack(neighbors)

N = 1000
k = 5
A = np.random.random((N, N))
nearest_neighbors(A, k)
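The stacked result has one row of k neighbour values per element, so for the 1000 x 1000 example above the output shape is (1000, 1000, 5) (a quick check, assuming the code above):

out = nearest_neighbors(A, k)
print(out.shape)  # (1000, 1000, 5): the 5 closest values for every element of every row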
I'm not really sure how you want the final results. But this definitely gets you what you need.
np.random.seed([3,1415])
X = np.random.rand(1000, 1000)
Grab upper triangle indices to track every combination of points per row
x1, x2 = np.triu_indices(X.shape[1], 1)
generate an array of all distances
d = np.abs(X[:, x1] - X[:, x2])
Find the closest 5 for every row
tpos = np.argpartition(d, 5)[:, :5]
Then x1[tpos] gives the row-wise positions of the first point in the closest pairs while x2[tpos] gives the second position of the closest pairs.
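For example, to pull out the closest pairs and their distances for a single row (a sketch, assuming x1, x2, d and tpos from above):

r = 0  # any row index
closest_pairs = list(zip(x1[tpos[r]], x2[tpos[r]], d[r, tpos[r]]))
print(closest_pairs)  # 5 tuples of (col_a, col_b, |X[r, col_a] - X[r, col_b]|)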
Here is an argsorting solution that strives to take advantage of the simple metric:
import numpy as np
from numpy.lib import stride_tricks
from timeit import timeit

def nn(A, k):
    out = np.zeros((A.shape[0], A.shape[1] + 2*k), dtype=int)
    out[:, k:-k] = np.argsort(A, axis=-1)
    out[:, :k] = out[:, -k-1, None]
    out[:, -k:] = out[:, k, None]
    strd = stride_tricks.as_strided(
        out, strides=out.strides + (out.strides[-1],), shape=A.shape + (2*k+1,))
    delta = A[np.arange(A.shape[0])[:, None, None], strd]
    delta -= delta[..., k, None]
    delta = np.abs(delta)
    s = np.argpartition(delta, (0, k), axis=-1)[..., 1:k+1]
    inds = tuple(np.ogrid[:strd.shape[0], :strd.shape[1], :0][:2])
    res = np.empty(A.shape + (k,), dtype=int)
    res[np.arange(strd.shape[0])[:, None, None], out[:, k:-k, None],
        np.arange(k)[None, None, :]] = strd[inds + (s,)]
    return res

N = 1000
k = 5
r = 10
A = np.random.random((N, N))
res = nn(A, k)

# crude test
print(np.abs(A[np.arange(N)[:, None, None], res] - A[..., None]).mean())

# timings
print(timeit(lambda: nn(A, k), number=r) / r)
Output:
# 0.00150537172454
# 0.4567880852999224

How to do this operation in numPy?

I have an array X of the 3D coords of N points (N x 3) and want to calculate the Euclidean distance between each pair of points.
I can do this by iterating over X and comparing them with the threshold.
coords = array([v.xyz for v in vertices])
for vertice in vertices:
    tests = np.sum(array(coords - vertice.xyz) ** 2, 1) < threshold
    closest = [v for v, t in zip(vertices, tests) if t]
Is this possible to do in one operation? I recall linear algebra from 10 years ago and can't find a way to do this.
Probably this should be a 3D array (point a, point b, axis) and then summed by axis dimension.
edit: found the solution myself, but it doesn't work on big datasets.
coords = array([v.xyz for v in vertices])
big = np.repeat(array([coords]), len(coords), 0)
big_same = np.swapaxes(big, 0, 1)
# sum the squared differences over the coordinate axis to get an N x N test matrix
tests = np.sum((big - big_same) ** 2, 2) < thr_square
for v, test_vector in zip(vertices, tests):
    v.closest = self.filter(vertices, test_vector)
Use scipy.spatial.distance. If X is an n×3 array of points, you can get an n×n distance matrix from
from scipy.spatial import distance
D = distance.squareform(distance.pdist(X))
Then, the closest to point i is the point with index
np.argsort(D[i])[1]
(The [1] skips over the value in the diagonal, which will be returned first.)
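To get back to the thresholding in the question, the same matrix also gives all points within a given distance of point i (a sketch, with an assumed threshold value for illustration):

i = 0
threshold = 0.5  # assumed value, for illustration
close_to_i = np.flatnonzero((D[i] < threshold) & (D[i] > 0))  # indices of points near i, excluding i itself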
I'm not quite sure what you're asking here. If you're computing the Euclidean distance between each pair of points in an N-point space, it would make sense to me to represent the results as a look-up matrix. So for N points, you'd get an NxN symmetric matrix. Element (3, 5) would represent the distance between points 3 and 5, whereas element (2, 2) would be the distance between point 2 and itself (zero). This is how I would do it for random points:
import numpy as np

N = 5
coords = np.array([np.random.rand(3) for _ in range(N)])
dist = np.zeros((N, N))

for i in range(N):
    for j in range(i, N):
        dist[i, j] = np.linalg.norm(coords[i] - coords[j])
        dist[j, i] = dist[i, j]

print(dist)
If xyz is the array with your coordinates, the following code will compute the distance matrix (it is fast as long as you have enough memory to store the N^2 distances):

xyz = np.random.uniform(size=(1000, 3))
distances = (sum([(xyz[:, i][:, None] - xyz[:, i][None, :])**2 for i in range(3)]))**.5
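To reproduce the boolean test from the question with this matrix (a sketch, using thr_square from the question's code):

tests = distances ** 2 < thr_square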
