I use a function to calculate the similarity between a pair of documents and want to perform clustering using this similarity measure.
Code so far:
Sim = np.zeros((n, n))  # create a numpy array
i = 0
j = 0
for i in range(0, n):
    for j in range(i, n):
        if i == j:
            Sim[i][j] = 1
        else:
            Sim[i][j] = simfunction(list_doc[i], list_doc[j])  # similarity between documents i and j
Sim = Sim + Sim.T - np.diag(Sim.diagonal())  # complete the symmetric matrix
AggClusterDistObj = AgglomerativeClustering(n_clusters=num_cluster, linkage='average', affinity='precomputed')
Res_Labels = AggClusterDistObj.fit_predict(Sim)
My concern is that I used a similarity function here, but as far as I understand the documentation, AgglomerativeClustering with a precomputed affinity expects a dissimilarity matrix. How can I change my matrix to a dissimilarity matrix? Also, what would be a more efficient way to do this?
Please format your code correctly, as indentation matters in Python.
If possible, keep the code complete (you left out an import numpy as np).
Since range starts from zero by default, you can omit the first argument and write range(n).
Indexing in numpy works like [i, j, k, ...].
So instead of Sim[i][j] you actually want to write Sim[i, j], because otherwise you perform two operations: first taking the entire row slice and then indexing into it. Here's another way to copy the elements of the upper triangle to the lower one:
Sim = np.identity(n)  # diagonal of ones (100 percent similarity)
for i in range(n):
    for j in range(i + 1, n):  # +1 skips the diagonal
        Sim[i, j] = simfunction(list_doc[i], list_doc[j])

# Complete the matrix (copy the upper triangle to the lower one)
tril = np.tril_indices_from(Sim, -1)  # lower and upper triangle indices
triu = np.triu_indices_from(Sim, 1)   # (without the diagonal)
Sim[tril] = Sim[triu]
Assuming that you really have similarities in the range (0, 1), you can then convert your similarity matrix into a distance matrix simply with:
dm = 1 - Sim
This operation is vectorized by numpy.
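Putting it together, a minimal end-to-end sketch (simfunction here is a toy stand-in for your real function, and I assume similarities in [0, 1]; note that newer scikit-learn versions renamed the affinity parameter to metric):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def simfunction(a, b):
    # toy similarity for illustration only; replace with your real function
    return 1.0 / (1.0 + abs(len(a) - len(b)))

list_doc = ["a short doc", "another document", "a third, longer document"]
n = len(list_doc)
num_cluster = 2

Sim = np.identity(n)
for i in range(n):
    for j in range(i + 1, n):
        Sim[i, j] = simfunction(list_doc[i], list_doc[j])
Sim[np.tril_indices_from(Sim, -1)] = Sim[np.triu_indices_from(Sim, 1)]

dm = 1 - Sim  # dissimilarity (distance) matrix
labels = AgglomerativeClustering(n_clusters=num_cluster,
                                 linkage='average',
                                 affinity='precomputed').fit_predict(dm)
print(labels)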
Related
I'm basically trying to take the weighted mean of a 3D dataset, but only on a filtered subset of the data, where the filter is based on another (2D) array. The shape of the 2D data matches the first two dimensions of the 3D data, and is thus repeated for each slice along the third dimension.
Something like:
import numpy as np
myarr = np.array([[[4,6,8],[9,3,2]],[[2,7,4],[3,8,6]],[[1,6,7],[7,8,3]]])
myarr2 = np.array([[7,3],[6,7],[2,6]])
weights = np.random.rand(3,2,3)
filtered = []
for k in range(len(myarr[0, 0, :])):
    temp1 = myarr[:, :, k]
    temp2 = weights[:, :, k]
    filtered.append(temp1[np.where(myarr2 > 5)] * temp2[np.where(myarr2 > 5)])
average = np.array(np.sum(filtered, 1) / len(filtered[0]))
I am concerned about efficiency here. Is it possible to vectorize this so I don't need the loop, or are there other suggestions to make this more efficient?
The most glaring efficiency issue, even setting the loop aside, is that np.where(...) is called multiple times inside the loop on the same condition. You can do this a single time beforehand. Moreover, there is no need for a loop at all; your operation basically equates to:
mask = myarr2 > 5
average = (myarr[mask] * weights[mask]).mean(axis=0)
There is no need for np.where either. myarr2 is an array of shape (i, j), sharing its first two dimensions with myarr and weights, which have shape (i, j, k). So if there are n True elements in the boolean mask myarr2 > 5, applying it to the other arrays yields (n, k) elements: all elements along the third axis are taken wherever the mask is True at a given [i, j] position.
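A quick sketch with toy data of my own to confirm the shapes:

import numpy as np

myarr = np.arange(18).reshape(3, 2, 3)       # shape (i, j, k) = (3, 2, 3)
weights = np.random.rand(3, 2, 3)
myarr2 = np.array([[7, 3], [6, 7], [2, 6]])  # shape (i, j)

mask = myarr2 > 5                            # 4 True entries here, so n = 4
print(myarr[mask].shape)                     # (4, 3): n rows, k columns
average = (myarr[mask] * weights[mask]).mean(axis=0)
print(average.shape)                         # (3,): one value per k-slice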
Is it possible to efficiently run some calculation on all possible pairs of the elements of a vector? I.e. I want to fill the lower triangular elements of a matrix (possibly flattened).
I.e. I want to:
calculate do_my_calculation(input_vector[i], input_vector[j])
for all i, j in [1, length(input_vector)] and j < i
save all the results
The shape of the results is not terribly important. If I can choose, however, I would prefer a vector corresponding to an unrolled version of the triangular (i, j) matrix.
To illustrate what I would like to do in pseudo-python:
input_vector = np.arange(100)
result_vector = []
for i in range(1, len(input_vector)):
    for j in range(0, i):
        result_vector.append(do_my_calculation(input_vector[i], input_vector[j]))
Note: For this question, the types of input_vector and result_vector in the above code are not pertinent. Equally, I am of course happy to preallocate result_vector if required. I am using a list for the sake of conciseness of the sample code.
Edit 1: a concrete example, as requested by @ddejohn
Note: The question is not whether I can get this to run in jax, but whether I can get it to run efficiently, i.e. vectorized.
# Set up the problem
import numpy as np

dim = 15
input_vector_x = np.random.rand(dim)
input_vector_y = np.random.rand(dim)
output_vector = np.empty(np.tril_indices(dim, k=-1)[0].size)
assert input_vector_x.size == input_vector_y.size

# alternative implementation 1
counter = 0
for i in range(1, input_vector_x.size):
    for j in range(0, i):
        output_vector[counter] = (input_vector_y[j] - input_vector_y[i]) / (input_vector_x[j] - input_vector_x[i])
        counter += 1

# alternative implementation 2
indices = np.tril_indices(dim, k=-1)
i = indices[0]
j = indices[1]
output_vector = (input_vector_y[j] - input_vector_y[i]) / (input_vector_x[j] - input_vector_x[i])
There are a few ways to approach this. If you want to compute the full matrix of pairwise results, you could use typical numpy-style broadcasting, assuming your function supports it. Similarly, you could use JAX's Automatic Vectorization (vmap) functionality whether or not your function is compatible with broadcasting.
If you really wish to only compute each value once, you can do this using the lower or upper triangular indices. Note that although this performs fewer operations, you may find that in practice it's faster, particularly on accelerators like GPU and TPU, to compute the full result. The reason is that multi-dimensional indexing (the gather operation) is relatively expensive on this kind of hardware, so doubling the number of function evaluations can cost less than the gather.
Here's a demonstration of these three approaches:
import jax
import jax.numpy as jnp
key = jax.random.PRNGKey(5748395)
dim = 3
x = jax.random.uniform(key, (dim,))
def f(x1, x2):
    return (x1 * x2) / (x1 + x2)
# Option 1: full result, broadcasted operations
print(f(x[:, None], x[None, :]))
# [[0.34950745 0.00658672 0.28704265]
# [0.00658672 0.00332469 0.00655982]
# [0.28704265 0.00655982 0.24352014]]
# Option 2: full result, via vmap
f_mapped = jax.vmap(jax.vmap(f, (None, 0)), (0, None))
print(f_mapped(x, x))
# [[0.34950745 0.00658672 0.28704265]
# [0.00658672 0.00332469 0.00655982]
# [0.28704265 0.00655982 0.24352014]]
# Option 3: explicitly computing at lower-triangular indices
i, j = jnp.tril_indices(dim)
out_tril = f(x[i], x[j])
print(out_tril)
# [0.34950745 0.00658672 0.00332469 0.28704265 0.00655982 0.24352014]
print(jnp.zeros((dim, dim)).at[i, j].set(out_tril))
# [[0.34950745 0. 0. ]
# [0.00658672 0.00332469 0. ]
# [0.28704265 0.00655982 0.24352014]]
My goal is to find the Top N vectors in a large 3D dask array (~100k rows per side, or more, would be nice) that are most cosine-similar to a target vector. I can get the Top 1, but only for smaller values of n; n=500 takes over 2 hours. I'm doing something incorrectly with dask, but I'm not sure what. Also, is there a vectorized way to get the cosine similarity instead of the for-loop? In pure numpy I can get to n = ~6000 before I hit a MemoryError. A dtype of float16 gives enough accuracy and is an attempt to save space. If dask isn't the right tool, I'd be open to something else too.
import dask.array as da
import numpy as np
from numpy.linalg import norm

# create a 2D matrix of n rows, each of length vec_len; ideally n is quite large, >100,000
start = 1
step = 1
n = 5
vec_len = 10
shape = [n, vec_len]
end = np.prod(shape) * step + start
arr_2D = da.from_array(np.array(np.arange(start, end, step).reshape(shape), dtype=np.float16))
print(arr_2D.compute())

# sum each row with each other row using broadcasting, resulting in a 3D matrix:
# each (i, j) location contains a vector that is the sum of the i-th and j-th original vectors
sums_3D = arr_2D[:, None] + arr_2D[None, :]

# make a target vector
target = np.array(range(vec_len, 0, -1))
print('target:', target)

# brute-force way to get the cosine of each vector in the 3D matrix with the target vector
da_cos = da.empty(shape=(n, n), dtype=np.float16)
for i in range(n):  # <----- is there a way to vectorize this for loop??
    print('row:', i)
    for j in range(i + 1, n):  # i+1: to get only the upper triangle
        cur = sums_3D[i, j]
        cosine = np.dot(target, cur) / (norm(target) * norm(cur))
        da_cos[i, j] = cosine
print(da_cos.compute(), da_cos.dtype, da_cos.shape)

# Get top match <------ how would I get the Top N matches??
ar_max = da_cos.argmax().compute()
best_1, best_2 = np.unravel_index(ar_max, (n, n))
print(da_cos.max().compute(), best_1, best_2)
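For what it's worth, the nested loop can be expressed with broadcasting; here is a sketch in plain numpy (dask.array supports the same broadcasting, matmul, and reductions, so the pattern should carry over), using float32 rather than float16 to keep the norms stable, and argpartition for the Top N:

import numpy as np

n, vec_len, N = 5, 10, 3
arr_2D = np.arange(1, n * vec_len + 1, dtype=np.float32).reshape(n, vec_len)
sums_3D = arr_2D[:, None] + arr_2D[None, :]    # (n, n, vec_len)
target = np.arange(vec_len, 0, -1, dtype=np.float32)

dots = sums_3D @ target                         # (n, n): all dot products at once
norms = np.linalg.norm(sums_3D, axis=2)         # (n, n): all vector norms at once
cos = dots / (norms * np.linalg.norm(target))

iu = np.triu_indices(n, k=1)                    # keep only the upper triangle
flat = cos[iu]
top = np.argpartition(flat, -N)[-N:]            # indices of the N largest cosines
top = top[np.argsort(flat[top])[::-1]]          # sort those N in descending order
print(list(zip(iu[0][top], iu[1][top])), flat[top])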
I have a 16000*16000 matrix and want to find the minimum entry. This matrix is a distance matrix, so it is symmetric about the diagonal. In order to get exactly one minimum each time, I set the lower triangle and the diagonal to np.inf. Below is a 5*5 example:
inf a0 a1 a2 a3
inf inf a4 a5 a6
inf inf inf a7 a8
inf inf inf inf a9
inf inf inf inf inf
I want to find the index of the minimum entry only in the upper triangle. However, when I use np.argmin(), it will still go through the whole matrix. Is there any way to "ignore" the lower triangle and increase speed?
I have tried many methods, such as:
Use masked array
Use triu_indices() to extract the upper triangle and then find the minimum
Set the entries in the lower triangle and diagonal to np.nan instead of np.inf, then use np.nanargmin() to find the minimum
However, all of the methods I tried are slower than using np.argmin() directly.
Thank you for your time; I would appreciate any help.
UPDATE 1: Some background on my problem
In fact, I am implementing a modified version of agglomerative clustering from scratch. The original dataset is 16000*64 (I have 16000 points, each 64-dimensional). At first, I build 16000 clusters, each containing exactly one point. In each iteration, I find the nearest 2 clusters and merge them, until the termination condition is met.
To avoid repeated distance calculations, I store the distances in a 16000*16000 distance matrix, with the diagonal and lower triangle set to np.inf. In each iteration, I find the smallest entry in the distance matrix, and the index of this entry corresponds to the 2 nearest clusters, say c_i and c_j. Afterwards, in the distance matrix, I fill the 2 rows and 2 columns corresponding to c_i and c_j with np.inf, which means that these 2 clusters are merged and do not exist anymore. Then I calculate an array of distances between the new cluster and all other clusters, and put the array in the row and column corresponding to c_i.
Let me make it clear: the size of the distance matrix never changes throughout the process. In each iteration, of the 2 rows and 2 columns corresponding to the 2 nearest clusters, I fill one row and one column with np.inf and put the distance array of the new cluster in the other row and column.
Now the bottleneck of the performance is finding the smallest entry in the distance matrix, which takes 0.008s. The run time of the whole algorithm is about 40 minutes.
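For concreteness, one iteration of the update described above might look like this sketch (toy data of my own; new_dists is a placeholder for the freshly computed distances to the merged cluster):

import numpy as np

np.random.seed(0)
n = 8
dis_matrix = np.random.rand(n, n)
dis_matrix[np.tril_indices(n)] = np.inf      # diagonal and lower triangle are unused

# one merge iteration
i, j = np.unravel_index(np.argmin(dis_matrix), dis_matrix.shape)  # nearest pair, i < j
dis_matrix[[i, j], :] = np.inf               # retire both old clusters
dis_matrix[:, [i, j]] = np.inf
new_dists = np.random.rand(n)                # placeholder for dist(new cluster, others)
new_dists[j] = np.inf                        # cluster j no longer exists
dis_matrix[i, i + 1:] = new_dists[i + 1:]    # re-open row i for the merged cluster
dis_matrix[:i, i] = new_dists[:i]            # and column i (upper triangle only)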
UPDATE 2: How I compute the distance matrix
Below is the code I use to generate the distance matrix:
from sklearn.metrics import pairwise_distances

dis_matrix = pairwise_distances(dataset)
for i in range(num_dim):
    for j in range(num_dim):
        if i >= j or (cluster_list[i].contain_reference_point and cluster_list[j].contain_reference_point):
            dis_matrix[i, j] = np.inf
Nevertheless, I should say that generating the distance matrix is not the bottleneck of the algorithm, because I generate it only once and then just update it (as described above).
If we back up a step, assuming the distance matrix is symmetric, based on an (i, n)-shaped array of i points in n dimensions, and the distance metric is Euclidean, this can be done very efficiently with a KDTree data structure:
i = 16000
n = 3
points = np.random.rand(i, n) * 100

from scipy.spatial import cKDTree

tree = cKDTree(points)
close = tree.sparse_distance_matrix(tree,
                                    max_distance=1,  # can tune for your application
                                    output_type="coo_matrix")
close.eliminate_zeros()
ix = close.data.argmin()
i, j = (close.row[ix], close.col[ix])
This is pretty blazing fast, but it depends on your application and distance metric if it's useful for you.
If you don't need the distance matrix at all (and only need indices), you can do:
d, ix = tree.query(points, 2)
j, i = ix[d[:, 1].argmin()]
EDIT: this doesn't work well for high-dimensional data. Since you're up against the curse of dimensionality, you'll probably need to brute force. I recommend scipy.spatial.distance.pdist for this:
from scipy.spatial.distance import pdist

D = pdist(points, metric='seuclidean')  # this returns only the condensed upper triangle
ix = np.argmin(D)

def ix_to_ij(ix, n):
    # map a condensed (pdist) index back to (i, j) in the square matrix
    row_ends = np.arange(n - 1, 0, -1).cumsum()  # row_ends[r] = first condensed index after row r
    i = np.searchsorted(row_ends, ix, side='right')
    j = ix - (row_ends[i - 1] if i > 0 else 0) + i + 1
    return i, j

ix_to_ij(ix, 16000)
Not completely tested but I think that should work.
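A quick sanity check of the mapping, using the ix_to_ij defined above (pdist's condensed ordering is the row-major upper triangle, the same order np.triu_indices(n, 1) produces):

import numpy as np

n = 6
rows, cols = np.triu_indices(n, k=1)
for ix in range(rows.size):
    assert ix_to_ij(ix, n) == (rows[ix], cols[ix])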
One thing I can think of that might give you a boost is using numba.njit:
import numpy as np
from numba import njit

@njit
def upper_min(m):
    x = np.inf
    for r in range(0, m.shape[0] - 1):
        for c in range(r + 1, m.shape[1]):
            if m[r, c] < x:
                x = m[r, c]
    return x
Be sure not to time the first call; compilation is slow.
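For example, a minimal timing sketch using time.perf_counter and the upper_min function from above:

import time
import numpy as np

m = np.random.rand(2000, 2000)
upper_min(m)  # first call triggers JIT compilation; exclude it from timing

t0 = time.perf_counter()
x = upper_min(m)
print(x, time.perf_counter() - t0)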
Another way could be to use sparse matrices somehow.
You can select the upper triangle of an array by masking; a simple example:
import numpy as np

arr = np.array([[0, 1], [2, 3]])
# Mask of the upper triangle (including the diagonal, in this toy example)
mask = np.array([[True, True], [False, True]])
# Masking returns the upper triangle as a 1D array
min_val = np.min(arr[mask])  # equal to np.min([0, 1, 3])
So instead of setting the lower triangle to inf, generate a mask where the lower triangle is False and the upper triangle is True, apply it with arr[mask] (which returns the upper triangle as a 1D array), and then take the min.
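A sketch of that for a general n, excluding the diagonal, and mapping the flat argmin back to 2D coordinates (arr[mask] and np.triu_indices enumerate the upper triangle in the same row-major order):

import numpy as np

n = 5
arr = np.random.rand(n, n)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
flat_ix = np.argmin(arr[mask])                    # position within the masked 1D view
rows, cols = np.triu_indices(n, k=1)              # same row-major order as arr[mask]
min_i, min_j = rows[flat_ix], cols[flat_ix]
print(min_i, min_j, arr[min_i, min_j])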
I am working on a problem which requires me to find all 6x6 (0,1) matrices with some given properties:
The sum of a row/column must be lower than 2.
The matrices are not symmetrical.
I am using this code:
import numpy as np
import itertools as it

n = 6
li = []
for i in it.product([0, 1], repeat=n**2):
    if (np.reshape(np.array(i), (n, n)).sum(axis=1) < 2).all() and (np.reshape(np.array(i), (n, n)).sum(axis=0) < 2).all():
        if (np.transpose(np.reshape(np.array(i), (n, n))) != np.reshape(np.array(i), (n, n))).any():
            li.append(np.reshape(np.array(i), (n, n)))
The problem is that this method has to go through all 68719476736 (0,1) matrices. After this piece of code I still have to impose extra conditions.
Is there a faster algorithm to find this list of matrices?
Edit:
The problem I am working on is one to find unique adjacency matrices (graph theory) up to a certain equivalence class. For instance, in the 4x4 version of the problem I wanted to find all (0,1) matrices such that:
The sum in a row/column is lower than 2;
Are not symmetrical, i.e. A^T != A;
Also A^T != P^T A P, where P is a matrix representation of the dihedral group D8 (order 8) which is a subgroup of S4.
After this last step I get a certain number of matrices. If A relates to B through B = P^T A P, then they represent the same graph, and I keep only one representative of each equivalence class.
In the 4x4 problem I go from 65536 to 3.
My estimate of the result after sorting through the first condition (sums) is 46080. In the 6x6 problem, the group of transformations P is of order 48.
Check your math: if the row/column sum must be less than 2, it can only be 0 or 1, which means every row/column can contain at most one non-zero element. That gives 7^6 = 117649 possible matrices (each row independently is either all zeros or has its single 1 in one of 6 columns).
Around 100k matrices is quite doable by brute force, with additional filtering to remove vertical/horizontal flips and diagonal symmetries.
Here's a simple code that should get you started:
import numpy as np
from itertools import permutations

for perm in permutations(range(7), 6):  # there are only 5040 permutations
    m = np.zeros((6, 6))  # start with an empty matrix
    for i, j in enumerate(perm):
        if j == 6:
            continue  # leave row i all zeros
        m[i, j] = 1  # put a 1 in row i, column j
    # here you check m for symmetry and save it somewhere or not
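For instance, a complete version with the non-symmetry filter plugged in (a sketch; the dihedral-group filtering would follow the same pattern):

import numpy as np
from itertools import permutations

li = []
for perm in permutations(range(7), 6):
    m = np.zeros((6, 6))
    for i, j in enumerate(perm):
        if j == 6:
            continue          # leave row i all zeros
        m[i, j] = 1
    if (m != m.T).any():      # condition 2: keep only non-symmetric matrices
        li.append(m)
print(len(li))

Note that since permutations draws distinct values, at most one row can be all zeros here; matrices with two or more empty rows would need a separate pass.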