Compute cosine similarity against every element of a dask matrix - python

My goal is to find the Top N vectors in a large 3D dask array (~100k rows per side, or more, would be nice) that are most cosine-similar to a target vector. I can get the Top 1, but only for smaller values of n; n=500 takes over 2 hours. I'm doing something incorrectly with dask, but I'm not sure what. Also, is there a vectorized way to get the cosine similarity instead of the for loop? In pure numpy I can get to n ≈ 6000 before I hit a MemoryError. A dtype of float16 is enough accuracy and an attempt to save space. If dask isn't the right tool, I'd be open to something else too.
import dask.array as da
import numpy as np
from numpy.linalg import norm
# create a 2D matrix of n rows, each of length vec_len; ideally n is quite large, >100,000
start = 1
step = 1
n = 5
vec_len = 10
shape = [n, vec_len]
end = np.prod(shape) * step + start
arr_2D = da.from_array(np.array(np.arange(start, end, step).reshape(shape), dtype=np.float16))
print(arr_2D.compute())
# sum each row with each other row using broadcasting, resulting in a 3D matrix
# each (i,j) location contains a vector that is the sum of the i-th and j-th original vectors
sums_3D = arr_2D[:, None] + arr_2D[None,:]
# make a target vector
target = np.array(range(vec_len,0,-1))
print('target:', target)
# brute-force way to get the cosine of each vector in the 3D matrix with the target vector
da_cos = da.empty(shape=(n, n), dtype=np.float16)
for i in range(n):  # <----- is there a way to vectorize this for loop??
    print('row:', i)
    for j in range(i + 1, n):  # i+1: to get only the upper triangle
        cur = sums_3D[i, j]
        cosine = np.dot(target, cur) / (norm(target) * norm(cur))
        da_cos[i, j] = cosine
print(da_cos.compute(), da_cos.dtype, da_cos.shape)
# Get top match <------ how would I get the Top N matches??
ar_max = da_cos.argmax().compute()
best_1, best_2 = np.unravel_index(ar_max, (n,n))
print(da_cos.max().compute(), best_1, best_2)
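One way to vectorize the loop (a sketch, not part of the original question, assuming the arrays sums_3D, target and n defined above): compute all dot products and norms at once with da.einsum, mask the lower triangle, and use da.argtopk to pull out the Top N matches.
import dask.array as da
import numpy as np

# cast up from float16 so the squared norms don't overflow
sums32 = sums_3D.astype(np.float32)
target32 = target.astype(np.float32)

dots = da.einsum('ijk,k->ij', sums32, target32)            # all (i, j) dot products with the target
norms = da.sqrt(da.einsum('ijk,ijk->ij', sums32, sums32))  # norm of every summed vector
cos = dots / (norms * np.linalg.norm(target32))

# keep only the strict upper triangle, as in the loop above
upper = da.triu(da.ones((n, n), dtype=bool), k=1)
cos = da.where(upper, cos, -np.inf)

# Top N matches: argtopk returns flat indices into the raveled (n*n,) array
top_n = 3
flat_idx = da.argtopk(cos.ravel(), top_n).compute()
best_i, best_j = np.unravel_index(flat_idx, (n, n))
Note that for very large n the 3D sums array itself becomes the memory bottleneck; the dot products and norms of the pairwise sums can also be expanded algebraically from the 2D array, but that is beyond this sketch.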

Related

Increase speed of finding minimum element in a 2-D numpy array which has many entries set to np.inf

I have a 16000*16000 matrix and want to find the minimum entry. This matrix is a distance matrix, so it is symmetric about the diagonal. In order to get exactly one minimum each time, I set the lower triangle and the diagonal to np.inf. Below is a 5*5 example:
inf a0 a1 a2 a3
inf inf a4 a5 a6
inf inf inf a7 a8
inf inf inf inf a9
inf inf inf inf inf
I want to find the index of the minimum entry only in the upper triangle. However, when I use np.argmin(), it will still go through the whole matrix. Is there any way to "ignore" the lower triangle and increase speed?
I have tried many methods, such as:
Use masked array
Use triu_indices() to extract the upper triangle and then find the minimum
Set the entries in the lower triangle and diagonal to None instead of np.inf, then use np.nanargmin() to find the minimum
However, all of the methods I tried are slower than using np.argmin() directly.
Thank you for your time, I would appreciate it if you can help me.
UPDATE 1: Some background of my problem
In fact, I am implementing a modified version of agglomerative clustering from scratch. The original dataset is 16000*64 (I have 16000 points, each 64-dimensional). At first, I build 16000 clusters, each containing exactly one point. In each iteration, I find the nearest 2 clusters and merge them, until the termination condition is met.
To avoid repeated distance calculations, I store the distances in a 16000*16000 distance matrix, with the diagonal and lower triangle set to np.inf. In each iteration, I find the smallest entry in the distance matrix; the index of this entry corresponds to the 2 nearest clusters, say c_i and c_j. Afterwards, in the distance matrix, I fill the 2 rows and 2 columns corresponding to c_i and c_j with np.inf, which means that these 2 clusters are merged and no longer exist. Then I calculate an array of the distances between the new cluster and all other clusters, and put that array in the row and column corresponding to c_i.
To be clear: the size of the distance matrix never changes during the whole process. In each iteration, of the 2 rows and 2 columns corresponding to the 2 nearest clusters, I fill 1 row and 1 column with np.inf and put the distance array of the new cluster in the other row and column.
Now the performance bottleneck is finding the smallest entry in the distance matrix, which takes 0.008s. The run time of the whole algorithm is about 40 minutes.
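A minimal sketch of one such iteration, for illustration only (compute_new_distances is a hypothetical placeholder for the actual distance computation; dis_matrix is the 16000*16000 matrix described above):
import numpy as np

def merge_step(dis_matrix, compute_new_distances):
    # index of the smallest entry (the diagonal and lower triangle are already np.inf)
    c_i, c_j = np.unravel_index(np.argmin(dis_matrix), dis_matrix.shape)

    # distances from the merged cluster to all other clusters (hypothetical callback)
    d = compute_new_distances(c_i, c_j)

    # put the new distances in the row and column of c_i, keeping the upper-triangle convention
    dis_matrix[c_i, c_i + 1:] = d[c_i + 1:]
    dis_matrix[:c_i, c_i] = d[:c_i]
    dis_matrix[c_i, c_i] = np.inf

    # retire cluster c_j: its row and column are never considered again
    dis_matrix[c_j, :] = np.inf
    dis_matrix[:, c_j] = np.inf
    return c_i, c_j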
UPDATE 2: How I compute distance matrix
Below is the code I used to generate the distance matrix:
import numpy as np
from sklearn.metrics import pairwise_distances

dis_matrix = pairwise_distances(dataset)
for i in range(num_dim):
    for j in range(num_dim):
        if i >= j or (cluster_list[i].contain_reference_point and cluster_list[j].contain_reference_point):
            dis_matrix[i][j] = np.inf
Nevertheless, I need to say that generating the distance matrix is not the bottleneck in the algorithm now, because I generate it only once, and then I just update the distance matrix (as mentioned above).
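As a side note (not part of the original post), the nested loops above can be replaced with a vectorized fill. A sketch that ignores the contain_reference_point condition:
import numpy as np
from sklearn.metrics import pairwise_distances

dis_matrix = pairwise_distances(dataset)
dis_matrix[np.tril_indices(len(dis_matrix))] = np.inf  # diagonal and lower triangle in one shot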
If we back up a step: assuming the distance matrix is symmetric, built from an (i, n)-shaped array of i points in n dimensions, with a Euclidean distance metric, this can be done very efficiently with a KDTree data structure:
import numpy as np
from scipy.spatial import cKDTree

i = 16000
n = 3
points = np.random.rand(i, n) * 100
tree = cKDTree(points)
close = tree.sparse_distance_matrix(tree,
                                    max_distance=1,  # can tune for your application
                                    output_type="coo_matrix")
close.eliminate_zeros()
ix = close.data.argmin()
i, j = (close.row[ix], close.col[ix])
This is pretty blazing fast, but whether it's useful for you depends on your application and distance metric.
If you don't need the distance matrix at all (and only need indices), you can do:
d, ix = tree.query(points, 2)  # nearest 2 neighbours of every point; the first is the point itself
j, i = ix[d[:, 1].argmin()]    # row whose true nearest neighbour is closest overall
EDIT: this doesn't work well for high-dimensional data. Since you're up against the curse of dimensionality, you'll probably need to brute-force it. I recommend scipy.spatial.distance.pdist for this:
from scipy.spatial.distance import pdist
D = pdist(points, metric='seuclidean')  # returns only the upper triangle, as a condensed 1D array
ix = np.argmin(D)

def ix_to_ij(ix, n):
    # starting condensed index of each row i of the upper triangle: 0, n-1, (n-1)+(n-2), ...
    starts = np.concatenate(([0], np.arange(n - 1, 1, -1).cumsum()))
    i = np.searchsorted(starts, ix, side='right') - 1
    j = ix - starts[i] + i + 1
    return i, j

ix_to_ij(ix, 16000)
Not completely tested but I think that should work.
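A quick way to sanity-check the index mapping (not in the original answer): np.triu_indices enumerates pairs in the same row-major, upper-triangle order as pdist's condensed output, so the two should agree.
n_check = 10
rows, cols = np.triu_indices(n_check, 1)
for k in range(len(rows)):
    i_chk, j_chk = ix_to_ij(k, n_check)
    assert i_chk == rows[k] and j_chk == cols[k]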
One thing I can think of that might give you a boost is using numba.njit:
from numba import njit
import numpy as np

@njit
def upper_min(m):
    # brute-force scan of the strict upper triangle for the minimum
    x = np.inf
    for r in range(0, m.shape[0] - 1):
        for c in range(r + 1, m.shape[1]):
            if m[r, c] < x:
                x = m[r, c]
    return x
Be sure not to time it the first time you run it. The compilation is slow.
Another way could be to use sparse matrices somehow.
You can select upper triangle of array by masking, simple example:
import numpy as np
arr = np.array([[0, 1], [2, 3]])
# Mask of upper triangle
mask = np.array([[True, True],[False, True]])
# Masking returns only upper triangle as 1D array
min_val = np.min(arr[mask]) # Equal to np.min([0, 1, 3])
So instead of setting the lower triangle to inf, you generate a mask where the lower triangle is False and the upper triangle is True, apply it with arr[mask] (which returns the upper triangle as a 1D array), and then take the min.
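For a general n x n matrix, a small sketch of the same idea (not in the original answer), using np.triu to build the mask and recovering the (row, col) index of the minimum:
import numpy as np

n = 5
arr = np.random.rand(n, n)

mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
rows, cols = np.nonzero(mask)                     # coordinates of the masked entries
flat = arr[mask]                                  # upper triangle as a 1D array

k = flat.argmin()
min_val = flat[k]
i, j = rows[k], cols[k]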

Generating linearly independent columns for a matrix

As the title suggests, I want to generate a random N x d matrix (N - number of examples, d - number of features) where each column is linearly independent of the other columns. How can I implement this using numpy and Python?
If you just generate the vectors at random, the chance that the column vectors will not be linearly independent is vanishingly small (assuming N >= d).
Let A = [B | x], where A is an N x d matrix, B is an N x (d-1) matrix with independent column vectors, and x is a column vector with N elements. The set of all x with no constraints is a subspace of dimension N, while the set of all x that are NOT linearly independent of the column vectors in B is a subspace of dimension d-1 (since every column vector in B serves as a basis vector for this set).
Since you are dealing with bounded, discrete numbers (likely doubles, floats, or integers), the probability of the matrix not having linearly independent columns will not be exactly zero. In general, the more possible values each element can take, the more likely the matrix is to have independent column vectors.
Therefore, I recommend you choose the elements at random. You can always verify after the fact that the matrix has linearly independent column vectors by calculating its column-echelon form. You could do this with np.random.rand(N,d).
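A small sketch of that check (not in the original answer; it uses the matrix rank rather than an explicit column-echelon form, which numpy does not provide directly):
import numpy as np

N, d = 1000, 200
A = np.random.rand(N, d)
assert np.linalg.matrix_rank(A) == d  # all d columns are linearly independent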
One way to guarantee random independent columns would be to iteratively add a random column and check matrix rank:
import numpy as np

N, d = 1000, 200
M = np.random.rand(N, 1)
r = 1  # current matrix rank
while r < d:
    t = np.random.rand(N, 1)
    if np.linalg.matrix_rank(np.hstack([M, t])) > r:
        M = np.hstack([M, t])
        r += 1
However, this process is quite slow, since it requires computing the rank of a matrix at least d times.
A faster approach would be to generate a random Nxd 2d-array and check its rank:
M = np.random.rand(N, d)
r = np.linalg.matrix_rank(M)
while r < d:
    M = np.random.rand(N, d)
    r = np.linalg.matrix_rank(M)
This will most likely never enter the while loop; the check is there so that, if it ever does, another random 2d-array is generated.
You can still have a small degree of correlation, simply by chance, if your number of observations is small.
One way of ensuring uncorrelated columns is to use the principal component scores. A brief explanation from Wikipedia:
Repeating this process yields an orthogonal basis in which different individual dimensions of the data are uncorrelated. These basis vectors are called principal components, and several related procedures principal component analysis (PCA).
We can see this below:
from sklearn.decomposition import PCA
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

N = 50
d = 20
a = np.random.normal(0, 1, (N, d))
pca = PCA(n_components=d)
pca.fit(a)
pc_scores = pca.transform(a)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(np.corrcoef(np.transpose(a)), ax=ax[0], cmap="YlGnBu")
sns.heatmap(np.corrcoef(np.transpose(pc_scores)), ax=ax[1], cmap="YlGnBu")
The left heatmap shows that the raw columns can still have some degree of correlation by chance (drawing from a standard normal, but with a small sample size), while the principal component scores on the right are uncorrelated.

Get list of X minimum distances by their indices

I have a huge matrix (think 20000 x 1000) called Z that I need to generate pairwise distances from, so I'm currently using sklearn.metrics.pairwise.euclidean_distances(Z, Z) to generate them.
However, I now need to search through the result to find the smallest X distances, along with their indices.
An example would be:
A = 20000 x 1000 numpy.ndarray
B = sklearn.metrics.pairwise.euclidean_distances(A, A)
C = ((2400,100), (800,900), (29,999)) if X = 3
What would be the best way to go about doing this? I saw numpy.unravel_index(a.argmax(), a.shape) but I'm not sure it would work well for this instance.
You can use np.triu_indices to generate the indices that correspond to entries of the compressed distance matrix.
import numpy as np
from scipy.spatial.distance import pdist
# Generate points
Z = np.random.normal(0, 1, (1000, 3))
# Compute euclidean distance
distance = pdist(Z)
# Get the smallest distance
min_distance = np.min(distance)
# Get the indices (k = 1 to omit diagonal entries)
idx = np.asarray(np.triu_indices(len(Z), 1))
# Filter the indices (this is assuming that the minimum distance is not unique)
idx = idx[:, distance == min_distance]
If you know that there is exactly one minimum distance, you could also use
idx = idx[:, np.argmin(distance)]
which is slightly more efficient.
EDIT: To get the sorted indices, try the following
idx = idx[:, np.argsort(distance)]
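To get the X smallest distances together with their index pairs, as asked in the question, here is a short sketch extending the above (it assumes idx is the full np.triu_indices array from above, before the filtering step; X = 3 to match the example):
X = 3
order = np.argpartition(distance, X)[:X]     # indices of the X smallest distances (unsorted)
order = order[np.argsort(distance[order])]   # sort just those X
closest_pairs = idx[:, order].T              # (X, 2) array of (i, j) pairs
closest_dists = distance[order]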

fastest way to get closest 10 euclidean neighbors of large feature vector in python

I have a numpy array that has 10,000 vectors with 3,000 elements each. I want to return the indices of the top 10 closest pairs, together with the distance between them. So if row 5 and row 7 have the closest euclidean distance of 0.005, and row 8 and row 10 have the second closest euclidean distance of 0.0052, then I want to return [(8,10,.0052),(5,7,.005)]. The traditional for-loop method is very slow. Is there a quicker alternative for getting the euclidean neighbors of large feature vectors (stored as an np array)?
I'm doing the following:
l = []
for i in range(0, M.shape[0]):
    for j in range(0, M.shape[0]):
        if i != j and i > j:
            l.append((i, j, euc(M[i], M[j])))
return l
Here euc is a function to calculate euclidean distances between two vectors of a matrix using scipy.
Then I sort l and pull out the top 10 closest distances
import numpy as np

def topTen(M):
    i, j = np.triu_indices(M.shape[0], 1)
    dist_sq = np.einsum('ij,ij->i', M[i] - M[j], M[i] - M[j])
    max_i = np.argpartition(dist_sq, 10)[:10]
    max_o = np.argsort(dist_sq[max_i])
    return np.vstack((i[max_i][max_o], j[max_i][max_o], dist_sq[max_i][max_o]**.5)).T
This should be pretty fast, as it only does the sorting and the square root on the top 10, which are the slow steps (apart from the looping).
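Example usage (a sketch, not from the original answer; a smaller random array stands in for the real (10000, 3000) data, since M[i] - M[j] materializes every pair of rows):
M = np.random.rand(300, 50)   # small stand-in for the real feature matrix
closest = topTen(M)           # 10 rows of (i, j, distance), closest pair first
print(closest[:3])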
I'll post this as an answer, but I admit it is not a real solution to the question, because it will only work for smaller arrays. The problem is that, if you want to be really fast and avoid loops, you would need to compute all the pairwise distances at once, and that implies a memory complexity on the order of the square of the input... Let's say 10,000 rows * 10,000 rows * 3,000 elems/row * 4 bytes/element (say we're using float32) ≈ 1.2TB (!) of memory required (actually maybe twice that, because you probably need a couple of arrays of that size). So while it is possible, it is not practical for these kinds of sizes. The following code shows how you could implement it (with the sizes divided by 100).
import numpy as np
# Row length
n = 30
# Number of rows
m = 100
# Number of top elements
k = 10
# Input data
data = np.random.random((m, n))
# Tile the data in two different dimensions
data1 = np.tile(data[:, :, np.newaxis], (1, 1, m))
data2 = np.tile(data.T[np.newaxis, :, :], (m, 1, 1))
# Compute pairwise squared distances
dist = np.sum(np.square(data1 - data2), axis=1)
# Fill lower half with inf to avoid repeat and self-matching
dist[np.tril_indices(m)] = np.inf
# Find smallest distance for each row
i = np.arange(m)
j = np.argmin(dist, axis=1)
dmin = dist[i, j]
# Pick the top K smallest distances
idx = np.stack((i, j), axis=1)
isort = dmin.argsort()
# Top K indices pairs (K x 2 matrix)
top_idx = idx[isort[:k], :]
# Top K smallest distances
top_dist = np.sqrt(dmin[isort[:k]])

Creating a sparse matrix from lists of sub matrices (Python)

This is my first SO question ever. Let me know if I could have asked it better :)
I am trying to find a way to splice together lists of sparse matrices into a larger block matrix.
I have python code that generates lists of square sparse matrices, matrix by matrix. In pseudocode:
Lx = [Lx1, Lx2, ... Lxn]
Ly = [Ly1, Ly2, ... Lyn]
Lz = [Lz1, Lz2, ... Lzn]
Since each individual Lx1, Lx2 etc. matrix is computed sequentially, they are appended to a list; I could not find a way to populate an array-like object "on the fly".
I am optimizing for speed, and the bottleneck features a computation of Cartesian products item-by-item, similar to the pseudocode:
M += J[i,j] * ( Lxi*Lxj + Lyi*Lyj + Lzi*Lzj )
for all combinations of 0 <= i, j <= n (J is an n x n matrix of numbers).
It seems that vectorizing this by computing all the Cartesian products in one step via (pseudocode):
L = [ [Lx1, Lx2, ...Lxn],
[Ly1, Ly2, ...Lyn],
[Lz1, Lz2, ...Lzn] ]
product = L.T * L
would be faster. However, options such as np.bmat, np.vstack, np.hstack seem to require arrays as inputs, and I have lists instead.
Is there a way to efficiently splice the three lists of matrices together into a block? Or, is there a way to generate an array of sparse matrices one element at a time and then np.vstack them together?
Reference: Similar MATLAB code, used to compute the Hamiltonian matrix for n-spin NMR simulation, can be found here:
http://spindynamics.org/Spin-Dynamics---Part-II---Lecture-06.php
This is what scipy.sparse.bmat is for:
import scipy.sparse

L = scipy.sparse.bmat([Lx, Ly, Lz], format='csc')
LT = scipy.sparse.bmat(list(zip(Lx, Ly, Lz)), format='csr')  # blocks rearranged; not equivalent to L.T
product = LT * L
I have a "vectorized" solution, but it's almost twice as slow as the original code. Both the bottleneck shown above, and the final dot product shown in the last line below, take about 95% of the calculation time according to kernprof tests.
from scipy.sparse import bmat

# Create the matrix of column vectors from these lists
L_column = bmat([Lx, Ly, Lz], format='csc')
# Create the matrix of row vectors (via a transpose of the matrix with
# transposed blocks)
Lx_trans = [x.T for x in Lx]
Ly_trans = [y.T for y in Ly]
Lz_trans = [z.T for z in Lz]
L_row = bmat([Lx_trans, Ly_trans, Lz_trans], format='csr').T
product = L_row * L_column
I was able to get a tenfold speed increase by not using sparse matrices and using an array of arrays.
Lx = np.empty((1, nspins), dtype='object')
Ly = np.empty((1, nspins), dtype='object')
Lz = np.empty((1, nspins), dtype='object')
These are populated with the individual Lx arrays (formerly sparse matrices) as they are generated. Using the array structure allows the transpose and Cartesian product to perform as desired:
Lcol = np.vstack((Lx, Ly, Lz)).real
Lrow = Lcol.T # As opposed to sparse version of code, this works!
Lproduct = np.dot(Lrow, Lcol)
The individual Lx[n] matrices are still "bundled", so Lproduct is an n x n array of matrices. This means element-wise multiplication of the n x n J array with Lproduct works:
scalars = np.multiply(J, Lproduct)
Each matrix element is then added on to the final hamiltonian matrix:
for n in range(nspins):
    for m in range(nspins):
        M += scalars[n, m].real
