Compute cosine similarity between 3D numpy array and 2D numpy array - python

I have a 3D numpy array A of shape (m, n, 300) and a 2D numpy array B of shape (p, 300).
For each of the m (n, 300) matrices in the 3D array, I want to compute its cosine similarity matrix with the 2D numpy array. Currently, I am doing the following:
result = []
for sub_matrix in A:
    result.append(sklearn.metrics.pairwise.cosine_similarity(sub_matrix, B))
The sklearn cosine_similarity function does not support operations with 3D arrays, so is there a more efficient way of computing this that does not involve using the for-loop?

You can reshape to 2D and use the same function -
from sklearn.metrics.pairwise import cosine_similarity
m,n = A.shape[:2]
out = cosine_similarity(A.reshape(m*n,-1), B).reshape(m,n,-1)
The output would be 3D after the reshape at the end, which is what you would get after array conversion of result.
Sample run -
In [336]: from sklearn.metrics.pairwise import cosine_similarity
...:
...: np.random.seed(0)
...: A = np.random.rand(5,4,3)
...: B = np.random.rand(2,3)
...:
...: # loop-based reference
...: result = []
...: for sub_matrix in A:
...:     result.append(cosine_similarity(sub_matrix, B))
...: out_org = np.array(result)
...:
...: # reshape-based version
...: m,n = A.shape[:2]
...: out = cosine_similarity(A.reshape(m*n,-1), B).reshape(m,n,-1)
...:
...: print(np.allclose(out_org, out))
True
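The same idea can also be sketched in plain NumPy, which may help if scikit-learn is not available. This is only an illustrative sketch, not sklearn's implementation: the helper name cosine_similarity_3d is hypothetical, and it assumes that no row of A or B is all zeros.

```python
import numpy as np

def cosine_similarity_3d(A, B):
    """Cosine similarity of every row of A (m, n, d) with every row of B (p, d).

    Hypothetical helper; assumes there are no all-zero rows.
    Returns an array of shape (m, n, p).
    """
    # Scale every d-dimensional vector to unit length.
    A_unit = A / np.linalg.norm(A, axis=-1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=-1, keepdims=True)
    # (m, n, d) @ (d, p) broadcasts over the leading axis -> (m, n, p).
    return A_unit @ B_unit.T

rng = np.random.default_rng(0)
A = rng.random((5, 4, 3))
B = rng.random((2, 3))
out = cosine_similarity_3d(A, B)

# Element-by-element reference, straight from the definition.
ref = np.array([
    [[a @ b / (np.linalg.norm(a) * np.linalg.norm(b)) for b in B] for a in sub]
    for sub in A
])
```

Here out and ref should agree up to floating-point rounding, and out has shape (5, 4, 2).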

Related

Pick 3D numpy array with 1D array of indices

Given two NumPy arrays:
3d_array with shape (100,10,2),
1d_indices with shape (100)
What is the NumPy equivalent of this loop?
result = []
for i,j in zip(range(len(3d_array)),1d_indices):
    result.append(3d_array[i,j])
The result should have shape (100,2).
The closest I've come is using NumPy fancy indexing:
result = 3d_array[np.arange(len(3d_array)), 1d_indices]
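Note that 3d_array[:, 1d_indices].reshape(-1,2) is not the same as the zip loop in the question: it combines every row with every index, like a nested loop, and so returns shape (10000,2) rather than (100,2). The fancy-indexing expression in the question is already the vectorized form of the zip loop. Example of the nested-loop equivalence: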
import numpy as np
a = np.arange(100*10*2).reshape(100,10,2) # 3d array
b = np.random.randint(0, 10, 100) # 1d indices
def fun(a,b):
    result = []
    for i in range(len(a)):
        for j in b:
            result.append(a[i,j])
    return np.array(result)
assert (a[:, b].reshape(-1, 2) == fun(a, b)).all()
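To make the difference between the paired (zip) selection and the all-combinations selection concrete, here is a small sketch. The names arr_3d and idx_1d are stand-ins I chose, since 3d_array and 1d_indices start with a digit and are not valid Python identifiers.

```python
import numpy as np

arr_3d = np.arange(100 * 10 * 2).reshape(100, 10, 2)
rng = np.random.default_rng(0)
idx_1d = rng.integers(0, 10, size=100)

def paired_loop(arr, idx):
    # The loop from the question: row i is paired with idx[i] only.
    return np.array([arr[i, j] for i, j in zip(range(len(arr)), idx)])

# Paired fancy indexing: one (i, idx[i]) pair per row -> shape (100, 2).
paired = arr_3d[np.arange(len(arr_3d)), idx_1d]

# Same selection with take_along_axis; the indices need the same ndim as arr_3d.
via_take = np.take_along_axis(arr_3d, idx_1d[:, None, None], axis=1)[:, 0, :]

# Slice plus index array: every row with every index -> shape (10000, 2).
all_pairs = arr_3d[:, idx_1d].reshape(-1, 2)
```

Only paired and via_take reproduce the zip loop; all_pairs is a hundred times larger.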

Apply function to numpy matrix dependent on position

Given a 2-d numpy array, X, of shape [m,m], I wish to apply a function and obtain a new 2-d numpy matrix P, also of shape [m,m], whose [i,j]th element is obtained as follows:
P[i][j] = exp (-|| X[i] - X[j] ||**2)
where ||.|| represents the standard L-2 norm of a vector. Is there any way faster than a simple nested for loop?
For example,
X = [[1,1,1],[2,3,4],[5,6,7]]
Then, at diagonal entries the rows accessed will be the same and the norm/magnitude of their difference will be 0. Hence,
P[0][0] = P[1][1] = P[2][2] = exp (0) = 1.0
Also,
P[0][1] = exp (- || X[0] - X[1] ||**2) = exp (- || [-1,-2,-3] || ** 2) = exp (-14)
etc.
The most trivial solution using a nested for loop is as follows:
import numpy as np
X = np.array([[1,2,3],[4,5,6],[7,8,9]])
P = np.zeros (shape=[len(X),len(X)])
for i in range (len(X)):
    for j in range (len(X)):
        P[i][j] = np.exp (- np.linalg.norm (X[i]-X[j])**2)
print (P)
This prints:
P = [[1.00000000e+00 1.87952882e-12 1.24794646e-47]
[1.87952882e-12 1.00000000e+00 1.87952882e-12]
[1.24794646e-47 1.87952882e-12 1.00000000e+00]]
Here, m is of the order of 5e4.
In [143]: X = np.array([[1,2,3],[4,5,6],[7,8,9]])
...: P = np.zeros (shape=[len(X),len(X)])
...: for i in range (len(X)):
...:     for j in range (len(X)):
...:         P[i][j] = np.exp (- np.linalg.norm (X[i]-X[j]))
...:
In [144]: P
Out[144]:
array([[1.00000000e+00, 5.53783071e-03, 3.06675690e-05],
[5.53783071e-03, 1.00000000e+00, 5.53783071e-03],
[3.06675690e-05, 5.53783071e-03, 1.00000000e+00]])
A no-loop version:
In [145]: np.exp(-np.sqrt(((X[:,None,:]-X[None,:,:])**2).sum(axis=2)))
Out[145]:
array([[1.00000000e+00, 5.53783071e-03, 3.06675690e-05],
[5.53783071e-03, 1.00000000e+00, 5.53783071e-03],
[3.06675690e-05, 5.53783071e-03, 1.00000000e+00]])
I had to drop your **2 to match values.
With the norm applied to the 3d difference array:
In [148]: np.exp(-np.linalg.norm(X[:,None,:]-X[None,:,:], axis=2))
Out[148]:
array([[1.00000000e+00, 5.53783071e-03, 3.06675690e-05],
[5.53783071e-03, 1.00000000e+00, 5.53783071e-03],
[3.06675690e-05, 5.53783071e-03, 1.00000000e+00]])
In one of the scikit packages (learn?) there's a cdist that may handle this sort of thing faster.
As hpaulj mentioned cdist does it better. Try the following.
from scipy.spatial.distance import cdist
import numpy as np
np.exp(-cdist(X,X,'sqeuclidean'))
Notice the sqeuclidean. This means that scipy does not take the square root so you don't have to square like you did above with the norm.
This would be easier if you provided a sample array. You can create an array Q of shape [m, m, m] where Q[i, j, k] = X[i, k] - X[j, k] by using
X[:,None,:] - X[None,:,:]
At this point, you're performing simple numpy operations along the third axis.
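With m on the order of 5e4, the broadcast difference array is likely too large to hold in memory. One common workaround, sketched here under my own assumptions (the function name gaussian_kernel is hypothetical), uses the identity ||a - b||**2 = ||a||**2 + ||b||**2 - 2*a.b so that only 2-D arrays are created:

```python
import numpy as np

def gaussian_kernel(X):
    """Return P with P[i, j] = exp(-||X[i] - X[j]||**2), without a 3-D intermediate.

    Hypothetical helper, shown as a sketch rather than a tuned implementation.
    """
    # Squared L2 norm of each row, shape (m,).
    sq_norms = np.einsum('ij,ij->i', X, X)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can produce tiny negative values; clip them to zero.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.exp(-sq_dists)

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
P = gaussian_kernel(X)

# Reference using the broadcast difference array, with the square kept.
P_ref = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
```

Note that the m x m result itself is still large at that scale (roughly 20 GB in float64 for m = 5e4), so processing X in row blocks may be needed as well.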

Vectorizing Numpy 3D and 2D array operation

I'm trying to create K MxN matrices in Python, stored in a (M,N,K) numpy array, C, from two matrices, A and B, with shapes (K,M) and (K,N) respectively. The first matrix is computed as C0 = a0.T x b0, where a0 is the first row of A and b0 is the first row of B, the second matrix as C1 = a1.T x b1, and so on.
Right now I'm using a for loop to compute the matrices.
import numpy as np
A = np.random.random((10,800))
B = np.random.random((10,500))
C = np.zeros((800,500,10))
for k in range(10):
    C[:,:,k] = A[k,:][:,None] @ B[k,:][None,:]
Since the operations are independent, I was wondering if there was some pythonic way to avoid the for loop. Perhaps I can vectorize the code, but I fail to see how it could be done.
In [235]: A = np.random.random((10,800))
...: B = np.random.random((10,500))
...: C = np.zeros((800,500,10))
...: for k in range(10):
...:     C[:,:,k] = A[k,:][:,None] @ B[k,:][None,:]
...:
In [236]: C.shape
Out[236]: (800, 500, 10)
Batched matrix product, followed by transpose
In [237]: np.allclose((A[:,:,None]@B[:,None,:]).transpose(1,2,0), C)
Out[237]: True
But since the matrix product axis is size 1, and there's no other summation, broadcasted multiply is just as good:
In [238]: np.allclose((A[:,:,None]*B[:,None,:]).transpose(1,2,0), C)
Out[238]: True
Execution time is about the same for both.
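Another way to write the same batch of outer products is np.einsum, which places the k axis last directly and so needs no transpose. The sketch below uses smaller shapes than the question (my choice, to keep it quick) and checks the result against the loop:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 10, 8, 5          # smaller than the question's 10, 800, 500
A = rng.random((K, M))
B = rng.random((K, N))

# Loop-based reference: one outer product per k.
C_loop = np.zeros((M, N, K))
for k in range(K):
    C_loop[:, :, k] = np.outer(A[k], B[k])

# einsum: C[m, n, k] = A[k, m] * B[k, n]; no index is summed over.
C_einsum = np.einsum('km,kn->mnk', A, B)
```

The subscript string spells out the output layout, which some readers find easier to follow than the None-indexing plus transpose.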

Python np.asarray does not return the true shape

I run a loop over two sub-tables of my original table.
When I run the loop and check the shape of the result, I get (1008,), while the shape should be (1008,168,252,3). Is there a problem in my loop?
train_images2 = []
for i in range(len(train_2)):
    im = process_image(Image.open(train_2['Path'][i]))
    train_images2.append(im)
train_images2 = np.asarray(train_images2)
The problem is that your process_image() function is returning a scalar instead of the processed image (i.e. a 3D array of shape (168,252,3)), so the variable im is just a scalar. Because of this, train_images2 ends up as a 1D array. Below is a contrived example that illustrates this:
In [59]: train_2 = range(1008)
In [65]: train_images2 = []
In [66]: for i in range(len(train_2)):
...:     im = np.random.random_sample()
...:     train_images2.append(im)
...: train_images2 = np.asarray(train_images2)
...:
In [67]: train_images2.shape
Out[67]: (1008,)
So, the fix is that you should make sure that process_image() function returns a 3D array as in the below contrived example:
In [58]: train_images2 = []
In [59]: train_2 = range(1008)
In [60]: for i in range(len(train_2)):
...:     im = np.random.random_sample((168,252,3))
...:     train_images2.append(im)
...: train_images2 = np.asarray(train_images2)
...:
# indeed a 4D array as you expected
In [61]: train_images2.shape
Out[61]: (1008, 168, 252, 3)
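One way to catch this kind of problem early is to check the shape of every item before stacking. The helper below is a hypothetical sketch (stack_images is not part of NumPy, and the small (4, 6, 3) shape merely stands in for (168, 252, 3)):

```python
import numpy as np

def stack_images(images, expected_shape):
    """Stack a list of images into one array, failing loudly on a bad item.

    Hypothetical helper for illustration only.
    """
    for idx, im in enumerate(images):
        shape = np.shape(im)  # () for a plain Python or NumPy scalar
        if shape != expected_shape:
            raise ValueError(f"item {idx} has shape {shape}, expected {expected_shape}")
    return np.stack(images)

# Well-formed input: five small "images".
good = [np.zeros((4, 6, 3)) for _ in range(5)]
batch = stack_images(good, (4, 6, 3))

# Scalars instead of images: np.asarray silently gives a 1D array,
# while the helper reports which item is wrong.
scalars = [0.5 for _ in range(5)]
silent_shape = np.asarray(scalars).shape
try:
    stack_images(scalars, (4, 6, 3))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With this check in place, the first offending item is named in the error instead of surfacing later as an unexpected (1008,) shape.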

Multiply array of vectors with array of matrices; return array of vectors?

I've got a numpy array of row vectors of shape (n,3) and another numpy array of matrices of shape (n,3,3). I would like to multiply each of the n vectors with the corresponding matrix and return an array of shape (n,3) of the resulting vectors.
By now I've been using a for loop to iterate through the n vectors/matrices and do the multiplication item by item.
I would like to know if there's a more numpy-ish way of doing this. A way without the for loop that might even be faster.
//edit 1:
As requested, here's my loopy code (with n = 10):
arr_in = np.random.randn(10, 3)
matrices = np.random.randn(10, 3, 3)
arr_out = np.empty_like(arr_in) # preallocate the (10, 3) output
for i in range(arr_in.shape[0]): # 10 iterations
    arr_out[i] = np.asarray(np.dot(arr_in[i], matrices[i]))
That dot-product is essentially performing reduction along axis=1 of the two input arrays. The dimensions could be represented like so -
arr_in : n 3
matrices : n 3 3
So, one way to solve it would be to "push" the dimensions of arr_in to front by one axis/dimension, thus creating a singleton dimension at axis=2 in a 3D array version of it. Then, sum-reducing the elements along axis = 1 would give us the desired output. Let's show it -
arr_in : n [3] 1
matrices : n [3] 3
Now, this could be achieved through two ways.
1) With np.einsum -
np.einsum('ij,ijk->ik',arr_in,matrices)
2) With NumPy broadcasting -
(arr_in[...,None]*matrices).sum(1)
Runtime test and verify output (for einsum version) -
In [329]: def loop_based(arr_in,matrices):
...:     arr_out = np.zeros((arr_in.shape[0], 3))
...:     for i in range(arr_in.shape[0]):
...:         arr_out[i] = np.dot(arr_in[i], matrices[i])
...:     return arr_out
...:
...: def einsum_based(arr_in,matrices):
...:     return np.einsum('ij,ijk->ik',arr_in,matrices)
...:
In [330]: # Inputs
...: N = 16935
...: arr_in = np.random.randn(N, 3)
...: matrices = np.random.randn(N, 3, 3)
...:
In [331]: np.allclose(einsum_based(arr_in,matrices),loop_based(arr_in,matrices))
Out[331]: True
In [332]: %timeit loop_based(arr_in,matrices)
10 loops, best of 3: 49.1 ms per loop
In [333]: %timeit einsum_based(arr_in,matrices)
1000 loops, best of 3: 714 µs per loop
You could use np.einsum. To get v.dot(M) for each vector-matrix pair, use np.einsum("...i,...ij", arr_in, matrices). To get M.dot(v) use np.einsum("...ij,...i", matrices, arr_in)
