I have two numpy arrays: the first is of size 100*4*200 and the second is of size 150*6*200. In other words, I am storing 100 samples of 200-dimensional vector representations of 4 fields in array 1, and 150 samples of 200-dimensional vectors of 6 fields in array 2.
Now I want to compute the similarity between the samples and create a similarity matrix. For each pair of samples, I would like to calculate the similarity between each combination of fields and store it such that I get a 15000*24 dimensional array.
The first 150 rows will be the similarity vectors between the 1st row of array 1 and the 150 rows of array 2, the next 150 rows will be the similarity vectors between the 2nd row of array 1 and the 150 rows of array 2, etc.
Each similarity vector has length (# fields in array 1) * (# fields in array 2), i.e. the 1st element of the similarity vector is the cosine similarity between field 1 of array 1 and field 1 of array 2, the 2nd element is the similarity between field 1 of array 1 and field 2 of array 2, and so on, with the last element being the similarity between the last field of array 1 and the last field of array 2.
What is the best way to do this using numpy arrays?
So every "row" (i assume the first axis, that I'll call axis 0) is the sample axis. That means you have 100 samples from one vector, each with fieldsxdimentions 4x200.
Doing this the way you describe, then the first row of the first array would have (4,200) and the second one would then have (150,6,200). Then you'd want to do a cos distance between an (m,n), and (m,n,k) array, which does not make sense (the closest you have to a dot product here would be the tensor product, which I'm fairly sure is not what you want).
So we have to extract these first and then iterate over all the others.
To do this I actually recomend just splitting the array with np.split and iterate over both of them. This is just because I've never come across a faster way in numpy. You could use tensorflow to gain efficiency, but I'm not going into that here in my answer.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)

# We know the output will be (100*150) x (4*6) = 15000 x 24
c = np.empty([15000, 24])

# Make a list with the rows of a and the same for b
a_splitted = np.split(a, a.shape[0], 0)
b_splitted = np.split(b, b.shape[0], 0)

i = 0
for alpha in a_splitted:
    for beta in b_splitted:
        # alpha[0] is (4, 200) and beta[0] is (6, 200); gives a 4x6 matrix
        sim = cosine_similarity(alpha[0], beta[0])
        c[i, :] = sim.ravel()
        i += 1
For the similarity function above I just chose what @StefanFalk suggested: sklearn.metrics.pairwise.cosine_similarity. If this similarity measure is not sufficient, you could write your own.
I am not at all claiming that this is the best way to do this in all of Python. I think the most efficient way is to do this symbolically using, as mentioned, tensorflow.
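That said, if you want to stay in pure numpy, a fully vectorized version is possible by normalizing both arrays and letting np.einsum do the pairwise dot products. This is just a sketch I haven't benchmarked, and it assumes the same row ordering as the loop above:

import numpy as np

a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)

# Normalize the 200-dimensional vectors so that dot products become cosine similarities
a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
b_n = b / np.linalg.norm(b, axis=-1, keepdims=True)

# (100, 150, 4, 6): similarity of every field pair for every sample pair
sims = np.einsum('ifd,jgd->ijfg', a_n, b_n)

# Flatten to (15000, 24); row i*150 + j compares sample i of a with sample j of b
c = sims.reshape(100 * 150, 4 * 6)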
Anyways, hope it helps!
Related
I have a data frame with shape:
(20,30,1024)
I want to find the Euclidean distance between every entry and every other entry in the dataframe (ideally non-redundantly, i.e. without computing the distance between row 1 and row 5 and then again between row 5 and row 1, but I'm not there yet). I have this code:
from scipy.spatial.distance import pdist,squareform
distances = pdist(df_test,metric='euclidean')
dist_matrix = squareform(distances)
print(dist_matrix)
The error says:
A 2-dimensional array must be passed.
So I guess I want to convert my matrix from shape (20,30,1024) to (20,30720), and then calculate the pdist/squareform between the rows (i.e. 20 rows of vectors that are 30720 in length).
I know that I can use test_df[0:20].flatten().tolist()
But that completely flattened my matrix; the output shape was (1,614400).
Can someone show me how to convert a shape from (20,30,1024) to (20,30720), or tell me if I'm not going about this the right way?
The ultimate end goal is to calculate the Euclidean distance between all non-redundant pairs in a data set, but the data set is big, so I need to do it as efficiently as possible, without duplicating calculations.
The most straightforward way to reshape that I can think of, given how you described the problem, is:
df_test.values.reshape(20, -1)
By calling .values, you retrieve your dataframe's data as a numpy array. From there, .reshape finishes the job. Since you need a 2D array, you provide the size of the first dimension (in your case, 20) and pass -1 for the second; numpy then calculates it for you (in this case by multiplying the remaining dimension sizes of the original 3D array).
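Putting this together with your pdist/squareform code, a sketch (using random data in place of df_test) might look like this; note that pdist already computes each pair only once, which covers the non-redundancy requirement:

import numpy as np
from scipy.spatial.distance import pdist, squareform

arr = np.random.rand(20, 30, 1024)           # stand-in for df_test.values
flat = arr.reshape(20, -1)                   # (20, 30720)

distances = pdist(flat, metric='euclidean')  # condensed form: each pair computed once
dist_matrix = squareform(distances)          # (20, 20) symmetric matrix
print(dist_matrix.shape)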
I have a matrix A with 500 rows and 1024 columns. I would like to generate a matrix consisting of evenly spaced columns from A, say with step size 2^5. How do I do this in Numpy? I haven't seen this explained in the references I have.
You can just use slicing:
import numpy as np
arr = np.random.rand(500, 1024)
step_size = 2 ** 5
arr[:, ::step_size]  # shape is (500, 32)
What this does is keep all the rows while taking every step_size-th column. You can read about numpy indexing at the following link:
https://numpy.org/doc/stable/user/basics.indexing.html?highlight=indexing#other-indexing-options
You can apply the same logic to the rows, or to both rows and columns, to get more sophisticated slicing.
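For example (the step sizes here are arbitrary, just for illustration):

arr[::10, :]            # every 10th row, all columns
arr[::10, ::step_size]  # every 10th row and every 32nd column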
I am trying to subtract two matrices of different shapes using broadcasting. However, I am stuck at one point and need a simple solution to the problem.
Literally I am evaluating data on a grid (the first step is subtracting). For example, I have 5 grid points grid = (-20, -10, 0, 10, 20) and an array of data of length 100.
Line:
u = grid.reshape((ngrid, 1)) - data
works perfectly fine. ngrid = 5 in this trivial example.
The output is a matrix of 5 rows and 100 columns, so each point of data is evaluated at each point of the grid.
Next I want to do it for 2 grids and 2 data sets simultaneously (data has shape (100,2), e.g. two randn arrays). I have already succeeded in subtracting two data sets from one grid, but using two grids throws an error.
In the example below, a is the vertical array of the grid, 5 points long, and data is an array of random data of shape (100,2).
In this case u has shape (2,5,100), so u[0] and u[1] have 5 rows and 100 columns, meaning that the data was subtracted correctly from the grid.
The second line of the code below is what I am trying to do. The error is the following:
ValueError: operands could not be broadcast together with shapes (5,2) (2,1,100)
u = a - data.T[:, None] # a is vertical grid of 5 elements. Works ok.
u = grid_test - data.T[:, None] # grid_test is 2 column 5 row matrix of 2 grids. Error.
What I need is the same kind of line of code as above, but it should work when "a" contains 2 columns, i.e. two different grids. So in the end the expected result is a "u" which contains, in addition to the results described above, another two matrices where the same data (both arrays) is evaluated on the second grid.
Unfortunately I cannot use any loops - only vectorization and broadcasting.
Thanks in advance.
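A possible broadcasting solution, sketched under the assumption that you want every grid paired with every data set (the second grid here is made up for illustration; untested against your exact data):

import numpy as np

grid_test = np.stack([np.linspace(-20, 20, 5),
                      np.linspace(-40, 40, 5)], axis=1)  # (5, 2): two grids as columns
data = np.random.randn(100, 2)                           # (100, 2): two data sets as columns

# u[g, d, i, j] = grid_test[i, g] - data[j, d]
u = grid_test.T[:, None, :, None] - data.T[None, :, None, :]  # shape (2, 2, 5, 100)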
I have a (large) 4D array, consisting of the 5 coefficients in a given basis for a matrix field. Given the 5 basis matrices, I want to efficiently calculate the matrix field.
The coefficient field c[x,y,z,i] is the value of the i-th coefficient at position x,y,z,
the matrix field M[x,y,z,a,b] is the (3,3) matrix at position x,y,z,
and T_1,...,T_5 are the (3,3) basis matrices.
I could loop over each position in space:
M[x,y,z,:,:] = T_1[:,:]*c[x,y,z,0] + T_2[:,:]*c[x,y,z,1] + ... + T_5[:,:]*c[x,y,z,4]
But this is very inefficient. My attempts at using np.multiply and np.sum result in broadcasting errors due to the ambiguity of the desired product being a field of 3x3 matrices.
Keep in mind that to numpy, these 4d and 5d arrays are just that: 4d and 5d arrays, not 3d arrays containing 2d matrices, etc.
Let's try to write your calculation in a way that clarifies dimensions:
M[x,y,z] = T_1*c[x,y,z,0] + T_2*c[x,y,z,1] + ... + T_5*c[x,y,z,4]
M[x,y,z,:,:] = T_1[:,:]*c[x,y,z,0] + T_2[:,:]*c[x,y,z,1] + ... + T_5[:,:]*c[x,y,z,4]
c[x,y,z,i] is a coefficient, right? So M is a weighted sum of the T_n arrays?
One way of expressing this is:
T = np.stack([T_1, T_2, ...T_5], axis=0) # 3d (nab)
M = np.einsum('nab,xyzn->xyzab', T, c)
We could alternatively stack the T_i on a new last axis:
T = np.stack([T_1, T_2 ...T_5], axis=2) # (abn)
M = np.einsum('abn,xyzn->xyzab', T, c)
or, using the (a,b,n) stacking, as a broadcasted multiplication plus sum:
M = (T[None,None,None,:,:,:] * c[:,:,:,None,None,:]).sum(axis=-1)
I'm writing this code without testing, so there may be errors, but I think the basic outline is right.
It could also be written as a dot, if I can put the n dimension last in one argument and second-to-last in the other. Or with tensordot. But there's less control over broadcasting of the other dimensions.
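For instance, a tensordot version might look like this (a sketch, assuming T is stacked as (n,a,b)):

M = np.tensordot(c, T, axes=([3], [0]))  # contracts c's n axis with T's n axis -> (x,y,z,a,b)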
For test calculations you could also reshape these arrays so that x,y,z are rolled into one axis and a,b into another, e.g.
M[xyz,:] = T_n[ab]*c[xyz,n] # etc
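Since I haven't tested the code above, a quick self-check along these lines might be worthwhile (small shapes and names assumed for illustration):

import numpy as np

rng = np.random.default_rng(0)
c = rng.random((4, 5, 6, 5))                 # (x, y, z, n) with n = 5 coefficients
Ts = [rng.random((3, 3)) for _ in range(5)]  # stand-ins for T_1..T_5

T = np.stack(Ts, axis=0)                     # (n, a, b)
M = np.einsum('nab,xyzn->xyzab', T, c)       # (x, y, z, 3, 3)

# Brute-force check at a single position
x, y, z = 1, 2, 3
ref = sum(Ts[n] * c[x, y, z, n] for n in range(5))
assert np.allclose(M[x, y, z], ref)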
I have two numpy arrays: 'A' of size (w,h,2) and 'B' of size (n,2).
In other words, A is a 2-dimensional array of 2D vectors while B is a 1D array of 2D vectors.
What I want as a result is an array of size (w,h,n). The last dimension is an n-dimensional vector where each component is the Euclidean distance between the corresponding vector from A (indexed by the first two dimensions w and h) and the n-th vector of B.
I know that I can just loop through w, h and n in Python manually and calculate the distance for each element, but I'd like to know if there is a smart way to do that with numpy operations to increase performance.
I found some similar questions but unfortunately all of those use input arrays of the same dimensionality.
Approach #1
You could reshape A to 2D, use Scipy's cdist (which expects 2D arrays as inputs) to get those euclidean distances, and finally reshape back to 3D.
Thus, an implementation would be -
from scipy.spatial.distance import cdist
out = cdist(A.reshape(-1,2),B).reshape(w,h,-1)
Approach #2
Since the axis of reduction is of length 2 only, we can just slice the input arrays to save memory on intermediate arrays, like so -
np.sqrt((A[...,0,None] - B[:,0])**2 + (A[...,1,None] - B[:,1])**2)
Explanation of A[...,0,None] and A[...,1,None] :
With that None we are just introducing a new axis at the end of the sliced A. Let's take a small example -
In [54]: A = np.random.randint(0,9,(4,5,2))
In [55]: A[...,0].shape
Out[55]: (4, 5)
In [56]: A[...,0,None].shape
Out[56]: (4, 5, 1)
In [57]: B = np.random.randint(0,9,(3,2))
In [58]: B[:,0].shape
Out[58]: (3,)
So, we have :
A[...,0,None] : 4 x 5 x 1
B[:,0] : 3
That is essentially :
A[...,0,None] : 4 x 5 x 1
B[:,0] : 1 x 1 x 3
When the subtraction is performed, the singleton dims are broadcast to match the dimensions of the other participating array -
A[...,0,None] - B[:,0] : 4 x 5 x 3
We repeat this for the second index along the last axis, add the two squared arrays, and finally take the square root to get the final Euclidean distances.
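A quick way to sanity-check that the two approaches agree (a sketch with small random shapes):

import numpy as np
from scipy.spatial.distance import cdist

w, h, n = 4, 5, 3
A = np.random.rand(w, h, 2)
B = np.random.rand(n, 2)

out1 = cdist(A.reshape(-1, 2), B).reshape(w, h, -1)
out2 = np.sqrt((A[..., 0, None] - B[:, 0])**2 + (A[..., 1, None] - B[:, 1])**2)

assert np.allclose(out1, out2)  # both give shape (w, h, n)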