Cosine Similarity - python

I was reading and came across this formula:
The formula is for cosine similarity. I thought this looked interesting and I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix:
M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]]
Here the entry in row u and column i is the rating that user u gave to item i. I want to calculate this cosine similarity between the items (columns) of this matrix. This should yield a 5 x 5 matrix, I believe. I tried:
df = pd.DataFrame(M)
item_mean_subtracted = df.sub(df.mean(axis=0), axis=1)
similarity_matrix = item_mean_subtracted.fillna(0).corr(method="pearson").values
However, this does not seem right.

Here's a possible implementation of the adjusted cosine similarity:
import numpy as np
from scipy.spatial.distance import pdist, squareform
M = np.asarray([[2, 3, 4, 1, 0],
[0, 0, 0, 0, 5],
[5, 4, 3, 0, 0],
[1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
I'm taking advantage of NumPy broadcasting to subtract the mean.
If M is a sparse matrix, you could first convert it to a dense array with M.toarray().
From the docs:
Y = pdist(X, 'cosine')
Computes the cosine distance between vectors u and v,
1 − u⋅v / (||u||_2 ||v||_2)
where ||∗||_2 is the 2-norm of its argument ∗, and u⋅v is the dot product of u and v.
Array transposition is performed through the .T attribute.
In [277]: M_u
Out[277]: array([ 2. , 1. , 2.4, 1. ])
In [278]: item_mean_subtracted
array([[ 0. , 1. , 2. , -1. , -2. ],
[-1. , -1. , -1. , -1. , 4. ],
[ 2.6, 1.6, 0.6, -2.4, -2.4],
[ 0. , 0. , 0. , 0. , 0. ]])
In [279]: np.set_printoptions(precision=2)
In [280]: similarity_matrix
array([[ 1. , 0.87, 0.4 , -0.68, -0.72],
[ 0.87, 1. , 0.8 , -0.65, -0.91],
[ 0.4 , 0.8 , 1. , -0.38, -0.8 ],
[-0.68, -0.65, -0.38, 1. , 0.27],
[-0.72, -0.91, -0.8 , 0.27, 1. ]])
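For comparison, the same computation can be sketched without pdist, using only NumPy. This is a minimal sketch under the assumption that no mean-centered item column is all zeros (such a column would have norm 0 and produce nan); the names centered, norms and similarity are illustrative only.

```python
import numpy as np

M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]], dtype=float)

# Subtract each user's (row's) mean rating from that user's ratings.
centered = M - M.mean(axis=1, keepdims=True)

# Cosine similarity between item columns:
# pairwise dot products divided by the product of the column norms.
norms = np.linalg.norm(centered, axis=0)
similarity = (centered.T @ centered) / np.outer(norms, norms)

print(np.round(similarity, 2))
```

The result should agree with 1 - squareform(pdist(item_mean_subtracted.T, 'cosine')) up to floating-point error.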


how to reverse index a 2-d array

I have a 2d MxN array A , each row of which is a sequence of indices, padded by -1's at the end e.g.:
[[ 2 1 -1 -1 -1]
[ 1 4 3 -1 -1]
[ 3 1 0 -1 -1]]
I have another MxN array of float values B:
[[ 0.7 0.4 1.5 2.0 4.4 ]
[ 0.8 4.0 0.3 0.11 0.53]
[ 0.6 7.4 0.22 0.71 0.06]]
and I want to use the indices in A to filter B i.e. for each row, only the indices present in A retain their values, and the values at all other locations are set to 0.0, i.e. the result would look like:
[[ 0.0 0.4 1.5 0.0 0.0 ]
[ 0.0 4.0 0.0 0.11 0.53 ]
[ 0.6 7.4 0.0 0.71 0.0]]
What's a good way to do this in "pure" numpy? (I would like to do this in pure numpy so I can jit it in jax.)
Numpy supports fancy indexing. Ignoring the "-1" entries for the moment, you can do something like this:
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]
This works because the indices are broadcast against each other. The column np.arange(B.shape[0]).reshape(-1, 1) matches all the elements of a given row of A to the corresponding row in B and result.
This example does not address the fact that -1 is a valid numpy index. You need to clear the elements that correspond to -1 in A when 4 (the last column) is not present in that row:
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0
Here, the mask is [True, False, True], indicating that even though the second row has a -1 in it, it also contains a 4.
This approach is fairly efficient. It will create no more than a couple of boolean arrays of the same shape as A for the mask.
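Putting the two steps together on the arrays from the question gives the following sketch. The name rows is introduced here for readability and is not part of the original snippet.

```python
import numpy as np

A = np.array([[2, 1, -1, -1, -1],
              [1, 4, 3, -1, -1],
              [3, 1, 0, -1, -1]])
B = np.array([[0.7, 0.4, 1.5, 2.0, 4.4],
              [0.8, 4.0, 0.3, 0.11, 0.53],
              [0.6, 7.4, 0.22, 0.71, 0.06]])

# Column of row numbers; it broadcasts against A to pair each index with its row.
rows = np.arange(B.shape[0]).reshape(-1, 1)

# Copy the selected entries. A padding value of -1 also selects the last column.
result = np.zeros_like(B)
result[rows, A] = B[rows, A]

# Undo the last-column copy for rows that are padded but never list the last index.
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0

print(result)
```

Note that the in-place assignments are plain NumPy; JAX arrays are immutable, so this form would need adapting before it could be jitted.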
You can use broadcasting, but note that it will create a large intermediate array of shape (M, N, N) (in pure numpy at least):
import numpy as np
A = ...
B = ...
M, N = A.shape
out = np.where(np.any(A[..., None] == np.arange(N), axis=1), B, 0.0)
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
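To make the intermediate array explicit, here is a small sketch of the same comparison split into steps. It assumes a variant of A with one row made entirely of -1 (not in the original question) to show that padding never matches a column number and so needs no special handling; matches and present are illustrative names.

```python
import numpy as np

A = np.array([[2, 1, -1, -1, -1],
              [-1, -1, -1, -1, -1],
              [3, 1, 0, -1, -1]])
B = np.array([[0.7, 0.4, 1.5, 2.0, 4.4],
              [0.8, 4.0, 0.3, 0.11, 0.53],
              [0.6, 7.4, 0.22, 0.71, 0.06]])
M, N = A.shape

# matches[m, j, k] is True when A[m, j] == k; this is the (M, N, N) intermediate.
matches = A[..., None] == np.arange(N)

# present[m, k] is True when column k is listed anywhere in row m of A.
present = matches.any(axis=1)

out = np.where(present, B, 0.0)
print(matches.shape)
print(out)
```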
Another possible solution:
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
np.where(mask, B, 0)
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
EDIT (when there are rows with only -1)
The following code handles the possibility, raised by @MadPhysicist (whom I thank), of rows containing only -1. It only requires adding 2 lines to my previous code.
A = np.array([[ 2, 1, -1, -1, -1],
[ -1, -1, -1, -1, -1],
[ 3, 1, 0, -1, -1]])
B = np.array([[ 0.7, 0.4, 1.5, 2.0, 4.4 ],
[ 0.8, 4.0, 0.3, 0.11, 0.53],
[ 0.6, 7.4, 0.22, 0.71, 0.06]])
rminus1 = np.all(A == -1, axis=1) # new
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
C = np.where(mask, B, 0)
C[rminus1, :] = 0 # new
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.6 , 7.4 , 0. , 0.71, 0. ]])

Numpy: How to stack a single array into each row of a bigger array and turn it into a 2D array?

I have a numpy array named heartbeats with 100 rows. Each row has 5 elements.
I also have a single array named time_index with 5 elements.
I need to prepend the time index to each row of heartbeats.
heartbeats = np.array([
    [-0.58, -0.57, -0.55, -0.39, -0.40],
    [-0.31, -0.31, -0.32, -0.46, -0.46]
])
time_index = np.array([-2, -1, 0, 1, 2])
What I need:
array([[-2, -0.58],
       [-1, -0.57],
       [0, -0.55],
       [1, -0.39],
       [2, -0.40],
       [-2, -0.31],
       [-1, -0.31],
       [0, -0.32],
       [1, -0.46],
       [2, -0.46]])
I only wrote two rows of heartbeats to illustrate.
Assuming you are using numpy, the exact output array you are looking for can be made by stacking a repeated version of time_index with the raveled version of heartbeats:
np.stack((np.tile(time_index, len(heartbeats)), heartbeats.ravel()), axis=-1)
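As a runnable sketch with the two sample rows from the question, the one-liner can be split into steps. The intermediate names times, values and pairs are illustrative only.

```python
import numpy as np

heartbeats = np.array([
    [-0.58, -0.57, -0.55, -0.39, -0.40],
    [-0.31, -0.31, -0.32, -0.46, -0.46],
])
time_index = np.array([-2, -1, 0, 1, 2])

# Repeat the whole time index once per heartbeat row: shape (10,).
times = np.tile(time_index, len(heartbeats))

# Flatten the heartbeats row by row: shape (10,).
values = heartbeats.ravel()

# Pair them up along a new last axis: shape (10, 2).
pairs = np.stack((times, values), axis=-1)
print(pairs)
```

Because the integer times are stacked with float values, the result has a float dtype.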
Another approach, using broadcasting
In [13]: heartbeats = np.array([
...: [-0.58, -0.57, -0.55, -0.39, -0.40],
...: [-0.31, -0.31, -0.32, -0.46, -0.46]
...: ])
...: time_index = np.array([-2, -1, 0, 1, 2])
Make a target array:
In [14]: res = np.zeros(heartbeats.shape + (2,), heartbeats.dtype)
In [15]: res[:,:,1] = heartbeats # insert a (2,5) into a (2,5) slot
In [17]: res[:,:,0] = time_index[None] # insert a (5,) into a (2,5) slot
In [18]: res
array([[[-2. , -0.58],
[-1. , -0.57],
[ 0. , -0.55],
[ 1. , -0.39],
[ 2. , -0.4 ]],
[[-2. , -0.31],
[-1. , -0.31],
[ 0. , -0.32],
[ 1. , -0.46],
[ 2. , -0.46]]])
and then reshape to 2d:
In [19]: res.reshape(-1,2)
array([[-2. , -0.58],
[-1. , -0.57],
[ 0. , -0.55],
[ 1. , -0.39],
[ 2. , -0.4 ],
[-2. , -0.31],
[-1. , -0.31],
[ 0. , -0.32],
[ 1. , -0.46],
[ 2. , -0.46]])
[17] takes a (5,), expands it to (1,5), and then to (2,5) for the insert. Read up on broadcasting.
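The same broadcasting step can also be made explicit with np.broadcast_arrays, which is a different spelling of the idea rather than the code used above. A minimal sketch, with t, hb and pairs as illustrative names:

```python
import numpy as np

heartbeats = np.array([
    [-0.58, -0.57, -0.55, -0.39, -0.40],
    [-0.31, -0.31, -0.32, -0.46, -0.46],
])
time_index = np.array([-2, -1, 0, 1, 2])

# Broadcast the (5,) time index against the (2, 5) heartbeats: both become (2, 5).
t, hb = np.broadcast_arrays(time_index, heartbeats)

# Stack on a new last axis to get (2, 5, 2), then flatten the first two axes.
pairs = np.stack((t, hb), axis=-1).reshape(-1, 2)
print(pairs)
```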
Alternatively, you can repeat time_index once per row of heartbeats with np.concatenate:
concatenated = np.concatenate([time_index] * heartbeats.shape[0])
# [-2 -1 0 1 2 -2 -1 0 1 2]
# result = np.dstack((concatenated, heartbeats.reshape(-1))).squeeze()
result = np.array([concatenated, heartbeats.reshape(-1)]).T
Using np.concatenate may be faster than np.tile. This solution is faster than Mad Physicist's, but the fastest is the broadcasting approach from hpaulj's answer.

Cartesian product from 2 series

I have this big series of length t (t = 200K rows):
prices = [200, 100, 500, 300 ..]
and I want to calculate a matrix (t x t) where each value is calculated as:
matrix[i][j] = prices[j]/prices[i] - 1
I tried this using a double for loop, but it's too slow. Any ideas how to do it faster?
for i, p0 in enumerate(prices):
    for j, p1 in enumerate(prices):
        matrix[i][j] = p1/p0 - 1
A vectorized solution is to use np.meshgrid with prices and 1/prices as arguments (note that prices must be a NumPy array), then multiply the two results and subtract 1 to compute matrix[i][j] = prices[j]/prices[i] - 1:
a, b = np.meshgrid(p, 1/p)
a * b - 1
As an example:
p = np.array([1,4,2])
Would give:
a, b = np.meshgrid(p, 1/p)
a * b - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
Quick check of some of the cells:
(i,j) prices[j]/prices[i] - 1
(1,1) 1/1 - 1 = 0
(1,2) 4/1 - 1 = 3
(1,3) 2/1 - 1 = 1
(2,1) 1/4 - 1 = -0.75
Another solution:
[p] / np.array([p]).T - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
There are two idiomatic ways of doing an outer product-type operation. Either use the .outer method of universal functions, here np.divide:
In [2]: p = np.array([10, 20, 30, 40])
In [3]: np.divide.outer(p, p)
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
Alternatively, use broadcasting:
In [4]: p[:, None] / p[None, :]
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
Here p[None, :] could also be spelled as a reshape, p.reshape((1, len(p))), but the indexing form is more readable.
Both are equivalent to a double for-loop:
In [6]: o = np.empty((len(p), len(p)))
In [7]: for i in range(len(p)):
   ...:     for j in range(len(p)):
   ...:         o[i, j] = p[i] / p[j]
In [8]: o
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
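Note that these outputs hold p[i] / p[j], whereas the question asks for prices[j]/prices[i] - 1. A minimal sketch of adapting both forms, using the small example array from the earlier answer (the names returns and same_via_outer are illustrative only):

```python
import numpy as np

p = np.array([1.0, 4.0, 2.0])

# Row i, column j holds p[j] / p[i] - 1: the row vector is divided by the column vector.
returns = p[None, :] / p[:, None] - 1

# The ufunc form computes p[i] / p[j], so transpose it before subtracting 1.
same_via_outer = np.divide.outer(p, p).T - 1

print(returns)
```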
I guess it can be done this way:
import numpy
prices = [200., 300., 100., 500., 600.]
x = numpy.array(prices).reshape(1, len(prices))
matrix = (1/x.T) * x - 1
Let me explain in detail. The matrix is the outer product of a column vector of element-wise reciprocal prices and a row vector of the original prices, with 1 subtracted from every entry. (With NumPy arrays, * is element-wise; multiplying an (n, 1) array by a (1, n) array broadcasts to the same (n, n) result as the matrix product of the two.)
First, we create a row vector from the prices list:
x = numpy.array(prices).reshape(1, len(prices))
Reshaping is required here. Otherwise the vector would have shape (len(prices),) rather than the required (1, len(prices)), and x.T would not turn it into a column.
Then we compute a column vector of element-wise reciprocal price values:
1/x.T
Finally, we compute the resulting matrix:
matrix = (1/x.T) * x - 1
Here the trailing - 1 is broadcast across the whole matrix (1/x.T) * x.
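The shapes involved at each step can be checked with a short sketch. The names col and flat are introduced here for illustration and are not part of the answer's code.

```python
import numpy as np

prices = [200., 300., 100., 500., 600.]

x = np.array(prices).reshape(1, len(prices))  # row vector, shape (1, 5)
col = 1 / x.T                                 # column of reciprocals, shape (5, 1)
matrix = col * x - 1                          # broadcasts to shape (5, 5)

# Without the reshape, transposing a 1-D array changes nothing.
flat = np.array(prices)

print(x.shape, col.shape, matrix.shape, flat.T.shape)
```

For the t = 200K series in the question, the full float64 matrix would hold 4e10 entries, roughly 320 GB, so in practice it may need to be computed in blocks of rows rather than all at once.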

Matrix created from a function, and concatenated column vector of the matrix

We have a function f(x,y). We want to calculate the matrix Bij = f(xi,xj) = f(ih,jh) for 1 <= i,j <= n and h=1/(n+1), such as :
If f(x,y)=x+y, then Bij = ih+jh and the matrix becomes (here, n=3) :
I would like to program a function calculating the column vector b that concatenates all the columns of Bij. For example, with my previous example, we would have :
Here is what I have done (the function and n can be changed; here f(x,y)=x+y):
def f(i,j):
    return a
B = np.fromfunction(f,(n,n))
But I don't know how to build the vector b. And with
I get a row vector, not a column vector. Could you help me? Sorry for my bad English; I'm a beginner in Python.
The ravel function along with a new axis should do the trick:
import numpy as np
x = np.array([[0.5, 0.75, 1],
[0.75, 1, 1.25],
[1, 1.25, 1.5]])
x.T.ravel()[:, np.newaxis]
# array([[ 0.5 ],
# [ 0.75],
# [ 1. ],
# [ 0.75],
# [ 1. ],
# [ 1.25],
# [ 1. ],
# [ 1.25],
# [ 1.5 ]])
Ravel stitches together all the rows, so we first transpose the matrix (with .T) to get the columns in order. The result is a 1-D array, which we turn into a column vector by adding a new axis.
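For completeness, here is a sketch that builds B from the definition in the question and then stacks its columns. It assumes f(x, y) = x + y and n = 3 as in the question's example; the index shift and the use of reshape with order="F" are choices made here, not the asker's code.

```python
import numpy as np

n = 3
h = 1 / (n + 1)

def f(x, y):
    # Example function from the question; swap in any vectorized f(x, y).
    return x + y

# np.fromfunction passes 0-based index arrays, so shift by one to get i, j = 1..n.
B = np.fromfunction(lambda i, j: f((i + 1) * h, (j + 1) * h), (n, n))

# order="F" reads B column by column; the (-1, 1) shape makes a column vector.
b = B.reshape(-1, 1, order="F")

print(B)
print(b.shape)
```

Reading in column-major order gives the same values as x.T.ravel()[:, np.newaxis], without the explicit transpose.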
import numpy as np
# create sample matrix `m`
m = np.matrix([[0.5, 0.75, 1], [0.75, 1, 1.25], [1, 1.25, 1.5]])
# convert matrix `m` to a 'flat' matrix
m_flat = m.flatten()
# `m_flat` is still a matrix, in case you need an array:
m_flat_arr = np.squeeze(np.asarray(m_flat))
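The snippet uses .flatten(), np.asarray() and np.squeeze() to convert the original matrix m,
matrix([[ 0.5 , 0.75, 1. ],
[ 0.75, 1. , 1.25],
[ 1. , 1.25, 1.5 ]])
into an array m_flat_arr of:
array([ 0.5 , 0.75, 1. , 0.75, 1. , 1.25, 1. , 1.25, 1.5 ])
Note that .flatten() goes row by row and the result is 1-D. It matches the column-by-column order asked for only because this m is symmetric; for a general matrix, flatten m.T instead.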

scipy sparse matrix division

I have been trying to divide a python scipy sparse matrix by a vector sum of its rows. Here is my code
sparse_mat = bsr_matrix((l_data, (l_row, l_col)), dtype=float)
sparse_mat = sparse_mat / (sparse_mat.sum(axis = 1)[:,None])
However, it throws an error no matter how I try it
sparse_mat = sparse_mat / (sparse_mat.sum(axis = 1)[:,None])
File "/usr/lib/python2.7/dist-packages/scipy/sparse/", line 381, in __div__
return self.__truediv__(other)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/", line 427, in __truediv__
raise NotImplementedError
Anyone with an idea of where I am going wrong?
You can circumvent the problem by creating a sparse diagonal matrix from the reciprocals of your row sums and then multiplying it with your matrix. In the product the diagonal matrix goes left and your matrix goes right.
>>> a
array([[0, 9, 0, 0, 1, 0],
[2, 0, 5, 0, 0, 9],
[0, 2, 0, 0, 0, 0],
[2, 0, 0, 0, 0, 0],
[0, 9, 5, 3, 0, 7],
[1, 0, 0, 8, 9, 0]])
>>> b = sparse.bsr_matrix(a)
>>> c = sparse.diags(1/b.sum(axis=1).A.ravel())
>>> # on older scipy versions the offsets parameter (default 0)
... # is a required argument, thus
... # c = sparse.diags(1/b.sum(axis=1).A.ravel(), 0)
>>> a/a.sum(axis=1, keepdims=True)
array([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
[ 0. , 1. , 0. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.375 , 0.20833333, 0.125 , 0. , 0.29166667],
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
>>> (c @ b).todense() # on Python < 3.5 replace c @ b with c.dot(b)
matrix([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
[ 0. , 1. , 0. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.375 , 0.20833333, 0.125 , 0. , 0.29166667],
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])
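As a self-contained sketch of the diagonal-matrix approach: this version assumes a CSR matrix with float data and no all-zero rows (a zero row sum would give a division by zero). The names row_sums and normalized are illustrative only.

```python
import numpy as np
from scipy import sparse

a = np.array([[0, 9, 0, 0, 1, 0],
              [2, 0, 5, 0, 0, 9],
              [0, 2, 0, 0, 0, 0],
              [2, 0, 0, 0, 0, 0],
              [0, 9, 5, 3, 0, 7],
              [1, 0, 0, 8, 9, 0]])
b = sparse.csr_matrix(a, dtype=float)

# Row sums come back as a dense (6, 1) matrix; flatten them to a 1-D array.
row_sums = np.asarray(b.sum(axis=1)).ravel()

# Sparse diagonal matrix of reciprocal row sums, multiplied from the left.
c = sparse.diags(1.0 / row_sums)
normalized = c @ b

print(sparse.issparse(normalized))
print(normalized.toarray())
```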
Something funny is going on. I have no problem performing the element division. I wonder if it's a Py2 issue. I'm using Py3.
In [1022]: A=sparse.bsr_matrix([[2,4],[1,2]])
In [1023]: A
<2x2 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements (blocksize = 2x2) in Block Sparse Row format>
In [1024]: A.A
array([[2, 4],
[1, 2]], dtype=int32)
In [1025]: A.sum(axis=1)
matrix([[6],
[3]], dtype=int32)
In [1026]: A/A.sum(axis=1)
matrix([[ 0.33333333, 0.66666667],
[ 0.33333333, 0.66666667]])
or to try the other example:
In [1027]: b=sparse.bsr_matrix([[0, 9, 0, 0, 1, 0],
...: [2, 0, 5, 0, 0, 9],
...: [0, 2, 0, 0, 0, 0],
...: [2, 0, 0, 0, 0, 0],
...: [0, 9, 5, 3, 0, 7],
...: [1, 0, 0, 8, 9, 0]])
In [1028]: b
<6x6 sparse matrix of type '<class 'numpy.int32'>'
with 14 stored elements (blocksize = 1x1) in Block Sparse Row format>
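In [1029]: b.sum(axis=1)
matrix([[10],
[16],
[ 2],
[ 2],
[24],
[18]], dtype=int32)
In [1030]: b/b.sum(axis=1)
matrix([[ 0. , 0.9 , 0. , 0. , 0.1 , 0. ],
[ 0.125 , 0. , 0.3125 , 0. , 0. , 0.5625 ],
[ 0. , 1. , 0. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0.375 , 0.20833333, 0.125 , 0. , 0.29166667],
[ 0.05555556, 0. , 0. , 0.44444444, 0.5 , 0. ]])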
The result of this sparse/dense division is also dense, whereas c*b (c being the sparse diagonal matrix) is sparse.
In [1039]: c*b
<6x6 sparse matrix of type '<class 'numpy.float64'>'
with 14 stored elements in Compressed Sparse Row format>
The sparse sum is a dense matrix. It is 2d, so there's no need to expand its dimensions. In fact, if I try that I get an error:
In [1031]: A/(A.sum(axis=1)[:,None])
ValueError: shape too large to be a matrix.
Per this message, to keep the matrix sparse, you access the data values and use the (nonzero) row indices:
sums = np.asarray(A.sum(axis=1)).squeeze() # this is dense
A.data /= sums[A.nonzero()[0]]
If dividing by the nonzero row mean instead of the sum, one can do:
nnz = A.getnnz(axis=1) # this is also dense
means = sums / nnz
A.data /= means[A.nonzero()[0]]
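A runnable sketch of the in-place idea follows. It assumes a CSR matrix with float data and no explicitly stored zeros, so that A.nonzero() lists entries in the same order as A.data; the small matrix and the name rows are made up for illustration.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[2.0, 4.0, 0.0],
                                [0.0, 1.0, 3.0]]))

# Dense 1-D array of row sums.
sums = np.asarray(A.sum(axis=1)).ravel()

# Row index of every stored entry, in the same order as A.data (CSR assumption).
rows = A.nonzero()[0]

# Scale the stored values in place; the sparsity pattern is untouched.
A.data /= sums[rows]

print(A.toarray())
```

With an integer matrix the in-place division would fail, so the data is created as float here.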
