SVD for recommendation engine - python

I'm trying to build a toy recommendation engine to wrap my mind around Singular Value Decomposition (SVD). I've read enough content to understand the motivations and intuition behind the actual decomposition of the matrix A (a user x movie matrix).
I need to know more about what goes on after that.
from numpy.linalg import svd
import numpy as np
A = np.matrix([
[0, 0, 0, 4, 5],
[0, 4, 3, 0, 0],
...
])
U, S, V = svd(A)  # note: numpy's svd returns V already transposed (i.e. V^T)
k = 5  # dimension reduction
A_k = U[:, :k] * np.diag(S[:k]) * V[:k, :]
Three Questions:
Do the values of the matrix A_k represent the predicted/approximate ratings?
What role/ what steps does cosine similarity play in the recommendation?
And finally, I'm using Mean Absolute Error (MAE) to calculate my error. But what values am I comparing? Something like MAE(A, A_k), or something else?
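For what it's worth, here is a minimal sketch of how those pieces could fit together, assuming the common convention of measuring MAE only on the originally observed (non-zero) entries and comparing items by the cosine similarity of their columns in A_k; the extra third row of A and the cosine_sim helper are made up for illustration:
import numpy as np
from numpy.linalg import svd

A = np.array([
    [0, 0, 0, 4, 5],
    [0, 4, 3, 0, 0],
    [1, 0, 0, 4, 0],   # made-up row standing in for the elided data
], dtype=float)

U, S, V = svd(A)                              # V is returned already transposed
k = 2                                         # dimension reduction
A_k = U[:, :k] @ np.diag(S[:k]) @ V[:k, :]    # rank-k approximation (approximate ratings)

mask = A > 0                                  # positions of the known ratings
mae = np.abs(A[mask] - A_k[mask]).mean()      # MAE over the observed entries only
print(mae)

def cosine_sim(u, v):                         # similarity between two item columns of A_k
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_sim(A_k[:, 3], A_k[:, 4]))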

Related

Compute Gradient of overdefined Plane

I want to compute the gradient (direction and magnitude) of an overdefined plane (> 3 points), for example given by these four x, y, z coordinates:
[0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]
My code for computing the gradient looks like this (using singular value decomposition from this post, modified):
import numpy as np

def regr(X):
    y = np.average(X, axis=0)
    Xm = X - y
    cov = (1. / X.shape[0]) * np.matmul(Xm.T, Xm)  # covariance
    u, s, v = np.linalg.svd(cov)                   # singular value decomposition
    return u[:, 1]
However as a result I get:
[0, 1, 0]
which does not represent the gradient nor the normal vector. The result should look like this:
[sqrt(2)/2, 0, -sqrt(2)/2]
Any ideas why I am getting a wrong vector?
You can use the function numpy.gradient for that. Here is some math background on how it works.
In short:
The directional derivative is the dot product of the gradient with the unit-normalized direction vector.
An easier solution is to use numdifftools, especially the convenience function directionaldiff.
An example can be found here. And if you look at the source code you can see the relation to the math mentioned above.
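To make that relation concrete, here is a small sketch; the directional_derivative helper and the example function are illustrative, not taken from the answer above:
import numpy as np

def directional_derivative(grad, p, d):
    # directional derivative of f at p along d = grad(f)(p) . d_hat,
    # where d_hat is the unit-normalized direction
    d_hat = np.asarray(d, dtype=float)
    d_hat = d_hat / np.linalg.norm(d_hat)
    return np.dot(grad(p), d_hat)

# Example: f(x, y) = x**2 + y, so grad f = (2x, 1)
grad_f = lambda p: np.array([2.0 * p[0], 1.0])
print(directional_derivative(grad_f, p=np.array([1.0, 2.0]), d=[1, 1]))  # ~2.1213 = 3/sqrt(2)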

How does matrix decomposition help fill in a sparse utility/ratings matrix for new users?

This is my first attempt at machine learning. I'm writing a very simple recommendation engine using the Yelp dataset. It's written in Python, using the pandas and numpy libraries for data processing. I've already narrowed the data down: first to restaurants (millions), then to restaurants in Vegas (thousands), then to restaurants with 3.5 stars or higher and over 50 reviews (hundreds). I also narrowed the users down to those that have reviewed at least 20% of the restaurants. Finally I've arrived at a ratings matrix of 100 users by 1800 restaurants.
However, I feel it's still too sparse to give (relatively) useful recommendations. The goal is to use item-item collaborative filtering, computing vector distances with cosine similarity.
I've been reading about dealing with sparse matrices, and the consensus seems to be to use matrix factorization. It seems, however, that most of these readings treat matrix factorization as the algorithm that drives the recommendation for current users, solving the sparsity issue only as a by-product. Is my understanding correct here? What I'm looking for is a method that will solve the sparsity issue first and then use cosine vector distances to guide the recommendation.
If decomposition is in fact the way to go: what sklearn.decomposition method should I use i.e. PCA, SVD, NMF?
[[ 3, 0, 0, ..., 0, 0, 0],
[ 0, 0, 0, ..., 0, 0, 0],
[ 0, 0, 0, ..., 0, 4, 3],
...
[ 1, 0, 0, ..., 0, 0, 0],
[ 0, 0, 0, ..., 0, 0, 2],
[ 0, 0, 5, ..., 0, 1, 3]]
(100 Users X 1800 Restaurants)
Reducing the number of ratings is not a good way to improve the accuracy of your recommendations (at least not directly).
That said, sparsity is not a "big" problem. Indeed, factorization algorithms for recommendation are designed to deal with sparsity levels of 95%, 98%, even 99%.
Currently, matrix factorization gives the best results, and its concept is pretty simple. Moreover, memory-based CF approaches (item-based, user-based, ...) are less efficient, less flexible, and give worse results than model-based ones.
Most popular algorithms are based on SVD:
Funk's SVD (often just called SVD, although it is not a true SVD; it's an approximation)
BRISMF (biased, regularised version of Funk's)
SVD++: BRISMF plus implicit feedback information
timeSVD++: SVD++ that also models time information
trustSVD: SVD++ which includes trust information (like friends)
The basic recipe for these algorithms is:
Create some low-rank matrices and initialize them randomly
For each rating in the data set, compute the error of your prediction
Update the low-rank matrices using the gradients of the function you are optimising
Repeat
Python example (BRISMF):
import numpy as np

# Initialize low-rank matrices (K is the number of latent factors);
# num_users, num_items, K, data, user_col, item_col, global_avg,
# std_user, std_item, alpha, beta and steps are assumed to be defined elsewhere
P = np.random.rand(num_users, K)  # user-feature matrix
Q = np.random.rand(num_items, K)  # item-feature matrix

# Factorize the R matrix using SGD
for step in range(steps):
    for k in range(len(data)):
        i = data.X[k, user_col]   # user index
        j = data.X[k, item_col]   # item index
        r_ij = data.Y[k]          # rating(i, j)

        # NOTE: for simplicity the bias is built here from precomputed
        # per-user/per-item deviations, but it should be learned
        # for better accuracy.
        bias = global_avg + std_user[i] + std_item[j]

        # Make the prediction and compute the error
        rij_pred = bias + np.dot(Q[j, :], P[i, :])
        eij = rij_pred - r_ij

        # Update P and Q at the same time; they depend on each other
        tempP = alpha * 2 * (eij * Q[j, :] + beta * P[i, :])
        tempQ = alpha * 2 * (eij * P[i, :] + beta * Q[j, :])
        P[i] -= tempP
        Q[j] -= tempQ
Extra:
For speed (and simplicity of the code), I recommend vectorising everything; see the sketch after this list
Try to create caches if needed; accessing sparse matrices can be pretty slow, even with the right data structure
These algorithms are computationally very expensive, so for simple versions you can expect around 10 s/iteration on datasets of 1,000,000 ratings
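As a rough sketch of what "vectorised" means here (the sizes and bias vectors below are placeholders, not values from this answer), you can predict every user-item rating at once by broadcasting the bias terms and doing a single matrix product:
import numpy as np

num_users, num_items, K = 100, 1800, 20
P = np.random.rand(num_users, K)          # user-feature matrix
Q = np.random.rand(num_items, K)          # item-feature matrix
global_avg = 3.5                          # placeholder global mean rating
b_user = np.random.rand(num_users)        # placeholder per-user bias
b_item = np.random.rand(num_items)        # placeholder per-item bias

# all predictions in one shot: shape (num_users, num_items)
R_pred = global_avg + b_user[:, None] + b_item[None, :] + P @ Q.T
print(R_pred.shape)                       # (100, 1800)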
I'm building a simple library for Orange3 Data Mining so if you're interested you can take a look: https://github.com/biolab/orange3-recommendation

How to find linearly independent rows from a matrix

How to identify the linearly independent rows of a matrix? For instance, in
[[0, 1, 0, 0],
 [0, 0, 1, 0],
 [0, 1, 1, 0],
 [1, 0, 0, 1]]
the 4th row is independent.
First, your 3rd row is linearly dependent on the 1st and 2nd rows. However, your 1st and 4th columns are linearly dependent.
Two methods you could use:
Eigenvalue
If an eigenvalue of the matrix is zero, its corresponding eigenvector is linearly dependent. The documentation of eig states that the returned eigenvalues are repeated according to their multiplicity and are not necessarily ordered. However, assuming the eigenvalues correspond to your row vectors, one method would be:
import numpy as np

matrix = np.array(
    [
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]
    ])

lambdas, V = np.linalg.eig(matrix.T)
# The linearly dependent row vectors
# (in practice, compare against a small tolerance rather than exactly 0)
print(matrix[lambdas == 0, :])
Cauchy-Schwarz inequality
To test linear dependence of vectors and figure out which ones, you could use the Cauchy-Schwarz inequality. Basically, if the inner product of the vectors is equal to the product of the norm of the vectors, the vectors are linearly dependent. Here is an example for the columns:
import numpy as np

matrix = np.array(
    [
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]
    ])

print(np.linalg.det(matrix))

for i in range(matrix.shape[0]):
    for j in range(matrix.shape[0]):
        if i != j:
            inner_product = np.inner(matrix[:, i], matrix[:, j])
            norm_i = np.linalg.norm(matrix[:, i])
            norm_j = np.linalg.norm(matrix[:, j])

            print('I: ', matrix[:, i])
            print('J: ', matrix[:, j])
            print('Prod: ', inner_product)
            print('Norm i: ', norm_i)
            print('Norm j: ', norm_j)
            if np.abs(inner_product - norm_j * norm_i) < 1E-5:
                print('Dependent')
            else:
                print('Independent')
Testing the rows is a similar approach.
You could then extend this to test all combinations of vectors, but I imagine this solution scales badly with size.
With sympy you can find the linearly independent rows using sympy.Matrix.rref:
>>> import sympy
>>> import numpy as np
>>> mat = np.array([[0,1,0,0],[0,0,1,0],[0,1,1,0],[1,0,0,1]]) # your matrix
>>> _, inds = sympy.Matrix(mat).T.rref() # to check the rows you need to transpose!
>>> inds
[0, 1, 3]
This basically tells you that rows 0, 1 and 3 are linearly independent while row 2 isn't (it's a linear combination of rows 0 and 1).
Then you could keep just those rows with slicing:
>>> mat[inds]
array([[0, 1, 0, 0],
[0, 0, 1, 0],
[1, 0, 0, 1]])
This also works well for rectangular (not only square) matrices.
I edited the code for the Cauchy-Schwarz inequality so that it scales better with dimension: the inputs are the matrix and its dimension, while the output is a new rectangular matrix which contains along its rows the linearly independent columns of the starting matrix. This works under the assumption that the first column is never null, but it can be readily generalized to handle that case too. Another thing I observed is that 1e-5 seems to be a "sloppy" threshold, since some particular pathological vectors were found to be linearly dependent with it; 1e-4 doesn't give me the same problems. I hope this is of some help: it was pretty difficult for me to find a really working routine to extract linearly independent vectors, so I'm willing to share mine. If you find a bug, please report it!
from numpy import dot, zeros, absolute
from numpy.linalg import matrix_rank, norm

def find_li_vectors(dim, R):
    r = matrix_rank(R)
    index = zeros(r, dtype=int)  # this will save the positions of the li columns in the matrix
    counter = 0
    index[0] = 0  # without loss of generality we pick the first column as linearly independent
    j = 0         # therefore the second index is simply 0

    for i in range(R.shape[1]):  # loop over the columns
        if i != j:  # if the two columns are not the same
            inner_product = dot(R[:, i], R[:, j])  # compute the scalar product
            norm_i = norm(R[:, i])                 # compute norms
            norm_j = norm(R[:, j])

            # inner product and the product of the norms are equal only if the two vectors are parallel
            # therefore we are looking for the ones which exhibit a difference which is bigger than a threshold
            if absolute(inner_product - norm_j * norm_i) > 1e-4:
                counter += 1        # counter is incremented
                index[counter] = i  # index is saved
                j = i               # j is refreshed
                # do not forget to refresh j: otherwise you would only compare against the first column!!

    R_independent = zeros((r, dim))

    # now save everything in a new matrix
    i = 0
    while i < r:
        R_independent[i, :] = R[:, index[i]]  # i-th row holds the index[i]-th (independent) column of R
        i += 1

    return R_independent
I know this was asked a while ago, but here is a very simple (although probably inefficient) solution. Given an array, the following finds a set of linearly independent vectors by progressively adding a vector and testing if the rank has increased:
from numpy.linalg import matrix_rank

def LI_vecs(dim, M):
    LI = [M[0]]
    for i in range(dim):
        tmp = []
        for r in LI:
            tmp.append(r)
        tmp.append(M[i])                 # set tmp = LI + [M[i]]
        if matrix_rank(tmp) > len(LI):   # test if M[i] is linearly independent from all (row) vectors in LI
            LI.append(M[i])              # note that matrix_rank does not need to take in a square matrix
    return LI                            # return set of linearly independent (row) vectors

# Example
mat = [[1, 2, 3, 4], [4, 5, 6, 7], [5, 7, 9, 11], [2, 4, 6, 8]]
LI_vecs(4, mat)
I interpret the problem as finding rows that are linearly independent of the other rows, which is equivalent to identifying the rows that are linearly dependent on the others.
Gaussian elimination, treating numbers smaller than a threshold as zero, can do that. It is faster than finding the eigenvalues of the matrix, testing all combinations of rows with the Cauchy-Schwarz inequality, or using singular value decomposition. A minimal sketch of this idea follows the links below.
See:
https://math.stackexchange.com/questions/1297437/using-gauss-elimination-to-check-for-linear-dependence
Problem with floating point numbers:
http://numpy-discussion.10968.n7.nabble.com/Reduced-row-echelon-form-td16486.html
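Here is that minimal sketch; the independent_row_indices helper and the tol threshold are illustrative assumptions, not code from the linked posts. It row-reduces the transpose, so the pivot columns it finds correspond to linearly independent rows of the original matrix.
import numpy as np

def independent_row_indices(A, tol=1e-10):
    M = np.array(A, dtype=float).T         # columns of A.T correspond to rows of A
    pivots = []
    row = 0
    for col in range(M.shape[1]):
        if row >= M.shape[0]:
            break
        p = row + np.argmax(np.abs(M[row:, col]))
        if np.abs(M[p, col]) < tol:        # treat tiny entries as zero
            continue
        M[[row, p]] = M[[p, row]]          # swap the pivot row into place
        M[row + 1:] -= np.outer(M[row + 1:, col] / M[row, col], M[row])
        pivots.append(col)
        row += 1
    return pivots

mat = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 1, 1, 0],
                [1, 0, 0, 1]])
print(independent_row_indices(mat))        # [0, 1, 3]: row 2 = row 0 + row 1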
With regard to the following discussion:
Find dependent rows/columns of a matrix using Matlab?
from sympy import *
A = Matrix([[1,1,1],[2,2,2],[1,7,5]])
print(A.nullspace())
It is obvious that the first and second rows are multiples of each other.
If we execute the above code we get [-1/3, -2/3, 1]. The indices of the zero elements in the null space vector should indicate independence, so why is the third element here not zero? If we multiply the matrix A by the null space vector, we do get a zero column vector. So what's wrong?
The answer which we are looking for is the null space of the transpose of A.
B = A.T
print(B.nullspace())
Now we get [-2, 1, 0], which shows that the third row is independent (and that the second row is twice the first).
Two important notes here:
Consider whether we want to check the row dependencies or the column dependencies.
Notice that the null space of a matrix is not equal to the null space of the transpose of that matrix unless it is symmetric.
You can find the vectors spanning the column space of the matrix by using the columnspace() method of SymPy's Matrix object. They are automatically the linearly independent columns of the matrix.
import sympy as sp
import numpy as np

M = sp.Matrix([[0, 1, 0, 0],
               [0, 0, 1, 0],
               [1, 0, 0, 1]])

for i in M.columnspace():
    print(np.array(i))
    print()
# The output is following.
# [[0]
# [0]
# [1]]
# [[1]
# [0]
# [0]]
# [[0]
# [1]
# [0]]
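A small follow-up (not part of the answer above): if you want the indices of the independent columns rather than the vectors themselves, rref() also returns the pivot positions, as in the earlier rref-based answer. Continuing with the same M:
_, pivot_cols = M.rref()
print(pivot_cols)   # (0, 1, 2): column 3 duplicates column 0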

Python, simultaneous pseudo-inversion of many 3x3, singular, symmetric, matrices

I have a 3D image with dimensions rows x cols x deps. For every voxel in the image, I have computed a 3x3 real symmetric matrix. They are stored in the array D, which therefore has shape (rows, cols, deps, 6).
D stores the 6 unique elements of the 3x3 symmetric matrix for every voxel in my image. I need to find the Moore-Penrose pseudo-inverse of all rows*cols*deps matrices simultaneously, in vectorized code (looping through every image voxel and inverting is far too slow in Python).
Some of these 3x3 symmetric matrices are non-singular, and I can find their inverses, in vectorized code, using the analytical formula for the true inverse of a non-singular 3x3 symmetric matrix, and I've done that.
However, for those matrices that ARE singular (and there are sure to be some) I need the Moore-Penrose pseudo-inverse. I could derive an analytical formula for the pseudo-inverse of a real, singular, symmetric 3x3 matrix, but it's a really nasty, lengthy formula, and would therefore involve a very large amount of (element-wise) matrix arithmetic and quite a bit of confusing-looking code.
Hence, I would like to know if there is a way to simultaneously find the MP pseudo inverse for all these matrices at once numerically. Is there a way to do this?
Gratefully,
GF
NumPy 1.8 included linear algebra gufuncs, which do exactly what you are after. While np.linalg.pinv is not gufunc-ed, np.linalg.svd is, and behind the scenes that is the function that gets called. So you can define your own gu_pinv function, based on the source code of the original function, as follows:
import numpy as np

def gu_pinv(a, rcond=1e-15):
    a = np.asarray(a)
    swap = np.arange(a.ndim)
    swap[[-2, -1]] = swap[[-1, -2]]
    u, s, v = np.linalg.svd(a)
    cutoff = np.maximum.reduce(s, axis=-1, keepdims=True) * rcond
    mask = s > cutoff
    s[mask] = 1. / s[mask]
    s[~mask] = 0
    return np.einsum('...uv,...vw->...uw',
                     np.transpose(v, swap) * s[..., None, :],
                     np.transpose(u, swap))
And you can now do things like:
a = np.random.rand(50, 40, 30, 6)
b = np.empty(a.shape[:-1] + (3, 3), dtype=a.dtype)
# Expand the unique items into a full symmetrical matrix
b[..., 0, :] = a[..., :3]
b[..., 1:, 0] = a[..., 1:3]
b[..., 1, 1:] = a[..., 3:5]
b[..., 2, 1:] = a[..., 4:]
# make matrix at [1, 2, 3] singular
b[1, 2, 3, 2] = b[1, 2, 3, 0] + b[1, 2, 3, 1]
# Find all the pseudo-inverses
pi = gu_pinv(b)
And of course the results are correct, both for singular and non-singular matrices:
>>> np.allclose(pi[0, 0, 0], np.linalg.pinv(b[0, 0, 0]))
True
>>> np.allclose(pi[1, 2, 3], np.linalg.pinv(b[1, 2, 3]))
True
And for this example, with 50 * 40 * 30 = 60,000 pseudo-inverses calculated:
In [2]: %timeit pi = gu_pinv(b)
1 loops, best of 3: 422 ms per loop
Which is really not that bad, although it is noticeably (about 4x) slower than simply calling np.linalg.inv, which of course fails to handle the singular matrices properly:
In [8]: %timeit np.linalg.inv(b)
10 loops, best of 3: 98.8 ms per loop
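One addition not in the original answer: newer NumPy releases let np.linalg.pinv operate directly on stacks of matrices of shape (..., M, N), so on such a version something like this should also work and agree with gu_pinv up to the rcond convention:
pi2 = np.linalg.pinv(b)                # broadcasts over the leading (50, 40, 30) axes
print(np.allclose(pi2, gu_pinv(b)))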
EDIT: See @Jaime's answer. Only the discussion in the comments to this answer is useful now, and only for the specific problem at hand.
You can do this matrix by matrix using scipy, which provides pinv to calculate the Moore-Penrose pseudo-inverse. An example follows:
from scipy.linalg import det, eig, pinv
from numpy.random import randint

# generate a random singular matrix M first
while True:
    M = randint(0, 10, 9).reshape(3, 3)
    if det(M) == 0:
        break
M = M.astype(float)

# this is the method you need
MPpseudoinverse = pinv(M)
This does not exploit the fact that the matrix is symmetric, though. You may also want to try the version of pinv exposed by numpy, which is supposedly faster, and different. See this post.

Working Example for Mahalanobis Distance Measure

I need to measure the distance between two n-dimensional vectors. It seems that the Mahalanobis distance is a good choice here, so I want to give it a try.
My Code looks like this:
import numpy as np
import scipy.spatial.distance

x = [19, 8, 0, 0, 2, 1, 0, 0, 18, 0, 1673, 9, 218]
y = [17, 6, 0, 0, 1, 2, 0, 0, 8, 0, 984, 9, 30]
scipy.spatial.distance.mahalanobis(x, y, np.linalg.inv(np.cov(x, y)))
But I get this error message:
/usr/lib/python2.7/dist-packages/scipy/spatial/distance.pyc in mahalanobis(u, v, VI)
498 v = np.asarray(v, order='c')
499 VI = np.asarray(VI, order='c')
--> 500 return np.sqrt(np.dot(np.dot((u-v),VI),(u-v).T).sum())
501
502 def chebyshev(u, v):
ValueError: matrices are not aligned
The SciPy docs say that VI is the inverse of the covariance matrix, and I think np.cov gives the covariance matrix and np.linalg.inv the inverse of a matrix...
But I do see what the problem is here (matrices are not aligned): the matrix VI has the wrong dimensions (2x2 instead of 13x13).
So a possible solution is to do this:
VI = np.linalg.inv(np.cov(np.vstack((x,y)).T))
but unfortunately the determinant of np.cov(np.vstack((x,y)).T) is 0, which means that an inverse matrix does not exist.
So how can I use the Mahalanobis distance measure when I can't even compute the covariance matrix?
Are you sure that the Mahalanobis distance is right for your application? According to Wikipedia, you need a set of points to generate the covariance matrix, not just two vectors. You can then compute distances of vectors from the set's center.
You don't have a sample set with which to calculate a covariance. You probably just want the Euclidean distance here (np.linalg.norm(x-y)). What is the bigger picture in what you are trying to achieve?
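To illustrate both suggestions, here is a small sketch; the samples array below is a made-up stand-in for a real data set, since only with such a set does the covariance (and hence the Mahalanobis distance) make sense:
import numpy as np
from scipy.spatial.distance import mahalanobis

x = np.array([19, 8, 0, 0, 2, 1, 0, 0, 18, 0, 1673, 9, 218], dtype=float)
y = np.array([17, 6, 0, 0, 1, 2, 0, 0, 8, 0, 984, 9, 30], dtype=float)

# Euclidean distance between the two vectors
print(np.linalg.norm(x - y))

# Mahalanobis distance needs a sample of points to estimate the covariance;
# `samples` is a made-up (200, 13) data set purely for illustration
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 13))
VI = np.linalg.inv(np.cov(samples, rowvar=False))   # 13x13 inverse covariance
print(mahalanobis(x, y, VI))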
