Efficient implementation of the transition matrix for page rank - python

I'm trying to implement PageRank. I'm reading the description here: http://nlp.stanford.edu/IR-book/html/htmledition/markov-chains-1.html
Everything is very clear to me; however, I'm concerned about the construction of the matrix $P$. I find that constructing $P$ the naive way would be very expensive. For example: to implement step 1, one would need to check every row of $A$ and then check every element of that row to see if all elements are zero. For step 2, one would need to compute the number of ones in each row. I can imagine my code having nasty slow loops. I was wondering if there are smart linear algebra techniques that could construct $P$ efficiently. I will be using Python/numpy for my coding.
EDIT: one way I'm now thinking of solving this is to sum $A$ element-wise over its columns, which gives a column vector. I can then go through each element of this vector to check which entries are zero. That tells me which rows have no 1s, and I can fill those rows with $1/N$.

Your concern is justified. Since the number of web pages (vertices in the underlying graph) is huge, it is impractical to actually generate such an A and work on it.
The matrix calculation of page rank can be much more efficiently calculated using sparse matrix implementations, since the matrix is very sparse. Most webpages are not actually connected to each other, so most entries in the matrix are 0.
The sparse matrix is built as follows:
Build the matrix A as described: A_ij = 1 if (i,j) is an edge, otherwise A_ij = 0.
Step 1 is usually not performed; instead, 'sinks' (rows with no outgoing links) are removed iteratively, because filling their rows with 1/N would make the matrix dense. Alternatives are to link sinks back to the nodes that linked to them, or to link each sink to itself.
Divide each 1 in A by the number of 1s in its row, as described in step 2.
Let's denote the resulting matrix by M. This is the matrix we will iterate with in order to obtain the column vector p of ranks (initialized with 1/n in each entry).
x = [1/n, 1/n, ..., 1/n]^T   // a column vector
p = [1/n, 1/n, ..., 1/n]^T   // a column vector with the initial ranks
M = genSparseMatrix()        // as described above
do until p converges:
    p = (1 - alpha) * M * p + alpha * x
return p
In the end, this yields p, the column vector that holds the page rank value for each node.
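For reference, here is a rough Python/scipy.sparse sketch of this iteration. The edge-list input, the function name, and the handling of dangling pages by spreading their mass uniformly (a rank-one correction, rather than removing sinks) are my own assumptions, not part of the answer above.

import numpy as np
import scipy.sparse as sp

def pagerank(edges, n, alpha=0.15, tol=1e-10, max_iter=100):
    # edges: iterable of (i, j) pairs meaning "page i links to page j"
    rows, cols = zip(*edges)
    A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    out_deg = np.asarray(A.sum(axis=1)).ravel()
    inv_deg = np.zeros(n)
    inv_deg[out_deg > 0] = 1.0 / out_deg[out_deg > 0]
    # M is the transpose of the row-normalized adjacency matrix (column-stochastic)
    M = (sp.diags(inv_deg) @ A).T.tocsr()
    sinks = out_deg == 0                      # dangling pages (no outgoing links)
    x = np.full(n, 1.0 / n)
    p = x.copy()
    for _ in range(max_iter):
        # dangling mass is redistributed uniformly, so the matrix itself stays sparse
        p_new = (1 - alpha) * (M @ p + p[sinks].sum() / n) + alpha * x
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return p_new

For example, pagerank([(0, 1), (1, 2), (2, 0), (2, 1)], n=3) ranks a small three-page graph.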

Related

Pairwise distance in very large datasets

I have an array that is about [5000000 x 6] and I need to select only the points (rows) that are at a certain distance from each other.
The idea should be:
Start new_array with first row from data array
Compare new_array with the second row from data array
If the pdist between them is > tol, append the row to new_array
Compare new_array with the third row from data array
and so on...
One problem is RAM size. I can't compare all rows at once, even with pdist.
So I've been thinking of splitting the dataset into smaller ones, but then I don't know how to retrieve the index information for the rows in the dataset.
I've tried scipy cdist, scipy euclidean, sklearn euclidean_distances, sklearn paired_distances, and the code below is the fastest I could get. At first it is fast, but after 40k loops it becomes really slow.
import numpy as np
from scipy.spatial.distance import pdist

xyTotal = np.random.random([5000000, 6])
tol = 0.5
ng = [xyTotal[0]]   # start new_array with the first row (initialization not shown originally)
for i, z in enumerate(xyTotal):
    if (pdist(np.vstack([np.array(ng), z])) > tol).all():
        ng.append(z)
Any suggestions for this problem?
EDIT
from sklearn.neighbors import BallTree

ktree = BallTree(xyTotal, leaf_size=40, metric='euclidean')
btsem = []
for i, j in enumerate(xyTotal):
    # count_only=True: a point whose only neighbour within r=tol is itself
    if ktree.query_radius(j.reshape(1, -1), r=tol, count_only=True) == 1:
        btsem.append(j)
This is fast, but I'm only picking outliers. When I get to points that are near one another (i.e. in a little cluster), I don't know how to pick only one point and leave the others out, since I will get the same metric for all points in the cluster (they all have the same distance to each other).
The computation is slow because the complexity of your algorithm is quadratic: O(k * n * n) where n is len(xyTotal) and k is the probability of the condition being true. Thus, assuming k=0.1 and n=5000000, the running time will be huge (likely hours of computation).
Fortunately, you can write a better implementation running in O(n * log(n)) time, though it is tricky to implement. You need to put your ng points in a k-d tree; then, for each candidate point, you can search for its nearest neighbor among the accepted points and check that the distance to the current point is greater than tol.
Note that you can find Python modules implementing k-d trees, and the SciPy documentation provides an example implementation written in pure Python (so likely not very efficient).
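Since SciPy's k-d trees do not support incremental insertion, here is a rough sketch of that greedy-filtering idea that rebuilds a cKDTree over the accepted points in batches; the function name, the batch size and the rebuild strategy are my own assumptions, not a drop-in solution.

import numpy as np
from scipy.spatial import cKDTree

def greedy_filter(points, tol, rebuild_every=1000):
    # Keep a point only if it is farther than `tol` from every point kept so far.
    kept = [points[0]]
    pending = [points[0]]            # accepted points not yet indexed in the tree
    tree = None
    for p in points[1:]:
        near_tree = tree is not None and tree.query(p, k=1)[0] <= tol
        near_pending = any(np.linalg.norm(p - q) <= tol for q in pending)
        if not near_tree and not near_pending:
            kept.append(p)
            pending.append(p)
            if len(pending) >= rebuild_every:
                tree = cKDTree(np.asarray(kept))   # rebuild over everything kept so far
                pending = []
    return np.asarray(kept)

Whether this beats the quadratic loop depends on how many points end up being kept; if almost every point is accepted, the tree queries themselves become the dominant cost.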

How to find the common eigenvectors of two matrices with distincts eigenvalues

I am looking to find, or rather build, a common eigenvector matrix X for 2 matrices A and B such that:
AX=aX with "a" the diagonal matrix corresponding to the eigenvalues
BX=bX with "b" the diagonal matrix corresponding to the eigenvalues
where A and B are square and diagonalizable matrices.
I took a look at a similar post but did not manage to reach a conclusion, i.e. to get valid results when I build the final desired endomorphism F defined by F = P D P^-1.
I have also read the Wikipedia article and this interesting paper, but couldn't extract methods that are easy enough to implement.
In particular, I am interested in the eig(A,B) Matlab function.
I tried to use it like this :
% Search for common build eigen vectors between FISH_sp and FISH_xc
[V,D] = eig(FISH_sp,FISH_xc);
% Diagonalize the matrix (A B^-1) to compute Lambda since we have AX=Lambda B X
[eigenv, eigen_final] = eig(inv(FISH_xc)*FISH_sp);
% Compute the final endomorphism : F = P D P^-1
FISH_final = V*eye(7).*eigen_final*inv(V)
But the matrix FISH_final doesn't give good results: when I do further computations with FISH_final (this is actually a Fisher matrix), the results of those computations are not valid.
So surely I must have made an error in the code snippet above. For now, I prefer to work this out in Matlab as a prototype, and afterwards, if it works, look at doing the synthesis with MKL or with Python functions. Hence also tagging python.
How can I build these common eigenvectors and also find the associated eigenvalues? I am a little lost among all the potential methods that exist to carry this out.
A screen capture (not reproduced here) shows that the kernel of the commutator has to be different from the null vector.
EDIT 1: On Maths Exchange, one answer advises using the Singular Value Decomposition (SVD) of the commutator [A,B]:
"If 𝑣 is a common eigenvector, then ‖(𝐴𝐵−𝐵𝐴)𝑣‖=0. The SVD approach gives you a unit-vector 𝑣 that minimizes ‖(𝐴𝐵−𝐵𝐴)𝑣‖ (with the constraint that ‖𝑣‖=1)"
So I extract the approximate eigenvectors V from:
[U,S,V] = svd(A*B-B*A)
Is there a way to increase the accuracy, so as to minimize ‖(AB−BA)v‖ as much as possible?
IMPORTANT REMARK: Maybe some of you didn't fully understand my goal.
Concerning the common basis of eigenvectors, I am looking for a combination (vector or matrix) of V1 and V2, or a direct use of the null operator on the 2 input Fisher matrices, to build this new basis "P" in which, with eigenvalues other than the known D1 and D2 (denoted D1a and D2a), we could have:
F = P (D1a+D2a) P^-1
To compute the new Fisher matrix F, I need to know P, assuming that D1a and D2a are respectively equal to the diagonal matrices D1 and D2 (coming from the diagonalization of the A and B matrices).
If I knew the common basis of eigenvectors P, I could deduce D1a and D2a from D1 and D2, couldn't I?
The 2 Fisher matrices are available at these links:
Matrix A
Matrix B
I don't think there is a built-in facility in Matlab for computing common eigenvectors of two matrices. I'll just outline a brute-force way and do it in Matlab in order to highlight some of its eigenvector-related methods. We will assume the matrices A and B are square and diagonalizable.
Outline of steps:
Get eigenvectors/values for A and B respectively.
Group the resultant eigenvectors by their eigenspaces.
Check for intersection of the eigenspaces by checking linear dependency among the eigenvectors of A and B, one pair of eigenspaces at a time.
Matlab does provide methods for (efficiently) completing each step! Except of course step 3 involves checking linear dependency many many times, which in turn means we are likely doing unnecessary computation. Not to mention, finding common eigenvectors may not require finding all eigenvectors. So this is not meant to be a general numerical recipe.
How to get eigenvector/values
The syntax is
[V,D] = eig(A)
where D(i,i) and V(:,i) form the corresponding eigenpairs.
Just be wary of numerical errors. In other words, if you check
tol = sum(abs(A*V(:,i) - D(i,i)*V(:,i)));
tol < n*eps should be true for some small n for a smallish matrix A, but it's probably not true for n = 0 or 1.
Example:
>> A = gallery('lehmer',4);
>> [V,D] = eig(A);
>> sum(abs(A*V(:,1)-D(1)*V(:,1)))<eps
ans =
logical
0
>> sum(abs(A*V(:,1)-D(1)*V(:,1)))<10*eps
ans =
logical
1
How to group eigenvectors by their eigenspaces
In Matlab, eigenvalues are not automatically sorted in the output of [V,D] = eig(A). So you need to do that.
Get diagonal entries of matrix: diag(D)
Sort and keep track of the required permutation for sorting: [d,I]=sort(diag(D))
Identify repeating elements in d: [~,ia,~]=unique(d,'stable')
ia(i) tells you the beginning index of the ith eigenspace. So you can expect d(ia(i):ia(i+1)-1) to be identical eigenvalues and thus the eigenvectors belonging to the ith eigenspace are the columns W(:,ia(i):ia(i+1)-1) where W=V(:,I). Of course, for the last one, the index is ia(end):end
The last step happens to be answered here in true generality. Here, unique is sufficient at least for small A.
(Feel free to ask a separate question on how to do this whole step of "shuffling columns of one matrix based on another diagonal matrix" efficiently. There are probably other efficient methods using built-in Matlab functions.)
For example,
>> A=[1,2,0;1,2,2;3,6,1];
>> [V,D] = eig(A),
V =
0 0 0.7071
1.0000 -0.7071 0
0 0.7071 -0.7071
D =
3 0 0
0 5 0
0 0 3
>> [d,I]=sort(diag(D));
>> W=V(:,I),
W =
0 0.7071 0
1.0000 0 -0.7071
0 -0.7071 0.7071
>> [~,ia,~]=unique(d,'stable'),
ia =
1
3
which makes sense because the 1st eigenspace is the one with eigenvalue 3, comprising the span of columns 1 and 2 of W, and similarly for the 2nd space.
How to get linear intersect of (the span of) two sets
To complete the task of finding common eigenvectors, you do the above for both A and B. Next, for each pair of eigenspaces, you check for linear dependency. If there is linear dependency, the linear intersect is an answer.
There are a number of ways for checking linear dependency. One is to use other people's tools. Example: https://www.mathworks.com/matlabcentral/fileexchange/32060-intersection-of-linear-subspaces
One is to get the RREF of the matrix formed by concatenating the column vectors column-wise.
Let's say you did the computation in step 2 and arrived at V1, D1, d1, W1, ia1 for A and V2, D2, d2, W2, ia2 for B. You need to do
for i = 1:numel(ia1)
    for j = 1:numel(ia2)
        check_linear_dependency(col1, col2);
    end
end
where col1 is W1(:,ia1(i):ia1(i+1)-1) as mentioned in step 2 (with the caveat for the last space), and similarly for col2. By check_linear_dependency we mean the following. First we get the RREF:
[R,p] = rref([col1,col2]);
You are looking for, first, rank([col1,col2])<size([col1,col2],2). If you have computed rref anyway, you already have the rank. You can check the Matlab documentation for details. You will need to profile your code for selecting the more efficient method. I shall refrain from guess-estimating what Matlab does in rank(). Although whether doing rank() implies doing the work in rref can make a good separate question.
In cases where rank([col1,col2])<size([col1,col2],2) is true, some columns don't have leading 1s (they are not pivot columns), and I believe p (the pivot column indices) will help you trace back which columns are dependent on which other columns. And you can build the intersect from here. As usual, be alert to numerical errors getting in the way of == statements. We are getting to the point of a different question -- i.e. how to get the linear intersect from rref() in Matlab, so I am going to leave it here.
There is yet another way using fundamental theorem of linear algebra (*sigh at that unfortunate naming):
null( [null(col1.').' ; null(col2.').'] )
The formula I got from here. I think ftla is why it should work. If that's not why or if you want to be sure that the formula works (which you probably should), please ask a separate question. Just beware that purely math questions should go on a different stackexchange site.
Now I guess we are done!
EDIT 1:
Let's be extra clear with how ia works with an example. Let's say we named everything with a trailing 1 for A and 2 for B. We need
for i = 1:numel(ia1)
    for j = 1:numel(ia2)
        if i == numel(ia1)
            col1 = W1(:, ia1(end):end);
        else
            col1 = W1(:, ia1(i):ia1(i+1)-1);
        end
        if j == numel(ia2)
            col2 = W2(:, ia2(end):end);
        else
            col2 = W2(:, ia2(j):ia2(j+1)-1);
        end
        check_linear_dependency(col1, col2);
    end
end
EDIT 2:
I should mention the observation that common eigenvectors should be those in the nullspace of the commutator. Thus, perhaps null(A*B-B*A) yields the same result.
But still be alert of numerical errors. With the brute force method, we started with eigenpairs with low tol (see definition in earlier sections) and so we already verified the "eigen" part in the eigenvectors. With null(A*B-B*A), the same should be done as well.
Of course, with multiple methods at hand, it's a good idea to compare results across methods.
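Since the question also tags python, here is a rough NumPy sketch of the null(A*B - B*A) idea via the SVD. The tolerance choice and the function name are my own assumptions, and the columns it returns are only candidates: their eigen-residuals still need to be checked, as noted above.

import numpy as np

def commutator_nullspace(A, B, rtol=1e-10):
    # Orthonormal basis of the numerical nullspace of the commutator AB - BA.
    K = A @ B - B @ A
    _, s, Vh = np.linalg.svd(K)
    small = s <= rtol * max(s[0], np.finfo(float).tiny)   # s is sorted in decreasing order
    return Vh[small].conj().T        # columns span the (numerical) nullspace of K

# Residual check for each candidate column v, using Rayleigh quotients as
# tentative eigenvalues:
#   V0 = commutator_nullspace(A, B)
#   for v in V0.T:
#       a, b = v.conj() @ A @ v, v.conj() @ B @ v
#       print(np.linalg.norm(A @ v - a*v), np.linalg.norm(B @ v - b*v))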
I suspect this is rather a delicate matter.
First off, mathematically, A and B are simultaneously diagonalisable iff they commute, that is iff
A*B - B*A == 0 (often A*B-B*A is written [A,B])
(for if A*X = X*a and B*X = X*b with a, b diagonal then
A = X*a*inv(X), B = X*b*inv(X)
[A,B] = X*[a,b]*inv(X) = 0 since a and b, being diagonal, commute)
So I'd say the first thing to check is that your A and B do commute, and here is the first awkward issue: since [A,B] as computed is unlikely to be all zeroes due to rounding error, you'll need to decide if [A,B] being non-zero is just due to rounding error or if, actually, A and B don't commute.
Now suppose x is an eigenvector of A, with eigenvalue e. Then
A*B*x = B*A*x = B*e*x = e*B*x
And so we have, mathematically, two possibilities: either Bx is 0, or Bx is also an eigenvector of A with eigenvalue e.
A nice case is when all the elements of a are different, that is when each eigenspace of A is one dimensional. In that case:
if AX = Xa for diagonal a, then BX = Xb for diagonal b (which you'll need to compute). If you diagonalize A, and all the eigenvalues are sufficiently different, then you can assume each eigenspace is of dimension 1, but what does 'sufficiently' mean? Another delicate question, alas. If two computed eigenvalues are very close, are the eigenvalues different or is the difference rounding error?
Anyway, to compute the eigenvalue of B for each eigenvector x of A, compute Bx. If ||Bx|| is small enough compared to ||x||, then the eigenvalue of B is 0; otherwise it's
x'*B*x/(x'*x)
In the general case, some of the eigenspaces may have dimension greater than 1. The one dimensional eigen spaces can be dealt with as above, but more computation is required for the higher dimensional ones.
Suppose that m eigenvectors x[1].. x[m] of A correspond to the eigenvalue e. Since A and B commute, it is easy to see that B maps the space spanned by the xs to itself. Let C be the mxm matrix
C[i,j] = x[j]'*B*x[i]
Then C is symmetric and so we can diagonalize it, ie find orthogonal V and diagonal c with
C = V'*c*V
If we define
y[l] = Sum over k of V[l,k]*x[k],   l = 1..m
then a little algebra shows that y[l] is an eigenvector of B, with eigenvalue c[l]. Moreover, since each x[i] is an eigenvector of A with the same eigenvalue e, each y[l] is also an eigenvector of A with eigenvalue e.
So all in all, I think a method would be:
Compute [A,B] and if it's really not 0, give up
Diagonalise A, and sort the eigenvalues to be increasing (and sort the eigenvectors!)
Identify the eigenspaces of A. For the 1 dimensional spaces the corresponding eigenvector of A is an eigenvector of B, and all you need to compute is the eigenvalue of B. For higher dimensional ones, proceed as in the previous paragraph.
A relatively expensive (in computational effort) but reasonably reliable way to test whether the commutator is zero would be to compute the SVD of the commutator and take its largest singular value, c say, and also to take the largest singular value (or largest absolute value of the eigenvalues) a of A and b of B. Unless c is a lot smaller than (e.g. 1e-10 times) the lesser of a and b, you should conclude the commutator is not zero.
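For the eventual Python port, here is a rough NumPy sketch of the procedure outlined above, assuming A and B are symmetric (as Fisher matrices are) and commute to within rounding; the grouping tolerance and all names are my own assumptions.

import numpy as np

def simultaneous_eigvecs(A, B, tol=1e-8):
    # Diagonalize A, then re-diagonalize B restricted to each eigenspace of A.
    a_vals, X = np.linalg.eigh(A)                # eigenvalues come out sorted ascending
    cols, a_out, b_out = [], [], []
    i, n = 0, len(a_vals)
    while i < n:
        # group (nearly) equal eigenvalues of A into one eigenspace
        j = i + 1
        while j < n and abs(a_vals[j] - a_vals[i]) <= tol * max(1.0, abs(a_vals[i])):
            j += 1
        Xe = X[:, i:j]                           # orthonormal basis of this eigenspace
        c, Q = np.linalg.eigh(Xe.T @ B @ Xe)     # B restricted to the eigenspace
        cols.append(Xe @ Q)                      # candidate common eigenvectors
        a_out.extend([a_vals[i]] * (j - i))
        b_out.extend(c)
        i = j
    P = np.hstack(cols)
    # If A and B really commute: A @ P ~ P @ diag(a_out) and B @ P ~ P @ diag(b_out)
    return P, np.array(a_out), np.array(b_out)

The returned P plays the role of the common basis the question calls P; if the commutator really is negligible, diag(a_out) and diag(b_out) are the D1a and D2a needed for F = P (D1a + D2a) P^-1.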

Python matrix multiplication 3d Array

I tried to solve a PDE numerically and in the course of this I faced the problem of a triple-nested for loop representing the 3 spatial dimensions. This construct is nested inside another time loop, so you can imagine that the computation takes forever for sufficiently large node numbers. The code block looks like this:
for jy in range(0, cy-1):
    for jx in range(0, cx-1):
        for jz in range(0, cz-1):
            T[n+1, jx, jy, jz] = (T[n, jx, jy, jz]
                + s*(T[n, jx-1, jy, jz] - 2*T[n, jx, jy, jz] + T[n, jx+1, jy, jz])
                + s*(T[n, jx, jy-1, jz] - 2*T[n, jx, jy, jz] + T[n, jx, jy+1, jz])
                + s*(T[n, jx, jy, jz-1] - 2*T[n, jx, jy, jz] + T[n, jx, jy, jz+1]))
It might look intimidating at first, but it is quite simple. I have a 3-dimensional array representing a solid bulk material, where each point holds the current temperature. The next temperature at each point is calculated iteratively from the points adjacent to it - 6 in total. In the case of a 1-dimensional solid, the update is just a simple matrix multiplication. Is there any way to express the 3-loop system above as a simple matrix operation, like in the 1D case?
Best regards!
With numpy you can easily do these kinds of array operations,
e.g. for a 3x3x3 array
import numpy as np
T = np.random.random((3,3,3))
T = T*T - 2*T ... etc.
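More concretely for the stencil in the question, the triple loop over space can usually be replaced by slicing over the interior points. A sketch reusing the names T, n and s from the question (note it updates indices 1..c-2 along each spatial axis, which differs slightly from the loop bounds shown there):

# T has shape (time, cx, cy, cz); update all interior points in one vectorized step
T[n+1, 1:-1, 1:-1, 1:-1] = (
    T[n, 1:-1, 1:-1, 1:-1]
    + s * (T[n, :-2, 1:-1, 1:-1] - 2 * T[n, 1:-1, 1:-1, 1:-1] + T[n, 2:, 1:-1, 1:-1])
    + s * (T[n, 1:-1, :-2, 1:-1] - 2 * T[n, 1:-1, 1:-1, 1:-1] + T[n, 1:-1, 2:, 1:-1])
    + s * (T[n, 1:-1, 1:-1, :-2] - 2 * T[n, 1:-1, 1:-1, 1:-1] + T[n, 1:-1, 1:-1, 2:])
)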
First off, you need to be a bit more careful with your terminology. A "matrix" is a 2-dimensional array of numbers, so you are really talking about an array. NumPy (on which SciPy builds) has a data type called an ndarray. You need to be very careful manipulating them, because although they are sometimes used to represent matrices, there are operations that can be performed on 2-D arrays that are not mathematically legal for matrices.
I strongly recommend you use @ and not * to perform multiplication of 1- or 2-D arrays, and be sure to add code to check that the operations you are doing are legal mathematically. As a trivial example, Python lets you add a 1 x n or an n x 1 vector to an n x n matrix, even though that is not mathematically correct. The reason it allows this is, as intimated above, that there is no true matrix type in Python.
It very well may be that you can reformulate your problem to use a 3-D array, and by experimentation find the particular operation you are trying to perform. Just keep in mind that the rules of linear algebra are only casually applied in Python.

Efficiently update values held in scoring matrix

I am continuously calculating correlation matrices where each time the order of the underlying data is randomized. When a correlation score with randomized data is greater than or equal to the original correlation determined with ordered data, I would like to update the corresponding cell in a scoring matrix with +1. (All cells begin as zeroes in the scoring matrix).
Due to the size of the matrices I am dealing with (shape = (3681, 12709)), I would like to find an efficient way of doing this. So far, what I have is inefficient and takes too long. I wonder if there is a matrix-operation style approach to this rather than iterating, as I am currently doing below:
from itertools import product

for i, j in product(data_sorted.index, data_sorted.columns):
    # if random correlation is as good as or better than sorted correlation
    if data_random.loc[i, j] >= data_sorted.loc[i, j]:
        # update scoring matrix
        scoring_matrix[sorted_index_list.index(i)][sorted_column_list.index(j)] += 1
I have crudely timed this approach and found that doing this for a single row of my matrix takes roughly 4.2 seconds, which seems excessive.
Any help would be much appreciated.
Assuming everything has the same indices, this should work as expected and be pretty quick.
scoring_matrix += (data_random >= data_sorted).astype(int)
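For context, a sketch of how this comparison fits into the repeated-randomization loop; the loop count and the correlation step are placeholders for whatever the question's pipeline actually does, and using .values assumes both DataFrames share the same row and column order.

import numpy as np

scoring_matrix = np.zeros(data_sorted.shape, dtype=int)
for _ in range(n_permutations):                    # placeholder number of randomizations
    data_random = compute_random_correlations()    # placeholder for your own shuffle + correlation step
    # one vectorized comparison of the whole matrix replaces the double loop over cells
    scoring_matrix += (data_random.values >= data_sorted.values)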

Scipy LinearOperator With Multiple Inputs

I need to invert a large, dense matrix, which I hoped to do with SciPy's gmres. Fortunately, the dense matrix A follows a pattern and I do not need to store the matrix in memory. The LinearOperator class allows us to construct an object which acts as the matrix for GMRES and can compute directly the matrix-vector product A*v. That is, we write a function mv(v) which takes as input a vector v and returns mv(v) = A*v. Then, we can use the LinearOperator class to create A_LinOp = LinearOperator(shape = shape, matvec = mv). We can pass the linear operator to the SciPy gmres command to evaluate the matrix-vector products without ever having to fully load A into memory.
The documentation for the LinearOperator is found here: LinearOperator Documentation.
Here is my problem: to write the routine that computes the matrix-vector product mv(v) = A*v, I need another input vector C. The entries of A are of the form A[i,j] = f(C[i] - C[j]). So, what I really want is for mv to take two inputs: one fixed vector C, and one variable input v for which we want to compute A*v.
MATLAB has a similar setup, where one would write x = gmres(@(v) mv(v,C), b), where b is the right-hand side of the problem Ax = b, and mv is the function that takes as variable input v (for which we want to compute A*v) and C is the fixed, known vector needed for the assembly of A.
My problem is that I can't figure out how to allow the LinearOperator class to accept two inputs, one variable and one "fixed" like I can in MATLAB.
Is there a way to do the analogous operation in SciPy? Alternatively, if anyone knows of a better way of inverting a large, dense matrix (50000, 50000) where the entries follow a pattern, I would greatly appreciate any suggestions.
Thanks!
EDIT: I should have stated this information actually. The matrix is actually (in block form) [A C; C^T 0], where A is N x N (N large) and C is N x 3, and the 0 is 3 x 3 and C^T is the transpose of C. This array C is the same array as the one mentioned above. The entries of A follow a pattern A[i,j] = f(C[i] - C[j]).
I wrote mv(v,C) to construct A*v row by row: for i = 0..N it computes sum_j f(C[i]-C[j])*v[j] (actually, I do numpy.dot(FC, v) where FC[j] = f(C[i]-C[j]), which works well). Then, at the end, it does the computations for the C^T rows. I was hoping to eventually replace the large for loop with a multiprocessing call to parallelize it, but that's a future thing to consider. I will also look into using Cython to speed up the computations.
This is very late, but if you're still interested...
Your A matrix must be very low rank since it's a nonlinearly transformed version of a rank-2 matrix. Plus it's symmetric. That means it's trivial to invert: get the truncated eigenvalue decomposition with, say, 5 eigenvalues: A = U*S*U', then invert that: A^-1 = U*S^-1*U'. S is diagonal so this is inexpensive. You can get the truncated eigenvalue decomposition with eigh.
That takes care of A. Then for the rest: use the block matrix inversion formula. Looks nasty, but I will bet you 100,000,000 prussian francs that it's 50x faster than the direct method you were using.
I faced the same situation (some years later than you) of trying to pass more than one argument to LinearOperator, but for another problem. The solution I found was to use global variables, to avoid passing the extra variables as arguments to the function.
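An alternative to globals is to capture the fixed vector in a closure (or a lambda), which is the direct analogue of the MATLAB @(v) mv(v,C) call mentioned in the question. A minimal sketch, where mv, C and b are placeholders standing in for the question's own objects:

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def mv(v, C):
    # placeholder matvec: the real one would build the rows A[i, :] = f(C[i] - C[:]) on the fly
    return v                                        # dummy, stands in for A @ v

N = 5
C = np.arange(N, dtype=float)                       # placeholder fixed vector
b = np.ones(N)                                      # placeholder right-hand side
A_op = LinearOperator((N, N), matvec=lambda v: mv(v, C))   # C is captured by the lambda
x, info = gmres(A_op, b)                            # info == 0 means GMRES converged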
