I have a large symmetric matrix A of dimensions (N, N), where N is about twenty million, so I certainly cannot store it, even though about 50% of its components are zero.
But every component A[i, j] is explicitly known: A[i, j] = f(i, j). For example A[i, j] = cos(i)*cos(j).
I need to multiply this matrix by a vector of length N. What is a "doable" way to do that on a machine with 64 cores and 128 GB of RAM?
If you can compute the elements of the matrix on the fly, there is no need to store the whole matrix in memory. Also, each element of the result vector is independent of the others, so you can run as many parallel workers as you want.
The only algorithmic optimization I can think of is to take into account that f(i, j) = cos(i)*cos(j) is a symmetric function (f(i, j) = f(j, i)), but that only applies if this is your real function.
Also look at NumPy and Cython for much faster computations in Python, as pure Python can be a little slow for this kind of job.
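As a rough illustration of the on-the-fly idea, here is a minimal NumPy sketch. The function name matvec_on_the_fly and the block sizes are made up for this example, and it assumes f accepts broadcastable index arrays. It builds one small tile of A at a time, so memory use is bounded by row_block * col_block values rather than N * N:

```python
import numpy as np

def matvec_on_the_fly(f, v, row_block=256, col_block=4096):
    """Compute A @ v where A[i, j] = f(i, j), without ever storing A."""
    n = v.shape[0]
    out = np.empty(n)
    for r0 in range(0, n, row_block):
        # Column vector of row indices for this block of output entries.
        rows = np.arange(r0, min(r0 + row_block, n))[:, None]
        acc = np.zeros(rows.shape[0])
        for c0 in range(0, n, col_block):
            cols = np.arange(c0, min(c0 + col_block, n))[None, :]
            # f(rows, cols) is a (row_block, col_block) tile of A.
            acc += f(rows, cols) @ v[c0:c0 + cols.shape[1]]
        out[r0:r0 + rows.shape[0]] = acc
    return out
```

Each row block is independent, so the outer loop could be handed to separate worker processes. For the specific example f(i, j) = cos(i)*cos(j), the matrix has rank one, so the product also reduces to cos(i) * sum_j cos(j)*v[j] and no tiling is needed at all.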
Related
I have a NumPy array vectors = np.random.randn(rows, cols). I want to find differences between its rows according to some other array diffs which is sparse and "2-hot": containing a 1 in its column corresponding to the first row of vectors and a -1 corresponding to the second row. Perhaps an example shall make it clearer:
diffs = np.array([[ 1,  0, -1],
                  [ 1, -1,  0]])
then I can compute the row differences by simply diffs @ vectors.
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse: sparse.csr_matrix(diffs) @ vectors, but even this is 300ms.
Possibly this is simply as fast as it gets, but some part of me wonders whether using matrix multiplication is the wisest choice for this task.
What's more, I need to take the absolute value afterwards, so really I'm doing np.abs(sparse.csr_matrix(diffs) @ vectors), which adds ~200ms for a grand total of ~500ms.
I can compute the row differences by simply diffs @ vectors.
This is very inefficient. A matrix multiplication runs in O(n*m*k) time for an (n,m) matrix multiplied by an (m,k) one. In your case, there are only two non-zero values per row of diffs, and you do not actually need to multiply by 1 or -1. Your problem can be computed in O(n*k) time (i.e. m times faster).
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse.
The problem is that the input data representation is inefficient. When diffs is an array of shape (10_000, 1000), it is not reasonable to use a dense matrix that is ~1000 times bigger than needed, nor a sparse matrix format that is not optimized for rows having only two non-zero values (especially 1 and -1). Instead, store the positions of the non-zero values in a 2D array called sel_rows of shape (2,n), where the first row contains the column index of the 1 and the second row contains the column index of the -1 in each row of diffs. Then you can extract rows of vectors with, for example, vectors[sel_rows[0]], and perform the final operation with vectors[sel_rows[0,:]] - vectors[sel_rows[1,:]]. This approach should be drastically faster than a dense matrix product, and it may be a bit faster than a sparse one, depending on the target machine.
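A small sketch of the indexing approach described above, on toy data. Recovering sel_rows from a dense diffs with np.argmax is only for illustration and assumes every row holds exactly one 1 and one -1; in practice you would build sel_rows directly:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((4, 5))

# Toy "2-hot" matrix: each row has exactly one 1 and one -1.
diffs = np.array([[1,  0, -1, 0],
                  [1, -1,  0, 0]])

# sel_rows[0] holds the position of the 1, sel_rows[1] the position of the -1.
sel_rows = np.stack([np.argmax(diffs == 1, axis=1),
                     np.argmax(diffs == -1, axis=1)])

# Row differences via fancy indexing, with no matrix product involved.
result = np.abs(vectors[sel_rows[0]] - vectors[sel_rows[1]])
```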
While the above solution is simple, it creates multiple temporary arrays that are not cache-friendly, since your output array takes 10_000 * 15_000 * 8 bytes = 1.1 GiB (which is quite large). You can use Numba to remove the temporary arrays and so improve performance. Multiple threads can be used to improve performance even further. Here is some untested code:
import numba as nb
import numpy as np

@nb.njit('(int_[:,::1], float64[:,::1])', parallel=True)
def compute(diffs, vectors):
    # diffs has shape (n, 2): column 0 holds the row index of the +1,
    # column 1 the row index of the -1.
    n, k = diffs.shape[0], vectors.shape[1]
    assert diffs.shape[1] == 2
    res = np.empty((n, k))
    for i in nb.prange(n):
        a, b = diffs[i, 0], diffs[i, 1]
        for j in range(k):
            # Apply abs() here if needed so as to avoid
            # creating new temporary arrays
            res[i, j] = vectors[a, j] - vectors[b, j]
    return res
The above code should be nearly optimal. It should be memory-bound and able to saturate the memory bandwidth. Note that writing such huge arrays to memory takes some time, as does reading the input rows (twice). On x86-64 platforms, a basic implementation should move 4.4 GiB of data from/to the RAM. Thus, on a mainstream PC with 20 GiB/s of RAM bandwidth, this takes 220 ms. In fact, the sparse matrix result was not so bad in practice for a sequential implementation.
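The 4.4 GiB figure can be roughly reproduced with back-of-the-envelope arithmetic. The breakdown used here is an assumption, not something spelled out above: the output is counted twice (a plain store on x86-64 typically reads the cache line before writing it) and the gathered input rows are counted twice (one read for each operand of the subtraction):

```python
n, k, itemsize = 10_000, 15_000, 8      # float64 output of shape (n, k)
bandwidth_gib_s = 20                    # assumed RAM bandwidth from the text

out_gib = n * k * itemsize / 2**30      # size of the output array in GiB
# Assumed traffic: output read + output write + two gathered input reads.
traffic_gib = 4 * out_gib
seconds = traffic_gib / bandwidth_gib_s

print(f"output: {out_gib:.2f} GiB, traffic: {traffic_gib:.2f} GiB, "
      f"time: {seconds * 1000:.0f} ms")
```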
If this is not enough for you, then you can use single-precision floating-point numbers instead of double-precision (twice as fast). You could also use a low-level C/C++ implementation to reduce the memory bandwidth used (thanks to non-temporal instructions, ~30% faster). There is not much more to do.
I have two NxM arrays in numpy, a and b. I would like to perform a vectorized operation that does the following:
c = np.zeros(N)
for i in range(N):
    for j in range(M):
        c[i] += a[i, j]*b[i, j]
Stated in a more mathematical way, I have two matrices A and B, and want to compute the diagonal of the matrix A*B (being imprecise with matrix transposition, etc). I've been trying to accomplish something like this with the tensordot function, but haven't had much success. This is an operation that is going to be performed many times, so I would like for it to be efficient (i.e., without literally calculating the matrix AB and just taking the diagonal from that).
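One possible way to express the double loop above without forming the full product is np.einsum, or equivalently an elementwise product followed by a row sum. This is a sketch only; the helper name rowwise_dot is made up for illustration:

```python
import numpy as np

def rowwise_dot(a, b):
    """Return c with c[i] = sum_j a[i, j] * b[i, j]."""
    # "ij,ij->i": multiply matching entries and sum over j for each row i.
    return np.einsum("ij,ij->i", a, b)

a = np.arange(6.0).reshape(2, 3)
b = np.ones((2, 3))
c = rowwise_dot(a, b)  # same values as (a * b).sum(axis=1)
```

This computes only the N diagonal entries of a @ b.T, never the full (N, N) product.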
Given a query vector (one-hot-vector) q with size of 50000x1 and a large sparse matrix A with size of 50000 x 50000 and nnz of A is 0.3 billion, I want to compute r=(A + A^2 + ... + A^S)q (usually 4 <= S <=6).
I can compute the above equation iteratively using a loop:
r = np.zeros((50000,1))
for i in range(S):
    q = A.dot(q)
    r += q
but I want a faster method.
My first thought was that A can be symmetric, so an eigendecomposition would help to compute powers of A. But since A is a large sparse matrix, the decomposition produces dense matrices of the same size as A, which degrades performance (in terms of both memory and speed).
Low Rank Approximation was also considered. But A is large and sparse, so not sure which rank r is appropriate.
It is totally fine to pre-compute something, like A + A^2 + ... + A^S = B. But I hope the last computation will be fast: computing Bq in less than 40 ms.
Is there any reference or paper or tricks for that?
Even if the matrix is not sparse, the iterative method is the way to go.
Multiplying A.dot(q) has complexity O(N^2), while computing A.dot(A^i) has complexity O(N^3).
The fact that q is sparse (indeed much more sparse than A) may help.
For the first iteration, A*q is simply column q_hot_index of A, i.e. A[:, q_hot_index] (equivalently A[q_hot_index,:].T when A is symmetric).
For the second iteration, the vector A @ q has the same expected density as A (about 10%), so it is still worth doing sparsely.
From the third iteration onwards, A^i @ q will be dense.
Since you are accumulating the result, it is good that your r is not sparse: it prevents index manipulation.
There are several different ways to store sparse matrices. I myself can't say I understand in depth all of them, but I think csr_matrix, csc_matrix, are the most compact for generic sparse matrices.
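A minimal sketch of the iterative scheme with the one-hot shortcut for the first term. The function name power_sum_times_onehot is hypothetical, and for simplicity every iteration after the first uses a dense vector, rather than keeping the second one sparse as suggested above:

```python
import numpy as np
from scipy import sparse

def power_sum_times_onehot(A, hot_index, S):
    """Return (A + A^2 + ... + A^S) @ e, where e is one-hot at hot_index."""
    # First term: A @ e is just column hot_index of A, so no product is needed.
    q = A[:, [hot_index]].toarray().ravel()
    r = q.copy()  # dense accumulator, cheap to update in place
    for _ in range(S - 1):
        q = A @ q  # sparse matrix times dense vector
        r += q
    return r

A = sparse.random(30, 30, density=0.2, format="csr", random_state=0)
r = power_sum_times_onehot(A, hot_index=7, S=4)
```

No power of A is ever formed: the work is S - 1 sparse matrix-vector products plus one column extraction.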
Eigendecomposition is good when you need to compute P(A) itself. For computing P(A)*q, the eigendecomposition becomes advantageous only when P(A) has a degree on the order of the size of A. Eigendecomposition has complexity O(N^3), a matrix-vector product has complexity O(N^2), and the evaluation of a polynomial P(A) of degree D using the eigendecomposition can be achieved in O(N^3 + N*D).
Edit: Answering questions in the comments
"it prevents index manipulation" <- Could you elaborate this?
Suppose you have a sparse vector [0,0,0,2,0,7,0]. This could be described as ((3,2), (5,7)). Now suppose you assign 1 to one element and it becomes [0,0,0,2,1,7,0]; it is now represented as ((3,2), (4,1), (5,7)). The assignment is performed by insertion into an array, and inserting into an array has complexity O(nnz), where nnz is the number of nonzero elements. If you have a dense matrix, you can always modify one element with complexity O(1).
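The same example can be sketched in plain Python. This is a toy model of a sorted (index, value) representation, not how any particular library stores its data, and the helper name sparse_set is made up:

```python
import bisect

# Sparse representation of [0, 0, 0, 2, 0, 7, 0]: sorted indices plus values.
indices = [3, 5]
values = [2, 7]

def sparse_set(indices, values, i, x):
    """Assign x at position i, keeping the indices sorted."""
    pos = bisect.bisect_left(indices, i)
    if pos < len(indices) and indices[pos] == i:
        values[pos] = x          # entry already stored: overwrite in place
    else:
        indices.insert(pos, i)   # shifts every later entry: O(nnz)
        values.insert(pos, x)

sparse_set(indices, values, 4, 1)

# The dense equivalent is a single O(1) store.
dense = [0, 0, 0, 2, 0, 7, 0]
dense[4] = 1
```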
What is the N in the complexity?
It is the number of rows, or columns, of the matrix A
About the eigen decomposition, do you want to say that it is worth
that computing r can be achieved in O(N^3 +N*D) not O(N^3 + N^2)
Computing P(A) directly will have complexity O(N^3 * D) (with a different constant), so for big matrices, computing P(A) using the eigendecomposition is probably the most efficient. But P(A)x has O(N^2 * D) complexity, so as far as speed is concerned, it is probably not a good idea to compute P(A)x with the eigendecomposition unless D is big (> N).
I have a function which currently multiplies a matrix in scipy.sparse.csr_matrix form by a vector. I use this function for different values lots of times and I would like the matrix * vector multiplication to be as efficient as possible. The matrix is an N x N matrix, but only contains m x N non-zero elements, where m << N. The non-zero elements are currently arranged randomly about the matrix. I could perform row operations to get this matrix in a form such that all the elements appear on only m + 2 diagonals. Then use scipy.sparse.dia_matrix instead of scipy.sparse.csr_matrix. It will take quite a bit of work so I was wondering if anyone knows if this will even improve the computational efficiency?
I have two matrices A and B, each with a size of NxM, where N is the number of samples and M is the size of histogram bins. Thus, each row represents a histogram for that particular sample.
What I would like to do is to compute the chi-square distance between the two matrices for every pair of samples. Each row of matrix A will be compared to all rows of matrix B, resulting in a final matrix C of size NxN where C[i,j] is the chi-square distance between the histograms A[i] and B[j].
Here is my python code that does the job:
import numpy as np

def chi_square(histA, histB):
    eps = 1.e-10  # avoids division by zero for empty bins
    d = np.sum((histA - histB)**2 / (histA + histB + eps))
    return 0.5*d

def matrix_cost(A, B):
    a, _ = A.shape
    b, _ = B.shape
    C = np.zeros((a, b))
    for i in range(a):
        for j in range(b):
            C[i, j] = chi_square(A[i], B[j])
    return C
Currently, for a 100x70 matrix, this entire process takes 0.1 seconds.
Is there any way to improve this performance?
I would appreciate any thoughts or recommendations.
Thank you.
Sure! I'm assuming you're using numpy?
If you have the RAM available, you could broadcast the arrays and use numpy's efficient vectorization of the operations on those arrays.
Here's how:
Abroad = A[:,np.newaxis,:] # prepared for broadcasting
C = np.sum((Abroad - B)**2/(Abroad + B), axis=-1)/2.
Timing considerations on my platform show a factor of 10 speed gain compared to your algorithm.
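For reference, here is a self-contained version of the broadcast approach. The function name chi_square_pairwise is made up, and adding eps to the denominator is carried over from the question's code as an assumption, to avoid dividing by zero when a bin is empty in both histograms:

```python
import numpy as np

def chi_square_pairwise(A, B, eps=1e-10):
    """Return C with C[i, j] = chi-square distance between A[i] and B[j]."""
    # Shape (N_a, 1, M) broadcasts against B's (N_b, M) to give (N_a, N_b, M).
    A3 = A[:, np.newaxis, :]
    return 0.5 * np.sum((A3 - B) ** 2 / (A3 + B + eps), axis=-1)

rng = np.random.default_rng(0)
A = rng.random((6, 4))
B = rng.random((5, 4))
C = chi_square_pairwise(A, B)
```

The intermediate array holds N_a * N_b * M values, which is where the RAM requirement comes from.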
A slower option (but still faster than your original algorithm) that uses less RAM than the previous option is simply to broadcast the rows of A into 2D arrays:
def new_way(A,B):
    C = np.empty((A.shape[0],B.shape[0]))
    for rowind, row in enumerate(A):
        C[rowind,:] = np.sum((row - B)**2/(row + B), axis=-1)/2.
    return C
This has the advantage that it can be run for arrays with shape (N,M) much larger than (100,70).
You could also look to Theano to push the expensive for-loops to the C-level if you don't have the memory available. I get a factor 2 speed gain compared to the first option (not taking into account the initial compile time) for both the (100,70) arrays as well as (1000,70):
import theano
import theano.tensor as T
X = T.matrix("X")
Y = T.matrix("Y")
results, updates = theano.scan(lambda x_i: ((x_i - Y)**2/(x_i+Y)).sum(axis=1)/2., sequences=X)
chi_square_norm = theano.function(inputs=[X, Y], outputs=[results])
chi_square_norm(A,B) # same result