I want to find the least-squares solution of a linear system, and I am using NumPy's linalg.lstsq function:
weights = np.linalg.lstsq(semivariance, prediction, rcond=None)
The dimensions of my variables are:
semivariance is a float array of size 5030 x 5030
prediction is a 1D array of length 5030
The problem is that it takes approximately 80 seconds to return the value of weights, and I have to repeat the calculation about 10000 times, so the total computation time becomes prohibitive.
Is there a faster way or a more Pythonic function to do this?
@Brenlla appears to be right: even if you perform least squares by solving the normal equations (which is what the Moore-Penrose pseudoinverse reduces to for a full-rank matrix), it is significantly faster than np.linalg.lstsq:
import numpy as np
import time
semivariance=np.random.uniform(0,100,[5030,5030]).astype(np.float64)
prediction=np.random.uniform(0,100,[5030,1]).astype(np.float64)
start=time.time()
weights_lstsq = np.linalg.lstsq(semivariance, prediction, rcond=None)
print("Took: {}".format(time.time()-start))
>>> Took: 34.65818190574646
start=time.time()
weights_pseudo = np.linalg.solve(semivariance.T.dot(semivariance),semivariance.T.dot(prediction))
print("Took: {}".format(time.time()-start))
>>> Took: 2.0434153079986572
np.allclose(weights_lstsq[0],weights_pseudo)
>>> True
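The above is not run on your exact matrices, but the result on these samples likely transfers. np.linalg.lstsq minimizes || b - a x ||^2 to solve for x in ax = b, and it does so with an SVD-based LAPACK routine (gelsd). That makes it robust for rank-deficient or badly conditioned matrices, but considerably slower than a direct solve. Note the trade-off: forming semivariance.T.dot(semivariance) squares the condition number, so the normal equations lose accuracy when the matrix is ill-conditioned. Since your matrix is square, np.linalg.solve(semivariance, prediction) avoids that and is cheaper still, provided the matrix is nonsingular.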
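If the same semivariance matrix is reused across the repeated calculations and only prediction changes (an assumption; the question does not say), the right-hand sides can be stacked as columns and solved in one call. A minimal sketch on small stand-in sizes, where n, n_repeats and predictions are illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_repeats = 200, 5  # small stand-ins for 5030 and 10000

semivariance = rng.uniform(0, 100, (n, n))
# Assumption: one prediction vector per repeat, stored as a column
predictions = rng.uniform(0, 100, (n, n_repeats))

# The matrix is factorized once and the factorization is applied to every column
weights_all = np.linalg.solve(semivariance, predictions)

# Reference: lstsq on the first column only
weights_ref = np.linalg.lstsq(semivariance, predictions[:, 0], rcond=None)[0]
first_column_matches = np.allclose(weights_all[:, 0], weights_ref)

# Residual check: every column should reproduce its right-hand side
residual_ok = np.allclose(semivariance @ weights_all, predictions)
```

If the matrix does change on every repeat, this does not apply, and a direct np.linalg.solve per repeat is the simpler option.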
Related
Given a query vector (one-hot-vector) q with size of 50000x1 and a large sparse matrix A with size of 50000 x 50000 and nnz of A is 0.3 billion, I want to compute r=(A + A^2 + ... + A^S)q (usually 4 <= S <=6).
I can compute the above equation iteratively using a loop:
r = np.zeros((50000,1))
for i in range(S):
    q = A.dot(q)
    r += q
but I want a faster method.
My first thought was that, since A can be symmetric, an eigendecomposition would help to compute powers of A. But since A is a large sparse matrix, the decomposition produces a dense matrix of the same size as A, which degrades performance (in both memory and speed).
A low-rank approximation was also considered. But A is large and sparse, so I am not sure which rank r is appropriate.
It is totally fine to pre-compute something, like A + A^2 + ... + A^S = B. But I hope the final computation will be fast: computing Bq should take less than 40 ms.
Are there any references, papers, or tricks for that?
Even if the matrix were not sparse, the iterative method is the way to go.
Multiplying A.dot(q) has complexity O(N^2), while computing A.dot(A^i) has complexity O(N^3).
The fact that q is sparse (indeed much more sparse than A) may help.
For the first iteration, A*q can be computed as A[:, q_hot_index], which equals A[q_hot_index,:].T when A is symmetric.
For the second iteration, A @ q has the same expected density as A (about 10%), so it is still worth computing sparsely.
From the third iteration onwards, A^i @ q will be dense.
Since you are accumulating the result, it is good that your r is not sparse: it prevents index manipulation.
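The iterative scheme described above can be sketched as follows. This is a minimal illustration on a small dense matrix. The function name apply_power_sum is hypothetical, and the same code should also accept a scipy.sparse matrix for A, since it only relies on the @ operator:

```python
import numpy as np

def apply_power_sum(A, q, S):
    """Return (A + A^2 + ... + A^S) q using S matrix-vector products."""
    r = np.zeros(q.shape, dtype=float)  # dense accumulator
    v = q
    for _ in range(S):
        v = A @ v  # v becomes A^k q
        r += v
    return r

rng = np.random.default_rng(0)
N, S, hot = 6, 4, 2  # tiny stand-ins for 50000 and 4 <= S <= 6
A = rng.random((N, N)) * 0.1
q = np.zeros(N)
q[hot] = 1.0  # one-hot query vector

r = apply_power_sum(A, q, S)

# Check against the explicit (and far more expensive) matrix powers
B = sum(np.linalg.matrix_power(A, k) for k in range(1, S + 1))
r_explicit = B @ q

# For a one-hot q, the first product is just a column of A
first_product = A[:, hot]
```

Each step costs one matrix-vector product, whereas forming B explicitly requires matrix-matrix products.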
There are several different ways to store sparse matrices. I myself can't say I understand in depth all of them, but I think csr_matrix, csc_matrix, are the most compact for generic sparse matrices.
Eigendecomposition is good when you need to compute a matrix polynomial P(A). For computing P(A)*q, it becomes advantageous only when P(A) has degree of the order of the size of A. Eigendecomposition has complexity O(N^3), a matrix-vector product has complexity O(N^2), and the evaluation of a polynomial P(A) of degree D using the eigendecomposition can be achieved in O(N^3 + N*D).
Edit: Answering questions from the comments
"it prevents index manipulation" <- Could you elaborate this?
Suppose you have a sparse matrix [0,0,0,2,0,7,0]. This could be described as ((3,2), (5,7)). Now suppose you assign 1 to one element and it becomes [0,0,0,2,1,7,0]; it is now represented as ((3,2), (4,1), (5,7)). The assignment is performed by insertion into an array, and inserting into an array has complexity O(nnz), where nnz is the number of nonzero elements. If you have a dense matrix, you can always modify one element with complexity O(1).
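The difference can be illustrated with a toy sorted index/value representation. This is a sketch for illustration only, not how any particular sparse library stores its data, and sparse_set is a hypothetical helper:

```python
import bisect

# Sparse form of [0,0,0,2,0,7,0]: sorted indices and their values
indices = [3, 5]
values = [2, 7]

def sparse_set(indices, values, i, x):
    """Assign x at position i, keeping the index list sorted."""
    pos = bisect.bisect_left(indices, i)
    if pos < len(indices) and indices[pos] == i:
        values[pos] = x  # entry already stored: overwrite in place
    else:
        indices.insert(pos, i)  # shifts every later entry: O(nnz)
        values.insert(pos, x)

sparse_set(indices, values, 4, 1)

# The dense form needs no search and no shifting: O(1)
dense = [0, 0, 0, 2, 0, 7, 0]
dense[4] = 1
```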
What is the N in the complexity?
It is the number of rows, or columns, of the matrix A
About the eigendecomposition, do you mean that
computing r can be achieved in O(N^3 + N*D), not O(N^3 + N^2)?
Computing P(A) directly will have complexity O(N^3 * D) (with a different constant), so for big matrices, computing P(A) using the eigendecomposition is probably the most efficient. But computing P(A)x has O(N^2 * D) complexity, so when speed is the concern it is probably not a good idea to compute P(A)x with the eigendecomposition unless D is big (> N).
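As a small illustration of evaluating P(A) through the eigendecomposition, here is a sketch that assumes A is symmetric (so that numpy.linalg.eigh applies) and uses made-up coefficients; it compares the result with a direct Horner evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 4  # matrix size and polynomial degree (illustrative values)
M = rng.standard_normal((N, N))
A = (M + M.T) / 2  # assumption: A is symmetric
coeffs = rng.standard_normal(D + 1)  # c_0, c_1, ..., c_D

# Eigendecomposition route: O(N^3) once, then O(N*D) on the eigenvalues
w, V = np.linalg.eigh(A)
p_w = np.polynomial.polynomial.polyval(w, coeffs)  # P applied to each eigenvalue
P_eig = (V * p_w) @ V.T  # V diag(P(w)) V^T

# Direct route: Horner's scheme with D matrix-matrix products, O(N^3 * D)
P_direct = np.zeros((N, N))
for c in coeffs[::-1]:
    P_direct = P_direct @ A + c * np.eye(N)
```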
I want to solve the linear equation Ax = b, for the unknown matrix x. A and b are both large and sparse, and have shapes (when converted to dense) of 30,000 x 25 and 30,000 x 100,000, respectively.
I have tried using both scipy.sparse.linalg.lsqr and scipy.sparse.linalg.lsmr, but they both require that b be dense, which is computationally very expensive and prohibitive.
How can I do this?
You could use numpy.linalg.pinv to find the "x" values.
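A minimal sketch of that idea on small dense stand-ins follows. It assumes A can be held densely (at 30,000 x 25 it is only about 6 MB in float64) and, for simplicity, also stores b densely here; with a sparse b the same product would have to be formed column block by column block or with a sparse-aware multiply, which this sketch does not show:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 25))  # stand-in for the 30,000 x 25 matrix
b = rng.standard_normal((300, 40))  # stand-in for the 30,000 x 100,000 matrix
b[np.abs(b) < 1.5] = 0.0  # make b mostly zeros, as in the question

A_pinv = np.linalg.pinv(A)  # shape (25, 300), computed once
x = A_pinv @ b  # least-squares solution for every column of b

# Reference solution for comparison
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
```

Because the pseudoinverse depends only on A, it is computed once and reused for all columns of b.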
I have a 50-dimensional array whose dimensions are 255 x 255 x 255 x ... (50 times) ... x 255, so it holds a total of 255^50 floating point numbers. It is out of the question to even think of fitting it in RAM. Moreover, I need to take a 50-dimensional Fourier transform (DFT, via FFT) of this array. I can't do it in Python on an ordinary PC, and I can't even imagine doing it on a GPU. I am guessing I would have to use hard disk storage, but the data is too large even for that. I don't need this in real time; I can afford to let it run for days. I have no clue what sort of machine I need, or whether it is even possible. I would appreciate your advice: supercomputers, grids, or something else. Even if it is costly, I am not worried about the investment.
If you found enough universes to save your data in, here is what you could do:
The Fourier Transform is separable: calculating the DFT along each axis, one after the other, gives the same result as calculating the n-dimensional DFT directly:
for i in range(C.ndim):
    C[...] = numpy.fft.fft(C, axis=i)
Double checking if the value is correct using a 2D tensor (because we have a 2D FFT numpy.fft.fft2 to compare against):
import numpy
A = numpy.random.rand(*[16] * 2)
B = numpy.fft.fft2(A)
# numpy.complex was removed from recent NumPy releases; use complex128
C = A.astype(numpy.complex128) # output array for separable FFT
for i in range(C.ndim):
    C[...] = numpy.fft.fft(C, axis=i)
numpy.allclose(C, B) # True
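The same check can be repeated in more than two dimensions against numpy.fft.fftn, and the storage requirement of the full problem can be worked out with Python's arbitrary-precision integers. This is a small sketch; the 16 bytes per element assumes complex128 output:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 5, 6))  # small 3-D stand-in

# Separable transform: one 1-D FFT per axis
C = A.astype(np.complex128)
for axis in range(C.ndim):
    C = np.fft.fft(C, axis=axis)

matches_fftn = np.allclose(C, np.fft.fftn(A))

# Size of the 50-dimensional array from the question
n_elements = 255 ** 50
n_bytes = n_elements * 16  # assumption: complex128, 16 bytes per element
print(f"{n_elements:.3e} elements, {n_bytes:.3e} bytes")
```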
I try to compute the matrix multiplication using the script:
import numpy as np
import math
from timeit import default_timer as timer
from numba import cuda
from numba import *

def mult(a,b):
    return a*b

mult_gpu=cuda.jit(restype=float32,argtypes=[float32,float32],device=True)(mult)

@cuda.jit(argtypes=[float32[:,:],float32[:,:],float32[:,:,:]])
def mult_kernel(a,b,c):
    Ni=c.shape[0]
    Nj=c.shape[1]
    Nk=c.shape[2]
    startX,startY,startZ=cuda.grid(3)
    gridX=cuda.gridDim.x*cuda.blockDim.x
    gridY=cuda.gridDim.y*cuda.blockDim.y
    gridZ=cuda.gridDim.z*cuda.blockDim.z
    for i in range(startX,Ni,gridX):
        for j in range(startY,Nj,gridY):
            for k in range(startZ,Nk,gridZ):
                c[i,j,k]=mult_gpu(a[i,k],b[j,k])

def main():
    A=np.ones((20,50000),dtype=np.float32)
    B=np.ones((3072,50000),dtype=np.float32)
    C=np.ones((20,3072,50000),dtype=np.float32)
    (Ni,Nj,Nk)=C.shape
    my_gpu=cuda.get_current_device()
    thread_ct=my_gpu.WARP_SIZE
    block_ct_x=int(math.ceil(float(Ni)/thread_ct))
    block_ct_y=int(math.ceil(float(Nj)/thread_ct))
    block_ct_z=int(math.ceil(float(Nk)/thread_ct))
    blockdim=thread_ct,thread_ct,thread_ct
    griddim=block_ct_x,block_ct_y,block_ct_z
    print "Threads per block:",blockdim
    print "Blocks per grid:",griddim
    start=timer()
    Cg=cuda.to_device(C)
    mult_kernel[griddim,blockdim](A,B,Cg)
    Cg.to_host()
    dt=timer()-start
    print "Computation done in %f s"%(dt)
    print 'C[:3,1,1] = ',C[:3,1,1]
    print 'C[-3:,1,1] = ',C[-3:,1,1]

if __name__=='__main__':
    main()
Executing this yields an error:
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
How could I fix this memory problem?
Nevertheless, using smaller matrices
A=np.ones((20,500),dtype=np.float32)
B=np.ones((372,500),dtype=np.float32)
C=np.ones((20,372,500),dtype=np.float32)
there is still an error:
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
I got inspired by the Mandelbrot Example to implement the computation above.
EDIT1
In order to resolve any confusion, this is actually a 3D matrix by 3D matrix multiplication:
A=np.ones((20,1,50000),dtype=np.float32)
B=np.ones((1,3072,50000),dtype=np.float32)
C=np.ones((20,3072,50000),dtype=np.float32)
I skipped one dimension in A and B because it is not necessary for the computation.
EDIT2
My GPU is:
In [1]: from numba import cuda
In [2]: gpu=cuda.get_current_device()
In [3]: gpu.name
Out[3]: 'GeForce GT 750M'
EDIT3
According to the memory of my GPU (2GB), I reduced the size of each dimension by 2:
dimx=10
dimy=1536
dimz=25000
A=np.ones((dimx,dimz),dtype=np.float32)
B=np.ones((dimy,dimz),dtype=np.float32)
C=np.ones((dimx,dimy,dimz),dtype=np.float32)
But I still receive the CUDA_ERROR_OUT_OF_MEMORY error. How could one explain this?
The calculation yields a size of about 1.7GB for the 3 matrices:
(10*1536*25000*4.+10*25000*4+1536*25000*4.)/(10**9)=1.6906
Regarding the first problem, you're running out of memory. A major contributor to that is that this isn't the way people would normally do a matrix-matrix multiply. Normally, as you multiply row and column elements together, you would keep a running sum, then store that sum in the appropriate location in the product (result) matrix. This allows a much smaller c matrix, i.e. it need only be 2 dimensions, not 3. You may wish to study the linear algebra definition of matrix-matrix multiply: when you multiply a 2D matrix by a 2D matrix, the result is a 2D matrix, not a 3D matrix.
In a nutshell, something like this:
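for i in range(startX,Ni,gridX):
    for j in range(startY,Nj,gridY):
        acc = 0.0
        # each thread sums over the full k range, so no two threads
        # accumulate into the same c[i,j]; a 2D launch grid is sufficient
        for k in range(Nk):
            acc = acc + mult_gpu(a[i,k],b[j,k])
        c[i,j] = acc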
And adjust your c shape accordingly.
If you actually need the individual products in 3D form as you are doing here, then there is not much I can say except that you will need to scale the problem to fit in the GPU memory size for whatever GPU you are using.
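The relationship between the 3D array of individual products and the 2D matrix product, and the memory each one needs, can be checked on the CPU with NumPy. This is a sketch on small stand-in shapes; gigabytes is a hypothetical helper:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 7)).astype(np.float32)
b = rng.random((6, 7)).astype(np.float32)

# What the original kernel stores: every individual product a[i,k]*b[j,k]
products_3d = a[:, None, :] * b[None, :, :]  # shape (4, 6, 7)

# Summing over k gives the ordinary matrix product of a with b transposed
c_2d = products_3d.sum(axis=2)
c_matmul = a @ b.T

def gigabytes(shape, itemsize=4):
    """Size in GB of an array of the given shape (float32 by default)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize / 1e9

size_3d = gigabytes((20, 3072, 50000))  # result array from the question
size_2d = gigabytes((20, 3072))  # result array when summing over k
```

For the shapes in the question, the 3D result alone is several times larger than the 2 GB of memory on the GPU, whereas the 2D result is well under a megabyte.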
Regarding the second problem, you have a problem here:
thread_ct=my_gpu.WARP_SIZE
...
blockdim=thread_ct,thread_ct,thread_ct
WARP_SIZE is 32 (presumably) so you are asking for a 3D block of dimensions 32*32*32 = 32K threads. But CUDA threadblocks are limited to a maximum of 1024 threads, which limit is the product of the individual dimensions.
If you change your thread_ct to 8, for example:
thread_ct=8
You should be able to get past this particular issue.
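The launch configuration can be sanity-checked on the host before calling the kernel. A minimal sketch, assuming the 1024 threads-per-block limit quoted above; launch_config is a hypothetical helper, not part of numba:

```python
import math

# Limit quoted in the answer above; treated here as an assumption
# rather than queried from the device.
MAX_THREADS_PER_BLOCK = 1024

def launch_config(shape, threads_per_axis):
    """Return (griddim, blockdim) covering `shape`, or raise if the block is too big."""
    blockdim = (threads_per_axis,) * len(shape)
    threads_per_block = math.prod(blockdim)
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        raise ValueError(
            f"{threads_per_block} threads per block exceeds {MAX_THREADS_PER_BLOCK}"
        )
    griddim = tuple(math.ceil(n / threads_per_axis) for n in shape)
    return griddim, blockdim

shape = (20, 372, 500)  # the smaller C from the question
griddim, blockdim = launch_config(shape, 8)  # 8*8*8 = 512 threads per block

# A warp-sized block edge (32) gives 32*32*32 = 32768 threads and is rejected
try:
    launch_config(shape, 32)
    warp_sized_block_ok = True
except ValueError:
    warp_sized_block_ok = False
```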
I have two matrices A and B, each with a size of NxM, where N is the number of samples and M is the size of histogram bins. Thus, each row represents a histogram for that particular sample.
What I would like to do is compute the chi-square distance between the two matrices for every pair of samples. Each row in matrix A will be compared to all rows in matrix B, resulting in a final matrix C of size NxN, where C[i,j] is the chi-square distance between the histograms A[i] and B[j].
Here is my python code that does the job:
import numpy as np

def chi_square(histA,histB):
    eps = 1.e-10  # avoids division by zero when a bin is empty in both histograms
    d = np.sum((histA-histB)**2/(histA+histB+eps))
    return 0.5*d

def matrix_cost(A,B):
    a,_ = A.shape
    b,_ = B.shape
    C = np.zeros((a,b))
    for i in range(a):
        for j in range(b):
            C[i,j] = chi_square(A[i],B[j])
    return C
Currently, for a 100x70 matrix, this entire process takes 0.1 seconds.
Is there any way to improve this performance?
I would appreciate any thoughts or recommendations.
Thank you.
Sure! I'm assuming you're using numpy?
If you have the RAM available, you could broadcast the arrays and use numpy's efficient vectorization of the operations on those arrays.
Here's how:
Abroad = A[:,np.newaxis,:] # prepared for broadcasting
C = np.sum((Abroad - B)**2/(Abroad + B), axis=-1)/2.
Timing considerations on my platform show a factor of 10 speed gain compared to your algorithm.
A slower option (but still faster than your original algorithm) that uses less RAM than the previous option is simply to broadcast the rows of A into 2D arrays:
def new_way(A,B):
    C = np.empty((A.shape[0],B.shape[0]))
    for rowind, row in enumerate(A):
        C[rowind,:] = np.sum((row - B)**2/(row + B), axis=-1)/2.
    return C
This has the advantage that it can be run for arrays with shape (N,M) much larger than (100,70).
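The broadcast version can be checked against the original double loop on a small example. The sketch below keeps the eps term from the question, which the vectorized snippets above omit, so that bins that are empty in both histograms do not cause a division by zero; the function names are illustrative:

```python
import numpy as np

def chi_square_loop(A, B, eps=1e-10):
    """Reference implementation: one pair of rows at a time."""
    C = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            C[i, j] = 0.5 * np.sum((A[i] - B[j]) ** 2 / (A[i] + B[j] + eps))
    return C

def chi_square_broadcast(A, B, eps=1e-10):
    """Vectorized version: builds an intermediate of shape (N_A, N_B, M)."""
    diff = A[:, np.newaxis, :] - B[np.newaxis, :, :]
    total = A[:, np.newaxis, :] + B[np.newaxis, :, :]
    return 0.5 * np.sum(diff ** 2 / (total + eps), axis=-1)

rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(10, 7)).astype(float)  # histograms with some empty bins
B = rng.integers(0, 5, size=(8, 7)).astype(float)

C_loop = chi_square_loop(A, B)
C_fast = chi_square_broadcast(A, B)
```

The broadcast intermediate has N_A * N_B * M elements, which is the RAM cost mentioned above.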
You could also look to Theano to push the expensive for-loops to the C-level if you don't have the memory available. I get a factor 2 speed gain compared to the first option (not taking into account the initial compile time) for both the (100,70) arrays as well as (1000,70):
import theano
import theano.tensor as T
X = T.matrix("X")
Y = T.matrix("Y")
results, updates = theano.scan(lambda x_i: ((x_i - Y)**2/(x_i+Y)).sum(axis=1)/2., sequences=X)
chi_square_norm = theano.function(inputs=[X, Y], outputs=[results])
chi_square_norm(A,B) # same result