Matrix multiplication in CUDA running out of memory - python

I am trying to compute a matrix multiplication with the following script:
import numpy as np
import math
from timeit import default_timer as timer
from numba import cuda
from numba import *
def mult(a,b):
    return a*b
mult_gpu=cuda.jit(restype=float32,argtypes=[float32,float32],device=True)(mult)
@cuda.jit(argtypes=[float32[:,:],float32[:,:],float32[:,:,:]])
def mult_kernel(a,b,c):
    Ni=c.shape[0]
    Nj=c.shape[1]
    Nk=c.shape[2]
    startX,startY,startZ=cuda.grid(3)
    gridX=cuda.gridDim.x*cuda.blockDim.x
    gridY=cuda.gridDim.y*cuda.blockDim.y
    gridZ=cuda.gridDim.z*cuda.blockDim.z
    for i in range(startX,Ni,gridX):
        for j in range(startY,Nj,gridY):
            for k in range(startZ,Nk,gridZ):
                c[i,j,k]=mult_gpu(a[i,k],b[j,k])
def main():
    A=np.ones((20,50000),dtype=np.float32)
    B=np.ones((3072,50000),dtype=np.float32)
    C=np.ones((20,3072,50000),dtype=np.float32)
    (Ni,Nj,Nk)=C.shape
    my_gpu=cuda.get_current_device()
    thread_ct=my_gpu.WARP_SIZE
    block_ct_x=int(math.ceil(float(Ni)/thread_ct))
    block_ct_y=int(math.ceil(float(Nj)/thread_ct))
    block_ct_z=int(math.ceil(float(Nk)/thread_ct))
    blockdim=thread_ct,thread_ct,thread_ct
    griddim=block_ct_x,block_ct_y,block_ct_z
    print "Threads per block:",blockdim
    print "Blocks per grid:",griddim
    start=timer()
    Cg=cuda.to_device(C)
    mult_kernel[griddim,blockdim](A,B,Cg)
    Cg.to_host()
    dt=timer()-start
    print "Computation done in %f s"%(dt)
    print 'C[:3,1,1] = ',C[:3,1,1]
    print 'C[-3:,1,1] = ',C[-3:,1,1]
if __name__=='__main__':
    main()
Executing this yields an error:
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
How could I fix this memory problem?
Nevertheless, using smaller matrices
A=np.ones((20,500),dtype=np.float32)
B=np.ones((372,500),dtype=np.float32)
C=np.ones((20,372,500),dtype=np.float32)
there is still an error:
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
I got inspired by the Mandelbrot Example to implement the computation above.
EDIT1
In order to resolve any confusion, this is actually a 3D matrix by 3D matrix multiplication:
A=np.ones((20,1,50000),dtype=np.float32)
B=np.ones((1,3072,50000),dtype=np.float32)
C=np.ones((20,3072,50000),dtype=np.float32)
I skipped one dimension in A and B because it is not necessary for the computation.
EDIT2
My GPU is:
In [1]: from numba import cuda
In [2]: gpu=cuda.get_current_device()
In [3]: gpu.name
Out[3]: 'GeForce GT 750M'
EDIT3
Given the memory of my GPU (2GB), I halved the size of each dimension:
dimx=10
dimy=1536
dimz=25000
A=np.ones((dimx,dimz),dtype=np.float32)
B=np.ones((dimy,dimz),dtype=np.float32)
C=np.ones((dimx,dimy,dimz),dtype=np.float32)
But I still receive the CUDA_ERROR_OUT_OF_MEMORY error. How could one explain this?
The calculation yields a size of about 1.7GB for the 3 matrices:
(10*1536*25000*4.+10*25000*4+1536*25000*4.)/(10**9)=1.6906

Regarding the first problem, you're running out of memory. A major contributor is that this isn't the way people normally do a matrix-matrix multiply. Normally, as you multiply row and column elements together, you keep a running sum and then store that sum in the appropriate location in the product (result) matrix. This allows a much smaller c matrix, i.e. it need only be 2 dimensions, not 3. You may wish to study the linear algebra definition of matrix-matrix multiply. When you multiply a 2D matrix by a 2D matrix, the result is a 2D matrix, not a 3D matrix.
In a nutshell, something like this:
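for i in range(startX,Ni,gridX):
    for j in range(startY,Nj,gridY):
        # one thread owns c[i,j] and sums over every k itself (Nk = a.shape[1]);
        # striding k across threads would make them race on c[i,j]
        acc = 0.0
        for k in range(Nk):
            acc += mult_gpu(a[i,k],b[j,k])
        c[i,j] = acc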
And adjust your c shape accordingly.
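As a quick CPU-side check of the running-sum idea, here is a minimal NumPy sketch. The small shapes are made up for the demo and are not the sizes from the question. It shows that summing the individual products over k gives the same 2D result as an ordinary matrix product of a with the transpose of b:

```python
import numpy as np

rng = np.random.default_rng(0)
# Small illustrative shapes (assumed for this demo only)
a = rng.random((4, 6), dtype=np.float32)
b = rng.random((5, 6), dtype=np.float32)

# 3D array of individual products, as in the question:
# products[i, j, k] = a[i, k] * b[j, k]
products = a[:, None, :] * b[None, :, :]

# Keeping a running sum over k collapses it to the 2D result c[i, j]
c_summed = products.sum(axis=2)

# The same quantity written as a standard matrix product
c_matmul = a @ b.T

print(products.shape, c_matmul.shape)
print(np.allclose(c_summed, c_matmul, rtol=1e-5))
```

The 3D intermediate is only built here to make the comparison; the point of the running sum is that it never has to exist.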
If you actually need the individual products in 3D form as you are doing here, then there is not much I can say except that you will need to scale the problem to fit in the GPU memory size for whatever GPU you are using.
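To see how far the problem has to be scaled, the arithmetic can be sketched in plain Python. The helper name nbytes is hypothetical, and float32 (4 bytes per element) is assumed as in the question:

```python
def nbytes(shape, itemsize=4):
    """Bytes needed for a dense array of the given shape (float32 by default)."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total

GB = 10**9
full_3d = nbytes((20, 3072, 50000))    # c as allocated in the question
halved_3d = nbytes((10, 1536, 25000))  # c after halving every dimension
result_2d = nbytes((20, 3072))         # c if only the summed 2D result is kept

print("3D c, original sizes: %.3f GB" % (full_3d / GB))
print("3D c, halved sizes:   %.3f GB" % (halved_3d / GB))
print("2D c, original sizes: %.6f GB" % (result_2d / GB))
```

The 3D array at the original sizes needs roughly 12.3 GB on its own, while the 2D result is a few hundred kilobytes.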
Regarding the second problem, you have a problem here:
thread_ct=my_gpu.WARP_SIZE
...
blockdim=thread_ct,thread_ct,thread_ct
WARP_SIZE is 32 (presumably), so you are asking for a 3D block of dimensions 32*32*32 = 32768 threads. But CUDA threadblocks are limited to a maximum of 1024 threads in total, i.e. the limit applies to the product of the individual dimensions.
If you change your thread_ct to 8, for example:
thread_ct=8
You should be able to get past this particular issue.
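A small sanity check before launching can catch this. The sketch below is plain Python with no GPU required. The helper launch_config is hypothetical, and the 1024 limit is the one quoted above:

```python
import math

MAX_THREADS_PER_BLOCK = 1024  # limit quoted above

def launch_config(shape, threads_per_dim):
    """Return (griddim, blockdim) covering `shape`, or raise if the block is too big."""
    blockdim = (threads_per_dim,) * len(shape)
    threads = math.prod(blockdim)
    if threads > MAX_THREADS_PER_BLOCK:
        raise ValueError("%d threads per block exceeds %d"
                         % (threads, MAX_THREADS_PER_BLOCK))
    # Enough blocks in each dimension to cover every element
    griddim = tuple(math.ceil(n / threads_per_dim) for n in shape)
    return griddim, blockdim

# 8*8*8 = 512 threads per block: accepted
print(launch_config((20, 372, 500), 8))

# 32*32*32 = 32768 threads per block: rejected
try:
    launch_config((20, 372, 500), 32)
except ValueError as err:
    print("rejected:", err)
```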

Related

NumPy array row differences

I have a NumPy array vectors = np.random.randn(rows, cols). I want to find differences between its rows according to some other array diffs, which is sparse and "2-hot": each row contains a 1 in the column corresponding to the first row of vectors and a -1 in the column corresponding to the second row. Perhaps an example will make it clearer:
diffs = np.array([[ 1,  0, -1],
                  [ 1, -1,  0]])
Then I can compute the row differences by simply diffs @ vectors.
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse: sparse.csr_matrix(diffs) @ vectors, but even this takes 300ms.
Possibly this is simply as fast as it gets, but part of me wonders whether matrix multiplication is the wisest choice for this task.
What's more, I need to take the absolute value afterwards, so really I'm doing np.abs(sparse.csr_matrix(diffs) @ vectors), which adds ~200ms for a grand total of ~500ms.
"I can compute the row differences by simply diffs @ vectors."
This is very inefficient. A matrix multiplication runs in O(n*m*k) time for an (n,m) matrix multiplied by an (m,k) one. In your case, there are only two nonzero values per row and you do not actually need a multiplication by 1 or -1. Your problem can be computed in O(n*k) time (i.e. m times faster).
"Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse."
The problem is that the input data representation is inefficient. When diffs is an array of shape (10_000,1000), it is not reasonable to use a dense matrix that is ~1000 times bigger than needed, nor a sparse matrix format that is not optimized for rows with only two nonzero values (especially 1 and -1). Instead, store the positions of the nonzero values in a 2D array called sel_rows of shape (2,n), where the first row contains the location of the 1 and the second one contains the location of the -1 in each row of the diffs 2D array. Then you can extract the rows of vectors, for example with vectors[sel_rows[0]]. You can perform the final operation with vectors[sel_rows[0,:]] - vectors[sel_rows[1,:]]. This approach should be drastically faster than a dense matrix product, and it may be a bit faster than a sparse one depending on the target machine.
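The indexing approach can be sketched in pure NumPy on small made-up data. Recovering sel_rows from a dense diffs with argmax/argmin is only done here so the result can be compared against the matrix product. In practice you would build sel_rows directly:

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, n = 6, 4, 5  # tiny illustrative sizes
vectors = rng.standard_normal((rows, cols))

# Build a small dense "2-hot" diffs matrix (illustrative data)
plus = rng.integers(0, rows, size=n)
minus = (plus + rng.integers(1, rows, size=n)) % rows  # always differs from plus
diffs = np.zeros((n, rows))
diffs[np.arange(n), plus] = 1
diffs[np.arange(n), minus] = -1

# sel_rows[0] holds the column of the 1, sel_rows[1] the column of the -1
sel_rows = np.stack([diffs.argmax(axis=1), diffs.argmin(axis=1)])

# Row differences by fancy indexing instead of a matrix product
fast = np.abs(vectors[sel_rows[0]] - vectors[sel_rows[1]])
reference = np.abs(diffs @ vectors)

print(sel_rows.shape)
print(np.allclose(fast, reference))
```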
While the above solution is simple, it creates multiple temporary arrays that are not cache-friendly, since your output array takes 10_000 * 15_000 * 8 bytes = 1.1 GiB (which is quite large). You can use Numba to remove the temporary arrays and so improve performance. Multiple threads can be used to improve performance even further. Here is some untested code (note that diffs here is the index array of shape (n,2), with one pair of row indices per line, not the dense matrix):
import numba as nb
import numpy as np
@nb.njit('(int_[:,::1], float64[:,::1])', parallel=True)
def compute(diffs, vectors):
    n, k = diffs.shape[0], vectors.shape[1]
    assert diffs.shape[1] == 2
    res = np.empty((n, k))
    for i in nb.prange(n):
        a, b = diffs[i]
        for j in range(k):
            # Apply abs() here if needed so as to avoid
            # creating new temporary arrays
            res[i, j] = vectors[a, j] - vectors[b, j]
    return res
The above code should be nearly optimal. It should be memory-bound and able to saturate the memory bandwidth. Note that writing such huge arrays to memory takes some time, as does reading (twice) the input array. On x86-64 platforms, a basic implementation should move 4.4 GiB of data from/to the RAM. Thus, on a mainstream PC with 20 GiB/s of RAM bandwidth, this takes 220 ms. In fact, the sparse matrix computation was not so bad in practice for a sequential implementation.
If this is not enough for you, then you can use single-precision floating-point numbers instead of double-precision (twice as fast). You could also use a low-level C/C++ implementation to reduce the memory bandwidth used (thanks to non-temporal instructions: ~30% faster). There is not much more to do.
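The memory-bound estimate quoted above can be reproduced with a few lines of arithmetic. The factor of 4 below is an assumption about how the traffic breaks down (the output both read and written because of write-allocate caching, plus two input rows read per output row). The answer itself only gives the rounded totals:

```python
n, k = 10_000, 15_000
itemsize = 8   # float64
GiB = 2**30

out_gib = n * k * itemsize / GiB  # size of the output array
# ASSUMPTION: about four passes of that size go through RAM
# (output read + written, and two input rows read per output row)
traffic_gib = 4 * out_gib
bandwidth_gib_s = 20              # "mainstream PC" figure used above
seconds = traffic_gib / bandwidth_gib_s

print("output:  %.2f GiB" % out_gib)
print("traffic: %.2f GiB" % traffic_gib)
print("time:    %.0f ms" % (seconds * 1000))
```

Switching itemsize to 4 (single precision) halves every figure, which is where the "twice as fast" estimate comes from.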

Matrix multiplication using @ or numpy.dot() is giving wrong results at higher values. Why?

Here input a is given in comments in the next block and the answer which I have got using numpy also gives me same result.
The actual answer should be
If you want the real context where I am using this is the drive link for the file
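This happens because the numbers in the original matrix a are too large: the products most likely no longer fit in the array's fixed-width integer type, so they overflow and the output is incorrect.
To deal with this problem, first scale the numbers in the matrix down by a constant factor, then run the matrix multiplication, and finally scale the result back up. Dividing also converts the array to floating point, which does not overflow here. Because every entry of the product is a sum of products of two scaled entries, the result has to be multiplied by the square of the factor.
The following is the code:
import numpy as np
a = np.array([[3524578, 2178309],[2178309, 1346269]])
a = a / 10000
b = np.dot(a, a) * 10000**2
print(b)
Output:
[[1.71676802e+13 1.06102099e+13]
 [1.06102099e+13 6.55747032e+12]]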
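If exact integer results are wanted rather than rounded floats, an alternative sketch is to choose a wider integer type explicitly. This assumes the wrong results come from a 32-bit integer dtype, which the question does not confirm. The largest entry of the product is about 1.7e13, which fits comfortably in int64, and dtype=object falls back to Python integers that never overflow:

```python
import numpy as np

values = [[3524578, 2178309], [2178309, 1346269]]

# Explicit 64-bit integers: every entry of the product fits in int64
a64 = np.array(values, dtype=np.int64)
b64 = a64 @ a64

# Python integers (dtype=object) cannot overflow, at the cost of speed
a_obj = np.array(values, dtype=object)
b_obj = np.dot(a_obj, a_obj)

print(b64)
print(b_obj)
```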

Efficient way to implement matrix multiplication when one matrix is extremely wide?

I need to multiply 3 matrices, A: 3000x100, B: 100x100, C: 100x3.6MM. I currently am just using normal matrix multiplication in PyTorch
A_gpu = torch.from_numpy(A)
B_gpu = torch.from_numpy(B)
C_gpu = torch.from_numpy(C)
D_gpu = (A_gpu @ B_gpu @ C_gpu.t()).t()
C is very wide so the data reuse on gpu is limited but are there other ways to speed this up? I have a machine with 4x GPUs.
Since you have four GPUs, you can harness them to perform the matrix multiplication efficiently. Notice, however, that the result of the multiplication has size 3000x3600000, which takes up about 40GB in single-precision floating point (fp32). Unless the CPU has enough RAM, you cannot store the result of this computation in host memory.
A possible solution is to break up the large matrix C into four smaller chunks, perform the matrix multiplication of each chunk on a different GPU, and keep the result on the GPU. Provided that each GPU has at least 10GB of memory, you will have enough memory for this.
If you also have enough CPU memory, you can then move the results from all four GPUs onto the CPU and concatenate them (in fact, in this case you could use only a single GPU and transfer the results from GPU to CPU each time). Otherwise, you can keep the results chunked on the GPUs, and you need to keep track of the fact that the four chunks are actually parts of one matrix.
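The sizes involved are easy to check. This is only a back-of-the-envelope sketch in plain Python, counting the fp32 result and ignoring the inputs and any framework overhead:

```python
rows, cols = 3000, 3_600_000
bytes_fp32 = 4
n_gpus = 4

total = rows * cols * bytes_fp32  # full result D
per_gpu = total // n_gpus         # one quarter of D per GPU

print("full result: %.1f GB (%.1f GiB)" % (total / 1e9, total / 2**30))
print("per GPU:     %.1f GB (%.1f GiB)" % (per_gpu / 1e9, per_gpu / 2**30))
```

By this estimate each quarter of the result is already a little over 10 GiB, so a card with exactly 10GB would have essentially no headroom left for its chunk of C.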
import numpy as np
import torch.nn as nn
import torch
number_of_gpus = 4
# create three matrices
A = np.random.normal(size=(3000,100))
B = np.random.normal(size=(100,100))
C = np.random.normal(size=(100,3600000))
# convert them to pytorch fp32 tensors
A = torch.from_numpy(A).float()
B = torch.from_numpy(B).float()
C = torch.from_numpy(C).float()
# calculate `A@B`, which is easy
AB = A@B
# split the large matrix `C` into 4 smaller chunks along the second dimension.
# we assume here that the size of the second dimension of `C` is divisible by 4.
C_split = torch.split(C,C.shape[1]//number_of_gpus,dim=1)
# loop over the four GPUs, and perform the calculation on each using the corresponding chunk of `C`
D_split = []
for i in range(number_of_gpus):
    device = 'cuda:{:d}'.format(i)
    D_split.append( AB.to(device) @ C_split[i].to(device))
# DO THIS ONLY IF YOU HAVE ENOUGH CPU MEMORY!! :
D = torch.cat([d.cpu() for d in D_split],dim=1)
If you have multiple GPUs, you can distribute the computation on all of them using PyTorch's DataParallel. It will split (parallelize) the multiplication of the columns of the matrix C_gpu among the GPUs.
Here's how:
First, import the modules and prepare the matrices:
import torch
import torch.nn as nn
A_gpu = torch.from_numpy(A).float()
B_gpu = torch.from_numpy(B).float()
C_gpu = torch.from_numpy(C).float()
Next, create a Linear "layer" without bias. What this layer does is exactly matrix multiplication. The input size will be the size of each column of C_gpu, and the output size will be the size of each column of the result.
mat_mult = nn.Linear(in_features=C_gpu.shape[0],out_features=A_gpu.shape[0],bias=False)
Set the matrix (=weight) of the layer to be A_gpu @ B_gpu, which is a small matrix that can be quickly computed without parallelization (although you could parallelize it as well if you want).
mat_mult.weight.data = A_gpu @ B_gpu
Convert the layer into a DataParallel instance. This means that it will automatically parallelize computation along the "batch" dimension. The argument device_ids is a list of indices of your GPUs (4 of them, in your case).
mat_mult_gpu = nn.DataParallel(mat_mult,device_ids=[0,1,2,3]).to('cuda:0')
Now you can feed the matrix C_gpu into the layer, and the computation will be parallelized along its large dimension:
D_gpu = mat_mult_gpu(C_gpu.t())
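Why the transposes line up can be checked on the CPU. A bias-free linear layer computes input @ weight.T, so feeding it the transpose of C with weight A @ B yields the transpose of A @ B @ C. The following is a NumPy stand-in with tiny made-up shapes, not the PyTorch code itself:

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny illustrative shapes (assumed for this demo only)
A = rng.standard_normal((7, 5))
B = rng.standard_normal((5, 5))
C = rng.standard_normal((5, 11))

W = A @ B               # plays the role of mat_mult.weight
linear_out = C.T @ W.T  # what a bias-free linear layer computes for input C.T
direct = (A @ B @ C).T  # the product we actually want, transposed

print(linear_out.shape)
print(np.allclose(linear_out, direct))
```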
IMPORTANT NOTE: When writing this answer, I did not have access to multiple GPUs to actually test this proposed solution. I would appreciate it if any reader could confirm that it indeed works (and even better, time the solution and compare it to a single GPU).
EDIT1: I have now tried this code on multiple GPUs (four Nvidia Tesla P100), and it turns out that it gives an out-of-memory error. I'll keep this solution here as a reference though, since it does work for sizes up to about 400K (instead of 3.6M).
Also, this solution will still work for sizes of 3.6M if you divide C into smaller chunks, feed each chunk into mat_mult_gpu, and then concatenate the results on the CPU. Note that you need a lot of CPU memory for this to work, since the result has size 3K-by-3.6M, which in fp32 takes about 40GB (alternatively, you can save each chunk to disk without concatenating the chunks).
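The chunk-and-reassemble pattern itself does not depend on the GPU. Here is a minimal NumPy sketch of it on small made-up shapes. The function chunked_matmul is hypothetical, and on a real system each block would be computed on a GPU and then copied to the host or written to disk:

```python
import numpy as np

def chunked_matmul(AB, C, chunk_cols):
    """Yield (start, block) pairs with block = AB @ C[:, start:start+chunk_cols]."""
    for start in range(0, C.shape[1], chunk_cols):
        yield start, AB @ C[:, start:start + chunk_cols]

rng = np.random.default_rng(0)
AB = rng.standard_normal((6, 4)).astype(np.float32)
C = rng.standard_normal((4, 25)).astype(np.float32)  # 25 is not a multiple of 10

# Preallocate the result once and fill it block by block
D = np.empty((AB.shape[0], C.shape[1]), dtype=np.float32)
for start, block in chunked_matmul(AB, C, chunk_cols=10):
    D[:, start:start + block.shape[1]] = block

print(np.allclose(D, AB @ C, atol=1e-5))
```

Writing into a preallocated array avoids holding both the list of chunks and the concatenated copy in memory at the same time.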
Depending on your matrix C, a sparse matrix representation may reduce size and computation time, e.g. by storing only the columns that are not 0; the torch reference may help.

How can I generate a large identity matrix in python without running into "memory full"?

I am using the code below:
n = 40000
numpy.matlib.identity(n)
You can do this with scipy using sparse matrix representation:
import numpy as np
from scipy.sparse import identity
n = 30000
a = np.identity(n)  # dense: needs about 7.2 GB of RAM
print(a.nbytes)
b = identity(n)     # sparse: stores only the diagonal
print(b.data.nbytes)
The difference is huge (quadratic versus linear in n): 7200000000 vs 240000 bytes.
You can also try to decrease the size by providing an appropriate dtype, like a = np.identity(n, dtype='int8'), but this only reduces the size by a constant factor (8x when going from the default float64 to int8).
In the same way, you can do b = identity(n, dtype='int8', format='dia'), which reduces the size even further, to 30000 bytes.
But the most important question is what you are planning to do with this matrix (I highly doubt you just want to create it). Some operations do not support sparse matrices. In that case you either have to buy more memory or come up with some clever linear algebra to operate on parts of the matrices, store the results on disk and merge them together.
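As a rough illustration of operating on parts of the matrix, the sketch below produces the identity one horizontal block at a time with np.eye, so that only block_rows * n entries exist at any moment. The generator identity_blocks is hypothetical, and whether this helps depends entirely on what is done with each block:

```python
import numpy as np

def identity_blocks(n, block_rows, dtype=np.float64):
    """Yield (start, block), where block holds rows start..start+len(block)-1 of I_n."""
    for start in range(0, n, block_rows):
        rows = min(block_rows, n - start)
        # k=start shifts the diagonal so the ones land in columns start, start+1, ...
        yield start, np.eye(rows, n, k=start, dtype=dtype)

# Dense size for the n from the question (float64): far beyond typical RAM
n = 40000
print("dense identity: %.1f GB" % (n * n * 8 / 1e9))

# Small check that the blocks really stack up to the identity
small = np.vstack([block for _, block in identity_blocks(10, 4)])
print(np.array_equal(small, np.identity(10)))
```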

Optimizing histogram distance metric for two matrices in Python

I have two matrices A and B, each with a size of NxM, where N is the number of samples and M is the size of histogram bins. Thus, each row represents a histogram for that particular sample.
What I would like to do is compute the chi-square distance between the two matrices for every pair of samples. That is, each row in matrix A is compared to all rows in the other matrix B, resulting in a final matrix C of size NxN, where C[i,j] corresponds to the chi-square distance between the A[i] and B[j] histograms.
Here is my python code that does the job:
import numpy as np
def chi_square(histA,histB):
    eps = 1.e-10  # avoids division by zero for empty bins
    d = sum((histA-histB)**2/(histA+histB+eps))
    return 0.5*d
def matrix_cost(A,B):
    a,_ = A.shape
    b,_ = B.shape
    C = np.zeros((a,b))
    for i in range(a):
        for j in range(b):
            C[i,j] = chi_square(A[i],B[j])
    return C
Currently, for a 100x70 matrix, this entire process takes 0.1 seconds.
Is there any way to improve this performance?
I would appreciate any thoughts or recommendations.
Thank you.
Sure! I'm assuming you're using numpy?
If you have the RAM available, you could broadcast the arrays and use numpy's efficient vectorization of the operations on those arrays.
Here's how:
Abroad = A[:,np.newaxis,:] # prepared for broadcasting
C = np.sum((Abroad - B)**2/(Abroad + B), axis=-1)/2.
Timing considerations on my platform show a factor of 10 speed gain compared to your algorithm.
A slower option (but still faster than your original algorithm) that uses less RAM than the previous option is simply to broadcast the rows of A into 2D arrays:
def new_way(A,B):
    C = np.empty((A.shape[0],B.shape[0]))
    for rowind, row in enumerate(A):
        C[rowind,:] = np.sum((row - B)**2/(row + B), axis=-1)/2.
    return C
This has the advantage that it can be run for arrays with shape (N,M) much larger than (100,70).
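The vectorized snippets above leave out the small eps term from the question, so bins that are empty in both histograms would produce 0/0. Below is a minimal sketch of the row-wise variant with eps added back, checked against a plain double loop on made-up data. The function names and the default eps value are assumptions for illustration:

```python
import numpy as np

def chi_square_loop(A, B, eps=1e-10):
    """Reference implementation: one pair of histograms at a time."""
    C = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            C[i, j] = 0.5 * np.sum((A[i] - B[j]) ** 2 / (A[i] + B[j] + eps))
    return C

def chi_square_rows(A, B, eps=1e-10):
    """Row-wise broadcasting: each row of A against all rows of B at once."""
    C = np.empty((A.shape[0], B.shape[0]))
    for i, row in enumerate(A):
        C[i] = 0.5 * np.sum((row - B) ** 2 / (row + B + eps), axis=-1)
    return C

rng = np.random.default_rng(0)
# Integer counts in [0, 5) so that some bins are empty in both histograms
A = rng.integers(0, 5, size=(8, 7)).astype(float)
B = rng.integers(0, 5, size=(6, 7)).astype(float)

C_rows = chi_square_rows(A, B)
print(C_rows.shape)
print(np.allclose(C_rows, chi_square_loop(A, B)))
print(np.isfinite(C_rows).all())
```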
You could also look to Theano to push the expensive for-loops to the C-level if you don't have the memory available. I get a factor 2 speed gain compared to the first option (not taking into account the initial compile time) for both the (100,70) arrays as well as (1000,70):
import theano
import theano.tensor as T
X = T.matrix("X")
Y = T.matrix("Y")
results, updates = theano.scan(lambda x_i: ((x_i - Y)**2/(x_i+Y)).sum(axis=1)/2., sequences=X)
chi_square_norm = theano.function(inputs=[X, Y], outputs=[results])
chi_square_norm(A,B) # same result
