Efficient way to implement matrix multiplication when one matrix is extremely wide?

I need to multiply 3 matrices, A: 3000x100, B: 100x100, C: 100x3.6MM. I currently am just using normal matrix multiplication in PyTorch
A_gpu = torch.from_numpy(A)
B_gpu = torch.from_numpy(B)
C_gpu = torch.from_numpy(C)
D_gpu = (A_gpu @ B_gpu @ C_gpu.t()).t()
C is very wide so the data reuse on gpu is limited but are there other ways to speed this up? I have a machine with 4x GPUs.

Since you have four GPUs, you can use them together to perform the multiplication efficiently. Note, however, that the result has size 3000x3600000, which takes roughly 43 GB (about 40 GiB) in single-precision floating point (fp32). Unless the host has more RAM than that, you cannot store the full result in CPU memory.
One possible solution is to break the large matrix C into four smaller chunks, multiply each chunk on a different GPU, and keep the result on that GPU. Each quarter of the result takes about 10.8 GB, so this works provided each GPU has somewhat more than that available (the result chunk plus its slice of C).
If you also have enough CPU memory, you can then move the results from all four GPUs to the CPU and concatenate them (in that case you could even use a single GPU and transfer each chunk's result to the CPU as it is computed). Otherwise, keep the results chunked on the GPUs and keep track of the fact that the four chunks are parts of one matrix.
import numpy as np
import torch
number_of_gpus = 4
# create the three matrices
A = np.random.normal(size=(3000,100))
B = np.random.normal(size=(100,100))
C = np.random.normal(size=(100,3600000))
# convert them to pytorch fp32 tensors
A = torch.from_numpy(A).float()
B = torch.from_numpy(B).float()
C = torch.from_numpy(C).float()
# calculate `A@B`, which is cheap
AB = A @ B
# split the large matrix `C` into 4 smaller chunks along the second dimension.
# we assume here that the size of the second dimension of `C` is divisible by 4.
C_split = torch.split(C,C.shape[1]//number_of_gpus,dim=1)
# loop over the four GPUs, and perform the calculation on each using the corresponding chunk of `C`
D_split = []
for i in range(number_of_gpus):
    device = 'cuda:{:d}'.format(i)
    D_split.append(AB.to(device) @ C_split[i].to(device))
# DO THIS ONLY IF YOU HAVE ENOUGH CPU MEMORY!! :
D = torch.cat([d.cpu() for d in D_split],dim=1)
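The single-GPU variant mentioned above (compute one chunk at a time and move each result to the CPU) could be sketched as follows. This is a minimal sketch, not the answerer's code: the function name `chunked_matmul`, the `chunk_cols` parameter and the fallback to the CPU when no GPU is present are assumptions made for illustration, and the sizes are shrunk so that it runs anywhere.

```python
import torch

def chunked_matmul(AB, C, chunk_cols, device):
    """Compute AB @ C one column block at a time, collecting results on the CPU.

    `chunked_matmul` and `chunk_cols` are hypothetical names for this sketch.
    """
    # Preallocate the full result on the CPU (requires enough host RAM).
    D = torch.empty((AB.shape[0], C.shape[1]), dtype=AB.dtype)
    AB_dev = AB.to(device)
    for start in range(0, C.shape[1], chunk_cols):
        stop = min(start + chunk_cols, C.shape[1])
        # Only one chunk of C and one chunk of the result live on the device.
        D[:, start:stop] = (AB_dev @ C[:, start:stop].to(device)).cpu()
    return D

# Small stand-in sizes; the real problem is 3000x100 times 100x3600000.
torch.manual_seed(0)
A = torch.randn(30, 10)
B = torch.randn(10, 10)
C = torch.randn(10, 1003)  # deliberately not divisible by the chunk size

device = "cuda:0" if torch.cuda.is_available() else "cpu"
D = chunked_matmul(A @ B, C, chunk_cols=250, device=device)
```

Because the loop handles a short final chunk, this version does not need the width of `C` to be divisible by the chunk size.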

If you have multiple GPUs, you can distribute the computation on all of them using PyTorch's DataParallel. It will split (parallelize) the multiplication of the columns of the matrix C_gpu among the GPUs.
Here's how:
First, import the modules and prepare the matrices:
import torch
import torch.nn as nn
A_gpu = torch.from_numpy(A).float()
B_gpu = torch.from_numpy(B).float()
C_gpu = torch.from_numpy(C).float()
Next, create a Linear "layer" without bias. What this layer does is exactly matrix multiplication. The input size will be the size of each column of C_gpu, and the output size will be the size of each column of the result.
mat_mult = nn.Linear(in_features=C_gpu.shape[0],out_features=A_gpu.shape[0],bias=False)
Set the matrix (=weight) of the layer to be A_gpu @ B_gpu, which is a small matrix that can be quickly computed without parallelization (although you could parallelize it as well if you want).
mat_mult.weight.data = A_gpu @ B_gpu
Convert the layer into a DataParallel instance. This means that it will automatically parallelize computation along the "batch" dimension. The argument device_ids is a list of indices of your GPUs (4 of them, in your case).
mat_mult_gpu = nn.DataParallel(mat_mult,device_ids=[0,1,2,3]).to('cuda:0')
Now you can feed the matrix C_gpu into the layer, and the computation will be parallelized along its large dimension:
D_gpu = mat_mult_gpu(C_gpu.t())
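To see why a bias-free Linear layer reproduces the product, recall that `nn.Linear` computes `x @ weight.T`. A small CPU-only check is sketched below; the sizes and variable names are made up for this illustration, and it does not exercise DataParallel itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
A = torch.randn(6, 4)    # stands in for the 3000x100 matrix
B = torch.randn(4, 4)    # stands in for the 100x100 matrix
C = torch.randn(4, 50)   # stands in for the 100x3600000 matrix

layer = nn.Linear(in_features=C.shape[0], out_features=A.shape[0], bias=False)
with torch.no_grad():
    layer.weight.copy_(A @ B)   # weight has shape (out_features, in_features)
    # Each row of C.t() is one column of C, treated as one "batch" item.
    out = layer(C.t())          # shape (50, 6)

expected = (A @ B @ C).t()
print(torch.allclose(out, expected, atol=1e-4))
```

The batch dimension here is the 50 columns of C, which is the dimension DataParallel scatters across devices.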
IMPORTANT NOTE: When writing this answer, I did not have access to multiple GPUs to test this proposed solution. I would appreciate it if a reader could confirm that it works (and, even better, time it and compare against a single GPU).
EDIT1: I have now tried this code on multiple GPUs (four Nvidia Tesla P100), and it turns out that it gives an out-of-memory error. I'll keep this solution here as a reference, though, since it does work for sizes up to about 400K (instead of 3.6M).
This solution will also work for a size of 3.6M if you divide C into smaller chunks, feed each chunk into mat_mult_gpu, and then concatenate the results on the CPU. Note that you need a lot of CPU memory for this to work, since the result has size 3K-by-3.6M, which takes about 40GB in fp32. (Alternatively, you can save each chunk to disk without concatenating the chunks.)
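The save-each-chunk-to-disk alternative can be sketched with a memory-mapped `.npy` file, so that the full result never has to sit in RAM at once. This is a NumPy-only illustration under stated assumptions: the helper name `matmul_to_disk`, the chunk size and the temporary file path are hypothetical, and in practice the per-chunk product would be replaced by the GPU call.

```python
import os
import tempfile
import numpy as np

def matmul_to_disk(AB, C, path, chunk_cols):
    """Write AB @ C to a .npy file one column block at a time (hypothetical helper)."""
    out = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.float32, shape=(AB.shape[0], C.shape[1])
    )
    for start in range(0, C.shape[1], chunk_cols):
        stop = min(start + chunk_cols, C.shape[1])
        # In the setting above this product would run on the GPU(s).
        out[:, start:stop] = AB @ C[:, start:stop]
    out.flush()
    return out

# Size of the full-scale result in fp32: 3000 x 3,600,000 x 4 bytes.
full_bytes = 3000 * 3_600_000 * 4
print(full_bytes / 1e9, "GB")

rng = np.random.default_rng(0)
AB = rng.normal(size=(30, 10)).astype(np.float32)
C = rng.normal(size=(10, 1003)).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "D.npy")
D = matmul_to_disk(AB, C, path, chunk_cols=250)
reloaded = np.load(path, mmap_mode="r")  # read back lazily, without loading it all
```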

Depending on your matrix C, a sparse representation may reduce both size and computation time, e.g. by storing only the columns that are not all zero. The torch reference may help.
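As a sketch of the idea of keeping only the nonzero columns: if many columns of C are entirely zero, the corresponding columns of the result are zero too, so only the remaining columns need to be multiplied. The sizes and the 90% zero-column ratio below are invented for the illustration, and NumPy is used in place of torch to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)
AB = rng.normal(size=(30, 10))
C = rng.normal(size=(10, 1000))
# Assumption for this sketch: about 90% of the columns of C are entirely zero.
C[:, rng.random(1000) < 0.9] = 0.0

# Indices of columns that contain at least one nonzero entry.
keep = np.flatnonzero(np.any(C != 0, axis=0))
D_compact = AB @ C[:, keep]          # only the useful work, shape (30, len(keep))

# Scatter back into the full-width result only if it is actually needed.
D_full = np.zeros((AB.shape[0], C.shape[1]))
D_full[:, keep] = D_compact
```

Keeping `D_compact` together with `keep` instead of `D_full` also shrinks the result, which is the larger memory cost in the original problem.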

Related

NumPy array row differences

I have a NumPy array vectors = np.random.randn(rows, cols). I want to find differences between its rows according to some other array diffs which is sparse and "2-hot": containing a 1 in its column corresponding to the first row of vectors and a -1 corresponding to the second row. Perhaps an example shall make it clearer:
diffs = np.array([[ 1,  0, -1],
                  [ 1, -1,  0]])
then I can compute the row differences by simply diffs @ vectors.
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse: sparse.csr_matrix(diffs) @ vectors, but even this takes 300ms.
Possibly this is simply as fast as it gets, but part of me wonders whether using matrix multiplication is the wisest choice for this task.
What's more, I need to take the absolute value afterwards, so really I'm doing np.abs(sparse.csr_matrix(diffs) @ vectors), which adds ~200ms for a grand total of ~500ms.

I can compute the row differences by simply diffs @ vectors.
This is very inefficient. A matrix multiplication runs in O(n*m*k) for a (n,m) matrix multiplied by a (m,k) one. In your case, there are only two non-zero values per row and you do not actually need to multiply by 1 or -1. Your problem can be computed in O(n*k) time (i.e. m times faster).
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse.
The thing is that the input data representation is inefficient. When diffs is an array of size (10_000,1000), it is not reasonable to use a dense matrix that is ~1000 times bigger than needed, nor a sparse matrix that is not optimized for rows with only two non-zero values (especially 1 and -1). Instead, store the positions of the non-zero values in a 2D array called sel_rows of shape (2,n), where the first row contains the location of the 1 and the second row contains the location of the -1 in each row of the diffs 2D array. Then you can extract rows of vectors with, for example, vectors[sel_rows[0]], and perform the final operation with vectors[sel_rows[0,:]] - vectors[sel_rows[1,:]]. This approach should be drastically faster than a dense matrix product, and it may be a bit faster than a sparse one, depending on the target machine.
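A small sketch of this index-based formulation, checked against the dense product. The conversion from a dense `diffs` to `sel_rows` via `argmax`/`argmin` is an assumption of this example (it relies on each row holding exactly one 1 and one -1); in practice you would build `sel_rows` directly and never create the dense matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 50, 80          # stand-ins for 10_000, 1000, 15_000
vectors = rng.standard_normal((m, k))

# Build a "2-hot" diffs matrix: one +1 and one -1 per row, in distinct columns.
plus = rng.integers(0, m, size=n)
minus = (plus + rng.integers(1, m, size=n)) % m   # guaranteed != plus
diffs = np.zeros((n, m))
diffs[np.arange(n), plus] = 1
diffs[np.arange(n), minus] = -1

# sel_rows has shape (2, n): row 0 = position of the 1, row 1 = position of the -1.
sel_rows = np.stack([diffs.argmax(axis=1), diffs.argmin(axis=1)])

# Row differences (with the absolute value the question asks for).
result = np.abs(vectors[sel_rows[0]] - vectors[sel_rows[1]])
```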
While the above solution is simple, it creates multiple temporary arrays that are not cache-friendly, since your output array takes 10_000 * 15_000 * 8 bytes = 1.1 GiB (which is quite large). You can use Numba to remove the temporary arrays and so improve performance. Multiple threads can be used to improve performance even further. Here is some untested code:
import numba as nb
import numpy as np
@nb.njit('(int_[:,::1], float64[:,::1])', parallel=True)
def compute(diffs, vectors):
    # Here diffs is the (n, 2) index array: column 0 holds the row of
    # the 1 and column 1 the row of the -1 (i.e. sel_rows transposed)
    n, k = diffs.shape[0], vectors.shape[1]
    assert diffs.shape[1] == 2
    res = np.empty((n, k))
    for i in nb.prange(n):
        a, b = diffs[i]
        for j in range(k):
            # Apply abs() here if needed so as to avoid
            # creating new temporary arrays
            res[i, j] = vectors[a, j] - vectors[b, j]
    return res
The above code should be nearly optimal. It should be memory-bound and able to saturate the memory bandwidth. Note that writing such a huge array to memory takes some time, as does reading the input array (twice). On x86-64 platforms, a basic implementation should move 4.4 GiB of data from/to RAM. Thus, on a mainstream PC with 20 GiB/s of RAM bandwidth, this takes 220 ms. In fact, the sparse matrix computation was not so bad in practice for a sequential implementation.
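The 4.4 GiB and 220 ms figures can be reproduced with a back-of-the-envelope calculation. The accounting below is one plausible reading, not something spelled out in the answer: it assumes that each output row reads two input rows, and that writing the output costs twice its size because of write-allocate on x86-64. The text rounds the output to 1.1 GiB, so the unrounded numbers come out slightly higher.

```python
GiB = 2**30
n, k, itemsize = 10_000, 15_000, 8           # rows of diffs, columns of vectors, float64

output_bytes = n * k * itemsize               # size of the result array
read_bytes = 2 * output_bytes                 # two gathered input rows per output row
write_bytes = 2 * output_bytes                # assumed: write-allocate reads each line before writing it

total_gib = (read_bytes + write_bytes) / GiB
bandwidth_gib_s = 20                          # "mainstream PC" figure from the text
time_ms = total_gib / bandwidth_gib_s * 1000

print(f"{output_bytes / GiB:.2f} GiB output, {total_gib:.2f} GiB moved, {time_ms:.0f} ms")
```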
If this is not enough for you, you can use single-precision floating-point numbers instead of double-precision (twice as fast). You could also use a low-level C/C++ implementation to reduce the memory bandwidth used (thanks to non-temporal instructions, ~30% faster). There is not much more to do.

Large scale matrix multiplication using Numpy

I am facing a problem where I need to perform matrix multiplication between two large matrices, A [400000 x 70000] and B [70000 x 1000]. The two matrices are dense and have no special structure that I can exploit.
Currently my implementation divides A into multiple chunks of rows, say sub_A [2000 x 70000], and performs sub_A * B. I noticed that a lot of time is spent on I/O, i.e. reading in sub_A. Reading in the matrix takes about 500 seconds and the computation takes about 300 seconds.
Would using PyTables be useful here to improve the I/O efficiency? Is there any library that would help improve the time efficiency?
Here is the code:
def sim_phe_g(geno, betas, chunk_size):
    num_indv = geno.row_count
    num_snps = geno.col_count
    num_settings = betas.shape[1]
    phe_g = np.zeros([num_indv, num_settings])
    # divide individuals into chunks
    for i in range(0, num_indv, chunk_size):
        sub_geno = geno[i : i + chunk_size, :]
        sub_geno = sub_geno.read().val
        phe_g[i : i + chunk_size, :] = np.dot(sub_geno, betas)
    return phe_g
geno is of size [400000 x 70000] and betas is of size [70000 x 1000]. geno here is a large matrix that is stored in disk. The statement sub_geno = sub_geno.read().val will load a chunk of the genotype into the memory. And this statement costs a lot of time.
Also, I divide the big matrix into chunks because of 32GB memory size limitation.

Try using TensorFlow for GPU optimization; it is very good for matrix multiplication, as it allows you to parallelize each operation.

If applicable, try using TensorFlow for large matrix multiplications; as you can see from this article, TensorFlow performs significantly better on large matrices under many circumstances. The most likely reason is that it is built primarily for the purpose of handling large matrices efficiently.
For more details on the specific use of matrix multiplication, kindly refer to the documentation.
I tested it on a (1000,1000) matrix multiplication:
numpy.matmul: 60 ms ± 5.35 ms
tensorflow.matmul: 42.5 ms ± 2.47 ms
100 runs of each were conducted; the values are mean ± standard deviation.
P.S. Only the CPU version of TensorFlow was used.
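A comparable measurement for the NumPy side can be sketched with `timeit`. The helper name `time_matmul` is made up here, the numbers will differ with hardware and BLAS build, and the TensorFlow side is omitted.

```python
import timeit
import statistics
import numpy as np

def time_matmul(n, runs):
    """Return (mean, stdev) in milliseconds over `runs` timings of an n x n matmul."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    # repeat=runs, number=1: one timing per run.
    times = timeit.repeat(lambda: np.matmul(a, b), repeat=runs, number=1)
    times_ms = [t * 1000 for t in times]
    return statistics.mean(times_ms), statistics.stdev(times_ms)

# Kept small so the sketch runs quickly; use n=1000, runs=100 to mirror the test above.
mean_ms, std_ms = time_matmul(n=200, runs=10)
print(f"numpy.matmul: {mean_ms:.2f} ms ± {std_ms:.2f} ms")
```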

Resources for doing an n dimensional FFT of a huge multi dimensional array

I have a 50-dimensional array whose dimensions are 255 x 255 x 255 x ... (50 times) ... x 255, so it holds a total of 255^50 floating point numbers. It is out of the question to fit it in RAM. Moreover, I need to take a 50-dimensional Fast Fourier Transform (DFT) of this array. I can't do it in Python on an ordinary PC, and I can't even imagine doing it on a GPU, so I am guessing I have to resort to hard disk storage, but even that is too huge. I don't need this in real time; I can afford for it to run for days. I have no clue what sort of machine I need, or whether it is even possible. I would appreciate your advice: supercomputers, grids, or something else. Even if it is costly, I am not worried about the investment.

If you found enough universes to save your data in, here is what you could do:
The Fourier Transform is separable, that means that calculating the DFT of each axis one after the other will give you the same result as if you calculated the n-dimensional DFT:
for i in range(C.ndim):
    C[...] = numpy.fft.fft(C, axis=i)
Double-checking that the result is correct using a 2D tensor (because we have a 2D FFT, numpy.fft.fft2, to compare against):
import numpy
A = numpy.random.rand(*[16] * 2)
B = numpy.fft.fft2(A)
C = A.astype(numpy.complex128) # output array for the separable FFT
for i in range(C.ndim):
    C[...] = numpy.fft.fft(C, axis=i)
numpy.allclose(C, B) # True
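The same check extends to more than two axes with `numpy.fft.fftn`, and a quick count shows why the full-size problem is out of reach. A small sketch (the 8x8x8 test size is arbitrary):

```python
import numpy as np

# Separable FFT on a 3D array, compared against the n-dimensional FFT.
rng = np.random.default_rng(0)
A = rng.random((8, 8, 8))
C = A.astype(np.complex128)
for axis in range(C.ndim):
    C = np.fft.fft(C, axis=axis)   # 1D FFT along one axis at a time
matches = np.allclose(C, np.fft.fftn(A))

# Size of the array in the question: 255 points along each of 50 axes.
n_elements = 255 ** 50
n_bytes = n_elements * 16          # complex128 output, 16 bytes per element
print(matches, f"{n_elements:.3e} elements")
```

The element count has more than 120 decimal digits, which is the point of the remark about needing several universes for storage.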

Matrix multiplication in CUDA running out of memory

I try to compute the matrix multiplication using the script:
import numpy as np
import math
from timeit import default_timer as timer
from numba import cuda
from numba import *
def mult(a,b):
    return a*b
mult_gpu=cuda.jit(restype=float32,argtypes=[float32,float32],device=True)(mult)
@cuda.jit(argtypes=[float32[:,:],float32[:,:],float32[:,:,:]])
def mult_kernel(a,b,c):
    Ni=c.shape[0]
    Nj=c.shape[1]
    Nk=c.shape[2]
    startX,startY,startZ=cuda.grid(3)
    gridX=cuda.gridDim.x*cuda.blockDim.x
    gridY=cuda.gridDim.y*cuda.blockDim.y
    gridZ=cuda.gridDim.z*cuda.blockDim.z
    for i in range(startX,Ni,gridX):
        for j in range(startY,Nj,gridY):
            for k in range(startZ,Nk,gridZ):
                c[i,j,k]=mult_gpu(a[i,k],b[j,k])
def main():
    A=np.ones((20,50000),dtype=np.float32)
    B=np.ones((3072,50000),dtype=np.float32)
    C=np.ones((20,3072,50000),dtype=np.float32)
    (Ni,Nj,Nk)=C.shape
    my_gpu=cuda.get_current_device()
    thread_ct=my_gpu.WARP_SIZE
    block_ct_x=int(math.ceil(float(Ni)/thread_ct))
    block_ct_y=int(math.ceil(float(Nj)/thread_ct))
    block_ct_z=int(math.ceil(float(Nk)/thread_ct))
    blockdim=thread_ct,thread_ct,thread_ct
    griddim=block_ct_x,block_ct_y,block_ct_z
    print "Threads per block:",blockdim
    print "Blocks per grid:",griddim
    start=timer()
    Cg=cuda.to_device(C)
    mult_kernel[griddim,blockdim](A,B,Cg)
    Cg.to_host()
    dt=timer()-start
    print "Computation done in %f s"%(dt)
    print 'C[:3,1,1] = ',C[:3,1,1]
    print 'C[-3:,1,1] = ',C[-3:,1,1]
if __name__=='__main__':
    main()
Executing this yields an error:
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
How could I fix this memory problem?
Nevertheless, using smaller matrices
A=np.ones((20,500),dtype=np.float32)
B=np.ones((372,500),dtype=np.float32)
C=np.ones((20,372,500),dtype=np.float32)
there is still an error:
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
I got inspired by the Mandelbrot Example to implement the computation above.
EDIT1
In order to resolve any confusion, this is actually a 3D matrix by 3D matrix multiplication:
A=np.ones((20,1,50000),dtype=np.float32)
B=np.ones((1,3072,50000),dtype=np.float32)
C=np.ones((20,3072,50000),dtype=np.float32)
I skipped one dimension in A and B because it is not necessary for the computation.
EDIT2
My GPU is:
In [1]: from numba import cuda
In [2]: gpu=cuda.get_current_device()
In [3]: gpu.name
Out[3]: 'GeForce GT 750M'
EDIT3
According to the memory of my GPU (2GB), I reduced the size of each dimension by 2:
dimx=10
dimy=1536
dimz=25000
A=np.ones((dimx,dimz),dtype=np.float32)
B=np.ones((dimy,dimz),dtype=np.float32)
C=np.ones((dimx,dimy,dimz),dtype=np.float32)
But I still receive the CUDA_ERROR_OUT_OF_MEMORY error. How could one explain this?
The calculation yields a size of about 1.7GB for the 3 matrices:
(10*1536*25000*4.+10*25000*4+1536*25000*4.)/(10**9)=1.6906

Regarding the first problem, you're running out of memory. A major contributor to that is that this isn't the way people would normally do a matrix-matrix multiply. Normally, as you are multiplying row and column elements together, you would keep a running sum, then store that sum in the appropriate location in the product (result) matrix. This will allow you to have a much smaller size for the c matrix, ie. it need only be 2 dimensions, not 3. You may wish to just study the linear algebra definition of matrix-matrix multiply. When you multiply a 2D matrix by a 2D matrix, the result is a 2D matrix, not a 3D matrix.
In a nutshell, something like this:
for i in range(startX,Ni,gridX):
    for j in range(startY,Nj,gridY):
        c[i,j] = 0
        for k in range(startZ,Nk,gridZ):
            c[i,j]= c[i,j] + mult_gpu(a[i,k],b[j,k])
And adjust your c shape accordingly.
If you actually need the individual products in 3D form as you are doing here, then there is not much I can say except that you will need to scale the problem to fit in the GPU memory size for whatever GPU you are using.
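To make the difference concrete, the following NumPy sketch (small stand-in sizes, CPU only) shows that summing the 3D array of individual products over k gives the ordinary 2D product, and compares the memory that the two result shapes would need at the sizes in the question. The variable names are chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 30)).astype(np.float32)    # stands in for A, shape (Ni, Nk)
b = rng.random((6, 30)).astype(np.float32)    # stands in for B, shape (Nj, Nk)

products_3d = a[:, None, :] * b[None, :, :]   # c[i, j, k] = a[i, k] * b[j, k]
c_2d = a @ b.T                                # c[i, j] = sum over k of a[i, k] * b[j, k]
same = np.allclose(products_3d.sum(axis=2), c_2d, atol=1e-4)

# Memory for the result at the original sizes (float32 = 4 bytes).
Ni, Nj, Nk = 20, 3072, 50000
bytes_3d = Ni * Nj * Nk * 4
bytes_2d = Ni * Nj * 4
print(same, bytes_3d / 1e9, "GB vs", bytes_2d / 1e3, "kB")
```

At the original sizes the 3D result alone needs about 12.3 GB, far more than a 2 GB GPU can hold, while the 2D result needs under 1 MB.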
Regarding the second problem, you have a problem here:
thread_ct=my_gpu.WARP_SIZE
...
blockdim=thread_ct,thread_ct,thread_ct
WARP_SIZE is 32 (presumably), so you are asking for a 3D block of 32*32*32 = 32K threads. But CUDA threadblocks are limited to a maximum of 1024 threads, and that limit applies to the product of the individual dimensions.
If you change your thread_ct to 8, for example:
thread_ct=8
You should be able to get past this particular issue.
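The block-size arithmetic can be checked without a GPU. In this sketch the 1024-thread limit is the one quoted above; the helper names are hypothetical, and on real hardware the limit should be read from the device rather than hard-coded.

```python
import math

MAX_THREADS_PER_BLOCK = 1024   # limit quoted in the answer

def threads_per_block(blockdim):
    """Total threads in a block = product of its dimensions."""
    return math.prod(blockdim)

def grid_for(shape, blockdim):
    """Blocks needed per axis so that grid * block covers `shape`."""
    return tuple(math.ceil(n / b) for n, b in zip(shape, blockdim))

shape = (20, 372, 500)         # the smaller C from the question

for thread_ct in (32, 8):
    blockdim = (thread_ct,) * 3
    total = threads_per_block(blockdim)
    ok = total <= MAX_THREADS_PER_BLOCK
    print(blockdim, total, "ok" if ok else "too many threads", grid_for(shape, blockdim))
```

With `thread_ct=8` each block has 512 threads, which fits under the limit, and the grid grows accordingly to cover the same array.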

How can I generate a large identity matrix in python without running into "memory full"?

I am using the code below:
n = 40000
numpy.matlib.identity(n)

You can do this with scipy using sparse matrix representation:
import numpy as np
from scipy.sparse import identity
n = 30000
a = np.identity(n)
print(a.nbytes)
b = identity(n)
print(b.data.nbytes)
The difference is huge (quadratic): 7200000000 vs 240000.
You can also try to decrease the size by providing an appropriate dtype, like a = np.identity(n, dtype='int8'), but this will only reduce the size by a constant factor (8x for int8 versus the default float64).
The same way you can do b = identity(n, dtype='int8', format='dia') which will reduce the size even further to 30000.
But the most important question is what you are planning to do with this matrix (I highly doubt you just want to create it). Some operations do not support sparse matrices; in that case you either have to buy more memory or come up with some smart linear algebra to operate on parts of the matrices, store the results on disk and merge them together.
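Many common operations do work directly on the sparse form. As a small sketch (Python 3, with an arbitrary test vector, a made-up 0.5 shift and a superdiagonal matrix standing in for whatever the identity is combined with), a sparse identity can be used in products and sums without ever materialising the dense n-by-n array:

```python
import numpy as np
from scipy import sparse

n = 40000
eye = sparse.identity(n, format="csr")    # stores n values, not n*n

v = np.arange(n, dtype=np.float64)
eye_v = eye @ v                           # matrix-vector product, returns a dense vector

# Shifted sparse matrix M + 0.5*I, a typical place an identity matrix shows up.
M = sparse.diags([np.ones(n - 1)], [1], format="csr")   # ones on the superdiagonal
shifted = M + 0.5 * eye                   # the result stays sparse

dense_bytes = n * n * 8                   # what a dense float64 identity would need
print(dense_bytes / 1e9, "GB dense vs", eye.data.nbytes, "bytes of stored values")
```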
