I am facing a problem where I need to perform matrix multiplication between two large matrix A [400000 x 70000] and B [70000 x 1000]. The two matrices are dense and have no special structure that I can utilize.
Currently my implementation is to divide A into multiple chunks of rows, say, sub_A [2000 x 70000] and perfrom sub_A * B. I noticed that there are a lot of time is spent on I/O, i.e. read in the sub_A. Read in the matrix takes about 500 seconds and computation takes about 300 seconds.
Will using PyTables here be useful to improve the I/O efficiency? Are there any library that will help in improving the time efficiency?
Here is the code:
def sim_phe_g(geno, betas, chunk_size):
num_indv = geno.row_count
num_snps = geno.col_count
num_settings = betas.shape[1]
phe_g = np.zeros([num_indv, num_settings])
# divide individuals into chunks
for i in range(0, num_indv, chunk_size):
sub_geno = geno[i : i + chunk_size, :]
sub_geno = sub_geno.read().val
phe_g[i : i + chunk_size, :] = np.dot(sub_geno, betas)
return phe_g
geno is of size [400000 x 70000] and betas is of size [70000 x 1000]. geno here is a large matrix that is stored in disk. The statement sub_geno = sub_geno.read().val will load a chunk of the genotype into the memory. And this statement costs a lot of time.
Also, I divide the big matrix into chunks because of 32GB memory size limitation.
Try using TensowFlow for GPU optimization, it's very good for matrix multiplication as it will allow you to parallelize each operation.
If applicable try using tensorflow for large matrices multiplication, as you can see from this article that tensorflow performs significantly better in case of large matrices under many circumstances. The reason for the same most likely being that its primarily built for this very purpose of handling large matrices efficiently.
for more details on the specific use of matrix multiplication kindly refer to the documentation.
I tested it on a (1000,1000) matrix for multiplication:
for numpy.matmul = 60 ms ± 5.35
for tensorflow.matmul = 42.5 ms ± 2.47 m
100 runs for each were conducted sharing mean and stdev
P.S. Tensorflow's cpu version was only used
Related
I am computing a matrix multiplication at few thousand times during my algorithm. Therefore, I compute:
import numpy as np
import time
def mat_mul(mat1, mat2, mat3, mat4):
return(np.dot(np.transpose(mat1),np.multiply(np.diag(mat2)[:,None], mat3))+mat4)
n = 2000
mat1 = np.random.rand(n,n)
mat2 = np.diag(np.random.rand(n))
mat3 = np.random.rand(n,n)
mat4 = np.random.rand(n,n)
t0=time.time()
cov_11=mat_mul(mat1, mat2, mat1, mat4)
t1=time.time()
print('time ',t1-t0, 's')
The matrices are of size:
n = (2000,2000) and mat2 only has entries along its diagonal. The remaining entries are zero.
On my machine I get the following:
time 0.3473696708679199 s
How can I speed this up?
Thanks.
The Numpy implementation can be optimized a bit by reducing the amount of temporary arrays and reuse them as much as possible (ie. multiple times). Indeed, while matrix multiplications are generally heavily-optimized by BLAS implementations, filling/copying (newly allocated) arrays add a non-negligible overhead.
Here is the implementation:
def mat_mul_opt(mat1, mat2, mat3, mat4):
tmp1 = np.empty((n,n))
tmp2 = np.empty((n,n))
vect = np.diag(mat2)[:,None]
np.dot(np.transpose(mat1),np.multiply(vect, mat3, out=tmp1), out=tmp2)
np.add(mat4, tmp2, out=tmp1)
return tmp1
The code can be optimized further if it is fine to mutate input matrices or if you can pre-allocate tmp1 and tmp2 outside the function once (and then reuse them multiple times). Here is an example:
def mat_mul_opt2(mat1, mat2, mat3, mat4, tmp1, tmp2):
vect = np.diag(mat2)[:,None]
np.dot(np.transpose(mat1),np.multiply(vect, mat3, out=tmp1), out=tmp2)
np.add(mat4, tmp2, out=tmp1)
return tmp1
Here are performance results on my i5-9600KF processor (6-cores):
mat_mul: 103.6 ms
mat_mul_opt1: 96.7 ms
mat_mul_opt2: 83.5 ms
np.dot time only: 74.4 ms (kind of practical lower-bound)
Optimal lower bound: 55 ms (quite optimistic)
cython is not going to speed it up, simply because numpy is using other tricks to speed things up like threading and SIMD, anyone that tries to implement such function with only cython is going to end up with much worse performance.
only 2 things are possible:
use a gpu based version of numpy (cupy)
use a different more optimized backend for numpy if you aren't using the best already (like intel MKL)
I have a NumPy array vectors = np.random.randn(rows, cols). I want to find differences between its rows according to some other array diffs which is sparse and "2-hot": containing a 1 in its column corresponding to the first row of vectors and a -1 corresponding to the second row. Perhaps an example shall make it clearer:
diffs = np.array([[ 1, 0, -1],
[ 1, -1, 0]])
then I can compute the row differences by simply diffs # vectors.
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse: sparse.csr_matrix(diffs) # vectors, but even this is 300ms.
Possibly this is simply as fast as it gets, but some part of me thinks whether using matrix multiplications is the wisest decision for this task.
What's more is I need to take the absolute value afterwards so really I'm doing np.abs(sparse.csr_matrix(diffs) # vectors) which adds ~ 200ms for a grand total of ~500ms.
I can compute the row differences by simply diffs # vectors.
This is very inefficient. A matrix multiplication runs in O(n*m*k) for a (n,m) multiplied by a (m,k) one. In your case, there is only two values per line and you do not actually need a multiplication by 1 or -1. Your problem can be computed in O(n*k) time (ie. m times faster).
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse.
The thing is the input data representation is inefficient. When diff is an array of size (10_000,1000), this is not reasonable to use a dense matrix that would be ~1000 times bigger than needed nor a sparse matrix that is not optimized for having only two non-zero values (especially 1 and -1). You need to store the position of the non-zeros values in a 2D array called sel_rows of shape (2,n) where the first row contains the location of the 1 and the second one contains the location of the -1 in the diff 2D array. Then, you can extract the rows of vectors for example with vectors[sel_rows[0]]. You can perform the final operation with vectors[sel_rows[0,:]] - vectors[sel_rows[1,:]]. This approach should be drastically faster than a dense matrix product and it may be a bit faster than a sparse one regarding the target machine.
While the above solution is simple, it create multiple temporary arrays that are not cache-friendly since your output array should take 10_000 * 15_000 * 8 = 1.1 GiB (which is quite huge). You can use Numba so to remove temporary array and so improve the performance. Multiple threads can be used to improve performance even further. Here is an untested code:
import numba as nb
#nb.njit('(int_[:,::1], float64[:,::1])', parallel=True)
def compute(diffs, vectors):
n, k = diffs.shape[0], vectors.shape[1]
assert diffs.shape[1] == 2
res = np.empty((n, k))
for i in nb.prange(n):
a, b = diffs[i]
for j in range(k):
# Compute nb.abs here if needed so to avoid
# creating new temporary arrays
res[i, j] = vectors[a, j] - vectors[b, j]
return res
This above code should be nearly optimal. It should be memory bound and able to saturate the memory bandwidth. Note that writing such huge arrays in memory take some time as well as reading (twice) the input array. On x86-64 platforms, a basic implementation should move 4.4 GiB of data from/to the RAM. Thus, on a mainstream PC with a 20 GiB/s RAM, this takes 220 ms. In fact, the sparse matrix computation result was not so bad in practice for a sequential implementation.
If this is not enough to you, then you can use simple-precision floating-point numbers instead of double-precision (twice faster). You could also use a low-level C/C++ implementation so to reduce the memory bandwidth used (thanks to non-temporal instructions -- ~30% faster). There is no much more to do.
I need to multiply 3 matrices, A: 3000x100, B: 100x100, C: 100x3.6MM. I currently am just using normal matrix multiplication in PyTorch
A_gpu = torch.from_numpy(A)
B_gpu = torch.from_numpy(B)
C_gpu = torch.from_numpy(C)
D_gpu = (A_gpu # B_gpu # C_gpu.t()).t()
C is very wide so the data reuse on gpu is limited but are there other ways to speed this up? I have a machine with 4x GPUs.
Since you have four GPUs, you can harness them to perform efficient matrix multiplication. Notice however that the results of the multiplication has size 3000x3600000, which takes up 40GB in single precision floating point (fp32). Unless you have a large enough RAM for the CPU, you cannot store the results of this computation on the RAM.
A possible solution for this is to break up the large matrix C into four smaller chunks, perform the matrix multiplication of each chunk on a different GPU, and keep the result on the GPU. Provided that each GPU has at least 10GB of memory, you will have enough memory for this.
If you do have also enough CPU memory, you can then move the results of all four GPUs onto the CPU and concatenate them (in fact, in this case you could have used only a single GPU and transfer the results from GPU to CPU each time). Otherwise, you can keep the results chunked on the GPUs, and you need to remember and keep track that the four chunks are actually part of one matrix.
import numpy as np
import torch.nn as nn
import torch
number_of_gpus = 4
# create four matrics
A = np.random.normal(size=(3000,100))
B = np.random.normal(size=(100,100))
C = np.random.normal(size=(100,3600000))
# convert them to pytorch fp32 tensors
A = torch.from_numpy(A).float()
B = torch.from_numpy(B).float()
C = torch.from_numpy(C).float()
# calcualte `A#B`, which is easy
AB = A#B
# split the large matrix `C` into 4 smaller chunks along the second dimension.
# we assume here that the size of the second dimension of `C` is divisible by 4.
C_split = torch.split(C,C.shape[1]//number_of_gpus,dim=1)
# loop over the four GPUs, and perform the calculation on each using the corresponding chunk of `C`
D_split = []
for i in range(number_of_gpus):
device = 'cuda:{:d}'.format(i)
D_split.append( AB.to(device) # C_split[i].to(device))
# DO THIS ONLY IF YOU HAVE ENOUGH CPU MEMORY!! :
D = torch.cat([d.cpu() for d in D_split],dim=1)
If you have multiple GPUs, you can distribute the computation on all of them using PyTorch's DataParallel. It will split (parallelize) the multiplication of the columns of the matrix C_gpu among the GPUs.
Here's how:
First, import the modules and prepare the matrices:
import torch
import torch.nn as nn
A_gpu = torch.from_numpy(A).float()
B_gpu = torch.from_numpy(B).float()
C_gpu = torch.from_numpy(C).float()
Next, create a Linear "layer" without bias. What this layer does is exactly matrix multiplication. The input size will be the size of each column of C_gpu, and the output size will be the size of each column of the result.
mat_mult = nn.Linear(in_features=C_gpu.shape[0],out_features=A_gpu.shape[0],bias=False)
Set the matrix (=weight) of the layer to be A_gpu # B_gpu, which is a small matrix that can be quickly computed without parallelization (although you could parallelize it as well if you want).
mat_mult.weight.data = A_gpu # B_gpu
Convert the layer into a DataParallel instance. This means that it will automatically parallelize computation along the "batch" dimension. The argument device_ids is a list of indices of your GPUs (4 of them, in your case).
mat_mult_gpu = nn.DataParallel(mat_mult,device_ids=[0,1,2,3]).to('cuda:0')
Now you can feed the matrix C_gpu into the layer, and the computation will be parallel along the its large dimension:
D_gpu = mat_mult_gpu(C_gpu.t())
IMPORTANT NOTE: When writing this answer, I did not have access to multiple GPUs to actually test this proposed solution. I will appreciate if any of the readers will confirm that it indeed works (and even better - time the solution and compare to a single GPU)
EDIT1: I now tried this code on multiple GPUs (four Nvidia Tesla P100), and turns out it gives an out of memory error. I'll keep this solution here as a reference though, since it does work for sizes up to about 400K (instead of 3.6M).
Also, This solution will still work also for sizes 3.6M if you divide C into smaller chunks, feed each chunk into mat_mult_gpu, and then concatenate the results on the CPU. Note that you need a lot of CPU memory for this to work, since the result has size 3K-by-3.6M which in fp32 takes about 40GB. (alternatively, you can save each chunk to the disk without concatenating chunks).
Depending on your matrix C, a sparse matrix, may reduce size and computation time, e.g. you save only columns that are not 0, maybe torch reference may help.
I want to find the least-square solution of a matrix and I am using the numpy linalg.lstsq function;
weights = np.linalg.lstsq(semivariance, prediction, rcond=None)
The dimension for my variables are;
semivariance is a float of size 5030 x 5030
prediction is a 1D array of length 5030
The problem I have is it takes approximately 80sec to return the value of weights and I have to repeat the calculation of weights about 10000 times so the computational time is just elevated.
Is there a faster way/pythonic function to do this?
#Brenlla appears to be right, even if you perform least squares by solving using the Moore-Penrose pseudo inverse, it is significantly faster than np.linalg.lstsq:
import numpy as np
import time
semivariance=np.random.uniform(0,100,[5030,5030]).astype(np.float64)
prediction=np.random.uniform(0,100,[5030,1]).astype(np.float64)
start=time.time()
weights_lstsq = np.linalg.lstsq(semivariance, prediction, rcond=None)
print("Took: {}".format(time.time()-start))
>>> Took: 34.65818190574646
start=time.time()
weights_pseudo = np.linalg.solve(semivariance.T.dot(semivariance),semivariance.T.dot(prediction))
print("Took: {}".format(time.time()-start))
>>> Took: 2.0434153079986572
np.allclose(weights_lstsq[0],weights_pseudo)
>>> True
The above is not on your exact matrices but the concept on the samples likely transfers. np.linalg.lstsq performs an optimisation problem by minimizing || b - a x ||^2 to solve for x in ax=b. This is usually faster on extremely large matrices, hence why linear models are often solved using gradient decent in neural networks, but in this case the matrices just aren't large enough for the performance benefit.
I am using the code below:
n = 40000
numpy.matlib.identity(n)
You can do this with scipy using sparse matrix representation:
import numpy as np
from scipy.sparse import identity
n = 30000
a = np.identity(n)
print a.nbytes
b = identity(n)
print b.data.nbytes
The difference is huge (quadratic): 7200000000 vs 240000.
You can also try to decrease the size by providing appropriate dtype, like a = np.identity(n, dtype='int8') but this will only reduce the size linearly (with maximum linear factor of less than 200).
The same way you can do b = identity(n, dtype='int8', format='dia') which will reduce the size even further to 30000.
But the most important thing is what are you planning to do with this matrix (highly doubt you just want to create it)? And some of the operations would not support sparse indices. Then you either have to buy more memory or come up with smart linear-algebra stuff to operate on parts of the matrices, store results on disk and merge them together.