I have a NumPy array vectors = np.random.randn(rows, cols). I want to find differences between its rows according to some other array diffs, which is sparse and "2-hot": each row contains a 1 in the column corresponding to the first row of vectors and a -1 in the column corresponding to the second row. Perhaps an example will make it clearer:
diffs = np.array([[ 1,  0, -1],
                  [ 1, -1,  0]])
then I can compute the row differences by simply diffs @ vectors.
Unfortunately this is slow for diffs of shape 10_000x1000 and vectors of shape 1000x15_000. I can get a speedup by using scipy.sparse: sparse.csr_matrix(diffs) @ vectors, but even this takes ~300ms.
Possibly this is simply as fast as it gets, but part of me wonders whether matrix multiplication is the wisest choice for this task.
What's more, I need to take the absolute value afterwards, so really I'm doing np.abs(sparse.csr_matrix(diffs) @ vectors), which adds ~200ms for a grand total of ~500ms.
I can compute the row differences by simply diffs @ vectors.
This is very inefficient. A matrix multiplication runs in O(n*m*k) for an (n,m) matrix multiplied by an (m,k) one. In your case, there are only two non-zero values per row and you do not actually need a multiplication by 1 or -1. Your problem can be computed in O(n*k) time (i.e. m times faster).
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse.
The thing is, the input data representation is inefficient. When diffs is an array of size (10_000, 1000), it is not reasonable to use a dense matrix that is ~1000 times bigger than needed, nor a sparse matrix that is not optimized for rows holding only two non-zero values (especially 1 and -1). You only need to store the positions of the non-zero values, in a 2D array called sel_rows of shape (2, n), where the first row contains the column index of the 1 and the second row contains the column index of the -1 for each row of diffs. Then you can extract rows of vectors with, for example, vectors[sel_rows[0]], and perform the final operation with vectors[sel_rows[0,:]] - vectors[sel_rows[1,:]]. This approach should be drastically faster than a dense matrix product, and it may be a bit faster than a sparse one depending on the target machine.
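For illustration, here is a minimal sketch of how such a sel_rows array could be built from the 2-hot diffs and used directly (assuming every row of diffs contains exactly one 1 and one -1; the names are just examples):

import numpy as np

# column index of the +1 and of the -1 in each row of `diffs`
sel_rows = np.vstack([np.argmax(diffs == 1, axis=1),
                      np.argmax(diffs == -1, axis=1)])

# row differences without any matrix product (apply np.abs on top if needed)
res = vectors[sel_rows[0]] - vectors[sel_rows[1]]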
While the above solution is simple, it creates multiple temporary arrays that are not cache-friendly, since your output array alone takes 10_000 * 15_000 * 8 = 1.1 GiB (which is quite huge). You can use Numba to remove the temporary arrays and improve the performance. Multiple threads can be used to improve performance even further. Here is an untested code:
import numpy as np
import numba as nb

# `diffs` here is the (n, 2) array of index pairs (i.e. sel_rows.T)
@nb.njit('(int_[:,::1], float64[:,::1])', parallel=True)
def compute(diffs, vectors):
    n, k = diffs.shape[0], vectors.shape[1]
    assert diffs.shape[1] == 2
    res = np.empty((n, k))
    for i in nb.prange(n):
        a, b = diffs[i, 0], diffs[i, 1]
        for j in range(k):
            # compute abs() here if needed so as to avoid
            # creating a new temporary array
            res[i, j] = vectors[a, j] - vectors[b, j]
    return res
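A hypothetical usage example, assuming the sel_rows array sketched above (np.ascontiguousarray is only there so the (n, 2) index array matches the C-contiguous signature in the decorator):

pairs = np.ascontiguousarray(sel_rows.T)   # shape (n, 2)
res = compute(pairs, vectors)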
The code above should be nearly optimal. It should be memory-bound and able to saturate the memory bandwidth. Note that writing such huge arrays to memory takes some time, as does reading (twice) the input array. On x86-64 platforms, a basic implementation should move about 4.4 GiB of data from/to the RAM. Thus, on a mainstream PC with 20 GiB/s of RAM bandwidth, this takes about 220 ms. In fact, the sparse matrix computation result was not so bad in practice for a sequential implementation.
If this is not enough for you, then you can use single-precision floating-point numbers instead of double precision (twice as fast). You could also use a low-level C/C++ implementation to reduce the memory bandwidth used (thanks to non-temporal instructions -- ~30% faster). There is not much more to do.
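For instance, the single-precision variant would only require converting the input and adjusting the code above accordingly (a sketch, not tested):

vectors_f32 = vectors.astype(np.float32)   # halves the data moved from/to RAM
# the decorator then becomes @nb.njit('(int_[:,::1], float32[:,::1])', parallel=True)
# and `res` should be allocated with dtype=np.float32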
Related
I'm dealing with large dense square matrices of size NxN ~(100k x 100k) that are too large to fit into memory.
After doing some research, I've found that most people handle large matrices by using either numpy's memmap or the pytables package. However, these packages seem to have major limitations: neither of them seems to support assigning values to list slices of the on-disk matrix along more than one dimension.
I would like to look for an efficient way to assign values to a large dense square matrix M with something like:
M[0, [1,2,3], [8,15,30]] = np.zeros((3, 3)) # or
M[0, [1,2,3,1,2,3,1,2,3], [8,8,8,15,15,15,30,30,30]] = 0 # for memmap
With memmap, the expression M[0, [1,2,3], [8,15,30]] would always copy the slice into RAM, hence assignment doesn't seem to work.
With pytables, list slicing along more than one dimension is not supported. Currently I'm just slicing along one dimension followed by the other (i.e. M[0, [1,2,3]][:, [8,15,30]]). RAM usage of this solution scales with N, which is better than dealing with the whole array (N^2) but is still not ideal.
In addition, it appears that pytables isn't the most efficient way of handling matrices with lots of rows (or could there be a way of specifying the chunksize to get rid of this message?). I am getting the following warning message:
The Leaf ``/M`` is exceeding the maximum recommended rowsize (104857600 bytes);
be ready to see PyTables asking for *lots* of memory and possibly slow
I/O. You may want to reduce the rowsize by trimming the value of
dimensions that are orthogonal (and preferably close) to the *main*
dimension of this leave. Alternatively, in case you have specified a
very small/large chunksize, you may want to increase/decrease it.
I'm just wondering whether there are better solutions to assign values to arbitrary 2D slices of large matrices?
First of all, note that in numpy (I'm not sure about pytables) M[0, [1,2,3], [8,15,30]] will return an array of shape (3,) corresponding to elements M[0,1,8], M[0,2,15] and M[0,3,30], so assigning np.zeros((3,3)) to it will raise an error.
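A small, purely illustrative example of this behaviour (the toy M below is an assumption, just for demonstration):

import numpy as np

M = np.arange(2 * 40 * 40, dtype=float).reshape(2, 40, 40)
print(M[0, [1, 2, 3], [8, 15, 30]].shape)           # (3,)
# M[0, [1, 2, 3], [8, 15, 30]] = np.zeros((3, 3))   # ValueError: shape mismatch
M[0, [1, 2, 3], [8, 15, 30]] = 0                    # this works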
Now, the following works fine for me:
import numpy as np

np.save('M.npy', np.random.randn(5,5,5)) # create some dummy matrix
M = np.load('M.npy', mmap_mode='r+') # load such matrix as a memmap
M[[0,1,2],[1,2,3],[2,3,4]] = 0
M.flush() # make sure thing is updated on disk
del M
M = np.load('M.npy', mmap_mode='r+') # re-load matrix
print(M[[0,1,2],[1,2,3],[2,3,4]]) # should show array([0., 0., 0.])
I need to multiply 3 matrices, A: 3000x100, B: 100x100, C: 100x3.6MM. I am currently just using normal matrix multiplication in PyTorch:
A_gpu = torch.from_numpy(A)
B_gpu = torch.from_numpy(B)
C_gpu = torch.from_numpy(C)
D_gpu = (A_gpu @ B_gpu @ C_gpu.t()).t()
C is very wide, so the data reuse on the GPU is limited, but are there other ways to speed this up? I have a machine with 4 GPUs.
Since you have four GPUs, you can harness them to perform efficient matrix multiplication. Notice however that the result of the multiplication has size 3000x3600000, which takes up 40GB in single-precision floating point (fp32). Unless you have enough CPU RAM, you cannot store the result of this computation in RAM.
A possible solution for this is to break up the large matrix C into four smaller chunks, perform the matrix multiplication of each chunk on a different GPU, and keep the result on the GPU. Provided that each GPU has at least 10GB of memory, you will have enough memory for this.
If you also have enough CPU memory, you can then move the results of all four GPUs onto the CPU and concatenate them (in fact, in this case you could have used only a single GPU and transferred the results from GPU to CPU each time). Otherwise, you can keep the results chunked on the GPUs, and you need to keep track of the fact that the four chunks are actually parts of one matrix.
import numpy as np
import torch.nn as nn
import torch
number_of_gpus = 4
# create the three matrices
A = np.random.normal(size=(3000,100))
B = np.random.normal(size=(100,100))
C = np.random.normal(size=(100,3600000))
# convert them to pytorch fp32 tensors
A = torch.from_numpy(A).float()
B = torch.from_numpy(B).float()
C = torch.from_numpy(C).float()
# calculate `A@B`, which is easy
AB = A @ B
# split the large matrix `C` into 4 smaller chunks along the second dimension.
# we assume here that the size of the second dimension of `C` is divisible by 4.
C_split = torch.split(C,C.shape[1]//number_of_gpus,dim=1)
# loop over the four GPUs, and perform the calculation on each using the corresponding chunk of `C`
D_split = []
for i in range(number_of_gpus):
    device = 'cuda:{:d}'.format(i)
    D_split.append(AB.to(device) @ C_split[i].to(device))
# DO THIS ONLY IF YOU HAVE ENOUGH CPU MEMORY!! :
D = torch.cat([d.cpu() for d in D_split],dim=1)
If you have multiple GPUs, you can distribute the computation on all of them using PyTorch's DataParallel. It will split (parallelize) the multiplication of the columns of the matrix C_gpu among the GPUs.
Here's how:
First, import the modules and prepare the matrices:
import torch
import torch.nn as nn
A_gpu = torch.from_numpy(A).float()
B_gpu = torch.from_numpy(B).float()
C_gpu = torch.from_numpy(C).float()
Next, create a Linear "layer" without bias. What this layer does is exactly matrix multiplication. The input size will be the size of each column of C_gpu, and the output size will be the size of each column of the result.
mat_mult = nn.Linear(in_features=C_gpu.shape[0],out_features=A_gpu.shape[0],bias=False)
Set the matrix (=weight) of the layer to be A_gpu @ B_gpu, which is a small matrix that can be quickly computed without parallelization (although you could parallelize it as well if you want).
mat_mult.weight.data = A_gpu @ B_gpu
Convert the layer into a DataParallel instance. This means that it will automatically parallelize computation along the "batch" dimension. The argument device_ids is a list of indices of your GPUs (4 of them, in your case).
mat_mult_gpu = nn.DataParallel(mat_mult,device_ids=[0,1,2,3]).to('cuda:0')
Now you can feed the matrix C_gpu into the layer, and the computation will be parallelized along its large dimension:
D_gpu = mat_mult_gpu(C_gpu.t())
IMPORTANT NOTE: When writing this answer, I did not have access to multiple GPUs to actually test this proposed solution. I will appreciate it if any of the readers can confirm that it indeed works (and even better, time the solution and compare it to a single GPU).
EDIT1: I have now tried this code on multiple GPUs (four Nvidia Tesla P100), and it turns out it gives an out-of-memory error. I'll keep this solution here as a reference though, since it does work for sizes up to about 400K (instead of 3.6M).
Also, this solution will still work for a size of 3.6M if you divide C into smaller chunks, feed each chunk into mat_mult_gpu, and then concatenate the results on the CPU. Note that you need a lot of CPU memory for this to work, since the result has size 3K-by-3.6M, which in fp32 takes about 40GB. (Alternatively, you can save each chunk to disk without concatenating them.)
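A hedged sketch of that chunked variant (the chunk width of 400_000 is only an assumption based on the sizes reported to work above, and the file name in the comment is hypothetical):

chunk_size = 400_000
D_chunks = []
for start in range(0, C_gpu.shape[1], chunk_size):
    chunk = C_gpu[:, start:start + chunk_size]       # shape (100, chunk_size)
    D_chunk = mat_mult_gpu(chunk.t().to('cuda:0'))   # shape (chunk_size, 3000)
    D_chunks.append(D_chunk.cpu())                   # or torch.save(D_chunk.cpu(), 'D_{}.pt'.format(start))
D = torch.cat(D_chunks, dim=0)                       # needs ~40GB of CPU memory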
Depending on your matrix C, a sparse representation may reduce the size and computation time, e.g. you store only the entries that are not 0; the torch sparse reference may help.
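A minimal sketch of that idea, assuming C really is mostly zeros (torch.sparse.mm expects the sparse operand first, so the product is computed in transposed form):

C_sparse = C_gpu.t().to_sparse()                     # sparse COO tensor, shape (3.6M, 100)
D = torch.sparse.mm(C_sparse, (A_gpu @ B_gpu).t())   # shape (3.6M, 3000), equal to (A @ B @ C).t()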
I am trying to get rid of the for loop and instead do an array-matrix multiplication to decrease the processing time when the weights array is very large:
import numpy as np
sequence = [np.random.random(10), np.random.random(10), np.random.random(10)]
weights = np.array([[0.1,0.3,0.6],[0.5,0.2,0.3],[0.1,0.8,0.1]])
Cov_matrix = np.matrix(np.cov(sequence))
results = []
for w in weights:
    result = np.matrix(w) * Cov_matrix * np.matrix(w).T
    results.append(result.A)
Where:
Cov_matrix is a 3x3 matrix
weights is an array of length n with n 1x3 matrices in it.
Is there a way to multiply/map weights to Cov_matrix and bypass the for loop? I am not very familiar with all the numpy functions.
I'd like to reiterate what's already been said in another answer: the np.matrix class has many more disadvantages than advantages these days, and I suggest moving to the use of the np.array class alone. Matrix multiplication of arrays can be easily written using the @ operator, so the notation is in most cases as elegant as for the matrix class (and arrays don't have several restrictions that matrices do).
With that out of the way, what you need can be done in terms of a call to np.einsum. We need to contract certain indices of three matrices while keeping one index alone in two of them. That is, we want to perform w_{ij} * Cov_{jk} * w.T_{ki} with a summation over j and k, giving us an array indexed by i. The following call to einsum will do:
res = np.einsum('ij,jk,ik->i', weights, Cov_matrix, weights)
Note that the above will give you a single 1d array, whereas you originally had a list of arrays with shape (1,1). I suspect the above result will even make more sense. Also, note that I omitted the transpose in the second weights argument, which is why the corresponding summation indices appear as ik rather than ki. This should be marginally faster.
To prove that the above gives the same result:
In [8]: results # original
Out[8]: [array([[0.02803215]]), array([[0.02280609]]), array([[0.0318784]])]
In [9]: res # einsum
Out[9]: array([0.02803215, 0.02280609, 0.0318784 ])
The same can be achieved by working with the weights as a matrix and then looking at the diagonal elements of the result. Namely:
np.diag(weights.dot(Cov_matrix).dot(weights.transpose()))
which gives:
array([0.03553664, 0.02394509, 0.03765553])
This does more calculations than necessary (calculates off-diagonals) so maybe someone will suggest a more efficient method.
Note: I'd suggest slowly moving away from np.matrix and instead working with np.array. It takes a bit of getting used to not being able to do A*b, but it will pay dividends in the long run. Here is a related discussion.
I have a 50-dimensional array, whose dimensions are 255 x 255 x 255 x ... (50 times) ... x 255. So it's a total of 255^50 floating point numbers. It's just out of scope to even think of fitting it in RAM. Moreover, I need to take a 50-dimensional Fast Fourier Transform (DFT) of this array. I can't do it in Python on an ordinary PC. I can't even imagine doing it on a GPU, so I am guessing I have to take the help of hard disk memory, but even that is too huge. I don't need this in real time; I can afford even days for it to run. I have no clue what sort of machine I need, or whether it is even possible. Appreciate your advice. Supercomputers, grids, or something else, even if it's costly, I am not worried about the investment.
If you found enough universes to save your data in, here is what you could do:
The Fourier transform is separable: calculating the 1D DFT along each axis, one after the other, gives the same result as the n-dimensional DFT:
for i in range(C.ndim):
    C[...] = numpy.fft.fft(C, axis=i)
Double-checking that the result is correct using a 2D array (because we have a 2D FFT, numpy.fft.fft2, to compare against):
import numpy

A = numpy.random.rand(*[16] * 2)
B = numpy.fft.fft2(A)

C = A.astype(complex)  # output array for the separable FFT
for i in range(C.ndim):
    C[...] = numpy.fft.fft(C, axis=i)

numpy.allclose(C, B)  # True
I am using the code below:
n = 40000
numpy.matlib.identity(n)
You can do this with scipy using a sparse matrix representation:
import numpy as np
from scipy.sparse import identity

n = 30000
a = np.identity(n)
print(a.nbytes)

b = identity(n)
print(b.data.nbytes)
The difference is huge (quadratic): 7200000000 vs 240000.
You can also try to decrease the size by providing an appropriate dtype, like a = np.identity(n, dtype='int8'), but this will only reduce the size linearly (with a maximum linear factor of less than 200).
In the same way you can do b = identity(n, dtype='int8', format='dia'), which will reduce the size even further, to 30000.
But the most important thing is: what are you planning to do with this matrix (I highly doubt you just want to create it)? Some operations may not support sparse matrices; then you either have to buy more memory or come up with smart linear-algebra tricks to operate on parts of the matrices, store results on disk and merge them together.
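As a small illustration (a sketch using the b from above), the sparse identity can be used directly in products without ever materializing the dense n x n matrix:

x = np.random.rand(n)
y = b @ x                  # matrix-vector product with the sparse identity
print(np.allclose(x, y))   # True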