Using cupy to create a distance matrix from another matrix on GPU - python

I have written code using numpy that takes an array of size (m x n)... The rows (m) are individual observations comprised of (n) features... and creates a square distance matrix of size (m x m). This distance matrix is the distance of a given observation from all other observations. E.g. row 0 column 9 is the distance between observation 0 and observation 9.
import numpy as np
#import cupy as np
def l1_distance(arr):
return np.linalg.norm(arr, 1)
X = np.random.randint(low=0, high=255, size=(700,4096))
distance = np.empty((700,700))
for i in range(700):
for j in range(700):
distance[i,j] = l1_distance(X[i,:] - X[j,:])
I attempted this on GPU using cupy by umcommenting the second import statement, but obviously the double for loop is drastically inefficient. It takes numpy approx 6 seconds, but cupy takes 26 seconds. I understand why but it's not immediately clear to me how to parallelize this process.
I know I'm going to need to write a reduction kernel of some sort, but I can't think of how to construct one cupy array from iterative operations on elements of another array.

Using broadcasting CuPy takes 0.10 seconds in a A100 GPU compared to NumPy which takes 6.6 seconds
for i in range(700):
distance[i,:] = np.abs(np.broadcast_to(X[i,:], X.shape) - X).sum(axis=1)
This vectorizes and makes the distance of one vector to all other ones in parallel.

Related

Can I use the numpy normal distribution sampler to efficiently create a sample using few or no loops?

I would like to generate a sample that follows a normal distribution from M source values each with a standard deviation, with N samples per source value. Can this be done efficiently with numpy arrays?
My desired output is an MxN array. I expected this pseudocode to work, but it fails with an error:
import numpy as np
# initial data
M = 100
x = np.arange(M)
y = x**2
y_err = y * 0.1
# sample the data N times per datapoint
N = 1000
N_samples = np.random.normal(loc=y, scale=y_err, size=N)
Running this yields a broadcasting error since N and M are not the same:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
I can imagine solutions that use loops, but is there a better/faster method that minimizes the use of loops? For example, many numpy functions are vectorized so I would expect there to be some numpy method that would be faster or at least avoid the use of loops.
I was able to create two methods: one that uses loops, and one that uses numpy functions. However, the numpy method is slower for large arrays, so I am curious as to why this is and whether there is an alternative method.
Method one: loop through each of the M source values and sample N points from that value, and proceed through the whole dataset so that the numpy sampler is used M times:
# initialize the sample array
y_sampled = np.zeros([M, N])
for i in range(M):
y_sampled[i] = prng.normal(loc=y[i], scale=y_err_abs[i], size=num_samples)
Method two: use numpy's vectorized methods on an adjusted dataset, wherein the source data is duplicated to be an MxN array, on which the numpy sampler is applied once
# duplicate the source data and error arrays horizontally N times
y_dup = np.repeat(np.vstack(y), N,axis=1)
y_err_dup = np.repeat(np.vstack(y_err), N, axis=1)
# apply the numpy sampler once on the entire 2D array
y_sampled = np.random.normal(loc=y_dup, scale=y_err_dup, size=(M,N))
I expected the second method to be faster since the sampler is applied only once, albeit on a 2D array. The walltime is similar for small arrays (M = 100) but different by a factor of ~2x for larger arrays (M = 1E5). Timing:
M = 100 N = 1000
Time used by loop method: 0.0156 seconds
Time used by numpy resize/duplicating method: 0.0199
M = 100000 N = 1000
Time used by loop method: 3.9298 seconds
Time used by numpy resize/duplicating method: 7.3371 seconds
I would expect there to be a built-in method to sample N times, instead of duplicating the dataset N times, but these methods work.

NumPy array row differences

I have a NumPy array vectors = np.random.randn(rows, cols). I want to find differences between its rows according to some other array diffs which is sparse and "2-hot": containing a 1 in its column corresponding to the first row of vectors and a -1 corresponding to the second row. Perhaps an example shall make it clearer:
diffs = np.array([[ 1, 0, -1],
[ 1, -1, 0]])
then I can compute the row differences by simply diffs # vectors.
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse: sparse.csr_matrix(diffs) # vectors, but even this is 300ms.
Possibly this is simply as fast as it gets, but some part of me thinks whether using matrix multiplications is the wisest decision for this task.
What's more is I need to take the absolute value afterwards so really I'm doing np.abs(sparse.csr_matrix(diffs) # vectors) which adds ~ 200ms for a grand total of ~500ms.
I can compute the row differences by simply diffs # vectors.
This is very inefficient. A matrix multiplication runs in O(n*m*k) for a (n,m) multiplied by a (m,k) one. In your case, there is only two values per line and you do not actually need a multiplication by 1 or -1. Your problem can be computed in O(n*k) time (ie. m times faster).
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse.
The thing is the input data representation is inefficient. When diff is an array of size (10_000,1000), this is not reasonable to use a dense matrix that would be ~1000 times bigger than needed nor a sparse matrix that is not optimized for having only two non-zero values (especially 1 and -1). You need to store the position of the non-zeros values in a 2D array called sel_rows of shape (2,n) where the first row contains the location of the 1 and the second one contains the location of the -1 in the diff 2D array. Then, you can extract the rows of vectors for example with vectors[sel_rows[0]]. You can perform the final operation with vectors[sel_rows[0,:]] - vectors[sel_rows[1,:]]. This approach should be drastically faster than a dense matrix product and it may be a bit faster than a sparse one regarding the target machine.
While the above solution is simple, it create multiple temporary arrays that are not cache-friendly since your output array should take 10_000 * 15_000 * 8 = 1.1 GiB (which is quite huge). You can use Numba so to remove temporary array and so improve the performance. Multiple threads can be used to improve performance even further. Here is an untested code:
import numba as nb
#nb.njit('(int_[:,::1], float64[:,::1])', parallel=True)
def compute(diffs, vectors):
n, k = diffs.shape[0], vectors.shape[1]
assert diffs.shape[1] == 2
res = np.empty((n, k))
for i in nb.prange(n):
a, b = diffs[i]
for j in range(k):
# Compute nb.abs here if needed so to avoid
# creating new temporary arrays
res[i, j] = vectors[a, j] - vectors[b, j]
return res
This above code should be nearly optimal. It should be memory bound and able to saturate the memory bandwidth. Note that writing such huge arrays in memory take some time as well as reading (twice) the input array. On x86-64 platforms, a basic implementation should move 4.4 GiB of data from/to the RAM. Thus, on a mainstream PC with a 20 GiB/s RAM, this takes 220 ms. In fact, the sparse matrix computation result was not so bad in practice for a sequential implementation.
If this is not enough to you, then you can use simple-precision floating-point numbers instead of double-precision (twice faster). You could also use a low-level C/C++ implementation so to reduce the memory bandwidth used (thanks to non-temporal instructions -- ~30% faster). There is no much more to do.

Scipy spsolve is order of magnitude slower than matlab mldivide

Ok, so I have a linear system. A is sparse 29791 by 29791 with 202771 stored elements. B is 29791 by 1 with 4561 stored elements.
I have tried solving this system by storing A as csr_matrix and as csc_matrix, and B as a regular numpy array. Both csr and csc matrices take about a minute to solve. Then, I tried saving the triplets (row, column, data) in csv format, and importing this in matlab and then doing:
Asp = sparse(data(:, 1), data(:, 2), data(:, 3))
P = Asp\b
This takes around a second, if not less. Why is matlab nearly 2 orders of magnitude faster than scipy, and how can I speed this computation up? Even matlab mldivide takes less than 2 seconds.
A = csc_matrix((data, (np.asarray(rows), np.asarray(cols))), shape=A_shape)
P = spsolve(A, csc_matrix(B))
I expect the scipy spsolve to be at max twice as slow as matlab backslash, but its not.

Matrix multiplication in CUDA running out of memory

I try to compute the matrix multiplication using the script:
import numpy as np
import math
from timeit import default_timer as timer
from numba import cuda
from numba import *
def mult(a,b):
return a*b
mult_gpu=cuda.jit(restype=float32,argtypes=[float32,float32],device=True)(mult)
#cuda.jit(argtypes=[float32[:,:],float32[:,:],float32[:,:,:]])
def mult_kernel(a,b,c):
Ni=c.shape[0]
Nj=c.shape[1]
Nk=c.shape[2]
startX,startY,startZ=cuda.grid(3)
gridX=cuda.gridDim.x*cuda.blockDim.x
gridY=cuda.gridDim.y*cuda.blockDim.y
gridZ=cuda.gridDim.z*cuda.blockDim.z
for i in range(startX,Ni,gridX):
for j in range(startY,Nj,gridY):
for k in range(startZ,Nk,gridZ):
c[i,j,k]=mult_gpu(a[i,k],b[j,k])
def main():
A=np.ones((20,50000),dtype=np.float32)
B=np.ones((3072,50000),dtype=np.float32)
C=np.ones((20,3072,50000),dtype=np.float32)
(Ni,Nj,Nk)=C.shape
my_gpu=cuda.get_current_device()
thread_ct=my_gpu.WARP_SIZE
block_ct_x=int(math.ceil(float(Ni)/thread_ct))
block_ct_y=int(math.ceil(float(Nj)/thread_ct))
block_ct_z=int(math.ceil(float(Nk)/thread_ct))
blockdim=thread_ct,thread_ct,thread_ct
griddim=block_ct_x,block_ct_y,block_ct_z
print "Threads per block:",blockdim
print "Blocks per grid:",griddim
start=timer()
Cg=cuda.to_device(C)
mult_kernel[griddim,blockdim](A,B,Cg)
Cg.to_host()
dt=timer()-start
print "Computation done in %f s"%(dt)
print 'C[:3,1,1] = ',C[:3,1,1]
print 'C[-3:,1,1] = ',C[-3:,1,1]
if __name__=='__main__':
main()
Executing this yields an error:
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
How could I fix this memory problem?
Nevertheless, using smaller matrices
A=np.ones((20,500),dtype=np.float32)
B=np.ones((372,500),dtype=np.float32)
C=np.ones((20,372,500),dtype=np.float32)
there is still an error:
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
I got inspired by the Mandelbrot Example to implement the computation above.
EDIT1
In order to resolve any confusion, this is actually a 3D matrix by 3D matrix multiplication:
A=np.ones((20,1,50000),dtype=np.float32)
B=np.ones((1,3072,50000),dtype=np.float32)
C=np.ones((20,3072,50000),dtype=np.float32)
I skipped one dimension in A and B because it is not necessary for the computation.
EDIT2
My GPU is:
In [1]: from numba import cuda
In [2]: gpu=cuda.get_current_device()
In [3]: gpu.name
Out[3]: 'GeForce GT 750M'
EDIT3
According to the memory of my GPU (2GB), I reduced the size of each dimension by 2:
dimx=10
dimy=1536
dimz=25000
A=np.ones((dimx,dimz),dtype=np.float32)
B=np.ones((dimy,dimz),dtype=np.float32)
C=np.ones((dimx,dimy,dimz),dtype=np.float32)
But I still receive the CUDA_ERROR_OUT_OF_MEMORY error. How could one explain this?
The calculation yields a size of about 1.7GB for the 3 matrices:
(10*1536*25000*4.+10*25000*4+1536*25000*4.)/(10**9)=1.6906
Regarding the first problem, you're running out of memory. A major contributor to that is that this isn't the way people would normally do a matrix-matrix multiply. Normally, as you are multiplying row and column elements together, you would keep a running sum, then store that sum in the appropriate location in the product (result) matrix. This will allow you to have a much smaller size for the c matrix, ie. it need only be 2 dimensions, not 3. You may wish to just study the linear algebra definition of matrix-matrix multiply. When you multiply a 2D matrix by a 2D matrix, the result is a 2D matrix, not a 3D matrix.
In a nutshell, something like this:
for i in range(startX,Ni,gridX):
for j in range(startY,Nj,gridY):
c[i,j] = 0
for k in range(startZ,Nk,gridZ):
c[i,j]= c[i,j] + mult_gpu(a[i,k],b[j,k])
And adjust your c shape accordingly.
If you actually need the individual products in 3D form as you are doing here, then there is not much I can say except that you will need to scale the problem to fit in the GPU memory size for whatever GPU you are using.
Regarding the second problem, you have a problem here:
thread_ct=my_gpu.WARP_SIZE
...
blockdim=thread_ct,thread_ct,thread_ct
WARP_SIZE is 32 (presumably) so you are asking for a 3D block of dimensions 32*32*32 = 32K threads. But CUDA threadblocks are limited to a maximum of 1024 threads, which limit is the product of the individual dimensions.
If you change your thread_ct to 8, for example:
thread_ct=8
You should be able to get past this particular issue.

Optimizing histogram distance metric for two matrices in Python

I have two matrices A and B, each with a size of NxM, where N is the number of samples and M is the size of histogram bins. Thus, each row represents a histogram for that particular sample.
What I would like to do is to compute the chi-square distance between two matrices for a different pair of samples. Therefore, each row in the matrix A will be compared to all rows in the other matrix B, resulting a final matrix C with a size of NxN and C[i,j] corresponds to the chi-square distance between A[i] and B[j] histograms.
Here is my python code that does the job:
def chi_square(histA,histB):
esp = 1.e-10
d = sum((histA-histB)**2/(histA+histB+eps))
return 0.5*d
def matrix_cost(A,B):
a,_ = A.shape
b,_ = B.shape
C = zeros((a,b))
for i in xrange(a):
for j in xrange(b):
C[i,j] = chi_square(A[i],B[j])
return C
Currently, for a 100x70 matrix, this entire process takes 0.1 seconds.
Is there any way to improve this performance?
I would appreciate any thoughts or recommendations.
Thank you.
Sure! I'm assuming you're using numpy?
If you have the RAM available, you could use broadcast the arrays and use numpy's efficient vectorization of the operations on those arrays.
Here's how:
Abroad = A[:,np.newaxis,:] # prepared for broadcasting
C = np.sum((Abroad - B)**2/(Abroad + B), axis=-1)/2.
Timing considerations on my platform show a factor of 10 speed gain compared to your algorithm.
A slower option (but still faster than your original algorithm) that uses less RAM than the previous option is simply to broadcast the rows of A into 2D arrays:
def new_way(A,B):
C = np.empty((A.shape[0],B.shape[0]))
for rowind, row in enumerate(A):
C[rowind,:] = np.sum((row - B)**2/(row + B), axis=-1)/2.
return C
This has the advantage that it can be run for arrays with shape (N,M) much larger than (100,70).
You could also look to Theano to push the expensive for-loops to the C-level if you don't have the memory available. I get a factor 2 speed gain compared to the first option (not taking into account the initial compile time) for both the (100,70) arrays as well as (1000,70):
import theano
import theano.tensor as T
X = T.matrix("X")
Y = T.matrix("Y")
results, updates = theano.scan(lambda x_i: ((x_i - Y)**2/(x_i+Y)).sum(axis=1)/2., sequences=X)
chi_square_norm = theano.function(inputs=[X, Y], outputs=[results])
chi_square_norm(A,B) # same result

Categories