Basic multi GPU parallelization of matrix multiplication

Basic multi GPU parallelization of matrix multiplication - python

I want to parallelize the simple following expression on 2 GPUs: C = A^n + B^n by calculating A^n on GPU 0 and B^n on GPU 1 before summing the results.
In TensorFlow I would go like:
with tf.device('/gpu:0'):
An = matpow(A, n)
with tf.device('/gpu:1'):
Bn = matpow(B, n)
with tf.Session() as sess:
C = sess.run(An + Bn)
However, since PyTorch is dynamic, I'm having trouble doing the same thing. I tried the following but it only takes more time.
with torch.cuda.device(0):
A = A.cuda()
with torch.cuda.device(1):
B = B.cuda()
C = matpow(A, n) + matpow(B, n).cuda(0)
I know there is a module to parallelize models on the batch dimension using torch.nn.DataParallel but here I try to do something more basic.

You can use cuda streams for this. This will not necessarily distribute it over two devices, but the execution will be in parallel.
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
with torch.cuda.stream(s1):
A = torch.pow(A,n)
with torch.cuda.stream(s2):
B = torch.pow(B,n)
C = A+B
Although I'm not sure whether it will really speed up your computation if you only parallelize this one operation. Your matrices must be really big.
If your requirement is to split it across devices, you can add this before the streams:
A = A.cuda(0)
B = B.cuda(1)
Then after the power operation, you need to get them on the same device again, e.g. B = B.cuda(0). After that you can do the addition.

Related

Compiled Numba function not faster that CPython

I have a compile function with Numba that splits an array based on an index, this returns an irregular(variable length) list of numpy arrays. This then get padded to form a 2d array from the irregular list.
Problem
The compile function 'nb_array2mat' should be much faster than the pure python 'array2mat' but it is not.
Additionally, is this possible using numpy?
length of the array and index
1456391 95007
times:
numba: 1.3438396453857422
python: 1.1407015323638916
I think I am not using the numba compile in a proper manner. Any help would be great.
EDIT
Using dummy data as edited in the code section now I get an speed up, why does it not work with the actual data?
length of the array and index
1456391 95007
times:
numba: 0.012002706527709961
python: 0.13403034210205078
Code
idx_split: https://drive.google.com/file/d/1hSduTs1_s3seEFAiyk_n5yk36ZBl0AXW/view?usp=sharing
dist_min_orto: https://drive.google.com/file/d/1fwarVmBa0NGbWPifBEezTzjEZSrHncSN/view?usp=sharing
import time
import numba
import numpy as np
from numba.pycc import CC
cc = CC('compile_func')
cc.verbose = True
#numba.njit(parallel=True, fastmath=True)
#cc.export('nb_array2mat', 'f8[:,:](f8[:], i4[:])')
def array2mat(arr, idx):
# split arr by idx indexes
out = []
s = 0
for n in numba.prange(len(idx)):
e = idx[n]
out.append(arr[s:e])
s = e
# create a 2d array with arr values pading empty values with fill_value=1000000.0
_len = [len(_i) for _i in out]
cols = max(_len)
rows = len(out)
mat = np.full(shape=(rows, cols), fill_value=1000000.0)
for row in numba.prange(rows):
len_col = len(out[row])
mat[row, :len_col] = out[row]
return mat
if __name__ == "__main__":
cc.compile()
# PYTHON FUNC
def array2mat(arr, idx):
# split arr by idx indexes
out = []
s = 0
for n in range(len(idx)):
e = idx[n]
out.append(arr[s:e])
s = e
# create a 2d array with arr values pading empty values with fill_value=1000000.0
_len = [len(_i) for _i in out]
cols = max(_len)
rows = len(out)
mat = np.full(shape=(rows, cols), fill_value=1000000.0)
for row in range(rows):
len_col = len(out[row])
mat[row, :len_col] = out[row]
return mat
import compile_func
#ACTUAL DATA
arr = np.load('dist_min_orto.npy').astype(float)
idx = np.load('idx_split.npy').astype(int)
# DUMMY DATA
arr = np.random.randint(50, size=1456391).astype(float)
idx = np.cumsum(np.random.randint(5, size=95007).astype(int))
print(len(arr), len(idx))
#NUMBA FUNC
t0 = time.time()
print(compile_func.nb_array2mat(arr, idx))
print(time.time() - t0)
# PYTHON FUNC
t0 = time.time()
print(array2mat(arr, idx))
print(time.time() - t0)

You cannot use nb.prange on the first loop since out is shared between threads and it is also read/written by them. This causes a race condition. Numba assume that there is not dependencies between iterations and this is your responsibility to guarantee this. The simplest solution is not to use a parallel loop here
Additionally, the second loop is mainly memory-bound so I do not expect a big speed up using multiple threads since the RAM is a shared resource with a limited throughput (few threads are often enough to saturate it, especially on PC where sometimes one thread is enough).
Hopefully, you do not need to create the out temporary list, just the end offsets so then to compute len_cols in the parallel loop. The maximum cols can be computed on the fly in the first loop. The first loop should be executed very quickly compared to the second loop. Filling a big matrix newly allocated is often faster in parallel on Linux since page faults can be done in parallel. AFAIK, one Windows this is less true (certainly since pages faults scale more badly). This is also better here since the range 0:len_col is variable and thus the time to fill this part of the matrix is variable causing some thread to finish after others (the slower thread bound the execution). Furthermore, this is generally much faster on NUMA machines since each NUMA node can write in its own memory.
Note that AOT compilation does not support automatic parallel execution. To quote a Numba developer:
From discussion in today's triage meeting, related to #7696: this is not likely to be supported as AOT code doesn't require Numba to be installed - this would mean a great deal of work and issues to overcome for packaging the code for the threading layers.
The same thing applies for fastmath also it is likely to be added in the next incoming release regarding the current work.
Note that JIT compilation and AOT compilation are two separate process. Thus the parameters of njit are not shared to cc.export and the signature is not shared to njit. This means that the function will be compiled during its first execution due to lazy compilation. That being said, the function is redefined, so the njit is just useless here (overwritten).
Here is the resulting code (using only the JIT implementation with an eager compilation instead of the AOT one):
import time
import numba
import numpy as np
#numba.njit('f8[:,:](f8[:], i4[:])', fastmath=True)
def nb_array2mat(arr, idx):
# split arr by idx indexes
s = 0
ends = np.empty(len(idx), dtype=np.int_)
cols = 0
for n in range(len(idx)):
e = idx[n]
ends[n] = e
len_col = e - s
cols = max(cols, len_col)
s = e
# create a 2d array with arr values pading empty values with fill_value=1000000.0
rows = len(idx)
mat = np.empty(shape=(rows, cols))
for row in numba.prange(rows):
s = ends[row-1] if row >= 1 else 0
e = ends[row]
len_col = e - s
mat[row, 0:len_col] = arr[s:e]
mat[row, len_col:cols] = 1000000.0
return mat
# PYTHON FUNC
def array2mat(arr, idx):
# split arr by idx indexes
out = []
s = 0
for n in range(len(idx)):
e = idx[n]
out.append(arr[s:e])
s = e
# create a 2d array with arr values pading empty values with fill_value=1000000.0
_len = [len(_i) for _i in out]
cols = max(_len)
rows = len(out)
mat = np.full(shape=(rows, cols), fill_value=1000000.0)
for row in range(rows):
len_col = len(out[row])
mat[row, :len_col] = out[row]
return mat
#ACTUAL DATA
arr = np.load('dist_min_orto.npy').astype(np.float64)
idx = np.load('idx_split.npy').astype(np.int32)
#NUMBA FUNC
t0 = time.time()
print(nb_array2mat(arr, idx))
print(time.time() - t0)
# PYTHON FUNC
t0 = time.time()
print(array2mat(arr, idx))
print(time.time() - t0)
On my machine, the new Numba code is slightly faster: it takes 0.358 seconds for the Numba implementation and 0.418 for the Python implementation. In fact, using a sequential Numba code is even slightly faster on my machine as it takes 0.344 second.
Note that the shape of the output matrix is (95007,5469). Thus, the matrix takes 3.87 GiB in memory. You should check you have enough memory to store it. In fact the Python implementation takes about 7.5 GiB on my machine (possibly because the GC/default-allocator does not release the memory directly). If you do not have enouth memory, then the system can use the very slow swap memory (which use your storage device). Moreover, x86-64 processors use a write allocate cache policy causing written cache-lines to be actually read by default. Non temporal writes can be used to avoid this on a big matrix. Unfortunately, neither Numpy nor Numba use this on my machine. This means half the RAM throughput is wasted. Not to mention page faults are pretty expensive: in sequential, 60% of the time of the Numpy implementation is spent in page faults. The Numba code spend almost all its time writing in memory and performing page faults. Here is a related open issue.

based on #Jérôme Richard answer I wrote the same function. The improvement was in the way the mat numpy array is created, as the previous answer stated, the size in memory of the np.full takes a lot longer to operate, so the solution was to initialize it as a np.empty.
The improvement is not much bewtewn python and numba, but the size of the mat array takes a big impact in processing time.
1456391 95007
python: 0.29506611824035645
numba: 0.1800403594970703
Code
#cc.export('nb_array2mat', 'f8[:,:](f8[:], i4[:])')
def nb_array2mat(arr, idx):
s = 0
_len = np.empty(len(idx), dtype=np.int_)
_len[0] = idx[0]
_len[1:] = idx[1:] - idx[:-1]
# create a 2d array
cols = int(np.max(_len))
rows = len(idx)
mat = np.empty(shape=(rows, cols), dtype=np.float_)
for row in range(len(idx)):
e = idx[row]
len_col = _len[row]
mat[row, :len_col] = arr[s:e]
s = e
return mat

Numba Python - how to exploit parallelism effectively?

I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing, is multiplying an array by a float32 then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda
lengthY = 16 # 2D array Y axis
lengthX = 2**16 # X axis
totalops = lengthY * lengthX * 2 # MAC operation has 2 operations
iters = 100
doParallel = True
#njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
output = (float)(0.0)
multconst = (float)(.99)
output = np.sum(np.multiply(testarray, multconst))
return output
#njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(lengthX):
output += multconst*testarray[y,x]
return output
#njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(int(lengthX/4)):
xn = x*4
output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)
# ==== MAC_numpy ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_numpy(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
start = timer()
output = MAC_01(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_04 ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_04(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

Q : How can I get better results?
1st : Learn how to avoid doing useless work - you can straight eliminate HALF of the FLOP-s not speaking about also the half of all the RAM-I/O-s avoided, each one being at a cost of +100~350 [ns] per writeback
Due to the distributive nature of MUL and ADD ( a.C + b.C ) == ( a + b ).C, better first np.sum( A ) and only after that then MUL the sum by the (float) constant.
#utput = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum( testarray)*multconst #######################
2nd : Learn how to best align data along the order of processing ( cache-line reuses get you ~100x faster re-use of pre-fetched data. Not aligning vectorised-code along these already pre-fetched data side-effects just let your code pay many times the RAM-access latencies, instead of smart re-using the already paid for data-blocks. Designing work-units aligned according to this principle means a few SLOCs more, but the rewards are worth that - who gets ~100x faster CPUs+RAMs for free and right now or about a ~100x speedup for free, just from not writing a badly or naively designed looping iterators?
3rd : Learn how to efficiently harness vectorised (block-directed) operations inside numpy or numba code-blocks and avoid pressing numba to spend time on auto-analysing the call-signatures ( you pay an extra time for this auto-analyses per call, while you have designed the code and knew exactly what data-types are going to go there, so why to pay an extra time for auto-analysis each time a numba-block gets called???)
4th : Learn where the extended Amdahl's Law, having all the relevant add-on costs and processing atomicity put into the game, supports your wish to get speedups, not to ever pay way more than you will get back (to at least justify the add-on costs... ) - paying extra costs for not getting any reward is possible, yet has no beneficial impact on your code's performance ( rather the opposite )
5th : Learn when and how the manually created inline(s) may save your code, once the steps 1-4 are well learnt and routinely excersised with proper craftmanship ( Using popular COTS frameworks is fine, yet these may deliver results after a few days of work, while a hand-crafted single purpose smart designed assembly code was able to get the same results in about 12 minutes(!), not several days without any GPU/CPU tricks etc - yes, that faster - just by not doing a single step more than what was needed for the numerical processing of the large matrix data )
Did I mention float32 may surprise at being processed slower on small scales than float64, while on larger data-scales ~ n [GB] the RAM I/O-times grow slower for more efficient float32 pre-fetches? This never happens here, as float64 array gets processed here. Sure, unless one explicitly instructs the constructor(s) to downconvert the default data type, like this: np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )>>> np.random.rand( 10, 2 ).dtypedtype('float64')Avoiding extensive memory allocations is another performance trick, supported in numpy call-signatures. Using this option for large arrays will save you a lot of extra time wasted on mem-allocs for large interim arrays. Reusing already pre-allocated memory-zones and wisely controlled gc-policing are another signs of a professional, focused on low-latency & design-for-performance

Is there a way to use tensorflow map_fn on GPU?

I have a tensor A with shape [a,n] and I need to perform an op my_op with another tensor B of shape [b,n] such that the resulting tensor C has shape [a,b].
In other words: For each subtensor in A (A[0], A1,...A[n]) I need to perform an element wise op with each subtensor in B.
So the resulting tensor would contain the following:
[ [ A[0] op B[0] , A[0] op B[1], ... , A[0] op B[b] ],
[ A[1] op B[0] , A[1] op B[1], ... , A[1] op B[b] ],
[ ... ],
[ A[a] op B[0] , A[a] op B[1], ... , A[a] op B[b] ] ]
The only way that I've been able to find that achieves this is through nested use of tf.map_fn
Thus:
import tensorflow as tf
import time
import numpy as np
a_size = 64
b_size = 256*256
n = 256
A = tf.placeholder(tf.float32,[a_size,n])
B = tf.placeholder(tf.float32,[b_size,n])
def elementwise_op(a,b):
return tf.reduce_sum(tf.multiply(a,b))
def intermediate_op(sub_a,my_b):
sample_values = tf.map_fn(lambda x: elementwise_op(sub_a,x),my_b)
return sample_values
my_op = tf.map_fn(lambda x: intermediate_op(x,B),A)
with tf.Session() as sess:
a = np.random.rand(a_size,n)
b = np.random.rand(b_size,n)
start_time = time.time()
result = sess.run (my_op,feed_dict={A:a,B:b})
print ("exec time: " ,time.time()-start_time)
print (result.shape)
The code above runs fine, however, it does not use the GPU very well (only ~15% utilization, according to nvidia-smi). In fact, it runs an order of magnitude faster when using only the CPU!! (on my 12 core machine) When run using the GPU, I see very low GPU utilization (~15%) and 100% on one of my CPU cores. When run on the CPU only, I see 100% utilization across all CPU cores.
Average timing of 5 CPU only runs: 11.33s
Average timing of 5 GPU runs: 111.88s
The above test was run using the official Tensorflow docker images: tensorflow/tensorflow:latest-py3 (for CPU) and tensorflow/tensorflow:latest-gpu-py3 (for GPU)
My guess is that map_fn, via the python lambda, is forcing data to be copied back and forth between the CPU and GPU at every iteration, and the nested nature of the op just makes it worse. The comments in unanswered SO question here suggest that this is the case.
This article claims that:
lambda expression is the main reason of low GPU utilization.
-
So my question is: Is there a way to force map_fn to use the GPU? Or to avoid the Python lambda?
Alternatively, is there some other (perhaps more tensorflow-y) way to achieve the result described above, in order to the get graph to run on the GPU?
Edit:
After running the profiler (I had to drastically reduce the size of the arrays to get the profiler to run at all, because it was eating up RAM like crazy), the following lines caught my attention:
node name | output bytes | total execution time | accelerator execution time | cpu execution time
Mul 1.02KB (22.23%, 0.29%), 195.07ms (85.00%, 13.06%), 5.29ms (100.00%, 25.79%), 189.78ms (84.79%, 12.89%)
Sum 256B (21.41%, 0.07%), 241.48ms (69.08%, 16.17%), 6.01ms (74.21%, 29.29%), 235.47ms (69.01%, 15.99%)
TensorArrayScatterV3 512B (0.64%, 0.15%), 658.31ms (46.87%, 44.09%), 9.19ms (44.80%, 44.80%), 649.12ms (46.90%, 44.08%)
It looks like certain ops are being done mostly on the CPU, and only on one thread at that!

The tf.map_fn() construct can be used with a function that runs ops on GPU. By default, TensorFlow will try to run as much of the function as possible on the GPU, and any GPU-incompatible ops will run on the CPU. In your program, the entire elementwise_op() function is built from GPU-compatible ops, so there should be no additional copying between CPU and GPU at each iteration.
The cause of low GPU utilization is difficult to determine from a program fragment. For example, if A and B are relatively small, and you are feeding them from Python and the immediately fetching back the result, it is likely that the overhead of copying the initial data to and from the GPU would dominate. The best way to track this down is to use a GPU profiler, which you can get using tfprof or the NVIDIA Visual Profiler.

Load large computation sequence from file and evaluate matrix in Python

My problem involves large matrices (20GB+ in storage) where each matrix element consists of an algebraic expression. To bypass this size issue I wrote a script which converts the matrix into a computation sequence, and by doing so more than halves the file size. Here is an example of how this is done:
Consider the arithmetic expression (a 1x1 matrix):
Running this through the sequencing code produces:
Where the definitions for the tx parameters are:
t1 = A^2, t2 = t1*A, t3 = t1^2, t4 = t3*t2, t7 = t3*t1, t8 = B^2, t14 = C^2, t16 = t3*A, t17 = t8*B, t23 = t8^2, t33 = A+B, t34 = t33^2, t35 = t34^2
For this isolated example it may seem pointless, but when applied to a 10,000 x 10,000 matrix the number of common sequences between elements reduces the storage size substantially (like a zipping procedure).
My question is how best to process these definitions which are saved in files using Python to rebuild the matrix and evaluate the elements.
For the small (1x1) example above this can be done easily:
from __future__ import division
# Values for A,B,C
A = 1
B = 2
C = 3
# List of definitions
t1 = A**2
t2 = t1*A
t3 = t1**2
t4 = t3*t2
t7 = t3*t1
t8 = B**2
t14 = C**2
t16 = t3*A
t17 = t8*B
t23 = t8**2
t33 = A+B
t34 = t33**2
t35 = t34**2
# Print numerical result
print((1/2)*(6*B*C*t7+16*C*t16*t8+t1*t23*t8+B*t4+C*t4+t14*t7+31*t16*t17+6*t7*t8+16)/(t17*t14*C*t2*t35*t33))
Which gives the correct answer of 0.00565843621399. For matrices which have larger lists of definitions I have imported the file which works well, but when the size of the file gets larger (1GB+) the import runs into memory issues when creating the .pyc file.
I can read the file line by line but this makes the evaluation of the matrix more complicated as the tx definitions are now all strings.
I feel there are multiple ways to approach this problem but I am unsure on the most efficient implementation for when the matrices get very large, so I am here asking more experienced people for some insights on how best to proceed with the problem.

Why single addition takes longer than single additions plus single assignment?

Here is the python code, I use python 3.5.2/Intel(R) Core(TM) i7-4790K CPU # 4.00GHz :
import time
empty_loop_t = 0.14300823211669922
N = 10000000
def single_addition(n):
a = 1.0
b = 0.0
start_t = time.time()
for i in range(0, n):
a + b
end_t = time.time()
cost_t = end_t - start_t - empty_loop_t
print(n,"iterations single additions:", cost_t)
return cost_t
single_addition(N)
def single_addition_plus_single_assignment(n):
a = 1.0
b = 0.0
c = 0.0
start_t = time.time()
for i in range(0, n):
c = a + b
end_t = time.time()
cost_t = end_t - start_t - empty_loop_t
print(n,"iterations single additions and single assignments:", cost_t)
return cost_t
single_addition_plus_single_assignment(N)
The output is:
10000000 iterations single additions: 0.19701123237609863
10000000 iterations single additions and single assignments: 0.1890106201171875
Normally, to get a more reliable result, it is better to do the test using K-fold. However, since K-fold loop itself has influence on the result, I don't use it in my test. And I'm sure this inequality can be reproduced, at least on my machine. So the question is why this happened?

I run it with pypy (had to set empty_loop_t = 0) and got the following results:
(10000000, 'iterations single additions:', 0.014394044876098633)
(10000000, 'iterations single additions and single assignments:', 0.018398046493530273)
So I guess it's up to what interpreter does with the source code and how interpreter executes it. It might be that deliberate assignment takes less operations and workload than disposing of the result with non-JIT interpreter while JIT-compiler forces the code to perform the actual number of operations.
Furthermore, the use of JIT-interpreter makes your script run ~50 times faster on my configuration. If you general aim is to optimize the running time of your script you are probably to look that way.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Basic multi GPU parallelization of matrix multiplication - python

Related

Compiled Numba function not faster that CPython

Numba Python - how to exploit parallelism effectively?

Is there a way to use tensorflow map_fn on GPU?

Load large computation sequence from file and evaluate matrix in Python

Why single addition takes longer than single additions plus single assignment?

Categories

Resources