Is there a way to have an integer counter variable that can be incremented/decremented across all threads in a parallelized CUDA kernel? The code below outputs "[1]", since the modifications to the counter array from one thread are not applied in the others.
import numpy as np
from numba import cuda

@cuda.jit('void(int32[:])')
def func(counter):
    counter[0] = counter[0] + 1  # plain read-modify-write: concurrent threads race here

counter = cuda.to_device(np.zeros(1, dtype=np.int32))
threadsperblock = 64
blockspergrid = 18
func[blockspergrid, threadsperblock](counter)
print(counter.copy_to_host())
One approach would be to use numba's CUDA atomics:
$ cat t18.py
import numpy as np
from numba import cuda

@cuda.jit('void(int32[:])')
def func(counter):
    cuda.atomic.add(counter, 0, 1)  # atomic read-modify-write on counter[0]

counter = cuda.to_device(np.zeros(1, dtype=np.int32))
threadsperblock = 64
blockspergrid = 18
print(blockspergrid * threadsperblock)
func[blockspergrid, threadsperblock](counter)
print(counter.copy_to_host())
$ python t18.py
1152
[1152]
$
An atomic operation performs an indivisible read-modify-write operation on the target, so threads do not interfere with each other when they update the target variable.
Other methods are certainly possible, depending on your actual needs, such as a classical parallel reduction; numba also provides some reduction sugar, sketched below.
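For illustration, a minimal sketch of that reduction sugar (this example is an addition, not part of the original answer; the array contents are arbitrary):
import numpy as np
from numba import cuda

@cuda.reduce
def sum_reduce(a, b):
    return a + b

vals = np.ones(1152, dtype=np.int32)
print(sum_reduce(vals))  # the reduction runs on the GPU and returns 1152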
In the CuPy documentation, it is stated that
"CuPy caches the kernel code sent to GPU device within the process, which reduces the kernel compilation time on further calls."
This means that when one calls a function from CuPy, subsequent calls to that function will be extremely fast. An example is as follows:
import cupy as cp
from timeit import default_timer as timer
import time

mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()

def multiply():
    rand = cp.random.default_rng()  # This is the fast way of creating large arrays with cp
    arr = rand.integers(0, 100_000, (10000, 1000))  # Create array
    y = cp.multiply(arr, 42)  # Multiply by 42, randomly chosen number
    return y

if __name__ == '__main__':
    times = []
    start = timer()
    for i in range(21):
        mempool.free_all_blocks()
        pinned_mempool.free_all_blocks()
        start = timer()
        multiply()
        times.append(timer() - start)
    print(times)
This will return the times:
[0.17462146899993058, 0.0006819850000283623, 0.0006159440001738403, 0.0006145069999092811, 0.000610309999956371, 0.0006169410000893549, 0.0006062159998236893, 0.0006096620002153941, 0.0006096250001519365, 0.0006106630000886071, 0.0006063629998607212, 0.0006168999998408253, 0.0006058349999875645, 0.0006090080000831222, 0.0005964219999441411, 0.0006113049998930364, 0.0005968339999071759, 0.0005951619998540991, 0.0005980400001135422, 0.0005941219999385794, 0.0006568090000200755]
Only the first call includes the time it takes to compile the kernel.
Is there a way to flush everything in order to force the compilation for each subsequent call to multiply()?
Currently, there is no way to disable kernel caching in CuPy. The only available option is to disable the persistent on-disk kernel cache (CUPY_CACHE_IN_MEMORY=1), but kernels are still cached in memory, so compilation runs only once within the process.
https://docs.cupy.dev/en/stable/user_guide/performance.html#one-time-overheads
https://docs.cupy.dev/en/latest/reference/environment.html
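For reference, a minimal sketch of that one available knob (it does not force recompilation on every call; it only keeps the cache off disk):
import os

# Must be set before cupy is imported; kernels are still cached in memory
# for the lifetime of the process.
os.environ['CUPY_CACHE_IN_MEMORY'] = '1'

import cupy as cp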
I've developed several codes that are heavy on matrix operations; they run quite well, but since I spent some extra money on a GPU, I'd like to take advantage of it... ;-) I've tried several configurations from the numba manual, but clearly something is missing. Why do I receive an error when I try to execute some functions on CUDA (NVIDIA GTX 1050)?
Whether with numba or some other module, how can I execute a portion of code (which requires heavy parallelization) on the GPU?
import numpy as np
from numba import types
from numba.extending import intrinsic
from numba import jit, cuda
from numba import vectorize

@jit
def calculate_portfolio_return(returns, weights):
    portfolio_return = np.sum(returns.mean() * weights) * 252
    print("Expected Portfolio Return:", portfolio_return)

@jit
def calculate_portfolio_risk(returns, weights):
    portfolio_variance = np.sqrt(np.dot(weights.T, np.dot(returns.cov() * 252, weights)))
    print("Expected Risk:", portfolio_variance)

@jit
def generate_portfolios(weights, returns):
    preturns = []
    pvariances = []
    for i in range(10000):
        # weights = np.random.random(len(stocks_portfolio))
        weights = np.random.random(data.shape[1])  # data is defined elsewhere in the full script
        weights /= np.sum(weights)
        preturns.append(np.sum(returns.mean() * weights) * 252)
        pvariances.append(np.sqrt(np.dot(weights.T, np.dot(returns.cov() * 252, weights))))
    preturns = np.array(preturns)
    pvariances = np.array(pvariances)
    return preturns, pvariances
......
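As a point of reference (this sketch is an addition, not an answer from the thread): numba's CUDA backend compiles explicit kernels over plain NumPy or device arrays, so pandas-style calls such as returns.cov() or returns.mean() generally cannot be compiled inside a kernel and have to stay on the host or be rewritten. A minimal, hypothetical kernel over plain arrays might look like this:
import numpy as np
from numba import cuda

@cuda.jit
def weighted_returns(returns, weights, out):
    # one thread per row: out[i] = sum_j returns[i, j] * weights[j]
    i = cuda.grid(1)
    if i < returns.shape[0]:
        s = 0.0
        for j in range(returns.shape[1]):
            s += returns[i, j] * weights[j]
        out[i] = s

returns = np.random.random((10000, 5))   # hypothetical shapes
weights = np.random.random(5)
out = np.zeros(10000)
threadsperblock = 256
blockspergrid = (returns.shape[0] + threadsperblock - 1) // threadsperblock
weighted_returns[blockspergrid, threadsperblock](returns, weights, out)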
I am profiling performance of a piece of Python code, using a line profiler.
In the code, I have a numpy array tt of shape (106906,) and dtype=int64. With the help of the profiler, I find that the second line below, mask[tt] = True, is quite slow. Is there any way to accelerate it? I am on Python 3, if that matters.
mask = np.zeros(100000, dtype='bool')
mask[tt] = True
You can use Numba as @orlevii has suggested:
import numpy as np
from numba import njit

@njit
def f(mask, tt):
    mask[tt] = True

# Test:
mask = np.zeros(1000000, dtype='bool')
tt = np.random.randint(0, 1000000, 106906)
f(mask, tt)
A simple %%timeit check suggests that you should expect roughly 3 times faster execution.
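A rough sketch of such a check (added here for illustration; it reuses f from the block above, and the warm-up call keeps numba's compilation time out of the measurement):
import timeit
import numpy as np

mask = np.zeros(1000000, dtype='bool')
tt = np.random.randint(0, 1000000, 106906)

f(mask, tt)  # warm-up so compilation is not timed

def plain(mask, tt):
    mask[tt] = True

print("njit :", timeit.timeit(lambda: f(mask, tt), number=100))
print("plain:", timeit.timeit(lambda: plain(mask, tt), number=100))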
Further speed-up can be achieved by utilizing the GPU. An example of how to do it with PyTorch:
import torch
mask = torch.zeros(1000000).type(torch.cuda.FloatTensor)
tt = torch.randint(0,1000000,torch.Size([106906])).type(torch.cuda.LongTensor)
mask[tt] = True
Note that here we use a torch.Tensor object, which is PyTorch's equivalent of numpy.ndarray. The code will run only if you have an NVIDIA GPU with CUDA. Expect roughly a 30x speed-up over your original code on a Tesla V100-SXM2.
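A variant sketch (an addition here) that uses a boolean tensor instead of a float one, since assigning True into a float tensor just stores 1.0; this likewise requires a CUDA-capable NVIDIA GPU:
import torch

mask = torch.zeros(1000000, dtype=torch.bool, device='cuda')
tt = torch.randint(0, 1000000, (106906,), device='cuda')
mask[tt] = True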
I want to do some large matrix multiplications using multiprocessing.Pool.
When the dimension is higher than 50, it suddenly takes an extremely long time to compute.
Is there any easy way to be faster?
Here, I don't want to use shared memory like RawArray, because my original code randomly generates the matrices each time.
The sample code is as follows.
import numpy as np
from time import time
from multiprocessing import Pool
from functools import partial

def f(d):
    a = int(10 * d)
    N = int(10000 / d)
    for _ in range(N):
        X = np.random.randn(a, 10) @ np.random.randn(10, 10)  # (a,10) x (10,10) product; total work is the same for every d
    return X
# Dimensions
ds = [1,2,3,4,5,6,8,10,20,35,40,45,50,60,62,64,66,68,70,80,90,100]

# Serial processing
serial = []
for d in ds:
    t1 = time()
    for i in range(20):
        f(d)
    serial.append(time() - t1)

# Parallel processing
parallel = []
for d in ds:
    t1 = time()
    pool = Pool()
    for i in range(20):
        pool.apply_async(partial(f, d), args=())
    pool.close()
    pool.join()
    parallel.append(time() - t1)

# Plot
import matplotlib.pyplot as plt
plt.title('Matrix multiplication time with 10000/d repetitions')
plt.plot(ds, serial, label='serial')
plt.plot(ds, parallel, label='parallel')
plt.xlabel('d (dimension)')
plt.ylabel('Total time (sec)')
plt.legend()
plt.show()
Since the total computation cost of f(d) is the same for all d, the parallel processing times should be roughly equal, but the actual output shows that they are not.
System info:
Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz
NOTE: I want to use parallel computation for a complicated internal simulation (like #), not for sending data to the child processes.
This is for self-reference.
Here, I found a solution.
My numpy uses MKL as its backend; the problem may be that MKL multithreading collides with multiprocessing.
If I run the code:
import os
os.environ['MKL_NUM_THREADS'] = '1'
before importing numpy, the problem is solved.
I just found an explanation here: https://github.com/numpy/numpy/issues/10145.
Looks like the CPU caching gets messed up when you have conflicting MKL matrix multiplications going at the same time.
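An alternative sketch (an addition here, assuming the threadpoolctl package is available): it limits BLAS/MKL threads at runtime, without having to set the environment variable before numpy is imported:
import numpy as np
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1, user_api='blas'):
    # MKL/OpenBLAS run single-threaded inside this block
    X = np.random.randn(500, 10) @ np.random.randn(10, 10)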
I am attempting to use the most time-efficient cumsum possible on a 3D array in Python. I have tried numpy's cumsum, but found that simply using a manual parallelized method with numba is faster:
import numpy as np
from numba import njit, prange
from timeit import default_timer as timer

@njit(parallel=True)
def cpu_cumsum(data, output):
    # cumulative sum along the last axis
    for i in prange(200):
        for j in prange(2000000):
            output[i, j, 0] = data[i, j, 0]
    for i in prange(200):
        for j in prange(2000000):
            for k in range(1, 5):
                output[i, j, k] = data[i, j, k] + output[i, j, k - 1]
    return output

data = np.float32(np.arange(2000000000).reshape(200, 2000000, 5))
output = np.empty_like(data)

func_start = timer()
output = cpu_cumsum(data, output)
timing = timer() - func_start
print("Function: manualCumSum duration (seconds):" + str(timing))
My method:
Function: manualCumSum duration (seconds):2.8496341188924994
np.cumsum:
Function: cumSum duration (seconds):6.182090314569933
While trying this with guvectorize, I found that it used too much memory for my GPU so I have since abandoned that avenue. Is there a better way to do this, or have I hit the end of the road?
PS: Speed is needed because this is executed in a loop many times.
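For reference (an addition here): the manual loops above compute a cumulative sum along the last axis, so the NumPy call being compared against is presumably the following, shown on a smaller array to keep the example light:
import numpy as np

data = np.float32(np.arange(200 * 1000 * 5).reshape(200, 1000, 5))
output = np.cumsum(data, axis=2, dtype=np.float32)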