Clear all cached kernels from CuPy to force kernel compilation - python

The CuPy documentation states that
"CuPy caches the kernel code sent to GPU device within the process, which reduces the kernel compilation time on further calls."
This means that after the first call to a CuPy function, subsequent calls to that function are extremely fast. An example:
import cupy as cp
from timeit import default_timer as timer
import time

mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()

def multiply():
    rand = cp.random.default_rng()  # This is the fast way of creating large arrays with cp
    arr = rand.integers(0, 100_000, (10000, 1000))  # Create array
    y = cp.multiply(arr, 42)  # Multiply by 42, randomly chosen number
    return y

if __name__ == '__main__':
    times = []
    start = timer()
    for i in range(21):
        mempool.free_all_blocks()
        pinned_mempool.free_all_blocks()
        start = timer()
        multiply()
        times.append(timer() - start)
    print(times)
This will return the times:
[0.17462146899993058, 0.0006819850000283623, 0.0006159440001738403, 0.0006145069999092811, 0.000610309999956371, 0.0006169410000893549, 0.0006062159998236893, 0.0006096620002153941, 0.0006096250001519365, 0.0006106630000886071, 0.0006063629998607212, 0.0006168999998408253, 0.0006058349999875645, 0.0006090080000831222, 0.0005964219999441411, 0.0006113049998930364, 0.0005968339999071759, 0.0005951619998540991, 0.0005980400001135422, 0.0005941219999385794, 0.0006568090000200755]
Only the first call includes the time it takes to compile the kernel.
Is there a way to flush everything in order to force the compilation for each subsequent call to multiply()?

Currently, there is no way to disable kernel caching in CuPy. The only available option is to disable persisting the kernel cache on disk (CUPY_CACHE_IN_MEMORY=1), but kernels are still cached in memory, so compilation runs only once within the process.
https://docs.cupy.dev/en/stable/user_guide/performance.html#one-time-overheads
https://docs.cupy.dev/en/latest/reference/environment.html
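If the goal is just to force recompilation in a fresh process, the on-disk cache can be bypassed with that variable; a minimal sketch, assuming a working CuPy installation (the variable must be set before cupy is imported):

import os

# Keep compiled kernels in memory only and skip the on-disk cache,
# so every new process has to recompile from scratch.
# Must be set before cupy is imported.
os.environ['CUPY_CACHE_IN_MEMORY'] = '1'

import cupy as cp

x = cp.arange(10)
y = x * 42  # the first such call in this process still pays the compilation cost

Within one process, though, the in-memory cache still applies, so the only way to recompile on every call is to start a new process for each call.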

Related

Only GPU to CPU transfer with cupy is incredibly slow

If I have an array on the GPU, copying back an array of shape (20, 256, 256) is really slow (on the order of hundreds of seconds).
My code is the following:
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
xt_gpu = cp.asarray(xt)
# Also very fast...
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
# Very very very very very slow....
result_cpu = cp.asnumpy(result_gpu)
I measured the times using cp.cuda.Event() with record and synchronize to avoid measuring any random times, but the result is still the same: the GPU->CPU transfer is incredibly slow. However, with PyTorch or TensorFlow this is not the case (from experience with similar data sizes/shapes)... What am I doing wrong?
I think you might be timing it wrong. I modified the code to synchronize between every GPU operation and it seems like the convolution takes the majority of the time with both transfer operations being very fast.
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
import time
# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
t0 = time.time()
xt_gpu = cp.asarray(xt)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
# Also very fast...
t0 = time.time()
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
# Very very very very very slow....
t0 = time.time()
result_cpu = cp.asnumpy(result_gpu)
cp.cuda.stream.get_current_stream().synchronize()
print(time.time() - t0)
Output:
0.1380000114440918
4.032999753952026
0.0010001659393310547
To me it seems like you were not actually synchronizing between calls when you tested it. Without the synchronize calls, all operations up to the transfer back to a numpy array are simply queued and appear to finish instantly. The measured GPU->CPU transfer time then actually includes the time for the convolution as well as the transfer.
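For reference, the cp.cuda.Event() timing the question mentions would look roughly like this; a minimal sketch with placeholder data, not the original arrays:

import cupy as cp
import numpy as np

xt_gpu = cp.asarray(np.random.rand(20, 256, 256).astype(np.float32))

start_ev = cp.cuda.Event()
end_ev = cp.cuda.Event()

start_ev.record()                  # fires only after all previously queued work is done
result_cpu = cp.asnumpy(xt_gpu)    # device -> host copy
end_ev.record()
end_ev.synchronize()               # wait until the copy has actually finished

print(cp.cuda.get_elapsed_time(start_ev, end_ev), 'ms')  # elapsed GPU time in milliseconds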
I also met the same problem. I found that accessing float64 data is much faster than float32, so maybe you can try .astype(np.float64).

Are Scipy LAPACK functions parallel?

I am currently using the scipy.linalg.lapack.zheevd() function. It runs on all cores, and it produces hangs and memory overflows if I try mapping the function over an array of arguments using ProcessPoolExecutor() or ThreadPoolExecutor() from concurrent.futures.
It utilizes as many cores as my test system has, but I was under the impression that things were not typically parallelized in Python due to the GIL. Is this a result of the underlying Fortran code running with OpenMP?
Is it safe to assume this is parallelized, and cannot be parallelized further? This is not a large bottleneck for my code (finding the eigensystems of 400 unique 1000x1000 matrices; although there may be need for this to be scaled up, e.g. 1000 2000x2000 matrices eventually), but I am in the optimization phase for it.
Here is a hopefully helpful code snippet for conceptualization; it does not represent the actual matrices:
import numpy as np
from scipy import linalg as la
import concurrent.futures

# In real code various parameters are used to build the matrix function;
# it is presumably not sparse

# Matrix with independent variable x
def matrix_function(x):
    # Define dimensions and pre-allocate space for matrix
    #dim = 100  # For quicker evaluation/testing
    dim = 1000  # For conveying the scale of the problem
    matrix_dimensions = [dim, dim]
    # The matrix is complex
    mat = np.zeros(matrix_dimensions, dtype=complex)
    for i in range(dim):
        for j in range(i, dim):
            mat[i, j] = x*np.random.rand(1) + np.random.rand(1)*1J
            # Making the matrix Hermitian
            mat[j, i] = np.conjugate(mat[i, j])
    return mat

# 400 arguments for the defined matrix function
args = np.arange(0, 10, 0.025)

# Parallelizing evaluation of 400 matrices
with concurrent.futures.ProcessPoolExecutor() as pool:
    evaluated_matrix_functions = pool.map(matrix_function, args)
    ''' This will hang,
    which is what tipped me off to the issue
    **not important to question
    eigsystem = pool.map(la.lapack.zheevd,
                         evaluated_matrix_functions
                         )
    '''
    pool.shutdown()

''' This will cause a memory overflow,
depending on the size of the matrices
and how many of them; even with 32GB memory

with concurrent.futures.ThreadPoolExecutor() as pool:
    eigsystem = pool.map(la.lapack.zheevd,
                         evaluated_matrix_functions
                         )
    pool.shutdown()
'''

# The code which I run, in serial,
# but still uses all cores/threads my 2700x provides at full load
eigensystem_list = []
for matrix in evaluated_matrix_functions:
    eigensystem_list.append(la.lapack.zheevd(matrix))

# The eigensystem_list is then used in later calculations
This is all controlled by the LAPACK library you are using under the hood.
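If that library-level threading gets in the way (for example when combining zheevd with your own process pool), the BLAS/LAPACK thread pools can usually be capped from Python. A minimal sketch using the third-party threadpoolctl package (an assumption about your environment, not something the answer above prescribes):

import numpy as np
from scipy import linalg as la
from threadpoolctl import threadpool_limits   # pip install threadpoolctl

matrix = np.eye(1000, dtype=complex)           # placeholder Hermitian matrix

with threadpool_limits(limits=1):              # cap OpenBLAS/MKL/OpenMP thread pools
    w, v, info = la.lapack.zheevd(matrix)      # the eigendecomposition now runs single-threaded

Setting OMP_NUM_THREADS or MKL_NUM_THREADS before the process starts has a similar effect.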

Numpy matrix multiplications with multiprocessing suddenly slow down as dimension increase

I want to do some large matrix multiplications using multiprocessing.Pool.
Suddenly, when the dimension is higher than 50, the computation takes an extremely long time.
Is there any easy way to make this faster?
Here, I don't want to use shared memory like RawArray, because my original code randomly generates the matrices each time.
The sample code is as follows.
import numpy as np
from time import time
from multiprocessing import Pool
from functools import partial

def f(d):
    a = int(10*d)
    N = int(10000/d)
    for _ in range(N):
        X = np.random.randn(a, 10) @ np.random.randn(10, 10)  # (a x 10) @ (10 x 10) matrix product
    return X

# Dimensions
ds = [1, 2, 3, 4, 5, 6, 8, 10, 20, 35, 40, 45, 50, 60, 62, 64, 66, 68, 70, 80, 90, 100]

# Serial processing
serial = []
for d in ds:
    t1 = time()
    for i in range(20):
        f(d)
    serial.append(time() - t1)

# Parallel processing
parallel = []
for d in ds:
    t1 = time()
    pool = Pool()
    for i in range(20):
        pool.apply_async(partial(f, d), args=())
    pool.close()
    pool.join()
    parallel.append(time() - t1)

# Plot
import matplotlib.pyplot as plt
plt.title('Matrix multiplication time with 10000/d repetitions')
plt.plot(ds, serial, label='serial')
plt.plot(ds, parallel, label='parallel')
plt.xlabel('d (dimension)')
plt.ylabel('Total time (sec)')
plt.legend()
plt.show()
Since the total computation cost of f(d) is the same for all d, the parallel processing times should be equal.
But the actual output is not.
System info:
Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz
NOTE: I want to use parallel computation for a complicated internal simulation (like #), not for sending data to the child processes.
This is for self-reference; here I found a solution.
My NumPy uses MKL as its backend, and the problem may be that MKL multithreading collides with multiprocessing.
If I run the following code before importing numpy, the problem is solved:
import os
os.environ['MKL_NUM_THREADS'] = '1'
I just found an explanation here: https://github.com/numpy/numpy/issues/10145.
Looks like the CPU caching gets messed up when you have conflicting MKL matrix multiplications going at the same time.
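For completeness, here is roughly how that fix slots into the script from the question; the key point is that the environment variable is set before numpy is imported, so each worker inherits a single-threaded MKL:

import os
os.environ['MKL_NUM_THREADS'] = '1'   # must happen before numpy is imported

import numpy as np
from multiprocessing import Pool
from functools import partial

def f(d):
    a = int(10*d)
    N = int(10000/d)
    for _ in range(N):
        X = np.random.randn(a, 10) @ np.random.randn(10, 10)
    return X

if __name__ == '__main__':
    pool = Pool()
    for i in range(20):
        pool.apply_async(partial(f, 100), args=())
    pool.close()
    pool.join()                        # workers no longer fight over MKL threads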

Multiple scipy.integrate.ode instances

I would like to use scipy.integrate.ode (or scipy.integrate.odeint) instances in multiple threads (one for each CPU core) in order to solve multiple IVPs at a time. However the documentation says: "This integrator is not re-entrant. You cannot have two ode instances using the “vode” integrator at the same time."
(odeint also causes internal errors if instantiated multiple times, although the documentation does not say so.)
Any idea what can be done?
One option is to use multiprocessing (i.e. use processes instead of threads). Here's an example that uses the map function of the multiprocessing.Pool class.
The function solve takes a set of initial conditions and returns a solution generated by odeint. The "serial" version of the code in the main section calls solve repeatedly, once for each set of initial conditions in ics. The "multiprocessing" version uses the map function of a multiprocessing.Pool instance to run several processes simultaneously, each calling solve. The map function takes care of doling out the arguments to solve.
My computer has four cores, and as I increase num_processes, the speedup maxes out at about 3.6.
from __future__ import division, print_function

import sys
import time
import multiprocessing as mp
import numpy as np
from scipy.integrate import odeint

def lorenz(q, t, sigma, rho, beta):
    x, y, z = q
    return [sigma*(y - x), x*(rho - z) - y, x*y - beta*z]

def solve(ic):
    t = np.linspace(0, 200, 801)
    sigma = 10.0
    rho = 28.0
    beta = 8/3
    sol = odeint(lorenz, ic, t, args=(sigma, rho, beta), rtol=1e-10, atol=1e-12)
    return sol

if __name__ == "__main__":
    ics = np.random.randn(100, 3)

    print("multiprocessing:", end='')
    tstart = time.time()
    num_processes = 5
    p = mp.Pool(num_processes)
    mp_solutions = p.map(solve, ics)
    tend = time.time()
    tmp = tend - tstart
    print(" %8.3f seconds" % tmp)

    print("serial: ", end='')
    sys.stdout.flush()
    tstart = time.time()
    serial_solutions = [solve(ic) for ic in ics]
    tend = time.time()
    tserial = tend - tstart
    print(" %8.3f seconds" % tserial)

    print("num_processes = %i, speedup = %.2f" % (num_processes, tserial/tmp))

    check = [(sol1 == sol2).all()
             for sol1, sol2 in zip(serial_solutions, mp_solutions)]
    if not all(check):
        print("There was at least one discrepancy in the solutions.")
On my computer, the output is:
multiprocessing: 6.904 seconds
serial: 24.756 seconds
num_processes = 5, speedup = 3.59
scipy.integrate.ode appears to use the LLNL SUNDIALS solvers, although SciPy doesn't say so explicitly (it should, in my opinion).
The current version of the CVODE ode solver, 3.2.2, is re-entrant, which means that it can be used to solve multiple problems concurrently. The relevant information appears in User Documentation for CVODE v3.2.0 (SUNDIALS v3.2.0).
All state information used by cvode to solve a given problem is saved in a structure, and a pointer to that structure is returned to the user. There is no global data in the cvode package, and so, in this respect, it is reentrant. State information specific to the linear solver is saved in a separate structure, a pointer to which resides in the cvode memory structure. The reentrancy of cvode was motivated by the anticipated multicomputer extension, but is also essential in a uniprocessor setting where two or more problems are solved by intermixed calls to the package from within a single user program.
But I don't know whether SciPy.integrate.ode, or other ode solvers like scikits.odes.ode, support this concurrency.

Theano GPU calculation slower than numpy

I'm learning to use theano. I want to populate a term-document matrix (a numpy sparse matrix) by calculating binary TF-IDF for each element inside it:
import theano
import theano.tensor as T
import numpy as np
from time import perf_counter

def tfidf_gpu(appearance_in_documents, num_documents, document_words):
    start = perf_counter()
    APP = T.scalar('APP', dtype='int32')
    N = T.scalar('N', dtype='int32')
    SF = T.scalar('S', dtype='int32')
    F = (T.log(N) - T.log(APP)) / SF
    TFIDF = theano.function([N, APP, SF], F)
    ret = TFIDF(num_documents, appearance_in_documents, document_words)
    end = perf_counter()
    print("\nTFIDF_GPU ", end - start, " secs.")
    return ret

def tfidf_cpu(appearance_in_documents, num_documents, document_words):
    start = perf_counter()
    tfidf = (np.log(num_documents) - np.log(appearance_in_documents)) / document_words
    end = perf_counter()
    print("TFIDF_CPU ", end - start, " secs.\n")
    return tfidf
But the numpy version is much faster than the theano implementation:
Progress 1/43
TFIDF_GPU 0.05702276699594222 secs.
TFIDF_CPU 1.454801531508565e-05 secs.
Progress 2/43
TFIDF_GPU 0.023830442980397493 secs.
TFIDF_CPU 1.1073017958551645e-05 secs.
Progress 3/43
TFIDF_GPU 0.021920352999586612 secs.
TFIDF_CPU 1.0738993296399713e-05 secs.
Progress 4/43
TFIDF_GPU 0.02303648801171221 secs.
TFIDF_CPU 1.1675001587718725e-05 secs.
Progress 5/43
TFIDF_GPU 0.02359767400776036 secs.
TFIDF_CPU 1.4385004760697484e-05 secs.
....
I've read that this can be due to overhead, which for small operations can kill the performance.
Is my code bad or should I avoid using GPU because of the overhead?
The thing is that you are compiling your Theano function every time. The compilation takes time. Try passing the compiled function like this:
def tfidf_gpu(appearance_in_documents, num_documents, document_words, TFIDF):
    start = perf_counter()
    ret = TFIDF(num_documents, appearance_in_documents, document_words)
    end = perf_counter()
    print("\nTFIDF_GPU ", end - start, " secs.")
    return ret

APP = T.scalar('APP', dtype='int32')
N = T.scalar('N', dtype='int32')
SF = T.scalar('S', dtype='int32')
F = (T.log(N) - T.log(APP)) / SF
TFIDF = theano.function([N, APP, SF], F)

tfidf_gpu(appearance_in_documents, num_documents, document_words, TFIDF)
Also, your TF-IDF task is bandwidth-intensive; Theano, and GPUs in general, are best for computation-intensive tasks.
The current task incurs considerable overhead moving the data to the GPU and back, because in the end each element only needs to be read O(1) times. But if you want to do more computation per element, it makes sense to use the GPU.
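To make that compilation overhead visible, compilation and execution can be timed separately; a rough sketch under the same assumptions as the snippets above (Theano installed, placeholder scalar inputs):

from time import perf_counter
import theano
import theano.tensor as T

# Compile once -- this is where the tens of milliseconds go
t0 = perf_counter()
APP = T.scalar('APP', dtype='int32')
N = T.scalar('N', dtype='int32')
SF = T.scalar('S', dtype='int32')
TFIDF = theano.function([N, APP, SF], (T.log(N) - T.log(APP)) / SF)
print('compile   :', perf_counter() - t0, 'secs')

# Call the compiled function many times -- each call is cheap
t0 = perf_counter()
for _ in range(1000):
    TFIDF(1000, 5, 20)
print('1000 calls:', perf_counter() - t0, 'secs')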
