Theano GPU calculation slower than numpy - python

I'm learning to use theano. I want to populate a term-document matrix (a numpy sparse matrix) by calculating binary TF-IDF for each element inside it:
import theano
import theano.tensor as T
import numpy as np
from time import perf_counter
def tfidf_gpu(appearance_in_documents,num_documents,document_words):
    start = perf_counter()
    APP = T.scalar('APP',dtype='int32')
    N = T.scalar('N',dtype='int32')
    SF = T.scalar('S',dtype='int32')
    F = (T.log(N)-T.log(APP)) / SF
    TFIDF = theano.function([N,APP,SF],F)
    ret = TFIDF(num_documents,appearance_in_documents,document_words)
    end = perf_counter()
    print("\nTFIDF_GPU ",end-start," secs.")
    return ret
def tfidf_cpu(appearance_in_documents,num_documents,document_words):
    start = perf_counter()
    tfidf = (np.log(num_documents)-np.log(appearance_in_documents))/document_words
    end = perf_counter()
    print("TFIDF_CPU ",end-start," secs.\n")
    return tfidf
But the numpy version is much faster than the theano implementation:
Progress 1/43
TFIDF_GPU 0.05702276699594222 secs.
TFIDF_CPU 1.454801531508565e-05 secs.
Progress 2/43
TFIDF_GPU 0.023830442980397493 secs.
TFIDF_CPU 1.1073017958551645e-05 secs.
Progress 3/43
TFIDF_GPU 0.021920352999586612 secs.
TFIDF_CPU 1.0738993296399713e-05 secs.
Progress 4/43
TFIDF_GPU 0.02303648801171221 secs.
TFIDF_CPU 1.1675001587718725e-05 secs.
Progress 5/43
TFIDF_GPU 0.02359767400776036 secs.
TFIDF_CPU 1.4385004760697484e-05 secs.
I've read that this can be due to overhead, which for small operations might kill performance.
Is my code bad, or should I avoid using the GPU because of the overhead?

The thing is that you are compiling your Theano function every time. The compilation takes time. Try passing the compiled function like this:
def tfidf_gpu(appearance_in_documents,num_documents,document_words,TFIDF):
    start = perf_counter()
    ret = TFIDF(num_documents,appearance_in_documents,document_words)
    end = perf_counter()
    print("\nTFIDF_GPU ",end-start," secs.")
    return ret
# Compile the Theano function once, outside the timed function
APP = T.scalar('APP',dtype='int32')
N = T.scalar('N',dtype='int32')
SF = T.scalar('S',dtype='int32')
F = (T.log(N)-T.log(APP)) / SF
TFIDF = theano.function([N,APP,SF],F)
Also, your TF-IDF task is bandwidth-intensive. Theano, and GPUs in general, are best suited to computation-intensive tasks.
The current task will incur considerable overhead moving the data to the GPU and back, because in the end you only need to read each element O(1) times. If you want to do more computation per element, though, it makes sense to use the GPU.
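One way to amortize the per-call overhead described above is to evaluate the formula over whole arrays in a single call rather than once per matrix element. The sketch below does this in NumPy, not Theano, using the same expression as the question. The function name binary_tfidf and the sample values are made up for illustration.

```python
import numpy as np

def binary_tfidf(appearance_in_documents, num_documents, document_words):
    """Evaluate (log(N) - log(APP)) / SF elementwise over whole arrays."""
    app = np.asarray(appearance_in_documents, dtype=np.float64)
    sf = np.asarray(document_words, dtype=np.float64)
    return (np.log(num_documents) - np.log(app)) / sf

# Hypothetical sample data: 4 terms in a corpus of 43 documents.
appearances = np.array([1, 5, 10, 43])
words_per_doc = np.array([120, 80, 200, 150])
scores = binary_tfidf(appearances, 43, words_per_doc)
print(scores)
```

A single call like this pays the function-call (and, on a GPU, the transfer) overhead once for the whole batch. Whether a GPU then beats the CPU still depends on how much arithmetic is done per element.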


Clear all cached kernels from CuPy to force kernel compilation

In the CuPy documentation, it is stated that
"CuPy caches the kernel code sent to GPU device within the process, which reduces the kernel compilation time on further calls."
This means that after the first call to a CuPy function, subsequent calls to that function will be extremely fast. An example is as follows:
import cupy as cp
from timeit import default_timer as timer
import time
mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()
def multiply():
    rand = cp.random.default_rng() #This is the fast way of creating large arrays with cp
    arr = rand.integers(0, 100_000, (10000, 1000)) #Create array
    y = cp.multiply(arr, 42) ## Multiply by 42, randomly chosen number
    return y
if __name__ == '__main__':
    times = []
    for i in range(21):
        start = timer()
        multiply()
        times.append(timer() - start) # elapsed time of this call
    print(times)
This will return the times:
[0.17462146899993058, 0.0006819850000283623, 0.0006159440001738403, 0.0006145069999092811, 0.000610309999956371, 0.0006169410000893549, 0.0006062159998236893, 0.0006096620002153941, 0.0006096250001519365, 0.0006106630000886071, 0.0006063629998607212, 0.0006168999998408253, 0.0006058349999875645, 0.0006090080000831222, 0.0005964219999441411, 0.0006113049998930364, 0.0005968339999071759, 0.0005951619998540991, 0.0005980400001135422, 0.0005941219999385794, 0.0006568090000200755]
Only the first call includes the time it takes to compile the kernel.
Is there a way to flush everything in order to force the compilation for each subsequent call to multiply()?
Currently, there is no way to disable kernel caching in CuPy. The only available option is to disable persisting the kernel cache to disk (CUPY_CACHE_IN_MEMORY=1), but kernels are still cached in memory, so compilation runs only once per process.
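Since the in-memory cache lives only as long as the process, one possible workaround is to run each measurement in a fresh interpreter with CUPY_CACHE_IN_MEMORY=1 set. This is a sketch under that assumption, not a CuPy feature. The helper run_in_fresh_process is hypothetical, and the snippet below only echoes the environment variable; in practice it would import cupy and time one call to multiply().

```python
import os
import subprocess
import sys

def run_in_fresh_process(snippet, extra_env=None):
    """Run a Python snippet in a new interpreter and return its stdout."""
    env = dict(os.environ)
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout

# Placeholder snippet (an assumption): replace with code that imports cupy
# and prints the time of a single multiply() call.
snippet = "import os; print(os.environ['CUPY_CACHE_IN_MEMORY'])"
outputs = [
    run_in_fresh_process(snippet, {"CUPY_CACHE_IN_MEMORY": "1"})
    for _ in range(3)
]
print(outputs)
```

Each run starts with an empty in-process cache and, with the on-disk cache disabled, should have nothing to reuse. The price is the interpreter and CUDA start-up time on every run, so time the call inside the child process rather than around the subprocess call.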

Only GPU to CPU transfer with cupy is incredibly slow

If I have an array on the GPU, it is really slow (order of hundreds of seconds) to copy back an array of shape (20, 256, 256).
My code is the following:
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
xt_gpu = cp.asarray(xt)
# Also very fast...
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
# Very very very very very slow....
result_cpu = cp.asnumpy(result_gpu)
I measured the times using cp.cuda.Event() with record and synchronize to avoid measuring any random times, but the result is still the same: the GPU->CPU transfer is incredibly slow. However, with PyTorch or TensorFlow this is not the case (from experience with similar data sizes/shapes)... What am I doing wrong?
I think you might be timing it wrong. I modified the code to synchronize between every GPU operation and it seems like the convolution takes the majority of the time with both transfer operations being very fast.
import cupy as cp
from cupyx.scipy.ndimage import convolve
import numpy as np
import time
# Fast...
xt = np.random.randint(0, 255, (20, 256, 256)).astype(np.float32)
t0 = time.time()
xt_gpu = cp.asarray(xt)
cp.cuda.Stream.null.synchronize()  # wait for the host-to-device copy
print(time.time() - t0)
# Also very fast...
t0 = time.time()
result_gpu = convolve(xt_gpu, xt_gpu, mode='constant')
cp.cuda.Stream.null.synchronize()  # wait for the convolution kernel to finish
print(time.time() - t0)
# Very very very very very slow....
t0 = time.time()
result_cpu = cp.asnumpy(result_gpu)
cp.cuda.Stream.null.synchronize()  # wait for the device-to-host copy
print(time.time() - t0)
To me it seems like you are not actually synchronizing between calls when you tested it. Until the transfer back to a numpy array all operations are simply queued up and seem to finish instantly without the synchronize calls. This would lead to the measured GPU->CPU transfer time actually being the time for the convolution and the transfer.
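The effect described above can be reproduced without a GPU. The following CPU-only analogy (not CuPy code) uses a one-worker thread pool as a stand-in for the device queue: submitting work returns almost immediately, and the cost only shows up at the first call that has to wait for the result. The names slow_kernel and device are made up for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_kernel():
    """Stand-in for a GPU kernel (here just a sleep)."""
    time.sleep(0.2)
    return 42

with ThreadPoolExecutor(max_workers=1) as device:
    t0 = time.perf_counter()
    pending = device.submit(slow_kernel)  # returns at once, like a kernel launch
    launch_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    value = pending.result()              # blocks, like cp.asnumpy or a synchronize
    wait_time = time.perf_counter() - t0

print(f"launch: {launch_time:.4f}s, wait: {wait_time:.4f}s")
```

Timing only the launch makes the work look free, and timing only the wait charges the whole computation to whatever operation happens to block first.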
I ran into the same problem. I found that accessing float64 data is way faster than float32, so maybe you can try .astype(np.float64).

Numpy matrix multiplications with multiprocessing suddenly slow down as dimension increases

I want to do some large matrix multiplications using multiprocessing.Pool.
Suddenly, when the dimension is higher than 50, the computation takes an extremely long time.
Is there an easy way to make it faster?
I don't want to use shared memory such as RawArray here, because my original code randomly generates the matrix each time.
The sample code is as follows.
import numpy as np
from time import time
from multiprocessing import Pool
from functools import partial
def f(d):
    a = int(10*d)
    N = int(10000/d)
    for _ in range(N):
        X = np.random.randn(a,10) @ np.random.randn(10,10)
    return X
# Dimensions
ds = [1,2,3,4,5,6,8,10,20,35,40,45,50,60,62,64,66,68,70,80,90,100]
# Serial processing
serial = []
for d in ds:
    t1 = time()
    for i in range(20):
        f(d)
    serial.append(time()-t1)
# Parallel processing
parallel = []
for d in ds:
    t1 = time()
    pool = Pool()
    for i in range(20):
        pool.apply_async(partial(f,d), args=())
    pool.close()
    pool.join()
    parallel.append(time()-t1)
# Plot
import matplotlib.pyplot as plt
plt.title('Matrix multiplication time with 10000/d repetitions')
plt.plot(ds,serial,label='serial')
plt.plot(ds,parallel,label='parallel')
plt.xlabel('d (dimension)')
plt.ylabel('Total time (sec)')
plt.legend()
plt.show()
Since the total computation cost of f(d) is the same for all d, the parallel processing time should be the same for all d as well.
But the actual output is not.
System info:
3.6.8 |Anaconda custom (64-bit)| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz
NOTE I want to use parallel computation for a complicated internal simulation (like @), not for sending data to the child process.
This is for self-reference.
Here, I found a solution.
My numpy uses MKL as its backend, so the problem may be that MKL multithreading conflicts with multiprocessing.
If I run the code:
import os
os.environ['MKL_NUM_THREADS'] = '1'
before importing numpy, then the problem is solved.
I just found an explanation here:
Looks like the CPU caching gets messed up when you have conflicting MKL matrix multiplications going at the same time.
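Put together, the fix might look like the sketch below: the environment variable is set at the very top of the script, before NumPy is imported, and the pool is then used as usual. The worker f mirrors the question's workload; the list of dimensions passed to pool.map is an arbitrary choice for illustration.

```python
import os

# Must be set before NumPy is imported so the MKL backend picks it up.
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
from multiprocessing import Pool

def f(d):
    """Same workload as in the question: 10000/d small matrix products."""
    a = int(10 * d)
    n = int(10000 / d)
    for _ in range(n):
        x = np.random.randn(a, 10) @ np.random.randn(10, 10)
    return x

if __name__ == "__main__":
    # Each worker process now runs single-threaded MKL, so the pool
    # is the only source of parallelism.
    with Pool() as pool:
        results = pool.map(f, [50, 60, 70, 80])
    print([r.shape for r in results])
```

If NumPy has already been imported elsewhere in the process, setting the variable afterwards may have no effect, which is why it sits above the import here.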

Gram-Schmidt orthogonalization in pure Tensorflow: performance for iterative solution is much slower than numpy

I want to do Gram-Schmidt orthogonalization in pure TensorFlow to fix big matrices that start to deviate slightly from orthogonality (doing it on the graph within a larger computation, without breaking it). The solutions I've seen, like the one there, are used "externally" (doing multiple sess.run calls inside).
So I wrote a simple and I think very inefficient implementation myself:
def tf_gram_schmidt(vectors):
    # add batch dimension for matmul
    basis = tf.expand_dims(vectors[0,:]/tf.norm(vectors[0,:]),0)
    for i in range(1,vectors.get_shape()[0].value):
        v = vectors[i,:]
        # add batch dimension for matmul
        v = tf.expand_dims(v,0)
        w = v - tf.matmul(tf.matmul(v, tf.transpose(basis)), basis)
        # I assume that my matrix is close to orthogonal
        basis = tf.concat([basis, w/tf.norm(w)],axis=0)
    return basis
But when I compare it with the same iterative external code, it is 3 times slower (on GPU !!!) (though has a bit better precision):
how much source differs from orthogonal matrix:
tensorflow version:
Time elapsed: 23365.9820557ms
numpy version with tensorflow and variable re-assign to the result of numpy code:
Time elapsed: 8540.5600071ms
(UPD 4: I had a small mistake in my example, but it didn't change timings at all, as ort_discrepancy() is a lightweight function):
Minimal example:
import tensorflow as tf
import numpy as np
import time
# found this code somewhere on stackoverflow
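def np_gram_schmidt(vectors):
    basis = []
    for v in vectors:
        # subtract the projections of v onto the basis vectors found so far
        w = v - sum(np.dot(v,b)*b for b in basis)
        if (w > 1e-10).any():
            basis.append(w/np.linalg.norm(w))
    return np.array(basis)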
def tf_gram_schmidt(vectors):
    # add batch dimension for matmul
    basis = tf.expand_dims(vectors[0,:]/tf.norm(vectors[0,:]),0)
    for i in range(1,vectors.get_shape()[0].value):
        v = vectors[i,:]
        # add batch dimension for matmul
        v = tf.expand_dims(v,0)
        w = v - tf.matmul(tf.matmul(v, tf.transpose(basis)), basis)
        # I assume that my matrix is close to orthogonal
        basis = tf.concat([basis, w/tf.norm(w)],axis=0)
    return basis
# how much matrix differs from orthogonal
# computes ||W*W^T - I||2
def ort_discrepancy(matrix):
    wwt = tf.matmul(matrix, matrix, transpose_a=True)
    rows = tf.shape(wwt)[0]
    cols = tf.shape(wwt)[1]
    return tf.norm((wwt - tf.eye(rows,cols)),ord='euclidean')
# white noise matrix
np_nearly_orthogonal = np.random.normal(size=(2000,2000))
# rows normalized to unit length
np_nearly_orthogonal = np.array([row/np.linalg.norm(row) for row in np_nearly_orthogonal])
tf_nearly_orthogonal = tf.Variable(np_nearly_orthogonal,dtype=tf.float32)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print("how much source differs from orthogonal matrix:")
    print(sess.run(ort_discrepancy(tf_nearly_orthogonal)))
    print("tensorflow version:")
    start = time.time()
    print(sess.run(ort_discrepancy(tf_gram_schmidt(tf_nearly_orthogonal))))
    end = time.time()
    print("Time elapsed: %sms"%(1000*(end-start)))
    print("numpy version with tensorflow and variable re-assign to the result of numpy code:")
    start = time.time()
    tf_nearly_orthogonal = tf.Variable(np_gram_schmidt(tf_nearly_orthogonal.eval()),dtype=tf.float32)
    sess.run(tf.variables_initializer([tf_nearly_orthogonal]))
    # check that variable was updated
    print(sess.run(ort_discrepancy(tf_nearly_orthogonal)))
    end = time.time()
    print("Time elapsed: %sms"%(1000*(end-start)))
Is there a way to speed it up? I couldn't figure out how to do it for G-S which requires appending to the basis (so no tf.map_fn parallelization can help).
UPD: I have reduced the difference to 2x by optimizing tf.matmul:
def tf_gram_schmidt(vectors):
    # add batch dimension for matmul
    basis = tf.expand_dims(vectors[0,:]/tf.norm(vectors[0,:]),0)
    for i in range(1,vectors.get_shape()[0].value):
        v = vectors[i,:]
        # add batch dimension for matmul
        v = tf.expand_dims(v,0)
        w = v - tf.matmul(tf.matmul(v, basis, transpose_b=True), basis)
        # I assume that my matrix is close to orthogonal
        basis = tf.concat([basis, w/tf.norm(w)],axis=0)
    return basis
how much source differs from orthogonal matrix:
tensorflow version:
Time elapsed: 17004.458189ms
numpy version with tensorflow and variable re-assign to the result of numpy code:
Time elapsed: 8082.20791817ms
Just for fun, I tried to fully mimic the numpy solution, and got extremely long-running code:
def tf_gram_schmidt(vectors):
    # add batch dimension for matmul
    basis = tf.expand_dims(vectors[0,:]/tf.norm(vectors[0,:]),0)
    for i in range(1,vectors.get_shape()[0].value):
        v = vectors[i,:]
        # like in numpy example
        multiplied = tf.reduce_sum(tf.map_fn(lambda b: tf.scalar_mul(tf.tensordot(v,b,axes=[[0],[0]]),b), basis), axis=0)
        w = v - multiplied
        ## add batch dimension for matmul
        ##v = tf.expand_dims(v,0)
        ##w = v - tf.matmul(tf.matmul(v, basis, transpose_b=True), basis)
        # I assume that my matrix is close to orthogonal
        basis = tf.concat([basis, tf.expand_dims(w/tf.norm(w),0)],axis=0)
    return basis
(which seems to overfill GPU memory as well):
how much source differs from orthogonal matrix:
tensorflow version:
2018-01-05 22:12:09.854505: I tensorflow/core/common_runtime/gpu/] PoolAllocator: After 14005 get requests, put_count=5105 evicted_count=1000 eviction_rate=0.195886 and unsatisfied allocation rate=0.714031
2018-01-05 22:12:09.854530: I tensorflow/core/common_runtime/gpu/] Raising pool_size_limit_ from 100 to 110
2018-01-05 22:12:13.090296: I tensorflow/core/common_runtime/gpu/] PoolAllocator: After 308520 get requests, put_count=314261 evicted_count=6000 eviction_rate=0.0190924 and unsatisfied allocation rate=0.00088487
2018-01-05 22:12:22.270822: I tensorflow/core/common_runtime/gpu/] PoolAllocator: After 1485113 get requests, put_count=1500399 evicted_count=16000 eviction_rate=0.0106638 and unsatisfied allocation rate=0.000490198
2018-01-05 22:12:37.833056: I tensorflow/core/common_runtime/gpu/] PoolAllocator: After 3484575 get requests, put_count=3509407 evicted_count=26000 eviction_rate=0.00740866 and unsatisfied allocation rate=0.000339209
2018-01-05 22:12:59.995184: I tensorflow/core/common_runtime/gpu/] PoolAllocator: After 6315546 get requests, put_count=6349923 evicted_count=36000 eviction_rate=0.00566936 and unsatisfied allocation rate=0.000259202
Time elapsed: 136108.97398ms
numpy version with tensorflow and variable re-assign to the result of numpy code:
Time elapsed: 10618.8428402ms
UPD3: My GPU is GTX1050, it usually has speedup 5-7 times in comparison to my CPU. So the result is very strange for me.
UPD5: Ok, I found that GPU is almost not used for this code, while training neural network with manually written backpropagation which uses a lot of tf.matmul's and other matrix arithmetics fully exploits it. Why is it so?
UPD 6:
Following the given suggestion I have measured the time in a new way:
# Akshay's suggestion to measure performance correctly
orthogonalized = ort_discrepancy(tf_gram_schmidt(tf_nearly_orthogonal))
with tf.Session() as sess:
    sess.run(init)
    print("how much source differs from orthogonal matrix:")
    print(sess.run(ort_discrepancy(tf_nearly_orthogonal)))
    print("tensorflow version:")
    start = time.time()
    tf_result = sess.run(orthogonalized)
    end = time.time()
    print(tf_result)
    print("Time elapsed: %sms"%(1000*(end-start)))
    print("numpy version with tensorflow and variable re-assign to the result of numpy code:")
    start = time.time()
    tf_nearly_orthogonal = tf.Variable(np_gram_schmidt(tf_nearly_orthogonal.eval()),dtype=tf.float32)
    sess.run(tf.variables_initializer([tf_nearly_orthogonal]))
    # check that variable was updated
    print(sess.run(ort_discrepancy(tf_nearly_orthogonal)))
    end = time.time()
    print("Time elapsed: %sms"%(1000*(end-start)))
Now I can see 4x speedup:
how much source differs from orthogonal matrix:
tensorflow version:
Time elapsed: 2594.85888481ms
numpy version with tensorflow and variable re-assign to the result of numpy code:
Time elapsed: 8851.86600685ms
TensorFlow appears slow because your benchmark is measuring both the time it takes to construct the graph and the time it takes to execute it; a fairer comparison between TensorFlow and NumPy would exclude graph construction from the benchmark. In particular, your benchmark should probably look something like this:
print("tensorflow version:")
# This line constructs the graph but does not execute it.
orthogonalized = ort_discrepancy(tf_gram_schmidt(tf_nearly_orthogonal))
start = time.time()
tf_result = sess.run(orthogonalized)
end = time.time()
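For a CPU cross-check of the matmul-based projection step used in tf_gram_schmidt, a minimal NumPy sketch could look like the following. It is not the code from the question: the function name np_gram_schmidt_matmul, the preallocated basis array (instead of repeated concatenation), and the small 50x50 test matrix are choices made here for illustration. It assumes the rows are linearly independent, as they are for a nearly orthogonal input.

```python
import numpy as np

def np_gram_schmidt_matmul(vectors):
    """Gram-Schmidt over rows, projecting with two matrix-vector products."""
    vectors = np.asarray(vectors, dtype=np.float64)
    basis = np.empty_like(vectors)
    basis[0] = vectors[0] / np.linalg.norm(vectors[0])
    for i in range(1, vectors.shape[0]):
        v = vectors[i]
        b = basis[:i]            # basis rows found so far, shape (i, n)
        w = v - (b @ v) @ b      # remove the components of v along those rows
        basis[i] = w / np.linalg.norm(w)
    return basis

def ort_discrepancy(matrix):
    """Frobenius norm of W W^T - I."""
    n = matrix.shape[0]
    return np.linalg.norm(matrix @ matrix.T - np.eye(n))

# Hypothetical test input: an orthogonal matrix plus a small perturbation.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
nearly = q + 1e-3 * rng.normal(size=(50, 50))
fixed = np_gram_schmidt_matmul(nearly)
print(ort_discrepancy(nearly), ort_discrepancy(fixed))
```

Timing this function on the same matrix as the TensorFlow graph gives a baseline that, like the corrected benchmark, contains no graph-construction cost.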

How to decrease time of execution using multi-threading python

I am performing DCT(in Raspberry Pi). I've broken the image into 8x8 blocks. Initially I performed DCT in nested for loop (without multithreading). I observed that it takes about 18 seconds for a 512x512 image.
Here's the code with multiple threads:
#!/usr/bin/env python
from __future__ import print_function,division
import time
start_time = time.time()
import cv2
import numpy as np
import sys
import pylab as plt
import threading
import Queue
from numpy import empty,arange,exp,real,imag,pi
from numpy.fft import rfft,irfft
from pprint import pprint
queue = Queue.Queue()
Nb = 8  # block size (8x8 blocks)
if len(sys.argv)>1:
    im = cv2.imread(sys.argv[1])
else :
    im = cv2.imread('baboon.jpg')
im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
h, w = im.shape[:2]
DF = np.zeros((h,w))
def dct2(y):
    M = y.shape[0]
    N = y.shape[1]
    a = empty([M,N],float)
    b = empty([M,N],float)
    for i in range(M):
        a[i,:] = dct(y[i,:])
    for j in range(N):
        b[:,j] = dct(a[:,j])
    queue.put(b)
def dct(y):
    N = len(y)
    y2 = empty(2*N,float)
    y2[:N] = y[:]
    y2[N:] = y[::-1]
    c = rfft(y2)
    phi = exp(-1j*pi*arange(N)/(2*N))
    return real(phi*c[:N])
def Main():
    jobs = []
    for row in range(0, h, Nb):
        for col in range(0, w, Nb):
            f = im[(row):(row+Nb), (col):(col+Nb)]
            thread = threading.Thread(target=dct2(f))
            jobs.append(thread)
            df = queue.get()
            DF[row:row+Nb, col:col+Nb] = df
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
if __name__ == "__main__":
    Main()
    cv2.imwrite('dct_img.jpg', DF)
    print("--- %s seconds ---" % (time.time() - start_time))
    plt.imshow(DF, cmap = 'Greys')
After using multiple threads, this code takes about 25 seconds to execute. What's wrong? Have I implemented multi-threading wrongly? I want to reduce the time taken to perform the DCT as much as possible (1-5 seconds). Any suggestions?
Any other concept or method (I've read post on multiprocessing) that'll significantly reduce my execution and processing time?
Because of the GIL, all your threads are executed sequentially (not in parallel).
So you might want to switch to multiprocessing. Another option is to use Numba, which can greatly speed up ordinary Python code and can also release the GIL.
In Python, you should use multithreading for performance only when mixing I/O and CPU tasks.
For your problem you should use multiprocessing.
Maybe the other posters are right about the GIL. But OpenCV as well as Numpy release the GIL so I would at least expect a speedup from a multithreaded solution.
I would have a look at how many threads you are creating simultaneously. It's probably a lot, since you start one for each 8 by 8 pixel sub-picture. (Each time a thread is taken off the CPU and replaced by another, it incurs a small overhead, which adds up noticeably if you have a lot of threads.)
If this is the case, you will probably gain performance by not starting them all at once, but starting only as many as you have CPU cores (a few more or a few less... just experiment) and starting the next thread only when one has finished.
Look at the answers to this question on how to do this with minimal effort.
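The suggestion above, running only about as many threads as there are CPU cores, can be sketched with a bounded pool from the standard library. This is an illustration under assumptions, not the original poster's code: it is written for Python 3, the names blockwise_dct and dct1d are made up, and the image is a synthetic array rather than a file read with OpenCV. The DCT itself follows the rfft-based definition from the question.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

NB = 8  # block size, as in the question

def dct1d(y):
    """1-D DCT via rfft of the mirrored signal (same formula as the question)."""
    n = len(y)
    y2 = np.empty(2 * n, dtype=float)
    y2[:n] = y
    y2[n:] = y[::-1]
    c = np.fft.rfft(y2)
    phi = np.exp(-1j * np.pi * np.arange(n) / (2 * n))
    return np.real(phi * c[:n])

def dct2(block):
    """2-D DCT: transform every row, then every column."""
    rows = np.array([dct1d(r) for r in block])
    return np.array([dct1d(c) for c in rows.T]).T

def blockwise_dct(image, max_workers=None):
    """Apply dct2 to each NB x NB block using a bounded thread pool."""
    h, w = image.shape
    out = np.zeros((h, w))
    coords = [(r, c) for r in range(0, h, NB) for c in range(0, w, NB)]

    def work(rc):
        r, c = rc
        return rc, dct2(image[r:r + NB, c:c + NB])

    # At most max_workers threads exist at any time; blocks queue up behind them.
    with ThreadPoolExecutor(max_workers=max_workers or os.cpu_count()) as pool:
        for (r, c), d in pool.map(work, coords):
            out[r:r + NB, c:c + NB] = d
    return out

# Synthetic stand-in for the grayscale image.
image = np.ones((32, 32))
result = blockwise_dct(image)
print(result[:2, :2])
```

Whether this beats the serial loop depends on how much of the work runs with the GIL released. For blocks this small, Python-level overhead may dominate, in which case a process pool or processing many blocks per task is worth trying.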
