Is there a way to use tensorflow map_fn on GPU? - python

I have a tensor A with shape [a,n] and I need to perform an op my_op with another tensor B of shape [b,n] such that the resulting tensor C has shape [a,b].
In other words: for each subtensor in A (A[0], A[1], ..., A[a-1]) I need to perform an element-wise op with each subtensor in B.
So the resulting tensor would contain the following:
[ [ A[0] op B[0] , A[0] op B[1], ... , A[0] op B[b-1] ],
[ A[1] op B[0] , A[1] op B[1], ... , A[1] op B[b-1] ],
[ ... ],
[ A[a-1] op B[0] , A[a-1] op B[1], ... , A[a-1] op B[b-1] ] ]
The only way that I've been able to find that achieves this is through nested use of tf.map_fn
import tensorflow as tf
import time
import numpy as np
a_size = 64
b_size = 256*256
n = 256
A = tf.placeholder(tf.float32,[a_size,n])
B = tf.placeholder(tf.float32,[b_size,n])
def elementwise_op(a,b):
    return tf.reduce_sum(tf.multiply(a,b))
def intermediate_op(sub_a,my_b):
    sample_values = tf.map_fn(lambda x: elementwise_op(sub_a,x),my_b)
    return sample_values
my_op = tf.map_fn(lambda x: intermediate_op(x,B),A)
with tf.Session() as sess:
    a = np.random.rand(a_size,n)
    b = np.random.rand(b_size,n)
    start_time = time.time()
    result = sess.run(my_op,feed_dict={A:a,B:b})
    print ("exec time: " ,time.time()-start_time)
    print (result.shape)
The code above runs fine; however, it does not use the GPU very well (only ~15% utilization, according to nvidia-smi). In fact, it runs an order of magnitude faster when using only the CPU (on my 12-core machine)! When run on the GPU, one of my CPU cores sits at 100% while the GPU stays at ~15%. When run on the CPU only, I see 100% utilization across all CPU cores.
Average timing of 5 CPU only runs: 11.33s
Average timing of 5 GPU runs: 111.88s
The above test was run using the official Tensorflow docker images: tensorflow/tensorflow:latest-py3 (for CPU) and tensorflow/tensorflow:latest-gpu-py3 (for GPU)
My guess is that map_fn, via the Python lambda, is forcing data to be copied back and forth between the CPU and GPU at every iteration, and the nested nature of the op just makes it worse. The comments on an unanswered SO question suggest that this is the case.
This article claims that:
lambda expression is the main reason of low GPU utilization.
So my question is: Is there a way to force map_fn to use the GPU? Or to avoid the Python lambda?
Alternatively, is there some other (perhaps more tensorflow-y) way to achieve the result described above, in order to get the graph to run on the GPU?
After running the profiler (I had to drastically reduce the size of the arrays to get the profiler to run at all, because it was eating up RAM like crazy), the following lines caught my attention:
node name | output bytes | total execution time | accelerator execution time | cpu execution time
Mul | 1.02KB (22.23%, 0.29%) | 195.07ms (85.00%, 13.06%) | 5.29ms (100.00%, 25.79%) | 189.78ms (84.79%, 12.89%)
Sum | 256B (21.41%, 0.07%) | 241.48ms (69.08%, 16.17%) | 6.01ms (74.21%, 29.29%) | 235.47ms (69.01%, 15.99%)
TensorArrayScatterV3 | 512B (0.64%, 0.15%) | 658.31ms (46.87%, 44.09%) | 9.19ms (44.80%, 44.80%) | 649.12ms (46.90%, 44.08%)
It looks like certain ops are being done mostly on the CPU, and only on one thread at that!

The tf.map_fn() construct can be used with a function that runs ops on GPU. By default, TensorFlow will try to run as much of the function as possible on the GPU, and any GPU-incompatible ops will run on the CPU. In your program, the entire elementwise_op() function is built from GPU-compatible ops, so there should be no additional copying between CPU and GPU at each iteration.
The cause of low GPU utilization is difficult to determine from a program fragment. For example, if A and B are relatively small, and you are feeding them from Python and then immediately fetching back the result, it is likely that the overhead of copying the initial data to and from the GPU would dominate. The best way to track this down is to use a GPU profiler, which you can get using tfprof or the NVIDIA Visual Profiler.
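As for the "more tensorflow-y way" the question asks about: because elementwise_op is a multiply followed by a full reduce_sum, each entry C[i][j] is just the dot product of A[i] and B[j], so for this particular op the nested map_fn collapses into one matrix product. The sketch below checks that equivalence with NumPy on small, made-up shapes; in TensorFlow the corresponding single op would be tf.matmul(A, B, transpose_b=True). A different my_op would need its own broadcasting formulation, shown here as the third variant.

```python
import numpy as np

a_size, b_size, n = 4, 6, 8          # small illustrative shapes, not the question's
rng = np.random.default_rng(0)
A = rng.random((a_size, n), dtype=np.float32)
B = rng.random((b_size, n), dtype=np.float32)

# Reference: what the nested map_fn computes, one (i, j) pair at a time.
C_loop = np.array([[np.sum(A[i] * B[j]) for j in range(b_size)]
                   for i in range(a_size)], dtype=np.float32)

# Same result as a single matrix product: C = A @ B^T, shape [a, b].
C_matmul = A @ B.T

# General pattern for other element-wise ops: broadcast to [a, b, n], reduce over n.
C_broadcast = np.sum(A[:, None, :] * B[None, :, :], axis=-1)

print(C_matmul.shape)                # (4, 6)
print(np.allclose(C_loop, C_matmul, atol=1e-5))
```

Note that the broadcast variant materialises an [a, b, n] intermediate, which at the question's sizes (64 x 65536 x 256 floats) would be about 4 GiB, so the matrix product is the one that scales.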


Numba Python - how to exploit parallelism effectively?

I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing, is multiplying an array by a float32 then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda
lengthY = 16 # 2D array Y axis
lengthX = 2**16 # X axis
totalops = lengthY * lengthX * 2 # MAC operation has 2 operations
iters = 100
doParallel = True
@njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
    output = (float)(0.0)
    multconst = (float)(.99)
    output = np.sum(np.multiply(testarray, multconst))
    return output
@njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(lengthX):
            output += multconst*testarray[y,x]
    return output
@njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(int(lengthX/4)):
            xn = x*4
            output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
    return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)
# ==== MAC_numpy ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_numpy(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("output = %f" % (output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
    start = timer()
    output = MAC_01(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("output = %f" % (output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_04 ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_04(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("output = %f" % (output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
Q : How can I get better results?
1st : Learn how to avoid doing useless work. You can eliminate half of the FLOPs outright, and with them half of the RAM I/O, each write-back costing roughly +100~350 [ns].
Because multiplication distributes over addition, ( a.C + b.C ) == ( a + b ).C, it is better to np.sum( A ) first and only then multiply the sum by the (float) constant.
# output = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum( testarray)*multconst #######################
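A minimal check of that rewrite, assuming the same array shape as in the question (the seed and variable names below are chosen only for illustration): both forms give the same value up to floating-point rounding, but the second performs a single multiplication and allocates no temporary array.

```python
import numpy as np

rng = np.random.default_rng(0)
testarray = rng.random((16, 2**16))   # float64, same shape as in the question
multconst = 0.99

# N multiplications plus a full-size temporary array, then the sum.
scaled_then_summed = np.sum(np.multiply(testarray, multconst))

# One pass for the sum, then a single multiplication.
summed_then_scaled = np.sum(testarray) * multconst

print(np.isclose(scaled_then_summed, summed_then_scaled))
```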
2nd : Learn how to best align data along the order of processing ( cache-line re-use gets you ~100x faster access to pre-fetched data ). Vectorised code that is not aligned with the already pre-fetched data pays the RAM-access latencies many times over, instead of smartly re-using data-blocks that have already been paid for. Designing work-units according to this principle costs a few SLOCs more, but the reward is worth it: up to a ~100x speedup for free, just from not writing badly or naively designed loop iterators.
3rd : Learn how to efficiently harness vectorised ( block-directed ) operations inside numpy or numba code-blocks, and avoid making numba spend time on auto-analysing the call-signatures ( you pay extra time for this auto-analysis per call, although you designed the code and know exactly which data-types are going to go there, so why pay for it each time a numba-block gets called? )
4th : Learn where the extended Amdahl's Law, with all the relevant add-on costs and processing atomicity taken into account, supports your wish to get speedups, so that you never pay more than you get back ( the speedup has to at least justify the add-on costs ). Paying extra costs without getting any reward is possible, yet it has no beneficial impact on your code's performance ( rather the opposite ).
5th : Learn when and how manually created inline(s) may save your code, once steps 1-4 are well learnt and routinely exercised with proper craftsmanship ( using popular COTS frameworks is fine, yet these may deliver results after a few days of work, while a hand-crafted, single-purpose, smartly designed assembly code was able to get the same results in about 12 minutes(!) instead of several days, without any GPU/CPU tricks. Yes, that much faster, just by not doing a single step more than what was needed for the numerical processing of the large matrix data ).
Did I mention that float32 may surprise by being processed slower than float64 on small scales, while on larger data-scales ~ n [GB] the RAM I/O-times grow slower thanks to more efficient float32 pre-fetches? This never happens here, as a float64 array gets processed, unless one explicitly instructs the constructor(s) to down-convert the default data type, like this: np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )
>>> np.random.rand( 10, 2 ).dtype
dtype('float64')
Avoiding extensive memory allocations is another performance trick, supported in numpy call-signatures. Using this option for large arrays will save you a lot of time otherwise wasted on mem-allocs for large interim arrays. Re-using already pre-allocated memory-zones and wisely controlled gc-policing are further signs of a professional focused on low-latency & design-for-performance.
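A small sketch of both points, assuming the question's array shape. The name scratch is invented here, and out= is the NumPy ufunc parameter that the "call-signatures" remark presumably refers to. Whether float32 is actually faster on a given machine has to be measured, so no timing is claimed.

```python
import numpy as np

rng = np.random.default_rng(0)
data64 = rng.random((16, 2**16))         # NumPy's default dtype: float64
data32 = data64.astype(np.float32)       # explicit down-conversion, half the bytes

print(data64.dtype, data64.nbytes)       # float64 8388608
print(data32.dtype, data32.nbytes)       # float32 4194304

# Reuse one preallocated buffer instead of allocating a new temporary per call.
scratch = np.empty_like(data32)
np.multiply(data32, np.float32(0.99), out=scratch)

# Accumulate in float64 to limit rounding error in the float32 data.
total = scratch.sum(dtype=np.float64)
```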

Why is sklearn faster on CPU than Theano on GPU?

I've compared the processing time of Theano (CPU), Theano (GPU) and scikit-learn (CPU) using Python.
But I got a strange result.
Here is the graph that I plotted.
Processing Time Comparison:
You can see that scikit-learn is faster than Theano (GPU).
The program whose elapsed time I measured computes a Euclidean distance matrix from a matrix with n * 40 elements.
Here is the relevant part of the code.
points = T.fmatrix("points")
edm = T.zeros_like(points)
def get_point_to_points_euclidean_distances(point_id):
    euclideans = (T.sqrt((T.sqr(points- points[point_id, : ])).sum(axis=1)))
    return euclideans
def get_EDM_CPU(points):
    EDM = np.zeros((points.shape[0], points.shape[0])).astype(np.float32)
    for row in range(points.shape[0]):
        EDM[row, :] = np.sqrt(np.sum((points - points[row, :])**2, axis=1))
    return EDM
def get_sk(points):
    EDM = sk.pairwise_distances(points, metric='l2')
    return EDM
seq = T.arange(T.shape(points)[0])
(result, _) = theano.scan(fn = get_point_to_points_euclidean_distances, \
                          outputs_info = None , \
                          sequences = seq)
get_EDM_GPU = theano.function(inputs = [points], outputs = result, allow_input_downcast = True)
I thought that the reason the GPU is slower than scikit-learn is probably transfer time, so I profiled the GPU run with the nvprof command and got this.
==27105== NVPROF is profiling process 27105, command: python ./
Using gpu device 0: GeForce GTX 580 (CNMeM is disabled, cuDNN not available)
data shape : (10000, 40)
get_EDM_GPU elapsed time : 1.84863090515 (s)
get_EDM_CPU elapsed time : 8.09937691689 (s)
get_EDM_sk elapsed time : 1.10968112946 (s)
ratio : 4.38128395145
==27105== Profiling application: python ./
==27105== Warning: Found 9 invalid records in the result.
==27105== Warning: This could be because device ran out of memory when profiling.
==27105== Profiling result:
Time(%) Time Calls Avg Min Max Name
71.34% 1.28028s 9998 128.05us 127.65us 128.78us kernel_reduce_01_node_316e2e1cbfbe8cfb8e4a101f329ffeec_0(int, int, float const *, int, int, float*, int)
19.95% 357.97ms 9997 35.807us 35.068us 36.948us kernel_Sub_node_bc41b3f8f12c93d29f2c4360ad445d80_0_2(unsigned int, int, int, float const *, int, int, float const *, int, int, float*, int, int)
7.32% 131.38ms 2 65.690ms 1.2480us 131.38ms [CUDA memcpy DtoH]
1.25% 22.456ms 9996 2.2460us 2.1140us 2.8420us kernel_Sqrt_node_23508f8f49d12f3e8369d543f5620c15_0_Ccontiguous(unsigned int, float const *, float*)
0.12% 2.1847ms 1 2.1847ms 2.1847ms 2.1847ms [CUDA memset]
0.01% 259.73us 5 51.946us 640ns 250.36us [CUDA memcpy HtoD]
0.00% 17.086us 1 17.086us 17.086us 17.086us kernel_reduce_ccontig_node_97496c4d3cf9a06dc4082cc141f918d2_0(unsigned int, float const *, float*)
0.00% 2.0090us 1 2.0090us 2.0090us 2.0090us void copy_kernel<float, int=0>(cublasCopyParams<float>)
The transfer [CUDA memcpy DtoH] was performed twice { 1.248 [us], 131.38 [ms] }
The transfer [CUDA memcpy HtoD] was performed 5x { min: 640 [ns], max: 250.36 [us] }
The transfer time is about 131.64 ms (131.38 ms + 259.73 us).
But the gap between the GPU and scikit-learn is about 700 ms (1.8 s - 1.1 s), so the gap is larger than the transfer time.
Does scikit-learn compute only the upper triangular part of the symmetric matrix?
What makes scikit-learn so fast?
What makes scikit-learn ( on pure CPU-side ) so fast?
My initial candidates would be a mix of:
highly efficient use of available CPU-cores' L1-/ L2- sizes within the fastest [ns]-distances
smart numpy vectorised execution being friendly to CPU cache-lines
dataset so small that it can remain in cache without ever being evicted ( test by scaling the dataset-under-review way above the L2-/L3-cache sizes to see the DDRx-memory-cost effects on the observed performance )
might enjoy even better timing on numpy, if avoiding .astype() conversions ( test it )
Facts on the GPU-side
auto-generated GPU-kernels do not have much chance to get ultimate levels of global memory latency-masking, compared to manually tweaked kernel-designs, tailor fit to respective GPU-silicon-architecture / latencies observed in-vivo
data-structures larger than just a few KB keep paying GPU-SM/GDDR-MEM distances of large hundreds of [ns], nearly [us], compared to small tens of [ns] at CPU/L1/L2/L3/DDRx
not being able to enjoy much of the GPU/SMX power, due to this task's obvious low-reuse of data points and dataset size beyond the GPU/SM-silicon limits, that causes and must cause GPU/SM-register capacity spillovers in any kind of GPU-kernel design attempts and tweaking
the global task is not having a minimum reasonable amount of asynchronous, isolated ( non-communicating islands ) mathematically-dense, yet SMX-local, GPU-kernel processing steps ( there is not much to compute so as to adjust for the add-on overheads and expensive SMX/GDDR memory costs )
GPUs exhibit their best performance when sufficiently dense, convolution-like re-processing operations take place, as in large-scale / high-resolution image processing, on [m,n,o] convolution-kernel matrices so small that all these m*n*o constant values can reside local to the SM, inside the available set of SMX/SM registers, and when the GPU-kernel launchers are optimally tuned via the 3D thread-block/grid processing-layout geometries, so that global memory access latencies are masked as well as possible, with all GPU threads kept within the hardware WARP-aligned SMx:WarpScheduler Round-Robin thread-scheduling capabilities ( the first swap from Round-Robin into Greedy-WarpSchedule mode loses the whole battle in case of divergent execution paths in the GPU-kernel code ).
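One further CPU-side factor worth checking: the scikit-learn documentation for euclidean_distances describes computing distances as sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)), which turns the whole distance matrix into one matrix product instead of n separate row passes like get_EDM_CPU or the scan loop. The sketch below is not scikit-learn's code; the function names, the float64 data and the small (200, 40) stand-in shape are assumptions made for illustration.

```python
import numpy as np

def edm_rowwise(points):
    """Row-by-row version, following get_EDM_CPU from the question."""
    n = points.shape[0]
    edm = np.zeros((n, n), dtype=points.dtype)
    for row in range(n):
        edm[row, :] = np.sqrt(np.sum((points - points[row, :]) ** 2, axis=1))
    return edm

def edm_gram(points):
    """Whole matrix at once via ||x-y||^2 = ||x||^2 + ||y||^2 - 2 x.y."""
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (points @ points.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)   # clamp tiny negatives from rounding
    np.fill_diagonal(sq_dists, 0.0)           # self-distances are exactly zero
    return np.sqrt(sq_dists)

rng = np.random.default_rng(0)
points = rng.random((200, 40))                # small stand-in for the (10000, 40) data
print(np.allclose(edm_rowwise(points), edm_gram(points), atol=1e-6))
```

The expansion trades accuracy for speed: the subtraction cancels badly for nearly identical points, and more so in float32, which is why the sketch clamps negatives and resets the diagonal.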

Basic multi GPU parallelization of matrix multiplication

I want to parallelize the following simple expression on 2 GPUs: C = A^n + B^n, by calculating A^n on GPU 0 and B^n on GPU 1 before summing the results.
In TensorFlow I would write something like:
with tf.device('/gpu:0'):
    An = matpow(A, n)
with tf.device('/gpu:1'):
    Bn = matpow(B, n)
with tf.Session() as sess:
    C = sess.run(An + Bn)
However, since PyTorch is dynamic, I'm having trouble doing the same thing. I tried the following but it only takes more time.
with torch.cuda.device(0):
    A = A.cuda()
with torch.cuda.device(1):
    B = B.cuda()
C = matpow(A, n) + matpow(B, n).cuda(0)
I know there is a module to parallelize models on the batch dimension using torch.nn.DataParallel but here I try to do something more basic.
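You can use CUDA streams for this. This will not necessarily distribute the work over two devices, but the execution will be in parallel.
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
with torch.cuda.stream(s1):
    A = torch.pow(A,n)
with torch.cuda.stream(s2):
    B = torch.pow(B,n)
torch.cuda.synchronize()  # wait for both streams before combining the results
C = A+B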
That said, I'm not sure it will really speed up your computation if you only parallelize this one operation; your matrices would have to be really big.
If your requirement is to split it across devices, you can add this before the streams:
A = A.cuda(0)
B = B.cuda(1)
Then after the power operation, you need to get them on the same device again, e.g. B = B.cuda(0). After that you can do the addition.

TensorFlow: how to log GPU memory (VRAM) utilization?

TensorFlow always (pre-)allocates all free memory (VRAM) on my graphics card, which is ok since I want my simulations to run as fast as possible on my workstation.
However, I would like to log how much memory (in sum) TensorFlow really uses. Additionally it would be really nice, if I could also log how much memory single tensors use.
This information is important to measure and compare the memory size that different ML/AI architectures need.
Any tips?
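Update: you can use TensorFlow ops to query the allocator:
# maximum across all sessions and .run calls so far
sess.run(tf.contrib.memory_stats.MaxBytesInUse())
# current usage
sess.run(tf.contrib.memory_stats.BytesInUse())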
You can also get detailed information about a session.run call, including all memory allocated during the call, by looking at RunMetadata, i.e. something like this:
run_metadata = tf.RunMetadata()
sess.run(c, options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE, output_partition_graphs=True), run_metadata=run_metadata)
Here's an end-to-end example -- take column vector, row vector and add them to get a matrix of additions:
import tensorflow as tf
no_opt = tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)
config = tf.ConfigProto(graph_options=tf.GraphOptions(optimizer_options=no_opt),
                        log_device_placement=True, allow_soft_placement=False,
                        device_count={"CPU": 3})
with tf.device("cpu:0"):
    a = tf.ones((13, 1))
with tf.device("cpu:1"):
    b = tf.ones((1, 13))
with tf.device("cpu:2"):
    c = a+b
sess = tf.Session(config=config)
run_metadata = tf.RunMetadata()
sess.run(c, options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE, output_partition_graphs=True), run_metadata=run_metadata)
with open("/tmp/run2.txt", "w") as out:
    out.write(str(run_metadata))
If you open /tmp/run2.txt you'll see messages like this:
node_name: "ones"
allocation_description {
  requested_bytes: 52
  allocator_name: "cpu"
  ptr: 4322108320
}
node_name: "ones_1"
allocation_description {
  requested_bytes: 52
  allocator_name: "cpu"
  ptr: 4322092992
}
node_name: "add"
allocation_description {
  requested_bytes: 676
  allocator_name: "cpu"
  ptr: 4492163840
}
So here you can see that a and b allocated 52 bytes each (13*4), and the result allocated 676 bytes (13*13*4).
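The byte counts follow directly from the shapes and the 4-byte float32 element size that tf.ones uses by default. A quick cross-check with NumPy arrays of the same shapes and dtype (NumPy is used here only as a stand-in for the arithmetic, not to inspect TensorFlow's allocator):

```python
import numpy as np

a = np.ones((13, 1), dtype=np.float32)   # column vector: 13 elements
b = np.ones((1, 13), dtype=np.float32)   # row vector: 13 elements
c = a + b                                # broadcasts to shape (13, 13)

print(a.nbytes, b.nbytes, c.nbytes)      # 52 52 676
```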
Yaroslav Bulatov's answer is the best solution for TF1.
For TF2, however, the contrib package does not exist. The best way is to use TensorFlow's profiler.
It plots a memory utilization graph.

Measuring time using pycuda.driver.Event gives wrong results

I ran from the PyCuda examples, producing the following output:
Using nbr_values == 8192
Calculating 100000 iterations
SourceModule time and first three results:
0.058294s, [ 0.005477 0.005477 0.005477]
Elementwise time and first three results:
0.102527s, [ 0.005477 0.005477 0.005477]
Elementwise Python looping time and first three results:
2.398071s, [ 0.005477 0.005477 0.005477]
GPUArray time and first three results:
8.207257s, [ 0.005477 0.005477 0.005477]
CPU time measured using :
0.000002s, [ 0.005477 0.005477 0.005477]
The first four time measurements are reasonable, the last one (0.000002s) however is way off. The CPU result should be the slowest one but it is orders of magnitude faster than the fastest GPU method. So obviously the measured time must be wrong. This is strange since the same timing method seems to work fine for the first four results.
So I took some code from and made a small test file [2], which produced:
time measured using option 1:
time measured using option 2:
Option 1 measures the duration using pycuda.driver.Event.record() (as in the example); option 2 uses time.clock(). Again, option 1 is off while option 2 gives a reasonable result (the time it takes to run the test file is around 6s).
Does anyone have an idea as to why this is happening?
Since using option 1 is endorsed in, could it be my setup that is causing the problem? I am running a GTX 470, Display Driver 301.42, CUDA 4.2, Python 2.7 64, PyCuda 2012.1, X5650 Xeon
[2] Test file:
import numpy
import time
import pycuda.driver as drv
import pycuda.autoinit
n_iter = 100000
nbr_values = 8192 # = 64 * 128 (values as used in
start = drv.Event() # option 1 uses pycuda.driver.Event
end = drv.Event()
a = numpy.ones(nbr_values).astype(numpy.float32) # test data
start.record() # start option 1 (inserting recording points into GPU stream)
tic = time.clock() # start option 2 (using CPU time)
for i in range(n_iter):
    a = numpy.sin(a) # do some work
end.record() # end option 1
toc = time.clock() # end option 2
events_secs = start.time_till(end)*1e-3
time_secs = toc - tic
print "time measured using option 1:"
print "%fs " % events_secs
print "time measured using option 2:"
print "%fs " % time_secs
I contacted Andreas Klöckner and he suggested synchronizing on the start event, too.
And this seems to solve the issue!
time measured using option 1:
time measured using option 2:
Apparently CUDA's behaviour changed in the last two years. I updated
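The suggested fix, synchronizing on the start event as well as the end event, can be sketched as a small helper. The name time_with_events and the make_event parameter are invented for illustration; with PyCUDA one would pass pycuda.driver.Event, whose record(), synchronize() and time_till() methods are the only ones used. The commented usage lines assume a CUDA device and a hypothetical some_gpu_work function.

```python
def time_with_events(work, make_event):
    """Return the seconds elapsed while running work(), measured between two events.

    make_event is assumed to build objects that offer record(), synchronize()
    and time_till(), for example pycuda.driver.Event.
    """
    start, end = make_event(), make_event()
    start.record()
    start.synchronize()   # make sure the start marker has really been recorded
    work()
    end.record()
    end.synchronize()     # wait for the end marker before reading the timer
    return start.time_till(end) * 1e-3   # time_till() reports milliseconds

# Intended use with PyCUDA (requires a CUDA device; some_gpu_work is hypothetical):
#   import pycuda.autoinit
#   import pycuda.driver as drv
#   seconds = time_with_events(lambda: some_gpu_work(), drv.Event)
```

Passing the event constructor in as an argument keeps the helper importable on machines without a GPU and makes the record/synchronize ordering easy to exercise with a stand-in event class.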
