TensorFlow always (pre-)allocates all free memory (VRAM) on my graphics card, which is ok since I want my simulations to run as fast as possible on my workstation.
However, I would like to log how much memory (in total) TensorFlow actually uses. Additionally, it would be really nice if I could also log how much memory individual tensors use.
This information is important to measure and compare the memory size that different ML/AI architectures need.
Any tips?
Update: you can use TensorFlow ops to query the allocator:
# maximum across all sessions and .run calls so far
sess.run(tf.contrib.memory_stats.MaxBytesInUse())
# current usage
sess.run(tf.contrib.memory_stats.BytesInUse())
You can also get detailed information about a session.run call, including all memory allocations made during the call, by looking at RunMetadata, i.e. something like this:
run_metadata = tf.RunMetadata()
sess.run(c, options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE, output_partition_graphs=True), run_metadata=run_metadata)
Here's an end-to-end example -- take a column vector and a row vector and add them to get a matrix of sums:
import tensorflow as tf

no_opt = tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0,
                             do_common_subexpression_elimination=False,
                             do_function_inlining=False,
                             do_constant_folding=False)
config = tf.ConfigProto(graph_options=tf.GraphOptions(optimizer_options=no_opt),
                        log_device_placement=True, allow_soft_placement=False,
                        device_count={"CPU": 3},
                        inter_op_parallelism_threads=3,
                        intra_op_parallelism_threads=1)

with tf.device("cpu:0"):
    a = tf.ones((13, 1))
with tf.device("cpu:1"):
    b = tf.ones((1, 13))
with tf.device("cpu:2"):
    c = a + b

sess = tf.Session(config=config)
run_metadata = tf.RunMetadata()
sess.run(c, options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                                  output_partition_graphs=True),
         run_metadata=run_metadata)
with open("/tmp/run2.txt", "w") as out:
    out.write(str(run_metadata))
If you open /tmp/run2.txt you'll see messages like this:
node_name: "ones"
allocation_description {
requested_bytes: 52
allocator_name: "cpu"
ptr: 4322108320
}
....
node_name: "ones_1"
allocation_description {
requested_bytes: 52
allocator_name: "cpu"
ptr: 4322092992
}
...
node_name: "add"
allocation_description {
requested_bytes: 676
allocator_name: "cpu"
ptr: 4492163840
So here you can see that a and b allocated 52 bytes each (13 * 4 bytes), and the result allocated 676 bytes (13 * 13 * 4 bytes).
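If you want the totals programmatically rather than by reading the text dump, you can walk the same RunMetadata proto. A rough sketch (field names assume the TF 1.x StepStats / NodeExecStats layout):

total_bytes = 0
for dev_stats in run_metadata.step_stats.dev_stats:   # one entry per device
    for node in dev_stats.node_stats:                 # one entry per executed op
        for mem in node.memory:                       # per-allocator usage for that op
            total_bytes += mem.total_bytes
print("bytes reported across all allocators:", total_bytes)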
Yaroslav Bulatov's answer is the best solution for TF1.
For TF2, however, the contrib package no longer exists. The best option is to use TensorFlow's profiler -- https://www.tensorflow.org/guide/profiler#memory_profile_tool
It will plot a memory utilization graph like the one shown in that guide.
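For completeness, a minimal TF2 sketch (assuming TF 2.4+ and a GPU visible as 'GPU:0') that queries the allocator directly, without the full profiler:

import tensorflow as tf

# Returns a dict with the allocator's current and peak bytes in use.
info = tf.config.experimental.get_memory_info('GPU:0')
print('current bytes in use:', info['current'])
print('peak bytes in use   :', info['peak'])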
I have a distributed dask cluster set up and I have used it to load and transform a bunch of data. It works like a charm.
I want to use it to do some processing in parallel. Here's my function:
import dask
import dask.array as da
import numpy as np

el = 5000
n_using = 26
n_across = 6

mat = np.random.random((el, n_using, n_across))
idx = np.tril_indices(n_across * 2, -n_across)

def get_vals(c1, m, el, idx):
    m1 = m[c1, :, :]
    corr_vals = np.zeros((el, (n_across // 2) * (n_across + 1)))
    for c2 in range(c1 + 1, el):
        corr = np.corrcoef(m1.T, m[c2, :, :].T)
        corr_vals[c2] = corr[idx]
    return corr_vals

lazy_get_val = dask.delayed(get_vals, pure=True)
Here is a single processor version of what I'm trying to do:
arrays = [get_vals(c1, mat, el, idx) for c1 in range(el)]
all_corr = np.stack(arrays, axis=0)
Works fine but takes a few hours.
Here's my go at doing this in dask:
lazy_list = [lazy_get_val(c1, mat, el, idx) for c1 in range(el)]
arrays = [da.from_delayed(lazy_item, dtype=float, shape=(el, 21)) for lazy_item in lazy_list]
all_corr = da.stack(arrays, axis=0)
Even if I run all_corr[1].compute(), it just sits there and doesn't respond. When I interrupt the kernel, it seems to be stuck in /distributed/utils.py:
~/.../lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
249 else:
250 while not e.is_set():
--> 251 e.wait(10)
252 if error[0]:
253 six.reraise(*error[0])
Any suggestions on debugging this?
Other things:
If I run it with a smaller mat (el = 1000), it runs fine.
If I make el = 5000, it hangs.
If I interrupt the kernel and run it again with el = 1000, it hangs.
After adding imports to the example I ran things and it was very slow while building the graph. This can be improved by avoiding placing numpy arrays directly in delayed calls as follows:
# mat = np.random.random((el,n_using,n_across))
# idx = np.tril_indices(n_across*2, -n_across)
mat = dask.delayed(np.random.random)((el,n_using,n_across))
idx = dask.delayed(np.tril_indices)(n_across*2, -n_across)
Or by removing the pure=True keyword from dask.delayed (when you set pure=True it has to hash the contents of all inputs to get a unique key for them, and you're doing this 5000 times). I found this out by profiling your code with the %snakeviz magic in IPython.
I then ran all_corr[1].compute() and it was fine. I then ran all_corr.compute() and it seemed like it would progress to completion, but wasn't very fast. I suspect that either your tasks are too small so that there is too much overhead, or that your code is spending too much time in Python for loops and so is running into GIL issues. Not sure which.
The next thing I would recommend trying would be the dask.distributed scheduler, which would handle the GIL issue better but exacerbate the overhead issue. Seeing how that performs would probably help isolate the issue.
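For example, a minimal sketch of switching to the distributed scheduler (assuming the distributed package is installed and lazy_list is built as above, without pure=True):

from dask.distributed import Client

client = Client()                    # local cluster of worker processes
futures = client.compute(lazy_list)  # submit every delayed task
results = client.gather(futures)     # list of numpy arrays, one per task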
I have a tensor A with shape [a,n] and I need to perform an op my_op with another tensor B of shape [b,n] such that the resulting tensor C has shape [a,b].
In other words: for each subtensor in A (A[0], A[1], ..., A[a]) I need to perform an element-wise op with each subtensor in B.
So the resulting tensor would contain the following:
[ [ A[0] op B[0] , A[0] op B[1], ... , A[0] op B[b] ],
[ A[1] op B[0] , A[1] op B[1], ... , A[1] op B[b] ],
[ ... ],
[ A[a] op B[0] , A[a] op B[1], ... , A[a] op B[b] ] ]
The only way I've found to achieve this is through nested use of tf.map_fn.
Thus:
import tensorflow as tf
import time
import numpy as np

a_size = 64
b_size = 256 * 256
n = 256

A = tf.placeholder(tf.float32, [a_size, n])
B = tf.placeholder(tf.float32, [b_size, n])

def elementwise_op(a, b):
    return tf.reduce_sum(tf.multiply(a, b))

def intermediate_op(sub_a, my_b):
    sample_values = tf.map_fn(lambda x: elementwise_op(sub_a, x), my_b)
    return sample_values

my_op = tf.map_fn(lambda x: intermediate_op(x, B), A)

with tf.Session() as sess:
    a = np.random.rand(a_size, n)
    b = np.random.rand(b_size, n)
    start_time = time.time()
    result = sess.run(my_op, feed_dict={A: a, B: b})
    print("exec time: ", time.time() - start_time)
    print(result.shape)
The code above runs fine; however, it does not use the GPU very well (only ~15% utilization, according to nvidia-smi). In fact, it runs an order of magnitude faster when using only the CPU (on my 12-core machine)! When run on the GPU, I see very low GPU utilization (~15%) and 100% on one of my CPU cores. When run on the CPU only, I see 100% utilization across all CPU cores.
Average timing of 5 CPU only runs: 11.33s
Average timing of 5 GPU runs: 111.88s
The above test was run using the official Tensorflow docker images: tensorflow/tensorflow:latest-py3 (for CPU) and tensorflow/tensorflow:latest-gpu-py3 (for GPU)
My guess is that map_fn, via the Python lambda, is forcing data to be copied back and forth between the CPU and GPU at every iteration, and the nested nature of the op just makes it worse. The comments in an unanswered SO question here suggest that this is the case.
This article claims that:
    lambda expression is the main reason of low GPU utilization.
So my question is: Is there a way to force map_fn to use the GPU? Or to avoid the Python lambda?
Alternatively, is there some other (perhaps more tensorflow-y) way to achieve the result described above, in order to get the graph to run on the GPU?
Edit:
After running the profiler (I had to drastically reduce the size of the arrays to get the profiler to run at all, because it was eating up RAM like crazy), the following lines caught my attention:
node name | output bytes | total execution time | accelerator execution time | cpu execution time
Mul 1.02KB (22.23%, 0.29%), 195.07ms (85.00%, 13.06%), 5.29ms (100.00%, 25.79%), 189.78ms (84.79%, 12.89%)
Sum 256B (21.41%, 0.07%), 241.48ms (69.08%, 16.17%), 6.01ms (74.21%, 29.29%), 235.47ms (69.01%, 15.99%)
TensorArrayScatterV3 512B (0.64%, 0.15%), 658.31ms (46.87%, 44.09%), 9.19ms (44.80%, 44.80%), 649.12ms (46.90%, 44.08%)
It looks like certain ops are being done mostly on the CPU, and only on one thread at that!
The tf.map_fn() construct can be used with a function that runs ops on the GPU. By default, TensorFlow will try to run as much of the function as possible on the GPU, and any GPU-incompatible ops will run on the CPU. In your program, the entire elementwise_op() function is built from GPU-compatible ops, so there should be no additional copying between the CPU and GPU at each iteration.
The cause of low GPU utilization is difficult to determine from a program fragment. For example, if A and B are relatively small, and you are feeding them from Python and then immediately fetching back the result, it is likely that the overhead of copying the initial data to and from the GPU would dominate. The best way to track this down is to use a GPU profiler, which you can get using tfprof or the NVIDIA Visual Profiler.
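As a side note (following the same reasoning, not a claim about the answer above): elementwise_op() is a reduce_sum of an element-wise multiply, i.e. a dot product, so the whole nested map_fn collapses into a single matrix multiplication, which runs as one GPU kernel. A sketch using the same shapes as the question:

import numpy as np
import tensorflow as tf

a_size, b_size, n = 64, 256 * 256, 256
A = tf.placeholder(tf.float32, [a_size, n])
B = tf.placeholder(tf.float32, [b_size, n])

# Each C[i, j] = sum_k A[i, k] * B[j, k], which is exactly A @ B^T.
C = tf.matmul(A, B, transpose_b=True)   # shape [a_size, b_size]

with tf.Session() as sess:
    result = sess.run(C, feed_dict={A: np.random.rand(a_size, n),
                                    B: np.random.rand(b_size, n)})
    print(result.shape)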
I've compared processing times for Theano (CPU), Theano (GPU) and scikit-learn (CPU) using Python.
But I got a strange result. Here is the graph that I plotted.
Processing Time Comparison:
You can see that scikit-learn is faster than Theano (GPU).
The program whose elapsed time I measured computes the Euclidean distance matrix from a matrix which has n * 40 elements.
Here is the part of code.
import numpy as np
import theano
import theano.tensor as T
import sklearn.metrics as sk

points = T.fmatrix("points")
edm = T.zeros_like(points)

def get_point_to_points_euclidean_distances(point_id):
    euclideans = T.sqrt((T.sqr(points - points[point_id, :])).sum(axis=1))
    return euclideans

def get_EDM_CPU(points):
    EDM = np.zeros((points.shape[0], points.shape[0])).astype(np.float32)
    for row in range(points.shape[0]):
        EDM[row, :] = np.sqrt(np.sum((points - points[row, :])**2, axis=1))
    return EDM

def get_sk(points):
    EDM = sk.pairwise_distances(points, metric='l2')
    return EDM

seq = T.arange(T.shape(points)[0])
(result, _) = theano.scan(fn=get_point_to_points_euclidean_distances,
                          outputs_info=None,
                          sequences=seq)

get_EDM_GPU = theano.function(inputs=[points], outputs=result, allow_input_downcast=True)
I thought that the reason the GPU is slower than scikit-learn is probably transfer time, so I profiled the GPU with the nvprof command. Then I got this:
==27105== NVPROF is profiling process 27105, command: python ./EDM_test.py
Using gpu device 0: GeForce GTX 580 (CNMeM is disabled, cuDNN not available)
data shape : (10000, 40)
get_EDM_GPU elapsed time : 1.84863090515 (s)
get_EDM_CPU elapsed time : 8.09937691689 (s)
get_EDM_sk elapsed time : 1.10968112946 (s)
ratio : 4.38128395145
==27105== Profiling application: python ./EDM_test.py
==27105== Warning: Found 9 invalid records in the result.
==27105== Warning: This could be because device ran out of memory when profiling.
==27105== Profiling result:
Time(%) Time Calls Avg Min Max Name
71.34% 1.28028s 9998 128.05us 127.65us 128.78us kernel_reduce_01_node_316e2e1cbfbe8cfb8e4a101f329ffeec_0(int, int, float const *, int, int, float*, int)
19.95% 357.97ms 9997 35.807us 35.068us 36.948us kernel_Sub_node_bc41b3f8f12c93d29f2c4360ad445d80_0_2(unsigned int, int, int, float const *, int, int, float const *, int, int, float*, int, int)
7.32% 131.38ms 2 65.690ms 1.2480us 131.38ms [CUDA memcpy DtoH]
1.25% 22.456ms 9996 2.2460us 2.1140us 2.8420us kernel_Sqrt_node_23508f8f49d12f3e8369d543f5620c15_0_Ccontiguous(unsigned int, float const *, float*)
0.12% 2.1847ms 1 2.1847ms 2.1847ms 2.1847ms [CUDA memset]
0.01% 259.73us 5 51.946us 640ns 250.36us [CUDA memcpy HtoD]
0.00% 17.086us 1 17.086us 17.086us 17.086us kernel_reduce_ccontig_node_97496c4d3cf9a06dc4082cc141f918d2_0(unsigned int, float const *, float*)
0.00% 2.0090us 1 2.0090us 2.0090us 2.0090us void copy_kernel<float, int=0>(cublasCopyParams<float>)
The transfer [CUDA memcpy DtoH] was performed twice { 1.248 [us], 131.38 [ms] }
The transfer [CUDA memcpy HtoD] was performed 5x { min: 640 [ns], max: 250.36 [us] }
The total transfer time is about 131.64 ms (131.38 ms + 259.73 us).
But the gap between the GPU and scikit-learn is about 700 ms (1.8 s - 1.1 s), so the gap is larger than the transfer time alone.
Does scikit-learn compute only the upper triangular part of the symmetric matrix?
What makes scikit-learn so fast?
What makes scikit-learn ( on pure CPU-side ) so fast?
My initial candidates would be a mix of:
highly efficient use of the available CPU cores' L1/L2 cache sizes, within the fastest [ns] access distances
smart numpy vectorised execution that is friendly to CPU cache lines
a dataset so small that it can remain entirely in cache without being evicted (to see the DDRx-memory-cost effects on the observed performance, scale the dataset under review well above the L2/L3 cache sizes; details are in the URL below)
numpy might enjoy even better timing if .astype() conversions are avoided (test it)
Facts on the GPU-side
auto-generated GPU kernels do not have much chance of reaching the ultimate levels of global-memory latency masking, compared to manually tweaked kernel designs tailor-fit to the respective GPU silicon architecture / latencies observed in vivo
data structures larger than just a few KB keep paying GPU-SM/GDDR-MEM access latencies of roughly hundreds of [ns], nearly [us], versus the small tens of [ns] at CPU/L1/L2/L3/DDRx; for timing details see https://stackoverflow.com/a/33065382
the task cannot enjoy much of the GPU/SMX power, due to its obviously low reuse of data points and a dataset size beyond the GPU/SM silicon limits, which causes and must cause GPU/SM register capacity spillovers in any kind of GPU-kernel design attempt and tweaking
the global task does not have a reasonable minimum of asynchronous, isolated (non-communicating islands), mathematically dense, yet SMX-local, GPU-kernel processing steps (there is not much to compute, so the add-on overheads and expensive SMX/GDDR memory costs cannot be amortised)
GPUs can exhibit their best performance when sufficiently dense convolution-style re-processing operations take place -- as in large-scale / high-resolution image processing -- on [m,n,o] convolution-kernel matrices small enough that all these m*n*o constant values can reside local to the SM, inside the available set of SMX SM-registers, and when the GPU-kernel launchers are optimally tweaked by the 3D-tblock/grid processing-layout geometries, so that the global memory access latencies are masked as well as possible, with all GPU threads kept within the hardware WARP-aligned SMx:WarpScheduler round-robin thread-scheduling capabilities (the first swap from Round-Robin into Greedy-WarpSchedule mode loses the whole battle in case of divergent execution paths in the GPU-kernel code).
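To make the first point concrete, here is a sketch of the squared-norm / dot-product trick that scikit-learn's euclidean_distances builds on; the heavy lifting is a single BLAS-backed matrix multiplication, which is a large part of why the pure-CPU path is so fast:

import numpy as np

def edm_via_gram(points):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y, so the whole distance
    # matrix reduces to one matrix product plus cheap broadcasting.
    sq_norms = np.einsum('ij,ij->i', points, points)
    gram = np.dot(points, points.T)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    np.maximum(sq_dists, 0.0, out=sq_dists)   # clip tiny negatives from rounding
    return np.sqrt(sq_dists)

points = np.random.rand(10000, 40).astype(np.float32)
EDM = edm_via_gram(points)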
I want to parallelize the simple following expression on 2 GPUs: C = A^n + B^n by calculating A^n on GPU 0 and B^n on GPU 1 before summing the results.
In TensorFlow I would go like:
with tf.device('/gpu:0'):
    An = matpow(A, n)
with tf.device('/gpu:1'):
    Bn = matpow(B, n)
with tf.Session() as sess:
    C = sess.run(An + Bn)
However, since PyTorch is dynamic, I'm having trouble doing the same thing. I tried the following but it only takes more time.
with torch.cuda.device(0):
    A = A.cuda()
with torch.cuda.device(1):
    B = B.cuda()
C = matpow(A, n) + matpow(B, n).cuda(0)
I know there is a module to parallelize models on the batch dimension using torch.nn.DataParallel, but here I'm trying to do something more basic.
You can use CUDA streams for this. This will not necessarily distribute the work over two devices, but execution will be in parallel.
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

with torch.cuda.stream(s1):
    A = torch.pow(A, n)
with torch.cuda.stream(s2):
    B = torch.pow(B, n)

C = A + B
I'm not sure whether it will really speed up your computation if you only parallelize this one operation, though; your matrices would have to be really big to see a benefit.
If your requirement is to split it across devices, you can add this before the streams:
A = A.cuda(0)
B = B.cuda(1)
Then after the power operation, you need to get them on the same device again, e.g. B = B.cuda(0). After that you can do the addition.
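A minimal sketch putting the two pieces together (assuming two GPUs and taking matpow to mean an integer matrix power; kernel launches are asynchronous, so the two powers can overlap even without explicit streams):

import torch

n = 4
A = torch.randn(2000, 2000, device='cuda:0')
B = torch.randn(2000, 2000, device='cuda:1')

# Each launch returns immediately, so GPU 0 and GPU 1 work concurrently.
An = torch.matrix_power(A, n)
Bn = torch.matrix_power(B, n)

# Bring the second result onto GPU 0, then sum (this synchronizes as needed).
C = An + Bn.to('cuda:0')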
Is there a good way to pass a large chunk of data between two python subprocesses without using the disk? Here's a cartoon example of what I'm hoping to accomplish:
import sys, subprocess, numpy

cmdString = """
import sys, numpy

done = False
while not done:
    cmd = raw_input()
    if cmd == 'done':
        done = True
    elif cmd == 'data':
        ##Fake data. In real life, get data from hardware.
        data = numpy.zeros(1000000, dtype=numpy.uint8)
        data.dump('data.pkl')
        sys.stdout.write('data.pkl' + '\\n')
        sys.stdout.flush()"""

proc = subprocess.Popen( #python vs. pythonw on Windows?
    [sys.executable, '-c %s'%cmdString],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)

for i in range(3):
    proc.stdin.write('data\n')
    print proc.stdout.readline().rstrip()
    a = numpy.load('data.pkl')
    print a.shape
proc.stdin.write('done\n')
This creates a subprocess which generates a numpy array and saves the array to disk. The parent process then loads the array from disk. It works!
The problem is, our hardware can generate data 10x faster than the disk can read/write. Is there a way to transfer data from one python process to another purely in-memory, maybe even without making a copy of the data? Can I do something like passing-by-reference?
My first attempt at transferring data purely in-memory is pretty lousy:
import sys, subprocess, numpy

cmdString = """
import sys, numpy

done = False
while not done:
    cmd = raw_input()
    if cmd == 'done':
        done = True
    elif cmd == 'data':
        ##Fake data. In real life, get data from hardware.
        data = numpy.zeros(1000000, dtype=numpy.uint8)
        ##Note that this is NFG if there's a '10' in the array:
        sys.stdout.write(data.tostring() + '\\n')
        sys.stdout.flush()"""

proc = subprocess.Popen( #python vs. pythonw on Windows?
    [sys.executable, '-c %s'%cmdString],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)

for i in range(3):
    proc.stdin.write('data\n')
    a = numpy.fromstring(proc.stdout.readline().rstrip(), dtype=numpy.uint8)
    print a.shape
proc.stdin.write('done\n')
This is extremely slow (much slower than saving to disk) and very, very fragile. There's got to be a better way!
I'm not married to the 'subprocess' module, as long as the data-taking process doesn't block the parent application. I briefly tried 'multiprocessing', but without success so far.
Background: We have a piece of hardware that generates up to ~2 GB/s of data in a series of ctypes buffers. The python code to handle these buffers has its hands full just dealing with the flood of information. I want to coordinate this flow of information with several other pieces of hardware running simultaneously in a 'master' program, without the subprocesses blocking each other. My current approach is to boil the data down a little bit in the subprocess before saving to disk, but it'd be nice to pass the full monty to the 'master' process.
While googling around for more information about the code Joe Kington posted, I found the numpy-sharedmem package. Judging from this numpy/multiprocessing tutorial it seems to share the same intellectual heritage (maybe largely the same authors? -- I'm not sure).
Using the sharedmem module, you can create a shared-memory numpy array (awesome!), and use it with multiprocessing like this:
import sharedmem as shm
import numpy as np
import multiprocessing as mp

def worker(q, arr):
    done = False
    while not done:
        cmd = q.get()
        if cmd == 'done':
            done = True
        elif cmd == 'data':
            ##Fake data. In real life, get data from hardware.
            rnd = np.random.randint(100)
            print('rnd={0}'.format(rnd))
            arr[:] = rnd
        q.task_done()

if __name__ == '__main__':
    N = 10
    arr = shm.zeros(N, dtype=np.uint8)
    q = mp.JoinableQueue()
    proc = mp.Process(target=worker, args=[q, arr])
    proc.daemon = True
    proc.start()

    for i in range(3):
        q.put('data')
        # Wait for the computation to finish
        q.join()
        print(arr.shape)
        print(arr)

    q.put('done')
    proc.join()
Running yields
rnd=53
(10,)
[53 53 53 53 53 53 53 53 53 53]
rnd=15
(10,)
[15 15 15 15 15 15 15 15 15 15]
rnd=87
(10,)
[87 87 87 87 87 87 87 87 87 87]
Basically, you just want to share a block of memory between processes and view it as a numpy array, right?
In that case, have a look at this (Posted to numpy-discussion by Nadav Horesh awhile back, not my work). There are a couple of similar implementations (some more flexible), but they all essentially use this principle.
# "Using Python, multiprocessing and NumPy/SciPy for parallel numerical computing"
# Modified and corrected by Nadav Horesh, Mar 2010
# No rights reserved
import numpy as N
import ctypes
import multiprocessing as MP

_ctypes_to_numpy = {
    ctypes.c_char   : N.dtype(N.uint8),
    ctypes.c_wchar  : N.dtype(N.int16),
    ctypes.c_byte   : N.dtype(N.int8),
    ctypes.c_ubyte  : N.dtype(N.uint8),
    ctypes.c_short  : N.dtype(N.int16),
    ctypes.c_ushort : N.dtype(N.uint16),
    ctypes.c_int    : N.dtype(N.int32),
    ctypes.c_uint   : N.dtype(N.uint32),
    ctypes.c_long   : N.dtype(N.int64),
    ctypes.c_ulong  : N.dtype(N.uint64),
    ctypes.c_float  : N.dtype(N.float32),
    ctypes.c_double : N.dtype(N.float64)}

_numpy_to_ctypes = dict(zip(_ctypes_to_numpy.values(), _ctypes_to_numpy.keys()))

def shmem_as_ndarray(raw_array, shape=None):
    address = raw_array._obj._wrapper.get_address()
    size = len(raw_array)
    if (shape is None) or (N.asarray(shape).prod() != size):
        shape = (size,)
    elif type(shape) is int:
        shape = (shape,)
    else:
        shape = tuple(shape)

    dtype = _ctypes_to_numpy[raw_array._obj._type_]

    class Dummy(object): pass
    d = Dummy()
    d.__array_interface__ = {
        'data'    : (address, False),
        'typestr' : dtype.str,
        'descr'   : dtype.descr,
        'shape'   : shape,
        'strides' : None,
        'version' : 3}
    return N.asarray(d)

def empty_shared_array(shape, dtype, lock=True):
    '''
    Generate an empty MP shared array given ndarray parameters
    '''
    if type(shape) is not int:
        shape = N.asarray(shape).prod()
    try:
        c_type = _numpy_to_ctypes[dtype]
    except KeyError:
        c_type = _numpy_to_ctypes[N.dtype(dtype)]
    return MP.Array(c_type, shape, lock=lock)

def emptylike_shared_array(ndarray, lock=True):
    'Generate an empty shared array with the size and dtype of a given array'
    return empty_shared_array(ndarray.size, ndarray.dtype, lock)
From the other answers, it seems that numpy-sharedmem is the way to go.
However, if you need a pure Python solution, or if installing extensions, Cython or the like is a (big) hassle, you might want to use the following code, which is a simplified version of Nadav's code:
import numpy, ctypes, multiprocessing

_ctypes_to_numpy = {
    ctypes.c_char   : numpy.dtype(numpy.uint8),
    ctypes.c_wchar  : numpy.dtype(numpy.int16),
    ctypes.c_byte   : numpy.dtype(numpy.int8),
    ctypes.c_ubyte  : numpy.dtype(numpy.uint8),
    ctypes.c_short  : numpy.dtype(numpy.int16),
    ctypes.c_ushort : numpy.dtype(numpy.uint16),
    ctypes.c_int    : numpy.dtype(numpy.int32),
    ctypes.c_uint   : numpy.dtype(numpy.uint32),
    ctypes.c_long   : numpy.dtype(numpy.int64),
    ctypes.c_ulong  : numpy.dtype(numpy.uint64),
    ctypes.c_float  : numpy.dtype(numpy.float32),
    ctypes.c_double : numpy.dtype(numpy.float64)}

_numpy_to_ctypes = dict(zip(_ctypes_to_numpy.values(),
                            _ctypes_to_numpy.keys()))

def shm_as_ndarray(mp_array, shape=None):
    '''Given a multiprocessing.Array, returns an ndarray pointing to
    the same data.'''

    # support SynchronizedArray:
    if not hasattr(mp_array, '_type_'):
        mp_array = mp_array.get_obj()

    dtype = _ctypes_to_numpy[mp_array._type_]
    result = numpy.frombuffer(mp_array, dtype)

    if shape is not None:
        result = result.reshape(shape)

    return numpy.asarray(result)

def ndarray_to_shm(array, lock=False):
    '''Generate a 1D multiprocessing.Array containing the data from
    the passed ndarray. The data will be *copied* into shared
    memory.'''

    array1d = array.ravel(order='A')

    try:
        c_type = _numpy_to_ctypes[array1d.dtype]
    except KeyError:
        c_type = _numpy_to_ctypes[numpy.dtype(array1d.dtype)]

    result = multiprocessing.Array(c_type, array1d.size, lock=lock)
    shm_as_ndarray(result)[:] = array1d
    return result
You would use it like this (a complete sketch follows after these steps):
Use sa = ndarray_to_shm(a) to convert the ndarray a into a shared multiprocessing.Array.
Use multiprocessing.Process(target=somefunc, args=(sa,)) (and start, maybe join) to call somefunc in a separate process, passing the shared array.
In somefunc, use a = shm_as_ndarray(sa) to get an ndarray pointing to the shared data. (Actually, you may want to do the same in the original process, immediately after creating sa, in order to have two ndarrays referencing the same data.)
AFAICS, you don't need to set lock to True, since shm_as_ndarray will not use the locking anyhow. If you need locking, you would set lock to True and call acquire/release on sa.
Also, if your array is not 1-dimensional, you might want to transfer the shape along with sa (e.g. use args = (sa, a.shape)).
This solution has the advantage that it does not need additional packages or extension modules, except multiprocessing (which is in the standard library).
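Putting those steps together, a minimal sketch (assuming the two helper functions above live in a module the child process can import, and passing the shape along as suggested):

import numpy, multiprocessing

def somefunc(sa, shape):
    # Reattach to the shared buffer inside the child process.
    a = shm_as_ndarray(sa, shape)
    a *= 2                          # in-place change, visible to the parent

if __name__ == '__main__':
    a = numpy.arange(6, dtype=numpy.float64).reshape(2, 3)
    sa = ndarray_to_shm(a)                 # data is copied into shared memory
    view = shm_as_ndarray(sa, a.shape)     # parent-side view of the same buffer
    p = multiprocessing.Process(target=somefunc, args=(sa, a.shape))
    p.start()
    p.join()
    print(view)                            # reflects the child's changes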
Use threads. But I guess you are going to get problems with the GIL.
Instead: Choose your poison.
I know from the MPI implementations I work with that they use shared memory for on-node communication. You will have to code your own synchronization in that case.
At 2 GB/s you will run into problems with most "easy" methods, depending on your real-time constraints and available main memory.
One possibility to consider is to use a RAM drive for the temporary storage of files to be shared between processes. A RAM drive is where a portion of RAM is treated as a logical hard drive, to which files can be written/read as you would with a regular drive, but at RAM read/write speeds.
This article describes using the ImDisk software (for MS Windows) to create such a disk, and reports file read/write speeds of 6-10 gigabytes per second:
https://www.tekrevue.com/tip/create-10-gbs-ram-disk-windows/
An example in Ubuntu:
https://askubuntu.com/questions/152868/how-do-i-make-a-ram-disk#152871
Another noted benefit is that files in arbitrary formats can be passed around with this method: e.g. Pickle, JSON, XML, CSV, HDF5, etc.
Keep in mind that anything stored on the RAM disk is wiped on reboot.
Use threads. You probably won't have problems with the GIL.
The GIL only affects Python code, not C/Fortran/Cython backed libraries. Most numpy operations and a good chunk of the C-backed Scientific Python stack release the GIL and can operate just fine on multiple cores. This blogpost discusses the GIL and scientific Python in more depth.
Edit
Simple ways to use threads include the threading module and multiprocessing.pool.ThreadPool.
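For instance, a small sketch with multiprocessing.pool.ThreadPool (assuming the work is numpy/BLAS-heavy, so the GIL is released during the number crunching):

from multiprocessing.pool import ThreadPool
import numpy as np

blocks = [np.random.rand(2000, 2000) for _ in range(8)]

def crunch(block):
    # numpy/BLAS release the GIL here, so the threads can use separate cores.
    return np.dot(block, block.T).trace()

pool = ThreadPool(4)
results = pool.map(crunch, blocks)
pool.close()
pool.join()
print(results)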