Calling MATLAB from Python is bound to give some performance reduction that I could avoid by rewriting (a lot of) code in Python. However, this isn't a realistic option for me, but it annoys me that a huge loss of efficiency lies in the simple conversion from a numpy array to a MATLAB double.
I'm talking about the following conversion from data1 to data1m, where
data1 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
data1m = matlab.double(list(data1))
Here matlab.double comes from Mathworks own MATLAB package / engine. The second line of code takes 20 s on my system, which just seems like too much for a conversion that doesn't really do anything other than making the numbers 'edible' for MATLAB.
So basically I'm looking for a trick opposite to the one given here that works for converting MATLAB output back to Python.
Passing numpy arrays efficiently
Take a look at the file mlarray_sequence.py in the folder PYTHONPATH\Lib\site-packages\matlab\_internal. There you will find the construction of the MATLAB array object. The performance problem comes from copying data with loops within the generic_flattening function.
To avoid this behavior we will edit the file a bit. This fix should work on complex and non-complex datatypes.
Make a backup of the original file in case something goes wrong.
Add import numpy as np to the other imports at the beginning of the file
In line 38 you should find:
init_dims = _get_size(initializer)
replace this with:
try:
init_dims=initializer.shape
except:
init_dims = _get_size(initializer)
In line 48 you should find:
if is_complex:
complex_array = flat(self, initializer,
init_dims, typecode)
self._real = complex_array['real']
self._imag = complex_array['imag']
else:
self._data = flat(self, initializer, init_dims, typecode)
Replace this with:
if is_complex:
try:
self._real = array.array(typecode,np.ravel(initializer, order='F').real)
self._imag = array.array(typecode,np.ravel(initializer, order='F').imag)
except:
complex_array = flat(self, initializer,init_dims, typecode)
self._real = complex_array['real']
self._imag = complex_array['imag']
else:
try:
self._data = array.array(typecode,np.ravel(initializer, order='F'))
except:
self._data = flat(self, initializer, init_dims, typecode)
Now you can pass a numpy array directly to the MATLAB array creation method.
data1 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
#faster
data1m = matlab.double(data1)
#or slower method
data1m = matlab.double(data1.tolist())
data2 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,)).astype(np.complex128)
#faster
data1m = matlab.double(data2,is_complex=True)
#or slower method
data1m = matlab.double(data2.tolist(),is_complex=True)
The performance in MATLAB array creation increases by a factor of 15 and the interface is easier to use now.
While awaiting better suggestions, I'll post the best trick I've come up with so far. It comes down to saving the file with `scipy.io.savematĀ“ and then loading this file in MATLAB.
This is not the prettiest hack and it requires some care to ensure different processes relying on the same script don't end up writing and loading each other's .mat files, but the performance gain is worth it for me.
As a test case I wrote two simple, almost identical MATLAB functions that require 2 numpy arrays (I tested with length 1000000) and one int as input.
function d = test(x, y, fs_signal)
d = sum((x + y))./double(fs_signal);
function d = test2(path)
load(path)
d = sum((x + y))./double(fs_signal);
The function test requires conversion, while test2 requires saving.
Testing test: Converting the two numpy arrays takes cirka 40 s on my system. The total time to prepare for and run test comes down to 170 s
Testing test2: Saving the arrays and int takes cirka 0.35 s on my system. Suprisingly, loading the .mat file in MATLAB is extremely efficient (or more suprisingly, it is extremely ineffcient at dealing with its doubles)... The total time to prepare for and run test2 comes down to 0.38 s
That's a performance gain of almost 450x...
My situation was a bit different (python script called from matlab) but for me converting the ndarray into an array.array massively speed up the process. Basically it is very similar to Alexandre Chabot solution but without the need to alter any files:
#untested i.e. only deducted from my "matlab calls python" situation
import numpy
import array
data1 = numpy.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
ar = array.array('d',data1.flatten('F').tolist())
p = matlab.double(ar)
C = matlab.reshape(p,data1.shape) #this part I am definitely not sure about if it will work like that
At least if done from Matlab the combination of "array.array" and "double" is relative fast. Tested with Matlab 2016b + python 3.5.4 64bit.
Related
Background
I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions, number of incoming and outgoing particles can vary, so the files consist of variable length records. For analysis I use python and numpy. After switching from python 2 to python 3 I have noticed a dramatic decrease in performance of my scripts and traced it down to numpy.fromfile function.
Simplified code to reproduce the problem
This code, iotest.py
Generates a file of a similar structure to what I have in my studies
Reads it using numpy.fromfile
Reads it using numpy.frombuffer
Compares timing of both
import numpy as np
import os
def generate_binary_file(filename, nrecords):
n_records = np.random.poisson(lam = nrecords)
record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
with open(filename, 'wb') as f:
s = 0
for i in range(n_records):
f.write(record_lengths[i].tobytes())
f.write(x[s:s+record_lengths[i]].tobytes())
s += record_lengths[i]
# Trick for testing: make sum of records equal to 0
f.write(np.array([1], dtype = 'i4').tobytes())
f.write(np.array([-x.sum()], dtype = 'd').tobytes())
return os.path.getsize(filename)
def read_binary_npfromfile(filename):
checksum = 0.0
with open(filename, 'rb') as f:
while True:
try:
record_length = np.fromfile(f, 'i4', 1)[0]
x = np.fromfile(f, 'd', record_length)
checksum += x.sum()
except:
break
assert(np.abs(checksum) < 1e-6)
def read_binary_npfrombuffer(filename):
checksum = 0.0
with open(filename, 'rb') as f:
while True:
try:
record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
checksum += x.sum()
except:
break
assert(np.abs(checksum) < 1e-6)
if __name__ == '__main__':
from timeit import Timer
from functools import partial
fname = 'testfile.tmp'
print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
fsize = generate_binary_file(fname, i)
t1 = Timer(partial(read_binary_npfromfile, fname))
t2 = Timer(partial(read_binary_npfrombuffer, fname))
a1 = np.array(t1.repeat(5, 1))
a2 = np.array(t2.repeat(5, 1))
print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))
Results
Conclusions
In Python 2 numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. Performance of both scaled linearly with file size.
In Python 3 numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! Performance of both still scales linearly with file size.
In the documentation of numpy.fromfile it is described as "A highly efficient way of reading binary data with a known data-type". It is not correct in Python 3 anymore. This was in fact noticed earlier by other people already.
Questions
In Python 3 how to obtain a comparable (or better) performance to Python 2, when reading binary files of variable structure?
What happened in Python 3 so that numpy.fromfile became an order of magnitude slower?
TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file in a big buffer and then decode it very efficiently using Numba.
Analysis
The main issue is that the benchmark measure overheads. Indeed, it perform a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops calls 601_214 times np.fromfile and np.frombuffer. The timing on my machine are 10.5s for read_binary_npfromfile and 1.2s for read_binary_npfrombuffer. This means respectively 17.4 us and 2.0 us per call for the two function. Such timing per call are relatively reasonable considering Numpy is not designed to efficiently operate on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate some objects, etc.). The overhead of these functions can change from one version to another and unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impact overheads and this appear to be the case here (eg. buffering interface). The point is that it is not really a problem because there is a way to use a different approach that is much much faster (as it does not pay huge overheads).
Faster Numpy code
The main solution to write a fast implementation is to read the whole file once in a big byte buffer and then decode it using np.view. That being said, this is a bit tricky because of data alignment and the fact that nearly all Numpy function needs to be prohibited in the while loop due to their overhead. Here is an example:
def read_binary_faster_numpy(filename):
buff = np.fromfile(filename, dtype=np.uint8)
buff_int32 = buff.view(np.int32)
buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
nblocks = buff.size // 4 # Number of 4-byte blocks
pos = 0 # Displacement by block of 4 bytes
lst = []
while pos < nblocks:
record_length = buff_int32[pos]
pos += 1
if pos + record_length * 2 > nblocks:
break
offset = pos // 2
if pos % 2 == 0: # Aligned with buff_double_1
x = buff_double_1[offset:offset+record_length]
else: # Aligned with buff_double_2
x = buff_double_2[offset:offset+record_length]
lst.append(x) # np.sum is too expensive here
pos += record_length * 2
checksum = np.sum(np.concatenate(lst))
assert(np.abs(checksum) < 1e-6)
The above implementation should be faster but it is a bit tricky to understand and it is still bounded by the latency of Numpy operations. Indeed, the loop is still calling Numpy functions due to operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overheads of indexing is much smaller than the one of previous functions, it is still quite big for such a critical loop (with ~300_000 iterations)...
Better performance with... a basic pure-Python code
It turns out that the following pure-python implementation is faster, safer and simpler:
from struct import unpack_from
def read_binary_python_struct(filename):
checksum = 0.0
with open(filename, 'rb') as f:
data = f.read()
offset = 0
while offset < len(data):
record_length = unpack_from('#i', data, offset)[0]
checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
offset += 4 + record_length * 8
assert(np.abs(checksum) < 1e-6)
This is because the overhead of unpack_from is far lower than the one of Numpy functions but it is still not great.
In fact, now the main issue is actually the CPython interpreter. It is clearly not designed with high-performance in mind. The above code push it to the limit. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive. This is not reasonable to let CPython do such an operation.
Writing a high-performance code with Numba
We can drastically speed it up using Numba which can compile Numpy-based Python codes to native ones using a just-in-time compiler! Here is an example:
#nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
checksum = 0.0
offset = 0
while offset + 4 < buff.size:
record_length = buff[offset:offset+4].view(np.int32)[0]
start = offset + 4
end = start + record_length * 8
if end > buff.size:
break
x = buff[start:end].view(np.float64)
checksum += x.sum()
offset = end
return checksum
def read_binary_numba(filename):
buff = np.fromfile(filename, dtype=np.uint8)
checksum = decode_buffer(buff)
assert(np.abs(checksum) < 1e-6)
Numba removes nearly all Numpy overheads thanks to a native compiled code. That being said note that Numba does not implement all Numpy functions yet. This include np.fromfile which need to be called outside a Numba-compiled function.
Benchmark
Here are the performance results on my machine (i5-9600KF with a high-performance Nvme SSD) with Python 3.8.1, Numpy 1.20.3 and Numba 0.54.1.
read_binary_npfromfile: 10616 ms ( x1)
read_binary_npfrombuffer: 1132 ms ( x9)
read_binary_faster_numpy: 509 ms ( x21)
read_binary_python_struct: 222 ms ( x48)
read_binary_numba: 12 ms ( x885)
Optimal time: 7 ms (x1517)
One can see that the Numba implementation is extremely fast compared to the initial Python implementation and even to the fastest alternative Python implementation. This is especially true considering that 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!
I am new to cython and have the following code for a numpy for loop that I am trying to optimize. So far, this Cython code isn't much faster than the numpy for loop.
# cython: infer_types = True
import numpy as np
cimport numpy
DTYPE = np.double
def hdcfTransfomation(scanData):
cdef Py_ssize_t Position
scanLength = scanData.shape[0]
hdcfFunction_np = np.zeros(scanLength, dtype = DTYPE)
cdef double [::1] hdcfFunction = hdcfFunction_np
for position in range(scanLength - 1):
topShift = scanData[1 + position:]
bottomShift = scanData[:-(position + 1)]
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
hdcfFunction[position] = arrayMean
return hdcfFunction
I know that using C math library functions would be more ideal than calling back into the numpy language (subtract, square, mean), but I am not sure where I can find a list of functions that can be called in this manner.
I have been trying to figure out ways to optimize this code by using different types, ect. but nothing is providing the performance that I think is possible with a fully optimized implementation of Cython.
Here is a working example of the numpy for-loop:
def hdcfTransfomation(scanData):
scanLength = scanData.shape[0]
hdcfFunction = np.zeros(scanLength)
for position in range(scanLength - 1):
topShift = scanData[1 + position:]
bottomShift = scanData[:-(position + 1)]
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
hdcfFunction[position] = arrayMean
return hdcfFunction
scanDataArray = np.random.rand(80000, 1)
transformedScan = hdcfTransformed(scanDataArray)
Always provide as much informations as possible (some example data, Python/Cython Version, Compiler Version/Settings and CPU Model.
Without that it is quite hard to compare any timings. For example this problem benefits quite well from SIMD-vectorization. It will make quite a difference which compiler you use or if you want to redistribute a compiled version which should also run on low-end or quite old CPUS (eg. no AVX).
I am not very familiar with Cython, but I think your main problem is the missing declaration for scanData. Maybe the C-Compiler needs additional flags like march=native, but the real syntax is compiler dependend. I am am also not sure how Cython or the C-compiler optimizes this part:
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
If that loops (all vectorized commands are actually loops) are not joined, but intead there are temporary arryas like in pure Python created, this will slow down the code. It will be a good idea to create a 1D array first. (eg. scanData=scanData[::1]
As said I am not that familliar with Cython, I tried what is possible with Numba. At least it shows what should also be possible with a resonable good Cython implementation.
Maybe easier to otimize for the compiler
import numba as nb
import numpy as np
#nb.njit(fastmath=True,error_model='numpy',parallel=True)
#scanData is a 1D-array here
def hdcfTransfomation(scanData):
scanLength = scanData.shape[0]
hdcfFunction = np.zeros(scanLength, dtype = scanData.dtype)
for position in nb.prange(scanLength - 1):
topShift = scanData[1 + position:]
bottomShift = scanData[:scanData.shape[0]-(position + 1)]
sum=0.
jj=0
for i in range(scanLength-(position + 1)):
jj+=1
sum+=(topShift[i]-bottomShift[i])**2
hdcfFunction[position] = sum/jj
return hdcfFunction
I also used parallelization here, because the problem is embarrassingly parallel. At least with a size of 80_000 and Numba it doesn't matter if you use a slightly modified version of your code (1D-array), or the code above.
Timings
#Quadcore Core i7-4th gen,Numba 0.4dev,Python 3.6
scanData=np.random.rand(80_000)
#The first call to the function isn't measured (compilation overhead),but the following calls.
Pure Python: 5900ms
Numba single-threaded: 947ms
Numba parallel: 260ms
Especially for larger arrays than np.random.rand(80_000) there may be better aproaches (loop tilling for better cache usage), but for this size that should be more or less OK (At least it fits in the L3-cache)
Naive GPU Implementation
from numba import cuda, float32
#cuda.jit('void(float32[:], float32[:])')
def hdcfTransfomation_gpu(scanData,out_data):
scanLength = scanData.shape[0]
position = cuda.grid(1)
if position < scanLength - 1:
sum= float32(0.)
offset=1 + position
for i in range(scanLength-offset):
sum+=(scanData[i+offset]-scanData[i])**2
out_data[position] = sum/(scanLength-offset)
hdcfTransfomation_gpu[scanData.shape[0]//64,64](scanData,res_3)
This gives about 400ms on a GT640 (float32) and 970ms (float64). For a good implemenation shared arrays should be considered.
Putting cython aside, does this do the same thing as your current code but without a for loop? We can tighten it up and correct for inaccuracies, but the first port of call is to try apply operations in numpy to 2D arrays before turning to cython for for loops. It's too long to put in a comment.
import numpy as np
# Setup
arr = np.random.choice(np.arange(10), 100).reshape(10, 10)
top_shift = arr[:, :-1]
bottom_shift = arr[:, 1:]
arr_diff = top_shift - bottom_shift
arr_squared = np.square(arr_diff)
arr_mean = arr_squared.mean(axis=1)
I have a distributed dask cluster setup and I have used it to load and transform a bunch of data. Works like a charm.
I'm want to use it do some processing in parallel. Here's my function
el = 5000
n_using = 26
n_across= 6
mat = np.random.random((el,n_using,n_across))
idx = np.tril_indices(n_across*2, -n_across)
def get_vals(c1, m, el, idx):
m1 = m[c1,:,:]
corr_vals = np.zeros((el, (n_across//2)*(n_across+1)))
for c2 in range(c1+1, el):
corr = np.corrcoef(m1.T, m[c2,:,:].T)
corr_vals[c2] = corr[idx]
return corr_vals
lazy_get_val = dask.delayed(get_vals, pure=True)
Here is a single processor version of what I'm trying to do:
arrays = [get_vals(c1, mat, el, idx) for c1 in range(el)]
all_corr = np.stack(arrays, axis=0)
Works fine but takes a few hours.
Here's my go at doing this in dask:
lazy_list = [lazy_get_val(c1, mat, el, idx) for c1 in range(el)]
arrays = [da.from_delayed(lazy_item, dtype=float, shape=(el, 21)) for lazy_item in lazy_list]
all_corr = da.stack(arrays, axis=0)
Even if it run all_corr[1].compute(), it just sits there and doesn't respond. When I interrupt the kernel, it seems to be stuck at /distributed/utils.py:
~/.../lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
249 else:
250 while not e.is_set():
--> 251 e.wait(10)
252 if error[0]:
253 six.reraise(*error[0])
Any suggestions on debugging this?
Other things:
If I run it with a smaller mat (el=1000) and it runs fine.
If I make el = 5000, it hangs.
If I interrupt the kernel and run it again with el = 1000, it hangs.
After adding imports to the example I ran things and it was very slow while building the graph. This can be improved by avoiding placing numpy arrays directly in delayed calls as follows:
# mat = np.random.random((el,n_using,n_across))
# idx = np.tril_indices(n_across*2, -n_across)
mat = dask.delayed(np.random.random)((el,n_using,n_across))
idx = dask.delayed(np.tril_indices)(n_across*2, -n_across)
Or by removing the pure=True keyword to dask.delayed (when you set pure=True it has to hash the contents of all inputs to get a unique key for them, you're doing this 5000 times). I found this out by profiling your code with the %snakeviz magic in IPython.
I then ran all_corr[1].compute() and it was fine. I then ran all_corr.compute() and it seemed like it would progress to completion, but wasn't very fast. I suspect that either your tasks are too small so that there is too much overhead, or that your code is spending too much time in Python for loops and so is running into GIL issues. Not sure which.
The next thing I would recommend trying would be using the dask.distributed scheduler, which would handle the GIL issue better and exacerbate the overhead issue. Seeing how that performed would probably help isolate the issue.
Calling MATLAB from Python is bound to give some performance reduction that I could avoid by rewriting (a lot of) code in Python. However, this isn't a realistic option for me, but it annoys me that a huge loss of efficiency lies in the simple conversion from a numpy array to a MATLAB double.
I'm talking about the following conversion from data1 to data1m, where
data1 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
data1m = matlab.double(list(data1))
Here matlab.double comes from Mathworks own MATLAB package / engine. The second line of code takes 20 s on my system, which just seems like too much for a conversion that doesn't really do anything other than making the numbers 'edible' for MATLAB.
So basically I'm looking for a trick opposite to the one given here that works for converting MATLAB output back to Python.
Passing numpy arrays efficiently
Take a look at the file mlarray_sequence.py in the folder PYTHONPATH\Lib\site-packages\matlab\_internal. There you will find the construction of the MATLAB array object. The performance problem comes from copying data with loops within the generic_flattening function.
To avoid this behavior we will edit the file a bit. This fix should work on complex and non-complex datatypes.
Make a backup of the original file in case something goes wrong.
Add import numpy as np to the other imports at the beginning of the file
In line 38 you should find:
init_dims = _get_size(initializer)
replace this with:
try:
init_dims=initializer.shape
except:
init_dims = _get_size(initializer)
In line 48 you should find:
if is_complex:
complex_array = flat(self, initializer,
init_dims, typecode)
self._real = complex_array['real']
self._imag = complex_array['imag']
else:
self._data = flat(self, initializer, init_dims, typecode)
Replace this with:
if is_complex:
try:
self._real = array.array(typecode,np.ravel(initializer, order='F').real)
self._imag = array.array(typecode,np.ravel(initializer, order='F').imag)
except:
complex_array = flat(self, initializer,init_dims, typecode)
self._real = complex_array['real']
self._imag = complex_array['imag']
else:
try:
self._data = array.array(typecode,np.ravel(initializer, order='F'))
except:
self._data = flat(self, initializer, init_dims, typecode)
Now you can pass a numpy array directly to the MATLAB array creation method.
data1 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
#faster
data1m = matlab.double(data1)
#or slower method
data1m = matlab.double(data1.tolist())
data2 = np.random.uniform(low = 0.0, high = 30000.0, size = (1000000,)).astype(np.complex128)
#faster
data1m = matlab.double(data2,is_complex=True)
#or slower method
data1m = matlab.double(data2.tolist(),is_complex=True)
The performance in MATLAB array creation increases by a factor of 15 and the interface is easier to use now.
While awaiting better suggestions, I'll post the best trick I've come up with so far. It comes down to saving the file with `scipy.io.savematĀ“ and then loading this file in MATLAB.
This is not the prettiest hack and it requires some care to ensure different processes relying on the same script don't end up writing and loading each other's .mat files, but the performance gain is worth it for me.
As a test case I wrote two simple, almost identical MATLAB functions that require 2 numpy arrays (I tested with length 1000000) and one int as input.
function d = test(x, y, fs_signal)
d = sum((x + y))./double(fs_signal);
function d = test2(path)
load(path)
d = sum((x + y))./double(fs_signal);
The function test requires conversion, while test2 requires saving.
Testing test: Converting the two numpy arrays takes cirka 40 s on my system. The total time to prepare for and run test comes down to 170 s
Testing test2: Saving the arrays and int takes cirka 0.35 s on my system. Suprisingly, loading the .mat file in MATLAB is extremely efficient (or more suprisingly, it is extremely ineffcient at dealing with its doubles)... The total time to prepare for and run test2 comes down to 0.38 s
That's a performance gain of almost 450x...
My situation was a bit different (python script called from matlab) but for me converting the ndarray into an array.array massively speed up the process. Basically it is very similar to Alexandre Chabot solution but without the need to alter any files:
#untested i.e. only deducted from my "matlab calls python" situation
import numpy
import array
data1 = numpy.random.uniform(low = 0.0, high = 30000.0, size = (1000000,))
ar = array.array('d',data1.flatten('F').tolist())
p = matlab.double(ar)
C = matlab.reshape(p,data1.shape) #this part I am definitely not sure about if it will work like that
At least if done from Matlab the combination of "array.array" and "double" is relative fast. Tested with Matlab 2016b + python 3.5.4 64bit.
Is there a good way to pass a large chunk of data between two python subprocesses without using the disk? Here's a cartoon example of what I'm hoping to accomplish:
import sys, subprocess, numpy
cmdString = """
import sys, numpy
done = False
while not done:
cmd = raw_input()
if cmd == 'done':
done = True
elif cmd == 'data':
##Fake data. In real life, get data from hardware.
data = numpy.zeros(1000000, dtype=numpy.uint8)
data.dump('data.pkl')
sys.stdout.write('data.pkl' + '\\n')
sys.stdout.flush()"""
proc = subprocess.Popen( #python vs. pythonw on Windows?
[sys.executable, '-c %s'%cmdString],
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
for i in range(3):
proc.stdin.write('data\n')
print proc.stdout.readline().rstrip()
a = numpy.load('data.pkl')
print a.shape
proc.stdin.write('done\n')
This creates a subprocess which generates a numpy array and saves the array to disk. The parent process then loads the array from disk. It works!
The problem is, our hardware can generate data 10x faster than the disk can read/write. Is there a way to transfer data from one python process to another purely in-memory, maybe even without making a copy of the data? Can I do something like passing-by-reference?
My first attempt at transferring data purely in-memory is pretty lousy:
import sys, subprocess, numpy
cmdString = """
import sys, numpy
done = False
while not done:
cmd = raw_input()
if cmd == 'done':
done = True
elif cmd == 'data':
##Fake data. In real life, get data from hardware.
data = numpy.zeros(1000000, dtype=numpy.uint8)
##Note that this is NFG if there's a '10' in the array:
sys.stdout.write(data.tostring() + '\\n')
sys.stdout.flush()"""
proc = subprocess.Popen( #python vs. pythonw on Windows?
[sys.executable, '-c %s'%cmdString],
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
for i in range(3):
proc.stdin.write('data\n')
a = numpy.fromstring(proc.stdout.readline().rstrip(), dtype=numpy.uint8)
print a.shape
proc.stdin.write('done\n')
This is extremely slow (much slower than saving to disk) and very, very fragile. There's got to be a better way!
I'm not married to the 'subprocess' module, as long as the data-taking process doesn't block the parent application. I briefly tried 'multiprocessing', but without success so far.
Background: We have a piece of hardware that generates up to ~2 GB/s of data in a series of ctypes buffers. The python code to handle these buffers has its hands full just dealing with the flood of information. I want to coordinate this flow of information with several other pieces of hardware running simultaneously in a 'master' program, without the subprocesses blocking each other. My current approach is to boil the data down a little bit in the subprocess before saving to disk, but it'd be nice to pass the full monty to the 'master' process.
While googling around for more information about the code Joe Kington posted, I found the numpy-sharedmem package. Judging from this numpy/multiprocessing tutorial it seems to share the same intellectual heritage (maybe largely the same authors? -- I'm not sure).
Using the sharedmem module, you can create a shared-memory numpy array (awesome!), and use it with multiprocessing like this:
import sharedmem as shm
import numpy as np
import multiprocessing as mp
def worker(q,arr):
done = False
while not done:
cmd = q.get()
if cmd == 'done':
done = True
elif cmd == 'data':
##Fake data. In real life, get data from hardware.
rnd=np.random.randint(100)
print('rnd={0}'.format(rnd))
arr[:]=rnd
q.task_done()
if __name__=='__main__':
N=10
arr=shm.zeros(N,dtype=np.uint8)
q=mp.JoinableQueue()
proc = mp.Process(target=worker, args=[q,arr])
proc.daemon=True
proc.start()
for i in range(3):
q.put('data')
# Wait for the computation to finish
q.join()
print arr.shape
print(arr)
q.put('done')
proc.join()
Running yields
rnd=53
(10,)
[53 53 53 53 53 53 53 53 53 53]
rnd=15
(10,)
[15 15 15 15 15 15 15 15 15 15]
rnd=87
(10,)
[87 87 87 87 87 87 87 87 87 87]
Basically, you just want to share a block of memory between processes and view it as a numpy array, right?
In that case, have a look at this (Posted to numpy-discussion by Nadav Horesh awhile back, not my work). There are a couple of similar implementations (some more flexible), but they all essentially use this principle.
# "Using Python, multiprocessing and NumPy/SciPy for parallel numerical computing"
# Modified and corrected by Nadav Horesh, Mar 2010
# No rights reserved
import numpy as N
import ctypes
import multiprocessing as MP
_ctypes_to_numpy = {
ctypes.c_char : N.dtype(N.uint8),
ctypes.c_wchar : N.dtype(N.int16),
ctypes.c_byte : N.dtype(N.int8),
ctypes.c_ubyte : N.dtype(N.uint8),
ctypes.c_short : N.dtype(N.int16),
ctypes.c_ushort : N.dtype(N.uint16),
ctypes.c_int : N.dtype(N.int32),
ctypes.c_uint : N.dtype(N.uint32),
ctypes.c_long : N.dtype(N.int64),
ctypes.c_ulong : N.dtype(N.uint64),
ctypes.c_float : N.dtype(N.float32),
ctypes.c_double : N.dtype(N.float64)}
_numpy_to_ctypes = dict(zip(_ctypes_to_numpy.values(), _ctypes_to_numpy.keys()))
def shmem_as_ndarray(raw_array, shape=None ):
address = raw_array._obj._wrapper.get_address()
size = len(raw_array)
if (shape is None) or (N.asarray(shape).prod() != size):
shape = (size,)
elif type(shape) is int:
shape = (shape,)
else:
shape = tuple(shape)
dtype = _ctypes_to_numpy[raw_array._obj._type_]
class Dummy(object): pass
d = Dummy()
d.__array_interface__ = {
'data' : (address, False),
'typestr' : dtype.str,
'descr' : dtype.descr,
'shape' : shape,
'strides' : None,
'version' : 3}
return N.asarray(d)
def empty_shared_array(shape, dtype, lock=True):
'''
Generate an empty MP shared array given ndarray parameters
'''
if type(shape) is not int:
shape = N.asarray(shape).prod()
try:
c_type = _numpy_to_ctypes[dtype]
except KeyError:
c_type = _numpy_to_ctypes[N.dtype(dtype)]
return MP.Array(c_type, shape, lock=lock)
def emptylike_shared_array(ndarray, lock=True):
'Generate a empty shared array with size and dtype of a given array'
return empty_shared_array(ndarray.size, ndarray.dtype, lock)
From the other answers, it seems that numpy-sharedmem is the way to go.
However, if you need a pure python solution, or installing extensions, cython or the like is a (big) hassle, you might want to use the following code which is a simplified version of Nadav's code:
import numpy, ctypes, multiprocessing
_ctypes_to_numpy = {
ctypes.c_char : numpy.dtype(numpy.uint8),
ctypes.c_wchar : numpy.dtype(numpy.int16),
ctypes.c_byte : numpy.dtype(numpy.int8),
ctypes.c_ubyte : numpy.dtype(numpy.uint8),
ctypes.c_short : numpy.dtype(numpy.int16),
ctypes.c_ushort : numpy.dtype(numpy.uint16),
ctypes.c_int : numpy.dtype(numpy.int32),
ctypes.c_uint : numpy.dtype(numpy.uint32),
ctypes.c_long : numpy.dtype(numpy.int64),
ctypes.c_ulong : numpy.dtype(numpy.uint64),
ctypes.c_float : numpy.dtype(numpy.float32),
ctypes.c_double : numpy.dtype(numpy.float64)}
_numpy_to_ctypes = dict(zip(_ctypes_to_numpy.values(),
_ctypes_to_numpy.keys()))
def shm_as_ndarray(mp_array, shape = None):
'''Given a multiprocessing.Array, returns an ndarray pointing to
the same data.'''
# support SynchronizedArray:
if not hasattr(mp_array, '_type_'):
mp_array = mp_array.get_obj()
dtype = _ctypes_to_numpy[mp_array._type_]
result = numpy.frombuffer(mp_array, dtype)
if shape is not None:
result = result.reshape(shape)
return numpy.asarray(result)
def ndarray_to_shm(array, lock = False):
'''Generate an 1D multiprocessing.Array containing the data from
the passed ndarray. The data will be *copied* into shared
memory.'''
array1d = array.ravel(order = 'A')
try:
c_type = _numpy_to_ctypes[array1d.dtype]
except KeyError:
c_type = _numpy_to_ctypes[numpy.dtype(array1d.dtype)]
result = multiprocessing.Array(c_type, array1d.size, lock = lock)
shm_as_ndarray(result)[:] = array1d
return result
You would use it like this:
Use sa = ndarray_to_shm(a) to convert the ndarray a into a shared multiprocessing.Array.
Use multiprocessing.Process(target = somefunc, args = (sa, ) (and start, maybe join) to call somefunc in a separate process, passing the shared array.
In somefunc, use a = shm_as_ndarray(sa) to get an ndarray pointing to the shared data. (Actually, you may want to do the same in the original process, immediately after creating sa, in order to have two ndarrays referencing the same data.)
AFAICS, you don't need to set lock to True, since shm_as_ndarray will not use the locking anyhow. If you need locking, you would set lock to True and call acquire/release on sa.
Also, if your array is not 1-dimensional, you might want to transfer the shape along with sa (e.g. use args = (sa, a.shape)).
This solution has the advantage that it does not need additional packages or extension modules, except multiprocessing (which is in the standard library).
Use threads. But I guess you are going to get problems with the GIL.
Instead: Choose your poison.
I know from the MPI implementations I work with, that they use shared memory for on-node-communications. You will have to code your own synchronization in that case.
2 GB/s sounds like you will get problems with most "easy" methods, depending on your real-time constraints and available main memory.
One possibility to consider is to use a RAM drive for the temporary storage of files to be shared between processes. A RAM drive is where a portion of RAM is treated as a logical hard drive, to which files can be written/read as you would with a regular drive, but at RAM read/write speeds.
This article describes using the ImDisk software (for MS Win) to create such disk and obtains file read/write speeds of 6-10 Gigabytes/second:
https://www.tekrevue.com/tip/create-10-gbs-ram-disk-windows/
An example in Ubuntu:
https://askubuntu.com/questions/152868/how-do-i-make-a-ram-disk#152871
Another noted benefit is that files with arbitrary formats can be passed around with such method: e.g. Picke, JSON, XML, CSV, HDF5, etc...
Keep in mind that anything stored on the RAM disk is wiped on reboot.
Use threads. You probably won't have problems with the GIL.
The GIL only affects Python code, not C/Fortran/Cython backed libraries. Most numpy operations and a good chunk of the C-backed Scientific Python stack release the GIL and can operate just fine on multiple cores. This blogpost discusses the GIL and scientific Python in more depth.
Edit
Simple ways to use threads include the threading module and multiprocessing.pool.ThreadPool.