I'm trying to parallelize operations on a two-dimensional array using the joblib library in Python. Here is the code I have:
from joblib import Parallel, delayed
import multiprocessing
import numpy as np
# The code below just aggregates the base_array to form a new two dimensional array
base_array = np.ones((2**12, 2**12), dtype=np.uint8)
def compute_average(i, j):
return np.uint8(np.mean(base_array[i*4: (i+1)*4, j*4: (j+1)*4]))
num_cores = multiprocessing.cpu_count()
new_array = np.array(Parallel(n_jobs=num_cores)(delayed(compute_average)(i, j)
for i in xrange(0,1024) for j in xrange(0,1024)), dtype=np.uint8)
The above code takes more time than the basic nested for loop below.
new_array_nested = np.ones((2**10, 2**10), dtype=np.uint8)
for i in xrange(0,1024):
for j in xrange(0,1024):
new_array_nested[i,j] = compute_average(i,j)
Why are parallel operations taking more time? How can the efficiency of the above code be improved?
We can quite easily get somewhere under 77 [ms], but it takes mastering a few steps to get there, so let's start:
Q: Why are parallel operations taking more time?
Because the proposed step with joblib creates that many full-scale process copies - so as to escape from the GIL-stepped, pure-[SERIAL] ( one-after-another ) execution - but (!) this includes the add-on costs of all the memory transfers ( very expensive / sensitive for indeed large numpy arrays ) of all variables and of the whole python interpreter and its internal state, before each copy gets to start the first step of the "useful" work on your "payload"-calculation strategy,
so
the sum of all these instantiation overheads can easily become larger than the overhead-agnostic expectation of an inversely proportional 1 / N factor,
where you set the N ~ num_cores.
For details, read the mathematical formulation in the tail part of Amdahl's Law re-formulation here.
Q: How can the efficiency of the above code be improved?
Save as much as possible on all overhead costs:
- where possible:
- on the process spawn-side, try to use n_jobs = ( num_cores - 1 ) to leave some headroom for the "main" process going forward, and benchmark whether performance goes up
- on the process termination-side, avoid collecting-and-constructing a new ( possibly large ) object from the returned values; rather, pre-allocate just-large-enough process-local data structures and return something efficiently serialised, for easy and non-blocking coalescing of the per-partes returned results.
Both of these "hidden" costs are your main design enemies, as they get linearly added to the pure-[SERIAL] part of the computing-path of the whole problem solution ( ref.: the effects of both of these in the overhead-strict Amdahl's Law formula ). A sketch of both ideas follows.
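A minimal sketch of both ideas ( the helper name compute_row_of_averages is illustrative, not from the original post ): each delayed call now produces a whole 1024-element row of tile-means instead of a single value, cutting the number of tasks from 2**20 down to 1024, and one core is left free for the parent process.

from joblib import Parallel, delayed
import multiprocessing
import numpy as np

base_array = np.ones( (2**12, 2**12), dtype = np.uint8 )

def compute_row_of_averages( i ):
    band = base_array[i*4:(i+1)*4, :]                              # a (4, 4096) slice
    return band.reshape( 4, 1024, 4 ).mean( axis = (0, 2) ).astype( np.uint8 )

num_cores = multiprocessing.cpu_count()
rows = Parallel( n_jobs = max( 1, num_cores - 1 ) )(               # leave one core free
           delayed( compute_row_of_averages )( i ) for i in range( 1024 )
       )
new_array = np.vstack( rows )                                      # assemble once, at the end

This is a sketch under the stated assumptions, not a guarantee of speedup - the per-task payload is still small, so benchmark it against the vectorised approaches shown further below.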
Experiments & Results:
>>> from zmq import Stopwatch; aClk = Stopwatch()
>>> base_array = np.ones( (2**12, 2**12), dtype = np.uint8 )
>>> base_array.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
>>> def compute_average_per_TILE( TILE_i, TILE_j ):  # NAIVE MODE
... return np.uint8( np.mean( base_array[ 4*TILE_i:4*(TILE_i+1),
... 4*TILE_j:4*(TILE_j+1)
... ]
... )
... )
...
>>> aClk.start(); _ = compute_average_per_TILE( 12,13 ); aClk.stop()
25110
102
109
93
This takes about 93 [us] per shot, giving an expectation of about 1024*1024*93 ~ 97,517,568 [us] to cover the mean-processing over the whole base_array.
Experimentally, one can nicely see here the impact of not very well handled overheads; the naive-nested experiment took:
>>> aClk.start(); _ = [ compute_average_per_TILE( i, j )
for i in xrange(1024)
for j in xrange(1024)
]; aClk.stop()
26310594
^^......
26310594 / 1024. / 1024. == 25.09 [us/cell]
which is about 3.7x less per cell ( due to not incurring the "tail"-part overheads ( assignment of individual returned values ) 2**20 times, but just once, at the terminal assignment ).
Yet, more surprises to come.
What is a proper tool here?
There is never a universal rule, no one-size-fits-all.
Given
not more than just a 4x4 matrix tile is going to be processed per call ( actually taking less than 25 [us] each ) in the proposed joblib-orchestrated spawn of 2**20 calls, distributed over ~ .cpu_count() fully instantiated processes by the original proposal
...( joblib.Parallel( n_jobs = num_cores )(
joblib.delayed( compute_average )( i, j )
for i in xrange( 1024 )
for j in xrange( 1024 )
)
there is indeed a space to improve the performance.
For these small-scale matrices ( not all problems are so happy in this sense ), one can expect best results from smarter-memory access patterns and from reducing python GIL-originated weaknesses.
As the per-call span is just a 4x4, micro-sized computation, a way better approach is to harness smart vectorisation ( all data fit in cache, so in-cache computing is a holiday journey for hunting the utmost performance ).
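For reference, a single fully vectorised numpy expression ( a plain reshape-based block-mean, given here as an assumed illustration, not the author's exact code ) computes all the 4x4 tile-means at once, with no per-tile python-call overhead at all:

>>> new_array_vec = ( base_array.reshape( 1024, 4, 1024, 4 )
...                             .mean( axis = ( 1, 3 ) )
...                             .astype( np.uint8 )
...                   )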
The best ( still very naively vectorised code )
was able to get from ~ 25 [us/cell] down to less than ~ 74 [ns/cell] ( still leaving space for better-aligned processing, as it took ~ 4.6 [ns] per base_array cell ), so expect yet another level of speedups, if the in-cache optimised vectorised code gets crafted properly.
In 77 [ms] ?! Worth doing that right, isn't it?
Not 97 seconds,
not 25 seconds,
but less than 77 [ms] in just a few strokes of the keyboard, and more could have been squeezed off, if better optimising the call-signature:
>>> import numba
>>> @numba.jit( nogil = True, nopython = True )
... def jit_avg2( base_IN, ret_OUT ):  # all pre-allocated memory for these data-structures
...     for i in np.arange( 1024 ):   # vectorised-code ready numpy iterator
...         for j in np.arange( 1024 ):  # vectorised-code ready numpy iterator
... ret_OUT[i,j] = np.uint8( np.mean( base_IN[4*i:4*(i+1),
... 4*j:4*(j+1)
... ]
... )
... )
...     return  # avoid terminal assignment costs
...
>>> mean_array = np.zeros( (1024, 1024), dtype = np.uint8 )  # pre-allocated result array
>>> aClk.start(); _ = jit_avg2( base_array, mean_array ); aClk.stop()
1586182 (even with all the jit-compilation circus, it was FASTER than GIL-stepped nested fors ...)
76935
77337
Background
I am analyzing large (between 0.5 and 20 GB) binary files, which contain information about particle collisions from a simulation. The number of collisions, number of incoming and outgoing particles can vary, so the files consist of variable length records. For analysis I use python and numpy. After switching from python 2 to python 3 I have noticed a dramatic decrease in performance of my scripts and traced it down to numpy.fromfile function.
Simplified code to reproduce the problem
This code, iotest.py
Generates a file of a similar structure to what I have in my studies
Reads it using numpy.fromfile
Reads it using numpy.frombuffer
Compares timing of both
import numpy as np
import os
def generate_binary_file(filename, nrecords):
n_records = np.random.poisson(lam = nrecords)
record_lengths = np.random.poisson(lam = 10, size = n_records).astype(dtype = 'i4')
x = np.random.normal(size = record_lengths.sum()).astype(dtype = 'd')
with open(filename, 'wb') as f:
s = 0
for i in range(n_records):
f.write(record_lengths[i].tobytes())
f.write(x[s:s+record_lengths[i]].tobytes())
s += record_lengths[i]
# Trick for testing: make sum of records equal to 0
f.write(np.array([1], dtype = 'i4').tobytes())
f.write(np.array([-x.sum()], dtype = 'd').tobytes())
return os.path.getsize(filename)
def read_binary_npfromfile(filename):
checksum = 0.0
with open(filename, 'rb') as f:
while True:
try:
record_length = np.fromfile(f, 'i4', 1)[0]
x = np.fromfile(f, 'd', record_length)
checksum += x.sum()
except:
break
assert(np.abs(checksum) < 1e-6)
def read_binary_npfrombuffer(filename):
checksum = 0.0
with open(filename, 'rb') as f:
while True:
try:
record_length = np.frombuffer(f.read(np.dtype('i4').itemsize), dtype = 'i4', count = 1)[0]
x = np.frombuffer(f.read(np.dtype('d').itemsize * record_length), dtype = 'd', count = record_length)
checksum += x.sum()
except:
break
assert(np.abs(checksum) < 1e-6)
if __name__ == '__main__':
from timeit import Timer
from functools import partial
fname = 'testfile.tmp'
print("# File size[MB], Timings and errors [s]: fromfile, frombuffer")
for i in [10**3, 3*10**3, 10**4, 3*10**4, 10**5, 3*10**5, 10**6, 3*10**6]:
fsize = generate_binary_file(fname, i)
t1 = Timer(partial(read_binary_npfromfile, fname))
t2 = Timer(partial(read_binary_npfrombuffer, fname))
a1 = np.array(t1.repeat(5, 1))
a2 = np.array(t2.repeat(5, 1))
print('%8.3f %12.6f %12.6f %12.6f %12.6f' % (1.0 * fsize / (2**20), a1.mean(), a1.std(), a2.mean(), a2.std()))
Results
Conclusions
In Python 2 numpy.fromfile was probably the fastest way to deal with binary files of variable structure. It was approximately 3 times faster than numpy.frombuffer. Performance of both scaled linearly with file size.
In Python 3 numpy.frombuffer became around 10% slower, while numpy.fromfile became around 9.3 times slower compared to Python 2! Performance of both still scales linearly with file size.
In the documentation of numpy.fromfile it is described as "A highly efficient way of reading binary data with a known data-type". It is not correct in Python 3 anymore. This was in fact noticed earlier by other people already.
Questions
In Python 3 how to obtain a comparable (or better) performance to Python 2, when reading binary files of variable structure?
What happened in Python 3 so that numpy.fromfile became an order of magnitude slower?
TL;DR: np.fromfile and np.frombuffer are not optimized to read many small buffers. You can load the whole file in a big buffer and then decode it very efficiently using Numba.
Analysis
The main issue is that the benchmark measures overheads. Indeed, it performs a lot of system/C calls that are very inefficient. For example, on the 24 MiB file, the while loops call np.fromfile and np.frombuffer 601_214 times. The timings on my machine are 10.5s for read_binary_npfromfile and 1.2s for read_binary_npfrombuffer. This means respectively 17.4 us and 2.0 us per call for the two functions. Such timings per call are relatively reasonable considering Numpy is not designed to efficiently operate on very small arrays (it needs to perform many checks, call some functions, wrap/unwrap CPython types, allocate some objects, etc.). The overhead of these functions can change from one version to another and, unless it becomes huge, this is not a bug. The addition of new features to Numpy and CPython often impacts overheads and this appears to be the case here (eg. the buffering interface). The point is that it is not really a problem because there is a way to use a different approach that is much much faster (as it does not pay huge overheads).
Faster Numpy code
The main solution for writing a fast implementation is to read the whole file once into a big byte buffer and then decode it using ndarray.view. That being said, this is a bit tricky because of data alignment and the fact that nearly all Numpy functions need to be kept out of the while loop due to their overhead. Here is an example:
def read_binary_faster_numpy(filename):
buff = np.fromfile(filename, dtype=np.uint8)
buff_int32 = buff.view(np.int32)
buff_double_1 = buff[0:len(buff)//8*8].view(np.float64)
buff_double_2 = buff[4:4+(len(buff)-4)//8*8].view(np.float64)
nblocks = buff.size // 4 # Number of 4-byte blocks
pos = 0 # Displacement by block of 4 bytes
lst = []
while pos < nblocks:
record_length = buff_int32[pos]
pos += 1
if pos + record_length * 2 > nblocks:
break
offset = pos // 2
if pos % 2 == 0: # Aligned with buff_double_1
x = buff_double_1[offset:offset+record_length]
else: # Aligned with buff_double_2
x = buff_double_2[offset:offset+record_length]
lst.append(x) # np.sum is too expensive here
pos += record_length * 2
checksum = np.sum(np.concatenate(lst))
assert(np.abs(checksum) < 1e-6)
The above implementation should be faster, but it is a bit tricky to understand and it is still bound by the latency of Numpy operations. Indeed, the loop is still calling Numpy functions due to operations like buff_int32[pos] or buff_double_1[offset:offset+record_length]. Even though the overhead of indexing is much smaller than that of the previous functions, it is still quite big for such a critical loop (with ~300_000 iterations)...
Better performance with... a basic pure-Python code
It turns out that the following pure-python implementation is faster, safer and simpler:
from struct import unpack_from
def read_binary_python_struct(filename):
checksum = 0.0
with open(filename, 'rb') as f:
data = f.read()
offset = 0
while offset < len(data):
record_length = unpack_from('@i', data, offset)[0]
checksum += sum(unpack_from(f'{record_length}d', data, offset + 4))
offset += 4 + record_length * 8
assert(np.abs(checksum) < 1e-6)
This is because the overhead of unpack_from is far lower than that of the Numpy functions, but it is still not great.
In fact, now the main issue is actually the CPython interpreter. It is clearly not designed with high performance in mind. The above code pushes it to the limit. Allocating millions of temporary reference-counted dynamic objects like variable-sized integers and strings is very expensive. It is not reasonable to let CPython do such an operation.
Writing a high-performance code with Numba
We can drastically speed it up using Numba which can compile Numpy-based Python codes to native ones using a just-in-time compiler! Here is an example:
import numba as nb

@nb.njit('float64(uint8[::1])')
def decode_buffer(buff):
checksum = 0.0
offset = 0
while offset + 4 < buff.size:
record_length = buff[offset:offset+4].view(np.int32)[0]
start = offset + 4
end = start + record_length * 8
if end > buff.size:
break
x = buff[start:end].view(np.float64)
checksum += x.sum()
offset = end
return checksum
def read_binary_numba(filename):
buff = np.fromfile(filename, dtype=np.uint8)
checksum = decode_buffer(buff)
assert(np.abs(checksum) < 1e-6)
Numba removes nearly all the Numpy overheads thanks to natively compiled code. That being said, note that Numba does not implement all Numpy functions yet. This includes np.fromfile, which needs to be called outside a Numba-compiled function.
Benchmark
Here are the performance results on my machine (i5-9600KF with a high-performance Nvme SSD) with Python 3.8.1, Numpy 1.20.3 and Numba 0.54.1.
read_binary_npfromfile: 10616 ms ( x1)
read_binary_npfrombuffer: 1132 ms ( x9)
read_binary_faster_numpy: 509 ms ( x21)
read_binary_python_struct: 222 ms ( x48)
read_binary_numba: 12 ms ( x885)
Optimal time: 7 ms (x1517)
One can see that the Numba implementation is extremely fast compared to the initial Python implementation and even to the fastest alternative Python implementation. This is especially true considering that 8 ms is spent in np.fromfile and only 4 ms in decode_buffer!
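For completeness, a hedged sketch ( not the author's exact benchmark harness ) of how the two phases of read_binary_numba can be timed separately with the standard library:

import time
import numpy as np

t0 = time.perf_counter()
buff = np.fromfile('testfile.tmp', dtype=np.uint8)   # I/O phase, outside Numba
t1 = time.perf_counter()
checksum = decode_buffer(buff)                       # decoding phase, Numba-compiled
t2 = time.perf_counter()
print('fromfile: %.3f s, decode_buffer: %.3f s' % (t1 - t0, t2 - t1))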
I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing, is multiplying an array by a float32 then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda
lengthY = 16 # 2D array Y axis
lengthX = 2**16 # X axis
totalops = lengthY * lengthX * 2 # MAC operation has 2 operations
iters = 100
doParallel = True
@njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
output = (float)(0.0)
multconst = (float)(.99)
output = np.sum(np.multiply(testarray, multconst))
return output
@njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(lengthX):
output += multconst*testarray[y,x]
return output
@njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(int(lengthX/4)):
xn = x*4
output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)
# ==== MAC_numpy ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_numpy(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
start = timer()
output = MAC_01(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_04 ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_04(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
Q : How can I get better results?
1st : Learn how to avoid doing useless work - you can straight away eliminate HALF of the FLOP-s, not to speak of also avoiding half of all the RAM-I/O-s, each one costing +100~350 [ns] per write-back.
Due to the distributive nature of MUL and ADD, ( a.C + b.C ) == ( a + b ).C, it is better to first np.sum( A ) and only after that MUL the sum by the (float) constant.
# output = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum( testarray)*multconst #######################
2nd : Learn how to best align data with the order of processing ( cache-line re-use gets you ~100x faster re-use of pre-fetched data ). Not aligning vectorised-code with these already pre-fetched data just makes your code pay the RAM-access latencies many times, instead of smartly re-using the already-paid-for data-blocks. Designing work-units aligned with this principle means a few SLOCs more, but the rewards are worth it - who gets ~100x faster CPUs+RAM for free, right now, or about a ~100x speedup for free, just from not writing badly or naively designed looping iterators? ( a sketch follows )
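A tiny illustration of this 2nd point ( an assumed example, not from the original post ): on a C-contiguous array, letting the inner loop walk the contiguous axis re-uses each fetched cache-line, while striding across rows does not.

import numpy as np
from numba import njit

@njit(fastmath=True)
def sum_row_major(a):              # inner loop walks contiguous memory
    s = 0.0
    for y in range(a.shape[0]):
        for x in range(a.shape[1]):
            s += a[y, x]
    return s

@njit(fastmath=True)
def sum_col_major(a):              # inner loop strides across rows -> poor cache-line re-use
    s = 0.0
    for x in range(a.shape[1]):
        for y in range(a.shape[0]):
            s += a[y, x]
    return s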
3rd : Learn how to efficiently harness vectorised ( block-directed ) operations inside numpy or numba code-blocks, and avoid pressing numba to spend time on auto-analysing the call-signatures ( you pay extra time for this auto-analysis per call, while you designed the code and knew exactly what data-types are going to go there, so why pay extra for auto-analysis each time a numba-block gets called? ) ( a sketch follows )
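A minimal sketch of this 3rd point ( illustrative names, assumed here ): pinning the call-signature up-front makes numba compile eagerly, once, instead of auto-analysing the argument types lazily.

import numpy as np
from numba import njit, float32

@njit(float32(float32[:, ::1]), fastmath=True)       # explicit signature -> eager compilation
def MAC_sig(a):
    return np.float32(np.sum(a) * np.float32(0.99))

x = np.random.rand(16, 2**16).astype(np.float32)
print(MAC_sig(x))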
4th : Learn where the extended Amdahl's Law, having all the relevant add-on costs and processing atomicity put into the game, supports your wish to get speedups, so as not to ever pay way more than you will get back ( to at least justify the add-on costs ... ) - paying extra costs and not getting any reward is possible, yet it has no beneficial impact on your code's performance ( rather the opposite ).
5th : Learn when and how manually created inline(s) may save your code, once steps 1-4 are well learnt and routinely exercised with proper craftsmanship ( using popular COTS frameworks is fine, yet these may deliver results after a few days of work, while a hand-crafted, single-purpose, smartly designed assembly code was able to get the same results in about 12 minutes(!), not several days, without any GPU/CPU tricks etc. - yes, that much faster - just by not doing a single step more than what was needed for the numerical processing of the large matrix data ).
Did I mention float32 may surprise by being processed slower on small scales than float64, while on larger data-scales ~ n [GB] the RAM I/O-times grow slower for the more efficient float32 pre-fetches? That never happens here, as a float64 array gets processed - unless one explicitly instructs the constructor(s) to down-convert the default data type, like this:
np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )
>>> np.random.rand( 10, 2 ).dtype
dtype('float64')
Avoiding extensive memory allocations is another performance trick, supported in numpy call-signatures ( e.g. the out= parameter ). Using this option for large arrays will save you a lot of extra time wasted on mem-allocs for large interim arrays. Re-using already pre-allocated memory-zones and wisely controlled gc-policing are other signs of a professional focused on low-latency & design-for-performance.
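And a short sketch of the pre-allocation trick hinted at above ( illustrative, using numpy's out= parameter ): the interim array is allocated once and re-used, instead of being re-created on every pass.

import numpy as np

a   = np.random.rand(16, 2**16).astype(np.float32)   # float32 from the start
b   = np.random.rand(16, 2**16).astype(np.float32)
out = np.empty_like(a)                                # pre-allocated once, re-used

for _ in range(100):
    np.add(a, b, out=out)                             # no new interim array per iteration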
Basically what I'm doing is calculating a vector and then writing the results. Right now I'm writing it row by row, then I have to transpose it afterwards with datamash transpose, which is kind of annoying. It would be way more efficient if I could write a tab-delimited table column-by-column.
How can I do this in Python (can it be done with np.savetxt)?
Here's one of my attempts below:
In [12]: import numpy as np
In [13]: data = np.random.normal(size=(10,3))
In [14]: data
Out[14]:
array([[-0.50469426, -0.9710173 , 0.43285955],
[ 0.71702597, -0.99998294, -1.00353228],
[ 0.77699465, -0.66542361, -0.04594868],
[-0.71012566, -1.46451086, -0.95308903],
[ 0.47470605, -0.56792278, 0.95818696],
[ 1.20729071, -0.04735589, -0.11576503],
[ 1.03686861, 0.72149358, 0.35908901],
[ 0.09520535, 0.24437775, -0.59554944],
[-0.13346795, 0.29530724, 0.17524018],
[-0.16433609, 0.05261348, -0.57545287]])
In [15]: with open("example.tsv", "w") as f:
...: for row in data:
...: print(*row, sep="\t\n", file=f, end="")
...:
In [16]: %%bash
...: cat example.tsv
...:
...:
-0.5046942610111921
-0.9710173002083825
0.43285954686999120.7170259682395401
-0.9999829435149956
-1.003532284093560.7769946455220355
-0.6654236121150638
-0.04594868270526936-0.7101256559235657
-1.4645108615674511
-0.95308903163753660.4747060486691834
-0.5679227787239494
0.9581869616594761.207290709055818
-0.047355888296561795
-0.115765032633420781.036868614074581
0.7214935810053711
0.35908901158512140.09520535113648704
0.2443777544867152
-0.5955494427027563-0.1334679518044996
0.29530724431573385
0.1752401825058493-0.1643360874489238
0.052613481327433834
-0.5754528683216069
Having 10 GB of data, it's worth starting with a few remarks first:
Unless your data-processing pipelines are outside of your control and you have to use the plain-text GNU datamash text-file transformations anyway, the numpy-side is an absolute master in smart vector/array processing, if one decides to harness all its powers.
Key factors are:
in-RAM-layout: numpy can change this in a snap, as it tricks you by handling the axes abstractly
on-disk-layout: numpy can read both F-like ( FORTRAN, column-wise ) and C-like ( row-wise ) data-files
This said, your python side is, thanks to numpy, as flexible as the crowd of scientific-computing developers has long been used to for designing, developing and performance-tuning. Fortran has used column-first ordering since ever, due to performance advantages, and it has been put into the numpy-tools for the very same performance reasons.
The numpy smart in-RAM-layout tricks:
|>>> from zmq import Stopwatch
|>>> aClk = Stopwatch()
|>>> aClk.start(); a = np.arange( 1E3 );aClk.stop() #_____ tiny arrays are cheapest
46 #_____ ~ 46 [us]+may live in-cache
|>>> aClk.start(); _ = np.arange( 1E6 );aClk.stop() #_____ small arrays are "fast"
10131 #_____ ~ 10 [ms] to alloc. RAM
|
|
|>>> aClk.start(); _ = np.arange( 1E8 );aClk.stop() #_____ going 1E8+ SWAPs a lot
23970419 #_____ ~ 24 [s] + O/S SWAPs RAM
+0:00:34.974360
|
|>>> _.shape
(100000000,)
|>>> _.size * _.itemsize
800000000
|>>> _.dtype
dtype('float64')
|>>> _.flags
C_CONTIGUOUS : True <--------+--- vectors are {C|F}-contiguous blocks
F_CONTIGUOUS : True <--------+
OWNDATA : True <------------ indeed owns own data + may have cheap "np.view"-s
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
The in-RAM-layout is a bit of a hidden issue and, if you have never been into numpy-tricks and performance-tweaking details, this might be the first place you hear about it.
|>>> aClk.start();_ = _.reshape( _.shape[0]/10, 10 );aClk.stop() # ___ make 1E8 array into [1E7,1E1] shape
24 # ___ ~ 24 [us]
# i.e. NOTHING HAS MOVED in memory, just the hidden indexing tricks were adapted
# not a single byte from all of the 0.8 GB, now array, has been moved
# the same will happen with numpy.ndarray.T smart-tricked operations
|>>> _.flags
C_CONTIGUOUS : True <--------- the vector ceased to be a vector, the matrix is C-row-wise
F_CONTIGUOUS : False as not a single byte was moved in in-RAM-layout
OWNDATA : False <--------- now, view-alike manipulations on "foreign"-owned data
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
The smart tools can "transpose" ( actually doing nothing but indexing tricks ) in almost zero-time:
|>>> aClk.start();_.shape;aClk.stop()
(10000000, 10) # BLOB is [1E7,1E1] ~0.8 GB
25 # ... 23 [us] to tell us the .shape
|>>> aClk.start();_.T.shape;aClk.stop() # go TRANSPOSE !
(10, 10000000) # ok BLOB is [1E1,1E7]
43 # ...~43 [us]
# -23 [us] to tell us the .shape
# == 20 [us] to TRANSPOSE BLOB ??????
# NO transpose is tricked by adapted indexing only
The transpose operation just changes the way indices get mapped onto an already RAM-allocated memory-area, all without moving a single byte of the 0.8 GB sized BLOB-data in RAM - just smart-indexing tricks - so the column-wise arrangement can be used as it comes, at zero add-on cost ( not counting those ~ 20 microseconds ... )
|>>> _.T.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True <------------- now, F-column-wise arrangement gets set
OWNDATA : False <------------ and, again, on "foreign"-owned data
WRITEABLE : True already
ALIGNED : True put in-RAM
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
Anyway, when writing any high-performance, smart, vectorised code for somewhat larger datasets, one will always benefit from taking due care about the in-RAM-layout and from tuning the vectorised-code to "follow" ( not fight against ) the actual layout used. Best read about the cache-memory hierarchies of your processor/computing-platform and, for ultimate performance, always test each ordering scenario at the target scales of vectors/n-dim-arrays before making the vectorised-code into HPC-production ( cache-lines and cache-associativity are very different across platforms and the HPC-impacts are indeed immense if your code "violates" the best alignment possible; such code wastes a lot of time right due to such inefficiencies ).
One may spend a lot, or smart-save, by using on-disk-layout tricks:
as noted above, unless forbidden to do so, one may increase the performance of on-storage ops by smart-compressing, first, all the output that numpy is to store on-disk. Be it a numpy way, or dill.dump(), or a similarly developed compression-library.
Starting from the exemplified code above:
In [15]: with open("example.tsv", "w") as f:
...: for row in data:
...: print(*row, sep="\t\n", file=f, end="")
is very expensive, as for-loops are particularly slow ( expensive ) in python and the whole job could be done smarter using the "HPC"-grade numpy-tools ( a sketch follows ).
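A hedged sketch of that "smarter" numpy-only route ( assuming np.savetxt is acceptable here ): write the transposed view in one vectorised call, so no datamash transpose pass is needed afterwards.

import numpy as np

data = np.random.normal( size = (10, 3) )
np.savetxt( "example.tsv", data.T, delimiter = "\t" )   # columns of data become rows on disk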
Additionally, data-compression is both fast per-se and way cheaper than moving uncompressed data onto storage and back. SSD-media are faster than conventional media, yet ( in 2019 ) the benefits of compression remain a smaller data-footprint and faster I/O-data-flows, which in most if not all cases justifies the time spent using numpy.savez_compressed() to move just the compressed ( .ZIP_DEFLATED ) blob, instead of always spending 8-bits-per-each-digit on ASCII-file plain-number representations.
numpy.load() is the input-direction call for the numpy-case; dill can even save/restore the whole python-interpreter session, which may have a wider use in scientific workflow automation ( plus saving restorable snapshots ).
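A minimal sketch of that compressed round-trip ( the file name is illustrative ):

import numpy as np

data = np.random.normal( size = (10, 3) )
np.savez_compressed( "example.npz", data = data )   # one compressed ( .ZIP_DEFLATED ) blob
restored = np.load( "example.npz" )["data"]         # loss-free, binary round-trip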
python also provides zipfile tools for direct work with compressed-files, so a need to resort to GNU command-line tools may be limited to some indeed corner-cases.
A simple program which calculates square of numbers and stores the results:
import time
from joblib import Parallel, delayed
import multiprocessing
array1 = [ 0 for i in range(100000) ]
def myfun(i):
return i**2
#### Simple loop ####
start_time = time.time()
for i in range(100000):
array1[i]=i**2
print( "Time for simple loop --- %s seconds ---" % ( time.time()
- start_time
)
)
#### Parallelized loop ####
start_time = time.time()
results = Parallel( n_jobs = -1,
verbose = 0,
backend = "threading"
)(
map( delayed( myfun ),
range( 100000 )
)
)
print( "Time for parallelized method --- %s seconds ---" % ( time.time()
- start_time
)
)
#### Output ####
# >>> ( executing file "Test_vr20.py" )
# Time for simple loop --- 0.015599966049194336 seconds ---
# Time for parallelized method --- 7.763299942016602 seconds ---
Could it be the difference in array handling for the two options? My actual program would have something more complicated but this is the kind of calculation that I need to parallelize, as simply as possible, but not with such results.
System Model: HP ProBook 640 G2, Windows 7,
IDLE for Python System Type: x64-based PC Processor:
Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz,
2401 MHz,
2 Core(s),
4 Logical Processor(s)
From the documentation of threading:
If you know that the function you are calling is based on a compiled
extension that releases the Python Global Interpreter Lock (GIL)
during most of its computation ...
The problem is that in this case, you don't know that. Python itself will only allow one thread to run at once (the python interpreter locks the GIL every time it executes a python operation).
threading is only going to be useful if myfun() spends most of its time in a compiled Python extension, and that extension releases the GIL.
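A hedged illustration of when backend="threading" can pay off ( an assumed example, not the question's workload ): each call hands a large chunk to a numpy reduction, which releases the GIL inside its compiled loop, so the threads can actually overlap.

import numpy as np
from joblib import Parallel, delayed

chunks = [ np.random.rand( 2 * 10**6 ) for _ in range( 8 ) ]

def chunk_sum( chunk ):
    return chunk.sum()                    # compiled numpy loop, GIL released while summing

totals = Parallel( n_jobs = 4, backend = "threading" )(
             delayed( chunk_sum )( c ) for c in chunks
         )
print( sum( totals ) )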
The Parallel code is so embarrassingly slow because you are doing a huge amount of work to create multiple threads - and then you only execute one thread at a time anyway.
If you use the multiprocessing backend, then you have to copy the input data into each of the four or eight processes (one per core), do the processing in each process, and then copy the output data back. The copying is going to be slow, but if the processing is a little bit more complex than just calculating a square, it might be worth it. Measure and see.
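A sketch of that "measure and see" advice ( heavier_fun below is a hypothetical stand-in for more complex per-item work ): with the process-based backend, the per-call work must be large enough to amortise the copying.

from joblib import Parallel, delayed

def heavier_fun( i ):
    s = 0
    for k in range( 10000 ):              # placeholder for genuinely heavier work
        s += ( i + k )**2
    return s

results = Parallel( n_jobs = -1, backend = "multiprocessing" )(
              delayed( heavier_fun )( i ) for i in range( 1000 )
          )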
Why? Because trying to use tools in cases where the tools principally cannot and DO NOT adjust the costs of entry:
I love python.
I pray educators better explain the costs of tools, otherwise we get lost in these wish-to-get [PARALLEL]-schedules.
A few facts:
No.0: With a lot of simplification, python intentionally uses the GIL to [SERIAL]-ise access to variables and thus avoid any potential collision from [CONCURRENT] modifications - paying the add-on costs of GIL-stepped dancing in extra time
No.1: [PARALLEL]-code execution is way harder than a "just"-[CONCURRENT] ( read more )
No.2: [SERIAL]-process has to pay extra costs, if trying to split work onto [CONCURRENT]-workers
No.3: If a process does inter-worker communication, immense extra costs per data exchange are paid
No.4: If hardware has few resources for [CONCURRENT] processes, results get way worse further
To get a sense of what can be done in standard python 2.7.13:
Efficiency lies in better using the silicon, not in bulldozing syntax-constructors into territories where they are legal, but where their performance has adverse effects on the experiment-under-test's end-to-end speed:
You pay about 8 ~ 11 [ms] just to iteratively assemble an empty array1
>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();array1 = [ 0 for i in xrange( 100000 ) ];aClk.stop()
9751L
10146L
10625L
9942L
10346L
9359L
10473L
9171L
8328L
( the Stopwatch().stop() method yields [us] from .start() )
while, the memory-efficient, vectorisable, GIL-free approach can do the same about +230x ~ +450x faster:
>>> import numpy as np
>>>
>>> aClk.start();arrayNP = np.zeros( 100000 );aClk.stop()
15L
22L
21L
23L
19L
22L
>>> aClk.start();arrayNP = np.zeros( 100000, dtype = np.int );aClk.stop()
43L
47L
42L
44L
47L
So, using the proper tools just starts the story of performance:
>>> def test_SERIAL_python( nLOOPs = 100000 ):
... aClk.start()
... for i in xrange( nLOOPs ): # py3 range() ~ xrange() in py27
... array1[i] = i**2 # your loop-code
... _ = aClk.stop()
... return _
While a naive [SERIAL]-iterative implementation works, you pay immense costs for opting to do so: ~ 70 [ms] for a 100000-D vector:
>>> test_SERIAL_python( nLOOPs = 100000 )
70318L
69211L
77825L
70943L
74834L
73079L
Using a more suitable / appropriate tool costs just ~ 0.2 [ms], i.e. ~ 350x FASTER
>>> aClk.start();arrayNP[:] = arrayNP[:]**2;aClk.stop()
189L
171L
173L
187L
183L
188L
193L
and with another trick, a.k.a. an in-place modus operandi:
>>> aClk.start();arrayNP[:] *=arrayNP[:];aClk.stop()
138L
139L
136L
137L
136L
136L
137L
Yields ~ +514x SPEEDUP, just from using appropriate tool
The art of performance is not in following marketing-sounding claims about parallelising ( at-any-cost ), but in using know-how based methods that pay the least costs for the biggest achievable speedups.
For "small"-problems, typical costs of distributing "thin"-work-packages are indeed hard to get covered by any potentially achievable speedups, so "problem-size" actually limits one's choice of methods, that could reach positive gain ( speedups of 0.9 or even << 1.0 are so often reported here, on StackOverflow, that you need not feel lost or alone in this sort of surprise ).
Epilogue
Processor number counts.
Core number counts.
But cache-sizes + NUMA-irregularities count more than that.
Smart, vectorised, HPC-cured, GIL-free libraries matter ( numpy et al - thanks a lot Travis OLIPHANT & al ... Great Salute to his team ... )
An overhead-strict Amdahl's Law (re-)formulation explains why even many-N-CPU parallelised code execution may ( and indeed often does ) suffer from speedups << 1.0.
The overhead-strict formulation of the Amdahl's Law speedup S explicitly includes the very costs of the paid [PAR]-Setup + [PAR]-Terminate overheads:
S = 1 / ( s + pSO + ( 1 - s ) / N + pTO )

where s, ( 1 - s ) and N were defined above,
      pSO := [PAR]-Setup-Overhead     add-on,
      pTO := [PAR]-Terminate-Overhead add-on
( an interactive animated tool for 2D visualising effects of these performance constraints is cited here )
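For a quick feel of the formula, a small helper ( illustrative, not from the cited source ) evaluates the overhead-strict speedup for given s, N, pSO, pTO:

def overhead_strict_speedup( s, N, pSO, pTO ):
    return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO )

# e.g. a 95% parallelisable job on 8 cores, with 2% setup + 1% termination overheads:
print( overhead_strict_speedup( s = 0.05, N = 8, pSO = 0.02, pTO = 0.01 ) )   # ~ 5.0x, not 8x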
I've compared processing time with theano(CPU), theano(GPU) and Scikit-learn(CPU) using Python.
But I got a strange result.
Here, look at the graph that I plotted.
Processing Time Comparison:
You can see that the scikit-learn result is faster than theano(GPU).
The program whose elapsed time I measured computes a euclidean distance matrix from a matrix which has n * 40 elements.
Here is the part of code.
points = T.fmatrix("points")
edm = T.zeros_like(points)
def get_point_to_points_euclidean_distances(point_id):
euclideans = (T.sqrt((T.sqr(points- points[point_id, : ])).sum(axis=1)))
return euclideans
def get_EDM_CPU(points):
EDM = np.zeros((points.shape[0], points.shape[0])).astype(np.float32)
for row in range(points.shape[0]):
EDM[row, :] = np.sqrt(np.sum((points - points[row, :])**2, axis=1))
return EDM
def get_sk(points):
EDM = sk.pairwise_distances(points, metric='l2')
return EDM
seq = T.arange(T.shape(points)[0])
(result, _) = theano.scan(fn = get_point_to_points_euclidean_distances, \
outputs_info = None , \
sequences = seq)
get_EDM_GPU = theano.function(inputs = [points], outputs = result, allow_input_downcast = True)
I thought that the reason why the GPU is slower than scikit-learn is probably the transfer time. So I profiled the GPU with the nvprof command, and got this:
==27105== NVPROF is profiling process 27105, command: python ./EDM_test.py
Using gpu device 0: GeForce GTX 580 (CNMeM is disabled, cuDNN not available)
data shape : (10000, 40)
get_EDM_GPU elapsed time : 1.84863090515 (s)
get_EDM_CPU elapsed time : 8.09937691689 (s)
get_EDM_sk elapsed time : 1.10968112946 (s)
ratio : 4.38128395145
==27105== Profiling application: python ./EDM_test.py
==27105== Warning: Found 9 invalid records in the result.
==27105== Warning: This could be because device ran out of memory when profiling.
==27105== Profiling result:
Time(%) Time Calls Avg Min Max Name
71.34% 1.28028s 9998 128.05us 127.65us 128.78us kernel_reduce_01_node_316e2e1cbfbe8cfb8e4a101f329ffeec_0(int, int, float const *, int, int, float*, int)
19.95% 357.97ms 9997 35.807us 35.068us 36.948us kernel_Sub_node_bc41b3f8f12c93d29f2c4360ad445d80_0_2(unsigned int, int, int, float const *, int, int, float const *, int, int, float*, int, int)
7.32% 131.38ms 2 65.690ms 1.2480us 131.38ms [CUDA memcpy DtoH]
1.25% 22.456ms 9996 2.2460us 2.1140us 2.8420us kernel_Sqrt_node_23508f8f49d12f3e8369d543f5620c15_0_Ccontiguous(unsigned int, float const *, float*)
0.12% 2.1847ms 1 2.1847ms 2.1847ms 2.1847ms [CUDA memset]
0.01% 259.73us 5 51.946us 640ns 250.36us [CUDA memcpy HtoD]
0.00% 17.086us 1 17.086us 17.086us 17.086us kernel_reduce_ccontig_node_97496c4d3cf9a06dc4082cc141f918d2_0(unsigned int, float const *, float*)
0.00% 2.0090us 1 2.0090us 2.0090us 2.0090us void copy_kernel<float, int=0>(cublasCopyParams<float>)
The transfer [CUDA memcpy DtoH] was performed twice { 1.248 [us], 131.38 [ms] }
The transfer [CUDA memcpy HtoD] was performed 5x { min: 640 [ns], max: 250.36 [us] }
The transfer time is about 131.64 ms (131.38 ms + 259.73 us).
but the gap between the GPU and scikit-learn is about 700 ms (1.8 s - 1.1 s), so the gap is larger than the transfer time.
Does it compute only the upper-triangular part of the symmetric matrix?
What makes scikit-learn so fast?
What makes scikit-learn ( on pure CPU-side ) so fast?
My initial candidates would be a mix of:
highly efficient use of available CPU-cores' L1-/ L2- sizes within the fastest [ns]-distances
smart numpy vectorised execution being friendly to CPU cache-lines
dataset so small, it can completely remain non-evicted from cache ( test to scale the dataset-under-review way above the L2-/L3-cache sizes to see the DDRx-memory-cost effects on the observed performance ( details are in the URL below ) )
might enjoy even better timing on numpy if avoiding .astype() conversions ( test it; a sketch follows this list )
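A quick sketch of that last bullet ( an assumed variant of the question's CPU routine, not the author's code ): allocate the distance matrix in float32 from the start instead of building a float64 one and converting it.

import numpy as np

def get_EDM_CPU_f32( points ):
    n = points.shape[0]
    EDM = np.zeros( (n, n), dtype = np.float32 )        # no .astype() round-trip
    for row in range( n ):
        EDM[row, :] = np.sqrt( np.sum( (points - points[row, :])**2, axis = 1 ) )
    return EDM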
Facts on the GPU-side
auto-generated GPU-kernels do not have much chance to reach the ultimate levels of global-memory latency-masking, compared to manually tweaked kernel-designs, tailor-fit to the respective GPU-silicon architecture / the latencies observed in-vivo
data-structures larger than just a few KB keep paying GPU-SM/GDDR-MEM distances of ~ large hundreds of [ns], nearly [us], compared to the small units of ~ small tens of [ns] on the CPU/L1/L2/L3/DDRx side; ref. timing details in >>> https://stackoverflow.com/a/33065382
not being able to enjoy much of the GPU/SMX power, due to this task's obviously low re-use of data points and a dataset size beyond the GPU/SM-silicon limits, which causes and must cause GPU/SM-register capacity spill-overs in any kind of GPU-kernel design attempt and tweaking
the global task is not having a minimum reasonable amount of asynchronous, isolated ( non-communicating islands ) mathematically-dense, yet SMX-local, GPU-kernel processing steps ( there is not much to compute so as to adjust for the add-on overheads and expensive SMX/GDDR memory costs )
GPUs can exhibit their best performance if sufficiently densely-convoluted re-processing operations take place -- like in large-scale / high-resolution image-processing -- on [m,n,o]-convolution-kernel matrices so small that all these m*n*o constant values can reside local to the SM, inside the available set of SMX-SM registers, and if the GPU-kernel launchers are optimally tweaked by the 3D-tblock/grid processing-layout geometries, so that the global-memory access latencies are masked at their best, with all the GPU-threads kept within the hardware WARP-aligned SMx:WarpScheduler Round-Robin thread-scheduling capabilities ( the first swap from Round-Robin into Greedy-WarpSchedule mode loses the whole battle in case of divergent execution-paths in the GPU-kernel code ).