Let's use, for example, numpy.sin()
The following code will return the value of the sine for each value of the array a:
import numpy
a = numpy.arange( 1000000 )
result = numpy.sin( a )
But my machine has 32 cores, so I'd like to make use of them. (The overhead might not be worthwhile for something like numpy.sin() but the function I actually want to use is quite a bit more complicated, and I will be working with a huge amount of data.)
Is this the best (read: smartest or fastest) method:
from multiprocessing import Pool
if __name__ == '__main__':
pool = Pool()
result = pool.map( numpy.sin, a )
or is there a better way to do this?
There is a better way: numexpr
Slightly reworded from their main page:
It's a multi-threaded VM written in C that analyzes expressions, rewrites them more efficiently, and compiles them on the fly into code that gets near optimal parallel performance for both memory and cpu bounded operations.
For example, in my 4 core machine, evaluating a sine is just slightly less than 4 times faster than numpy.
In [1]: import numpy as np
In [2]: import numexpr as ne
In [3]: a = np.arange(1000000)
In [4]: timeit ne.evaluate('sin(a)')
100 loops, best of 3: 15.6 ms per loop
In [5]: timeit np.sin(a)
10 loops, best of 3: 54 ms per loop
Documentation, including supported functions here. You'll have to check or give us more information to see if your more complicated function can be evaluated by numexpr.
Well this is kind of interesting note if you run the following commands:
import numpy
from multiprocessing import Pool
a = numpy.arange(1000000)
pool = Pool(processes = 5)
result = pool.map(numpy.sin, a)
UnpicklingError: NEWOBJ class argument has NULL tp_new
wasn't expecting that, so whats going on, well:
>>> help(numpy.sin)
Help on ufunc object:
sin = class ufunc(__builtin__.object)
| Functions that operate element by element on whole arrays.
|
| To see the documentation for a specific ufunc, use np.info(). For
| example, np.info(np.sin). Because ufuncs are written in C
| (for speed) and linked into Python with NumPy's ufunc facility,
| Python's help() function finds this page whenever help() is called
| on a ufunc.
yep numpy.sin is implemented in c as such you can't really use it directly with multiprocessing.
so we have to wrap it with another function
perf:
import time
import numpy
from multiprocessing import Pool
def numpy_sin(value):
return numpy.sin(value)
a = numpy.arange(1000000)
pool = Pool(processes = 5)
start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.032201
Multithreaded 10.550432
wow, wasn't expecting that either, well theres a couple of issues for starters we are using a python function even if its just a wrapper vs a pure c function, and theres also the overhead of copying the values, multiprocessing by default doesn't share data, as such each value needs to be copy back/forth.
do note that if properly segment our data:
import time
import numpy
from multiprocessing import Pool
def numpy_sin(value):
return numpy.sin(value)
a = [numpy.arange(100000) for _ in xrange(10)]
pool = Pool(processes = 5)
start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.150192
Multithreaded 0.055083
So what can we take from this, multiprocessing is great but we should always test and compare it sometimes its faster and sometimes its slower, depending how its used ...
Granted you are not using numpy.sin but another function I would recommend you first verify that indeed multiprocessing will speed up the computation, maybe the overhead of copying values back/forth may affect you.
Either way I also do believe that using pool.map is the best, safest method of multithreading code ...
I hope this helps.
SciPy actually has a pretty good writeup on this subject here.
Related
I am trying to learn the multiprocessing library in Python3.9. One thing I compared was the performance of a repeated computation of on a dataset composing of 220500 samples per dataset. I did this using the multiprocessing library and then using for loops.
Throughout my tests I am consistently getting better performance using for loops. Here is the code for the test I am running. I am computing the FFT of a signal with 220500 samples. My experiment involves running this process for a certain amount of times in each test. I am testing this out with setting the number of processes to 10, 100, and 1000 respectively.
import time
import numpy as np
from scipy.signal import get_window
from scipy.fftpack import fft
import multiprocessing
from itertools import product
def make_signal():
# moved this code into a function to make threading portion of code clearer
DUR = 5
FREQ_HZ = 10
Fs = 44100
# precompute the size
N = DUR * Fs
# get a windowing function
w = get_window('hanning', N)
t = np.linspace(0, DUR, N)
x = np.zeros_like(t)
b = 2*np.pi*FREQ_HZ*t
for i in range(50):
x += np.sin(b*i)
return x*w, Fs
def fft_(x, Fs):
yfft = fft(x)[:x.size//2]
xfft = np.linspace(0,Fs//2,yfft.size)
return 2/yfft.size * np.abs(yfft), xfft
if __name__ == "__main__":
# grab the raw sample data which will be computed by the fft function
x = make_signal()
# len(x) = 220500
# create 5 different tests, each with the amount of processes below
# array([ 10, 100, 1000])
tests_sweep = np.logspace(1,3,3, dtype=int)
# sweep through the processes
for iteration, test_num in enumerate(tests_sweep):
# create a list of the amount of processes to give for each iteration
fft_processes = []
for i in range(test_num):
fft_processes.append(x)
start = time.time()
# repeat the process for test_num amount of times (e.g. 10, 100, 1000)
with multiprocessing.Pool() as pool:
results = pool.starmap(fft_, fft_processes)
end = time.time()
print(f'{iteration}: Multiprocessing method with {test_num} processes took: {end - start:.2f} sec')
start = time.time()
for fft_processes in fft_processes:
# repeat the process the same amount of time as the multiprocessing method using for loops
fft_(*fft_processes)
end = time.time()
print(f'{iteration}: For-loop method with {test_num} processes took: {end - start:.2f} sec')
print('----------')
Here are the results of my test.
0: Multiprocessing method with 10 processes took: 0.84 sec
0: For-loop method with 10 processes took: 0.05 sec
----------
1: Multiprocessing method with 100 processes took: 1.46 sec
1: For-loop method with 100 processes took: 0.45 sec
----------
2: Multiprocessing method with 1000 processes took: 6.70 sec
2: For-loop method with 1000 processes took: 4.21 sec
----------
Why is the for-loop method considerably faster? Am I using the multiprocessing library correctly? Thanks.
There is a nontrivial amount of overhead to starting a new process. In addition the data has to be copied from one process to another (again with some overhead compared to a normal memory copy).
Another aspect is that you should limit the number of processes to the number of cores you have. Going over will make you incurr process switching costs as well.
This, coupled with the fact that you have little computation per process makes the switch not worth while.
I think if you make the signal significantly longer (10x or 100x) you should start seeing some benefits from using multiple cores.
Also check if the operations you are running are already using some parallelism. They might be implemented with threads, which are significantly cheaper the processes (but historically didn't work well in python, dye to GIL).
I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is i9-9900k, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing, is multiplying an array by a float32 then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda
lengthY = 16 # 2D array Y axis
lengthX = 2**16 # X axis
totalops = lengthY * lengthX * 2 # MAC operation has 2 operations
iters = 100
doParallel = True
#njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
output = (float)(0.0)
multconst = (float)(.99)
output = np.sum(np.multiply(testarray, multconst))
return output
#njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(lengthX):
output += multconst*testarray[y,x]
return output
#njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
output = (float)(0.0)
multconst = (float)(.99)
for y in prange(lengthY):
for x in prange(int(lengthX/4)):
xn = x*4
output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
return output
# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)
# ==== MAC_numpy ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_numpy(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
start = timer()
output = MAC_01(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
# ==== MAC_04 ====
time = 1000
for n in range(iters):
start = timer()
output = MAC_04(testarray)
end = timer()
if((end-start) < time): #get shortest time
time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
Q : How can I get better results?
1st : Learn how to avoid doing useless work - you can straight eliminate HALF of the FLOP-s not speaking about also the half of all the RAM-I/O-s avoided, each one being at a cost of +100~350 [ns] per writeback
Due to the distributive nature of MUL and ADD ( a.C + b.C ) == ( a + b ).C, better first np.sum( A ) and only after that then MUL the sum by the (float) constant.
#utput = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum( testarray)*multconst #######################
2nd : Learn how to best align data along the order of processing ( cache-line reuses get you ~100x faster re-use of pre-fetched data. Not aligning vectorised-code along these already pre-fetched data side-effects just let your code pay many times the RAM-access latencies, instead of smart re-using the already paid for data-blocks. Designing work-units aligned according to this principle means a few SLOCs more, but the rewards are worth that - who gets ~100x faster CPUs+RAMs for free and right now or about a ~100x speedup for free, just from not writing a badly or naively designed looping iterators?
3rd : Learn how to efficiently harness vectorised (block-directed) operations inside numpy or numba code-blocks and avoid pressing numba to spend time on auto-analysing the call-signatures ( you pay an extra time for this auto-analyses per call, while you have designed the code and knew exactly what data-types are going to go there, so why to pay an extra time for auto-analysis each time a numba-block gets called???)
4th : Learn where the extended Amdahl's Law, having all the relevant add-on costs and processing atomicity put into the game, supports your wish to get speedups, not to ever pay way more than you will get back (to at least justify the add-on costs... ) - paying extra costs for not getting any reward is possible, yet has no beneficial impact on your code's performance ( rather the opposite )
5th : Learn when and how the manually created inline(s) may save your code, once the steps 1-4 are well learnt and routinely excersised with proper craftmanship ( Using popular COTS frameworks is fine, yet these may deliver results after a few days of work, while a hand-crafted single purpose smart designed assembly code was able to get the same results in about 12 minutes(!), not several days without any GPU/CPU tricks etc - yes, that faster - just by not doing a single step more than what was needed for the numerical processing of the large matrix data )
Did I mention float32 may surprise at being processed slower on small scales than float64, while on larger data-scales ~ n [GB] the RAM I/O-times grow slower for more efficient float32 pre-fetches? This never happens here, as float64 array gets processed here. Sure, unless one explicitly instructs the constructor(s) to downconvert the default data type, like this: np.random.rand( lengthY, lengthX ).astype( dtype = np.float32 )>>> np.random.rand( 10, 2 ).dtypedtype('float64')Avoiding extensive memory allocations is another performance trick, supported in numpy call-signatures. Using this option for large arrays will save you a lot of extra time wasted on mem-allocs for large interim arrays. Reusing already pre-allocated memory-zones and wisely controlled gc-policing are another signs of a professional, focused on low-latency & design-for-performance
The purpose of using the numexpr.evaluate() is to speed up the compute. But im my case it wokrs even slower than numpy und eval(). I would like to know why?
code as example:
import datetime
import numpy as np
import numexpr as ne
expr = '11808000.0*1j*x**2*exp(2.5e-10*1j*x) + 1512000.0*1j*x**2*exp(5.0e-10*1j*x)'
# use eval
start_eval = datetime.datetime.now()
namespace = dict(x=np.array([m+3j for m in range(1, 1001)]), exp=np.exp)
result_eval = eval(expr, namespace)
end_eval = datetime.datetime.now()
# print(result)
print("time by using eval : %s" % (end_eval- start_eval))
# use numexpr
# ne.set_num_threads(8)
start_ne = datetime.datetime.now()
x = np.array([n+3j for n in range(1, 1001)])
result_ne = ne.evaluate(expr)
end_ne = datetime.datetime.now()
# print(result_ne)
print("time by using numexpr: %s" % (end_ne- start_ne))
return:
time by using eval : 0:00:00.002998
time by using numexpr: __ 0:00:00.052969
Thank you all
I have get the answer from robbmcleod in Github
for NumExpr 2.6 you'll need arrays around 128 - 256 kElements to see a speed-up. NumPy always starts faster as it doesn't have to synchronize at a thread barrier and otherwise spin-up the virtual machine
Also, once you call numexpr.evaluate() the second time, it should be faster as it will have already compiled everything. Compilation takes around 0.5 ms for simple expressions, more for longer expressions. The expression is stored as a hash, so the next time that computational expense is gone.
related url: https://github.com/pydata/numexpr/issues/301#issuecomment-388097698
Ok, here is my problem: I have a nested for loop in my program which runs on a single core. Since the program spend over 99% of run time in this nested for loop I would like to parallelize it. Right now I have to wait 9 days for the computation to finish. I tried to implement a parallel for loop by using the multiprocessing library. But I only find very basic examples and can not transfer them to my problem. Here are the nested loops with random data:
import numpy as np
dist_n = 100
nrm = np.linspace(1,10,dist_n)
data_Y = 11000
data_I = 90000
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n)
for t in range(data_Y):
for i in range(data_I):
d = np.abs(I[i] - Y[t])
for p in range(dist_n):
dist[i,p] = np.sum(d**nrm[p])/nrm[p]
print(dist)
Please give me some advise how to make it parallel.
There's a small overhead with initiating a process (50ms+ depending on data size) so it's generally best to MP the largest block of code possible. From your comment it sounds like each loop of t is independent so we should be free to parallelize this.
When python creates a new process you get a copy of the main process so you have available all your global data but when each process writes the data, it writes to it's own local copy. This means dist[i,p] won't be available to the main process unless you explicitly pass it back with a return (which will have some overhead). In your situation, if each process writes dist[i,p] to a file then you should be fine, just don't try to write to the same file unless you implement some type of mutex access control.
#!/usr/bin/python
import time
import multiprocessing as mp
import numpy as np
data_Y = 11 #11000
data_I = 90 #90000
dist_n = 100
nrm = np.linspace(1,10,dist_n)
I = np.random.randn(data_I, 1000)
Y = np.random.randn(data_Y, 1000)
dist = np.zeros((data_I, dist_n))
def worker(t):
st = time.time()
for i in range(data_I):
d = np.abs(I[i] - Y[t])
for p in range(dist_n):
dist[i,p] = np.sum(d**nrm[p])/nrm[p]
# Here - each worker opens a different file and writes to it
print 'Worker time %4.3f mS' % (1000.*(time.time()-st))
if 1: # single threaded
st = time.time()
for x in map(worker, range(data_Y)):
pass
print 'Single-process total time is %4.3f seconds' % (time.time()-st)
print
if 1: # multi-threaded
pool = mp.Pool(28) # try 2X num procs and inc/dec until cpu maxed
st = time.time()
for x in pool.imap_unordered(worker, range(data_Y)):
pass
print 'Multiprocess total time is %4.3f seconds' % (time.time()-st)
print
If you re-increase the size of data_Y/data_I again, the speed-up should increase up to the theoretical limit.
A simple program which calculates square of numbers and stores the results:
import time
from joblib import Parallel, delayed
import multiprocessing
array1 = [ 0 for i in range(100000) ]
def myfun(i):
return i**2
#### Simple loop ####
start_time = time.time()
for i in range(100000):
array1[i]=i**2
print( "Time for simple loop --- %s seconds ---" % ( time.time()
- start_time
)
)
#### Parallelized loop ####
start_time = time.time()
results = Parallel( n_jobs = -1,
verbose = 0,
backend = "threading"
)(
map( delayed( myfun ),
range( 100000 )
)
)
print( "Time for parallelized method --- %s seconds ---" % ( time.time()
- start_time
)
)
#### Output ####
# >>> ( executing file "Test_vr20.py" )
# Time for simple loop --- 0.015599966049194336 seconds ---
# Time for parallelized method --- 7.763299942016602 seconds ---
Could it be the difference in array handling for the two options? My actual program would have something more complicated but this is the kind of calculation that I need to parallelize, as simply as possible, but not with such results.
System Model: HP ProBook 640 G2, Windows 7,
IDLE for Python System Type: x64-based PC Processor:
Intel(R) Core(TM) i5-6300U CPU # 2.40GHz,
2401 MHz,
2 Core(s),
4 Logical Processor(s)
From the documentation of threading:
If you know that the function you are calling is based on a compiled
extension that releases the Python Global Interpreter Lock (GIL)
during most of its computation ...
The problem is that in the this case, you don't know that. Python itself will only allow one thread to run at once (the python interpreter locks the GIL every time it executes a python operation).
threading is only going to be useful if myfun() spends most of its time in a compiled Python extension, and that extension releases the GIL.
The Parallel code is so embarrassingly slow because you are doing a huge amount of work to create multiple threads - and then you only execute one thread at a time anyway.
If you use the multiprocessing backend, then you have to copy the input data into each of four or eight processes (one per core), do the processing in each processes, and then copy the output data back. The copying is going to be slow, but if the processing is a little bit more complex than just calculating a square, it might be worth it. Measure and see.
Why?Because trying to use tools in cases,where tools principally cannot and DO NOT adjust the costs of entry:
I love python.
I pray educators better explain the costs of tools, otherwise we get lost in these wish-to-get [PARALLEL]-schedules.
A few facts:
No.0: With a lot of simplification, python intentionally uses GIL to [SERIAL]-ise access to variables and thus avoiding any potential collision from [CONCURRENT] modifications - paying these add-on costs of GIL-stepped dancing in extra time
No.1: [PARALLEL]-code execution is way harder than a "just"-[CONCURRENT] ( read more )
No.2: [SERIAL]-process has to pay extra costs, if trying to split work onto [CONCURRENT]-workers
No.3: If a process does inter-worker communication, immense extra costs per data exchange are paid
No.4: If hardware has few resources for [CONCURRENT] processes, results get way worse further
To have some smell of what can be done in standard python 2.7.13:
Efficiency is in better using silicon, not in bulldozing syntax-constructors into territories, where they are legal, but their performance has adverse effects on the experiment-under-test end-to-end speed:
You pay about 8 ~ 11 [ms] just to iteratively assemble an empty array1
>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();array1 = [ 0 for i in xrange( 100000 ) ];aClk.stop()
9751L
10146L
10625L
9942L
10346L
9359L
10473L
9171L
8328L
( the Stopwatch().stop() method yields [us] from .start() )
while, the memory-efficient, vectorisable, GIL-free approach can do the same about +230x ~ +450x faster:
>>> import numpy as np
>>>
>>> aClk.start();arrayNP = np.zeros( 100000 );aClk.stop()
15L
22L
21L
23L
19L
22L
>>> aClk.start();arrayNP = np.zeros( 100000, dtype = np.int );aClk.stop()
43L
47L
42L
44L
47L
So, using the proper tools just starts the story of performance:
>>> def test_SERIAL_python( nLOOPs = 100000 ):
... aClk.start()
... for i in xrange( nLOOPs ): # py3 range() ~ xrange() in py27
... array1[i] = i**2 # your loop-code
... _ = aClk.stop()
... return _
While a naive [SERIAL]-iterative implementation works, you pay immense costs for opting to do so ~ 70 [ms] for a 100000-D vector:
>>> test_SERIAL_python( nLOOPs = 100000 )
70318L
69211L
77825L
70943L
74834L
73079L
Using a more suitable / appropriate tool costs just ~ 0.2 [ms] i.e. ++350x FASTER
>>> aClk.start();arrayNP[:] = arrayNP[:]**2;aClk.stop()
189L
171L
173L
187L
183L
188L
193L
and with another glitch, a.k.a. an inplace modus-operandi:
>>> aClk.start();arrayNP[:] *=arrayNP[:];aClk.stop()
138L
139L
136L
137L
136L
136L
137L
Yields ~ +514x SPEEDUP, just from using appropriate tool
The art of performance is not in following marketing-sounding claimsabout parallellizing-( at-any-cost ),but in using know-how based methods, that pay least costs for biggest speedups achievable.
For "small"-problems, typical costs of distributing "thin"-work-packages are indeed hard to get covered by any potentially achievable speedups, so "problem-size" actually limits one's choice of methods, that could reach positive gain ( speedups of 0.9 or even << 1.0 are so often reported here, on StackOverflow, that you need not feel lost or alone in this sort of surprise ).
Epilogue
Processor number counts.
Core number counts.
But cache-sizes + NUMA-irregularities count more than that.
Smart, vectorised, HPC-cured, GIL-free libraries matter ( numpy et al - thanks a lot Travis OLIPHANT & al ... Great Salute to his team ... )
As an overhead-strict Amdahl Law (re-)-formulation explains, why even many-N-CPU parallelised code execution may ( and indeed often does ) suffer from speedups << 1.0
Overhead-strict formulation of the Amdahl's Law speedup S includes the very costs of the paid [PAR]-Setup + [PAR]-Terminate Overheads, explicitly:
1
S = __________________________; where s, ( 1 - s ), N were defined above
( 1 - s ) pSO:= [PAR]-Setup-Overhead add-on
s + pSO + _________ + pTO pTO:= [PAR]-Terminate-Overhead add-on
N
( an interactive animated tool for 2D visualising effects of these performance constraints is cited here )