What actually happens when setting parallel=True in @njit numba?

Could someone please explain roughly what happens when one runs an @njit-ted Python function which contains nested for loops (each iteration of each loop is independent of the others), sets parallel=True, and uses prange instead of range?
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def f():
    C = np.empty((80, 20, 18), dtype=np.complex128)
    for i in prange(80):
        for j in prange(20):
            for k in range(18):
                C[i, j, k] = do_smth(i, j, k)  # do_smth(i, j, k) is @njit-ted and will further call other functions
Similarly, what happens when using prange only for the outermost loop? (i.e. keeping for j in range(20): ...)
I understand what a thread is, and I set NUMBA_NUM_THREADS (the environment variable) to the number of cores of the processor.
I did some profiling with the timeit module, and it seems that the parallel=True keyword only slows down the execution of f() when the .py script is run on a machine with 20 cores, by a considerable amount (up to 4 times slower).
f() above further calls more functions (the first one being do_smth()) whose structure resembles that of f(): nested for loops which, at each iteration, call other @njit-ted functions.
I checked them in the same way. Is my approach sound? I.e. profiling them with timeit while changing the keyword parameters inside their @njit decorator (I played with parallel, fastmath and nogil) and building a table of the execution times. My aim was to find the best execution time from the results I obtain.

Could someone please explain roughly what happens when one runs an @njit-ted Python function which contains a nested for loop, sets parallel=True and uses prange instead of range?
This is explained in the documentation, but basically, when parallel=True is set, prange splits the loop iterations into blocks that are executed by multiple threads. The exact scheduling depends on the underlying parallel runtime (e.g. TBB, OpenMP, etc.). The loop is analysed by Numba to determine whether a reduction is needed (not all patterns are allowed). It can also fuse parallel loops when possible (though this does not work on my machine with Numba 0.55.2, even on trivial reduction loops: only the outer loop is parallelized). Note that it takes time to create threads, and the more cores there are, the longer this takes. This is why a multi-threaded computation should run long enough for multiple threads to be useful.
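If you want to see how Numba actually transformed the loops (which ones were parallelized, whether they were fused, etc.), the compiled dispatcher can print a diagnostics report after the first call. Here is a minimal sketch on a toy function (not the code from the question):

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def fill(n, m):
    out = np.empty((n, m))
    for i in prange(n):        # parallel loop
        for j in range(m):     # serial inner loop
            out[i, j] = i * m + j
    return out

fill(80, 20)                        # first call triggers compilation
fill.parallel_diagnostics(level=4)  # prints which loops were parallelized/fused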
Similarly, what happens when using prange only for the outermost loop? (i.e. keeping for j in range(20): ...)
In theory, it is generally better to specify more parallelism. In practice, it is not always useful and is sometimes even detrimental, because the runtime may use inefficient methods (loop fusion can cause slow modulus operations to be used with some OpenMP backends).
If you use it only on the outer i-based loop, then only that loop is parallelized (using all the cores by default, so 4 iterations per thread on a 20-core machine if a static schedule is selected by the backend).
I did some profiling with the timeit module, and it seems that the parallel=True keyword only slows down the execution of f() when the .py script is run on a machine with 20 cores, by a considerable amount (up to 4 times slower).
Parallel programming is not easy; it is far harder than most people think. This is why research teams have worked on it for decades and it is still an active field of research.
There are many effects that can be responsible for this, including:
Allocator contention (very frequent)
Undefined behavior in the code (frequent): typically a race condition (example)
False-sharing (quite frequent)
NUMA effects (eg. access to remote pages)
Other resource saturation (eg. memory), though this generally makes the code scale poorly rather than causing a slowdown (unless there is contention)
A bug in Numba (quite rare)
Also note that the first call causes the function to be compiled, so it is slower (and parallel code is even slower to compile).
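As a side note on benchmarking, here is a minimal sketch of how to time such a function without counting compilation, using a placeholder body instead of the do_smth() from the question: call the function once to trigger compilation, then time only subsequent calls.

import timeit
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def f():
    C = np.empty((80, 20, 18), dtype=np.complex128)
    for i in prange(80):
        for j in range(20):
            for k in range(18):
                C[i, j, k] = complex(i + j, k)   # placeholder instead of do_smth()
    return C

f()                                               # warm-up: compile outside the timing
print(timeit.timeit(f, number=100) / 100, "s per call")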

Related

Parallel processing with numpy or numba

I have a simple problem. A function receives an array [a, b] of two numbers and returns another array [a*a, a*b]. The sample code is
import numpy as np

def func(array_1):
    array_2 = np.zeros_like(array_1)
    array_2[0] = array_1[0]*array_1[0]
    array_2[1] = array_1[0]*array_1[1]
    return array_2

array_1 = np.array([3., 4.])  # sample test array [a, b]
print(array_1)                # prints this test array [a, b]
print(func(array_1))          # prints [a*a, a*b]
The two lines inside the function func
array_2[0] = array_1[0]*array_1[0]
array_2[1] = array_1[0]*array_1[1]
are independent and I want to parallelize them.
Please tell me
how to parallelize this (without Numba)?
how to parallelize this (with Numba)?
It does not make sense to parallelise this code using multiple threads/processes because the arrays are far too small for such parallelisation to be useful. Indeed, creating a thread typically takes about 1-100 microseconds on a mainstream machine, while this code should take clearly less than a microsecond in Numba. In fact, the two computing lines should take less than 0.01 microsecond. Thus, creating threads will make the execution far slower.
Assuming your array would be much bigger, the typical way to parallelize a Python script is to use multiprocessing (which creates processes). For a Numba code, it is prange + parallel=True (which creates threads).
If you execute a jitted Numba function, then the code already runs a bit in parallel. Indeed, modern mainstream processors already execute instructions in parallel. This is called instruction-level parallelism. More specifically, modern processors pipeline the instructions and execute multiple of them thanks to a superscalar execution and in an out-of-order way. All of this is completely automatic. You just need to avoid having dependencies between the executed instructions.
Finally, if you want to speed up this function, then you need to use Numba in the caller function, because a function call from CPython is far more expensive than computing and storing two floats. Note also that allocating an array is pretty expensive, so it is better to reuse buffers at this granularity.
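To make the last point concrete, here is a minimal sketch (assuming a workload of many [a, b] pairs, which is not in the question) of jitting the caller loop and reusing a preallocated output buffer:

import numpy as np
from numba import njit

@njit
def func_into(pair, out):
    # same two independent computations, but no allocation inside the hot function
    out[0] = pair[0] * pair[0]
    out[1] = pair[0] * pair[1]

@njit
def process_all(pairs, results):
    # the caller loop is jitted too, so there is no per-call CPython overhead
    for i in range(pairs.shape[0]):
        func_into(pairs[i], results[i])

pairs = np.random.random((1_000_000, 2))
results = np.empty_like(pairs)
process_all(pairs, results)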

Why is building a list from user input and printing its contents much slower in PyPy than CPython?

I was coding for a problem in CodeForces, and I submitted this code to run in PyPy:
import math

a = []
b = []
t = int(input())
for i in range(t):
    n = float(input())
    a.append(math.floor(n))
    b.append(math.ceil(n))

l = 0 - sum(a)
i = 0
while i < len(a):
    if l > 0 and a[i] != b[i]:
        print(b[i])
        l -= 1
    else:
        print(a[i])
    i += 1
However, I was given a "time limit exceeded" verdict, with the execution taking over 1 second.
The same code ran in under 600 ms when run by the CPython interpreter.
From what I understand, PyPy is usually faster than Python. Why would CPython be faster for this code?
Welcome to Stack Overflow! In two words, the reason that PyPy loses to CPython in this case is that the Python code we are running is not really computing much; instead, all the time is spent doing input/output (first a loop of input(), then a loop of print()). This is likely where the major part of the time goes. PyPy's routines for input/output are not as well optimized as CPython's, which is why it is somewhat slower. You can guess that PyPy will win over CPython, sometimes massively, when the Python code you wrote spends its time doing computations in Python.
The opposite of "doing computations in Python" is sometimes called "running library code"---this includes things like input/output, or more generally anything where a single Python function call invokes quite a lot of C code. Note that, counter-intuitively, this also includes doing arithmetic on very, very large integers, because that requires a lot of C code for every single operation. The opposite extreme example would be doing arithmetic on "small" integers, up to sys.maxsize, because the PyPy JIT can map every operation directly to one CPU instruction.
In summary, PyPy is good where there is some time spent in pure Python---not necessarily all the time. For example, non-trivial pure-Python web servers tend to benefit a lot from PyPy: the raw socket input/output is indeed a bit slower, but all the logic to process the queries and build responses is much faster, and that's easily the major part of the execution time.
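As a side note not covered in the answer above, a common workaround on both interpreters for such I/O-bound code is to read the whole input at once and write the whole output at once. A rough sketch, assuming the input format of the snippet above:

import sys
import math

data = sys.stdin.read().split()
t = int(data[0])
vals = [float(x) for x in data[1:1 + t]]
a = [math.floor(v) for v in vals]
b = [math.ceil(v) for v in vals]

l = 0 - sum(a)
out = []
for i in range(t):
    if l > 0 and a[i] != b[i]:
        out.append(b[i])
        l -= 1
    else:
        out.append(a[i])

sys.stdout.write("\n".join(map(str, out)) + "\n")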

Why is it better to use synchronous programming for in-memory operations?

I have a complex nested data structure. I iterate through it and perform some calculations on each possible unique pair of elements. It's all in-memory mathematical functions. I don't read from files or do networking.
It takes a few hours to run, with do_work() being called 25,000 times. I am looking for ways to speed it up.
Although Pool.map() seems useful for my lists, it's proving to be difficult because I need to pass extra arguments into the function being mapped.
I thought using the Python multitasking library would help, but when I use Pool.apply_async() to call do_work(), it actually takes longer.
I did some googling and a blogger says "Use sync for in-memory operations — async is a complete waste when you aren’t making blocking calls." Is this true? Can someone explain why? Do the RAM read & write operations interfere with each other? Why does my code take longer with async calls? do_work() writes calculation results to a database, but it doesn't modify my data structure.
Surely there is a way to utilize my processor cores instead of just linearly iterating through my lists.
My starting point, doing it synchronously:
main_list = [[[a, b, c, [x, y, z], ...], ...], ...]  # list of identical structures
helper_list = [1, 2, 3]
z = 2
for i_1 in range(0, len(main_list)):
    for i_2 in range(0, len(main_list)):
        if i_1 < i_2:  # only unique combinations
            for m in range(0, len(main_list[i_1])):
                for h, helper in enumerate(helper_list):
                    do_work(
                        main_list[i_1][m][0], main_list[i_2][m][0],  # unique combo
                        main_list[i_1][m][1], main_list[i_1][m][2],
                        main_list[i_1][m][3][z], main_list[i_2][m][3][h],
                        helper_list[h]
                    )
Variable names have been changed to make it more readable.
This is just a general answer, but too long for a comment...
First of all, I think your biggest bottleneck at this very moment is Python itself. I don't know what do_work() does, but if it's CPU intensive, you have the GIL, which completely prevents effective parallelisation inside one process. No matter what you do, threads will fight for the GIL and it will eventually make your code even slower. Remember: Python has real threads, but the CPU is shared inside a single process.
I recommend checking out the page of David M Beazley: http://dabeaz.com/GIL/gilvis, who put a lot of effort into visualising the GIL behaviour in Python.
On the other hand, the module multiprocessing allows you to run multiple processes and "circumvent" the GIL downsides, but it will be tricky to get access to the same memory locations without bigger penalties or trade-offs.
Second: if you have heavy nested loops, you should think about using numba and trying to fit your data structures into numpy (structured) arrays. This can easily give you an order of magnitude of speedup. Python is slow as hell for such things, but luckily there are ways to squeeze out a lot when using appropriate libraries.
To sum up, I think the code you are running could be orders of magnitude faster with numba and numpy structures.
Alternatively, you can try to rewrite the code in a language like Julia (very similar syntax to Python and the community is extremely helpful) and quickly check how fast it is in order to explore the limits of the performance. It's always a good idea to get a feeling for how fast something (or parts of a code) can be in a language which does not have such complex performance-critical aspects as Python.
Your task is more CPU-bound than reliant on I/O operations. Asynchronous execution makes sense when you have long I/O operations, i.e. sending/receiving something over the network etc.
What you can do is split the task into chunks and utilize threads and multiprocessing (run on different CPU cores).
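A minimal sketch of that route with multiprocessing.Pool.starmap, which takes one argument tuple per call and therefore handles the "extra arguments" issue mentioned in the question (do_work here is a hypothetical stand-in, not the real function):

import multiprocessing as mp

def do_work(x, y, extra):          # hypothetical stand-in for the real CPU-bound function
    return (x * y) ** extra

def make_tasks(main_list, helper_list):
    # build one argument tuple per unique combination, mirroring the nested loops
    for i_1 in range(len(main_list)):
        for i_2 in range(i_1 + 1, len(main_list)):
            for helper in helper_list:
                yield (main_list[i_1], main_list[i_2], helper)

if __name__ == "__main__":
    main_list = list(range(100))
    helper_list = [1, 2, 3]
    with mp.Pool() as pool:        # one worker process per core by default
        results = pool.starmap(do_work, make_tasks(main_list, helper_list))
    print(len(results), "results")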

Why is linear read-shuffled write not faster than shuffled read-linear write?

I'm currently trying to get a better understanding of memory/cache related performance issues. I read somewhere that memory locality is more important for reading than for writing, because in the former case the CPU has to actually wait for the data whereas in the latter case it can just ship them out and forget about them.
With that in mind, I did the following quick-and-dirty test: I wrote a script that creates an array of N random floats and a permutation, i.e. an array containing the numbers 0 to N-1 in random order. Then it repeatedly either (1) reads the data array linearly and writes it back to a new array in the random access pattern given by the permutation or (2) reads the data array in the permuted order and linearly writes it to a new array.
To my surprise (2) seemed consistently faster than (1). There were, however, problems with my script
The script is written in python/numpy. This being quite a high-level language, it is not clear how precisely the reads/writes are implemented.
I probably did not balance the two cases properly.
Also, some of the answers/comments below suggest that my original expectation isn't correct and that depending on details of the cpu cache either case might be faster.
My question is:
Which (if any) of the two should be faster?
What are the relevant cache concepts here, and how do they influence the result?
A beginner-friendly explanation would be appreciated. Any supporting code should be in C / cython / numpy / numba or python.
Optionally:
Explain why the absolute durations are nonlinear in problem size (cf. timings below).
Explain the behavior of my clearly inadequate python experiments.
For reference, my platform is Linux-4.12.14-lp150.11-default-x86_64-with-glibc2.3.4. Python version is 3.6.5.
Here is the code I wrote:
import numpy as np
from timeit import timeit

def setup():
    global a, b, c
    a = np.random.permutation(N)
    b = np.random.random(N)
    c = np.empty_like(b)

def fwd():
    c = b[a]

def inv():
    c[a] = b

N = 10_000
setup()
timeit(fwd, number=100_000)
# 1.4942631321027875
timeit(inv, number=100_000)
# 2.531870319042355

N = 100_000
setup()
timeit(fwd, number=10_000)
# 2.4054739447310567
timeit(inv, number=10_000)
# 3.2365565397776663

N = 1_000_000
setup()
timeit(fwd, number=1_000)
# 11.131387163884938
timeit(inv, number=1_000)
# 14.19817715883255
As pointed out by @Trilarion and @Yann Vernier, my snippets aren't properly balanced, so I replaced them with
def fwd():
    c[d] = b[a]
    b[d] = c[a]

def inv():
    c[a] = b[d]
    b[a] = c[d]
where d = np.arange(N) (I shuffle everything both ways to hopefully reduce cross-trial caching effects). I also replaced timeit with repeat and reduced the number of repeats by a factor of 10.
Then I get
[0.6757169323973358, 0.6705542299896479, 0.6702114241197705] #fwd
[0.8183442652225494, 0.8382121799513698, 0.8173762648366392] #inv
[1.0969422250054777, 1.0725746559910476, 1.0892365919426084] #fwd
[1.0284497970715165, 1.025063106790185, 1.0247828317806125] #inv
[3.073981977067888, 3.077839042060077, 3.072118630632758] #fwd
[3.2967213969677687, 3.2996009718626738, 3.2817375687882304] #inv
So there still seems to be a difference, but it is much more subtle and can now go either way depending on the problem size.
This is a complex problem closely related to architectural features of modern processors, and your intuition that random reads are slower than random writes because the CPU has to wait for the read data is not verified (most of the time). There are several reasons for that, which I will detail below.
Modern processors are very efficient at hiding read latency, while memory writes are more expensive than memory reads, especially in a multicore environment.
Reason #1: Modern processors are efficient at hiding read latency.
Modern superscalar processors can execute several instructions simultaneously and change the instruction execution order (out-of-order execution).
While the primary reason for these features is to increase instruction throughput, one of the most interesting consequences is the ability of processors to hide the latency of memory reads (or of complex operators, branches, etc.).
To explain that, let us consider a simple code that copies an array into another one:
for i in range(N):
    c[i] = b[i]
Once compiled, the code executed by the processor will look something like this:
# 1. (iteration 1) c[0] = b[0]
1a. read memory at b[0] and store result in register c0
1b. write register c0 at memory address c[0]
# 2. (iteration 2) c[1] = b[1]
2a. read memory at b[1] and store result in register c1
2b. write register c1 at memory address c[1]
# 3. (iteration 3) c[2] = b[2]
3a. read memory at b[2] and store result in register c2
3b. write register c2 at memory address c[2]
# etc.
(this is terribly oversimplified and the actual code is more complex and has to deal with loop management, address computation, etc, but this simplistic model is presently sufficient).
As said in the question, for reads the processor has to wait for the actual data. Indeed, 1b needs the data fetched by 1a and cannot execute as long as 1a is not completed. Such a constraint is called a dependency, and we can say that 1b is dependent on 1a. Dependencies are a major notion in modern processors. Dependencies express the algorithm (e.g. I write b to c) and must absolutely be respected. But, if there is no dependency between instructions, processors will try to execute other pending instructions in order to keep their pipeline always active. This can lead to out-of-order execution, as long as dependencies are respected (similar to the as-if rule).
For the considered code, there is no dependency between high-level instruction 2. and 1. (or between asm instructions 2a and 2b and the previous instructions). Actually the final result would even be identical if 2. were executed before 1., and the processor will try to execute 2a and 2b before the completion of 1a and 1b. There is still a dependency between 2a and 2b, but both can be issued. And similarly for 3a and 3b, and so on. This is a powerful means to hide memory latency. If for some reason 2., 3. and 4. can terminate before 1. loads its data, you may not even notice any slowdown at all.
This instruction level parallelism is managed by a set of "queues" in the processor.
a queue of pending instructions in the reservation stations RS (typ. 128 μinstructions in recent pentiums). As soon as the resources required by an instruction are available (for instance the value of register c0 for instruction 1b), the instruction can execute.
a queue of pending memory accesses in the memory order buffer MOB, before the L1 cache. This is required to deal with memory aliases and to ensure sequentiality of memory writes or loads at the same address (typ. 64 loads, 32 stores)
a queue to enforce sequentiality when writing back results to registers (reorder buffer or ROB of 168 entries), for similar reasons
and some other queues at instruction fetch, for μop generation, write and miss buffers in the cache, etc.
At some point in the execution of the previous program, there will be many pending store instructions in the RS, several loads in the MOB, and instructions waiting to retire in the ROB.
As soon as a piece of data becomes available (for instance a read terminates), dependent instructions can execute, and that frees positions in the queues. But if no termination occurs and one of these queues is full, the functional unit associated with this queue stalls (this can also happen at instruction issue if the processor runs out of register names). Stalls are what creates performance loss, and to avoid them, queue filling must be limited.
This explains the difference between linear and random memory accesses.
In a linear access, 1/ the number of misses will be smaller because of the better spatial locality and because caches can prefetch accesses with a regular pattern to reduce it further, and 2/ whenever a read terminates, it will concern a complete cache line and can free several pending load instructions, limiting the filling of the instruction queues. This way the processor is permanently busy and memory latency is hidden.
For a random access, the number of misses will be higher, and only a single load can be served when data arrives. Hence instruction queues will saturate rapidly, the processor stalls, and memory latency can no longer be hidden by executing other instructions.
The processor architecture must be balanced in terms of throughput in order to avoid queue saturation and stalls. Indeed, there are generally tens of instructions at some stage of execution in a processor, and global throughput (i.e. the ability of the memory (or of the functional units) to serve instruction requests) is the main factor that determines performance. The fact that some of these pending instructions are waiting for a memory value has a minor effect...
...except if you have long dependency chains.
There is a dependency when an instruction has to wait for the completion of a previous one. Using the result of a read is a dependency. And dependencies can be a problem when involved in a dependency chain.
For instance, consider the code for i in range(1,100000): s += a[i]. All the memory reads are independent, but there is a dependency chain for the accumulation in s. No addition can happen until the previous one has terminated. These dependencies will make the reservation stations fill up rapidly and create stalls in the pipeline.
But reads are rarely involved in dependency chains. It is still possible to imagine pathological code where all reads are dependent on the previous one (for instance for i in range(1,100000): s = a[s]), but it is uncommon in real code. And the problem comes from the dependency chain, not from the fact that it is a read; the situation would be similar (and probably even worse) with compute-bound dependent code like for i in range(1,100000): x = 1.0/x+1.0.
Hence, except in some situations, computation time is more related to throughput than to read dependency, thanks to the fact that superscalar out-of-order execution hides latency. And as far as throughput is concerned, writes are worse than reads.
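As an illustration of the dependency-chain point (a sketch of my own, not part of the benchmark below): with Numba, a loop of independent reads can hide memory latency, whereas a pointer-chasing loop in which every read depends on the previous one cannot.

import numpy as np
from numba import njit

@njit
def independent_reads(a):
    s = 0
    for i in range(a.size):      # reads are independent, latency can be hidden
        s += a[i]
    return s

@njit
def dependent_reads(a):
    s = 0
    for _ in range(a.size):      # each read depends on the previous result
        s = a[s]
    return s

a = np.random.permutation(10_000_000)
independent_reads(a); dependent_reads(a)   # warm-up compilation
# timing both (e.g. with %timeit) should show the pointer-chasing loop
# being several times slower despite performing the same number of reads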
Reason #2: Memory writes (especially random ones) are more expensive than memory reads
This is related to the way caches behave. Caches are fast memories that store parts of the memory (called lines) used by the processor. Cache lines are presently 64 bytes and allow exploiting the spatial locality of memory references: once a line is stored, all data in the line are immediately available. The important aspect here is that all transfers between the cache and the memory are lines.
When a processor performs a read of a piece of data, the cache checks whether the line to which the data belongs is in the cache. If not, the line is fetched from memory, stored in the cache, and the desired data is sent back to the processor.
When a processor writes a piece of data to memory, the cache also checks for the presence of the line. If the line is not present, the cache cannot send the data to memory (because all transfers are line based) and does the following steps:
the cache fetches the line from memory and stores it in a cache line
the data is written into the cache line and the complete line is marked as modified (dirty)
when a line is evicted from the cache, the modified flag is checked, and if the line has been modified, it is written back to memory (write-back cache)
Hence, every memory write must be preceded by a memory read to get the line in the cache. This adds an extra operation, but is not very expensive for linear writes. There will be a cache miss and a memory read for the first written word, but successive writes will just concern the cache and be hits.
But the situation is very different for random writes. If the number of misses is large, every cache miss implies a read followed by only a small number of writes before the line is evicted from the cache, which significantly increases the write cost. If a line is evicted after a single write, we can even consider that a write has twice the temporal cost of a read.
It is important to note that increasing the number of memory accesses (either reads or writes) tends to saturate the memory access path and to globally slow down all transfers between the processor and memory.
In either case, writes are always more expensive than reads, and multicore execution amplifies this effect.
Reason #3: Random writes create cache misses in multicores
Not sure this really applies to the situation of the question. While numpy BLAS routines are multithreaded, I do not think basic array copy is. But it is closely related and is another reason why writes are more expensive.
The problem with multicores is to ensure proper cache coherence, in such a way that data shared by several processors is properly updated in the cache of every core. This is done by means of a protocol such as MESI that updates a cache line before writing it and invalidates other cached copies (read for ownership).
While none of the data is actually shared between cores in the question (or in a parallel version of it), note that the protocol applies to cache lines. Whenever a cache line is to be modified, it is copied from the cache holding the most recent copy, locally updated, and all other copies are invalidated, even if the cores are accessing different parts of the cache line. Such a situation is called false sharing and it is an important issue for multicore programming.
Concerning the problem of random writes, cache lines are 64 bytes and can hold 8 int64 values, and if the computer has 8 cores, every core will process on average 2 values. Hence there is significant false sharing that will slow down writes.
We did some performance evaluations. They were performed in C in order to include an evaluation of the impact of parallelization. We compared 5 functions that process int64 arrays of size N.
Just a copy of b to c (c[i] = b[i]) (implemented by the compiler with memcpy())
Copy with a linear index c[i] = b[d[i]] where d[i]==i (read_linear)
Copy with a random index c[i] = b[a[i]] where a is a random permutation of 0..N-1 (read_random, equivalent to fwd in the original question)
Write linear c[d[i]] = b[i] where d[i]==i (write_linear)
Write random c[a[i]] = b[i] where a is a random permutation of 0..N-1 (write_random, equivalent to inv in the question)
The code has been compiled with gcc -O3 -funroll-loops -march=native -malign-double on a Skylake processor. Performance is measured with _rdtsc() and given in cycles per iteration. Each function is executed several times (1000-20000 depending on array size), 10 experiments are performed, and the smallest time is kept. Array sizes range from 4000 to 1200000. All code has been measured in both a sequential and a parallel version with OpenMP.
Here is a graph of the results. Functions are shown in different colors, with the sequential version in thick lines and the parallel one in thin lines.
Direct copy is (obviously) the fastest and is implemented by gcc with the highly optimized memcpy(). It is a means of estimating the data throughput with memory. It ranges from 0.8 cycles per iteration (CPI) for small matrices to 2.0 CPI for large ones.
Read-linear performance is approximately twice as long as memcpy, but there are 2 reads and a write vs. 1 read and a write for the direct copy. Moreover, the index adds some dependency. The min value is 1.56 CPI and the max value 3.8 CPI. Write linear is slightly longer (5-10%).
Reads and writes with a random index are the purpose of the original question and deserve longer comments. Here are the results.
size 4000 6000 9000 13496 20240 30360 45536 68304 102456 153680 230520 345776 518664 777992 1166984
rd-rand 1.86821 2.52813 2.90533 3.50055 4.69627 5.10521 5.07396 5.57629 6.13607 7.02747 7.80836 10.9471 15.2258 18.5524 21.3811
wr-rand 7.07295 7.21101 7.92307 7.40394 8.92114 9.55323 9.14714 8.94196 8.94335 9.37448 9.60265 11.7665 15.8043 19.1617 22.6785
small values (<10k): The L1 cache is 32k and can hold a 4k array of uint64. Note that, due to the randomness of the index, after ~1/8 of the iterations the L1 cache will be completely filled with values of the random index array (as cache lines are 64 bytes and can hold 8 array elements). Accesses to the other, linear arrays will rapidly generate many L1 misses and we have to use the L2 cache. L1 cache access is 5 cycles, but it is pipelined and can serve a couple of values per cycle. L2 access is longer and requires 12 cycles. The number of misses is similar for random reads and writes, but we see that we fully pay the double access required for writes when the array size is small.
medium values (10k-100k): The L2 cache is 256k and it can hold a 32k int64 array. After that, we need to go to the L3 cache (12 MB). As size increases, the number of misses in L1 and L2 increases and the computation time accordingly. Both algorithms have a similar number of misses, mostly due to random reads or writes (the other accesses are linear and can be very efficiently prefetched by the caches). We retrieve the factor of two between random reads and writes already noted in B.M.'s answer. It can be partly explained by the double cost of writes.
large values (>100k): the difference between the methods is progressively reduced. For these sizes, a large part of the information is stored in the L3 cache. The L3 size is sufficient to hold a full array of 1.5M elements and lines are less likely to be evicted. Hence, for writes, after the initial read, a larger number of writes can be done without line eviction, and the relative cost of writes vs. reads is reduced. For these large sizes, there are also many other factors that need to be considered. For instance, caches can only serve a limited number of misses (typ. 16), and when the number of misses is large, this may become the limiting factor.
One word on the parallel OpenMP version of the random reads and writes. Except for small sizes, where having the random index array spread over several caches may not be an advantage, they are systematically about twice as fast. For large sizes, we clearly see that the gap between random reads and writes increases due to false sharing.
It is almost impossible to do quantitative predictions with the complexity of present computer architectures, even for simple code, and even qualitative explanations of the behaviour are difficult and must take into account many factors. As mentioned in other answers, software aspects related to python can also have an impact. But, while it may happen in some situations, most of the time, one cannot consider that reads are more expensive because of data dependency.
First, a refutation of your intuition: fwd beats inv even without the numpy mechanism.
It is the case for this numba version:
import numba

@numba.njit
def fwd_numba(a, b, c):
    for i in range(N):
        c[a[i]] = b[i]

@numba.njit
def inv_numba(a, b, c):
    for i in range(N):
        c[i] = b[a[i]]
Timings for N= 10 000:
%timeit fwd()
%timeit inv()
%timeit fwd_numba(a,b,c)
%timeit inv_numba(a,b,c)
62.6 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
144 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16.6 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.9 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Second, NumPy has to deal with fearsome problems of alignment and (cache) locality.
It's essentially a wrapper around low-level procedures from BLAS/ATLAS/MKL tuned for that.
Fancy indexing is a nice high-level tool, but heretical for these problems; there is no direct translation of this concept at the low level.
Third, the numpy dev docs detail fancy indexing. In particular:
Unless there is only a single indexing array during item getting, the validity of the indices is checked beforehand. Otherwise it is handled in the inner loop itself for optimization.
We are in this case here. I think this can explain the difference, and why set is slower than get.
It also explains why hand-made numba is often faster: it doesn't check anything and crashes on an inconsistent index.
Your two NumPy snippets b[a] and c[a] = b seem like reasonable heuristics for measuring shuffled/linear read/write speeds, as I'll try to argue by looking at the underlying NumPy code in the first section below.
Regarding the question of which ought to be faster, it seems plausible that shuffled-read-linear-write could typically win (as the benchmarks seem to show), but the difference in speed may be affected by how "shuffled" the shuffled index is, and one or more of:
The CPU cache read/update policies (write-back vs. write-through, etc.).
How the CPU chooses to (re)order the instructions it needs to execute (pipelining).
The CPU recognising memory access patterns and pre-fetching data.
Cache eviction logic.
Even making assumptions about which policies are in place, these effects are difficult to model and reason about analytically and so I'm not sure a general answer applicable to all processors is possible (although I am not an expert in hardware).
Nevertheless, in the second section below I'll attempt to reason about why the shuffled-read-linear-write is apparently faster, given some assumptions.
"Trivial" Fancy Indexing
The purpose of this section is to go through the NumPy source code to determine whether there are any obvious explanations for the timings, and to get as clear an idea as possible of what happens when A[B] or A[B] = C is executed.
The iteration routine underpinning the fancy-indexing for getitem and setitem operations in this question is "trivial":
B is a single-indexing array with a single stride
A and B have the same memory order (both C-contiguous or both Fortran-contiguous)
Furthermore, in our case both A and B are Uint Aligned:
Strided copy code: Here, "uint alignment" is used instead. If the itemsize [N] of an array is equal to 1, 2, 4, 8 or 16 bytes and the array is uint aligned then instead [of using buffering] numpy will do *(uintN*)dst) = *(uintN*)src) for appropriate N. Otherwise numpy copies by doing memcpy(dst, src, N).
The point here is that the use of an internal buffer to ensure alignment is avoided. The underlying copying implemented with *(uintN*)dst) = *(uintN*)src) is as straightforward as "put the X bytes from offset src into the X bytes at offset dst".
Compilers will likely translate this very simply into mov instructions (on x86 for example), or similar.
The core low-level code which performs the getting and setting of items is in the functions mapiter_trivial_get and mapiter_trivial_set. These functions are produced in lowlevel_strided_loops.c.src, where the templating and macros make it somewhat challenging to read (an occasion to be grateful for higher-level languages).
Persevering, we can eventually see that there is little difference between getitem and setitem. Here is a simplified version of the main loop for exposition. The macro lines determine whether we're running getitem or setitem:
while (itersize--) {
    char * self_ptr;
    npy_intp indval = *((npy_intp*)ind_ptr);

#if @isget@
    if (check_and_adjust_index(&indval, fancy_dim, 0, _save) < 0 ) {
        return -1;
    }
#else
    if (indval < 0) {
        indval += fancy_dim;
    }
#endif

    self_ptr = base_ptr + indval * self_stride; /* offset into array being indexed */

#if @isget@
    *(npy_uint64 *)result_ptr = *(npy_uint64 *)self_ptr;
#else
    *(npy_uint64 *)self_ptr = *(npy_uint64 *)result_ptr;
#endif

    ind_ptr += ind_stride;       /* move to next item of index array */
    result_ptr += result_stride; /* move to next item of result array */
}
As we might expect, this simply amounts to some arithmetic to get the correct offset into the arrays, and then copying bytes from one memory location to another.
Extra index checks for setitem
One thing worth mentioning is that for setitem, the validity of the indices (whether they are all inbounds for the target array) is checked before copying begins (via check_and_adjust_index), which also replaces negative indices with corresponding positive indices.
In the snippet above you can see check_and_adjust_index called for getitem in the main loop, while a simpler (possibly redundant) check for negative indices occurs for setitem.
This extra preliminary check could conceivably have a small but negative impact on the speed of setitem (A[B] = C).
Cache misses
Because the code for both code snippets is so similar, suspicion falls on the CPU and how it handles access to the underlying arrays of memory.
The CPU caches small blocks of memory (cache lines) that have been recently accessed in the anticipation that it will probably soon need to access that region of memory again.
For context, cache lines are generally 64 bytes. The L1 (fastest) data cache on my ageing laptop's CPU is 32KB (512 cache lines, so enough to hold around 4000 int64 values from the array, but keep in mind that the CPU will be doing other things requiring other memory while the NumPy snippet executes):
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K
As you are probably already aware, caching works well for reading/writing memory sequentially because 64-byte blocks of memory are fetched as needed and stored closer to the CPU. Repeated access to that block of memory is quicker than fetching it from RAM (or a slower, higher-level cache). In fact, the CPU may even preemptively fetch the next cache line before it is requested by the program.
On the other hand, randomly accessing memory is likely to cause frequent cache misses. Here, the region of memory with the required address is not in the fast cache near the CPU and instead must be accessed from a higher-level cache (slower) or the actual memory (much slower).
So which is faster for the CPU to handle: frequent data read misses, or data write misses?
Let's assume the CPU's write policy is write-back, meaning that modified memory is written back to the cache. The cache line is marked as modified (or "dirty"), and the change will only be written back to main memory once the line is evicted from the cache (the CPU can still read from a dirty cache line).
If we are writing to random points in a large array, the expectation is that many of the cache lines in the CPU's cache will become dirty. A write back to main memory will be needed as each one is evicted, which may occur often if the cache is full.
However, this write back should happen less frequently when writing data sequentially and reading it at random, as we expect fewer cache lines to become dirty, so data is written back to main memory or slower caches less regularly.
As mentioned, this is a simplified model and there may be many other factors that influence the CPU's performance. Someone with more expertise than me may well be able to improve this model.
Your function fwd isn't touching the global variable c. You didn't declare global c in it (only in setup), so it has its own local variable and uses STORE_FAST in CPython:
>>> import dis
>>> def fwd():
... c = b[a]
...
>>> dis.dis(fwd)
2 0 LOAD_GLOBAL 0 (b)
3 LOAD_GLOBAL 1 (a)
6 BINARY_SUBSCR
7 STORE_FAST 0 (c)
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
Now, let's try that with a global:
>>> def fwd2():
... global c
... c = b[a]
...
>>> dis.dis(fwd2)
3 0 LOAD_GLOBAL 0 (b)
3 LOAD_GLOBAL 1 (a)
6 BINARY_SUBSCR
7 STORE_GLOBAL 2 (c)
10 LOAD_CONST 0 (None)
13 RETURN_VALUE
Even so, it may differ in time compared to the inv function which calls setitem for a global.
Either way, if you wanted it to write into c, you would need something like c[:] = b[a]. The assignment replaces the variable (name) with the object from the right-hand side, so the old c might be getting deallocated instead of the new b[a], and that sort of memory shuffling can be costly.
As for the effect I think you wanted to measure, basically whether forward or inverse permutations are more costly, that would be highly cache dependent. Forward permutation (storing at randomly ordered indices from a linear read) could in principle be faster because it can use write masking and never fetch the new array, assuming the cache system is smart enough to preserve byte masks in the write buffer. Backward runs a high risk of cache collisions while performing the random read if the array is large enough.
That was my initial impression; the results, as you say, are the opposite. This could be a result of a cache implementation that doesn't have a large write buffer or can't exploit small writes. If out-of-cache accesses require the same memory bus time anyway, the read access has a chance of loading data that won't be expunged from the cache before it's needed. With a multiway cache, the partially written lines also have a chance of not being chosen for eviction; and only dirty cache lines require memory bus time to drop. A lower-level program written with other knowledge (e.g. that the permutation is complete and non-overlapping) could improve the behaviour using hints such as non-temporal SSE writes.
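To make the point about rebinding vs. writing in place measurable, here is a small sketch (my own, using the same setup as the question but smaller timing counts) comparing the original fwd, an in-place variant c[:] = b[a], and inv:

import numpy as np
from timeit import timeit

N = 10_000
a = np.random.permutation(N)
b = np.random.random(N)
c = np.empty_like(b)

def fwd_rebind():
    c = b[a]          # allocates a new array and rebinds a local name

def fwd_inplace():
    c[:] = b[a]       # writes into the preallocated buffer

def inv_inplace():
    c[a] = b          # the original inv: always writes into the buffer

for f in (fwd_rebind, fwd_inplace, inv_inplace):
    print(f.__name__, timeit(f, number=10_000))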
The following experiment corroborates that random writes are faster than random reads. For small sizes of the data (when it fits entirely in the caches) the random-writing code is slower than the random-reading one (probably because of certain implementation peculiarities in numpy), but as the data size grows the initial 1.7x difference in execution time is almost completely eliminated (however, in the case of numba there is a strange reversal of that trend at the end).
$ cat test.py
import numpy as np
from timeit import timeit
import numba

def fwd(a, b, c):
    c = b[a]

def inv(a, b, c):
    c[a] = b

@numba.njit
def fwd_numba(a, b, c):
    for i, j in enumerate(a):
        c[i] = b[j]

@numba.njit
def inv_numba(a, b, c):
    for i, j in enumerate(a):
        c[j] = b[i]

for p in range(4, 8):
    N = 10**p
    n = 10**(9-p)
    a = np.random.permutation(N)
    b = np.random.random(N)
    c = np.empty_like(b)
    print('---- N = %d ----' % N)
    for f in 'fwd', 'fwd_numba', 'inv', 'inv_numba':
        print(f, timeit(f+'(a,b,c)', number=n, globals=globals()))
$ python test.py
---- N = 10000 ----
fwd 1.1199337750003906
fwd_numba 0.9052993479999714
inv 1.929507338001713
inv_numba 1.5510062070025015
---- N = 100000 ----
fwd 1.8672701190007501
fwd_numba 1.5000483989970235
inv 2.509873716000584
inv_numba 2.0653326050014584
---- N = 1000000 ----
fwd 7.639554155000951
fwd_numba 5.673054756000056
inv 7.685382894000213
inv_numba 5.439735023999674
---- N = 10000000 ----
fwd 15.065879136000149
fwd_numba 12.68919651500255
inv 15.433822674000112
inv_numba 14.862108078999881

Is a generator faster than a while loop in Python?

The question is simple. I have the following code that does the same thing in Python 2:
for _ in range(n):  # or xrange(); they have similar performance according to my test
    pass

i = 0
while i < n:
    i += 1
    pass
The for loop is faster than the while loop: when n = 1000000, they take roughly 0.105544 s and 0.2389421 s respectively.
On the surface it looks like the while loop is doing the increment and boundary check, but as far as I know, the generator or iterator has to perform the same amount of hard work, so if the work done is the same, why is one faster than the other?
From the Python generator wiki:
def generator(n):
    i = 0
    while i < n:
        yield i
        i += 1
In the case of an iterator, there is usually a member function called next, and every time it is called, it returns the "next item in the iterable". To me this means a lot of function calls, and thus a huge overhead on the stack (more assembly code to push and pop the stack). Based on my knowledge of coroutines (generators), they try to circumvent this by creating a new, separate stack (just like a thread, managing its own program counter). Although this avoids tons of function calls, it bears the same problem as a thread, namely the overhead of context switches.
How can the while loop be slower when it does not face any of the overheads I mentioned above?
I expect the performance difference you're seeing has to do with which parts of the code are defined in Python and which are defined inside the interpreter (in C, for CPython). The calls to next in the for-loop case, for instance, are going to be handled in C, and for a range or other built-in iterable, the implementation of that function will also be in C, so it can be pretty fast. The bounds check in the while loop, on the other hand, is a Python expression which needs to be evaluated on each pass of the loop. Python code is almost always going to be slower than C code, so it's not too shocking that a for loop may be faster than a while loop in some situations.
Note however that both kinds of loops are probably much faster than any sort of useful work you might be doing inside of them. It is almost never worth focusing your efforts on the very small performance differences between different kinds of loops like this, rather than on larger issues like the complexity of your algorithms or the efficiency of your data structures.
The only exception might be if you've done a bunch of profiling of your code and found that a specific loop is the greatest performance bottleneck for your particular program. If that's the case, micro-optimize to your heart's content.
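One quick way to see this (an illustration, not from the answer above) is to compare the bytecode of the two loops with dis: the for loop's per-iteration work is essentially a single FOR_ITER whose range iteration runs in C, while the while loop re-evaluates the comparison and the increment as several separate bytecodes on every pass.

import dis

def loop_for(n):
    for _ in range(n):
        pass

def loop_while(n):
    i = 0
    while i < n:
        i += 1

dis.dis(loop_for)    # per iteration: essentially FOR_ITER (range.__next__ runs in C)
dis.dis(loop_while)  # per iteration: LOAD_FAST, COMPARE_OP, a jump, plus the += bytecodes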
