I have some simple code that:
Reads a trajectory file that can be seen as a list of 2D arrays (lists of positions in space), stored in Y.
I then want to compute the RMSD for each pair (scipy.pdist style).
My code works fine:
trajectory = read("test.lammpstrj", index="::")
m = len(trajectory)
#.get_positions() return a 2d numpy array
Y = np.array([snapshot.get_positions() for snapshot in trajectory])
b = [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
This code executes in 0.86 seconds using Python 3.10; using Julia 1.8, the same kind of code executes in 0.46 seconds.
I plan to have much larger trajectories (~200,000 elements). Would it be possible to get a speed-up using Python, or should I stick to Julia?
You've mentioned that snapshot.get_positions() returns a 2D array; suppose it has shape (p, q). Then Y is a 3D array of shape (m, p, q), where m is the number of snapshots in the trajectory. You also expect m to become rather large.
Let's see a basic way to speed up the distance calculation, on the setting m=1000:
import numpy as np
# dummy inputs
m = 1000
p, q = 4, 5
Y = np.random.randn(m, p, q)
# your current method
def foo():
    return [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
# vectorized approach -> compute the upper triangle of the pairwise distance matrix
def bar():
    u, v = np.triu_indices(Y.shape[0], 1)
    return np.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# Check for correctness
out_1 = foo()
out_2 = bar()
print(np.allclose(out_1, out_2))
# True
If we test the time required:
%timeit -n 10 -r 3 foo()
# 3.16 s ± 50.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
The first method is really slow: it takes over 3 seconds for this calculation. Let's check the second method:
%timeit -n 10 -r 3 bar()
# 97.5 ms ± 405 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So we have a ~30x speedup here, which would make your large calculation in python much more feasible than using the original code. Feel free to test out with other sizes of Y to see how it scales compared to the original.
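One caveat for m around 200,000: bar() materialises Y[u] - Y[v], an array of shape (m*(m-1)/2, p, q), which is unlikely to fit in memory. Below is a minimal sketch of a chunked variant that processes a block of rows at a time and writes into the same condensed (pdist-style) layout. The name pairwise_rmsd_chunked and the chunk parameter are illustrative and not part of the code above; the chunk size would need tuning, and the condensed output itself still has m*(m-1)/2 entries.

```python
import numpy as np

def pairwise_rmsd_chunked(Y, chunk=256):
    """Same values as bar(), computed one block of rows at a time."""
    m = Y.shape[0]
    flat = Y.reshape(m, -1)          # (m, p*q)
    n = flat.shape[1]
    out = np.empty(m * (m - 1) // 2)
    pos = 0
    for start in range(0, m, chunk):
        stop = min(start + chunk, m)
        # differences between rows [start, stop) and all rows >= start
        diff = flat[start:stop, None, :] - flat[None, start:, :]
        d = np.sqrt(3.0 * (diff ** 2).sum(axis=-1) / n)
        for r in range(stop - start):
            row = d[r, r + 1:]       # keep only pairs with j > i
            out[pos:pos + row.size] = row
            pos += row.size
    return out

rng = np.random.default_rng(0)
Y_demo = rng.standard_normal((50, 4, 5))
out_chunked = pairwise_rmsd_chunked(Y_demo, chunk=16)
```

The temporary diff array has shape (chunk, m - start, p*q), so peak memory is controlled by chunk rather than by the total number of pairs.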
JIT
In addition, you can also try out JIT, mainly jax or numba. It is fairly simple to port the function bar with jax.numpy, for example:
import jax
import jax.numpy as jnp
@jax.jit
def jit_bar(Y):
    u, v = jnp.triu_indices(Y.shape[0], 1)
    return jnp.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# check for correctness
print(np.allclose(bar(), jit_bar(Y)))
# True
If we test the time of the jitted jnp op:
%timeit -n 10 -r 3 jit_bar(Y)
# 10.6 ms ± 678 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So compared to the original, we can reach up to a ~300x speedup.
Note that not every operation can be converted to jax/jit so easily (this particular problem is conveniently suitable), so the general advice is to simply avoid python loops and use numpy's broadcasting/vectorization capabilities, like in bar().
Stick to Julia.
If you already made it in a language which runs faster, why are you trying to use python in the first place?
Your question is about speeding up Python, relative to Julia, so I'd like to offer some Julia code for comparison.
Since your data is most naturally expressed as a list of 4x5 arrays, I suggest expressing it as a vector of SMatrixes:
sumdiff2(A, B) = sum((A[i] - B[i])^2 for i in eachindex(A, B))
function dists(Y)
    M = length(Y)
    V = Vector{float(eltype(eltype(Y)))}(undef, sum(1:M-1))
    Threads.@threads for i in eachindex(Y)
        ii = sum(M-i+1:M-1) # don't worry about this sum
        for j in i+1:lastindex(Y)
            ind = ii + (j-i)
            V[ind] = sqrt(3 * sumdiff2(Y[i], Y[j])/length(Y[i]))
        end
    end
    return V
end
using Random: randn
using StaticArrays: SMatrix
Ys = [randn(SMatrix{4,5,Float64}) for _ in 1:1000];
Benchmarks:
# single-threaded
julia> using BenchmarkTools
julia> @btime dists($Ys);
  6.561 ms (2 allocations: 3.81 MiB)
# multi-threaded with 6 cores
julia> @btime dists($Ys);
  1.606 ms (75 allocations: 3.82 MiB)
I was not able to install jax on my computer, but when comparing with @Mercury's numpy code I got
foo: 5.5 seconds
bar: 179 ms
i.e. the multi-threaded Julia version is approximately 3400x faster than foo.
It is possible to write this as a one-liner at a ~2-3x performance cost.
While Python tends to be slower than Julia for many tasks, it is possible to write numerical code in Python that is as fast as Julia by using Numba and plain loops. Indeed, Numba is built on llvmlite, which provides a JIT compiler based on the LLVM toolchain. The standard implementation of Julia also uses a JIT and the LLVM toolchain. This means the two should perform similarly, apart from language-specific overheads that become negligible once the computation runs in parallel (because the resulting computation will be memory-bound on nearly all modern platforms).
This computation can be parallelized in both Julia and Python (still using Numba). While writing a sequential computation is quite straightforward, writing a parallel one is a bit more complex. Indeed, computing the upper triangular values line by line can result in an imbalanced workload and thus a sub-optimal execution time. An efficient strategy is to compute, in each iteration, a pair of lines: one from the top of the upper triangular part and one from the bottom. With 0-based indexing, the top line i contains m-i-1 items while the bottom line m-i-1 contains i items, so each iteration computes m-1 items regardless of i. This results in much better load balancing. When m is odd, the middle line has no partner and needs to be computed separately.
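The pairing argument can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper names line_length and line_offset are made up here, and indices are 0-based as in the implementation that follows.

```python
def line_length(i, m):
    """Number of pairs (i, j) with i < j < m, i.e. the length of line i."""
    return m - i - 1

def line_offset(i, m):
    """Index of the first entry of line i in the condensed output."""
    return (i * (2 * m - i - 1)) // 2

m = 7
# work done when line i is paired with line m - 1 - i
work_per_iteration = [line_length(i, m) + line_length(m - 1 - i, m)
                      for i in range(m // 2)]
offsets = [line_offset(i, m) for i in range(m)]
print(work_per_iteration)  # every entry is the same
print(offsets)
```

For m = 7 the paired lines are (0, 6), (1, 5) and (2, 4), and line 3 is the unpaired middle line. Consecutive offsets differ by exactly the length of the preceding line, which is what lets every line write into its own slice of the result.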
Here is the final implementation:
import numba as nb
import numpy as np
@nb.njit(inline='always', fastmath=True)
def compute_line(tmp, res, i, m):
    n = tmp.shape[1]  # number of coordinates per snapshot
    # start of line i in the condensed (upper-triangle) output
    offset = (i * (2 * m - i - 1)) // 2
    factor = 3.0 / n
    for j in range(i + 1, m):
        s = 0.0
        for k in range(n):
            s += (tmp[i, k] - tmp[j, k]) ** 2
        res[offset] = np.sqrt(s * factor)
        offset += 1
    return res
@nb.njit('()', parallel=True, fastmath=True)
def fastest():
    m, n = Y.shape[0], Y.shape[1] * Y.shape[2]
    res = np.empty(m*(m-1)//2)
    tmp = Y.reshape(m, n)
    # pair line i (long) with line m-i-1 (short) to balance the work
    for i in nb.prange(m//2):
        compute_line(tmp, res, i, m)
        compute_line(tmp, res, m-i-1, m)
    # for odd m, the middle line m//2 has no partner
    if m % 2 == 1:
        compute_line(tmp, res, m//2, m)
    return res
# [...] same as others
%timeit -n 100 fastest()
Results
Here are performance results on my machine (an i5-9600KF with 6 cores):
foo (seq, Python, Mercury): 4910.7 ms
bar (seq, Python, Mercury): 134.2 ms
jit_bar (seq, Python, Mercury): ???
dists (seq, Julia, DNF) 6.9 ms
dists (par, Julia, DNF) 2.2 ms
fastest (par, Python, me): 1.5 ms <-----
(Jax does not work on my machine, so I cannot test it yet.)
This implementation is the fastest one and succeeds in beating the best Julia code so far.
Optimal implementation
Note that for large arrays like (200_000,4,5), all implementations provided so far are inefficient since they are not cache friendly. Indeed, the input array will take 32 MiB and will not fit in the cache of most modern processors (and even if it could, one needs to consider the space needed for the output and the fact that caches are not perfect). This can be fixed using tiling, at the expense of even more complex code. I think such an implementation should be optimal if you use Z-order curves.
Related
I have a 500-by-40K matrix A and a 40K-element vector q in the form of numpy arrays. By my estimate, it would take 16 minutes to compute their dot product with A.dot(q). Because there are 50K vectors q that I need to do this computation with (which is a lot), I would like to reduce the computing time as much as possible.
Are there more efficient ways to compute the dot product of A and q, or better, of A and Q=[q1, ... , qn]? Scipy seems to provide a solution for this, but I don't quite understand how I should approach my problem here. Can anybody help me out?
from time import process_time
t1 = process_time()
A[0,:].dot(q)
t2 = process_time()
print("Estimated Total Time:", (t2-t1)*A.shape[0]/60)
Here is a sample of what you can do:
Q = np.column_stack((q1, q2, q3, q4, ...))
B = np.dot(A, Q)
In the output, column i of B (B[:, i]) is the product of A and q_i. This should be fairly fast, even for your required array sizes. It should not take anywhere near as long as you mentioned, probably a matter of a minute or so.
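As a small sanity check of the stacking idea, here is a sketch on toy sizes. It assumes each q is a 1-D vector, in which case np.column_stack places them side by side as the columns of Q; the sizes and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))                  # stands in for the 500 x 40K matrix
qs = [rng.normal(size=8) for _ in range(3)]  # stands in for the 50K vectors

Q = np.column_stack(qs)   # shape (8, 3): one q per column
B = A @ Q                 # shape (5, 3): one product per column

# column i of B equals the single-vector product A.dot(qs[i])
same = all(np.allclose(B[:, i], A.dot(q)) for i, q in enumerate(qs))
print(B.shape, same)
```

The single matrix product lets the underlying BLAS routine reuse A across all columns instead of streaming it from memory once per vector.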
If you have sufficient memory to hold A and all the q vectors, it should not take 16 minutes to compute those products, if you put all the q vectors into an array Q and call A.dot(Q).
This is on a laptop with a 2.6 GHz Intel Core i7 and 16 GB memory. I'm using the interactive Python shell ipython:
In [1]: import numpy as np
In [2]: rng = np.random.default_rng()
In [3]: A = rng.normal(size=(500, 40_000)).astype(np.float32)
In [4]: Q = np.ones((40_000, 50_000), dtype=np.float32)
In [5]: %timeit A.dot(Q)
7.79 s ± 235 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's under 8 seconds--pretty far from 16 minutes!
If you don't have sufficient memory, it could take a lot longer, as the OS might start swapping out memory to disk.
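If the full Q does not fit in memory, one option is to multiply a batch of columns at a time. The following is only a sketch on toy sizes: dot_in_batches is a made-up helper name, and in practice each batch would be loaded or generated on demand rather than sliced from an array that already exists in memory.

```python
import numpy as np

def dot_in_batches(A, q_batches):
    """Yield A @ Q for each batch Q of shape (n, k)."""
    for Q in q_batches:
        yield A @ Q

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8)).astype(np.float32)
Q_full = rng.normal(size=(8, 10)).astype(np.float32)

# here the batches are slices of Q_full, purely for demonstration
batches = (Q_full[:, s:s + 4] for s in range(0, Q_full.shape[1], 4))
B = np.hstack(list(dot_in_batches(A, batches)))
print(B.shape)
```

Larger batches generally make better use of the matrix-multiply kernel, so the batch size is a trade-off between memory use and throughput.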
I'm following the exercises from "Doing Bayesian Data Analysis" in both R and Python.
I would like to find a fast method of doing Monte-Carlo simulation that uses constant space.
The problem below is trivial, but serves as a good test for different methods:
ex 4.3
Determine the exact probability of drawing a 10 from a shuffled pinochle deck. (In a pinochle deck, there are 48 cards. There are six values: 9, 10, Jack, Queen, King, Ace. There are two copies of each value in each of the standard four suits: hearts, diamonds, clubs, spades.)
(A) What is the probability of getting a 10?
Of course, the answer is 1/6.
The fastest solution I could find (comparable to the speed of R) is generating a large array of card draws using np.random.choice, then applying a Counter. I don't like the idea of creating arrays unnecessarily, so I tried using a dictionary and a for loop, drawing one card at a time and incrementing the count for that type of card. To my surprise, it was much slower!
The full code is below for the 3 methods I tested. Is there a way of doing this that will be as performant as method1(), but using constant space?
Python code: (Google Colab link)
import random
from collections import Counter, defaultdict
import numpy as np
import pandas as pd
deck = [c for c in ['9','10','Jack','Queen','King','Ace'] for _ in range(8)]
num_draws = 1000000
def method1():
    draws = np.random.choice(deck, size=num_draws, replace=True)
    df = pd.DataFrame([Counter(draws)])/num_draws
    print(df)
def method2():
    card_counts = defaultdict(int)
    for _ in range(num_draws):
        card_counts[np.random.choice(deck, replace=True)] += 1
    df = pd.DataFrame([card_counts])/num_draws
    print(df)
def method3():
    card_counts = defaultdict(int)
    for _ in range(num_draws):
        card_counts[deck[random.randint(0, len(deck)-1)]] += 1
    df = pd.DataFrame([card_counts])/num_draws
    print(df)
Python timeit() results:
method1: 1.2997
method2: 23.0626
method3: 5.5859
R code:
card = sample(deck, numDraws, replace=TRUE)
print(as.data.frame(table(card)/numDraws))
Here's one with np.unique+np.bincount -
def unique():
    unq,ids = np.unique(deck, return_inverse=True)
    all_ids = np.random.choice(ids, size=num_draws, replace=True)
    ar = np.bincount(all_ids)/num_draws
    return pd.DataFrame(ar[None], columns=unq)
How does NumPy help here?
There are two major improvements that help us here:
We convert the string data to numeric. NumPy works well with such data. To achieve this, we are using np.unique.
We use np.bincount to replace the counting step. Again, it works well with numeric data and we do have that from the numeric conversion done at the start of this method.
NumPy in general works well with large data, which is the case here.
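To make the two steps concrete, here is a tiny illustration of what np.unique with return_inverse=True and np.bincount return. The four-card list is made up for the example.

```python
import numpy as np

cards = ['9', '10', 'Jack', '9']

# unq holds the sorted distinct labels; ids maps each card to its position in unq
unq, ids = np.unique(cards, return_inverse=True)
counts = np.bincount(ids)   # how often each integer id occurs

print(unq)     # ['10' '9' 'Jack']  (strings sort lexicographically)
print(ids)     # [1 0 2 1]
print(counts)  # [1 2 1]
```

Once the labels are integers, drawing and counting operate on a compact integer array rather than on Python strings.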
Timings with given sample dataset comparing against fastest method1 -
In [177]: %timeit method1()
328 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [178]: %timeit unique()
12.4 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy achieves efficiency by running C code in its numerical engine. Python is convenient, but it is orders of magnitude slower than C.
In Numpy and other high-performance Python libraries, the Python code consists mostly of glue code, preparing the task to be dispatched. Since there is overhead, it is much faster to draw a lot of samples at once.
Remember that providing a buffer of 1 million elements for NumPy to work with is still constant space. You can then sample 1 billion times by looping over it.
This extra memory allocation is usually not a problem. If you must avoid using memory at all costs while still getting performance benefits from Numpy, you can try using Numba or Cython to accelerate it.
from numba import jit
@jit(nopython=True)
def method4():
    card_counts = np.zeros(6)
    for _ in range(num_draws):
        card_counts[np.random.randint(0, 6)] += 1
    return card_counts/num_draws
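The fixed-size buffer idea mentioned above can also be sketched without Numba. This is a minimal illustration, not a tuned implementation: batched_counts and its parameters are made-up names, and cards are represented by the integers 0 to 5, which is valid here because all six values are equally likely.

```python
import numpy as np

def batched_counts(num_draws, batch_size=1_000_000, num_values=6, seed=0):
    """Estimate the frequency of each value using O(batch_size) memory."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_values, dtype=np.int64)
    remaining = num_draws
    while remaining > 0:
        k = min(batch_size, remaining)
        draws = rng.integers(0, num_values, size=k)          # one batch of draws
        counts += np.bincount(draws, minlength=num_values)   # accumulate counts
        remaining -= k
    return counts / num_draws

freqs = batched_counts(200_000, batch_size=50_000)
print(freqs)
```

Memory use depends only on batch_size, so the total number of draws can grow without the buffer growing with it.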
I have two large matrices (40000*4096), and I would like to compare each row of the first matrix with all of the rows of the second matrix, so the output will have size (40000*40000). However, since I need to do this several thousand times, it is wildly time-consuming: 26k seconds for each iteration, so for 5000 times ...
I would be glad if you could give me some smart suggestion. Thank you.
P.S. this is what I did so far for just one iteration (1 of 5000)
def matcher(Antigens, Antibodies, ind):
    temp = np.zeros((Antibodies.shape[0],Antibodies.shape[1]))
    output = np.zeros((Antibodies.shape[0],1))
    for i in range(len(Antibodies)):
        temp[i] = np.int32(np.equal(Antigens[ind],Antibodies[i]))
        output[i] = np.sum(temp[i])
    return output
output = [matcher(Antigens, Antibodies, ind) for ind in range(len(Antigens))]
Okay, I think I understand what your goal is:
Count the number of row matches (antigen vs antibody matrix). Each row of the resulting vector (40,000 x 1) represents a count of exact matches between 1 antigen row and all of the antibody rows (so values from 0 to 40_000).
I made some fake data:
import numpy as np
import numba as nb
num_mat = 5 # number of matrices
num_row = 10_000 # number of rows per matrix
num_elm = 4_096 # number of elements per row
dim = (num_mat,num_row,num_elm)
Antigens = np.random.randint(0,256,dim,dtype=np.uint8)
Antibodies = np.random.randint(0,256,dim,dtype=np.uint8)
There's one important point here: I reduced the matrices to the smallest datatype that can represent the data in order to reduce their memory footprint. I'm not sure what your data looks like, but hopefully you can do this as well.
Also, the following code assumes your dimensions look like the fake data:
(number of matrices, rows, elements)
@nb.njit
def match_arr(arr1, arr2):
    for i in range(arr1.shape[0]): #4096 vs 4096
        if arr1[i] != arr2[i]:
            return False
    return True
@nb.njit
def match_mat_sum(ag, ab):
    out = np.zeros((ag.shape[0])) # 40000
    for i in range(ag.shape[0]):
        tmp = 0
        for j in range(ab.shape[0]):
            tmp += match_arr(ag[i], ab[j])
        out[i] = tmp
    return out
@nb.njit(parallel=True)
def match_sets(Antigens, Antibodies):
    out = np.empty((Antigens.shape[0] * Antibodies.shape[0], Antigens.shape[1])) # 5000 x 40000
    # multiprocessing per antigen matrix, may want to move this as suits your data
    for i in nb.prange(Antigens.shape[0]):
        for j in range(Antibodies.shape[0]):
            # each (i, j) pair writes to its own row of out, which avoids race conditions
            out[i * Antibodies.shape[0] + j] = match_mat_sum(Antigens[i], Antibodies[j])
    return out
I lean on Numba heavily. One of the key optimizations is not to check the equivalence of entire rows with np.equal() but to write a custom function match_arr() that breaks as soon as it finds a mis-matched element. Hopefully, this lets us skip a ton of comparisons.
Time comparison:
%timeit match_arr(arr1, arr2)
314 ns ± 0.361 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.equal(arr1, arr2)
1.07 µs ± 5.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
match_mat_sum
This function simply calculates the middle step (the 40,000 x 1 vector) that represents the sum of exact matches between two matrices. This step reduces two matrices like: (m x n), (o x n) -> (m)
match_sets()
The last function parallelizes this operation with explicit parallel loops through nb.prange. You might want to move this function to a different loop depending on what your data looks like (like if you have one antigen matrix, but 5000 antibody matrices, you should move prange to the inner loop or you'll not be leveraging parallelization). The fake data assumes some antigen and some antibody matrices.
Another important thing to note here is the indexing on the out array. In order to avoid race conditions, each iteration of the explicit parallel loop needs to write to a unique location. Again, depending on your data, you'll need to index the proper "place" to put the result.
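One common way to give every (i, j) pair its own output slot is a row-major flat index. The sketch below uses made-up names (flat_index, n_outer, n_inner) and plain Python, just to show that the mapping is collision-free and covers every slot exactly once.

```python
def flat_index(i, j, n_inner):
    """Row-major position of pair (i, j) when j runs over range(n_inner)."""
    return i * n_inner + j

n_outer, n_inner = 3, 4
indices = [flat_index(i, j, n_inner)
           for i in range(n_outer)
           for j in range(n_inner)]
print(indices)
```

Because no two pairs share an index, parallel iterations over i never write to the same location.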
On a Ryzen 1600 (6-core) with 16 gigs of RAM, using this fake data, I generated a result in 10.2 seconds.
Your data is about 3200x larger. Assuming linear scaling, the full set would take approximately 9 hours, provided you have enough memory.
You could write some kind of batch loader as well, rather than loading 5000 giant matrices directly into memory.
This problem can be tackled with a mixture of numpy broadcasting and the numexpr module, which performs operations fast while minimizing the storage of intermediate values:
import numexpr as ne
# expand arrays dimensions to support broadcasting when doing comparison
Antigens, Antibodies = Antigens[None, :, :], Antibodies[:, None, :]
output = ne.evaluate('sum((Antigens==Antibodies)*1, axis=2)')
# *1 is a hack because numexpr does not currently support sum on bool
This may be faster than your current solution, but for such large arrays it will take a while.
The performance of numexpr for this operation is a bit lackluster, but you can at least use broadcasting inside the loop:
output = np.zeros((Antibodies.shape[0],)*2, dtype=np.int32)
for row, out_row in zip(Antibodies, output):
    (row[None,:]==Antigens).sum(1, out=out_row)
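A middle ground between one huge broadcast and a row-by-row loop is to broadcast a small block of rows at a time. The sketch below is an assumption-laden illustration: match_counts and chunk are made-up names, and the data is tiny. The boolean temporary holds chunk x (number of antigen rows) x (row length) entries, so chunk has to be kept small for 40000 x 4096 inputs.

```python
import numpy as np

def match_counts(antigens, antibodies, chunk=8):
    """out[i, j] = number of positions where antibodies[i] equals antigens[j]."""
    out = np.empty((antibodies.shape[0], antigens.shape[0]), dtype=np.int32)
    for s in range(0, antibodies.shape[0], chunk):
        block = antibodies[s:s + chunk]
        # temporary boolean array of shape (len(block), n_antigens, n_elements)
        out[s:s + chunk] = (block[:, None, :] == antigens[None, :, :]).sum(axis=2)
    return out

rng = np.random.default_rng(0)
ag = rng.integers(0, 4, size=(6, 10), dtype=np.uint8)
ab = rng.integers(0, 4, size=(5, 10), dtype=np.uint8)
counts = match_counts(ag, ab, chunk=2)
print(counts.shape)
```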
I have an array of vectors and compute the norm of their diffs vs the first one.
When using NumPy broadcasting, the calculation is significantly slower than doing it via a simple loop. Why?
import numpy as np
def norm_loop(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = np.sum((M[i] - v)**2)
    return d
def norm_bcast(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    d = np.sum((M - v)**2, axis=1)
    return d
M = np.random.random_sample((1000, 10000))
v = M[0]
%timeit norm_loop(M, v)
25.9 ms
%timeit norm_bcast(M, v)
38.5 ms
I have Python 3.6.3 and Numpy 1.14.2
To run the example in google colab:
https://drive.google.com/file/d/1GKzpLGSqz9eScHYFAuT8wJt4UIZ3ZTru/view?usp=sharing
Memory access.
First off, the broadcast version can be simplified to
def norm_bcast(M, v):
    return np.sum((M - v)**2, axis=1)
This still runs slightly slower than the looped version.
Now, conventional wisdom says that vectorized code using broadcasting should always be faster, which in many cases isn't true (I'll shamelessly plug another of my answers here). So what's happening?
As I said, it comes down to memory access.
In the broadcast version, v is subtracted from every row of M. By the time the last row of M is processed, the results for the first row have been evicted from cache, so for the second step these differences are loaded into cache again and squared. Finally, they are loaded and processed a third time for the summation. Since M is quite large, parts of the cache are cleared on each step to accommodate all of the data.
In the looped version each row is processed completely in one smaller step, leading to fewer cache misses and overall faster code.
Lastly, it is possible to avoid this with some array operations by using einsum.
This function allows mixing matrix multiplications and summations.
First, I'll point out it's a function that has rather unintuitive syntax compared to the rest of numpy, and potential improvements often aren't worth the extra effort to understand it.
The answer may also be slightly different due to rounding errors.
In this case it can be written as
def norm_einsum(M, v):
    tmp = M-v
    return np.einsum('ij,ij->i', tmp, tmp)
This reduces it to two operations over the entire array - a subtraction, and calling einsum, which performs the squaring and summation.
This gives a slight improvement:
%timeit norm_bcast(M, v)
30.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_loop(M, v)
25.1 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_einsum(M, v)
21.7 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
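The cache argument above also suggests a middle ground between the per-row loop and the full broadcast: process a block of rows at a time, so that the temporaries stay small while the Python-level loop runs far fewer iterations. The following sketch is untimed; norm_blocked and block are made-up names, and whether it beats the other variants will depend on array sizes and hardware.

```python
import numpy as np

def norm_blocked(M, v, block=64):
    """Squared distances of each row of M to v, computed block by block."""
    d = np.empty(M.shape[0], dtype=M.dtype)
    for s in range(0, M.shape[0], block):
        diff = M[s:s + block] - v               # small temporary
        d[s:s + block] = np.sum(diff * diff, axis=1)
    return d

rng = np.random.default_rng(0)
M_demo = rng.random((300, 50))
v_demo = M_demo[0]
d_blocked = norm_blocked(M_demo, v_demo, block=64)
```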
Squeezing out maximum performance
With vectorized operations you clearly get bad cache behaviour. But the calculation itself is also slow because it does not exploit modern SIMD instructions (AVX2, FMA). Fortunately, it isn't really complicated to overcome these issues.
Example
import numpy as np
import numba as nb
@nb.njit(fastmath=True,parallel=True)
def norm_loop_improved(M, v):
    n = M.shape[0]
    d = np.empty(n,dtype=M.dtype)
    #enables SIMD-vectorization
    #if the arrays are not aligned
    M=np.ascontiguousarray(M)
    v=np.ascontiguousarray(v)
    for i in nb.prange(n):
        dT=0.
        for j in range(v.shape[0]):
            dT+=(M[i,j]-v[j])*(M[i,j]-v[j])
        d[i]=dT
    return d
Performance
M = np.random.random_sample((1000, 1000))
norm_loop_improved: 0.11 ms**, 0.28ms
norm_loop: 6.56 ms
norm_einsum: 3.84 ms
M = np.random.random_sample((10000, 10000))
norm_loop_improved:34 ms
norm_loop: 223 ms
norm_einsum: 379 ms
** Be careful when measuring performance
The first result (0.11 ms) comes from calling the function repeatedly with the same data. This would need 77 GB/s of read throughput from RAM, which is far more than my dual-channel DDR3 RAM is capable of. Since calling a function successively with the same input parameters isn't realistic at all, we have to modify the measurement.
To avoid this issue, we have to call the same function with different data at least twice (8 MB L3 cache, 8 MB of data), so that the caches are cleared between calls, and then divide the result by two.
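A minimal sketch of such a measurement follows. The helper time_alternating is a made-up name; it cycles through several different input sets so that consecutive calls do not find their data already in cache, and it reports the best average time per call. The demo arrays are deliberately small; for a meaningful measurement each dataset should be at least as large as the last-level cache.

```python
import time
import numpy as np

def time_alternating(func, datasets, repeats=5):
    """Best average seconds per call when cycling through different inputs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        for M, v in datasets:
            func(M, v)
        elapsed = (time.perf_counter() - t0) / len(datasets)
        best = min(best, elapsed)
    return best

def norm_bcast(M, v):
    return np.sum((M - v) ** 2, axis=1)

rng = np.random.default_rng(0)
datasets = []
for _ in range(2):
    M = rng.random((200, 200))
    datasets.append((M, M[0]))

best_time = time_alternating(norm_bcast, datasets)
print(f"{best_time * 1e3:.3f} ms per call")
```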
The relative performance of these methods also differs with array size (have a look at the einsum results).
As a user of Python for data analysis and numerical calculations, rather than a real "coder", I had been missing a really low-overhead way of distributing embarrassingly parallel loop calculations on several cores.
As I learned, there used to be the prange construct in Numba, but it was abandoned because of "instability and performance issues".
Playing with the newly open-sourced @guvectorize decorator I found a way to use it for virtually no-overhead emulation of the functionality of the late prange.
I am very happy to have this tool at hand now, thanks to the guys at Continuum Analytics, and did not find anything on the web explicitly mentioning this use of @guvectorize. Although it may be trivial to people who have been using NumbaPro earlier, I'm posting this for all those fellow non-coders out there (see my answer to this "question").
Consider the example below, where a two-level nested for loop with a core doing some numerical calculation involving two input arrays and a function of the loop indices is executed in four different ways. Each variant is timed with Ipython's %timeit magic:
1. naive for loop, compiled using numba.jit
2. forall-like construct using numba.guvectorize, executed in a single thread (target = "cpu")
3. forall-like construct using numba.guvectorize, executed in as many threads as there are cpu "cores" (in my case hyperthreads) (target = "parallel")
4. same as 3., however calling the "guvectorized" forall with the sequence of "parallel" loop indices randomly permuted
The last one is done because (in this particular example) the inner loop's range depends on the value of the outer loop's index. I don't know how exactly the dispatchment of gufunc calls is organized inside numpy, but it appears as if the randomization of "parallel" loop indices achieves slightly better load balancing.
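The scramble-and-restore bookkeeping can be illustrated without Numba. In this sketch the "work" is just squaring the loop index, a stand-in chosen for the example: results come back in scrambled order and np.argsort of the permutation puts them back in loop-index order.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

scrambled = rng.permutation(n)          # loop indices in random order
unscramble = np.argsort(scrambled)      # inverse permutation

# stand-in for the parallel loop body: slot k holds the result for index scrambled[k]
results_scrambled = scrambled ** 2

# reorder so that slot i holds the result for loop index i
results_in_order = results_scrambled[unscramble]
print(results_in_order)
```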
On my (slow) machine (1st gen core i5, 2 cores, 4 hyperthreads) I get the timings:
1 loop, best of 3: 8.19 s per loop
1 loop, best of 3: 8.27 s per loop
1 loop, best of 3: 4.6 s per loop
1 loop, best of 3: 3.46 s per loop
Note: I'd be interested if this recipe readily applies to target="gpu" (it should do, but I don't have access to a suitable graphics card right now), and what's the speedup. Please post!
And here's the example:
import numpy as np
from numba import jit, guvectorize, float64, int64
@jit
def naive_for_loop(some_input_array, another_input_array, result):
    for i in range(result.shape[0]):
        for k in range(some_input_array.shape[0] - i):
            result[i] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))
@guvectorize([(float64[:],float64[:],int64[:],float64[:])],'(n),(n),()->()', nopython=True, target='parallel')
def forall_loop_body_parallel(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0] # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))
@guvectorize([(float64[:],float64[:],int64[:],float64[:])],'(n),(n),()->()', nopython=True, target='cpu')
def forall_loop_body_cpu(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0] # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))
arg_size = 20000
input_array_1 = np.random.rand(arg_size)
input_array_2 = np.random.rand(arg_size)
result_array = np.zeros_like(input_array_1)
# do single-threaded naive nested for loop
# reset result_array inside %timeit call
%timeit -r 3 result_array[:] = 0.0; naive_for_loop(input_array_1, input_array_2, result_array)
result_1 = result_array.copy()
# do single-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_cpu(input_array_1, input_array_2, loop_indices, result_array)
result_2 = result_array.copy()
# do multi-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices, result_array)
result_3 = result_array.copy()
# do forall loop (loop indices scrambled for better load balancing)
# reset result_array inside %timeit call
loop_indices_scrambled = np.random.permutation(range(arg_size))
loop_indices_unscrambled = np.argsort(loop_indices_scrambled)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices_scrambled, result_array)
result_4 = result_array[loop_indices_unscrambled].copy()
# check validity
print(np.all(result_1 == result_2))
print(np.all(result_1 == result_3))
print(np.all(result_1 == result_4))