I have the following Python method, which selects a random element from the sequence seq, weighted by a second sequence weights that contains the weight of each element in seq:
import random

def weighted_choice(seq, weights):
    assert len(seq) == len(weights)

    total = sum(weights)
    r = random.uniform(0, total)
    upto = 0
    for i in range(len(seq)):
        if upto + weights[i] >= r:
            return seq[i]
        upto += weights[i]
    assert False, "Shouldn't get here"
If I call the above a million times with a 1000 element sequence, like this:
import time

seq = range(1000)
weights = []
for i in range(1000):
    weights.append(random.randint(1, 100))

st = time.time()
for i in range(1000000):
    r = weighted_choice(seq, weights)
print(time.time() - st)
it runs for approximately 45 seconds on CPython 2.7 and 70 seconds on CPython 3.6.
It finishes in around 2.3 seconds on PyPy 5.10, which would be fine for me; sadly, I can't use PyPy for various reasons.
Any ideas on how to speed up this function on CPython? I'm also interested in other implementations (algorithmic, or via external libraries like numpy) if they perform better.
PS: Python 3 has random.choices with weights; it runs in around 23 seconds, which is better than the above function, but still about ten times slower than PyPy.
I've tried it with numpy this way:
weights = [1. / 1000] * 1000
st = time.time()
for i in range(1000000):
    # r = weighted_choice(seq, weights)
    # r = random.choices(seq, weights)
    r = numpy.random.choice(seq, p=weights)
print(time.time() - st)
It ran for 70 seconds.
You can use numpy.random.choice (the p parameter is the weights). Normally numpy functions are vectorized and so run at near-C speed.
Implement as:
import numpy as np

def weighted_choice(seq, weights):
    w = np.asarray(weights)
    p = w / w.sum()  # can skip if weights always sum to 1
    return np.random.choice(seq, p=p)
Edit:
Timings:
%timeit np.random.choice(x, p=w) # len(x) == 1_000_000
13 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.random.choice(y, p=w) # len(y) == 100_000_000
1.28 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
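For the use case in the question (a million draws from one fixed distribution), it can also help to draw all the samples in a single call instead of looping; here is a minimal sketch of that idea (my addition, untimed), reusing the question's sizes:
import numpy as np

seq = np.arange(1000)
w = np.random.randint(1, 100, 1000).astype(float)
p = w / w.sum()

# one vectorized call replaces the million-iteration Python loop
samples = np.random.choice(seq, size=1000000, p=p)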
You could take this approach with numpy. If you eliminate the for loop, you get the true power of numpy by indexing only the positions you need:
# Untimed since you did not
seq = np.arange(1000)
weights = np.random.randint(1, 100, (1000, 1))

def weights_numpy(seq, weights, iterations):
    """
    :param seq: Input sequence
    :param weights: Input Weights
    :param iterations: Iterations to run
    :return:
    """
    r = np.random.uniform(0, weights.sum(0), (1, iterations))  # create array of choices
    ar = weights.cumsum(0)  # get cumulative sum
    return seq[(ar >= r).argmax(0)]  # get indices of seq that meet your condition
And the timing (Python 3, numpy 1.14.0):
%timeit weights_numpy(seq,weights,1000000)
4.05 s ± 256 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is a bit slower than PyPy, but hardly...
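Usage note (my addition): a single call returns all of the draws at once, which is what makes this batched approach competitive with PyPy.
samples = weights_numpy(seq, weights, 1000000)
print(samples.shape)  # (1000000,) -- one weighted draw per "iteration"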
Related
I was wondering if anyone has an idea on how to speed up the identification of which indices are between a set of values.
Let's say I have a 1d array of sorted values (~50k) and a large list (>100k) of min/max value pairs, and I want to determine which (if any) indices of the 1d array fall within each pair. I must also be able to do this many times, where the 1d array changes in size/shape.
My current approach is to use numpy and numba and list comprehension but unfortunately it doesn't really scale. It's okay if I try to look for ~1k values but when the number is much larger, it's too slow to be able to repeat it 1000s of times.
Current code:
import numpy as np
import numba

@numba.njit()
def find_between_batch(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
    """Find indices between specified boundaries for many items."""
    res = []
    for i in range(len(min_value)):
        res.append(np.where(np.logical_and(array >= min_value[i], array <= max_value[i]))[0])
    return res
Here is an example of the input:
x = np.linspace(0, 2000, 50000) # input 1d array
# these are the boundaries for which we should find the indices
mins = np.sort(np.random.choice(x, 10000)) - 0.01 # lower values to search for
maxs = mins + 0.02 # upper values to search for
And the current performance
# pre-compile
result = find_between_batch(x, mins, maxs)
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
And example output
result
[array([11]),
array([14]),
array([19]),
array([23]),
...
]
Does anyone have a suggestion on how to speed this up or if there is another approach that could give me the same results?
Thanks for the suggestion to use np.searchsorted - I've come up with a solution that is approx. 10-100x faster than my initial attempt.
@numba.njit()
def find_between_batch2(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
    """Find indices between specified boundaries for many items."""
    min_indices = np.searchsorted(array, min_value, side="left")
    max_indices = np.searchsorted(array, max_value, side="right")
    res = []
    for i in range(len(min_value)):
        _array = array[min_indices[i]:max_indices[i]]
        res.append(min_indices[i] + find_between(_array, min_value[i], max_value[i]))
    return res
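The helper find_between isn't shown in this excerpt; presumably it looks something like the following njit-compiled function (my guess at its shape, not the author's code):
@numba.njit()
def find_between(array: np.ndarray, min_value: float, max_value: float):
    """Find indices of values lying between min_value and max_value."""
    return np.where(np.logical_and(array >= min_value, array <= max_value))[0]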
Original code:
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
Updated code:
%timeit -r 3 -n 10 find_between_batch2(x, mins, maxs)
6.36 ms ± 73.6 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
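Since array is sorted, the searchsorted bounds already delimit the matching positions, so the inner find_between call can likely be dropped altogether; here is a sketch of that further simplification (my addition, untimed; find_between_batch3 is just an illustrative name):
@numba.njit()
def find_between_batch3(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
    """Same result as find_between_batch2, relying on array being sorted."""
    min_indices = np.searchsorted(array, min_value, side="left")
    max_indices = np.searchsorted(array, max_value, side="right")
    res = []
    for i in range(len(min_value)):
        res.append(np.arange(min_indices[i], max_indices[i]))
    return res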
I have to compute a large number of 3x3 linear transformations (e.g. rotations). This is what I have so far:
import numpy as np
from scipy import sparse
from numba import jit

n = 100000  # number of transformations
k = 100  # number of vectors for each transformation

A = np.random.rand(n, 3, k)  # vectors
Op = np.random.rand(n, 3, 3)  # operators
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1)))  # same as Op but as block-diag

def dot1():
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2():
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3():
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4():
    """ using sparse block diag matrix """
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
On a macbook pro 2012, this gives me:
In [62]: %timeit dot1()
783 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [63]: %timeit dot2()
261 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: %timeit dot3()
293 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [65]: %timeit dot4()
281 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Apart from the naive approach, all approaches are similar. Is there a way to accelerate this significantly?
Edit
(The CUDA approach is the best when available. The following compares the non-CUDA versions.)
Following the various suggestions, I modified dot2, added the Op@A method, and a version based on #59356461.
from numba import njit, prange

@njit(fastmath=True, parallel=True)
def dot2(Op, A):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in prange(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot5(Op, A):
    """ using matmul """
    return Op @ A

@njit(fastmath=True, parallel=True)
def dot6(Op, A):
    """ another numba.jit with parallel (based on #59356461) """
    new = np.empty_like(A)
    for i_n in prange(A.shape[0]):
        for i_k in range(A.shape[2]):
            for i_x in range(3):
                acc = 0.0j
                for i_y in range(3):
                    acc += Op[i_n, i_x, i_y] * A[i_n, i_y, i_k]
                new[i_n, i_x, i_k] = acc
    return new
This is what I get (on a different machine) with benchit:
def gen(n, k):
    Op = np.random.rand(n, 3, 3) + 1j * np.random.rand(n, 3, 3)
    A = np.random.rand(n, 3, k) + 1j * np.random.rand(n, 3, k)
    return Op, A

# benchit
import benchit
funcs = [dot1, dot2, dot3, dot4, dot5, dot6]
inputs = {n: gen(n, 100) for n in [100, 1000, 10000, 100000, 1000000]}
t = benchit.timings(funcs, inputs, multivar=True, input_name='Number of operators')
t.plot(logy=True, logx=True)
You've gotten some great suggestions, but I wanted to add one more due to this specific goal:
Is there a way to accelerate this significantly?
Realistically, if you need these operations to be significantly faster (which often means > 10x) you probably would want to use a GPU for the matrix multiplication. As a quick example:
import numpy as np
import cupy as cp

n = 100000  # number of transformations
k = 100  # number of vectors for each transformation

# CPU version
A = np.random.rand(n, 3, k)  # vectors
Op = np.random.rand(n, 3, 3)  # operators

def dot5():  # the suggested, best CPU approach
    return Op @ A

# GPU version using a V100
gA = cp.asarray(A)
gOp = cp.asarray(Op)

# run once to ignore JIT overhead before benchmarking
gOp @ gA;

%timeit dot5()
%timeit gOp @ gA; cp.cuda.Device().synchronize()  # need to sync for a fair benchmark
112 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.19 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
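One small follow-up (my addition): if the result is needed back on the host afterwards, copy it over explicitly; that transfer is not included in the timing above.
result = cp.asnumpy(gOp @ gA)  # device -> host copy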
Use Op@A as suggested by @hpaulj in the comments.
Here is a comparison using benchit:
def dot1(A, Op):
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2(A, Op):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3(A, Op):
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4(A, Op):
    """ using sparse block diag matrix """
    n = A.shape[0]
    sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1)))  # same as Op but as block-diag
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)

def dot5(A, Op):
    return Op @ A

in_ = {n: [np.random.rand(n, 3, k), np.random.rand(n, 3, 3)] for n in [100, 1000, 10000, 100000, 1000000]}
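The benchit driver itself isn't repeated here; presumably it mirrors the one in the question's edit, along these lines (my sketch):
import benchit
t = benchit.timings([dot1, dot2, dot3, dot4, dot5], in_, multivar=True, input_name='Number of operators')
t.plot(logy=True, logx=True)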
They seem to be close in performance for larger scale with dot5 being slightly faster.
In one answer Nick mentioned using the GPU - which is the best solution of course.
But - as a general rule - what you're doing is likely CPU limited. Therefore (with the exception to the GPU approach), the best bang you can get is if you make use of all the cores on your machine to work in parallel.
So for that you would want to use multiprocessing (not python's multithreading!), to split the job up into pieces running on each core in parallel.
This is not trivial, but also not too hard, and there are many good examples/guides online.
On an 8-core machine, this would likely give you close to an 8x speedup, as long as you avoid memory bottlenecks: don't pass many small objects back and forth between processes, but hand each worker its whole chunk of data up front. A minimal sketch of the idea follows.
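This sketch is my addition (not benchmarked); the chunk count, function names, and the use of Op @ A inside each worker are illustrative choices, not from the original answer:
import numpy as np
from multiprocessing import Pool

def matmul_chunk(chunk):
    op_chunk, a_chunk = chunk
    return op_chunk @ a_chunk  # batched 3x3 matmul on this worker's slice

def dot_parallel(Op, A, workers=8):
    # hand each worker its whole slice once, rather than many small objects
    chunks = list(zip(np.array_split(Op, workers), np.array_split(A, workers)))
    with Pool(workers) as pool:
        parts = pool.map(matmul_chunk, chunks)
    return np.concatenate(parts)

if __name__ == "__main__":
    n, k = 100000, 100
    A = np.random.rand(n, 3, k)
    Op = np.random.rand(n, 3, 3)
    print(dot_parallel(Op, A).shape)  # (100000, 3, 100)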
Right now I am just looping through the array using np.nditer() and comparing each element to the previous one, looking for the first value that repeats max_len times in a row. Is there a (vectorised) approach which is faster?
Added bonus is the fact that I don't always have to go to the end of the array; as soon as a sequence of max_len has been found I am done searching.
import numpy as np

max_len = 3
streak = 0
prev = np.nan

a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])

for c in np.nditer(a):
    if c == prev:
        streak += 1
        if streak == max_len:
            print(c)
            break
    else:
        prev = c
        streak = 1
An alternative I thought about is using np.diff(), but this just shifts the problem; we are then looking for a sequence of zeroes in its result. I also doubt it will be faster, since it has to calculate the difference for every element, whereas in practice the sequence will more often than not occur before reaching the end of the list.
I developed a numpy-only version that works, but after testing, I found that it performs quite poorly because it can't take advantage of short-circuiting. Since that's what you asked for, I describe it below. However, there is a much better approach using numba with a lightly modified version of your code. (Note that all of these return the index of the first match in a, rather than the value itself. I find that approach more flexible.)
import numba

@numba.jit(nopython=True)
def find_reps_numba(a, max_len):
    streak = 1
    val = a[0]
    for i in range(1, len(a)):
        if a[i] == val:
            streak += 1
            if streak >= max_len:
                return i - max_len + 1
        else:
            streak = 1
            val = a[i]
    return -1
This turns out to be ~100x faster than the pure Python version.
The numpy version uses the rolling window trick and the argmax trick. But again, this turns out to be far slower than even the pure Python version, by a substantial ~30x.
import numpy

def rolling_window(a, window):
    a = numpy.ascontiguousarray(a)  # This approach requires a C-ordered array
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def find_reps_numpy(a, max_len):
    windows = rolling_window(a, max_len)
    return (windows == windows[:, 0:1]).sum(axis=1).argmax()
I tested both of these against a non-jitted version of the first function. (I used Jupyter's %%timeit feature for testing.)
a = numpy.random.randint(0, 100, 1000000)
%%timeit
find_reps_numpy(a, 3)
28.6 ms ± 553 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
find_reps_orig(a, 3)
4.04 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
find_reps_numba(a, 3)
8.29 µs ± 89.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note that these numbers can vary dramatically depending on how deep into a the functions have to search. For a better estimate of expected performance, we can regenerate a new set of random numbers each time, but it's difficult to do so without including that step in the timing. So for comparison here, I include the time required to generate the random array without running anything else:
a = numpy.random.randint(0, 100, 1000000)
9.91 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numpy(a, 3)
38.2 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_orig(a, 3)
13.7 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numba(a, 3)
9.87 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, find_reps_numba is so fast that the variance in the time it takes to run numpy.random.randint(0, 100, 1000000) is much larger — hence the illusory speedup between the first and last tests.
So the big moral of the story is that numpy solutions aren't always best. Sometimes even pure Python is faster. In those cases, numba in nopython mode may be the best option by far.
You can use groupby from the itertools package.
import numpy as np
from itertools import groupby

max_len = 3
best = ()

a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])

for k, g in groupby(a):
    tup_g = tuple(g)
    if len(tup_g) >= max_len:
        best = tup_g
        break
    if len(tup_g) > len(best):
        best = tup_g

best
# returns:
(2, 2, 2)
You could create sub-arrays of length max_length, moving one position to the right each time (like ngrams), and check if the sum of one sub_array divided by max_length is equal to the first element of that sub-array.
If that's True, then you have found your consecutive sequence of integers of length max_length.
def get_conseq(array, max_length):
    sub_arrays = zip(*[array[i:] for i in range(max_length)])
    for e in sub_arrays:
        if sum(e) / len(e) == e[0]:
            print("Found : {}".format(e))
            return e
    print("Nothing found")
    return []
For example, this array [1,2,2,3,4,5], with max_length = 2, will be 'split' like this:
[1,2]
[2,2]
[2,3]
[3,4]
[4,5]
On the second element, [2,2], the sum is 4, divided by max_length gives 2, and that matches the first element of that subgroup, and the function returns.
You can break if that's what you prefer to do, instead of returning like I do.
You could also add a few rules to capture edge cases and keep things clean (empty array, max_length greater than the length of the array, etc.); see the sketch after the example calls below.
Here are a few example calls:
>>> get_conseq([1,2,3,4,5,6], 2)
Nothing found
>>> get_conseq([1,2,2,3,4,5,6], 3)
Nothing found
>>> get_conseq([1,2,3,3,3], 3)
Found : (3, 3, 3)
>>> get_conseq([1,2,2,3,3], 2)
Found : (2, 2)
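One way to add the guards mentioned above (my addition; get_conseq_safe is just an illustrative wrapper name):
def get_conseq_safe(array, max_length):
    # edge cases: empty input, non-positive length, window longer than the array
    if len(array) == 0 or max_length <= 0 or max_length > len(array):
        print("Nothing found")
        return []
    return get_conseq(array, max_length)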
Hope this helps !
Assuming you are looking for the element that appears at least max_len times consecutively, here's one NumPy-based way -
m = np.r_[True, a[:-1] != a[1:], True]
idx0 = np.flatnonzero(m)
m2 = np.diff(idx0) >= max_len
out = None  # None for the no-such-streak-found case
if m2.any():
    out = a[idx0[m2.argmax()]]
Another with binary-erosion -
from scipy.ndimage.morphology import binary_erosion

m = np.r_[False, a[:-1] == a[1:]]
m2 = binary_erosion(m, np.ones(max_len-1, dtype=bool))
out = None
if m2.any():
    out = a[m2.argmax()]
Finally, for completeness, you can also look into numba. Your existing code would work as it is, with a direct loop over a, i.e. for c in a:.
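A rough sketch of that route (my adaptation of the question's loop, not code from the original answer; the -1 sentinel assumes the values themselves are never -1):
import numpy as np
from numba import njit

@njit
def find_streak_value(a, max_len):
    streak = 1
    prev = a[0]
    for c in a[1:]:
        if c == prev:
            streak += 1
            if streak == max_len:
                return prev  # first value repeated max_len times in a row
        else:
            prev = c
            streak = 1
    return -1  # sentinel: no such streak

a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
print(find_streak_value(a, 3))  # 2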
So I am experimenting with the performance boost of combining vectorization and a for-loop powered by @njit in numba (I am currently using numba 0.45.1). Disappointingly, I found that it is actually slower than the pure nested-loop implementation in my code.
This is my code:
import numpy as np
from numba import njit

@njit
def func3(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        w = w + (1 - alpha_arr)**i
        e = e*(1 - alpha_arr) + arr_in[i]
        result[i, :] = e / w
    return result

@njit
def func4(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        for col in range(len(win_arr)):
            w[col] = w[col] + (1 - alpha_arr[col])**i
            e[col] = e[col]*(1 - alpha_arr[col]) + arr_in[i]
            result[i, col] = e[col] / w[col]
    return result

if __name__ == '__main__':
    np.random.seed(0)
    data_size = 200000
    winarr_size = 1000
    data = np.random.uniform(0, 1000, size=data_size) + 29000
    win_array = np.arange(1, winarr_size + 1)
    abc_test3 = func3(data, win_array)
    abc_test4 = func4(data, win_array)
    print(np.allclose(abc_test3, abc_test4, equal_nan=True))
I benchmarked the two functions using the following configurations:
(data_size,winarr_size) = (200000,100), (200000,200),(200000,1000), (200000,2000), (20000,10000), (2000,100000).
And found that the pure nested-for-loop implementation(func4) is consistently faster (about 2-5% faster) than the implementation with a for-loop mixed with vectorization (func3).
My questions are the following:
1) what needs to be changed to further improve the speed of the code?
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
It seems you misunderstood what "vectorized" means. Vectorized means that you write code that operates on arrays as if they were scalars - but that's just how the code looks, not something that by itself determines performance.
In the Python/NumPy world vectorized also carries the meaning that the overhead of the loop in vectorized operations is (often) much smaller compared to loopy code. However the vectorized code still has to do the loop (even if it's hidden in a library)!
Also, if you write a loop with numba, numba will compile it and create fast code that performs (generally) as fast as vectorized NumPy code. That means inside a numba function there's no significant performance difference between vectorized and non-vectorized code.
So that should answer your questions:
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
It grows linearly because it still has to iterate. In vectorized code the loop is just hidden inside a library routine.
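A quick way to see this (my illustration, not from the original answer): even a single vectorized operation takes time roughly proportional to the array length, because NumPy still loops over every element internally.
import time
import numpy as np

for n in (10**5, 10**6, 10**7):
    x = np.random.rand(n)
    t0 = time.perf_counter()
    y = x * 2.0  # one "vectorized" call
    print(n, time.perf_counter() - t0)  # grows roughly linearly with n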
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
No.
You also asked what could be done to make it faster.
The comments already mentioned that you could parallelize it:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def func6(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        for col in nb.prange(len(win_arr)):
            w[col] = w[col] + (1 - alpha_arr[col])**i
            e[col] = e[col] * (1 - alpha_arr[col]) + arr_in[i]
            result[i, col] = e[col] / w[col]
    return result
This makes the code a bit faster on my machine (4cores).
However, there is also the problem that your algorithm may be numerically unstable. The (1-alpha_arr[col])**i will underflow at some point when you raise it to powers in the hundreds of thousands:
>>> alpha = 0.01
>>> for i in [1, 10, 100, 1_000, 10_000, 50_000, 100_000, 200_000]:
...     print((1-alpha)**i)
0.99
0.9043820750088044
0.3660323412732292
4.317124741065786e-05
2.2487748498162805e-44
5.750821364590612e-219
0.0  # <-- underflow
0.0
Always think twice about complicated mathematical operations like pow and division. If you can replace them with simple operations like multiplications, additions and subtractions, it is always worth a try.
Please note that multiplying alpha repeatedly by itself is only algebraically the same as computing the power directly. Since this is numerical math, the results can differ.
Also avoid unnecessary temporary arrays.
First try
@nb.njit(error_model="numpy", parallel=True)
def func5(arr_in, win_arr):
    # filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0], arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col, 0] = np.nan

    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[:two_index, 0] = arr_in[0]

    for col in nb.prange(win_arr.shape[0]):
        alpha = 1. - (2. / (win_arr[col] + 1.))
        alpha_exp = alpha
        w = 1.
        e = arr_in[0]
        for i in range(1, arr_in.shape[0]):
            w += alpha_exp
            e = e*alpha + arr_in[i]
            result[col, i] = e / w
            alpha_exp *= alpha
    return result.T
Second try (avoiding underflow)
@nb.njit(error_model="numpy", parallel=True)
def func7(arr_in, win_arr):
    # filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0], arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col, 0] = np.nan

    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[:two_index, 0] = arr_in[0]

    for col in nb.prange(win_arr.shape[0]):
        alpha = 1. - (2. / (win_arr[col] + 1.))
        alpha_exp = alpha
        w = 1.
        e = arr_in[0]
        for i in range(1, arr_in.shape[0]):
            w += alpha_exp
            e = e*alpha + arr_in[i]
            result[col, i] = e / w
            if np.abs(alpha_exp) >= 1e-308:
                alpha_exp *= alpha
            else:
                alpha_exp = 0.
    return result.T
Timings
%timeit abc_test3= func3(data, win_array)
7.17 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test4= func4(data, win_array)
7.13 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#from MSeifert answer (parallelized)
%timeit abc_test6= func6(data, win_array)
3.42 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test5= func5(data, win_array)
1.22 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test7= func7(data, win_array)
238 ms ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I recently converted a MATLAB script to Python with Numpy, and found that it ran significantly slower. I expected similar performance, so I'm wondering if I'm doing something wrong.
As stripped-down example, I manually sum a geometric series:
MATLAB version:
function s = array_sum(a, array_size, iterations)
    s = zeros(array_size);
    for m = 1:iterations
        s = a + 0.5*s;
    end
end

% benchmark code
array_size = 500
iterations = 500
a = randn(array_size)
f = @() array_sum(a, array_size, iterations);
fprintf('run time: %.2f ms\n', timeit(f)*1e3);
Python/Numpy version:
import numpy as np
import timeit

def array_sum(a, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        s = a + 0.5*s
    return s

array_size = 500
iterations = 500
a = np.random.randn(array_size, array_size)

timeit_iterations = 10
t1 = timeit.timeit(lambda: array_sum(a, array_size, iterations),
                   number=timeit_iterations)
print("run time: {:.2f} ms".format(1e3*t1/timeit_iterations))
On my machine, MATLAB completes in 58 ms. The Python version runs in 292 ms, or 5X slower.
I also tried speeding up the Python code by adding the Numba JIT decorator @jit('f8[:,:](i8, i8)', nopython=True), but the time only dropped to 236 ms (4X slower).
This is slower than I expected. Am I using timeit improperly? Is there something wrong with my Python code?
EDIT: edited so that the random matrix is created outside of benchmarked function.
EDIT 2: I ran the benchmark using Torch instead of Numpy (calculating the sum as s = torch.add(s, 0.5, a)) and it runs in just 52 ms on my computer!
From my experience, when using numba's jit function it's usually faster to expand array operations into loops. So I tried to rewrite your python function as:
import numpy as np
from numba import jit

@jit(nopython=True, cache=True)
def array_sum_numba(a, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        for i in range(array_size):
            for j in range(array_size):
                s[i, j] = a[i, j] + 0.5 * s[i, j]
    return s
And out of curiosity, I've also tested @percusse's version with a small modification to the parameters:
def array_sum2(r, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        s /= 2
        s += r
    return s
The testing results on my machine are:
original version run time: 143.83 ms
numba jitted loop version run time: 26.99 ms
@percusse's version run time: 61.38 ms
This result is within my expectations. It's worth mentioning that I've increased the timeit iterations to 50, which results in some significant time reduction for the numba version.
In summary: The Python code can still be significantly accelerated if you use numba's jit and write the function in loops. I don't have Matlab on my machine to test, but my guess is with numba the python version is faster.
Since you are repeatedly updating the same variable, it is suitable for in-place operations; you can update your function as
def array_sum2(array_size, iterations):
    s = np.zeros((array_size, array_size))
    r = np.random.randn(array_size, array_size)
    for m in range(iterations):
        s /= 2
        s += r
    return s
This has given the following speed benefit on my machine compared to array_sum
run time: 157.32 ms
run time2: 672.43 ms
Times include the randn call as well as the summation:
In [68]: timeit array_sum(array_size, 0)
16.6 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [69]: timeit array_sum(array_size, 1)
18.9 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [70]: timeit array_sum(array_size, 20)
55.5 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [71]: (55-16)/20
Out[71]: 1.95
So it's 16ms for the setup, and 2ms per iteration. Same pattern with 500 iterations.
MATLAB does some JIT compilation. I don't know if that's the case here or not; I don't have MATLAB to test. In Octave (which has no timeit):
>> t = time(); array_sum(500,0); (time()-t)*1000
ans = 13.704
>> t = time(); array_sum(500,1); (time()-t)*1000
ans = 16.219
>> t = time(); array_sum(500,20); (time()-t)*1000
ans = 82.346
>> t = time(); array_sum(500,500); (time()-t)*1000
ans = 1610.6
Octave's random is faster, but the per iteration sum is slower.