Fastest way to compute large number of 3x3 dot product

Fastest way to compute large number of 3x3 dot product - python

I have to compute a large number of 3x3 linear transformations (eg. rotations). This is what I have so far:
import numpy as np
from scipy import sparse
from numba import jit
n = 100000 # number of transformations
k = 100 # number of vectors for each transformation
A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag
def dot1():
""" naive approach: many times np.dot """
return np.stack([np.dot(o, a) for o, a in zip(Op, A)])
#jit(nopython=True)
def dot2():
""" same as above, but jitted """
new = np.empty_like(A)
for i in range(Op.shape[0]):
new[i] = np.dot(Op[i], A[i])
return new
def dot3():
""" using einsum """
return np.einsum("ijk,ikl->ijl", Op, A)
def dot4():
""" using sparse block diag matrix """
return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
On a macbook pro 2012, this gives me:
In [62]: %timeit dot1()
783 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [63]: %timeit dot2()
261 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: %timeit dot3()
293 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [65]: %timeit dot4()
281 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Appart from the naive approach, all approaches are similar. Is there a way to accelerate this significantly?
Edit
(The cuda approach is the best when available. The following is comparing the non-cuda versions)
Following the various suggestions, I modified dot2, added the Op#A method, and a version based on #59356461.
#njit(fastmath=True, parallel=True)
def dot2(Op, A):
""" same as above, but jitted """
new = np.empty_like(A)
for i in prange(Op.shape[0]):
new[i] = np.dot(Op[i], A[i])
return new
def dot5(Op, A):
""" using matmul """
return Op#A
#njit(fastmath=True, parallel=True)
def dot6(Op, A):
""" another numba.jit with parallel (based on #59356461) """
new = np.empty_like(A)
for i_n in prange(A.shape[0]):
for i_k in range(A.shape[2]):
for i_x in range(3):
acc = 0.0j
for i_y in range(3):
acc += Op[i_n, i_x, i_y] * A[i_n, i_y, i_k]
new[i_n, i_x, i_k] = acc
return new
This is what I get (on a different machine) with benchit:
def gen(n, k):
Op = np.random.rand(n, 3, 3) + 1j * np.random.rand(n, 3, 3)
A = np.random.rand(n, 3, k) + 1j * np.random.rand(n, 3, k)
return Op, A
# benchit
import benchit
funcs = [dot1, dot2, dot3, dot4, dot5, dot6]
inputs = {n: gen(n, 100) for n in [100,1000,10000,100000,1000000]}
t = benchit.timings(funcs, inputs, multivar=True, input_name='Number of operators')
t.plot(logy=True, logx=True)

You've gotten some great suggestions, but I wanted to add one more due to this specific goal:
Is there a way to accelerate this significantly?
Realistically, if you need these operations to be significantly faster (which often means > 10x) you probably would want to use a GPU for the matrix multiplication. As a quick example:
import numpy as np
import cupy as cp
n = 100000 # number of transformations
k = 100 # number of vectors for each transformation
# CPU version
A = np.random.rand(n, 3, k) # vectors
Op = np.random.rand(n, 3, 3) # operators
def dot5(): # the suggested, best CPU approach
return Op#A
# GPU version using a V100
gA = cp.asarray(A)
gOp = cp.asarray(Op)
# run once to ignore JIT overhead before benchmarking
gOp#gA;
%timeit dot5()
%timeit gOp#gA; cp.cuda.Device().synchronize() # need to sync for a fair benchmark
112 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.19 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Use Op#A like suggested by #hpaulj in comments.
Here is a comparison using benchit:
def dot1(A,Op):
""" naive approach: many times np.dot """
return np.stack([np.dot(o, a) for o, a in zip(Op, A)])
#jit(nopython=True)
def dot2(A,Op):
""" same as above, but jitted """
new = np.empty_like(A)
for i in range(Op.shape[0]):
new[i] = np.dot(Op[i], A[i])
return new
def dot3(A,Op):
""" using einsum """
return np.einsum("ijk,ikl->ijl", Op, A)
def dot4(A,Op):
n = A.shape[0]
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1))) # same as Op but as block-diag
""" using sparse block diag matrix """
return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
def dot5(A,Op):
return Op#A
in_ = {n:[np.random.rand(n, 3, k), np.random.rand(n, 3, 3)] for n in [100,1000,10000,100000,1000000]}
They seem to be close in performance for larger scale with dot5 being slightly faster.

In one answer Nick mentioned using the GPU - which is the best solution of course.
But - as a general rule - what you're doing is likely CPU limited. Therefore (with the exception to the GPU approach), the best bang you can get is if you make use of all the cores on your machine to work in parallel.
So for that you would want to use multiprocessing (not python's multithreading!), to split the job up into pieces running on each core in parallel.
This is not trivial, but also not too hard, and there are many good examples/guides online.
But if you had an 8-core machine, it would likely give you an almost 8x speed increase as long as you're careful to avoid memory bottlenecks by trying to pass many small objects between processes, but pass them all in a group at the start

Related

Faster way to find same integers in sequence in numpy array

Right now I am just looping through using np.nditer() and comparing to the previous element. Is there a (vectorised) approach which is faster?
Added bonus is the fact that I don't always have to go to the end of the array; as soon as a sequence of max_len has been found I am done searching.
import numpy as np
max_len = 3
streak = 0
prev = np.nan
a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
for c in np.nditer(a):
if c == prev:
streak += 1
if streak == max_len:
print(c)
break
else:
prev = c
streak = 1
Alternative I thought about is using np.diff() but this just shifts the problem; we are now looking for a sequence of zeroes in its result. Also I doubt it will be faster since it will have to calculate the difference for every integer whereas in practice the sequence will occur before reaching the end of the list more often than not.

I developed a numpy-only version that works, but after testing, I found that it performs quite poorly because it can't take advantage of short-circuiting. Since that's what you asked for, I describe it below. However, there is a much better approach using numba with a lightly modified version of your code. (Note that all of these return the index of the first match in a, rather than the value itself. I find that approach more flexible.)
#numba.jit(nopython=True)
def find_reps_numba(a, max_len):
streak = 1
val = a[0]
for i in range(1, len(a)):
if a[i] == val:
streak += 1
if streak >= max_len:
return i - max_len + 1
else:
streak = 1
val = a[i]
return -1
This turns out to be ~100x faster than the pure Python version.
The numpy version uses the rolling window trick and the argmax trick. But again, this turns out to be far slower than even the pure Python version, by a substantial ~30x.
def rolling_window(a, window):
a = numpy.ascontiguousarray(a) # This approach requires a C-ordered array
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def find_reps_numpy(a, max_len):
windows = rolling_window(a, max_len)
return (windows == windows[:, 0:1]).sum(axis=1).argmax()
I tested both of these against a non-jitted version of the first function. (I used Jupyter's %%timeit feature for testing.)
a = numpy.random.randint(0, 100, 1000000)
%%timeit
find_reps_numpy(a, 3)
28.6 ms ± 553 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
find_reps_orig(a, 3)
4.04 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
find_reps_numba(a, 3)
8.29 µs ± 89.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note that these numbers can vary dramatically depending on how deep into a the functions have to search. For a better estimate of expected performance, we can regenerate a new set of random numbers each time, but it's difficult to do so without including that step in the timing. So for comparison here, I include the time required to generate the random array without running anything else:
a = numpy.random.randint(0, 100, 1000000)
9.91 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numpy(a, 3)
38.2 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_orig(a, 3)
13.7 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numba(a, 3)
9.87 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, find_reps_numba is so fast that the variance in the time it takes to run numpy.random.randint(0, 100, 1000000) is much larger — hence the illusory speedup between the first and last tests.
So the big moral of the story is that numpy solutions aren't always best. Sometimes even pure Python is faster. In those cases, numba in nopython mode may be the best option by far.

You can use groupby from the itertools package.
import numpy as np
from itertools import groupby
max_len = 3
best = ()
a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
for k, g in groupby(a):
tup_g = tuple(g)
if tup_g==max_len:
best = tup_g
break
if len(tup_g) > len(best):
best = tup_g
best
# returns:
(2, 2, 2)

You could create sub-arrays of length max_length, moving one position to the right each time (like ngrams), and check if the sum of one sub_array divided by max_length is equal to the first element of that sub-array.
If that's True, then you have found your consecutive sequence of integers of length max_length.
def get_conseq(array, max_length):
sub_arrays = zip(*[array[i:] for i in range(max_length)])
for e in sub_arrays:
if sum(e) / len(e) == e[0]:
print("Found : {}".format(e))
return e
print("Nothing found")
return []
For example, this array [1,2,2,3,4,5], with max_length = 2, will be 'split' like this:
[1,2]
[2,2]
[2,3]
[3,4]
[4,5]
On the second element, [2,2], the sum is 4, divided by max_length gives 2, and that matches the first element of that subgroup, and the function returns.
You can break if that's what you prefer to do, instead of returning like I do.
You could also add a few rules to capture edge cases, to make things clean (empty array, max_length superior to the length of the array, etc).
Here are a few example calls:
>>> splits([1,2,3,4,5,6], 2)
Nothing found
>>> splits([1,2,2,3,4,5,6], 3)
Nothing found
>>> splits([1,2,3,3,3], 3)
Found : [3, 3, 3]
>>> splits([1,2,2,3,3], 2)
Found : [2, 2]
Hope this helps !

Assuming you are looking for the element that appears for at least max_len times consecutively, here's one NumPy based way -
m = np.r_[True,a[:-1]!=a[1:],True]
idx0 = np.flatnonzero(m)
m2 = np.diff(idx0)>=max_len
out = None # None for no such streak found case
if m2.any():
out = a[idx0[m2.argmax()]]
Another with binary-dilation -
from scipy.ndimage.morphology import binary_erosion
m = np.r_[False,a[:-1]==a[1:]]
m2 = binary_erosion(m, np.ones(max_len-1, dtype=bool))
out = None
if m2.any():
out = a[m2.argmax()]
Finally, for completeness, you can also look into numba. Your existing code would work as it is, with a direct-looping over a, i.e. for c in a:.

Vectorization in a loop slower than a nested loop in numba jitted function

So I am experimenting on the performance boost of combining vectorization and for-loop powered by #njit in numba(I am currently using numba 0.45.1). Disappointingly, I found out it is actually slower than the pure nested-loop implementation in my code.
This is my code:
import numpy as np
from numba import njit
#njit
def func3(arr_in, win_arr):
n = arr_in.shape[0]
win_len = len(win_arr)
result = np.full((n, win_len), np.nan)
alpha_arr = 2 / (win_arr + 1)
e = np.full(win_len, arr_in[0])
w = np.ones(win_len)
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[0, :two_index] = arr_in[0]
for i in range(1, n):
w = w + (1-alpha_arr)**i
e = e*(1-alpha_arr) + arr_in[i]
result[i,:] = e /w
return result
#njit
def func4(arr_in, win_arr):
n = arr_in.shape[0]
win_len = len(win_arr)
result = np.full((n, win_len), np.nan)
alpha_arr = 2 / (win_arr + 1)
e = np.full(win_len, arr_in[0])
w = np.ones(win_len)
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[0, :two_index] = arr_in[0]
for i in range(1, n):
for col in range(len(win_arr)):
w[col] = w[col] + (1-alpha_arr[col])**i
e[col] = e[col]*(1-alpha_arr[col]) + arr_in[i]
result[i,col] = e[col] /w[col]
return result
if __name__ == '__main__':
np.random.seed(0)
data_size = 200000
winarr_size = 1000
data = np.random.uniform(0,1000, size = data_size)+29000
win_array = np.arange(1, winarr_size+1)
abc_test3= func3(data, win_array)
abc_test4= func4(data, win_array)
print(np.allclose(abc_test3, abc_test4, equal_nan = True))
I benchmarked the two functions using the following configurations:
(data_size,winarr_size) = (200000,100), (200000,200),(200000,1000), (200000,2000), (20000,10000), (2000,100000).
And found that the pure nested-for-loop implementation(func4) is consistently faster (about 2-5% faster) than the implementation with a for-loop mixed with vectorization (func3).
My questions are the following:
1) what needs to be changed to further improve the speed of the code?
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?

It seems you misunderstood what "vectorized" means. Vectorized means that you write code that operates on arrays as-if they were scalars - but that's just how the code looks like, not related to performance.
In the Python/NumPy world vectorized also carries the meaning that the overhead of the loop in vectorized operations is (often) much smaller compared to loopy code. However the vectorized code still has to do the loop (even if it's hidden in a library)!
Also, if you write a loop with numba, numba will compile it and create fast code that performs (generally) as fast as vectorized NumPy code. That means inside a numba function there's no significant performance difference between vectorized and non-vectorized code.
So that should answer your questions:
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
It grows linearly because it still has to iterate. In vectorized code the loop is just hidden inside a library routine.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
No.
You also asked what could be done to make it faster.
The comments already mentioned that you could parallelize it:
import numpy as np
import numba as nb
#nb.njit(parallel=True)
def func6(arr_in, win_arr):
n = arr_in.shape[0]
win_len = len(win_arr)
result = np.full((n, win_len), np.nan)
alpha_arr = 2 / (win_arr + 1)
e = np.full(win_len, arr_in[0])
w = np.ones(win_len)
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[0, :two_index] = arr_in[0]
for i in range(1, n):
for col in nb.prange(len(win_arr)):
w[col] = w[col] + (1-alpha_arr[col])**i
e[col] = e[col] * (1-alpha_arr[col]) + arr_in[i]
result[i,col] = e[col] /w[col]
return result
This makes the code a bit faster on my machine (4cores).
However there's also a problem that your algorithm may be numerically unstable. The (1-alpha_arr[col])**i will underflow at some point when you raise it to powers of hundred-thousands:
>>> alpha = 0.01
>>> for i in [1, 10, 100, 1_000, 10_000, 50_000, 100_000, 200_000]:
... print((1-alpha)**i)
0.99
0.9043820750088044
0.3660323412732292
4.317124741065786e-05
2.2487748498162805e-44
5.750821364590612e-219
0.0 # <-- underflow
0.0

Always think twice about complicated mathematical operations like (pow, divisions,...). If you can replace them by easy operations like multiplications, additions and subtractions it is always worth a try.
Please note that multiplying alpha repeatedly with itself is only algebraically the same as directly calculating with exponentiation. Since this is numerical math the results can differ.
Also avoid unnecessary temporary arrays.
First try
#nb.njit(error_model="numpy",parallel=True)
def func5(arr_in, win_arr):
#filling the whole array with NaNs isn't necessary
result = np.empty((win_arr.shape[0],arr_in.shape[0]))
for col in range(win_arr.shape[0]):
result[col,0]=np.nan
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[:two_index,0] = arr_in[0]
for col in nb.prange(win_arr.shape[0]):
alpha=1.-(2./ (win_arr[col] + 1.))
alpha_exp=alpha
w=1.
e=arr_in[0]
for i in range(1, arr_in.shape[0]):
w+= alpha_exp
e = e*alpha + arr_in[i]
result[col,i] = e/w
alpha_exp*=alpha
return result.T
Second try (avoiding underflow)
#nb.njit(error_model="numpy",parallel=True)
def func7(arr_in, win_arr):
#filling the whole array with NaNs isn't necessary
result = np.empty((win_arr.shape[0],arr_in.shape[0]))
for col in range(win_arr.shape[0]):
result[col,0]=np.nan
two_index = np.nonzero(win_arr <= 2)[0][-1]+1
result[:two_index,0] = arr_in[0]
for col in nb.prange(win_arr.shape[0]):
alpha=1.-(2./ (win_arr[col] + 1.))
alpha_exp=alpha
w=1.
e=arr_in[0]
for i in range(1, arr_in.shape[0]):
w+= alpha_exp
e = e*alpha + arr_in[i]
result[col,i] = e/w
if np.abs(alpha_exp)>=1e-308:
alpha_exp*=alpha
else:
alpha_exp=0.
return result.T
Timings
%timeit abc_test3= func3(data, win_array)
7.17 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test4= func4(data, win_array)
7.13 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#from MSeifert answer (parallelized)
%timeit abc_test6= func6(data, win_array)
3.42 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test5= func5(data, win_array)
1.22 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test7= func7(data, win_array)
238 ms ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why does a numpy array not appear to be much faster than a standard python list?

From what I understand, numpy arrays can handle operations more quickly than python lists because they're handled in a parallel rather than iterative fashion. I tried to test that out for fun, but I didn't see much of a difference.
Was there something wrong with my test? Does the difference only matter with arrays much bigger than the ones I used? I made sure to create a python list and numpy array in each function to cancel out differences creating one vs. the other might make, but the time delta really seems negligible. Here's my code:
My final outputs were numpy function: 6.534756324786595s, list function: 6.559365831783256s
import timeit
import numpy as np
a_setup = 'import timeit; import numpy as np'
std_fx = '''
def operate_on_std_array():
std_arr = list(range(0,1000000))
np_arr = np.asarray(std_arr)
for index,elem in enumerate(std_arr):
std_arr[index] = (elem**20)*63134
return std_arr
'''
parallel_fx = '''
def operate_on_np_arr():
std_arr = list(range(0,1000000))
np_arr = np.asarray(std_arr)
np_arr = (np_arr**20)*63134
return np_arr
'''
def operate_on_std_array():
std_arr = list(range(0,1000000))
np_arr = np.asarray(std_arr)
for index,elem in enumerate(std_arr):
std_arr[index] = (elem**20)*63134
return std_arr
def operate_on_np_arr():
std_arr = list(range(0,1000000))
np_arr = np.asarray(std_arr)
np_arr = (np_arr**20)*63134
return np_arr
print('std',timeit.timeit(setup = a_setup, stmt = std_fx, number = 80000000))
print('par',timeit.timeit(setup = a_setup, stmt = parallel_fx, number = 80000000))
#operate_on_np_arr()
#operate_on_std_array()

The timeit docs here show that the statement you pass in is supposed to execute something, but the statements you pass in just define functions. I was thinking 80000000 trials on a 1-million-length array should take much longer.
Other issues you have in your test:
np_arr = (np_arr**20)*63134 may create a copy of np_arr, but your Python list equivalent only mutates an existing array.
Numpy math is different than Python math. 100**20 in Python returns a huge number because Python has unbounded-length integers, but Numpy uses C-style fixed-length integers that overflow. (In general, you have to imagine doing the operation in C when you use Numpy because other unintuitive things may apply, like garbage in uninitialized arrays.)
Here's a test where I modify both in place, multiplying then dividing by 31 each time so the values don't change over time or overflow:
import numpy as np
import timeit
std_arr = list(range(0,100000))
np_arr = np.array(std_arr)
np_arr_vec = np.vectorize(lambda n: (n * 31) / 31)
def operate_on_std_array():
for index,elem in enumerate(std_arr):
std_arr[index] = elem * 31
std_arr[index] = elem / 31
return std_arr
def operate_on_np_arr():
np_arr_vec(np_arr)
return np_arr
import time
def test_time(f):
count = 100
start = time.time()
for i in range(count):
f()
dur = time.time() - start
return dur
print(test_time(operate_on_std_array))
print(test_time(operate_on_np_arr))
Results:
3.0798873901367188 # standard array time
2.221336841583252 # np array time
Edit: As #user2357112 pointed out, the proper Numpy way to do it is this:
def operate_on_np_arr():
global np_arr
np_arr *= 31
np_arr //= 31 # integer division, not double
return np_arr
Makes it much faster. I see 0.1248 seconds.

Here are some timings using the ipython magic to initialize lists and or arrays. The results should focus on the calculations:
In [103]: %%timeit alist = list(range(10000))
...: for i,e in enumerate(alist):
...: alist[i] = (e*3)*20
...:
4.13 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [104]: %%timeit arr = np.arange(10000)
...: z = (arr*3)*20
...:
20.6 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [105]: %%timeit alist = list(range(10000))
...: z = [(e*3)*20 for e in alist]
...:
...:
1.71 ms ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Looking at the effect of array creation times:
In [106]: %%timeit alist = list(range(10000))
...: arr = np.array(alist)
...: z = (arr*3)*20
...:
...:
1.01 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Ok, the calculation isn't the same. If I use **3 instead, all times are about 2x larger. Same relative relations.

High performance weighted random choice for python 2?

I have the following python method, which selects a weighted random element from the sequence "seq" randomly weighted by other sequence, which contains the weights for each element in seq:
def weighted_choice(seq, weights):
assert len(seq) == len(weights)
total = sum(weights)
r = random.uniform(0, total)
upto = 0
for i in range(len(seq)):
if upto + weights[i] >= r:
return seq[i]
upto += weights[i]
assert False, "Shouldn't get here"
If I call the above a million times with a 1000 element sequence, like this:
seq = range(1000)
weights = []
for i in range(1000):
weights.append(random.randint(1,100))
st=time.time()
for i in range(1000000):
r=weighted_choice(seq, weights)
print (time.time()-st)
it runs for approximately 45 seconds in cpython 2.7 and for 70 seconds in cpython 3.6.
It finishes in around 2.3 seconds in pypy 5.10, which would be fine for me, sadly I can't use pypy for some reasons.
Any ideas on how to speed up this function on cpython? I'm interested in other implementations (algorithmically, or via external libraries, like numpy) as well if they perform better.
ps: python3 has random.choices with weights, it runs for around 23 seconds, which is better than the above function, but still exactly ten times slower than pypy can run.
I've tried it with numpy this way:
weights=[1./1000]*1000
st=time.time()
for i in range(1000000):
#r=weighted_choice(seq, weights)
#r=random.choices(seq, weights)
r=numpy.random.choice(seq, p=weights)
print (time.time()-st)
It ran for 70 seconds.

You can use numpy.random.choice (the p parameter is the weights). Normally numpy functions are vectorized and so run at near-C speed.
Implement as:
def weighted_choice(seq, weights):
w = np.asarray(weights)
p = w / w.sum() # can skip if weights always sum to 1
return np.random.choice(seq, p=w)
Edit:
Timings:
%timeit np.random.choice(x, p=w) # len(x) == 1_000_000
13 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.random.choice(y, p=w) # len(y) == 100_000_000
1.28 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

you could take this approach with numpy. If you emlimiate the for loop, you can get the true power of numpy by indexing the positions you need
#Untimed since you did not
seq = np.arange(1000)
weights = np.random.randint(1,100,(1000,1))
def weights_numpy(seq,weights,iterations):
"""
:param seq: Input sequence
:param weights: Input Weights
:param iterations: Iterations to run
:return:
"""
r = np.random.uniform(0,weights.sum(0),(1,iterations)) #create array of choices
ar = weights.cumsum(0) # get cumulative sum
return seq[(ar >= r).argmax(0)] #get indeces of seq that meet your condition
And the timing (python 3,numpy '1.14.0')
%timeit weights_numpy(seq,weights,1000000)
4.05 s ± 256 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is a bit slower than PyPy, but hardly...

Comparing Numpy and Matlab array summation speed

I recently converted a MATLAB script to Python with Numpy, and found that it ran significantly slower. I expected similar performance, so I'm wondering if I'm doing something wrong.
As stripped-down example, I manually sum a geometric series:
MATLAB version:
function s = array_sum(a, array_size, iterations)
s = zeros(array_size);
for m = 1:iterations
s = a + 0.5*s;
end
end
% benchmark code
array_size = 500
iterations = 500
a = randn(array_size)
f = #() array_sum(a, array_size, iterations);
fprintf('run time: %.2f ms\n', timeit(f)*1e3);
Python/Numpy version:
import numpy as np
import timeit
def array_sum(a, array_size, iterations):
s = np.zeros((array_size, array_size))
for m in range(iterations):
s = a + 0.5*s
return s
array_size = 500
iterations = 500
a = np.random.randn(array_size, array_size)
timeit_iterations = 10
t1 = timeit.timeit(lambda: array_sum(a, array_size, iterations),
number=timeit_iterations)
print("run time: {:.2f} ms".format(1e3*t1/timeit_iterations))
On my machine, MATLAB completes in 58 ms. The Python version runs in 292 ms, or 5X slower.
I also tried speeding up the Python code by adding the Numba JIT decorator #jit('f8[:,:](i8, i8)', nopython=True), but the time only dropped to 236 ms (4X slower).
This is slower than I expected. Am I using timeit improperly? Is there something wrong with my Python code?
EDIT: edited so that the random matrix is created outside of benchmarked function.
EDIT 2: I ran the benchmark using Torch instead of Numpy (calculating the sum as s = torch.add(s, 0.5, a)) and it runs in just 52 ms on my computer!

From my experience, when using numba's jit function it's usually faster to expand array operations into loops. So I tried to rewrite your python function as:
#jit(nopython=True, cache=True)
def array_sum_numba(a, array_size, iterations):
s = np.zeros((array_size, array_size))
for m in range(iterations):
for i in range(array_size):
for j in range(array_size):
s[i,j] = a[i,j] + 0.5 * s[i,j]
return s
And out of curiosity, I've also tested #percusse's version with a little modification on the parameter:
def array_sum2(r, array_size, iterations):
s = np.zeros((array_size, array_size))
for m in range(iterations):
s /= 2
s += r
return s
The testing results on my machine are:
original version run time: 143.83 ms
numba jitted loop version run time: 26.99 ms
#percusse's version run time: 61.38 ms
This result is within my expectation. It's worthing mentioning that I've increased timeit iterations to 50, which results in some significant time reduction for numba version.
In summary: The Python code can still be significantly accelerated if you use numba's jit and write the function in loops. I don't have Matlab on my machine to test, but my guess is with numba the python version is faster.

Since you are updating the same variable suitable for inplace operations, you can update your function as
def array_sum2(array_size, iterations):
s = np.zeros((array_size, array_size))
r = np.random.randn(array_size, array_size)
for m in range(iterations):
s /= 2
s += r
return s
This has given the following speed benefit on my machine compared to array_sum
run time: 157.32 ms
run time2: 672.43 ms

Times include the randn call as well as the summation:
In [68]: timeit array_sum(array_size, 0)
16.6 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [69]: timeit array_sum(array_size, 1)
18.9 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [70]: timeit array_sum(array_size, 20)
55.5 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [71]: (55-16)/20
Out[71]: 1.95
So it's 16ms for the setup, and 2ms per iteration. Same pattern with 500 iterations.
MATLAB does some JIT compilation. I don't know if that's the case here or not. I don't have MATLAB to test. In Octave (no timeit)
>> t = time(); array_sum(500,0); (time()-t)*1000
ans = 13.704
>> t = time(); array_sum(500,1); (time()-t)*1000
ans = 16.219
>> t = time(); array_sum(500,20); (time()-t)*1000
ans = 82.346
>> t = time(); array_sum(500,500); (time()-t)*1000
ans = 1610.6
Octave's random is faster, but the per iteration sum is slower.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Fastest way to compute large number of 3x3 dot product - python

Related

Faster way to find same integers in sequence in numpy array

Vectorization in a loop slower than a nested loop in numba jitted function

Why does a numpy array not appear to be much faster than a standard python list?

High performance weighted random choice for python 2?

Comparing Numpy and Matlab array summation speed

Categories

Resources