Comparing Numpy and Matlab array summation speed - python

I recently converted a MATLAB script to Python with Numpy, and found that it ran significantly slower. I expected similar performance, so I'm wondering if I'm doing something wrong.
As a stripped-down example, I manually sum a geometric series:
MATLAB version:
function s = array_sum(a, array_size, iterations)
    s = zeros(array_size);
    for m = 1:iterations
        s = a + 0.5*s;
    end
end
% benchmark code
array_size = 500;
iterations = 500;
a = randn(array_size);
f = @() array_sum(a, array_size, iterations);
fprintf('run time: %.2f ms\n', timeit(f)*1e3);
Python/Numpy version:
import numpy as np
import timeit
def array_sum(a, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        s = a + 0.5*s
    return s

array_size = 500
iterations = 500
a = np.random.randn(array_size, array_size)
timeit_iterations = 10
t1 = timeit.timeit(lambda: array_sum(a, array_size, iterations),
                   number=timeit_iterations)
print("run time: {:.2f} ms".format(1e3*t1/timeit_iterations))
On my machine, MATLAB completes in 58 ms. The Python version runs in 292 ms, or 5X slower.
I also tried speeding up the Python code by adding the Numba JIT decorator @jit('f8[:,:](i8, i8)', nopython=True), but the time only dropped to 236 ms (4X slower).
This is slower than I expected. Am I using timeit improperly? Is there something wrong with my Python code?
EDIT: edited so that the random matrix is created outside of the benchmarked function.
EDIT 2: I ran the benchmark using Torch instead of Numpy (calculating the sum as s = torch.add(s, 0.5, a)) and it runs in just 52 ms on my computer!

In my experience, when using numba's jit it's usually faster to expand array operations into explicit loops. So I tried to rewrite your Python function as:
@jit(nopython=True, cache=True)
def array_sum_numba(a, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        for i in range(array_size):
            for j in range(array_size):
                s[i,j] = a[i,j] + 0.5 * s[i,j]
    return s
And out of curiosity, I've also tested @percusse's version, with a small modification to the parameters:
def array_sum2(r, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        s /= 2
        s += r
    return s
The testing results on my machine are:
original version run time: 143.83 ms
numba jitted loop version run time: 26.99 ms
@percusse's version run time: 61.38 ms
This result is within my expectation. It's worth mentioning that I increased the timeit iterations to 50, which results in a significant time reduction for the numba version.
In summary: the Python code can still be significantly accelerated if you use numba's jit and write the function with explicit loops. I don't have Matlab on my machine to test, but my guess is that with numba the Python version is faster.
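If your machine has multiple cores, a further variant worth trying is numba's parallel mode with prange on a loop whose iterations are independent. A minimal sketch (my addition, not benchmarked against the numbers above):
import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def array_sum_numba_parallel(a, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):        # must stay sequential: each step depends on the previous one
        for i in prange(array_size):   # rows are independent within a step, so parallelize here
            for j in range(array_size):
                s[i,j] = a[i,j] + 0.5 * s[i,j]
    return s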

Since you are repeatedly updating the same variable, the computation is suitable for in-place operations. You can update your function as:
def array_sum2(array_size, iterations):
    s = np.zeros((array_size, array_size))
    r = np.random.randn(array_size, array_size)
    for m in range(iterations):
        s /= 2
        s += r
    return s
This gave the following run times on my machine compared to array_sum:
run time: 157.32 ms
run time2: 672.43 ms
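For reference, the same in-place idea can also be spelled out with explicit ufunc out= arguments, which likewise avoids allocating a temporary array on every iteration. A sketch (my addition, using the question's edited signature where a is created outside):
import numpy as np

def array_sum3(a, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        np.multiply(s, 0.5, out=s)  # halve s in place, no temporary array
        np.add(s, a, out=s)         # add a in place, no temporary array
    return s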

Times include the randn call as well as the summation:
In [68]: timeit array_sum(array_size, 0)
16.6 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [69]: timeit array_sum(array_size, 1)
18.9 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [70]: timeit array_sum(array_size, 20)
55.5 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [71]: (55-16)/20
Out[71]: 1.95
So it's 16ms for the setup, and 2ms per iteration. Same pattern with 500 iterations.
MATLAB does some JIT compilation. I don't know if that's the case here or not. I don't have MATLAB to test. In Octave (no timeit)
>> t = time(); array_sum(500,0); (time()-t)*1000
ans = 13.704
>> t = time(); array_sum(500,1); (time()-t)*1000
ans = 16.219
>> t = time(); array_sum(500,20); (time()-t)*1000
ans = 82.346
>> t = time(); array_sum(500,500); (time()-t)*1000
ans = 1610.6
Octave's random is faster, but the per iteration sum is slower.

Related

Making a Numpy Operation Vectorised

Long story short, I'm applying a function to multiple different time intervals and then storing the resulting arrays at different indices in an ndarray. Presently, I'm doing this using a for loop with the numpy equivalent of the enumerate function. As I understand it, this eliminates the major advantage of numpy: vectorisation. Is there a particular way my routine could be implemented that retains this advantage?
Here is my code:
Most of it is the working parts for the function psi_t.
import numpy as np
from scipy import linalg as LA  # needed for LA.expm in psi_t below

# Number of Walks and Number of Positions
N = 100
P = 2*N + 1
hopping_rate = 0.5
psi_t0 = np.zeros(P)
psi_t0[N] = 1
# creates the line upon which the particle moves
# index N is the central position
def hamiltonian(line_length, hopping_rate):
    '''
    creates the simple non-time-dependent hamiltonian H = γA,
    where A is the adjacency matrix
    '''
    return hopping_rate * line_adjacency_matrix(line_length)
def measurement_operator(positions, finished_quantum_state):
    '''
    Converts the finished quantum state into an array of probabilities for
    being in each position.
    Uses the measurement operator from Susan Stepney's blog
    https://susan-stepney.blogspot.com/2014/02/mathjax.html
    improved on by
    https://github.com/Driminary/python-randomwalk-project/blob/master/quantum-2D.py
    apart from the fact that the measurement operator drops the extra dimensions
    of the spin space, which aren't present in the continuous walk.
    '''
    probabilities = np.empty(P)
    #M_hat = np.zeros((2*P,2*P,2*P))
    for k in range(P):
        posn = np.zeros(P)  # set the values of all positions to nought ..
        posn[k] = 1         # except for the value we're interested in
        #M_hat = np.kron(np.outer(posn,posn)) #perform measurement at the current pos
        M_hat = np.outer(posn, posn)  # perform measurement at the current position
        proj = M_hat.dot(finished_quantum_state)  # find the state the system is in
        probabilities[k] = proj.dot(proj.conjugate()).real  # probability of the particle being there
    return probabilities
def psi_t(initial_wave_function, positions, hopping_rate, time):
    '''
    Acts on the initial state to give the 'position' of the quantum particle at
    time t. Applies the measurement operator to return the probability of being
    at any position at time t.
    '''
    # state after time evolution under the continuous walk
    psi = np.matmul(LA.expm(-1j * hamiltonian(positions, hopping_rate) * time),
                    initial_wave_function)
    probabilities = measurement_operator(P, psi)
    return probabilities
time_evolution = 150  # how many 'seconds' the wavefunction is evolved for
time_interval = 0.5
number_of_intervals = int(time_evolution / time_interval)
number_of_positions = P
# empty ndarray ready for the probabilities at each time t
probabilities_at_t = np.ndarray((number_of_intervals, number_of_positions))
# the individual times at which psi_t is calculated
array_of_times = np.linspace(0, time_evolution, number_of_intervals)
for idx, time in np.ndenumerate(array_of_times):
    # probabilities_at_t is filled at index idx with the array of probabilities
    # produced by psi_t. This is the step I am trying to vectorise.
    probabilities_at_t[idx] = psi_t(psi_t0, P, hopping_rate, time)
The function psi_t is called in a for loop to act on each of the times in array_of_times individually. Is there a way for psi_t to act on the whole array array_of_times, the way one can write x**2 for an array x? Can it be done in one fell swoop?
P.S. Eagle-eyed Overflowers will note that there is a for loop inside measurement_operator anyway. I don't think there's a way to get rid of that, however!
The question is not really reproducible because some of the functions being called are missing, but here is my vectorised implementation of measurement_operator. It assumes finished_quantum_state has shape (P,) (not sure if that's the case, because I couldn't reproduce up to that part).
def measurement_operator_vectorized(positions, finished_quantum_state):
    M_hat = np.zeros((P, P, P))
    M_hat[np.arange(0, P), np.arange(0, P), np.arange(0, P)] = 1
    proj = np.tensordot(M_hat, finished_quantum_state, axes=((2), (0)))
    probabilities = (proj * proj.conjugate()).sum(axis=1).real
    return probabilities
Here are some benchmarks:
P = 1000
a = np.random.rand(P)
b = np.random.rand(P)
%timeit c1 = measurement_operator(a, b)
%timeit c2 = measurement_operator_vectorized(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
print(np.allclose(c1, c2))
Gives:
1.18 s ± 46.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
308 ms ± 6.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
peak memory: 86.43 MiB, increment: 0.00 MiB
peak memory: 90.34 MiB, increment: 3.91 MiB
True
The vectorised version is faster, with comparable memory usage, for P ~ 1000.
Note that for really high values of P, the memory usage will increase a lot for the vectorised version.
This isn't exactly what the OP asked for; to vectorise the other loop, a more complete code sample would be helpful.
However, this benchmark is valid only if finished_quantum_state is real. For complex values the tensordot operation is very slow and inefficient (in memory) so you might actually be better off with the non-vectorized version.
P = 1000
a = np.random.rand(P) + np.random.rand(P)*1j
b = np.random.rand(P) + np.random.rand(P)*1j
%timeit -n1 -r1 c1 = measurement_operator(a, b)
%timeit -n1 -r1 c2 = measurement_operator_vectorized(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
np.allclose(c1, c2)
2.97 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.49 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
peak memory: 102.69 MiB, increment: 0.03 MiB
peak memory: 15365.38 MiB, increment: 15262.69 MiB
However, if you really want the best performance, you are better off forgetting the physics details about measurement etc. temporarily and just doing:
def measurement_operator_fastest(positions, finished_quantum_state):
    return (finished_quantum_state * finished_quantum_state.conjugate()).real
P = 1000
a = np.random.rand(P) + np.random.rand(P)*1j
b = np.random.rand(P) + np.random.rand(P)*1j
%timeit -n1 -r1 c1 = measurement_operator(a, b)
%timeit -n1 -r1 c2 = measurement_operator_vectorized(a, b)
%timeit -n1 -r1 c3 = measurement_operator_fastest(a, b)
%memit c1 = measurement_operator(a, b)
%memit c2 = measurement_operator_vectorized(a, b)
%memit c3 = measurement_operator_fastest(a, b)
print(np.allclose(c1, c2))
print(np.allclose(c1, c3))
2.87 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
3.48 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
16.6 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
peak memory: 102.70 MiB, increment: 0.00 MiB
peak memory: 15365.39 MiB, increment: 15262.69 MiB
peak memory: 102.69 MiB, increment: -0.01 MiB
True
True
By taking the inner product directly, you can make the function around 10^6 times faster. Of course that assumes the measurement operator as defined.

Fastest way to compute large number of 3x3 dot products

I have to compute a large number of 3x3 linear transformations (e.g. rotations). This is what I have so far:
import numpy as np
from scipy import sparse
from numba import jit

n = 100000  # number of transformations
k = 100     # number of vectors for each transformation

A = np.random.rand(n, 3, k)   # vectors
Op = np.random.rand(n, 3, 3)  # operators
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1)))  # same as Op but as block-diag

def dot1():
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2():
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3():
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4():
    """ using sparse block diag matrix """
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
On a macbook pro 2012, this gives me:
In [62]: %timeit dot1()
783 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [63]: %timeit dot2()
261 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: %timeit dot3()
293 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [65]: %timeit dot4()
281 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Apart from the naive approach, all approaches are similar. Is there a way to accelerate this significantly?
Edit
(The CUDA approach is the best when available. The following compares the non-CUDA versions.)
Following the various suggestions, I modified dot2, added the Op@A method, and a version based on #59356461.
from numba import njit, prange

@njit(fastmath=True, parallel=True)
def dot2(Op, A):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in prange(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot5(Op, A):
    """ using matmul """
    return Op@A

@njit(fastmath=True, parallel=True)
def dot6(Op, A):
    """ another numba.jit with parallel (based on #59356461) """
    new = np.empty_like(A)
    for i_n in prange(A.shape[0]):
        for i_k in range(A.shape[2]):
            for i_x in range(3):
                acc = 0.0j
                for i_y in range(3):
                    acc += Op[i_n, i_x, i_y] * A[i_n, i_y, i_k]
                new[i_n, i_x, i_k] = acc
    return new
This is what I get (on a different machine) with benchit:
def gen(n, k):
    Op = np.random.rand(n, 3, 3) + 1j * np.random.rand(n, 3, 3)
    A = np.random.rand(n, 3, k) + 1j * np.random.rand(n, 3, k)
    return Op, A

# benchit
import benchit
funcs = [dot1, dot2, dot3, dot4, dot5, dot6]
inputs = {n: gen(n, 100) for n in [100, 1000, 10000, 100000, 1000000]}
t = benchit.timings(funcs, inputs, multivar=True, input_name='Number of operators')
t.plot(logy=True, logx=True)
You've gotten some great suggestions, but I wanted to add one more due to this specific goal:
Is there a way to accelerate this significantly?
Realistically, if you need these operations to be significantly faster (which often means > 10x) you probably would want to use a GPU for the matrix multiplication. As a quick example:
import numpy as np
import cupy as cp

n = 100000  # number of transformations
k = 100     # number of vectors for each transformation

# CPU version
A = np.random.rand(n, 3, k)   # vectors
Op = np.random.rand(n, 3, 3)  # operators

def dot5():  # the suggested, best CPU approach
    return Op@A

# GPU version using a V100
gA = cp.asarray(A)
gOp = cp.asarray(Op)

# run once to ignore JIT overhead before benchmarking
gOp@gA;

%timeit dot5()
%timeit gOp@gA; cp.cuda.Device().synchronize()  # need to sync for a fair benchmark
112 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.19 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use Op@A as suggested by @hpaulj in the comments.
Here is a comparison using benchit:
def dot1(A, Op):
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2(A, Op):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3(A, Op):
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4(A, Op):
    """ using sparse block diag matrix """
    n = A.shape[0]
    sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n+1)))  # same as Op but as block-diag
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)

def dot5(A, Op):
    """ using matmul """
    return Op@A

in_ = {n: [np.random.rand(n, 3, k), np.random.rand(n, 3, 3)] for n in [100, 1000, 10000, 100000, 1000000]}
They seem to be close in performance for larger scale with dot5 being slightly faster.
In one answer Nick mentioned using the GPU - which is the best solution of course.
But - as a general rule - what you're doing is likely CPU limited. Therefore (with the exception to the GPU approach), the best bang you can get is if you make use of all the cores on your machine to work in parallel.
So for that you would want to use multiprocessing (not python's multithreading!), to split the job up into pieces running on each core in parallel.
This is not trivial, but also not too hard, and there are many good examples/guides online.
But if you had an 8-core machine, it would likely give you an almost 8x speed increase, as long as you avoid memory bottlenecks: don't pass many small objects between processes, but pass them all in one group at the start.
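A rough sketch of that idea (my addition; the chunk count is illustrative, and for small n the pickling overhead of moving the arrays between processes can eat the gains):
import numpy as np
from multiprocessing import Pool

def matmul_chunk(args):
    op_chunk, a_chunk = args
    return op_chunk @ a_chunk  # batched 3x3 matmuls for this chunk

def dot_parallel(Op, A, processes=8):
    # one large chunk per process, passed in a single group up front
    chunks = list(zip(np.array_split(Op, processes), np.array_split(A, processes)))
    with Pool(processes) as pool:
        results = pool.map(matmul_chunk, chunks)
    return np.concatenate(results)

if __name__ == '__main__':
    Op = np.random.rand(1000, 3, 3)
    A = np.random.rand(1000, 3, 100)
    assert np.allclose(dot_parallel(Op, A), Op @ A)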

Vectorization in a loop slower than a nested loop in numba jitted function

So I am experimenting with the performance boost of combining vectorization and for-loops powered by @njit in numba (I am currently using numba 0.45.1). Disappointingly, I found out it is actually slower than the pure nested-loop implementation in my code.
This is my code:
import numpy as np
from numba import njit

@njit
def func3(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        w = w + (1-alpha_arr)**i
        e = e*(1-alpha_arr) + arr_in[i]
        result[i, :] = e / w
    return result
@njit
def func4(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        for col in range(len(win_arr)):
            w[col] = w[col] + (1-alpha_arr[col])**i
            e[col] = e[col]*(1-alpha_arr[col]) + arr_in[i]
            result[i, col] = e[col] / w[col]
    return result

if __name__ == '__main__':
    np.random.seed(0)
    data_size = 200000
    winarr_size = 1000
    data = np.random.uniform(0, 1000, size=data_size) + 29000
    win_array = np.arange(1, winarr_size + 1)
    abc_test3 = func3(data, win_array)
    abc_test4 = func4(data, win_array)
    print(np.allclose(abc_test3, abc_test4, equal_nan=True))
I benchmarked the two functions using the following configurations:
(data_size,winarr_size) = (200000,100), (200000,200),(200000,1000), (200000,2000), (20000,10000), (2000,100000).
And found that the pure nested-for-loop implementation(func4) is consistently faster (about 2-5% faster) than the implementation with a for-loop mixed with vectorization (func3).
My questions are the following:
1) what needs to be changed to further improve the speed of the code?
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
It seems you misunderstood what "vectorized" means. Vectorized means that you write code that operates on arrays as if they were scalars, but that's just how the code looks; it is not, by itself, about performance.
In the Python/NumPy world vectorized also carries the meaning that the overhead of the loop in vectorized operations is (often) much smaller compared to loopy code. However the vectorized code still has to do the loop (even if it's hidden in a library)!
Also, if you write a loop with numba, numba will compile it and create fast code that performs (generally) as fast as vectorized NumPy code. That means inside a numba function there's no significant performance difference between vectorized and non-vectorized code.
So that should answer your questions:
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
It grows linearly because it still has to iterate. In vectorized code the loop is just hidden inside a library routine.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
No.
You also asked what could be done to make it faster.
The comments already mentioned that you could parallelize it:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def func6(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        for col in nb.prange(len(win_arr)):
            w[col] = w[col] + (1-alpha_arr[col])**i
            e[col] = e[col] * (1-alpha_arr[col]) + arr_in[i]
            result[i, col] = e[col] / w[col]
    return result
This makes the code a bit faster on my machine (4 cores).
However there's also a problem: your algorithm may be numerically unstable. The (1-alpha_arr[col])**i will underflow at some point when you raise it to powers in the hundreds of thousands:
>>> alpha = 0.01
>>> for i in [1, 10, 100, 1_000, 10_000, 50_000, 100_000, 200_000]:
...     print((1-alpha)**i)
0.99
0.9043820750088044
0.3660323412732292
4.317124741065786e-05
2.2487748498162805e-44
5.750821364590612e-219
0.0  # <-- underflow
0.0
Always think twice about expensive mathematical operations like pow and division. If you can replace them with cheap operations like multiplication, addition and subtraction, it is always worth a try.
Note that repeatedly multiplying alpha by itself is only algebraically the same as computing the power directly. Since this is numerical math, the results can differ.
Also avoid unnecessary temporary arrays.
First try
@nb.njit(error_model="numpy", parallel=True)
def func5(arr_in, win_arr):
    # filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0], arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col, 0] = np.nan
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[:two_index, 0] = arr_in[0]
    for col in nb.prange(win_arr.shape[0]):
        alpha = 1. - (2. / (win_arr[col] + 1.))
        alpha_exp = alpha
        w = 1.
        e = arr_in[0]
        for i in range(1, arr_in.shape[0]):
            w += alpha_exp
            e = e*alpha + arr_in[i]
            result[col, i] = e/w
            alpha_exp *= alpha
    return result.T
Second try (avoiding underflow)
@nb.njit(error_model="numpy", parallel=True)
def func7(arr_in, win_arr):
    # filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0], arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col, 0] = np.nan
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[:two_index, 0] = arr_in[0]
    for col in nb.prange(win_arr.shape[0]):
        alpha = 1. - (2. / (win_arr[col] + 1.))
        alpha_exp = alpha
        w = 1.
        e = arr_in[0]
        for i in range(1, arr_in.shape[0]):
            w += alpha_exp
            e = e*alpha + arr_in[i]
            result[col, i] = e/w
            if np.abs(alpha_exp) >= 1e-308:
                alpha_exp *= alpha
            else:
                alpha_exp = 0.
    return result.T
Timings
%timeit abc_test3= func3(data, win_array)
7.17 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test4= func4(data, win_array)
7.13 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#from MSeifert answer (parallelized)
%timeit abc_test6= func6(data, win_array)
3.42 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test5= func5(data, win_array)
1.22 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test7= func7(data, win_array)
238 ms ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why does a numpy array not appear to be much faster than a standard python list?

From what I understand, numpy arrays can handle operations more quickly than python lists because they're handled in a parallel rather than iterative fashion. I tried to test that out for fun, but I didn't see much of a difference.
Was there something wrong with my test? Does the difference only matter with arrays much bigger than the ones I used? I made sure to create a python list and numpy array in each function to cancel out differences creating one vs. the other might make, but the time delta really seems negligible. Here's my code:
My final outputs were numpy function: 6.534756324786595s, list function: 6.559365831783256s
import timeit
import numpy as np

a_setup = 'import timeit; import numpy as np'

std_fx = '''
def operate_on_std_array():
    std_arr = list(range(0,1000000))
    np_arr = np.asarray(std_arr)
    for index,elem in enumerate(std_arr):
        std_arr[index] = (elem**20)*63134
    return std_arr
'''

parallel_fx = '''
def operate_on_np_arr():
    std_arr = list(range(0,1000000))
    np_arr = np.asarray(std_arr)
    np_arr = (np_arr**20)*63134
    return np_arr
'''

def operate_on_std_array():
    std_arr = list(range(0,1000000))
    np_arr = np.asarray(std_arr)
    for index,elem in enumerate(std_arr):
        std_arr[index] = (elem**20)*63134
    return std_arr

def operate_on_np_arr():
    std_arr = list(range(0,1000000))
    np_arr = np.asarray(std_arr)
    np_arr = (np_arr**20)*63134
    return np_arr

print('std', timeit.timeit(setup=a_setup, stmt=std_fx, number=80000000))
print('par', timeit.timeit(setup=a_setup, stmt=parallel_fx, number=80000000))
#operate_on_np_arr()
#operate_on_std_array()
The timeit docs show that the statement you pass in is supposed to execute something, but the statements you pass in just define functions, so your benchmark never actually calls them. That's also the giveaway: 80000000 trials on a 1-million-length array should take much longer than 6.5 seconds.
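For illustration, a corrected call might look like this (a sketch reusing the asker's a_setup and std_fx strings; note the much smaller number):
import timeit
# define the function in setup, execute it in stmt
t = timeit.timeit(stmt='operate_on_std_array()',
                  setup=a_setup + std_fx,
                  number=10)
print('std', t / 10, 's per call')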
Other issues you have in your test:
np_arr = (np_arr**20)*63134 may create a copy of np_arr, but your Python list equivalent only mutates an existing array.
Numpy math is different than Python math. 100**20 in Python returns a huge number because Python has unbounded-length integers, but Numpy uses C-style fixed-length integers that overflow. (In general, you have to imagine doing the operation in C when you use Numpy because other unintuitive things may apply, like garbage in uninitialized arrays.)
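A quick illustration of that second point (the numpy result is whatever 100**20 wraps to modulo 2**64):
import numpy as np

print(100 ** 20)              # Python int: the full 41-digit value
print(np.array([100]) ** 20)  # int64: silently wraps modulo 2**64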
Here's a test where I modify both in place, multiplying then dividing by 31 each time so the values don't change over time or overflow:
import numpy as np
import timeit

std_arr = list(range(0,100000))
np_arr = np.array(std_arr)
np_arr_vec = np.vectorize(lambda n: (n * 31) / 31)

def operate_on_std_array():
    for index,elem in enumerate(std_arr):
        std_arr[index] = elem * 31
        std_arr[index] = elem / 31
    return std_arr

def operate_on_np_arr():
    np_arr_vec(np_arr)
    return np_arr

import time
def test_time(f):
    count = 100
    start = time.time()
    for i in range(count):
        f()
    dur = time.time() - start
    return dur

print(test_time(operate_on_std_array))
print(test_time(operate_on_np_arr))
Results:
3.0798873901367188 # standard array time
2.221336841583252 # np array time
Edit: As @user2357112 pointed out, the proper Numpy way to do it is this:
def operate_on_np_arr():
    global np_arr
    np_arr *= 31
    np_arr //= 31  # integer division, not double
    return np_arr
Makes it much faster. I see 0.1248 seconds.
Here are some timings using the ipython magic to initialize the lists and arrays, so the results focus on the calculations:
In [103]: %%timeit alist = list(range(10000))
...: for i,e in enumerate(alist):
...:     alist[i] = (e*3)*20
...:
4.13 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [104]: %%timeit arr = np.arange(10000)
...: z = (arr*3)*20
...:
20.6 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [105]: %%timeit alist = list(range(10000))
...: z = [(e*3)*20 for e in alist]
...:
...:
1.71 ms ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Looking at the effect of array creation times:
In [106]: %%timeit alist = list(range(10000))
...: arr = np.array(alist)
...: z = (arr*3)*20
...:
...:
1.01 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Ok, the calculation isn't the same. If I use **3 instead, all times are about 2x larger. Same relative relations.

High performance weighted random choice for python 2?

I have the following python method, which selects a random element from the sequence "seq", weighted by another sequence that contains the weights for each element in seq:
import random

def weighted_choice(seq, weights):
    assert len(seq) == len(weights)
    total = sum(weights)
    r = random.uniform(0, total)
    upto = 0
    for i in range(len(seq)):
        if upto + weights[i] >= r:
            return seq[i]
        upto += weights[i]
    assert False, "Shouldn't get here"
If I call the above a million times with a 1000 element sequence, like this:
import time

seq = range(1000)
weights = []
for i in range(1000):
    weights.append(random.randint(1, 100))

st = time.time()
for i in range(1000000):
    r = weighted_choice(seq, weights)
print(time.time() - st)
it runs for approximately 45 seconds in cpython 2.7 and for 70 seconds in cpython 3.6.
It finishes in around 2.3 seconds in pypy 5.10, which would be fine for me, sadly I can't use pypy for some reasons.
Any ideas on how to speed up this function on cpython? I'm interested in other implementations (algorithmically, or via external libraries, like numpy) as well if they perform better.
P.S.: python3 has random.choices with weights; it runs for around 23 seconds, which is better than the above function, but still exactly ten times slower than pypy.
I've tried it with numpy this way:
weights = [1./1000]*1000
st = time.time()
for i in range(1000000):
    #r = weighted_choice(seq, weights)
    #r = random.choices(seq, weights)
    r = numpy.random.choice(seq, p=weights)
print(time.time() - st)
It ran for 70 seconds.
You can use numpy.random.choice (the p parameter is the weights). Normally numpy functions are vectorized and so run at near-C speed.
Implement as:
def weighted_choice(seq, weights):
    w = np.asarray(weights)
    p = w / w.sum()  # can skip if weights always sum to 1
    return np.random.choice(seq, p=p)
Edit:
Timings:
%timeit np.random.choice(x, p=w) # len(x) == 1_000_000
13 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.random.choice(y, p=w) # len(y) == 100_000_000
1.28 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You could take this approach with numpy. If you eliminate the for loop, you get the true power of numpy by indexing the positions you need:
#Untimed since you did not
seq = np.arange(1000)
weights = np.random.randint(1, 100, (1000, 1))

def weights_numpy(seq, weights, iterations):
    """
    :param seq: Input sequence
    :param weights: Input Weights
    :param iterations: Iterations to run
    :return:
    """
    r = np.random.uniform(0, weights.sum(0), (1, iterations))  # create array of choices
    ar = weights.cumsum(0)  # get cumulative sum
    return seq[(ar >= r).argmax(0)]  # get indices of seq that meet your condition
And the timing (python 3, numpy 1.14.0):
%timeit weights_numpy(seq,weights,1000000)
4.05 s ± 256 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is a bit slower than PyPy, but hardly...
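If the use case allows drawing all the samples up front, the per-call overhead disappears entirely with a single vectorised draw. A sketch (my addition, untimed on this hardware):
import numpy as np

seq = np.arange(1000)
weights = np.random.randint(1, 100, 1000)
p = weights / weights.sum()  # np.random.choice needs probabilities summing to 1

# one call for all million draws instead of a million single draws
samples = np.random.choice(seq, size=1000000, p=p)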
