Fastest way to count sequence elements satisfying a predicate - python

I want to count the elements of a throw-away sequence that satisfy a certain property. I am somewhat surprised that a generator expression is not the fastest option:
from random import random
l = [random() for i in range(1000000)]
%timeit len([None for x in l if x < 0.5])
%timeit len([x for x in l if x < 0.5])
%timeit sum(1 for x in l if x < 0.5)
%timeit sum(x < 0.5 for x in l)
Measured performances:
90.7 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
97.7 ms ± 7.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
105 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
178 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Is there a faster way to do this?

You can use NumPy, if the conversion to a NumPy array itself does not count:
import numpy as np
a = np.array(l)
%timeit np.sum(a < 0.5)
1.28 ms ± 48.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even taking the conversion into account, it is still significantly faster:
%%timeit
a = np.array(l)
np.sum(a < 0.5)
27.2 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For comparison, the pure Python versions on my machine:
from random import random
l = [random() for i in range(1000000)]
%timeit len([None for x in l if x < 0.5])
%timeit len([x for x in l if x < 0.5])
%timeit sum(1 for x in l if x < 0.5)
%timeit sum(x < 0.5 for x in l)
46.4 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
48.1 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
59.5 ms ± 811 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
103 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
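A further option worth trying (my suggestion, not benchmarked above): np.count_nonzero is specialized for counting True entries and is typically at least as fast as np.sum on a boolean mask. A minimal sketch:
import numpy as np
from random import random

l = [random() for i in range(1000000)]
a = np.array(l)

# Counts the True entries of the boolean mask directly, without
# summing integers.
print(np.count_nonzero(a < 0.5))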

Related

How is Numba faster than NumPy for matrix multiplication with integers?

I was comparing parallel matrix multiplication with numba and matrix multiplication with numpy when I noticed that numpy isn't as fast with integers (int32).
import numpy as np
from numba import njit, prange
@njit()
def matrix_multiplication(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, k] += A[i, j] * B[j, k]
    return C

@njit(parallel=True, fastmath=True)
def matrix_multiplication_parallel(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in prange(m):
        for j in range(n):
            for k in range(p):
                C[i, k] += A[i, j] * B[j, k]
    return C
m = 100
n = 1000
p = 1500
A = np.random.randn(m, n)
B = np.random.randn(n, p)
A2 = np.random.randint(1, 100, size=(m, n))
B2 = np.random.randint(1, 100, size=(n, p))
A3 = np.ones((m, n))
B3 = np.ones((n, p))
# compile function
matrix_multiplication(A, B)
matrix_multiplication_parallel(A, B)
print('normal')
%timeit matrix_multiplication(A, B)
%timeit matrix_multiplication(A2, B2)
%timeit matrix_multiplication(A3, B3)
print('parallel')
%timeit matrix_multiplication_parallel(A, B)
%timeit matrix_multiplication_parallel(A2, B2)
%timeit matrix_multiplication_parallel(A3, B3)
print('numpy')
%timeit A @ B
%timeit A2 @ B2
%timeit A3 @ B3
normal
1.51 s ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.56 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.5 s ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
333 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
408 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
313 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
31.2 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.99 s ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
28.4 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I found this answer explaining that numpy doesn't use BLAS for integers.
From what I understand, both numpy and numba make use of vectorization. I wonder what could be different in the implementations for a relatively consistent 25% increase in performance.
I tried reversing the order of operations in case less CPU resources were available towards the end. I made sure to not do anything while the program was running.
numpy
35.1 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.97 s ± 44.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
32 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
normal
1.48 s ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.46 s ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.47 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
379 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
461 ms ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
381 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Trying the method in the answer doesn't really help.
import inspect
inspect.getmodule(matrix_multiplication)
<module '__main__'>
I tried it on Google Colab.
normal
2.28 s ± 407 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.7 s ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.6 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
1.33 s ± 315 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.66 s ± 425 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.34 s ± 327 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
64.9 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.14 s ± 477 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
64.1 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is possible to print the generated code, but I don't know how it can be compared to the numpy code.
for v, k in matrix_multiplication.inspect_llvm().items():
    print(v, k)
Going to the definition of np.matmul leads to matmul: _GUFunc_Nin2_Nout1[L['matmul'], L[19], None] in ".../site-packages/numpy/__init__.pyi".
I think this is the C function being called, given the "no BLAS" comment. The code seems equivalent to mine, except for additional if statements.
For small arrays m = n = p = 10, numpy is faster.
normal
6.6 µs ± 99.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.72 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.57 µs ± 62.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
parallel
63.5 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
64.5 µs ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
63.3 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
numpy
1.94 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.53 µs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
1.91 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
With m = 10000 instead of 1000:
normal
14.4 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.3 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.7 s ± 538 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
3.34 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.46 s ± 78.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
334 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
19.4 s ± 655 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
248 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You are comparing two different loop patterns. The loop order equivalent to the NumPy implementation is the following:
def matrix_multiplication(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for k in range(p):
            for j in range(n):
                C[i, k] += A[i, j] * B[j, k]
    return C
Test with m, n, p = 1000, 1500, 100 using your original test code, and you will see that NumPy is faster than Numba.
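For concreteness, a hedged, self-contained sketch of the suggested test (the helper name matmul_ikj is illustrative, not from the original post):
import numpy as np
from numba import njit

@njit
def matmul_ikj(A, B):
    # Reordered (i, k, j) loops, matching the pattern described above.
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for k in range(p):
            for j in range(n):
                C[i, k] += A[i, j] * B[j, k]
    return C

m, n, p = 1000, 1500, 100
A = np.random.randn(m, n)
B = np.random.randn(n, p)
matmul_ikj(A, B)      # first call triggers JIT compilation
# %timeit matmul_ikj(A, B)
# %timeit A @ B       # BLAS-backed baseline for comparison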

Need help understanding loop speed with the timeit function in Python

I need help understanding how the %timeit magic function works in the two programs below.
Program A
a = [1,3,2,4,1,4,2]
%timeit [val + 5 for val in a]
830 ns ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Program B
import numpy as np
a = np.array([1,3,2,4,1,4,2])
%timeit [a+5]
1.07 µs ± 23.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
My confusion:
1. µs is bigger than ns. How does the NumPy version execute more slowly than the for loop here?
2. In "1.07 µs ± 23.7 ns per loop", why is the deviation reported in ns and not in µs?
NumPy adds overhead, which impacts speed on small datasets; vectorization mostly pays off on large datasets. You must try larger numbers:
N = 10_000_000
a = list(range(N))
%timeit [val + 5 for val in a]
import numpy as np
a = np.arange(N)
%timeit a+5
Output:
1.51 s ± 318 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
55.8 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
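To see where the crossover happens, here is a small hedged sketch using the timeit module (the sizes are illustrative; the exact crossover depends on the machine):
import timeit
import numpy as np

# Locate where NumPy's fixed per-call overhead is amortized.
for N in (10, 100, 1_000, 10_000, 100_000):
    lst = list(range(N))
    arr = np.arange(N)
    t_list = timeit.timeit(lambda: [v + 5 for v in lst], number=1000)
    t_np = timeit.timeit(lambda: arr + 5, number=1000)
    print(f"N={N:>7}: list comp {t_list:.5f}s  numpy {t_np:.5f}s")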

Why is NumPy's `repmat` faster than `kron` to repeat blocks of arrays?

Why is the repmat function from numpy.matlib so much faster than numpy.kron (i.e., the Kronecker product) for repeating blocks of matrices?
An MWE would be:
import numpy as np
import numpy.matlib
from numpy import kron, ones

test_N = 1000
test_vec = np.random.rand(test_N, 2)
rep_vec = np.matlib.repmat(test_vec, 100, 1)
kron_vec = kron(ones((100, 1)), test_vec)
%%timeit
rep_vec = np.matlib.repmat(test_vec, 10, 1)
53.5 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
kron_vec = kron(test_vec, ones((10,1)))
1.65 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Some of my own timings:
In [361]: timeit kron_vec = np.kron(np.ones((10,1)), test_vec)
131 µs ± 871 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Your kron_vec = kron(test_vec, ones((10,1))) at 1.65 ms looks more like a ones((100,1)) timing test.
Mine is slower than the others, but not as drastically so.
A similar multiplication-based approach (like the outer product inside kron, but without the concatenate step):
In [362]: timeit (test_vec*np.ones((10,1,1))).reshape(-1,2)
61.2 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
repmat:
In [363]: timeit rep_vec = matlib.repmat(test_vec,10,1)
94.2 µs ± 32.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using tile instead:
In [364]: timeit np.tile(test_vec,(10,1))
20.4 µs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and repeat directly:
In [365]: timeit x = test_vec[None,:,:].repeat(10,0).reshape(-1,2)
12.2 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
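As a sanity check (my sketch, not part of the original answer), all of the variants above build the same block-repeated array; kron differs only in that it reaches the result through multiplication:
import numpy as np
import numpy.matlib as matlib

test_vec = np.random.rand(1000, 2)

rep = matlib.repmat(test_vec, 10, 1)
til = np.tile(test_vec, (10, 1))
krn = np.kron(np.ones((10, 1)), test_vec)
rpt = test_vec[None, :, :].repeat(10, 0).reshape(-1, 2)

# All four produce the same (10000, 2) array of stacked copies.
assert np.array_equal(rep, til)
assert np.array_equal(til, rpt)
assert np.allclose(til, krn)  # kron multiplies by 1.0 along the way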

Range of all elements in numpy array

Suppose I have a 1d array a where, from each element, I would like to construct a range whose size is stored in ranges:
a = np.array([10,9,12])
ranges = np.array([2,4,3])
The desired output would be:
np.array([10,11,9,10,11,12,12,13,14])
I could of course use a for loop, but I prefer a fully vectorized approach. np.repeat allows one to repeat each element of a a given number of times via its repeats argument, but I am not aware of a similar NumPy function that deals with the problem above.
>>> np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
array([10, 11, 9, 10, 11, 12, 12, 13, 14])
With pandas it could be easier:
>>> import pandas as pd
>>> x = pd.Series(np.repeat(a, ranges))
>>> x + x.groupby(x).cumcount()
0 10
1 11
2 9
3 10
4 11
5 12
6 12
7 13
8 14
dtype: int64
>>>
If you want a numpy array:
>>> x.add(x.groupby(x).cumcount()).to_numpy()
array([10, 11, 9, 10, 11, 12, 12, 13, 14], dtype=int64)
>>>
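For reference, a hedged pure-NumPy sketch of the same idea as the pandas version (repeat each start, then add a within-group counter); this variant is my addition, not from the answer above:
import numpy as np

a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])

# Repeat each start value, then add a within-group counter built by
# subtracting each group's starting offset from a global arange.
starts = np.repeat(a, ranges)
offsets = np.repeat(np.cumsum(ranges) - ranges, ranges)
out = starts + np.arange(ranges.sum()) - offsets
assert np.array_equal(out, np.array([10, 11, 9, 10, 11, 12, 12, 13, 14]))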
Someone asked about timing, so I compared the times of the three solutions (so far) in a very simple manner, using the %timeit magic function in Jupyter notebook cells.
I set it up as follows:
N = 1
a = np.array([10,9,12])
a = np.tile(a, N)
ranges = np.array([2,4,3])
ranges = np.tile(ranges, N)
a.shape, ranges.shape
This let me scale up easily (albeit with repeated rather than random values).
Then I ran the three solutions:
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Results are as follows:
N = 1:
9.81 µs ± 481 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
568 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.53 µs ± 81.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
N = 10:
63.4 µs ± 976 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
575 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
25.1 µs ± 698 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
N = 100:
612 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
608 µs ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
237 µs ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 1000:
6.09 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
852 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.44 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the Pandas solution wins once arrays reach about 1000 elements, while the Python double list comprehension does an excellent job until that point. np.hstack probably loses out because of extra memory allocation and copying, but that's a guess. Note also that the Pandas solution takes nearly the same time for each array size.
Caveats still exist because there are repeated numbers and all values are relatively small integers. This really shouldn't matter, but I'm not (yet) betting on it. (For example, Pandas' groupby functionality may be fast because of the repeated numbers.)
Bonus: the OP stated in a comment that "The real life arrays are around 1000 elements, yet with ranges ranging from 100 to 1000. So it becomes quite big. – pr94".
So I adjusted my timing test to the following:
import numpy as np
import pandas as pd
N = 1000
a = np.random.randint(100, 1000, N)
# This is how I understand "ranges ranging from 100 to 1000"
ranges = np.random.randint(100, 1000, N)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Which comes out as:
hstack: 2.78 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pandas: 18.4 ms ± 663 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
double list comprehension: 64.1 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Which shows that those caveats I mentioned, in some form at least, do seem to exist. But people should double check whether this testing code is actually the most relevant and appropriate, and whether it is correct.
This problem is probably going to be solved much faster with a Numba-compiled function:
import numba as nb
import numpy as np

@nb.jit
def expand_range(values, counts):
    n = len(values)
    m = np.sum(counts)
    r = np.zeros((m,), dtype=values.dtype)
    k = 0
    for i in range(n):
        x = values[i]
        for j in range(counts[i]):
            r[k] = x + j
            k += 1
    return r
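One usage note (my addition, not from the original answer): Numba compiles on the first call, so warm the function up once before timing, otherwise the first measurement includes compilation:
a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])
expand_range(a, ranges)  # first call triggers JIT compilation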
On the very small inputs:
%timeit expand_range(a, ranges)
# 1.16 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and on somewhat larger inputs:
b = np.random.randint(0, 1000, 1000)
b_ranges = np.random.randint(1, 10, 1000)
%timeit expand_range(b, b_ranges)
# 5.07 µs ± 98.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
These show the Numba-based approach winning, with a speed gain of at least 100x over any of the other approaches proposed so far.
With numbers closer to what has been indicated in one of the comments by the OP:
b = np.random.randint(10, 1000, 1000)
b_ranges = np.random.randint(100, 1000, 1000)
%timeit expand_range(b, b_ranges)
# 1.5 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x = pd.Series(np.repeat(b, b_ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 91.8 ms ± 6.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(b, b_ranges)])
# 10.7 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.array([i for j in range(len(b)) for i in range(b[j],b[j]+b_ranges[j])])
# 144 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
which is still a respectable speedup of at least 7x over the others.

Is there a way to speed up Numpy array calculations when they only contain values in upper/lower triangle?

I'm doing some matrix calculations (2d) that only involve values in the upper triangle of the matrices.
So far I've found that using Numpy's triu method ("return a copy of a matrix with the elements below the k-th diagonal zeroed") works and is quite fast. But presumably, the calculations are still being carried out for the whole matrix, including unnecessary calculations on the zeros. Or are they?...
Here is an example of what I tried first:
import numpy as np

# Initialize vars
N = 160
u = np.empty(N)
u[0] = 1000
u[1:] = np.cumprod(np.full(N-1, 1/2**(1/16)))*1000
m = np.random.random(N)

def method1():
    # Prepare matrices with values only in the upper triangle
    ones_ut = np.triu(np.ones((N, N)))
    u_ut = np.triu(np.broadcast_to(u, (N, N)))
    m_ut = np.triu(np.broadcast_to(m, (N, N)))
    # Do the calculation
    return (ones_ut - np.divide(u_ut, u.reshape(N, 1)))**3*m_ut
Then I realized I only need to zero-out the final result matrix:
def method2():
    return np.triu((np.ones((N, N)) - np.divide(u, u.reshape(N, 1)))**3*m)

assert np.array_equal(method1(), method2())
But to my surprise, this was slower.
In [62]: %timeit method1()
662 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [63]: %timeit method2()
836 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Does numpy do some kind of special optimization when it knows the matrices contain half zeros?
I'm curious about why it is slower but actually my main question is, is there a way to speed up vectorized calculations by taking account of the fact that you are not interested in half the values in the matrix?
UPDATE
I tried just doing the calculations over 3 of the quadrants of the matrices but it didn't achieve any speed increase over method 1:
def method4():
    split = N//2
    x = np.zeros((N, N))
    u_mat = 1 - u/u.reshape(N, 1)
    x[:split, :] = u_mat[:split, :]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return np.triu(x)

assert np.array_equal(method1(), method4())
In [86]: %timeit method4()
683 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But this is faster than method 2.
We should simplify things to leverage broadcasting in as few places as possible. Doing so, we end up with something like this to get the final output directly from u and m:
np.triu((1-u/u.reshape(N, 1))**3*m)
Then, we can leverage the numexpr module, which performs noticeably better with transcendental operations (as is the case here) and is also very memory efficient. Ported to numexpr, it becomes:
import numexpr as ne
np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
Bringing the masking inside the evaluate call gives a further performance boost:
M = np.tri(N,dtype=bool)
ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)})
Timings on the given dataset:
In [25]: %timeit method1()
1000 loops, best of 3: 521 µs per loop
In [26]: %timeit method2()
1000 loops, best of 3: 417 µs per loop
In [27]: %timeit np.triu((1-u/u.reshape(N, 1))**3*m)
1000 loops, best of 3: 408 µs per loop
In [28]: %timeit np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
10000 loops, best of 3: 159 µs per loop
In [29]: %timeit ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1),'M':np.tri(N,dtype=bool)})
10000 loops, best of 3: 110 µs per loop
Note that another way to extend u to a 2D version would be with np.newaxis/None and this would be the idiomatic way. Hence, u.reshape(N, 1) could be replaced by u[:,None]. This shouldn't change the timings though.
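A quick illustration of that equivalence (my sketch):
import numpy as np

u = np.linspace(1.0, 2.0, 4)
# Both create the same (N, 1) column view for broadcasting.
assert np.array_equal(u.reshape(-1, 1), u[:, None])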
Here is another solution that is faster in some cases but slower in others.
idx = np.triu_indices(N)

def my_method():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result[idx] = t * t * t * m[idx[1]]
    return result
Here, the computation is done only for the elements in the (flattened) upper triangle. However, there is overhead in the 2D-index-based assignment operation result[idx] = .... So the method is faster when the overhead is less than the saved computations -- which happens when N is small or the computation is relatively complex (e.g., using t ** 3 instead of t * t * t).
Another variation of the method is to use 1D-index for the assignment operation, which can lead to a small speedup.
idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))

def my_method2():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result.ravel()[raveled_idx] = t * t * t * m[idx[1]]
    return result
Following are the results of the performance tests. Note that idx and raveled_idx are fixed for each N and do not change with u and m (as long as their shapes remain unchanged). Hence their values can be precomputed and the time is excluded from the test.
(If you need to call these methods with matrices of many different sizes, there will be added overhead in the computations of idx and raveled_idx.) For comparison, method4b, method5 and method6 cannot benefit much from any precomputation. For method_ne, the precomputation M = np.tri(N, dtype=bool) is also excluded from the test.
%timeit method4b()
%timeit method5()
%timeit method6()
%timeit method_ne()
%timeit my_method()
%timeit my_method2()
Result (for N = 160):
1.54 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
167 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
255 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
233 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
177 µs ± 907 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For N = 32:
89.9 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.2 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
28.6 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
14.3 µs ± 52.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For N = 1000:
70.7 ms ± 871 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
65.1 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
21.4 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.03 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.2 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.7 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using t ** 3 instead of t * t * t in my_method and my_method2 (N = 160):
1.53 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
156 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
235 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.4 ms ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.32 ms ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here, my_method and my_method2 outperform method4b and method5 a little bit.
I think the answer may be quite simple. Just put zeros in the cells that you don't want to calculate and the overall calculation will be faster. I think that might explain why method1() was faster than method2().
Here are some tests to illustrate the point.
In [29]: size = (160, 160)
In [30]: z = np.zeros(size)
In [31]: r = np.random.random(size) + 1
In [32]: t = np.triu(r)
In [33]: w = np.ones(size)
In [34]: %timeit z**3
177 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [35]: %timeit t**3
376 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [36]: %timeit r**3
572 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [37]: %timeit w**3
138 µs ± 548 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [38]: %timeit np.triu(r)**3
427 µs ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit np.triu(r**3)
625 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I'm not sure how all this works at a low level, but clearly zero or one raised to a power takes much less time to compute than other values.
Also interesting: with numexpr there is no difference.
In [42]: %timeit ne.evaluate("r**3")
79.2 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [43]: %timeit ne.evaluate("z**3")
79.3 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So, I think the fastest without using numexpr may be this way:
def method5():
    return np.triu(1 - u/u[:, None])**3*m

assert np.array_equal(method1(), method5())
In [65]: %timeit method1()
656 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [66]: %timeit method5()
587 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Or, if you are really chasing every micro-second:
def method4b():
    split = N//2
    x = np.zeros((N, N))
    u_mat = np.triu(1 - u/u.reshape(N, 1))
    x[:split, :] = u_mat[:split, :]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return x

assert np.array_equal(method1(), method4b())
In [71]: %timeit method4b()
543 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [72]: %timeit method4b()
533 µs ± 7.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And @Divakar's answer using numexpr is the fastest overall.
UPDATE
Thanks to @GZ0's comment: if you only need to raise to the power of 3, this is much faster:
def method6():
    a = np.triu(1 - u/u[:, None])
    return a*a*a*m

assert np.isclose(method1(), method6()).all()
(But I noticed a slight loss of precision.)
In [84]: %timeit method6()
195 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In fact it is not far off the numexpr methods in @Divakar's answer (185/163 µs on my machine).
