How is Numba faster than NumPy for matrix multiplication with integers? - python

I was comparing parallel matrix multiplication with numba and matrix multiplication with numpy when I noticed that numpy isn't as fast with integers (int32).
import numpy as np
from numba import njit, prange
@njit()
def matrix_multiplication(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, k] += A[i, j] * B[j, k]
    return C
@njit(parallel=True, fastmath=True)
def matrix_multiplication_parallel(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in prange(m):
        for j in range(n):
            for k in range(p):
                C[i, k] += A[i, j] * B[j, k]
    return C
m = 100
n = 1000
p = 1500
A = np.random.randn(m, n)
B = np.random.randn(n, p)
A2 = np.random.randint(1, 100, size=(m, n))
B2 = np.random.randint(1, 100, size=(n, p))
A3 = np.ones((m, n))
B3 = np.ones((n, p))
# compile function
matrix_multiplication(A, B)
matrix_multiplication_parallel(A, B)
print('normal')
%timeit matrix_multiplication(A, B)
%timeit matrix_multiplication(A2, B2)
%timeit matrix_multiplication(A3, B3)
print('parallel')
%timeit matrix_multiplication_parallel(A, B)
%timeit matrix_multiplication_parallel(A2, B2)
%timeit matrix_multiplication_parallel(A3, B3)
print('numpy')
%timeit A @ B
%timeit A2 @ B2
%timeit A3 @ B3
normal
1.51 s ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.56 s* ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.5 s ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
333 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
408 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
313 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
31.2 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*1.99 s* ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
28.4 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I found this answer explaining that numpy doesn't use BLAS for integers.
From what I understand, both numpy and numba make use of vectorization. I wonder what could differ between the two implementations to produce a relatively consistent 25% performance gap.
I tried reversing the order of the measurements in case fewer CPU resources were available towards the end. I made sure not to do anything else while the program was running.
numpy
35.1 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*1.97 s* ± 44.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
32 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
normal
1.48 s ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.46 s* ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.47 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
379 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
461 ms ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
381 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Trying the method in the answer doesn't really help.
import inspect
inspect.getmodule(matrix_multiplication)
<module '__main__'>
I tried it on Google Colab.
normal
2.28 s ± 407 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.7 s* ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.6 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
1.33 s ± 315 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.66 s ± 425 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.34 s ± 327 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
64.9 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*2.14 s* ± 477 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
64.1 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is possible to print the generated code, but I don't know how to compare it to the numpy code.
for v, k in matrix_multiplication.inspect_llvm().items():
    print(v, k)
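For a higher-level view than the raw LLVM IR, Numba dispatchers also expose inspect_types() and inspect_asm(); a minimal sketch (assuming matrix_multiplication has already been compiled for the signature of interest):
# Show Numba's inferred types per source line (helps spot unintended dtype promotions)
matrix_multiplication.inspect_types()
# Dump the final machine assembly for each compiled signature
for sig, asm in matrix_multiplication.inspect_asm().items():
    print(sig)
    print(asm[:1000])  # the full listing is long; print only a prefix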
Going to the definition of np.matmul leads to matmul: _GUFunc_Nin2_Nout1[L['matmul'], L[19], None] in ".../site-packages/numpy/__init__.pyi".
I think this is the C method being called, given the "no BLAS" reference in its name. The code seems equivalent to mine, except for additional if statements.
For small arrays m = n = p = 10, numpy is faster.
normal
6.6 µs ± 99.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.72 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.57 µs ± 62.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
parallel
63.5 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
64.5 µs ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
63.3 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
numpy
1.94 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.53 µs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
1.91 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
n=10000 instead of 1000
normal
14.4 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.3 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.7 s ± 538 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
3.34 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.46 s ± 78.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
334 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
19.4 s ± 655 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
248 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You are comparing two different loop patterns. The pattern equivalent to the NumPy implementation would be the following:
def matrix_multiplication(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for k in range(p):
            for j in range(n):
                C[i, k] += A[i, j] * B[j, k]
    return C
Test with m, n, p = 1000, 1500, 100 using your original test code, and you will see that NumPy is faster than Numba.
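For completeness, a minimal sketch of how the two loop orders could be timed head to head under @njit (the names and warm-up call are illustrative, not from the original post):
import numpy as np
from numba import njit

@njit
def matmul_ijk(A, B):
    # OP's order: the innermost loop walks along a row of C and a row of B
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, k] += A[i, j] * B[j, k]
    return C

@njit
def matmul_ikj(A, B):
    # the reduction index j innermost, matching the non-BLAS NumPy loop above
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for k in range(p):
            for j in range(n):
                C[i, k] += A[i, j] * B[j, k]
    return C

m, n, p = 1000, 1500, 100
A = np.random.randn(m, n)
B = np.random.randn(n, p)
matmul_ijk(A, B); matmul_ikj(A, B)  # warm up: trigger compilation before timing
%timeit matmul_ijk(A, B)
%timeit matmul_ikj(A, B)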

Related

Why is NumPy's `repmat` faster than `kron` to repeat blocks of arrays?

Why is the repmat function from numpy.matlib so much faster than numpy.kron (i.e., the Kronecker product) for repeating blocks of matrices?
A MWE would be:
import numpy as np
import numpy.matlib
from numpy import kron, ones

test_N = 1000
test_vec = np.random.rand(test_N, 2)
rep_vec = np.matlib.repmat(test_vec, 100, 1)
kron_vec = kron(ones((100, 1)), test_vec)
%%timeit
rep_vec = np.matlib.repmat(test_vec, 10, 1)
53.5 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
kron_vec = kron(test_vec, ones((10,1)))
1.65 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Some of my own timings:
In [361]: timeit kron_vec = np.kron(np.ones((10,1)), test_vec)
131 µs ± 871 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Your kron_vec = kron(test_vec, ones((10,1))) at 1.65 ms looks more like a ones((100,1)) timing.
Mine is slower than the others, but not as drastically so.
A similar multiplication approach (like the outer of kron, but no need for the concatenate step):
In [362]: timeit (test_vec*np.ones((10,1,1))).reshape(-1,2)
61.2 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
repmat:
In [363]: timeit rep_vec = matlib.repmat(test_vec,10,1)
94.2 µs ± 32.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using tile instead:
In [364]: timeit np.tile(test_vec,(10,1))
20.4 µs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and repeat directly:
In [365]: timeit x = test_vec[None,:,:].repeat(10,0).reshape(-1,2)
12.2 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

How would you write a less computationally intensive equivalent of numpy.where(np.ones(shape))?

I want to get a list of elements for an array of a given shape.
I found one easy way to do that:
import numpy as np
shape = (3,3)
elements = np.where(np.ones(shape))
the result is
>>> elements
(array([0, 0, 0, 1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2, 0, 1, 2]))
This is the expected behaviour. However, it doesn't seem to be the most compute-efficient way. If shape is huge, np.where can be quite sluggish. I am looking for a more compute-efficient solution. Any ideas?
Based on the comments I have received, I implemented 3 ways to get the same result and tested their performance.
import timeit
import numpy as np
def with_where(a):
    shape = a.shape
    return np.where(np.ones(shape))

def with_mgrid(a):
    shape = a.shape
    grid_shape = (len(shape), np.prod(shape))
    return np.mgrid[0:shape[0], 0:shape[1]].reshape(grid_shape)

def with_repeat(a):
    shape = a.shape
    return np.repeat(np.arange(shape[0]), shape[1]), np.tile(np.arange(shape[1]), shape[0])
a1 = np.ones((1,1))
a10 = np.ones((10,10))
a100 = np.ones((100,100))
a1000 = np.ones((1000,1000))
a10000 = np.ones((10000,10000))
Then I ran %timeit in IPython:
%timeit with_where(a1)
%timeit with_where(a10)
%timeit with_where(a100)
%timeit with_where(a1000)
%timeit with_where(a10000)
11.1 µs ± 163 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.7 µs ± 39.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
146 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16.2 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.49 s ± 58.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit with_mgrid(a1)
%timeit with_mgrid(a10)
%timeit with_mgrid(a100)
%timeit with_mgrid(a1000)
%timeit with_mgrid(a10000)
50.2 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
45.9 µs ± 989 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
75.1 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
6.17 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit with_repeat(a1)
%timeit with_repeat(a10)
%timeit with_repeat(a100)
%timeit with_repeat(a1000)
%timeit with_repeat(a10000)
23.3 µs ± 931 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
31 µs ± 739 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
66 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4.41 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.05 s ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So for large arrays, the method with np.where is about 2x as slow as the fastest method. That is not as bad as I thought.
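For what it's worth, another way to produce the same tuple without materializing the ones array or scanning for nonzeros is np.unravel_index over a flat arange; a sketch (it generalizes to any number of dimensions):
import numpy as np

def with_unravel(a):
    # map flat indices 0..size-1 back to per-axis coordinates (C order, like np.where)
    return np.unravel_index(np.arange(a.size), a.shape)

a = np.ones((3, 3))
assert all(np.array_equal(x, y) for x, y in zip(with_unravel(a), np.where(a)))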

Is there a way to speed up Numpy array calculations when they only contain values in upper/lower triangle?

I'm doing some matrix calculations (2d) that only involve values in the upper triangle of the matrices.
So far I've found that using Numpy's triu method ("return a copy of a matrix with the elements below the k-th diagonal zeroed") works and is quite fast. But presumably, the calculations are still being carried out for the whole matrix, including unnecessary calculations on the zeros. Or are they?...
Here is an example of what I tried first:
import numpy as np

# Initialize vars
N = 160
u = np.empty(N)
u[0] = 1000
u[1:] = np.cumprod(np.full(N-1, 1/2**(1/16)))*1000
m = np.random.random(N)

def method1():
    # Prepare matrices with values only in the upper triangle
    ones_ut = np.triu(np.ones((N, N)))
    u_ut = np.triu(np.broadcast_to(u, (N, N)))
    m_ut = np.triu(np.broadcast_to(m, (N, N)))
    # Do calculation
    return (ones_ut - np.divide(u_ut, u.reshape(N, 1)))**3*m_ut
Then I realized I only need to zero-out the final result matrix:
def method2():
    return np.triu((np.ones((N, N)) - np.divide(u, u.reshape(N, 1)))**3*m)
assert np.array_equal(method1(), method2())
But to my surprise, this was slower.
In [62]: %timeit method1()
662 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [63]: %timeit method2()
836 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Does numpy do some kind of special optimization when it knows the matrices contain half zeros?
I'm curious about why it is slower but actually my main question is, is there a way to speed up vectorized calculations by taking account of the fact that you are not interested in half the values in the matrix?
UPDATE
I tried just doing the calculations over 3 of the quadrants of the matrices but it didn't achieve any speed increase over method 1:
def method4():
    split = N//2
    x = np.zeros((N, N))
    u_mat = 1 - u/u.reshape(N, 1)
    x[:split, :] = u_mat[:split, :]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return np.triu(x)
assert np.array_equal(method1(), method4())
In [86]: %timeit method4()
683 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But this is faster than method 2.
We can simplify things to leverage broadcasting in as few places as possible. We then end up with something like this to get the final output directly from u and m:
np.triu((1-u/u.reshape(N, 1))**3*m)
Then, we can leverage the numexpr module, which performs noticeably better with transcendental operations (as is the case here) and is also very memory efficient. Ported to numexpr, it would be:
import numexpr as ne
np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
Bringing the masking inside the evaluate call gives a further performance boost:
M = np.tri(N,dtype=bool)
ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)})
Timings on given dataset -
In [25]: %timeit method1()
1000 loops, best of 3: 521 µs per loop
In [26]: %timeit method2()
1000 loops, best of 3: 417 µs per loop
In [27]: %timeit np.triu((1-u/u.reshape(N, 1))**3*m)
1000 loops, best of 3: 408 µs per loop
In [28]: %timeit np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
10000 loops, best of 3: 159 µs per loop
In [29]: %timeit ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1),'M':np.tri(N,dtype=bool)})
10000 loops, best of 3: 110 µs per loop
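As a sanity check (a sketch reusing u, m and N from the question), the masked numexpr variant can be verified against the plain NumPy expression:
import numpy as np
import numexpr as ne

ref = np.triu((1 - u/u.reshape(N, 1))**3*m)
M = np.tri(N, dtype=bool)
out = ne.evaluate('(1-M)*(1-u/u2D)**3*m', {'u2D': u.reshape(N, 1), 'M': M, 'u': u, 'm': m})
assert np.allclose(ref, out)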
Note that another way to extend u to a 2D version is with np.newaxis/None, which is the idiomatic way; hence, u.reshape(N, 1) could be replaced by u[:, None]. This shouldn't change the timings, though.
Here is another solution that is faster in some cases and slower in others.
idx = np.triu_indices(N)

def my_method():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result[idx] = t * t * t * m[idx[1]]
    return result
Here, the computation is done only for the elements in the (flattened) upper triangle. However, there is overhead in the 2D-index-based assignment operation result[idx] = .... So the method is faster when the overhead is less than the saved computations -- which happens when N is small or the computation is relatively complex (e.g., using t ** 3 instead of t * t * t).
Another variation of the method is to use 1D-index for the assignment operation, which can lead to a small speedup.
idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))

def my_method2():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result.ravel()[raveled_idx] = t * t * t * m[idx[1]]
    return result
Following are the results of the performance tests. Note that idx and raveled_idx are fixed for each N and do not change with u and m (as long as their shapes remain unchanged), so their values can be precomputed and their times are excluded from the tests.
(If you need to call these methods with matrices of many different sizes, there will be added overhead in the computation of idx and raveled_idx.) For comparison, method4b, method5 and method6 cannot benefit much from any precomputation. For method_ne, the precomputation M = np.tri(N, dtype=bool) is also excluded from the test.
%timeit method4b()
%timeit method5()
%timeit method6()
%timeit method_ne()
%timeit my_method()
%timeit my_method2()
Result (for N = 160):
1.54 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
167 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
255 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
233 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
177 µs ± 907 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For N = 32:
89.9 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.2 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
28.6 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
14.3 µs ± 52.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For N = 1000:
70.7 ms ± 871 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
65.1 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
21.4 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.03 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.2 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.7 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using t ** 3 instead of t * t * t in my_method and my_method2 (N = 160):
1.53 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
156 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
235 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.4 ms ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.32 ms ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here, my_method and my_method2 outperform method4b and method5 a little bit.
I think the answer may be quite simple. Just put zeros in the cells that you don't want to calculate and the overall calculation will be faster. I think that might explain why method1() was faster than method2().
Here are some tests to illustrate the point.
In [29]: size = (160, 160)
In [30]: z = np.zeros(size)
In [31]: r = np.random.random(size) + 1
In [32]: t = np.triu(r)
In [33]: w = np.ones(size)
In [34]: %timeit z**3
177 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [35]: %timeit t**3
376 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [36]: %timeit r**3
572 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [37]: %timeit w**3
138 µs ± 548 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [38]: %timeit np.triu(r)**3
427 µs ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit np.triu(r**3)
625 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not sure how all this works at a low level, but clearly zero or one raised to a power takes much less time to compute than any other value.
Also interesting: with numexpr there is no such difference.
In [42]: %timeit ne.evaluate("r**3")
79.2 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [43]: %timeit ne.evaluate("z**3")
79.3 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So, I think the fastest way without using numexpr may be this:
def method5():
    return np.triu(1 - u/u[:, None])**3*m

assert np.array_equal(method1(), method5())
In [65]: %timeit method1()
656 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [66]: %timeit method5()
587 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Or, if you are really chasing every micro-second:
def method4b():
    split = N//2
    x = np.zeros((N, N))
    u_mat = np.triu(1 - u/u.reshape(N, 1))
    x[:split, :] = u_mat[:split, :]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return x
assert np.array_equal(method1(), method4b())
In [71]: %timeit method4b()
543 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [72]: %timeit method4b()
533 µs ± 7.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And @Divakar's answer using numexpr is the fastest overall.
UPDATE
Thanks to @GZ0's comment: if you only need to raise to the power of 3, this is much faster:
def method6():
    a = np.triu(1 - u/u[:, None])
    return a*a*a*m
assert np.isclose(method1(), method6()).all()
(But I noticed a slight loss of precision, hence np.isclose rather than np.array_equal.)
In [84]: %timeit method6()
195 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In fact, it is not far off the numexpr methods in @Divakar's answer (185/163 µs on my machine).

Why does a dictionary count in some cases faster than collections.Counter?

I needed a solution for extracting the unique elements of a non-unique list along with the counts of their duplicates.
The purpose was to use it in an algorithm for creating unique combinations from a non-unique list. The list sizes involved are usually very small (fewer than 50 elements), but my goal was to find the overall fastest code, optimizing whenever and wherever possible (even if it gains only a tiny amount of running time).
Python's collections module provides collections.Counter, which is specialized for exactly this purpose, but there are apparently cases in which a plain dictionary leads to faster code than collections.Counter, as you can check yourself with the code below:
from time import time as t
from timeit import timeit as tt
from collections import Counter
def counter(iterable):
    dctCounter = {}
    for item in iterable:
        if item in dctCounter:
            dctCounter[item] += 1
        else:
            dctCounter[item] = 1
    return dctCounter

for n, N in [(1,10), (10,1), (1,50), (50,1), (1,100), (100,1), (1,200), (200,1), (1,500), (500,1), (1,1000), (1000,1)]:
    lstItems = n*list(range(N))
    for noLoops in [10**p for p in range(5, 6)]:
        s = t()
        for _ in range(noLoops):
            dctCounter = counter(lstItems)
        e = t()
        timeDctFctn = e - s
        s = t()
        for _ in range(noLoops):
            objCounter = Counter(lstItems)
        e = t()
        timeCollCtr = e - s
        timeitCollCtr = tt("objCounter=Counter(lstItems)", "from __main__ import Counter, lstItems", number=noLoops)
        timeitDctFctn = tt("dctCounter=counter(lstItems)", "from __main__ import counter, lstItems", number=noLoops)
        print("collections.Counter(): {:7.5f}, def counter(): {:7.5f} sec. lstSize: {:3}, %uniq: {:3.0f}, ({} timitLoops)".format(timeitCollCtr, timeitDctFctn, n*N, 100.0/n, noLoops))
Here is the output:
python3.6 -u "collections.Counter-vs-dictionaryAsCounter_Cg.py"
collections.Counter(): 0.36461, def counter(): 0.09592 sec. lstSize: 10, %uniq: 100, (100000 timitLoops)
collections.Counter(): 0.36444, def counter(): 0.12286 sec. lstSize: 10, %uniq: 10, (100000 timitLoops)
collections.Counter(): 0.58627, def counter(): 0.43233 sec. lstSize: 50, %uniq: 100, (100000 timitLoops)
collections.Counter(): 0.52399, def counter(): 0.54106 sec. lstSize: 50, %uniq: 2, (100000 timitLoops)
collections.Counter(): 0.82332, def counter(): 0.81436 sec. lstSize: 100, %uniq: 100, (100000 timitLoops)
collections.Counter(): 0.72513, def counter(): 1.06823 sec. lstSize: 100, %uniq: 1, (100000 timitLoops)
collections.Counter(): 1.27130, def counter(): 1.59476 sec. lstSize: 200, %uniq: 100, (100000 timitLoops)
collections.Counter(): 1.13817, def counter(): 2.14566 sec. lstSize: 200, %uniq: 0, (100000 timitLoops)
collections.Counter(): 3.16287, def counter(): 4.26738 sec. lstSize: 500, %uniq: 100, (100000 timitLoops)
collections.Counter(): 2.64247, def counter(): 5.67448 sec. lstSize: 500, %uniq: 0, (100000 timitLoops)
collections.Counter(): 4.89153, def counter(): 7.68661 sec. lstSize:1000, %uniq: 100, (100000 timitLoops)
collections.Counter(): 6.06389, def counter():13.92613 sec. lstSize:1000, %uniq: 0, (100000 timitLoops)
P.S.: It seems that collections.Counter() falls short of expectations in other contexts too. See here: https://stackoverflow.com/questions/41594940/why-is-collections-counter-so-slow
Counter has one major bottleneck when you count short iterables: it checks isinstance(iterable, Mapping). This test is rather slow because collections.abc.Mapping is an abstract metaclass, so the isinstance check is a bit more complicated than ordinary isinstance checks; see for example "Why is checking isinstance(something, Mapping) so slow?"
So it isn't really surprising that other approaches are faster for short iterables. However, for long iterables the check doesn't matter much, and Counter should be faster (at least on Python 3 (CPython), where the actual counting function _count_elements is written in C).
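You can see the cost of that abstract-class check in isolation; a rough sketch (absolute numbers will vary by machine and Python version):
from collections.abc import Mapping

lst = list(range(10))
%timeit isinstance(lst, Mapping)  # routed through the ABC's __instancecheck__
%timeit isinstance(lst, dict)     # plain C-level type check, much cheaper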
An easy way to identify bottlenecks is profiling. I'll use line_profiler here:
%load_ext line_profiler
from collections import Counter
x = range(50)
# Profile the function Counter.update when executing the command "Counter(x)"
%lprun -f Counter.update Counter(x)
The result:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
604 1 8 8.0 3.9 if not args:
605 raise TypeError("descriptor 'update' of 'Counter' object "
606 "needs an argument")
607 1 13 13.0 6.4 self, *args = args
608 1 6 6.0 3.0 if len(args) > 1:
609 raise TypeError('expected at most 1 arguments, got %d' % len(args))
610 1 5 5.0 2.5 iterable = args[0] if args else None
611 1 3 3.0 1.5 if iterable is not None:
612 1 94 94.0 46.3 if isinstance(iterable, Mapping):
613 if self:
614 self_get = self.get
615 for elem, count in iterable.items():
616 self[elem] = count + self_get(elem, 0)
617 else:
618 super(Counter, self).update(iterable) # fast path when counter is empty
619 else:
620 1 69 69.0 34.0 _count_elements(self, iterable)
621 1 5 5.0 2.5 if kwds:
622 self.update(kwds)
So the time it takes to initialize a Counter from an iterable has a rather big constant factor (46% of the time is spent on the isinstance check, while building the dictionary with the counts takes only 34%).
However, for long iterables it doesn't matter (much), because the check is only done once:
%lprun -f Counter.update Counter([1]*100000)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
604 1 12 12.0 0.0 if not args:
605 raise TypeError("descriptor 'update' of 'Counter' object "
606 "needs an argument")
607 1 12 12.0 0.0 self, *args = args
608 1 6 6.0 0.0 if len(args) > 1:
609 raise TypeError('expected at most 1 arguments, got %d' % len(args))
610 1 6 6.0 0.0 iterable = args[0] if args else None
611 1 3 3.0 0.0 if iterable is not None:
612 1 97 97.0 0.3 if isinstance(iterable, Mapping):
613 if self:
614 self_get = self.get
615 for elem, count in iterable.items():
616 self[elem] = count + self_get(elem, 0)
617 else:
618 super(Counter, self).update(iterable) # fast path when counter is empty
619 else:
620 1 28114 28114.0 99.5 _count_elements(self, iterable)
621 1 13 13.0 0.0 if kwds:
622 self.update(kwds)
To give you an overview of how these perform depending on the number of elements, I included for comparison an optimized version of your count and the _count_elements function that Counter uses. I excluded the part where you sorted the items and created a list of the counts, to avoid other effects, especially since sorted has a different run-time behavior (O(n log n)) than counting (O(n)):
# Setup
import random
from collections import Counter, _count_elements

def count(iterable):
    """Explicit iteration over items."""
    dctCounter = {}
    for item in iterable:
        if item in dctCounter:
            dctCounter[item] += 1
        else:
            dctCounter[item] = 1
    return dctCounter

def count2(iterable):
    """Iterating over the indices."""
    dctCounter = {}
    lenLstItems = len(iterable)
    for idx in range(lenLstItems):
        item = iterable[idx]
        if item in dctCounter.keys():
            dctCounter[item] += 1
        else:
            dctCounter[item] = 1
    return dctCounter

def c_count(iterable):
    """Internal counting function that's used by Counter."""
    d = {}
    _count_elements(d, iterable)
    return d

# Timing
timings = {Counter: [], count: [], count2: [], c_count: []}
for i in range(1, 20):
    print(2**i)
    it = [random.randint(0, 2**i) for _ in range(2**i)]
    for func in (Counter, count, count2, c_count):
        res = %timeit -o func(it)
        timings[func].append(res)

# Plotting
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(1)
ax = plt.subplot(111)
n = 2**np.arange(1, 20)  # one point per size timed above (2**1 .. 2**19)
ax.plot(n,
        [time.average for time in timings[count]],
        label='my custom function', c='red')
ax.plot(n,
        [time.average for time in timings[count2]],
        label='your custom function', c='green')
ax.plot(n,
        [time.average for time in timings[Counter]],
        label='Counter', c='blue')
ax.plot(n,
        [time.average for time in timings[c_count]],
        label='_count_elements', c='purple')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('elements')
ax.set_ylabel('time to count them [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
And the result:
Individual timings:
2
30.5 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.67 µs ± 3.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
6.03 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.67 µs ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
4
30.7 µs ± 75.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
2.63 µs ± 25.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7.81 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.97 µs ± 5.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
8
34.3 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4.3 µs ± 16.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
11.3 µs ± 23.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
16
34.2 µs ± 599 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
7.46 µs ± 42 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
17.5 µs ± 83.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.24 µs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
32
38.4 µs ± 578 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
13.7 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
29.8 µs ± 383 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
7.56 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
64
43.5 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
24 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
52.8 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
11.6 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
128
53.5 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
47.8 µs ± 507 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
101 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.7 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256
69.6 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
92.1 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
188 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
39.5 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
512
123 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
200 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
409 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
90.9 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1024
230 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
428 µs ± 5.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
855 µs ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
193 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
2048
436 µs ± 7.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
868 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.76 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
386 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
4096
830 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.8 ms ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.75 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.06 ms ± 89.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
8192
2.3 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.8 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.8 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.69 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
16384
4.53 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.22 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.9 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.9 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32768
9.6 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
17.2 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.5 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
10.4 ms ± 687 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
65536
24.8 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
40.1 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
66.8 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
24.5 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
131072
54.6 ms ± 756 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
84.2 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
142 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
54.1 ms ± 424 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
262144
120 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
182 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
296 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
117 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
524288
244 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
368 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
601 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
252 ms ± 6.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python Speedup np.unique

I am looking to speed up the following piece of code:
NNlist=[np.unique(i) for i in NNlist]
where NNlist is a list of np.arrays with duplicated entries.
Thanks :)
numpy.unique is already pretty optimized; you're not likely to get much of a speedup over what you already have unless you know something else about the underlying data. For example, if the data are all small integers you might be able to use numpy.bincount, or if the unique values in each of the arrays are mostly the same, some optimization might be possible over the whole list of arrays.
pandas.unique() is much faster than numpy.unique(). The Pandas version does not sort the result, but you can do that yourself, and it will still be much faster if the result is much smaller than the input (i.e. there are a lot of duplicate values):
np.sort(pd.unique(arr))
Timings:
In [1]: x = np.random.randint(10, 20, 50000000)
In [2]: %timeit np.sort(pd.unique(x))
201 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit np.unique(x)
1.49 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
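A minimal self-contained version of that comparison (a sketch; the assertion just confirms the two routes agree):
import numpy as np
import pandas as pd

arr = np.random.randint(10, 20, 1_000_000)
u_np = np.unique(arr)            # sorted by construction
u_pd = np.sort(pd.unique(arr))   # pd.unique keeps first-appearance order, so sort afterwards
assert np.array_equal(u_np, u_pd)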
I also took a look at list(set()), and at strings in a list, comparing pandas Series against plain Python lists.
import numpy as np
import pandas as pd

data = np.random.randint(0, 10, 100)
data_hex = [str(hex(n)) for n in data]  # just some simple strings

sample1 = pd.Series(data, name='data')
sample2 = data.tolist()
sample3 = pd.Series(data_hex, name='data')
sample4 = data_hex
And then the benchmarks:
%timeit np.unique(sample1) # 16.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample2) # 15.9 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample3) # 45.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.unique(sample4) # 20.6 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample1) # 60.3 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample2) # 196 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample3) # 79.7 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample4) # 214 µs ± 61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(set(sample1)) # 16.3 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample2)) # 1.64 µs ± 83.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit list(set(sample3)) # 17.8 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample4)) # 2.48 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The takeaway is:
Starting with a Pandas Series with integers? Go with either np.unique() or list(set())
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
However, if N=1,000,000 instead, the results are different.
%timeit np.unique(sample1) # 26.5 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample2) # 98.1 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(sample3) # 1.31 s ± 78.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample4) # 174 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.unique(sample1) # 10.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.unique(sample2) # 99.3 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample3) # 46.4 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample4) # 113 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample1)) # 25.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample2)) # 11.2 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(set(sample3)) # 37.1 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample4)) # 20.2 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Starting with a Pandas Series with integers? Go with pd.unique()
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
Here are some benchmarks:
In [72]: ar_list = [np.random.randint(0, 100, 1000) for _ in range(100)]
In [73]: %timeit map(np.unique, ar_list)
100 loops, best of 3: 4.9 ms per loop
In [74]: %timeit [np.unique(ar) for ar in ar_list]
100 loops, best of 3: 4.9 ms per loop
In [75]: %timeit [pd.unique(ar) for ar in ar_list] # using pandas
100 loops, best of 3: 2.25 ms per loop
So pandas.unique seems to be faster than numpy.unique. However, the docstring mentions that the values are "not necessarily sorted", which (partly) explains why it is faster.
Using a list comprehension or map makes no difference in this example.
numpy.unique() is based on sorting (quicksort), while pandas.unique() is based on a hash table. Normally the latter is faster, according to my benchmarks. Both are already very optimized.
For some special cases, you can optimize further.
For example, if the data is already sorted, you can skip the sorting step:
# ar is already sorted
# this segment is from source code of numpy
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = ar[1:] != ar[:-1]
ret = ar[mask]
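Wrapped up as a reusable helper (a sketch, assuming the input really is sorted; the name unique_sorted is mine, not NumPy's):
import numpy as np

def unique_sorted(ar):
    # unique values of an already-sorted 1-D array, skipping the sort
    mask = np.empty(ar.shape, dtype=np.bool_)
    mask[:1] = True                  # always keep the first element
    mask[1:] = ar[1:] != ar[:-1]     # keep elements that differ from their predecessor
    return ar[mask]

ar = np.array([1, 1, 2, 3, 3, 3, 7])
assert np.array_equal(unique_sorted(ar), np.unique(ar))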
I met a similar problem to yours and wrote my own unique function for my use case, because pandas.unique doesn't support the return_counts option. It is a fast implementation, but it only supports integer arrays. You can check out the source code here.
