How would you write a less computationally intensive equivalent of numpy.where(np.ones(shape))?

I want to get a list of elements for an array of a given shape.
I found one easy way to do that:
import numpy as np
shape = (3,3)
elements = np.where(np.ones(shape))
the result is
>>> elements
(array([0, 0, 0, 1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2, 0, 1, 2]))
This is the expected behaviour. However, it doesn't seem to be the most compute-efficient way. If shape is huge, then np.where can be quite sluggish. I am looking for a more compute-efficient solution. Any ideas?

Based on the comments I have received, I implemented 3 ways to get the same result and tested their performance.
import timeit
import numpy as np
def with_where(a):
    shape = a.shape
    return np.where(np.ones(shape))
def with_mgrid(a):
    shape = a.shape
    grid_shape = (len(shape), np.prod(shape))
    return np.mgrid[0:shape[0],0:shape[1]].reshape(grid_shape)
def with_repeat(a):
    shape = a.shape
    # return the (row indices, column indices) pair
    return np.repeat(np.arange(shape[0]), shape[1]), np.tile(np.arange(shape[1]), shape[0])
a1 = np.ones((1,1))
a10 = np.ones((10,10))
a100 = np.ones((100,100))
a1000 = np.ones((1000,1000))
a10000 = np.ones((10000,10000))
Then I ran %timeit in ipython
%timeit with_where(a1)
%timeit with_where(a10)
%timeit with_where(a100)
%timeit with_where(a1000)
%timeit with_where(a10000)
11.1 µs ± 163 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.7 µs ± 39.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
146 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
16.2 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.49 s ± 58.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit with_mgrid(a1)
%timeit with_mgrid(a10)
%timeit with_mgrid(a100)
%timeit with_mgrid(a1000)
%timeit with_mgrid(a10000)
50.2 µs ± 2.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
45.9 µs ± 989 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
75.1 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
6.17 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit with_repeat(a1)
%timeit with_repeat(a10)
%timeit with_repeat(a100)
%timeit with_repeat(a1000)
%timeit with_repeat(a10000)
23.3 µs ± 931 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
31 µs ± 739 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
66 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4.41 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.05 s ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So for large arrays, the np.where method is about 2x slower than the fastest method. This is not as bad as I thought.
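For shapes with any number of dimensions, one option that avoids building and scanning a throwaway array of ones is np.indices. The sketch below is only an illustration, not a benchmarked recommendation. The helper name all_indices is made up for this example, and timings will depend on the NumPy version and the machine.

```python
import numpy as np

def all_indices(shape):
    """Return one index array per axis, in the same C order as np.where(np.ones(shape))."""
    # np.indices builds an array of shape (ndim, *shape); flattening the
    # trailing axes leaves one row of indices per dimension.
    grid = np.indices(shape).reshape(len(shape), -1)
    return tuple(grid)

shape = (3, 3)
rows, cols = all_indices(shape)
print(rows.tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(cols.tolist())  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

Unlike the with_mgrid and with_repeat variants, this form is not tied to 2D shapes.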

Related

How is Numba faster than NumPy for matrix multiplication with integers?

I was comparing parallel matrix multiplication with numba and matrix multiplication with numpy when I noticed that numpy isn't as fast with integers (int32).
import numpy as np
from numba import njit, prange
@njit()
def matrix_multiplication(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, k] += A[i, j] * B[j, k]
    return C
@njit(parallel=True, fastmath=True)
def matrix_multiplication_parallel(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in prange(m):
        for j in range(n):
            for k in range(p):
                C[i, k] += A[i, j] * B[j, k]
    return C
m = 100
n = 1000
p = 1500
A = np.random.randn(m, n)
B = np.random.randn(n, p)
A2 = np.random.randint(1, 100, size=(m, n))
B2 = np.random.randint(1, 100, size=(n, p))
A3 = np.ones((m, n))
B3 = np.ones((n, p))
# compile function
matrix_multiplication(A, B)
matrix_multiplication_parallel(A, B)
print('normal')
%timeit matrix_multiplication(A, B)
%timeit matrix_multiplication(A2, B2)
%timeit matrix_multiplication(A3, B3)
print('parallel')
%timeit matrix_multiplication_parallel(A, B)
%timeit matrix_multiplication_parallel(A2, B2)
%timeit matrix_multiplication_parallel(A3, B3)
print('numpy')
%timeit A @ B
%timeit A2 @ B2
%timeit A3 @ B3
normal
1.51 s ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.56 s* ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.5 s ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
333 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
408 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
313 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
31.2 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*1.99 s* ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
28.4 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I found this answer explaining that numpy doesn't use BLAS for integers.
From what I understand, both numpy and numba make use of vectorization. I wonder what could be different in the implementations for a relatively consistent 25% increase in performance.
I tried reversing the order of operations in case less CPU resources were available towards the end. I made sure to not do anything while the program was running.
numpy
35.1 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*1.97 s* ± 44.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
32 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
normal
1.48 s ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.46 s* ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.47 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
379 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
461 ms ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
381 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Trying the method in the answer doesn't really help.
import inspect
inspect.getmodule(matrix_multiplication)
<module '__main__'>
I tried it on Google Colab.
normal
2.28 s ± 407 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.7 s* ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.6 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
1.33 s ± 315 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.66 s ± 425 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.34 s ± 327 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
64.9 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*2.14 s* ± 477 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
64.1 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is possible to print the generated code, but I don't know how it can be compared to the numpy code.
for v, k in matrix_multiplication.inspect_llvm().items():
print(v, k)
Going to the definition of np.matmul leads to matmul: _GUFunc_Nin2_Nout1[L['matmul'], L[19], None] in ".../site-packages/numpy/__init__.pyi".
I think this is the C method being called because of the name "no BLAS". The code seems equivalent to mine, except for additional if statements.
For small arrays m = n = p = 10, numpy is faster.
normal
6.6 µs ± 99.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.72 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.57 µs ± 62.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
parallel
63.5 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
64.5 µs ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
63.3 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
numpy
1.94 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.53 µs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
1.91 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
m=10000 instead of 1000
normal
14.4 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.3 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.7 s ± 538 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
3.34 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.46 s ± 78.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
334 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
19.4 s ± 655 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
248 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You are comparing two different loop patterns. The pattern equivalent to the NumPy implementation looks like the following.
def matrix_multiplication(A, B):
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for k in range(p):
            for j in range(n):
                C[i, k] += A[i, j] * B[j, k]
    return C
Test with m, n, p = 1000, 1500, 100 using your original test code, and you will see that NumPy is faster than Numba.
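Both loop orders compute the same product; only the memory access pattern differs. With C-contiguous arrays, the i-j-k order from the question reads B[j, :] and writes C[i, :] contiguously in its innermost loop, while the i-k-j order walks down a column of B. The sketch below checks the equivalence on tiny arrays in plain Python. It is a correctness check, not a benchmark, and the function names are made up for this example.

```python
import numpy as np

def matmul_ijk(A, B):
    """Loop order from the question: innermost loop runs along rows of B and C."""
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, k] += A[i, j] * B[j, k]
    return C

def matmul_ikj(A, B):
    """Loop order from the answer: innermost loop is a dot product down a column of B."""
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for k in range(p):
            for j in range(n):
                C[i, k] += A[i, j] * B[j, k]
    return C

rng = np.random.default_rng(0)
A = rng.integers(1, 100, size=(4, 6))
B = rng.integers(1, 100, size=(6, 5))

# Small integer products sum exactly in float64, so exact comparison is safe here.
print(np.array_equal(matmul_ijk(A, B), A @ B))  # True
print(np.array_equal(matmul_ikj(A, B), A @ B))  # True
```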

Why is NumPy's `repmat` faster than `kron` to repeat blocks of arrays?

Why is the repmat function from numpy.matlib so much faster than numpy.kron (i.e., Kronecker Product) to repeat blocks of matrices?
A MWE would be:
import numpy as np
import numpy.matlib
from numpy import kron, ones
test_N = 1000
test_vec = np.random.rand(test_N, 2)
rep_vec = np.matlib.repmat(test_vec, 100, 1)
kron_vec = kron(ones((100,1)), test_vec)
%%timeit
rep_vec = np.matlib.repmat(test_vec, 10, 1)
53.5 µs ± 2.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
kron_vec = kron(test_vec, ones((10,1)))
1.65 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Some of my own timings:
In [361]: timeit kron_vec = np.kron(np.ones((10,1)), test_vec)
131 µs ± 871 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Your
kron_vec = kron(test_vec, ones((10,1)))
1.65 ms
looks more like a ones((100,1)) time test.
Mine is longer than others, but not as drastically so.
A similar multiplication approach (like the outer of kron, but no need for the concatenate step):
In [362]: timeit (test_vec*np.ones((10,1,1))).reshape(-1,2)
61.2 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
repmat:
In [363]: timeit rep_vec = matlib.repmat(test_vec,10,1)
94.2 µs ± 32.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using tile instead:
In [364]: timeit np.tile(test_vec,(10,1))
20.4 µs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and repeat directly:
In [365]: timeit x = test_vec[None,:,:].repeat(10,0).reshape(-1,2)
12.2 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Range of all elements in numpy array

Suppose I have a 1d array a where from each element I would like to have a range of which the size is stored in ranges:
a = np.array([10,9,12])
ranges = np.array([2,4,3])
The desired output would be:
np.array([10,11,9,10,11,12,12,13,14])
I could of course use a for loop, but I prefer a fully vectorized approach. np.repeat allows one to repeat the elements in a a number of times by setting repeats=, but I am not aware of a similar numpy function particularly dealing with the problem above.

>>> np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
array([10, 11, 9, 10, 11, 12, 12, 13, 14])

With pandas it could be easier:
>>> import pandas as pd
>>> x = pd.Series(np.repeat(a, ranges))
>>> x + x.groupby(x).cumcount()
0 10
1 11
2 9
3 10
4 11
5 12
6 12
7 13
8 14
dtype: int64
>>>
If you want a numpy array:
>>> x.add(x.groupby(x).cumcount()).to_numpy()
array([10, 11, 9, 10, 11, 12, 12, 13, 14], dtype=int64)
>>>
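The same "repeat, then add a per-group counter" idea can be written with NumPy alone. The sketch below is one possible formulation, with a made-up function name. It builds the counter from the position of each range in the output. Unlike grouping on the repeated values themselves, it keeps ranges apart even when a contains the same start value more than once.

```python
import numpy as np

def expand_ranges(starts, sizes):
    """Concatenate arange(s, s + n) for each (s, n) pair without a Python loop."""
    starts = np.asarray(starts)
    sizes = np.asarray(sizes)
    total = sizes.sum()
    # Position in the flat output where each range begins.
    first = np.cumsum(sizes) - sizes
    # Counter that restarts at 0 at the beginning of every range.
    offsets = np.arange(total) - np.repeat(first, sizes)
    return np.repeat(starts, sizes) + offsets

a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])
print(expand_ranges(a, ranges).tolist())  # [10, 11, 9, 10, 11, 12, 12, 13, 14]
```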

Someone asked about timing, so I compared the times of the three solutions (so far) in a very simple manner, using the %timeit magic function in Jupyter notebook cells.
I set it up as follows:
N = 1
a = np.array([10,9,12])
a = np.tile(a, N)
ranges = np.array([2,4,3])
ranges = np.tile(ranges, N)
a.shape, ranges.shape
So I could easily scale (albeit things not random, but repeated).
Then I ran:
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
,
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
and
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Results are as follows:
N = 1:
9.81 µs ± 481 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
568 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.53 µs ± 81.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
N = 10:
63.4 µs ± 976 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
575 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
25.1 µs ± 698 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
N = 100:
612 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
608 µs ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
237 µs ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 1000:
6.09 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
852 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.44 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the Pandas solution wins when things get to arrays of 1000 elements or more, but the Python double list comprehension does an excellent job until that point. np.hstack probably loses out because of extra memory allocation and copying, but that's a guess. Note also that the Pandas solution is nearly the same time for each array size.
Caveats still exist because there are repeated numbers, and all values are relatively small integers. This really shouldn't matter, but I'm not (yet) betting on it. (For example, Pandas groupby functionality may be fast because of the repeated numbers.)
Bonus: the OP stated in a comment that "The real life arrays are around 1000 elements, yet with ranges ranging from 100 to 1000. So becomes quite big – pr94".
So I adjusted my timing test to the following:
import numpy as np
import pandas as pd
N = 1000
a = np.random.randint(100, 1000, N)
# This is how I understand "ranges ranging from 100 to 1000"
ranges = np.random.randint(100, 1000, N)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Which comes out as :
hstack: 2.78 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pandas: 18.4 ms ± 663 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
double list comprehension: 64.1 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Which shows that those caveats I mentioned, in some form at least, do seem to exist. But people should double check whether this testing code is actually the most relevant and appropriate, and whether it is correct.

This problem is probably going to be solved much faster with a Numba-compiled function:
import numba as nb
import numpy as np
@nb.jit
def expand_range(values, counts):
    n = len(values)
    m = np.sum(counts)
    r = np.zeros((m,), dtype=values.dtype)
    k = 0
    for i in range(n):
        x = values[i]
        for j in range(counts[i]):
            r[k] = x + j
            k += 1
    return r
On the very small inputs:
%timeit expand_range(a, ranges)
# 1.16 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and on somewhat larger inputs:
b = np.random.randint(0, 1000, 1000)
b_ranges = np.random.randint(1, 10, 1000)
%timeit expand_range(b, b_ranges)
# 5.07 µs ± 98.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
These show that the Numba-based approach wins, with a speed gain of at least 100x over any of the other approaches proposed so far.
With numbers closer to what has been indicated in one of the comments by the OP:
b = np.random.randint(10, 1000, 1000)
b_ranges = np.random.randint(100, 1000, 1000)
%timeit expand_range(b, b_ranges)
# 1.5 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000
%timeit x = pd.Series(np.repeat(b, b_ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 91.8 ms ± 6.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(b, b_ranges)])
# 10.7 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.array([i for j in range(len(b)) for i in range(b[j],b[j]+b_ranges[j])])
# 144 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
which is still at least a respectable 7x over the others.
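Before trusting timings like these, it is worth checking that the compiled loop returns the same array as the np.hstack version. The sketch below does that on random inputs. The fallback decorator is an assumption added for this example so that the check still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    import numba as nb
    jit = nb.njit
except ImportError:
    # Assumption for this sketch: without Numba, run the same loop in plain Python.
    def jit(func):
        return func

@jit
def expand_range(values, counts):
    n = len(values)
    m = np.sum(counts)
    r = np.zeros((m,), dtype=values.dtype)
    k = 0
    for i in range(n):
        x = values[i]
        for j in range(counts[i]):
            r[k] = x + j
            k += 1
    return r

rng = np.random.default_rng(0)
b = rng.integers(10, 1000, 200)
b_ranges = rng.integers(1, 20, 200)

reference = np.hstack([np.arange(s, s + c) for s, c in zip(b, b_ranges)])
result = expand_range(b, b_ranges)
print(np.array_equal(result, reference))  # True
```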

Is there a way to speed up Numpy array calculations when they only contain values in upper/lower triangle?

I'm doing some matrix calculations (2d) that only involve values in the upper triangle of the matrices.
So far I've found that using Numpy's triu method ("return a copy of a matrix with the elements below the k-th diagonal zeroed") works and is quite fast. But presumably, the calculations are still being carried out for the whole matrix, including unnecessary calculations on the zeros. Or are they?...
Here is an example of what I tried first:
import numpy as np
# Initialize vars
N = 160
u = np.empty(N)
u[0] = 1000
u[1:] = np.cumprod(np.full(N-1, 1/2**(1/16)))*1000
m = np.random.random(N)
def method1():
    # Prepare matrices with values only in upper triangle
    ones_ut = np.triu(np.ones((N, N)))
    u_ut = np.triu(np.broadcast_to(u, (N, N)))
    m_ut = np.triu(np.broadcast_to(m, (N, N)))
    # Do calculation
    return (ones_ut - np.divide(u_ut, u.reshape(N, 1)))**3*m_ut
Then I realized I only need to zero-out the final result matrix:
def method2():
    return np.triu((np.ones((N, N)) - np.divide(u, u.reshape(N, 1)))**3*m)
assert np.array_equal(method1(), method2())
But to my surprise, this was slower.
In [62]: %timeit method1()
662 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [63]: %timeit method2()
836 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Does numpy do some kind of special optimization when it knows the matrices contain half zeros?
I'm curious about why it is slower but actually my main question is, is there a way to speed up vectorized calculations by taking account of the fact that you are not interested in half the values in the matrix?
UPDATE
I tried just doing the calculations over 3 of the quadrants of the matrices but it didn't achieve any speed increase over method 1:
def method4():
    split = N//2
    x = np.zeros((N, N))
    u_mat = 1 - u/u.reshape(N, 1)
    x[:split, :] = u_mat[:split,:]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return np.triu(x)
assert np.array_equal(method1(), method4())
In [86]: %timeit method4()
683 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But this is faster than method 2.

We can simplify this by letting broadcasting do the work in as few places as possible. That gives the final output directly from u and m, like so -
np.triu((1-u/u.reshape(N, 1))**3*m)
Then, we could leverage numexpr module that performs noticeably better when working with transcendental operations as is the case here and also is very memory efficient. So, upon porting to numexpr version, it would be -
import numexpr as ne
np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
Bring in the masking part within the evaluate method for further perf. boost -
M = np.tri(N,dtype=bool)
ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)})
Timings on given dataset -
In [25]: %timeit method1()
1000 loops, best of 3: 521 µs per loop
In [26]: %timeit method2()
1000 loops, best of 3: 417 µs per loop
In [27]: %timeit np.triu((1-u/u.reshape(N, 1))**3*m)
1000 loops, best of 3: 408 µs per loop
In [28]: %timeit np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
10000 loops, best of 3: 159 µs per loop
In [29]: %timeit ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1),'M':np.tri(N,dtype=bool)})
10000 loops, best of 3: 110 µs per loop
Note that another way to extend u to a 2D version would be with np.newaxis/None and this would be the idiomatic way. Hence, u.reshape(N, 1) could be replaced by u[:,None]. This shouldn't change the timings though.
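For readers without numexpr installed, the NumPy-only part of this answer can be checked on its own. The sketch below rebuilds the question's data, with a seeded generator standing in for np.random.random. It then confirms that the broadcast one-liner matches the three-matrix version and that u[:, None] is the same column vector as u.reshape(N, 1). The function names are made up for this example.

```python
import numpy as np

N = 160
u = np.empty(N)
u[0] = 1000
u[1:] = np.cumprod(np.full(N - 1, 1 / 2 ** (1 / 16))) * 1000
rng = np.random.default_rng(0)
m = rng.random(N)

def reference():
    # Three explicit upper-triangular matrices, as in the question's method1.
    ones_ut = np.triu(np.ones((N, N)))
    u_ut = np.triu(np.broadcast_to(u, (N, N)))
    m_ut = np.triu(np.broadcast_to(m, (N, N)))
    return (ones_ut - u_ut / u.reshape(N, 1)) ** 3 * m_ut

def simplified():
    # Broadcasting u (row) against u[:, None] (column) builds the N x N ratio
    # matrix directly; m broadcasts along the rows.
    return np.triu((1 - u / u[:, None]) ** 3 * m)

print(np.array_equal(u.reshape(N, 1), u[:, None]))  # True
print(np.allclose(reference(), simplified()))       # True
```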

Here is another solution that is faster in some cases but slower in some other cases.
idx = np.triu_indices(N)
def my_method():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result[idx] = t * t * t * m[idx[1]]
    return result
Here, the computation is done only for the elements in the (flattened) upper triangle. However, there is overhead in the 2D-index-based assignment operation result[idx] = .... So the method is faster when the overhead is less than the saved computations -- which happens when N is small or the computation is relatively complex (e.g., using t ** 3 instead of t * t * t).
Another variation of the method is to use 1D-index for the assignment operation, which can lead to a small speedup.
idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))
def my_method2():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result.ravel()[raveled_idx] = t * t * t * m[idx[1]]
    return result
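To make the indexing concrete, here is what the two index forms look like for a 3 x 3 matrix. This is a small illustration with made-up fill values, not part of the timed code.

```python
import numpy as np

N = 3
rows, cols = np.triu_indices(N)
print(rows.tolist())  # [0, 0, 0, 1, 1, 2]
print(cols.tolist())  # [0, 1, 2, 1, 2, 2]

# The same positions as offsets into the flattened (row-major) matrix.
flat = np.ravel_multi_index((rows, cols), (N, N))
print(flat.tolist())  # [0, 1, 2, 4, 5, 8]

result = np.zeros((N, N))
# ravel() returns a view here because result is C-contiguous, so the
# assignment writes through to result.
result.ravel()[flat] = np.arange(1, flat.size + 1)
print(result.tolist())  # [[1.0, 2.0, 3.0], [0.0, 4.0, 5.0], [0.0, 0.0, 6.0]]
```

The write-through relies on ravel() returning a view. For a non-contiguous array, ravel() returns a copy and the assignment would be lost.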
Following are the results of the performance tests. Note that idx and raveled_idx are fixed for each N and do not change with u and m (as long as their shapes remain unchanged). Hence their values can be precomputed, and the time to compute them is excluded from the test.
(If you need to call these methods with matrices of many different sizes, there will be added overhead in the computations of idx and raveled_idx.) For the comparison, method4b, method5 and method6 cannot benefit much from any precomputation. For method_ne, the precomputation M = np.tri(N, dtype=bool) is also excluded from the test.
%timeit method4b()
%timeit method5()
%timeit method6()
%timeit method_ne()
%timeit my_method()
%timeit my_method2()
Result (for N = 160):
1.54 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
167 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
255 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
233 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
177 µs ± 907 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For N = 32:
89.9 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.2 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
28.6 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
14.3 µs ± 52.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For N = 1000:
70.7 ms ± 871 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
65.1 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
21.4 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.03 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.2 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.7 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using t ** 3 instead of t * t * t in my_method and my_method2 (N = 160):
1.53 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
156 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
235 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.4 ms ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.32 ms ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here, my_method and my_method2 outperform method4b and method5 a little bit.

I think the answer may be quite simple. Just put zeros in the cells that you don't want to calculate and the overall calculation will be faster. I think that might explain why method1() was faster than method2().
Here are some tests to illustrate the point.
In [29]: size = (160, 160)
In [30]: z = np.zeros(size)
In [31]: r = np.random.random(size) + 1
In [32]: t = np.triu(r)
In [33]: w = np.ones(size)
In [34]: %timeit z**3
177 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [35]: %timeit t**3
376 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [36]: %timeit r**3
572 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [37]: %timeit w**3
138 µs ± 548 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [38]: %timeit np.triu(r)**3
427 µs ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit np.triu(r**3)
625 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not sure how all this works at a low level but clearly, zero or one raised to a power takes much less time to compute than any other value.
Also interesting. With numexpr computation there is no difference.
In [42]: %timeit ne.evaluate("r**3")
79.2 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [43]: %timeit ne.evaluate("z**3")
79.3 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So, I think the fastest without using numexpr may be this way:
def method5():
    return np.triu(1 - u/u[:, None])**3*m
assert np.array_equal(method1(), method5())
In [65]: %timeit method1()
656 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [66]: %timeit method5()
587 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Or, if you are really chasing every micro-second:
def method4b():
    split = N//2
    x = np.zeros((N, N))
    u_mat = np.triu(1 - u/u.reshape(N, 1))
    x[:split, :] = u_mat[:split,:]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return x
assert np.array_equal(method1(), method4b())
In [71]: %timeit method4b()
543 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [72]: %timeit method4b()
533 µs ± 7.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And @Divakar's answer using numexpr is the fastest overall.
UPDATE
Thanks to @GZ0's comment, if you only need to raise to the power of 3, this is much faster:
def method6():
    a = np.triu(1 - u/u[:, None])
    return a*a*a*m
assert np.isclose(method1(), method6()).all()
(But there is a slight loss of precision I noticed).
In [84]: %timeit method6()
195 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In fact it is not far off the numexpr methods in @Divakar's answer (185/163 µs on my machine).

Python Speedup np.unique

I am looking to speed up the following piece of code:
NNlist=[np.unique(i) for i in NNlist]
where NNlist is a list of np.arrays with duplicated entries.
Thanks :)

numpy.unique is already pretty optimized, so you're not likely to get much of a speedup over what you already have unless you know something else about the underlying data. For example, if the data is all small integers you might be able to use numpy.bincount, or if the unique values in each of the arrays are mostly the same there might be some optimization that could be done over the whole list of arrays.
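As an illustration of the numpy.bincount idea, the sketch below assumes every array holds small non-negative integers. The helper name is made up, and whether it actually beats np.unique depends on the value range and array sizes, so it should be timed on real data.

```python
import numpy as np

def unique_small_ints(arr):
    """Sorted unique values of a 1-D array of non-negative integers."""
    # bincount allocates max(arr) + 1 counters, so this only pays off when
    # the largest value is small relative to the array length.
    counts = np.bincount(arr)
    # Values that occurred at least once are the indices with a non-zero count.
    return np.flatnonzero(counts)

rng = np.random.default_rng(0)
NNlist = [rng.integers(0, 100, 1000) for _ in range(5)]
NNlist_unique = [unique_small_ints(arr) for arr in NNlist]

print(all(np.array_equal(u, np.unique(a)) for u, a in zip(NNlist_unique, NNlist)))  # True
```

Note that the result has NumPy's index dtype rather than the dtype of the input array.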

pandas.unique() is much faster than numpy.unique(). The Pandas version does not sort the result, but you can do that yourself and it will still be much faster if the result is much smaller than the input (i.e. there are a lot of duplicate values):
np.sort(pd.unique(arr))
Timings:
In [1]: x = np.random.randint(10, 20, 50000000)
In [2]: %timeit np.sort(pd.unique(x))
201 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit np.unique(x)
1.49 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I also took a look at list(set()) and strings in a list, between pandas Series and python lists.
data = np.random.randint(0,10,100)
data_hex = [str(hex(n)) for n in data] # just some simple strings
sample1 = pd.Series(data, name='data')
sample2 = data.tolist()
sample3 = pd.Series(data_hex, name='data')
sample4 = data_hex
And then the benchmarks:
%timeit np.unique(sample1) # 16.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample2) # 15.9 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample3) # 45.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.unique(sample4) # 20.6 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample1) # 60.3 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample2) # 196 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample3) # 79.7 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample4) # 214 µs ± 61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each
%timeit list(set(sample1)) # 16.3 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample2)) # 1.64 µs ± 83.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit list(set(sample3)) # 17.8 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample4)) # 2.48 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The takeaway is:
Starting with a Pandas Series with integers? Go with either np.unique() or list(set())
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
However, if N=1,000,000 instead, the results are different.
%timeit np.unique(sample1) # 26.5 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample2) # 98.1 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(sample3) # 1.31 s ± 78.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample4) # 174 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.unique(sample1) # 10.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.unique(sample2) # 99.3 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample3) # 46.4 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample4) # 113 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample1)) # 25.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample2)) # 11.2 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(set(sample3)) # 37.1 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample4)) # 20.2 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Starting with a Pandas Series with integers? Go with pd.unique()
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())

Here are some benchmarks:
In [72]: ar_list = [np.random.randint(0, 100, 1000) for _ in range(100)]
In [73]: %timeit map(np.unique, ar_list)
100 loops, best of 3: 4.9 ms per loop
In [74]: %timeit [np.unique(ar) for ar in ar_list]
100 loops, best of 3: 4.9 ms per loop
In [75]: %timeit [pd.unique(ar) for ar in ar_list] # using pandas
100 loops, best of 3: 2.25 ms per loop
So pandas.unique seems to be faster than numpy.unique. However, the docstring mentions that the values are "not necessarily sorted", which (partly) explains why it is faster.
Using a list comprehension or map doesn't give a difference in this example.

numpy.unique() is based on sorting (quicksort), and pandas.unique() is based on a hash table. Normally, the latter is faster according to my benchmarks. Both are already heavily optimized.
For some special cases, you can optimize further.
For example, if the data is already sorted, you can skip the sorting step:
# ar is already sorted
# this segment is from source code of numpy
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = ar[1:] != ar[:-1]
ret = ar[mask]
I ran into a similar problem and wrote my own unique function, because pandas.unique doesn't support the return_counts option. It is a fast implementation, but it only supports integer arrays. You can check out the source code here.
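The sorted-input shortcut can also produce counts, since the positions where the mask is True mark the start of each run. The sketch below is one way to do that. It is not the implementation linked above, and the function name is made up for this example.

```python
import numpy as np

def unique_sorted_with_counts(ar):
    """Unique values and their counts for an already sorted 1-D array."""
    ar = np.asarray(ar)
    mask = np.empty(ar.shape, dtype=np.bool_)
    mask[:1] = True
    mask[1:] = ar[1:] != ar[:-1]
    # Each True marks the first element of a run; the distance to the next
    # run start (or to the end of the array) is the run length.
    starts = np.flatnonzero(mask)
    counts = np.diff(np.append(starts, ar.size))
    return ar[mask], counts

values, counts = unique_sorted_with_counts(np.array([1, 1, 2, 5, 5, 5]))
print(values.tolist(), counts.tolist())  # [1, 2, 5] [2, 1, 3]
```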
