I have a large 1D NumPy array with over 100 million elements and am applying np.unique to it:
import numpy as np
x = np.random.randint(0,10000, size=100_000_000)
_, index = np.unique(x, return_inverse=True)
What I actually need is the index that is returned from np.unique but I do not need the unique array at all (i.e., it is throwaway). Since, in my real use case, I need to call np.unique many times on different arrays (all with the same length), this becomes the bottleneck. I'm guessing that a lot of the time is spent on sorting the unique array.
What is the fastest way to obtain the index for a large 1D array (it may be over a billion elements in length)?
Is there a parallelized option?
Here's a way with array-assignment + masking + indexing trickery, specific to the case where the input array x holds only non-negative integers -
def return_inverse_only(x, maxnum=None):
    if maxnum is None:
        maxnum = x.max()+1 # Determines extent of indexing array
    p = np.zeros(maxnum, dtype=bool)
    p[x] = 1
    p2 = np.empty(maxnum, dtype=np.uint64)
    c = p.sum()
    p2[p] = np.arange(c)
    out = p2[x]
    return out
If the maximum number in the input array is known beforehand, pass that number plus one as maxnum to boost performance further.
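As a quick sanity check, the sketch below compares the function's output with the inverse returned by np.unique on a small input. It assumes, as stated above, that x holds only non-negative integers; the seed, value range and array size are arbitrary choices for illustration.

```python
import numpy as np

def return_inverse_only(x, maxnum=None):
    if maxnum is None:
        maxnum = x.max() + 1          # extent of the lookup table
    p = np.zeros(maxnum, dtype=bool)
    p[x] = True                       # mark every value that occurs in x
    p2 = np.empty(maxnum, dtype=np.uint64)
    p2[p] = np.arange(p.sum())        # rank of each occurring value, in sorted order
    return p2[x]                      # look up the rank of every element

# Illustrative data (assumption: small enough to compare against np.unique quickly)
rng = np.random.default_rng(0)
x = rng.integers(0, 50, size=1_000)

_, expected = np.unique(x, return_inverse=True)
inverse = return_inverse_only(x)
print(np.array_equal(inverse, expected))
```

Note that the lookup table has maxnum entries, so memory use grows with the largest value in x rather than with the number of distinct values.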
Timings on large arrays -
In [146]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=100000)
In [147]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
10.9 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
539 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
446 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [148]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=1000000)
In [149]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
149 ms ± 5.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
6.1 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.3 ms ± 504 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [150]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=10000000)
In [151]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
1.88 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
67.9 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
55.8 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Roughly a 20x to 30x+ speedup!
I need help in understanding how the %timeit function works in the two programs.
Program A
a = [1,3,2,4,1,4,2]
%timeit [val + 5 for val in a]
830 ns ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Program B
import numpy as np
a = np.array([1,3,2,4,1,4,2])
%timeit [a+5]
1.07 µs ± 23.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
My confusion:
µs is bigger than ns. How does the NumPy function execute slower than for loop here?
1.07 µs ± 23.7 ns per loop... why is the loop speed calculated in ns and not in µs?
NumPy adds overhead, and that overhead dominates on small datasets. Vectorization is mostly useful on large datasets.
Try it on larger inputs:
N = 10_000_000
a = list(range(N))
%timeit [val + 5 for val in a]
import numpy as np
a = np.arange(N)
%timeit a+5
Output:
1.51 s ± 318 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
55.8 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
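The same comparison can be sketched outside IPython with the standard timeit module. Printing both results in nanoseconds also makes the unit question easier to see: 1.07 µs is 1070 ns, which is larger than 830 ns. The helper name time_per_call, the sizes and the repeat counts below are illustrative assumptions, and the absolute numbers will vary by machine.

```python
import timeit
import numpy as np

def time_per_call(stmt, env, number):
    """Average seconds per call over `number` executions of `stmt`."""
    return timeit.timeit(stmt, globals=env, number=number) / number

# (array length, number of executions) pairs chosen only for illustration
for n, number in [(7, 10_000), (100_000, 20)]:
    lst = list(range(n))
    arr = np.arange(n)
    env = {"lst": lst, "arr": arr}
    t_list = time_per_call("[val + 5 for val in lst]", env, number)
    t_numpy = time_per_call("arr + 5", env, number)
    # Report both in the same unit so they can be compared directly
    print(f"n={n}: list {t_list * 1e9:.0f} ns, numpy {t_numpy * 1e9:.0f} ns")
```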
I've noticed that np.unique might get slower in some cases when True is not passed to the return_index parameter.
a = np.ones(shape = (1000, 50), dtype=int)
a[:,-7:] = [10000, -4750, -4750, 95, 95, 95, 95]
arr = np.cumsum(a.ravel())
%timeit np.unique(arr)
%timeit np.unique(arr, return_index=True)
%timeit np.unique(arr, return_index=True, return_inverse=True)
%timeit np.unique(arr, return_index=True, return_inverse=True, return_counts=True)
1.14 ms ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
711 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
955 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.3 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
It doesn't occur usually with other kinds of data. What is happening here?
The difference occurs because the data in your example is sorted. unique sorts the data, and when return_index=True, a stable merge-sort algorithm is used. When merge sort is applied to data that is already sorted, the algorithm makes just one pass through the data, so it is very fast.
For example, in the following, arr is an array of nondecreasing values:
In [10]: arr = np.random.randint(0, 3, size=50000).cumsum()
In [11]: arr
Out[11]: array([ 1, 3, 4, ..., 49892, 49892, 49894])
The default sort algorithm takes almost 8 times as long as the merge-sort:
In [12]: %timeit np.sort(arr)
386 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [13]: %timeit np.sort(arr, kind='mergesort')
49.5 µs ± 708 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can see the code that ends up doing the actual work of finding the unique values here: https://github.com/numpy/numpy/blob/6ff787b93d46cca6d31c370cfd9543ed573a98fc/numpy/lib/arraysetops.py#L320-L361
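The sort-based approach described above can be sketched roughly as follows. This is a simplified illustration rather than NumPy's actual implementation, and unique_with_index is a hypothetical helper name: sort stably, mark the first element of each run of equal values, then read off the values and their original positions.

```python
import numpy as np

def unique_with_index(ar):
    # Hypothetical helper: a simplified sketch of a sort-based unique
    perm = ar.argsort(kind="mergesort")    # stable sort keeps first occurrences first
    aux = ar[perm]
    mask = np.empty(aux.shape, dtype=bool)
    mask[:1] = True                        # the first sorted element starts a run
    mask[1:] = aux[1:] != aux[:-1]         # True wherever the value changes
    return aux[mask], perm[mask]

rng = np.random.default_rng(0)
arr = rng.integers(0, 3, size=50_000).cumsum()   # nondecreasing, as in the example

values, first_index = unique_with_index(arr)
ref_values, ref_index = np.unique(arr, return_index=True)
print(np.array_equal(values, ref_values), np.array_equal(first_index, ref_index))
```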
arr = np.round(np.random.rand(50000) * 50).astype('int')
%timeit np.unique(arr)
%timeit np.unique(arr, return_index=True)
%timeit np.unique(arr, return_index=True, return_inverse=True)
%timeit np.unique(arr, return_index=True, return_inverse=True, return_counts=True)
274 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
352 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
397 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
414 µs ± 555 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Data in your array are all unique. Random data with duplicates didn't reproduce this behavior.
My best guess is NumPy has some optimization for near-all-unique arrays when return_index is specified.
Suppose I have a 1d array a where from each element I would like to have a range of which the size is stored in ranges:
a = np.array([10,9,12])
ranges = np.array([2,4,3])
The desired output would be:
np.array([10,11,9,10,11,12,12,13,14])
I could of course use a for loop, but I prefer a fully vectorized approach. np.repeat allows one to repeat the elements in a a number of times by setting repeats=, but I am not aware of a similar numpy function particularly dealing with the problem above.
>>> np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
array([10, 11, 9, 10, 11, 12, 12, 13, 14])
With pandas it could be easier:
>>> import pandas as pd
>>> x = pd.Series(np.repeat(a, ranges))
>>> x + x.groupby(x).cumcount()
0 10
1 11
2 9
3 10
4 11
5 12
6 12
7 13
8 14
dtype: int64
>>>
If you want a numpy array:
>>> x.add(x.groupby(x).cumcount()).to_numpy()
array([10, 11, 9, 10, 11, 12, 12, 13, 14], dtype=int64)
>>>
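For a NumPy-only alternative without a Python-level loop, one option is to combine np.repeat with a running offset. This is a sketch of a different technique from the two answers above, and expand_ranges is a hypothetical helper name. It assumes ranges holds non-negative integers, and it does not depend on the values in a being distinct.

```python
import numpy as np

def expand_ranges(a, ranges):
    # Position in the output where each block starts: 0, r0, r0+r1, ...
    starts = np.cumsum(ranges) - ranges
    # Offset of every output element within its own block: 0, 1, ..., size-1
    within = np.arange(ranges.sum()) - np.repeat(starts, ranges)
    return np.repeat(a, ranges) + within

a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])
result = expand_ranges(a, ranges)
print(result)
```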
Someone asked about timing, so I compared the times of the three solutions (so far) in a very simple manner, using the %timeit magic function in Jupyter notebook cells.
I set it up as follows:
N = 1
a = np.array([10,9,12])
a = np.tile(a, N)
ranges = np.array([2,4,3])
ranges = np.tile(ranges, N)
a.shape, ranges.shape
This lets me scale the inputs easily (albeit with repeated rather than random values).
Then I ran:
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
,
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
and
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Results are as follows:
N = 1:
9.81 µs ± 481 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
568 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.53 µs ± 81.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
N = 10:
63.4 µs ± 976 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
575 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
25.1 µs ± 698 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
N = 100:
612 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
608 µs ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
237 µs ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 1000:
6.09 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
852 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.44 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the Pandas solution wins when things get to arrays of 1000 elements or more, but the Python double list comprehension does an excellent job until that point. np.hstack probably loses out because of extra memory allocation and copying, but that's a guess. Note also that the Pandas solution is nearly the same time for each array size.
Caveats still exist because there are repeated numbers, and all values are relatively small integers. The repetition does matter for the pandas solution: x.groupby(x).cumcount() keeps counting across every block that shares a starting value, so with tiled inputs (N > 1) its output no longer matches that of the other two. (Pandas groupby functionality may also be fast because of the repeated numbers.)
Bonus: the OP stated in a comment that "The real life arrays are around 1000 elements, yet with ranges ranging from 100 to 1000. So becomes quite big – pr94".
So I adjusted my timing test to the following:
import numpy as np
import pandas as pd
N = 1000
a = np.random.randint(100, 1000, N)
# This is how I understand "ranges ranging from 100 to 1000"
ranges = np.random.randint(100, 1000, N)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Which comes out as :
hstack: 2.78 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pandas: 18.4 ms ± 663 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
double list comprehension: 64.1 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Which shows that those caveats I mentioned, in some form at least, do seem to exist. But people should double check whether this testing code is actually the most relevant and appropriate, and whether it is correct.
This problem is probably going to be solved much faster with a Numba-compiled function:
import numba as nb

@nb.jit
def expand_range(values, counts):
    n = len(values)
    m = np.sum(counts)
    r = np.zeros((m,), dtype=values.dtype)
    k = 0
    for i in range(n):
        x = values[i]
        for j in range(counts[i]):
            r[k] = x + j
            k += 1
    return r
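A minimal usage sketch on the question's example follows. The try/except fallback to an undecorated function is an assumption added here so the snippet still runs where Numba is not installed. It uses nb.njit and np.empty in place of nb.jit and np.zeros, which should not change the result.

```python
import numpy as np

try:
    import numba as nb
    jit = nb.njit
except ImportError:
    # Assumption for illustration: without Numba, run the plain Python loop
    def jit(func):
        return func

@jit
def expand_range(values, counts):
    m = np.sum(counts)
    r = np.empty(m, dtype=values.dtype)   # every slot is written below
    k = 0
    for i in range(len(values)):
        for j in range(counts[i]):
            r[k] = values[i] + j
            k += 1
    return r

a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])
out = expand_range(a, ranges)
print(out)
```

With Numba, the first call includes compilation time, so timings should be taken from later calls.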
On the very small inputs:
%timeit expand_range(a, ranges)
# 1.16 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and on somewhat larger inputs:
b = np.random.randint(0, 1000, 1000)
b_ranges = np.random.randint(1, 10, 1000)
%timeit expand_range(b, b_ranges)
# 5.07 µs ± 98.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
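These show the Numba-based approach winning on both input sizes. (The three comparison timings in the second block were run on a and ranges rather than b and b_ranges, so they repeat the small-input figures; on those figures the gain ranges from roughly 10x to over 500x.)
With numbers closer to what has been indicated in one of the comments by the OP: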
b = np.random.randint(10, 1000, 1000)
b_ranges = np.random.randint(100, 1000, 1000)
%timeit expand_range(b, b_ranges)
# 1.5 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000
%timeit x = pd.Series(np.repeat(b, b_ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 91.8 ms ± 6.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(b, b_ranges)])
# 10.7 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.array([i for j in range(len(b)) for i in range(b[j],b[j]+b_ranges[j])])
# 144 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
which is still at least a respectable 7x over the others.
I'm doing some matrix calculations (2d) that only involve values in the upper triangle of the matrices.
So far I've found that using Numpy's triu method ("return a copy of a matrix with the elements below the k-th diagonal zeroed") works and is quite fast. But presumably, the calculations are still being carried out for the whole matrix, including unnecessary calculations on the zeros. Or are they?...
Here is an example of what I tried first:
# Initialize vars
N = 160
u = np.empty(N)
u[0] = 1000
u[1:] = np.cumprod(np.full(N-1, 1/2**(1/16)))*1000
m = np.random.random(N)
def method1():
    # Prepare matrices with values only in upper triangle
    ones_ut = np.triu(np.ones((N, N)))
    u_ut = np.triu(np.broadcast_to(u, (N, N)))
    m_ut = np.triu(np.broadcast_to(m, (N, N)))
    # Do calculation
    return (ones_ut - np.divide(u_ut, u.reshape(N, 1)))**3*m_ut
Then I realized I only need to zero-out the final result matrix:
def method2():
    return np.triu((np.ones((N, N)) - np.divide(u, u.reshape(N, 1)))**3*m)
assert np.array_equal(method1(), method2())
But to my surprise, this was slower.
In [62]: %timeit method1()
662 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [63]: %timeit method2()
836 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Does numpy do some kind of special optimization when it knows the matrices contain half zeros?
I'm curious about why it is slower but actually my main question is, is there a way to speed up vectorized calculations by taking account of the fact that you are not interested in half the values in the matrix?
UPDATE
I tried just doing the calculations over 3 of the quadrants of the matrices but it didn't achieve any speed increase over method 1:
def method4():
    split = N//2
    x = np.zeros((N, N))
    u_mat = 1 - u/u.reshape(N, 1)
    x[:split, :] = u_mat[:split,:]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return np.triu(x)
assert np.array_equal(method1(), method4())
In [86]: %timeit method4()
683 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But this is faster than method 2.
We should simplify things to rely on broadcasting and create as few intermediate arrays as possible. That way, we end up computing the final output directly from u and m, like so -
np.triu((1-u/u.reshape(N, 1))**3*m)
Then, we could leverage the numexpr module, which performs noticeably better on transcendental operations, as is the case here, and is also very memory efficient. Ported to numexpr, it would be -
import numexpr as ne
np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
Bring in the masking part within the evaluate method for further perf. boost -
M = np.tri(N,dtype=bool)
ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)})
Timings on given dataset -
In [25]: %timeit method1()
1000 loops, best of 3: 521 µs per loop
In [26]: %timeit method2()
1000 loops, best of 3: 417 µs per loop
In [27]: %timeit np.triu((1-u/u.reshape(N, 1))**3*m)
1000 loops, best of 3: 408 µs per loop
In [28]: %timeit np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
10000 loops, best of 3: 159 µs per loop
In [29]: %timeit ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1),'M':np.tri(N,dtype=bool)})
10000 loops, best of 3: 110 µs per loop
Note that another way to extend u to a 2D version would be with np.newaxis/None and this would be the idiomatic way. Hence, u.reshape(N, 1) could be replaced by u[:,None]. This shouldn't change the timings though.
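A small sketch on the question's data, checking that u[:, None] and u.reshape(N, 1) give the same column vector and that the simplified expression agrees with method1 from the question. The seed for m is an arbitrary choice, and the names reference and simplified are illustrative.

```python
import numpy as np

N = 160
u = np.empty(N)
u[0] = 1000
u[1:] = np.cumprod(np.full(N - 1, 1 / 2 ** (1 / 16))) * 1000
m = np.random.default_rng(0).random(N)   # arbitrary seed for reproducibility

col_reshape = u.reshape(N, 1)
col_newaxis = u[:, None]                 # same shape (N, 1), same values

# method1 from the question, written inline
ones_ut = np.triu(np.ones((N, N)))
u_ut = np.triu(np.broadcast_to(u, (N, N)))
m_ut = np.triu(np.broadcast_to(m, (N, N)))
reference = (ones_ut - u_ut / col_reshape) ** 3 * m_ut

# Simplified form: broadcasting u (row) against u[:, None] (column)
simplified = np.triu((1 - u / col_newaxis) ** 3 * m)

print(col_newaxis.shape, np.allclose(reference, simplified))
```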
Here is another solution that is faster in some cases but slower in some other cases.
idx = np.triu_indices(N)
def my_method():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result[idx] = t * t * t * m[idx[1]]
    return result
Here, the computation is done only for the elements in the (flattened) upper triangle. However, there is overhead in the 2D-index-based assignment operation result[idx] = .... So the method is faster when the overhead is less than the saved computations -- which happens when N is small or the computation is relatively complex (e.g., using t ** 3 instead of t * t * t).
Another variation of the method is to use 1D-index for the assignment operation, which can lead to a small speedup.
idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))
def my_method2():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result.ravel()[raveled_idx] = t * t * t * m[idx[1]]
    return result
Following is the result of performance tests. Note that idx and raveled_idx are fixed for each N and do not change with u and m (as long as their shapes remain unchanged). Hence their values can be precomputed and the times are excluded from the test.
(If you need to call these methods with matrices of many different sizes, there will be added overhead in the computations of idx and raveled_idx.) For the comparison, method4b, method5 and method6 cannot benefit much from any precomputation. For method_ne, the precomputation M = np.tri(N, dtype=bool) is also excluded from the test.
%timeit method4b()
%timeit method5()
%timeit method6()
%timeit method_ne()
%timeit my_method()
%timeit my_method2()
Result (for N = 160):
1.54 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
167 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
255 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
233 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
177 µs ± 907 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For N = 32:
89.9 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.2 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
28.6 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
14.3 µs ± 52.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For N = 1000:
70.7 ms ± 871 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
65.1 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
21.4 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.03 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.2 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.7 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using t ** 3 instead of t * t * t in my_method and my_method2 (N = 160):
1.53 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
156 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
235 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.4 ms ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.32 ms ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here, my_method and my_method2 outperform method4b and method5 a little bit.
I think the answer may be quite simple. Just put zeros in the cells that you don't want to calculate and the overall calculation will be faster. I think that might explain why method1() was faster than method2().
Here are some tests to illustrate the point.
In [29]: size = (160, 160)
In [30]: z = np.zeros(size)
In [31]: r = np.random.random(size) + 1
In [32]: t = np.triu(r)
In [33]: w = np.ones(size)
In [34]: %timeit z**3
177 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [35]: %timeit t**3
376 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [36]: %timeit r**3
572 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [37]: %timeit w**3
138 µs ± 548 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [38]: %timeit np.triu(r)**3
427 µs ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit np.triu(r**3)
625 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not sure how all this works at a low level but clearly, zero or one raised to a power takes much less time to compute than any other value.
Also interesting. With numexpr computation there is no difference.
In [42]: %timeit ne.evaluate("r**3")
79.2 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [43]: %timeit ne.evaluate("z**3")
79.3 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So, I think the fastest without using numexpr may be this way:
def method5():
    return np.triu(1 - u/u[:, None])**3*m
assert np.array_equal(method1(), method5())
In [65]: %timeit method1()
656 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [66]: %timeit method5()
587 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Or, if you are really chasing every micro-second:
def method4b():
    split = N//2
    x = np.zeros((N, N))
    u_mat = np.triu(1 - u/u.reshape(N, 1))
    x[:split, :] = u_mat[:split,:]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return x
assert np.array_equal(method1(), method4b())
In [71]: %timeit method4b()
543 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [72]: %timeit method4b()
533 µs ± 7.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And @Divakar's answer using numexpr is the fastest overall.
UPDATE
Thanks to @GZ0's comment, if you only need to raise to the power of 3, this is much faster:
def method6():
    a = np.triu(1 - u/u[:, None])
    return a*a*a*m
assert np.isclose(method1(), method6()).all()
(But there is a slight loss of precision I noticed).
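The precision remark can be explored with a small sketch that compares a**3 with a*a*a on random data. The array size and seed are arbitrary, and the exact size of the difference is platform-dependent, so no specific value is claimed here; the point is only that the two forms agree to within floating-point tolerance without necessarily being bit-identical.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10_000)        # illustrative values in [0, 1)

cubed_pow = a ** 3            # power operator
cubed_mul = a * a * a         # two multiplications

max_abs_diff = np.max(np.abs(cubed_pow - cubed_mul))
print(np.allclose(cubed_pow, cubed_mul), max_abs_diff)
```

This is why the assertion above uses np.isclose rather than np.array_equal.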
In [84]: %timeit method6()
195 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In fact it is not far off the numexpr methods in @Divakar's answer (185/163 µs on my machine).
For efficiency I want to calculate the sqrt of a tensor only for values that are below a threshold.
In numpy, for example, I have
import numpy as np
x = np.random.random(size=int(10e6))
%timeit np.sqrt(x)
-> 10 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If I use a mask
x_m = x[x < 1e-3]
%timeit np.sqrt(x_m)
-> 8.94 µs ± 20.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The calculation is faster, as expected, as numpy seems to calculate the sqrt only for the elements x < 1e-3.
In Tensorflow, however, I cannot make this work:
import tensorflow as tf
tf.InteractiveSession()
x_tf = tf.constant(x)
%timeit tf.sqrt(x_tf).eval()
-> 314 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I now try to use a boolean_mask
mask = tf.boolean_mask(x_tf, x_tf < 1e-3)
%timeit tf.sqrt(mask).eval()
-> 341 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
there is no speed-up like in the numpy version. It seems like the sqrt in Tensorflow is still computed for all values of the original Tensor x_tf.
Is there a way to run operations (like the sqrt) only on the masked values? Or, alternatively, extract a shorter tensor from the masked tensor?
There are two problems with your measures:
You are not counting the comparison and the boolean masking in NumPy.
You are creating new graph nodes on each timing trial in TensorFlow.
These should be more representative timings:
import numpy as np
import tensorflow as tf
np.random.seed(0)
x = np.random.random(size=int(10e6))
%timeit np.sqrt(x)
# 20.4 ms ± 581 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.sqrt(x[x < 1e-3])
# 9.96 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
with tf.Graph().as_default(), tf.Session():
    x_tf = tf.constant(x)
    x_tf_sqrt = tf.sqrt(x_tf)
    %timeit x_tf_sqrt.eval()
    # 16.8 ms ± 685 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    mask = tf.boolean_mask(x_tf, x_tf < 1e-3)
    mask_sqrt = tf.sqrt(mask)
    %timeit mask_sqrt.eval()
    # 103 µs ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
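The first point, about the NumPy measurement, can also be sketched with the standard timeit module by timing the square root once on a pre-masked array and once with the comparison and masking included. The array here is smaller than the one above to keep the sketch quick, and the repeat count is an arbitrary assumption; absolute numbers will vary by machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000_000)     # smaller than the 10e6 elements used above
x_m = x[x < 1e-3]             # mask applied once, outside the timed statement

env = {"np": np, "x": x, "x_m": x_m}
number = 20                   # illustrative repeat count

# Only the sqrt of the already-extracted elements
t_sqrt_only = timeit.timeit("np.sqrt(x_m)", globals=env, number=number) / number
# Comparison + boolean indexing + sqrt, all inside the timed statement
t_with_mask = timeit.timeit("np.sqrt(x[x < 1e-3])", globals=env, number=number) / number

print(f"sqrt only: {t_sqrt_only * 1e6:.1f} µs, with masking: {t_with_mask * 1e6:.1f} µs")
```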