ufunc memory consumption in arithmetic expressions - python

What is the memory consumption of arithmetic numpy expressions, e.g.
vec ** 3 + vec ** 2 + vec
(vec being a numpy.ndarray)? Is a new array stored for each intermediate operation? Could such compound expressions require several times the memory of the underlying ndarray?

You are correct: a new array will be allocated for each intermediate result. Fortunately, the numexpr package is designed to deal with exactly this issue. From its description:
The main reason why NumExpr achieves better performance than NumPy is that it avoids allocating memory for intermediate results. This results in better cache utilization and reduces memory access in general. Due to this, NumExpr works best with large arrays.
Example:
In [97]: xs = np.random.rand(1_000_000)
In [98]: %timeit xs ** 3 + xs ** 2 + xs
26.8 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [99]: %timeit numexpr.evaluate('xs ** 3 + xs ** 2 + xs')
1.43 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Thanks to @max9111 for pointing out that numexpr simplifies power to multiplication. It seems that most of the discrepancy in the benchmark is explained by the optimization of xs ** 3.
In [421]: %timeit xs * xs
1.62 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [422]: %timeit xs ** 2
1.63 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [423]: %timeit xs ** 3
22.8 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [424]: %timeit xs * xs * xs
2.52 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Need help in understanding the loop speed with timeit function in python

I need help understanding how the %timeit function behaves in the two programs below.
Program A
a = [1,3,2,4,1,4,2]
%timeit [val + 5 for val in a]
830 ns ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Program B
import numpy as np
a = np.array([1,3,2,4,1,4,2])
%timeit [a+5]
1.07 µs ± 23.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
My confusion:
µs is bigger than ns. How does the NumPy version execute slower than the for loop here?
In 1.07 µs ± 23.7 ns per loop..., why is part of the timing reported in ns rather than µs?
NumPy adds per-call overhead, which dominates on small datasets; vectorization mostly pays off on large datasets. (As for the units: %timeit picks a readable unit for each figure separately; the mean here is about 1 µs while its standard deviation is only about 24 ns, so they are printed in different units.)
Try it with larger arrays:
N = 10_000_000
a = list(range(N))
%timeit [val + 5 for val in a]
import numpy as np
a = np.arange(N)
%timeit a+5
Output:
1.51 s ± 318 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
55.8 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
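The crossover point is machine-dependent; to find it on your own hardware, a quick sketch with the timeit module (sizes chosen arbitrarily) could look like this:
import timeit
import numpy as np

# Time a list comprehension vs. vectorized addition across sizes.
for n in (10, 100, 1_000, 100_000):
    lst = list(range(n))
    arr = np.arange(n)
    t_list = timeit.timeit(lambda: [v + 5 for v in lst], number=1000)
    t_arr = timeit.timeit(lambda: arr + 5, number=1000)
    print(f"n={n:>7}: list {t_list:.5f} s, numpy {t_arr:.5f} s")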

How can I parallelize np.matmul and np.multiply?

I have a question about matrix calculation using numpy. How can I parallelize calculations such as np.matmul and np.multiply? I cannot find any references describing how to compute np.matmul with parallelization.
def time_shift_R(V, R_1, I0, t):  # V is the potential function; it returns an array
    temp1 = V(xx, yy, t) + B*I0**2
    temp = P*np.matmul(M, I0) + Q*np.matmul(I0, M) - np.multiply(temp1, I0)
    R1 = (R_1 - dt*temp) / (1 - dt*B*R_1*I0)
    return R1
I appreciate your kind help in advance.
You might want to do some time tests to see what exactly is taking the most time. For example, on a rather modest machine with stock Ubuntu Linux:
Make a complex array (you didn't cite any sizes, so I'm just guessing at something reasonable):
In [60]: A = np.ones((1000,1000),complex)
multiply and the operator * are basically the same:
In [61]: timeit A*A
7.49 ms ± 4.99 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [62]: timeit np.multiply(A,A)
7.48 ms ± 6.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
matmul takes quite a bit longer, but then it's doing a lot more. A faster BLAS library might help here. Note that matmul is smart enough to use a specialized BLAS function for the transpose case. The @ operator is basically the same.
In [63]: timeit np.matmul(A,A)
381 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: timeit np.matmul(A,A.T)
231 ms ± 9.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and the power calc:
In [65]: timeit 10.0*A**2
14.4 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
The cost of V(xx, yy, t) is unknown.
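On the parallelization part of the question: np.matmul delegates to whatever BLAS library NumPy was linked against (OpenBLAS, MKL, ...), and it is that library, not NumPy, that decides how many threads to use. A sketch for checking and controlling this, assuming the third-party threadpoolctl package is installed:
import numpy as np

np.show_config()  # shows which BLAS/LAPACK NumPy was built against

# threadpoolctl (pip install threadpoolctl) can cap the BLAS thread pool,
# letting you compare single-threaded and threaded matmul:
from threadpoolctl import threadpool_limits

A = np.ones((1000, 1000), complex)
with threadpool_limits(limits=1):  # force single-threaded BLAS
    C_serial = A @ A
C_threaded = A @ A  # default: the BLAS picks the thread count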

Speed up numpy indexing of large array

I have a numpy array x of size (n, n, p) and I need to index it using a list m. I need to return two new arrays, of sizes (n, m, p) and (n, n-m, p). Both p and m are generally small (range 10 to 100), but n can be from 100 to 10000+.
When n is small, there is no issue. However when n gets large, these indexing operations take the majority of my function call time.
In my actual implementation, the indexing took 15 seconds, and the rest of the function was less than 1 sec.
I've tried regular indexing, np.delete, and np.take; np.take was faster by a factor of 2 and is what I'm currently using to get the 15 s time.
An example is below:
m = [1, 7, 12, 40]
r = np.arange(5000)
r = np.delete(r, m, axis=0)
x = np.random.rand(5000,5000,10)
%timeit tmp = x[:,m,:]
1.55 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = x[:,r,:]
1.7 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tmp = np.delete(x, r, axis=1)
1.46 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = np.delete(x, m, axis=1)
1.64 s ± 18.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tmp = np.take(x, m, axis=1)
1.21 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = np.take(x, r, axis=1)
1.04 s ± 79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Except instead of 1 s it's 15 times that, and I have to call this function a few hundred or a few thousand times.
Is there something I can do to speed this indexing up?
I'm using Python 3.6.10 through Spyder 4.0.1 on a Windows 10 laptop with an Intel i7-8650U and 16GB of RAM. I checked the array sizes and my available RAM when executing the commands and did not hit the maximum usage at any point in the execution.

Is there a way to speed up Numpy array calculations when they only contain values in upper/lower triangle?

I'm doing some matrix calculations (2d) that only involve values in the upper triangle of the matrices.
So far I've found that using Numpy's triu method ("return a copy of a matrix with the elements below the k-th diagonal zeroed") works and is quite fast. But presumably, the calculations are still being carried out for the whole matrix, including unnecessary calculations on the zeros. Or are they?...
Here is an example of what I tried first:
# Initialize vars
N = 160
u = np.empty(N)
u[0] = 1000
u[1:] = np.cumprod(np.full(N-1, 1/2**(1/16)))*1000
m = np.random.random(N)

def method1():
    # Prepare matrices with values only in upper triangle
    ones_ut = np.triu(np.ones((N, N)))
    u_ut = np.triu(np.broadcast_to(u, (N, N)))
    m_ut = np.triu(np.broadcast_to(m, (N, N)))
    # Do calculation
    return (ones_ut - np.divide(u_ut, u.reshape(N, 1)))**3*m_ut
Then I realized I only need to zero-out the final result matrix:
def method2():
    return np.triu((np.ones((N, N)) - np.divide(u, u.reshape(N, 1)))**3*m)

assert np.array_equal(method1(), method2())
But to my surprise, this was slower.
In [62]: %timeit method1()
662 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [63]: %timeit method2()
836 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Does numpy do some kind of special optimization when it knows the matrices contain half zeros?
I'm curious why it is slower, but my main question is: is there a way to speed up vectorized calculations by taking advantage of the fact that you are not interested in half the values in the matrix?
UPDATE
I tried doing the calculations over only 3 of the 4 quadrants of the matrices, but it didn't achieve any speed increase over method1:
def method4():
    split = N//2
    x = np.zeros((N, N))
    u_mat = 1 - u/u.reshape(N, 1)
    x[:split, :] = u_mat[:split, :]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return np.triu(x)

assert np.array_equal(method1(), method4())
In [86]: %timeit method4()
683 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But this is faster than method 2.
We can simplify things to leverage broadcasting in as few places as possible. Doing so, we end up with something that gets the final output directly from u and m, like so -
np.triu((1-u/u.reshape(N, 1))**3*m)
Then, we could leverage the numexpr module, which performs noticeably better on transcendental operations (as is the case here) and is also very memory efficient. Ported to numexpr, it would be -
import numexpr as ne
np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
Bringing the masking part inside the evaluate call gives a further performance boost -
M = np.tri(N,dtype=bool)
ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)})
Timings on given dataset -
In [25]: %timeit method1()
1000 loops, best of 3: 521 µs per loop
In [26]: %timeit method2()
1000 loops, best of 3: 417 µs per loop
In [27]: %timeit np.triu((1-u/u.reshape(N, 1))**3*m)
1000 loops, best of 3: 408 µs per loop
In [28]: %timeit np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
10000 loops, best of 3: 159 µs per loop
In [29]: %timeit ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1),'M':np.tri(N,dtype=bool)})
10000 loops, best of 3: 110 µs per loop
Note that another way to extend u to a 2D version would be with np.newaxis/None, which is the idiomatic way; hence, u.reshape(N, 1) could be replaced by u[:, None]. This shouldn't change the timings, though.
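A quick sanity check of that equivalence:
import numpy as np

u = np.arange(4.0)
# Both produce the same (N, 1) column for broadcasting
assert np.array_equal(u.reshape(-1, 1), u[:, None])
print(u[:, None].shape)  # (4, 1)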
Here is another solution that is faster in some cases and slower in others.
idx = np.triu_indices(N)

def my_method():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result[idx] = t * t * t * m[idx[1]]
    return result
Here, the computation is done only for the elements in the (flattened) upper triangle. However, there is overhead in the 2D-index-based assignment operation result[idx] = .... So the method is faster when the overhead is less than the saved computations -- which happens when N is small or the computation is relatively complex (e.g., using t ** 3 instead of t * t * t).
Another variation of the method is to use 1D-index for the assignment operation, which can lead to a small speedup.
idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))

def my_method2():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result.ravel()[raveled_idx] = t * t * t * m[idx[1]]
    return result
Following are the results of the performance tests. Note that idx and raveled_idx are fixed for each N and do not change with u and m (as long as their shapes remain unchanged), so their values can be precomputed and their cost is excluded from the tests.
(If you need to call these methods with matrices of many different sizes, there will be added overhead in the computation of idx and raveled_idx.) For comparison, method4b, method5 and method6 cannot benefit much from any precomputation. For method_ne, the precomputation M = np.tri(N, dtype=bool) is also excluded from the test.
%timeit method4b()
%timeit method5()
%timeit method6()
%timeit method_ne()
%timeit my_method()
%timeit my_method2()
Result (for N = 160):
1.54 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
167 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
255 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
233 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
177 µs ± 907 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For N = 32:
89.9 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.2 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
28.6 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
14.3 µs ± 52.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For N = 1000:
70.7 ms ± 871 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
65.1 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
21.4 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.03 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.2 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.7 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using t ** 3 instead of t * t * t in my_method and my_method2 (N = 160):
1.53 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
156 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
235 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.4 ms ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.32 ms ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here, my_method and my_method2 slightly outperform method4b and method5.
I think the answer may be quite simple: just put zeros in the cells that you don't want to calculate, and the overall calculation will be faster. I think that might explain why method1() was faster than method2().
Here are some tests to illustrate the point.
In [29]: size = (160, 160)
In [30]: z = np.zeros(size)
In [31]: r = np.random.random(size) + 1
In [32]: t = np.triu(r)
In [33]: w = np.ones(size)
In [34]: %timeit z**3
177 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [35]: %timeit t**3
376 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [36]: %timeit r**3
572 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [37]: %timeit w**3
138 µs ± 548 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [38]: %timeit np.triu(r)**3
427 µs ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [39]: %timeit np.triu(r**3)
625 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not sure how all this works at a low level, but clearly zero or one raised to a power takes much less time to compute than any other value.
Also interesting: with the numexpr computation there is no difference.
In [42]: %timeit ne.evaluate("r**3")
79.2 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [43]: %timeit ne.evaluate("z**3")
79.3 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So, I think the fastest without using numexpr may be this way:
def method5():
    return np.triu(1 - u/u[:, None])**3*m

assert np.array_equal(method1(), method5())
In [65]: %timeit method1()
656 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [66]: %timeit method5()
587 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Or, if you are really chasing every micro-second:
def method4b():
    split = N//2
    x = np.zeros((N, N))
    u_mat = np.triu(1 - u/u.reshape(N, 1))
    x[:split, :] = u_mat[:split, :]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return x

assert np.array_equal(method1(), method4b())
In [71]: %timeit method4b()
543 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [72]: %timeit method4b()
533 µs ± 7.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And @Divakar's answer using numexpr is the fastest overall.
UPDATE
Thanks to @GZ0's comment: if you only need to raise to the power of 3, this is much faster:
def method6():
    a = np.triu(1 - u/u[:, None])
    return a*a*a*m

assert np.isclose(method1(), method6()).all()
(But I noticed a slight loss of precision.)
In [84]: %timeit method6()
195 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In fact it is not far off the numexpr methods in @Divakar's answer (185/163 µs on my machine).

Merging 1D arrays into a 2D array

Is there a built-in function to join two 1D arrays into a 2D array?
Consider an example:
X=np.array([1,2])
y=np.array([3,4])
result=np.array([[1,3],[2,4]])
I can think of 2 simple solutions.
The first one is pretty straightforward.
np.transpose([X,y])
The other one employs a lambda function.
np.array(list(map(lambda i: [X[i], y[i]], range(len(X)))))
While the second one looks more complex, it seems to be almost twice as fast as the first one.
Edit
A third solution involves the zip() function.
np.array(list(zip(X, y)))
It's faster than the lambda function but slower than the column_stack solution suggested by @Divakar.
np.column_stack((X,y))
Take scalability into consideration: if we increase the size of the arrays, pure NumPy solutions are considerably faster than solutions involving Python built-in operations:
np.random.seed(1234)
X = np.random.rand(10000)
y = np.random.rand(10000)
%timeit np.array(list(map(lambda i: [X[i],y[i]], range(len(X)))))
6.64 ms ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.array(list(zip(X, y)))
4.53 ms ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.column_stack((X,y))
19.2 µs ± 30.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.transpose([X,y])
16.2 µs ± 247 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.vstack((X, y)).T
14.2 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Taking into account all proposed solutions, np.vstack((X, y)).T is the fastest when working with larger array sizes.
This is one way:
import numpy as np
X = np.array([1,2])
y = np.array([3,4])
result = np.vstack((X, y)).T
print(result)
# [[1 3]
# [2 4]]
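For completeness, np.stack along axis=1 is another built-in that produces the same pairing (it wasn't benchmarked above, so no speed claim):
import numpy as np

X = np.array([1, 2])
y = np.array([3, 4])
result = np.stack((X, y), axis=1)  # pairs element i of X with element i of y
print(result)
# [[1 3]
#  [2 4]]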
