I have a question about matrix calculations using numpy. How can I parallelize calculations such as np.matmul and np.multiply? I cannot find any references describing how to parallelize np.matmul.
def time_shift_R(V, R_1, I0, t): # V is the potential function which returns an array
    temp1 = V(xx, yy, t) + B*I0**2
    temp = P*np.matmul(M, I0) + Q*np.matmul(I0, M) - np.multiply(temp1, I0)
    R1 = (R_1 - dt*temp) / (1 - dt*B*R_1*I0)
    return R1
I appreciate your kind help in advance.
You might want to do some timing tests to see what exactly is taking the most time. For example, on a rather modest machine with stock Ubuntu Linux:
Make a complex array (you didn't give any sizes, so I'm just guessing at something reasonable):
In [60]: A = np.ones((1000,1000),complex)
np.multiply and the * operator are basically the same:
In [61]: timeit A*A
7.49 ms ± 4.99 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [62]: timeit np.multiply(A,A)
7.48 ms ± 6.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
matmul takes quite a bit longer, but then it's doing a lot more. A faster BLAS library might help here. Note that matmul is smart enough to use a specialized BLAS function for the transpose case. The @ operator is basically the same as np.matmul.
In [63]: timeit np.matmul(A,A)
381 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: timeit np.matmul(A,A.T)
231 ms ± 9.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and the power calc:
In [65]: timeit 10.0*A**2
14.4 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
The cost of V(xx, yy, t) is unknown.
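One thing worth checking (not from the original answer, so treat it as a sketch): np.matmul already dispatches to whatever BLAS library numpy was built against, and most BLAS builds (OpenBLAS, MKL) are multithreaded, so the matmul calls may already run in parallel. You can inspect the backend and experiment with the thread count, e.g. with the threadpoolctl package:

import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits  # extra package: pip install threadpoolctl

np.show_config()          # prints the BLAS/LAPACK numpy was built against
print(threadpool_info())  # lists the thread pools (and thread counts) actually in use

A = np.ones((1000, 1000), complex)

with threadpool_limits(limits=1):   # force single-threaded BLAS for comparison
    C = np.matmul(A, A)

with threadpool_limits(limits=4):   # allow up to 4 BLAS threads
    C = np.matmul(A, A)

The environment variables OMP_NUM_THREADS, OPENBLAS_NUM_THREADS and MKL_NUM_THREADS control the same thing process-wide.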
scipy.stats has a function nbinom.pmf() which computes the probability mass function of the negative binomial distribution.
The mathematical function is very easily described in pure Python code.
from math import comb
def nbinom_pmf(k, n, p):
    return comb(k + n - 1, n - 1) * p**n * (1 - p)**k
It turns out that scipy.stats.nbinom.pmf() is quite a lot slower than this pure Python code; this is a known issue caused by the overhead of parameter checking. The docs suggest using ._pmf instead. That is indeed faster, but still slower than the pure Python code in many cases. E.g.
In [24]: %timeit nbinom_pmf(1, 26, 0.5)
282 ns ± 1.61 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [25]: %timeit nbinom._pmf(1, 26, 0.5)
2.03 µs ± 6.55 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [32]: %timeit nbinom._pmf(36, 26, 0.5)
2.03 µs ± 1.49 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [33]: %timeit nbinom_pmf(36, 26, 0.5)
1.64 µs ± 30.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Why is the scipy code so much slower than the naive pure Python implementation?
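As a side note (not part of the original question): the parameter-checking overhead is a fixed per-call cost, so it amortizes if pmf is evaluated over an array of k values in one call instead of one scalar at a time. A rough sketch of the comparison:

import numpy as np
from math import comb
from scipy.stats import nbinom

def nbinom_pmf(k, n, p):
    return comb(k + n - 1, n - 1) * p**n * (1 - p)**k

ks = np.arange(100)

vals_scipy = nbinom.pmf(ks, 26, 0.5)                           # overhead paid once for all k
vals_py = np.array([nbinom_pmf(int(k), 26, 0.5) for k in ks])  # one Python call per k

assert np.allclose(vals_scipy, vals_py)  # same values either way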
I have a numpy array x of size (n, n, p) and I need to index it using a list m. I need to return two new arrays of sizes (n, m, p) and (n, n-m, p). Both p and m are generally small (in the range 10 to 100), but n can be from 100 to 10000+.
When n is small, there is no issue. However when n gets large, these indexing operations take the majority of my function call time.
In my actual implementation, the indexing took 15 seconds, and the rest of the function was less than 1 sec.
I've tried regular indexing, np.delete, and np.take; np.take was faster by a factor of 2, and it is what I'm currently using to get the 15-second time.
An example is below:
m = [1, 7, 12, 40]
r = np.arange(5000)
r = np.delete(r, m, axis=0)
x = np.random.rand(5000,5000,10)
%timeit tmp = x[:,m,:]
1.55 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = x[:,r,:]
1.7 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tmp = np.delete(x, r, axis=1)
1.46 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = np.delete(x, m, axis=1)
1.64 s ± 18.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tmp = np.take(x, m, axis=1)
1.21 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = np.take(x, r, axis=1)
1.04 s ± 79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Except that instead of about 1 second it takes 15 times that, and I have to call this function a few hundred or a few thousand times.
Is there something I can do to speed this indexing up?
I'm using Python 3.6.10 through Spyder 4.0.1 on a Windows 10 laptop with an Intel i7-8650U and 16GB of RAM. I checked the array sizes and my available RAM when executing the commands and did not hit the maximum usage at any point in the execution.
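One thing that may be worth trying (a sketch, not from the original post, so profile it on your data): build a boolean mask once so both splits come from the same mask, and, since the function is called many times with the same shapes, reuse a preallocated output buffer via np.take's out= argument:

import numpy as np

n, p = 5000, 10
x = np.random.rand(n, n, p)
m = [1, 7, 12, 40]

# Boolean mask built once; both splits are derived from it.
keep = np.zeros(n, dtype=bool)
keep[m] = True

small = x[:, keep, :]   # shape (n, len(m), p)
rest = x[:, ~keep, :]   # shape (n, n - len(m), p)

# For repeated calls with identical shapes, reuse a preallocated buffer.
r = np.flatnonzero(~keep)
out = np.empty((n, r.size, p))
np.take(x, r, axis=1, out=out)

Whether this beats plain np.take depends on memory layout and sizes, so it needs to be timed on the real arrays.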
Let's say I want to compute a mathematical summation in Python, for example the Madhava–Leibniz series for π.
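For reference, the series is π/4 = 1 - 1/3 + 1/5 - 1/7 + …, i.e. the nth partial sum of (-1)**i / (2*i + 1) approximates π/4.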
Within a function called Leibniz_pi(), I could create a loop to calculate the nth partial sum, such as:
def Leibniz_pi(n):
    nth_partial_sum = 0  # initialize the variable
    for i in range(n+1):
        nth_partial_sum += ((-1)**i)/(2*i + 1)
    return nth_partial_sum
I'm assuming it would be faster to use something like xrange() instead of range(). Would it be even faster to use numpy and its built-in numpy.sum() method? What would such an example look like?
I guess most people will call the numpy-only solution by @zero the most pythonic, but it is certainly not the fastest. With some additional optimizations you can beat the already fast numpy implementation by a factor of 50.
Using only Numpy (@zero)
import numpy as np
import numexpr as ne
import numba as nb
def Leibniz_point(n):
    val = (-1)**n / (2*n + 1)
    return val
%timeit Leibniz_point(np.arange(1000)).sum()
33.8 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Make use of numexpr
n=np.arange(1000)
%timeit ne.evaluate("sum((-1)**n / (2*n + 1))")
21 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Compile your function using Numba
# with error_model="numpy", turns off division-by-zero checks
@nb.njit(error_model="numpy", cache=True)
def Leibniz_pi(n):
    nth_partial_sum = 0.  # initialize the variable as float64
    for i in range(n+1):
        nth_partial_sum += ((-1)**i)/(2*i + 1)
    return nth_partial_sum
%timeit Leibniz_pi(999)
6.48 µs ± 38.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Edit: optimizing away the costly (-1)**n
import numba as nb
import numpy as np
#replacement for the much more costly (-1)**n
@nb.njit()
def sgn(i):
    if i % 2 > 0:
        return -1.
    else:
        return 1.
# with error_model="numpy", turns off the division-by-zero checks
#
# fastmath=True makes SIMD-vectorization in this case possible
# floating point math is in general not commutative
# e.g. calculating four times sgn(i)/(2*i + 1) at once and then the sum
# is not exactly the same as doing this sequentially, therefore you have to
# explicitly allow the compiler to make the optimizations
@nb.njit(fastmath=True, error_model="numpy", cache=True)
def Leibniz_pi(n):
    nth_partial_sum = 0.  # initialize the variable
    for i in range(n+1):
        nth_partial_sum += sgn(i)/(2*i + 1)
    return nth_partial_sum
%timeit Leibniz_pi(999)
777 ns ± 5.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3 suggestions (with timings):
Define the Leibniz point (a single term of the series), not the cumulative sum:
def Leibniz_point(n):
    val = (-1)**n / (2*n + 1)
    return val
1) sum a list comprehension
%timeit sum([Leibniz_point(n) for n in range(100)])
58.8 µs ± 825 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sum([Leibniz_point(n) for n in range(1000)])
667 µs ± 3.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2) standard for loop
%%timeit
sum = 0
for n in range(100):
    sum += Leibniz_point(n)
61.8 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
sum = 0
for n in range(1000):
    sum += Leibniz_point(n)
729 µs ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3) use a numpy array (suggested)
%timeit Leibniz_point(np.arange(100)).sum()
11.5 µs ± 866 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit Leibniz_point(np.arange(1000)).sum()
61.8 µs ± 3.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In general, for operations involving collections of more than a few elements, numpy will be faster. A simple numpy implementation could be something like this:
import numpy as np

def leibniz(n):
    a = np.arange(n + 1)
    return (((-1.0) ** a) / (2 * a + 1)).sum()
Note that you must specify that the numerator is a float with 1.0 on Python 2. On Python 3, 1 will be fine.
What is the memory consumption of arithmetic numpy expressions, e.g.
vec ** 3 + vec ** 2 + vec
(vec being a numpy.ndarray). Is a new array allocated for each intermediate operation? Could such compound expressions need several times the memory of the underlying ndarray?
You are correct, a new array will be allocated for each intermediate result. Fortunately, the package numexpr is designed to deal with this issue. From the description:
The main reason why NumExpr achieves better performance than NumPy is that it avoids allocating memory for intermediate results. This results in better cache utilization and reduces memory access in general. Due to this, NumExpr works best with large arrays.
Example:
In [97]: xs = np.random.rand(1_000_000)
In [98]: %timeit xs ** 3 + xs ** 2 + xs
26.8 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [99]: %timeit numexpr.evaluate('xs ** 3 + xs ** 2 + xs')
1.43 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Thanks to @max9111 for pointing out that numexpr simplifies power to multiplication. It seems that most of the discrepancy in the benchmark is explained by optimization of xs ** 3.
In [421]: %timeit xs * xs
1.62 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [422]: %timeit xs ** 2
1.63 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [423]: %timeit xs ** 3
22.8 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [424]: %timeit xs * xs * xs
2.52 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Is there a built-in function to join two 1D arrays into a 2D array?
Consider an example:
X=np.array([1,2])
y=np.array([3,4])
result=np.array([[1,3],[2,4]])
I can think of 2 simple solutions.
The first one is pretty straightforward.
np.transpose([X,y])
The other one employs a lambda function.
np.array(list(map(lambda i: [X[i], y[i]], range(len(X)))))
While the second one looks more complex, it seems to be almost twice as fast as the first one.
Edit
A third solution involves the zip() function.
np.array(list(zip(X, y)))
It's faster than the lambda function but slower than the column_stack solution suggested by @Divakar.
np.column_stack((X,y))
Take scalability into consideration. If we increase the size of the arrays, the pure numpy solutions are considerably faster than the solutions involving Python built-in operations:
np.random.seed(1234)
X = np.random.rand(10000)
y = np.random.rand(10000)
%timeit np.array(list(map(lambda i: [X[i],y[i]], range(len(X)))))
6.64 ms ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.array(list(zip(X, y)))
4.53 ms ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.column_stack((X,y))
19.2 µs ± 30.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.transpose([X,y])
16.2 µs ± 247 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.vstack((X, y)).T
14.2 µs ± 94.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Taking all proposed solutions into account, np.vstack((X, y)).T is the fastest when working with larger array sizes.
This is one way:
import numpy as np
X = np.array([1,2])
y = np.array([3,4])
result = np.vstack((X, y)).T
print(result)
# [[1 3]
# [2 4]]
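For completeness (not one of the original answers): np.stack gives the same result and states the intent explicitly; for 1D inputs it is essentially what np.column_stack does.

import numpy as np

X = np.array([1, 2])
y = np.array([3, 4])

# Stack the two 1D arrays as columns of a new 2D array.
result = np.stack((X, y), axis=1)
print(result)
# [[1 3]
#  [2 4]]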