Performance of numpy with built-in functions - python

I was just trying to figure out why my program is so slow and found the following result.
In [11]: n = 1000000
In [12]: x = randn(n)
In [13]: %timeit norm(x)
100 loops, best of 3: 2.25 ms per loop
In [14]: %timeit (x.dot(x))**0.5
1000 loops, best of 3: 387 µs per loop
I know the norm function contains many if/else branches to detect the input and select the right norm, but I am still puzzled by this big difference, especially when the function is called in loops.
Is this normal in numpy?
Another example is computing the eigenvalues and eigenvectors of a 10000x10000 randomly generated matrix from randn. Matlab computes the result in several minutes, but numpy took so long that I finally had to Ctrl+C the process. Both use their respective eig functions.
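For reference, here is a minimal self-contained sketch of that first comparison outside IPython (using the timeit module instead of %timeit; the measurement setup is just one way to do it):
import timeit
import numpy as np
from numpy.linalg import norm

n = 1000000
x = np.random.randn(n)

# both expressions compute the same 2-norm; norm() goes through its
# general-purpose dispatch, while x.dot(x) calls straight into BLAS
print(norm(x), (x.dot(x)) ** 0.5)
print(timeit.timeit(lambda: norm(x), number=100))
print(timeit.timeit(lambda: (x.dot(x)) ** 0.5, number=100))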

Related

Largest singular value of a NumPy `ndarray`

I have a large two-dimensional ndarray A and I want to compute its SVD, retrieving the largest singular value and the associated pair of singular vectors. Looking at the NumPy docs it seems that NumPy can only compute the complete SVD (numpy.linalg.svd), while SciPy has a method that does exactly what I need (scipy.sparse.linalg.svds), but for sparse matrices, and I don't want to convert A, since that would require additional computational time.
Until now, I have used SciPy's svds directly on the dense A; however, the documentation discourages passing ndarrays to these methods.
Is there a way to perform this task with a method that accepts ndarray objects?
If svds works with your dense A array, then continue to use it. You don't need to convert it to anything. svds does all the adaptation that it needs.
Its documentation says
A : {sparse matrix, LinearOperator}
Array to compute the SVD on, of shape (M, N)
But what is a LinearOperator? It is a wrapper around something that can perform a matrix product. For a dense array, A.dot qualifies.
Look at the code for svds. The first thing it does is A = np.asarray(A) if A isn't already a LinearOperator or sparse matrix. Then it grabs A.dot and (hermitian A).dot and makes a new LinearOperator.
There's nothing special about a sparse matrix in this function. All that matters is having a compatible matrix product.
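To make that concrete, here is a small sketch (names made up) wrapping a dense array with aslinearoperator and checking that its matvec is just A.dot:
import numpy as np
from scipy.sparse.linalg import aslinearoperator

A = np.random.rand(10, 10)
Alo = aslinearoperator(A)      # wraps the dense array, no copy of the data
x = np.random.rand(10)
print(np.allclose(Alo.matvec(x), A.dot(x)))   # True: the wrapper just forwards to A.dot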
Look at these times:
In [358]: A=np.eye(10)
In [359]: Alg=splg.aslinearoperator(A)
In [360]: Am=sparse.csr_matrix(A)
In [361]: timeit splg.svds(A)
1000 loops, best of 3: 541 µs per loop
In [362]: timeit splg.svds(Alg)
1000 loops, best of 3: 964 µs per loop
In [363]: timeit splg.svds(Am)
1000 loops, best of 3: 939 µs per loop
Direct use of A is fastest. The conversions don't help, even when they are outside of the timing loop.
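So, a sketch of how this might look for the question's use case (the shape here is made up; k=1 asks svds for just the largest singular triplet):
import numpy as np
from scipy.sparse.linalg import svds

A = np.random.rand(500, 300)   # dense ndarray, passed to svds as-is
u, s, vt = svds(A, k=1)        # largest singular value and associated vectors
print(s)                       # largest singular value
print(u.shape, vt.shape)       # (500, 1) and (1, 300)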

Numpy: Replace every value in the array with the mean of its adjacent elements

I have an ndarray, and I want to replace every value in the array with the mean of its adjacent elements. The code below can do the job, but it is super slow when I have 700 arrays all with shape (7000, 7000), so I wonder if there are better ways to do it. Thanks!
import numpy as np

a = np.array(([1,2,3,4,5,6,7,8,9],[4,5,6,7,8,9,10,11,12],[3,4,5,6,7,8,9,10,11]))
row, col = a.shape
new_arr = np.ndarray(a.shape)
for x in xrange(row):
    for y in xrange(col):
        # clamp the window to the array edges, then average the neighbourhood
        min_x = max(0, x-1)
        min_y = max(0, y-1)
        new_arr[x][y] = a[min_x:(x+2), min_y:(y+2)].mean()
print new_arr
Well, that's a smoothing operation in image processing, which can be achieved with 2D convolution. You handle the near-boundary elements a bit differently, so if exactness at the boundary elements can be sacrificed, you can use scipy's convolve2d like so -
from scipy.signal import convolve2d as conv2
out = conv2(a, np.ones((3,3)), 'same') / 9.0
This specific operation is built into the OpenCV module as cv2.blur and is very efficient at it. The name basically describes its operation of blurring the input arrays representing images. I believe the efficiency comes from the fact that internally it is implemented entirely in C for performance, with a thin Python wrapper to handle NumPy arrays.
So, the output could be alternatively calculated with it, like so -
import cv2 # Import OpenCV module
out = cv2.blur(a.astype(float),(3,3))
Here's a quick show-down on timings on a decently big image/array -
In [93]: a = np.random.randint(0,255,(5000,5000)) # Input array
In [94]: %timeit conv2(a,np.ones((3,3)),'same')/9.0
1 loops, best of 3: 2.74 s per loop
In [95]: %timeit cv2.blur(a.astype(float),(3,3))
1 loops, best of 3: 627 ms per loop
Following the discussion with @Divakar, find below a comparison of the different convolution methods available in scipy:
import numpy as np
from scipy import signal, ndimage

def conv2(A, size):
    return signal.convolve2d(A, np.ones((size, size)), mode='same') / float(size**2)

def fftconv(A, size):
    return signal.fftconvolve(A, np.ones((size, size)), mode='same') / float(size**2)

def uniform(A, size):
    return ndimage.uniform_filter(A, size, mode='constant')
All 3 methods return exactly the same values. However, note that uniform_filter takes a parameter mode='constant', which specifies the boundary conditions of the filter; constant == 0 is the same zero-padding boundary condition that the other two methods enforce. For different use cases you can change the boundary conditions.
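For instance, one might check the agreement on a small random array with something like:
A_small = np.random.randn(50, 50)
print(np.allclose(conv2(A_small, 3), uniform(A_small, 3)),
      np.allclose(fftconv(A_small, 3), uniform(A_small, 3)))
# expected: True True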
Now some test matrices:
A = np.random.randn(1000, 1000)
And some timings:
%timeit conv2(A, 3) # 33.8 ms per loop
%timeit fftconv(A, 3) # 84.1 ms per loop
%timeit uniform(A, 3) # 17.1 ms per loop
%timeit conv2(A, 5) # 68.7 ms per loop
%timeit fftconv(A, 5) # 92.8 ms per loop
%timeit uniform(A, 5) # 17.1 ms per loop
%timeit conv2(A, 10) # 210 ms per loop
%timeit fftconv(A, 10) # 86 ms per loop
%timeit uniform(A, 10) # 16.4 ms per loop
%timeit conv2(A, 30) # 1.75 s per loop
%timeit fftconv(A, 30) # 102 ms per loop
%timeit uniform(A, 30) # 16.5 ms per loop
So, in short, uniform_filter seems fastest, and that is because the convolution is separable into two 1D convolutions (similar to gaussian_filter, which is also separable).
Other, non-separable kernels are more likely to be faster using the signal module (the one in @Divakar's solution).
The speed of both fftconvolve and uniform_filter remains roughly constant for different kernel sizes, while convolve2d gets slower as the kernel grows.
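To illustrate the separability point, a 3x3 box kernel can be applied as a 3x1 pass followed by a 1x3 pass; a quick sketch using the same imports as above:
A2 = np.random.randn(200, 200)
box_2d = signal.convolve2d(A2, np.ones((3, 3)) / 9.0, mode='same')
box_sep = signal.convolve2d(
    signal.convolve2d(A2, np.ones((3, 1)) / 3.0, mode='same'),
    np.ones((1, 3)) / 3.0, mode='same')
print(np.allclose(box_2d, box_sep))   # expected: True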
I had a similar problem recently and had to find a different solution since I can't use scipy.
import numpy as np

a = np.random.randint(100, size=(7000, 7000))  # array of 7000 x 7000
row, col = a.shape
column_totals = a.sum(axis=0)  # dump the sum of all columns into a single array
new_array = np.zeros([row, col])  # create a receiving array
for i in range(row):
    # resulting row = the sum over all rows minus the original row, divided by the number of rows minus one
    new_array[i] = (column_totals - a[i]) / (row - 1)
print(new_array)

Computing the spectral norms of ~1m Hermitian matrices: `numpy.linalg.norm` is too slow

I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np
array = np.random.rand(8, 8, 10**6) + 1j*np.random.rand(8, 8, 10**6)
It currently takes me almost 10 seconds using numpy.linalg.norm:
np.linalg.norm(array, ord=2, axis=(0,1))
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np
cimport numpy as np
cimport cython
np.import_array()
DTYPE = np.complex64
@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
why isn't numpy.mean multithreaded?
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks, and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck is in my code, i.e. whether it is CPU or memory-bound, or perhaps something different.
Digging into the code for np.linalg.norm, I've deduced that, for these parameters, it is finding the maximum of the matrix singular values over the N dimension.
First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:
In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
Out[269]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
the equivalent operation:
In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
Out[270]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
same values, and same time:
In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
1000 loops, best of 3: 398 µs per loop
In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
1000 loops, best of 3: 389 µs per loop
And most of the time is spent in svd, which produces an (N,8) array:
In [273]: timeit np.linalg.svd(A1,compute_uv=0)
1000 loops, best of 3: 366 µs per loop
So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses np.linalg._umath_linalg functions - that is a compiled .so file.
The C code is in https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src
It sure looks like this is the fastest you'll get. There's no Python-level loop. Any looping is in that C code, or the LAPACK function it calls.
np.linalg.norm(A, ord=2) computes the spectral norm by finding the largest singular value using SVD. However, since your 8x8 submatrices are Hermitian, their largest singular values will be equal to the maximum of their absolute eigenvalues (see here):
import numpy as np

def random_symmetric(N, k):
    A = np.random.randn(N, k, k)
    A += A.transpose(0, 2, 1)
    return A

N = 100000
k = 8
A = random_symmetric(N, k)

norm1 = np.abs(np.linalg.eigvalsh(A)).max(1)
norm2 = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm1, norm2))
# True
Eigendecomposition on a Hermitian matrix is quite a bit faster than SVD:
In [1]: %%timeit A = random_symmetric(N, k)
np.linalg.norm(A, ord=2, axis=(1, 2))
....:
1 loops, best of 3: 1.54 s per loop
In [2]: %%timeit A = random_symmetric(N, k)
np.abs(np.linalg.eigvalsh(A)).max(1)
....:
1 loops, best of 3: 757 ms per loop
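The same idea should carry over to complex Hermitian stacks like the one in the question; a sketch (with the stack axis moved to the front, unlike the (8, 8, N) layout above):
import numpy as np

N = 1000
B = np.random.rand(N, 8, 8) + 1j * np.random.rand(N, 8, 8)
B = B + B.conj().transpose(0, 2, 1)                 # make each 8x8 matrix Hermitian
norm_eig = np.abs(np.linalg.eigvalsh(B)).max(axis=1)
norm_svd = np.linalg.norm(B, ord=2, axis=(1, 2))
print(np.allclose(norm_eig, norm_svd))              # expected: True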

Many small matrices speed-up for loops

I have a large coordinate grid (vectors a and b), for which I generate and solve a matrix (10x10) equation. Is there a way for scipy.linalg.solve to accept vector input? So far my solution was to run for cycles over the coordinate arrays.
import time
import numpy as np
import scipy.linalg
N = 10
a = np.linspace(0, 1, 10**3)
b = np.linspace(0, 1, 2*10**3)
A = np.random.random((N, N)) # input matrix, not static
def f(a, b, n):  # array-filling function
    return a*b*n

def sol(x, y):  # matrix solver
    D = np.arange(0, N)
    B = f(x, y, D)**2 + f(x-1, y+1, D)  # source vector
    X = scipy.linalg.solve(A, B)
    return X  # output an N-size vector

start = time.time()
answer = np.zeros(shape=(a.size, b.size))  # predefine output array
for egg in range(a.size):  # an ugly double-for cycle
    for ham in range(b.size):
        aa = a[egg]
        bb = b[ham]
        answer[egg, ham] = sol(aa, bb)[0]
print time.time() - start
To illustrate my point about generalized ufuncs, and the ability to eliminate the loop over egg and ham, consider the following piece of code:
import numpy as np
A = np.random.randn(4,4,10,10)
AI = np.linalg.inv(A)
#check that generalized ufuncs work as expected
I = np.einsum('xyij,xyjk->xyik', A, AI)
print np.allclose(I, np.eye(10)[np.newaxis, np.newaxis, :, :])
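In the same spirit, np.linalg.solve also broadcasts over stacked matrices, so a batch of independent systems can be solved in one call; a sketch with made-up shapes:
import numpy as np

M, N = 1000, 10
A_stack = np.random.randn(M, N, N)            # one matrix per coordinate
b_stack = np.random.randn(M, N, 1)            # one right-hand side per coordinate
x_stack = np.linalg.solve(A_stack, b_stack)   # shape (M, N, 1), no Python loop

# spot-check against a plain Python loop
x_loop = np.array([np.linalg.solve(A_stack[i], b_stack[i]) for i in range(M)])
print(np.allclose(x_stack, x_loop))           # expected: True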
@yevgeniy You are right, efficiently solving multiple independent linear systems A x = b with scipy is a bit tricky (assuming an A array that changes at every iteration).
For instance, here is a benchmark for solving 1000 systems of the form A x = b, where A is a 10x10 matrix and b a 10-element vector. Surprisingly, the approach of putting all of this into one block diagonal matrix and calling scipy.linalg.solve once is indeed slower, with both dense and sparse matrices.
import numpy as np
from scipy.linalg import block_diag, solve
from scipy.sparse import block_diag as sp_block_diag
from scipy.sparse.linalg import spsolve
N = 10
M = 1000 # number of coordinates
Ai = np.random.randn(N, N) # we can compute the inverse here,
# but let's assume that Ai are different matrices inside the for loop
bi = np.random.randn(N)
%timeit [solve(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 32.1 ms per loop
Afull = sp_block_diag([Ai]*M, format='csr')
bfull = np.tile(bi, M)
%timeit Afull = sp_block_diag([Ai]*M, format='csr')
%timeit spsolve(Afull, bfull)
# 1 loops, best of 3: 303 ms per loop
# 100 loops, best of 3: 5.55 ms per loop
Afull = block_diag(*[Ai]*M)
%timeit Afull = block_diag(*[Ai]*M)
%timeit solve(Afull, bfull)
# 100 loops, best of 3: 14.1 ms per loop
# 1 loops, best of 3: 23.6 s per loop
The solution of the linear system with sparse arrays is faster, but the time to create the block diagonal array is actually very slow. As for dense arrays, they are simply slower in this case (and take lots of RAM).
Maybe I'm missing something about how to make this work efficiently with sparse arrays, but if you are keeping the for loops, there are two things that you could do for optimization.
From pure Python, look at the source code of scipy.linalg.solve: remove unnecessary tests and factor all repeated operations out of your loops. For instance, assuming your arrays are not symmetric positive definite, we could do
from scipy.linalg import get_lapack_funcs
from numpy.linalg import LinAlgError

gesv, = get_lapack_funcs(('gesv',), (Ai, bi))

def solve_opt(A, b, gesv=gesv):
    # not sure if copying A and b is necessary, but just in case (faster if arrays are not copied)
    lu, piv, x, info = gesv(A.copy(), b.copy(), overwrite_a=False, overwrite_b=False)
    if info == 0:
        return x
    if info > 0:
        raise LinAlgError("singular matrix")
    raise ValueError('illegal value in %d-th argument of internal gesv|posv' % -info)
%timeit [solve(Ai, bi) for el in range(M)]
%timeit [solve_opt(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 30.1 ms per loop
# 100 loops, best of 3: 3.77 ms per loop
which results in a speed-up of roughly 8x for these timings.
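As a quick sanity check, the stripped-down wrapper should return the same solution as scipy.linalg.solve:
print(np.allclose(solve(Ai, bi), solve_opt(Ai, bi)))   # expected: True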
If you need even better performance, you would have to port this for loop to Cython and call the gesv LAPACK function directly in C, as discussed here, or better, with the Cython API for BLAS/LAPACK in Scipy 0.16.
Edit: As @Eelco Hoogendoorn mentioned, if your A matrix is fixed, there is a much simpler and more efficient approach.

Broadcastable Numpy dot

I have an array H of dimensions (n0, n2) and an array W of dimensions (n0, n1, n2, n3) and I want to do the following operation:
(H[:, None, :, None] * W).sum(axis=(0, 2))
As far as I know, the above line does not use BLAS libraries. Is there a way to use numpy.dot or a similar function that uses BLAS to do the same computation (and still without copying the array H several times in memory)?
You have identified one way of doing this; I know of two others.
For a small example
In [365]: n0,n1,n2,n3=2,3,4,5
In [366]: H=np.ones((n0,n2));W=np.ones((n0,n1,n2,n3))
comparative times are:
In [362]: timeit np.tensordot(H,W,[(0,1),(0,2)])
10000 loops, best of 3: 32.8 µs per loop
In [363]: timeit np.einsum('ik,ijkl',H,W)
100000 loops, best of 3: 10.7 µs per loop
In [364]: timeit (H[:,None,:,None]*W).sum(axis=(0,2))
10000 loops, best of 3: 29.5 µs per loop
tensordot reshapes and transposes the inputs so it can call np.dot. einsum decodes the string, and does its own nditer in C.
https://stackoverflow.com/a/31129207/901925 has timings for another multidimensional dot, involving (100,)*(10,100,100)*(100,) arrays.
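For completeness, a quick sketch checking that the three formulations agree on the small example shapes above:
import numpy as np

n0, n1, n2, n3 = 2, 3, 4, 5
H = np.random.rand(n0, n2)
W = np.random.rand(n0, n1, n2, n3)
r1 = (H[:, None, :, None] * W).sum(axis=(0, 2))
r2 = np.tensordot(H, W, [(0, 1), (0, 2)])
r3 = np.einsum('ik,ijkl', H, W)
print(np.allclose(r1, r2), np.allclose(r1, r3))   # expected: True True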
