I have a set of 2D arrays for which I have to compute the 2D correlation. I have been trying many different things (even programming it in Fortran), but I think the fastest way will be to calculate it using an FFT.
Based on my tests and on this answer, I can use scipy.signal.fftconvolve, and it works fine if I'm trying to reproduce the output of scipy.signal.correlate2d with boundary='fill'. So basically this
scipy.signal.fftconvolve(a, a[::-1, ::-1], mode='same')
is equal to this (with the exception of a slight shift)
scipy.signal.correlate2d(a, a, boundary='fill', mode='same')
The thing is that the correlation should be computed in wrapped mode, since the arrays are 2D periodic (i.e., boundary='wrap'). So if I try to reproduce the output of
scipy.signal.correlate2d(a, a, boundary='wrap', mode='same')
I can't, or at least I don't see how to do it. (And I want to use the FFT method, since it's way faster.)
Apparently SciPy used to have something that might have done the trick, but it seems to have been left behind and I can't find it, so I think SciPy may have dropped support for it.
Anyway, is there a way to use scipy's or numpy's FFT routines to calculate this correlation of periodic arrays?
The wrapped correlation can be implemented using the FFT. Here's some code to demonstrate how:
In [276]: import numpy as np
In [277]: from scipy.signal import correlate2d
Create a random array a to work with:
In [278]: a = np.random.randn(200, 200)
Compute the 2D correlation using scipy.signal.correlate2d:
In [279]: c = correlate2d(a, a, boundary='wrap', mode='same')
Now compute the same result, using the 2D FFT functions from numpy.fft. (This code assumes a is square.)
In [280]: from numpy.fft import fft2, ifft2
In [281]: fc = np.roll(ifft2(fft2(a).conj()*fft2(a)).real, (a.shape[0] - 1)//2, axis=(0,1))
Verify that both methods give the same result:
In [282]: np.allclose(c, fc)
Out[282]: True
And as you point out, using the FFT is much faster. For this example, it is about 1000 times faster:
In [283]: %timeit c = correlate2d(a, a, boundary='wrap', mode='same')
1 loop, best of 3: 3.2 s per loop
In [284]: %timeit fc = np.roll(ifft2(fft2(a).conj()*fft2(a)).real, (a.shape[0] - 1)//2, axis=(0,1))
100 loops, best of 3: 3.19 ms per loop
And that includes the duplicated computation of fft2(a). Of course, fft2(a) should only be computed once:
In [285]: fta = fft2(a)
In [286]: fc = np.roll(ifft2(fta.conj()*fta).real, (a.shape[0] - 1)//2, axis=(0,1))
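As a minimal sketch, the steps above can be packaged into a helper that applies the (n - 1)//2 shift per axis, which should also handle non-square arrays (only the square case is verified above; the tuple shift in np.roll needs numpy 1.12+, and the function name periodic_autocorrelate is just illustrative):

import numpy as np
from numpy.fft import fft2, ifft2

def periodic_autocorrelate(a):
    # Circular (wrapped) autocorrelation via the FFT, rolled so the zero-lag
    # term lines up with correlate2d(a, a, boundary='wrap', mode='same').
    fta = fft2(a)                      # computed once and reused
    c = ifft2(fta.conj() * fta).real
    shifts = ((a.shape[0] - 1) // 2, (a.shape[1] - 1) // 2)
    return np.roll(c, shifts, axis=(0, 1))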
I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np
array = np.random.rand(8, 8, int(1e6)) + 1j*np.random.rand(8, 8, int(1e6))
It currently takes me almost 10 seconds using numpy.linalg.norm:
np.linalg.norm(array, ord=2, axis=(0,1))
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np
cimport numpy as np
cimport cython
np.import_array()
DTYPE = np.complex64
@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
why isn't numpy.mean multithreaded?
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck in my code is, i.e. whether it is CPU-bound or memory-bound, or perhaps something else.
Digging into the code for np.linalg.norm, I've deduced that, for these parameters, it finds the maximum of the matrix singular values over the N dimension.
First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:
In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
Out[269]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
The equivalent operation:
In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
Out[270]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
Same values, and about the same time:
In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
1000 loops, best of 3: 398 µs per loop
In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
1000 loops, best of 3: 389 µs per loop
And most of the time is spent in svd, which produces an (N, 8) array:
In [273]: timeit np.linalg.svd(A1,compute_uv=0)
1000 loops, best of 3: 366 µs per loop
So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses the np.linalg._umath_linalg functions, which live in a compiled .so file.
The C code is at https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src
It sure looks like this is the fastest you'll get. There's no Python-level loop; any looping is in that C code or in the LAPACK functions it calls.
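If the data is stored (8, 8, N) as in the question, a quick sketch of putting N first before calling the stacked svd (a small N and a plain transpose are used here purely for illustration):

import numpy as np

# Small N just for illustration; the question uses N close to 1e6.
N = 1000
arr = np.random.rand(8, 8, N) + 1j * np.random.rand(8, 8, N)

# Move N to the front so the stacked svd loops over the trailing (8, 8)
# axes directly, instead of np.linalg.norm doing a rollaxis internally.
A1 = arr.transpose(2, 0, 1)
norms = np.amax(np.linalg.svd(A1, compute_uv=False), axis=-1)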
np.linalg.norm(A, ord=2) computes the spectral norm by finding the largest singular value using SVD. However, since your 8x8 submatrices are Hermitian, their largest singular values will be equal to the maximum of their absolute eigenvalues (see here):
import numpy as np
def random_symmetric(N, k):
    A = np.random.randn(N, k, k)
    A += A.transpose(0, 2, 1)
    return A
N = 100000
k = 8
A = random_symmetric(N, k)
norm1 = np.abs(np.linalg.eigvalsh(A)).max(1)
norm2 = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm1, norm2))
# True
Eigendecomposition on a Hermitian matrix is quite a bit faster than SVD:
In [1]: %%timeit A = random_symmetric(N, k)
np.linalg.norm(A, ord=2, axis=(1, 2))
....:
1 loops, best of 3: 1.54 s per loop
In [2]: %%timeit A = random_symmetric(N, k)
np.abs(np.linalg.eigvalsh(A)).max(1)
....:
1 loops, best of 3: 757 ms per loop
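Since the example above uses real symmetric matrices, here is a rough sketch of the same eigvalsh trick applied to a stack of complex Hermitian matrices stored (8, 8, N) as in the question (the names M, H, A and the small N are just illustrative):

import numpy as np

N = 1000
M = np.random.randn(8, 8, N) + 1j * np.random.randn(8, 8, N)
H = M + M.conj().transpose(1, 0, 2)        # Hermitian in the first two axes

A = H.transpose(2, 0, 1)                   # (N, 8, 8), the layout the stacked routines expect
norm_eig = np.abs(np.linalg.eigvalsh(A)).max(axis=1)
norm_svd = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm_eig, norm_svd))     # True, since the matrices are Hermitian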
For one of my projects in image processing, I need to use key points. To compute them, I found that OpenCV was quite fast and convenient to use. But when computing the key points of an image with, for example, the FAST algorithm, we receive an array of KeyPoint objects.
When I get those key points, I would like to take only their coordinates, not the additional information (angle, etc.). Those coordinates will be used for several computations with numpy. The problem is the conversion time from an array of KeyPoint objects to a numpy array: it takes approximately 60% of the total execution time. Any suggestions on how to improve the loop below?
import cv2
import numpy as np
image_cv2 = cv2.imread("sample.jpg")
fast = cv2.FastFeatureDetector(threshold=25)
keypoints = fast.detect(image_cv2)
n = len(keypoints)
cd_x = np.zeros(n, dtype=np.int)
cd_y = np.zeros(n, dtype=np.int)
for i in xrange(0, n):
    cd_x[i] = keypoints[i].pt[0]
    cd_y[i] = keypoints[i].pt[1]
PS: I tried to use np.vectorize but did not notice any improvement. For information, the number of key points per image is often around 5,000.
Update:
As some people pointed out, the simple assignment from key points to a numpy array should be quite fast. After some tests, it is indeed very fast. For example, for a dataset of 275 images, with 1 thread, the complete execution time is 22.9 s, with only 0.2 s for the keypoints->numpy conversion and around 20 s spent in cv2.imread().
My mistake was using too many threads at the same time: since each core was not being used at least 80%, I kept increasing the number of threads up to that arbitrary limit, which slowed down the loop execution. Thank you everyone for opening my eyes to a silly mistake elsewhere in the code!
A trial case:
In [765]: class Keypoints(object):
   .....:     def __init__(self):
   .....:         self.pt=[1.,2.]
   .....:
In [766]: keypoints=[Keypoints() for i in xrange(1000)]
In [767]: cd=np.array([k.pt for k in keypoints])
In [768]: cd
Out[768]:
array([[ 1., 2.],
[ 1., 2.],
[ 1., 2.],
...,
[ 1., 2.],
[ 1., 2.],
[ 1., 2.]])
In [769]: cd_x=cd[:,0]
In timeit tests, the keypoints step takes just as long as the cd calculation, about 1 ms.
But the 2 simpler iterations
cd_x=np.array([k.pt[0] for k in keypoints])
cd_y=np.array([k.pt[1] for k in keypoints])
take half the time. I was expecting the single iteration to save time, but in these simple cases the comprehension itself takes only half of the time; the rest goes into creating the array.
In [789]: timeit [k.pt[0] for k in keypoints]
10000 loops, best of 3: 136 us per loop
In [790]: timeit np.array([k.pt[0] for k in keypoints])
1000 loops, best of 3: 282 us per loop
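Applied to the question's loop, the same idea would look roughly like this (a sketch assuming keypoints is the list returned by fast.detect; the int cast is only needed if integer coordinates are wanted):

import numpy as np

# Build one (n, 2) array from the KeyPoint objects, then slice out columns.
cd = np.array([kp.pt for kp in keypoints])
cd_x = cd[:, 0].astype(int)   # pt holds floats
cd_y = cd[:, 1].astype(int)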
Assume a few functions that are called many times. These functions do things such as multiplying, dividing, or adding on a 3D vector (a 1x3 array).
Given:
import numpy as np
import math
x = [0,1,2]
y = [3,2,1]
a = 1.2
Based on my testing, it is faster for the Python math library to do:
math.sin(a)
than for numpy to do:
np.sin(a)
Additionally, simple algorithms such as normalization are faster in plain Python than with np.linalg.norm, using the method discussed in this conversation.
Now if we add a bit of complexity to the data, such as doing matrix multiplication in 3D, where we have a 3x3 rotation matrix that is then multiplied by another matrix and transposed, numpy starts to gain the advantage.
Currently, doing operations such as:
L = math.sqrt(V[0] * V[0] + V[1] * V[1] + V[2] * V[2])
V = (V[0] / L, V[1] / L, V[2] / L)
are much faster when called repeatedly (I assume because there is no overhead from creating a numpy array).
However, in order to use the numpy matrix functions, the array needs to be a numpy array. Using np.asarray() has significant overhead, which makes it a close call between not using numpy at all, accepting the overhead of creating the array, or accepting the slower numpy math functions on scalars and using numpy exclusively.
Of course I can try out all of these methods, but in a large algorithm the possible combinations are too many. Is there any strategy for efficiently switching between Python and numpy in this situation?
EDIT:
From some comments, it seems the question is not clear enough. I understand numpy is more efficient with large data sets, which is why this question exists. The algorithm does NOT only calculate sine. The following code might make it easier to understand:
x = [2,1,2]
math.sin(x[0])
L = math.sqrt(x[0] * x[0] + x[1] * x[1] + x[2] * x[2])
V = (x[0] / L, x[1] / L, x[2] / L)
math.sin(V[0])
#Do something else here
When working with single values and small arrays, the np.array overhead certainly slows things down compared to using the math module equivalents. But with many values, the array approach quickly becomes better.
For example in Ipython I can time sin for 50 values:
In [444]: %%timeit x=np.arange(50)
np.sin(x)
100000 loops, best of 3: 8.5 us per loop
In [445]: %%timeit x=range(50)
[math.sin(i) for i in x]
100000 loops, best of 3: 18.1 us per loop
Your V calculation is 20x faster than
Va=Va/math.sqrt((Va*Va).sum())
But if I do that on 20 sets of values, the times are about equal. And I don't have to change the expression to handle Va=np.ones((20,3), float). To time your V I had to wrap it in a function and time [foo(i) for i in V].
You might even gain more speed by doing the indexing only once, e.g.
v1, v2, v3 = V
L = math.sqrt(v1*v1+ v2*v2+v3*v3)
V = (v1/L, v2/L, v3/L)
I'd expect more gain when using arrays than lists.
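As a concrete sketch of the "many values at once" point, here is one way to normalize 20 vectors with a single vectorized expression versus a per-row math version (norm_row is just an illustrative helper, not code from the question):

import math
import numpy as np

Va = np.random.rand(20, 3)

# Vectorized: one expression normalizes every row at once.
normed = Va / np.sqrt((Va * Va).sum(axis=1, keepdims=True))

# Scalar version, roughly the question's V computation applied per row.
def norm_row(v):
    L = math.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2])
    return (v[0]/L, v[1]/L, v[2]/L)

normed_rows = [norm_row(v) for v in Va.tolist()]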
I can't figure out why numba is beating numpy here (over 3x). Did I make some fundamental error in how I am benchmarking here? Seems like the perfect situation for numpy, no? Note that as a check, I also ran a variation combining numba and numpy (not shown), which as expected was the same as running numpy without numba.
(btw this is a followup question to: Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba )
import numpy as np
from numba import jit
nobs = 10000
def proc_numpy(x,y,z):
    x = x*2 - ( y * 55 )    # these 4 lines represent use cases
    y = x + y*2             # where the processing time is mostly
    z = x + y + 99          # a function of, say, 50 to 200 lines
    z = z * ( z - .88 )     # of fairly simple numerical operations
    return z
@jit
def proc_numba(xx,yy,zz):
    for j in range(nobs):      # as pointed out by Llopis, this for loop
        x, y = xx[j], yy[j]    # is not needed here. it is here by
                               # accident because in the original benchmarks
        x = x*2 - ( y * 55 )   # I was doing data creation inside the function
        y = x + y*2            # instead of passing it in as an array
        z = x + y + 99         # in any case, this redundant code seems to
        z = z * ( z - .88 )    # have something to do with the code running
                               # faster. without the redundant code, the
        zz[j] = z              # numba and numpy functions are exactly the same.
    return zz
x = np.random.randn(nobs)
y = np.random.randn(nobs)
z = np.zeros(nobs)
res_numpy = proc_numpy(x,y,z)
z = np.zeros(nobs)
res_numba = proc_numba(x,y,z)
results:
In [356]: np.all( res_numpy == res_numba )
Out[356]: True
In [357]: %timeit proc_numpy(x,y,z)
10000 loops, best of 3: 105 µs per loop
In [358]: %timeit proc_numba(x,y,z)
10000 loops, best of 3: 28.6 µs per loop
I ran this on a 2012 macbook air (13.3), standard anaconda distribution. I can provide more detail on my setup if it's relevant.
I think this question highlights (somewhat) the limitations of calling out to precompiled functions from a higher level language. Suppose in C++ you write something like:
for (int i = 0; i != N; ++i) a[i] = b[i] + c[i] + 2 * d[i];
The compiler sees all this at compile time, the whole expression. It can do a lot of really intelligent things here, including optimizing out temporaries (and loop unrolling).
In Python, however, consider what's happening: when you use numpy, each "+" uses operator overloading on the np array types (which are just thin wrappers around contiguous blocks of memory, i.e. arrays in the low-level sense) and calls out to a Fortran (or C++) function that does the addition super fast. But it just does one addition and spits out a temporary.
So, in some ways, while numpy is awesome, convenient, and pretty fast, it ends up slowing things down: although it seems to call into a fast compiled language for the hard work, the compiler never gets to see the whole program; it is only fed isolated little pieces. And that is hugely detrimental to a compiler, especially modern compilers, which are very intelligent and can retire multiple instructions per cycle when the code is well written.
Numba, on the other hand, uses a JIT. So at runtime it can figure out that the temporaries are not needed and optimize them away. Basically, Numba gets a chance to compile the program as a whole; numpy can only call small atomic blocks that have themselves been pre-compiled.
When you ask numpy to do:
x = x*2 - ( y * 55 )
It is internally translated to something like:
tmp1 = y * 55
tmp2 = x * 2
tmp3 = tmp2 - tmp1
x = tmp3
Each of those temps is an array that has to be allocated, operated on, and then deallocated. Numba, on the other hand, handles things one item at a time and doesn't have to deal with that overhead.
Numba is generally faster than Numpy and even Cython (at least on Linux).
There's a plot demonstrating this in Numba vs. Cython: Take 2. In that benchmark, pairwise distances are computed, so the result may depend on the algorithm.
Note that this may be different on other platforms; see the WinPython Cython tutorial for a comparison on WinPython.
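For context, the benchmark referenced above is a pairwise-distance kernel. A rough sketch of that kind of kernel under numba (my own reconstruction, not the exact benchmarked code) looks like this:

import numpy as np
from numba import jit

@jit(nopython=True)
def pairwise_numba(X):
    # Plain nested loops: numba compiles them, so no array temporaries are created.
    n, d = X.shape
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                s += diff * diff
            D[i, j] = np.sqrt(s)
    return D

X = np.random.randn(500, 3)
D = pairwise_numba(X)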
Instead of cluttering the original question further, I'll add some more stuff here in response to Jeff, Jaime, Veedrac:
def proc_numpy2(x,y,z):
    np.subtract( np.multiply(x,2), np.multiply(y,55), out=x)
    np.add( x, np.multiply(y,2), out=y)
    np.add(x, np.add(y,99), out=z)
    np.multiply(z, np.subtract(z,.88), out=z)
    return z
def proc_numpy3(x,y,z):
    x *= 2
    x -= y*55
    y *= 2
    y += x
    z = x + y
    z += 99
    z *= (z-.88)
    return z
My machine seems to be running a tad faster today than yesterday, so here they are in comparison to proc_numpy (proc_numba is timing the same as before):
In [611]: %timeit proc_numpy(x,y,z)
10000 loops, best of 3: 103 µs per loop
In [612]: %timeit proc_numpy2(x,y,z)
10000 loops, best of 3: 92.5 µs per loop
In [613]: %timeit proc_numpy3(x,y,z)
10000 loops, best of 3: 85.1 µs per loop
Note that as I was writing proc_numpy2/3, I started seeing some side effects, so I made copies of x, y, z and passed the copies instead of re-using x, y, z. Also, the different functions sometimes had slight differences in precision, so some of them didn't pass the equality tests, but if you diff them, they are really close. I assume that is due to creating (or not creating) temp variables. E.g.:
In [458]: (res_numpy2 - res_numba)[:12]
Out[458]:
array([ -7.27595761e-12, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -7.27595761e-12, 0.00000000e+00])
Also, it's pretty minor (about 10 µs), but using float literals (55. instead of 55) saves a little time for numpy, though it doesn't help numba.