I have an array H of dimensions (n0, n2) and an array W of dimensions (n0, n1, n2, n3) and I want to do the following operation:
(H[:, None, :, None] * W).sum(axis=(0, 2))
As far as I know, the above line does not use BLAS libraries. Is there a way to use numpy.dot or a similar function that uses BLAS to do the same computation (and still without copying the array H several times in memory)?
You have identified one way of doing this; I know of two others.
For a small example
In [365]: n0,n1,n2,n3=2,3,4,5
In [366]: H=np.ones((n0,n2));W=np.ones((n0,n1,n2,n3))
comparative times are:
In [362]: timeit np.tensordot(H,W,[(0,1),(0,2)])
10000 loops, best of 3: 32.8 µs per loop
In [363]: timeit np.einsum('ik,ijkl',H,W)
100000 loops, best of 3: 10.7 µs per loop
In [364]: timeit (H[:,None,:,None]*W).sum(axis=(0,2))
10000 loops, best of 3: 29.5 µs per loop
tensordot reshapes and transposes the inputs so it can call np.dot. einsum decodes the string, and does its own nditer in C.
https://stackoverflow.com/a/31129207/901925 has timings for another multidimensional dot, involving (100,)*(10,100,100)*(100,) arrays.
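As a quick sanity check that the three spellings above compute the same contraction, here is a minimal sketch. The random inputs and the names `H_flat` and `W_flat` are illustrative assumptions; the last variant only spells out by hand the reshape-and-transpose step that tensordot is described as doing before it calls np.dot.

```python
import numpy as np

n0, n1, n2, n3 = 2, 3, 4, 5
rng = np.random.default_rng(0)
H = rng.standard_normal((n0, n2))
W = rng.standard_normal((n0, n1, n2, n3))

# Broadcasting version from the question (builds a full temporary array)
broadcast = (H[:, None, :, None] * W).sum(axis=(0, 2))

# Contract H axes (0, 1) against W axes (0, 2)
via_tensordot = np.tensordot(H, W, axes=[(0, 1), (0, 2)])

# Same contraction with an explicit output specification
via_einsum = np.einsum('ik,ijkl->jl', H, W)

# Hand-written equivalent: move the summed axes of W to the front,
# flatten them, and use a single matrix product
H_flat = H.reshape(1, n0 * n2)
W_flat = W.transpose(0, 2, 1, 3).reshape(n0 * n2, n1 * n3)
via_dot = np.dot(H_flat, W_flat).reshape(n1, n3)
```

Note that the transpose followed by reshape copies W, so the BLAS route trades memory for speed on large inputs.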
I have an ndarray, and I want to replace every value in the array with the mean of its adjacent elements. The code below can do the job, but it is super slow when I have 700 arrays all with shape (7000, 7000) , so I wonder if there are better ways to do it. Thanks!
import numpy as np
a = np.array(([1,2,3,4,5,6,7,8,9],[4,5,6,7,8,9,10,11,12],[3,4,5,6,7,8,9,10,11]))
row, col = a.shape
new_arr = np.empty(a.shape)
for x in range(row):
    for y in range(col):
        # Clip the window at the top/left edge; slicing clips the bottom/right edge
        min_x = max(0, x-1)
        min_y = max(0, y-1)
        new_arr[x][y] = a[min_x:(x+2), min_y:(y+2)].mean()
print(new_arr)
Well, that's a smoothing operation from image processing, which can be achieved with 2D convolution. Your loop treats the near-boundary elements a bit differently, averaging only over the neighbours that exist. So, if exact values at the boundary elements can be given up, you can use scipy's convolve2d like so -
from scipy.signal import convolve2d as conv2
out = conv2(a, np.ones((3,3)), 'same') / 9.0
This specific operation is built into the OpenCV module as cv2.blur, which is very efficient at it. The name describes what it does: blurring input arrays that represent images. I believe the efficiency comes from the fact that internally it's implemented entirely in C for performance, with a thin Python wrapper to handle NumPy arrays.
So, the output could be alternatively calculated with it, like so -
import cv2 # Import OpenCV module
out = cv2.blur(a.astype(float),(3,3))
Here's a quick show-down on timings on a decently big image/array -
In [93]: a = np.random.randint(0,255,(5000,5000)) # Input array
In [94]: %timeit conv2(a,np.ones((3,3)),'same')/9.0
1 loops, best of 3: 2.74 s per loop
In [95]: %timeit cv2.blur(a.astype(float),(3,3))
1 loops, best of 3: 627 ms per loop
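The boundary difference mentioned above can be removed by dividing by the number of neighbours that actually exist at each position instead of by a fixed 9. The following is only a sketch under that assumption, not part of the original answer; `neighbor_mean` and `neighbor_mean_loop` are hypothetical helper names, and the loop version restates the question's code for comparison.

```python
import numpy as np
from scipy.signal import convolve2d

def neighbor_mean(a):
    """3x3 mean that averages only over in-bounds neighbours."""
    a = np.asarray(a, dtype=float)
    kernel = np.ones((3, 3))
    # Sum of each zero-padded 3x3 window
    sums = convolve2d(a, kernel, mode='same')
    # Number of real (non-padding) elements in each window
    counts = convolve2d(np.ones(a.shape), kernel, mode='same')
    return sums / counts

def neighbor_mean_loop(a):
    """Reference implementation: the double loop from the question."""
    rows, cols = a.shape
    out = np.empty(a.shape, dtype=float)
    for x in range(rows):
        for y in range(cols):
            out[x, y] = a[max(0, x - 1):x + 2, max(0, y - 1):y + 2].mean()
    return out

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [4, 5, 6, 7, 8, 9, 10, 11, 12],
              [3, 4, 5, 6, 7, 8, 9, 10, 11]])
result = neighbor_mean(a)
```

The count array depends only on the shape, so for many arrays of the same shape it could be computed once and reused.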
Following the discussion with @Divakar, find below a comparison of the different convolution methods available in scipy:
import numpy as np
from scipy import signal, ndimage
def conv2(A, size):
    return signal.convolve2d(A, np.ones((size, size)), mode='same') / float(size**2)
def fftconv(A, size):
    return signal.fftconvolve(A, np.ones((size, size)), mode='same') / float(size**2)
def uniform(A, size):
    return ndimage.uniform_filter(A, size, mode='constant')
All 3 methods return the same values, up to floating-point rounding. Note, however, that uniform_filter is called with mode='constant', which sets the boundary condition of the filter: the input is padded with a constant (0 by default), which is the same zero padding that the other 2 methods apply at the edges. For different use cases you can change the boundary conditions.
Now some test matrices:
A = np.random.randn(1000, 1000)
And some timings:
%timeit conv2(A, 3) # 33.8 ms per loop
%timeit fftconv(A, 3) # 84.1 ms per loop
%timeit uniform(A, 3) # 17.1 ms per loop
%timeit conv2(A, 5) # 68.7 ms per loop
%timeit fftconv(A, 5) # 92.8 ms per loop
%timeit uniform(A, 5) # 17.1 ms per loop
%timeit conv2(A, 10) # 210 ms per loop
%timeit fftconv(A, 10) # 86 ms per loop
%timeit uniform(A, 10) # 16.4 ms per loop
%timeit conv2(A, 30) # 1.75 s per loop
%timeit fftconv(A, 30) # 102 ms per loop
%timeit uniform(A, 30) # 16.5 ms per loop
So in short, uniform_filter is the fastest here, because the box filter is separable into two 1D convolutions (similar to gaussian_filter, which is also separable).
Other, non-separable filters with different kernels are more likely to be faster using the signal module (the one in @Divakar's solution).
The speed of both fftconvolve and uniform_filter stays roughly constant across kernel sizes, while convolve2d slows down sharply as the kernel grows (from 33.8 ms at size 3 to 1.75 s at size 30).
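The separability argument can be checked directly: filtering along one axis and then along the other should reproduce the 2D box filter. This is a small sketch with an assumed random test matrix and an arbitrary odd size; it relies on scipy.ndimage.uniform_filter1d, the 1D counterpart of uniform_filter.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
size = 5

# 2D box filter with zero padding at the borders
full_2d = ndimage.uniform_filter(A, size, mode='constant')

# The same filter as two 1D passes: first down the rows, then along the columns
rows_pass = ndimage.uniform_filter1d(A, size, axis=0, mode='constant')
two_passes = ndimage.uniform_filter1d(rows_pass, size, axis=1, mode='constant')
```

Each 1D pass touches every element once per axis, which is consistent with the timings above barely changing with kernel size.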
I had a similar problem recently and had to find a different solution since I can't use scipy.
import numpy as np
a = np.random.randint(100, size=(7000,7000)) #Array of 7000 x 7000
row,col = a.shape
column_totals = a.sum(axis=0) #Dump the sum of all columns into a single array
new_array = np.zeros([row,col]) #Create a receiving array
for i in range(row):
    #Resulting row = the sum of all rows minus the original row, divided by the number of rows minus one.
    new_array[i] = (column_totals - a[i]) / (row - 1)
print(new_array)
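For the 3x3 neighbour mean asked about in the question, a NumPy-only variant is also possible by summing nine shifted views of a zero-padded copy. This is a swapped-in technique sketched for illustration, not the method used in the answer above; `neighbor_mean_numpy` is a hypothetical name.

```python
import numpy as np

def neighbor_mean_numpy(a):
    """3x3 neighbour mean using only NumPy (no scipy)."""
    a = np.asarray(a, dtype=float)
    rows, cols = a.shape
    # Zero border so every 3x3 window is fully defined
    padded = np.pad(a, 1, mode='constant')
    # Same padding applied to ones, to count the real elements per window
    ones = np.pad(np.ones_like(a), 1, mode='constant')
    sums = np.zeros_like(a)
    counts = np.zeros_like(a)
    for dx in range(3):
        for dy in range(3):
            sums += padded[dx:dx + rows, dy:dy + cols]
            counts += ones[dx:dx + rows, dy:dy + cols]
    return sums / counts

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [4, 5, 6, 7, 8, 9, 10, 11, 12],
              [3, 4, 5, 6, 7, 8, 9, 10, 11]])
smoothed = neighbor_mean_numpy(a)
```

Only nine array additions run in Python here, regardless of the array size, so the per-element work stays inside NumPy.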
I have two numpy arrays:
x of shape (d_1,...,d_m)
y of shape (e_1,...,e_n)
I would like to form the outer tensor product, that is the numpy array
z of shape (d_1,...,d_m,e_1,...,e_n)
such that
z[i_1,...,i_m,i_{m+1},...,i_{m+n}] == x[i_1,...,i_m]*y[i_{m+1},...,i_{m+n}]
I have to perform the above outer multiplication several times so I would like to speed this up as much as possible.
You want np.multiply.outer:
z = np.multiply.outer(x, y)
An alternative to outer is to explicitly expand the dimensions. For 1d arrays this would be
x[:,None]*y # y[None,:] is automatic.
For 10x10 arrays, and generalizing the dimension expansion, I get the same times
In [74]: timeit x[[slice(None)]*x.ndim + [None]*y.ndim] * y
10000 loops, best of 3: 53.6 µs per loop
In [75]: timeit np.multiply.outer(x,y)
10000 loops, best of 3: 52.6 µs per loop
So outer does save some coding, but the basic broadcasted multiplication is the same.
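A small check that the two forms agree, using assumed shapes (2, 3) and (4, 5). One detail differs from the timing above: recent NumPy versions require the multidimensional index to be a tuple rather than a list, so the sketch builds a tuple. The einsum line is an extra equivalent spelling for these particular ranks.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))
y = rng.standard_normal((4, 5))

# Outer product via the ufunc method
z_outer = np.multiply.outer(x, y)

# Same thing by appending y.ndim new axes to x and broadcasting
index = (slice(None),) * x.ndim + (None,) * y.ndim
z_broadcast = x[index] * y

# Equivalent einsum for a 2D x and a 2D y
z_einsum = np.einsum('ij,kl->ijkl', x, y)
```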
I have a 3D numpy array in this form:
>>> img.shape
(4504932, 2, 2)
>>> img
array([[[15114, 15306],
[15305, 15304]],
[[15305, 15306],
[15303, 15304]],
[[15305, 15306],
[15303, 15304]],
...,
[[15305, 15302],
[15305, 15302]]], dtype=uint16)
Which I want to convert to a 1D numpy array where each entry is the sum of each 2x2 submatrix in the above img numpy array.
I have been able to accomplish this using:
img_new = np.array([i.sum() for i in img])
>>> img_new
array([61029, 61218, 61218, ..., 61214, 61214, 61214], dtype=uint64)
Which is exactly what I want. But this is too slow (takes about 10 seconds). Is there a faster method I could use? I included above img.shape so you had an idea of the size of this numpy array.
EDIT - ADDITIONAL INFO:
My img array could also be a 3D array in the form of 4x4, 5x5, 7x7.. etc submatrices. This is specified by the variables sub_rows and sub_cols.
img.sum(axis=(1, 2))
sum allows you to specify an axis or axes along which to sum, rather than just summing the whole array. This allows NumPy to loop over the array in C and perform just a few machine instructions per sum, rather than having to go through the Python bytecode evaluation loop and create a ton of wrapper objects to stick in a list.
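Because the axes are given explicitly, the same call covers the 4x4, 5x5 or 7x7 case from the question's edit. A minimal sketch with made-up random data, using the question's `sub_rows` and `sub_cols` names:

```python
import numpy as np

sub_rows, sub_cols = 4, 4
rng = np.random.default_rng(0)
img = rng.integers(0, 2**16, size=(1000, sub_rows, sub_cols), dtype=np.uint16)

# Sum over both submatrix axes in one vectorised call
fast = img.sum(axis=(1, 2))

# Equivalent: flatten each submatrix, then sum along the flattened axis
flat = img.reshape(len(img), -1).sum(axis=1)

# Original Python-level loop, kept only for comparison
slow = np.array([block.sum() for block in img])
```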
Using a NumPy function (apply_over_axes) instead of a Python-level loop is usually quicker, and that is indeed the case here. I just tested on a 4000x2x2 array:
img = np.random.rand(4000,2,2)
timeit(np.apply_over_axes(np.sum, img, [1,2]))
# 1000 loops, best of 3: 721 us per loop
timeit(np.array([i.sum() for i in img]))
# 100 loops, best of 3: 17.2 ms per loop
You can use np.einsum -
img_new = np.einsum('ijk->i',img)
Verify results
In [42]: np.array_equal(np.array([i.sum() for i in img]),np.einsum('ijk->i',img))
Out[42]: True
Runtime tests
In [34]: img = np.random.randint(0,10000,(10000,2,2)).astype('uint16')
In [35]: %timeit np.array([i.sum() for i in img]) # Original approach
10 loops, best of 3: 92.4 ms per loop
In [36]: %timeit img.sum(axis=(1, 2)) # From other solution
1000 loops, best of 3: 297 µs per loop
In [37]: %timeit np.einsum('ijk->i',img)
10000 loops, best of 3: 102 µs per loop
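One caveat worth checking before adopting the einsum version: as far as I know, einsum keeps the input dtype for its output unless told otherwise, whereas sum promotes small integer types to a wider accumulator. With uint16 input and large values or larger submatrices, the result may therefore wrap around. The sketch below uses deliberately large made-up values and passes einsum's dtype argument to request a wider result.

```python
import numpy as np

# Each 2x2 block sums to 240000, which does not fit in uint16
img = np.full((3, 2, 2), 60000, dtype=np.uint16)

# Output dtype follows the input, so the sums cannot be represented
narrow = np.einsum('ijk->i', img)

# Ask einsum to compute and return the result in a wider type
wide = np.einsum('ijk->i', img, dtype=np.uint64)

reference = img.sum(axis=(1, 2))
```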
I am trying to figure out why my program is so slow and found the following result.
In [11]: n = 1000000
In [12]: x = randn(n)
In [13]: %timeit norm(x)
100 loops, best of 3: 2.25 ms per loop
In [14]: %timeit (x.dot(x))**0.5
1000 loops, best of 3: 387 µs per loop
I know the norm function contains many if/else branches that inspect the input and select the right norm. But I am still surprised by this big difference, especially when norm is called in loops.
Is this normal in numpy?
Another example is computing the eigenvalues and eigenvectors of a 10000x10000 random matrix generated with randn. First I used Matlab and got the result in several minutes. But numpy took a very long time and I finally stopped the process with Ctrl+C. Both used their respective eig function.
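For reference, the two expressions being timed compute the same quantity, the Euclidean norm of a 1D vector. The sketch below assumes that the bare `norm` and `randn` in the session refer to np.linalg.norm and np.random.randn; it only checks agreement and makes no claim about timings.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

# General-purpose routine (handles different orders, axes and matrix norms)
via_norm = np.linalg.norm(x)

# Direct 2-norm of a 1D vector: square root of the dot product with itself
via_dot = np.sqrt(x.dot(x))
```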
I want to create a 3D array in Python, filled with -1.
I tested these methods:
import numpy as np
l = 200
b = 100
h = 30
%timeit grid = [[[-1 for x in range(l)] for y in range(b)] for z in range(h)]
1 loops, best of 3: 458 ms per loop
%timeit grid = -1 * np.ones((l, b, h), dtype=int)
10 loops, best of 3: 35.5 ms per loop
%timeit grid = np.zeros((l, b, h), dtype=int) - 1
10 loops, best of 3: 31.7 ms per loop
%timeit grid = -1.0 * np.ones((l, b, h), dtype=np.float32)
10 loops, best of 3: 42.1 ms per loop
%%timeit
grid = np.empty((l,b,h))
grid.fill(-1.0)
100 loops, best of 3: 13.7 ms per loop
So the last one is clearly the fastest. Does anybody have an even faster method, or at least a less memory-intensive one? It has to run on a Raspberry Pi.
The only thing I can think to add is that any of these methods will be faster with the dtype argument chosen to take up as little memory as possible.
Assuming you need no more space than int8, the method suggested by @RutgerKassies in the comments took this long on my system:
%timeit grid = np.full((l, b, h), -1, dtype=np.int8)
1000 loops, best of 3: 286 µs per loop
For comparison, not specifying dtype (defaulting to int32) took about 10 times longer with the same method:
%timeit grid = np.full((l, b, h), -1)
100 loops, best of 3: 3.61 ms per loop
Your fastest method was about as fast as np.full (sometimes beating it):
%%timeit
grid = np.empty((l,b,h))
grid.fill(-1)
100 loops, best of 3: 3.51 ms per loop
or, with dtype specified as int8,
1000 loops, best of 3: 255 µs per loop
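The variants compared above all produce the same array; what changes with dtype is the memory footprint, which is probably what matters most on a small device. A short sketch using the question's dimensions:

```python
import numpy as np

l, b, h = 200, 100, 30

# Three ways to build the same int8 array filled with -1
full = np.full((l, b, h), -1, dtype=np.int8)

filled = np.empty((l, b, h), dtype=np.int8)
filled.fill(-1)

assigned = np.empty((l, b, h), dtype=np.int8)
assigned[...] = -1

# Same shape with the default float64 dtype, for a memory comparison
as_float = np.full((l, b, h), -1.0)

print(full.nbytes, as_float.nbytes)  # 600000 and 4800000 bytes
```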
Edit: This is probably cheating, but, well...
%timeit grid = np.lib.stride_tricks.as_strided(np.array(-1, dtype=np.int8), (l, b, h), (0, 0, 0))
100000 loops, best of 3: 12.4 us per loop
All that's happening here is that we begin with a single-element array, np.array(-1, dtype=np.int8), and then fiddle with the stride lengths so that grid looks exactly like an array with the required dimensions. Every element of grid refers to the same byte in memory, so writing to one element changes all of them.
If you need an actual array, you can use grid = grid.copy(); this makes the creation of the grid array about as fast as the quickest approaches suggested elsewhere on this page.
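A related option, not mentioned in the answer, is np.broadcast_to, which builds the same kind of zero-stride view but marks it read-only, so accidental writes fail instead of silently changing every element. A minimal sketch under that assumption:

```python
import numpy as np

l, b, h = 200, 100, 30

# Zero-stride, read-only view of a single int8 value
view = np.broadcast_to(np.int8(-1), (l, b, h))

# Materialise a real, writable array only when it is needed
grid = view.copy()
grid[0, 0, 0] = 5  # safe: touches one element of the copy only
```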
This is a little faster for me. Might be different on a RPi though.
grid = np.empty((l,b,h))
grid[...] = -1
np.int8 is much faster if it's big enough for you
grid = np.empty((l,b,h), dtype=np.int8)
grid[...] = -1
%%timeit
grid = np.empty((l,b,h), dtype=np.int8)
grid[:] = -1
100 loops, best of 3: 6.91 ms per loop