I have a 3D numpy array in this form:
>>> img.shape
(4504932, 2, 2)
>>> img
array([[[15114, 15306],
[15305, 15304]],
[[15305, 15306],
[15303, 15304]],
[[15305, 15306],
[15303, 15304]],
...,
[[15305, 15302],
[15305, 15302]]], dtype=uint16)
I want to convert this to a 1D numpy array where each entry is the sum of the corresponding 2x2 submatrix in the img array above.
I have been able to accomplish this using:
img_new = np.array([i.sum() for i in img])
>>> img_new
array([61029, 61218, 61218, ..., 61214, 61214, 61214], dtype=uint64)
Which is exactly what I want. But this is too slow (takes about 10 seconds). Is there a faster method I could use? I included above img.shape so you had an idea of the size of this numpy array.
EDIT - ADDITIONAL INFO:
My img array could also be a 3D array made of 4x4, 5x5, 7x7, etc. submatrices. The submatrix size is specified by the variables sub_rows and sub_cols.
img.sum(axis=(1, 2))
sum allows you to specify an axis or axes along which to sum, rather than just summing the whole array. This allows NumPy to loop over the array in C and perform just a few machine instructions per sum, rather than having to go through the Python bytecode evaluation loop and create a ton of wrapper objects to stick in a list.
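As a minimal sketch of how this carries over to the sub_rows x sub_cols case from the question's edit (the random data and the 4x4 block size are illustrative assumptions, not the asker's real array):

```python
import numpy as np

# Illustrative stand-in for the real data: 1000 blocks of sub_rows x sub_cols.
sub_rows, sub_cols = 4, 4
rng = np.random.default_rng(0)
img = rng.integers(0, 2**16, size=(1000, sub_rows, sub_cols), dtype=np.uint16)

# Sum over the last two axes; this works for any block size.
img_new = img.sum(axis=(1, 2))

# Equivalent: flatten each block, then sum along the last axis.
img_flat = img.reshape(len(img), -1).sum(axis=1)

# The original list-comprehension result, for comparison.
img_loop = np.array([block.sum() for block in img])
```

Because the axes are given explicitly, nothing in the call depends on the block being 2x2.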
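Using a NumPy method (apply_over_axes) is usually quicker than a Python loop, and indeed that is the case here. Note that apply_over_axes keeps the summed axes, so the result has shape (4000, 1, 1); call .ravel() on it to get a 1D array. I just tested on a 4000x2x2 array:
img = np.random.rand(4000,2,2)
timeit(np.apply_over_axes(np.sum, img, [1,2]))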
# 1000 loops, best of 3: 721 us per loop
timeit(np.array([i.sum() for i in img]))
# 100 loops, best of 3: 17.2 ms per loop
You can use np.einsum -
img_new = np.einsum('ijk->i',img)
Verify results
In [42]: np.array_equal(np.array([i.sum() for i in img]),np.einsum('ijk->i',img))
Out[42]: True
Runtime tests
In [34]: img = np.random.randint(0,10000,(10000,2,2)).astype('uint16')
In [35]: %timeit np.array([i.sum() for i in img]) # Original approach
10 loops, best of 3: 92.4 ms per loop
In [36]: %timeit img.sum(axis=(1, 2)) # From other solution
1000 loops, best of 3: 297 µs per loop
In [37]: %timeit np.einsum('ijk->i',img)
10000 loops, best of 3: 102 µs per loop
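One caveat worth checking before adopting einsum on integer data: as far as I know, np.einsum accumulates in the input dtype, whereas sum upcasts small unsigned integers before adding. With uint16 input and larger submatrices the einsum result can therefore wrap around. A minimal sketch, with made-up values chosen to force the overflow:

```python
import numpy as np

# Made-up data: two 4x4 blocks of large uint16 values (each block sums to 960000).
img = np.full((2, 4, 4), 60000, dtype=np.uint16)

# sum() upcasts small unsigned integers before accumulating.
total_sum = img.sum(axis=(1, 2))

# einsum keeps uint16, so the total cannot be represented and wraps around.
total_einsum = np.einsum('ijk->i', img)

# Asking einsum for a wider accumulator avoids the problem.
total_einsum64 = np.einsum('ijk->i', img, dtype=np.uint64)
```

For the 2x2 blocks in the question the totals stay below 65536, so the plain einsum call is fine there.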
I have an ndarray, and I want to replace every value in the array with the mean of its adjacent elements. The code below can do the job, but it is super slow when I have 700 arrays all with shape (7000, 7000) , so I wonder if there are better ways to do it. Thanks!
import numpy as np
a = np.array(([1,2,3,4,5,6,7,8,9],[4,5,6,7,8,9,10,11,12],[3,4,5,6,7,8,9,10,11]))
row, col = a.shape
new_arr = np.empty(a.shape)
for x in range(row):
    for y in range(col):
        # Clip the 3x3 window at the top/left edge; slicing clips the other side.
        min_x = max(0, x-1)
        min_y = max(0, y-1)
        new_arr[x][y] = a[min_x:(x+2),min_y:(y+2)].mean()
print(new_arr)
Well, that's a smoothing operation in image processing, which can be achieved with 2D convolution. Your code treats the near-boundary elements a bit differently: it averages only the neighbours that exist, whereas a zero-padded convolution always divides by 9. If you can tolerate that difference at the boundary, you can use scipy's convolve2d like so -
from scipy.signal import convolve2d as conv2
out = conv2(a, np.ones((3,3)), 'same') / 9.0
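If the boundary values do matter, one possible workaround (a sketch, not part of the original answer) is to divide by the number of in-bounds neighbours instead of a fixed 9. That count can itself be obtained by convolving an array of ones with the same kernel:

```python
import numpy as np
from scipy.signal import convolve2d

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [4, 5, 6, 7, 8, 9, 10, 11, 12],
              [3, 4, 5, 6, 7, 8, 9, 10, 11]], dtype=float)
kernel = np.ones((3, 3))

# Sum of each 3x3 neighbourhood, treating cells outside the array as 0.
window_sum = convolve2d(a, kernel, mode='same')
# Number of real (in-bounds) cells in each neighbourhood: 4, 6 or 9.
window_count = convolve2d(np.ones_like(a), kernel, mode='same')
out = window_sum / window_count

# Reference: the explicit loop from the question.
expected = np.empty_like(a)
for x in range(a.shape[0]):
    for y in range(a.shape[1]):
        expected[x, y] = a[max(0, x - 1):x + 2, max(0, y - 1):y + 2].mean()
```

This costs a second convolution, but window_count depends only on the array shape, so it can be computed once and reused across all arrays of the same size.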
This specific operation is built into the OpenCV module as cv2.blur, which is very efficient at it. The name describes what it does: blurring input arrays that represent images. I believe the efficiency comes from the fact that internally it's implemented entirely in C++ for performance, with a thin Python wrapper to handle NumPy arrays.
So, the output could be alternatively calculated with it, like so -
import cv2 # Import OpenCV module
out = cv2.blur(a.astype(float),(3,3))
Here's a quick show-down on timings on a decently big image/array -
In [93]: a = np.random.randint(0,255,(5000,5000)) # Input array
In [94]: %timeit conv2(a,np.ones((3,3)),'same')/9.0
1 loops, best of 3: 2.74 s per loop
In [95]: %timeit cv2.blur(a.astype(float),(3,3))
1 loops, best of 3: 627 ms per loop
Following the discussion with @Divakar, find below a comparison of the different convolution methods available in scipy:
import numpy as np
from scipy import signal, ndimage
def conv2(A, size):
    return signal.convolve2d(A, np.ones((size, size)), mode='same') / float(size**2)
def fftconv(A, size):
    return signal.fftconvolve(A, np.ones((size, size)), mode='same') / float(size**2)
def uniform(A, size):
    return ndimage.uniform_filter(A, size, mode='constant')
All three methods return the same values. Note, however, that uniform_filter is called with mode='constant', which sets the filter's boundary condition: padding with a constant value of 0 is the same boundary condition the other two methods apply. For other use cases you can change the boundary condition.
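A quick way to check the claim about boundary conditions, as a sketch on a small random matrix with an odd kernel size (the array size and seed are arbitrary choices):

```python
import numpy as np
from scipy import signal, ndimage

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 60))
size = 3
kernel = np.ones((size, size)) / size**2

direct = signal.convolve2d(A, kernel, mode='same')
via_fft = signal.fftconvolve(A, kernel, mode='same')
zero_padded = ndimage.uniform_filter(A, size, mode='constant')  # cval defaults to 0.0
reflected = ndimage.uniform_filter(A, size, mode='reflect')     # SciPy's default mode

# The three zero-padded results agree up to floating-point rounding.
same_direct_fft = np.allclose(direct, via_fft)
same_direct_uniform = np.allclose(direct, zero_padded)

# Changing the mode only affects the one-pixel border for a 3x3 kernel.
same_interior = np.allclose(zero_padded[1:-1, 1:-1], reflected[1:-1, 1:-1])
same_everywhere = np.allclose(zero_padded, reflected)
```

Only the border pixels change with the mode, so the choice matters mostly for small arrays or large kernels.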
Now some test matrices:
A = np.random.randn(1000, 1000)
And some timings:
%timeit conv2(A, 3) # 33.8 ms per loop
%timeit fftconv(A, 3) # 84.1 ms per loop
%timeit uniform(A, 3) # 17.1 ms per loop
%timeit conv2(A, 5) # 68.7 ms per loop
%timeit fftconv(A, 5) # 92.8 ms per loop
%timeit uniform(A, 5) # 17.1 ms per loop
%timeit conv2(A, 10) # 210 ms per loop
%timeit fftconv(A, 10) # 86 ms per loop
%timeit uniform(A, 10) # 16.4 ms per loop
%timeit conv2(A, 30) # 1.75 s per loop
%timeit fftconv(A, 30) # 102 ms per loop
%timeit uniform(A, 30) # 16.5 ms per loop
In short, uniform_filter is the fastest here, because its convolution is separable into two 1D convolutions (as is gaussian_filter).
Non-separable filters with other kernels are more likely to be faster with the signal module (the one used in @Divakar's solution).
The speed of both fftconvolve and uniform_filter stays roughly constant across kernel sizes, while convolve2d gets markedly slower as the kernel grows (from 33.8 ms at size 3 to 1.75 s at size 30).
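To illustrate the separability point, here is a rough NumPy-only sketch: a zero-padded k x k box average computed as a 1D pass along the rows followed by a 1D pass along the columns. The helper names box_1d and box_2d_direct are hypothetical, and this is not how uniform_filter is implemented internally, only the same idea.

```python
import numpy as np

def box_1d(arr, size, axis):
    """Zero-padded moving average of odd length `size` along one axis."""
    kernel = np.ones(size) / size
    return np.apply_along_axis(np.convolve, axis, arr, kernel, mode='same')

def box_2d_direct(arr, size):
    """Reference: average the size x size window around every element directly."""
    half = size // 2
    padded = np.pad(arr, half, mode='constant')
    out = np.zeros(arr.shape, dtype=float)
    for dx in range(size):
        for dy in range(size):
            out += padded[dx:dx + arr.shape[0], dy:dy + arr.shape[1]]
    return out / size**2

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 50))

# Two 1D passes (2 * size terms per element) ...
separable = box_1d(box_1d(A, 5, axis=0), 5, axis=1)
# ... give the same result as one 2D pass (size**2 terms per element).
direct = box_2d_direct(A, 5)
```

The two-pass form grows linearly with the kernel size rather than quadratically, and a running sum can remove the dependence on kernel size altogether, which would be consistent with the flat uniform_filter timings above.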
I had a similar problem recently and had to find a different solution since I can't use scipy.
import numpy as np
a = np.random.randint(100, size=(7000,7000)) #Array of 7000 x 7000
row,col = a.shape
column_totals = a.sum(axis=0) #Dump the sum of all columns into a single array
new_array = np.zeros([row,col]) #Create a receiving array
for i in range(row):
    #Resulting row = the column totals minus the original row, divided by the number of rows minus one.
    new_array[i] = (column_totals - a[i]) / (row - 1)
print(new_array)
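For the neighbour mean from the original question (the mean over the 3x3 window, clipped at the borders), a NumPy-only sketch is to pad with zeros, add up the nine shifted views, and divide by the number of in-bounds cells. The helper name neighbour_mean is hypothetical.

```python
import numpy as np

def neighbour_mean(a):
    """Mean of each element's 3x3 neighbourhood, clipped at the array border."""
    rows, cols = a.shape
    padded = np.pad(a.astype(float), 1, mode='constant')
    in_bounds = np.pad(np.ones(a.shape), 1, mode='constant')
    total = np.zeros(a.shape)
    count = np.zeros(a.shape)
    # Nine shifted views of the padded array cover the whole 3x3 window.
    for dx in range(3):
        for dy in range(3):
            total += padded[dx:dx + rows, dy:dy + cols]
            count += in_bounds[dx:dx + rows, dy:dy + cols]
    return total / count

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [4, 5, 6, 7, 8, 9, 10, 11, 12],
              [3, 4, 5, 6, 7, 8, 9, 10, 11]])
result = neighbour_mean(a)

# Reference: the explicit loop from the question.
expected = np.empty(a.shape)
for x in range(a.shape[0]):
    for y in range(a.shape[1]):
        expected[x, y] = a[max(0, x - 1):x + 2, max(0, y - 1):y + 2].mean()
```

The Python loop here runs nine times regardless of the array size, so the per-element work stays inside NumPy.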
I have two numpy arrays:
x of shape (d_1, ..., d_m)
y of shape (e_1, ..., e_n)
I would like to form the outer tensor product, that is the numpy array
z of shape (d_1, ..., d_m, e_1, ..., e_n)
such that
z[i_1, ..., i_m, i_{m+1}, ..., i_{m+n}] == x[i_1, ..., i_m] * y[i_{m+1}, ..., i_{m+n}]
I have to perform the above outer multiplication several times so I would like to speed this up as much as possible.
You want np.multiply.outer:
z = np.multiply.outer(x, y)
An alternative to outer is to explicitly expand the dimensions. For 1d arrays this would be
x[:,None]*y # y[None,:] is automatic.
For 10x10 arrays, and generalizing the dimension expansion, I get the same times
In [74]: timeit x[(slice(None),)*x.ndim + (None,)*y.ndim] * y
10000 loops, best of 3: 53.6 µs per loop
In [75]: timeit np.multiply.outer(x,y)
10000 loops, best of 3: 52.6 µs per loop
So outer does save some coding, but the basic broadcasted multiplication is the same.
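A small sketch comparing the two approaches on arrays of different rank (the shapes are arbitrary; the einsum line is an extra cross-check and only works when the number of dimensions is known in advance):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))      # shape (d_1, d_2)
y = rng.standard_normal((4, 5, 6))   # shape (e_1, e_2, e_3)

z_outer = np.multiply.outer(x, y)

# Append y.ndim singleton axes to x so broadcasting lines the axes up.
# The index is built as a tuple; as far as I know, recent NumPy versions
# no longer accept a list of slices here.
index = (slice(None),) * x.ndim + (None,) * y.ndim
z_broadcast = x[index] * y

# Cross-check with explicit subscripts.
z_einsum = np.einsum('ab,cde->abcde', x, y)
```

All three produce an array of shape x.shape + y.shape.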
I have an array H of dimensions (n0, n2) and an array W of dimensions (n0, n1, n2, n3) and I want to do the following operation:
(H[:, None, :, None] * W).sum(axis=(0, 2))
As far as I know, the above line does not use BLAS libraries. Is there a way to use numpy.dot or a similar function that uses BLAS to do the same computation (and still without copying the array H several times in memory)?
You have identified one way of doing this; I know of two others.
For a small example
In [365]: n0,n1,n2,n3=2,3,4,5
In [366]: H=np.ones((n0,n2));W=np.ones((n0,n1,n2,n3))
comparative times are:
In [362]: timeit np.tensordot(H,W,[(0,1),(0,2)])
10000 loops, best of 3: 32.8 µs per loop
In [363]: timeit np.einsum('ik,ijkl',H,W)
100000 loops, best of 3: 10.7 µs per loop
In [364]: timeit (H[:,None,:,None]*W).sum(axis=(0,2))
10000 loops, best of 3: 29.5 µs per loop
tensordot reshapes and transposes the inputs so it can call np.dot. einsum decodes the string, and does its own nditer in C.
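A sketch checking that the three expressions agree, with the reshape-and-dot step that tensordot performs spelled out by hand (the random inputs are illustrative; by_dot is my own reconstruction of the idea, not tensordot's actual source):

```python
import numpy as np

n0, n1, n2, n3 = 2, 3, 4, 5
rng = np.random.default_rng(0)
H = rng.standard_normal((n0, n2))
W = rng.standard_normal((n0, n1, n2, n3))

by_broadcast = (H[:, None, :, None] * W).sum(axis=(0, 2))
by_tensordot = np.tensordot(H, W, axes=[(0, 1), (0, 2)])
by_einsum = np.einsum('ik,ijkl->jl', H, W)

# Move the contracted axes of W to the front, flatten them, and hand a
# plain vector-matrix product to np.dot. The reshape after the transpose
# copies W, but H is never replicated.
W2 = W.transpose(0, 2, 1, 3).reshape(n0 * n2, n1 * n3)
by_dot = np.dot(H.reshape(-1), W2).reshape(n1, n3)
```

The np.dot call is where a BLAS routine can be used for floating-point inputs; whether that pays off depends on the array sizes, as the timings above suggest for this small case.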
https://stackoverflow.com/a/31129207/901925 has timings for another multidimensional dot, involving (100,)*(10,100,100)*(100,) arrays.
EDIT I have kept the more complicated problem I am facing below, but my problems with np.take can be summarized better as follows. Say you have an array img of shape (planes, rows), and another array lut of shape (planes, 256), and you want to use them to create a new array out of shape (planes, rows), where out[p,j] = lut[p, img[p, j]]. This can be achieved with fancy indexing as follows:
In [4]: %timeit lut[np.arange(planes).reshape(-1, 1), img]
1000 loops, best of 3: 471 us per loop
But if, instead of fancy indexing, you use take and a python loop over the planes things can be sped up tremendously:
In [6]: %timeit for _ in (lut[j].take(img[j]) for j in xrange(planes)) : pass
10000 loops, best of 3: 59 us per loop
Can lut and img be in someway rearranged, so as to have the whole operation happen without python loops, but using numpy.take (or an alternative method) instead of conventional fancy indexing to keep the speed advantage?
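For reference, a small sketch of the simplified 2D problem showing that the fancy-indexing and per-plane take versions compute the same thing. The sizes and random data are illustrative. The np.take_along_axis line is an additional loop-free spelling that I am adding for comparison; I make no claim about its speed relative to the others.

```python
import numpy as np

planes, rows = 3, 1000
rng = np.random.default_rng(0)
lut = rng.integers(0, 256, size=(planes, 256), dtype=np.uint8)
img = rng.integers(0, 256, size=(planes, rows), dtype=np.uint8)

# Fancy indexing: the (planes, 1) plane index broadcasts against img.
out_fancy = lut[np.arange(planes).reshape(-1, 1), img]

# One take per plane, stacked back into a (planes, rows) array.
out_loop = np.stack([lut[p].take(img[p]) for p in range(planes)])

# out[p, j] = lut[p, img[p, j]] is also what take_along_axis computes on axis 1.
out_along = np.take_along_axis(lut, img.astype(np.intp), axis=1)
```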
ORIGINAL QUESTION
I have a set of look-up tables (LUTs) that I want to use on an image. The array holding the LUTs is of shape (planes, 256, n), and the image has shape (planes, rows, cols). Both are of dtype = 'uint8', matching the 256 axis of the LUT. The idea is to run the p-th plane of the image through each of the n LUTs from the p-th plane of the LUT.
If my lut and img are the following:
planes, rows, cols, n = 3, 4000, 4000, 4
lut = np.random.randint(-2**31, 2**31 - 1,
size=(planes * 256 * n // 4,)).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31 - 1,
size=(planes * rows * cols // 4,)).view('uint8')
img = img.reshape(planes, rows, cols)
I can achieve what I am after using fancy indexing like this
out = lut[np.arange(planes).reshape(-1, 1, 1), img]
which gives me an array of shape (planes, rows, cols, n) , where out[i, :, :, j] holds the i-th plane of img run through the j-th LUT of the i-th plane of the LUT...
All is good, except for this:
In [2]: %timeit lut[np.arange(planes).reshape(-1, 1, 1), img]
1 loops, best of 3: 5.65 s per loop
which is completely unacceptable, especially since I have all of the following not-so-nice-looking alternatives using np.take that run much faster:
A single LUT on a single plane runs about x70 faster:
In [2]: %timeit np.take(lut[0, :, 0], img[0])
10 loops, best of 3: 78.5 ms per loop
A python loop running through all the desired combinations finishes almost x6 faster:
In [2]: %timeit for _ in (np.take(lut[j, :, k], img[j]) for j in xrange(planes) for k in xrange(n)) : pass
1 loops, best of 3: 947 ms per loop
Even running all combinations of planes in the LUT and image and then discarding the planes**2 - planes unwanted ones is faster than fancy indexing:
In [2]: %timeit np.take(lut, img, axis=1)[np.arange(planes), np.arange(planes)]
1 loops, best of 3: 3.79 s per loop
And the fastest combination I have been able to come up with has a python loop iterating over the planes and finishes x13 faster:
In [2]: %timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
1 loops, best of 3: 434 ms per loop
The question, of course, is whether there is a way of doing this with np.take without any python loop. Ideally whatever reshaping or resizing is needed should happen on the LUT, not the image, but I am open to whatever you people can come up with...
First of all, I have to say I really liked your question. Without rearranging LUT or IMG the following solution worked:
%timeit a=np.take(lut, img, axis=1)
# 1 loops, best of 3: 1.93s per loop
But from the result you have to query the diagonal: a[0,0], a[1,1], a[2,2]; to get what you want. I've tried to find a way to do this indexing only for the diagonal elements, but still did not manage.
Here are some ways to rearrange your LUT and IMG:
The following works if the indexes in IMG are from 0-255, for the 1st plane, 256-511 for the 2nd plane, and 512-767 for the 3rd plane, but that would prevent you from using 'uint8', which can be a big issue...:
lut2 = lut.reshape(-1,4)
%timeit np.take(lut2,img,axis=0)
# 1 loops, best of 3: 716 ms per loop
# or
%timeit np.take(lut2, img.flatten(), axis=0).reshape(3,4000,4000,4)
# 1 loops, best of 3: 709 ms per loop
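A sketch of how the shifted indices could be built and checked against the fancy-indexing result. The sizes are deliberately small, and the names offsets and img2 are my own; the widened copy of img is the memory cost mentioned above.

```python
import numpy as np

planes, rows, cols, n = 3, 40, 50, 4   # small sizes, for illustration only
rng = np.random.default_rng(0)
lut = rng.integers(0, 256, size=(planes, 256, n), dtype=np.uint8)
img = rng.integers(0, 256, size=(planes, rows, cols), dtype=np.uint8)

# Reference result via fancy indexing, shape (planes, rows, cols, n).
expected = lut[np.arange(planes).reshape(-1, 1, 1), img]

# Stack the per-plane tables into one (planes * 256, n) table ...
lut2 = lut.reshape(-1, n)
# ... and shift plane p's indices into the range [256 * p, 256 * (p + 1)).
# The indices no longer fit in uint8, so img is widened first.
offsets = 256 * np.arange(planes).reshape(-1, 1, 1)
img2 = img.astype(np.intp) + offsets

out = np.take(lut2, img2, axis=0)
```

A narrower index type such as uint16 would be enough to hold values up to 767 and would halve the extra memory compared with a 32-bit index, though I have not measured how that affects the speed of take.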
On my machine your solution is still the best option, and very adequate since you only need the diagonal evaluations, i.e. plane1-plane1, plane2-plane2 and plane3-plane3:
%timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
# 1 loops, best of 3: 677 ms per loop
I hope this gives you some insight towards a better solution. It would be worth looking for more options with flatten() and methods such as np.apply_over_axes() or np.apply_along_axis(), which seem promising.
I used this code below to generate the data:
import numpy as np
num = 4000
planes, rows, cols, n = 3, num, num, 4
lut = np.random.randint(-2**31, 2**31-1,size=(planes*256*n//4,)).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31-1,size=(planes*rows*cols//4,)).view('uint8')
img = img.reshape(planes, rows, cols)
Imagine you have an RGB image and want to process every pixel:
import numpy as np
image = np.zeros((1024, 1024, 3))
def rgb_to_something(rgb):
    pass
vfunc = np.vectorize(rgb_to_something)
vfunc(image)
vfunc should now get every RGB value. The problem is that numpy flattens the
array and the function gets r0, g0, b0, r1, g1, b1, ... when it should get
rgb0, rgb1, rgb2, ....
Can this be done somehow?
http://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
Maybe by converting the numpy array to some special datatype beforehand?
For example (of course not working):
image = image.astype(np.float32)
import ctypes
RGB = ctypes.c_float * 3
image.astype(RGB)
ValueError: shape mismatch: objects cannot be broadcast to a single shape
Update:
The main purpose is efficiency here. A non vectorized version could simply look like this:
import numpy as np
image = np.zeros((1024, 1024, 3))
shape = image.shape[0:2]
image = image.reshape((-1, 3))
def rgb_to_something(rgb):
    r, g, b = rgb  # tuple parameters in the signature are Python 2 only
    return r + g + b
transformed_image = np.array([rgb_to_something(rgb) for rgb in image]).reshape(shape)
The easy way to solve this kind of problem is to pass the entire array to the function and use vectorized idioms inside it. Specifically, your rgb_to_something can also be written
def rgb_to_something(pixels):
    return pixels.sum(axis=1)
which is about 15 times faster than your version:
In [16]: %timeit np.array([old_rgb_to_something(rgb) for rgb in image]).reshape(shape)
1 loops, best of 3: 3.03 s per loop
In [19]: %timeit image.sum(axis=1).reshape(shape)
1 loops, best of 3: 192 ms per loop
The problem with np.vectorize is that it necessarily incurs a lot of Python function call overhead when applied to large arrays.
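If the per-pixel function really cannot be rewritten with array operations, np.vectorize does accept a signature argument that makes it pass whole RGB triples instead of single scalars. A minimal sketch on a small made-up image follows; it is still a Python-level loop, so it addresses the calling convention from the question, not the speed.

```python
import numpy as np

def rgb_to_something(rgb):
    # rgb is a length-3 array holding one pixel's (r, g, b) values.
    r, g, b = rgb
    return r + g + b

# signature='(n)->()' tells vectorize to consume the last axis as a single
# argument and to expect one scalar back per pixel.
vfunc = np.vectorize(rgb_to_something, signature='(n)->()')

image = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)  # small made-up image
transformed = vfunc(image)      # shape (4, 5)

# The array-level equivalent, for comparison.
fast = image.sum(axis=-1)
```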
You can use Numexpr for some cases. For instance:
import numpy as np
import numexpr
rgb = np.random.rand(3,1000,1000)
r,g,b = rgb
In this case, numexpr is about 5x faster than even a "vectorized" NumPy expression. However, not all functions can be written this way.
%timeit (r*2+g*3) / b
10 loops, best of 3: 20.8 ms per loop
%timeit numexpr.evaluate("(r*2+g*3) / b")
100 loops, best of 3: 4.2 ms per loop