Using numpy.take for faster fancy indexing


EDIT I have kept the more complicated problem I am facing below, but my problems with np.take can be summarized better as follows. Say you have an array img of shape (planes, rows), and another array lut of shape (planes, 256), and you want to use them to create a new array out of shape (planes, rows), where out[p,j] = lut[p, img[p, j]]. This can be achieved with fancy indexing as follows:
In [4]: %timeit lut[np.arange(planes).reshape(-1, 1), img]
1000 loops, best of 3: 471 us per loop
But if, instead of fancy indexing, you use take and a python loop over the planes things can be sped up tremendously:
In [6]: %timeit for _ in (lut[j].take(img[j]) for j in xrange(planes)) : pass
10000 loops, best of 3: 59 us per loop
Can lut and img be rearranged in some way, so that the whole operation happens without Python loops, but using numpy.take (or an alternative method) instead of conventional fancy indexing to keep the speed advantage?
ORIGINAL QUESTION
I have a set of look-up tables (LUTs) that I want to use on an image. The array holding the LUTs is of shape (planes, 256, n), and the image has shape (planes, rows, cols). Both are of dtype = 'uint8', matching the 256 axis of the LUT. The idea is to run the p-th plane of the image through each of the n LUTs from the p-th plane of the LUT.
If my lut and img are the following:
planes, rows, cols, n = 3, 4000, 4000, 4
lut = np.random.randint(-2**31, 2**31 - 1,
size=(planes * 256 * n // 4,)).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31 - 1,
size=(planes * rows * cols // 4,)).view('uint8')
img = img.reshape(planes, rows, cols)
I can achieve what I am after using fancy indexing like this
out = lut[np.arange(planes).reshape(-1, 1, 1), img]
which gives me an array of shape (planes, rows, cols, n) , where out[i, :, :, j] holds the i-th plane of img run through the j-th LUT of the i-th plane of the LUT...
All is good, except for this:
In [2]: %timeit lut[np.arange(planes).reshape(-1, 1, 1), img]
1 loops, best of 3: 5.65 s per loop
which is completely unacceptable, especially since I have all of the following not so nice looking alternatives using np.take that run much faster:
A single LUT on a single plane runs about x70 faster:
In [2]: %timeit np.take(lut[0, :, 0], img[0])
10 loops, best of 3: 78.5 ms per loop
A python loop running through all the desired combinations finishes almost x6 faster:
In [2]: %timeit for _ in (np.take(lut[j, :, k], img[j]) for j in xrange(planes) for k in xrange(n)) : pass
1 loops, best of 3: 947 ms per loop
Even running all combinations of planes in the LUT and image and then discarding the planes**2 - planes unwanted ones is faster than fancy indexing:
In [2]: %timeit np.take(lut, img, axis=1)[np.arange(planes), np.arange(planes)]
1 loops, best of 3: 3.79 s per loop
And the fastest combination I have been able to come up with has a python loop iterating over the planes and finishes x13 faster:
In [2]: %timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
1 loops, best of 3: 434 ms per loop
The question of course is if there is no way of doing this with np.take without any python loop? Ideally whatever reshaping or resizing is needed should happen on the LUT, not the image, but I am open to whatever you people can come up with...

First of all, I have to say I really liked your question. Without rearranging LUT or IMG the following solution worked:
%timeit a=np.take(lut, img, axis=1)
# 1 loops, best of 3: 1.93s per loop
But from the result you have to query the diagonal: a[0,0], a[1,1], a[2,2]; to get what you want. I've tried to find a way to do this indexing only for the diagonal elements, but still did not manage.
Here are some ways to rearrange your LUT and IMG:
The following works if the indexes in IMG are from 0-255, for the 1st plane, 256-511 for the 2nd plane, and 512-767 for the 3rd plane, but that would prevent you from using 'uint8', which can be a big issue...:
lut2 = lut.reshape(-1,4)
%timeit np.take(lut2,img,axis=0)
# 1 loops, best of 3: 716 ms per loop
# or
%timeit np.take(lut2, img.flatten(), axis=0).reshape(3,4000,4000,4)
# 1 loops, best of 3: 709 ms per loop
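The shifted indices described above are not built in the answer, so here is a minimal sketch of one way to do it, checked against the fancy-indexing result from the question. The array sizes are reduced, the data comes from np.random.default_rng instead of the question's generator, and the names offsets, shifted and out_take are illustrative assumptions. Note that the shifted index array needs a wider integer type than uint8.

```python
import numpy as np

planes, rows, cols, n = 3, 40, 50, 4
rng = np.random.default_rng(0)
lut = rng.integers(0, 256, size=(planes, 256, n), dtype=np.uint8)
img = rng.integers(0, 256, size=(planes, rows, cols), dtype=np.uint8)

# Shift the indices of plane p by p * 256 so that they address
# rows of the LUT flattened to shape (planes * 256, n).
offsets = (256 * np.arange(planes, dtype=np.intp)).reshape(-1, 1, 1)
shifted = img.astype(np.intp) + offsets

lut2 = lut.reshape(-1, n)
out_take = np.take(lut2, shifted, axis=0)   # shape (planes, rows, cols, n)

# Reference result: the fancy indexing from the question.
out_fancy = lut[np.arange(planes).reshape(-1, 1, 1), img]
print(np.array_equal(out_take, out_fancy))
```

Whether the extra cost of building shifted pays off depends on the image size, so it is worth timing on your own data.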
On my machine your solution is still the best option, and very adequate since you just need the diagonal evaluations, i.e. plane1-plane1, plane2-plane2 and plane3-plane3:
%timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
# 1 loops, best of 3: 677 ms per loop
I hope this can give you some insight about a better solution. It would be nice to look for more options with flatten(), and similar methods as np.apply_over_axes() or np.apply_along_axis(), that seem to be promising.
I used this code below to generate the data:
import numpy as np
num = 4000
planes, rows, cols, n = 3, num, num, 4
lut = np.random.randint(-2**31, 2**31-1,size=(planes*256*n//4,)).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31-1,size=(planes*rows*cols//4,)).view('uint8')
img = img.reshape(planes, rows, cols)
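Another loop-free option to try, not mentioned in the answer above, is np.take_along_axis, which I am assuming is available (it was added in NumPy 1.15). The sketch below only shows that it reproduces the fancy-indexing result on reduced sizes; no speed claim is made, and the names idx and out_along are illustrative.

```python
import numpy as np

planes, rows, cols, n = 3, 20, 30, 4
rng = np.random.default_rng(1)
lut = rng.integers(0, 256, size=(planes, 256, n), dtype=np.uint8)
img = rng.integers(0, 256, size=(planes, rows, cols), dtype=np.uint8)

# take_along_axis needs indices with the same number of dimensions as lut;
# the trailing length-1 axis broadcasts against the n LUTs of each plane.
idx = img.reshape(planes, rows * cols, 1).astype(np.intp)
out_along = np.take_along_axis(lut, idx, axis=1).reshape(planes, rows, cols, n)

# Reference result: the fancy indexing from the question.
out_fancy = lut[np.arange(planes).reshape(-1, 1, 1), img]
print(np.array_equal(out_along, out_fancy))
```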

Related

Numpy: Replace every value in the array with the mean of its adjacent elements

I have an ndarray, and I want to replace every value in the array with the mean of its adjacent elements. The code below can do the job, but it is super slow when I have 700 arrays all with shape (7000, 7000) , so I wonder if there are better ways to do it. Thanks!
a = np.array(([1,2,3,4,5,6,7,8,9],[4,5,6,7,8,9,10,11,12],[3,4,5,6,7,8,9,10,11]))
row,col = a.shape
new_arr = np.ndarray(a.shape)
for x in xrange(row):
    for y in xrange(col):
        min_x = max(0, x-1)
        min_y = max(0, y-1)
        new_arr[x][y] = a[min_x:(x+2),min_y:(y+2)].mean()
print new_arr

Well, that's a smoothing operation in image processing, which can be achieved with 2D convolution. Your code treats the near-boundary elements a bit differently, though. So, if exact values at the boundary elements are not required, you can use scipy's convolve2d like so -
from scipy.signal import convolve2d as conv2
out = conv2(a,np.ones((3,3)),'same')/9.0
This specific operation is built into the OpenCV module as cv2.blur, which is very efficient at it. The name basically describes its operation of blurring the input arrays representing images. I believe the efficiency comes from the fact that internally it's implemented entirely in C for performance, with a thin Python wrapper to handle NumPy arrays.
So, the output could be alternatively calculated with it, like so -
import cv2 # Import OpenCV module
out = cv2.blur(a.astype(float),(3,3))
Here's a quick show-down on timings on a decently big image/array -
In [93]: a = np.random.randint(0,255,(5000,5000)) # Input array
In [94]: %timeit conv2(a,np.ones((3,3)),'same')/9.0
1 loops, best of 3: 2.74 s per loop
In [95]: %timeit cv2.blur(a.astype(float),(3,3))
1 loops, best of 3: 627 ms per loop
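If the boundary values do need to match the question's loop (which averages only the neighbours that actually exist), one possible sketch is to convolve a second array of ones to count the contributing elements and divide by that count instead of 9. This is an illustration using the question's small array, not part of the timed code above; the names kernel, sums and counts are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

a = np.array(([1, 2, 3, 4, 5, 6, 7, 8, 9],
              [4, 5, 6, 7, 8, 9, 10, 11, 12],
              [3, 4, 5, 6, 7, 8, 9, 10, 11]))

kernel = np.ones((3, 3))
# Sum of each 3x3 window, with zeros assumed outside the array.
sums = convolve2d(a, kernel, mode='same')
# Number of real elements inside each window (9 inside, 6 on edges, 4 in corners).
counts = convolve2d(np.ones(a.shape), kernel, mode='same')
out = sums / counts
print(out)
```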

Following the discussion with @Divakar, find below a comparison of different convolution methods present in scipy:
import numpy as np
from scipy import signal, ndimage
def conv2(A, size):
    return signal.convolve2d(A, np.ones((size, size)), mode='same') / float(size**2)
def fftconv(A, size):
    return signal.fftconvolve(A, np.ones((size, size)), mode='same') / float(size**2)
def uniform(A, size):
    return ndimage.uniform_filter(A, size, mode='constant')
All 3 methods return the same values (up to floating-point rounding). However, note that uniform_filter has a parameter mode='constant', which sets the boundary condition of the filter; a constant of 0 matches the zero padding that the other 2 methods apply at the borders. For different use cases you can change the boundary conditions.
Now some test matrices:
A = np.random.randn(1000, 1000)
And some timings:
%timeit conv2(A, 3) # 33.8 ms per loop
%timeit fftconv(A, 3) # 84.1 ms per loop
%timeit uniform(A, 3) # 17.1 ms per loop
%timeit conv2(A, 5) # 68.7 ms per loop
%timeit fftconv(A, 5) # 92.8 ms per loop
%timeit uniform(A, 5) # 17.1 ms per loop
%timeit conv2(A, 10) # 210 ms per loop
%timeit fftconv(A, 10) # 86 ms per loop
%timeit uniform(A, 10) # 16.4 ms per loop
%timeit conv2(A, 30) # 1.75 s per loop
%timeit fftconv(A, 30) # 102 ms per loop
%timeit uniform(A, 30) # 16.5 ms per loop
So in short, uniform_filter seems faster, and that is because the convolution is separable into two 1D convolutions (similar to gaussian_filter, which is also separable).
Other non-separable filters with different kernels are more likely to be faster using the signal module (the one in @Divakar's solution).
The speed of both fftconvolve and uniform_filter remains roughly constant for different kernel sizes, while convolve2d slows down considerably as the kernel grows (from 33.8 ms at size 3 to 1.75 s at size 30).
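To make the separability argument concrete, here is a small sketch that applies two 1D box filters, one per axis, and compares the result with the single 2D calls. It assumes scipy.ndimage.uniform_filter1d with zero padding (mode='constant') and an odd kernel size; the variable names are illustrative.

```python
import numpy as np
from scipy import ndimage, signal

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 300))
size = 3

# Two 1D passes: first along the rows axis, then along the columns axis.
rows_pass = ndimage.uniform_filter1d(A, size, axis=0, mode='constant')
two_pass = ndimage.uniform_filter1d(rows_pass, size, axis=1, mode='constant')

# Single 2D calls for comparison.
one_call = ndimage.uniform_filter(A, size, mode='constant')
direct = signal.convolve2d(A, np.ones((size, size)), mode='same') / float(size**2)

print(np.allclose(two_pass, one_call), np.allclose(two_pass, direct))
```

Each 1D pass touches size elements per output value instead of size**2, which is one plausible reason the cost barely grows with the kernel size.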

I had a similar problem recently and had to find a different solution since I can't use scipy.
import numpy as np
a = np.random.randint(100, size=(7000,7000)) #Array of 7000 x 7000
row,col = a.shape
column_totals = a.sum(axis=0) #Dump the sum of all columns into a single array
new_array = np.zeros([row,col]) #Create a receiving array
for i in range(row):
    #Resulting row = the sum of all rows minus the original row, divided by the number of rows minus one.
    new_array[i] = (column_totals - a[i]) / (row - 1)
print(new_array)
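The snippet above averages over whole columns. For the 3x3 windowed mean asked about in the question, a NumPy-only sketch (no scipy) could add up nine shifted views of a zero-padded copy and divide by the number of real neighbours. The function name neighborhood_mean is hypothetical, and this is only one of several possible approaches.

```python
import numpy as np

def neighborhood_mean(a):
    """Mean over each element's 3x3 neighbourhood, clipped at the borders."""
    a = np.asarray(a, dtype=float)
    rows, cols = a.shape
    padded = np.pad(a, 1, mode='constant')                # zeros around the data
    ones = np.pad(np.ones_like(a), 1, mode='constant')    # marks real elements
    sums = np.zeros_like(a)
    counts = np.zeros_like(a)
    # Nine shifted views cover every offset of the 3x3 window.
    for dx in range(3):
        for dy in range(3):
            sums += padded[dx:dx + rows, dy:dy + cols]
            counts += ones[dx:dx + rows, dy:dy + cols]
    return sums / counts

a = np.array(([1, 2, 3, 4, 5, 6, 7, 8, 9],
              [4, 5, 6, 7, 8, 9, 10, 11, 12],
              [3, 4, 5, 6, 7, 8, 9, 10, 11]))
print(neighborhood_mean(a))
```

The Python loop runs only nine times regardless of the array size, so the heavy work stays inside NumPy.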

Vectorize large NumPy multiplication

I am interested in calculating a large NumPy array. I have a large array A which contains a bunch of numbers. I want to calculate the sum of different combinations of these numbers. The structure of the data is as follows:
A = np.random.uniform(0,1, (3743, 1388, 3))
Combinations = np.random.randint(0,3, (306,3))
Final_Product = np.array([ np.sum( A*cb, axis=2) for cb in Combinations])
My question is if there is a more elegant and memory efficient way to calculate this? I find it frustrating to work with np.dot() when a 3-D array is involved.
If it helps, the shape of Final_Product ideally should be (3743, 306, 1388). Currently Final_Product is of the shape (306, 3743, 1388), so I can just reshape to get there.

np.dot() won't give you the desired output unless you involve extra steps, which would probably include reshaping. Here's one vectorized approach using np.einsum to do it in one shot without any extra memory overhead -
Final_Product = np.einsum('ijk,lk->lij',A,Combinations)
For completeness, here's with np.dot and reshaping as discussed earlier -
M,N,R = A.shape
Final_Product = A.reshape(-1,R).dot(Combinations.T).T.reshape(-1,M,N)
Runtime tests and verify output -
In [138]: # Inputs ( smaller version of those listed in question )
...: A = np.random.uniform(0,1, (374, 138, 3))
...: Combinations = np.random.randint(0,3, (30,3))
...:
In [139]: %timeit np.array([ np.sum( A*cb, axis=2) for cb in Combinations])
1 loops, best of 3: 324 ms per loop
In [140]: %timeit np.einsum('ijk,lk->lij',A,Combinations)
10 loops, best of 3: 32 ms per loop
In [141]: M,N,R = A.shape
In [142]: %timeit A.reshape(-1,R).dot(Combinations.T).T.reshape(-1,M,N)
100 loops, best of 3: 15.6 ms per loop
In [143]: Final_Product =np.array([np.sum( A*cb, axis=2) for cb in Combinations])
...: Final_Product2 = np.einsum('ijk,lk->lij',A,Combinations)
...: M,N,R = A.shape
...: Final_Product3 = A.reshape(-1,R).dot(Combinations.T).T.reshape(-1,M,N)
...:
In [144]: print np.allclose(Final_Product,Final_Product2)
True
In [145]: print np.allclose(Final_Product,Final_Product3)
True
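The question also mentions wanting the axes in the order (3743, 306, 1388). A plain reshape keeps the memory order and would mix up values, so the axes have to be transposed instead. As a sketch on smaller inputs, the einsum output subscripts can be reordered to produce that layout directly; the names loop_result and wanted are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0, 1, (37, 13, 3))
Combinations = rng.integers(0, 3, (30, 3))

# Loop version from the question: shape (30, 37, 13).
loop_result = np.array([np.sum(A * cb, axis=2) for cb in Combinations])

# Writing the output subscripts as 'ilj' gives shape (37, 30, 13) directly.
wanted = np.einsum('ijk,lk->ilj', A, Combinations)

print(wanted.shape)
print(np.allclose(wanted, loop_result.transpose(1, 0, 2)))
```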

Instead of dot you could use tensordot. Your current method is equivalent to:
np.tensordot(A, Combinations, [2, 1]).transpose(2, 0, 1)
Note the transpose at the end to put the axes in the correct order.
Like dot, the tensordot function can call down to the fast BLAS/LAPACK libraries (if you have them installed) and so should perform well for large arrays.

Many small matrices speed-up for loops

I have a large coordinate grid (vectors a and b), for which I generate and solve a matrix (10x10) equation. Is there a way for scipy.linalg.solve to accept vector input? So far my solution was to run for cycles over the coordinate arrays.
import time
import numpy as np
import scipy.linalg
N = 10
a = np.linspace(0, 1, 10**3)
b = np.linspace(0, 1, 2*10**3)
A = np.random.random((N, N)) # input matrix, not static
def f(a,b,n): # array-filling function
    return a*b*n
def sol(x,y): # matrix solver
    D = np.arange(0,N)
    B = f(x,y,D)**2 + f(x-1, y+1, D) # source vector
    X = scipy.linalg.solve(A,B)
    return X # output an N-size vector
start = time.time()
answer = np.zeros(shape=(a.size, b.size)) # predefine output array
for egg in range(a.size): # an ugly double-for cycle
    for ham in range(b.size):
        aa = a[egg]
        bb = b[ham]
        answer[egg,ham] = sol(aa,bb)[0]
print time.time() - start

To illustrate my point about generalized ufuncs, and the ability to eliminate the loop over egg and ham, consider the following piece of code:
import numpy as np
A = np.random.randn(4,4,10,10)
AI = np.linalg.inv(A)
#check that generalized ufuncs work as expected
I = np.einsum('xyij,xyjk->xyik', A, AI)
print np.allclose(I, np.eye(10)[np.newaxis, np.newaxis, :, :])
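Applied to the question, the same broadcasting idea could look like the sketch below: build all source vectors at once and hand np.linalg.solve a stack of matrices. This assumes one (N, N) matrix per grid point (random here, shifted on the diagonal only to keep the example well conditioned), a much smaller grid than in the question, and a trailing length-1 axis on B so that each right-hand side is treated as a column.

```python
import numpy as np

N = 10
a = np.linspace(0, 1, 30)
b = np.linspace(0, 1, 40)
rng = np.random.default_rng(0)

# Hypothetical input: a different matrix for every (a, b) grid point.
A = rng.standard_normal((a.size, b.size, N, N)) + N * np.eye(N)

def f(a, b, n):  # array-filling function from the question
    return a * b * n

D = np.arange(N)
aa = a[:, None, None]          # shape (a.size, 1, 1)
bb = b[None, :, None]          # shape (1, b.size, 1)
B = f(aa, bb, D)**2 + f(aa - 1, bb + 1, D)   # shape (a.size, b.size, N)

# Solve every system in one call; B[..., None] has shape (a.size, b.size, N, 1).
X = np.linalg.solve(A, B[..., None])[..., 0]
answer = X[..., 0]             # first component of each solution, as in the question
print(answer.shape)
```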

@yevgeniy You are right, efficiently solving multiple independent linear systems A x = b with scipy is a bit tricky (assuming an A array that changes for every iteration).
For instance, here is a benchmark for solving 1000 systems of the form, A x = b, where A is a 10x10 matrix, and b a 10 element vector. Surprisingly, the approach to put all this into one block diagonal matrix and call scipy.linalg.solve once is indeed slower both with dense and sparse matrices.
import numpy as np
from scipy.linalg import block_diag, solve
from scipy.sparse import block_diag as sp_block_diag
from scipy.sparse.linalg import spsolve
N = 10
M = 1000 # number of coordinates
Ai = np.random.randn(N, N) # we can compute the inverse here,
# but let's assume that Ai are different matrices in the for loop
bi = np.random.randn(N)
%timeit [solve(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 32.1 ms per loop
Afull = sp_block_diag([Ai]*M, format='csr')
bfull = np.tile(bi, M)
%timeit Afull = sp_block_diag([Ai]*M, format='csr')
%timeit spsolve(Afull, bfull)
# 1 loops, best of 3: 303 ms per loop
# 100 loops, best of 3: 5.55 ms per loop
Afull = block_diag(*[Ai]*M)
%timeit Afull = block_diag(*[Ai]*M)
%timeit solve(Afull, bfull)
# 100 loops, best of 3: 14.1 ms per loop
# 1 loops, best of 3: 23.6 s per loop
The solution of the linear system, with sparse arrays is faster, but the time to create this block diagonal array is actually very slow. As to dense arrays, they are simply slower in this case (and take lots of RAM).
Maybe I'm missing something about how to make this work efficiently with sparse arrays, but if you are keeping the for loops, there are two things that you could do for optimizations.
From pure python, look at the source code of scipy.linalg.solve : remove unnecessary tests and factor all repeated operations out of your loops. For instance, assuming your arrays are not symmetric positive definite, we could do
from numpy.linalg import LinAlgError
from scipy.linalg import get_lapack_funcs
gesv, = get_lapack_funcs(('gesv',), (Ai, bi))
def solve_opt(A, b, gesv=gesv):
    # not sure if copying A and B is necessary, but just in case (faster if arrays are not copied)
    lu, piv, x, info = gesv(A.copy(), b.copy(), overwrite_a=False, overwrite_b=False)
    if info == 0:
        return x
    if info > 0:
        raise LinAlgError("singular matrix")
    raise ValueError('illegal value in %d-th argument of internal gesv|posv' % -info)
%timeit [solve(Ai, bi) for el in range(M)]
%timeit [solve_opt(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 30.1 ms per loop
# 100 loops, best of 3: 3.77 ms per loop
which is roughly an 8x speedup (30.1 ms versus 3.77 ms).
If you need even better performance, you would have to port this for loop to Cython and call the gesv LAPACK function directly from C, as discussed here, or better, use the Cython API for BLAS/LAPACK available in Scipy 0.16.
Edit: As @Eelco Hoogendoorn mentioned, if your A matrix is fixed, there is a much simpler and more efficient approach.
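The fixed-A approach is not spelled out here, so the following is only a guess at what it could look like: stack the right-hand sides as columns and solve once, or factor A once with lu_factor and reuse the factorization with lu_solve. The matrix is random with a diagonal shift, purely to keep the example well conditioned, and the names B, X_all and lu_piv are illustrative.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, solve

N, M = 10, 1000
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N)) + N * np.eye(N)   # fixed matrix (assumed)
B = rng.standard_normal((N, M))                    # one right-hand side per column

# Option 1: a single call solves all M systems.
X_all = solve(A, B)

# Option 2: factor once, then reuse the factorization for later right-hand sides.
lu_piv = lu_factor(A)
X_lu = lu_solve(lu_piv, B)
x_later = lu_solve(lu_piv, B[:, 0])

print(np.allclose(X_all, X_lu))
```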

Efficient conversion of a 3D numpy array to a 1D numpy array

I have a 3D numpy array in this form:
>>> img.shape
(4504932, 2, 2)
>>> img
array([[[15114, 15306],
[15305, 15304]],
[[15305, 15306],
[15303, 15304]],
[[15305, 15306],
[15303, 15304]],
...,
[[15305, 15302],
[15305, 15302]]], dtype=uint16)
Which I want to convert to a 1D numpy array where each entry is the sum of each 2x2 submatrix in the above img numpy array.
I have been able to accomplish this using:
img_new = np.array([i.sum() for i in img])
>>> img_new
array([61029, 61218, 61218, ..., 61214, 61214, 61214], dtype=uint64)
Which is exactly what I want. But this is too slow (takes about 10 seconds). Is there a faster method I could use? I included above img.shape so you had an idea of the size of this numpy array.
EDIT - ADDITIONAL INFO:
My img array could also be a 3D array in the form of 4x4, 5x5, 7x7.. etc submatrices. This is specified by the variables sub_rows and sub_cols.

img.sum(axis=(1, 2))
sum allows you to specify an axis or axes along which to sum, rather than just summing the whole array. This allows NumPy to loop over the array in C and perform just a few machine instructions per sum, rather than having to go through the Python bytecode evaluation loop and create a ton of wrapper objects to stick in a list.
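Since the question mentions other sub-matrix sizes (sub_rows by sub_cols), here is a short sketch showing that the same call works unchanged for them, along with an equivalent reshape-based form. The random test data and the names sums_axes and sums_flat are assumptions made for illustration.

```python
import numpy as np

sub_rows, sub_cols = 4, 4
rng = np.random.default_rng(0)
img = rng.integers(0, 2**16, size=(1000, sub_rows, sub_cols), dtype=np.uint16)

# Sum over the last two axes, whatever their size.
sums_axes = img.sum(axis=(1, 2))

# Equivalent: flatten each sub-matrix into one row and sum along it.
sums_flat = img.reshape(len(img), -1).sum(axis=1)

print(np.array_equal(sums_axes, sums_flat))
```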

Using a numpy method (apply_over_axes) is usually quicker and indeed that is the case here. I just tested on a 4000x2x2 array:
img = np.random.rand(4000,2,2)
%timeit np.apply_over_axes(np.sum, img, [1,2])
# 1000 loops, best of 3: 721 us per loop
%timeit np.array([i.sum() for i in img])
# 100 loops, best of 3: 17.2 ms per loop

You can use np.einsum -
img_new = np.einsum('ijk->i',img)
Verify results
In [42]: np.array_equal(np.array([i.sum() for i in img]),np.einsum('ijk->i',img))
Out[42]: True
Runtime tests
In [34]: img = np.random.randint(0,10000,(10000,2,2)).astype('uint16')
In [35]: %timeit np.array([i.sum() for i in img]) # Original approach
10 loops, best of 3: 92.4 ms per loop
In [36]: %timeit img.sum(axis=(1, 2)) # From other solution
1000 loops, best of 3: 297 µs per loop
In [37]: %timeit np.einsum('ijk->i',img)
10000 loops, best of 3: 102 µs per loop
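One caveat worth checking on your own NumPy version: as far as I know, np.einsum keeps the dtype of its operands unless told otherwise, so with uint16 input a sum above 65535 could wrap around, whereas sum() accumulates in a wider integer by default. The sketch below passes an explicit dtype to einsum; the test values are made up to force large sums.

```python
import numpy as np

# Four values of 60000 per sub-matrix: the true sum is 240000, beyond uint16.
img = np.full((5, 2, 2), 60000, dtype=np.uint16)

via_sum = img.sum(axis=(1, 2))

# Ask einsum to accumulate in uint64 instead of the input dtype.
via_einsum = np.einsum('ijk->i', img, dtype=np.uint64)

print(via_sum, via_einsum)
```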

Fastest way to create an array in Python [duplicate]

This question already has answers here:
NumPy array initialization (fill with identical values)
(9 answers)
Closed 8 years ago.
I want to create a 3D array in Python, filled with -1.
I tested these methods:
import numpy as np
l = 200
b = 100
h = 30
%timeit grid = [[[-1 for x in range(l)] for y in range(b)] for z in range(h)]
1 loops, best of 3: 458 ms per loop
%timeit grid = -1 * np.ones((l, b, h), dtype=int)
10 loops, best of 3: 35.5 ms per loop
%timeit grid = np.zeros((l, b, h), dtype=int) - 1
10 loops, best of 3: 31.7 ms per loop
%timeit grid = -1.0 * np.ones((l, b, h), dtype=np.float32)
10 loops, best of 3: 42.1 ms per loop
%%timeit
grid = np.empty((l,b,h))
grid.fill(-1.0)
100 loops, best of 3: 13.7 ms per loop
So obviously, the last one is the fastest. Does anybody have an even faster method, or at least a less memory-intensive one? It has to run on a Raspberry Pi.

The only thing I can think to add is that any of these methods will be faster with the dtype argument chosen to take up as little memory as possible.
Assuming you need no more space than int8, the method suggested by @RutgerKassies in the comments took this long on my system:
%timeit grid = np.full((l, b, h), -1, dtype=np.int8)
1000 loops, best of 3: 286 µs per loop
For comparison, not specifying dtype (defaulting to int32) took about 10 times longer with the same method:
%timeit grid = np.full((l, b, h), -1)
100 loops, best of 3: 3.61 ms per loop
Your fastest method was about as fast as np.full (sometimes beating it):
%%timeit
grid = np.empty((l,b,h))
grid.fill(-1)
100 loops, best of 3: 3.51 ms per loop
or, with dtype specified as int8,
1000 loops, best of 3: 255 µs per loop
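To see the memory side of the dtype choice, which matters on a small device, a quick sketch can compare the nbytes attribute of the same grid in a few dtypes. The variable names are illustrative.

```python
import numpy as np

l, b, h = 200, 100, 30

grid_int8 = np.full((l, b, h), -1, dtype=np.int8)
grid_f32 = np.full((l, b, h), -1, dtype=np.float32)
grid_f64 = np.full((l, b, h), -1, dtype=np.float64)

# nbytes = number of elements * bytes per element
for name, grid in [("int8", grid_int8), ("float32", grid_f32), ("float64", grid_f64)]:
    print(name, grid.nbytes)
```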
Edit: This is probably cheating, but, well...
%timeit grid = np.lib.stride_tricks.as_strided(np.array(-1, dtype=np.int8), (l, b, h), (0, 0, 0))
100000 loops, best of 3: 12.4 us per loop
All that's happening here is that we begin with an array of length one, np.array([-1]), and then fiddle with the stride lengths so that grid looks exactly like an array with the required dimensions.
If you need an actual array, you can use grid = grid.copy(); this makes the creation of the grid array about as fast as the quickest approaches suggested elsewhere on this page.
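A similar zero-stride view can also be obtained with np.broadcast_to, which is not what was timed above but avoids setting strides by hand and returns a read-only array. A minimal sketch, with illustrative names:

```python
import numpy as np

l, b, h = 200, 100, 30

# Read-only view in which every element refers to the same single int8 value.
view = np.broadcast_to(np.int8(-1), (l, b, h))
print(view.shape, view.strides, view.flags.writeable)

# A real, writable array that owns l * b * h bytes.
real = view.copy()
real[0, 0, 0] = 5
print(real.nbytes, real[0, 0, 0], real[0, 0, 1])
```

The read-only flag guards against accidentally writing through the view, which would otherwise appear to change every element at once.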

This is a little faster for me. Might be different on a RPi though.
grid = np.empty((l,b,h))
grid[...] = -1
np.int8 is much faster if it's big enough for you
grid = np.empty((l,b,h), dtype=np.int8)
grid[...] = -1
%%timeit
grid = np.empty((l,b,h), dtype=np.int8)
grid[:] = -1
100 loops, best of 3: 6.91 ms per loop
