Vectorize large NumPy multiplication - python

I want to perform a computation over a large NumPy array. I have a large array A which contains a bunch of numbers, and I want to calculate sums of different combinations of these numbers. The structure of the data is as follows:
A = np.random.uniform(0,1, (3743, 1388, 3))
Combinations = np.random.randint(0,3, (306,3))
Final_Product = np.array([ np.sum( A*cb, axis=2) for cb in Combinations])
My question is whether there is a more elegant and memory-efficient way to calculate this. I find it frustrating to work with np.dot() when a 3-D array is involved.
If it helps, the shape of Final_Product ideally should be (3743, 306, 1388). Currently Final_Product has shape (306, 3743, 1388), so I can just swap the first two axes to get there.

np.dot() won't give you the desired output unless you add extra step(s), probably including reshaping. Here's one vectorized approach using np.einsum to do it in one shot, without any extra memory overhead -
Final_Product = np.einsum('ijk,lk->lij',A,Combinations)
For completeness, here's with np.dot and reshaping as discussed earlier -
M,N,R = A.shape
Final_Product = A.reshape(-1,R).dot(Combinations.T).T.reshape(-1,M,N)
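An aside that the original timings below don't cover: on NumPy 1.12 and later, np.einsum accepts an optimize flag, which lets it pick a better contraction order and dispatch to BLAS where possible, often closing much of the gap with the dot-based version -
Final_Product = np.einsum('ijk,lk->lij', A, Combinations, optimize=True)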
Runtime tests and output verification -
In [138]: # Inputs ( smaller version of those listed in question )
...: A = np.random.uniform(0,1, (374, 138, 3))
...: Combinations = np.random.randint(0,3, (30,3))
...:
In [139]: %timeit np.array([ np.sum( A*cb, axis=2) for cb in Combinations])
1 loops, best of 3: 324 ms per loop
In [140]: %timeit np.einsum('ijk,lk->lij',A,Combinations)
10 loops, best of 3: 32 ms per loop
In [141]: M,N,R = A.shape
In [142]: %timeit A.reshape(-1,R).dot(Combinations.T).T.reshape(-1,M,N)
100 loops, best of 3: 15.6 ms per loop
In [143]: Final_Product =np.array([np.sum( A*cb, axis=2) for cb in Combinations])
...: Final_Product2 = np.einsum('ijk,lk->lij',A,Combinations)
...: M,N,R = A.shape
...: Final_Product3 = A.reshape(-1,R).dot(Combinations.T).T.reshape(-1,M,N)
...:
In [144]: print np.allclose(Final_Product,Final_Product2)
True
In [145]: print np.allclose(Final_Product,Final_Product3)
True

Instead of dot you could use tensordot. Your current method is equivalent to:
np.tensordot(A, Combinations, [2, 1]).transpose(2, 0, 1)
Note the transpose at the end to put the axes in the correct order.
Like dot, the tensordot function can call down to the fast BLAS/LAPACK libraries (if you have them installed) and so should perform well for large arrays.
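As a quick sanity check (not part of the original answer), here is the tensordot version verified against the loop from the question, on smaller arrays so it runs quickly -
import numpy as np

A = np.random.uniform(0, 1, (374, 138, 3))
Combinations = np.random.randint(0, 3, (30, 3))

loop_result = np.array([np.sum(A*cb, axis=2) for cb in Combinations])
td_result = np.tensordot(A, Combinations, [2, 1]).transpose(2, 0, 1)
print(np.allclose(loop_result, td_result))  # True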

Related

Numpy applying a condition on each element of an array [duplicate]

This sounds simple, and I think I'm overcomplicating this in my mind.
I want to make an array whose elements are generated from two source arrays of the same shape, depending on which element in the source arrays is greater.
To illustrate:
import numpy as np
array1 = np.array((2,3,0))
array2 = np.array((1,5,0))
array3 = (insert magic)
>> array([2, 5, 0])
I can't work out how to produce an array3 that combines the elements of array1 and array2, taking only the greater of the two element values at each position.
Any help would be much appreciated. Thanks.
We could use NumPy's built-in np.maximum, made exactly for this purpose -
np.maximum(array1, array2)
Another way would be to stack the two arrays into a 2D array and take np.max along the first axis (axis=0) -
np.max([array1,array2],axis=0)
Timings on arrays with 1 million elements -
In [271]: array1 = np.random.randint(0,9,(1000000))
In [272]: array2 = np.random.randint(0,9,(1000000))
In [274]: %timeit np.maximum(array1, array2)
1000 loops, best of 3: 1.25 ms per loop
In [275]: %timeit np.max([array1, array2],axis=0)
100 loops, best of 3: 3.31 ms per loop
# @Eric Duminil's soln1
In [276]: %timeit np.where( array1 > array2, array1, array2)
100 loops, best of 3: 5.15 ms per loop
# @Eric Duminil's soln2
In [277]: magic = lambda x,y : np.where(x > y , x, y)
In [278]: %timeit magic(array1, array2)
100 loops, best of 3: 5.13 ms per loop
Extending to other supporting ufuncs
Similarly, there's np.minimum for finding the element-wise minimum between two arrays of the same (or broadcastable) shapes. So, to find the element-wise minimum between array1 and array2, we would have:
np.minimum(array1, array2)
For a complete list of ufuncs that support this feature, please refer to the docs and look for the keyword element-wise. Grepping for it, I got the following ufuncs:
add, subtract, multiply, divide, logaddexp, logaddexp2, true_divide,
floor_divide, power, remainder, mod, fmod, divmod, heaviside, gcd,
lcm, arctan2, hypot, bitwise_and, bitwise_or, bitwise_xor, left_shift,
right_shift, greater, greater_equal, less, less_equal, not_equal,
equal, logical_and, logical_or, logical_xor, maximum, minimum, fmax,
fmin, copysign, nextafter, ldexp, fmod
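A related aside, beyond the original answer: since np.maximum is a binary ufunc, its reduce method extends the element-wise maximum to any number of arrays without explicit stacking -
import numpy as np

a = np.array([2, 3, 0])
b = np.array([1, 5, 0])
c = np.array([4, 1, 2])
np.maximum.reduce([a, b, c])  # array([4, 5, 2])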
If your condition ever becomes more complex, you could use np.where:
import numpy as np
array1 = np.array((2,3,0))
array2 = np.array((1,5,0))
array3 = np.where( array1 > array2, array1, array2)
# array([2, 5, 0])
You could replace array1 > array2 with any condition. If all you want is the maximum, go with @Divakar's answer.
And just for fun:
magic = lambda x,y : np.where(x > y , x, y)
magic(array1, array2)
# array([2, 5, 0])
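To illustrate the "any condition" point with a made-up example, here is a variation that takes the value from array1 only where it is both greater and even, falling back to array2 otherwise -
array3 = np.where((array1 > array2) & (array1 % 2 == 0), array1, array2)
# array([2, 5, 0])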

Efficient Kronecker product with identity matrix and regular matrix - NumPy/ Python

I am working on a Python project and making use of NumPy. I frequently have to compute Kronecker products of matrices with the identity matrix. These are a pretty big bottleneck in my code, so I would like to optimize them. There are two kinds of products I have to take. The first one is:
np.kron(np.eye(N), A)
This one is pretty easy to optimize by simply using scipy.linalg.block_diag. The product is equivalent to:
la.block_diag(*[A]*N)
which is about 10 times faster. However, I am unsure how to optimize the second kind of product:
np.kron(A, np.eye(N))
Is there a similar trick I can use?
One approach would be to initialize a 4D output array and then assign values into it from A. Such an assignment broadcasts values, and that is where we gain efficiency in NumPy.
Thus, a solution would be like so -
# Get shape of A
m,n = A.shape
# Initialize output array as 4D
out = np.zeros((m,N,n,N))
# Get range array for indexing into the second and fourth axes
r = np.arange(N)
# Index into the second and fourth axes, selecting all elements along
# the rest, to assign values from A. The values are broadcast.
out[:,r,:,r] = A
# Finally reshape back to 2D
out.shape = (m*N,n*N)
Put as a function -
def kron_A_N(A, N):  # simulates np.kron(A, np.eye(N))
    m,n = A.shape
    out = np.zeros((m,N,n,N),dtype=A.dtype)
    r = np.arange(N)
    out[:,r,:,r] = A
    out.shape = (m*N,n*N)
    return out
To simulate np.kron(np.eye(N), A) instead, simply swap the roles of the first and second axes, and likewise the third and fourth -
def kron_N_A(A, N):  # simulates np.kron(np.eye(N), A)
    m,n = A.shape
    out = np.zeros((N,m,N,n),dtype=A.dtype)
    r = np.arange(N)
    out[r,:,r,:] = A
    out.shape = (m*N,n*N)
    return out
Timings -
In [174]: N = 100
...: A = np.random.rand(100,100)
...:
In [175]: np.allclose(np.kron(A, np.eye(N)), kron_A_N(A,N))
Out[175]: True
In [176]: %timeit np.kron(A, np.eye(N))
1 loops, best of 3: 458 ms per loop
In [177]: %timeit kron_A_N(A, N)
10 loops, best of 3: 58.4 ms per loop
In [178]: 458/58.4
Out[178]: 7.842465753424658
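For completeness, a similar check (not in the original answer) confirms the other ordering against both np.kron and the block_diag trick from the question -
import numpy as np
from scipy.linalg import block_diag

N = 10
A = np.random.rand(8, 8)
print(np.allclose(np.kron(np.eye(N), A), kron_N_A(A, N)))  # True
print(np.allclose(block_diag(*[A]*N), kron_N_A(A, N)))     # True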

Quickly compute eigenvectors for each element of an array in python

I want to compute eigenvectors for an array of data (in my actual case, a cloud of polygons).
To do so I wrote this function:
import numpy as np
def eigen(data):
    eigenvectors = []
    eigenvalues = []
    for d in data:
        # compute covariance for each triangle
        cov = np.cov(d, ddof=0, rowvar=False)
        # compute eigenvectors
        vals, vecs = np.linalg.eig(cov)
        eigenvalues.append(vals)
        eigenvectors.append(vecs)
    return np.array(eigenvalues), np.array(eigenvectors)
Running this on some test data:
import cProfile
triangles = np.random.random((10**4,3,3,)) # 10k 3D triangles
cProfile.run('eigen(triangles)') # 550005 function calls in 0.933 seconds
This works fine, but it gets very slow because of the iteration loop. Is there a faster way to compute the data I need without iterating over the array? If not, can anyone suggest ways to speed it up?
Hack It!
Well, I hacked into the covariance function's definition and plugged in the stated inputs, ddof=0 and rowvar=False; as it turns out, everything reduces to just three lines -
nC = m.shape[0] # m is the 2D input array; nC is the number of observations (rows)
X = m - m.mean(0)
out = np.dot(X.T, X)/nC
To extend it to our 3D array case, I wrote down a loopy version that applies these three lines to each 2D slice of the 3D input array, like so -
for i,d in enumerate(m):
    # Using np.cov :
    org_cov = np.cov(d, ddof=0, rowvar=False)
    # Using the earlier hacked 2D version :
    nC = m[i].shape[0]
    X = m[i] - m[i].mean(0,keepdims=True)
    hacked_cov = np.dot(X.T, X)/nC
    assert np.allclose(org_cov, hacked_cov)  # the two agree
Boost-it-up
We need to speed up those last three lines. The computation of X across all iterations can be done with broadcasting -
diffs = data - data.mean(1,keepdims=True)
Next up, the dot-product calculation for all iterations could be done with transpose and np.dot, but that transpose could be a costly affair for such a multi-dimensional array. A better alternative exists in np.einsum, like so -
cov3D = np.einsum('ijk,ijl->ikl',diffs,diffs)/data.shape[1]
Use it!
To sum up:
for d in data:
    # compute covariance for each triangle
    cov = np.cov(d, ddof=0, rowvar=False)
Could be pre-computed like so :
diffs = data - data.mean(1,keepdims=True)
cov3D = np.einsum('ijk,ijl->ikl',diffs,diffs)/data.shape[1]
These pre-computed values can then be used across iterations to compute the eigenvectors, like so -
for i,d in enumerate(data):
    # Directly use pre-computed covariances for each triangle
    vals, vecs = np.linalg.eig(cov3D[i])
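A further note, beyond the original answer: since NumPy 1.8, np.linalg.eig broadcasts over leading dimensions, so even that remaining loop can go away -
# cov3D has shape (n, 3, 3); eig operates on each 3x3 block at once
vals, vecs = np.linalg.eig(cov3D)  # vals: (n, 3), vecs: (n, 3, 3)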
Test It!
Here are some runtime tests to assess the effect of pre-computing covariance results -
In [148]: def original_app(data):
...:     cov = np.empty(data.shape)
...:     for i,d in enumerate(data):
...:         # compute covariance for each triangle
...:         cov[i] = np.cov(d, ddof=0, rowvar=False)
...:     return cov
...:
...: def vectorized_app(data):
...:     diffs = data - data.mean(1,keepdims=True)
...:     return np.einsum('ijk,ijl->ikl',diffs,diffs)/data.shape[1]
...:
In [149]: data = np.random.randint(0,10,(1000,3,3))
In [150]: np.allclose(original_app(data),vectorized_app(data))
Out[150]: True
In [151]: %timeit original_app(data)
10 loops, best of 3: 64.4 ms per loop
In [152]: %timeit vectorized_app(data)
1000 loops, best of 3: 1.14 ms per loop
In [153]: data = np.random.randint(0,10,(5000,3,3))
In [154]: np.allclose(original_app(data),vectorized_app(data))
Out[154]: True
In [155]: %timeit original_app(data)
1 loops, best of 3: 324 ms per loop
In [156]: %timeit vectorized_app(data)
100 loops, best of 3: 5.67 ms per loop
I don't know how much of a speed-up you can actually achieve.
Here is a slight modification that can help a little:
%timeit -n 10 values, vectors = \
eigen(triangles)
10 loops, best of 3: 745 ms per loop
%timeit values, vectors = \
zip(*(np.linalg.eig(np.cov(d, ddof=0, rowvar=False)) for d in triangles))
10 loops, best of 3: 705 ms per loop

How to speed up a vector cross product calculation

Hi, I'm relatively new here and trying to do some calculations with NumPy. I'm seeing a long elapsed time for one particular calculation and can't work out any faster way to achieve the same thing.
Basically, it's part of a ray-triangle intersection algorithm, and I need to calculate all the vector cross products between two matrices of different sizes.
The code I was using was :
allhvals1 = numpy.cross( dirvectors[:,None,:], trivectors2[None,:,:] )
where dirvectors is an array of n vectors (xyz) and trivectors2 is an array of m vectors (xyz). allhvals1 is the resulting n*m array of cross products (xyz).
This works but is very slow. It's essentially the n*m matrix of cross products between each vector from each array. I hope that makes sense. The sizes of each vary from approx. 1 to 4000 depending on parameters (I basically chunk dirvectors depending on size).
Any advice appreciated. Unfortunately my matrix math is somewhat flaky.
If you look at the source code of np.cross, it basically moves the xyz dimension to the front of the shape tuple for all arrays, and then has the calculation of each of the components spelled out like this:
x = a[1]*b[2] - a[2]*b[1]
y = a[2]*b[0] - a[0]*b[2]
z = a[0]*b[1] - a[1]*b[0]
In your case, each of those products requires allocating huge arrays, so the overall behavior is not very efficient.
Let's set up some test data:
u = np.random.rand(1000, 3)
v = np.random.rand(2000, 3)
In [13]: %timeit s1 = np.cross(u[:, None, :], v[None, :, :])
1 loops, best of 3: 591 ms per loop
We can try to compute it using Levi-Civita symbols and np.einsum as follows:
eijk = np.zeros((3, 3, 3))
eijk[0, 1, 2] = eijk[1, 2, 0] = eijk[2, 0, 1] = 1
eijk[0, 2, 1] = eijk[2, 1, 0] = eijk[1, 0, 2] = -1
In [14]: %timeit s2 = np.einsum('ijk,uj,vk->uvi', eijk, u, v)
1 loops, best of 3: 706 ms per loop
In [15]: np.allclose(s1, s2)
Out[15]: True
So while it works, its performance is worse. The thing is that np.einsum has trouble when there are more than two operands, but has optimized pathways for two or fewer. So we can try to rewrite it in two steps, to see if that helps:
In [16]: %timeit s3 = np.einsum('iuk,vk->uvi', np.einsum('ijk,uj->iuk', eijk, u), v)
10 loops, best of 3: 63.4 ms per loop
In [17]: np.allclose(s1, s3)
Out[17]: True
Bingo! Close to an order of magnitude improvement...
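Packaged as a small convenience wrapper around the two-step einsum above (the helper name is mine, not from the original answer) -
import numpy as np

# Levi-Civita symbol, as defined above
eijk = np.zeros((3, 3, 3))
eijk[0, 1, 2] = eijk[1, 2, 0] = eijk[2, 0, 1] = 1
eijk[0, 2, 1] = eijk[2, 1, 0] = eijk[1, 0, 2] = -1

def pairwise_cross(u, v):
    # result[a, b] == np.cross(u[a], v[b]) for all pairs
    return np.einsum('iuk,vk->uvi', np.einsum('ijk,uj->iuk', eijk, u), v)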
Some performance figures for NumPy 1.11.0, with a = numpy.random.rand(n, 3) and b = numpy.random.rand(n, 3): the nested einsum is about twice as fast as cross for the largest n tested.
While writing dynamic simulations for underwater vehicles, I found this method for a fast cross product:
https://github.com/simena86/Simulink-Underwater-Robotics-Simulator/blob/master/3rdparty/gnc_mfiles/Smtrx.m
It works well; it is written in MATLAB, but the code is very simple. Just read the comments at the top.

Using numpy.take for faster fancy indexing

EDIT I have kept the more complicated problem I am facing below, but my problems with np.take can be summarized better as follows. Say you have an array img of shape (planes, rows), and another array lut of shape (planes, 256), and you want to use them to create a new array out of shape (planes, rows), where out[p,j] = lut[p, img[p, j]]. This can be achieved with fancy indexing as follows:
In [4]: %timeit lut[np.arange(planes).reshape(-1, 1), img]
1000 loops, best of 3: 471 us per loop
But if, instead of fancy indexing, you use take and a python loop over the planes things can be sped up tremendously:
In [6]: %timeit for _ in (lut[j].take(img[j]) for j in xrange(planes)) : pass
10000 loops, best of 3: 59 us per loop
Can lut and img be in someway rearranged, so as to have the whole operation happen without python loops, but using numpy.take (or an alternative method) instead of conventional fancy indexing to keep the speed advantage?
ORIGINAL QUESTION
I have a set of look-up tables (LUTs) that I want to use on an image. The array holding the LUTs is of shape (planes, 256, n), and the image has shape (planes, rows, cols). Both are of dtype = 'uint8', matching the 256 axis of the LUT. The idea is to run the p-th plane of the image through each of the n LUTs from the p-th plane of the LUT.
If my lut and img are the following:
planes, rows, cols, n = 3, 4000, 4000, 4
lut = np.random.randint(-2**31, 2**31 - 1,
                        size=(planes * 256 * n // 4,)).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31 - 1,
                        size=(planes * rows * cols // 4,)).view('uint8')
img = img.reshape(planes, rows, cols)
I can achieve what I am after using fancy indexing like this
out = lut[np.arange(planes).reshape(-1, 1, 1), img]
which gives me an array of shape (planes, rows, cols, n), where out[i, :, :, j] holds the i-th plane of img run through the j-th LUT of the i-th plane of the LUT...
All is good, except for this:
In [2]: %timeit lut[np.arange(planes).reshape(-1, 1, 1), img]
1 loops, best of 3: 5.65 s per loop
which is completely unacceptable, especially since I have all of the following not-so-nice-looking alternatives using np.take that run much faster:
A single LUT on a single plane runs about x70 faster:
In [2]: %timeit np.take(lut[0, :, 0], img[0])
10 loops, best of 3: 78.5 ms per loop
A python loop running through all the desired combinations finishes almost x6 faster:
In [2]: %timeit for _ in (np.take(lut[j, :, k], img[j]) for j in xrange(planes) for k in xrange(n)) : pass
1 loops, best of 3: 947 ms per loop
Even running all combinations of planes in the LUT and image and then discarding the planes**2 - planes unwanted ones is faster than fancy indexing:
In [2]: %timeit np.take(lut, img, axis=1)[np.arange(planes), np.arange(planes)]
1 loops, best of 3: 3.79 s per loop
And the fastest combination I have been able to come up with has a python loop iterating over the planes and finishes x13 faster:
In [2]: %timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
1 loops, best of 3: 434 ms per loop
The question, of course, is whether there is a way of doing this with np.take without any Python loop. Ideally, whatever reshaping or resizing is needed should happen on the LUT, not the image, but I am open to whatever you people can come up with...
First of all, I have to say I really liked your question. Without rearranging LUT or IMG, the following solution worked:
%timeit a=np.take(lut, img, axis=1)
# 1 loops, best of 3: 1.93s per loop
But from the result you have to take the diagonal, a[0,0], a[1,1], a[2,2], to get what you want. I've tried to find a way to do this indexing only for the diagonal elements, but have not managed so far.
Here are some ways to rearrange your LUT and IMG:
The following works if the indexes in IMG are 0-255 for the 1st plane, 256-511 for the 2nd plane, and 512-767 for the 3rd plane, but that would prevent you from using 'uint8', which could be a big issue...:
lut2 = lut.reshape(-1,4)
%timeit np.take(lut2,img,axis=0)
# 1 loops, best of 3: 716 ms per loop
# or
%timeit np.take(lut2, img.flatten(), axis=0).reshape(3,4000,4000,4)
# 1 loops, best of 3: 709 ms per loop
On my machine your solution is still the best option, and very adequate, since you just need the diagonal evaluations, i.e. plane1-plane1, plane2-plane2 and plane3-plane3:
%timeit for _ in (np.take(lut[j], img[j], axis=0) for j in xrange(planes)) : pass
# 1 loops, best of 3: 677 ms per loop
I hope this gives you some insight towards a better solution. It would be nice to explore more options using flatten() and methods such as np.apply_over_axes() or np.apply_along_axis(), which seem promising.
I used the code below to generate the data:
import numpy as np
num = 4000
planes, rows, cols, n = 3, num, num, 4
lut = np.random.randint(-2**31, 2**31-1,size=(planes*256*n//4,)).view('uint8')
lut = lut.reshape(planes, 256, n)
img = np.random.randint(-2**31, 2**31-1,size=(planes*rows*cols//4,)).view('uint8')
img = img.reshape(planes, rows, cols)
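A present-day footnote, beyond what was available when this was asked: NumPy 1.15 added np.take_along_axis, which expresses the simplified (planes, rows) lookup from the edit directly, with no fancy-index broadcasting and no Python loop -
import numpy as np

planes, rows = 3, 5
lut = np.random.randint(0, 256, size=(planes, 256)).astype('uint8')
img = np.random.randint(0, 256, size=(planes, rows))

out = np.take_along_axis(lut, img, axis=1)  # out[p, j] == lut[p, img[p, j]]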
