Broadcasting only with specific dimensions of ndarray in python

Consider a TxFxM ndarray. I wish to multiply it with its conjugate, only for the M dimension while keeping the other dimensions the same as presented in the following code:
import numpy as np
T=2
F=3
M=4
x=np.random.rand(T,F,M)
result=np.zeros((T,F,M,M))
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        result[i,j,:,:]=np.matmul(np.expand_dims(x[i,j,:],axis=1),np.expand_dims(x[i,j,:],axis=0).conj())
If I simply use broadcasting, as in np.matmul(x,x.conj().T), the broadcast operation continues into the higher dimensions and keeps multiplying. On the other hand, my implementation is very slow because of the two loops and, to my understanding, very unpythonic.
Is there a way to implement this so that it runs faster?
P.S.
My real dimensions are obviously larger (T=3000, F=1024, M=4), and this operation is repeated many times, hence my requirement for a fast implementation.
I plan to average this over dimension T, so if there is a faster total implementation I would be very interested.

The array you need can be computed with broadcasting if you inject singleton dimensions in two different places for x and x.conj(). If x has shape (T,F,M) then arrays of shape (T,F,M,1) and (T,F,1,M) will broadcast to (T,F,M,M) just the way you want it. Here's your example with complex data to make sure we're not missing something:
import numpy as np
T,F,M = 2,3,4
x = np.random.rand(T,F,M) + np.random.rand(T,F,M)*1j
result = np.zeros((T,F,M,M), dtype=complex)
# loop
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        result[i,j,:,:] = np.matmul(np.expand_dims(x[i,j,:],axis=1),
                                    np.expand_dims(x[i,j,:],axis=0).conj())
# broadcasting
result2 = x[..., None] * x[..., None, :].conj()
# proof
print(np.array_equal(result, result2)) # True
Since you mentioned that you want to take a mean along the T-sized dimension, we have to consider whether it's worth putting this dimension last, so that the mean makes use of as contiguous blocks of memory as possible. This means the following options:
def summed_original(x):
    """Assume x.shape == (T, F, M), return array of shape (F, M, M) averaged over T"""
    return (x[..., None] * x[..., None, :].conj()).mean(0)
def summed_transposed(x):
    """Assume x.shape == (F, M, T), return array of shape (F, M, M) averaged over T"""
    return (x[..., None, :] * x[:, None, ...].conj()).mean(-1)
x_transposed = x.transpose(1, 2, 0).copy() # ensure contiguous copy
print(np.allclose(summed_original(x), summed_transposed(x_transposed))) # True
As you can see these two functions compute the same thing, but they assume the input to have different memory order. The reason why this is important is because it might prove faster to have your original array in a different memory layout (at the cost of transposing and copying it once at the start, if need be).
So let's time them using IPython's %timeit magic and your real sizes:
T,F,M = 3000,1024,4
x = np.random.rand(T, F, M) + np.random.rand(T, F, M)*1j
x_transposed = x.transpose(1, 2, 0).copy()
print(np.allclose(summed_original(x), summed_transposed(x_transposed))) # True
%timeit summed_original(x)
# 500 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit summed_transposed(x_transposed)
# 352 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, for your specific sizes and types it seems well worth rearranging the dimensions of your array so that the T dimension corresponds to contiguous blocks of memory, aiding caching in the CPU. You can either do this with a .transpose(...).copy() call at the start or, even better, construct your array that way in the first place, which avoids the extra copy.
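Since only the average over T is needed in the end, another option is to let np.einsum do the multiplication and the reduction in one call, so that the (T,F,M,M) intermediate is never built. The following is a minimal sketch, not a timed recommendation: the function name mean_outer_einsum is made up for this illustration, and whether it beats the two variants above for your sizes would have to be measured.

```python
import numpy as np

def mean_outer_einsum(x):
    """Assume x.shape == (T, F, M); return the (F, M, M) average over T."""
    # Sum x[t, f, m] * conj(x[t, f, n]) over t, then divide by T.
    return np.einsum('tfm,tfn->fmn', x, x.conj()) / x.shape[0]

rng = np.random.default_rng(0)
x = rng.random((5, 3, 4)) + 1j * rng.random((5, 3, 4))

# Reference: the broadcasting expression from this answer.
reference = (x[..., None] * x[..., None, :].conj()).mean(0)
print(np.allclose(mean_outer_einsum(x), reference))
```

The subscripts 'tfm,tfn->fmn' keep f, m and n in the output and sum over the repeated index t, which is the same contraction as the broadcasting version followed by the mean.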

Related

K-Means: assign clusters to new data points

I've implemented a k-means clustering algorithm in python, and now I want to label new data with the clusters I got from my algorithm. My approach is to iterate through every data point and every centroid to find the minimum distance and the centroid associated with it. But I wonder if there are simpler or shorter ways to do it.
def assign_cluster(clusterDict, data):
    clusterList = []
    label = []
    cen = list(clusterDict.values())
    for i in range(len(data)):
        for j in range(len(cen)):
            # if cen[j] has the minimum distance with data[i]
            # then clusterList[i] = cen[j]
Where clusterDict is a dictionary with keys as labels, [0,1,2,....] and values as coordinates of centroids.
Can someone help me implement this?
This is a good use case for numba, because it lets you express this as a simple double loop without a big performance penalty, which in turn allows you to avoid the excessive extra memory of using np.tile to replicate the data across a third dimension just to do it in a vectorized manner.
Borrowing the standard vectorized numpy implementation from the other answer, I have these two implementations:
import numba
import numpy as np
def kmeans_assignment(centroids, points):
    num_centroids, dim = centroids.shape
    num_points, _ = points.shape
    # Tile and reshape both arrays into `[num_points, num_centroids, dim]`.
    centroids = np.tile(centroids, [num_points, 1]).reshape([num_points, num_centroids, dim])
    points = np.tile(points, [1, num_centroids]).reshape([num_points, num_centroids, dim])
    # Compute all distances (for all points and all centroids) at once and
    # select the min centroid for each point.
    distances = np.sum(np.square(centroids - points), axis=2)
    return np.argmin(distances, axis=1)
@numba.jit
def kmeans_assignment2(centroids, points):
    P, C = points.shape[0], centroids.shape[0]
    distances = np.zeros((P, C), dtype=np.float32)
    for p in range(P):
        for c in range(C):
            distances[p, c] = np.sum(np.square(centroids[c] - points[p]))
    return np.argmin(distances, axis=1)
Then for some sample data, I did a few timing experiments:
In [12]: points = np.random.rand(10000, 50)
In [13]: centroids = np.random.rand(30, 50)
In [14]: %timeit kmeans_assignment(centroids, points)
196 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [15]: %timeit kmeans_assignment2(centroids, points)
127 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I won't go so far as to say that the numba version is certainly faster than the np.tile version, but clearly it's very close while not incurring the extra memory cost of np.tile.
In fact, I noticed for my laptop that when I make the shapes larger and use (10000, 1000) for the shape of points and (200, 1000) for the shape of centroids, then np.tile generated a MemoryError, meanwhile the numba function runs in under 5 seconds with no memory error.
Separately, I actually noticed a slowdown when using numba.jit on the first version (with np.tile), which is likely due to the extra array creation inside the jitted function combined with the fact that there's not much numba can optimize when you're already calling all vectorized functions.
And I also did not notice any significant improvement in the second version when trying to shorten the code by using broadcasting. E.g. shortening the double loop to be
for p in range(P):
    distances[p, :] = np.sum(np.square(centroids - points[p, :]), axis=1)
did not really help anything (and would use more memory when repeatedly broadcasting points[p, :] across all of centroids).
This is one of the really nice benefits of numba. You really can write the algorithms in a very straightforward, loop-based way that comports with standard descriptions of algorithms and allows finer point of control over how the syntax unpacks into memory consumption or broadcasting... all without giving up runtime performance.
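For readers who cannot use numba, the memory concern raised in this answer can also be addressed in plain NumPy by expanding the squared distance as ||p - c||^2 = ||p||^2 - 2 p.c + ||c||^2, so that only a (P, C) matrix is ever allocated. The sketch below is an illustration under that assumption, not something benchmarked here; the name kmeans_assignment_gram is hypothetical.

```python
import numpy as np

def kmeans_assignment_gram(centroids, points):
    """Return, for each row of points, the index of the nearest centroid."""
    # Cross term p.c for every (point, centroid) pair: shape (P, C).
    cross = points @ centroids.T
    # ||c||^2 for every centroid: shape (C,).
    centroid_sq = np.sum(centroids ** 2, axis=1)
    # ||p||^2 is the same for all centroids of a given point, so it does
    # not change the argmin and is left out.
    scores = centroid_sq[None, :] - 2.0 * cross
    return np.argmin(scores, axis=1)

centroids = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 5.0]])
points = np.array([[0.5, -0.2], [9.0, 11.0], [-9.5, 4.0], [1.0, 1.0]])
print(kmeans_assignment_gram(centroids, points))
```

One caveat: because of floating-point cancellation, points that are almost exactly equidistant from two centroids may be assigned differently than with the direct difference-based computation.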
An efficient way to perform the assignment phase is vectorized computation. This approach assumes that you start with two 2D arrays, points and centroids, with the same number of columns (the dimensionality of the space) but possibly a different number of rows. By using tiling (np.tile) we can then compute the distance matrix in one batch and select the closest cluster for each point.
Here's the code:
def kmeans_assignment(centroids, points):
    num_centroids, dim = centroids.shape
    num_points, _ = points.shape
    # Tile and reshape both arrays into `[num_points, num_centroids, dim]`.
    centroids = np.tile(centroids, [num_points, 1]).reshape([num_points, num_centroids, dim])
    points = np.tile(points, [1, num_centroids]).reshape([num_points, num_centroids, dim])
    # Compute all distances (for all points and all centroids) at once and
    # select the min centroid for each point.
    distances = np.sum(np.square(centroids - points), axis=2)
    return np.argmin(distances, axis=1)
See this GitHub gist for a complete runnable example.
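To connect this back to the question's clusterDict, here is a minimal sketch that converts the dictionary into arrays and lets NumPy broadcasting take the place of np.tile. It assumes, as in the question, that the dictionary maps labels to centroid coordinates; the sample values are invented for illustration. Note that the (P, C, dim) difference array is still allocated, so the memory footprint is comparable to the tiled version.

```python
import numpy as np

def assign_cluster(cluster_dict, data):
    """Return the label of the nearest centroid for every row of data."""
    labels = np.array(list(cluster_dict.keys()))
    centroids = np.array(list(cluster_dict.values()), dtype=float)  # (C, dim)
    data = np.asarray(data, dtype=float)                            # (P, dim)
    # Singleton axes make the two arrays broadcast to (P, C, dim).
    diff = data[:, None, :] - centroids[None, :, :]
    distances = np.sum(diff ** 2, axis=2)                           # (P, C)
    return labels[np.argmin(distances, axis=1)]

# Hypothetical centroids and new points, for illustration only.
cluster_dict = {0: [0.0, 0.0], 1: [5.0, 5.0], 2: [0.0, 5.0]}
new_points = [[0.2, -0.1], [4.8, 5.3], [0.1, 4.0]]
print(assign_cluster(cluster_dict, new_points))
```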

Creating an euclidean distance matrix of tensors

I have 10000 matrices with the shape (32, 32, 3). I want to create a Euclidean distance matrix between all of them. In the end, it should look like this:
[0, d2, d3, d4, ...]
[d1, 0, d3, d4, ...]
[d1, d2, 0, d4, ...]
[d1, d2, d3, 0, ...]
How can I do this in the fastest way? I have tried the following, but it takes ages to finish.
import numpy as np
dists = []
for a in range(len(X_test)):
    dists.append([])
    for b in range(len(X_test)):
        dists[a].append(np.linalg.norm(X_test[a] - X_test[b]))
print(dists)
You can cut the time in half by exploiting the fact that the distance matrix is symmetric and computing only the upper triangular portion, using
for b in range(a+1, len(X_test)):
on line 5.
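Changing only the inner range leaves the rows with different lengths, so the result also has to be stored differently. A minimal sketch of the symmetric version, assuming a preallocated square array and a hypothetical helper name distance_matrix_symmetric, could look like this:

```python
import numpy as np

def distance_matrix_symmetric(X):
    """Pairwise Euclidean distances, computing each pair only once."""
    n = len(X)
    dists = np.zeros((n, n))  # the diagonal stays zero
    for a in range(n):
        for b in range(a + 1, n):
            # With no ord/axis given, norm flattens the (32, 32, 3) difference.
            d = np.linalg.norm(X[a] - X[b])
            dists[a, b] = d
            dists[b, a] = d   # mirror into the lower triangle
    return dists

rng = np.random.default_rng(0)
X_small = rng.random((5, 32, 32, 3))  # small stand-in for X_test
D = distance_matrix_symmetric(X_small)
print(D.shape)
```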
I don't see any other obvious optimizations while keeping the problem exactly the same, but it also seems that you're working with 32x32 images in a three channel format. That's 3072 dimensions! Why not first down-sample to 4x4, convert to HSL color space, and keep only Hue and Lightness to get a (4,4,2) "signature" for each image. If your problem is mostly about shape, you can throw away Hue too and basically work with black-and-white images.
(4,4,2) has only 32 dimensions, a reduction by a factor of roughly 100 (3072/32 = 96) compared to (32,32,3). And if you did want to do the full comparison in the (32,32,3) space, you could do that only on images that are already very similar in the (4,4,2) space.
I have read Divakar's comment.
Rather than asking "Show me, Divakar", I asked myself "What is this pdist/cdist stuff?" I read about pdist and norm and came up with the following code.
Import stuff:
In [1]: import numpy as np
In [2]: from scipy.spatial.distance import pdist
Generate a random sample, not necessarily as large as the OP's one, and reshape it as suggested by Divakar
In [3]: a = np.random.random((100,32,32,3))
In [4]: b = a.reshape((100,32*32*3))
Using the magic of IPython, let's benchmark the two approaches
In [5]: %%timeit
   ...: dists = []
   ...: for i in range(len(a)):
   ...:     dists.append([])
   ...:     for j in range(len(a)):
   ...:         dists[i].append(np.linalg.norm(a[i] - a[j]))
128 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit pdist(b)
12.3 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Divakar's was 1 order of magnitude faster — but what about the accuracy?
Let's repeat the computations...
In [7]: dists1 = []
   ...: for i in range(len(a)):
   ...:     dists1.append([])
   ...:     for j in range(len(a)):
   ...:         dists1[i].append(np.linalg.norm(a[i] - a[j]))
In [8]: dists2 = pdist(b)
To compare the results, we must be aware that pdist computes only the upper triangle of the square matrix of distances (because the matrix is symmetric and the principal diagonal is identically equal to zero) so we must be careful in checking our results: hence I check the off diagonal part of the first row of dists1 with the first 99 elements of dists2 using allclose
In [9]: np.allclose(dists1[0][1:], dists2[:99])
Out[9]: True
The result is the same, nice.
What about an estimate of the time required for 10,000 elements? My feeling is that the cost is quadratic in the number of elements, but let's check by doubling it
In [10]: b = np.random.random((200,32*32*3))
In [11]: %timeit pdist(b)
48 ms ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The new timing is 4 times the initial one, which is consistent with quadratic scaling, so my estimate for your computation, on my feeble pc and using Divakar's proposal, is 12 ms x 100 x 100 = 120,000 ms = 120 s. You should read carefully the excellent answer by olooney and decide what you really want to do.
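Since pdist returns only the condensed upper triangle, the square matrix shown in the question still has to be rebuilt from it. A minimal sketch using scipy.spatial.distance.squareform, on a deliberately small random sample rather than the OP's data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
images = rng.random((6, 32, 32, 3))          # small stand-in for the real data
flat = images.reshape(len(images), -1)       # one row of 3072 values per image

condensed = pdist(flat)                      # n*(n-1)/2 distances, upper triangle
square = squareform(condensed)               # full symmetric (n, n) matrix

print(condensed.shape, square.shape)
```

Keep in mind that for 10,000 images the square matrix holds 10^8 float64 values, about 800 MB, so the condensed form may be worth keeping if memory is tight.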

Numpy: Replace every value in the array with the mean of its adjacent elements

I have an ndarray, and I want to replace every value in the array with the mean of its adjacent elements. The code below can do the job, but it is super slow when I have 700 arrays all with shape (7000, 7000) , so I wonder if there are better ways to do it. Thanks!
a = np.array(([1,2,3,4,5,6,7,8,9],[4,5,6,7,8,9,10,11,12],[3,4,5,6,7,8,9,10,11]))
row,col = a.shape
new_arr = np.ndarray(a.shape)
for x in range(row):
    for y in range(col):
        min_x = max(0, x-1)
        min_y = max(0, y-1)
        new_arr[x][y] = a[min_x:(x+2),min_y:(y+2)].mean()
print(new_arr)
Well, that's a smoothing operation in image processing, which can be achieved with 2D convolution. Your code treats the near-boundary elements a bit differently, though. So, if you can tolerate different values at the boundary elements, you can use scipy's convolve2d like so -
from scipy.signal import convolve2d as conv2
out = conv2(a,np.ones((3,3)),'same')/9.0
This specific operation is built into the OpenCV module as cv2.blur, which is very efficient at it. The name describes what it does: blurring the input arrays representing images. I believe the efficiency comes from the fact that internally it's implemented entirely in C for performance, with a thin Python wrapper to handle NumPy arrays.
So, the output could be alternatively calculated with it, like so -
import cv2 # Import OpenCV module
out = cv2.blur(a.astype(float),(3,3))
Here's a quick show-down on timings on a decently big image/array -
In [93]: a = np.random.randint(0,255,(5000,5000)) # Input array
In [94]: %timeit conv2(a,np.ones((3,3)),'same')/9.0
1 loops, best of 3: 2.74 s per loop
In [95]: %timeit cv2.blur(a.astype(float),(3,3))
1 loops, best of 3: 627 ms per loop
Following the discussion with @Divakar, find below a comparison of the different convolution methods present in scipy:
import numpy as np
from scipy import signal, ndimage
def conv2(A, size):
    return signal.convolve2d(A, np.ones((size, size)), mode='same') / float(size**2)
def fftconv(A, size):
    return signal.fftconvolve(A, np.ones((size, size)), mode='same') / float(size**2)
def uniform(A, size):
    return ndimage.uniform_filter(A, size, mode='constant')
All 3 methods return the same values (up to floating-point error). Note, however, that uniform_filter is called with mode='constant', which sets the boundary condition of the filter: padding with a constant value of 0 matches the zero-padded boundary that the other 2 methods use. For different use cases you can change the boundary conditions.
Now some test matrices:
A = np.random.randn(1000, 1000)
And some timings:
%timeit conv2(A, 3) # 33.8 ms per loop
%timeit fftconv(A, 3) # 84.1 ms per loop
%timeit uniform(A, 3) # 17.1 ms per loop
%timeit conv2(A, 5) # 68.7 ms per loop
%timeit fftconv(A, 5) # 92.8 ms per loop
%timeit uniform(A, 5) # 17.1 ms per loop
%timeit conv2(A, 10) # 210 ms per loop
%timeit fftconv(A, 10) # 86 ms per loop
%timeit uniform(A, 10) # 16.4 ms per loop
%timeit conv2(A, 30) # 1.75 s per loop
%timeit fftconv(A, 30) # 102 ms per loop
%timeit uniform(A, 30) # 16.5 ms per loop
So in short, uniform_filter seems faster, and that is because the convolution is separable into two 1D convolutions (similar to gaussian_filter, which is also separable).
Other non-separable filters with different kernels are more likely to be faster using the signal module (the one in @Divakar's solution).
The speed of both fftconvolve and uniform_filter remains roughly constant for different kernel sizes, while convolve2d gets markedly slower (from 33.8 ms at size 3 to 1.75 s at size 30).
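To make the separability argument concrete, here is a small NumPy-only sketch. It assumes an odd window size and zero padding at the borders, and both function names are invented for this illustration. It applies a 1D box kernel along the rows and then along the columns, and checks the result against a direct 2D window average.

```python
import numpy as np

def box_filter_separable(a, size):
    """Box filter as two 1D passes (odd size, zero-padded borders)."""
    kernel = np.ones(size) / size
    # Pass 1: filter every row; pass 2: filter every column of the result.
    rows_done = np.apply_along_axis(np.convolve, 1, a, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows_done, kernel, mode="same")

def box_filter_direct(a, size):
    """Same filter, computed by summing all size*size shifted copies."""
    half = size // 2
    padded = np.pad(a, half, mode="constant")
    out = np.zeros(a.shape, dtype=float)
    for dr in range(size):
        for dc in range(size):
            out += padded[dr:dr + a.shape[0], dc:dc + a.shape[1]]
    return out / size ** 2

rng = np.random.default_rng(0)
A_small = rng.standard_normal((6, 7))
print(np.allclose(box_filter_separable(A_small, 3), box_filter_direct(A_small, 3)))
```

The separable version touches each element about 2*size times instead of size**2 times, which is one reason a separable filter can scale better with kernel size than a general 2D convolution.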
I had a similar problem recently and had to find a different solution since I can't use scipy.
import numpy as np
a = np.random.randint(100, size=(7000,7000)) # Array of 7000 x 7000
row,col = a.shape
column_totals = a.sum(axis=0) # Sum of each column, collected into a single array
new_array = np.zeros([row,col]) # Create a receiving array
for i in range(row):
    # Resulting row = the column totals minus the original row, divided by the number of rows minus one.
    new_array[i] = (column_totals - a[i]) / (row - 1)
print(new_array)
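For a NumPy-only version that reproduces the question's windowed mean, including the smaller windows at the borders, one option is to sum shifted views of a zero-padded copy and divide by the number of valid neighbours. This is a sketch under the assumption of a 3x3 window; the function name neighbourhood_mean is hypothetical.

```python
import numpy as np

def neighbourhood_mean(a):
    """Mean over the 3x3 window around each element, clipped at the borders."""
    a = np.asarray(a, dtype=float)
    rows, cols = a.shape
    padded = np.pad(a, 1, mode="constant")                # zeros around the data
    valid = np.pad(np.ones_like(a), 1, mode="constant")   # 1 where real data is
    total = np.zeros_like(a)
    count = np.zeros_like(a)
    for dr in range(3):
        for dc in range(3):
            total += padded[dr:dr + rows, dc:dc + cols]
            count += valid[dr:dr + rows, dc:dc + cols]
    # count is 4 in the corners, 6 on the edges and 9 in the interior.
    return total / count

a = np.array(([1,2,3,4,5,6,7,8,9],[4,5,6,7,8,9,10,11,12],[3,4,5,6,7,8,9,10,11]))
print(neighbourhood_mean(a))
```

Only nine whole-array additions are performed, so the Python-level loop no longer depends on the array size.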

Computing the spectral norms of ~1m Hermitian matrices: `numpy.linalg.norm` is too slow

I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np
array = np.random.rand(8,8,int(1e6)) + 1j*np.random.rand(8,8,int(1e6))
It currently takes me almost 10 seconds using numpy.linalg.norm:
np.linalg.norm(array, ord=2, axis=(0,1))
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np
cimport numpy as np
cimport cython
np.import_array()
DTYPE = np.complex64
@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
why isn't numpy.mean multithreaded?
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks, and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck is in my code, i.e. whether it is CPU or memory-bound, or perhaps something different.
Digging into the code for np.linalg.norm, I've deduced that, for these parameters, it computes the largest singular value of each matrix along the N dimension.
First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:
In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
Out[269]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
the equivalent operation:
In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
Out[270]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
same values, and same time:
In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
1000 loops, best of 3: 398 µs per loop
In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
1000 loops, best of 3: 389 µs per loop
And most of the time spent in svd, which produces an (N,8) array:
In [273]: timeit np.linalg.svd(A1,compute_uv=0)
1000 loops, best of 3: 366 µs per loop
So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses np.linalg._umath_linalg functions, which live in a compiled .so file.
The c code is in https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src
It sure looks like this is the fastest you'll get. There's no Python level loop. Any looping is in that c code, or the lapack function it calls.
np.linalg.norm(A, ord=2) computes the spectral norm by finding the largest singular value using SVD. However, since your 8x8 submatrices are Hermitian, their largest singular values will be equal to the maximum of their absolute eigenvalues (see here):
import numpy as np
def random_symmetric(N, k):
    A = np.random.randn(N, k, k)
    A += A.transpose(0, 2, 1)
    return A
N = 100000
k = 8
A = random_symmetric(N, k)
norm1 = np.abs(np.linalg.eigvalsh(A)).max(1)
norm2 = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm1, norm2))
# True
Eigendecomposition on a Hermitian matrix is quite a bit faster than SVD:
In [1]: %%timeit A = random_symmetric(N, k)
np.linalg.norm(A, ord=2, axis=(1, 2))
....:
1 loops, best of 3: 1.54 s per loop
In [2]: %%timeit A = random_symmetric(N, k)
np.abs(np.linalg.eigvalsh(A)).max(1)
....:
1 loops, best of 3: 757 ms per loop
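The example above uses real symmetric matrices, while the question is about complex Hermitian ones. As a quick sanity check that the same identity holds there, here is a minimal sketch; the helper random_hermitian is made up for this illustration and builds Hermitian matrices as B plus its conjugate transpose.

```python
import numpy as np

def random_hermitian(n, k, seed=0):
    """Stack of n random complex Hermitian k x k matrices."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal((n, k, k)) + 1j * rng.standard_normal((n, k, k))
    # B + B^H is Hermitian by construction.
    return b + b.conj().transpose(0, 2, 1)

A = random_hermitian(200, 8)

# Largest absolute eigenvalue of each matrix (eigvalsh returns real values).
norm_eig = np.abs(np.linalg.eigvalsh(A)).max(axis=1)
# Spectral norm via the SVD-based route, for comparison.
norm_svd = np.linalg.norm(A, ord=2, axis=(1, 2))

print(np.allclose(norm_eig, norm_svd))
```

Note that eigvalsh reads only one triangle of each matrix and does not check that the input really is Hermitian, so the shortcut is only valid when that property is guaranteed.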

Many small matrices speed-up for loops

I have a large coordinate grid (vectors a and b), for which I generate and solve a matrix (10x10) equation. Is there a way for scipy.linalg.solve to accept vector input? So far my solution has been to run for loops over the coordinate arrays.
import time
import numpy as np
import scipy.linalg
N = 10
a = np.linspace(0, 1, 10**3)
b = np.linspace(0, 1, 2*10**3)
A = np.random.random((N, N)) # input matrix, not static
def f(a,b,n): # array-filling function
    return a*b*n
def sol(x,y): # matrix solver
    D = np.arange(0,N)
    B = f(x,y,D)**2 + f(x-1, y+1, D) # source vector
    X = scipy.linalg.solve(A,B)
    return X # output an N-size vector
start = time.time()
answer = np.zeros(shape=(a.size, b.size)) # predefine output array
for egg in range(a.size): # an ugly double-for cycle
    for ham in range(b.size):
        aa = a[egg]
        bb = b[ham]
        answer[egg,ham] = sol(aa,bb)[0]
print(time.time() - start)
To illustrate my point about generalized ufuncs, and the ability to eliminate the loop over egg and ham, consider the following piece of code:
import numpy as np
A = np.random.randn(4,4,10,10)
AI = np.linalg.inv(A)
#check that generalized ufuncs work as expected
I = np.einsum('xyij,xyjk->xyik', A, AI)
print(np.allclose(I, np.eye(10)[np.newaxis, np.newaxis, :, :]))
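The same generalized-ufunc behaviour applies to np.linalg.solve, which accepts a stack of matrices and solves all systems in one call without forming the inverses. The sketch below assumes a different random matrix per system and uses made-up sizes. The right-hand sides get an explicit trailing axis so that each one is treated as an (N, 1) column regardless of NumPy version.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 200, 10                        # number of systems, size of each system
A = rng.standard_normal((M, N, N))    # one matrix per system
b = rng.standard_normal((M, N))       # one right-hand side per system

# Solve all M systems at once; drop the trailing axis afterwards.
x = np.linalg.solve(A, b[..., None])[..., 0]

# Check the residual A[i] @ x[i] - b[i] for every system.
print(np.allclose(np.einsum('mij,mj->mi', A, x), b))
```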
@yevgeniy You are right, efficiently solving multiple independent linear systems A x = b with scipy is a bit tricky (assuming an A array that changes for every iteration).
For instance, here is a benchmark for solving 1000 systems of the form, A x = b, where A is a 10x10 matrix, and b a 10 element vector. Surprisingly, the approach to put all this into one block diagonal matrix and call scipy.linalg.solve once is indeed slower both with dense and sparse matrices.
import numpy as np
from scipy.linalg import block_diag, solve
from scipy.sparse import block_diag as sp_block_diag
from scipy.sparse.linalg import spsolve
N = 10
M = 1000 # number of coordinates
Ai = np.random.randn(N, N) # we could compute the inverse here,
# but let's assume that the Ai are different matrices in the for loop
bi = np.random.randn(N)
%timeit [solve(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 32.1 ms per loop
Afull = sp_block_diag([Ai]*M, format='csr')
bfull = np.tile(bi, M)
%timeit Afull = sp_block_diag([Ai]*M, format='csr')
%timeit spsolve(Afull, bfull)
# 1 loops, best of 3: 303 ms per loop
# 100 loops, best of 3: 5.55 ms per loop
Afull = block_diag(*[Ai]*M)
%timeit Afull = block_diag(*[Ai]*M)
%timeit solve(Afull, bfull)
# 100 loops, best of 3: 14.1 ms per loop
# 1 loops, best of 3: 23.6 s per loop
Solving the linear system with sparse arrays is faster, but creating this block diagonal array is actually very slow. As for dense arrays, they are simply slower in this case (and take lots of RAM).
Maybe I'm missing something about how to make this work efficiently with sparse arrays, but if you are keeping the for loops, there are two things that you could do for optimizations.
From pure Python: look at the source code of scipy.linalg.solve, remove unnecessary tests and factor all repeated operations out of your loops. For instance, assuming your arrays are not symmetric positive definite, we could do
from numpy.linalg import LinAlgError
from scipy.linalg import get_lapack_funcs
gesv, = get_lapack_funcs(('gesv',), (Ai, bi))
def solve_opt(A, b, gesv=gesv):
    # not sure if copying A and B is necessary, but just in case (faster if arrays are not copied)
    lu, piv, x, info = gesv(A.copy(), b.copy(), overwrite_a=False, overwrite_b=False)
    if info == 0:
        return x
    if info > 0:
        raise LinAlgError("singular matrix")
    raise ValueError('illegal value in %d-th argument of internal gesv|posv' % -info)
%timeit [solve(Ai, bi) for el in range(M)]
%timeit [solve_opt(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 30.1 ms per loop
# 100 loops, best of 3: 3.77 ms per loop
which, for the timings shown, is roughly an 8x speed-up (30.1 ms vs. 3.77 ms).
If you need even better performance, you would have to port this for loop to Cython and interface the gesv LAPACK function directly in C, as discussed here, or better with the Cython API for BLAS/LAPACK in Scipy 0.16.
Edit: As @Eelco Hoogendoorn mentioned, if your A matrix is fixed, there is a much simpler and more efficient approach.
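The edit does not spell that approach out. One plausible reading, assumed here rather than taken from the comment, is that with a fixed A all right-hand sides can be stacked as columns of a single matrix and solved in one call, so A is factorized only once. The sizes and variable names below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 1000
A = rng.standard_normal((N, N))   # the single, fixed matrix
B = rng.standard_normal((M, N))   # one right-hand side per row

# Solve A @ X.T = B.T: every column of B.T is one right-hand side.
X = np.linalg.solve(A, B.T).T     # shape (M, N), row i solves A x = B[i]

print(np.allclose(X @ A.T, B))
```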
