K-Means: assign clusters to new data points - python

I've implemented a k-means clustering algorithm in Python, and now I want to label new data points with the clusters I computed with my algorithm. My approach is to iterate through every data point and every centroid to find the minimum distance and the centroid associated with it. But I wonder if there are simpler or shorter ways to do it.
def assign_cluster(clusterDict, data):
    clusterList = []
    label = []
    cen = list(clusterDict.values())
    for i in range(len(data)):
        for j in range(len(cen)):
            # if cen[j] has the minimum distance to data[i]
            # then clusterList[i] = cen[j]
where clusterDict is a dictionary whose keys are the labels, [0, 1, 2, ...], and whose values are the coordinates of the centroids.
Can someone help me implement this?
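For reference, a minimal loop-based completion of the skeleton above (my own sketch, not the asker's code) could pick, for each point, the key of the closest centroid using min() with a distance key function:

import numpy as np

def assign_cluster(clusterDict, data):
    # For each point, pick the label of the closest centroid (squared Euclidean distance).
    labels = []
    for point in data:
        best_label = min(
            clusterDict,
            key=lambda k: np.sum((np.asarray(clusterDict[k]) - np.asarray(point)) ** 2),
        )
        labels.append(best_label)
    return labels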

This is a good use case for numba, because it lets you express this as a simple double loop without a big performance penalty, which in turn allows you to avoid the excessive extra memory of using np.tile to replicate the data across a third dimension just to do it in a vectorized manner.
Borrowing the standard vectorized numpy implementation from the other answer, I have these two implementations:
import numba
import numpy as np

def kmeans_assignment(centroids, points):
    num_centroids, dim = centroids.shape
    num_points, _ = points.shape
    # Tile and reshape both arrays into `[num_points, num_centroids, dim]`.
    centroids = np.tile(centroids, [num_points, 1]).reshape([num_points, num_centroids, dim])
    points = np.tile(points, [1, num_centroids]).reshape([num_points, num_centroids, dim])
    # Compute all distances (for all points and all centroids) at once and
    # select the min centroid for each point.
    distances = np.sum(np.square(centroids - points), axis=2)
    return np.argmin(distances, axis=1)

@numba.jit
def kmeans_assignment2(centroids, points):
    P, C = points.shape[0], centroids.shape[0]
    distances = np.zeros((P, C), dtype=np.float32)
    for p in range(P):
        for c in range(C):
            distances[p, c] = np.sum(np.square(centroids[c] - points[p]))
    return np.argmin(distances, axis=1)
Then for some sample data, I did a few timing experiments:
In [12]: points = np.random.rand(10000, 50)
In [13]: centroids = np.random.rand(30, 50)
In [14]: %timeit kmeans_assignment(centroids, points)
196 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [15]: %timeit kmeans_assignment2(centroids, points)
127 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I won't go so far as to say that the numba version is certainly faster than the np.tile version, but clearly it's very close while not incurring the extra memory cost of np.tile.
In fact, I noticed on my laptop that when I make the shapes larger and use (10000, 1000) for the shape of points and (200, 1000) for the shape of centroids, then np.tile generates a MemoryError, while the numba function runs in under 5 seconds with no memory error.
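As a rough back-of-the-envelope check (my own arithmetic, not from the answer), each tiled float64 intermediate at that size is on the order of 16 GB, which explains the MemoryError:

num_points, num_centroids, dim = 10000, 200, 1000
bytes_per_float64 = 8
# Each of the two tiled arrays (and the difference array) has shape
# (num_points, num_centroids, dim) of float64:
print(num_points * num_centroids * dim * bytes_per_float64 / 1e9)  # 16.0 GB per array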
Separately, I actually noticed a slowdown when using numba.jit on the first version (with np.tile), which is likely due to the extra array creation inside the jitted function combined with the fact that there's not much numba can optimize when you're already calling only vectorized functions.
And I also did not notice any significant improvement in the second version when trying to shorten the code by using broadcasting, e.g. shortening the double loop to
for p in range(P):
    distances[p, :] = np.sum(np.square(centroids - points[p, :]), axis=1)
did not really help anything (and would use more memory when repeatedly broadcasting points[p, :] across all of centroids).
This is one of the really nice benefits of numba. You really can write the algorithms in a very straightforward, loop-based way that comports with standard descriptions of algorithms and allows finer control over how the syntax unpacks into memory consumption or broadcasting... all without giving up runtime performance.

An efficient way to perform the assignment phase is with a vectorized computation. This approach assumes that you start with two 2D arrays, points and centroids, with the same number of columns (the dimensionality of the space) but possibly different numbers of rows. By using tiling (np.tile) we can compute the distance matrix in a batch, then select the closest cluster for each point.
Here's the code:
import numpy as np

def kmeans_assignment(centroids, points):
    num_centroids, dim = centroids.shape
    num_points, _ = points.shape
    # Tile and reshape both arrays into `[num_points, num_centroids, dim]`.
    centroids = np.tile(centroids, [num_points, 1]).reshape([num_points, num_centroids, dim])
    points = np.tile(points, [1, num_centroids]).reshape([num_points, num_centroids, dim])
    # Compute all distances (for all points and all centroids) at once and
    # select the min centroid for each point.
    distances = np.sum(np.square(centroids - points), axis=2)
    return np.argmin(distances, axis=1)
See this GitHub gist for a complete runnable example.
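As an aside (not from either answer above), scipy's scipy.spatial.distance.cdist gives an even shorter way to do the same assignment, if adding a scipy dependency is acceptable; the function name here is my own:

import numpy as np
from scipy.spatial.distance import cdist

def kmeans_assignment_cdist(centroids, points):
    # Pairwise Euclidean distances, shape (num_points, num_centroids),
    # then the index of the closest centroid for each point.
    return np.argmin(cdist(points, centroids), axis=1)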

Related

How to compute rolling dot product/cosine similarity on pandas dataframe with a number of columns?

The main objective of this question is to calculate the rolling dot_product or cosine_similarity over a pandas dataframe. Going through the documentation, I found that, technically, we can compute the rolling correlation using the following syntax:
df.rolling(window_size).corr().
However, I am wondering how to compute the rolling cosine_similarity. For instance, I would like to have something like:
from sklearn.metrics.pairwise import cosine_similarity
df.rolling(window=3, method="table").apply(lambda table: cosine_similarity(table.T))
However, this is throwing an error.
The entire code to reproduce the problem is below.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
pd.__version__ # 1.4.3
def generate_random_walk(_len):
    lst = [np.random.randn()]
    for i in range(_len):
        lst.append(lst[i] + np.random.randn())
    return lst
_len = 100
_num_arrays = 5
array_2D = np.array([generate_random_walk(_len) for _ in range(_num_arrays)]).T
df = pd.DataFrame(array_2D)
df.rolling(window=3, method="table").apply(lambda table: cosine_similarity(table.T))
Please note that I am using:
Python version: 3.10, and
Pandas version: 1.4.3
I would expect the answer to include a solution using the pandas native API. Otherwise, any solution avoiding a for loop is great, given how slow the code can get with for loops. I personally made some comparisons for the corr function using pandas native functions versus a for loop, and the former was more than 100 times faster. Finally, if no pandas native function is available for the cosine_similarity, and if there is no solution similar to df.rolling(window=3, method="table").apply(lambda table: cosine_similarity(table.T)), I would appreciate a solution using a for loop and numba for faster computation.
Thanks in advance for the support.
Below is my solution, using numpy and numba for faster processing.
import numpy as np
import pandas as pd
from numba import jit, njit
import numba
from sklearn.metrics.pairwise import cosine_similarity
Explanation
The main objective of this problem is to calculate the rolling cosine_similarity over a given matrix. In other words, I would like to achieve behavior similar to the pandas native function df.rolling(window_size).corr(), which returns the correlation coefficients for each sliding window over a given dataframe. More information about rolling correlation can be found here. The following figure demonstrates how the sliding window moves over axis=0.
As a result, each window is passed to the corr() function, and the result of corr() is a dataframe of size (N x N), where N is the number of columns in the main dataframe. The output of df.rolling(window_size).corr() is therefore roughly of shape (len(df) - window_size + 1, N, N).
In the following sections, I will go over the main steps for calculating the cosine similarity over a 2D array (section 1). In section 2, I will show the main implementation in Python, using numba to achieve faster processing times. Finally, I will apply the cosine similarity over a sliding window as shown in the figure above.
Section 1
The formula for cosine similarity between two vectors a and b is:
cos_sim(a, b) = a.b / (|a| |b|)
Therefore, if we have a matrix A with m rows and N columns, calculating the cosine similarity between each and every pair of columns requires a nested for loop that consumes every pair of columns and applies the cosine formula above. A Python code snippet would look like this:
def calc_cosine_sim(a, b):
    dot_product = np.dot(a, b)
    a_norm = np.linalg.norm(a)
    b_norm = np.linalg.norm(b)
    return dot_product / (a_norm * b_norm)

A = np.array(some_values)  # A.shape = (m, N)
lst = []
for i in range(N - 1):
    for j in range(i + 1, N):
        lst.append(calc_cosine_sim(A[:, i], A[:, j]))
Given that we are calculating the dot product between each and every pair of columns, the same process can be achieved with a matrix multiplication: if we multiply the transpose of A by A itself, we get another matrix whose entries are the dot products of A's columns.
Assume we have the following matrix, where each row represents one instance and each column represents a feature, or vector:

A = [a b c]   (a, b and c are the columns of A, each of length m)

Multiplying the transpose of A with itself produces the matrix of all pairwise column dot products:

A^{T}.A = [[a.a, a.b, a.c],
           [b.a, b.b, b.c],
           [c.a, c.b, c.c]]

Consequently, using matrix multiplication, I have achieved the same result without using for loops. This process is much faster due to vectorization in numpy.
However, the cosine similarity requires dividing each dot product by the norms of both vectors. Therefore, we would like to build this matrix:

norm_matrix = [[|a||a|, |a||b|, |a||c|],
               [|b||a|, |b||b|, |b||c|],
               [|c||a|, |c||b|, |c||c|]]

However, this is simply the elementwise product B * C, where

B = [[|a|, |b|, |c|],
     [|a|, |b|, |c|],
     [|a|, |b|, |c|]]   and   C = B^{T}

Therefore, if we simply calculate the norms |a|, |b| and |c| and create the two matrices above (B and C), we can build norm_matrix; dividing A^{T}.A elementwise by norm_matrix returns the final result, the cosine_similarity matrix. In the next section, I will go through the Python code to explain how the theory above gets translated into code.
One final note: C = B^{T}. Therefore, we only need to produce B, which is then used to generate C.
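A quick numerical check of the reasoning above (my own sketch, using a small random matrix) confirms that A^{T}.A divided elementwise by the outer product of the column norms matches sklearn's cosine_similarity:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

A = np.random.randn(7, 3)                  # m=7 rows, N=3 columns
norms = np.linalg.norm(A, axis=0)          # |a|, |b|, |c|
cos_matrix = (A.T @ A) / np.outer(norms, norms)

# sklearn computes cosine similarity between rows, so pass the transpose.
print(np.allclose(cos_matrix, cosine_similarity(A.T)))  # True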
Section 2
Below is the Python implementation using numpy. This is a straightforward implementation, without numba. I have added comments to reflect the steps and matrices described in section 1. I have written two methods, calc_rolling_cosine_similarity_v1 and calc_rolling_cosine_similarity_v2, on purpose, so that I use both the cosine_similarity implementation from sklearn and my own numpy implementation. The aim is to compare the numbers at the end and make sure the implementation is correct.
# This is the implementation of the above logic (section 1) using the numpy library.
def calc_cosine_similarity_on_2darray(arr):
    '''
    Input is a 2D array.
    Return the cosine similarity matrix over the 2D array. In other words, the result is the
    cosine similarity between each and every column of the input 2D array/matrix.
    '''
    # Equation 1
    arr_x_arr = arr.T @ arr
    # Calculating matrix B
    arr_norm = np.linalg.norm(arr, axis=0)
    arr_norm_r = np.expand_dims(arr_norm, axis=0)
    arr_norm_r_m = np.tile(arr_norm_r, (arr_norm_r.shape[1], 1))
    # Calculating matrix C
    arr_norm_c_m = arr_norm_r_m.T
    # Calculating the norm_matrix
    arr_norm_mul = arr_norm_r_m * arr_norm_c_m
    # Return matrix: cosine_similarity
    return arr_x_arr / arr_norm_mul

# This function calculates the rolling cosine similarity using the sliding_window_view numpy
# function and cosine_similarity from the sklearn library.
def calc_rolling_cosine_similarity_v1(array, window_size, num_features):
    array_windowed = np.squeeze(np.lib.stride_tricks.sliding_window_view(
        array, window_shape=(window_size, num_features)))
    # arr is transposed on purpose because cosine_similarity from sklearn computes cos_sim
    # between the matrix rows. The transpose puts the features, aka columns, in the first dimension.
    cos_sim = [cosine_similarity(arr.T) for arr in array_windowed]
    return cos_sim

def calc_rolling_cosine_similarity_v2(array, window_size, num_features):
    array_windowed = np.squeeze(np.lib.stride_tricks.sliding_window_view(
        array, window_shape=(window_size, num_features)))
    cos_sim = [calc_cosine_similarity_on_2darray(arr) for arr in array_windowed]
    return cos_sim
The same implementation, modified to use the numba package in Python to speed up the computations. The following numpy functions generated errors when used with numba:
np.linalg.norm
np.expand_dims
np.tile
Finally, I could also have implemented the numpy function np.lib.stride_tricks.sliding_window_view manually. I am not sure how much faster this method could be when used with numba; I left it out on purpose due to time constraints (a sketch of one possible manual version follows the code below).
# This function returns the norm of a vector v.
@njit
def calc_norm(v):
    return np.sqrt(np.sum(np.square(v)))

# The implementation here is equivalent to the implementation of calc_cosine_similarity_on_2darray
# above, but modified to utilize the numba python package for faster processing.
@njit
def calc_cosine_similarity_on_2darray_numba(arr):
    '''
    Input is a 2D array.
    Return the similarity matrix over the 2D array. In other words, the result is the cosine
    similarity between each and every column of the input 2D array/matrix.
    '''
    # Equation 1
    arr_x_arr = arr.T @ arr
    # Calculating matrix B
    arr_norm_r = np.zeros(shape=(1, arr.shape[1]))
    for i in range(arr.shape[1]):  # iterate over each column in arr
        arr_norm_r[0, i] = calc_norm(arr[:, i])  # calculate the norm of every column in arr
    arr_norm_r_m = np.ones(shape=(arr.shape[1], arr.shape[1])) * arr_norm_r
    # Calculating matrix C
    arr_norm_c_m = arr_norm_r_m.T
    # Calculating the norm_matrix
    arr_norm_mul = arr_norm_r_m * arr_norm_c_m
    # Return matrix: cosine_similarity
    return arr_x_arr / arr_norm_mul

@njit
def rolling_cosine_similarity_numba(array_windowed):
    cos_sim = [calc_cosine_similarity_on_2darray_numba(arr) for arr in array_windowed]
    return cos_sim

def calc_rolling_cosine_similarity_numba(arr, window_size, num_features):
    array_windowed = np.squeeze(np.lib.stride_tricks.sliding_window_view(arr, window_shape=(window_size, num_features)))
    cos_sim = rolling_cosine_similarity_numba(array_windowed)
    return cos_sim
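For reference, here is a minimal sketch (my own addition, not part of the answer, and not benchmarked) of the manual sliding window mentioned above, written so that the entire rolling computation lives inside a single jitted loop over window start positions instead of relying on np.lib.stride_tricks.sliding_window_view:

import numpy as np
from numba import njit

@njit
def rolling_cosine_similarity_manual(arr, window_size):
    # arr has shape (num_rows, num_features); slide the window over axis 0 manually.
    n_rows, n_cols = arr.shape
    n_windows = n_rows - window_size + 1
    out = np.empty((n_windows, n_cols, n_cols))
    for w in range(n_windows):
        window = arr[w:w + window_size]
        # Column norms of the current window.
        norms = np.empty(n_cols)
        for i in range(n_cols):
            norms[i] = np.sqrt(np.sum(np.square(window[:, i])))
        # Pairwise column dot products divided by the product of the norms.
        for i in range(n_cols):
            for j in range(n_cols):
                out[w, i, j] = np.sum(window[:, i] * window[:, j]) / (norms[i] * norms[j])
    return out

It should produce the same (num_windows, N, N) result as calc_rolling_cosine_similarity_numba; whether it is faster is something I have not measured.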
# Helper function to generate random signal.
def generate_random_walk(_len):
    lst = [np.random.randn()]
    for i in range(_len - 1):
        lst.append(lst[i] + np.random.randn())
    return lst
signal_len = 24*365*5
num_features = 5
window_size = 10
array_2D = np.array([generate_random_walk(signal_len) for _ in range(num_features)]).T
array_2D.shape
--> output: (43800, 5)
%%timeit
calc_rolling_cosine_similarity_v1(array_2D, window_size, num_features)
--> output: 9.87 s ± 127 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
calc_rolling_cosine_similarity_v2(array_2D, window_size, num_features)
--> output: 2.79 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
calc_rolling_cosine_similarity_numba(array_2D, window_size, num_features)
--> output: 343 ms ± 4.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sol1 = np.array(calc_rolling_cosine_similarity_v1(array_2D, window_size, num_features))
sol2 = np.array(calc_rolling_cosine_similarity_v2(array_2D, window_size, num_features))
sol3 = np.array(calc_rolling_cosine_similarity_numba(array_2D, window_size, num_features))
sol1.shape, sol2.shape, sol3.shape
--> output: ((43791, 5, 5), (43791, 5, 5), (43791, 5, 5))
np.allclose(sol1, sol2), np.allclose(sol2, sol3), np.allclose(sol1, sol3)
--> output: (True, True, True)
Conclusion
The same solution, when using numba, is almost 9.87 / 0.343 ≈ 30 times faster than the sklearn-based implementation, and 2.79 / 0.343 ≈ 8 times faster than the same numpy implementation without numba.

Calculate Pearson correlation coefficient for only 1 column of array efficiently

I have an array with shape ~(700, 36000) and would like to calculate the Pearson correlation coefficient for one specific column (against all other columns), but thousands of times. I've tried this a number of ways, but none seem to be particularly efficient:
import numpy
df_corr = numpy.corrcoef(df.T)
corr_column = df_corr[:, column_index]
This of course calculates the entire correlation matrix, and takes ~12 s on my machine; this is a problem, as I need to do this ~35,000 times (the array is changed slightly every time before creating the correlation matrix)!
I've also tried iterating over the columns individually:
corr_column = numpy.zeros(df.shape[1])
for x in range(df.shape[1]):
    corr_column[x] = numpy.corrcoef(x=df.iloc[:, column_index], y=df.iloc[:, x])[0][1]
corr_column = corr_column.reshape(-1, 1)
This is slightly faster at ~10s per iteration, but still too slow. Are there ways to find the correlation coefficient between a column and all other columns faster?
Well you can just implement the formula yourself:
import numpy as np

def corr(a, i):
    '''
    Parameters
    ----------
    a: numpy array
    i: column index

    Returns
    -------
    c: numpy array
        correlation coefficients of a[:,i] against all other columns of a
    '''
    mean_t = np.mean(a, axis=0)
    std_t = np.std(a, axis=0)
    mean_i = mean_t[i]
    std_i = std_t[i]
    mean_xy = np.mean(a * a[:, i][:, None], axis=0)
    c = (mean_xy - mean_i * mean_t) / (std_i * std_t)
    return c
a = np.random.randint(0,10, (700,36000))
%timeit corr(a,0)
608 ms ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.corrcoef(a.T)
# Actually didn't have the patience to let it finish in my machine
# Using a smaller sample, the implementation above is 100x faster.
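As a quick sanity check (my own addition, reusing the corr function defined above), the closed-form version matches np.corrcoef on a small array:

import numpy as np

small = np.random.randn(50, 6)
ref = np.corrcoef(small.T)[:, 2]   # column 2 against all columns
print(np.allclose(corr(small, 2), ref))  # True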

Broadcasting only with specific dimensions of ndarray in python

Consider a TxFxM ndarray. I wish to multiply it with its conjugate, only along the M dimension, while keeping the other dimensions as they are, as in the following code:
import numpy as np

T = 2
F = 3
M = 4
x = np.random.rand(T, F, M)
result = np.zeros((T, F, M, M))
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        result[i, j, :, :] = np.matmul(np.expand_dims(x[i, j, :], axis=1),
                                       np.expand_dims(x[i, j, :], axis=0).conj())
If I simply use broadcasting as in np.matmul(x, x.conj().T), the broadcast operation continues over the higher dimensions and keeps multiplying. On the other hand, my implementation is very slow due to the two loops and, to my understanding, very unpythonic.
Is there a way to implement this such that it runs faster?
P.S.
My dimensions are obviously larger, T=3000, F=1024, M=4, and this operation repeats itself, hence my requirement for a fast implementation.
I plan to average this over dimension T, so if there is a faster total implementation I would be very interested.
The array you need can be computed with broadcasting if you inject singleton dimensions in two different places for x and x.conj(). If x has shape (T,F,M) then arrays of shape (T,F,M,1) and (T,F,1,M) will broadcast to (T,F,M,M) just the way you want it. Here's your example with complex data to make sure we're not missing something:
import numpy as np

T, F, M = 2, 3, 4
x = np.random.rand(T, F, M) + np.random.rand(T, F, M)*1j
result = np.zeros((T, F, M, M), dtype=complex)

# loop
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        result[i, j, :, :] = np.matmul(np.expand_dims(x[i, j, :], axis=1),
                                       np.expand_dims(x[i, j, :], axis=0).conj())

# broadcasting
result2 = x[..., None] * x[..., None, :].conj()

# proof
print(np.array_equal(result, result2))  # True
Since you mentioned that you want to take a mean along the T-sized dimension, we have to consider whether it's worth putting this dimension last, so that the mean operates over memory that is as contiguous as possible. This gives the following two options:
def summed_original(x):
    """Assume x.shape == (T, F, M), return averaged array of shape (F, M, M)."""
    return (x[..., None] * x[..., None, :].conj()).mean(0)

def summed_transposed(x):
    """Assume x.shape == (F, M, T), return averaged array of shape (F, M, M)."""
    return (x[..., None, :] * x[:, None, ...].conj()).mean(-1)

x_transposed = x.transpose(1, 2, 0).copy()  # ensure contiguous copy
print(np.allclose(summed_original(x), summed_transposed(x_transposed)))  # True
As you can see these two functions compute the same thing, but they assume the input to have different memory order. The reason why this is important is because it might prove faster to have your original array in a different memory layout (at the cost of transposing and copying it once at the start, if need be).
So let's time them using IPython's %timeit magic and your real sizes:
T,F,M = 3000,1024,4
x = np.random.rand(T, F, M) + np.random.rand(T, F, M)*1j
x_transposed = x.transpose(1, 2, 0).copy()
print(np.allclose(summed_original(x), summed_transposed(x_transposed))) # True
%timeit summed_original(x)
# 500 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit summed_transposed(x_transposed)
# 352 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, for your specific sizes and types it seems well worth rearranging the dimensions of your array so that the T dimension corresponds to contiguous blocks of memory, aiding CPU caching. You can either do this with a .transpose(...).copy() call at the start, or, even better, construct your array that way in the first place, making the code optimal.
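For completeness (my own addition, not part of the answer above): np.einsum can express the same product-and-mean in one call, which avoids materializing the full (T, F, M, M) intermediate; I have not benchmarked it against the versions above.

import numpy as np

def summed_einsum(x):
    """Assume x.shape == (T, F, M); average x[t,f,:] outer conj(x[t,f,:]) over T."""
    return np.einsum('tfi,tfj->fij', x, x.conj()) / x.shape[0]

T, F, M = 2, 3, 4
x = np.random.rand(T, F, M) + np.random.rand(T, F, M) * 1j
print(np.allclose(summed_einsum(x), (x[..., None] * x[..., None, :].conj()).mean(0)))  # True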

Computing the spectral norms of ~1m Hermitian matrices: `numpy.linalg.norm` is too slow

I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np
array = np.random.rand(8, 8, 10**6) + 1j*np.random.rand(8, 8, 10**6)
It currently takes me almost 10 seconds using numpy.linalg.norm:
np.linalg.norm(array, ord=2, axis=(0,1))
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np
cimport numpy as np
cimport cython

np.import_array()
DTYPE = np.complex64

@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
why isn't numpy.mean multithreaded?
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks, and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck is in my code, i.e. whether it is CPU or memory-bound, or perhaps something different.
Digging into the code for np.linalg.norm, I've deduced that, for these parameters, it finds the maximum of the matrix singular values over the N dimension.
First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:
In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
Out[269]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
the equivalent operation:
In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
Out[270]:
array([ 5.87718306, 5.54662999, 6.15018125, 5.869058 , 5.80882818,
5.86060462, 6.04997992, 5.85681085, 5.71243196, 5.58533323])
same values, and same time:
In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
1000 loops, best of 3: 398 µs per loop
In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
1000 loops, best of 3: 389 µs per loop
And most of the time spent in svd, which produces an (N,8) array:
In [273]: timeit np.linalg.svd(A1,compute_uv=0)
1000 loops, best of 3: 366 µs per loop
So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses the np.linalg._umath_linalg functions, which live in a compiled .so file.
The c code is in https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src
It sure looks like this is the fastest you'll get. There's no Python-level loop. Any looping is in that C code, or in the LAPACK function it calls.
np.linalg.norm(A, ord=2) computes the spectral norm by finding the largest singular value using SVD. However, since your 8x8 submatrices are Hermitian, their largest singular values will be equal to the maximum of their absolute eigenvalues (see here):
import numpy as np

def random_symmetric(N, k):
    A = np.random.randn(N, k, k)
    A += A.transpose(0, 2, 1)
    return A

N = 100000
k = 8
A = random_symmetric(N, k)

norm1 = np.abs(np.linalg.eigvalsh(A)).max(1)
norm2 = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm1, norm2))
# True
Eigendecomposition on a Hermitian matrix is quite a bit faster than SVD:
In [1]: %%timeit A = random_symmetric(N, k)
np.linalg.norm(A, ord=2, axis=(1, 2))
....:
1 loops, best of 3: 1.54 s per loop
In [2]: %%timeit A = random_symmetric(N, k)
np.abs(np.linalg.eigvalsh(A)).max(1)
....:
1 loops, best of 3: 757 ms per loop
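The question uses complex Hermitian matrices rather than real symmetric ones; np.linalg.eigvalsh handles that case too, as this small check (my addition) illustrates:

import numpy as np

def random_hermitian(N, k):
    A = np.random.randn(N, k, k) + 1j * np.random.randn(N, k, k)
    return A + A.conj().transpose(0, 2, 1)   # A + A^H is Hermitian

A = random_hermitian(1000, 8)
norm_eig = np.abs(np.linalg.eigvalsh(A)).max(1)
norm_svd = np.linalg.norm(A, ord=2, axis=(1, 2))
print(np.allclose(norm_eig, norm_svd))  # True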

Many small matrices speed-up for loops

I have a large coordinate grid (vectors a and b), for which I generate and solve a matrix (10x10) equation. Is there a way for scipy.linalg.solve to accept vector input? So far my solution was to run for cycles over the coordinate arrays.
import time
import numpy as np
import scipy.linalg

N = 10
a = np.linspace(0, 1, 10**3)
b = np.linspace(0, 1, 2*10**3)
A = np.random.random((N, N))  # input matrix, not static

def f(a, b, n):  # array-filling function
    return a*b*n

def sol(x, y):  # matrix solver
    D = np.arange(0, N)
    B = f(x, y, D)**2 + f(x - 1, y + 1, D)  # source vector
    X = scipy.linalg.solve(A, B)
    return X  # output an N-size vector

start = time.time()
answer = np.zeros(shape=(a.size, b.size))  # predefine output array
for egg in range(a.size):  # an ugly double-for cycle
    for ham in range(b.size):
        aa = a[egg]
        bb = b[ham]
        answer[egg, ham] = sol(aa, bb)[0]
print(time.time() - start)
To illustrate my point about generalized ufuncs, and the ability to eliminate the loop over egg and ham, consider the following piece of code:
import numpy as np

A = np.random.randn(4, 4, 10, 10)
AI = np.linalg.inv(A)

# check that generalized ufuncs work as expected
I = np.einsum('xyij,xyjk->xyik', A, AI)
print(np.allclose(I, np.eye(10)[np.newaxis, np.newaxis, :, :]))
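In the same spirit (my own sketch, not the answer's code), np.linalg.solve is also a generalized ufunc, so all the small systems can be solved in one call if the matrices and right-hand sides are stacked:

import numpy as np

M, N = 1000, 10
A_stack = np.random.randn(M, N, N)   # one 10x10 matrix per coordinate
B_stack = np.random.randn(M, N)      # one right-hand side per coordinate

# Add a trailing axis so each b is treated as an (N, 1) column, then drop it again.
X = np.linalg.solve(A_stack, B_stack[..., None])[..., 0]   # shape (M, N)

# spot-check one system
print(np.allclose(X[0], np.linalg.solve(A_stack[0], B_stack[0])))  # True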
@yevgeniy You are right, efficiently solving multiple independent linear systems A x = b with scipy is a bit tricky (assuming an A array that changes for every iteration).
For instance, here is a benchmark for solving 1000 systems of the form A x = b, where A is a 10x10 matrix and b a 10-element vector. Surprisingly, the approach of putting all of this into one block diagonal matrix and calling scipy.linalg.solve once is indeed slower, both with dense and sparse matrices.
import numpy as np
from scipy.linalg import block_diag, solve
from scipy.sparse import block_diag as sp_block_diag
from scipy.sparse.linalg import spsolve

N = 10
M = 1000  # number of coordinates

Ai = np.random.randn(N, N)  # we could compute the inverse here,
# but let's assume that Ai are different matrices in the for loop
bi = np.random.randn(N)

%timeit [solve(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 32.1 ms per loop

Afull = sp_block_diag([Ai]*M, format='csr')
bfull = np.tile(bi, M)

%timeit Afull = sp_block_diag([Ai]*M, format='csr')
%timeit spsolve(Afull, bfull)
# 1 loops, best of 3: 303 ms per loop
# 100 loops, best of 3: 5.55 ms per loop

Afull = block_diag(*[Ai]*M)
%timeit Afull = block_diag(*[Ai]*M)
%timeit solve(Afull, bfull)
# 100 loops, best of 3: 14.1 ms per loop
# 1 loops, best of 3: 23.6 s per loop
The solution of the linear system with sparse arrays is faster, but the time to create this block diagonal array is actually very slow. As for dense arrays, they are simply slower in this case (and take lots of RAM).
Maybe I'm missing something about how to make this work efficiently with sparse arrays, but if you are keeping the for loops, there are two things that you could do for optimization.
From pure Python: look at the source code of scipy.linalg.solve, remove unnecessary tests, and factor all repeated operations out of your loops. For instance, assuming your arrays are not symmetric positive-definite, we could do
from scipy.linalg import get_lapack_funcs
from numpy.linalg import LinAlgError

gesv, = get_lapack_funcs(('gesv',), (Ai, bi))

def solve_opt(A, b, gesv=gesv):
    # not sure if copying A and b is necessary, but just in case (faster if arrays are not copied)
    lu, piv, x, info = gesv(A.copy(), b.copy(), overwrite_a=False, overwrite_b=False)
    if info == 0:
        return x
    if info > 0:
        raise LinAlgError("singular matrix")
    raise ValueError('illegal value in %d-th argument of internal gesv|posv' % -info)

%timeit [solve(Ai, bi) for el in range(M)]
%timeit [solve_opt(Ai, bi) for el in range(M)]
# 10 loops, best of 3: 30.1 ms per loop
# 100 loops, best of 3: 3.77 ms per loop
which, for the timings above, is roughly an 8x speed-up.
If you need even better performance, you would have to port this for loop to Cython and interface with the gesv LAPACK function directly in C, as discussed here, or better, with the Cython API for BLAS/LAPACK in Scipy 0.16.
Edit: as @Eelco Hoogendoorn mentioned, if your A matrix is fixed, there is a much simpler and more efficient approach.
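A minimal sketch of that shortcut (my reading of the comment, not the commenter's code): since A does not change, factor it once with scipy's lu_factor and reuse the factorization for every right-hand side, or solve all of them in a single call.

import numpy as np
from scipy.linalg import lu_factor, lu_solve, solve

N, M = 10, 1000
A = np.random.random((N, N))   # the fixed matrix
B = np.random.randn(N, M)      # all source vectors stacked as columns

lu, piv = lu_factor(A)         # O(N^3), done once
X = lu_solve((lu, piv), B)     # reuses the factorization for every column

print(np.allclose(X, solve(A, B)))  # True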
