I am having a big vector and I want to compute the elements wise inverse each time with a small perturbation. For example, for large N, I have y and its element wise inverse y_inv
y = np.random.rand(1, N)
y_inv = 1. / y
y_inv[y_inv == np.inf] = 0 # y may contain zeros
and I want to compute
p = 0.01
z = y + p
z_inv = 1./ z = 1./ (y + p)
multiple times for different p and fixed y. Because N is very large, is there an efficient way or an approximation, to use the already computed y_inv in the computation of z_inv? Any suggestions to increase the speed of inverting z are highly appreciated.
Floating point divisions are slow, especially for double-precision floating point numbers. Simple-precision is faster (with a relative error likely less than 2.5e-07) and it can be made even faster if you do not need a high-precision by computing the approximate reciprocal (with a maximum relative error for this approximation is less than 3.66e-4 on x86-64 processors).
Assuming this is OK to you to decrease the precision of results, here are the performance of the different approach on a Skylake processor (assuming Numpy is compiled with the support of the AVX SIMD instruction set):
Double-precision division: 8 operation/cycle/core
Simple-precision division: 5 operation/cycle/core
Simple-precision approximate reciprocal: 1 operation/cycle/core
You can easily switch to simple-precision with Numpy by specifying the dtype of arrays. For the reciprocal, you need to use Numba (or Cython) configured to use the fastmath model (a flag of njit).
Another solution is simply to execute this operation in using multiple threads. This can be easily done with Numba using the njit flag parallel=True and a loop iterating over nb.prange. This solution can be combined with the previous one (about precision) resulting in a much faster computation compared to the initial Numpy double-precision-based code.
Moreover, computing arrays in-place should be faster (especially when using multiple threads). Alternatively, you can preallocate the output array and reuse it (slower than the in-place method but faster than the naive approach). The parameter out of Numpy function (like np.divide) can be used to do that.
Here is an (untested) example of parallel Numba code:
import numba as nb
# See fastmath=True and float32 types for faster performance of approximate results
# Assume `out` is already preallocated
#njit('void(float64[::1], float64[::1], float64)', parallel=True)
def compute(out, y, p):
assert(out.size == y.size)
for i in nb.prange(y.size):
out[i] = 1.0 / (y[i] + p)
Related
I have a NumPy array vectors = np.random.randn(rows, cols). I want to find differences between its rows according to some other array diffs which is sparse and "2-hot": containing a 1 in its column corresponding to the first row of vectors and a -1 corresponding to the second row. Perhaps an example shall make it clearer:
diffs = np.array([[ 1, 0, -1],
[ 1, -1, 0]])
then I can compute the row differences by simply diffs # vectors.
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse: sparse.csr_matrix(diffs) # vectors, but even this is 300ms.
Possibly this is simply as fast as it gets, but some part of me thinks whether using matrix multiplications is the wisest decision for this task.
What's more is I need to take the absolute value afterwards so really I'm doing np.abs(sparse.csr_matrix(diffs) # vectors) which adds ~ 200ms for a grand total of ~500ms.
I can compute the row differences by simply diffs # vectors.
This is very inefficient. A matrix multiplication runs in O(n*m*k) for a (n,m) multiplied by a (m,k) one. In your case, there is only two values per line and you do not actually need a multiplication by 1 or -1. Your problem can be computed in O(n*k) time (ie. m times faster).
Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse.
The thing is the input data representation is inefficient. When diff is an array of size (10_000,1000), this is not reasonable to use a dense matrix that would be ~1000 times bigger than needed nor a sparse matrix that is not optimized for having only two non-zero values (especially 1 and -1). You need to store the position of the non-zeros values in a 2D array called sel_rows of shape (2,n) where the first row contains the location of the 1 and the second one contains the location of the -1 in the diff 2D array. Then, you can extract the rows of vectors for example with vectors[sel_rows[0]]. You can perform the final operation with vectors[sel_rows[0,:]] - vectors[sel_rows[1,:]]. This approach should be drastically faster than a dense matrix product and it may be a bit faster than a sparse one regarding the target machine.
While the above solution is simple, it create multiple temporary arrays that are not cache-friendly since your output array should take 10_000 * 15_000 * 8 = 1.1 GiB (which is quite huge). You can use Numba so to remove temporary array and so improve the performance. Multiple threads can be used to improve performance even further. Here is an untested code:
import numba as nb
#nb.njit('(int_[:,::1], float64[:,::1])', parallel=True)
def compute(diffs, vectors):
n, k = diffs.shape[0], vectors.shape[1]
assert diffs.shape[1] == 2
res = np.empty((n, k))
for i in nb.prange(n):
a, b = diffs[i]
for j in range(k):
# Compute nb.abs here if needed so to avoid
# creating new temporary arrays
res[i, j] = vectors[a, j] - vectors[b, j]
return res
This above code should be nearly optimal. It should be memory bound and able to saturate the memory bandwidth. Note that writing such huge arrays in memory take some time as well as reading (twice) the input array. On x86-64 platforms, a basic implementation should move 4.4 GiB of data from/to the RAM. Thus, on a mainstream PC with a 20 GiB/s RAM, this takes 220 ms. In fact, the sparse matrix computation result was not so bad in practice for a sequential implementation.
If this is not enough to you, then you can use simple-precision floating-point numbers instead of double-precision (twice faster). You could also use a low-level C/C++ implementation so to reduce the memory bandwidth used (thanks to non-temporal instructions -- ~30% faster). There is no much more to do.
I am writing an application in Python having speed as the main driver. While optimizing my code, I found out that the main bottleneck is given by the code used to compute
In my code, this matrix multiplication is computed as
POW = np.arange(4)
y = C # (x ** POW)
I tried to use different methods (e.g., for cycle and others), but as now this is the fastest way I found. Do you have any suggestion to improve the computational time?
It's absolutely to use Numpy. Numpy does the actual mathematical operations in highly optimized C code. Using Numpy is faster than writing your own non-optimized C code.
Firstly, use float instead int. Secondly, if you don't need double precision then use np.float32.
POW = np.arange(4, dtype='f')
# C = C.astype('f', copy=False) # ensure that C.dtype == np.float32
y = C # (x ** POW)
Given a query vector (one-hot-vector) q with size of 50000x1 and a large sparse matrix A with size of 50000 x 50000 and nnz of A is 0.3 billion, I want to compute r=(A + A^2 + ... + A^S)q (usually 4 <= S <=6).
I can above equation iteratively using loop
r = np.zeros((50000,1))
for i in range(S):
q = A.dot(q)
r += q
but I want to more fast method.
First thought was A can be symmetric, so eigen decomposition would help for compute power of A. But since A is large sparse matrix, decomposition makes dense matrix with same size as A which leads to performance degradation (in aspect of memory and speed).
Low Rank Approximation was also considered. But A is large and sparse, so not sure which rank r is appropriate.
It is totally fine to pre-compute something, like A + A^2 + ... + A^S = B. But I hope last computation will be fast: compute Bq less than 40ms.
Is there any reference or paper or tricks for that?
Even if the matrix is not sparse this the iterative method is the way to go.
Multiplying A.dot(q) has complexity O(N^2), while computing A.dot(A^i) has complexity O(N^3).
The fact that q is sparse (indeed much more sparse than A) may help.
For the first iteration A*q can be computed as A[q_hot_index,:].T.
For the second iteration the expected density of A # q has the same expected density as A for A (about 10%) so it is still good to do it sparsely.
For the third iteration afterwards the A^i # q will be dense.
Since you are accumulating the result it is good that your r is not sparse, it prevents index manipulation.
There are several different ways to store sparse matrices. I myself can't say I understand in depth all of them, but I think csr_matrix, csc_matrix, are the most compact for generic sparse matrices.
Eigen decomposition is good when you need to compute a P(A), to compute a P(A)*q the eigen decomposition becomes advantageous only when P(A) has degree of the order of the size of A. Eigen-decomposition has complexity O(N^3), a matrix-vector product has complexity O(N^2), the evaluation of a polynomial P(A) of degree D using the eigen decomposition can be achieved in O(N^3 + N*D).
Edit: Answering questionss on the comments
"it prevents index manipulation" <- Could you elaborate this?
Suppose you have a sparse matrix [0,0,0,2,0,7,0]. This could be described as ((3,2), (5,7)). Now suppose you assigne 1 to one element and it becomes [0,0,0,2,1,7,0], it is now represented as ((3,2), (4,1), (5,7)). The assignment is performed by insertion in an array, and inserting in an array has complexity O(nnz), where nnz is the number of nonzero elements. If you have a dense matrix you can always modify one element with complexity O(1).
What is the N in the complexity?
It is the number of rows, or columns, of the matrix A
About the eigen decomposition, do you want to say that it is worth
that computing r can be achieved in O(N^3 +N*D) not O(N^3 + N^2)
Computing P(A) will have complexity O(N^3 * D) (with a different constant), for big matrices, computing P(A) using the eigen decomposition is probably the most efficient. But P(A)x have O(N^2 * D) complexity, so it is probably not a good idea to compute P(A)x with eigen decomposition unless you have big D (>N), when speed is concerned.
I'm trying to make a dot product of an expression and it was supposed to be symmetric.
It turns out that it just isn't.
B is a 4D array which I must transpose its last two dimensions to become B^t.
D is a 2D array. (It's an expression of the Stiffness Matrix known to the Finite Element Method programmers)
The numpy.dotproduct associated with numpy.transpose and as a second alternative numpy.einsum (the idea came from this topic: Numpy Matrix Multiplication U*B*U.T Results in Non-symmetric Matrix) have already been used and the problem persists.
By the end of the calculations the product B^tDB is obtained and when it's verified if it really is symmetric by subtracting its transpose B^tDB, there is still a residue.
The Dot product or the Einstein Summation are used only over the dimensions of interest (last ones).
The question is: How can these residues be eliminated?
You need to use arbitrary precision floating point math. Here's how you can combine numpy and the mpmath package to define an arbitrary precision version of matrix multiplication (ie the np.dot method):
from mpmath import mp, mpf
import numpy as np
# stands for "decimal places". Larger values
# mean higher precision, but slower computation
mp.dps = 75
def tompf(arr):
"""Convert any numpy array to one of arbitrary precision mpmath.mpf floats
"""
if arr.size and not isinstance(arr.flat[0], mpf):
return np.array([mpf(x) for x in arr.flat]).reshape(*arr.shape)
else:
return arr
def dotmpf(arr0, arr1):
"""An arbitrary precision version of np.dot
"""
return tompf(arr0).dot(tompf(arr1))
As an example, if you then set up B, B^t, and D matrices as so:
bshape = (8,8,8,8)
dshape = (8,8)
B = np.random.rand(*bshape)
BT = np.swapaxes(B, -2, -1)
d = np.random.rand(*dshape)
D = d.dot(d.T)
then B^tDB - (B^tDB)^t will always have a non-zero value if you calculate it using the standard matrix multiplication method from numpy:
M = np.dot(np.dot(B, D), BT)
np.sum(M - M.T)
but if you use the arbitrary precision version given above it won't have a residue:
M = dotmpf(dotmpf(B, D), BT)
np.sum(M - M.T)
Watch out though. Calculations using arbitrary precision math run much slower than those done using standard floating point numbers.
I have two matrices A and B, each with a size of NxM, where N is the number of samples and M is the size of histogram bins. Thus, each row represents a histogram for that particular sample.
What I would like to do is to compute the chi-square distance between two matrices for a different pair of samples. Therefore, each row in the matrix A will be compared to all rows in the other matrix B, resulting a final matrix C with a size of NxN and C[i,j] corresponds to the chi-square distance between A[i] and B[j] histograms.
Here is my python code that does the job:
def chi_square(histA,histB):
esp = 1.e-10
d = sum((histA-histB)**2/(histA+histB+eps))
return 0.5*d
def matrix_cost(A,B):
a,_ = A.shape
b,_ = B.shape
C = zeros((a,b))
for i in xrange(a):
for j in xrange(b):
C[i,j] = chi_square(A[i],B[j])
return C
Currently, for a 100x70 matrix, this entire process takes 0.1 seconds.
Is there any way to improve this performance?
I would appreciate any thoughts or recommendations.
Thank you.
Sure! I'm assuming you're using numpy?
If you have the RAM available, you could use broadcast the arrays and use numpy's efficient vectorization of the operations on those arrays.
Here's how:
Abroad = A[:,np.newaxis,:] # prepared for broadcasting
C = np.sum((Abroad - B)**2/(Abroad + B), axis=-1)/2.
Timing considerations on my platform show a factor of 10 speed gain compared to your algorithm.
A slower option (but still faster than your original algorithm) that uses less RAM than the previous option is simply to broadcast the rows of A into 2D arrays:
def new_way(A,B):
C = np.empty((A.shape[0],B.shape[0]))
for rowind, row in enumerate(A):
C[rowind,:] = np.sum((row - B)**2/(row + B), axis=-1)/2.
return C
This has the advantage that it can be run for arrays with shape (N,M) much larger than (100,70).
You could also look to Theano to push the expensive for-loops to the C-level if you don't have the memory available. I get a factor 2 speed gain compared to the first option (not taking into account the initial compile time) for both the (100,70) arrays as well as (1000,70):
import theano
import theano.tensor as T
X = T.matrix("X")
Y = T.matrix("Y")
results, updates = theano.scan(lambda x_i: ((x_i - Y)**2/(x_i+Y)).sum(axis=1)/2., sequences=X)
chi_square_norm = theano.function(inputs=[X, Y], outputs=[results])
chi_square_norm(A,B) # same result