Right now I am using numpy.linalg.solve, but solving a system built from a 5000*17956 matrix is very time consuming: it has already taken me more than an hour. The running time for solving a matrix equation is probably O(n^3), but I never thought it would be this slow. Is there any way to solve it faster in Python?
My code is something like the following, solving for a in the equation B.T * U.T = B.T * B * a, where m is the number of test cases (in my case over 5000), B is an m*17956 data matrix, and U is 1*m.
C = 0.005 # hyperparameter term for regularization
I = np.identity(17956) # 17956*17956 identity matrix
rhs = np.dot(B.T, U.T) # (17956*m) * (m*1) = 17956*1
lhs = np.dot(B.T, B)+C*I # (17956*m) * (m*17956) = 17956*17956
a = np.linalg.solve(lhs, rhs) # B.T u = B.T B a, solve for a (17956*1)
Update (2 July 2018): The updated question asks about the impact of a regularization term and of the type of data in the matrices. In general, the data type can matter a great deal, since CPUs are more optimized for some types than others (as a rough rule of thumb, AMD is better at vectorized integer math and Intel at vectorized floating-point math, all else being equal), and a large number of zero values can allow the use of sparse matrix libraries. In this particular case, though, the changes on the main diagonal (well under 1% of all the values in consideration) will have a negligible impact on runtime.
TLDR;
An hour is reasonable (a cubic regression suggests that this would take around 83 minutes on my machine -- a low-end chromebook).
The pre-processing to generate lhs and rhs accounts for almost none of that time.
You won't be able to solve that exact problem much faster than with numpy.linalg.solve.
If m is small as you suggest and if B is invertible, you can instead solve the equation U.T=Ba in a minute or less.
If this is part of a larger problem, this costly intermediate step might be able to be simplified away from a mathematical framework.
Performance bottlenecks really should be addressed with profiling to figure out which step is causing the issues.
Since this comes from real-world data, you might be able to get away with fewer features (either directly or through a reduction step like PCA, NMF, or LLE), depending on the end goal; see the sketch after this list.
As mentioned in another answer, if the matrix is sufficiently sparse you can get away with sparse linear algebra routines to great effect (many natural language processing data sources are like this).
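A minimal sketch of that reduction idea, assuming a PCA step from scikit-learn and smaller illustrative sizes (the real B would be m*17956); the names mirror the question's code and the numbers are placeholders, not recommendations:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
m, n_features, n_reduced = 500, 2000, 100 # illustrative sizes only
B = rng.random((m, n_features)) # stand-in for the data matrix
U = rng.random(m) # stand-in for the targets
B_red = PCA(n_components=n_reduced).fit_transform(B) # m*n_reduced
C = 0.005
lhs = B_red.T @ B_red + C * np.identity(n_reduced) # n_reduced*n_reduced
rhs = B_red.T @ U
a_red = np.linalg.solve(lhs, rhs) # coefficients in the reduced feature space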
Since the output is a 1D vector, I would use np.dot(U, B).T instead of np.dot(B.T, U.T). Transposes are neat that way. This avoids doing the transpose on a big matrix like B, though since you have a cubic operation as the dominant step this doesn't matter much for your problem.
Depending on whether you need the original data anymore and if the matrices involved have any other special properties, you might be able to fiddle with the parameters in scipy.linalg.solve instead for a gain.
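For what it's worth, lhs = B.T B + C*I is symmetric positive definite for C > 0, so scipy.linalg.solve can be told as much; a small self-contained sketch with stand-in sizes:
import numpy as np
import scipy.linalg
rng = np.random.default_rng(0)
B_small = rng.random((50, 200)) # stand-in for the real m*17956 B
U_small = rng.random(50)
C = 0.005
lhs = B_small.T @ B_small + C * np.identity(200)
rhs = B_small.T @ U_small
a = scipy.linalg.solve(lhs, rhs, assume_a='pos') # Cholesky-based path instead of general LU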
I've had mixed success replacing large matrix equations with block matrix equations falling back on numpy routines. That approach typically saves 5-20% over numpy approaches and takes 1% or so off scipy approaches on my system. I haven't fully explored the reason for the discrepancy.
Assuming your matrix is sparse, the scipy.sparse.linalg module will be useful; see the documentation for the module as a whole, and for spsolve in particular.
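A minimal self-contained sketch of that route, with an illustrative tridiagonal system standing in for a real sparse matrix:
import numpy as np
import scipy.sparse
import scipy.sparse.linalg
n = 1000
main = 4.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = scipy.sparse.diags([off, main, off], offsets=[-1, 0, 1], format='csr') # illustrative sparse matrix
b = np.ones(n)
x = scipy.sparse.linalg.spsolve(A, b) # solves A x = b without densifying A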
EDIT: The original post was too vague. I am looking for an algorithm to solve a large, solvable, linear IVP system that can handle very small floating-point values. Solving for the eigenvectors and eigenvalues is impossible with numpy.linalg.eig(), as the returned values are complex and should not be; it does not support numpy.float128 either, and the matrix is not symmetric, so numpy.linalg.eigh() won't work. Sympy could do it given an infinite amount of time, but after running it for 5 hours I gave up. scipy.integrate.solve_ivp() works with implicit methods (I have tried Radau and BDF), but the output is wildly wrong. Are there any libraries, methods, algorithms, or solutions for working with this many very small numbers?
Feel free to ignore the rest of this.
I have a 150x150 sparse (~500 nonzero entries of 22500) matrix representing a system of first order, linear differential equations. I'm attempting to find the eigenvalues and eigenvectors of this matrix to construct a function that serves as the analytical solution to the system so that I can just give it a time and it will give me values for each variable. I've used this method in the past for similar 40x40 matrices, and it's much (tens, in some cases hundreds of times) faster than scipy.integrate.solve_ivp() and also makes post model analysis much easier as I can find maximum values and maximum rates of change using scipy.optimize.fmin() or evaluate my function at inf to see where things settle if left long enough.
This time around, however, numpy.linalg.eig() doesn't seem to like my matrix and is giving me complex values, which I know are wrong because I'm modeling a physical system that can't have complex rates of growth or decay (or sinusoidal solutions), much less complex values for its variables. I believe this to be a stiffness or floating point rounding problem where the underlying LAPACK algorithm is unable to handle either the very small values (smallest is ~3e-14, and most nonzero values are of similar scale) or disparity between some values (largest is ~4000, but values greater than 1 only show up a handful of times).
I have seen suggestions on similar questions to use sympy to solve for the eigenvalues, but when it hadn't solved my matrix after 5 hours I figured it wasn't a viable solution for my large system. I've also seen suggestions to use numpy.real_if_close() to remove the imaginary portions of the complex values, but I'm not sure this is a good solution either: several eigenvalues from numpy.linalg.eig() are 0, which is a sign of error to me, and additionally almost all the real portions are of the same scale as the imaginary portions (exceedingly small), which makes me question their validity as well. My matrix is real, but unfortunately not symmetric, so numpy.linalg.eigh() is not viable either.
I'm at a point where I may just run scipy.integrate.solve_ivp() for an arbitrarily long time (a few thousand hours) which will probably take a long time to compute, and then use scipy.optimize.curve_fit() to approximate the analytical solutions I want, since I have a good idea of their forms. This isn't ideal as it makes my program much slower, and I'm also not even sure it will work with the stiffness and rounding problems I've encountered with numpy.linalg.eig(); I suspect Radau or BDF would be able to navigate the stiffness, but not the rounding.
Anybody have any ideas? Any other algorithms for finding eigenvalues that could handle this? Can numpy.linalg.eig() work with numpy.float128 instead of numpy.float64 or would even that extra precision not help?
I'm happy to provide additional details upon request. I'm open to changing languages if needed.
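For reference, here is a minimal sketch of the eigendecomposition approach described above, with a hypothetical diagonal stand-in for the real 150x150 coefficient matrix (the real one comes from the model):
import numpy as np
rng = np.random.default_rng(1)
A = np.diag(-rng.random(150) * 1e-3) # hypothetical stand-in coefficient matrix
x0 = rng.random(150) # initial values of the variables
eigvals, V = np.linalg.eig(A) # A = V diag(eigvals) V^-1
c = np.linalg.solve(V, x0) # coordinates of x0 in the eigenbasis
def x(t):
    # analytical solution of x' = A x: x(t) = V diag(exp(eigvals*t)) V^-1 x0
    return (V * np.exp(eigvals * t)) @ c
print(x(100.0))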
As mentioned in the comment chain above, the best solution for this is to use a matrix exponential, which is a lot simpler (and apparently less error-prone) than diagonalizing your system with eigenvectors and eigenvalues.
For my case I used scipy.sparse.linalg.expm() since my system is sparse. It's fast, accurate, and simple. My only complaint is the loss of evaluation at infinity, but it's easy enough to work around.
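A minimal sketch of that approach, assuming a sparse coefficient matrix A (CSC format) and an initial state x0, both of which are illustrative stand-ins here:
import numpy as np
import scipy.sparse
import scipy.sparse.linalg
n = 150
A = scipy.sparse.diags(-np.linspace(1e-3, 1.0, n), format='csc') # stand-in sparse coefficient matrix
x0 = np.ones(n)
def x(t):
    # solution of x' = A x is x(t) = expm(A*t) x0
    return scipy.sparse.linalg.expm(A * t) @ x0
print(x(10.0))
If forming the full matrix exponential ever becomes a bottleneck, scipy.sparse.linalg.expm_multiply can apply it to a vector directly.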
I'm solving a non-linear elliptic PDE via linearization + iteration and a finite difference method: basically it comes down to solving a matrix equation Ax = b. A is a banded matrix. Due to the large size of A (typically ~8 billion elements) I have been using a sparse solver (scipy.sparse.linalg.spsolve) to do this. In my code, I compute a residual value which measures deviation from the true non-linear solution and lowers it with successive iterations. It turns out that there is a difference between the values that the sparse solver produces in comparison to what scipy.linalg.solve does.
Output of normal solver:
Output of sparse solver:
The only difference in my code is the replacement of the solver. I don't think this is down to floating-point error, since the difference creeps up to the 2nd decimal place (in the last iteration; but the order of magnitude also decreases, so I'm not sure). Any insights on why this might be happening? The final solution does not seem to be affected qualitatively, but I wonder whether this can create problems.
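One way to check whether the discrepancy is just numerical noise is to compare the residuals of both solutions against the same A and b; the banded matrix below is only an illustrative stand-in:
import numpy as np
import scipy.linalg
import scipy.sparse
import scipy.sparse.linalg
rng = np.random.default_rng(0)
n = 500
A = scipy.sparse.diags([np.full(n - 1, -1.0), np.full(n, 4.0), np.full(n - 1, -1.0)], offsets=[-1, 0, 1], format='csr')
b = rng.random(n)
x_sparse = scipy.sparse.linalg.spsolve(A, b)
x_dense = scipy.linalg.solve(A.toarray(), b)
# if both residual norms are near machine precision, both answers solve Ax = b
# to within rounding, and the difference between them is not a sign of a bug
print(np.linalg.norm(A @ x_sparse - b), np.linalg.norm(A @ x_dense - b))
print(np.max(np.abs(x_sparse - x_dense)))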
(No code has been included since the difference is only there in the creation of a sparse matrix and the sparse solver. However, if you feel you need to check some part of it, please ask me to include code accordingly)
I have to generate a matrix (a propagator, in physics terms) by the ordered multiplication of many other matrices. Each matrix is about (30, 30) in size, with all real entries (floats), but not symmetric. The number of matrices to multiply varies between 1e3 and 1e5. Each matrix is only slightly different from the previous one, but they are not commutative (and at the end I need the product of all these non-commutative multiplications). Each matrix is for a certain time slice, so I know how to generate each of them independently, wherever they fall in the multiplication sequence. At the end, I have to produce many such matrix propagators, so any performance enhancement is welcome.
What is the algorithm for fastest implementation of such matrix multiplication in python?
In particular -
How should I structure it? Are there fast axes, preferable dimensions for the rows/columns of the matrices, and so on?
Assuming memory is not a problem, should I allocate and build all the matrices before multiplying, or generate each one per time step? Should each matrix be stored in a dedicated variable before multiplication, or generated when needed and multiplied directly?
Are there cumulative effects of function-call overhead when generating the matrices?
Since I know how to build each matrix independently, should the work be parallelized? For example, create batch sequences from the start and from the end of the sequence, multiply them in parallel, and multiply the results in the proper order at the end?
Is it preferable to use a module other than numpy? Could Numba be useful, or some other efficient way to compile to C, or an optimized external library? (Please give a reference if so; I don't have experience with that.)
Thanks in advance.
I don't think the matrix multiplication itself will take much time, so I would do it in a single loop. Assembling the matrices is probably the costly part here.
If you have bigger matrices, a map-reduce approach could be helpful (split the sequence of matrices, multiply each chunk, and then multiply the resulting partial products in order), as sketched below.
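A minimal sketch of both ideas, assuming a hypothetical make_matrix(t) that builds the (30, 30) slice for time step t (the function body here is a placeholder):
from functools import reduce
import numpy as np
def make_matrix(t, n=30):
    # hypothetical: build the propagator slice for time step t
    rng = np.random.default_rng(t)
    return np.eye(n) + 1e-4 * rng.standard_normal((n, n))
n_steps = 10_000
# single loop: generate each slice on the fly and accumulate the ordered product
prop = np.eye(30)
for t in range(n_steps):
    prop = prop @ make_matrix(t) # order matters; earlier slices stay on the left
# map-reduce flavour: multiply chunks independently, then combine the partial
# products in the original order
chunk = 1_000
partials = [reduce(np.matmul, (make_matrix(t) for t in range(s, s + chunk)))
            for s in range(0, n_steps, chunk)]
prop_chunked = reduce(np.matmul, partials)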
Numpy is perfectly fine for problems like this, as it is well optimized (and partly written in C).
Just test how much time the matrix multiplication takes and how much the assembling. The result should indicate where you need to optimize.
This is a very general question -- is there any way to vectorize a sequential simulation (where the next step depends on the previous one), or any such iterative algorithm in general?
Obviously, if you need to run M simulations (each of N steps), you can loop for i in range(N) and compute all M values at each step to get a significant speed-up. But say you only need one or two simulations with a lot of steps, or your simulations don't have a fixed number of steps (like radiation detection), or you are solving a differential system (again, for a lot of steps). Is there any way to push the outer for-loop under the numpy hood (with an actual speed gain; I am not talking about passing a Python function object to numpy.vectorize), or are Cython-ish approaches the only option? Or maybe this is possible in R or some similar language, but not (currently?) in Python?
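To make the premise concrete, a minimal sketch of the M-simulations case (a random walk is used purely as an illustration):
import numpy as np
M, N = 100_000, 1_000 # simulations, steps
rng = np.random.default_rng(0)
x = np.zeros(M)
for step in range(N): # the time loop stays sequential
    x += rng.standard_normal(M) # but each step is vectorized across all M runs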
Perhaps multigrid-in-time methods can give some improvement.
I'm using scikit-learn (v0.15.2) for non-negative matrix factorization on a large sparse matrix (less than 1% of values > 0). I want to find factors by minimizing error only on the non-zero values of the matrix (i.e., not calculating error for entries that are zero), and to favor sparsity. I'm not sure whether something is wrong with what I'm trying. The scikit-learn package's NMF and ProjectedGradientNMF have worked well for me before, but it seems that as the matrix size increases, the factorization becomes terribly slow.
I'm talking about matrices with > 10^10 cells. For a matrix with ~10^7 cells, I find the execution time to be good.
The parameters I've used are as follows: nmf_model = NMF(n_components=100, init='nndsvd', random_state=0, tol=0.01, sparseness='data').
When I tried slightly different parameters (changing to init='random'), I got the following warning. After the warning, execution of the script halts.
/lib/python2.7/site-packages/sklearn/decomposition/nmf.py:252: UserWarning: Iteration limit reached in nls subproblem.
warnings.warn("Iteration limit reached in nls subproblem.")
Is there a way to make this faster and solve the above problem? I've tried using scipy sparse matrices (column- and row-compressed), but surprisingly they are slower on the test I did with a smaller matrix (~10^7 cells).
Considering that one would have to run multiple iterations of such a factorization (to choose an ideal number of factors and for k-fold cross-validation), a faster way to solve this problem is highly desirable.
I'm also open to suggestions of packages/tools that aren't based on sklearn or Python. I understand questions about package/tool choices are not encouraged, but for such a specific use case, knowing what techniques others in the field use would be very helpful.
Maybe a few words on what the initial problem is about could enable us to give better answers.
Matrix Factorization on a very large matrix is always going to be slow due to the nature of the problem.
Suggestions:
Reducing n_components to < 20 will speed it up somewhat. However, the only real improvement in speed will be achieved by limiting the size of the matrix.
With a matrix like the one you describe, one could think you are trying to factorize a term-frequency matrix. If so, you could try to use the vectorization functions in scikit-learn to limit the size of the matrix. Most of them have a max_features parameter. Example:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(data) # 'data' is the iterable of raw documents
This will significantly speed up the problem solving.
Should I be completely wrong and this is not a term frequency problem, I would still look into ways to limit the initial matrix you are trying to factorize.
You might want to take a look at this article which discusses more recent techniques on NMF: http://www.cc.gatech.edu/~hpark/papers/nmf_blockpivot.pdf
The idea is to work only on the nonzero entries for factorization, which reduces computation time, especially when the matrices involved are very sparse.
Also, one of the authors of the same article created NMF implementations on GitHub, including the ones mentioned in their article. Here's the link: https://github.com/kimjingu/nonnegfac-python
Hope that helps.
Old question, new answer.
The OP asks for "zero-masked" NMF, where zeros are treated as missing values. This will never be faster than normal NMF. Consider NMF by alternating least squares: there, the left-hand side of the systems of equations is generally constant (it is simply the tcrossprod of W or H), but in zero-masked NMF it needs to be recalculated for every single sample or feature.
I've implemented zero-masked NMF in the RcppML R package. You can install it from CRAN and use the nmf function with the mask_zeros argument set to TRUE:
install.packages("RcppML")
A <- rsparsematrix(1000, 1000, 0.1) # simulate random matrix
model <- RcppML::nmf(A, k = 10, mask_zeros = TRUE)
My NMF implementation is faster than scikit-learn without masking zeros, and shouldn't be impossibly slow for 99% sparse matrices.