Best performance method when solving 7000x7000 linear system with python

I'm in need of an efficient method to invert a 7000x7000 aerodynamic influence coefficient (dense) matrix in Python. I had started a FORTRAN routine to handle the problem using the LU decomposition routines from LAPACK, which I had seen used quite efficiently in other related apps. I've read, though, that the NumPy and SciPy linear system solvers are mostly based on direct calls to the same LAPACK/BLAS functions in C, and was wondering whether switching to FORTRAN would really reduce computation time to a level that justifies giving up an easier, higher-level language.
If there are Python solvers which would guarantee similar performance for matrices of that size (1000 to 10000, square), which are they?
I really need the matrix inverse, so switching to iterative Ax=b solutions is not an option.

Indeed, NumPy and SciPy efficiently call LAPACK routines to perform numpy.linalg.inv and scipy.linalg.inv.
To invert a general matrix, numpy.linalg.inv solves A.x = np.eye(n). The function inv() calls ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj), which calls call_#lapack_func#(&params), where params.B is the identity matrix and #lapack_func# is one of sgesv, dgesv, cgesv, zgesv, the linear solvers for general matrices.
On the other hand, scipy.linalg.inv calls getri, defined as get_lapack_funcs(('getri'),(a1,)). It corresponds to the DGETRI() function of LAPACK, designed to compute the inverse of a matrix using the LU factorization computed by DGETRF(). Hence, if you are using DGETRI() in Fortran, making use of scipy.linalg.inv() in Python will likely deliver similar performance and results.
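For instance, the same GETRF/GETRI pair can be reached from pure Python through scipy.linalg.get_lapack_funcs; here is a minimal sketch (the error handling and the test matrix are mine, not part of SciPy's inv()):
# Invert a dense matrix with the LAPACK getrf/getri pair, as scipy.linalg.inv does internally
import numpy as np
from scipy.linalg import get_lapack_funcs

def lu_inverse(A):
    getrf, getri, getri_lwork = get_lapack_funcs(('getrf', 'getri', 'getri_lwork'), (A,))
    lu, piv, info = getrf(A)                   # LU factorization (DGETRF for float64 input)
    if info != 0:
        raise RuntimeError(f"getrf failed, info={info}")
    lwork, info = getri_lwork(A.shape[0])      # optimal workspace size query
    inv_A, info = getri(lu, piv, lwork=int(lwork))
    if info != 0:
        raise RuntimeError(f"getri failed, info={info}")
    return inv_A

A = np.random.rand(1000, 1000)
Ainv = lu_inverse(A)
print('residual ', np.linalg.norm(A.dot(Ainv) - np.identity(1000), np.inf))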
Most LAPACK functions can be called using scipy.linalg.lapack. Here is an example making use of scipy.linalg.cython_lapack.dgetri() in a cython module: How to compile C extension for Python where C function uses LAPACK library? Here is a sample code comparing scipy.linalg.cython_lapack.dgetrf()+scipy.linalg.cython_lapack.dgetri(), numpy, and scipy.linalg.inv() on a 1000x1000 matrix:
import numpy as np
from scipy import linalg
import time
import myinverse

n = 1000
A = np.random.rand(n, n)

start = time.time()
Am, info, string = myinverse.invert(A.copy())
end = time.time()
print('DGETRF+DGETRI, ', end - start, ' seconds')
if info == 0:
    print('residual ', np.linalg.norm(A.dot(Am) - np.identity(n), np.inf))
else:
    print('inversion failed, info=', info, string)

start = time.time()
Am = np.linalg.inv(A.copy())
end = time.time()
print('np.linalg.inv ', end - start, ' seconds')
print('residual ', np.linalg.norm(A.dot(Am) - np.identity(n), np.inf))

start = time.time()
Am = linalg.inv(A.copy())
end = time.time()
print('scipy.linalg.inv ', end - start, ' seconds')
print('residual ', np.linalg.norm(A.dot(Am) - np.identity(n), np.inf))
And the output is:
DGETRF+DGETRI, 0.22541308403 seconds
residual 4.2155882951089296e-11
np.linalg.inv 0.29932808876 seconds
residual 4.371813154546711e-11
scipy.linalg.inv 0.298856973648 seconds
residual 9.110997546690758e-11
For a 2000x2000 matrix:
DGETRF+DGETRI, 1.64830899239 seconds
residual 8.541625644634121e-10
np.linalg.inv 2.02795410156 seconds
residual 7.448244269611659e-10
scipy.linalg.inv 1.61937093735 seconds
residual 1.6453560233026243e-09
A Fortran code chaining DGETRF()+DGETRI() is provided in LAPACK inversion routine strangely mixes up all variables.
After a few changes, let's run it:
PROGRAM solvelinear
    implicit none
    REAL(8), dimension(1000,1000) :: A, Ainv, M, LU
    REAL(8), allocatable :: work(:)
    REAL(8) :: wwork
    INTEGER :: info, lwork
    INTEGER, dimension(1000) :: ipiv
    INTEGER :: i, j
    real :: start, finish

    ! put code to test here
    info = 0
    !work=0
    ipiv = 0
    call RANDOM_NUMBER(A)

    call cpu_time(start)

    !-- LU factorisation
    LU = A
    CALL DGETRF(1000, 1000, LU, 1000, ipiv, info)

    !-- Inversion of matrix A using the LU
    Ainv = LU
    lwork = -1
    CALL DGETRI(1000, Ainv, 1000, ipiv, wwork, lwork, info)
    lwork = INT(wwork + 0.1)
    allocate(work(lwork))
    CALL DGETRI(1000, Ainv, 1000, ipiv, work, lwork, info)
    deallocate(work)

    call cpu_time(finish)
    print '("Time = ",f6.3," seconds.")', finish - start

    !-- computation of A^-1 * A to check the inverse
    M = matmul(Ainv, A)
    print*, "M = "
    do i = 1, 3
        do j = 1, 3
            print*, M(i, j)
        enddo
    end do
END PROGRAM solvelinear
Once compiled using gfortran main2.f90 -o main2 -llapack -lblas -lm -Wall, it takes 0.42 s for a 1000x1000 matrix and 3 s for a 2000x2000 matrix.
Finally, different performance may occur if the Fortran code and the Python code do not link to the same BLAS/LAPACK libraries. To investigate this, type commands like np.__config__.show() as shown in Link ATLAS/MKL to an installed Numpy or the commands reported in How to check BLAS/LAPACK linkage in NumPy and SciPy?.
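For example, a quick check of the build-time configuration from a Python session:
# Show which BLAS/LAPACK NumPy and SciPy were built against
import numpy as np
import scipy

np.__config__.show()
scipy.__config__.show()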
To go further and make use of distributed computations, PETSc discourages inverting full matrices, as it is rarely required. It is also written that MatMatSolve(A,B,X), where B and X are dense matrices, can be used to do so. Furthermore, this function is provided in the Python interface petsc4py as the method matSolve(self, Mat B, Mat X) of the object petsc4py.PETSc.Mat. The no-longer-maintained Elemental library is listed as implementing a direct solver for dense matrices. While the Elemental library supported a Python interface, its fork Hydrogen does not support it anymore.
Nevertheless, the Elemental page lists some related open-source projects for distributed dense linear algebra. ScaLAPACK provides the routines PDGETRI()/PZGETRI() to invert distributed dense matrices using LU decomposition. That might leave some room for a faster inversion.

Related

Efficient computation of entropy-like formula (sum(xlogx)) in Python

I'm looking for an efficient way to compute the entropy of vectors, without normalizing them and while ignoring any non-positive value.
Since the vectors aren't probability vectors and shouldn't be normalized, I can't use scipy's entropy function.
So far I couldn't find a single numpy or scipy function to obtain this, and as a result my alternatives involve breaking the computation into two steps, which involves intermediate arrays and slows down the run time. If anyone can think of a single function for this computation it would be interesting.
Below is a timeit script for measuring several alternatives that I tried. I'm using a pre-allocated array to avoid repeated allocations and deallocations during run time. It's possible to select which alternative to run by setting the value of func_code. I included the nansum version offered by one of the answers. The measurements on my MacBook Pro 2019 are:
matmul: 16.720187613
xlogy: 17.296380516
nansum: 20.059866123000003
import timeit
import numpy as np
from scipy import special

def matmul(arg):
    a, log_a = arg
    log_a.fill(0)
    np.log2(a, where=a > 0, out=log_a)
    return (a[:, None, :] @ log_a[..., None]).ravel()

def xlogy(arg):
    a, log_a = arg
    a[a < 0] = 0
    return np.sum(special.xlogy(a, a), axis=1) * (1 / np.log(2))

def nansum(arg):
    a, log_a = arg
    return np.nansum(a * np.log2(a, out=log_a), axis=1)

def setup():
    a = np.random.rand(20, 1000) - 0.1
    log = np.empty_like(a)
    return a, log

setup_code = """
from __main__ import matmul, xlogy, nansum, setup
data = setup()
"""
func_code = "matmul(data)"

print(timeit.timeit(func_code, setup=setup_code, number=100000))
On my machine the computation of the logarithms takes about 80% of the time of matmul, so it is definitely the bottleneck, and optimizing the other functions will result in a negligible speed-up.
The bad news is that the default implementation of np.log is not yet optimized on most platforms. Indeed, it is not vectorized by default, except on recent x86 Intel processors supporting AVX-512 (i.e. basically Skylake processors on servers and Ice Lake processors on PCs, but not the recent Alder Lake). This means the computation could be significantly faster once vectorized. AFAIK, the closed-source SVML library does support AVX/AVX2 and could speed it up (on x86-64 processors only). SVML is supported by Numexpr and Numba, which can be faster because of that, assuming you have access to the non-free SVML, which is part of Intel tools often available on HPC machines (e.g. MKL, OneAPI, etc.).
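If Numexpr is available, the masked sum can also be written as a single expression; here is a small untested sketch (note that Numexpr has no log2, so the natural log is used and the result is rescaled afterwards):
# Hypothetical Numexpr variant of the computation (assumes the numexpr package is installed)
import numexpr as ne
import numpy as np

a = np.random.rand(20, 1000) - 0.1
ent = ne.evaluate("sum(where(a > 0, a * log(a), 0), axis=1)") / np.log(2)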
If you do not have access to the SVML there are two possible remaining options:
Implement your own optimized SIMD log2 function, which is possible but hard since it requires a good understanding of the hardware SIMD units and certainly requires writing C or Cython code. This solution consists of computing the log2 function as an n-degree polynomial approximation (it can be exact to 1 ULP with a big n, though one generally does not need that). Naive approximations (e.g. n=1) are much simpler to implement but often too inaccurate for scientific use.
Implement a multi-threaded log computation, typically using Numba/Cython. This is a somewhat desperate solution, as multithreading can slow things down if the input data is not large enough.
Here is an example of multi-threaded Numba solution:
import numba as nb
import numpy as np

@nb.njit('(UniTuple(f8[:,::1],2),)', parallel=True)
def matmul(arg):
    a, log_a = arg
    result = np.empty(a.shape[0])
    for i in nb.prange(a.shape[0]):
        s = 0.0
        for j in range(a.shape[1]):
            if a[i, j] > 0:
                s += a[i, j] * np.log2(a[i, j])
        result[i] = s
    return result
This is about 4.3 times faster on my 6-core PC (200 µs vs. 46.4 µs). However, you should be careful if you run this on a many-core server with such a small dataset, as it can actually be slower on some platforms.
Taking np.log2 of negative values just gives a runtime warning and produces np.nan (zero gives -inf, which also becomes nan once multiplied by the zero in front of it), which is probably the best way to deal with them. If you don't want them to pollute your sum, just use
np.nansum(v_i*np.log2(v_i))
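A minimal sketch of that approach, with the warnings silenced via np.errstate (the random test vector is mine):
# nansum-based entropy-like sum; non-positive entries contribute nan terms that are ignored
import numpy as np

v = np.random.rand(1000) - 0.1                        # contains some non-positive values
with np.errstate(divide='ignore', invalid='ignore'):  # silence log2 warnings for v <= 0
    ent = np.nansum(v * np.log2(v))
print(ent)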

Is there a way to increase the speed of working with arrays in Fortran under Windows, like Python numpy?

Sorry for the possible duplication.
About the problem.
numpy (1.18.2) in Python 3.8.2 gives me a much higher simulation speed (about 3 times faster) for a matrix product compared to GNU Fortran (9.2.0 MinGW.org GCC Build-20200227-1) under Windows. I used the command gfortran.exe test.f without any additional options.
Does anyone know what is causing this and is it possible to increase the simulation speed in Fortran?
Here is the fortran code:
program product_test
    INTEGER :: N, N_count, i, j, k, nc
    REAL*8 :: t1, t2
    REAL*8, dimension (:,:), allocatable :: a, b, c

    N = 200
    N_count = 10
    allocate ( a(N,N) )
    allocate ( b(N,N) )
    allocate ( c(N,N) )
    call RANDOM_NUMBER(a)
    call RANDOM_NUMBER(b)

    print *, 'Matrix Multiplication: C = A * B for size (',N,',',N,')'
    call CPU_TIME ( time_begin )
    do nc = 1, N_count
        c = MATMUL(a, b)
    end do
    call CPU_TIME ( time_end )
    t2 = (time_end - time_begin) / N_count
    print *, 'Time of operation was ', t2, ' seconds'
end
Here is the output:
Matrix Multiplication: C = A * B for size ( 200 , 200 )
Time of operation was 9.3749E-003 seconds
Here is the python 3 code:
import numpy as np
import time

N = 200
N_count = 10
a = np.random.rand(N, N)
b = np.random.rand(N, N)
c = np.zeros([N, N], dtype=float)

print('Matrix product in python (using numpy): c= a*b for size (', N, ',', N, ')')
start_time = time.time()
for nc in range(N_count):
    c = a @ b
t2 = (time.time() - start_time) / N_count
print('Elapsed time = ', t2, 's')
Here is the output:
Matrix product in python (using numpy): c= a*b for size ( 200 , 200 )
Elapsed time = 0.0031252 s
Additional tests. Following the comments of roygvib and Vladimir F, I have done the test with BLAS/LAPACK:
gfortran test.f -lopenblas -o test.exe
or gfortran test.f -ffast-math -o test.exe
or gfortran test.f -lblas -o test.exe
or gfortran test.f -llapack -o test.exe give me a calculation time of 0.0063 s for matrix multiplication of square matrices of size (200 x 200).
Unfortunately, I have deleted the previous version of MinGW and the new tests were performed under GNU Fortran (x86_64-posix-seh-rev0, Built by MinGW-W64 project 8.1.0). Maybe I did something incorrect, because there is no difference between -llapack, -lblas and -lopenblas. For time measuring, I used SYSTEM_CLOCK as suggested by Vladimir F.
Now it is better, but numpy is still faster than Fortran (not three times, but two times).
Following the last comment of Vladimir F, I found that, unlike Python, Fortran uses mainly one logical core (there are 4 logical cores on my PC with an Intel i3 CPU). Thus, this is a problem of an improperly configured MinGW on my PC (Windows 8.1).
Use MATMUL or external libraries like BLAS for matrix multiplication in Fortran. We have many questions that deal with the performance of matrix multiplication:
Fortran matrix multiplication performance in different optimization
performance of fortran matrix operations
How does BLAS get such extreme performance?
You should read them first. You should never do matrix multiplication in a naive for loop; that will always be slow. There are special algorithms for matrix multiplication. They use the memory bandwidth in an efficient way and also employ vectorizing instructions (often written directly in assembly).
Many Fortran compilers will allow you to call BLAS xGEMM directly through MATMUL. In gfortran it is possible with -fexternal-blas mentioned by roygvib. If you have problems with that, call DGEMM directly.
Certain BLAS implementations are able to use multiple threads. If you try that you must not use CPU_TIME to measure the speed, you have to use SYSTEM_CLOCK or an alternative.
Also, you did not report using any optimization flags like -O3. These are necessary for any decent performance unless an optimized external library does all the work.
The problem was maybe in the compatibility of different versions. I updated the compiler and libraries (I upgraded to gcc 9.3.0 and OpenBLAS 0.3.9 after uninstalling all previous versions).
Now the following results for the matrix product c = a * b with matrix size (2000x2000), averaged over 20 trials, are more adequate (I carried out the test on a PC with an Intel i5 (4 logical cores) under Windows 10):
0.237833 s (minGW64) and 0.236853 s (cygwin64). C++ with Armadillo using gcc 9.3.0 + OpenBLAS 0.3.9
0.2492 s (minGW64) and 0.2479 s (cygwin64), norm = 0. Fortran (matmul) with the -fexternal-blas flag, command line: gfortran FILE_NAME.f95 -o FILE_NAME -O3 -ffast-math -fexternal-blas "[pathto]\libopenblas_v0.3.9-gcc_9_3_0.a" (gcc 9.3.0, OpenBLAS 0.3.9)
0.2484 s (dgemm), versus 1.12894 s for matmul, norm = 1.5695E-10. Fortran in minGW64 with the -lopenblas flag, command line: gfortran FILE_NAME.f95 -o FILE_NAME -O3 -ffast-math -lopenblas (gcc 9.3.0)
0.2562533 s, norm = 0.0. Python (numpy)
0.285133 s (R2016a) and 0.269926 s (R2020a), norm = 8.4623e-12. Matlab 64-bit.
0.3133 s, norm = 1.5695E-10. Fortran (matmul) in minGW64/cygwin64 with the -lblas flag, command line: gfortran FILE_NAME.f95 -o FILE_NAME -O3 -ffast-math -lblas (gcc 9.3.0, in cygwin64).
To run these tests, I used cygwin (or minGW) to compile C++ code using Armadillo (OpenMP C++ Matrix Multiplication run slower in parallel), where three matrices A, B, C were created and saved to disk in order to use the same matrices in all these tests. Thus, the "norm" indicates the relative accuracy of the matrix product.
I found that numpy uses openblas (libopenblas.PYQHXLVVQ7VESDPUVUADXEVJOBGHJPAY.gfortran-win_amd64).
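One way to confirm at runtime which BLAS a NumPy build has actually loaded, and how many threads it uses, is the third-party threadpoolctl package; a minimal sketch, assuming threadpoolctl is installed:
# Inspect the BLAS/OpenMP libraries loaded by the running process
import numpy as np                       # importing NumPy loads its BLAS
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    print(pool.get("internal_api"), pool.get("filepath"), pool.get("num_threads"))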
Matlab on my PC gives me the next information about library of blas/lapack: Intel (R) Math Kernel Library version 11.2.3 Build 20150413 for applications with Intel (R) 64 architecture, CNR AVX2 branch in R2016a and Intel(R) Math Kernel Library Version 2019.0.3 Product Build 20190125 for Intel(R) 64 architecture applications, CNR branch AVX2 in R2020a.
The Fortran simulation speed is now reasonable with respect to the other languages. And OpenBLAS won in C++ (perhaps due to its adaptation for C). Note that Matlab shows a relatively high computational speed while not fully using the CPU. All languages/programs use all 4 cores of my system.

Vectorization of recursively defined problem - Python

I was wondering if anyone would have an idea on how I am able to vectorize the following loop:
for i in range(1, (T * n) + 1):
    Y = Y + np.diag(mu) @ Y * dt + np.multiply(np.diag(sigma) @ Y, L @ np.random.normal(0, dt, (d, N)))
where the following terms are already d x N matrices (I already vectorized a loop for those):
Y (this is the recursive parameter)
np.diag(mu) @ Y * dt
np.diag(sigma) @ Y
L @ np.random.normal(0, dt, (d, N))
Any help would be very appreciated. :)
With best regards!
Unfortunately, this doesn't look like vectorizable code:
Iterations should be independent. Typically, vectorization means making several iterations at once, and it also implies using AVX, SSE or FMA instructions (if we talk about x86 processors) to make iterations go truly in parallel on a hardware level.
Continuing about vector assembly instructions, such a level of optimization is typically unreachable from Python code because the interpreter isn't that smart. An iteration is also doing too much to be vectorized. It actually contains sub-loops! We don't see them, but matrix multiplications do involve more loops.
So I wouldn't call optimization of this loop a "vectorization". But luckily, there are still things to check:
Profile it. Find out what part of the computation consumes most of the time.
Verify that np.random doesn't slow down the program significantly. If it does, you can rely on pre-generated values instead (see the sketch after the unrolled-loop example below).
Check if code that can be vectorized is vectorized. That means, verify that your numpy is built with SSE/AVX support and that matrix multiplications use that under the hood. It can be a bit tricky to do but up to x4 speedups* are possible with AVX usage.
If parts of the code are indeed vectorized on the assembly level, switching to storing data in float16 arrays can make it even faster. To my knowledge, AVX does support operations on large blocks of 16-bit floats.
Rewrite it in C/Cython or try out Numba JIT compilation for the same task.
If even compilation with Numba doesn't work out, I wonder if Tensorflow can help here. With Tensorflow, Python code doesn't kick off computations immediately but rather constructs a computational graph that is then executed without returning to the interpreter level. Tensorflow does support AVX and SSE (although not without pain), so you may expect more control over low-level details than with numpy. And you can also try to launch it on GPU.
Last thing, I don't quite believe in it, but does loop unrolling help?
for i in range(1, (T * n + 1) // 4):
    Y = Y + ...
    Y = Y + ...
    Y = Y + ...
    Y = Y + ...
* - subject to Amdahl's law
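Regarding point 2 above (pre-generating the random values), here is a rough sketch of what that looks like; only the names come from the question, while the shapes and the stand-in values for mu, sigma, L, Y and dt are my assumptions:
# Draw all random increments once, outside the loop (stand-in sizes and values)
import numpy as np

d, N, T, n = 4, 1000, 1, 250
dt = 1.0 / n
rng = np.random.default_rng(0)

mu, sigma = rng.random(d), rng.random(d)
L = np.eye(d)
Y = np.ones((d, N))

steps = T * n
noise = rng.normal(0.0, dt, size=(steps, d, N))   # one big draw instead of one per iteration

for i in range(steps):
    Y = Y + np.diag(mu) @ Y * dt + (np.diag(sigma) @ Y) * (L @ noise[i])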

SVD calculation error from lapack function when using scikit-learn's Linear Discriminant Analysis class

I'm classifying 2-class, 1-D data using scikit-learn's LDA classifier in a machine learning pipeline I've created. The following exception occurred:
ValueError: Internal work array size computation failed: -10
at the following line:
LinearDiscriminantAnalysis.fit(X,y)
where X = [-5e15, -5e15, -5e15, 5.7e16] and y = [0, 0, 0, 1], both float64 data-type
Additionally the following error was printed to console:
Intel MKL ERROR: Parameter 10 was incorrect on entry to DGESDD
After a quick Google search, I found that dgesdd is a function in LAPACK which scikit-learn relies upon. The dgesdd documentation tells us that the function computes the singular value decomposition (SVD) of a real M-by-N matrix A.
Going back to the original exception, I found it was raised in scipy.linalg.lapack at the _compute_lwork function. This function takes as input a function, which in this case I believe is the dgesdd function. CTRL-F "-10" on the dgesdd documentation page gives the logic behind this error code, but I don't know Fortran so I'm not exactly sure what it means.
I want to bet that the SVD calculation is failing due to either (1) the large values in the X array, or (2) the fact that 3 of the values in the X array are exactly the same number.
I will keep reading into SVD and its limitations. Any insight on how to avoid this error would be tremendously appreciated.
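For reference, here is the failing call packaged as a minimal runnable snippet; the reshape of X into a column vector is my assumption, since scikit-learn expects a 2-D feature array:
# Minimal reproduction sketch using the values quoted above
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([-5e15, -5e15, -5e15, 5.7e16], dtype=np.float64).reshape(-1, 1)
y = np.array([0, 0, 0, 1])

LinearDiscriminantAnalysis().fit(X, y)   # raised "ValueError: Internal work array size computation failed: -10" here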
This is the definition of DGESDD:
subroutine dgesdd (JOBZ, M, N, A, LDA, S, U, LDU, VT, LDVT, WORK, LWORK, IWORK, INFO)
The error that you have indicates that the value passed to MKL's implementation of the routine for the 10th parameter, LDVT, the leading dimension of the V**T matrix, does not comply with the expectations of said routine.
This could be a bug in Intel's implementation (rather unlikely, assuming that there is a battery of tests stress-testing these routines, but not impossible). Which version of MKL is this? Or it could be a bug in the LDA code, which is rather more likely:
LDVT is INTEGER
The leading dimension of the array VT. LDVT >= 1;
if JOBZ = 'A' or JOBZ = 'O' and M >= N, LDVT >= N;
if JOBZ = 'S', LDVT >= min(M,N).
Would you please print M, N, LDA, LDU and LDVT?
If you set LDVT properly the workspace analysis will run just fine.
Regarding the Intel MKL ERROR: Parameter 10 was incorrect on entry to DGESDD problem: this has actually been fixed in MKL v.2018 u4 (Sep 2018). Here is the link to the MKL 2018 bug fix list.
You can easily check the version of MKL you use by setting the environment variable MKL_VERBOSE=1 and looking at the output, which will contain this kind of info.
E.g.:
MKL_VERBOSE Intel(R) MKL 2019.0 Update 2 Product build 20190118 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions (Intel(R) AVX) enabled processors, Lnx 2.80GHz lp64 intel_thread
MKL_VERBOSE ZGETRF(85,85,0x13e66f0,85,0x13e1080,0) 6.18ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20
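From Python, one way to trigger that output is to set the variable before NumPy (and therefore MKL) is loaded, then issue any LAPACK-backed call; a rough sketch, assuming an MKL-backed NumPy build:
# Enable MKL's verbose banner, then trigger a LAPACK call so the version info is printed
import os
os.environ.setdefault("MKL_VERBOSE", "1")   # must be set before MKL is loaded

import numpy as np                          # assumes NumPy is linked against MKL
np.linalg.svd(np.random.rand(50, 50))       # any LAPACK-backed call prints MKL_VERBOSE lines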

Long Vector Linear Programming in R?

Hello and thanks in advance. Fresh off the heels of this question, I acquired some more RAM and now have enough memory to fit all the matrices I need to run a linear programming solver. Now the problem is that none of the linear programming packages in R seem to support long vectors (i.e. large matrices).
I've tried functions Rsymphony_solve_LP, Rglpk_solve_LP and lp from packages Rsymphony, Rglpk, and lpSolve respectively. All report a similar error to the one below:
Error in rbind(const.mat, const.dir.num, const.rhs) :
long vectors not supported yet: bind.c:1544
I also have my code below in case that helps... the constraint matrix mat is my big matrix (7062 rows by 364520 columns) created using the package bigmemory. When I run this line the matrix is pulled into memory, and then after a while the errors show.
Rsymph <- Rsymphony_solve_LP(obj,mat[1:nrow(mat),1:ncol(mat)],dir,rhs,types=types,max=max, write_lp=T)
I'm guessing this is a hard-coded error in each of the three functions? Is there currently a linear programming solver in R or even Python that supports long vectors? Should I contact the package maintainers or just edit the code myself? Thanks!
The package lpSolveAPI can solve long-vector linear programming problems. You first have to start by declaring a linear programming object, then add the constraints:
library(lpSolveAPI)
#Generate Linear Programming Object
lprec <- make.lp(nrow = 0           # Number of Constraints
                 , ncol = ncol(mat) # Number of Decision Variables
                 )
#Set Objective Function to Minimize
set.objfn(lprec, obj)
#Add Constraints
#Note Direction and RHS is included along with Constraint Value
for(i in 1:nrow(mat)){
  add.constraint(lprec, mat[i,], dir[i], rhs[i])
  print(i)
}
#Set Decision Variable Type
set.type(lprec, c(1:ncol(mat)), type = c("binary"))
#Solve Model
solve(lprec)
#Obtain Solutions
get.total.iter(lprec)
get.objective(lprec)
get.variables(lprec)
There's a good introduction to this package here.
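As for the Python side of the question: one candidate worth mentioning (my suggestion, not part of the answer above, and untested at this scale) is SciPy's milp, which accepts sparse constraint matrices and binary variables. A rough sketch, with stand-in data in place of the question's obj/mat/dir/rhs and all constraints assumed to be of the "<=" type:
# Hypothetical sketch with scipy.optimize.milp (SciPy >= 1.9); stand-in sizes keep it quick to run,
# whereas the question's real problem is 7062 constraints by 364520 binary variables
import numpy as np
from scipy import sparse
from scipy.optimize import milp, LinearConstraint, Bounds

n_con, n_var = 100, 5000
obj = np.random.rand(n_var)                                      # stand-in objective
mat = sparse.random(n_con, n_var, density=0.01, format="csr")    # stand-in sparse constraint matrix
rhs = np.full(n_con, 10.0)                                       # stand-in right-hand side

res = milp(c=obj,
           constraints=LinearConstraint(mat, -np.inf, rhs),      # mat @ x <= rhs
           integrality=np.ones(n_var),                           # all variables integer...
           bounds=Bounds(0, 1))                                  # ...and bounded to {0, 1}, i.e. binary
print(res.status, res.fun)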
