scipy.optimize.least_squares, LinearOperator for argument 'jac'


I'm trying to understand the documentation on the scipy.optimize.least_squares function:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.least_squares.html
The possibilities for inputs are a bit messy, and the documentation is not super clear on all of them. It says that the argument jac can be a callable returning a LinearOperator.
I suppose, the provided LinearOperator is supposed to represent the Jacobian as a linear operator mapping the variable shifts to the residual shifts. Or is it the other way round?
Which operations do I need to implement for the LinearOperator? Only matvec, or matmat as well?
Does providing a LinearOperator instead of the full Jacobi matrix actually speed up anything? Or is the full matrix built from the operator anyways? (And yes, in my example, evaluating the LinearOperator is much faster than building the whole Jacobi matrix.)

least_squares expects the Jacobian of the original functions (the map from variable shifts to value shifts). If you know the Jacobian of the residuals, you should probably switch to fmincon or another optimization routine and work with the residuals. One of the main advantages of the least-squares approach is that you can work efficiently in terms of the original functions instead of the residuals.
least_squares invokes matmat, matvec, and rmatvec. LinearOperator can derive matmat from matvec if only matvec is provided (and vice versa), but it cannot derive rmatvec unless rmatvec or rmatmat is supplied.
Most of the time only the result of J(x).T.dot(f) is needed, and the full matrix is never stored. I did, however, notice some numerical differences between matrix and operator Jacobians.
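To make the points above concrete, here is a minimal sketch of a matrix-free Jacobian passed to least_squares. The test problem (a cubic in each component with a diagonal Jacobian) and the names c, fun and jac are hypothetical and chosen only so the answer is easy to check. The operator maps a shift in the variables to a shift in the vector returned by fun (matvec), and rmatvec applies the transpose. The sketch assumes method="trf" with tr_solver="lsmr" and a fixed x_scale, since operator Jacobians are not supported by the dense-only options.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.sparse.linalg import LinearOperator

# Hypothetical targets, chosen so that the solution is x = [1, 2, 3].
c = np.array([2.0, 10.0, 30.0])


def fun(x):
    # Vector whose squared norm least_squares minimizes.
    return x**3 + x - c


def jac(x):
    # The Jacobian of fun is diagonal here: d fun_i / d x_i = 3 x_i^2 + 1.
    d = 3.0 * x**2 + 1.0
    n = x.size

    def matvec(v):
        # J @ v: variable shift -> shift in fun(x).
        # ravel() because LinearOperator may pass a column of shape (n, 1).
        return d * np.ravel(v)

    def rmatvec(u):
        # J.T @ u, needed for the gradient J.T @ f.
        return d * np.ravel(u)

    # matmat is not supplied; LinearOperator derives it from matvec.
    return LinearOperator((n, n), matvec=matvec, rmatvec=rmatvec, dtype=float)


res = least_squares(
    fun,
    x0=np.ones(3),
    jac=jac,
    method="trf",        # 'lm' only works with dense Jacobians
    tr_solver="lsmr",    # iterative solver that only needs operator products
    x_scale=1.0,         # x_scale='jac' is not available for operators
)
print(res.x, res.cost)
```

Only matvec and rmatvec are implemented here, which matches the statement that matmat can be derived automatically while rmatvec cannot.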

Related

Is there a way to define a 'heterogeneous' kernel design to incorporate linear operators into the regression for GPflow (or GPytorch/GPy/...)?

I'm trying to perform a GP regression with linear operators as described in for example this paper by Särkkä: https://users.aalto.fi/~ssarkka/pub/spde.pdf In this example we can see from equation (8) that I need a different kernel function for the four covariance blocks (of training and test data) in the complete covariance matrix.
This is definitely possible and valid, but I would like to include this in a kernel definition of (preferably) GPflow, or GPytorch, GPy or the like.
However, in the documentation for kernel design in GPflow, the only possibility is to define a covariance function that acts on all covariance blocks. In principle, the method above should be straightforward to add myself (the kernel function expressions can be derived analytically), but I don't see any way of incorporating the 'heterogeneous' kernel functions into the regression or kernel classes. I tried consulting other packages such as GPyTorch and GPy, but again, the kernel design does not seem to allow this.
Maybe I'm missing something here, or maybe I'm not familiar enough with the underlying implementation to assess this, but if someone has done this before or sees how to implement it (it should be reasonably straightforward?), I would be happy to find out.
Thank you very much in advance for your answer!
Kind regards

This should be reasonably straightforward, though it requires building a custom kernel. Basically, you need a kernel that knows, for each input, which linear operator applies to the corresponding output (a function observation/identity operator, an integral observation, a derivative observation, etc.). You can achieve this by including an extra column in your input matrix X, similar to how it's done for the gpflow.kernels.Coregion kernel. You would then need to define a new kernel with K and K_diag methods that, for each linear operator type, find the corresponding rows in the input matrix and pass them to the appropriate covariance function (using tf.dynamic_partition and tf.dynamic_stitch; these are used in a very similar way in GPflow's SwitchedLikelihood class).
The full implementation would probably take half a day or so, which is beyond what I can do here, but I hope this is a useful starting pointer. You're very welcome to join the GPflow Slack (invite link in the GPflow README) and discuss it in more detail there!
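The extra-column idea can be illustrated without GPflow. The following is a plain NumPy sketch, not a GPflow kernel: it assumes a 1-D squared-exponential kernel, two operator types (0 = function value, 1 = first derivative), and a hypothetical helper name hetero_rbf. It selects the covariance block with boolean masks instead of tf.dynamic_partition and tf.dynamic_stitch, but the bookkeeping is the same: the second column of X tells the kernel which covariance function to use for each pair of rows.

```python
import numpy as np


def hetero_rbf(X, X2=None, ell=1.0):
    """Covariance between observations of f and f' under an RBF prior.

    Column 0 of X is the input location; column 1 is the operator index
    (0 = function observation, 1 = derivative observation).
    """
    X2 = X if X2 is None else X2
    x, op = X[:, :1], X[:, 1:2].astype(int)       # shape (N, 1)
    x2, op2 = X2[:, :1].T, X2[:, 1:2].T.astype(int)  # shape (1, M)

    r = x - x2
    k = np.exp(-0.5 * r**2 / ell**2)               # cov(f(x),  f(x'))
    k_fd = k * r / ell**2                          # cov(f(x),  f'(x'))
    k_df = -k * r / ell**2                         # cov(f'(x), f(x'))
    k_dd = k * (1.0 / ell**2 - r**2 / ell**4)      # cov(f'(x), f'(x'))

    # Pick the block that matches each (row operator, column operator) pair.
    return np.select(
        [(op == 0) & (op2 == 0), (op == 0) & (op2 == 1),
         (op == 1) & (op2 == 0), (op == 1) & (op2 == 1)],
        [k, k_fd, k_df, k_dd],
    )


# Three function observations and two derivative observations.
X = np.array([[0.0, 0], [0.5, 0], [1.0, 0], [0.25, 1], [0.75, 1]])
K = hetero_rbf(X, ell=0.7)
print(K.shape)
```

In a GPflow kernel the same branching would live inside K and K_diag, with the operator column stripped off before the base covariance functions are evaluated.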

Scalar minimization using scipy (`minimize` vs `minimize_scalar`)

I have a polynomial function for which I would like to find all local extrema. I can evaluate the polynomial via P(x) and its derivative via d_P(x).
My first thought was to use minimize_scalar; however, it does not seem to be able to take advantage of the fact that I can evaluate the derivative. Alternatively, I can use the more general minimize function and provide the gradient.
Is there a rule of thumb about which method will work better, or is this something where I should test both methods and see which works better? Since the function I am optimizing is a polynomial (well behaved), I wonder whether it really matters which I use, but if someone has more background, that would be great.
In particular, P(x) is the (unique) polynomial of degree n which alternately attains a value of 1 or -1 on a set of n-1 points.
Here is a sample of the P(x) scaled so that P(0)=1. Note that the y axis is plotted on a symlog scale.

minimize_scalar is derivative-free: its methods work by bracketing the minimum and shrinking the interval rather than by following a gradient. Since it doesn't use gradient information, it won't have trouble with noise, discontinuities, or discreteness in your objective. If you use minimize in conjunction with a gradient-based method, on the other hand, noise, discontinuities, or discreteness can cause convergence problems.
If the objective function is first-order continuous, then both minimize and minimize_scalar should yield the same solution for a given bound.
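A quick way to check this claim on a smooth function is to run both routines on the same interval. The sketch below uses a hypothetical quartic as a stand-in for P(x) and d_P(x) (it is not the polynomial from the question) and an interval that contains exactly one local minimum. minimize_scalar uses its derivative-free bounded method, while minimize uses L-BFGS-B with the analytic derivative.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar


def P(x):
    # Hypothetical stand-in polynomial.
    return x**4 - 3.0 * x**2 + x


def d_P(x):
    # Its analytic derivative.
    return 4.0 * x**3 - 6.0 * x + 1.0


lo, hi = 0.5, 2.0  # interval containing a single local minimum of P

# Derivative-free search on the interval.
res_scalar = minimize_scalar(P, bounds=(lo, hi), method="bounded")

# Gradient-based search; minimize works on 1-element arrays.
res_grad = minimize(
    lambda x: P(x[0]),
    x0=[1.0],
    jac=lambda x: np.array([d_P(x[0])]),
    method="L-BFGS-B",
    bounds=[(lo, hi)],
)

print(res_scalar.x, res_grad.x[0])
print(res_scalar.nfev, res_grad.nfev)  # compare evaluation counts
```

To find all local extrema this way, the search would have to be repeated on each bracketing interval, and on -P for the maxima.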

How should I scipy.optimize a multivariate and non-differentiable function with boundaries?

I came upon the following optimization problem:
The target function is a multivariate, non-differentiable function that takes a list of scalars as its argument and returns a scalar. It is non-differentiable in the sense that the computation within the function is based on pandas and a series of rolling, std, etc. operations.
The pseudo code is below:
def target_function(x: list) -> float:
    # calculations
    return output
In addition, each component of the x argument has its own bounds, defined as a tuple (min, max). How should I use the scipy.optimize library to find the global minimum of this function? Are there any other libraries that could help?
I have already tried scipy.optimize.brute, which took forever, and scipy.optimize.minimize, which never produced a seemingly correct answer.

basinhopping, brute, and differential_evolution are among the methods scipy.optimize offers for global optimization. As you've already discovered, brute-force global optimization is not going to be particularly efficient.
Differential evolution is a stochastic method that should do better than brute-force, but may still require a large number of objective function evaluations. If you want to use it, you should play with the parameters and see what will work best for your problem. This tends to work better than other methods if you know that your objective function is not "smooth": there could be discontinuities in the function or its derivatives.
Basin-hopping, on the other hand, makes stochastic jumps but also uses local relaxation after each jump. This is useful if your objective function has many local minima, but due to the local relaxation used, the function should be smooth. If you can't easily get at the gradient of your function, you could still try basin-hopping with one of the local minimizers which doesn't require this information.
The advantage of the scipy.optimize.basinhopping routine is that it is very customizable. You can use take_step to define a custom random jump, accept_test to override the test used for deciding whether to proceed with or discard the results of a random jump and relaxation, and minimizer_kwargs to adjust the local minimization behavior. For example, you might override take_step to stay within your bounds, and then select perhaps the L-BFGS-B minimizer, which can numerically estimate your function's gradient as well as take bounds on the parameters. L-BFGS-B does work better if you give it a gradient, but I've used it without one and it still is able to minimize well. Be sure to read about all of the parameters on the local and global optimization routines and adjust things like tolerances as acceptable to improve performance.
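Both suggestions can be sketched on a toy problem. The objective below is a hypothetical stand-in for the pandas-based function (kinked at its minimum, so not differentiable there), and the bounds, starting point, step size and seeds are arbitrary assumptions. The first call uses differential_evolution directly with the bounds; the second uses basinhopping with a custom take_step that clips jumps back into the box and L-BFGS-B as the bounded local minimizer with numerically estimated gradients.

```python
import numpy as np
from scipy.optimize import basinhopping, differential_evolution

bounds = [(-5.0, 5.0), (-5.0, 5.0)]
lower = np.array([b[0] for b in bounds])
upper = np.array([b[1] for b in bounds])


def target_function(x):
    # Hypothetical stand-in: minimum value 1.0 at x = [1, 1], with a kink there.
    x = np.asarray(x, dtype=float)
    return 1.0 + float(np.sum(np.abs(x - 1.0)))


# Stochastic, derivative-free global search within the bounds.
res_de = differential_evolution(target_function, bounds, seed=0)

rng = np.random.default_rng(0)


def bounded_step(x):
    # Custom take_step: random jump, clipped so it stays inside the bounds.
    return np.clip(x + rng.uniform(-1.0, 1.0, size=x.shape), lower, upper)


x0 = np.array([-3.0, 4.0])
res_bh = basinhopping(
    target_function,
    x0,
    niter=25,
    take_step=bounded_step,
    # L-BFGS-B estimates the gradient numerically and honors the bounds.
    minimizer_kwargs={"method": "L-BFGS-B", "bounds": bounds},
)

print(res_de.x, res_de.fun)
print(res_bh.x, res_bh.fun)
```

On a real objective, the population size and tolerances of differential_evolution and the step size and niter of basinhopping would need tuning, as the answer notes.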

Scipy Linear algebra LinearOperator function utilised in Conjugate Gradient

I am preconditioning a matrix using spilu. To pass this preconditioner into cg (the built-in conjugate gradient method), it is necessary to use the LinearOperator function. Can someone explain the matvec parameter to me, and why I need to use it? Below is my current code:
Ainv=scla.spilu(A,drop_tol= 1e-7)
Ainv=scla.LinearOperator(Ainv.shape,matvec=Ainv)
scla.cg(A,b,maxiter=maxIterations, M = Ainv)
However, this doesn't work, and I get the error TypeError: 'SuperLU' object is not callable. I have played around and tried
Ainv=scla.LinearOperator(Ainv.shape,matvec=Ainv.solve)
instead. This seems to work but I want to know why matvec needs Ainv.solve rather than just Ainv, and is it the right thing to feed LinearOperator?
Thanks for your time

I don't have much experience with this part of scipy, but here are some comments:
According to the docs you don't have to use LinearOperator, although you can:
M : {sparse matrix, dense matrix, LinearOperator}, so you can use explicit matrices too!
The idea/advantage of the LinearOperator, quoting the docs:
Many iterative methods (e.g. cg, gmres) do not need to know the individual entries of a matrix to solve a linear system A*x=b. Such solvers only require the computation of matrix vector products.
Depending on the task, matrix-free approaches are sometimes available, and these can be much more efficient.
The working approach you presented is indeed the correct one (other sources and some course materials do it the same way).
The point of using solve() rather than the inverse matrix is to avoid forming the inverse explicitly, which might be very costly. matvec must be a callable that takes a vector and returns the operator applied to it. The SuperLU object returned by spilu is not callable, but its solve method is, and solve(v) applies the approximate inverse of A to v.
A similar idea is very common in BFGS-based optimization algorithms, although the Wikipedia article might not give much insight here.
scipy has a dedicated LinearOperator for this purpose of not forming the inverse explicitly (I think it's only used for statistics or for finishing off some optimization, but I have successfully built some L-BFGS-based optimizers with it).
A scicomp.stackexchange thread discusses this without touching scipy.
Because of that, I would assume spilu follows the same approach, returning an object with a solve method.
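For reference, here is a self-contained sketch of the working pattern from the question. The matrix is a hypothetical test problem (a 1-D Laplacian, symmetric positive definite), and the size and maxiter are arbitrary. The preconditioner is wrapped so that each application M @ v calls ilu.solve(v) instead of multiplying by an explicit inverse.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as scla

# Hypothetical SPD test matrix: 1-D Laplacian in CSC format (spilu prefers CSC).
n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# Incomplete LU factorization; the result is a SuperLU object, not a matrix.
ilu = scla.spilu(A, drop_tol=1e-7)

# matvec must be a callable v -> M @ v. ilu.solve(v) applies the
# approximate inverse of A without ever forming it.
M = scla.LinearOperator(A.shape, matvec=ilu.solve)

x, info = scla.cg(A, b, maxiter=500, M=M)
print(info, np.linalg.norm(A @ x - b))  # info == 0 means convergence
```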

The difference between C++ (LAPACK, sgels) and Python (Numpy, lstsq) results

I am comparing the numerical results of C++ and Python computations. In C++, I make use of LAPACK's sgels function to compute the coefficients of a linear regression problem. In Python, I use Numpy's linalg.lstsq function for a similar task.
What is the mathematical difference between the methods used by sgels and linalg.lstsq?
What is the expected error (e.g. 6 significant digits) when comparing the results (i.e. the regression coefficients) numerically?
FYI: I am by no means a C++ or Python expert, which makes it difficult to understand what is going on inside the functions.

Taking a look at the source of numpy, in the file linalg.py, lstsq relies on LAPACK's zgelsd() for complex input and dgelsd() for real input. Here are the differences from sgels():
dgelsd() works in double precision while sgels() works in single precision (float), so there is a difference in precision.
dgels() makes use of the QR factorization of the matrix A and assumes that A has full rank. The condition number of the matrix must be reasonable to get a meaningful result. dgelsd(), on the other hand, makes use of the singular value decomposition of A. In particular, A can be rank deficient, and small singular values are discarded depending on the additional argument rcond or on machine precision. Note that in the numpy version examined here the default value for rcond is -1; negative values refer to machine precision.
According to the LAPACK benchmark, one can expect dgels() to be about 5 times faster than dgelsd().
You may see significant differences between the results of sgels() and dgelsd() if the matrix is ill-conditioned. Indeed, there is a bound on the error of the linear regression which depends on the algorithm and on the value of rcond that is used. See the LAPACK user guide sections "Error Bounds for Linear Least Squares Problems" for estimates of the errors and "Further Details: Error Bounds for Linear Least Squares Problems" for technical details.
In conclusion, sgels() and dgels() can be used if the measurements in b are accurate and easily related to the explanatory variables. For instance, if sensors are placed at the exits of exhaust pipes, it's easy to guess which motors are running. But sometimes the linear link between the source and the measurements is not precisely known (uncertainty in the terms of A), or discriminating between polluters on the basis of measurements becomes more difficult (some polluters are far from the set of sensors and A is ill-conditioned). In this kind of situation, dgelsd() and tuning the rcond argument can help. Whenever in doubt, use dgelsd() and estimate the error on the computed x according to LAPACK's user guide.
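The size of the discrepancy can be measured directly from Python, because SciPy exposes the LAPACK routines. The sketch below assumes a small, well-conditioned, hypothetical regression problem (a straight-line fit with exact data) and compares single-precision QR (scipy.linalg.lapack.sgels) with numpy's lstsq, which works in double precision via the SVD-based driver as described above.

```python
import numpy as np
from scipy.linalg import lapack

# Hypothetical regression problem: fit b = x0 + x1 * t with exact data.
t = np.arange(10, dtype=np.float64)
A = np.column_stack([np.ones_like(t), t])
b = 2.0 + 3.0 * t

# Single precision, QR-based (what the C++ code calls).
# sgels returns the factored matrix, a padded solution array, and a status flag.
_, x_pad, info = lapack.sgels(A.astype(np.float32), b.astype(np.float32)[:, None])
x_sgels = x_pad[: A.shape[1], 0]  # the first n rows hold the coefficients

# Double precision, SVD-based (numpy's lstsq); rcond=None uses machine precision.
x_gelsd, _, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)

print(x_sgels, x_gelsd)
print(np.abs(x_sgels - x_gelsd).max())  # discrepancy between the two results
print(sing_vals[0] / sing_vals[-1])     # condition number of A
```

For a well-conditioned matrix like this one, the coefficients should agree to roughly single-precision accuracy; the gap grows with the condition number printed on the last line.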
