LBFGS: Accuracy of Hessian approximation (Python)
Does anybody know how useful LBFGS is for estimating the Hessian matrix in the case of many (>10,000) dimensions? When running scipy's implementation on a simple 100-dimensional quadratic form, the algorithm already seems to struggle. Are there any general results about special cases (e.g. a dominant diagonal) in which the approximated Hessian is reasonably trustworthy?
Finally, one immediate drawback of scipy's implementation seems to be that the initial estimate of the Hessian is the identity matrix, which might lead to slower convergence. How important is this effect, i.e. how would the algorithm be affected if I had a good idea of what the diagonal elements are?
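One workaround I have been considering for injecting such diagonal information without touching scipy's internals is a simple change of variables: if d approximates the Hessian diagonal, minimizing over z = sqrt(d) * x makes the Hessian with respect to z closer to the identity, which is exactly the initial guess L-BFGS uses. A minimal sketch (reusing fgauss, fprimegauss, x0, ndims and m from the experiment code below; diag_h is only a placeholder for the assumed diagonal estimate, and the stock scipy routine suffices here):
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

diag_h = np.ones(ndims)          # placeholder: your estimate of the Hessian diagonal
scale = np.sqrt(diag_h)

# Minimize in z = scale * x, so the Hessian with respect to z is roughly the identity
f_scaled = lambda z: fgauss(z / scale)
g_scaled = lambda z: fprimegauss(z / scale) / scale   # chain rule: x = z / scale

z_opt, f_opt, info = fmin_l_bfgs_b(func=f_scaled, x0=scale * x0, fprime=g_scaled, m=m)
x_opt = z_opt / scale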
Here are two sets of example plots, one for a rather diagonally dominant form and one for a case with strong off-diagonal elements. In each set, the first plot shows the original covariance matrix and the following ones show the approximations obtained with m=50 and m=500.
Code for running the experiment:
import numpy as np
from matplotlib import pyplot as plt
# Parameters
ndims = 100 # Dimensions for our problem
a = .2 # Relative importance of non-diagonal elements in covariance
m = 500 # Number of updates we allow in lbfgs
x0=1*np.random.rand(ndims) # Initial starting point for LBFGS
# Generate covariance matrix
A = np.matrix([np.random.randn(ndims) + np.random.randn(1)*a for i in range(ndims)])
A = A*np.transpose(A)
D_half = np.diag(np.diag(A)**(-0.5))
cov= D_half*A*D_half
invcov = np.linalg.inv(cov)
assert(np.all(np.linalg.eigvals(cov) > 0))
# Define quadratic form and its derivative
def gauss(x,invcov):
    res = 0.5*x.T@invcov@x
return res[0,0]
def gaussgrad(x,invcov):
    res = np.asarray(x.T@invcov)
return res[0]
# Put function in lambda shape
fgauss = lambda x: gauss(x,invcov=invcov)
fprimegauss = lambda x: gaussgrad(x,invcov=invcov)
# Run the modified lbfgs variant and retrieve the inverse Hessian approximation
# (fmin_l_bfgs_b and LbfgsInvHess are the modified versions defined further below)
x, f, d, s, y = fmin_l_bfgs_b(func=fgauss,x0=x0,fprime=fprimegauss,m=m,approx_grad=False)
invhess = LbfgsInvHess(s, y)
# Plot the results
plt.imshow(cov)
plt.colorbar()
plt.show()
plt.imshow(invhess.todense(),vmin=np.min(cov),vmax=np.max(cov))
plt.colorbar()
plt.show()
plt.imshow(invhess.todense()-cov)
plt.colorbar()
plt.show()
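To complement the plots with a number, one can also compute how far the reconstructed inverse Hessian is from the true one (which for this quadratic form is just cov); a quick sketch reusing invhess and cov from above:
# Quantify the approximation error (invhess.todense() should approach cov)
cov_a = np.asarray(cov)
H_approx = np.asarray(invhess.todense())
rel_err = np.linalg.norm(H_approx - cov_a) / np.linalg.norm(cov_a)
diag_err = (np.linalg.norm(np.diag(H_approx) - np.diag(cov_a))
            / np.linalg.norm(np.diag(cov_a)))
print("relative Frobenius error:", rel_err)
print("relative diagonal error: ", diag_err)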
As scipy does not expose the update vectors from which the inverse Hessian is reconstructed, we need to call a marginally modified function (based on scipy/optimize/lbfgsb.py):
import numpy as np
from numpy import array, asarray, float64, zeros
from scipy.optimize import _lbfgsb
from scipy.optimize.optimize import (MemoizeJac, OptimizeResult,
_check_unknown_options, _prepare_scalar_function)
from scipy.optimize._constraints import old_bound_to_new
from scipy.sparse.linalg import LinearOperator
__all__ = ['fmin_l_bfgs_b', 'LbfgsInvHess']
def fmin_l_bfgs_b(func, x0, fprime=None, args=(),
approx_grad=0,
bounds=None, m=10, factr=1e7, pgtol=1e-5,
epsilon=1e-8,
iprint=-1, maxfun=15000, maxiter=15000, disp=None,
callback=None, maxls=20):
"""
Minimize a function func using the L-BFGS-B algorithm.
Parameters
----------
func : callable f(x,*args)
Function to minimize.
x0 : ndarray
Initial guess.
fprime : callable fprime(x,*args), optional
The gradient of `func`. If None, then `func` returns the function
value and the gradient (``f, g = func(x, *args)``), unless
`approx_grad` is True in which case `func` returns only ``f``.
args : sequence, optional
Arguments to pass to `func` and `fprime`.
approx_grad : bool, optional
Whether to approximate the gradient numerically (in which case
`func` returns only the function value).
bounds : list, optional
``(min, max)`` pairs for each element in ``x``, defining
the bounds on that parameter. Use None or +-inf for one of ``min`` or
``max`` when there is no bound in that direction.
m : int, optional
The maximum number of variable metric corrections
used to define the limited memory matrix. (The limited memory BFGS
method does not store the full hessian but uses this many terms in an
approximation to it.)
factr : float, optional
The iteration stops when
``(f^k - f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= factr * eps``,
where ``eps`` is the machine precision, which is automatically
generated by the code. Typical values for `factr` are: 1e12 for
low accuracy; 1e7 for moderate accuracy; 10.0 for extremely
high accuracy. See Notes for relationship to `ftol`, which is exposed
(instead of `factr`) by the `scipy.optimize.minimize` interface to
L-BFGS-B.
pgtol : float, optional
The iteration will stop when
``max{|proj g_i | i = 1, ..., n} <= pgtol``
where ``pg_i`` is the i-th component of the projected gradient.
epsilon : float, optional
Step size used when `approx_grad` is True, for numerically
calculating the gradient
iprint : int, optional
Controls the frequency of output. ``iprint < 0`` means no output;
``iprint = 0`` print only one line at the last iteration;
``0 < iprint < 99`` print also f and ``|proj g|`` every iprint iterations;
``iprint = 99`` print details of every iteration except n-vectors;
``iprint = 100`` print also the changes of active set and final x;
``iprint > 100`` print details of every iteration including x and g.
disp : int, optional
If zero, then no output. If a positive number, then this over-rides
`iprint` (i.e., `iprint` gets the value of `disp`).
maxfun : int, optional
Maximum number of function evaluations.
maxiter : int, optional
Maximum number of iterations.
callback : callable, optional
Called after each iteration, as ``callback(xk)``, where ``xk`` is the
current parameter vector.
maxls : int, optional
Maximum number of line search steps (per iteration). Default is 20.
Returns
-------
x : array_like
Estimated position of the minimum.
f : float
Value of `func` at the minimum.
d : dict
Information dictionary.
* d['warnflag'] is
- 0 if converged,
- 1 if too many function evaluations or too many iterations,
- 2 if stopped for another reason, given in d['task']
* d['grad'] is the gradient at the minimum (should be 0 ish)
* d['funcalls'] is the number of function calls made.
* d['nit'] is the number of iterations.
See also
--------
minimize: Interface to minimization algorithms for multivariate
functions. See the 'L-BFGS-B' `method` in particular. Note that the
`ftol` option is made available via that interface, while `factr` is
provided via this interface, where `factr` is the factor multiplying
the default machine floating-point precision to arrive at `ftol`:
``ftol = factr * numpy.finfo(float).eps``.
Notes
-----
License of L-BFGS-B (FORTRAN code):
The version included here (in fortran code) is 3.0
(released April 25, 2011). It was written by Ciyou Zhu, Richard Byrd,
    and Jorge Nocedal <nocedal@ece.nwu.edu>. It carries the following
condition for use:
This software is freely available, but we expect that all publications
describing work using this software, or all commercial products using it,
quote at least one of the references given below. This software is released
under the BSD License.
References
----------
* R. H. Byrd, P. Lu and J. Nocedal. A Limited Memory Algorithm for Bound
Constrained Optimization, (1995), SIAM Journal on Scientific and
Statistical Computing, 16, 5, pp. 1190-1208.
* C. Zhu, R. H. Byrd and J. Nocedal. L-BFGS-B: Algorithm 778: L-BFGS-B,
FORTRAN routines for large scale bound constrained optimization (1997),
ACM Transactions on Mathematical Software, 23, 4, pp. 550 - 560.
* J.L. Morales and J. Nocedal. L-BFGS-B: Remark on Algorithm 778: L-BFGS-B,
FORTRAN routines for large scale bound constrained optimization (2011),
ACM Transactions on Mathematical Software, 38, 1.
"""
# handle fprime/approx_grad
if approx_grad:
fun = func
jac = None
elif fprime is None:
fun = MemoizeJac(func)
jac = fun.derivative
else:
fun = func
jac = fprime
# build options
if disp is None:
disp = iprint
opts = {'disp': disp,
'iprint': iprint,
'maxcor': m,
'ftol': factr * np.finfo(float).eps,
'gtol': pgtol,
'eps': epsilon,
'maxfun': maxfun,
'maxiter': maxiter,
'callback': callback,
'maxls': maxls}
res, s, y = _minimize_lbfgsb(fun, x0, args=args, jac=jac, bounds=bounds,
**opts)
d = {'grad': res['jac'],
'task': res['message'],
'funcalls': res['nfev'],
'nit': res['nit'],
'warnflag': res['status']}
f = res['fun']
x = res['x']
return x, f, d, s, y
def _minimize_lbfgsb(fun, x0, args=(), jac=None, bounds=None,
disp=None, maxcor=10, ftol=2.2204460492503131e-09,
gtol=1e-5, eps=1e-8, maxfun=15000, maxiter=15000,
iprint=-1, callback=None, maxls=20,
finite_diff_rel_step=None, **unknown_options):
"""
Minimize a scalar function of one or more variables using the L-BFGS-B
algorithm.
Options
-------
disp : None or int
If `disp is None` (the default), then the supplied version of `iprint`
is used. If `disp is not None`, then it overrides the supplied version
of `iprint` with the behaviour you outlined.
maxcor : int
The maximum number of variable metric corrections used to
define the limited memory matrix. (The limited memory BFGS
method does not store the full hessian but uses this many terms
in an approximation to it.)
ftol : float
The iteration stops when ``(f^k -
f^{k+1})/max{|f^k|,|f^{k+1}|,1} <= ftol``.
gtol : float
The iteration will stop when ``max{|proj g_i | i = 1, ..., n}
<= gtol`` where ``pg_i`` is the i-th component of the
projected gradient.
eps : float or ndarray
If `jac is None` the absolute step size used for numerical
approximation of the jacobian via forward differences.
maxfun : int
Maximum number of function evaluations.
maxiter : int
Maximum number of iterations.
iprint : int, optional
Controls the frequency of output. ``iprint < 0`` means no output;
``iprint = 0`` print only one line at the last iteration;
``0 < iprint < 99`` print also f and ``|proj g|`` every iprint iterations;
``iprint = 99`` print details of every iteration except n-vectors;
``iprint = 100`` print also the changes of active set and final x;
``iprint > 100`` print details of every iteration including x and g.
callback : callable, optional
Called after each iteration, as ``callback(xk)``, where ``xk`` is the
current parameter vector.
maxls : int, optional
Maximum number of line search steps (per iteration). Default is 20.
finite_diff_rel_step : None or array_like, optional
If `jac in ['2-point', '3-point', 'cs']` the relative step size to
use for numerical approximation of the jacobian. The absolute step
size is computed as ``h = rel_step * sign(x0) * max(1, abs(x0))``,
possibly adjusted to fit into the bounds. For ``method='3-point'``
the sign of `h` is ignored. If None (default) then step is selected
automatically.
Notes
-----
The option `ftol` is exposed via the `scipy.optimize.minimize` interface,
but calling `scipy.optimize.fmin_l_bfgs_b` directly exposes `factr`. The
relationship between the two is ``ftol = factr * numpy.finfo(float).eps``.
I.e., `factr` multiplies the default machine floating-point precision to
arrive at `ftol`.
"""
#_check_unknown_options(unknown_options)
m = maxcor
pgtol = gtol
factr = ftol / np.finfo(float).eps
x0 = asarray(x0).ravel()
n, = x0.shape
if bounds is None:
bounds = [(None, None)] * n
if len(bounds) != n:
raise ValueError('length of x0 != length of bounds')
# unbounded variables must use None, not +-inf, for optimizer to work properly
bounds = [(None if l == -np.inf else l, None if u == np.inf else u) for l, u in bounds]
# LBFGSB is sent 'old-style' bounds, 'new-style' bounds are required by
# approx_derivative and ScalarFunction
new_bounds = old_bound_to_new(bounds)
# check bounds
if (new_bounds[0] > new_bounds[1]).any():
raise ValueError("LBFGSB - one of the lower bounds is greater than an upper bound.")
# initial vector must lie within the bounds. Otherwise ScalarFunction and
# approx_derivative will cause problems
x0 = np.clip(x0, new_bounds[0], new_bounds[1])
if disp is not None:
if disp == 0:
iprint = -1
else:
iprint = disp
sf = _prepare_scalar_function(fun, x0, jac=jac, args=args, epsilon=eps,
bounds=new_bounds,
finite_diff_rel_step=finite_diff_rel_step)
func_and_grad = sf.fun_and_grad
fortran_int = _lbfgsb.types.intvar.dtype
nbd = zeros(n, fortran_int)
low_bnd = zeros(n, float64)
upper_bnd = zeros(n, float64)
bounds_map = {(None, None): 0,
(1, None): 1,
(1, 1): 2,
(None, 1): 3}
for i in range(0, n):
l, u = bounds[i]
if l is not None:
low_bnd[i] = l
l = 1
if u is not None:
upper_bnd[i] = u
u = 1
nbd[i] = bounds_map[l, u]
if not maxls > 0:
raise ValueError('maxls must be positive.')
x = array(x0, float64)
f = array(0.0, float64)
g = zeros((n,), float64)
wa = zeros(2*m*n + 5*n + 11*m*m + 8*m, float64)
iwa = zeros(3*n, fortran_int)
task = zeros(1, 'S60')
csave = zeros(1, 'S60')
lsave = zeros(4, fortran_int)
isave = zeros(44, fortran_int)
dsave = zeros(29, float64)
task[:] = 'START'
n_iterations = 0
while 1:
# x, f, g, wa, iwa, task, csave, lsave, isave, dsave = \
_lbfgsb.setulb(m, x, low_bnd, upper_bnd, nbd, f, g, factr,
pgtol, wa, iwa, task, iprint, csave, lsave,
isave, dsave, maxls)
task_str = task.tobytes()
if task_str.startswith(b'FG'):
# The minimization routine wants f and g at the current x.
# Note that interruptions due to maxfun are postponed
# until the completion of the current minimization iteration.
# Overwrite f and g:
f, g = func_and_grad(x)
elif task_str.startswith(b'NEW_X'):
# new iteration
n_iterations += 1
if callback is not None:
callback(np.copy(x))
if n_iterations >= maxiter:
task[:] = 'STOP: TOTAL NO. of ITERATIONS REACHED LIMIT'
elif sf.nfev > maxfun:
task[:] = ('STOP: TOTAL NO. of f AND g EVALUATIONS '
'EXCEEDS LIMIT')
else:
break
task_str = task.tobytes().strip(b'\x00').strip()
if task_str.startswith(b'CONV'):
warnflag = 0
elif sf.nfev > maxfun or n_iterations >= maxiter:
warnflag = 1
else:
warnflag = 2
# These two portions of the workspace are described in the mainlb
# subroutine in lbfgsb.f. See line 363.
s = wa[0: m*n].reshape(m, n)
y = wa[m*n: 2*m*n].reshape(m, n)
# See lbfgsb.f line 160 for this portion of the workspace.
# isave(31) = the total number of BFGS updates prior the current iteration;
n_bfgs_updates = isave[30]
n_corrs = min(n_bfgs_updates, maxcor)
inv_hess = LbfgsInvHess(s[:n_corrs], y[:n_corrs])
task_str = task_str.decode()
return OptimizeResult(fun=f, jac=g, nfev=sf.nfev,
njev=sf.ngev,
nit=n_iterations, status=warnflag, message=task_str,
x=x, success=(warnflag == 0), hess_inv=inv_hess), s[:n_corrs], y[:n_corrs]
class LbfgsInvHess(LinearOperator):
"""Linear operator for the L-BFGS approximate inverse Hessian.
This operator computes the product of a vector with the approximate inverse
of the Hessian of the objective function, using the L-BFGS limited
memory approximation to the inverse Hessian, accumulated during the
optimization.
Objects of this class implement the ``scipy.sparse.linalg.LinearOperator``
interface.
Parameters
----------
sk : array_like, shape=(n_corr, n)
Array of `n_corr` most recent updates to the solution vector.
(See [1]).
yk : array_like, shape=(n_corr, n)
Array of `n_corr` most recent updates to the gradient. (See [1]).
References
----------
.. [1] Nocedal, Jorge. "Updating quasi-Newton matrices with limited
storage." Mathematics of computation 35.151 (1980): 773-782.
"""
def __init__(self, sk, yk):
"""Construct the operator."""
if sk.shape != yk.shape or sk.ndim != 2:
raise ValueError('sk and yk must have matching shape, (n_corrs, n)')
n_corrs, n = sk.shape
super().__init__(dtype=np.float64, shape=(n, n))
self.sk = sk
self.yk = yk
self.n_corrs = n_corrs
self.rho = 1 / np.einsum('ij,ij->i', sk, yk)
def _matvec(self, x):
"""Efficient matrix-vector multiply with the BFGS matrices.
This calculation is described in Section (4) of [1].
Parameters
----------
x : ndarray
An array with shape (n,) or (n,1).
Returns
-------
y : ndarray
The matrix-vector product
"""
s, y, n_corrs, rho = self.sk, self.yk, self.n_corrs, self.rho
q = np.array(x, dtype=self.dtype, copy=True)
if q.ndim == 2 and q.shape[1] == 1:
q = q.reshape(-1)
alpha = np.empty(n_corrs)
for i in range(n_corrs-1, -1, -1):
alpha[i] = rho[i] * np.dot(s[i], q)
q = q - alpha[i]*y[i]
r = q
for i in range(n_corrs):
beta = rho[i] * np.dot(y[i], r)
r = r + s[i] * (alpha[i] - beta)
return r
def todense(self):
"""Return a dense array representation of this operator.
Returns
-------
arr : ndarray, shape=(n, n)
An array with the same shape and containing
the same data represented by this `LinearOperator`.
"""
s, y, n_corrs, rho = self.sk, self.yk, self.n_corrs, self.rho
I = np.eye(*self.shape, dtype=self.dtype)
Hk = I
for i in range(n_corrs):
A1 = I - s[i][:, np.newaxis] * y[i][np.newaxis, :] * rho[i]
A2 = I - y[i][:, np.newaxis] * s[i][np.newaxis, :] * rho[i]
Hk = np.dot(A1, np.dot(Hk, A2)) + (rho[i] * s[i][:, np.newaxis] *
s[i][np.newaxis, :])
return Hk
Edit: Typo in code.
Related
Scipy minimize returns a higher value than minimum
As a part of multi-start optimization, I am running differential evolution (DE), the output of which I feed as initial values to scipy minimization with SLSQP (I need constraints). I am testing the procedure on the Ackley function. Even in situations in which DE returns the optimum (zeros), scipy minimization deviates from the optimal initial value and returns a value higher than at the optimum.
Do you know how to make scipy minimize return the optimum? I noticed it helps to specify tolerance for scipy minimize, but it does not solve the issue completely. Scaling the objective function makes things worse. The problem is not present for the COBYLA solver.
Here are the optimization steps:

# Set up
x0min = -20
x0max = 20
xdim = 4
fun = ackley
bounds = [(x0min, x0max)] * xdim
tol = 1e-12

# Get a DE solution
result = differential_evolution(fun, bounds,
                                maxiter=10000,
                                tol=tol,
                                workers=1,
                                init='latinhypercube')

# Initialize at DE output
x0 = result.x

# Estimate the model
r = minimize(fun, x0, method='SLSQP', tol=1e-18)

which in my case yields

result.fun = -4.440892098500626e-16
r.fun = 1.0008238682246429e-09
result.x = array([0., 0., 0., 0.])
r.x = array([-1.77227927e-10, -1.77062108e-10,  4.33179228e-10, -2.73031830e-12])

Here is the implementation of the Ackley function:

def ackley(x):
    # Computes the value of the Ackley benchmark function.
    # ACKLEY accepts a matrix of size (dim, N) and returns a vector
    # FVALS of size (N,)
    #
    # Parameters
    # ----------
    # x : 1-D array size (dim,) or a 2-D array size (dim, N)
    #     Each row of the matrix represents one dimension.
    #     Columns have therefore the interpretation of different points at which
    #     the function is evaluated. N is number of points to be evaluated.
    #
    # Returns
    # -------
    # fvals : a scalar if x is a 1-D array or
    #     a 1-D array size (N,) if x is a 2-D array size (dim, N),
    #     in which each row contains the function value for each column of X.

    n = x.shape[0]
    ninverse = 1 / n
    sum1 = np.sum(x**2, axis=0)
    sum2 = np.sum(np.cos(2 * np.pi * x), axis=0)
    fvals = (20 + np.exp(1)
             - (20 * np.exp(-0.2 * np.sqrt(ninverse * sum1)))
             - np.exp(ninverse * sum2))
    return fvals
Decreasing the "step size used for numerical approximation of the Jacobian" in the SLSQP options solved the issue for me.
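For concreteness, that step size is the eps entry of the SLSQP options; a hedged sketch of such a call (the particular values are illustrative, not taken from the answer):
from scipy.optimize import minimize

# 'eps' is the finite-difference step SLSQP uses to approximate the Jacobian;
# decreasing it is the change suggested above.
r = minimize(fun, x0, method='SLSQP', tol=1e-18,
             options={'eps': 1e-11, 'ftol': 1e-18})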
Using python built-in functions for coupled ODEs
THIS PART IS JUST BACKGROUND IF YOU NEED IT
I am developing a numerical solver for the second-order Kuramoto model. The functions I use to find the derivatives of theta and omega are given below.

# n-dimensional change in theta
def d_theta(omega):
    return omega

# n-dimensional change in omega
def d_omega(K,A,P,alpha,mask,n):
    def layer1(theta,omega):
        T = theta[:,None] - theta
        A[mask] = K[mask] * np.sin(T[mask])
        return - alpha*omega + P - A.sum(1)
    return layer1

These equations return vectors.

QUESTION 1
I know how to use odeint for two dimensions, (y,t). For my research I want to use a built-in Python function that works for higher dimensions.

QUESTION 2
I do not necessarily want to stop after a predetermined amount of time. I have other stopping conditions in the code below that will indicate whether the system of equations converges to the steady state. How do I incorporate these into a built-in Python solver?

WHAT I CURRENTLY HAVE
This is the code I am currently using to solve the system. I just implemented RK4 with constant time stepping in a loop.

# This function randomly samples initial values in the domain and returns whether the solution converged
# Inputs:
#   f          change in theta (d_theta)
#   g          change in omega (d_omega)
#   tol        when step size is lower than tolerance, the solution is said to converge
#   h          size of the time step
#   max_iter   maximum number of steps Runge-Kutta will perform before giving up
#   max_laps   maximum number of laps the solution can do before giving up
#   fixed_t    vector of fixed points of theta
#   fixed_o    vector of fixed points of omega
#   n          number of dimensions
#   theta      initial theta vector
#   omega      initial omega vector
# Outputs:
#   converges  true if the nodes restabilize, false otherwise
def kuramoto_rk4_wss(f,g,tol_ss,tol_step,h,max_iter,max_laps,fixed_o,fixed_t,n):
    def layer1(theta,omega):
        lap = np.zeros(n, dtype = int)
        converges = False
        i = 0
        tau = 2 * np.pi
        while(i < max_iter):
            # perform RK4 with constant time step
            p_omega = omega
            p_theta = theta
            T1 = h*f(omega)
            O1 = h*g(theta,omega)
            T2 = h*f(omega + O1/2)
            O2 = h*g(theta + T1/2,omega + O1/2)
            T3 = h*f(omega + O2/2)
            O3 = h*g(theta + T2/2,omega + O2/2)
            T4 = h*f(omega + O3)
            O4 = h*g(theta + T3,omega + O3)

            theta = theta + (T1 + 2*T2 + 2*T3 + T4)/6       # take theta time step

            mask2 = np.array(np.where(np.logical_or(theta > tau, theta < 0)))   # find which nodes left [0, 2pi]
            lap[mask2] = lap[mask2] + 1                     # increment the mask
            theta[mask2] = np.mod(theta[mask2], tau)        # take the modulus

            omega = omega + (O1 + 2*O2 + 2*O3 + O4)/6

            if(max_laps in lap):            # if any generator rotates this many times it probably won't converge
                break
            elif(np.any(omega > 12)):       # if any of the generators is rotating this fast, it probably won't converge
                break
            elif(np.linalg.norm(omega) < tol_ss and             # assert the nodes are sufficiently close to the equilibrium
                 np.linalg.norm(omega - p_omega) < tol_step and # assert change in omega is small
                 np.linalg.norm(theta - p_theta) < tol_step):   # assert change in theta is small
                converges = True
                break
            i = i + 1
        return converges
    return layer1

Thanks for your help!
You can wrap your existing functions into a function accepted by odeint (option tfirst=True) and solve_ivp as

def odesys(t,u):
    theta, omega = u[:n], u[n:]     # or = u.reshape(2,-1)
    return [*f(omega), *g(theta,omega)]   # or np.concatenate([f(omega), g(theta,omega)])

u0 = [*theta0, *omega0]
t = np.linspace(t0, tf, timesteps+1)
u = odeint(odesys, u0, t, tfirst=True)
# or
res = solve_ivp(odesys, [t0,tf], u0, t_eval=t)

The scipy methods pass numpy arrays and convert the return value into the same, so you do not have to worry about that in the ODE function. The variants in the comments use explicit numpy functions.
While solve_ivp does have event handling, using it for a systematic collection of events is rather cumbersome. It would be easier to advance some fixed step, do the normalization and termination detection, and then repeat this.
If you want to later increase efficiency somewhat, use directly the stepper classes behind solve_ivp.
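Regarding the fixed-step / normalize / check-and-repeat pattern mentioned above, here is a hedged sketch of how it could look with solve_ivp (odesys is the wrapper from above; dt, max_steps and the tolerances are illustrative assumptions, and the steady-state test only roughly mirrors the criterion in the question):
import numpy as np
from scipy.integrate import solve_ivp

def integrate_until_steady(odesys, u0, dt=0.1, max_steps=10000,
                           tol_ss=1e-6, tol_step=1e-9):
    """Advance the system in chunks of length dt, wrap theta into [0, 2*pi)
    after every chunk, and stop once omega is small and the state barely
    changes (a rough steady-state criterion)."""
    u = np.asarray(u0, dtype=float)
    n = u.size // 2
    t = 0.0
    for _ in range(max_steps):
        sol = solve_ivp(odesys, (t, t + dt), u, method='RK45')
        u_new = sol.y[:, -1]
        u_new[:n] = np.mod(u_new[:n], 2 * np.pi)   # normalize theta
        omega = u_new[n:]
        if (np.linalg.norm(omega) < tol_ss
                and np.linalg.norm(u_new - u) < tol_step):
            return True, u_new                     # converged
        u, t = u_new, t + dt
    return False, u                                # gave up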
Finding alpha and beta of beta-binomial distribution with scipy.optimize and loglikelihood
A distribution is beta-binomial if p, the probability of success, in a binomial distribution has a beta distribution with shape parameters α > 0 and β > 0. The shape parameters define the probability of success. I want to find the values for α and β that best describe my data from the perspective of a beta-binomial distribution.
My dataset players consists of data about the number of hits (H), the number of at-bats (AB) and the conversion (H / AB) of a lot of baseball players. I estimate the PDF with the help of the answer of JulienD in Beta Binomial Function in Python

from scipy.special import beta
from scipy.misc import comb

pdf = comb(n, k) * beta(k + a, n - k + b) / beta(a, b)

Next, I write a log-likelihood function that we will minimize.

def loglike_betabinom(params, *args):
    """
    Negative log likelihood function for betabinomial distribution
    :param params: list of parameters to be fitted.
    :param args: 2-element array containing the sample data.
    :return: negative log-likelihood to be minimized.
    """
    a, b = params[0], params[1]
    k = args[0]   # the conversion rate
    n = args[1]   # the number of at-bats (AB)
    pdf = comb(n, k) * beta(k + a, n - k + b) / beta(a, b)
    return -1 * np.log(pdf).sum()

Now, I want to write a function that minimizes loglike_betabinom

from scipy.optimize import minimize

init_params = [1, 10]
res = minimize(loglike_betabinom, x0=init_params,
               args=(players['H'] / players['AB'], players['AB']),
               bounds=bounds, method='L-BFGS-B',
               options={'disp': True, 'maxiter': 250})
print(res.x)

The result is [-6.04544138  2.03984464], which implies that α is negative, which is not possible. I based my script on the following R snippet. They get [101.359, 287.318].

ll <- function(alpha, beta) {
  x <- career_filtered$H
  total <- career_filtered$AB
  -sum(VGAM::dbetabinom.ab(x, total, alpha, beta, log = TRUE))
}

m <- mle(ll, start = list(alpha = 1, beta = 10),
         method = "L-BFGS-B", lower = c(0.0001, 0.1))

ab <- coef(m)

Can someone tell me what I am doing wrong? Help is much appreciated!!
One thing to pay attention to is that comb(n, k) in your log-likelihood might not be well-behaved numerically for the values of n and k in your dataset. You can verify this by applying comb to your data and seeing if infs appear.
One way to amend things could be to rewrite the negative log-likelihood as suggested in https://stackoverflow.com/a/32355701/4240413, i.e. as a function of logarithms of Gamma functions, as in

from scipy.special import gammaln
import numpy as np

def loglike_betabinom(params, *args):
    a, b = params[0], params[1]
    k = args[0]   # the OVERALL conversions
    n = args[1]   # the number of at-bats (AB)
    logpdf = gammaln(n+1) + gammaln(k+a) + gammaln(n-k+b) + gammaln(a+b) - \
             (gammaln(k+1) + gammaln(n-k+1) + gammaln(a) + gammaln(b) + gammaln(n+a+b))
    return -np.sum(logpdf)

You can then minimize the log-likelihood with

from scipy.optimize import minimize

init_params = [1, 10]
# note that I am putting 'H' in the args
res = minimize(loglike_betabinom, x0=init_params,
               args=(players['H'], players['AB']),
               method='L-BFGS-B', options={'disp': True, 'maxiter': 250})
print(res)

and that should give reasonable results.
You could check How to properly fit a beta distribution in python? for inspiration if you want to rework your code further.
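As an extra sanity check on the gammaln expression (not part of the original answer): recent SciPy versions ship a beta-binomial distribution whose logpmf should match it. This assumes SciPy >= 1.4, and the parameter values below are illustrative only:
import numpy as np
from scipy.special import gammaln
from scipy.stats import betabinom     # available in SciPy >= 1.4

a, b, n, k = 101.359, 287.318, 550, 150   # illustrative values only

manual = (gammaln(n+1) + gammaln(k+a) + gammaln(n-k+b) + gammaln(a+b)
          - (gammaln(k+1) + gammaln(n-k+1) + gammaln(a)
             + gammaln(b) + gammaln(n+a+b)))
reference = betabinom.logpmf(k, n, a, b)
print(np.isclose(manual, reference))      # expect True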
Steepest descent spitting out unreasonably large values
My implementation of steepest descent for solving Ax = b is showing some weird behavior: for any matrix large enough (~10 x 10, have only tested square matrices so far), the returned x contains all huge values (on the order of 1x10^10).

def steepestDescent(A, b, numIter=100, x=None):
    """Solves Ax = b using steepest descent method"""
    warnings.filterwarnings(action="error", category=RuntimeWarning)

    # Reshape b in case it has shape (nL,)
    b = b.reshape(len(b), 1)

    exes = []
    res = []

    # Make a guess for x if none is provided
    if x is None:
        x = np.zeros((len(A[0]), 1))
    exes.append(x)

    for i in range(numIter):
        # Re-calculate r(i) using r(i) = b - Ax(i) every five iterations
        # to prevent roundoff error. Also calculates initial direction
        # of steepest descent.
        if (numIter % 5) == 0:
            r = b - np.dot(A, x)
        # Otherwise use r(i+1) = r(i) - step * Ar(i)
        else:
            r = r - step * np.dot(A, r)
        res.append(r)

        # Calculate step size. Catching the runtime warning allows the function
        # to stop and return before all iterations are completed. This is
        # necessary because once the solution x has been found, r = 0, so the
        # calculation below divides by 0, turning step into "nan", which then
        # goes on to overwrite the correct answer in x with "nan"s
        try:
            step = np.dot(r.T, r) / np.dot(np.dot(r.T, A), r)
        except RuntimeWarning:
            warnings.resetwarnings()
            return x

        # Update x
        x = x + step * r
        exes.append(x)

    warnings.resetwarnings()
    return x, exes, res

(exes and res are returned for debugging)
I assume the problem must be with calculating r or step (or some deeper issue), but I can't make out what it is.
The code seems correct. For example, the following test works for me (both linalg.solve and steepestDescent give close answers, most of the time):

import numpy as np

n = 100
A = np.random.random(size=(n,n)) + 10 * np.eye(n)
print(np.linalg.eig(A)[0])
b = np.random.random(size=(n,1))
x, xs, r = steepestDescent(A, b, numIter=50)
print(x - np.linalg.solve(A,b))

The problem is in the math. This algorithm is guaranteed to converge to the correct solution if A is a positive definite matrix. By adding 10 times the identity matrix to a random matrix, we increase the probability that all the eigenvalues are positive.
If you test with large random matrices (for example A = random.random(size=(n,n))), you are almost certain to have a negative eigenvalue, and the algorithm will not converge.
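As a side note (not from the original answer): the classical convergence guarantee for steepest descent on Ax = b assumes A is symmetric positive definite, so a stricter way to build a test matrix would be, for example:
import numpy as np

n = 100
M = np.random.random(size=(n, n))
A = M @ M.T + n * np.eye(n)                 # symmetric positive definite by construction
assert np.all(np.linalg.eigvalsh(A) > 0)    # all eigenvalues real and strictly positive
b = np.random.random(size=(n, 1))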
Minimizing a multivariable function with scipy. Derivative not known
I have a function which is actually a call to another program (some Fortran code). When I call this function (run_moog) I can pass 4 variables, and it returns 6 values. These values should all be close to 0 (in order to minimize). However, I combined them like this: np.sum(results**2). Now I have a scalar function. I would like to minimize this function, i.e. get np.sum(results**2) as close to zero as possible.
Note: When this function (run_moog) takes the 4 input parameters, it creates an input file for the Fortran code that depends on these parameters.
I have tried several ways to optimize this from the scipy docs. But none works as expected. The minimization should be able to have bounds on the 4 variables. Here is an attempt:

from scipy.optimize import minimize  # Tried others as well from the docs

x0 = 4435, 3.54, 0.13, 2.4
bounds = [(4000, 6000), (3.00, 4.50), (-0.1, 0.1), (0.0, None)]
a = minimize(fun_mmog, x0, bounds=bounds, method='L-BFGS-B')  # I've tried several different methods here
print a

This then gives me

status: 0
success: True
nfev: 5
fun: 2.3194639999999964
x: array([ 4.43500000e+03,  3.54000000e+00,  1.00000000e-01,  2.40000000e+00])
message: 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
jac: array([ 0.,  0., -54090399.99999981,  0.])
nit: 0

The third parameter changes slightly, while the others are exactly the same. Also there have been 5 function calls (nfev) but no iterations (nit). The output from scipy is shown here.
Couple of possibilities:

Try COBYLA. It should be derivative-free, and supports inequality constraints. You can't use different epsilons via the normal interface; so try scaling your first variable by 1e4. (Divide it going in, multiply coming back out.)

Skip the normal automatic jacobian constructor, and make your own:
Say you're trying to use SLSQP, and you don't provide a jacobian function. It makes one for you. The code for it is in approx_jacobian in slsqp.py. Here's a condensed version:

def approx_jacobian(x, func, epsilon, *args):
    x0 = asfarray(x)
    f0 = atleast_1d(func(*((x0,)+args)))
    jac = zeros([len(x0), len(f0)])
    dx = zeros(len(x0))
    for i in range(len(x0)):
        dx[i] = epsilon
        jac[i] = (func(*((x0+dx,)+args)) - f0)/epsilon
        dx[i] = 0.0
    return jac.transpose()

You could try replacing that loop with:

    for (i, e) in zip(range(len(x0)), epsilon):
        dx[i] = e
        jac[i] = (func(*((x0+dx,)+args)) - f0)/e
        dx[i] = 0.0

You can't provide this as the jacobian to minimize, but fixing it up for that is straightforward:

def construct_jacobian(func, epsilon):
    def jac(x, *args):
        x0 = asfarray(x)
        f0 = atleast_1d(func(*((x0,)+args)))
        jac = zeros([len(x0), len(f0)])
        dx = zeros(len(x0))
        for i in range(len(x0)):
            dx[i] = epsilon
            jac[i] = (func(*((x0+dx,)+args)) - f0)/epsilon
            dx[i] = 0.0
        return jac.transpose()
    return jac

You can then call minimize like:

minimize(fun_mmog, x0,
         jac=construct_jacobian(fun_mmog, [1e0, 1e-4, 1e-4, 1e-4]),
         bounds=bounds, method='SLSQP')
It sounds like your target function doesn't have well-behaved derivatives. The line in the output jac: array([ 0., 0., -54090399.99999981, 0.]) means that changing only the third variable value is significant. And because the derivative w.r.t. this variable is virtually infinite, there is probably something wrong in the function. That is also why the third variable value ends up at its maximum.
I would suggest that you take a look at the derivatives, at least at a few points in your parameter space. Compute them using finite differences and the default step size of SciPy's fmin_l_bfgs_b, 1e-8. Here is an example of how you could compute the derivatives.
Try also plotting your target function. For instance, keep two of the parameters constant and let the other two vary. If the function has multiple local optima, you shouldn't use gradient-based methods like BFGS.
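One possible way to do that finite-difference check (a sketch, not necessarily the example originally linked above) is scipy.optimize.approx_fprime, applied to the fun_mmog and starting point from the question:
import numpy as np
from scipy.optimize import approx_fprime

eps = 1e-8                                  # default step size of fmin_l_bfgs_b
x0 = np.array([4435, 3.54, 0.13, 2.4])      # starting point from the question

grad = approx_fprime(x0, fun_mmog, eps)
print(grad)   # huge or wildly varying entries hint at scaling or derivative problems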
How difficult is it to get an analytical expression for the gradient? If you have that, you can then approximate the product of the Hessian with a vector using finite differences, and use other optimization routines. Among the various optimization routines available in SciPy, the one called TNC (truncated Newton conjugate gradient) is quite robust to the numerical values associated with the problem.
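A hedged sketch of what that could look like, assuming an analytical gradient function fun_mmog_grad exists (the name is hypothetical), with x0 and bounds as in the question:
from scipy.optimize import minimize

# TNC handles bounds and only needs function values plus the gradient;
# fun_mmog_grad is a hypothetical analytical gradient of fun_mmog.
res = minimize(fun_mmog, x0, jac=fun_mmog_grad,
               bounds=bounds, method='TNC')
print(res.x, res.fun)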
The Nelder-Mead Simplex Method (suggested by Cristián Antuña in the comments above) is well known to be a good choice for optimizing (possibly ill-behaved) functions with no knowledge of derivatives (see Numerical Recipes in C, Chapter 10).
There are two somewhat specific aspects to your question. The first is the constraints on the inputs, and the second is a scaling problem. The following suggests solutions to these points, but you might need to manually iterate between them a few times until things work.

Input Constraints
Assuming your input constraints form a convex region (as your examples above indicate, but I'd like to generalize it a bit), then you can write a function

is_in_bounds(p):  # Return whether p is in the bounds

Using this function, assume that the algorithm wants to move from point from_ to point to, where from_ is known to be in the region. Then the following function will efficiently find the furthest point on the line between the two points at which it can proceed:

from numpy.linalg import norm

def progress_within_bounds(from_, to, eps):
    """
    from_ -- source (in region)
    to    -- target point
    eps   -- Euclidean precision along the line
    """
    if norm(from_ - to) < eps:
        return from_
    mid = (from_ + to) / 2
    if is_in_bounds(mid):
        return progress_within_bounds(mid, to, eps)
    return progress_within_bounds(from_, mid, eps)

(Note that this function can be optimized for some regions, but it's hardly worth the bother, as it doesn't even call your original objective function, which is the expensive one.)

One of the nice aspects of Nelder-Mead is that the function does a series of steps that are quite intuitive. Some of these points can obviously throw you out of the region, but it's easy to modify this. Here is an implementation of Nelder-Mead with the modifications marked between pairs of lines of the form ##################################################################:

import copy

'''
    Pure Python/Numpy implementation of the Nelder-Mead algorithm.
    Reference: https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
'''

def nelder_mead(f, x_start,
                step=0.1, no_improve_thr=10e-6,
                no_improv_break=10, max_iter=0,
                alpha=1., gamma=2., rho=-0.5, sigma=0.5):
    '''
        @param f (function): function to optimize, must return a scalar score
            and operate over a numpy array of the same dimensions as x_start
        @param x_start (numpy array): initial position
        @param step (float): look-around radius in initial step
        @no_improv_thr, no_improv_break (float, int): break after no_improv_break
            iterations with an improvement lower than no_improv_thr
        @max_iter (int): always break after this number of iterations.
            Set it to 0 to loop indefinitely.
        @alpha, gamma, rho, sigma (floats): parameters of the algorithm
            (see Wikipedia page for reference)
    '''

    # init
    dim = len(x_start)
    prev_best = f(x_start)
    no_improv = 0
    res = [[x_start, prev_best]]

    for i in range(dim):
        x = copy.copy(x_start)
        x[i] = x[i] + step
        score = f(x)
        res.append([x, score])

    # simplex iter
    iters = 0
    while 1:
        # order
        res.sort(key=lambda x: x[1])
        best = res[0][1]

        # break after max_iter
        if max_iter and iters >= max_iter:
            return res[0]
        iters += 1

        # break after no_improv_break iterations with no improvement
        print('...best so far:', best)

        if best < prev_best - no_improve_thr:
            no_improv = 0
            prev_best = best
        else:
            no_improv += 1

        if no_improv >= no_improv_break:
            return res[0]

        # centroid
        x0 = [0.] * dim
        for tup in res[:-1]:
            for i, c in enumerate(tup[0]):
                x0[i] += c / (len(res)-1)

        # reflection
        xr = x0 + alpha*(x0 - res[-1][0])
        ##################################################################
        ##################################################################
        xr = progress_within_bounds(x0, x0 + alpha*(x0 - res[-1][0]), prog_eps)
        ##################################################################
        ##################################################################
        rscore = f(xr)
        if res[0][1] <= rscore < res[-2][1]:
            del res[-1]
            res.append([xr, rscore])
            continue

        # expansion
        if rscore < res[0][1]:
            xe = x0 + gamma*(x0 - res[-1][0])
            ##################################################################
            ##################################################################
            xe = progress_within_bounds(x0, x0 + gamma*(x0 - res[-1][0]), prog_eps)
            ##################################################################
            ##################################################################
            escore = f(xe)
            if escore < rscore:
                del res[-1]
                res.append([xe, escore])
                continue
            else:
                del res[-1]
                res.append([xr, rscore])
                continue

        # contraction
        xc = x0 + rho*(x0 - res[-1][0])
        ##################################################################
        ##################################################################
        xc = progress_within_bounds(x0, x0 + rho*(x0 - res[-1][0]), prog_eps)
        ##################################################################
        ##################################################################
        cscore = f(xc)
        if cscore < res[-1][1]:
            del res[-1]
            res.append([xc, cscore])
            continue

        # reduction
        x1 = res[0][0]
        nres = []
        for tup in res:
            redx = x1 + sigma*(tup[0] - x1)
            score = f(redx)
            nres.append([redx, score])
        res = nres

Note: this implementation is GPL, which is either fine for you or not. It's extremely easy to modify NM from any pseudocode, though, and you might want to throw in simulated annealing in any case.

Scaling
This is a trickier problem, but jasaarim has made an interesting point regarding that. Once the modified NM algorithm has found a point, you might want to run matplotlib.contour while fixing a few dimensions, in order to see how the function behaves. At this point, you might want to rescale one or more of the dimensions, and rerun the modified NM.