How does l_bfgs optimization method approximate the gradient

How does l_bfgs optimization method approximate the gradient - python

I'm using scipy's fmin_l_bfgs_b optimization method on a 2-dimensional function available as a black box. Gradients cannot be evaluated directly, so I'm asking the method to approximate the gradients by setting approx_grad = True.
I want to know how the approximate gradients are computed. My guess is that at each point, for each dimension, gradient is approximated by forward difference. So for each point in N dimensions, N evaluations are made to get the partial derivatives. Is this correct?

Jacobian approximation is done with scipy.optimize.approx_fprime function, docs:
f(xk[i] + epsilon[i]) - f(xk[i])
f'[i] = ---------------------------------
epsilon[i]
Where epsilon is a paramether to fmin_l_bfgs_b
epsilon : float
Step size used when approx_grad is True, for numerically calculating the gradient

I do not know how scipy does it particularly. A popular approach is to calculate them as follows:
(f(x+e)-f(x-e)/(2*e) (apparently here no LaTex supported)
This gives you accuracy up to quadratic terms (just calculate Taylor expansion for each term and substract them)

Related

scipy.optimize.leastq Minimize sum of least squares

I have a least squares minimization problem that has the following form
Where the parameters I want to optimize over are x and everything else is known.
scipy.optimize.least_squares has the following form:
scipy.optimize.least_squares(fun, x0)
where x0 is an initial condition and fun is a "Function which computes the vector of residuals"
After reading the documentation, I'm a little confused about what fun wants me to return.
If I do the summation inside fun, then I'm afraid that it would compute RHS, which is not equivalent to the LHS (...or is it, when it comes to minimization?)
Thanks for any assistance!

According to the documentation of scipy.optimize.least_squares, the argument fun is to provide the vector of residuals with which the process of minimization proceeds. It is possible to supply a scalar that is the result of summation of squared residuals, but it is also possible to supply a one-dimensional vector of shape (m,), where m is the number of dimensions of the residual function. Note that squaring and summation is not done in this instance as least_squares handles that detail on its own. Only the residuals as such must be supplied in this instance.

Solver tolerance and residual error when using sweep function in FiPy

I was trying to use FiPy to solve a set of PDEs when I realized the command sweep was not working the way I thought it would. Here goes a sample with part of my code:
from pylab import *
import sys
from fipy import *
viscosity = 5.55555555556e-06
Pe =5.
pfi=100.
lfi=0.01
Ly=1.
Nx =200
Ny=100
Lx=Ly*Nx/Ny
dL=Ly/Ny
mesh = PeriodicGrid2DTopBottom(nx=Nx, ny=Ny, dx=dL, dy=dL)
x, y = mesh.cellCenters
xVelocity = CellVariable(mesh=mesh, hasOld=True, name='X velocity')
xVelocity.constrain(Pe, mesh.facesLeft)
xVelocity.constrain(Pe, mesh.facesRight)
rad=0.1
var1 = DistanceVariable(name='distance to center', mesh=mesh, value=numerix.sqrt((x-Nx*dL/2.)**2+(y-Ny*dL/2.)**2))
pi_fi= CellVariable(mesh=mesh, value=0.,name='Fluid-interface energy map')
pi_fi.setValue(pfi*exp(-1.*(var1-rad)/lfi), where=(var1 > rad) )
pi_fi.setValue(pfi, where=(var1 <= rad))
xVelocityEq = DiffusionTerm(coeff=viscosity) - ImplicitSourceTerm(pi_fi)
xres=10.
while (xres > 1.e-6) :
xVelocity.updateOld()
mySolver = LinearGMRESSolver(iterations=1000,tolerance=1.e-6)
xres = xVelocityEq.sweep(var=xVelocity,solver=mySolver)
print 'Result = ', xres
#Thats it
In short, I am declaring a function called xVelocityEq and solving it using sweep. Here is my output:
Result = 0.0007856742013190237
Result = 6.414470433257661e-07
As you can see, the while loop ends after two iterations. My first question is: why is my first residual error (=0.0007856742013190237) higher than the solver's tolerance? I thought that, since xVelocityEq corresponds to a linear system, solver tolerance and residual error would mean the same thing.
If I increase the no. of iterations in mySolver from 1000 to 10000, I get the following output:
Result = 0.0007856742013190237
Result = 2.4619110931978988e-09
Why did the second residual change, given that the first remained the same?
If I increase the tolerance in mySolver from 1.e-6 to 7.e-4, I get the following output:
Result = 0.0007856742013190237
Result = 6.414470433257661e-07
Note that these residuals are the same as in the first output. Now if I try to further increase the tolerance to 8.e-4, here's what I get as output:
Result = 0.0007856742013190237
Result = 0.0007856742013190237
Result = 0.0007856742013190237
Result = 0.0007856742013190237
Result = 0.0007856742013190237
...
At this point I was completely lost. Why the residuals have the same values for all solver tolerances smaller than 7.e-4? And why these residuals are constant and equal to 0.0007856742013190237 for solver tolerances higher than 7.e-4?
If I change the mySolver to LinearLUSolver (iterations=1000, tolerance=1.e-6), here's what I get:
Result = 0.0007856742013190237
Result = 1.6772757200988522e-18
Why in the world is my first residual the same as before, even though I have changed the solver?

why is my first residual error (=0.0007856742013190237) higher than the solver's tolerance?
The residual calculated by .sweep() is calculated before the solver is invoked to calculated a new solution vector. The matrix L and right-hand-side vector b are calculated based on the initial value of the solution vector x.
The residual is a measure of how well the current solution vector satisfies the non-linear PDE. The solver tolerance places a limit on how hard the solver should work to satisfy the linear system of equations discretized from the PDE.
Even if the PDE is linear (e.g., the diffusion coefficient is not a function of the solution variable), the initial value presumably doesn't solve the PDE, so the residual is large. After the solver is invoked, then x should solve the PDE, to within the solver tolerance. If the PDE is non-linear, then a well-converged solution to the linear algebra is still probably not a good solution to the PDE; that's what sweeping is for.
I thought that, since xVelocityEq corresponds to a linear system, solver tolerance and residual error would mean the same thing.
There wouldn't be any utility in keeping track of both. In addition to the residual being before the solve and the solver tolerance being used to terminate the solve, there are different normalizations that can be used and a lot of the solver documentation can be kind of sketchy. FiPy uses |L x - b|_2 as its residual. Solvers may normalize by the magnitude of b, the diagonal of L, or the phase of the moon, all of which can make it hard to directly compare the residual with the tolerance.
Why did the second residual change, given that the first remained the same?
By allowing 1000 iterations instead of 100, the solver was able to drive to a more exacting tolerance which, in turn, led to a smaller residual for the next sweep.
Why the residuals have the same values for all solver tolerances smaller than 7.e-4? And why these residuals are constant and equal to 0.0007856742013190237 for solver tolerances higher than 7.e-4?
Probably because the solver is failing and so not changing the value of the solution vector. Some solvers don't report this. In other cases, we should be doing a better job of reporting that fact to you.
Why in the world is my first residual the same as before, even though I have changed the solver?
The residual is not a property of the solver. It is a property of the discretized system of equations that approximates your PDE. Those linear algebra equations are then the input to the solver.

gradient at the minimum in fmin_l_bfgs_b

I am using fmin_l_bfgs_b for a bounded minimization on 4 parameters.
I would like to inspect the gradient at the minimum of the cost function and for this I call the d['grad'] parameter as described in the documentation of fmin_l_bfgs_b. My problem is that d['grad'] is an array of size 4 looking like:
'grad': array([ 8.38440428e-05, -5.72697445e-04, 3.21875859e-03,
-2.21115926e+00])
I would expect it to be a single value close to zero. Does this have something to do with the number of the parameters I am using for the minimization (4)..? Not what I would expect but any help would be appreciated.

What you are getting is the gradient of the cost function with respect to each parameter, in turn.
To picture it, suppose there were only two parameters, x and y. The cost function is a surface z as a function of x and y.
The optimization is finding a minimum point on that surface.
That's where the gradients with respect to both x and y are zero (or close to it).
If either gradient is not zero, you are not at a minimum, and you would descend further.
As a further point, you could well be interested in the curvature, or second derivative, because high curvature means a narrow (precise) minimum, while low curvature means a nearly flat minimum, with very uncertain estimates.
The second derivative in the x,y case would not be a 2-vector, but a 2x2-matrix (called a "Hessian", just to snow your friends).
You might want to think about why it's a 2x2-matrix.

IFFT of a Gaussian power spectrum - Python

I want to calculate the Inverse Fourier Transform of a Gaussian power spectrum, thus obtaining a Gaussian again. I want to use this fact to check that the IFFT of my Gaussian power spectrum is sensible, in the sense that it produces an array of data effectively distributed in Gaussian way.
Now, it turns out that the IFFT must be multiplied by a factor 2*pi*N, where N is the dimension of the array, in order to recover the analytic correlation function (which is the Inverse Fourier Transform of the power spectrum). Can someone explain why?
Here is the piece of code that first fills an array with the Gaussian power spectrum and then does the IFFT of the power spectrum.
power_spectrum_k = np.zeros(n, float)
for k in range(1, int(n/2+1)):
power_spectrum_k[k] = math.exp(-(2*math.pi*k*sigma/n)*(2*math.pi*k*sigma/n))
for k in range(int(n/2+1), n):
power_spectrum_k[k] = power_spectrum_k[int(k - n/2)]
inverse_transform2 = np.zeros(n, float)
inverse_transform2 = np.fft.ifft(power_spectrum_k)
where the symmetry of the power spectrum comes from the need to get a real correlation function, at the same time following the rules for the use of numpy.ifft (quoting from the documentation:
"The input should be ordered in the same way as is returned by fft, i.e., a[0] should contain the zero frequency term, a[1:n/2+1] should contain the positive-frequency terms, and a[n/2+1:] should contain the negative-frequency terms, in order of decreasingly negative frequency".)

The reason is the Plancherel theorem, which states that the Fourier transform conserves the signal's energy, i.e., the integral over |x(t)|² equals the integral over |X(f)|². If you have more samples (e.g., caused by higher sampling rate or a longer interval), you have more energy. For that reason your IFFT result is scaled by a factor of N. Your factor depends on hand on the convention of Fourier Integral used, as #pv already noted. On the other hand, on the length of your interval, since integral over the power of the sampled and the continuous interval need to be the same.

I'd recommend using an existing library for an fft. Not as its particularly difficult but there are some well optimised solutions.
Try scipy http://docs.scipy.org/doc/scipy/reference/fftpack.html or my favourite fftw https://hgomersall.github.io/pyFFTW/

Why does numpy std() give a different result to matlab std()?

I try to convert matlab code to numpy and figured out that numpy has a different result with the std function.
in matlab
std([1,3,4,6])
ans = 2.0817
in numpy
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?

The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This gets rid of some of (but probably not all of) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by #hbaderts gives further mathematical details.

The standard deviation is the square root of the variance. The variance of a random variable X is defined as
An estimator for the variance would therefore be
where denotes the sample mean. For randomly selected , it can be shown that this estimator does not converge to the real variance, but to
If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator
which will converge to . The correction term is also called Bessel's correction.
Now by default, MATLABs std calculates the unbiased estimator with the correction term n-1. NumPy however (as #ajcr explained) calculates the biased estimator with no correction term by default. The parameter ddof allows to set any correction term n-ddof. By setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB allows to add a second parameter w, which specifies the "weighing scheme". The default, w=0, results in the correction term n-1 (unbiased estimator), while for w=1, only n is used as correction term (biased estimator).

For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population
The DDOF is included for samples in order to counterbalance bias that can occur in the numbers.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.