Integer step size in scipy optimize minimize - python

I have a computer vision algorithm I want to tune up using scipy.optimize.minimize. Right now I only want to tune up two parameters but the number of parameters might eventually grow so I would like to use a technique that can do high-dimensional gradient searches. The Nelder-Mead implementation in SciPy seemed like a good fit.
I got the code all set up, but it seems that the minimize function really wants to use floating-point values with a step size smaller than one. The current set of parameters are both integers: one has a step size of one and the other has a step size of two (i.e. the value must be odd; if it isn't, the thing I am trying to optimize will convert it to an odd number). Roughly speaking, one parameter is a window size in pixels and the other is a threshold (a value from 0-255).
For what it is worth I am using a fresh build of scipy from the git repo. Does anyone know how to tell scipy to use a specific step size for each parameter? Is there some way I can roll my own gradient function? Is there a scipy flag that could help me out? I am aware that this could be done with a simple parameter sweep, but I would eventually like to apply this code to much larger sets of parameters.
The code itself is dead simple:
import numpy as np
from scipy.optimize import minimize
from ScannerUtil import straightenImg
import bson

def doSingleIteration(parameters):
    # do some machine vision magic
    # return the difference between my value and the truth value
    pass

parameters = np.array([11, 10])
res = minimize(doSingleIteration, parameters, method='Nelder-Mead',
               options={'xtol': 1e-2, 'disp': True, 'ftol': 1.0})  # not sure if these params do anything
print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
print res
This is what my output looks like. As you can see we are repeating a lot of runs and not getting anywhere in the minimization.
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.] <-- Output from scipy minimize
{'block_size': 11, 'degree': 10} <-- input to my algorithm rounded and made int
+++++++++++++++++++++++++++++++++++++++++
120 <-- output of the function I am trying to minimize
+++++++++++++++++++++++++++++++++++++++++
[ 11.55 10. ]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.5]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.55 9.5 ]
{'block_size': 11, 'degree': 9}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.1375 10.25 ]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.275 10. ]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.25]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.275 9.75 ]
{'block_size': 11, 'degree': 9}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
~~~
SNIP
~~~
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.0078125]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
Optimization terminated successfully.
Current function value: 120.000000
Iterations: 7
Function evaluations: 27
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
status: 0
nfev: 27
success: True
fun: 120.0
x: array([ 11., 10.])
message: 'Optimization terminated successfully.'
nit: 7

Assuming that the function to minimize is arbitrarily complex (nonlinear), this is a very hard problem in general. It cannot be guaranteed to be solved optimally unless you try every possible option. I do not know if there is any integer-constrained nonlinear optimizer (I somewhat doubt it), and I will assume you know that Nelder-Mead should work fine if it were a continuous function.
Edit: Considering the comment from @Dougal I will just add here: set up a coarse + fine grid search first; if you then feel like trying whether your Nelder-Mead works (and converges faster), the points below may help...
But here are some points that may help:
Considering how difficult the integer constraint is, maybe it would be an option to do some simple interpolation to help the optimizer. It should still converge to an integer solution. Of course this requires calculating extra points, but it might solve many other problems. (Even in linear integer programming it is common to solve the unconstrained system first, AFAIK.)
Nelder-Mead starts with N+1 points; these are hard-wired in scipy (at least in older versions) to (1+0.05) * x0[j] (for j in all dimensions, unless x0[j] is 0), which you will see in your first evaluation steps. Maybe these can be supplied in newer versions; otherwise you could just change/copy the scipy code (it is pure Python) and set it to something more reasonable. Or, if you feel that is simpler, scale all input variables down so that (1+0.05)*x0 is of sensible size.
Maybe you should cache all function evaluations, since with Nelder-Mead I would guess you can always run into duplicate evaluations (at least at the end); see the sketch after these points.
You have to check how likely it is that Nelder-Mead will just shrink to a single value and give up, because it always finds the same result.
You generally must check if your function is well behaved at all... This optimization is doomed if the function does not change smoothly over the parameter space, and even then it can easily run into local minima if you have any. (Since you cached all evaluations, as in the caching point above, you could at least plot those and have a look at the error landscape without needing to do any extra evaluations.)
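As a concrete illustration of the caching point, here is a minimal sketch (not from the original post) that snaps the parameters to the integer grid and memoizes evaluations, so duplicate simplex points cost nothing and the visited landscape can be inspected afterwards; doSingleIteration stands in for the expensive objective from the question:
import numpy as np
from scipy.optimize import minimize

cache = {}

def cached_objective(parameters):
    # Snap to the integer grid: block_size is any integer, degree must be odd.
    block_size = int(round(parameters[0]))
    degree = int(round((parameters[1] - 1) / 2.0)) * 2 + 1
    key = (block_size, degree)
    if key not in cache:                               # evaluate each grid point only once
        cache[key] = doSingleIteration(np.array(key))  # the expensive call from the question
    return cache[key]

res = minimize(cached_objective, np.array([11.0, 10.0]), method='Nelder-Mead')
print(sorted(cache.items()))                           # the sampled error landscape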

Unfortunately, Scipy's built-in optimization tools don't easily allow for this. But never fear; it sounds like you have a convex problem, and so you should be able to find a unique optimum, even if it won't be mathematically pretty.
Two options that I've implemented for different problems are creating a custom gradient descent algorithm, and using bisection on a series of univariate problems. If you're doing cross-validation in your tuning, your loss function unfortunately won't be smooth (because of noise from cross-validation on different datasets), but will be generally convex.
To implement gradient descent numerically (without having an analytical method for evaluating the gradient), choose a test point and a second point that is delta away from your test point in all dimensions. Evaluating your loss function at these two points can allow you to numerically compute a local subgradient. It is important that delta be large enough that it steps outside of local minima created by cross-validation noise.
A slower but potentially more robust alternative is to implement bisection for each parameter you're testing. If you know that the problem is jointly convex in your two parameters (or n parameters), you can separate this into n univariate optimization problems, and write a bisection algorithm which recursively homes in on the optimal parameters. This can help handle some types of quasiconvexity (e.g. if your loss function takes a background noise value for part of its domain, and is convex in another region), but requires a good guess as to the bounds for the initial iteration.
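A minimal sketch of that idea, under the assumption that the objective is roughly convex in each integer parameter: the bisection works on the sign of the forward difference f(m+1) - f(m), and the coordinate sweep is one simple way to chain the univariate searches (all names here are illustrative):
def int_bisect_min(f, lo, hi):
    # Minimize a convex function of one integer argument on [lo, hi] by
    # bisecting on the sign of the forward difference f(m + 1) - f(m).
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid + 1) - f(mid) >= 0:
            hi = mid          # minimum is at mid or to its left
        else:
            lo = mid + 1      # still descending, move right
    return lo

def coordinate_search(f, x0, bounds, sweeps=5):
    # Alternate univariate integer searches over each parameter.
    x = list(x0)
    for _ in range(sweeps):
        for i, (lo, hi) in enumerate(bounds):
            x[i] = int_bisect_min(lambda v: f(x[:i] + [v] + x[i + 1:]), lo, hi)
    return x

# e.g. best = coordinate_search(loss, [11, 9], [(3, 51), (1, 255)])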
If you simply snap the requested x values to an integer grid without fixing xtol to map to that gridsize, you risk having the solver request two points within a grid cell, receiving the same output value, and concluding that it is at a minimum.
No easy answer, unfortunately.

Snap your floats x, y (a.k.a. winsize, threshold) to an integer grid inside your function, like this:
def func( x, y ):
    x = round( x )
    y = round( (y - 1) / 2 ) * 2 + 1  # 1 3 5 ...
    ...
Then Nelder-Mead will see function values only on the grid, and should give you near-integer x, y.
(If you'd care to post your code someplace, I'm looking for test cases for a Nelder-Mead with restarts.)

The Nelder-Mead minimize method now lets you specify the initial simplex vertex points, so you should be able to set the simplex points far apart, and the simplex will then flop around and find the minimum and converge when the simplex size drops below 1.
https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
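For the two-parameter problem in the question, a minimal sketch of what this might look like (the vertex values here are arbitrary illustrations, and doSingleIteration is the objective from the question):
import numpy as np
from scipy.optimize import minimize

# N+1 = 3 vertices for N = 2 parameters, spread well beyond a step size of 1
init_simplex = np.array([[11.0, 11.0],
                         [31.0, 11.0],
                         [11.0, 51.0]])

res = minimize(doSingleIteration, x0=np.array([11.0, 11.0]),
               method='Nelder-Mead',
               options={'initial_simplex': init_simplex,
                        'xatol': 1.0, 'fatol': 0.5})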

The problem is that the algorithm gets stuck trying to shrink its (N+1) simplex.
I'd highly recommend that anyone new to the concept learn more about the geometrical shape of a simplex and figure out how the input parameters relate to the points on the simplex. Once you get a grasp of that, then, as I.P. Freeley suggested, the problem can be solved by defining strong initial points for your simplex. Note that this is different from defining your x0 and goes into Nelder-Mead's dedicated options. Here is an example of a higher-dimensional (4-D) problem. Also note that the initial simplex has to have N+1 points, in this case 5 and in your case 3.
init_simplex = np.array([[1, .1, .3, .3], [.1, 1, .3, .3], [.1, .1, 5, .3],
                         [.1, .1, .3, 5], [1, 1, 5, 5]])
minimum = minimize(Optimize.simplex_objective, x0=np.array([.01, .01, .01, .01]),
                   method='Nelder-Mead',
                   options={'adaptive': True, 'xatol': 0.1, 'fatol': .00001,
                            'initial_simplex': init_simplex})
In this example the x0 gets ignored by the definition of the initial_simplex. Another useful option in high-dimensional problems is the 'adaptive' option, which takes the number of parameters into account while setting the method's operational coefficients (i.e. α, γ, ρ and σ for reflection, expansion, contraction and shrink, respectively). And if you haven't already, I'd also recommend familiarizing yourself with the steps of the algorithm.
Now, as for the reason this problem is happening: it's because the method gets no good results in an expansion, so it keeps shrinking the simplex smaller and smaller, trying to find a better solution that may or may not exist.


Optuna suggest float log=True

How can I have optuna suggest float numeric values from this list:
[1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
I'm using this Python code snippet:
trial.suggest_float("lambda", 1e-6, 1.0, log=True)
It correctly suggests values between 1e-6 and 1.0, but it suggests other values in the range, not just the values explicitly in the list above. What am I doing wrong?
What's wrong with suggest_categorical (in this instance)?
The suggest_categorical approach works alright, but it's subtly going to hurt the efficiency of your search in a way that can sometimes be extremely significant (especially if your list is large or you use this categorical approach many times in the same search space). Optuna considers each value in the list to be its own separate entity that cannot be ordered or compared to the others. For example, I often use suggest_categorical to select between different algorithms, e.g. trial.suggest_categorical("algo", ["dbscan", "kmeans"]). But of course, it would be silly to say that dbscan > kmeans, or to find the value "halfway between" dbscan and kmeans. Optuna treats your list of numbers the same way. Even if it finds that performance is steadily and substantially decreasing as you move from 1.0 to 1e-5, it will still try 1e-6 because it cannot extrapolate from the trend. Note this is actually good when your values are truly categorical and unordered, but bad in your case, since it keeps trying parameters that are almost certainly bad.
What to do instead
trial.suggest_float selects between the min and max values provided in a continuous manner, unless the "step" argument is provided.
For example, trial.suggest_float("x", 0, 10) can return 0.0, 6.5, 3.25846, or anything else between 0 and 10. With step=0.5, it can only return numbers divisible by 0.5.
Sadly, the Optuna docs state:
The step and log arguments cannot be used at the same time. To set the step argument to a float number, set the log argument to False.
However, you can get around this. What I would suggest is having Optuna give you the exponent, and you calculate the actual value yourself. For example:
exp = trial.suggest_int("exp", -6, 0)
x = 10 ** exp
Essentially, any time you can express the parameters you want as some function (in this case 10^x for x between -6 and 0), it is better to just code that function directly, with Optuna supplying the inputs, than to try to bend Optuna's functions too far to your purposes.
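For context, here is a minimal sketch of how this might look inside an Optuna objective (the objective body is just a placeholder standing in for model training):
import optuna

def objective(trial):
    # Let Optuna pick the exponent, then derive the actual value ourselves.
    exp = trial.suggest_int("exp", -6, 0)
    lam = 10.0 ** exp            # lam is one of 1e-6, 1e-5, ..., 1.0
    # ... train a model with regularization strength `lam` and return its score
    return (lam - 1e-3) ** 2     # placeholder objective

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)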
For selecting from a list, use suggest_categorical.
trial.suggest_categorical("lambda", [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0])

How to configure the optimization algorithm during maximum likelihood estimation in OpenTURNS?

I have a Sample and I want to fit the parameters of a Beta distribution with maximum likelihood estimation. Moreover, I want to fix its bounds to the [0, 100] interval. This should be easy with MaximumLikelihoodFactory, but the problem is that the optimization algorithm fails. How may I change the algorithm so that it can succeed?
Here is a simple example, where I generate a sample of size 100 and fix the parameters a and b with setKnownParameter.
import openturns as ot
# Get sample
beta_true = ot.Beta(3.0, 1.0, 0.0, 100.0)
sample = beta_true.getSample(100)
# Fit
factory = ot.MaximumLikelihoodFactory(ot.Beta())
factory.setKnownParameter([0.0, 100.0], [2, 3])
beta = factory.build(sample)
print(beta)
The previous script produces:
Beta(alpha = 2, beta = 2, a = 0, b = 100)
WRN - Switch to finite difference to compute the gradient at point=[0.130921,-2.18413]
WRN - TNC went to an abnormal point=[nan,nan]
The algorithm clearly fails, since the values of alpha and beta are unchanged with respect to their default values.
I do not know why this fails, perhaps because it uses finite difference derivatives. Anyway, I would like to customize the optimization algorithm and see if it can change anything to the result, but I do not know how to do this.
The ResourceMap has a key which allows you to configure the optimization algorithm. The value is a string which is the name of the default algorithm:
"MaximumLikelihoodFactory-DefaultOptimizationAlgorithm": "TNC"
The code uses this algorithm to perform the maximum likelihood estimation (MLE), but it does not say what values can be set. Actually, the MLE code uses the OptimizationAlgorithm.Build static method to create the optimization algorithm. According to the doc, this is the "Name of the algorithm or problem to solve. For example TNC, Cobyla or one of the NLopt solver names." So I can configure, say, M. J. D. Powell's "Cobyla" algorithm:
ot.ResourceMap.SetAsString("MaximumLikelihoodFactory-DefaultOptimizationAlgorithm", "Cobyla")
factory = ot.MaximumLikelihoodFactory(ot.Beta())
factory.setKnownParameter([0.0, 100.0], [2, 3])
beta = factory.build(sample)
print(beta)
The previous script produces:
Beta(alpha = 2.495, beta = 0.842196, a = 0, b = 100)
This shows that the algorithm now performs correctly.
I can also use one of NLOPT's algorithms, e.g. "LN_AUGLAG" (this is a "L"ocal algorithm, with "N"o derivatives).
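A minimal sketch of that variant, reusing the sample and factory setup from above (this assumes an OpenTURNS build with NLopt support, so that the solver name is recognized):
# Same MLE as before, but with an NLopt derivative-free local solver.
ot.ResourceMap.SetAsString(
    "MaximumLikelihoodFactory-DefaultOptimizationAlgorithm", "LN_AUGLAG")
factory = ot.MaximumLikelihoodFactory(ot.Beta())
factory.setKnownParameter([0.0, 100.0], [2, 3])
beta = factory.build(sample)
print(beta)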

what is the significance of omega in successive over relaxation rate method?

I have the following matrix
I have transformed this into a strictly diagonally dominant matrix and applied the Gauss-Seidel and successive over-relaxation (SOR) methods with omega=1.1 and a tolerance of epsilon=1e-4, with the convergence formula as below.
By solving this in Python manually (not using a linear algebra library) I found that both methods take the same number of iterations (6). But as per my understanding, if the matrix is convergent under Gauss-Seidel and 1 < omega < 2 for the SOR method, then SOR should take fewer iterations, which is not happening.
So, is my understanding correct? Is it mandatory for the SOR method to take fewer iterations?
This is actually a question I had myself as I was trying to solve the same problem. Here I will include my results from the 6th iteration of both the GS and SOR methods and give my opinion on why this is the case. For both, the initial vector is x = (0, 0, 0, 0). Practically speaking, we see that the L-infinity norm is different for each method (see below).
For Gauss-Seidel:
The solution vector in iteration 6 is:
[[ 1.0001]
[ 2. ]
[-1. ]
[ 1. ]]
The L infinity norm in iteration 6 is: [4.1458e-05]
For SOR:
The solution vector in iteration 6 is:
[[ 1.0002]
[ 2.0001]
[-1.0001]
[ 1. ]]
The L infinity norm in iteration 6 is: [7.8879e-05]
Academically speaking, "SOR can provide a convenient means to speed up both the Jacobi and Gauss-Seidel methods of solving our linear system. The parameter ω is referred to as the relaxation parameter. Clearly for ω = 1 we restore the original equations. If ω < 1 we talk of under-relaxation, and this can be important for some systems which will not converge under normal Jacobi relaxation. If ω > 1, we have over-relaxation, with which we will be more concerned. It was discovered during the years of hand computation that convergence is faster if we go beyond the Gauss-Seidel correction. Roughly speaking, those approximations stay on the same side of the solution x. An over-relaxation factor ω moves us closer to the solution. With ω = 1, we recover Gauss-Seidel; with ω > 1, the method is known as SOR. The optimal choice of ω never exceeds 2. It is often in the neighborhood of 1.9."
For more information on the ω you can also refer to Strang, G., 2006 page 410 of the book "Linear Algebra and its applications" as well as to the paper A rapid finite difference algorithm, utilizing successive over‐relaxation to solve the Poisson–Boltzmann equation.
Based on the academic description above, I believe that both of these methods take 6 iterations because 1.1 is not the optimal ω value. Changing ω to a value closer to the optimum could yield a better result, as the whole point of over-relaxation is to discover this optimal ω. (My belief, again, is that 1.1 is not the optimal omega, and I will update you once I do the calculation.) The image is from Strang, G., 2006, "Linear Algebra and its Applications", 4th edition, page 411.
Edit: Indeed, by running a graphical representation of omega vs. iterations in SOR, it seems that my optimal omega is in the range of 1.0300 to 1.0440, and the whole range of these omegas gives me five iterations, which is more efficient than pure Gauss-Seidel at omega = 1, which gives 6 iterations.
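To make the ω sweep concrete, here is a minimal sketch of the SOR iteration and an omega sweep; the system A x = b below is a generic diagonally dominant placeholder, not the matrix from the question:
import numpy as np

def sor(A, b, omega, tol=1e-4, max_iter=1000):
    # Successive over-relaxation; returns (solution, iteration count).
    n = len(b)
    x = np.zeros(n)
    for k in range(1, max_iter + 1):
        x_old = x.copy()
        for i in range(n):
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x_old[i + 1:]
            gs_update = (b[i] - sigma) / A[i, i]               # Gauss-Seidel value
            x[i] = (1 - omega) * x_old[i] + omega * gs_update  # relax toward/beyond it
        if np.max(np.abs(x - x_old)) < tol:                    # L-infinity convergence test
            return x, k
    return x, max_iter

# Example: sweep omega and count iterations (A and b are placeholders).
A = np.array([[4.0, -1, 0, 0], [-1, 4, -1, 0], [0, -1, 4, -1], [0, 0, -1, 4]])
b = np.array([15.0, 10, 10, 10])
for omega in (1.0, 1.1, 1.2, 1.3):
    _, iters = sor(A, b, omega)
    print(omega, iters)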

loss function as min of several points, custom loss function and gradient

I am trying to predict the quality of a metal coil. I have a metal coil with a width of 10 meters and a length from 1 to 6 kilometers. As training data I have ~600 parameters measured every 10 meters, and a final quality control mark - good/bad (for the whole coil). Bad means there is at least one place where the coil is bad; there is no data on exactly where. I have data for approximately 10000 coils.
Let's imagine we want to train logistic regression for this data (with 2 factors).
X = [[0, 0],
...
[0, 0],
[1, 1], # coil is actually broken here, but we don't know it yet.
[0, 0],
...
[0, 0]]
Y = ?????
I can't just put all "bad" in Y and run the classifier, because that would be confusing for the classifier. I can't put all "good" and one "bad" in Y because I don't know where the bad position is.
The solution I have in mind is the following: I could define the loss function as sum( (Y - min(F(x1,x2)))^2 ) (the min taken over all F values belonging to one coil), not sum( (Y - F(x1,x2))^2 ). In this case I would probably get F trained correctly to point to the bad place. I need a gradient for that; it is impossible to calculate it at all points, since the min is not differentiable everywhere, but I could use a weak gradient instead (using the value of the function which is minimal in the coil at each place).
I more or less know how to implement it myself; the question is what the simplest way to do it in Python with scikit-learn is. Ideally it should be the same (or easily adaptable) for several learning methods (a lot of methods are based on a loss function and gradient). Is it possible to make some wrapper for learning methods which works this way?
update: looking at gradient_boosting.py - there is an internal abstract class LossFunction with the ability to calculate the loss and gradient; that looks promising. It seems there is no common solution.
What you are considering here is known in the machine learning community as superset learning, meaning that instead of the typical supervised setting where you have a training set of the form {(x_i, y_i)}, you have {({x_1, ..., x_N}, y_1)} such that you know at least one element from the set has property y_1. This is not a very common setting, but it exists, with some research available; google for papers in the domain.
In terms of your own loss functions - scikit-learn is a no-go. Scikit-learn is about simplicity: it provides you with a small set of ready-to-use tools with very little flexibility. It is not a research tool, and your problem is researchy. What can you use instead? I suggest you go for any automatic-differentiation solution, for example autograd, which gives you the ability to differentiate through Python code; simply apply scipy.optimize.minimize on top of it and you are done! Any custom loss function will work just fine.
As a side note - the minimum operator is not differentiable, so the model might have a hard time figuring out what is going on. You could instead try sum((Y - prod_x F(x_1, x_2))^2), since multiplication is nicely differentiable, and you will still get a similar effect - if at least one element is predicted to be 0 it will remove any "1" answer from the remaining ones. You can even go one step further to make it more numerically stable and do:
if Y==0 then loss = sum_x log(F(x_1, x_2 ) )
if Y==1 then loss = sum_x log(1-F(x_1, x_2))
which translates to
Y * sum_x log(1-F(x_1, x_2)) + (1-Y) * sum_x log( F(x_1, x_2) )
You can notice the similarity with the cross-entropy cost, which makes perfect sense since your problem is indeed a classification. And now you have a perfectly probabilistic loss - you are attaching probabilities to each segment being "bad" or "good", so the probability of the whole object being bad is either high (if Y==0) or low (if Y==1).
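A minimal sketch of the suggested approach (autograd plus scipy.optimize.minimize with the product-based "every segment must be good" loss); the data layout here - a list of per-coil feature matrices and one label per coil - and the toy numbers are assumptions for illustration only:
import autograd.numpy as np
from autograd import grad
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, coils, labels):
    # coils: list of (n_segments, n_features) arrays; labels: 1 = good coil, 0 = bad coil
    total = 0.0
    eps = 1e-12
    for X, y in zip(coils, labels):
        p_good = sigmoid(np.dot(X, w))      # per-segment "good" probability
        p_all_good = np.prod(p_good)        # coil is good only if every segment is good
        total -= y * np.log(p_all_good + eps) + (1 - y) * np.log(1 - p_all_good + eps)
    return total

# toy data: two coils with 3 segments and 2 features each
coils = [np.array([[0.0, 0.1], [0.2, -0.1], [1.0, 1.0]]),
         np.array([[0.0, 0.0], [0.1, 0.2], [-0.1, 0.0]])]
labels = [0, 1]

w0 = np.zeros(2)
res = minimize(loss, w0, args=(coils, labels), jac=grad(loss), method='L-BFGS-B')
print(res.x)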

Difference between fitting algorithms in scipy

I have a question about the fit algorithms used in scipy. In my program, I have a set of x and y data points with y errors only, and want to fit a function
f(x) = (a[0] - a[1])/(1+np.exp(x-a[2])/a[3]) + a[1]
to it.
The problem is that I get absurdly high errors on the parameters, and also different values and errors for the fit parameters, using the two scipy fit routines scipy.odr.ODR (with the least squares algorithm) and scipy.optimize. I'll give my example:
Fit with scipy.odr.ODR, fit_type=2
Beta: [ 11.96765963 68.98892582 100.20926023 0.60793377]
Beta Std Error: [ 4.67560801e-01 3.37133614e+00 8.06031988e+04 4.90014367e+04]
Beta Covariance: [[ 3.49790629e-02 1.14441187e-02 -1.92963671e+02 1.17312104e+02]
[ 1.14441187e-02 1.81859542e+00 -5.93424196e+03 3.60765567e+03]
[ -1.92963671e+02 -5.93424196e+03 1.03952883e+09 -6.31965068e+08]
[ 1.17312104e+02 3.60765567e+03 -6.31965068e+08 3.84193143e+08]]
Residual Variance: 6.24982731975
Inverse Condition #: 1.61472215874e-08
Reason(s) for Halting:
Sum of squares convergence
and then the fit with scipy.optimize.leastsq:
Fit with scipy.optimize.leastsq
beta: [ 11.9671859 68.98445306 99.43252045 1.32131099]
Beta Std Error: [0.195503 1.384838 34.891521 45.950556]
Beta Covariance: [[ 3.82214235e-02 -1.05423284e-02 -1.99742825e+00 2.63681933e+00]
[ -1.05423284e-02 1.91777505e+00 1.27300761e+01 -1.67054172e+01]
[ -1.99742825e+00 1.27300761e+01 1.21741826e+03 -1.60328181e+03]
[ 2.63681933e+00 -1.67054172e+01 -1.60328181e+03 2.11145361e+03]]
Residual Variance: 6.24982904455 (calculated by me)
My point is the third fit parameter: the results are
scipy.odr.ODR, fit_type=2:
C = 100.209 +/- 80600
scipy.optimize.leastsq:
C = 99.432 +/- 12.730
I don't know why the first error is so much higher. Even better: If I put exactly the same data points with errors into Origin 9 I get
C = x0 = 99,41849 +/- 0,20283
and again exactly the same data into c++ ROOT Cern
C = 99.85+/- 1.373
even though I used exactly the same initial variables for ROOT and Python. Origin doesn't need any.
Do you have any clue why this happens and which is the best result?
I added the code for you at pastebin:
Data
C++ code
Python code: http://pastebin.com/jZVyzMkS
Thank you for helping!
EDIT: here's the plot related to SirJohnFranklin's post:
Did you actually try plotting the ODR and leastsq fits side by side? They look basically identical:
Consider what the parameters correspond to - the step function described by beta[0] and beta[1], the initial and final values, explains by far the majority of the variance in your data. By contrast, small changes in beta[2] and beta[3], the inflexion point and slope, will have comparatively little effect on the overall shape of the curve and therefore the residual variance for the fit. It's therefore no surprise that these parameters have high standard errors, and are fitted slightly differently by the two algorithms.
The overall greater standard errors reported by ODR are due to the fact that this model incorporates errors in the y-values whereas the ordinary least squares fit does not - errors in the measured y-values ought to reduce our confidence in the estimated fit parameters.
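To see this concretely, here is a minimal sketch (with synthetic stand-in data, since the original pastebin data is not reproduced here) that fits the model exactly as written in the question with both scipy.odr (fit_type=2) and scipy.optimize.curve_fit, and prints the parameter standard errors side by side:
import numpy as np
from scipy import odr
from scipy.optimize import curve_fit

def f_odr(a, x):                 # ODR expects f(beta, x)
    return (a[0] - a[1]) / (1 + np.exp(x - a[2]) / a[3]) + a[1]

def f_cf(x, a0, a1, a2, a3):     # curve_fit expects f(x, *params)
    return f_odr([a0, a1, a2, a3], x)

# synthetic stand-in data with y-errors
rng = np.random.default_rng(0)
x = np.linspace(90, 110, 60)
sy = np.full_like(x, 1.0)
y = f_odr([12.0, 69.0, 100.0, 1.0], x) + rng.normal(0.0, sy)

# weighted least squares via ODRPACK (fit_type=2)
odr_obj = odr.ODR(odr.RealData(x, y, sy=sy), odr.Model(f_odr), beta0=[12, 69, 100, 1])
odr_obj.set_job(fit_type=2)
out = odr_obj.run()
print("ODR beta:", out.beta, "sd:", out.sd_beta)

# ordinary curve_fit with the same y-errors
popt, pcov = curve_fit(f_cf, x, y, p0=[12, 69, 100, 1], sigma=sy, absolute_sigma=True)
print("curve_fit beta:", popt, "sd:", np.sqrt(np.diag(pcov)))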
(Sadly, I can't upload the fit, because I need more reputation. I'll give the plot to Captain Sandwich, so he can upload it for me.)
I'm in the same workgroup as the person who started the thread, but I did this plot.
So, I added x-errors on the data, because I had not gotten that far last time. The error obtained through the ODR is still absurdly high (4.18550164e+04 on beta[2]). In the plot, I show you what the fit from ROOT Cern gives, now with x and y errors. Here, x0 is beta[2].
The red and the green curves have a different beta[2]: the left one is minus the fit error of 3.430 obtained by ROOT, and the right one is plus the error. I think this makes total sense, much more than the error of 0.2 given by the fit of Origin 9 (which can only handle y-errors, I think) or the error of about 40k given by the ODR, which also includes x and y errors.
Maybe, because ROOT is mostly used by astrophysicists who need very robust fitting algorithms, it can handle much more difficult fits, but I don't know enough about the robustness of fitting algorithms.
