I have the following matrix
I have transformed this into a strictly diagonally dominant matrix and applied the Gauss-Seidel and successive over-relaxation (SOR) methods with omega=1.1 and a tolerance of epsilon=1e-4, with the convergence formula below
Solving this manually in Python (without a linear algebra library), I found that both methods take the same number of iterations (6). As I understand it, if Gauss-Seidel converges for the matrix and 1<omega<2, then SOR should take fewer iterations, but that is not happening.
So, is my understanding correct? Must SOR always take fewer iterations?
This is actually a question I had myself while trying to solve the same problem. Here are my results from the 6th iteration of both the GS and SOR methods, followed by my view on why this happens. For both, the initial vector is x = (0,0,0,0). In practice, the L-infinity norm differs between the two methods (see below).
For Gauss-Seidel:
The solution vector in iteration 6 is:
[[ 1.0001]
[ 2. ]
[-1. ]
[ 1. ]]
The L infinity norm in iteration 6 is: [4.1458e-05]
For SOR:
The solution vector in iteration 6 is:
[[ 1.0002]
[ 2.0001]
[-1.0001]
[ 1. ]]
The L infinity norm in iteration 6 is: [7.8879e-05]
Academically speaking "SOR can provide a convenient means to speed up both the Jacobian and Gauss-Seidel methods of solving the our linear system. The parameter ω is referred to as the relaxation parameter. Clearly for ω = 1 we restore the original equations. If ω < 1 we talk of under-relaxation, and this can be important for some systems which will not converge under normal Jacobian relaxation. If ω > 1, we have over-relaxation, with which we will be more concerned. It was discovered during the years of hand computation that convergence is faster if we go beyond the Gauss-Seidel correction. Roughly speaking, those approximations stay on the same side of the solution x. An overrelaxation factor ω moves us closer to the solution. With ω = 1, we recover Gauss-Seidel; with ω > 1, the method is known as SOR. The optimal choice of ω never exceeds 2. It is often in the neighborhood of 1.9."
For more information on the ω you can also refer to Strang, G., 2006 page 410 of the book "Linear Algebra and its applications" as well as to the paper A rapid finite difference algorithm, utilizing successive over‐relaxation to solve the Poisson–Boltzmann equation.
Based on the academic description above, I believe that both methods take 6 iterations because 1.1 is not the optimal ω value. Changing ω to a value closer to the optimum could yield a better result, as the whole point of overrelaxation is to discover this optimal ω. (Again, my belief is that 1.1 is not the optimal omega; I will update you once I do the calculation.) The image is from Strang, G., 2006 "Linear algebra and its applications" 4th edition page 411.
Edit: Indeed by running a graphical representation of omega - iterations in SOR it seems that my optimal omega is in the range of 1.0300 to 1.0440, and the whole range of these omegas gives me five iterations, which is a more efficient way than pure Gauss-Seidel at omega = 1 that gives 6 iterations.
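The omega sweep described in the edit can be reproduced with a short script. This is only a sketch: the matrix from the question is not shown here, so the system below is an assumed, illustrative strictly diagonally dominant one whose exact solution is (1, 2, -1, 1), and the stopping rule (relative change in the infinity norm) is likewise an assumption. The function name sor is made up for this example.

```python
import numpy as np

def sor(A, b, omega, tol=1e-4, max_iter=1000):
    """Solve A x = b by SOR from x = 0; omega = 1 reduces to Gauss-Seidel."""
    n = len(b)
    x = np.zeros(n)
    for k in range(1, max_iter + 1):
        x_old = x.copy()
        for i in range(n):
            # Sum over all j != i, using the entries of x updated so far.
            sigma = A[i] @ x - A[i, i] * x[i]
            gs_value = (b[i] - sigma) / A[i, i]
            x[i] = (1 - omega) * x[i] + omega * gs_value
        # Assumed stopping rule: relative change in the infinity norm.
        change = np.linalg.norm(x - x_old, np.inf)
        if change / np.linalg.norm(x, np.inf) < tol:
            return x, k
    return x, max_iter

# Illustrative system (an assumption, not necessarily the asker's matrix).
A = np.array([[10., -1., 2., 0.],
              [-1., 11., -1., 3.],
              [2., -1., 10., -1.],
              [0., 3., -1., 8.]])
b = np.array([6., 25., -11., 15.])

for omega in (1.0, 1.03, 1.1, 1.5, 1.9):
    x, iters = sor(A, b, omega)
    print(f"omega={omega:4.2f}  iterations={iters:3d}  x={np.round(x, 4)}")
```

Scanning a finer grid of omega values in the same loop shows which range gives the fewest iterations for a particular matrix and tolerance.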
In an optimization problem developed in PuLP I use the following objective function:
objective = p.lpSum(vec[r] for r in range(0,len(vec)))
All variables are non-negative integers, hence the sum over the vector gives the total number of units for my problem.
Now I am struggling with the fact that PuLP returns only one of many optimal solutions, and I would like to narrow the solution space to results that favor the smallest standard deviation of the decision variables.
E.g. say vec is a vector with elements 6 and 12. Then 7/11, 8/10 and 9/9 are equally feasible solutions, and I would like PuLP to arrive at 9/9.
Then the objective
objective = p.lpSum(vec[r]*vec[r] for r in range(0,len(vec)))
would obviously create a cost function that helps here, but alas, it is non-linear and PuLP throws an error.
Anyone who can point me to a potential solution?
Instead of minimizing the standard deviation (which is inherently non-linear), you could minimize the range or bandwidth. Along the lines of:
minimize maxv-minv
maxv >= vec[r] for all r
minv <= vec[r] for all r
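In PuLP the formulation above might look like the following sketch. The two-variable model and the fixed total of 18 are assumptions made to mirror the 6/12 example from the question. In the real model the total is itself the objective, so one option is to solve for the minimal total first and then add it as an equality constraint before minimizing the range.

```python
import pulp as p

prob = p.LpProblem("balanced_units", p.LpMinimize)

# Hypothetical decision variables: two non-negative integers.
vec = [p.LpVariable(f"v{r}", lowBound=0, cat="Integer") for r in range(2)]
maxv = p.LpVariable("maxv", lowBound=0)
minv = p.LpVariable("minv", lowBound=0)

prob += maxv - minv            # objective: minimize the range
prob += p.lpSum(vec) == 18     # assumed: total fixed from a first solve
for v in vec:
    prob += maxv >= v          # maxv bounds every element from above
    prob += minv <= v          # minv bounds every element from below

prob.solve(p.PULP_CBC_CMD(msg=False))
solution = [int(round(v.value())) for v in vec]
print(solution)
```

Everything stays linear, so the default CBC solver can handle it.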
I'm doing homework for a Monte Carlo course and I'm asked to find the transition matrix of a Markov chain with 6 states, namely
0, 1, 2, 3, 4, 5
such that after long enough period of time we have spent time proportional to the numbers
5, 10, 5, 10, 25, 60
in each of the states.
I see that this is the stationary vector we would get if we had the transition matrix. I have to use the Metropolis algorithm, but all the explanations and examples I find are based on the Metropolis-Hastings algorithm.
Pseudocode of the algorithm that I have:
select x
Loop over repetitions t=1,2...
    select y from Nx using density Gx
    put h=min(1, f(y)/f(x))
    if U ~ U(0, 1) < h then x <- y
end loop
I'm looking for a step by step explanation how to implement the algorithm for the given problem, preferably in python!
The algorithmic approach
The standard approach to computing the stationary distribution of a Markov Chain is the solution of linear equations, e.g. as described here https://stephens999.github.io/fiveMinuteStats/stationary_distribution.html.
The solution to your problem is the same in reverse - solve the same equations, except that in your case, you have the stationary distribution but you don't have the transition probabilities/rates.
The problem with this approach, however, is that you may construct a system of linear equations with many more variables than equations. This severely reduces your options regarding the topology of the constructed Markov Chain.
Fortunately, you seem to have no constraints on the topology of the constructed Markov Chain, so you can make some compromises. What you can do is disable most transitions, i.e. give them zero probability/rate, and only enable one transition per state. This may produce some kind of a ring topology, but it should ensure that your system of linear equations has a solution.
Primitive example
Consider stationary distribution Pi = ( x = 1/3, y = 1/3, z = 1/3)
Construct your system of linear equations as
Pi(x) = 1/3 = Pr(y,x) * Pi(y)
Pi(y) = 1/3 = Pr(z,y) * Pi(z)
Pi(z) = 1/3 = Pr(x,z) * Pi(x)
In this case a solution is Pr(y,x) = Pr(z,y) = Pr(x,z) = 1 and the obtained Markov Chain just boringly loops from x to z to y and back to x with probability 1.
Note that the number of fitting solutions may be infinite (even for the reduced system of linear equations as shown in the example), e.g. in this case the probabilities/rates can be any positive value as long as they are all equal.
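For a non-uniform target such as the weights 5, 10, 5, 10, 25, 60 from the question, a ring with probability-1 moves no longer balances, but a ring with self-loops does. The sketch below is one possible construction, not the only one: state i steps to its successor with probability c / Pi(i) and otherwise stays, where the constant c is an assumption chosen as the smallest stationary probability so that every entry stays in [0, 1].

```python
import numpy as np

weights = np.array([5, 10, 5, 10, 25, 60], dtype=float)
pi = weights / weights.sum()      # target stationary distribution
n = len(pi)

# Balance around the ring requires pi[i] * step_i to be the same
# constant c for every state; c = min(pi) keeps all step_i <= 1.
c = pi.min()
P = np.zeros((n, n))
for i in range(n):
    step = c / pi[i]
    P[i, (i + 1) % n] = step      # move to the next state on the ring
    P[i, i] = 1.0 - step          # otherwise stay where we are

print(np.round(P, 3))
print(np.allclose(pi @ P, pi))    # checks that pi is stationary for P
```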
So, step by step solution
Construct the system of linear equations as described.
Solve the constructed system of linear equations
The solution of your constructed system of linear equations describes a Markov Chain you are looking for. Trivially reconstruct the entire transition matrix, if you want to.
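Since the question asks specifically for the Metropolis algorithm, here is a minimal sketch of it for the six-state target. The proposal is an assumption: from state x, every other state is proposed with equal probability, which is symmetric, so the acceptance ratio reduces to f(y)/f(x) as in the pseudocode. The script builds the transition matrix this rule implies and also simulates the chain to compare visit frequencies with the target.

```python
import numpy as np

f = np.array([5, 10, 5, 10, 25, 60], dtype=float)  # target weights
n = len(f)
pi = f / f.sum()

# Transition matrix implied by the Metropolis rule.
P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if y != x:
            # Propose y with probability 1/(n-1), accept with min(1, f(y)/f(x)).
            P[x, y] = min(1.0, f[y] / f[x]) / (n - 1)
    P[x, x] = 1.0 - P[x].sum()    # rejected proposals keep the chain at x

# The same chain simulated step by step, following the pseudocode.
rng = np.random.default_rng(0)
steps = 200_000
visits = np.zeros(n)
x = 0
for t in range(steps):
    y = rng.integers(n - 1)       # uniform over the n-1 states other than x
    if y >= x:
        y += 1
    h = min(1.0, f[y] / f[x])
    if rng.random() < h:
        x = y
    visits[x] += 1

print(np.round(P, 3))
print(np.round(visits / steps, 3))
print(np.round(pi, 3))
```

The rows of P sum to one, and pi @ P equals pi up to rounding, which is the stationarity condition the question is after.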
When creating a line of best fit with numpy's polyfit, you can specify the parameter full to be True. This returns 4 extra values, apart from the coefficients. What do these values mean and what do they tell me about how well the function fits my data?
https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html
What i'm doing is:
bestFit = np.polyfit(x_data, y_data, deg=1, full=True)
and I get the result:
(array([ 0.00062008, 0.00328837]), array([ 0.00323329]), 2, array([
1.30236506, 0.55122159]), 1.1102230246251565e-15)
The documentation says that the four extra pieces of information are: residuals, rank, singular_values, and rcond.
Edit:
I am looking for a further explanation of how rcond and singular_values describe goodness of fit.
Thank you!
how rcond and singular_values describes goodness of fit.
Short answer: they don't.
They do not describe how well the polynomial fits the data; this is what residuals are for. They describe how numerically robust the computation of that polynomial was.
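As a rough illustration of using residuals for goodness of fit, the sketch below unpacks the five return values and turns the residual sum of squares into an R² value. The data are synthetic and made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x_data = np.linspace(0.0, 10.0, 50)
# Synthetic noisy line (assumed data, not the asker's).
y_data = 0.5 * x_data + 2.0 + rng.normal(scale=0.3, size=x_data.size)

coeffs, residuals, rank, singular_values, rcond = np.polyfit(
    x_data, y_data, deg=1, full=True)

# residuals holds the sum of squared residuals of the least-squares fit.
ss_res = residuals[0]
ss_tot = np.sum((y_data - y_data.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("coefficients:", coeffs)
print("sum of squared residuals:", ss_res)
print("R^2:", r_squared)
```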
rcond
The value of rcond is not really about quality of fit, it describes the process by which the fit was obtained, namely a least-squares solution of a linear system. Most of the time the user of polyfit does not provide this parameter, so a suitable value is picked by polyfit itself. This value is then returned to the user for their information.
rcond is used for truncation in ill-conditioned matrices. Least squares solver does two things:
Finds x that minimizes the norm of residuals Ax-b
If multiple x achieve this minimum, returns x with the smallest norm among those.
The second clause occurs when some changes of x do not affect the right-hand side at all. But since floating point computations are imperfect, usually what happens is that some changes of x affect the right hand side very little. And this is where rcond is used to decide when "very little" should be considered as "zero up to noise".
For example, consider the system
x1 = 1
x1 + 0.0000000001 * x2 = 2
This one can be solved exactly: x1 = 1 and x2 = 10000000000. But... that tiny coefficient (that in reality, came after some matrix manipulations) has some numeric error in it; for all we know it could be negative, or zero. Should we let it have such huge influence on the solution?
So, in such a situation the matrix (specifically its singular values) gets truncated at level rcond. This leaves
x1 = 1
x1 = 2
for which the least-squares solution is x1 = 1.5, x2 = 0. Note that this solution is robust: no huge numbers from tiny fluctuations of coefficients.
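The effect can be reproduced with np.linalg.lstsq, which takes the same kind of rcond cutoff. This is a sketch on the two-equation system above; the two cutoff values are arbitrary choices that sit on either side of the ratio between the small and large singular values.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1e-10]])
b = np.array([1.0, 2.0])

# Tiny cutoff: both singular values are kept, so the huge x2 appears.
x_full, _, rank_full, _ = np.linalg.lstsq(A, b, rcond=1e-15)

# Larger cutoff: the small singular value is treated as zero.
x_trunc, _, rank_trunc, sv = np.linalg.lstsq(A, b, rcond=1e-8)

print("singular values:", sv)
print("rank", rank_full, "solution", x_full)
print("rank", rank_trunc, "solution", x_trunc)
```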
Singular values
When one solves a linear system Ax = b in the least squares sense, the singular values of A determine how numerically tricky this is. Specifically, large disparity between largest and smallest singular values is problematic: such systems are ill-conditioned. An example is
0.835*x1 + 0.667*x2 = 0.168
0.333*x1 + 0.266*x2 = 0.067
The exact solution is (1, -1). But if the right hand side is changed from 0.067 to 0.066, the solution is (-666, 834) -- totally different. The problem is that the singular values of A are (roughly) 1 and 1e-6; this magnifies any changes on the right by the factor of 1e6.
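A quick numerical check of this example, using only np.linalg.solve and np.linalg.svd:

```python
import numpy as np

A = np.array([[0.835, 0.667],
              [0.333, 0.266]])

# Same matrix, right-hand sides differing by 0.001 in one entry.
x_a = np.linalg.solve(A, np.array([0.168, 0.067]))
x_b = np.linalg.solve(A, np.array([0.168, 0.066]))

s = np.linalg.svd(A, compute_uv=False)

print("solution for 0.067:", x_a)
print("solution for 0.066:", x_b)
print("singular values:", s, "ratio:", s[0] / s[1])
```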
Unfortunately, polynomial fitting often results in ill-conditioned matrices. For example, fitting a polynomial of degree 24 to 25 equally spaced data points is inadvisable.
import numpy as np
x = np.arange(25)
np.polyfit(x, x, 24, full=True)
The singular values are
array([4.68696731e+00, 1.55044718e+00, 7.17264545e-01, 3.14298605e-01,
1.16528492e-01, 3.84141241e-02, 1.15530672e-02, 3.20120674e-03,
8.20608411e-04, 1.94870760e-04, 4.28461687e-05, 8.70404409e-06,
1.62785983e-06, 2.78844775e-07, 4.34463936e-08, 6.10212689e-09,
7.63709211e-10, 8.39231664e-11, 7.94539407e-12, 6.32326226e-13,
4.09332903e-14, 2.05501534e-15, 7.55397827e-17, 4.81104905e-18,
8.98275758e-20]),
which, with the default value of rcond (5.55e-15 here), gets four of them truncated to 0.
The difference in magnitude between smallest and largest singular values indicates that perturbing the y-values by numbers of size 1e-15 can result in changes of about 1 to the coefficients. (Not every perturbation will do that, just some that happen to align with a singular vector for a small singular value).
Rank
The effective rank is just the number of singular values above the rcond threshold. In the above example it's 21. This means that even though the fit is for 25 points, and we get a polynomial with 25 coefficients, there are only 21 degrees of freedom in the solution.
I've implemented the mutual information formula in python using pandas and numpy
def mutual_info(p):
    p_x=p.sum(axis=1)
    p_y=p.sum(axis=0)
    I=0.0
    for i_y in p.index:
        for i_x in p.columns:
            I+=p.loc[i_y,i_x]*np.log2(p.loc[i_y,i_x]/(p_x[i_y]*p_y[i_x]))
    return I
However, if a cell in p has a zero probability, then np.log2(p.loc[i_y,i_x]/(p_x[i_y]*p_y[i_x])) is negative infinity, and the whole expression is multiplied by zero and returns NaN.
What is the right way to work around that?
For various theoretical and practical reasons (e.g., see Competitive Distribution Estimation:
Why is Good-Turing Good), you might consider never using a zero probability with the log loss measure.
So, say, if you have a probability vector p, then, for some small scalar α > 0, you would use α 1 + (1 - α) p (where here the first 1 is the uniform vector). Unfortunately, there are no general guidelines for choosing α, and you'll have to assess this further down the calculation.
For the Kullback-Leibler distance, you would of course apply this to each of the inputs.
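A minimal sketch of this smoothing for mutual information, assuming that "the uniform vector" means the uniform distribution over all cells of the joint table. The function signature and the value of alpha are illustrative choices, not a recommendation.

```python
import numpy as np

def mutual_info(p, alpha=0.0):
    """Mutual information in bits of a joint probability table p."""
    p = np.asarray(p, dtype=float)
    if alpha > 0:
        # Mix with the uniform distribution so that no cell is exactly zero.
        p = alpha / p.size + (1.0 - alpha) * p
    p_x = p.sum(axis=1, keepdims=True)   # row marginals
    p_y = p.sum(axis=0, keepdims=True)   # column marginals
    return float(np.sum(p * np.log2(p / (p_x * p_y))))

joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])           # contains zero cells
mi = mutual_info(joint, alpha=1e-6)
print(mi)
```

The smoothed table still sums to one, so the marginals remain valid probability vectors.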
I have a computer vision algorithm I want to tune up using scipy.optimize.minimize. Right now I only want to tune up two parameters but the number of parameters might eventually grow so I would like to use a technique that can do high-dimensional gradient searches. The Nelder-Mead implementation in SciPy seemed like a good fit.
I got the code all set up but it seems that the minimize function really wants to use floating point values with a step size that is less than one. The current parameters are both integers: one has a step size of one and the other has a step size of two (i.e. the value must be odd; if it isn't, the thing I am trying to optimize will convert it to an odd number). Roughly, one parameter is a window size in pixels and the other parameter is a threshold (a value from 0-255).
For what it is worth I am using a fresh build of scipy from the git repo. Does anyone know how to tell scipy to use a specific step size for each parameter? Is there some way I can roll my own gradient function? Is there a scipy flag that could help me out? I am aware that this could be done with a simple parameter sweep, but I would eventually like to apply this code to much larger sets of parameters.
The code itself is dead simple:
import numpy as np
from scipy.optimize import minimize
from ScannerUtil import straightenImg
import bson
def doSingleIteration(parameters):
    # do some machine vision magic
    # return the difference between my value and the truth value
    ...
parameters = np.array([11,10])
res = minimize(doSingleIteration, parameters, method='Nelder-Mead', options={'xtol': 1e-2, 'disp': True, 'ftol': 1.0})  # not sure if these params do anything
print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
print(res)
This is what my output looks like. As you can see we are repeating a lot of runs and not getting anywhere in the minimization.
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.] <-- Output from scipy minimize
{'block_size': 11, 'degree': 10} <-- input to my algorithm rounded and made int
+++++++++++++++++++++++++++++++++++++++++
120 <-- output of the function I am trying to minimize
+++++++++++++++++++++++++++++++++++++++++
[ 11.55 10. ]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.5]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.55 9.5 ]
{'block_size': 11, 'degree': 9}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.1375 10.25 ]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.275 10. ]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.25]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
[ 11.275 9.75 ]
{'block_size': 11, 'degree': 9}
+++++++++++++++++++++++++++++++++++++++++
120
+++++++++++++++++++++++++++++++++++++++++
~~~
SNIP
~~~
+++++++++++++++++++++++++++++++++++++++++
[ 11. 10.0078125]
{'block_size': 11, 'degree': 10}
+++++++++++++++++++++++++++++++++++++++++
120
Optimization terminated successfully.
Current function value: 120.000000
Iterations: 7
Function evaluations: 27
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
status: 0
nfev: 27
success: True
fun: 120.0
x: array([ 11., 10.])
message: 'Optimization terminated successfully.'
nit: 7
Assuming that the function to minimize is arbitrarily complex (nonlinear), this is a very hard problem in general. It cannot be guaranteed to be solved optimally unless you try every possible option. I do not know whether there is any integer-constrained nonlinear optimizer (I somewhat doubt it), and I will assume you know that Nelder-Mead should work fine if it were a continuous function.
Edit: Considering the comment from @Dougal I will just add here: set up a coarse+fine grid search first; if you then feel like checking whether Nelder-Mead works (and converges faster), the points below may help...
But maybe some points that help:
Considering how difficult the whole integer constraint is, maybe it would be an option to do some simple interpolation to help the optimizer. It should still converge to an integer solution. Of course this requires calculating extra points, but it might solve many other problems. (Even in linear integer programming it's common to solve the unconstrained system first, AFAIK.)
Nelder-Mead starts with N+1 points, these are hard wired in scipy (at least older versions) to (1+0.05) * x0[j] (for j in all dimensions, unless x0[j] is 0), which you will see in your first evaluation steps. Maybe these can be supplied in newer versions, otherwise you could just change/copy the scipy code (it is pure python) and set it to something more reasonable. Or if you feel that is simpler, scale all input variables down so that (1+0.05)*x0 is of sensible size.
Maybe you should cache all function evaluations, since with Nelder-Mead I would guess you can always run into duplicate evaluations (at least at the end).
You have to check how likely Nelder-Mead will just shrink to a single value and give up, because it always finds the same result.
You generally must check if your function is well behaved at all... This optimization is doomed if the function does not change smoothly over the parameter space, and even then it can easily run into local minima if you have any of those. (Since you cached all evaluations, as suggested above, you could at least plot those and look at the error landscape without needing any extra evaluations.)
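The caching idea can be sketched as a small wrapper. All names here (make_cached, snap_to_grid, toy_objective) are hypothetical, and the grid (odd block size, integer degree) is an assumption based on the output shown in the question.

```python
def snap_to_grid(params):
    # Hypothetical grid: odd block size, integer degree.
    block_size = 2 * int(round((params[0] - 1) / 2)) + 1
    degree = int(round(params[1]))
    return block_size, degree

def make_cached(objective, snap):
    """Wrap objective so each snapped grid point is evaluated only once."""
    cache = {}

    def wrapped(params):
        key = snap(params)
        if key not in cache:
            cache[key] = objective(key)   # the expensive call
        return cache[key]

    return wrapped, cache

calls = []

def toy_objective(key):
    # Stand-in for the expensive vision pipeline; records each real call.
    calls.append(key)
    return (key[0] - 15) ** 2 + (key[1] - 7) ** 2

cached, cache = make_cached(toy_objective, snap_to_grid)
print(cached([11.0, 10.0]), cached([11.55, 10.2]), len(calls))
```

The cache dictionary also doubles as the record of evaluated points that can be plotted afterwards.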
Unfortunately, Scipy's built-in optimization tools don't easily allow for this. But never fear; it sounds like you have a convex problem, and so you should be able to find a unique optimum, even if it won't be mathematically pretty.
Two options that I've implemented for different problems are creating a custom gradient descent algorithm, and using bisection on a series of univariate problems. If you're doing cross-validation in your tuning, your loss function unfortunately won't be smooth (because of noise from cross-validation on different datasets), but will be generally convex.
To implement gradient descent numerically (without having an analytical method for evaluating the gradient), choose a test point and a second point that is delta away from your test point in all dimensions. Evaluating your loss function at these two points can allow you to numerically compute a local subgradient. It is important that delta be large enough that it steps outside of local minima created by cross-validation noise.
A slower but potentially more robust alternative is to implement bisection for each parameter you're testing. If you know that the problem is jointly convex in your two parameters (or n parameters), you can separate this into n univariate optimization problems, and write a bisection algorithm which recursively homes in on the optimal parameters. This can help handle some types of quasiconvexity (e.g. if your loss function takes a background noise value for part of its domain, and is convex in another region), but requires a good guess as to the bounds for the initial iteration.
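One way to sketch the per-parameter search on an integer grid is shown below. Note that it swaps in a ternary search for plain bisection, since only function values (no derivative signs) are available, and it assumes the loss is strictly unimodal along each axis. The function names, bounds and toy loss are all made up for illustration.

```python
def argmin_int(f, lo, hi):
    """Integer ternary search; assumes f is strictly unimodal on [lo, hi]."""
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if f(m1) < f(m2):
            hi = m2          # the minimum cannot lie to the right of m2
        else:
            lo = m1          # the minimum cannot lie to the left of m1
    return min(range(lo, hi + 1), key=f)

def coordinate_search(loss, start, bounds, sweeps=5):
    """Minimize loss over integer points, one parameter at a time."""
    point = list(start)
    for _ in range(sweeps):
        previous = list(point)
        for d, (lo, hi) in enumerate(bounds):
            def along_axis(v, d=d):
                trial = list(point)
                trial[d] = v
                return loss(trial)
            point[d] = argmin_int(along_axis, lo, hi)
        if point == previous:    # no parameter moved in a full sweep
            break
    return point

def toy_loss(params):
    # Stand-in for the real loss; convex with its minimum at (15, 7).
    return (params[0] - 15) ** 2 + (params[1] - 7) ** 2

best = coordinate_search(toy_loss, start=[11, 10], bounds=[(1, 51), (0, 255)])
print(best)
```

With a noisy loss the unimodality assumption can fail, in which case the search may settle on a nearby grid point rather than the true optimum.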
If you simply snap the requested x values to an integer grid without fixing xtol to map to that gridsize, you risk having the solver request two points within a grid cell, receiving the same output value, and concluding that it is at a minimum.
No easy answer, unfortunately.
Snap your floats x, y (a.k.a. winsize, threshold) to an integer grid inside your function, like this:
def func( x, y ):
    x = round( x )
    y = round( (y - 1) / 2 ) * 2 + 1  # 1 3 5 ...
    ...
Then Nelder-Mead will see function values only on the grid, and should give you near-integer x, y.
(If you'd care to post your code someplace, I'm looking for test cases for a Nelder-Mead
with restarts.)
The Nelder-Mead minimize method now lets you specify the initial simplex vertex points, so you should be able to set the simplex points far apart, and the simplex will then flop around and find the minimum and converge when the simplex size drops below 1.
https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
The problem is that the algorithm gets stuck trying to shrink its (N+1) simplex.
I'd highly recommend that anyone new to the concept learn more about the geometric shape of a simplex and work out how the input parameters relate to the points on the simplex. Once you have a grasp of that, then, as I.P. Freeley suggested, the problem can be solved by defining strong initial points for your simplex. Note that this is different from defining your x0 and goes into Nelder-Mead's dedicated options. Here is an example of a higher-dimensional (4-dimensional) problem. Also note that the initial simplex has to have N+1 points: 5 in this case and 3 in yours.
init_simplex = np.array([[1, .1, .3, .3], [.1, 1, .3, .3], [.1, .1, 5, .3],
                         [.1, .1, .3, 5], [1, 1, 5, 5]])
minimum = minimize(Optimize.simplex_objective, x0=np.array([.01, .01, .01, .01]),
                   method='Nelder-Mead',
                   options={'adaptive': True, 'xatol': 0.1, 'fatol': .00001,
                            'initial_simplex': init_simplex})
In this example the x0 gets ignored because initial_simplex is defined. Another useful option in high-dimensional problems is the 'adaptive' option, which takes the number of parameters into account when setting the algorithm's operational coefficients (i.e. α, γ, ρ and σ for reflection, expansion, contraction and shrink respectively). And if you haven't already, I'd also recommend familiarizing yourself with the steps of the algorithm.
As for why this problem happens: the method gets no good results from an expansion, so it keeps shrinking the simplex smaller and smaller, trying to find a better solution that may or may not exist.
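For the two-parameter case in the question, a sketch along the same lines could look like this. The objective is a hypothetical stand-in for the vision pipeline (it only sees snapped integer values), and the simplex vertices and tolerances are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def objective(params):
    # Hypothetical stand-in: snap to the grid, then score the grid point,
    # so the function is piecewise constant in params.
    block_size = 2 * int(round((params[0] - 1) / 2)) + 1   # odd integer
    degree = int(round(params[1]))
    return (block_size - 21) ** 2 + (degree - 4) ** 2

# N + 1 = 3 vertices for N = 2 parameters, spread several grid cells apart.
init_simplex = np.array([[11.0, 10.0],
                         [31.0, 10.0],
                         [11.0, 30.0]])

res = minimize(objective, x0=init_simplex[0], method='Nelder-Mead',
               options={'initial_simplex': init_simplex,
                        'xatol': 0.5, 'fatol': 0.5})
print(res.x, res.fun)
```

Because the vertices start far apart, the first reflections and contractions compare genuinely different grid points instead of several points inside one cell.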