LBFGS never converges in large dimensions in pytorch

LBFGS never converges in large dimensions in pytorch - python

I am playing with Rule 110 of Wolfram cellular automata. Given line of zeroes and ones, you can calculate next line with these rules:
Starting with 00000000....1 in the end you get this sequence:
Just of curiosity I decided to approximate these rules with a polynomial, so that cells could be not only 0 and 1, but also gray color in between:
def triangle(x,y,z,v0):
v=(y + y * y + y * y * y - 3. * (1. + x) * y * z + z * (1. + z + z * z)) / 3.
return (v-v0)*(v-v0)
so if x,y,z and v0 matches any of these rules from the table, it will return 0, and positive nonzero value otherwise.
Next I've added all possible groups of 4 neighbors into single sum, which will be zero for integer solutions:
def eval():
s = 0.
for i in range(W - 1):
for j in range(1, W + 1):
xx = x[i, (j - 1) % W]
yy = x[i, j % W]
zz = x[i, (j + 1) % W]
r = x[i + 1, j % W]
s += triangle(xx, yy, zz, r)
for j in range(W - 1): s += x[0, j] * x[0, j]
s += (1 - x[0, W - 1]) * (1 - x[0, W - 1])
return torch.sqrt(s)
Also in the bottom of this function I add ordinary conditions for first line, so that all elements are 0 except last one, which is 1. Finally I've decided to minimize this sum of squares on W*W matrix with pytorch:
x = Variable(torch.DoubleTensor(W,W).zero_(), requires_grad=True)
opt = torch.optim.LBFGS([x],lr=.1)
for i in range(15500):
def closure():
opt.zero_grad()
s=eval()
s.backward()
return s
opt.step(closure)
Here is full code, you can try it yourself. The problem is that for 10*10 it converges to correct solution in ~20 steps:
But if I take 15*15 board, it never finishes convergence:
The graph on the right shows how sum of squares is changing with each next iteration and you can see that it never reaches zero. My question is why this happens and how can I fix this. Tried different pytorch optimisers, but all of them perform worse then LBFGS. Tried different learning rates. Any ideas why this happen and how I can reach final point during optimisation?
UPD: improved convergence graph, log of SOS:
UPD2: I also tried doing same in C++ with dlib, and I don't have any convergency issues there, it goes much deeper in much less time:
I am using this code for optimisation in C++:
find_min_using_approximate_derivatives(bfgs_search_strategy(),
objective_delta_stop_strategy(1e-87),
s, x, -1)

What's you're trying to do here is non convex optimisation and this is a notoriously difficult problem. Once you think about it, it make sense because just about any practical mathematical problem can be formulated as an optimisation problem.
1. Prelude
So, before giving you hints as to where to find a solution to your particular problem, I want to illustrate why certain optimisation problems are easy to solve.
I'm going to start by discussing convex problems. These are easy to solve even in the constrained case, and the reason for this is that when you compute the gradient you actually get a lot of information of where the minimum cannot be (the Taylor expansion of a convex function, f, is always an underestimate of f), additionally there is only one minimum and no sadle points. If you're interested in learning more about convex optimisation I recommend seeing Stephen Boyd's class in convex optimisation on YouTube
Now then if non convex optimization is so difficult, how come we are able to solve it in deep learning? The answer is simply that the non convex function we are minimising in deep learning it's quite nice as demonstrated by Henaff et al.
It is therefore important that machine learning practitioners realise that the operation procedures used in deep learning will most likely not yield a good minimum, if they converge to a minimum in the first place, on other non-convex problems.
2. Answer to your question
Now then to answer your problem, You're not you probably not gonna find and fast solution as nonconvex optimisation is NP complete. But fear not, SciPy has a few global optimisation algorithms to choose from. Here is a link to another stack overflow thread with a good answer to your question.
3. Moral of the story
Finally, I want to remind you that convergence guarantees are important, forgetting it has led to an oil rig collapsing.
PS. Please forgive typos, I'm usong my phone for this
Update: As to why BFGS works with dlib, there might be two reasons, firstly, BFGS is better at using curvature information than L-BFGS, and secondly it uses a line search to find an optimal step size. I'd recommend checking if PyTorch allow line searches and if not, setting an decreasing step size (or just a really low one).

Related

Why is theta0 skipped while performing regulariztion on regression?

I am currently learning ML on coursera with the help of course on ML by Andrew Ng. I am performing the assignments in python because I am more used to it rather than Matlab. I have recently come to a problem regarding my understanding of the topic of Regularization. My understanding is that by doing regularization, one can add less important features which are important enough in prediction. But while implementing it, I don't understand why the 1st element of theta(parameters) i.e theta[0] is skipped while calculating the cost. I have referred other solutions but they also have done the same skipping w/o explanation.
Here is the code:
`
term1 = np.dot(-np.array(y).T,np.log(h(theta,X)))
term2 = np.dot((1-np.array(y)).T,np.log(1-h(theta,X)))
regterm = (lambda_/2) * np.sum(np.dot(theta[1:].T,theta[1:])) #Skip theta0. Explain this line
J=float( (1/m) * ( np.sum(term1 - term2) + regterm ) )
grad=np.dot((sigmoid(np.dot(X,theta))-y),X)/m
grad_reg=grad+((lambda_/m)*theta)
grad_reg[0]=grad[0]
`
And here is the formula:
Here J(theta) is cost function
h(x) is the sigmoid function or hypothesis.
lamnda is the regularization parameter.

Theta0 is referring to bias.
Bias comes in to picture when we want our decision boundaries to be separated properly. just consider an example of
Y1=w1 * X and then Y2= w2 * X
when the values of X comes close to zero, there could be a case when its a tough deal to separate them, here comes bias into the role.
Y1=w1 * X + b1 and Y2= w2 * X + b2
now, via learning, the decision boundaries will be clear all the time.
Let’s consider why we use regularization now.
So that we don’t over-fit, and smoothen the curve. As you can see the equation, its the slopes w1 and w2, that needs smoothening, bias are just the intercepts of segregation. So, there is no point of using them in regularization.
Although we can use it, in the case of neural networks it won’t make any difference. But we might face the issues of reducing bias value so much, that it might confuse data points. Thus, it's better to not use Bias in Regularization.
Hope it answers your question.
Originally published: https://medium.com/#shrutijadon10104776/why-we-dont-use-bias-in-regularization-5a86905dfcd6

2D N body simulation

I've followed the equations from the n-body problem found on Wikipedia and implemented a simple O(n²) n-body simulation. However, once I visualize the simulation, things don't behave as expected, namely, all the particles move away from the center as though they have high repulsive force. I thought at first I may have mistaken the direction of the force vectors, but I tried flipping it and it did pretty much the same thing.
data = np.random.rand(100, 2)
velocities = np.zeros_like(data)
masses = np.ones_like(data)
dt = 60 * 60 * 24
for _ in range(10000):
forces = np.zeros_like(data)
for i, node1 in enumerate(data):
for j, node2 in enumerate(data):
d = node2 - node1
# First term is gravitational constant, 1e-8 is a softening factor
forces[i] += 6.67384e-11 * d / (np.sqrt(d.dot(d) + 1e-8) ** 3)
velocities += forces * dt / masses
data += velocities * dt
yield data # for visualization
I also considered that it may just not work in 2D (although there is no reason it shouldn't at all, so I tried it in 3D as well by setting rand dimensions to (100, 3), but the behaviour was the same.
I've looked over other code available online, but I can't seem to find what I've done wrong (or differently from others), so any help would be appreciated.
EDIT 1
This actually appears to be consistent with the equations. I've worked out the first couple steps by hand for [-1, 1] and [1, 1] (ignoring G) and for p1, the forces are [0.25, 0.7, 81, 0, 0] respectively. However, since the velocity is so high from the third step, and that particle p2 does the opposite of p1, they move away really fast. However, other implementations easily found online don't face this issue. I can't seem to figure out why. I thought it may have been the initialization, but other implementations don't seem to suffer from this.

My dt was too large. Setting the dt to a smaller value e.g. 0.05 did it.

Alternating direction implicit method for finite difference solver of pde in Python

I am working on implementing the Alternating direction implicit method to solve FitzHugh–Nagumo reaction diffusion model. I have found a Python implementation example for it in a blog, but I think there is an error in the method - in the stencil presented here:
Shouldn't it be half time step size multiplying the reaction term f ?

Replacing the difference quotients by the differential quotients, one gets
U_t = D/2 * U_xx + D/2 * U_yy + Δt*f
in both instances, which is not the equation
U_t = D * (U_xx + U_yy) + f
that was the originally posed task.
So the coefficients should be 1/(Δt/2) as it was at U_t, D/(Δp^2) at U_pp, p=x,y and 1 for f.
It seems the formula is a mix-up of the one with difference quotients and the next stage where it gets multiplied by Δt/2.
And in that next formula one does not need new constants as indeed α_p=σ_p, p=x,y and then you are right that the factor of f should be Δt/2.

On ordinary differential equations (ODE) and optimization, in Python

I want to solve this kind of problem:
dy/dt = 0.01*y*(1-y), find t when y = 0.8 (0<t<3000)
I've tried the ode function in Python, but it can only calculate y when t is given.
So are there any simple ways to solve this problem in Python?
PS: This function is just a simple example. My real problem is so complex that can't be solve analytically. So I want to know how to solve it numerically. And I think this problem is more like an optimization problem:
Objective function y(t) = 0.8, Subject to dy/dt = 0.01*y*(1-y), and 0<t<3000
PPS: My real problem is:
objective function: F(t) = 0.85,
subject to: F(t) = sqrt(x(t)^2+y(t)^2+z(t)^2),
x''(t) = (1/F(t)-1)*250*x(t),
y''(t) = (1/F(t)-1)*250*y(t),
z''(t) = (1/F(t)-1)*250*z(t)-10,
x(0) = 0, y(0) = 0, z(0) = 0.7,
x'(0) = 0.1, y'(0) = 1.5, z'(0) = 0,
0<t<5

This differential equation can be solved analytically quite easily:
dy/dt = 0.01 * y * (1-y)
rearrange to gather y and t terms on opposite sides
100 dt = 1/(y * (1-y)) dy
The lhs integrates trivially to 100 * t, rhs is slightly more complicated. We can always write a product of two quotients as a sum of the two quotients * some constants:
1/(y * (1-y)) = A/y + B/(1-y)
The values for A and B can be worked out by putting the rhs on the same denominator and comparing constant and first order y terms on both sides. In this case it is simple, A=B=1. Thus we have to integrate
1/y + 1/(1-y) dy
The first term integrates to ln(y), the second term can be integrated with a change of variables u = 1-y to -ln(1-y). Our integrated equation therefor looks like:
100 * t + C = ln(y) - ln(1-y)
not forgetting the constant of integration (it is convenient to write it on the lhs here). We can combine the two logarithm terms:
100 * t + C = ln( y / (1-y) )
In order to solve t for an exact value of y, we first need to work out the value of C. We do this using the initial conditions. It is clear that if y starts at 1, dy/dt = 0 and the value of y never changes. Thus plug in the values for y and t at the beginning
100 * 0 + C = ln( y(0) / (1 - y(0) )
This will give a value for C (assuming y is not 0 or 1) and then use y=0.8 to get a value for t. Note that because of the logarithm and the factor 100 multiplying t y will reach 0.8 within a relatively short range of t values, unless the initial value of y is incredibly small. It is of course also straightforward to rearrange the equation above to express y in terms of t, then you can plot the function as well.
Edit: Numerical integration
For a more complexed ODE which cannot be solved analytically, you will have to try numerically. Initially we only know the value of the function at zero time y(0) (we have to know at least that in order to uniquely define the trajectory of the function), and how to evaluate the gradient. The idea of numerical integration is that we can use our knowledge of the gradient (which tells us how the function is changing) to work out what the value of the function will be in the vicinity of our starting point. The simplest way to do this is Euler integration:
y(dt) = y(0) + dy/dt * dt
Euler integration assumes that the gradient is constant between t=0 and t=dt. Once y(dt) is known, the gradient can be calculated there also and in turn used to calculate y(2 * dt) and so on, gradually building up the complete trajectory of the function. If you are looking for a particular target value, just wait until the trajectory goes past that value, then interpolate between the last two positions to get the precise t.
The problem with Euler integration (and with all other numerical integration methods) is that its results are only accurate when its assumptions are valid. Because the gradient is not constant between pairs of time points, a certain amount of error will arise for each integration step, which over time will build up until the answer is completely inaccurate. In order to improve the quality of the integration, it is necessary to use more sophisticated approximations to the gradient. Check out for example the Runge-Kutta methods, which are a family of integrators which remove progressive orders of error term at the cost of increased computation time. If your function is differentiable, knowing the second or even third derivatives can also be used to reduce the integration error.
Fortunately of course, somebody else has done the hard work here, and you don't have to worry too much about solving problems like numerical stability or have an in depth understanding of all the details (although understanding roughly what is going on helps a lot). Check out http://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.ode.html#scipy.integrate.ode for an example of an integrator class which you should be able to use straightaway. For instance
from scipy.integrate import ode
def deriv(t, y):
return 0.01 * y * (1 - y)
my_integrator = ode(deriv)
my_integrator.set_initial_value(0.5)
t = 0.1 # start with a small value of time
while t < 3000:
y = my_integrator.integrate(t)
if y > 0.8:
print "y(%f) = %f" % (t, y)
break
t += 0.1
This code will print out the first t value when y passes 0.8 (or nothing if it never reaches 0.8). If you want a more accurate value of t, keep the y of the previous t as well and interpolate between them.

As an addition to Krastanov`s answer:
Aside of PyDSTool there are other packages, like Pysundials and Assimulo which provide bindings to the solver IDA from Sundials. This solver has root finding capabilites.

Use scipy.integrate.odeint to handle your integration, and analyse the results afterward.
import numpy as np
from scipy.integrate import odeint
ts = np.arange(0,3000,1) # time series - start, stop, step
def rhs(y,t):
return 0.01*y*(1-y)
y0 = np.array([1]) # initial value
ys = odeint(rhs,y0,ts)
Then analyse the numpy array ys to find your answer (dimensions of array ts matches ys). (This may not work first time because I am constructing from memory).
This might involve using the scipy interpolate function for the ys array, such that you get a result at time t.
EDIT: I see that you wish to solve a spring in 3D. This should be fine with the above method; Odeint on the scipy website has examples for systems such as coupled springs that can be solved for, and these could be extended.

What you are asking for is a ODE integrator with root finding capabilities. They exist and the low-level code for such integrators is supplied with scipy, but they have not yet been wrapped in python bindings.
For more information see this mailing list post that provides a few alternatives: http://mail.scipy.org/pipermail/scipy-user/2010-March/024890.html
You can use the following example implementation which uses backtracking (hence it is not optimal as it is a bolt-on addition to an integrator that does not have root finding on its own): https://github.com/scipy/scipy/pull/4904/files

Multiple linear regression in python without fitting the origin?

I found this chunk of code on http://rosettacode.org/wiki/Multiple_regression#Python, which does a multiple linear regression in python. Print b in the following code gives you the coefficients of x1, ..., xN. However, this code is fitting the line through the origin (i.e. the resulting model does not include a constant).
All I'd like to do is the exact same thing except I do not want to fit the line through the origin, I need the constant in my resulting model.
Any idea if it's a small modification to do this? I've searched and found numerous documents on multiple regressions in python, except they are lengthy and overly complicated for what I need. This code works perfect, except I just need a model that fits through the intercept not the origin.
import numpy as np
from numpy.random import random
n=100
k=10
y = np.mat(random((1,n)))
X = np.mat(random((k,n)))
b = y * X.T * np.linalg.inv(X*X.T)
print(b)
Any help would be appreciated. Thanks.

you only need to add a row to X that is all 1.

Maybe a more stable approach would be to use a least squares algorithm anyway. This can also be done in numpy in a few lines. Read the documentation about numpy.linalg.lstsq.
Here you can find an example implementation:
http://glowingpython.blogspot.de/2012/03/linear-regression-with-numpy.html

What you have written out, b = y * X.T * np.linalg.inv(X * X.T), is the solution to the normal equations, which gives the least squares fit with a multi-linear model. swang's response is correct (and EMS's elaboration)---you need to add a row of 1's to X. If you want some idea of why it works theoretically, keep in mind that you are finding b_i such that
y_j = sum_i b_i x_{ij}.
By adding a row of 1's, you are are setting x_{(k+1)j} = 1 for all j, which means that you are finding b_i such that:
y_j = (sum_i b_i x_{ij}) + b_{k+1}
because the k+1st x_ij term is always equal to one. Thus, b_{k+1} is your intercept term.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.