Wishing to do a harder problem, I tried to attack it from a simpler point of view with the following toy minimization problem, which does not have a trivial solution:

minimize x^2 subject to exp(x) + x - a <= 0

In order to do that, I need an augmented Lagrangian / dual function (1) with its gradient (2), and the equilibrium point (3):

g(mu) = min_x L(x, mu)                                      (1)
dg/dmu = exp(x*) + x* - a, evaluated at the minimizer x*    (2)
dL/dx = 0, which defines x*(mu)                             (3)

The Lagrangian version of the previous problem:

L(x, mu) = x^2 + mu * (exp(x) + x - a)

The point of a Lagrange multiplier is to optimize over mu in [0, inf) in order to take the constraint of our problem into account.
Running the following code:

import numpy as np
import scipy.optimize as spo
import matplotlib.pyplot as plt

a = 1
nbtests = 5; minmu = 0; maxmu = 5

def dual(mu):
    # equilibrium point x*(mu): solve dL/dx = 0 for the current mu
    x = spo.fsolve(lambda x: 2*x - mu * (np.exp(x) + 1), 1)
    # negated dual value and gradient (negated for maximization, see below)
    return (-(x**2 - mu * (np.exp(x) + x - a)), -(np.exp(x) + x - a))

pl = np.empty((nbtests, 2))
for i, nu in enumerate(np.linspace(minmu, maxmu, pl.shape[0])):
    res = spo.fmin_l_bfgs_b(dual, nu, bounds=[(0, None)], factr=1e6)
    print(nu, res[0], res[2]['task'])
    pl[i] = [nu, spo.fsolve(lambda x: 2*x - res[0] * (np.exp(x) + 1), 1)[0]]

plt.plot(pl[:, 0], pl[:, 1], label="L-BFGS-B")
plt.axhline(y=spo.fsolve(lambda x: np.exp(x) + x - a, 1)[0], label="Target", color="red")
plt.legend()
plt.show()
Here, I am applying the L-BFGS-B optimizer in Python (the fastest choice, since we have access to the gradient) to the dual problem, then recovering the primal solution with fsolve. Note the minus signs inside the function and gradient: the minimization of the primal problem corresponds to the maximization of the dual problem, and since fmin_l_bfgs_b minimizes, both are negated to actually maximize the equations.
I plot here the result of the optimizer with respect to the initial guess of mu, and it shows that the result is highly dependent on the initialization; it is not robust at all. When changing the parameter a to other values, the convergence of the method is even worse.
The plot of the optimizer result with respect to mu (for a larger nbtests),
and the output of the program:
0.0 [0.] b'ABNORMAL_TERMINATION_IN_LNSRCH'
0.5555555555555556 [0.55555556] b'ABNORMAL_TERMINATION_IN_LNSRCH'
1.1111111111111112 [3.52870269] b'ABNORMAL_TERMINATION_IN_LNSRCH'
1.6666666666666667 [3.52474085] b'ABNORMAL_TERMINATION_IN_LNSRCH'
2.2222222222222223 [3.5243099] b'ABNORMAL_TERMINATION_IN_LNSRCH'
2.7777777777777777 [3.49601967] b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
3.3333333333333335 [3.52020875] b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
3.8888888888888893 [3.88888889] b'ABNORMAL_TERMINATION_IN_LNSRCH'
4.444444444444445 [4.44444354] b'ABNORMAL_TERMINATION_IN_LNSRCH'
5.0 [5.] b'ABNORMAL_TERMINATION_IN_LNSRCH'
The first column is the initial guess and the second one is the estimate after the optimizer, and we see that it does not optimize at all for most of the values.
For a <= 0, the domain of the function is x < 0, which gives a trivial minimization x^2 = 0 for mu = 0. So all the solutions should give 0 at the return of the optimizer.
The error b'ABNORMAL_TERMINATION_IN_LNSRCH' usually comes from a bad gradient, but here, is it not the real gradient of the function?
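One way to sanity-check this is scipy.optimize.check_grad, which compares an analytic gradient with a finite-difference estimate (a minimal sketch reusing the dual function defined above):

from scipy.optimize import check_grad

# norm of the difference between the analytic gradient and a
# finite-difference approximation; it should be tiny if they agree
err = check_grad(lambda mu: float(dual(mu)[0]),
                 lambda mu: dual(mu)[1],
                 np.array([1.0]))
print(err)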
What did I miss?
There are multiple wrong signs in your code, which, by the way, also violates the DRY principle. It should be x**2 + mu * (np.exp(x) + x - a) instead of x**2 - mu * (np.exp(x) + x - a), and similarly for the derivative. IMHO, something like
import numpy as np
from scipy.optimize import fsolve, fmin_l_bfgs_b

a = 1
nbtests = 5
minmu = 0
maxmu = 5

def lagrange(x, mu):
    return x**2 + mu * (np.exp(x) + x - a)

def lagrange_grad(x, mu):
    grad_x = 2*x + mu * (np.exp(x) + 1)
    grad_mu = np.exp(x) + x - a
    return grad_x, grad_mu

def dual(mu):
    # equilibrium point: solve dL/dx = 0 for the current mu
    x = fsolve(lambda x: lagrange_grad(x, mu)[0], x0=1)
    obj_val = lagrange(x, mu)
    grad = lagrange_grad(x, mu)[1]
    # negate value and gradient: maximizing the dual with a minimizer
    return -1.0*obj_val, -1.0*grad

pl = np.empty((nbtests, 2))
for i, nu in enumerate(np.linspace(minmu, maxmu, nbtests)):
    res = fmin_l_bfgs_b(dual, x0=nu, bounds=[(0, None)], factr=1e6)
    mu_opt = res[0]
    x_opt = fsolve(lambda x: lagrange_grad(x, mu_opt)[0], x0=1)
    pl[i] = [nu, *x_opt]
is much cleaner. This gives me
array([[0.  , 0.  ],
       [1.25, 0.  ],
       [2.5 , 0.  ],
       [3.75, 0.  ],
       [5.  , 0.  ]])
as desired.
I have to do logistic regression using batch gradient descent.
import numpy as np

X = np.asarray([
    [0.50],[0.75],[1.00],[1.25],[1.50],[1.75],[1.75],
    [2.00],[2.25],[2.50],[2.75],[3.00],[3.25],[3.50],
    [4.00],[4.25],[4.50],[4.75],[5.00],[5.50]])
y = np.asarray([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1])
m = len(X)

def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))

def gradient_Descent(theta, alpha, X, y):
    for i in range(0, m):
        cost = ((-y) * np.log(sigmoid(X[i]))) - ((1 - y) * np.log(1 - sigmoid(X[i])))
        grad = theta - alpha * (1.0/m) * (np.dot(cost, X[i]))
        theta = theta - alpha * grad

gradient_Descent(0.1, 0.005, X, y)
This is the way I have to do it, but I can't seem to understand how to make it work.
It looks like you have some stuff mixed up in here. It's critical when doing this that you keep track of the shapes of your vectors and make sure you're getting sensible results. For example, you are calculating cost with:
cost = ((-y) * np.log(sigmoid(X[i]))) - ((1 - y) * np.log(1 - sigmoid(X[i])))
In your case, y is a vector with 20 items and X[i] is a single value. This makes your cost calculation a 20-item vector, which doesn't make sense. Your cost should be a single value. (You're also calculating this cost a bunch of times for no reason in your gradient descent function.)
Also, if you want this to be able to fit your data, you need to add a bias term to X. So let's start there.
X = np.asarray([
    [0.50],[0.75],[1.00],[1.25],[1.50],[1.75],[1.75],
    [2.00],[2.25],[2.50],[2.75],[3.00],[3.25],[3.50],
    [4.00],[4.25],[4.50],[4.75],[5.00],[5.50]])

ones = np.ones(X.shape)
X = np.hstack([ones, X])
# X.shape is now (20, 2)
Theta will now need 2 values for each X. So initialize that and Y:
Y = np.array([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]).reshape([-1, 1])
# reshape Y so it's column vector so matrix multiplication is easier
Theta = np.array([[0], [0]])
Your sigmoid function is good. Let's also make a vectorized cost function:
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))

def cost(x, y, theta):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    cost = (np.matmul(-y.T, np.log(h)) - np.matmul((1 - y.T), np.log(1 - h))) / m
    return cost
The cost function works because Theta has a shape of (2, 1) and X has a shape of (20, 2), so matmul(X, Theta) will be shaped (20, 1). We then matrix-multiply by the transpose of Y (y.T has shape (1, 20)), which results in a single value: our cost for a particular value of Theta.
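Before training, a quick sanity check of the shapes and the starting cost (with Theta at zero, every prediction is 0.5, so the starting cost should be log(2), about 0.693):

print(np.matmul(X, Theta).shape)  # (20, 1): one linear score per sample
print(cost(X, Y, Theta))          # [[ 0.69314718]], i.e. log(2) at Theta = 0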
We can then write a function that performs a single step of batch gradient descent:
def gradient_Descent(theta, alpha, x, y):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    grad = np.matmul(x.T, (h - y)) / m  # use the x argument, not the global X
    theta = theta - alpha * grad
    return theta
Notice np.matmul(x.T, (h - y)) is multiplying shapes (2, 20) and (20, 1), which results in a shape of (2, 1): the same shape as Theta, which is what you want from your gradient. This allows you to multiply it by your learning rate and subtract it from the initial Theta, which is what gradient descent is supposed to do.
So now you just write a loop for a number of iterations and update Theta until it looks like it converges:
n_iterations = 500
learning_rate = 0.5

for i in range(n_iterations):
    Theta = gradient_Descent(Theta, learning_rate, X, Y)
    if i % 50 == 0:
        print(cost(X, Y, Theta))
This will print the cost every 50 iterations resulting in a steadily decreasing cost, which is what you hope for:
[[ 0.6410409]]
[[ 0.44766253]]
[[ 0.41593581]]
[[ 0.40697167]]
[[ 0.40377785]]
[[ 0.4024982]]
[[ 0.40195]]
[[ 0.40170533]]
[[ 0.40159325]]
[[ 0.40154101]]
You can try different initial values of Theta and you will see it always converges to the same thing.
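For example, restarting from a different (arbitrary) initial Theta should land at essentially the same cost:

Theta = np.array([[5.], [-5.]])  # an arbitrary alternative starting point
for i in range(n_iterations):
    Theta = gradient_Descent(Theta, learning_rate, X, Y)
print(cost(X, Y, Theta))  # should again be close to 0.4015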
Now you can use your newly found values of Theta to make predictions:
h = sigmoid(np.matmul(X, Theta))
print((h > .5).astype(int) )
This prints what you would expect for a linear fit to your data:
[[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]]
I am calculating the theta for my AI like so:
theta = opt.fmin_cg(cost, initial_theta, gradient, (newX, y))
Which works great and gives me this output:
Optimization terminated successfully.
Current function value: 0.684355
Iterations: 6
Function evaluations: 15
Gradient evaluations: 15
When I print theta, I get this:
[ 0. -0.28132729 0.158859 ]
I now want to plot this on my scatter graph as a line; my expected output looks like this:
But when I try to perform this on my graph with the algorithm:
weights * features = weight0 + weight1 * feature1 + weight2 * feature2
Like so:
x_axis = np.array([min(newX[:, 1]), max(newX[:, 1])])
y_axis = x_axis * theta[1:]
ax.plot(x_axis, y_axis, linewidth=2)
plt.show()
The output looks like this:
What should y_axis = x_axis * theta[1:] be to match the algorithm?
Update:
newX derives from my training data frame and is created like this:
newX = np.zeros(shape=(x.shape[0], x.shape[1] + 1))
newX[:, 1:] = x.values
It now looks like this; the concept is that column 0 is the free weight:
[[0. 8. 2.]
[0. 0. 0.]
[0. 0. 0.]
...
[0. 0. 0.]
[0. 0. 0.]
[0. 1. 1.]]
IIUC, you are trying to plot the decision boundary for your logistic regression. This is not simply a y = mx + b problem; well, it is, but you first need to determine where your decision boundary is, typically at a probability of 0.5. I assume the model you are going with looks like h(x) = g(theta_0*x_0 + theta_1*x_1 + theta_2*x_2), where g(x) = 1 / (1 + e^-x) and x_1 and x_2 are the features you are plotting, i.e. your y and x axes (I don't know which is y and which is x, since I don't know your data). So for probability 0.5, you want to solve h(x) = 0.5, i.e. theta_0*x_0 + theta_1*x_1 + theta_2*x_2 = 0.
So what you want to plot is the line 0 = theta_0*x_0 + theta_1*x_1 + theta_2*x_2. Let's just say you have x_1 on your x axis and x_2 on your y axis. (x_0 is just 1, corresponding to theta_0, your intercept.)
So you'll need to pick (somewhat arbitrarily) the x_1 values that will give you a good illustration of the boundary line. Min/max of your dataset works, which you've done. Then solve for x_2 given the formula above. You can define a function here: lambda x_1: -(theta[0] + theta[1] * x_1) / theta[2] (note the leading minus sign, which comes from moving the x_2 term to the other side of the equation). I'm assuming your theta variable corresponds to [intercept, coeff for x_1, coeff for x_2]. So you'll end up with something like:
import numpy as np
import matplotlib.pyplot as plt

theta = [0., -0.28132729, 0.158859]
x = np.array([0, 10])
f = lambda x_1: -(theta[0] + theta[1] * x_1) / theta[2]
y = f(x)
plt.plot(x, y)
plt.show()
I have the same problem as in this question, but I want to add not just one but several constraints to the optimization problem.
So, e.g., I want to maximize x1 + 5 * x2 with the constraints that the sum of x1 and x2 is smaller than 5 and x2 is smaller than 3 (needless to say, the actual problem is far more complicated and cannot simply be thrown into scipy.optimize.minimize like this one; it just serves to illustrate the problem...).
I can do an ugly hack like this:
from scipy.optimize import differential_evolution
import numpy as np

def simple_test(x, more_constraints):
    # check whether all constraints evaluate to True
    if all(map(eval, more_constraints)):
        return -1 * (x[0] + 5 * x[1])
    # if not all constraints evaluate to True, return a positive number
    return 10

bounds = [(0., 5.), (0., 5.)]
additional_constraints = ['x[0] + x[1] <= 5.', 'x[1] <= 3']

result = differential_evolution(simple_test, bounds, args=(additional_constraints,), tol=1e-6)
print(result.x, result.fun, sum(result.x))
This will print
[ 1.99999986 3. ] -16.9999998396 4.99999985882
as one would expect.
Is there a better, more straightforward way to add several constraints than using the rather 'dangerous' eval?
An example is something like this:

additional_constraints = [lambda x: x[0] + x[1] <= 5., lambda x: x[1] <= 3]

def simple_test(x, more_constraints):
    # check whether all constraints evaluate to True
    if all(constraint(x) for constraint in more_constraints):
        return -1 * (x[0] + 5 * x[1])
    # if not all constraints evaluate to True, return a positive number
    return 10
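With that change, the call stays the same as in the question, just without eval (a sketch):

from scipy.optimize import differential_evolution

bounds = [(0., 5.), (0., 5.)]
result = differential_evolution(simple_test, bounds,
                                args=(additional_constraints,), tol=1e-6)
print(result.x, result.fun, sum(result.x))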
There is a proper way to solve the problem described in the question, i.e. to enforce multiple nonlinear constraints with scipy.optimize.differential_evolution: the scipy.optimize.NonlinearConstraint class.
Below I give a non-trivial example of optimizing the classic Rosenbrock function inside a region defined by the intersection of two circles.
import numpy as np
from scipy import optimize

# Rosenbrock function
def fun(x):
    return 100*(x[1] - x[0]**2)**2 + (1 - x[0])**2

# Function defining the nonlinear constraints:
# 1) x^2 + (y - 3)^2 < 4
# 2) (x - 1)^2 + (y + 1)^2 < 13
def constr_fun(x):
    r1 = x[0]**2 + (x[1] - 3)**2
    r2 = (x[0] - 1)**2 + (x[1] + 1)**2
    return r1, r2

# No lower limit on constr_fun
lb = [-np.inf, -np.inf]

# Upper limit on constr_fun
ub = [4, 13]

# Bounds are irrelevant for this problem, but are needed
# for differential_evolution to compute the starting points
bounds = [[-2.2, 1.5], [-0.5, 2.2]]

nlc = optimize.NonlinearConstraint(constr_fun, lb, ub)
sol = optimize.differential_evolution(fun, bounds, constraints=nlc)

# Accurate solution by Mathematica
true = [1.174907377273171, 1.381484428610871]
print(f"nfev = {sol.nfev}")
print(f"x = {sol.x}")
print(f"err = {sol.x - true}\n")
This prints the following with default parameters:
nfev = 636
x = [1.17490808 1.38148613]
err = [7.06260962e-07 1.70116282e-06]
Here is a visualization of the function (contours) and the feasible region defined by the nonlinear constraints (shading inside the green line). The constrained global minimum is indicated by the yellow dot, while the magenta one shows the unconstrained global minimum.
This constrained problem has an obvious local minimum at (x, y) ~ (-1.2, 1.4) on the boundary of the feasible region which will make local optimizers fail to converge to the global minimum for many starting locations. However, differential_evolution consistently finds the global minimum as expected.
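To illustrate that last point, a quick check (a sketch; the exact outcome can depend on the scipy version and the starting point): a local SLSQP solve started near the local minimum can get trapped there, unlike differential_evolution:

# local optimizer started near the local minimum on the boundary;
# it may converge there instead of to the constrained global minimum
sol_local = optimize.minimize(fun, x0=[-1.2, 1.4], method='SLSQP',
                              bounds=bounds, constraints=nlc)
print(sol_local.x)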
I'm trying to maximize a sum of two monotonically increasing functions f(x) = f1(x) + f2(x) within the given bounds, say x = 0 to 6. The curves of the two functions are:
To solve this I'm using basinhopping function from scipy package.
I would like to specify a constraint in addition to the bounds. Specifically, I want the sum of the variables to be less than or equal to a constant value, i.e. in my implementation below I want x[0] + x[1] <= C where C = 6.
In the above figure, for C = 6, approximately x[0] = 2 and x[1] = 4 (2 + 4 <= 6) will yield the maximum value. My question is how to specify this constraint. If it's not possible, is there another optimization function better suited to this problem?
from scipy.optimize import basinhopping
from math import tanh

def f(x):
    return -(f1(x[0]) + f2(x[1]))  # -ve sign for maximization

def f1(x):
    return tanh(x)

def f2(x):
    return 10 ** (0.2 * x)

# Starting point
x0 = [1., 1.]

# Bounds
xmin = [1., 1.]
xmax = [6., 6.]

# rewrite the bounds in the way required by L-BFGS-B
bounds = [(low, high) for low, high in zip(xmin, xmax)]
minimizer_kwargs = dict(method="L-BFGS-B", bounds=bounds)

result = basinhopping(f, x0, minimizer_kwargs=minimizer_kwargs, niter=200)
print("global max: x = [%.4f, %.4f], f(x0) = %.4f" % (result.x[0], result.x[1], result.fun))
This is possible with COBYLA constrained optimization, though I don't know how to do it with L-BFGS-B. You can add constraints like the following:
def constraint1(x):
    return 6 - x[0]

def constraint2(x):
    return 6 - x[0] - x[1]

def constraint3(x):
    return x[0] - 1
To add these constraints to the minimizer, use:
c1={"type":"ineq","fun":constraint1}
c2={"type":"ineq","fun":constraint2}
c3={"type":"ineq","fun":constraint3}
minimizer_kwargs = dict(method="COBYLA",constraints=(c1,c2,c3))
When I run this example, I get the result global max: x = [1.0000, 5.0000], f(x0) = -10.7616
Note, I haven't tested whether you need to add a constraint4 that x[1]>1.
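Putting the pieces together with the code from the question (a sketch; since COBYLA does not use a bounds argument, the bounds on x[0] are expressed through constraint1 and constraint3):

from scipy.optimize import basinhopping

minimizer_kwargs = dict(method="COBYLA", constraints=(c1, c2, c3))
result = basinhopping(f, [1., 1.], minimizer_kwargs=minimizer_kwargs, niter=200)
print("global max: x = [%.4f, %.4f], f(x0) = %.4f"
      % (result.x[0], result.x[1], result.fun))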
For more reading, see the documentation for cobyla:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_cobyla.html
I have the following code to minimize the Cost Function with its gradient.
import scipy.optimize
from numpy import shape, zeros, random

def trainLinearReg(X, y, lamda):
    # theta = zeros((shape(X)[1], 1))
    theta = random.rand(shape(X)[1], 1)  # random initialization of theta
    result = scipy.optimize.fmin_cg(computeCost, fprime=computeGradient, x0=theta,
                                    args=(X, y, lamda), maxiter=200, disp=True,
                                    full_output=True)
    return result[1], result[0]
But I am having this warning:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 8403387632289934651424768.000000
Iterations: 0
Function evaluations: 15
Gradient evaluations: 3
My computeCost and computeGradient are defined as
def computeCost(theta, X, y, lamda):
    theta = theta.reshape(shape(X)[1], 1)
    m = shape(y)[0]
    J = 0
    grad = zeros(shape(theta))
    h = X.dot(theta)
    squaredErrors = (h - y).T.dot(h - y)
    # theta[0] = 0.0
    J = (1.0 / (2 * m)) * squaredErrors + (lamda / (2 * m)) * theta.T.dot(theta)
    return J[0]

def computeGradient(theta, X, y, lamda):
    theta = theta.reshape(shape(X)[1], 1)
    m = shape(y)[0]
    J = 0
    grad = zeros(shape(theta))
    h = X.dot(theta)
    squaredErrors = (h - y).T.dot(h - y)
    # theta[0] = 0.0
    J = (1.0 / (2 * m)) * squaredErrors + (lamda / (2 * m)) * theta.T.dot(theta)
    grad = (1.0 / m) * X.T.dot(h - y) + (lamda / m) * theta
    return grad.flatten()
I have reviewed these similar questions:
scipy.optimize.fmin_bfgs: "Desired error not necessarily achieved due to precision loss"
scipy.optimize.fmin_cg: "Desired error not necessarily achieved due to precision loss."
scipy is not optimizing and returns "Desired error not necessarily achieved due to precision loss"
But I still cannot find the solution to my problem. How can I make the minimization converge instead of getting stuck at the start?
ANSWER:
I solved this problem based on @lejlot's comments below.
He is right. The data set X was too large because I did not assign the normalized values back to the correct variable. Even though this is a small mistake, it shows where to look when encountering such problems: a cost function value that is far too large suggests something is wrong with the data set.
The previous wrong one:
X_poly = polyFeatures(X, p)
X_norm, mu, sigma = featureNormalize(X_poly)
X_poly = c_[ones((m, 1)), X_poly]
The correct one:
X_poly = polyFeatures(X, p)
X_poly, mu, sigma = featureNormalize(X_poly)
X_poly = c_[ones((m, 1)), X_poly]
where X_poly is actually used in the subsequent training as
cost, theta = trainLinearReg(X_poly, y, lamda)
For my implementation, scipy.optimize.fmin_cg also failed with the above-mentioned error for some initial guesses. Then I changed to the BFGS method, and it converged:
scipy.optimize.minimize(fun, x0, args=(), method='BFGS', jac=None, tol=None, callback=None, options={'disp': False, 'gtol': 1e-05, 'eps': 1.4901161193847656e-08, 'return_all': False, 'maxiter': None, 'norm': inf})
It seems this error in CG is still unavoidable, as CG can end up with a non-descent direction.
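Applied to the functions from the question, the switch is just (a sketch, keeping the question's computeCost/computeGradient and reusing its arguments; theta is flattened because minimize expects a 1-D x0):

import scipy.optimize

result = scipy.optimize.minimize(computeCost, x0=theta.flatten(), jac=computeGradient,
                                 args=(X, y, lamda), method='BFGS',
                                 options={'maxiter': 200, 'disp': True})
cost_val, theta_opt = result.fun, result.x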
I too faced this problem, and even after searching a lot for solutions, nothing helped, as the solutions were not clearly defined.
Then I read the documentation for scipy.optimize.fmin_cg, where it is clearly mentioned that the parameter x0 must be a 1-D array.
My approach was the same as yours, in that I passed a 2-D matrix as x0, and I always got some precision error or a divide-by-zero error and the same warning you got.
Then I changed my approach: I passed theta as a 1-D array and converted that array into a 2-D matrix inside the computeCost and computeGradient functions, which worked for me, and I got the results as expected.
My solution for logistic regression:
import numpy as np
import scipy.optimize as opt

# X and Y are pandas DataFrames here; m and features were not defined in
# the original snippet, so they are defined here to make the example
# self-contained: the number of training examples and of feature columns.
m = X.shape[0]
features = X.shape[1]

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

theta = np.zeros(features)

def computeCost(theta, X, Y):
    x = np.matrix(X.values)
    y = np.matrix(Y.values)
    theta = np.matrix(theta)
    xtheta = np.matmul(x, theta.T)
    hx = sigmoid(xtheta)
    cost = np.multiply(y, np.log(hx)) + np.multiply(1 - y, np.log(1 - hx))
    return -np.sum(cost) / m

def computeGradient(theta, X, Y):
    x = np.matrix(X.values)
    y = np.matrix(Y.values)
    theta = np.matrix(theta)
    grad = np.zeros(features)
    xtheta = np.matmul(x, theta.T)
    hx = sigmoid(xtheta)
    error = hx - y  # use the matrix y, not the DataFrame Y
    for i in range(0, features, 1):
        term = np.multiply(error, x[:, i])
        grad[i] = np.sum(term) / m
    return grad

result = opt.fmin_tnc(func=computeCost, x0=theta, fprime=computeGradient, args=(X, Y))
print(computeCost(result[0], X, Y))
Note again that theta has to be a 1-D array.
So, in your code, modify theta in trainLinearReg to theta = random.randn(features).
I faced this problem today.
I then noticed that my cost function was implemented the wrong way and was producing errors on a huge scale, which was what scipy was complaining about. Hope this helps someone like me.
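If the initial cost comes out absurdly large, scaling the features is the first thing to check. A minimal sketch of a featureNormalize along the lines of the accepted fix above (assuming X is a plain NumPy feature matrix; the name and return signature mirror that answer):

import numpy as np

def featureNormalize(X):
    # standardize each feature column to zero mean and unit variance
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma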