Estimating logistic regression using BFGS optimization algorithm - python

I tried to use scipy.optimize.minimum to estimate parameters in logistic regression. Before this, I wrote log likelihood function and gradient of log likelihood function. I then used Nelder-Mead and BFGS algorithm, respectively. Turned out the latter one failed but the former one succeeded. Because BFGS algorithm also uses the gradient, I doubt the gradient part (function log_likelihood_gradient(x, y)) suffers from bugs. But it's also likely due to any numerical issue associated with Nelder-Mead since I got the warning Desired error not necessarily achieved due to precision loss.
I'm using UCLA's tutorial and dataset (see link) and the correct estimates are,
FYI, I also highly suggest you reading this post to understand the problem setup and preprocess on the dataset.
Now is my code, the first part is data preparation. You can simply run it to get yourself a copy,
import numpy as np
from scipy.optimize import minimize
from patsy import dmatrices
from urllib.request import urlretrieve
url = 'http://www.ats.ucla.edu/stat/data/binary.csv'
urlretrieve(url, './admit_rate.csv')
data = pd.read_csv('./admit_rate.csv')
y, x = dmatrices('admit ~ gre + gpa + C(rank)', data, return_type = 'dataframe')
y = np.array(y, dtype=np.float64)
x = np.array(x, dtype=np.float64)
The second part is the MLE part, log_likelihood defines the sum of log-likelihood function, and log_likelihood_gradient defines the gradient of the sum of log-likelihood function. Both come from the formula as below,
def sigmoid(x):
"""
Logistic function
"""
return 1.0 / (1 + np.exp(-x))
def sigmoid_derivative(x):
"""
Logistic function's first derivative
"""
return sigmoid(x) * (1 - sigmoid(x))
def log_likelihood(x, y):
"""
The sum of log likelihood functions
"""
n1, n2 = x.shape
def log_likelihood_p(p):
"""
p: betas/parameters in logistic regression, to be estimated
"""
p = p.reshape(n2, 1)
sig = sigmoid(np.dot(x, p)).reshape(n1, 1)
return np.sum(y * np.log(sig) + (1 - y) * np.log(1 - sig))
return log_likelihood_p
def log_likelihood_gradient(x, y):
"""
The gradient of the sum of log likelihood functions used in optimization
"""
n1, n2 = x.shape
def log_likelihood_gradient_p(p):
"""
p: betas/parameters in logistic regression, to be estimated
"""
p = p.reshape(n2, 1)
xp = np.dot(x, p)
sig = sigmoid(xp)
sig_der = sigmoid_derivative(xp)
return np.sum(y * sig_der / sig * x - (1 - y) * sig_der / (1 - sig) * x, axis=0)
return log_likelihood_gradient_p
def negate(f):
return lambda *args, **kwargs: -f(*args, **kwargs)
The third part is the optimization part, I started with inital value
x0 = np.array([-1, -0.5, -1, -1.5, 0, 1])
for Nelder-Mead, it works and outputs
array([ -3.99005691e+00, -6.75402757e-01, -1.34021420e+00, -1.55145450e+00, 2.26447746e-03, 8.04046227e-01].
But then when I tried Nelder-Mead algorithm, it failed at
x0 = np.array([-1, -0.5, -1, -1.5, 0, 1]).
Even when I gave even closer initial value like the output from Nelder-Mead algorithm, it failed with warning Desired error not necessarily achieved due to precision loss.
Nelder-Mead
x0 = np.array([-1, -0.5, -1, -1.5, 0, 1])
estimator1 = minimize(negate(log_likelihood(x, y)), x0,
method='nelder-mead', options={'disp': True})
print(estimator1.success)
estimator1.x
BFGS
x0 = np.array([ -3.99005691e+00, -6.75402757e-01, -1.34021420e+00,
-1.55145450e+00, 2.26447746e-03, 8.04046227e-01])
estimator2 = minimize(negate(log_likelihood(x, y)), x0, method='BFGS',
jac=log_likelihood_gradient(x, y), options={'disp': True})
print(estimator2.success)
I know my post is a bit long, but I'd glad to answer for any questions you throw to me and can anyone help?

Related

How to fit this data in python and scipy?

I have some function which behaves as shown below i.e. some tapered/decaying oscillations
I want to fit the data using scipy's curve_fit. I have previously asked a question related to fitting functions with scipy, which was well answered here, and highlighted the importance of the initial guess for the values of the fitting parameters.
However, I am struggling to fit this data in a way which captures both the oscillations and the decay. My approach is as follows:
from scipy.optimize import curve_fit
import numpy as np
def Fit(x,y):
#Define the function fit
func = ansatz
#Define the initial guess of parameters
mag = (y.max() + y.min()) / 2
y_shifted = y - mag
omega_guess = np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (x.max() - x.min())
lam = np.log(2) / 1e7 #Rough guess based on approximate half life
p0 = (mag,mag, omega_guess,mag,lam)
#Do the fit
popt, pcov = curve_fit(func, x,y,p0=p0)
# return
return func(x, *popt)
def ansatz(x,A,B,omega,offset,lam):
osc = A*np.sin(omega*x) + B*np.cos(omega*x)
linear = offset
decay = np.exp(-x*lam)
return decay*osc + linear
data = np.load('example.npy')
x = data[:,0]
y = data[:,1]
yFit = Fit(x,y)
This approach captures the decay, but not the oscillations. What is erroneous with my approach? Guesses for fit parameters? Function ansatz? Code implementation?

Finding alpha and beta of beta-binomial distribution with scipy.optimize and loglikelihood

A distribution is beta-binomial if p, the probability of success, in a binomial distribution has a beta distribution with shape parameters α > 0 and β > 0. The shape parameters define the probability of success.
I want to find the values for α and β that best describe my data from the perspective of a beta-binomial distribution. My dataset players consist of data about the number of hits (H), the number of at-bats (AB) and the conversion (H / AB) of a lot of baseball players. I estimate the PDF with the help of the answer of JulienD in Beta Binomial Function in Python
from scipy.special import beta
from scipy.misc import comb
pdf = comb(n, k) * beta(k + a, n - k + b) / beta(a, b)
Next, I write a loglikelihood function that we will minimize.
def loglike_betabinom(params, *args):
"""
Negative log likelihood function for betabinomial distribution
:param params: list for parameters to be fitted.
:param args: 2-element array containing the sample data.
:return: negative log-likelihood to be minimized.
"""
a, b = params[0], params[1]
k = args[0] # the conversion rate
n = args[1] # the number of at-bats (AE)
pdf = comb(n, k) * beta(k + a, n - k + b) / beta(a, b)
return -1 * np.log(pdf).sum()
Now, I want to write a function that minimizes loglike_betabinom
from scipy.optimize import minimize
init_params = [1, 10]
res = minimize(loglike_betabinom, x0=init_params,
args=(players['H'] / players['AB'], players['AB']),
bounds=bounds,
method='L-BFGS-B',
options={'disp': True, 'maxiter': 250})
print(res.x)
The result is [-6.04544138 2.03984464], which implies that α is negative which is not possible. I based my script on the following R-snippet. They get [101.359, 287.318]..
ll <- function(alpha, beta) {
x <- career_filtered$H
total <- career_filtered$AB
-sum(VGAM::dbetabinom.ab(x, total, alpha, beta, log=True))
}
m <- mle(ll, start = list(alpha = 1, beta = 10),
method = "L-BFGS-B", lower = c(0.0001, 0.1))
ab <- coef(m)
Can someone tell me what I am doing wrong? Help is much appreciated!!
One thing to pay attention to is that comb(n, k) in your log-likelihood might not be well-behaved numerically for the values of n and k in your dataset. You can verify this by applying comb to your data and see if infs appear.
One way to amend things could be to rewrite the negative log-likelihood as suggested in https://stackoverflow.com/a/32355701/4240413, i.e. as a function of logarithms of Gamma functions as in
from scipy.special import gammaln
import numpy as np
def loglike_betabinom(params, *args):
a, b = params[0], params[1]
k = args[0] # the OVERALL conversions
n = args[1] # the number of at-bats (AE)
logpdf = gammaln(n+1) + gammaln(k+a) + gammaln(n-k+b) + gammaln(a+b) - \
(gammaln(k+1) + gammaln(n-k+1) + gammaln(a) + gammaln(b) + gammaln(n+a+b))
return -np.sum(logpdf)
You can then minimize the log-likelihood with
from scipy.optimize import minimize
init_params = [1, 10]
# note that I am putting 'H' in the args
res = minimize(loglike_betabinom, x0=init_params,
args=(players['H'], players['AB']),
method='L-BFGS-B', options={'disp': True, 'maxiter': 250})
print(res)
and that should give reasonable results.
You could check How to properly fit a beta distribution in python? for inspiration if you want to rework further your code.

Cost Function and Gradient Seem to be Working, but scipy.optimize functions are not

I'm working through my Matlab code for the Andrew NG Coursera course and turning it into python. I am working on non-regularized logistic regression and after writing my gradient and cost functions I needed something similar to fminunc and after some googling, I found a couple options. They are both returning the same results, but they do not match what is in Andrew NG's expected results code. Others seem to be getting this to work correctly, but I'm wondering why my specific code does not seem to return the desired result when using scipy.optimize functions, but does for the cost and gradient pieces earlier in the code.
The data I'm using can be found at the link below;
ex2data1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as op
#Machine Learning Online Class - Exercise 2: Logistic Regression
#Load Data
#The first two columns contains the exam scores and the third column contains the label.
data = pd.read_csv('ex2data1.txt', header = None)
X = np.array(data.iloc[:, 0:2]) #100 x 3
y = np.array(data.iloc[:,2]) #100 x 1
y.shape = (len(y), 1)
#Creating sub-dataframes for plotting
pos_plot = data[data[2] == 1]
neg_plot = data[data[2] == 0]
#==================== Part 1: Plotting ====================
#We start the exercise by first plotting the data to understand the
#the problem we are working with.
print('Plotting data with + indicating (y = 1) examples and o indicating (y = 0) examples.')
plt.plot(pos_plot[0], pos_plot[1], "+", label = "Admitted")
plt.plot(neg_plot[0], neg_plot[1], "o", label = "Not Admitted")
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()
def sigmoid(z):
'''
SIGMOID Compute sigmoid function
g = SIGMOID(z) computes the sigmoid of z.
Instructions: Compute the sigmoid of each value of z (z can be a matrix,
vector or scalar).
'''
g = 1 / (1 + np.exp(-z))
return g
def costFunction(theta, X, y):
'''
COSTFUNCTION Compute cost and gradient for logistic regression
J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
parameter for logistic regression and the gradient of the cost
w.r.t. to the parameters.
'''
m = len(y) #number of training examples
h = sigmoid(X.dot(theta)) #logisitic regression hypothesis
J = (1/m) * np.sum((-y*np.log(h)) - ((1-y)*np.log(1-h)))
#h is 100x1, y is %100x1, these end up as 2 vector we subtract from each other
#then we sum the values by rows
#cost function for logisitic regression
return J
def gradient(theta, X, y):
m = len(y)
grad = np.zeros((theta.shape))
h = sigmoid(X.dot(theta))
for i in range(len(theta)): #number of rows in theta
XT = X[:,i]
XT.shape = (len(X),1)
grad[i] = (1/m) * np.sum((h-y)*XT) #updating each row of the gradient
return grad
#============ Part 2: Compute Cost and Gradient ============
#In this part of the exercise, you will implement the cost and gradient
#for logistic regression. You neeed to complete the code in costFunction.m
#Add intercept term to x and X_test
Bias = np.ones((len(X), 1))
X = np.column_stack((Bias, X))
#Initialize fitting parameters
initial_theta = np.zeros((len(X[0]), 1))
#Compute and display initial cost and gradient
(cost, grad) = costFunction(initial_theta, X, y), gradient(initial_theta, X, y)
print('Cost at initial theta (zeros): %f' % cost)
print('Expected cost (approx): 0.693\n')
print('Gradient at initial theta (zeros):')
print(grad)
print('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628')
#Compute and display cost and gradient with non-zero theta
test_theta = np.array([[-24], [0.2], [0.2]]);
(cost, grad) = costFunction(test_theta, X, y), gradient(test_theta, X, y)
print('\nCost at test theta: %f' % cost)
print('Expected cost (approx): 0.218\n')
print('Gradient at test theta:')
print(grad)
print('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n')
result = op.fmin_tnc(func = costFunction, x0 = initial_theta, fprime = gradient, args = (X,y))
result[1]
Result = op.minimize(fun = costFunction,
x0 = initial_theta,
args = (X, y),
method = 'TNC',
jac = gradient, options={'gtol': 1e-3, 'disp': True, 'maxiter': 1000})
theta = Result.x
theta
test = np.array([[1, 45, 85]])
prob = sigmoid(test.dot(theta))
print('For a student with scores 45 and 85, we predict an admission probability of %f,' % prob)
print('Expected value: 0.775 +/- 0.002\n')
This was a very difficult problem to debug, and illustrates a poorly documented aspect of the scipy.optimize interface. The documentation vaguely indicates that theta will be passed around as a vector:
Minimization of scalar function of one or more variables.
In general, the optimization problems are of the form:
minimize f(x) subject to
g_i(x) >= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where x is a vector of one or more variables.
What's important is that they really mean vector in the most primitive sense, a 1-dimensional array. So you have to expect that whenever theta is passed into one of your callbacks, it will be passed in as a 1-d array. But in numpy, 1-d arrays sometimes behave differently from 2-d row arrays (and, obviously, from 2-d column arrays).
I don't know exactly why it's causing a problem in your case, but it's easily fixed regardless. You just have to add the following at the top of both your cost function and your gradient function:
theta = theta.reshape(-1, 1)
This guarantees that theta will be a 2-d column array, as expected. Once you've done this, the results are correct.
I have had similar issues with Scipy dealing with the same problem as you. As senderle points out the interface is not the easiest to deal with, especially combined with the numpy array interface... Here is my implementation which works as expected.
Defining the cost and gradient functions
Note that initial_theta is passed as a simple array of shape (3,) and converted to a column vector of shape (3,1) within the function. The gradient function then returns the grad.ravel() which has shape (3,) again. This is important as doing otherwise caused an error message with various optimization methods in Scipy.optimize.
Note that different methods have different behaviours but returning .ravel() seems to fix most issues...
import pandas as pd
import numpy as np
import scipy.optimize as opt
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def CostFunc(theta,X,y):
#Initializing variables
m = len(y)
J = 0
grad = np.zeros(theta.shape)
#Vectorized computations
z = X # theta
h = sigmoid(z)
J = (1/m) * ( (-y.T # np.log(h)) - (1 - y).T # np.log(1-h));
return J
def Gradient(theta,X,y):
#Initializing variables
m = len(y)
theta = theta[:,np.newaxis]
grad = np.zeros(theta.shape)
#Vectorized computations
z = X # theta
h = sigmoid(z)
grad = (1/m)*(X.T # ( h - y));
return grad.ravel() #<-- This is the trick
Initializing variables and parameters
Note that initial_theta.shape returns (3,)
X = data1.iloc[:,0:2].values
m,n = X.shape
X = np.concatenate((np.ones(m)[:,np.newaxis],X),1)
y = data1.iloc[:,-1].values[:,np.newaxis]
initial_theta = np.zeros((n+1))
Calling Scipy.optimize
model = opt.minimize(fun = CostFunc, x0 = initial_theta, args = (X, y), method = 'TNC', jac = Gradient)
Any comments from more knowledgeable people are welcome, this Scipy interface is a mystery to me, thanks

fmin_cg: Desired error not necessarily achieved due to precision loss

I have the following code to minimize the Cost Function with its gradient.
def trainLinearReg( X, y, lamda ):
# theta = zeros( shape(X)[1], 1 )
theta = random.rand( shape(X)[1], 1 ) # random initialization of theta
result = scipy.optimize.fmin_cg( computeCost, fprime = computeGradient, x0 = theta,
args = (X, y, lamda), maxiter = 200, disp = True, full_output = True )
return result[1], result[0]
But I am having this warning:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 8403387632289934651424768.000000
Iterations: 0
Function evaluations: 15
Gradient evaluations: 3
My computeCost and computeGradient are defined as
def computeCost( theta, X, y, lamda ):
theta = theta.reshape( shape(X)[1], 1 )
m = shape(y)[0]
J = 0
grad = zeros( shape(theta) )
h = X.dot(theta)
squaredErrors = (h - y).T.dot(h - y)
# theta[0] = 0.0
J = (1.0 / (2 * m)) * (squaredErrors) + (lamda / (2 * m)) * (theta.T.dot(theta))
return J[0]
def computeGradient( theta, X, y, lamda ):
theta = theta.reshape( shape(X)[1], 1 )
m = shape(y)[0]
J = 0
grad = zeros( shape(theta) )
h = X.dot(theta)
squaredErrors = (h - y).T.dot(h - y)
# theta[0] = 0.0
J = (1.0 / (2 * m)) * (squaredErrors) + (lamda / (2 * m)) * (theta.T.dot(theta))
grad = (1.0 / m) * (X.T.dot(h - y)) + (lamda / m) * theta
return grad.flatten()
I have reviewed these similar questions:
scipy.optimize.fmin_bfgs: “Desired error not necessarily achieved due to precision loss”
scipy.optimize.fmin_cg: "'Desired error not necessarily achieved due to precision loss.'
scipy is not optimizing and returns "Desired error not necessarily achieved due to precision loss"
But still cannot have the solution to my problem. How to let the minimization function process converge instead of being stuck at first?
ANSWER:
I solve this problem based on #lejlot 's comments below.
He is right. The data set X is to large since I did not properly return the correct normalized value to the correct variable. Even though this is a small mistake, it indeed can give you the thought where should we look at when encountering such problems. The Cost Function value is too large leads to the possibility that there are some wrong with my data set.
The previous wrong one:
X_poly = polyFeatures(X, p)
X_norm, mu, sigma = featureNormalize(X_poly)
X_poly = c_[ones((m, 1)), X_poly]
The correct one:
X_poly = polyFeatures(X, p)
X_poly, mu, sigma = featureNormalize(X_poly)
X_poly = c_[ones((m, 1)), X_poly]
where X_poly is actually used in the following traing as
cost, theta = trainLinearReg(X_poly, y, lamda)
ANSWER:
I solve this problem based on #lejlot 's comments below.
He is right. The data set X is to large since I did not properly return the correct normalized value to the correct variable. Even though this is a small mistake, it indeed can give you the thought where should we look at when encountering such problems. The Cost Function value is too large leads to the possibility that there are some wrong with my data set.
The previous wrong one:
X_poly = polyFeatures(X, p)
X_norm, mu, sigma = featureNormalize(X_poly)
X_poly = c_[ones((m, 1)), X_poly]
The correct one:
X_poly = polyFeatures(X, p)
X_poly, mu, sigma = featureNormalize(X_poly)
X_poly = c_[ones((m, 1)), X_poly]
where X_poly is actually used in the following traing as
cost, theta = trainLinearReg(X_poly, y, lamda)
For my implementation scipy.optimize.fmin_cg also failed with the above-mentioned error in some initial guesses. Then I changed it to the BFGS method and converged.
scipy.optimize.minimize(fun, x0, args=(), method='BFGS', jac=None, tol=None, callback=None, options={'disp': False, 'gtol': 1e-05, 'eps': 1.4901161193847656e-08, 'return_all': False, 'maxiter': None, 'norm': inf})
seems that this error in cg is inevitable still as,
CG ends up with a non-descent direction
I too faced this problem and even after searching a lot for solutions nothing happened as the solutions were not clearly defined.
Then I read the documentation from scipy.optimize.fmin_cg where it is clearly mentioned that parameter x0 must be a 1-D array.
My approach was same as that of you wherein I passed 2-D matrix as x0 and I always got some precision error or divide by zero error and same warning as you got.
Then I changed my approach and passed theta as a 1-D array and converted that array into 2-D matrix inside the computeCost and computeGradient function which worked for me and I got the results as expected.
My solutiion for Logistic Regression
def sigmoid(z):
return 1 / (1 + np.exp(-z))
theta = np.zeros(features)
def computeCost(theta,X, Y):
x = np.matrix(X.values)
y = np.matrix(Y.values)
theta = np.matrix(theta)
xtheta = np.matmul(x,theta.T)
hx = sigmoid(xtheta)
cost = (np.multiply(y,np.log(hx)))+(np.multiply((1-y),np.log(1-hx)))
return -(np.sum(cost))/m
def computeGradient(theta, X, Y):
x = np.matrix(X.values)
y = np.matrix(Y.values)
theta = np.matrix(theta)
grad = np.zeros(features)
xtheta = np.matmul(x,theta.T)
hx = sigmoid(xtheta)
error = hx-Y
for i in range(0,features,1):
term = np.multiply(error,x[:,i])
grad[i] = (np.sum(term))/m
return grad
import scipy.optimize as opt
result = opt.fmin_tnc(func=computeCost, x0=theta, fprime=computeGradient, args=(X, Y))
print cost(result[0],X, Y)
Note Again that theta has to be a 1-D array
So in your code modify theta in trainLinearReg to theta = random.randn(features)
I today faced this problem.
I then noticed that my cost function was implemented wrong way and was producing high scaled errors due to which scipy was asking for more data. Hope this helps for someone like me.

scipy.optimize.minimize() in place of gradient desc

Given a set of points we'd like to find a straight line fitting the data best possible. I have implemented a version where the minimization of the cost function is done via gradient descent, and now I'd like to use the algorithm from scipy (scipy.optimize.minimize).
I tried this:
def gradDescVect(X, y, theta, alpha, iters):
m = shape(X)[0]
grad = copy(theta)
for c in range(0, iters):
error_sum = hypo(X, grad) - y
error_sum = X.T.dot(error_sum)
grad -= (alpha/m)*error_sum
return grad
def computeCostScipy(theta, X, y):
"""Compute cost, vectorized version"""
m = len(y)
term = hypo(X, theta) - y
print((term.T.dot(term) / (2 * m))[0, 0])
return (term.T.dot(term) / (2 * m))[0, 0]
def findMinTheta( theta, X, y):
result = scipy.optimize.minimize( computeCostScipy, theta, args=(X, y), method='BFGS', options={"maxiter":5000, "disp":True} )
return result.x, result.fun
This actually works pretty fine and gives results very close results to the original grad desc version.
The only problem seems that fminunc() stops execution before reaching minimum value of costFunction().
its sure that costFunction() works fine, and no. of iteration of minimize() is set greater than max iters it takes.
Output:
Optimization terminated successfully.
Current function value: 15.024985
Iterations: 2
Function evaluations: 16
Gradient evaluations: 4
Cost : 15.0249848024
Thteta : [ 0.15232531 0.93072285]
while the correct results are :
Cost : 4.48338825659
Theta : [[-3.63029144]
[ 1.16636235]]
See how close the results are:

Categories