I am trying to write a gradient descent function in python as part of a multivariate linear regression exercise. It runs, but does not compute the correct answer. My code is below. I've been trying for weeks to finish this problem but have made zero progress.
I believe that I understand the concept of gradient descent to optimize a multivariate linear regression function and also that the 'math' is correct. I believe that the error is in my code, but I am still learning python. Your help is very much appreciated.
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    from math import sqrt
    converged = False
    weights = np.array(initial_weights)
    while not converged:
        predictions = np.dot(feature_matrix, weights)
        errors = predictions - output
        gradient_sum_squares = 0
        for i in range(len(weights)):
            derivative = -2 * np.dot(errors[i], feature_matrix[i])
            gradient_sum_squares = gradient_sum_squares + np.dot(derivative, derivative)
            weights[i] = weights[i] - step_size * derivative[i]
        gradient_magnitude = sqrt(gradient_sum_squares)
        print gradient_magnitude
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
Feature matrix is:
sales = gl.SFrame.read_csv('kc_house_data.csv',column_type_hints = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int,'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int,'view':int})
I'm calling the function as:
train_data,test_data = sales.random_split(.8,seed=0)
simple_features = ['sqft_living']
my_output= 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights,step_size,tolerance)
Note: get_numpy_data is just a function that converts everything into numpy arrays, and it works as intended.
Update: I fixed the formula to:
derivative = 2 * np.dot(errors,feature_matrix)
and it seems to have worked. The derivation of this formula in my online course used
-2 * np.dot(errors,feature_matrix)
and I'm not sure why this formula did not provide the correct answer.
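For reference, here is a minimal sketch (on made-up toy data) of how the sign interacts with the update rule: the course's -2 formula is the negative of the RSS derivative, so it has to be added to the weights rather than subtracted; either sign works as long as the update matches it.

import numpy as np

# Hypothetical toy data, purely to illustrate the sign convention.
feature_matrix = np.array([[1., 1.], [1., 2.], [1., 3.]])
output = np.array([1., 2., 3.])
weights = np.array([0., 0.])
step_size = 1e-2

errors = np.dot(feature_matrix, weights) - output   # predictions - output, as above

# Convention A: the RSS derivative has a positive sign, so the update subtracts it.
derivative = 2 * np.dot(errors, feature_matrix)
weights_a = weights - step_size * derivative

# Convention B: the course's formula is the negative derivative, so the update adds it.
neg_derivative = -2 * np.dot(errors, feature_matrix)
weights_b = weights + step_size * neg_derivative

assert np.allclose(weights_a, weights_b)   # the step is identical either way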
The step size seems too small, and the tolerance unusually big. Perhaps you meant to use them the other way around?
In general, the step size is found by trial and error: the "natural" step size α = 1 might lead to divergence, so one lowers the value (e.g. α = 1/2, α = 1/4, and so on) until convergence is achieved; see the sketch below. Don't start with a very small step size.
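A minimal sketch of that trial-and-error loop, on made-up toy data (the data and the diverges helper are invented purely for illustration):

import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.], [1., 4.]])
y = np.array([2., 4., 6., 8.])

def diverges(alpha, iterations=100):
    # Run a few plain gradient-descent steps and see whether the cost grew.
    w = np.zeros(X.shape[1])
    first_cost = np.mean((X.dot(w) - y) ** 2)
    for _ in range(iterations):
        errors = X.dot(w) - y
        w = w - alpha * (2.0 / len(y)) * X.T.dot(errors)
    return np.mean((X.dot(w) - y) ** 2) > first_cost

alpha = 1.0
while diverges(alpha):
    alpha = alpha / 2.0   # try 1/2, 1/4, 1/8, ... until the cost stops blowing up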
I'm trying to implement gradient descent in python and my loss/cost keeps increasing with every iteration.
I've seen a few people post about this, and saw an answer here: gradient descent using python and numpy
I believe my implementation is similar, but I can't see what I'm doing wrong to get an exploding cost value:
Iteration: 1 | Cost: 697361.660000
Iteration: 2 | Cost: 42325117406694536.000000
Iteration: 3 | Cost: 2582619233752172973298548736.000000
Iteration: 4 | Cost: 157587870187822131053636619678439702528.000000
Iteration: 5 | Cost: 9615794890267613993157742129590663647488278265856.000000
I'm testing this on a dataset I found online (LA Heart Data): http://www.umass.edu/statdata/statdata/stat-corr.html
Import code:
dataset = np.genfromtxt('heart.csv', delimiter=",")
x = dataset[:]
x = np.insert(x,0,1,axis=1) # Add 1's for bias
y = dataset[:,6]
y = np.reshape(y, (y.shape[0],1))
Gradient descent:
def gradientDescent(weights, X, Y, iterations = 1000, alpha = 0.01):
    theta = weights
    m = Y.shape[0]
    cost_history = []
    for i in xrange(iterations):
        residuals, cost = calculateCost(theta, X, Y)
        gradient = (float(1)/m) * np.dot(residuals.T, X).T
        theta = theta - (alpha * gradient)
        # Store the cost for this iteration
        cost_history.append(cost)
        print "Iteration: %d | Cost: %f" % (i+1, cost)
Calculate cost:
def calculateCost(weights, X, Y):
    m = Y.shape[0]
    residuals = h(weights, X) - Y
    squared_error = np.dot(residuals.T, residuals)
    return residuals, float(1)/(2*m) * squared_error
Calculate hypothesis:
def h(weights, X):
    return np.dot(X, weights)
To actually run it:
gradientDescent(np.ones((x.shape[1],1)), x, y, 5)
Assuming that your derivation of the gradient is correct: you were using =- where you should be using -=. Instead of updating theta, you were reassigning it to -(alpha * gradient).
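A tiny standalone illustration of the difference (not from the original post):

theta = 5
theta =- 2   # parsed as theta = -2: the old value is discarded, theta is now -2
theta = 5
theta -= 2   # in-place subtraction: theta is now 3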
EDIT (after the above issue was fixed in the code):
I ran the code on what I believe is the right dataset and was able to get the cost to behave by setting alpha = 1e-7. If you run it for 1e6 iterations you should see it converging. This approach, on this dataset, appears very sensitive to the learning rate.
In general, if your cost is increasing, the very first thing to check is whether your learning rate is too large. In such cases the rate makes the cost function jump over the optimal value and grow toward infinity. Try different small values of the learning rate. When I face the problem you describe, I usually keep trying 1/10 of the learning rate until I find a rate for which J(w) decreases.
Another problem might be a bug in your derivative implementation. A good way to debug is to do a gradient check: compare the analytic gradient against a numerical approximation, as in the sketch below.
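For example, a rough numerical gradient check on made-up toy data (the cost here is the same squared-error form as above, written independently for the sketch):

import numpy as np

X = np.array([[1., 2.], [1., 3.], [1., 5.]])
y = np.array([[1.], [2.], [4.]])
theta = np.array([[0.5], [0.5]])
m = y.shape[0]

def cost(t):
    r = X.dot(t) - y
    return float(r.T.dot(r)) / (2 * m)

# Analytic gradient of the squared-error cost.
analytic = (1.0 / m) * X.T.dot(X.dot(theta) - y)

# Central-difference approximation, one parameter at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost(theta + step) - cost(theta - step)) / (2 * eps)

# The two should agree to several decimal places if the gradient code is right.
assert np.allclose(analytic, numeric, atol=1e-5)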
I was implementing logistic regression in python. To find theta, I was struggling to decide which algorithm always guarantees the global optimum without having to worry about the initial parameter theta.
import numpy as np
import scipy.optimize as op
def Sigmoid(z):
    return 1/(1 + np.exp(-z));

def Gradient(theta,x,y):
    m , n = x.shape
    theta = theta.reshape((n,1));
    y = y.reshape((m,1))
    sigmoid_x_theta = Sigmoid(x.dot(theta));
    grad = ((x.T).dot(sigmoid_x_theta-y))/m;
    return grad.flatten();

def CostFunc(theta,x,y):
    m,n = x.shape;
    theta = theta.reshape((n,1));
    y = y.reshape((m,1));
    term1 = np.log(Sigmoid(x.dot(theta)));
    term2 = np.log(1-Sigmoid(x.dot(theta)));
    term1 = term1.reshape((m,1))
    term2 = term2.reshape((m,1))
    term = y * term1 + (1 - y) * term2;
    J = -((np.sum(term))/m);
    return J;
data = np.loadtxt('ex2data1.txt',delimiter=',');
# m training samples and n attributes
m , n = data.shape
X = data[:,0:n-1]
y = data[:,n-1:]
X = np.concatenate((np.ones((m,1)), X),axis = 1)
initial_theta = np.zeros((n,1))
m , n = X.shape;
Result = op.minimize(fun = CostFunc,
                     x0 = initial_theta,
                     args = (X,y),
                     method = 'TNC',
                     jac = Gradient);
theta = Result.x;
where content of ex2data1.txt is:
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
The code above gives theta = Result.x as [-25.87282405 0.21193078 0.20722013]. This is the global minimum when initial_theta = np.zeros((n,1)), but with initial_theta = np.ones((n,1)) it gives an error. So the result depends on the initial value of theta. Can this be automated in some way so that the issue is avoided?
I also tried the 'BFGS' method instead of 'TNC' in the minimize call, as shown below, and then I get a RuntimeWarning.
initial_theta = np.zeros((n,1))
result = op.minimize(fun = CostFunc,
                     x0 = initial_theta,
                     args = (X,y),
                     method = 'BFGS',
                     jac = Gradient);
optimal_theta = result.x
I have called the function above several times with different initial values of initial_theta, and I found that BFGS converges to a local minimum most of the time. When I called BFGS with
initial_theta = np.array([-25, 0.2, 0.2])
which is near the global minimum, it converged. So it seems that TNC is better than BFGS, because with the same initial_theta in both cases, TNC converges to the global minimum while BFGS converges to a local one. So:
Is this true in all cases, or does it depend on the particular problem?
Which is better, BFGS or TNC?
Is there any difference between scipy.optimize.fmin_bfgs and scipy.optimize.minimize with method = 'BFGS', or are they the same?
Any help or insight would be appreciated. Thank you.
There is no practical algorithm that is guaranteed to find a global optimum. However, there are heuristics such as DIRECT (see e.g. here) that work very well in practice within given bounds. These can be used to find a good initialization for an algorithm that then finds the local optimum in the vicinity of that initialization and works more efficiently; a rough sketch of that two-stage idea follows.
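For illustration only (scipy's differential_evolution stands in for DIRECT here, and the objective f is a made-up toy function, not your cost):

import numpy as np
import scipy.optimize as op

def f(w):
    # Toy multimodal objective, invented for the sketch.
    return np.sum((w - np.array([3.0, -2.0])) ** 2) + np.sin(5 * w).sum()

bounds = [(-10, 10), (-10, 10)]
coarse = op.differential_evolution(f, bounds, maxiter=50, seed=0)   # bounded global heuristic
fine = op.minimize(f, x0=coarse.x, method='BFGS')                   # polish locally from there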
However, logistic regression is a convex optimization problem. That means the objective (error) function has only one minimum, i.e. any local minimum is the global minimum. Hence you can use any local optimizer (gradient descent, L-BFGS, conjugate gradient, ...). The only catch is that the minimum cannot be computed in closed form because of the nonlinear logistic function. The related problem without that logistic function is linear regression; in that case the global minimum of the error function can be computed directly, without any iterative optimization.
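For example, a minimal sketch of that closed-form solution on made-up toy data:

import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.], [1., 4.]])
y = np.array([1.1, 1.9, 3.2, 3.9])

# Least squares has an exact solution; no gradient descent needed.
w, residuals, rank, sv = np.linalg.lstsq(X, y)          # minimizes ||Xw - y||^2
w_normal = np.linalg.solve(X.T.dot(X), X.T.dot(y))      # same thing via the normal equations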
A comparison of optimizers for logistic regression can be found in Fabian Pedregosa's blog. My first guess would be that you have an error in your gradient computation. Maybe you should compare it to a numerical approximation of the gradient with scipy.optimize.check_grad.
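A minimal sketch of such a check, reusing the CostFunc, Gradient, X and y already defined in your question (so this assumes that code has been run):

import numpy as np
import scipy.optimize as op

# check_grad returns the norm of the difference between the analytic gradient
# and a finite-difference approximation; a tiny value means they agree.
theta0 = np.zeros(X.shape[1])
err = op.check_grad(CostFunc, Gradient, theta0, X, y)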
scipy.optimize.minimize with method = 'BFGS' and scipy.optimize.fmin_bfgs call the same underlying BFGS implementation; minimize is just the newer, unified interface.
This isn't possible with an efficient, general algorithm. You'll never really know what the cost function looked like on the inputs you didn't try. Perhaps there was some miracle trench running through a high plateau you ignored. Perhaps the cost function starts with if arg1 == secret: return -1e100. Who can say? If you really, absolutely need a global minimum, you either need to take advantage of extra knowledge about the cost function, or you need to try each and every single possible input.
Hi,
Trying to learn pymc3 (I never learned pymc2, so I'm jumping straight into the new stuff), and I suspect there is a very simple example/pseudocode for what I'm trying to do. Wondering if someone can help me out, as I've not made much progress over the past few hours...
My problem is to sample from a posterior in a rather straightforward manner. Let "x" be a vector, "t(x)" be a function (R^n --> R^n map) of that vector, and "D" be some observed data. I want to sample vectors x from
P( x | D ) \propto P( D | x ) P(x)
Usual Bayesian stuff. An example of how to do this using NUTS would be spectacular! My main problem seems to be getting the function t(x) to work appropriately, and have the model return samples from the posterior (rather than the prior).
Any and all help/hints appreciated. In the mean time I'll continue to try stuff out.
Best,
TJ
Your notation is a little confusing to me, but if I understand correctly, you want to sample from the likelihood (some function of the parameters and data) times the prior. And I agree - that's typical Bayesian stuff.
I think Bayesian logistic regression is a good example since we can't solve it analytically. Let's say our model is the following:
B ~ Normal(0, sigma2 * I)
p(y_i | B) = p_i ^ {y_i} (1 - p_i) ^{1 - y_i}
Where y_i is observed and p_i = 1 / (1 + exp(-z_i)) and
z_i = B_0 + B_1 * x_i
We'll assume sigma2 is known. After we load data into numpy arrays x and y, we can sample from the posterior with the following:
import pymc3 as pm
import theano.tensor as t

with pm.Model() as model:
    # Priors
    b0 = pm.Normal("b0", mu=0, tau=1e-6)
    b1 = pm.Normal("b1", mu=0, tau=1e-6)
    # Likelihood
    yhat = pm.Bernoulli("yhat", 1 / (1 + t.exp(-(b0 + b1*x))), observed=y)
    # Sample from the posterior
    trace = pm.sample(10000, pm.NUTS(), progressbar=False)
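Once sampling finishes, the posterior draws can be inspected, for example (assuming the block above has run):

b0_samples = trace['b0']      # posterior samples for the intercept
b1_samples = trace['b1']      # posterior samples for the slope
pm.traceplot(trace)           # quick visual check of the chains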
To see a full example, check out this iPython notebook:
http://nbviewer.ipython.org/gist/jbencook/9295751c917941208349
pymc3 also has a nice glm syntax. You can see how that works here:
http://jbencook.github.io/portfolio/bayesian_logistic_regression.html