I am trying to build a linear regression model without using the scikit-learn package. For a single independent variable, I have the code below (given by my professor):
def get_gradient(w, x, y):
    y_estimate = (np.power(x, 1).dot(w)).flatten()          # hypothesis
    error = (y.flatten() - y_estimate)
    mse = (1.0 / len(x)) * np.sum(np.power(error, 2))       # mean squared error
    gradient = -(1.0 / len(x)) * error.dot(np.power(x, 1))  # gradient
    return gradient, mse
w = np.random.randn(2)  # random initialization
alpha = 0.5             # learning rate
tolerance = 1e-3        # parameter for stopping the loop

print("Initial values of Weights:")
print(w[1], w[0])

# Perform gradient descent
iterations = 1
while True:
    gradient, error = get_gradient(w, train_x, train_y)
    new_w = w - alpha * gradient

    # Stopping condition
    if np.sum(abs(new_w - w)) < tolerance:
        print("Converged")
        break

    # Print error every 10 iterations
    if iterations % 10 == 0:
        print("Iteration: %d - Error: %.4f" % (iterations, error))
        print("Updated Weights : {:f} , {:f}".format(w[1], w[0]))

    iterations += 1
    w = new_w

print("Final Weights : {:f} , {:f}".format(w[1], w[0]))
print("Test Cost =", get_gradient(w, test_x, test_y)[1])
This works fine with one variable plus the intercept. Note: I created a NumPy array for x and for y for the purposes of this code.
But now I wish to use this on the Boston Housing dataset, which has 13 independent variables/features.
I cannot figure out how to adapt the code for multiple variables. Specifically, I cannot figure out how to:
Have each weight work only with the values in its own variable's column, apply gradient descent, and recalculate that weight. If I run the gradient descent function with 13 weights inside the weights array, it ends up (I think) multiplying all 13 weights with ALL rows of EACH variable. I think this is what is happening because the error became infinity after only a few iterations, and the loop never terminated.
I tried running a for loop over each of the 13 weights and THEN calculating the gradient for each of the 13 variables, collecting each weight in a list (later converted to a NumPy array) before calling get_gradient in the while loop. This ran at the time, but the error again reached infinity and the program went into an infinite loop just like before. In any case, this should not even count as a solution, since a for loop lets the first variable absorb all the relevant weight first.
I also tried dividing all the weights and errors by 13, just to get the program to run so I could tweak it afterwards, but no success there either.
Strangely, I cannot even replicate the for-loop version anymore; I keep getting one error or another (shape errors, value errors, etc.).
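To make the question concrete, here is a minimal sketch of the data layout I think the multivariate version needs (boston_X and boston_y are just placeholders for however the dataset gets loaded, so this may well be where I am going wrong):

import numpy as np

# boston_X: (n_samples, 13) array of features, boston_y: (n_samples,) targets.
# Prepend a column of ones so that one of the 14 weights acts as the intercept.
X = np.column_stack([np.ones(len(boston_X)), boston_X])
y = boston_y

w = np.random.randn(X.shape[1])   # 14 weights, one per column of X

# get_gradient should then work unchanged: x.dot(w) gives one prediction per
# row, and error.dot(x) gives one gradient component per column/weight.
# With unscaled features, alpha = 0.5 is probably far too large, which may be
# why the error blows up; a much smaller alpha or feature scaling seems needed.
gradient, mse = get_gradient(w, X, y)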
Please help.
If any further information is required - please do let me know.
Thank you so much!
Related
I am trying to write a gradient descent function in python as part of a multivariate linear regression exercise. It runs, but does not compute the correct answer. My code is below. I've been trying for weeks to finish this problem but have made zero progress.
I believe that I understand the concept of gradient descent to optimize a multivariate linear regression function and also that the 'math' is correct. I believe that the error is in my code, but I am still learning python. Your help is very much appreciated.
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    from math import sqrt
    converged = False
    weights = np.array(initial_weights)
    while not converged:
        predictions = np.dot(feature_matrix, weights)
        errors = predictions - output
        gradient_sum_squares = 0
        for i in range(len(weights)):
            derivative = -2 * np.dot(errors[i], feature_matrix[i])
            gradient_sum_squares = gradient_sum_squares + np.dot(derivative, derivative)
            weights[i] = weights[i] - step_size * derivative[i]
        gradient_magnitude = sqrt(gradient_sum_squares)
        print gradient_magnitude
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
Feature matrix is:
sales = gl.SFrame.read_csv('kc_house_data.csv',column_type_hints = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int,'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int,'view':int})
I'm calling the function as:
train_data,test_data = sales.random_split(.8,seed=0)
simple_features = ['sqft_living']
my_output= 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights,step_size,tolerance)
(get_numpy_data is just a function to convert everything into arrays, and it works as intended.)
Update: I fixed the formula to:
derivative = 2 * np.dot(errors,feature_matrix)
and it seems to have worked. The derivation of this formula in my online course used
-2 * np.dot(errors,feature_matrix)
and I'm not sure why this formula did not provide the correct answer.
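For reference, here is a sketch of a fully vectorized version after that fix (this is only my reading of it, not the course's reference solution; the per-weight loop is replaced by one matrix product):

import numpy as np
from math import sqrt

def regression_gradient_descent_vec(feature_matrix, output, initial_weights, step_size, tolerance):
    weights = np.array(initial_weights, dtype=float)
    while True:
        errors = np.dot(feature_matrix, weights) - output   # predictions - output
        gradient = 2 * np.dot(feature_matrix.T, errors)     # one component per weight
        weights -= step_size * gradient
        if sqrt(np.dot(gradient, gradient)) < tolerance:
            return weights

The sign question may come down to how errors is defined: with errors = predictions - output the factor is +2, while the course's errors = output - predictions gives the -2 form; both describe the same descent direction.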
The step size seems too small, and the tolerance unusually big. Perhaps you meant to use them the other way around?
In general, the step size is determined by a trial-and-error procedure: the "natural" step size α = 1 might lead to divergence, so one could try lowering the value (e.g. α = 1/2, α = 1/4, etc.) until convergence is achieved. Don't start with a very small step size.
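A minimal sketch of that trial-and-error procedure (run_gd stands for whatever training routine is being tuned and is assumed to return the final cost, or inf/NaN if the run diverged):

import numpy as np

def tune_step_size(run_gd, start_alpha=1.0, shrink=0.5, max_tries=20):
    # Halve the step size until a run no longer diverges.
    alpha = start_alpha
    for _ in range(max_tries):
        final_cost = run_gd(alpha)
        if np.isfinite(final_cost):
            return alpha, final_cost   # first stable step size found
        alpha *= shrink                # e.g. 1, 1/2, 1/4, ...
    raise RuntimeError("No stable step size found; check the gradient itself.")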
After I implemented a LS estimation with gradient descent for a simple linear regression problem, I'm now trying to do the same with Maximum Likelihood.
I used the log-likelihood equation for a linear model with Gaussian noise from Wikipedia: ln L = -(n/2) ln(2π) - (n/2) ln(σ²) - (1/(2σ²)) Σᵢ (yᵢ - ŷᵢ)². The maximum of this has to be found.
train_X = np.random.rand(100, 1) # all values [0-1)
train_Y = train_X
X = tf.placeholder("float", None)
Y = tf.placeholder("float", None)
theta_0 = tf.Variable(np.random.randn())
theta_1 = tf.Variable(np.random.randn())
var = tf.Variable(0.5)
hypothesis = tf.add(theta_0, tf.mul(X, theta_1))
lhf = 1 * (50 * np.log(2*np.pi) + 50 * tf.log(var) + (1/(2*var)) * tf.reduce_sum(tf.pow(hypothesis - Y, 2)))
op = tf.train.GradientDescentOptimizer(0.01).minimize(lhf)
This code works, but I still have some questions about it:
If I change the lhf function from 1 * to -1 * and minimize -lhf (according to the equation), it does not work. But why?
The value for lhf goes up and down during optimization. Shouldn't it only change in one direction?
The value for lhf sometimes is a NaN during optimization. How can I avoid that?
In the equation, σ² is the variance of the error (right?). My values are perfectly on a line. Why do I get a value of var above 100?
The symptoms in your question indicate a common problem: the learning rate or step size might be too high for the problem.
The zig-zag behaviour, where the function to be maximized goes up and down, is typical when the learning rate is too high, and so are the NaNs.
The simplest solution is to lower the learning rate: keep dividing your current learning rate by 10 until the learning curve is smooth and there are no NaNs or up-and-down behaviour.
As you are using TensorFlow, you can also try AdamOptimizer, which adapts the effective step size per parameter as you train.
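For example, keeping the rest of the graph above the same, the optimizer line could be swapped like this (the 0.001 learning rate is just a common default, not a value from the question):

# Adam keeps running estimates of each parameter's gradient and squared gradient,
# which makes training far less sensitive to the initial learning rate.
op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(lhf)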
I'm trying to implement gradient descent in python and my loss/cost keeps increasing with every iteration.
I've seen a few people post about this, and saw an answer here: gradient descent using python and numpy
I believe my implementation is similar, but I can't see what I'm doing wrong to get an exploding cost value:
Iteration: 1 | Cost: 697361.660000
Iteration: 2 | Cost: 42325117406694536.000000
Iteration: 3 | Cost: 2582619233752172973298548736.000000
Iteration: 4 | Cost: 157587870187822131053636619678439702528.000000
Iteration: 5 | Cost: 9615794890267613993157742129590663647488278265856.000000
I'm testing this on a dataset I found online (LA Heart Data): http://www.umass.edu/statdata/statdata/stat-corr.html
Import code:
dataset = np.genfromtxt('heart.csv', delimiter=",")
x = dataset[:]
x = np.insert(x,0,1,axis=1) # Add 1's for bias
y = dataset[:,6]
y = np.reshape(y, (y.shape[0],1))
Gradient descent:
def gradientDescent(weights, X, Y, iterations = 1000, alpha = 0.01):
    theta = weights
    m = Y.shape[0]
    cost_history = []
    for i in xrange(iterations):
        residuals, cost = calculateCost(theta, X, Y)
        gradient = (float(1)/m) * np.dot(residuals.T, X).T
        theta = theta - (alpha * gradient)
        # Store the cost for this iteration
        cost_history.append(cost)
        print "Iteration: %d | Cost: %f" % (i+1, cost)
Calculate cost:
def calculateCost(weights, X, Y):
    m = Y.shape[0]
    residuals = h(weights, X) - Y
    squared_error = np.dot(residuals.T, residuals)
    return residuals, float(1)/(2*m) * squared_error
Calculate hypothesis:
def h(weights, X):
    return np.dot(X, weights)
To actually run it:
gradientDescent(np.ones((x.shape[1],1)), x, y, 5)
Assuming that your derivation of the gradient is correct: you are using =- where you should be using -=. Instead of updating theta, =- reassigns it to -(alpha * gradient).
EDIT (after the above issue was fixed in the code):
I ran the code on what I believe is the right dataset and was able to get the cost to behave by setting alpha = 1e-7. If you run it for 1e6 iterations you should see it converging. On this dataset, this approach appears very sensitive to the learning rate.
In general, if your cost is increasing, the very first thing to check is whether your learning rate is too large. In such cases, the rate causes the cost function to jump over the optimal value and grow towards infinity. Try different small values of your learning rate. When I face the problem you describe, I usually repeatedly take 1/10 of the learning rate until I find a rate for which J(w) decreases.
Another problem might be a bug in your derivative implementation. A good way to debug is to do a gradient check to compare the analytic gradient versus the numeric gradient.
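A minimal sketch of such a check, reusing the calculateCost function from the question (eps is just a typical choice of finite-difference step):

import numpy as np

def numeric_gradient(theta, X, Y, eps=1e-5):
    # Central-difference estimate of d(cost)/d(theta), one element at a time.
    grad = np.zeros_like(theta, dtype=float)
    it = np.nditer(theta, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        step = np.zeros_like(theta, dtype=float)
        step[idx] = eps
        _, cost_plus = calculateCost(theta + step, X, Y)
        _, cost_minus = calculateCost(theta - step, X, Y)
        grad[idx] = (np.ravel(cost_plus)[0] - np.ravel(cost_minus)[0]) / (2 * eps)
        it.iternext()
    return grad

# The analytic gradient used in gradientDescent,
#     (float(1)/m) * np.dot(residuals.T, X).T,
# should agree with numeric_gradient(theta, X, Y) to several decimal places.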
I've implemented a single-variable linear regression model in Python that uses gradient descent to find the intercept and slope of the best-fit line (I'm using gradient descent rather than computing the optimal values for intercept and slope directly because I'd eventually like to generalize to multiple regression).
The data I am using are below. sales is the dependent variable (in dollars) and temp is the independent variable (degrees Celsius); think ice cream sales vs. temperature, or something similar.
sales temp
215 14.20
325 16.40
185 11.90
332 15.20
406 18.50
522 22.10
412 19.40
614 25.10
544 23.40
421 18.10
445 22.60
408 17.20
And this is my data after it has been normalized:
sales temp
0.06993007 0.174242424
0.326340326 0.340909091
0 0
0.342657343 0.25
0.515151515 0.5
0.785547786 0.772727273
0.529137529 0.568181818
1 1
0.836829837 0.871212121
0.55011655 0.46969697
0.606060606 0.810606061
0.51981352 0.401515152
My code for the algorithm:
import numpy as np
import pandas as pd
from scipy import stats

class SLRegression(object):
    def __init__(self, learnrate = .01, tolerance = .000000001, max_iter = 10000):
        # Initialize learnrate, tolerance, and max_iter.
        self.learnrate = learnrate
        self.tolerance = tolerance
        self.max_iter = max_iter

    # Define the gradient descent algorithm.
    def fit(self, data):
        # data : array-like, shape = [m_observations, 2_columns]
        # Initialize local variables.
        converged = False
        m = data.shape[0]
        # Track number of iterations.
        self.iter_ = 0
        # Initialize theta0 and theta1.
        self.theta0_ = 0
        self.theta1_ = 0
        # Compute the cost function.
        J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
        print('J is: ', J)
        # Iterate over each point in data and update theta0 and theta1 on each pass.
        while not converged:
            diftemp0 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) for i in range(m)])
            diftemp1 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) * data[i][1] for i in range(m)])
            # Subtract the learnrate * partial derivative from theta0 and theta1.
            temp0 = self.theta0_ - (self.learnrate * diftemp0)
            temp1 = self.theta1_ - (self.learnrate * diftemp1)
            # Update theta0 and theta1.
            self.theta0_ = temp0
            self.theta1_ = temp1
            # Compute the updated cost function, given new theta0 and theta1.
            new_J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
            print('New J is: %s' % new_J)
            # Test for convergence.
            if abs(J - new_J) <= self.tolerance:
                converged = True
                print('Model converged after %s iterations!' % self.iter_)
            # Set old cost equal to new cost and update iter.
            J = new_J
            self.iter_ += 1
            # Test whether we have hit max_iter.
            if self.iter_ == self.max_iter:
                converged = True
                print('Maximum iterations have been reached!')
        return self

    def point_forecast(self, x):
        # Given feature value x, return the regression's predicted value for y.
        return self.theta0_ + self.theta1_ * x


# Run the algorithm on a data set.
if __name__ == '__main__':
    # Load in the .csv file.
    data = np.squeeze(np.array(pd.read_csv('sales_normalized.csv')))
    # Create a regression model with the default learning rate, tolerance, and maximum number of iterations.
    slregression = SLRegression()
    # Call the fit function and pass in the data.
    slregression.fit(data)
    # Print out the results.
    print('After %s iterations, the model converged on Theta0 = %s and Theta1 = %s.'
          % (slregression.iter_, slregression.theta0_, slregression.theta1_))
    # Compare our model to the scipy linregress model.
    slope, intercept, r_value, p_value, slope_std_error = stats.linregress(data[:,1], data[:,0])
    print('Scipy linear regression gives intercept: %s and slope = %s.' % (intercept, slope))
    # Test the model with a point forecast.
    print('As an example, our algorithm gives y = %s given x = .87.' % slregression.point_forecast(.87))  # Should be about .83.
    print('The true y-value for x = .87 is about .8368.')
I'm having trouble understanding exactly what allows the algorithm to converge versus return values that are completely wrong. Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005. This brings changes in the cost function from iteration to iteration down to around 614, but I can't get it to go any lower.
Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why? Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values? For instance, if I were going to deliver this algorithm to a client so they could make predictions of their own (I'm not, but for the sake of argument..), wouldn't I want them to simply be able to plug in the un-normalized x-value?
All in all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time. Is this normal, or are there flaws in my algorithm that are contributing to this issue?
Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005
That's kind of to be expected the way you have your algorithm set up.
The normalization of the data makes it so the y-intercept of the best fit is around 0.0. Otherwise, you could have a y-intercept thousands of units off of the starting guess, and you'd have to trek there before you ever really started the optimization part.
Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why?
No, absolutely not, but if you don't normalize, you should pick a starting point more intelligently (you're starting at (m,b) = (0,0)). Your learnrate may also be too small if you don't normalize your data, and same with your tolerance.
Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values?
Apply to your new x-value the same transformation you applied to the original data to produce the normalized data (the normalization code is outside of what you have shown). If this test point falls within the (min_x, max_x) range of your original data, then once transformed it should fall within 0 <= x <= 1. Once you have this normalized test point, plug it into your theta equation of a line (remember, your thetas are the m and b of the slope-intercept form of the equation of a line).
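A minimal sketch of that round trip, assuming plain min-max scaling was used (which matches the normalized table above, where 185 maps to 0 and 614 maps to 1):

# Min-max statistics taken from the ORIGINAL, un-normalized training data.
temp_min, temp_max = 11.90, 25.10      # temp column above
sales_min, sales_max = 185.0, 614.0    # sales column above

def normalize_temp(x):
    # Map a raw temperature onto the [0, 1] scale the model was trained on.
    return (x - temp_min) / (temp_max - temp_min)

def denormalize_sales(y_norm):
    # Map the model's [0, 1] prediction back to dollars.
    return y_norm * (sales_max - sales_min) + sales_min

# Example: predict sales for a raw temperature of 20.0 degrees.
# x_norm = normalize_temp(20.0)
# y_norm = slregression.point_forecast(x_norm)
# y_dollars = denormalize_sales(y_norm)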
All in all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time.
For a well-formed problem, if you're in fact diverging it often means your step size is too large. Try lowering it.
If it's simply not converging before it hits the max iterations, that could be a few issues:
Your step size is too small,
Your tolerance is too small,
Your max iterations is too small,
Your starting point is poorly chosen
In your case, using the non-normalized data results in your starting point of (0,0) being very far off (the (m,b) of the non-normalized data is around (-159, 30), while the (m,b) of your normalized data is (0.10, 0.79)), so most if not all of your iterations are spent just getting to the area of interest.
The problem is that increasing the step size to get to the area of interest faster also makes it less likely to find convergence once it gets there.
To account for this, some gradient descent algorithms have dynamic step size (or learnrate) such that large steps are taken at the beginning, and smaller ones as it nears convergence.
It may also be helpful to keep a history of the theta pairs throughout the algorithm and then plot them. You'll be able to see the difference immediately between using normalized and non-normalized input data.
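A minimal sketch of that bookkeeping (the history_ attribute and the plot are additions on top of the SLRegression class above, not part of the original code):

import matplotlib.pyplot as plt

# Inside fit(), before the while loop:   self.history_ = [(self.theta0_, self.theta1_)]
# Inside the loop, after each update:    self.history_.append((self.theta0_, self.theta1_))

def plot_theta_path(history):
    # Plot the path gradient descent takes through (theta0, theta1) space.
    theta0s = [t0 for t0, _ in history]
    theta1s = [t1 for _, t1 in history]
    plt.plot(theta0s, theta1s, marker='.', linewidth=0.5)
    plt.xlabel('theta0 (intercept)')
    plt.ylabel('theta1 (slope)')
    plt.title('Gradient descent path')
    plt.show()

# plot_theta_path(slregression.history_)  # compare the normalized vs. raw-data runs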
I am trying to implement the SVM loss function and its gradient.
I found some example projects that implement these two, but I could not figure out how they can use the loss function when computing the gradient.
Here is the formula of the loss function (the multiclass SVM hinge loss): L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + Δ), where the scores are s = X[i].dot(W) and the margin Δ is 1 in the code below.
What I cannot understand is that how can I use the loss function's result while computing gradient?
The example project computes the gradient as follows:
for i in xrange(num_train):
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    for j in xrange(num_classes):
        if j == y[i]:
            continue
        margin = scores[j] - correct_class_score + 1  # note delta = 1
        if margin > 0:
            loss += margin
            dW[:, j] += X[i]
            dW[:, y[i]] -= X[i]
dW holds the gradient result, and X is the array of training data.
But I don't understand how the derivative of the loss function leads to this code.
The way to calculate the gradient in this case is calculus (analytically, NOT numerically!). Differentiating the loss function with respect to w_{y_i} gives:
∇_{w_{y_i}} L_i = -( Σ_{j ≠ y_i} 1(w_j·x_i - w_{y_i}·x_i + Δ > 0) ) x_i
and with respect to w_j, when j ≠ y_i, it is:
∇_{w_j} L_i = 1(w_j·x_i - w_{y_i}·x_i + Δ > 0) x_i
Here 1(·) is just the indicator function, so whenever the margin condition holds the derivative is simply ±x_i; written in code, the example you provided is exactly the answer.
Since you are using the cs231n example, you should definitely check the course notes and videos if needed.
Hope this helps!
If the subtraction is less than zero, the loss is zero, so the gradient of W is also zero. If the subtraction is larger than zero, then the gradient of W is the partial derivative of the loss.
If we don't keep these two lines of code:
    dW[:, j] += X[i]
    dW[:, y[i]] -= X[i]
we only compute the loss value, not the gradient.
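For comparison, here is a fully vectorized sketch of the same loss and gradient (no regularization term, and it averages over the N training examples, which the looped example above may do elsewhere):

import numpy as np

def svm_loss_vectorized(W, X, y, delta=1.0):
    # W: (D, C) weights, X: (N, D) training data, y: (N,) integer class labels.
    N = X.shape[0]
    scores = X.dot(W)                                   # (N, C) class scores
    correct = scores[np.arange(N), y][:, np.newaxis]    # (N, 1) correct-class scores
    margins = np.maximum(0, scores - correct + delta)   # hinge margins
    margins[np.arange(N), y] = 0                        # don't count j == y[i]
    loss = margins.sum() / N

    # Each positive margin adds +X[i] to column j and -X[i] to column y[i],
    # exactly like the two dW lines in the looped version.
    mask = (margins > 0).astype(float)                  # (N, C)
    mask[np.arange(N), y] = -mask.sum(axis=1)
    dW = X.T.dot(mask) / N                              # (D, C)
    return loss, dW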