Gradient Descent Variation doesn't work - Python

I am trying to implement the stochastic gradient descent algorithm.
The first version works:
def gradientDescent(x, y, theta, alpha):
    xTrans = x.transpose()
    for i in range(0, 99):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        gradient = np.dot(xTrans, loss)
        theta = theta - alpha * gradient
    return theta
This version gives the right theta values, but the following algorithm doesn't work:
def gradientDescent2(x, y, theta, alpha):
    xTrans = x.transpose()
    for i in range(0, 99):
        hypothesis = np.dot(x[i], theta)
        loss = hypothesis - y[i]
        gradientThetaZero = loss * x[i][0]
        gradientThetaOne = loss * x[i][1]
        theta[0] = theta[0] - alpha * gradientThetaZero
        theta[1] = theta[1] - alpha * gradientThetaOne
    return theta
I don't understand why the second version does not work; it basically does the same thing as the first algorithm.
I use the following code to produce data:
def genData():
    x = np.random.rand(100, 2)
    y = np.zeros(shape=100)
    for i in range(0, 100):
        x[i][0] = 1
        # our target variable
        e = np.random.uniform(-0.1, 0.1, size=1)
        y[i] = np.sin(2*np.pi*x[i][1]) + e[0]
    return x, y
And I use it the following way:
x,y = genData()
theta = np.ones(2)
theta = gradientDescent2(x,y,theta,0.005)
print(theta)
I hope you can help me!
Best regards, Felix

Your second code example overwrites the gradient computation on each iteration over your observation data.
In the first snippet, you properly adjust your parameters in each iteration based on the error over the whole training set (the loss function).
In the second snippet, you calculate the point-wise gradient in each iteration but never accumulate it, so each update reflects only a single data point and your final update effectively only trains on the very last data point.
If instead you accumulate the gradients within the loop by summing (+=), the result should be closer to what you're looking for: an expression of the gradient of the loss function with respect to your parameters over the entire observation set.
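For illustration, here is a minimal sketch of that accumulate-then-update idea (my restructuring of the second snippet, not the original poster's code):

def gradientDescent2(x, y, theta, alpha):
    gradient = np.zeros(2)
    for i in range(0, 99):
        hypothesis = np.dot(x[i], theta)
        loss = hypothesis - y[i]
        gradient[0] += loss * x[i][0]  # accumulate instead of overwrite
        gradient[1] += loss * x[i][1]
    theta = theta - alpha * gradient   # single update over the whole pass
    return theta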

Related

Linear regression using gradient descent; having trouble with cost function value

I'm coding linear regression using gradient descent, with plain for loops rather than tensor operations.
I think my code is logically right; when I plot the results, the theta values and the fitted linear model look good. But the value of the cost function is high. Can you help me?
The value of the cost function is 1,160,934, which seems abnormal.
def gradient_descent(alpha, x, y, ep=0.0001, max_repeat=10000000):
    m = x.shape[0]
    converged = False
    repeat = 0
    theta0 = 1.0
    theta3 = -1.0
    # J = sum([(theta0 + theta3*x[i] - y[i])**2 for i in range(m)]) / 2*m #######
    J = 1
    while not converged:
        grad0 = sum([(theta0 + theta3*x[i] - y[i]) for i in range(m)]) / m
        grad1 = sum([(theta0 + theta3*x[i] - y[i])*x[i] for i in range(m)]) / m
        temp0 = theta0 - alpha*grad0
        temp1 = theta3 - alpha*grad1
        theta0 = temp0
        theta3 = temp1
        msqe = (sum([(theta0 + theta3*x[i] - y[i])**2 for i in range(m)])) * (1 / 2*m)
        print(theta0, theta3, msqe)
        if abs(J - msqe) <= ep:
            print('Converged, iterations: {0}', repeat, '!!!')
            converged = True
        J = msqe
        repeat += 1
        if repeat == max_repeat:
            converged = True
            print("Reached max iterations")
    return theta0, theta3, J

[theta0, theta3, J] = gradient_descent(0.001, X3, Y, ep=0.0000001, max_repeat=1000000)
print("************\n theta0 : {0}\ntheta3 : {1}\nJ : {2}\n"
      .format(theta0, theta3, J))
This is the data set.
I think the dataset itself is quite spread out, and that is why the best-fit line yields a large value for the cost function. If you scale your data, you will see it drop significantly.
It is quite normal for the cost to be high when dealing with a large dataset that has high variance. Moreover, your data involves large numbers, so the cost is large; normalizing the data will give you the correct estimate, since normalized data doesn't need further scaling. To verify, start with random weights and observe the cost each time: if the cost fluctuates over a huge range, there may be a mistake somewhere; otherwise it's fine.
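As a hedged sketch of that advice, reusing the asker's gradient_descent and array names X3 and Y (the standardization step is my reading of "normalize"):

# Standardize features and targets to zero mean, unit variance before fitting.
x_scaled = (X3 - X3.mean()) / X3.std()
y_scaled = (Y - Y.mean()) / Y.std()
theta0, theta3, J = gradient_descent(0.001, x_scaled, y_scaled, ep=0.0000001, max_repeat=1000000)
# J is now measured on the standardized scale, so it should drop dramatically.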

What is the use of mfunction and elmul?

I was converting a program from Octave to Python, using OMPC to see whether it works. At the top of the converted Python code there is a line that says #mfunction(""), which I'm not sure about and haven't found much information on. The same goes for elmul: I don't know what it is for and haven't found information about it either.
This is the Octave code:
function [J, grad] = costFunction(theta, X, y)
%COSTFUNCTION Compute cost and gradient for logistic regression
%   J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
%   parameter for logistic regression and the gradient of the cost
%   w.r.t. to the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta
%
% Note: grad should have the same dimensions as theta
% Use sigmoid function previously programmed
hypothesis = sigmoid(X*theta); % Hypothesis for logistic regression
Weight = 1/m;
J = -Weight*sum( ( y.*log(hypothesis) + (1 - y).*log(1 - hypothesis) ) );
for i = 1 : m
    grad = grad + (hypothesis(i) - y(i)) * X(i,:)'; % X must be transposed
end
grad = Weight*grad;
% =============================================================
end
This is the code produced by OMPC:
#mfunction("J, grad")
def costFunction(theta=None, X=None, y=None):
    #COSTFUNCTION Compute cost and gradient for logistic regression
    #   J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
    #   parameter for logistic regression and the gradient of the cost
    #   w.r.t. to the parameters.

    # Initialize some useful values
    m = length(y)    # number of training examples

    # You need to return the following variables correctly
    J = 0
    grad = zeros(size(theta))

    # ====================== YOUR CODE HERE ======================
    # Instructions: Compute the cost of a particular choice of theta.
    #               You should set J to the cost.
    #               Compute the partial derivatives and set grad to the partial
    #               derivatives of the cost w.r.t. each parameter in theta
    #
    # Note: grad should have the same dimensions as theta
    # Use sigmoid function previously programmed
    hypothesis = sigmoid(X * theta)    # Hypothesis for logistic regression
    Weight = 1 / m
    J = -Weight * sum((y *elmul* log(hypothesis) + (1 - y) *elmul* log(1 - hypothesis)))
    for i in mslice[1:m]:
        grad = grad + (hypothesis(i) - y(i)) * X(i, mslice[:]).cT    # X must be transposed
    end
    grad = Weight * grad
    # =============================================================
end
Both #mfunction and elmul are OMPC-defined constructs. The first deals correctly with the special marray type defined by OMPC; the second performs element-wise multiplication on those marrays. They exist to bridge the semantic differences between NumPy and MATLAB.
However, these differences are difficult to translate in full generality, especially because MATLAB code often relies on exactly these semantic details.
That is why MATLAB-to-Python converters cannot be relied on for production code.
I would translate the Octave code above into something like this (assuming NumPy array inputs):
def cost_function(theta, x, y):
    m = np.size(y)
    grad = np.zeros(theta.shape)
    hypothesis = sigmoid(x @ theta)
    j = -np.sum(y * np.log(hypothesis) + (1 - y) * np.log(1 - hypothesis)) / m
    for i in range(m):
        grad = grad + (hypothesis[i] - y[i]) * x[i, :].transpose().conjugate()
    grad = grad / m
    return j, grad
Please carefully check that x[i, :].transpose().conjugate() is actually needed (it depends on what x and hypothesis actually are).
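As a quick sanity check, here is a hypothetical usage with random stand-in data and an inline sigmoid (none of these names or values come from the original post):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.hstack([np.ones((100, 1)), np.random.randn(100, 2)])  # stand-in design matrix
y = (np.random.rand(100) > 0.5).astype(float)                # stand-in 0/1 labels
theta = np.zeros(3)

j, grad = cost_function(theta, x, y)
print(j, grad)  # j is ln(2) ~ 0.693 at theta = 0, as expected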

How to do a gradient descent problem (machine learning)?

Could somebody please explain how to do a gradient descent problem WITHOUT the context of the cost function? I have seen countless tutorials that explain gradient descent using the cost function, but I really don't understand how it works in a more general sense.
I am given a 3D function:
z = 3*((1 - xx)**2) * np.exp(-(xx**2) - (yy + 1)**2) \
    - 10*(xx/5 - xx**3 - yy**5) * np.exp(-xx**2 - yy**2) \
    - (1/3) * np.exp(-(xx + 1)**2 - yy**2)
And I am asked to:
Code a simple gradient algorithm. Set the parameters as follows:
learning rate = step size: 0.1
Max number of iterations: 20
Stopping criterion: 0.0001 (Your iterations should stop when your gradient is smaller than the threshold)
Then start your algorithm at
(x0 = 0.5, y0 = -0.5)
(x0 = -0.3, y0 = -0.3)
I have seen this piece of code floating around wherever gradient descent is talked about:
def update_weights(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))
        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))
    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate
    return m, b
But I don't understand how to use it for my problem. How does my function fit in there? What do I adjust instead of m and b? I'm very very confused.
Thank you.
Gradient descent is an optimization algorithm for finding the minimum of a function.
Very simplified view
Let's start with a 1D function y = f(x), pick an arbitrary value of x, and find the gradient (slope) of f(x) there.
If the function is decreasing at x, we have to move further right along the number line to reach the minimum; if it is increasing, we have to move left.
We can get the slope by taking the derivative of the function: the derivative is negative where the function is decreasing and positive where it is increasing.
So we can start at some arbitrary value of x and slowly move toward the minimum using the derivative at that value of x. How slowly we move is determined by the learning rate, or step size. This gives the update rule
x = x - df_dx*lr
If the function is decreasing, the derivative (df_dx) is negative, so x increases and moves further right. If the function is increasing, df_dx is positive, so x decreases and moves left.
We continue this either for some large fixed number of iterations or until the derivative is very small.
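A minimal 1-D sketch of this update rule, assuming the toy function f(x) = (x - 3)**2:

# Gradient descent on f(x) = (x - 3)**2, whose derivative is 2*(x - 3).
x, lr = 0.0, 0.1
for _ in range(1000):
    df_dx = 2 * (x - 3)    # derivative at the current x
    if abs(df_dx) < 1e-4:  # derivative is very small: stop
        break
    x -= df_dx * lr        # the update rule x = x - df_dx*lr
print(x)                   # close to 3, the minimizer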
Multivariate function z = f(x,y)
The same logic applies, except that we now take partial derivatives instead of the ordinary derivative.
The update rule is
x = x - dpf_dx*lr
y = y - dpf_dy*lr
where dpf_dx is the partial derivative of f with respect to x.
The above algorithm is called the gradient descent algorithm. In machine learning, f(x,y) is a cost/loss function whose minimum we are interested in.
Example
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D
from pylab import meshgrid
from scipy.optimize import fmin
import math

def z_func(a):
    x, y = a
    return (x - 1)**2 + (y - 2)**2

x = np.arange(-3.0, 3.0, 0.1)
y = np.arange(-3.0, 3.0, 0.1)
X, Y = meshgrid(x, y)  # grid of points
Z = z_func((X, Y))     # evaluation of the function on the grid

fig = plt.figure()
ax = fig.gca(projection='3d')
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, linewidth=0, antialiased=False)
plt.show()
The minimum of z_func is at (1, 2). This can be verified using SciPy's fmin function:
fmin(z_func,np.array([10,10]))
Now let's write our own gradient descent algorithm to find the minimum of z_func:
def gradient_descent(x, y, lr):
    while True:
        d_x = 2*(x - 1)
        d_y = 2*(y - 2)
        x -= d_x*lr
        y -= d_y*lr
        if abs(d_x) < 0.0001 and abs(d_y) < 0.0001:
            break
    return x, y

print(gradient_descent(10, 10, 0.1))
We start at arbitrary values x = 10 and y = 10 with a learning rate of 0.1. The above code prints 1.000033672997724 2.0000299315535326, which is correct.
So if you have a continuously differentiable convex function, all you have to do to find its optimum (the minimum, for a convex function) is compute the partial derivatives with respect to each variable and apply the update rule above. Repeat until the gradients are small, which means we have reached the minimum of a convex function.
If the function is not convex, we might get stuck in a local optimum.
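Applied to the function in the question, a sketch using the asker's settings (step size 0.1, at most 20 iterations, stopping threshold 0.0001) might look like this; the finite-difference gradient is my assumption, and analytic partial derivatives would work just as well:

import numpy as np

def f(x, y):
    return (3*(1 - x)**2 * np.exp(-x**2 - (y + 1)**2)
            - 10*(x/5 - x**3 - y**5) * np.exp(-x**2 - y**2)
            - (1/3) * np.exp(-(x + 1)**2 - y**2))

def gradient_descent_2d(x, y, lr=0.1, max_iter=20, tol=0.0001, h=1e-6):
    for _ in range(max_iter):
        df_dx = (f(x + h, y) - f(x - h, y)) / (2*h)  # numerical partial derivative in x
        df_dy = (f(x, y + h) - f(x, y - h)) / (2*h)  # numerical partial derivative in y
        if np.hypot(df_dx, df_dy) < tol:             # stop when the gradient is small
            break
        x -= lr * df_dx
        y -= lr * df_dy
    return x, y

print(gradient_descent_2d(0.5, -0.5))
print(gradient_descent_2d(-0.3, -0.3))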

Cost Function and Gradient Seem to be Working, but scipy.optimize functions are not

I'm working through my MATLAB code from the Andrew Ng Coursera course and turning it into Python. I am working on non-regularized logistic regression, and after writing my gradient and cost functions I needed something similar to fminunc; after some googling, I found a couple of options. They both return the same results, but those results do not match Andrew Ng's expected output. Others seem to have gotten this to work correctly, so I'm wondering why my specific code does not return the desired result when using scipy.optimize functions, even though the cost and gradient pieces earlier in the code do.
The data I'm using can be found at the link below:
ex2data1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as op

#Machine Learning Online Class - Exercise 2: Logistic Regression

#Load Data
#The first two columns contain the exam scores and the third column contains the label.
data = pd.read_csv('ex2data1.txt', header = None)
X = np.array(data.iloc[:, 0:2])  #100 x 2
y = np.array(data.iloc[:, 2])    #100 x 1
y.shape = (len(y), 1)

#Creating sub-dataframes for plotting
pos_plot = data[data[2] == 1]
neg_plot = data[data[2] == 0]

#==================== Part 1: Plotting ====================
#We start the exercise by first plotting the data to understand
#the problem we are working with.
print('Plotting data with + indicating (y = 1) examples and o indicating (y = 0) examples.')
plt.plot(pos_plot[0], pos_plot[1], "+", label = "Admitted")
plt.plot(neg_plot[0], neg_plot[1], "o", label = "Not Admitted")
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()

def sigmoid(z):
    '''
    SIGMOID Compute sigmoid function
    g = SIGMOID(z) computes the sigmoid of z.
    Instructions: Compute the sigmoid of each value of z (z can be a matrix,
    vector or scalar).
    '''
    g = 1 / (1 + np.exp(-z))
    return g

def costFunction(theta, X, y):
    '''
    COSTFUNCTION Compute cost and gradient for logistic regression
    J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
    parameter for logistic regression and the gradient of the cost
    w.r.t. to the parameters.
    '''
    m = len(y)  #number of training examples
    h = sigmoid(X.dot(theta))  #logistic regression hypothesis
    J = (1/m) * np.sum((-y*np.log(h)) - ((1-y)*np.log(1-h)))
    #h is 100x1, y is 100x1; these end up as two vectors we subtract from each other,
    #then we sum the values by rows
    #cost function for logistic regression
    return J

def gradient(theta, X, y):
    m = len(y)
    grad = np.zeros((theta.shape))
    h = sigmoid(X.dot(theta))
    for i in range(len(theta)):  #number of rows in theta
        XT = X[:, i]
        XT.shape = (len(X), 1)
        grad[i] = (1/m) * np.sum((h - y)*XT)  #updating each row of the gradient
    return grad

#============ Part 2: Compute Cost and Gradient ============
#In this part of the exercise, you will implement the cost and gradient
#for logistic regression. You need to complete the code in costFunction.m

#Add intercept term to x and X_test
Bias = np.ones((len(X), 1))
X = np.column_stack((Bias, X))

#Initialize fitting parameters
initial_theta = np.zeros((len(X[0]), 1))

#Compute and display initial cost and gradient
(cost, grad) = costFunction(initial_theta, X, y), gradient(initial_theta, X, y)
print('Cost at initial theta (zeros): %f' % cost)
print('Expected cost (approx): 0.693\n')
print('Gradient at initial theta (zeros):')
print(grad)
print('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628')

#Compute and display cost and gradient with non-zero theta
test_theta = np.array([[-24], [0.2], [0.2]])
(cost, grad) = costFunction(test_theta, X, y), gradient(test_theta, X, y)
print('\nCost at test theta: %f' % cost)
print('Expected cost (approx): 0.218\n')
print('Gradient at test theta:')
print(grad)
print('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n')

result = op.fmin_tnc(func = costFunction, x0 = initial_theta, fprime = gradient, args = (X, y))
result[1]

Result = op.minimize(fun = costFunction,
                     x0 = initial_theta,
                     args = (X, y),
                     method = 'TNC',
                     jac = gradient,
                     options = {'gtol': 1e-3, 'disp': True, 'maxiter': 1000})
theta = Result.x
theta

test = np.array([[1, 45, 85]])
prob = sigmoid(test.dot(theta))
print('For a student with scores 45 and 85, we predict an admission probability of %f,' % prob)
print('Expected value: 0.775 +/- 0.002\n')
This was a very difficult problem to debug, and it illustrates a poorly documented aspect of the scipy.optimize interface. The documentation vaguely indicates that theta will be passed around as a vector:
Minimization of scalar function of one or more variables.
In general, the optimization problems are of the form:
minimize f(x) subject to
g_i(x) >= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where x is a vector of one or more variables.
What's important is that they really mean vector in the most primitive sense, a 1-dimensional array. So you have to expect that whenever theta is passed into one of your callbacks, it will be passed in as a 1-d array. But in numpy, 1-d arrays sometimes behave differently from 2-d row arrays (and, obviously, from 2-d column arrays).
I don't know exactly why it's causing a problem in your case, but it's easily fixed regardless. You just have to add the following at the top of both your cost function and your gradient function:
theta = theta.reshape(-1, 1)
This guarantees that theta will be a 2-d column array, as expected. Once you've done this, the results are correct.
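For concreteness, here is a sketch of how that fix could look applied to the two callbacks; the vectorized gradient is my condensed rewrite of the asker's loop, not the original code:

def costFunction(theta, X, y):
    theta = theta.reshape(-1, 1)  # scipy passes theta in as a flat 1-d array
    m = len(y)
    h = sigmoid(X.dot(theta))
    return (1/m) * np.sum((-y*np.log(h)) - ((1-y)*np.log(1-h)))

def gradient(theta, X, y):
    theta = theta.reshape(-1, 1)
    m = len(y)
    h = sigmoid(X.dot(theta))
    return ((1/m) * X.T.dot(h - y)).ravel()  # flatten back for the optimizer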
I have had similar issues with SciPy when dealing with the same problem as you. As senderle points out, the interface is not the easiest to deal with, especially combined with the numpy array interface... Here is my implementation, which works as expected.
Defining the cost and gradient functions
Note that initial_theta is passed as a simple array of shape (3,) and converted to a column vector of shape (3,1) within the function. The gradient function then returns grad.ravel(), which has shape (3,) again. This is important, as doing otherwise caused error messages with various optimization methods in scipy.optimize.
Note that different methods have different behaviours, but returning .ravel() seems to fix most issues...
import pandas as pd
import numpy as np
import scipy.optimize as opt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def CostFunc(theta, X, y):
    #Initializing variables
    m = len(y)
    J = 0
    grad = np.zeros(theta.shape)
    #Vectorized computations
    z = X @ theta
    h = sigmoid(z)
    J = (1/m) * ((-y.T @ np.log(h)) - (1 - y).T @ np.log(1 - h))
    return J

def Gradient(theta, X, y):
    #Initializing variables
    m = len(y)
    theta = theta[:, np.newaxis]
    grad = np.zeros(theta.shape)
    #Vectorized computations
    z = X @ theta
    h = sigmoid(z)
    grad = (1/m) * (X.T @ (h - y))
    return grad.ravel()  #<-- This is the trick
Initializing variables and parameters
Note that initial_theta.shape returns (3,)
X = data1.iloc[:,0:2].values
m,n = X.shape
X = np.concatenate((np.ones(m)[:,np.newaxis],X),1)
y = data1.iloc[:,-1].values[:,np.newaxis]
initial_theta = np.zeros((n+1))
Calling Scipy.optimize
model = opt.minimize(fun = CostFunc, x0 = initial_theta, args = (X, y), method = 'TNC', jac = Gradient)
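The fitted parameters can then be read back from model.x; as an illustration, a prediction in the spirit of the exercise (the test point is borrowed from the question above) might look like:

theta_opt = model.x                # optimized parameters, shape (3,)
test = np.array([1, 45, 85])       # bias term plus the two exam scores
prob = sigmoid(test @ theta_opt)   # predicted admission probability, roughly 0.775
print(prob)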
Any comments from more knowledgeable people are welcome; this SciPy interface is a mystery to me. Thanks!

Stochastic Gradient Descent Convergence Criteria

Currently my convergence criteria for SGD checks whether the MSE error ratio is within a specific boundary.
def compute_mse(data, labels, weights):
    m = len(labels)
    hypothesis = np.dot(data, weights)
    sq_errors = (hypothesis - labels) ** 2
    mse = np.sum(sq_errors)/(2.0*m)
    return mse

cur_mse = 1.0
prev_mse = 100.0
m = len(labels)
while cur_mse/prev_mse < 0.99999:
    prev_mse = cur_mse
    for i in range(m):
        d = np.array(data[i])
        hypothesis = np.dot(d, weights)
        gradient = np.dot((labels[i] - hypothesis), d)/m
        weights = weights + (alpha * gradient)
    cur_mse = compute_mse(data, labels, weights)
    if cur_mse > prev_mse:
        return
The weights are updated w.r.t. a single data point in the training set.
With an alpha of 0.001, the model is supposed to converge within a few iterations, yet I get no convergence. Is this convergence criterion too strict?
I'll try to answer the question. First, the pseudocode of stochastic gradient descent looks something like this:
input: f(x), alpha, initial x (guess or random)
output: min_x f(x)  # x that minimizes f(x)

while True:
    shuffle data  # good practice, not strictly needed
    for d in data:
        x -= alpha * grad(f(x); d)  # gradient of f at x, evaluated on the single point d
    if <stopping criterion>:
        break
There can be other regularization terms added to the function you want to minimize, such as an l1 penalty to avoid overfitting.
Going back to your problem: looking at your data and your definition of the gradient, it looks like you want to solve a simple linear system of equations of the form
Ax = b
which yields the objective function
f(x) = ||Ax - b||^2
Stochastic gradient descent uses one row of data at a time:
||A_i x - b_i||^2
where || o || is the Euclidean norm and the subscript i denotes a row index.
Here, A is your data, x is your weights and b is your labels.
The gradient of the full objective is then
grad(f(x)) = 2 * A.T (Ax - b)
or, in the case of stochastic gradient descent,
2 * A_i.T (A_i x - b_i)
where .T means transpose.
Putting everything back into your code... first I will set up some synthetic data:
A = np.random.randn(100, 2) # 100x2 data
x = np.random.randn(2, 1) # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1) # 100x1 labels
b[b == 0] = -1 # labels in {-1, 1}
Then, define the parameters:
from random import shuffle  # needed for shuffle(idx) below

alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
idx = list(range(m))  # a list, so shuffle can modify it in place
And loop!
while cur_mse/prev_mse < 0.99999 and it < max_iter:
    prev_mse = cur_mse
    shuffle(idx)
    for i in idx:
        d = A[i:i+1]
        y = b[i:i+1]
        h = np.dot(d, x)
        dx = 2 * np.dot(d.T, (h - y))
        x -= (alpha * dx)
    cur_mse = np.mean((A.dot(x) - b)**2)
    if cur_mse > prev_mse:
        raise Exception("Not converging")
    it += 1
This code is pretty much the same as yours, with a couple of additions:
Another stopping criterion based on the number of iterations (to avoid looping forever if the system doesn't converge, or converges too slowly).
A redefinition of the gradient dx (still similar to yours). Your version has the sign inverted, which is why your weight update is positive (+) while in my example it is negative (-); that makes sense, since you are moving down the gradient.
Indexing of data and labels: while data[i] returns an array of shape (2,) (for 100x2 data), the slice data[i:i+1] returns a view of the data without dropping a dimension (shape (1, 2)), which allows the proper matrix multiplications; see the short demo after this list.
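A tiny demo of that indexing difference:

import numpy as np

a = np.random.randn(100, 2)
print(a[3].shape)    # (2,)   -- integer indexing drops a dimension
print(a[3:4].shape)  # (1, 2) -- slice indexing keeps the 2-d shape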
You can add a third stopping criterion based on an acceptable MSE, e.g. if cur_mse < 1e-3: break.
This algorithm, with random data, converges in 20-40 iterations for me (depending on the generated random data).
So, assuming this is the function you want to minimize: if this method doesn't work for you, it might mean that your system is underdetermined (you have less training data than features, which means A is wider than it is tall).
Hope it helps!
