How to do a gradient descent problem (machine learning)? - python

could somebody please explain how to do a gradient descent problem WITHOUT the context of the cost function? I have seen countless tutorials that explain gradient descent using the cost function, but I really don't understand how it works in a more general sense.
I am given a 3D function:
z = 3*((1-xx)2) * np.exp(-(xx2) - (yy+1)2) \
- 10*(xx/5 - xx3 - yy5) * np.exp(-xx2 - yy2)- (1/3)* np.exp(-(xx+1)**2 - yy2)
And I am asked to:
Code a simple gradient algorithm. Set the parameters as follows:
learning rate = step size: 0.1
Max number of iterations: 20
Stopping criterion: 0.0001 (Your iterations should stop when your gradient is smaller than the threshold)
Then start your algorithm at
(x0 = 0.5, y0 = -0.5)
(x0 = -0.3, y0 = -0.3)
I have seen this piece of code floating around wherever gradient descent is talked about:
def update_weights(m, b, X, Y, learning_rate):
m_deriv = 0
b_deriv = 0
N = len(X)
for i in range(N):
# Calculate partial derivatives
# -2x(y - (mx + b))
m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))
# -2(y - (mx + b))
b_deriv += -2*(Y[i] - (m*X[i] + b))
# We subtract because the derivatives point in direction of steepest ascent
m -= (m_deriv / float(N)) * learning_rate
b -= (b_deriv / float(N)) * learning_rate
return m, b
enter code here
But I don't understand how to use it for my problem. How does my function fit in there? What do I adjust instead of m and b? I'm very very confused.
Thank you.

Gradient Descent is optimization algorithm for finding the minimum of a function.
Very simplified view
Lets start with a 1D function y = f(x)
Lets start at an arbitrary value of x and find the gradient (slope) of f(x).
If the slope is decreasing at x then it means we have to go further toward (right of number line) x (for reaching the minimum)
If the slope is increasing at x then it means we have to go away from (left of number line) x
We can get the slope by taking the derivative of the function. The derivative is -ve if the slop is decreasing and +ve if the slope is increasing
So we can start at some arbitrary value of x and slowly move toward the minimum using the derivatives at that value of x. How slowly we are moving is determined by the learning rate or step size. so we have the update rule
x = x - df_dx*lr
We can see that if the slope is decreasing the derivative (df_dx) is -ve and x is increasing and so x is moving to further right. On the other hand if slope is increasing the df_dx is +ve which decreases x and so we are moving toward left.
We continue this either for some large number of times or until the derivative is very small
Multivariate function z = f(x,y)
The same logic as above applies except now we take the partial derivatives instead of derivative.
Update rule is
x = x - dpf_dx*lr
y = y - dpf_dy*lr
Where dpf_dx is the partial derivative of f with respect to x
The above algorithm is called the gradient decent algorithm. In Machine learning the f(x,y) is a cost/loss function whose minimum we are interested in.
Example
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D
from pylab import meshgrid
from scipy.optimize import fmin
import math
def z_func(a):
x, y = a
return ((x-1)**2+(y-2)**2)
x = np.arange(-3.0,3.0,0.1)
y = np.arange(-3.0,3.0,0.1)
X,Y = meshgrid(x, y) # grid of point
Z = z_func((X, Y)) # evaluation of the function on the grid
fig = plt.figure()
ax = fig.gca(projection='3d')
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1,linewidth=0, antialiased=False)
plt.show()
The min of z_func is at (1,2). This can be verified using the fmin function of scipy
fmin(z_func,np.array([10,10]))
Now lets write our own gradient decent algorithm to find the min of z_func
def gradient_decent(x,y,lr):
while True:
d_x = 2*(x-1)
d_y = 2*(y-2)
x -= d_x*lr
y -= d_y*lr
if d_x < 0.0001 and d_y < 0.0001:
break
return x,y
print (gradient_decent(10,10,0.1)
We are starting at some arbitrary value x=10 and y=10 and a learning rate of 0.1. The above code prints 1.000033672997724 2.0000299315535326 which is correct.
So if you have a continuous differentiable convex function, to find its optimal (which is minimal for a convex) all you have to do is find the partial derivatives of the function with respect to each variable and use the update rule mentioned above. Repeat the steps until the gradients are small which mean we have reached the minima for a convex function.
If the function is not convex, we might get stuck in a local optima.

Related

How to fit a piecewise (alternating linear and constant segments) function to a parabolic function?

I do have a function, for example , but this can be something else as well, like a quadratic or logarithmic function. I am only interested in the domain of . The parameters of the function (a and k in this case) are known as well.
My goal is to fit a continuous piece-wise function to this, which contains alternating segments of linear functions (i.e. sloped straight segments, each with intercept of 0) and constants (i.e. horizontal segments joining the sloped segments together). The first and last segments are both sloped. And the number of segments should be pre-selected between around 9-29 (that is 5-15 linear steps + 4-14 constant plateaus).
Formally
The input function:
The fitted piecewise function:
I am looking for the optimal resulting parameters (c,r,b) (in terms of least squares) if the segment numbers (n) are specified beforehand.
The resulting constants (c) and the breakpoints (r) should be whole natural numbers, and the slopes (b) round two decimal point values.
I have tried to do the fitting numerically using the pwlf package using a segmented constant models, and further processed the resulting constant model with some graphical intuition to "slice" the constant steps with the slopes. It works to some extent, but I am sure this is suboptimal from both fitting perspective and computational efficiency. It takes multiple minutes to generate a fitting with 8 slopes on the range of 1-50000. I am sure there must be a better way to do this.
My idea would be to instead using only numerical methods/ML, the fact that we have the algebraic form of the input function could be exploited in some way to at least to use algebraic transforms (integrals) to get to a simpler optimization problem.
import numpy as np
import matplotlib.pyplot as plt
import pwlf
# The input function
def input_func(x,k,a):
return np.power(x,1/a)*k
x = np.arange(1,5e4)
y = input_func(x, 1.8, 1.3)
plt.plot(x,y);
def pw_fit(func, x_r, no_seg, *fparams):
# working on the specified range
x = np.arange(1,x_r)
y_input = func(x, *fparams)
my_pwlf = pwlf.PiecewiseLinFit(x, y_input, degree=0)
res = my_pwlf.fit(no_seg)
yHat = my_pwlf.predict(x)
# Function values at the breakpoints
y_isec = func(res, *fparams)
# Slope values at the breakpoints
slopes = np.round(y_isec / res, decimals=2)
slopes = slopes[1:]
# For the first slope value, I use the intersection of the first constant plateau and the input function
slopes = np.insert(slopes,0,np.round(y_input[np.argwhere(np.diff(np.sign(y_input - yHat))).flatten()[0]] / np.argwhere(np.diff(np.sign(y_input - yHat))).flatten()[0], decimals=2))
plateaus = np.unique(np.round(yHat))
# If due to rounding slope values (to two decimals), there is no change in a subsequent step, I just remove those segments
to_del = np.argwhere(np.diff(slopes) == 0).flatten()
slopes = np.delete(slopes,to_del + 1)
plateaus = np.delete(plateaus,to_del)
breakpoints = [np.ceil(plateaus[0]/slopes[0])]
for idx, j in enumerate(slopes[1:-1]):
breakpoints.append(np.floor(plateaus[idx]/j))
breakpoints.append(np.ceil(plateaus[idx+1]/j))
breakpoints.append(np.floor(plateaus[-1]/slopes[-1]))
return slopes, plateaus, breakpoints
slo, plat, breaks = pw_fit(input_func, 50000, 8, 1.8, 1.3)
# The piecewise function itself
def pw_calc(x, slopes, plateaus, breaks):
x = x.astype('float')
cond_list = [x < breaks[0]]
for idx, j in enumerate(breaks[:-1]):
cond_list.append((j <= x) & (x < breaks[idx+1]))
cond_list.append(breaks[-1] <= x)
func_list = [lambda x: x * slopes[0]]
for idx, j in enumerate(slopes[1:]):
func_list.append(plateaus[idx])
func_list.append(lambda x, j=j: x * j)
return np.piecewise(x, cond_list, func_list)
y_output = pw_calc(x, slo, plat, breaks)
plt.plot(x,y,y_output);
(Not important, but I think the fitted piecewise function is not continuous as it is. Intervals should be x<=r1; r1<x<=r2; ....)
As Anatolyg has pointed out, it looks to me that in the optimal solution (for the function posted at least, and probably for any where the derivative is different from zero), the horizantal segments will collapse to a point or the minimum segment length (in this case 1).
EDIT---------------------------------------------
The behavior above could only be valid if the slopes could have an intercept. If the intercepts are zero, as posted in the question, one consideration must be taken into account: Is the initial parabolic function defined in zero or nearby? Imagine the function y=0.001 *sqrt(x-1000), then the segments defined as b*x will have a slope close to zero and will be so similar to the constant segments that the best fit will be just the line that without intercept that fits better all the function.
Provided that the function is defined in zero or nearby, you can start by approximating the curve just by linear segments (with intercepts):
divide the function domain in N intervals(equal intervals or whose size is a function of the average curvature (or second derivative) of the function along the domain).
linear fit/regression in each intervals
for each interval, if a point (or bunch of points) in the extreme of any interval is better fitted by the line of the neighbor interval than the line of its interval, this point is assigned to the neighbor interval.
Repeat from 2) until no extreme points are moved.
Linear regressions might be optimized not to calculate all the covariance matrixes from scratch on each iteration, but just adding the contributions of the moved points to the previous covariance matrixes.
Then each linear segment (LSi) is replaced by a combination of a small constant segment at the beginning (Cbi), a linear segment without intercept (Si), and another constant segment at the end (Cei). This segments are easy to calculate as Si will contain the middle point of LSi, and Cbi and Cei will have respectively the begin and end values of the segment LSi. Then the intervals of each segment has to be calculated as an intersection between lines.
With this, the constant end segment will be collinear with the constant begin segment from the next interval so they will merge, resulting in a series of constant and linear segments interleaved.
But this would be a floating point start solution. Next, you will have to apply all the roundings which will mess up quite a lot all the segments as the conditions integer intervals and linear segments without slope can be very confronting. In fact, b,c,r are not totally independent. If ci and ri+1 are known, then bi+1 is already fixed
If nothing is broken so far, the final task will be to minimize the error/cost function (I assume that it will be the integral of the error between the parabolic function and the segments). My guess is that gradients here will be quite a pain, as if you change for example one ci, all the rest of the bj and cj will have to adapt as well due to the integer intervals restriction. However, if you can generalize the derivatives between parameters ( how much do I have to adapt bi+1 if ci changes a unit), you can propagate the change of one parameter to all other parameters and have kind of a gradient. Then for each interval, you can estimate what would be the ideal parameter and averaging all intervals calculate the best gradient step. Let me illustrate this:
Assuming first that r parameters are fixed, if I change c1 by one unit, b2 changes by 0.1, c2 changes by -0.2 and b3 changes by 0.2. This would be the gradient.
Then I estimate, comparing with the parabolic curve, that c1 should increase 0.5 (to reduce the cost by 10 points), b2 should increase 0.2 (to reduce the cost by 5 points), c2 should increase 0.2 (to reduce the cost by 6 points) and b3 should increase 0.1 (to reduce the cost by 9 points).
Finally, the gradient step would be (0.5/1·10 + 0.2/0.1·5 - 0.2/(-0.2)·6 + 0.1/0.2·9)/(10 + 5 + 6 + 9)~= 0.45. Thus, c1 would increase 0.45 units, b2 would increase 0.45·0.1, and so on.
When you add the r parameters to the pot, as integer intervals do not have an proper derivative, calculation is not straightforward. However, you can consider r parameters as floating points, calculate and apply the gradient step and then apply the roundings.
We can integrate the squared error function for linear and constant pieces and let SciPy optimize it. Python 3:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize
xl = 1
xh = 50000
a = 1.3
p = 1 / a
n = 8
def split_b_and_c(bc):
return bc[::2], bc[1::2]
def solve_for_r(b, c):
r = np.empty(2 * n)
r[0] = xl
r[1:-1:2] = c / b[:-1]
r[2::2] = c / b[1:]
r[-1] = xh
return r
def linear_residual_integral(b, x):
return (
(x ** (2 * p + 1)) / (2 * p + 1)
- 2 * b * x ** (p + 2) / (p + 2)
+ b ** 2 * x ** 3 / 3
)
def constant_residual_integral(c, x):
return x ** (2 * p + 1) / (2 * p + 1) - 2 * c * x ** (p + 1) / (p + 1) + c ** 2 * x
def squared_error(bc):
b, c = split_b_and_c(bc)
r = solve_for_r(b, c)
linear = np.sum(
linear_residual_integral(b, r[1::2]) - linear_residual_integral(b, r[::2])
)
constant = np.sum(
constant_residual_integral(c, r[2::2])
- constant_residual_integral(c, r[1:-1:2])
)
return linear + constant
def evaluate(x, b, c, r):
i = 0
while x > r[i + 1]:
i += 1
return b[i // 2] * x if i % 2 == 0 else c[i // 2]
def main():
bc0 = (xl + (xh - xl) * np.arange(1, 4 * n - 2, 2) / (4 * n - 2)) ** (
p - 1 + np.arange(2 * n - 1) % 2
)
bc = scipy.optimize.minimize(
squared_error, bc0, bounds=[(1e-06, None) for i in range(2 * n - 1)]
).x
b, c = split_b_and_c(bc)
r = solve_for_r(b, c)
X = np.linspace(xl, xh, 1000)
Y = [evaluate(x, b, c, r) for x in X]
plt.plot(X, X ** p)
plt.plot(X, Y)
plt.show()
if __name__ == "__main__":
main()
I have tried to come up with a new solution myself, based on the idea of #Amo Robb, where I have partitioned the domain, and curve fitted a dual - constant and linear - piece together (with the help of np.maximum). I have used the 1 / f(x)' as the function to designate the breakpoints, but I know this is arbitrary and does not provide a global optimum. Maybe there is some optimal function for these breakpoints. But this solution is OK for me, as it might be appropriate to have a better fit at the first segments, at the expense of the error for the later segments. (The task itself is actually a cost based retail margin calculation {supply price -> added margin}, as the retail POS software can only work with such piecewise margin function).
The answer from #David Eisenstat is correct optimal solution if the parameters are allowed to be floats. Unfortunately the POS software can not use floats. It is OK to round up c-s and r-s afterwards. But the b-s should be rounded to two decimals, as those are inputted as percents, and this constraint would ruin the optimal solution with long floats. I will try to further improve my solution with both Amo's and David's valuable input. Thank You for that!
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# The input function f(x)
def input_func(x,k,a):
return np.power(x,1/a) * k
# 1 / f(x)'
def one_per_der(x,k,a):
return a / (k * np.power(x, 1/a-1))
# 1 / f(x)' inverted
def one_per_der_inv(x,k,a):
return np.power(a / (x*k), a / (1-a))
def segment_fit(start,end,y,first_val):
b, _ = curve_fit(lambda x,b: np.maximum(first_val, b*x), np.arange(start,end), y[start-1:end-1])
b = float(np.round(b, decimals=2))
bp = np.round(first_val / b)
last_val = np.round(b * end)
return b, bp, last_val
def pw_fit(end_range, no_seg, **fparams):
y_bps = np.linspace(one_per_der(1, **fparams), one_per_der(end_range,**fparams) , no_seg+1)[1:]
x_bps = np.round(one_per_der_inv(y_bps, **fparams))
y = input_func(x, **fparams)
slopes = [np.round(float(curve_fit(lambda x,b: x * b, np.arange(1,x_bps[0]), y[:int(x_bps[0])-1])[0]), decimals = 2)]
plats = [np.round(x_bps[0] * slopes[0])]
bps = []
for i, xbp in enumerate(x_bps[1:]):
b, bp, last_val = segment_fit(int(x_bps[i]+1), int(xbp), y, plats[i])
slopes.append(b); bps.append(bp); plats.append(last_val)
breaks = sorted(list(x_bps) + bps)[:-1]
# If due to rounding slope values (to two decimals), there is no change in a subsequent step, I just remove those segments
to_del = np.argwhere(np.diff(slopes) == 0).flatten()
breaks_to_del = np.concatenate((to_del * 2, to_del * 2 + 1))
slopes = np.delete(slopes,to_del + 1)
plats = np.delete(plats[:-1],to_del)
breaks = np.delete(breaks,breaks_to_del)
return slopes, plats, breaks
def pw_calc(x, slopes, plateaus, breaks):
x = x.astype('float')
cond_list = [x < breaks[0]]
for idx, j in enumerate(breaks[:-1]):
cond_list.append((j <= x) & (x < breaks[idx+1]))
cond_list.append(breaks[-1] <= x)
func_list = [lambda x: x * slopes[0]]
for idx, j in enumerate(slopes[1:]):
func_list.append(plateaus[idx])
func_list.append(lambda x, j=j: x * j)
return np.piecewise(x, cond_list, func_list)
fparams = {'k':1.8, 'a':1.2}
end_range = 5e4
no_steps = 10
x = np.arange(1, end_range)
y = input_func(x, **fparams)
slopes, plats, breaks = pw_fit(end_range, no_steps, **fparams)
y_output = pw_calc(x, slopes, plats, breaks)
plt.plot(x,y_output,y);

Cost Function and Gradient Seem to be Working, but scipy.optimize functions are not

I'm working through my Matlab code for the Andrew NG Coursera course and turning it into python. I am working on non-regularized logistic regression and after writing my gradient and cost functions I needed something similar to fminunc and after some googling, I found a couple options. They are both returning the same results, but they do not match what is in Andrew NG's expected results code. Others seem to be getting this to work correctly, but I'm wondering why my specific code does not seem to return the desired result when using scipy.optimize functions, but does for the cost and gradient pieces earlier in the code.
The data I'm using can be found at the link below;
ex2data1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as op
#Machine Learning Online Class - Exercise 2: Logistic Regression
#Load Data
#The first two columns contains the exam scores and the third column contains the label.
data = pd.read_csv('ex2data1.txt', header = None)
X = np.array(data.iloc[:, 0:2]) #100 x 3
y = np.array(data.iloc[:,2]) #100 x 1
y.shape = (len(y), 1)
#Creating sub-dataframes for plotting
pos_plot = data[data[2] == 1]
neg_plot = data[data[2] == 0]
#==================== Part 1: Plotting ====================
#We start the exercise by first plotting the data to understand the
#the problem we are working with.
print('Plotting data with + indicating (y = 1) examples and o indicating (y = 0) examples.')
plt.plot(pos_plot[0], pos_plot[1], "+", label = "Admitted")
plt.plot(neg_plot[0], neg_plot[1], "o", label = "Not Admitted")
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()
def sigmoid(z):
'''
SIGMOID Compute sigmoid function
g = SIGMOID(z) computes the sigmoid of z.
Instructions: Compute the sigmoid of each value of z (z can be a matrix,
vector or scalar).
'''
g = 1 / (1 + np.exp(-z))
return g
def costFunction(theta, X, y):
'''
COSTFUNCTION Compute cost and gradient for logistic regression
J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
parameter for logistic regression and the gradient of the cost
w.r.t. to the parameters.
'''
m = len(y) #number of training examples
h = sigmoid(X.dot(theta)) #logisitic regression hypothesis
J = (1/m) * np.sum((-y*np.log(h)) - ((1-y)*np.log(1-h)))
#h is 100x1, y is %100x1, these end up as 2 vector we subtract from each other
#then we sum the values by rows
#cost function for logisitic regression
return J
def gradient(theta, X, y):
m = len(y)
grad = np.zeros((theta.shape))
h = sigmoid(X.dot(theta))
for i in range(len(theta)): #number of rows in theta
XT = X[:,i]
XT.shape = (len(X),1)
grad[i] = (1/m) * np.sum((h-y)*XT) #updating each row of the gradient
return grad
#============ Part 2: Compute Cost and Gradient ============
#In this part of the exercise, you will implement the cost and gradient
#for logistic regression. You neeed to complete the code in costFunction.m
#Add intercept term to x and X_test
Bias = np.ones((len(X), 1))
X = np.column_stack((Bias, X))
#Initialize fitting parameters
initial_theta = np.zeros((len(X[0]), 1))
#Compute and display initial cost and gradient
(cost, grad) = costFunction(initial_theta, X, y), gradient(initial_theta, X, y)
print('Cost at initial theta (zeros): %f' % cost)
print('Expected cost (approx): 0.693\n')
print('Gradient at initial theta (zeros):')
print(grad)
print('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628')
#Compute and display cost and gradient with non-zero theta
test_theta = np.array([[-24], [0.2], [0.2]]);
(cost, grad) = costFunction(test_theta, X, y), gradient(test_theta, X, y)
print('\nCost at test theta: %f' % cost)
print('Expected cost (approx): 0.218\n')
print('Gradient at test theta:')
print(grad)
print('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n')
result = op.fmin_tnc(func = costFunction, x0 = initial_theta, fprime = gradient, args = (X,y))
result[1]
Result = op.minimize(fun = costFunction,
x0 = initial_theta,
args = (X, y),
method = 'TNC',
jac = gradient, options={'gtol': 1e-3, 'disp': True, 'maxiter': 1000})
theta = Result.x
theta
test = np.array([[1, 45, 85]])
prob = sigmoid(test.dot(theta))
print('For a student with scores 45 and 85, we predict an admission probability of %f,' % prob)
print('Expected value: 0.775 +/- 0.002\n')
This was a very difficult problem to debug, and illustrates a poorly documented aspect of the scipy.optimize interface. The documentation vaguely indicates that theta will be passed around as a vector:
Minimization of scalar function of one or more variables.
In general, the optimization problems are of the form:
minimize f(x) subject to
g_i(x) >= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where x is a vector of one or more variables.
What's important is that they really mean vector in the most primitive sense, a 1-dimensional array. So you have to expect that whenever theta is passed into one of your callbacks, it will be passed in as a 1-d array. But in numpy, 1-d arrays sometimes behave differently from 2-d row arrays (and, obviously, from 2-d column arrays).
I don't know exactly why it's causing a problem in your case, but it's easily fixed regardless. You just have to add the following at the top of both your cost function and your gradient function:
theta = theta.reshape(-1, 1)
This guarantees that theta will be a 2-d column array, as expected. Once you've done this, the results are correct.
I have had similar issues with Scipy dealing with the same problem as you. As senderle points out the interface is not the easiest to deal with, especially combined with the numpy array interface... Here is my implementation which works as expected.
Defining the cost and gradient functions
Note that initial_theta is passed as a simple array of shape (3,) and converted to a column vector of shape (3,1) within the function. The gradient function then returns the grad.ravel() which has shape (3,) again. This is important as doing otherwise caused an error message with various optimization methods in Scipy.optimize.
Note that different methods have different behaviours but returning .ravel() seems to fix most issues...
import pandas as pd
import numpy as np
import scipy.optimize as opt
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def CostFunc(theta,X,y):
#Initializing variables
m = len(y)
J = 0
grad = np.zeros(theta.shape)
#Vectorized computations
z = X # theta
h = sigmoid(z)
J = (1/m) * ( (-y.T # np.log(h)) - (1 - y).T # np.log(1-h));
return J
def Gradient(theta,X,y):
#Initializing variables
m = len(y)
theta = theta[:,np.newaxis]
grad = np.zeros(theta.shape)
#Vectorized computations
z = X # theta
h = sigmoid(z)
grad = (1/m)*(X.T # ( h - y));
return grad.ravel() #<-- This is the trick
Initializing variables and parameters
Note that initial_theta.shape returns (3,)
X = data1.iloc[:,0:2].values
m,n = X.shape
X = np.concatenate((np.ones(m)[:,np.newaxis],X),1)
y = data1.iloc[:,-1].values[:,np.newaxis]
initial_theta = np.zeros((n+1))
Calling Scipy.optimize
model = opt.minimize(fun = CostFunc, x0 = initial_theta, args = (X, y), method = 'TNC', jac = Gradient)
Any comments from more knowledgeable people are welcome, this Scipy interface is a mystery to me, thanks

Gradient Descent Variation doesn't work

I try to implement the Stochastic Gradient Descent Algorithm.
The first solution works:
def gradientDescent(x,y,theta,alpha):
xTrans = x.transpose()
for i in range(0,99):
hypothesis = np.dot(x,theta)
loss = hypothesis - y
gradient = np.dot(xTrans,loss)
theta = theta - alpha * gradient
return theta
This solution gives the right theta values but the following algorithm
doesnt work:
def gradientDescent2(x,y,theta,alpha):
xTrans = x.transpose();
for i in range(0,99):
hypothesis = np.dot(x[i],theta)
loss = hypothesis - y[i]
gradientThetaZero= loss * x[i][0]
gradientThetaOne = loss * x[i][1]
theta[0] = theta[0] - alpha * gradientThetaZero
theta[1] = theta[1] - alpha * gradientThetaOne
return theta
I don't understand why solution 2 does not work, basically it
does the same like the first algorithm.
I use the following code to produce data:
def genData():
x = np.random.rand(100,2)
y = np.zeros(shape=100)
for i in range(0, 100):
x[i][0] = 1
# our target variable
e = np.random.uniform(-0.1,0.1,size=1)
y[i] = np.sin(2*np.pi*x[i][1]) + e[0]
return x,y
And use it the following way:
x,y = genData()
theta = np.ones(2)
theta = gradientDescent2(x,y,theta,0.005)
print(theta)
I hope you can help me!
Best regards, Felix
Your second code example overwrites the gradient computation on each iteration over your observation data.
In the first code snippet, you properly adjust your parameters in each looping iteration based on the error (loss function).
In the second code snippet, you calculate the point-wise gradient computation in each iteration, but then don't do anything with it. That means that your final update effectively only trains on the very last data point.
If instead you accumulate the gradients within the loop by summing ( += ), it should be closer to what you're looking for (as an expression of the gradient of the loss function with respect to your parameters over the entire observation set).

How to approximate a value from a formula

So I have one vector of alpha, one vector of beta, and I am trying to find a theta for when the sum of all the estimates (for alpha's 1 to N and beta's 1 to N) equals 60:
def CalcTheta(grensscore, alpha, beta):
theta = 0.0001
estimate = [grensscore-1]
while(sum(estimate) < grensscore):
theta += 0.00001
for x in range(len(beta)):
if x == 0:
estimate = []
estimate.append(math.exp(alpha[x] * (theta - beta[x])) /
(1 + math.exp(alpha[x] * (theta - beta[x]))))
return(theta)
Basically what I did is start from theta = 0.0001, and iterate through, calculating all these sums, and when it is lower than 60, continue by adding 0.0001 each time, while above 60 means we found the theta.
I found the value theta this way. Problem is, it took me about 60 seconds using Python, to find a theta of 0.456.
What is quicker approach to find this theta (since I would like to apply this for other data)?
If you know a lower and an upper bound for θ, and the function is monotonic in the range between these, then you could employ a bisection algorithm to easily and quickly find the desired value.

Nonlinear e^(-x) regression using scipy, python, numpy

The code below is giving me a flat line for the line of best fit rather than a nice curve along the model of e^(-x) that would fit the data. Can anyone show me how to fix the code below so that it fits my data?
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize
def _eNegX_(p,x):
x0,y0,c,k=p
y = (c * np.exp(-k*(x-x0))) + y0
return y
def _eNegX_residuals(p,x,y):
return y - _eNegX_(p,x)
def Get_eNegX_Coefficients(x,y):
print 'x is: ',x
print 'y is: ',y
# Calculate p_guess for the vectors x,y. Note that p_guess is the
# starting estimate for the minimization.
p_guess=(np.median(x),np.min(y),np.max(y),.01)
# Calls the leastsq() function, which calls the residuals function with an initial
# guess for the parameters and with the x and y vectors. Note that the residuals
# function also calls the _eNegX_ function. This will return the parameters p that
# minimize the least squares error of the _eNegX_ function with respect to the original
# x and y coordinate vectors that are sent to it.
p, cov, infodict, mesg, ier = scipy.optimize.leastsq(
_eNegX_residuals,p_guess,args=(x,y),full_output=1,warning=True)
# Define the optimal values for each element of p that were returned by the leastsq() function.
x0,y0,c,k=p
print('''Reference data:\
x0 = {x0}
y0 = {y0}
c = {c}
k = {k}
'''.format(x0=x0,y0=y0,c=c,k=k))
print 'x.min() is: ',x.min()
print 'x.max() is: ',x.max()
# Create a numpy array of x-values
numPoints = np.floor((x.max()-x.min())*100)
xp = np.linspace(x.min(), x.max(), numPoints)
print 'numPoints is: ',numPoints
print 'xp is: ',xp
print 'p is: ',p
pxp=_eNegX_(p,xp)
print 'pxp is: ',pxp
# Plot the results
plt.plot(x, y, '>', xp, pxp, 'g-')
plt.xlabel('BPM%Rest')
plt.ylabel('LVET/BPM',rotation='vertical')
plt.xlim(0,3)
plt.ylim(0,4)
plt.grid(True)
plt.show()
return p
# Declare raw data for use in creating regression equation
x = np.array([1,1.425,1.736,2.178,2.518],dtype='float')
y = np.array([3.489,2.256,1.640,1.043,0.853],dtype='float')
p=Get_eNegX_Coefficients(x,y)
It looks like it's a problem with your initial guesses; something like (1, 1, 1, 1) works fine:
You have
p_guess=(np.median(x),np.min(y),np.max(y),.01)
for the function
def _eNegX_(p,x):
x0,y0,c,k=p
y = (c * np.exp(-k*(x-x0))) + y0
return y
So that's test_data_maxe^( -.01(x - test_data_median)) + test_data_min
I don't know much about the art of choosing good starting parameters, but I can say a few things. leastsq is finding a local minimum here - the key in choosing these values is to find the right mountain to climb, not to try to cut down on the work that the minimization algorithm has to do. Your initial guess looks like this (green):
(1.736, 0.85299999999999998, 3.4889999999999999, 0.01)
which results in your flat line (blue):
(-59.20295956, 1.8562 , 1.03477144, 0.69483784)
Greater gains were made in adjusting the height of the line than in increasing the k value. If you know you're fitting to this kind of data, use a larger k. If you don't know, I guess you could try to find a decent k value by sampling your data, or working back from the slope between an average of the first half and the second half, but I wouldn't know how to go about that.
Edit: You could also start with several guesses, run the minimization several times, and take the line with the lowest residuals.

Categories