Got stuck while make a ML model from scratch

Got stuck while make a ML model from scratch - python

I have a CSV file of various persons with 8 parameters to determine whether the person is diabetic or not.
You will get the CSV file from here
I am making a model that will train and predict if a person is diabetic or not without using of third-party applications like Tensorlfow Scikitlearn etc. I am making it from scratch.
here is my code:
from numpy import genfromtxt
import numpy as np
my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
X,Y = my_data[1: ,:-1], my_data[1: ,-1:] #striping data and output from my_data
def sigmoid(x):
return (1/(1+np.exp(-x)))
m = X.shape[0]
def propagate(W, b, X, Y):
#forward propagation
A = sigmoid(np.dot(X, W) + b)
cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))
print(cost)
#backward propagation
dw = (1 / m) * np.dot(X.T, (A - Y))
db = (1 / m) * np.sum(A - Y)
return(dw, db, cost)
def optimizer(W,b,X,Y,number_of_iterration,learning_rate):
for i in range(number_of_iterration):
dw, db, cost = propagate(W,b,X,Y)
W = W - learning_rate*dw
b = b - learning_rate*db
return(W, b)
W = np.zeros((X.shape[1],1))
b = 0
W,b = optimizer(W, b, X, Y, 100, 0.05)
The output which is getting generated is:
It is in this link please take a look.
I have tried to -
initialize the value of W with random numbers.
spent a lot of time to debug but cannot find what I have done wrong

This short answer is that your learning rate is about 500x too big for this problem. Think about it like you're trying to pilot your W vector into a canyon in the cost function. At each step, the gradient tells you which way is down hill, but the steps you take in that direction are so big that you jump over the canyon and end up on the other side. Each time this happens, your cost goes up because you're getting farther and farther out of the canyon until after 2 iterations, it blows up.
If you replace the line
W,b = optimizer(W, b, X, Y, 100, 0.05)
with
W,b = optimizer(W, b, X, Y, 100, 0.0001)
It will converge, though still not at a reasonable speed. (Side note, there's no good way to know the learning rate you need for a given problem. You just try lower and lower values until your cost value doesn't diverge.)
The longer answer is that the problem is that your features are all on different scales.
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
yields
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
This means that the variations in the numbers of the second feature are about 100x as large as the variations in the numbers of the second to last feature which in turn means that the number of the second value in your W vector will have to be tuned to about 100x the precision of the value of the second to last number in your W vector.
There are two ways to deal with this in practice. First, you could use a fancier optimizer. Instead of basic gradient descent, you could use gradient descent with momentum, but that would change all your code. The second, simpler, way is just to scale your features so they're all about the same size.
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
X -= col_means
X /= col_stds
W, b = optimizer(W, b, X, Y, 100, 1.0)
Here we subtract the mean value of each feature and divide each feature's value by its standard deviation. Sometimes newbies are thrown off by this -- "you can't change your data values, that changes the problem" -- but it makes sense if you realize that it's just another mathematical transformation, just like multiplying by W, adding b, taking the sigmoid, etc. The only catch is that you've got to make sure you do the same thing for any future data. Just like the values of your W vector are learned parameters of your model, the values of the col_means and col_stds are too, so you've got to save them like W and b and use them if you want to perform inference with this model on new data in the future.
That lets us use a much bigger learning rater of 1.0 because now all the features are approximately the same size.
Now if you try, you'll get the following output:
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
0.6931471805599452
0.5902957589079032
0.5481784378158732
0.5254804089153315
...
0.4709931321295562
0.4709931263193595
0.47099312122176273
0.4709931167488006
0.470993112823447
This is what you want. Your cost function is going down at each step and at the end of your 100 iterations, the cost is stable to ~8 significant figures, so dropping it more probably won't do much.
Welcome to machine learning!

The problem is with your initialization of the weight and bias. It’s important that you don’t initialize at least the weights to zero and instead initialize them with some random small numbers. The value of A is coming out to be zero making your cost function undefined
Update:
Try something like this:
from numpy import genfromtxt
import numpy as np
# my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
# X,Y = my_data[1: ,:-1], my_data[1: ,-1:] #striping data and output from my_data
# Using random data
n_points = 100
n_neurons = 5
X = np.random.rand(n_points, n_neurons) # 5 dimensional data from uniform distribution [0, 1)
Y = np.random.randint(low=0, high=2, size=(n_points, 1)) # Binary labels
def sigmoid(x):
return (1/(1+np.exp(-x)))
m = X.shape[0]
def propagate(W, b, X, Y):
#forward propagation
A = sigmoid(np.dot(X, W) + b)
cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))
print(cost)
#backward propagation
dw = (1 / m) * np.dot(X.T, (A - Y))
db = (1 / m) * np.sum(A - Y)
return(dw, db, cost)
def optimizer(W,b,X,Y,number_of_iterration,learning_rate):
for i in range(number_of_iterration):
dw, db, cost = propagate(W,b,X,Y)
W = W - learning_rate*dw
b = b - learning_rate*db
return(W, b)
W = np.random.normal(loc=0, scale=0.01, size=(n_neurons, 1)) # Drawing random initialization from gaussian
b = 0
W,b = optimizer(W, b, X, Y, 100, 0.05)

Your NaN problem is simply due to np.log encountering a zero value. You always want to scale your X values. Statistical (mean, std) normalization will work, but I find min-max scaling works best. Here is code for that:
def minmax_scaler(x):
min = np.nanmin(x, axis=0)
max = np.nanmax(x, axis=0)
return (x-min)/(max-min)
Also, your neural net has only one neuron. When you call np.dot(X, W) these should be matrices of shape (cases, features) and (features, neurons) respectively. So, now your initialization code looks like this:
X = minmax_scaler(X)
neurons = 10
learning_rate = 0.05
W = np.random.random((X.shape[1], neurons))
b = np.zeros((1, neurons)) # b width to match W
I got decent convergence without needing to change the learning rate. See chart:
This is such a small dataset that, even with 10-20 neurons, you are in danger of overfitting it. Ordinarily, you would code a predict() method and an accuracy check, and then set aside some of the data to test for overfitting.

Related

How to fit a piecewise (alternating linear and constant segments) function to a parabolic function?

I do have a function, for example , but this can be something else as well, like a quadratic or logarithmic function. I am only interested in the domain of . The parameters of the function (a and k in this case) are known as well.
My goal is to fit a continuous piece-wise function to this, which contains alternating segments of linear functions (i.e. sloped straight segments, each with intercept of 0) and constants (i.e. horizontal segments joining the sloped segments together). The first and last segments are both sloped. And the number of segments should be pre-selected between around 9-29 (that is 5-15 linear steps + 4-14 constant plateaus).
Formally
The input function:
The fitted piecewise function:
I am looking for the optimal resulting parameters (c,r,b) (in terms of least squares) if the segment numbers (n) are specified beforehand.
The resulting constants (c) and the breakpoints (r) should be whole natural numbers, and the slopes (b) round two decimal point values.
I have tried to do the fitting numerically using the pwlf package using a segmented constant models, and further processed the resulting constant model with some graphical intuition to "slice" the constant steps with the slopes. It works to some extent, but I am sure this is suboptimal from both fitting perspective and computational efficiency. It takes multiple minutes to generate a fitting with 8 slopes on the range of 1-50000. I am sure there must be a better way to do this.
My idea would be to instead using only numerical methods/ML, the fact that we have the algebraic form of the input function could be exploited in some way to at least to use algebraic transforms (integrals) to get to a simpler optimization problem.
import numpy as np
import matplotlib.pyplot as plt
import pwlf
# The input function
def input_func(x,k,a):
return np.power(x,1/a)*k
x = np.arange(1,5e4)
y = input_func(x, 1.8, 1.3)
plt.plot(x,y);
def pw_fit(func, x_r, no_seg, *fparams):
# working on the specified range
x = np.arange(1,x_r)
y_input = func(x, *fparams)
my_pwlf = pwlf.PiecewiseLinFit(x, y_input, degree=0)
res = my_pwlf.fit(no_seg)
yHat = my_pwlf.predict(x)
# Function values at the breakpoints
y_isec = func(res, *fparams)
# Slope values at the breakpoints
slopes = np.round(y_isec / res, decimals=2)
slopes = slopes[1:]
# For the first slope value, I use the intersection of the first constant plateau and the input function
slopes = np.insert(slopes,0,np.round(y_input[np.argwhere(np.diff(np.sign(y_input - yHat))).flatten()[0]] / np.argwhere(np.diff(np.sign(y_input - yHat))).flatten()[0], decimals=2))
plateaus = np.unique(np.round(yHat))
# If due to rounding slope values (to two decimals), there is no change in a subsequent step, I just remove those segments
to_del = np.argwhere(np.diff(slopes) == 0).flatten()
slopes = np.delete(slopes,to_del + 1)
plateaus = np.delete(plateaus,to_del)
breakpoints = [np.ceil(plateaus[0]/slopes[0])]
for idx, j in enumerate(slopes[1:-1]):
breakpoints.append(np.floor(plateaus[idx]/j))
breakpoints.append(np.ceil(plateaus[idx+1]/j))
breakpoints.append(np.floor(plateaus[-1]/slopes[-1]))
return slopes, plateaus, breakpoints
slo, plat, breaks = pw_fit(input_func, 50000, 8, 1.8, 1.3)
# The piecewise function itself
def pw_calc(x, slopes, plateaus, breaks):
x = x.astype('float')
cond_list = [x < breaks[0]]
for idx, j in enumerate(breaks[:-1]):
cond_list.append((j <= x) & (x < breaks[idx+1]))
cond_list.append(breaks[-1] <= x)
func_list = [lambda x: x * slopes[0]]
for idx, j in enumerate(slopes[1:]):
func_list.append(plateaus[idx])
func_list.append(lambda x, j=j: x * j)
return np.piecewise(x, cond_list, func_list)
y_output = pw_calc(x, slo, plat, breaks)
plt.plot(x,y,y_output);

(Not important, but I think the fitted piecewise function is not continuous as it is. Intervals should be x<=r1; r1<x<=r2; ....)
As Anatolyg has pointed out, it looks to me that in the optimal solution (for the function posted at least, and probably for any where the derivative is different from zero), the horizantal segments will collapse to a point or the minimum segment length (in this case 1).
EDIT---------------------------------------------
The behavior above could only be valid if the slopes could have an intercept. If the intercepts are zero, as posted in the question, one consideration must be taken into account: Is the initial parabolic function defined in zero or nearby? Imagine the function y=0.001 *sqrt(x-1000), then the segments defined as b*x will have a slope close to zero and will be so similar to the constant segments that the best fit will be just the line that without intercept that fits better all the function.
Provided that the function is defined in zero or nearby, you can start by approximating the curve just by linear segments (with intercepts):
divide the function domain in N intervals(equal intervals or whose size is a function of the average curvature (or second derivative) of the function along the domain).
linear fit/regression in each intervals
for each interval, if a point (or bunch of points) in the extreme of any interval is better fitted by the line of the neighbor interval than the line of its interval, this point is assigned to the neighbor interval.
Repeat from 2) until no extreme points are moved.
Linear regressions might be optimized not to calculate all the covariance matrixes from scratch on each iteration, but just adding the contributions of the moved points to the previous covariance matrixes.
Then each linear segment (LSi) is replaced by a combination of a small constant segment at the beginning (Cbi), a linear segment without intercept (Si), and another constant segment at the end (Cei). This segments are easy to calculate as Si will contain the middle point of LSi, and Cbi and Cei will have respectively the begin and end values of the segment LSi. Then the intervals of each segment has to be calculated as an intersection between lines.
With this, the constant end segment will be collinear with the constant begin segment from the next interval so they will merge, resulting in a series of constant and linear segments interleaved.
But this would be a floating point start solution. Next, you will have to apply all the roundings which will mess up quite a lot all the segments as the conditions integer intervals and linear segments without slope can be very confronting. In fact, b,c,r are not totally independent. If ci and ri+1 are known, then bi+1 is already fixed
If nothing is broken so far, the final task will be to minimize the error/cost function (I assume that it will be the integral of the error between the parabolic function and the segments). My guess is that gradients here will be quite a pain, as if you change for example one ci, all the rest of the bj and cj will have to adapt as well due to the integer intervals restriction. However, if you can generalize the derivatives between parameters ( how much do I have to adapt bi+1 if ci changes a unit), you can propagate the change of one parameter to all other parameters and have kind of a gradient. Then for each interval, you can estimate what would be the ideal parameter and averaging all intervals calculate the best gradient step. Let me illustrate this:
Assuming first that r parameters are fixed, if I change c1 by one unit, b2 changes by 0.1, c2 changes by -0.2 and b3 changes by 0.2. This would be the gradient.
Then I estimate, comparing with the parabolic curve, that c1 should increase 0.5 (to reduce the cost by 10 points), b2 should increase 0.2 (to reduce the cost by 5 points), c2 should increase 0.2 (to reduce the cost by 6 points) and b3 should increase 0.1 (to reduce the cost by 9 points).
Finally, the gradient step would be (0.5/1·10 + 0.2/0.1·5 - 0.2/(-0.2)·6 + 0.1/0.2·9)/(10 + 5 + 6 + 9)~= 0.45. Thus, c1 would increase 0.45 units, b2 would increase 0.45·0.1, and so on.
When you add the r parameters to the pot, as integer intervals do not have an proper derivative, calculation is not straightforward. However, you can consider r parameters as floating points, calculate and apply the gradient step and then apply the roundings.

We can integrate the squared error function for linear and constant pieces and let SciPy optimize it. Python 3:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize
xl = 1
xh = 50000
a = 1.3
p = 1 / a
n = 8
def split_b_and_c(bc):
return bc[::2], bc[1::2]
def solve_for_r(b, c):
r = np.empty(2 * n)
r[0] = xl
r[1:-1:2] = c / b[:-1]
r[2::2] = c / b[1:]
r[-1] = xh
return r
def linear_residual_integral(b, x):
return (
(x ** (2 * p + 1)) / (2 * p + 1)
- 2 * b * x ** (p + 2) / (p + 2)
+ b ** 2 * x ** 3 / 3
)
def constant_residual_integral(c, x):
return x ** (2 * p + 1) / (2 * p + 1) - 2 * c * x ** (p + 1) / (p + 1) + c ** 2 * x
def squared_error(bc):
b, c = split_b_and_c(bc)
r = solve_for_r(b, c)
linear = np.sum(
linear_residual_integral(b, r[1::2]) - linear_residual_integral(b, r[::2])
)
constant = np.sum(
constant_residual_integral(c, r[2::2])
- constant_residual_integral(c, r[1:-1:2])
)
return linear + constant
def evaluate(x, b, c, r):
i = 0
while x > r[i + 1]:
i += 1
return b[i // 2] * x if i % 2 == 0 else c[i // 2]
def main():
bc0 = (xl + (xh - xl) * np.arange(1, 4 * n - 2, 2) / (4 * n - 2)) ** (
p - 1 + np.arange(2 * n - 1) % 2
)
bc = scipy.optimize.minimize(
squared_error, bc0, bounds=[(1e-06, None) for i in range(2 * n - 1)]
).x
b, c = split_b_and_c(bc)
r = solve_for_r(b, c)
X = np.linspace(xl, xh, 1000)
Y = [evaluate(x, b, c, r) for x in X]
plt.plot(X, X ** p)
plt.plot(X, Y)
plt.show()
if __name__ == "__main__":
main()

I have tried to come up with a new solution myself, based on the idea of #Amo Robb, where I have partitioned the domain, and curve fitted a dual - constant and linear - piece together (with the help of np.maximum). I have used the 1 / f(x)' as the function to designate the breakpoints, but I know this is arbitrary and does not provide a global optimum. Maybe there is some optimal function for these breakpoints. But this solution is OK for me, as it might be appropriate to have a better fit at the first segments, at the expense of the error for the later segments. (The task itself is actually a cost based retail margin calculation {supply price -> added margin}, as the retail POS software can only work with such piecewise margin function).
The answer from #David Eisenstat is correct optimal solution if the parameters are allowed to be floats. Unfortunately the POS software can not use floats. It is OK to round up c-s and r-s afterwards. But the b-s should be rounded to two decimals, as those are inputted as percents, and this constraint would ruin the optimal solution with long floats. I will try to further improve my solution with both Amo's and David's valuable input. Thank You for that!
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# The input function f(x)
def input_func(x,k,a):
return np.power(x,1/a) * k
# 1 / f(x)'
def one_per_der(x,k,a):
return a / (k * np.power(x, 1/a-1))
# 1 / f(x)' inverted
def one_per_der_inv(x,k,a):
return np.power(a / (x*k), a / (1-a))
def segment_fit(start,end,y,first_val):
b, _ = curve_fit(lambda x,b: np.maximum(first_val, b*x), np.arange(start,end), y[start-1:end-1])
b = float(np.round(b, decimals=2))
bp = np.round(first_val / b)
last_val = np.round(b * end)
return b, bp, last_val
def pw_fit(end_range, no_seg, **fparams):
y_bps = np.linspace(one_per_der(1, **fparams), one_per_der(end_range,**fparams) , no_seg+1)[1:]
x_bps = np.round(one_per_der_inv(y_bps, **fparams))
y = input_func(x, **fparams)
slopes = [np.round(float(curve_fit(lambda x,b: x * b, np.arange(1,x_bps[0]), y[:int(x_bps[0])-1])[0]), decimals = 2)]
plats = [np.round(x_bps[0] * slopes[0])]
bps = []
for i, xbp in enumerate(x_bps[1:]):
b, bp, last_val = segment_fit(int(x_bps[i]+1), int(xbp), y, plats[i])
slopes.append(b); bps.append(bp); plats.append(last_val)
breaks = sorted(list(x_bps) + bps)[:-1]
# If due to rounding slope values (to two decimals), there is no change in a subsequent step, I just remove those segments
to_del = np.argwhere(np.diff(slopes) == 0).flatten()
breaks_to_del = np.concatenate((to_del * 2, to_del * 2 + 1))
slopes = np.delete(slopes,to_del + 1)
plats = np.delete(plats[:-1],to_del)
breaks = np.delete(breaks,breaks_to_del)
return slopes, plats, breaks
def pw_calc(x, slopes, plateaus, breaks):
x = x.astype('float')
cond_list = [x < breaks[0]]
for idx, j in enumerate(breaks[:-1]):
cond_list.append((j <= x) & (x < breaks[idx+1]))
cond_list.append(breaks[-1] <= x)
func_list = [lambda x: x * slopes[0]]
for idx, j in enumerate(slopes[1:]):
func_list.append(plateaus[idx])
func_list.append(lambda x, j=j: x * j)
return np.piecewise(x, cond_list, func_list)
fparams = {'k':1.8, 'a':1.2}
end_range = 5e4
no_steps = 10
x = np.arange(1, end_range)
y = input_func(x, **fparams)
slopes, plats, breaks = pw_fit(end_range, no_steps, **fparams)
y_output = pw_calc(x, slopes, plats, breaks)
plt.plot(x,y_output,y);

Linear Regression with gradient descent: two questions

I'm trying to understand Linear Regression with Gradient Descent and I do not understand this part in my loss_gradients function below.
import numpy as np
def forward_linear_regression(X, y, weights):
# dot product weights * inputs
N = np.dot(X, weights['W'])
# add bias
P = N + weights['B']
# compute loss with MSE
loss = np.mean(np.power(y - P, 2))
forward_info = {}
forward_info['X'] = X
forward_info['N'] = N
forward_info['P'] = P
forward_info['y'] = y
return loss, forward_info
Here is where I'm stuck in my understanding, I have commented out my questions:
def loss_gradients(forward_info, weights):
# to update weights, we need: dLdW = dLdP * dPdN * dNdW
dLdP = -2 * (forward_info['y'] - forward_info['P'])
dPdN = np.ones_like(forward_info['N'])
dNdW = np.transpose(forward_info['X'], (1, 0))
dLdW = np.dot(dNdW, dLdP * dPdN)
# why do we mix matrix multiplication and dot product like this?
# Why not dLdP * dPdN * dNdW instead?
# to update biases, we need: dLdB = dLdP * dPdB
dPdB = np.ones_like(forward_info[weights['B']])
dLdB = np.sum(dLdP * dPdB, axis=0)
# why do we sum those values along axis 0?
# why not just dLdP * dPdB ?

It looks to me like this code is expecting a 'batch' of data. What I mean by that is, it's expecting that when you do forward_info and loss_gradients, you're actually passing a bunch of (X, y) pairs together. Let's say you pass B such pairs. The first dimension of all of your forward info stuff will have size B.
Now, the answers to both of your questions are the same: essentially, these lines compute the gradients (using the formulas you predicted) for each of the B terms, and then sum up all of the gradients so you get one gradient update. I encourage you to work out the logic behind the dot product yourself, because this is a very common pattern in ML, but it's a little tricky to get the hang of at first.

Linear regression implementation from scratch

I'm trying to understand the gradient descent algorithm.
Can someone please explain why I'm getting high MSE values using the following code, or if I missed some concept can you please clarify?
import numpy as np
import pandas as pd
my_data = pd.DataFrame({'x': np.arange(0,100),
'y': np.arange(0,100)})
X = my_data.iloc[:,0:1].values
y = my_data.iloc[:,1].values
def gradientDescent(X, y, lr = 0.001, n = 1000):
n_samples, n_features = X.shape
cost = []
weight = np.zeros([n_features])
b = 0
for _ in range(n):
# predict
y_hat = np.dot(X, weight) + b # y = ax + b
residual = y - y_hat
db = -(2/n_samples) * np.sum(residual)
dw = -(2/n_samples) * np.sum(X.T * residual, axis = 1)
# update weights
weight -= (lr * dw)
b -= (lr * db)
cost.append(((y-y_hat) **2).mean())
return weight, b, cost
gradientDescent(X,y)

Not an expert, but I think you are currently experiencing the exploding gradient problem. If you step through your code you will notice that your weight value is swinging from positive to negative in increasing steps. I believe you cannot find the minimum because using mse for this dataset is causing you to jump back and forth never converging. Your x and y ranges to 100, so when you look at the cost it is just blowing up.
If you want to use mse with your current x and y values you should normalize your data. You can do this by subtracting the mean and dividing by the standard deviation, or just normalize both x and y to 1.
For example:
my_data.x = my_data.x.transform(lambda x: x / x.max())
my_data.y = my_data.y.transform(lambda x: x / x.max())
If you do this you should see your cost converge to ~0 with enough iterations.

Steepest descent spitting out unreasonably large values

My implementation of steepest descent for solving Ax = b is showing some weird behavior: for any matrix large enough (~10 x 10, have only tested square matrices so far), the returned x contains all huge values (on the order of 1x10^10).
def steepestDescent(A, b, numIter=100, x=None):
"""Solves Ax = b using steepest descent method"""
warnings.filterwarnings(action="error",category=RuntimeWarning)
# Reshape b in case it has shape (nL,)
b = b.reshape(len(b), 1)
exes = []
res = []
# Make a guess for x if none is provided
if x==None:
x = np.zeros((len(A[0]), 1))
exes.append(x)
for i in range(numIter):
# Re-calculate r(i) using r(i) = b - Ax(i) every five iterations
# to prevent roundoff error. Also calculates initial direction
# of steepest descent.
if (numIter % 5)==0:
r = b - np.dot(A, x)
# Otherwise use r(i+1) = r(i) - step * Ar(i)
else:
r = r - step * np.dot(A, r)
res.append(r)
# Calculate step size. Catching the runtime warning allows the function
# to stop and return before all iterations are completed. This is
# necessary because once the solution x has been found, r = 0, so the
# calculation below divides by 0, turning step into "nan", which then
# goes on to overwrite the correct answer in x with "nan"s
try:
step = np.dot(r.T, r) / np.dot( np.dot(r.T, A), r )
except RuntimeWarning:
warnings.resetwarnings()
return x
# Update x
x = x + step * r
exes.append(x)
warnings.resetwarnings()
return x, exes, res
(exes and res are returned for debugging)
I assume the problem must be with calculating r or step (or some deeper issue) but I can't make out what it is.

The code seems correct. For example, the following test work for me (both linalg.solve and steepestDescent give the close answer, most of the time):
import numpy as np
n = 100
A = np.random.random(size=(n,n)) + 10 * np.eye(n)
print(np.linalg.eig(A)[0])
b = np.random.random(size=(n,1))
x, xs, r = steepestDescent(A,b, numIter=50)
print(x - np.linalg.solve(A,b))
The problem is in the math. This algorithm is guaranteed to converge to the correct solution if A is positive definite matrix. By adding the 10 * identity matrix to a random matrix, we increase the probability that all the eigen-values are positive
If you test with large random matrices (for example A = random.random(size=(n,n)), you are almost certain to have a negative eigenvalue, and the algorithm will not converge.

Stochastic Gradient Descent Convergence Criteria

Currently my convergence criteria for SGD checks whether the MSE error ratio is within a specific boundary.
def compute_mse(data, labels, weights):
m = len(labels)
hypothesis = np.dot(data,weights)
sq_errors = (hypothesis - labels) ** 2
mse = np.sum(sq_errors)/(2.0*m)
return mse
cur_mse = 1.0
prev_mse = 100.0
m = len(labels)
while cur_mse/prev_mse < 0.99999:
prev_mse = cur_mse
for i in range(m):
d = np.array(data[i])
hypothesis = np.dot(d, weights)
gradient = np.dot((labels[i] - hypothesis), d)/m
weights = weights + (alpha * gradient)
cur_mse = compute_mse(data, labels, weights)
if cur_mse > prev_mse:
return
The weights are update w.r.t. to a single data point in the training set.
With an alpha of 0.001, the model is supposed to have converged within a few iterations however I get no convergence. Is this convergence criteria too strict?

I'll try to answer the question. First, the pseudocode of stochastic gradient descent looks something like this:
input: f(x), alpha, initial x (guess or random)
output: min_x f(x) # x that minimizes f(x)
while True:
shuffle data # good practice, not completely needed
for d in data:
x -= alpha * grad(f(x)) # df/dx
if <stopping criterion>:
break
There can be other regularization parameters added to the function that you want to minimize, such as the l1 penalty to avoid overfitting.
Going back to your problem, looking at your data and definition of the gradient, looks like you want to solve a simple linear system of equations of the form:
Ax = b
which yields the objevtive function:
f(x) = ||Ax - b||^2
stochastic gradient descent uses one row data at a time:
||A_i x - b||
where || o || is the euclidean norm and _i means index of a row.
Here, A is your data, x is your weights and b is your labels.
The gradient of the function is then computed as a:
grad(f(x)) = 2 * A.T (Ax - b)
Or in the case of the stochastic gradient descent:
2 * A_i.T (A_i x - b)
where .T means transpose.
Putting everything back into your code... first I will setup a synthetic data:
A = np.random.randn(100, 2) # 100x2 data
x = np.random.randn(2, 1) # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1) # 100x1 labels
b[b == 0] = -1 # labels in {-1, 1}
Then, define the parameters:
alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
idx = range(m)
And loop!
while cur_mse/prev_mse < 0.99999 and it < max_iter:
prev_mse = cur_mse
shuffle(idx)
for i in idx:
d = A[i:i+1]
y = b[i:i+1]
h = np.dot(d, x)
dx = 2 * np.dot(d.T, (h - y))
x -= (alpha * dx)
cur_mse = np.mean((A.dot(x) - b)**2)
if cur_mse > prev_mse:
raise Exception("Not converging")
it += 1
This code is pretty much the same as yours, with a couple of additions:
Another stopping criterion based on the number of iterations (to avoid looping forever if the system doesn't converge or does too slowly)
Redefinition of the gradient dx (still similar to yours). You have the sign inverted and therefore the weight update is positive + since in my example is negative - (makes sense since you are going down in a gradient).
Indexing of data and labels. While data[i] gives a tuple of size (2,) (in this case for a 100x2 data), using fancy indexing data[i:i+1] will return a view of the data without reshaping it (e.g with shape (1, 2)) and therefore will allow you to perform the proper matrix multiplications.
You can add a 3rd stopping criterion based on acceptable mse error, i.e: if cur_mse < 1e-3: break.
This algorithm, with random data, converges in 20-40 iterations for me (depending on the generated random data).
So... assuming that this is the function you want to minimize, if this method doesn't work for you, it might mean that your system is underdeterminated (you have less training data than features, which means A is more wide than high).
Hope it helps!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.