Linear regression implementation from scratch

Linear regression implementation from scratch - python

I'm trying to understand the gradient descent algorithm.
Can someone please explain why I'm getting high MSE values using the following code, or if I missed some concept can you please clarify?
import numpy as np
import pandas as pd
my_data = pd.DataFrame({'x': np.arange(0,100),
'y': np.arange(0,100)})
X = my_data.iloc[:,0:1].values
y = my_data.iloc[:,1].values
def gradientDescent(X, y, lr = 0.001, n = 1000):
n_samples, n_features = X.shape
cost = []
weight = np.zeros([n_features])
b = 0
for _ in range(n):
# predict
y_hat = np.dot(X, weight) + b # y = ax + b
residual = y - y_hat
db = -(2/n_samples) * np.sum(residual)
dw = -(2/n_samples) * np.sum(X.T * residual, axis = 1)
# update weights
weight -= (lr * dw)
b -= (lr * db)
cost.append(((y-y_hat) **2).mean())
return weight, b, cost
gradientDescent(X,y)

Not an expert, but I think you are currently experiencing the exploding gradient problem. If you step through your code you will notice that your weight value is swinging from positive to negative in increasing steps. I believe you cannot find the minimum because using mse for this dataset is causing you to jump back and forth never converging. Your x and y ranges to 100, so when you look at the cost it is just blowing up.
If you want to use mse with your current x and y values you should normalize your data. You can do this by subtracting the mean and dividing by the standard deviation, or just normalize both x and y to 1.
For example:
my_data.x = my_data.x.transform(lambda x: x / x.max())
my_data.y = my_data.y.transform(lambda x: x / x.max())
If you do this you should see your cost converge to ~0 with enough iterations.

Related

Got stuck while make a ML model from scratch

I have a CSV file of various persons with 8 parameters to determine whether the person is diabetic or not.
You will get the CSV file from here
I am making a model that will train and predict if a person is diabetic or not without using of third-party applications like Tensorlfow Scikitlearn etc. I am making it from scratch.
here is my code:
from numpy import genfromtxt
import numpy as np
my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
X,Y = my_data[1: ,:-1], my_data[1: ,-1:] #striping data and output from my_data
def sigmoid(x):
return (1/(1+np.exp(-x)))
m = X.shape[0]
def propagate(W, b, X, Y):
#forward propagation
A = sigmoid(np.dot(X, W) + b)
cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))
print(cost)
#backward propagation
dw = (1 / m) * np.dot(X.T, (A - Y))
db = (1 / m) * np.sum(A - Y)
return(dw, db, cost)
def optimizer(W,b,X,Y,number_of_iterration,learning_rate):
for i in range(number_of_iterration):
dw, db, cost = propagate(W,b,X,Y)
W = W - learning_rate*dw
b = b - learning_rate*db
return(W, b)
W = np.zeros((X.shape[1],1))
b = 0
W,b = optimizer(W, b, X, Y, 100, 0.05)
The output which is getting generated is:
It is in this link please take a look.
I have tried to -
initialize the value of W with random numbers.
spent a lot of time to debug but cannot find what I have done wrong

This short answer is that your learning rate is about 500x too big for this problem. Think about it like you're trying to pilot your W vector into a canyon in the cost function. At each step, the gradient tells you which way is down hill, but the steps you take in that direction are so big that you jump over the canyon and end up on the other side. Each time this happens, your cost goes up because you're getting farther and farther out of the canyon until after 2 iterations, it blows up.
If you replace the line
W,b = optimizer(W, b, X, Y, 100, 0.05)
with
W,b = optimizer(W, b, X, Y, 100, 0.0001)
It will converge, though still not at a reasonable speed. (Side note, there's no good way to know the learning rate you need for a given problem. You just try lower and lower values until your cost value doesn't diverge.)
The longer answer is that the problem is that your features are all on different scales.
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
yields
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
This means that the variations in the numbers of the second feature are about 100x as large as the variations in the numbers of the second to last feature which in turn means that the number of the second value in your W vector will have to be tuned to about 100x the precision of the value of the second to last number in your W vector.
There are two ways to deal with this in practice. First, you could use a fancier optimizer. Instead of basic gradient descent, you could use gradient descent with momentum, but that would change all your code. The second, simpler, way is just to scale your features so they're all about the same size.
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
X -= col_means
X /= col_stds
W, b = optimizer(W, b, X, Y, 100, 1.0)
Here we subtract the mean value of each feature and divide each feature's value by its standard deviation. Sometimes newbies are thrown off by this -- "you can't change your data values, that changes the problem" -- but it makes sense if you realize that it's just another mathematical transformation, just like multiplying by W, adding b, taking the sigmoid, etc. The only catch is that you've got to make sure you do the same thing for any future data. Just like the values of your W vector are learned parameters of your model, the values of the col_means and col_stds are too, so you've got to save them like W and b and use them if you want to perform inference with this model on new data in the future.
That lets us use a much bigger learning rater of 1.0 because now all the features are approximately the same size.
Now if you try, you'll get the following output:
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
0.6931471805599452
0.5902957589079032
0.5481784378158732
0.5254804089153315
...
0.4709931321295562
0.4709931263193595
0.47099312122176273
0.4709931167488006
0.470993112823447
This is what you want. Your cost function is going down at each step and at the end of your 100 iterations, the cost is stable to ~8 significant figures, so dropping it more probably won't do much.
Welcome to machine learning!

The problem is with your initialization of the weight and bias. It’s important that you don’t initialize at least the weights to zero and instead initialize them with some random small numbers. The value of A is coming out to be zero making your cost function undefined
Update:
Try something like this:
from numpy import genfromtxt
import numpy as np
# my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
# X,Y = my_data[1: ,:-1], my_data[1: ,-1:] #striping data and output from my_data
# Using random data
n_points = 100
n_neurons = 5
X = np.random.rand(n_points, n_neurons) # 5 dimensional data from uniform distribution [0, 1)
Y = np.random.randint(low=0, high=2, size=(n_points, 1)) # Binary labels
def sigmoid(x):
return (1/(1+np.exp(-x)))
m = X.shape[0]
def propagate(W, b, X, Y):
#forward propagation
A = sigmoid(np.dot(X, W) + b)
cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))
print(cost)
#backward propagation
dw = (1 / m) * np.dot(X.T, (A - Y))
db = (1 / m) * np.sum(A - Y)
return(dw, db, cost)
def optimizer(W,b,X,Y,number_of_iterration,learning_rate):
for i in range(number_of_iterration):
dw, db, cost = propagate(W,b,X,Y)
W = W - learning_rate*dw
b = b - learning_rate*db
return(W, b)
W = np.random.normal(loc=0, scale=0.01, size=(n_neurons, 1)) # Drawing random initialization from gaussian
b = 0
W,b = optimizer(W, b, X, Y, 100, 0.05)

Your NaN problem is simply due to np.log encountering a zero value. You always want to scale your X values. Statistical (mean, std) normalization will work, but I find min-max scaling works best. Here is code for that:
def minmax_scaler(x):
min = np.nanmin(x, axis=0)
max = np.nanmax(x, axis=0)
return (x-min)/(max-min)
Also, your neural net has only one neuron. When you call np.dot(X, W) these should be matrices of shape (cases, features) and (features, neurons) respectively. So, now your initialization code looks like this:
X = minmax_scaler(X)
neurons = 10
learning_rate = 0.05
W = np.random.random((X.shape[1], neurons))
b = np.zeros((1, neurons)) # b width to match W
I got decent convergence without needing to change the learning rate. See chart:
This is such a small dataset that, even with 10-20 neurons, you are in danger of overfitting it. Ordinarily, you would code a predict() method and an accuracy check, and then set aside some of the data to test for overfitting.

Linear Regression with python - gradient descent error

I have been trying to implement my own Linear Regression from scratch using python but have been facing a issue during the last days.
This is the code I am using :
Import modules
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
import matplotlib.pyplot as plt
Initialize parameters
def initialize_parameters(n):
w = np.zeros(n,)
b = 0.0
return w,b
Predictor/Hypothesis
def predictor(x, w, b):
return np.dot(x,w) + b
Cost function
def calculate_cost(X, y, theta, b):
m = len(y)
predictions = np.dot(X, theta)
error = predictions - y
cost = (1/2*m) * np.sum(np.power(error,2))
return cost
Gradient descent
def gradient_descent(X, W, b, y, learning_rate = 0.0001, epochs = 25):
m = len(y)
final_cost = 0
for _ in range(epochs):
predictions = predictor(X, W, b)
error = predictions - y
derivate = np.dot(error, X)
print(derivate)
W = W - (learning_rate/m) * derivate
b = b - (learning_rate/m) * error.sum()
Test run :
# Load dataset
boston = load_boston()
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['PRICE'] = boston.target
# Split dataset
X = data.drop(columns=['PRICE']).values
Y = data['PRICE'].values
w, b = initialize_parameters(X.shape[1])
gradient_descent(X, w, b, Y)
During the test run, I can see that the values for the derivate is growing insanely fast :
[1.41239553e+06 3.20162679e+06 3.84829686e+06 2.17737688e+04
1.81667467e+05 1.99565485e+06 2.27660208e+07 1.15045731e+06
3.50107975e+06 1.40396525e+08 5.96494458e+06 1.14447329e+08
4.25947931e+06]
[-4.33362969e+07 -9.66008831e+07 -1.16941872e+08 -6.62733008e+05
-5.50761913e+06 -6.04452389e+07 -6.90425672e+08 -3.46792848e+07
-1.06967561e+08 -4.26847914e+09 -1.80579130e+08 -3.45024565e+09
-1.29016170e+08]
...
[-2.01209195e+34 -4.47742185e+34 -5.42629282e+34 -3.07294644e+32
-2.55503032e+33 -2.80363423e+34 -3.20314565e+35 -1.60824109e+34
-4.96433806e+34 -1.98052568e+36 -8.37673498e+34 -1.60024763e+36
-5.98654489e+34]
[6.09700758e+35 1.35674093e+36 1.64426623e+36 9.31159124e+33
7.74221040e+34 8.49552585e+35 9.70611871e+36 4.87326542e+35
1.50428547e+36 6.00135600e+37 2.53830431e+36 4.84904376e+37
1.81403288e+36]
[-1.84750510e+37 -4.11117381e+37 -4.98242821e+37 -2.82158290e+35
-2.34603173e+36 -2.57430013e+37 -2.94113196e+38 -1.47668879e+37
-4.55826082e+37 -1.81852092e+39 -7.69152754e+37 -1.46934918e+39
-5.49685229e+37]
[5.59827926e+38 1.24576106e+39 1.50976712e+39 8.54991361e+36
7.10890636e+37 7.80060146e+38 8.91216919e+39 4.47463782e+38
1.38123662e+39 5.51045187e+40 2.33067389e+39 4.45239747e+40
1.66564705e+39]
[-1.69638128e+40 -3.77488445e+40 -4.57487122e+40 -2.59078061e+38
-2.15412899e+39 -2.36372529e+40 -2.70055070e+41 -1.35589732e+40
-4.18540025e+40 -1.66976797e+42 -7.06236930e+40 -1.34915808e+42
-5.04721600e+40]
And then, the gradient descent run stops before all interactions due to the high values.
At a certain point, the values form the derivate assume values as NaN.
As expected, when I try to predict a test case, I get 0.0 as output:
sample_house = [[2.29690000e-01, 0.00000000e+00, 1.05900000e+01, 0.00000000e+00, 4.89000000e-01,
6.32600000e+00, 5.25000000e+01, 4.35490000e+00, 4.00000000e+00, 2.77000000e+02,
1.86000000e+01, 3.94870000e+02, 1.09700000e+01]]
test_predict = predictor(sample_house, w, b)
test_predict
------------------------------------------------
out : array([0.])
Thanks!

Your cost function is wrong, it should be:
cost = 1/(2*m) * np.sum(np.power(error,2))
Also, try to initialize your weights as random values between 0 an 1 and scale your inputs to range 0-1.

I had the same issue which I resolved by normalizing the x values.

I think that you are making a mistake in the gradient descent algorithm. When updating the values for "W" vector it should be:
W = W - (learning_rate/m) * derivate.sum()

The learning rate is too large.
I try learning_rate = 0.000001, and it converges normally.

Cost Function and Gradient Seem to be Working, but scipy.optimize functions are not

I'm working through my Matlab code for the Andrew NG Coursera course and turning it into python. I am working on non-regularized logistic regression and after writing my gradient and cost functions I needed something similar to fminunc and after some googling, I found a couple options. They are both returning the same results, but they do not match what is in Andrew NG's expected results code. Others seem to be getting this to work correctly, but I'm wondering why my specific code does not seem to return the desired result when using scipy.optimize functions, but does for the cost and gradient pieces earlier in the code.
The data I'm using can be found at the link below;
ex2data1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as op
#Machine Learning Online Class - Exercise 2: Logistic Regression
#Load Data
#The first two columns contains the exam scores and the third column contains the label.
data = pd.read_csv('ex2data1.txt', header = None)
X = np.array(data.iloc[:, 0:2]) #100 x 3
y = np.array(data.iloc[:,2]) #100 x 1
y.shape = (len(y), 1)
#Creating sub-dataframes for plotting
pos_plot = data[data[2] == 1]
neg_plot = data[data[2] == 0]
#==================== Part 1: Plotting ====================
#We start the exercise by first plotting the data to understand the
#the problem we are working with.
print('Plotting data with + indicating (y = 1) examples and o indicating (y = 0) examples.')
plt.plot(pos_plot[0], pos_plot[1], "+", label = "Admitted")
plt.plot(neg_plot[0], neg_plot[1], "o", label = "Not Admitted")
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()
def sigmoid(z):
'''
SIGMOID Compute sigmoid function
g = SIGMOID(z) computes the sigmoid of z.
Instructions: Compute the sigmoid of each value of z (z can be a matrix,
vector or scalar).
'''
g = 1 / (1 + np.exp(-z))
return g
def costFunction(theta, X, y):
'''
COSTFUNCTION Compute cost and gradient for logistic regression
J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
parameter for logistic regression and the gradient of the cost
w.r.t. to the parameters.
'''
m = len(y) #number of training examples
h = sigmoid(X.dot(theta)) #logisitic regression hypothesis
J = (1/m) * np.sum((-y*np.log(h)) - ((1-y)*np.log(1-h)))
#h is 100x1, y is %100x1, these end up as 2 vector we subtract from each other
#then we sum the values by rows
#cost function for logisitic regression
return J
def gradient(theta, X, y):
m = len(y)
grad = np.zeros((theta.shape))
h = sigmoid(X.dot(theta))
for i in range(len(theta)): #number of rows in theta
XT = X[:,i]
XT.shape = (len(X),1)
grad[i] = (1/m) * np.sum((h-y)*XT) #updating each row of the gradient
return grad
#============ Part 2: Compute Cost and Gradient ============
#In this part of the exercise, you will implement the cost and gradient
#for logistic regression. You neeed to complete the code in costFunction.m
#Add intercept term to x and X_test
Bias = np.ones((len(X), 1))
X = np.column_stack((Bias, X))
#Initialize fitting parameters
initial_theta = np.zeros((len(X[0]), 1))
#Compute and display initial cost and gradient
(cost, grad) = costFunction(initial_theta, X, y), gradient(initial_theta, X, y)
print('Cost at initial theta (zeros): %f' % cost)
print('Expected cost (approx): 0.693\n')
print('Gradient at initial theta (zeros):')
print(grad)
print('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628')
#Compute and display cost and gradient with non-zero theta
test_theta = np.array([[-24], [0.2], [0.2]]);
(cost, grad) = costFunction(test_theta, X, y), gradient(test_theta, X, y)
print('\nCost at test theta: %f' % cost)
print('Expected cost (approx): 0.218\n')
print('Gradient at test theta:')
print(grad)
print('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n')
result = op.fmin_tnc(func = costFunction, x0 = initial_theta, fprime = gradient, args = (X,y))
result[1]
Result = op.minimize(fun = costFunction,
x0 = initial_theta,
args = (X, y),
method = 'TNC',
jac = gradient, options={'gtol': 1e-3, 'disp': True, 'maxiter': 1000})
theta = Result.x
theta
test = np.array([[1, 45, 85]])
prob = sigmoid(test.dot(theta))
print('For a student with scores 45 and 85, we predict an admission probability of %f,' % prob)
print('Expected value: 0.775 +/- 0.002\n')

This was a very difficult problem to debug, and illustrates a poorly documented aspect of the scipy.optimize interface. The documentation vaguely indicates that theta will be passed around as a vector:
Minimization of scalar function of one or more variables.
In general, the optimization problems are of the form:
minimize f(x) subject to
g_i(x) >= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where x is a vector of one or more variables.
What's important is that they really mean vector in the most primitive sense, a 1-dimensional array. So you have to expect that whenever theta is passed into one of your callbacks, it will be passed in as a 1-d array. But in numpy, 1-d arrays sometimes behave differently from 2-d row arrays (and, obviously, from 2-d column arrays).
I don't know exactly why it's causing a problem in your case, but it's easily fixed regardless. You just have to add the following at the top of both your cost function and your gradient function:
theta = theta.reshape(-1, 1)
This guarantees that theta will be a 2-d column array, as expected. Once you've done this, the results are correct.

I have had similar issues with Scipy dealing with the same problem as you. As senderle points out the interface is not the easiest to deal with, especially combined with the numpy array interface... Here is my implementation which works as expected.
Defining the cost and gradient functions
Note that initial_theta is passed as a simple array of shape (3,) and converted to a column vector of shape (3,1) within the function. The gradient function then returns the grad.ravel() which has shape (3,) again. This is important as doing otherwise caused an error message with various optimization methods in Scipy.optimize.
Note that different methods have different behaviours but returning .ravel() seems to fix most issues...
import pandas as pd
import numpy as np
import scipy.optimize as opt
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def CostFunc(theta,X,y):
#Initializing variables
m = len(y)
J = 0
grad = np.zeros(theta.shape)
#Vectorized computations
z = X # theta
h = sigmoid(z)
J = (1/m) * ( (-y.T # np.log(h)) - (1 - y).T # np.log(1-h));
return J
def Gradient(theta,X,y):
#Initializing variables
m = len(y)
theta = theta[:,np.newaxis]
grad = np.zeros(theta.shape)
#Vectorized computations
z = X # theta
h = sigmoid(z)
grad = (1/m)*(X.T # ( h - y));
return grad.ravel() #<-- This is the trick
Initializing variables and parameters
Note that initial_theta.shape returns (3,)
X = data1.iloc[:,0:2].values
m,n = X.shape
X = np.concatenate((np.ones(m)[:,np.newaxis],X),1)
y = data1.iloc[:,-1].values[:,np.newaxis]
initial_theta = np.zeros((n+1))
Calling Scipy.optimize
model = opt.minimize(fun = CostFunc, x0 = initial_theta, args = (X, y), method = 'TNC', jac = Gradient)
Any comments from more knowledgeable people are welcome, this Scipy interface is a mystery to me, thanks

Gradient Descent & linear regression - Code not converging

I'm trying to implement the gradient descent algorithm from scratch on a toy problem. My code always returns a vector of NaN's:
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
x = np.linspace(0, 1000, num=1000)
y = 3*x + 2 + np.random.randn(len(x))
# sklearn output - This works (returns intercept = 1.6, coef = 3)
lm = LinearRegression()
lm.fit(x.reshape(-1, 1), y.reshape(-1, 1))
print("Intercept = {:.2f}, Coef = {:.2f}".format(lm.coef_[0][0], lm.intercept_[0]))
# BGD output
theta = np.array((0, 0)).reshape(-1, 1)
X = np.hstack([np.ones_like(x.reshape(-1, 1)), x.reshape(-1, 1)]) # [1, x]
Y = y.reshape(-1, 1) # Column vector
alpha = 0.05
for i in range(100):
# Update: theta <- theta - alpha * [X.T][X][theta] - [X.T][Y]
h = np.dot(X, theta) # Hypothesis
loss = h - Y
theta = theta - alpha*np.dot(X.T, loss)
theta
The sklearn part runs fine, so I must be doing something wrong in the for loop. I've tried various different alpha values and none of them converge.
The problem is theta keeps getting bigger and bigger throughout the loop, and eventually becomes too big for python to store.
Here's a contour plot of the cost function:
J = np.dot((np.dot(X, theta) - y).T, (np.dot(X, theta) - y))
plt.contour(J)
Clearly there's no minimum here. Where have I gone wrong?
Thanks

In the theta update, the second term should be divided by the size of the training set. More details are there: gradient descent using python and numpy

Stochastic Gradient Descent Convergence Criteria

Currently my convergence criteria for SGD checks whether the MSE error ratio is within a specific boundary.
def compute_mse(data, labels, weights):
m = len(labels)
hypothesis = np.dot(data,weights)
sq_errors = (hypothesis - labels) ** 2
mse = np.sum(sq_errors)/(2.0*m)
return mse
cur_mse = 1.0
prev_mse = 100.0
m = len(labels)
while cur_mse/prev_mse < 0.99999:
prev_mse = cur_mse
for i in range(m):
d = np.array(data[i])
hypothesis = np.dot(d, weights)
gradient = np.dot((labels[i] - hypothesis), d)/m
weights = weights + (alpha * gradient)
cur_mse = compute_mse(data, labels, weights)
if cur_mse > prev_mse:
return
The weights are update w.r.t. to a single data point in the training set.
With an alpha of 0.001, the model is supposed to have converged within a few iterations however I get no convergence. Is this convergence criteria too strict?

I'll try to answer the question. First, the pseudocode of stochastic gradient descent looks something like this:
input: f(x), alpha, initial x (guess or random)
output: min_x f(x) # x that minimizes f(x)
while True:
shuffle data # good practice, not completely needed
for d in data:
x -= alpha * grad(f(x)) # df/dx
if <stopping criterion>:
break
There can be other regularization parameters added to the function that you want to minimize, such as the l1 penalty to avoid overfitting.
Going back to your problem, looking at your data and definition of the gradient, looks like you want to solve a simple linear system of equations of the form:
Ax = b
which yields the objevtive function:
f(x) = ||Ax - b||^2
stochastic gradient descent uses one row data at a time:
||A_i x - b||
where || o || is the euclidean norm and _i means index of a row.
Here, A is your data, x is your weights and b is your labels.
The gradient of the function is then computed as a:
grad(f(x)) = 2 * A.T (Ax - b)
Or in the case of the stochastic gradient descent:
2 * A_i.T (A_i x - b)
where .T means transpose.
Putting everything back into your code... first I will setup a synthetic data:
A = np.random.randn(100, 2) # 100x2 data
x = np.random.randn(2, 1) # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1) # 100x1 labels
b[b == 0] = -1 # labels in {-1, 1}
Then, define the parameters:
alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
idx = range(m)
And loop!
while cur_mse/prev_mse < 0.99999 and it < max_iter:
prev_mse = cur_mse
shuffle(idx)
for i in idx:
d = A[i:i+1]
y = b[i:i+1]
h = np.dot(d, x)
dx = 2 * np.dot(d.T, (h - y))
x -= (alpha * dx)
cur_mse = np.mean((A.dot(x) - b)**2)
if cur_mse > prev_mse:
raise Exception("Not converging")
it += 1
This code is pretty much the same as yours, with a couple of additions:
Another stopping criterion based on the number of iterations (to avoid looping forever if the system doesn't converge or does too slowly)
Redefinition of the gradient dx (still similar to yours). You have the sign inverted and therefore the weight update is positive + since in my example is negative - (makes sense since you are going down in a gradient).
Indexing of data and labels. While data[i] gives a tuple of size (2,) (in this case for a 100x2 data), using fancy indexing data[i:i+1] will return a view of the data without reshaping it (e.g with shape (1, 2)) and therefore will allow you to perform the proper matrix multiplications.
You can add a 3rd stopping criterion based on acceptable mse error, i.e: if cur_mse < 1e-3: break.
This algorithm, with random data, converges in 20-40 iterations for me (depending on the generated random data).
So... assuming that this is the function you want to minimize, if this method doesn't work for you, it might mean that your system is underdeterminated (you have less training data than features, which means A is more wide than high).
Hope it helps!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.