I have to do Logistic regression using batch gradient descent.
import numpy as np
X = np.asarray([
[0.50],[0.75],[1.00],[1.25],[1.50],[1.75],[1.75],
[2.00],[2.25],[2.50],[2.75],[3.00],[3.25],[3.50],
[4.00],[4.25],[4.50],[4.75],[5.00],[5.50]])
y = np.asarray([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1])
m = len(X)
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))
def gradient_Descent(theta, alpha, X , y):
    for i in range(0,m):
        cost = ((-y) * np.log(sigmoid(X[i]))) - ((1 - y) * np.log(1 - sigmoid(X[i])))
        grad = theta - alpha * (1.0/m) * (np.dot(cost,X[i]))
        theta = theta - alpha * grad
gradient_Descent(0.1,0.005,X,y)
The way I have to do it is like this but I can't seem to understand how to make it work.
It looks like you have some stuff mixed up in here. It's critical when doing this that you keep track of the shapes of your vectors and make sure you're getting sensible results. For example, you are calculating cost with:
cost = ((-y) * np.log(sigmoid(X[i]))) - ((1 - y) * np.log(1 - sigmoid(X[i])))
In your case y is a vector with 20 items and X[i] is a single value. This makes your cost calculation a 20-item vector, which doesn't make sense: your cost should be a single value. (You're also calculating this cost a bunch of times for no reason in your gradient descent function.)
Also, if you want this to be able to fit your data, you need to add a bias term to X. So let's start there.
X = np.asarray([
[0.50],[0.75],[1.00],[1.25],[1.50],[1.75],[1.75],
[2.00],[2.25],[2.50],[2.75],[3.00],[3.25],[3.50],
[4.00],[4.25],[4.50],[4.75],[5.00],[5.50]])
ones = np.ones(X.shape)
X = np.hstack([ones, X])
# X.shape is now (20, 2)
Theta now needs two values, one for each column of X. So initialize that and Y:
Y = np.array([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]).reshape([-1, 1])
# reshape Y into a column vector so matrix multiplication is easier
Theta = np.array([[0], [0]])
Your sigmoid function is good. Let's also make a vectorized cost function:
def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))
def cost(x, y, theta):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    cost = (np.matmul(-y.T, np.log(h)) - np.matmul((1 -y.T), np.log(1 - h)))/m
    return cost
The cost function works because Theta has a shape of (2, 1) and X has a shape of (20, 2), so matmul(X, Theta) will be shaped (20, 1). We then matrix multiply by the transpose of Y (y.T has shape (1, 20)), which results in a single value: our cost for a particular value of Theta.
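A quick shape check can make the argument above concrete. This is only a sketch on made-up random data with the same shapes; the names X_demo, Y_demo and theta_demo are illustrative and not part of the original code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))

rng = np.random.default_rng(0)

# Made-up data with the same shapes as in the answer (assumption: values are arbitrary)
X_demo = np.hstack([np.ones((20, 1)), rng.uniform(0.5, 5.5, size=(20, 1))])  # (20, 2)
Y_demo = rng.integers(0, 2, size=(20, 1))                                     # (20, 1)
theta_demo = np.zeros((2, 1))                                                 # (2, 1)

# (20, 2) @ (2, 1) -> (20, 1)
h = sigmoid(np.matmul(X_demo, theta_demo))

# (1, 20) @ (20, 1) -> (1, 1), i.e. a single cost value
cost_demo = (np.matmul(-Y_demo.T, np.log(h))
             - np.matmul(1 - Y_demo.T, np.log(1 - h))) / X_demo.shape[0]

print(h.shape, cost_demo.shape)
```

With Theta set to all zeros every prediction is 0.5, so the cost is log(2) whatever the labels are, which makes a handy sanity check.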
We can then write a function that performs a single step of batch gradient descent:
def gradient_Descent(theta, alpha, x , y):
    m = x.shape[0]
    h = sigmoid(np.matmul(x, theta))
    # use the x argument here, not the global X
    grad = np.matmul(x.T, (h - y)) / m
    theta = theta - alpha * grad
    return theta
Notice np.matmul(X.T, (h - y)) is multiplying shapes (2, 20) and (20, 1), which results in a shape of (2, 1), the same shape as Theta, which is what you want from your gradient. This allows you to multiply it by your learning rate and subtract it from the current Theta, which is what gradient descent is supposed to do.
So now you just write a loop for a number of iterations and update Theta until it looks like it converges:
n_iterations = 500
learning_rate = 0.5
for i in range(n_iterations):
    Theta = gradient_Descent(Theta, learning_rate, X, Y)
    if i % 50 == 0:
        print(cost(X, Y, Theta))
This will print the cost every 50 iterations resulting in a steadily decreasing cost, which is what you hope for:
[[ 0.6410409]]
[[ 0.44766253]]
[[ 0.41593581]]
[[ 0.40697167]]
[[ 0.40377785]]
[[ 0.4024982]]
[[ 0.40195]]
[[ 0.40170533]]
[[ 0.40159325]]
[[ 0.40154101]]
You can try different initial values of Theta and you will see it always converges to the same thing.
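To check the claim about initial values, here is a minimal self-contained sketch that reruns the same update from two different starting points on the data above. The helper names step and fit, the second starting point and the iteration count are my own choices, not part of the original answer.

```python
import numpy as np

x_raw = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75,
                  2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50,
                  4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
X = np.hstack([np.ones_like(x_raw), x_raw])   # add the bias column -> (20, 2)
Y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
              1, 0, 1, 0, 1, 1, 1, 1, 1, 1]).reshape(-1, 1)

def sigmoid(a):
    return 1.0 / (1 + np.exp(-a))

def step(theta, alpha, x, y):
    # one batch gradient descent step for logistic regression
    m = x.shape[0]
    h = sigmoid(x @ theta)
    return theta - alpha * (x.T @ (h - y)) / m

def fit(theta0, alpha=0.5, n_iterations=10000):
    # hypothetical helper: run many steps from a given starting point
    theta = np.asarray(theta0, dtype=float).reshape(-1, 1)
    for _ in range(n_iterations):
        theta = step(theta, alpha, X, Y)
    return theta

theta_a = fit([0.0, 0.0])
theta_b = fit([1.0, -1.0])
print(theta_a.ravel(), theta_b.ravel())
```

The cost is convex in Theta, so with a small enough learning rate both runs should end up at essentially the same parameters.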
Now you can use your newly found values of Theta to make predictions:
h = sigmoid(np.matmul(X, Theta))
print((h > .5).astype(int) )
This prints what you would expect for a linear fit to your data:
[[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]]
I'm working with the gradient function for an exercise but I still can't get the expected outcome. That is, I receive 2 error messages:
Wrong output for the loss function. Check how you are implementing the matrix multiplications.
Wrong values for weight's matrix theta. Check how you are updating the matrix of weights.
When applying the function (see below) I notice that the cost decreases at each iteration, but it still does not converge to the desired outcome in the exercise. I have already tried several adaptations of the formula but haven't solved it yet.
# gradientDescent
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    ### START CODE HERE ###
    # get 'm', the number of rows in matrix x
    m = len(x)
    for i in range(0, num_iters):
        # get z, the dot product of x and theta
        # z = predictions
        z = np.dot(x, theta)
        h = sigmoid(z)
        loss = z - y
        # calculate the cost function
        J = (-1/m) * np.sum(loss)
        print("Iteration %d | Cost: %f" % (i, J))
        gradient = np.dot(x.T, loss)
        #update theta
        theta = theta - (1/m) * alpha * gradient
    ### END CODE HERE ###
    J = float(J)
    return J, theta
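The issue is that I applied the wrong formulas for the cost function and for the weight update. The correct ones are:
$$J = -\frac{1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + (1 - \mathbf{y})^T \cdot \log(1 - \mathbf{h})\right)$$
$$\theta = \theta - \frac{\alpha}{m} \times \left(\mathbf{x}^T \cdot (\mathbf{h} - \mathbf{y})\right)$$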
The solution is:
J = (-1/m) * (np.dot(y.T, np.log(h)) + np.dot((1-y).T, np.log(1-h)))
gradient = np.dot(x.T, h - y)
theta = theta - (alpha/m) * gradient
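Putting the two corrected lines back into the loop might look like the sketch below. The function name gradient_descent and the tiny toy dataset are assumptions made for illustration, not the exercise's own code or data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(x, y, theta, alpha, num_iters):
    # x: (m, n+1), y: (m, 1), theta: (n+1, 1)
    m = x.shape[0]
    for _ in range(num_iters):
        h = sigmoid(np.dot(x, theta))                       # predictions, (m, 1)
        # cross-entropy cost, a (1, 1) array
        J = (-1.0 / m) * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))
        # the gradient uses h - y, not z - y
        theta = theta - (alpha / m) * np.dot(x.T, h - y)
    return J.item(), theta

# Toy data (made up): a bias column plus one feature
x_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([[0.0], [1.0], [0.0], [1.0]])

J_first, _ = gradient_descent(x_toy, y_toy, np.zeros((2, 1)), 0.1, 1)
J_final, theta_final = gradient_descent(x_toy, y_toy, np.zeros((2, 1)), 0.1, 100)
print(J_first, J_final)
```

Note that the returned J is the cost evaluated just before the last update, which mirrors the structure of the original function.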
I've been spending a few hours googling about this problem and it seems I can't find any information.
I tried coding a multivariate gaussian pdf as:
def multivariate_normal(X, M, S):
    # X has shape (D, N) where D is the number of dimensions and N the number of observations
    # M is the mean vector with shape (D, 1)
    # S is the covariance matrix with shape (D, D)
    D = S.shape[0]
    S_inv = np.linalg.inv(S)
    logdet = np.log(np.linalg.det(S))
    log2pi = np.log(2*np.pi)
    devs = X - M
    # log-density per observation; the quadratic form also carries a factor of 1/2
    a = np.array([- D/2 * log2pi - (1/2) * logdet - (1/2) * dev.T @ S_inv @ dev for dev in devs.T])
    return np.exp(a)
I've only been successful in computing the pdf through a for loop, iterating N times. If I don't, I end up with an (N, N) matrix which is unhelpful. I've found another post here, but the post is quite outdated and in matlab.
Is there any way to take advantage of numpy's vectorisation?
This is my first post on stackoverflow, let me know if anything is off!
I came across this problem in a similar manner and here's how I solved it:
Variables:
X = numpy.ndarray[numpy.ndarray[float]] - m x n
MU = numpy.ndarray[numpy.ndarray[float]] - k x n
SIGMA = numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]] - k x n x n
k = int
Where X is my feature vector, MU is my means, SIGMA is my covariance matrix.
To vectorize, I rewrote the dot product per the definition of the dot-product:
sigma_det = np.linalg.det(sigma)
sigma_inv = np.linalg.inv(sigma)
const = 1/((2*np.pi)**(n/2)*sigma_det**(1/2))
p = const*np.exp((-1/2)*np.sum((X-mu).dot(sigma_inv)*(X-mu),axis=1))
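As a sanity check of the row-wise trick above, the following sketch wraps the same expression for a single component and compares it against a plain loop. The function names and the random test data are assumptions for illustration only.

```python
import numpy as np

def gaussian_pdf_rows(X, mu, sigma):
    # X: (m, n) samples as rows, mu: (n,), sigma: (n, n)
    n = X.shape[1]
    sigma_inv = np.linalg.inv(sigma)
    const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma)))
    diff = X - mu
    # row-wise quadratic form: sum_j (diff @ sigma_inv)[i, j] * diff[i, j]
    maha = np.sum(diff.dot(sigma_inv) * diff, axis=1)
    return const * np.exp(-0.5 * maha)

def gaussian_pdf_loop(X, mu, sigma):
    # reference implementation, one sample at a time
    n = X.shape[1]
    sigma_inv = np.linalg.inv(sigma)
    const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma)))
    return np.array([const * np.exp(-0.5 * (row - mu) @ sigma_inv @ (row - mu))
                     for row in X])

rng = np.random.default_rng(0)
X_test = rng.normal(size=(50, 2))
mu_test = np.array([0.5, 0.25])
sigma_test = np.array([[1.0, 0.5], [0.5, 1.0]])

p_vec = gaussian_pdf_rows(X_test, mu_test, sigma_test)
p_loop = gaussian_pdf_loop(X_test, mu_test, sigma_test)
print(np.allclose(p_vec, p_loop))
```

The elementwise product followed by a sum over axis 1 only computes the diagonal of the (m, m) matrix that a full matrix product would produce, which is why no (N, N) intermediate appears.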
I have been working on this problem for the last few days and finally have come to a solution.
To do so I have added an extra dimension to the x vector, and then used the np.einsum() function for computing the Mahalanobis distance.
Example
For the following example we will use a (100 x 2) input array. That is, 100 samples of two random variables. That gives us a (1 x 2) mean vector and a (2 x 2) covariance matrix.
Generating some data:
# instantiate a random number generator
rng = np.random.default_rng(100)
# define mu and sigma for the dummy sample
mu = np.array([0.5, 0.25])
covmat = np.array([[1, 0.5],
[0.5, 1]])
# generate multivariate normal random sample
x = rng.multivariate_normal(mu, covmat, size=100)
And defining the pdf function:
def pdf(x, mu, covmat):
    """
    Evaluates the probability density of each row of x under the
    multivariate normal distribution N(mu, covmat)
    Returns: the density values, with shape (n, 1)
    """
    x = x[:, np.newaxis]  # add a new second axis to x: (n, k) -> (n, 1, k)
    k = mu.shape[0]  # number of dimensions
    diff = x - mu  # deviation of x from the mean
    inv_covmat = np.linalg.inv(covmat)
    # normalizing constant (2*pi)^(-k/2) * det(covmat)^(-1/2)
    term1 = (2*np.pi)**-(k/2) * np.sqrt(np.linalg.det(inv_covmat))
    term2 = np.exp(-np.einsum('ijk, kl, ijl->ij', diff, inv_covmat, diff) / 2)
    return term1 * term2
This returns an (n, 1) array, where n is the number of samples, in this case (100, 1).
Explanation
The easiest way to think about solving the problem is just writing down the dimensions, and trying to do the linear algebra.
We need to do some kind of manipulation of three tensors with the following shapes, to get the resulting tensor:
A, B, C -> D
(100 x 1 x 2), (2, 2), (100 x 1 x 2) -> (100 x 1)
Let the first tensor, A, have the indices, ijk:
Then we want to do some operation of A and B to get the shape (100 x 1 x 2).
Hence,
ijk, kl -> ijl
(100 x 1 x 2), (2 x 2) -> (100 x 1 x 2)
This leaves us with AB, C
(100 x 1 x 2), (100 x 1 x 2)
We want D to have the shape (100 x 1)
Hence:
ijl, ijl->ij
(100 x 1 x 2), (100 x 1 x 2) -> (100 x 1)
Putting the two operations together, we get:
ijk, kl, ijl->ij
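The index bookkeeping above can be verified numerically. The sketch below, on made-up random tensors (the names diff and B are illustrative), checks that the two separate einsum steps, the combined expression and a plain loop all agree.

```python
import numpy as np

rng = np.random.default_rng(0)
diff = rng.normal(size=(100, 1, 2))          # plays the role of A and C
B = np.array([[2.0, -0.5], [-0.5, 1.0]])     # plays the role of the (2 x 2) matrix

# Step 1: ijk,kl->ijl   (100 x 1 x 2), (2 x 2) -> (100 x 1 x 2)
step1 = np.einsum('ijk,kl->ijl', diff, B)

# Step 2: ijl,ijl->ij   (100 x 1 x 2), (100 x 1 x 2) -> (100 x 1)
step2 = np.einsum('ijl,ijl->ij', step1, diff)

# Both steps in a single call
combined = np.einsum('ijk,kl,ijl->ij', diff, B, diff)

# Reference: the quadratic form d^T B d for each sample, one at a time
reference = np.array([[d[0] @ B @ d[0]] for d in diff])

print(combined.shape, np.allclose(combined, reference))
```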
I am taking the machine learning course from Coursera. There is a topic called gradient descent to optimize the cost function. It says to simultaneously update theta0 and theta1 such that it will minimize the cost function and reach the global minimum.
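The formula for gradient descent is the usual update rule, repeated until convergence:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{for } j = 0, 1$$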
How do I do this programmatically using Python? I am using NumPy arrays and pandas to start from scratch and understand its logic step by step.
For now I have only calculated the cost function
# step 1 - collect our data
data = pd.read_csv("datasets.txt", header=None)
def compute_cost_function(x, y, theta):
    '''
    Taking in a numpy array x, y, theta and generate the cost function
    '''
    m = len(y)
    # formula for prediction = theta0 + theta1.x
    predictions = x.dot(theta)
    # formula for square error = ((theta1.x + theta0) - y)**2
    square_error = (predictions - y)**2
    # sum of square error function
    return 1/(2*m) * np.sum(square_error)
# converts into numpy represetation of the pandas dataframe. The axes labels will be excluded
numpy_data = data.values
m = data[0].size
x = np.append(np.ones((m, 1)), numpy_data[:, 0].reshape(m, 1), axis=1)
y = numpy_data[:, 1].reshape(m, 1)
theta = np.zeros((2, 1))
compute_cost_function(x, y, theta)
def gradient_descent(x, y, theta, alpha):
    '''
    simultaneously update theta0 and theta1 where
    theta0 = theta0 - alpha * 1/m * (sum of square error)
    '''
    pass
I know I have to call compute_cost_function from gradient descent, but I could not apply that formula.
What it means is that you use the previous values of the parameters and compute what you need on the right hand side. Once you're done, update the parameters. To do this the most clearly, create a temporary array inside your function that stores the results on the right hand side and return the computed result when you're finished.
def gradient_descent(x, y, theta, alpha):
    ''' simultaneously update theta0 and theta1 where
    theta0 = theta0 - alpha * 1/m * (sum of errors) '''
    m = len(y)  # number of training examples
    theta_return = np.zeros((2, 1))
    theta_return[0] = theta[0] - (alpha / m) * ((x.dot(theta) - y).sum())
    theta_return[1] = theta[1] - (alpha / m) * (((x.dot(theta) - y)*x[:, 1][:, None]).sum())
    return theta_return
We first declare the temporary array, then compute each parameter separately, namely the intercept and the slope, and finally return what we need. The nice thing about the above code is that it's vectorized. For the intercept term, x.dot(theta) performs matrix-vector multiplication between your data matrix x and parameter vector theta. Subtracting the output values y from this result gives the errors between the predicted and true values; we sum these errors, multiply by the learning rate and divide by the number of samples. We do something similar for the slope term, only we additionally multiply each error by the corresponding input value (the column of x without the bias term). We also need to make sure the input values form a column, because slicing the second column of x gives a 1D NumPy array instead of a 2D array with a singleton column. Adding that axis lets the elementwise multiplication play nicely.
One more thing to note is that you don't need to compute the cost at all when updating the parameters. Mind you, inside your optimization loop it'll be nice to call it as you're updating your parameters so you can see how well your parameters are learning from your data.
To make this truly vectorized and thus exploiting the simultaneous update, you can formulate this as a matrix-vector multiplication on the training examples alone:
def gradient_descent(x, y, theta, alpha):
    ''' simultaneously update theta0 and theta1 where
    theta0 = theta0 - alpha * 1/m * (sum of errors) '''
    m = len(y)  # number of training examples
    return theta - (alpha / m) * x.T.dot(x.dot(theta) - y)
When we compute x.dot(theta), this calculates the predicted values, and subtracting the expected values produces the error vector. When we pre-multiply by the transpose of x, we perform the summation in vectorized form: the first row of the transposed matrix x is all ones, so we simply sum up all of the error terms, which gives us the update for the bias or intercept term. Similarly, the second row of the transposed matrix x weights each error term by the corresponding sample value in x (without the bias term of 1) and computes the sum that way. The result is a 2 x 1 vector, which gives us the final update when we scale it by the learning rate divided by the number of samples and subtract it from the previous value of our parameters.
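To see that the matrix form really is the same simultaneous update, here is a small check on made-up data. The function names update_componentwise and update_matrix and the random inputs are assumptions for illustration.

```python
import numpy as np

def update_componentwise(x, y, theta, alpha):
    # intercept and slope updated separately, both from the OLD theta
    m = len(y)
    err = x.dot(theta) - y                       # (m, 1) error vector
    out = np.zeros((2, 1))
    out[0] = theta[0] - (alpha / m) * err.sum()
    out[1] = theta[1] - (alpha / m) * (err * x[:, 1][:, None]).sum()
    return out

def update_matrix(x, y, theta, alpha):
    # the same update written as one matrix-vector product
    m = len(y)
    return theta - (alpha / m) * x.T.dot(x.dot(theta) - y)

rng = np.random.default_rng(0)
m = 30
x = np.hstack([np.ones((m, 1)), rng.uniform(0, 10, size=(m, 1))])  # bias + one feature
y = 3.0 + 2.0 * x[:, 1:2] + rng.normal(scale=0.5, size=(m, 1))
theta = rng.normal(size=(2, 1))

a = update_componentwise(x, y, theta, 0.01)
b = update_matrix(x, y, theta, 0.01)
print(np.allclose(a, b))
```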
I didn't realize you were putting the code in an iterative framework. In that case you need to update the parameters at each iteration.
def gradient_descent(x, y, theta, alpha, iterations):
    ''' simultaneously update theta0 and theta1 where
    theta0 = theta0 - alpha * 1/m * (sum of errors) '''
    m = len(y)  # number of training examples
    for i in range(iterations):
        # allocate a fresh array each iteration so theta_return never aliases theta;
        # otherwise the slope update would see the already-updated intercept
        theta_return = np.zeros((2, 1))
        theta_return[0] = theta[0] - (alpha / m) * ((x.dot(theta) - y).sum())
        theta_return[1] = theta[1] - (alpha / m) * (((x.dot(theta) - y)*x[:, 1][:, None]).sum())
        theta = theta_return
    return theta
theta = gradient_descent(x, y, theta, 0.01, 1000)
At each iteration, you update the parameters then set it properly so that the next time, the current updates become the previous updates.
I'm working through my Matlab code for the Andrew NG Coursera course and turning it into python. I am working on non-regularized logistic regression and after writing my gradient and cost functions I needed something similar to fminunc and after some googling, I found a couple options. They are both returning the same results, but they do not match what is in Andrew NG's expected results code. Others seem to be getting this to work correctly, but I'm wondering why my specific code does not seem to return the desired result when using scipy.optimize functions, but does for the cost and gradient pieces earlier in the code.
The data I'm using can be found at the link below;
ex2data1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as op
#Machine Learning Online Class - Exercise 2: Logistic Regression
#Load Data
#The first two columns contains the exam scores and the third column contains the label.
data = pd.read_csv('ex2data1.txt', header = None)
X = np.array(data.iloc[:, 0:2]) #100 x 3
y = np.array(data.iloc[:,2]) #100 x 1
y.shape = (len(y), 1)
#Creating sub-dataframes for plotting
pos_plot = data[data[2] == 1]
neg_plot = data[data[2] == 0]
#==================== Part 1: Plotting ====================
#We start the exercise by first plotting the data to understand the
#the problem we are working with.
print('Plotting data with + indicating (y = 1) examples and o indicating (y = 0) examples.')
plt.plot(pos_plot[0], pos_plot[1], "+", label = "Admitted")
plt.plot(neg_plot[0], neg_plot[1], "o", label = "Not Admitted")
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()
def sigmoid(z):
    '''
    SIGMOID Compute sigmoid function
    g = SIGMOID(z) computes the sigmoid of z.
    Instructions: Compute the sigmoid of each value of z (z can be a matrix,
    vector or scalar).
    '''
    g = 1 / (1 + np.exp(-z))
    return g
def costFunction(theta, X, y):
    '''
    COSTFUNCTION Compute cost and gradient for logistic regression
    J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
    parameter for logistic regression and the gradient of the cost
    w.r.t. to the parameters.
    '''
    m = len(y) #number of training examples
    h = sigmoid(X.dot(theta)) #logistic regression hypothesis
    J = (1/m) * np.sum((-y*np.log(h)) - ((1-y)*np.log(1-h)))
    #h is 100x1, y is 100x1, these end up as 2 vectors we subtract from each other
    #then we sum the values by rows
    #cost function for logistic regression
    return J
def gradient(theta, X, y):
    m = len(y)
    grad = np.zeros((theta.shape))
    h = sigmoid(X.dot(theta))
    for i in range(len(theta)): #number of rows in theta
        XT = X[:,i]
        XT.shape = (len(X),1)
        grad[i] = (1/m) * np.sum((h-y)*XT) #updating each row of the gradient
    return grad
#============ Part 2: Compute Cost and Gradient ============
#In this part of the exercise, you will implement the cost and gradient
#for logistic regression. You need to complete the code in costFunction.m
#Add intercept term to x and X_test
Bias = np.ones((len(X), 1))
X = np.column_stack((Bias, X))
#Initialize fitting parameters
initial_theta = np.zeros((len(X[0]), 1))
#Compute and display initial cost and gradient
(cost, grad) = costFunction(initial_theta, X, y), gradient(initial_theta, X, y)
print('Cost at initial theta (zeros): %f' % cost)
print('Expected cost (approx): 0.693\n')
print('Gradient at initial theta (zeros):')
print(grad)
print('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628')
#Compute and display cost and gradient with non-zero theta
test_theta = np.array([[-24], [0.2], [0.2]]);
(cost, grad) = costFunction(test_theta, X, y), gradient(test_theta, X, y)
print('\nCost at test theta: %f' % cost)
print('Expected cost (approx): 0.218\n')
print('Gradient at test theta:')
print(grad)
print('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n')
result = op.fmin_tnc(func = costFunction, x0 = initial_theta, fprime = gradient, args = (X,y))
result[1]
Result = op.minimize(fun = costFunction,
x0 = initial_theta,
args = (X, y),
method = 'TNC',
jac = gradient, options={'gtol': 1e-3, 'disp': True, 'maxiter': 1000})
theta = Result.x
theta
test = np.array([[1, 45, 85]])
prob = sigmoid(test.dot(theta))
print('For a student with scores 45 and 85, we predict an admission probability of %f,' % prob)
print('Expected value: 0.775 +/- 0.002\n')
This was a very difficult problem to debug, and illustrates a poorly documented aspect of the scipy.optimize interface. The documentation vaguely indicates that theta will be passed around as a vector:
Minimization of scalar function of one or more variables.
In general, the optimization problems are of the form:
minimize f(x) subject to
g_i(x) >= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where x is a vector of one or more variables.
What's important is that they really mean vector in the most primitive sense, a 1-dimensional array. So you have to expect that whenever theta is passed into one of your callbacks, it will be passed in as a 1-d array. But in numpy, 1-d arrays sometimes behave differently from 2-d row arrays (and, obviously, from 2-d column arrays).
I don't know exactly why it's causing a problem in your case, but it's easily fixed regardless. You just have to add the following at the top of both your cost function and your gradient function:
theta = theta.reshape(-1, 1)
This guarantees that theta will be a 2-d column array, as expected. Once you've done this, the results are correct.
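One likely mechanism, offered here as an assumption rather than something the answer confirms, is NumPy broadcasting: with a 1-d theta the hypothesis h has shape (100,), and combining it elementwise with a (100, 1) label column silently produces a (100, 100) array. The sketch below shows this on made-up data with the same shapes as the exercise.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
# Made-up stand-in for the exam data: bias column plus two score columns
X = np.column_stack([np.ones(100), rng.uniform(30, 100, size=(100, 2))])  # (100, 3)
y = rng.integers(0, 2, size=(100, 1))                                      # (100, 1)

theta_1d = np.zeros(3)                 # the shape scipy.optimize passes to callbacks
theta_col = theta_1d.reshape(-1, 1)    # the fix suggested above

h_1d = sigmoid(X.dot(theta_1d))        # shape (100,)
h_col = sigmoid(X.dot(theta_col))      # shape (100, 1)

bad = (h_1d - y).shape     # (100,) against (100, 1) broadcasts to (100, 100)
good = (h_col - y).shape   # stays (100, 1)
print(bad, good)
```

Any np.sum over the broadcast result then adds up 10,000 terms instead of 100, so the cost and gradient no longer match the intended formulas even though no error is raised.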
I have had similar issues with Scipy dealing with the same problem as you. As senderle points out the interface is not the easiest to deal with, especially combined with the numpy array interface... Here is my implementation which works as expected.
Defining the cost and gradient functions
Note that initial_theta is passed as a simple array of shape (3,) and converted to a column vector of shape (3,1) within the function. The gradient function then returns the grad.ravel() which has shape (3,) again. This is important as doing otherwise caused an error message with various optimization methods in Scipy.optimize.
Note that different methods have different behaviours but returning .ravel() seems to fix most issues...
import pandas as pd
import numpy as np
import scipy.optimize as opt
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def CostFunc(theta,X,y):
    #Initializing variables
    m = len(y)
    J = 0
    grad = np.zeros(theta.shape)
    #Vectorized computations
    z = X @ theta
    h = sigmoid(z)
    J = (1/m) * ( (-y.T @ np.log(h)) - (1 - y).T @ np.log(1-h))
    return J
def Gradient(theta,X,y):
    #Initializing variables
    m = len(y)
    theta = theta[:,np.newaxis]
    grad = np.zeros(theta.shape)
    #Vectorized computations
    z = X @ theta
    h = sigmoid(z)
    grad = (1/m)*(X.T @ ( h - y))
    return grad.ravel() #<-- This is the trick
Initializing variables and parameters
Note that initial_theta.shape returns (3,)
X = data1.iloc[:,0:2].values
m,n = X.shape
X = np.concatenate((np.ones(m)[:,np.newaxis],X),1)
y = data1.iloc[:,-1].values[:,np.newaxis]
initial_theta = np.zeros((n+1))
Calling Scipy.optimize
model = opt.minimize(fun = CostFunc, x0 = initial_theta, args = (X, y), method = 'TNC', jac = Gradient)
Any comments from more knowledgeable people are welcome, this Scipy interface is a mystery to me, thanks
Currently my convergence criteria for SGD checks whether the MSE error ratio is within a specific boundary.
def compute_mse(data, labels, weights):
    m = len(labels)
    hypothesis = np.dot(data,weights)
    sq_errors = (hypothesis - labels) ** 2
    mse = np.sum(sq_errors)/(2.0*m)
    return mse
cur_mse = 1.0
prev_mse = 100.0
m = len(labels)
while cur_mse/prev_mse < 0.99999:
    prev_mse = cur_mse
    for i in range(m):
        d = np.array(data[i])
        hypothesis = np.dot(d, weights)
        gradient = np.dot((labels[i] - hypothesis), d)/m
        weights = weights + (alpha * gradient)
    cur_mse = compute_mse(data, labels, weights)
    if cur_mse > prev_mse:
        return
The weights are updated w.r.t. a single data point in the training set.
With an alpha of 0.001, the model is supposed to have converged within a few iterations; however, I get no convergence. Is this convergence criterion too strict?
I'll try to answer the question. First, the pseudocode of stochastic gradient descent looks something like this:
input: f(x), alpha, initial x (guess or random)
output: min_x f(x) # x that minimizes f(x)
while True:
    shuffle data # good practice, not completely needed
    for d in data:
        x -= alpha * grad(f(x)) # df/dx
    if <stopping criterion>:
        break
There can be other regularization parameters added to the function that you want to minimize, such as the l1 penalty to avoid overfitting.
Going back to your problem, looking at your data and definition of the gradient, looks like you want to solve a simple linear system of equations of the form:
Ax = b
which yields the objective function:
f(x) = ||Ax - b||^2
Stochastic gradient descent uses one row of the data at a time:
f_i(x) = ||A_i x - b_i||^2
where || o || is the Euclidean norm and _i means the index of a row.
Here, A is your data, x is your weights and b is your labels.
The gradient of the function is then computed as:
grad(f(x)) = 2 * A.T (Ax - b)
Or, in the case of stochastic gradient descent:
grad(f_i(x)) = 2 * A_i.T (A_i x - b_i)
where .T means transpose.
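The gradient formula can be checked against finite differences. The sketch below uses small made-up matrices (the sizes and the names f, analytic, numeric and per_row are my own) and also confirms that the per-row gradients add up to the full one.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))   # data
x = rng.normal(size=(2, 1))   # weights
b = rng.normal(size=(5, 1))   # labels

def f(x):
    # objective ||Ax - b||^2
    return float(np.sum((A.dot(x) - b) ** 2))

# Analytic gradient: 2 * A.T (Ax - b)
analytic = 2 * A.T.dot(A.dot(x) - b)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(x)
for j in range(x.shape[0]):
    e = np.zeros_like(x)
    e[j] = eps
    numeric[j] = (f(x + e) - f(x - e)) / (2 * eps)

# Sum of the single-row (stochastic) gradients 2 * A_i.T (A_i x - b_i)
per_row = sum(2 * A[i:i+1].T.dot(A[i:i+1].dot(x) - b[i:i+1]) for i in range(A.shape[0]))

print(np.allclose(analytic, numeric, atol=1e-5), np.allclose(analytic, per_row))
```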
Putting everything back into your code... first I will set up some synthetic data:
A = np.random.randn(100, 2) # 100x2 data
x = np.random.randn(2, 1) # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1) # 100x1 labels
b[b == 0] = -1 # labels in {-1, 1}
Then, define the parameters:
alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
idx = np.arange(m)  # an array (not a range), so it can be shuffled in place
And loop!
while cur_mse/prev_mse < 0.99999 and it < max_iter:
    prev_mse = cur_mse
    np.random.shuffle(idx)
    for i in idx:
        d = A[i:i+1]
        y = b[i:i+1]
        h = np.dot(d, x)
        dx = 2 * np.dot(d.T, (h - y))
        x -= (alpha * dx)
    cur_mse = np.mean((A.dot(x) - b)**2)
    if cur_mse > prev_mse:
        raise Exception("Not converging")
    it += 1
This code is pretty much the same as yours, with a couple of additions:
Another stopping criterion based on the number of iterations (to avoid looping forever if the system doesn't converge or does so too slowly)
Redefinition of the gradient dx (still similar to yours). You have the sign inverted, and therefore your weight update adds (+) whereas mine subtracts (-), which makes sense since you are going down the gradient.
Indexing of data and labels. While data[i] gives a 1-D array of shape (2,) (in this case for 100x2 data), slicing with data[i:i+1] returns a 2-D view of the data (with shape (1, 2)) and therefore allows you to perform the proper matrix multiplications.
You can add a 3rd stopping criterion based on acceptable mse error, i.e: if cur_mse < 1e-3: break.
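The indexing point in the list above is easy to see from the shapes alone. A minimal sketch, using a made-up 100x2 array named data and a dummy weights column:

```python
import numpy as np

data = np.arange(200.0).reshape(100, 2)   # made-up 100x2 data
weights = np.ones((2, 1))

row_1d = data[3]      # integer index drops the row axis
row_2d = data[3:4]    # slice keeps it

print(row_1d.shape)                     # 1-D row
print(row_2d.shape)                     # 2-D row
print(np.shares_memory(row_2d, data))   # the slice is a view, not a copy

# With the 2-D row, the transpose and the products have matrix shapes
pred_shape = np.dot(row_2d, weights).shape   # (1, 2) @ (2, 1)
col_shape = row_2d.T.shape                   # column version of the row
```

Transposing the 1-D row does nothing, which is why the slice form is the one used in the loop.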
This algorithm, with random data, converges in 20-40 iterations for me (depending on the generated random data).
So... assuming that this is the function you want to minimize, if this method doesn't work for you, it might mean that your system is underdetermined (you have fewer training examples than features, which means A is wider than it is tall).
Hope it helps!