Learning parameters with gradient descent - python

I just started an ML course and I'm trying to run gradient descent in Python. The functions below work fine, but when I move on to the bigger chunk where the actual learning happens, I can't get the expected output or learn the right parameters, as you can tell from the decision boundary I plotted afterwards, and I'm trying to figure out why.
[image: the decision boundary plotted from the learned parameters]
import math
import numpy as np

def sigmoid(z):
    sigma = 1/(1+np.exp(-z))
    return sigma

def compute_cost(X, y, w, b):
    y_hat = sigmoid((X * np.expand_dims(w, axis=0)).sum(axis=1) + b)
    total_cost = (-y * np.log(y_hat) - (1-y) * np.log(1-y_hat)).mean()
    return total_cost

def compute_gradient(X, y, w, b):
    z = w * X + b
    yhat = sigmoid(z)
    y1 = np.expand_dims(y, axis=1)
    error = yhat - y1
    db = error.mean()
    dw_j1 = (X * error)
    dw_j = np.mean(dw_j1, axis=0)
    return dw_j, db
Before building the gradient descent function, I tested all of the above with my training data and they all work and output the correct numbers. I'd really appreciate it if you can spot my mistakes.
Learning parameters with gradient descent
def gradient_descent(X, y, w, b, alpha, num_iters):
    m = len(X)
    J_history = []
    wb_history = []
    for i in range(num_iters):
        cost = compute_cost(X, y, w, b)
        dw_j, db = compute_gradient(X, y, w, b)
        w = w - alpha * dw_j
        b = b - alpha * db
        wb_history.append((w, b))
        J_history.append(cost)
        if i % math.ceil(num_iters/10) == 0 or i == (num_iters-1):
            print(f"Iteration {i:4}: Cost {float(J_history[-1]):8.2f}")
    return w, b, J_history, wb_history
np.random.seed(1)
initial_w = 0.01 * (np.random.rand(2) - 0.5)
initial_b = -8
iterations = 10000
alpha = 0.001
w, b, J_history, _ = gradient_descent(X_train, y_train, initial_w, initial_b, alpha, iterations)
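One thing worth noting when reading the two helpers side by side (my observation, not part of the original post): compute_cost builds the pre-activation as (X * np.expand_dims(w, axis=0)).sum(axis=1) + b, i.e. one value per example, while compute_gradient uses z = w * X + b, which never sums over the features. A sketch of a compute_gradient that mirrors compute_cost (an assumption about the intent, not the poster's code):

def compute_gradient(X, y, w, b):
    # same per-example pre-activation as compute_cost, shape (m,)
    z = (X * np.expand_dims(w, axis=0)).sum(axis=1) + b
    yhat = sigmoid(z)
    error = yhat - y                                           # shape (m,)
    db = error.mean()
    dw_j = (X * np.expand_dims(error, axis=1)).mean(axis=0)    # shape (n,)
    return dw_j, db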

Related

A Simple Linear Regression with One Neuron Gives Incorrect Results

I am trying to code a NN with one neuron. I have one input (x) and a bias (b), and I want to solve a simple regression problem: find the weight and bias for the equation below (my costFunc is just the identity, y = x).
y = 0.3 * x + 2.
The closest results I am getting are:
x = 0.38178107 (expected: ~0.3)
b = 1.10040842 (expected: ~1.0)
My question is: why are my results so far from the expected ones? Am I falling into an over/underfitting problem, or a buffer overflow?
I took into consideration the relationship between the learning rate and the number of iterations.
I know my training data is small, but I am looping through each data entry 100 times. I also tried increasing the training set to 100 entries and reducing the looping for each entry to 10 times; the results were much farther off, something like x = ~3.067 and b = -3.098.
Here are the steps I followed:
My training data is x: 1~10 & y:2.3~5.0. Training: [(1, 2.3), .., (10, 5.0)]
The derivatives used:
dE_dw = -(y-A)*x #gradient
new_w = w - lr * dE_dw
dE_db = -(y-A) #gradient
new_b = b - lr * dE_db
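For reference (my addition, not from the original post), these follow from the chain rule applied to the squared error E = 0.5 * (y - A)**2 with A = w*x + 1*b:

dE/dA = -(y - A)
dE/dw = dE/dA * dA/dw = -(y - A) * x
dE/db = dE/dA * dA/db = -(y - A)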
The Code:
import random as r
# function: calculate gradient for weight w for the x input or weight b for bias input
def calc_new_Weight(v, lr, grad):
# v is value of the weight
# lr is learning rate
# grad is gradient
new_v = v - lr * grad
return new_v
# linear cost function y=x
def costFunc(s): return s
def nn(x, y, w, b, lr):
s = x*w + 1*b
A = costFunc(s)
#Error: E = 0.5 (y - a) ** 2
#partial deriv E w/ respect to w
dE_dw = -1*(y-A)*x
w_new = calc_new_Weight(w, lr, grad = dE_dw)
# partial deriv E w/ respect to b
dE_db = -1*(y-A)
b_new = calc_new_Weight(b, lr, grad = dE_db)
return (w_new, b_new)
def main():
#random init weights w, b for the inputs x, b
w = r.random()
b = r.random()
for x, y in data:
# y = 0.3*x + 2
for i in range(1, 100):
#update w, b with the new weights
w, b = nn(x, y, w, b, lr=0.001)
print(w, b)
If you can help me understand this I really appreciate your time.
Thank you in advance
Your gradients are correct, but the gradient descent algorithm in the code needs to be modified a little.
The gradient descent algorithm goes like this (reference):
t <- 0
max_iterations <- 1000
Initialize W/theta (weights)
while t++ < max_iterations do
    H = Forward_propagate(Inputs, W)
    delta_W = Backward_propagation(H)
    W -= n*delta_W
end
In Python it looks like this:
w, b, lr, max_iterations = r.random(), r.random(), 0.001, 1000
for i in range(max_iterations):
    dw, db = 0, 0
    for x, y in data:
        # have nn return dE_dw and dE_db instead of the updated w and b
        d_w, d_b = nn(x, y, w, b, lr)
        dw += d_w
        db += d_b
    # one update per full pass over the data (batch gradient descent)
    w = w - lr * dw
    b = b - lr * db
And if you want to do stochastic gradient descent (i.e. update w and b after every point in the data), the code would be:
w, b, lr, max_iterations = r.random(), 0, 0.001, 1000
for i in range(max_iterations):
    for x, y in data:
        d_w, d_b = nn(x, y, w, b, lr)
        # update w and b immediately after each data point
        w = w - lr * d_w
        b = b - lr * d_b
Implementing SGD on your example with ~50 data points and 1000 iterations, initializing w randomly and b to 0, we can consistently converge to the expected values:
import random as r
import matplotlib.pyplot as plt

# function: calculate gradient for weight w for the x input or weight b for bias input
def calc_new_Weight(v, lr, grad):
    # v is value of the weight
    # lr is learning rate
    # grad is gradient
    new_v = v - lr * grad
    return new_v

# linear cost function y=x
def costFunc(s): return s

def nn(x, y, w, b, lr):
    s = x*w + 1*b
    A = costFunc(s)
    # Error: E = 0.5 (y - a) ** 2
    # partial deriv E w/ respect to w
    dE_dw = -1*(y-A)*x
    # w_new = calc_new_Weight(w, lr, grad = dE_dw)
    # partial deriv E w/ respect to b
    dE_db = -1*(y-A)
    # b_new = calc_new_Weight(b, lr, grad = dE_db)
    return (dE_dw, dE_db)

def main():
    # random init weights w, b for the inputs x, b
    w = r.random()
    b = 0
    x = list(range(1, 50))
    y = [(0.3*i + 2) for i in x]
    data = list(zip(x, y))
    # r.shuffle(data)
    for i in range(1, 1000):
        dw, db = 0, 0
        for x, y in data:
            # y = 0.3*x + 2
            # update w, b with the new weights
            d_w, d_b = nn(x, y, w, b, lr=0.001)
            dw += d_w
            db += d_b
            w = w - (0.001*d_w)
            b = b - (0.001*d_b)
    return w, b
w, b = main()

How to find 2 parameters with gradient descent method in Python?

I have a few lines of code that don't converge. If anyone has an idea why, I would greatly appreciate it. The original equation is written in def f(x, y, b, m) and I need to find the parameters b, m.
import numpy as np

np.random.seed(42)
x = np.random.normal(0, 5, 100)
y = 50 + 2 * x + np.random.normal(0, 2, len(x))

def f(x, y, b, m):
    return (1/len(x))*np.sum((y - (b + m*x))**2)  # it is supposed to be a sum operator

def dfb(x, y, b, m):  # partial derivative with respect to b
    return b - m*np.mean(x)+np.mean(y)

def dfm(x, y, b, m):  # partial derivative with respect to m
    return np.sum(x*y - b*x - m*x**2)

b0 = np.mean(y)
m0 = 0
alpha = 0.0001
beta = 0.0001
epsilon = 0.01

while True:
    b = b0 - alpha * dfb(x, y, b0, m0)
    m = m0 - alpha * dfm(x, y, b0, m0)
    if np.sum(np.abs(m-m0)) <= epsilon and np.sum(np.abs(b-b0)) <= epsilon:
        break
    else:
        m0 = m
        b0 = b

print(m, f(x, y, b, m))
Both derivatives got some signs mixed up:
def dfb(x, y, b, m):  # partial derivative with respect to b
    # return b - m*np.mean(x)+np.mean(y)
    #          ^             ^------ these are incorrect
    return b + m*np.mean(x) - np.mean(y)

def dfm(x, y, b, m):  # partial derivative with respect to m
    #      v------ this should be negative
    return -np.sum(x*y - b*x - m*x**2)
In fact, these derivatives are still missing some constants:
dfb should be multiplied by 2
dfm should be multiplied by 2/len(x)
I imagine that's not too bad because the gradient is scaled by alpha anyway, but it could make the speed of convergence worse.
If you do use the correct derivatives, your code will converge after one iteration:
def dfb(x, y, b, m):  # partial derivative with respect to b
    return 2 * (b + m * np.mean(x) - np.mean(y))

def dfm(x, y, b, m):  # partial derivative with respect to m
    # Used `mean` here since (2/len(x)) * np.sum(...)
    # is the same as 2 * np.mean(...)
    return -2 * np.mean(x * y - b * x - m * x**2)
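For completeness (my addition, not part of the original answer), both expressions follow directly from f(x, y, b, m) = (1/n) * sum((y - (b + m*x))**2):

df/db = (1/n) * sum(2 * (y - b - m*x) * (-1)) = 2 * (b + m*mean(x) - mean(y))
df/dm = (1/n) * sum(2 * (y - b - m*x) * (-x)) = -2 * mean(x*y - b*x - m*x**2)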

scipy.optimize.fmin_cg needs two callable functions, f and fprime; how can I extract two functions from one function that returns both values?

import numpy as np
import scipy.optimize as opt

def linear_regression(theta, X, y, lamb):
    # X(12,1+1) theta(2,1) y(12,1)
    m = X.shape[0]
    ones = np.ones([m, 1])
    X = np.hstack([ones, X])
    h = X.dot(theta)
    # cost function
    J = 1 / 2 / m * np.sum(np.power(h - y, 2)) + lamb / 2 / m * np.sum(np.power(theta[1:], 2))
    # gradient X(12,2) X.T(2,12) (h-y)(12,1) sum_error(2,1)
    sum_error = 1 / m * X.T.dot(h - y)
    temp = theta
    temp[0] = 0
    gradient = sum_error + lamb / m * temp
    return J, gradient

def f(theta, X, y, lamb):
    J, gradient = linear_regression(theta, X, y, lamb)
    return J

def fprime(theta, X, y, lamb):
    J, gradient = linear_regression(theta, X, y, lamb)
    return gradient

J, gradient = linear_regression(theta, X, y, 1)
# theta needs to be a vector, not a matrix
result = opt.fmin_cg(f, theta, fprime=fprime, args=(X, y, 1))
print(result[0])
Explain:
opt.fmin_cg(f, theta, fprime=fprime, args=(X, y, 1)) needs callable functions f and fprime.
Both of the values they should return come from linear_regression(theta, X, y, lamb).
It is easy to compute the cost function and the gradient in the same function.
Question:
Is there an easy way to extract two callable functions from linear_regression(theta, X, y, lamb)?
Calling J, gradient = linear_regression(theta, X, y, 1) and passing the results to opt.fmin_cg(J, theta, fprime=gradient, args=(X, y, 1)) does not work.
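No answer is included here, but for what it's worth (a sketch of mine, not from the original post): the f/fprime wrappers above are the standard way to split the two return values for fmin_cg. Alternatively, scipy.optimize.minimize accepts a single callable that returns both cost and gradient when jac=True is passed; the returned gradient then needs to be a flat 1-D array:

import scipy.optimize as opt

# Sketch: assumes linear_regression returns the gradient flattened to 1-D
# (e.g. via gradient.flatten()) so that minimize can consume it directly.
res = opt.minimize(linear_regression, theta, args=(X, y, 1), method='CG', jac=True)
print(res.x)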

Regularized Logistic Regression in Python (Andrew ng Course)

I'm starting the ML journey and I'm having trouble with this coding exercise.
Here is my code:
import numpy as np
import pandas as pd
import scipy.optimize as op

# Read the data and give it labels
data = pd.read_csv('ex2data2.txt', header=None, names=['Test1', 'Test2', 'Accepted'])

# Separate the features to make it fit into the mapFeature function
X1 = data['Test1'].values.T
X2 = data['Test2'].values.T

# This function makes more features (degree)
def mapFeature(x1, x2):
    degree = 6
    out = np.ones((x1.shape[0], sum(range(degree + 2))))
    curr_column = 1
    for i in range(1, degree + 1):
        for j in range(i+1):
            out[:,curr_column] = np.power(x1, i-j) * np.power(x2, j)
            curr_column += 1
    return out

# Separate the data into training and target, also initialize theta
X = mapFeature(X1, X2)
y = np.matrix(data['Accepted'].values).T
m, n = X.shape
cols = X.shape[1]
theta = np.matrix(np.zeros(cols))

# Initialize the learningRate (sigma)
learningRate = 1

# Define the Sigmoid Function (Output between 0 and 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y, learningRate):
    # This is required to make the optimize function work
    theta = theta.reshape(-1, 1)
    error = sigmoid(X @ theta)
    first = np.multiply(-y, np.log(error))
    second = np.multiply(1 - y, np.log(1 - error))
    j = np.sum((first - second)) / m + (learningRate * np.sum(np.power(theta, 2)) / 2 * m)
    return j

# Define the gradient of the cost function
def gradient(theta, X, y, learningRate):
    # This is required to make the optimize function work
    theta = theta.reshape(-1, 1)
    error = sigmoid(X @ theta)
    grad = (X.T @ (error - y)) / m + ((learningRate * theta) / m)
    grad_no = (X.T @ (error - y)) / m
    grad[0] = grad_no[0]
    return grad

Result = op.minimize(fun=cost, x0=theta, args=(X, y, learningRate), method='TNC', jac=gradient)
opt_theta = np.matrix(Result.x)

def predict(theta, X):
    sigValue = sigmoid(X @ theta.T)
    p = sigValue >= 0.5
    return p

p = predict(opt_theta, X)
print('Train Accuracy: {:f}'.format(np.mean(p == y) * 100))
So, when learningRate = 1, the accuracy should be around 83.05% but I'm getting 80.5%, and when learningRate = 0, the accuracy should be 91.52% but I'm getting 87.28%.
So the question is: what am I doing wrong? Why is my accuracy below the expected answer?
Hope someone can guide me in the right direction. Thanks!
P.S.: Here is the dataset, maybe it can help:
https://raw.githubusercontent.com/TheGirlWhiteWithBandages/Machine-Learning-Algorithms/master/Logistic%20Regression/ex2data2.txt
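One detail worth double-checking (an observation while editing, not part of the original post): in the cost above, learningRate * np.sum(np.power(theta, 2)) / 2 * m multiplies by m because of operator precedence instead of dividing by 2*m, and it also penalizes theta[0]. The regularized term is conventionally written as, for example:

    # conventional regularized term: divide by (2*m) and skip theta[0]
    j = np.sum(first - second) / m + (learningRate / (2 * m)) * np.sum(np.power(theta[1:], 2))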
Hey guys I found a way to make it even better!
Here is the code
import numpy as np
import pandas as pd
import scipy.optimize as op
from sklearn.preprocessing import PolynomialFeatures

# Read the data and give it labels
data = pd.read_csv('ex2data2.txt', header=None, names=['Test1', 'Test2', 'Accepted'])

# Separate the data into training and target
X = (data.iloc[:, 0:2]).values
y = (data.iloc[:, 2:3]).values

# Modify the features to a certain degree (Polynomial)
poly = PolynomialFeatures(6)
m = y.size
XX = poly.fit_transform(data.iloc[:, 0:2].values)

# Initialize Theta
theta = np.zeros(XX.shape[1])

# Define the Sigmoid Function (Output between 0 and 1)
def sigmoid(z):
    return (1 / (1 + np.exp(-z)))

# Define the Regularized cost function
def costFunctionReg(theta, reg, *args):
    # This is required to make the optimize function work
    h = sigmoid(XX @ theta)
    first = np.log(h).T @ -y
    second = np.log(1 - h).T @ (1 - y)
    J = (1 / m) * (first - second) + (reg / (2 * m)) * np.sum(np.square(theta[1:]))
    return J

# Define the Regularized gradient function
def gradientReg(theta, reg, *args):
    theta = theta.reshape(-1, 1)
    h = sigmoid(XX @ theta)
    grad = (1 / m) * (XX.T @ (h - y)) + (reg / m) * np.r_[[[0]], theta[1:]]
    return grad.flatten()

# Define the predict Function
def predict(theta, X):
    sigValue = sigmoid(X @ theta.T)
    p = sigValue >= 0.5
    return p

# A loop to test different values for sigma (reg parameter)
for i, Sigma in enumerate([0, 1, 100]):
    # Optimize costFunctionReg
    res2 = op.minimize(costFunctionReg, theta, args=(Sigma, XX, y), method=None, jac=gradientReg)
    # Get the accuracy of the model
    accuracy = 100 * sum(predict(res2.x, XX) == y.ravel()) / y.size
    # Get the Error between different weights
    error1 = costFunctionReg(res2.x, Sigma, XX, y)
    # print the accuracy and error
    print('Train accuracy {}% with Lambda = {}'.format(np.round(accuracy, decimals=4), Sigma))
    print(error1)
Thanks for all your help!
try out this:
# import libraries
import pandas as pd
import numpy as np

dataset = pd.read_csv('ex2data2.csv', names = ['Test #1','Test #2','Accepted'])

# splitting to x and y variables for features and target variable
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values
print('x[0] ={}, y[0] ={}'.format(x[0], y[0]))
m, n = x.shape
print('#{} Number of training samples, #{} features per sample'.format(m, n))

# import library FeatureMapping
from sklearn.preprocessing import PolynomialFeatures
# We also add one column of ones to interpret theta 0 (x with power of 0 = 1)
# by setting include_bias to True
pf = PolynomialFeatures(degree = 6, include_bias = True)
x_poly = pf.fit_transform(x)
pd.DataFrame(x_poly).head(5)

m, n = x_poly.shape
# define theta as zero
theta = np.zeros(n)
# define hyperparameter λ
lambda_ = 1
# reshape (-1,1) because we just have one feature in y column
y = y.reshape(-1,1)

def sigmoid(z):
    return 1/(1+np.exp(-z))

def lr_hypothesis(x, theta):
    return np.dot(x, theta)

def compute_cost(theta, x, y, lambda_):
    theta = theta.reshape(n,1)
    infunc1 = -y*(np.log(sigmoid(lr_hypothesis(x,theta)))) - ((1-y)*(np.log(1 - sigmoid(lr_hypothesis(x,theta)))))
    infunc2 = (lambda_*np.sum(theta[1:]**2))/(2*m)
    j = np.sum(infunc1)/m + infunc2
    return j

# gradient[0] corresponds to the gradient for theta(0)
# gradient[1:] corresponds to the gradient for theta(j), j > 0
def compute_gradient(theta, x, y, lambda_):
    gradient = np.zeros(n).reshape(n,)
    theta = theta.reshape(n,1)
    infunc1 = sigmoid(lr_hypothesis(x,theta)) - y
    gradient_in = np.dot(x.transpose(), infunc1)/m
    gradient[0] = gradient_in[0,0]  # theta(0)
    gradient[1:] = gradient_in[1:,0] + (lambda_*theta[1:,]/m).reshape(n-1,)  # theta(j); j > 0
    gradient = gradient.flatten()
    return gradient
You can now test your cost and gradient without running the optimization. The code below will optimize the model:
# hyperparameters
m, n = x_poly.shape
# define theta as zero
theta = np.zeros(n)
# define hyperparameter λ
lambda_array = [0, 1, 10, 100]

import scipy.optimize as opt

for i in range(0, len(lambda_array)):
    # Train
    print('======================================== Iteration {} ===================================='.format(i))
    optimized = opt.minimize(fun = compute_cost, x0 = theta, args = (x_poly, y, lambda_array[i]),
                             method = 'TNC', jac = compute_gradient)
    new_theta = optimized.x
    # Prediction
    y_pred_train = predictor(x_poly, new_theta)
    cm_train = confusion_matrix(y, y_pred_train)
    t_train, f_train, acc_train = acc(cm_train)
    print('With lambda = {}, {} correct, {} wrong ==========> accuracy = {}%'
          .format(lambda_array[i], t_train, f_train, acc_train*100))
Now you should see output like this:
=== Iteration 0 === With lambda = 0, 104 correct, 14 wrong ==========> accuracy = 88.13559322033898%
=== Iteration 1 === With lambda = 1, 98 correct, 20 wrong ==========> accuracy = 83.05084745762711%
=== Iteration 2 === With lambda = 10, 88 correct, 30 wrong ==========> accuracy = 74.57627118644068%
=== Iteration 3 === With lambda = 100, 72 correct, 46 wrong ==========> accuracy = 61.016949152542374%
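The loop above calls predictor, confusion_matrix and acc, which are not defined in the snippet. A minimal sketch of what they might look like (my assumptions, not part of the original answer):

from sklearn.metrics import confusion_matrix

# hypothetical helper: threshold the sigmoid output at 0.5
def predictor(x, theta):
    return (sigmoid(np.dot(x, theta)) >= 0.5).astype(int)

# hypothetical helper: correct/wrong counts and accuracy from a 2x2 confusion matrix
def acc(cm):
    correct = cm[0, 0] + cm[1, 1]
    wrong = cm[0, 1] + cm[1, 0]
    return correct, wrong, correct / (correct + wrong)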

training a logistic neuron with scipy.minimize()

I am having trouble using scipy.minimize() to train a logistic neuron.
My cost and gradient functions have been successfully tested.
scipy.minimize() sends me back "IndexError: too many indices for array".
I am using method='CG', but it's the same with other methods.
res = minimize(loCostEntro, W, args=(XX,Y,lmbda), method='CG', jac=loGradEntro, options={'maxiter': 500})
W (weights), XX (training sets) and Y (result) are all numpy 2D arrays.
Please find below the code of the gradient and the cost functions:
def loOutput(X, W):
    Z = np.dot(X, W)
    O = misc.sigmoid(Z)
    return O

def loCostEntro(W, X, Y, lmbda=0):
    m = len(X)
    O = loOutput(X, W)
    cost = -1 * (1 / m) * (np.log(O).T.dot(Y) + np.log(1 - O).T.dot(1 - Y)) \
           + (lmbda / (2 * m)) * np.sum(np.square(W[1:]))
    return cost[0,0]

def loGradEntro(W, X, Y, lmbda=0):
    m = len(X)
    O = loOutput(X, W)
    GRAD = (1 / m) * np.dot(X.T, (O - Y)) + (lmbda / m) * np.r_[[[0]], W[1:].reshape(-1, 1)]
    return GRAD
Thanks to this working example, I figured out what was wrong. The reason is that scipy.minimize() sends a 1-D weights array (W) to my gradient and cost functions, whereas my functions only supported 2-D arrays.
So reshaping W in the dot product as below fixed the issue:
def loOutput(X, W):
    Z = np.dot(X, W.reshape(-1, 1))  # reshape(-1, 1) because scipy.minimize() sends a 1-D W !!!
    O = misc.sigmoid(Z)
    return O
By the way, I encountered another, similar problem after fixing this one: the gradient function should return a 1-D gradient. So I added:
def loGradEntroFlatten(W, X, Y, lmbda=0):
    return loGradEntro(W, X, Y, lmbda).flatten()
and I updated:
res = minimize(loCostEntro, W, args=(XX,Y,lmbda), method='CG', jac=loGradEntroFlatten, options={'maxiter': 500})
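Putting the two fixes together, the full call (a sketch using the names from the post) looks like:

W0 = W.ravel()  # scipy.optimize.minimize works with a flat 1-D parameter vector
res = minimize(loCostEntro, W0, args=(XX, Y, lmbda),
               method='CG', jac=loGradEntroFlatten, options={'maxiter': 500})
W_opt = res.x.reshape(-1, 1)  # back to a column vector for the rest of the 2-D code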
