I am trying to implement backpropagation from scratch. While my cost is decreasing, the gradient check yields a whopping 0.767399376130221. I've been trying to figure out what's wrong and managed to slim the code down to these few lines:
def forward(self, X, y):
    z2 = self.params_l1.dot(X.T)
    a2 = self.sigmoid(z2)
    z3 = self.params_l2.dot(a2)
    a3 = self.sigmoid(z3)
    loss = self.cross_entropy(a3, y)
    return a3, loss, z2, a2, z3

def backward(self, X, y):
    n_examples = len(X)
    yh, loss, Z2, A2, Z3 = self.forward(X, y)
    delta3 = np.multiply(-(yh - y), self.dsigmoid(Z3))
    delta2 = np.dot(self.params_l2.T, delta3) * self.dsigmoid(Z2)
    de3 = np.dot(delta3, A2.T)
    de2 = np.dot(delta2, X)
    self.params_l2 = self.params_l2 - self.lr * (de3 / n_examples)
    self.params_l1 = self.params_l1 - self.lr * (de2 / n_examples)
    return de3 / n_examples, de2 / n_examples
It is a simple (2,2,1) MLP. I'm using cross-entropy as the loss function. I am following the chain rule for the backprop.
I suspect the problem may lie in the order in which I take the products, but I have tried every which way and still had no luck.
I managed to get a difference of 1.7250119005319425e-10 by computing delta3 simply as yh - y, with no further multiplications. Now I need to figure out why this is.
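Writing out the chain rule for the output layer suggests why: for a sigmoid output with cross-entropy loss, dL/da3 = (a3 - y) / (a3 * (1 - a3)), and multiplying by da3/dz3 = a3 * (1 - a3) cancels to exactly dL/dz3 = a3 - y, so multiplying by dsigmoid(Z3) again double-counts the sigmoid derivative. A quick scalar check (a standalone sketch, not my network code) agrees:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(z, y):
    # cross-entropy of sigmoid(z) against target y, scalars for simplicity
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.3, 1.0, 1e-7
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)  # central finite difference
analytic = sigmoid(z) - y  # no extra dsigmoid factor
print(numeric, analytic)  # both approximately -0.4256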
I'm creating a neural network in Python to recognize handwritten numbers. I'm pretty sure my feedforward, backpropagation, and gradient descent are correct, since my program has a training accuracy of around 90%. I'm confident it's working properly because I physically pulled out a couple of random test images and compared them with the predictions, and they were all correct.
However, when plotting the cost function J against iterations, I'm getting very weird results. Sometimes it decreases in a weird way and sometimes it increases, depending on what I choose for my regularization factor lambda. Here's an example plot:
I suspect that the mistake is how I coded the cost function, although I cannot spot it. Here is the function:
def J2(y_target, y_pred, theta1, theta2, lamb):
    """
    Args:
        y_target np.array(n_samples, 10): One-hot target class
        y_pred np.array(n_samples, 10): Predicted class likelihoods
        [...]
    """
    m = y_target.shape[1]
    cost = np.multiply(y_target, np.log(y_pred)) + np.multiply((1 - y_target), np.log(1 - y_pred))
    cost = np.sum(cost)
    cost = (-1/m) * cost
    reg = np.sum(np.square(theta1)) + np.sum(np.square(theta2))
    reg = (lamb/2*m) * reg
    J = cost + reg
    return J
Here is how I did forward and backward propagation:
def forward_prop2(X, theta1, theta2):
    # forward propagation
    # X is an 'm by n' matrix
    # m = number of examples
    # n = number of features
    a1 = np.transpose(X)
    z2 = np.matmul(theta1, a1)
    a2 = sigmoid(z2)
    a2 = np.append(np.ones((1, a2.shape[1])), a2, axis=0)  # prepend bias row
    z3 = np.matmul(theta2, a2)
    a3 = sigmoid(z3)
    return a1, a2, a3
def backward_prop2(y_vectors, a1, a2, a3, theta1, theta2, lamb):
    # backward propagation
    # y_vectors is vectors of results (see notes for clarification)
    # outputs gradient arrays for theta1 and theta2
    m = y_vectors.shape[1]
    delta3 = a3 - y_vectors
    delta2_matmul_term = np.matmul(np.transpose(theta2), delta3)
    delta2_dot_term = np.multiply(a2, np.ones(a2.shape) - a2)  # sigmoid derivative a2*(1 - a2)
    delta2 = np.multiply(delta2_matmul_term, delta2_dot_term)
    triangle2 = np.matmul(delta3, np.transpose(a2))
    triangle1 = np.matmul(delta2[1:, :], np.transpose(a1))  # drop the bias row of delta2
    reg2 = np.zeros((theta2.shape[0], 1))
    reg2 = np.append(reg2, theta2[:, 1:], axis=1)  # don't regularize bias weights
    grad2 = (1/m)*triangle2 + lamb*reg2
    reg1 = np.zeros((theta1.shape[0], 1))
    reg1 = np.append(reg1, theta1[:, 1:], axis=1)
    grad1 = (1/m)*triangle1 + lamb*reg1
    return grad1, grad2
And then finally I run this for loop:
iterations = 1000
alpha = 1
lamb = 0
J = []
for i in range(0, iterations):
    a1, a2, a3 = nn.forward_prop2(X_train, theta1, theta2)
    grad1, grad2 = nn.backward_prop2(y_vectors_train, a1, a2, a3, theta1, theta2, lamb)
    theta1, theta2 = nn.grad_des(theta1, theta2, grad1, grad2, alpha)
    J.append(nn.J2(y_vectors_train, a3, theta1, theta2, lamb))
plt.plot(J)
plt.xlabel('Iterations')
plt.ylabel('J')
plt.show()
Edit:
Here's the plot after making lambda very small, i.e. 0.0000000001:
It still looks a bit off to me.
I agree, the issue seems to be linked with theta1 and theta2, although it also seems unclear to me how the model learns if what you're plotting is the training accuracy.
Also, you mention y_vectors is an 'm by 10' array, yet assign m = y_vectors.shape[1]. It seems like you should write m = y_vectors.shape[0].
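For concreteness, here's a minimal sketch of that fix (assuming y_target really is (n_samples, 10) as the docstring says). While at it, note that in Python lamb/2*m parses as (lamb/2)*m, not lamb/(2*m), so the regularization term as written grows with m:

m = y_target.shape[0]  # number of samples, per the docstring

cost = np.multiply(y_target, np.log(y_pred)) + np.multiply((1 - y_target), np.log(1 - y_pred))
cost = (-1/m) * np.sum(cost)

reg = np.sum(np.square(theta1)) + np.sum(np.square(theta2))
reg = (lamb / (2*m)) * reg  # parentheses matter: lamb/(2*m), not (lamb/2)*m

J = cost + reg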
P.S.: My post is more of a comment, but I can't comment yet.
I am going to implement binary addition with a Recurrent Neural Network (RNN) as a sample. I have run into an issue implementing it in Python, so I decided to share my problem here to gather ideas for fixing it.
As can be seen in my notebook code (Backpropagation through time (BPTT) section), there is a chain rule for updating the input weight matrix, and my problem is with one particular part of it.
I've tried to implement this part in my Python code or notebook code (class input_layer, backward method), but the mismatched dimensions raise an error.
In my sample code, W_hidden is 16x16, whereas the result of delta pre_hidden is 1x2. This mismatch causes the error; if you run the code, you can see it.
I spent a lot of time checking my chain rule as well as my code, and I believe the chain rule is right, so the only possible source of the error is my code.
As far as I know, multiplying matrices with mismatched dimensions is impossible. If my chain rule is correct, how can it be implemented in Python?
Any idea?
Thanks in advance.
You need to apply dimension balancing on the gradients. Taken from Stanford's CS231n course, it comes down to two simple modifications:
Given h = x·W, with x of shape (1, n) and W of shape (n, m), and an upstream gradient delta = ∂L/∂h of shape (1, m), we will have:

∂L/∂W = xᵀ·delta (shape (n, m)), and ∂L/∂x = delta·Wᵀ (shape (1, n))
Here is the code I used to ensure the gradient calculation is correct. You should be able to update your code accordingly.
import torch

torch.random.manual_seed(0)

# random inputs, target, and initial hidden state
x_1, x_2 = torch.zeros(size=(1, 8)).normal_(0, 0.01), torch.zeros(size=(1, 8)).normal_(0, 0.01)
y = torch.zeros(size=(1, 8)).normal_(0, 0.01)
h_0 = torch.zeros(size=(1, 16)).normal_(0, 0.01)

weight_ih = torch.zeros(size=(8, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_hh = torch.zeros(size=(16, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_ho = torch.zeros(size=(16, 8)).normal_(mean=0, std=0.01).requires_grad_(True)

# forward pass through two time steps
h_1 = x_1.mm(weight_ih) + h_0.mm(weight_hh)
h_2 = x_2.mm(weight_ih) + h_1.mm(weight_hh)
g_2 = h_2.sigmoid()
j_2 = g_2.mm(weight_ho)
y_predicted = j_2.sigmoid()

loss = 0.5 * (y - y_predicted).pow(2).sum()
loss.backward()  # autograd reference gradients

# manual backward pass
delta_1 = -1 * (y - y_predicted) * y_predicted * (1 - y_predicted)
delta_2 = delta_1.mm(weight_ho.t()) * (g_2 * (1 - g_2))
delta_3 = delta_2.mm(weight_hh.t())

# 16 x 8
weight_ho_grad = g_2.t() * delta_1
# 16 x 16
weight_hh_grad = h_1.t() * delta_2 + (h_0.t() * delta_3)
# 8 x 16
weight_ih_grad = x_2.t() * delta_2 + x_1.t() * delta_3

# compare manual gradients against autograd
atol = 1e-10
assert torch.allclose(weight_ho.grad, weight_ho_grad, atol=atol)
assert torch.allclose(weight_hh.grad, weight_hh_grad, atol=atol)
assert torch.allclose(weight_ih.grad, weight_ih_grad, atol=atol)
So I'm new to learning ML, and gradient descent is the first algorithm I would like to get good at and learn well. I wrote my first version of the code and have looked online for the issue I'm facing, but due to a lack of concrete knowledge I'm having a hard time understanding how to diagnose it. My gradient descent begins by approaching the correct answer, but once the error has been cut by a factor of 8, the algorithm loses its way: the b-value starts going negative and the m-value goes past the target value. I'm sorry if I worded this oddly; hopefully the code will help.
I am learning this from multiple sources on YouTube and Google. I have been following Siraj Raval's Math of Intelligence playlist on YouTube; I understood how the underlying algorithm works, but I decided to take my own approach, and it seems to not be working too well. I'm struggling to read online resources, as I'm inexperienced in what each algorithm means and how it's implemented in Python. I know this issue has something to do with training and testing, but I don't know where to apply this.
def gradient_updater(error, mcurr, bcurr):
    for i in x:
        # gets the predicted y-value
        ypred = (mcurr * i) + bcurr
        # uses partial derivative formula to get new m and b
        new_m = -(2/N) * sum(x*(y - ypred))
        new_b = -(2/N) * sum(y - ypred)
    # applies the new b and m value
    mcurr = mcurr - (learning_rate * new_m)
    bcurr = bcurr - (learning_rate * new_b)
    return mcurr, bcurr
def run(iterations, initial_m, initial_b):
    current_m = initial_m
    current_b = initial_b
    for i in range(iterations):
        error = get_error(current_m, current_b)
        current_m, current_b = gradient_updater(error, current_m, current_b)
        print(current_m, current_b, error)
I expected the m and b values to converge to specific values; this didn't occur, and the values kept growing in opposite directions.
If I am understanding your code correctly, I think your problem is that you're taking the partial derivative to get your new slope and intercept on just one point. I'm not sure exactly what some of the variables within gradient_updater are, so I will try to provide an example that better explains the concept:
I'm not sure we are calculating the optimization in the same way. In my code, b0 is your 'b' in y = mx + b, and b1 is your 'm' in that same equation. The following code calculates totals b0_temp and b1_temp that will be divided by the batch size to produce a new b0 and b1 to fit your graph.
ERROR = 0    # running totals, assumed reset to 0 before each pass
b1_temp = 0
b0_temp = 0
for i in range(len(X)):
    ERROR = ERROR + (b1*X[i] + b0 - Y[i])**2
    b1_temp = b1_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])*X[i]
    b0_temp = b0_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])
I run through this for every value within my dataset, where X[i] and Y[i] represent an individual datapoint.
Next, I adjust the slope that is currently fitting the graph:
b1_temp = b1_temp / batch_size
b0_temp = b0_temp / batch_size
b0 = b0 - learning_rate * b0_temp
b1 = b1 - learning_rate * b1_temp
b1_temp = 0
b0_temp = 0
Where batch_size can just be taken as len(X). I run through this for some number of epochs (i.e. a for loop over some number; 100 should work), and the line of best fit will adjust accordingly over time. The overall concept behind it is to decrease the distance between each point and the line until it is at a minimum.
Hope I was able to explain this to you better and provide you with a basic code base to adjust yours upon!
Here's where I think the error in your code lies: the calculation of the gradient. I believe your cost function is similar to the one used in https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html. To solve for the gradient, you need to aggregate the effects of all the partial derivatives. In your implementation, however, you iterate over x without applying the updates inside the loop, so the new_m and new_b that are eventually applied are only those calculated for the final term of x (items marked 1 and 2 below).
Your implementation:
def gradient_updater(error, mcurr, bcurr):
    for i in x:
        # gets the predicted y-value
        ypred = (mcurr * i) + bcurr
        # uses partial derivative formula to get new m and b
        new_m = -(2/N) * sum(x*(y - ypred))  #-- 1 --
        new_b = -(2/N) * sum(y - ypred)  #-- 2 --
    # applies the new b and m value <-- Indent this block to place inside the for loop
    mcurr = mcurr - (learning_rate * new_m)
    bcurr = bcurr - (learning_rate * new_b)
    return mcurr, bcurr
That said, I think your implementation would come closer to the mathematical formula if you just updated mcurr and bcurr in every iteration (see the inline comment). The other thing to do is to divide both sum(x*(y - ypred)) and sum(y - ypred) by N as well, in computing new_m and new_b.
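For illustration, a minimal sketch of the first change, keeping the question's structure and names (x, y, N, and learning_rate are assumed to be defined globally, as in the original post):

def gradient_updater(error, mcurr, bcurr):
    for i in x:
        # predicted y-values under the current parameters
        ypred = (mcurr * i) + bcurr
        # partial derivatives of the squared error
        new_m = -(2/N) * sum(x*(y - ypred))
        new_b = -(2/N) * sum(y - ypred)
        # step m and b on every iteration, not just once after the loop
        mcurr = mcurr - (learning_rate * new_m)
        bcurr = bcurr - (learning_rate * new_b)
    return mcurr, bcurr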
Note
Since I do not know what your actual cost function is, I just want to point out that you are also using a constant y value in your code. It is more likely that it should be an array of different values, accessed as Y[i] and X[i] respectively.
I am trying to work through the exercise questions in this style transfer tutorial. Does anyone know how to replace the basic gradient descent with the Adam optimizer?
I think this code may be the place to change. Thank you very much for the help.
# Reduce the dimensionality of the gradient.
grad = np.squeeze(grad)
# Scale the step-size according to the gradient-values.
step_size_scaled = step_size / (np.std(grad) + 1e-8)
# Update the image by following the gradient.
mixed_image -= grad * step_size_scaled
Referring to slides 36 and 37 of Stanford's CS231n course,
first_moment = 0
second_moment = 0
must be declared above the for i in range(num_iterations): line present in that GitHub file. Also, initialize the beta1 and beta2 variables based on your requirements. Then, you can replace your code block with the following:
# Reduce the dimensionality of the gradient.
grad = np.squeeze(grad)
# Calculate moments
first_moment = beta1 * first_moment + (1 - beta1) * grad
second_moment = beta2 * second_moment + (1 - beta2) * grad * grad
# Bias correction steps (note: this assumes i starts at 1; at i == 0 it divides by zero)
first_unbias = first_moment / (1 - beta1 ** i)
second_unbias = second_moment / (1 - beta2 ** i)
# Update the image by following the gradient (the Adam update step)
mixed_image -= step_size * first_unbias / (tf.sqrt(second_unbias) + 1e-8)
I initialize beta1 and beta2 like this:
beta1 = tf.Variable(0, name='beta1')
beta2 = tf.Variable(0, name='beta2')
session.run([beta1.initializer, beta2.initializer])
However, something goes wrong: 'Tensor' object has no attribute 'sqrt'.
The detailed error looks like this.
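One likely culprit: because beta1 and beta2 are tf.Variables, the moment arithmetic turns the NumPy arrays into graph Tensors, and NumPy ufuncs then fail on a Tensor with exactly this kind of AttributeError. A sketch of a NumPy-only variant (an assumption-laden sketch, not the tutorial's code: it assumes mixed_image and grad are NumPy arrays as in the tutorial loop, uses 0.9 and 0.999 as Adam's usual defaults, and counts t from 1 to avoid dividing by zero in the bias correction):

import numpy as np

beta1, beta2 = 0.9, 0.999  # plain floats instead of tf.Variables

# declared once, before the optimization loop
first_moment = 0.0
second_moment = 0.0

# ... inside the loop, replacing the original update block:
t = i + 1                  # 1-based step count for the bias correction
grad = np.squeeze(grad)
first_moment = beta1 * first_moment + (1 - beta1) * grad
second_moment = beta2 * second_moment + (1 - beta2) * grad * grad
first_unbias = first_moment / (1 - beta1 ** t)
second_unbias = second_moment / (1 - beta2 ** t)
mixed_image -= step_size * first_unbias / (np.sqrt(second_unbias) + 1e-8)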
I am trying to implement the stochastic gradient descent algorithm. The first solution works:
def gradientDescent(x, y, theta, alpha):
    xTrans = x.transpose()
    for i in range(0, 99):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        gradient = np.dot(xTrans, loss)
        theta = theta - alpha * gradient
    return theta
This solution gives the right theta values, but the following algorithm doesn't work:
def gradientDescent2(x, y, theta, alpha):
    xTrans = x.transpose()
    for i in range(0, 99):
        hypothesis = np.dot(x[i], theta)
        loss = hypothesis - y[i]
        gradientThetaZero = loss * x[i][0]
        gradientThetaOne = loss * x[i][1]
    theta[0] = theta[0] - alpha * gradientThetaZero
    theta[1] = theta[1] - alpha * gradientThetaOne
    return theta
I don't understand why solution 2 does not work; basically, it does the same as the first algorithm.
I use the following code to produce data:
def genData():
    x = np.random.rand(100, 2)
    y = np.zeros(shape=100)
    for i in range(0, 100):
        x[i][0] = 1
        # our target variable
        e = np.random.uniform(-0.1, 0.1, size=1)
        y[i] = np.sin(2*np.pi*x[i][1]) + e[0]
    return x, y
And use it the following way:
x,y = genData()
theta = np.ones(2)
theta = gradientDescent2(x,y,theta,0.005)
print(theta)
I hope you can help me!
Best regards, Felix
Your second code example overwrites the gradient computation on each iteration over your observation data.
In the first code snippet, you properly adjust your parameters in each looping iteration based on the error (loss function).
In the second code snippet, you calculate the point-wise gradient in each iteration, but then don't do anything with it. That means that your final update effectively only trains on the very last data point.
If instead you accumulate the gradients within the loop by summing (+=), it should be closer to what you're looking for (as an expression of the gradient of the loss function with respect to your parameters over the entire observation set).
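For illustration, a minimal sketch of that accumulation using the question's own names (one of several reasonable ways to write it):

def gradientDescent2(x, y, theta, alpha):
    gradientThetaZero = 0.0
    gradientThetaOne = 0.0
    for i in range(0, 99):
        hypothesis = np.dot(x[i], theta)
        loss = hypothesis - y[i]
        # accumulate the per-point gradients instead of overwriting them
        gradientThetaZero += loss * x[i][0]
        gradientThetaOne += loss * x[i][1]
    # a single batch update using the accumulated gradient
    theta[0] = theta[0] - alpha * gradientThetaZero
    theta[1] = theta[1] - alpha * gradientThetaOne
    return theta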