Multiple unmatched matrices in backpropagation through time - python

I am going to implement binary addition by Recurrent Neural Network (RNN) as a sample. I have coped with an issue to implement it by Python, so I decided to share my problem in there to come up with ideas to fix it.
As can be seen in my notebook code (Backpropagation through time (BPTT) section),
There is a chain rule like below to update input weight matrix like below:
My problem is this part:
I've tried to implement this part in my Python code or notebook code (class input_layer, backward method), but unmatched dimensions raises an error.
In my sample code, W_hidden is 16*16, whereas the result of delta pre_hidden is 1*2. This makes the error. If you run the code, you could see the error.
I spent a lot of time to check my chain rule as well as my code. I guess my chain rule is right. Only reason to make this error is my code.
As I know, multiple unmatched matrices in terms of dimension is impossible. If my chain rule is correct, how it could be implemented by Python?
Any idea?
Thanks in advance.

You need to apply dimension balancing on the gradients. Taken from the Stanford's cs231n course, it comes down to two simple modifications:
Given , and , we will have:
,
Here is the code I used to ensure the gradient calculation is correct. You should be able to update your code accordingly.
import torch
torch.random.manual_seed(0)
x_1, x_2 = torch.zeros(size=(1, 8)).normal_(0, 0.01), torch.zeros(size=(1, 8)).normal_(0, 0.01)
y = torch.zeros(size=(1, 8)).normal_(0, 0.01)
h_0 = torch.zeros(size=(1, 16)).normal_(0, 0.01)
weight_ih = torch.zeros(size=(8, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_hh = torch.zeros(size=(16, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_ho = torch.zeros(size=(16, 8)).normal_(mean=0, std=0.01).requires_grad_(True)
h_1 = x_1.mm(weight_ih) + h_0.mm(weight_hh)
h_2 = x_2.mm(weight_ih) + h_1.mm(weight_hh)
g_2 = h_2.sigmoid()
j_2 = g_2.mm(weight_ho)
y_predicted = j_2.sigmoid()
loss = 0.5 * (y - y_predicted).pow(2).sum()
loss.backward()
delta_1 = -1 * (y - y_predicted) * y_predicted * (1 - y_predicted)
delta_2 = delta_1.mm(weight_ho.t()) * (g_2 * (1 - g_2))
delta_3 = delta_2.mm(weight_hh.t())
# 16 x 8
weight_ho_grad = g_2.t() * delta_1
# 16 x 16
weight_hh_grad = h_1.t() * delta_2 + (h_0.t() * delta_3)
# 8 x 16
weight_ih_grad = x_2.t() * delta_2 + x_1.t() * delta_3
atol = 1e-10
assert torch.allclose(weight_ho.grad, weight_ho_grad, atol=atol)
assert torch.allclose(weight_hh.grad, weight_hh_grad, atol=atol)
assert torch.allclose(weight_ih.grad, weight_ih_grad, atol=atol)

Related

Backpropagation bug

I am trying to implement backpropagation from scratch. While my cost is decreasing, gradient check yields a whooping 0.767399376130221. I've been trying to figure out what's wrong and managed to slim down the code to these few lines:
def forward(self,X,y):
z2 = self.params_l1.dot(X.T)
a2 = self.sigmoid(z2)
z3 = self.params_l2.dot(a2)
a3 = self.sigmoid(z3)
loss = self.cross_entropy(a3,y)
return a3,loss,z2,a2,z3
def backward(self,X,y):
n_examples = len(X)
yh,loss,Z2,A2,Z3 = self.forward(X,y)
delta3 = np.multiply(-(yh - y),self.dsigmoid(Z3))
delta2 = (np.dot(self.params_l2.T,delta3))*self.dsigmoid(Z2)
de3 = np.dot(delta3,A2.T)
de2 = np.dot(delta2,X)
self.params_l2 = self.params_l2 - self.lr * (de3 /n_examples)
self.params_l1 = self.params_l1 - self.lr * (de2 / n_examples)
return de3/n_examples ,de2 /n_examples
It is a simple (2,2,1) MLP. I'm using cross-entropy as the loss function. I am following the chain rule for the backprop.
I suspect the problem may lay in the order in which i take the products, but i have tried every which way and still had no luck.
I managed to get a difference of 1.7250119005319425e-10 by computing delta3 just through yh - yand no further multiplications. Now I need to figure out why this is.

Gradient Descent basic algorithm overshooting and doesn't converge in python

So I'm new to learning ML and I am using gradient descent as my first algorithm I would like to get good at and learn well. I wrote my first code and have looked online for the issue I'm facing but due to lack of concrete knowledge I'm having a hard time understanding how I would go about diagnosing my issue. My gradient begins by approaching the correct answer and when the error has been cut by a factor of 8, the algorithm loses it's value and the b-value begins to go negative and the m-value goes past the target value. I'm sorry if I worded this odd, hopefully the code will help.
I am learning this from multiple sources on youtube and on google. I have been following Siraj Raval's math of intelligence playlist on youtube, I understood how the underlying algorithm worked but I decided to take my own approach and it seems to not be working too great. I'm struggling to read online resources as I'm inexperienced in what ever algorithm means and how it's implemented into python. I know this issue has something to do with training and testing but I don't know where to apply this.
def gradient_updater(error, mcurr, bcurr):
for i in x:
# gets the predicted y-value
ypred = (mcurr * i) + bcurr
# uses partial derivative formula to get new m and b
new_m = -(2/N) * sum(x*(y - ypred))
new_b = -(2/N) * sum(y - ypred)
# applies the new b and m value
mcurr = mcurr - (learning_rate * new_m)
bcurr = bcurr - (learning_rate * new_b)
return mcurr, bcurr
def run(iterations, initial_m, initial_b):
current_m = initial_m
current_b = initial_b
for i in range(iterations):
error = get_error(current_m, current_b)
current_m, current_b = gradient_updater(error, current_m, current_b)
print(current_m, current_b, error)
I expected the m and b values to converge to a specific value, this didn't occur and the values kept increasing in opposite direction.
If I am understanding your code correctly, I think your problem is that your taking the partial derivative to get your new slope and intercept on just one point. I'm not sure what exactly some of the variables within the gradient_updater are, so I will try to provide an example that better explains the concept:
I'm not sure we are calculating the optimization in the same way, so in my code, b0 is your 'x' in y=mx+b and b1 is your 'b' that same equation. The following code is for calculating a total b0_temp and b1_temp that will be divided by the batch size to present a new b0 and b1 to fit your graph.
for i in range(len(X)):
ERROR = ERROR + (b1*X[i] + b0 - Y[i])**2
b1_temp = b1_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])*X[i]
b0_temp = b0_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])
I run through this for every value within my dataset, where X[i] and Y[i] represent an individual datapoint.
Next, I adjust the slope that is currently fitting the graph:
b1_temp = b1_temp / batch_size
b0_temp = b0_temp / batch_size
b0 = b0 - learning_rate * b0_temp
b1 = b1 - learning_rate * b1_temp
b1_temp = 0
b0_temp = 0
Where batch_size can just be taken as len(X). I run through this for some number of epochs (i.e. a for loop of some number, 100 should work), and the line of best fit will adjust accordingly over time. The overall concept behind it is decrease the distance between each point and the line to where it is at a minimum.
Hope I was able to better explain this to you and provide you with a basic code base to adjust your's upon!
Here's where I think the error in your code lies - the calculation of the gradient. I believe that your cost function is similar to the one used in https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html. To solve the gradient, you need to aggregate the effects from all partial derivatives. In your implementation however, you iterate over the range x, without accumulating the effects. Therefore, your new_m and new_b are only calculated for the final term, x (Items marked 1 and 2 below).
Your implementation:
def gradient_updater(error, mcurr, bcurr):
for i in x:
# gets the predicted y-value
ypred = (mcurr * i) + bcurr
# uses partial derivative formula to get new m and b
new_m = -(2/N) * sum(x*(y - ypred)) #-- 1 --
new_b = -(2/N) * sum(y - ypred) #-- 2 --
# applies the new b and m value <-- Indent this block to place inside the for loop
mcurr = mcurr - (learning_rate * new_m)
bcurr = bcurr - (learning_rate * new_b)
return mcurr, bcurr
That said, I think your implementation should come closer to the mathematical formula if you just update mcurr and bcurr in every iteration (See inline comment). The other thing to do is to divide both sum(x*(y - ypred)) and sum(y - ypred) by N as well, in computing new_m and new_b.
Note
Since I do not know what your actual cost function is, I just want to point out that you are also using a constant y value in your code. It is more likely to be an array of different values and be called by Y[i] and X[i] respectively.

Python: Intersection of two equations

I have the following equations:
sqrt((x0 - x)^2 + (y0 - y)^2) - sqrt((x1 - x)^2 + (y1 - y)^2) = c1
sqrt((x3 - x)^2 + (y3 - y)^2) - sqrt((x4 - x)^2 + (y4 - y)^2) = c2
And I would like to find the intersection. I tried using fsolve, and transforming the equations into linear f(x) functions, and it worked for small numbers. I am working with huge numbers and to solve the linear equation there are lots of calculations performed, specifically the calculations reach to a square root of a subtraction, and when handling huge numbers precision is lost, and the left operand is smaller than the right one getting to a math value domain error trying to solve the square root of a negative number.
I am trying to solve this issue in different manners:
Trying to use bigger precision floats. Tried using numpy.float128 but fsolve wont allow using this.
Currently searching for a library that allows to solve non linear equations system, but no luck so far.
Any help/guidance/tip I will appreciate!!
Thanks!!
Taking all advice, i ended using code like the following:
for the the system:
0 = x + y - 8
0 = sqrt((-6 - x)^2 + (4 - y)^2) - sqrt((1 - x)^2 + y^) - 5
from math import sqrt
import numpy as np
from scipy.optimize import fsolve
def f(x):
y = np.zeros(2)
y[0] = x[1] + x[0] - 8
y[1] = sqrt((-6 - x[0]) ** 2 + (4 - x[1]) ** 2) - sqrt((1 - x[0]) ** 2 + x[1] ** 2) - 5
return y
x0 = np.array([0, 0])
solution = fsolve(f, x0)
print "(x, y) = (" + str(solution[0]) + ", " + str(solution[1]) + ")"
Note: the line x0 = np.array([0, 0]) corresponds to the seed that the method uses in fsolve in order to get to a solution. It is important to have a close seed to reach for a solution.
The example provided works :)
You might find some use in SymPy, which is a symbolic algebra manipulation in Python.
From it's home page:
SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.
As you have a non-linear equation you need some kind of optimizer to solve it. Probably you can use something like scipy.optimize (https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html). However, as I have no experience with that scipy function can offer you only a solution with the gradient descent method of the tensorflow library. You can find a short guide here: https://learningtensorflow.com/lesson7/ (check out the Gradient descent cahpter). Analog to the method described there you could do something like that:
# These arrays are pseudo code, fill in your values for x0,x1,y0,y1,...
x_array = [x0,x1,x3,x4]
y_array = [y0,y1,y3,y4]
c_array = [c1,c2]
# Tensorflow model starts here
x=tf.placeholder("float")
y=tf.placeholder("float")
z=tf.placeholder("float")
# the array [0,0] are initial guesses for the "correct" x and y that solves the equation
xy_array = tf.Variable([0,0], name="xy_array")
x0 = tf.constant(x_array[0], name="x0")
x1 = tf.constant(x_array[1], name="x1")
x3 = tf.constant(x_array[2], name="x3")
x4 = tf.constant(x_array[3], name="x4")
y0 = tf.constant(y_array[0], name="y0")
y1 = tf.constant(y_array[1], name="y1")
y3 = tf.constant(y_array[2], name="y3")
y4 = tf.constant(y_array[3], name="y4")
c1 = tf.constant(c_array[0], name="c1")
c2 = tf.constant(c_array[1], name="c2")
# I took your first line and subtracted c1 from it, same for the second line, and introduced d_1 and d_2
d_1 = tf.sqrt(tf.square(x0 - xy_array[0])+tf.square(y0 - xy_array[1])) - tf.sqrt(tf.square(x1 - xy_array[0])+tf.square(y1 - xy_array[1])) - c_1
d_2 = tf.sqrt(tf.square(x3 - xy_array[0])+tf.square(y3 - xy_array[1])) - tf.sqrt(tf.square(x4 - xy_array[0])+tf.square(y4 - xy_array[1])) - c_2
# this z_model should actually be zero in the end, in that case there is an intersection
z_model = d_1 - d_2
error = tf.square(z-z_model)
# you can try different values for the "learning rate", here 0.01
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error)
model = tf.global_variables_initializer()
with tf.Session() as session:
session.run(model)
# here you are creating a "training set" of size 1000, you can also make it bigger if you like
for i in range(1000):
x_value = np.random.rand()
y_value = np.random.rand()
d1_value = np.sqrt(np.square(x_array[0]-x_value)+np.square(y_array[0]-y_value)) - np.sqrt(np.square(x_array[1]-x_value)+np.square(y_array[1]-y_value)) - c_array[0]
d2_value = np.sqrt(np.square(x_array[2]-x_value)+np.square(y_array[2]-y_value)) - np.sqrt(np.square(x_array[3]-x_value)+np.square(y_array[3]-y_value)) - c_array[1]
z_value = d1_value - d2_value
session.run(train_op, feed_dict={x: x_value, y: y_value, z: z_value})
xy_value = session.run(xy_array)
print("Predicted model: {a:.3f}x + {b:.3f}".format(a=xy_value[0], b=xy_value[1]))
But be aware: This code will probably run a while... This is why haven't tested it...
Also I am currently not sure what will happen if there is no intersection. Probably you get the coordinates of the closest distance of the functions...
Tensorflow can be somewhat difficult if you haven't used it yet, but it is worth to learn it, as you can also use it for any deep learning application (actual purpose of this library).

How do you update the weights in function approximation with reinforcement learning?

My SARSA with gradient-descent keep escalating the weights exponentially. At Episode 4 step 17 the value is already nan
Exception: Qa is nan
e.g:
6) Qa:
Qa = -2.00890180632e+303
7) NEXT Qa:
Next Qa with west = -2.28577776413e+303
8) THETA:
1.78032402991e+303 <= -0.1 + (0.1 * -2.28577776413e+303) - -2.00890180632e+303
9) WEIGHTS (sample)
5.18266630725e+302 <= -1.58305782482e+301 + (0.3 * 1.78032402991e+303 * 1)
I don't know where to look for the mistake I made.
Here's some code FWIW:
def getTheta(self, reward, Qa, QaNext):
""" let t = r + yQw(s',a') - Qw(s,a) """
theta = reward + (self.gamma * QaNext) - Qa
def updateWeights(self, Fsa, theta):
""" wi <- wi + alpha * theta * Fi(s,a) """
for i, w in enumerate(self.weights):
self.weights[i] += (self.alpha * theta * Fsa[i])
I have about 183 binary features.
you need normalization in each trial. This will keep the weights in a bounded range. (e.g. [0,1]). They way you are adding the weights each time, just grows the weights and it would be useless after the first trial.
I would do something like this:
self.weights[i] += (self.alpha * theta * Fsa[i])
normalize(self.weights[i],wmin,wmax)
or see the following example (from literature of RL):
You need to write the normalization function by yourself though ;)
I do not have access to the full code in your application, so I might be wrong. But I think that I know where you are going wrong.
First and foremost, normalization should not be necessary here. For weights to get bloated so soon in this situation suggests something wrong with your implementation.
I think your update equation should be:-
self.weights[:, action_i] = self.weights[:, action_i] + (self.alpha * theta * Fsa[i])
That is to say that you should be updating columns instead of rows, because rows are for states and columns for for actions in the weight matrix.

Python Neural Network Backpropagation

I'm learning about neural networks, specifically looking at MLPs with a back-propagation implementation. I'm trying to implement my own network in python and I thought I'd look at some other libraries before I started. After some searching I found Neil Schemenauer's python implementation bpnn.py. (http://arctrix.com/nas/python/bpnn.py)
Having worked through the code and read the first part of Christopher M. Bishops book titled 'Neural Networks for Pattern Recognition' I found an issue in the backPropagate function:
# calculate error terms for output
output_deltas = [0.0] * self.no
for k in range(self.no):
error = targets[k]-self.ao[k]
output_deltas[k] = dsigmoid(self.ao[k]) * error
The line of code that calculates the error is different in Bishops book. On page 145, equation 4.41 he defines the output units error as:
d_k = y_k - t_k
Where y_k are the outputs and t_k are the targets. (I'm using _ to represent subscript)
So my question is should this line of code:
error = targets[k]-self.ao[k]
Be infact:
error = self.ao[k] - targets[k]
I'm most likely completely wrong but could someone help clear up my confusion please. Thanks
It all depends on the error measure you use. To give just a few examples of error measures (for brevity, I'll use ys to mean a vector of n outputs and ts to mean a vector of n targets):
mean squared error (MSE):
sum((y - t) ** 2 for (y, t) in zip(ys, ts)) / n
mean absolute error (MAE):
sum(abs(y - t) for (y, t) in zip(ys, ts)) / n
mean logistic error (MLE):
sum(-log(y) * t - log(1 - y) * (1 - t) for (y, t) in zip(ys, ts)) / n
Which one you use depends entirely on the context. MSE and MAE can be used for when the target outputs can take any values, and MLE gives very good results when your target outputs are either 0 or 1 and when y is in the open range (0, 1).
With that said, I haven't seen the errors y - t or t - y used before (I'm not very experienced in machine learning myself). As far as I can see, the source code you provided doesn't square the difference or use the absolute value, are you sure the book doesn't either? The way I see it y - t or t - y can't be very good error measures and here's why:
n = 2 # We only have two output neurons
ts = [ 0, 1 ] # Our target outputs
ys = [ 0.999, 0.001 ] # Our sigmoid outputs
# Notice that your outputs are the exact opposite of what you want them to be.
# Yet, if you use (y - t) or (t - y) to measure your error for each neuron and
# then sum up to get the total error of the network, you get 0.
t_minus_y = (0 - 0.999) + (1 - 0.001)
y_minus_t = (0.999 - 0) + (0.001 - 1)
Edit: Per alfa's comment, in the book, y - t is actually the derivative of MSE. In that case, t - y is incorrect. Note, however, that the actual derivative of MSE is 2 * (y - t) / n, not simply y - t.
If you don't divide by n (so you actually have a summed squared error (SSE), not a mean squared error), then the derivative would be 2 * (y - t). Furthermore, if you use SSE / 2 as your error measure, then the 1 / 2 and the 2 in the derivative cancel out and you are left with y - t.
You have to backpropagate the derivative of
0.5*(y-t)^2 or 0.5*(t-y)^2 with respect to y
which is always
y-t = (y-t)(+1) = (t-y)(-1)
You can study this implementation of MLP from Padasip library.
And the documentation is here
In actual code, we often calculate NEGATIVE grad(of loss with regard to w), and use w += eta*grad to update weight. Actually its a grad ascent.
In some text book, POSITIVE grad is calculated and w -= eta*grad to update weight.

Categories