Backpropagating bias in a neural network - python

Following Andrew Trask's example, I want to implement a 3 layer neural network - 1 input, 1 hidden, 1 output - with simple dropout, for binary classification.
If I include bias terms b1 and b2, then I would need to slightly modify Andrew's code as below.
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim,dropout_percent = (0.5,4,0.2)
synapse_0 = 2*np.random.random((X.shape[1],hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
b1 = np.zeros(hidden_dim)
b2 = np.zeros(1)
for j in range(60000):
    # sigmoid activation function
    layer_1 = (1/(1+np.exp(-(np.dot(X,synapse_0) + b1))))
    # dropout
    layer_1 *= np.random.binomial([np.ones((len(X),hidden_dim))],1-dropout_percent)[0] * (1.0/(1-dropout_percent))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1) + b2)))
    # sigmoid derivative = s(x)(1-s(x))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
    b1 -= alpha*layer_1_delta
    b2 -= alpha*layer_2_delta
The problem is, of course, that with the code above the dimensions of b1 don't match the dimensions of layer_1_delta, and similarly for b2 and layer_2_delta.
I don't understand how the delta used to update b1 and b2 is calculated - according to Michael Nielsen's example, b1 and b2 should be updated by a delta, which in my code I believe corresponds to layer_1_delta and layer_2_delta respectively.
What am I doing wrong here? Have I messed up the dimensionality of the deltas or of the biases? I feel it is the latter, because if I remove the biases from this code it works fine. Thanks in advance.

First, I would rename the biases so their indices match the synapses they belong to: the hidden-layer bias becomes b0 (paired with synapse_0) and the output bias becomes b1 (paired with synapse_1). The updates then become:
b1 -= alpha * 1.0 / m * np.sum(layer_2_delta, axis=0)
b0 -= alpha * 1.0 / m * np.sum(layer_1_delta, axis=0)
where m is the number of examples in the training set and the sum runs over the examples (axis 0), so each bias update has the same shape as the bias itself. Also, the dropout rate is far too high and actually hurts convergence. So, all things considered, the whole code:
import numpy as np
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
m = X.shape[0]
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim,dropout_percent = (0.5,4,0.02)
synapse_0 = 2*np.random.random((X.shape[1],hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
b0 = np.zeros(hidden_dim)
b1 = np.zeros(1)
for j in range(10000):
    # sigmoid activation function
    layer_1 = (1/(1+np.exp(-(np.dot(X,synapse_0) + b0))))
    # dropout
    layer_1 *= np.random.binomial([np.ones((len(X),hidden_dim))],1-dropout_percent)[0] * (1.0/(1-dropout_percent))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1) + b1)))
    # sigmoid derivative = s(x)(1-s(x))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
    # sum the deltas over the examples (axis 0) so each update matches its bias shape
    b1 -= alpha * 1.0 / m * np.sum(layer_2_delta, axis=0)
    b0 -= alpha * 1.0 / m * np.sum(layer_1_delta, axis=0)
print(layer_2)
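
As a side note (my own illustration, not part of the original answer), the reason the sum runs over axis 0 is that the bias gradient of a layer is the per-unit sum of that layer's deltas over the training examples, so its shape matches the bias. A minimal standalone sketch:
import numpy as np
np.random.seed(0)
batch, hidden = 4, 3
delta = np.random.randn(batch, hidden)   # per-example deltas of one layer, shape (batch, hidden)
b = np.zeros(hidden)                     # one bias per hidden unit
grad_b = np.sum(delta, axis=0)           # sum over the batch axis -> shape (hidden,)
print(grad_b.shape == b.shape)           # True, so b -= alpha * grad_b is well-defined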

Related

Using Adam to find the minimum of the Rosenbrock function using Pytorch

I am comparing the Adam algorithm to SGD with momentum. I realised that the convergence rate of Adam is way worse than the convergence rate of SGD with momentum when applied to the Rosenbrock function. This finding is in contrast to this visualisation. You can read the underlying code here.
To ensure that I did not have an implementation error, I compared the results of my algorithm to the PyTorch implementation. PyTorch and my implementation return the same result.
Therefore either PyTorch and my implementation are both incorrect or the implementation in the link is incorrect. If you check out the code from the link above you will find that the bias-correction step is missing. After adapting my code in the same way, the results did not significantly improve.
So my question is: why does it work in the linked scenario but not in my/PyTorch implementation, even though all three should return the same result?
import numpy as np
import torch

# Rosenbrock function
class Rosenbrock:
    a_f = 1.
    b_f = 2.
    # The minimum is at (a_f, a_f**2)

class Adam_para:
    beta1 = 0.9  # 0.7 # modified because of github: https://gist.github.com/EmilienDupont/f97a3902f4f3a98f350500a3a00371db
    beta2 = 0.999
    eps = 1e-8
    lr = 2e-2
    iterations = 100

def f(x, y):
    return (Rosenbrock.a_f - x) ** 2 + Rosenbrock.b_f * (y - x ** 2) ** 2

def grad_f(x, y):
    grad_x = -1. * 2 * (Rosenbrock.a_f - x) + Rosenbrock.b_f * (-2 * x) * 2 * (y - x ** 2)
    grad_y = Rosenbrock.b_f * 1. * 2 * (y - x ** 2)
    return np.array([grad_x, grad_y])

def adam_inner(p: np.ndarray, t, exp_avg, exp_avg_sqr, lr):
    # inner loop of the Adam algorithm
    # p: current point
    # exp_avg: first moment estimate
    # exp_avg_sqr: second moment estimate
    # lr: learning rate
    # the following values are taken from the Adam paper
    beta1 = Adam_para.beta1
    beta2 = Adam_para.beta2
    eps = Adam_para.eps
    t = t + 1
    g = grad_f(*p)
    exp_avg = beta1 * exp_avg + (1 - beta1) * g
    exp_avg_sqr = beta2 * exp_avg_sqr + (1 - beta2) * np.square(g)
    bias_corr_1 = 1 - beta1 ** t
    bias_corr_2 = 1 - beta2 ** t
    exp_avg_hat = exp_avg / bias_corr_1
    exp_avg_sqr_hat = exp_avg_sqr / bias_corr_2
    denom = np.sqrt(exp_avg_sqr_hat) + eps
    p = p - lr * exp_avg_hat / denom
    return {'p': p, 'first_mom': exp_avg, 'second_mom': exp_avg_sqr}

def adam(p, it, lr=0.001):
    # it: number of iterations
    # m: first moment estimate
    # v: second moment estimate
    # init
    m = 0
    v = 0
    p_list = [p]
    for i in range(it):
        tmp = adam_inner(p_list[-1], i, m, v, lr)
        p_list.append(tmp['p'])
        m = tmp['first_mom']
        v = tmp['second_mom']
    return np.asarray(p_list)

x0 = np.array([3., 3.])
t = adam(x0, Adam_para.iterations, Adam_para.lr)

x0_torch = torch.tensor(x0, requires_grad=True)
f_torch = f(x0_torch[0], x0_torch[1])
optimizer = torch.optim.Adam([x0_torch], lr=Adam_para.lr, betas=(Adam_para.beta1, Adam_para.beta2))
for i in range(Adam_para.iterations):
    optimizer.zero_grad()
    f_torch = f(x0_torch[0], x0_torch[1])
    f_torch.backward()
    optimizer.step()

print("pytorch result:", x0_torch)
print("my result:", t[-1])

Updating weight in previous layers in Backpropagation

I am trying to create a simple neural network and am stuck at updating the weights of the first of its two layers. I believe the update I am doing to w2 is correct, based on what I learned from the backpropagation algorithm. I am not including biases for now. But how to update the first-layer weights is where I am stuck.
import numpy as np
np.random.seed(10)

def sigmoid(x):
    return 1.0/(1+ np.exp(-x))

def sigmoid_derivative(x):
    return x * (1.0 - x)

def cost_function(output, y):
    return (output - y) ** 2

x = 2
y = 4
w1 = np.random.rand()
w2 = np.random.rand()
h = sigmoid(w1 * x)
o = sigmoid(h * w2)
cost_function_output = cost_function(o, y)
prev_w2 = w2
w2 -= 0.5 * 2 * cost_function_output * h * sigmoid_derivative(o)  # 0.5 being learning rate
w1 -= 0  # What do you update this to?
print(cost_function_output)
I'm not able to comment on your question, so writing here.
Firstly, your sigmoid_derivative function is wrong.
The derivative of sigmoid(x*y) w.r.t. x is sigmoid(x*y)*(1-sigmoid(x*y))*y.
We need dW1 and dW2 (these are dJ/dW1 and dJ/dW2, the partial derivatives, respectively).
J = (o - y)^2, therefore dJ/do = 2*(o - y)
Now, dW2
dJ/dW2 = dJ/do * do/dW2 (chain rule)
dJ/dW2 = (2*(o - y)) * (o*(1 - o)*h)
dW2 (equals above equation)
W2 -= learning_rate*dW2
Now, for dW1
dJ/dh = dJ/do * do/dh = (2*(o - y)) * (o*(1 - o)*W2)
dJ/dW1 = dJ/dh * dh/dW1 = ((2*(o - y)) * (o*(1 - o)*W2)) * (h*(1 - h)*x)
dW1 (equals above equation)
W1 -= learning_rate*dW1
PS: Try to make a computational graph; finding derivatives becomes a lot easier. (If you don't know about this, read up on it online.)
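
Putting the derivation above back into the question's single-neuron-per-layer setup, a minimal sketch could look like the following (my own illustration, using the question's variable names and its squared-error cost; a sketch, not a definitive implementation):
import numpy as np
np.random.seed(10)

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

x, y = 2.0, 4.0
w1, w2 = np.random.rand(), np.random.rand()
learning_rate = 0.5

for _ in range(100):
    # forward pass
    h = sigmoid(w1 * x)
    o = sigmoid(h * w2)
    J = (o - y) ** 2
    # backward pass, following the chain rule above
    dJ_do = 2 * (o - y)
    dW2 = dJ_do * o * (1 - o) * h       # dJ/dW2
    dJ_dh = dJ_do * o * (1 - o) * w2    # dJ/dh
    dW1 = dJ_dh * h * (1 - h) * x       # dJ/dW1
    # gradient step (note that dW1 updates w1 and dW2 updates w2)
    w2 -= learning_rate * dW2
    w1 -= learning_rate * dW1

print(J)  # J can only fall to (1 - y)**2 = 9 here, since a sigmoid output never exceeds 1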

neural network xor gate classification

I've written a simple neural network that should predict the XOR gate function. I think I've used the math correctly, but the loss doesn't go down and remains near 0.6. Can anyone help me find the reason why?
import numpy as np
import matplotlib.pyplot as plt

train_X = np.array([[0,0],[0,1],[1,0],[1,1]]).T
train_Y = np.array([[0,1,1,0]])
test_X = np.array([[0,0],[0,1],[1,0],[1,1]]).T
test_Y = np.array([[0,1,1,0]])
learning_rate = 0.1
S = 5

def sigmoid(z):
    return 1/(1+np.exp(-z))

def sigmoid_derivative(z):
    return sigmoid(z)*(1-sigmoid(z))

S0, S1, S2 = 2, 5, 1
m = 4
w1 = np.random.randn(S1, S0) * 0.01
b1 = np.zeros((S1, 1))
w2 = np.random.randn(S2, S1) * 0.01
b2 = np.zeros((S2, 1))

for i in range(1000000):
    Z1 = np.dot(w1, train_X) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(w2, A1) + b2
    A2 = sigmoid(Z2)
    J = np.sum(-train_Y * np.log(A2) + (train_Y-1) * np.log(1-A2)) / m
    dZ2 = A2 - train_Y
    dW2 = np.dot(dZ2, A1.T) / m
    dB2 = np.sum(dZ2, axis = 1, keepdims = True) / m
    dZ1 = np.dot(w2.T, dZ2) * sigmoid_derivative(Z1)
    dW1 = np.dot(dZ1, train_X.T) / m
    dB1 = np.sum(dZ1, axis = 1, keepdims = True) / m
    w1 = w1 - dW1 * 0.03
    w2 = w2 - dW2 * 0.03
    b1 = b1 - dB1 * 0.03
    b2 = b2 - dB2 * 0.03
    print(J)
I think your dZ2 is not correct, as you do not multiply it with the derivative of sigmoid.
For the XOR problem, if you inspect the outputs, the 1's are slightly higher than 0.5 and the 0's are slightly lower. I believe this is because the search has reached a plateau and is therefore progressing very slowly. I tried RMSProp, which converged to almost 0 very fast. I also tried a pseudo second-order algorithm, RProp, which converged almost immediately (I used iRProp-). I am showing the plot for RMSProp below.
Also, the final output of the network is now
[[1.67096234e-06 9.99999419e-01 9.99994158e-01 6.87836337e-06]]
Rounding this gives
array([[0., 1., 1., 0.]])
But I would highly recommend performing gradient checking to be sure that the analytical gradients match the ones computed numerically. Also see Andrew Ng's Coursera lecture on gradient checking.
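For reference, gradient checking just means perturbing one parameter at a time and comparing the finite-difference slope of the cost with the analytic gradient from the backward pass. A minimal sketch (my own illustration, reusing the question's setup; not part of the original answer):
import numpy as np
np.random.seed(0)

# same setup as the question: XOR data, one hidden layer with 5 units
X = np.array([[0,0],[0,1],[1,0],[1,1]]).T
Y = np.array([[0,1,1,0]])
w1, b1 = np.random.randn(5, 2) * 0.01, np.zeros((5, 1))
w2, b2 = np.random.randn(1, 5) * 0.01, np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost():
    # forward pass and cross-entropy cost, as in the question
    A1 = sigmoid(np.dot(w1, X) + b1)
    A2 = sigmoid(np.dot(w2, A1) + b2)
    return np.sum(-Y * np.log(A2) + (Y - 1) * np.log(1 - A2)) / X.shape[1]

# numeric estimate of dJ/dw2 via central finite differences
eps = 1e-6
dW2_numeric = np.zeros_like(w2)
for i in range(w2.shape[0]):
    for j in range(w2.shape[1]):
        orig = w2[i, j]
        w2[i, j] = orig + eps; c_plus = cost()
        w2[i, j] = orig - eps; c_minus = cost()
        w2[i, j] = orig
        dW2_numeric[i, j] = (c_plus - c_minus) / (2 * eps)

print(dW2_numeric)  # compare entry by entry with the dW2 your backward pass produces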
I am adding the modified code with the RMSProp implementation below.
#!/usr/bin/python3
import numpy as np
import matplotlib.pyplot as plt

train_X = np.array([[0,0],[0,1],[1,0],[1,1]]).T
train_Y = np.array([[0,1,1,0]])
test_X = np.array([[0,0],[0,1],[1,0],[1,1]]).T
test_Y = np.array([[0,1,1,0]])
learning_rate = 0.1
S = 5

def sigmoid(z):
    return 1/(1+np.exp(-z))

def sigmoid_derivative(z):
    return sigmoid(z)*(1-sigmoid(z))

S0, S1, S2 = 2, 5, 1
m = 4
w1 = np.random.randn(S1, S0) * 0.01
b1 = np.zeros((S1, 1))
w2 = np.random.randn(S2, S1) * 0.01
b2 = np.zeros((S2, 1))

# RMSProp variables
dWsqsum1 = np.zeros_like (w1)
dWsqsum2 = np.zeros_like (w2)
dBsqsum1 = np.zeros_like (b1)
dBsqsum2 = np.zeros_like (b2)
alpha = 0.9
lr = 0.01

err_vec = list ();

for i in range(20000):
    Z1 = np.dot(w1, train_X) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(w2, A1) + b2
    A2 = sigmoid(Z2)
    J = np.sum(-train_Y * np.log(A2) + (train_Y-1) * np.log(1-A2)) / m
    dZ2 = (A2 - train_Y) * sigmoid_derivative (Z2);
    dW2 = np.dot(dZ2, A1.T) / m
    dB2 = np.sum(dZ2, axis = 1, keepdims = True) / m
    dZ1 = np.dot(w2.T, dZ2) * sigmoid_derivative(Z1)
    dW1 = np.dot(dZ1, train_X.T) / m
    dB1 = np.sum(dZ1, axis = 1, keepdims = True) / m
    # RMSProp update
    dWsqsum1 = alpha * dWsqsum1 + (1 - learning_rate) * np.square (dW1);
    dWsqsum2 = alpha * dWsqsum2 + (1 - learning_rate) * np.square (dW2);
    dBsqsum1 = alpha * dBsqsum1 + (1 - learning_rate) * np.square (dB1);
    dBsqsum2 = alpha * dBsqsum2 + (1 - learning_rate) * np.square (dB2);
    w1 = w1 - (lr * dW1 / (np.sqrt (dWsqsum1) + 10e-10));
    w2 = w2 - (lr * dW2 / (np.sqrt (dWsqsum2) + 10e-10));
    b1 = b1 - (lr * dB1 / (np.sqrt (dBsqsum1) + 10e-10));
    b2 = b2 - (lr * dB2 / (np.sqrt (dBsqsum2) + 10e-10));
    print(J)
    err_vec.append (J);

Z1 = np.dot(w1, train_X) + b1
A1 = sigmoid(Z1)
Z2 = np.dot(w2, A1) + b2
A2 = sigmoid(Z2)
print ("\n", A2);

plt.plot (np.array (err_vec));
plt.show ();

Why does my neural network's cost keep increasing?

I've implemented a neural network to predict the XOR gate. It has 1 input layer with 2 nodes, 1 hidden layer with 2 nodes and 1 output layer with 1 node. No matter what I try, my cost keeps on increasing. I've tried setting my learning rate to small values, but that just makes the cost increase slowly. Any tips are appreciated.
import numpy as np

train_data = np.array([[0,0],[0,1],[1,0],[1,1]]).T
labels = np.array([[0,1,1,0]])

def sigmoid(z, deriv = False):
    sig = 1/(1+np.exp(-z))
    if deriv == True:
        return np.multiply(sig, 1-sig)
    return sig

w1 = np.random.randn(2,2)*0.01
b1 = np.zeros((2,1))
w2 = np.random.randn(1,2)*0.01
b2 = np.zeros((1,1))
iterations = 1000
lr = 0.1

for i in range(1000):
    z1 = np.dot(w1, train_data) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(w2, a1) + b2
    al = sigmoid(z2)  # forward prop
    cost = np.dot(labels, np.log(al).T) + np.dot(1-labels, np.log(1-al).T)
    cost = cost*(-1/4)
    cost = np.squeeze(cost)  # calc cost
    dal = (-1/4) * (np.divide(labels,al) + np.divide(1-labels,1-al))
    dz2 = np.multiply(dal, sigmoid(z2, deriv = True))
    dw2 = np.dot(dz2, a1.T)
    db2 = np.sum(dz2, axis=1, keepdims = True)
    da1 = np.dot(w2.T, dz2)
    dz1 = np.multiply(da1, sigmoid(z1, deriv = True))
    dw1 = np.dot(dz1, train_data.T)
    db1 = np.sum(dz1, axis=1, keepdims = True)  # backprop
    w1 = w1 - lr*dw1
    w2 = w2 - lr*dw2
    b1 = b1 - lr*db1
    b2 = b2 - lr*db2  # update params
    print(cost, '------', str(i))
The main mistake is in the cross-entropy backprop (I recommend these notes for checking). The correct formula is the following:
dal = -labels / al + (1 - labels) / (1 - al)
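As a quick sanity check of that formula (my own addition, not from the original answer), the derivative of the per-example cross-entropy -[y*log(a) + (1-y)*log(1-a)] with respect to a can be compared against a finite-difference estimate:
import numpy as np

def ce(a, y):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

y, a, eps = 1.0, 0.3, 1e-6
analytic = -y / a + (1 - y) / (1 - a)
numeric = (ce(a + eps, y) - ce(a - eps, y)) / (2 * eps)
print(analytic, numeric)  # both approximately -3.3333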
I have also simplified the code a little bit. Here's a complete working version:
import numpy as np

train_data = np.array([[0,0], [0,1], [1,0], [1,1]]).T
labels = np.array([0, 1, 1, 1])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w1 = np.random.randn(2,2) * 0.001
b1 = np.zeros((2,1))
w2 = np.random.randn(1,2) * 0.001
b2 = np.zeros((1,1))
lr = 0.1

for i in range(1000):
    z1 = np.dot(w1, train_data) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(w2, a1) + b2
    a2 = sigmoid(z2)
    cost = -np.mean(labels * np.log(a2) + (1 - labels) * np.log(1 - a2))
    da2 = (a2 - labels) / (a2 * (1 - a2))  # version #1
    # da2 = -labels / a2 + (1 - labels) / (1 - a2)  # version #2
    dz2 = np.multiply(da2, a2 * (1 - a2))
    dw2 = np.dot(dz2, a1.T)
    db2 = np.sum(dz2, axis=1, keepdims=True)
    da1 = np.dot(w2.T, dz2)
    dz1 = np.multiply(da1, a1 * (1 - a1))
    dw1 = np.dot(dz1, train_data.T)
    db1 = np.sum(dz1, axis=1, keepdims=True)
    w1 = w1 - lr*dw1
    w2 = w2 - lr*dw2
    b1 = b1 - lr*db1
    b2 = b2 - lr*db2
    print(i, cost)

Cost of a simple non-object-oriented neural network "jumping"

I am building a sketch of a neural network in Python 3.4 with numpy and matrices to learn a simple XOR.
My Notation is as follows:
a is the activity of a neuron
z is the input of a neuron
W is a weight matrix of size (number of neurons in the previous layer) x (number of neurons in the next layer)
B is a vector of bias values
After implementing a very simple network in Python, everything works fine when training on only a single input vector. However, when training on all four training examples of XOR, the error function shows quite weird behaviour (see the pictures below) and the output of the network is always roughly 0.5.
Changing the network size, the learning rate or the training epochs does not seem to help.
Cost J while training on only one training example
Cost J while training with all training examples
This is the code for the network:
import numpy as np
import time
import matplotlib.pyplot as plt

Js = []
start = time.time()
np.random.seed(2)

# Sigmoid
def activation(x, derivative = False):
    if(derivative):
        a = activation(x)
        return a * (1 - a)
    else:
        return 1/(1+np.exp(-x))

def cost(output, target):
    return (1/2) * np.sum((target - output)**2)

INPUTS = np.array([
    [0, 1],
    [1, 0],
    [0, 0],
    [1, 1],
])
TARGET = np.array([
    [1],
    [1],
    [0],
    [0],
])

"Hyper-Parameters"
# Layer Structure
LAYER = [2, 3, 1]
LEARNING_RATE = 0.1
ITERATIONS = int(1e3)

# Init Weights
W1 = np.random.rand(LAYER[0], LAYER[1])
W2 = np.random.rand(LAYER[1], LAYER[2])
# Init Biases
B1 = np.random.rand(LAYER[1], 1)
B2 = np.random.rand(LAYER[2], 1)

for i in range(0, ITERATIONS):
    exampleIndex = i % len(INPUTS)
    #exampleIndex = 2
    "Forward Pass"
    # Layer One Activity (Input layer)
    A0 = np.transpose(INPUTS[exampleIndex:exampleIndex+1])
    # Layer Two Activity (Hidden Layer)
    Z1 = np.dot(np.transpose(W1), A0) + B1
    A1 = activation(Z1)
    # Layer Three Activity (Output Layer)
    Z2 = np.dot(np.transpose(W2), A1) + B2
    A2 = activation(Z2)
    # Output
    O = A2
    # Cost J
    # Target Vector T
    T = np.transpose(TARGET[exampleIndex:exampleIndex+1])
    J = cost(O, T)
    Js.append(J)
    print("J = {}".format(J))
    print("I = {}, O = {}".format(A0, O))
    "Backward Pass"
    # Calculate Delta of output layer
    D2 = (O - T) * activation(Z2, True)
    # Calculate Delta of hidden layer
    D1 = np.dot(W2, D2) * activation(Z1, True)
    # Calculate Derivatives w.r.t. W2
    DerW2 = np.dot(A1, np.transpose(D2))
    # Calculate Derivatives w.r.t. W1
    DerW1 = np.dot(A0, np.transpose(D1))
    # Calculate Derivatives w.r.t. B2
    DerB2 = D2
    # Calculate Derivatives w.r.t. B1
    DerB1 = D1
    "Update Weights and Biases"
    W1 -= LEARNING_RATE * DerW1
    B1 -= LEARNING_RATE * DerB1
    W2 -= LEARNING_RATE * DerW2
    B2 -= LEARNING_RATE * DerB2

# Show prediction
print("Time elapsed {}s".format(time.time() - start))
plt.plot(Js)
plt.ylabel("Cost J")
plt.xlabel("Iterations")
plt.show()
What could be the reason for this strange behaviour in my implementation?
I think your cost function is jumping because you perform your weight updates after each sample. However, your network is learning the correct behavior nonetheless:
479997
J = 4.7222501603409765e-05
I = [[1]
[0]], O = [[ 0.99028172]]
T = [[1]]
479998
J = 7.3205311398742e-05
I = [[0]
[0]], O = [[ 0.01210003]]
T = [[0]]
479999
J = 4.577485181547362e-05
I = [[1]
[1]], O = [[ 0.00956816]]
T = [[0]]
480000
J = 4.726257702199439e-05
I = [[0]
[1]], O = [[ 0.9902776]]
T = [[1]]
The cost function shows some interesting behavior: the training process reaches a point where jumps in the cost function become quite small.
You can reproduce this with the code below (I have only made slight changes; note that I trained over many more epochs):
import numpy as np
import time
import matplotlib.pyplot as plt

Js = []
start = time.time()
np.random.seed(2)

# Sigmoid
def activation(x, derivative = False):
    if(derivative):
        a = activation(x)
        return a * (1 - a)
    else:
        return 1/(1+np.exp(-x))

def cost(output, target):
    return (1/2) * np.sum((target - output)**2)

INPUTS = np.array([[0, 1],[1, 0],[0, 0],[1, 1]])
TARGET = np.array([[1],[1],[0],[0]])

"Hyper-Parameters"
# Layer Structure
LAYER = [2, 3, 1]
LEARNING_RATE = 0.1
ITERATIONS = int(5e5)

# Init Weights
W1 = np.random.rand(LAYER[0], LAYER[1])
W2 = np.random.rand(LAYER[1], LAYER[2])
# Init Biases
B1 = np.random.rand(LAYER[1], 1)
B2 = np.random.rand(LAYER[2], 1)

for i in range(0, ITERATIONS):
    exampleIndex = i % len(INPUTS)
    # exampleIndex = 2
    "Forward Pass"
    # Layer One Activity (Input layer)
    A0 = np.transpose(INPUTS[exampleIndex:exampleIndex+1])
    # Layer Two Activity (Hidden Layer)
    Z1 = np.dot(np.transpose(W1), A0) + B1
    A1 = activation(Z1)
    # Layer Three Activity (Output Layer)
    Z2 = np.dot(np.transpose(W2), A1) + B2
    A2 = activation(Z2)
    # Output
    O = A2
    # Cost J
    # Target Vector T
    T = np.transpose(TARGET[exampleIndex:exampleIndex+1])
    J = cost(O, T)
    Js.append(J)
    # print("J = {}".format(J))
    # print("I = {}, O = {}".format(A0, O))
    # print("T = {}".format(T))
    # print four consecutive iterations (one per training example) around every multiple of 20000
    if ((i+3) % 20000 == 0):
        print(i)
        print("J = {}".format(J))
        print("I = {}, O = {}".format(A0, O))
        print("T = {}".format(T))
    if ((i+2) % 20000 == 0):
        print(i)
        print("J = {}".format(J))
        print("I = {}, O = {}".format(A0, O))
        print("T = {}".format(T))
    if ((i+1) % 20000 == 0):
        print(i)
        print("J = {}".format(J))
        print("I = {}, O = {}".format(A0, O))
        print("T = {}".format(T))
    if (i % 20000 == 0):
        print(i)
        print("J = {}".format(J))
        print("I = {}, O = {}".format(A0, O))
        print("T = {}".format(T))
    "Backward Pass"
    # Calculate Delta of output layer
    D2 = (O - T) * activation(Z2, True)
    # Calculate Delta of hidden layer
    D1 = np.dot(W2, D2) * activation(Z1, True)
    # Calculate Derivatives w.r.t. W2
    DerW2 = np.dot(A1, np.transpose(D2))
    # Calculate Derivatives w.r.t. W1
    DerW1 = np.dot(A0, np.transpose(D1))
    # Calculate Derivatives w.r.t. B2
    DerB2 = D2
    # Calculate Derivatives w.r.t. B1
    DerB1 = D1
    "Update Weights and Biases"
    W1 -= LEARNING_RATE * DerW1
    B1 -= LEARNING_RATE * DerB1
    W2 -= LEARNING_RATE * DerW2
    B2 -= LEARNING_RATE * DerB2

# Show prediction
print("Time elapsed {}s".format(time.time() - start))
plt.plot(Js)
plt.ylabel("Cost J")
plt.xlabel("Iterations")
plt.savefig('cost.pdf')
plt.show()
In order to reduce fluctuations in the cost function, one usually uses multiple data samples before performing an update (an averaged update), but I see that this is difficult with a set containing only four different training examples.
So, to conclude this rather long answer: your cost function jumps because it is calculated for every single example and not for an average of multiple examples. However, the network output follows the distribution of the XOR function quite well, so you don't need to change it.
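
To illustrate the averaged-update idea in the same setup, here is a minimal full-batch sketch (my own illustration, not taken from the code above): all four XOR examples are pushed through at once and the gradients are summed over the batch before a single update, so there is exactly one cost value per update and no per-example jumping.
import numpy as np
np.random.seed(2)

def activation(x):
    return 1 / (1 + np.exp(-x))

INPUTS = np.array([[0, 1], [1, 0], [0, 0], [1, 1]]).T   # shape (2, 4): one column per example
TARGET = np.array([[1, 1, 0, 0]])                       # shape (1, 4)
W1 = np.random.rand(2, 3)
W2 = np.random.rand(3, 1)
B1 = np.random.rand(3, 1)
B2 = np.random.rand(1, 1)
LEARNING_RATE = 0.1

for i in range(100000):
    # forward pass over the whole batch
    A1 = activation(np.dot(W1.T, INPUTS) + B1)
    O = activation(np.dot(W2.T, A1) + B2)
    J = 0.5 * np.sum((TARGET - O) ** 2)
    # backward pass; weight gradients sum over the batch via the matrix products,
    # bias gradients sum explicitly over the example axis
    D2 = (O - TARGET) * O * (1 - O)
    D1 = np.dot(W2, D2) * A1 * (1 - A1)
    W2 -= LEARNING_RATE * np.dot(A1, D2.T)
    W1 -= LEARNING_RATE * np.dot(INPUTS, D1.T)
    B2 -= LEARNING_RATE * np.sum(D2, axis=1, keepdims=True)
    B1 -= LEARNING_RATE * np.sum(D1, axis=1, keepdims=True)

print(J)  # one smooth cost value per update instead of one per example
print(O)  # compare with TARGET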
