How to call "backward" in a loop with 2 optimizers? - python

I have 2 networks that I'm trying to update:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Normal
import matplotlib.pyplot as plt
from tqdm import tqdm

softplus = torch.nn.Softplus()

class Model_RL(nn.Module):
    def __init__(self):
        super(Model_RL, self).__init__()
        self.fc1 = nn.Linear(3, 20)
        self.fc2 = nn.Linear(20, 30)
        self.fc3 = nn.Linear(30, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = softplus(self.fc3(x))
        return x

class Model_FA(nn.Module):
    def __init__(self):
        super(Model_FA, self).__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 30)
        self.fc3 = nn.Linear(30, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = softplus(self.fc3(x))
        return x

net_RL = Model_RL()
net_FA = Model_FA()
The training loop is
inps = torch.tensor([[1.0]])
y = torch.tensor(10.0)

opt_RL = optim.Adam(net_RL.parameters())
opt_FA = optim.Adam(net_FA.parameters())

baseline = 0
baseline_lr = 0.1
epochs = 100

for _ in tqdm(range(epochs)):
    for inp in inps:
        with torch.no_grad():
            net_FA(inp)
        for layer in range(3):
            out_RL = net_RL(torch.tensor([1.0, 2.0, 3.0]))
            mu, std = out_RL
            dist = Normal(mu, std)
            update_values = dist.sample()
            log_p = dist.log_prob(update_values).mean()

            out = net_FA(inp)
            reward = -torch.square((y - out))
            baseline = (1 - baseline_lr) * baseline + baseline_lr * reward
            loss_RL = -(reward - baseline) * log_p

            opt_RL.zero_grad()
            opt_FA.zero_grad()
            loss_RL.backward()
            opt_RL.step()

            out = net_FA(inp)
            loss_FA = torch.mean(torch.square(y - out))

            opt_RL.zero_grad()
            opt_FA.zero_grad()
            loss_FA.backward()
            opt_FA.step()

print("Mean: " + str(mu.detach().numpy()) + ", Goal: " + str(y))
print("Standard deviation: " + str(softplus(std).detach().numpy()) + ", Goal: 0ish")
I'm getting 2 main errors:
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward()...
And when I add retain_graph=True to both backward calls I get the following
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [30, 1]], which is output 0 of TBackward, is at version 5; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)
My main question is how can I make this training work?
But intermediate questions are:
Why is retain_graph=True needed here if I'm using a loop? From here: "there is no need to use retain_graph=True. In each loop, a new graph is created".
Why does retain_graph=True seem to make training significantly slower (if I remove the other backward call)? This doesn't really make sense to me, as a new computational graph should be created in each epoch (and not just one that keeps being extended).

I think the line baseline = (1 - baseline_lr) * baseline + baseline_lr * reward is causing the error, because:
the previous state of baseline is used to compute the new state of baseline,
PyTorch tracks all of these states inside a graph,
backward flushes the graph,
so baseline at time t + 1 tries to backpropagate through baseline at time t,
but at time t + 1 the graph behind baseline at time t no longer exists.
This leads to the error (a minimal sketch reproducing it follows below).
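Here is a minimal sketch (a hypothetical toy example, not the original code) that reproduces the same RuntimeError by letting a running value carry the previous iteration's graph into the next backward call:

import torch

w = torch.tensor(1.0, requires_grad=True)
baseline = torch.tensor(0.0)
for step in range(2):
    reward = (w * 3.0) ** 2                    # pow saves its input for the backward pass
    baseline = 0.9 * baseline + 0.1 * reward   # baseline now references this step's graph
    loss = -(reward - baseline)
    loss.backward()                            # frees the graph that was just built
    # On step 1, baseline still points into step 0's (already freed) graph,
    # so backward re-traverses it and raises the
    # "Trying to backward through the graph a second time" error.

Detaching baseline before the update breaks exactly this link.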
Solution:
Since you are not optimizing the variable baseline or anything behind it:
Initialize baseline as a torch tensor.
Detach it from the graph before updating its state.
Try this:
# initialize baseline as a torch tensor
baseline = torch.tensor(0.)
baseline_lr = 0.1
epochs = 100

for _ in tqdm(range(epochs)):
    for inp in inps:
        with torch.no_grad():
            net_FA(inp)
        for layer in range(3):
            out_RL = net_RL(torch.tensor([1.0, 2.0, 3.0]))
            mu, std = out_RL
            dist = Normal(mu, std)
            update_values = dist.sample()
            log_p = dist.log_prob(update_values).mean()

            out = net_FA(inp)
            reward = -torch.square((y - out))
            # detach baseline from the graph before updating it
            baseline = (1 - baseline_lr) * baseline.detach() + baseline_lr * reward
            loss_RL = -(reward - baseline) * log_p

            opt_RL.zero_grad()
            opt_FA.zero_grad()
            loss_RL.backward()
            opt_RL.step()

            out = net_FA(inp)
            loss_FA = torch.mean(torch.square(y - out))

            opt_RL.zero_grad()
            opt_FA.zero_grad()
            loss_FA.backward()
            opt_FA.step()
But actually, I don't know why you are updating the networks 3 times for the same input.


Why does regularization in PyTorch and scratch code not match, and what is the formula used for regularization in PyTorch?

I have been trying to do L2 regularization on a binary classification model in PyTorch, but the results of the PyTorch code and my scratch code don't match.
PyTorch code:
class LogisticRegression(nn.Module):
    def __init__(self, n_input_features):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(4, 1)
        self.linear.weight.data.fill_(0.0)
        self.linear.bias.data.fill_(0.0)

    def forward(self, x):
        y_predicted = torch.sigmoid(self.linear(x))
        return y_predicted

model = LogisticRegression(4)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=0.1)
dataset = Data()
train_data = DataLoader(dataset=dataset, batch_size=1096, shuffle=False)

num_epochs = 1000
for epoch in range(num_epochs):
    for x, y in train_data:
        y_pred = model(x)
        loss = criterion(y_pred, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
Scratch Code:
def sigmoid(z):
    s = 1/(1 + np.exp(-z))
    return s

def yinfer(X, beta):
    return sigmoid(beta[0] + np.dot(X, beta[1:]))

def cost(X, Y, beta, lam):
    sum = 0
    sum1 = 0
    n = len(beta)
    m = len(Y)
    for i in range(m):
        sum = sum + Y[i]*(np.log(yinfer(X[i], beta))) + (1 - Y[i])*np.log(1 - yinfer(X[i], beta))
    for i in range(0, n):
        sum1 = sum1 + beta[i]**2
    return (-sum + (lam/2) * sum1)/(1.0*m)

def pred(X, beta):
    if (yinfer(X, beta) > 0.5):
        ypred = 1
    else:
        ypred = 0
    return ypred

beta = np.zeros(5)
iterations = 1000
arr_cost = np.zeros((iterations, 4))
print(beta)
n = len(Y_train)
for i in range(iterations):
    Y_prediction_train = np.zeros(len(Y_train))
    Y_prediction_test = np.zeros(len(Y_test))
    for l in range(len(Y_train)):
        Y_prediction_train[l] = pred(X[l, :], beta)
    for l in range(len(Y_test)):
        Y_prediction_test[l] = pred(X_test[l, :], beta)
    train_acc = format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100)
    test_acc = 100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100
    arr_cost[i, :] = [i, cost(X, Y_train, beta, lam), train_acc, test_acc]
    temp_beta = np.zeros(len(beta))
    ''' main code from below '''
    for j in range(n):
        temp_beta[0] = temp_beta[0] + yinfer(X[j, :], beta) - Y_train[j]
        temp_beta[1:] = temp_beta[1:] + (yinfer(X[j, :], beta) - Y_train[j]) * X[j, :]
    for k in range(0, len(beta)):
        temp_beta[k] = temp_beta[k] + lam * beta[k]  # regularization here
    temp_beta = temp_beta / (1.0*n)
    beta = beta - alpha*temp_beta
(Graphs of the losses, training accuracy, and testing accuracy omitted.)
Can someone please tell me why this is happening?
L2 value=0.1
Great question. I dug a lot through the PyTorch documentation and found the answer. It is quite tricky: there are basically two ways to calculate regularization. (For a summary, jump to the last section.)
PyTorch uses the first type (in which the regularization factor is not divided by the batch size).
Here's sample code that demonstrates this:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch.optim as optim

class model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)
        self.linear.weight.data.fill_(1.0)
        self.linear.bias.data.fill_(1.0)

    def forward(self, x):
        return self.linear(x)

model = model()
optimizer = optim.SGD(model.parameters(), lr=0.1, weight_decay=1.0)

input = torch.tensor([[2], [4]], dtype=torch.float32)
target = torch.tensor([[7], [11]], dtype=torch.float32)

optimizer.zero_grad()
pred = model(input)
loss = F.mse_loss(pred, target)

print(f'input: {input[0].data, input[1].data}')
print(f'prediction: {pred[0].data, pred[1].data}')
print(f'target: {target[0].data, target[1].data}')
print(f'\nMSEloss: {loss.item()}\n')

loss.backward()

print('Before updation:')
print('--------------------------------------------------------------------------')
print(f'weight [data, gradient]: {model.linear.weight.data, model.linear.weight.grad}')
print(f'bias [data, gradient]: {model.linear.bias.data, model.linear.bias.grad}')
print('--------------------------------------------------------------------------')

optimizer.step()

print('After updation:')
print('--------------------------------------------------------------------------')
print(f'weight [data]: {model.linear.weight.data}')
print(f'bias [data]: {model.linear.bias.data}')
print('--------------------------------------------------------------------------')
which outputs:
input: (tensor([2.]), tensor([4.]))
prediction: (tensor([3.]), tensor([5.]))
target: (tensor([7.]), tensor([11.]))
MSEloss: 26.0
Before updation:
--------------------------------------------------------------------------
weight [data, gradient]: (tensor([[1.]]), tensor([[-32.]]))
bias [data, gradient]: (tensor([1.]), tensor([-10.]))
--------------------------------------------------------------------------
After updation:
--------------------------------------------------------------------------
weight [data]: tensor([[4.1000]])
bias [data]: tensor([1.9000])
--------------------------------------------------------------------------
Here m = batch size = 2, lr = alpha = 0.1, lambda = weight_decay = 1.
Now consider the weight tensor, which has value = 1 and grad = -32.

Case 1 (type 1 regularization):
weight = weight - lr * (grad + weight_decay * weight)
weight = 1 - 0.1 * (-32 + 1 * 1)
weight = 4.1

Case 2 (type 2 regularization):
weight = weight - lr * (grad + (weight_decay / batch_size) * weight)
weight = 1 - 0.1 * (-32 + (1/2) * 1)
weight = 4.15

From the output we can see that the updated weight is 4.1000, which confirms that PyTorch uses type 1 regularization.
So, finally: in your code you are following type 2 regularization. Just change the last few lines to this:
# for k in range(0, len(beta)):
#     temp_beta[k] = temp_beta[k] + lam * beta[k]  # regularization here
temp_beta = temp_beta / (1.0*n)
beta = beta - alpha*(temp_beta + lam * beta)
Also, PyTorch loss functions don't include the regularization term (it is implemented inside the optimizers), so remove the regularization term from your custom cost function as well.
In summary:
PyTorch applies L2 regularization by adding weight_decay * weight directly to the gradient (not divided by the batch size).
Regularization is implemented inside the optimizers (the weight_decay parameter).
PyTorch loss functions don't include a regularization term.
The bias is also regularized when weight decay is used.
To use regularization, try:
torch.optim.optimizer_name(model.parameters(), lr=lr, weight_decay=lambda).
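To see the same rule without the built-in weight_decay, here is a small sketch (assuming plain SGD with no momentum; names such as make_model, wd and lr are illustrative) showing that adding (weight_decay / 2) * ||param||^2 to the loss, not divided by the batch size, reproduces the 4.1 update from the example above:

import torch
import torch.nn as nn
import torch.nn.functional as F

wd, lr = 1.0, 0.1
x = torch.tensor([[2.0], [4.0]])
t = torch.tensor([[7.0], [11.0]])

def make_model():
    m = nn.Linear(1, 1)
    with torch.no_grad():
        m.weight.fill_(1.0)
        m.bias.fill_(1.0)
    return m

# 1) built-in weight decay
m1 = make_model()
opt1 = torch.optim.SGD(m1.parameters(), lr=lr, weight_decay=wd)
F.mse_loss(m1(x), t).backward()
opt1.step()

# 2) manual L2 penalty added to the loss: (wd / 2) * ||param||^2, NOT divided by batch size
m2 = make_model()
opt2 = torch.optim.SGD(m2.parameters(), lr=lr)
loss = F.mse_loss(m2(x), t) + (wd / 2) * sum(p.pow(2).sum() for p in m2.parameters())
loss.backward()
opt2.step()

print(m1.weight.item(), m2.weight.item())   # both should print 4.1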

Implementing Neural Network using pure Numpy (Softmax + CrossEntropy)

I am trying a simple implementation of a multi-layer perceptron (MLP) using pure NumPy. My previous implementation using RMSE and sigmoid activation at the output (single output) works perfectly with appropriate data. However, a multi-output system (due to one-hot encoding) with a cross-entropy loss function and softmax activation always fails.
I believe I am doing something wrong in my gradient calculation but am unable to figure it out. So I am here for help.
For the current implementation, I use IRIS dataset for testing the model.
The data for IRIS is obtained as follows:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import minmax_scale

def one_hot_encoder(y):
    y_oh = np.zeros((len(y), np.max(y)+1))
    for t in np.unique(y):
        y_oh[y==t, t] = 1
    return y_oh

data = load_iris().data
target = load_iris().target
data_scaled = minmax_scale(data)
target_oh = one_hot_encoder(target)
A Neural network class is defined with a simple 1-hidden layer network as follows:
class NeuralNetwork:
    def __init__(self, x, y):
        self.x = x
        # hidden layer with 16 nodes
        self.weights1 = np.random.rand(self.x.shape[1], 16)
        self.bias1 = np.random.rand(16)
        # output layer with 3 nodes (for 3 outputs - one-hot encoded)
        self.weights2 = np.random.rand(16, 3)
        self.bias2 = np.random.rand(3)
        self.y = y
        self.pred = np.zeros(y.shape)
        self.lr = 0.001

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.x, self.weights1) + self.bias1)
        self.layer2 = softmax(np.dot(self.layer1, self.weights2) + self.bias2)
        # print(self.layer2.shape)
        return self.layer2.clip(min=1e-8, max=None)

    def backprop(self):
        dloss = cross_entropy_derivative(self.pred, self.y)  # 2*(self.y - self.pred)
        d_weights2 = np.dot(self.layer1.T, dloss*softmax_derivative(self.pred))
        d_bias2 = np.dot(np.ones([self.x.shape[0]]), dloss*softmax_derivative(self.pred))
        d_weights1 = np.dot(self.x.T, np.dot(dloss*softmax_derivative(self.pred), self.weights2.T)*sigmoid_derivative(self.layer1))
        d_bias1 = np.dot(np.ones([self.x.shape[0]]), np.dot(dloss*softmax_derivative(self.pred), self.weights2.T)*sigmoid_derivative(self.layer1))

        self.weights1 += self.lr*d_weights1
        self.weights2 += self.lr*d_weights2
        self.bias1 += self.lr*d_bias1
        self.bias2 += self.lr*d_bias2

    def train(self, X, y):
        self.x = X
        self.y = y
        self.pred = self.feedforward()
        self.backprop()

    def predict(self, X):
        self.x = X
        self.pred = self.feedforward()
        return self.pred

    def evaluate(self, y, pred):
        self.y = y
        self.pred = pred
        # self.loss = np.sqrt(np.mean((self.pred-self.y)**2))
        self.loss = cross_entropy(self.pred, self.y)
        return self.loss
The activation functions and their derivatives are computed as follows (I feel there is something wrong here)
# Activation functions
def sigmoid(t):
    return 1/(1 + np.exp(-t))

# Derivative of sigmoid
def sigmoid_derivative(p):
    return p * (1 - p)

# softmax activation
def softmax(X):
    exps = np.exp(X - np.max(X, axis=1).reshape(-1, 1))
    return exps / np.sum(exps, axis=1)[:, None]

# derivative of softmax
def softmax_derivative(pred):
    return pred * (1 - (1 * pred).sum(axis=1)[:, None])
The cross-entropy loss function and its derivatives are as shown below:
def cross_entropy(X, y):
    X = X.clip(min=1e-8, max=None)
    # print('\n\nCE: ', (np.where(y==1, -np.log(X), 0)).sum(axis=1))
    return (np.where(y==1, -np.log(X), 0)).sum(axis=1)

def cross_entropy_derivative(X, y):
    X = X.clip(min=1e-8, max=None)
    # print('\n\nCED: ', np.where(y==1, -1/X, 0))
    return np.where(y==1, -1/X, 0)
The main function call:
NN = NeuralNetwork(data_scaled, target_oh)
loss = []  # added here: the original snippet appends to `loss` without defining it
for i in range(10000):  # trains the NN 10,000 times
    NN.train(data_scaled, target_oh)
    loss.append(NN.evaluate(NN.y, NN.pred))

y_pred = NN.predict(data_scaled)
The output is approximately constant, always predicting a single class. What am I doing wrong? I'd appreciate your help on the code or directions to look at. Thanks.
Subtract the gradient (the updates currently add it), and take the derivative at the unactivated output instead of the activated output. Read this piece of code that I wrote for more info, and watch Sebastian Lague's video about neural networks for help on this topic.
P.S. The video is not in Python, but it explains exactly what three years in college try to explain.
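For reference, here is a minimal sketch of what that correction could look like, using the standard identity that for a softmax output with one-hot targets the cross-entropy gradient with respect to the unactivated output is simply pred - y. The function name backprop_step is hypothetical and the shapes follow the question's 4-16-3 network; treat it as an illustration rather than a drop-in fix:

import numpy as np

def backprop_step(x, layer1, pred, y, weights1, weights2, bias1, bias2, lr):
    """One gradient-descent step; pred is the softmax output, y is one-hot."""
    n = x.shape[0]
    dz2 = (pred - y) / n                               # N x 3, gradient at the unactivated output
    d_weights2 = layer1.T.dot(dz2)                     # 16 x 3
    d_bias2 = dz2.sum(axis=0)                          # 3
    dz1 = dz2.dot(weights2.T) * layer1 * (1 - layer1)  # N x 16, sigmoid derivative of the hidden layer
    d_weights1 = x.T.dot(dz1)                          # 4 x 16
    d_bias1 = dz1.sum(axis=0)                          # 16
    # subtract the gradients (gradient descent) instead of adding them
    weights1 -= lr * d_weights1
    weights2 -= lr * d_weights2
    bias1 -= lr * d_bias1
    bias2 -= lr * d_bias2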

Strange result Neural network Python

I followed an article here: TowardsDataScience.
I wrote math equations about the network, everything made sense.
However, after writing the code, the results are pretty strange: it always predicts the same class...
I spent a lot of time on it, changed many things, but I still cannot understand what I did wrong.
Here is the code:
# coding: utf-8
from mnist import MNIST
import numpy as np
import math
import os
import pdb

DATASETS_PREFIX = '../Datasets/MNIST'
mndata = MNIST(DATASETS_PREFIX)
TRAINING_IMAGES, TRAINING_LABELS = mndata.load_training()
TESTING_IMAGES, TESTING_LABELS = mndata.load_testing()

### UTILS
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def d_sigmoid(x):
    return x.T * (1 - x)
    #return np.dot(x.T, 1.0 - x)

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def d_softmax(x):
    #This function has not yet been tested.
    return x.T * (1 - x)

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1 - x.T * x

def normalize(image):
    return image / (255.0 * 0.99 + 0.01)
### !UTILS

class NeuralNetwork(object):
    """
    This is a 3-layer neural network (1 hidden layer).
    #_input   : input layer
    #_weights1: weights between input layer and hidden layer (matrix shape (input.shape[1], 4))
    #_weights2: weights between hidden layer and output layer (matrix shape (4, 1))
    #_y       : output
    #_output  : computed output
    #_alpha   : learning rate
    """
    def __init__(self, xshape, yshape):
        self._neurones_nb = 20
        self._input = None
        self._weights1 = np.random.randn(xshape, self._neurones_nb)
        self._weights2 = np.random.randn(self._neurones_nb, yshape)
        self._y = np.mat(np.zeros(yshape))
        self._output = np.mat(np.zeros(yshape))
        self._alpha1 = 0.1
        self._alpha2 = 0.1
        self._function = sigmoid
        self._derivative = d_sigmoid
        self._epoch = 1

    def Train(self, xs, ys):
        for j in range(self._epoch):
            for i in range(len(xs)):
                self._input = normalize(np.mat(xs[i]))
                self._y[0, ys[i]] = 1
                self.feedforward()
                self.backpropagation()
                self._y[0, ys[i]] = 0

    def Predict(self, image):
        self._input = normalize(image)
        out = self.feedforward()
        return out

    def feedforward(self):
        self._layer1 = self._function(np.dot(self._input, self._weights1))
        self._output = self._function(np.dot(self._layer1, self._weights2))
        return self._output

    def backpropagation(self):
        d_weights2 = np.dot(
            self._layer1.T,
            2 * (self._y - self._output) * self._derivative(self._output)
        )
        d_weights1 = np.dot(
            self._input.T,
            np.dot(
                2 * (self._y - self._output) * self._derivative(self._output),
                self._weights2.T
            ) * self._derivative(self._layer1)
        )
        self._weights1 += self._alpha1 * d_weights1
        self._weights2 += self._alpha2 * d_weights2

if __name__ == '__main__':
    neural_network = NeuralNetwork(len(TRAINING_IMAGES[0]), 10)
    print('* training neural network')
    neural_network.Train(TRAINING_IMAGES, TRAINING_LABELS)
    print('* testing neural network')
    count = 0
    for i in range(len(TESTING_IMAGES)):
        image = np.mat(TESTING_IMAGES[i])
        expected = TESTING_LABELS[i]
        prediction = neural_network.Predict(image)
        if i % 100 == 0: print(expected, prediction)
    #print(f'* results: {count} / {len(TESTING_IMAGES)}')
Thank you for your help, really appreciated.
Julien
Well, I don't see any error in the implementation, so considering your network, it could be improved by doing two things:
One epoch is not enough. Not at all! You need to pass over your data multiple times (an absolute minimum is about 10, an average might be around 100 epochs, and this could go up to 5000 or more).
Your network is a shallow network, i.e. really simple. To detect difficult things (like images), you could implement a CNN (Convolutional Neural Network), or first try to deepen your network and make it more complex.
=> Try to add layers (3, 4, 5, etc.) and then add neurons to each layer (50, 60, ...) depending on the size of your input. You can go up to 800, 900 or more. A rough sketch of these settings follows below.
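As a rough sketch of those two suggestions applied to the question's NeuralNetwork class (the values below are illustrative, not tuned):

# Hypothetical settings inside NeuralNetwork.__init__:
self._neurones_nb = 60   # a wider hidden layer (the question uses 20)
self._epoch = 100        # many passes over the training data instead of one

Adding further hidden layers would additionally require extra weight matrices and an extra matrix product in feedforward and backpropagation.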

Neural network can learn |sin(x)| for [0,pi] but not [0,2pi] or [0, 4pi]

My neural network can learn |sin(x)| for [0,pi], but not larger intervals than that. I tried changing the quantity and widths of hidden layers in various ways, but none of the changes leads to a good result.
I train the NN on thousands of random values from a uniform distribution in the chosen interval, using backpropagation with gradient descent.
I am starting to think there is a fundamental problem in my network.
For the following examples I used a 1-10-10-1 layer structure:
(Result plots for [0, pi], [0, 2pi], and [0, 4pi] omitted.)
Here is the code for the neural network:
import math
import numpy
import random
import copy
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1.0/(1 + numpy.exp(-x))

def sigmoid_derivative(x):
    return x * (1.0 - x)

class NeuralNetwork:
    def __init__(self, weight_dimensions, x=None, y=None):
        self.weights = []
        self.layers = [[]] * len(weight_dimensions)
        self.weight_gradients = []
        self.learning_rate = 1
        self.layers[0] = x
        for i in range(len(weight_dimensions) - 1):
            self.weights.append(numpy.random.rand(weight_dimensions[i], weight_dimensions[i+1]) - 0.5)
        self.y = y

    def feed_forward(self):
        # calculate an output using feed forward layer-by-layer
        for i in range(len(self.layers) - 1):
            self.layers[i + 1] = sigmoid(numpy.dot(self.layers[i], self.weights[i]))

    def print_loss(self):
        loss = numpy.square(self.layers[-1] - self.y).sum()
        print(loss)

    def get_weight_gradients(self):
        return self.weight_gradients

    def apply_weight_gradients(self):
        for i in range(len(self.weight_gradients)):
            self.weights[i] += self.weight_gradients[i] * self.learning_rate
        if self.learning_rate > 0.001:
            self.learning_rate -= 0.0001

    def back_prop(self):
        # find derivative of the loss function with respect to weights
        self.weight_gradients = []
        deltas = []
        output_error = (self.y - self.layers[-1])
        output_delta = output_error * sigmoid_derivative(self.layers[-1])
        deltas.append(output_delta)
        self.weight_gradients.append(self.layers[-2].T.dot(output_delta))
        for i in range(len(self.weights) - 1):
            i_error = deltas[i].dot(self.weights[-(i+1)].T)
            i_delta = i_error * sigmoid_derivative(self.layers[-(i+2)])
            self.weight_gradients.append(self.layers[-(i+3)].T.dot(i_delta))
            deltas.append(copy.deepcopy(i_delta))
        # Unreverse weight gradient list
        self.weight_gradients = self.weight_gradients[::-1]

    def get_output(self, inp):
        self.layers[0] = inp
        self.feed_forward()
        return self.layers[-1]

def sin_test():
    interval = numpy.random.uniform(0, 2*math.pi, int(1000*(2*math.pi)))
    x_values = []
    y_values = []
    for i in range(len(interval)):
        y_values.append([abs(math.sin(interval[i]))])
        x_values.append([interval[i]])
    x = numpy.array(x_values)
    y = numpy.array(y_values)

    nn = NeuralNetwork([1, 10, 10, 1], x, y)
    for i in range(10000):
        tmp_input = []
        tmp_output = []
        mini_batch_indexes = random.sample(range(0, len(x)), 10)
        for j in mini_batch_indexes:
            tmp_input.append(x[j])
            tmp_output.append(y[j])
        nn.layers[0] = numpy.array(tmp_input)
        nn.y = numpy.array(tmp_output)
        nn.feed_forward()
        nn.back_prop()
        nn.apply_weight_gradients()
        nn.print_loss()

    nn.layers[0] = numpy.array(numpy.array(x))
    nn.y = numpy.array(numpy.array(y))
    nn.feed_forward()

    axis_1 = []
    axis_2 = []
    for i in range(len(nn.layers[-1])):
        axis_1.append(nn.layers[0][i][0])
        axis_2.append(nn.layers[-1][i][0])

    true_axis_2 = []
    for x in axis_1:
        true_axis_2.append(abs(math.sin(x)))

    axises = []
    for i in range(len(axis_1)):
        axises.append([axis_1[i], axis_2[i], true_axis_2[i]])
    axises.sort(key=lambda x: x[0], reverse=False)

    axis_1_new = []
    axis_2_new = []
    true_axis_2_new = []
    for elem in axises:
        axis_1_new.append(elem[0])
        axis_2_new.append(elem[1])
        true_axis_2_new.append(elem[2])

    plt.plot(axis_1_new, axis_2_new, label="nn")
    plt.plot(axis_1_new, true_axis_2_new, 'k--', label="sin(x)")
    plt.grid()
    plt.axis([0, 2*math.pi, -1, 2.5])
    plt.show()

sin_test()
The main issue with your network seems to be that you apply the activation function to the final "layer" of your network. The final output of your network should be a linear combination without any sigmoid applied.
As a warning though, do not expect the model to generalize outside of the region included in the training data.
Here is an example in PyTorch:
import torch
import torch.nn as nn
import math
import numpy as np
import matplotlib.pyplot as plt

N = 1000
p = 2.5
x = 2 * p * math.pi * torch.rand(N, 1)
y = np.abs(np.sin(x))

with torch.no_grad():
    plt.plot(x.numpy(), y.numpy(), '.')
    plt.savefig("training_data.png")

inner = 20
model = nn.Sequential(
    nn.Linear(1, inner, bias=True),
    nn.Sigmoid(),
    nn.Linear(inner, 1, bias=True)#,
    #nn.Sigmoid()
)

loss_fn = nn.MSELoss()
learning_rate = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500000):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    if t % 1000 == 0:
        print("MSE: {}".format(t), loss.item())
    model.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    X = torch.arange(0, p * 2 * math.pi, step=0.01).reshape(-1, 1)
    Y = model(X)
    Y_TRUTH = np.abs(np.sin(X))
    print(Y.shape)
    print(Y_TRUTH.shape)
    loss = loss_fn(Y, Y_TRUTH)
    plt.clf()
    plt.plot(X.numpy(), Y_TRUTH.numpy())
    plt.plot(X.numpy(), Y.numpy())
    plt.title("MSE: {}".format(loss.item()))
    plt.savefig("output.png")
The output is an image (omitted here) showing the neural network prediction and the ground truth: the yellow line is the line predicted by the neural network and the blue line is the ground truth.
First and foremost, you've chosen a topology suited for a different class of problems. A simple, fully-connected NN such as this is great with trivial classification (e.g. Boolean operators) or functions with at least two continuous derivatives. You've tried to apply it to a function that is simply one step beyond its capabilities.
Try your model on sin(x) and see how it performs at larger ranges. Try it on max(sin(x), 0). Do you see how the model has trouble with certain periodicity and irruptions? These are an emergent feature of the many linear equations struggling to predict the proper functional value: the linear combinations have trouble emulating non-linearities past a simple level.
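If you want to try those variants with the PyTorch snippet from the other answer above, swapping the training target is enough (a small sketch; x is the training input defined there):

# hypothetical alternative targets for the same x
y_sin = torch.sin(x)                             # smooth target: easy for a small MLP
y_relu_sin = torch.clamp(torch.sin(x), min=0.0)  # max(sin(x), 0): a kink every period, harder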

I can't find the bug in this implementation of backpropagation

My data is 4123 rows of inputs and outputs for an XOR gate.
I want to write a Neural Network with three input layer neurons (the third one is bias), a hidden layer, and an output layer.
Here's my implementation
import numpy as np

class TwoLayerNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        """
        input_size: the number of neurons in the input layer
        hidden_size: the number of neurons in the hidden layer
        output_size: the number of neurons in the output layer
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.params = {}
        self.params['W1'] = 0.01 * np.random.randn(input_size, hidden_size)  # FxH
        self.params['b1'] = np.zeros((hidden_size, 1))  # Hx1
        self.params['W2'] = 0.01 * np.random.randn(hidden_size, output_size)  # HxO
        self.params['b2'] = np.zeros((output_size, 1))  # Ox1
        self.optimal_weights = []
        self.errors = {}

    def train(self, X, y, epochs):
        """
        X: input data matrix, NxF
        y: output vector, Nx1
        returns:
            the optimal set of parameters that best minimize the loss function
        """
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        for iteration in range(epochs):
            forward_to_hidden = X.dot(W1)  # NxH
            activate_hidden = sigmoid(forward_to_hidden)  # NxH
            forward_to_output = activate_hidden.dot(W2)  # NxO
            output = sigmoid(forward_to_output)  # NxO

            self.errors[iteration] = np.mean(0.5 * (y**2 - output**2))

            output_error = y - output  # NxO
            output_layer_delta = output_error * sigmoidPrime(output)  # NxO
            hidden_layer_error = output_layer_delta.dot(W2.T)  # NxO . OxH = NxH
            hidden_layer_delta = hidden_layer_error * sigmoidPrime(activate_hidden)  # NxH

            W1_update = X.T.dot(hidden_layer_delta)  # FxN . NxH = FxH
            W2_update = activate_hidden.T.dot(output_layer_delta)  # HxN . NxO = HxO

            W1 += W1_update
            W2 += W2_update

        self.optimal_weights.append(W1)
        self.optimal_weights.append(W2)

    def predict(self, X):
        W1, W2 = self.optimal_weights[0], self.optimal_weights[1]
        forward = sigmoid(X.dot(W1))  # NxH
        forward = forward.dot(W2)  # NxO
        forward = sigmoid(forward)  # NxO
        return forward

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoidPrime(x):
    return sigmoid(x) * (1 - sigmoid(x))
I realize that's very vanilla, but that's intentional. I want to understand the most basic form of NN architecture first.
Now, my problem is that my error plot is confusing.
The neural network just stops learning.
My second problem is that my weights are blowing up, reaching values as large as -10000, which causes overflow because of the exp in the sigmoid function.
My third problem is that my output vector only outputs 0.5 instead of 1 or 0.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('xor.csv').sample(frac=1)
X = data.iloc[:, [0, 1]]  # 1st and 2nd cols are the input
X = np.hstack((X, np.ones((data.shape[0], 1))))  # adding the bias 1's
y = data.iloc[:, 2][:, np.newaxis]  # 3rd col is the output

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# nn = TwoLayerNetwork(...)  # instantiation not shown in the question
nn.train(X_train, y_train, 100)

plt.plot(range(100), [i for i in nn.errors.values()])
plt.show()
The link for the dataset
So, if I read your code correctly, your network is specified correctly, but is missing a few key points in order to learn XOR by backpropagation.
The fun part is, your error specification is weird.
I made it into
self.errors[iteration] = np.mean(0.5 * (y - output)**2)
for visualization.
With the x-axis denoting epoch and the y-axis denoting error (plot omitted):
So what happens is that the backpropagation hits a plateau, then rapidly blows up the weights. To slow the blow-up of the weights and give the network time to re-evaluate its mistakes, you can add a so-called "learning rate" != 1. This addresses one of the pitfalls.
Another one is shown in the second figure (omitted): you hit oscillatory behaviour in the updates and the program will never reach its optimum state. To address this, you can deliberately introduce an imperfection in the form of "momentum".
Additionally, the initial conditions matter for the speed at which you converge, so you need enough epochs to overcome the local plateaux.
Last, but certainly not least, I did find an error with your specification, but all of the above still applies.
In your layer deltas you effectively compute sigmoidPrime(sigmoid(forward)), which is one call to sigmoid too many.
# before the epoch loop: momentum buffers
last_update = np.zeros((X.shape[1], W1.shape[1]))
last_update2 = np.zeros((W1.shape[1], W2.shape[1]))

# inside the loop: derivatives taken at the pre-activation values
output_layer_delta = output_error * sigmoidPrime(forward_to_output)  # NxO
hidden_layer_delta = hidden_layer_error * sigmoidPrime(forward_to_hidden)  # NxH

# weight updates with a learning rate of 0.001 and a momentum factor of 0.5
W1 += 0.001*(W1_update + last_update * 0.5)
W2 += 0.001*(W2_update + last_update2 * 0.5)
# W1 = 0.001*W1_update
# W2 = 0.001*W2_update
last_update = W1_update.copy()
last_update2 = W2_update.copy()
That did the final trick for me. Now please verify and appease this grumbling man who spent the better part of a night and a day figuring it out. ;)
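Putting the fragments together, here is a hedged sketch of how they might slot into the question's train method (the 0.001 learning rate and 0.5 momentum factor are the illustrative values from the snippet above; sigmoid and sigmoidPrime are the question's helpers):

def train(self, X, y, epochs, lr=0.001, momentum=0.5):
    W1, W2 = self.params['W1'], self.params['W2']
    last_update = np.zeros_like(W1)
    last_update2 = np.zeros_like(W2)
    for iteration in range(epochs):
        forward_to_hidden = X.dot(W1)
        activate_hidden = sigmoid(forward_to_hidden)
        forward_to_output = activate_hidden.dot(W2)
        output = sigmoid(forward_to_output)
        self.errors[iteration] = np.mean(0.5 * (y - output) ** 2)
        output_error = y - output
        # derivatives taken at the pre-activation values, as noted above
        output_layer_delta = output_error * sigmoidPrime(forward_to_output)
        hidden_layer_error = output_layer_delta.dot(W2.T)
        hidden_layer_delta = hidden_layer_error * sigmoidPrime(forward_to_hidden)
        W1_update = X.T.dot(hidden_layer_delta)
        W2_update = activate_hidden.T.dot(output_layer_delta)
        # learning rate plus momentum keep the weights from blowing up or oscillating
        W1 += lr * (W1_update + momentum * last_update)
        W2 += lr * (W2_update + momentum * last_update2)
        last_update, last_update2 = W1_update.copy(), W2_update.copy()
    self.optimal_weights = [W1, W2]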
