Backpropagation outputs tend towards same value - python

I'm attempting to create a multilayer feedforward backpropagation neural network to recognize handwritten digits and I'm running into a problem where the activations in my output layer all tend towards the same value.
I'm using the Optical Recognition of Handwritten Digits Data Set, with training data that looks like
0,1,6,15,12,1,0,0,0,7,16,6,6,10,0,0,0,8,16,2,0,11,2,0,0,5,16,3,0,5,7,0,0,7,13,3,0,8,7,0,0,4,12,0,1,13,5,0,0,0,14,9,15,9,0,0,0,0,6,14,7,1,0,0,0
which represents an 8x8 matrix: each of the first 64 integers is the count of dark pixels in a 4x4 sub-block of the original bitmap, and the last integer is the classification.
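(For reference, a minimal sketch of splitting such rows into features and labels; the filename here is a placeholder, not from the original post:)

import numpy as np

# Placeholder filename; each row is 64 pixel counts (0-16) followed by the digit label.
data = np.loadtxt("optdigits_train.csv", delimiter=",", dtype=int)
X, y = data[:, :64], data[:, 64]   # X has shape (n_samples, 64); y values are digits 0-9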
I'm using 64 nodes in the input layer corresponding to the 64 integers, some number of hidden nodes in some number of hidden layers, and 10 nodes in the output layer corresponding to 0-9.
My weights are initialized here, and biases are added for the input layer and hidden layers
self.weights = []
for i in xrange(1, len(layers) - 1):
    self.weights.append(
        np.random.uniform(low=-0.2,
                          high=0.2,
                          size=(layers[i-1] + 1, layers[i] + 1)))
# Output weights
self.weights.append(
    np.random.uniform(low=-0.2,
                      high=0.2,
                      size=(layers[-2] + 1, layers[-1])))
where layers contains the number of nodes in each layer, e.g.
layers=[64, 30, 10]
I'm using the logistic function as my activation function
def logistic(self, z):
    return sp.expit(z)
and its derivative
def derivative(self, z):
    return sp.expit(z) * (1 - sp.expit(z))
My backpropagation algorithm is borrowed heavily from here; my previous attempts failed so I wanted to try another route.
def back_prop_learning(self, X, y):
    # add biases to inputs with value of 1
    biases = np.atleast_2d(np.ones(X.shape[0]))
    X = np.concatenate((biases.T, X), axis=1)
    # Iterate over training set
    for epoch in xrange(self.epochs):
        # for each weight w[i][j] in network assign random tiny values
        # handled in __init__
        ''' PROPAGATE THE INPUTS FORWARD TO COMPUTE THE OUTPUTS '''
        for example in zip(X, y):
            # for each node i in the input layer
            # set input layer outputs equal to input vector outputs
            activations = [example[0]]
            # for layer = 1 (first hidden) to output layer
            for layer in xrange(len(self.weights)):
                # for each node j in layer
                weighted_sum = np.dot(activations[layer], self.weights[layer])
                # assert number of outputs == number of weights in each layer
                assert(len(activations[layer]) == len(self.weights[layer]))
                # compute activation of weighted sum of node j
                activation = self.logistic(weighted_sum)
                # append vector of activations
                activations.append(activation)
            ''' PROPAGATE DELTAS BACKWARDS FROM OUTPUT LAYER TO INPUT LAYER '''
            # for each node j in the output layer
            # compute error of target - output
            errors = example[1] - activations[-1]
            # multiply by derivative
            deltas = [errors * self.derivative(activations[-1])]
            # for layer = last hidden layer down to first hidden layer
            for layer in xrange(len(activations)-2, 0, -1):
                deltas.append(deltas[-1].dot(self.weights[layer].T) * self.derivative(activations[layer]))
            ''' UPDATE EVERY WEIGHT IN NETWORK USING DELTAS '''
            deltas.reverse()
            # for each weight w[i][j] in network
            for i in xrange(len(self.weights)):
                layer = np.atleast_2d(activations[i])
                delta = np.atleast_2d(deltas[i])
                self.weights[i] += self.alpha * layer.T.dot(delta)
And my outputs after running testing data all resemble
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 9.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 4.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 6.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 6.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 7.0
No matter what I select for my learning rate, number of hidden nodes, or number of hidden layers, everything seems to tend towards 1. This leaves me wondering whether I'm even approaching and setting up the problem correctly (64 inputs to 10 outputs), whether I've selected/implemented my sigmoid function correctly, or whether the failure is in my implementation of the backpropagation algorithm. I've recreated the above program two or three times with the same results, which leads me to believe that I'm fundamentally misunderstanding the problem and not representing it correctly.

I think I've answered my question.
I believe the problem was how I was calculating my errors in the output layer. I had been calculating them as errors = example[1] - activations[-1], which subtracted my output-layer activations from the single scalar target value (e.g. 9.0), broadcast over all ten outputs, rather than comparing against a target vector.
I changed this so that my target was a vector of ten zeros, one slot per digit 0-9, with a 1.0 at the index of the target value.
y = int(example[1])
errors_v = np.zeros(shape=(10,), dtype=float)
errors_v[y] = 1.0
errors = errors_v - activations[-1]
I also changed my activation function to be the tanh function.
This has significantly increased the variance in the activations in my output layer and I've been able to achieve 50% - 75% accuracy in my limited testing so far. Hopefully this helps someone else.
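For anyone making the same change, this is roughly what the tanh pair looks like (a minimal sketch as standalone functions, not the exact methods from my class; note that tanh outputs lie in (-1, 1) rather than (0, 1)):

import numpy as np

def tanh(z):
    # Hyperbolic tangent activation
    return np.tanh(z)

def tanh_derivative(z):
    # d/dz tanh(z) = 1 - tanh(z)**2
    return 1.0 - np.tanh(z) ** 2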

Related

Is output of Batch Normalization in Keras dependent on number of epochs?

I am inspecting the output of the BatchNormalization layer in Keras.
My model is:
# Import libraries
import numpy as np
import keras
from keras import layers
from keras.layers import Input, Dense, Activation, BatchNormalization, Flatten, Conv2D
from keras.models import Model

# Model
def HappyModel3(input_shape):
    X_input = Input(input_shape, name='input_layer')
    X = BatchNormalization(axis = 1, name = 'batchnorm_layer')(X_input)
    X = Dense(1, activation='sigmoid', name='sigmoid_layer')(X)
    model = Model(inputs = X_input, outputs = X, name='HappyModel3')
    return model
Compiling and fitting the model; here the number of epochs is 1:
X_train=np.array([[1,1,-1],[2,1,1]])
Y_train=np.array([0,1])
happyModel_1=HappyModel3(X_train[0].shape)
happyModel_1.compile(optimizer=keras.optimizers.RMSprop(), loss=keras.losses.mean_squared_error)
happyModel_1.fit(x = X_train, y = Y_train, epochs = 1 , batch_size = 2, verbose=0 )
Finding the Batch Normalization layer's output for the model trained with epochs=1:
for i in range(0, len(happyModel_1.layers)):
    tmp_model = Model(happyModel_1.layers[0].input, happyModel_1.layers[i].output)
    tmp_output = tmp_model.predict(X_train)
    if i in (0, 1):
        print(happyModel_1.layers[i].name)
        print(tmp_output.shape)
        print(tmp_output)
        print('\n')
Code Output is:
input_layer
(2, 3)
[[ 1. 1. -1.]
[ 2. 1. 1.]]
batchnorm_layer
(2, 3)
[[ 0.99003249 0.99388224 -0.99551398]
[ 1.99647105 0.99388224 0.9971655 ]]
We've normalized at axis=1. In the batch norm layer's output, the mean of the 1st dimension is 1.5, the mean of the 2nd dimension is 1, and the mean of the 3rd dimension is 0.
Since it's batch norm, I expect the mean to be close to 0 for all 3 dimensions.
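(One quick way to check these per-feature means, using the tmp_output array printed for the batchnorm layer in the loop above:)

print(tmp_output.mean(axis=0))   # per-feature means over the batch, roughly [1.5, 1.0, 0.0] for the epochs=1 run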
This happens when I increase epochs to 1000:
happyModel_2=HappyModel3(X_train[0].shape)
happyModel_2.compile(optimizer=keras.optimizers.RMSprop(), loss=keras.losses.mean_squared_error)
happyModel_2.fit(x = X_train, y = Y_train, epochs = 1000 , batch_size = 2, verbose=0 )
Finding the Batch Normalization layer's output for the model trained with epochs=1000:
for i in range(0, len(happyModel_2.layers)):
    tmp_model = Model(happyModel_2.layers[0].input, happyModel_2.layers[i].output)
    tmp_output = tmp_model.predict(X_train)
    if i in (0, 1):
        print(happyModel_2.layers[i].name)
        print(tmp_output.shape)
        print(tmp_output)
        print('\n')
#Code output
input_layer
(2, 3)
[[ 1. 1. -1.]
[ 2. 1. 1.]]
batchnorm_layer
(2, 3)
[[ -1.95576239e+00 8.08715820e-04 -1.86621261e+00]
[ 1.95795488e+00 8.08715820e-04 1.86590290e+00]]
We've normalized at axis=1. Now the batch norm layer's output has a mean of 0 in the 1st, 2nd, and 3rd dimensions. This is the output I expected.
My question is: Is output of Batch Normalization in Keras dependent on number of epochs?
(Probably YES, as we do backpropagation, batch Normalization parameters will be affected by increasing number of epochs)
The keras documentation for BatchNormalization gives an answer to your question:
Importantly, batch normalization works differently during training and
during inference.
What happens during training, i.e. when calling model.fit()?
During training [...], the layer normalizes its output
using the mean and standard deviation of the current batch of inputs.
But what happens during inference, i.e. when calling model.predict() as in your examples?
During inference [...], the layer normalizes its output using a moving average of
the mean and standard deviation of the batches it has seen during
training. That is to say, it returns (batch - self.moving_mean) / (self.moving_var + epsilon) * gamma + beta.
self.moving_mean and self.moving_var are non-trainable variables that
are updated each time the layer is called in training mode [...].
It's important to understand that batch normalization estimates the statistics (mean and variance) of your whole training data during training by looking at the statistics of single batches and internally updating the moving_mean and moving_variance parameters with a running average computed from the single-batch statistics. These two parameters are therefore not affected by backpropagation. Ideally, after your model has seen enough training examples (or done enough training epochs), moving_mean and moving_variance will correspond to the statistics of your whole training set; they are then used during inference to normalize test examples. At the start of training the two parameters are initialized to 0 and 1. In addition, batch norm has two more parameters called gamma and beta, which are updated by the optimizer and therefore depend on your loss.
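As a rough sketch of that running-average update (assuming the default momentum of 0.99; this is illustrative, not the actual Keras source):

import numpy as np

def update_moving_stats(moving_mean, moving_var, batch, momentum=0.99):
    # Running-average update applied once per training batch
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    new_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * moving_var + (1.0 - momentum) * batch_var
    return new_mean, new_var

With only one epoch on a two-sample batch, the statistics are still very close to their 0/1 initialisation, which is why the epochs=1 output above is barely normalized.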
In essence, yes, the output of batch normalization during inference depends on the number of epochs you have trained your model: first because the moving averages for mean and variance change, and second because the learned parameters gamma and beta change.
For a deeper understanding of how batch normalization works and why it is needed, have a look at the original publication.

Does tf.keras.losses.categorical_crossentropy return an array or a single value?

I'm using a custom training loop. The loss returned by tf.keras.losses.categorical_crossentropy is an array of shape, I'm assuming, (1, batch_size). Is this what it is supposed to return, or a single value?
In the latter case, any idea what I could be doing wrong?
If your predictions have shape (samples of batch, classes), tf.keras.losses.categorical_crossentropy returns the losses with shape (samples of batch,).
So, if your labels are:
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
And your predictions are:
[[0.9 0.05 0.05]
[0.5 0.89 0.6 ]
[0.05 0.01 0.94]]
You will get a loss like:
[0.10536055 0.8046684 0.06187541]
In most cases your model will use the mean of these values to update your model parameters. So if you do the updates manually, you can use:
loss = tf.keras.backend.mean(losses)
Most usual losses return the original shape minus the last axis.
So, if your original y_pred shape was (samples, ..., ..., classes), then your resulting shape will be (samples, ..., ...).
This is probably because Keras may use this tensor in further calculations, for sample weights and maybe other things.
In a custom loop, if these dimensions are useless, you can simply take a K.mean(loss_result) before calculating the gradients. (Where K is either keras.backend or tensorflow.keras.backend)
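For example, a custom training step using that reduction might look like this sketch (the model and optimizer here are placeholders; tf.reduce_mean is equivalent to K.mean in this context):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(3, activation='softmax')])  # placeholder model
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = model(x_batch, training=True)
        per_sample = tf.keras.losses.categorical_crossentropy(y_batch, y_pred)  # shape (batch_size,)
        loss = tf.reduce_mean(per_sample)  # reduce to a scalar before taking gradients
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss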

unable to shuffle matrix rows

I'm coding a simple neural network from scratch. The neural network is implemented in the method def simple_1_layer_classification_NN, which accepts an input matrix and output labels, among other parameters. Before looping through every epoch I wanted to shuffle the input matrix, only by its rows (i.e. its observations), as one measure of avoiding over-fitting. I tried random.shuffle(dataset_input_matrix), and two strange things happened. I took a snapshot of the matrix before and after the shuffle step (using the code below with breakpoints to see the value of the matrix before and after, expecting it to shuffle). So input_matrix should give the value of the matrix before the shuffle, and input_matrix1 should give the value after, i.e. of the shuffled matrix.
input_matrix = dataset_input_matrix
# shuffle our matrix observation samples, to decrease the chance of overfitting
random.shuffle(dataset_input_matrix)
input_matrix1 = dataset_input_matrix
When I printed both values, I got the same matrix, with no changes.
ipdb> input_matrix
array([[3. , 1.5],
[3. , 1.5],
[2. , 1. ],
[3. , 1.5],
[3. , 1.5],
[3. , 1. ]])
ipdb> input_matrix1
array([[3. , 1.5],
[3. , 1.5],
[2. , 1. ],
[3. , 1.5],
[3. , 1.5],
[3. , 1. ]])
ipdb>
Not sure if I'm doing something wrong here.
The second strange thing is that when I ran the neural network (after the shuffle), its accuracy dropped dramatically. Before, I was getting accuracy ranging from 60% to 95% (with very few runs around 50%).
After adding the shuffle step for the input matrix, I was barely getting an accuracy above 50%, no matter how many times I ran the model. This is strange, considering that the shuffle doesn't even appear to have worked when examined with the breakpoints. And anyway, why should the network accuracy drop this badly, unless I'm doing the shuffling completely wrong?
So 2 questions:
1- How do I shuffle only the rows of a matrix? (I only need to randomise the observations (rows), not the features (columns) of the dataset; see the sketch below.)
2- Why did the shuffle drop the accuracy so much that the neural network cannot get anything above 50%? After all, shuffling data is a recommended pre-processing step to avoid over-fitting.
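(As a point of reference for question 1, a minimal sketch of shuffling rows while keeping the labels aligned, using an index permutation; the data here is illustrative, not from the original post:)

import numpy as np

# Illustrative data; rows are observations, as in the question.
X = np.array([[3., 1.5], [2., 1.], [4., 1.5], [3., 1.]])
y = np.array([1, 0, 1, 0])

perm = np.random.permutation(len(X))       # one random ordering of the row indices
X_shuffled, y_shuffled = X[perm], y[perm]  # rows and labels stay paired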
Please refer to the full code below, and apologies for the large portion of code.
Many thanks in advance for any help.
# --- neural network structure diagram ---
#      O   output prediction
#     / \  w1, w2, b
#    O   O datapoint 1, datapoint 2
def simple_1_layer_classification_NN(self, dataset_input_matrix, output_data_labels, input_dimension, epochs, activation_func='sigmoid', learning_rate=0.2, cost_func='squared_error'):
    weights = []
    bias = int()
    cost = float()
    costs = []
    dCost_dWeights = []
    chosen_activation_func_derivation = None
    chosen_cost_func = None
    chosen_cost_func_derivation = None
    correct_pred = int()
    incorrect_pred = int()
    # store the chosen activation function to use it later on in the activation calculation section and in the 'predict' method.
    # Also the same goes for the derivation section.
    if activation_func == 'sigmoid':
        self.chosen_activation_func = NN_classification.sigmoid
        chosen_activation_func_derivation = NN_classification.sigmoid_derivation
    elif activation_func == 'relu':
        self.chosen_activation_func = NN_classification.relu
        chosen_activation_func_derivation = NN_classification.relu_derivation
    else:
        print("Exception error - no activation function utilised, in training method", file=sys.stderr)
        return
    # store the chosen cost function to use it later on in the cost calculation section.
    # Also the same goes for the cost derivation section.
    if cost_func == 'squared_error':
        chosen_cost_func = NN_classification.squared_error
        chosen_cost_func_derivation = NN_classification.squared_error_derivation
    else:
        print("Exception error - no cost function utilised, in training method", file=sys.stderr)
        return
    # Set initial network parameters (weights & bias):
    # Will initialise the weights to a uniform distribution and ensure the numbers are small, close to 0.
    # We need to loop through all the weights to set them to a random value initially.
    for i in range(input_dimension):
        # create random numbers for our initial weights (connections) to begin with. 'rand' method creates small random numbers.
        w = np.random.rand()
        weights.append(w)
    # create a random number for our initial bias to begin with.
    bias = np.random.rand()
    '''
    I tried adding the shuffle step, where the matrix is shuffled only in terms of its observations (i.e. rows),
    but this has dropped the accuracy dramatically, to the point where the 50% range was the best the model could achieve.
    '''
    input_matrix = dataset_input_matrix
    # shuffle our matrix observation samples, to decrease the chance of overfitting
    random.shuffle(dataset_input_matrix)
    input_matrix1 = dataset_input_matrix
    # We perform the training based on the number of epochs specified
    for i in range(epochs):
        # reset average accuracy with every epoch
        self.train_average_accuracy = 0
        for ri in range(len(dataset_input_matrix)):
            # reset weighted sum value at the beginning of every observation to avoid incrementing the previous observations' weighted sums on top.
            weighted_sum = 0
            input_observation_vector = dataset_input_matrix[ri]
            # Loop through all the independent variables (x) in the observation
            for x in range(len(input_observation_vector)):
                # Weighted_sum: we take each independent variable in the entire observation, add weight to it then add it to the subtotal of weighted sum
                weighted_sum += input_observation_vector[x] * weights[x]
            # Add Bias: add bias to weighted sum
            weighted_sum += bias
            # Activation: process weighted_sum through activation function
            activation_func_output = self.chosen_activation_func(weighted_sum)
            # Prediction: because this is a single layer neural network, the activation output will be the same as the prediction
            pred = activation_func_output
            # Cost: the cost function to calculate the prediction error margin
            cost = chosen_cost_func(pred, output_data_labels[ri])
            # Also calculate the derivative of the cost function with respect to prediction
            dCost_dPred = chosen_cost_func_derivation(pred, output_data_labels[ri])
            # Derivative: bringing derivative from prediction output with respect to the activation function used for the weighted sum.
            dPred_dWeightSum = chosen_activation_func_derivation(weighted_sum)
            # Bias is just a number on its own added to the weighted sum, so its derivative is just 1
            dWeightSum_dB = 1
            # The derivative of the Weighted Sum with respect to each weight is the input data point / independent variable it's multiplied by.
            # Therefore I simply assigned the input data array to another variable I called 'dWeightedSum_dWeights'
            # to represent the array of the derivatives of all the weights involved. I could've used the 'input_sample'
            # array variable itself, but for the sake of readability, I created a separate variable to represent the derivative of each of the weights.
            dWeightedSum_dWeights = input_observation_vector
            # Derivative chain rule: chaining all the derivative functions together
            # Loop through all the weights to work out the derivative of the cost with respect to each weight:
            for dWeightedSum_dWeight in dWeightedSum_dWeights:
                dCost_dWeight = dCost_dPred * dPred_dWeightSum * dWeightedSum_dWeight
                dCost_dWeights.append(dCost_dWeight)
            dCost_dB = dCost_dPred * dPred_dWeightSum * dWeightSum_dB
            # Backpropagation: update the weights and bias according to the derivatives calculated above.
            # In other words we update the parameters of the neural network to correct parameters and therefore
            # optimise the neural network prediction to be as accurate to the real output as possible.
            # We loop through each weight and update it with its derivative with respect to the cost error function value.
            for ind in range(len(weights)):
                weights[ind] = weights[ind] - learning_rate * dCost_dWeights[ind]
            bias = bias - learning_rate * dCost_dB
            # Compare prediction to target
            error_margin = np.sqrt(np.square(pred - output_data_labels[ri]))
            accuracy = (1 - error_margin) * 100
            self.train_average_accuracy += round(accuracy)
            # Evaluate whether guessed correctly or not, based on the binary classification outcome (0 or 1).
            # If the prediction is above 0.5 it guessed 1, and below 0.5 it guessed 0. If it's dead on 0.5 it is
            # counted as incorrect, because it's not a good guess for either 0 or 1. We need to set a good standard for the neural net model.
            if (error_margin < 0.5) and (error_margin >= 0):
                correct_pred += 1
            elif (error_margin >= 0.5) and (error_margin <= 1):
                incorrect_pred += 1
            else:
                print("Exception error - 'margin error' for 'predict' method is out of range. Must be between 0 and 1, in training method", file=sys.stderr)
                return
            costs.append(cost)
        # Calculate average accuracy from the predictions of all observations in the training dataset
        self.train_average_accuracy = round(self.train_average_accuracy / len(dataset_input_matrix), 1)
    # store the final optimised weights to the weights instance variable so they can be used in the predict method.
    self.weights = weights
    # store the final optimised bias to the bias instance variable so it can be used in the predict method.
    self.bias = bias
    # Print out results
    print('Average Accuracy: {}'.format(self.train_average_accuracy))
    print('Correct predictions: {}, Incorrect Predictions: {}'.format(correct_pred, incorrect_pred))

from numpy import array
# define array of dataset
# each observation vector has 3 datapoints or 3 columns: length, width, and outcome label (0, 1 to represent blue flower and red flower respectively).
data = array([[3,   1.5, 1],
              [2,   1,   0],
              [4,   1.5, 1],
              [3,   1,   0],
              [3.5, 0.5, 1],
              [2,   0.5, 0],
              [5.5, 1,   1],
              [1,   1,   0]])
# separate data: split input, output, train and test data.
X_train, y_train, X_test, y_test = data[:6, :-1], data[:6, -1], data[6:, :-1], data[6:, -1]
nn_model = NN_classification()
nn_model.simple_1_layer_classification_NN(X_train, y_train, 2, 10000, learning_rate=0.2)

Basic neural network, weights too high

I am trying to code a very basic neural network in python, with 3 input nodes with a value of 0 or 1 and a single output node, with a value of 0 or 1. The output should be almost equal to the second input, but after training, the weights are way way too high, and the network almost always guesses 1.
I am using Python 3.7 with numpy and scipy. I have tried changing the training set, the new instance, and the random seed.
import numpy as np
from scipy.special import expit as ex

rand.seed(10)
training_set=[[0,1,0],[1,0,1],[0,0,0],[1,1,1]] #The training sets and their outputs
training_outputs=[0,1,0,1]
weightlst=[rand.uniform(-1,1),rand.uniform(-1,1),rand.uniform(-1,1)] #Weights are randomly set with a value between -1 and 1
print('Random weights\n'+str(weightlst))

def calcout(inputs,weights): #Calculate the expected output with given inputs and weights
    output=0.5
    for i in range(len(inputs)):
        output=output+(inputs[i]*weights[i])
    #print('\nmy output is ' + str(ex(output)))
    return ex(output) #Return the output on a sigmoid curve between 0 and 1

def adj(expected_output,training_output,weights,inputs): #Adjust the weights based on the expected output, true (training) output and the weights
    adjweights=[]
    error=expected_output-training_output
    for i in weights:
        adjweights.append(i+(error*(expected_output*(1-expected_output))))
    return adjweights

#Train the network, adjusting weights each time
training_iterations=10000
for k in range(training_iterations):
    for l in range(len(training_set)):
        expected=calcout(training_set[l],weightlst)
        weightlst=adj(expected,training_outputs[l],weightlst,training_set[l])

new_instance=[1,0,0] #Calculate and return the expected output of a new instance
print('Adjusted weights\n'+str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance,weightlst)))
The expected output would be 0, or something very close to it, but no matter what I set new_instance to, the output is still
Random weights
[-0.7312715117751976, 0.6948674738744653, 0.5275492379532281]
Adjusted weights
[1999.6135460307303, 2001.03968501638, 2000.8723667804588]
Expected output of new instance = 1.0
What is wrong with my code?
Bugs:
No bias used in the neuron
error = training_output - expected_output (not the other way around) for gradient descent
weight update rule for the ith weight: w_i = w_i + learning_rate * delta_w_i (delta_w_i is the gradient of the loss with respect to w_i)
For squared loss, delta_w_i = error * sample[i] (the ith value of the input vector sample)
Since you have only one neuron (one layer of size 1), your model can only learn linearly separable data (it is only a linear classifier). Examples of linearly separable data are data generated by boolean functions like AND and OR. Note that boolean XOR is not linearly separable.
Code with bugs fixed
import numpy as np
import random as rand  # added: needed for rand.seed and rand.uniform below
from scipy.special import expit as ex

rand.seed(10)
training_set=[[0,1,0],[1,0,1],[0,0,0],[1,1,1]] #The training sets and their outputs
training_outputs=[1,1,0,1] # Boolean OR of input vector
#training_outputs=[0,0,0,1] # Boolean AND of input vector
weightlst=[rand.uniform(-1,1),rand.uniform(-1,1),rand.uniform(-1,1)] #Weights are randomly set with a value between -1 and 1
bias = rand.uniform(-1,1)
print('Random weights\n'+str(weightlst))

def calcout(inputs,weights,bias): #Calculate the expected output with given inputs and weights
    output=bias
    for i in range(len(inputs)):
        output=output+(inputs[i]*weights[i])
    #print('\nmy output is ' + str(ex(output)))
    return ex(output) #Return the output on a sigmoid curve between 0 and 1

def adj(expected_output,training_output,weights,bias,inputs): #Adjust the weights based on the expected output, true (training) output and the weights
    adjweights=[]
    error=training_output-expected_output
    lr = 0.1
    for j, i in enumerate(weights):
        adjweights.append(i+error*inputs[j]*lr)
    adjbias = bias+error*lr
    return adjweights, adjbias

#Train the network, adjusting weights each time
training_iterations=10000
for k in range(training_iterations):
    for l in range(len(training_set)):
        expected=calcout(training_set[l],weightlst,bias)
        weightlst, bias = adj(expected,training_outputs[l],weightlst,bias,training_set[l])

new_instance=[1,0,0] #Calculate and return the expected output of a new instance
print('Adjusted weights\n'+str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance,weightlst,bias)))
Output:
Random weights
[0.142805189379827, -0.14222189064977075, 0.15618260226894076]
Adjusted weights
[6.196759842119063, 11.71208191137411, 6.210137255008176]
Expected output of new instance = 0.6655563851223694
As you can see, for input [1,0,0] the model predicted a probability of 0.66, which is class 1 (since 0.66 > 0.5). This is correct, as the target class is the OR of the input vector.
Note:
For learning/understanding how each weight is updated it is OK to code it like the above, but in practice all the operations are vectorised. Check the link for a vectorized implementation.
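For illustration, a vectorized sketch of the same single-neuron update (NumPy only; the data and error convention follow the fixed code above, but the update here is batch-averaged rather than per-sample):

import numpy as np
from scipy.special import expit as sigmoid

rng = np.random.default_rng(10)
X = np.array([[0, 1, 0], [1, 0, 1], [0, 0, 0], [1, 1, 1]], dtype=float)
y = np.array([1, 1, 0, 1], dtype=float)        # Boolean OR of each input vector
w = rng.uniform(-1, 1, size=3)
b = rng.uniform(-1, 1)
lr = 0.1

for _ in range(10000):
    pred = sigmoid(X @ w + b)                  # forward pass for the whole batch at once
    error = y - pred                           # target - prediction, as in the fixed code
    w += lr * (X.T @ error) / len(X)           # batch-averaged weight update
    b += lr * error.mean()                     # batch-averaged bias update

print(sigmoid(np.array([1, 0, 0]) @ w + b))    # well above 0.5, i.e. class 1 (OR of [1,0,0])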

Learning OR gate through gradient descent

I am trying to make my program learn an OR logic gate using a neural network and the gradient descent algorithm. I added an extra input neuron fixed at -1 so that I can adjust the neuron's activation threshold later; currently the threshold is simply 0.
Here's my attempt at implementation
#!/usr/bin/env python
from numpy import *

def pcntrain(inp, tar, wei, eta):
    for data in range(nData):
        activation = dot(inp,wei)
        wei += eta*(dot(transpose(inp), target-activation))
        print "ITERATION " + str(data)
        print wei
    print "TESTING LEARNED ALGO"
    # Sample input
    activation = dot(array([[0,0,-1],[1,0,-1],[1,1,-1],[0,0,-1]]),wei)
    print activation

nIn = 2
nOut = 1
nData = 4
inputs = array([[0,0],[0,1],[1,0],[1,1]])
target = array([[0],[1],[1],[1]])
inputs = concatenate((inputs,-ones((nData,1))),axis=1) #add bias input = -1
weights = random.rand(nIn +1,nOut)*0.1-0.05 #random weight

if __name__ == '__main__':
    pcntrain(inputs, target, weights, 0.25)
This code seems to produce output which does not look like an OR gate. Help?
Well, this is an OR gate. If you correct your testing data to be
activation = dot(array([[0,0,-1],[1,0,-1],[1,1,-1],[0,1,-1]]),wei)
(your code has 0,0 twice, and never 0,1) it produces
[[ 0.30021868]
[ 0.67476151]
[ 1.0276208 ]
[ 0.65307797]]
which, after calling round gives
[[ 0.]
[ 1.]
[ 1.]
[ 1.]]
as desired.
However, you do have some minor errors:
you are running only 4 iterations of gradient descent (the main loop); furthermore, that count comes from the fact that you use the number of inputs to specify it - this is incorrect, as there is no relation between the number of "reasonable" iterations and the number of data points. If you run 100 iterations you end up with closer scores:
[[ 0.25000001]
[ 0.75 ]
[ 1.24999999]
[ 0.75 ]]
your model is linear and has linear output, thus you cannot expect it to output exactly 0 and 1; the above result (0.25, 0.75 and 1.25) is actually the optimal solution for this kind of model. If you want it to converge to a nice 0/1 you need a sigmoid on the output and consequently a different loss/derivative (this is still a linear model in the ML sense, you simply have a squashing function on the output to make it work in the correct space); a sketch of that variant follows after this list.
you are not using the "tar" argument in your function; instead, you refer to the global variable "target" (which has the same value, but this is an obvious error)
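If you do want the outputs pushed towards 0/1, a minimal sketch of the same loop with a sigmoid output and the corresponding squared-error gradient could look like this (illustrative, not the original code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nData = 4
inputs = np.concatenate((np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float),
                         -np.ones((nData, 1))), axis=1)   # same -1 bias input as above
target = np.array([[0.], [1.], [1.], [1.]])
weights = np.random.rand(3, 1) * 0.1 - 0.05
eta = 0.25

for _ in range(1000):                                     # many more than 4 iterations, as noted above
    out = sigmoid(np.dot(inputs, weights))
    # squared-error gradient through the sigmoid: (target - out) * out * (1 - out)
    weights += eta * np.dot(inputs.T, (target - out) * out * (1 - out))

print(sigmoid(np.dot(inputs, weights)))                   # approaches [0, 1, 1, 1]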
