Learning OR gate through gradient descent - python

I am trying to make my program learn an OR logic gate using a neural network and the gradient descent algorithm. I added an extra input fixed at -1 so that I can adjust the neuron's activation threshold later; currently the threshold is simply 0.
Here's my attempt at an implementation:
#!/usr/bin/env python
from numpy import *
def pcntrain(inp, tar, wei, eta):
for data in range(nData):
activation = dot(inp,wei)
wei += eta*(dot(transpose(inp), target-activation))
print "ITERATION " + str(data)
print wei
print "TESTING LEARNED ALGO"
# Sample input
activation = dot(array([[0,0,-1],[1,0,-1],[1,1,-1],[0,0,-1]]),wei)
print activation
nIn = 2
nOut = 1
nData = 4
inputs = array([[0,0],[0,1],[1,0],[1,1]])
target = array([[0],[1],[1],[1]])
inputs = concatenate((inputs,-ones((nData,1))),axis=1) #add bias input = -1
weights = random.rand(nIn +1,nOut)*0.1-0.05 #random weight
if __name__ == '__main__':
pcntrain(inputs, target, weights, 0.25)
This code seems to produce output that does not look like an OR gate. Help?

Well, this is an OR gate. If you correct your testing data to be
activation = dot(array([[0,0,-1],[1,0,-1],[1,1,-1],[0,1,-1]]),wei)
(your code has 0,0 twice, and never 0,1) it produces
[[ 0.30021868]
[ 0.67476151]
[ 1.0276208 ]
[ 0.65307797]]
which, after calling round, gives
[[ 0.]
[ 1.]
[ 1.]
[ 1.]]
as desired.
However, you do have some minor errors:
you are running only 4 iterations of gradient descent (the main loop), and that number comes from the fact that you use the number of input samples (nData) to specify it - this is incorrect; there is no relation between a "reasonable" number of iterations and the number of data points. If you run 100 iterations you end up with scores closer to the optimum:
[[ 0.25000001]
[ 0.75 ]
[ 1.24999999]
[ 0.75 ]]
your model is linear and has a linear output, so you cannot expect it to output exactly 0 and 1; the result above (0.25, 0.75 and 1.25) is actually the optimal solution for this kind of model. If you want it to converge to a clean 0/1 you need a sigmoid on the output and consequently a different loss/derivatives (this is still a linear model in the ML sense; you simply have a squashing function on the output to make it work in the correct space).
you are not using the "tar" argument in your function; instead, you refer to the global variable "target" (which has the same value, but this is an obvious error)
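To make this concrete, here is a minimal sketch of the trainer with those fixes applied (100 iterations, the tar argument actually used, and the 0,1 row included in the test); just a sketch, not the only way to write it:
import numpy as np

def pcntrain(inp, tar, wei, eta, n_iterations=100):
    # plain batch gradient descent on squared error for a linear model
    for it in range(n_iterations):
        activation = np.dot(inp, wei)
        wei += eta * np.dot(inp.T, tar - activation)
    return wei

nData = 4
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([[0], [1], [1], [1]], dtype=float)
inputs = np.concatenate((inputs, -np.ones((nData, 1))), axis=1)  # bias input = -1
weights = np.random.rand(3, 1) * 0.1 - 0.05  # small random initial weights

weights = pcntrain(inputs, targets, weights, 0.25)
print(np.dot(inputs, weights).round())  # approximately [[0.], [1.], [1.], [1.]]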

Related

Keras - Specifying from_logits=False when using tf.keras.layers.Dense(1,activation='sigmoid')(x)

I am working on a binary classification problem using transfer learning and image inputs, and have a question regarding the output activation and the from_logits argument.
I have been working through choosing the correct activation layer (sigmoid for binary, softmax for multiclass) and noticed that when I specify 'sigmoid' as part of the Dense() output layer, I no longer need to specify from_logits=True during model.compile().
This means when I am obtaining predictions, I don't use the tf.nn.sigmoid() function and instead simply check if the value is greater than 0.5, then 1, else 0. Is this correct? Here is my code:
i = keras.Input(shape=(150, 150, 3))
scale_layer = keras.layers.Rescaling(scale=1 / 127.5, offset=-1)
mt = scale_layer(i)
mt = base_model(mt, training=False)
mt = keras.layers.GlobalAveragePooling2D()(mt)
mt = keras.layers.Dropout(dropout)(mt) # Regularize with dropout
o = keras.layers.Dense(1,activation='sigmoid')(mt)
model = keras.Model(i, o)
....
model.compile(optimizer=keras.optimizers.Adam(lr),loss=keras.losses.BinaryCrossentropy(from_logits=False)
)
And then when I obtain predictions, I have the following:
pred = model.predict(test)
pred = tf.where(pred < 0.5, 0, 1)
pred = pred.numpy()
My intuition is that as I am specifying the sigmoid activation function in the Dense layer, I no longer work with 'logits' and therefore do not need to apply the sigmoid function later on. The documentation shows both styles used, but it's quite sparse on information about working with model.predict(). I would appreciate any guidance.
This means when I am obtaining predictions, I don't use the tf.nn.sigmoid() function and instead simply check if the value is greater than 0.5, then 1, else 0. Is this correct?
Yes, you don't even need the from_logits parameter since you're using the sigmoid function. I believe it's False by default.
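As a quick illustration (a toy model, not your transfer-learning network), these two setups compute the same loss and differ only in what predict() returns:
from tensorflow import keras

inputs = keras.Input(shape=(4,))

# 1) sigmoid in the output layer: the loss consumes probabilities,
#    and predict() already returns values in [0, 1]
probs = keras.layers.Dense(1, activation="sigmoid")(inputs)
m1 = keras.Model(inputs, probs)
m1.compile(optimizer="adam",
           loss=keras.losses.BinaryCrossentropy(from_logits=False))

# 2) no activation in the output layer: the loss consumes raw logits,
#    and predict() returns logits you would still have to squash yourself
logits = keras.layers.Dense(1)(inputs)
m2 = keras.Model(inputs, logits)
m2.compile(optimizer="adam",
           loss=keras.losses.BinaryCrossentropy(from_logits=True))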
And then when I obtain predictions, I have the following:
That depends on how (un)balanced your training data is. Ideally, if it's balanced, you're correct: pred > 0.5 means the model thinks the image belongs closer to class 1. If you have a disproportionately large amount of class 1, the model may be biased towards classifying an image as 1. Conversely, if you choose to use the softmax function, you'll get an array of length num_of_classes for each prediction, with the elements adding up to 1.0 and each element representing the model's confidence that the image belongs to that class.
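To make the two cases concrete, here is a small sketch with made-up prediction arrays standing in for the output of model.predict(test):
import numpy as np

# Sigmoid output layer: one probability per sample, threshold at 0.5.
sigmoid_preds = np.array([[0.12], [0.93], [0.48], [0.71]])
sigmoid_labels = (sigmoid_preds > 0.5).astype(int)     # [[0], [1], [0], [1]]

# Softmax output layer: one row per sample, rows sum to 1.0, take the argmax.
softmax_preds = np.array([[0.9, 0.1], [0.2, 0.8]])
softmax_labels = np.argmax(softmax_preds, axis=-1)     # [0, 1]
print(sigmoid_labels.ravel(), softmax_labels)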

unable to shuffle matrix rows

I'm coding a simple neural network from scratch. The neural network is implemented in the method def simple_1_layer_classification_NN, which accepts an input matrix and output labels, among other parameters. Before looping through every epoch I wanted to shuffle the input matrix, but only by its rows (i.e. its observations), as one measure of avoiding over-fitting. I tried random.shuffle(dataset_input_matrix). Two strange things happened. I took a snapshot of the matrix before and after the shuffle step (using the code below with breakpoints to inspect the matrix before and after, expecting it to change). So input_matrix should hold the value of the matrix before the shuffle, and input_matrix1 the value after, i.e. the shuffled matrix.
input_matrix = dataset_input_matrix
# shuffle our matrix observation samples, to decrease the chance of overfitting
random.shuffle(dataset_input_matrix)
input_matrix1 = dataset_input_matrix
When I printed both values, I got the same matrix, with no changes.
ipdb> input_matrix
array([[3. , 1.5],
[3. , 1.5],
[2. , 1. ],
[3. , 1.5],
[3. , 1.5],
[3. , 1. ]])
ipdb> input_matrix1
array([[3. , 1.5],
[3. , 1.5],
[2. , 1. ],
[3. , 1.5],
[3. , 1.5],
[3. , 1. ]])
ipdb>
Not sure if I'm doing something wrong here.
The second strange thing is that when I ran the neural network (after the shuffle), its accuracy dropped dramatically. Before, I was getting accuracy ranging from 60% to 95% (with very few runs at 50%).
After adding the shuffle step for the input matrix, I was barely getting an accuracy above 50%, no matter how many times I ran the model. This is strange considering that, judging by the breakpoints, the shuffle doesn't even appear to have worked. And in any case, why should the network's accuracy drop this badly, unless I'm doing the shuffling completely wrong?
So 2 questions:
1- How do I shuffle only the rows of a matrix (I only need to randomise the observations (rows), not the features (columns) of the dataset)?
2- Why did doing the shuffle drop the accuracy so much that the neural network cannot get anything above 50%? After all, shuffling the data is a recommended pre-processing step to avoid over-fitting.
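For reference, here is a minimal sketch of what I think row-only shuffling might look like (permuting the labels together with the rows), though I'm not sure this is the right approach:
import numpy as np

X = np.array([[3, 1.5], [2, 1], [4, 1.5], [3, 1], [3.5, 0.5], [2, 0.5]])
y = np.array([1, 0, 1, 0, 1, 0])

# One random permutation of the row indices, applied to both the inputs and
# the labels, so each observation keeps its own label.
perm = np.random.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]
print(X_shuffled)
print(y_shuffled)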
Please refer to the full code below, and apologies for the large portion of code.
Many thanks in advance for any help.
# --- neural network structure diagram ---
# O output prediction
# / \ w1, w2, b
# O O datapoint 1, datapoint 2
def simple_1_layer_classification_NN(self, dataset_input_matrix, output_data_labels, input_dimension, epochs, activation_func='sigmoid', learning_rate=0.2, cost_func='squared_error'):
weights = []
bias = int()
cost = float()
costs = []
dCost_dWeights = []
chosen_activation_func_derivation = None
chosen_cost_func = None
chosen_cost_func_derivation = None
correct_pred = int()
incorrect_pred = int()
# store the chosen activation function so it can be used later on in the activation calculation section and in the 'predict' method.
# Also the same goes for the derivation section.
if activation_func == 'sigmoid':
self.chosen_activation_func = NN_classification.sigmoid
chosen_activation_func_derivation = NN_classification.sigmoid_derivation
elif activation_func == 'relu':
self.chosen_activation_func = NN_classification.relu
chosen_activation_func_derivation = NN_classification.relu_derivation
else:
print("Exception error - no activation function utilised, in training method", file=sys.stderr)
return
# store the chosen cost function so it can be used later on in the cost calculation section.
# Also the same goes for the cost derivation section.
if cost_func == 'squared_error':
chosen_cost_func = NN_classification.squared_error
chosen_cost_func_derivation = NN_classification.squared_error_derivation
else:
print("Exception error - no cost function utilised, in training method", file=sys.stderr)
return
# Set initial network parameters (weights & bias):
# Will initialise the weights to a uniform distribution and ensure the numbers are small close to 0.
# We need to loop through all the weights to set them to a random value initially.
for i in range(input_dimension):
# create random numbers for our initial weights (connections) to begin with. 'rand' method creates small random numbers.
w = np.random.rand()
weights.append(w)
# create a random number for our initial bias to begin with.
bias = np.random.rand()
'''
I tried adding the shuffle step, where the matrix is shuffled only in terms of its observations (i.e. rows)
but this dropped the accuracy dramatically, to the point where the 50% range was the best the model could achieve.
'''
input_matrix = dataset_input_matrix
# shuffle our matrix observation samples, to decrease the chance of overfitting
random.shuffle(dataset_input_matrix)
input_matrix1 = dataset_input_matrix
# We perform the training based on the number of epochs specified
for i in range(epochs):
#reset average accuracy with every epoch
self.train_average_accuracy = 0
for ri in range(len(dataset_input_matrix)):
# reset the weighted sum at the beginning of every observation to avoid accumulating the previous observation's weighted sum on top.
weighted_sum = 0
input_observation_vector = dataset_input_matrix[ri]
# Loop through all the independent variables (x) in the observation
for x in range(len(input_observation_vector)):
# Weighted_sum: we take each independent variable in the entire observation, add weight to it then add it to the subtotal of weighted sum
weighted_sum += input_observation_vector[x] * weights[x]
# Add Bias: add bias to weighted sum
weighted_sum += bias
# Activation: process weighted_sum through activation function
activation_func_output = self.chosen_activation_func(weighted_sum)
# Prediction: Because this is a single layer neural network, so the activation output will be the same as the prediction
pred = activation_func_output
# Cost: the cost function to calculate the prediction error margin
cost = chosen_cost_func(pred, output_data_labels[ri])
# Also calculate the derivative of the cost function with respect to prediction
dCost_dPred = chosen_cost_func_derivation(pred, output_data_labels[ri])
# Derivative: bringing derivative from prediction output with respect to the activation function used for the weighted sum.
dPred_dWeightSum = chosen_activation_func_derivation(weighted_sum)
# Bias is just a number on its own added to the weighted sum, so its derivative is just 1
dWeightSum_dB = 1
# The derivative of the weighted sum with respect to each weight is the input data point / independent variable it's multiplied by.
# Therefore I simply assigned the input data array to another variable called 'dWeightedSum_dWeights'
# to represent the array of derivatives with respect to all the weights involved. I could've used the
# 'input_observation_vector' variable itself, but for the sake of readability I created a separate variable to represent the derivative with respect to each of the weights.
dWeightedSum_dWeights = input_observation_vector
# Derivative chaining rule: chaining all the derivative functions together (chaining rule)
# Loop through all the weights to workout the derivative of the cost with respect to each weight:
for dWeightedSum_dWeight in dWeightedSum_dWeights:
dCost_dWeight = dCost_dPred * dPred_dWeightSum * dWeightedSum_dWeight
dCost_dWeights.append(dCost_dWeight)
dCost_dB = dCost_dPred * dPred_dWeightSum * dWeightSum_dB
# Backpropagation: update the weights and bias according to the derivatives calculated above.
# In other words, we update the parameters of the neural network towards the correct parameters and therefore
# optimise the neural network prediction to be as accurate to the real output as possible
# We loop through each weight and update it with its derivative with respect to the cost error function value.
for ind in range(len(weights)):
weights[ind] = weights[ind] - learning_rate * dCost_dWeights[ind]
bias = bias - learning_rate * dCost_dB
# Compare prediction to target
error_margin = np.sqrt(np.square(pred - output_data_labels[ri]))
accuracy = (1 - error_margin) * 100
self.train_average_accuracy += round(accuracy)
# Evaluate whether the model guessed correctly, based on the binary classification 0/1 outcome. If the error margin is below 0.5 the guess counts as correct, and at 0.5 or above it counts as incorrect; an error margin of exactly 0.5 is treated as incorrect because it is not really a good guess for either 0 or 1. We need to set a good standard for the neural net model.
if (error_margin < 0.5) and (error_margin >= 0):
correct_pred += 1
elif (error_margin >= 0.5) and (error_margin <= 1):
incorrect_pred += 1
else:
print("Exception error - 'margin error' for 'predict' method is out of range. Must be between 0 and 1, in training method", file=sys.stderr)
return
costs.append(cost)
# Calculate the average accuracy from the predictions of all observations in the training dataset
self.train_average_accuracy = round(self.train_average_accuracy / len(dataset_input_matrix), 1)
# store the final optimised weights to the weights instance variable so it can be used in the predict method.
self.weights = weights
# store the final optimised bias to the weights instance variable so it can be used in the predict method.
self.bias = bias
# Print out results
print('Average Accuracy: {}'.format(self.train_average_accuracy))
print('Correct predictions: {}, Incorrect Predictions: {}'.format(correct_pred, incorrect_pred))
from numpy import array
#define array of dataset
# each observation vector has 3 datapoints or 3 columns: length, width, and outcome label (0, 1 to represent blue flower and red flower respectively).
data = array([[3, 1.5, 1],
[2, 1, 0],
[4, 1.5, 1],
[3, 1, 0],
[3.5, 0.5, 1],
[2, 0.5, 0],
[5.5, 1, 1],
[1, 1, 0]])
# separate data: split input, output, train and test data.
X_train, y_train, X_test, y_test = data[:6, :-1], data[:6, -1], data[6:, :-1], data[6:, -1]
nn_model = NN_classification()
nn_model.simple_1_layer_classification_NN(X_train, y_train, 2, 10000, learning_rate=0.2)

how to normalize prediction values in tensorflow

A simple feed forward DNN with relevant .csv files can be found here https://github.com/jhsmith12345/tensorflow/blob/normalize_prediction/tf_from_csv.py
This piece of code
classification = prediction.eval(feed_dict={x: [[9,3]]})
print (classification)
is outputting
[[ -12.2412138 -17.24327469 ]]
I am expecting a prediction that conforms to the labels, which are 1 or 0. Something like
[[ 0 1 ]]
I believe that my predictive values are not getting normalized by a softmax, but have no idea how to proceed. Any help is appreciated! Also, I'm more than happy to post the full code here but didn't want to clutter the post. Thanks!
Let me be clear - in your code
prediction = neural_network_model(x)
prediction.eval(feed_dict={x: [[9,3]]})
# output is [[ -12.2412138 -17.24327469 ]]
you are confused about why the range is not 0 ~ 1, right?
That is because no softmax is applied to prediction.
You use tf.nn.softmax_cross_entropy_with_logits.
As far as I know, this function applies a softmax to the prediction before computing the cross entropy,
but it doesn't change the value of prediction itself.
I think you can either
apply the softmax yourself and then compute the cross entropy, and finally print prediction directly (which means you can't use tf.nn.softmax_cross_entropy_with_logits),
or change nothing in the graph, but apply a softmax to prediction before printing it.
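For the second option, a minimal sketch (using TF2 eager execution for brevity, with the logits from your output):
import tensorflow as tf

logits = tf.constant([[-12.2412138, -17.24327469]])

# Leave the training graph untouched (softmax_cross_entropy_with_logits applies
# its own softmax internally); apply a softmax only when reading predictions.
probs = tf.nn.softmax(logits)
print(probs)  # values in [0, 1] that sum to 1 per row, roughly [[0.993, 0.007]]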

Simple TensorFlow Neural Network minimizes cost function yet all results are close to 1

So I tried implementing the neural network from:
http://iamtrask.github.io/2015/07/12/basic-python-network/
but using TensorFlow instead. I printed out the cost function twice during training and the error appears to be getting smaller, yet all the values in the output layer are close to 1 when only two of them should be. I imagine it might be something wrong with my maths, but I'm not sure. There is no difference when I try it with a hidden layer or use squared error as the cost function. Here is my code:
import tensorflow as tf
import numpy as np
input_layer_size = 3
output_layer_size = 1
x = tf.placeholder(tf.float32, [None, input_layer_size]) #holds input values
y = tf.placeholder(tf.float32, [None, output_layer_size]) # holds true y values
tf.set_random_seed(1)
input_weights = tf.Variable(tf.random_normal([input_layer_size, output_layer_size]))
input_bias = tf.Variable(tf.random_normal([1, output_layer_size]))
output_layer_vals = tf.nn.sigmoid(tf.matmul(x, input_weights) + input_bias)
cross_entropy = -tf.reduce_sum(y * tf.log(output_layer_vals))
training = tf.train.AdamOptimizer(0.1).minimize(cross_entropy)
x_data = np.array(
[[0,0,1],
[0,1,1],
[1,0,1],
[1,1,1]])
y_data = np.reshape(np.array([0,0,1,1]).T, (4, 1))
with tf.Session() as ses:
init = tf.initialize_all_variables()
ses.run(init)
for _ in range(1000):
ses.run(training, feed_dict={x: x_data, y:y_data})
if _ % 500 == 0:
print(ses.run(output_layer_vals, feed_dict={x: x_data}))
print(ses.run(cross_entropy, feed_dict={x: x_data, y:y_data}))
print('\n\n')
And this is what it outputs:
[[ 0.82036656]
[ 0.96750367]
[ 0.87607527]
[ 0.97876281]]
0.21947 #first cross_entropy error
[[ 0.99937409]
[ 0.99998224]
[ 0.99992537]
[ 0.99999785]]
0.00062825 #second cross_entropy error, as you can see, it's smaller
First of all: you have no hidden layer. As far as I remember, a basic perceptron cannot model the XOR problem without some adjustments. (AI is only inspired by biology; it does not model real neural networks exactly.) Thus, you have to build at least an MLP (multilayer perceptron), which consists of at least one input, one hidden and one output layer. The XOR problem needs at least two neurons + bias in the hidden layer to be solved correctly (with high precision).
Additionally, your learning rate is too high. 0.1 is a very high learning rate. To put it simply, it basically means that you update/adapt your current state by 10% of a single learning step. This makes your network quickly forget invariants it has already learned. Usually the learning rate is somewhere between 1e-2 and 1e-6, depending on your problem, network size and general architecture.
Moreover, you implemented the "simplified/short" version of cross-entropy; see Wikipedia for the full version: cross-entropy. To avoid some edge cases, TensorFlow also ships its own versions of cross-entropy, for example tf.nn.softmax_cross_entropy_with_logits.
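For comparison, a small NumPy sketch of the full binary cross-entropy (both terms), evaluated on the first batch of outputs printed above:
import numpy as np

def full_binary_cross_entropy(y_true, p, eps=1e-7):
    # both terms of the binary cross-entropy; the code in the question keeps only the first
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([0., 0., 1., 1.])
p = np.array([0.82036656, 0.96750367, 0.87607527, 0.97876281])
print(full_binary_cross_entropy(y_true, p))  # penalises the confident 1s on the 0-targets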
Finally, you should remember that the cross-entropy error is a logistic loss function that operates on the probabilities of your classes. Although your sigmoid function squashes the output layer into the interval [0, 1], this only works in your case because you have a single output neuron. As soon as you have more than one output neuron, you also need the outputs of the layer to sum to exactly 1.0 in order to really describe a probability for every class on the output layer.
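For instance, a softmax over the output layer gives per-class values that always sum to 1:
import numpy as np

logits = np.array([[2.0, 1.0, 0.1]])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs, probs.sum(axis=1))  # each row sums to 1.0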

Backpropagation outputs tend towards same value

I'm attempting to create a multilayer feedforward backpropagation neural network to recognize handwritten digits and I'm running into a problem where the activations in my output layer all tend towards the same value.
I'm using the Optical Recognition of Handwritten Digits Data Set, with training data that looks like
0,1,6,15,12,1,0,0,0,7,16,6,6,10,0,0,0,8,16,2,0,11,2,0,0,5,16,3,0,5,7,0,0,7,13,3,0,8,7,0,0,4,12,0,1,13,5,0,0,0,14,9,15,9,0,0,0,0,6,14,7,1,0,0,0
which represents an 8x8 matrix, where each of the 64 integers corresponds to the number of dark pixels in a sub-4x4 matrix, with the last integer being the classification.
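So each line splits into 64 pixel-count features plus one class label, e.g.:
import numpy as np

row = "0,1,6,15,12,1,0,0,0,7,16,6,6,10,0,0,0,8,16,2,0,11,2,0,0,5,16,3,0,5,7,0,0,7,13,3,0,8,7,0,0,4,12,0,1,13,5,0,0,0,14,9,15,9,0,0,0,0,6,14,7,1,0,0,0"
values = np.array(row.split(","), dtype=float)
features, label = values[:64], int(values[64])  # 64 inputs, classification 0-9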
I'm using 64 nodes in the input layer corresponding to the 64 integers, some number of hidden nodes in some number of hidden layers, and 10 nodes in the output layer corresponding to 0-9.
My weights are initialized here, and biases are added for the input layer and hidden layers
self.weights = []
for i in xrange(1, len(layers) - 1):
self.weights.append(
np.random.uniform(low=-0.2,
high=0.2,
size=(layers[i-1] + 1, layers[i] + 1)))
# Output weights
self.weights.append(
np.random.uniform(low=-0.2,
high=0.2,
size=(layers[-2] + 1, layers[-1])))
where layers contains the number of nodes in each layer, e.g.
layers=[64, 30, 10]
I'm using the logistic function as my activation function
def logistic(self, z):
return sp.expit(z)
and its derivative
def derivative(self, z):
return sp.expit(z) * (1 - sp.expit(z))
My backpropagation algorithm is borrowed heavily from here; my previous attempts failed so I wanted to try another route.
def back_prop_learning(self, X, y):
# add biases to inputs with value of 1
biases = np.atleast_2d(np.ones(X.shape[0]))
X = np.concatenate((biases.T, X), axis=1)
# Iterate over training set
for epoch in xrange(self.epochs):
# for each weight w[i][j] in network assign random tiny values
# handled in __init__
''' PROPAGATE THE INPUTS FORWARD TO COMPUTE THE OUTPUTS '''
for example in zip(X, y):
# for each node i in the input layer
# set input layer outputs equal to input vector outputs
activations = [example[0]]
# for layer = 1 (first hidden) to output layer
for layer in xrange(len(self.weights)):
# for each node j in layer
weighted_sum = np.dot(activations[layer], self.weights[layer])
# assert number of outputs == number of weights in each layer
assert(len(activations[layer]) == len(self.weights[layer]))
# compute activation of weighted sum of node j
activation = self.logistic(weighted_sum)
# append vector of activations
activations.append(activation)
''' PROPAGATE DELTAS BACKWARDS FROM OUTPUT LAYER TO INPUT LAYER '''
# for each node j in the output layer
# compute error of target - output
errors = example[1] - activations[-1]
# multiply by derivative
deltas = [errors * self.derivative(activations[-1])]
# for layer = last hidden layer down to first hidden layer
for layer in xrange(len(activations)-2, 0, -1):
deltas.append(deltas[-1].dot(self.weights[layer].T) * self.derivative(activations[layer]))
''' UPDATE EVERY WEIGHT IN NETWORK USING DELTAS '''
deltas.reverse()
# for each weight w[i][j] in network
for i in xrange(len(self.weights)):
layer = np.atleast_2d(activations[i])
delta = np.atleast_2d(deltas[i])
self.weights[i] += self.alpha * layer.T.dot(delta)
And my outputs after running testing data all resemble
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 9.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 4.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 6.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 6.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 7.0
No matter what I select for my learning rate, number of hidden nodes, or number of hidden layers, everything seems to tend towards 1. This leaves me wondering whether I'm even approaching and setting up the problem correctly (64 inputs to 10 outputs), whether I've selected/implemented my sigmoid function correctly, or whether the failure is in my implementation of the backpropagation algorithm. I've recreated the above program two or three times with the same results, which leads me to believe that I'm fundamentally misunderstanding the problem and not representing it correctly.
I think I've answered my question.
I believe the problem was how I was calculating the errors in the output layer. I had been calculating them as errors = example[1] - activations[-1], which subtracts the output layer activations from the single scalar class label (e.g. 9.0), broadcasting that label across all ten outputs.
I changed this so that my target is a vector of ten zeros (one per digit 0-9) with a 1.0 at the index of the target class.
y = int(example[1])
errors_v = np.zeros(shape=(10,), dtype=float)
errors_v[y] = 1.0
errors = errors_v - activations[-1]
I also changed my activation function to be the tanh function.
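For reference, the tanh pair I'm now using looks roughly like this (shown as plain functions here; in my class they are methods like logistic/derivative):
import numpy as np

def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    # derivative of tanh(z) is 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2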
This has significantly increased the variance in the activations in my output layer and I've been able to achieve 50% - 75% accuracy in my limited testing so far. Hopefully this helps someone else.
