I'm coding a simple neural network from scratch. The network is implemented in the method def simple_1_layer_classification_NN, which accepts an input matrix and output labels, among other parameters. Before looping through every epoch I wanted to shuffle the input matrix by its rows only (i.e. its observations), as one measure of avoiding over-fitting. I tried random.shuffle(dataset_input_matrix). Two strange things happened. First, I took a snapshot of the matrix before and after the shuffle step (using the code below with breakpoints to inspect the values, expecting it to shuffle). So input_matrix should give the value of the matrix before the shuffle, and input_matrix1 should give the value after, i.e. of the shuffled matrix.
input_matrix = dataset_input_matrix
# shuffle our matrix observation samples, to decrease the chance of overfitting
random.shuffle(dataset_input_matrix)
input_matrix1 = dataset_input_matrix
When I printed both values, I got the same matrix, with no changes.
ipdb> input_matrix
array([[3. , 1.5],
       [3. , 1.5],
       [2. , 1. ],
       [3. , 1.5],
       [3. , 1.5],
       [3. , 1. ]])
ipdb> input_matrix1
array([[3. , 1.5],
       [3. , 1.5],
       [2. , 1. ],
       [3. , 1.5],
       [3. , 1.5],
       [3. , 1. ]])
ipdb>
Not sure if I'm doing something wrong here.
The second strange thing is that when I ran the neural network (after adding the shuffle), its accuracy dropped dramatically. Before, I was getting accuracy ranging from 60% to 95% (with very few runs at 50%).
After adding the shuffle step for the input matrix, I was barely getting accuracy above 50%, no matter how many times I ran the model. That is strange considering that the shuffle doesn't even appear to have worked when examined with the breakpoints. And in any case, why should the network's accuracy drop this badly? Unless I'm doing the shuffling completely wrong.
So 2 questions:
1- How do I shuffle only the rows of a matrix? (I only need to randomise the observations (rows), not the features (columns) of the dataset.)
2- Why did doing the shuffle drop the accuracy so much that the neural network can't get anything above 50%? After all, shuffling the data is a recommended pre-processing step to avoid over-fitting.
Please refer to the full code below, and apologies for the large portion of code.
Many thanks in advance for any help.
# --- neural network structure diagram ---
# O output prediction
# / \ w1, w2, b
# O O datapoint 1, datapoint 2
def simple_1_layer_classification_NN(self, dataset_input_matrix, output_data_labels, input_dimension, epochs, activation_func='sigmoid', learning_rate=0.2, cost_func='squared_error'):
    weights = []
    bias = int()
    cost = float()
    costs = []
    dCost_dWeights = []
    chosen_activation_func_derivation = None
    chosen_cost_func = None
    chosen_cost_func_derivation = None
    correct_pred = int()
    incorrect_pred = int()
    # store the chosen activation function to use later on in the activation calculation section and in the 'predict' method.
    # The same goes for the derivation section.
    if activation_func == 'sigmoid':
        self.chosen_activation_func = NN_classification.sigmoid
        chosen_activation_func_derivation = NN_classification.sigmoid_derivation
    elif activation_func == 'relu':
        self.chosen_activation_func = NN_classification.relu
        chosen_activation_func_derivation = NN_classification.relu_derivation
    else:
        print("Exception error - no activation function utilised, in training method", file=sys.stderr)
        return
    # store the chosen cost function to use later on in the cost calculation section.
    # The same goes for the cost derivation section.
    if cost_func == 'squared_error':
        chosen_cost_func = NN_classification.squared_error
        chosen_cost_func_derivation = NN_classification.squared_error_derivation
    else:
        print("Exception error - no cost function utilised, in training method", file=sys.stderr)
        return
    # Set initial network parameters (weights & bias):
    # Will initialise the weights to a uniform distribution and ensure the numbers are small, close to 0.
    # We need to loop through all the weights to set them to a random value initially.
    for i in range(input_dimension):
        # create random numbers for our initial weights (connections) to begin with. 'rand' method creates small random numbers.
        w = np.random.rand()
        weights.append(w)
    # create a random number for our initial bias to begin with.
    bias = np.random.rand()
    '''
    I tried adding the shuffle step, where the matrix is shuffled only in terms of its observations (i.e. rows),
    but this dropped the accuracy dramatically, to the point where the 50% range was the best the model could achieve.
    '''
    input_matrix = dataset_input_matrix
    # shuffle our matrix observation samples, to decrease the chance of overfitting
    random.shuffle(dataset_input_matrix)
    input_matrix1 = dataset_input_matrix
    # We perform the training based on the number of epochs specified
    for i in range(epochs):
        # reset average accuracy with every epoch
        self.train_average_accuracy = 0
        for ri in range(len(dataset_input_matrix)):
            # reset the weighted sum at the beginning of every observation to avoid incrementing on top of the previous observations' weighted sums.
            weighted_sum = 0
            input_observation_vector = dataset_input_matrix[ri]
            # Loop through all the independent variables (x) in the observation
            for x in range(len(input_observation_vector)):
                # Weighted sum: take each independent variable in the observation, multiply it by its weight, then add it to the subtotal
                weighted_sum += input_observation_vector[x] * weights[x]
            # Add bias to the weighted sum
            weighted_sum += bias
            # Activation: process weighted_sum through the activation function
            activation_func_output = self.chosen_activation_func(weighted_sum)
            # Prediction: because this is a single-layer neural network, the activation output is the prediction
            pred = activation_func_output
            # Cost: the cost function calculates the prediction error margin
            cost = chosen_cost_func(pred, output_data_labels[ri])
            # Also calculate the derivative of the cost function with respect to the prediction
            dCost_dPred = chosen_cost_func_derivation(pred, output_data_labels[ri])
            # Derivative of the prediction output with respect to the weighted sum, via the activation function used.
            dPred_dWeightSum = chosen_activation_func_derivation(weighted_sum)
            # Bias is just a number added to the weighted sum, so its derivative is just 1
            dWeightSum_dB = 1
            # The derivative of the weighted sum with respect to each weight is the input data point / independent variable it's multiplied by.
            # Therefore I simply assigned the input data array to another variable I called 'dWeightedSum_dWeights'
            # to represent the array of derivatives for all the weights involved. I could've used the input vector
            # variable itself, but for the sake of readability, I created a separate variable to represent the derivative of each of the weights.
            dWeightedSum_dWeights = input_observation_vector
            # Chain rule: chain all the derivative functions together.
            # Loop through all the weights to work out the derivative of the cost with respect to each weight:
            for dWeightedSum_dWeight in dWeightedSum_dWeights:
                dCost_dWeight = dCost_dPred * dPred_dWeightSum * dWeightedSum_dWeight
                dCost_dWeights.append(dCost_dWeight)
            dCost_dB = dCost_dPred * dPred_dWeightSum * dWeightSum_dB
            # Backpropagation: update the weights and bias according to the derivatives calculated above.
            # In other words, we update the parameters of the neural network to optimise the prediction
            # to be as close to the real output as possible.
            # We loop through each weight and update it with its derivative with respect to the cost error function value.
            for ind in range(len(weights)):
                weights[ind] = weights[ind] - learning_rate * dCost_dWeights[ind]
            bias = bias - learning_rate * dCost_dB
            # Compare prediction to target
            error_margin = np.sqrt(np.square(pred - output_data_labels[ri]))
            accuracy = (1 - error_margin) * 100
            self.train_average_accuracy += round(accuracy)
            # Evaluate whether it guessed correctly, based on the binary classification's 0 or 1 outcome. If the prediction is within 0.5
            # of the target it counts as correct, otherwise as incorrect. An error margin of exactly 0.5 counts as incorrect,
            # because it's not a good guess for either 0 or 1. We need to set a good standard for the neural net model.
            if (error_margin < 0.5) and (error_margin >= 0):
                correct_pred += 1
            elif (error_margin >= 0.5) and (error_margin <= 1):
                incorrect_pred += 1
            else:
                print("Exception error - 'margin error' for 'predict' method is out of range. Must be between 0 and 1, in training method", file=sys.stderr)
                return
        costs.append(cost)
        # Calculate average accuracy from the predictions of all observations in the training dataset
        self.train_average_accuracy = round(self.train_average_accuracy / len(dataset_input_matrix), 1)
    # store the final optimised weights in the weights instance variable so they can be used in the predict method.
    self.weights = weights
    # store the final optimised bias in the bias instance variable so it can be used in the predict method.
    self.bias = bias
    # Print out results
    print('Average Accuracy: {}'.format(self.train_average_accuracy))
    print('Correct predictions: {}, Incorrect Predictions: {}'.format(correct_pred, incorrect_pred))
from numpy import array

# define array of dataset
# each observation vector has 3 datapoints or 3 columns: length, width, and outcome label (0 and 1 representing the blue flower and red flower respectively).
data = array([[3,   1.5, 1],
              [2,   1,   0],
              [4,   1.5, 1],
              [3,   1,   0],
              [3.5, 0.5, 1],
              [2,   0.5, 0],
              [5.5, 1,   1],
              [1,   1,   0]])

# separate data: split input, output, train and test data.
X_train, y_train, X_test, y_test = data[:6, :-1], data[:6, -1], data[6:, :-1], data[6:, -1]

nn_model = NN_classification()
nn_model.simple_1_layer_classification_NN(X_train, y_train, 2, 10000, learning_rate=0.2)
I seem to be having a problem with my code. The error occurs at:
x, predicted = torch.max(net(value).data.squeeze(), 1)
I'm not sure what the issue is, and I've tried everything to fix it. From my understanding, there seems to be a problem with the tensor dimensions. I'm not sure what else to do. Can anyone give me any suggestions or solutions for how to fix this problem? Thank you in advance.
import torch
import torch.nn as nn
import torch.optim as optim

class Network(nn.Module):  # Class for the neural network
    def __init__(self):
        super(Network, self).__init__()
        self.layer1 = nn.Linear(6, 10)  # First number is the number of inputs (784, since 28x28 is 784). Second number is the number of inputs for the hidden layer (can be any number).
        self.hidden = nn.Softmax()  # Activation function
        self.layer2 = nn.Linear(10, 1)  # First number is the hidden layer size (same as the first layer), second number is the number of outputs.
        self.layer3 = nn.Sigmoid()

    def forward(self, x):  # Feed-forward part of the neural network. We feed the input through every layer of our network.
        y = self.layer1(x)
        y = self.hidden(y)
        y = self.layer2(y)
        y = self.layer3(y)
        return y  # Returns the result

net = Network()
loss_function = nn.BCELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

for x in range(1):  # Number of epochs over the dataset
    for index, value in enumerate(new_train_loader):  # This loop loops over every image in the dataset
        print(value)
        #actual = value[0]
        actual_value = value[5]
        #print(value.size())
        #print(net(value).size())
        print("Actual", actual_value)
        net(value)
        loss = loss_function(net(value), actual_value.unsqueeze(0))  # Updating our loss function for every image
        # Backpropagation
        optimizer.zero_grad()  # Sets gradients to zero
        loss.backward()  # Computes gradients
        optimizer.step()  # Updates parameters
        print("Loop #: ", str(x+1), "Index #: ", str(index+1), "Loss: ", loss.item())

right = 0
total = 0
for value in new_test_loader:
    actual_value = value[5]
    #print(torch.max(net(value).data, 1))
    print(net(value).shape)
    x, predicted = torch.max(net(value).data.squeeze(), 1)
    total += actual_value.size(0)
    right += (predicted == actual_value).sum().item()

print("Accuracy: " + str((100 * right / total)))
I should also mention that I'm using the latest versions.
You are calling .squeeze() on the model's output, which removes all singular dimensions (dimensions of size 1). Your model's output has size [batch_size, 1], so .squeeze() removes the second dimension entirely, resulting in size [batch_size]. Afterwards, you're trying to take the maximum value across dimension 1, but the only dimension you have is the 0th dimension.
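You can see this with a quick check (a minimal sketch, assuming a batch size of 4):
import torch

out = torch.rand(4, 1)      # same shape as your model's output: [batch_size, 1]
print(out.shape)            # torch.Size([4, 1])
print(out.squeeze().shape)  # torch.Size([4]) - the size-1 dimension is gone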
You don't need to take the maximum value in this case, since you have only one class as the output, and with the sigmoid at the end of your model you get values between [0, 1]. Since you are doing binary classification, that single class acts as two: either it's 0 or it's 1, so it can be seen as the probability of class 1. Then you just need to use a threshold of 0.5, meaning when the probability is over 0.5 it's class 1, and when it's under 0.5 it's class 0. That's exactly what rounding does, therefore you can use torch.round.
output = net(value)
predicted = torch.round(output.squeeze())
On a side note, you are calling net(value) multiple times with the same value, which means its output is calculated multiple times as well, because it needs to go through the entire network again. That is unnecessary, and you should just save the output in a variable. With this small network it isn't noticeable, but with larger networks recalculating the output wastes a lot of time.
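For example, the evaluation loop could reuse a single forward pass like this (a sketch based on your loop above):
for value in new_test_loader:
    actual_value = value[5]
    output = net(value)                        # forward pass once, then reuse
    predicted = torch.round(output.squeeze())  # threshold at 0.5 via rounding
    total += actual_value.size(0)
    right += (predicted == actual_value).sum().item()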
I am struggling on how to compute accuracy from my neural network. I am using the MNIST database with backpropagation algorithm. All is done from scratch.
My partial code looks like this:
for x in range(1, epochs+1):
    #Compute Feedforward
    #...
    activationValueOfSoftmax = softmax(Z2)
    #Loss
    #Y are my labels
    loss = - np.sum((Y * np.log(activationValueOfSoftmax)), axis=0, keepdims=True)
    cost = np.sum(loss, axis=1) / m #m is 784
    #Backpropagation
    dZ2 = activationValueOfSoftmax - Y
    #the rest of the parameters
    #...
    #parameters update via Gradient Descent
    #...
Can I compute accuracy from this, or do I have to redo some parts of my NN?
Thanks for any help!
I assume that you have your 10-element one-hot y vector for your test set (10 digits), and that you retrieved your hypothesis through forward prop with your training set.
correct = 0
for i in range(np.shape(y)[0]):
    # argmax retrieves the index of the max element in the hypothesis
    guess = np.argmax(hyp[i, :])
    ans = np.argmax(y[i, :])
    print("guess: ", guess, "| ans: ", ans)
    if guess == ans:
        correct = correct + 1
accuracy = (correct / np.shape(y)[0]) * 100
You have to do forward prop again with your weights and the TEST SET data to get your hypothesis vector (it should have 10 elements per example). Then you can loop through all the y values in the test set, using a counter variable (correct) to track the number of correct guesses. To get a percentage, you just divide correct by the number of test set examples and multiply by 100.
If you want the accuracy on the training set, just use your hypothesis (in your case activationValueOfSoftmax) and do the same.
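As a side note, the loop can be collapsed into a single vectorised line. Here is a self-contained sketch with made-up hyp and y arrays (3 examples, 10 classes):
import numpy as np

# made-up hypothesis and one-hot label arrays for digits 9, 0 and 5
hyp = np.array([[0.1]*9 + [0.9],
                [0.8] + [0.1]*9,
                [0.1]*5 + [0.6] + [0.1]*4])
y = np.eye(10)[[9, 0, 5]]

# fraction of rows where the predicted digit matches the labelled digit
accuracy = np.mean(np.argmax(hyp, axis=1) == np.argmax(y, axis=1)) * 100
print(accuracy)  # 100.0 here, since every argmax matches its label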
Best of luck
I am trying to code a very basic neural network in Python, with 3 input nodes with a value of 0 or 1 and a single output node with a value of 0 or 1. The output should be almost equal to the second input, but after training, the weights are way, way too high, and the network almost always guesses 1.
I am using Python 3.7 with numpy and scipy. I have tried changing the training set, the new instance, and the random seed.
import numpy as np
import random as rand
from scipy.special import expit as ex

rand.seed(10)

training_set = [[0,1,0],[1,0,1],[0,0,0],[1,1,1]]  # The training sets and their outputs
training_outputs = [0,1,0,1]
weightlst = [rand.uniform(-1,1), rand.uniform(-1,1), rand.uniform(-1,1)]  # Weights are randomly set with a value between -1 and 1
print('Random weights\n' + str(weightlst))

def calcout(inputs, weights):  # Calculate the expected output with given inputs and weights
    output = 0.5
    for i in range(len(inputs)):
        output = output + (inputs[i] * weights[i])
    #print('\nmy output is ' + str(ex(output)))
    return ex(output)  # Return the output on a sigmoid curve between 0 and 1

def adj(expected_output, training_output, weights, inputs):  # Adjust the weights based on the expected output, true (training) output and the weights
    adjweights = []
    error = expected_output - training_output
    for i in weights:
        adjweights.append(i + (error * (expected_output * (1 - expected_output))))
    return adjweights

# Train the network, adjusting weights each time
training_iterations = 10000
for k in range(training_iterations):
    for l in range(len(training_set)):
        expected = calcout(training_set[l], weightlst)
        weightlst = adj(expected, training_outputs[l], weightlst, training_set[l])

new_instance = [1,0,0]  # Calculate and return the expected output of a new instance
print('Adjusted weights\n' + str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance, weightlst)))
The expected output would be 0, or something very close to it, but no matter what I set new_instance to, the output is still:
Random weights
[-0.7312715117751976, 0.6948674738744653, 0.5275492379532281]
Adjusted weights
[1999.6135460307303, 2001.03968501638, 2000.8723667804588]
Expected output of new instance = 1.0
What is wrong with my code?
Bugs:
- No bias used in the neuron
- error = training_output - expected_output (not the other way around) for gradient descent
- Weight update rule for the ith weight: w_i = w_i + learning_rate * delta_w_i (where delta_w_i is the gradient of the loss with respect to w_i)
- For squared loss, delta_w_i = error * sample[i] (the ith value of the input vector sample)
Since you have only one neuron (one hidden layer of size 1), your model can only learn linearly separable data (it is only a linear classifier). Examples of linearly separable data are data generated by functions like boolean AND and OR. Note that boolean XOR is not linearly separable.
Code with bugs fixed:
import numpy as np
import random as rand
from scipy.special import expit as ex

rand.seed(10)

training_set = [[0,1,0],[1,0,1],[0,0,0],[1,1,1]]  # The training sets and their outputs
training_outputs = [1,1,0,1]  # Boolean OR of input vector
#training_outputs = [0,0,0,1]  # Boolean AND of input vector
weightlst = [rand.uniform(-1,1), rand.uniform(-1,1), rand.uniform(-1,1)]  # Weights are randomly set with a value between -1 and 1
bias = rand.uniform(-1,1)
print('Random weights\n' + str(weightlst))

def calcout(inputs, weights, bias):  # Calculate the expected output with given inputs and weights
    output = bias
    for i in range(len(inputs)):
        output = output + (inputs[i] * weights[i])
    #print('\nmy output is ' + str(ex(output)))
    return ex(output)  # Return the output on a sigmoid curve between 0 and 1

def adj(expected_output, training_output, weights, bias, inputs):  # Adjust the weights based on the expected output, true (training) output and the weights
    adjweights = []
    error = training_output - expected_output
    lr = 0.1
    for j, i in enumerate(weights):
        adjweights.append(i + error * inputs[j] * lr)
    adjbias = bias + error * lr
    return adjweights, adjbias

# Train the network, adjusting weights each time
training_iterations = 10000
for k in range(training_iterations):
    for l in range(len(training_set)):
        expected = calcout(training_set[l], weightlst, bias)
        weightlst, bias = adj(expected, training_outputs[l], weightlst, bias, training_set[l])

new_instance = [1,0,0]  # Calculate and return the expected output of a new instance
print('Adjusted weights\n' + str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance, weightlst, bias)))
Output:
Random weights
[0.142805189379827, -0.14222189064977075, 0.15618260226894076]
Adjusted weights
[6.196759842119063, 11.71208191137411, 6.210137255008176]
Expected output of new instance = 0.6655563851223694
As you can see, for input [1,0,0] the model predicted a probability of 0.66, which is class 1 (since 0.66 > 0.5). That is correct, as the expected output is the OR of the input vector.
Note:
For learning/understanding how each weight is updated, it is OK to code like the above, but in practice all the operations are vectorised. Check the link for a vectorised implementation.
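To illustrate, here is a sketch of the same model and update rule written with numpy, where each epoch updates the weights from all samples at once (note it does one batch update per epoch rather than a per-sample update, so the learned weights will differ slightly):
import numpy as np

rng = np.random.default_rng(10)
X = np.array([[0, 1, 0], [1, 0, 1], [0, 0, 0], [1, 1, 1]], dtype=float)
y = np.array([1, 1, 0, 1], dtype=float)  # boolean OR of each input vector

w = rng.uniform(-1, 1, size=3)
b = rng.uniform(-1, 1)
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    pred = sigmoid(X @ w + b)  # forward pass for all four samples at once
    error = y - pred           # same error definition as in the fixed code
    w += lr * (X.T @ error)    # accumulate error * input over the batch
    b += lr * error.sum()

print(sigmoid(np.array([1, 0, 0]) @ w + b))  # close to 1, as OR([1,0,0]) = 1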
I am new to neural networks and I am using an example neural network I found online to attempt to approximate the sphere function (the sum of a set of squared numbers) using back propagation.
The initial code is:
from numpy import exp, array, random, dot

class NeuralNetwork():
    def __init__(self):
        # Seed the random number generator, so it generates the same numbers
        # every time the program is run.
        #random.seed(1)
        # Model a single neuron, with 2 input connections and 1 output connection.
        # We assign random weights to a 2 x 1 matrix, with values in the range -1 to 1
        # and mean 0.
        self.synaptic_weights = 2 * random.random((2, 1)) - 1

    # The sigmoid function, which describes an S-shaped curve.
    # We pass the weighted sum of the inputs through this function to
    # normalise them between 0 and 1.
    def __sigmoid(self, x):
        return 1 / (1 + exp(-x))

    # The derivative of the sigmoid function.
    # This is the gradient of the sigmoid curve.
    # It indicates how confident we are about the existing weight.
    def __sigmoid_derivative(self, x):
        return x * (1 - x)

    # Train the network through a process of trial and error,
    # adjusting the synaptic weights each time.
    def train(self, training_set_inputs, training_set_outputs, number_of_training_iterations):
        for iteration in xrange(10000):
            # Pass the training set through our neural network (a single neuron).
            output = self.think(training_set_inputs)
            # Calculate the error (difference between the desired output and predicted output).
            error = training_set_outputs - output
            # Multiply the error by the input and again by the gradient of the sigmoid curve.
            # This means less confident weights are adjusted more.
            # This means inputs, which are zero, do not cause changes to the weights.
            adjustment = dot(training_set_inputs.T, error * self.__sigmoid_derivative(output))
            # Adjust the weights.
            self.synaptic_weights += adjustment

    # The neural network thinks.
    def think(self, inputs):
        # Pass inputs through our neural network (our single neuron).
        return self.__sigmoid(dot(inputs, self.synaptic_weights))

if __name__ == "__main__":
    # Initialise a single-neuron neural network.
    neural_network = NeuralNetwork()
    print "Random starting synaptic weights: "
    print neural_network.synaptic_weights
    # The training set. We have 3 examples, each consisting of 2 input values and 1 output value.
    training_set_inputs = array([[0, 1], [1, 0], [0, 0]])
    training_set_outputs = array([[1, 1, 0]]).T
    # Train the neural network using a training set.
    # Do it 10,000 times and make small adjustments each time.
    neural_network.train(training_set_inputs, training_set_outputs, 10000)
    print "New synaptic weights after training: "
    print neural_network.synaptic_weights
    # Test the neural network with a new situation.
    print "Considering new situation [1, 1] -> ?: "
    print neural_network.think(array([1, 1]))
My aim is to input training data (the sphere function inputs and outputs) into the neural network to train it and meaningfully adjust the weights. After continuous training, the weights should reach a point where reasonably accurate results are given for the training inputs.
I imagine an example of some training sets for the sphere function would be something like:
training_set_inputs = array([[2,1], [3,2], [4,6], [8,3]])
training_set_outputs = array([[5, 13, 52, 73]])
The example I found online can successfully approximate the XOR operation, but when given sphere function inputs it only gives me an output of 1 when tested on a new example (for example, [6,7], which should ideally return an approximation around 85).
From what I have read about neural networks, I suspect this is because I need to normalise the inputs, but I am not entirely sure how to do this. Any help on this, or something to point me in the right direction, would be appreciated a lot, thank you.
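To show what I mean by normalisation, this is roughly what I have in mind (a min-max scaling sketch; scaling the targets as well is my own guess, since the sigmoid output can never exceed 1):
import numpy as np

X = np.array([[2, 1], [3, 2], [4, 6], [8, 3]], dtype=float)
y = np.array([[5, 13, 52, 73]], dtype=float).T

# min-max scale each input column into [0, 1]
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# the sigmoid output lives in (0, 1), so the targets would need scaling
# into that range too, and predictions mapped back afterwards
y_scaled = y / y.max()
prediction_scaled = 0.42                  # hypothetical network output
prediction = prediction_scaled * y.max()  # back to the original range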
I'm attempting to create a multilayer feedforward backpropagation neural network to recognize handwritten digits and I'm running into a problem where the activations in my output layer all tend towards the same value.
I'm using the Optical Recognition of Handwritten Digits Data Set, with training data that looks like
0,1,6,15,12,1,0,0,0,7,16,6,6,10,0,0,0,8,16,2,0,11,2,0,0,5,16,3,0,5,7,0,0,7,13,3,0,8,7,0,0,4,12,0,1,13,5,0,0,0,14,9,15,9,0,0,0,0,6,14,7,1,0,0,0
which represents an 8x8 matrix, where each of the 64 integers corresponds to the number of dark pixels in a sub-4x4 matrix, with the last integer being the classification.
I'm using 64 nodes in the input layer corresponding to the 64 integers, some number of hidden nodes in some number of hidden layers, and 10 nodes in the output layer corresponding to 0-9.
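In other words, each line splits into 64 pixel-count features plus a trailing digit label, like so (a minimal parsing sketch using the row quoted above):
# one row of the dataset, as quoted above
line = "0,1,6,15,12,1,0,0,0,7,16,6,6,10,0,0,0,8,16,2,0,11,2,0,0,5,16,3,0,5,7,0,0,7,13,3,0,8,7,0,0,4,12,0,1,13,5,0,0,0,14,9,15,9,0,0,0,0,6,14,7,1,0,0,0"
values = [int(v) for v in line.split(",")]
features, label = values[:64], values[64]  # 64 pixel counts, then the digit (0 here)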
My weights are initialized here, and biases are added for the input layer and hidden layers
self.weights = []
for i in xrange(1, len(layers) - 1):
    self.weights.append(
        np.random.uniform(low=-0.2,
                          high=0.2,
                          size=(layers[i-1] + 1, layers[i] + 1)))
# Output weights
self.weights.append(
    np.random.uniform(low=-0.2,
                      high=0.2,
                      size=(layers[-2] + 1, layers[-1])))
where layers contains the number of nodes in each layer, e.g.
layers=[64, 30, 10]
I'm using the logistic function as my activation function
def logistic(self, z):
    return sp.expit(z)
and its derivative
def derivative(self, z):
    return sp.expit(z) * (1 - sp.expit(z))
My backpropagation algorithm is borrowed heavily from here; my previous attempts failed so I wanted to try another route.
def back_prop_learning(self, X, y):
    # add biases to inputs with value of 1
    biases = np.atleast_2d(np.ones(X.shape[0]))
    X = np.concatenate((biases.T, X), axis=1)
    # Iterate over training set
    for epoch in xrange(self.epochs):
        # for each weight w[i][j] in network assign random tiny values
        # handled in __init__
        ''' PROPAGATE THE INPUTS FORWARD TO COMPUTE THE OUTPUTS '''
        for example in zip(X, y):
            # for each node i in the input layer
            # set input layer outputs equal to input vector outputs
            activations = [example[0]]
            # for layer = 1 (first hidden) to output layer
            for layer in xrange(len(self.weights)):
                # for each node j in layer
                weighted_sum = np.dot(activations[layer], self.weights[layer])
                # assert number of outputs == number of weights in each layer
                assert(len(activations[layer]) == len(self.weights[layer]))
                # compute activation of weighted sum of node j
                activation = self.logistic(weighted_sum)
                # append vector of activations
                activations.append(activation)
            ''' PROPAGATE DELTAS BACKWARDS FROM OUTPUT LAYER TO INPUT LAYER '''
            # for each node j in the output layer
            # compute error of target - output
            errors = example[1] - activations[-1]
            # multiply by derivative
            deltas = [errors * self.derivative(activations[-1])]
            # for layer = last hidden layer down to first hidden layer
            for layer in xrange(len(activations) - 2, 0, -1):
                deltas.append(deltas[-1].dot(self.weights[layer].T) * self.derivative(activations[layer]))
            ''' UPDATE EVERY WEIGHT IN NETWORK USING DELTAS '''
            deltas.reverse()
            # for each weight w[i][j] in network
            for i in xrange(len(self.weights)):
                layer = np.atleast_2d(activations[i])
                delta = np.atleast_2d(deltas[i])
                self.weights[i] += self.alpha * layer.T.dot(delta)
And my outputs after running testing data all resemble
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 9.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 4.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 6.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 6.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 7.0
No matter what I select for my learning rate, number of hidden nodes, or number of hidden layers, everything seems to tend towards 1. This leaves me wondering whether I'm even approaching and setting up the problem correctly, with 64 inputs to 10 outputs, whether I've selected and implemented my sigmoid function correctly, or whether the failure lies in my implementation of the backpropagation algorithm. I've recreated the above program two or three times with the same results, which leads me to believe that I'm fundamentally misunderstanding the problem and not representing it correctly.
I think I've answered my question.
I believe the problem was how I was calculating my errors in the output layer. I had been calculating it as errors = example[1] - activations[-1], which created an array of errors resulting from subtracting my output layer activations from the target value.
I changed this so that my target was a vector of ten zeros (one per digit, 0-9), with the index of the target value set to 1.0.
y = int(example[1])
errors_v = np.zeros(shape=(10,), dtype=float)
errors_v[y] = 1.0
errors = errors_v - activations[-1]
I also changed my activation function to be the tanh function.
This has significantly increased the variance in the activations of my output layer, and I've been able to achieve 50% - 75% accuracy in my limited testing so far. Hopefully this helps someone else.
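For completeness, the tanh swap could look something like this (a minimal sketch; written as plain functions here, whereas in my class they take self as the first argument):
import numpy as np

def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    # 1 - tanh(z)^2; if z is already the activated value a = tanh(z),
    # this simplifies to 1 - a**2
    return 1.0 - np.tanh(z) ** 2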