I'm currently building my 3-4-1 neural network from scratch using numpy (I avoided Keras and TensorFlow on purpose, to learn and to demonstrate my own understanding rather than let pre-built libraries do all the work). The problems I find when I run the program are:
1/ I get "nan" values in the "updated" weights after a certain number of iterations; lowering the learning rate only delays the problem and doesn't solve it.
2/ the prediction accuracy is very low.
I would like to know what causes these bugs in my program and would appreciate any help.
Here is the code:
# Import our dependencies
from numpy import exp, array, random, dot, ones_like, where
# Create our Artificial Neural Network class
class ArtificialNeuralNetwork():
# initializing the class
def __init__(self):
# generating the same synaptic weights every time the program runs
random.seed(1)
# synaptic weights (3 × 4 Matrix) of the hidden layer
self.w_ij = 2 * random.rand(3, 4) - 1
# synaptic weights (4 × 1 Matrix) of the output layer
self.w_jk = 2 * random.rand(4, 1) - 1
def LeakyReLU(self, x):
# The Leaky ReLU (short for Rectified Linear Unit) activation function will be applied to the inputs of the hidden layer
# The activation function will return the same value of x if x is positive
# while it will multiply the negative values of x by the alpha parameter
# we used in this example the Leaky ReLU instead of the standard ReLU activation function to avoid the dying ReLU problem
return where(x > 0, x, x * 0.01)
def LeakyReLUDerivative(self, x, α = 0.01):
# The Leaky ReLU Derivative will return 1 for every positive value in the x array
# while returning the value of the parameter alpha for every negative value
x[x > 0] = 1 # returns 1 for every positive value in the x array
x[x <= 0] = α # returns α for every negative value in the x array
return x
def Sigmoid(self, x):
# The Sigmoid activation function will turn every input value into probabilities between 0 and 1
# the probabilistic values help us assert which class x belongs to
return 1 / (1 + exp(-x))
def SigmoidDerivative(self, x):
# The derivative of the Sigmoid activation function will be used to calculate the gradient during the backpropagation process
# and help optimize the random starting synaptic weights
return x * (1 - x)
def train(self, x, y, learning_rate, iterations):
# x: training set of data
# y: the actual output of the training data
for i in range(iterations):
z_ij = dot(x, self.w_ij) # the dot product of the weights of the hidden layer and the inputs
a_ij = self.LeakyReLU(z_ij) # using the Leaky ReLU activation function to introduce non-linearity to our Neural Network
z_jk = dot(a_ij, self.w_jk) # the same precedent process will be applied to find the last input of the output layer
a_jk = self.Sigmoid(z_jk) # this time the Sigmoid activation function will be used instead of Leaky ReLU
dl_jk = -y/a_jk + (1 - y)/(1 - a_jk) # calculating the derivative of the cross entropy loss wrt output
da_jk = self.SigmoidDerivative(a_jk) # calculating the derivative of the Sigmoid activation function wrt the input (before activation) of the output layer
dz_jk = a_ij # calculating the derivative of the inputs of the hidden layer (before activation) wrt weights of the output layer
dl_ij = dot(da_jk * dl_jk, self.w_jk.T) # calculating the derivative of the cross entropy loss wrt activated input of the hidden layer
# to do so we multiply the derivative of the cross entropy loss wrt output by the derivative of the Sigmoid activation function wrt the input (before activation) of the output layer by the derivative of the inputs of the hidden layer (before activation) wrt weights of the output layer
da_ij = self.LeakyReLUDerivative(z_ij) # calculating the derivative of the Leaky ReLU activation function wrt the inputs of the hidden layer (before activation)
dz_ij = x # calculating the derivative of the inputs of the hidden layer (before activation) wrt weights of the hidden layer
# calculating the gradient using the chain rule
gradient_ij = dot(dz_ij.T , dl_ij * da_ij)
gradient_jk = dot(dz_jk.T , dl_jk * da_jk)
# calculating the new optimal weights
self.w_ij = self.w_ij - learning_rate * gradient_ij
self.w_jk = self.w_jk - learning_rate * gradient_jk
def predict(self, inputs):
# predicting the class of the input data after weights optimization
output_from_layer1 = self.LeakyReLU(dot(inputs, self.w_ij)) # the output of the hidden layer
output_from_layer2 = self.Sigmoid(dot(output_from_layer1, self.w_jk)) # the output of the output layer
return output_from_layer1, output_from_layer2
# the function will print the initial starting weights before training
def SynapticWeights(self):
print("Layer 1 (4 neurons, each with 3 inputs): ")
print("w_ij: ", self.w_ij)
print("Layer 2 (1 neuron, with 4 inputs): ")
print("w_jk: ", self.w_jk)
def main():
ANN = ArtificialNeuralNetwork()
ANN.SynapticWeights()
# the training inputs
x = array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]])
# the training outputs
y = array([[0, 1, 1, 1, 1, 0, 0]]).T
ANN.train(x, y, 1, 10000)
# Printing the new synaptic weights after training
print("New synaptic weights after training: ")
print("w_ij: ", ANN.w_ij)
print("w_jk: ", ANN.w_jk)
# Our prediction after feeding the ANN with new set of data
print("Considering new situation [1, 1, 0] -> ?: ")
print(ANN.predict(array([[1, 1, 0]])))
if __name__=="__main__":
main()
So, I changed a few things. (Disclaimer: I didn't check the correctness of the code)
Weight initialization: initialize to much smaller weights.
# synaptic weights (3 × 4 Matrix) of the hidden layer
self.w_ij = (2 * random.rand(3, 4) - 1)*0.1
# synaptic weights (4 × 1 Matrix) of the output layer
self.w_jk = (2 * random.rand(4, 1) - 1)*0.1
Weight initialization really matters.
I reduced the learning rate to 0.1.
ANN.train(x, y, .1, 500000)
With these changes, I see the neural network fitting your data perfectly and not producing NaN, even after 500,000 iterations.
print(ANN.predict(array([[0, 0, 1],
[0, 1, 1],
[1, 0, 1],
[0, 1, 0],
[1, 0, 0],
[1, 1, 1],
[0, 0, 0]])))
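One more safeguard worth adding here (my own note, not part of the answer above): for a Sigmoid output trained with cross-entropy, the product dl_jk * da_jk simplifies algebraically to a_jk - y, so the backward pass never has to evaluate -y/a_jk or (1 - y)/(1 - a_jk), which are exactly the terms that blow up to NaN when a_jk saturates near 0 or 1. A quick numeric check of the identity:
from numpy import array
# verify that dl_jk * da_jk == a_jk - y for a couple of values
a_jk = array([[0.2], [0.9]])                  # sigmoid outputs
y = array([[0.0], [1.0]])                     # targets
dl_jk = -y / a_jk + (1 - y) / (1 - a_jk)      # cross-entropy derivative wrt the output
da_jk = a_jk * (1 - a_jk)                     # sigmoid derivative written via the activation
print(dl_jk * da_jk)                          # [[ 0.2] [-0.1]]
print(a_jk - y)                               # [[ 0.2] [-0.1]] -> identical, with no division
So inside train() one could compute delta_jk = a_jk - y and build gradient_jk and dl_ij from that instead.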
I am trying to implement a NN from scratch in Python. It has 2 layers: an input layer and an output layer. The input layer will have 4 neurons and the output layer will have only a single node (plus biases). I have the following code, but I get the error message: ValueError: shapes (4,2) and (4,1) not aligned: 2 (dim 1) != 4 (dim 0). Can someone help me?
import numpy as np
# Step 1: Define input and output data
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[0, 1, 0, 1]])
# Step 2: Define the number of input neurons, hidden neurons, and output neurons
input_neurons = 4
output_neurons = 1
# Step 3: Define the weights and biases for the network
weights = np.random.rand(input_neurons, output_neurons)
biases = np.random.rand(output_neurons, 1)
# Step 4: Define the sigmoid activation function
def sigmoid(x):
return 1 / (1 + np.exp(-x))
# Step 5: Define the derivative of the sigmoid function
def sigmoid_derivative(x):
return sigmoid(x) * (1 - sigmoid(x))
# Step 6: Define the forward propagation function
def forward_propagation(X, weights, biases):
output = sigmoid(np.dot(X.T, weights) + biases)
return output
# Step 7: Define the backward propagation function
def backward_propagation(X, y, output, weights, biases):
error = output - y
derivative = sigmoid_derivative(output)
delta = error * derivative
weights_derivative = np.dot(X, delta.T)
biases_derivative = np.sum(delta, axis=1, keepdims=True)
return delta, weights_derivative, biases_derivative
# Step 8: Define the train function
def train(X, y, weights, biases, epochs, learning_rate):
for i in range(epochs):
output = forward_propagation(X, weights, biases)
delta, weights_derivative, biases_derivative = backward_propagation(X, y, output, weights, biases)
weights -= learning_rate * weights_derivative
biases -= learning_rate * biases_derivative
error = np.mean(np.abs(delta))
print("Epoch ", i, " error: ", error)
# Step 9: Train the network
epochs = 5000
learning_rate = 0.1
train(X, y, weights, biases, epochs, learning_rate)
You have an output layer with one neuron, so your output should be of one dimension.
You're assuming that the output has 4 dims:
y = np.array([[0, 1, 0, 1]])
Since you are giving two inputs (a pair of 4 dim inputs) like this,
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
You also need to give two outputs (each of one dimension), for example like this:
y = np.array([[0],[1]])
Hope this helps.
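For concreteness, here is a minimal sketch of shapes that make the forward pass work (my own illustration: I also transpose X so that np.dot(X.T, weights) in forward_propagation lines up, treating each of the two original rows as one 4-dimensional sample):
import numpy as np
# each *column* of X is one 4-dimensional sample
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]]).T                                # shape (4, 2)
y = np.array([[0], [1]])                                      # shape (2, 1): one label per sample
weights = np.random.rand(4, 1)
biases = np.random.rand(1, 1)
output = 1 / (1 + np.exp(-(np.dot(X.T, weights) + biases)))   # shape (2, 1)
print(output.shape, (output - y).shape)                       # (2, 1) (2, 1) -> the error term now lines up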
I am not a deep learning geek; I am learning to do this for my homework.
How can I make my neural network output a list of positive floats that sum to 1, while at the same time each element of the list stays smaller than a threshold (0.4, for example)?
I tried to add some hidden layers before the output layer, but that didn't improve the results.
Here is where I started from:
def build_net(inputs, predictor,scope,trainable):
with tf.name_scope(scope):
if predictor == 'CNN':
L=int(inputs.shape[2])
N = int(inputs.shape[3])
conv1_W = tf.Variable(tf.truncated_normal([1,L,N,32], stddev=0.15), trainable=trainable)
layer = tf.nn.conv2d(inputs, filter=conv1_W, padding='VALID', strides=[1, 1, 1, 1])
norm1 = tf.layers.batch_normalization(layer)
x = tf.nn.relu(norm1)
conv3_W = tf.Variable(tf.random_normal([1, 1, 32, 1], stddev=0.15), trainable=trainable)
conv3 = tf.nn.conv2d(x, filter=conv3_W, strides=[1, 1, 1, 1], padding='VALID')
norm3 = tf.layers.batch_normalization(conv3)
net = tf.nn.relu(norm3)
net=tf.layers.flatten(net)
return net
x=build_net(inputs,predictor,scope,trainable=trainable)
y=tf.placeholder(tf.float32,shape=[None]+[self.M])
network = tf.add(x,y)
w_init=tf.random_uniform_initializer(-0.0005,0.0005)
outputs=tf.layers.dense(network,self.M,activation=tf.nn.softmax,kernel_initializer=w_init)
I am expecting that the outputs would still sum to 1, but with each element smaller than the specific threshold I set.
Thank you so much in advance for your precious help guys.
What you want to do is add a penalty for the case where any of the outputs is larger than some specified thresh; you can do this with the max function:
thresh = 0.4
strength = 10.0
reg_output = strength * tf.reduce_sum(tf.math.maximum(0.0, outputs - thresh), axis=-1)
Then you need to add reg_output to your loss so it is optimized together with the rest of the loss. strength is a tunable parameter that defines how strong the penalty is for going over the threshold, and you have to tune it to your needs.
This penalty works by summing max(0, output - thresh) over the last dimension, which activates the penalty whenever an output is bigger than thresh. If it is smaller, the penalty is zero and does nothing.
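To see the arithmetic concretely, here is the same penalty in plain numpy (a sketch with made-up output values, just to show which rows get penalized):
import numpy as np
thresh = 0.4
strength = 10.0
outputs = np.array([[0.55, 0.25, 0.20],    # first row has an element above the threshold
                    [0.30, 0.35, 0.35]])   # second row does not
reg = strength * np.sum(np.maximum(0.0, outputs - thresh), axis=-1)
print(reg)                                 # [1.5 0. ] -> only the violating row adds to the loss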
I have a small, 3-layer neural network with two input neurons, two hidden neurons and one output neuron. I am trying to stick to the format below of using only 2 hidden neurons.
I am trying to show how this can be used to behave as the XOR logic gate; however, with just two hidden neurons I get the following poor output after 1,000,000 iterations:
Input: 0 0 Output: [0.01039096]
Input: 1 0 Output: [0.93708829]
Input: 0 1 Output: [0.93599738]
Input: 1 1 Output: [0.51917667]
If I use three hidden neurons I get a much better output with 100,000 iterations:
Input: 0 0 Output: [0.01831612]
Input: 1 0 Output: [0.98558057]
Input: 0 1 Output: [0.98567602]
Input: 1 1 Output: [0.02007876]
I am getting a decent output with 3 neurons in the hidden layer but not with two. Why?
As per a comment below, this repo contains code that solves the XOR problem using only two hidden neurons.
I can't figure out what I am doing wrong. Any suggestions are appreciated!
Attached is my code:
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
# Sigmoid function
def sigmoid(x, deriv=False):
if deriv:
return x * (1 - x)
return 1 / (1 + np.exp(-x))
alpha = [0.7]
# Input dataset
X = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
# Output dataset
y = np.array([[0, 1, 1, 0]]).T
# seed random numbers to make calculation deterministic
np.random.seed(1)
# initialise weights randomly with mean 0
syn0 = 2 * np.random.random((2, 3)) - 1 # 1st layer of weights synapse 0 connecting L0 to L1
syn1 = 2 * np.random.random((3, 1)) - 1 # 2nd layer of weights synapse 0 connecting L1 to L2
# Randomize inputs for stochastic gradient descent
data = np.hstack((X, y)) # append Input and output dataset
np.random.shuffle(data) # shuffle
x, y = np.array_split(data, 2, 1) # Split along vertical(1) axis
for iter in range(100000):
for i in range(4):
# forward prop
layer0 = x[i] # Input layer
layer1 = sigmoid(np.dot(layer0, syn0)) # Prediction step for layer 1
layer2 = sigmoid(np.dot(layer1, syn1)) # Prediction step for layer 2
layer2_error = y[i] - layer2 # Compare how well layer2's guess was with input
layer2_delta = layer2_error * sigmoid(layer2, deriv=True) # Error weighted derivative step
if iter % 10000 == 0:
print("Error: ", str(np.mean(np.abs(layer2_error))))
plt.plot(iter, layer2_error, 'ro')
# Uses "confidence weighted error" from l2 to establish an error for l1
layer1_error = layer2_delta.dot(syn1.T)
layer1_delta = layer1_error * sigmoid(layer1, deriv=True) # Error weighted derivative step
# Since SGD we need to dot product two 1D arrays. This is how.
syn1 += (alpha * np.dot(layer1[:, None], layer2_delta[None, :])) # Update weights
syn0 += (alpha * np.dot(layer0[:, None], layer1_delta[None, :]))
# Training was done above, below we re run to test algorithm
layer0 = X # Input layer
layer1 = sigmoid(np.dot(layer0, syn0)) # Prediction step for layer 1
layer2 = sigmoid(np.dot(layer1, syn1)) # Prediction step for layer 2
plt.show()
print("output after training: \n")
print("Input: 0 0 \t Output: ", layer2[0])
print("Input: 1 0 \t Output: ", layer2[1])
print("Input: 0 1 \t Output: ", layer2[2])
print("Input: 1 1 \t Output: ", layer2[3])
This is due to the fact that you have not included any bias for the neurons.
You have only used weights to try to fit the XOR model.
In the case of 2 neurons in the hidden layer, the network under-fits because it can't compensate for the missing bias.
When you use 3 neurons in the hidden layer, the extra neuron counters the effect caused by the lack of bias.
This is an example of a network for the XOR gate. You'll notice theta (the bias) added to the hidden layers. This gives the network an additional parameter to tweak.
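To make the bias point concrete, here is a hand-wired 2-2-1 sigmoid network (the weights and biases are my own illustrative choice, not taken from the linked example) in which the bias terms are exactly what lets two hidden neurons compute XOR:
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# hidden layer: neuron 1 behaves like OR, neuron 2 like AND (thanks to the biases)
W1 = np.array([[20.0, 20.0],
               [20.0, 20.0]])
b1 = np.array([-10.0, -30.0])
# output layer: roughly OR AND (NOT AND), i.e. XOR
W2 = np.array([[20.0], [-20.0]])
b2 = np.array([-10.0])
hidden = sigmoid(X @ W1 + b1)
output = sigmoid(hidden @ W2 + b2)
print(np.round(output, 3))                 # approximately [[0.], [1.], [1.], [0.]]
Without b1 and b2 there is no way to shift the two hidden activations like this, which is the under-fitting you observed.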
It is an unsolvable equation system, which is why the NN cannot solve it either.
While it may be an oversimplification, if we say the transfer function is linear, the expression becomes something like
z = (w1*x+w2*y)*w3 + (w4*x+w5*y)*w6
Then there are the 4 cases:
xy=00: z must be 0, and the expression gives 0
xy=10: z must be 1, so w1*w3 + w4*w6 = 1
xy=01: z must be 1, so w2*w3 + w5*w6 = 1
xy=11: z must be 0, so (w1+w2)*w3 + (w4+w5)*w6 = 0
The problem is that
0 = (w1+w2)*w3 + (w4+w5)*w6 = (w1*w3 + w4*w6) + (w2*w3 + w5*w6)   <-- the xy=11 case, regrouped
but the xy=10 and xy=01 cases force (w1*w3 + w4*w6) + (w2*w3 + w5*w6) = 1 + 1 = 2, a contradiction.
So the seemingly 6 degrees of freedom are just not enough here, which is why you experience the need to add something extra.
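A quick numeric check of that argument (my own illustration): writing a = w1*w3 + w4*w6 and b = w2*w3 + w5*w6, the three non-trivial cases require a = 1, b = 1 and a + b = 0, and a least-squares solve confirms there is no exact solution:
import numpy as np
# rows encode the constraints a = 1, b = 1, a + b = 0 for the unknowns (a, b)
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
t = np.array([1.0, 1.0, 0.0])
sol, residual, rank, _ = np.linalg.lstsq(A, t, rcond=None)
print(sol, residual)                       # best fit (1/3, 1/3) with a non-zero residual: no exact solution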
I've implemented the following neural network to solve the XOR problem in Python. My neural network consists of an input layer of 2 neurons, 1 hidden layer of 2 neurons and an output layer of 1 neuron. I am using the Sigmoid function as the activation function for both the hidden layer and the output layer. Can someone please explain what I have done wrong?
import numpy
import scipy.special
class NeuralNetwork:
def __init__(self, inputNodes, hiddenNodes, outputNodes, learningRate):
self.iNodes = inputNodes
self.hNodes = hiddenNodes
self.oNodes = outputNodes
self.wIH = numpy.random.normal(0.0, pow(self.iNodes, -0.5), (self.hNodes, self.iNodes))
self.wOH = numpy.random.normal(0.0, pow(self.hNodes, -0.5), (self.oNodes, self.hNodes))
self.lr = learningRate
self.activationFunction = lambda x: scipy.special.expit(x)
def train(self, inputList, targetList):
inputs = numpy.array(inputList, ndmin=2).T
targets = numpy.array(targetList, ndmin=2).T
#print(inputs, targets)
hiddenInputs = numpy.dot(self.wIH, inputs)
hiddenOutputs = self.activationFunction(hiddenInputs)
finalInputs = numpy.dot(self.wOH, hiddenOutputs)
finalOutputs = self.activationFunction(finalInputs)
outputErrors = targets - finalOutputs
hiddenErrors = numpy.dot(self.wOH.T, outputErrors)
self.wOH += self.lr * numpy.dot((outputErrors * finalOutputs * (1.0 - finalOutputs)), numpy.transpose(hiddenOutputs))
self.wIH += self.lr * numpy.dot((hiddenErrors * hiddenOutputs * (1.0 - hiddenOutputs)), numpy.transpose(inputs))
def query(self, inputList):
inputs = numpy.array(inputList, ndmin=2).T
hiddenInputs = numpy.dot(self.wIH, inputs)
hiddenOutputs = self.activationFunction(hiddenInputs)
finalInputs = numpy.dot(self.wOH, hiddenOutputs)
finalOutputs = self.activationFunction(finalInputs)
return finalOutputs
nn = NeuralNetwork(2, 2, 1, 0.01)
data = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
epochs = 10
for e in range(epochs):
for record in data:
inputs = numpy.asfarray(record[1:])
targets = record[0]
#print(targets)
#print(inputs, targets)
nn.train(inputs, targets)
print(nn.query([0, 0]))
print(nn.query([1, 0]))
print(nn.query([0, 1]))
print(nn.query([1, 1]))
Several reasons.
I don't think you should be taking the activation function of everything, especially in your query function. I think you have muddled up the idea of neuron-to-neuron weightings (wIH and wOH) with the activation values.
Because of this muddle you have missed the idea of re-using your query function as part of your training. You should think of it as: feed the activation levels forward to the output, compare the result with the target output to get an array of errors, and then feed those errors backwards, using the derivative of the sigmoid function, to adjust the weightings.
I would define the function and its derivative directly rather than importing them from scipy, as they are so simple. Also, "it's recommended" to use tanh and its derivative for the hidden layer (I can't remember why; it's probably not needed for this simple net):
# transfer functions
def sigmoid(x):
return 1 / (1 + np.exp(-x))
# derivative of sigmoid
def dsigmoid(y):
return y * (1.0 - y)
# using tanh over logistic sigmoid for the hidden layer is recommended
def tanh(x):
return np.tanh(x)
# derivative for tanh sigmoid
def dtanh(y):
return 1 - y*y
Finally, you might be able to figure out what I did a while ago with a neural net using just numpy here https://github.com/paddywwoof/Machine-Learning/blob/master/perceptron.py
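As a rough sketch of that "re-use the forward pass in training" idea (my own restructuring of the question's class, keeping the same weight shapes; it would still need many more epochs, a larger learning rate and ideally biases to learn XOR):
import numpy as np
class NeuralNetwork:
    def __init__(self, inputNodes, hiddenNodes, outputNodes, learningRate):
        self.wIH = np.random.normal(0.0, inputNodes ** -0.5, (hiddenNodes, inputNodes))
        self.wOH = np.random.normal(0.0, hiddenNodes ** -0.5, (outputNodes, hiddenNodes))
        self.lr = learningRate
    def _forward(self, inputs):
        # single forward pass shared by query() and train()
        hiddenOutputs = 1 / (1 + np.exp(-np.dot(self.wIH, inputs)))
        finalOutputs = 1 / (1 + np.exp(-np.dot(self.wOH, hiddenOutputs)))
        return hiddenOutputs, finalOutputs
    def query(self, inputList):
        inputs = np.array(inputList, ndmin=2).T
        return self._forward(inputs)[1]
    def train(self, inputList, targetList):
        inputs = np.array(inputList, ndmin=2).T
        targets = np.array(targetList, ndmin=2).T
        hiddenOutputs, finalOutputs = self._forward(inputs)          # same code path as query()
        outputErrors = targets - finalOutputs
        hiddenErrors = np.dot(self.wOH.T, outputErrors)
        # weight updates use the sigmoid derivative expressed through the activations
        self.wOH += self.lr * np.dot(outputErrors * finalOutputs * (1 - finalOutputs), hiddenOutputs.T)
        self.wIH += self.lr * np.dot(hiddenErrors * hiddenOutputs * (1 - hiddenOutputs), inputs.T)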
I understand how a neural network with backpropagation is supposed to work. I know how to use sklearn's MLPClassifier and its fit function. I am creating my own because I'd like to know the details better. I will first show my code (with comments) and then discuss my problems.
import numpy as np
import scipy as sp
import sklearn as ML
# z: the linear combination of the previous layer
#
# returns the activation for the node
#
def sigmoid(z):
a = 1 / (1 + np.exp(-z))
return a
# z: the contribution of a layer
#
# returns the derivative of the sigmoid evaluated at z
#
def sig_grad(z):
d = (1 - sigmoid(z))*sigmoid(z)
return d
# input: the data we want to train the network with
# hidden_layers: the number of nodes in the hidden layers
# num_layers: how many hidden layers between the input layer and the output layer
# num_output: how many outputs there are... this becomes relevant when we input many features.
#
# returns the activations determined
# and the linear combinations of previous layer's nodes for each layer
#
def feedforward(input, hidden_layers, num_layers, num_output, thresh, weights):
#initialize the vector for inputs AND threshold values
X = np.hstack([thresh[0], input])
#intialize the activations list
A = []
#intialize the linear combos for each layer
Z = []
w = list(weights)
#place ones in the first row of each layer of weights for the threshold
w[0] = np.vstack([np.ones([1,hidden_layers]), w[0]])
for i in range(1,num_layers):
w[i] = np.vstack([np.ones([1,hidden_layers]), weights[i]])
w[-1] = np.vstack([np.ones([1,num_output]), w[-1]])
#the first layer of weights are initialized outside function
#cycle through the hidden layers
for i in range(1, num_layers+1):
Z.append( np.dot(X, w[i-1])); S = sigmoid(Z[i-1]); A.append(S); X = np.hstack([thresh[i], A[i-1]])
#find the output/last layer activations
Z.append( np.dot(X, w[-1]) ); S = sigmoid(Z[-1]); A.append(S);
return A, Z
#
# truth: what we know the output should be
# activations: the activations determined at each node by the sigmoid
# function in the previous feedforward pass
# combos: the linear combinations at each layer in the prev. ff pass
# num_layers: the number of hidden layers
#
# error: the errors determined at each layer; will be needed for gradient descent
#
def backprop(input, truth, activations, combos, num_layers, weights):
#initialize an array of errors for each hidden layer and the output layer
error = [0 for x in range(0,num_layers+1)]
#intialize the lists containing the gradients w.r.t. weights and threshold
derivW = []; derivb = []
#set the output layer since its error is computed differently than the others
error[num_layers] = (activations[num_layers] - truth)*sig_grad(combos[num_layers])
#find the rate of change for weights and thresh for connections to output
derivW.append( activations[num_layers-1]*error[num_layers]); derivb.append(np.sum(error[num_layers]))
if(num_layers > 1):
#find the errors for each of the hidden layers
for i in range(num_layers - 1, 0, -1):
error[i] = np.dot(weights[i+1],error[i+1])*sig_grad(combos[i])
derivW.append( np.outer(activations[i-1], error[i]) ); derivb.append(np.sum(error[i]))
#
#finding the derivative for weights of input to next layer
#
error[0] = np.dot(weights[i],error[i])*sig_grad(combos[0])
derivW.append( np.outer(input, error[0]) ); derivb.append(np.sum(error[0]))
return derivW, derivb
#
# weights: our networks weights to update via gradient descent
# thresh: the threshold values to update for our system
# derivb: the derivative of our cost function with respect to b for each layer
# derivW: the derivative of our cost function with respect to W for each layer
# stepsize: the stepsize we want to take, determines how big of a step we take
#
# returns the updated weights and threshold values for our network
def gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers):
#perform gradient descent
for j in range(100):
for i in range(0, num_layers + 1):
weights[i] = weights[i] - stepsize*derivW[num_layers-i]
thresh[i] = thresh[i] - stepsize*derivb[num_layers-i]
return weights, thresh
#input: the data to send through the network
#hidden_layers: the number of hidden_layers between the input layer and the output layer
#num_layers: the number of nodes in the hidden layer
#num_output: the number of nodes in the output layer
#
#returns the output of the network
#
def nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter, stepsize):
#assuming that input is an array where each element is an input/sample
#we also need to know the size of each sample itself
m = input.size
thresh = np.random.randn(num_layers + 1, 1)
thresh_weights = np.ones([num_layers + 1, 1])
# initialize the weights as a list because each layer might have
# a different number of weights
weights = []; weights.append(np.random.randn(m,hidden_layers));
if( num_layers > 1):
for i in range(1, num_layers):
weights.append(np.random.randn(hidden_layers, hidden_layers))
weights.append(np.random.randn(hidden_layers, num_output))
for i in range(maxiter):
activations, combos = feedforward(input, hidden_layers, num_layers, num_output, thresh, weights)
derivW, derivb = backprop(input, truth, activations, combos, num_layers, weights)
weights, thresh = gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers)
return weights, thresh
def main():
# a very, very simple neural network
input = np.array([1,0,0])
truth = 0
hidden_layers = 3
num_layers = 2
num_output = 1
#train the network
w, t = nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter = 10, stepsize = 0.001)
#test the network on a new set of arguments
#activations, combos = feedforward(new_input, hidden_layers = 3, num_layers = 2, thresh = t, weights = w)
main()
I've tested this code on simple examples where there are n inputs of one dimension each and an output of n dimensions (I'm not yet able to work out the bugs I get when I type import NN.py into the console, but it works when I run it piece by piece in the console). I have a few questions to help me better understand what is going on when the n inputs have m dimensions each. For example, the digits data in Python (there are 1797 samples and each sample is 64x1 -- an 8x8 image vectorized).
1) Is each of the 64 pixels considered an input? If so, is the neural net trained one image at a time? This would be an easy fix for me.
2) If the neural net is trained on all images at once, what are suggestions for modifying my code?
3) Obviously the output for an image comes in the form of 0, 1, 2, 3, ..., or 9. But does the output come in the form of a 10x1 vector, where there is a 1 at the position of the digit the image represents and 0's elsewhere? So my prediction vector would have its highest value at the position where the 1 should be, right?
4) Then, I'm not quite sure how #3 would look if #2 is true.
I apologize for the long note. Thanks for taking a look and helping me understand better!
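For what it's worth on 3): yes, for the digits data the target is normally encoded as a length-10 one-hot vector and the predicted digit is read off with argmax. A small sketch of that encoding and decoding (my own illustration, separate from the code above):
import numpy as np
labels = np.array([3, 0, 9])                  # example digit labels
one_hot = np.zeros((labels.size, 10))         # one row of length 10 per sample
one_hot[np.arange(labels.size), labels] = 1   # put a 1 at the position of each digit
print(one_hot[0])                             # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
predictions = np.array([[0.05, 0.01, 0.02, 0.80, 0.02, 0.03, 0.02, 0.02, 0.02, 0.01]])
print(np.argmax(predictions, axis=1))         # [3] -> the predicted digit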