Python - Neural Network Approximating Sphere Function

I am new to neural networks and I am using an example neural network I found online to attempt to approximate the sphere function (the sum of the squares of a set of numbers) using backpropagation.
The initial code is:
from numpy import exp, array, random, dot

class NeuralNetwork():
    def __init__(self):
        # Seed the random number generator, so it generates the same numbers
        # every time the program is run.
        # random.seed(1)

        # Model a single neuron, with 2 input connections and 1 output connection.
        # We assign random weights to a 2 x 1 matrix, with values in the range -1 to 1
        # and mean 0.
        self.synaptic_weights = 2 * random.random((2, 1)) - 1

    # The Sigmoid function, which describes an S shaped curve.
    # We pass the weighted sum of the inputs through this function to
    # normalise them between 0 and 1.
    def __sigmoid(self, x):
        return 1 / (1 + exp(-x))

    # The derivative of the sigmoid function.
    # This is the gradient of the sigmoid curve.
    # It indicates how confident we are about the existing weight.
    def __sigmoid_derivative(self, x):
        return x * (1 - x)

    # Train the network through a process of trial and error,
    # adjusting the synaptic weights each time.
    def train(self, training_set_inputs, training_set_outputs, number_of_training_iterations):
        for iteration in xrange(number_of_training_iterations):
            # Pass the training set through our neural network (a single neuron).
            output = self.think(training_set_inputs)
            # Calculate the error (difference between the desired output and predicted output).
            error = training_set_outputs - output
            # Multiply the error by the input and again by the gradient of the Sigmoid curve.
            # This means less confident weights are adjusted more.
            # This means inputs, which are zero, do not cause changes to the weights.
            adjustment = dot(training_set_inputs.T, error * self.__sigmoid_derivative(output))
            # Adjust the weights.
            self.synaptic_weights += adjustment

    # The neural network thinks.
    def think(self, inputs):
        # Pass inputs through our neural network (our single neuron).
        return self.__sigmoid(dot(inputs, self.synaptic_weights))

if __name__ == "__main__":
    # Initialise a single neuron neural network.
    neural_network = NeuralNetwork()

    print "Random starting synaptic weights: "
    print neural_network.synaptic_weights

    # The training set. We have 3 examples, each consisting of 2 input values and 1 output value.
    training_set_inputs = array([[0, 1], [1, 0], [0, 0]])
    training_set_outputs = array([[1, 1, 0]]).T

    # Train the neural network using a training set.
    # Do it 10,000 times and make small adjustments each time.
    neural_network.train(training_set_inputs, training_set_outputs, 10000)

    print "New synaptic weights after training: "
    print neural_network.synaptic_weights

    # Test the neural network with a new situation.
    print "Considering new situation [1, 1] -> ?: "
    print neural_network.think(array([1, 1]))
My aim is to feed training data (sphere function inputs and outputs) into the neural network to train it and meaningfully adjust the weights. After enough training the weights should reach a point where the network gives reasonably accurate results for the training inputs.
I imagine an example of some training sets for the sphere function would be something like:
training_set_inputs = array([[2,1], [3,2], [4,6], [8,3]])
training_set_outputs = array([[5, 13, 52, 73]])
The example I found online can successfully approximate the XOR operation, but when given sphere function inputs it only gives me an output of 1 when tested on a new example (for example, [6,7], which should ideally return an approximation of around 85).
From what I have read about neural networks, I suspect this is because I need to normalise the inputs, but I am not entirely sure how to do this. Any help on this, or something to point me in the right direction, would be appreciated a lot. Thank you.
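One hedged sketch of the normalisation idea (my own illustration, not part of the original question): a single sigmoid neuron can only output values in (0, 1), so targets such as 85 can never be produced. Scaling inputs and outputs into [0, 1] by an assumed maximum value at least lets the network train, and predictions are scaled back afterwards. The MAX_INPUT bound below is an assumption for illustration.

# A minimal normalisation sketch, assuming inputs stay below MAX_INPUT.
from numpy import array

MAX_INPUT = 10.0                    # assumed upper bound on each input component
MAX_OUTPUT = 2 * MAX_INPUT ** 2     # sphere function maximum for 2 inputs in [0, MAX_INPUT]

training_set_inputs = array([[2, 1], [3, 2], [4, 6], [8, 3]]) / MAX_INPUT
training_set_outputs = array([[5, 13, 52, 73]]).T / MAX_OUTPUT

# After training, undo the scaling on a prediction (assumes the NeuralNetwork class above):
# prediction = neural_network.think(array([6, 7]) / MAX_INPUT) * MAX_OUTPUT

Even with scaling, a single neuron is a linear model passed through a sigmoid, so it can only roughly fit the sphere function; a hidden layer would be needed for a good approximation.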

Related

Can you reverse a PyTorch neural network and activate the inputs from the outputs?

Can we activate the outputs of a NN to gain insight into how the neurons are connected to input features?
If I take a basic NN example from the PyTorch tutorials, here is an example of training a network on (x, y) pairs.
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)

    model.zero_grad()
    loss.backward()

    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
After I've finished training the network to predict y from x inputs, is it possible to reverse the trained NN so that it can predict x from y inputs?
I don't expect the recovered x to exactly match the original inputs that produced the y outputs. I just expect to see what features the model activates on to relate x and y.
If it is possible, how do I rearrange the Sequential model without breaking all the weights and connections?
It is possible but only for very special cases. For a feed-forward network (Sequential) each of the layers needs to be reversible; that means the following arguments apply to each layer separately. The transformation associated with one layer is y = activation(W*x + b) where W is the weight matrix and b the bias vector. In order to solve for x we need to perform the following steps:
Invert the activation; not all activation functions have an inverse though. For example, the ReLU function does not have an inverse on (-inf, 0). If we use tanh, on the other hand, we can use its inverse, which is 0.5 * log((1 + x) / (1 - x)).
Solve W*x = inverse_activation(y) - b for x; for a unique solution to exist, W must be square (the same number of rows and columns) and det(W) must be non-zero. We can control the former by choosing a specific network architecture, while the latter depends on the training process.
So for a neural network to be reversible it must have a very specific architecture: all layers must have the same number of input and output neurons (i.e. square weight matrices) and the activation functions all need to be invertible.
Code: Using PyTorch we will have to do the inversion of the network manually, both in terms of solving the system of linear equations as well as finding the inverse activation function. Consider the following example of a 1-layer neural network (since the steps apply to each layer separately extending this to more than 1 layer is trivial):
import torch

N = 10  # number of samples
n = 3   # number of neurons per layer

x = torch.randn(N, n)
model = torch.nn.Sequential(
    torch.nn.Linear(n, n), torch.nn.Tanh()
)
y = model(x)

z = y  # use 'z' for the reverse result, start with the model's output 'y'.
for step in list(model.children())[::-1]:
    if isinstance(step, torch.nn.Linear):
        z = z - step.bias[None, ...]
        z = z[..., None]  # 'torch.solve' requires N column vectors (i.e. shape (N, n, 1)).
        z = torch.solve(z, step.weight)[0]
        z = torch.squeeze(z)  # remove the extra dimension that we've added for 'torch.solve'.
    elif isinstance(step, torch.nn.Tanh):
        z = 0.5 * torch.log((1 + z) / (1 - z))

print('Agreement between x and z: ', torch.dist(x, z))
If I've understood correctly, there are two questions here:
Is it possible to determine what features in the input have activated neurons?
If so, is it possible to use this information to generate samples from p(x|y)?
Regarding 1, a basic way to determine if a neuron is sensitive to an input feature x_i is to compute the gradient of this neuron's output w.r.t x_i. A high gradient will indicate sensitivity to a particular input element. There is a rich literature on the subject, for example, you can have a look at guided backpropagation or at GradCam (the latter is about classification with convnets, but it does contain useful ideas).
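As a minimal sketch of that gradient-sensitivity idea (my own illustration, assuming the Sequential model and x from the question's code above): pick one output neuron, backpropagate to the input, and inspect the magnitude of the gradient per input feature.

# A minimal gradient-sensitivity sketch, assuming 'model' and 'x' from the question above.
import torch

x_sample = x[0:1].clone().requires_grad_(True)   # one input sample, tracked for gradients
output = model(x_sample)

neuron_index = 0                  # hypothetical choice of output neuron to inspect
output[0, neuron_index].backward()

sensitivity = x_sample.grad[0]    # d(output_neuron)/d(x_i) for every input feature i
top_features = torch.topk(sensitivity.abs(), k=5).indices
print('Most influential input features:', top_features)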
As for 2, I don't think that your approach to "reversing the problem" is correct. The problem is that your network is discriminative, and what it outputs can be seen as argmax_y p(y|x). Note that this is a point-wise estimate, not a full modelling of the distribution. However, the inverse problem that you're interested in seems to be sampling from
p(x|y) = constant * p(y|x) * p(x).
You don't know how to sample from p(y|x) and you don't know anything about p(x). Even if you use a method to discover correlations between the neurons and specific input features, you have only discovered which features were more important to the network's prediction, and depending on the nature of y this might be insufficient. Consider a toy example where your inputs x are 2d points distributed according to some distribution in R^2 and where the output y is binary, such that any (a,b) in R^2 is classified as 1 if a<1 and as 0 if a>1. Then a discriminative network could learn the vertical line a=1 as its decision boundary. Inspecting correlations between neurons and input features would reveal that only the first coordinate was useful for this prediction, but that information is not sufficient for sampling from the full 2d distribution of inputs.
I think that Variational autoencoders could be what you're looking for.

Siamese network, lower part uses a dense layer instead of a euclidean distance layer

This is a rather interesting question about Siamese networks.
I am following the example from https://keras.io/examples/mnist_siamese/.
My modified version of the code is in this google colab
The siamese network takes in 2 inputs (2 handwritten digits) and output whether they are of the same digit (1) or not (0).
Each of the two inputs is first processed by a shared base_network (3 Dense layers with 2 Dropout layers in between). input_a is extracted into processed_a, and input_b into processed_b.
The last layer of the Siamese network is a Euclidean distance layer between the two extracted tensors:
distance = Lambda(euclidean_distance,
                  output_shape=eucl_dist_output_shape)([processed_a, processed_b])

model = Model([input_a, input_b], distance)
I understand the reasoning behind using a Euclidean distance layer for the lower part of the network: if the features are extracted nicely, then similar inputs should have similar features.
I am thinking, why not use a normal Dense layer for the lower part, as:
# distance = Lambda(euclidean_distance,
#                   output_shape=eucl_dist_output_shape)([processed_a, processed_b])
# model = Model([input_a, input_b], distance)

# my model
subtracted = Subtract()([processed_a, processed_b])
out = Dense(1, activation="sigmoid")(subtracted)
model = Model([input_a, input_b], out)
My reasoning is that if the extracted features are similar, then the Subtract layer should produce a small tensor, the difference between the extracted features. The next layer, a Dense layer, can learn to output 1 if its input is small and 0 otherwise.
Because the Euclidean distance layer outputs a value close to 0 when the two inputs are similar and closer to 1 otherwise, I also need to invert the accuracy and loss functions:
# the version of loss and accuracy for Euclidean distance layer
# def contrastive_loss(y_true, y_pred):
#     '''Contrastive loss from Hadsell-et-al.'06
#     http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
#     '''
#     margin = 1
#     square_pred = K.square(y_pred)
#     margin_square = K.square(K.maximum(margin - y_pred, 0))
#     return K.mean(y_true * square_pred + (1 - y_true) * margin_square)

# def compute_accuracy(y_true, y_pred):
#     '''Compute classification accuracy with a fixed threshold on distances.
#     '''
#     pred = y_pred.ravel() < 0.5
#     return np.mean(pred == y_true)

# def accuracy(y_true, y_pred):
#     '''Compute classification accuracy with a fixed threshold on distances.
#     '''
#     return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))

### my version, loss and accuracy
def contrastive_loss(y_true, y_pred):
    margin = 1
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    # return K.mean(y_true * square_pred + (1 - y_true) * margin_square)
    return K.mean(y_true * margin_square + (1 - y_true) * square_pred)

def compute_accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    pred = y_pred.ravel() > 0.5
    return np.mean(pred == y_true)

def accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    return K.mean(K.equal(y_true, K.cast(y_pred > 0.5, y_true.dtype)))
The accuracy for the old model:
* Accuracy on training set: 99.55%
* Accuracy on test set: 97.42%
This slight change leads to a model that does not learn anything:
* Accuracy on training set: 48.64%
* Accuracy on test set: 48.29%
So my questions are:
1. What is wrong with my reasoning of using Subtract + Dense for the lower part of the Siamese network?
2. Can we fix this? I have two potential solutions in mind but I am not confident: (1) a convolutional neural net for feature extraction, (2) more Dense layers for the lower part of the Siamese network.
In the case of two similar examples, after subtracting the two n-dimensional feature vectors (extracted using the common/base feature extraction model), you will get a value of zero or near zero in most positions of the resulting n-dimensional vector, which the next/output Dense layer works on. On the other hand, we all know that in an ANN the weights are learnt in such a way that less important features produce very small responses and prominent/interesting features contributing to the decision produce high responses. Now you can see that our subtracted feature vector points in just the opposite direction: when two examples are from different classes they produce high responses, and the opposite for examples from the same class. Furthermore, with a single node in the output layer (and no additional hidden layer before it), it is quite difficult for the model to learn to generate a high response from near-zero values when the two samples are of the same class. This might be an important point for solving your problem.
Based on the above discussion, you may want to try the following ideas (a minimal sketch follows below):
Transform the subtracted feature vector so that similarity produces high responses, perhaps by subtracting it from 1 or taking the reciprocal (multiplicative inverse), followed by normalization.
Add more Dense layers before the output layer.
I won't be surprised if a convolutional neural net instead of stacked Dense layers for feature extraction (as you are thinking) does not improve your accuracy much, as it is just another way of doing the same thing (feature extraction).
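Here is a minimal Keras-style sketch of those two ideas combined (my own illustration, not from the original answer), assuming processed_a, processed_b, input_a and input_b from the question's code. The absolute difference is used as a similarity-friendly transform, and one extra Dense layer is added before the output:

# A minimal sketch, assuming processed_a/processed_b and input_a/input_b exist as in the question.
from keras.layers import Lambda, Dense
from keras.models import Model
import keras.backend as K

# element-wise |a - b|; small values mean the two extracted feature vectors are similar
abs_diff = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]))([processed_a, processed_b])

hidden = Dense(64, activation="relu")(abs_diff)   # extra layer; 64 units is an arbitrary choice
out = Dense(1, activation="sigmoid")(hidden)      # 1 = same digit, 0 = different

model = Model([input_a, input_b], out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

Since the final Dense output here is already a probability of "same", this sketch uses binary cross-entropy rather than the contrastive loss from the original example.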

Basic neural network, weights too high

I am trying to code a very basic neural network in python, with 3 input nodes with a value of 0 or 1 and a single output node, with a value of 0 or 1. The output should be almost equal to the second input, but after training, the weights are way way too high, and the network almost always guesses 1.
I am using python 3.7 with numpy and scipy. I have tried changing the training set, the new instance, and the random seed
import random as rand
import numpy as np
from scipy.special import expit as ex

rand.seed(10)
training_set = [[0,1,0],[1,0,1],[0,0,0],[1,1,1]]  # The training sets and their outputs
training_outputs = [0,1,0,1]
weightlst = [rand.uniform(-1,1), rand.uniform(-1,1), rand.uniform(-1,1)]  # Weights are randomly set with a value between -1 and 1
print('Random weights\n'+str(weightlst))

def calcout(inputs, weights):  # Calculate the expected output with given inputs and weights
    output = 0.5
    for i in range(len(inputs)):
        output = output + (inputs[i]*weights[i])
    # print('\nmy output is ' + str(ex(output)))
    return ex(output)  # Return the output on a sigmoid curve between 0 and 1

def adj(expected_output, training_output, weights, inputs):  # Adjust the weights based on the expected output, true (training) output and the weights
    adjweights = []
    error = expected_output - training_output
    for i in weights:
        adjweights.append(i + (error*(expected_output*(1-expected_output))))
    return adjweights

# Train the network, adjusting weights each time
training_iterations = 10000
for k in range(training_iterations):
    for l in range(len(training_set)):
        expected = calcout(training_set[l], weightlst)
        weightlst = adj(expected, training_outputs[l], weightlst, training_set[l])

new_instance = [1,0,0]  # Calculate and return the expected output of a new instance
print('Adjusted weights\n'+str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance, weightlst)))
The expected output would be 0, or something very close to it, but no matter what I set new_instance to, the output is still:
Random weights
[-0.7312715117751976, 0.6948674738744653, 0.5275492379532281]
Adjusted weights
[1999.6135460307303, 2001.03968501638, 2000.8723667804588]
Expected output of new instance = 1.0
What is wrong with my code?
Bugs:
No bias used in the neuron
error = training_output - expected_output (not the other way around) for gradient descent
weight update rule of ith weight w_i = w_i + learning_rate * delta_w_i, (delta_w_i is gradient of loss with respect to w_i)
For squared loss delta_w_i = error*sample[i] (ith value of input vector sample)
Since you have only one neuron (a single layer of size 1), your model can only learn linearly separable data (it is only a linear classifier). Examples of linearly separable data are data generated by boolean functions like AND and OR. Note that boolean XOR is not linearly separable.
Code with bugs fixed
import random as rand
import numpy as np
from scipy.special import expit as ex

rand.seed(10)
training_set = [[0,1,0],[1,0,1],[0,0,0],[1,1,1]]  # The training sets and their outputs
training_outputs = [1,1,0,1]  # Boolean OR of input vector
#training_outputs = [0,0,0,1]  # Boolean AND of input vector
weightlst = [rand.uniform(-1,1), rand.uniform(-1,1), rand.uniform(-1,1)]  # Weights are randomly set with a value between -1 and 1
bias = rand.uniform(-1,1)
print('Random weights\n'+str(weightlst))

def calcout(inputs, weights, bias):  # Calculate the expected output with given inputs and weights
    output = bias
    for i in range(len(inputs)):
        output = output + (inputs[i]*weights[i])
    # print('\nmy output is ' + str(ex(output)))
    return ex(output)  # Return the output on a sigmoid curve between 0 and 1

def adj(expected_output, training_output, weights, bias, inputs):  # Adjust the weights based on the expected output, true (training) output and the weights
    adjweights = []
    error = training_output - expected_output
    lr = 0.1
    for j, i in enumerate(weights):
        adjweights.append(i + error*inputs[j]*lr)
    adjbias = bias + error*lr
    return adjweights, adjbias

# Train the network, adjusting weights each time
training_iterations = 10000
for k in range(training_iterations):
    for l in range(len(training_set)):
        expected = calcout(training_set[l], weightlst, bias)
        weightlst, bias = adj(expected, training_outputs[l], weightlst, bias, training_set[l])

new_instance = [1,0,0]  # Calculate and return the expected output of a new instance
print('Adjusted weights\n'+str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance, weightlst, bias)))
Output:
Random weights
[0.142805189379827, -0.14222189064977075, 0.15618260226894076]
Adjusted weights
[6.196759842119063, 11.71208191137411, 6.210137255008176]
Expected output of new instance = 0.6655563851223694
As you can see, for input [1,0,0] the model predicted a probability of 0.66, which is class 1 (since 0.66 > 0.5). This is correct, as the target class is the OR of the input vector.
Note:
For learning/understanding how each weight is updated, it is OK to code it like the above, but in practice all the operations are vectorised. Check the link for a vectorised implementation.
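A minimal vectorised sketch of the same single-neuron update (my own illustration, not the answerer's code), using the training data from the fixed code above. It uses a batch update over all samples rather than the per-sample update above:

# Vectorised single-neuron training: one matrix product replaces the per-sample loops.
import numpy as np
from scipy.special import expit as ex

X = np.array([[0,1,0],[1,0,1],[0,0,0],[1,1,1]], dtype=float)
y = np.array([1,1,0,1], dtype=float)   # boolean OR targets

rng = np.random.default_rng(10)
w = rng.uniform(-1, 1, size=3)
b = rng.uniform(-1, 1)
lr = 0.1

for _ in range(10000):
    pred = ex(X @ w + b)              # forward pass for all samples at once
    error = y - pred                  # same error direction as the fixed code
    w += lr * X.T @ error / len(X)    # average gradient over the batch
    b += lr * error.mean()

print(ex(np.array([1,0,0]) @ w + b))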

Why does this backpropagation implementation fail to train weights correctly?

I've written the following backpropagation routine for a neural network, using the code here as an example. The issue I'm facing is confusing me, and has pushed my debugging skills to their limit.
The problem I am facing is rather simple: as the neural network trains, its weights are being trained to zero with no gain in accuracy.
I have attempted to fix it many times, verifying that:
the training sets are correct
the target vectors are correct
the forward step is recording information correctly
the backward step deltas are recording properly
the signs on the deltas are correct
the weights are indeed being adjusted
the deltas of the input layer are all zero
there are no other errors or overflow warnings
Some information:
The training inputs are an 8x8 grid of [0,16) values representing an intensity; this grid represents a numeral digit (converted to a column vector)
The target vector is an output that is 1 in the position corresponding to the correct number
The initial weights and biases are drawn from a Gaussian distribution
The activations are a standard sigmoid
I'm not sure where to go from here. I've verified that all things I know to check are operating correctly, and it's still not working, so I'm asking here. The following is the code I'm using to backpropagate:
def backprop(train_set, wts, bias, eta):
    learning_coef = eta / len(train_set[0])
    for next_set in train_set:
        # These record the sum of the cost gradients in the batch
        sum_del_w = [np.zeros(w.shape) for w in wts]
        sum_del_b = [np.zeros(b.shape) for b in bias]

        for test, sol in next_set:
            del_w = [np.zeros(wt.shape) for wt in wts]
            del_b = [np.zeros(bt.shape) for bt in bias]

            # These two helper functions take training set data and make them useful
            next_input = conv_to_col(test)
            outp = create_tgt_vec(sol)

            # Feedforward step
            pre_sig = []; post_sig = []
            for w, b in zip(wts, bias):
                next_input = np.dot(w, next_input) + b
                pre_sig.append(next_input)
                post_sig.append(sigmoid(next_input))
                next_input = sigmoid(next_input)

            # Backpropagation gradient
            delta = cost_deriv(post_sig[-1], outp) * sigmoid_deriv(pre_sig[-1])
            del_b[-1] = delta
            del_w[-1] = np.dot(delta, post_sig[-2].transpose())

            for i in range(2, len(wts)):
                pre_sig_vec = pre_sig[-i]
                sig_deriv = sigmoid_deriv(pre_sig_vec)
                delta = np.dot(wts[-i+1].transpose(), delta) * sig_deriv
                del_b[-i] = delta
                del_w[-i] = np.dot(delta, post_sig[-i-1].transpose())

            sum_del_w = [dw + sdw for dw, sdw in zip(del_w, sum_del_w)]
            sum_del_b = [db + sdb for db, sdb in zip(del_b, sum_del_b)]

        # Modify weights based on current batch
        wts = [wt - learning_coef * dw for wt, dw in zip(wts, sum_del_w)]
        bias = [bt - learning_coef * db for bt, db in zip(bias, sum_del_b)]

    return wts, bias
At Shep's suggestion, I checked what happens when training a network of shape [2, 1, 1] to always output 1, and indeed, the network trains properly in that case. My best guess at this point is that the gradient adjusts too strongly for the 0s and too weakly for the 1s, resulting in a net decrease despite an increase at each step, but I'm not sure.
I suppose your problem is in the choice of initial weights and in the choice of weight-initialization algorithm. Jeff Heaton, the author of Encog, claims that this kind of initialization usually performs worse than other initialization methods. There are also published comparisons of the performance of different weight-initialization algorithms. From my own experience, I also recommend initializing your weights with values of different signs. Even in cases where I had all positive outputs, weights with different signs performed better than weights with the same sign.
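A minimal numpy sketch of one common alternative to a plain Gaussian draw (my own illustration, not the answerer's code): a fan-in/fan-out scaled uniform (Glorot-style) initialization, which naturally produces weights of both signs:

# Glorot-style uniform initialization: mixed signs, scale tied to layer size.
import numpy as np

def init_layer(n_in, n_out, rng):
    # the limit keeps initial activations away from sigmoid saturation
    limit = np.sqrt(6.0 / (n_in + n_out))
    W = rng.uniform(-limit, limit, size=(n_out, n_in))  # values of both signs
    b = np.zeros((n_out, 1))
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(64, 30, rng)   # e.g. 8x8 input grid -> 30 hidden units (sizes assumed)
W2, b2 = init_layer(30, 10, rng)   # 30 hidden units -> 10 output classes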

Issue with gradient calculation in a Neural Network (stuck at 7% error in MNIST)

Hi, I am having an issue with my gradient check when implementing a neural network in Python using numpy.
I am using the MNIST dataset and trying to use mini-batch gradient descent.
I have checked the math, and on paper it looks good, so maybe you can give me a hint of what's happening here:
EDIT: One answer made me realize that the cost function was indeed being calculated wrong. However, that does not explain the problem with the gradient, as it is calculated using back_prop. I get a 7% error rate using 300 units in the hidden layer, mini-batch gradient descent with RMSprop, 30 epochs and 100 batches. (learning_rate = 0.001, small due to RMSprop.)
Each input has 768 features, so for 100 samples I have a 100x768 matrix. MNIST has 10 classes.
X = NoSamplesxFeatures = 100x768
Y = NoSamplesxClasses = 100x10
I am using a neural network with one hidden layer of size 300 when fully training. Another question I have is whether I should use a softmax output function for calculating the error... which I think not. But I am kind of a newbie at all of this and the obvious might seem strange to me.
(NOTE: I know the code is ugly, but this is my first Python/Numpy code done under pressure, bear with me)
Here are back_prop and the activations:
def sigmoid(z):
    return np.true_divide(1, 1 + np.exp(-z))

# not calculated really - this is the fake version to make it faster.
def sigmoid_prime(a):
    return (a)*(1 - a)

def _back_prop(self, W, X, labels, f=sigmoid, fprime=sigmoid_prime, lam=0.001):
    """
    Calculate the partial derivatives of the cost function using backpropagation.
    """
    # Weights for first layer and hidden layer
    Wl1, bl1, Wl2, bl2 = self._extract_weights(W)

    # get the forward prop value
    layers_outputs = self._forward_prop(W, X, f)

    # from a number make a binary vector, for mnist 1x10 with all 0 but the number.
    y = self.make_1_of_c_encoding(labels)
    num_samples = X.shape[0]  # layers_outputs[-1].shape[0]

    # Dot product returns Numsamples (N) x Outputs (No Classes)
    # Y is N x No Classes
    big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
    big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)

    # calculate the gradient for each training sample in the batch and accumulate it
    for i, x in enumerate(X):
        # Error with respect to the output
        dE_dy = layers_outputs[-1][i,:] - y[i,:]

        # bias of hidden layer
        big_delta_bl2 += dE_dy

        # get the error for the hidden layer
        dE_dz_out = dE_dy * fprime(layers_outputs[-1][i,:])

        # and for the input layer
        dE_dhl = dE_dy.dot(Wl2.T)

        # bias of input layer
        big_delta_bl1 += dE_dhl

        small_delta_hl = dE_dhl*fprime(layers_outputs[-2][i,:])

        # here calculate the gradient for the weights in the hidden and first layer
        big_delta_wl2 += np.outer(layers_outputs[-2][i,:], dE_dz_out)
        big_delta_wl1 += np.outer(x, small_delta_hl)

    # divide by the number of samples in the batch (should it be done here?)
    big_delta_wl2 = np.true_divide(big_delta_wl2, num_samples) + lam*Wl2*2
    big_delta_bl2 = np.true_divide(big_delta_bl2, num_samples)
    big_delta_wl1 = np.true_divide(big_delta_wl1, num_samples) + lam*Wl1*2
    big_delta_bl1 = np.true_divide(big_delta_bl1, num_samples)

    # return
    return np.concatenate([big_delta_wl1.ravel(),
                           big_delta_bl1,
                           big_delta_wl2.ravel(),
                           big_delta_bl2.reshape(big_delta_bl2.size)])
Now the feed_forward:
def _forward_prop(self,W,X,transfer_func=sigmoid):
"""
Return the output of the net a Numsamples (N) x Outputs (No CLasses)
# an array containing the size of the output of all of the laye of the neural net
"""
# Hidden layer DxHLS
weights_L1,bias_L1,weights_L2,bias_L2 = self._extract_weights(W)
# Output layer HLSxOUT
# A_2 = N x HLS
A_2 = transfer_func(np.dot(X,weights_L1) + bias_L1 )
# A_3 = N x Outputs
A_3 = transfer_func(np.dot(A_2,weights_L2) + bias_L2)
# output layer
return [A_2,A_3]
And the cost function for the gradient checking:
def cost_function(self, W, X, labels, reg=0.001):
    """
    reg: regularization term
    No weight decay term - let's leave it for later
    """
    outputs = self._forward_prop(W, X, sigmoid)[-1]  # take the last layer out
    sample_size = X.shape[0]
    y = self.make_1_of_c_encoding(labels)
    e1 = np.sum((outputs - y)**2, axis=1)*0.5
    # error = e1.sum(axis=1)
    error = e1.sum()/sample_size + 0.5*reg*(np.square(W)).sum()
    return error
What kind of results are you getting when you run gradient checking? Often times you can tease out the nature of the implementation error by looking at the output of your gradient vs the output produced by gradient checking.
Furthermore, square error is usually a poor choice for a classification task such as MNIST and I would suggest using either a simple sigmoid top-layer or a softmax. With sigmoid the cross entropy function you want to use is:
L(h,Y) = -Y*log(h) - (1-Y)*log(1-h)
For a softmax
L(h,Y) = -sum(Y*log(h))
where Y is the target given as a 1x10 vector and h is your predicted value, but easily extends to arbitrary batch sizes.
In both cases the top-layer delta simply becomes:
delta = h - Y
And the top-layer gradient becomes:
grad = dot(delta, A_in)
Where A_in is the input into the top layer from the previous layer.
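A minimal numpy sketch of that softmax + cross-entropy top layer (my own illustration, not the answerer's code), using the N x Classes shape convention from the question's _forward_prop, so the weight gradient is A_in.T dotted with delta:

# Softmax + cross-entropy top layer: the delta reduces to h - Y, as stated above.
import numpy as np

def softmax(Z):
    # subtract the row-wise max for numerical stability
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def top_layer_grads(A_in, Z_out, Y):
    """A_in: N x HLS input to the top layer, Z_out: N x Classes pre-activation,
    Y: N x Classes one-hot targets."""
    h = softmax(Z_out)
    loss = -np.sum(Y * np.log(h)) / len(Y)      # mean cross-entropy over the batch
    delta = h - Y                               # top-layer delta
    grad_W = np.dot(A_in.T, delta) / len(Y)     # gradient for the top weight matrix (HLS x Classes)
    grad_b = delta.mean(axis=0)                 # gradient for the top bias
    return loss, grad_W, grad_b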
While I am having some trouble getting my head around your backprop routine, I suspect from your code that the error in the gradient is due to the fact that you are not calculating the top-level dE/dw_l2 correctly when using square error, along with computing fprime on the incorrect input.
When using square error the top layer delta should be:
delta = (h - Y) * fprime(Z_l2)
Here Z_l2 is the input into your transfer function for layer 2. Similarly when computing fprime for the lower layers, you want to use the input to your transfer function (i.e. dot(X,weights_L1) + bias_L1)
Hope that helps.
EDIT:
As some added justification for using cross entropy error over square error I would suggest looking up Geoffrey Hinton's lectures on linear classification methods:
www.cs.toronto.edu/~hinton/csc2515/notes/lec3.ppt
EDIT2:
I ran some tests locally with my implementation of neural nets on the MNIST dataset with different parameters and 1 hidden layer using RMSPROP. Here are the results:
Test1
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.001
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 6.1%
Test Error: 6.9%
Test2
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.000002
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 4.5%
Test Error: 5.7%
It already appears that if you decrease your lambda parameter by a couple orders of magnitude you should end up with better performance.
