I am trying to code a very basic neural network in Python, with 3 input nodes that each take a value of 0 or 1 and a single output node whose value should also be 0 or 1. The output should be almost equal to the second input, but after training, the weights are far too high, and the network almost always guesses 1.
I am using Python 3.7 with numpy and scipy. I have tried changing the training set, the new instance, and the random seed.
import numpy as np
import random as rand  # needed for rand.seed and rand.uniform below
from scipy.special import expit as ex

rand.seed(10)

training_set=[[0,1,0],[1,0,1],[0,0,0],[1,1,1]] #The training sets and their outputs
training_outputs=[0,1,0,1]
weightlst=[rand.uniform(-1,1),rand.uniform(-1,1),rand.uniform(-1,1)] #Weights are randomly set with a value between -1 and 1
print('Random weights\n'+str(weightlst))

def calcout(inputs,weights): #Calculate the expected output with given inputs and weights
    output=0.5
    for i in range(len(inputs)):
        output=output+(inputs[i]*weights[i])
    #print('\nmy output is ' + str(ex(output)))
    return ex(output) #Return the output on a sigmoid curve between 0 and 1

def adj(expected_output,training_output,weights,inputs): #Adjust the weights based on the expected output, true (training) output and the weights
    adjweights=[]
    error=expected_output-training_output
    for i in weights:
        adjweights.append(i+(error*(expected_output*(1-expected_output))))
    return adjweights

#Train the network, adjusting weights each time
training_iterations=10000
for k in range(training_iterations):
    for l in range(len(training_set)):
        expected=calcout(training_set[l],weightlst)
        weightlst=adj(expected,training_outputs[l],weightlst,training_set[l])

new_instance=[1,0,0] #Calculate and return the expected output of a new instance
print('Adjusted weights\n'+str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance,weightlst)))
The expected output should be 0, or something very close to it, but no matter what I set new_instance to, the output is still:
Random weights
[-0.7312715117751976, 0.6948674738744653, 0.5275492379532281]
Adjusted weights
[1999.6135460307303, 2001.03968501638, 2000.8723667804588]
Expected output of new instance = 1.0
What is wrong with my code?
Bugs:
No bias is used in the neuron.
For gradient descent the error should be error = training_output - expected_output (not the other way around).
The weight update rule for the ith weight is w_i = w_i + learning_rate * delta_w_i, where delta_w_i is the gradient of the loss with respect to w_i.
For squared loss, delta_w_i = error * sample[i] (the ith value of the input vector sample).
Since you have only one neuron (one hidden layer of size 1), your model can only learn linearly separable data (it is only a linear classifier). Examples of linearly separable data are data generated by boolean functions like AND and OR. Note that boolean XOR is not linearly separable.
Code with bugs fixed
import numpy as np
import random as rand  # needed for rand.seed and rand.uniform below
from scipy.special import expit as ex

rand.seed(10)

training_set=[[0,1,0],[1,0,1],[0,0,0],[1,1,1]] #The training sets and their outputs
training_outputs=[1,1,0,1] # Boolean OR of input vector
#training_outputs=[0,0,0,1] # Boolean AND of input vector
weightlst=[rand.uniform(-1,1),rand.uniform(-1,1),rand.uniform(-1,1)] #Weights are randomly set with a value between -1 and 1
bias = rand.uniform(-1,1)
print('Random weights\n'+str(weightlst))

def calcout(inputs,weights,bias): #Calculate the expected output with given inputs and weights
    output=bias
    for i in range(len(inputs)):
        output=output+(inputs[i]*weights[i])
    #print('\nmy output is ' + str(ex(output)))
    return ex(output) #Return the output on a sigmoid curve between 0 and 1

def adj(expected_output,training_output,weights,bias,inputs): #Adjust the weights based on the expected output, true (training) output and the weights
    adjweights=[]
    error=training_output-expected_output
    lr = 0.1
    for j, i in enumerate(weights):
        adjweights.append(i+error*inputs[j]*lr)
    adjbias = bias+error*lr
    return adjweights, adjbias

#Train the network, adjusting weights each time
training_iterations=10000
for k in range(training_iterations):
    for l in range(len(training_set)):
        expected=calcout(training_set[l],weightlst,bias)
        weightlst, bias = adj(expected,training_outputs[l],weightlst,bias,training_set[l])

new_instance=[1,0,0] #Calculate and return the expected output of a new instance
print('Adjusted weights\n'+str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance,weightlst,bias)))
Output:
Random weights
[0.142805189379827, -0.14222189064977075, 0.15618260226894076]
Adjusted weights
[6.196759842119063, 11.71208191137411, 6.210137255008176]
Expected output of new instance = 0.6655563851223694
As you can see, for the input [1,0,0] the model predicts a probability of 0.66, which corresponds to class 1 (since 0.66 > 0.5). This is correct, as the output class is the OR of the input vector.
Note:
For learning/understanding how each weight is updated it is OK to code it like the above, but in practice all the operations are vectorised. Check the link for a vectorized implementation.
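As a rough illustration (my own sketch, not the linked implementation), a vectorized version of the fixed code above could look like this, updating on the whole batch at once with numpy matrix operations:
import numpy as np
from scipy.special import expit as ex

rng = np.random.RandomState(10)
X = np.array([[0,1,0],[1,0,1],[0,0,0],[1,1,1]], dtype=float)   # training inputs
y = np.array([1,1,0,1], dtype=float)                           # boolean OR of each input vector
w = rng.uniform(-1, 1, size=3)                                 # weights
b = rng.uniform(-1, 1)                                         # bias
lr = 0.1

for _ in range(10000):
    pred = ex(X @ w + b)              # forward pass for every sample at once
    error = y - pred                  # same error definition as the per-sample version
    w += lr * (X.T @ error) / len(X)  # batch update instead of per-sample updates
    b += lr * error.mean()

print(ex(np.array([1,0,0]) @ w + b))  # should be well above 0.5 (class 1 for OR)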
Related
I wanted to code my implementation of polynomial regression, but my model's gradients either exploded or my model didn't fit the data well enough.
For testing purposes, my dataset is just the function x^2 and my model is a second-degree polynomial ax^2 + bx + c. I trained it for 50 epochs using batch gradient descent.
I noticed that the model explodes with the learning rate >=0.001 and underfits with a learning rate <=0.0001
To visualize the model, at the end of each epoch, I plot the model's predictions with the labels. So, in the ideal case, these lines should be indistinguishable.
The orange line is the labels and the blue one is the model's predictions.
Here is the model exploding: [plot not shown]
And here it underfits: [plot not shown]
One interesting thing is that even though the model's predictions are way too big, the line still resembles the correct polynomial. And the picture where the predictions go negative is also correct, just flipped/mirrored.
I made the code in python. This is my main.py:
from decimal import Decimal
from matplotlib.pyplot import plot, draw, pause, clf
from model import PolynomialRegression
POLYNOMIAL_FUNCTION = [0, 1, 2]
LEARNING_RATE = Decimal(0.0001)
DATASET = [0, 1, 2, 3, 4, 5, 6, 7]
LABELSET = [0, 1, 4, 9, 16, 32, 64, 128]
EPOCHS = 50
model = PolynomialRegression(POLYNOMIAL_FUNCTION, LEARNING_RATE)
for _ in range(EPOCHS):
    for data, label in zip(DATASET, LABELSET):
        # train the model
        model.train(data, label)

    # update the model
    model.update()

    # predict the dataset
    predictions = [model.predict(data) for data in DATASET]

    # plot predictions and labels
    plot(predictions)
    plot(LABELSET)
    draw()
    pause(0.1)
    clf()

    print(model.parameters)

    # erase the stored gradients
    model.clear_grad()
And this is my model.py:
from decimal import Decimal
class PolynomialRegression:
    """
    Polynomial regression model.
    """

    def __init__(self, polynomial_function: list, learning_rate: Decimal) -> None:
        # the structure of the polynomial function (the exponents)
        self.polynomial_function = polynomial_function
        # parameters of the model set to be 1
        self.parameters = [Decimal(1)] * len(polynomial_function)
        self.learning_rate = learning_rate
        # stored gradients to update the model
        self.gradients = []

    def predict(self, x: Decimal) -> Decimal:
        """
        Make a prediction based on the input.

        Args:
            x (Decimal): Input to the model.

        Returns:
            Decimal: A prediction.
        """
        y = Decimal(0)
        # go through each parameter and exponent
        for param, exponent in zip(self.parameters, self.polynomial_function):
            # compute a term and add it to the final output
            y += param * (x ** exponent)
        return y

    def train(self, x: Decimal, y: Decimal) -> Decimal:
        """
        Compute a gradient from a given input and target output.

        Args:
            x (Decimal): Input for the model.
            y (Decimal): Target/Desired output.

        Returns:
            Decimal: An MSE loss.
        """
        prediction = self.predict(x)
        error = prediction - y
        loss = error ** 2

        gradient = []
        # go through each parameter and exponent
        for param, exponent in zip(self.parameters, self.polynomial_function):
            # compute the gradient for a single parameter
            param_gradient = error * (x ** exponent) * self.learning_rate
            # add the parameter gradient to the gradient list
            gradient.append(param_gradient)
        # add the gradient to a list
        self.gradients.append(gradient)

        return loss

    def __sum_gradients(self) -> Decimal:
        """
        Return a sum of gradients along the 0 axis.
        (equivalent of numpy.sum(x, axis=0))

        Returns:
            list: List of summed Decimals.
        """
        result = [Decimal(0)] * len(self.parameters)
        # iterate through the y axis
        for gradient in self.gradients:
            # iterate through the x axis
            for i, param_gradient in enumerate(gradient):
                result[i] += param_gradient
        return result

    def update(self) -> None:
        """
        Update the model's parameters based on the stored gradients.
        """
        summed_gradients = self.__sum_gradients()
        # fraction used to calculate the average for every gradient
        averaging_fraction = Decimal(1) / len(self.gradients)

        for param_index, grad in enumerate(summed_gradients):
            self.parameters[param_index] -= averaging_fraction * grad

    def clear_grad(self) -> None:
        """
        Clear/Reset the stored gradients.
        """
        self.gradients = []
I think the problem lies somewhere in my gradient descent calculations, but it may also be something unexpected and silly.
First, your dataset consists of only 8 data points. This is too little data for a model to generalize from, which means that you are probably overfitting.
The second thing I see is that you do not normalize the x data. The model is not very complex, so I guess it doesn't really matter in this context. But if you had a more complex model with n features, and one feature is very small and one is very big, the feature with the bigger values would influence the result much more than the smaller one, which might result in a badly performing model.
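As a small, hypothetical illustration of that point (not taken from the question's code), min-max scaling the inputs to [0, 1] before training would look something like this:
DATASET = [0, 1, 2, 3, 4, 5, 6, 7]
x_min, x_max = min(DATASET), max(DATASET)
# scale each input into [0, 1] so no single feature dominates the gradients
normalized_dataset = [(x - x_min) / (x_max - x_min) for x in DATASET]
print(normalized_dataset)   # [0.0, 0.142..., ..., 1.0]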
But your last plot doesn't look like underfitting to me. You have to realize that an ML model will always have some error. In my opinion, for 8 data points, a model with only one layer and 50 epochs, that looks fine. You could probably improve the results by training longer, but that would mean overfitting the model even more. To be honest, if your goal is to emulate a mathematical function with ML, this should be okay. You could also add a new layer.
The fact that your learning rate has to be that small to keep the results from blowing up tells me that you are right: there is something wrong with the gradient descent, and you might want to look into this behavior.
An easy way to evaluate this is to build your model in PyTorch and let its optimizer update the weights. If the problem goes away, it was your gradient descent; if you still get the same problem, it lies somewhere else. But I strongly believe it is your gradient descent. Maybe step through that function in a debugger and look at the actual values you are subtracting.
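A minimal sketch of that cross-check, assuming plain SGD and the same degree-2 polynomial (the names and values here are my own, not taken from the question's code):
import torch

xs = torch.tensor([0., 1., 2., 3., 4., 5., 6., 7.])
ys = xs ** 2                                    # clean x^2 labels for the comparison
params = torch.ones(3, requires_grad=True)      # [c, b, a], initialised to 1 like the Decimal model

optimizer = torch.optim.SGD([params], lr=1e-4)
for _ in range(50):                             # 50 epochs of full-batch gradient descent
    preds = params[0] + params[1] * xs + params[2] * xs ** 2
    loss = ((preds - ys) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()                             # autograd computes the gradients
    optimizer.step()

print(params.detach())                          # compare against the hand-computed parameters
If the PyTorch version fits fine with the same learning rate while your implementation explodes, that points at the hand-written gradient computation.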
This is a rather interesting question about Siamese networks.
I am following the example from https://keras.io/examples/mnist_siamese/.
My modified version of the code is in this google colab
The Siamese network takes in 2 inputs (2 handwritten digits) and outputs whether they are of the same digit (1) or not (0).
Each of the two inputs is first processed by a shared base_network (3 Dense layers with 2 Dropout layers in between). The input_a is extracted into processed_a, and input_b into processed_b.
The last layer of the Siamese network is a Euclidean distance layer between the two extracted tensors:
distance = Lambda(euclidean_distance,
                  output_shape=eucl_dist_output_shape)([processed_a, processed_b])
model = Model([input_a, input_b], distance)
I understand the reasoning behind using a Euclidean distance layer for the lower part of the network: if the features are extracted nicely, then similar inputs should have similar features.
I am thinking, why not use a normal Dense layer for the lower part, as:
# distance = Lambda(euclidean_distance,
#                   output_shape=eucl_dist_output_shape)([processed_a, processed_b])
# model = Model([input_a, input_b], distance)
#my model
subtracted = Subtract()([processed_a, processed_b])
out = Dense(1, activation="sigmoid")(subtracted)
model = Model([input_a,input_b], out)
My reasoning is that if the extracted features are similar, then the Subtract layer should produce a small tensor as the difference between the extracted features. The next layer, a Dense layer, can learn to output 1 if the input is small and 0 otherwise.
Because the Euclidean distance layer outputs a value close to 0 when the two inputs are similar and close to 1 otherwise, I also need to invert the accuracy and loss functions, as:
# the version of loss and accuracy for Euclidean distance layer
# def contrastive_loss(y_true, y_pred):
#     '''Contrastive loss from Hadsell-et-al.'06
#     http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
#     '''
#     margin = 1
#     square_pred = K.square(y_pred)
#     margin_square = K.square(K.maximum(margin - y_pred, 0))
#     return K.mean(y_true * square_pred + (1 - y_true) * margin_square)

# def compute_accuracy(y_true, y_pred):
#     '''Compute classification accuracy with a fixed threshold on distances.
#     '''
#     pred = y_pred.ravel() < 0.5
#     return np.mean(pred == y_true)

# def accuracy(y_true, y_pred):
#     '''Compute classification accuracy with a fixed threshold on distances.
#     '''
#     return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))

### my version, loss and accuracy
def contrastive_loss(y_true, y_pred):
    margin = 1
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    # return K.mean(y_true * square_pred + (1-y_true) * margin_square)
    return K.mean(y_true * margin_square + (1-y_true) * square_pred)

def compute_accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    pred = y_pred.ravel() > 0.5
    return np.mean(pred == y_true)

def accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    return K.mean(K.equal(y_true, K.cast(y_pred > 0.5, y_true.dtype)))
The accuracy for the old model:
* Accuracy on training set: 99.55%
* Accuracy on test set: 97.42%
This slight change leads to a model that does not learn anything:
* Accuracy on training set: 48.64%
* Accuracy on test set: 48.29%
So my question is:
1. What is wrong with my reasoning of using Subtract + Dense for the lower part of the Siamese network?
2. Can we fix this? I have two potential solutions in mind, but I am not confident: (1) a convolutional neural net for feature extraction, (2) more Dense layers for the lower part of the Siamese network.
When two examples are similar, subtracting their two n-dimensional feature vectors (extracted by the common/base feature-extraction model) gives values at or around zero in most positions of the resulting n-dimensional vector, and that is what the next/output Dense layer works on. On the other hand, in an ANN model the weights are learned such that less important features produce very small responses and the prominent/interesting features that drive the decision produce high responses. Notice that our subtracted feature vector behaves in exactly the opposite way: when two examples are from different classes it produces high responses, and when they are from the same class it produces near-zero ones. Furthermore, with a single node in the output layer (no additional hidden layer before it), it is quite difficult for the model to learn to generate a high response from near-zero values when the two samples are of the same class. This might be an important point for solving your problem.
Based on the above discussion, you may want to try the following ideas:
Transforming the subtracted feature vector so that similarity produces high responses, perhaps by subtracting it from 1 or taking its reciprocal (multiplicative inverse), followed by normalization.
Adding more Dense layers before the output layer (see the sketch after this list).
I won't be surprised if a convolutional neural net instead of stacked Dense layers for feature extraction (as you are thinking) does not improve your accuracy much, as it is just another way of doing the same thing (feature extraction).
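As a rough sketch of the second idea (an extra Dense layer on top of a similarity-friendly transform), under my own assumptions: the base network below is only a stand-in for the one in the Keras example, and I use the absolute difference |a - b| instead of a raw Subtract so that the sign of the difference does not matter:
from keras.layers import Input, Dense, Lambda
from keras.models import Model
import keras.backend as K

def create_base_network(input_shape):
    # stand-in for the shared base_network from the Keras MNIST Siamese example
    inp = Input(shape=input_shape)
    x = Dense(128, activation='relu')(inp)
    x = Dense(128, activation='relu')(x)
    return Model(inp, x)

input_shape = (784,)
base_network = create_base_network(input_shape)
input_a = Input(shape=input_shape)
input_b = Input(shape=input_shape)
processed_a = base_network(input_a)
processed_b = base_network(input_b)

# element-wise |a - b|: small values mean "similar", and the sign is removed
diff = Lambda(lambda t: K.abs(t[0] - t[1]))([processed_a, processed_b])
hidden = Dense(64, activation='relu')(diff)    # extra Dense layer before the output
out = Dense(1, activation='sigmoid')(hidden)   # 1 = same digit, 0 = different

model = Model([input_a, input_b], out)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])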
I'm implementing a Convolutional Neural Network in Tensorflow with python.
I'm in the following scenario: I've got a tensor of labels y (batch labels) like this:
y = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
where each row is a one-hot vector that represents the label of the corresponding example. Now, during training, I want to stop the loss gradient (set it to 0) for any example with this label (the third one):
[1,0,0]
which represents the n/a label, while the losses of the other examples in the batch are computed as usual.
For my loss computation I use a method like this:
self.y_loss = kl_divergence(self.pred_y, self.y)
I found this function that stops gradients, but how can I apply it conditionally to the batch elements?
If you don't want some samples to contribute to the gradients you could just avoid feeding them to the network during training at all. Simply remove the samples with that label from your training set.
Alternatively, since the loss is computed by summing the KL-divergences of all samples, you could multiply each sample's KL-divergence by 1 if the sample should be taken into account and by 0 otherwise, before summing over them.
You can get the vector of values to multiply the individual KL-divergences by from subtracting the first column of the tensor of labels from 1: 1 - y[:,0]
For the kl_divergence function from the answer to your previous question it might look like this:
def kl_divergence(p, q):
    return tf.reduce_sum(tf.reduce_sum(p * tf.log(p/q), axis=1) * (1 - p[:,0]))
where p is the ground-truth tensor and q is the predictions.
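For illustration, here is a minimal sketch of the masking idea with made-up values (TF 1.x style, matching the tf.log call above); rows whose label is [1,0,0] get weight 0, so they contribute nothing to the loss and therefore no gradient:
import tensorflow as tf

y = tf.constant([[0., 1., 0.],
                 [0., 0., 1.],
                 [1., 0., 0.]])                    # third row is the n/a label

per_sample_kl = tf.constant([0.7, 0.3, 1.2])       # pretend these are the per-sample KL values
mask = 1.0 - y[:, 0]                               # [1, 1, 0]
total_loss = tf.reduce_sum(per_sample_kl * mask)   # only the first two samples contribute

with tf.Session() as sess:
    print(sess.run(total_loss))                    # 1.0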
I am new to neural networks, and I am using an example neural network I found online to attempt to approximate the sphere function (the sum of a set of squared numbers) using backpropagation.
The initial code is:
from numpy import exp, array, random, dot  # imports used by the example

class NeuralNetwork():
    def __init__(self):
        #Seed the random number generator, so it generates the same numbers
        #every time the program is run.
        #random.seed(1)

        #Model a single neuron, with 2 input connections and 1 output connection.
        #We assign random weights to a 2 x 1 matrix, with values in the range -1 to 1
        #and mean 0
        self.synaptic_weights = 2 * random.random((2,1)) - 1

    #The Sigmoid function, which describes an S shaped curve.
    #We pass the weighted sum of the inputs through this function to
    #normalise them between 0 and 1.
    def __sigmoid(self, x):
        return 1 / (1 + exp(-x))

    #The derivative of the sigmoid function
    #This is the gradient of the sigmoid curve.
    #It indicates how confident we are about the existing weight.
    def __sigmoid_derivative(self, x):
        return x * (1 - x)

    #Train the network through a process of trial and error.
    #Adjusting the synaptic weights each time.
    def train(self, training_set_inputs, training_set_outputs, number_of_training_iterations):
        for iteration in xrange(number_of_training_iterations):
            #Pass the training set through our neural network (a single neuron).
            output = self.think(training_set_inputs)

            #Calculate the error (difference between the desired output and predicted output).
            error = training_set_outputs - output

            #Multiply the error by the input and again by the gradient of the Sigmoid curve.
            #This means less confident weights are adjusted more.
            #This means inputs, which are zero, do not cause changes to the weights.
            adjustment = dot(training_set_inputs.T, error * self.__sigmoid_derivative(output))

            #Adjust the weights
            self.synaptic_weights += adjustment

    #The neural network thinks.
    def think(self, inputs):
        #Pass inputs through our neural network (OUR SINGLE NEURON).
        return self.__sigmoid(dot(inputs, self.synaptic_weights))

if __name__ == "__main__":
    #Initialise a single neuron neural network.
    neural_network = NeuralNetwork()

    print "Random starting synaptic weights: "
    print neural_network.synaptic_weights

    #The training set. We have 3 examples, each consisting of 2 input values and 1 output value.
    training_set_inputs = array([[0, 1], [1, 0], [0, 0]])
    training_set_outputs = array([[1, 1, 0]]).T

    #Train the neural network using a training set.
    #Do it 10,000 times and make small adjustments each time.
    neural_network.train(training_set_inputs, training_set_outputs, 10000)

    print "New synaptic weights after training: "
    print neural_network.synaptic_weights

    #Test the neural network with a new situation.
    print "Considering new situation [1,1] -> ?: "
    print neural_network.think(array([1, 1]))
My aim is to feed training data (the sphere function's inputs and outputs) into the neural network to train it and meaningfully adjust the weights. After continued training, the weights should reach a point where reasonably accurate results are given for the training inputs.
I imagine an example of some training sets for the sphere function would be something like:
training_set_inputs = array([[2,1], [3,2], [4,6], [8,3]])
training_set_outputs = array([[5, 13, 52, 73]])
The example I found online can successfully approximate the XOR operation, but when given sphere function inputs it only gives me an output of 1 when tested on a new example (for example, [6,7], which should ideally return an approximation around 85).
From what I have read about neural networks, I suspect this is because I need to normalize the inputs, but I am not entirely sure how to do this. Any help with this, or something to point me in the right direction, would be appreciated a lot. Thank you.
Hi, I am having an issue with my gradient check when implementing a neural network in Python using numpy.
I am using the MNIST dataset and trying to use mini-batch gradient descent.
I have checked the math and on paper it looks good, so maybe you can give me a hint about what's happening here:
EDIT: One answer made me realize that the cost function was indeed being calculated wrong. However, that does not explain the problem with the gradient, as it is calculated using back_prop. I get a 7% error rate using 300 units in the hidden layer with mini-batch gradient descent with RMSProp, 30 epochs and 100 batches (learning_rate = 0.001, small due to RMSProp).
Each input has 768 features, so for 100 samples I have a matrix. MNIST has 10 classes.
X = NoSamplesxFeatures = 100x768
Y = NoSamplesxClasses = 100x10
I am using a one-hidden-layer neural network with a hidden layer size of 300 when fully training. Another question I have is whether I should use a softmax output function for calculating the error... I think not, but I am kind of a newbie at all of this and the obvious might seem strange to me.
(NOTE: I know the code is ugly, but this is my first Python/Numpy code done under pressure, bear with me)
Here is back_prop and the activations:
def sigmoid(z):
    return np.true_divide(1, 1 + np.exp(-z))

#not calculated really - this is the fake version to make it faster.
def sigmoid_prime(a):
    return (a)*(1 - a)

def _back_prop(self,W,X,labels,f=sigmoid,fprime=sigmoid_prime,lam=0.001):
    """
    Calculate the partial derivatives of the cost function using backpropagation.
    """
    #Weights for the first layer and hidden layer
    Wl1,bl1,Wl2,bl2 = self._extract_weights(W)

    # get the forward prop value
    layers_outputs = self._forward_prop(W,X,f)

    #from a number make a binary vector, for mnist 1x10 with all 0 but the number.
    y = self.make_1_of_c_encoding(labels)
    num_samples = X.shape[0] # layers_outputs[-1].shape[0]

    # Dot product returns Numsamples (N) x Outputs (No Classes)
    # Y is N x No Classes
    big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
    big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)

    # calculate the gradient for each training sample in the batch and accumulate it
    for i,x in enumerate(X):

        # Error with respect to the output
        dE_dy = layers_outputs[-1][i,:] - y[i,:]

        # bias hidden layer
        big_delta_bl2 += dE_dy

        # get the error for the hidden layer
        dE_dz_out = dE_dy * fprime(layers_outputs[-1][i,:])

        #and for the input layer
        dE_dhl = dE_dy.dot(Wl2.T)

        #bias input layer
        big_delta_bl1 += dE_dhl

        small_delta_hl = dE_dhl*fprime(layers_outputs[-2][i,:])

        #here calculate the gradient for the weights in the hidden and first layer
        big_delta_wl2 += np.outer(layers_outputs[-2][i,:],dE_dz_out)
        big_delta_wl1 += np.outer(x,small_delta_hl)

    # divide by number of samples in the batch (should be done here)?
    big_delta_wl2 = np.true_divide(big_delta_wl2,num_samples) + lam*Wl2*2
    big_delta_bl2 = np.true_divide(big_delta_bl2,num_samples)
    big_delta_wl1 = np.true_divide(big_delta_wl1,num_samples) + lam*Wl1*2
    big_delta_bl1 = np.true_divide(big_delta_bl1,num_samples)

    # return
    return np.concatenate([big_delta_wl1.ravel(),
                           big_delta_bl1,
                           big_delta_wl2.ravel(),
                           big_delta_bl2.reshape(big_delta_bl2.size)])
Now the feed_forward:
def _forward_prop(self,W,X,transfer_func=sigmoid):
    """
    Return the outputs of all of the layers of the net;
    the last one is Numsamples (N) x Outputs (No Classes).
    """
    # Hidden layer DxHLS
    weights_L1,bias_L1,weights_L2,bias_L2 = self._extract_weights(W)

    # Output layer HLSxOUT

    # A_2 = N x HLS
    A_2 = transfer_func(np.dot(X,weights_L1) + bias_L1)

    # A_3 = N x Outputs
    A_3 = transfer_func(np.dot(A_2,weights_L2) + bias_L2)

    # output layer
    return [A_2,A_3]
And the cost function for the gradient checking:
def cost_function(self,W,X,labels,reg=0.001):
    """
    reg: regularization term
    No weight decay term - lets leave it for later
    """
    outputs = self._forward_prop(W,X,sigmoid)[-1] #take the last layer out
    sample_size = X.shape[0]

    y = self.make_1_of_c_encoding(labels)

    e1 = np.sum((outputs - y)**2, axis=1)*0.5
    #error = e1.sum(axis=1)
    error = e1.sum()/sample_size + 0.5*reg*(np.square(W)).sum()

    return error
What kind of results are you getting when you run gradient checking? Often you can tease out the nature of the implementation error by comparing the output of your gradient with the output produced by gradient checking.
Furthermore, square error is usually a poor choice for a classification task such as MNIST, and I would suggest using either a simple sigmoid top layer or a softmax. With a sigmoid, the cross-entropy function you want to use is:
L(h,Y) = -Y*log(h) - (1-Y)*log(1-h)
For a softmax
L(h,Y) = -sum(Y*log(h))
where Y is the target given as a 1x10 vector and h is your predicted value; this easily extends to arbitrary batch sizes.
In both cases the top-layer delta simply becomes:
delta = h - Y
And the top-layer gradient becomes:
grad = dot(delta, A_in)
Where A_in is the input into the top layer from the previous layer.
While I am having some trouble getting my head around your backprop routine, I suspect from your code that the error in the gradient is due to not calculating the top-level dE/dw_l2 correctly when using square error, along with computing fprime on the incorrect input.
When using square error the top layer delta should be:
delta = (h - Y) * fprime(Z_l2)
Here Z_l2 is the input to your transfer function for layer 2. Similarly, when computing fprime for the lower layers, you want to use the input to your transfer function (i.e. dot(X, weights_L1) + bias_L1).
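To make the suggested fix concrete, here is a small self-contained numpy sketch of the corrected square-error backprop for a network with the same shapes as in the question (the random data and initialization are my own placeholders, not the asker's code):
import numpy as np

rng = np.random.RandomState(0)
N, D, H, C = 100, 768, 300, 10            # batch size, features, hidden size, classes (as in the question)
X = rng.rand(N, D)
Y = np.eye(C)[rng.randint(0, C, N)]       # one-hot targets
Wl1, bl1 = rng.randn(D, H) * 0.01, np.zeros(H)
Wl2, bl2 = rng.randn(H, C) * 0.01, np.zeros(C)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fprime(z):
    s = sigmoid(z)
    return s * (1 - s)                    # derivative in terms of the pre-activation input

# forward pass, keeping the pre-activations Z so fprime can use them
Z_l1 = np.dot(X, Wl1) + bl1
A_2 = sigmoid(Z_l1)
Z_l2 = np.dot(A_2, Wl2) + bl2
h = sigmoid(Z_l2)

# square-error top-layer delta as suggested above: (h - Y) * fprime(Z_l2)
delta_out = (h - Y) * fprime(Z_l2)
grad_Wl2 = np.dot(A_2.T, delta_out) / N   # gradient for the output-layer weights
delta_hidden = np.dot(delta_out, Wl2.T) * fprime(Z_l1)
grad_Wl1 = np.dot(X.T, delta_hidden) / N  # gradient for the first-layer weights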
Hope that helps.
EDIT:
As some added justification for using cross entropy error over square error I would suggest looking up Geoffrey Hinton's lectures on linear classification methods:
www.cs.toronto.edu/~hinton/csc2515/notes/lec3.ppt
EDIT2:
I ran some tests locally with my implementation of neural nets on the MNIST dataset with different parameters and 1 hidden layer using RMSPROP. Here are the results:
Test1
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.001
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 6.1%
Test Error: 6.9%
Test2
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.000002
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 4.5%
Test Error: 5.7%
It already appears that if you decrease your lambda parameter by a couple orders of magnitude you should end up with better performance.