Why slow learning in RNN implemented using for-loop? - python

Problem settings
As a beginner of RNN, I'm currently building a 3-to-1 autocompletion RNN model for 4-letter words, where the input is a 3-letter incomplete word and the output is a single-letter which completes the word. For example, I would desire to have the following model-prediction:
input : "C", "A", "F"
output : "E"
Codes - generate dataset
To get the desired result from an RNN model, I have made an (imbalanced) dataset as follows:
import string
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
alphList = list(string.ascii_uppercase) # Define a list of alphabets
alphToNum = {n: i for i, n in enumerate(alphList)} # dic of alphabet-numbers
# Make dataset
# define words of interest
fourList = ['CARE', 'CODE', 'COME', 'CANE', 'COPE', 'FISH', 'JAZZ', 'GAME', 'WALK', 'QUIZ']
# (len(Sequence), len(Batch), len(Observation)) following tensorflow-style
first3Data = np.zeros((3, len(fourList), len(alphList)), dtype=np.int32)
last1Data = np.zeros((len(fourList), len(alphList)), dtype=np.int32)
for idxObs, word in enumerate(fourList):
# Make an array of one-hot vectors consisting of first 3 letters
first3 = [alphToNum[n] for n in word[:-1]]
first3Data[:,idxObs,:] = np.eye(len(alphList))[first3]
# Make an array of one-hot vectors consisting of last 1 letter
last1 = alphToNum[word[3]]
last1Data[idxObs,:] = np.eye(len(alphList))[last1]
So fourList contains the training data information, first3Data contains all the one-hot encoded first 3 letters of the training data, and last1Data contains all the one-hot encoded last 1 letter of the training data.
Codes - build model
Following the standard setting of 3-to-1 RNN model,I have made the following code.
# Hyperparameters
n_data = len(fourList)
n_input = len(alphList) # number of input units
n_hidden = 128 # number of hidden units
n_output = len(alphList) # number of output units
learning_rate = 0.01
total_epoch = 100000
# Variables (separate version)
W_in = tf.Variable(tf.random_normal([n_input, n_hidden]))
W_rec = tf.Variable(tf.random_normal([n_hidden, n_hidden]))
b_rec = tf.Variable(tf.random_normal([n_hidden]))
W_out = tf.Variable(tf.random_normal([n_hidden, n_output]))
b_out = tf.Variable(tf.random_normal([n_output]))
# Manual calculation of RNN output
def RNNoutput(Xinput):
h_state = tf.random_normal([1,n_hidden]) # initial hidden state
for iX in Xinput:
h_state = tf.nn.tanh(iX # W_in + (h_state # W_rec + b_rec))
rnn_output = h_state # W_out + b_out
return(rnn_output)
Note that the Manual calculation of RNN output part basically rolls the hidden state exactly 4 times using the matrix multiplication and the tanh activation function as follows:
tf.nn.tanh(iX # W_in + (h_state # W_rec + b_rec))
Here, every time the whole data is passed, one epoch is completed. Thus I initialize the h_state every time I pass the data. Additionally, note that I have not used a placeholder, which may be a cause of the learning instability.
Codes - train
I have used the following code to train the network.
# Cost / optimizer definition
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=RNNoutput(first3Data),
labels=last1Data))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
# Train and keep track of the loss history
sess = tf.Session()
sess.run(tf.global_variables_initializer())
lossHistory = []
for epoch in range(total_epoch):
_, loss = sess.run([optimizer, cost])
lossHistory.append(loss)
Question
The resulting learning curve looks as follows. Indeed, it shows an exponential decay.
However, for me it looks too wiggly for this kind of simple example, showing some instabilities even in the late period of the learning.
plt.plot(range(total_epoch), lossHistory)
plt.show()
Possible explanations?
I think the learning curve should show a square-like stable decay pattern as expected using tensorflow built-in functions (*). But I think this instability may be explained plausibly as follows:
Instability in random initialization of parameters
Numerical instability due to the successive addition when defining RNNoutput
Not using a tensor for loop but using the for loop directly in data
But I don't think any of these played a crucial role. Is there any other solution to help me out?
(*) I have seen a nearly square-patterned loss decay using tensorflow built-in functions for simple RNN. But sorry that I have not included the results to be compared, since I run out of time... I think I can edit shortly.

This modification where the initial state is set to be zero seems to solve the problem.
# Variables (separate version)
W_in = tf.Variable(tf.random_normal([n_input, n_hidden]))
W_rec = tf.Variable(tf.random_normal([n_hidden, n_hidden]))
b_rec = tf.Variable(tf.random_normal([n_hidden]))
W_out = tf.Variable(tf.random_normal([n_hidden, n_output]))
b_out = tf.Variable(tf.random_normal([n_output]))
h_init = tf.zeros([1,n_hidden])
# Manual calculation of RNN output
def RNNoutput(Xinput):
h_state = h_init # initial hidden state
for iX in Xinput:
h_state = tf.nn.tanh(iX # W_in + (h_state # W_rec + b_rec))
rnn_output = h_state # W_out + b_out
return(rnn_output)

Related

Keras Word2Vec implementation

I'm using the implementation found in http://adventuresinmachinelearning.com/word2vec-keras-tutorial/ to learn something about word2Vec. What I am not understanding is why isn't the loss function decreasing?
Iteration 119200, loss=0.7305528521537781
Iteration 119300, loss=0.6254740953445435
Iteration 119400, loss=0.8255964517593384
Iteration 119500, loss=0.7267132997512817
Iteration 119600, loss=0.7213149666786194
Iteration 119700, loss=0.6156617999076843
Iteration 119800, loss=0.11473365128040314
Iteration 119900, loss=0.6617216467857361
The net, from my understanding, is a standard one used in this task:
input_target = Input((1,))
input_context = Input((1,))
embedding = Embedding(vocab_size, vector_dim, input_length=1, name='embedding')
target = embedding(input_target)
target = Reshape((vector_dim, 1))(target)
context = embedding(input_context)
context = Reshape((vector_dim, 1))(context)
dot_product = Dot(axes=1)([target, context])
dot_product = Reshape((1,))(dot_product)
output = Dense(1, activation='sigmoid')(dot_product)
model = Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop') #adam??
Words come from a vocabulary of size 10000 from http://mattmahoney.net/dc/text8.zip (english text)
What I notice is that some words are somewhat learned in time like the context for numbers and articles is easily guessed, yet the loss is quite stuck around 0.7 from the beginning, and as iterations goes it only fluctuates randomly.
The training part is made like this (which I sense strange since the absence of the standard fit method)
arr_1 = np.zeros((1,))
arr_2 = np.zeros((1,))
arr_3 = np.zeros((1,))
for cnt in range(epochs):
idx = np.random.randint(0, len(labels)-1)
arr_1[0,] = word_target[idx]
arr_2[0,] = word_context[idx]
arr_3[0,] = labels[idx]
loss = model.train_on_batch([arr_1, arr_2], arr_3)
if cnt % 100 == 0:
print("Iteration {}, loss={}".format(cnt, loss))
Am i missing something important about these type of net? What is not written is implemented exactly like the link above
I followed the same tutorial and the loss drops after the algorithm went through a sample again. Note that the loss function is calculated only for the current target and context word pair. In the code example from the tutorial one epoch is only one sample, therefore you would need more than the number of target and context words to come to a point where the loss drops.
I implemented the training part with the following line
model.fit([word_target, word_context], labels, epochs=5)
Be warned that this can take a long time depending on how large the corpus is. The train_on_batch function gives you more control in training and you can vary the batch size or select samples you choose at every step of the training.

modify projection layer for seq2seq in tensorflow - python

I am trying to modify the projection layer of my NMT (neural machine translation) model. I want to be able to update the number of units without reinitializing all of the weights. I followed the tutorial from the tensorflow NMT tutorial found here. Here is the code for my decoder:
# Decoder
train_decoder = tf.contrib.seq2seq.BasicDecoder(
decoder_cell, train_helper, decoder_initial_state)
maximum_iterations = tf.round(tf.reduce_max(encoder_input_lengths) * 2)
# Dynamic decoding
train_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(train_decoder)
# Projection layer -- THIS IS WHAT I WANT TO MODIFY
projection_layer = layers_core.Dense(
len(language_base.vocabulary), use_bias=False)
train_logits = projection_layer(train_outputs.rnn_output)
train_crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
labels=decoder_outputs, logits=train_logits)
# Target weights
target_weights = tf.sequence_mask(
decoder_input_lengths, params.tgt_max_len, dtype=train_logits.dtype)
target_weights = tf.transpose(target_weights)
# Loss function
train_loss = (tf.reduce_sum(train_crossent * target_weights) /
tf.to_float(params.batch_size))
# Calculate and clip gradients
train_vars = tf.trainable_variables()
gradients = tf.gradients(train_loss, train_vars)
clipped_gradients, _ = tf.clip_by_global_norm(
gradients, params.max_gradient_norm)
# Optimization
optimizer = tf.train.AdamOptimizer(params.learning_rate)
update_step = optimizer.apply_gradients(
zip(clipped_gradients, train_vars))
Tensorflow doesn't really let you change the shape of a variable (which is related to the number of units here) without some effort.
Instead, you're better off preallocating a larger-than-usual number of units and masking the units you don't use to 0 during the early stages of training, and update your mask as you go (use a variable to store the mask to make updating it easier).

Seq2Seq in TensorFlow without embeddings

I'm trying to create a very basic multivariate time series auto-encoder.
I want to be able to reconstruct the exact two signals I pass in.
Most of the references I'm looking at are using older versions of APIs or use embeddings.
I'm trying to use the latest higher level APIs, but its not obvious how you cobble them together.
class Seq2SeqConfig():
def __init__(self):
self.sequence_length = 15 # num of time steps
self.hidden_units = 64 # ?
self.num_features = 2
self.batch_size = 10
# Expect input as batch major.
encoder_inputs = tf.placeholder(shape=(None, config.sequence_length, config.num_features), dtype=tf.float32)
decoder_inputs = tf.placeholder(shape=(None, config.sequence_length, config.num_features), dtype=tf.float32))
# Convert inputs to time major
encoder_inputs_tm = tf.transpose(encoder_inputs, [1, 0, 2])
decoder_inputs_tm = tf.transpose(decoder_inputs, [1, 0, 2])
# setup encoder
encoder_cell = tf.contrib.rnn.LSTMCell(config.hidden_units)
encoder_outputs, encoder_final_state = tf.nn.dynamic_rnn(
cell=encoder_cell,
inputs=encoder_inputs_tm,
time_major=True)
# setup decoder
decoder_cell = tf.contrib.rnn.LSTMCell(config.hidden_units)
# The sequence length is mandatory. Not sure what the expectation is here?
helper = tf.contrib.seq2seq.TrainingHelper(
decoder_inputs_tm,
sequence_length=tf.constant(config.sequence_length, dtype=tf.int32, shape=[config.batch_size]),
time_major=True)
decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, encoder_final_state)
decoder_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, output_time_major=True)
# loss calculation
loss_op = tf.reduce_mean(tf.square(decoder_outputs.rnn - decoder_targets_tm)
The loss operation fails because the shapes are different.
decoder_targets is (?, 15, 2) and decoder_outputs.rnn is (?, ?, 64).
Question 1:
Am I missing an operation somewhere where I reshape the decoder output?
I loosely followed this tensorflow tutorial: https://www.tensorflow.org/tutorials/seq2seq
There is a projection_layer operation passed into the basic decoder. Is that the purpose of this?
projection_layer = layers_core.Dense(tgt_vocab_size, use_bias=False)
I don't see a layers_core.Dense() function anywhere. I assume its deprecated or internal.
Question 2:
Which helper does one use for Inference when not using embeddings?
Question 3:
What would the ideal size of the hidden units be?
I assume because we want to reduce the dimensions in the latent space, it needs to be less that the size of the inputs. How does that translate when you have a input with sequence length = 15 and number of features = 2.
Should the number of hidden units be < 15, < 2 or < 15 *2?
Figured out the answer to Question 1
from tensorflow.python.layers.core import Dense
output_layer = Dense(config.num_features)
decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, encoder_final_state, output_layer)
Reference: https://github.com/udacity/deep-learning/blob/master/seq2seq/sequence_to_sequence_implementation.ipynb
Other two questions still stand.
Regarding question 3: I suggest you run several training and validation cycles with different hyperparameters to find what works best for your data and requirements. You can take a look at my implementation here (https://github.com/fritzfitzpatrick/tensorflow-seq2seq-generic-example) where I have built a very simple training & validation loop that stops once the validation loss has not gone down for a number of cycles to prevent overfitting.
Regarding question 2: I am still working on a CustomHelper implementation at the moment, and it looks like it is going somewhere. You can find the full sample code here (https://github.com/fritzfitzpatrick/tensorflow-seq2seq-generic-example/blob/master/tensorflow_custom_helper.ipynb).
batch_size = 5
features_dec_inp = 2 # number of features in target sequence
go_token = 2
end_token = 3
sess = tf.InteractiveSession()
def initialize_fn():
finished = tf.tile([False], [batch_size])
start_inputs = tf.fill([batch_size, features_dec_inp], go_token)
return (finished, start_inputs)
def next_inputs_fn(time, outputs, state, sample_ids):
del time, sample_ids
# finished needs to update after last step.
# one could use conditional logic based on sequence length
# if sequence length is known in advance
finished = tf.tile([False], [batch_size])
# next inputs should be the output of the dense layer
# unless the above finished logic returns [True]
# in which case next inputs can be anything in the right shape
next_inputs = tf.fill([batch_size, features_dec_inp], 0.5)
return (finished, next_inputs, state)
helper = tf.contrib.seq2seq.CustomHelper(
initialize_fn = initialize_fn,
sample_fn = tf.identity,
next_inputs_fn = next_inputs_fn)
print(helper)
Regarding question 1: This is the code that I am using to reduce the dimensionality of my decoder output to the number of features in my target sequence:
train_output_dense = tf.layers.dense(
train_dec_out_logits, # [batch_size x seq_length x num_units]
features_dec_exp_out) # [batch_size x seq_length x num_target_features]

Compute updates in Theano after N number of loss calculations

I've constructed a LSTM recurrent NNet using lasagne that is loosely based on the architecture in this blog post. My input is a text file that has around 1,000,000 sentences and a vocabulary of 2,000 word tokens. Normally, when I construct networks for image recognition my input layer will look something like the following:
l_in = nn.layers.InputLayer((32, 3, 128, 128))
(where the dimensions are batch size, channel, height and width) which is convenient because all the images are the same size so I can process them in batches. Since each instance in my LSTM network has a varying sentence length, I have an input layer that looks like the following:
l_in = nn.layers.InputLayer((None, None, 2000))
As described in above referenced blog post,
Masks:
Because not all sequences in each minibatch will always have the same length, all recurrent layers in
lasagne
accept a separate mask input which has shape
(batch_size, n_time_steps)
, which is populated such that
mask[i, j] = 1
when
j <= (length of sequence i)
and
mask[i, j] = 0
when
j > (length
of sequence i)
.
When no mask is provided, it is assumed that all sequences in the minibatch are of length
n_time_steps.
My question is: Is there a way to process this type of network in mini-batches without using a mask?
Here is a simplified version if my network.
# -*- coding: utf-8 -*-
import theano
import theano.tensor as T
import lasagne as nn
softmax = nn.nonlinearities.softmax
def build_model():
l_in = nn.layers.InputLayer((None, None, 2000))
lstm = nn.layers.LSTMLayer(l_in, 4096, grad_clipping=5)
rs = nn.layers.SliceLayer(lstm, 0, 0)
dense = nn.layers.DenseLayer(rs, num_units=2000, nonlinearity=softmax)
return l_in, dense
model = build_model()
l_in, l_out = model
all_params = nn.layers.get_all_params(l_out)
target_var = T.ivector("target_output")
output = nn.layers.get_output(l_out)
loss = T.nnet.categorical_crossentropy(output, target_var).sum()
updates = nn.updates.adagrad(loss, all_params, 0.005)
train = theano.function([l_in.input_var, target_var], cost, updates=updates)
From there I have generator that spits out (X, y) pairs and I am computing train(X, y) and updating the gradient with each iteration. What I want to do is do an N number of training steps and then update the parameters with the average gradient.
To do this, I tried creating a compute_gradient function:
gradient = theano.grad(loss, all_params)
compute_gradient = theano.function(
[l_in.input_var, target_var],
output=gradient
)
and then looping over several training instances to create a "batch" and collect the gradient calculations to a list:
grads = []
for _ in xrange(1024):
X, y = train_gen.next() # generator for producing training data
grads.append(compute_gradient(X, y))
this produces a list of lists
>>> grads
[[<CudaNdarray at 0x7f83b5ff6d70>,
<CudaNdarray at 0x7f83b5ff69f0>,
<CudaNdarray at 0x7f83b5ff6270>,
<CudaNdarray at 0x7f83b5fc05f0>],
[<CudaNdarray at 0x7f83b5ff66f0>,
<CudaNdarray at 0x7f83b5ff6730>,
<CudaNdarray at 0x7f83b5ff6b70>,
<CudaNdarray at 0x7f83b5ff64f0>] ...
From here I would need to take the mean of the gradient at each layer, and then update the model parameters. This is possible to do in pieces like this does does the gradient calc/parameter update need to happen all in one theano function?
Thanks.
NOTE: this is a solution, but by no means do i have enough experience to verify its best and the code is just a sloppy example
You need 2 theano functions. The first being the grad one you seem to have already judging from the information provided in your question.
So after computing the batched gradients you want to immediately feed them as an input argument back into another theano function dedicated to updating the shared variables. For this you need to specify the expected batch size at the compile time of your neural network. so you could do something like this: (for simplicity i will assume you have a global list variable where all your params are stored)
params #list of params you wish to update
BATCH_SIZE = 1024 #size of the expected training batch
G = [T.matrix() for i in range(BATCH_SIZE) for param in params] #placeholder for grads result flattened so they can be fed into a theano function
updates = [G[i] for i in range(len(params))] #starting with list of param updates from first batch
for i in range(len(params)): #summing the gradients for each individual param
for j in range(1, len(G)/len(params)):
updates[i] += G[i*BATCH_SIZE + j]
for i in range(len(params)): #making a list of tuples for theano.function updates argument
updates[i] = (params[i], updates[i]/BATCH_SIZE)
update = theano.function([G], 0, updates=updates)
Like this theano will be taking the mean of the gradients and updating the params as usual
dont know if you need to flatten the inputs as I did, but probably
EDIT: gathering from how you edited your question it seems important that the batch size can vary in that case you could add 2 theano functions to your existing one:
the first theano function takes a batch of size 2 of your params and returns the sum. you could apply this theano function using python's reduce() and get the sum of the over the whole batch of gradients
the second theano function takes those summed param gradients and a scaler (the batch size) as input and hence is able to update the NN params over the mean of the summed gradients.

Issue with gradient calculation in a Neural Network (stuck at 7% error in MNIST)

Hi I am having an issue with my calculation of checking the gradient when implementing a neural network in python using numpy.
I am using mnist dataset to try and trying to using mini-batch gradient descent.
I have check the math and on paper look good so maybe you can give me a hint of what's happening here:
EDIT: One answer made me realize that indeed the cost function was being calculated wrong. Howerver that does not explain the problem with the gradient as it is calculated using back_prop. I get %7 error rate using 300 units in the hidden layer using minibatch gradient descent with rmsprop, 30 epochs and 100 batches. (learning_rate = 0.001, small due to the rmsprop).
each input is has 768 features so for a 100 samples I have a matrix. Mnist has 10 classes.
X = NoSamplesxFeatures = 100x768
Y = NoSamplesxClasses = 100x10
I am using a one hidden layer neural network with hidden layer size of 300 when fully training. Another question I have is whether I should use a softmax output function for calculating the error... which I think not. But I am kinda newbie to all of this and the obvious might seem strange to me.
(NOTE: I know the code is ugly, but this is my first Python/Numpy code done under pressure, bear with me)
Here is back_prof and activations:
def sigmoid(z):
return np.true_divide(1,1 + np.exp(-z) )
#not calculated really - this the fake version to make it faster.
def sigmoid_prime(a):
return (a)*(1 - a)
def _back_prop(self,W,X,labels,f=sigmoid,fprime=sigmoid_prime,lam=0.001):
"""
Calculate the partial derivates of the cost function using backpropagation.
"""
#Weight for first layer and hidden layer
Wl1,bl1,Wl2,bl2 = self._extract_weights(W)
# get the forward prop value
layers_outputs = self._forward_prop(W,X,f)
#from a number make a binary vector, for mnist 1x10 with all 0 but the number.
y = self.make_1_of_c_encoding(labels)
num_samples = X.shape[0] # layers_outputs[-1].shape[0]
# Dot product return Numsamples (N) x Outputs (No CLasses)
# Y is NxNo Clases
# Layers output to
big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)
# calculate the gradient for each training sample in the batch and accumulate it
for i,x in enumerate(X):
# Error with respect the output
dE_dy = layers_outputs[-1][i,:] - y[i,:]
# bias hidden layer
big_delta_bl2 += dE_dy
# get the error for the hiddlen layer
dE_dz_out = dE_dy * fprime(layers_outputs[-1][i,:])
#and for the input layer
dE_dhl = dE_dy.dot(Wl2.T)
#bias input layer
big_delta_bl1 += dE_dhl
small_delta_hl = dE_dhl*fprime(layers_outputs[-2][i,:])
#here calculate the gradient for the weights in the hidden and first layer
big_delta_wl2 += np.outer(layers_outputs[-2][i,:],dE_dz_out)
big_delta_wl1 += np.outer(x,small_delta_hl)
# divide by number of samples in the batch (should be done here)?
big_delta_wl2 = np.true_divide(big_delta_wl2,num_samples) + lam*Wl2*2
big_delta_bl2 = np.true_divide(big_delta_bl2,num_samples)
big_delta_wl1 = np.true_divide(big_delta_wl1,num_samples) + lam*Wl1*2
big_delta_bl1 = np.true_divide(big_delta_bl1,num_samples)
# return
return np.concatenate([big_delta_wl1.ravel(),
big_delta_bl1,
big_delta_wl2.ravel(),
big_delta_bl2.reshape(big_delta_bl2.size)])
Now the feed_forward:
def _forward_prop(self,W,X,transfer_func=sigmoid):
"""
Return the output of the net a Numsamples (N) x Outputs (No CLasses)
# an array containing the size of the output of all of the laye of the neural net
"""
# Hidden layer DxHLS
weights_L1,bias_L1,weights_L2,bias_L2 = self._extract_weights(W)
# Output layer HLSxOUT
# A_2 = N x HLS
A_2 = transfer_func(np.dot(X,weights_L1) + bias_L1 )
# A_3 = N x Outputs
A_3 = transfer_func(np.dot(A_2,weights_L2) + bias_L2)
# output layer
return [A_2,A_3]
And the cost function for the gradient checking:
def cost_function(self,W,X,labels,reg=0.001):
"""
reg: regularization term
No weight decay term - lets leave it for later
"""
outputs = self._forward_prop(W,X,sigmoid)[-1] #take the last layer out
sample_size = X.shape[0]
y = self.make_1_of_c_encoding(labels)
e1 = np.sum((outputs - y)**2, axis=1))*0.5
#error = e1.sum(axis=1)
error = e1.sum()/sample_size + 0.5*reg*(np.square(W)).sum()
return error
What kind of results are you getting when you run gradient checking? Often times you can tease out the nature of the implementation error by looking at the output of your gradient vs the output produced by gradient checking.
Furthermore, square error is usually a poor choice for a classification task such as MNIST and I would suggest using either a simple sigmoid top-layer or a softmax. With sigmoid the cross entropy function you want to use is:
L(h,Y) = -Y*log(h) - (1-Y)*log(1-h)
For a softmax
L(h,Y) = -sum(Y*log(h))
where Y is the target given as a 1x10 vector and h is your predicted value, but easily extends to arbitrary batch sizes.
In both cases the top-layer delta simply becomes:
delta = h - Y
And the top-layer gradient becomes:
grad = dot(delta, A_in)
Where A_in is the input into the top layer from the previous layer.
While I am having some trouble getting my head around your backprop routine, I suspect from your code that the error in gradient is due to the fact that you are not calculating the top-level dE/dw_l2 correctly when using square error, along with computing fprime on the incorrect input.
When using square error the top layer delta should be:
delta = (h - Y) * fprime(Z_l2)
Here Z_l2 is the input into your transfer function for layer 2. Similarly when computing fprime for the lower layers, you want to use the input to your transfer function (i.e. dot(X,weights_L1) + bias_L1)
Hope that helps.
EDIT:
As some added justification for using cross entropy error over square error I would suggest looking up Geoffrey Hinton's lectures on linear classification methods:
www.cs.toronto.edu/~hinton/csc2515/notes/lec3.ppt
EDIT2:
I ran some tests locally with my implementation of neural nets on the MNIST dataset with different parameters and 1 hidden layer using RMSPROP. Here are the results:
Test1
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.001
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 6.1%
Test Error: 6.9%
Test2
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.000002
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 4.5%
Test Error: 5.7%
It already appears that if you decrease your lambda parameter by a couple orders of magnitude you should end up with better performance.

Categories