I'm implementing a Restricted Boltzmann Machine with Rectified Linear Units. I haven't found a simple implementation anywhere, so I wanted to ask if somebody would kindly verify the design.
Here is the CD1 calculation:
def propup(self, vis):
    activation = numpy.dot(vis, self.W) + self.hbias
    # ReLU activation of hidden units
    return activation * (activation > 0)

def sample_h_given_v(self, v0_sample):
    h1_mean = self.propup(v0_sample)
    # Sampling from a rectified Normal distribution
    h1_sample = numpy.maximum(0, h1_mean + numpy.random.normal(0, sigmoid(h1_mean)))
    return [h1_mean, h1_sample]

def propdown(self, hid):
    activation = numpy.dot(hid, self.W.T) + self.vbias
    return sigmoid(activation)

def sample_v_given_h(self, h0_sample):
    v1_mean = self.propdown(h0_sample)
    v1_sample = self.numpy_rng.binomial(size=v1_mean.shape, n=1, p=v1_mean)
    return [v1_mean, v1_sample]
This is how I calculate the gradient:
def get_cost_updates(self, lr, decay, mom, l1_penalty, p_noise, epoch, persistent=None, k=1):
    ph_mean, ph_sample = self.sample_h_given_v(self.input)
    nv_means, nv_samples, nh_means, nh_samples = self.gibbs_hvh(ph_sample)
    # positive phase minus negative phase
    W_grad = numpy.dot(self.input.T, ph_mean) - numpy.dot(nv_samples.T, nh_means)
    vbias_grad = numpy.mean(self.input - nv_samples, axis=0)
    hbias_grad = numpy.mean(ph_mean - nh_means, axis=0)
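For reference, gibbs_hvh is referenced above but not shown; it is presumably just one down-up Gibbs step built from the two samplers (a minimal sketch, matching the four return values used above):

def gibbs_hvh(self, h0_sample):
    # one half-step down to the visible units and one back up to the hidden units
    v1_mean, v1_sample = self.sample_v_given_h(h0_sample)
    h1_mean, h1_sample = self.sample_h_given_v(v1_sample)
    return [v1_mean, v1_sample, h1_mean, h1_sample]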
My question is, how do I layer these into a DBN?
The aim is to construct an autoencoder, but I'm not sure how to handle the visible units also being real number variables in the second layer.
I can see this question was asked some time ago, but since there is no answer, I will add mine.
A DBN, as you wrote, is trained with a greedy layer-wise algorithm that treats each layer as if it were an RBM. I actually gave a lecture about it recently, and you can find a presentation with a numeric example here: https://www.slideshare.net/mobile/AvnerGidron/generative-models/AvnerGidron/generative-models
I think that once you understand the presentation, it shouldn't take long to do it yourself.
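To make the stacking concrete, here is a minimal sketch of greedy layer-wise pretraining. The RBM class and its contrastive_divergence method are hypothetical stand-ins for the class your methods belong to; the key point is that each trained layer's hidden activations become the "visible" data for the next RBM, so with ReLU hiddens the deeper layers see non-negative real-valued inputs and need a real-valued visible model (e.g. Gaussian visible units) rather than a binary one.

# Sketch only: RBM and contrastive_divergence are hypothetical names
def pretrain_dbn(data, hidden_sizes, epochs=10, lr=0.01):
    rbms = []
    layer_input = data
    for n_hidden in hidden_sizes:
        rbm = RBM(n_visible=layer_input.shape[1], n_hidden=n_hidden)
        for _ in range(epochs):
            rbm.contrastive_divergence(layer_input, lr=lr)  # CD-1 as in the code above
        rbms.append(rbm)
        # hidden activations of this layer become the next layer's visible data
        layer_input = rbm.propup(layer_input)
    return rbms

For an autoencoder, the pretrained stack is then unrolled: the learned weights initialize the encoder, their transposes initialize the decoder, and the whole network is fine-tuned with backpropagation.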
I am trying to implement an LSTM VAE (following this example I found), but also have it accept variable-length sequences using Masking layers. I tried to combine the above code with the ideas from this SO question, which seems to deal with it the "best way" by cropping the gradients to get as accurate a loss as possible. However, my implementation does not seem to be able to reproduce sequences on a small set of data. I am thus relatively confident that there is something amiss with my implementation, but I cannot seem to pinpoint what exactly is wrong. The relevant part is here:
x = Input(shape=(None, input_dim))
x_masked = Masking(mask_value=0.0, input_shape=(None, input_dim))(x)

h = LSTM(intermediate_dim)(x_masked)
z_mean = Dense(latent_dim)(h)
z_log_sigma = Dense(latent_dim)(h)

def sampling(args):
    z_mean, z_log_sigma = args
    epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0., stddev=epsilon_std)
    return z_mean + z_log_sigma * epsilon

z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_sigma])

decoder_h = LSTM(intermediate_dim, return_sequences=True)
decoder_mean = LSTM(input_dim, return_sequences=True)

h_decoded = RepeatVector(max_timesteps)(z)
h_decoded = decoder_h(h_decoded)
x_decoded_mean = decoder_mean(h_decoded)

def crop_outputs(x):
    # zero out the decoder output wherever the (zero-padded) input is zero
    padding = K.cast(K.not_equal(x[1], 0), dtype=K.floatx())
    return x[0] * padding

x_decoded_mean = Lambda(crop_outputs, output_shape=(max_timesteps, input_dim))([x_decoded_mean, x])

vae = Model(x, x_decoded_mean)

def vae_loss(x, x_decoded_mean):
    xent_loss = objectives.mse(x, x_decoded_mean)
    kl_loss = -0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma))
    loss = xent_loss + kl_loss
    return loss

vae.compile(optimizer='adam', loss=vae_loss)
# Here, X is variable length time series data of shape
# (num_examples, max_timesteps, input_dim) and is zero padded
# on the right for all the examples of length less than max_timesteps
# X has been appropriately scaled using the StandardScaler.
vae.fit(X, X, epochs = num_epochs, batch_size=batch_size)
As always, any help is much appreciated. Thank you!
I came across your question while looking to do exactly the same thing. I gave up on VAEs, but found a solution to apply masking to layers that don't support it. What I did was predefine a binary mask (you can do this with numpy, see Code 1) and then multiply my output by the mask. During backpropagation, the gradient flows through the multiplication, so the padded positions simply propagate nothing. It is not as clever as the Masking layer in Keras, but it did the job for me.
# Code 1
# making a numpy binary mask
# expecting a sequence with shape (Time_Steps, Features)
# let's say that my sequence has Features = 10 and a max_Length of 15
import numpy as np

max_Len = 15
seq = np.linspace(0, 1, 100).reshape((10, 10))

# You must pad/truncate the sequence here
mask = np.concatenate([np.ones(seq.shape[0]), np.zeros(max_Len - seq.shape[0])], axis=-1)

# This mask can be fed to the model as an extra input afterwards
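To show how the precomputed mask is then used, here is a minimal self-contained sketch (a plain seq2seq autoencoder rather than the VAE; layer sizes and names are illustrative): the mask enters as a second input and zeroes out the padded time steps of the decoder output before the loss is computed.

import keras.backend as K
from keras.layers import Input, LSTM, Dense, Lambda, RepeatVector, TimeDistributed
from keras.models import Model

max_Len, n_features = 15, 10

seq_in = Input(shape=(max_Len, n_features))
mask_in = Input(shape=(max_Len,))            # the binary mask built in Code 1

h = LSTM(32)(seq_in)
dec = RepeatVector(max_Len)(h)
dec = LSTM(32, return_sequences=True)(dec)
dec = TimeDistributed(Dense(n_features))(dec)

# broadcast the (batch, max_Len) mask over the feature axis and apply it,
# so padded steps contribute nothing to the loss
masked = Lambda(lambda t: t[0] * K.expand_dims(t[1], -1))([dec, mask_in])

model = Model([seq_in, mask_in], masked)
model.compile(optimizer='adam', loss='mse')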
A few considerations:
1- It resulted in a weak regression model. I don't know the impact on VAEs, since I never tested it, but I think it will generate a lot of noise.
2- The computational resource demand went up, so it is worth estimating the cost of propagating and backpropagating this workaround (or "gambiarra", as we say here) if you are on a budget like me.
3- It won't solve the problem completely; you could delve deeper into this and implement a more stable solution using pure TensorFlow.
4- A more "accurate" solution would be to implement a custom masking layer (code 2).
Regarding point 4, it is easy: define a regular layer, have its call function receive a mask, and output the multiplication of the mask and the input. Like this:
# Code 2
import numpy as np
import tensorflow as tf

class MyCoolMaskingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(MyCoolMaskingLayer, self).__init__(**kwargs)
        # init stuff here

    def compute_mask(self, inputs, mask=None):
        # pass the mask on to downstream layers unchanged
        return mask

    def call(self, input, mask=None):
        # broadcast the boolean mask over the feature axis and zero out masked steps
        bc_mask = tf.expand_dims(tf.cast(mask, "float32"), -1) if mask is not None else np.asarray([[1]])
        return input * bc_mask
This might not work for you; it is really problem-specific and comes from a noob (me), but it worked in my case. I just cannot share the entire code because my Master's tutor doesn't allow it.
(A little bit of context: I wrap it in a TimeDistributed so that each time step of an LSTM output is individually processed by this masking layer, because inside call I perform some transformations on the data.)
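For illustration only (a simplified sketch, not the full pipeline described above), this is roughly how the layer would be wired in: Masking creates the mask, the LSTM propagates it, and the custom layer re-applies it to the LSTM output.

import tensorflow as tf

inp = tf.keras.layers.Input(shape=(None, 10))   # arbitrary feature size
x = tf.keras.layers.Masking(mask_value=0.0)(inp)
x = tf.keras.layers.LSTM(32, return_sequences=True)(x)
out = MyCoolMaskingLayer()(x)   # zeroes the masked time steps of the LSTM output

model = tf.keras.Model(inp, out)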
I am computing the forward Jacobian (derivative of outputs with respect to inputs) of a 2-layer feedforward neural network in PyTorch, and my results are correct but relatively slow. Given the nature of the calculation, I would expect it to be approximately as fast as a forward pass through the network (or maybe 2-3x as long), but it takes ~12x as long to run an optimization step on this routine (in my test example I just want the Jacobian = 1 at all points) vs the standard mean squared error, so I assume I am doing something in a suboptimal manner. I'm just wondering if anyone knows a faster way to code this. My test network has 2 input nodes, followed by 2 hidden layers of 5 nodes each and an output layer of 2 nodes, and uses tanh activation functions on the hidden layers, with a linear output layer.
The Jacobian calculations are based on the paper The Limitations of Deep Learning in Adversarial Settings which gives a basic recursive definition of the forward derivative (basically you end up multiplying the derivative of your activation functions with the weights and previous partial derivatives of each layer). This is very similar to forward propagation, which is why I would expect it to be faster than it is. Then the determinant of the 2x2 jacobian at the end is pretty straightforward.
Below is the code for the network and the jacobian
class Network(torch.nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.h_1_1 = torch.nn.Linear(input_1, hidden_1)
        self.h_1_2 = torch.nn.Linear(hidden_1, hidden_2)
        self.out = torch.nn.Linear(hidden_2, out_1)

    def forward(self, x):
        x = F.tanh(self.h_1_1(x))
        x = F.tanh(self.h_1_2(x))
        x = self.out(x)
        return x

    def jacobian(self, x):
        a = self.h_1_1.weight
        x = F.tanh(self.h_1_1(x))
        tanh_deriv_tensor = 1 - (x ** 2)
        expanded_deriv = tanh_deriv_tensor.unsqueeze(-1).expand(-1, -1, input_1)
        partials = expanded_deriv * a.expand_as(expanded_deriv)

        a = torch.matmul(self.h_1_2.weight, partials)
        x = F.tanh(self.h_1_2(x))
        tanh_deriv_tensor = 1 - (x ** 2)
        expanded_deriv = tanh_deriv_tensor.unsqueeze(-1).expand(-1, -1, out_1)
        partials = expanded_deriv * a

        partials = torch.matmul(self.out.weight, partials)

        determinant = partials[:, 0, 0] * partials[:, 1, 1] - partials[:, 0, 1] * partials[:, 1, 0]
        return determinant
and here are the two error functions being compared. Note that the first one requires an extra forward call through the network, to get the output values (labeled action) while the second function does not since it works on the input values.
def actor_loss_fcn1(action, target):
    loss = ((action - target) ** 2).mean()
    return loss

def actor_loss_fcn2(input):  # 12x slower
    jacob = model.jacobian(input)
    loss = ((jacob - 1) ** 2).mean()
    return loss
Any insight on this would be greatly appreciated
The second calculation of 'a' takes the most time on my machine (cpu).
# Here you increase the size of the matrix with a factor of "input_1"
expanded_deriv = tanh_deriv_tensor.unsqueeze(-1).expand(-1, -1, input_1)
partials = expanded_deriv * a.expand_as(expanded_deriv)
# Here your torch.matmul() needs to handle "input_1" times more computations than in a normal forward call
a = torch.matmul(self.h_1_2.weight, partials)
On my machine the time of computing the Jacobian is roughly the time it takes torch to compute
a = torch.rand(hidden_1, hidden_2)
b = torch.rand(n_inputs, hidden_1, input_1)
%timeit torch.matmul(a,b)
I don't think it is possible to speed this up computationally, unless you can move from CPU to GPU, since GPUs handle large matrices much better.
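As a correctness cross-check (not necessarily a speed-up), recent PyTorch versions can compute the same quantity with torch.autograd.functional.jacobian. A minimal sketch for a single input point, using an arbitrary stand-in network with the question's 2-5-5-2 shape:

import torch

# stand-in for the question's Network: 2 -> 5 -> 5 -> 2, tanh hidden layers, linear output
net = torch.nn.Sequential(
    torch.nn.Linear(2, 5), torch.nn.Tanh(),
    torch.nn.Linear(5, 5), torch.nn.Tanh(),
    torch.nn.Linear(5, 2),
)

x = torch.rand(2)                                # a single input point
J = torch.autograd.functional.jacobian(net, x)   # shape (2, 2): d(outputs)/d(inputs)
det = J[0, 0] * J[1, 1] - J[0, 1] * J[1, 0]      # same determinant as the manual routine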
My question is similar to the one posed here:
keras combining two losses with adjustable weights
However, the outputs have different dimensionalities, so they cannot be concatenated; hence that solution is not applicable. Is there another way to solve this problem?
The question:
I have a keras functional model with two layers with outputs x1 and x2.
x1 = Dense(1,activation='relu')(prev_inp1)
x2 = Dense(2,activation='relu')(prev_inp2)
I need to use x1 and x2 in a weighted loss function like the one in the attached image, and propagate the 'same loss' into both branches. Alpha should be free to vary across iterations.
For this question, a more elaborated solution is necessary. Since we're going to use a trainable weight, we will need a custom layer.
Also, we will need a different form of training, since our loss doesn't work like the usual ones, which take only y_true and y_pred; this one joins two different outputs.
Thus, we're going to create two versions of the same model, one for prediction, another for training, and the training version will contain the loss in itself, using a dummy keras loss function in compilation.
The prediction model
Let's use a very basic example of model with two outputs and one input:
#any input your true model takes
inp = Input((5,5,2))
#represents the localization output
outImg = Conv2D(1,3,activation='sigmoid')(inp)
#represents the classification output
outClass = Flatten()(inp)
outClass = Dense(2,activation='sigmoid')(outClass)
#the model
predictionModel = Model(inp, [outImg,outClass])
You use this one regularly for predictions. It's not necessary to compile this one.
The losses for each branch
Now, let's create custom loss functions for each branch, one for LossCls and another for LossLoc.
I'm using dummy examples here; you can elaborate these losses further if necessary. The most important thing is that they output batches shaped like (batch, 1) or (batch,), and that both output the same shape so they can be summed later.
def calcImgLoss(x):
    true, pred = x
    loss = binary_crossentropy(true, pred)
    return K.mean(loss, axis=[1, 2])

def calcClassLoss(x):
    true, pred = x
    return binary_crossentropy(true, pred)
These will be used in Lambda layers in the training model.
The loss weighting layer - (WARNING! EDITED! - See explanation at the end)
Now, let's weight the losses with the trainable alpha. Trainable parameters need custom layers to be implemented.
class LossWeighter(Layer):
    def __init__(self, **kwargs): #kwargs can have 'name' and other things
        super(LossWeighter, self).__init__(**kwargs)

    #create the trainable weight here, notice the constraint between 0 and 1
    def build(self, inputShape):
        self.weight = self.add_weight(name='loss_weight',
                                      shape=(1,),
                                      initializer=Constant(0.5),
                                      constraint=Between(0, 1),
                                      trainable=True)
        super(LossWeighter, self).build(inputShape)

    def call(self, inputs):
        #old answer: will always tend to completely ignore the biggest loss
        #return (self.weight * firstLoss) + ((1-self.weight)*secondLoss)
        #problem: alpha tends to 0 or 1, eliminating the biggest of the two losses

        #proposal of working alpha optimization
        #return K.square((self.weight * firstLoss) - ((1-self.weight)*secondLoss))
        #problem: might not train any of the losses, and even increase one of them
        #in order to minimize the difference between the two losses

        #new answer - a mix between the two, applying gradients to the right weights
        loss1, loss2 = inputs                 #trainable
        static_loss1 = K.stop_gradient(loss1) #non_trainable
        static_loss2 = K.stop_gradient(loss2) #non_trainable

        a1 = self.weight                      #trainable
        a2 = 1 - a1                           #trainable
        static_a1 = K.stop_gradient(a1)       #non_trainable
        static_a2 = 1 - static_a1             #non_trainable

        #this trains only alpha to minimize the difference between both losses
        alpha_loss = K.square((a1 * static_loss1) - (a2 * static_loss2))
        #or K.abs (.....)

        #this trains only the original model weights to minimize both original losses
        model_loss = (static_a1 * loss1) + (static_a2 * loss2)

        return alpha_loss + model_loss

    def compute_output_shape(self, inputShape):
        return inputShape[0]
Notice that there is a custom constraint to keep this weight between 0 and 1. This constraint is implemented with:
class Between(Constraint):
    def __init__(self, min_value, max_value):
        self.min_value = min_value
        self.max_value = max_value

    def __call__(self, w):
        return K.clip(w, self.min_value, self.max_value)

    def get_config(self):
        return {'min_value': self.min_value,
                'max_value': self.max_value}
The training model
This model will take the prediction model as base, add the loss calculations and loss weighter at the end and output only the loss value. Because it outputs only a loss, we will use the true targets as inputs, and a dummy loss function defined like:
def ignoreLoss(true, pred):
    return pred #this just tries to minimize the prediction without any extra computation
Model inputs:
#true targets
trueImg = Input((3,3,1))
trueClass = Input((2,))
#predictions from the prediction model
predImg = predictionModel.outputs[0]
predClass = predictionModel.outputs[1]
Model outputs = losses:
imageLoss = Lambda(calcImgLoss, name='loss_loc')([trueImg, predImg])
classLoss = Lambda(calcClassLoss, name='loss_cls')([trueClass, predClass])
weightedLoss = LossWeighter(name='weighted_loss')([imageLoss,classLoss])
Model:
trainingModel = Model([predictionModel.input, trueImg, trueClass], weightedLoss)
trainingModel.compile(optimizer='sgd', loss=ignoreLoss)
Dummy training
inputImages = np.zeros((7,5,5,2))
outputImages = np.ones((7,3,3,1))
outputClasses = np.ones((7,2))
dummyOut = np.zeros((7,))
trainingModel.fit([inputImages,outputImages,outputClasses], dummyOut, epochs = 50)
predictionModel.predict(inputImages)
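After fitting, the learned alpha can be read back from the weighting layer (a quick check, using the layer name defined above):

#read the trained alpha (constrained to [0, 1]) out of the LossWeighter
alpha = trainingModel.get_layer('weighted_loss').get_weights()[0]
print(alpha)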
Necessary imports
from keras.layers import *
from keras.models import Model
from keras.constraints import Constraint
from keras.initializers import Constant
from keras.losses import binary_crossentropy #or another you need
(EDIT) Explaining the problem with the old answer:
The formula used in the old answer would make alpha always go to 0 or 1, meaning only the smallest of the two losses would ever be trained. (Useless.)
The second formula leads alpha to make both losses have the same value. Alpha would be trained properly and not tend to 0 or 1. But the losses would still not be properly trained, because "increasing one loss to reach the other" would be an option for the model, and once both losses were equal, the model would stop training.
The new solution is a mix of the two proposals above: the first actually trains the losses, but with the wrong alpha, and the second trains alpha, but with the wrong losses. The mixed solution adds both, and uses K.stop_gradient to prevent the wrong part of each from being trained.
The result will be that the "easiest" loss (not the biggest) gets trained more than the hardest. We may use K.abs or K.square, analogous to "mae" or "mse" between the two losses; the best option is a matter of experimentation.
See this table comparing the old and new proposals:
This does not guarantee the best optimization though!!!
Training the easiest loss will not always give the best result, though. It may be better than favoring a huge loss just because its formula is different, but the expected result might still need some manual weighting of the losses.
I fear there is no automatic training for this weight. If you have a target metric, you can try to train that metric (when possible; metrics that depend on sorting, getting an index, rounding, or anything else that breaks backpropagation may not be possible to transform into losses).
There is no need to concatenate your outputs. To pass multiple arguments to a loss function, you can wrap it as follows:
def custom_loss(x1, x2, y1, y2, alpha):
    def loss(y_true, y_pred):
        return (1 - alpha) * loss_cls(y1, x1) + alpha * loss_loc(y2, x2)
    return loss
And then compile your functional model as:
x1 = Dense(1, activation='relu')(prev_inp1)
x2 = Dense(2, activation='relu')(prev_inp2)
y1 = Input((1,))
y2 = Input((2,))
model.compile('sgd',
              loss=custom_loss(x1, x2, y1, y2, 0.5),
              target_tensors=[y1, y2])
NOTE: Not tested.
Is there a way in which we can enforce constraint on the prediction of sequences?
Say, if my modeling is as follows:
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='linear')))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['acc'])
Can I somehow capture the constraint that model.predict(x) <= x?
The docs show that we can add constraints to the network weights. However, they do not mention how to express relationships or constraints between input and output.
I've never heard of a built-in way, but there are quite a few ways you can implement that yourself using a functional API model and custom functions.
Below is a possible answer, but first: is this really the best thing to do?
If you're trying to create an autoencoder, you should not care about limiting the outputs; otherwise, your model will not really learn much.
Maybe the best thing to do is simply to normalize the inputs first (between -1 and +1) and use a tanh activation at the end.
Functional API model to preserve the input:
inputTensor = Input((n_timesteps_in, n_features))
out = LSTM(150)(inputTensor)
out = RepeatVector(n_timesteps_in)(out) #this line sounds funny in your model...
out = LSTM(150, return_sequences=True)(out)
out = TimeDistributed(Dense(n_features))(out)
out = Activation(chooseOneActivation)(out)
out = Lambda(chooseACustomFunction)([out, inputTensor])
model = Model(inputTensor, out)
model.compile(...)
Custom limit options
There are countless ways of doing this; here are some examples that may or may not be what you need, but from them you are free to develop anything similar.
The options below limit each individual output to the respective individual input. But you may prefer to confine all outputs to the maximum input instead.
If so, use this: maxInput = K.max(originalInput, axis=1, keepdims=True)
1 - A simple stretched 'tanh':
You can simply define both top and bottom limits by using a tanh (that ranges from -1 to +1) and multiplying it by the inputs.
Use the Activation('tanh') layer, and the following custom function in the Lambda layer:
import keras.backend as K

def stretchedTanh(x):
    originalOutput = x[0]
    originalInput = x[1]
    return K.abs(originalInput) * originalOutput
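Plugged into the template above (a sketch reusing that template's names), this option reads:

out = Activation('tanh')(out)
out = Lambda(stretchedTanh)([out, inputTensor])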
I'm not totally sure this will be a healthy option. If the idea is to create an autoencoder, this model will easily find a solution that outputs all tanh activations as close to 1 as possible, without really looking at the inputs.
2 - Modified 'relu'
First, you could simply clip your outputs based on the inputs, changing a relu activation. Use the Activation('relu')(out) in your model above, and the following custom function in the Lambda layer:
def modifiedRelu(x):
    negativeOutput = (-1) * x[0] #ranging from -infinite to 0
    originalInput = x[1]

    #ranging from -infinite to originalInput
    return negativeOutput + originalInput #needs the same shape between input and output
This may have a downside when everything goes above the limit and backpropagation cannot recover (a problem that can happen with 'relu').
3 - Half linear, half modified tanh
In this case, you don't need the Activation layer, or you can use it as 'linear'.
import keras.backend as K

def halfTanh(x):
    originalOutput = x[0]
    originalInput = x[1]  #assuming all inputs are positive

    #find the positive outputs and get a tensor with 1's at their positions
    positiveOutputs = K.greater(originalOutput, 0)
    positiveOutputs = K.cast(positiveOutputs, K.floatx())

    #now the 1's are at the negative positions
    negativeOutputs = 1 - positiveOutputs

    tanhOutputs = K.tanh(originalOutput)      #function limited to -1 or +1
    tanhOutputs = originalInput * tanhOutputs #raises the limit from 1 to originalInput

    #use the conditions above to select between the negative and the positive side
    return positiveOutputs * tanhOutputs + negativeOutputs * originalOutput
Keras provides an easy way to handle such constraints: we can simply write out = Minimum()([out, input_tensor]).
Complete example
import keras
from keras.layers.merge import Maximum, Minimum
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
n_timesteps_in = 20
n_features=2
input_tensor = keras.layers.Input(shape = [n_timesteps_in, n_features])
out = keras.layers.LSTM(150)(input_tensor)
out = keras.layers.RepeatVector(n_timesteps_in)(out)
out = keras.layers.LSTM(150, return_sequences=True)(out)
out = keras.layers.TimeDistributed(keras.layers.Dense(n_features))(out)
out = Minimum()([out, input_tensor])
model = keras.Model(input_tensor, out )
SVG(model_to_dot(model, show_shapes=True, show_layer_names=True, rankdir='HB').create(prog='dot', format='svg'))
Here's the network structure of the model. It shows how the input and the output are used to compute clamped output.
Hi, I am having an issue with my gradient check when implementing a neural network in Python using numpy.
I am using the MNIST dataset with mini-batch gradient descent.
I have checked the math, and on paper it looks good, so maybe you can give me a hint about what's happening here:
EDIT: One answer made me realize that the cost function was indeed being calculated wrongly. However, that does not explain the problem with the gradient, as it is calculated using back_prop. I get a 7% error rate using 300 units in the hidden layer, mini-batch gradient descent with rmsprop, 30 epochs and 100 batches (learning_rate = 0.001, small due to the rmsprop).
Each input has 768 features, so for 100 samples I have a matrix. MNIST has 10 classes.
X = NoSamplesxFeatures = 100x768
Y = NoSamplesxClasses = 100x10
I am using a neural network with one hidden layer of size 300 when fully training. Another question I have is whether I should use a softmax output function for calculating the error... which I think not. But I am kind of a newbie to all of this, and the obvious might seem strange to me.
(NOTE: I know the code is ugly, but this is my first Python/Numpy code done under pressure, bear with me)
Here is back_prop and the activation functions:
def sigmoid(z):
    return np.true_divide(1, 1 + np.exp(-z))

#not calculated really - this is the fake version to make it faster.
def sigmoid_prime(a):
    return (a)*(1 - a)

def _back_prop(self, W, X, labels, f=sigmoid, fprime=sigmoid_prime, lam=0.001):
    """
    Calculate the partial derivatives of the cost function using backpropagation.
    """
    #Weights for first layer and hidden layer
    Wl1, bl1, Wl2, bl2 = self._extract_weights(W)

    # get the forward prop value
    layers_outputs = self._forward_prop(W, X, f)

    #from a number make a binary vector, for mnist 1x10 with all 0 but the number.
    y = self.make_1_of_c_encoding(labels)
    num_samples = X.shape[0] # layers_outputs[-1].shape[0]

    # Dot product returns Numsamples (N) x Outputs (No Classes)
    # Y is N x No Classes
    big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
    big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)

    # calculate the gradient for each training sample in the batch and accumulate it
    for i, x in enumerate(X):

        # Error with respect to the output
        dE_dy = layers_outputs[-1][i, :] - y[i, :]

        # bias hidden layer
        big_delta_bl2 += dE_dy

        # get the error for the hidden layer
        dE_dz_out = dE_dy * fprime(layers_outputs[-1][i, :])

        #and for the input layer
        dE_dhl = dE_dy.dot(Wl2.T)

        #bias input layer
        big_delta_bl1 += dE_dhl

        small_delta_hl = dE_dhl * fprime(layers_outputs[-2][i, :])

        #here calculate the gradient for the weights in the hidden and first layer
        big_delta_wl2 += np.outer(layers_outputs[-2][i, :], dE_dz_out)
        big_delta_wl1 += np.outer(x, small_delta_hl)

    # divide by number of samples in the batch (should be done here)?
    big_delta_wl2 = np.true_divide(big_delta_wl2, num_samples) + lam*Wl2*2
    big_delta_bl2 = np.true_divide(big_delta_bl2, num_samples)
    big_delta_wl1 = np.true_divide(big_delta_wl1, num_samples) + lam*Wl1*2
    big_delta_bl1 = np.true_divide(big_delta_bl1, num_samples)

    # return
    return np.concatenate([big_delta_wl1.ravel(),
                           big_delta_bl1,
                           big_delta_wl2.ravel(),
                           big_delta_bl2.reshape(big_delta_bl2.size)])
Now the feed_forward:
def _forward_prop(self, W, X, transfer_func=sigmoid):
    """
    Return the outputs of the layers of the net; the last entry is
    Numsamples (N) x Outputs (No Classes).
    """
    # Hidden layer DxHLS
    weights_L1, bias_L1, weights_L2, bias_L2 = self._extract_weights(W)

    # Output layer HLSxOUT
    # A_2 = N x HLS
    A_2 = transfer_func(np.dot(X, weights_L1) + bias_L1)

    # A_3 = N x Outputs
    A_3 = transfer_func(np.dot(A_2, weights_L2) + bias_L2)

    # output layer
    return [A_2, A_3]
And the cost function for the gradient checking:
def cost_function(self, W, X, labels, reg=0.001):
    """
    reg: regularization term
    No weight decay term - let's leave it for later
    """
    outputs = self._forward_prop(W, X, sigmoid)[-1] #take the last layer out
    sample_size = X.shape[0]

    y = self.make_1_of_c_encoding(labels)

    e1 = np.sum((outputs - y)**2, axis=1) * 0.5
    #error = e1.sum(axis=1)
    error = e1.sum()/sample_size + 0.5*reg*(np.square(W)).sum()

    return error
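For reference, the gradient check itself is a centered finite difference of this cost function against the output of _back_prop. A minimal sketch, using the method signatures shown above (the same regularization constant must be passed to both for the comparison to be meaningful):

import numpy as np

# compare a few randomly chosen components of the analytic gradient
# with numerical derivatives of the cost
def gradient_check(net, W, X, labels, reg=0.001, eps=1e-5, n_checks=20):
    analytic = net._back_prop(W, X, labels, lam=reg)
    rng = np.random.RandomState(0)
    for idx in rng.randint(0, W.size, n_checks):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        numeric = (net.cost_function(W_plus, X, labels, reg=reg)
                   - net.cost_function(W_minus, X, labels, reg=reg)) / (2 * eps)
        rel_err = abs(numeric - analytic[idx]) / max(1e-8, abs(numeric) + abs(analytic[idx]))
        print(idx, numeric, analytic[idx], rel_err)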
What kind of results are you getting when you run gradient checking? Often times you can tease out the nature of the implementation error by looking at the output of your gradient vs the output produced by gradient checking.
Furthermore, square error is usually a poor choice for a classification task such as MNIST and I would suggest using either a simple sigmoid top-layer or a softmax. With sigmoid the cross entropy function you want to use is:
L(h,Y) = -Y*log(h) - (1-Y)*log(1-h)
For a softmax
L(h,Y) = -sum(Y*log(h))
where Y is the target given as a 1x10 vector and h is your predicted value, but easily extends to arbitrary batch sizes.
In both cases the top-layer delta simply becomes:
delta = h - Y
And the top-layer gradient becomes:
grad = dot(delta, A_in)
Where A_in is the input into the top layer from the previous layer.
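In numpy, and following the question's weight layout (weights are hidden x classes, activations are batch x hidden), that delta and gradient look roughly like this (a sketch with made-up shapes):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

batch, hidden, classes = 100, 300, 10
A_in = np.random.rand(batch, hidden)                       # input to the top layer
W_top = np.random.randn(hidden, classes) * 0.01
Y = np.eye(classes)[np.random.randint(0, classes, batch)]  # one-hot targets

h = softmax(A_in.dot(W_top))
delta = h - Y                        # top-layer delta for softmax + cross entropy
grad = A_in.T.dot(delta) / batch     # gradient w.r.t. the top-layer weights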
While I am having some trouble getting my head around your backprop routine, I suspect from your code that the error in gradient is due to the fact that you are not calculating the top-level dE/dw_l2 correctly when using square error, along with computing fprime on the incorrect input.
When using square error the top layer delta should be:
delta = (h - Y) * fprime(Z_l2)
Here Z_l2 is the input into your transfer function for layer 2. Similarly when computing fprime for the lower layers, you want to use the input to your transfer function (i.e. dot(X,weights_L1) + bias_L1)
Hope that helps.
EDIT:
As some added justification for using cross entropy error over square error I would suggest looking up Geoffrey Hinton's lectures on linear classification methods:
www.cs.toronto.edu/~hinton/csc2515/notes/lec3.ppt
EDIT2:
I ran some tests locally with my implementation of neural nets on the MNIST dataset with different parameters and 1 hidden layer using RMSPROP. Here are the results:
Test1
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.001
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 6.1%
Test Error: 6.9%
Test2
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.000002
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 4.5%
Test Error: 5.7%
It already appears that if you decrease your lambda parameter by a couple of orders of magnitude, you should end up with better performance.