MLP Neural Network: calculating the gradient (matrices) - python

What is a good implementation for calculating the gradient in an n-layered neural network?
Weight layers:
First layer weights:     (n_inputs+1, n_units_layer)-matrix
Hidden layer weights: (n_units_layer+1, n_units_layer)-matrix
Last layer weights:     (n_units_layer+1, n_outputs)-matrix
If there is only one hidden layer we would represent the net by using just two (weight) layers:
inputs --first_layer-> network_unit --second_layer-> output
For an n-layer network with more than one hidden layer, we need to implement the (2) step.
Some rough pseudocode:
weight_layers = [ layer1, layer2 ] # a list of layers as described above
input_values = [ [0,0], [0,0], [1,0], [0,1] ] # our test set (corresponds to XOR)
target_output = [ 0, 0, 1, 1 ] # what we want to train our net to output
output_layers = [] # output for the corresponding layers
for layer in weight_layers:
    output <-- calculate the output # calculate the output from the current layer
    output_layers <-- output # store the output from each layer
n_samples = input_values.shape[0]
n_outputs = target_output.shape[1]
error = ( output-target_output )/( n_samples*n_outputs )
""" calculate the gradient here """
Final implementation
The final implementation is available at github.

With Python and numpy that is easy.
You have two options:
1. You can either compute everything in parallel for num_instances instances, or
2. you can compute the gradient for one instance (which is actually a special case of 1.).
I will now give some hints on how to implement option 1. I would suggest that you create a new class called Layer. It should have two functions:
forward:
    inputs:
        X: shape = [num_instances, num_inputs]
        W: shape = [num_outputs, num_inputs]
        b: shape = [num_outputs]
        g: function
            activation function
    outputs:
        Y: shape = [num_instances, num_outputs]
backprop:
    inputs:
        dE/dY: shape = [num_instances, num_outputs]
            backpropagated gradient
        W: shape = [num_outputs, num_inputs]
        b: shape = [num_outputs]
        gd: function
            calculates the derivative of g(A) = Y
            based on Y, i.e. gd(Y) = g'(A)
        Y: shape = [num_instances, num_outputs]
        X: shape = [num_instances, num_inputs]
    outputs:
        dE/dX: shape = [num_instances, num_inputs]
            will be backpropagated (dE/dY of lower layer)
        dE/dW: shape = [num_outputs, num_inputs]
            accumulated derivative with respect to weights
        dE/db: shape = [num_outputs]
            accumulated derivative with respect to biases
The implementation is simple:
def forward(X, W, b, g):
    A = X.dot(W.T) + b # will be broadcasted
    Y = g(A)
    return Y
def backprop(dEdY, W, b, gd, Y, X):
    Deltas = gd(Y) * dEdY # element-wise multiplication
    dEdX = Deltas.dot(W)
    dEdW = Deltas.T.dot(X)
    dEdb = Deltas.sum(axis=0)
    return dEdX, dEdW, dEdb
X of the first layer is taken from your dataset, and then you pass each Y as the X of the next layer in the forward pass.
The dE/dY of the output layer is computed (either for softmax activation function and cross entropy error function or for linear activation function and sum of squared errors) as Y-T, where Y is the output of the network (shape = [num_instances, num_outputs]) and T (shape = [num_instances, num_outputs]) is the desired output. Then you can backpropagate, i.e. dE/dX of each layer is dE/dY of the previous layer.
Now you can use dE/dW and dE/db of each layer to update W and b.
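As a minimal sketch of how these pieces fit together, the following chains two layers and runs a few gradient descent steps. It assumes a tanh hidden layer, a linear output layer and sum of squared errors (so dE/dY at the output is Y - T). The helper names `tanh_d`, `identity`, `identity_d`, the layer sizes and the learning rate are illustrative assumptions, not part of the answer above.

```python
import numpy as np

def forward(X, W, b, g):
    """Forward pass of one layer: Y = g(X W^T + b)."""
    return g(X.dot(W.T) + b)

def backprop(dEdY, W, b, gd, Y, X):
    """Backward pass of one layer; returns dE/dX, dE/dW, dE/db."""
    Deltas = gd(Y) * dEdY  # element-wise, shape [num_instances, num_outputs]
    return Deltas.dot(W), Deltas.T.dot(X), Deltas.sum(axis=0)

# Derivatives are written in terms of the layer output Y (hypothetical helpers).
def tanh_d(Y):
    return 1.0 - Y ** 2

def identity(A):
    return A

def identity_d(Y):
    return np.ones_like(Y)

rng = np.random.RandomState(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])   # XOR targets, shape [4, 1]

W1, b1 = 0.5 * rng.randn(4, 2), np.zeros(4)  # hidden layer: 2 -> 4
W2, b2 = 0.5 * rng.randn(1, 4), np.zeros(1)  # output layer: 4 -> 1

def loss():
    H = forward(X, W1, b1, np.tanh)
    Y = forward(H, W2, b2, identity)
    return 0.5 * np.sum((Y - T) ** 2)

loss_before = loss()
learning_rate = 0.05  # assumed value, chosen small for stability
for _ in range(200):
    H = forward(X, W1, b1, np.tanh)
    Y = forward(H, W2, b2, identity)
    # output layer: dE/dY = Y - T for linear output with sum of squared errors
    dEdH, dEdW2, dEdb2 = backprop(Y - T, W2, b2, identity_d, Y, H)
    # dE/dX of the output layer is dE/dY of the hidden layer
    _, dEdW1, dEdb1 = backprop(dEdH, W1, b1, tanh_d, H, X)
    W1 -= learning_rate * dEdW1
    b1 -= learning_rate * dEdb1
    W2 -= learning_rate * dEdW2
    b2 -= learning_rate * dEdb2
loss_after = loss()
print(loss_before, loss_after)
```

Comparing one analytic gradient entry against a finite-difference estimate of `loss()` is a cheap way to check that the shapes and transposes are right before training for longer.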
Here is an example for C++: OpenANN.
Btw. you can compare the speed of instance-wise and batch-wise forward propagation:
In [1]: import timeit
In [2]: setup = """import numpy
...: W = numpy.random.rand(10, 5000)
...: X = numpy.random.rand(1000, 5000)"""
In [3]: timeit.timeit('[W.dot(x) for x in X]', setup=setup, number=10)
Out[3]: 0.5420958995819092
In [4]: timeit.timeit('W.dot(X.T)', setup=setup, number=10)
Out[4]: 0.22001314163208008


Coding softmax activation using numpy

I have a neural network for multi-class classification (3 classes) with the following architecture:
Input layer has 2 neurons for 2 input features
There is one hidden layer having 4 neurons
Output layer has 3 neurons corresponding to 3 classes to be predicted
Sigmoid activation function is used for hidden layer neurons and softmax activation function is used for output layer.
The parameters used in the network are as follows:
Weights from input layer to hidden layer have the shape = (4, 2)
Biases for hidden layer = (1, 4)
Weights from hidden layer to output layer have the shape = (3, 4)
Biases for output layer = (1, 3)
The forward propagation is coded as follows:
Z1 = np.dot(X, W1.T) + b1 # Z1.shape = (m, 4); 'm' is number of training examples
A1 = sigmoid(Z1) # A1.shape = (m, 4)
Z2 = np.dot(W2, A1.T) + b2.T # Z2.shape = (3, m)
Now 'Z2' has to be fed into an activation function so that each of the three neurons computes probabilistic activations summing up to one.
The code I have for the 3 output neurons is:
o1 = np.exp(Z2[0,:])/np.exp(Z2[0,:]).sum() # o1.shape = (m,)
o2 = np.exp(Z2[1,:])/np.exp(Z2[1,:]).sum() # o2.shape = (m,)
o3 = np.exp(Z2[2,:])/np.exp(Z2[2,:]).sum() # o3.shape = (m,)
I was expecting each o1, o2 and o3 to output a vector of shape (3,).
My aim is to reduce 'Z2' having shape (m, n) and use softmax activation function to (1, n) for each of the 'n' neurons.
Here, 'm' is the number of training examples and 'n' is the number of classes.
What am I doing wrong?
From what I understand, the equation for the second activation should be:
Z2 = np.dot(A1, W2.T) + b2 # Z2.shape = (m,3); b2 has shape (1,3) and is broadcast over the rows
A softmax for Z2 could be performed as:
o = np.exp(Z2)/np.sum(np.exp(Z2), axis=1, keepdims=True) # o.shape = (m,3)
The interpretation of the nth-column of o is the probability of your input belonging to the n-th class for each of the m input rows.
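A small sketch of the row-wise softmax, assuming Z2 has shape (m, 3) as above. Subtracting the row maximum before exponentiating is a common stabilisation trick added here (it is not in the answer); it leaves the result unchanged but avoids overflow for large values. The function name `softmax` and the sample values are illustrative.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax for Z of shape (m, n_classes)."""
    # Shift each row by its maximum; softmax is invariant to this shift.
    shifted = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    # keepdims=True keeps the sum at shape (m, 1) so it broadcasts over columns.
    return e / e.sum(axis=1, keepdims=True)

Z2 = np.array([[1.0, 2.0, 3.0],
               [1000.0, 1000.0, 1000.0],  # would overflow without the shift
               [-5.0, 0.0, 5.0],
               [0.0, 0.0, 0.0]])
o = softmax(Z2)
print(o.shape)        # (4, 3)
print(o.sum(axis=1))  # every row sums to 1 (up to rounding)
```

Without `keepdims=True` the denominator has shape (m,), which only broadcasts against an (m, 3) array when m happens to equal 3.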

Keras custom softmax layer: Is it possible to have output neurons set to 0 in the output of a softmax layer based on zeros as data in an input layer?

I have a neural network with 10 output neurons in the last layer using softmax activation. I also know exactly that based on the input values, certain neurons in the output layer shall have 0 values. So I have a special input layer of 10 neurons, each of them being either 0 or 1.
Would it be somehow possible to force let's say the output neuron no. 3 to have value = 0 if the input neuron no 3 is also 0?
action_input = Input(shape=(10,), name='action_input')
x = Dense(10, kernel_initializer = RandomNormal(),bias_initializer = RandomNormal() )(x)
x = Activation('softmax')(x)
I know that there is a method via which I can mask out the results of the output layer OUTSIDE the neural network, and have all non zero related outputs reshaped (in order to have a total sum of 1). But I would like to solve this issue within the network and use it during the training of the network, too. Shall I use a custom layer for this?
You can use a Lambda layer and K.switch to check for zero values in the input and mask them in the output:
import numpy as np
from keras import backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model
inp = Input((5,))
soft_out = Dense(5, activation='softmax')(inp)
out = Lambda(lambda x: K.switch(x[0], x[1], K.zeros_like(x[1])))([inp, soft_out])
model = Model(inp, out)
model.predict(np.array([[0, 3, 0, 2, 0]]))
# array([[0., 0.35963967, 0., 0.47805876, 0.]], dtype=float32)
However, as you can see, the sum of the outputs is no longer one. If you want the sum to be one, you can rescale the values:
def mask_output(x):
    inp, soft_out = x
    y = K.switch(inp, soft_out, K.zeros_like(inp))
    y /= K.sum(y, axis=-1, keepdims=True) # keepdims so the row sums broadcast for any batch size
    return y
# ...
out = Lambda(mask_output)([inp, soft_out])
At the end I came up with this code:
from keras import backend as K
import tensorflow as tf
def mask_output2(x):
    inp, soft_out = x
    # add a very small value in order to avoid having 0 everywhere
    c = K.constant(0.0000001, dtype='float32', shape=(32, 13))
    y = soft_out + c
    y = Lambda(lambda x: K.switch(K.equal(x[0],0), x[1], K.zeros_like(x[1])))([inp, soft_out])
    y_sum = K.sum(y, axis=-1)
    y_sum_corrected = Lambda(lambda x: K.switch(K.equal(x[0],0), K.ones_like(x[0]), x[0] ))([y_sum])
    y_sum_corrected = tf.divide(1,y_sum_corrected)
    y = tf.einsum('ij,i->ij', y, y_sum_corrected)
    return y
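The arithmetic behind masking and rescaling can be checked outside Keras. The following is a NumPy stand-in, not Keras code: it zeroes the masked outputs, renormalises each row, and guards against rows where every output is masked, which is the case the y_sum_corrected step above handles. The function name `mask_and_renormalize` and the sample values are assumptions for illustration.

```python
import numpy as np

def mask_and_renormalize(soft_out, mask):
    """Zero the outputs where mask == 0, then rescale each row to sum to 1."""
    y = np.where(mask != 0, soft_out, 0.0)
    total = y.sum(axis=-1, keepdims=True)
    # Rows that are fully masked have total == 0; divide by 1 there instead.
    safe_total = np.where(total == 0, 1.0, total)
    return y / safe_total

soft_out = np.array([[0.1, 0.2, 0.3, 0.4],
                     [0.25, 0.25, 0.25, 0.25]])
mask = np.array([[0, 1, 0, 1],
                 [0, 0, 0, 0]])   # second row: everything masked
out = mask_and_renormalize(soft_out, mask)
print(out)
```

The first row keeps only the second and fourth entries and rescales them to 1/3 and 2/3; the fully masked second row stays all zeros instead of becoming NaN.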

Errors from building a one hidden neural network

I'm currently building my 3-4-1 neural network from scratch using numpy (I avoided using keras and tensorflow for the purpose of learning and trying to demonstrate my knowledge instead of using pre-built libraries to do all the work), the problems I find when I run the program are:
1/ getting "nan" values after a certain number of iterations in the "updated" weights, lowering the learning rate only delays the problem and doesn't solve it.
2/ the second problem is the very low predicting accuracy.
I would like to know what causes these bugs in my program and would appreciate any help.
Here is the code:
# Import our dependencies
from numpy import exp, array, random, dot, ones_like, where
# Create our Artificial Neural Network class
class ArtificialNeuralNetwork():
    # initializing the class
    def __init__(self):
        # generating the same synaptic weights every time the program runs
        # synaptic weights (3 × 4 Matrix) of the hidden layer
        self.w_ij = 2 * random.rand(3, 4) - 1
        # synaptic weights (4 × 1 Matrix) of the output layer
        self.w_jk = 2 * random.rand(4, 1) - 1
    def LeakyReLU(self, x):
        # The Leaky ReLU (short for Rectified Linear Unit) activation function will be applied to the inputs of the hidden layer
        # The activation function will return the same value of x if x is positive
        # while it will multiply the negative values of x by the alpha parameter
        # we used in this example the Leaky ReLU instead of the standard ReLU activation function to avoid the dying ReLU problem
        return where(x > 0, x, x * 0.01)
    def LeakyReLUDerivative(self, x, α = 0.01):
        # The Leaky ReLU Derivative will return 1 for every positive value in the x array
        # while returning the value of the parameter alpha for every negative value
        x[x > 0] = 1 # returns 1 for every positive value in the x array
        x[x <= 0] = α # returns α for every negative value in the x array
        return x
    def Sigmoid(self, x):
        # The Sigmoid activation function will turn every input value into probabilities between 0 and 1
        # the probabilistic values help us assert which class x belongs to
        return 1 / (1 + exp(-x))
    def SigmoidDerivative(self, x):
        # The derivative of the Sigmoid activation function will be used to calculate the gradient during the backpropagation process
        # and help optimize the random starting synaptic weights
        return x * (1 - x)
    def train(self, x, y, learning_rate, iterations):
        # x: training set of data
        # y: the actual output of the training data
        for i in range(iterations):
            z_ij = dot(x, self.w_ij) # the dot product of the weights of the hidden layer and the inputs
            a_ij = self.LeakyReLU(z_ij) # using the Leaky ReLU activation function to introduce non-linearity to our Neural Network
            z_jk = dot(a_ij, self.w_jk) # the same precedent process will be applied to find the last input of the output layer
            a_jk = self.Sigmoid(z_jk) # this time the Sigmoid activation function will be used instead of Leaky ReLU
            dl_jk = -y/a_jk + (1 - y)/(1 - a_jk) # calculating the derivative of the cross entropy loss wrt output
            da_jk = self.SigmoidDerivative(a_jk) # calculating the derivative of the Sigmoid activation function wrt the input (before activation) of the output layer
            dz_jk = a_ij # calculating the derivative of the inputs of the hidden layer (before activation) wrt weights of the output layer
            dl_ij = dot(da_jk * dl_jk, self.w_jk.T) # calculating the derivative of the cross entropy loss wrt activated input of the hidden layer
            # to do so we multiply the derivative of the cross entropy loss wrt output by the derivative of the Sigmoid activation function wrt the input (before activation) of the output layer by the derivative of the inputs of the hidden layer (before activation) wrt weights of the output layer
            da_ij = self.LeakyReLUDerivative(z_ij) # calculating the derivative of the Leaky ReLU activation function wrt the inputs of the hidden layer (before activation)
            dz_ij = x # calculating the derivative of the inputs of the hidden layer (before activation) wrt weights of the hidden layer
            # calculating the gradient using the chain rule
            gradient_ij = dot(dz_ij.T , dl_ij * da_ij)
            gradient_jk = dot(dz_jk.T , dl_jk * da_jk)
            # calculating the new optimal weights
            self.w_ij = self.w_ij - learning_rate * gradient_ij
            self.w_jk = self.w_jk - learning_rate * gradient_jk
    def predict(self, inputs):
        # predicting the class of the input data after weights optimization
        output_from_layer1 = self.LeakyReLU(dot(inputs, self.w_ij)) # the output of the hidden layer
        output_from_layer2 = self.Sigmoid(dot(output_from_layer1, self.w_jk)) # the output of the output layer
        return output_from_layer1, output_from_layer2
    # the function will print the initial starting weights before training
    def SynapticWeights(self):
        print("Layer 1 (4 neurons, each with 3 inputs): ")
        print("w_ij: ", self.w_ij)
        print("Layer 2 (1 neuron, with 4 inputs): ")
        print("w_jk: ", self.w_jk)
def main():
    ANN = ArtificialNeuralNetwork()
    # the training inputs
    x = array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]])
    # the training outputs
    y = array([[0, 1, 1, 1, 1, 0, 0]]).T
    ANN.train(x, y, 1, 10000)
    # Printing the new synaptic weights after training
    print("New synaptic weights after training: ")
    print("w_ij: ", ANN.w_ij)
    print("w_jk: ", ANN.w_jk)
    # Our prediction after feeding the ANN with new set of data
    print("Considering new situation [1, 1, 0] -> ?: ")
    print(ANN.predict(array([[1, 1, 0]])))
if __name__=="__main__":
    main()
So, I changed a few things. (Disclaimer: I didn't check the correctness of the code)
Weight initialization: initialize to much smaller weights.
# synaptic weights (3 × 4 Matrix) of the hidden layer
self.w_ij = (2 * random.rand(3, 4) - 1)*0.1
# synaptic weights (4 × 1 Matrix) of the output layer
self.w_jk = (2 * random.rand(4, 1) - 1)*0.1
Weight initialization really matters.
I reduced the learning rate to 0.1.
ANN.train(x, y, .1, 500000)
I see the neural network perfectly fitting your data and not producing NaN even after 500,000 iterations.
print(ANN.predict(array([[0, 0, 1],
                         [0, 1, 1],
                         [1, 0, 1],
                         [0, 1, 0],
                         [1, 0, 0],
                         [1, 1, 1],
                         [0, 0, 0]])))
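One likely contributor to the NaN values (an observation about the posted code, not something verified in the answer above) is the term -y/a_jk + (1 - y)/(1 - a_jk), which divides by zero once the sigmoid output rounds to exactly 0 or 1. For a sigmoid output with cross-entropy loss, multiplying that term by a(1 - a) simplifies algebraically to a - y, which needs no division. A small sketch with made-up values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Moderate pre-activations: both forms of dL/dz agree.
z = np.array([[-2.0], [0.5], [3.0]])
y = np.array([[0.0], [1.0], [1.0]])
a = sigmoid(z)
long_form = (-y / a + (1 - y) / (1 - a)) * (a * (1 - a))  # chain rule written out
short_form = a - y                                        # simplified form
print(np.allclose(long_form, short_form))

# Saturated pre-activation: sigmoid(40) rounds to exactly 1.0 in float64.
z_big = np.array([[40.0]])
y_big = np.array([[0.0]])
a_big = sigmoid(z_big)
with np.errstate(divide="ignore", invalid="ignore"):
    long_big = (-y_big / a_big + (1 - y_big) / (1 - a_big)) * (a_big * (1 - a_big))
short_big = a_big - y_big
print(long_big, short_big)  # the long form is nan here, the short form stays finite
```

Smaller initial weights and a lower learning rate make saturation less likely, which is consistent with the changes above; computing the output-layer delta as a - y removes the division entirely.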

Keras_ERROR : "cannot import name '_time_distributed_dense"

Since the Keras wrapper does not support attention model yet, I'd like to refer to the following custom attention.
But the problem is, when I run the code above, it returns following error:
ImportError: cannot import name '_time_distributed_dense'
It looks like _time_distributed_dense is no longer supported in Keras versions after 2.0.0.
The only part of the code that uses _time_distributed_dense is shown below:
def call(self, x):
    # store the whole sequence so we can "attend" to it at each timestep
    self.x_seq = x
    # apply a dense layer over the time dimension of the sequence
    # do it here because it doesn't depend on any previous steps
    # therefore we can save computation time:
    self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a,
    return super(AttentionDecoder, self).call(x)
How should I change the _time_distributed_dense(self ... ) part?
I just copied this from An Chen's answer on the GitHub issue (the page or his answer might be deleted in the future):
from keras import backend as K
def _time_distributed_dense(x, w, b=None, dropout=None,
                            input_dim=None, output_dim=None,
                            timesteps=None, training=None):
    """Apply `y . w + b` for every temporal slice y of x.
    # Arguments
        x: input tensor.
        w: weight matrix.
        b: optional bias vector.
        dropout: whether to apply dropout (same dropout mask
            for every temporal slice of the input).
        input_dim: integer; optional dimensionality of the input.
        output_dim: integer; optional dimensionality of the output.
        timesteps: integer; optional number of timesteps.
        training: training phase tensor or boolean.
    # Returns
        Output tensor.
    """
    if not input_dim:
        input_dim = K.shape(x)[2]
    if not timesteps:
        timesteps = K.shape(x)[1]
    if not output_dim:
        output_dim = K.shape(w)[1]
    if dropout is not None and 0. < dropout < 1.:
        # apply the same dropout pattern at every timestep
        ones = K.ones_like(K.reshape(x[:, 0, :], (-1, input_dim)))
        dropout_matrix = K.dropout(ones, dropout)
        expanded_dropout_matrix = K.repeat(dropout_matrix, timesteps)
        x = K.in_train_phase(x * expanded_dropout_matrix, x, training=training)
    # collapse time dimension and batch dimension together
    x = K.reshape(x, (-1, input_dim))
    x = K.dot(x, w)
    if b is not None:
        x = K.bias_add(x, b)
    # reshape to 3D tensor
    if K.backend() == 'tensorflow':
        x = K.reshape(x, K.stack([-1, timesteps, output_dim]))
        x.set_shape([None, None, output_dim])
    else:
        x = K.reshape(x, (-1, timesteps, output_dim))
    return x
You can just add this to your Python code.
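To see what the function computes without a Keras installation, here is a NumPy sketch of the same reshape, dot, reshape pattern (dropout omitted). The name `time_distributed_dense_np` and the array sizes are illustrative assumptions, not part of Keras.

```python
import numpy as np

def time_distributed_dense_np(x, w, b=None):
    """Apply y . w + b to every temporal slice y of x (shape: batch, time, input_dim)."""
    batch, timesteps, input_dim = x.shape
    output_dim = w.shape[1]
    # collapse batch and time so one matrix product handles every timestep
    flat = x.reshape(-1, input_dim)
    out = flat.dot(w)
    if b is not None:
        out = out + b
    # restore the 3D layout
    return out.reshape(batch, timesteps, output_dim)

rng = np.random.RandomState(0)
x = rng.rand(2, 3, 4)   # 2 sequences, 3 timesteps, 4 features
w = rng.rand(4, 5)
b = rng.rand(5)
y = time_distributed_dense_np(x, w, b)
print(y.shape)  # (2, 3, 5)
```

Each slice `y[i, t]` equals `x[i, t].dot(w) + b`, so the same weights are shared across all timesteps.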

First Neural Network, (MLP), from Scratch, Python -- Questions

I understand how a neural network with backpropagation is supposed to work, and I know how to use the MLPClassifier and its fit function in sklearn. I am creating my own because I'd like to understand the details better. I will first show my code (with comments) and then discuss my problems.
import numpy as np
import scipy as sp
import sklearn as ML
# z: the linear combination of the previous layer
# returns the activation for the node
def sigmoid(z):
    a = 1 / (1 + np.exp(-z))
    return a
# z: the contribution of a layer
# returns the derivative of the sigmoid evaluated at z
def sig_grad(z):
    d = (1 - sigmoid(z))*sigmoid(z)
    return d
# input: the data we want to train the network with
# hidden_layers: the number of nodes in the hidden layers
# num_layers: how many hidden layers between the input layer and the output layer
# num_output: how many outputs there are... this becomes relevant when we input many features.
# returns the activations determined
# and the linear combinations of previous layer's nodes for each layer
def feedforward(input, hidden_layers, num_layers, num_output, thresh, weights):
    #initialize the vector for inputs AND threshold values
    X = np.hstack([thresh[0], input])
    #initialize the activations list
    A = []
    #initialize the linear combos for each layer
    Z = []
    w = list(weights)
    #place ones in the first row of each layer of weights for the threshold
    w[0] = np.vstack([np.ones([1,hidden_layers]), w[0]])
    for i in range(1,num_layers):
        w[i] = np.vstack([np.ones([1,hidden_layers]), weights[i]])
    w[-1] = np.vstack([np.ones([1,num_output]), w[-1]])
    #the first layer of weights are initialized outside function
    #cycle through the hidden layers
    for i in range(1, num_layers+1):
        Z.append(np.dot(X, w[i-1]))
        S = sigmoid(Z[i-1])
        A.append(S)
        X = np.hstack([thresh[i], A[i-1]])
    #find the output/last layer activations
    Z.append(np.dot(X, w[-1]))
    S = sigmoid(Z[-1])
    A.append(S)
    return A, Z
# truth: what we know the output should be
# activations: the activations determined at each node by the sigmoid
# function in the previous feedforward pass
# combos: the linear combinations at each layer in the prev. ff pass
# num_layers: the number of hidden layers
# error: the errors determined at each layer; will be needed for gradient descent
def backprop(input, truth, activations, combos, num_layers, weights):
    #initialize an array of errors for each hidden layer and the output layer
    error = [0 for x in range(0,num_layers+1)]
    #initialize the lists containing the gradients w.r.t. weights and threshold
    derivW = []
    derivb = []
    #set the output layer since its error is computed differently than the others
    error[num_layers] = (activations[num_layers] - truth)*sig_grad(combos[num_layers])
    #find the rate of change for weights and thresh for connections to output
    derivW.append(activations[num_layers-1]*error[num_layers])
    derivb.append(np.sum(error[num_layers]))
    if num_layers > 1:
        #find the errors for each of the hidden layers
        for i in range(num_layers - 1, 0, -1):
            error[i] = np.dot(weights[i+1], error[i+1])*sig_grad(combos[i])
            derivW.append(np.outer(activations[i-1], error[i]))
            derivb.append(np.sum(error[i]))
    #finding the derivative for weights of input to next layer
    #(index 1 is the value i holds when the loop above finishes; written out so it is defined for num_layers == 1)
    error[0] = np.dot(weights[1], error[1])*sig_grad(combos[0])
    derivW.append(np.outer(input, error[0]))
    derivb.append(np.sum(error[0]))
    return derivW, derivb
# weights: our networks weights to update via gradient descent
# thresh: the threshold values to update for our system
# derivb: the derivative of our cost function with respect to b for each layer
# derivW: the derivative of our cost function with respect to W for each layer
# stepsize: the stepsize we want to take, determines how big of a step we take
# returns the updated weights and threshold values for our network
def gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers):
    #perform gradient descent
    for j in range(100):
        for i in range(0, num_layers + 1):
            weights[i] = weights[i] - stepsize*derivW[num_layers-i]
            thresh[i] = thresh[i] - stepsize*derivb[num_layers-i]
    return weights, thresh
#input: the data to send through the network
#hidden_layers: the number of nodes in the hidden layers
#num_layers: the number of hidden layers between the input layer and the output layer
#num_output: the number of nodes in the output layer
#returns the trained weights and threshold values of the network
def nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter, stepsize):
    #assuming that input is an array where each element is an input/sample
    #we also need to know the size of each sample itself
    m = input.size
    thresh = np.random.randn(num_layers + 1, 1)
    thresh_weights = np.ones([num_layers + 1, 1])
    # initialize the weights as a list because each layer might have
    # a different number of weights
    weights = []
    weights.append(np.random.randn(m,hidden_layers))
    if num_layers > 1:
        for i in range(1, num_layers):
            weights.append(np.random.randn(hidden_layers, hidden_layers))
    weights.append(np.random.randn(hidden_layers, num_output))
    for i in range(maxiter):
        activations, combos = feedforward(input, hidden_layers, num_layers, num_output, thresh, weights)
        derivW, derivb = backprop(input, truth, activations, combos, num_layers, weights)
        weights, thresh = gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers)
    return weights, thresh
def main():
    # a very, very simple neural network
    input = np.array([1,0,0])
    truth = 0
    hidden_layers = 3
    num_layers = 2
    num_output = 1
    #train the network
    w, t = nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter = 10, stepsize = 0.001)
    #test the network on a new set of arguments
    #activations, combos = feedforward(new_input, hidden_layers = 3, num_layers = 2, thresh = t, weights = w)
I've tested this code on simple examples where there are n inputs of one dimension and an output of n dimensions (I haven't yet worked out the bugs that appear when I import it in the console, but it works when I run it piece by piece in the console). I have a few questions to help me better understand what happens when I have n inputs of m dimensions each. For example, the digits data in Python (there are 1797 samples and each sample is 64x1 -- an 8x8 image vectorized).
1) Is each of the 64 pixels considered an input? If so, is the neural net trained one image at a time? This would be an easy fix for me.
2) If the neural net is trained all images at once, what are suggestions for modifying my code?
3) Obviously the output for an image comes in the form of 0, 1, 2, 3, ... , or 9. But, does the output come in the form of a vector 10x1 where there is a 1 in the digit the image represents and 0's elsewhere? So, my prediction vector would have the highest value where the 1 might be, right?
4) Then, I'm not quite sure how #3 would look if #2 is true..
I apologize for the long note. Thanks for taking a look and helping me understand better!
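The target encoding described in question 3 can be sketched as follows, assuming ten digit classes and a batch of samples stacked as rows. The labels and the stand-in network outputs are made up for illustration; they do not come from the digits data.

```python
import numpy as np

num_classes = 10
labels = np.array([3, 0, 9])              # hypothetical digit labels for 3 images

# One-hot targets: row k has a 1 in column labels[k] and 0 elsewhere.
one_hot = np.eye(num_classes)[labels]     # shape (3, 10)

# Stand-in for network outputs (one row of 10 scores per image);
# a real network would produce these from the 64 pixel inputs.
scores = 0.8 * one_hot + 0.02

# The predicted digit is the index of the largest output in each row.
predicted = scores.argmax(axis=1)
print(one_hot.shape, predicted)
```

With samples stacked as rows, the same one-hot and argmax steps apply whether the network processes one image or the whole batch at once.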
