Coding softmax activation using numpy - python

I have a neural network for multi-class classification (3 classes) with the following architecture:
Input layer has 2 neurons for 2 input features
There is one hidden layer having 4 neurons
Output layer has 3 neurons corresponding to 3 classes to be predicted
Sigmoid activation function is used for hidden layer neurons and softmax activation function is used for output layer.
The parameters used in the network are as follows:
Weights from input layer to hidden layer have the shape = (4, 2)
Biases for hidden layer = (1, 4)
Weights from hidden layer to output layer have the shape = (3, 4)
Biases for output layer = (1, 3)
The forward propagation is coded as follows:
Z1 = np.dot(X, W1.T) + b1 # Z1.shape = (m, 4); 'm' is number of training examples
A1 = sigmoid(Z1) # A1.shape = (m, 4)
Z2 = np.dot(W2, A1.T) + b2.T # Z2.shape = (3, m)
Now 'Z2' has to be fed into the activation function so that each of the three output neurons computes probabilistic activations summing up to one.
The code I have for the 3 output neurons is:
o1 = np.exp(Z2[0,:])/np.exp(Z2[0,:]).sum() # o1.shape = (m,)
o2 = np.exp(Z2[1,:])/np.exp(Z2[1,:]).sum() # o2.shape = (m,)
o3 = np.exp(Z2[2,:])/np.exp(Z2[2,:]).sum() # o3.shape = (m,)
I was expecting each o1, o2 and o3 to output a vector of shape (3,).
My aim is to reduce 'Z2', which has shape (m, n), to (1, n) for each of the 'n' neurons using the softmax activation function.
Here, 'm' is the number of training examples and 'n' is the number of classes.
What am I doing wrong?
Thanks!

From what I understand, the equation for the second activation should be:
Z2 = np.dot(A1, W2.T) + b2 # Z2.shape = (m, 3)
A soft-max for Z2 could be performed as:
o = np.exp(Z2)/np.sum(np.exp(Z2), axis=1, keepdims=True) # o.shape = (m, 3)
The n-th column of o can then be interpreted as the probability of the input belonging to the n-th class, for each of the m input rows.
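For reference, a numerically stable row-wise softmax can be written as a small self-contained helper; subtracting the row maximum is an extra stability step not mentioned above:
import numpy as np

def softmax(Z):
    # Z has shape (m, n); shift by the row-wise max for numerical stability
    Z_shift = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=1, keepdims=True)   # each row sums to 1

# quick check with random logits for m = 5 examples and n = 3 classes
Z2 = np.random.randn(5, 3)
A2 = softmax(Z2)
print(A2.sum(axis=1))   # [1. 1. 1. 1. 1.]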

Understanding neural networks architecture visually

I am following this book and trying to visualize the network. This part seems tricky to me, and I am trying to get my head around it by drawing it out:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
nnfs.init()
class Layer_Dense:
    # Layer initialization
    def __init__(self, n_inputs, n_neurons):
        # Initialize weights and biases
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from inputs, weights and biases
        self.output = np.dot(inputs, self.weights) + self.biases

# ReLU activation
class Activation_Relu():
    # forward pass
    def forward(self, inputs):
        # calculate output values from inputs
        self.output = np.maximum(0, inputs)

# create dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# create ReLU activation which will be used with Dense layer
activation1 = Activation_Relu()
# create second dense layer with 3 input features from the previous layer and 3 output values
dense2 = Layer_Dense(3, 3)
# create dataset
X, y = spiral_data(samples=100, classes=3)

dense1.forward(X)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
My input data X is an array of 300 rows and 2 columns, meaning each of my 300 inputs will have 2 values that describes it.
The Layer_Dense class is initialized with parameters (2, 3) meaning that there are 2 inputs and 3 neurons.
At the moment my variables look like this:
X.shape # (300, 2)
X[:5]
# [[ 0. , 0. ],
# [ 0.00279763, 0.00970586],
# [-0.01122664, 0.01679536],
# [ 0.02998079, 0.0044075 ],
# [-0.01222386, 0.03851056]]
dense1.weights.shape
# (2, 3)
dense1.weights
# [[0.00862166, 0.00508044, 0.00461094],
# [0.00965116, 0.00796512, 0.00558731]])
dense1.biases
# [[0., 0., 0.]]
dense1.output.shape
# (300, 3)
print(dense1.output[:5])
# [[0.0000000e+00 0.0000000e+00 0.0000000e+00]
# [8.0659374e-05 4.3710388e-05 6.5012209e-05]
# [1.5923499e-04 6.9124777e-05 1.0470775e-04]
# [2.3033096e-04 1.9152602e-04 2.7749798e-04]
# [1.9318146e-04 3.1980115e-04 4.5189835e-04]]
Does this configuration make my network look like so:
Where each of 300 inputs has 2 features
Or like so:
Do I understand this correctly:
There are 300 inputs with 2 features each
Each input is connected to 3 neurons in the first layer; since it is connected to 3 neurons, there are 3 weights.
Why is the shape of the weights (2, 3) instead of (300, 3), given that there are 300 inputs with 2 features each and each feature is connected to 3 neurons?
I have used this to draw networks.
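As a shape check (a small sketch with dummy data, not the spiral dataset): the weight matrix is (2, 3) because there is one weight per (feature, neuron) pair, shared by all 300 samples; the sample dimension only appears in the activations.
import numpy as np

X = np.random.randn(300, 2)        # 300 samples, 2 features each
W = 0.01 * np.random.randn(2, 3)   # one weight per (feature, neuron) pair, shared across samples
b = np.zeros((1, 3))
out = X.dot(W) + b                 # (300, 2) @ (2, 3) -> (300, 3)
print(out.shape)                   # (300, 3): one row per sample, one column per neuron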

Cost Function Neural Network

The following code is my implementation of a neural network (1 hidden layer) that tries to predict a number based on the input data.
Number of input nodes: 11
Number of nodes in the hidden layer: 11
Number of nodes in the output layer: 1
m: number of training examples, here = 4527
X: [11, m] matrix
y: [1, m] matrix
w1: weights associated with the connections from the input layer to the hidden layer
b1: bias vector associated with the hidden layer
w2: weights associated with the connections from the hidden layer to the output layer
b2: bias vector associated with the output layer
alpha: learning rate
ite: number of iterations, here = 10000
Since I'm trying to predict a continuous-valued output, I'm using the sigmoid function in the hidden layer and the identity function in the output layer.
def propagate(X,y,w1,b1,w2,b2,alpha,ite):
    assert(X.shape[0] == 11)
    assert(y.shape[0] == 1)
    assert(X.shape[1] == y.shape[1])
    m = X.shape[1]
    J = np.zeros(shape=(ite,1))
    iteNo = np.zeros(shape=(ite,1))
    for i in range(1,ite+1):
        z1 = np.dot(w1,X) + b1
        a1 = sigmoid(z1)
        z2 = np.dot(w2,a1) + b2
        dz2 = (z2-y)/m
        dw2 = np.dot(dz2,a1.T)
        db2 = np.sum(dz2, axis=1, keepdims=True)
        dz1 = np.dot(w2.T,dz2)*derivative_of_sigmoid(z1)
        dw1 = np.dot(dz1,X.T)
        db1 = np.sum(dz1, axis=1, keepdims=True)
        w2 = w2 - (alpha*dw2)
        b2 = b2 - (alpha*db2)
        w1 = w1 - (alpha*dw1)
        b1 = b1 - (alpha*db1)
        iteNo[i-1] = i
        J[i-1] = np.dot((z2-y),(z2-y).T)/(2*m)
    print(z2)
    return w1,b1,w2,b2,iteNo,J
I have tried it both ways (with feature normalization and scaling, and without), but my cost function varies as follows with respect to the number of iterations (plot of J):
On the x-axis: number of iterations; on the y-axis: error * 10^12.
Please help!
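A minimal sketch of per-feature standardization for the (11, m) layout described above (the helper name and dummy data are illustrative only, not taken from the question):
import numpy as np

def standardize_rows(X):
    mu = X.mean(axis=1, keepdims=True)             # per-feature mean, shape (11, 1)
    sigma = X.std(axis=1, keepdims=True) + 1e-8    # per-feature std; the 1e-8 avoids division by zero
    return (X - mu) / sigma, mu, sigma

X = np.random.randn(11, 4527)                      # stand-in for the real training data
X_norm, mu, sigma = standardize_rows(X)            # train on X_norm; reuse mu and sigma at test time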

Keras custom softmax layer: Is it possible to have output neurons set to 0 in the output of a softmax layer based on zeros as data in an input layer?

I have a neural network with 10 output neurons in the last layer using softmax activation. I also know exactly, based on the input values, which neurons in the output layer must have the value 0. So I have a special input layer of 10 neurons, each of them being either 0 or 1.
Would it somehow be possible to force, say, output neuron no. 3 to have the value 0 if input neuron no. 3 is also 0?
action_input = Input(shape=(10,), name='action_input')
...
x = Dense(10, kernel_initializer = RandomNormal(),bias_initializer = RandomNormal() )(x)
x = Activation('softmax')(x)
I know that there is a method by which I can mask out the results of the output layer OUTSIDE the neural network and rescale the remaining non-zero outputs (so that they sum to 1). But I would like to solve this issue within the network and use it during training, too. Shall I use a custom layer for this?
You can use a Lambda layer and K.switch to check for zero values in the input and mask them in the output:
import numpy as np
from keras import backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

inp = Input((5,))
soft_out = Dense(5, activation='softmax')(inp)
out = Lambda(lambda x: K.switch(x[0], x[1], K.zeros_like(x[1])))([inp, soft_out])
model = Model(inp, out)

model.predict(np.array([[0, 3, 0, 2, 0]]))
# array([[0., 0.35963967, 0., 0.47805876, 0.]], dtype=float32)
However, as you can see, the sum of the outputs is no longer one. If you want the sum to be one, you can rescale the values:
def mask_output(x):
    inp, soft_out = x
    y = K.switch(inp, soft_out, K.zeros_like(inp))
    y /= K.sum(y, axis=-1, keepdims=True)   # keepdims so the division broadcasts per row
    return y

# ...
out = Lambda(mask_output)([inp, soft_out])
At the end I came up with this code:
from keras import backend as K
import tensorflow as tf
def mask_output2(x):
    inp, soft_out = x
    # add a very small value in order to avoid having 0 everywhere
    c = K.constant(0.0000001, dtype='float32', shape=(32, 13))
    y = soft_out + c
    y = Lambda(lambda x: K.switch(K.equal(x[0], 0), x[1], K.zeros_like(x[1])))([inp, soft_out])
    y_sum = K.sum(y, axis=-1)
    y_sum_corrected = Lambda(lambda x: K.switch(K.equal(x[0], 0), K.ones_like(x[0]), x[0]))([y_sum])
    y_sum_corrected = tf.divide(1, y_sum_corrected)
    y = tf.einsum('ij,i->ij', y, y_sum_corrected)
    return y
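For comparison, a simpler variant of the same idea is to multiply the softmax output by the 0/1 mask and renormalize inside a Lambda layer. This is a minimal sketch with assumed layer sizes and names, not the exact setup above:
import numpy as np
from keras import backend as K
from keras.layers import Input, Dense, Lambda
from keras.models import Model

def masked_renorm(tensors):
    # mask holds 0/1 per output neuron; zero out masked entries, then renormalize
    mask, soft_out = tensors
    y = soft_out * mask
    return y / (K.sum(y, axis=-1, keepdims=True) + K.epsilon())   # epsilon avoids division by zero

feat_in = Input(shape=(10,), name='action_input')
mask_in = Input(shape=(10,), name='mask_input')   # assumed extra 0/1 input
h = Dense(32, activation='relu')(feat_in)
soft = Dense(10, activation='softmax')(h)
out = Lambda(masked_renorm)([mask_in, soft])
model = Model([feat_in, mask_in], out)

mask = np.array([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1]], dtype='float32')
feats = np.random.rand(1, 10).astype('float32')
print(model.predict([feats, mask]))   # masked positions are 0, the rest sums to ~1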

First Neural Network, (MLP), from Scratch, Python -- Questions

I understand how a neural network with backpropagation is supposed to work. I know how sklearn's MLPClassifier and its fit function work. I am creating my own because I'd like to understand the details better. I will first show my code (with comments) and then discuss my problems.
import numpy as np
import scipy as sp
import sklearn as ML
# z: the linear combination of the previous layer
#
# returns the activation for the node
#
def sigmoid(z):
    a = 1 / (1 + np.exp(-z))
    return a

# z: the contribution of a layer
#
# returns the derivative of the sigmoid evaluated at z
#
def sig_grad(z):
    d = (1 - sigmoid(z))*sigmoid(z)
    return d

# input: the data we want to train the network with
# hidden_layers: the number of nodes in the hidden layers
# num_layers: how many hidden layers between the input layer and the output layer
# num_output: how many outputs there are... this becomes relevant when we input many features.
#
# returns the activations determined
# and the linear combinations of previous layer's nodes for each layer
#
def feedforward(input, hidden_layers, num_layers, num_output, thresh, weights):
    #initialize the vector for inputs AND threshold values
    X = np.hstack([thresh[0], input])
    #initialize the activations list
    A = []
    #initialize the linear combos for each layer
    Z = []
    w = list(weights)
    #place ones in the first row of each layer of weights for the threshold
    w[0] = np.vstack([np.ones([1,hidden_layers]), w[0]])
    for i in range(1,num_layers):
        w[i] = np.vstack([np.ones([1,hidden_layers]), weights[i]])
    w[-1] = np.vstack([np.ones([1,num_output]), w[-1]])
    #the first layer of weights are initialized outside the function
    #cycle through the hidden layers
    for i in range(1, num_layers+1):
        Z.append( np.dot(X, w[i-1])); S = sigmoid(Z[i-1]); A.append(S); X = np.hstack([thresh[i], A[i-1]])
    #find the output/last layer activations
    Z.append( np.dot(X, w[-1]) ); S = sigmoid(Z[-1]); A.append(S);
    return A, Z

#
# truth: what we know the output should be
# activations: the activations determined at each node by the sigmoid
#              function in the previous feedforward pass
# combos: the linear combinations at each layer in the prev. ff pass
# num_layers: the number of hidden layers
#
# error: the errors determined at each layer; will be needed for gradient descent
#
def backprop(input, truth, activations, combos, num_layers, weights):
    #initialize an array of errors for each hidden layer and the output layer
    error = [0 for x in range(0,num_layers+1)]
    #initialize the lists containing the gradients w.r.t. weights and threshold
    derivW = []; derivb = []
    #set the output layer since its error is computed differently than the others
    error[num_layers] = (activations[num_layers] - truth)*sig_grad(combos[num_layers])
    #find the rate of change for weights and thresh for connections to output
    derivW.append( activations[num_layers-1]*error[num_layers]); derivb.append(np.sum(error[num_layers]))
    if(num_layers > 1):
        #find the errors for each of the hidden layers
        for i in range(num_layers - 1, 0, -1):
            error[i] = np.dot(weights[i+1],error[i+1])*sig_grad(combos[i])
            derivW.append( np.outer(activations[i-1], error[i]) ); derivb.append(np.sum(error[i]))
    #
    #finding the derivative for weights of input to next layer
    #
    error[0] = np.dot(weights[i],error[i])*sig_grad(combos[0])
    derivW.append( np.outer(input, error[0]) ); derivb.append(np.sum(error[0]))
    return derivW, derivb

#
# weights: our network's weights to update via gradient descent
# thresh: the threshold values to update for our system
# derivb: the derivative of our cost function with respect to b for each layer
# derivW: the derivative of our cost function with respect to W for each layer
# stepsize: the stepsize we want to take, determines how big of a step we take
#
# returns the updated weights and threshold values for our network
def gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers):
    #perform gradient descent
    for j in range(100):
        for i in range(0, num_layers + 1):
            weights[i] = weights[i] - stepsize*derivW[num_layers-i]
            thresh[i] = thresh[i] - stepsize*derivb[num_layers-i]
    return weights, thresh
#input: the data to send through the network
#hidden_layers: the number of nodes in each hidden layer
#num_layers: the number of hidden layers between the input layer and the output layer
#num_output: the number of nodes in the output layer
#
#returns the output of the network
#
def nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter, stepsize):
    #assuming that input is an array where each element is an input/sample
    #we also need to know the size of each sample itself
    m = input.size
    thresh = np.random.randn(num_layers + 1, 1)
    thresh_weights = np.ones([num_layers + 1, 1])
    # initialize the weights as a list because each layer might have
    # a different number of weights
    weights = []; weights.append(np.random.randn(m,hidden_layers));
    if( num_layers > 1):
        for i in range(1, num_layers):
            weights.append(np.random.randn(hidden_layers, hidden_layers))
    weights.append(np.random.randn(hidden_layers, num_output))
    for i in range(maxiter):
        activations, combos = feedforward(input, hidden_layers, num_layers, num_output, thresh, weights)
        derivW, derivb = backprop(input, truth, activations, combos, num_layers, weights)
        weights, thresh = gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers)
    return weights, thresh

def main():
    # a very, very simple neural network
    input = np.array([1,0,0])
    truth = 0
    hidden_layers = 3
    num_layers = 2
    num_output = 1
    #train the network
    w, t = nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter = 10, stepsize = 0.001)
    #test the network on a new set of arguments
    #activations, combos = feedforward(new_input, hidden_layers = 3, num_layers = 2, thresh = t, weights = w)

main()
I've tested this code on simple examples with n one-dimensional inputs and an n-dimensional output (I'm not yet able to work out the bugs when I type import NN.py into the console, but it works when I run it piece by piece in the console). I have a few questions to help me better understand what happens when each of the n inputs has m dimensions, for example the digits data in Python (there are 1797 samples and each sample is 64x1, an 8x8 image vectorized).
1) Is each of the 64 pixels considered an input? If so, is the neural net trained one image at a time? This would be an easy fix for me.
2) If the neural net is trained on all images at once, what are your suggestions for modifying my code?
3) Obviously the output for an image is 0, 1, 2, 3, ..., or 9. But does the output come in the form of a 10x1 vector, with a 1 in the position of the digit the image represents and 0's elsewhere (see the sketch after this list)? So my prediction vector would have its highest value where that 1 should be, right?
4) Then, I'm not quite sure how #3 would look if #2 is true.
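(A minimal illustration of what question 3 describes, with made-up labels rather than the real digits data: the targets become one-hot 10-vectors, and the predicted digit is the argmax of the output vector.)
import numpy as np

y = np.array([3, 0, 9])              # example digit labels
T = np.zeros((y.size, 10))           # one row per sample, one column per class
T[np.arange(y.size), y] = 1          # 1 in the true digit's column, 0 elsewhere

scores = np.array([0.05, 0.08, 0.02, 0.60, 0.03, 0.05, 0.05, 0.04, 0.04, 0.04])
predicted_digit = scores.argmax()    # -> 3, the position with the highest value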
I apologize for the long note. Thanks for taking a look and helping me understand better!

MLP Neural Network: calculating the gradient (matrices)

What is a good implementation for calculating the gradient in a n-layered neural network?
Weight layers:
First layer weights:     (n_inputs+1, n_units_layer)-matrix
Hidden layer weights: (n_units_layer+1, n_units_layer)-matrix
Last layer weights:     (n_units_layer+1, n_outputs)-matrix
Notes:
If there is only one hidden layer we would represent the net by using just two (weight) layers:
inputs --first_layer-> network_unit --second_layer-> output
For an n-layer network with more than one hidden layer, we need to repeat the second step.
A bit vague pseudocode:
weight_layers = [ layer1, layer2 ]             # a list of layers as described above
input_values = [ [0,0], [0,0], [1,0], [0,1] ]  # our test set (corresponds to XOR)
target_output = [ 0, 0, 1, 1 ]                 # what we want to train our net to output

output_layers = []                             # output for the corresponding layers
for layer in weight_layers:
    output <-- calculate the output            # calculate the output from the current layer
    output_layers <-- output                   # store the output from each layer

n_samples = input_values.shape[0]
n_outputs = target_output.shape[1]
error = ( output - target_output )/( n_samples*n_outputs )

""" calculate the gradient here """
Final implementation
The final implementation is available at github.
With Python and numpy that is easy.
You have two options:
You can either compute everything in parallel for num_instances instances or
you can compute the gradient for one instance (which is actually a special case of 1.).
I will now give some hints on how to implement option 1. I would suggest that you create a new class called Layer. It should have two functions:
forward:
    inputs:
        X: shape = [num_instances, num_inputs]   (inputs)
        W: shape = [num_outputs, num_inputs]     (weights)
        b: shape = [num_outputs]                 (biases)
        g: function                              (activation function)
    outputs:
        Y: shape = [num_instances, num_outputs]  (outputs)
backprop:
    inputs:
        dE/dY: shape = [num_instances, num_outputs]  (backpropagated gradient)
        W: shape = [num_outputs, num_inputs]         (weights)
        b: shape = [num_outputs]                     (biases)
        gd: function                                 (calculates the derivative of g(A) = Y based on Y, i.e. gd(Y) = g'(A))
        Y: shape = [num_instances, num_outputs]      (outputs)
        X: shape = [num_instances, num_inputs]       (inputs)
    outputs:
        dE/dX: shape = [num_instances, num_inputs]   (will be backpropagated; the dE/dY of the lower layer)
        dE/dW: shape = [num_outputs, num_inputs]     (accumulated derivative with respect to weights)
        dE/db: shape = [num_outputs]                 (accumulated derivative with respect to biases)
The implementation is simple:
def forward(X, W, b):
    A = X.dot(W.T) + b    # will be broadcasted
    Y = g(A)
    return Y

def backprop(dEdY, W, b, gd, Y, X):
    Deltas = gd(Y) * dEdY    # element-wise multiplication
    dEdX = Deltas.dot(W)
    dEdW = Deltas.T.dot(X)
    dEdb = Deltas.sum(axis=0)
    return dEdX, dEdW, dEdb
The X of the first layer is taken from your dataset, and then you pass each Y as the X of the next layer in the forward pass.
The dE/dY of the output layer is computed (either for a softmax activation function with the cross-entropy error function, or for a linear activation function with the sum of squared errors) as Y-T, where Y is the output of the network (shape = [num_instances, num_outputs]) and T (shape = [num_instances, num_outputs]) is the desired output. Then you can backpropagate, i.e. the dE/dX of each layer is the dE/dY of the layer below it.
Now you can use dE/dW and dE/db of each layer to update W and b.
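A minimal sketch of such a Layer class, assuming a sigmoid activation and caching X and Y on the object between forward and backprop (the class name, caching strategy and toy training loop are my own choices, not part of the answer above):
import numpy as np

class Layer:
    def __init__(self, num_inputs, num_outputs, rng=np.random):
        self.W = 0.1 * rng.randn(num_outputs, num_inputs)   # shape = [num_outputs, num_inputs]
        self.b = np.zeros(num_outputs)                       # shape = [num_outputs]

    def forward(self, X):
        self.X = X                                 # cache inputs for backprop
        A = X.dot(self.W.T) + self.b               # bias is broadcasted over instances
        self.Y = 1.0 / (1.0 + np.exp(-A))          # g(A) = sigmoid
        return self.Y

    def backprop(self, dEdY):
        Deltas = self.Y * (1.0 - self.Y) * dEdY    # gd(Y) * dE/dY, element-wise
        self.dEdW = Deltas.T.dot(self.X)           # shape = [num_outputs, num_inputs]
        self.dEdb = Deltas.sum(axis=0)             # shape = [num_outputs]
        return Deltas.dot(self.W)                  # dE/dX, passed to the layer below

# toy usage on XOR with one hidden layer and a squared-error cost (dE/dY = Y - T)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
hidden, output = Layer(2, 3), Layer(3, 1)
for _ in range(2000):
    Y = output.forward(hidden.forward(X))
    hidden.backprop(output.backprop(Y - T))
    for layer in (hidden, output):
        layer.W -= 0.5 * layer.dEdW
        layer.b -= 0.5 * layer.dEdb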
Here is an example for C++: OpenANN.
Btw. you can compare the speed of instance-wise and batch-wise forward propagation:
In [1]: import timeit
In [2]: setup = """import numpy
...: W = numpy.random.rand(10, 5000)
...: X = numpy.random.rand(1000, 5000)"""
In [3]: timeit.timeit('[W.dot(x) for x in X]', setup=setup, number=10)
Out[3]: 0.5420958995819092
In [4]: timeit.timeit('X.dot(W.T)', setup=setup, number=10)
Out[4]: 0.22001314163208008
