I've implemented the following neural network to solve the XOR problem in Python. My neural network consists of an input layer of 2 neurons, 1 hidden layer of 2 neurons and an output layer of 1 neuron. I am using the Sigmoid function as the activation function for both the hidden layer and output layer. Can someone please explain what I have done wrong.
import numpy
import scipy.special
class NeuralNetwork:
def __init__(self, inputNodes, hiddenNodes, outputNodes, learningRate):
self.iNodes = inputNodes
self.hNodes = hiddenNodes
self.oNodes = outputNodes
self.wIH = numpy.random.normal(0.0, pow(self.iNodes, -0.5), (self.hNodes, self.iNodes))
self.wOH = numpy.random.normal(0.0, pow(self.hNodes, -0.5), (self.oNodes, self.hNodes))
self.lr = learningRate
self.activationFunction = lambda x: scipy.special.expit(x)
def train(self, inputList, targetList):
inputs = numpy.array(inputList, ndmin=2).T
targets = numpy.array(targetList, ndmin=2).T
#print(inputs, targets)
hiddenInputs = numpy.dot(self.wIH, inputs)
hiddenOutputs = self.activationFunction(hiddenInputs)
finalInputs = numpy.dot(self.wOH, hiddenOutputs)
finalOutputs = self.activationFunction(finalInputs)
outputErrors = targets - finalOutputs
hiddenErrors = numpy.dot(self.wOH.T, outputErrors)
self.wOH += self.lr * numpy.dot((outputErrors * finalOutputs * (1.0 - finalOutputs)), numpy.transpose(hiddenOutputs))
self.wIH += self.lr * numpy.dot((hiddenErrors * hiddenOutputs * (1.0 - hiddenOutputs)), numpy.transpose(inputs))
def query(self, inputList):
inputs = numpy.array(inputList, ndmin=2).T
hiddenInputs = numpy.dot(self.wIH, inputs)
hiddenOutputs = self.activationFunction(hiddenInputs)
finalInputs = numpy.dot(self.wOH, hiddenOutputs)
finalOutputs = self.activationFunction(finalInputs)
return finalOutputs
nn = NeuralNetwork(2, 2, 1, 0.01)
data = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
epochs = 10
for e in range(epochs):
for record in data:
inputs = numpy.asfarray(record[1:])
targets = record[0]
#print(targets)
#print(inputs, targets)
nn.train(inputs, targets)
print(nn.query([0, 0]))
print(nn.query([1, 0]))
print(nn.query([0, 1]))
print(nn.query([1, 1]))
Several reasons.
I don't think you should be taking the activation function of everything, especially in your query function. I think you have muddled up the ideas of neuron to neuron weightings (wIH and wOH) with the activation values.
Because of your muddle you have missed the idea of re-using your query function as part of your training. You should think of it as feed forward activation levels to the output, compare the result with the target output to give an array of errors which are then fed backwards using the derivative of the sigmoid function to adjust the weightings.
I would put the function and it's derivative in rather than importing from scipy as they are so simple. Also "it's recommended" to use tanh and d/dx.tanh for the hidden layer functions (can't remember why, probably not needed for this simple net)
# transfer functions
def sigmoid(x):
return 1 / (1 + np.exp(-x))
# derivative of sigmoid
def dsigmoid(y):
return y * (1.0 - y)
# using tanh over logistic sigmoid for the hidden layer is recommended
def tanh(x):
return np.tanh(x)
# derivative for tanh sigmoid
def dtanh(y):
return 1 - y*y
Finally, you might be able to figure out what I did a while ago with a neural net using just numpy here https://github.com/paddywwoof/Machine-Learning/blob/master/perceptron.py
Related
I am trying to implement a NN from scratch in Python. It has 2 layers: input layer –
output layer. The input layer will have 4 neurons and the output layer will have only a
single node (+biases). I have the following code but I get the error message: ValueError: shapes (4,2) and (4,1) not aligned: 2 (dim 1) != 4 (dim 0). Can someone help me?
import numpy as np
# Step 1: Define input and output data
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[0, 1, 0, 1]])
# Step 2: Define the number of input neurons, hidden neurons, and output neurons
input_neurons = 4
output_neurons = 1
# Step 3: Define the weights and biases for the network
weights = np.random.rand(input_neurons, output_neurons)
biases = np.random.rand(output_neurons, 1)
# Step 4: Define the sigmoid activation function
def sigmoid(x):
return 1 / (1 + np.exp(-x))
# Step 5: Define the derivative of the sigmoid function
def sigmoid_derivative(x):
return sigmoid(x) * (1 - sigmoid(x))
# Step 6: Define the forward propagation function
def forward_propagation(X, weights, biases):
output = sigmoid(np.dot(X.T, weights) + biases)
return output
# Step 7: Define the backward propagation function
def backward_propagation(X, y, output, weights, biases):
error = output - y
derivative = sigmoid_derivative(output)
delta = error * derivative
weights_derivative = np.dot(X, delta.T)
biases_derivative = np.sum(delta, axis=1, keepdims=True)
return delta, weights_derivative, biases_derivative
# Step 8: Define the train function
def train(X, y, weights, biases, epochs, learning_rate):
for i in range(epochs):
output = forward_propagation(X, weights, biases)
delta, weights_derivative, biases_derivative = backward_propagation(X, y, output, weights, biases)
weights -= learning_rate * weights_derivative
biases -= learning_rate * biases_derivative
error = np.mean(np.abs(delta))
print("Epoch ", i, " error: ", error)
# Step 9: Train the network
epochs = 5000
learning_rate = 0.1
train(X, y, weights, biases, epochs, learning_rate)
You have an output layer with one neuron, so your output should be of one dimension.
You're assuming that the output has 4 dims:
y = np.array([[0, 1, 0, 1]])
Since you are giving two inputs (a pair of 4 dim inputs) like this,
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
You need also give two outputs (in one dim), for example like this:
y = np.array([[0],[1]])
Hope this helps.
Here is a link to my project: https://github.com/aaronnoyes/neural-network/blob/master/nn.py
I have implemented a basic neural network in python. By default it uses a sigmoid activation function and that works great. I'm trying to compare changes in learning rate between activation functions, so I tried implementing an option for using ReLU. When it runs however, the weights all drop immediately to 0.
if (self.activation == 'relu'):
d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * self.relu(self.output, True)))
d_weights1 = np.dot(self.input.T, (np.dot(2*(self.y - self.output) * self.relu(self.output, True), self.weights2.T) * self.relu(self.layer1, True)))
I'm almost sure the issue is in lines 54-56 of my program (shown above) when I try to apply gradient descent. How can I fix this so the program will actually update weights appropriately? My relu implementation is as follows:
def relu(self, x, derivative=False):
if derivative:
return 1. * (x > 0)
else:
return x * (x > 0)
There are two problems with your code:
You are applying a relu to the output layer as well. The recommended standard approach is to use identity as output layer activation for regression and sigmoid/softmax for classification.
You are using a learning rate of 1, which is way to high. (Usual test values are 1e-2 and smaller.)
I changed the output activation to sigmoid even when using relu activation in the hidden layers
def feedforward(self):
...
if (self.activation == 'relu'):
self.layer1 = self.relu(np.dot(self.input, self.weights1))
self.output = self.sigmoid(np.dot(self.layer1, self.weights2))
return self.output
def backprop(self):
...
if (self.activation == 'relu'):
d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * self.sigmoid(self.output, True)))
d_weights1 = np.dot(self.input.T, (np.dot(2*(self.y - self.output) * self.relu(self.output, True), self.weights2.T) * self.relu(self.layer1, True)))
and used a smaller learning rate
# update the weights with the derivative (slope) of the loss function
self.weights1 += .01 * d_weights1
self.weights2 += .01 * d_weights2
and this is the result:
Actual Output : [[ 0.00000] [ 1.00000] [ 1.00000] [ 0.00000]]
Predicted Output: [[ 0.10815] [ 0.92762] [ 0.94149] [ 0.05783]]
I'm currently building my 3-4-1 neural network from scratch using numpy (I avoided using keras and tensorflow for the purpose of learning and trying to demonstrate my knowledge instead of using pre-built libraries to do all the work), the problems I find when I run the program are:
1/ getting "nan" values after a certain number of iterations in the "updated" weights, lowering the learning rate only delays the problem and doesn't solve it.
2/ the second problem is the very low predicting accuracy.
I would like to know what causes these bugs on my program and would appreciate any help.
here is the code:
# Import our dependencies
from numpy import exp, array, random, dot, ones_like, where
# Create our Artificial Neural Network class
class ArtificialNeuralNetwork():
# initializing the class
def __init__(self):
# generating the same synaptic weights every time the program runs
random.seed(1)
# synaptic weights (3 × 4 Matrix) of the hidden layer
self.w_ij = 2 * random.rand(3, 4) - 1
# synaptic weights (4 × 1 Matrix) of the output layer
self.w_jk = 2 * random.rand(4, 1) - 1
def LeakyReLU(self, x):
# The Leaky ReLU (short for Rectified Linear Unit) activation function will be applied to the inputs of the hidden layer
# The activation function will return the same value of x if x is positive
# while it will multiply the negative values of x by the alpha parameter
# we used in this example the Leaky ReLU instead of the standard ReLU activation function to avoid the dying ReLU problem
return where(x > 0, x, x * 0.01)
def LeakyReLUDerivative(self, x, α = 0.01):
# The Leaky ReLU Derivative will return 1 for every positive value in the x array
# while returning the value of the parameter alpha for every negative value
x[x > 0] = 1 # returns 1 for every positive value in the x array
x[x <= 0] = α # returns α for every negative value in the x array
return x
def Sigmoid(self, x):
# The Sigmoid activation function will turn every input value into probabilities between 0 and 1
# the probabilistic values help us assert which class x belongs to
return 1 / (1 + exp(-x))
def SigmoidDerivative(self, x):
# The derivative of the Sigmoid activation function will be used to calculate the gradient during the backpropagation process
# and help optimize the random starting synaptic weights
return x * (1 - x)
def train(self, x, y, learning_rate, iterations):
# x: training set of data
# y: the actual output of the training data
for i in range(iterations):
z_ij = dot(x, self.w_ij) # the dot product of the weights of the hidden layer and the inputs
a_ij = self.LeakyReLU(z_ij) # using the Leaky ReLU activation function to introduce non-linearity to our Neural Network
z_jk = dot(a_ij, self.w_jk) # the same precedent process will be applied to find the last input of the output layer
a_jk = self.Sigmoid(z_jk) # this time the Sigmoid activation function will be used instead of Leaky ReLU
dl_jk = -y/a_jk + (1 - y)/(1 - a_jk) # calculating the derivative of the cross entropy loss wrt output
da_jk = self.SigmoidDerivative(a_jk) # calculating the derivative of the Sigmoid activation function wrt the input (before activation) of the output layer
dz_jk = a_ij # calculating the derivative of the inputs of the hidden layer (before activation) wrt weights of the output layer
dl_ij = dot(da_jk * dl_jk, self.w_jk.T) # calculating the derivative of the cross entropy loss wrt activated input of the hidden layer
# to do so we multiply the derivative of the cross entropy loss wrt output by the derivative of the Sigmoid activation function wrt the input (before activation) of the output layer by the derivative of the inputs of the hidden layer (before activation) wrt weights of the output layer
da_ij = self.LeakyReLUDerivative(z_ij) # calculating the derivative of the Leaky ReLU activation function wrt the inputs of the hidden layer (before activation)
dz_ij = x # calculating the derivative of the inputs of the hidden layer (before activation) wrt weights of the hidden layer
# calculating the gradient using the chain rule
gradient_ij = dot(dz_ij.T , dl_ij * da_ij)
gradient_jk = dot(dz_jk.T , dl_jk * da_jk)
# calculating the new optimal weights
self.w_ij = self.w_ij - learning_rate * gradient_ij
self.w_jk = self.w_jk - learning_rate * gradient_jk
def predict(self, inputs):
# predicting the class of the input data after weights optimization
output_from_layer1 = self.LeakyReLU(dot(inputs, self.w_ij)) # the output of the hidden layer
output_from_layer2 = self.Sigmoid(dot(output_from_layer1, self.w_jk)) # the output of the output layer
return output_from_layer1, output_from_layer2
# the function will print the initial starting weights before training
def SynapticWeights(self):
print("Layer 1 (4 neurons, each with 3 inputs): ")
print("w_ij: ", self.w_ij)
print("Layer 2 (1 neuron, with 4 inputs): ")
print("w_jk: ", self.w_jk)
def main():
ANN = ArtificialNeuralNetwork()
ANN.SynapticWeights()
# the training inputs
x = array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]])
# the training outputs
y = array([[0, 1, 1, 1, 1, 0, 0]]).T
ANN.train(x, y, 1, 10000)
# Printing the new synaptic weights after training
print("New synaptic weights after training: ")
print("w_ij: ", ANN.w_ij)
print("w_jk: ", ANN.w_jk)
# Our prediction after feeding the ANN with new set of data
print("Considering new situation [1, 1, 0] -> ?: ")
print(ANN.predict(array([[1, 1, 0]])))
if __name__=="__main__":
main()
So, I changed a few things. (Disclaimer: I didn't check the correctness of the code)
Weight initialization: initialize to much smaller weights.
# synaptic weights (3 × 4 Matrix) of the hidden layer
self.w_ij = (2 * random.rand(3, 4) - 1)*0.1
# synaptic weights (4 × 1 Matrix) of the output layer
self.w_jk = (2 * random.rand(4, 1) - 1)*0.1
Weight initialization really matter.
I reduced the learning rate to 0.1.
ANN.train(x, y, .1, 500000)
I see the neural network perfectly fitting your data and not giving Nan even after 500,000 iterations.
print(ANN.predict(array([[0, 0, 1],
[0, 1, 1],
[1, 0, 1],
[0, 1, 0],
[1, 0, 0],
[1, 1, 1],
[0, 0, 0]])))
I have implemented a simple Neural Net with just a single sigmoid hidden layer, with the choice of a sigmoid or softmax output layer and squared error or cross entropy loss function, respectively. After much research on the softmax activation function, the cross entropy loss, and their derivatives (and with following this blog) I believe that my implementation seems correct.
When attempting to learn the simple XOR function, the NN with the sigmoid output learns to a very small loss very quickly when using single binary outputs of 0 and 1. However, when changing the labels to one-hot encodings of [1, 0] = 0 and [0, 1] = 1, the softmax implementation does not work. The loss consistently increases as the network's outputs converge to exactly [0, 1] for the two outputs on every input, yet the labels of the data set is perfectly balanced between [0, 1] and [1, 0].
My code is below, where the choice of using sigmoid or softmax at the output layer can be chosen by uncommenting the necessary two lines near the bottom of the code. I cannot figure out why the softmax implementation is not working.
import numpy as np
class MLP:
def __init__(self, numInputs, numHidden, numOutputs, activation):
self.numInputs = numInputs
self.numHidden = numHidden
self.numOutputs = numOutputs
self.activation = activation.upper()
self.IH_weights = np.random.rand(numInputs, numHidden) # Input -> Hidden
self.HO_weights = np.random.rand(numHidden, numOutputs) # Hidden -> Output
self.IH_bias = np.zeros((1, numHidden))
self.HO_bias = np.zeros((1, numOutputs))
# Gradients corresponding to weight matrices computed during backprop
self.IH_w_gradients = np.zeros_like(self.IH_weights)
self.HO_w_gradients = np.zeros_like(self.HO_weights)
# Gradients corresponding to biases computed during backprop
self.IH_b_gradients = np.zeros_like(self.IH_bias)
self.HO_b_gradients = np.zeros_like(self.HO_bias)
# Input, hidden and output layer neuron values
self.I = np.zeros(numInputs) # Inputs
self.L = np.zeros(numOutputs) # Labels
self.H = np.zeros(numHidden) # Hidden
self.O = np.zeros(numOutputs) # Output
# ##########################################################################
# ACIVATION FUNCTIONS
# ##########################################################################
def sigmoid(self, x, derivative=False):
if derivative:
return x * (1 - x)
return 1 / (1 + np.exp(-x))
def softmax(self, prediction, label=None, derivative=False):
if derivative:
return prediction - label
return np.exp(prediction) / np.sum(np.exp(prediction))
# ##########################################################################
# LOSS FUNCTIONS
# ##########################################################################
def squaredError(self, prediction, label, derivative=False):
if derivative:
return (-2 * prediction) + (2 * label)
return (prediction - label) ** 2
def crossEntropy(self, prediction, label, derivative=False):
if derivative:
return [-(y / x) for x, y in zip(prediction, label)] # NOT NEEDED ###############################
return - np.sum([y * np.log(x) for x, y in zip(prediction, label)])
# ##########################################################################
def forward(self, inputs):
self.I = np.array(inputs).reshape(1, self.numInputs) # [numInputs, ] -> [1, numInputs]
self.H = self.I.dot(self.IH_weights) + self.IH_bias
self.H = self.sigmoid(self.H)
self.O = self.H.dot(self.HO_weights) + self.HO_bias
if self.activation == 'SIGMOID':
self.O = self.sigmoid(self.O)
elif self.activation == 'SOFTMAX':
self.O = self.softmax(self.O) + 1e-10 # allows for log(0)
return self.O
def backward(self, labels):
self.L = np.array(labels).reshape(1, self.numOutputs) # [numOutputs, ] -> [1, numOutputs]
if self.activation == 'SIGMOID':
self.O_error = self.squaredError(self.O, self.L)
self.O_delta = self.squaredError(self.O, self.L, derivative=True) * self.sigmoid(self.O, derivative=True)
elif self.activation == 'SOFTMAX':
self.O_error = self.crossEntropy(self.O, self.L)
self.O_delta = self.softmax(self.O, self.L, derivative=True)
self.H_error = self.O_delta.dot(self.HO_weights.T)
self.H_delta = self.H_error * self.sigmoid(self.H, derivative=True)
self.IH_w_gradients += self.I.T.dot(self.H_delta)
self.HO_w_gradients += self.H.T.dot(self.O_delta)
self.IH_b_gradients += self.H_delta
self.HO_b_gradients += self.O_delta
return self.O_error
def updateWeights(self, learningRate):
self.IH_weights += learningRate * self.IH_w_gradients
self.HO_weights += learningRate * self.HO_w_gradients
self.IH_bias += learningRate * self.IH_b_gradients
self.HO_bias += learningRate * self.HO_b_gradients
self.IH_w_gradients = np.zeros_like(self.IH_weights)
self.HO_w_gradients = np.zeros_like(self.HO_weights)
self.IH_b_gradients = np.zeros_like(self.IH_bias)
self.HO_b_gradients = np.zeros_like(self.HO_bias)
sigmoidData = [
[[0, 0], 0],
[[0, 1], 1],
[[1, 0], 1],
[[1, 1], 0]
]
softmaxData = [
[[0, 0], [1, 0]],
[[0, 1], [0, 1]],
[[1, 0], [0, 1]],
[[1, 1], [1, 0]]
]
sigmoidMLP = MLP(2, 10, 1, 'SIGMOID')
softmaxMLP = MLP(2, 10, 2, 'SOFTMAX')
# SIGMOID #######################
# data = sigmoidData
# mlp = sigmoidMLP
# ###############################
# SOFTMAX #######################
data = softmaxData
mlp = softmaxMLP
# ###############################
numEpochs = 5000
for epoch in range(numEpochs):
losses = []
for i in range(len(data)):
print(mlp.forward(data[i][0])) # Print outputs
# mlp.forward(data[i][0]) # Don't print outputs
loss = mlp.backward(data[i][1])
losses.append(loss)
mlp.updateWeights(0.001)
# if epoch % 1000 == 0 or epoch == numEpochs - 1: # Print loss every 1000 epochs
print(np.mean(losses)) # Print loss every epoch
Contrary to all the information online, simply changing the derivative of the softmax cross entropy from prediction - label to label - prediction solved the problem. Perhaps I have something else backwards somewhere since every source I have come across has it as prediction - label.
For past few days I have been debugging my NN but I can't find an issue.
I've created total raw implementation of multi-layer perceptron for identifying MNIST dataset images.
Network seems to learn because after train cycle test data accuracy is above 94% accuracy. I have problem with loss function - it starts increasing after a while, when test/val accuracy reaches ~76%.
Can someone please check my forward/backprop math and tell me if my loss function is properly implemented, or suggest what might be wrong?
NN structure:
input layer: 758 nodes, (1 node per pixel)
hidden layer 1: 300 nodes
hidden layer 2: 75 nodes
output layer: 10 nodes
NN activation functions:
input layer -> hidden layer 1: ReLU
hidden layer 1 -> hidden layer 2: ReLU
hidden layer 2 -> output layer 3: Softmax
NN Loss function:
Categorial Cross-Entropy
Full CLEAN code available here as Jupyter Notebook.
Neural Network forward/backward pass:
def train(self, features, targets):
n_records = features.shape[0]
# placeholders for weights and biases change values
delta_weights_i_h1 = np.zeros(self.weights_i_to_h1.shape)
delta_weights_h1_h2 = np.zeros(self.weights_h1_to_h2.shape)
delta_weights_h2_o = np.zeros(self.weights_h2_to_o.shape)
delta_bias_i_h1 = np.zeros(self.bias_i_to_h1.shape)
delta_bias_h1_h2 = np.zeros(self.bias_h1_to_h2.shape)
delta_bias_h2_o = np.zeros(self.bias_h2_to_o.shape)
for X, y in zip(features, targets):
### forward pass
# input to hidden 1
inputs_to_h1_layer = np.dot(X, self.weights_i_to_h1) + self.bias_i_to_h1
inputs_to_h1_layer_activated = self.activation_ReLU(inputs_to_h1_layer)
# hidden 1 to hidden 2
h1_to_h2_layer = np.dot(inputs_to_h1_layer_activated, self.weights_h1_to_h2) + self.bias_h1_to_h2
h1_to_h2_layer_activated = self.activation_ReLU(h1_to_h2_layer)
# hidden 2 to output
h2_to_output_layer = np.dot(h1_to_h2_layer_activated, self.weights_h2_to_o) + self.bias_h2_to_o
h2_to_output_layer_activated = self.softmax(h2_to_output_layer)
# output
final_outputs = h2_to_output_layer_activated
### backpropagation
# output to hidden2
error = y - final_outputs
output_error_term = error.dot(self.dsoftmax(h2_to_output_layer_activated))
h2_error = np.dot(output_error_term, self.weights_h2_to_o.T)
h2_error_term = h2_error * self.activation_dReLU(h1_to_h2_layer_activated)
# hidden2 to hidden1
h1_error = np.dot(h2_error_term, self.weights_h1_to_h2.T)
h1_error_term = h1_error * self.activation_dReLU(inputs_to_h1_layer_activated)
# weight & bias step (input to hidden)
delta_weights_i_h1 += h1_error_term * X[:, None]
delta_bias_i_h1 = np.sum(h1_error_term, axis=0)
# weight & bias step (hidden1 to hidden2)
delta_weights_h1_h2 += h2_error_term * inputs_to_h1_layer_activated[:, None]
delta_bias_h1_h2 = np.sum(h2_error_term, axis=0)
# weight & bias step (hidden2 to output)
delta_weights_h2_o += output_error_term * h1_to_h2_layer_activated[:, None]
delta_bias_h2_o = np.sum(output_error_term, axis=0)
# update the weights and biases
self.weights_i_to_h1 += self.lr * delta_weights_i_h1 / n_records
self.weights_h1_to_h2 += self.lr * delta_weights_h1_h2 / n_records
self.weights_h2_to_o += self.lr * delta_weights_h2_o / n_records
self.bias_i_to_h1 += self.lr * delta_bias_i_h1 / n_records
self.bias_h1_to_h2 += self.lr * delta_bias_h1_h2 / n_records
self.bias_h2_to_o += self.lr * delta_bias_h2_o / n_records
Activation function implementation:
def activation_ReLU(self, x):
return x * (x > 0)
def activation_dReLU(self, x):
return 1. * (x > 0)
def softmax(self, x):
z = x - np.max(x)
return np.exp(z) / np.sum(np.exp(z))
def dsoftmax(self, x):
# TODO: vectorise math
vec_len = len(x)
J = np.zeros((vec_len, vec_len))
for i in range(vec_len):
for j in range(vec_len):
if i == j:
J[i][j] = x[i] * (1 - x[j])
else:
J[i][j] = -x[i] * x[j]
return J
Loss function implementation:
def categorical_cross_entropy(pred, target):
return (1/len(pred)) * -np.sum(target * np.log(pred))
I managed to find the problem.
Neural Network is large so I couldn't stick everything to this question. Though if you check my Jupiter Notebook you could see implementation of my Softmax activation function and how do I use it in train cycle.
Problem with Loss miscalculation was caused by the fact my Softmax implementation worked only for ndarray dim == 1.
During training step I have put only ndarray with dim 1 to activtion function so NN learned well, but my run() function was returning wrong predictions as I have inserted whole test data to it, not only single row of it in for loop. Because of that it calculated Softmax "matrix-wise" rather than "row-wise".
This is very fast fix for it:
def softmax(self, x):
# TODO: vectorise math to speed up computation
softmax_result = None
if x.ndim == 1:
z = x - np.max(x)
softmax_result = np.exp(z) / np.sum(np.exp(z))
return softmax_result
else:
softmax_result = []
for row in x:
z = row - np.max(row)
row_softmax_result = np.exp(z) / np.sum(np.exp(z))
softmax_result.append(row_softmax_result)
return np.array(softmax_result)
Yet this code should be vectorised to avoid for loops and ifs if possible because currently it's ugly and takes too much PC resources.