How to compute accuracy? - python

I am struggling on how to compute accuracy from my neural network. I am using the MNIST database with backpropagation algorithm. All is done from scratch.
My partial code looks like this:
for x in range(1, epochs+1):
#Compute Feedforward
#...
activationValueOfSoftmax = softmax(Z2)
#Loss
#Y are my labels
loss = - np.sum((Y * np.log(activationValueOfSoftmax)), axis=0, keepdims=True)
cost = np.sum(loss, axis=1) / m #m is 784
#Backpropagation
dZ2 = activationValueOfSoftmax - Y
#the rest of the parameters
#...
#parameters update via Gradient Descent
#...
Can I compute accuracy from this, or do I have to redo some parts of my NN?
Thanks for any help!

I assume that you have your 10 sized one hot y vector for your test set (10 digits), and you retrieved your hypothesis through forward prop with your training set.
correct = 0
for i in range(np.shape(y)[0]):
#argmax retrieves index of max element in hypothesis
guess = np.argmax(hyp[i, :])
ans= np.argmax(y[i, :])
print("guess: ", guess, "| ans: ", ans)
if guess == match:
correct = correct + 1;
accuracy = (correct/np.shape(y)[0]) * 100
You have to do forward prop again with your weights and the TEST SET data to get your hypothesis vector (should be 10 sized) then you can loop through all the y values in the test set, using a counter variable (correct) to retrieve the amount correct. To get percentage, you just divide the correct by the number of test set examples and multiply by 100.
If you want the accuracy from the training set, just use your hypothesis (in your case activationValueOfSoftmax) and do the same.
Best of luck

Related

Polynomial regression exploding gradients and underfitting

I wanted to code my implementation of polynomial regression, but my model's gradients either exploded or my model didn't fit the data well enough.
For testing purposes, my dataset is just the function x^2 and my model is a second-degree polynomial ax^2 + bx + c. I trained it for 50 epochs using batch gradient descent.
I noticed that the model explodes with the learning rate >=0.001 and underfits with a learning rate <=0.0001
To visualize the model, at the end of each epoch, I plot the model's predictions with the labels. So, in the ideal case, these lines should be indistinguishable.
*The orange line is the labels and the blue one is the model's predictions.
Here is the model exploding:*
And here it underfits:*
One interesting thing is, that even though the model's predictions are way too big, the line still resembles the correct polynomial. And the picture where the predictions go into negatives is also correct, but just flipped/mirrored.
I made the code in python. This is my main.py:
from decimal import Decimal
from matplotlib.pyplot import plot, draw, pause, clf
from model import PolynomialRegression
POLYNOMIAL_FUNCTION = [0, 1, 2]
LEARNING_RATE = Decimal(0.0001)
DATASET = [0, 1, 2, 3, 4, 5, 6, 7]
LABELSET = [0, 1, 4, 9, 16, 32, 64, 128]
EPOCHS = 50
model = PolynomialRegression(POLYNOMIAL_FUNCTION, LEARNING_RATE)
for _ in range(EPOCHS):
for data, label in zip(DATASET, LABELSET):
# train the model
model.train(data, label)
# update the model
model.update()
# predict the dataset
predictions = [model.predict(data) for data in DATASET]
# plot predictions and labels
plot(predictions)
plot(LABELSET)
draw()
pause(0.1)
clf()
print(model.parameters)
# erase the stored gradients
model.clear_grad()
And this is my model.py:
from decimal import Decimal
class PolynomialRegression:
"""
Polynomial regression model.
"""
def __init__(self, polynomial_function: list, learning_rate: Decimal) -> None:
# the structure of the polynomial function (the exponents)
self.polynomial_function = polynomial_function
# parameters of the model set to be 1
self.parameters = [Decimal(1)] * len(polynomial_function)
self.learning_rate = learning_rate
# stored gradients to update the model
self.gradients = []
def predict(self, x: Decimal) -> Decimal:
"""
Make a prediction based on the input.
Args:
x (Decimal): Input to the model.
Returns:
Decimal: A prediction.
"""
y = Decimal(0)
# go through each parameter and exponent
for param, exponent in zip(self.parameters, self.polynomial_function):
# compute a term and add it to the final output
y += param * (x ** exponent)
return y
def train(self, x: Decimal, y: Decimal) -> Decimal:
"""
Compute a gradient from a given input and target output.
Args:
x (Decimal): Input for the model.
y (Decimal): Target/Desired output.
Returns:
Decimal: An MSE loss.
"""
prediction = self.predict(x)
error = prediction - y
loss = error ** 2
gradient = []
# go through each parameter and exponent
for param, exponent in zip(self.parameters, self.polynomial_function):
# compute the gradient for a single parameter
param_gradient = error * (x ** exponent) * self.learning_rate
# add the parameter gradient to the gradient list
gradient.append(param_gradient)
# add the gradient to a list
self.gradients.append(gradient)
return loss
def __sum_gradients(self) -> Decimal:
"""
Return a sum of gradients along the 0 axis.
(equivalent of numpy.sum(x, axis=0))
Returns:
list: List of summed Decimals.
"""
result = [Decimal(0)] * len(self.parameters)
# iterate through the y axis
for gradient in self.gradients:
# iterate through the x axis
for i, param_gradient in enumerate(gradient):
result[i] += param_gradient
return result
def update(self) -> None:
"""
Update the model's parameters based on the stored gradients.
"""
summed_gradients = self.__sum_gradients()
# fraction used to calculate the average for every gradient
averaging_fraction = Decimal(1) / len(self.gradients)
for param_index, grad in enumerate(summed_gradients):
self.parameters[param_index] -= averaging_fraction * grad
def clear_grad(self) -> None:
"""
Clear/Reset the stored gradients.
"""
self.gradients = []
I think the problem lies somewhere in my gradient descent calculations, but it may also be something unexpected and silly.
First your dataset consists of only 8 datapoints. This is to few data to generalize a model, which means that you are probably overfitting.
The second thing I see, is that you do not normalize the x data. The model is not very complex so I guess it doesn't really matter in that context. But if you had a more complex model with n features and one feature is very small and one is very big, the feature with the bigger values would influence the result much more than the smaller one. Which might result in a bad performing model.
But your last plot doesn't look like it's underfitting to me. You have to realize that a ML model will always have an error. In my opinion for 8 datapoints, a model with only one layer and 50 epochs that looks fine. You probably could improve the results by learning longer, but that would mean to overfit the model even more. But to be honest if your goal is to emulate a mathematical function with ML this should be okay. You could also add a new layer.
The fact that your lr has to be that small to not fuck up the results tells me that you are correct, there is something wrong with the gradient descent you might want to look into this behavior.
An easy way to evaluate this is to build your model in pytorch and then use your optimizer to update the weights. If you get the same problem, it was your gradient descent, if not the problem lies somewhere else. But I strongly believe it is your gradient descent. Maybe debug into this function and look at the actual values you are subtracting.

Can I set the predicted values of a neural network?

My question is more theoretical than practical but I can show some code as well. I have network that is mapping from random values in a domain uvw, to a xyz domain. I want a certain values of the uvw to go to other certain values in the xyz that I already know, since that is the idea of the function that I want to obtain and I want the network to over-fit.
My question is divided into two questions:
Can I set the predicted values I want into the network so it doesn't have to calculate those predicted values?
Does this is going to affect prediction of the other values?
This is my code, I want to show it so we have some notation that we can talk about.
# The model is a simple fully connected network mapping a 3D parameter point to 3D
phi = common.MLP(in_dim=3, out_dim=3).to(args.device)
# Eps is 1/lambda and max_iters is the maximum number of Sinkhorn iterations to do
emd_loss_fun = SinkhornLoss(eps=args.sinkhorn_eps, max_iters=args.max_sinkhorn_iters,
stop_thresh=1e-3, return_transport_matrix=True) # TODO add r-1 function to the weights
mse_loss_fun = torch.nn.MSELoss()
# Adam optimizer at first
optimizer = torch.optim.Rprop(phi.parameters(), lr=args.learning_rate)
fit_start_time = time.time()
for epoch in range(args.num_epochs):
optimizer.zero_grad()
# Do the forward pass of the neural net, evaluating the function at the parametric points
y = phi(t)
# Compute the Sinkhorn divergence between the reconstruction*(using the francis library) and the target
# NOTE: The Sinkhorn function expects a batch of b point sets (i.e. tensors of shape [b, n, 3])
# since we only have 1, we unsqueeze so x and y have dimension [1, n, 3]
with torch.no_grad():
_, M = emd_loss_fun(phi(t[num:]).unsqueeze(0), x[num:].unsqueeze(0))
_, Q = emd_loss_fun(phi(t[0:num]).unsqueeze(0), x[0:num].unsqueeze(0))
P[0,num:,num:] = M[0]
P[0,0:num,0:num] = Q[0]
#print(y[Q.squeeze().max(0)[1], :])
# Project the transport matrix onto the space of permutation matrices and compute the L-2 loss
# between the permuted points
loss = mse_loss_fun(y[P.squeeze().max(0)[1], :], x)
# loss = mse_loss_fun(P.squeeze() # y, x) # Use the transport matrix directly
# Take an optimizer step
loss.backward()
optimizer.step()
print("Epoch %d, loss = %f" % (epoch, loss.item()))

Basic neural network, weights too high

I am trying to code a very basic neural network in python, with 3 input nodes with a value of 0 or 1 and a single output node, with a value of 0 or 1. The output should be almost equal to the second input, but after training, the weights are way way too high, and the network almost always guesses 1.
I am using python 3.7 with numpy and scipy. I have tried changing the training set, the new instance, and the random seed
import numpy as np
from scipy.special import expit as ex
rand.seed(10)
training_set=[[0,1,0],[1,0,1],[0,0,0],[1,1,1]] #The training sets and their outputs
training_outputs=[0,1,0,1]
weightlst=[rand.uniform(-1,1),rand.uniform(-1,1),rand.uniform(-1,1)] #Weights are randomly set with a value between -1 and 1
print('Random weights\n'+str(weightlst))
def calcout(inputs,weights): #Calculate the expected output with given inputs and weights
output=0.5
for i in range(len(inputs)):
output=output+(inputs[i]*weights[i])
#print('\nmy output is ' + str(ex(output)))
return ex(output) #Return the output on a sigmoid curve between 0 and 1
def adj(expected_output,training_output,weights,inputs): #Adjust the weights based on the expected output, true (training) output and the weights
adjweights=[]
error=expected_output-training_output
for i in weights:
adjweights.append(i+(error*(expected_output*(1-expected_output))))
return adjweights
#Train the network, adjusting weights each time
training_iterations=10000
for k in range(training_iterations):
for l in range(len(training_set)):
expected=calcout(training_set[l],weightlst)
weightlst=adj(expected,training_outputs[l],weightlst,training_set[l])
new_instance=[1,0,0] #Calculate and return the expected output of a new instance
print('Adjusted weights\n'+str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance,weightlst)))
The expected output would be 0, or something very close to it, but no matter what i set new_instance to, the output is still
Random weights
[-0.7312715117751976, 0.6948674738744653, 0.5275492379532281]
Adjusted weights
[1999.6135460307303, 2001.03968501638, 2000.8723667804588]
Expected output of new instance = 1.0
What is wrong with my code?
Bugs:
No bias used in the neuron
error=training_output-expected_output (not the other way around) for gradient decent
weight update rule of ith weight w_i = w_i + learning_rate * delta_w_i, (delta_w_i is gradient of loss with respect to w_i)
For squared loss delta_w_i = error*sample[i] (ith value of input vector sample)
Since you have only one neuron (one hidden layer or size 1) your model can only learn linearly separable data (it is only a linear classifier). Examples of linearly separable data are data generated by functions like boolean AND, OR. Note that boolean XOR is not linearly separable.
Code with bugs fixed
import numpy as np
from scipy.special import expit as ex
rand.seed(10)
training_set=[[0,1,0],[1,0,1],[0,0,0],[1,1,1]] #The training sets and their outputs
training_outputs=[1,1,0,1] # Boolean OR of input vector
#training_outputs=[0,0,,1] # Boolean AND of input vector
weightlst=[rand.uniform(-1,1),rand.uniform(-1,1),rand.uniform(-1,1)] #Weights are randomly set with a value between -1 and 1
bias = rand.uniform(-1,1)
print('Random weights\n'+str(weightlst))
def calcout(inputs,weights, bias): #Calculate the expected output with given inputs and weights
output=bias
for i in range(len(inputs)):
output=output+(inputs[i]*weights[i])
#print('\nmy output is ' + str(ex(output)))
return ex(output) #Return the output on a sigmoid curve between 0 and 1
def adj(expected_output,training_output,weights,bias,inputs): #Adjust the weights based on the expected output, true (training) output and the weights
adjweights=[]
error=training_output-expected_output
lr = 0.1
for j, i in enumerate(weights):
adjweights.append(i+error*inputs[j]*lr)
adjbias = bias+error*lr
return adjweights, adjbias
#Train the network, adjusting weights each time
training_iterations=10000
for k in range(training_iterations):
for l in range(len(training_set)):
expected=calcout(training_set[l],weightlst, bias)
weightlst, bias =adj(expected,training_outputs[l],weightlst,bias,training_set[l])
new_instance=[1,0,0] #Calculate and return the expected output of a new instance
print('Adjusted weights\n'+str(weightlst))
print('\nExpected output of new instance = ' + str(calcout(new_instance,weightlst, bias)))
Output:
Random weights
[0.142805189379827, -0.14222189064977075, 0.15618260226894076]
Adjusted weights
[6.196759842119063, 11.71208191137411, 6.210137255008176]
Expected output of new instance = 0.6655563851223694
As up can see for input [1,0,0] the model predicted the probability 0.66 which is class 1 (since 0.66>0.5). It is correct as the output class is OR of input vector.
Note:
For learning/understanding how each weight is updated it is ok to code like above, but in practice all the operations are vectorised. Check the link for vectorized implementation.

Why does this backpropagation implementation fail to train weights correctly?

I've written the following backpropagation routine for a neural network, using the code here as an example. The issue I'm facing is confusing me, and has pushed my debugging skills to their limit.
The problem I am facing is rather simple: as the neural network trains, its weights are being trained to zero with no gain in accuracy.
I have attempted to fix it many times, verifying that:
the training sets are correct
the target vectors are correct
the forward step is recording information correctly
the backward step deltas are recording properly
the signs on the deltas are correct
the weights are indeed being adjusted
the deltas of the input layer are all zero
there are no other errors or overflow warnings
Some information:
The training inputs are an 8x8 grid of [0,16) values representing an intensity; this grid represents a numeral digit (converted to a column vector)
The target vector is an output that is 1 in the position corresponding to the correct number
The original weights and biases are being assigned by Gaussian distribution
The activations are a standard sigmoid
I'm not sure where to go from here. I've verified that all things I know to check are operating correctly, and it's still not working, so I'm asking here. The following is the code I'm using to backpropagate:
def backprop(train_set, wts, bias, eta):
learning_coef = eta / len(train_set[0])
for next_set in train_set:
# These record the sum of the cost gradients in the batch
sum_del_w = [np.zeros(w.shape) for w in wts]
sum_del_b = [np.zeros(b.shape) for b in bias]
for test, sol in next_set:
del_w = [np.zeros(wt.shape) for wt in wts]
del_b = [np.zeros(bt.shape) for bt in bias]
# These two helper functions take training set data and make them useful
next_input = conv_to_col(test)
outp = create_tgt_vec(sol)
# Feedforward step
pre_sig = []; post_sig = []
for w, b in zip(wts, bias):
next_input = np.dot(w, next_input) + b
pre_sig.append(next_input)
post_sig.append(sigmoid(next_input))
next_input = sigmoid(next_input)
# Backpropagation gradient
delta = cost_deriv(post_sig[-1], outp) * sigmoid_deriv(pre_sig[-1])
del_b[-1] = delta
del_w[-1] = np.dot(delta, post_sig[-2].transpose())
for i in range(2, len(wts)):
pre_sig_vec = pre_sig[-i]
sig_deriv = sigmoid_deriv(pre_sig_vec)
delta = np.dot(wts[-i+1].transpose(), delta) * sig_deriv
del_b[-i] = delta
del_w[-i] = np.dot(delta, post_sig[-i-1].transpose())
sum_del_w = [dw + sdw for dw, sdw in zip(del_w, sum_del_w)]
sum_del_b = [db + sdb for db, sdb in zip(del_b, sum_del_b)]
# Modify weights based on current batch
wts = [wt - learning_coef * dw for wt, dw in zip(wts, sum_del_w)]
bias = [bt - learning_coef * db for bt, db in zip(bias, sum_del_b)]
return wts, bias
By Shep's suggestion, I checked what's happening when training a network of shape [2, 1, 1] to always output 1, and indeed, the network trains properly in that case. My best guess at this point is that the gradient is adjusting too strongly for the 0s and weakly on the 1s, resulting in a net decrease despite an increase at each step - but I'm not sure.
I suppose your problem is in choice of initial weights and in choice of initialization of weights algorithm. Jeff Heaton author of Encog claims that it as usually performs worse then other initialization method. Here is another results of weights initialization algorithm perfomance. Also from my own experience recommend you to init your weights with different signs values. Even in cases when I had all positive outputs weights with different signs perfomed better then with the same sign.

Issue with gradient calculation in a Neural Network (stuck at 7% error in MNIST)

Hi I am having an issue with my calculation of checking the gradient when implementing a neural network in python using numpy.
I am using mnist dataset to try and trying to using mini-batch gradient descent.
I have check the math and on paper look good so maybe you can give me a hint of what's happening here:
EDIT: One answer made me realize that indeed the cost function was being calculated wrong. Howerver that does not explain the problem with the gradient as it is calculated using back_prop. I get %7 error rate using 300 units in the hidden layer using minibatch gradient descent with rmsprop, 30 epochs and 100 batches. (learning_rate = 0.001, small due to the rmsprop).
each input is has 768 features so for a 100 samples I have a matrix. Mnist has 10 classes.
X = NoSamplesxFeatures = 100x768
Y = NoSamplesxClasses = 100x10
I am using a one hidden layer neural network with hidden layer size of 300 when fully training. Another question I have is whether I should use a softmax output function for calculating the error... which I think not. But I am kinda newbie to all of this and the obvious might seem strange to me.
(NOTE: I know the code is ugly, but this is my first Python/Numpy code done under pressure, bear with me)
Here is back_prof and activations:
def sigmoid(z):
return np.true_divide(1,1 + np.exp(-z) )
#not calculated really - this the fake version to make it faster.
def sigmoid_prime(a):
return (a)*(1 - a)
def _back_prop(self,W,X,labels,f=sigmoid,fprime=sigmoid_prime,lam=0.001):
"""
Calculate the partial derivates of the cost function using backpropagation.
"""
#Weight for first layer and hidden layer
Wl1,bl1,Wl2,bl2 = self._extract_weights(W)
# get the forward prop value
layers_outputs = self._forward_prop(W,X,f)
#from a number make a binary vector, for mnist 1x10 with all 0 but the number.
y = self.make_1_of_c_encoding(labels)
num_samples = X.shape[0] # layers_outputs[-1].shape[0]
# Dot product return Numsamples (N) x Outputs (No CLasses)
# Y is NxNo Clases
# Layers output to
big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)
# calculate the gradient for each training sample in the batch and accumulate it
for i,x in enumerate(X):
# Error with respect the output
dE_dy = layers_outputs[-1][i,:] - y[i,:]
# bias hidden layer
big_delta_bl2 += dE_dy
# get the error for the hiddlen layer
dE_dz_out = dE_dy * fprime(layers_outputs[-1][i,:])
#and for the input layer
dE_dhl = dE_dy.dot(Wl2.T)
#bias input layer
big_delta_bl1 += dE_dhl
small_delta_hl = dE_dhl*fprime(layers_outputs[-2][i,:])
#here calculate the gradient for the weights in the hidden and first layer
big_delta_wl2 += np.outer(layers_outputs[-2][i,:],dE_dz_out)
big_delta_wl1 += np.outer(x,small_delta_hl)
# divide by number of samples in the batch (should be done here)?
big_delta_wl2 = np.true_divide(big_delta_wl2,num_samples) + lam*Wl2*2
big_delta_bl2 = np.true_divide(big_delta_bl2,num_samples)
big_delta_wl1 = np.true_divide(big_delta_wl1,num_samples) + lam*Wl1*2
big_delta_bl1 = np.true_divide(big_delta_bl1,num_samples)
# return
return np.concatenate([big_delta_wl1.ravel(),
big_delta_bl1,
big_delta_wl2.ravel(),
big_delta_bl2.reshape(big_delta_bl2.size)])
Now the feed_forward:
def _forward_prop(self,W,X,transfer_func=sigmoid):
"""
Return the output of the net a Numsamples (N) x Outputs (No CLasses)
# an array containing the size of the output of all of the laye of the neural net
"""
# Hidden layer DxHLS
weights_L1,bias_L1,weights_L2,bias_L2 = self._extract_weights(W)
# Output layer HLSxOUT
# A_2 = N x HLS
A_2 = transfer_func(np.dot(X,weights_L1) + bias_L1 )
# A_3 = N x Outputs
A_3 = transfer_func(np.dot(A_2,weights_L2) + bias_L2)
# output layer
return [A_2,A_3]
And the cost function for the gradient checking:
def cost_function(self,W,X,labels,reg=0.001):
"""
reg: regularization term
No weight decay term - lets leave it for later
"""
outputs = self._forward_prop(W,X,sigmoid)[-1] #take the last layer out
sample_size = X.shape[0]
y = self.make_1_of_c_encoding(labels)
e1 = np.sum((outputs - y)**2, axis=1))*0.5
#error = e1.sum(axis=1)
error = e1.sum()/sample_size + 0.5*reg*(np.square(W)).sum()
return error
What kind of results are you getting when you run gradient checking? Often times you can tease out the nature of the implementation error by looking at the output of your gradient vs the output produced by gradient checking.
Furthermore, square error is usually a poor choice for a classification task such as MNIST and I would suggest using either a simple sigmoid top-layer or a softmax. With sigmoid the cross entropy function you want to use is:
L(h,Y) = -Y*log(h) - (1-Y)*log(1-h)
For a softmax
L(h,Y) = -sum(Y*log(h))
where Y is the target given as a 1x10 vector and h is your predicted value, but easily extends to arbitrary batch sizes.
In both cases the top-layer delta simply becomes:
delta = h - Y
And the top-layer gradient becomes:
grad = dot(delta, A_in)
Where A_in is the input into the top layer from the previous layer.
While I am having some trouble getting my head around your backprop routine, I suspect from your code that the error in gradient is due to the fact that you are not calculating the top-level dE/dw_l2 correctly when using square error, along with computing fprime on the incorrect input.
When using square error the top layer delta should be:
delta = (h - Y) * fprime(Z_l2)
Here Z_l2 is the input into your transfer function for layer 2. Similarly when computing fprime for the lower layers, you want to use the input to your transfer function (i.e. dot(X,weights_L1) + bias_L1)
Hope that helps.
EDIT:
As some added justification for using cross entropy error over square error I would suggest looking up Geoffrey Hinton's lectures on linear classification methods:
www.cs.toronto.edu/~hinton/csc2515/notes/lec3.ppt
EDIT2:
I ran some tests locally with my implementation of neural nets on the MNIST dataset with different parameters and 1 hidden layer using RMSPROP. Here are the results:
Test1
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.001
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 6.1%
Test Error: 6.9%
Test2
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.000002
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 4.5%
Test Error: 5.7%
It already appears that if you decrease your lambda parameter by a couple orders of magnitude you should end up with better performance.

Categories