Simple Neural Network from scratch using NumPy - python

I added learning rate and momentum to a neural network implementation from scratch I found at: https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6
However, I have a few questions about my implementation:
Is it correct? Any suggested improvements? It appears to output adequate results in general, but outside advice is very much appreciated.
With a learning rate < 0.5 or momentum > 0.9, the network tends to get stuck in a local optimum where the loss is ~1. I assume this is because the step size isn't big enough to escape it, but is there a way to overcome this? Or is it inherent to the nature of the data being fit and therefore unavoidable?
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_derivative(x):
    sig = 1 / (1 + np.exp(-x))
    return sig * (1 - sig)


class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)
        self.weights2 = np.random.rand(4, 1)
        self.y = y
        self.output = np.zeros(self.y.shape)
        self.v_dw1 = 0
        self.v_dw2 = 0
        self.alpha = 0.5
        self.beta = 0.5

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self, alpha, beta):
        # application of the chain rule to find derivative of the loss function
        # with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))
        d_weights1 = np.dot(self.input.T, (np.dot(2*(self.y - self.output) *
                            sigmoid_derivative(self.output), self.weights2.T) *
                            sigmoid_derivative(self.layer1)))

        # adding effect of momentum
        self.v_dw1 = (beta * self.v_dw1) + ((1 - beta) * d_weights1)
        self.v_dw2 = (beta * self.v_dw2) + ((1 - beta) * d_weights2)

        # update the weights with the derivative (slope) of the loss function
        self.weights1 = self.weights1 + (self.v_dw1 * alpha)
        self.weights2 = self.weights2 + (self.v_dw2 * alpha)


if __name__ == "__main__":
    X = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 1]])
    y = np.array([[0], [1], [1], [0]])
    nn = NeuralNetwork(X, y)

    total_loss = []
    for i in range(10000):
        nn.feedforward()
        nn.backprop(nn.alpha, nn.beta)
        total_loss.append(sum((nn.y - nn.output)**2))

    iteration_num = list(range(10000))
    plt.plot(iteration_num, total_loss)
    plt.show()
    print(nn.output)

First thing: in your sigmoid_derivative(x), the input is already the output of a sigmoid, but you compute the sigmoid again and then take its derivative. That is one problem; it should simply be:
return x * (1 - x)
Second problem: you are not using any bias. How do you know your decision boundary would cross the origin in the problem's hypothesis space? You need to add a bias term.
And one last thing: I think your derivatives are not correct. You can refer to Andrew Ng's deep learning course 1, week 2 at coursera.org for a list of general formulas for computing backpropagation in neural networks, to make sure you are doing it right.
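To make the first two points concrete, here is a minimal sketch (mine, not the original author's code) of how the simplified derivative and bias terms could look in a network shaped like the asker's; the bias shapes and the way they enter the update are my own assumptions:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative_from_output(a):
    # 'a' is assumed to already be a sigmoid activation, so no second sigmoid is needed
    return a * (1 - a)

class NeuralNetworkWithBias:
    def __init__(self, x, y, hidden=4):
        self.input = x
        self.y = y
        self.weights1 = np.random.rand(x.shape[1], hidden)
        self.bias1 = np.zeros((1, hidden))   # bias for the hidden layer
        self.weights2 = np.random.rand(hidden, 1)
        self.bias2 = np.zeros((1, 1))        # bias for the output layer

    def feedforward(self):
        self.layer1 = sigmoid(self.input @ self.weights1 + self.bias1)
        self.output = sigmoid(self.layer1 @ self.weights2 + self.bias2)

    def backprop(self, alpha=0.5):
        # deltas for the squared-error loss used in the question,
        # keeping the same sign convention as the original update
        delta2 = 2 * (self.y - self.output) * sigmoid_derivative_from_output(self.output)
        delta1 = (delta2 @ self.weights2.T) * sigmoid_derivative_from_output(self.layer1)

        self.weights2 += alpha * (self.layer1.T @ delta2)
        self.bias2 += alpha * delta2.sum(axis=0, keepdims=True)
        self.weights1 += alpha * (self.input.T @ delta1)
        self.bias1 += alpha * delta1.sum(axis=0, keepdims=True)

Momentum is left out of the sketch; it would wrap the weight and bias updates exactly as in the original backprop.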

Related

scipy optimize one iteration at a time

I want to control the objective of my optimization as a function of the number of iterations. In my real problem, I have a complicated regularization term that I want to control using the iteration number.
Is it possible to call a scipy optimizer one iteration at a time, or at least to be able to access the iteration number in the objective function?
Here is an example showing my best attempt so far:
from scipy.optimize import fmin_slsqp
from scipy.optimize import minimize as mini
import numpy as np

# define objective function
# x is the design input
# iteration is the iteration number
# the idea is that I want to control a regularization term using the iteration number
def objective(x, iteration):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2 + 10 * np.sum(x ** 2) / iteration

x = np.ones(2) * 5
for ii in range(20):
    x = fmin_slsqp(objective, x, iter=1, args=(ii,), iprint=0)
    if ii == 5: print('at iteration 5, I expect to get ~ [0, 0], but I get', x)

truex = mini(objective, np.ones(2) * 5, args=(200,)).x
print('the final result is ', x, 'instead of the correct answer, which is close to [1, 1] (', truex, ')')
output:
at iteration 5, I expect to get ~ [0, 0], but I get [5. 5.]
the final result is [5. 5.] instead of the correct answer, [1, 1] ([0.88613989 0.78485145])
No, I don't think scipy offers this option.
Interestingly, pytorch does. See this example of optimizing one iteration at a time:
import numpy as np
import torch

# define rosenbrock function and gradient
a = 1
b = 5
def f(x):
    return (a - x[0]) ** 2 + b * (x[1] - x[0] ** 2) ** 2

# create stochastic rosenbrock function and gradient
def f_rand(x):
    return f(x) * np.random.uniform(0.5, 1.5)

x = np.array([0.1, 0.1])
x0 = x.copy()

x_tensor = torch.tensor(x0, requires_grad=True)
learning_rate = 0.1  # step size for Adam (the original snippet left this undefined)
optimizer = torch.optim.Adam([x_tensor], lr=learning_rate)

def closure():
    optimizer.zero_grad()
    loss = f_rand(x_tensor)
    loss.backward()
    return loss

# optimize one iteration at a time
for ii in range(200):
    optimizer.step(closure)

print('optimal solution found: ', x_tensor, f(x_tensor))
If you really need to use scipy, you can make a class to count iterations, though you should be careful when mixing this with an algorithm that is approximating the inverse Hessian matrix.
from scipy.optimize import fmin_slsqp
from scipy.optimize import minimize as mini
import numpy as np

# define objective function
# x is the design input
# iteration is the iteration number
# the idea is that I want to control a regularization term using the iteration number
def objective(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2 + 10 * np.sum(x ** 2)

class myclass:
    def __init__(self):
        self.iteration = 0

    def call(self, x):
        self.iteration += 1
        return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2 + 10 * np.sum(x ** 2) / self.iteration

x = np.ones(2) * 5
obj = myclass()
x = fmin_slsqp(obj.call, x, iprint=0)
truex = mini(objective, np.ones(2) * 5).x
print('the final result is ', x, ', which is not the correct answer, and is not close to [1, 1] (', truex, ')')
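Note that the class above counts function evaluations rather than true iterations. As a variation (mine, not from the original answer), here is a sketch that counts iterations via the callback argument of scipy.optimize.minimize, which is invoked once after each completed iteration for methods such as SLSQP; the counter class and its attribute names are my own:

from scipy.optimize import minimize
import numpy as np

class IterationCounter:
    def __init__(self):
        self.iteration = 1  # start at 1 to avoid dividing by zero in the objective

    def __call__(self, xk):
        # minimize calls this once after each iteration with the current point xk
        self.iteration += 1

counter = IterationCounter()

def objective(x):
    # regularization term shrinks as the iteration count grows
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2 + 10 * np.sum(x ** 2) / counter.iteration

result = minimize(objective, np.ones(2) * 5, method='SLSQP', callback=counter)
print(result.x)

The same caveat applies: the objective changes between iterations, which can interact badly with optimizers that build curvature estimates across iterations.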

Neural network back propagation regression, how to correctly learn the cos function?

After Lutz Lehmann's suggestion, I discovered that it was a problem of the random weights and biases. With np.random.seed(2021) as the random seed, the error did not converge, but with np.random.seed(10) the error converges to a relatively small value by the 600th epoch.
Galletti_Lance's suggestion is correct: the activation function should be replaced with a periodic one. I expanded the interval of the sin function, and the learning error did not converge. Sure enough, it is overfitting.
input_data = np.arange(0, np.pi * 4, 0.1) # input
correct_data = np.sin(input_data) # correct answer
input_data = (input_data - np.pi*2) / np.pi
With np.random.seed(2021), learning the cos function, the error through epoch 20000 is as follows:
Epoch:0/20001 Error:0.2904405534384431
Epoch:200/20001 Error:0.2752981376571506
Epoch:400/20001 Error:0.27356300803051226
Epoch:600/20001 Error:0.27409878767315193
Epoch:800/20001 Error:0.2638216736165815
Epoch:1000/20001 Error:0.27196157366033213
Epoch:1200/20001 Error:0.2743520487664953
Epoch:1400/20001 Error:0.2589745966244678
Epoch:1600/20001 Error:0.2705289192984957
Epoch:1800/20001 Error:0.2689693217636388
....
Epoch:20000/20001 Error:0.2678723095120438
But if I use np.random.seed(10) as the random seed, the error converges to a relatively small value by the 600th epoch:
Epoch:0/20001 Error:0.283958515549615
Epoch:200/20001 Error:0.260819823215878
Epoch:400/20001 Error:0.23267630899157743
Epoch:600/20001 Error:0.0022589485429890047
Epoch:800/20001 Error:0.0007425256677052262
Epoch:1000/20001 Error:0.0003946220094805989
....
Epoch:2800/20001 Error:0.00011495288247859594
Epoch:3000/20001 Error:9.989662843897715e-05
....
Epoch:20000/20001 Error:4.6146397913360866e-05
(Plot: np.random.seed(10), learning the cos function, fit at the 600th epoch.)
I am using neural network back propagation regression to learn the cos function. When I learn the sin function, everything is normal; if I change it to cos, it fails. What is the problem?
correct_data = np.cos(input_data)
Related settings:
1. Activation function of the middle layer: sigmoid
2. Activation function of the output layer: identity
3. Loss function: sum-of-squares error
4. Optimization algorithm: stochastic gradient descent
5. Batch size: 1
My code is as follows:
import numpy as np
import matplotlib.pyplot as plt

# -- Prepare the input and correct answer data --
input_data = np.arange(0, np.pi * 2, 0.1)  # input
correct_data = np.cos(input_data)  # correct answer
input_data = (input_data - np.pi) / np.pi  # Scale the input to the range -1.0 to 1.0
n_data = len(correct_data)  # number of data points

# -- Each setting value --
n_in = 1  # The number of neurons in the input layer
n_mid = 3  # The number of neurons in the middle layer
n_out = 1  # The number of neurons in the output layer

wb_width = 0.01  # The spread of weights and biases
eta = 0.1  # learning coefficient
epoch = 2001
interval = 200  # Display progress interval

# -- middle layer --
class MiddleLayer:
    def __init__(self, n_upper, n):  # Initialize settings
        self.w = wb_width * np.random.randn(n_upper, n)  # weight (matrix)
        self.b = wb_width * np.random.randn(n)  # bias (vector)

    def forward(self, x):  # forward propagation
        self.x = x
        u = np.dot(x, self.w) + self.b
        self.y = 1 / (1 + np.exp(-u))  # Sigmoid function

    def backward(self, grad_y):  # Backpropagation
        delta = grad_y * (1 - self.y) * self.y  # Derivative of the sigmoid function
        self.grad_w = np.dot(self.x.T, delta)
        self.grad_b = np.sum(delta, axis=0)
        self.grad_x = np.dot(delta, self.w.T)

    def update(self, eta):  # update of weights and bias
        self.w -= eta * self.grad_w
        self.b -= eta * self.grad_b

# -- Output layer --
class OutputLayer:
    def __init__(self, n_upper, n):  # Initialize settings
        self.w = wb_width * np.random.randn(n_upper, n)  # weight (matrix)
        self.b = wb_width * np.random.randn(n)  # bias (vector)

    def forward(self, x):  # forward propagation
        self.x = x
        u = np.dot(x, self.w) + self.b
        self.y = u  # Identity function

    def backward(self, t):  # Backpropagation
        delta = self.y - t
        self.grad_w = np.dot(self.x.T, delta)
        self.grad_b = np.sum(delta, axis=0)
        self.grad_x = np.dot(delta, self.w.T)

    def update(self, eta):  # update of weights and bias
        self.w -= eta * self.grad_w
        self.b -= eta * self.grad_b

# -- Initialization of each network layer --
middle_layer = MiddleLayer(n_in, n_mid)
output_layer = OutputLayer(n_mid, n_out)

# -- learn --
for i in range(epoch):

    # Randomly shuffle the index values
    index_random = np.arange(n_data)
    np.random.shuffle(index_random)

    # Used for the display of results
    total_error = 0
    plot_x = []
    plot_y = []

    for idx in index_random:
        x = input_data[idx:idx + 1]  # input
        t = correct_data[idx:idx + 1]  # correct answer

        # Forward propagation
        middle_layer.forward(x.reshape(1, 1))  # Convert the input to a matrix
        output_layer.forward(middle_layer.y)

        # Backpropagation
        output_layer.backward(t.reshape(1, 1))  # Convert the correct answer to a matrix
        middle_layer.backward(output_layer.grad_x)

        # Update of weights and biases
        middle_layer.update(eta)
        output_layer.update(eta)

        if i % interval == 0:
            y = output_layer.y.reshape(-1)  # Restore the matrix to a vector

            # Error calculation
            total_error += 1.0 / 2.0 * np.sum(np.square(y - t))  # Sum-of-squares error

            # Output record
            plot_x.append(x)
            plot_y.append(y)

    if i % interval == 0:
        # Display the number of epochs and the error
        print("Epoch:" + str(i) + "/" + str(epoch), "Error:" + str(total_error / n_data))

        # Display the output with a graph
        plt.plot(input_data, correct_data, linestyle="dashed")
        plt.scatter(plot_x, plot_y, marker="+")
        plt.show()
If increasing the number of epochs worked, the model needed more training.
But you may be overfitting... Notice that the cosine function is periodic, yet you are using only monotonic functions (sigmoid and identity) to approximate it.
So while it may work on the bounded interval of your data (first plot), it does not generalize well outside of it (second plot).
Code for the above plots:
import math as m
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as datasets
from tensorflow import keras
from tensorflow.keras import layers

t, _ = datasets.make_blobs(n_samples=7500, centers=[[0, 0]], cluster_std=1, random_state=0)
X = np.array(list(filter(lambda x: m.cos(4*x[0]) - x[1] < -.5 or m.cos(4*x[0]) - x[1] > .5, t)))
Y = np.array([1 if m.cos(4*x[0]) - x[1] >= 0 else 0 for x in X])

model = keras.models.Sequential()
model.add(layers.Dense(8, input_dim=2, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss="binary_crossentropy")
model.fit(X, Y, batch_size=500, epochs=3000)

# create a mesh to plot in
h = .02  # step size in the mesh
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
meshData = np.c_[xx.ravel(), yy.ravel()]

fig, ax = plt.subplots()
Z = model.predict(meshData)
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, alpha=.3, cmap=plt.cm.Paired)
ax.axis('off')

# Plot also the training points, colored by the thresholded prediction
colors = np.array(['blue', 'red'])
T = (model.predict(X).reshape(X[:, 0].shape) > 0.5).astype(int)
ax.scatter(X[:, 0], X[:, 1], color=colors[T].tolist(), s=10, alpha=0.9)
plt.show()

# repeat the plotting code above to generate the second plot,
# predicting on data generated from a blob
# with a larger standard deviation
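To illustrate the "periodic activation" suggestion on the asker's own regression code, here is a minimal sketch (mine, not from either post) of a middle layer that uses sin as its activation; the class mirrors the MiddleLayer above and assumes the same wb_width-style initialization:

import numpy as np

class PeriodicMiddleLayer:
    def __init__(self, n_upper, n, wb_width=0.01):
        self.w = wb_width * np.random.randn(n_upper, n)  # weight (matrix)
        self.b = wb_width * np.random.randn(n)           # bias (vector)

    def forward(self, x):
        self.x = x
        self.u = np.dot(x, self.w) + self.b
        self.y = np.sin(self.u)          # periodic activation instead of sigmoid

    def backward(self, grad_y):
        delta = grad_y * np.cos(self.u)  # derivative of sin is cos
        self.grad_w = np.dot(self.x.T, delta)
        self.grad_b = np.sum(delta, axis=0)
        self.grad_x = np.dot(delta, self.w.T)

    def update(self, eta):
        self.w -= eta * self.grad_w
        self.b -= eta * self.grad_b

Dropping this in place of MiddleLayer leaves the rest of the training loop unchanged; whether it extrapolates beyond the training interval still depends on the initialization and learning rate.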

Using Adam to find the minimum of the Rosenbrock function using Pytorch

I am comparing the Adam algorithm to SGD with momentum. I realised that the convergence rate of Adam is far worse than that of SGD with momentum when applied to the Rosenbrock function. This finding is in contrast to this visualisation. You can read the underlying code here.
To ensure that I did not have an implementation error, I compared the results of my algorithm to the PyTorch implementation. PyTorch and my implementation return the same result.
Therefore either PyTorch and my implementation are both incorrect, or the implementation in the link is incorrect. If you check out the code from the link above, you will find that the bias correction step is missing. After adapting my code in the same way, the results did not significantly improve.
So my question is: why does it work in the linked scenario but not in my/PyTorch's implementation, even though all three should return the same result?
import numpy as np
import torch

# Rosenbrock function
class Rosenbrock:
    a_f = 1.
    b_f = 2.
    # The minimum is at (a_f, a_f**2)

class Adam_para:
    beta1 = 0.9  # 0.7 # modified because of github: https://gist.github.com/EmilienDupont/f97a3902f4f3a98f350500a3a00371db
    beta2 = 0.999
    eps = 1e-8
    lr = 2e-2
    iterations = 100

def f(x, y):
    return (Rosenbrock.a_f - x) ** 2 + Rosenbrock.b_f * (y - x ** 2) ** 2

def grad_f(x, y):
    grad_x = -1. * 2 * (Rosenbrock.a_f - x) + Rosenbrock.b_f * (-2 * x) * 2 * (y - x ** 2)
    grad_y = Rosenbrock.b_f * 1. * 2 * (y - x ** 2)
    return np.array([grad_x, grad_y])

def adam_inner(p: np.ndarray, t, exp_avg, exp_avg_sqr, lr):
    # inner loop of the adam algorithm
    # p            current point
    # exp_avg      first moment estimate
    # exp_avg_sqr  second moment estimate
    # lr           learning rate
    # the following values are taken from the ADAM paper
    beta1 = Adam_para.beta1
    beta2 = Adam_para.beta2
    eps = Adam_para.eps
    t = t + 1

    g = grad_f(*p)
    exp_avg = beta1 * exp_avg + (1 - beta1) * g
    exp_avg_sqr = beta2 * exp_avg_sqr + (1 - beta2) * np.square(g)

    bias_corr_1 = 1 - beta1 ** t
    bias_corr_2 = 1 - beta2 ** t
    exp_avg_hat = exp_avg / bias_corr_1
    exp_avg_sqr_hat = exp_avg_sqr / bias_corr_2

    denom = np.sqrt(exp_avg_sqr_hat) + eps
    p = p - lr * exp_avg_hat / denom

    return {'p': p, 'first_mom': exp_avg, 'second_mom': exp_avg_sqr}

def adam(p, it, lr=0.001):
    # it  number of iterations
    # m   first moment estimate
    # v   second moment estimate
    m = 0
    v = 0
    p_list = [p]
    for i in range(it):
        tmp = adam_inner(p_list[-1], i, m, v, lr)
        p_list.append(tmp['p'])
        m = tmp['first_mom']
        v = tmp['second_mom']
    return np.asarray(p_list)

x0 = np.array([3., 3.])
t = adam(x0, Adam_para.iterations, Adam_para.lr)

x0_torch = torch.tensor(x0, requires_grad=True)
f_torch = f(x0_torch[0], x0_torch[1])
optimizer = torch.optim.Adam([x0_torch], lr=Adam_para.lr, betas=(Adam_para.beta1, Adam_para.beta2))

for i in range(Adam_para.iterations):
    optimizer.zero_grad()
    f_torch = f(x0_torch[0], x0_torch[1])
    f_torch.backward()
    optimizer.step()

print("pytorch result:", x0_torch)
print("my result:", t[-1])

How to fix "invalid value", "divide by zero" and other calculation errors

I'm using numpy to calculate neural network weights and nodes, but almost every time I run into errors like:
"RuntimeWarning: divide by zero encountered in log"
"invalid value encountered in multiply"
"RuntimeWarning: overflow encountered in exp"
...
which most of the time result in values becoming "nan".
Because the weights are initialized randomly, I don't always get the same errors.
When I build a small neural network (2 inputs, 4 hidden nodes and 1 output, for example) I don't encounter these errors at all, but if I increase the number of hidden nodes it breaks, with 'nan' becoming the value of everything.
I searched for solutions already, but they didn't help. I checked that the values are not ones that would result in 'nan', but I didn't find anything.
import numpy as np
import sys
import matplotlib.pyplot as plt

np.set_printoptions(threshold=sys.maxsize)

class NeuralNetwork:
    def __init__(self, inputs, outputs, epochs, lr):
        self.inputs = inputs
        self.outputs = outputs
        self.m = self.inputs.shape[0]
        self.n = self.inputs.shape[1]
        self.epochs = epochs
        self.lr = lr
        self.hidden1Size = 64
        self.hidden2Size = 64
        self.w1 = np.random.randn(self.n, self.hidden1Size)
        self.w2 = np.random.randn(self.hidden1Size, self.hidden2Size)
        self.w3 = np.random.randn(self.hidden2Size, self.outputs.shape[1])

    def sigmoid(self, n):
        return 1 / (1 - np.exp(-n))

    def sigmoidDerivative(self, n):
        return self.sigmoid(n) * (1 - self.sigmoid(n))

    def ReLU(self, n):
        return n * (n > 0)

    def ReLUDerivatie(self, n):
        return n > 0

    def crossEntropyLoss(self, y, h):
        return - (y * np.log(h) + (y - 1) * np.log(1 - h))

    def train(self):
        for i in range(self.epochs):
            X, a2, a3, h = self.feedForward()
            self.backProp(X, a2, a3, h)
            if i % 100 == 0:
                print('Cost at ', i, 'epochs: ', print(self.crossEntropyLoss(self.inputs, h)))

    def feedForward(self):
        X = self.inputs
        z2 = X.dot(self.w1)
        a2 = self.ReLU(z2)
        z3 = a2.dot(self.w2)
        a3 = self.ReLU(z3)
        z4 = a3.dot(self.w3)
        h = self.sigmoid(z4)
        return X, a2, a3, h

    def backProp(self, X, a2, a3, h):
        outputErrors = self.crossEntropyLoss(self.outputs, h)
        outputDeltas = outputErrors * self.sigmoidDerivative(h)
        hidden2Errors = outputDeltas.dot(self.w3.T)
        hidden2Deltas = hidden2Errors * self.ReLUDerivatie(a3)
        hidden1Errors = hidden2Deltas.dot(self.w2.T)
        hidden1Deltas = hidden1Errors * self.ReLUDerivatie(a2)
        self.w3 += self.lr * a3.T.dot(outputDeltas)
        self.w2 += self.lr * a2.T.dot(hidden2Deltas)
        self.w1 += self.lr * X.T.dot(hidden1Deltas)

NN = NeuralNetwork(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), np.array([[0], [0], [0], [1]]), 10000, 0.01)
NN.train()
These values are just for testing. I plan on feeding it image data to distinguish digits, but as far as I know it shouldn't break with these.
So, the following might produce nans/errors:
# if n == 0, you'll have division by 0
def sigmoid(self, n):
    return 1 / (1 - np.exp(-n))
And here:
# log is not defined for non-positive arguments, i.e. for h <= 0 or 1 - h <= 0
def crossEntropyLoss(self, y, h):
    return - (y * np.log(h) + (y - 1) * np.log(1 - h))
There are a lot of transformations going on, so it's hard to trace where the values come from, but these would be my first guesses. dot yields 0 when the vectors are orthogonal - maybe that helps you as well.
So, I recommend printing the traceback in these two functions whenever one of the conditions mentioned above occurs: https://docs.python.org/3/library/traceback.html#traceback.print_tb
This way you'll be able to find out which iteration/function call caused the invalid condition.
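As a concrete illustration of how these two spots could be made numerically safer, here is a small sketch (mine, not from the answer). Note that the standard logistic sigmoid uses 1 + np.exp(-n), whereas the question's version has a minus sign, and that the usual binary cross-entropy uses (1 - y) rather than (y - 1); clipping is just one common way to keep the log arguments strictly positive:

import numpy as np

def sigmoid(n):
    # standard logistic function: the denominator 1 + exp(-n) never reaches 0
    return 1 / (1 + np.exp(-n))

def cross_entropy_loss(y, h, eps=1e-12):
    # clip predictions away from 0 and 1 so np.log never sees a non-positive argument
    h = np.clip(h, eps, 1 - eps)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))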

Neural network - Matrix problems in python

I'm playing around with a simple neural network, specifically this tutorial https://stevenmiller888.github.io/mind-how-to-build-a-neural-network/ , and I'm running into problems when doing backpropagation. My matrices won't match when backpropagating to the initial input weights.
I guess it's a simple linear algebra problem. However, I wonder if the programming language in the tutorial is confusing me, or if the problem occurred much earlier.
If anybody has any ideas about what I might be doing wrong, please let me know!
My Code
import numpy as np

inputM = np.matrix([
    [0, 1],
    [1, 0],
    [1, 1],
    [0, 0]
])
outputM = np.matrix([
    [0],
    [0],
    [1],
    [1]
])

neurons = 3
mu, sigma = 0, 0.1  # mean and standard deviation
weights = np.random.normal(mu, sigma, len(inputM.T) * neurons)
weightsMatrix = np.matrix(weights).reshape(3, 2)
weights = np.matrix(weights)

# Forward
inputHidden = inputM * weightsMatrix.T
hiddenLayerLog = 1 / (1 + np.exp(inputHidden))
hiddenWeights = np.random.normal(mu, sigma, neurons)[np.newaxis, :]
sumOfHiddenLayer = np.sum(hiddenWeights + hiddenLayerLog, axis=1)
predictedOutput = 1 / (1 + np.exp(sumOfHiddenLayer))
residual = outputM - predictedOutput
logDerivative = 1 / (1 + np.exp(sumOfHiddenLayer)) * (1 - 1 / (1 + np.exp(sumOfHiddenLayer))).T
deltaOutputSum = logDerivative * residual

# Backward
deltaWeights = deltaOutputSum / hiddenLayerLog
newHiddenWeights = hiddenWeights - deltaWeights
deltaHiddenSum = (deltaOutputSum / hiddenWeights)
deltaHiddenSum = deltaHiddenSum.T * (1 / (1 + np.exp(inputHidden))) * (1 - 1 / (1 + np.exp(inputHidden))).T
newInputWeights = np.array(deltaHiddenSum) / np.array(inputM)
