I am trying to create a simple two-layer neural network and I am stuck on updating the weights of the first layer. I believe the update I am doing to w2 is correct, based on what I learned about the backpropagation algorithm. I am not including a bias for now. But how to update the first-layer weights is where I am stuck.
import numpy as np

np.random.seed(10)

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1.0 - x)

def cost_function(output, y):
    return (output - y) ** 2

x = 2
y = 4

w1 = np.random.rand()
w2 = np.random.rand()

h = sigmoid(w1 * x)
o = sigmoid(h * w2)

cost_function_output = cost_function(o, y)

prev_w2 = w2
w2 -= 0.5 * 2 * cost_function_output * h * sigmoid_derivative(o)  # 0.5 being learning rate
w1 -= 0  # What do you update this to?

print(cost_function_output)
I'm not able to comment on your question, so I'm writing here.
Firstly, your sigmoid_derivative function is wrong.
The derivative of sigmoid(x*y) w.r.t. x is sigmoid(x*y) * (1 - sigmoid(x*y)) * y.
We need dW1 and dW2 (these are the partial derivatives dJ/dW1 and dJ/dW2, respectively).
J = (o - y)^2, therefore dJ/do = 2*(o - y)
Now, dW2:
dJ/dW2 = dJ/do * do/dW2  (chain rule)
dJ/dW2 = (2*(o - y)) * (o*(1 - o)*h)
dW2 equals the expression above, so:
W2 -= learning_rate*dW2
Now, for dW1:
dJ/dh = dJ/do * do/dh = (2*(o - y)) * (o*(1 - o)*W2)
dJ/dW1 = dJ/dh * dh/dW1 = ((2*(o - y)) * (o*(1 - o)*W2)) * (h*(1 - h)*x)
dW1 equals the expression above, so:
W1 -= learning_rate*dW1
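To make this concrete, here is a minimal sketch applying these updates to the setup from the question (same sigmoid, x, y, w1, w2 and the learning rate of 0.5; the loop and iteration count are my additions, not from the question):

import numpy as np

np.random.seed(10)

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

x, y = 2, 4
w1, w2 = np.random.rand(), np.random.rand()
learning_rate = 0.5

for _ in range(100):
    # forward pass (same as the question)
    h = sigmoid(w1 * x)
    o = sigmoid(h * w2)

    # backward pass, following the chain rule above
    dJ_do = 2 * (o - y)                # dJ/do
    dJ_dW2 = dJ_do * o * (1 - o) * h   # dJ/dW2
    dJ_dh = dJ_do * o * (1 - o) * w2   # dJ/dh
    dJ_dW1 = dJ_dh * h * (1 - h) * x   # dJ/dW1

    # gradient-descent updates
    w2 -= learning_rate * dJ_dW2
    w1 -= learning_rate * dJ_dW1

Note that with a sigmoid output, o is bounded by 1, so with y = 4 the cost cannot reach zero; the loop only demonstrates the update rule.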
PS: Try to draw a computational graph; finding the derivatives becomes a lot easier. (If you don't know this technique, read about it online.)
I'm trying to learn linear regression and gave this problem a try. The adjusted b (bias) and m (linear coefficient) end up being output as "inf" or "-inf". What should I do?
Sorry if the problem in the code is obvious, I'm new at this.
from matplotlib import pyplot as plt
import random

x = [1,2,3,3,4,4,3,2,1,2,5,4]
y = [1,2,2,1,3,4,1,1,2,3,4,5]

b = random.random()
m = random.random()
learning_rate = 0.3
iterations = 1000

for i in range(iterations):
    for k in range(len(x)):
        X = m * x[k] + b
        derivative_error = 2 * (X - y[k])
        dX_dm = x[k]
        dX_db = 1
        m += derivative_error * dX_dm * learning_rate
        b += derivative_error * learning_rate
If I get it right, you are trying to use gradient descent to solve the linear regression model. Here are the problems with your approach:
First:
The gradient step is incorrect. Instead of
X = m * x[k] + b
derivative_error = 2 * (X - y[k])
dX_dm = x[k]
dX_db = 1
m += derivative_error * dX_dm * learning_rate
b += derivative_error * learning_rate
it should take the derivative of the error with respect to m and b and move in the negative direction of the gradient (subtract the step, not add it).
Second:
You shouldn't update the parameters every time you see a data point x[k], as you are doing in the inner for-loop of your code:
for k in range(len(x)):
    X = m * x[k] + b
    derivative_error = 2 * (X - y[k])
    dX_dm = x[k]
    dX_db = 1
    m += derivative_error * dX_dm * learning_rate
    b += derivative_error * learning_rate
Instead, accumulate the gradients over all x and average them, then use the averaged gradient to update your m and b.
Third:
Your learning_rate of 0.3 is probably too large, so each update 'overshoots' the optimum point, and the values of m and b blow up all the way to inf.
That said, the following is my solution, with an error function to check the error you get at every iteration.
def error(x, y, m, b):
    error = 0
    for k in range(len(x)):
        error = error + ((x[k] * m + b - y[k]) ** 2)
    return error

from matplotlib import pyplot as plt
import random

x = [1,2,3,3,4,4,3,2,1,2,5,4]
y = [1,2,2,1,3,4,1,1,2,3,4,5]

b = random.random()
m = random.random()
learning_rate = 0.01
iterations = 100

for i in range(iterations):
    print(error(x, y, m, b))
    d_m = 0
    d_b = 0
    for k in range(len(x)):
        # Calculate the derivative w.r.t. m and accumulate it
        derivative_error_m = -2 * (y[k] - m * x[k] - b) * x[k]
        d_m = d_m + derivative_error_m
        # Calculate the derivative w.r.t. b and accumulate it
        derivative_error_b = -2 * (y[k] - m * x[k] - b)
        d_b = d_b + derivative_error_b
    # Average the derivatives of the error.
    d_m = d_m / len(x)
    d_b = d_b / len(x)
    # Update parameters in the negative direction of the gradient.
    m = m - d_m * learning_rate
    b = b - d_b * learning_rate
After running the code for iterations = 10, you get:
15.443121587504484
14.019097680461613
13.123926121402514
12.561191094860135
12.207425702911078
11.985018705759003
11.8451837105445
11.757253610772613
11.70195107555181
11.66715838203049
where errors are shrinking at every update.
Besides, you should also notice that for a simple model like linear regression there is a nice closed-form solution, which gets you the optimum immediately without applying an iterative method such as gradient descent.
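For reference, here is a minimal sketch of that closed-form (ordinary least squares) solution with numpy, assuming the same x and y lists as above:

import numpy as np

x = [1,2,3,3,4,4,3,2,1,2,5,4]
y = [1,2,2,1,3,4,1,1,2,3,4,5]

# Design matrix with a column of ones for the bias term b.
X = np.column_stack((np.ones(len(x)), x))

# Least-squares solution: minimizes sum((X @ w - y)**2).
b_opt, m_opt = np.linalg.lstsq(X, np.array(y), rcond=None)[0]
print(m_opt, b_opt)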
I am comparing the Adam algorithm to SGD with momentum. I realised that the convergence rate of Adam is much worse than the convergence rate of SGD with momentum when applied to the Rosenbrock function. This finding is in contrast to this visualisation. You can read the underlying code here.
To ensure that I did not have an implementation error, I compared the results of my algorithm to the PyTorch implementation. PyTorch and my implementation return the same result.
Therefore either both PyTorch and my implementation are incorrect, or the implementation in the link is incorrect. If you check out the code from the link above, you will find that the bias-correction step is missing. After adapting my code in the same way, the results did not significantly improve.
So my question is: why does it work in the linked scenario but not in my/PyTorch implementation, even though all three should return the same result?
import numpy as np
import torch

# Rosenbrock function
class Rosenbrock:
    a_f = 1.
    b_f = 2.
    # The minimum is at (a_f, a_f**2)

class Adam_para:
    beta1 = 0.9  # 0.7 # modified because of github: https://gist.github.com/EmilienDupont/f97a3902f4f3a98f350500a3a00371db
    beta2 = 0.999
    eps = 1e-8
    lr = 2e-2
    iterations = 100

def f(x, y):
    return (Rosenbrock.a_f - x) ** 2 + Rosenbrock.b_f * (y - x ** 2) ** 2

def grad_f(x, y):
    grad_x = -1. * 2 * (Rosenbrock.a_f - x) + Rosenbrock.b_f * (-2 * x) * 2 * (y - x ** 2)
    grad_y = Rosenbrock.b_f * (1.) * 2 * (y - x ** 2)
    return np.array([grad_x, grad_y])

def adam_inner(p: np.ndarray, t, exp_avg, exp_avg_sqr, lr):
    # inner loop of the Adam algorithm
    # p            current point
    # exp_avg      first moment estimate
    # exp_avg_sqr  second moment estimate
    # lr           learning rate
    # the following values are taken from the Adam paper
    beta1 = Adam_para.beta1
    beta2 = Adam_para.beta2
    eps = Adam_para.eps
    t = t + 1
    g = grad_f(*p)
    exp_avg = beta1 * exp_avg + (1 - beta1) * g
    exp_avg_sqr = beta2 * exp_avg_sqr + (1 - beta2) * np.square(g)
    bias_corr_1 = 1 - beta1 ** t
    bias_corr_2 = 1 - beta2 ** t
    exp_avg_hat = exp_avg / bias_corr_1
    exp_avg_sqr_hat = exp_avg_sqr / bias_corr_2
    denom = np.sqrt(exp_avg_sqr_hat) + eps
    p = p - lr * exp_avg_hat / denom
    return {'p': p, 'first_mom': exp_avg, 'second_mom': exp_avg_sqr}

def adam(p, it, lr=0.001):
    # it  number of iterations
    # m   first moment estimate
    # v   second moment estimate
    # init
    m = 0
    v = 0
    p_list = [p]
    for i in range(it):
        tmp = adam_inner(p_list[-1], i, m, v, lr)
        p_list.append(tmp['p'])
        m = tmp['first_mom']
        v = tmp['second_mom']
    return np.asarray(p_list)

x0 = np.array([3., 3.])
t = adam(x0, Adam_para.iterations, Adam_para.lr)

x0_torch = torch.tensor(x0, requires_grad=True)
f_torch = f(x0_torch[0], x0_torch[1])
optimizer = torch.optim.Adam([x0_torch], lr=Adam_para.lr, betas=(Adam_para.beta1, Adam_para.beta2))

for i in range(Adam_para.iterations):
    optimizer.zero_grad()
    f_torch = f(x0_torch[0], x0_torch[1])
    f_torch.backward()
    optimizer.step()

print("pytorch result:", x0_torch)
print("my result:", t[-1])
I have a problem where I first have to create a dataset.
Afterwards, I have to use Theano to get the w_0 and w_1 parameters of the following model:
y = log(1 + w_0 * |x|) + (w_1 * |x|)
The dataset is created, and I have computed the w_0 and w_1 values, but with numpy, using the following code. I have studied this thoroughly but don't know how to compute the w_0 and w_1 values with Theano. How can I compute these using Theano?
Any help would be greatly appreciated, thank you :)
Code that I am using:
import numpy as np
import math
import theano as t

# code to generate the dataset
trX = np.linspace(-1, 1, 101)
trY = np.linspace(-1, 1, 101)
for i in range(len(trY)):
    trY[i] = math.log(1 + 0.5 * abs(trX[i])) + trX[i] / 3 + np.random.randn() * 0.033

# code that produces w0 and w1; I want to compute these with Theano instead
X = np.column_stack((np.ones(101, dtype=trX.dtype), trX))
print(X.shape)

Xplus = np.linalg.pinv(X)  # pseudo-inverse of X
w_opt = Xplus @ trY  # the @ symbol denotes matrix multiplication
print(w_opt)

x = abs(trX)  # abs returns the element-wise absolute values of the array
y = trY
for i in range(len(trX)):
    y[i] = math.log(1 + w_opt[0] * x[i]) + (w_opt[1] * x[i])
Good morning Hina Malik,
Using the gradient descent algorithm and with the right model selection, this problem can be solved. Also, you should create 2 shared variables (w and c), one for each parameter.
import numpy as np
import theano
import theano.tensor as T

X = T.scalar()
Y = T.scalar()

def model(X, w, c):
    return X * w + c

w = theano.shared(np.asarray(0., dtype=theano.config.floatX))
c = theano.shared(np.asarray(0., dtype=theano.config.floatX))
y = model(X, w, c)

learning_rate = 0.01
cost = T.mean(T.sqr(y - Y))
gradient_w = T.grad(cost=cost, wrt=w)
gradient_c = T.grad(cost=cost, wrt=c)
updates = [[w, w - gradient_w * learning_rate], [c, c - gradient_c * learning_rate]]

train = theano.function(inputs=[X, Y], outputs=cost, updates=updates)

coste = []  # variable to store the cost values so they can be plotted later
for i in range(101):
    for x, y in zip(trX, trY):
        cost_i = train(x, y)
        coste.append(cost_i)

w0 = float(w.get_value())
w1 = float(c.get_value())
print(w0, w1)
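To fit the exact model from the question, y = log(1 + w_0*|x|) + w_1*|x|, the same template can be adapted by changing the model expression. This is only a sketch under the assumption that trX and trY come from the question's code and that |x| is fed as the input; the learning rate is a guess and may need to be small enough that 1 + w0*|x| stays positive:

X = T.scalar()
Y = T.scalar()

w0 = theano.shared(np.asarray(0., dtype=theano.config.floatX))
w1 = theano.shared(np.asarray(0., dtype=theano.config.floatX))

# model from the question: y = log(1 + w0*|x|) + w1*|x|
y_hat = T.log(1 + w0 * X) + w1 * X

learning_rate = 0.01
cost = T.mean(T.sqr(y_hat - Y))
updates = [[w0, w0 - T.grad(cost, w0) * learning_rate],
           [w1, w1 - T.grad(cost, w1) * learning_rate]]
train = theano.function(inputs=[X, Y], outputs=cost, updates=updates)

for i in range(101):
    for x_i, y_i in zip(abs(trX), trY):  # feed |x| as the input
        train(x_i, y_i)

print(float(w0.get_value()), float(w1.get_value()))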
I also replied to the same or a very similar topic on the Spanish version of Stack Overflow here: go to solution
I hope this can help you.
Best regards
Following Andrew Traks's example, I want to implement a 3 layer neural network - 1 input, 1 hidden, 1 output - with a simple dropout, for binary classification.
If I include bias terms b1 and b2, then I would need to slightly modify Andrew's code as below.
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim,dropout_percent = (0.5,4,0.2)
synapse_0 = 2*np.random.random((X.shape[1],hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
b1 = np.zeros(hidden_dim)
b2 = np.zeros(1)
for j in range(60000):
    # sigmoid activation function
    layer_1 = (1/(1+np.exp(-(np.dot(X,synapse_0) + b1))))
    # dropout
    layer_1 *= np.random.binomial([np.ones((len(X),hidden_dim))],1-dropout_percent)[0] * (1.0/(1-dropout_percent))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1) + b2)))
    # sigmoid derivative = s(x)(1-s(x))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
    b1 -= alpha*layer_1_delta
    b2 -= alpha*layer_2_delta
The problem is, of course, that with the code above the dimensions of b1 don't match the dimensions of layer_1_delta, and similarly for b2 and layer_2_delta.
I don't understand how the delta is calculated to update b1 and b2. According to Michael Nielsen's example, b1 and b2 should be updated by a delta, which in my code I believe to be layer_1_delta and layer_2_delta respectively.
What am I doing wrong here? Have I messed up the dimensionality of the deltas or of the biases? I feel it is the latter, because if I remove the biases from this code it works fine. Thanks in advance.
So first I would change the X in bX to 0 and 1 so the biases correspond to synapse_X, because that is where they belong, which makes the updates:
b1 -= alpha * 1.0 / m * np.sum(layer_2_delta)
b0 -= alpha * 1.0 / m * np.sum(layer_1_delta)
where m is the number of examples in the training set. Also, the dropout rate is far too high and actually hurts convergence. So, all things considered, the whole code becomes:
import numpy as np

X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
m = X.shape[0]
y = np.array([[0,1,1,0]]).T
alpha,hidden_dim,dropout_percent = (0.5,4,0.02)
synapse_0 = 2*np.random.random((X.shape[1],hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
b0 = np.zeros(hidden_dim)
b1 = np.zeros(1)
for j in range(10000):
    # sigmoid activation function
    layer_1 = (1/(1+np.exp(-(np.dot(X,synapse_0) + b0))))
    # dropout
    layer_1 *= np.random.binomial([np.ones((len(X),hidden_dim))],1-dropout_percent)[0] * (1.0/(1-dropout_percent))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1) + b1)))
    # sigmoid derivative = s(x)(1-s(x))
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
    b1 -= alpha * 1.0 / m * np.sum(layer_2_delta)
    b0 -= alpha * 1.0 / m * np.sum(layer_1_delta)

print(layer_2)
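As a side note, a per-unit bias gradient closer to Michael Nielsen's formulation, where each bias receives the delta of its own unit summed over the training examples, would sum over the example axis only. A hedged variant of just the two bias-update lines (everything else unchanged):

b1 -= alpha * 1.0 / m * np.sum(layer_2_delta, axis=0)
b0 -= alpha * 1.0 / m * np.sum(layer_1_delta, axis=0)

This keeps the shapes of b0 (hidden_dim,) and b1 (1,) matching their gradients.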
I'm trying to create my first neural network. I'm building a "virtual creatures" system, in which each creature has two sensors on the top of its head.
The input the network gets is the distances from the two sensors to the closest food source. From the output I'm supposed to decide whether the creature should go left or right, and at what angle. The problem is that I always get the same result (it always goes right, or always left). My calculations are as follows:
def forward(self, X):
    self.z1 = np.dot(np.insert(X, len(X), 1), self.W1)
    self.a1 = self.sigmoid(self.z1)
    self.z2 = np.dot(np.insert(self.a1, len(self.a1), 1), self.W2)
    results = self.sigmoid(self.z2)
    return results
I calculate the angle by:
left_d = distance(sensors[0], food_pos)
right_d = distance(sensors[1], food_pos)
max_dis = sqrt(WIDTH**2 + HEIGHT ** 2)
output = self.NN.forward(np.array([float(left_d) / max_dis, float(right_d) / max_dis]))
if output[0] > output[1]:
    self.angle += (output[0] - 0.5) * np.pi
else:
    self.angle -= (output[1] - 0.5) * np.pi
I've also tried the network with random inputs, and I found that for a given NN the output is always (X, Y) with X > Y, or always with X < Y, never a mix of the two within the same NN.
Here is some data from a run:
X - [ 0.78958477 0.69948212]
Z1 - [ 1.61766664 1.56767388 1.82580234]
A1 - [ 0.99890179 0.99999513 0.96766178]
Z2 - [ 1.45907443 0.92895941]
R - [ 0.9937656 0.80099741]
X - [ 0.14044444 0.60987121]
Z1 - [ 0.97104647 1.00091401 1.23983745]
A1 - [ 0.82547683 0.84196448 0.94573119]
Z2 - [ 1.28368194 0.85941254]
R - [ 0.95906503 0.75745915]
As you can see, R is consistent in this respect, both here and across ~a million measurements.
Here is my weight initialization:
self.W1 = np.random.rand(self.inputLayerSize + 1, self.hiddenLayerSize)
self.W2 = np.random.rand(self.hiddenLayerSize + 1, self.outputLayerSize)
Note: I'm using a genetic algorithm rather than backpropagation.
Here is the GA part:
def merge_guppies(screen, g1, g2):
    W11, W12 = g1.NN.get_W()
    W21, W22 = g2.NN.get_W()
    W1 = [[0] * len(W11[0])] * len(W11)
    W2 = [[0] * len(W12[0])] * len(W12)
    for k in xrange(len(W11)):
        for j in xrange(len(W11[k])):
            if uniform(0, 1) > 0.9:
                W1[k][j] = uniform(0, 1)
            elif uniform(0, 1) > 0.45:
                W1[k][j] = W11[k][j]
            else:
                W1[k][j] = W21[k][j]
    for k in xrange(len(W12)):
        for j in xrange(len(W12[k])):
            if uniform(0, 1) > 0.9:
                W2[k][j] = uniform(0, 1)
            elif uniform(0, 1) > 0.45:
                W2[k][j] = W12[k][j]
            else:
                W2[k][j] = W22[k][j]
    W1 = np.array(W1)
    W2 = np.array(W2)
    g = Guppy(screen)
    g.NN.set_W(W1, W2)
    return g
I doubt the problem is here, considering it also occurred in other tests I've made with my NN (just giving it random inputs), but who knows... I've been searching for an answer for a couple of days now and I'm pretty lost.
Any help will be appreciated. Any idea where I have made a mistake?