Why does the random seed impact my back-propagation algorithm? - python

After a whole class about Machine Learning I realized I don't have the slightest idea of how to build a NN, even though I passed the exam. Therefore I tried to write one from scratch following the advice of this video: https://youtu.be/I74ymkoNTnw?t=425
In order to test the NN code I tried to overfit on the first point and, for some reason, I get exactly the opposite result (output = (0, 1); expected = (1, 0)), where the outputs are probabilities.
I tried to change the sign of the correction in the back-propagation, but I still get an error of 45% even after thousands of iterations. Therefore I assumed the sign was correct and that the problem lies elsewhere.
I'm working with Google Colab so you can check and run the whole code: https://colab.research.google.com/drive/1j-WMk80t8mbg7vr5HscbTUJxUFxOK1yN
The function I assume is not working is the following:
def back_propagation(self, x: np.ndarray, y: np.ndarray, y_exp: np.ndarray):
    error = 0.5*np.sum((y - y_exp)**2)
    Ep = y - y_exp                                   # d(Error) / d(y)
    dfrac = np.flip(self.out)/np.sum(self.out)**2    # d( x/sum(x) )/d(x)
    dsigm = self.out*(1 - self.out)                  # d( 1/(1+exp(-x)) )/d(x) | out = sig(x)
    correction = np.outer(Ep*dfrac*dsigm, x)         # Correction matrix
    self.NN *= 1 - self.lr*correction
    return error
Where y was obtained through:
def forward_propagation(self, x: np.ndarray):
    Ax = self.NN.dot(x)
    self.out = self.sigmoid(Ax)
    y = self.out / np.sum(self.out)
    return y
Can someone lend me a hand?
PS: I haven't written English in a long time; if there is any error or unreadable part, tell me and I'll try to explain myself better.
EDIT: I examined the error more when I have the + sign in the back-propagation, and I noticed that changing the seed changes the minimum error after 10002 iterations:
seed = 1000 --> error = 0.4443457394544875
seed = 1234 --> error = 3.484945305904348e-05
seed = 1 --> error = 2.8741028650796533e-05
seed = 10000 --> error = 0.44434995711021025
seed = 12345 --> error = 3.430037390869015e-05
seed = 100 --> error = 2.851979370926932e-05
Therefore I changed the question from "Why is my back-propagation algorithm maximizing the error?" to "Why does the random seed impact my back-propagation algorithm?"

I managed to fix my code. There were a couple of errors:
As I'm doing gradient descent, a minus sign was needed. (I got lost in the math.)
The correction is independent of the weights. Therefore the right way to do it is
self.NN = self.NN - self.lr*correction
and not
self.NN *= 1-self.lr*correction
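Putting both fixes together, the corrected function (a sketch of the code from the question with the two changes applied) looks like this:

def back_propagation(self, x: np.ndarray, y: np.ndarray, y_exp: np.ndarray):
    error = 0.5*np.sum((y - y_exp)**2)
    Ep = y - y_exp                                   # d(Error) / d(y)
    dfrac = np.flip(self.out)/np.sum(self.out)**2    # d( x/sum(x) )/d(x)
    dsigm = self.out*(1 - self.out)                  # d(sigmoid)/d(x)
    correction = np.outer(Ep*dfrac*dsigm, x)         # gradient w.r.t. the weights
    self.NN = self.NN - self.lr*correction           # subtract: gradient descent
    return error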
The fact that the seed changed my result was probably due to a low learning rate (when I wrote the question it was 0.1-0.5); it may have been due to a local maximum (a local maximum because I was wrongly trying to maximize the error).
Hope my answer helps someone else with a similar problem.

Related

Odd linear model results

I'm unit acceptance testing some code I wrote. It's conceivable that at some point in the real world we will have input data where the dependent variable is constant. Not the norm, but possible. A linear model should yield coefficients of 0 in this case (right?), which is fine and what we would want -- but for some reason I'm getting some wild results when I try to fit the model on this use case.
I have tried 3 models and get different weird results every time -- or no results in some cases.
For this use case all of the dependent observations are set at 100, all the freq_weights are set at 1, and the independent variables are a binary coded dummy set of 20 features.
In total there are 150 observations.
Again, this data is unlikely in the real world, but I need my code to be able to handle this ugly data. I don't know why I'm getting such erroneous and inconsistent results.
As I understand it, with no variance in the dependent variable I should be getting 0 for all my coefficients.
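For reference, here is roughly what I would expect on clean, full-rank data (a hypothetical sketch, not my actual df):

import numpy as np
import statsmodels.api as sm

# Hypothetical reproduction: constant dependent variable,
# 20 binary dummy features, 150 observations.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(150, 20)).astype(float)
y = np.full(150, 100.0)

res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.params)  # expectation: const ~ 100, all other coefficients ~ 0

What I actually ran: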
freq = freq['Freq']
Indies = sm.add_constant(df)
model = sm.OLS(df1, Indies)
res = model.fit()
res.params
yields:
const 65.990203
x1 17.214836
reg = sm.GLM(df1, Indies, freq_weights=freq)
results = reg.fit(method = 'lbfgs', max_start_irls=0)
results.params
yields:
const 83.205034
x1 82.575228
reg = sm.GLM(df1, Indies, freq_weights=freq)
result2 = reg.fit()
result2.params
yields:
PerfectSeparationError: Perfect separation detected, results not available

Pytorch: How to create an update rule that doesn't come from derivatives?

I want to implement the following algorithm (actor-critic with eligibility traces), taken from this book, section 13.6:
I don't understand how to implement the update rule in pytorch (the rule for w is quite similar to that of theta).
As far as I know, torch requires a loss for loss.backward(). That form does not seem to apply to the quoted algorithm.
I'm still certain there is a correct way of implementing such update rules in pytorch.
Would greatly appreciate a code snippet of how the w weights should be updated, given that V(s,w) is the output of the neural net, parameterized by w.
EDIT: Chris Holland suggested a way to implement this, and I implemented it. It does not converge on CartPole, and I wonder if I did something wrong.
The critic does converge on the solution of gamma*f(n) = f(n) - 1, which happens to be the sum of the geometric series 1 + gamma + gamma^2 + ... = 1/(1-gamma):
meaning gamma=1 diverges, gamma=0.99 converges on 100, gamma=0.5 converges on 2, and so on, regardless of the actor or policy.
The code:
def _update_grads_with_eligibility(self, is_critic, delta, discount, ep_t):
    gamma = self.args.gamma
    if is_critic:
        params = list(self.critic_nn.parameters())
        lamb = self.critic_lambda
        eligibilities = self.critic_eligibilities
    else:
        params = list(self.actor_nn.parameters())
        lamb = self.actor_lambda
        eligibilities = self.actor_eligibilities

    is_episode_just_started = (ep_t == 0)
    if is_episode_just_started:
        eligibilities.clear()
        for i, p in enumerate(params):
            if not p.requires_grad:
                continue
            eligibilities.append(torch.zeros_like(p.grad, requires_grad=False))

    # eligibility traces
    for i, p in enumerate(params):
        if not p.requires_grad:
            continue
        eligibilities[i][:] = (gamma * lamb * eligibilities[i]) + (discount * p.grad)
        p.grad[:] = delta.squeeze() * eligibilities[i]
and
expected_reward_from_t = self.critic_nn(s_t)
probs_t = self.actor_nn(s_t)
expected_reward_from_t1 = torch.tensor([[0]], dtype=torch.float)
if s_t1 is not None:  # s_t is not a terminal state, s_t1 exists.
    expected_reward_from_t1 = self.critic_nn(s_t1)

delta = r_t + gamma * expected_reward_from_t1.data - expected_reward_from_t.data

negative_expected_reward_from_t = -expected_reward_from_t
self.critic_optimizer.zero_grad()
negative_expected_reward_from_t.backward()
self._update_grads_with_eligibility(is_critic=True,
                                    delta=delta,
                                    discount=discount,
                                    ep_t=ep_t)
self.critic_optimizer.step()
EDIT 2:
Chris Holland's solution works. The problem originated from a bug in my code that caused the line
if s_t1 is not None:
    expected_reward_from_t1 = self.critic_nn(s_t1)
to always get called, so expected_reward_from_t1 was never zero, and thus no stopping condition was specified for the Bellman equation recursion.
With no reward engineering, gamma=1, lambda=0.6, and a single hidden layer of size 128 for both actor and critic, this converged on a rather stable optimal policy within 500 episodes.
Even faster with gamma=0.99, as the graph shows (best discounted episode reward is about 86.6).
BIG thank you to @Chris Holland, who "gave this a try"
I am gonna give this a try.
.backward() does not need a loss function; it just needs a differentiable scalar output. It computes a gradient with respect to the model parameters. Let's just look at the first case, the update for the value function.
We have one gradient appearing for v; we can obtain this gradient by:
v = model(s)
v.backward()
This gives us the gradient of v, with the dimensions of your model parameters. Assuming we already calculated the other parameter updates, we can calculate the actual optimizer update:
for i, p in enumerate(model.parameters()):
    z_theta[i][:] = gamma * lamda * z_theta[i] + l * p.grad
    p.grad[:] = alpha * delta * z_theta[i]
We can then use opt.step() to update the model parameters with the adjusted gradient.
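For completeness, here is a minimal self-contained sketch of the whole pattern (the network shape and the hyperparameter values are placeholders, not taken from the book; note the sign flip, since opt.step() subtracts the gradient while the update rule adds alpha*delta*z):

import torch
import torch.nn as nn

# Hypothetical value network and hyperparameters (placeholders).
model = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.SGD(model.parameters(), lr=1.0)  # alpha is folded into p.grad below
gamma, lamda, alpha = 0.99, 0.6, 0.01

# One eligibility trace per parameter tensor, initialized to zero.
z = [torch.zeros_like(p) for p in model.parameters()]

def td_lambda_step(s, delta):
    # One TD(lambda) critic update: decay traces, accumulate grads, step.
    opt.zero_grad()
    v = model(s)   # scalar output V(s, w)
    v.backward()   # p.grad now holds dV/dw
    with torch.no_grad():
        for p, z_p in zip(model.parameters(), z):
            z_p.mul_(gamma * lamda).add_(p.grad)  # z <- gamma*lambda*z + grad(V)
            p.grad.copy_(-alpha * delta * z_p)    # minus: opt.step() subtracts
    opt.step()                                    # net effect: w <- w + alpha*delta*z

# Usage: delta is the TD error, computed outside the graph.
s = torch.randn(4)
delta = torch.tensor(0.5)
td_lambda_step(s, delta)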

Bayesian fit of cosine wave taking longer than expected

In a recent homework, I was asked to perform a Bayesian fit over a set of data a and b using a Metropolis algorithm. The relationship between a and b is given by:
e(t) = e_0*cos(w*t)
w = 2 * pi
The Metropolis algorithm is (it works fine with other fits):
def metropolis(logP, args, v0, Nsteps, stepSize):
    vCur = v0
    logPcur = logP(vCur, *args)
    v = []
    Nattempts = 0
    for i in range(Nsteps):
        while True:
            # Propose step:
            vNext = vCur + stepSize*np.random.randn(*vCur.shape)
            logPnext = logP(vNext, *args)
            Nattempts += 1
            # Accept/reject step:
            Pratio = (1. if logPnext > logPcur else np.exp(logPnext - logPcur))
            if np.random.rand() < Pratio:
                vCur = vNext
                logPcur = logPnext
                v.append(vCur)
                break
    acceptRatio = Nsteps*(1./Nattempts)
    return np.array(v), acceptRatio
I have tried to Bayesian-fit the cosine wave using the Metropolis algorithm above:
import numpy as np
import pandas as pd
import seaborn as sns

e_0 = -0.00155

def strain_t(e_0, t):
    return e_0*np.cos(2*np.pi*t)

data = pd.read_csv('stressStrain.csv')
t = np.array(data['t'])
e = strain_t(e_0, t)

def logfitstrain_t(params, t, e):
    e_0 = params[0]
    sigmaR = params[1]
    strainModel = strain_t(e_0, t)
    return np.sum(-0.5*((e-strainModel)/sigmaR)**2 - np.log(sigmaR))

params0 = np.array([-0.00155, np.std(t)])
params, accRatio = metropolis(logfitstrain_t, (t, e), params0, 1000, 0.042)
print('Acceptance ratio:', accRatio)

e0 = np.mean(params[0])
print('e0 =', e0)
e_t = e0*np.cos(2*np.pi*t)
sns.jointplot(t, e_t, kind='hex', color='purple')
The data in the .csv looks like this: [table screenshot omitted]
There isn't any error message after I hit run, but it takes forever for Python to give me an output. What did I do wrong here?
Why it might "take forever"
Your algorithm is designed to run until it accepts a given number of proposals (1000 in the example). Thus, if it's running for a long time, you're likely rejecting a bunch of proposals. This can happen when the step size is too large, leading new proposals to end up in distant, low probability regions of the likelihood space. Try reducing your step size. This may require you to also increase the number of samples to ensure the posterior space becomes adequately explored.
A more serious concern
Because you only append accepted proposals to the chain v, you haven't actually implemented the Metropolis algorithm, and instead obtain a biased set of samples that will tend to overrepresent less likely regions of the posterior space. A true Metropolis implementation re-appends the previous proposal whenever the new proposal is rejected. You can still enforce a minimum number of accepted proposals, but you really must append something every time.
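Concretely, the fix might look like this (a sketch keeping the names from the question):

def metropolis(logP, args, v0, Nsteps, stepSize):
    vCur = v0
    logPcur = logP(vCur, *args)
    v = []
    Naccept = 0
    for i in range(Nsteps):
        # Propose step:
        vNext = vCur + stepSize*np.random.randn(*vCur.shape)
        logPnext = logP(vNext, *args)
        # Accept/reject step:
        Pratio = 1. if logPnext > logPcur else np.exp(logPnext - logPcur)
        if np.random.rand() < Pratio:
            vCur = vNext          # accept: the chain moves
            logPcur = logPnext
            Naccept += 1
        v.append(vCur)            # append every iteration, accepted or not
    acceptRatio = Naccept / Nsteps
    return np.array(v), acceptRatio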

Network Flow Optimization (Gurobi)

I am trying to model and solve an optimization problem with Python and the Gurobi optimizer. It is my first experience solving a problem with an optimizer. At first I wrote a really big model, adding all the variables and constraints step by step, but there were problems in it, so I reduced the problem to a smaller version, again and again. Now I have a very simple code:
from gurobipy import *

m = Model('net')
x = m.addVar(name='x')
y = m.addVar(name='y')
m.addConstr(x >= 0 and x <= 9000, name='flow0')
m.addConstr(y >= 0 and y <= 1000, name='flow1')
m.addConstr(y + x == 9990, name='total_flow')
m.setObjective(x*(4 + 0.6*(x/9000)) + (y*(4 + 0.6*(y/1000))), GRB.MINIMIZE)
solo = m.optimize()
if solo:
    print('find!!!')
It actually is a simple network flow problem (for a graph with two nodes and two edges). I want to calculate the flow of each edge (x and y). Obviously the flow of each edge can't be negative and can't be bigger than the edge capacity (capacity of x = 9000, capacity of y = 1000), and the third constraint imposes the total flow requirement on both edges. Finally, the objective function has to be minimized.
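As an aside, gurobipy lets you put the capacity bounds directly on the variables, and a Python `and` between two constraint expressions does not add both constraints to the model (only one operand survives the `and`). A sketch of an equivalent formulation:

from gurobipy import *

m = Model('net')
x = m.addVar(lb=0, ub=9000, name='x')  # capacity bounds on the variable itself
y = m.addVar(lb=0, ub=1000, name='y')
m.addConstr(x + y == 9990, name='total_flow')
m.setObjective(x*(4 + 0.6*(x/9000)) + y*(4 + 0.6*(y/1000)), GRB.MINIMIZE)
m.optimize()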
Now I have some questions on this code:
1. Why is 'solo' None?
2. How can I print the solution variables? I used the getAttr() function, but I couldn't find out the role of the variable names (x, y or flow0, flow1).
3. I've got this result, but I really can't understand it. For example: what does it calculate in each iteration?
Thanks in advance, and excuse my simple questions...
The optimize() method always returns None; see print(help(m.optimize)). The status of your model after calling this method is stored in m.status, while the solution values are stored in the .X attribute of each variable (assuming the model was solved to optimality). To access them you can use m.getVars():
# your model ...
m.optimize()
if m.status == GRB.OPTIMAL:
    for var in m.getVars():
        print(var.VarName, var.X)
Your posted log shows, for each iteration of the barrier method (also known as the interior point method), the objective value. See here for a detailed overview.

How to define a general deterministic function in PyMC

In my model, I need to obtain the value of my deterministic variable from a set of parent variables using a complicated Python function.
Is it possible to do that?
Following is some PyMC3 code which shows what I am trying to do in a simplified case.
import numpy as np
import pymc as pm

# Predefine values on a two-parameter grid (x,w) for a set of i values (1,2,3)
idata = np.array([1,2,3])
size = 20
gridlength = size*size
Grid = np.empty((gridlength, 2+len(idata)))
for x in range(size):
    for w in range(size):
        # A silly version of my real model evaluated on the grid.
        Grid[x*size+w,:] = np.array([x,w]+[(x**i + w**i) for i in idata])

# A function to find the nearest value in Grid and return its product with the third variable z
def FindFromGrid(x,w,z):
    return Grid[int(x)*size+int(w),2:] * z

# Generate fake Y data with error
yerror = np.random.normal(loc=0.0, scale=9.0, size=len(idata))
ydata = Grid[16*size+12,2:]*3.6 + yerror  # i.e. true x=16, w=12 and z=3.6

with pm.Model() as model:
    # Priors
    x = pm.Uniform('x', lower=0, upper=size)
    w = pm.Uniform('w', lower=0, upper=size)
    z = pm.Uniform('z', lower=-5, upper=10)
    # Expected value
    y_hat = pm.Deterministic('y_hat', FindFromGrid(x,w,z))
    # Data likelihood
    ysigmas = np.ones(len(idata))*9.0
    y_like = pm.Normal('y_like', mu=y_hat, sd=ysigmas, observed=ydata)
    # Inference...
    start = pm.find_MAP()        # Find starting value by optimization
    step = pm.NUTS(state=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(1000, step, start=start, progressbar=False)  # draw 1000 posterior samples using NUTS sampling

print('The trace plot')
fig = pm.traceplot(trace, lines={'x': 16, 'w': 12, 'z': 3.6})
fig.show()
When I run this code, I get an error at the y_hat stage, because the int() function inside FindFromGrid(x,w,z) needs an integer, not a FreeRV.
Finding y_hat from a pre-calculated grid is important because my real model for y_hat does not have an analytical form.
I had earlier tried to use OpenBUGS, but I found out here that it is not possible to do this in OpenBUGS. Is it possible in PyMC?
Update
Based on an example on the PyMC GitHub page, I found I need to add the following decorator to my FindFromGrid(x,w,z) function:
@pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar], otypes=[t.dvector])
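Applied to the function above, that might look like this (a sketch; it assumes theano.tensor is imported as t, as the itypes/otypes suggest):

import theano.tensor as t

@pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar],
                             otypes=[t.dvector])
def FindFromGrid(x, w, z):
    # Same body as above: nearest grid cell, scaled by z.
    return Grid[int(x)*size + int(w), 2:] * z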
This seems to solve the above-mentioned issue, but I cannot use the NUTS sampler anymore since it needs the gradient.
Metropolis does not seem to be converging.
Which step method should I use in a scenario like this?
You found the correct solution with as_op.
Regarding the convergence: are you using pm.Metropolis() instead of pm.NUTS() by any chance? One reason this might not converge is that Metropolis() by default samples in the joint space, while Gibbs-within-Metropolis is often more effective (and this was the default in pymc2). Having said that, I just merged this: https://github.com/pymc-devs/pymc/pull/587 which changes the default behavior of the Metropolis and Slice samplers to be non-blocked (i.e. within Gibbs). Other samplers like NUTS that are primarily designed to sample the joint space still default to blocked. You can always explicitly set this with the kwarg blocked=True.
Anyway, update pymc with the most recent master and see if convergence improves. If not, try the Slice sampler.
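For example (a sketch reusing the model context from the question; blocked is the kwarg mentioned above):

with model:
    step = pm.Metropolis(blocked=False)  # Gibbs-within-Metropolis; or pm.Slice()
    trace = pm.sample(1000, step=step)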
