custom dqn agent can't output correct action

custom dqn agent can't output correct action - python

I customize a dqn agent to solve a circuit problem.for example, the state is 1D input which represents five nodes' value(node_0 to node_4),shape=(5,),and actions are choosing one of six components (like whose values are [0,1,2,3,4,5])to place in the circuit to get a new state,named state_.So the action_space is (6,).My goal is to make five values that in one state to reach fixed value as possible.For example,the initial state is [0.8,0.7,0.9,0.98,0.9] and my goal is to make five value higher than 0.95. I mean, i put a component whose value is 3, in node_0 and it becomes 0.95 from 0.8,it meets the requirments.and the node_3 don't need place a component because it is 0.98.And i set up a limit that the sum of placed components value can't over 10.
here are some hyperparameter:
gamma = 0.9
TARGET_REPLACE_ITER = 500
nodes = 5
memory_capability = 1000
batch_size = 30
epsilon_start =1
epsilon_end = 0.0001
epsilon_decay = 0.06
learning_rate = 0.001
epsilon = 1
n_state = 5
n_action = 6
I make two neural networks to do,one is eval_net and the other is target_net,the code is below:
class NN(nn.Module):
def __init__(self, ):
super(NN,self).__init__()
self.fc1 = nn.Linear(n_state, 16)
self.fc1.weight.data.normal_(0, 0.1)
self.fc2 = nn.Linear(16,32)
self.out = nn.Linear(32, n_action)
self.out.weight.data.normal_(0, 0.1)
def forward(self, x):
x = self.fc1(x)
x = F.relu(x)
x = self.fc2(x)
x = F.relu(x)
action_value = self.out(x)
return action_value
here is the agent:
class Ckt_Opt(object):
def __init__(self,):
self.learn_step_counter = 0
self.memory = np.zeros((memory_capability, n_state * 2 + 2))
self.memory_cntr = 0
self.eval_net, self.target_net = NN(), NN()
self.loss_func = nn.MSELoss()
self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=learning_rate)
def choose_action(self, state):
state = torch.unsqueeze(torch.FloatTensor(state),0)
if random.random() < epsilon:
action = random.randint(0,len(com_data) - 1) # choose a random component value,com_data is a list which means components' value ,from 0 to 5
else:
action_value = self.eval_net.forward(state)
action = torch.max(action_value, 1)[1].numpy()[0] # I copied this code from others and i guess it means choose the max Q value action
return action
def step(self,action):
self.ran_node = random.choice([a for a,x in enumerate(vio_node) if x == 1]) # x=1 means the node state value is lower than 0.95,ran_node means i randomly select node to place coponent
str1 = '.param cap_0_%d_val=%e\n' % (self.ran_node, com_data[action]) #
self.decap_param[self.ran_node] = com_data[action] # this two lines means change the placed component,it doesn't matter
def learn(self):
# target net update
if self.learn_step_counter % TARGET_REPLACE_ITER == 0:
self.target_net.load_state_dict(self.eval_net.state_dict())
self.learn_step_counter += 1
# sample from memory
sample_index = np.random.choice(memory_capability, batch_size)
b_memory = self.memory[sample_index, :]
b_s = torch.FloatTensor(b_memory[:, :n_state])
b_a = torch.LongTensor(b_memory[:, n_state:n_state + 1].astype(int))
b_r = torch.FloatTensor(b_memory[:, n_state + 1:n_state + 2])
b_s_ = torch.FloatTensor(b_memory[:, -n_state:])
q_eval = self.eval_net(b_s).gather(1, b_a) # shape (batch, 1)
q_next = self.target_net(b_s_).detach()
q_target = b_r + gamma * q_next.max(1)[0] # shape (batch, 1)
loss = self.loss_func(q_eval, q_target)
# calculate and update eval_net
self.optimizer.zero_grad()
loss.backward() # i don't understand this three lines
self.optimizer.step()
the main function is below:
ckt = Ckt_Opt()
for i in range(0,50):
ckt.reset() # no component is placed,get initial state
state = read_result() # a function to read state after taking action,here is reading initial state
for j in range(500):
action = pdn.choose_action(state)
state_,Cof,vio_node = pdn.step(action)
# Cof means whether the sum of component value is more than the limit(1 means more than limit, 0 not),vio_node means whether place component
reward = sum(-((state_ -0.95) * vio_node)**2 *500) + (nodes - sum(vio_node))if Cof == 0 else \
sum(-((state_ - 0.95) * vio_node)**2 *5000)
# equations mean give priority to placing components in nodes with low state value to improve the reward.for example,node_0 is 0.6 and node_1 is 0.9,the penalty(equals negative reward) of node_0 is -(0.6-0.95)^2 *500 = -61.25,and node_1's penalty is -(0.9-0.95)^2 *500 = -1.25
ckt.store_transition(state,action,reward,state_) # just store in the experience memory
state = state_
epsilon = epsilon_end +(epsilon_start - epsilon_end) * math.exp(-1. *epsilon_decay * i)
My goal is to find a solution that has the best reward.For example,the initial state is [0.6,0.7,0.8,0.9,0.97],and the placed component values are [5,4,0,1,0],get the best state is [0.85,0.9,0.91,0.93,0.97], it can't make every state value get over than 0.95 because of some reason.
But!!! I ran many times and always get a wired solution like [1,1,1,1,0] or [2,2,2,2,0], which is not make sense, i think it must be something wrong with choose_action function or learn function,but i can't find it because i am new to DQN
Could anyone help me ? thanks a lot

Related

K-Arms Bandit Epsilon-Greedy Policy

I have been trying to implement Reinforcement Learning books exercise 2.5
I have written this piece of code according to this pseudo version
class k_arm:
def __init__(self, iter, method="incrementally"):
# self.iter placeholder
self.iter = iter
self.k = 10
self.eps = .1
# here is Q(a) and N(a)
self.qStar = np.zeros(self.k)
self.n = np.zeros(self.k)
# Method just for experimenting different functions
self.method = method
def pull(self):
# selecting argmax(Q(A)) action with prob. (1 - eps)
eps = np.random.uniform(0, 1, 1)
if eps < self.eps or self.qStar.argmax() == 0:
a = np.random.randint(10)
else: a = self.qStar.argmax()
# R bandit(A)
r = np.random.normal(0, 0.01, 1)
# N(A) <- N(A) + 1
self.n[a] += 1
# Q(A) <- Q(A) i / (N(A)) * (R - Q(A))
if self.method == "incrementally":
self.qStar[a] += (r - self.qStar[a]) / self.n[a]
return self.qStar[a]`
iter = 1000
rewards = np.zeros(iter)
c = k_arm(iter, method="incrementally")
for i in range(iter):
k = c.pull()
rewards[i] = k
And I get this as a result
Where I am expecting this kind of results.
I have been trying to understand where am I went missing here, but I couldn't.

Your average reward is around 0 because it is the correct estimation. Your reward function is defined as:
# R bandit(A)
r = np.random.normal(0, 0.01, 1)
This means the expected value of your reward distribution is 0 with 0.01 variance. In the book the authors use a different reward function. While this still has a fundamental issue, you could earn similar rewards if you change your code to
# R bandit(A)
r = np.random.normal(1.25, 0.01, 1)
It makes sense to give each bandit a different reward function or all your action values will be the same. So what you really should do is sample from k different distributions with different expected values. Otherwise action selection is meaningless.
Add this to your init function:
self.expected_vals = np.random.uniform(0, 2, self.k)
and change the the calculation of the reward so, that it depends on the action:
r = np.random.uniform(self.expected_vals[a], 0.5, 1)
I've also increased the variance to 0.5 as 0.01 is basically meaningless variance in the context of bandits. If your agents works correctly, his average reward should be approximately equal to np.max(self.expected_vals)

L-BFGS-B code, Scipy (sciopt.fmin_l_bfgs_b(func, init_guess, maxiter=10, bounds=list(bounds), disp=1, iprint=101))

I'm using the L-BFGS-B optimizer to find the minima of a function. This will help me calculate sharpness for the function. However, I'm not sure if this following message is considered a normal message (i.e. Is there something wrong with my program or is this message typical?) See below:
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 28149514 M = 10
At X0 0 variables are exactly at the bounds
^[[C
At iterate 0 f= -3.59325D+00 |proj g|= 2.10249D-03
At iterate 1 f= -2.47853D+01 |proj g|= 4.20499D-03
Bad direction in the line search;
refresh the lbfgs memory and restart the iteration.
At iterate 2 f= -2.53202D+01 |proj g|= 4.17686D-03
At iterate 3 f= -2.53202D+01 |proj g|= 4.17686D-03
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
***** 3 43 ****** 0 ***** 4.177D-03 -2.532D+01
F = -25.320247650146484
CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
Warning: more than 10 function and gradient
evaluations in the last line search. Termination
may possibly be caused by a bad search direction.
I got the following sharpness anyway which is relatively consistent with the paper I'm trying to reproduce: It's just that I'm a bit concerned with the above message.
tensor(473.0201)
Here is my code for computing sharpness:
def get_sharpness(data_loader, model, criterion, epsilon, manifolds=0):
# extract current x0
x0 = None
for p in model.parameters():
if x0 is None:
x0 = p.data.view(-1)
else:
x0 = torch.cat((x0, p.data.view(-1)))
x0 = x0.cpu().numpy()
# get current f_x
f_x0, _ = get_minus_cross_entropy(x0, data_loader, model, criterion)
f_x0 = -f_x0
logging.info('min loss f_x0 = {loss:.4f}'.format(loss=f_x0))
# find the minimum
if 0==manifolds:
x_min = np.reshape(x0 - epsilon * (np.abs(x0) + 1), (x0.shape[0], 1))
x_max = np.reshape(x0 + epsilon * (np.abs(x0) + 1), (x0.shape[0], 1))
bounds = np.concatenate([x_min, x_max], 1)
func = lambda x: get_minus_cross_entropy(x, data_loader, model, criterion, training=True)
init_guess = x0
else:
warnings.warn("Small manifolds may not be able to explore the space.")
assert(manifolds<=x0.shape[0])
#transformer = rp.GaussianRandomProjection(n_components=manifolds)
#transformer.fit(np.random.rand(manifolds, x0.shape[0]))
#A_plus = transformer.components_
#A = np.linalg.pinv(A_plus)
A_plus = np.random.rand(manifolds, x0.shape[0])*2.-1.
# normalize each column to unit length
A_plus_norm = np.linalg.norm(A_plus, axis=1)
A_plus = A_plus / np.reshape(A_plus_norm, (manifolds,1))
A = np.linalg.pinv(A_plus)
abs_bound = epsilon * (np.abs(np.dot(A_plus, x0))+1)
abs_bound = np.reshape(abs_bound, (abs_bound.shape[0], 1))
bounds = np.concatenate([-abs_bound, abs_bound], 1)
def func(y):
floss, fg = get_minus_cross_entropy(x0 + np.dot(A, y), data_loader, model, criterion, training=True)
return floss, np.dot(np.transpose(A), fg)
#func = lambda y: get_minus_cross_entropy(x0+np.dot(A, y), data_loader, model, criterion, training=True)
init_guess = np.zeros(manifolds)
#rand_selections = (np.random.rand(bounds.shape[0])+1e-6)*0.99
#init_guess = np.multiply(1.-rand_selections, bounds[:,0])+np.multiply(rand_selections, bounds[:,1])
minimum_x, f_x, d = sciopt.fmin_l_bfgs_b(func, init_guess, maxiter=10, bounds=list(bounds), disp=1, iprint=101)
#factr=10.,
#pgtol=1.e-12,
f_x = -f_x
logging.info('max loss f_x = {loss:.4f}'.format(loss=f_x))
sharpness = (f_x - f_x0)/(1+f_x0)*100
print(sharpness)
# recover the model
x0 = torch.from_numpy(x0).float()
x0 = x0.cuda()
x_start = 0
for p in model.parameters():
psize = p.data.size()
peltnum = 1
for s in psize:
peltnum *= s
x_part = x0[x_start:x_start + peltnum]
p.data = x_part.view(psize)
x_start += peltnum
return sharpness
Which was taken from this repository:
https://github.com/wenwei202/smoothout/blob/master/measure_sharpness.py
I'm concerned about exact accuracy.

First, l-bfgs-b will only give a global minimum for a convex function.
the message
CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
is the normal convergence message.
The warning you are getting says that there are a lot of function/gradient evaluations in the line search - this can often happen when you use l-bfgs-b on non convex functions. So if the thing you're minimizing is non convex (and it seems like it might be just by glancing at the code), I would say this is normal.

Why won't a simple AND gate neural network work without BIAS?

A simple neural net of 2 inputs and one output without a bias, like this - doesn't seem to work.
|input1||weight1 weight2| = Z
|input2|
output = sigmoid(Z)
Whereas, it works perfectly when BIAS is added, why does it work & what is the math behind it?
|input1||weight1 weight2| = Z
|input2|
output = sigmoid(Z - BIAS)
Here's the CODE to working version with BIAS:
import numpy as np
import random as r
import sys
def sigmoid(ip, derivate=False):
if derivate:
return ip*(1-ip)
return 1.0/(1+np.exp(-1*ip))
class NeuralNet:
global sigmoid
def __init__(self):
self.inputLayers = 2
self.outputLayer = 1
self.bias = r.random()
def setup(self):
self.i = np.array([r.random(), r.random()], dtype=float).reshape(2,)
self.w = np.array([r.random(), r.random()], dtype=float).reshape(2,)
def forward_propogate(self):
self.z = self.w*self.i
self.o = sigmoid(sum(self.z)-self.bias)
def optimize_cost(self, desired):
i=0
current_cost = pow(desired - self.o, 2)
for weight in self.w:
dpdw = -1*(desired-self.o) * (sigmoid(self.o, derivate=True)) * self.i[i]
self.w[i] = self.w[i] - 2*dpdw
i+=1
#calculate dp/dB
dpdB = -1*(desired-self.o) * (sigmoid(self.o, derivate=True)) * -1
self.bias = self.bias - 2*dpdB
self.forward_propogate()
def train(self, ip, op):
self.i = np.array(ip).reshape(2,)
self.forward_propogate()
self.optimize_cost(op[0])
n = NeuralNet()
n.setup()
# while sys.stdin.read(1):
success_rate = 0
trial=0
done = False
while not done:
a = [0.1,1,0.1,1]
b = [0.1,0.1,1,1]
c = [0,0,0,1]
for i in range(len(a)):
trial +=1
n.train([a[i],b[i]],[c[i]])
if c[i] - n.o < 0.01:
success_rate +=1
print(100*success_rate/trial, "%")
if 100*success_rate/trial > 99 and trial > 4:
print(100*success_rate/trial, "%")
print("Network trained, took: {} trials".format(trial))
print("Network weights:{}, bias:{}".format(n.w, n.bias))
done = True
break

A bias is just a shift of the intercept. The NN you have set up in this example appears to be a single layer neural network with no hidden layers, which is effectively a logistic regression, which is just a linear model.
When you don't learn an intercept value, the intercept defaults to 0, so it always passes through the origin and you're just learning the slope of the line. To correctly classify the AND of your data, i.e. the top right corner at (1,1), but not any of the other points, you need a non zero intercept because there is no line that passes through the origin that will only have the top right corner on one side and the other three points on the other side.

Why don't my 2 functions give the same results?

Below I have the code of my attempt to make a neural network with 2 inputs and 3 outputs. While the training gives good results, when I try to input the numbers, the results are way off. After I made some small changes, I observed that, even though they return the output from the function which should be the same, again, the results were different. The only explanation I can think of is that there is a bug.
The functions that I'm talking about are "train" and "result".
Here is the code:
from numpy import dot, exp, max, sum, random, array
class Network:
def __init__(self):
self.w = random.random((2,3))
def sigmoid(self, x, derivate = False):
if(derivate == True):
return x * (1 - x)
return 1 /(1 + exp(-x))
def train(self):
trainingInput = array([[0,0],[0,1],[1,0],[1,1]])
trainingOutput = array([[0,0,0],[0,1,0],[0,0,1],[1,0,0]])
n = 0
while(n < 10000):
exOutput = self.sigmoid(dot(trainingInput, self.w) - 0.1)
error = trainingOutput - exOutput
self.w += dot(trainingInput.T, error *
self.sigmoid(exOutput,True))
n += 1
return exOutput
def result(self):
trainingInput = array([[0,0],[0,1],[1,0],[1,1]])
exOutput = self.sigmoid(dot(trainingInput, self.w) - 0.1)
return exOutput
network = Network()
c = 0
d = 1
o = network.result()
output = network.train()
print(o)
print(output)

you should first train and then check the results.
If you check it before training, obviously two results will be different.
you can just once again calculate the results after training, hopefully, this will solve your bug.

Neural network XOR gate not learning

I'm trying to make a XOR gate by using 2 perceptron network but for some reason the network is not learning, when I plot the change of error in a graph the error comes to a static level and oscillates in that region.
I did not add any bias to the network at the moment.
import numpy as np
def S(x):
return 1/(1+np.exp(-x))
win = np.random.randn(2,2)
wout = np.random.randn(2,1)
eta = 0.15
# win = [[1,1], [2,2]]
# wout = [[1],[2]]
obj = [[0,0],[1,0],[0,1],[1,1]]
target = [0,1,1,0]
epoch = int(10000)
emajor = ""
for r in range(0,epoch):
for xy in range(len(target)):
tar = target[xy]
fdata = obj[xy]
fdata = S(np.dot(1,fdata))
hnw = np.dot(fdata,win)
hnw = S(np.dot(fdata,win))
out = np.dot(hnw,wout)
out = S(out)
diff = tar-out
E = 0.5 * np.power(diff,2)
emajor += str(E[0]) + ",\n"
delta_out = (out-tar)*(out*(1-out))
nindelta_out = delta_out * eta
wout_change = np.dot(nindelta_out[0], hnw)
for x in range(len(wout_change)):
change = wout_change[x]
wout[x] -= change
delta_in = np.dot(hnw,(1-hnw)) * np.dot(delta_out[0], wout)
nindelta_in = eta * delta_in
for x in range(len(nindelta_in)):
midway = np.dot(nindelta_in[x][0], fdata)
for y in range(len(win)):
win[y][x] -= midway[y]
f = open('xor.csv','w')
f.write(emajor) # python will convert \n to os.linesep
f.close() # you can omit in most cases as the destructor will call it
This is the error changing by the number of learning rounds. Is this correct? The red color line is the line I was expecting how the error should change.
Anything wrong I'm doing in the code? As I can't seem to figure out what's causing the error. Help much appreciated.
Thanks in advance

Here is a one hidden layer network with backpropagation which can be customized to run experiments with relu, sigmoid and other activations. After several experiments it was concluded that with relu the network performed better and reached convergence sooner, while with sigmoid the loss value fluctuated. This happens because, "the gradient of sigmoids becomes increasingly small as the absolute value of x increases".
import numpy as np
import matplotlib.pyplot as plt
from operator import xor
class neuralNetwork():
def __init__(self):
# Define hyperparameters
self.noOfInputLayers = 2
self.noOfOutputLayers = 1
self.noOfHiddenLayerNeurons = 2
# Define weights
self.W1 = np.random.rand(self.noOfInputLayers,self.noOfHiddenLayerNeurons)
self.W2 = np.random.rand(self.noOfHiddenLayerNeurons,self.noOfOutputLayers)
def relu(self,z):
return np.maximum(0,z)
def sigmoid(self,z):
return 1/(1+np.exp(-z))
def forward (self,X):
self.z2 = np.dot(X,self.W1)
self.a2 = self.relu(self.z2)
self.z3 = np.dot(self.a2,self.W2)
yHat = self.relu(self.z3)
return yHat
def costFunction(self, X, y):
#Compute cost for given X,y, use weights already stored in class.
self.yHat = self.forward(X)
J = 0.5*sum((y-self.yHat)**2)
return J
def costFunctionPrime(self,X,y):
# Compute derivative with respect to W1 and W2
delta3 = np.multiply(-(y-self.yHat),self.sigmoid(self.z3))
djw2 = np.dot(self.a2.T, delta3)
delta2 = np.dot(delta3,self.W2.T)*self.sigmoid(self.z2)
djw1 = np.dot(X.T,delta2)
return djw1,djw2
if __name__ == "__main__":
EPOCHS = 6000
SCALAR = 0.01
nn= neuralNetwork()
COST_LIST = []
inputs = [ np.array([[0,0]]), np.array([[0,1]]), np.array([[1,0]]), np.array([[1,1]])]
for epoch in xrange(1,EPOCHS):
cost = 0
for i in inputs:
X = i #inputs
y = xor(X[0][0],X[0][1])
cost += nn.costFunction(X,y)[0]
djw1,djw2 = nn.costFunctionPrime(X,y)
nn.W1 = nn.W1 - SCALAR*djw1
nn.W2 = nn.W2 - SCALAR*djw2
COST_LIST.append(cost)
plt.plot(np.arange(1,EPOCHS),COST_LIST)
plt.ylim(0,1)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title(str('Epochs: '+str(EPOCHS)+', Scalar: '+str(SCALAR)))
plt.show()
inputs = [ np.array([[0,0]]), np.array([[0,1]]), np.array([[1,0]]), np.array([[1,1]])]
print "X\ty\ty_hat"
for inp in inputs:
print (inp[0][0],inp[0][1]),"\t",xor(inp[0][0],inp[0][1]),"\t",round(nn.forward(inp)[0][0],4)
End Result:
X y y_hat
(0, 0) 0 0.0
(0, 1) 1 0.9997
(1, 0) 1 0.9997
(1, 1) 0 0.0005
The weights obtained after training were:
nn.w1
[ [-0.81781753 0.71323677]
[ 0.48803631 -0.71286155] ]
nn.w2
[ [ 2.04849235]
[ 1.40170791] ]
I found the following youtube series extremely helpful for understanding neural nets: Neural networks demystified
There is only little which I know and also that can be explained in this answer. If you want an even better understanding of neural nets, then I would suggest you to go through the following link: cs231n: Modelling one neuron

The error calculated in each epoch should be a sum total of all sum squared errors (i.e. error for every target)
import numpy as np
def S(x):
return 1/(1+np.exp(-x))
win = np.random.randn(2,2)
wout = np.random.randn(2,1)
eta = 0.15
# win = [[1,1], [2,2]]
# wout = [[1],[2]]
obj = [[0,0],[1,0],[0,1],[1,1]]
target = [0,1,1,0]
epoch = int(10000)
emajor = ""
for r in range(0,epoch):
# ***** initialize final error *****
finalError = 0
for xy in range(len(target)):
tar = target[xy]
fdata = obj[xy]
fdata = S(np.dot(1,fdata))
hnw = np.dot(fdata,win)
hnw = S(np.dot(fdata,win))
out = np.dot(hnw,wout)
out = S(out)
diff = tar-out
E = 0.5 * np.power(diff,2)
# ***** sum all errors *****
finalError += E
delta_out = (out-tar)*(out*(1-out))
nindelta_out = delta_out * eta
wout_change = np.dot(nindelta_out[0], hnw)
for x in range(len(wout_change)):
change = wout_change[x]
wout[x] -= change
delta_in = np.dot(hnw,(1-hnw)) * np.dot(delta_out[0], wout)
nindelta_in = eta * delta_in
for x in range(len(nindelta_in)):
midway = np.dot(nindelta_in[x][0], fdata)
for y in range(len(win)):
win[y][x] -= midway[y]
# ***** Save final error *****
emajor += str(finalError[0]) + ",\n"
f = open('xor.csv','w')
f.write(emajor) # python will convert \n to os.linesep
f.close() # you can omit in most cases as the destructor will call it

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

custom dqn agent can't output correct action - python

Related

K-Arms Bandit Epsilon-Greedy Policy

L-BFGS-B code, Scipy (sciopt.fmin_l_bfgs_b(func, init_guess, maxiter=10, bounds=list(bounds), disp=1, iprint=101))

Why won't a simple AND gate neural network work without BIAS?

Why don't my 2 functions give the same results?

Neural network XOR gate not learning

Categories

Resources