Implementing the TD-Gammon algorithm

Implementing the TD-Gammon algorithm - python

I am attempting to implement the algorithm from the TD-Gammon article by Gerald Tesauro. The core of the learning algorithm is described in the following paragraph:
I have decided to have a single hidden layer (if that was enough to play world-class backgammon in the early 1990's, then it's enough for me). I am pretty certain that everything except the train() function is correct (they are easier to test), but I have no idea whether I have implemented this final algorithm correctly.
import numpy as np
class TD_network:
"""
Neural network with a single hidden layer and a Temporal Displacement training algorithm
taken from G. Tesauro's 1995 TD-Gammon article.
"""
def __init__(self, num_input, num_hidden, num_output, hnorm, dhnorm, onorm, donorm):
self.w21 = 2*np.random.rand(num_hidden, num_input) - 1
self.w32 = 2*np.random.rand(num_output, num_hidden) - 1
self.b2 = 2*np.random.rand(num_hidden) - 1
self.b3 = 2*np.random.rand(num_output) - 1
self.hnorm = hnorm
self.dhnorm = dhnorm
self.onorm = onorm
self.donorm = donorm
def value(self, input):
"""Evaluates the NN output"""
assert(input.shape == self.w21[1,:].shape)
h = self.w21.dot(input) + self.b2
hn = self.hnorm(h)
o = self.w32.dot(hn) + self.b3
return(self.onorm(o))
def gradient(self, input):
"""
Calculates the gradient of the NN at the given input. Outputs a list of dictionaries
where each dict corresponds to the gradient of an output node, and each element in
a given dict gives the gradient for a subset of the weights.
"""
assert(input.shape == self.w21[1,:].shape)
J = []
h = self.w21.dot(input) + self.b2
hn = self.hnorm(h)
o = self.w32.dot(hn) + self.b3
for i in range(len(self.b3)):
db3 = np.zeros(self.b3.shape)
db3[i] = self.donorm(o[i])
dw32 = np.zeros(self.w32.shape)
dw32[i, :] = self.donorm(o[i])*hn
db2 = np.multiply(self.dhnorm(h), self.w32[i,:])*self.donorm(o[i])
dw21 = np.transpose(np.outer(input, db2))
J.append(dict(db3 = db3, dw32 = dw32, db2 = db2, dw21 = dw21))
return(J)
def train(self, input_states, end_result, a = 0.1, l = 0.7):
"""
Trains the network using a single series of input states representing a game from beginning
to end, and a final (supervised / desired) output for the end state
"""
outputs = [self(input_state) for input_state in input_states]
outputs.append(end_result)
for t in range(len(input_states)):
delta = dict(
db3 = np.zeros(self.b3.shape),
dw32 = np.zeros(self.w32.shape),
db2 = np.zeros(self.b2.shape),
dw21 = np.zeros(self.w21.shape))
grad = self.gradient(input_states[t])
for i in range(len(self.b3)):
for key in delta.keys():
td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
delta[key] += a*(outputs[t + 1][i] - outputs[t][i])*td_sum
self.w21 += delta["dw21"]
self.w32 += delta["dw32"]
self.b2 += delta["db2"]
self.b3 += delta["db3"]
The way I use this is I play through a whole game (or rather, the neural net plays against itself), and then I send the states of that game, from start to finish, into train(), along with the final result. It then takes this game log, and applies the above formula to alter weights using the first game state, then the first and second game states, and so on until the final time, when it uses the entire list of game states. Then I repeat that many times and hope that the network learns.
To be clear, I am not after feedback on my code writing. This was never meant to be more than a quick and dirty implementation to see that I have all the nuts and bolts in the right spots.
However, I have no idea whether it is correct, as I have thus far been unable to make it capable of playing tic-tac-toe at any reasonable level. There could be many reasons for that. Maybe I'm not giving it enough hidden nodes (I have used 10 to 12). Maybe it needs more games to train (I have used 200 000). Maybe it would do better with different normalisation functions (I've tried sigmoid and ReLU, leaky and non-leaky, in different variations). Maybe the learning parameters are not tuned right. Maybe tic-tac-toe and its deterministic gameplay means it "locks in" on certain paths in the game tree. Or maybe the training implementation is just wrong. Which is why I'm here.
Have I misunderstood Tesauro's algorithm?

I can't say that I entirely understand your implementation, but this line jumps out to me:
td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
Comparing with the formula you reference:
I see at least two differences:
Your implementation sums over t+1 elements compared to t elements in the formula
The gradient should be indexed with the same k as used in l**(t-k), but in your implementation it is indexed with i and key, without any reference to k
Perhaps if you fix these discrepancies your solution will behave more as expected.

Related

RL + optimization: how to do it better?

I am learning about how to optimize using reinforcement learning. I have chosen the problem of maximum matching in a bipartite graph as I can easily compute the true optimum.
Recall that a matching in a graph is a subset of the edges where no two edges are incident on the same node/vertex. The goal is to find the largest such subset.
I show my full code below but first let me explain parts of it.
num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))
This makes a random bipartite graph with 1000 nodes in each of the two sets of nodes. It then prints out the size of the true maximum matching.
In the code below, self.agent_pos is an array representing the current matching found. Its length is the number of edges in the original graph and there is a 1 at index i if edge i is included and a 0 otherwise. self.matching is the set of edges in the growing matching. self.matching_nodes is the set of nodes in the growing matching that is used to check to see if a particular edge can be added or not.
import igraph as ig
from tqdm import tqdm
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
class MaxMatchEnv(gym.Env):
metadata = {'render.modes': ['console']}
def __init__(self, array_length=10):
super(MaxMatchEnv, self).__init__()
# Size of the 1D-grid
self.array_length = array_length
self.agent_pos = [0]*array_length
self.action_space = spaces.Discrete(array_length)
self.observation_space = spaces.Box(low=0, high=1, shape=(array_length,), dtype=np.uint8)
self.matching = set() # set of edges
self.matching_nodes = set() # set of node ids (ints)
self.matching_size = len([v for v in g_matching.matching if v < num_variables and v != -1])
self.best_found = 0
def reset(self):
# Initialize the array to have random values
self.time = 0
#print(self.agent_pos)
self.agent_pos = [0]*self.array_length
self.matching = set()
self.matching_nodes = set()
return np.array(self.agent_pos)
def step(self, action):
self.time += 1
reward = 0
edge = g.es[action]
if not(edge.source in self.matching_nodes or edge.target in self.matching_nodes):
self.matching.add(edge)
self.matching_nodes.add(edge.source)
self.matching_nodes.add(edge.target)
self.agent_pos[action] = 1
if sum(self.agent_pos) > self.best_found:
self.best_found = sum(self.agent_pos)
print("New max", self.best_found)
reward = 1
elif self.agent_pos[action] == 1:
#print("Removing edge", action)
self.matching_nodes.remove(edge.source)
self.matching_nodes.remove(edge.target)
self.matching.remove(edge)
self.agent_pos[action] = 0
reward = -1
done = sum(self.agent_pos) == self.matching_size
info = {}
return np.array(self.agent_pos), reward, done, info
def render(self, mode='console'):
print(sum(self.agent_pos))
def close(self):
pass
if __name__ == '__main__':
num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))
env = make_vec_env(lambda: MaxMatchEnv(array_length=len(g.es)), n_envs=12)
model = PPO('MlpPolicy', env, verbose=1).learn(10000000)
There are a number of problems with this but the main one is that it doesn't optimize well. This code gives just over 550 and then stops improving where the true optimum is over 900 (it is printed out by the code at the start).
The main question is:
How can this be done better so that it gets to a better matching?
A subsidiary question is, how can I print the best matching found so far? My attempt using self.best_found to maintain the best score does not work as it seems to be reset regularly.
Changes that haven't help
Changing PPO for DQN makes only a marginal difference.
I tried changing the code so that done is True after 1000 steps.
The change is as follows:
if self.time == 1000:
done = True
else:
done = False
Having added print(max(env.get_attr("best_found"))) in place of print("New max", self.best_found) this change to done shows no advantage at all.

To print the max for each environment you can use get_attr method from stable baselines. More info in their official docs.
For example the lines below will print the max for each of the 12 environments and then the maximum across all environments.
print(env.get_attr("best_found"))
print(max(env.get_attr("best_found")))
Regarding why it does not converge it could be due a bad reward selected, although looking at your reward choice it seems sensible. I added a debug print in your code to see if some step lead to done = True, but it seems that the environment never reaches that state. I think that for the model to learn it would be good to have multiple sequence of actions leading to a state with done = True, which would mean that the model gets to experience the end of an episode. I did not study the problem in your code in detail but maybe this information can help debug your issue.
If we compare the problem with other environments like CartPole, there we have episodes that end with done = True and that helps the model learn a better policy (in your case you could limit the amount of actions per episode instead of running the same episode forever). That could help the model avoid getting stuck in a local optimum as you give it the opportunity to "try again" in a new episode.

If I'm reading this right, then you aren't performing any hyperparameter tuning. How do you know your learning rate / policy update rate / any other variable parameters are a good fit for the problem at hand?
It looks like stable baselines has this functionality built-in:
https://stable-baselines.readthedocs.io/en/master/guide/rl_zoo.html?highlight=tune#hyperparameter-optimization

Minimal cost flow network when using multiple demand/supply sets (networkx)

This is my first stackoverflow submission, so please let me know whether I am asking this question according to community guidelines.
Question in short: Is there a way to have networkx find the min cost flow network given multiple demand/supply sets?
I currently working on a network optimization problem in which I am using the nx.min_cost_flow() function to calculate the minimum cost network layout given a supply and demand set for the nodes, and maximum capacity for certain edges. The code works, but now I am trying to find a way to find the optimal network layout when given multiple supply and demand sets over time. I have tried a method that works for optimal trees (iterating through all the demand/supply sets and using the maximum capacity for each edge in each time step), but sadly this does not provide the optimal solution when using nx.min_cost_flow() as I want the model to be able to create loops as well.
I have added the relevant code below, but it might be a bit confusing as there are many elements referring to other parts of the model unrelated to the problem I am currently facing. The complete model is too large to share.
I have tried looking for a solution online without success. Hopefully you know a solution. Thanks in advance!
def full_existing(T,G,folder,rpc,routing):
fullG = G.copy()
for (i,j) in fullG.edges():
fullG[i][j]['capacity'] = 0
for (i,j) in T.edges():
if 'current' in T[i][j]:
fullG[i][j]['weight'] = T[i][j]['weight']*rpc
f_demand = open(folder+'/demand.txt','r')
for X in f_demand:
print('Demand set is called')
Y = X.split(',')
demand = {i: float(Y[i+1]) for i in range(len(Y)-1)}
G1 = fullG.copy()
for i in fullG.nodes():
if i in demand.keys():
G1.nodes[i]['demand'] = demand[i]
else:
G1.nodes[i]['demand'] = 0
for (i,j) in G1.edges():
G1[i][j]['capacity'] = inf
for (i,j) in T.edges():
if 'current' in T[i][j]:
G1[i][j]['capacity'] = T[i][j]['current']
G1[i][j]['weight'] = T[i][j]['weight']*rpc
else:
G1[i][j]['capacity'] = inf
G1 = simplex(G1)
for i,j in G1.edges():
fullG[i][j]['capacity'] = max(fullG[i][j]['capacity'],G1[i][j]['capacity'])
for (i,j) in T.edges():
if 'current' in T[i][j]:
fullG[i][j]['weight'] = T[i][j]['weight']
nx.set_edge_attributes(fullG,nx.get_edge_attributes(T,'current'),'current')
return fullG
The simplex procedure is called because the input graph is undirected.
def simplex(G):
GX = G.copy()
G1 = GX.to_directed()
ns = nx.min_cost_flow(G1)
for i,j in G1.edges():
if ns[i][j] > 0 or ns[j][i] > 0:
GX[i][j]['capacity'] = max(ns[i][j],ns[j][i])
else:
GX[i][j]['capacity'] = 0
return GX

Implement a N-aryTreeLSTM version of the TreeLSTM in TensorFlow Fold

I'm trying to implement this paper TreeLSTM by using TensorFlow Fold. Actually, in Tensorflow Fold, there's already an example of TreeLSTM but in a BinaryTreeLSTM version, here's the tutorial: https://github.com/tensorflow/fold/blob/master/tensorflow_fold/g3doc/sentiment.ipynb
What I'm trying to do now is to implement a real NaryTreeLSTM, means that a LSTM node can be the parent of any number of children, not just 2 like in the above tutorial.
This is my attempt at trying to fold the tree, this is a modify version of logits_and_state() in the above example "
def logits_and_state():
"""Creates a block that goes from tokens to (logits, state) tuples."""
word2vec = (td.GetItem(0) >> td.InputTransform(lookup_word) >>
td.Scalar('int32') >> word_embedding)
children_num =
children2vec_list = list()
children2vec_list.append(embed_subtree())
for i in range(children_num):
children2vec_list.append(embed_subtree())
children2vec = tuple(children2vec_list)
# Trees are binary, so the tree layer takes two states as its input_state.
zero_state = td.Zeros((tree_lstm.state_size,) * 2)
# Input is a word vector.
zero_inp = td.Zeros(word_embedding.output_type.shape[0])
# word_case =
word_case = td.AllOf(word2vec, zero_state)
children_case = td.AllOf(zero_inp, children2vec)
tree2vec = td.OneOf(lambda x: 1 if len(x) == 1 else 2), [(1,word_case),(2,children_case)])
return tree2vec >> tree_lstm >> (output_layer, td.Identity())
The children_num is the thing that I'm struggling at this moment, I have no idea to get out that number, eventhought I know that the length of children can be obtained by td.GetItem(1) ==> will produce a block that contains an array of children ==> how to get out the real number of that block?
You may say that I should try PyTorch or some others DL framework that also provides Dynamic Computation Graph, but in my case, the requirement is strict with Tensorflow Fold.

Back-propagation in Tensorflow

I am constructing a time delayed recurrent model and I need to know about how and when TensorFlow computes its backwards step.
Consider the following model and pseudo-code:
unit_1 = LSTM(unit_size)
unit_2 = LSTM(unit_size)
unit_3 = LSTM(unit_size)
unit_4 = LSTM(unit_size)
ip_W = Variable([4 * unit_size, output_size])
ip_b = Variable([output_size])
prev_1 = tf.zeros([unit_size])
prev_2 = tf.zeros([unit_size])
prev_3 = tf.zeros([unit_size])
prev_4 = tf.zeros([unit_size])
for t, input in enumerate(input_data):
if t%1==0:
prev_1 = unit_1([input])
if t%2==0:
prev_2 = unit_2([input])
if t%3==0:
prev_3 = unit_3([input])
if t%4==0:
prev_4 = unit_4([input])
concat = tf.concat(0,[prev_1, prev_2, prev_3, prev_4])
output[t] = tf.matmul(concat, ip_W) + ip_B
Here is a gist to a usable version of this code, based on tensorflow/python/ops/rnn.py
My Question:
For time steps where cells are not called (i.e. at T=1, unit_0 is called, while all the rest are not) are their weights updated? I'm torn as to whether it would be a good idea to have them update at every state or not. The cells themselves haven't been exposed to any new data in the steps, so I afraid that back propagating may result in over-correction. Would appreciate other's insights into this.
Let me know if any clarification is necessary.

Create a model that switches between two different states using Temporal Logic?

Im trying to design a model that can manage different requests for different water sources.
Platform : MAC OSX, using latest Python with TuLip module installed.
For example,
Definitions :
Two water sources : w1 and w2
3 different requests : r1,r2,and r3
-
Specifications :
Water 1 (w1) is preferred, but w2 will be used if w1 unavailable.
Water 2 is only used if w1 is depleted.
r1 has the maximum priority.
If all entities request simultaneously, r1's supply must not fall below 50%.
-
The water sources are not discrete but rather continuous, this will increase the difficulty of creating the model. I can do a crude discretization for the water levels but I prefer finding a model for the continuous state first.
So how do I start doing that ?
Some of my thoughts :
Create a matrix W where w1,w2 ∈ W
Create a matrix R where r1,r2,r3 ∈ R
or leave all variables singular without putting them in a matrix
I'm not an expert in coding so that's why I need help. Not sure what is the best way to start tackling this problem.
I am only interested in the model, or a code sample of how can this be put together.
edit
Now imagine I do a crude discretization of the water sources to have w1=[0...4] and w2=[0...4] for 0, 25, 50, 75,100 percent respectively.
==> means implies
Usage of water sources :
if w1[0]==>w2[4] -- meaning if water source 1 has 0%, then use 100% of water source 2 etc
if w1[1]==>w2[3]
if w1[2]==>w2[2]
if w1[3]==>w2[1]
if w1[4]==>w2[0]
r1=r2=r3=[0,1] -- 0 means request OFF and 1 means request ON
Now what model can be designed that will give each request 100% water depending on the values of w1 and w2 (w1 and w2 values are uncontrollable so cannot define specific value, but 0...4 is used for simplicity )

This is called the flow problem: http://en.wikipedia.org/wiki/Maximum_flow_problem
Wiki has some code for the solution: http://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm
I'm not sure temporal logic is of much help here. For example load balancing is a major research topic, and I believe most of it doesn't use this formalism.
I have coded something, which only represents a simple priority list, which is kind of trivial. I would use classes and functions to represent states, not matrices. The dependencies in terms of priority are simple enough. Otherwise you can add those to the class watersource aswell. (class WaterSourcePriorityQueue or something like that). To get a simulation it is good to use threads, which I haven't here. You can use stepwise iteration (rounds), which is more in line with a procedural program.
import time
from random import random
from math import floor
import operator
class Watersource:
def __init__(self,initlevel,prio,name):
self.level = initlevel
self.priority = prio
self.name = name
def requestWater(self,amount):
if amount < self.level:
self.level -= amount
return True
else:
return False
#watersources
w1 = Watersource(40,1,"A")
w2 = Watersource(30,2,"B")
w3 = Watersource(20,3,"C")
probA = 0.8 # probability A will be requested
probB = 0.7
probC = 0.9
probs = {w1:probA,w2:probB,w3:probC}
amounts = {w1:10,w2:10,w3:20} # amounts requested
ws = [w1,w2,w3]
numrounds = 100
for round in range(1,numrounds):
print 'round ',round
done = False
i = 0
priorRequest = False
prioramount = 0
while not done or priorRequest:
if i==len(ws):
done=True
break
w = ws[i]
probtresh = probs[w]
prob = random()
if prob > probtresh: # request water
if prioramount != 0:
amount = prioramount
else:
amount = floor(random()*amounts[w])
prioramount = amount
print 'requesting ',amount
success = w.requestWater(amount)
if not success:
print 'not enough'
priorRequest=True
else:
print 'got water'
done = True
priorRequest=False
i+=1
time.sleep(1)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Implementing the TD-Gammon algorithm - python

Related

RL + optimization: how to do it better?

Minimal cost flow network when using multiple demand/supply sets (networkx)

Implement a N-aryTreeLSTM version of the TreeLSTM in TensorFlow Fold

Back-propagation in Tensorflow

Create a model that switches between two different states using Temporal Logic?

Categories

Resources