RL + optimization: how to do it better?

RL + optimization: how to do it better? - python

I am learning about how to optimize using reinforcement learning. I have chosen the problem of maximum matching in a bipartite graph as I can easily compute the true optimum.
Recall that a matching in a graph is a subset of the edges where no two edges are incident on the same node/vertex. The goal is to find the largest such subset.
I show my full code below but first let me explain parts of it.
num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))
This makes a random bipartite graph with 1000 nodes in each of the two sets of nodes. It then prints out the size of the true maximum matching.
In the code below, self.agent_pos is an array representing the current matching found. Its length is the number of edges in the original graph and there is a 1 at index i if edge i is included and a 0 otherwise. self.matching is the set of edges in the growing matching. self.matching_nodes is the set of nodes in the growing matching that is used to check to see if a particular edge can be added or not.
import igraph as ig
from tqdm import tqdm
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
class MaxMatchEnv(gym.Env):
metadata = {'render.modes': ['console']}
def __init__(self, array_length=10):
super(MaxMatchEnv, self).__init__()
# Size of the 1D-grid
self.array_length = array_length
self.agent_pos = [0]*array_length
self.action_space = spaces.Discrete(array_length)
self.observation_space = spaces.Box(low=0, high=1, shape=(array_length,), dtype=np.uint8)
self.matching = set() # set of edges
self.matching_nodes = set() # set of node ids (ints)
self.matching_size = len([v for v in g_matching.matching if v < num_variables and v != -1])
self.best_found = 0
def reset(self):
# Initialize the array to have random values
self.time = 0
#print(self.agent_pos)
self.agent_pos = [0]*self.array_length
self.matching = set()
self.matching_nodes = set()
return np.array(self.agent_pos)
def step(self, action):
self.time += 1
reward = 0
edge = g.es[action]
if not(edge.source in self.matching_nodes or edge.target in self.matching_nodes):
self.matching.add(edge)
self.matching_nodes.add(edge.source)
self.matching_nodes.add(edge.target)
self.agent_pos[action] = 1
if sum(self.agent_pos) > self.best_found:
self.best_found = sum(self.agent_pos)
print("New max", self.best_found)
reward = 1
elif self.agent_pos[action] == 1:
#print("Removing edge", action)
self.matching_nodes.remove(edge.source)
self.matching_nodes.remove(edge.target)
self.matching.remove(edge)
self.agent_pos[action] = 0
reward = -1
done = sum(self.agent_pos) == self.matching_size
info = {}
return np.array(self.agent_pos), reward, done, info
def render(self, mode='console'):
print(sum(self.agent_pos))
def close(self):
pass
if __name__ == '__main__':
num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))
env = make_vec_env(lambda: MaxMatchEnv(array_length=len(g.es)), n_envs=12)
model = PPO('MlpPolicy', env, verbose=1).learn(10000000)
There are a number of problems with this but the main one is that it doesn't optimize well. This code gives just over 550 and then stops improving where the true optimum is over 900 (it is printed out by the code at the start).
The main question is:
How can this be done better so that it gets to a better matching?
A subsidiary question is, how can I print the best matching found so far? My attempt using self.best_found to maintain the best score does not work as it seems to be reset regularly.
Changes that haven't help
Changing PPO for DQN makes only a marginal difference.
I tried changing the code so that done is True after 1000 steps.
The change is as follows:
if self.time == 1000:
done = True
else:
done = False
Having added print(max(env.get_attr("best_found"))) in place of print("New max", self.best_found) this change to done shows no advantage at all.

To print the max for each environment you can use get_attr method from stable baselines. More info in their official docs.
For example the lines below will print the max for each of the 12 environments and then the maximum across all environments.
print(env.get_attr("best_found"))
print(max(env.get_attr("best_found")))
Regarding why it does not converge it could be due a bad reward selected, although looking at your reward choice it seems sensible. I added a debug print in your code to see if some step lead to done = True, but it seems that the environment never reaches that state. I think that for the model to learn it would be good to have multiple sequence of actions leading to a state with done = True, which would mean that the model gets to experience the end of an episode. I did not study the problem in your code in detail but maybe this information can help debug your issue.
If we compare the problem with other environments like CartPole, there we have episodes that end with done = True and that helps the model learn a better policy (in your case you could limit the amount of actions per episode instead of running the same episode forever). That could help the model avoid getting stuck in a local optimum as you give it the opportunity to "try again" in a new episode.

If I'm reading this right, then you aren't performing any hyperparameter tuning. How do you know your learning rate / policy update rate / any other variable parameters are a good fit for the problem at hand?
It looks like stable baselines has this functionality built-in:
https://stable-baselines.readthedocs.io/en/master/guide/rl_zoo.html?highlight=tune#hyperparameter-optimization

Related

Minimal cost flow network when using multiple demand/supply sets (networkx)

This is my first stackoverflow submission, so please let me know whether I am asking this question according to community guidelines.
Question in short: Is there a way to have networkx find the min cost flow network given multiple demand/supply sets?
I currently working on a network optimization problem in which I am using the nx.min_cost_flow() function to calculate the minimum cost network layout given a supply and demand set for the nodes, and maximum capacity for certain edges. The code works, but now I am trying to find a way to find the optimal network layout when given multiple supply and demand sets over time. I have tried a method that works for optimal trees (iterating through all the demand/supply sets and using the maximum capacity for each edge in each time step), but sadly this does not provide the optimal solution when using nx.min_cost_flow() as I want the model to be able to create loops as well.
I have added the relevant code below, but it might be a bit confusing as there are many elements referring to other parts of the model unrelated to the problem I am currently facing. The complete model is too large to share.
I have tried looking for a solution online without success. Hopefully you know a solution. Thanks in advance!
def full_existing(T,G,folder,rpc,routing):
fullG = G.copy()
for (i,j) in fullG.edges():
fullG[i][j]['capacity'] = 0
for (i,j) in T.edges():
if 'current' in T[i][j]:
fullG[i][j]['weight'] = T[i][j]['weight']*rpc
f_demand = open(folder+'/demand.txt','r')
for X in f_demand:
print('Demand set is called')
Y = X.split(',')
demand = {i: float(Y[i+1]) for i in range(len(Y)-1)}
G1 = fullG.copy()
for i in fullG.nodes():
if i in demand.keys():
G1.nodes[i]['demand'] = demand[i]
else:
G1.nodes[i]['demand'] = 0
for (i,j) in G1.edges():
G1[i][j]['capacity'] = inf
for (i,j) in T.edges():
if 'current' in T[i][j]:
G1[i][j]['capacity'] = T[i][j]['current']
G1[i][j]['weight'] = T[i][j]['weight']*rpc
else:
G1[i][j]['capacity'] = inf
G1 = simplex(G1)
for i,j in G1.edges():
fullG[i][j]['capacity'] = max(fullG[i][j]['capacity'],G1[i][j]['capacity'])
for (i,j) in T.edges():
if 'current' in T[i][j]:
fullG[i][j]['weight'] = T[i][j]['weight']
nx.set_edge_attributes(fullG,nx.get_edge_attributes(T,'current'),'current')
return fullG
The simplex procedure is called because the input graph is undirected.
def simplex(G):
GX = G.copy()
G1 = GX.to_directed()
ns = nx.min_cost_flow(G1)
for i,j in G1.edges():
if ns[i][j] > 0 or ns[j][i] > 0:
GX[i][j]['capacity'] = max(ns[i][j],ns[j][i])
else:
GX[i][j]['capacity'] = 0
return GX

make big process on graph with python parallelised

i'm working on graphs and big dataset of complex network's. i run SIR algorithm on them with ndlib library.
but each iteration takes something like 1Sec and it make code takes 10-12 h to complete .
i was wondering is there any way to make it parallelised ?
the code is like down bellow
this line of the code is core :
sir = model.infected_SIR_MODEL(it, infectionList, False)
is there any simple method to make it run on multi thread or parallelised ?
count = 500
for i in numpy.arange(1, count, 1):
for it in model.get_nodes():
sir = model.infected_SIR_MODEL(it, infectionList, False)
each iteration :
for u in self.graph.nodes():
u_status = self.status[u]
eventp = np.random.random_sample()
neighbors = self.graph.neighbors(u)
if isinstance(self.graph, nx.DiGraph):
neighbors = self.graph.predecessors(u)
if u_status == 0:
infected_neighbors = len([v for v in neighbors if self.status[v] == 1])
if eventp < self.BetaList[u] * infected_neighbors:
actual_status[u] = 1
elif u_status == 1:
if eventp < self.params['model']['gamma']:
actual_status[u] = 2

So, if the iterations are independent, then I don't see the point of iteration over count=500. Either way the multiprocessing library might be of interest to you.
I've prepared 2 stub solutions (i.e. alter to your exact needs).
The first expects that every input is static (the changes in solutions as far as I understand the OP's question raise from the random state generation inside each iteration). With the second, you can update the input data between iterations of i. I've not tried the code as I don't have the model so it might not work directly.
import multiprocessing as mp
# if everything is independent (eg. "infectionList" is static and does not change during the iterations)
def worker(model, infectionList):
sirs = []
for it in model.get_nodes():
sir = model.infected_SIR_MODEL(it, infectionList, False)
sirs.append(sir)
return sirs
count = 500
infectionList = []
model = "YOUR MODEL INSTANCE"
data = [(model, infectionList) for _ in range(1, count+1)]
with mp.Pool() as pool:
results = pool.starmap(worker, data)
The second proposed solution if "infectionList" or something else gets updated in each iteration of "i":
def worker2(model, it, infectionList):
sir = model.infected_SIR_MODEL(it, infectionList, False)
return sir
with mp.Pool() as pool:
for i in range(1, count+1):
data = [(model, it, infectionList) for it in model.get_nodes()]
results = pool.starmap(worker2, data)
# process results, update something go to next iteration....
Edit: Updated the answer to separate proposals more clearly.

SORT Tracker - Parameter Setting

I'm having some issues related to the SORT tracker (a combination of Kalman filter and Hungarian algorithm), combined with YOLO v3, in soccer videos.
As also mentioned in the main paper, SORT suffers a lot about identity switches (in other words, the ID changes even if the tracked object is the same), also in absence of occlusions! I was wondering if i can (slightly) mitigate this problem by calibrating the parameters max_age (the time that can pass without the id assignment) and max_hits. How can these parameter affect the final tracking? And the IoU parameter of Hungarian? Thanks a lot!
class Sort(object):
def __init__(self,max_age=8,min_hits=3):
"""
Sets key parameters for SORT
"""
self.max_age = max_age
self.min_hits = min_hits
self.trackers = []
self.frame_count = 0
def update(self,dets):
"""
Params:
dets - a numpy array of detections in the format [[x,y,w,h,score],[x,y,w,h,score],...]
Requires: this method must be called once for each frame even with empty detections.
Returns the a similar array, where the last column is the object ID.
NOTE: The number of objects returned may differ from the number of detections provided.
"""
# prevent "too many indices for array" error
if len(dets)==0:
return np.empty((0,5))
self.frame_count += 1
#get predicted locations from existing trackers.
trks = np.zeros((len(self.trackers),5))
to_del = []
ret = []
for t,trk in enumerate(trks):
pos = self.trackers[t].predict()[0]
trk[:] = [pos[0], pos[1], pos[2], pos[3], 0]
if(np.any(np.isnan(pos))):
to_del.append(t)
trks = np.ma.compress_rows(np.ma.masked_invalid(trks))
for t in reversed(to_del):
self.trackers.pop(t)
matched, unmatched_dets, unmatched_trks = associate_detections_to_trackers(dets,trks)
#update matched trackers with assigned detections
for t,trk in enumerate(self.trackers):
if(t not in unmatched_trks):
d = matched[np.where(matched[:,1]==t)[0],0]
trk.update(dets[d,:][0])
#create and initialise new trackers for unmatched detections
for i in unmatched_dets:
trk = KalmanBoxTracker(dets[i,:])
self.trackers.append(trk)
i = len(self.trackers)
for trk in reversed(self.trackers):
d = trk.get_state()[0]
if((trk.time_since_update < 1) and (trk.hit_streak >= self.min_hits or self.frame_count <= self.min_hits)):
ret.append(np.concatenate((d,[trk.id+1])).reshape(1,-1)) # +1 as MOT benchmark requires positive
i -= 1
#remove dead tracklet
if(trk.time_since_update > self.max_age):
self.trackers.pop(i)
if(len(ret)>0):

In case you raise the max_age you are risking confusion between objects which were lost/got out of scene and new objects which get into the last seen area. You should play with this parameter (maybe raise it a bit) and lower the IOU for the Kalman. This would create longer and more robust tracking with an increased risk of different IDs merging into one track. This tuning is essential for your tracker's performance and is highly data dependent. Good luck :)

Implementing the TD-Gammon algorithm

I am attempting to implement the algorithm from the TD-Gammon article by Gerald Tesauro. The core of the learning algorithm is described in the following paragraph:
I have decided to have a single hidden layer (if that was enough to play world-class backgammon in the early 1990's, then it's enough for me). I am pretty certain that everything except the train() function is correct (they are easier to test), but I have no idea whether I have implemented this final algorithm correctly.
import numpy as np
class TD_network:
"""
Neural network with a single hidden layer and a Temporal Displacement training algorithm
taken from G. Tesauro's 1995 TD-Gammon article.
"""
def __init__(self, num_input, num_hidden, num_output, hnorm, dhnorm, onorm, donorm):
self.w21 = 2*np.random.rand(num_hidden, num_input) - 1
self.w32 = 2*np.random.rand(num_output, num_hidden) - 1
self.b2 = 2*np.random.rand(num_hidden) - 1
self.b3 = 2*np.random.rand(num_output) - 1
self.hnorm = hnorm
self.dhnorm = dhnorm
self.onorm = onorm
self.donorm = donorm
def value(self, input):
"""Evaluates the NN output"""
assert(input.shape == self.w21[1,:].shape)
h = self.w21.dot(input) + self.b2
hn = self.hnorm(h)
o = self.w32.dot(hn) + self.b3
return(self.onorm(o))
def gradient(self, input):
"""
Calculates the gradient of the NN at the given input. Outputs a list of dictionaries
where each dict corresponds to the gradient of an output node, and each element in
a given dict gives the gradient for a subset of the weights.
"""
assert(input.shape == self.w21[1,:].shape)
J = []
h = self.w21.dot(input) + self.b2
hn = self.hnorm(h)
o = self.w32.dot(hn) + self.b3
for i in range(len(self.b3)):
db3 = np.zeros(self.b3.shape)
db3[i] = self.donorm(o[i])
dw32 = np.zeros(self.w32.shape)
dw32[i, :] = self.donorm(o[i])*hn
db2 = np.multiply(self.dhnorm(h), self.w32[i,:])*self.donorm(o[i])
dw21 = np.transpose(np.outer(input, db2))
J.append(dict(db3 = db3, dw32 = dw32, db2 = db2, dw21 = dw21))
return(J)
def train(self, input_states, end_result, a = 0.1, l = 0.7):
"""
Trains the network using a single series of input states representing a game from beginning
to end, and a final (supervised / desired) output for the end state
"""
outputs = [self(input_state) for input_state in input_states]
outputs.append(end_result)
for t in range(len(input_states)):
delta = dict(
db3 = np.zeros(self.b3.shape),
dw32 = np.zeros(self.w32.shape),
db2 = np.zeros(self.b2.shape),
dw21 = np.zeros(self.w21.shape))
grad = self.gradient(input_states[t])
for i in range(len(self.b3)):
for key in delta.keys():
td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
delta[key] += a*(outputs[t + 1][i] - outputs[t][i])*td_sum
self.w21 += delta["dw21"]
self.w32 += delta["dw32"]
self.b2 += delta["db2"]
self.b3 += delta["db3"]
The way I use this is I play through a whole game (or rather, the neural net plays against itself), and then I send the states of that game, from start to finish, into train(), along with the final result. It then takes this game log, and applies the above formula to alter weights using the first game state, then the first and second game states, and so on until the final time, when it uses the entire list of game states. Then I repeat that many times and hope that the network learns.
To be clear, I am not after feedback on my code writing. This was never meant to be more than a quick and dirty implementation to see that I have all the nuts and bolts in the right spots.
However, I have no idea whether it is correct, as I have thus far been unable to make it capable of playing tic-tac-toe at any reasonable level. There could be many reasons for that. Maybe I'm not giving it enough hidden nodes (I have used 10 to 12). Maybe it needs more games to train (I have used 200 000). Maybe it would do better with different normalisation functions (I've tried sigmoid and ReLU, leaky and non-leaky, in different variations). Maybe the learning parameters are not tuned right. Maybe tic-tac-toe and its deterministic gameplay means it "locks in" on certain paths in the game tree. Or maybe the training implementation is just wrong. Which is why I'm here.
Have I misunderstood Tesauro's algorithm?

I can't say that I entirely understand your implementation, but this line jumps out to me:
td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
Comparing with the formula you reference:
I see at least two differences:
Your implementation sums over t+1 elements compared to t elements in the formula
The gradient should be indexed with the same k as used in l**(t-k), but in your implementation it is indexed with i and key, without any reference to k
Perhaps if you fix these discrepancies your solution will behave more as expected.

Create a model that switches between two different states using Temporal Logic?

Im trying to design a model that can manage different requests for different water sources.
Platform : MAC OSX, using latest Python with TuLip module installed.
For example,
Definitions :
Two water sources : w1 and w2
3 different requests : r1,r2,and r3
-
Specifications :
Water 1 (w1) is preferred, but w2 will be used if w1 unavailable.
Water 2 is only used if w1 is depleted.
r1 has the maximum priority.
If all entities request simultaneously, r1's supply must not fall below 50%.
-
The water sources are not discrete but rather continuous, this will increase the difficulty of creating the model. I can do a crude discretization for the water levels but I prefer finding a model for the continuous state first.
So how do I start doing that ?
Some of my thoughts :
Create a matrix W where w1,w2 ∈ W
Create a matrix R where r1,r2,r3 ∈ R
or leave all variables singular without putting them in a matrix
I'm not an expert in coding so that's why I need help. Not sure what is the best way to start tackling this problem.
I am only interested in the model, or a code sample of how can this be put together.
edit
Now imagine I do a crude discretization of the water sources to have w1=[0...4] and w2=[0...4] for 0, 25, 50, 75,100 percent respectively.
==> means implies
Usage of water sources :
if w1[0]==>w2[4] -- meaning if water source 1 has 0%, then use 100% of water source 2 etc
if w1[1]==>w2[3]
if w1[2]==>w2[2]
if w1[3]==>w2[1]
if w1[4]==>w2[0]
r1=r2=r3=[0,1] -- 0 means request OFF and 1 means request ON
Now what model can be designed that will give each request 100% water depending on the values of w1 and w2 (w1 and w2 values are uncontrollable so cannot define specific value, but 0...4 is used for simplicity )

This is called the flow problem: http://en.wikipedia.org/wiki/Maximum_flow_problem
Wiki has some code for the solution: http://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm
I'm not sure temporal logic is of much help here. For example load balancing is a major research topic, and I believe most of it doesn't use this formalism.
I have coded something, which only represents a simple priority list, which is kind of trivial. I would use classes and functions to represent states, not matrices. The dependencies in terms of priority are simple enough. Otherwise you can add those to the class watersource aswell. (class WaterSourcePriorityQueue or something like that). To get a simulation it is good to use threads, which I haven't here. You can use stepwise iteration (rounds), which is more in line with a procedural program.
import time
from random import random
from math import floor
import operator
class Watersource:
def __init__(self,initlevel,prio,name):
self.level = initlevel
self.priority = prio
self.name = name
def requestWater(self,amount):
if amount < self.level:
self.level -= amount
return True
else:
return False
#watersources
w1 = Watersource(40,1,"A")
w2 = Watersource(30,2,"B")
w3 = Watersource(20,3,"C")
probA = 0.8 # probability A will be requested
probB = 0.7
probC = 0.9
probs = {w1:probA,w2:probB,w3:probC}
amounts = {w1:10,w2:10,w3:20} # amounts requested
ws = [w1,w2,w3]
numrounds = 100
for round in range(1,numrounds):
print 'round ',round
done = False
i = 0
priorRequest = False
prioramount = 0
while not done or priorRequest:
if i==len(ws):
done=True
break
w = ws[i]
probtresh = probs[w]
prob = random()
if prob > probtresh: # request water
if prioramount != 0:
amount = prioramount
else:
amount = floor(random()*amounts[w])
prioramount = amount
print 'requesting ',amount
success = w.requestWater(amount)
if not success:
print 'not enough'
priorRequest=True
else:
print 'got water'
done = True
priorRequest=False
i+=1
time.sleep(1)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

RL + optimization: how to do it better? - python

Related

Minimal cost flow network when using multiple demand/supply sets (networkx)

make big process on graph with python parallelised

SORT Tracker - Parameter Setting

Implementing the TD-Gammon algorithm

Create a model that switches between two different states using Temporal Logic?

Categories

Resources