Q Learning Applied To a Two Player Game - python

I am trying to implement a Q Learning agent to learn an optimal policy for playing against a random agent in a game of Tic Tac Toe.
I have created a plan that I believe will work. There is just one part that I cannot get my head around, and it comes from the fact that there are two players within the environment.
Now, a Q Learning agent should act upon the current state, s, the action taken given some policy, a, the successive state given the action, s', and any reward received from that successive state, r.
Let's put this into a tuple (s, a, r, s').
Now, usually an agent acts in every state it encounters, and uses the Q Learning equation to update the value of the previous state.
However, as Tic Tac Toe has two players, we can partition the set of states into two. One set of states can be those where it is the learning agent's turn to act. The other set of states can be those where it is the opponent's turn to act.
So, do we need to partition the states into two? Or does the learning agent need to update every single state that is accessed within the game?
I feel as though it should probably be the latter, as this might affect updating Q Values for when the opponent wins the game.
Any help with this would be great, as there does not seem to be anything online that helps with my predicament.

In general, directly applying Q-learning to a two-player game (or other kind of multi-agent environment) isn't likely to lead to very good results if you assume that the opponent can also learn. However, you specifically mentioned
for playing against a random agent
and that means it actually can work, because the opponent isn't learning / changing its behaviour, so you can reliably treat the opponent as "a part of the environment".
Doing exactly that is also likely the best approach you can take. Treating the opponent (and its actions) as a part of the environment means that you should basically just completely ignore all of the states in which the opponent is to move. Whenever your agent takes an action, you should also immediately generate an action for the opponent, and only then take the resulting state as the next state.
So, in the tuple (s, a, r, s'), we have:
s = state in which your agent is to move
a = action executed by your agent
r = one-step reward
s' = next state in which your agent is to move again
The state in which the opponent is to move, and the action they took, do not appear at all. They should simply be treated as unobservable, nondeterministic parts of the environment. From the point of view of your algorithm, there are no other states in between s and s', in which there is an opponent that can take actions. From the point of view of your algorithm, the environment is simply nondeterministic, which means that taking action a in state s will sometimes randomly lead to s', but maybe also sometimes randomly to a different state s''.
Note that this will only work precisely because you wrote that the opponent is a random agent (or, more importantly, a non-learning agent with a fixed policy). As soon as the opponent also gains the ability to learn, this will break down completely, and you'd have to move on to proper multi-agent versions of Reinforcement Learning algorithms.
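For concreteness, here is a minimal sketch of that training loop under the following assumptions: env is a hypothetical tic-tac-toe environment whose step() applies the agent's move and then immediately the random opponent's reply before returning the next state, and legal_actions() is a made-up helper, not a specific library API.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
Q = defaultdict(float)  # Q[(state, action)] -> estimated value

def choose_action(state, legal_actions):
    # epsilon-greedy over the agent's own moves only
    if random.random() < EPSILON:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q[(state, a)])

def play_episode(env):
    state = env.reset()                      # a state where the agent moves
    done = False
    while not done:
        action = choose_action(state, env.legal_actions(state))
        # step() applies the agent's move AND the random opponent's reply,
        # so next_state is again a state in which the agent is to move
        next_state, reward, done = env.step(action)
        if done:
            target = reward
        else:
            target = reward + GAMMA * max(
                Q[(next_state, a)] for a in env.legal_actions(next_state))
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state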

Q-Learning is an algorithm from the MDP (Markov Decision Process) field, i.e. in practice the agent learns by acting upon a world, and each action changes the agent's state (with some probability).
The algorithm is built on the premise that for any action, the world gives back feedback (a reaction).
Q-Learning works best when, for any action, there is a somewhat immediate and measurable reaction.
In addition, this method looks at the world from a single agent's perspective.
My suggestion is to implement the second player as part of the world, like a bot which plays with various strategies, e.g. random, best action, fixed layout, or even one whose logic is itself implemented as Q-learning.
For looking n steps ahead and evaluating all the resulting states (so you can later pick the best one), you can use Monte Carlo tree search if the state space is too large (as was done with Go).
Note that Tic-Tac-Toe is already solved: a player can achieve a win or a draw by following the optimal strategy, two optimal players will always draw, and the full game tree is fairly easy to build.
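To illustrate that last point, here is a minimal, self-contained sketch that builds and evaluates the full game tree by recursion; a board is just a tuple of nine cells, and none of this is tied to any particular library.

from functools import lru_cache

# +1 means X can force a win, -1 means O can, 0 means a forced draw
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    w = winner(board)
    if w == 'X':
        return 1
    if w == 'O':
        return -1
    moves = [i for i, cell in enumerate(board) if cell is None]
    if not moves:
        return 0  # board full: draw
    children = [value(board[:i] + (player,) + board[i + 1:],
                      'O' if player == 'X' else 'X') for i in moves]
    return max(children) if player == 'X' else min(children)

print(value((None,) * 9, 'X'))  # prints 0: two optimal players draw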

Related

Deep Q Learning Approach for the card game Schnapsen

So I have a DQN Agent that plays the card game Schnapsen. I won't bore you with the details of the game as they are not very relevant to the question I am about to ask. The only important point is that for every round of the game there is a specific set of valid moves a player can take. The DQN Agent I have created sometimes outputs invalid moves, in the form of an integer. There are 28 possible moves in the entire game, so sometimes it will output a move that cannot be played based on the current state of the game, for example playing the Jack of Diamonds when it is not in its hand. I was wondering if there was any way for me to "map" the outputs of the neural network onto the most similar valid move in the case that it does not output a valid one? Would that be the best approach to this problem, or do I have to tune the neural network better?
As of right now, whenever the DQN Agent does not output a valid move, it falls back on another algorithm, a Bully Bot implementation that plays one of the possible valid moves. Here is the link to my github repo with the code. To run the code where the DQN Agent plays against a bully bot, just navigate into the executables file and run: python cli.py bully-bot
One approach to mapping the outputs of your neural network to the most similar valid move would be to use softmax to convert the raw outputs of the network into a probability distribution over the possible moves, and then select the valid move with the highest probability. Another approach could be to use argmax, which returns the index of the maximum value in the output; you would then check whether the returned index corresponds to a valid move, and if not, fall back to the highest-scoring index that does correspond to a valid move.
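For illustration, a minimal sketch of that masked-argmax idea; q_values and valid_moves are assumed names for the network output and the legal-move indices, not identifiers from the linked repo.

import numpy as np

def select_valid_move(q_values, valid_moves):
    # q_values: 1-D array of raw network outputs (e.g. length 28 here),
    # valid_moves: indices of the moves that are legal in the current state
    masked = np.full_like(q_values, -np.inf, dtype=float)
    masked[valid_moves] = q_values[valid_moves]
    return int(np.argmax(masked))

# The plain argmax would pick index 3, but it is not legal here, so the
# masked argmax falls back to the best legal move instead.
q = np.array([0.1, 0.8, 0.05, 0.9])
print(select_valid_move(q, valid_moves=[0, 1, 2]))  # -> 1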

How to list possible successor states for each state in OpenAI gym? (strictly for normal MDPs)

Is there a way to iterate through each state, force the environment to go to that state, take a step, and then use the "info" dictionary returned to see what all the possible successor states are?
Or an even easier way to recover all possible successor states for each state, perhaps somewhere hidden?
I saw online that something called MuJoCo has a set_state function, but I don't want to create a new environment; I just want to set the state of the ones already provided by OpenAI gym.
Context: trying to implement topological order value iteration, which requires making a graph where each state has an edge to any state that any action could ever transition it to.
I realize that obviously in some games that's just not provided, but for the ones where it is, is there a way?
(Other than the brute force method of running the game and taking every step I haven't yet taken at whatever state I land at until I've reached all states and seen everything, which depending on the game could take forever)
This is my first time using OpenAI gym, so please explain in as much detail as you can. For example, I have no idea what Wrappers are.
Thanks!
No, OpenAI gym does not have a method for supplying all possible successor states. Generally, that's sort of the point of creating an algorithm with OpenAI gym: you are training an agent to learn what the outcomes of its actions are. If it could look into the future and know the results of its actions in advance, that would rather defeat the purpose.
The brute force method you described is probably the easiest way to accomplish what you're describing.
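If you do go the brute-force route, here is a rough sketch of one way it could look. It assumes a small, deterministic environment with a discrete action space and hashable observations (e.g. the toy-text ones) and the old gym API where step() returns (obs, reward, done, info); it is not an official gym feature.

import gym

def enumerate_successors(env_name, max_depth=20):
    env = gym.make(env_name)
    n_actions = env.action_space.n

    def replay(prefix):
        # deterministically re-reach the state given by an action sequence
        state, done = env.reset(), False
        for a in prefix:
            state, _, done, _ = env.step(a)
        return state, done

    successors = {}        # state -> {action: next_state}
    frontier = [()]        # action sequences still to expand
    while frontier:
        prefix = frontier.pop()
        state, done = replay(prefix)
        if done or state in successors or len(prefix) >= max_depth:
            continue
        successors[state] = {}
        for a in range(n_actions):
            replay(prefix)                     # rewind back to `state`
            next_state, _, _, _ = env.step(a)
            successors[state][a] = next_state
            frontier.append(prefix + (a,))
    return successors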

OpenAI gym cartpole-v0 understanding observation and action relationship

I'm interested in modelling a system that uses OpenAI gym to build a model that not only performs well but, better yet, continuously improves to converge on the best moves.
This is how I initialize the env
import gym
env = gym.make("CartPole-v0")
env.reset()
Stepping the environment returns a set of values: observation, reward, done and info; info is always empty, so ignore that.
I'd hoped reward would signify whether the action taken was good or bad, but it always returns 1 until the game ends, so it's more of a counter of how long you've been playing.
The action can be sampled by
action = env.action_space.sample()
which in this case is either 1 or 0.
To put it into perspective for anyone who doesn't know what this game is, here's the link; the objective is to balance a pole by moving left or right, i.e. providing an input of 0 or 1.
The observation is the only key way to tell whether you're making a good or bad move.
obs, reward, done, info = env.step(action)
and the observation looks something like this
array([-0.02861881, 0.02662095, -0.01234258, 0.03900408])
As I said before, reward is always 1, so it's not a good indicator of whether a move was good or bad, and done just means the game has come to an end, though I also can't tell whether that means you lost or won.
As you'll see from the linked page, the objective is to balance the pole for a total reward of +195 averaged over 100 games; that's the determining guide of a successful game, though I'm not sure whether it means you've balanced the pole completely or just lasted long enough. Still, I've followed a few examples and suggestions: generate a lot of random games and use the ones that rank well to train a model.
But this way feels sketchy and isn't inherently aware of what a failing move is, i.e. when you're about to tip the pole more than 15 degrees or move the cart more than 2.4 units from the center.
I've gathered data from running the simulation over 200000 times, and found a good number of games that lasted for more than 80 steps (the goal is 195). Using this, I graphed these games in an ipython notebook; since I'm graphing each observation individually per game, there are too many graphs to put here. The hope was to see a link between a final observation and the game ending, but since these are randomly sampled actions, the moves are random.
What I thought I saw was that, for the first observation value, the game ends when it reaches 0, but I've also seen other games that keep running with negative values. Basically, I can't make sense of the data even with graphing.
What I'd really like to know, if possible, is what each value in the observation means, and also whether 0 means left or right, though the latter would be easier to deduce once I understand the former.
It seems you asked this question quite some time ago. However, the answer is that the observation is given by the position of the cart, the angle of the pole and their derivatives. The position in the middle is 0. So the negative is left and positive is right.
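For illustration, a minimal sketch against the old gym API; the observation order used here (cart position, cart velocity, pole angle, pole angular velocity) and the action meanings (0 pushes the cart left, 1 pushes it right) come from the CartPole source.

import gym

env = gym.make("CartPole-v0")
obs = env.reset()
done = False
while not done:
    cart_pos, cart_vel, pole_angle, pole_vel = obs
    # crude hand-made policy: push the cart in the direction the pole leans
    action = 0 if pole_angle < 0 else 1
    obs, reward, done, info = env.step(action)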

Python generator instead of class object as reinforcement learning environment

Typical way of using reinforcement learning environments looks like this:
env = Environment()
while not env.done:
    state = env.state
    action = choose_action(state)
    env.step(action)
results = env.results
But wouldn't it be more pythonic this way:
env = Environment()
for state in env:
    action = choose_action(state)
    env.step(action)
else:
    results = env.results
What difference does it make? I can see two reasons.
Less code: in the latter example we don't need to worry about env.done or keep track of what state we are in; the generator will pick up where we left off automatically.
Easy copying: we can easily duplicate the generator at any state to evaluate different strategies.
We are looping over an object we mutate inside the loop, but since the introduction of the generator .send() method, isn't this sort of thing acceptable?
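For concreteness, here is roughly the kind of Environment I have in mind: a minimal stand-in whose __iter__ yields the current state until the episode is done (the dynamics and the terminal condition are arbitrary placeholders).

import random

class Environment:
    def __init__(self):
        self.state = 0
        self.done = False
        self.results = None

    def __iter__(self):
        # keep yielding the current state until the episode ends
        while not self.done:
            yield self.state

    def step(self, action):
        self.state += action
        if self.state >= 10:          # arbitrary terminal condition
            self.done = True
            self.results = {"return": self.state}

def choose_action(state):
    return random.choice([1, 2])

env = Environment()
for state in env:
    action = choose_action(state)
    env.step(action)
else:
    results = env.results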
The Reinforcement Learning paradigm consists of repeatedly taking an action in an environment, which in turn gives back a new state (and also a reward, some information, and a boolean that tells you whether the episode is done).
Looping over all the possible states of the environment is inconsistent with the spirit of RL. Here is an example to illustrate what I mean:
Suppose a robot has to grab a cup of coffee. At each millisecond, the robot receives an image from its internal camera. The state space is then the set of all possible images that its camera can return, right? Doing this
for state in env:
    action = choose_action(state)
    env.step(action)
would mean that, on seeing in succession every possible image the world could give it, the robot would take a corresponding action, which is obviously not what you want it to do. You want it to act according to what it has just seen in the previous state, and then take a new, consistent action.
Hence the dynamics of this code definitely make more sense:
while not env.done:
    state = env.state
    action = choose_action(state)
    env.step(action)
results = env.results
Indeed, it means that as long as the robot has not grabbed the cup, it should look at the environment and take an action. Then it looks at the new state and takes a new action according to its new observation, and so on.

Artificial life with neural networks [closed]

I am trying to build a simple evolution simulation of agents controlled by neural networks. In the current version each agent has a feed-forward neural net with one hidden layer. The environment contains a fixed amount of food, each piece represented as a red dot. When an agent moves, it loses energy, and when it is near food, it gains energy. An agent with 0 energy dies. The input of the neural net is the current angle of the agent and a vector to the closest food. Every time step, the angle of movement of each agent is changed by the output of its neural net. The aim, of course, is to see food-seeking behaviour evolve after some time. However, nothing happens.
I don't know if the problem is the structure of the neural net (too simple?) or the reproduction mechanism. To prevent a population explosion, the initial population is about 20 agents, and as the population approaches 50, the reproduction chance approaches zero. When reproduction does occur, the parent is chosen by going over the list of agents from beginning to end and checking, for each agent, whether a random number between 0 and 1 is less than the ratio between that agent's energy and the sum of the energy of all agents. If so, the search is over and this agent becomes a parent: we add to the environment a copy of this agent, with some probability of mutations in one or more of the weights of its neural network.
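For concreteness, a minimal sketch of that selection step; the agents list and the energy attribute are just stand-ins for my actual code.

import random

def pick_parent(agents):
    # walk the list front-to-back; each agent is accepted with probability
    # equal to its share of the total energy
    total_energy = sum(a.energy for a in agents)
    if total_energy == 0:
        return None
    for agent in agents:
        if random.random() < agent.energy / total_energy:
            return agent
    return None   # no parent chosen this time step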
Thanks in advance!
If the environment is benign enough (e.g. it's easy enough to find food) then just moving randomly may be a perfectly viable strategy, and reproductive success may be far more influenced by luck than anything else. Also consider unintended consequences: e.g. if offspring are co-sited with their parent, then both are immediately in competition with each other in the local area, and this might be sufficiently disadvantageous to lead to the death of both in the longer term.
To test your system, introduce an individual with a "premade" neural network set up to steer the individual directly towards the nearest food (your model is such that such a thing exists and is reasonably easy to write down, right? If not, it's unreasonable to expect it to evolve!). Introduce that individual into your simulation amongst the dumb masses. If the individual doesn't quickly dominate, it suggests your simulation isn't set up to reinforce such behaviour. But if the individual enjoys reproductive success and it and its descendants take over, then your simulation is doing something right and you need to look elsewhere for the reason such behaviour isn't evolving.
Update in response to comment:
Seems to me this mixing of angles and vectors is dubious. Whether individuals can evolve towards the "move straight towards nearest food" behaviour must rather depend on how well an atan function can be approximated by your network (I'm sceptical). Again, this suggests more testing:
set aside all the ecological simulation and just test perturbing a population of your style of random networks to see if they can evolve towards the expected function.
(simpler, better) Have the network output a vector (instead of an angle): the direction the individual should move in (of course this means having 2 output nodes instead of one). Obviously the "move straight towards food" strategy is then just a straight pass-through of the "direction towards food" vector components, and the interesting thing is then to see whether your random networks evolve towards this simple "identity function" (also should allow introduction of a readymade optimised individual as described above).
I'm dubious about the "fixed amount of food" too (I assume you mean that as soon as a red dot is consumed, another one is introduced). A more "realistic" model might be to introduce food at a constant rate and not impose any artificial population limits: population limits are then determined by the limitations of the food supply. E.g. if you introduce 100 units of food a minute and individuals need 1 unit of food per minute to survive, then your simulation should tend towards a long-term average population of 100 individuals without any need for a clamp to avoid a "population explosion" (although boom-and-bust, feast-or-famine dynamics may still emerge depending on the details).
This sounds like a problem for Reinforcement Learning; there is a good online textbook too.
