So I have a DQN Agent that plays the card game Schnapsen. I wont bore you with the details of the game as they are not so related to the question I am about to ask. The only important point is that for every round of the game, there are specific valid moves a player can take. The DQN Agent I have created sometime outputs non-valid moves, in the form of an integer. There are 28 possible moves in the entire game, so sometimes it will output a move that cannot be played based on the current state of the game, for example playing the Jack of Diamonds when it is not in its hand. I was wondering if there was any way for me to "map" the outputs of the neural network into the most similar move in the case that it does not converge? Would that be the best approach to this problem or do I have to tune the neural network better?
As of right now, whenever the DQN Agent does not output a valid move, it falls on to another algorithm, a Bully Bot implementation that plays one of the possible valid moves. Here is the link to my github repo with the code. To run the code where the DQN Agent plays against a bully bot, just navigate into the executables file and run : python cli.py bully-bot
One approach to mapping the outputs of your neural network to the most similar valid move would be to use "softmax" to convert the raw outputs of the network into a probability distribution over the possible moves. Then, you could select the move with the highest probability that is also a valid move. Another approach could be to use "argmax" which returns the index of the maximum value in the output. Then you will have to check whether the returned index corresponds to a valid move or not. If not, you can select the next possible index which corresponds to a valid move.
Related
I would like to generate some data (position of the snake, available moves, distance from the food...) to create a neural network model so that it can be trained on the data to play the snake game. However, I don't know how to do that. My current ideas are:
Play manually (by myself) the game for many iterations and store the data (drawback: I should play the game a lot of times).
Make the snake do some random movements track and track their outcomes.
Play the snake with depth-fist search or similar algorithms many times and store the data.
Can you suggest to me some other method or should I choose from one of those? Which one in that case?
P.S. I don't know if it is the right place to ask such a question. However, I don't know whom/where to ask such a question hence, I am here.
If using a neural network, start simple. Think inputs and outputs and keep them simple too.
Inputs:
How many squares to the left of the head are free
How many squares to the right of the head are free
How many squares forward of the head are free
Relative position of next food left/right
Relative position of next food forward/back
Length of snake
keep inputs normalized to the minimum and maximum possible values to keep inputs in range -1.0 to 1.0
Outputs:
Turn Left
Turn Right
Straight Ahead
(choose the output with highest activation)
Next problem is training. Typical application might be use a genetic algorithm for all the weights of the above neural network. This randomizes and tests new versions of the neural network every life. After X attempts, create a new evolution and repeat to improve. This is pretty much doing random moves automatically (your second choice)
Next problem is fitness for training. How do you know which neural network is better? Well you could simply use the length of the snake as a fitness factor - the longer the snake, the better and more 'fit'
I'm currently trying to implement a neural network in Python to play the game of Snake that is trained using a genetic algorithm (although that's a separate matter right now).
Every network that plays the game does the same movement over and over (e.g. continues in a straight line, keeps turning left). There are 5 inputs to the network: the distance to an object (food, a boundary, its own tail) in all four directions, and the angle between the food and the direction the snake is facing. The three outputs represent turning left, continuing straight, and turning right.
I've never worked with anything like this before so I have a fairly basic understanding at this point. The number of hidden layers and the number of nodes per layer is variable, and is something I have been altering a lot to test, but the snakes continue to each repeat the exact same motion.
Any advice on why this is happening would be greatly appreciated, and how to fix it. I can show my code if it's useful.
You may have weights initialized to zero and if they aren't trained properly and stay zero for some reason, neural network will be producing bias as an output always.
I'm interested in modelling a system that can use openai gym to make a model that not only performs well but hopefully even better yet continuously improves to converge on the best moves.
This is how I initialize the env
import gym
env = gym.make("CartPole-v0")
env.reset()
it returns a set of info; observation, reward, done and info, info always nothing so ignore that.
reward I'd hope would signify whether the action taken is good or bad but it always returns a reward of 1 until the game ends, it's more of a counter of how long you've been playing.
The action can be sampled by
action = env.action_space.sample()
which in this case is either 1 or 0.
To put into perspective for anyone who doesn't know what this game is, here's the link and it's objective is to balance a pole by moving left or right i.e. provide an input of 0 or 1.
The observation is the only key way to tell whether you're making a good or bad move.
obs, reward, done, info = env.step(action)
and the observation looks something like this
array([-0.02861881, 0.02662095, -0.01234258, 0.03900408])
as I said before reward is always 1 so not a good pointer of good or bad move based on the observation and done means the game has come to an end though I also can't tell if it means you lost or won also.
Since the objective as you'll see from the link to the page is to balance the pole for a total reward of +195 averaged over 100 games that's the determining guide of a successful game, not sure then if you've successfully then balanced it completely or just lasted long but still, I've followed a few examples and suggestion to generate a lot of random games and those that do rank well use them to train a model.
But this way feels sketchy and not inherently aware of what a failing move is i.e. when you're about to tip the pole more than 15 degrees or the cart moves 2.4 units from the center.
I've been able to gather data from running the simulation for over 200000 times and using this also found I've got a good number of games that lasted for more than 80 steps. (the goal is 195) so using this I graphed these games (< ipython notebook) there's a number of graphs and since I'm graphing each observation individually per game it's too many graphs to put here just to hopefully then maybe see a link between a final observation and the game ending since these are randomly sampled actions so it's random moves.
What I thought I saw was maybe for the first observation that if it gets to 0 the game ends but I've also seen some others where the game runs with negative values. I can't make sense of the data even with graphing basically.
What I really would like to know is if possible what each value in the observation means and also if 0 means left or right but the later would be easier to deduce when I can understand the first.
It seems you asked this question quite some time ago. However, the answer is that the observation is given by the position of the cart, the angle of the pole and their derivatives. The position in the middle is 0. So the negative is left and positive is right.
I am trying to implement a Q Learning agent to learn an optimal policy for playing against a random agent in a game of Tic Tac Toe.
I have created a plan that I believe will work. There is just one part that I cannot get my head around. And this comes from the fact that there are two players within the environment.
Now, a Q Learning agent should act upon the current state, s, the action taken given some policy, a, the successive state given the action, s', and any reward received from that successive state, r.
Lets put this into a tuple (s, a, r, s')
Now usually an agent will act upon every state it finds itself encountered in given an action, and use the Q Learning equation to update the value of the previous state.
However, as Tic Tac Toe has two players, we can partition the set of states into two. One set of states can be those where it is the learning agents turn to act. The other set of states can be where it is the opponents turn to act.
So, do we need to partition the states into two? Or does the learning agent need to update every single state that is accessed within the game?
I feel as though it should probably be the latter, as this might affect updating Q Values for when the opponent wins the game.
Any help with this would be great, as there does not seem to be anything online that helps with my predicament.
In general, directly applying Q-learning to a two-player game (or other kind of multi-agent environment) isn't likely to lead to very good results if you assume that the opponent can also learn. However, you specifically mentioned
for playing against a random agent
and that means it actually can work, because this means the opponent isn't learning / changing its behaviour, so you can reliably treat the opponent as ''a part of the environment''.
Doing exactly that will also likely be the best approach you can take. Treating the opponent (and his actions) as a part of the environment means that you should basically just completely ignore all of the states in which the opponent is to move. Whenever your agent takes an action, you should also immediately generate an action for the opponent, and only then take the resulting state as the next state.
So, in the tuple (s, a, r, s'), we have:
s = state in which your agent is to move
a = action executed by your agent
r = one-step reward
s' = next state in which your agent is to move again
The state in which the opponent is to move, and the action they took, do not appear at all. They should simply be treated as unobservable, nondeterministic parts of the environment. From the point of view of your algorithm, there are no other states in between s and s', in which there is an opponent that can take actions. From the point of view of your algorithm, the environment is simply nondeterministic, which means that taking action a in state s will sometimes randomly lead to s', but maybe also sometimes randomly to a different state s''.
Note that this will only work precisely because you wrote that the opponent is a random agent (or, more importantly, a non-learning agent with a fixed policy). As soon as the opponent also gains the ability to learn, this will break down completely, and you'd have to move on to proper multi-agent versions of Reinforcement Learning algorithms.
Q-Learning is an algorithm from the MDP (Markov Decision Process) field, i.e the MDP and Learning in practically facing a world that being act upon. and each action change the state of the agent (with some probability)
the algorithm build on the basis that for any action, the world give a feedback (reaction).
Q-Learning works best when for any action there is a somewhat immediate and measurable reaction
in addition this method looks at the world from one agent perspective
My Suggestion is to implement the agent as part of the world. like a be bot which plays with various strategies e.g random, best action, fixed layout or even a implement it's logic as q-learning
for looking n steps forward and running all the states (so later you can pick the best one) you can use monte-carlo tree search if the space size is too large (like did with GO)
the Tic-Tac-Toe game is already solved, the player can achieve win or draw if follows the optimal strategy, and 2 optimal players will achieve draw, the full game tree is fairly easy to build
I just started working on an artificial life simulation (again... I lost the other one) in Python and Pygame using Pybrain, and I'm planning how this is going to work. So far I have an environment with some "food pellets". A food pellet is added every minutes. I haven't made my agents (aka "Creatures") yet, but I know I want them to have simple feed forward neural networks with some inputs and the outputs will be its' movement. I want the inputs to show what's in front of them, sort of like they are seeing the simulated world in front of them. How should I go about this? I either want them to actually "see" the colors in their line of vision, or just input the nearest object into their NN. Which one would be best, and how will I implement them?
Having a full field of vision is technically possible in a neural network, but requires a LOT of inputs and massive processing; not a direction you should expect to be able to evolve in any kind of meaningful way.
A neural network deals with values and thresholds. I'd recommend using two inputs associated with the nearest individual - one of them has a value for distance (of the nearest) and the other its angle (with zero being directly ahead, less than zero being on the left and greater than zero bring on the right).
Make sure that these values are easy to process into outputs. For example, if one output goes to a rotation actuator, make sure that the input values and output values are on the same scale. Then it will be easy to both turn toward or away from a particular individual.
If you want them to be able to see multiple individuals, simple include multiple pairs of inputs. I was going to suggest putting them in distance order, but it might be easier for them if as soon as an organism sees something it always comes in to the same inputs until it's no longer tracked.