I have little to no formal discrete math training, and have run into a wee bit of an issue. I am trying to write an agent which reads in a human player's (arbitrary) score and scores a point every so often. The agent needs to "lag behind" and "catch up" every so often, so that the human player believes there is some competition going on. Then, the agent must either win or lose (depending on the condition) against the human.
I have tried a few different techniques, including a wonky probabilistic loop (which failed horribly). I was thinking this problem might call for something like a Hidden Markov Model (HMM) with emissions, but I'm not sure how to implement it (or even whether that is the best approach).
I have a gist up, but again, it sucks.
I hope the __main__ function provides some insight as to the goal of this agent. It is going to be called in pygame.
I think you may be over-thinking this. You can use simple probability to estimate how often, and by how much, the computer's score should catch up. Additionally, you can calculate the difference between the computer's score and the human's score, and then feed this to a sigmoid-like function to control the rate at which the computer's score increases.
Illustrative Python:
#!/usr/bin/env python
import math
import random

human_score = 0
computer_score = 0
trials = 100
computer_ahead_factor = 5  # maximum number of points the computer can be ahead by
computer_catchup_prob = 0.33  # probability of the computer catching up
computer_ahead_prob = 0.5  # probability of the computer jumping ahead of the human
computer_advantage_count = 0

for i in range(trials):
    # Simulate the human player's score increase.
    human_score += random.randint(0, 5)  # add an arbitrary random amount

    # Simulate the computer lagging behind the human: map the score difference
    # through an atan-based sigmoid so the catch-up increment grows with the
    # gap and shrinks as the computer nears the human's score.
    score_diff = human_score - computer_score
    p = (math.atan(score_diff) / (math.pi / 2.0) + 1) / 2.0
    if random.random() < computer_ahead_prob:
        computer_score = human_score + random.randint(0, computer_ahead_factor)
    elif random.random() < computer_catchup_prob:
        computer_score += int(abs(score_diff) * p)

    # Display scores.
    print('Human score:', human_score)
    print('Computer score:', computer_score)
    computer_advantage_count += computer_score > human_score

print('Effective computer advantage ratio: %.6f' % (computer_advantage_count / float(trials),))
I am making the assumption that the human cannot see the computer agent playing the game. If this is the case, here is one idea you might try.
Create a list of all the possible point values that can be scored on any given move. For each move, find a score range you would like the agent to end up within after the current turn. Reduce the set of possible move values to only those that would leave the agent in that range, and randomly select one. As conditions change for how far behind or ahead you would like the agent to be, simply slide the range appropriately.
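A minimal sketch of that idea; the function and argument names are hypothetical, not from the original post:

import random

def choose_agent_points(possible_points, target_low, target_high, agent_score):
    # Pick a point value that keeps the agent's score inside
    # [target_low, target_high] after this turn.
    candidates = [p for p in possible_points
                  if target_low <= agent_score + p <= target_high]
    if candidates:
        return random.choice(candidates)
    # No value lands in range: take the one that gets closest to it.
    return min(possible_points,
               key=lambda p: min(abs(agent_score + p - target_low),
                                 abs(agent_score + p - target_high)))

# Example: agent trails at 40 points; we want it between 45 and 55 this turn.
print(choose_agent_points([0, 2, 5, 8, 12], 45, 55, 40))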
If you are looking for something with built-in, researched psychological effects on the human, I can't help you there. You will need to define more rules for us if you want something more specific to your situation than this.
So I'm scratching my head at this, and I don't know what's wrong. The code is here. The idea is that the AI plays games against itself, and is rewarded for winning or drawing, and punished for losing. Given a board, the AI will choose a move with certain weights. If the game ends up being a win or draw, those moves that it selected will have their weights increased. If the game ends up being a loss, those moves that it selected will have their weights decreased.
Instead what I observe is that 1) The 'X' player (Player 1) will almost always go for either the top left or bottom right square, rather than in the middle as expected, and 2) The 'X' player will become more and more favoured to win as the number of games increases.
I have no idea what is causing this behaviour, and I would appreciate any help.
Apparently Stack Overflow requires you to include code inline when you link to pastebin, so here is the reward bit, although it probably makes more sense in the full context linked above.
foo = ai_player
for i in range(0, len(moves_made)):
    # Find the index of the move made
    weight_index = foo.children.index(moves_made[i])
    # If X won
    if checkWin(current_player.placements) == 1:
        if i % 2 == 0:
            foo.weights[weight_index] += 10
        else:
            foo.weights[weight_index] -= 10
            if foo.weights[weight_index] <= 0: foo.weights[weight_index] = 0
    # If O won
    if checkWin(current_player.placements) == -1:
        if i % 2 == 0:
            foo.weights[weight_index] -= 10
            if foo.weights[weight_index] <= 0: foo.weights[weight_index] = 0
        else:
            foo.weights[weight_index] += 10
    # If it was a draw
    if checkWin(current_player.placements) == 0:
        if i % 2 == 0:
            foo.weights[weight_index] += 5
        else:
            foo.weights[weight_index] += 5
    foo = foo.children[weight_index]
There are a number of issues with your original code for trying to get agents to learn.
You use absolute weights instead of a calculated heuristic for the weights. When adjusting weights, you want to use an update mechanism such as Q-learning (a minimal sketch follows this list).
Wikipedia Q-Learning
This bounds the board weights to the range of possible rewards and allows new, promising options to rise given positive rewards (with unbounded absolute weights, a move that has accumulated +100 or +200 will be very hard for new options to beat).
You assign weightings to board indexes and not board positions. This teaches agents that certain squares of the board are 'the best' and others aren't, regardless of utility (e.g. if there are X's on squares 0 and 1, O may still play its 'absolute best' bottom-right square even though X can win next turn).
Moves seem to be random in this environment (as far as I can see). The agents need some way of 'exploring' to find which moves work and which moves don't, and then 'exploiting' the best moves found so far.
You only store the 'state' values and not the value of taking actions in a given state. With this, agents can calculate the value of the state they're in, but they need a way of calculating the value of actions they can take, to inform them of the best actions to take.
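A minimal sketch of a tabular Q-learning update for a setting like this; the names q_table, alpha, gamma, and epsilon are illustrative, not from the original code:

import random

alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.1  # exploration rate
q_table = {}   # maps (state, action) -> estimated value

def choose_action(state, actions):
    # Explore occasionally; otherwise exploit the best known action.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def q_update(state, action, reward, next_state, next_actions):
    # Standard Q-learning update: move the estimate toward the observed
    # reward plus the discounted value of the best follow-up action.
    best_next = max((q_table.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)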
In the Reinforcement Learning framework, the agent takes in the state and needs to evaluate the best possible action to take, given the state.
I've re-written your code using a Q-learning table to get the agents moving appropriately (moving the agents into a class), and using NumPy for argmax and similar functions.
Q-Learning function is:
self.Q_table[(state, action)] = q_val + 0.1 * \
    (reward + self.return_value[(state, action)] / self.return_number[(state, action)])
Code is in a pastebin.
With epsilon of 0.08 and 1,000,000 episodes, the win rate for X was ~16%, for O ~7%, with ~77% draws.
With epsilon of 0.04 and 1,000,000 episodes, the win rate for X was ~9%, for O ~3%, with ~88% draws.
With epsilon of 0.04 and 2,000,000 episodes, the results were almost identical to 1M episodes.
It's hard to tell from this alone, but possibly you reward them for playing draws, so draws keep happening?
Also, you should evaluate checkWin() once at the start, and you could make a simple helper like update(weight) and pass it 10, -10, or 5 depending on the case.
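A minimal sketch of that suggestion, reusing the foo, weight_index, and moves_made names from the question's loop:

def update(node, index, delta):
    # Adjust a move's weight by delta, never letting it drop below zero.
    node.weights[index] = max(0, node.weights[index] + delta)

result = checkWin(current_player.placements)  # evaluated once, before the loop
for i in range(len(moves_made)):
    weight_index = foo.children.index(moves_made[i])
    if result == 1:     # X won
        update(foo, weight_index, 10 if i % 2 == 0 else -10)
    elif result == -1:  # O won
        update(foo, weight_index, -10 if i % 2 == 0 else 10)
    else:               # draw
        update(foo, weight_index, 5)
    foo = foo.children[weight_index]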
Is your algorithm:
0. init all weights to 100 for every board state
1. reset board
2. get board state
3. make a random (weighted) move depending on weights/board state
4. check if the game ends; if so, update weights and go back to 1, else go to 2
?
I have a simple game where a ball bounces around the screen; the player can move left and right across the screen and shoot an arrow up to pop the ball. Every time the player hits a ball, it bursts and splits into two smaller balls, until the balls reach a minimum size and disappear.
I am trying to solve this game with a genetic algorithm based on the Python NEAT library and this Flappy Bird tutorial: https://www.youtube.com/watch?v=MMxFDaIOHsE&list=PLzMcBGfZo4-lwGZWXz5Qgta_YNX3_vLS2. In the configuration file I have to specify how many input nodes the network has; I had thought to give as inputs the player's x coordinate, the distance between the player's x coordinate and the ball's x coordinate, and the distance between the player's y coordinate and the ball's y coordinate.
My problem is that at the beginning of the game there is only one ball, but after a few moves there could be more balls on the screen, so I would need a greater number of input nodes: the more balls there are on the screen, the more input coordinates I have to provide to the network.
So how to set the number of input nodes in a variable way?
config-feedforward.txt file
"""
# network parameters
num_hidden = 0
num_inputs = 3 #this needs to be variable
num_outputs = 3
"""
python file
for index, player in enumerate(game.players):
    balls_array_x = []
    balls_array_y = []
    for ball in game.balls:
        balls_array_x.append(ball.x)
        balls_array_y.append(ball.y)
    output = np.argmax(nets[index].activate(("there may be a number of variable arguments here")))
    # other...
final code
for index, player in enumerate(game.players):
    balls_array_x = []
    balls_array_y = []
    for ball in game.balls:
        balls_array_x.append(ball.x)
        balls_array_y.append(ball.y)
    distance_list = []
    player_x = player.x
    player_y = player.y
    i = 0
    while i < len(balls_array_x):
        dist = math.sqrt((balls_array_x[i] - player_x) ** 2 + (balls_array_y[i] - player_y) ** 2)
        distance_list.append(dist)
        i += 1
    if len(distance_list) > 0:
        nearest_ball = min(distance_list)
        output = np.argmax(nets[index].activate((player.x, player.y, nearest_ball)))
This is a good question, and as far as I can tell from a quick Google search it hasn't been addressed for simple ML algorithms like NEAT.
Traditional input-resizing methods for deep NNs (padding, cropping, RNNs, intermediate layers, etc.) obviously cannot be applied here, since NEAT explicitly encodes every single neuron and connection.
I am also not aware of any general method/trick to make the input size mutable for the traditional NEAT algorithm, and frankly I don't think there is one. I can think of a couple of changes to the algorithm that would make it possible, but that's of no help to you, I suppose.
In my opinion you therefore have 3 options:
You increase the input size to the maximum number of balls the algorithm should track, and set the x-diff/y-diff values of non-existent balls to an otherwise impossible number (e.g. -1). When balls come into existence you actually set the values for those x-diff/y-diff input neurons, and set them back to -1 when they are gone. Then you let NEAT figure it out (see the sketch after this list). It is also worth thinking about concatenating 2 separate NEAT NNs: the first NN having 2 inputs and 1 output, and the second NN having 1 (player pos) + x (max number of balls) inputs and 2 outputs (left, right). The first NN produces an output for each ball position (and is identical for each ball), and the second NN takes the first NN's outputs and turns them into an action. Also: the maximum number of balls doesn't have to be the maximum number of displayable balls; it can be limited to, say, 10, considering only the 10 closest balls.
You only consider 1 ball on each side of the player (making your input 1 + 2*2). This could be the lowest ball on each side or the closest ball on each side. However, such preprocessing can make a simple NN task like this quite easy to solve. Maybe you can add inertia to your test environment, introducing a non-linearity that makes it less straightforward to always rush to the lowest ball.
You input the whole observation space into NEAT (or a uniformly downsampled fraction), e.g. the whole game at whatever resolution is lowest but still sensible. I know that this observation space is huge, but NEAT works quite well in handling such spaces.
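A minimal sketch of the first option; MAX_BALLS and build_inputs are hypothetical names, and NEAT's num_inputs would be set to 1 + 2 * MAX_BALLS to match:

import math

MAX_BALLS = 10   # fixed input budget: track at most the 10 closest balls
SENTINEL = -1.0  # otherwise-impossible value marking an empty ball slot

def build_inputs(player, balls):
    # Fixed-length vector: player x, then (x-diff, y-diff) per tracked ball,
    # padded with SENTINEL so the network always sees the same width.
    closest = sorted(balls,
                     key=lambda b: math.hypot(b.x - player.x, b.y - player.y))
    closest = closest[:MAX_BALLS]
    inputs = [player.x]
    for ball in closest:
        inputs.extend([ball.x - player.x, ball.y - player.y])
    inputs.extend([SENTINEL] * (2 * (MAX_BALLS - len(closest))))
    return inputs

This would then be passed as nets[index].activate(build_inputs(player, game.balls)).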
I know that this is not the variable input size option of NEAT that you might have hoped for, but I don't know about any such general option/trick without changing the underlying NEAT algorithm significantly.
However, I am very happy to be corrected if someone knows a better option!
I'm building a media player for the office, and so far so good, but I want to add a voting system (kinda like Pandora's thumbs up/thumbs down).
To build the playlist, I am currently using the following code, which pulls 100 random tracks that haven't been played recently (we make sure all tracks have around the same play count), and then ensures we don't hear the same artist within 10 songs and builds a playlist of 50 songs.
max_value = Items.select(fn.Max(Items.count_play)).scalar()
query = (Items
         .select()
         .where(Items.count_play < max_value, Items.count_skip_vote < 5)
         .order_by(fn.Rand()).limit(100))
if query.count() < 1:
    max_value = max_value - 1
    query = (Items
             .select()
             .where(Items.count_play < max_value, Items.count_skip_vote < 5)
             .order_by(fn.Rand()).limit(100))
artistList = []
playList = []
for item in query:
    if len(playList) == 50:
        break
    if item.artist not in artistList:
        playList.append(item.path)
        if len(artistList) < 10:
            artistList.append(item.artist)
        else:
            artistList.pop(0)
            artistList.append(item.artist)
for path in playList:
    client.add(path.replace("/music/Library/", ""))
I'm trying to work out the best way to use the up/down votes.
I want to hear tracks with downvotes less often and tracks with upvotes more often.
I'm not after direct code because I'm pretty OK with python, it's more of the logic that I can't quite nut out (that being said, if you feel the need to improve my code, I won't stop you :) )
Initially give each track a weight w, e.g. 10; an upvote increases it, a downvote reduces it (but never to 0). Then, when deciding which track to play next:
Calculate the total of all the weights, generate a random number between 0 and that total, and step through the tracks from 0-49 adding up their w until you exceed the random number. Play that track.
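A minimal sketch of that selection step; the weights list here is made up for illustration, and Python's random.choices does the same thing in one call:

import random

weights = [10, 12, 7, 10, 3]  # hypothetical per-track weights

def pick_track(weights):
    # Draw a point in [0, total), then walk the cumulative weights past it.
    target = random.uniform(0, sum(weights))
    cumulative = 0
    for index, w in enumerate(weights):
        cumulative += w
        if target < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point edge cases

print(pick_track(weights))
# Equivalent one-liner:
print(random.choices(range(len(weights)), weights=weights, k=1)[0])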
The exact weighting algorithm (e.g. how much an upvote/downvote changes w) will of course affect how often tracks (re)appear. Wasn't it Apple who had to change the 'random' shuffle of their early iPod because it could play the same track twice (or close enough together for a user to notice)? They had to make it less random, which I presume means also weighting by how recently a track was played; in that case the time since last play would also be taken into account when choosing the next track. Make sure you cover the edge cases where everyone downvotes 49 of the tracks (or all 50, if they want silence). Or maybe that's what you want...
I'm trying to write the minimax algorithm in Python with one for loop (yes, I know Wikipedia says the min and max players are often treated separately), and I'm using the variable turn to keep track of whether the min or max player is currently exploring options. I think, however, that at present the code wrongly evaluates for X when it is the O player's turn and for O when it is the X player's turn.
Here's the source (p. 12): http://web.cs.wpi.edu/~rich/courses/imgd4000-d10/lectures/E-MiniMax.pdf
Things you might be wondering about:
b is a list of lists; 0 denotes an available space
evaluate is used both for checking for a victory (by default) and for scoring the board for a particular player (we look for runs of that player's piece on the board).
makeMove returns the row at which the piece lands in the chosen column (used for subsequently undoing the move)
Any help would be very much appreciated. Please let me know if anything is unclear.
def minMax(b, turn, depth=0):
    player, piece = None, None
    best, move = None, -1
    if turn % 2 == 0:  # even player is max player
        player, piece = 'max', 'X'
        best, move = -1000, -1
    else:
        player, piece = 'min', 'O'
        best, move = 1000, -1
    if boardFull(b) or depth == MAX_DEPTH:
        return evaluate(b, False, piece)
    for col in range(N_COLS):
        if possibleMove(b, col):
            row = makeMove(b, col, piece)
            turn += 1  # now the other player's turn
            score = minMax(b, turn, depth+1)
            if player == 'max':
                if score > best:
                    best, move = score, col
            else:
                if score < best:
                    best, move = score, col
            reset(b, row, col)
    return move
@seaotternerd: Yes, I was wondering about that. But I'm not sure that is the problem. Here is one printout. As you can see, X has been dropped in the fourth column by the AI, but it is evaluating from the min player's perspective (it counts 2 O pieces in the far right column).
Here's what the evaluate function determines, depending on piece:
if piece == 'O':
    return best * -25
return best * 25
You are incrementing turn every time you find a possible move, and never undoing it. As a result, when control returns to a given minMax call, turn is 1 greater than it was before, and the next time your program finds a possible move it increments turn again. This causes the next call to minMax to select the wrong player as the current one. Overall, I believe this results in the board being evaluated for the wrong player approximately half the time. You can fix it by adding 1 to turn in the recursive call to minMax(), rather than changing the value stored in the variable:
row = makeMove(b, col, piece)
score = minMax(b, turn+1, depth+1)
EDIT: Digging deeper into your code, I'm finding a number of additional problems:
MAX_DEPTH is set to 1. This will not allow the AI to see its own next move, forcing it to make decisions solely based on getting in the way of the other player.
minMax() returns the score if it has reached MAX_DEPTH or a win condition, but otherwise it returns a move. This breaks propagation of the score back up the recursion tree.
This is not critical, but it's something to keep in mind: your board evaluation function only takes into account how long a given player's longest string is, ignoring how the other player is doing and any other factors that may make one placement better than another. This mostly just means that your AI won't be very "smart."
EDIT 2: A big part of the problem with the way that you're keeping track of min and max is in your evaluation function. You check to see if each piece has won. You are then basing the score of that board off of who the current player is, but the point of having a min player and a max player is that you don't need to know who the current player is to evaluate the board. If max has won, the score is infinity. If min has won, the score is -infinity.
def evaluate(b, piece):
    if evaluate_aux(b, True, 'X'):
        return 100000
    if evaluate_aux(b, True, 'O'):
        return -100000
    return evaluate_aux(b, False, piece)
In general, I think there is a lot you could do to make your code cleaner and easier to read, which would make it much easier to detect errors. For instance, if "X" is always max and "O" is always min, then you don't need to keep track of both player and piece. Additionally, having evaluate_aux sometimes return a boolean and sometimes an int is confusing. You could just have it count pieces in a row, with contiguous "X"s counting positive and contiguous "O"s counting negative, and sum the scores; an evaluation function isn't supposed to be from one player's perspective or the other. Obviously you would still need a check for win conditions in there. This would also address point 3.
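A minimal sketch of such a symmetric count-based evaluation, assuming a list-of-lists board whose cells hold 'X', 'O', or 0; the run scoring is illustrative (horizontal runs only), not the question's evaluate_aux:

def score_runs(b):
    # Sum squared run lengths: contiguous 'X's count positive, contiguous
    # 'O's count negative, so no "current player" is needed to score a board.
    total = 0
    for row in b:
        run_piece, run_len = None, 0
        for cell in row + [0]:  # trailing 0 flushes the final run
            if cell == run_piece and cell != 0:
                run_len += 1
            else:
                if run_piece == 'X':
                    total += run_len ** 2
                elif run_piece == 'O':
                    total -= run_len ** 2
                run_piece, run_len = cell, 1
    return total

# X has a horizontal run of two, O a run of one: net positive for max.
print(score_runs([['X', 'X', 0], ['O', 0, 0], [0, 0, 0]]))  # prints 3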
It's possible that there are more problems, but like I said, this code is not particularly easy to wade through. If you fix the things that I've already found and clean it up, I can take another look.
I'm quite new to algorithms and I was trying to understand minimax. I read a lot of articles, but I still can't get how to implement it in a tic-tac-toe game in Python.
Can you try to explain it to me as simply as possible, maybe with some pseudo-code or some Python code?
I just need to understand how it works. I read a lot of stuff about it and I understood the basics, but I still can't get how it can return a move.
If you can, please don't link me tutorials and samples like (http://en.literateprograms.org/Tic_Tac_Toe_(Python)); I know they are good, but I simply need an idiot-proof explanation.
Thank you for your time :)
The idea of "minimax" is that in a two-player game, one player is trying to maximize some form of score and the other is trying to minimize it. For example, in tic-tac-toe a win for X might be scored as +1 and a win for O as -1. X would be the max player, trying to maximize the final score, and O would be the min player, trying to minimize it.
X is called the max player because when it is X's move, X needs to choose a move that maximizes the outcome after that move. When O plays, O needs to choose a move that minimizes the outcome after that move. These rules apply recursively, so that e.g. if there are only three board positions open to play, the best play for X is the one that forces O to choose a minimum-value move whose value is as high as possible.
In other words, the game-theoretic minimax value V for a board position B is defined as
V(B) =  1  if X has won in this position
V(B) = -1  if O has won in this position
V(B) =  0  if neither player has won and no more moves are possible (draw)
otherwise
V(B) = max(V(B1), ..., V(Bn))  where board positions B1..Bn are
                               the positions available for X, and it is X's move
V(B) = min(V(B1), ..., V(Bn))  where board positions B1..Bn are
                               the positions available for O, and it is O's move
The optimal strategy for X is always to move from B to the Bi for which V(Bi) is maximal, i.e. matches the game-theoretic value V(B); for O, analogously, choose a minimum-value successor position.
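To make that definition concrete, here is a minimal Python sketch that returns both the value and the move; the board representation (a 3x3 list of lists holding 'X', 'O', or None) and the winner helper are assumptions for illustration:

def winner(board):
    # Return 'X' or 'O' if someone has three in a row, else None.
    lines = [[(r, c) for c in range(3)] for r in range(3)]   # rows
    lines += [[(r, c) for r in range(3)] for c in range(3)]  # columns
    lines += [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]
    for line in lines:
        values = [board[r][c] for r, c in line]
        if values[0] is not None and values.count(values[0]) == 3:
            return values[0]
    return None

def minimax(board, to_move):
    # Return (value, best_move) for the player to_move ('X' = max, 'O' = min).
    won = winner(board)
    if won == 'X':
        return 1, None
    if won == 'O':
        return -1, None
    moves = [(r, c) for r in range(3) for c in range(3) if board[r][c] is None]
    if not moves:
        return 0, None  # draw
    results = []
    for r, c in moves:
        board[r][c] = to_move
        value, _ = minimax(board, 'O' if to_move == 'X' else 'X')
        board[r][c] = None  # undo the move
        results.append((value, (r, c)))
    # Max keeps the highest value, min the lowest; the move paired with that
    # value is what gets returned up the tree, which is how a move comes back.
    return max(results) if to_move == 'X' else min(results)

# Example: from an empty board, best play by both sides gives value 0 (a draw).
empty = [[None] * 3 for _ in range(3)]
print(minimax(empty, 'X'))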
However, this is not usually possible to calculate in games like chess, because computing the game-theoretic value requires enumerating the whole game tree down to final positions, and that tree is usually extremely large. Therefore, the standard approach is to design an "evaluation function" that maps board positions to scores that are hopefully correlated with the game-theoretic values. E.g. in chess programs, evaluation functions tend to give positive scores for material advantage, open columns, etc. A minimax algorithm then minimaximizes the evaluation-function score instead of the actual (uncomputable) game-theoretic value of a board position.
A significant, standard optimization to minimax is "alpha-beta pruning". It gives the same results as minimax search, but faster. Minimax can also be cast as "negamax", where the sign of the score is reversed at every search level; it is just an alternative way to implement minimax that handles the two players uniformly (a short sketch follows below). Other game-tree search methods include iterative deepening, proof-number search, and more.
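For completeness, a hedged sketch of the negamax formulation, reusing the hypothetical winner helper from the sketch above:

def negamax(board, to_move):
    # Same search as minimax, but every level maximizes the negated child
    # value; scores are always from to_move's perspective (+1 win, -1 loss).
    won = winner(board)
    if won is not None:
        return (1 if won == to_move else -1), None
    moves = [(r, c) for r in range(3) for c in range(3) if board[r][c] is None]
    if not moves:
        return 0, None  # draw
    best_value, best_move = -2, None  # any real score beats -2
    for r, c in moves:
        board[r][c] = to_move
        value, _ = negamax(board, 'O' if to_move == 'X' else 'X')
        board[r][c] = None
        if -value > best_value:  # the opponent's gain, negated, is our score
            best_value, best_move = -value, (r, c)
    return best_value, best_move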
Minimax is a way of exploring the space of potential moves in a two-player game with alternating turns. You are trying to win, and your opponent is trying to prevent you from winning.
A key intuition is that if it's currently your turn, a two-move sequence that guarantees you a win isn't useful, because your opponent will not cooperate with you. You try to make moves that maximize your chances of winning and your opponent makes moves that minimize your chances of winning.
For that reason, it's not very useful to explore branches from moves that you make that are bad for you, or moves your opponent makes that are good for you.