OpenAI gym cartpole-v0 understanding observation and action relationship - python

I'm interested in using OpenAI Gym to build a model that not only performs well but, ideally, keeps improving until it converges on the best moves.
This is how I initialize the env
import gym
env = gym.make("CartPole-v0")
env.reset()
It returns a set of values: observation, reward, done and info; info is always empty, so I ignore it.
I'd hope reward would signify whether the action taken was good or bad, but it always returns 1 until the game ends; it's more of a counter of how long you've been playing.
The action can be sampled by
action = env.action_space.sample()
which in this case is either 1 or 0.
To put this into perspective for anyone who doesn't know the game, here's the link: the objective is to balance a pole by moving the cart left or right, i.e. providing an input of 0 or 1.
The observation is the only key way to tell whether you're making a good or bad move.
obs, reward, done, info = env.step(action)
and the observation looks something like this
array([-0.02861881, 0.02662095, -0.01234258, 0.03900408])
As I said before, reward is always 1, so it's not a good indicator of whether a move was good or bad given the observation, and done just means the game has come to an end; I also can't tell whether that means you lost or won.
As you'll see from the linked page, the objective is to balance the pole for a total reward of +195 averaged over 100 games; that's the criterion for success, though I'm not sure whether it means you've truly balanced the pole or just lasted long enough. Still, I've followed a few examples and suggestions: generate a lot of random games, and use the ones that rank well to train a model.
But this approach feels sketchy and isn't inherently aware of what a failing move is, i.e. when you're about to tip the pole more than 15 degrees or move the cart more than 2.4 units from the center.
I've gathered data by running the simulation over 200,000 times and found a good number of games that lasted more than 80 steps (the goal is 195). Using this data I graphed those games in an IPython notebook; since I'm graphing each observation value individually per game, there are too many graphs to include here. The hope was to spot a link between the final observation and the game ending, since the actions are randomly sampled, i.e. random moves.
What I thought I saw was that the game ends when the first observation value reaches 0, but I've also seen games that keep running with negative values. Basically, I can't make sense of the data even with graphing.
What I'd really like to know is what each value in the observation means, and also whether an action of 0 means left or right; the latter would be easier to deduce once I understand the former.

It seems you asked this question quite some time ago, but here is the answer: the observation is [cart position, cart velocity, pole angle, pole angular velocity], i.e. the position of the cart, the angle of the pole and their derivatives. Position 0 is the middle of the track, so negative values are to the left and positive values to the right.
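For a concrete illustration, here is a minimal sketch (assuming the classic Gym API, where step returns a 4-tuple) that unpacks the observation and prints a warning as the cart or pole approaches the termination limits; in CartPole-v0 the episode ends when the pole tips more than roughly 12 degrees or the cart moves more than 2.4 units from the centre:

import math
import gym

env = gym.make("CartPole-v0")
obs = env.reset()
done = False

while not done:
    action = env.action_space.sample()           # 0 pushes the cart left, 1 pushes it right
    obs, reward, done, info = env.step(action)   # classic (pre-0.26) Gym API
    cart_pos, cart_vel, pole_angle, pole_vel = obs

    # Warn as we approach the termination thresholds
    if abs(cart_pos) > 2.0:
        print("cart near the edge: x = %.2f (episode ends at |x| > 2.4)" % cart_pos)
    if abs(pole_angle) > math.radians(10):
        print("pole tipping: %.1f degrees (episode ends near 12 degrees)" % math.degrees(pole_angle))

env.close()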

Related

What's better for large datasets in reinforcement learning? Dictionary, numpy or pandas?

I'm trying to implement a chess AI with reinforcement learning. I'm using the Q-learning algorithm and, as a starting point for keeping track of the Q-values of each state, I implemented a dictionary of dictionaries (in other words, to keep track of a large number of keys that each hold a large number of values, I implemented a dictionary of dictionaries).
This implementation worked smoothly in small, trivial scenarios.
For example, with the following board, if white is the only side that can move, it reaches checkmate in less than 3 minutes.
Initial testing board
I've tried to minimize the computational requirements of each step in the Q-learning algorithm, but nothing seems to work, so, since the algorithm revolves around the Q-table itself, it's quite likely that I'm not using the optimal data structure.
Right now this is the method that gets/initializes the value for each state-action pair (the key of the first dictionary is the state; the keys of the second are the different actions, i.e. future states, reachable from that state):
def get_qdict(self, state, action):
    # Convert the array representations of the position to hashable string keys
    string_state = self.stateToString(state)
    string_action = self.stateToString(action)
    if string_state in self.q_dict:
        if string_action in self.q_dict[string_state]:
            return self.q_dict[string_state][string_action]
        # State seen before, but not this action: initialize its Q-value to 0
        self.q_dict[string_state][string_action] = 0
        return 0
    # First time this state is seen: create its inner dictionary
    self.q_dict[string_state] = {}
    self.q_dict[string_state][string_action] = 0
    return 0
stateToString converts the array representation of a position to a string.
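As a side note, a collections.defaultdict can express the same dictionary-of-dictionaries lookup more compactly (this is just a sketch of the same structure, not a performance fix; the key names are placeholders):

from collections import defaultdict

# Same structure as self.q_dict, but missing entries default to 0.0 automatically.
q_table = defaultdict(lambda: defaultdict(float))

state_key, action_key = "some-state-string", "some-action-string"
value = q_table[state_key][action_key]        # 0.0 on first access, no explicit checks needed
q_table[state_key][action_key] = value + 0.1  # updates work the same way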
The problem comes when I try to add black's moves into the equation, or set up a more complex scenario: the time needed increases exponentially, and it takes hours to get anywhere.
For example, each episode in the previous scenario (the one with just the black king and the white rook and king) takes around 0.01 s.
In the scenario where black can also move, each episode takes around 0.3 s.
How could I fix this? Should I implement the Q-table as a numpy array, or is it better to implement it as a pandas DataFrame?
Or is it simply normal that the time increases that much in the second scenario?

Rising and Falling Edge in multiple signals - PYTHON

Here is the overall scenario: I'm recording some simple signals from a novel sensor using Python 3.8. I have already filtered the signals to get a cleaner representation on which to run further data-analysis algorithms. Nothing special.
Below are some of the signals on which I need to run my algorithm:
First Example
Second Example
These signals come from a sensor I am working on. My aim is to get the timestamps where a signal starts to increase or decrease. I actually need to run this algorithm on only one signal (blue or orange).
I have included both signals because they have antagonistic behaviour, which may be useful for the task.
In other words, these signals relate to foot flexion/extension (FLE/EXT): the point where they start to increase is the point where I start to move my foot, and vice versa, when I move my foot back the signal amplitude decreases.
My job is to identify the FLE/EXT events. I tried examining the first derivative, but it doesn't seem to give me any useful information.
I also tried a convolution with a fixed-length ones-array, looking for the points where the average over the next window is greater than the average over the current one (a sketch of this approach appears after the list below).
This approach has 2 constraints:
Fixed-length window: when the signal represents a faster FLE/EXT (i.e. less temporal distance on the x-axis), the window is not long enough to catch the variation.
Threshold criterion: I have to choose how much larger the next window's average must be than the current one's for an iteration to count as an event.
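For reference, the fixed-window version described above might look roughly like this sketch (the function name, window length and threshold are illustrative assumptions, not the actual code); it compares the mean of the next window with the mean of the current one:

import numpy as np

def detect_edges(signal, window=50, threshold=0.1):
    # Moving average via convolution with a normalized ones-array
    signal = np.asarray(signal, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(signal, kernel, mode="valid")

    rising, falling = [], []
    for i in range(len(smoothed) - window):
        current = smoothed[i]            # average of the current window
        upcoming = smoothed[i + window]  # average of the next window
        if upcoming - current > threshold:
            rising.append(i)             # signal starting to increase (flexion)
        elif current - upcoming > threshold:
            falling.append(i)            # signal starting to decrease (extension)
    return rising, falling

The two fixed parameters (window, threshold) are exactly the constraints listed above; one possible direction toward a dynamic threshold is to derive it from a rolling estimate of the signal's noise (e.g. a multiple of a rolling standard deviation) instead of a constant.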
I'm stuck here, because I would like a dynamic-threshold approach, or something similar, that lets me avoid any fixed thresholds.
I'd be happy to discuss this with you. What do you think?
If anything is unclear, I'm happy to explain further.
Best regards,
V

Generate the data for A.I. to play the Snake game

I would like to generate some data (position of the snake, available moves, distance from the food, ...) to train a neural network model to play the snake game. However, I don't know how to do that. My current ideas are:
Play the game manually (by myself) for many iterations and store the data (drawback: I would have to play the game a lot of times).
Make the snake do random moves and track their outcomes.
Play the snake with depth-first search or similar algorithms many times and store the data.
Can you suggest another method, or should I choose one of those? If so, which one?
P.S. I don't know if this is the right place to ask such a question, but I don't know where else to ask, hence I am here.
If using a neural network, start simple. Think inputs and outputs and keep them simple too.
Inputs:
How many squares to the left of the head are free
How many squares to the right of the head are free
How many squares forward of the head are free
Relative position of next food left/right
Relative position of next food forward/back
Length of snake
Normalize each input by its minimum and maximum possible values so that all inputs stay in the range -1.0 to 1.0.
Outputs:
Turn Left
Turn Right
Straight Ahead
(choose the output with highest activation)
The next problem is training. A typical approach is to use a genetic algorithm over all the weights of the above neural network: randomize and test new versions of the network every life, and after X attempts create a new generation and repeat to improve. This is pretty much doing random moves automatically (your second choice).
The next problem is the fitness function for training. How do you know which neural network is better? You could simply use the length of the snake as the fitness factor: the longer the snake, the better and more 'fit'.
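A minimal numpy sketch of such a policy network (the class and feature names are placeholders; you would compute the six inputs from your own snake implementation):

import numpy as np

class SnakePolicy:
    # Tiny feed-forward net: 6 normalized inputs -> hidden layer -> 3 move scores.
    def __init__(self, hidden=8, seed=None):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.5, size=(6, hidden))
        self.w2 = rng.normal(scale=0.5, size=(hidden, 3))

    def act(self, inputs):
        # inputs: the six features listed above, each scaled to the range -1.0 to 1.0
        h = np.tanh(np.asarray(inputs, dtype=float) @ self.w1)
        scores = h @ self.w2
        return ("left", "right", "straight")[int(np.argmax(scores))]

    def genome(self):
        # Flattened weights: the vector a genetic algorithm would mutate and recombine
        return np.concatenate([self.w1.ravel(), self.w2.ravel()])

The fitness of each genome could then simply be the final snake length after one (or a few) games, as suggested above.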

Q Learning Applied To a Two Player Game

I am trying to implement a Q Learning agent to learn an optimal policy for playing against a random agent in a game of Tic Tac Toe.
I have created a plan that I believe will work. There is just one part that I cannot get my head around. And this comes from the fact that there are two players within the environment.
Now, a Q-learning update involves the current state s, the action a taken under some policy, the successor state s' reached by that action, and the reward r received in that successor state.
Let's put this into a tuple: (s, a, r, s').
Usually an agent acts in every state it encounters and uses the Q-learning equation to update the value of the previous state.
However, as Tic Tac Toe has two players, we can partition the set of states into two. One set of states is where it is the learning agent's turn to act; the other set is where it is the opponent's turn to act.
So, do we need to partition the states into two? Or does the learning agent need to update every single state that is accessed within the game?
I feel as though it should probably be the latter, as this might affect updating Q Values for when the opponent wins the game.
Any help with this would be great, as there does not seem to be anything online that helps with my predicament.
In general, directly applying Q-learning to a two-player game (or other kind of multi-agent environment) isn't likely to lead to very good results if you assume that the opponent can also learn. However, you specifically mentioned
for playing against a random agent
and that means it actually can work, because the opponent isn't learning or changing its behaviour, so you can reliably treat the opponent as "a part of the environment".
Doing exactly that will also likely be the best approach you can take. Treating the opponent (and his actions) as a part of the environment means that you should basically just completely ignore all of the states in which the opponent is to move. Whenever your agent takes an action, you should also immediately generate an action for the opponent, and only then take the resulting state as the next state.
So, in the tuple (s, a, r, s'), we have:
s = state in which your agent is to move
a = action executed by your agent
r = one-step reward
s' = next state in which your agent is to move again
The state in which the opponent is to move, and the action they took, do not appear at all. They should simply be treated as unobservable, nondeterministic parts of the environment. From the point of view of your algorithm, there are no other states in between s and s', in which there is an opponent that can take actions. From the point of view of your algorithm, the environment is simply nondeterministic, which means that taking action a in state s will sometimes randomly lead to s', but maybe also sometimes randomly to a different state s''.
Note that this will only work precisely because you wrote that the opponent is a random agent (or, more importantly, a non-learning agent with a fixed policy). As soon as the opponent also gains the ability to learn, this will break down completely, and you'd have to move on to proper multi-agent versions of Reinforcement Learning algorithms.
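A sketch of what that looks like in code, assuming a hypothetical env object with legal_actions(state) and apply(state, action) helpers (rewards are from the learning agent's perspective); the point is only that the opponent's random reply is folded into the same transition before the Q-update:

import random
from collections import defaultdict

Q = defaultdict(float)               # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def q_step(env, state):
    # Epsilon-greedy action for the learning agent
    actions = env.legal_actions(state)
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])

    # Agent moves; the random opponent replies immediately.
    # Only the state after the opponent's reply counts as s'.
    next_state, reward, done = env.apply(state, action)
    if not done:
        opp_action = random.choice(env.legal_actions(next_state))
        next_state, reward, done = env.apply(next_state, opp_action)

    # Standard Q-learning update on (s, a, r, s')
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.legal_actions(next_state))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return next_state, done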
Q-learning is an algorithm from the MDP (Markov Decision Process) field; that is, it learns by acting on a world in which each action changes the agent's state (with some probability).
The algorithm is built on the assumption that, for any action, the world gives feedback (a reaction).
Q-learning works best when every action has a somewhat immediate and measurable reaction.
In addition, this method looks at the world from a single agent's perspective.
My suggestion is to implement the opponent as part of the world, e.g. as a bot which plays with various strategies such as random, best action, a fixed layout, or even its own Q-learning logic.
For looking n steps ahead and enumerating the resulting states (so you can later pick the best one), you can use Monte Carlo tree search if the state space is too large (as was done with Go).
Tic-tac-toe is already a solved game: a player can always achieve a win or a draw by following the optimal strategy, two optimal players will always draw, and the full game tree is fairly easy to build.

writing optimization function

I'm trying to write a tennis reservation system and I got stuck on this problem.
Let's say you have players with preferences regarding court number, day and hour.
Every player is also ranked, so if several players have a preference for the same day/hour slot, the one with the highest priority should get it.
I'm thinking about using some optimization algorithm to solve this problem, but I'm not sure what would be the best cost function and/or algorithm to use.
Any advice?
One more thing: I would prefer to use Python, but language-agnostic advice would also be welcome.
Thanks!
edit:
some clarifications:
the player with higher priority wins, and the loser is moved to the nearest free slot
regarding fixed vs. flexible time slots: they are rather flexible
yes, the goal is maximizing the number of people getting their most highly preferred times
The basic Algorithm
I'd sort the players by their rank, as the high ranked ones always push away the low ranked ones. Then you start with the player with the highest rank, give him what he asked for (if he really is the highest, he will always win, thus you can as well give him whatever he requested). Then I would start with the second highest one. If he requested something already taken by the highest, try to find a slot nearby and assign this slot to him. Now comes the third highest one. If he requested something already taken by the highest one, move him to a slot nearby. If this slot is already taken by the second highest one, move him to a slot some further away. Continue with all other players.
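In code, that greedy pass could look something like the following sketch (the Player attributes and the nearby-slot search are illustrative assumptions):

def allocate(players, find_nearby_free_slot):
    # Greedy allocation: highest-ranked player first, bump to a nearby slot on conflict.
    schedule = {}  # slot -> player
    for player in sorted(players, key=lambda p: p.rank, reverse=True):
        slot = player.requested_slot
        if slot in schedule:
            # Requested slot already taken by a higher-ranked player;
            # look for the closest free slot instead (may return None).
            slot = find_nearby_free_slot(slot, schedule)
        if slot is not None:
            schedule[slot] = player
    return schedule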
Some tunings to consider:
If multiple players can have the same rank, you may need to implement some fairness. All players of equal rank will end up in a random order relative to each other if you just sort them, e.g. with quicksort. You can get some fairness if you don't process player by player, but rank by rank: start with the highest rank and the first player of that rank, process his first request, but before you process his second request, process the first request of the next player of the highest rank, then of the third player of the highest rank, and so on. The algorithm is the same as above, but assuming you have 10 players, players 1-4 are high rank, players 5-7 are low and players 8-10 are very low, and every player made 3 requests, you process them as
Player 1 - Request 1
Player 2 - Request 1
Player 3 - Request 1
Player 4 - Request 1
Player 1 - Request 2
Player 2 - Request 2
:
That way you have some fairness. You could also choose randomly within a ranking class each time; that, too, provides some fairness.
You could implement fairness even across ranks. E.g. if you have 4 ranks, you could say
Rank 1 - 50%
Rank 2 - 25%
Rank 3 - 12.5%
Rank 4 - 6.25%
(These are just example values; you could use a different factor than 0.5, e.g. 0.8, so the numbers decrease more slowly.)
Now you start processing with Rank 1, but once 50% of all Rank 1 requests have been fulfilled, you move on to Rank 2 and make sure 25% of their requests are fulfilled, and so on. This way even a Rank 4 user can win over a Rank 1 user, somewhat defeating the initial algorithm, but you gain some fairness: even a Rank 4 player sometimes gets his request and won't "run dry". Otherwise a Rank 1 player scheduling every request at the same time as a Rank 4 player would make sure the Rank 4 player never gets a single request; this way he has at least a small chance of getting one.
After you have made sure everyone has had their minimum percentage processed (the higher the rank, the larger that percentage), you go back to the top, starting with Rank 1 again, and process the rest of their requests, then the rest of the Rank 2 requests, and so on.
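A rough sketch of that two-pass quota scheme (the quota values and the request bookkeeping are illustrative assumptions):

import math

def quota_passes(requests_by_rank, process, quotas):
    # quotas e.g. {1: 0.50, 2: 0.25, 3: 0.125, 4: 0.0625}
    leftovers = {}
    # First pass: fulfil only each rank's guaranteed fraction, best rank first
    for rank in sorted(requests_by_rank):
        reqs = requests_by_rank[rank]
        guaranteed = math.ceil(len(reqs) * quotas.get(rank, 0.0))
        for request in reqs[:guaranteed]:
            process(request)
        leftovers[rank] = reqs[guaranteed:]
    # Second pass: go back to the top and process the remaining requests, rank by rank
    for rank in sorted(leftovers):
        for request in leftovers[rank]:
            process(request)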
Last but not least: you may want to define a maximum slot offset. If a slot is taken, the application should search for the nearest slot still free. However, what if this nearest slot is very far away? If I request a slot on Monday at 4 PM and the application finds the next free one on Wednesday at 9 AM, that's not really helpful for me, is it? I might have no time on Wednesday at all. So you may limit the slot search to the same day and say the slot may be at most 3 hours off. If no slot is found within that range, cancel the request. In that case you need to inform the player: "We are sorry, but we could not find any nearby slot for you; please request a slot on another date/time and we will see if we can find a suitable slot there for you".
This is an NP-complete problem, I think, so it'll be impossible to have a very fast algorithm for any large data sets.
There's also the problem that you might have a schedule that is impossible to satisfy. Assuming that's not the case, something like this sketch is probably your best bet:
# sort players by priority, highest to lowest
players.sort(key=lambda p: p.priority, reverse=True)
# start with an empty schedule
schedule = {}
for player in players:
    for timeslot in player.preferences():
        if timeslot not in schedule:  # timeslot is free
            schedule[timeslot] = player
            break
    else:
        # If we get here, this player couldn't be accommodated at all;
        # you'll have to go through the slots that were filled and move
        # another (higher-priority) player's time slot.
        pass
You are describing a matching problem. Possible references are the Stony Brook algorithm repository and Algorithm Design by Kleinberg and Tardos. If the number of players is equal to the number of courts you can reach a stable matching - The Stable Marriage Problem. Other formulations become harder.
There are several questions I'd ask before answering this question:
What happens if there is a conflict, i.e. a worse player books first and then a better player books the same court? Who wins? What happens to the loser?
Do you let the best players play as long as the match runs, or do you have fixed time slots?
How often is the scheduling run? Is it run interactively, so that someone could potentially be told they can play only to be told later that they can't; or is it run in a more batch-like manner, where you put in requests and are told later whether you got your slot? Or do users set up a number of preferred times, and the system has to maximise the number of people getting their most highly preferred times?
As an aside, you can make it slightly less complex by re-writing the times as integer indexes (so you're dealing with integers rather than times).
I would advise using a scoring algorithm. Basically, construct a formula that pulls all the values you described into a single number. Whoever has the highest final score wins that slot. For example, a simple formula might be:
FinalScore = ( PlayerRanking * N1 ) + ( PlayerPreference * N2 )
Where N1, N2 are weights to control the formula.
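A direct translation of that formula might look like this (the weights and attribute names are placeholders to tune for your own system):

N1, N2 = 10.0, 1.0   # weights controlling how much rank vs. preference matters

def final_score(player, slot):
    # preference_for(slot): e.g. 3 for the player's first choice, 2 for the second, and so on
    return player.ranking * N1 + player.preference_for(slot) * N2

# The player with the highest final_score for a given slot wins that slot.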
This will allow you to get good (not perfect) results very quickly. We use this approach on a much more complex system with very good results.
You can add more variety to this by adding in factors for how many times the player has won or lost slots, or (as someone suggested) how much the player paid.
Also, you can use multiple passes to assign slots in the day: one strategy that goes chronologically, one reverse chronologically, one that does the morning first, one that does the afternoon first, etc. Then sum the scores of the players that got the spots and pick the strategy that produced the best results.
Basically, you have the advantage that players have priorities; therefore, you sort the players by descending priority and then start allocating slots to them. The first gets their preferred slot, then the next takes their preferred slot among the free ones, and so on. It's an O(N) algorithm.
I think you should use a genetic algorithm because:
It is well suited to large problem instances.
It trades some accuracy for reduced time complexity (the answer is approximate, not guaranteed to be optimal).
You can specify constraints and preferences easily by adding fitness penalties for the ones that are not met.
You can specify a time limit for program execution.
The quality of the solution depends on how much time you are willing to spend solving the problem.
Genetic Algorithms Definition
Genetic Algorithms Tutorial
Class scheduling project with GA
Also take a look at: a similar question and another one
Money. Allocate time slots based on who pays the most. In case of a draw don't let any of them have the slot.
