The typical way of using a reinforcement learning environment looks like this:
env = Environment()
while not env.done:
    state = env.state
    action = choose_action(state)
    env.step(action)
results = env.results
But wouldn't it be more Pythonic this way:
env = Environment()
for state in env:
    action = choose_action(state)
    env.step(action)
else:
    results = env.results
What difference does it make? I can see two reasons.
Less code: in the latter example we don't need to worry about env.done or keep track of which state we are in; the generator picks up where we left off automatically
Easy copying: we can easily duplicate the generator at any state to evaluate different strategies
We are looping over an object that we mutate inside the loop, but since the introduction of the generator .send() method, isn't this sort of thing acceptable?
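Concretely, a minimal sketch of what I have in mind; the Environment class and its counter dynamics are made up just to have something runnable:

```python
# Sketch of an environment supporting `for state in env`; the counter
# dynamics (the episode ends once the state reaches 10) are invented
# purely for illustration.

class Environment:
    def __init__(self):
        self.state = 0
        self.done = False
        self.results = None

    def step(self, action):
        # Made-up transition: the action is simply added to the state.
        self.state += action
        if self.state >= 10:
            self.done = True
            self.results = self.state

    def __iter__(self):
        # Yield the current state until the episode ends, picking up
        # whatever `step` has changed in the meantime.
        while not self.done:
            yield self.state


def choose_action(state):
    return 1  # placeholder policy


env = Environment()
for state in env:
    action = choose_action(state)
    env.step(action)
else:
    results = env.results

print(results)  # 10
```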
The reinforcement learning paradigm consists of repeatedly taking an action in an environment, which in turn gives back a new state (along with a reward, some information, and a boolean telling you whether the episode is done or not).
Looping over all the possible states of the environment is inconsistent with the spirit of RL. Here is an example to illustrate what I mean:
Suppose a robot has to grab a cup of coffee. At each millisecond, the robot receives an image from its internal camera. The state space is then the set of possible images that its camera can return, right? Doing this
for state in env:
    action = choose_action(state)
    env.step(action)
would mean that, on seeing in succession every possible image the world could give it, the robot would make a corresponding action, which is obviously not what you want it to do. You want it to act according to what it has just seen in the previous state, so as to make a new, consistent action.
Hence the dynamics of this code definitely make more sense:
while not env.done:
    state = env.state
    action = choose_action(state)
    env.step(action)
results = env.results
Indeed, it means that as long as the robot has not grabbed the cup, it should look at the environment and take an action. Then it looks at the new state and takes a new action according to its new observation, and so on.
Related
I'm moderately new to RL (reinforcement learning) and trying to solve a problem using a deep Q-learning agent (trying a bunch of algorithms), and I don't want to implement my own agent (I would probably do a worse job than anyone writing an RL library).
My problem is that the way I'm able to view my state space is as (state, action) pairs, which poses a technical problem more than an algorithmic one.
I've found libraries that allow me to upload my own neural network as a Q-function estimator,
but I haven't been able to find a library that lets me evaluate
(state, action) -> Q_estimation, or even better [(state, action)_1, ..., (state, action)_i] -> action to take according to the policy (either greedy or exploratory).
All I've found are libraries that let me input a "state" and "possible actions" and get Q-values or an action choice back.
My second problem is that I want to control the horizon - meaning I want to use a finite horizon.
In short, what I'm looking for is:
An RL library that will let me use a deep Q-network agent that accepts (state, action) pairs and approximates the relevant Q-value.
I would also like to control the horizon of the problem.
Does anyone know of any solutions? I've spent days searching the internet for an implemented solution.
There is some reward "function" R(s,a) (or perhaps R(s'), where s' is determined by the state-action pair) that gives the reward necessary for training a deep Q-learner in this way. The reason there aren't really out-of-the-box solutions is that the reward function is yours to define based on the specific learning problem, and it furthermore depends on your state transition function P(s, s', a), which gives the probability of reaching state s' from state s given action a. This function is also problem-specific.
Let's assume for simplicity that your reward function is a function only of the state. In this case, you simply need to write a function that scores each state reachable from state s given action a, weighted by the probability of reaching that state.
def get_expected_reward(s, a):
    # Sum over all possible next states; `reachable`, `reward`, and the
    # transition-probability tensor `P` are assumed to be defined elsewhere.
    # Here the reward is assumed to be a function of the state only.
    expected_reward = 0
    for s_next in reachable:
        expected_reward += P[s, s_next, a] * reward(s_next)
    return expected_reward
Then, when training:
(states, actions) = next(dataloader)  # here I assume a batch size of b
targets = torch.empty(len(actions))   # note: torch.tensor(n) would create a 0-dim tensor, not one of length n
for b in range(len(actions)):
    targets[b] = get_expected_reward(states[b], actions[b])
# forward pass through model
predicted_rewards = model(states, actions)
# loss
loss = loss_function(predicted_rewards, targets)
I'm trying to learn how to use Q-learning with OpenAI Gym in Python, and I modified the existing gym 'FrozenLake-v0' environment to make an example where the agent moves through a labyrinth map and picks up apples; there is a reward for every apple picked up. A picked apple should disappear from the map, so that it can't be picked up endlessly. What I'm trying to ask is how to make that work in theory: the "picking up" and removal of objects from the map. Should I decrease the Q-table value for that place when I get an apple? What is the most proper solution to this problem?
Is there a way to iterate through each state, force the environment into that state, take a step, and then use the "info" dictionary returned to see all the possible successor states?
Or, even easier, a way to recover all possible successor states for each state, perhaps hidden somewhere?
I saw online that something called MuJoCo (or something like that) has a set_state function, but I don't want to create a new environment; I just want to set the state of the ones already provided by OpenAI Gym.
Context: I'm trying to implement topological-order value iteration, which requires building a graph where each state has an edge to any state that any action could ever transition it to.
I realize that obviously in some games that's just not provided, but for the ones where it is, is there a way?
(Other than the brute-force method of running the game and taking every step I haven't yet taken at whatever state I land in, until I've reached all states and seen everything, which depending on the game could take forever.)
This is my first time using OpenAI Gym, so please explain in as much detail as you can. For example, I have no idea what Wrappers are.
Thanks!
No, OpenAI Gym does not have a method for supplying all possible successor states. Generally, that's sort of the point of creating an algorithm with OpenAI Gym. You are training an agent to learn what the outcomes of its actions are; if it could look into the future and know the results of its actions, that would rather defeat the purpose.
The brute force method you described is probably the easiest way to accomplish what you're describing.
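For environments whose state can be copied, that brute-force search could be sketched as a breadth-first search over deep copies; the ToyEnv below is a made-up stand-in, since real gym environments generally can't be deep-copied or forced into arbitrary states:

```python
import copy
from collections import deque

class ToyEnv:
    """Made-up deterministic environment; real gym environments
    generally can't be deep-copied or reset to arbitrary states."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Invented dynamics: states 0..4, the action is the step size.
        self.state = min(self.state + action, 4)
        return self.state

def enumerate_transitions(start_env, actions=(1, 2)):
    """BFS over environment copies, recording (state, action) -> next_state."""
    transitions = {}
    seen = {start_env.state}
    queue = deque([start_env])
    while queue:
        env = queue.popleft()
        for a in actions:
            child = copy.deepcopy(env)  # branch without disturbing `env`
            s_next = child.step(a)
            transitions[(env.state, a)] = s_next
            if s_next not in seen:
                seen.add(s_next)
                queue.append(child)
    return transitions

transitions = enumerate_transitions(ToyEnv())
```

The resulting dictionary is exactly the state graph that topological-order value iteration needs, at the cost of visiting every reachable state once.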
I am trying to implement a Q Learning agent to learn an optimal policy for playing against a random agent in a game of Tic Tac Toe.
I have created a plan that I believe will work. There is just one part that I cannot get my head around, and it comes from the fact that there are two players in the environment.
Now, a Q-learning agent works with the current state, s; the action taken under some policy, a; the successor state given that action, s'; and any reward received in that successor state, r.
Let's put this into a tuple (s, a, r, s').
Now, usually an agent acts in every state it encounters, and uses the Q-learning equation to update the value of the previous state.
However, as Tic Tac Toe has two players, we can partition the set of states into two: one set of states where it is the learning agent's turn to act, and another where it is the opponent's turn to act.
So, do we need to partition the states into two? Or does the learning agent need to update every single state that is accessed within the game?
I feel as though it should probably be the latter, as this might affect updating Q-values for when the opponent wins the game.
Any help with this would be great, as there does not seem to be anything online that helps with my predicament.
In general, directly applying Q-learning to a two-player game (or other kind of multi-agent environment) isn't likely to lead to very good results if you assume that the opponent can also learn. However, you specifically mentioned
for playing against a random agent
and that means it actually can work, because this means the opponent isn't learning / changing its behaviour, so you can reliably treat the opponent as ''a part of the environment''.
Doing exactly that will also likely be the best approach you can take. Treating the opponent (and his actions) as a part of the environment means that you should basically just completely ignore all of the states in which the opponent is to move. Whenever your agent takes an action, you should also immediately generate an action for the opponent, and only then take the resulting state as the next state.
So, in the tuple (s, a, r, s'), we have:
s = state in which your agent is to move
a = action executed by your agent
r = one-step reward
s' = next state in which your agent is to move again
The state in which the opponent is to move, and the action they took, do not appear at all. They should simply be treated as unobservable, nondeterministic parts of the environment. From the point of view of your algorithm, there are no other states in between s and s', in which there is an opponent that can take actions. From the point of view of your algorithm, the environment is simply nondeterministic, which means that taking action a in state s will sometimes randomly lead to s', but maybe also sometimes randomly to a different state s''.
Note that this will only work precisely because you wrote that the opponent is a random agent (or, more importantly, a non-learning agent with a fixed policy). As soon as the opponent also gains the ability to learn, this will break down completely, and you'd have to move on to proper multi-agent versions of Reinforcement Learning algorithms.
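A sketch of what "the opponent as part of the environment" might look like in code; the minimal TicTacToe class and the wrapper names are invented here for illustration:

```python
import random

class TicTacToe:
    """Minimal Tic-Tac-Toe state, invented here for illustration."""
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

    def __init__(self):
        self.board = [' '] * 9
        self.turn = 'X'

    def legal_moves(self):
        return [i for i, c in enumerate(self.board) if c == ' ']

    def play(self, move):
        self.board[move] = self.turn
        self.turn = 'O' if self.turn == 'X' else 'X'

    def outcome(self):
        # From the learner's ('X') point of view:
        # +1 win, -1 loss, 0 draw, None if the game is still running.
        for a, b, c in self.LINES:
            if self.board[a] != ' ' and self.board[a] == self.board[b] == self.board[c]:
                return 1 if self.board[a] == 'X' else -1
        return 0 if not self.legal_moves() else None

class OpponentWrappedEnv:
    """The random opponent's reply is folded into the environment's step,
    so the learner only ever observes states where it is to move."""
    def __init__(self):
        self.game = TicTacToe()

    def step(self, action):
        self.game.play(action)                  # learner's ('X') move
        if self.game.outcome() is None:         # opponent replies, unseen by the learner
            self.game.play(random.choice(self.game.legal_moves()))
        outcome = self.game.outcome()
        done = outcome is not None
        reward = outcome if done else 0
        return tuple(self.game.board), reward, done  # s' is again X-to-move (or terminal)

env = OpponentWrappedEnv()
state, reward, done = env.step(4)  # X takes the centre; O answers at random
```

From the learner's side, one call to `step` spans both players' moves, which is exactly the (s, a, r, s') tuple described above.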
Q-learning is an algorithm from the MDP (Markov decision process) field: the agent learns while acting in a world, and each action changes the agent's state (with some probability).
The algorithm is built on the premise that for every action, the world gives feedback (a reaction).
Q-learning works best when, for any action, there is a somewhat immediate and measurable reaction.
In addition, this method looks at the world from a single agent's perspective.
My suggestion is to implement the opponent as part of the world, like a bot that plays with various strategies, e.g. random, best action, fixed layout, or even with its logic implemented as Q-learning.
For looking n steps ahead and evaluating the resulting states (so that you can later pick the best one), you can use Monte Carlo tree search if the state space is too large (as was done with Go).
Note that Tic-Tac-Toe is already solved: a player can achieve a win or a draw by following the optimal strategy, two optimal players will always draw, and the full game tree is fairly easy to build.
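The draw-under-optimal-play claim is easy to check directly with a plain minimax over the full game tree; a quick sketch:

```python
from functools import lru_cache

# Winning lines of a 3x3 board indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Value of `board` for 'X' with `player` to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == 'X' else -1
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if not moves:
        return 0  # board full, no winner: draw
    nxt = 'O' if player == 'X' else 'X'
    values = (minimax(board[:i] + (player,) + board[i + 1:], nxt) for i in moves)
    return max(values) if player == 'X' else min(values)

print(minimax((' ',) * 9, 'X'))  # 0: optimal play on both sides is a draw
```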
I am trying to devise an iterative markov decision process (MDP) agent in Python with the following characteristics:
observable state
I handle potential 'unknown' states by reserving some state space for answering query-type moves made by the DP (the state at t+1 will identify the previous query [or zero if the previous move was not a query] as well as the embedded result vector); this space is padded with 0s to a fixed length to keep the state frame aligned regardless of which query was answered (their data lengths may vary)
actions that may not always be available in all states
a reward function that may change over time
policy convergence should be incremental and only computed per move
So the basic idea is that the MDP agent should make its best-guess optimized move at T using its current probability model (and since it is probabilistic, the move it makes is expectedly stochastic, implying possible randomness), couple the new input state at T+1 with the reward from the previous move at T, and re-evaluate the model. The convergence must not be permanent, since the reward may modulate or the available actions could change.
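Concretely, the per-move update I have in mind is essentially the tabular Q-learning rule applied online, one transition at a time; a minimal sketch with placeholder state and action names:

```python
from collections import defaultdict

class OnlineQAgent:
    """One Q-update per observed transition; nothing is solved up front."""

    def __init__(self, alpha=0.1, gamma=0.95):
        self.q = defaultdict(float)   # (state, action) -> value
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor

    def best_action(self, state, available_actions):
        # Actions may differ per state; only score the ones offered now.
        return max(available_actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, reward, s_next, next_actions):
        # Standard Q-learning backup, applied immediately after each move,
        # so a changing reward signal is folded in as it arrives.
        best_next = max((self.q[(s_next, a2)] for a2 in next_actions), default=0.0)
        target = reward + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

agent = OnlineQAgent()
agent.update('s0', 'go', reward=1.0, s_next='s1', next_actions=['go', 'stop'])
```

Because each update touches only one (state, action) entry, the model can keep adapting as the reward function or the available actions change.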
What I'd like to know is whether there are any current Python libraries (preferably cross-platform, as I necessarily switch between Windows and Linux) that can do this sort of thing already (or may support it with suitable customization, e.g. derived-class support that allows redefining, say, the reward method with one's own).
I'm finding that information on online, per-move MDP learning is rather scarce. Most uses of MDPs that I can find seem to focus on solving the entire policy as a preprocessing step.
Here is a Python toolbox for MDPs.
Caveat: It's for vanilla textbook MDPs and not for partially observable MDPs (POMDPs), or any kind of non-stationarity in rewards.
Second caveat: I found the documentation to be really lacking. You have to look in the Python code if you want to know what it implements, or you can quickly look at the documentation for the similar toolbox they have for MATLAB.