Python libraries for on-line machine learning MDP - python

I am trying to devise an iterative markov decision process (MDP) agent in Python with the following characteristics:
- observable state
- I handle a potential 'unknown' state by reserving some of the state space for answering query-type moves made by the DP (the state at t+1 will identify the previous query [or zero if the previous move was not a query] as well as the embedded result vector); this space is padded with 0s to a fixed length to keep the state frame aligned regardless of which query was answered (since their data lengths may vary)
- actions that may not always be available at all states
- a reward function that may change over time
- policy convergence that should be incremental and only computed per move
So the basic idea is that the MDP should make its best-guess optimized move at T using its current probability model (and since it's probabilistic, the move it makes is expected to be stochastic, implying possible randomness), couple the new input state at T+1 with the reward from the previous move at T, and re-evaluate the model. The convergence must not be permanent, since the reward may modulate or the available actions could change.
What I'd like to know is whether there are any current Python libraries (preferably cross-platform, as I necessarily switch environments between Windoze and Linux) that can do this sort of thing already, or that could support it with suitable customization (e.g. derived-class support that allows redefining, say, the reward method with one's own).
I'm finding that information about on-line, per-move MDP learning is rather scarce. Most use of MDPs that I can find seems to focus on solving the entire policy as a preprocessing step.
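For concreteness, here is roughly the kind of interface I'm hoping a library would expose (all names are hypothetical, just to illustrate the per-move loop described above):

# Hypothetical interface only -- illustrating the per-move loop I want,
# not any actual library's API.
class OnlineMDPAgent:
    def available_actions(self, state):
        """Actions legal in `state` (these may vary from state to state)."""
        raise NotImplementedError

    def reward(self, state, action, next_state):
        """Overridable reward; it may change over time."""
        raise NotImplementedError

    def choose_action(self, state):
        """Best-guess (possibly stochastic) move under the current model."""
        raise NotImplementedError

    def update(self, state, action, reward, next_state):
        """Incremental, per-move model/policy update -- never 'final'."""
        raise NotImplementedError

# Per-move loop at time T:
#   a_T = agent.choose_action(s_T)
#   observe s_{T+1} and the reward r_T from the environment
#   agent.update(s_T, a_T, r_T, s_{T+1})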

Here is a python toolbox for MDPs.
Caveat: It's for vanilla textbook MDPs and not for partially observable MDPs (POMDPs), or any kind of non-stationarity in rewards.
Second Caveat: I found the documentation to be really lacking. You have to look in the python code if you want to know what it implements or you can quickly look at their documentation for a similar toolbox they have for MATLAB.
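For reference, a minimal sketch of using that toolbox (pymdptoolbox), in the style of its quick-start usage: solve the built-in forest-management example with value iteration. Note this computes the whole policy offline, so it would only be a starting point for the per-move setting asked about above.

import mdptoolbox.example
import mdptoolbox.mdp

# Transition (P) and reward (R) matrices for the toolbox's built-in forest example
P, R = mdptoolbox.example.forest()

# Solve the MDP by value iteration with a 0.9 discount factor
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
vi.run()
print(vi.policy)   # the optimal action for each state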

Related

Ways to speed up Monte Carlo Tree Search in a model-based RL task

This area is still very new to me, so forgive me if I am asking dumb questions. I'm utilizing MCTS to run a model-based reinforcement learning task. Basically I have an agent foraging in a discrete environment, where the agent can see out some number of spaces around it (I'm assuming perfect knowledge of its observation space for simplicity, so the observation is the same as the state). The agent has an internal transition model of the world represented by an MLP (I'm using tf.keras). Basically, for each step in the tree, I use the model to predict the next state given the action, and I let the agent calculate how much reward it would receive based on the predicted change in state. From there it's the familiar MCTS algorithm, with selection, expansion, rollout, and backprop.
Essentially, the problem is that this all runs prohibitively slowly. From profiling my code, I notice that a lot of time is spent doing the rollout, likely because the NN needs to be consulted many times and each prediction takes a nontrivial amount of time. Of course, I can probably stand to clean up my code to make it run faster (e.g. better vectorization), but I was wondering:
Are there ways to speed up/work around the traditional random walk done for rollout in MCTS?
Are there generally other ways to speed up MCTS? Does it just not mix well with using an NN in terms of runtime?
Thanks!
I am working on a similar problem, and so far the following have helped me:
Make sure you are running TensorFlow on your GPU (you will have to install CUDA)
Estimate how many steps into the future your agent needs to calculate to still get good results
Parallelize (the one I am currently working on); see the sketch of batching the NN calls below
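One concrete form of the last point is to batch the network calls: instead of consulting the NN once per simulated step of a single rollout, run many rollouts in lockstep and predict a whole batch of next states per call, which amortizes the per-call overhead. A rough sketch under assumptions of my own (none of these names come from the original post): a tf.keras model transition_model mapping concatenated state/action features to the predicted next state, plus vectorized sample_action and reward_fn helpers.

import numpy as np

def batched_rollouts(transition_model, start_states, sample_action, reward_fn, horizon=10):
    """Run many rollouts in lockstep so each step is ONE batched NN call."""
    states = np.asarray(start_states, dtype=np.float32)          # (n_rollouts, state_dim)
    total_reward = np.zeros(len(states), dtype=np.float32)
    for _ in range(horizon):
        actions = np.stack([sample_action(s) for s in states])   # random-walk rollout policy
        model_in = np.concatenate([states, actions], axis=1)
        next_states = transition_model.predict(model_in, verbose=0)
        total_reward += reward_fn(states, next_states)            # vectorized reward
        states = next_states
    return total_reward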

How to list possible successor states for each state in OpenAI gym? (strictly for normal MDPs)

Is there a way to iterate through each state, force the environment to go to that state, and then take a step and then use the "info" dictionary returned to see what are all the possible successor states?
Or an even easier way to recover all possible successor states for each state, perhaps somewhere hidden?
I saw online that something called MuJoCo (or something like that) has a set_state function, but I don't want to create a new environment; I just want to set the state of the ones already provided by OpenAI gym.
Context: trying to implement topological order value iteration, which requires making a graph where each state has an edge to any state that any action could ever transition it to.
I realize that obviously in some games that's just not provided, but for the ones where it is, is there a way?
(Other than the brute force method of running the game and taking every step I haven't yet taken at whatever state I land at until I've reached all states and seen everything, which depending on the game could take forever)
This is my first time using OpenAI gym, so please explain in as much detail as you can. For example, I have no idea what Wrappers are.
Thanks!
No, OpenAI gym does not have a method for supplying all possible successor states. Generally, that's sort of the point of creating an algorithm with OpenAI gym. You are training an agent to learn what the outcome of its actions are; if it can look into the future and know what the results of its actions are, it kind of defeats the purpose.
The brute force method you described is probably the easiest way to accomplish what you're describing.
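For what it's worth, on the small "toy text" environments (FrozenLake, Taxi, ...) the current state happens to be exposed as env.unwrapped.s, so the brute-force sweep can be written roughly like this. This is a sketch only: it relies on an internal attribute rather than a public gym API, it applies only to discrete toy-text environments, and the env id / step signature may differ between gym versions.

import gym

env = gym.make("FrozenLake-v0")
env.reset()

n_states = env.observation_space.n
n_actions = env.action_space.n
successors = {}                          # state -> action -> set of observed next states

for s in range(n_states):
    successors[s] = {}
    for a in range(n_actions):
        seen = set()
        for _ in range(50):              # repeat to catch stochastic transitions
            env.reset()
            env.unwrapped.s = s          # force the environment into state s (toy-text only)
            next_s, reward, done, info = env.step(a)
            seen.add(next_s)
        successors[s][a] = seen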

Deep Q learning Replay method Memory Vanishing

In Q-learning with experience replay, as used in reinforcement learning, one stores previous experiences in a data structure that is later used for training (a basic example would be a tuple in Python). For a complex state space, I would need to train the agent in a very large number of different situations to obtain a NN that correctly approximates the Q-values. The experience data will occupy more and more memory, so I should impose an upper limit on the number of experiences to be stored, after which experiences should start being dropped from memory.
Do you think FIFO (first in, first out) would be a good way of managing how experiences are discarded from the agent's memory (that way, after reaching the memory limit, I would discard the oldest experience, which may be useful for allowing the agent to adapt more quickly to changes in the environment)? How could I compute a good maximum number of experiences in memory to make sure that Q-learning on the agent's NN converges towards the Q-function approximator I need? (I know this could be done empirically; I would like to know if an analytical estimator for this limit exists.)
In the preeminent paper on "Deep Reinforcement Learning", DeepMind achieved their results by randomly selecting which experiences should be stored. The rest of the experiences were dropped.
It's hard to say how a FIFO approach would affect your results without knowing more about the problem you're trying to solve. As dblclik points out, this may cause your learning agent to overfit. That said, it's worth trying. There very well may be a case where using FIFO to saturate the experience replay would result in an accelerated rate of learning. I would try both approaches and see if your agent reaches convergence more quickly with one.
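If you do try the FIFO route, the usual implementation is a fixed-capacity circular buffer, which in Python is just a deque with maxlen. A minimal sketch (the capacity is something you would still have to tune empirically, as discussed above):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        # A deque with maxlen automatically drops the OLDEST entry (FIFO)
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch for the Q-network update
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=100_000)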

Can Tensorflow be used for global minimization of multivariate functions?

I've been curious if TF can be used for global optimization of a function. For example, could it be used to efficiently find the ground state of a Lennard-Jones potential? Would it be any better or worse than existing optimization methods, like Basin-Hopping?
Part of my research involves searching for the ground state of large, multi-component molecules. Traditional methods (BH, etc.) are good for this, but also quite slow. I've looked into TF and there are parts that seem robust enough to apply to this problem, although my limited web search doesn't turn up any use of TF for it.
The gradient descent performed to train neural networks considers only a local region of the function. There is thus no guarantee that it will converge toward a global minimum (which is actually fine for most machine-learning algorithms; given the really high dimensionality of the spaces considered, one is usually happy to find a good local minimum without having to explore around too much).
That being said, one could certainly use Tensorflow (or any such frameworks) to implement a local optimizer for the global basin-hopping scheme, e.g. as follows (simplified algorithm):
1. Choose a starting point;
2. Use your local optimizer to get the local minimum;
3. Apply some perturbation to the coordinates of this minimum;
4. From this new position, re-use your local optimizer to get the next local minimum;
5. Keep the best minimum found so far, and repeat from 3.
Actually, some people are currently trying to implement this exact scheme, interfacing TF with scipy.optimize.basinhopping(). Current development and discussions can be found in this Github issue.
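As a rough illustration of the loop above (a simplified sketch, not the scipy/TF integration discussed in that issue): use TensorFlow's automatic differentiation and plain gradient descent as the local optimizer inside a hand-rolled basin-hopping loop, here on a toy function standing in for the real potential.

import numpy as np
import tensorflow as tf

def energy(x):
    # Toy multimodal function standing in for a real potential (e.g. Lennard-Jones)
    return tf.reduce_sum(x ** 4 - 3.0 * x ** 2 + x)

def local_minimize(x0, steps=200, lr=0.05):
    """Steps 2/4: plain gradient descent from x0 to the nearest local minimum."""
    x = tf.Variable(x0, dtype=tf.float32)
    opt = tf.keras.optimizers.SGD(learning_rate=lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            e = energy(x)
        grads = tape.gradient(e, [x])
        opt.apply_gradients(zip(grads, [x]))
    return x.numpy(), float(energy(x).numpy())

best_x, best_e = local_minimize(np.random.randn(2).astype(np.float32))   # steps 1+2
for _ in range(20):                                                      # basin-hopping outer loop
    perturbed = best_x + np.random.normal(scale=0.5, size=best_x.shape).astype(np.float32)  # step 3
    x, e = local_minimize(perturbed)                                     # step 4
    if e < best_e:                                                       # step 5: keep the best minimum
        best_x, best_e = x, e
print(best_x, best_e)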

What algorithms are suitable for this simple machine learning problem?

I have what I think is a simple machine learning question.
Here is the basic problem: I am repeatedly given a new object and a list of descriptions about the object. For example: new_object: 'bob', new_object_descriptions: ['tall','old','funny']. I then have to use some kind of machine learning to find the 10 or fewer previously handled objects that have the most similar descriptions, for example, past_similar_objects: ['frank','steve','joe']. Next, I have an algorithm that can directly measure whether these objects are indeed similar to bob, for example, correct_objects: ['steve','joe']. The classifier is then given this feedback training of successful matches. Then this loop repeats with a new object.
Here's the pseudo-code:
Classifier = new_classifier()
while True:
    new_object, new_object_descriptions = get_new_object_and_descriptions()
    past_similar_objects = Classifier.classify(new_object, new_object_descriptions)
    correct_objects = calc_successful_matches(new_object, past_similar_objects)
    Classifier.train_successful_matches(new_object, correct_objects)
But, there are some stipulations that may limit what classifier can be used:
There will be millions of objects put into this classifier, so classification and training need to scale well to millions of object types and still be fast. I believe this disqualifies something like a spam classifier that is optimized for just two types: spam or not spam. (Update: I could probably narrow this to thousands of objects instead of millions, if that is a problem.)
Again, I prefer speed when millions of objects are being classified, over accuracy.
Update: The classifier should return the 10 (or fewer) most similar objects, based on feedback from past training. Without this limit, an obvious cheat would be for the classifier to just return all past objects :)
What are decent, fast machine learning algorithms for this purpose?
Note: The calc_successful_matches distance metric is extremely expensive to calculate and that's why I'm using a fast machine learning algorithm to try to guess which objects will be close before I actually do the expensive calculation.
An algorithm that seems to meet your requirements (and is perhaps similar to what John the Statistician is suggesting) is Semantic Hashing. The basic idea is that it trains a deep belief network (a type of neural network that some have called 'neural networks 2.0' and that is a very active area of research right now) to hash the list of descriptions of an object into a binary number, such that the Hamming distance between the numbers corresponds to the similarity of the objects. Since this only requires bitwise operations it can be pretty fast, and since you can use it to create a nearest-neighbor-style algorithm it naturally generalizes to a very large number of classes. This is very good state-of-the-art stuff. Downside: it's not trivial to understand and implement, and requires some parameter tuning. The author provides some Matlab code here. A somewhat easier algorithm to implement, and one closely related to this, is Locality Sensitive Hashing.
Now that you say you have an expensive distance function you want to approximate quickly, I'm reminded of another very interesting algorithm that does this: BoostMap. It uses boosting to create a fast metric which approximates an expensive-to-calculate metric. In a certain sense it's similar to the above idea, but the algorithms used are different. The authors of this paper have several papers on related techniques, all pretty good quality (published in top conferences), that you might want to check out.
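If Semantic Hashing feels heavyweight, here is the flavor of the easier alternative mentioned above, random-hyperplane Locality Sensitive Hashing: project each description vector onto a handful of random hyperplanes, keep the sign bits as a binary code, and compare codes by Hamming distance. A sketch with made-up dimensions (how you turn a description list into a feature vector is up to you):

import numpy as np

rng = np.random.default_rng(0)
n_features = 1000          # size of the description vocabulary (illustrative)
n_bits = 64                # length of the binary code

# One random hyperplane per bit of the code
hyperplanes = rng.standard_normal((n_bits, n_features))

def lsh_code(x):
    """Binary code: the sign of the projection onto each random hyperplane."""
    return (hyperplanes @ x > 0).astype(np.uint8)

def hamming(code_a, code_b):
    return int(np.count_nonzero(code_a != code_b))

a = rng.random(n_features)
b = a + 0.05 * rng.standard_normal(n_features)        # near-duplicate of a
c = rng.random(n_features)                             # unrelated vector
print("a vs b:", hamming(lsh_code(a), lsh_code(b)))    # usually small
print("a vs c:", hamming(lsh_code(a), lsh_code(c)))    # usually much larger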
Do you really need a machine learning algorithm for this? What is your metric for similarity? You've mentioned the dimensionality of the number of objects; what about the size of the trait set for each person? Is there a maximum number of trait types? I might try something like this:
1) Build a dictionary mapping each trait to the set of people who have it (an inverted index):
from collections import defaultdict, Counter

# given `people`: a dict mapping person -> list of traits
trait_index = defaultdict(set)
for person, traits in people.items():
    for t in traits:
        trait_index[t].add(person)
2) Then, when I want to find the closest person, I'd take my index and build a temporary count:
counts = Counter()
for t in traits_of_person_of_interest:
    for person in trait_index[t]:
        counts[person] += 1
# the entries with the highest counts are the closest matches
The benefit here is that the index is only created once. If the number of traits per person is small and the number of available trait types is large, the algorithm should be fast.
You could use the vector space model (http://en.wikipedia.org/wiki/Vector_space_model). I think what you are trying to learn is how to weight terms when judging how close two object-description vectors are to each other, for example in terms of a simplified mutual information. This could be very efficient, since you can hash from terms to vectors, which means you never have to compare objects with no shared features. The naive model would then have an adjustable weight per term (either per term per vector, per term overall, or both), as well as a threshold. The vector space model is a widely used technique (for example, in Apache Lucene, which you might be able to use for this problem), so you'll be able to find out a lot about it through further searches.
Let me give a very simple formulation of this in terms of your example. Given bob: ['tall','old','funny'], I retrieve
frank: ['young','short','funny']
steve: ['tall','old','grumpy']
joe: ['tall','old']
as I am maintaining a hash from funny->{frank,...}, tall->{steve, joe,...}, and old->{steve, joe,...}
I calculate something like the overall mutual information: the weight of the shared tags divided by the weight of bob's tags. If that ratio is over the threshold, I include the person in the list.
When training, if I make a mistake I modify the weights of the shared tags. If my error was including frank, I reduce the weight for funny, while if I make a mistake by not including steve or joe, I increase the weights for tall and old.
You can make this as sophisticated as you'd like, for example by including weights for conjunctions of terms.
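A tiny sketch of the scheme described above: an inverted index from tag to people, one adjustable weight per tag, a threshold on the shared-weight ratio, and the simple error-driven weight update. All numbers and names are illustrative, not a tuned implementation:

from collections import defaultdict

weights = defaultdict(lambda: 1.0)      # one adjustable weight per tag
index = defaultdict(set)                # tag -> people who have that tag
tags_of = {}                            # person -> their tags
THRESHOLD = 0.5
LEARNING_RATE = 0.1

def add_person(name, tags):
    tags_of[name] = set(tags)
    for t in tags:
        index[t].add(name)

def retrieve(query_tags, limit=10):
    """Score = weight of shared tags / weight of the query's tags."""
    query_tags = set(query_tags)
    query_weight = sum(weights[t] for t in query_tags)
    scores = {}
    for t in query_tags:
        for person in index[t]:
            scores[person] = scores.get(person, 0.0) + weights[t] / query_weight
    matches = [p for p, s in scores.items() if s >= THRESHOLD]
    return sorted(matches, key=lambda p: scores[p], reverse=True)[:limit]

def train(query_tags, predicted, correct):
    """Error-driven update: demote tags of false positives, promote tags of misses."""
    query_tags = set(query_tags)
    for p in set(predicted) - set(correct):          # wrongly included (e.g. frank)
        for t in tags_of[p] & query_tags:
            weights[t] -= LEARNING_RATE
    for p in set(correct) - set(predicted):          # wrongly excluded (e.g. steve, joe)
        for t in tags_of[p] & query_tags:
            weights[t] += LEARNING_RATE

add_person('frank', ['young', 'short', 'funny'])
add_person('steve', ['tall', 'old', 'grumpy'])
add_person('joe', ['tall', 'old'])
print(retrieve(['tall', 'old', 'funny']))            # -> ['steve', 'joe']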
SVM is pretty fast. LIBSVM for Python, in particular, provides a very decent implementation of Support Vector Machine for classification.
This project departs from typical classification applications in two notable ways:
Rather than outputting the class which the new object is thought to belong to (or possibly outputting an array of these classes, each with probability / confidence level), the "classifier" provides a list of "neighbors" which are "close enough" to the new object.
With each new classification, an objective function, independent of the classifier, provides the list of the correct "neighbors"; in turn this corrected list (a subset of the list provided by the classifier?) is then used to train the classifier.
The idea behind the second point is probably that future objects submitted to the classifier, and similar to the current object, should get better "classified" (be associated with a more correct set of previously seen objects), since the ongoing training reinforces connections to positive (correct) matches while weakening the connections to objects which the classifier initially got wrong.
These two characteristics introduce distinct problems.
- The fact that the output is a list of objects rather than a "prototype" (or category identifier of sorts) makes it difficult to scale as the number of objects seen so far grows toward the millions of instances suggested in the question.
- The fact that the training is done on the basis of a subset of the matches found by the classifier may introduce over-fitting, whereby the classifier could become "blind" to features (dimensions) which it accidentally didn't weight as important/relevant in the early parts of the training. (I may be assuming too much with regard to the objective function in charge of producing the list of "correct" objects.)
Possibly, the scaling concern could be handled by a two-step process, with a first classifier, based on the K-Means algorithm or something similar, which would produce a subset of the overall object collection (of objects previously seen) as plausible matches for the current object (effectively filtering out, say, 70% or more of the collection). These possible matches would then be evaluated on the basis of the Vector Space Model (particularly relevant if the feature dimensions are based on factors rather than values) or some other model. The underlying assumption of this two-step process is that the object collection effectively exposes clusters (rather than being relatively evenly distributed along the various dimensions).
Another way to further limit the number of candidates to evaluate, as the size of the previously seen objects grows, is to remove near duplicates and to only compare with one of these (but to supply the full duplicate list in the result, assuming that if the new object is close to the "representative" of this near duplicate class, all members of the class would also match)
The issue of over-fitting is trickier to handle. A possible approach would be to [sometimes] randomly add objects to the matching list which the classifier would not normally include. The extra objects could be added on the basis of their relative distance to the new object (i.e. making it a bit more probable that a relatively close object is added).
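A rough sketch of the two-step idea above, with scikit-learn's MiniBatchKMeans as the first-stage filter over trait vectors and the exact (expensive) score computed only within the matching cluster. Everything here is illustrative: random stand-in data, and a dot product standing in for the real objective function.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Assume each previously seen object is already encoded as a fixed-length trait vector.
rng = np.random.default_rng(0)
object_vectors = rng.random((10_000, 200))           # stand-in for the real collection

# Step 1: coarse partition of the collection (done once, refreshed occasionally)
km = MiniBatchKMeans(n_clusters=64, random_state=0)
cluster_of = km.fit_predict(object_vectors)

def candidates(new_vector):
    """Only objects in the new object's cluster survive the first filter."""
    c = km.predict(new_vector.reshape(1, -1))[0]
    return np.where(cluster_of == c)[0]

# Step 2: run the expensive / exact similarity only on the surviving candidates
query = rng.random(200)
cand = candidates(query)
scores = object_vectors[cand] @ query                # placeholder for the real metric
top10 = cand[np.argsort(scores)[::-1][:10]]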
What you describe is somewhat similar to the Locally Weighted Learning algorithm, which, given a query instance, trains a model locally on the neighboring instances, weighted by their distances to the query instance.
Weka (Java) has an implementation of this in weka.classifiers.lazy.LWL
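The same idea in a few lines of Python, for intuition only (a toy locally weighted linear regression, not Weka's LWL implementation): weight the training instances by a kernel of their distance to the query, then fit and predict locally.

import numpy as np

def locally_weighted_predict(X, y, query, bandwidth=1.0):
    """Fit a distance-weighted linear model around `query` and predict its value."""
    # Gaussian kernel: nearby training points get large weights
    d2 = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))

    # Weighted least squares with an intercept column
    Xb = np.hstack([X, np.ones((len(X), 1))])
    W = np.diag(w)
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    return float(np.append(query, 1.0) @ theta)

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()
print(locally_weighted_predict(X, y, np.array([3.0])))   # close to sin(3.0) ~= 0.141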
