I have been implementing reinforcement learning algorithms in Python using different variants such as Q-learning, Deep Q-Networks (DQN), Double DQN, and Dueling Double DQN. Consider a cart-pole example: to evaluate the performance of each of these variants, I can think of plotting the sum of rewards against the number of episodes (attaching a picture of the plot) and watching the actual graphical output, i.e. how well the pole stays stable while the cart is moving.
But these two evaluations are not really enough to explain quantitatively which variant is better. I am new to reinforcement learning and trying to understand whether there are other ways to compare different variants of RL models on the same problem.
I am referring to the Colab notebook https://colab.research.google.com/github/ageron/handson-ml2/blob/master/18_reinforcement_learning.ipynb#scrollTo=MR0z7tfo3k9C for the code for all the variants of the cart-pole example.
You can find the answer in the research papers about those algorithms, because when a new algorithm is proposed we usually need experiments to show evidence that it has an advantage over other algorithms.
The most commonly used evaluation method in research papers about RL algorithms is the average return (note: not reward; the return is the accumulated reward, like the score in a game) over timesteps, and there are many ways you can average the return, e.g. with respect to different hyperparameters, or, as in the Soft Actor-Critic paper's comparative evaluation, with respect to different random seeds (used to initialize the model):
Figure 1 shows the total average return of evaluation rollouts during
training for DDPG, PPO, and TD3. We train five different instances of
each algorithm with different random seeds, with each performing one
evaluation rollout every 1000 environment steps. The solid curves
correspond to the mean and the shaded region to the minimum and
maximum returns over the five trials.
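A minimal sketch of this kind of plot with NumPy and matplotlib (the returns array below is a hypothetical placeholder, one row of per-evaluation returns per random seed):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: returns[i, j] = return of the j-th evaluation rollout for seed i.
n_seeds, n_evals = 5, 200
returns = np.cumsum(np.random.RandomState(0).rand(n_seeds, n_evals), axis=1)

steps = np.arange(n_evals) * 1000                  # e.g. one evaluation every 1000 env steps
plt.plot(steps, returns.mean(axis=0), label="DQN (mean over 5 seeds)")
plt.fill_between(steps, returns.min(axis=0), returns.max(axis=0), alpha=0.3)
plt.xlabel("environment steps")
plt.ylabel("average return")
plt.legend()
plt.show()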
And we usually want to compare the performance of algorithms not only on one task but on a diverse set of tasks (i.e. a benchmark), because algorithms may have some form of inductive bias that makes them better at some kinds of tasks but worse on others, e.g. in the Phasic Policy Gradient paper's experimental comparison to PPO:
We report results on the environments in Procgen Benchmark
(Cobbe et al., 2019). This benchmark was designed to be highly
diverse, and we expect improvements on this benchmark to transfer well
to many other RL environments.
I have a dataset which includes socioeconomic indicators for students nationwide as well as their grades. More specifically, this dataset has 36 variables for about 30 million students as predictors, and the students' grades as the responses.
My goal is to be able to predict whether a student will fail out (i.e. be in the bottom 2nd percentile of the nation in terms of grades). I understand that classification with an imbalanced dataset (98% : 2%) will introduce a bias. Based on some research, I planned to account for this by increasing the cost of an incorrect classification in the minority class.
Can someone please confirm that this is the correct approach (and that there isn't a better one, I'm assuming there is)? And also, given the nature of this dataset, could someone please help me choose a machine learning algorithm to accomplish this?
I am working with TensorFlow 2.0 in a Google Colab. I've compiled all the data together into a .feather file using pandas.
In the case of an imbalanced dataset, using class weights is the most common approach. But with such a large dataset (30M training examples) for a binary classification problem, with 2% in the first class and 98% in the second, I would say it is hard to keep the model from being biased against the first class using class weights alone, since it is not very different from reducing the training set size to be balanced.
Here are some steps for evaluating model accuracy.
Split your dataset into train, evaluation and test sets.
For the evaluation metric I suggest these alternatives:
a. Make sure the first class is represented by at least 20% in both the evaluation and test sets.
b. Use precision and recall as the evaluation metrics for your model (rather than using the F1 score).
c. Use Cohen's kappa score (coefficient) as the evaluation metric.
From my own perspective, I prefer using b (a minimal sketch of b and c follows below).
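A minimal sketch of options b and c with scikit-learn (the toy arrays stand in for your real evaluation labels and model predictions):

from sklearn.metrics import precision_score, recall_score, cohen_kappa_score

y_val = [0, 0, 0, 1, 1, 0, 1, 0]    # toy evaluation labels, minority class = 1
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]   # toy model predictions

print("precision:", precision_score(y_val, y_pred))    # option b
print("recall:   ", recall_score(y_val, y_pred))       # option b
print("kappa:    ", cohen_kappa_score(y_val, y_pred))  # option c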
Since you are using TensorFlow, I assume that you are familiar with deep learning, so use deep learning instead of classical machine learning; that gives you many additional alternatives. Anyway, here are some steps for both the machine learning and the deep learning approach.
For Machine Learning Algorithms
1. Decision tree algorithms (especially Random Forest).
2. If my features had no correlation (correlation approaching zero, e.g. 0.01), I would try the Complement Naive Bayes classifier for multinomial features, or Gaussian Naive Bayes with class weights for continuous features (a short sketch of 1 and 2 follows after this list).
3. Try some nonparametric learning algorithms. You may not be able to fit this training set with Support Vector Machines (SVM) easily because the dataset is rather large, but you could try.
4. Try unsupervised learning algorithms (this sometimes gives you a more generic model).
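A hedged sketch of points 1 and 2 with scikit-learn (the toy X_train/y_train arrays stand in for the real training split):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import ComplementNB

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 36)                    # toy non-negative features
y_train = (rng.rand(1000) < 0.02).astype(int)   # ~2% minority class

# Point 1: a class-weighted random forest.
forest = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=-1)
forest.fit(X_train, y_train)

# Point 2: Complement Naive Bayes (designed with imbalanced data in mind);
# note that it expects non-negative, count-like features.
cnb = ComplementNB()
cnb.fit(X_train, y_train)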
For Deep Learning Algorithms
1. Encoder-decoder architectures, or simply generative adversarial networks (GANs).
2. Siamese networks.
3. Training the model with 1D convolution layers.
4. Using class weights.
5. Using balanced batches of the training set, randomly chosen.
You have many other alternatives. From my own perspective, I would try hard to get it working with 1, 3 or 5.
For deep learning, the 5th approach (balanced batches) sometimes works very well, and I recommend trying it together with 1 and 3; a short sketch of 4 and 5 follows below.
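A hedged illustration of points 4 and 5 in TensorFlow 2.x (the toy arrays stand in for the data loaded from the .feather file):

import numpy as np
import tensorflow as tf

rng = np.random.RandomState(0)
features = rng.randn(10000, 36).astype(np.float32)   # toy stand-in for the 36 indicators
labels = (rng.rand(10000) < 0.02).astype(np.int32)   # ~2% minority class

pos = labels == 1
pos_ds = tf.data.Dataset.from_tensor_slices((features[pos], labels[pos])).shuffle(1000).repeat()
neg_ds = tf.data.Dataset.from_tensor_slices((features[~pos], labels[~pos])).shuffle(10000).repeat()

# Point 5: sample from each class with equal probability so every batch is roughly balanced.
balanced_ds = tf.data.experimental.sample_from_datasets(
    [pos_ds, neg_ds], weights=[0.5, 0.5]).batch(256)

# Point 4: alternatively, keep the natural distribution and pass class weights to fit(), e.g.
# model.fit(train_ds, class_weight={0: 1.0, 1: 49.0})   # roughly the 98:2 ratio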
I am working on a project with Wasserstein GANs and, more specifically, with an implementation of the improved version of Wasserstein GANs (WGAN-GP). I have two theoretical questions about WGANs regarding their stability and training process. Firstly, the value of the loss function is notoriously correlated with the quality of the generated samples (that is stated here). Is there some extra bibliography that supports that argument?
Secondly, during my experimental phase, I noticed that training my architecture using WGANs is much faster than using a simple version of GANs. Is that common behavior? Is there also some literature analysis about that?
Furthermore, I have one question about the continuous functions that are guaranteed by using the Wasserstein loss. I am having some issues understanding this concept in practice: what does it mean that the normal GAN loss is not a continuous function?
You can check the Inception Score and the Fréchet Inception Distance for now. And also here. The problem is that, since GANs do not have a unified objective function (there are two networks), there is no agreed way of evaluating and comparing GAN models. Instead, people devise metrics that relate the image distribution and the generator's distribution.
WGAN could be faster due to having a more stable training procedure as opposed to the vanilla GAN (Wasserstein metric, weight clipping, and gradient penalty if you are using it). I don't know if there is a literature analysis on speed, and it may not always be the case that WGAN is faster than a simple GAN. Also, WGAN cannot find the best Nash equilibrium the way a GAN does.
Think of two distributions: p and q. If these distributions overlap, i.e. their domains overlap, then the KL or JS divergence is differentiable. The problem arises when p and q don't overlap. As in the WGAN paper's example, say two pdfs on a 2D space, V = (0, Z) and Q = (K, Z), where K is different from 0 and Z is sampled from a uniform distribution. If you try to take the derivative of the KL/JS divergence of these two pdfs, you cannot, because the divergence behaves like a binary indicator (equal or not) and we cannot take the derivative of such a function. However, if we use the Wasserstein loss or Earth-Mover distance, we can, since it is a distance between two points in space. Short story: the normal GAN loss function is continuous iff the distributions have an overlap; otherwise it is discrete.
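For concreteness, in that example from the WGAN paper (V the distribution of (0, Z), Q the distribution of (K, Z), Z uniform on [0, 1]), the divergences work out to:
$W(V, Q) = |K|$
$JS(V, Q) = \log 2$ if $K \neq 0$, and $0$ if $K = 0$
$KL(V \,\|\, Q) = KL(Q \,\|\, V) = +\infty$ if $K \neq 0$, and $0$ if $K = 0$
Only the Wasserstein distance varies continuously with $K$; the JS and KL divergences jump as soon as the supports stop overlapping, which is the sense in which the "normal" GAN loss fails to be continuous.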
Hope this helps
The most common way to stabilize the training of a WGAN is to replace the weight clipping technique that was used in the original WGAN with a gradient penalty (WGAN-GP). This technique seems to outperform the original WGAN. The paper that describes the gradient penalty can be found here (a minimal sketch of the penalty term is included at the end of this answer):
https://arxiv.org/pdf/1704.00028.pdf
Also, If you need any help of how to implement this, You can check a nice repository that I have found here:
https://github.com/kochlisGit/Keras-GAN
There are also other tricks that you can use to improve the overall quality of your generated images, described in the repository. For example:
Add random Gaussian noise to the inputs of the discriminator, decaying over time.
Random/Adaptive Data Augmentations
Separate fake/real batches
etc.
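If it helps, here is a minimal, hedged sketch of the gradient penalty term itself in TensorFlow 2 (critic, real_images and fake_images are hypothetical placeholders; see the paper and the repository above for complete implementations):

import tensorflow as tf

def gradient_penalty(critic, real_images, fake_images):
    # Interpolate between real and fake samples (assumes NHWC image batches).
    batch_size = tf.shape(real_images)[0]
    eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
    interpolated = eps * real_images + (1.0 - eps) * fake_images
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = critic(interpolated, training=True)
    grads = tape.gradient(scores, interpolated)
    # Penalize the deviation of the critic's gradient norm from 1, as in WGAN-GP.
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norm - 1.0))

# Typical use inside the critic's training step, with lambda_gp around 10:
# critic_loss = tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores) \
#               + lambda_gp * gradient_penalty(critic, real_batch, fake_batch)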
I'm learning about Actor-Critic Reinforcement Learning techniques, in particular the A2C algorithm.
I've found a good description of a simple version of the algorithm (i.e. without experience replay, batching or other tricks) with implementation here: https://link.medium.com/yi55uKWwV2. The complete code from that article is available on GitHub.
I think I understand ok-ish what's happening here, but to make sure I actually do, I'm trying to reimplement it from scratch using the higher-level tf.keras APIs. Where I'm getting stuck is how to implement the training loop correctly and how to formulate the actor's loss function.
What is the correct way to pass action and advantage into the loss function?
The actor's loss function involves computing the probability of the action taken under a normal distribution. How can I ensure that the mu and sigma of the normal distribution during loss function computation actually match the ones used during prediction?
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems to be wrong, but I'm not sure how.
The code I have so far: https://gist.github.com/nevkontakte/beb59f29e0a8152d99003852887e7de7
Edit: I suppose some of my confusion stems from a poor understanding of magic behind gradient computation in Keras/TensorFlow, so any pointers there would be appreciated.
First, credit where credit is due: information provided by ralf htp and Simon was instrumental in helping me to figure out the right answers eventually.
Before I go into detailed answers to my own questions, here's the original code I was trying to rewrite in tf.keras terms, and here's my result.
What is the correct way to pass action and advantage into a loss function in Keras?
There is a difference between what a raw TF optimizer considers a loss function and what Keras does. When using an optimizer directly, it simply expects a tensor (lazy or eager, depending on your configuration), which will be evaluated under tf.GradientTape() to compute the gradient and update the weights.
Example from https://medium.com/#asteinbach/actor-critic-using-deep-rl-continuous-mountain-car-in-tensorflow-4c1fb2110f7c:
# Below norm_dist is the output tensor of the neural network we are training.
loss_actor = -tfc.log(norm_dist.prob(action_placeholder) + 1e-5) * delta_placeholder
training_op_actor = tfc.train.AdamOptimizer(
    lr_actor, name='actor_optimizer').minimize(loss_actor)

# Later, in the training loop...
_, loss_actor_val = sess.run([training_op_actor, loss_actor],
                             feed_dict={action_placeholder: np.squeeze(action),
                                        state_placeholder: scale_state(state),
                                        delta_placeholder: td_error})
In this example, sess.run() evaluates the whole graph: it makes the inference, computes the gradient, and adjusts the weights. So to pass whatever values you need into the loss function / gradient computation, you just feed the necessary values into the computation graph.
Keras is a bit more formal in what loss function should look like:
loss: String (name of objective function), objective function or tf.keras.losses.Loss instance. See tf.keras.losses. An objective function is any callable with the signature scalar_loss = fn(y_true, y_pred). If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses.
Keras will do the inference (forward pass) for you and pass the output into the loss function. The loss function is supposed to do some extra computation on the predicted value and y_true label, and return the result. This whole process will be tracked for the purpose of gradient computation.
Although it is very convenient for traditional training, this is a bit restrictive when we want to pass some extra data in, like TD error. It is possible to work around that and shove all the extra data into y_true, and pull it apart inside the loss function (I found this trick somewhere on the web, but unfortunately lost the link to source).
Here's how I rewrote the above in the end:
def loss(y_true, y_pred):
    action_true = y_true[:, :n_outputs]
    advantage = y_true[:, n_outputs:]
    return -tfc.log(y_pred.prob(action_true) + 1e-5) * advantage

# Below, in the training loop...
# A trick to pass the TD error *and* the actual action to the loss function: join them
# into a single tensor here and split them apart inside the loss function.
annotated_action = tf.concat([action, td_error], axis=1)
actor_model.train_on_batch([scale_state(state)], [annotated_action])
The actor's loss function involves computing the probability of the action taken under a normal distribution. How can I ensure that the mu and sigma of the normal distribution during loss function computation actually match the ones used during prediction?
When I asked this question, I didn't understand well enough how the TF compute graph works. So the answer is simple: every time sess.run() is invoked, it must compute the whole graph from scratch. The parameters of the distribution will be the same (or similar) as long as the graph inputs (e.g. the observed state) and the NN weights are the same (or similar).
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems to be wrong, but I'm not sure how.
What's wrong is the assumption "the actor's loss function doesn't care about y_pred" :) The actor's loss function involves norm_dist (which is the action probability distribution), and norm_dist is effectively the analog of y_pred in this context.
As far as I understand A2C, it is the machine learning implementation of activator-inhibitor systems, which are also called two-component reaction-diffusion systems (https://en.wikipedia.org/wiki/Reaction%E2%80%93diffusion_system). Activator-inhibitor models are important in many fields of science as they describe pattern formation, e.g. the Turing mechanism (simply search the net for activator-inhibitor model and you will find a vast amount of information; a very common application is predator-prey models). Also cf. the graphic
Source of the graphic: https://www.researchgate.net/figure/Activator-Inhibitor-System_fig1_23671770/
with the explanatory graphic of the A2C algorithm in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69
Activator-inhibitor models are closely linked to the theory of nonlinear dynamical systems (or 'chaos theory'); this also becomes obvious when comparing the bifurcation-tree-like structure in https://medium.com/#asteinbach/rl-introduction-simple-actor-critic-for-continuous-actions-4e22afb712 with the bifurcation tree of a nonlinear dynamical system such as the logistic map (https://en.wikipedia.org/wiki/Logistic_map; the logistic map is one of the simplest predator-prey or activator-inhibitor models). Another similarity is the sensitivity to initial conditions in A2C models, which is described as
This introduces in inherent high variability in log probabilities (log of the policy distribution) and cumulative reward values, because each trajectories during training can deviate from each other at great degrees.
in https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f, and the curse of dimensionality also appears in chaos theory, e.g. in attractor reconstruction.
From the viewpoint of systems theory, the A2C algorithm tries to adapt the initial value (start state) in such a way that the system ends up at a given endpoint when the growth rate of a dynamical system such as the logistic map is increased (the r-value is increased and the initial value (start state) is constantly re-adapted to choose the correct bifurcations (actions) in the bifurcation tree).
So A2C tries to numerically solve a chaos-theory problem, namely finding the initial value for a given outcome of a nonlinear dynamical system in its chaotic region. Analytically this problem is in most cases not solvable.
The actions are the bifurcation points in the bifurcation tree, and the states are the future bifurcations.
Both actions and states are modeled by two coupled neural networks, and this coupling of two neural nets is the great innovation of A2C algorithms.
In https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 there is well-documented Keras code for implementing A2C, so you have a possible implementation there.
The loss function here is defined as the temporal-difference (TD) function, i.e. the exact difference between the state value at the actual bifurcation point and the state value at the estimated future one. However, even this mathematically exact definition is prone to stochastic error (or noise), so the stochastic error is included in the definition of "exact", because in the end machine learning is based on stochastic systems or error calculus, meaning systems that are composed of a deterministic and a stochastic component. To drive this error to zero, stochastic gradient descent is used. In Keras this is simply implemented by choosing optimizer='sgd'.
This interaction of the actual and the future step is implemented as memory on https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 in the remember function, and this function also links the actor and the critic network (or activator and inhibitor network). This general structure of trial (action), call predict (TD function), remember and train (i.e. stochastic gradient descent) is fundamental to all reinforcement learning algorithms, and is linked to the structure actual state, action, reward, new state:
The prediction code is also very much the same as it was in previous reinforcement learning algorithms. That is, we just have to iterate through the trial and call predict, remember, and train on the agent:
In the implementation, your first question is solved by applying remember on the critic and then training the critic with these values (this happens in the main function). Since training always evaluates the loss function, the action and the reward are passed to the loss function by remember in this implementation:
actor_critic.remember(cur_state, action, reward, new_state, done)
actor_critic.train()
Regarding your second question: I am not sure, but I think this is achieved by the optimization algorithm (i.e. stochastic gradient descent).
Third question: in the predator-prey model, the actor or activator is the prey, and the behavior of the prey is determined only by the size or capacity of the habitat (the amount of grass) and the size of the predator (inhibitor) population, so modelling it in this way is consistent with nature, or with an activator-inhibitor system, again. In the main function in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 also only the critic or inhibitor/predator is trained.
Scikit-learn allows sample weights to be provided to linear, logistic, and ridge regressions (among others), but not to elastic net or lasso regressions. By sample weights, I mean each element of the input to fit on (and the corresponding output) is of varying importance, and should have an effect on the estimated coefficients proportional to its weight.
Is there a way I can manipulate my data before passing it to ElasticNet.fit() to incorporate my sample weights?
If not, is there a fundamental reason it is not possible?
Thanks!
You can read some discussion about this in sklearn's issue-tracker.
It basically reads like:
not that hard to do (theory-wise)
a pain to keep all the basic sklearn APIs intact and to support all possible cases (dense vs. sparse)
As you can see in this thread and the linked one about adaptive lasso, there is not much activity there (probably because not many people care and the related paper is not popular enough; but that's only a guess).
Depending on your exact task (size? sparseness?), you could build your own optimizer quite easily based on scipy.optimize, supporting this kind of sample weights (it will be a bit slower, but robust and precise)!
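For what it's worth, here is a rough sketch of that do-it-yourself route (the objective mirrors sklearn's ElasticNet formulation with an added per-sample weight; a general-purpose solver like this only approximates the sparse solution, since the L1 term is not differentiable at zero):

import numpy as np
from scipy.optimize import minimize

def weighted_elastic_net(X, y, sample_weight, alpha=1.0, l1_ratio=0.5):
    # Minimize 1/(2n) * sum_i w_i * (y_i - x_i @ beta)^2
    #        + alpha * (l1_ratio * ||beta||_1 + 0.5 * (1 - l1_ratio) * ||beta||_2^2)
    n_samples, n_features = X.shape

    def objective(beta):
        residuals = y - X @ beta
        data_term = 0.5 * np.sum(sample_weight * residuals ** 2) / n_samples
        penalty = alpha * (l1_ratio * np.sum(np.abs(beta))
                           + 0.5 * (1.0 - l1_ratio) * np.sum(beta ** 2))
        return data_term + penalty

    return minimize(objective, np.zeros(n_features), method="Powell").x

# Toy usage:
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.randn(200)
w = rng.rand(200)
print(weighted_elastic_net(X, y, w, alpha=0.1))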
I'm currently working on a project estimating a signal by using some classification learning algorithms, such as logistic regression and random forest, using scikit-learn.
I'm now using the confusion matrix to estimate the performance of the different algorithms in prediction, and I found a common problem for both algorithms. That is, in all cases, although the accuracy of the algorithms seems relatively good (around 90% - 93%), the total number of FN is pretty high compared to TP (recall is below 3%). Does anyone have a clue about why I'm having this kind of issue in my prediction problem? If possible, can you give me some hints on how to possibly solve it?
Thanks in advance for your reply and help.
Updates:
The dataset is extremely imbalanced (8:1), with around 180K observations in total. I have already tested several re-sampling methods, such as OSS and SMOTE (+Tomek or +ENN), but none of them returns good results. In both cases, although the recall goes up from 2.5% to 20%, the precision decreases significantly (from 60% to 20%).
You probably have an imbalanced dataset, where one of your classes has many more examples than the other class.
One solution is to assign a higher cost to misclassifying the class with fewer examples.
This question in Cross Validated covers many approaches to your problem:
https://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
EDIT:
Given that you are using scikit-learn you can, as a first approach, set the parameter class_weight to 'balanced' on your logistic regression.
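For example, a minimal sketch (make_classification just stands in for your real signal dataset):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data (~8:1), standing in for the real ~180K-observation dataset.
X_train, y_train = make_classification(n_samples=2000, weights=[0.89, 0.11], random_state=0)

# class_weight='balanced' penalizes mistakes on the rare class more heavily,
# which typically trades some precision for higher recall.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)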