Convert this loss function equation into Python code

Please check the equation at this link and convert it into a Python loss function for a simple Keras model.
EQUATION PICTURE OR IMAGE LINK FOR CONVERTING IT TO PYTHON'S KERAS REQUIRED LOSS EQUATION
where the max part (the highlighted part of the equation in the picture) is the hinge loss, y_i represents the label of each
example, φ(x) denotes the feature representation, b is a bias, k is the total number of training examples and w is the classifier to be learned.
For easy checking, the sample equation is:
min_w [ (1/k) * sum_{i=1..k} max(0, 1 - y_i(w.φ(x_i) - b)) ] + 1/2 * ||w||^2
I can implement the max part (the highlighted section of the equation in the picture), but I cannot figure out the 1/2 * ||w||^2 part.
You can check this link too for help:
similar link
Here I have attached some sample code to clarify my issue:
print("Create Model")
model = Sequential()
model.add(Dense(512,
input_dim=4096, init='glorot_normal',W_regularizer=l2(0.001),activation='relu'))
model.add(Dropout(0.6))
model.add(Dense(32, init='glorot_normal',W_regularizer=l2(0.001)))
model.add(Dropout(0.6))
model.add(Dense(1, init='glorot_normal',W_regularizer=l2(0.001),activation='sigmoid'))
adagrad=Adagrad(lr=0.01, epsilon=1e-08)
model.compile(loss= required_loss_function, optimizer=adagrad)
def required_loss_function(y_true, y_pred):
IN THIS LOSS FUNCTION,
CONVERT THE EQUATION IN THE
PICTURE INTO PYTHON CODE.
AS A MENTION, THE THING YOU HAVE TO FIND IS THE- 1/2 * ||w|| ^ 2 .
I can find the Python code for the remaining part of the equation in the linked picture. The hinge loss part can easily be calculated using this:
import keras
keras.losses.hinge(y_true, y_pred)
If you require further help, please comment for details.

Your screenshot shows the whole objective function, but only the sum(max(...)) term is called the loss term. Therefore, only that term needs to get implemented in required_loss_function. Indeed, you can probably use the pre-baked hinge loss from the keras library rather than writing it yourself—unless you're supposed to write it yourself as part of the exercise, of course.
The other term, the 0.5*||w||^2 term, is a regularization term. Specifically, it's an L2 regularization term. keras has a completely separate way of dealing with regularization, which you can read about at https://keras.io/regularizers/ . Basically it amounts to creating an l2 instance keras.regularizers.l2(lambdaParameter) and attaching it to the layers as you add them to your model (your screenshotted equation doesn't have a parameter that scales the regularization term, so if that's literally what you're supposed to implement, your lambdaParameter would be 1.0).
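For illustration, a minimal sketch in the same old Keras 1 API your listing uses (the strength value here is only illustrative; note that keras' l2(l) penalty computes l * sum(w**2), so matching 0.5*||w||^2 exactly would mean l2(0.5)):
from keras.layers import Dense
from keras.regularizers import l2

# sketch only: attach the L2 penalty to the layer weights instead of writing it
# into the loss function (W_regularizer is the Keras 1 spelling; Keras 2 calls it kernel_regularizer)
model.add(Dense(512, input_dim=4096, init='glorot_normal',
                W_regularizer=l2(0.5), activation='relu'))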
But the listing you supply already seems to be applying l2 regularizers like this, multiple times in different contexts (I'm not familiar with keras at all, so I don't really know what's going on; I guess it's a more sophisticated model than the one represented in your screenshot).
Either way, the answer to your question is that the regularization term is handled separately and does not belong in the loss function (the signature of the loss function gives us that hint too: there's no w argument passed into it—only y_true and y_pred).
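That said, if you do need to write the hinge term yourself, a minimal sketch could look like this (assuming labels are encoded as +1/-1, which is what the hinge formulation expects):
import keras.backend as K

def required_loss_function(y_true, y_pred):
    # hinge term only: mean over the batch of max(0, 1 - y_i * prediction_i)
    return K.mean(K.maximum(0., 1. - y_true * y_pred))

# or simply reuse the built-in version:
# model.compile(loss='hinge', optimizer=adagrad)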

Related

Using sigmoid output for cross entropy loss on Pytorch

I'm trying to modify YOLO v1 to work with my task, in which each object has only one class (e.g. an object cannot be both cat and dog).
Due to the architecture (other outputs, like the localization prediction, must use regression), a sigmoid was applied to the last output of the model (f.sigmoid(nearly_last_output)). For classification, YOLO v1 also uses MSE as the loss. But as far as I know, MSE sometimes does not work as well as cross entropy for one-hot targets like the ones I want.
To be specific, the GT looks like this: 0 0 0 0 1 (let's say we have only 5 classes in total; each object has only one class, so there is only a single 1 among them; in this example it is the 5th class),
and the model output at the classification part is: 0.1 0.1 0.9 0.2 0.1
I found some suggestions to use nn.BCE / nn.BCEWithLogitsLoss, but I think I should ask here to be sure, since I'm not good at math and may be wrong somewhere. What should I use?
MSE loss is usually used for regression problems.
For binary classification, you can use either BCE or BCEWithLogitsLoss. BCEWithLogitsLoss combines a sigmoid with the BCE loss, so if a sigmoid is already applied on the last layer, you can directly use BCE.
The GT you describe corresponds to a 'multi-class' classification problem, and the output you show doesn't really correspond to a multi-class classification output. So, in this case, you can apply CrossEntropyLoss, which combines softmax and log loss and is suitable for 'multi-class' classification problems.
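To make the distinction concrete, here is a minimal sketch (the tensor shapes and values are made up, not taken from the question):
import torch
import torch.nn as nn

logits = torch.randn(8, 5)                        # raw model outputs for 5 classes (no sigmoid/softmax)
targets = torch.tensor([4, 0, 2, 1, 3, 4, 0, 2])  # class indices, one class per object

# multi-class case: CrossEntropyLoss applies log-softmax internally,
# so it expects raw logits and integer class indices
ce_loss = nn.CrossEntropyLoss()(logits, targets)

# binary / multi-label case: BCEWithLogitsLoss applies the sigmoid itself,
# so the last layer should not have a sigmoid when you use it
one_hot = nn.functional.one_hot(targets, num_classes=5).float()
bce_loss = nn.BCEWithLogitsLoss()(logits, one_hot)

print(ce_loss.item(), bce_loss.item())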

Neural Network Loss Function and how to add a Constant Value to it

I have a neural network model with only convolutional layers and need some help with the loss function.
I am reading a paper which suggests adding a constant that is proportional to something called an 'energy', which can be calculated from the result of the trained model. It is a bit more complicated than a simple loss function. This is done to assist training and avoid getting stuck in a local minimum.
Two questions arise:
1: How do I simply add a value to the loss for every epoch (or mini-batch?) step?
2: How does this help the network to train? Adding some constant value at every epoch step doesn't seem to help in the back-propagation step, since that depends on differentiation.
(edit starts here)
Basically the model looks like this (it's not essential to understanding my question, just extra context):
model.append(models.Sequential())
model[i].add(layers.Conv1D(1, 2, activation='relu', input_shape=(32+2, 1)))
model[i].add(layers.Conv1D(1, 2, activation='sigmoid', input_shape=(32+1, 1)))
model[i].compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                 loss=tf.keras.losses.BinaryCrossentropy(),
                 metrics=['accuracy'])
es = EarlyStopping(monitor='loss', mode='min', verbose=1, patience=100, min_delta=0)
model[i].fit(train_rgS[i].reshape(10000, 32+padding_size, 1),
             train_mcS[i].reshape(10000, 32, 1),
             batch_size=10**3, epochs=500, verbose=0, callbacks=[es])
I can apply this model to a set of input data and from this calculate an energy. This is a little bit more complicated and cannot be described by any loss function. However, I want to add this value to my loss function to assist training.
Since I am coming from PyTorch, where it was very easy to manipulate the loss function, I wonder how it would be possible to add a constant value to the loss in TensorFlow, where everything is already built in together.
Here is a picture of the entire extract of the paper which I am referring to:
I don't want to explain what this energy is, because that goes too deep for this simple question and requires a lot of background information.
(edit ends here)
I would already be very grateful if you answered just my first question.
Thank you very much.
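(One possible sketch for question 1, not from any posted answer: wrap the built-in loss in a custom function and add the extra value there, assuming the 'energy' is computed outside the loss and stored in a non-trainable variable. The names below are hypothetical.)
import tensorflow as tf

# hypothetical: energy_value stands for whatever extra quantity you compute;
# update it from outside the loss (e.g. in a callback) per epoch or per batch
energy_value = tf.Variable(0.0, trainable=False)
base_loss = tf.keras.losses.BinaryCrossentropy()

def loss_with_energy(y_true, y_pred):
    # standard loss plus the extra term; a term that does not depend on the weights
    # has zero gradient, so (as question 2 suspects) it only shifts the reported loss value
    return base_loss(y_true, y_pred) + energy_value

model[i].compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                 loss=loss_with_energy, metrics=['accuracy'])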

A2C algorithm in tf.keras: actor loss function

I'm learning about Actor-Critic Reinforcement Learning techniques, in particular the A2C algorithm.
I've found a good description of a simple version of the algorithm (i.e. without experience replay, batching or other tricks) with implementation here: https://link.medium.com/yi55uKWwV2. The complete code from that article is available on GitHub.
I think I understand ok-ish what's happening here, but to make sure I actually do, I'm trying to reimplement it from scratch using the higher-level tf.keras APIs. Where I'm getting stuck is how to implement the training loop correctly and how to formulate the actor's loss function.
What is the correct way to pass action and advantage into the loss function?
The actor's loss function involves computing the probability of the action taken under a normal distribution. How can I ensure that the mu and sigma of the normal distribution during loss function computation actually match the ones used during prediction?
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems wrong to me, but I'm not sure how.
The code I have so far: https://gist.github.com/nevkontakte/beb59f29e0a8152d99003852887e7de7
Edit: I suppose some of my confusion stems from a poor understanding of magic behind gradient computation in Keras/TensorFlow, so any pointers there would be appreciated.
First, credit where credit is due: information provided by ralf htp and Simon was instrumental in helping me to figure out the right answers eventually.
Before I go into detailed answers to my own questions, here's the original code I was trying to rewrite in tf.keras terms, and here's my result.
What is the correct way to pass action and advantage into a loss function in Keras?
There is a difference between what a raw TF optimizer considers a loss function and what Keras does. When using an optimizer directly, it simply expects a tensor (lazy or eager depending on your configuration), which will be evaluated under tf.GradientTape() to compute the gradient and update the weights.
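(As a TF2-style illustration of that idea, here is a minimal tf.GradientTape sketch with made-up names; the TF1 session example from the article follows below. It assumes the model returns a probability distribution over actions, e.g. via tensorflow_probability.)
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-3)

def actor_train_step(model, state, action, advantage):
    # any extra values the loss needs (action, advantage) are simply used inside the tape
    with tf.GradientTape() as tape:
        dist = model(state)
        loss = -tf.reduce_mean(tf.math.log(dist.prob(action) + 1e-5) * advantage)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss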
Example from https://medium.com/#asteinbach/actor-critic-using-deep-rl-continuous-mountain-car-in-tensorflow-4c1fb2110f7c:
# Below, norm_dist is the output tensor of the neural network we are training.
loss_actor = -tfc.log(norm_dist.prob(action_placeholder) + 1e-5) * delta_placeholder
training_op_actor = tfc.train.AdamOptimizer(
    lr_actor, name='actor_optimizer').minimize(loss_actor)

# Later, in the training loop...
_, loss_actor_val = sess.run([training_op_actor, loss_actor],
                             feed_dict={action_placeholder: np.squeeze(action),
                                        state_placeholder: scale_state(state),
                                        delta_placeholder: td_error})
In this example it computes the whole graph, including making an inference, capturing the gradient and adjusting the weights. So to pass whatever values you need into the loss function/gradient computation, you just pass the necessary values into the computation graph.
Keras is a bit more formal in what loss function should look like:
loss: String (name of objective function), objective function or tf.keras.losses.Loss instance. See tf.keras.losses. An objective function is any callable with the signature scalar_loss = fn(y_true, y_pred). If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses.
Keras will do the inference (forward pass) for you and pass the output into the loss function. The loss function is supposed to do some extra computation on the predicted value and y_true label, and return the result. This whole process will be tracked for the purpose of gradient computation.
Although it is very convenient for traditional training, this is a bit restrictive when we want to pass some extra data in, like TD error. It is possible to work around that and shove all the extra data into y_true, and pull it apart inside the loss function (I found this trick somewhere on the web, but unfortunately lost the link to source).
Here's how I rewrote the above in the end:
def loss(y_true, y_pred):
    action_true = y_true[:, :n_outputs]
    advantage = y_true[:, n_outputs:]
    return -tfc.log(y_pred.prob(action_true) + 1e-5) * advantage

# Below, in the training loop...
# A trick to pass the TD error *and* the actual action to the loss function:
# join them into a single tensor here and split them apart inside the loss function.
annotated_action = tf.concat([action, td_error], axis=1)
actor_model.train_on_batch([scale_state(state)], [annotated_action])
The actor's loss function involves computing the probability of the action taken under a normal distribution. How can I ensure that the mu and sigma of the normal distribution during loss function computation actually match the ones used during prediction?
When I asked this question, I didn't understand well enough how the TF compute graph works. So the answer is simple: every time sess.run() is invoked, it must compute the whole graph from scratch. The parameters of the distribution will be the same (or similar) as long as the graph inputs (e.g. the observed state) and the NN weights are the same (or similar).
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems wrong to me, but I'm not sure how.
What's wrong is the assumption "the actor's loss function doesn't care about y_pred" :) The actor's loss function involves norm_dist (which is the action probability distribution), which is effectively an analog of y_pred in this context.
As far as I understand, A2C is a machine learning implementation of activator-inhibitor systems, also called two-component reaction-diffusion systems (https://en.wikipedia.org/wiki/Reaction%E2%80%93diffusion_system). Activator-inhibitor models are important in many fields of science as they describe pattern formation, e.g. the Turing mechanism (simply search the net for activator-inhibitor model and you will find a vast amount of information; a very common application is predator-prey models). Also cf. the graphic
source of graphic : https://www.researchgate.net/figure/Activator-Inhibitor-System_fig1_23671770/
with the explanatory graphic of the A2C algorithm in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69
Activator-inhibitor models are closely linked to the theory of nonlinear dynamical systems (or 'chaos theory'); this also becomes obvious when comparing the bifurcation-tree-like structure in https://medium.com/#asteinbach/rl-introduction-simple-actor-critic-for-continuous-actions-4e22afb712 with the bifurcation tree of a nonlinear dynamical system such as the logistic map (https://en.wikipedia.org/wiki/Logistic_map; the logistic map is one of the simplest predator-prey or activator-inhibitor models). Another similarity is the sensitivity to initial conditions in A2C models, which is described as
This introduces in inherent high variability in log probabilities (log of the policy distribution) and cumulative reward values, because each trajectories during training can deviate from each other at great degrees.
in https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f. The curse of dimensionality also appears in chaos theory, e.g. in attractor reconstruction.
From the viewpoint of systems theory, the A2C algorithm tries to adapt the initial value (start state) so that it ends up at a given endpoint when the growth rate of a dynamical system such as the logistic map is increased (the r-value is increased and the initial value (start state) is constantly re-adapted to choose the correct bifurcations (actions) in the bifurcation tree).
So A2C tries to numerically solve a chaos theory problem, namely finding the initial value for a given outcome of a nonlinear dynamical system in its chaotic region. Analytically this problem is in most cases not solvable.
The actions are the bifurcation points in the bifurcation tree; the states are the future bifurcations.
Both actions and states are modeled by two coupled neural networks, and this coupling of two neural nets is the great innovation of A2C algorithms.
In https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 there is well-documented Keras code for implementing A2C, so you have a possible implementation there.
The loss function here is defined as the temporal difference (TD) function, i.e. the difference between the state at the actual bifurcation point and the state at the estimated future one. However, this exact mathematical definition is prone to stochastic error (or noise), so the stochastic error has to be included in the definition, because in the end machine learning is based on stochastic systems, meaning systems that are composed of a deterministic and a stochastic component. To drive this error towards zero, stochastic gradient descent is used; in Keras this is simply implemented by choosing optimizer='sgd'.
This interaction of the actual and the future step is implemented as memory on https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 in the function remember, and this function also links the actor and the critic network (or activator and inhibitor network). This general structure of trial (action), call predict (TD function), remember and train (i.e. stochastic gradient descent) is fundamental to all reinforcement learning algorithms, and is linked to the structure actual state, action, reward, new state:
The prediction code is also very much the same as it was in previous reinforcement learning algorithms. That is, we just have to iterate through the trial and call predict, remember, and train on the agent:
In that implementation, your first question is solved by applying remember on the critic and then training the critic with these values (this is in the main function); training always evaluates the loss function, so the action and reward are passed to the loss function by remember in this implementation:
actor_critic.remember(cur_state, action, reward, new_state, done)
actor_critic.train()
As for your second question: I am not sure, but I think this is achieved by the optimization algorithm (i.e. stochastic gradient descent).
Third question: In the predator-prey model, the actor or activator is the prey, and the behavior of the prey is determined only by the size or capacity of the habitat (the amount of grass) and the size of the predator (inhibitor) population, so modelling it this way is again consistent with nature or an activator-inhibitor system. In the main function in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69, also only the critic or inhibitor/predator is trained.

Regression through reinforcement learning

I'm trying to build an agent that can play Pocket Tanks using RL. The problem I'm facing now is how to train a neural network to output the correct power and angle, i.e. I want regression instead of action classification.
In order to output the correct power and angle, all you need to do is go into your neural network architecture and change the activation of your last layer.
In your question, you stated that you are currently using an action classification output, so it is most likely a softmax output layer. We can do two things here:
If the power and angle have hard constraints, e.g. the angle cannot be greater than 360° or the power cannot exceed 700 kW, we can change the softmax output to a tanh output (hyperbolic tangent) and multiply it by the power/angle constraint. This creates a "scaling effect" because tanh's output is between -1 and 1. Multiplying the tanh output by the constraint ensures that the constraints are always satisfied and the output is the correct power/angle.
If there are no constraints on your problem, we can simply delete the softmax output altogether. Removing the softmax allows the output to no longer be constrained between 0 and 1. The last layer of the neural network will then simply act as a linear mapping, i.e. y = Wx + b.
I hope this helps!
EDIT: In both cases, the loss function to train your neural network can simply be an MSE loss. Example: loss = (real_power - estimated_power)^2 + (real_angle - estimated_angle)^2
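To make option 1 and the MSE idea concrete, here is a minimal tf.keras sketch (the bounds, state size and layer sizes are made-up values, not from the question):
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_POWER = 700.0   # assumed hard constraint on power
MAX_ANGLE = 360.0   # assumed hard constraint on angle

state_in = layers.Input(shape=(16,))             # whatever your state representation is
h = layers.Dense(64, activation='relu')(state_in)
raw = layers.Dense(2, activation='tanh')(h)      # two outputs in (-1, 1): power and angle
out = layers.Lambda(lambda t: t * tf.constant([MAX_POWER, MAX_ANGLE]))(raw)

model = models.Model(state_in, out)
model.compile(optimizer='adam', loss='mse')      # MSE between (real_power, real_angle) and the prediction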

How can I compute the hessian of the loss function in MXNet?

I'm learning MXNet at the moment and I'm working on a problem using neural nets. I'm interested in observing the curvature of my loss function with respect to the network weights but as best I can tell higher order gradients are not supported for neural network functions at the moment. Is there any (possibly hacky) way that I could still do this?
You can follow the discussion here
The gist of it is that not all operators support higher order gradients at the moment.
In Gluon you can try the following:
import mxnet as mx

with mx.autograd.record():
    output = net(x)
    loss = loss_func(output)
    dz = mx.autograd.grad(loss, [z], create_graph=True)  # where [z] is the parameter(s) you want
dz[0].backward()  # now the actual parameters should have second order gradients
Taken from this forum thread
