Derivative of a tensor operation - python

I'm reading a book on deep learning and I'm a bit confused about one of the ideas the author mentioned.
I don't understand why we move the weights by -step * gradient(f)(W0) and not just by -step, since step * gradient(f)(W0) represents a loss while step is the parameter (i.e. the x value, i.e. a small change in the weights).

The gradient tells you which direction to move, and the step controls the magnitude of the move so that your sequence of updates converges.
We can't just subtract step. Recall that step is just a scalar, while W0 is a tensor; subtracting a bare scalar from a tensor is not what we want here. The gradient is a tensor with the same shape as W0, and that makes the subtraction well defined.
Reading up on gradient descent might help your understanding.

You need to change the parameters by a small amount in the direction opposite to the gradient to make sure the loss goes down. Subtracting just step does not guarantee that the loss decreases. This is called gradient descent in optimization, and there are proofs of convergence for it. You can check online tutorials on this topic such as this.
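To make the update rule concrete, here is a minimal NumPy sketch of a single gradient-descent step; the toy loss, its gradient, and the shapes are made up for illustration and are not from the book:

import numpy as np

def loss_fn(W):
    # toy quadratic loss, purely for illustration
    return np.sum(W ** 2)

def grad_fn(W):
    # gradient of the toy loss: a tensor with the same shape as W
    return 2 * W

W0 = np.random.randn(3, 4)   # weight tensor
step = 0.1                   # scalar step size (learning rate)

# step alone is a scalar; step * grad_fn(W0) has the same shape as W0,
# so subtracting it nudges every entry of W0 downhill.
W1 = W0 - step * grad_fn(W0)

print(loss_fn(W1) < loss_fn(W0))  # True: the loss decreased for this toy case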

Related

Custom Negative Log Likelihood returning NaNs after varying amounts of training

Hello StackOverflow people,
I am encountering a problem and I don't know what else to try. First off, I am using a custom loss function (at least I believe that is where the problem lies, but maybe it's something different?) for a mixture density network:
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def nll_loss(mu, sigma, alpha, y):
    # Gaussian mixture built from the predicted parameters
    gm = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=alpha),
        components_distribution=tfd.Normal(
            loc=mu,
            scale=sigma))
    # negative mean log-likelihood of the targets under the mixture
    log_likelihood = gm.log_prob(tf.transpose(y))
    return -tf.reduce_mean(log_likelihood, axis=-1)
The funny thing is, the network randomly collapses to NaN losses after a varying amount of training.
The things I already tried and checked:
All my input data is scaled between 0 and 1 (both x and y)
I tried multiplying the y's and adding an integer to those, so the distance to zero is increased
Different optimizers
Clipping optimizers
Clipping loss function
Setting the learning rate to 0! (That one puzzles me the most, as I am sure my inputs are correct)
Adding Batch Normalization to every layer of my network
Does anyone have an idea why this is happening? What am I missing? Thank you!

Gradient of neural network with respect to inputs

I am working on an NN with PyTorch which simply maps points from the plane to real numbers, for example
model = nn.Sequential(nn.Linear(2,2),nn.ReLU(),nn.Linear(2,1))
What I want to do, since this network defines a map h:R^2->R, is to compute the gradient of this mapping h in the training loop. So for example
for it in range(epochs):
    pred = model(X_train)
    grad = torch.autograd.grad(pred, X_train)
    ...
The training set has been defined as a tensor with requires_grad=True. My problem is that even though the output for each individual point is a scalar, I am pushing a batch of N=100 points through the network, so the output is actually an Nx1 tensor. This leads to the error that autograd can only compute the gradient of scalar functions.
In fact, trying with the little change
pred = torch.sum(model(X_train))
everything works perfectly. However, I am interested in all the individual gradients, so is there a way to compute them all together?
Actually, computing the sum as shown above gives exactly the result I expect, of course, but I wanted to know whether this is the only possibility.
There are other possibilities, but using .sum() is the simplest way. Summing the outputs into a scalar pred and computing dpred/dinput gives you exactly the per-point gradients you want. Here is why:
pred = sum_i f(x_i), where i is the index of the input x_i.
dpred/dinput is therefore the stack [dpred/dx_0, dpred/dx_1, dpred/dx_2, ...].
Consider dpred/dx_0: it equals df(x_0)/dx_0, since every other term df(x_i)/dx_0 with i != 0 is zero.
PS: Please excuse the crappy mathematical expressions... SO does not support latex/math expressions.
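A minimal sketch of the sum trick, assuming the two-layer model from the question and a batch X_train of shape (N, 2); the names and sizes here are illustrative:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))

N = 100
X_train = torch.randn(N, 2, requires_grad=True)

pred = model(X_train)                          # shape (N, 1)
# Summing gives a scalar, so autograd.grad accepts it; because each output
# row depends only on its own input row, the result contains the gradient
# of h at every individual point.
grads, = torch.autograd.grad(pred.sum(), X_train)
print(grads.shape)                             # torch.Size([100, 2])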

Use PyTorch to adjust Tensor matrix values based on numbers I calculate from the Tensors?

I have two tensors (matrices) that I've initialized:
sm = Var(torch.randn(20, 1), requires_grad=True)
sm = torch.mm(sm, sm.t())
freq_m = Var(torch.randn(12, 20), requires_grad=True)
I am creating two lists from the data inside these 2 matrices, and I am using spearmanr to get a correlation value between these 2 lists. How I am creating the lists is not important, but the goal is to adjust the values inside the matrices so that the calculated correlation value is as close to 1 as possible.
If I were to solve this problem manually, I would tweak values in the matrices by .01 (or some small number) each time and recalculate the lists and correlation score. If the new correlation value is higher than the previous one, I would save the 2 matrices and tweak a different value until I get the 2 matrices that give me the highest correlation score possible.
Is PyTorch capable of doing this automatically? I know PyTorch can adjust based on an equation but the way I want to adjust the matrix values is not against an equation, it's against a correlation value that I calculate. Any guidance with this is greatly appreciated!
PyTorch has an autograd package: if you take variables, pass them through differentiable functions, and end up with a scalar result, you can perform gradient descent to update those variables so as to lower (or raise) that scalar result.
So what you need to do is define a function f that works at the tensor level, such that f(sm, freq_m) gives you the desired correlation.
Then, you should do something like:
lr = 1e-3
for i in range(100):
    # 100 updates
    loss = 1 - f(sm, freq_m)
    print(loss)
    loss.backward()
    with torch.no_grad():
        sm -= lr * sm.grad
        freq_m -= lr * freq_m.grad
        # Manually zero the gradients after updating weights
        sm.grad.zero_()
        freq_m.grad.zero_()
The learning rate is basically the size of the step you take: a learning rate that is too high will cause the loss to explode, while one that is too low will make convergence slow, so I suggest you experiment.
Edit: to answer the comment on loss.backward: the loss here is a differentiable function of several tensors t1, ..., tn that have requires_grad=True, so you can calculate the gradient of the loss with respect to each of those tensors. When you call loss.backward(), it computes those gradients and stores them in t1.grad, ..., tn.grad. You then update t1, ..., tn using gradient descent in order to lower the value of f. This update step doesn't need to be tracked in the computational graph, which is why it is wrapped in with torch.no_grad().
At the end of the loop you zero the gradients, because .backward() doesn't overwrite the gradients but rather adds the new gradients to them. More on that here: https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903
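For the correlation itself, one option is a differentiable Pearson correlation written directly on tensors; scipy's spearmanr is not differentiable, so this is only a stand-in, and the function name pearson_f plus the way the matrices are flattened below are illustrative assumptions, not part of the question:

import torch

def pearson_f(sm, freq_m):
    # Turn each matrix into a vector with tensor operations (the asker
    # builds their two lists differently; this is just for illustration)
    # and return a differentiable Pearson correlation between them.
    a = sm.flatten()[:freq_m.numel()]
    b = freq_m.flatten()
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

With such an f, the loop above can then minimize 1 - pearson_f(sm, freq_m) directly.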

Negative reward in reinforcement learning

I can't wrap my head around this question: how exactly do negative rewards help the machine avoid them?
The question originates from Google's solution for the game Pong. By their logic, once a game is finished (the agent won or lost a point), the environment returns a reward of +1 or -1. Any intermediate state returns a reward of 0. That means each win/loss produces a reward array of either [0,0,0,...,0,1] or [0,0,0,...,0,-1]. Then they discount and standardize the rewards:
#rwd - array with rewards (ex. [0,0,0,0,0,0,1]), args.gamma is 0.99
prwd = discount_rewards(rwd, args.gamma)
prwd -= np.mean(prwd)
prwd /= np.std(prwd)
discount_rewards is supposed to be some kind of standard function; an implementation can be found here. The result for a win (+1) could be something like this:
[-1.487 , -0.999, -0.507, -0.010, 0.492, 0.999, 1.512]
For a loss (-1):
[1.487 , 0.999, 0.507, 0.010, -0.492, -0.999, -1.512]
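For reference, here is a minimal NumPy version of such a discounting function; it is an assumption about what the linked implementation does, but together with the standardization above it reproduces the arrays shown:

import numpy as np

def discount_rewards(rwd, gamma):
    # running discounted sum, accumulated from the last step backwards
    out = np.zeros(len(rwd))
    running = 0.0
    for t in reversed(range(len(rwd))):
        running = rwd[t] + gamma * running
        out[t] = running
    return out

rwd = np.array([0, 0, 0, 0, 0, 0, 1], dtype=float)
prwd = discount_rewards(rwd, 0.99)
prwd -= np.mean(prwd)
prwd /= np.std(prwd)
print(prwd)   # approximately [-1.487, -0.999, -0.507, -0.010, 0.492, 0.999, 1.512]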
As a result, each move gets a reward. Their loss function looks like this:
loss = tf.reduce_sum(processed_rewards * cross_entropies + move_cost)
Please help me answer the following questions:
The cross entropy function can produce output from 0 to inf. Right?
The TensorFlow optimizer minimizes the loss by absolute value (it doesn't care about the sign; a perfect loss is always 0). Right?
If statement 2 is correct, then a loss of 7.234 is equally as bad as -7.234. Right?
If everything above is correct, then how does a negative reward tell the machine that it's bad, and a positive one that it's good?
I also read this answer; however, I still didn't manage to grasp exactly why negative is worse than positive. It would make more sense to me to have something like:
loss = tf.reduce_sum(tf.pow(cross_entropies, reward))
But that experiment didn't go well.
The cross entropy function can produce output from 0 to inf. Right?
Yes, but only because we multiply it by -1. Think of the natural sign of log(p): since p is a probability (i.e. between 0 and 1), log(p) ranges over (-inf, 0].
The TensorFlow optimizer minimizes the loss by absolute value (it doesn't care about the sign; a perfect loss is always 0). Right?
Nope, the sign matters. It sums up all losses with their signs intact.
If statement 2 is correct, then a loss of 7.234 is equally as bad as -7.234. Right?
See below: a loss of 7.234 is much better than a loss of -7.234 in terms of increasing the reward. An overall positive loss indicates that our agent is making a series of good decisions.
If everything above is correct, then how does a negative reward tell the machine that it's bad, and a positive one that it's good?
Normalizing Rewards to Generate Returns in reinforcement learning makes a very good point that the signed rewards are there to control the size of the gradient. The positive / negative rewards perform a "balancing" act for the gradient size. This is because a huge gradient from a large loss would cause a large change to the weights. Thus if your agent makes as many mistakes as it does proper moves, the overall update for that batch should not be large.
"Tensorflow optimizer minimize loss by absolute value (doesn't care about sign, perfect loss is always 0). Right?"
Wrong. Minimizing the loss means trying to achieve as small a value as possible. That is, -100 is "better" than 0. Accordingly, -7.2 is better than 7.2. Thus, a value of 0 really carries no special significance, besides the fact that many loss functions are set up such that 0 determines the "optimal" value. However, these loss functions are usually set up to be non-negative, so the question of positive vs. negative values doesn't arise. Examples are cross entropy, squared error etc.
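To make the sign argument concrete, here is a toy calculation (the probability value is made up for illustration) showing what "minimizing the loss" does to the sampled action's probability under each reward:

import math

p = 0.3                              # current probability of the sampled action
cross_entropy = -math.log(p)         # about 1.20, always non-negative

loss_win = (+1) * cross_entropy      # about +1.20: minimizing this term
                                     # means raising p (action becomes more likely)
loss_lose = (-1) * cross_entropy     # about -1.20: minimizing this term means
                                     # making it even more negative, i.e. lowering p

print(loss_win, loss_lose)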

Loss function for simple Reinforcement Learning algorithm

This question comes from watching the following video on TensorFlow and Reinforcement Learning from Google I/O 18: https://www.youtube.com/watch?v=t1A3NTttvBA
Here they train a very simple RL algorithm to play the game of Pong.
In the slides they use, the loss is defined like this (at approx. 11m 25s):
loss = -R(sampled_actions * log(action_probabilities))
Further on they show the following code (at approx. 20m 26s):
# loss
cross_entropies = tf.losses.softmax_cross_entropy(
    onehot_labels=tf.one_hot(actions, 3), logits=Ylogits)
loss = tf.reduce_sum(rewards * cross_entropies)
# training operation
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.99)
train_op = optimizer.minimize(loss)
Now my question is this: they use +1 for winning and -1 for losing as rewards. In the code that is provided, any cross-entropy loss that's multiplied by a negative reward will end up very low (negative), right? And if the training operation uses the optimizer to minimize the loss, doesn't that mean the algorithm is being trained to lose?
Or is there something fundamental I'm missing (probably because of my very limited mathematical skills)?
Great question, Corey. I have also wondered exactly what this popular loss function in RL actually means. I've seen many implementations of it, and many of them contradict each other. To my understanding, it means this:
Loss = - log(pi) * A
where A is the advantage compared to a baseline case. In Google's case they used a baseline of 0, so A = R. The log-probability is taken for the specific action sampled at that specific time step: in your example above, actions are one-hot encoded as, say, [1, 0, 0], so we ignore the 0s and only keep the 1. Hence we have the equation above.
If you intuitively calculate this loss for a negative reward:
Loss = - (-1) * log(P) = log(P)
But for any P less than 1, the log of that value is negative. Therefore, you have a negative loss, which could be interpreted as "very good", but it doesn't really make physical sense.
The correct way:
However, in my opinion (and please, others, correct me if I'm wrong), you do not evaluate the loss value directly; you take the gradient of the loss. That is, you take the derivative of -log(pi) * A.
Therefore, you would have:
-(d(pi) / pi) * A
Now, if you have a large negative reward, it translates into a large gradient that pushes the probability of that action down.
I hope this makes sense.
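As a sanity check of that derivative, here is a small PyTorch sketch (not from the talk; the logits and rewards are made-up values) that computes loss = -A * log(pi) for a sampled action and looks at the gradient on the logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 0.5, -0.5], requires_grad=True)
action = 0                                   # the sampled action
log_pi = F.log_softmax(logits, dim=0)[action]

for A in (+1.0, -1.0):                       # reward / advantage
    loss = -A * log_pi
    grad, = torch.autograd.grad(loss, logits, retain_graph=True)
    # With A = +1 the gradient on the chosen logit is negative, so gradient
    # descent raises that logit and makes the action more likely; with
    # A = -1 the sign flips and the action is made less likely.
    print(A, grad)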
