I have a toy reinforcement learning project based on the REINFORCE algorithm (here's PyTorch's implementation) that I would like to add batch updates to. In RL, the "target" can only be created after a "prediction" has been made, so standard batching techniques do not apply. As such, I accrue losses for each episode and append them to a list l_losses where each item is a zero-dimensional tensor. I hold off on calling .backward() or optimizer.step() until a certain number of episodes have passed in order to create a sort of pseudo batch.
Given this list of losses, how do I have PyTorch update the network based on their average gradient? Or would updating based on the average gradient be the same as updating on the average loss (I seem to have read otherwise elsewhere)?
My current method is to create a new tensor t_loss from torch.stack(l_losses), then run t_loss = t_loss.mean(), t_loss.backward(), optimizer.step(), and zero the gradient, but I'm unsure whether this does what I intend. It's also unclear to me whether I should instead have been running .backward() on each individual loss rather than collecting them in a list (while holding off on the .step() call until the end).
Taking the gradient is a linear operation, so the gradient of the average is the same as the average of the gradients.
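Written out, the linearity claim looks like this. The symbols are my own notation, not the asker's: N is the number of episodes in the pseudo batch, L_i is the loss of episode i, and θ stands for the network parameters.

```latex
\nabla_\theta \left( \frac{1}{N} \sum_{i=1}^{N} L_i(\theta) \right)
  = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta)
```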
Take some example data
import torch
a = torch.randn(1, 4, requires_grad=True);
b = torch.randn(5, 4);
You could store all the losses and compute the mean as you are doing,
a.grad = None
x = (a * b).mean(axis=1)
x.mean().backward() # gradient of the mean
print(a.grad)
Or you could call backward() in every iteration, so that each loss adds its contribution to the accumulated gradient as soon as it is computed.
a.grad = None
for bi in b:
    (a * bi / len(b)).mean().backward()
print(a.grad)
Performance
I don't know the internal details of PyTorch's backward implementation, but I can tell that
(1) the graph is destroyed by default after the backward pass, unless you pass retain_graph=True or create_graph=True to backward();
(2) the gradient is not kept except for leaf tensors, unless you call retain_grad() on the tensor;
(3) if you evaluate a model twice using different inputs, you can run the backward pass on each result individually, which means that they have separate graphs. This can be verified with the following code.
a.grad = None
# compute all the variables in advance
r = [(a * bi / len(b)).mean() for bi in b]
for ri in r:
    # This uses the graph of r[i], but the graph of r[i-1]
    # was already destroyed; it means that the r[i] graph is independent
    # of the r[i-1] graph, hence they require separate memory.
    ri.backward()  # this will remove the graph of ri
print(a.grad)
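Points (1) and (2) can be checked directly as well. The following is only a small sketch with a made-up loss, (a * a).sum(), chosen because its backward pass needs tensors that are saved in the graph.

```python
import torch

a = torch.randn(1, 4, requires_grad=True)

# (1) A second backward() through the same graph fails by default,
# because the saved tensors were freed by the first call.
loss = (a * a).sum()
loss.backward()
try:
    loss.backward()
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True
print(second_backward_failed)

# With retain_graph=True the graph survives and the gradients accumulate:
# two backward passes of sum(a^2) give 2 * (2a) = 4a.
a.grad = None
loss = (a * a).sum()
loss.backward(retain_graph=True)
loss.backward()
print(torch.allclose(a.grad, 4 * a.detach()))

# (2) A non-leaf tensor keeps its gradient only after retain_grad().
y = a * 2
y.retain_grad()
y.sum().backward()
print(y.grad)
```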
So if you call backward() after each episode, the gradient accumulates in the leaf tensors (divide each loss by the number of episodes, as above, if you want the average). That is all the information you need for the next optimization step, so you can discard that loss, freeing up resources for further computations. I would expect lower memory usage, and potentially even faster execution if the allocator can efficiently reuse the just-deallocated memory for the next allocation.
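To connect this back to the pseudo-batch question, here is a rough sketch comparing the two strategies on a stand-in model. The names policy, episode_loss, grads_from_mean_loss and grads_from_accumulation are hypothetical, and the squared-output loss is just a placeholder for a real REINFORCE episode loss.

```python
import torch
import torch.nn as nn

def episode_loss(model, obs):
    # Placeholder for a per-episode loss: any scalar that depends
    # on the model parameters is enough for this comparison.
    return model(obs).pow(2).mean()

def grads_from_mean_loss(model, episodes):
    # Strategy 1: keep every loss (and its graph), then backward once on the mean.
    model.zero_grad()
    losses = [episode_loss(model, obs) for obs in episodes]
    torch.stack(losses).mean().backward()
    return [p.grad.clone() for p in model.parameters()]

def grads_from_accumulation(model, episodes):
    # Strategy 2: backward after each episode; each graph is freed immediately
    # and the scaled gradients accumulate in the parameters' .grad fields.
    model.zero_grad()
    for obs in episodes:
        (episode_loss(model, obs) / len(episodes)).backward()
    return [p.grad.clone() for p in model.parameters()]

torch.manual_seed(0)
policy = nn.Linear(4, 2)                       # stand-in for the policy network
episodes = [torch.randn(3, 4) for _ in range(5)]

g_mean = grads_from_mean_loss(policy, episodes)
g_acc = grads_from_accumulation(policy, episodes)
print(all(torch.allclose(x, y, atol=1e-6) for x, y in zip(g_mean, g_acc)))
```

In a training loop, optimizer.step() followed by optimizer.zero_grad() would come after the last episode of the pseudo batch in either strategy.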
I am using PyTorch to implement a neural network that has (say) 5 inputs and 2 outputs:
import torch
import torch.nn as nn
class myNetwork(nn.Module):
    def __init__(self):
        super(myNetwork, self).__init__()
        self.layer1 = nn.Linear(5, 32)
        self.layer2 = nn.Linear(32, 2)
    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x
Obviously, I can feed this an (N x 5) Tensor and get an (N x 2) result,
net = myNetwork()
nbatch = 100
inp = torch.rand([nbatch,5])
inp.requires_grad = True
out = net(inp)
I would now like to compute the derivatives of the NN output with respect to one element of the input vector (let's say the 5th element), for each example in the batch. I know I can calculate the derivatives of one element of the output with respect to all inputs using torch.autograd.grad, and I could use this as follows:
deriv = torch.zeros([nbatch,2])
for i in range(nbatch):
    for j in range(2):
        deriv[i,j] = torch.autograd.grad(out[i,j],inp,retain_graph=True)[0][i,4]
However, this seems very inefficient: it calculates the gradient of out[i,j] with respect to every single element in the batch, and then discards all except one. Is there a better way to do this?
Because of how backpropagation works, computing the gradient w.r.t. only a single input would not necessarily save much: you would only save some work in the first layer, since all later layers need to be backpropagated through either way.
So this may not be the optimal way, but it doesn't actually create much overhead, especially if your network has many layers.
By the way, is there a reason that you need to loop over nbatch? If you wanted the gradient of each element of a batch w.r.t. a parameter, I could understand that, because PyTorch will lump them together, but you seem to be solely interested in the input...
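To make the last remark concrete, here is one way the loop over nbatch might be dropped. It assumes that each row of the output depends only on the matching row of the input (true for plain Linear/ReLU layers, not for something like batch norm in training mode), so summing an output column over the batch does not mix examples. The nn.Sequential model is a stand-in for the network in the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 2))
nbatch = 100
inp = torch.rand(nbatch, 5, requires_grad=True)
out = net(inp)

# One backward pass per output column instead of one per (example, output) pair.
# grad(out[:, j].sum(), inp)[0][i, k] equals d out[i, j] / d inp[i, k]
# because out[i', j] does not depend on inp[i] for i' != i.
deriv = torch.stack(
    [torch.autograd.grad(out[:, j].sum(), inp, retain_graph=True)[0][:, 4]
     for j in range(out.shape[1])],
    dim=1,
)
print(deriv.shape)  # one derivative per (example, output) pair

# Spot check against the element-by-element version from the question.
i, j = 7, 1
reference = torch.autograd.grad(out[i, j], inp, retain_graph=True)[0][i, 4]
print(torch.allclose(deriv[i, j], reference))
```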
I'm trying to convert TensorFlow code to Julia, but I don't understand how the layer_norm function works.
I'm trying to convert that part:
output = tf.contrib.layers.layer_norm(attn_out + h, begin_norm_axis=-1,
scope='LayerNorm')
shape of the tensor attn_out + h is ( qlen,bsz,d_model ) and when I downloaded the trained model, variables gamma and beta have shape (d_model,). Also, output of nn.moments should have shape [qlen,bsz,1]. layer_norm function calls batch_norm function with these parameters:
@add_arg_scope
def layer_norm(inputs,
               center=True,
               scale=True,
               activation_fn=None,
               reuse=None,
               variables_collections=None,
               outputs_collections=None,
               trainable=True,
               begin_norm_axis=1,
               begin_params_axis=-1,
               scope=None):
  """Adds a Layer Normalization layer.
  Based on the paper:
    "Layer Normalization"
    Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
    https://arxiv.org/abs/1607.06450.
  Can be used as a normalizer function for conv2d and fully_connected.
  Given a tensor `inputs` of rank `R`, moments are calculated and normalization
  is performed over axes `begin_norm_axis ... R - 1`. Scaling and centering,
  if requested, is performed over axes `begin_params_axis .. R - 1`.
  By default, `begin_norm_axis = 1` and `begin_params_axis = -1`,
  meaning that normalization is performed over all but the first axis
  (the `HWC` if `inputs` is `NHWC`), while the `beta` and `gamma` trainable
  parameters are calculated for the rightmost axis (the `C` if `inputs` is
  `NHWC`). Scaling and recentering is performed via broadcast of the
  `beta` and `gamma` parameters with the normalized tensor.
  The shapes of `beta` and `gamma` are `inputs.shape[begin_params_axis:]`,
  and this part of the inputs' shape must be fully defined.
  Args:
    inputs: A tensor having rank `R`. The normalization is performed over axes
      `begin_norm_axis ... R - 1` and centering and scaling parameters are
      calculated over `begin_params_axis ... R - 1`.
    center: If True, add offset of `beta` to normalized tensor. If False, `beta`
      is ignored.
    scale: If True, multiply by `gamma`. If False, `gamma` is not used. When the
      next layer is linear (also e.g. `nn.relu`), this can be disabled since the
      scaling can be done by the next layer.
    activation_fn: Activation function, default set to None to skip it and
      maintain a linear activation.
    reuse: Whether or not the layer and its variables should be reused. To be
      able to reuse the layer scope must be given.
    variables_collections: Optional collections for the variables.
    outputs_collections: Collections to add the outputs.
    trainable: If `True` also add variables to the graph collection
      `GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable).
    begin_norm_axis: The first normalization dimension: normalization will be
      performed along dimensions `begin_norm_axis : rank(inputs)`
    begin_params_axis: The first parameter (beta, gamma) dimension: scale and
      centering parameters will have dimensions
      `begin_params_axis : rank(inputs)` and will be broadcast with the
      normalized inputs accordingly.
    scope: Optional scope for `variable_scope`.
  Returns:
    A `Tensor` representing the output of the operation, having the same
    shape and dtype as `inputs`.
  Raises:
    ValueError: If the rank of `inputs` is not known at graph build time,
      or if `inputs.shape[begin_params_axis:]` is not fully defined at
      graph build time.
  """
  with variable_scope.variable_scope(
      scope, 'LayerNorm', [inputs], reuse=reuse) as sc:
    inputs = ops.convert_to_tensor(inputs)
    inputs_shape = inputs.shape
    inputs_rank = inputs_shape.ndims
    if inputs_rank is None:
      raise ValueError('Inputs %s has undefined rank.' % inputs.name)
    dtype = inputs.dtype.base_dtype
    if begin_norm_axis < 0:
      begin_norm_axis = inputs_rank + begin_norm_axis
    if begin_params_axis >= inputs_rank or begin_norm_axis >= inputs_rank:
      raise ValueError('begin_params_axis (%d) and begin_norm_axis (%d) '
                       'must be < rank(inputs) (%d)' %
                       (begin_params_axis, begin_norm_axis, inputs_rank))
    params_shape = inputs_shape[begin_params_axis:]
    if not params_shape.is_fully_defined():
      raise ValueError(
          'Inputs %s: shape(inputs)[%s:] is not fully defined: %s' %
          (inputs.name, begin_params_axis, inputs_shape))
    # Allocate parameters for the beta and gamma of the normalization.
    beta, gamma = None, None
    if center:
      beta_collections = utils.get_variable_collections(variables_collections,
                                                        'beta')
      beta = variables.model_variable(
          'beta',
          shape=params_shape,
          dtype=dtype,
          initializer=init_ops.zeros_initializer(),
          collections=beta_collections,
          trainable=trainable)
    if scale:
      gamma_collections = utils.get_variable_collections(
          variables_collections, 'gamma')
      gamma = variables.model_variable(
          'gamma',
          shape=params_shape,
          dtype=dtype,
          initializer=init_ops.ones_initializer(),
          collections=gamma_collections,
          trainable=trainable)
    # By default, compute the moments across all the dimensions except the one with index 0.
    norm_axes = list(range(begin_norm_axis, inputs_rank))
    mean, variance = nn.moments(inputs, norm_axes, keep_dims=True)
    # Compute layer normalization using the batch_normalization function.
    # Note that epsilon must be increased for float16 due to the limited
    # representable range.
    variance_epsilon = 1e-12 if dtype != dtypes.float16 else 1e-3
    outputs = nn.batch_normalization(
        inputs,
        mean,
        variance,
        offset=beta,
        scale=gamma,
        variance_epsilon=variance_epsilon)
    outputs.set_shape(inputs_shape)
    if activation_fn is not None:
      outputs = activation_fn(outputs)
    return utils.collect_named_outputs(outputs_collections, sc.name, outputs)
However none of these tensors (gamma,beta,mean,variance) satisfy the input restrictions of batch_norm:
@tf_export("nn.batch_normalization")
def batch_normalization(x,
                        mean,
                        variance,
                        offset,
                        scale,
                        variance_epsilon,
                        name=None):
  r"""Batch normalization.
  Normalizes a tensor by `mean` and `variance`, and applies (optionally) a
  `scale` \\(\gamma\\) to it, as well as an `offset` \\(\beta\\):
  \\(\frac{\gamma(x-\mu)}{\sigma}+\beta\\)
  `mean`, `variance`, `offset` and `scale` are all expected to be of one of two
  shapes:
    * In all generality, they can have the same number of dimensions as the
      input `x`, with identical sizes as `x` for the dimensions that are not
      normalized over (the 'depth' dimension(s)), and dimension 1 for the
      others which are being normalized over.
      `mean` and `variance` in this case would typically be the outputs of
      `tf.nn.moments(..., keep_dims=True)` during training, or running averages
      thereof during inference.
    * In the common case where the 'depth' dimension is the last dimension in
      the input tensor `x`, they may be one dimensional tensors of the same
      size as the 'depth' dimension.
      This is the case for example for the common `[batch, depth]` layout of
      fully-connected layers, and `[batch, height, width, depth]` for
      convolutions.
      `mean` and `variance` in this case would typically be the outputs of
      `tf.nn.moments(..., keep_dims=False)` during training, or running averages
      thereof during inference.
  See Source: [Batch Normalization: Accelerating Deep Network Training by
  Reducing Internal Covariate Shift; S. Ioffe, C. Szegedy]
  (http://arxiv.org/abs/1502.03167).
  Args:
    x: Input `Tensor` of arbitrary dimensionality.
    mean: A mean `Tensor`.
    variance: A variance `Tensor`.
    offset: An offset `Tensor`, often denoted \\(\beta\\) in equations, or
      None. If present, will be added to the normalized tensor.
    scale: A scale `Tensor`, often denoted \\(\gamma\\) in equations, or
      `None`. If present, the scale is applied to the normalized tensor.
    variance_epsilon: A small float number to avoid dividing by 0.
    name: A name for this operation (optional).
  Returns:
    the normalized, scaled, offset tensor.
  """
  with ops.name_scope(name, "batchnorm", [x, mean, variance, scale, offset]):
    inv = math_ops.rsqrt(variance + variance_epsilon)
    if scale is not None:
      inv *= scale
    # Note: tensorflow/contrib/quantize/python/fold_batch_norms.py depends on
    # the precise order of ops that are generated by the expression below.
    return x * math_ops.cast(inv, x.dtype) + math_ops.cast(
        offset - mean * inv if offset is not None else -mean * inv, x.dtype)
I don't understand how it works and couldn't convert it to Julia code. I'm using the same order for tensor shapes: if x has shape [a,b,c] in tf, it has shape [a,b,c] in my Julia code as well.
Flux provides batch normalization and layer normalization (and some others) in Julia, called BatchNorm and LayerNorm. In your question you don't specify which library you use in Julia, but if you use something other than Flux, the code might still be helpful for your own implementation. See: https://github.com/FluxML/Flux.jl/blob/master/src/layers/normalise.jl
If you are using Knet (which I have never used), there is at least batch normalization already implemented; I am not sure about other normalization layers. They might be there as well.
Please also post what you have already tried so far in Julia.
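As a reference point for a port, here is a small NumPy sketch of what the quoted TensorFlow code appears to compute when begin_norm_axis=-1. It is my reading of the source above, not an official equivalent: the moments keep a trailing dimension of size 1, and gamma and beta of shape (d_model,) are combined with them purely through broadcasting, which seems to be why the shapes work despite the wording of the batch_normalization docstring. The function name layer_norm_last_axis is made up for this example.

```python
import numpy as np

def layer_norm_last_axis(x, gamma, beta, eps=1e-12):
    # Moments over the last axis only, keeping it as size 1: (qlen, bsz, 1).
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)   # population variance (ddof=0)
    # Same order of operations as in batch_normalization:
    # inv = rsqrt(variance + eps) * gamma, broadcast to (qlen, bsz, d_model).
    inv = gamma / np.sqrt(variance + eps)
    return x * inv + (beta - mean * inv)

rng = np.random.default_rng(0)
qlen, bsz, d_model = 3, 2, 8
x = rng.normal(size=(qlen, bsz, d_model))
gamma = np.ones(d_model)    # trained values would be loaded here
beta = np.zeros(d_model)

y = layer_norm_last_axis(x, gamma, beta)
print(y.shape)
```

With gamma set to ones and beta to zeros, every length-d_model slice of y should come out with mean close to 0 and variance close to 1, which is an easy sanity check for a Julia version too.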
I am new to machine learning, Python and TensorFlow. I am used to coding in C++ or C#, and it is difficult for me to use tf.backend.
I am trying to write a custom loss function for an LSTM network that tries to predict whether the next element of a time series will be positive or negative. My code runs nicely with the binary_crossentropy loss function. I now want to improve my network with a loss function that adds the value of the next time series element if the predicted probability is greater than 0.5 and subtracts it if the probability is less than or equal to 0.5.
I tried something like this:
def customLossFunction(y_true, y_pred):
    temp = 0.0
    for i in range(0, len(y_true)):
        if(y_pred[i] > 0):
            temp += y_true[i]
        else:
            temp -= y_true[i]
    return temp
Obviously, dimensions are wrong but since I cannot step into my function while debugging, it is very hard to get a grasp of dimensions here.
Can you please tell me if I can use an element-by-element function? If yes, how? And if not, could you help me with tf.backend?
Thanks a lot
From keras backend functions, you have the function greater that you can use:
import keras.backend as K
def customLossFunction(yTrue, yPred):
    greater = K.greater(yPred, 0.5)
    greater = K.cast(greater, K.floatx())  # has zeros and ones
    multiply = (2*greater) - 1  # has -1 and 1
    modifiedTrue = multiply * yTrue
    # here, it's important to know which dimension you want to sum
    # (replace ? with the axis you need, see below)
    return K.sum(modifiedTrue, axis=?)
The axis parameter should be used according to what you want to sum.
axis=0 -> batch or sample dimension (number of sequences)
axis=1 -> time steps dimension (if you're using return_sequences = True until the end)
axis=2 -> predictions for each step
Now, if you have only a 2D target:
axis=0 -> batch or sample dimension (number of sequences)
axis=1 -> predictions for each sequence
If you simply want to sum everything for every sequence, then just don't put the axis parameter.
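To see what each axis choice does for a 2D target, here is a NumPy stand-in for the function above. NumPy is used only so the shapes can be inspected without building a model; the function name and the sample values are made up for illustration.

```python
import numpy as np

def custom_loss_numpy(y_true, y_pred, axis=None):
    greater = (y_pred > 0.5).astype(np.float32)   # zeros and ones
    multiply = 2 * greater - 1                    # -1 and +1
    return np.sum(multiply * y_true, axis=axis)

# 3 sequences (axis 0), 2 predictions per sequence (axis 1)
y_true = np.array([[1.0, -2.0], [3.0, 4.0], [-5.0, 6.0]], dtype=np.float32)
y_pred = np.array([[0.9, 0.2], [0.4, 0.7], [0.6, 0.5]], dtype=np.float32)

per_prediction = custom_loss_numpy(y_true, y_pred, axis=0)  # sums over sequences
per_sequence = custom_loss_numpy(y_true, y_pred, axis=1)    # sums over predictions
total = custom_loss_numpy(y_true, y_pred)                   # sums everything

print(per_prediction)  # [-7.  0.]
print(per_sequence)    # [  3.   1. -11.]
print(total)           # -7.0
```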
Important note about this function:
Since the result contains only values from yTrue, it cannot backpropagate to change the weights. This will lead to a "none values not supported" error or something very similar.
Although yPred (the one that is connected to the model's weights) is used in the function, it's used only to get a true/false condition, which is not differentiable.
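The missing gradient can be observed directly. The sketch below assumes TensorFlow 2 with eager execution and uses tf.GradientTape instead of a Keras model. The single variable w and the sigmoid "model" are stand-ins invented for this example.

```python
import tensorflow as tf

w = tf.Variable(0.3)                       # stand-in for a model weight
x = tf.constant([1.0, 2.0, -1.0])
y_true = tf.constant([1.0, -2.0, 0.5])

with tf.GradientTape() as tape:
    y_pred = tf.sigmoid(w * x)             # depends on w
    # The comparison yields booleans, so no gradient can flow through it.
    greater = tf.cast(tf.greater(y_pred, 0.5), tf.float32)
    loss = tf.reduce_sum((2.0 * greater - 1.0) * y_true)

grad = tape.gradient(loss, w)
print(grad)  # None: the loss is not connected to w by any differentiable path
```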
I have an LSTM predicting time series values in tensorflow.
The model is working using an MSE as a loss function.
However, I'd like to be able to create a custom loss function where one of the error values is multiplied by two (therefore producing a higher error value).
In my batch of size 10, I want the 3rd value of the first input to be multiplied by 2, but because this is time series, this corresponds to the second value in the second input and the first value in the third input.
The error I get is:
ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients
How do I make the gradients?
def loss_function(y_true, y_pred, peak_value=3, weight=2):
    # peak value is where the multiplication happens on the first line
    # weight is the how much the error is multiplied by
    all_dif = tf.squared_difference(y_true, y_pred) # should be shape=[10,10]
    peak = [peak_value] * 10
    listy = range(0, 10)
    c = [(i - j) % 10 for i, j in zip(peak, listy)]
    for i in range(0, 10):
        indices = [[i, c[i]]]
        values = [1.0]
        shape = [10,10]
        delta = tf.SparseTensor(indices, values, shape)
        all_dif = all_dif + tf.sparse_tensor_to_dense(delta)
    return tf.reduce_sum(all_dif)
I believe the pseudocode would look something like this:
@tf.custom_gradient
def loss_function(y_true, y_pred, peak_value=3, weight=2):
    ## your code
    def grad(dy):
        return dy * partial_derivative
    return loss, grad
Where partial_derivative is the analytically evaluated partial derivative of your loss function. If your loss function is a function of more than one variable, it will require a partial derivative with respect to each variable, I believe.
If you need more information, the documentation is good: https://www.tensorflow.org/api_docs/python/tf/custom_gradient
And I've yet to find an example of this functionality embedded in a model that's not a toy.
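For what it's worth, here is a small self-contained sketch of the mechanics, assuming TensorFlow 2 with eager execution. The names weighted_sse and peak_weights are hypothetical, and the weight mask is my interpretation of the index pattern in the question (weight 2 at row i, column (peak_value - i) % 10), not the asker's exact loss. The gradient function returns one gradient per tensor input. A plain weighted sum of squares like this is already differentiable without a custom gradient, so the example only illustrates the API.

```python
import numpy as np
import tensorflow as tf

def peak_weights(size=10, peak_value=3, weight=2.0):
    # Hypothetical helper: `weight` at (row i, column (peak_value - i) % size),
    # 1.0 everywhere else.
    w = np.ones((size, size), dtype=np.float32)
    for i in range(size):
        w[i, (peak_value - i) % size] = weight
    return tf.constant(w)

@tf.custom_gradient
def weighted_sse(y_pred, y_true, weights):
    diff = y_pred - y_true
    loss = tf.reduce_sum(weights * tf.square(diff))

    def grad(upstream):
        # Analytic partial derivatives, one per input, in argument order.
        d_pred = upstream * 2.0 * weights * diff
        return d_pred, -d_pred, upstream * tf.square(diff)

    return loss, grad

tf.random.set_seed(0)
y_true = tf.random.normal((10, 10))
y_pred = tf.random.normal((10, 10))
weights = peak_weights()

with tf.GradientTape() as tape:
    tape.watch(y_pred)
    loss = weighted_sse(y_pred, y_true, weights)
custom_grad = tape.gradient(loss, y_pred)
print(custom_grad.shape)
```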