Tensorflow: _variable_with_weight_decay(...) explanation - python

At the moment I'm looking at the CIFAR-10 example, and I noticed the function _variable_with_weight_decay(...) in the file cifar10.py. The code is as follows:
def _variable_with_weight_decay(name, shape, stddev, wd):
  """Helper to create an initialized Variable with weight decay.

  Note that the Variable is initialized with a truncated normal distribution.
  A weight decay is added only if one is specified.

  Args:
    name: name of the variable
    shape: list of ints
    stddev: standard deviation of a truncated Gaussian
    wd: add L2Loss weight decay multiplied by this float. If None, weight
      decay is not added for this Variable.

  Returns:
    Variable Tensor
  """
  dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
  var = _variable_on_cpu(
      name,
      shape,
      tf.truncated_normal_initializer(stddev=stddev, dtype=dtype))
  if wd is not None:
    weight_decay = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
    tf.add_to_collection('losses', weight_decay)
  return var
I'm wondering if this function does what it says. It is clear that when a weight decay factor is given (wd is not None) the decay value (weight_decay) is computed. But is it ever applied? At the end the unmodified variable (var) is returned, or am I missing something?
My second question is how to fix this. As I understand it, the value of the scalar weight_decay must be subtracted from each element in the weight matrix, but I'm unable to find a TensorFlow op that can do that (adding/subtracting a single value from every element of a tensor). Is there an op like this?
As a workaround I thought it might be possible to create a new tensor filled with the value of weight_decay and use tf.subtract(...) to achieve the same result. Or is this the right way to go anyway?
Thanks in advance.

The code does what it says. You are supposed to sum everything in the 'losses' collection (which the weight decay term is added to in the second to last line) for the loss that you pass to the optimizer. In the loss() function in that example:
tf.add_to_collection('losses', cross_entropy_mean)
[...]
return tf.add_n(tf.get_collection('losses'), name='total_loss')
so what the loss() function returns is the classification loss plus everything that was in the 'losses' collection before.
As a side note, weight decay does not mean that wd is subtracted from every value in the tensor as part of the update step; rather, each value is multiplied by (1 - learning_rate * wd) (in plain SGD). To see why this is so, recall that l2_loss computes
output = sum(t_i ** 2) / 2
with t_i being the elements of the tensor. This means that the derivative of l2_loss with regard to each tensor element is the value of that tensor element itself, and since you scaled l2_loss with wd the derivative is scaled as well.
Since the update step (again, in plain SGD) is (forgive me for omitting the time step indexes)
w := w - learning_rate * dL/dw
you get, if you only had the weight decay term
w := w - learning_rate * wd * w
or
w := w * (1 - learning_rate * wd)
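To convince yourself, here is a quick NumPy check (not from the CIFAR-10 code; w, wd and learning_rate are made-up values) that one SGD step on the wd * l2_loss term alone is the same as scaling the weights:
import numpy as np

# One plain-SGD step on the weight-decay term only; all values are illustrative.
w = np.array([0.5, -1.2, 3.0])
wd, learning_rate = 0.01, 0.1

grad = wd * w                        # d/dw of wd * sum(w_i ** 2) / 2
updated = w - learning_rate * grad   # w := w - learning_rate * wd * w

assert np.allclose(updated, w * (1 - learning_rate * wd))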

Related

What is the difference between these two pytorch tensors?

I have two lists of tensors: the tensors in the first list can be concatenated with torch.cat(), while the tensors in the last list cannot. I would therefore like to know what the difference between these two kinds of tensors is and how we can concatenate the last ones. I see that in the first case the printed tensor has square brackets around its value.item(), while in the last one there are no brackets surrounding value.item().
def update(self, episode: dict, gamma: float) -> dict:
    """
    Updates the policy and value networks and returns
    a dictionary with the overall loss and the loss at
    each step for the policy and value nets.

    Args:
        episode: a dictionary with states, actions, rewards, and log probabilities
        gamma: the discount factor

    Returns:
        a dictionary with the loss at each state and the overall loss for the policy and value nets.
    """
    # use this list to keep track of the loss terms at the individual steps
    policy_losses = []
    value_losses = []

    # TODO compute returns by calling `compute_returns(rewards, gamma)`, as it
    # already includes the return standardization
    # YOUR CODE HERE
    #raise NotImplementedError()
    states, actions, rewards, log_probs, values = episode.values()
    returns = compute_returns(rewards, gamma=gamma)
    returns = torch.tensor(returns)

    for a_log_prob, state_value, R in zip(log_probs, values, returns):
        policy_losses.append(-1 * a_log_prob * (R - state_value.item()))
        value_losses.append(torch.nn.functional.mse_loss(state_value, torch.tensor([R]), reduction='mean'))

    print('p', policy_losses)
    print('v', value_losses)

    # -----------------------------------------------------------------------------------
    # --- the code below does not need to be changed ------------------------------------
    # -----------------------------------------------------------------------------------
    # this resets the gradients on all involved weight tensors
    self.policy_optimizer.zero_grad()
    self.value_optimizer.zero_grad()

    # this concatenates all individual policy loss terms
    policy_losses = torch.cat(policy_losses)
    print(policy_losses)
    value_losses = torch.cat(value_losses)

    # here we sum all policy losses up. this implements the gradient accumulation
    # "theta = theta + ..." from the pseudo code, where it says
    # "loop for each step of the episode ..." in one neat line.
    policy_loss = policy_losses.sum()
    value_loss = value_losses.sum()

    # this computes the gradient of the loss wrt all involved weight tensors
    policy_loss.backward()
    value_loss.backward()

    # this updates the weight tensors with the update rule of the optimizer
    # we're not using vanilla SGD here, but rather the Adam update rule,
    # as it converges much, much faster
    self.policy_optimizer.step()
    self.value_optimizer.step()

    # finally, we'll return the policy_loss and value_loss for visualization purposes later on
    return dict(
        policy_loss=policy_loss.item(),
        value_loss=value_loss.item(),
        policy_losses=policy_losses,
        value_losses=value_losses
    )
The difference lies in the distinction between a zero-dimensional tensor holding a single value (PyTorch lists its shape as torch.Size([])) and a one-dimensional tensor holding a single value (PyTorch lists its shape as torch.Size([1])).
The former cannot be concatenated (as there is no dimension along which to concatenate) while the latter can. You can easily convert the former to the latter by:
dim1 = dim0.unsqueeze(0)
Alternatively, you can use torch.stack which "concatenates" tensors along a new dimension.
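For illustration, a minimal sketch of the two cases (the values here are made up, not taken from your update() function):
import torch

zero_dim = torch.tensor(3.0)    # shape torch.Size([])
one_dim = torch.tensor([3.0])   # shape torch.Size([1])

torch.cat([one_dim, one_dim])                               # works: tensor([3., 3.])
# torch.cat([zero_dim, zero_dim])                           # RuntimeError: zero-dimensional tensors cannot be concatenated
torch.cat([zero_dim.unsqueeze(0), zero_dim.unsqueeze(0)])   # tensor([3., 3.])
torch.stack([zero_dim, zero_dim])                           # tensor([3., 3.]) -- stacks along a new dim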

Adding manual gradient to step

I was wondering whether there is any way to manually add a gradient to a step in PyTorch while otherwise using autograd. There is one middle step in my loss function that I cannot compute without moving the data out of a tensor, so autograd doesn't track that component and the gradient doesn't get computed correctly. However, I could compute the gradient manually. How would I go about incorporating this into the gradient graph in PyTorch? All the guides I've found don't use autograd at all (as far as I understand them).
The specific issue I am trying to solve is normalizing a function over some interval. The following example does this for a sum of Gaussians. The tensor m is [[m1,m2,m3,m4...]] and represents the means, s represents the standard deviations, and p represents the weights; p, m and s are all outputs from my model. I want the integral between the lower and higher cutoffs to be 1, so I get that by taking the cdf at the higher cutoff, subtracting the cdf at the lower cutoff, and dividing all the ps by that value. I would then use these new values of p (along with m, s and a target) to calculate some value for the loss function. Then when I call loss.backward() I would get the correct gradients, including the part of the gradient that comes from the normalization factor changing as p, m and s change.
normFactor = 0
for gaussianInd in range(numberGaussians):
    normFactor += (spstats.norm.cdf(higherCutoff, m[0][gaussianInd].cpu().detach(), s[0][gaussianInd].cpu().detach() + 1e-6)
                   - spstats.norm.cdf(lowerCutoff, m[0][gaussianInd].cpu().detach(), s[0][gaussianInd].cpu().detach() + 1e-6)) * p[0][gaussianInd]
p = p / normFactor
Edit: Added specific example
Of course, you can always modify the grad attribute of the tensor of interest. Your optimizer reads this attribute to update the corresponding tensor.
For illustration purposes:
>>> p = nn.Linear(10, 1, bias=False)
>>> p.weight
Parameter containing:
tensor([[ 0.3148, -0.2287, 0.1254, -0.1360, 0.2799, -0.0225, -0.3006, -0.0605,
-0.2784, -0.2618]], requires_grad=True)
>>> optim = torch.optim.SGD(p.parameters(), lr=.1)
Manually modify the gradient:
>>> p.weight.grad = torch.rand_like(p.weight)
Update with the optimizer:
>>> optim.step()
The parameter will get updated:
>>> p.weight
Parameter containing:
tensor([[ 0.2514, -0.2555, 0.1026, -0.1881, 0.2529, -0.0497, -0.3750, -0.1489,
-0.3762, -0.2839]], requires_grad=True)
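Applied to a training step, a minimal sketch of this approach (the names w, optimizer, loss and manual_grad are hypothetical placeholders, not from your code): let autograd fill in what it can, then add your hand-computed contribution to the grad attribute before stepping:
import torch

optimizer.zero_grad()
loss.backward()                  # autograd fills w.grad for the part it can track

with torch.no_grad():
    if w.grad is None:
        w.grad = manual_grad.clone()
    else:
        w.grad += manual_grad    # add the hand-derived gradient contribution

optimizer.step()                 # the optimizer uses the combined gradient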

How to convert tf.layer_norm to julia?

I'm trying to convert TensorFlow code to Julia, but I don't understand how the layer_norm function works.
I'm trying to convert this part:
output = tf.contrib.layers.layer_norm(attn_out + h, begin_norm_axis=-1,
                                      scope='LayerNorm')
The shape of the tensor attn_out + h is (qlen, bsz, d_model), and in the trained model I downloaded, the variables gamma and beta have shape (d_model,). Also, the output of nn.moments should have shape [qlen, bsz, 1]. The layer_norm function is defined as follows and ends up calling batch normalization with these parameters:
@add_arg_scope
def layer_norm(inputs,
               center=True,
               scale=True,
               activation_fn=None,
               reuse=None,
               variables_collections=None,
               outputs_collections=None,
               trainable=True,
               begin_norm_axis=1,
               begin_params_axis=-1,
               scope=None):
  """Adds a Layer Normalization layer.

  Based on the paper:
    "Layer Normalization"
    Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
    https://arxiv.org/abs/1607.06450.

  Can be used as a normalizer function for conv2d and fully_connected.

  Given a tensor `inputs` of rank `R`, moments are calculated and normalization
  is performed over axes `begin_norm_axis ... R - 1`. Scaling and centering,
  if requested, is performed over axes `begin_params_axis .. R - 1`.

  By default, `begin_norm_axis = 1` and `begin_params_axis = -1`,
  meaning that normalization is performed over all but the first axis
  (the `HWC` if `inputs` is `NHWC`), while the `beta` and `gamma` trainable
  parameters are calculated for the rightmost axis (the `C` if `inputs` is
  `NHWC`). Scaling and recentering is performed via broadcast of the
  `beta` and `gamma` parameters with the normalized tensor.

  The shapes of `beta` and `gamma` are `inputs.shape[begin_params_axis:]`,
  and this part of the inputs' shape must be fully defined.

  Args:
    inputs: A tensor having rank `R`. The normalization is performed over axes
      `begin_norm_axis ... R - 1` and centering and scaling parameters are
      calculated over `begin_params_axis ... R - 1`.
    center: If True, add offset of `beta` to normalized tensor. If False, `beta`
      is ignored.
    scale: If True, multiply by `gamma`. If False, `gamma` is not used. When the
      next layer is linear (also e.g. `nn.relu`), this can be disabled since the
      scaling can be done by the next layer.
    activation_fn: Activation function, default set to None to skip it and
      maintain a linear activation.
    reuse: Whether or not the layer and its variables should be reused. To be
      able to reuse the layer scope must be given.
    variables_collections: Optional collections for the variables.
    outputs_collections: Collections to add the outputs.
    trainable: If `True` also add variables to the graph collection
      `GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable).
    begin_norm_axis: The first normalization dimension: normalization will be
      performed along dimensions `begin_norm_axis : rank(inputs)`
    begin_params_axis: The first parameter (beta, gamma) dimension: scale and
      centering parameters will have dimensions
      `begin_params_axis : rank(inputs)` and will be broadcast with the
      normalized inputs accordingly.
    scope: Optional scope for `variable_scope`.

  Returns:
    A `Tensor` representing the output of the operation, having the same
    shape and dtype as `inputs`.

  Raises:
    ValueError: If the rank of `inputs` is not known at graph build time,
      or if `inputs.shape[begin_params_axis:]` is not fully defined at
      graph build time.
  """
  with variable_scope.variable_scope(
      scope, 'LayerNorm', [inputs], reuse=reuse) as sc:
    inputs = ops.convert_to_tensor(inputs)
    inputs_shape = inputs.shape
    inputs_rank = inputs_shape.ndims
    if inputs_rank is None:
      raise ValueError('Inputs %s has undefined rank.' % inputs.name)
    dtype = inputs.dtype.base_dtype
    if begin_norm_axis < 0:
      begin_norm_axis = inputs_rank + begin_norm_axis
    if begin_params_axis >= inputs_rank or begin_norm_axis >= inputs_rank:
      raise ValueError('begin_params_axis (%d) and begin_norm_axis (%d) '
                       'must be < rank(inputs) (%d)' %
                       (begin_params_axis, begin_norm_axis, inputs_rank))
    params_shape = inputs_shape[begin_params_axis:]
    if not params_shape.is_fully_defined():
      raise ValueError(
          'Inputs %s: shape(inputs)[%s:] is not fully defined: %s' %
          (inputs.name, begin_params_axis, inputs_shape))
    # Allocate parameters for the beta and gamma of the normalization.
    beta, gamma = None, None
    if center:
      beta_collections = utils.get_variable_collections(variables_collections,
                                                        'beta')
      beta = variables.model_variable(
          'beta',
          shape=params_shape,
          dtype=dtype,
          initializer=init_ops.zeros_initializer(),
          collections=beta_collections,
          trainable=trainable)
    if scale:
      gamma_collections = utils.get_variable_collections(
          variables_collections, 'gamma')
      gamma = variables.model_variable(
          'gamma',
          shape=params_shape,
          dtype=dtype,
          initializer=init_ops.ones_initializer(),
          collections=gamma_collections,
          trainable=trainable)
    # By default, compute the moments across all the dimensions except the one with index 0.
    norm_axes = list(range(begin_norm_axis, inputs_rank))
    mean, variance = nn.moments(inputs, norm_axes, keep_dims=True)
    # Compute layer normalization using the batch_normalization function.
    # Note that epsilon must be increased for float16 due to the limited
    # representable range.
    variance_epsilon = 1e-12 if dtype != dtypes.float16 else 1e-3
    outputs = nn.batch_normalization(
        inputs,
        mean,
        variance,
        offset=beta,
        scale=gamma,
        variance_epsilon=variance_epsilon)
    outputs.set_shape(inputs_shape)
    if activation_fn is not None:
      outputs = activation_fn(outputs)
    return utils.collect_named_outputs(outputs_collections, sc.name, outputs)
However, none of these tensors (gamma, beta, mean, variance) satisfy the input restrictions of batch_normalization:
@tf_export("nn.batch_normalization")
def batch_normalization(x,
                        mean,
                        variance,
                        offset,
                        scale,
                        variance_epsilon,
                        name=None):
  r"""Batch normalization.

  Normalizes a tensor by `mean` and `variance`, and applies (optionally) a
  `scale` \\(\gamma\\) to it, as well as an `offset` \\(\beta\\):

  \\(\frac{\gamma(x-\mu)}{\sigma}+\beta\\)

  `mean`, `variance`, `offset` and `scale` are all expected to be of one of two
  shapes:

  * In all generality, they can have the same number of dimensions as the
    input `x`, with identical sizes as `x` for the dimensions that are not
    normalized over (the 'depth' dimension(s)), and dimension 1 for the
    others which are being normalized over.
    `mean` and `variance` in this case would typically be the outputs of
    `tf.nn.moments(..., keep_dims=True)` during training, or running averages
    thereof during inference.
  * In the common case where the 'depth' dimension is the last dimension in
    the input tensor `x`, they may be one dimensional tensors of the same
    size as the 'depth' dimension.
    This is the case for example for the common `[batch, depth]` layout of
    fully-connected layers, and `[batch, height, width, depth]` for
    convolutions.
    `mean` and `variance` in this case would typically be the outputs of
    `tf.nn.moments(..., keep_dims=False)` during training, or running averages
    thereof during inference.

  See Source: [Batch Normalization: Accelerating Deep Network Training by
  Reducing Internal Covariate Shift; S. Ioffe, C. Szegedy]
  (http://arxiv.org/abs/1502.03167).

  Args:
    x: Input `Tensor` of arbitrary dimensionality.
    mean: A mean `Tensor`.
    variance: A variance `Tensor`.
    offset: An offset `Tensor`, often denoted \\(\beta\\) in equations, or
      None. If present, will be added to the normalized tensor.
    scale: A scale `Tensor`, often denoted \\(\gamma\\) in equations, or
      `None`. If present, the scale is applied to the normalized tensor.
    variance_epsilon: A small float number to avoid dividing by 0.
    name: A name for this operation (optional).

  Returns:
    the normalized, scaled, offset tensor.
  """
  with ops.name_scope(name, "batchnorm", [x, mean, variance, scale, offset]):
    inv = math_ops.rsqrt(variance + variance_epsilon)
    if scale is not None:
      inv *= scale
    # Note: tensorflow/contrib/quantize/python/fold_batch_norms.py depends on
    # the precise order of ops that are generated by the expression below.
    return x * math_ops.cast(inv, x.dtype) + math_ops.cast(
        offset - mean * inv if offset is not None else -mean * inv, x.dtype)
I don't understand how it works and couldn't convert it to Julia code. I'm using the same order for tensor shapes: if x has shape [a,b,c] in TensorFlow, it has shape [a,b,c] in my Julia code as well.
Flux provides batch normalization and layer normalization (and some others) in Julia, called BatchNorm and LayerNorm. In your question you don't specify which library you use in Julia, but if you use something other than Flux, the code might still be helpful for your own implementation. See: https://github.com/FluxML/Flux.jl/blob/master/src/layers/normalise.jl
If you are using Knet (which I have never used) there is at least batch normalization already implemented; I am not sure about other normalization layers. They might be there as well.
Please also post what you have already tried so far in Julia.
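For the port itself, here is a minimal NumPy sketch of what the layer_norm call in the question computes for an input of shape (qlen, bsz, d_model) with begin_norm_axis=-1; it mirrors the TF code above, it is not Flux or Knet API:
import numpy as np

def layer_norm_last_axis(x, gamma, beta, eps=1e-12):
    # x: (qlen, bsz, d_model); gamma, beta: (d_model,)
    mean = x.mean(axis=-1, keepdims=True)        # (qlen, bsz, 1), like nn.moments
    var = x.var(axis=-1, keepdims=True)          # (qlen, bsz, 1)
    normalized = (x - mean) / np.sqrt(var + eps)
    return gamma * normalized + beta             # gamma/beta broadcast over the last axis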

tensorflow: gradients for a custom loss function

I have an LSTM predicting time series values in tensorflow.
The model is working using an MSE as a loss function.
However, I'd like to be able to create a custom loss function where one of the error values is multiplied by two (therefore producing a higher error value).
In my batch of size 10, I want the 3rd value of the first input to be multiplied by 2, but because this is time series, this corresponds to the second value in the second input and the first value in the third input.
The error I get is:
ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients
How do I make the gradients?
def loss_function(y_true, y_pred, peak_value=3, weight=2):
    # peak value is where the multiplication happens on the first line
    # weight is how much the error is multiplied by
    all_dif = tf.squared_difference(y_true, y_pred)  # should be shape=[10,10]
    peak = [peak_value] * 10
    listy = range(0, 10)
    c = [(i - j) % 10 for i, j in zip(peak, listy)]
    for i in range(0, 10):
        indices = [[i, c[i]]]
        values = [1.0]
        shape = [10, 10]
        delta = tf.SparseTensor(indices, values, shape)
        all_dif = all_dif + tf.sparse_tensor_to_dense(delta)
    return tf.reduce_sum(all_dif)
I believe the pseudo code would look something like this:
@tf.custom_gradient
def loss_function(y_true, y_pred, peak_value=3, weight=2):
    ## your code
    def grad(dy):
        return dy * partial_derivative
    return loss, grad
Here partial_derivative is the analytically evaluated partial derivative of your loss function. If your loss function is a function of more than one variable, it will require a partial derivative with respect to each variable, I believe.
If you need more information, the documentation is good: https://www.tensorflow.org/api_docs/python/tf/custom_gradient
And I've yet to find an example of this functionality embedded in a model that's not a toy.
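As a concrete (hedged) version of that pseudo code, here is a small sketch of a weighted squared-error loss with an explicit gradient; the weight matrix w and the factor 2 in the derivatives are assumptions for illustration, not your exact peak logic:
import tensorflow as tf

@tf.custom_gradient
def weighted_squared_error(y_true, y_pred):
    # Hypothetical weights: 1 everywhere; put 2 wherever the error should count double.
    w = tf.ones_like(y_pred)
    diff = y_pred - y_true
    loss = tf.reduce_sum(w * tf.square(diff))

    def grad(dy):
        # Analytic partials of the loss w.r.t. y_true and y_pred, in input order.
        return dy * (-2.0 * w * diff), dy * (2.0 * w * diff)

    return loss, grad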

Python + Theano: Logistic regression weights do not update

I've compared extensively to existing tutorials but I can't figure out why my weights don't update. Here is the function that returns the list of updates:
def get_updates(cost, params, learning_rate):
    updates = []
    for param in params:
        updates.append((param, param - learning_rate * T.grad(cost, param)))
    return updates
It is defined at the top level, outside of any classes. This is standard gradient descent for each param. The 'params' parameter here is fed in as mlp.params, which is simply the concatenated lists of the param lists for each layer. I removed every layer except for a logistic regression one to isolate the reason as to why my cost was not decreasing. The following is the definition of mlp.params in MLP's constructor. It follows the definition of each layer and their respective param lists.
self.params = []
for layer in self.layers:
    self.params += layer.params
The following is the train function, which I call for each minibatch during each epoch:
train = theano.function([minibatch_index], cost,
    updates=get_updates(cost, mlp.params, learning_rate),
    givens={
        x: train_set_x[minibatch_index * batch_size : (minibatch_index + 1) * batch_size],
        y: train_set_y[minibatch_index * batch_size : (minibatch_index + 1) * batch_size]
    })
If you require further details, the entire file is available here: http://pastebin.com/EeNmXfGD
I don't know how many people use Theano (it doesn't seem to be many); if you've read to this point, thank you.
Fixed: I've determined that I can't use average squared error as the cost function. It works as usual after replacing it with a negative log-likelihood.
This behavior is caused by a few things, but it comes down to the cost not being properly computed. In your implementation, the output of the LogisticRegression layer is the predicted class for every input digit (obtained with the argmax operation), and you take the squared difference between it and the expected prediction.
This will give you gradients of 0 with respect to any parameter in your model, because the gradient of the argmax's output (the predicted class) with respect to its input (the class probabilities) is 0.
Instead, the LogisticRegression should output the probabilities of the classes :
def output(self, input):
    input = input.flatten(2)
    self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)
    return self.p_y_given_x
And then in the MLP class, you compute the cost. You could use the mean squared error between the desired probabilities for each class and the probabilities computed by the model, but people tend to use the negative log-likelihood of the expected classes, and you can implement it in the MLP class like this:
def neg_log_likelihood(self, x, y):
    p_y_given_x = self.output(x)
    return -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
Then you can use this function to compute your cost, and the model trains:
cost = mlp.neg_log_likelihood(x_, y)
A few additional things:
At line 215, when you print your cost, you format it as an integer value but it is a floating-point value; this will lose precision in the monitoring.
Initializing all the weights to 0s, as you do in your LogisticRegression class, is often not recommended. Weights should differ in their initial values so as to help break symmetry; see the sketch below.
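A minimal sketch of a symmetry-breaking initialization (n_in and n_out are hypothetical layer sizes, not names from the pastebin):
import numpy as np

n_in, n_out = 784, 10
rng = np.random.RandomState(1234)
bound = np.sqrt(6.0 / (n_in + n_out))
W_init = rng.uniform(low=-bound, high=bound, size=(n_in, n_out)).astype('float32')
b_init = np.zeros(n_out, dtype='float32')   # biases can safely stay at zero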
