How to convert tf.layer_norm to Julia?

I'm trying to convert TensorFlow code to Julia, but I don't understand how the layer_norm function works.
This is the part I'm trying to convert:
output = tf.contrib.layers.layer_norm(attn_out + h, begin_norm_axis=-1,
                                      scope='LayerNorm')
The tensor attn_out + h has shape (qlen, bsz, d_model), and in the trained model I downloaded the variables gamma and beta have shape (d_model,). Also, the output of nn.moments should have shape [qlen, bsz, 1]. Here is the source of layer_norm, which ends up calling batch_normalization:
@add_arg_scope
def layer_norm(inputs,
               center=True,
               scale=True,
               activation_fn=None,
               reuse=None,
               variables_collections=None,
               outputs_collections=None,
               trainable=True,
               begin_norm_axis=1,
               begin_params_axis=-1,
               scope=None):
"""Adds a Layer Normalization layer.
Based on the paper:
"Layer Normalization"
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
https://arxiv.org/abs/1607.06450.
Can be used as a normalizer function for conv2d and fully_connected.
Given a tensor `inputs` of rank `R`, moments are calculated and normalization
is performed over axes `begin_norm_axis ... R - 1`. Scaling and centering,
if requested, is performed over axes `begin_params_axis .. R - 1`.
By default, `begin_norm_axis = 1` and `begin_params_axis = -1`,
meaning that normalization is performed over all but the first axis
(the `HWC` if `inputs` is `NHWC`), while the `beta` and `gamma` trainable
parameters are calculated for the rightmost axis (the `C` if `inputs` is
`NHWC`). Scaling and recentering is performed via broadcast of the
`beta` and `gamma` parameters with the normalized tensor.
The shapes of `beta` and `gamma` are `inputs.shape[begin_params_axis:]`,
and this part of the inputs' shape must be fully defined.
Args:
inputs: A tensor having rank `R`. The normalization is performed over axes
`begin_norm_axis ... R - 1` and centering and scaling parameters are
calculated over `begin_params_axis ... R - 1`.
center: If True, add offset of `beta` to normalized tensor. If False, `beta`
is ignored.
scale: If True, multiply by `gamma`. If False, `gamma` is not used. When the
next layer is linear (also e.g. `nn.relu`), this can be disabled since the
scaling can be done by the next layer.
activation_fn: Activation function, default set to None to skip it and
maintain a linear activation.
reuse: Whether or not the layer and its variables should be reused. To be
able to reuse the layer scope must be given.
variables_collections: Optional collections for the variables.
outputs_collections: Collections to add the outputs.
trainable: If `True` also add variables to the graph collection
`GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable).
begin_norm_axis: The first normalization dimension: normalization will be
performed along dimensions `begin_norm_axis : rank(inputs)`
begin_params_axis: The first parameter (beta, gamma) dimension: scale and
centering parameters will have dimensions
`begin_params_axis : rank(inputs)` and will be broadcast with the
normalized inputs accordingly.
scope: Optional scope for `variable_scope`.
Returns:
A `Tensor` representing the output of the operation, having the same
shape and dtype as `inputs`.
Raises:
ValueError: If the rank of `inputs` is not known at graph build time,
or if `inputs.shape[begin_params_axis:]` is not fully defined at
graph build time.
"""
  with variable_scope.variable_scope(
      scope, 'LayerNorm', [inputs], reuse=reuse) as sc:
    inputs = ops.convert_to_tensor(inputs)
    inputs_shape = inputs.shape
    inputs_rank = inputs_shape.ndims
    if inputs_rank is None:
      raise ValueError('Inputs %s has undefined rank.' % inputs.name)
    dtype = inputs.dtype.base_dtype
    if begin_norm_axis < 0:
      begin_norm_axis = inputs_rank + begin_norm_axis
    if begin_params_axis >= inputs_rank or begin_norm_axis >= inputs_rank:
      raise ValueError('begin_params_axis (%d) and begin_norm_axis (%d) '
                       'must be < rank(inputs) (%d)' %
                       (begin_params_axis, begin_norm_axis, inputs_rank))
    params_shape = inputs_shape[begin_params_axis:]
    if not params_shape.is_fully_defined():
      raise ValueError(
          'Inputs %s: shape(inputs)[%s:] is not fully defined: %s' %
          (inputs.name, begin_params_axis, inputs_shape))
    # Allocate parameters for the beta and gamma of the normalization.
    beta, gamma = None, None
    if center:
      beta_collections = utils.get_variable_collections(variables_collections,
                                                        'beta')
      beta = variables.model_variable(
          'beta',
          shape=params_shape,
          dtype=dtype,
          initializer=init_ops.zeros_initializer(),
          collections=beta_collections,
          trainable=trainable)
    if scale:
      gamma_collections = utils.get_variable_collections(
          variables_collections, 'gamma')
      gamma = variables.model_variable(
          'gamma',
          shape=params_shape,
          dtype=dtype,
          initializer=init_ops.ones_initializer(),
          collections=gamma_collections,
          trainable=trainable)
    # By default, compute the moments across all the dimensions except the one with index 0.
    norm_axes = list(range(begin_norm_axis, inputs_rank))
    mean, variance = nn.moments(inputs, norm_axes, keep_dims=True)
    # Compute layer normalization using the batch_normalization function.
    # Note that epsilon must be increased for float16 due to the limited
    # representable range.
    variance_epsilon = 1e-12 if dtype != dtypes.float16 else 1e-3
    outputs = nn.batch_normalization(
        inputs,
        mean,
        variance,
        offset=beta,
        scale=gamma,
        variance_epsilon=variance_epsilon)
    outputs.set_shape(inputs_shape)
    if activation_fn is not None:
      outputs = activation_fn(outputs)
    return utils.collect_named_outputs(outputs_collections, sc.name, outputs)
However, none of these tensors (gamma, beta, mean, variance) seems to satisfy the shape requirements described in the batch_normalization docstring:
@tf_export("nn.batch_normalization")
def batch_normalization(x,
                        mean,
                        variance,
                        offset,
                        scale,
                        variance_epsilon,
                        name=None):
r"""Batch normalization.
Normalizes a tensor by `mean` and `variance`, and applies (optionally) a
`scale` \\(\gamma\\) to it, as well as an `offset` \\(\beta\\):
\\(\frac{\gamma(x-\mu)}{\sigma}+\beta\\)
`mean`, `variance`, `offset` and `scale` are all expected to be of one of two
shapes:
* In all generality, they can have the same number of dimensions as the
input `x`, with identical sizes as `x` for the dimensions that are not
normalized over (the 'depth' dimension(s)), and dimension 1 for the
others which are being normalized over.
`mean` and `variance` in this case would typically be the outputs of
`tf.nn.moments(..., keep_dims=True)` during training, or running averages
thereof during inference.
* In the common case where the 'depth' dimension is the last dimension in
the input tensor `x`, they may be one dimensional tensors of the same
size as the 'depth' dimension.
This is the case for example for the common `[batch, depth]` layout of
fully-connected layers, and `[batch, height, width, depth]` for
convolutions.
`mean` and `variance` in this case would typically be the outputs of
`tf.nn.moments(..., keep_dims=False)` during training, or running averages
thereof during inference.
See Source: [Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift; S. Ioffe, C. Szegedy]
(http://arxiv.org/abs/1502.03167).
Args:
x: Input `Tensor` of arbitrary dimensionality.
mean: A mean `Tensor`.
variance: A variance `Tensor`.
offset: An offset `Tensor`, often denoted \\(\beta\\) in equations, or
None. If present, will be added to the normalized tensor.
scale: A scale `Tensor`, often denoted \\(\gamma\\) in equations, or
`None`. If present, the scale is applied to the normalized tensor.
variance_epsilon: A small float number to avoid dividing by 0.
name: A name for this operation (optional).
Returns:
the normalized, scaled, offset tensor.
"""
  with ops.name_scope(name, "batchnorm", [x, mean, variance, scale, offset]):
    inv = math_ops.rsqrt(variance + variance_epsilon)
    if scale is not None:
      inv *= scale
    # Note: tensorflow/contrib/quantize/python/fold_batch_norms.py depends on
    # the precise order of ops that are generated by the expression below.
    return x * math_ops.cast(inv, x.dtype) + math_ops.cast(
        offset - mean * inv if offset is not None else -mean * inv, x.dtype)
I don't understand how this works and couldn't convert it to Julia code. I'm keeping the same order for tensor shapes: if x has shape [a, b, c] in TensorFlow, it also has shape [a, b, c] in my Julia code.

Flux provides batch normalization and layer normalization (and some other normalization layers) in Julia, called BatchNorm and LayerNorm. Your question doesn't say which library you use on the Julia side, but if it is something other than Flux, the Flux source might still be helpful for your own implementation. See: https://github.com/FluxML/Flux.jl/blob/master/src/layers/normalise.jl
If you are using Knet (which I have never used), batch normalization is at least already implemented there; I am not sure about other normalization layers, but they might be available as well.
Please also post what you have already tried so far in Julia.
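To make the TensorFlow behaviour concrete, here is a minimal NumPy sketch (not Flux code, and the toy shapes are made up) of what layer_norm computes with begin_norm_axis=-1:
import numpy as np

# toy stand-ins for (qlen, bsz, d_model)
qlen, bsz, d_model = 4, 2, 8
x = np.random.randn(qlen, bsz, d_model).astype(np.float32)

# the trained parameters have shape (d_model,), like the gamma/beta you downloaded
gamma = np.ones(d_model, dtype=np.float32)
beta = np.zeros(d_model, dtype=np.float32)

# with begin_norm_axis=-1 the moments are taken over the last axis only,
# keeping that axis, so mean/var have shape (qlen, bsz, 1)
mean = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)

# batch_normalization then just broadcasts everything together:
# (qlen, bsz, d_model) with (qlen, bsz, 1) and (d_model,)
eps = 1e-12
out = gamma * (x - mean) / np.sqrt(var + eps) + beta
print(out.shape)  # (4, 2, 8), same as the input
Note that NumPy (like TensorFlow) broadcasts a (d_model,) vector against the last axis, while Julia broadcasts against the first axis, so in Julia you would reshape gamma and beta to (1, 1, d_model) before applying them.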

Related

What is the difference between these two pytorch tensors?

I have two lists of tensors: the first one can be concatenated with torch.cat(), while the last one cannot. I would therefore like to know what the difference between these two kinds of tensors is and how we can concatenate the last one. I noticed that value.item() is printed with square brackets around it in the first list, while in the last one there are no brackets surrounding value.item().
def update(self, episode: dict, gamma: float) -> dict:
    """
    Updates the policy and value networks and returns
    a dictionary with the overall loss and the loss at
    each step for the policy and value nets.

    Args:
        episode: a dictionary with states, actions, rewards, and log probabilities
        gamma: the discount factor

    Returns:
        a dictionary with the loss at each state and the overall loss for the policy and value nets.
    """
    # use these lists to keep track of the loss terms at the individual steps
    policy_losses = []
    value_losses = []

    # TODO compute returns by calling `compute_returns(rewards, gamma)`, as it
    # already includes the return standardization
    # YOUR CODE HERE
    # raise NotImplementedError()
    states, actions, rewards, log_probs, values = episode.values()
    returns = compute_returns(rewards, gamma=gamma)
    returns = torch.tensor(returns)
    for a_log_prob, state_value, R in zip(log_probs, values, returns):
        policy_losses.append(-1 * a_log_prob * (R - state_value.item()))
        value_losses.append(torch.nn.functional.mse_loss(state_value, torch.tensor([R]), reduction='mean'))
    print('p', policy_losses)
    print('v', value_losses)

    # -----------------------------------------------------------------------------------
    # --- the code below does not need to be changed ------------------------------------
    # -----------------------------------------------------------------------------------
    # this resets the gradients on all involved weight tensors
    self.policy_optimizer.zero_grad()
    self.value_optimizer.zero_grad()

    # this concatenates all individual policy loss terms
    policy_losses = torch.cat(policy_losses)
    print(policy_losses)
    value_losses = torch.cat(value_losses)

    # here we sum all policy losses up. this implements the gradient accumulation
    # "theta = theta + ..." from the pseudo code, where it says
    # "loop for each step of the episode ..." in one neat line.
    policy_loss = policy_losses.sum()
    value_loss = value_losses.sum()

    # this computes the gradient of the loss wrt all involved weight tensors
    policy_loss.backward()
    value_loss.backward()

    # this updates the weight tensors with the update rule of the optimizer
    # we're not using vanilla SGD here, but rather the Adam update rule,
    # as it converges much, much faster
    self.policy_optimizer.step()
    self.value_optimizer.step()

    # finally, we'll return the policy_loss and value_loss for visualization purposes later on
    return dict(
        policy_loss=policy_loss.item(),
        value_loss=value_loss.item(),
        policy_losses=policy_losses,
        value_losses=value_losses
    )
The difference lies in the distinction between a zero-dimensional tensor holding a single value (PyTorch lists its shape as torch.Size([])) and a one-dimensional tensor holding a single value (PyTorch lists its shape as torch.Size([1])).
The former cannot be concatenated (as there is no dimension along which to concatenate) while the latter can. You can easily convert the former to the latter by:
dim1 = dim0.unsqueeze(0)
Alternatively, you can use torch.stack which "concatenates" tensors along a new dimension.
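As a quick illustration (a minimal sketch with made-up values):
import torch

scalar = torch.tensor(3.0)    # zero-dimensional: shape torch.Size([])
vector = torch.tensor([3.0])  # one-dimensional:  shape torch.Size([1])

# torch.cat needs at least one dimension to concatenate along, so it works
# for the one-dimensional tensors but fails for the zero-dimensional ones.
print(torch.cat([vector, vector]).shape)    # torch.Size([2])
# torch.cat([scalar, scalar])               # RuntimeError: zero-dimensional tensor ...

# either add a dimension first ...
print(torch.cat([scalar.unsqueeze(0), scalar.unsqueeze(0)]).shape)  # torch.Size([2])
# ... or stack along a new dimension
print(torch.stack([scalar, scalar]).shape)  # torch.Size([2])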

Multi-channel, 2D mask weights using BCEWithLogitsLoss in Pytorch

I have a set of 256x256 images that are each labeled with nine binary 256x256 masks. I am trying to calculate the pos_weight in order to weight the BCEWithLogitsLoss using PyTorch.
The shape of my masks tensor is tensor([1000, 9, 256, 256]) where 1000 is the number of training images, 9 is the number of mask channels (all encoded to 0/1), and 256 is the size of each image side.
To calculate pos_weight, I have summed the zeros in each mask channel and divided that number by the sum of all of the ones in that channel (following the advice suggested here):
(masks[:,channel,:,:]==0).sum()/masks[:,channel,:,:].sum()
Calculating the weight for every mask channel provides a tensor with the shape of tensor([9]), which seems intuitive to me, since I want a pos_weight value for each of the nine mask channels. However when I try to fit my model, I get the following error message:
RuntimeError: The size of tensor a (9) must match the size of
tensor b (256) at non-singleton dimension 3
This error message is surprising because it suggests that the weights need to be the size of one of the image sides, but not the number of mask channels. What shape should pos_weight be and how do I specify that it should be providing weights for the mask channels instead of the image pixels?
TLDR; This is a broadcasting issue which is surprisingly not handled by PyTorch's nn.BCEWithLogitsLoss, namely by F.binary_cross_entropy_with_logits. It might actually be worth opening a GitHub issue linking to this SO thread to notify the developers of this undesirable behaviour.
In the documentation page of nn.BCEWithLogitsLoss, it is stated that the provided positive weights tensor pos_weight:
Must be a vector with length equal to the number of classes.
This is of course what you were expecting (rightly so) since positive weights refer to the weight given to the positive instances for every single class. Since your prediction and target tensors are multi-dimensional this seems to not be handled properly by PyTorch.
Anyhows, here is a minimal example showing how you can bypass this error and also showing the manual computation of the binary cross-entropy, as reference.
Here is the setup of the prediction and target tensors pred and label respectively:
>>> c=2;b=5;h=3;w=3
>>> pred = torch.rand(b,c,h,w)
>>> label = torch.randint(0,2, (b,c,h,w), dtype=float)
Now for the definition of the positive weight, notice the trailing singleton dimensions:
>>> pos_weight = torch.rand(c,1,1)
In your case, with your existing 1D tensor of length c, you would simply have to unsqueeze two extra dimensions for the height and width dimensions. This means doing something like: pos_weight = pos_weight[:,None,None].
Calling the BCE-with-logits function or its object-oriented equivalent:
>>> F.binary_cross_entropy_with_logits(pred, label, pos_weight=pos_weight).mean()
Which is equivalent, in plain code to:
>>> z = torch.sigmoid(pred)
>>> bce = -(pos_weight*label*torch.log(z) + (1-label)*torch.log(1-z))
Note that the built-in function would have the desired behaviour (i.e. no error message) if the class dimension were last in your prediction and target tensors.
>>> pos_weight = torch.rand(c)
>>> F.binary_cross_entropy_with_logits(
... pred.transpose(1,-1),
... label.transpose(1,-1),
... pos_weight=pos_weight)
In other words, we are applying the function with format NHWC which means the pos_weight of format C can be multiplied properly. So the result above effectively yields the same result as:
>>> F.binary_cross_entropy_with_logits(
... pred,
... label,
... pos_weight=pos_weight[:,None,None])
You can read more about pos_weight in BCEWithLogitsLoss in another thread here.

a cost function that consists of two parts with different second dimension

I have defined a loss function like this:
def my_loss(y_recon, y_real, brain_hidden, brain_real):
    loss = torch.mean((y_recon - y_real)**2 + (brain_hidden - brain_real)**2)
    return loss
y_recon (and y_real) has shape batch_size*300, and brain_hidden (and brain_real) has shape batch_size*64.
I need to minimize both of these terms. However, this way I get the error
The size of tensor a (300) must match the size of tensor b (64) at
non-singleton dimension 1
How can I update the loss function to avoid this error?
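One way to see the problem (a sketch, not from the original thread; the names just mirror the question): the two squared-difference tensors have trailing dimensions 300 and 64, so they cannot be added elementwise before the mean is taken. Reducing each term to a scalar first and then adding avoids the broadcast entirely:
import torch

def my_loss(y_recon, y_real, brain_hidden, brain_real):
    # reduce each term to a scalar before combining, so the
    # (batch, 300) and (batch, 64) shapes never have to broadcast
    recon_loss = torch.mean((y_recon - y_real) ** 2)
    brain_loss = torch.mean((brain_hidden - brain_real) ** 2)
    return recon_loss + brain_loss

# toy usage with the shapes from the question
loss = my_loss(torch.randn(8, 300), torch.randn(8, 300),
               torch.randn(8, 64), torch.randn(8, 64))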

Keras tensor has an additional dimension and causes wrong results for net.evaluate()

I'd like to train a neural network in Python and Keras using a metric learning custom loss function. The loss minimizes the distances of the outputs for similar inputs and maximizes the distances between dissimilar ones. The part considering similar inputs is:
# function to create a pairwise similarity matrix, i.e
# L[i,j] == 1 for similar samples i, j and 0 otherwise
def build_indicator_matrix(y_, thr=0.1):
    # y_: contains the labels of the samples,
    # samples are similar in case of same label
    # prevent checking equality of floats --> check if absolute
    # differences are below threshold
    lbls_diff = K.expand_dims(y_, axis=0) - K.expand_dims(y_, axis=1)
    lbls_thr = K.less(K.abs(lbls_diff), thr)
    # cast bool tensor back to float32
    L = K.cast(lbls_thr, 'float32')
    # POSSIBLE WORKAROUND
    # L = K.sum(L, axis=2)
    return L

# function to compute the (squared) Euclidean distances between all pairs
# of samples, store in DIST[i,j] the distance between output y_pred[i,:] and y_pred[j,:]
def compute_pairwise_distances(y_pred):
    DIFF = K.expand_dims(y_pred, axis=0) - K.expand_dims(y_pred, axis=1)
    DIST = K.sum(K.square(DIFF), axis=-1)
    return DIST

# function to compute the average distance between all similar samples
def my_loss(y_true, y_pred):
    # y_true: contains true labels of the samples
    # y_pred: contains network outputs
    L = build_indicator_matrix(y_true)
    DIST = compute_pairwise_distances(y_pred)
    return K.mean(DIST * L, axis=1)
For training, I pass a NumPy array y of shape (n,) as the target variable to my_loss. However, I found (using the computational graph in TensorBoard) that the TensorFlow backend creates a 2D variable out of y (displayed shape ? x ?), and hence L in build_indicator_matrix is not 2- but 3-dimensional (shape ? x ? x ? in TensorBoard). This causes net.evaluate() and net.fit() to compute wrong results.
Why does TensorFlow create a 2D rather than a 1D array? And how does this affect net.evaluate() and net.fit()?
As quick workarounds I found that either replacing build_indicator_matrix() with static NumPy code for computing L, or collapsing the "fake" dimension with the line L = K.sum(L, axis=2), solves the problem. In the latter case, however, the output of K.eval(build_indicator_matrix(y)) is only of shape (n,) and not (n,n), so I do not understand why this workaround still yields correct results. Why does TensorFlow introduce an additional dimension?
My library versions are:
keras: 2.2.4
tensorflow: 1.8.0
numpy: 1.15.0
This is because evaluate and fit work in batches.
The first dimension you see in tensorboard is the batch dimension, unknown in advance and therefore denoted ?.
When using custom metrics, remember the tensors (y_true and y_pred) you get are the ones corresponding to the batch.
For more info, show us how you call both those functions.
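A minimal sketch of a possible fix (assuming the Keras/TensorFlow backend setup from the question): since Keras hands the targets to a loss or metric as a 2D tensor of shape (batch_size, 1), you can flatten y_true inside the loss so build_indicator_matrix again sees the 1D label vector it expects.
from keras import backend as K

def my_loss(y_true, y_pred):
    # y_true arrives with shape (batch_size, 1); drop the trailing
    # singleton dimension so the pairwise label differences stay 2D
    y_true = K.flatten(y_true)
    L = build_indicator_matrix(y_true)
    DIST = compute_pairwise_distances(y_pred)
    return K.mean(DIST * L, axis=1)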

How is the categorical_crossentropy implemented in keras?

I'm trying to apply the concept of distillation, basically to train a new smaller network to do the same as the original one but with less computation.
I have the softmax outputs for every sample instead of the logits.
My question is, how is the categorical cross entropy loss function implemented?
Does it take the maximum value of the original labels and multiply it with the corresponding predicted value at the same index, or does it sum over all entries of the one-hot target, as the cross-entropy formula -sum_i y_i * log(y_hat_i) says?
As an answer to "Do you happen to know what the epsilon and tf.clip_by_value is doing?": it ensures that output != 0, because tf.log(0) is undefined (it evaluates to -inf).
(I don't have points to comment but thought I'd contribute.)
I see that you used the tensorflow tag, so I guess this is the backend you are using?
def categorical_crossentropy(output, target, from_logits=False):
    """Categorical crossentropy between an output tensor and a target tensor.

    # Arguments
        output: A tensor resulting from a softmax
            (unless `from_logits` is True, in which
            case `output` is expected to be the logits).
        target: A tensor of the same shape as `output`.
        from_logits: Boolean, whether `output` is the
            result of a softmax, or is a tensor of logits.

    # Returns
        Output tensor.
    """
This code comes from the Keras source code. Looking directly at the code should answer all your questions :) If you need more info, just ask!
EDIT:
Here is the code that interests you:
# Note: tf.nn.softmax_cross_entropy_with_logits
# expects logits, Keras expects probabilities.
if not from_logits:
    # scale preds so that the class probas of each sample sum to 1
    output /= tf.reduce_sum(output,
                            reduction_indices=len(output.get_shape()) - 1,
                            keep_dims=True)
# manual computation of crossentropy
epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
output = tf.clip_by_value(output, epsilon, 1. - epsilon)
return - tf.reduce_sum(target * tf.log(output),
                       reduction_indices=len(output.get_shape()) - 1)
If you look at the return, they sum it... :)
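To make that explicit, here is a small NumPy sketch of the same computation (illustrative only, with made-up toy values): the loss multiplies the target (one-hot or "soft", as in distillation) by the log of the predicted probabilities elementwise and then sums over the class axis.
import numpy as np

# one sample, three classes; the target can be one-hot or soft
target = np.array([0.0, 1.0, 0.0])
output = np.array([0.1, 0.7, 0.2])        # softmax probabilities

eps = 1e-7
output = np.clip(output, eps, 1.0 - eps)  # mirrors tf.clip_by_value
loss = -np.sum(target * np.log(output))   # the sum over the class axis
print(loss)  # about 0.357 = -log(0.7)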
