Implementing weight normalization using TensorFlow layers' `kernel_constraint` - python

Some of the TensorFlow layers, such as tf.layers.dense and tf.layers.conv2d, take in a kernel_constraint argument, which according to the TF API docs implements an
Optional projection function to be applied to the kernel after being updated by an Optimizer (e.g. used to implement norm constraints or value constraints for layer weights).
In [1], Salimans et al. present a neural network normalization technique, called weight normalization, which normalizes the weight vectors of the network layers, in contrast to, for example the batch normalization [2], which normalizes the actual data batch flowing through the layer. In some cases the computational overhead of the weight normalization method is lower and it can also be used in cases where the use of batch normalization is not feasible.
My question is: is it possible to implement weight normalization using the above-mentioned kernel_constraint argument? Assuming x is an input with shape (batch, height, width, channels), I thought I could implement it as follows:
x = tf.layers.conv2d(
    inputs=x,
    filters=16,
    kernel_size=(3, 3),
    strides=(1, 1),
    kernel_constraint=lambda kernel: (
        tf.nn.l2_normalize(kernel, list(range(kernel.shape.ndims - 1)))))
What would be a simple test case to validate/invalidate my solution?
[1] SALIMANS, Tim; KINGMA, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems. 2016. p. 901-909.
[2] IOFFE, Sergey; SZEGEDY, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Despite the title, the paper by Salimans and Kingma proposes decoupling each weight vector's norm from its direction, rather than actually normalising the weights (i.e. fixing their l2 norm to one, as you suggest).
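For reference, their reparameterization writes each weight vector as

w = (g / ||v||) * v

where v is an unconstrained vector and g is a separately learned scalar, so the norm of w becomes the trainable parameter g rather than being fixed to one.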
If you want to verify that your code has the intended effect even if it is not what they proposed, you can get the weights of the model and check their norm.
In pseudo-code:
import numpy as np
import tensorflow as tf

model = tf.keras.models.Model(inputs=inputs, outputs=x)
weights = model.get_weights()[i]  # checking the weights of the i-th layer
flat_weights = weights.flatten()
print(np.linalg.norm(flat_weights, 2))
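To turn that into a self-contained test, here is a minimal sketch using tf.keras (TF 2.x applies kernel constraints after each optimizer update during fit; the question uses tf.layers, so this is an adaptation, and the shapes below are illustrative, not from the question). After one training step, every filter should have unit l2 norm if the constraint works as intended:
normalize = lambda kernel: tf.nn.l2_normalize(
    kernel, axis=list(range(kernel.shape.ndims - 1)))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), kernel_constraint=normalize,
                           input_shape=(8, 8, 3)),
])
model.compile(optimizer='sgd', loss='mse')

x = np.random.rand(4, 8, 8, 3).astype('float32')
y = np.random.rand(4, 6, 6, 16).astype('float32')
model.fit(x, y, epochs=1, verbose=0)  # constraint applied after the update

kernel = model.layers[0].get_weights()[0]             # shape (3, 3, 3, 16)
per_filter_norms = np.linalg.norm(kernel.reshape(-1, 16), axis=0)
print(np.allclose(per_filter_norms, 1.0, atol=1e-5))  # expect True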

Related

Neural network with linear activation output. Calculate output range for each of the output neurons

Let's assume I have a neural network like the following:
model = keras.models.Sequential()
model.add(keras.layers.Dense(10, input_shape=(5,), activation='relu'))
model.add(keras.layers.Dense(4, activation='linear'))
With n output neurons with a linear activation function.
The training process is not important here, so we can take a look at the random weights that keras initialized using:
model.weights
Of course, in a real example, these weights should be adjusted in the training process.
Depending on these model.weights, each of the output neurons returns values in a range.
I would like to calculate this exact range.
Does keras offer any function to calculate it?
I built a flawed piece of code to make an approximation of it, using a loop and predicting random inputs. But this would not be really useful in a real example with many more inputs/neurons/weights.
Here are a few examples trying to clarify my question (all of them assume that the input values are between 0 and 1):
model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,),
                             activation='linear', use_bias=False))
model.set_weights([np.array([1, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between 0 and 2
model.set_weights([np.array([-0.5, 1]).reshape(2, 1)])
For the previous example the output neuron results would be between -0.5 and 1
model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(2,), activation='linear', use_bias=False))
model.add(keras.layers.Dense(1, activation='linear', use_bias=False))
model.set_weights([np.array([1, 1, 1, 1]).reshape(2,2), np.array([1, 1]).reshape(2,1)])
For the previous example, the output neuron results would be between 0 and 4
These are simplified examples. In a real scenario with a much more complex network structure, activation functions, biases, etc., these ranges are not obvious to calculate.
It sounds like you are roughly interested in what is referred to as neural network verification. This field broadly consists of answering the question: given a range of possible inputs, what is the range of possible outputs from a neural network with a given set of weights? A few things to note:
A neural network is essentially a complex, non-linear function. That is, it maps the input space to the output space. Defining an output range does not make sense except with respect to an input range. In your question you make no reference to the inputs, so your examples are flawed/incomplete.
In general, neural network verification is an emerging field with most published works being fairly recent (last 5-7 years). That being said, there are exact and approximate methods for fully connected networks with a variety of activation functions. I'll list a few such methods here:
https://arxiv.org/abs/2004.05519 - MATLAB toolbox, but you could export your neural network in ONNX format and then use MATLAB for the verification/output range analysis.
https://arxiv.org/abs/1804.10829 - specifically for ReLU activation function.
https://anwu1219.github.io/download/Marabou.pdf with python API available here: https://github.com/NeuralNetworkVerification/Marabou
The field is still evolving, so in some cases you may have to do some of the coding yourself rather than rely on pre-existing libraries, but these papers and a search query for neural network verification should at least give you some ideas of where to start.
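To make the idea concrete, here is a hand-rolled sketch (my own toy code, not a library call) of the simplest such method, interval bound propagation, for a bias-free linear layer with inputs assumed in [0, 1] as in your examples. Chaining it layer by layer reproduces your ranges, though for deep nonlinear networks the bounds can become loose:
import numpy as np

def linear_bounds(W, lo, hi):
    """Elementwise output bounds of x @ W when lo <= x <= hi."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return lo @ W_pos + hi @ W_neg, hi @ W_pos + lo @ W_neg

lo, hi = np.zeros(2), np.ones(2)                         # inputs in [0, 1]
print(linear_bounds(np.array([[1.0], [1.0]]), lo, hi))   # ([0.], [2.])
print(linear_bounds(np.array([[-0.5], [1.0]]), lo, hi))  # ([-0.5], [1.])

# Two stacked layers, as in your last example: [0, 4].
l1, h1 = linear_bounds(np.ones((2, 2)), lo, hi)
print(linear_bounds(np.ones((2, 1)), l1, h1))            # ([0.], [4.])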
IMO, there is no such function, as far as I know, to estimate the output value's range (without imposing your input restriction).
For example, a dense layer without bias is just a plain linear function, a = bx; in your case, you are restricting x to the 0-1 range and explicitly setting b to your desired values.
You will always get values in the ranges you've cited in your question only under those restrictions. As a hypothetical example, choose b randomly and the ranges in your question would no longer hold.
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,), activation='linear', use_bias=False))
# model.set_weights([np.array([1, 1]).reshape(2, 1)])

eval_func = keras.backend.function([model.input], [model.layers[-1].output])
outputs = eval_func(np.array([[2, 1]]))
counts, bins = np.histogram(outputs)
plt.hist(bins[:-1], bins, weights=counts)
plt.show()

Is convolution useful on a network with a timestep of 1?

This code comes from https://www.kaggle.com/dkaraflos/1-geomean-nn-and-6featlgbm-2-259-private-lb. The goal of this competition is to use seismic signals to predict the timing of laboratory earthquakes. The author of this kernel won first place among more than 4,000 teams.
def get_model():
    inp = Input(shape=(1, train_sample.shape[1]))
    x = BatchNormalization()(inp)
    x = LSTM(128, return_sequences=True)(x)  # LSTM as first layer performed better than Dense.
    x = Convolution1D(128, (2), activation='relu', padding="same")(x)
    x = Convolution1D(84, (2), activation='relu', padding="same")(x)
    x = Convolution1D(64, (2), activation='relu', padding="same")(x)
    x = Flatten()(x)
    x = Dense(64, activation="relu")(x)
    x = Dense(32, activation="relu")(x)

    # Outputs
    ttf = Dense(1, activation='relu', name='regressor')(x)  # Time to Failure
    tsf = Dense(1)(x)                                       # Time Since Failure
    classifier = Dense(1, activation='sigmoid')(x)          # Binary for TTF < 0.5 seconds

    model = models.Model(inputs=inp, outputs=[ttf, tsf, classifier])
    opt = optimizers.Nadam(lr=0.008)

    # We are fitting to 3 targets simultaneously: Time to Failure (TTF),
    # Time Since Failure (TSF), and Binary for TTF < 0.5 seconds.
    # We weight the model to optimize heavily for TTF.
    # Optimizing for TSF and Binary TTF < 0.5 helps to reduce overfitting
    # and helps generalization.
    model.compile(optimizer=opt,
                  loss=['mae', 'mae', 'binary_crossentropy'],
                  loss_weights=[8, 1, 1],
                  metrics=['mae'])
    return model
However, according to my derivation, I think x = Convolution1D(128, (2), activation='relu', padding="same")(x) and x = Dense(128, activation='relu')(x) have the same effect, because the convolution kernel performs convolution on a sequence with a time step of 1. In principle, it is very similar to a fully connected layer. Why use Conv1D here instead of directly using a fully connected layer? Is my derivation wrong?
1) Assuming you would input a sequence to the LSTM (the normal use case):
It would not be the same since the LSTM returns a sequence (return_sequences=True), thereby not reducing the input dimensionality. The output shape is therefore (Batch, Sequence, Hid). This is being fed to the Convolution1D layer which performs convolution on the Sequence dimension, i.e. on (Sequence, Hid). So in effect, the purpose of the 1D Convolutions is to extract local 1D subsequences/patches after the LSTM.
If we had return_sequences=False, the LSTM would return the final state h_t. To ensure the same behavior as a Dense layer, you need a fully connected convolutional layer, i.e. a kernel size of Sequence length, and we need as many filters as we have Hid in the output shape. This would then make the 1D Convolution equivalent to a Dense layer.
2) Assuming you do not input a sequence to the LSTM (your example):
In your example, the LSTM is used as a replacement for a Dense layer. It serves the same function, though it gives you a slightly different result, as the gates do additional transformations (even though we have no sequence).
Since the Convolution is then performed on (Sequence, Hid) = (1, Hid), it is indeed operating per timestep. Since we have 128 inputs and 128 filters, it is fully connected and the kernel size is large enough to operate on the single element. This meets the above defined criteria for a 1D Convolution to be equivalent to a Dense layer, so you're correct.
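If you want to convince yourself numerically, here is a small sketch (my own toy example, using kernel size 1 to isolate the per-timestep behaviour) showing that a 1D convolution applies the same dense transform at every timestep:
import numpy as np
from tensorflow.keras import layers

x = np.random.rand(1, 1, 4).astype('float32')  # (batch, sequence=1, features)

conv = layers.Conv1D(3, 1, use_bias=False)     # kernel size 1
dense = layers.Dense(3, use_bias=False)        # Dense acts on the last axis
_ = conv(x), dense(x)                          # call once to build both layers' weights

# Copy the conv kernel of shape (1, 4, 3) into the dense kernel (4, 3).
dense.set_weights([conv.get_weights()[0][0]])
print(np.allclose(conv(x), dense(x)))          # True: identical per-timestep transform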
As a side note, this type of architecture is something you would typically get with a Neural Architecture Search. The "replacements" used here are not really commonplace and not generally guaranteed to be better than the more established counterparts. In a lot of cases, using Reinforcement Learning or Evolutionary Algorithms can however yield slightly better accuracy using "untraditional" solutions since very small performance gains can just happen by chance and don't have to necessarily reflect back on the usefulness of the architecture.

Tensorflow - building LSTM model - need for tf.keras.layers.Dense()

Python 3.7, TensorFlow.
I am experimenting with time series forecasting in TensorFlow.
I understand the second line creates an LSTM RNN, i.e. a Recurrent Neural Network of type Long Short-Term Memory.
Why do we need to add a Dense(1) layer at the end?
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
Tutorial for Dense() says
Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Could you rephrase or elaborate on the need for Dense() here?
The following line
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
creates an LSTM layer which transforms each input step of size #features into a latent representation of size 32. You want to predict a single value so you need to convert this latent representation of size 32 into a single value. Hence, you add the following line
single_step_model.add(tf.keras.layers.Dense(1))
which adds a Dense Layer (Fully-Connected Neural Network) with one neuron in the output which, obviously, produces a single value. Look at it as a way to transform an intermediate result of higher dimensionality into the final result.
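To see the shape transformation concretely, here is a small sketch (with hypothetical numbers: 20 past timesteps, 3 features per step, not taken from the tutorial):
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(20, 3)),  # (batch, 20, 3) -> (batch, 32)
    tf.keras.layers.Dense(1),                       # (batch, 32)    -> (batch, 1)
])
print(model.output_shape)  # (None, 1): one predicted value per sample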
Well, in the tutorial you are following, Time series forecasting, they are trying to forecast temperature (6 hours ahead), for which they use an LSTM followed by a Dense layer.
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
The Dense layer is nothing but a regular fully-connected NN layer. In this case you are bringing the output dimensionality down to 1, which should bear some relation (not necessarily linear) to the temperature you are trying to predict. There are other layers you could use as well; check out Keras Layers.
If you are confused about the input and output shape of the LSTM, check out I/O Shape.

tensorflow basic word2vec example: Shouldn't we be using weights [nce_weight Transpose] for the representation and not embedding matrix?

I am referring to this sample code, specifically the snippet below:
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the NCE loss
nce_weights = tf.Variable(tf.truncated_normal(
    [vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
Now the NCE loss function is nothing but a single-hidden-layer neural network with softmax at the output layer [given that it takes only a few negative samples].
This part of the graph will only update the weights of the network; it is not doing anything to the "embeddings" matrix/tensor.
So ideally, once the network is trained, we should pass each word through the embeddings matrix first and then multiply by the transpose of the "nce_weights" [considering them the same weights, auto-encoder style, at the input & output layers] to reach the hidden-layer representation of each word, which is what we are calling word2vec (?)
But if you look at the later part of the code, the value of the embeddings matrix is being used as the word representation.
Even the tensorflow doc for NCE loss, mentions input (to which we are passing embed, which uses embeddings) as just the 1st layer input activation values.
inputs: A Tensor of shape [batch_size, dim]. The forward activations of the input network.
Normal backpropagation stops at the first layer of the network.
Does this implementation of NCE loss go beyond that and propagate the loss to the input values (and hence to the embeddings)?
This seems like an extra step.
Refer to this for why I am calling it an extra step; he gives the same explanation.
What I have figured out from reading and going through the TensorFlow code is that,
though the entire thing is a single-hidden-layer neural network, it is indeed an auto-encoder. But the weights are not tied, as I had assumed.
The encoder is made of the weight matrix embeddings and the decoder is made of the nce_weights. And embed is nothing but the hidden-layer output, obtained by multiplying the input with embeddings.
So with this, embeddings and nce_weights will both be updated in the graph, and we can choose either of the two weight matrices; embeddings is preferred here.
Edit1:
Actually, for both tf.nn.nce_loss and tf.nn.sampled_softmax_loss, the parameters weights and biases are for the term Weights(transpose) · X + biases that feeds the objective function, which can be logistic regression or a softmax function [refer].
But back-propagation/gradient descent runs all the way to the very base of the graph you are building and does not stop at the weights and biases of the loss function. Hence the inputs parameter in both tf.nn.nce_loss and tf.nn.sampled_softmax_loss is also updated, which in turn is built from the embeddings matrix.
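A minimal sketch to check this yourself (TF1-style graph mode, mirroring the snippet in the question; the toy sizes are illustrative): if the gradient of the loss with respect to embeddings is not None, the embeddings matrix is indeed being trained.
import math
import tensorflow as tf  # TF1-style graph mode assumed

vocabulary_size, embedding_size, num_sampled = 50, 8, 4
train_inputs = tf.placeholder(tf.int32, shape=[None])
train_labels = tf.placeholder(tf.int32, shape=[None, 1])

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
nce_weights = tf.Variable(tf.truncated_normal(
    [vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(tf.nn.nce_loss(
    weights=nce_weights, biases=nce_biases, labels=train_labels,
    inputs=embed, num_sampled=num_sampled, num_classes=vocabulary_size))

grads = tf.gradients(loss, [embeddings, nce_weights])
print(grads)  # neither entry is None: both matrices receive gradient updates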

Batch normalization with 3D convolutions in TensorFlow

I'm implementing a model relying on 3D convolutions (for a task that is similar to action recognition) and I want to use batch normalization (see [Ioffe & Szegedy 2015]). I could not find any tutorial focusing on 3D convs, hence I'm making a short one here which I'd like to review with you.
The code below refers to TensorFlow r0.12 and it explicitly instantiates variables - I mean I'm not using tf.contrib.learn except for the tf.contrib.layers.batch_norm() function. I'm doing this both to better understand how things work under the hood and to have more implementation freedom (e.g., variable summaries).
I will get to the 3D convolution case smoothly by first writing the example for a fully-connected layer, then for a 2D convolution and finally for the 3D case. While going through the code, it would be great if you could check if everything is done correctly - the code runs, but I'm not 100% sure about the way I apply batch normalization. I end this post with a more detailed question.
import tensorflow as tf
# This flag is used to allow/prevent batch normalization params updates
# depending on whether the model is being trained or used for prediction.
training = tf.placeholder_with_default(True, shape=())
Fully-connected (FC) case
# Input.
INPUT_SIZE = 512
u = tf.placeholder(tf.float32, shape=(None, INPUT_SIZE))
# FC params: weights only, no bias as per [Ioffe & Szegedy 2015].
FC_OUTPUT_LAYER_SIZE = 1024
w = tf.Variable(tf.truncated_normal(
    [INPUT_SIZE, FC_OUTPUT_LAYER_SIZE], dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
fc = tf.matmul(u, w)
# Batch normalization.
fc_bn = tf.contrib.layers.batch_norm(
    fc,
    center=True,
    scale=True,
    is_training=training,
    scope='fc-batch_norm')
# Activation function.
fc_bn_relu = tf.nn.relu(fc_bn)
print(fc_bn_relu) # Tensor("Relu:0", shape=(?, 1024), dtype=float32)
2D convolutional (CNN) layer case
# Input: 640x480 RGB images (whitened input, hence tf.float32).
INPUT_HEIGHT = 480
INPUT_WIDTH = 640
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN_FILTER_HEIGHT = 3 # Space dimension.
CNN_FILTER_WIDTH = 3 # Space dimension.
CNN_FILTERS = 128
w = tf.Variable(tf.truncated_normal(
    [CNN_FILTER_HEIGHT, CNN_FILTER_WIDTH, INPUT_CHANNELS, CNN_FILTERS],
    dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN_LAYER_STRIDE_VERTICAL = 1
CNN_LAYER_STRIDE_HORIZONTAL = 1
CNN_LAYER_PADDING = 'SAME'
cnn = tf.nn.conv2d(
    input=u, filter=w,
    strides=[1, CNN_LAYER_STRIDE_VERTICAL, CNN_LAYER_STRIDE_HORIZONTAL, 1],
    padding=CNN_LAYER_PADDING)
# Batch normalization.
cnn_bn = tf.contrib.layers.batch_norm(
    cnn,
    data_format='NHWC',  # Matching the "cnn" tensor which has shape (?, 480, 640, 128).
    center=True,
    scale=True,
    is_training=training,
    scope='cnn-batch_norm')
# Activation function.
cnn_bn_relu = tf.nn.relu(cnn_bn)
print(cnn_bn_relu) # Tensor("Relu_1:0", shape=(?, 480, 640, 128), dtype=float32)
3D convolutional (CNN3D) layer case
# Input: sequence of 9 160x120 RGB images (whitened input, hence tf.float32).
INPUT_SEQ_LENGTH = 9
INPUT_HEIGHT = 120
INPUT_WIDTH = 160
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_SEQ_LENGTH, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN3D_FILTER_LENGTH = 3 # Time dimension.
CNN3D_FILTER_HEIGHT = 3 # Space dimension.
CNN3D_FILTER_WIDTH = 3 # Space dimension.
CNN3D_FILTERS = 96
w = tf.Variable(tf.truncated_normal(
    [CNN3D_FILTER_LENGTH, CNN3D_FILTER_HEIGHT, CNN3D_FILTER_WIDTH, INPUT_CHANNELS, CNN3D_FILTERS],
    dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN3D_LAYER_STRIDE_TEMPORAL = 1
CNN3D_LAYER_STRIDE_VERTICAL = 1
CNN3D_LAYER_STRIDE_HORIZONTAL = 1
CNN3D_LAYER_PADDING = 'SAME'
cnn3d = tf.nn.conv3d(
    input=u, filter=w,
    strides=[1, CNN3D_LAYER_STRIDE_TEMPORAL, CNN3D_LAYER_STRIDE_VERTICAL, CNN3D_LAYER_STRIDE_HORIZONTAL, 1],
    padding=CNN3D_LAYER_PADDING)
# Batch normalization.
cnn3d_bn = tf.contrib.layers.batch_norm(
    cnn3d,
    data_format='NHWC',  # Matching the "cnn3d" tensor which has shape (?, 9, 120, 160, 96).
    center=True,
    scale=True,
    is_training=training,
    scope='cnn3d-batch_norm')
# Activation function.
cnn3d_bn_relu = tf.nn.relu(cnn3d_bn)
print(cnn3d_bn_relu) # Tensor("Relu_2:0", shape=(?, 9, 120, 160, 96), dtype=float32)
What I would like to make sure is whether the code above exactly implements batch normalization as described in [Ioffe & Szegedy 2015] at the end of Sec. 3.2:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. [...] Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
UPDATE
I guess the code above is also correct for the 3D conv case. In fact, when I define my model and print all the trainable variables, I also see the expected numbers of beta and gamma variables. For instance:
Tensor("conv3a/conv3d_weights/read:0", shape=(3, 3, 3, 128, 256), dtype=float32)
Tensor("BatchNorm_2/beta/read:0", shape=(256,), dtype=float32)
Tensor("BatchNorm_2/gamma/read:0", shape=(256,), dtype=float32)
This looks OK to me, since due to BN one pair of beta and gamma is learned for each feature map (256 in total).
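As a sanity check of that property, here is a small numpy sketch (my own, not from the paper or the TF source) of what batch normalization computes at training time for channels-last tensors: statistics are taken over every axis except channels, so each feature map gets a single mean/variance and a single beta/gamma pair, for 2D and 3D convolutions alike.
import numpy as np

def bn_train_sketch(x, beta, gamma, eps=1e-3):
    # Reduce over all axes except the last (channels): (0, 1, 2) for NHWC,
    # (0, 1, 2, 3) for NDHWC -- one mean/variance per feature map.
    axes = tuple(range(x.ndim - 1))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.rand(2, 4, 6, 8, 96)  # toy-sized NDHWC tensor, 96 channels
out = bn_train_sketch(x, beta=np.zeros(96), gamma=np.ones(96))
print(out.shape)  # (2, 4, 6, 8, 96)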
[Ioffe & Szegedy 2015]: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
That is a great post about 3D batchnorm; it often goes unnoticed that batchnorm can be applied to any tensor of rank greater than 1. Your code is correct, but I couldn't help adding a few important notes on this:
A "standard" 2D batchnorm (one that accepts a 4D tensor) can be significantly faster in TensorFlow than 3D or higher, because it supports the fused_batch_norm implementation, which applies one kernel operation:
Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process
that for some models makes up a large percentage of the operation
time. Using fused batch norm can result in a 12%-30% speedup.
There is an issue on GitHub to support 3D filters as well, but there hasn't been any recent activity, and at this point the issue is closed and unresolved.
Although the original paper prescribes using batchnorm before the ReLU activation (and that's what you did in the code above), there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on the Keras GitHub by François Chollet:
... I can guarantee that recent code written by Christian [Szegedy] applies relu before BN. It is still occasionally a topic of debate, though.
For anyone interested to apply the idea of normalization in practice, there's been recent research developments of this idea, namely weight normalization and layer normalization, which fix certain disadvantages of original batchnorm, for example they work better for LSTM and recurrent networks.
