Why does global average pooling work in ResNet?

Recently I started a classification project using a very shallow ResNet.
The model has just 10 conv. layers, followed by a global average pooling layer and then the softmax layer.
The performance is as good as I expected --- 93% (yeah, it is ok).
However, for some reasons I need to replace the global average pooling layer.
I have tried the following alternatives (the input to this layer has shape [-1, 128, 1, 32], TensorFlow form):
Global max pooling layer, but got 85% ACC.
Exponential moving average, but got 12% (it almost didn't work):
split_list = tf.split(input, 128, axis=1)
avg_pool = split_list[0]
beta = 0.5
for i in range(1, 128):
    avg_pool = beta * split_list[i] + (1 - beta) * avg_pool
avg_pool = tf.reshape(avg_pool, [-1, 32])
Splitting the input into 4 parts, pooling each part, and finally concatenating them, but got 75%:
split_shape = [32, 32, 32, 32]
split_list = tf.split(input,
                      split_shape,
                      axis=1)
for i in range(len(split_shape)):
    split_list[i] = tf.keras.layers.GlobalMaxPooling2D()(split_list[i])
avg_pool = tf.concat(split_list, axis=1)
Averaging over the last channel, [-1, 128, 1, 32] --> [-1, 128]: didn't work.
Using a conv. layer with a single kernel, so that the output shape is [-1, 128, 1, 1]: didn't work either, 25% or so.
I am pretty confused about why global average pooling works so well here.
And is there any other way to replace it?

Global Average Pooling has the following advantages over the fully connected final layers paradigm:
The removal of a large number of trainable parameters from the model. Fully connected or dense layers have lots of parameters: a 7 x 7 x 64 CNN output being flattened and fed into a 500-node dense layer yields roughly 1.57 million weights which need to be trained. Removing these layers speeds up the training of your model.
The elimination of all these trainable parameters also reduces the tendency of over-fitting, which needs to be managed in fully connected layers by the use of dropout.
The authors argue in the original paper that removing the fully connected classification layers forces the feature maps to be more closely related to the classification categories – so that each feature map becomes a kind of “category confidence map”.
Finally, the authors also argue that, due to the averaging operation over the feature maps, the model becomes more robust to spatial translations in the data. In other words, as long as the requisite feature is included / activated in the feature map somewhere, it will still be “picked up” by the averaging operation.
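To make the parameter comparison concrete, here is a minimal Keras sketch (mine, not from the original answer; the 7 x 7 x 64 feature map and the 500-node layer are the hypothetical numbers used above):

import tensorflow as tf

# Hypothetical 7 x 7 x 64 feature map feeding a 10-class classifier.
feature_map = tf.keras.Input(shape=(7, 7, 64))

# Dense head: flatten, then a 500-node fully connected layer.
flat = tf.keras.layers.Flatten()(feature_map)                    # 3136 values
hidden = tf.keras.layers.Dense(500)(flat)                        # 3136*500 + 500 = 1,568,500 params
dense_head = tf.keras.layers.Dense(10, activation='softmax')(hidden)

# GAP head: one value per feature map, no extra hidden layer.
gap = tf.keras.layers.GlobalAveragePooling2D()(feature_map)      # 64 values, 0 params
gap_head = tf.keras.layers.Dense(10, activation='softmax')(gap)  # 64*10 + 10 = 650 params

tf.keras.Model(feature_map, dense_head).summary()  # ~1.57M trainable parameters
tf.keras.Model(feature_map, gap_head).summary()    # 650 trainable parameters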

Related

Is convolution useful on a network with a timestep of 1?

This code comes from https://www.kaggle.com/dkaraflos/1-geomean-nn-and-6featlgbm-2-259-private-lb. The goal of the competition is to use seismic signals to predict the timing of laboratory earthquakes, and the author of this kernel finished first among more than 4,000 teams.
def get_model():
    inp = Input(shape=(1, train_sample.shape[1]))
    x = BatchNormalization()(inp)
    x = LSTM(128, return_sequences=True)(x)  # LSTM as first layer performed better than Dense.
    x = Convolution1D(128, (2), activation='relu', padding="same")(x)
    x = Convolution1D(84, (2), activation='relu', padding="same")(x)
    x = Convolution1D(64, (2), activation='relu', padding="same")(x)
    x = Flatten()(x)
    x = Dense(64, activation="relu")(x)
    x = Dense(32, activation="relu")(x)
    # Outputs
    ttf = Dense(1, activation='relu', name='regressor')(x)  # Time to Failure
    tsf = Dense(1)(x)  # Time Since Failure
    classifier = Dense(1, activation='sigmoid')(x)  # Binary for TTF < 0.5 seconds
    model = models.Model(inputs=inp, outputs=[ttf, tsf, classifier])
    opt = optimizers.Nadam(lr=0.008)
    # We are fitting to 3 targets simultaneously: Time to Failure (TTF), Time Since Failure (TSF),
    # and a binary flag for TTF < 0.5 seconds.
    # We weight the model to optimize heavily for TTF; optimizing for TSF and the binary target
    # helps reduce overfitting and helps generalization.
    model.compile(optimizer=opt, loss=['mae', 'mae', 'binary_crossentropy'],
                  loss_weights=[8, 1, 1], metrics=['mae'])
    return model
However, according to my derivation, I think x = Convolution1D(128, (2), activation='relu', padding="same")(x) and x = Dense(128, activation='relu')(x) have the same effect, because the convolution kernel performs its convolution on a sequence with a time step of 1, which in principle is very similar to a fully connected layer. Why use Conv1D here instead of a fully connected layer directly? Is my derivation wrong?
1) Assuming you would input a sequence to the LSTM (the normal use case):
It would not be the same since the LSTM returns a sequence (return_sequences=True), thereby not reducing the input dimensionality. The output shape is therefore (Batch, Sequence, Hid). This is being fed to the Convolution1D layer which performs convolution on the Sequence dimension, i.e. on (Sequence, Hid). So in effect, the purpose of the 1D Convolutions is to extract local 1D subsequences/patches after the LSTM.
If we had return_sequences=False, the LSTM would return the final state h_t. To ensure the same behavior as a Dense layer, you need a fully connected convolutional layer, i.e. a kernel size of Sequence length, and we need as many filters as we have Hid in the output shape. This would then make the 1D Convolution equivalent to a Dense layer.
2) Assuming you do not input a sequence to the LSTM (your example):
In your example, the LSTM is used as a replacement for a Dense layer.
It serves the same function, though it gives you a slightly different
result as the gates do additional transformations (even though we
have no sequence).
Since the Convolution is then performed on (Sequence, Hid) = (1, Hid), it is indeed operating per timestep. Since we have 128 inputs and 128 filters, it is fully connected and the kernel size is large enough to operate on the single element. This meets the above defined criteria for a 1D Convolution to be equivalent to a Dense layer, so you're correct.
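To verify this numerically, here is a small sketch (mine, not from the original answer; I use kernel_size=1 for an exact weight-for-weight match, while the post's kernel size of 2 with "same" padding behaves the same way on a single timestep): copy a Dense layer's weights into a Conv1D and compare the outputs:

import numpy as np
import tensorflow as tf

x = np.random.rand(4, 1, 128).astype('float32')  # (batch, sequence length 1, 128 features)

dense = tf.keras.layers.Dense(128, activation='relu')
conv = tf.keras.layers.Conv1D(128, kernel_size=1, activation='relu')

dense(x)  # call both layers once so their weights are built
conv(x)

# Copy the Dense weights into the Conv1D kernel: (in, out) -> (1, in, out).
w, b = dense.get_weights()
conv.set_weights([w[np.newaxis, ...], b])

print(np.allclose(dense(x).numpy(), conv(x).numpy()))  # True: identical per-timestep transform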
As a side note, this type of architecture is something you would typically get with a Neural Architecture Search. The "replacements" used here are not really commonplace and not generally guaranteed to be better than the more established counterparts. In a lot of cases, using Reinforcement Learning or Evolutionary Algorithms can however yield slightly better accuracy using "untraditional" solutions since very small performance gains can just happen by chance and don't have to necessarily reflect back on the usefulness of the architecture.

How to combat huge numbers produced by relu-oriented CNN

I have a CNN with a structure loosely based on AlexNet; see below:
Convolutional Neural Network structure:
100x100x3 Input image
25x25x12 Convolutional layer: 4x4x12, stride = 4, padding = 0
12x12x12 Max pooling layer: 3x3, stride = 2
12x12x24 Convolutional layer: 5x5x24, stride = 1, padding = 2
5x5x24 Max pooling layer: 4x4, stride = 2
300x1x1 Flatten layer: 600 -> 300
300x1x1 Fully connected layer: 300
3x1x1 Fully connected layer: 3
Obviously, with only max pooling and convolutional layers, the numbers will approach 0 or infinity, depending on the sign and magnitude of the weights. I was wondering about approaches to combat this, seeing as I would like to avoid large numbers.
One problem that arises from this is if you use sigmoid in the final layers: since the derivative of sigmoid is s(x)*(1-s(x)), larger numbers will inevitably push the value of sigmoid to 1, and so on backprop you get 1*(1-1), which obviously doesn't go down too well.
So I would like to know of any ways to try and keep the numbers low.
Tagged with python because that's what I implemented this in. I used my own code.
I asked this question on the AI Stack Exchange (which it is better suited for) and, by implementing the correct weight initialisation, numbers will neither explode nor vanish on a forward or backward pass. See here: https://ai.stackexchange.com/questions/13106/how-are-exploding-numbers-in-a-forward-pass-of-a-cnn-combated
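For reference, a minimal sketch of the fix (mine, not from the linked answer): He initialisation, the standard choice for ReLU networks like this one.

import numpy as np

def he_init(shape, fan_in, rng):
    # He initialisation: std = sqrt(2 / fan_in) keeps the variance of
    # activations roughly constant through ReLU layers, so values neither
    # explode nor vanish on the forward pass.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

rng = np.random.default_rng(0)
# Example: the first conv layer above (4x4 kernels, 3 input channels, 12 filters).
fan_in = 4 * 4 * 3
w = he_init((12, 4, 4, 3), fan_in, rng)
print(w.std())  # close to sqrt(2/48) ~ 0.2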

Implement Causal CNN in Keras for multivariate time-series prediction

This question is a followup to my previous question here: Multi-feature causal CNN - Keras implementation. However, there are numerous things that are unclear to me that I think warrant a new question. The model in question here has been built according to the accepted answer in the post mentioned above.
I am trying to apply a Causal CNN model on multivariate time-series data of 10 sequences with 5 features.
lookback, features = 10, 5
What should filters and kernel be set to?
What is the effect of filters and kernel on the network?
Are these just arbitrary numbers - i.e., like the number of neurons in an ANN layer?
Or will they have an effect on how the net interprets the time-steps?
What should dilations be set to?
Is this just an arbitrary number or does this represent the lookback of the model?
filters = 32
kernel = 5
dilations = 5
dilation_rates = [2 ** i for i in range(dilations)]
model = Sequential()
model.add(InputLayer(input_shape=(lookback, features)))
model.add(Reshape(target_shape=(features, lookback, 1), input_shape=(lookback, features)))
According to the previously mentioned answer, the input needs to be reshaped according to the following logic:
After Reshape 5 input features are now treated as the temporal layer for the TimeDistributed layer
When Conv1D is applied to each input feature, it thinks the shape of the layer is (10, 1)
with the default "channels_last", therefore...
10 time-steps is the temporal dimension
1 is the "channel", the new location for the feature maps
# Add causal layers
for dilation_rate in dilation_rates:
    model.add(TimeDistributed(Conv1D(filters=filters,
                                     kernel_size=kernel,
                                     padding='causal',
                                     dilation_rate=dilation_rate,
                                     activation='elu')))
According to the mentioned answer, the model needs to be reshaped, according to the following logic:
Stack feature maps on top of each other so each time step can look at all features produced earlier - (10 time steps, 5 features * 32 filters)
Next, causal layers are now applied to the 5 input features dependently.
Why were they initially applied independently?
Why are they now applied dependently?
model.add(Reshape(target_shape=(lookback, features * filters)))
next_dilations = 3
dilation_rates = [2 ** i for i in range(next_dilations)]
for dilation_rate in dilation_rates:
    model.add(Conv1D(filters=filters,
                     kernel_size=kernel,
                     padding='causal',
                     dilation_rate=dilation_rate,
                     activation='elu'))
model.add(MaxPool1D())
model.add(Flatten())
model.add(Dense(units=1, activation='linear'))
model.summary()
SUMMARY
What should filters and kernel be set to?
Will they have an effect on how the net interprets the time-steps?
What should dilations be set to, to represent a lookback of 10?
Why are causal layers initially applied independently?
Why are they applied dependently after reshape?
Why not apply them dependently from the beginning?
===========================================================================
FULL CODE
lookback, features = 10, 5
filters = 32
kernel = 5
dilations = 5
dilation_rates = [2 ** i for i in range(dilations)]
model = Sequential()
model.add(InputLayer(input_shape=(lookback, features)))
model.add(Reshape(target_shape=(features, lookback, 1), input_shape=(lookback, features)))
# Add causal layers
for dilation_rate in dilation_rates:
    model.add(TimeDistributed(Conv1D(filters=filters,
                                     kernel_size=kernel,
                                     padding='causal',
                                     dilation_rate=dilation_rate,
                                     activation='elu')))
model.add(Reshape(target_shape=(lookback, features * filters)))
next_dilations = 3
dilation_rates = [2 ** i for i in range(next_dilations)]
for dilation_rate in dilation_rates:
    model.add(Conv1D(filters=filters,
                     kernel_size=kernel,
                     padding='causal',
                     dilation_rate=dilation_rate,
                     activation='elu'))
model.add(MaxPool1D())
model.add(Flatten())
model.add(Dense(units=1, activation='linear'))
model.summary()
===========================================================================
EDIT:
Daniel, thank you for your answer.
Question:
If you can explain "exactly" how you're structuring your data (what the original data is, how you're transforming it into the input shape, whether you have independent sequences, whether you're creating sliding windows, etc.), a better understanding of this process could be achieved.
Answer:
I hope I understand your question correctly.
Each feature is a sequence array of time-series data. They are independent, as in they are not an image, but they correlate with each other somewhat.
That is why I am trying to use WaveNet, which is very good at predicting a single time-series array; however, my problem requires me to use multiple features.
Comments about the given answer
Questions:
Why are causal layers initially applied independently?
Why are they applied dependently after reshape?
Why not apply them dependently from the beginning?
That answer is sort of strange. I'm not an expert, but I don't see the need to keep independent features with a TimeDistributed layer. I also cannot say whether it gives a better result or not. At first I'd say it's just unnecessary, but it might bring extra intelligence, given that it might see relations that involve distant steps between two features instead of just looking at "same steps". (This should be tested.)
Nevertheless, there is a mistake in that approach.
The reshapes that are intended to swap the lookback and feature sizes are not doing what they are expected to do. The author of that answer clearly wants to swap the axes (which keeps the interpretation of what is feature and what is lookback), which is different from a reshape (which mixes everything together, so the data loses its meaning).
A correct approach would need actual axis swapping, like model.add(Permute((2,1))), instead of the reshapes.
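To make the difference concrete, here is a small sketch (mine, not from the original exchange) on a toy (lookback=2, features=3) tensor:

import numpy as np
import tensorflow as tf

x = np.arange(6).reshape(1, 2, 3).astype('float32')  # (batch, lookback=2, features=3)
# [[[0 1 2]
#   [3 4 5]]]

permuted = tf.keras.layers.Permute((2, 1))(x)  # (1, 3, 2): axes swapped
reshaped = tf.keras.layers.Reshape((3, 2))(x)  # (1, 3, 2): same shape, scrambled content

print(permuted.numpy())  # [[[0 3] [1 4] [2 5]]] - each row is one feature over time
print(reshaped.numpy())  # [[[0 1] [2 3] [4 5]]] - time and feature values are mixed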
So, I can't answer those questions definitively, but nothing seems to create that need.
One sure thing is: you will certainly want the dependent part. A model will not get anywhere near the intelligence of your original model if it doesn't consider relations between features. (Unless you're lucky enough to have data that is completely independent.)
Now, explaining the relation between LSTM and Conv1D
An LSTM can be directly compared to a Conv1D: the shapes used are exactly the same, and they mean virtually the same thing, as long as you're using channels_last.
That said, the shape (samples, input_length, features_or_channels) is the correct shape for both LSTM and Conv1D. In fact, features and channels are exactly the same thing in this case. What changes is how each layer works regarding the input length and calculations.
Concept of filters and kernels
The kernel is the entire tensor inside the conv layer that is multiplied with the inputs to produce the results. A kernel's shape comprises its spatial size (kernel_size) and its number of filters (output features); the number of input features is inferred automatically.
So there is not a number of kernels, but there is a kernel_size. The kernel size is how many steps in the length are joined together for each output step. (This tutorial is great for understanding 2D convolutions regarding what it does and what the kernel size is - just imagine 1D images instead -- this tutorial doesn't show the number of "filters" though, it's like 1-filter animations.)
The number of filters relates directly to the number of features, they're exactly the same thing.
What should filters and kernel be set to?
So, if your LSTM layer is using units=256, meaning it will output 256 features, you should use filters=256, meaning your convolution will output 256 channels/features.
This is not a rule, though, you may find that using more or less filters could bring better results, since the layers do different things after all. There is no need to have all layers with the same number of filters as well!! Here you should go with a parameter tuning. Test to see which numbers are best for your goal and data.
Now, kernel size is something that can't be compared to the LSTM. It's a new thing added to the model.
The number 3 is sort of a very common choice. It means that the convolution will take three time steps to produce one time step. Then slide one step to take another group of three steps to produce the next step and so on.
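For example, a minimal shape check (my own sketch):

import numpy as np
import tensorflow as tf

x = np.random.rand(1, 10, 5).astype('float32')    # (batch, length=10, features=5)
y = tf.keras.layers.Conv1D(32, kernel_size=3)(x)  # default padding='valid'
print(y.shape)  # (1, 8, 32): each output step looks at 3 input steps, 32 filters/features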
Dilations
Dilations mean how many spaces between steps the convolution filter will have.
A convolution dilation_rate=1 takes kernel_size consecutive steps to produce one step.
A convolution with dilation_rate = 2 takes, for instance, steps 0, 2 and 4 to produce a step. Then takes steps 1,3,5 to produce the next step and so on.
What should dilations be set to, to represent a lookback of 10?
range = 1 + (kernel_size - 1) * dilation_rate
So, with a kernel size = 3:
Dilation = 0 (dilation_rate=1): the kernel size will range 3 steps
Dilation = 1 (dilation_rate=2): the kernel size will range 5 steps
Dilation = 2 (dilation_rate=4): the kernel size will range 9 steps
Dilation = 3 (dilation_rate=8): the kernel size will range 17 steps
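Putting that together, a tiny helper (my own sketch, assuming a kernel size of 3 as above) shows how many stacked dilated layers are needed for the receptive field to cover a lookback of 10:

def receptive_field(kernel_size, dilation_rates):
    # Each stacked causal conv layer adds (kernel_size - 1) * dilation_rate
    # steps to the receptive field of the network.
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

print(receptive_field(3, [1]))        # 3
print(receptive_field(3, [1, 2]))     # 7
print(receptive_field(3, [1, 2, 4]))  # 15 >= 10, enough to cover a lookback of 10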
My question to you
If you can explain "exactly" how you're structuring your data (what the original data is, how you're transforming it into the input shape, whether you have independent sequences, whether you're creating sliding windows, etc.), a better understanding of this process could be achieved.

Implementing weight normalization using TensorFlow layers' `kernel_constraint`

Some of the TensorFlow layers, such as tf.layers.dense and tf.layers.conv2d, take in a kernel_constraint argument, which, according to the TF API docs, implements an
Optional projection function to be applied to the kernel after being updated by an Optimizer (e.g. used to implement norm constraints or value constraints for layer weights).
In [1], Salimans et al. present a neural network normalization technique called weight normalization, which normalizes the weight vectors of the network layers, in contrast to, for example, batch normalization [2], which normalizes the actual data batch flowing through the layer. In some cases the computational overhead of weight normalization is lower, and it can also be used in cases where batch normalization is not feasible.
My question is: is it possible to implement the weight normalization using the abovementioned TensorFlow layers' kernel_constraint? Assuming x is an input with shape (batch, height, width, channels), I thought I could implement it as follows:
x = tf.layers.conv2d(
    inputs=x,
    filters=16,
    kernel_size=(3, 3),
    strides=(1, 1),
    kernel_constraint=lambda kernel: (
        tf.nn.l2_normalize(kernel, list(range(kernel.shape.ndims - 1)))))
What would be a simple test case to validate/invalidate my solution?
[1] SALIMANS, Tim; KINGMA, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems. 2016. p. 901-909.
[2] IOFFE, Sergey; SZEGEDY, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Despite the title, the paper by Salimans and Kingma suggests to decouple the weight norm and their direction, rather than actually normalising the weights (i.e. setting their l2 norm to one as you suggested).
If you want to verify that your code has the intended effect even if it is not what they proposed, you can get the weights of the model and check their norm.
In pseudo-code:
import numpy as np

model = tf.keras.models.Model(inputs=inputs, outputs=x)
weights = model.get_weights()[i]  # checking the weights of the i-th layer
flat_weights = weights.flatten()
print(np.linalg.norm(flat_weights, 2))
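One possible concrete test (my own sketch using tf.keras rather than tf.layers, with the same constraint): run a single training step so the projection is actually applied, then check that every filter has unit norm:

import numpy as np
import tensorflow as tf

constraint = lambda kernel: tf.nn.l2_normalize(
    kernel, axis=list(range(kernel.shape.ndims - 1)))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), input_shape=(8, 8, 4),
                           kernel_constraint=constraint),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='sgd', loss='mse')

# One gradient step triggers the constraint projection after the weight update.
x = np.random.rand(2, 8, 8, 4).astype('float32')
y = np.random.rand(2, 1).astype('float32')
model.fit(x, y, epochs=1, verbose=0)

kernel = model.layers[0].get_weights()[0]   # shape (3, 3, 4, 16)
norms = np.linalg.norm(kernel.reshape(-1, 16), axis=0)
print(np.allclose(norms, 1.0, atol=1e-5))   # True if the projection worked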

Batch normalization with 3D convolutions in TensorFlow

I'm implementing a model relying on 3D convolutions (for a task that is similar to action recognition) and I want to use batch normalization (see [Ioffe & Szegedy 2015]). I could not find any tutorial focusing on 3D convs, hence I'm making a short one here which I'd like to review with you.
The code below refers to TensorFlow r0.12 and it explicitly instantiates variables - I mean I'm not using tf.contrib.learn except for the tf.contrib.layers.batch_norm() function. I'm doing this both to better understand how things work under the hood and to have more implementation freedom (e.g., variable summaries).
I will get to the 3D convolution case smoothly by first writing the example for a fully-connected layer, then for a 2D convolution and finally for the 3D case. While going through the code, it would be great if you could check if everything is done correctly - the code runs, but I'm not 100% sure about the way I apply batch normalization. I end this post with a more detailed question.
import tensorflow as tf
# This flag is used to allow/prevent batch normalization params updates
# depending on whether the model is being trained or used for prediction.
training = tf.placeholder_with_default(True, shape=())
Fully-connected (FC) case
# Input.
INPUT_SIZE = 512
u = tf.placeholder(tf.float32, shape=(None, INPUT_SIZE))
# FC params: weights only, no bias as per [Ioffe & Szegedy 2015].
FC_OUTPUT_LAYER_SIZE = 1024
w = tf.Variable(tf.truncated_normal(
    [INPUT_SIZE, FC_OUTPUT_LAYER_SIZE], dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
fc = tf.matmul(u, w)
# Batch normalization.
fc_bn = tf.contrib.layers.batch_norm(
    fc,
    center=True,
    scale=True,
    is_training=training,
    scope='fc-batch_norm')
# Activation function.
fc_bn_relu = tf.nn.relu(fc_bn)
print(fc_bn_relu) # Tensor("Relu:0", shape=(?, 1024), dtype=float32)
2D convolutional (CNN) layer case
# Input: 640x480 RGB images (whitened input, hence tf.float32).
INPUT_HEIGHT = 480
INPUT_WIDTH = 640
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN_FILTER_HEIGHT = 3 # Space dimension.
CNN_FILTER_WIDTH = 3 # Space dimension.
CNN_FILTERS = 128
w = tf.Variable(tf.truncated_normal(
    [CNN_FILTER_HEIGHT, CNN_FILTER_WIDTH, INPUT_CHANNELS, CNN_FILTERS],
    dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN_LAYER_STRIDE_VERTICAL = 1
CNN_LAYER_STRIDE_HORIZONTAL = 1
CNN_LAYER_PADDING = 'SAME'
cnn = tf.nn.conv2d(
    input=u, filter=w,
    strides=[1, CNN_LAYER_STRIDE_VERTICAL, CNN_LAYER_STRIDE_HORIZONTAL, 1],
    padding=CNN_LAYER_PADDING)
# Batch normalization.
cnn_bn = tf.contrib.layers.batch_norm(
    cnn,
    data_format='NHWC',  # Matching the "cnn" tensor which has shape (?, 480, 640, 128).
    center=True,
    scale=True,
    is_training=training,
    scope='cnn-batch_norm')
# Activation function.
cnn_bn_relu = tf.nn.relu(cnn_bn)
print(cnn_bn_relu) # Tensor("Relu_1:0", shape=(?, 480, 640, 128), dtype=float32)
3D convolutional (CNN3D) layer case
# Input: sequence of 9 160x120 RGB images (whitened input, hence tf.float32).
INPUT_SEQ_LENGTH = 9
INPUT_HEIGHT = 120
INPUT_WIDTH = 160
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_SEQ_LENGTH, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN3D_FILTER_LENGTH = 3 # Time dimension.
CNN3D_FILTER_HEIGHT = 3 # Space dimension.
CNN3D_FILTER_WIDTH = 3 # Space dimension.
CNN3D_FILTERS = 96
w = tf.Variable(tf.truncated_normal(
    [CNN3D_FILTER_LENGTH, CNN3D_FILTER_HEIGHT, CNN3D_FILTER_WIDTH, INPUT_CHANNELS, CNN3D_FILTERS],
    dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN3D_LAYER_STRIDE_TEMPORAL = 1
CNN3D_LAYER_STRIDE_VERTICAL = 1
CNN3D_LAYER_STRIDE_HORIZONTAL = 1
CNN3D_LAYER_PADDING = 'SAME'
cnn3d = tf.nn.conv3d(
    input=u, filter=w,
    strides=[1, CNN3D_LAYER_STRIDE_TEMPORAL, CNN3D_LAYER_STRIDE_VERTICAL, CNN3D_LAYER_STRIDE_HORIZONTAL, 1],
    padding=CNN3D_LAYER_PADDING)
# Batch normalization.
cnn3d_bn = tf.contrib.layers.batch_norm(
    cnn3d,
    data_format='NHWC',  # Matching the "cnn3d" tensor which has shape (?, 9, 120, 160, 96).
    center=True,
    scale=True,
    is_training=training,
    scope='cnn3d-batch_norm')
# Activation function.
cnn3d_bn_relu = tf.nn.relu(cnn3d_bn)
print(cnn3d_bn_relu) # Tensor("Relu_2:0", shape=(?, 9, 120, 160, 96), dtype=float32)
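One detail to keep in mind when training with the snippets above (per the tf.contrib.layers.batch_norm documentation, not shown in my code): with the default updates_collections, the moving mean/variance update ops are only collected into tf.GraphKeys.UPDATE_OPS and must be attached to the train op explicitly, along the lines of the following (loss and the optimizer are placeholders for your own setup):

# Make the train op depend on the moving mean/variance updates; otherwise
# the statistics used at inference time are never updated.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)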
What I would like to make sure is whether the code above exactly implements batch normalization as described in [Ioffe & Szegedy 2015] at the end of Sec. 3.2:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. [...] Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
UPDATE
I guess the code above is also correct for the 3D conv case. In fact, when I define my model and print all the trainable variables, I also see the expected number of beta and gamma variables. For instance:
Tensor("conv3a/conv3d_weights/read:0", shape=(3, 3, 3, 128, 256), dtype=float32)
Tensor("BatchNorm_2/beta/read:0", shape=(256,), dtype=float32)
Tensor("BatchNorm_2/gamma/read:0", shape=(256,), dtype=float32)
This looks OK to me since, due to BN, one pair of beta and gamma is learned for each feature map (256 pairs in total).
[Ioffe & Szegedy 2015]: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
That is a great post about 3D batchnorm; it often goes unnoticed that batchnorm can be applied to any tensor of rank greater than 1. Your code is correct, but I couldn't help but add a few important notes on this:
A "standard" 2D batchnorm (accepts a 4D tensor) can be significantly faster in tensorflow than 3D or higher, because it supports fused_batch_norm implementation, which applies one kernel operation:
Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process
that for some models makes up a large percentage of the operation
time. Using fused batch norm can result in a 12%-30% speedup.
There is an issue on GitHub to support 3D filters as well, but there hasn't been any recent activity and at this point the issue is closed unresolved.
Although the original paper prescribes using batchnorm before ReLU activation (and that's what you did in the code above), there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy]
applies relu
before BN. It is still occasionally a topic of debate, though.
For anyone interested in applying the idea of normalization in practice, there have been recent research developments of this idea, namely weight normalization and layer normalization, which fix certain disadvantages of the original batchnorm; for example, they work better for LSTMs and recurrent networks.
