How can I make TensorFlow RNN training more robust?

I am training an RNN on a time series. I subclassed RNNCell and I use it in dynamic_rnn. The topology of the RNNCell is as follows:
input (shape [15, 100, 3])
1x3 convolution (5 kernels), ReLu (shape [15, 98, 5])
1x(remaining) convolution (20 kernels), ReLu (shape [15, 1, 20])
concatenate previous output (shape [15, 1, 21])
squeeze and 1x1 convolution (1 kernel), ReLu (shape [15, 1])
squeeze and softmax (shape [15])
The batch size for dynamic_rnn is around 100 (not the same 100 of the description above, that's the number of time periods in a window of data). Epochs are made of about 200 batches.
I would like to experiment with hyperparameters and regularization, but too often what I try stops the learning entirely and I don't understand why. These are some of the weird things that happen:
Adagrad works, but if I use Adam or Nadam the gradients are all zero.
I am forced to set a huge learning rate (~1.0) to see learning from epoch to epoch.
If I try to add dropout after any of the convolutions, even if I set keep_prob to 1.0 it stops learning.
If I tweak the number of kernels in the convolutions, for some choices that would seem just as good (e.g. 5, 25, 1 vs 5, 20, 1) again the network stops learning entirely.
Why is this model so fragile? Is it the topology of the RNNCell?
EDIT:
This is the code of the RNNCell:
import tensorflow as tf
import tflearn

class RNNCell(tf.nn.rnn_cell.RNNCell):
    def __init__(self):
        super(RNNCell, self).__init__()
        self._output_size = 15
        self._state_size = 15

    def __call__(self, X, prev_state):
        network = X
        # ------ 2 convolutional layers ------
        network = tflearn.layers.conv_2d(network, 5, [1, 3], activation='relu', weights_init=tflearn.initializations.variance_scaling(), padding="valid", regularizer=None)
        width = network.get_shape()[2]
        network = tflearn.layers.conv_2d(network, 20, [1, width], [1, 1], activation='relu', weights_init=tflearn.initializations.variance_scaling(), padding="valid", regularizer=None)
        # ------ concatenate the previous state ------
        _, height, width, features = network.get_shape()
        network = tf.reshape(network, [-1, int(height), 1, int(width * features)])
        network = tf.concat([network, prev_state[..., None, None]], axis=3)
        # ------ last convolution and softmax ------
        network = tflearn.layers.conv_2d(network, 1, [1, 1], activation='relu', weights_init=tflearn.initializations.variance_scaling(), padding="valid", regularizer=None)
        network = network[:, :, 0, 0]
        predictions = tflearn.layers.core.activation(network, activation="softmax")
        return predictions, predictions

    @property
    def output_size(self):
        return self._output_size

    @property
    def state_size(self):
        return self._state_size

Most likely you are running into the vanishing gradient problem.
The instability may be caused by combining ReLU with a very small number of trainable parameters. As far as I understand from the description, the first layer, for instance, has only 1x3x5 = 15 trainable parameters. Assuming the initialization is centered around zero, the gradients of roughly 50% of the parameters will, on average, stay at zero. Generally speaking, ReLU in small networks is risky, especially in RNNs.
Try Leaky ReLU (though you may then face exploding gradients).
Try tanh, but check that the initial parameter values really are close to zero; otherwise your gradients will vanish very quickly as well.
Inspect the outputs of the freshly initialized, untrained network at step 0. With proper initialization and network construction you should normally get values distributed around 0.5. If you get strictly ones, zeros, or a mix of them, your architecture is wrong. All values being exactly 0.5 is also a bad sign.
Consider a more robust approach such as an LSTM.
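A minimal NumPy sketch of the zero-gradient effect described above. It assumes a toy layer of 3 inputs and 5 units (15 weights); the sizes, the seed, and the input sample are illustrative and not taken from the asker's model. It counts the units whose ReLU gradient is zero for one input and shows that a leaky ReLU keeps a small gradient everywhere.

```python
import numpy as np

def relu_grad(z):
    # Derivative of ReLU: 1 where the pre-activation is positive, else 0.
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    # Derivative of leaky ReLU: 1 for positive inputs, alpha otherwise.
    return np.where(z > 0, 1.0, alpha)

rng = np.random.default_rng(0)          # illustrative seed
w = rng.normal(0.0, 0.1, size=(3, 5))   # 3 inputs -> 5 units, 15 weights
x = rng.normal(0.0, 1.0, size=(1, 3))   # one hypothetical input sample
z = x @ w                               # pre-activations of the 5 units

dead = relu_grad(z) == 0
print("units with zero ReLU gradient:", int(dead.sum()), "of", z.size)
print("smallest leaky ReLU gradient:", leaky_relu_grad(z).min())
```

If a unit stays in the negative region for every sample, its weights receive no gradient at all under ReLU, which is one way training can stall completely.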

Related

assigning direct input to weight of a neural network in python

Neural network based XOR
import numpy as np
import keras
model = keras.models.Sequential()
model.add(keras.layers.Dense(2, activation='relu', input_shape=(2,),use_bias=False)) # hidden layer
model.add(keras.layers.Dense(1,activation='softmax',use_bias=False)) # output layer
w1=np.zeros((2,2)) # two input neurons for two neurons at the hidden layer
w2=np.zeros((2,1)) # two input neurons for one output neuron
def __init__(self,x1,x2):
    self.x1=x1
    self.x2=x2
    w1[0,0]=not x2
    w1[0,1]=0
    w1[1,0]=0
    w1[1,1]=not x1
    w2[0,0]=1
    w2[1,0]=1
model.set_weights([w1, w2])
x = np.array([
[0, 0],
[0, 1],
[1, 0],
[1, 1],
])
model.predict(x)
I want to implement 2 input XOR gate for the above shown neural network. I want to assign weights directly as not x1, not x2 to get outputs. I tried with different Activation functions. When linear, sigmoid, softmax activation functions are used at output, all outputs obtained are 0's, 0.5's, 1's respectively. Kindly help me in getting correct output.
I am using python 3.6

With such a small network, you cannot get the output of the XOR function. You need to add more dense layers in the middle.
This is not a hidden layer; it is the input layer (you have also given the input shape in it):
model.add(keras.layers.Dense(2, activation='relu', input_shape=(2,),use_bias=False))
Since the weights are zero, the input to the last (output) layer is zero too, so each activation simply returns its value at 0, which is exactly what you observe:
Linear: f(0) = 0
Sigmoid: f(0) = 0.5
Softmax: f(0) = 1 (a softmax over a single output unit always returns 1)
If you want to feed not x1, not x2 as inputs, you could do:
X = 1-X
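The three values listed above can be checked with a short NumPy sketch. The helper functions are written here for illustration and are not Keras internals; the input mimics four samples arriving at a single output unit with a pre-activation of 0.

```python
import numpy as np

def linear(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Softmax over the last axis, shifted for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

z = np.zeros((4, 1))  # four samples, one output unit, all pre-activations 0
print(linear(z).ravel())   # all 0
print(sigmoid(z).ravel())  # all 0.5
print(softmax(z).ravel())  # all 1, because each row has a single entry
```

Because the softmax normalizes across the units of a layer, a one-unit softmax output carries no information regardless of the weights.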

Why does the global average pooling work in ResNet?

I recently started a classification project using a very shallow ResNet.
The model has just 10 conv. layers, followed by a global average pooling layer before the softmax layer.
The performance is as good as I expected: 93% accuracy.
However, for certain reasons, I need to replace the global average pooling layer.
I have tried the following approaches
(given the input shape of this layer, [-1, 128, 1, 32], in TensorFlow format):
Global max pooling layer, but got 85% accuracy.
Exponential moving average, but got 12% (it almost didn't work).
split_list = tf.split(input, 128, axis=1)
avg_pool = split_list[0]
beta = 0.5
for i in range(1, 128):
    avg_pool = beta*split_list[i] + (1-beta)*avg_pool
avg_pool = tf.reshape(avg_pool, [-1,32])
Split the input into 4 parts, average-pool each part, and finally concatenate them, but got 75%.
split_shape = [32,32,32,32]
split_list = tf.split(input,
                      split_shape,
                      axis=1)
for i in range(len(split_shape)):
    split_list[i] = tf.keras.layers.GlobalMaxPooling2D()(split_list[i])
avg_pool = tf.concat(split_list, axis=1)
Average over the last (channel) axis, [-1, 128, 1, 32] --> [-1, 128]. This didn't work.
Use a conv. layer with 1 kernel, so that the output shape is [-1, 128, 1, 1]. This didn't work either, giving 25% or so.
I am pretty confused: why does global average pooling work so well?
And is there any other way to replace it?

Global Average Pooling has the following advantages over the fully connected final layers paradigm:
The removal of a large number of trainable parameters from the model. Fully connected or dense layers have lots of parameters. A 7 x 7 x 64 CNN output flattened and fed into a 500-node dense layer yields 7 x 7 x 64 x 500 = 1,568,000 (about 1.57 million) weights which need to be trained. Removing these layers speeds up the training of your model.
The elimination of all these trainable parameters also reduces the tendency of over-fitting, which needs to be managed in fully connected layers by the use of dropout.
The authors argue in the original paper that removing the fully connected classification layers forces the feature maps to be more closely related to the classification categories – so that each feature map becomes a kind of “category confidence map”.
Finally, the authors also argue that, due to the averaging operation over the feature maps, this makes the model more robust to spatial translations in the data. In other words, as long as the requisite feature is included / or activated in the feature map somewhere, it will still be “picked up” by the averaging operation.
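Two of the points above can be made concrete with a small NumPy sketch: the weight count of the dense layer in the example, and the fact that averaging over the spatial axes gives the same result wherever a feature is activated. The helper name global_average_pool and the two activation positions are chosen here purely for illustration.

```python
import numpy as np

# Weights of a dense layer fed by a flattened 7 x 7 x 64 feature map.
h, w, c, dense_units = 7, 7, 64, 500
dense_weights = h * w * c * dense_units
print(dense_weights)  # 1568000

def global_average_pool(x):
    # x has shape (batch, height, width, channels); average over height and width.
    return x.mean(axis=(1, 2))

# A single activation placed at two different locations of one feature map.
a = np.zeros((1, h, w, 1))
a[0, 1, 1, 0] = 1.0
b = np.zeros((1, h, w, 1))
b[0, 5, 3, 0] = 1.0
print(np.allclose(global_average_pool(a), global_average_pool(b)))  # True
```

Global average pooling itself adds no trainable weights, which is where the parameter saving comes from.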

Keras Recurrent Neural Networks For Multivariate Time Series

I have been reading about Keras RNN models (LSTMs and GRUs), and authors seem to largely focus on language data or univariate time series that use training instances composed of previous time steps. The data I have is a bit different.
I have 20 variables measured every year for 10 years for 100,000 persons as input data, and the 20 variables measured for year 11 as output data. What I would like to do is predict the value of one of the variables (not the other 19) for the 11th year.
I have my data structured as X.shape = [persons, years, variables] = [100000, 10, 20] and Y.shape = [persons, variable] = [100000, 1]. Below is my Python code for a LSTM model.
## LSTM model.
# Define model.
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(128, activation = 'tanh',
                             input_shape = (X.shape[1], X.shape[2])))
network_lstm.add(layers.Dense(1, activation = None))
# Compile model.
network_lstm.compile(optimizer = 'adam', loss = 'mean_squared_error')
# Fit model.
history_lstm = network_lstm.fit(X, Y, epochs = 25, batch_size = 128)
I have four (related) questions, please:
Have I coded the Keras model correctly for the data structure I have? The performance I get from a fully-connected network (using flattened data) and from LSTM, GRU, and 1D CNN models are nearly identical, and I don't know if I have made an error in Keras or if a recurrent model is simply not helpful in this case.
Should I have Y as a series with shape Y.shape = [persons, years] = [100000, 11], rather than including the variable in X, which would then have shape X.shape = [persons, years, variables] = [100000, 10, 19]? If so, how can I get the RNN to output the predicted sequence? When I use return_sequences = True, Keras returns an error.
Is this the best way to predict with the data I have? Are there better option choices available in the Keras RNN models, or even other models?
How could I simulate data resembling the data structure I have so that a RNN model would outperform a fully-connected network?
UPDATE:
I have tried a simulation, with what I hope is a very simple case where an RNN should be expected to outperform a FNN.
While the LSTM tends to outperform the FNN when both have fewer hidden layers (4), the performance becomes identical with more hidden layers (8+). Can anyone think of a better simulation where an RNN would be expected to outperform an FNN with a similar data structure?
from keras import models
from keras import layers
from keras.layers import Dense, LSTM
import numpy as np
import matplotlib.pyplot as plt
The code below simulates data for 10,000 instances, 10 time steps, and 2 variables. If the second variable has a 0 in the very first time step, then Y is the value of the first variable for the very last time step multiplied by 3. If the second variable has a 1 in the very first time step, then Y is the value of the first variable for the very last time step multiplied by 9.
My hope was that the RNN would keep the value of the second variable at the very first time step in memory and use it to know which value (3 or 9) to multiply the first variable by at the very last time step.
## Simulate data.
instances = 10000
sequences = 10
X = np.zeros((instances, sequences * 2))
X[:int(instances / 2), 1] = 1
for i in range(instances):
    for j in range(0, sequences * 2, 2):
        X[i, j] = np.random.random()
Y = np.zeros((instances, 1))
for i in range(len(Y)):
    if X[i, 1] == 0:
        Y[i] = X[i, -2] * 3
    if X[i, 1] == 1:
        Y[i] = X[i, -2] * 9
Below is code for a FNN:
## Densely connected model.
# Define model.
network_dense = models.Sequential()
network_dense.add(layers.Dense(4, activation = 'relu',
                               input_shape = (X.shape[1],)))
network_dense.add(Dense(1, activation = None))
# Compile model.
network_dense.compile(optimizer = 'rmsprop', loss = 'mean_absolute_error')
# Fit model.
history_dense = network_dense.fit(X, Y, epochs = 100, batch_size = 256, verbose = False)
plt.scatter(Y[X[:, 1] == 0, :], network_dense.predict(X[X[:, 1] == 0, :]), alpha = 0.1)
plt.plot([0, 3], [0, 3], color = 'black', linewidth = 2)
plt.title('FNN, Second Variable has a 0 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
plt.scatter(Y[X[:, 1] == 1, :], network_dense.predict(X[X[:, 1] == 1, :]), alpha = 0.1)
plt.plot([0, 9], [0, 9], color = 'black', linewidth = 2)
plt.title('FNN, Second Variable has a 1 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
Below is code for a LSTM:
## Structure X data for LSTM.
X_lstm = X.reshape(X.shape[0], X.shape[1] // 2, 2)
X_lstm.shape
## LSTM model.
# Define model.
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(4, activation = 'relu',
                             input_shape = (X_lstm.shape[1], 2)))
network_lstm.add(layers.Dense(1, activation = None))
# Compile model.
network_lstm.compile(optimizer = 'rmsprop', loss = 'mean_squared_error')
# Fit model.
history_lstm = network_lstm.fit(X_lstm, Y, epochs = 100, batch_size = 256, verbose = False)
plt.scatter(Y[X[:, 1] == 0, :], network_lstm.predict(X_lstm[X[:, 1] == 0, :]), alpha = 0.1)
plt.plot([0, 3], [0, 3], color = 'black', linewidth = 2)
plt.title('LSTM, Second Variable has a 0 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
plt.scatter(Y[X[:, 1] == 1, :], network_lstm.predict(X_lstm[X[:, 1] == 1, :]), alpha = 0.1)
plt.plot([0, 9], [0, 9], color = 'black', linewidth = 2)
plt.title('LSTM, Second Variable has a 1 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

Yes, the code is correct for what you are trying to do. 10 years is the time window used to predict the following year, so that should be the number of time steps fed into your model for each of the 20 variables. The sample size of 100,000 observations is not relevant to the input shape of your model.
The way you originally shaped the dependent variable Y is correct. You are predicting a window of 1 year for 1 variable and you have 100,000 observations. The keyword argument return_sequences=True raises an error here because the single LSTM layer then emits an output for every time step, which no longer matches the shape of Y. Set this parameter to True when you stack multiple LSTM layers and the layer in question is followed by another LSTM layer.
I wish I could offer some guidance on question 3, but without actually having your dataset I don't know if it's possible to answer it with any certainty.
I will say that LSTMs were designed to address what is known as the long-term dependency problem present in regular RNNs. The problem boils down to this: as the gap grows between the point where the relevant information was observed and the point where it becomes useful, a standard RNN has a harder time learning the relationship between them. Think of predicting a stock price based on 3 days of activity versus an entire year.
This leads into question 4. If I use the term 'resembling' loosely and stretch your time window further out, to say 50 years as opposed to 10, the advantages of using an LSTM would become more apparent. That said, I'm sure someone more experienced will be able to offer a better answer, and I look forward to seeing it.
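To illustrate what return_sequences changes, here is a NumPy-only sketch of a plain tanh recurrence. It is not an LSTM and not Keras's implementation; the function simple_rnn, the 4 hidden units, and the random weights are assumptions made for the example. It only mimics the shape behaviour: one output per sample versus one output per sample and time step.

```python
import numpy as np

def simple_rnn(x, w_x, w_h, return_sequences=False):
    # x: (batch, time, features). Plain tanh recurrence, not an LSTM.
    batch, time, _ = x.shape
    h = np.zeros((batch, w_h.shape[0]))
    states = []
    for t in range(time):
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h)
        states.append(h)
    # Either every hidden state, or only the last one.
    return np.stack(states, axis=1) if return_sequences else h

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 10, 20))           # 8 persons, 10 years, 20 variables
w_x = rng.normal(scale=0.1, size=(20, 4))  # hypothetical 4 hidden units
w_h = rng.normal(scale=0.1, size=(4, 4))

print(simple_rnn(x, w_x, w_h).shape)                         # (8, 4)
print(simple_rnn(x, w_x, w_h, return_sequences=True).shape)  # (8, 10, 4)
```

With the extra time axis, a following Dense(1) layer would produce one prediction per year, which cannot be compared against a target with one value per person.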
I found this page helpful for understanding LSTMs:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
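As a rough sketch of the longer time window suggested above, the asker's simulation can be rewritten so that the number of time steps is a parameter. The function name simulate, its defaults, and the seed are assumptions for this example; the rule for Y follows the question (multiply the last value of the first variable by 9 if the flag at the first time step is 1, otherwise by 3). The data is returned directly in the 3-D layout an RNN expects.

```python
import numpy as np

def simulate(instances=10000, steps=50, seed=0):
    # Returns X with shape (instances, steps, 2) and Y with shape (instances, 1).
    rng = np.random.default_rng(seed)
    X = np.zeros((instances, steps, 2))
    X[:, :, 0] = rng.random((instances, steps))  # first variable: uniform noise
    X[: instances // 2, 0, 1] = 1                # second variable: flag at step 0 only
    multiplier = np.where(X[:, 0, 1] == 1, 9.0, 3.0)
    Y = (X[:, -1, 0] * multiplier).reshape(-1, 1)
    return X, Y

X_long, Y_long = simulate(steps=50)
print(X_long.shape, Y_long.shape)  # (10000, 50, 2) (10000, 1)
```

Flattening with X_long.reshape(len(X_long), -1) gives the corresponding input for a fully connected network, so both models can be compared on the same data.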

Tensorflow ReLu doesn't work?

I have written a convolutional network in tensorflow with relu as an activation function, however it is not learning (loss is constant for both eval and train data set).
For different activation functions everything works as it should.
Here is code where the nn is created:
def _create_nn(self):
    current = tf.layers.conv2d(self.input, 20, 3, activation=self.activation)
    current = tf.layers.max_pooling2d(current, 2, 2)
    current = tf.layers.conv2d(current, 24, 3, activation=self.activation)
    current = tf.layers.conv2d(current, 24, 3, activation=self.activation)
    current = tf.layers.max_pooling2d(current, 2, 2)
    self.descriptor = current = tf.layers.conv2d(current, 32, 5, activation=self.activation)
    if not self.drop_conv:
        current = tf.layers.conv2d(current, self.layer_7_filters_n, 3, activation=self.activation)
        if self.add_conv:
            current = tf.layers.conv2d(current, 48, 2, activation=self.activation)
        self.descriptor = current
    last_conv_output_shape = current.get_shape().as_list()
    self.descr_size = last_conv_output_shape[1] * last_conv_output_shape[2] * last_conv_output_shape[3]
    current = tf.layers.dense(tf.reshape(current, [-1, self.descr_size]), 100, activation=self.activation)
    current = tf.layers.dense(current, 50, activation=self.last_activation)
    return current
self.activation is set to tf.nn.relu and self.last_activation is set to tf.nn.softmax.
loss function and optimizer are created here:
self._nn = self._create_nn()
self._loss_function = tf.reduce_sum(tf.squared_difference(self._nn, self.Y), 1)
optimizer = tf.train.AdamOptimizer()
self._train_op = optimizer.minimize(self._loss_function)
I tried changing variables initialization by passing tf.random_normal_initializer(0.1, 0.1) as initializers however it did not result in any change in loss function.
I would be grateful for help in making this neural network work with ReLu.
Edit: Leaky ReLu has the same problem
Edit: Small example where I managed to duplicate same error:
x = tf.constant([[3., 211., 123., 78.]])
v = tf.Variable([0.5, 0.5, 0.5, 0.5])
h_d = tf.layers.Dense(4, activation=tf.nn.leaky_relu)
h = h_d(x)
y_d = tf.layers.Dense(4, activation=tf.nn.softmax)
y = y_d(h)
d = tf.constant([[.5, .5, 0, 0]])
Gradients (as calculated with tf.gradients) for h_d and y_d kernels and biases are either equal or close to 0

In a very improbable case, all activations in some layer can be negative for all samples. They are set to zero by the ReLU and there is no learning progress because the gradient is zero in the negative part of the ReLU.
Things that make this more probable are a small dataset, weird scaling of input features, inappropriate weight initialization, and/or few channels in intermediate layers.
Here you use random_normal_initializer with mean=0.1, so maybe your inputs are all negative, and thus get mapped to negative values. Try mean=0, or rescale input features.
You can also try a Leaky ReLU. Also maybe the learning rate is too small or too large.

It looks like the problem was the scale of the input data. With values between 0 and 255, that scale was more or less preserved in the following layers, so the pre-activation outputs of the last layer differed by enough to push the softmax gradient to (almost) 0.
This was observable only with ReLU-like activation functions because others, such as sigmoid or softsign, keep the value ranges in the network smaller, with an order of magnitude of 1 instead of tens or hundreds.
The solution was simply to rescale the input to the 0-1 range, which for byte values means multiplying by 1/255.
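The saturation effect can be sketched in NumPy. As a simplifying assumption, the four raw values from the question's small example are fed to the softmax directly as if they had passed through the network unchanged; the sketch then compares the largest entry of the softmax Jacobian before and after rescaling by 1/255.

```python
import numpy as np

def softmax(z):
    # Shifted by the maximum for numerical stability.
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # d softmax_i / d z_j = p_i * (delta_ij - p_j)
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

raw = np.array([3.0, 211.0, 123.0, 78.0])  # values from the question's small example
scaled = raw / 255.0                        # rescaled to the 0-1 range

print(np.abs(softmax_jacobian(raw)).max())     # effectively 0
print(np.abs(softmax_jacobian(scaled)).max())  # clearly non-zero
```

When one logit dominates by tens or hundreds, the softmax output is a near one-hot vector and every entry of its Jacobian is vanishingly small, so almost no gradient flows back to earlier layers.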

Batch normalization with 3D convolutions in TensorFlow

I'm implementing a model relying on 3D convolutions (for a task that is similar to action recognition) and I want to use batch normalization (see [Ioffe & Szegedy 2015]). I could not find any tutorial focusing on 3D convs, hence I'm making a short one here which I'd like to review with you.
The code below refers to TensorFlow r0.12 and explicitly instantiates variables, i.e. I'm not using tf.contrib.learn except for the tf.contrib.layers.batch_norm() function. I'm doing this both to better understand how things work under the hood and to have more implementation freedom (e.g., variable summaries).
I will get to the 3D convolution case smoothly by first writing the example for a fully-connected layer, then for a 2D convolution and finally for the 3D case. While going through the code, it would be great if you could check if everything is done correctly - the code runs, but I'm not 100% sure about the way I apply batch normalization. I end this post with a more detailed question.
import tensorflow as tf
# This flag is used to allow/prevent batch normalization params updates
# depending on whether the model is being trained or used for prediction.
training = tf.placeholder_with_default(True, shape=())
Fully-connected (FC) case
# Input.
INPUT_SIZE = 512
u = tf.placeholder(tf.float32, shape=(None, INPUT_SIZE))
# FC params: weights only, no bias as per [Ioffe & Szegedy 2015].
FC_OUTPUT_LAYER_SIZE = 1024
w = tf.Variable(tf.truncated_normal(
    [INPUT_SIZE, FC_OUTPUT_LAYER_SIZE], dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
fc = tf.matmul(u, w)
# Batch normalization.
fc_bn = tf.contrib.layers.batch_norm(
    fc,
    center=True,
    scale=True,
    is_training=training,
    scope='fc-batch_norm')
# Activation function.
fc_bn_relu = tf.nn.relu(fc_bn)
print(fc_bn_relu) # Tensor("Relu:0", shape=(?, 1024), dtype=float32)
2D convolutional (CNN) layer case
# Input: 640x480 RGB images (whitened input, hence tf.float32).
INPUT_HEIGHT = 480
INPUT_WIDTH = 640
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN_FILTER_HEIGHT = 3 # Space dimension.
CNN_FILTER_WIDTH = 3 # Space dimension.
CNN_FILTERS = 128
w = tf.Variable(tf.truncated_normal(
    [CNN_FILTER_HEIGHT, CNN_FILTER_WIDTH, INPUT_CHANNELS, CNN_FILTERS],
    dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN_LAYER_STRIDE_VERTICAL = 1
CNN_LAYER_STRIDE_HORIZONTAL = 1
CNN_LAYER_PADDING = 'SAME'
cnn = tf.nn.conv2d(
    input=u, filter=w,
    strides=[1, CNN_LAYER_STRIDE_VERTICAL, CNN_LAYER_STRIDE_HORIZONTAL, 1],
    padding=CNN_LAYER_PADDING)
# Batch normalization.
cnn_bn = tf.contrib.layers.batch_norm(
    cnn,
    data_format='NHWC', # Matching the "cnn" tensor which has shape (?, 480, 640, 128).
    center=True,
    scale=True,
    is_training=training,
    scope='cnn-batch_norm')
# Activation function.
cnn_bn_relu = tf.nn.relu(cnn_bn)
print(cnn_bn_relu) # Tensor("Relu_1:0", shape=(?, 480, 640, 128), dtype=float32)
3D convolutional (CNN3D) layer case
# Input: sequence of 9 160x120 RGB images (whitened input, hence tf.float32).
INPUT_SEQ_LENGTH = 9
INPUT_HEIGHT = 120
INPUT_WIDTH = 160
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_SEQ_LENGTH, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN3D_FILTER_LENGTH = 3 # Time dimension.
CNN3D_FILTER_HEIGHT = 3 # Space dimension.
CNN3D_FILTER_WIDTH = 3 # Space dimension.
CNN3D_FILTERS = 96
w = tf.Variable(tf.truncated_normal(
    [CNN3D_FILTER_LENGTH, CNN3D_FILTER_HEIGHT, CNN3D_FILTER_WIDTH, INPUT_CHANNELS, CNN3D_FILTERS],
    dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN3D_LAYER_STRIDE_TEMPORAL = 1
CNN3D_LAYER_STRIDE_VERTICAL = 1
CNN3D_LAYER_STRIDE_HORIZONTAL = 1
CNN3D_LAYER_PADDING = 'SAME'
cnn3d = tf.nn.conv3d(
    input=u, filter=w,
    strides=[1, CNN3D_LAYER_STRIDE_TEMPORAL, CNN3D_LAYER_STRIDE_VERTICAL, CNN3D_LAYER_STRIDE_HORIZONTAL, 1],
    padding=CNN3D_LAYER_PADDING)
# Batch normalization.
cnn3d_bn = tf.contrib.layers.batch_norm(
    cnn3d,
    data_format='NHWC', # Matching the "cnn3d" tensor which has shape (?, 9, 120, 160, 96).
    center=True,
    scale=True,
    is_training=training,
    scope='cnn3d-batch_norm')
# Activation function.
cnn3d_bn_relu = tf.nn.relu(cnn3d_bn)
print(cnn3d_bn_relu) # Tensor("Relu_2:0", shape=(?, 9, 120, 160, 96), dtype=float32)
What I would like to make sure is whether the code above exactly implements batch normalization as described in [Ioffe & Szegedy 2015] at the end of Sec. 3.2:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. [...] Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
UPDATE
I guess the code above is also correct for the 3D conv case. In fact, when I define my model if I print all the trainable variables, I also see the expected numbers of beta and gamma variables. For instance:
Tensor("conv3a/conv3d_weights/read:0", shape=(3, 3, 3, 128, 256), dtype=float32)
Tensor("BatchNorm_2/beta/read:0", shape=(256,), dtype=float32)
Tensor("BatchNorm_2/gamma/read:0", shape=(256,), dtype=float32)
This looks ok to me since due to BN, one pair of beta and gamma are learned for each feature map (256 in total).
[Ioffe & Szegedy 2015]: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
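The quoted passage can be sketched in NumPy for the training-time statistics. This is an illustration of the idea, not the tf.contrib.layers.batch_norm implementation: the function name, the epsilon value, and the small tensor sizes are assumptions. The statistics are computed jointly over the batch and all locations (time, height, width), leaving one mean, variance, gamma, and beta per feature map.

```python
import numpy as np

def batch_norm_per_feature_map(x, gamma, beta, eps=1e-5):
    # Training-mode statistics: reduce over every axis except the last (channels).
    axes = tuple(range(x.ndim - 1))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
channels = 4  # illustrative, much smaller than the 96 or 256 maps in the post
# Layout: (batch, time, height, width, channels)
x = rng.normal(2.0, 3.0, size=(2, 9, 6, 8, channels))
y = batch_norm_per_feature_map(x, gamma=np.ones(channels), beta=np.zeros(channels))

print(y.mean(axis=(0, 1, 2, 3)))  # close to 0 for every channel
print(y.std(axis=(0, 1, 2, 3)))   # close to 1 for every channel
```

The same function handles the fully connected and 2D cases unchanged, since it always reduces over everything but the channel axis; this matches the beta and gamma shapes of (256,) reported in the update.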

That is a great post about 3D batchnorm. It often goes unnoticed that batchnorm can be applied to any tensor of rank greater than 1. Your code is correct, but I couldn't help adding a few important notes:
A "standard" 2D batchnorm (which accepts a 4D tensor) can be significantly faster in TensorFlow than 3D or higher, because it supports the fused_batch_norm implementation, which applies a single kernel operation:
Fused batch norm combines the multiple operations needed to do batch normalization into a single kernel. Batch norm is an expensive process that for some models makes up a large percentage of the operation time. Using fused batch norm can result in a 12%-30% speedup.
There is an issue on GitHub to support 3D filters as well, but there hasn't been any recent activity and at this point the issue is closed unresolved.
Although the original paper prescribes using batchnorm before ReLU activation (and that's what you did in the code above), there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy] applies relu before BN. It is still occasionally a topic of debate, though.
For anyone interested in applying the idea of normalization in practice, there have been recent research developments of this idea, namely weight normalization and layer normalization, which fix certain disadvantages of the original batchnorm; for example, they work better for LSTMs and other recurrent networks.
