Understand and Implement Element-Wise Attention Module

Please add a minimum comment on your thoughts so that I can improve my query. Thank you. -)
I'm trying to understand and implement a research work on Triple Attention Learning, which consists of
- channel-wise attention (a)
- element-wise attention (b)
- scale-wise attention (c)
The mechanism is integrated experimentally inside the DenseNet model. A diagram of the whole model's architecture is here. The channel-wise attention module is simply the squeeze-and-excitation block, which passes a sigmoid output on to the element-wise attention module. Below is a more precise feature-flow diagram of these modules (a, b, and c).
Theory
For the most part, I was able to understand and implement it but was a bit lost in the Element-Wise attention section (part b from the above diagram). This is where I need your assistance. -)
Here is a little theory on this topic to give you a rough idea of what all this is about. Please note that the paper is no longer openly accessible, but shortly after its release it was free on the publisher's page and I saved a copy at that time. To be fair to everyone, I'm sharing it with you: Link. Anyway, Section 4.3 of the paper shows:
First of all, the f(att) function (in the first inline diagram, left-middle part, or b) consists of three convolution layers: 512 kernels of 1 x 1, 512 kernels of 3 x 3, and C kernels of 1 x 1, where C is the number of classes, followed by a softmax activation.
Next, it is applied to the output of the channel-wise attention module, which, as mentioned, is simply an SENet module that gives a sigmoid probability score, i.e. X(CA). So from f(att) we get C softmax probability scores, each of which is multiplied with the sigmoid output, finally producing the feature maps A (equation 4 in the diagram above).
Second, there are C linear classifiers, implemented as a convolution layer with C kernels of 1 x 1. This layer is also applied to the SENet module's output, X(CA), at each feature vector pixel-wise. In the end, it outputs the feature maps S (equation 5 in the diagram below).
Third, they element-wise multiply each confidence score (of S) with the corresponding attention element of A. This multiplication is deliberate: it prevents unnecessary attention on the feature maps. To make it effective, they also minimize a weighted cross-entropy loss between the classification ground truth and the score vector.
My Query
Mostly, I don't properly understand the minimization strategy in the middle of the network. I'm looking for a proper explanation and implementation, in detail, of the element-wise attention mechanism proposed in the paper mentioned above (Section 4.3).
Implement
Here is some minimal code to get started; it should be enough, I guess. This is a shallow implementation and still far from the original element-wise module. I'm not sure how to implement it properly. For now, I want it as a layer that can be plugged into any model. I was trying it with MNIST and a simple conv net.
In summary, for MNIST we should have a network that contains both the channel-wise and element-wise attention modules, followed by a final 10-unit softmax layer. For example:
Net: Conv2D - Attentions-Module - GAP - Softmax(10)
The Attention-Module consists of two parts, channel-wise and element-wise. The element-wise part is supposed to have a softmax too, and to minimize a weighted CE loss between the ground truth and the score vector coming from this module (according to the paper, as described above). The module also passes weighted feature maps to the subsequent layers. For more clarity, here is a simple schematic diagram of what we're looking for.
OK, for the channel-wise attention, which should give us a single probability score (sigmoid), let's use a fake layer for now for simplicity:
class FakeSE(tf.keras.layers.Layer):
    def __init__(self):
        super(FakeSE, self).__init__()
        # conv layer
        self.conv = tf.keras.layers.Conv2D(10, padding='same',
                                           kernel_size=3)

    def call(self, input_tensor, training=False):
        x = self.conv(input_tensor)
        return tf.math.sigmoid(x)
And for the element-wise attention part, following is the failed attempt so far:
class ElementWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        # call the base constructor first so Keras can track the sub-layers
        super(ElementWiseAttention, self).__init__()
        # for simplicity the f(attn) function here has 2 convolutions instead of 3
        # self.conv1, and self.conv2
        self.conv1 = tf.keras.layers.Conv2D(16,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.silu)
        self.conv2 = tf.keras.layers.Conv2D(10,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=False, activation=tf.keras.activations.softmax)
        # fake SENet or channel-wise attention module
        self.cam = FakeSE()
        # a linear layer
        self.linear = tf.keras.layers.Conv2D(10,
                                             kernel_size=1,
                                             strides=1, padding='same',
                                             use_bias=True, activation=None)

    def call(self, inputs):
        # 2 stacked conv layers (in the paper it's 3; we use 2 for simplicity)
        # this is the f(att)
        x = self.conv1(inputs)
        x = self.conv2(x)
        # this is the A = f(att)*X(CA)
        camx = self.cam(x)*x
        # this is S = X(CA)*Linear_Classifier
        linx = self.cam(self.linear(inputs))
        # element-wise multiply to prevent unnecessary attention
        # supposed to be minimized with a weighted cross-entropy loss
        out = tf.multiply(camx, linx)
        return out
The layer above is the layer of interest. If I understand the paper correctly, this layer should not only minimize the weighted loss between the ground truth and the score vector but also produce weighted (2D) feature maps.
Run
Here is the toy data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, axis=-1)
x_train = x_train.astype('float32') / 255
x_train = tf.image.resize(x_train, [32,32]) # if we want to resize
y_train = tf.keras.utils.to_categorical(y_train , num_classes=10)
# Model
input = tf.keras.Input(shape=(32,32,1))
efnet = tf.keras.applications.DenseNet121(weights=None,
include_top = False,
input_tensor = input)
em = ElementWiseAttention()(efnet.output)
# Now that we apply global max pooling.
gap = tf.keras.layers.GlobalMaxPooling2D()(em)
# classification layer.
output = tf.keras.layers.Dense(10, activation='softmax')(gap)
# bind all
func_model = tf.keras.Model(efnet.input, output)
func_model.compile(
loss = tf.keras.losses.CategoricalCrossentropy(),
metrics = tf.keras.metrics.CategoricalAccuracy(),
optimizer = tf.keras.optimizers.Adam())
# fit
func_model.fit(x_train, y_train, batch_size=32, epochs=3, verbose = 1)

Understanding the element-wise attention
When the paper introduces the method, it says:
The attention modules aim to exploit the relationship between disease
labels and (1) diagnosis-specific feature channels, (2)
diagnosis-specific locations on images (i.e. the regions of thoracic
abnormalities), and (3) diagnosis-specific scales of the feature maps.
(1), (2) and (3) correspond to channel-wise attention, element-wise attention and scale-wise attention, respectively.
We can tell that element-wise attention deals with disease location and weight information, i.e. at each location in the image, how likely it is that a disease is present. The paper says so again when it introduces element-wise attention:
The element-wise attention learning aims to enhance the sensitivity of feature
representations to thoracic abnormal regions, while suppressing the activations when there is no abnormality.
OK, we could easily get location and weight information for one disease, but there are multiple diseases:
Since there are multiple thoracic diseases, we choose to estimate an
element-wise attention map for each category in this work.
We can store the location and weight information for multiple diseases in a tensor A with shape (height, width, number of diseases):
The all-category attention map is denoted by $A \in \mathbb{R}^{H \times W \times C}$, where each
element $a_{ijc}$ is expected to represent the relative importance at location (i, j) for
identifying the c-th category of thoracic abnormalities.
We also have linear classifiers that produce a tensor S with the same shape as A. This can be interpreted as:
At each location on the feature maps X(CA), how confident the linear classifiers are that a certain disease is present at that location
Now we element-wise multiply S and A to get M, i.e. we:
prevent the attention maps from paying unnecessary attention to those
location with non-existent labels
After all that, we get a tensor M which tells us:
location and weight information about a certain disease that the linear classifiers are confident about
Then, if we apply global average pooling over M, we get a predicted weight for each disease; adding a softmax (or sigmoid) gives a predicted probability for each disease.
Now that we have both labels and predictions, we can naturally minimize a loss function to optimize the model.
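The data flow described above can be sketched in plain NumPy to make the shapes concrete. This is only an illustration under assumptions: the toy sizes, the random weights, the single 1x1 projection standing in for the three-layer f(att), and the choice to take the softmax over the class axis are mine, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
H, W, D, C = 4, 4, 8, 3            # toy sizes (assumptions)

x_ca = rng.normal(size=(H, W, D))  # stand-in for the channel-attended features X(CA)
w_att = rng.normal(size=(D, C))    # stand-in for f(att), reduced to one 1x1 conv
w_cls = rng.normal(size=(D, C))    # the C linear classifiers as a 1x1 conv

A = softmax(x_ca @ w_att, axis=-1)    # attention map, shape (H, W, C)
S = x_ca @ w_cls                      # confidence scores, shape (H, W, C)
M = S * A                             # element-wise product
y_hat = softmax(M.mean(axis=(0, 1)))  # global average pooling, then softmax

print(A.shape, S.shape, M.shape, y_hat.shape)
```

A 1x1 convolution is just a matrix product applied at every pixel, which is why `x_ca @ w_att` is enough here. The vector `y_hat` is what would be compared with the label in the loss.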
Implementation
The following code was tested on Colab. It shows how to implement channel-wise attention and element-wise attention, and how to build and train a simple model based on your code, with DenseNet121 and without scale-wise attention:
import tensorflow as tf
import numpy as np
ALPHA = 1/16
C = 10
D = 128
class ChannelWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ChannelWiseAttention, self).__init__()
        # squeeze
        self.gap = tf.keras.layers.GlobalAveragePooling2D()
        # excitation
        self.fc0 = tf.keras.layers.Dense(int(ALPHA * D), use_bias=False, activation=tf.nn.relu)
        self.fc1 = tf.keras.layers.Dense(D, use_bias=False, activation=tf.nn.sigmoid)
        # reshape so we can do channel-wise multiplication
        self.rs = tf.keras.layers.Reshape((1, 1, D))

    def call(self, inputs):
        # calculate channel-wise attention vector
        z = self.gap(inputs)
        u = self.fc0(z)
        u = self.fc1(u)
        u = self.rs(u)
        return u * inputs
class ElementWiseAttention(tf.keras.layers.Layer):
    def __init__(self):
        super(ElementWiseAttention, self).__init__()
        # f(att)
        self.conv0 = tf.keras.layers.Conv2D(512,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.relu)
        self.conv1 = tf.keras.layers.Conv2D(512,
                                            kernel_size=3,
                                            strides=1, padding='same',
                                            use_bias=True, activation=tf.nn.relu)
        self.conv2 = tf.keras.layers.Conv2D(C,
                                            kernel_size=1,
                                            strides=1, padding='same',
                                            use_bias=False, activation=tf.keras.activations.softmax)
        # linear classifier
        self.linear = tf.keras.layers.Conv2D(C,
                                             kernel_size=1,
                                             strides=1, padding='same',
                                             use_bias=True, activation=None)
        # used to compute the score vector for training the element-wise attention module
        self.gap = tf.keras.layers.GlobalAveragePooling2D()
        self.sfm = tf.keras.layers.Softmax()

    def call(self, inputs):
        # f(att)
        a = self.conv0(inputs)
        a = self.conv1(a)
        a = self.conv2(a)
        # confidence score
        s = self.linear(inputs)
        # element-wise multiply to prevent unnecessary attention
        m = s * a
        # used to minimize the weighted cross-entropy loss
        y_hat = self.gap(m)
        # could also use sigmoid, as in the paper
        out = self.sfm(y_hat)
        return m, out
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, axis=-1)
x_train = x_train.astype('float32') / 255
x_train = tf.image.resize(x_train, [32,32]) # if we want to resize
y_train = tf.keras.utils.to_categorical(y_train , num_classes=10)
# Model
input = tf.keras.Input(shape=(32,32,1))
efnet = tf.keras.applications.DenseNet121(weights=None,
include_top = False,
input_tensor = input)
xca = ChannelWiseAttention()(efnet.get_layer("conv3_block1_0_bn").output)
m, output = ElementWiseAttention()(xca)
# bind all
func_model = tf.keras.Model(efnet.input, output)
func_model.compile(
loss = tf.keras.losses.CategoricalCrossentropy(),
metrics = tf.keras.metrics.CategoricalAccuracy(),
optimizer = tf.keras.optimizers.Adam())
# fit
func_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose = 1)
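The model above is compiled with a plain categorical cross-entropy, whereas the paper is said to use a weighted one. The weighting scheme is not given in this post, so the following NumPy sketch simply assumes one fixed weight per class; the function name and the example weights are hypothetical.

```python
import numpy as np

def weighted_cross_entropy(y_true, y_prob, class_weights, eps=1e-7):
    """Mean over the batch of -sum_c w_c * y_c * log(p_c)."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    per_sample = -(class_weights * y_true * np.log(y_prob)).sum(axis=-1)
    return per_sample.mean()

y_true = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])

# With uniform weights this reduces to the ordinary cross-entropy.
uniform = weighted_cross_entropy(y_true, y_prob, np.ones(3))
# Doubling the weight of class 2 makes mistakes on that class cost more.
rare_up = weighted_cross_entropy(y_true, y_prob, np.array([1.0, 1.0, 2.0]))
print(uniform, rare_up)
```

In Keras, a comparable effect can probably be had by passing a `class_weight` dictionary to `fit`, which may be simpler than writing a custom loss.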
PS: By coincidence, I also answered another question of yours related to this paper a few months back:
How to place custom layer inside a in-built pre trained model?

Related

Image sequence detection with Keras, Convolutional and Stateful Neural Network

I am trying to write a pretty complicated neural network (at least for me) in keras that needs to combine both a common CNN structure and an LSTM/GRU layer.
Basically, I have a dataset of climatological maps of the Mediterranean sea, each map details the wind, pressure and other parameters. I am studying Medicanes (Mediterranean hurricanes) and my goal is to create a neural network that can classify each map with a label zero if there is no trace of such hurricanes or one if the map contains one.
In order to achieve that I need a network with two parts:
feature extractor (normal CNN).
temporal layer (LSTM/GRU).
The main cause of this is that each map is correlated with the previous one because the formation and life cycle of a Medicane can take several days to complete.
Important note: the dataset is too big to be uploaded all at once so I have to work one batch at a time.
I am working with Keras and I found it pretty challenging to adapt its standard framework to my needs so I have come up with some peculiar flow to feed my data into the network.
In particular, I found it hard to pass both my batch size and my time-step parameter to the GRU layer using a more standard alternative.
This is what I tried:
I am positively sure I have overcomplicated the task, but, as I said I am not very proficient with Keras and TensorFlow.
The main problem was that I could not find a way to import the data both in a batch (for RAM reasons) and in a sequence of 10-15 pictures (to be used as the time steps in the GRU layer).
I solved this problem by importing batches of 120 maps in order (no shuffle) and I created a way to turn these batches into the sequence of images I needed then I proceeded to re-batch the sequences and feed them to the model manually.
Data Import
batch_size=120
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
"./Figures_1/Train",
validation_split=None,
subset=None,
labels="inferred",
label_mode="binary",
color_mode="rgb",
interpolation='bilinear',
batch_size=batch_size,
image_size=(600, 600),
shuffle=False,
seed=123
)
Get a sequence of Images
Here, I break down the 120 map batches into sequences of 60 observations, and I return each sequence one at a time.
sequence_lengh=60
def sequence_x(train_dataset):
    x_numpy = np.asarray(list(map(lambda x: x[0], tfds.as_numpy(train_dataset))), dtype=object)
    for element in range(0, x_numpy.shape[0]):
        for i in range(0, x_numpy.shape[0], sequence_lengh):
            x_seq = x_numpy[element][i:i+sequence_lengh]
            yield x_seq

def sequence_y(train_dataset):
    y_numpy = np.asarray(list(map(lambda x: x[1], tfds.as_numpy(train_dataset))), dtype=object)
    for element in range(0, y_numpy.shape[0]):
        for i in range(0, y_numpy.shape[0], sequence_lengh):
            y_seq = y_numpy[element][i:i+sequence_lengh]
            yield y_seq
CNN Model
I build the CNN model based on a pre-trained DenseNet
from keras.layers import TimeDistributed, GRU
def build_convnet(shape=(600, 600, 3)):
    inputs = keras.Input(shape = shape)
    x = inputs
    # preprocessing
    x = keras.applications.densenet.preprocess_input(x)
    # Convbase
    x = convBase(x)
    x = layers.Flatten()(x)
    # Fine tuning
    x = keras.layers.Dense(1024, activation='relu')(x)
    x = layers.Dropout(0.2)(x)
    x = keras.layers.Dense(512, activation='relu')(x)
    x = keras.layers.GlobalMaxPool2D()
    return x
GRU Model
I build the time part of the network with a GRU layer
def action_model(shape=(15, 600, 600, 3), nbout=15):
    # Create our convnet with the per-frame input shape, e.g. (600, 600, 3)
    convnet = build_convnet(shape[1:]) #[1:]
    # then create our final model
    model = keras.Sequential()
    # add the convnet, applied to every time step of the (15, 600, 600, 3) input
    model.add(TimeDistributed(convnet, input_shape=shape))
    # here, you can also use GRU or LSTM
    model.add(GRU(64))
    # and finally, we make a decision network
    model.add(Dense(1024, activation='relu'))
    model.add(Dropout(.5))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(.5))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(15, activation='softmax'))
    return model
Transfer Learning
I retrain part of the convolutional base (the conv4 and conv5 blocks of the DenseNet)
convBase = DenseNet121(include_top=False, weights=None, input_shape=(600,600,3), pooling="avg")
for layer in convBase.layers:
    if 'conv5' in layer.name:
        layer.trainable = True
for layer in convBase.layers:
    if 'conv4' in layer.name:
        layer.trainable = True
Model Compile
Model compilation ( image size= 600x600x3)
INSHAPE=(15, 600, 600, 3) # (5, 112, 112, 3)
model = action_model(INSHAPE, 1)
optimizer = keras.optimizers.Adam(0.001)
model.compile(
optimizer,
'categorical_crossentropy',
metrics='accuracy'
)
Model Fit
Here I manually batch my data. I turn an array of shape (60, 600, 600, 3) into an array of shape (4, 15, 600, 600, 3), i.e. 4 batches, each containing a 15-map-long sequence.
epochs = 10
for value in range(0, epochs):
    train_x, train_y = sequence_x(train_ds), sequence_y(train_ds)
    val_x, val_y = sequence_x(validation_ds), sequence_y(validation_ds)
    for i in range(0,278): #
        x = next(train_x, "none")
        y = next(train_y, "none")
        if (x!="none" or y!="none"):
            if (np.any(x) and np.any(y)):
                x_stack = np.stack((x[:15], x[15:30], x[30:45], x[45:]))
                y_stack = np.stack((y[:15], y[15:30], y[30:45], y[45:]))
                y_stack=y_stack.reshape(4,15)
                model.fit(x=x_stack, y=y_stack,
                          validation_data=None,
                          batch_size=None,
                          shuffle=False
                          )
            else:
                continue
        else:
            continue
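The manual stacking of four 15-frame slices could arguably be written as a single reshape. The sketch below is one way to do that, assuming the frames arrive in temporal order; `to_sequences` is a hypothetical helper, and small 8x8 arrays stand in for the 600x600 maps.

```python
import numpy as np

def to_sequences(frames, labels, seq_len):
    """Split consecutive frames into non-overlapping sequences of seq_len."""
    n_seq = len(frames) // seq_len
    usable = n_seq * seq_len  # drop a trailing partial sequence, if any
    x = frames[:usable].reshape(n_seq, seq_len, *frames.shape[1:])
    y = labels[:usable].reshape(n_seq, seq_len)
    return x, y

frames = np.zeros((60, 8, 8, 3), dtype=np.float32)  # small maps for illustration
labels = np.arange(60) % 2                          # made-up binary labels
x, y = to_sequences(frames, labels, seq_len=15)
print(x.shape, y.shape)  # (4, 15, 8, 8, 3) (4, 15)
```

Because the reshape keeps row-major order, sequence `k` contains frames `15*k` to `15*k + 14`, the same grouping as the four slices above.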
The idea is to get a model that, when presented with a sequence of images, can categorize each one of them with a 0 or a 1 if they have a Medicane or not.
The model compiles without any errors, but the results it produces are horrible.
What am I doing incorrectly? Is there a more effective way to write all of this?

What activation function on the last layer and loss function should I use in an auto encoder for reconstructing a sequence of events? [Keras]

My data set is a 3D array of the size (M,t,N) where M is the number of samples, t is the number of timesteps in a sequence and N is the number of possible events that can happen at time t. By selecting a specific M we have a 2D array of size (t,N) where each row is a timestep and each column is an event. Each column is set to 1 if that event happened at time t, otherwise it's set to 0. Only 1 event can happen at any given timestep.
I want to try and build an auto-encoder for anomaly detection, and in the tutorials and blogs I have read, the last activation layer is 'relu' and the loss function is 'mse'. But since I am trying to basically reconstruct a classification with N classes, would 'softmax' as the last layer and 'categorical_crossentropy' be better?
inputs = Input(shape = (timesteps,n_features))
# Encoder
lstm_enc_1 = LSTM(32, activation='relu', input_shape=(timesteps, n_features), return_sequences=True)(inputs)
lstm_enc_2 = LSTM(latent_dim, activation='relu', return_sequences=False)(lstm_enc_1)
repeater = RepeatVector(timesteps)
# Decoder
lstm_dec_1 = LSTM(latent_dim, activation='relu', return_sequences=True)
lstm_dec_2 = LSTM(32, activation='relu', return_sequences=True)
time_dis = TimeDistributed(Dense(n_features,activation='softmax')) #<-- Does this make sense here?
z = repeater(lstm_enc_2)
h = lstm_dec_1(z)
decoded_h = lstm_dec_2(h)
decoded = time_dis(decoded_h)
ae = Model(inputs,decoded)
ae.compile(loss='categorical_crossentropy', optimizer='adam') #<-- Does this make sense here?
Or should I, for some reason, still use 'relu' and 'mse' as the last activation function and loss function?
Any input is appreciated.

If I read it correctly, N is one-hot encoded, and it sounds like you want to do classification, not regression.
Since y is one-hot encoded, using categorical_crossentropy is correct.
If you have more than 4 classes in y, you may use integer encodings with sparse_categorical_crossentropy, which converts your y values to one-hot matrices on the fly.
mse is better suited to regression.
As the last activation, since you have a classification task, you may want to use softmax, which outputs a probability for each of your y classes.
As far as I know, you normally do not use relu in the last layer; if you have a regression task, sigmoid is generally preferred.
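To illustrate the point about the two encodings, here is a small NumPy sketch (not the Keras implementation) showing that a cross-entropy on one-hot targets and a "sparse" cross-entropy on integer targets give the same per-timestep loss. The function bodies and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def categorical_crossentropy(y_onehot, y_prob, eps=1e-7):
    # One-hot targets: sum over classes picks out -log p(true class).
    return -(y_onehot * np.log(np.clip(y_prob, eps, 1.0))).sum(axis=-1)

def sparse_categorical_crossentropy(y_index, y_prob, eps=1e-7):
    # Integer targets: index the probability of the true class directly.
    picked = np.take_along_axis(y_prob, y_index[..., None], axis=-1)[..., 0]
    return -np.log(np.clip(picked, eps, 1.0))

# Softmax output for one sample: (timesteps, n_events)
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]])
y_index = np.array([0, 1, 2])   # the event that happened at each timestep
y_onehot = np.eye(3)[y_index]   # the same targets, one-hot encoded

a = categorical_crossentropy(y_onehot, y_prob)
b = sparse_categorical_crossentropy(y_index, y_prob)
print(a, b)
```

The two only differ in how the targets are stored, which matters mostly for memory when N is large.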

50% accuracy in CNN on image binary classification

I have a collection of images with open and closed eyes.
The data is collected from the current directory using keras in this way:
batch_size = 64
N_images = 84898 #total number of images
datagen = ImageDataGenerator(
rescale=1./255)
data_iterator = datagen.flow_from_directory(
'./Eyes',
shuffle = 'False',
color_mode='grayscale',
target_size=(h, w),
batch_size=batch_size,
class_mode = 'binary')
I've got a .csv file with the state of each eye.
I've built this Sequential model:
num_filters = 8
filter_size = 3
pool_size = 2
model = Sequential([
Conv2D(num_filters, filter_size, input_shape=(90, 90, 1)),
MaxPooling2D(pool_size=pool_size),
Flatten(),
Dense(16, activation='relu'),
Dense(2, activation='sigmoid'), # Two classes. one for "open" and another one for "closed"
])
Model compilation.
model.compile(
'adam',
loss='categorical_crossentropy',
metrics=['accuracy']
)
Finally I fit all the data with the following:
model.fit(
train_images,
to_categorical(train_labels),
epochs=3,
validation_data=(test_images, to_categorical(test_labels)),
)
The result fluctuates around 50% and I do not understand why.

Your current model essentially has one convolutional layer. That is, num_filters convolutional filters (which in this case are 3 x 3 arrays) are defined and fit such that when they are convolved with the image, they produce features that are as discriminative as possible between classes. You then perform maxpooling to slightly reduce the dimension of the output CNN features before passing to 2 dense layers.
I'd start by saying that one convolutional layer is almost certainly insufficient, especially with 3x3 filters. Basically, with a single convolutional layer, the most meaningful information you can get are edges or lines. These features are only marginally more useful to a function approximator (i.e. your fully connected layers) than the raw pixel intensity values because they still have an extremely high degree of variability both within a class and between classes. Consider that shifting an image of an eye 2 pixels to the left would result in completely different values output from your 1-layer CNN. You'd like the outputs of your CNN to be invariant to scale, rotation, illumination, etc.
In practice, this means you're going to need more convolutional layers. The relatively simple VGG net has at least 14 convolutional layers, and modern residual-layer based networks often have over 100 convolutional layers. Try writing a routine to define sequentially more complex networks until you start seeing performance gains.
As a secondary point, generally you don't want to use a sigmoid() activation function on your final layer outputs during training. This flattens the gradients and makes it much slower to backpropagate your loss. You actually don't care that the output values fall between 0 and 1; you only care about their relative magnitudes. Common practice is to use a cross-entropy loss, which combines a log-softmax function (whose gradient is more stable than that of a plain softmax) and a negative log-likelihood loss, as you've already done. Thus, since the log-softmax portion transforms the output values into the desired range, there's no need to use the sigmoid activation function.
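The combination of log-softmax and negative log-likelihood mentioned above can be sketched in NumPy as follows. This is an illustration, not any library's implementation, and the logits are made up. Note that in Keras, `categorical_crossentropy` expects probabilities by default; to feed it raw outputs like this, the loss would need `from_logits=True` and the last layer no activation.

```python
import numpy as np

def log_softmax(logits):
    # Shifting by the row maximum keeps exp() from overflowing.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy_from_logits(logits, labels):
    # Negative log-likelihood of the true class for each row.
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels]

# Raw two-class outputs ("open" / "closed"); the middle row is deliberately extreme.
logits = np.array([[2.0, -1.0], [1000.0, 0.0], [0.0, 0.0]])
labels = np.array([0, 1, 1])
loss = cross_entropy_from_logits(logits, labels)
print(loss)
```

Even for the extreme row the loss stays finite, which is the stability argument for working with logits rather than squashed outputs.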

Improve Accuracy for a Siamese Network

I wrote this little model using Keras Functional API to find similarity of a dialogue between two individuals. I am using Gensim's Doc2Vec embeddings for transforming text-data into vectors (vocab size: 4117). My data is equally divided up into 56 positive cases and 64 negative cases. (yes I know the dataset is small - but that's all I have for the time being).
def euclidean_distance(vects):
    x, y = vects
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    return K.sqrt(K.maximum(sum_square, K.epsilon()))
ch_inp = Input(shape=(38, 200))
csr_inp = Input(shape=(38, 200))
inp = Input(shape=(38, 200))
net = Embedding(int(vocab_size), 16)(inp)
net = Conv2D(16, 1, activation='relu')(net)
net = TimeDistributed(LSTM(8, return_sequences=True))(net)
out = Activation('relu')(net)
sia = Model(inp, out)
x = sia(csr_inp)
y = sia(ch_inp)
sub = Subtract()([x, y])
mul = Multiply()([sub, sub])
mul_x = Multiply()([x, x])
mul_y = Multiply()([y, y])
sub_xy = Subtract()([x, y])
euc = Lambda(euclidean_distance)([x, y])
z = Concatenate(axis=-1)([euc, sub_xy, mul])
z = TimeDistributed(Bidirectional(LSTM(4)))(z)
z = Activation('relu')(z)
z = GlobalMaxPooling1D()(z)
z = Dense(2, activation='relu')(z)
out = Dense(1, activation = 'sigmoid')(z)
model = Model([ch_inp, csr_inp], out)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
The problem is: my accuracy won't improve from 60.87%. I ran 10 epochs and the accuracy remains constant. Is there something I've done in my code that's causing that? Or perhaps it's an issue with my data?
I also did K-Fold Validation for some Sklearn models and got these results from the dataset:
Additionally, an overview of my dataset is attached below:
I'm definitely struggling with this one - so literally any help here would be appreciated. Thanks!
UPDATE:
I increased my dataset size to 1875 training samples. The accuracy improved to 70.28%, but it's still constant over all iterations.

I see two things that may be important there.
You're using 'relu' after the LSTM. An LSTM in Keras already has 'tanh' as its default activation. So, although you're not locking your model, you're making it harder for it to learn, by stacking an activation that constrains the results to a small range with one that cuts off the negative values.
You're using 'relu' with very few units! Relu with few units, bad initialization, big learning rates and bad luck will get stuck in the zero region without any gradients.
If your loss completely freezes, it's most probably due to the second point above. And even if it doesn't freeze, it may be using just one unit from the 2 Dense units, for instance, making the layer very poor.
You should do one of the following:
Your model is small, so quit using 'relu' and use 'tanh' instead. This will give your model the expected power it should have.
Otherwise, you should definitely increase the number of units, both for the LSTM and for the Dense, so 'relu' doesn't get easily stuck.
You can add a BatchNormalization layer after Dense and before 'relu'; this way you guarantee that a good number of units will always be above zero.
In any case, don't use 'relu' after the LSTM.
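The "stuck in the zero region" argument can be seen numerically. In this NumPy sketch the pre-activations of a 2-unit layer are assumed to be negative for the whole batch (a made-up worst case); ReLU then outputs zero and passes no gradient, while tanh still passes some.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Derivative of ReLU: 1 where the input is positive, 0 elsewhere.
    return (x > 0).astype(x.dtype)

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, strictly positive for finite x.
    return 1.0 - np.tanh(x) ** 2

# Pre-activations of a 2-unit layer over a batch of 4, all negative (assumed).
pre = np.array([[-0.5, -2.0], [-1.0, -0.1], [-3.0, -0.7], [-0.2, -1.5]])

print(relu(pre).sum())       # 0.0: the layer outputs nothing
print(relu_grad(pre).sum())  # 0.0: no gradient flows back through it
print(tanh_grad(pre).min())  # small but positive: tanh can still recover
```

With only two units, it takes little bad luck for every unit to end up in this state, which matches the frozen accuracy described in the question.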
The other approach would be making the model more powerful.
For instance:
z = TimeDistributed(Bidirectional(LSTM(4)))(z)
z = Conv1D(10, 3, activation = 'tanh')(z) #or 'relu' maybe
z = MaxPooling1D()(z)
z = Conv1D(15, 3, activation = 'tanh')(z) #or 'relu' maybe
z = Flatten()(z) #unless the length is variable, then GlobalAveragePooling1D()(z)
z = Dense(10, activation='relu')(z)
out = Dense(1, activation = 'sigmoid')(z)

Batch normalization with 3D convolutions in TensorFlow

I'm implementing a model relying on 3D convolutions (for a task that is similar to action recognition) and I want to use batch normalization (see [Ioffe & Szegedy 2015]). I could not find any tutorial focusing on 3D convs, hence I'm making a short one here which I'd like to review with you.
The code below refers to TensorFlow r0.12 and it explicitly instantiates variables; I mean, I'm not using tf.contrib.learn, only the tf.contrib.layers.batch_norm() function. I'm doing this both to better understand how things work under the hood and to have more implementation freedom (e.g., variable summaries).
I will get to the 3D convolution case smoothly by first writing the example for a fully-connected layer, then for a 2D convolution and finally for the 3D case. While going through the code, it would be great if you could check if everything is done correctly - the code runs, but I'm not 100% sure about the way I apply batch normalization. I end this post with a more detailed question.
import tensorflow as tf
# This flag is used to allow/prevent batch normalization params updates
# depending on whether the model is being trained or used for prediction.
training = tf.placeholder_with_default(True, shape=())
Fully-connected (FC) case
# Input.
INPUT_SIZE = 512
u = tf.placeholder(tf.float32, shape=(None, INPUT_SIZE))
# FC params: weights only, no bias as per [Ioffe & Szegedy 2015].
FC_OUTPUT_LAYER_SIZE = 1024
w = tf.Variable(tf.truncated_normal(
[INPUT_SIZE, FC_OUTPUT_LAYER_SIZE], dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
fc = tf.matmul(u, w)
# Batch normalization.
fc_bn = tf.contrib.layers.batch_norm(
fc,
center=True,
scale=True,
is_training=training,
scope='fc-batch_norm')
# Activation function.
fc_bn_relu = tf.nn.relu(fc_bn)
print(fc_bn_relu) # Tensor("Relu:0", shape=(?, 1024), dtype=float32)
2D convolutional (CNN) layer case
# Input: 640x480 RGB images (whitened input, hence tf.float32).
INPUT_HEIGHT = 480
INPUT_WIDTH = 640
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN_FILTER_HEIGHT = 3 # Space dimension.
CNN_FILTER_WIDTH = 3 # Space dimension.
CNN_FILTERS = 128
w = tf.Variable(tf.truncated_normal(
[CNN_FILTER_HEIGHT, CNN_FILTER_WIDTH, INPUT_CHANNELS, CNN_FILTERS],
dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN_LAYER_STRIDE_VERTICAL = 1
CNN_LAYER_STRIDE_HORIZONTAL = 1
CNN_LAYER_PADDING = 'SAME'
cnn = tf.nn.conv2d(
input=u, filter=w,
strides=[1, CNN_LAYER_STRIDE_VERTICAL, CNN_LAYER_STRIDE_HORIZONTAL, 1],
padding=CNN_LAYER_PADDING)
# Batch normalization.
cnn_bn = tf.contrib.layers.batch_norm(
cnn,
data_format='NHWC', # Matching the "cnn" tensor which has shape (?, 480, 640, 128).
center=True,
scale=True,
is_training=training,
scope='cnn-batch_norm')
# Activation function.
cnn_bn_relu = tf.nn.relu(cnn_bn)
print(cnn_bn_relu) # Tensor("Relu_1:0", shape=(?, 480, 640, 128), dtype=float32)
3D convolutional (CNN3D) layer case
# Input: sequence of 9 160x120 RGB images (whitened input, hence tf.float32).
INPUT_SEQ_LENGTH = 9
INPUT_HEIGHT = 120
INPUT_WIDTH = 160
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_SEQ_LENGTH, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: weights only, no bias as per [Ioffe & Szegedy 2015].
CNN3D_FILTER_LENGHT = 3 # Time dimension.
CNN3D_FILTER_HEIGHT = 3 # Space dimension.
CNN3D_FILTER_WIDTH = 3 # Space dimension.
CNN3D_FILTERS = 96
w = tf.Variable(tf.truncated_normal(
[CNN3D_FILTER_LENGHT, CNN3D_FILTER_HEIGHT, CNN3D_FILTER_WIDTH, INPUT_CHANNELS, CNN3D_FILTERS],
dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN3D_LAYER_STRIDE_TEMPORAL = 1
CNN3D_LAYER_STRIDE_VERTICAL = 1
CNN3D_LAYER_STRIDE_HORIZONTAL = 1
CNN3D_LAYER_PADDING = 'SAME'
cnn3d = tf.nn.conv3d(
input=u, filter=w,
strides=[1, CNN3D_LAYER_STRIDE_TEMPORAL, CNN3D_LAYER_STRIDE_VERTICAL, CNN3D_LAYER_STRIDE_HORIZONTAL, 1],
padding=CNN3D_LAYER_PADDING)
# Batch normalization.
cnn3d_bn = tf.contrib.layers.batch_norm(
cnn3d,
data_format='NHWC', # Matching the "cnn3d" tensor which has shape (?, 9, 120, 160, 96).
center=True,
scale=True,
is_training=training,
scope='cnn3d-batch_norm')
# Activation function.
cnn3d_bn_relu = tf.nn.relu(cnn3d_bn)
print(cnn3d_bn_relu) # Tensor("Relu_2:0", shape=(?, 9, 120, 160, 96), dtype=float32)
What I would like to make sure is whether the code above exactly implements batch normalization as described in [Ioffe & Szegedy 2015] at the end of Sec. 3.2:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. [...] Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
UPDATE
I guess the code above is also correct for the 3D conv case. In fact, when I define my model if I print all the trainable variables, I also see the expected numbers of beta and gamma variables. For instance:
Tensor("conv3a/conv3d_weights/read:0", shape=(3, 3, 3, 128, 256), dtype=float32)
Tensor("BatchNorm_2/beta/read:0", shape=(256,), dtype=float32)
Tensor("BatchNorm_2/gamma/read:0", shape=(256,), dtype=float32)
This looks ok to me since due to BN, one pair of beta and gamma are learned for each feature map (256 in total).
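As a sanity check on what "jointly normalize over all locations" means for a 5D tensor, here is a training-mode batch normalization written in NumPy. It is a sketch of the idea in the quoted passage, not the tf.contrib implementation; the function name and the reduced spatial size are assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize over every axis except the last (channel) one."""
    axes = tuple(range(x.ndim - 1))  # (N, T, H, W) for a 5D input
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 9, 12, 16, 96))  # (N, T, H, W, C)
gamma = np.ones(96)   # one scale per feature map
beta = np.zeros(96)   # one offset per feature map
y = batch_norm_train(x, gamma, beta)
print(y.shape, gamma.shape, beta.shape)
```

Each of the 96 feature maps gets a single mean and variance computed across batch, time and space, so `gamma` and `beta` have shape `(96,)`, consistent with the `(256,)` variables printed above for a 256-filter layer.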
[Ioffe & Szegedy 2015]: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

That is a great post about 3D batchnorm; it often goes unnoticed that batchnorm can be applied to any tensor of rank greater than 1. Your code is correct, but I couldn't help adding a few important notes:
A "standard" 2D batchnorm (accepts a 4D tensor) can be significantly faster in tensorflow than 3D or higher, because it supports fused_batch_norm implementation, which applies one kernel operation:
Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process
that for some models makes up a large percentage of the operation
time. Using fused batch norm can result in a 12%-30% speedup.
There is an issue on GitHub to support 3D filters as well, but there hasn't been any recent activity and at this point the issue is closed unresolved.
Although the original paper prescribes using batchnorm before ReLU activation (and that's what you did in the code above), there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy]
applies relu before BN. It is still occasionally a topic of debate, though.
For anyone interested in applying the idea of normalization in practice, there have been recent research developments of this idea, namely weight normalization and layer normalization, which fix certain disadvantages of the original batchnorm; for example, they work better for LSTMs and recurrent networks.
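For contrast with batchnorm, layer normalization computes its statistics per sample over the feature axis, so the result does not depend on the rest of the batch. The NumPy sketch below illustrates that property under assumed shapes; it is not taken from any particular library.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its feature axis (the last one)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 64))  # (batch, hidden units), e.g. one recurrent step
out = layer_norm(h, np.ones(64), np.zeros(64))

# Normalizing the first sample alone gives the same result as inside the batch.
single = layer_norm(h[:1], np.ones(64), np.zeros(64))
print(np.allclose(out[:1], single))
```

This batch independence is one reason the technique is often suggested for recurrent networks, where batch statistics vary from one time step to the next.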
