Padding text sequences for conv1D change the output of the convolution

Padding text sequences for conv1D change the output of the convolution - python

I found out that even if new tensorflow api (since 2.0) and Keras is practical because we do not need to perform padding every time (even with only one input) during inference performing padding change the results of conv1d layer which do not accept Masking layer. The results change even when using GlobalMaxPooling1D.
Here is a little code snippet to explain my point :
import tensorflow as tf
y = tf.random.normal((2, 7, 100))
# Conv1d and maxpooling
cnn = tf.keras.layers.Conv1D(32, 3, padding="same")
max_pool = tf.keras.layers.GlobalMaxPool1D()
m_post = max_pool(cnn(tf.keras.preprocessing.sequence.pad_sequences(y, maxlen=10, dtype="float32", value=tf.zeros(100), padding="post")))
m_pre = max_pool(cnn(tf.keras.preprocessing.sequence.pad_sequences(y, maxlen=10, dtype="float32", value=tf.zeros(100), padding="pre")))
m = max_pool(cnn(y))
print(m - m_post)
print(m - m_pre)
print(m_post - m_pre)
The thing that is really weird is, the output does not change among differrent padding size...
m_post = max_pool(cnn(tf.keras.preprocessing.sequence.pad_sequences(y, maxlen=10, dtype="float32", value=tf.zeros(100), padding="post")))
m_post_2 = max_pool(cnn(tf.keras.preprocessing.sequence.pad_sequences(y, maxlen=16, dtype="float32", value=tf.zeros(100), padding="post")))
print(m_post - m_post_2)
When I am in production inference, The solution is to pad the same amount than I did in training, but it is not optimal because I have really different text lenght.
Can someone explain that behaviour ? Did I miss something here ?

Related

Is convolution useful on a network with a timestep of 1?

This code comes from https://www.kaggle.com/dkaraflos/1-geomean-nn-and-6featlgbm-2-259-private-lb, The goal of this competition is to use seismic signals to predict the timing of laboratory earthquakes. The person in this link has won first place among more than 4000 teams
def get_model():
inp = Input(shape=(1,train_sample.shape[1]))
x = BatchNormalization()(inp)
x = LSTM(128,return_sequences=True)(x) # LSTM as first layer performed better than Dense.
x = Convolution1D(128, (2),activation='relu', padding="same")(x)
x = Convolution1D(84, (2),activation='relu', padding="same")(x)
x = Convolution1D(64, (2),activation='relu', padding="same")(x)
x = Flatten()(x)
x = Dense(64, activation="relu")(x)
x = Dense(32, activation="relu")(x)
#outputs
ttf = Dense(1, activation='relu',name='regressor')(x) # Time to Failure
tsf = Dense(1)(x) # Time Since Failure
classifier = Dense(1, activation='sigmoid')(x) # Binary for TTF<0.5 seconds
model = models.Model(inputs=inp, outputs=[ttf,tsf,classifier])
opt = optimizers.Nadam(lr=0.008)
# We are fitting to 3 targets simultaneously: Time to Failure (TTF), Time Since Failure (TSF), and Binary for TTF<0.5 seconds
# We weight the model to optimize heavily for TTF
# Optimizing for TSF and Binary TTF<0.5 helps to reduce overfitting, and helps for generalization.
model.compile(optimizer=opt, loss=['mae','mae','binary_crossentropy'],loss_weights=[8,1,1],metrics=['mae'])
return model
However, According to my derivation, I think x = Convolution1D(128, (2),activation='relu', padding="same")(x) and x = Dense(128, activation='relu ')(x) has the same effect, because the convolution kernel performs convolution on the sequence with a time step of 1. In principle, it is very similar to the fully connected layer. Why use conv1D here instead of directly using the fullly connection layer? Is my derivation wrong?

1) Assuming you would input a sequence to the LSTM (the normal use case):
It would not be the same since the LSTM returns a sequence (return_sequences=True), thereby not reducing the input dimensionality. The output shape is therefore (Batch, Sequence, Hid). This is being fed to the Convolution1D layer which performs convolution on the Sequence dimension, i.e. on (Sequence, Hid). So in effect, the purpose of the 1D Convolutions is to extract local 1D subsequences/patches after the LSTM.
If we had return_sequences=False, the LSTM would return the final state h_t. To ensure the same behavior as a Dense layer, you need a fully connected convolutional layer, i.e. a kernel size of Sequence length, and we need as many filters as we have Hid in the output shape. This would then make the 1D Convolution equivalent to a Dense layer.
2) Assuming you do not input a sequence to the LSTM (your example):
In your example, the LSTM is used as a replacement for a Dense layer.
It serves the same function, though it gives you a slightly different
result as the gates do additional transformations (even though we
have no sequence).
Since the Convolution is then performed on (Sequence, Hid) = (1, Hid), it is indeed operating per timestep. Since we have 128 inputs and 128 filters, it is fully connected and the kernel size is large enough to operate on the single element. This meets the above defined criteria for a 1D Convolution to be equivalent to a Dense layer, so you're correct.
As a side note, this type of architecture is something you would typically get with a Neural Architecture Search. The "replacements" used here are not really commonplace and not generally guaranteed to be better than the more established counterparts. In a lot of cases, using Reinforcement Learning or Evolutionary Algorithms can however yield slightly better accuracy using "untraditional" solutions since very small performance gains can just happen by chance and don't have to necessarily reflect back on the usefulness of the architecture.

Primer on TensorFlow and Keras: The past (TF1) the present (TF2)

The aim of this question is to ask for a bare-minimal guide to get someone up to speed with TensorFlow 1 and TensorFlow 2. I feel like there isn't a coherent guide that explains differences between TF1 and TF2 and TF has been through major revisions and evolving at a rapid pace.
For reference when I say,
v1 or TF1 - I refer to TF 1.15.0
v2 or TF2 - I refer to TF 2.0.0
The questions I have are,
How does TF1/TF2 work? What are their key differences?
What are different datatypes/ data structures in TF1 and TF2?
What is Keras and how does that fit in all these? What kind of different APIs Keras provide to implement deep learning models? Can you provide examples of each?
What are the most recurring warnings/errors I have to look out for while using TF and Keras?
Performance differences between TF1 and TF2

How does TF1/TF2 work? And their differences
TF1
TF1 follows an execution style known as define-then-run. This is opposed to define-by-run which is for example Python execution style. But what does that mean? Define then run means that, just because you called/defined something it's not executed. You have to explicitly execute what you defined.
TF has this concept of a Graph. First you define all the computations you need (e.g. all the layer computations of a neural network, loss computation and an optimizer that minimizes the loss - these are represented as ops or operations). After you define the computation/data-flow graph you execute bits and pieces of this using a Session. Let's see a simple example in action.
# Graph generation
tf_a = tf.placeholder(dtype=tf.float32)
tf_b = tf.placeholder(dtype=tf.float32)
tf_c = tf.add(tf_a, tf.math.multiply(tf_b, 2.0))
# Execution
with tf.Session() as sess:
c = sess.run(tf_c, feed_dict={tf_a: 5.0, tf_b: 2.0})
print(c)
The computational graph (also known as data flow graph) will look like below.
tf_a tf_b tf.constant(2.0)
\ \ /
\ tf.math.multiply
\ /
tf.add
|
tf_c
Analogy: Think about you making a cake. You download the recipe from the internet. Then you start following the steps to actually make the cake. The recipe is the Graph and the process of making the cake is what the Session does (i.e. execution of the graph).
TF2
TF2 follows immediate execution style or define-by-run. You call/define something, it is executed. Let's see an example.
a = tf.constant(5.0)
b = tf.constant(3.0)
c = tf_a + (tf_b * 2.0)
print(c.numpy())
Woah! It looks so clean compared to the TF1 example. Everything looks so Pythonic.
Analogy: Now think that you are in a hands-on cake workshop. You are making cake as the instructor explains. And the instructor explains what the result of each step is immediately. So, unlike in the previous example you don't have to wait till you bake the cake to see if you got it right (which is a reference to the fact that you cannot debug code). But you get instant feedback on how you are doing (you know what this means).
Does that mean TF2 doesn't build a graph? Panic attack
Well yes and no. There's two features in TF2 you should know about eager execution and AutoGraph functions.
Tip: To be exact TF1 also had eager execution (off by default) and can be enabled using tf.enable_eager_execution(). TF2 has eager_execution on by default.
Eager execution
Eager execution can immediately execute Tensors and Operations. This is what you observed in the TF2 example. But the flipside is that it does not build a graph. So for example you use eager execution to implement and run a neural network, it will be very slow (as neural networks do very repetitive tasks (forward computation - loss computation - backward pass) over and over again).
AutoGraph
This is where the AutoGraph feature comes to the rescue. AutoGraph is one of my favorite features in TF2. What this does is that if you are doing "TensorFlow" stuff in a function, it analyses the function and builds the graph for you (mind blown). So for example you do the following. TensorFlow builds the graph.
#tf.function
def do_silly_computation(x, y):
a = tf.constant(x)
b = tf.constant(y)
c = tf_a + (tf_b * 2.0)
return c
print(do_silly_computation(5.0, 3.0).numpy())
So all you need to do is define a function which takes the necessary inputs and return the correct output. Most importantly add #tf.function decorator as that's the trigger for TensorFlow AutoGraph to analyse a given function.
Warning: AutoGraph is not a silver bullet and not to be used naively. There are various limitations of AutoGraph too.
Differences between TF1 and TF2
TF1 requires a tf.Session() object to execute the graph and TF2 doesn't
In TF1 the unreferenced variables were not collected by the Python GC, but in TF2 they are
TF1 does not promote code modularity as you need the full graph defined before starting the computations. However, with the AutoGraph function code modularity is encouraged
What are different datatypes in TF1 and TF2?
You've already seen lot of the main data types. But you might have questions about what they do and how they behave. Well this section is all about that.
TF1 Data types / Data structures
tf.placeholder: This is how you provide inputs to the computational graph. As the name suggest, it does not have a value attached to it. Rather, you feed a value at runtime. tf_a and tf_b are examples of these. Think of this as an empty box. You fill that with water/sand/fluffy teddy bears depending on the need.
tf.Variable: This is what you use to define the parameters of your neural network. Unlike placeholders, variables are initialized with some value. But their value can also be changed over time. This is what happens to the parameters of a neural network during back propagation.
tf.Operation: Operations are various transformations you can execute on Placeholders, Tensors and Variables. For example tf.add() and tf.mul() are operations. These operations return a Tensor (most of the time). If you want proof of an op that doesn't return a Tensor, check this out.
tf.Tensor: This is similar to a variable in the sense that it has an initial value. However, once they are defined, their value cannot be changed (i.e. they are immutable). For example, tf_c in the previous example is a tf.Tensor.
TF2 Data types / Data structures
tf.Variable
tf.Tensor
tf.Operation
In terms of the behavior nothing much has changed in data types going from TF1 to TF2. The only main difference is that, the tf.placeholders are gone. You can also have a look at the full list of data types.
What is Keras and how does that fit in all these?
Keras used to be a separate library providing high-level implementations of components (e.g. layers and models) that are mainly used for deep learning models. But since later versions of TensorFlow, Keras got integrated into TensorFlow.
So as I explained, Keras hides lot of unnecessary intricacies you have to deal with if you were to work with bare-bone TensorFlow. There are two main things Keras offers Layer objects and Model objects for implementing NNs. Keras also has two most common model APIs that lets you develop models: the Sequential API and the Functional API. Let's see how different Keras and TensorFlow are in a quick example. Let's build a simple CNN.
Tip: Keras allows you to achieve what you can with do achieve with TF much easier. But Keras also provide capabilities that are not yet strong in TF (e.g. text processing capabilities).
height=64
width = 64
n_channels = 3
n_outputs = 10
Keras (Sequential API) example
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(2,2),
activation='relu',input_shape=(height, width, n_channels)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(filters=64, kernel_size=(2,2), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(n_outputs, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.summary()
Pros
Straight-forward to implement simple models
Cons
Cannot be used to implement complex models (e.g. models with multiple inputs)
Keras (Functional API) example
inp = Input(shape=(height, width, n_channels))
out = Conv2D(filters=32, kernel_size=(2,2), activation='relu',input_shape=(height, width, n_channels))(inp)
out = MaxPooling2D(pool_size=(2,2))(out)
out = Conv2D(filters=64, kernel_size=(2,2), activation='relu')(out)
out = MaxPooling2D(pool_size=(2,2))(out)
out = Flatten()(out)
out = Dense(n_outputs, activation='softmax')(out)
model = Model(inputs=inp, outputs=out)
model.compile(loss='binary_crossentropy', optimizer='adam')
model.summary()
Pros
Can be used to implement complex models involving multiple inputs and outputs
Cons
Needs to have a very good understanding of the shapes of the inputs outputs and what's expected as an input by each layer
TF1 example
# Input
tf_in = tf.placeholder(shape=[None, height, width, n_channels], dtype=tf.float32)
# 1st conv and max pool
conv1 = tf.Variable(tf.initializers.glorot_uniform()([2,2,3,32]))
tf_out = tf.nn.conv2d(tf_in, filters=conv1, strides=[1,1,1,1], padding='SAME') # 64,64
tf_out = tf.nn.max_pool2d(tf_out, ksize=[2,2], strides=[1,2,2,1], padding='SAME') # 32,32
# 2nd conv and max pool
conv2 = tf.Variable(tf.initializers.glorot_uniform()([2,2,32,64]))
tf_out = tf.nn.conv2d(tf_out, filters=conv2, strides=[1,1,1,1], padding='SAME') # 32, 32
tf_out = tf.nn.max_pool2d(tf_out, ksize=[2,2], strides=[1,2,2,1], padding='SAME') # 16, 16
tf_out = tf.reshape(tf_out, [-1, 16*16*64])
# Dense layer
dense = conv1 = tf.Variable(tf.initializers.glorot_uniform()([16*16*64, n_outputs]))
tf_out = tf.matmul(tf_out, dense)
Pros
Is very good for cutting edge research involving atypical operations (e.g. changing the sizes of layers dynamically)
Cons
Poor readability
Caveats and Gotchas
Here I will be listing down few things you have to watch out for when using TF (coming from my experience).
TF1 - Forgetting to feed all the dependent placeholders to compute the result
tf_a = tf.placeholder(dtype=tf.float32)
tf_b = tf.placeholder(dtype=tf.float32)
tf_c = tf.add(tf_a, tf.math.multiply(tf_b, 2.0))
with tf.Session() as sess:
c = sess.run(tf_c, feed_dict={tf_a: 5.0})
print(c)
InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder_8' with dtype float
[[node Placeholder_8 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
The reason you get an error here is because, you haven't fed a value to tf_b. So make sure you feed values to all the dependent placeholder to compute a result.
TF1 - Be very very careful of data types
tf_a = tf.placeholder(dtype=tf.int32)
tf_b = tf.placeholder(dtype=tf.float32)
tf_c = tf.add(tf_a, tf.math.multiply(tf_b, 2.0))
with tf.Session() as sess:
c = sess.run(tf_c, feed_dict={tf_a: 5, tf_b: 2.0})
print(c)
TypeError: Input 'y' of 'Add' Op has type float32 that does not match type int32 of argument 'x'.
Can you spot the error? It is because you have to match data types when passing them to operations. Otherwise, use tf.cast() operation to cast your data type to a compatible data type.
Keras - Understand what input shape each layer expects
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(2,2),
activation='relu',input_shape=(height, width)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(filters=64, kernel_size=(2,2), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(n_outputs, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam')
ValueError: Input 0 of layer conv2d_8 is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 64, 64]
Here, you have defined an input shape [None, height, width] (when you add the batch dimension). But Conv2D expects a 4D input [None, height, width, n_channels]. Therefore you get the error above. Some commonly misunderstood/error-prone layers are,
Conv2D layer - Expects a 4D input [None, height, width, n_channels]. To know about the convolution layer/operation in more detail have a look at this answer
LSTM layer - Expects a 3D input [None, timesteps, n_dim]
ConvLSTM2D layer - Expects a 5D input [None, timesteps, height, width, n_channels]
Concatenate layer - Except the axis the data concatenated on all other dimension needs to be the same
Keras - Feeding in the wrong input/output shape during fit()
height=64
width = 64
n_channels = 3
n_outputs = 10
Xtrain = np.random.normal(size=(500, height, width, 1))
Ytrain = np.random.choice([0,1], size=(500, n_outputs))
# Build the model
# fit network
model.fit(Xtrain, Ytrain, epochs=10, batch_size=32, verbose=0)
ValueError: Error when checking input: expected conv2d_9_input to have shape (64, 64, 3) but got array with shape (64, 64, 1)
You should know this one. We are feeding an input of shape [batch size, height, width, 1] when we should be feeding a [batch size, height, width, 3] input.
Performance differences between TF1 and TF2
This has already been in discussion here. So I will not reiterate what's in there.
Things I wish I could have talked about but couldn't
I'm leaving this with some links to further reading.
tf.data.Dataset
tf.RaggedTensor

How to verify structure a neural network in keras model?

I'm new in Keras and Neural Networks. I'm writing a thesis and trying to create a SimpleRNN in Keras as it is illustrated below:
As it is shown in the picture, I need to create a model with 4 inputs + 2 outputs and with any number of neurons in the hidden layer.
This is my code:
model = Sequential()
model.add(SimpleRNN(4, input_shape=(1, 4), activation='sigmoid', return_sequences=True))
model.add(Dense(2))
model.compile(loss='mean_absolute_error', optimizer='adam')
model.fit(data, target, epochs=5000, batch_size=1, verbose=2)
predict = model.predict(data)
1) Does my model implement the graph?
2) Is it possible to specify connections between neurons Input and Hidden layers or Output and Input layers?
Explanation:
I am going to use backpropagation to train my network.
I have input and target values
Input is a 10*4 array and target is a 10*2 array which I then reshape:
input = input.reshape((10, 1, 4))
target = target.reshape((10, 1, 2))
It is crucial for to able to specify connections between neurons as they can be different. For instance, here you can have an example:

1) Not really. But I'm not sure about what exactly you want in that graph. (Let's see how Keras recurrent layers work below)
2) Yes, it's possible to connect every layer to every layer, but you can't use Sequential for that, you must use Model.
This answer may not be what you're looking for. What exactly do you want to achieve? What kind of data you have, what output you expect, what is the model supposed to do? etc...
1 - How does a recurrent layer work?
Documentation
Recurrent layers in keras work with an "input sequence" and may output a single result or a sequence result. It's recurrency is totally contained in it and doesn't interact with other layers.
You should have inputs with shape (NumberOrExamples, TimeStepsInSequence, DimensionOfEachStep). This means input_shape=(TimeSteps,Dimension).
The recurrent layer will work internally with each time step. The cycles happen from step to step and this behavior is totally invisible. The layer seems to work just like any other layer.
This doesn't seem to be what you want. Unless you have a "sequence" to input. The only way I know if using recurrent layers in Keras that is similar to you graph is when you have a segment of a sequence and want to predict the next step. If that's the case, see some examples by searching for "predicting the next element" in Google.
2 - How to connect layers using Model:
Instead of adding layers to a sequential model (which will always follow a straight line), start using the layers independently, starting from an input tensor:
from keras.layers import *
from keras.models import Model
inputTensor = Input(shapeOfYourInput) #it seems the shape is "(2,)", but we must see your data.
#A dense layer with 2 outputs:
myDense = Dense(2, activation=ItsAGoodIdeaToUseAnActivation)
#The output tensor of that layer when you give it the input:
denseOut1 = myDense(inputTensor)
#You can do as many cycles as you want here:
denseOut2 = myDense(denseOut1)
#you can even make a loop:
denseOut = Activation(ItsAGoodIdeaToUseAnActivation)(inputTensor) #you may create a layer and call it with the input tensor in just one line if you're not going to reuse the layer
#I'm applying this activation layer here because since we defined an activation for the dense layer and we're going to cycle it, it's not going to behave very well receiving huge values in the first pass and small values the next passes....
for i in range(n):
denseOut = myDense(denseOut)
This kind of usage allows you to create any kind of model, with branches, alternative ways, connections from anywhere to anywhere, provided you respect the shape rules. For a cycle like that, inputs and outputs must have the same shape.
At the end, you must define a model from one or many inputs to one or many outputs (you must have training data to match all inputs and outputs you choose):
model = Model(inputTensor,denseOut)
But notice that this model is static. If you want to change the number of cycles, you will have to create a new model.
In this case, it would be as simple as repeating the loop step denseOut = myDense(denseOut) and creating another model2=Model(inputTensor,denseOut).
3 - Trying to create something like the image below:
I am supposing C and F will participate in all iterations. If not,
Since there are four actual inputs, and we are going to treat them all separately, let's create 4 inputs instead, all like (1,).
Your input array should be divided in 4 arrays, all being (10,1).
from keras.models import Model
from keras.layers import *
inputA = Input((1,))
inputB = Input((1,))
inputC = Input((1,))
inputF = Input((1,))
Now the layers N2 and N3, that will be used only once, since C and F are constant:
outN2 = Dense(1)(inputC)
outN3 = Dense(1)(inputF)
Now the recurrent layer N1, without giving it the tensors yet:
layN1 = Dense(1)
For the loop, let's create outA and outB. They start as actual inputs and will be given to the layer N1, but in the loop they will be replaced
outA = inputA
outB = inputB
Now in the loop, let's do the "passes":
for i in range(n):
#unite A and B in one
inputAB = Concatenate()([outA,outB])
#pass through N1
outN1 = layN1(inputAB)
#sum results of N1 and N2 into A
outA = Add()([outN1,outN2])
#this is constant for all the passes except the first
outB = outN3 #looks like B is never changing in your image....
Now the model:
finalOut = Concatenate()([outA,outB])
model = Model([inputA,inputB,inputC,inputF], finalOut)

Why do Keras Conv1D layers' output tensors not have the input dimension?

According to the keras documentation (https://keras.io/layers/convolutional/) the shape of a Conv1D output tensor is (batch_size, new_steps, filters) while the input tensor shape is (batch_size, steps, input_dim). I don't understand how this could be since that implies that if you pass a 1d input of length 8000 where batch_size = 1 and steps = 1 (I've heard steps means the # of channels in your input) then this layer would have an output of shape (1,1,X) where X is the number of filters in the Conv layer. But what happens to the input dimension? Since the X filters in the layer are applied to the entire input dimension shouldn't one of the output dimensions be 8000 (or less depending on padding), something like (1,1,8000,X)? I checked and Conv2D layers behave in a way that makes more sense their output_shape is (samples, filters, new_rows, new_cols) where new_rows and new_cols would be the dimensions of an input image again adjusted based on padding. If Conv2D layers preserve their input dimensions why don't Conv1D layers? Is there something I'm missing here?
Background Info:
I'm trying to visualize 1d convolutional layer activations of my CNN but most tools online I've found seem to just work for 2d convolutional layers so I've decided to write my own code for it. I've got a pretty good understanding of how it works here is the code I've got so far:
# all the model's activation layer output tensors
activation_output_tensors = [layer.output for layer in model.layers if type(layer) is keras.layers.Activation]
# make a function that computes activation layer outputs
activation_comp_function = K.function([model.input, K.learning_phase()], activation_output_tensors)
# 0 means learning phase = False (i.e. the model isn't learning right now)
activation_arrays = activation_comp_function([training_data[0,:-1], 0])
This code is based off of julienr's first comment in this thread, with some modifications for the current version of keras. Sure enough when I use it though all the activation arrays are of shape (1,1,X)... I spent all day yesterday trying to figure out why this is but no luck any help is greatly appreciated.
UPDATE: Turns out I mistook the meaning of the input_dimension with the steps dimension. This is mostly because the architecture I used came from another group that build their model in mathematica and in mathematica an input shape of (X,Y) to a Conv1D layer means X "channels" (or input_dimension of X) and Y steps. A thank you to gionni for helping me realize this and explaining so well how the "input_dimension" becomes the "filter" dimension.

I used to have the same problem with 2D convolutions. The thing is that when you apply a convolutional layer the kernel you are applying is not of size (kernel_size, 1) but actually (kernel_size, input_dim).
If you think of it if it wasn't this way a 1D convolutional layer with kernel_size = 1 would be doing nothing to the inputs it received.
Instead it is computing a weighted average of the input features at each time step, using the same weights for each time step (although every filter uses a different set of weights). I think it helps to visualize input_dim as the number of channels in a 2D convolution of an image, where the same reaoning applies (in that case is the channels that "get lost" and trasformed into the number of filters).
To convince yourself of this, you can reproduce the 1D convolution with a 2D convolution layer using kernel_size=(1D_kernel_size, input_dim) and the same number of filters. Here an example:
from keras.layers import Conv1D, Conv2D
import keras.backend as K
import numpy as np
# create an input with 4 steps and 5 channels/input_dim
channels = 5
steps = 4
filters = 3
val = np.array([list(range(i * channels, (i + 1) * channels)) for i in range(1, steps + 1)])
val = np.expand_dims(val, axis=0)
x = K.variable(value=val)
# 1D convolution. Initialize the kernels to ones so that it's easier to compute the result by hand
conv1d = Conv1D(filters=filters, kernel_size=1, kernel_initializer='ones')(x)
# 2D convolution that replicates the 1D one
# need to add a dimension to your input since conv2d expects 4D inputs. I add it at axis 4 since my keras is setup with `channel_last`
val1 = np.expand_dims(val, axis=3)
x1 = K.variable(value=val1)
conv2d = Conv2D(filters=filters, kernel_size=(1, 5), kernel_initializer='ones')(x1)
# evaluate and print the outputs
print(K.eval(conv1d))
print(K.eval(conv2d))
As I said, it took me a while too to understand this, I think mostly because no tutorial explains it clearly

Thanks, It's very useful.
here the same code adapted using recent version of tensorflow + keras
and stacking on axis 0 to build the 4D
# %%
from tensorflow.keras.layers import Conv1D, Conv2D
from tensorflow.keras.backend import eval
import tensorflow as tf
import numpy as np
# %%
# create an 3D input with format BLC (Batch, Layer, Channel)
batch = 10
layers = 3
channels = 5
kernel = 2
val3D = np.random.randint(0, 100, size=(batch, layers, channels))
x = tf.Variable(val3D.astype('float32'))
# %%
# 1D convolution. Initialize the kernels to ones so that it's easier to compute the result by hand / compare
conv1d = Conv1D(filters=layers, kernel_size=kernel, kernel_initializer='ones')(x)
# %%
# 2D convolution that replicates the 1D one
# need to add a dimension to your input since conv2d expects 4D inputs. I add it at axis 0 since my keras is setup with `channel_last`
# stack 3 time the same
val4D = np.stack([val3D,val3D,val3D], axis=0)
x1 = tf.Variable(val4D.astype('float32'))
# %%
# 2D convolution. Initialize the kernel_size to one for the 1st kernel size so that replicate the conv1D
conv2d = Conv2D(filters=layers, kernel_size=(1, kernel), kernel_initializer='ones')(x1)
# %%
# evaluate and print the outputs
print(eval(conv1d))
print('---------------------------------------------')
# display only one of the stacked
print(eval(conv2d)[0])

Batch normalization with 3D convolutions in TensorFlow

I'm implementing a model relying on 3D convolutions (for a task that is similar to action recognition) and I want to use batch normalization (see [Ioffe & Szegedy 2015]). I could not find any tutorial focusing on 3D convs, hence I'm making a short one here which I'd like to review with you.
The code below refers to TensorFlow r0.12 and it explicitly instances variables - I mean I'm not using tf.contrib.learn except for the tf.contrib.layers.batch_norm() function. I'm doing this both to better understand how things work under the hood and to have more implementation freedom (e.g., variable summaries).
I will get to the 3D convolution case smoothly by first writing the example for a fully-connected layer, then for a 2D convolution and finally for the 3D case. While going through the code, it would be great if you could check if everything is done correctly - the code runs, but I'm not 100% sure about the way I apply batch normalization. I end this post with a more detailed question.
import tensorflow as tf
# This flag is used to allow/prevent batch normalization params updates
# depending on whether the model is being trained or used for prediction.
training = tf.placeholder_with_default(True, shape=())
Fully-connected (FC) case
# Input.
INPUT_SIZE = 512
u = tf.placeholder(tf.float32, shape=(None, INPUT_SIZE))
# FC params: weights only, no bias as per [Ioffe & Szegedy 2015].
FC_OUTPUT_LAYER_SIZE = 1024
w = tf.Variable(tf.truncated_normal(
[INPUT_SIZE, FC_OUTPUT_LAYER_SIZE], dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
fc = tf.matmul(u, w)
# Batch normalization.
fc_bn = tf.contrib.layers.batch_norm(
fc,
center=True,
scale=True,
is_training=training,
scope='fc-batch_norm')
# Activation function.
fc_bn_relu = tf.nn.relu(fc_bn)
print(fc_bn_relu) # Tensor("Relu:0", shape=(?, 1024), dtype=float32)
2D convolutional (CNN) layer case
# Input: 640x480 RGB images (whitened input, hence tf.float32).
INPUT_HEIGHT = 480
INPUT_WIDTH = 640
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: wights only, no bias as per [Ioffe & Szegedy 2015].
CNN_FILTER_HEIGHT = 3 # Space dimension.
CNN_FILTER_WIDTH = 3 # Space dimension.
CNN_FILTERS = 128
w = tf.Variable(tf.truncated_normal(
[CNN_FILTER_HEIGHT, CNN_FILTER_WIDTH, INPUT_CHANNELS, CNN_FILTERS],
dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN_LAYER_STRIDE_VERTICAL = 1
CNN_LAYER_STRIDE_HORIZONTAL = 1
CNN_LAYER_PADDING = 'SAME'
cnn = tf.nn.conv2d(
input=u, filter=w,
strides=[1, CNN_LAYER_STRIDE_VERTICAL, CNN_LAYER_STRIDE_HORIZONTAL, 1],
padding=CNN_LAYER_PADDING)
# Batch normalization.
cnn_bn = tf.contrib.layers.batch_norm(
cnn,
data_format='NHWC', # Matching the "cnn" tensor which has shape (?, 480, 640, 128).
center=True,
scale=True,
is_training=training,
scope='cnn-batch_norm')
# Activation function.
cnn_bn_relu = tf.nn.relu(cnn_bn)
print(cnn_bn_relu) # Tensor("Relu_1:0", shape=(?, 480, 640, 128), dtype=float32)
3D convolutional (CNN3D) layer case
# Input: sequence of 9 160x120 RGB images (whitened input, hence tf.float32).
INPUT_SEQ_LENGTH = 9
INPUT_HEIGHT = 120
INPUT_WIDTH = 160
INPUT_CHANNELS = 3
u = tf.placeholder(tf.float32, shape=(None, INPUT_SEQ_LENGTH, INPUT_HEIGHT, INPUT_WIDTH, INPUT_CHANNELS))
# CNN params: wights only, no bias as per [Ioffe & Szegedy 2015].
CNN3D_FILTER_LENGHT = 3 # Time dimension.
CNN3D_FILTER_HEIGHT = 3 # Space dimension.
CNN3D_FILTER_WIDTH = 3 # Space dimension.
CNN3D_FILTERS = 96
w = tf.Variable(tf.truncated_normal(
[CNN3D_FILTER_LENGHT, CNN3D_FILTER_HEIGHT, CNN3D_FILTER_WIDTH, INPUT_CHANNELS, CNN3D_FILTERS],
dtype=tf.float32, stddev=1e-1))
# Layer output with no activation function (yet).
CNN3D_LAYER_STRIDE_TEMPORAL = 1
CNN3D_LAYER_STRIDE_VERTICAL = 1
CNN3D_LAYER_STRIDE_HORIZONTAL = 1
CNN3D_LAYER_PADDING = 'SAME'
cnn3d = tf.nn.conv3d(
input=u, filter=w,
strides=[1, CNN3D_LAYER_STRIDE_TEMPORAL, CNN3D_LAYER_STRIDE_VERTICAL, CNN3D_LAYER_STRIDE_HORIZONTAL, 1],
padding=CNN3D_LAYER_PADDING)
# Batch normalization.
cnn3d_bn = tf.contrib.layers.batch_norm(
cnn3d,
data_format='NHWC', # Matching the "cnn" tensor which has shape (?, 9, 120, 160, 96).
center=True,
scale=True,
is_training=training,
scope='cnn3d-batch_norm')
# Activation function.
cnn3d_bn_relu = tf.nn.relu(cnn3d_bn)
print(cnn3d_bn_relu) # Tensor("Relu_2:0", shape=(?, 9, 120, 160, 96), dtype=float32)
What I would like to make sure is whether the code above exactly implements batch normalization as described in [Ioffe & Szegedy 2015] at the end of Sec. 3.2:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. [...] Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
UPDATE
I guess the code above is also correct for the 3D conv case. In fact, when I define my model if I print all the trainable variables, I also see the expected numbers of beta and gamma variables. For instance:
Tensor("conv3a/conv3d_weights/read:0", shape=(3, 3, 3, 128, 256), dtype=float32)
Tensor("BatchNorm_2/beta/read:0", shape=(256,), dtype=float32)
Tensor("BatchNorm_2/gamma/read:0", shape=(256,), dtype=float32)
This looks ok to me since due to BN, one pair of beta and gamma are learned for each feature map (256 in total).
[Ioffe & Szegedy 2015]: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

That is a great post about 3D batchnorm, it's often unnoticed that batchnorm can be applied to any tensor of rank greater than 1. Your code is correct, but I couldn't help but add a few important notes on this:
A "standard" 2D batchnorm (accepts a 4D tensor) can be significantly faster in tensorflow than 3D or higher, because it supports fused_batch_norm implementation, which applies one kernel operation:
Fused batch norm combines the multiple operations needed to do batch
normalization into a single kernel. Batch norm is an expensive process
that for some models makes up a large percentage of the operation
time. Using fused batch norm can result in a 12%-30% speedup.
There is an issue on GitHub to support 3D filters as well, but there hasn't been any recent activity and at this point the issue is closed unresolved.
Although the original paper prescribes using batchnorm before ReLU activation (and that's what you did in the code above), there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:
... I can guarantee that recent code written by Christian [Szegedy]
applies relu
before BN. It is still occasionally a topic of debate, though.
For anyone interested to apply the idea of normalization in practice, there's been recent research developments of this idea, namely weight normalization and layer normalization, which fix certain disadvantages of original batchnorm, for example they work better for LSTM and recurrent networks.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.