Does Keras BatchNormalization work correctly?

Not really sure whether I'm being stupid here, but shouldn't the values produced by BatchNormalization end up between -1 and 1? There are already a lot of discussions on Keras BatchNormalization, and I couldn't really find what I was looking for. I became suspicious one day and tried several test scenarios, but none of them produced what I was expecting. I even tried it on Google Colab to rule out version problems.
EDIT:
So, the question was rather stupid. However, I was mostly interested in the initial state, which is why I set the "lr" so low and ran only one epoch.
btw:
tf.__version__
>>> 2.4.1
Simple test case:
import tensorflow as tf
import numpy as np
# a = (np.arange(25, dtype=np.float32)/50).reshape(1, 5, 5, 1)
a = np.arange(25, dtype=np.float32).reshape(1, 5, 5, 1)
inputs = tf.keras.layers.Input(shape=[5,5,1])
initializer = tf.random_normal_initializer(1.0, 0.002)
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
model = tf.keras.Sequential()
model.add(inputs)
model.add(tf.keras.layers.Conv2D(1, 4, strides=2, padding='same', kernel_initializer=initializer, use_bias=False)) # not really necessary
model.add(tf.keras.layers.BatchNormalization(momentum=0.99, epsilon=0.001, center=True, scale=True))
model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.000000000001), loss=loss_fn, metrics=['accuracy'])  # learning rate deliberately tiny so the weights barely move
model.fit(a, a[:, 1:4, 1:4, :], epochs=1, batch_size=1)  # a 3x3 centre crop of the input serves as a dummy target
print(model(a), 0)
>>> tf.Tensor(
[[[[ 8.615232]
[14.495497]
[ 8.131738]]
[[26.24201 ]
[38.98827 ]
[20.710234]]
[[17.929565]
[25.93689 ]
[13.535995]]]], shape=(1, 3, 3, 1), dtype=float32) 0

Short answer: no!
You should not expect BatchNormalization to give values between -1 and 1. Even after the normalisation step you should not expect values strictly between -1 and 1, and after the gamma/beta step the values get inflated again, which is the kind of values you are seeing.
Two things happen inside the BatchNormalization layer:
1. Normalisation of the layer input using the mean and standard deviation: $\ddot z = (z - \mu) / \sigma$
2. Learning of two new parameters $\gamma$ and $\beta$: $z_\delta = \gamma \ddot z + \beta$
You can see that the BatchNorm layer has 4 parameters: 2 non-trainable (the moving mean and variance) and 2 trainable ($\gamma$ and $\beta$).
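Since the question was really about the initial state, here is a minimal sketch of my own (a freshly built layer, not the exact model from the question) that lists those four parameters and shows why the barely-trained model above still produces large values:

import tensorflow as tf

# gamma and beta are trainable; the moving mean and moving variance are not.
bn = tf.keras.layers.BatchNormalization()
bn.build((None, 1))  # build with a dummy feature dimension

for w in bn.weights:
    print(w.name, w.shape, "trainable" if w.trainable else "non-trainable")

# Defaults: gamma = 1, beta = 0, moving_mean = 0, moving_variance = 1,
# so right after initialisation the layer is close to an identity mapping
# in inference mode, which is why model(a) above still returns large values.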

Related

How to use 6x6 filter size in a 2D Convolutional Neural Network without causing negative dimensions?

I am doing classification with a 2D CNN. My data is composed of samples of shape (3788, 6, 1) (rows, columns, channels).
The 6 columns in each tensor represent X, Y and Z values of an accelerometer sensor and a gyroscope sensor, respectively. My aim is to predict which of 11 possible movements is performed in a sample, based on the sensor data.
From a logical standpoint, it would make sense to me to make the filter consider the X, Y and Z values of both sensors all together in each stride, since all 6 values combined are key to defining the movement.
This would leave me with a filter of size (?, 6). Since filters are usually square, I tried to use a filter of size (6, 6). This raises an error since the filters reduce the shape of my data to negative dimensions.
A filter size of (2, 2) works, but as I have just described, it does not make logical sense to me since it only considers two column values of two rows at a time, and thereby only considers a fraction of the entire movement at a given point in time.
Model:
from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.optimizers import Adam

model = keras.Sequential()
model.add(Conv2D(filters=32, kernel_size=(6, 6), strides=1, activation='relu', input_shape=(3788, 6, 1)))
model.add(MaxPooling2D(pool_size=(6, 6)))
model.add(Conv2D(filters=64, kernel_size=(6, 6), strides=1, activation='relu'))
model.add(Dropout(0.4))
model.add(Flatten())
model.add(Dense(11, activation='softmax'))
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), verbose=1)
Error:
Negative dimension size caused by subtracting 6 from 1 for '{{node max_pooling2d_7/MaxPool}} = MaxPool[T=DT_FLOAT, data_format="NHWC", ksize=[1, 6, 6, 1], padding="VALID", strides=[1, 6, 6, 1]](conv2d_13/Identity)' with input shapes: [?,3783,1,32].
I have 3 questions:
Is it even possible to use a 6x6 filter for my data? Changing filter sizes and/or leaving out the Pooling layer did not work.
I must admit that I still do not completely understand how the shape of my data changes from one layer to the next. Can you recommend an exemplary and explanatory resource for this?
Would a 1x6 filter also be an option, even though it is not square? This would ensure that one data point (composed of the X, Y and Z values of both sensors) is considered within each stride.
For the 6x6 convolutions, try "same" padding: add a padding='same' argument to both Conv2D layers.
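As a rough sketch, this is the model from the question with only that change applied (everything else as posted; X_train, y_train, X_test and y_test are assumed to exist):

from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.optimizers import Adam

model = keras.Sequential()
# padding='same' keeps the spatial size of each Conv2D output, so the
# width-6 axis is not shrunk below the pooling window by the convolutions
model.add(Conv2D(filters=32, kernel_size=(6, 6), strides=1, padding='same',
                 activation='relu', input_shape=(3788, 6, 1)))
model.add(MaxPooling2D(pool_size=(6, 6)))   # (3788, 6, 32) -> (631, 1, 32)
model.add(Conv2D(filters=64, kernel_size=(6, 6), strides=1, padding='same',
                 activation='relu'))        # stays (631, 1, 64)
model.add(Dropout(0.4))
model.add(Flatten())
model.add(Dense(11, activation='softmax'))
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()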
For your second question try simulating this behavior on this website https://madebyollin.github.io/convnet-calculator/.
Not so sure about your third question.

Padding text sequences for conv1D change the output of the convolution

Even though the new TensorFlow API (since 2.0) and Keras are practical because we do not need to pad every time (even with a single input) during inference, I found out that padding changes the results of a Conv1D layer, which does not accept a Masking layer. The results change even when using GlobalMaxPooling1D.
Here is a little code snippet to explain my point :
import tensorflow as tf
y = tf.random.normal((2, 7, 100))
# Conv1d and maxpooling
cnn = tf.keras.layers.Conv1D(32, 3, padding="same")
max_pool = tf.keras.layers.GlobalMaxPool1D()
m_post = max_pool(cnn(tf.keras.preprocessing.sequence.pad_sequences(y, maxlen=10, dtype="float32", value=tf.zeros(100), padding="post")))
m_pre = max_pool(cnn(tf.keras.preprocessing.sequence.pad_sequences(y, maxlen=10, dtype="float32", value=tf.zeros(100), padding="pre")))
m = max_pool(cnn(y))
print(m - m_post)
print(m - m_pre)
print(m_post - m_pre)
The really weird thing is that the output does not change across different padding lengths...
m_post = max_pool(cnn(tf.keras.preprocessing.sequence.pad_sequences(y, maxlen=10, dtype="float32", value=tf.zeros(100), padding="post")))
m_post_2 = max_pool(cnn(tf.keras.preprocessing.sequence.pad_sequences(y, maxlen=16, dtype="float32", value=tf.zeros(100), padding="post")))
print(m_post - m_post_2)
For production inference, the workaround is to pad by the same amount as I did during training, but that is not optimal because my texts have very different lengths.
Can someone explain this behaviour? Did I miss something here?

Unexpected behaviour of from_logits in BinaryCrossentropy?

I am playing with a naive U-net that I'm deploying on MNIST as a toy dataset.
I am seeing a strange behaviour in the way the from_logits argument works in tf.keras.losses.BinaryCrossentropy.
From what I understand, if activation='sigmoid' is used in the last layer of a neural network, then tf.keras.losses.BinaryCrossentropy must be used with from_logits=False. If instead activation=None, you need from_logits=True. Either should work in practice, although from_logits=True appears to be more numerically stable (e.g., Why does sigmoid & crossentropy of Keras/tensorflow have low precision?). This is not the case in the following example.
So, my unet goes as follows (the full code is at the end of this post):
def unet(input, init_depth, activation):
    # do stuff that defines layers
    # last layer is a 1x1 convolution
    output = tf.keras.layers.Conv2D(1, (1, 1), activation=activation)(previous_layer)  # shape = (28, 28, 1)
    return tf.keras.Model(input, output)
Now I define two models, one with the activation in the last layer:
input = Layers.Input((28, 28, 1))
model_withProbs = unet(input, 4, activation='sigmoid')
model_withProbs.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                        optimizer=tf.keras.optimizers.Adam())  # from_logits=False since the sigmoid is already present
and one without
model_withLogits = unet(input, 4, activation=None)
model_withLogits.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                         optimizer=tf.keras.optimizers.Adam())  # from_logits=True since there is no activation
If I'm right, they should have exactly the same behaviour.
Instead, the prediction for model_withLogits has pixel values up to 2500 or so (which is wrong), while for model_withProbs I get values between 0 and 1 (which is right). You can check out the figures I get here
I thought about the issue of stability (from_logits=True is more stable) but this problem appears even before training (see here). Moreover, the problem is exactly when I pass from_logits=True (that is, for model_withLogits) so I don't think stability is relevant.
Does anybody have any clue of why this is happening? Am I missing anything fundamental here?
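For reference, the equivalence I am assuming can be checked directly; this is just my own sanity-check sketch, where x and y stand for any batch of images and segmentation maps from the data below:

import tensorflow as tf

bce_probs  = tf.keras.losses.BinaryCrossentropy(from_logits=False)
bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)

logits = model_withLogits(x)   # raw, unbounded outputs of the no-activation model
probs  = tf.sigmoid(logits)    # mapped into (0, 1)

# The two losses agree (up to numerical precision) when the sigmoid
# is applied consistently:
print(bce_logits(y, logits).numpy())
print(bce_probs(y, probs).numpy())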
Post Scriptum: Codes
Re-purposing MNIST for segmentation.
I load MNIST (imports shown here for completeness):
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers as Layers

(x_train, labels_train), (x_test, labels_test) = tf.keras.datasets.mnist.load_data()
I am re-purposing MNIST for a segmentation task by setting all the non-zero values of x_train to one:
x_train = x_train/255 #normalisation
x_test = x_test/255
Y_train = np.zeros(x_train.shape) #create segmentation map
Y_train[x_train>0] = 1 #Y_train is zero everywhere but where the digit is drawn
Full unet network:
def unet(input, init_depth, activation):
    conv1 = Layers.Conv2D(init_depth, (2, 2), activation='relu', padding='same')(input)
    pool1 = Layers.MaxPool2D((2, 2))(conv1)
    drop1 = Layers.Dropout(0.2)(pool1)

    conv2 = Layers.Conv2D(init_depth*2, (2, 2), activation='relu', padding='same')(drop1)
    pool2 = Layers.MaxPool2D((2, 2))(conv2)
    drop2 = Layers.Dropout(0.2)(pool2)

    conv3 = Layers.Conv2D(init_depth*4, (2, 2), activation='relu', padding='same')(drop2)
    #pool3 = Layers.MaxPool2D((2,2))(conv3)
    #drop3 = Layers.Dropout(0.2)(conv3)

    # upsampling
    up1 = Layers.Conv2DTranspose(init_depth*2, (2, 2), strides=(2, 2))(conv3)
    up1 = Layers.concatenate([conv2, up1])
    conv4 = Layers.Conv2D(init_depth*2, (2, 2), padding='same')(up1)

    up2 = Layers.Conv2DTranspose(init_depth, (2, 2), strides=(2, 2), padding='same')(conv4)
    up2 = Layers.concatenate([conv1, up2])
    conv5 = Layers.Conv2D(init_depth, (2, 2), padding='same')(up2)

    last = Layers.Conv2D(1, (1, 1), activation=activation)(conv5)
    return tf.keras.Model(inputs=input, outputs=last)

improving xor neural network in tensorflow and using dense as an input layer

I am trying to build a neural network in tf as a beginner's challenge, and my model is not very good: many times it will not be very accurate (although sometimes accuracy is 1, most of the time it isn't, and even then the loss is high).
So I have two questions:
How can I improve this NN?
What is the difference between using Input as the input layer and using Dense?
Here is the code:
import tensorflow as tf
from tensorflow import keras
model = keras.Sequential()
model.add(tf.keras.Input(shape=(2,)))
#model.add(keras.layers.Dense(2))
model.add(keras.layers.Dense(4, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss=tf.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
# X_train, Y_train = ([[0, 0], [0, 1], [1, 0], [1, 1]], [[0], [1], [1], [0]])
X_train = tf.cast([[0, 0], [0, 1], [1, 0], [1, 1]], tf.float32)
Y_train = tf.cast([0, 1, 1, 0], tf.float32)
model.fit(X_train, Y_train, epochs=500, steps_per_epoch=1)
print(model.predict([[0, 1]]))
print(model.predict([[1, 1]]))
print(model.predict([[1, 0]]))
print(model.predict([[0, 0]]))
There are several obvious problems with the above code:
First off, the steps_per_epoch=1 parameter means that in each epoch your model will only see one example, which is very inefficient. Remove that parameter.
Next up, 500 epochs are not nearly enough. Without pretraining, neural networks take a long time to train, even on the simplest problems. I just ran your code and it converges to the optimal solution in about 3500 epochs.
Haven't tried it, but you can also try a higher learning rate, like this:
optimizer = keras.optimizers.Adam(lr=5e-2)  # for example
model.compile(optimizer=optimizer, ...)
Also, if you know how to use callbacks, you can always use the EarlyStopping callback so that training runs until the best model is found.
About early stopping:
If you use early stopping, you must also use a separate validation set. It's the validation set that tells you when it's the right time to stop training. Early stopping is one of the easiest and (in my opinion) one of the most useful regularization techniques in the field today.
So, if used with a validation set, there is no problem of stopping too early. The default parameters usually do the trick.
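A minimal sketch of that setup (the monitor and patience values are just illustrative assumptions, and a separate validation set (X_val, Y_val) is assumed to exist; for the 4-sample XOR toy problem you could monitor 'loss' instead):

from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss',
                                           patience=200,               # illustrative value
                                           restore_best_weights=True)  # roll back to the best model seen

hist = model.fit(X_train, Y_train,
                 validation_data=(X_val, Y_val),
                 epochs=5000,
                 callbacks=[early_stop])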
Also, if you train for many (50+) epochs, try plotting the history to gain insights.
Like this:
import matplotlib.pyplot as plt

hist = model.fit(...)
plt.plot(hist.history['loss'])
If the line is still bouncing around at the end, you probably need EarlyStopping or maybe even learning-rate decay.
Ask me if anything else seems vague.
Cheers.
You should try putting more neurons in your hidden layer. I tried with 64 and it worked fine.
model.add(keras.layers.Dense(64, activation='relu'))
The Input layer is configured to receive your initial data: it lets you specify the shape of the input and explicitly asks for it. When you use Dense, you configure how many neurons you want in that layer, and additionally you can choose its activation function.
Notice that you also use Dense to set up your output layer, where the number of neurons is the number of outputs you want to predict (one in this case).
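To make that distinction concrete, here is a small sketch of my own (not from the answer) of the two common ways to tell a Sequential model its input shape:

from tensorflow import keras

# Option 1: an explicit Input layer only declares the input shape;
# it has no weights and no activation of its own.
model_a = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(4, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])

# Option 2: the first Dense layer both declares the input shape and
# adds a layer of neurons with its own weights and activation.
model_b = keras.Sequential([
    keras.layers.Dense(4, activation='relu', input_shape=(2,)),
    keras.layers.Dense(1, activation='sigmoid'),
])

model_a.summary()
model_b.summary()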

Building/Training 1D CNN for sequence-InvalidArgument

I'm building and training a CNN for a sequence and have been using RNNs successfully, but am running into issues with the CNN.
Here's the code; cnn1 comes first (the more complex model). I tried getting a simpler one to fit as well and get errors on both:
The shapes are as follows:
xtrain (5206, 19, 4)
ytrain (5206, 4)
xvalid (651, 19, 4)
yvalid (651, 4)
xtest (651, 19, 4)
ytest (651, 4)
I've tried just about every combination of kernel sizes and nodes I can think of, tried 2 different model builds.
from tensorflow import keras

model_cnn1 = keras.models.Sequential()  # instantiation implied by the add() calls below
model_cnn1.add(keras.layers.Conv1D(32, 4, activation='relu'))
model_cnn1.add(keras.layers.MaxPooling1D(4))
model_cnn1.add(keras.layers.Conv1D(32, 4, activation='relu'))
model_cnn1.add(keras.layers.MaxPooling1D(4))
model_cnn1.add(keras.layers.Conv1D(32, 4, activation='relu'))
model_cnn1.add(keras.layers.Dense(4))

model_cnn2 = keras.models.Sequential([
    keras.layers.Conv1D(100, 4, input_shape=(19, 4), activation='relu'),
    keras.layers.MaxPooling1D(4),
    keras.layers.Dense(4)
])
model_cnn2.compile(loss='mse', optimizer='adam', metrics=['mse', 'accuracy'])
model_cnn2.fit(X_train_tf, y_train_tf, epochs=25)
The output shows epoch 1/25, which never completes. On cnn1 I then receive some variation of (final line):
ValueError: Negative dimension size caused by subtracting 4 from 1 for
'max_pooling1d_26/MaxPool' (op: 'MaxPool') with input shapes:
[?,1,1,32]
On cnn2 (the simpler one) I get this error (final line):
InvalidArgumentError: Incompatible shapes: [32,4,4] vs. [32,4]
[[{{node metrics_6/mse/SquaredDifference}}]]
[Op:__inference_keras_scratch_graph_6917]
In general, is there some rule I should be following here for kernels/nodes/etc? I always seem to get these errors on the shape.
I'm hoping after I build a model of each type I'll understand the ins and outs--no pun intended--but it's driving me crazy!
You can read in the docs for Conv1D and MaxPooling1D that these layers change the output shape depending on the value of strides (and padding). In your case you can keep the output shape of Conv1D unchanged by specifying padding='same'. MaxPooling1D changes the output shape by definition: with strides=4, the output is in fact 4 times shorter. I'd suggest carefully reading the docs to figure out exactly what happens, and learning about the underlying theory of CNNs to understand why.
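A small sketch of my own (random data, layer sizes borrowed from the question) that makes the shape arithmetic visible:

import tensorflow as tf

x = tf.random.normal((32, 19, 4))  # (batch, steps, channels), as in the question

# 'valid' (the default) shrinks the length by kernel_size - 1; 'same' keeps it.
conv_valid = tf.keras.layers.Conv1D(32, 4)                  # 19 -> 16
conv_same  = tf.keras.layers.Conv1D(32, 4, padding='same')  # 19 -> 19
pool       = tf.keras.layers.MaxPooling1D(4)                # divides the length by 4

print(conv_valid(x).shape)        # (32, 16, 32)
print(pool(conv_valid(x)).shape)  # (32, 4, 32)
print(conv_same(x).shape)         # (32, 19, 32)
print(pool(conv_same(x)).shape)   # (32, 4, 32)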
