Tensorflow siamese CNN model always predicts the same class - python

I'm trying to train a siamese CNN model for deepfake detection. The model takes pairs of 3D images. Each pair consists of a 3D face image and a 3D background block image. The 3D image is formed by stacking images across multiple frames where the number of stacked frames is referred to as depth. Each 3D image has the shape (height, width, depth, channels=3) and dtype=float32. The images are normalized to have values between 0.0 and 1.0. I use a custom data generator to generate batches of data. The data generator is tested and provides the data as expected. Each generated batch consists of a tuple of ([input1, input2], output) where each input has shape (batch_size, img_height, img_width, img_depth, 3) and each output has shape (batch_size, 1) because it's a binary classification problem and the output is either 0 (for real) or 1 (for fake).
The assumption is for real videos, the noise patterns are the same for both the face and the background. However, for fake videos the face is manipulated while the background isn't. So the noise patterns will be different. The objective is to train a siamese network to extract features related to noise patterns from the face and background and calculate a distance score (between 0 and 1). For real videos the distance score should be close to 0. And for fake videos the distance should be close to 1.
My base model is a 3D CNN model with the following architecture:
def create_3D_CNN(input_shape):
model = Sequential()
# 1
model.add(Conv3D(8, kernel_size=(3, 3, 3), padding='same', input_shape=input_shape))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 1), strides=(2, 2, 1), padding='same'))
# 2
model.add(Conv3D(16, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# 3
model.add(Conv3D(32, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# 4
model.add(Conv3D(64, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# 5
model.add(Conv3D(128, kernel_size=(3, 3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
# final
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(1024))
model.add(BatchNormalization())
model.add(Activation('relu'))
return model
It extracts 1024 features from each input sample. Because it's a siamese network, there are 2 inputs which are processed in parallel (face and background) by the same 3D CNN using shared weights. Each input is converted to a vector of 1024 features. Assume the two vectors are called a and b.
The two vectors are fed to an output layer which calculates a distance value score between 0 and 1 based on the Manhattan distance. The equation of the output layer is as follows:
predicted_output = 1 - exp(-sum|a_i - b_i|)
When a and b are smilar, the predicted_output will be close to 0. When a and b are very different, the predicted_output will be close to 1.
Here is the code for the rest of the model:
def manhatten_distance(vects):
x, y = vects
return 1 - K.exp(-K.sum(K.abs(x - y), axis=1, keepdims=True))
def man_dist_output_shape(shapes):
shape1, shape2 = shapes
return (shape1[0], 1)
input_shape = train_gen[0][0][0][0].shape
print('input_shape =', input_shape)
#-------------- defining layers using functional api ----------------------
# inputs
input_face = Input(shape=input_shape)
input_back = Input(shape=input_shape)
# feature extraction (shared weights)
CNN_3D = create_3D_CNN(input_shape)
features_face = CNN_3D(input_face)
features_back = CNN_3D(input_back)
# distance between the 2 feature vectors
distance = Lambda(manhatten_distance, output_shape=man_dist_output_shape)([features_face, features_back])
#---------------- creating final model ---------------------------
model = Model(inputs=[input_face, input_back], outputs=distance)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
However, when I train the model, the accuracy is always close to 50% and the loss (binary_crossentropy) doesn't decrease below 7.8. When I try to make predictions after training I always get the same value.
What could be the reason for this?
Update:
To narrow down the issue I think the problem has to do with the Manhattan distance layer.
I replaced it with a "concatenate" layer and added another layer for output with 1 neuron and sigmoid activation and the model seems to be learning.
Why would the Manhattan distance layer prevent the model from learning?

Related

How to add scalar value into latent variables in auto-encoder with keras

I've tried to input a specific scalar value(0 to 1; each image has) into a layer of latent variables.
How to insert the value in CNN based auto-encoder sequential model?
def encoder():
model = tf.keras.Sequential()
model.add(Conv2D(12, (2,2), activation='relu', padding='same', input_shape=(32, 32, 1)))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Flatten())
model.add(Dense(64))
return model
# maybe some code here
def decoder():
model = tf.keras.Sequential()
model.add(Dense(16*16*12, input_shape=(65,)))
model.add(Reshape((16,16,12)))
model.add(Conv2D(32, (2, 2), activation='relu', padding='same'))
model.add(UpSampling2D((2, 2)))
model.add(Dropout(0.5))
model.add(Conv2D(1, (2, 2), padding='same'))
return model
The number of latent variables is 64 in encoder, therefore, a scalar variable should make 65 latent variables in decoder.
Concatenate layer could be applied?
You can use a Concatenate layer for that.
For example:
x1 = tf.keras.layers.Dense(8)(np.arange(10).reshape(5, 2))
x2 = tf.keras.layers.Dense(8)(np.arange(10, 20).reshape(5, 2))
concatted = tf.keras.layers.Concatenate()([x1, x2])
But you will have to choose the correct shape for that constant you are trying to add.
In your case, you could use:
...
outputs = tf.keras.layers.Concatenate()([model, your_constant])
model = tf.keras.Model(inputs=model.inputs, outputs=outputs)
return model
...
Does this solution work for you?

Python: How to solve the low accuracy of a Variational Autoencoder Convolutional Model developed to predict a sequence of future frames?

I am currently developing a precipitation cell displacement prediction model. I have taken as a model to implement a variational convolutional autoencoder (I attach the model code). In summary, the model receives a sequence of 5 images, and must predict the following 5. The architecture consists of five convolutive layers in the encoder and decoder (Conv Transpose), which were made to greatly reduce the image size and learn spatial details. Between the encoder and decoder carries ConvLSTM layers to learn the temporal sequences. I am working it in Python, with tensorflow and keras.
The data consists of "images" of the rain radar of 400x400 pixels, with the dark background and the rain cells in the center of the frame. The time between frame and frame is 5 minutes, radar configuration. After further processing, the training data is scaled between 0 and 1, and in numpy format (for working with matrices). My training data finally has the form [number of sequences, number of images per sequence, height, width, channel = 1].
Sequence of precipitation Images
The sequences are made up of: 5 inputs and 5 targets, of which there are 2111 radar image sequences (I know I don't have much data :( for training) and 80% have been taken for training and 20% for the validation.
To detail:
train_input = [1688, 5, 400, 400, 1]
train_target = [1688, 5, 400, 400, 1]
valid_input = [423, 5, 400, 400, 1]
valid_target = [423, 5, 400, 400, 1]
The problem is that I have trained my model, and I have obtained the value of accuracy very poor. around 8e-05. I've been training 400 epochs, and the value remains or surrounds the mentioned value. Also when I take a sequence of 5 images to predict the next 5, I get very bad results (not even a small formation of "spots" in the center, which represents the rain). I have already tried to reduce the number of layers in the encoder and decoder, in addition to the optimizer [adam, nadam, adadelta], I have also tried using the activation function [relu, elu]. I have not obtained any profitable results, in the prediction images and the accuracy value.
Loss and Accuracy during Training
I am a beginner in Deep Learning topics, I like it a lot, but I can't find a solution to this problem. I suspect that my model architecture is not right. In addition to that I should look for a better optimizer or activation function to improve the accuracy value and predicted images. As a last solution, perhaps cut the image of 400x400 pixels to a central area, where the precipitation is. Although I would lose training data.
I appreciate you can help me solve this problem, maybe giving me some ideas to organize my architecture model, or ideas to organize de train data.
Best regards.
# Encoder
seq = Sequential()
seq.add(Input(shape=(5, 400, 400,1)))
seq.add(Conv3D(filters=32, kernel_size=(11, 11, 5), strides=3,
padding='same', activation ='relu'))
seq.add(BatchNormalization())
seq.add(Conv3D(filters=32, kernel_size=(9, 9, 32), strides=2,
padding='same', activation ='relu'))
seq.add(BatchNormalization())
seq.add(Conv3D(filters=64, kernel_size=(7, 7, 32), strides=2,
padding='same', activation ='relu'))
seq.add(BatchNormalization())
seq.add(Conv3D(filters=64, kernel_size=(5, 5, 64), strides=2,
padding='same', activation ='relu'))
seq.add(BatchNormalization())
seq.add(Conv3D(filters=32, kernel_size=(3, 3, 64), strides=3,
padding='same', activation ='relu'))
seq.add(BatchNormalization())
# ConvLSTM Layers
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
input_shape=(None, 6, 6, 32),
padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
padding='same', return_sequences=True))
seq.add(BatchNormalization())
seq.add(Conv3D(filters=32, kernel_size=(3, 3, 3),
activation='relu',
padding='same', data_format='channels_last'))
# Decoder
seq.add(Conv3DTranspose(filters=32, kernel_size=(3, 3, 64), strides=(2,3,3),
input_shape=(1, 6, 6, 32),
padding='same', activation ='relu'))
seq.add(BatchNormalization())
seq.add(Conv3DTranspose(filters=64, kernel_size=(5, 5, 64), strides=(3,2,2),
padding='same', activation ='relu'))
seq.add(BatchNormalization())
seq.add(Conv3DTranspose(filters=64, kernel_size=(7, 7, 32), strides=(1,2,2),
padding='same', activation ='relu'))
seq.add(BatchNormalization())
seq.add(Conv3DTranspose(filters=32, kernel_size=(9, 9, 32), strides=(1,2,2),
padding='same', activation ='relu'))
seq.add(BatchNormalization())
seq.add(Conv3DTranspose(filters=1, kernel_size=(11, 11, 5), strides=(1,3,3),
padding='same', activation ='relu'))
seq.add(BatchNormalization())
seq.add(Cropping3D(cropping = (0,16,16)))
seq.add(Cropping3D(cropping = ((0,-5),(0,0),(0,0))))
seq.compile(loss='mean_squared_error', optimizer='adadelta', metrics=['accuracy'])
The metric you would like to use in case of a regression problem is mse(mean_squared_error) or mae (mean_absolute_error).
You may want to use mse in the beginning as it penalises greater errors more than the mae.
You just need to change a little bit the code where you compile your model.
seq.compile(loss='mean_squared_error', optimizer='adadelta', metrics=['mse','mae'])
In this way you can monitor both mse and mae metric during the training.

i have 10000 images in a vector form how do i convert it for my Convolution neural network?

I am new to Convolutional Neural Network. Instead of getting my data in image format i have been given flattened images matrix which is [10000x784].
Means 10000 images of size 28x28
Considering one image size is 28x28, how should i give the data matrix to my input for CNN?
My model is:
model = models.Sequential()
model.add(layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28,28,1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
#model.add(layers.Flatten())
model.add(layers.Dense(2500, activation='relu'))
model.add(layers.Dense(2500, activation='relu'))
model.add(layers.Dense(1, activation='relu'))
model.compile(optimizer='adam',
loss='mean_squared_error',
metrics=['mae','mse'])
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=15)
#Fits model
history= model.fit(x_trained, y_train, epochs = 7000, validation_split = 0.2, shuffle= True, verbose = 1, callbacks=[callback])
I get error at model.fit.
P.S: I am doing regression and for every image i have one value as output
Begin with a Reshape layer:
model = models.Sequential()
model.add(layers.Reshape((28, 28, 1), input_shape=(784,)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
# ...

Managing Reshape correctly to build RNN from a previous CNN

I am currently trying to build a CRNN with Keras. When I try to reshape the input size, I had some trouble finding the correct dimension for my LSTM. After some debugging, I found a field in my model object called output_shape whose value was (3,1,244) and I tried to pass it as a 2D array with (3,224). Everything worked fine, but did I do correctly? What is the math behind this and what can I do next time to discover this size without debugging?
def CRNN(blockSize, blockCount, inputShape, trainGen, testGen, epochs):
model = Sequential()
# Conv Layer
channels = 32
for i in range(blockCount):
for j in range(blockSize):
if (i, j) == (0, 0):
conv = Conv2D(channels, kernel_size=(5, 5),
input_shape=inputShape, padding='same')
else:
conv = Conv2D(channels, kernel_size=(5, 5), padding='same')
model.add(conv)
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.15))
if j == blockSize - 2:
channels += 32
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(0.15))
# Feature aggregation across time
model.add(Reshape((3, 224)))
# LSTM layer
model.add(Bidirectional(LSTM(200), merge_mode='ave'))
model.add(Dropout(0.5))
# Linear classifier
model.add(Dense(4, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adam(),
metrics=['accuracy']) # F1?
model.fit_generator(trainGen,
validation_data=testGen, steps_per_epoch = trainGen.x.size // 20,
validation_steps = testGen.x.size // 20,
epochs=epochs, verbose=1)
return model
# Function call
model = CRNN(4, 6, (140, 33, 1), trainGen, testGen, 1)

Wrong model prediction

I have a binary classification problem. I want to detect raindrops on the image. I trained a simple model, but my prediction is not good. I want to have a prediction between 0 and 1.
For my first try, i used relu for all layers accept the final(I used softmax). As the optimizer, i used binary_crossentropy and i changed it into the categorical_crossentropy. Both of them didn't work.
opt = Adam(lr=LEARNING_RATE, decay=LEARNING_RATE / EPOCHS)
cnNetwork.compile(loss='categorical_crossentropy',
optimizer=optimizers.RMSprop(lr=lr),
metrics=['accuracy'])
inputShape = (height, width, depth)
# if we are using "channels first", update the input shape
if K.image_data_format() == "channels_first":
inputShape = (depth, height, width)
# First layer is a convolution with 20 functions and a kernel size of 5x5 (2 neighbor pixels on each side)
model.add(Conv2D(20, (5, 5), padding="same",
input_shape=inputShape))
# our activation function is ReLU (Rectifier Linear Units)
model.add(Activation("relu"))
# second layer is maxpooling 2x2 that reduces our image resolution by half
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
# Third Layer - Convolution, twice the size of the first convoltion
model.add(Conv2D(40, (5, 5), padding="same"))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
# Fifth Layer is Full connected flattened layer that makes our 3D images into 1D arrays
model.add(Flatten())
model.add(Dense(500))
model.add(Activation("relu"))
# softmax classifier
model.add(Dense(classes))
model.add(Activation("softmax"))
I expect to get for ex .1 for the first class and .9 for the second. In the result, i get 1 , 1.3987518e-35. The main problem is that i always get 1 as a prediction.
You should be using binary_crossentropy and there is nothing wrong in the output you have got. The output 1 , 1.3987518e-35 means the probability of first class is almost 1 and probability of second class is very close to 0 (1e-35).

Categories