Related
I'm trying to train a siamese CNN model for deepfake detection. The model takes pairs of 3D images. Each pair consists of a 3D face image and a 3D background block image. The 3D image is formed by stacking images across multiple frames where the number of stacked frames is referred to as depth. Each 3D image has the shape (height, width, depth, channels=3) and dtype=float32. The images are normalized to have values between 0.0 and 1.0. I use a custom data generator to generate batches of data. The data generator is tested and provides the data as expected. Each generated batch consists of a tuple of ([input1, input2], output) where each input has shape (batch_size, img_height, img_width, img_depth, 3) and each output has shape (batch_size, 1) because it's a binary classification problem and the output is either 0 (for real) or 1 (for fake).
The assumption is for real videos, the noise patterns are the same for both the face and the background. However, for fake videos the face is manipulated while the background isn't. So the noise patterns will be different. The objective is to train a siamese network to extract features related to noise patterns from the face and background and calculate a distance score (between 0 and 1). For real videos the distance score should be close to 0. And for fake videos the distance should be close to 1.
My base model is a 3D CNN model with the following architecture:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Input, Lambda, Conv3D, MaxPool3D, BatchNormalization,
                                     Activation, Flatten, Dropout, Dense)
from tensorflow.keras import backend as K

def create_3D_CNN(input_shape):
    model = Sequential()
    # 1
    model.add(Conv3D(8, kernel_size=(3, 3, 3), padding='same', input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool3D((2, 2, 1), strides=(2, 2, 1), padding='same'))
    # 2
    model.add(Conv3D(16, kernel_size=(3, 3, 3), padding='same'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
    # 3
    model.add(Conv3D(32, kernel_size=(3, 3, 3), padding='same'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
    # 4
    model.add(Conv3D(64, kernel_size=(3, 3, 3), padding='same'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
    # 5
    model.add(Conv3D(128, kernel_size=(3, 3, 3), padding='same'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPool3D((2, 2, 2), strides=(2, 2, 2), padding='same'))
    # final
    model.add(Flatten())
    model.add(Dropout(0.5))
    model.add(Dense(1024))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    return model
It extracts 1024 features from each input sample. Because it's a siamese network, there are 2 inputs which are processed in parallel (face and background) by the same 3D CNN using shared weights. Each input is converted to a vector of 1024 features. Assume the two vectors are called a and b.
The two vectors are fed to an output layer which calculates a distance score between 0 and 1 based on the Manhattan distance. The equation of the output layer is as follows:
predicted_output = 1 - exp(-sum|a_i - b_i|)
When a and b are similar, the predicted_output will be close to 0. When a and b are very different, the predicted_output will be close to 1.
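For intuition, a quick numeric check of this formula (plain NumPy, separate from the model code below):

import numpy as np

def manhattan_score(a, b):
    # predicted_output = 1 - exp(-sum|a_i - b_i|)
    return 1.0 - np.exp(-np.sum(np.abs(a - b)))

a = np.array([0.2, 0.5, 0.1])
print(manhattan_score(a, a))                            # identical vectors -> 0.0
print(manhattan_score(a, np.array([5.0, -3.0, 9.0])))   # very different vectors -> ~1.0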
Here is the code for the rest of the model:
def manhatten_distance(vects):
    x, y = vects
    return 1 - K.exp(-K.sum(K.abs(x - y), axis=1, keepdims=True))

def man_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)
input_shape = train_gen[0][0][0][0].shape
print('input_shape =', input_shape)
#-------------- defining layers using functional api ----------------------
# inputs
input_face = Input(shape=input_shape)
input_back = Input(shape=input_shape)
# feature extraction (shared weights)
CNN_3D = create_3D_CNN(input_shape)
features_face = CNN_3D(input_face)
features_back = CNN_3D(input_back)
# distance between the 2 feature vectors
distance = Lambda(manhatten_distance, output_shape=man_dist_output_shape)([features_face, features_back])
#---------------- creating final model ---------------------------
model = Model(inputs=[input_face, input_back], outputs=distance)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
However, when I train the model, the accuracy is always close to 50% and the loss (binary_crossentropy) doesn't decrease below 7.8. When I try to make predictions after training I always get the same value.
What could be the reason for this?
Update:
To narrow down the issue, I think the problem has to do with the Manhattan distance layer.
I replaced it with a "concatenate" layer and added another layer for output with 1 neuron and sigmoid activation and the model seems to be learning.
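For reference, here is roughly what that replacement head looks like (a sketch of what I described, not the original distance-based model):

from tensorflow.keras.layers import Concatenate, Dense

# instead of the Manhattan distance Lambda layer:
merged = Concatenate()([features_face, features_back])
output = Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[input_face, input_back], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])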
Why would the Manhattan distance layer prevent the model from learning?
I got this error message when declaring the input layer in Keras.
ValueError: Negative dimension size caused by subtracting 3 from 1 for
'conv2d_2/convolution' (op: 'Conv2D') with input shapes: [?,1,28,28],
[3,3,28,32].
My code is like this
model.add(Convolution2D(32, 3, 3, activation='relu', input_shape=(1,28,28)))
Sample application: https://github.com/IntellijSys/tensorflow/blob/master/Keras.ipynb
By default, Convolution2D (https://keras.io/layers/convolutional/) expects the input to be in the format (samples, rows, cols, channels), which is "channels-last". Your data seems to be in the format (samples, channels, rows, cols). You should be able to fix this using the optional keyword data_format = 'channels_first' when declaring the Convolution2D layer.
model.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(1,28,28), data_format='channels_first'))
I had the same problem, however the solution provided in this thread did not help me.
In my case it was a different problem that caused this error:
Code
imageSize=32
classifier=Sequential()
classifier.add(Conv2D(64, (3, 3), input_shape = (imageSize, imageSize, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
classifier.add(Conv2D(64, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
classifier.add(Conv2D(64, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
classifier.add(Conv2D(64, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
classifier.add(Conv2D(64, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
classifier.add(Flatten())
Error
The image size is 32 by 32. After the first convolutional layer, we reduced it to 30 by 30. (If I understood convolution correctly)
Then the pooling layer halves it, so 15 by 15.
Then another convolutional layer reduces it to 13 by 13...
I hope you can see where this is going:
In the end, my feature map is so small that my pooling layer (or convolution layer) is too big to slide over it, and that causes the error.
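A quick way to sanity-check this arithmetic (just the size calculation, not the full model):

# 3x3 'valid' convolution removes 2 pixels per dimension; 2x2 pooling halves what is left
size = 32
for block in range(1, 6):
    size = size - 2      # Conv2D(64, (3, 3)) with default padding='valid'
    size = size // 2     # MaxPooling2D(pool_size=(2, 2))
    print(f"after block {block}: {size} x {size}")
# prints 15, 6, 2, 0, -1 -> by the 4th/5th block there is no room left for a 3x3 kernel,
# which is what triggers the "Negative dimension size" ValueError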
Solution
The easy solution to this error is to either make the input image size bigger or use fewer convolutional/pooling layers.
Keras is available with the following backend compatibility:
TensorFlow: by Google,
Theano: developed by the LISA lab,
CNTK: by Microsoft.
Whenever you see an error with shapes like [?,X,X,X], [X,Y,Z,X], it is a channel-ordering issue. To fix it, set the image dimension ordering in Keras:
Import
from keras import backend as K
K.set_image_dim_ordering('th')
"tf" format means that the convolutional kernels will have the shape (rows, cols, input_depth, depth)
This will always work ...
You can instead preserve the spatial dimensions of the volume, so that the output volume size matches the input volume size, by setting the padding to "same":
use padding='same'
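For example, applied to the classifier from the earlier answer, that would look roughly like this (a sketch, not tested against that exact setup):

from tensorflow.keras.layers import Conv2D, MaxPooling2D

# padding='same' keeps the 32x32 spatial size through each convolution;
# only the pooling layers halve it, so the feature map shrinks much more slowly
classifier.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))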
Use the following:
from keras import backend
backend.set_image_data_format('channels_last')
Depending on your preference, you can use 'channels_first' or 'channels_last' to set the image data format. (Source)
If this does not work and your image size is small, try reducing the architecture of your CNN, as previous posters mentioned.
Hope it helps!
from tensorflow.keras import backend, layers, models

# define the model as a class
class LeNet:
    '''
    In a sequential model, we stack layers sequentially.
    So, each layer has a unique input and output, and those inputs and outputs
    then also come with a unique input shape and output shape.
    '''
    @staticmethod  # can be called on the class without creating an instance
    def init(numChannels, imgRows, imgCols, numClasses, weightsPath=None):
        # if we are using channels_first we have to update the input shape
        if backend.image_data_format() == "channels_first":
            inputShape = (numChannels, imgRows, imgCols)
        else:
            inputShape = (imgRows, imgCols, numChannels)

        # initialize the model
        model = models.Sequential()

        # Define the first set of CONV => ACTIVATION => POOL layers
        model.add(layers.Conv2D(filters=6, kernel_size=(5, 5), strides=(1, 1),
                                padding="valid", activation='relu',
                                kernel_initializer='he_uniform',
                                input_shape=inputShape))
        model.add(layers.AveragePooling2D(pool_size=(2, 2), strides=(2, 2)))

        # ... (the remaining LeNet layers are omitted in the original post)
        return model
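A usage sketch, assuming the method is kept as init and an MNIST-sized grayscale input:

model = LeNet.init(numChannels=1, imgRows=28, imgCols=28, numClasses=10)
model.summary()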
I hope it would help :)
See code : Fashion_Mnist_Using_LeNet_CNN
I am trying to apply one-shot learning for face-recognition.
I have several pictures of different people in my dataset directory and want to train my model, but the problem is I can't figure out how to provide anchor-positive and anchor-negative pairs from the dataset directory.
I have built a custom ConvNet model and defined the triplet loss (as described in the deeplearning.ai course).
My model
from tensorflow.keras import backend, layers, models

model = models.Sequential()
model.add(layers.Conv2D(16, (3, 3), (3, 3), activation='relu', input_shape=(384, 384, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.BatchNormalization())

for t in range(2):
    model.add(layers.Conv2D(32, (1, 1), (1, 1), activation='relu'))
    model.add(layers.Conv2D(32, (3, 3), (1, 1), padding='same', activation='relu'))
    model.add(layers.Conv2D(64, (1, 1), (1, 1), activation='relu'))
    model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))

for t in range(3):
    model.add(layers.Conv2D(64, (1, 1), (1, 1), activation='relu'))
    model.add(layers.Conv2D(64, (3, 3), (1, 1), padding='same', activation='relu'))
    model.add(layers.Conv2D(128, (1, 1), (1, 1), activation='relu'))
    model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))

for t in range(4):
    model.add(layers.Conv2D(128, (1, 1), (1, 1), activation='relu'))
    model.add(layers.Conv2D(128, (3, 3), (1, 1), padding='same', activation='relu'))
    model.add(layers.Conv2D(256, (1, 1), (1, 1), activation='relu'))
    model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))

for t in range(3):
    model.add(layers.Conv2D(256, (1, 1), (1, 1), activation='relu'))
    model.add(layers.Conv2D(256, (3, 3), (1, 1), padding='same', activation='relu'))
    model.add(layers.Conv2D(512, (1, 1), (1, 1), activation='relu'))
    model.add(layers.BatchNormalization())
model.add(layers.AveragePooling2D((4, 4)))

model.add(layers.Flatten())
model.add(layers.Dense(128))
model.add(layers.Lambda(lambda x: backend.l2_normalize(x, axis=1)))
Triplet_loss
import tensorflow as tf

def triplet_loss(y_true, y_pred, alpha=0.3):
    """
    Implementation of the triplet loss as defined by formula (3)

    Arguments:
    y_pred -- python list containing three objects:
        anchor -- the encodings for the anchor images, of shape (None, 128)
        positive -- the encodings for the positive images, of shape (None, 128)
        negative -- the encodings for the negative images, of shape (None, 128)

    Returns:
    loss -- real number, value of the loss
    """
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]

    # Step 1: Compute the (encoding) distance between the anchor and the positive, you will need to sum over axis=-1
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), axis=-1)
    # Step 2: Compute the (encoding) distance between the anchor and the negative, you will need to sum over axis=-1
    neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), axis=-1)
    # Step 3: subtract the two previous distances and add alpha.
    basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)
    # Step 4: Take the maximum of basic_loss and 0.0. Sum over the training examples.
    loss = tf.reduce_sum(tf.maximum(basic_loss, 0.0))

    return loss
Model Compilation
model.compile(optimizer='adam',loss='triplet_loss',metrics=['accuracy'])
Please help me with making anchor-positive and anchor-negative pairs for training. I have no idea how to handle the dataset directory in this regard!
Finding triplets to train a Siamese neural network with the triplet loss function can be done in several ways. The original FaceNet paper describes the importance of hard triplets: hard positives, i.e. positives such that argmax ||f(anchor) - f(positive)||^2, and hard negatives, i.e. negatives such that argmin ||f(anchor) - f(negative)||^2, where f is the embedding produced by the neural network.
However, in one of my Siamese networks, I selected (anchor, positive, negative) triplets randomly and it turned out to give good classification accuracy. So you could try random triplet selection first, as hard-triplet selection is generally computationally expensive and requires a CPU cluster.
I hope you have labelled all the images in the dataset, and the labels should reflect which person a particular image refers to. For example, if you have 5 images of person A, the file names could look like (A_1.jpg, A_2.jpg, ..., A_5.jpg), or you could have a separate directory for each person. You could randomly select an image from one directory as the anchor, another image from the same directory as the positive, and an image from a different directory as the negative. Bundle these images in triplet format (anchor, positive, negative) and repeat the process to create a batch. And there you have a training batch of images.
I have just covered the basic procedure; however, if you're looking for example code, this tutorial may help you create batches of triplets to train the network.
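As a rough illustration of that procedure, here is a minimal sketch assuming a directory per person (the directory layout, image size, and helper names are assumptions, not from the question):

import os
import random
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def load_gray(path, size=(384, 384)):
    # load as grayscale and scale to [0, 1] to match the (384, 384, 1) model input
    img = load_img(path, color_mode='grayscale', target_size=size)
    return img_to_array(img) / 255.0

def random_triplet_batch(dataset_dir, batch_size=16):
    people = [d for d in os.listdir(dataset_dir)
              if os.path.isdir(os.path.join(dataset_dir, d))]
    anchors, positives, negatives = [], [], []
    for _ in range(batch_size):
        pos_person = random.choice(people)
        neg_person = random.choice([p for p in people if p != pos_person])
        pos_dir = os.path.join(dataset_dir, pos_person)
        neg_dir = os.path.join(dataset_dir, neg_person)
        # assumes each person has at least 2 images
        anchor_file, positive_file = random.sample(os.listdir(pos_dir), 2)
        negative_file = random.choice(os.listdir(neg_dir))
        anchors.append(load_gray(os.path.join(pos_dir, anchor_file)))
        positives.append(load_gray(os.path.join(pos_dir, positive_file)))
        negatives.append(load_gray(os.path.join(neg_dir, negative_file)))
    return np.array(anchors), np.array(positives), np.array(negatives)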
Based on the discussion in the comments, just modify the triplet loss function that you have in the question in the following way:
def triplet_loss(anchor, positive, negative, margin=0.3):
    """
    Implementation of the triplet loss as defined by formula (3)

    Arguments:
    anchor -- A batch of anchor embeddings (batch_size, embedding_size)
    positive -- A batch of positive embeddings (batch_size, embedding_size)
    negative -- A batch of negative embeddings (batch_size, embedding_size)
    margin -- The contrastive margin

    Returns:
    loss -- real number, value of the loss
    """
    # Step 1: Compute the (encoding) distance between the anchor and the positive, you will need to sum over axis=-1
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), axis=-1)
    # Step 2: Compute the (encoding) distance between the anchor and the negative, you will need to sum over axis=-1
    neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), axis=-1)
    # Step 3: subtract the two previous distances and add the margin.
    basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), margin)
    # Step 4: Take the maximum of basic_loss and 0.0. Sum over the training examples.
    loss = tf.reduce_sum(tf.maximum(basic_loss, 0.0))

    return loss
The real issue with computing the triplet loss is coming up with the triplets, or mining them. However, that is already done for you, as you mentioned in the comments discussion.
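For completeness, a minimal sketch of how such a loss could be used with the embedding model from the question (a hand-rolled training step; the variable names are assumptions, and random_triplet_batch is the helper sketched in the earlier answer):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(anchor_batch, positive_batch, negative_batch):
    with tf.GradientTape() as tape:
        # one forward pass per branch, all through the same (shared-weight) model
        anchor_emb = model(anchor_batch, training=True)
        positive_emb = model(positive_batch, training=True)
        negative_emb = model(negative_batch, training=True)
        loss = triplet_loss(anchor_emb, positive_emb, negative_emb, margin=0.3)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# e.g. inside a training loop:
# a, p, n = random_triplet_batch('dataset/', batch_size=16)
# loss = train_step(a, p, n)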
I am trying to color the bird images from the CIFAR-10 dataset.
Problem set-up:
X: (5000,32,32,1) where each entry is a grayscale version of the bird images
Y: (5000,4096), which is a one-hot encoded array. For example, the first pixel will have [0,0,1,0], where the 1 indicates which color to use.
Y is simply the collapsed (flattened) version of all the per-pixel one-hot encodings for each image.
I've followed many articles that implement coloring of grayscale images, but my loss stays high and my accuracy stays low.
model = Sequential()
model.add(Convolution2D(32, (5, 5), strides=(1,1), input_shape=(32,32,1),padding='same', activation='relu'))
model.add(Dropout(0.2))
model.add(Convolution2D(32, (5, 5),activation='relu', padding='same' ))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Convolution2D(64, (5, 5), activation='relu', padding='same' ))
model.add(Flatten())
model.add(Dense(128))
model.add(Dense(4096, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(Xtrain, Ytrain, validation_data=(val_data,Ytest),epochs=5, batch_size=32)
I'm expecting the accuracy to improve as training progresses through the epochs, but it keeps getting worse.
You'll have to put some work into architecture (which it sounds like you've been thinking about), but to simply black-box it, you can pump in the gray images and draw out the color images. Why not?
Use model.summary() to make sure your shapes are to your liking. (See below)
I haven't tested this code, but it should be pretty close...
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (InputLayer, Conv2D, SpatialDropout2D, MaxPool2D,
                                     Dense, UpSampling2D, Activation)

model = Sequential()
model.add(InputLayer(input_shape=(32,32,1)))
model.add(Conv2D(32,(5,5),strides=(1,1), activation='relu', padding='same'))
model.add(SpatialDropout2D(rate=0.2)) # holla at this layer
model.add(Conv2D(32,(5,5), activation='relu', padding='same'))
model.add(MaxPool2D((2,2)))
model.add(Conv2D(64,(5,5),activation='relu',padding='same'))
model.add(Dense(128))
# have to upsample to get your height/width back from max pooling!
model.add(UpSampling2D((2,2)))
model.add(Conv2D(3,(2,2),activation='relu',padding='same'))
model.add(Activation('softmax'))
model.compile(optimizer='adam',loss='mse')
model.summary()
Here's the output of model.summary(). The output layer is (32, 32, 3): 32 height, 32 width, 3 channels.
[image: model.summary() output]
Now just train it with grayscales as X, and the color originals as Y. And post results, for the curious!
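A sketch of that training call, with assumed variable names for the grayscale inputs and color targets:

# X_gray: (N, 32, 32, 1) grayscale inputs; Y_color: (N, 32, 32, 3) color originals scaled to [0, 1]
model.fit(X_gray, Y_color, validation_split=0.1, epochs=5, batch_size=32)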
I'm a student in acoustics and really new at deep learning. My goal is to get a good understanding in how a CNN exactly works. There is one part that I don't understand. I can't find any precise information about that.
My model is something like this:
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', activation='relu', input_shape = input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(48, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(ndim, activation='relu', use_bias=True, batch_size=batchSize, kernel_initializer='glorot_uniform', kernel_regularizer=None))
model.add(Dense(nclasses, activation='softmax', kernel_regularizer=l2(1e-2)))
model.compile(loss='categorical_crossentropy', optimizer=opt)
It works, that's not the problem. I know that the input of the second conv layer consists of 32 feature maps (the output of the first pooling layer).
What exactly is each kernel of the second conv layer convolved with?
Thank you for your time and help!
To my knowledge, if your input image is M*N, then the output of the first Conv2D is M*N with depth 32, that is M*N*32, and (M/2)*(N/2)*32 after the first max pooling. So the input to the second Conv2D is an (M/2)*(N/2)*32 tensor, which the second Conv2D maps to an (M/2)*(N/2)*48 tensor (48 filters in your second conv layer).
To make the convolution concrete: I use TensorFlow for deep learning, and the statements below should also help you understand CNNs.
Say the input image has size [M, N, 3], with 3 being the image depth (RGB for example). With this input, the first convolution kernel has shape [3, 3, 3, 32]: the first two 3s are the convolution window size, the third 3 is the input depth, and 32 is the output depth, as in your example. A second convolution would then have a kernel of shape [3, 3, 32, 64], where the third number, 32, must equal the output depth of the first conv.
From this we can see that each 3*3 convolution window extends across every input channel: in your example, each filter of the second conv layer has 3*3*32 weights, which it convolves with the (M/2)*(N/2)*32 output of the first conv.
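If it helps to see this concretely, you can check the kernel shape and parameter count of the second conv layer in Keras (the input shape here is just an assumption):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout

# same first two blocks as in the question, with an assumed input shape
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(64, 64, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(48, (3, 3), padding='same', activation='relu'))

second_conv = model.layers[-1]
print(second_conv.kernel.shape)     # (3, 3, 32, 48): each kernel spans all 32 input maps
print(second_conv.count_params())   # 3*3*32*48 + 48 = 13872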
I hope this is what you wanted and that I have stated it clearly.