Can someone please explain the content loss function? - python

I am currently getting familiar with TensorFlow and machine learning. I am working through some tutorials on style transfer, and now there is a part of an example code that I somehow cannot comprehend.
I think I get the main idea: there are three images, the content image, the style image and the mixed image. Let's just talk about the content loss first, because if I can understand that, I will also understand the style loss. So I have the content image and the mixed image (starting from some distribution with some noise), and the VGG16 model.
As far as I can understand, I should now feed the content image into the network up to some layer, and see what the output (feature map) of that layer is for the content-image input.
After that, I should also feed the mixed image into the network up to the same layer, and see what the output (feature map) of that layer is for the mixed-image input.
I should then compute the loss from these two outputs, because I would like the mixed image to have a feature map similar to that of the content image.
My problem is that I do not understand how this is done in the example codes that I could find online.
The example code can be the following:
http://gcucurull.github.io/tensorflow/style-transfer/2016/08/18/neural-art-tf/
But nearly all of the examples used the same approach.
The content loss is defined like this:
def content_loss(cont_out, target_out, layer, content_weight):
    # content loss is just the mean square error between the outputs of a given layer
    # in the content image and the target image
    cont_loss = tf.reduce_sum(tf.square(tf.sub(target_out[layer], cont_out)))
    # multiply the loss by its weight
    cont_loss = tf.mul(cont_loss, content_weight, name="cont_loss")
    return cont_loss
And is called like this:
# compute loss
cont_cost = losses.content_loss(content_out, model, C_LAYER, content_weight)
Where content_out is the output for the content image, model is the model in use, C_LAYER is the reference to the layer whose output we want, and content_weight is the weight by which the loss is multiplied.
The problem is that I somehow cannot see where this feeds the network with the mixed image. It seems to me that "cont_loss" calculates the squared difference between the output for the content image and the layer itself.
The magic should be somewhere here:
cont_loss = tf.reduce_sum(tf.square(tf.sub(target_out[layer], cont_out)))
But I simply cannot see how this produces the RMS between the feature map of the content image and the feature map of the mixed image at the given layer.
I would be very thankful if someone could point out where I am wrong and explain to me, how that content loss is calculated.
Thanks!

The loss forces the network to have similar activations at the chosen layer for the content image and for the mixed image.
Let us call x one convolutional map/pixel from target_out[layer] and y the corresponding map from cont_out. You want their difference to be as small as possible, i.e., the absolute value of their difference, |x - y|. For the sake of numerical stability, we use the square instead of the absolute value, because it is a smooth function and more tolerant of small errors.
We thus get (x - y)^2, which is: tf.square(tf.sub(target_out[layer], cont_out)).
Finally, we want to minimize the difference for each map and each example in the batch. This is why we sum all the differences into a single scalar using tf.reduce_sum.
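To answer the original question about where the mixed image comes in: in setups like the linked one, cont_out holds the precomputed activations of the content image, while target_out is the network graph built on top of the trainable mixed-image variable, so target_out[layer] automatically tracks the current mixed image and no explicit feed is needed; the optimizer simply updates that variable. A rough sketch of this wiring (TF 1.x style; build_vgg, content_activations and the layer name are hypothetical stand-ins, not names from the linked code):

import numpy as np
import tensorflow as tf

# 1) activations of the chosen layer for the content image, computed once beforehand
content_activations = np.zeros([1, 28, 28, 512], np.float32)  # stand-in for the stored activations
cont_out = tf.constant(content_activations)

# 2) the same network built on a trainable image variable (the mixed image);
#    its layer outputs change whenever the mixed image changes
mixed_image = tf.Variable(tf.random_normal([1, 224, 224, 3]))
target_out = build_vgg(mixed_image)   # hypothetical helper returning a dict of layer tensors

# 3) the content loss compares the two; minimizing it updates only the mixed image
cont_loss = tf.reduce_sum(tf.square(target_out["conv4_2"] - cont_out))
train_step = tf.train.AdamOptimizer(2.0).minimize(cont_loss, var_list=[mixed_image])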

Related

Image Generation in Variational Autoencoder having a Binary Images Dataset

I'm implementing a VAE with a binary images dataset (pixels are black or white), where every pixel in an image has a meaning (it belongs to a class).
Searching online, I found that the best implementation is to use the Sigmoid as the last activation function and binary cross-entropy as the loss function; correct me if I'm wrong.
When I try to generate an image from the latent space, using random coordinates or coordinates obtained by encoding an input image, I may obtain blurry images, which is normal, but I want only 0 and 1 as values (because I want to know whether an element belongs to that class or not).
So my question is: are there standard procedures to get only binary images as output, or to train the model toward this result (maybe by changing the loss or something), or does the model have to be implemented this way, so that the only solution is to apply a threshold (0.5) to the pixels of the output images?
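For what it's worth, the fallback already described in the question - thresholding the sigmoid outputs at 0.5 - would look roughly like this NumPy sketch (decoder, z and the other names are hypothetical):

import numpy as np

def binarize(decoded, threshold=0.5):
    # map sigmoid outputs (probabilities in [0, 1]) to hard 0/1 class membership
    return (decoded >= threshold).astype(np.uint8)

# e.g. decoded = decoder.predict(z) for some latent vector z (hypothetical names)
# binary_image = binarize(decoded)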

How to train different size of image using cnn? [duplicate]

I am trying to train my model, which classifies images.
The problem I have is that the images have different sizes. How should I format my images and/or my model architecture?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights - and that's not possible.
If that is your problem, here are some things you can do:
Don't care about squashing the images. A network might learn to make sense of the content anyway; do scale and perspective even matter for your content?
Center-crop the images to a specific size. If you fear losing data, do multiple crops and use them to augment your input data, so that the original image is split into N different images of the correct size.
Pad the images with a solid color to a square size, then resize.
Do a combination of the above.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take care of the bigger work for you.
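For illustration, a minimal TF 1.x sketch of that helper (the 224x224 target size and the filename tensor are assumptions):

import tensorflow as tf

# decode one image, then center-crop or zero-pad it to a fixed size
image = tf.image.decode_jpeg(tf.read_file(filename), channels=3)  # filename: hypothetical string tensor
image = tf.image.resize_image_with_crop_or_pad(image, target_height=224, target_width=224)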
As for just not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there actually is a paper here called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.
Try making a spatial pyramid pooling layer and putting it after your last convolution layer, so that the FC layers always get constant-dimensional vectors as input. During training, train on images from the entire dataset using a particular image size for one epoch. Then, for the next epoch, switch to a different image size and continue training.
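A minimal sketch of such a layer (written in PyTorch purely for illustration; the pyramid levels are an assumption). Adaptive pooling to fixed grid sizes makes the output length independent of the input feature-map size, so the FC layers that follow always see the same dimensionality:

import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(k) for k in levels])

    def forward(self, x):  # x: [batch, channels, H, W], any H and W
        return torch.cat([pool(x).flatten(1) for pool in self.pools], dim=1)

# e.g. with 512 channels the output always has 512 * (1 + 4 + 16) features,
# regardless of the input image size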

Extracting features from caffe last layer

I am trying to extract features from the last layer of a GoogleNet caffe model fine-tuned on car classification. Here's the deploy.prototxt. I tried a couple of things:
I took features from 'loss3_classifier_model' layer which is incorrect.
Now I am extracting features from 'pool5' layer in the model as given in prototxt.
I am not sure whether that is correct, because the features I extract for different cars don't seem to differ much. In other words, I am unable to differentiate cars using these last-layer features. I used Euclidean distance on the features (is that correct?). I am not using softmax because I don't want to classify them; I just want the features, and I am comparing them using Euclidean distance.
These are the steps I followed:
## load the model
net = caffe.Net('deploy.prototxt',
                caffe.TEST,
                weights='googlenet_finetune_web_car_iter_10000.caffemodel')
# reshape the input blob, as I have only one image in my batch
net.blobs["data"].reshape(1, 3, 224, 224)
# read my image of size (x, y, 3)
frame = cv2.imread(frame_path)
# crop out the car; I have stored x1, x2, y1, y2 separately
bbox = frame[int(x1):int(x2), int(y1):int(y2)]
# resize my image to 224, 224, 3, the network input size
bbox = cv2.resize(bbox, (224, 224))
# rearrange the axes to align my input with the input layout of the model
bbox_input = bbox.swapaxes(1, 2).reshape(3, 224, 224)
# feed the input image to the model
net.blobs['data'].data[0] = bbox_input
net.forward()
# features from the pool5 layer, i.e. the last layer
temp = net.blobs["pool5"].data[0]
Now, I want to confirm whether these steps are correct. I am new to Caffe and I am not sure about the steps I wrote above.
Both options are valid. The farther you are from the end of your network, the less specialized the features will be to your problem/training set, while still capturing relevant information that may be applied to similar tasks. As you move to the end of the network, the features will be more tuned to your task.
Note that you are dealing with two similar problems/tasks. The network was fine-tuned for car classification ("which model is this car?") and now you want to verify if two cars belong to the same model.
Considering the network was fine-tuned with a large and representative training set, the features obtained from it are powerful and have a lot of representational capability (i.e., they capture many of the complex underlying patterns of the task they were trained for), which is useful for your verification task.
With this in mind, you could try many ways of comparing two feature vectors:
Euclidean Distance is too simple. I would try it only because it is easy/fast to implement;
Cosine Similarity [1] might also be a simple, but good starting point (see the sketch after this list);
Classifier. Another possibility, which I have used for a similar problem, is to train a classifier (SVM, logistic regression) on top of a combination of the two feature vectors. The input of your classifier could be their concatenation side by side.
Incorporate the verification task into your network. You could alter the GoogleNet architecture to receive two photos of cars and output whether or not they belong to the same model. You would be transforming/fine-tuning your network from the classification problem to a verification task. Look into siamese networks [2].
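A small NumPy sketch of the first two options, continuing from the question's net object (the second forward pass is only indicated by a comment):

import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

feat1 = net.blobs["pool5"].data[0].flatten().copy()  # features of the first car image
# ... forward the second image through the network ...
feat2 = net.blobs["pool5"].data[0].flatten().copy()  # features of the second car image

print(euclidean_distance(feat1, feat2), cosine_similarity(feat1, feat2))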
Edit: there is a mistake when you resize your frame that may be the cause of your problems!
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(frame, (224, 224))
You should have passed the cropped bbox as input to cv2.resize(), not the whole frame. You are probably feeding a garbage input to the network, and that is why the outputs always end up looking similar.

BCELoss for binary pixel-wise segmentation pytorch

I'm implementing a UNet for binary segmentation while using Sigmoid and BCELoss. The problem is that after several iterations the network tries to predict very small values per pixel while for some regions it should predict values close to one (for ground truth mask region). Does it give any intuition about the wrong behavior?
Besides, there exists NLLLoss2d, which is used for pixel-wise loss. Currently, I'm simply ignoring this and using MSELoss() directly. Should I use NLLLoss2d with a Sigmoid activation layer?
Thanks
It seems to me that your sigmoids are saturating the activation maps. Either the images are not properly normalised or some batch normalisation layers are missing. If you have an implementation that works with other images, check the image loader and make sure it does not saturate the pixel values. This usually happens with 16-bit channels. Can you share some of the input images?
PS Sorry for commenting in the answer. This is a new account and I am not allowed to comment yet.
You might want to use torch.nn.BCEWithLogitsLoss(), replacing the Sigmoid and the BCELoss function.
An excerpt from the docs tells you why it's always better to use this loss function implementation.
This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
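A minimal PyTorch sketch of the swap (model, images and masks are hypothetical names; the point is that the network's final sigmoid is removed and the raw logits go into the loss):

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = model(images)                   # UNet output *without* a final sigmoid, shape [B, 1, H, W]
loss = criterion(logits, masks.float())  # masks: binary ground truth of the same shape
loss.backward()

# at inference time, apply the sigmoid (and a threshold) explicitly
probs = torch.sigmoid(logits)
pred = (probs > 0.5).float()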

How to use Inception-v3 as a convolutional network

So, I've retrained the Inception-v3 network to classify specific kinds of data - for training I've provided it with 200x200 pictures. Now, when I run the graph on another 200x200 picture it works just fine. What I want to achieve is to turn it into a filter for a convolutional network - i.e. slide it as a filter through the whole picture and get the probability of each pixel being in a given class.
It seems to be fairly simple to do manually - just split the picture into small sections, classify each of them, put the results together, and voila. But that would be very inefficient. Instead, I want to do something like what is described here: http://cs231n.github.io/convolutional-networks/#convert. Basically, change the last FC layer into a CONV layer by reshaping the weights. Seems simple enough, but I can't figure out how to actually do this.
My main problem is that at the end of the Inception-v3 net, right before the last FC layer, there's a pooling operation that reformats the data into (1,2048) shape, so I won't really be able to perform a convolution here.
Could anyone help me out?
My most immediate solution for this is to skip the fully connected layer at the end, as it causes the input to lose its initial spatial structure. Doing Conv -> FC -> Conv seems redundant.
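For completeness, the cs231n-style conversion the question refers to amounts to reshaping the learned FC weight matrix into a 1x1 convolution kernel. A rough TF 1.x sketch, assuming the global pooling step has been removed so that features keeps its spatial extent, and fc_weights/fc_biases are the learned FC parameters (hypothetical names and shapes):

import tensorflow as tf

# features:   [batch, H, W, 2048] convolutional feature map (pooling removed)
# fc_weights: [2048, num_classes], fc_biases: [num_classes]
conv_kernel = tf.reshape(fc_weights, [1, 1, 2048, num_classes])  # FC weights as a 1x1 kernel
logits_map = tf.nn.conv2d(features, conv_kernel, strides=[1, 1, 1, 1], padding="SAME")
logits_map = tf.nn.bias_add(logits_map, fc_biases)
# logits_map: [batch, H, W, num_classes] -- per-location class scores over the image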
