Image Generation in a Variational Autoencoder with a Binary Images Dataset - python

I'm implementing a VAE with a binary images dataset (pixels are black or white), where every pixel in an image has a meaning (it belongs to a class).
Searching online, I found that the best approach is to use a sigmoid as the last activation function and binary cross-entropy as the loss function; correct me if I'm wrong.
When I try to generate an image from the latent space, using random coordinates or coordinates obtained by encoding an input image, I may obtain blurry images, which is normal, but I want only 0 and 1 as values (because I want to know whether an element belongs to that class or not).
So my question is: are there standard procedures to obtain only binary images as output, or to train the model toward this result (maybe by changing the loss or something), or does the model have to be implemented this way, so that the only solution is to threshold the output pixels at 0.5?
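For reference, a minimal sketch of the thresholding option, assuming a decoder whose last layer is a sigmoid so that every output pixel is a probability in [0, 1] (the decoder and latent_dim names are hypothetical):

import numpy as np

latent_dim = 2
z = np.random.normal(size=(1, latent_dim))   # random latent coordinates
probs = decoder.predict(z)                   # hypothetical decoder; output in [0, 1]

# Binarize: 1 if the pixel is judged to belong to the class, else 0.
binary_image = (probs > 0.5).astype(np.uint8)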

Related

UNet Loss function for non-categorical Mask?

I have a UNet segmentation network implemented in Keras that maps every pixel of an RGB image to one of 4 categories; it is trained on a heat-map mask (Low, Low-Med, High-Med, High). Using CCE or categorical Dice loss I am able to get decent results.
However, the mask in its original form is an 8-bit heat-map image with 256 intensity levels. It seems like a totally arbitrary introduction of error to shoehorn it into the UNet by reducing that resolution to 4 categories.
I would like the network to output an image with each pixel having a value in (0, 1), and to train the network with masks produced by multiplying the heat-map image by 1./255.
In this case, the loss function would incorporate the mathematical difference between the mask and the network's prediction. Can anyone point me in the direction of someone who has done something similar? I think I am just awful at describing what I'm looking for with the relevant terminology, because it seems like this would be a fairly common goal in computer vision.
If I understand your question correctly, the "ground truth" mask is just a gray-scale image with values in the range [0, 255], meaning there is a strong relation between its values (for example, 25 is closer to 26 than to 70; this is not the case with regular segmentation, where you assign a different class to each pixel and the class values may represent arbitrary objects such as "bicycle" or "person"). In other words, this is a regression problem, and more specifically an image-to-image regression: you are trying to reconstruct a gray-scale image that should be identical to the ground truth mask, pixel-wise.
If that is the case, you should look for regression losses. Common examples are Mean Squared Error (aka MSE, L2 norm) and Mean Absolute Error (aka MAE, L1 norm). Those are the "usual suspects" and I suggest you start with them, although many other losses exist.
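A minimal Keras sketch of this regression setup, assuming the 8-bit masks are scaled to [0, 1]; the two Conv2D layers below are only a placeholder for the existing UNet body, the point is the single-channel sigmoid head and the regression loss:

from tensorflow import keras

inputs = keras.Input(shape=(256, 256, 3))
x = keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)   # stand-in for the UNet body
x = keras.layers.Conv2D(16, 3, padding="same", activation="relu")(x)
# Single-channel sigmoid head: every pixel becomes a continuous value in (0, 1).
outputs = keras.layers.Conv2D(1, 1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

# Regression loss instead of categorical cross-entropy:
# MSE (or "mae") between the prediction and the heat map scaled by 1/255.
model.compile(optimizer="adam", loss="mse")
# model.fit(images, masks / 255.0, ...)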

Can someone please explain the content loss function?

I am currently getting familiar with TensorFlow and machine learning. I am doing some tutorials on style transfer and now I have a part of an example code that I somehow can not comprehend.
I think I get the main idea: there are three images, the content image, the style image and the mixed image. Let's just talk about the content loss first, because if I can understand that, I will also understand the style loss. So I have the content image and the mixed image (starting from some distribution with some noise), and the VGG16 model.
As far as I can understand, I should now feed the content image into the network to some layer, and see what is the output (feature map) of that layer for the content image input.
After that I also should feed the network with the mixed image to the same layer as before, and see what is the output (feature map) of that layer for the mixed image input.
I then should calculate the loss function from these two outputs, because I would like the mixed image to have a feature map similar to that of the content image.
My problem is that I do not understand how this is done in the example codes that I could find online.
The example code can be the following:
http://gcucurull.github.io/tensorflow/style-transfer/2016/08/18/neural-art-tf/
But nearly all of the examples used the same approach.
The content loss is defined like this:
def content_loss(cont_out, target_out, layer, content_weight):
    # content loss is just the mean square error between the outputs of a given layer
    # in the content image and the target image
    cont_loss = tf.reduce_sum(tf.square(tf.sub(target_out[layer], cont_out)))
    # multiply the loss by its weight
    cont_loss = tf.mul(cont_loss, content_weight, name="cont_loss")
    return cont_loss
And is called like this:
# compute loss
cont_cost = losses.content_loss(content_out, model, C_LAYER, content_weight)
Where content_out is the output for the content image, model is the used model, C_LAYER is the reference to the layer that we would like to get the output of and content_weight is the weight with which we multiply.
The problem is that I somehow cannot see where this feeds the network with the mixed image. It seems to me that cont_loss is computed between the output for the content image and the layer itself.
The magic should be somewhere here:
cont_loss = tf.reduce_sum(tf.square(tf.sub(target_out[layer], cont_out)))
But I simply cannot see how this produces the RMS between the feature map of the content image and the feature map of the mixed image at the given layer.
I would be very thankful if someone could point out where I am wrong and explain to me, how that content loss is calculated.
Thanks!
The loss forces the networks to have similar activations at the layer you have chosen.
Let us call x one convolutional map/pixel from target_out[layer] and y the corresponding map from cont_out. You want their difference to be as small as possible, i.e., the absolute value |x - y|. For the sake of numerical stability, we use the square instead of the absolute value, because it is a smooth function and more tolerant of small errors.
We thus get (x - y)^2, which is: tf.square(tf.sub(target_out[layer], cont_out)).
Finally, we want to minimize the difference for each map and each example in the batch. This is why we sum all the differences into a single scalar using tf.reduce_sum.
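To make the role of the mixed image explicit, here is a small sketch of the same idea in current TensorFlow, where vgg_features(img, layer) is a hypothetical helper returning the feature map of a given layer; target_out[layer] in the quoted code plays the same role, because it is computed from the mixed image that lives in the graph:

import tensorflow as tf

def content_loss(content_features, mixed_features, content_weight):
    # Sum of squared differences between the two feature maps, scaled by the weight.
    return content_weight * tf.reduce_sum(tf.square(mixed_features - content_features))

# Hypothetical usage: the mixed image is the variable being optimized,
# so its features are recomputed (and differentiated) on every step.
# mixed_image = tf.Variable(initial_noise)
# content_features = vgg_features(content_image, layer)   # fixed, computed once
# with tf.GradientTape() as tape:
#     mixed_features = vgg_features(mixed_image, layer)   # depends on mixed_image
#     loss = content_loss(content_features, mixed_features, content_weight=1.0)
# grads = tape.gradient(loss, mixed_image)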

Extracting features from caffe last layer

I am trying to extract features from the last layer of a GoogleNet caffe model fine-tuned on car classification. Here's the deploy.prototxt. I tried a couple of things:
I took features from the 'loss3_classifier_model' layer, which is incorrect.
Now I am extracting features from the 'pool5' layer of the model, as given in the prototxt.
I am not sure whether this is correct, because the features I extract for different cars don't seem to differ much. In other words, I am unable to differentiate cars using these last-layer features. I compare the features with the Euclidean distance (is that correct?). I am not using softmax since I don't want to classify the cars; I just want the features, and I check them using the Euclidean distance.
These are the steps I followed:
## load the model
net = caffe.Net('deploy.prototxt',
                caffe.TEST,
                weights='googlenet_finetune_web_car_iter_10000.caffemodel')
# resize the input shape as I have only one image in my batch.
net.blobs["data"].reshape(1, 3, 224, 224)
# I read my image of size (x, y, 3)
frame = cv2.imread(frame_path)
bbox = frame[int(x1):int(x2), int(y1):int(y2)]  # getting the car; I have stored x1, x2, y1, y2 separately.
# resized my image to 224, 224, 3, the network input size.
bbox = cv2.resize(bbox, (224, 224))
# to align my input to the input of the model
bbox_input = bbox.swapaxes(1, 2).reshape(3, 224, 224)
# fed the input image to the model.
net.blobs['data'].data[0] = bbox_input
net.forward()
# features from the pool5 layer, i.e. the last layer.
temp = net.blobs["pool5"].data[0]
Now, I want to confirm if these steps are correct or not? I am new to caffe and I am not sure about the steps I wrote above.
Both options are valid. The farther you are from the end of your network, the less specialized the features will be to your problem/training set, while still capturing relevant information that may be applied to similar tasks. As you move to the end of the network, the features will be more tuned to your task.
Note that you are dealing with two similar problems/tasks. The network was fine-tuned for car classification ("which model is this car?") and now you want to verify if two cars belong to the same model.
Considering the network was fine-tuned with a large and representative training set, the features obtained from it are powerful, with a lot of representation capability (i.e., they capture many of the complex underlying patterns of the task they were trained for), and they are useful for your verification task.
With this in mind, you could try many ways of comparing two feature vectors:
Euclidean Distance is too simple. I would try it only because it is easy/fast to implement;
Cosine similarity [1] might also be a simple but good starting point (see the short sketch after this list);
Classifier. Another possibility, which I have used in a similar problem, is to train a classifier (SVM, logistic regression) on top of a combination of the two feature vectors. The input of your classifier could be their concatenation side by side.
Incorporate the verification task into your network. You could alter the GoogleNet architecture to receive two photos of cars and output whether or not they belong to the same model. You would be transforming/fine-tuning your network from a classification problem to a verification task. Check for siamese networks [2].
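A quick sketch of the first two options, assuming feat_a and feat_b are two pool5 feature maps flattened to 1-D vectors as extracted in the question:

import numpy as np

def euclidean_distance(feat_a, feat_b):
    # Smaller distance -> more similar cars.
    return np.linalg.norm(feat_a - feat_b)

def cosine_similarity(feat_a, feat_b):
    # Value in [-1, 1]; closer to 1 -> more similar cars.
    return np.dot(feat_a, feat_b) / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))

# Hypothetical usage (copy the blob, since net.forward() overwrites it):
# feat_a = net.blobs["pool5"].data[0].flatten().copy()
# feat_b = ...  # same extraction for the second car image
# print(euclidean_distance(feat_a, feat_b), cosine_similarity(feat_a, feat_b))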
Edit: there is a mistake when you resize your frame that may be the cause of your problems!
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(frame, (224, 224))
You should have passed frame as input in the cv2.resize() method. You are probably feeding a garbage input to the network and that is why the output ends up always looking similar.

What is the expected input range for working with Keras VGG models?

I'm trying to use a pretrained VGG 16 from keras. But I'm really unsure about what the input range should be.
Quick answer, which of these color orders?
RGB
BGR
And which range?
0 to 255?
balanced from about -125 to about +130?
0 to 1?
-1 to 1?
I notice the file where the model is defined imports an input preprocessor:
from .imagenet_utils import preprocess_input
But this preprocessor is never used in the rest of the file.
Also, when I check the code for this preprocessor, it has two modes: caffe and tf (tensorflow).
Each mode works differently.
Finally, I can't find consistent documentation on the internet.
So, what is the best range for working? To what range are the model weights trained?
The model weights were ported from caffe, so they expect images in BGR format.
Caffe uses a BGR color channel scheme for reading image files. This is due to the underlying OpenCV implementation of imread. The assumption of RGB is a common mistake.
You can find the original caffe model weight files on the VGG website. This link can also be found in the Keras documentation.
I think the second range would be the closest one. There's no scaling during training, but the authors have subtracted the mean value of the ILSVRC2014 training set. As stated in the original VGG paper, section 2.1:
The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel.
This sentence is actually what imagenet_utils.preprocess_input(mode='caffe') does.
Convert from RGB to BGR: because keras.preprocessing.image.load_img() loads images in RGB format, this conversion is required for VGG16 (and all models ported from caffe).
Subtract the mean BGR values: (103.939, 116.779, 123.68) is subtracted from the image array.
The preprocessor is not used in vgg16.py. It's imported in the file so that users can use the preprocess function by calling keras.applications.vgg16.preprocess_input(rgb_img_array), without caring about where model weights come from. The argument for preprocess_input() is always an image array in RGB format. If the model was trained with caffe, preprocess_input() will convert the array into BGR format.
Note that the function preprocess_input() is not intended to be called from imagenet_utils module. If you are using VGG16, call keras.applications.vgg16.preprocess_input() and the images will be converted to a suitable format and range that VGG16 was trained on. Similarly, if you are using Inception V3, call keras.applications.inception_v3.preprocess_input() and the images will be converted to the range that Inception V3 was trained on.
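A short usage sketch of the pipeline described above, using the standalone Keras package layout referenced in the answer (the image path is a placeholder):

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array

model = VGG16(weights="imagenet")

# load_img returns an RGB image; preprocess_input converts it to BGR and
# subtracts the ImageNet BGR means (103.939, 116.779, 123.68), matching the caffe training setup.
img = load_img("car.jpg", target_size=(224, 224))   # placeholder path
x = img_to_array(img)                               # shape (224, 224, 3), RGB, values 0-255
x = preprocess_input(np.expand_dims(x, axis=0))     # now BGR and mean-centered

preds = model.predict(x)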

Training a convolutional network on template images

I'm trying to design and train a convolutional neural network to identify circular cells in an image. I am training it on "cutouts" of the full images, which either have a circle in the middle of the image (positive training sample) or don't (negative training sample).
Example of an image with a circle in the middle (the heatmap colors are wonky, the images are all grayscale): http://imgur.com/a/6q8LZ
Rather than just classify the two types of input images (circle or not in the middle), I'd like the network output to be a binary bitmap, which is either a uniform value (e.g. -1) if there is no circle in the input image or has a "blotch" (ideally a single point) in the middle of the image to indicate the center of the circle. This would then be applied to a large image containing many such circular cells and the output should be a bitmap with blotches where the cells are.
In order to train this, I'm using the mean square error between the output image and a 2D gaussian filter (http://imgur.com/a/fvfP6) for positive training samples and the MSE between the image and a uniform matrix with value -1 for negative training samples. Ideally, this should cause the CNN to converge on an image, which resembles the gaussian peak in the middle for positive training samples, and an image, which is uniformly -1 for negative training samples.
HOWEVER, the network keeps converging on a universal solution of "make everything zero". This does not minimize the MSE, so I don't think it's an inherent problem with the network structure (I've tried different structures, from a single-layer CNN with a filter as large as the input image to multi-layer CNNs with filters of varying size, all with the same result).
The loss function I am using is as follows:
weighted_score = tf.reduce_sum(tf.square(tf.sub(conv_squeeze, y)),
                               reduction_indices=[1, 2])
with conv_squeeze being the output image of the network and y being the label (i.e. the gaussian template shown above). I've already tried averaging over the batch size as suggested here:
Using squared difference of two images as loss function in tensorflow
but without success. I cannot find any academic publications on how to train neural networks with template images as labels, so I would be grateful if anybody could point me in the right direction. Thank you so much!
According to your description, I think you are facing an "imbalanced data" problem: most pixels in your target images are background, so a nearly constant output already achieves a fairly low error. You can try hinge loss instead of MSE; it may solve your problem.
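A minimal TensorFlow sketch of a pixel-wise hinge loss, assuming the target maps are binarized to -1 (background) and +1 (circle center region); this is one way to apply the suggestion, not the answer's exact formulation:

import tensorflow as tf

def pixelwise_hinge_loss(y_true, y_pred):
    # y_true: target map with values in {-1, +1}
    # y_pred: raw network output (no final squashing required)
    # Penalizes pixels whose prediction falls on the wrong side of the margin.
    per_pixel = tf.maximum(0.0, 1.0 - y_true * y_pred)
    return tf.reduce_mean(per_pixel)

# Hypothetical usage with the tensors from the question:
# loss = pixelwise_hinge_loss(y, conv_squeeze)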
