I have a UNet segmentation network implemented in Keras that maps each pixel of an RGB image to one of 4 categories, trained on a heat-map mask (Low, Low-Med, High-Med, High). Using CCE or categorical Dice loss I am able to get decent results.
However, the mask in its original form is a heat-map image with 256 levels (8 bits) of resolution. It seems like a totally arbitrary introduction of error to shoehorn it into the UNet by reducing those 256 levels to 4 categories.
I would like the network to output an image with each pixel having a value between (0,1), and train the network with masks that are produced by multiplying the heat map image by 1./255.
In this case, the loss function would incorporate the mathematical difference between the mask and the network's prediction. Can anyone point me in the direction of someone who has done something similar? I think I am just awful at describing what I'm looking for with the relevant terminology, because it seems like this would be a fairly common goal in computer vision.
If I understand your question correctly, the "ground truth" mask is just a gray-scale image with values in the range [0, 255], meaning there is a strong relation between its values (for example, 25 is closer to 26 than to 70; this is not the case with regular segmentation, where you assign a different class to each pixel and the class values may represent arbitrary objects such as "bicycle" or "person"). In other words, this is a regression problem, and more specifically an image-to-image regression: you are trying to reconstruct a gray-scale image that should be identical to the ground-truth mask, pixel-wise.
If I understood you correctly, you should look for regression losses. Common examples are Mean Squared Error (aka MSE, L2 norm) and Mean Absolute Error (aka MAE, L1 norm). Those are the "usual suspects" and I suggest you start with them, although many other losses exist.
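As a concrete illustration, here is a minimal Keras sketch of that head/loss change (the tiny conv body is only a stand-in for your existing UNet encoder/decoder, and the input size and optimizer are assumptions):

import tensorflow as tf
from tensorflow import keras

# Stand-in body: replace with the existing UNet encoder/decoder.
inputs = keras.Input(shape=(256, 256, 3))  # assumed input size
x = keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)

# Regression head: one sigmoid channel in [0, 1] instead of a 4-way softmax.
outputs = keras.layers.Conv2D(1, 1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")  # or loss="mae"

# Masks: scale the 8-bit heat map into [0, 1] to match the sigmoid output.
# y_train = heatmap_masks.astype("float32") / 255.0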
Related
I'm implementing a VAE on a dataset of binary images (pixels are black or white), where every pixel in an image has a meaning (it belongs to a class).
Searching online, I found that the best implementation is to use a sigmoid as the last activation function and binary cross-entropy as the loss function; correct me if I'm wrong.
When I generate an image from the latent space, either from random coordinates or from coordinates obtained by encoding an input image, I may get blurry images, which is normal, but I want only 0 and 1 as values (because I want to know whether an element belongs to that class or not).
So my question is: are there standard procedures for getting only binary images as output, or for training the model to produce them (maybe by changing the loss or something)? Or does the model have to be implemented this way, so that the only solution is to threshold the output pixels at 0.5 to obtain a binary image?
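For what it's worth, the thresholding option mentioned at the end is just a post-processing step on the decoder output; a minimal sketch, where decoder and z are hypothetical names for the trained VAE decoder and a latent sample:

import numpy as np

# Hypothetical names: decoder is the trained VAE decoder, z is a latent sample.
probs = decoder.predict(z)                    # sigmoid outputs in (0, 1)
binary_img = (probs >= 0.5).astype(np.uint8)  # hard 0/1 pixels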
I am working with features extracted from pre-trained VGG16 and VGG19 models. The features were extracted from the second fully connected layer (FC2) of each network.
The resulting feature matrix (of dimensions (8000, 4096)) has values in the range [0, 45]. As a result, when I use this feature matrix in gradient-based optimization algorithms, the loss function, gradients, norms, etc. take very high values.
To do away with such high values, I applied min-max normalization to the feature matrix, and since then the values are manageable and the optimization algorithm behaves properly. Is my strategy OK, i.e. is it fair to normalize features that have been extracted from a pre-trained model for further processing?
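For reference, the min-max scaling described above amounts to a few lines of numpy per feature column (sklearn's MinMaxScaler does the same thing); the file name below is hypothetical:

import numpy as np

def minmax_normalize(features, eps=1e-8):
    # Scale each column of an (n_samples, n_features) matrix to [0, 1].
    mins = features.min(axis=0, keepdims=True)
    maxs = features.max(axis=0, keepdims=True)
    return (features - mins) / (maxs - mins + eps)

# features = np.load("vgg16_fc2_features.npy")  # e.g. shape (8000, 4096), values in [0, 45]
# features_scaled = minmax_normalize(features)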
From experience, as long as you are aware of the fact that your results are coming from normalized values, it is okay. If normalization helps you show gradients, norms, etc. better then I am for it.
What I would be cautious about, though, is any further analysis on those feature matrices, as they are normalized and not the true values. Say, if you were to study the distributions and such, you should be fine, but I am not sure what your next step is, and whether this can or will be harmful.
Can you share more details around "further analysis"?
If I train my CNN to identify MNIST handwritten digits using "images" (arrays) with black background (value 0):
Will it be able to identify digits in images with white background?
What about vice versa? If the answer is yes (background color doesn't matter), what would be the explanation? Thanks in advance.
No, it wouldn't work directly. If you think about the problem of classifying digits, what we want to do is take a point (an array of 28x28 numbers from 0-255) and map it to a digit 0-9. If we fit a function that performs this task, we can't feed it the inverted point and expect it to still work.
Imagine a simpler case where we have points in 2D (coordinates of 2 numbers) and fit a straight line through them. If we now transform the data by moving the points (inverting them, for example), the line no longer fits, and neither does our model.
However, a CNN that trains and performs well on the first dataset in theory should be able to train and perform well on the second.
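A quick sketch of that second experiment: invert the MNIST images to get a white-background copy, then either retrain on the inverted set alone or train on both polarities (the augmentation variant is just one option, not the only approach):

import numpy as np
from tensorflow import keras

(x_train, y_train), _ = keras.datasets.mnist.load_data()

# White-background version of the same digits: invert the pixel values.
x_train_inv = 255 - x_train

# Option: train on both polarities so the background color stops mattering.
x_both = np.concatenate([x_train, x_train_inv], axis=0)
y_both = np.concatenate([y_train, y_train], axis=0)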
I'm currently trying to find any information on how to implement a localization loss for the task of detecting multiple objects in an image. There is a lot of information on how to calculate a localization loss when there is only one detection. On the other hand, there are also lots of implementations of SOTA object detectors (R-CNN, Faster R-CNN, SSD, etc.).
The reason for my question is that I would like to try to train a custom object detector whose output is simply a tensor of shape B x N x 4, without any anchors and so on.
So, if I understand correctly, to calculate the loss for multiple objects one first has to map each prediction to a ground-truth bounding box (using, for instance, IoU), then calculate a smooth L1 loss for each pair and average them. How can I do this in TensorFlow?
Thanks in advance.
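One possible sketch in TensorFlow, under simplifying assumptions (a single image, boxes in (x1, y1, x2, y2) format, and greedy argmax matching rather than the Hungarian matching or anchor assignment used by the detectors mentioned above):

import tensorflow as tf

def iou_matrix(pred_boxes, gt_boxes):
    # Pairwise IoU between N predicted and M ground-truth boxes: [N, 4], [M, 4] -> [N, M].
    p = tf.expand_dims(pred_boxes, 1)   # [N, 1, 4]
    g = tf.expand_dims(gt_boxes, 0)     # [1, M, 4]
    x1 = tf.maximum(p[..., 0], g[..., 0])
    y1 = tf.maximum(p[..., 1], g[..., 1])
    x2 = tf.minimum(p[..., 2], g[..., 2])
    y2 = tf.minimum(p[..., 3], g[..., 3])
    inter = tf.maximum(x2 - x1, 0.0) * tf.maximum(y2 - y1, 0.0)
    area_p = (p[..., 2] - p[..., 0]) * (p[..., 3] - p[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_p + area_g - inter + 1e-8)

def matched_smooth_l1(pred_boxes, gt_boxes):
    # Assign each ground-truth box to its highest-IoU prediction,
    # then average a Huber (smooth L1) loss over the matched pairs.
    iou = iou_matrix(pred_boxes, gt_boxes)            # [N, M]
    best_pred = tf.argmax(iou, axis=0)                # [M] best prediction per GT box
    matched = tf.gather(pred_boxes, best_pred)        # [M, 4]
    huber = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.NONE)
    return tf.reduce_mean(huber(gt_boxes, matched))   # scalar loss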
I'm trying to design and train a convolutional neural network to identify circular cells in an image. I am training it on "cutouts" of the full images, which either have a circle in the middle of the image (positive training sample) or don't (negative training sample).
Example of an image with a circle in the middle (the heatmap colors are wonky, the images are all grayscale): http://imgur.com/a/6q8LZ
Rather than just classify the two types of input images (circle or not in the middle), I'd like the network output to be a binary bitmap, which is either a uniform value (e.g. -1) if there is no circle in the input image or has a "blotch" (ideally a single point) in the middle of the image to indicate the center of the circle. This would then be applied to a large image containing many such circular cells and the output should be a bitmap with blotches where the cells are.
In order to train this, I'm using the mean square error between the output image and a 2D gaussian filter (http://imgur.com/a/fvfP6) for positive training samples and the MSE between the image and a uniform matrix with value -1 for negative training samples. Ideally, this should cause the CNN to converge on an image, which resembles the gaussian peak in the middle for positive training samples, and an image, which is uniformly -1 for negative training samples.
HOWEVER, the network keeps converging on a universal solution of "make everything zero". This does not minimize the MSE, so I don't think it's an inherent problem with the network structure (I've tried different structures, from a single-layer CNN with a filter as large as the input image to multi-layer CNNs with filters of varying sizes, all with the same result).
The loss function I am using is as follows:
weighted_score = tf.reduce_sum(tf.square(tf.sub(conv_squeeze, y)),
                               reduction_indices=[1, 2])
with conv_squeeze being the output image of the network and y being the label (i.e. the gaussian template shown above). I've already tried averaging over the batch size as suggested here:
Using squared difference of two images as loss function in tensorflow
but without success. I cannot find any academic publications on how to train neural networks with template images as labels, so I would be grateful if anybody could point me in the right direction. Thank you so much!
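For reference, the same loss written against current TensorFlow, where tf.sub and reduction_indices have become tf.subtract and axis, with the batch averaging included. This is only a restatement of the snippet above, not a fix for the all-zero collapse:

import tensorflow as tf

def template_mse(y_true, y_pred):
    # Squared error against the template, summed over H and W, averaged over the batch.
    per_image = tf.reduce_sum(tf.square(y_pred - y_true), axis=[1, 2])
    return tf.reduce_mean(per_image)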
According to your description, I think you are facing an "imbalanced data" problem. You could try a hinge loss instead of MSE; it may solve your problem.
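A pixel-wise hinge loss needs targets of +1/-1; a hedged sketch, assuming the Gaussian template is binarized (with an arbitrarily chosen threshold) so the peak region becomes +1 and everything else -1:

import tensorflow as tf

def pixel_hinge_loss(y_true, y_pred):
    # Element-wise hinge loss; expects y_true to be +1 / -1 per pixel.
    return tf.reduce_mean(tf.maximum(0.0, 1.0 - y_true * y_pred))

# Hypothetical binarization of the Gaussian template into +/-1 targets:
# y_true = tf.where(gaussian_template > 0.5, 1.0, -1.0)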