Multi-channel, 2D mask weights using BCEWithLogitsLoss in Pytorch - python

I have a set of 256x256 images, each labeled with nine binary 256x256 masks. I am trying to calculate pos_weight in order to weight BCEWithLogitsLoss in PyTorch.
The shape of my masks tensor is tensor([1000, 9, 256, 256]), where 1000 is the number of training images, 9 is the number of mask channels (all encoded to 0/1), and 256 is the size of each image side.
To calculate pos_weight, I have summed the zeros in each mask and divided that number by the sum of all of the ones in each mask (following the advice suggested here):
(masks[:,channel,:,:]==0).sum()/masks[:,channel,:,:].sum()
Calculating the weight for every mask channel provides a tensor with the shape of tensor([9]), which seems intuitive to me, since I want a pos_weight value for each of the nine mask channels. However, when I try to fit my model, I get the following error message:
RuntimeError: The size of tensor a (9) must match the size of
tensor b (256) at non-singleton dimension 3
This error message is surprising because it suggests that the weights need to be the size of one of the image sides, but not the number of mask channels. What shape should pos_weight be and how do I specify that it should be providing weights for the mask channels instead of the image pixels?

TL;DR: This is a broadcasting issue which, surprisingly, is not handled by PyTorch's nn.BCEWithLogitsLoss, i.e. by the underlying F.binary_cross_entropy_with_logits. It might actually be worth opening a GitHub issue linking to this SO thread to notify the developers of this undesirable behaviour.
In the documentation page of nn.BCEWithLogitsLoss, it is stated that the provided positive weights tensor pos_weight:
Must be a vector with length equal to the number of classes.
This is of course what you were expecting (rightly so), since positive weights refer to the weight given to the positive instances of every single class. Because your prediction and target tensors are multi-dimensional with the class dimension in second position, a 1D pos_weight is broadcast against the last dimension (the width) instead of the class dimension, which is what triggers the error.
Anyhow, here is a minimal example showing how you can bypass this error, along with the manual computation of the binary cross-entropy as a reference.
Here is the setup of the prediction and target tensors, pred and label respectively:
>>> import torch
>>> import torch.nn.functional as F
>>> c=2;b=5;h=3;w=3
>>> pred = torch.rand(b,c,h,w)
>>> label = torch.randint(0,2, (b,c,h,w), dtype=torch.float)
Now for the definition of the positive weight, notice the trailing singleton dimensions:
>>> pos_weight = torch.rand(c,1,1)
In your case, with your existing 1D tensor of length c, you would simply have to unsqueeze two extra dimensions for the height and width dimensions. This means doing something like: pos_weight = pos_weight[:,None,None].
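Applied to the nine-channel masks from the question, the whole thing could look roughly like the sketch below. The tensor sizes are shrunk to keep it fast, the masks and logits are random stand-ins, and the clamp that guards against a channel with no positive pixels is my own assumption rather than something from the question.

```python
import torch
import torch.nn as nn

# Smaller stand-in for the (1000, 9, 256, 256) masks tensor from the question.
n, c, h, w = 10, 9, 32, 32
masks = torch.randint(0, 2, (n, c, h, w)).float()

# Count positives and negatives per channel in one go (reduce over N, H, W).
pos = masks.sum(dim=(0, 2, 3))        # shape (9,)
neg = masks.numel() / c - pos         # pixels per channel minus positives
# Assumption: clamp avoids division by zero for a channel with no positives.
pos_weight = neg / pos.clamp(min=1)   # shape (9,)

# Add singleton height/width dims so the weights broadcast over (N, 9, H, W).
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight[:, None, None])

logits = torch.randn(n, c, h, w)      # stand-in for the model output
loss = criterion(logits, masks)
```

The only change compared with the failing setup is the `[:, None, None]` indexing; the per-channel weights themselves stay the same.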
Calling the BCE-with-logits function (or its OOP equivalent, nn.BCEWithLogitsLoss):
>>> F.binary_cross_entropy_with_logits(pred, label, pos_weight=pos_weight).mean()
This is equivalent, in plain code, to:
>>> z = torch.sigmoid(pred)
>>> bce = -(pos_weight*label*torch.log(z) + (1-label)*torch.log(1-z))
>>> bce.mean()
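If you want to convince yourself that the manual formula and the built-in agree, a quick check along these lines should do. It is only a sketch with the same toy shapes as above; the fixed seed and the tolerance are arbitrary choices of mine.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, c, h, w = 5, 2, 3, 3
pred = torch.rand(b, c, h, w)
label = torch.randint(0, 2, (b, c, h, w)).float()
pos_weight = torch.rand(c, 1, 1)

# Built-in loss (default reduction is the mean over all elements).
builtin = F.binary_cross_entropy_with_logits(pred, label, pos_weight=pos_weight)

# Manual weighted binary cross-entropy, averaged the same way.
z = torch.sigmoid(pred)
manual = -(pos_weight * label * torch.log(z) + (1 - label) * torch.log(1 - z)).mean()

matches = torch.allclose(builtin, manual, atol=1e-6)
```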
Note that the built-in function would have the desired behaviour (i.e. no error message) if the class dimension were last in your prediction and target tensors.
>>> pos_weight = torch.rand(c)
>>> F.binary_cross_entropy_with_logits(
... pred.transpose(1,-1),
... label.transpose(1,-1),
... pos_weight=pos_weight)
In other words, we are applying the function with format NHWC which means the pos_weight of format C can be multiplied properly. So the result above effectively yields the same result as:
>>> F.binary_cross_entropy_with_logits(
... pred,
... label,
... pos_weight=pos_weight[:,None,None])
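The equivalence of the two calls can be checked numerically. The sketch below uses permute(0, 2, 3, 1) to get an exact NHWC layout (transpose(1, -1) as above gives NWHC, which yields the same mean); the seed and toy shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, c, h, w = 5, 2, 3, 3
pred = torch.rand(b, c, h, w)
label = torch.randint(0, 2, (b, c, h, w)).float()
pos_weight = torch.rand(c)

# Channels-last: a 1D pos_weight of length C broadcasts against the last dim.
loss_nhwc = F.binary_cross_entropy_with_logits(
    pred.permute(0, 2, 3, 1),
    label.permute(0, 2, 3, 1),
    pos_weight=pos_weight)

# Channels-first: pos_weight needs trailing singleton dims to line up with C.
loss_nchw = F.binary_cross_entropy_with_logits(
    pred, label, pos_weight=pos_weight[:, None, None])

same = torch.allclose(loss_nhwc, loss_nchw, atol=1e-6)
```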
You can read more about the pos_weight in BCEWithLogitsLoss in another thread here

Related

a cost function that consists of two parts with different second dimension

I have defined a loss function like this:
def my_loss(y_recon, y_real, brain_hidden, brain_real):
    loss = torch.mean((y_recon - y_real)**2 + (brain_hidden - brain_real)**2)
    return loss
The shape of y_recon (and y_real) is batch_size*300 and the shape of brain_hidden (and brain_real) is batch_size*64.
I need to minimize both of these terms. However, written this way, I get the error
The size of tensor a (300) must match the size of tensor b (64) at
non-singleton dimension 1
How can I update the loss function to avoid this error?
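One possible way around this, offered as a sketch rather than the only answer: the two squared-error terms cannot be added element-wise because their second dimensions differ (300 vs 64), so reduce each term to a scalar first and add the scalars. The equal weighting of the two terms and the batch size of 8 below are assumptions of mine.

```python
import torch

def my_loss(y_recon, y_real, brain_hidden, brain_real):
    # Reduce each term to a scalar before adding, so shapes no longer clash.
    recon_loss = torch.mean((y_recon - y_real) ** 2)
    brain_loss = torch.mean((brain_hidden - brain_real) ** 2)
    # Assumption: both terms are weighted equally.
    return recon_loss + brain_loss

batch_size = 8  # arbitrary batch size for the demo
loss = my_loss(torch.rand(batch_size, 300), torch.rand(batch_size, 300),
               torch.rand(batch_size, 64), torch.rand(batch_size, 64))
```

If one term should matter more than the other, multiply it by a scalar weight before the addition.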

Tensorflow CNN for different input size

I'm trying to build a conv network for image regression.
As shown below, one [224 x 224] image has one GT value {x}.
It's easy to train on [224 x 224] images and validate/test with [224 x 224] images.
However, I'd like to apply the CNN to different image sizes.
For example, for a [224 x 229] image, I want to get 5 regression values 'at once'.
I could simply slide a [224 x 224] window 5 times, but that is apparently too slow.
I think using conv layers on different image sizes is possible, but a fully connected layer (FCL) is not.
If I change the image size to [455 x 256], the following error occurs:
lhs shape= [4608,1024] rhs shape= [2048,1024]
Is there any way to handle it?
Fully connected layers have a fixed size input. Thus, changing the input size will cause a wrong-size error.
One way to tackle this problem, and allow for different image sizes is to use a fully convolutional network.
An example with easy numbers:
Assuming for example that the conv layer's output is of size 16x16, you can create a "classifier layer" of size 4x4 with stride 4 that outputs, for each of the 16 non-overlapping 4x4 squares comprising the 16x16 feature map, a single value per dimension. Such a filter would be of size 4x4xn_dim; in your case n_dim will be 5, and the final output would be of size 4x4x5, corresponding to 5 outputs (one for each regression value) for each 4x4 square.
You will notice you can play with the shape of the last conv filter to obtain different sizes for the final output, each output location corresponding to a different part of the input image, while the network as a whole still looks at all of it.
You can work out the numbers for your own example.
You would probably like to read about basic methods for semantic segmentation.
Also see basic fully conv nets.
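To make the numbers above concrete, here is a small sketch of such a convolutional "classifier layer". It is written in PyTorch purely for brevity (so the layout is channels-first, unlike TensorFlow's default), and the number of feature channels is a hypothetical value of mine.

```python
import torch
import torch.nn as nn

n_dim = 5           # regression values per output location
feat_channels = 8   # hypothetical channel count of the conv feature map

# 4x4 kernel with stride 4 replaces the fully connected layer.
head = nn.Conv2d(feat_channels, n_dim, kernel_size=4, stride=4)

# A 16x16 feature map gives a 4x4 grid of outputs, 5 values each.
out_16 = head(torch.rand(1, feat_channels, 16, 16))

# A wider 16x20 feature map works too and simply gives a 4x5 grid.
out_20 = head(torch.rand(1, feat_channels, 16, 20))
```

Because no layer depends on a fixed flattened size, the same weights accept both inputs; only the spatial size of the output changes.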

What is the meaning of the result of model.predict() function for semantic segmentation?

I use Segmentation Models library for multi-class (in my case 4 class) semantic segmentation. The model (UNet with 'resnet34' backbone) is trained with 3000 RGB (224x224x3) images. The accuracy is around 92.80%.
1) Why does the model.predict() function require a (1,224,224,3)-shaped array as input? I didn't find the answer even in the Keras documentation. The code below works without any problem, but I want to understand the reason.
predictions = model.predict( test_image.reshape(-1,224,224,3) );
2) predictions is a (1,224,224,3)-shaped numpy array. Its data type is float32 and it contains floating-point numbers. What is the meaning of the numbers inside this array? How can I visualize them? I assumed that the result array would contain one of the 4 class labels (from 0 to 3) for every pixel, and that I would then apply a color map for each class. In other words, the result should have been a prediction map, but I didn't get one. To better understand what I mean by prediction map, please visit Jeremy Jordan's blog post about semantic segmentation.
result = predictions[0]
plt.imshow(result) # import matplotlib.pyplot as plt
3) What I finally want to do is like Github: mrgloom - Semantic Segmentation Categorical Crossentropy Example did in visualy_inspect_result function.
1) The image input shape in your network architecture is (224,224,3), so width=height=224 with 3 color channels. You need an additional leading dimension so that you can give more than one image at a time to your model, hence (1,224,224,3) or (something,224,224,3).
2) According to the docs of the Segmentation Models repo, you can specify the number of classes you want as output: model = Unet('resnet34', classes=4, activation='softmax'). Your labelled images should then be reshaped to (1,224,224,4), where the last dimension is a mask channel indicating with a 0 or 1 whether pixel i,j belongs to class k. Then you can predict and access each output mask:
masked = model.predict(np.array([im]))[0]
mask_class0 = masked[:,:,0]
mask_class1 = masked[:,:,1]
3) Then you will be able to plot the semantic segmentation using matplotlib, or using the scikit-image color.label2rgb function.
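As a rough illustration of turning the per-class scores into the prediction map asked about in the question, something like the following could work. The predictions array here is random stand-in data with the shape a 4-class softmax model would return, and the colour palette is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for model.predict(...): per-pixel scores for 4 classes.
scores = rng.random((1, 224, 224, 4))
predictions = scores / scores.sum(axis=-1, keepdims=True)

# Pick the most likely class per pixel: shape (224, 224), values 0..3.
class_map = np.argmax(predictions[0], axis=-1)

# Hypothetical palette: one RGB colour per class.
palette = np.array([[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]],
                   dtype=np.uint8)
color_map = palette[class_map]   # shape (224, 224, 3), ready for plt.imshow
```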

converting an array of size (n,n,m) to (None,n,n,m)

I am trying to reshape an array of size (14,14,3) to (None,14,14,3). I have seen that the output of each layer in a convolutional neural network has a shape in the format (None, n, n, m).
Assume that the name of my array is arr.
I tried arr[None,:,:], but it converts it to a shape of (1,14,14,3).
How should I do it?
https://www.tensorflow.org/api_docs/python/tf/TensorShape
A TensorShape represents a possibly-partial shape specification for a Tensor. It may be one of the following:
Partially-known shape: has a known number of dimensions, and an unknown size for one or more dimension. e.g. TensorShape([None, 256])
That is not possible in numpy. All dimensions of a ndarray are known.
arr[None,:,:] notation adds a new size 1 dimension, (1,14,14,3). Under broadcasting rules, such a dimension may be changed to match a dimension of another array. In that sense we often treat the None as a flexible dimension.
I haven't worked with tensorflow, though I see a lot of questions with both tags. tensorflow should have mechanisms for transferring values to and from tensors. It knows about numpy, but numpy does not 'know' anything about tensorflow.
A ndarray is an object with known values, and its shape is used to access those values in a multidimensional way. In contrast a tensor does not have values:
https://www.tensorflow.org/api_docs/python/tf/Tensor
It does not hold the values of that operation's output, but instead provides a means of computing those values
Looks like you can create a TensorProto from an array (and return an array from one as well):
https://www.tensorflow.org/api_docs/python/tf/make_tensor_proto
and to make a Tensor from an array:
https://www.tensorflow.org/api_docs/python/tf/convert_to_tensor
The shape (None,14,14,3) represents (batch_size, imgH, imgW, imgChannel); the order of imgH and imgW can be swapped depending on the network and the problem.
The batch size is given as "None" in the neural network because we don't want to restrict it to a specific value: the batch size depends on many factors, such as the memory available for the model to run.
So let's say you have 4 images of size 14x14x3. You can append each image to an array, say L1, and L1 will then have the shape 4x14x14x3, i.e. you made a batch of 4 images that you can feed to your neural network.
NOTE: here None will be replaced by 4, and for the whole training process it will be 4. Similarly, when you feed your network only one image, it assumes a batch size of 1 and sets None equal to 1, giving you the shape (1x14x14x3).
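On the numpy side, building such a batch might look like the sketch below. The arrays are filled with zeros just as placeholders; the name L1 follows the answer above.

```python
import numpy as np

arr = np.zeros((14, 14, 3))

# A batch containing a single image: both forms give shape (1, 14, 14, 3).
single = arr[None, ...]
also_single = np.expand_dims(arr, axis=0)

# A batch of 4 images stacked along a new leading axis: (4, 14, 14, 3).
images = [np.zeros((14, 14, 3)) for _ in range(4)]
L1 = np.stack(images, axis=0)
```

Either array can be fed to a network whose input shape is (None,14,14,3); the None is filled in by whatever batch size you pass.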

tensorflow - understanding tensor shapes for convolution

Currently trying to work my way through the Tensorflow MNIST tutorial for convolutional networks and I could use some help with understanding the dimensions of the darn tensors.
So we have images of 28x28 pixels in size.
The convolution will compute 32 features for each 5x5 patch.
Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.
Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels.
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
If you say so ...
To apply the layer, we first reshape x to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])
Alright, now I'm getting lost.
Judging by this last reshape, we have
"howevermany" 28x28x1 "blocks" of pixels that are our images.
I guess this makes sense because the images are in greyscale
However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values.
The x32 makes sense, I guess, if we want to infer 32 features per patch
The rest, though, I'm not terribly convinced by.
Why does the weight tensor look the way it apparently does?
(For completeness: we use them
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
where
def conv2d(x,W):
    '''
    2D convolution, expects 4D input x and filter matrix W
    '''
    return tf.nn.conv2d(x,W,strides=[1,1,1,1],padding ='SAME')
def max_pool_2x2(x):
    '''
    max-pooling, using 2x2 patches
    '''
    return tf.nn.max_pool(x,ksize=[1,2,2,1], strides=[1,2,2,1],padding='SAME')
)
Your input tensor has the shape [-1,28,28,1]. Like you mention, the last dimension is 1 because the images are in greyscale. The first index is the batchsize. The convolution will process every image in the batch independently, therefore the batchsize has no influence on the convolution-weight-tensor dimensions, or, in fact, no influence on any weight-tensor dimensions in the network. That is why the batchsize can be arbitrary (-1 signifies arbitrary size in tensorflow).
Now to the weight tensor; you don't have five of 5x1x32-blocks, you rather have 32 of 5x5x1-blocks. Each represents one feature. The 1 is the depth of the patch and is 1 due to the gray scale (it would be 5x5x3x32 for color images). The 5x5 is the size of the patch.
The ordering of dimensions in the data tensors is different from the ordering of dimensions in the convolution weight tensors.
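To see how the two orderings fit together, here is a plain numpy sketch of what a stride-1, 'SAME'-padded convolution does with these shapes. It is meant only to illustrate the shapes, not how tensorflow implements it; the random data and the batch size of 2 are placeholders.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
x_image = rng.random((2, 28, 28, 1))   # [batch, height, width, in_channels]
W = rng.random((5, 5, 1, 32))          # [patch_h, patch_w, in_channels, out_channels]

# 'SAME' padding for a 5x5 patch with stride 1: 2 zero pixels on each side.
x_pad = np.pad(x_image, ((0, 0), (2, 2), (2, 2), (0, 0)))

# Every 5x5 patch of every image: shape [batch, 28, 28, in_channels, 5, 5].
windows = sliding_window_view(x_pad, (5, 5), axis=(1, 2))

# Each of the 32 filters is a 5x5x1 block applied to every patch.
out = np.einsum('nhwcij,ijco->nhwo', windows, W)   # [batch, 28, 28, 32]
```

The batch dimension never appears in W, which is why the batch size can be arbitrary.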
Beside the other answer, I would like to add some more points,
Let's just accept this, for now, and ask ourselves later why 32 features and why 5x5 patches.
There is no specific reason why we choose 5x5 patches or 32 features; all of these parameters are chosen empirically (except in some cases). You may use 3x3 patches or a larger number of features.
I said 'except in some cases' because we may use 3x3 patches to capture finer detail from the images, or a larger number of features to learn each image in more detail ('larger' and 'more detail' are relative terms in this case).
However, if that is the ordering, then our weight tensor is essentially a collection of five 5x1x32 "blocks" of values.
Not exactly. The weight tensor is not a collection of such blocks; it is a filter bank with patch size 5x5, 1 input channel, and 32 output features (channels).
Why does the weight tensor look the way it apparently does?
The weight tensor weight_variable([5, 5, 1, 32]) says: I have a 5x5 patch to apply to an image, 1 input feature (since the images are in grayscale), and 32 output features (channels).
More Details:
So this line tf.nn.conv2d(x,W,strides=[1,1,1,1],padding ='SAME') takes input x with shape [-1,28,28,1]. The -1 means this dimension can have any size you want (the batch size). 28,28 is the input size, which is 28x28 after the reshape, and the last 1 is the number of input channels; since the MNIST images are grayscale, it is 1. In more detail, the input image is a 28x28 2D matrix and each cell of the matrix holds a value indicating the grayscale intensity. If the input images were RGB, we would have 3 channels instead of 1, and the input image would be a 28x28x3 3D matrix: the first of the 3 channels holds the intensity of red, the second the intensity of green, and the third the intensity of blue.
Now tf.nn.conv2d(x,W,strides=[1,1,1,1],padding ='SAME') takes x and applies W (which is a 5x5 patch) to the 28x28 image with step size 1 (since the stride is 1), and gives a result image that is again 28x28 because we use padding='SAME'.
