I was reading the research paper "Fully Convolutional Networks for Semantic Segmentation", and the following is a quote from it:
The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions.
I didn't understand the bold part, but after some research on the internet I have come to the conclusion that if I remove the last layer (the fully connected layer) and then convolve what is now the last layer (previously the second-to-last) with three 1x1 kernels, I will be doing what the bold part describes. Am I correct here?
Why three 1x1 kernels?
Because in the paper they create an RGB heatmap from the original input, and RGB means three channels. The output of the convolutional network (without the fully connected layer) has many channels (it is high dimensional), so I convolve it with three 1x1 kernels to turn it into an RGB image. Image from paper
Suppose you have a 200X200 matrix in the second last layer.
Then if you are going to fully connected layer you will be converting the 200X200 matrix into a single one dimensional array.
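That means an array of size 40000. That is what is meant by throwing away spatial coordinates. The same result can be obtained with a convolution whose kernel covers the entire 200X200 input: each output unit corresponds to one such kernel, and the output is 1x1 spatially with the same values the fully connected layer would produce. A 1x1 kernel is equivalent only when the input to the fully connected layer is already 1x1 spatially, as it is for any later fully connected layers once the first one has been converted this way.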
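To make the equivalence concrete, here is a minimal NumPy sketch. The sizes (a 4x4 feature map with 3 channels and 5 output units) and the helper `conv_valid` are made up for illustration and are not taken from the paper. It reshapes the fully connected weights into kernels that cover the whole input and checks that both computations agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4x4 feature map with 3 channels feeding 5 output units.
H, W, C, n_out = 4, 4, 3, 5
feature_map = rng.normal(size=(H, W, C))

# Fully connected layer: flatten the map, then multiply by a weight matrix.
fc_weights = rng.normal(size=(H * W * C, n_out))
fc_out = feature_map.reshape(-1) @ fc_weights            # shape (n_out,)

# The same weights viewed as n_out kernels of size H x W x C,
# i.e. kernels that cover the entire input region.
kernels = fc_weights.reshape(H, W, C, n_out)

def conv_valid(x, k):
    """'Valid' cross-correlation of x (h, w, c) with k (kh, kw, c, n)."""
    kh, kw, _, n = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow, n))
    for i in range(oh):
        for j in range(ow):
            patch = x[i:i + kh, j:j + kw, :]
            out[i, j] = np.tensordot(patch, k, axes=([0, 1, 2], [0, 1, 2]))
    return out

conv_out = conv_valid(feature_map, kernels)               # shape (1, 1, n_out)

# On a larger input the same kernels slide and give a spatial map of outputs.
bigger = rng.normal(size=(6, 7, C))
heatmap = conv_valid(bigger, kernels)                     # shape (3, 4, n_out)
```

On the original input size the convolution returns a single 1x1 position that matches the fully connected output; on a larger input it returns one such output per position, which is what lets the network produce a coarse heatmap.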
For a university project I need to compare two images I have taken and find the differences between them.
To be precise I monitor a 3d printing process where I take a picture after each printed layer. Afterwards I need to find the outlines of the newly printed part.
The pictures look like this (left layer X, right layer X+1):
I have managed to extract the layer differences with the structural similarity from scikit from this question. Resulting in this image:
The recognized differences match the printed layer nearly 1:1 and seem to be a good starting point for drawing the contours. However, this is where I am currently stuck. I have tried several combinations of thresholding, blurring, findContours, Sobel and Canny operations, but I am unable to produce an accurate outline of the newly printed layer.
Edit:
This is what I am looking for:
Edit2:
I have uploaded the images in the original file size and format here:
Layer X Layer X+1 Difference between the layers
Are there any operations that I haven't tried yet/do not know about? Or is there a combination of operations that could help in my case?
Any help on how to solve this problem would be greatly appreciated!!
I'm creating an encoder-decoder CNN for some images. Each image has a geometric shape around the center - circle, ellipse, etc.
I want my CNN to ignore all the values that are in this shape. All my input data values have been normalized to be around 0-1. I set all the shape values to be 0.
I thought that setting them to zero would mean that they will not be updated, however, the output of my encoder-decoder CNN changes the shape.
What can I do to ensure these values stay put and do not update?
Thank you!
I think you are looking for "partial convolution". This is a work published by Guilin Liu and colleagues that extends convolution to take an input mask as well as an input feature map and apply the convolution only to the unmasked pixels. They also suggest how to compensate for pixels on the boundary of the mask, where the kernel "sees" both valid and masked-out pixels.
Please note that their implementation may have issues running with automatic mixed precision (AMP).
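As a rough illustration of the idea (not the authors' implementation), here is a single-channel NumPy sketch of a partial convolution. The function name `partial_conv2d` and the toy data are assumptions made for this example. Masked pixels are excluded from the sum, the result is rescaled by the ratio of window size to valid pixels, and the mask is updated so a position becomes valid once its window saw at least one valid pixel.

```python
import numpy as np

def partial_conv2d(x, mask, kernel, bias=0.0):
    """Single-channel partial convolution ('valid' padding, stride 1).

    x, mask: arrays of shape (H, W); mask is 1 for valid pixels, 0 for holes.
    kernel:  array of shape (kh, kw).
    Returns the output feature map and the updated mask.
    """
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    new_mask = np.zeros((oh, ow))
    window_size = kh * kw
    for i in range(oh):
        for j in range(ow):
            m = mask[i:i + kh, j:j + kw]
            valid = m.sum()
            if valid > 0:
                patch = x[i:i + kh, j:j + kw] * m
                # Rescale so windows with few valid pixels are not dimmer.
                out[i, j] = (kernel * patch).sum() * (window_size / valid) + bias
                new_mask[i, j] = 1.0
    return out, new_mask

# Toy data: a constant image with one masked pixel holding a junk value.
x = np.ones((5, 5))
x[2, 2] = 100.0
mask = np.ones((5, 5))
mask[2, 2] = 0.0
mean_kernel = np.full((3, 3), 1.0 / 9.0)

out, new_mask = partial_conv2d(x, mask, mean_kernel)
```

With the mean kernel, the junk value under the mask has no influence on the output: every window averages only its valid pixels.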
I'm building a CNN to identify facial keypoints. I want to make the net more robust, so I thought about applying some zoom-out transforms: most pictures have the keypoints in about the same location, so the net doesn't learn much.
My approach:
I want the augmented images to keep the original image size, so I apply MaxPool2d and then random (unequal) padding until the original size is reached.
First question
Is it going to work with simple average padding or zero padding? I'm sure it would be even better if I made the padding look more like a background, but is there a simple way to do that?
Second question
The keypoints are the target vector; they come as a row vector of 30 values. I'm getting confused by the logic needed to transform them to the smaller space.
Generally, if an original point was at (x=5,y=7), it transforms to (x=2,y=3). I'm not sure about this, but I checked manually and so far it's correct. But what should I do if two keypoints end up in the same new pixel? I can't feed the network fewer target values.
That's it. I would be happy to hear your thoughts.
I suggest using torchvision.transforms.RandomResizedCrop as part of your Compose statement, which will give you random zooms AND resize the resulting images to some standard size. This avoids the issues in both of your questions.
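For the manual zoom-out described in the question (pooling followed by padding), the keypoint mapping can be sketched as below. The function name `zoom_out_keypoints` and its parameters are hypothetical. The sketch assumes the targets are regressed as real values, so it keeps fractional coordinates instead of rounding to pixels; that is one way to stop two keypoints from collapsing onto the same target.

```python
import numpy as np

def zoom_out_keypoints(keypoints, pool, pad_left, pad_top):
    """Map (x, y) keypoints through pooling by `pool` followed by padding.

    keypoints: flat sequence [x0, y0, x1, y1, ...] in original pixel coordinates.
    Returns a flat float array of the same length.
    """
    pts = np.asarray(keypoints, dtype=float).reshape(-1, 2)
    pts = pts / pool                     # coordinates shrink with the image
    pts = pts + [pad_left, pad_top]      # shift by the padding added left and on top
    return pts.reshape(-1)

original = [5, 7, 4, 6]                  # two nearby keypoints
mapped = zoom_out_keypoints(original, pool=2, pad_left=10, pad_top=3)

# Rounding down before the padding shift reproduces the (5, 7) -> (2, 3)
# mapping from the question, and shows both points landing in one pixel.
pooled_pixels = np.floor(np.asarray(original).reshape(-1, 2) / 2)
```

The target vector keeps its length whichever way the points move, because each keypoint is transformed independently.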
Goal:
For the past two weeks I've been trying to figure out how to convert the following image:
To one that looks like this (may not match exactly, as this image was taken at a different time):
Lens Correction (necessary?):
The first thing I noticed is that simply slicing the image and overlaying the four parts wouldn't work perfectly, as the curvature of certain lines does not match. For instance, the mid-court line bends left in the second slice and bends right in the third slice. This bending looks like a barrel distortion so I tried using both a parameterized lens correction function (passing k1, k2, and k3 to OpenCV) and using lensfun. Since the lensfun database does not include my camera make or model (it's an AXIS camera) and I do not know the make or model of the lens (it's manufactured as part of the camera), I wrote a small script to dump test images using various lenses with various parameters, then skimmed through the thousands of output images until I found one that looked like it had relatively straight lines:
This correction was done using the "Samyang 12mm f/2.8 Fish-Eye ED AS NCS" lens with a "Canon EOS 10D" camera in lensfun. It's probably not perfect, but I figured it was close enough to move on to step two.
Once the lens distortion was corrected, the second issue is that the same line in two slices was pointing in different directions, which should be corrected with a simple perspective transform. So I began a long quest to figure out the proper parameters for this perspective transform.
Failed Attempts:
1. Using SciPy
I started by writing a cost function to judge the "quality" of a given set of parameters (overlapped pixels should match) and applying SciPy's solver to figure it out. I made several tweaks to my cost function (applying a Gaussian blur, scaling down the image, gray scaling the image, using the Sobel operator to get a gradient, looking only at the pixels on either side of a "seam" after overlapping instead of the whole overlap region, etc) but it always failed to find a good solution. The results looked worse than the original camera image most of the time:
2. Using math
When that failed I tried applying math to compute the proper perspective transform. I know the FOV of the camera (from the spec sheet), I know the image width and height, I know the sensor size (from the spec sheet), and using a protractor I measured the angles between the lenses. Using the pinhole model I then calculated the expected (x,y) values of points on the image plane and what transform would be necessary to correct them. The results looked better than SciPy, but were still dismal.
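For reference, the pinhole-model calculation described above can be sketched as follows. The image size, field of view and angle are placeholders rather than values from the AXIS spec sheet, and the sign convention of the rotation is an assumption. The sketch also assumes the lenses share a common optical center, in which case a pure rotation R between two views gives the homography K R K^-1.

```python
import numpy as np

def intrinsics(width, height, hfov_deg):
    """Pinhole intrinsics with the principal point at the image center."""
    f = (width / 2) / np.tan(np.radians(hfov_deg) / 2)   # focal length in pixels
    return np.array([[f, 0.0, width / 2],
                     [0.0, f, height / 2],
                     [0.0, 0.0, 1.0]])

def yaw_rotation(deg):
    """Rotation about the vertical (y) axis."""
    a = np.radians(deg)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])

def rotation_homography(K, R):
    """Homography induced by a pure camera rotation."""
    return K @ R @ np.linalg.inv(K)

def warp_point(H, x, y):
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]

# Placeholder numbers: 1000x500 image, 90 degree horizontal FOV, 45 degree yaw.
K = intrinsics(1000, 500, 90.0)
H = rotation_homography(K, yaw_rotation(45.0))
center_after = warp_point(H, 500, 250)
```

With these numbers the focal length is 500 px, and the image center lands on the right edge of the frame at (1000, 250), which is a quick sanity check that the angle and FOV are being combined correctly.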
3. Using OpenCV's Stitcher
After this I tried using OpenCV's built-in Stitcher class. However it failed to stitch together slices 2 and 3 due to insufficient overlap between the images (and about 10% of the time it even failed to stitch together slices 1 and 2, presumably because of the non-deterministic nature of RANSAC). Even when it did succeed, the stitch wasn't that great:
4. Using ORB and OpenCV's findHomography
Most recently I tried using ORB with a mask (only looking for features in the overlap region) and OpenCV's findHomography function to create a custom version of the Stitcher. While the matches seemed promising, the resulting stitch was still sub-optimal:
I'm beginning to suspect that my methodology (slice -> lens correct -> perspective transform -> overlay) is flawed and there's a better way to do this.
5. Updated ORB / findHomography
I updated my feature detection to eliminate any matches where the Y coordinates differed drastically (e.g. matching the white of the table to the white of the lights). After doing this my number of matched features fell from ~110 to ~55, but the homography was improved significantly. Here's the stitch that results for slices 1/2 and 2/3 with the update:
Until someone can tell me that I'm going about this all wrong, I'm going to keep pursuing this strategy with the following added step:
Slice image
Lens correct each slice
Perspective transform slice 2 or 3 so that the side line is horizontal and the mid-court line is vertical
Use ORB + match filtering + findHomography to iteratively align and then stitch adjacent slices
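The match-filtering step can be sketched in a few lines of NumPy. The function name `filter_matches_by_y`, the threshold and the sample coordinates are hypothetical; in practice the point arrays would come from the ORB matches.

```python
import numpy as np

def filter_matches_by_y(pts_a, pts_b, max_dy):
    """Keep only matches whose y coordinates differ by at most max_dy pixels.

    pts_a, pts_b: arrays of shape (N, 2) holding matched (x, y) points.
    """
    pts_a = np.asarray(pts_a, dtype=float)
    pts_b = np.asarray(pts_b, dtype=float)
    keep = np.abs(pts_a[:, 1] - pts_b[:, 1]) <= max_dy
    return pts_a[keep], pts_b[keep]

# Hypothetical matches: the middle one pairs points at very different heights.
pts_a = [[10, 100], [20, 105], [30, 400]]
pts_b = [[5, 102], [15, 300], [25, 398]]
good_a, good_b = filter_matches_by_y(pts_a, pts_b, max_dy=20)
```

The surviving point pairs are what would then be handed to findHomography.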
Ultimately when it's all said and done I want to try and compute a mapping from input pixels to output pixels so that we're not doing all of this complex work (lens correction, ORB, findHomography, etc) per-frame. We'll do it once per camera, save the mapping to a file somewhere, then we can in real-time map the input video to an output video frame-by-frame using cv2.remap
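A minimal sketch of that idea, using plain NumPy as a stand-in for cv2.remap: the maps are computed once from a homography (here a made-up translation) and then applied to every frame with a nearest-neighbour lookup. The function names `build_map` and `apply_map` are hypothetical. A real pipeline would also fold the lens correction into the same maps.

```python
import numpy as np

def build_map(H_out_to_in, out_h, out_w):
    """For every output pixel, precompute which input pixel it samples."""
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # (3, N)
    src = H_out_to_in @ pts
    map_x = (src[0] / src[2]).reshape(out_h, out_w)
    map_y = (src[1] / src[2]).reshape(out_h, out_w)
    return map_x.astype(np.float32), map_y.astype(np.float32)

def apply_map(frame, map_x, map_y):
    """Nearest-neighbour lookup; pixels sampled outside the frame become 0."""
    xi = np.rint(map_x).astype(int)
    yi = np.rint(map_y).astype(int)
    inside = (xi >= 0) & (xi < frame.shape[1]) & (yi >= 0) & (yi < frame.shape[0])
    out = np.zeros(map_x.shape + frame.shape[2:], dtype=frame.dtype)
    out[inside] = frame[yi[inside], xi[inside]]
    return out

# Made-up example: each output pixel (x, y) samples input pixel (x + 2, y).
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
map_x, map_y = build_map(H, out_h=4, out_w=5)      # done once per camera

frame = np.arange(20).reshape(4, 5)
remapped = apply_map(frame, map_x, map_y)          # done once per frame
```

The float32 `map_x` and `map_y` arrays have the form cv2.remap takes, so with OpenCV available the per-frame step could be replaced by that call to get proper interpolation.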
Note:
The second image I posted showing the "expected output" comes directly from the camera in question. It can be configured to return the first image at 30 fps, or the second image at 10 fps. We wish to perform the stitching off-camera on a more powerful computer so we can get 30 fps but still have the single image.
AXIS provides an SDK for doing the stitching off-camera, but this SDK is Windows-only and most of our tech stack is Linux and most of our development machines are Mac OS. I have used a Windows computer to try and look into the stitching SDK they provide, however I had no luck getting it to compile and run. Their sample code kept throwing errors and I've never had any luck getting Visual Studio or C++ to play nicely for me.
My suggestion is to train an autoencoder. Use the first image as input and the second one as an output, as in a denoising autoencoder:
Note that you may lose resolution if the bottleneck you create in the middle layer is too small.
Also, Variational autoencoders present a latent vector but work following the same principle.
You can adapt this code:
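from keras import Input
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Activation,
                          UpSampling2D, MaxPooling2D, Reshape)
from keras.optimizers import SGD

# input_shape, learning_rate, momentum, x_train_noisy and x_train must be
# defined beforehand. The final Reshape assumes input_shape = (28, 28, 1).
denoise = Sequential()
denoise.add(Input(shape=input_shape))
denoise.add(Conv2D(20, (3, 3), padding='valid'))
# The original used BatchNormalization(mode=2); the mode argument no longer exists.
denoise.add(BatchNormalization())
denoise.add(Activation('relu'))
denoise.add(UpSampling2D(size=(2, 2)))
denoise.add(Conv2D(20, (3, 3), kernel_initializer='glorot_uniform'))
denoise.add(BatchNormalization())
denoise.add(Activation('relu'))
denoise.add(Conv2D(20, (3, 3), kernel_initializer='glorot_uniform'))
denoise.add(BatchNormalization())
denoise.add(Activation('relu'))
denoise.add(MaxPooling2D(pool_size=(3, 3)))
denoise.add(Conv2D(4, (3, 3), kernel_initializer='glorot_uniform'))
denoise.add(BatchNormalization())
denoise.add(Activation('relu'))
denoise.add(Reshape((28, 28, 1)))
# The original also passed decay=decay_rate, which recent Keras versions no
# longer accept; a learning-rate schedule can be used instead.
sgd = SGD(learning_rate=learning_rate, momentum=momentum, nesterov=False)
# Accuracy is not meaningful for pixel regression, so only the loss is tracked.
denoise.compile(loss='mean_squared_error', optimizer=sgd)
denoise.summary()
denoise.fit(x_train_noisy, x_train, epochs=50, batch_size=30, verbose=1)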
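When adapting the code to another image size, it helps to walk the spatial size through the layers, since the final Reshape only works if the element counts match. The sketch below assumes a 28x28x1 input (implied by the Reshape) and 'valid' padding for every convolution; the helper names are made up for this example.

```python
def conv_valid(size, kernel):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def upsample(size, factor):
    return size * factor

def max_pool(size, pool):
    """Non-overlapping pooling; incomplete windows are dropped."""
    return size // pool

size = 28
size = conv_valid(size, 3)    # 26
size = upsample(size, 2)      # 52
size = conv_valid(size, 3)    # 50
size = conv_valid(size, 3)    # 48
size = max_pool(size, 3)      # 16
size = conv_valid(size, 3)    # 14

channels = 4                           # filters in the last convolution
elements = size * size * channels      # 14 * 14 * 4 = 784 = 28 * 28 * 1
```

For a different input size, the number of filters in the last convolution or the target shape of the Reshape has to change so that the two counts agree again.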
Keras' Conv3D expects a 5D input tensor with shape (batch, conv_dim1, conv_dim2, conv_dim3, channels) (assuming data_format is "channels_last"). Now say my filter size is (3,3,3) and my input is (10,125,300,200,3): a video dataset of 10 videos, each with 125 frames, a spatial size of 300x200, and 3 channels because the frames are RGB. The default stride value is (1, 1, 1). The picture in my head of how this convolution works is as shown here at 9:28.
What I can't figure out is whether the stride of 1 along the temporal dimension moves 1 frame at a time or 1 channel of a frame at a time. I tried to look up the code of Conv3D here and couldn't gather much. I tried training a deep learning network using 3D CNNs and RGB videos, and the resulting images have messed-up colours (almost grey), so I'm assuming something is going wrong with the colour channels. I checked my input and it seems fine, so the problem is probably in the network.
TL;DR
I need to figure out whether RGB videos need conscious changes in strides so that the channels of one frame are treated with 2D convolution. I would also be grateful for pointers to code/papers dealing with RGB videos and 3D CNNs.
In all convolutions, the filter size encompasses all channels together. Channels do not participate in strides.
So, strides happen as if your video was a cube. Stride 1 step in each dimension (x,y,z) until the entire cube is swept. (The convolution has no idea of what the dimensions are, and will not treat frames any differently from how it treats pixels.)
You have a little 3x3x3 cube sweeping a huge 125x300x200 parallelepiped, pixel by pixel, frame by frame. So: the stride moves one frame at a time, but considering only a 3x3 segment of the image.
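A small NumPy sketch of that sweep, for a single filter with 'valid' padding and stride 1. The function name `conv3d_valid` and the tiny video shape are made up to keep the loop fast. Note that the kernel carries a channel axis of its own, so every step sums over all three colour channels at once.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """video: (frames, height, width, channels); kernel: (kt, kh, kw, channels).

    Returns one feature map of shape (frames-kt+1, height-kh+1, width-kw+1).
    """
    kt, kh, kw, _ = kernel.shape
    T, H, W, _ = video.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):            # stride 1 in time: one frame per step
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                block = video[t:t + kt, y:y + kh, x:x + kw, :]
                out[t, y, x] = np.sum(block * kernel)   # all channels summed together
    return out

# Small stand-in for one (125, 300, 200, 3) video.
video = np.random.default_rng(0).random((6, 8, 7, 3))
kernel = np.ones((3, 3, 3, 3))
feature_map = conv3d_valid(video, kernel)    # shape (4, 6, 5)
```

By the same arithmetic, one (125, 300, 200, 3) video with a (3, 3, 3) kernel and 'valid' padding gives a (123, 298, 198) map per filter; the channel axis never appears as a dimension the kernel steps along.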
This doesn't "seem" good for videos (but machine learning has its surprises), unless at some point you have a very tiny resolution, so that a filter starts seeing the whole picture in each frame.
You can keep testing the 3D convs to see what happens, but a few suggestions that "seem" better are:
Use TimeDistributed(Conv2D(...)) and TimeDistributed(MaxPooling2D(...)) until you get a small resolution video in the middle of the model (or even a 1x1, if you're going extreme). Then start using:
Conv3D if there are still spatial dimensions
Conv1D if you eliminated the spatial dimensions
In both cases, it's a good idea to increase the kernel size in the frames dimension, 3 frames may be too little to interpret what is happening (unless you have a low frame rate)
Use TimeDistributed(Conv2D(...)), eliminate the spatial dimensions at some point and start using RNNs like LSTM
Use ConvLSTM2D layers.