In this blog post I will talk about the paper "Fully Convolutional Networks for Semantic Segmentation".
Let's first understand the semantic segmentation problem. Semantic segmentation classifies each pixel into one of the classes (including background), without differentiating between instances of an object.
*Figure: Semantic Segmentation*
In the above image we can see two cows, but semantic segmentation does not differentiate between instances of connected similar objects. One can use instance segmentation to segment the two cows separately.
The Fully Convolutional Networks for Semantic Segmentation paper mainly talks about two things:
- First, a fully convolutional network trained end-to-end, pixels-to-pixels, that takes input images of arbitrary size and produces correspondingly-sized output.
- Second, a novel "skip" architecture that combines semantic information from a deep, coarse layer with local appearance information from a shallow, fine layer to produce accurate and detailed segmentations.
From Classification Network to FCN for Semantic Segmentation
Let's see what we mean by a fully convolutional network. Usually, in a classification network like AlexNet, VGG or GoogLeNet, the input image goes through multiple convolution blocks, followed by fully connected layers and then the output layer.
Because of the fully connected layers, these networks require input images of a fixed size, since a fully connected layer takes an input of fixed dimension.
In an FCN, the authors replace these fully connected layers with convolutional layers (fc6 becomes a 7x7 convolution and fc7 a 1x1 convolution, as detailed below), making the network fully convolutional.
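To make this concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code) showing that a fully connected layer over a 7x7x512 feature map is just a 7x7 convolution with the same weights, reshaped:

```python
import torch
import torch.nn as nn

# A fully connected layer over a 7x7x512 feature map is equivalent to a
# 7x7 convolution whose filter holds the same (reshaped) weights.
fc = nn.Linear(512 * 7 * 7, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data.copy_(fc.weight.data.view(4096, 512, 7, 7))
conv.bias.data.copy_(fc.bias.data)

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))        # shape (1, 4096)
out_conv = conv(x)               # shape (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True
```

Unlike the fully connected layer, the convolution also works on larger inputs, sliding over the feature map and producing a spatial grid of outputs.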
In a classification network we usually have N output nodes for N classes in the output layer, but for semantic segmentation we want a pixel mask as output, with the same spatial dimensions as the input. For this, the authors remove the final output layer of the classification network and use a deconvolution (upsampling) layer to produce the output mask.
Don't get confused by the name deconvolution: it is not the inverse of convolution. In deep learning, deconvolution (sometimes also called backwards convolution, up-convolution or transposed convolution) is simply used for upsampling.
Note that the deconvolution filter in such a layer need not be fixed (e.g. to bilinear upsampling) but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.
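As a rough sketch of such a layer (assuming PyTorch; the `bilinear_kernel` helper is mine, not from the paper), a transposed convolution can be initialized to bilinear upsampling and then learned further during training:

```python
import numpy as np
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    # Build a (channels, channels, k, k) weight that performs bilinear
    # upsampling, each channel independently of the others.
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((channels, channels, kernel_size, kernel_size), dtype=np.float32)
    weight[range(channels), range(channels)] = filt
    return torch.from_numpy(weight)

# 2x upsampling of 21-channel score maps: kernel 4, stride 2 doubles the spatial size.
up2 = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
up2.weight.data.copy_(bilinear_kernel(21, 4))  # initialized to bilinear, but learnable
```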
So far we have seen that two changes are required to turn a fully connected classification network into a fully convolutional network for semantic segmentation. Let's see how to make these changes.
FCN-32
First, recall the architecture of the VGG network.
*Figure: VGG original architecture*
After the fifth max pooling layer the output feature map has dimension 7x7x512, and after that there are two fully connected layers.
To use only convolutional layers, we replace the original fc6 layer with a convolutional layer of filter size 7x7 (7x7x512x4096), which produces a 7x7x4096 feature map, and replace fc7 with a convolutional layer of filter size 1x1 (1x1x4096x4096), which also produces a 7x7x4096 feature map. A final 1x1 convolution then predicts scores for the 21 classes, and a 32x upsampling (deconvolution) layer expands these coarse scores to the input resolution; this network is FCN-32.
*Figure: FCN-32*
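Putting these pieces together, a minimal FCN-32 sketch in PyTorch might look as follows. This is my own approximation, not the authors' Caffe model: in particular I pad fc6 to keep the 7x7 grid, whereas the reference implementation pads the first convolution by 100 pixels and crops at the end.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FCN32(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.features = vgg16().features                            # conv1 ... pool5
        self.fc6 = nn.Conv2d(512, 4096, kernel_size=7, padding=3)   # replaces fc6
        self.fc7 = nn.Conv2d(4096, 4096, kernel_size=1)             # replaces fc7
        self.score = nn.Conv2d(4096, num_classes, kernel_size=1)    # per-class scores
        # Learned 32x upsampling back to the input resolution.
        self.up32 = nn.ConvTranspose2d(num_classes, num_classes,
                                       kernel_size=64, stride=32,
                                       padding=16, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout2d()

    def forward(self, x):
        h = self.features(x)                   # (N, 512, H/32, W/32)
        h = self.drop(self.relu(self.fc6(h)))  # (N, 4096, H/32, W/32)
        h = self.drop(self.relu(self.fc7(h)))
        h = self.score(h)                      # (N, 21, H/32, W/32)
        return self.up32(h)                    # (N, 21, H, W)

mask = FCN32()(torch.randn(1, 3, 224, 224))    # torch.Size([1, 21, 224, 224])
```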
Fusing these coarse predictions with the outputs of shallower layers gives us the other variants of FCN, i.e. FCN-16 and FCN-8.
FCN-16
For FCN-16, the output of pool4 is convolved with a 1x1 convolution to get class-specific predictions for the 21 classes. This prediction is fused (by element-wise addition) with the 2x upsampled output of conv7, and 16x upsampling is performed on the fused output to get the final segmentation mask.
*Figure: FCN-32, FCN-16 and FCN-8*
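A minimal sketch of this fusion step (assuming PyTorch; the tensor names are mine, and dummy tensors stand in for the real backbone features):

```python
import torch
import torch.nn as nn

N, H, W = 1, 224, 224
pool4 = torch.randn(N, 512, H // 16, W // 16)    # stand-in for VGG pool4 features
conv7 = torch.randn(N, 4096, H // 32, W // 32)   # stand-in for conv7 features

score_pool4 = nn.Conv2d(512, 21, kernel_size=1)(pool4)    # class scores from pool4
score_conv7 = nn.Conv2d(4096, 21, kernel_size=1)(conv7)   # class scores from conv7

# 2x upsample the coarse conv7 scores so the two grids align, then fuse by addition.
up2 = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
fused = score_pool4 + up2(score_conv7)           # (N, 21, H/16, W/16)

# 16x upsample the fused scores to the full resolution.
up16 = nn.ConvTranspose2d(21, 21, kernel_size=32, stride=16, padding=8, bias=False)
mask = up16(fused)                               # (N, 21, H, W)
```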
FCN-8
Similarly for FCN-8, pool3 is convolved with a 1x1 convolution to get class-specific predictions for the 21 classes, and this output is fused with the 2x upsampled (last) feature map of FCN-16, i.e. 2x pool4 and 4x conv7. After that, 8x upsampling is performed on the fused output to get the final segmentation mask.
*Figure: Refining fully convolutional nets by fusing information from layers with different strides improves segmentation detail*
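The corresponding FCN-8 step, again as a sketch with dummy tensors:

```python
import torch
import torch.nn as nn

N, H, W = 1, 224, 224
fused16 = torch.randn(N, 21, H // 16, W // 16)   # FCN-16 fused scores (pool4 + 2x conv7)
pool3 = torch.randn(N, 256, H // 8, W // 8)      # stand-in for VGG pool3 features

score_pool3 = nn.Conv2d(256, 21, kernel_size=1)(pool3)

# 2x upsample the FCN-16 map (conv7 is now at 4x, pool4 at 2x) and add pool3 scores.
up2 = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
fused8 = score_pool3 + up2(fused16)              # (N, 21, H/8, W/8)

# 8x upsample the result to the full resolution.
up8 = nn.ConvTranspose2d(21, 21, kernel_size=16, stride=8, padding=4, bias=False)
mask = up8(fused8)                               # (N, 21, H, W)
```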
Results
Pixel accuracy, mean accuracy, mean intersection over union (IU) and frequency weighted IU are reported for these three networks.
*Figure: Comparison of skip FCNs on a subset of PASCAL VOC 2011 validation*
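For reference, here is one way to compute these four metrics from a confusion matrix, following the formulas in the paper (the function is my own sketch, not from any particular library):

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes):
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    cm = np.bincount(num_classes * target.flatten() + pred.flatten(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)                         # n_ii: pixels of class i predicted as i
    t = cm.sum(axis=1)                       # t_i: total pixels of class i
    with np.errstate(divide="ignore", invalid="ignore"):
        iu = tp / (t + cm.sum(axis=0) - tp)  # per-class intersection over union
        per_class_acc = tp / t               # NaN for classes absent from target

    pixel_acc = tp.sum() / cm.sum()
    mean_acc = np.nanmean(per_class_acc)
    mean_iu = np.nanmean(iu)
    fw_iu = np.nansum(t * iu) / cm.sum()     # frequency weighted IU
    return pixel_acc, mean_acc, mean_iu, fw_iu

pred = np.random.randint(0, 21, (224, 224))
target = np.random.randint(0, 21, (224, 224))
print(segmentation_metrics(pred, target, 21))
```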
That's all for this blog post. Thanks for reading.
References
- https://arxiv.org/pdf/1411.4038.pdf
- http://cs231n.stanford.edu/slides/2016/winter1516_lecture13.pdf