
Sunday, June 20, 2021

Fully Convolutional Networks for Semantic Segmentation

In this blog post I will talk about the paper Fully Convolutional Networks for Semantic Segmentation.

Let's first understand what the semantic segmentation problem is. Semantic segmentation classifies each pixel of an image into one of a set of classes (including background), without differentiating between instances of an object.

Semantic Segmentation

In the above image we can see two cows, but semantic segmentation does not differentiate between instances of connected similar objects. One can use instance segmentation to segment the two cows separately.

The Fully Convolutional Networks for Semantic Segmentation paper mainly talks about two things:

  • First, it uses a fully convolutional network trained end-to-end, pixels-to-pixels, which takes input images of arbitrary size and produces correspondingly sized output.
  • Second, it uses a novel "skip" architecture that combines semantic information from a deep, coarse layer with local appearance information from a shallow, fine layer to produce accurate and detailed segmentations.

 

Classification Network to FCN Network for Semantic Segmentation

Let's see what we mean by a fully convolutional network. Usually, in classification networks like AlexNet, VGG and GoogLeNet, the input image goes through multiple convolution blocks followed by fully connected layers and then the output layer.

Because of the fully connected layers, these networks require images of fixed input dimensions, since a fully connected layer only accepts input of a fixed size.

In FCN, the authors replace these fully connected layers with convolutional layers, making the network fully convolutional.

In a classification network we usually have N output nodes for N classes in the output layer, but for semantic segmentation we want a pixel mask as output, with the same spatial dimensions as the input. For this, the authors remove the final output layer of the classification network and use a deconvolution (upsampling) layer to produce the output mask.

Don't get confused by the name deconvolution: it is not the reverse process of convolution. In deep learning, deconvolution, sometimes also called backwards convolution, up-convolution or transposed convolution, is simply used for upsampling.

Note that the deconvolution filter in such a layer need not be fixed (e.g. to bilinear upsampling) but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.
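For example, here is a minimal PyTorch sketch (not the authors' original Caffe code) of a learnable 2x upsampling layer whose weights are initialized to bilinear interpolation; the helper function and the 21-channel/kernel-size-4 choices are my own illustrative assumptions:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Build a (channels, 1, k, k) bilinear-interpolation kernel,
    one filter per channel (used with groups=channels)."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt_1d = 1 - torch.abs(og - center) / factor
    filt_2d = filt_1d[:, None] * filt_1d[None, :]
    weight = torch.zeros(channels, 1, kernel_size, kernel_size)
    weight[:, 0] = filt_2d
    return weight

# 2x upsampling of a 21-channel score map: initialized to bilinear
# interpolation, but trainable like any other layer.
up2 = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1,
                         groups=21, bias=False)
with torch.no_grad():
    up2.weight.copy_(bilinear_kernel(21, 4))

x = torch.randn(1, 21, 16, 16)
print(up2(x).shape)  # torch.Size([1, 21, 32, 32])
```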

So far we have seen that two changes are required to turn a fully connected classification network into a fully convolutional network for semantic segmentation. Let's see how to make these changes.

FCN-32

First recall the architecture of VGG network.

VGG original architecture

After the fifth max pooling layer, the output feature map dimension is 7x7x512 (for a 224x224 input), and after that there are two fully connected layers.

In order to use only convolutional layers, we replace the original fc6 layer with a convolutional layer of filter size 7x7 (7x7x512x4096), which (with padding to preserve the spatial size) produces a 7x7x4096 feature map, and replace fc7 with a convolutional layer of filter size 1x1 (1x1x4096x4096), which produces a 7x7x4096 feature map.
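As an illustration, here is a rough PyTorch sketch of this "convolutionalization" using torchvision's VGG-16; the padding=3 assumption keeps the 7x7 spatial size used in this post, and the weight reshapes show that the converted layers simply reuse the fully connected weights:

```python
import torch.nn as nn
import torchvision.models as models

vgg = models.vgg16(weights=None)  # backbone; pretrained weights optional

# fc6: Linear(512*7*7 -> 4096) becomes a 7x7 convolution whose weights
# are the same numbers reshaped; padding=3 keeps the 7x7 spatial size.
fc6 = nn.Conv2d(512, 4096, kernel_size=7, padding=3)
fc6.weight.data.copy_(vgg.classifier[0].weight.view(4096, 512, 7, 7))
fc6.bias.data.copy_(vgg.classifier[0].bias)

# fc7: Linear(4096 -> 4096) becomes a 1x1 convolution.
fc7 = nn.Conv2d(4096, 4096, kernel_size=1)
fc7.weight.data.copy_(vgg.classifier[3].weight.view(4096, 4096, 1, 1))
fc7.bias.data.copy_(vgg.classifier[3].bias)
```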

After that, this output feature map (7x7x4096) of the new fc7 layer passes through one more 1x1 (1x1x4096x21) convolutional layer to give an individual prediction mask for each of the 21 PASCAL classes (including background). Finally, a deconvolution layer is used to upsample the output 32 times, from 7x7 to 224x224.
 
 
Note that for an input of any arbitrary dimension (say 512x512), the feature map after the last pooling layer will be 1/32 of that size (16x16), and after the deconvolution layer (upsampling by 32x) the output mask will have the same dimensions (512x512) as the input.

This network is what we call FCN-32.

FCN-32
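Putting these pieces together, below is a minimal, hypothetical PyTorch sketch of the FCN-32 head; the toy backbone stand-in and the exact deconvolution hyperparameters (64x64 kernel, stride 32) are my assumptions for illustration, not the paper's verbatim configuration. The usage line at the end demonstrates the arbitrary-input-size property noted above:

```python
import torch
import torch.nn as nn

class FCN32(nn.Module):
    """Minimal FCN-32 sketch: stride-32 features -> class scores ->
    learnable 32x upsampling. `backbone` is assumed to map an
    (N, 3, H, W) image to (N, 512, H/32, W/32) features."""
    def __init__(self, backbone, num_classes=21):
        super().__init__()
        self.backbone = backbone
        self.fc6 = nn.Conv2d(512, 4096, kernel_size=7, padding=3)
        self.fc7 = nn.Conv2d(4096, 4096, kernel_size=1)
        self.score = nn.Conv2d(4096, num_classes, kernel_size=1)
        self.up32 = nn.ConvTranspose2d(num_classes, num_classes,
                                       kernel_size=64, stride=32,
                                       padding=16, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.backbone(x)              # (N, 512, H/32, W/32)
        f = self.relu(self.fc6(f))
        f = self.relu(self.fc7(f))
        return self.up32(self.score(f))   # back to (N, 21, H, W)

# Toy stand-in backbone, just to make the sketch runnable end-to-end.
backbone = nn.Sequential(nn.Conv2d(3, 512, 3, padding=1), nn.MaxPool2d(32))
model = FCN32(backbone)

# Any input whose sides are multiples of 32 works unchanged.
print(model(torch.randn(1, 3, 512, 512)).shape)  # (1, 21, 512, 512)
```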
 
The FCN-32 network outputs a segmentation mask for the input, but the output maps are coarse because of the 32-pixel stride at the final prediction layer, which limits the scale of detail in the upsampled output.
 
To overcome this, the authors use skip connections, which combine semantic information from deep layers with spatial location information from shallow layers to produce accurate and detailed segmentations.

This gives us different variants of FCN, i.e. FCN-16 and FCN-8.

FCN-16

For FCN-16, the output of pool4 is convolved with a 1x1 convolution to get class-specific predictions over 21 channels. This predicted output is fused (added) with the 2x upsampled output of conv7, and 16x upsampling is performed on the fused output to get the final segmentation mask.
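Here is a rough PyTorch sketch of this fusion, with assumed feature map shapes for a 512x512 input (pool4: 512 channels at 32x32, conv7: 4096 channels at 16x16); the layer names are mine, not the paper's:

```python
import torch
import torch.nn as nn

num_classes = 21

# 1x1 scoring layers and learnable upsampling layers.
score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)
score_conv7 = nn.Conv2d(4096, num_classes, kernel_size=1)
up2 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2,
                         padding=1, bias=False)
up16 = nn.ConvTranspose2d(num_classes, num_classes, 32, stride=16,
                          padding=8, bias=False)

pool4 = torch.randn(1, 512, 32, 32)    # assumed pool4 feature map
conv7 = torch.randn(1, 4096, 16, 16)   # assumed conv7 feature map

fused = score_pool4(pool4) + up2(score_conv7(conv7))  # both 32x32 now
mask = up16(fused)                                    # 16x upsampling
print(mask.shape)  # torch.Size([1, 21, 512, 512])
```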

FCN-32, FCN-16 and FCN-8

FCN-8

Similarly for FCN-8, pool3 is convolved with a 1x1 convolution to get class-specific predictions for the 21 classes, and this output is fused with the 2x upsampled (final) feature map of FCN-16, i.e. 2x pool4 and 4x conv7. After that, 8x upsampling is performed on the fused output to get the final segmentation mask.
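Extending the FCN-16 sketch above with one more skip, under the same assumed shapes (pool3 of a 512x512 input is 64x64 with 256 channels in VGG-16):

```python
# Continuing the FCN-16 sketch above: fuse pool3 as well, then upsample 8x.
score_pool3 = nn.Conv2d(256, num_classes, kernel_size=1)
up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8,
                         padding=4, bias=False)

pool3 = torch.randn(1, 256, 64, 64)   # assumed pool3 feature map

# Reusing `up2` here for brevity; a real model would train a
# separate 2x layer for this stage.
fused8 = score_pool3(pool3) + up2(fused)  # carries 2x pool4 + 4x conv7
mask = up8(fused8)                        # 8x upsampling
print(mask.shape)  # torch.Size([1, 21, 512, 512])
```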
 
As FCN-8 has both spatial information and deep semantic information, it performs better than FCN-16 and FCN-32.

Refining fully convolutional nets by fusing information from layers with different strides improves segmentation detail

Results

Pixel accuracy, mean accuracy, mean intersection over union (IU) and frequency weighted IU are reported for these three networks.

 

Comparison of skip FCNs on a subset of PASCAL VOC 2011 validation

That's all for this blog post. Thanks for reading!

References

  • Fully Convolutional Networks for Semantic Segmentation: https://arxiv.org/pdf/1411.4038.pdf
  • Stanford CS231n slides (Winter 2016, Lecture 13): http://cs231n.stanford.edu/slides/2016/winter1516_lecture13.pdf
