
Tuesday, July 14, 2020

Fast R-CNN: Object Detection

In this blog I will cover Fast R-CNN (Fast Region-based Convolutional Neural Network), an object detection algorithm. Fast R-CNN improves on its base architecture, R-CNN, in terms of detection accuracy and in training and testing time. If you don't know about the R-CNN model, please read about it in my previous blog.

I will be covering the following points from the Fast R-CNN paper.
  • Advantages of Fast R-CNN over R-CNN
  • Fast R-CNN Architecture
  • RoI Pooling Layer
  • VGG-16 fine-tuning for Fast R-CNN
  • Loss Function
  • Main Results
  • Miscellaneous
Let's get started with Fast R-CNN. 

Advantages of Fast R-CNN over R-CNN

  • Higher detection quality (mAP) than R-CNN: it achieves 66% mAP on PASCAL VOC 2012
  • Training is single-stage: candidate objects are classified and their spatial locations refined jointly, using a multi-task loss function
  • Fast R-CNN with VGG-16 as the base network trains 9 times faster than R-CNN and is 213 times faster at test time
  • No disk storage is required for feature caching

Fast R-CNN Architecture

Let's recall the R-CNN architecture. As we saw in my previous blog, the R-CNN network has three components: first, region proposal, which generates about 2000 region proposals; second, feature extraction for each region proposal by passing it through a CNN; and third, region classification using an SVM plus bounding-box (b-box) refinement using b-box regression.

Fast R-CNN uses the same region proposal method as R-CNN, but instead of passing every region proposal through the CNN, it first generates a feature map of the whole image by passing it through the CNN once, and then extracts features for each proposal from that feature map using the RoI pooling layer.

These features are then fed into a sequence of fully connected (FC) layers that finally branch into two sibling output layers: one produces softmax probability estimates over K+1 object classes (where K is the number of classes, plus one for the background), and the other outputs four real-valued numbers for each of the K object classes. These four values encode the refined bounding-box position for one of the K classes.

Fast R-CNN architecture
R-CNN: Image + region proposals (~2K) -> 2K cropped regions -> CNN features for the 2K cropped regions, computed by passing each region through the CNN -> SVM + b-box regression.

Fast R-CNN: Image + region proposals (~2K) -> CNN feature map for the whole image -> mapping all region proposals onto the image feature map -> extracting features for each proposal using the RoI pooling layer -> two separate output layers for classification and b-box refinement (sketched in code below).
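To make the two sibling output layers concrete, here is a minimal PyTorch sketch of the head that sits on top of the pooled RoI features. This is my own illustration, not the paper's code; the 4096-d FC sizes and 7x7 RoI feature size follow the VGG-16 setting, and the class and attribute names are assumptions.

```python
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Shared FC layers that branch into two sibling output layers."""
    def __init__(self, in_features=512 * 7 * 7, num_classes=21):  # K=20 classes + 1 background
        super().__init__()
        self.fcs = nn.Sequential(
            nn.Linear(in_features, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        # Sibling 1: raw scores over K+1 classes (softmax is applied in the loss)
        self.cls_score = nn.Linear(4096, num_classes)
        # Sibling 2: 4 real-valued box offsets for each of the K object classes
        self.bbox_pred = nn.Linear(4096, 4 * (num_classes - 1))

    def forward(self, roi_features):           # roi_features: (N, 512, 7, 7)
        x = self.fcs(roi_features.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)
```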

RoI Pooling Layer

Let's see what the RoI pooling layer is used for.
As we know, a fully connected layer always takes a fixed-length input. Therefore, in R-CNN we warped each region proposal to a fixed size (227x227) before passing it to the CNN, so that the CNN produced a feature map of fixed shape and size, giving a fixed-length input to the FC layers.

Fast R-CNN takes an image and the region proposals for that image as input, and computes the features for all proposals simultaneously by calculating the feature map for the whole image and projecting all the region proposals onto that feature map.
The feature map extracted for each proposal must have a fixed size, because we have to pass these features to FC layers for classification and for refining the object's boundary. The RoI pooling layer is what resizes the feature map of each proposal to a fixed size.

RoI max pooling works by dividing the H x W RoI region (region proposal) into an h x w grid of sub-windows of approximate size H/h x W/w, and then max-pooling the values in each sub-window into the corresponding output grid cell.

Let's understand RoI pooling with an example where the conv feature map size is 8x8 and the fixed RoI output size for each proposal is 2x2. Assume that after applying multiple conv blocks to the input image we get an 8x8xC feature map, where C is the depth of the feature map, i.e. the number of channels.

Pooling is applied to each channel independently, so for now we will ignore the depth/channels.

1. 8x8 feature map


2. Now let's say we have one object proposal for the input image, and after mapping it onto the feature map it covers a 6x5 area, from index (3,2) to (8,6).


3. Since our RoI pooling output size in this case is 2x2, we divide the 6x5 region into a 2x2 grid of cells and apply max pooling within each cell.


4. We select the maximum value (max pooling) from each sub-cell. This 2x2 grid is the output of the RoI pooling layer.


5. RoI pooling is applied to every channel of the feature map, so the final output is a 2x2xC feature map, and this is the input to the FC layers. A code sketch of the whole procedure follows.
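Here is a minimal NumPy sketch of RoI max pooling on a single channel, matching the example above. This is my own illustration; real implementations differ in how they round the sub-window boundaries, and the coordinates below are the 0-indexed version of the (3,2)-(8,6) box.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """RoI max pooling on one channel of a feature map.

    feature_map: 2D array, e.g. an 8x8 conv feature map.
    roi: (x1, y1, x2, y2), inclusive indices on the feature map.
    output_size: (h, w) of the pooled output.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2 + 1, x1:x2 + 1]            # 5 rows x 6 cols here
    out_h, out_w = output_size
    # Split the region into an out_h x out_w grid of roughly equal sub-windows
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            sub = region[row_edges[i]:row_edges[i + 1],
                         col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = sub.max()                      # max over each sub-window
    return pooled

fmap = np.random.rand(8, 8)
print(roi_max_pool(fmap, roi=(2, 1, 7, 5)))               # -> 2x2 pooled output
```

For a batched GPU version, torchvision provides an equivalent operator, torchvision.ops.roi_pool.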

Importance of RoI Pooling:

Feature sharing: it allows the feature map computed for the image to be reused by all object proposals, hence feature sharing.
Training and testing time: it significantly speeds up training and testing, and it allows the network to be trained end-to-end.

VGG-16 Fine-Tuning for the Fast R-CNN Network

The VGG-16 network undergoes three transformations (sketched in code after this list):
  • The last max pooling layer is replaced by a RoI pooling layer.
  • The network's last fully connected layer and softmax layer are replaced with the two sibling layers (a fully connected layer with softmax over K+1 categories, and category-specific bounding-box regressors).
  • The network is modified to take two inputs: a list of images and a list of RoIs in those images.
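A rough torchvision sketch of the first transformation and the two-input forward pass; this is my own illustration under the assumption of the modern torchvision weights API, and the 7x7 output size is chosen so the pooled features match VGG-16's FC input.

```python
import torch.nn as nn
import torchvision
from torchvision.ops import roi_pool

# Load VGG-16 and drop its last max-pool layer (replaced by RoI pooling)
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
backbone = nn.Sequential(*list(vgg.features)[:-1])  # conv layers, final pool removed

def extract_roi_features(images, rois, output_size=(7, 7)):
    """images: (B, 3, H, W) tensor; rois: (N, 5) tensor of
    (batch_index, x1, y1, x2, y2) in input-image coordinates."""
    fmap = backbone(images)                 # one shared feature map per image
    # With the last pool removed, VGG-16 downsamples by 16, so RoIs are
    # scaled from image coordinates to feature-map coordinates
    return roi_pool(fmap, rois, output_size, spatial_scale=1.0 / 16)
```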

Fast R-CNN Loss Function

As Fast R-CNN has two sibling output layers at the end, it uses a multi-task loss to train both layers simultaneously.

The first layer outputs a discrete probability distribution (per RoI), p = (p_0, p_1, ..., p_K), over the K+1 categories. As usual, p is computed by a softmax over the K+1 outputs of a fully connected layer. This is the usual softmax layer for classification over K object classes plus one background class.

The second layer outputs bounding-box regression offsets, t^k = (t^k_x, t^k_y, t^k_w, t^k_h), for each of the K object classes; for the background class we don't need a bounding box. Here t^k specifies a scale-invariant translation and a log-space height/width shift relative to an object proposal.
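This parameterization is the one Fast R-CNN inherits from R-CNN. A small NumPy sketch of the encoding, with boxes given as center x/y plus width/height (the function name is mine):

```python
import numpy as np

def encode_offsets(proposal, gt):
    """proposal, gt: (cx, cy, w, h). Returns the targets (tx, ty, tw, th)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([
        (gx - px) / pw,    # translation, scaled by proposal width
        (gy - py) / ph,    # translation, scaled by proposal height
        np.log(gw / pw),   # log-space width shift
        np.log(gh / ph),   # log-space height shift
    ])
```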
Each training RoI is labelled with a ground-truth class u and a bounding-box regression target v. The multi-task loss is defined as

L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)

where u and v are the ground truth, p and t^u are the network outputs, and L_cls(p, u) = -log p_u is the log loss for the true class u. The Iverson bracket [u ≥ 1] is 1 when u ≥ 1 and 0 otherwise, so the localization loss is switched off for background RoIs (u = 0).
The second loss, L_loc, is defined over a tuple of true bounding-box regression targets for class u, v = (v_x, v_y, v_w, v_h), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. The hyper-parameter λ balances the two losses. For bounding-box regression (L_loc) we use

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i - v_i)

where the smooth L1 function is

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.

The smooth L1 loss is less sensitive to outliers than the L2 loss.
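A minimal PyTorch sketch of this multi-task loss; the tensor shapes and the λ = 1 default are assumptions on my part (note that F.smooth_l1_loss with its default beta of 1.0 matches the smooth L1 definition above):

```python
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, bbox_preds, labels, bbox_targets, lam=1.0):
    """cls_scores: (N, K+1) raw class scores; bbox_preds: (N, K, 4);
    labels: (N,) ground-truth class u in [0, K], 0 = background;
    bbox_targets: (N, 4) regression targets v for the true class."""
    # L_cls: log loss of the softmax over K+1 classes
    loss_cls = F.cross_entropy(cls_scores, labels)

    # L_loc: smooth L1 on the offsets predicted for the true class,
    # applied to foreground RoIs only ([u >= 1] in the loss above)
    fg = labels > 0
    if fg.any():
        t_u = bbox_preds[fg, labels[fg] - 1]    # (N_fg, 4) offsets for class u
        loss_loc = F.smooth_l1_loss(t_u, bbox_targets[fg])
    else:
        loss_loc = cls_scores.sum() * 0.0       # no foreground RoIs in this batch

    return loss_cls + lam * loss_loc
```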

Main Results  

The three main results of this paper are:
  • State-of-the-art mAP on VOC 2007, 2010, and 2012
  • Fast training and testing compared to R-CNN, SPPnet
  • Fine-tuning conv layers in VGG16 improves mAP

Miscellaneous

The Fast R-CNN paper also discusses the following topics that I haven't covered; if you want to know more, please go through the original paper:
  • Truncated SVD for faster detection
  • SVM vs Softmax for classification
  • How the number of object proposals affects mAP
That's all for this post. I hope you found this blog informative.
Thanks for reading !!