
Monday, November 9, 2020

Faster R-CNN : Object Detection

In this blog I will talk about the Faster R-CNN algorithm. This is the third blog in the series on R-CNN based object detection. In my previous blogs I talked about R-CNN and Fast R-CNN; for a better understanding of Faster R-CNN, first read about R-CNN and Fast R-CNN. Faster R-CNN is one of the state-of-the-art deep-learning-based object detection networks, and new object detection algorithms are routinely compared against it, so it is worth knowing well.

Let's recall the previous object detection algorithms of the R-CNN family before starting with Faster R-CNN.

R-CNN (Region-based Convolutional Neural Network) has three components: first, region proposal using the Selective Search algorithm, which generates approximately 2000 candidate object regions for an image; second, feature extraction for all these candidate regions by passing each of them through a CNN; and third, region classification using SVMs and bounding box (b-box) refinement using a b-box regression layer.

R-CNN Object Detection
Fast R-CNN also uses region proposals (Selective Search) like R-CNN, but instead of processing every region proposal through the CNN, it first generates a feature map of the whole image by passing it through the CNN and then extracts features for each proposal from that feature map using an RoI Pooling layer.

These features are fed into a sequence of fully connected (FC) layers that finally branch into two sibling output layers: one produces softmax probability estimates over K+1 object classes (where K is the number of classes, plus one for background) and the other outputs four real-valued numbers for each of the K classes.

Fast R-CNN Object Detection
Fast R-CNN is fast and accurate in comparison to R-CNN, but it is still not end-to-end trainable, and region proposal remains the bottleneck in state-of-the-art detection systems.

Faster R-CNN introduces a novel Region Proposal Network (RPN) that shares convolutional layers with the detection network. By sharing convolutions, the marginal cost of computing proposals drops from about 1.5 sec to about 10 ms per image.

Faster R-CNN can be seen as an RPN plus the Fast R-CNN detector network.

Let's try to understand the RPN network.

Region Proposal Network 

The RPN generates candidate object regions in the image, so it must produce proposals for objects of different scales and aspect ratios.

To handle multiple scales and sizes, there are different schemes such as a pyramid of images and a pyramid of filters. In a pyramid of images, feature maps are computed over multiple scales of the image to detect smaller and larger objects. In a pyramid of filters, multiple filters of different scales/sizes are applied over the feature map to detect smaller and larger objects.

Fig. (a) Pyramid of Image (b) Pyramid of Filters

RPN uses a different scheme: a pyramid of reference boxes (anchors).

Fig. Pyramid of Reference Anchor

Anchor 

Anchors are pre-defined bounding boxes of fixed shape and size. Anchor dimensions are defined w.r.t. the input image. Faster R-CNN uses anchors at 3 different scales, with box areas of 128², 256² and 512², and 3 different aspect ratios, 1:1, 1:2 and 2:1; the combination of these produces 9 anchors.

All 9 anchors at pixel (320, 320). The areas of the blue, green and red anchors are 128², 256² and 512² respectively.
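To make this concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code) of how the 9 anchor boxes can be generated from the 3 scales and 3 aspect ratios around a single location such as (320, 320):

```python
import numpy as np

def generate_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 (x1, y1, x2, y2) anchors centered at `center`.

    Each anchor keeps an area of scale**2; `ratios` gives the h/w aspect ratio.
    """
    cx, cy = center
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # keep area = scale**2 while stretching the box to the aspect ratio
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(generate_anchors((320, 320)))  # 9 x 4 array of anchor boxes
```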

For a typical 1000x600x3 input image in Faster R-CNN, the feature map dimension is reduced to 60x40x512 after the backbone (VGG-16) network. This feature map is shared between the RPN and the Fast R-CNN detection network.

RPN adds a mini convolutional network on top of the backbone/head network to output a set of rectangular object proposals. This mini-network slides a 3x3 convolutional layer over the last convolutional feature map of the backbone network, followed by two sibling 1x1 convolutional layers, a box-regression layer and a box-classification layer, for object proposal prediction.

At each sliding-window (convolution) location it predicts 9 proposals (b-boxes) of different scales and sizes using the anchors.

Region Proposal Network

The first conv layer of the RPN helps the network learn features for anchor prediction on top of the base network features. The first 1x1 convolution branch of the RPN predicts whether a proposal contains an object or not. So the number of conv filters required to classify one anchor is 2, and for 9 anchors it is 18.

The second 1x1 convolution branch of the RPN predicts the proposal bounding-box offsets. So the number of conv filters required to predict the b-box offsets for one anchor is 4 (x, y, w, h), and for 9 anchors it is 36.
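Below is a rough PyTorch-style sketch of this mini-network (layer names and wiring are my own simplification, not the authors' released code): a shared 3x3 convolution followed by the two sibling 1x1 convolutions with 18 and 36 filters for 9 anchors.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        # 3x3 conv slides over the backbone feature map
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # 1x1 conv: 2 scores (object / not object) per anchor -> 18 filters
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)
        # 1x1 conv: 4 offsets (tx, ty, tw, th) per anchor -> 36 filters
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# e.g. a 60x40 VGG-16 feature map
scores, offsets = RPNHead()(torch.randn(1, 512, 40, 60))
print(scores.shape, offsets.shape)  # (1, 18, 40, 60) and (1, 36, 40, 60)
```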

For the VGG and ZF head networks, the network stride is 16, which means that after processing an image through either of these networks, the feature map dimensions are 1/16 of the input image dimensions; one step in the feature map corresponds to 16 pixels in the input image. So when we say that at each point of the feature map we try to detect 9 different proposals, it means we try to locate objects in the input image every 16 pixels.
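As a quick sanity check (my own arithmetic, consistent with the ~20000 anchor count quoted later), placing 9 anchors at every cell of a 60x40 feature map gives anchor centers spaced 16 pixels apart in the input image:

```python
# stride 16: one anchor location per feature-map cell (centers are an assumption;
# implementations differ in whether they use the cell corner or cell center)
feat_w, feat_h, num_anchors = 60, 40, 9
centers_x = [16 * i + 8 for i in range(feat_w)]  # anchor centers in image coordinates
centers_y = [16 * j + 8 for j in range(feat_h)]
print(len(centers_x) * len(centers_y) * num_anchors)  # 21600, i.e. roughly 20000 anchors
```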

To understand how the RPN learns to predict object locations, look at its loss function.

RPN Loss function

The RPN is trained on mini-batches of 256 anchors, with a 1:1 ratio of positive to negative anchors. Positive anchors have IoU greater than 0.7 with a ground-truth box and negative anchors have IoU less than 0.3 with all ground-truth boxes; the remaining anchors are neutral. Note that one ground-truth box can assign positive labels to multiple anchors.
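A simplified sketch of this labelling rule (it ignores the extra rule in the paper that the anchor with the highest IoU for each ground-truth box is also marked positive):

```python
import numpy as np

def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors from an (num_anchors, num_gt) IoU matrix.

    Returns 1 for positive, 0 for negative, -1 for neutral (ignored) anchors.
    """
    max_iou = ious.max(axis=1)            # best IoU of each anchor with any ground-truth box
    labels = np.full(len(ious), -1)       # neutral by default
    labels[max_iou < neg_thresh] = 0      # negative (background) anchors
    labels[max_iou > pos_thresh] = 1      # positive (object) anchors
    return labels
```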

The RPN loss has two components: a b-box regression loss for predicting the location of the object and a b-box classification loss for separating positive (object) and negative (background) anchors. The regression loss is only calculated for positive anchors.

Multi task loss of RPN
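Written out (this is the multi-task loss from the paper, which the figure above shows):

$$ L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) $$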

Here, i is the index of an anchor in a mini-batch.

The first part of the above loss is the classification loss, where pi is the predicted probability of anchor i being an object and p*i is the ground-truth label of anchor i; p*i is 1 for a positive anchor and 0 for a negative anchor. The classification loss Lcls is the log loss over the two classes (object vs. not object).

The second part of the loss is the regression loss, which is multiplied by p*i, so its value for negative anchors is zero. Here ti is a vector representing the 4 parameterized coordinates of the predicted b-box and t*i is that of the ground-truth box associated with a positive anchor.

For the regression loss, Lreg(ti, t*i), it uses the smooth L1 loss. To read more about the smooth L1 loss, go through the loss section of the previous blog.
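For completeness, a tiny sketch of the smooth L1 loss for a single error value x:

```python
def smooth_l1(x):
    # quadratic for small errors, linear for large ones (less sensitive to outliers)
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5
```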

The classification loss is normalized by the mini-batch size Ncls (256), the regression loss is normalized by the number of anchor locations Nreg (~2400), and the regression loss is weighted by lambda = 10, thus giving approximately equal weight to both losses.

The regression target t*i for a positive anchor is defined as follows.

Offset of ground truth box and anchor

Here x, y, w and h denote the box's center coordinates and its width and height.

The variables x, xa and x* are for the predicted box, the anchor box and the ground-truth box respectively (likewise for y, w and h).

The predicted bounding-box parameters (x, y, w, h) can be recovered from the predicted offsets and the corresponding anchor.

Offset of predicted box and anchor

The RPN loss function forces the network to learn to predict the offsets (tx, ty, tw and th) of an object's bounding box w.r.t. the pre-defined anchors.

We can see this as a bounding box regression from an anchor box to a nearby ground-truth box.
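Here is a small NumPy sketch of this parameterization, with an encode step (ground truth + anchor -> regression target) and a decode step (predicted offsets + anchor -> predicted box); the function names are my own:

```python
import numpy as np

def encode(gt, anchor):
    """Offsets t* of a ground-truth box (x*, y*, w*, h*) w.r.t. an anchor (xa, ya, wa, ha)."""
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Predicted box (x, y, w, h) from predicted offsets (tx, ty, tw, th) and an anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya, wa * np.exp(tw), ha * np.exp(th)])
```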

Training RPN

The RPN can be trained end-to-end using backpropagation and SGD. The weights of the backbone network are initialized from a model pre-trained for ImageNet classification. The weights of the new layers are randomly initialized from a zero-mean Gaussian distribution with standard deviation 0.01.

To compute the RPN loss, a random sample of 256 anchors is selected from an image with a 1:1 ratio of positive to negative anchors. If an image has fewer than 128 positive anchors, the mini-batch is padded with negative anchors.

Anchor boxes that cross image boundaries are removed from training, which reduces the number of anchors from ~20000 (60x40x9) to ~6000. Some proposals highly overlap with each other, so to reduce redundancy NMS is applied with a 0.7 IoU threshold, which leaves around 2000 proposal regions per image. After NMS, the top-N proposals are used for detection.
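A plain NumPy sketch of greedy NMS with the 0.7 IoU threshold (my own simplified version, not the authors' implementation):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); returns indices of kept boxes."""
    order = scores.argsort()[::-1]        # proposals sorted by score, best first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]   # drop proposals that overlap too much
    return keep
```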

Faster R-CNN : RPN + Fast R-CNN detector

The Faster R-CNN network can be seen as a Fast R-CNN network with an RPN as the object proposal network instead of the Selective Search algorithm.

Faster R-CNN Network

The backbone/head network (VGG-16) feature map is shared between the RPN and the detection branch.

The detection network projects the object proposals from the RPN onto the backbone feature map to get the features for object classification and bounding-box refinement, but these features need to be of fixed size because of the fully connected layers in the network.

So it uses an RoI pooling layer to resize the feature map of each object proposal to a fixed size (7x7x512).

RoI pooling works by dividing the H x W RoI region (object proposal) into an h x w grid of sub-windows of approximate size H/h x W/w and then max-pooling the values in each sub-window into the corresponding output grid cell. Read more about the RoI pooling layer in the Fast R-CNN blog.
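A naive sketch of RoI max pooling for a single proposal (my own simplification that assumes the RoI spans at least 7x7 feature-map cells; real implementations handle quantization more carefully):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """feature_map: (H, W, C) array; roi: integer (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]                       # the proposal's features
    h_edges = np.linspace(0, region.shape[0], out_size + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    out = np.zeros((out_size, out_size, feature_map.shape[2]))
    for i in range(out_size):
        for j in range(out_size):
            sub = region[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]]
            out[i, j] = sub.max(axis=(0, 1))                  # max-pool each sub-window
    return out                                                # fixed 7x7xC output
```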

After the RoI pooling layer, the resized features for each proposal are passed through fully connected layers and finally fed to the classification layer and the bounding-box refinement layer.

The classification layer gives C probability values using the softmax function for each proposal, where C is the number of classes including background.

The predicted box of each object proposal is further refined by the bounding-box regression layer of the detection network. This layer outputs bounding-box offsets w.r.t. each class; that means each class has its own regression with four parameters, unlike the bounding-box regression of the RPN.
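A rough PyTorch-style sketch of this detection head after RoI pooling (names and sizes are my own illustration for the VGG-16 setting, with C = 21 for PASCAL VOC):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, num_classes=21):             # e.g. 20 PASCAL VOC classes + background
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),                            # 7x7x512 RoI feature -> 25088 vector
            nn.Linear(7 * 7 * 512, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.cls_score = nn.Linear(4096, num_classes)      # softmax scores over C classes
        self.bbox_pred = nn.Linear(4096, num_classes * 4)  # per-class box offsets

    def forward(self, roi_features):                 # roi_features: (num_rois, 512, 7, 7)
        x = self.fc(roi_features)
        return self.cls_score(x), self.bbox_pred(x)
```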

It uses the same multi-task loss as in Fast R-CNN.

One important thing to note about this network is the feature sharing between the RPN and the detection network.

Feature Sharing for RPN and Fast R-CNN

If the RPN and the detector network are trained independently, both will modify the backbone convolutional layer weights in different ways. So the authors followed a 4-step training procedure to allow the networks to share weights.

  1. First, the RPN is trained independently as mentioned above. The network is initialized with an ImageNet pre-trained model and fine-tuned end-to-end for the region proposal task.
  2. In the second step, a separate detection network is trained using the proposals generated by the step-1 RPN. This network is also initialized with an ImageNet pre-trained model. Up to this point the two networks do not share weights.
  3. In the third step, the RPN is trained again, but this time it is initialized with the step-2 detector weights and the convolutional layers shared between the RPN and the detector are kept fixed; only the layers unique to the RPN are fine-tuned. Now both networks share the same weights for the backbone network.
  4. Finally, keeping the shared convolutional layers fixed, the layers unique to the detection branch (the Fast R-CNN detector) are fine-tuned. This gives the final Faster R-CNN network, which shares the convolutional weights and forms a unified network.

Results

Faster R-CNN with VGG-16 as the backbone network achieves state-of-the-art object detection accuracy on the PASCAL VOC 2007, 2012 and MS COCO datasets with only 300 proposals per image. It runs at 5 fps (including all steps) on a GPU.

  • Faster R-CNN takes around 198 ms for proposals plus detection on one image, compared to approximately 1.8 sec for Fast R-CNN (about 1.5 sec for proposals and 320 ms for detection).
  • Faster R-CNN achieves 3.2% higher mAP than Fast R-CNN when trained on the union of the PASCAL VOC 2007 trainval and 2012 trainval sets.
  • Faster R-CNN achieves 2.8% higher mAP@0.5 IoU than Fast R-CNN on the MS COCO dataset.

That's all for this post, hope you find this blog informative.
Thanks for reading !!
