
Saturday, June 13, 2020

R-CNN : Object Detection

In this blog I will discuss one of the CNN-based object detection algorithms, R-CNN (Region-based Convolutional Neural Network). Two more object detection algorithms build on R-CNN: Fast R-CNN and Faster R-CNN. Nowadays we use only Faster R-CNN out of these three, but it's important to understand the base network first.

So let's start.

What is object detection and how is it different from classification?
In classification, given an image we have to predict the class ID of the object present in it. In classification problems the object generally occupies more than 70-80% of the image and the rest is background. See the example image below.

Example image for classification; its output should be dog
In an object detection problem, an image can contain one or multiple objects. An object detection algorithm has to detect the location (x, y, width and height in pixels) of each object in the image and also correctly classify every object present.
Example: say we have a dog detection model. If we run this model on the above image, it should detect the bounding box covering the dog and classify it as a dog.
Output of object detection model
The pipeline of any object detection algorithm can be divided into three parts: first, find the regions in the image where an object can be; second, if there is an object, determine which object it is; and third, find the closest bounding box surrounding that object.

Previous Method of Object Detection

Before these CNN-based object detection networks, we used sliding-window-based algorithms for object detection. We slide a window of W×H dimensions over the image from top to bottom, crop these window regions from the image, and classify each crop with a machine-learning algorithm such as an SVM.
To detect objects of different shapes and sizes, we change the dimensions and aspect ratio of the sliding window and iterate over the image again.

This method is quite fast, but the problem is accuracy: it's very difficult to cover a whole object in one sliding window, and classical machine-learning algorithms also fail to classify correctly when the image is exposed to different lighting conditions or when the object is very small or large in the image.
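The sliding-window loop described above can be sketched as follows (the window size, stride, and image dimensions here are arbitrary illustration values, not anything prescribed by the method):

```python
def sliding_windows(img_h, img_w, win_h, win_w, stride):
    """Enumerate (x, y, w, h) crops of a win_h x win_w window slid over the image."""
    boxes = []
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            boxes.append((x, y, win_w, win_h))
    return boxes

# Slide a 64x64 window over a 256x256 image with a stride of 32 pixels.
windows = sliding_windows(256, 256, 64, 64, 32)
print(len(windows))  # 7 positions per axis -> 49 windows
```

Each of these crops would then be classified separately, and the whole enumeration repeated for every window size and aspect ratio, which is exactly why the approach becomes expensive.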

Moving forward, let's see what R-CNN is.

R-CNN stands for Regions with CNN features. The algorithm was introduced in the paper Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. It achieves a mean average precision (mAP) of 53.3% on VOC 2012, more than 30% better than the previous best result.
1. Input image 2. Extract around 2000 region proposals 3. Compute features for each proposal using a large CNN 4. Classify each region using class-specific linear SVMs

The R-CNN object detection model consists of three modules.

1. Region Proposal

Conventional object detection algorithms use a sliding-window approach to search for objects at each location in the image. This method is very slow if we use a CNN for classification at each location, and it is also inaccurate for objects with different aspect ratios and sizes.

R-CNN uses the Selective Search algorithm to generate category-independent region proposals. These proposals define the set of candidate detections available to the detector. At test time, for any given input image, it generates around 2000 category-independent region proposals.

During training, the labelled ground-truth boxes serve as positive detection candidates, while region proposals are assigned to classes by how much they overlap the ground truth (high-overlap proposals as positives for fine-tuning, low-overlap ones as negatives); at test time, all 2000 selective-search proposals are passed through the pipeline.

10 random region proposals out of 2000
In the above image we can see 10 random region proposals, of different sizes and aspect ratios, out of the 2000 region proposals.
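The overlap measure used to assign proposals to ground-truth boxes is intersection-over-union (IoU); here is a minimal sketch for boxes in (x, y, w, h) form:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A proposal shifted halfway off the ground truth scores IoU = 1/3.
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
```

A proposal is then treated as a positive or negative example depending on whether its IoU with a ground-truth box clears a threshold.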

2. Feature Extraction

Region proposal gives us the detection candidates. The second step of R-CNN is fixed-length feature extraction, using a CNN, for each detection candidate.

Each region proposal's bounding box is first dilated (expanded) by p = 16 pixels, to capture some image context around the original box.
 
For each region proposal, R-CNN generates a fixed-length (4096-dimensional) feature vector using a CNN. To compute the features it uses AlexNet, which has five convolutional layers and two fully connected layers and takes a mean-subtracted input image of size 227×227. To compute features for a region proposal, each region must therefore be of size 227×227. So regardless of the size and aspect ratio of a region proposal, R-CNN warps all candidate regions to 227×227 before passing them through the CNN.

1. Input image with ground truth 2. Example of warping objects to 227×227 3. Feature extraction for each warped image using the AlexNet network; feature size is 4096
To explain the procedure I have used ground-truth labels. In the above image we can see the image with its ground-truth bounding boxes; each object is cropped from the image, resized to 227×227, and passed to the AlexNet network, which produces a 4096-length feature vector for each cropped ground truth. The same happens at test time with region proposals.
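The dilate-crop-warp step can be sketched as below. This is a simplified stand-in: it uses a plain nearest-neighbour resize instead of the paper's warp, and a random array in place of a real image, just to show the geometry of the operation (the real pipeline would feed the warped, mean-subtracted crop to AlexNet):

```python
import numpy as np

def warp_region(image, box, out_size=227, pad=16):
    """Dilate an (x, y, w, h) box by `pad` pixels, crop it, and warp to out_size x out_size."""
    h, w = image.shape[:2]
    x, y, bw, bh = box
    # Expand the box by `pad` pixels on every side, clipped to the image bounds.
    x1, y1 = max(0, x - pad), max(0, y - pad)
    x2, y2 = min(w, x + bw + pad), min(h, y + bh + pad)
    crop = image[y1:y2, x1:x2]
    # Nearest-neighbour sampling grid down/up to the fixed CNN input size.
    ys = np.arange(out_size) * crop.shape[0] // out_size
    xs = np.arange(out_size) * crop.shape[1] // out_size
    return crop[ys][:, xs]

img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # dummy image
warped = warp_region(img, (100, 50, 200, 120))
print(warped.shape)  # (227, 227, 3)
```

Whatever the proposal's original size and aspect ratio, the output is always the fixed 227×227 input the CNN expects.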

3. Region Classification

The feature vector for each region generated by the CNN is scored by an SVM trained for each class. For an image, the feature matrix is typically 2000×4096 (2000 region proposals, with a 4096-length feature for each) and the SVM weight matrix is 4096×N, where N is the number of classes. The SVMs give a score for each region proposal, and depending on a threshold value each proposal is classified as one of the N classes or as background.
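The scoring step is a single matrix multiply; a sketch with random stand-in features and weights (the shapes match the text above, the values are obviously not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
num_proposals, feat_dim, num_classes = 2000, 4096, 20

features = rng.standard_normal((num_proposals, feat_dim))   # one CNN feature row per proposal
svm_weights = rng.standard_normal((feat_dim, num_classes))  # one linear SVM column per class

scores = features @ svm_weights        # (2000, 20) class scores
best_class = scores.argmax(axis=1)     # highest-scoring class per proposal
print(scores.shape, best_class.shape)
```

Because all 2000 proposals are scored in one batched product, this step is cheap; the expensive part of R-CNN is computing the 2000 feature vectors in the first place.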

After region classification, non-maximum suppression is applied (for each class independently) to remove overlapping detections of the same object.
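Per-class non-maximum suppression greedily keeps the highest-scoring box and discards any remaining proposal that overlaps it above a threshold; a minimal sketch with boxes in (x1, y1, x2, y2) corner form (the 0.3 threshold is an illustrative choice):

```python
def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)           # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps it too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much and is suppressed
```

Running this once per class ensures that overlapping detections of different classes are never suppressed against each other.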

Bounding-box Regression

To reduce localization error, R-CNN fine-tunes the bounding-box prediction of each detected object with a class-specific bounding-box regressor. The input to bounding-box regression is a set of N training pairs {(P, G)}, where P = (Px, Py, Pw, Ph) specifies the pixel coordinates of the center of a proposal's bounding box together with its width and height in pixels. Each ground-truth box G is specified the same way: G = (Gx, Gy, Gw, Gh). The bounding-box regressor increases mAP by 3-4 points.

R-CNN with bounding box regressor branch
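The regressor is trained on scale-invariant translation and log-space size offsets between P and G; the sketch below computes those targets and inverts them to recover a refined box (the numeric boxes are made-up examples):

```python
import math

def regression_targets(P, G):
    """Box-regression targets t = (tx, ty, tw, th) for proposal P and ground truth G,
    both given as (cx, cy, w, h)."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def apply_regression(P, t):
    """Invert the targets: predict a refined box from proposal P and offsets t."""
    px, py, pw, ph = P
    tx, ty, tw, th = t
    return (pw * tx + px, ph * ty + py, pw * math.exp(tw), ph * math.exp(th))

P = (100.0, 100.0, 50.0, 40.0)   # proposal: center (100, 100), 50x40
G = (110.0, 95.0, 60.0, 44.0)    # ground truth
t = regression_targets(P, G)
print(apply_regression(P, t))    # recovers G (up to floating-point rounding)
```

At test time the learned regressor predicts t from the proposal's CNN features, and applying it nudges the proposal box toward the object.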

Conclusion

So let's wrap it up. The R-CNN object detection algorithm generates 2000 object proposals from the input image using the selective search algorithm, warps each proposal region to a fixed size (227×227) to compute a fixed-length (4096) feature vector using a CNN (AlexNet), classifies these feature vectors using class-specific SVMs followed by non-max suppression for each class independently, and finally applies bounding-box regression to fine-tune the bounding boxes of the predicted objects.

Drawback of R-CNN 

1. Training is a multistage pipeline. R-CNN first fine-tunes a ConvNet (AlexNet) on object proposals using log loss. Then it fits SVMs to the ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learned during fine-tuning. In the third training stage, bounding-box regressors are learned.

2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from every object proposal in every image and written to disk. These features require a large amount of storage and make training slow.

3. Object detection is slow. At test time, features are extracted from each object proposal in each test image. Detection with VGG16 as the base network instead of AlexNet takes 47 s per image on a GPU.

That's all for this blog, hope you find this blog informative.
Thanks for reading !!
