Fast-RCNN object detection algorithm

Shashikant Reddy
4 min read · Jan 11, 2021

The successor to the R-CNN object detection algorithm.

I’m truly in love with computer vision.

I recommend reading my previous post on R-CNN before you start this article.

Let’s get started with the Fast Region-based Convolutional Network method (Fast R-CNN) for object detection.

Fast R-CNN builds on the earlier R-CNN work to efficiently classify object proposals using deep convolutional networks. Compared to R-CNN, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012 (Ross Girshick).

Fast R-CNN architecture and training:

  1. The first part of Fast R-CNN is a convolutional neural network such as VGG16. The network takes an entire image as input and first processes it with several convolutional and max-pooling layers to produce a feature map.
  2. Region proposals (explained in the R-CNN post) are generated from the input image, not from the feature map, using a hand-crafted algorithm called Selective Search. It uses local cues such as texture, intensity, color, and/or a measure of insideness to propose candidate object locations. Each proposal is then projected onto the feature map.

Refer to the two pictures above for a better understanding.

  3. The region proposals, projected onto the feature map, are fed to a region of interest (RoI) pooling layer, which produces a fixed-length feature vector for each proposal from the feature map.
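Selective Search works in image coordinates, so each proposal box has to be mapped onto the coarser feature map before RoI pooling. Here is a minimal sketch of that projection. The function name `project_roi`, the stride of 16 (VGG16 with its last max-pooling layer removed), and the floor/ceil rounding are assumptions for illustration; real implementations differ in how they round.

```python
import math


def project_roi(box, stride=16, fmap_h=None, fmap_w=None):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cell indices.

    stride is the total downsampling factor of the conv layers
    (assumed 16 here). Floor/ceil rounding makes the projected
    region cover the whole box; this is one convention among several.
    """
    x1, y1, x2, y2 = box
    fx1 = math.floor(x1 / stride)
    fy1 = math.floor(y1 / stride)
    fx2 = math.ceil(x2 / stride)
    fy2 = math.ceil(y2 / stride)
    # Optionally clamp to the feature-map bounds.
    if fmap_w is not None:
        fx1, fx2 = max(0, fx1), min(fmap_w, fx2)
    if fmap_h is not None:
        fy1, fy2 = max(0, fy1), min(fmap_h, fy2)
    return (fx1, fy1, fx2, fy2)


# A 480 x 640 image gives a 30 x 40 feature map at stride 16.
print(project_roi((64, 32, 320, 256), fmap_h=30, fmap_w=40))
```

The projected region can have any size, which is why the next layer has to turn it into a fixed-length vector.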

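What is RoI pooling?

RoI pooling is a special case of the spatial pyramid pooling (SPP) layer used in SPPnet, so it helps to look at SPP first. An SPP layer max-pools the same feature map at 3 different levels, each with a fixed number of bins.

The feature map coming out of the last convolutional layer has 256 channels and an arbitrary spatial size (it depends on the input size).

  1. In the first pooling level (the gray one in the figure), the output has a single bin that covers the complete feature map. This is similar to a global pooling operation. The output of this level is 256-d.
  2. In the second pooling level, the feature map is pooled into 4 bins, giving an output of size 4*256.
  3. In the third pooling level, the feature map is pooled into 16 bins, giving an output of size 16*256.

The outputs of all the pooling levels are flattened and concatenated to give an output of fixed dimension, (1 + 4 + 16) * 256 = 5376 values, irrespective of the input size, and this is fed to the fully connected layers. Fast R-CNN's RoI pooling layer keeps only one pyramid level: each proposal's region of the feature map is divided into a fixed H × W grid (for example 7 × 7 for VGG16) and each cell is max-pooled.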
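To make the fixed-length idea concrete, here is a small pure-Python sketch of the pooling described above. The names `max_pool_grid`, `pyramid_pool` and `make_map` are made up for this example, and the bin boundaries (floor for the start, ceiling for the end) are one reasonable choice rather than the exact rule of any particular implementation. Passing a single level, such as `levels=(7,)`, gives the RoI-pooling-style case with one grid.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def max_pool_grid(channel, bins):
    """Max-pool one 2-D channel (a list of rows) into a bins x bins grid."""
    h, w = len(channel), len(channel[0])
    pooled = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, ceil_div((i + 1) * h, bins)
        for j in range(bins):
            c0, c1 = (j * w) // bins, ceil_div((j + 1) * w, bins)
            pooled.append(max(channel[r][c]
                              for r in range(r0, r1)
                              for c in range(c0, c1)))
    return pooled


def pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Pool every channel at every level and concatenate the results.

    levels are bins per side, so (1, 2, 4) means 1, 4 and 16 bins.
    """
    out = []
    for bins in levels:
        for channel in feature_map:
            out.extend(max_pool_grid(channel, bins))
    return out


def make_map(channels, h, w):
    """Build a dummy feature map with distinct values (illustration only)."""
    return [[[c * 1000 + r * w + col for col in range(w)]
             for r in range(h)] for c in range(channels)]


# Two feature maps of different spatial size give the same output length.
small = pyramid_pool(make_map(3, 5, 7))
large = pyramid_pool(make_map(3, 13, 9))
print(len(small), len(large))  # both are 3 * (1 + 4 + 16) = 63
```

With 256 channels instead of 3, the same call would return 5376 values, whatever the spatial size of the input.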

  4. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class, and a bounding-box regressor that outputs four real-valued numbers for each of the K object classes.

Each set of 4 values encodes refined bounding-box positions for one of the K classes.
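As a rough sketch of how the two sibling outputs could be used for one proposal: the softmax branch picks a class, and that class's four regression values refine the proposal box. The code assumes the (tx, ty, tw, th) parameterization from the R-CNN paper and index 0 for background; the function names and sample numbers are invented for illustration, and a full detector would also run per-class non-maximum suppression afterwards.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_box(proposal, deltas):
    """Apply (tx, ty, tw, th) to a proposal given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = proposal
    pw, ph = x2 - x1, y2 - y1
    px, py = x1 + 0.5 * pw, y1 + 0.5 * ph
    tx, ty, tw, th = deltas
    cx, cy = px + pw * tx, py + ph * ty          # shift the center
    w, h = pw * math.exp(tw), ph * math.exp(th)  # rescale width and height
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def detect(cls_scores, box_deltas, proposal):
    """cls_scores: K+1 raw scores (index 0 = background).
    box_deltas: K tuples of 4 values, one per object class."""
    probs = softmax(cls_scores)
    best = max(range(1, len(probs)), key=lambda k: probs[k])
    return best, probs[best], decode_box(proposal, box_deltas[best - 1])


# K = 2 object classes; all numbers below are made up.
label, score, box = detect(
    cls_scores=[0.1, 2.0, 0.5],
    box_deltas=[(0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    proposal=(10, 20, 50, 100),
)
print(label, round(score, 3), box)
```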

Initializing from pre-trained networks:

The authors experiment with three networks pre-trained on ImageNet: CaffeNet, VGG_CNN_M_1024, and VGG16.

When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.

First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
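A quick arithmetic check shows what "compatible" means here. This sketch assumes the standard VGG16 shapes (512 channels after the last convolutional layer and 25088 inputs to the first fully connected layer); the helper name is made up.

```python
def roi_pool_output_size(channels, H, W):
    """Length of the flattened RoI pooling output for one proposal."""
    return channels * H * W


VGG16_LAST_CONV_CHANNELS = 512   # assumed standard VGG16 value
VGG16_FIRST_FC_INPUTS = 25088    # assumed standard VGG16 value

size = roi_pool_output_size(VGG16_LAST_CONV_CHANNELS, H=7, W=7)
print(size == VGG16_FIRST_FC_INPUTS)
```

Any other H and W would change the vector length, and the pre-trained weights of the first fully connected layer could no longer be reused as-is.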

Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).

Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

Contributions:

The authors propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet while improving on their speed and accuracy. They call this method Fast R-CNN because it is comparatively fast to train and test.

The Fast R-CNN method has several advantages:

  1. Higher detection quality (mAP) than R-CNN and SPPnet.
  2. Training is single-stage, using a multi-task loss.
  3. Training can update all network layers.
  4. No disk storage is required for feature caching.
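The multi-task loss mentioned in the list combines the two sibling outputs into one training objective. Below is a minimal sketch following the form given in the paper: a log loss for the true class u plus, for non-background RoIs only, a smooth L1 loss between the predicted box values for class u and the regression target v. The function names and sample numbers are made up for illustration, and the balancing weight defaults to 1.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (less sensitive to outliers)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def multi_task_loss(probs, u, t_u, v, lam=1.0):
    """probs: softmax output over K+1 classes (index 0 = background).
    u: true class label. t_u: predicted 4 box values for class u.
    v: 4 regression targets. lam: weight between the two terms."""
    l_cls = -math.log(probs[u])
    # Background RoIs have no ground-truth box, so the box term is skipped.
    l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v)) if u >= 1 else 0.0
    return l_cls + lam * l_loc


# One foreground RoI with made-up numbers.
loss = multi_task_loss(
    probs=[0.1, 0.7, 0.2], u=1,
    t_u=(0.2, 0.0, 0.1, -0.3), v=(0.0, 0.0, 0.0, 0.0),
)
print(round(loss, 4))
```

Because both terms are summed into a single value, the classifier, the box regressor and the convolutional layers can all be updated in one backward pass, which is what makes the training single-stage.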

Result:

Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012.

This concludes the technical summary of the Fast R-CNN paper. Hope you enjoyed (understood)! Open to discussions or corrections in the comments below.

References:

https://arxiv.org/pdf/1504.08083.pdf

https://medium.com/analytics-vidhya/review-spatial-pyramid-pooling-1406-4729-bfc142988dd2
