Object detection using the R-CNN algorithm

Shashikant Reddy
7 min read · Jan 9, 2021

A deep dive into the R-CNN object detection algorithm…

Object detection sample

One of the most powerful and compelling types of AI is computer vision (I’m in love with this), which you’ve almost surely experienced in any number of ways without even knowing it. Here’s a look at what it is, how it works, and why it’s so awesome (and is only going to get better).

Computer vision is the field of computer science that focuses on replicating parts of the complexity of the human vision system and enabling computers to identify and process objects in images and videos in the same way that humans do.

Thanks to advances in artificial intelligence and innovations in deep learning and neural networks, the field has been able to take great leaps in recent years and has been able to surpass humans in some tasks related to detecting and labeling objects.

What is Image Classification?

Image classification takes an image and predicts the object in it. For example, when we built a cat-dog classifier, we took images of cats or dogs and predicted their class.

What do you do if both cat and dog are present in the same image?

This is where object detection comes into play. Object detection will classify the object (here, dog or cat) as well as localize the object’s position (the coordinates of the object) in the image.

Bounding box: In object detection, we usually use a bounding box to describe the target location.

The bounding box is a rectangular box that can be determined by the x- and y-coordinates of the upper-left corner and the x- and y-coordinates of the lower-right corner of the rectangle.

Here you can see the x, y coordinates of the upper-left and lower-right corners of the object in the image.

What is object detection?

Predicting the location of an object along with its class is called object detection.

Instead of predicting only the class of an object in an image, we now have to predict the class as well as a rectangle (called a bounding box) containing that object. It takes four variables to uniquely identify a rectangle, so for each instance of an object in the image we shall predict the following variables (a short sketch follows the list):

class_name,

bounding_box_top_left_x_coordinate,

bounding_box_top_left_y_coordinate,

bounding_box_width,

bounding_box_height
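
To make this concrete, here is a minimal sketch (not from the original post; the class and function names are purely illustrative) of one detection stored as class name plus top-left corner, width, and height, together with a conversion from the two-corner format described earlier:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    class_name: str
    x: float        # bounding_box_top_left_x_coordinate
    y: float        # bounding_box_top_left_y_coordinate
    width: float    # bounding_box_width
    height: float   # bounding_box_height

def corners_to_xywh(x1, y1, x2, y2):
    """Convert (upper-left, lower-right) corners to (x, y, width, height)."""
    return x1, y1, x2 - x1, y2 - y1

# Example: a dog whose box spans from (50, 30) to (200, 180)
x, y, w, h = corners_to_xywh(50, 30, 200, 180)
print(Detection("dog", x, y, w, h))
```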

Just like multi-label image classification problems, we can have a multi-class object detection problem where we detect multiple kinds of objects in a single image.

Object detection can be modeled as a classification problem where we take windows of fixed sizes from the input image at all possible locations and feed these patches to an image classifier.

What is a sliding window?

While classifying, the detector checks whether each part of the picture contains the object/region; this is how it detects the location of the object in the given picture.

As you can see, the object can be of varying sizes. To solve this problem, an image pyramid is created by scaling the image. The idea is that we resize the image at multiple scales and count on the fact that our chosen window size will completely contain the object in one of these resized images.

Most commonly, the image is downsampled (its size is reduced) until a certain condition, typically a minimum size, is reached. A fixed-size window detector is run on each of these images. It’s common to have as many as 64 levels in such pyramids. All of these windows are then fed to a classifier to detect the object of interest. This helps us solve the problem of both size and location.

While running the window detector on the image pyramid patches, it classifies each patch and also detects the location of the object (bounding box).
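
Here is a minimal sketch of the image pyramid plus sliding window idea (assumptions: the image is a NumPy array, OpenCV is available, and classify_patch is a hypothetical classifier; the scale, step, and window size are arbitrary):

```python
import cv2

def pyramid(image, scale=1.5, min_size=(64, 64)):
    """Yield progressively downsampled copies of the image."""
    yield image
    while True:
        h, w = image.shape[:2]
        new_w, new_h = int(w / scale), int(h / scale)
        if new_w < min_size[0] or new_h < min_size[1]:
            break
        image = cv2.resize(image, (new_w, new_h))
        yield image

def sliding_window(image, step=32, window=(64, 64)):
    """Yield (x, y, patch) for every fixed-size window position."""
    win_w, win_h = window
    for y in range(0, image.shape[0] - win_h + 1, step):
        for x in range(0, image.shape[1] - win_w + 1, step):
            yield x, y, image[y:y + win_h, x:x + win_w]

# Usage sketch: run a (hypothetical) classifier on every window of every level.
# image = cv2.imread("dog.jpg")
# for level in pyramid(image):
#     for x, y, patch in sliding_window(level):
#         score = classify_patch(patch)  # hypothetical classifier
```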

There is one more problem: aspect ratio (the ratio of the width to the height of an image). A lot of objects can come in various shapes; for example, a sitting person will have a different aspect ratio than a standing or sleeping person. We shall cover this a little later in this post. There are various methods for object detection, like R-CNN, Faster R-CNN, SSD, etc. Why do we have so many methods, and what are the salient features of each of them? Let’s have a look:

The above method works well if we perform object detection using HOG features, but it becomes very slow and computationally expensive if we run a deep-learning CNN on all the patches created by the sliding window.

Region-based Convolutional Neural Networks (R-CNN):

Object detection is modeled as an image classification and regression problem, so success depends on the accuracy of the classifier. After the rise of deep learning, the obvious idea was to replace HOG-based classifiers (Histogram of Oriented Gradients features, introduced in 2005) with a more accurate convolutional neural network-based classifier. However, there was one problem.

CNNs were too slow and computationally very expensive.

It was impossible to run CNNs on so many patches generated by the sliding window detector.

R-CNN solves this problem by using an object proposal algorithm called Selective Search which reduces the number of bounding boxes that are fed to the classifier to close to 2000 region proposals.

Selective search: Selective search is a hard-coded algorithm that uses local cues like texture, intensity, color, and/or a measure of insideness to generate all the possible locations of the object. It does not learn anything and gives approximately 2000 proposals.

Instead of the thousands or millions of proposals generated by a sliding window, selective search generates nearly 2000 proposals, a comparatively small number, which makes our R-CNN model much faster to run.
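
As a rough sketch, OpenCV ships a selective search implementation in its contrib modules (this assumes the opencv-contrib-python package, which provides cv2.ximgproc; the exact API can vary between OpenCV versions):

```python
import cv2

def selective_search_proposals(image, max_proposals=2000):
    """Generate class-agnostic region proposals as (x, y, w, h) boxes."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()   # faster, slightly less exhaustive mode
    rects = ss.process()               # array of (x, y, w, h) proposals
    return rects[:max_proposals]

# image = cv2.imread("dog.jpg")
# proposals = selective_search_proposals(image)
# print(len(proposals))  # typically on the order of 2000
```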

Now we can feed these boxes (proposals) to our CNN-based classifier. Remember, the fully connected part of a CNN takes a fixed-size input, so we resize all the generated boxes (without preserving the aspect ratio) to a fixed size (224×224 for VGG) and feed them to the CNN part.

We all know what happens inside a CNN model; we can use any CNN, such as VGG, Inception, ResNet, etc.

Here, only the convolutional layers of the CNN are used to extract features; the fully connected classification layers are not used.
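
Here is a minimal sketch of that feature-extraction step, assuming PyTorch/torchvision with a pretrained VGG16 (the original R-CNN used a Caffe-based network, so this is just an illustration; the normalization constants are the usual ImageNet values, and the weights argument may differ between torchvision versions):

```python
import cv2
import numpy as np
import torch
import torchvision

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
vgg.eval()
# Keep only the convolutional backbone (plus pooling) as a feature extractor.
feature_extractor = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())

def extract_features(image, box):
    """Warp one (x, y, w, h) proposal to 224x224 and return its feature vector."""
    x, y, w, h = box
    patch = image[y:y + h, x:x + w]
    patch = cv2.resize(patch, (224, 224))                   # warp, aspect ratio not preserved
    patch = patch[:, :, ::-1].astype(np.float32) / 255.0    # BGR -> RGB, scale to [0, 1]
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    patch = (patch - mean) / std                            # ImageNet normalization
    tensor = torch.from_numpy(np.ascontiguousarray(patch.transpose(2, 0, 1))).unsqueeze(0)
    with torch.no_grad():
        return feature_extractor(tensor).squeeze(0)         # fixed-length feature vector

# features = extract_features(image, proposals[0])
# These feature vectors are what the per-class SVMs are trained on.
```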

For each proposed region, the CNN extracts features, an SVM (support vector machine, a classical machine learning algorithm) classifies the region, and a bounding-box regressor fine-tunes the region coordinates (tries to find better coordinates). This way we are able to localize and classify the object at the same time.

Bounding-box regression: a popular technique to refine or predict localization boxes in recent object detection approaches. Typically, bounding-box regressors are trained to regress from either proposals or fixed anchor boxes to nearby bounding boxes of pre-defined target object classes.
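
For reference, here is a small sketch of the standard regression targets used in the R-CNN family (boxes written as center x, center y, width, height; the function names are just illustrative):

```python
import math

def regression_targets(proposal, ground_truth):
    """Targets the regressor learns: shift the center, scale width/height (log space)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return (gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph)

def apply_regression(proposal, t):
    """Apply predicted targets to a proposal to get the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th)

# A regressor trained on these targets learns to nudge a rough proposal
# toward the nearby ground-truth box:
# t = regression_targets((100, 100, 50, 80), (110, 95, 60, 90))
# print(apply_regression((100, 100, 50, 80), t))  # -> (110.0, 95.0, 60.0, 90.0)
```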

Hence, there are 3 important parts of R-CNN (a minimal sketch tying them together follows the list below):

Run Selective Search to generate probable objects.

Feed these patches to CNN, followed by SVM to predict the class of each patch.

Refine the boxes by training a bounding-box regressor separately.
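
Putting the three parts together, here is a minimal, heavily simplified sketch that reuses the illustrative helpers from the earlier snippets (selective_search_proposals, extract_features, apply_regression); the trained SVMs and regressor are assumed to exist and are represented by hypothetical callables:

```python
def rcnn_detect(image, classify_with_svms, regress_box, score_threshold=0.5):
    """classify_with_svms and regress_box stand in for hypothetical trained models."""
    detections = []
    for (x, y, w, h) in selective_search_proposals(image):    # 1. region proposals
        features = extract_features(image, (x, y, w, h))      # 2. CNN features + SVM score
        class_name, score = classify_with_svms(features)
        if class_name == "background" or score < score_threshold:
            continue
        cx, cy = x + w / 2, y + h / 2                         # 3. refine the box
        refined = apply_regression((cx, cy, w, h), regress_box(class_name, features))
        detections.append((class_name, refined, score))
    return detections
```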

Problems with R-CNN

  • It still takes a huge amount of time to train the network as you would have to classify 2000 region proposals per image.
  • It cannot be implemented real time as it takes around 47 seconds for each test image.
  • The selective search algorithm is a fixed algorithm. Therefore, no learning is happening at that stage. This could lead to the generation of bad candidate region proposals.

Thanks for reading this article.

Hit clap if you found this article useful, and comment below with any questions or suggestions.

Happy learning !

References:

Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation”: https://arxiv.org/pdf/1311.2524.pdf


