Today, we are going through a very important application of computer vision, facial recognition, a widely popular topic with a huge range of applications.
We are going to use the OpenCV library to implement various examples of computer vision. OpenCV (Open Source Computer Vision Library)[1] is an open-source library that includes many computer vision algorithms.
You see facial recognition everywhere, from security cameras to your phone. But how exactly does facial recognition work to classify faces? Which algorithms are best suited for it? We are going through all of this to see how it works in the real world.
The Haar feature-based cascade classifier is an effective object detection method proposed by Paul Viola and Michael Jones in their 2001 paper "Rapid Object Detection using a Boosted Cascade of Simple Features"[2].
So what is a Haar cascade classifier? To understand this, you first have to understand what a classifier is. A classifier is trained with a few hundred sample views of a particular object (e.g., a face or a car), called positive images, which are scaled to the same size (e.g., 20 * 20), and with negative images, which are arbitrary images of the same size that do not contain the object.
The Haar cascade classifier is an object detection algorithm used to detect faces in an image or a real-time video. It is a machine learning based approach where a cascade function is trained from a lot of positive and negative images. It is then used to detect objects in other images.
In practice, a trained cascade is stored as a large .xml file containing many feature sets, and each XML file corresponds to a very specific use case.
The algorithm can be explained in four stages:
Haar Features
Integral Images
AdaBoost
Cascading
Step 1: Haar Features
Now the question arises: how are we going to extract the features? Before getting into which features the Viola-Jones algorithm uses, let me give you a brief introduction to edge detection.
First, understand what a convolution kernel is. It is a small matrix used for blurring, sharpening, embossing, edge detection and more. This is accomplished by convolving the kernel with an image.
In the above image, you can see how edge detection works. I have the above matrix, or pattern. If you look at this pattern, it has low values on the top and bottom and higher values in the middle, so it is like a single bright line surrounded by a dark region. I want to find the horizontal line in the input image, so I created a kernel that looks like the pattern I am searching for.
This process is called convolution in image processing. Convolution is a mathematical way of combining two signals to form a third signal. It is important because it relates three signals: the input signal, the output signal and the impulse response.
In the output, you can see we get high values only at the places where the pattern matches the image. Now, we can proceed to understand Haar Features.
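To make the idea concrete, here is a minimal pure-Python sketch that slides a horizontal-line kernel over a tiny image. The kernel values, the 5 x 5 image and the helper name slide_kernel are made up for illustration. Strictly speaking the loop computes a cross-correlation (the kernel is not flipped), which gives the same result as convolution for a symmetric kernel like this one.

```python
def slide_kernel(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            # Multiply the kernel with the patch under it and add everything up.
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Low values on the top and bottom, high values in the middle.
line_kernel = [
    [-1, -1, -1],
    [ 2,  2,  2],
    [-1, -1, -1],
]

# A dark image with one bright horizontal line in the middle row.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

response = slide_kernel(image, line_kernel)
for row in response:
    print(row)
# [-3, -3, -3]
# [6, 6, 6]
# [-3, -3, -3]
```

The response is highest (6) exactly where the kernel sits on top of the line, and negative where the line only touches the top or bottom row of the kernel.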
You can see some basic Haar features in the above image. In these features, the black region stands for a higher, positive weight and the white region for a lower, negative weight. In other words, a Haar feature works exactly like a convolution kernel. When you apply a feature to a region of an image, you get a single value, obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle.
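The subtraction described above can be sketched in a few lines of Python. This example uses a two-rectangle edge feature with the white rectangle on top and the black rectangle directly below it. The patch values and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rect_sum(image, top, left, height, width):
    """Sum of the pixels inside a rectangle, computed the slow, direct way."""
    return sum(
        image[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


def two_rect_edge_feature(image, top, left, height, width):
    """Edge feature: white rectangle on top, black rectangle of equal size below.

    The value is sum(black) - sum(white), as described in the text.
    """
    white = rect_sum(image, top, left, height, width)
    black = rect_sum(image, top + height, left, height, width)
    return black - white


# A 4 x 4 patch whose top half has low values and bottom half has high values.
edge_patch = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [90, 90, 90, 90],
    [90, 90, 90, 90],
]
flat_patch = [[50] * 4 for _ in range(4)]

print(two_rect_edge_feature(edge_patch, 0, 0, 2, 4))  # 640
print(two_rect_edge_feature(flat_patch, 0, 0, 2, 4))  # 0
```

A large absolute value means the patch contains the kind of edge the feature is looking for, while a flat patch gives 0.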
We can apply some of these features to the above image. Every possible size and location of each kernel is used, which produces a large number of features for a single image. So we can say that Haar features represent some characteristics of the face.
But the problem is computation time. If we consider all possible parameters of Haar features (position, size and type), we end up calculating over 160,000 features in a 24×24 window, which is the base window size used to evaluate these features in any given image. This is where integral images come into play.
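To see where a number above 160,000 can come from, here is a small counting sketch. It assumes five base feature shapes (two edge features, two line features and one four-rectangle feature), which is a common way to reproduce this figure but is not spelled out in this article. Each shape is stretched to every size that is a whole multiple of its base size and placed at every position where it still fits inside the window.

```python
def count_haar_features(window=24,
                        base_shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count every (shape, size, position) combination inside a square window.

    base_shapes holds (width, height) of the smallest version of each feature.
    Assumption: the five shapes listed here; other feature sets give other totals.
    """
    total = 0
    for base_w, base_h in base_shapes:
        # Widths and heights grow in whole multiples of the base size.
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                positions_x = window - w + 1
                positions_y = window - h + 1
                total += positions_x * positions_y
    return total


print(count_haar_features(24))  # 162336
```

Under these assumptions the count is 162,336, which matches the "over 160,000" figure for a 24×24 window.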
Step 2: Integral Images
As I mentioned, each feature calculation needs the sum of the pixels under the white and black regions. To make this fast, Viola and Jones introduced the integral image.
It simplifies the calculation of the sum of pixels.
In an integral image, the value at each point is the sum of all pixels above and to the left of it, including the pixel itself (shown below).
When we add up the pixels inside the bordered box of the original image, we get a sum of 8, and six elements were involved in the calculation.
Now let's calculate the same sum using the integral image. You only need the values at the four corners of the bordered box: add the two corner values marked in grey, then subtract the two corner values marked in yellow. We get the same answer (21 + 1 - (11 + 3) = 8) with only four elements involved in the calculation. No matter how many pixels are in the rectangle, we only need to look up 4 corner values.
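The four-corner trick can be sketched as follows. This is a minimal pure-Python version with a made-up 4 x 4 image, not the numbers from the figure. One implementation choice to note: the integral image is padded with an extra row and column of zeros, so the lookups at the top and left borders need no special cases.

```python
def integral_image(image):
    """Build an integral image padded with a leading row and column of zeros.

    ii[r][c] holds the sum of all pixels in image[0:r][0:c].
    """
    rows, cols = len(image), len(image[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += image[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of any rectangle using only four lookups in the integral image."""
    bottom, right = top + height, left + width
    return ii[bottom][right] + ii[top][left] - ii[top][right] - ii[bottom][left]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
ii = integral_image(image)

# The 2 x 2 box in the centre: 6 + 7 + 10 + 11 = 34.
print(box_sum(ii, top=1, left=1, height=2, width=2))  # 34
# The whole image: 1 + 2 + ... + 16 = 136.
print(box_sum(ii, top=0, left=0, height=4, width=4))  # 136
```

The integral image is built once per image, and after that every rectangle sum costs four lookups regardless of the rectangle size.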
But among all the features we calculated, some are irrelevant. So how do we determine the best features to represent an object out of 160,000+ features? This is where AdaBoost comes into play.
Step 3: AdaBoost
As I mentioned earlier, a 24×24 window contains over 160,000 features, but only a few of them are important for identifying a face. So we use the AdaBoost algorithm to pick the best features.
Now, let's see how this process works:
In the above diagram, the positive images are faces and the negative images are non-faces. We have to separate the faces from the non-faces, but no single feature can do this on its own.
So we combine several features, each with its own weight, to separate faces from non-faces. Let's see how we are going to do it.
First, we assign equal weight to all faces and non-faces. Then we select the classifier with the lowest weighted error (i.e., a weak classifier).
By weak classifier, I mean a simple threshold test on one feature value x, here at 0.25. If the value is less than 0.25, it's not a face. If it's higher than 0.25, it's a face. This threshold gives the lowest error, so we pick it as our first weak classifier.
But there are some non-faces in the face region. So we increase the weights of these misclassified images and decrease the weights of the correctly classified ones (shown below).
We repeat this step to identify more weak classifiers.
These are non-faces, but they fall on the face side of the threshold, so they are misclassified (shown below).
Again, we increase the weights of the misclassified images and decrease the weights of the correctly classified ones.
Then we find a new feature and again check what is misclassified.
This time some faces have fallen into the non-face category, so they are misclassified (as shown above). Again we increase the weights of these images and repeat the same process.
In the end, we have a linear combination of weak classifiers that forms the final strong classifier.
With a linear combination of several such classifiers, we are able to design a system that separates faces from non-faces. This is training.
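The training loop described above can be sketched in pure Python. This is the textbook discrete AdaBoost with threshold classifiers ("stumps"), which may differ in details from the exact variant used by Viola and Jones. The toy data (one feature response per sample, ten samples) and all function names are made up for illustration. Labels are +1 for face and -1 for non-face.

```python
import math


def stump_predict(sample, feature, threshold, polarity):
    """Weak classifier: threshold one feature value; polarity flips the side."""
    raw = 1 if sample[feature] > threshold else -1
    return polarity * raw


def train_adaboost(samples, labels, rounds):
    n = len(samples)
    weights = [1.0 / n] * n  # start with equal weights
    model = []
    for _ in range(rounds):
        best = None
        for feature in range(len(samples[0])):
            values = sorted({s[feature] for s in samples})
            # Candidate thresholds: midpoints between neighbouring values.
            thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
            for threshold in thresholds:
                for polarity in (1, -1):
                    error = sum(
                        w for w, s, y in zip(weights, samples, labels)
                        if stump_predict(s, feature, threshold, polarity) != y
                    )
                    if best is None or error < best[0]:
                        best = (error, feature, threshold, polarity)
        error, feature, threshold, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # weight of this stump
        model.append((alpha, feature, threshold, polarity))
        # Increase weights of misclassified samples, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(s, feature, threshold, polarity))
            for w, s, y in zip(weights, samples, labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def strong_predict(model, sample):
    """Final classifier: sign of the weighted vote of all weak classifiers."""
    score = sum(a * stump_predict(sample, f, t, p) for a, f, t, p in model)
    return 1 if score > 0 else -1


# Toy data: no single threshold separates the two classes.
samples = [[x] for x in range(10)]
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = train_adaboost(samples, labels, rounds=3)
predictions = [strong_predict(model, s) for s in samples]
print(predictions == labels)  # True
```

No single stump gets this toy data right, but the weighted vote of three stumps classifies all ten samples correctly.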
Let's say you get 2500 features after this step. Now take an image and slide a 24×24 window over it. You need to evaluate 2500 features in every 24×24 window, which is still a time-consuming process. This is where the last step, cascading, comes into play.
Step 4: Cascading Classifiers
The job of the cascade is to quickly discard non-faces, so that we avoid wasting computation on them and spend more time on probable face regions.
We set up a cascade system in which we divide the process of identifying a face into multiple stages.
In the first stage, the subregion is tested against the best features. If that stage evaluates the subregion as positive, meaning it thinks it's a face, the output of the stage is "maybe".
When a subregion gets a "maybe", it is sent to the next stage of the cascade, and the process continues until we reach the last stage.
If all stages approve the subregion, it is finally classified as a human face and presented to the user as a detection.
How does this help us increase speed?
If the first stage gives a negative evaluation, the subregion is immediately discarded and we don't evaluate the remaining features on it. If it passes the first stage but fails at the second stage, it is discarded as well. Basically, a subregion can be discarded at any stage of the classifier.
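The early-exit behaviour can be sketched like this. The three stage functions are hypothetical stand-ins (simple brightness and contrast checks on a flat list of 16 pixel values), not real boosted classifiers. The point is only to show how a subregion stops costing computation as soon as one stage rejects it.

```python
def run_cascade(window, stages):
    """Run the stages in order and stop at the first rejection.

    Returns (is_face, number_of_stages_evaluated).
    """
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index  # discarded, later stages never run
    return True, len(stages)


# Hypothetical toy stages, ordered from cheapest to most specific.
def stage_bright_enough(window):
    return sum(window) / len(window) > 40


def stage_has_contrast(window):
    return max(window) - min(window) > 30


def stage_top_darker_than_bottom(window):
    half = len(window) // 2
    return sum(window[:half]) < sum(window[half:])


stages = [stage_bright_enough, stage_has_contrast, stage_top_darker_than_bottom]

flat_dark = [10] * 16                # rejected by the first stage
flat_bright = [100] * 16             # passes stage 1, rejected by stage 2
face_like = [60] * 8 + [120] * 8     # passes all three stages

print(run_cascade(flat_dark, stages))    # (False, 1)
print(run_cascade(flat_bright, stages))  # (False, 2)
print(run_cascade(face_like, stages))    # (True, 3)
```

Since most subregions of a typical image are not faces, most of them exit after one or two cheap stages, and only the promising ones pay for the full evaluation.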
Conclusion:
The Haar cascade is one of the best-known face detection algorithms. Haar features are used not only to detect faces but also eyes, lips, cars, animals, license plates, etc. The pretrained models are stored on GitHub[5].
Check out the references for more details.
References: