Face recognition is commonly used in automated surveillance, individual identification, and database searches for specific faces. The face recognition process consists of three stages: face detection, representation, and matching. A face is first detected in the query image, features are then extracted from it using a face recognition algorithm, and finally the query face is matched against the database. However, face recognition algorithms perform poorly in unconstrained environments, such as those with variations in lighting, pose, and facial expression. This paper proposes a face recognition system designed to address these challenges using Convolutional Neural Networks (CNNs), Local Binary Pattern (LBP) histograms, and Histogram of Oriented Gradients (HOG). Initially, face detection from the input image is accomplished using the Viola-Jones technique. The feature space is created by fusing the features extracted using the CNN, HOG, and the LBP histogram. SVM and KNN classifiers are used to assess the classification ability for various HOG cell sizes.
Hardware and software technologies have advanced significantly in the past two decades. Due to this advancement, information technologies such as Artificial Intelligence and Machine Learning have grown rapidly. These technologies use modern devices to build more effective and comfortable methods of human-computer interaction. Computer vision, one area of machine intelligence, focuses on replicating or simulating visual perception. Applications of computer vision systems have included automated industrial quality control and assembly line inspection. As the cost of computer systems and video image-collecting technology has dropped, computer vision has expanded to more sophisticated applications, including facial recognition and facial tracking. Due to scaling and illumination issues, face recognition has remained a challenge in the computer vision domain to date. Other challenges related to face detection are facial expression recognition and face authentication. These problems have traditionally been addressed using segmentation, facial feature detection, and face verification in complex contexts.
The difficulties of face detection are exacerbated by variations in size, location, orientation, posture, facial expression, occlusion, and illumination. Face tracking is a subproblem of the broader field of visual object tracking in computer vision. A considerable amount of research has addressed object tracking in computer vision, including applications such as autonomous robots.
A real-time captured image sequence generally exhibits minimal variation from one frame to the next. Consequently, the object information present across frames within a defined time interval tends to be significantly redundant. This redundancy can be leveraged to monitor particular objects and differentiate between various visual elements. Since the human visual system struggles to differentiate between a face and a complex background, identifying redundancy in a series of images continues to pose one of the most significant challenges in the field of recognition.
Many works address face tracking through face recognition; we first discuss the papers on face detection and then continue with face tracking methods. Chi et al.
[2] studied the current schemes in the field of visual surveillance. Recent developments in computer vision-based applications have sparked much interest in face detection and recognition. Hao et al. [3] explained that face detectors based on Convolutional Neural Networks (CNNs) are ineffective when dealing with faces of various sizes, because they rely on multi-scale testing or on a single large model that represents faces across a wide range of scales. Further, Zhang et al. [4] noted that high-performance face detection remains a complex problem, mainly when there are several small faces. They presented RefineFace, a single-shot refinement face detector with high efficiency. Lenz et al. [5] presented the first purely event-based approach for face detection, which uses the high temporal resolution of an event-based camera to track the movement of an object in a shot. Guo et al. [6] stated that convolutional neural network-based face and object recognition methods (such as OverFeat, R-CNN, and DenseNet) precisely extract multi-scale features from an image. Zhang et al. [7] concluded that anchor-based deep face recognition techniques have shown promising outcomes, but that deep learning-based methods have difficulty identifying hard faces that are small, blurry, or partially occluded. Further, Liang et al. [8] explained that face recognition from low-light exposures is complex due to the small number of photons available and the unavoidable noise, which is also spatially unevenly distributed, making the task even more difficult. According to Zhou et al. [9], in the security field, images taken by outdoor surveillance cameras typically contain faces that are distorted, occluded, tiny, and in a variety of poses, owing to external factors such as camera pose and distance as well as weather conditions. Chen et al. [10] also explained that face spoofing puts the security of face recognition systems in jeopardy.
Previous anti-spoofing research has focused on supervised methods, with binary or auxiliary supervision being the most common. Xu et al. [11] introduced CenterFace, a one-stage approach that predicts the locations of the facial box and landmarks in real time with higher accuracy. Mahmoud et al. [12] suggested a robust method for detecting hidden faces under various camera angles and lighting conditions, in which a hybrid non-linear transform model that blends the RGB and YCbCr color models identifies human skin patches. Li et al. [13] suggested a face identification technique in the wild that uses a multi-task discriminative learning framework to integrate a ConvNet with a 3D mean face model. Zhang et al. [14] presented a cascaded Convolutional Neural Network called the Supervised Transformer Network to predict face regions and the related facial landmarks. Tao et al. [15] used a kernel combination approach (LS-KC-SVM) to construct a locality-sensitive support vector machine to address these problems. Lian et al. [16] used multiple-object tracking algorithms to create a real-time face tracking device. Ren et al. [17] presented a tool for detecting and monitoring the human face in real time using Convolutional Neural Networks and Kalman filters.
Face-tracking algorithms proposed by various authors are reviewed next. Lin et al. [18] proposed face tracking with region-based CNN (FT-RCNN), an effective face tracker based on the Faster-RCNN platform. In addition, Zheng et al. [19] suggested a deep learning-based face detection and tracking system that includes a Regression Network-based Face Tracking (RNFT) model to precisely monitor human faces in video sequences. The Squeeze-and-Excitation Network (SENet) and the Residual Neural Network (ResNet) are combined in its SENResNet model. Li et al. [20] demonstrated a real-time multi-target face detection, monitoring, and recognition algorithm comprising three stages: fast tracking, detection, and rapid recognition. A new GOTURN-based network is used in this work for quick face tracking. Chakravorty et al. [21] addressed visual face monitoring in real-world situations, covering various challenges in face matching. The authors introduced the FaceTrack method, which uses multiple appearance models as well as long- and short-term memory to provide effective face monitoring. Su et al. [22] proposed a fast face tracking-by-detection (FFTD) method that works independently for tasks such as tracking, facial detection, and discrimination. Li et al. [23] explained that face tracking could be used to monitor faces reliably in various situations, including variations in lighting, background clutter, rapid
motion and partial occlusion. Soldie et al. [24] presented a powerful real-time face tracking device with numerous novel capabilities. Short- and long-term memories (STM and LTM) are built into the framework and are used to monitor re-initialization throughout the online learning process. Pham et al. [25] presented a comprehensive hybrid 3D face tracking framework based on RGBD (Red Green Blue-Depth) video streams that tracks head pose and facial gestures without the need for re-calibration or user involvement. Ranganatha et al. [26] proposed a face tracking method that combines the Harris corner detection algorithm and the KLT (Kanade-Lucas-Tomasi) tracker. In the first frame of the video sequence, the Viola-Jones approach detects the face; the detected face region is then extracted and passed to the Harris corner detection algorithm. Male et al. [27] developed a new reference architecture based on four paradigms. The suggested framework allows deep learning ideas, a traditional approach to addressing the domain problem, cognitive agents with social concerns, and nature-inspired computing concepts to be integrated. According to Yuan et al. [28], traditional face tracking algorithms have obtained good results in some constrained contexts. On the other hand, these methods require facial features to be designed manually based on the researcher’s experience. Wu et al. [29] developed a framework for maintaining identity that simultaneously clusters and connects the faces of different persons in extended video sequences. Congcong et al. [30] proposed Dual-Cycle Deep Reinforcement Learning (DCDRL) to learn a robust face-tracking policy using just weakly-labeled annotations sparsely acquired from raw video data.
The approaches proposed in the literature have not reached human-level performance in identifying faces. Further, the accuracy achieved and the datasets used by the proposed face detection and tracking methods are limited. Although progress in facial recognition has been encouraging, the task remains difficult. To achieve better results, this work proposes a face recognition system using a CNN, the LBP histogram, and HOG to perform face identification under difficult conditions. Initially, face detection from the input image was accomplished using the Viola-Jones technique [19]. The feature space was then created by fusing the features extracted using the CNN, HOG, and the LBP histogram. SVM and KNN classifiers were used to assess the proposed method's classification ability for various HOG cell sizes.
Initially, the exact face area is extracted from the input image using the Viola-Jones algorithm, and the retrieved face region is then resized to 64 × 64 pixels to ensure accurate recognition and computationally efficient processing of the images. Subsequently, HOG, LBP, and CNN features are combined into a comprehensive feature vector that leverages the strengths of each method: the CNN provides robust features, HOG captures the local shape information of the input face image, and LBP extracts texture features. The resulting feature vector is then classified using SVM. The entire process is depicted in Fig. 1.
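The fusion step described above can be sketched as a plain concatenation of the three descriptors. This is a minimal illustration, not the authors' implementation: the function name `fuse_features` is hypothetical, the descriptor values are placeholders, and the CNN feature length of 120 is an assumption (the width of the last fully connected layer); the HOG and LBP lengths are the ones reported in the paper.

```python
from typing import List, Sequence


def fuse_features(hog: Sequence[float],
                  lbp_hist: Sequence[float],
                  cnn: Sequence[float]) -> List[float]:
    """Concatenate the three descriptors into one feature vector."""
    return list(hog) + list(lbp_hist) + list(cnn)


# Placeholder descriptors. Lengths 8100 (HOG, 4x4 cells) and 59 (LBP
# histogram) are reported in the paper; 120 for the CNN is an assumption.
hog = [0.0] * 8100
lbp_hist = [0.0] * 59
cnn = [0.0] * 120

fused = fuse_features(hog, lbp_hist, cnn)
print(len(fused))  # 8279
```

The fused vector is what a classifier such as SVM or KNN would receive; any per-descriptor scaling before concatenation would be an additional design choice that the text does not specify.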
Fig. 1 The process of the proposed face recognition method
3.1 Histogram of Oriented Gradients (HOG)
HOG is a feature descriptor used in image processing and computer vision, here for detecting faces. HOG feature extraction preserves the edges as well as the directionality of the edge information. The HOG shape descriptor is used to characterize the shape of local objects in computer vision (Dalal and Triggs [31]). HOG splits the image into small connected blocks, which are further divided into cells, each of which is a matrix of pixels. Each pixel casts a weighted vote for an orientation-based histogram channel, and the histogram channels are evenly spread over 0 to 360 degrees. To obtain these votes, the gradient direction of each pixel in the cell is determined. Let P(·) be the intensity function denoting the grayscale values of the image. The gradient of each pixel is computed in the horizontal and vertical directions, and the gradient magnitude and orientation are obtained from these two components.
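These gradient quantities are commonly written with centered differences, as in Dalal and Triggs. The following is a sketch under that assumption; the symbols G_x, G_y, M, and θ are introduced here for illustration, and only P(·) comes from the text.

```latex
\begin{align}
G_x(x, y) &= P(x+1, y) - P(x-1, y), \\
G_y(x, y) &= P(x, y+1) - P(x, y-1), \\
M(x, y) &= \sqrt{G_x(x, y)^2 + G_y(x, y)^2}, \\
\theta(x, y) &= \arctan\!\left(\frac{G_y(x, y)}{G_x(x, y)}\right).
\end{align}
```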
The gradient magnitudes are used as weights and accumulated to create a histogram vector for each cell. To enhance robustness against edge contrast, shadows, and illumination, these histogram vectors are normalized. The final HOG representation consists of the vectors from all normalized cells within each block. With a cell size of 4 × 4 for a 64 × 64 image, this results in a feature vector of dimension 1 × 8100. The input image and the corresponding HOG descriptor image are shown in Fig. 2.
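The reported length of 8100 can be reproduced with a short calculation. The sketch below assumes blocks of 2 × 2 cells that slide one cell at a time and 9 orientation bins per cell; these settings are not stated in the text, but they are common defaults and are consistent with the reported length. The function name `hog_feature_length` is illustrative.

```python
def hog_feature_length(image_size: int, cell_size: int,
                       block_cells: int = 2, bins: int = 9) -> int:
    """Length of a HOG descriptor for a square image.

    Assumes blocks of block_cells x block_cells cells that slide one cell
    at a time, and `bins` orientation bins per cell histogram.
    """
    cells_per_side = image_size // cell_size
    blocks_per_side = cells_per_side - block_cells + 1
    values_per_block = block_cells * block_cells * bins
    return blocks_per_side * blocks_per_side * values_per_block


print(hog_feature_length(64, 4))  # 8100
print(hog_feature_length(64, 8))  # 1764
```

Under the same assumptions, the 8 × 8 cell size yields a much shorter descriptor, which trades spatial detail for compactness.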
Fig. 2 (a) Face detected image (b) HOG Descriptor
Local Binary Patterns
Ojala et al. [32] presented LBP as a local texture descriptor. The grayscale value of each of the eight adjacent pixels in the 3 × 3 neighborhood is compared with that of the center pixel of the window. If the neighbor's value is greater than that of the central pixel, a one is placed at that pixel location; otherwise, a zero is placed there, as given in Eqn. (5).
From Eqn. (6), eight bits are generated; they are then summed with weights of 2^p to obtain the LBP value, where g_p (p = 0, 1, 2, ..., 7) represents the eight pixels around the center pixel, g_c denotes the grayscale value of the center pixel, and (x_c, y_c) is the location of the central pixel. (P, R) denotes P neighboring points at a radius of R. The generation of the LBP code is illustrated in Fig. 3. For the resized face, the histogram of the LBP image gives a feature vector of dimension 1 × 59. The resized face image, the LBP image, and the corresponding histogram are shown in Fig. 4.
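The thresholding and weighting steps can be illustrated for a single 3 × 3 window. This is a minimal sketch: the neighbor ordering (clockwise from the top-left corner) is an assumption, and the strict "greater than" comparison follows the wording above, whereas the formulation of Ojala et al. uses "greater than or equal". The second part shows that a 59-bin histogram is consistent with the uniform-pattern variant of LBP (58 uniform codes plus one shared bin); that the paper uses this variant is likewise an assumption.

```python
def lbp_code(window):
    """LBP code of the center pixel of a 3x3 grayscale window (P=8, R=1)."""
    center = window[1][1]
    # Neighbors g_0..g_7, visited clockwise from the top-left corner.
    neighbors = [window[0][0], window[0][1], window[0][2], window[1][2],
                 window[2][2], window[2][1], window[2][0], window[1][0]]
    # Bit p is set when neighbor g_p is greater than the center g_c,
    # and contributes a weight of 2**p to the code.
    return sum(1 << p for p, g in enumerate(neighbors) if g > center)


def is_uniform(code: int, bits: int = 8) -> bool:
    """True when the circular bit string has at most two 0/1 transitions."""
    transitions = 0
    for p in range(bits):
        a = (code >> p) & 1
        b = (code >> ((p + 1) % bits)) & 1
        transitions += int(a != b)
    return transitions <= 2


window = [[6, 5, 2],
          [7, 6, 1],
          [9, 8, 7]]
print(lbp_code(window))  # 240

uniform_codes = [c for c in range(256) if is_uniform(c)]
num_bins = len(uniform_codes) + 1  # one extra bin for all non-uniform codes
print(num_bins)  # 59
```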
Fig. 3 The LBP with R=1 and P=8.
Fig. 4: Sample of (a) Resized image (b) LBP image (c) Histogram of (b)
The framework of the CNN is depicted in Fig. 5. It contains three convolutional layers with 8, 16, and 32 filters, each using ReLU as the activation function. The input to the first convolutional layer is an image of dimension 64 × 64 × 1. The first convolutional layer comprises eight filters with 3 × 3 kernels and a stride of one, so the output of Convolution 1 is eight feature maps of dimension 62 × 62. In the proposed CNN, every convolutional layer is followed by a max-pooling layer with a kernel size of 2 × 2 and a stride of two. Maxpooling1 produces an output of dimension 31 × 31 × 8. Convolution 2 and Convolution 3 produce outputs of sizes 29 × 29 × 16 and 12 × 12 × 32 respectively. Maxpooling2 and Maxpooling3 generate outputs of sizes 14 × 14 × 16 and 6 × 6 × 32 respectively. Two fully connected layers of sizes 250 and 120 follow the Maxpooling3 layer. The numbers of learnable parameters of the proposed CNN are tabulated in Table 1. The proposed CNN is trained using Stochastic Gradient Descent with a batch size of four. In every class of the face database, 70% of the images were used for training and 30% for testing.
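The per-class 70/30 split can be sketched as follows. This is an illustration under assumptions: the text does not say how images were assigned to each part, so random shuffling with a fixed seed is assumed, and the helper name `split_per_class` is hypothetical.

```python
import random
from collections import defaultdict


def split_per_class(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), split separately in each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(train_fraction * len(indices))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# ORL-like layout: 40 subjects with 10 images each.
labels = [subject for subject in range(40) for _ in range(10)]
train_idx, test_idx = split_per_class(labels)
print(len(train_idx), len(test_idx))  # 280 120
```

Splitting inside each class keeps every subject represented in both parts, which matters for small databases such as ORL with only ten images per subject.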
Fig. 5 The Architecture of the proposed CNN
Table 1 Number of learnable parameters of the proposed CNN
Layer | Activation shape | Number of learnable parameters
Conv1 | (62, 62, 8) | 80
Conv2 | (29, 29, 16) | 1168
Conv3 | (12, 12, 32) | 4640
FC1 | (250, 1) | 288250
FC2 | (120, 1) | 30120
Total number of learnable parameters | | 324258
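The activation shapes and parameter counts in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes convolutions without padding and pooling that discards any leftover row or column, which is consistent with the 62 × 62 and 14 × 14 outputs stated in the text; FC2 is taken to have 120 units, as in the text. The helper names are illustrative.

```python
def conv_out(size, kernel=3, stride=1):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - kernel) // stride + 1


def conv_params(in_channels, out_channels, kernel=3):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * out_channels + out_channels


def fc_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


size, channels = 64, 1
shapes, params = {}, {}
for i, filters in enumerate([8, 16, 32], start=1):
    size = conv_out(size)
    shapes[f"Conv{i}"] = (size, size, filters)
    params[f"Conv{i}"] = conv_params(channels, filters)
    size = pool_out(size)
    shapes[f"Maxpooling{i}"] = (size, size, filters)
    channels = filters

flat = size * size * channels  # 6 * 6 * 32 = 1152 inputs to FC1
params["FC1"] = fc_params(flat, 250)
params["FC2"] = fc_params(250, 120)
total = sum(params.values())

print(shapes)
print(params)
print(total)  # 324258
```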
Experiments and Results
We conducted experiments on the ORL (Jin et al. [33]), Extended Yale B (Georghiades et al. [34]), and CMU-PIE (Gross et al. [35]) face datasets. ORL includes 400 images of 40 subjects, with 10 different images per person taken under varying lighting, poses, and facial details. Extended Yale B comprises 16,128 face images of 28 persons under nine different poses and 64 lighting conditions. CMU-PIE includes 41,368 images of 68 subjects, captured under 13 distinct poses, 43 different lighting conditions, and four distinct expressions. A few images from these datasets are shown in Fig. 6. Table 2 shows the recognition rate on the chosen face databases for the HOG, LBP histogram, and CNN features, both individually and in combination, with a HOG cell size of 4x4. For comparison, the recognition rate with the KNN classifier is given alongside that of the SVM classifier. From Table 2, it is observed that the CNN gives a higher recognition rate than HOG and the LBP histogram across all the chosen databases. Among all the combinations, the proposed method (HOG + histogram of LBP + CNN) gives the highest recognition rate. Table 3 gives the recognition rates for HOG with a cell size of 8x8 on the same databases. With the SVM classifier, the recognition rate of the suggested approach on ORL, Extended Yale B, and CMU-PIE is 98.48%, 97.33%, and 97.28% respectively for the 4x4 cell size, whereas for the 8x8 cell size it is 98.12%, 96.95%, and 96.74% respectively. Tables 2 and 3 thus show that HOG with a cell size of 4x4 produces better results than HOG with an 8x8 cell size for the suggested method. To evaluate the proposed methodology, the following performance metrics were used: precision, recall, specificity, and F1-score. The performance metrics on the aforementioned databases are given in Tables 4, 5, and 6.
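The four metrics can be computed from a confusion matrix as sketched below. Since face recognition is a multi-class problem, the sketch treats each class one-vs-rest and averages over classes (macro averaging); this averaging scheme is an assumption, as the text does not state how the per-database values were aggregated. The confusion matrix uses made-up counts, not results from the paper.

```python
def per_class_metrics(confusion, k):
    """One-vs-rest metrics for class k.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[i][k] for i in range(n)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, specificity, f1


def macro_average(confusion):
    """Average each metric over all classes."""
    n = len(confusion)
    rows = [per_class_metrics(confusion, k) for k in range(n)]
    return tuple(sum(column) / n for column in zip(*rows))


# Toy 3-class confusion matrix (made-up counts for illustration only).
confusion = [
    [9, 1, 0],
    [0, 10, 0],
    [1, 0, 9],
]
precision, recall, specificity, f1 = macro_average(confusion)
print(f"precision={precision:.4f} recall={recall:.4f} "
      f"specificity={specificity:.4f} f1={f1:.4f}")
```

Note that with macro averaging the reported F1-score is the mean of the per-class F1 values, which generally differs from the harmonic mean of the averaged precision and recall.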
Fig. 6: Database images of (a)ORL (b)Extended YALE B (c)CMUPIE databases
Table 2 Recognition rate (%) using KNN and SVM classifier with HOG Cell size =4x4
Method | ORL (KNN) | ORL (SVM) | Extended Yale B (KNN) | Extended Yale B (SVM) | CMU-PIE (KNN) | CMU-PIE (SVM)
LBP | 95.27 | 97.34 | 93.25 | 94.43 | 94.35 | 95.52
HOG | 95.69 | 97.73 | 93.47 | 94.57 | 94.74 | 95.84
CNN | 96.82 | 97.91 | 94.84 | 95.93 | 95.46 | 96.73
LBP+HOG | 96.61 | 97.83 | 94.21 | 95.42 | 95.31 | 95.86
LBP+CNN | 97.21 | 98.17 | 95.46 | 96.61 | 95.82 | 96.88
HOG+CNN | 97.46 | 98.23 | 95.83 | 96.94 | 96.66 | 97.11
LBP+HOG+CNN | 97.83 | 98.48 | 96.57 | 97.33 | 96.83 | 97.28
Table 3 Recognition rate (%) using KNN and SVM classifier with HOG Cell size =8x8
Method | ORL (KNN) | ORL (SVM) | Extended Yale B (KNN) | Extended Yale B (SVM) | CMU-PIE (KNN) | CMU-PIE (SVM)
LBP | 93.37 | 94.76 | 93.74 | 94.24 | 93.43 | 95.58
HOG | 93.64 | 94.83 | 93.48 | 94.63 | 93.68 | 95.42
CNN | 94.68 | 95.46 | 94.82 | 95.47 | 94.94 | 95.86
LBP+HOG | 94.42 | 95.84 | 94.62 | 94.95 | 94.25 | 95.67
LBP+CNN | 95.62 | 96.23 | 94.64 | 95.24 | 95.24 | 96.32
HOG+CNN | 95.86 | 96.78 | 95.22 | 96.37 | 95.83 | 96.68
LBP+HOG+CNN | 96.15 | 98.12 | 95.64 | 96.95 | 95.96 | 96.74
Table 4 Performance Metrics on the ORL database
Classifier | Precision | Recall | Specificity | F1-Score
KNN | 0.9925 | 0.9846 | 0.9763 | 0.9785
SVM | 0.9887 | 0.9972 | 0.9742 | 0.9836
Table 5 Performance Metrics on the Extended Yale B database
Classifier | Precision | Recall | Specificity | F1-Score
KNN | 0.9763 | 0.9749 | 0.8868 | 0.9741
SVM | 0.9936 | 0.9884 | 0.9364 | 0.9779
Table 6 Performance Metrics on the CMU-PIE database
Classifier | Precision | Recall | Specificity | F1-Score
KNN | 0.9742 | 0.9723 | 0.9682 | 0.9768
SVM | 0.9983 | 0.9767 | 0.9863 | 0.9756
Table 7 Comparison of the suggested method with other techniques on the ORL.
Method | Recognition accuracy (%)
PCA (Cavalcanti et al. (2013)) | 95.85
LDA (Lu et al. (2012)) | 91.45
DLPV (Wen, Zhang, von Deneen and He (2016)) | 96.65
LBP (Ojala et al. (2002)) | 95.35
LOOP (Chakraborti et al. (2018)) | 97.31
GA-CNN (Rikhtegar et al. (2016)) | 94.61
SIAMESE (Wang, Yang, Xiao, Li and Zhou (2014)) | 92.10
DCT+LBP (Khan et al. (2015)) | 95.10
SSV (Zaaraoui et al. (2021)) | 96.75
Proposed method | 98.48
Table 8 Comparison of the proposed method with other methods on the Extended Yale B.
Method | Recognition accuracy (%)
PCA (Cavalcanti et al. (2013)) | 83.47
LDA (Lu et al. (2012)) | 85.41
DLPV (Wen, Zhang, von Deneen and He (2016)) | 89.90
LBP (Ojala et al. (2002)) | 89.32
LOOP (Chakraborti et al. (2018)) | 95.36
GA-CNN (Rikhtegar et al. (2016)) | 93.84
SIAMESE (Wang, Yang, Xiao, Li and Zhou (2014)) | 92.52
DCT+LBP (Khan et al. (2015)) | 94.36
SSV (Zaaraoui et al. (2021)) | 94.42
Proposed method | 97.33
Comparison of the proposed method with other techniques
To demonstrate its effectiveness, the suggested approach is compared with existing approaches. Holistic feature extraction methods such as PCA (Cavalcanti et al. [36]), LDA (Lu et al. [37]), Discriminative Locality Preserving Vectors (DLPV) (Wen, Zhang, von Deneen and He [38]), and 2-Dimensional Random Projection (2DRP) (Leng et al. [39]), together with local feature descriptors, namely LBP (Ojala et al. [40]), Full Ranking (FR) (Chan et al. [41]), Local Optimal Oriented Pattern (LOOP) (Chakraborti et al. [42]), Local Quadruple Pattern (LQP) (Chakraborty et al. [43]), and Strings of Successive Values (SSV) (Zaaraoui et al. [44]), were used for comparison. Deep learning techniques such as the Genetic Algorithm optimized structure of CNN (GA-CNN) (Rikhtegar et al. [45]) and the SIAMESE network (Wang, Yang, Xiao, Li and Zhou [46]), as well as fusion-based approaches such as DCT+LBP (Khan et al. [47]), are also included. The recognition accuracy of the suggested approach is compared with that of the other methods on the chosen databases in Tables 7 and 8 and Fig. 7.
Fig. 7: Comparison of the proposed method with other techniques on the CMU-PIE
In this work, a novel convolutional neural network-based face recognition method is proposed. Initially, the Viola-Jones algorithm was used for face detection from the input image. Features were then extracted using HOG, the histogram of LBP, and the proposed CNN, and fused to create the feature space. The classification capacity of the suggested approach was tested with SVM and KNN classifiers for different HOG cell sizes. Of these two classifiers, SVM gave the higher recognition rate. The ORL, Extended Yale B, and CMU-PIE databases were used for the experimental work, on which the method attained recognition rates of 98.48%, 97.33%, and 97.28% respectively. Our experimental work shows that the proposed approach improves the face recognition rate compared to several existing techniques. In the future, we plan to extend the proposed system to track faces in videos from real surveillance systems.