Article Information
Corresponding Author: Jin Wang* (jinwang@csust.edu.cn)
Jianming Zhang*, Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation and School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China, jmzhang@csust.edu.cn
Xiaokang Jin*, Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation and School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China, jxk726@163.com
Yukai Liu*, Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation and School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China, lyk0311@163.com
Arun Kumar Sangaiah**, School of Computer Science and Engineering, Vellore Institute of Technology (VIT), Vellore, India, arunkumarsangaiah@gmail.com
Jin Wang*, Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation and School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China, jinwang@csust.edu.cn
Received: August 31, 2018
Revision received: October 2, 2018
Accepted: October 22, 2018
Published (Print): December 31, 2018
Published (Electronic): December 31, 2018
1. Introduction
Face recognition, a classical and important task in computer vision, is widely used in video retrieval and pedestrian tracking, and more recently in distributed diagnosis and home healthcare. Stephen et al. [1] constructed a computer model based on cognitive learning of facial images that can both make accurate physical health judgments and predict body mass index (BMI) and blood pressure, thereby supporting doctors in the early identification and treatment of diseases. Face recognition has also been applied extensively in Internet of Things (IoT) scenarios such as face-based intelligent attendance systems, intelligent video surveillance in public spaces, and secure payment systems. These extensive applications have attracted many scholars at home and abroad [2,3]. Despite considerable progress, face recognition remains challenging under interference factors such as angle and illumination variation, expression and posture changes, noise, low resolution, object occlusion, and a small number of single-class samples spread over numerous (up to millions of) categories. A human can recognize a person at first sight, whereas this remains difficult for a computer, and algorithms must cope with classifier training on small samples. Representative face recognition algorithms [4-6] show that performance decreases dramatically as the number of training samples drops, and this also holds for the convolutional neural network (CNN), which otherwise performs excellently in object detection and classification [7]. Therefore, face recognition with small samples is an extremely challenging topic.
In recent years, benefiting from massive training data and improved hardware computing capabilities, deep learning has made great progress in the fields of image [8,9], voice [10], and text [11] processing. In public datasets, as the number of categories increases, the number of samples per category must be enriched to facilitate the training of network models and improve classification performance. In real-world face recognition, however, there are often many categories with only a few samples each; this per-category insufficiency greatly limits recognition performance.
In this work, with the help of the Siamese network [12], we use pairs of face images as inputs to expand the effective number of samples per category, and we propose a small-sample face recognition algorithm based on a self-constructed Siamese network that does not require a large amount of training data. Training a CNN with the contrastive loss function [12] learns a mapping from input image pairs to a target space in which the L2-norm distance approximates the semantic distance of the source space. During training, the network parameters are learned by minimizing the loss function so as to diminish the distance between face image pairs from the same person and increase the distance between pairs from different persons. Experiments comparing several loss functions show that the proposed network model, combined with a method for generating training data, can effectively improve face recognition accuracy, achieving better recognition rates on the AR and Labeled Faces in the Wild (LFW) datasets.
2. Related Work
Traditional face recognition algorithms have achieved much through years of development. The work in [13] proposes sparse representation based classification (SRC), which represents a face image as a linear combination of all training samples of the same person. Compared with other common methods, SRC is more effective when each category has only a few training samples. The Gabor wavelet can capture local structure information corresponding to spatial frequency, spatial position, and direction; applying Gabor features to SRC [14] improves the recognition rate significantly. Although SRC improves the face recognition rate effectively, it incurs a high computational cost. Zhang et al. [15] propose collaborative representation based classification (CRC), pointing out that the L1-norm regularization used by SRC is computationally expensive, whereas an L2 regularization constraint achieves similar recognition results with better computational efficiency. Nevertheless, the performance of both SRC and CRC degrades greatly when training samples are insufficient. A newer representative method, hierarchical CRC (HCRC), is proposed in [16]. Compared with traditional collaborative representation methods, HCRC introduces the Euclidean distance from projective vectors to training vectors, which improves recognition precision effectively even when training samples are scarce.
In recent years, CNN-based algorithms have made great achievements in face verification and recognition [17-19], attaining higher accuracy than methods based on handcrafted features [20,21]. A deep learning model proposed in [22] can restore frontal facial features, greatly reducing intra-personal differences and improving recognition performance. DeepFace [23] uses complex 3D face alignment and four million facial images to derive a face representation from a 9-layer deep neural network. DeepID1 [24] crops facial images into patches, extracts features from the patches, and integrates them with Joint Bayesian; these features contain rich category information. DeepID2 [25] exploits both contrastive loss and softmax loss for network feedback regulation, using a great number of positive and negative sample pairs as training data: positive samples reduce intra-category distance, while negative samples increase inter-category distance. However, the samples are generated randomly, which makes the network model unstable. HaarNet [26] is a network whose backbone extracts global image information while its three branches apply Haar-like features over regions of interest (ROI), significantly improving face recognition accuracy. FaceNet [27] maps face images into a Euclidean space in which distance represents facial similarity, and trains with a triplet loss function, achieving high performance on pose-variant face recognition; its training set contains up to 200 million images.
3. Face Recognition Algorithm Based on Siamese Network
In traditional feature extraction algorithms, the feature operators are handcrafted: a human chooses which kind of features to extract, which limits the robustness and extensibility of the algorithms. The advantage of CNN over traditional methods is that the parameters of the entire model are obtained by autonomous learning. This is superior in two respects. First, autonomously learned features are more robust and have stronger expressive ability. Second, it greatly reduces manual labor and avoids parameters that are designed inappropriately for the model because of insufficient human experience. CNN performs well in many areas of image processing and in some respects exceeds both traditional image processing methods and human ability.
CNN achieves such performance mainly through the autonomous learning ability of its network model and the use of numerous training data. CNN obtains suitable model parameters by learning features extracted from the training data; in short, data plays a critical role in training an excellent network model, and performance is unsatisfactory when the model is trained on little data. To date, CNN recognition performance has been seriously affected by face datasets with insufficient samples per category. In this paper, we propose a face recognition algorithm based on the Siamese network, designing and implementing two different network models, and we still achieve high recognition accuracy when the number of single-class training samples is small.
3.1 Siamese Network
The Siamese network [12] is a CNN that is divided into two branches from input to output, and the two branches share the same weights. The Siamese network is special in that its training samples are image pairs: the two branches extract features from the pair respectively, yielding a pair of feature vectors as output. Fig. 1 shows the architecture of the Siamese network.
Here, [TeX:] $$< \boldsymbol { X } _ { 1 } , \boldsymbol { X } _ { 2 } >$$ is the input image pair and [TeX:] $$< G _ { W } \left( \boldsymbol { X } _ { 1 } \right) , G _ { W } \left( \boldsymbol { X } _ { 2 } \right) >$$, calculated by the network mapping, is the output feature pair, where W denotes the parameters of the network model. [TeX:] $$\left\| G _ { W } \left( \boldsymbol { X } _ { 1 } \right) - G _ { W } \left( \boldsymbol { X } _ { 2 } \right) \right\| _ { 2 }$$ feeds the loss function, which adjusts the parameters of the entire network.
Fig. 1. Siamese network architecture.
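To make the weight-sharing idea concrete, the following is a minimal PyTorch sketch (PyTorch is the framework used in Section 5) of a Siamese wrapper around a single embedding network G_W; the class and argument names are ours, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseWrapper(nn.Module):
    """Minimal sketch of the Siamese idea: one embedding network G_W is
    applied to both inputs, so the two branches share all weights."""
    def __init__(self, embedding_net: nn.Module):
        super().__init__()
        self.embedding_net = embedding_net  # a single G_W, reused for both inputs

    def forward(self, x1, x2):
        g1 = self.embedding_net(x1)         # G_W(X1)
        g2 = self.embedding_net(x2)         # G_W(X2)
        # Euclidean (L2) distance between the two embeddings, per pair
        return F.pairwise_distance(g1, g2)
```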
3.2 Face Recognition Oriented Siamese Network Model Design
In this paper, we design and implement two different network models based on the Siamese network, named SiameseFace1 and SiameseFace2, to improve the accuracy of face recognition.
3.2.1 SiameseFace1 model
The single-branch network of SiameseFace1 consists of 7 convolutional layers, 3 pooling layers, and 3 fully-connected layers, and outputs a 400-dimensional feature vector. The two outputs of the Siamese network are compared by the Euclidean distance of their features to judge whether they come from the same person. Denoting the feature pair [TeX:] $$< G _ { W } \left( \boldsymbol { X } _ { 1 } \right) , G _ { W } \left( \boldsymbol { X } _ { 2 } \right) >$$ as [TeX:] $$G _ { W } \left( \boldsymbol { X } _ { 1 } \right) = \left( x _ { 1 } ^ { ( 1 ) } , x _ { 1 } ^ { ( 2 ) } , \ldots , x _ { 1 } ^ { ( i ) } , \ldots , x _ { 1 } ^ { ( 400 ) } \right)$$ and [TeX:] $$G _ { W } \left( \boldsymbol { X } _ { 2 } \right) = \left( x _ { 2 } ^ { ( 1 ) } , x _ { 2 } ^ { ( 2 ) } , \ldots , x _ { 2 } ^ { ( i ) } , \ldots , x _ { 2 } ^ { ( 400 ) } \right)$$, a Euclidean distance [TeX:] $$D < \tau$$ indicates that the image pair is cropped from the faces of the same person, while [TeX:] $$D > \tau$$ indicates faces of different persons. Fig. 2 shows the network architecture of SiameseFace1.
As Fig. 2 shows, each training input is an image pair with a label: label 0 denotes a pair from the faces of the same person, and label 1 denotes a pair from the faces of different persons. The input images are resized to 120×120; every convolutional kernel is 3×3 with padding 1 and stride 1. Convolutional layers extract features, each followed by a ReLU activation; max-pooling layers and three fully-connected layers follow. The final output of the network model is a 400-dimensional vector. Table 1 shows the detailed parameters of the SiameseFace1 model.
Fig. 2. SiameseFace1 network architecture.
Table 1. SiameseFace1 network parameters
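The following PyTorch sketch shows one branch of a SiameseFace1-style model under the constraints stated above (7 convolutional layers with 3×3 kernels, stride 1, padding 1, ReLU after each; 3 max-pooling layers; 3 fully-connected layers; 120×120 input; 400-dimensional output). The channel widths are our assumptions; the exact values are in Table 1.

```python
import torch.nn as nn

class SiameseFace1Branch(nn.Module):
    """One branch of a SiameseFace1-style network. Layer counts and kernel
    settings follow the paper; channel widths are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        def conv(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=1, padding=1),
                                 nn.ReLU(inplace=True))
        self.features = nn.Sequential(
            conv(3, 32), conv(32, 64), nn.MaxPool2d(2),    # 120x120 -> 60x60
            conv(64, 64), conv(64, 128), nn.MaxPool2d(2),  # 60x60  -> 30x30
            conv(128, 128), conv(128, 256), conv(256, 256),
            nn.MaxPool2d(2),                               # 30x30  -> 15x15
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 15 * 15, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 400),                           # 400-D feature vector
        )

    def forward(self, x):                                  # x: (B, 3, 120, 120)
        return self.classifier(self.features(x).flatten(1))
```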
3.2.2 SiameseFace2 model
A new lightweight network based on the SiameseFace1 model is designed to optimize the network: we reduce the number of network parameters without losing recognition precision. The SiameseFace2 network architecture is shown in Fig. 3. We add 1×1 convolutional kernels to the SiameseFace2 model to enhance nonlinearity without changing the scale of the feature maps; this facilitates deepening the network, enhances its feature expressive ability, and reduces both the dimensionality and the computational load. In deep CNNs, the low-level convolutional layers extract mostly low-level features such as edges and texture, while the high-level layers extract features carrying more semantic information. Therefore, we concatenate the low-level and high-level features, merging detailed information (e.g., edges and texture) into the semantic features of the high-level layers to enhance the feature expressive ability.
Fig. 3. SiameseFace2 network architecture.
Table 2. SiameseFace2 network parameters
Table 2 shows the detailed parameters of the SiameseFace2 model. "Out" in Table 2 indicates that the feature map of that layer is concatenated at the CAT layer; the output of CAT is used as the input of the fully-connected layers.
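The sketch below illustrates the two SiameseFace2 ideas in PyTorch: a 1×1 convolution that adds nonlinearity and shrinks channels without changing spatial size, and a CAT layer that concatenates a pooled low-level feature map with the high-level one before the fully-connected layers. Layer widths and the exact fusion point are our assumptions; the precise configuration is in Table 2.

```python
import torch
import torch.nn as nn

class SiameseFace2Branch(nn.Module):
    """Illustrative branch with a 1x1 bottleneck and a low/high-level CAT
    layer; widths and fusion point are assumptions, not the paper's Table 2."""
    def __init__(self):
        super().__init__()
        self.low = nn.Sequential(                          # low-level: edges, texture
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2))                               # 120 -> 60
        self.high = nn.Sequential(                         # high-level: semantics
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # 60 -> 30
            nn.Conv2d(128, 64, 1), nn.ReLU(inplace=True))  # 1x1 bottleneck conv
        self.pool_low = nn.AdaptiveMaxPool2d(30)           # match spatial sizes
        self.fc = nn.Linear((64 + 64) * 30 * 30, 400)      # CAT -> 400-D embedding

    def forward(self, x):
        low = self.low(x)                                  # (B, 64, 60, 60)
        high = self.high(low)                              # (B, 64, 30, 30)
        cat = torch.cat([self.pool_low(low), high], dim=1) # CAT layer
        return self.fc(cat.flatten(1))
```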
3.3 Contrastive Loss Function
We use a loss function to estimate the consistency between the model's predictions f(x) and the ground truth. The log loss, square loss, and exponential loss functions are often used, but they are not suitable for the Siamese network; we therefore employ the discriminative contrastive loss function [12] in this paper. Learning the network parameters is a process of minimizing the contrastive loss function so as to reduce the distance measure between faces of the same person and enlarge it between faces of different persons.
As shown in Fig. 1, [TeX:] $$E _ { W } ^ { ( i ) }$$ denotes the Euclidean distance of the output features for sample pair i, [TeX:] $$< X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } >$$. [TeX:] $$E _ { W } ^ { ( i ) }$$ is computed as:

[TeX:] $$E _ { W } ^ { ( i ) } = \left\| G _ { W } \left( X _ { 1 } ^ { ( i ) } \right) - G _ { W } \left( X _ { 2 } ^ { ( i ) } \right) \right\| _ { 2 }$$
We use mini-batches to feed the input data to the CNN in batches for more effective training. The final loss function is:

[TeX:] $$L ( W ) = \frac { 1 } { mb } \sum _ { i = 1 } ^ { mb } H ^ { ( i ) }$$
Here, mb denotes the number of sample pairs per batch, and [TeX:] $$H ^ { ( i ) }$$ represents the contrastive loss of sample pair i, [TeX:] $$< X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } >$$. The contrastive loss value [TeX:] $$H ^ { ( i ) }$$ is computed as:

[TeX:] $$H ^ { ( i ) } = \left( 1 - f ^ { ( i ) } \right) E _ { W } ^ { ( i ) } + f ^ { ( i ) } \max \left( m - E _ { W } ^ { ( i ) } , 0 \right)$$
Here, [TeX:] $$f ^ { ( i ) } \in \{ 0,1 \}$$ denotes the label of sample pair i. The label value [TeX:] $$f ^ { ( i ) } = 0$$ indicates that the pair [TeX:] $$< X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } >$$ shows faces of the same person, in which case [TeX:] $$H ^ { ( i ) } = E _ { W } ^ { ( i ) }$$; the smaller [TeX:] $$H ^ { ( i ) }$$ is, the more reasonable the model parameters are, and if [TeX:] $$H ^ { ( i ) }$$ is too large we optimize the parameters by back-propagation (BP). The label value [TeX:] $$f ^ { ( i ) } = 1$$ indicates that the pair shows faces of different persons, giving [TeX:] $$H ^ { ( i ) } = m - E _ { W } ^ { ( i ) }$$, where m denotes the boundary value (margin); the loss function thus pushes the distance [TeX:] $$E _ { W } ^ { ( i ) }$$ between different faces to be large. When [TeX:] $$E _ { W } ^ { ( i ) } > m$$, the loss is set to 0 and the model parameters are left unchanged; such a sample pair does not affect the network learning process.
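A direct PyTorch rendering of this loss, under the paper's label convention (0 = same person, 1 = different persons) and the batch-averaged form above, might look as follows; the function name is ours.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(g1, g2, labels, m=2.0):
    """Contrastive loss with label 0 = same person, 1 = different persons.
    g1, g2: (mb, 400) embeddings of the image pair; labels: (mb,) in {0, 1}.
    m defaults to 2, the boundary value used in Section 5."""
    labels = labels.float()
    e_w = F.pairwise_distance(g1, g2)                # Euclidean distance E_W
    same = (1 - labels) * e_w                        # pull same-person pairs together
    diff = labels * torch.clamp(m - e_w, min=0.0)    # push others beyond margin m
    return (same + diff).mean()                      # average over the mini-batch
```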
4. Data Training
It is difficult for state-of-the-art face recognition algorithms to achieve high recognition accuracy when trained, without pretraining, on a small number of samples per class. In the current public face datasets such as AR and LFW, the number of face images per person is relatively small, and deep learning cannot train an excellent network model without enough data. Given these limitations, we generate new training data for the experiments from the AR and LFW datasets in the pair form the Siamese network requires.
(1) AR dataset: AR, created by Purdue University in the United States, provides color face images of 126 people. In this paper, we use a subset of the AR dataset containing 100 persons (50 men and 50 women) with 26 images each, 2,600 images in total; each image is 165×120 pixels. Fig. 4 shows some of the facial images in this subset.
(2) LFW dataset: The training data also originates from the LFW dataset, an unconstrained face recognition dataset of scene images. It consists of almost 13,000 face images of more than 5,000 celebrities under different orientations, expressions, and natural-scene lighting. Among them, 1,680 celebrities have two or more face images. Each face image is a 250×250 color image with a unique name ID and serial number. Fig. 5 shows part of the dataset.
Fig. 4. Part of the face images in the AR dataset.
Fig. 5. Part of the face images in the LFW dataset.
The inputs of the Siamese network model are an image pair and a label, so the training data must be collated and generated to meet this requirement. We mark an image pair from the same person as 0 and otherwise as 1. A dataset of 32,000 image pairs is generated with matched and mismatched pairs at a rate of 1:1; 20,000 image pairs are used for training while 12,000 are used for testing. Table 3 shows the generative algorithm for the training data.
(5) When the number of training sample pairs in S′ reaches the preset value, the generation ends.
Fig. 6 shows the matched and mismatched pairs formed by the generation algorithm, where [TeX:] $$A , A _ { 1 } , A _ { 2 } , A _ { 3 }$$ represent face images of the same person with variations in expression and gesture, distinguished by different IDs, and A, B, C, D represent face images of different persons. The final sample pairs are [TeX:] $$\left( A , A _ { 1 } , 0 \right) , \left( A , A _ { 2 } , 0 \right) , \left( A , A _ { 3 } , 0 \right) , ( A , B , 1 ) , ( A , C , 1 ) , ( A , D , 1 )$$.
Fig. 6. Generation of matched and mismatched pairs.
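A minimal Python sketch of the Table 3 generation procedure is given below: matched pairs (label 0) and mismatched pairs (label 1) are emitted at a 1:1 rate until the preset number of pairs is reached (step (5)). The `images_by_person` input format and function name are our assumptions.

```python
import random

def generate_pairs(images_by_person, target_pairs):
    """Generate (image, image, label) training pairs at a 1:1 matched/
    mismatched rate. `images_by_person` maps a person ID to a list of that
    person's image paths (assumed input format)."""
    ids = list(images_by_person)
    multi = [i for i in ids if len(images_by_person[i]) >= 2]
    pairs = []
    while len(pairs) < target_pairs:            # step (5): stop at the set-point
        # matched pair: two images of one person, e.g. (A, A1, 0)
        pid = random.choice(multi)
        a, a1 = random.sample(images_by_person[pid], 2)
        pairs.append((a, a1, 0))
        # mismatched pair: images of two different persons, e.g. (A, B, 1)
        p1, p2 = random.sample(ids, 2)
        pairs.append((random.choice(images_by_person[p1]),
                      random.choice(images_by_person[p2]), 1))
    return pairs[:target_pairs]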
The images, with their backgrounds and illumination, receive no preprocessing before features are extracted by the CNN; in the CNN's nonlinear dimension reduction, the influence of such interference factors is eliminated automatically. To further reduce image matching time and computation, the image size is set to 120×120.
5. Experimental Results and Analysis
The experiments are implemented on a Rongtian SCW4550 GPU server with an Intel Xeon E5-2670 v3 2.3 GHz CPU, 128 GB of memory, and a GeForce GTX TITAN X GPU with 12 GB of memory. The processing speed reaches 50 fps, faster than the real-time standard. We use PyTorch as the deep learning framework. The experiments explore the effects of network structure, parameter settings, and loss functions separately, and are conducted on the AR and LFW datasets.
We set the parameters as follows: the boundary value m of the loss function is set to 2, and the mini-batch size (mb) is set to 32. Over the interval [0, 30], we gradually increase the threshold with a step size of 0.01 and calculate the recognition rate at each tested threshold to find the one giving the best recognition performance. We choose 0.49 as the threshold, at which the recognition rate peaks at 0.988.
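A small sketch of this threshold search follows; it assumes the pair distances and labels have already been computed on the test set, and the function name is ours.

```python
import numpy as np

def best_threshold(distances, labels, lo=0.0, hi=30.0, step=0.01):
    """Sweep the decision threshold tau over [lo, hi] in increments of
    `step` and keep the one with the highest recognition rate. `labels`
    use the paper's convention: 0 = same person, 1 = different persons."""
    best_tau, best_acc = lo, 0.0
    for tau in np.arange(lo, hi + step, step):
        pred = (distances >= tau).astype(int)   # D < tau -> same person (label 0)
        acc = (pred == labels).mean()
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau, best_acc
```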
Table 4. Configuration and recognition rate of five different network models
5.1 Network Models and Parameters Comparison
In this paper, five models with different network structures are trained on the LFW and AR datasets, respectively; their specific configurations and recognition rates are shown in Table 4.
Table 4 shows that the network structure and parameter settings affect the accuracy of the algorithm. The recognition rate is highest when the number of convolutional layers is 7 and, for a fixed number of convolutional layers, when the number of fully-connected layers is 3. Considering this, the first model is adopted in the experiments. The convolutional kernel size is set to 3×3 when designing the network parameters, which enhances the discriminative ability of the network while using fewer parameters than 5×5 or 7×7 kernels. For example, with C input and output channels, a stack of three 3×3 convolutional layers has 3×(3×3×C×C) = 27C² parameters, whereas a single 7×7 convolutional layer has 7×7×C×C = 49C² parameters.
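As a quick sanity check of this arithmetic, the snippet below counts the parameters of both options with PyTorch; C = 64 is an arbitrary width chosen for illustration.

```python
import torch.nn as nn

C = 64  # arbitrary channel width for illustration
three_3x3 = nn.Sequential(*[nn.Conv2d(C, C, 3, padding=1, bias=False)
                            for _ in range(3)])
one_7x7 = nn.Conv2d(C, C, 7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(three_3x3))  # 3 * (3*3*C*C) = 27*C^2 = 110592
print(count(one_7x7))    # 7*7*C*C      = 49*C^2 = 200704
```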
The convergence of each model is shown in Fig. 7. Model1 and Model2 correspond to the SiameseFace1 and SiameseFace2 models, respectively. In the experiments, Model1 performs best and its loss function converges fastest. Model3 is difficult to converge because of its deeper network, and its performance is not ideal with the same number of iterations.
Fig. 7. Loss convergence comparison of the different models.
5.2 AR Dataset Experiment
The AR face dataset contains 4,000 images of 126 people with variations in facial expression and illumination as well as disguised (occluded) faces. The AR dataset is processed with the same training data generation procedure used for LFW. The experimental results of the trained models and the existing algorithms are shown in Table 5.
The algorithm and value marked in bold indicate the best experimental results. Although the traditional algorithms in Table 5 achieve good results on the AR dataset, the recognition rate of our algorithm improves on them considerably. This shows that the Siamese-network-based method can effectively solve the problem of insufficient training samples per category: the network model learns effective features that better compare and distinguish a pair of input images.
Table 5. Experimental results on the AR dataset
5.3 LFW Dataset Experiment
We randomly select 12,000 pairs of faces from the LFW dataset to form a test set, of which 6,000 pairs show the same person in different poses and the remaining 6,000 pairs show two different persons. In testing, a pair of images is input to the Siamese network, whose output is 'yes' or 'no': 'yes' means the image pair shows the same person, while 'no' means it shows different persons. The face recognition accuracy is the proportion of test pairs whose predicted results agree with the ground truth. The LFW dataset contains more than 13,000 face images collected from over 5,000 people, of whom only 1,680 have two or more images and about 4,000 have only one face image, which greatly increases the difficulty of model training. We use only internal LFW data when training the network and do not use external data to optimize it. Table 6 shows the experimental results of the trained model compared with existing algorithms.
Table 6. Experimental results on the LFW dataset
In Table 6, Face++, a commercial system built by the Face++ company, has the best performance; the number of facial feature points and the training data of that algorithm are not publicly disclosed. Our algorithm is inferior only to Face++ and has a higher recognition rate than the other algorithms.
5.4 Comparison Experiment of Loss Function
The loss function used in this paper is the contrastive loss function, which achieves higher recognition accuracy. We also tried several other loss functions: the triplet loss function, the cosine proximity function, and the squared error function. Comparison experiments are implemented on the AR dataset and the results are shown in Table 7.
In the work of [27], a triplet is generated by randomly selecting a sample from the training dataset, denoted [TeX:] $$S _ a$$, and then randomly selecting one sample of the same class and one of a different class as [TeX:] $$S _ a$$, denoted the positive sample [TeX:] $$S _ p$$ and the negative sample [TeX:] $$S _ n$$, respectively. For each element of the triplet, a parameter-sharing network produces the feature expressions [TeX:] $$f \left( s _ { i } ^ { a } \right) , f \left( s _ { i } ^ { p } \right) , f \left( s _ { i } ^ { n } \right)$$. The purpose of the triplet loss function is to learn features such that the distance between the same-class elements [TeX:] $$S _ a$$ and [TeX:] $$S _ p$$ is as small as possible, and the distance between the different-class elements [TeX:] $$S _ a$$ and [TeX:] $$S _ n$$ is as large as possible. The triplet loss function is defined as:

[TeX:] $$L = \sum _ { i } ^ { N } \left[ \left\| f \left( s _ { i } ^ { a } \right) - f \left( s _ { i } ^ { p } \right) \right\| _ { 2 } ^ { 2 } - \left\| f \left( s _ { i } ^ { a } \right) - f \left( s _ { i } ^ { n } \right) \right\| _ { 2 } ^ { 2 } + m \right] _ { + }$$
Here, N denotes the number of triplets and m denotes the margin value; the subscript + indicates that the bracketed value counts as loss only when it is greater than zero, and when it is less than zero the loss is zero.
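A minimal PyTorch sketch of this triplet loss, following the definition above, might look as follows; the function name and default margin are ours.

```python
import torch

def triplet_loss(f_a, f_p, f_n, m=0.2):
    """Triplet loss in the spirit of FaceNet [27]: pull the anchor-positive
    distance below the anchor-negative distance by at least the margin m.
    f_a, f_p, f_n: (N, d) embeddings of anchor, positive, and negative."""
    d_ap = (f_a - f_p).pow(2).sum(dim=1)                 # ||f(s_a) - f(s_p)||^2
    d_an = (f_a - f_n).pow(2).sum(dim=1)                 # ||f(s_a) - f(s_n)||^2
    return torch.clamp(d_ap - d_an + m, min=0.0).sum()   # [.]_+ hinge, summed over N
```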
The cosine distance, called cosine similarity in [30], uses the cosine of the angle between two vectors in vector space to measure the difference between two inputs. It is defined as:

[TeX:] $$\cos \theta = \frac { G _ { W } \left( X _ { 1 } \right) \cdot G _ { W } \left( X _ { 2 } \right) } { \left\| G _ { W } \left( X _ { 1 } \right) \right\| \left\| G _ { W } \left( X _ { 2 } \right) \right\| }$$
The loss function used in [31] is the squared error loss function, which is defined as:

[TeX:] $$H ^ { ( i ) } = \left( f ^ { ( i ) } - \delta \left( d ^ { ( i ) } \right) \right) ^ { 2 }$$
Here, [TeX:] $$\delta$$ is a logistic function and [TeX:] $$d ^ { ( i ) }$$ represents the similarity measure of sample pair i, [TeX:] $$< X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } >$$.
A shortcoming of the squared error loss function is that it is prone to vanishing gradients.
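For reference, minimal sketches of the two alternatives follow, assuming the embeddings and distances defined earlier; the squared error form follows our reconstruction above rather than a verified formula from [31], and the function names are ours.

```python
import torch
import torch.nn.functional as F

def cosine_proximity(g1, g2):
    """Cosine similarity of two embedding batches [30]: the cosine of the
    angle between them, in [-1, 1]."""
    return F.cosine_similarity(g1, g2, dim=1)

def squared_error_loss(distance, labels):
    """Squared error loss in the spirit of [31] (our reconstruction):
    squash the pair distance with a logistic function delta and penalize
    its squared deviation from the 0/1 label. Saturation of the sigmoid
    is what makes the gradient prone to vanishing."""
    delta = torch.sigmoid(distance)
    return ((labels.float() - delta) ** 2).mean()
```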
Table 7. Comparison of different loss functions on the AR dataset
As shown in Table 7, the contrastive loss function used in this paper is optimal: its recognition rate is much higher than that of the cosine proximity function and slightly higher than that of the triplet loss function. We also found in the experiments that the triplet loss converges slowly and is prone to overfitting.
6. Conclusion
In this paper, we propose an effective face recognition algorithm based on a novel Siamese CNN that indirectly expands the number of training samples per category on the AR and LFW datasets. With image pairs as network input, the designed Siamese network model extracts features, and the similarity calculation is carried out with the contrastive loss function. In addition, a lightweight network model without loss of recognition accuracy is also proposed. The training data generation method, combined with the proposed Siamese network model and the contrastive loss function, achieves a higher recognition rate on the AR and LFW datasets. In the future, we will carry out quantitative experimental analysis of single-sample training, design and optimize the deep network model, construct novel loss functions, and further improve the recognition performance of our algorithm.
Acknowledgement
The research work was supported by the National Natural Science Foundation of China (No. 61772454, 61811530332), the Scientific Research Fund of Hunan Provincial Education Department (No. 16A008), the Scientific Research Fund of Hunan Provincial Transportation Department (No. 201446), the Industry-University Cooperation and Collaborative Education Project of the Department of Higher Education of the Ministry of Education (No. 201702137008), the Undergraduate Inquiry Learning and Innovative Experimental Fund of CSUST (No. 2018-6-119), and the Postgraduate Course Construction Fund of CSUST (No. KC201611).