Zhang* , Jin* , Liu* , Sangaiah** , and Wang*: Small Sample Face Recognition Algorithm based on Novel Siamese Network

# Small Sample Face Recognition Algorithm based on Novel Siamese Network

Abstract: In face recognition, the number of training samples available for a single category is sometimes insufficient, so models trained with convolutional neural networks perform poorly. This paper proposes a small sample face recognition algorithm based on a novel Siamese network, which does not need rich samples for training. The algorithm designs and implements a new Siamese network model, SiameseFace1, which takes pairs of face images as inputs and maps them to a target space such that the L2 norm distance in the target space represents the semantic distance in the source space. The mapping is represented by a neural network trained with supervised learning. Moreover, a more lightweight Siamese network model, SiameseFace2, is designed to reduce the number of network parameters without losing accuracy. We also present a new method to generate training data and expand the number of training samples for a single category in the AR and Labeled Faces in the Wild (LFW) datasets, which improves the recognition accuracy of the models. Experiments comparing several loss functions on the AR and LFW datasets show that the contrastive loss function combined with the new Siamese network model can effectively improve the accuracy of face recognition.

Keywords: Face Recognition, Convolutional Neural Network, Siamese Network, Loss Function, Small Sample

## 1. Introduction

Face recognition, a classical and important task in computer vision, is commonly used in video retrieval and pedestrian tracking, and nowadays also in distributed diagnosis and home healthcare. Stephen et al. [1] constructed a computer model based on the cognitive learning of facial images. This model can make accurate physical health judgments and predict body mass index (BMI) and blood pressure, which helps doctors to identify and treat diseases early. Meanwhile, face recognition has been extensively applied in the Internet of Things (IoT), for example in intelligent attendance systems based on face identification, intelligent video surveillance systems in public places, and secure payment systems. These extensive applications have attracted scholars at home and abroad to work on face recognition [2,3]. Despite this progress, face recognition remains a challenge under interference factors such as angle variation, illumination variation, changes in expression and posture, noise, low resolution, object occlusion, and a small number of samples per class combined with a very large number of categories (on the order of millions). Humans can recognize a person at first sight, whereas this is a great challenge for a computer, so algorithms have to cope with training classifiers from small samples. Representative face recognition algorithms [4-6] show that performance decreases dramatically as the number of training samples drops, and this also holds for the convolutional neural network (CNN), which shows excellent performance in object detection and classification [7]. Therefore, face recognition with small samples is an extremely challenging topic.

In recent years, benefiting from massive training data and improved hardware computing capabilities, deep learning has made great progress in the fields of image [8,9], voice [10] and text [11] processing. In public datasets, as the number of categories increases, the number of samples needs to grow as well to facilitate the training of network models and improve classification performance. However, real face recognition scenarios often involve few samples and a large number of categories. This leaves each category with insufficient samples, which greatly limits the performance of face recognition.

In this work, with the help of the Siamese network [12], we use pairs of face images as inputs to expand the number of samples for a single category. Furthermore, we propose a small sample face recognition algorithm based on a self-constructed Siamese network that does not require a large number of training samples. It learns a mapping by training a CNN with the contrastive loss function [12], and maps the input image pairs to a target space so that the L2 norm distance in the target space represents the semantic distance in the source space. During training, the network parameters are learned by minimizing the loss function, which reduces the distance between face image pairs from the same person and increases the distance between face image pairs from different persons. Experiments with several loss functions show that the proposed network model, combined with the method for generating training data, can effectively improve face recognition accuracy, and it achieves a better recognition rate on the AR and Labeled Faces in the Wild (LFW) datasets.

## 2. Related Work

Traditional face recognition algorithms have made many achievements through years of development. The work in [13] proposes sparse representation based classification (SRC), which represents a face image as a linear combination of all the training samples from the same person. Compared with other ordinary methods, SRC is more effective when there are only a few training samples in each category. The Gabor wavelet can capture local structure information corresponding to spatial frequency, spatial position and direction. The Gabor feature is applied to SRC in [14], which improves the SRC recognition rate significantly. Although SRC improves the face recognition rate effectively, it also incurs a high computational cost. Zhang et al. [15] propose collaborative representation based classification (CRC) and point out that the L1 norm regularization used by SRC is computationally expensive, while an L2 regularization constraint achieves similar recognition results with better computational efficiency. Nevertheless, the performance of both SRC and CRC is greatly degraded when the number of training samples is insufficient. A new representation method called hierarchical CRC (HCRC) is proposed in [16]. Compared with traditional collaborative representation methods, HCRC introduces the Euclidean distance from projective vectors to training vectors, which improves the recognition precision effectively even when the training samples are insufficient.

In recent years, CNN-based algorithms have made great achievements in face verification and recognition [17-19]. Compared with face recognition methods based on handcrafted features [20,21], CNN-based methods achieve higher accuracy. A new deep learning model is proposed in [22]. It can recover frontal facial features, greatly reduce the differences among faces of a single individual, and improve the performance of the face recognition algorithm. DeepFace [23] uses complex 3D face alignment and four million facial images to derive a face representation from a 9-layer deep neural network. DeepID1 [24] crops the facial images, extracts features from the image patches and integrates them with Joint Bayesian. These facial features contain rich category information. DeepID2 [25] exploits contrastive loss and softmax loss to achieve network feedback regulation. A great number of positive and negative samples are used as training data. Positive samples are used to reduce the distance within a single category, and negative samples are used to increase the distance between categories. However, the samples are generated randomly, which makes the network model unstable. A new network model called HaarNet [26] has also been designed. Its backbone network extracts the global image information, and its three branches use Haar-like features in regions of interest (ROI), which significantly improves the accuracy of face recognition. A face recognition algorithm called FaceNet [27] maps face images into a Euclidean space in which distance represents the similarity of the face images. It uses the triplet loss function in the training process and achieves high performance in pose-variant face recognition. The number of training images is up to 200 million.

## 3. Face Recognition Algorithm Based on Siamese Network

In traditional feature extraction algorithms, the feature operators are hand-crafted. Choosing which kind of features to extract is a manual decision, which leads to poor robustness and extensibility. The advantage of CNN over traditional methods is that the parameters of the entire model are obtained by autonomous learning. It is superior in two respects. Firstly, autonomously learned features are more robust and have stronger expressive ability. Secondly, it greatly reduces manual labor and avoids hand-designed parameters that are inappropriate for the model because of insufficient experience. CNN shows great performance in many areas of image processing and exceeds traditional image processing methods, and in some respects human ability.

CNN achieves great performance mainly because of the autonomous learning ability of its network model and the large amount of training data. CNN obtains suitable model parameters by learning from the features extracted from the training data, so data plays a critical role in training an excellent network model. The performance of the network model is not satisfactory when it is trained with a small amount of data. Up to now, the recognition performance of CNN has been seriously affected by face datasets with insufficient samples in each category. In this paper, we propose a face recognition algorithm based on the Siamese network. The proposed algorithm designs and implements two different network models, which still achieve high recognition accuracy when the number of single-class training samples is small.

3.1 Siamese Network

The Siamese network [12] is a kind of CNN that is divided into two branches from input to output. The two branches share the same weights. The Siamese network is special in that it takes image pairs as training input, extracts features with its two branches respectively, and finally obtains a pair of feature vectors for each sample pair. Fig. 1 shows the architecture of the Siamese network.

Here, [TeX:] $$< \boldsymbol { X } _ { 1 } , \boldsymbol { X } _ { 2 } >$$ is the input image pair. [TeX:] $$< G _ { W } \left( \boldsymbol { X } _ { 1 } \right) , G _ { W } \left( \boldsymbol { X } _ { 2 } \right) >$$, calculated by the network mapping, is the output feature pair. W denotes the parameters of the network model. [TeX:] $$\left\| G _ { W } \left( \boldsymbol { X } _ { 1 } \right) - G _ { W } \left( \boldsymbol { X } _ { 2 } \right) \right\| _ { 2 }$$ is the distance on which the loss function is built, and it is used to adjust the parameters of the entire network.
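To make the weight sharing concrete, the following is a minimal sketch, assuming a toy mapping made of a single linear layer and 4-element inputs. The names `embed` and `pair_distance` and all numeric values are hypothetical and only illustrate that both branches apply the same W before the L2 distance is taken; they are not the SiameseFace architecture.

```python
import math

def embed(weights, x):
    """Toy mapping G_W: one linear layer, `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def pair_distance(weights, x1, x2):
    """L2 distance between G_W(x1) and G_W(x2).

    Both inputs go through the SAME weights, which is what
    distinguishes a Siamese network from two independent networks.
    """
    g1, g2 = embed(weights, x1), embed(weights, x2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))

# Hypothetical weights and inputs, for illustration only.
W = [[0.5, -0.5, 0.0, 1.0],
     [1.0, 0.0, -1.0, 0.5]]
x1 = [0.2, 0.4, 0.1, 0.9]
x2 = [0.2, 0.4, 0.1, 0.9]   # same content as x1
x3 = [0.9, 0.1, 0.8, 0.0]   # different content

print(pair_distance(W, x1, x2))  # identical inputs give a distance of 0.0
print(pair_distance(W, x1, x3))  # different inputs give a positive distance
```

Because the weights are shared, the distance is symmetric in its two inputs, and identical images always map to the same point.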

Fig. 1.

Siamese network algorithm architecture.

3.2 Face Recognition Oriented Siamese Network Model Design

In this paper, we design and implement two different network models based on the Siamese network, named SiameseFace1 and SiameseFace2, to improve the accuracy of face recognition.

3.2.1 SiameseFace1 model

The single network model of SiameseFace1 consists of 7 convolutional layers, 3 pooling layers, and 3 fully-connected layers. Its output is a 400-dimensional feature vector. The Euclidean distance D between the two outputs of the Siamese network is compared with a threshold [TeX:] $$\tau$$ to judge whether the two inputs belong to the same class. The two features of the pair [TeX:] $$< G _ { W } \left( \boldsymbol { X } _ { 1 } \right) , G _ { W } \left( \boldsymbol { X } _ { 2 } \right) >$$ are denoted as [TeX:] $$G _ { W } \left( \boldsymbol { X } _ { 1 } \right) = \left( x _ { 1 } ^ { ( 1 ) } , x _ { 1 } ^ { ( 2 ) } , \ldots , x _ { 1 } ^ { ( i ) } , \ldots , x _ { 1 } ^ { ( 400 ) } \right)$$ and [TeX:] $$G _ { W } \left( \boldsymbol { X } _ { 2 } \right) = \left( x _ { 2 } ^ { ( 1 ) } , x _ { 2 } ^ { ( 2 ) } , \ldots , x _ { 2 } ^ { ( i ) } , \ldots , x _ { 2 } ^ { ( 400 ) } \right)$$, respectively. A distance [TeX:] $$D < \tau$$ indicates that the image pair comes from the faces of the same person, while [TeX:] $$D > \tau$$ indicates that the image pair comes from the faces of different persons. Fig. 2 shows the network architecture of SiameseFace1.

Fig. 2 shows that each training input consists of an image pair and a label. The label 0 denotes an image pair from the faces of the same person, while the label 1 denotes an image pair from the faces of different persons. The input images of the network model are 120×120, each convolutional kernel is 3×3, and both the padding and the step (stride) are set to 1. We use convolutional layers to extract features, with each convolutional layer followed by a ReLU activation function, and then employ max-pooling and three fully-connected layers. The final output of the network model is a 400-dimensional vector. Table 1 shows the detailed parameters of the SiameseFace1 model.
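The layer sizes listed in Table 1 follow from the standard output-size formula for convolution and pooling, out = (n + 2p - k) / s + 1. The sketch below traces the spatial size and channel count through the SiameseFace1 convolutional stack under that formula; the helper names `out_size` and `trace` and the `LAYERS` list layout are assumptions made for illustration.

```python
def out_size(n, kernel, stride, padding):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding, output channels) following Table 1.
# Pooling layers keep the channel count, marked here with None.
LAYERS = [
    ("Conv1_1", 3, 1, 1, 32),
    ("Conv1_2", 3, 1, 1, 32),
    ("Pooling", 2, 2, 0, None),
    ("Conv2_1", 3, 1, 1, 64),
    ("Conv2_2", 3, 1, 1, 128),
    ("Pooling", 2, 2, 0, None),
    ("Conv3_1", 3, 1, 1, 512),
    ("Conv3_2", 3, 1, 1, 512),
    ("Pooling", 2, 2, 0, None),
    ("Conv4", 3, 1, 1, 1024),
]

def trace(size=120, channels=1):
    """Return (name, height, width, channels) after every layer."""
    shapes = []
    for name, kernel, stride, padding, out_channels in LAYERS:
        size = out_size(size, kernel, stride, padding)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, size, size, channels))
    return shapes

for name, h, w, c in trace():
    print(f"{name}: {h}x{w}x{c}")
```

With a 3×3 kernel, padding 1 and stride 1, a convolution keeps the spatial size, so only the three 2×2 pooling layers shrink the 120×120 input down to 15×15 before the fully-connected layers.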

Fig. 2.

SiameseFace1 network architecture.

Table 1.

SiameseFace1 network parameters

| Layer | Kernel | Step | Padding | Input | Output |
|---|---|---|---|---|---|
| Conv1_1 | 3×3 | 1 | 1 | 120×120×1 | 120×120×32 |
| Conv1_2 | 3×3 | 1 | 1 | 120×120×32 | 120×120×32 |
| Pooling | 2×2 | 2 | 0 | 120×120×32 | 60×60×32 |
| Conv2_1 | 3×3 | 1 | 1 | 60×60×32 | 60×60×64 |
| Conv2_2 | 3×3 | 1 | 1 | 60×60×64 | 60×60×128 |
| Pooling | 2×2 | 2 | 0 | 60×60×128 | 30×30×128 |
| Conv3_1 | 3×3 | 1 | 1 | 30×30×128 | 30×30×512 |
| Conv3_2 | 3×3 | 1 | 1 | 30×30×512 | 30×30×512 |
| Pooling | 2×2 | 2 | 0 | 30×30×512 | 15×15×512 |
| Conv4 | 3×3 | 1 | 1 | 15×15×512 | 15×15×1024 |
| FC1 | | | | 15×15×1024 | 400 |
| FC2 | | | | 400 | 400 |
| FC3 | | | | 400 | 400 |

3.2.2 SiameseFace2 model

A new lightweight network based on the SiameseFace1 model is designed to optimize the network. We reduce the number of network parameters without losing recognition precision. The SiameseFace2 network architecture is shown in Fig. 3. We add convolutional layers with 1×1 kernels to the SiameseFace2 model to enhance the nonlinearity of the features without changing the scale of the feature maps. This facilitates deepening the network, enhances its feature expressive ability, and reduces both the dimension and the computational load at the same time. In deep CNNs, the low-level convolutional layers extract mostly low-level features such as edges and textures, while the high-level layers extract features that contain more semantic information. Therefore, we cascade the low-level and high-level features and merge detailed information (e.g., edge and texture) into the semantic features of the high-level layers to enhance the feature expressive ability.

Fig. 3.

SiameseFace2 network architecture.

Table 2.

SiameseFace2 network parameters

| Layer | Kernel | Step | Padding | Input | Output |
|---|---|---|---|---|---|
| Conv1_1 | 3×3 | 1 | 1 | 120×120×1 | 120×120×32 |
| Conv1_2 | 3×3 | 1 | 1 | 120×120×32 | 120×120×64 |
| Conv1_3 | 1×1 | 1 | 0 | 120×120×64 | 120×120×32 |
| Pooling(out) | 2×2 | 2 | 0 | 120×120×32 | 60×60×32 |
| Conv2_1 | 3×3 | 1 | 1 | 60×60×32 | 60×60×128 |
| Conv2_2 | 3×3 | 1 | 1 | 60×60×128 | 60×60×128 |
| Conv2_3 | 3×3 | 1 | 1 | 60×60×128 | 60×60×128 |
| Conv2_4 | 1×1 | 1 | 0 | 60×60×128 | 60×60×64 |
| Pooling(out) | 2×2 | 2 | 0 | 60×60×64 | 30×30×64 |
| Conv3_1 | 3×3 | 1 | 1 | 30×30×64 | 30×30×512 |
| Conv3_2 | 3×3 | 1 | 1 | 30×30×512 | 30×30×512 |
| Conv3_3 | 1×1 | 1 | 0 | 30×30×512 | 30×30×64 |
| Pooling | 2×2 | 2 | 0 | 30×30×64 | 15×15×64 |
| CAT | | | | 60×60×32 + 30×30×64 + 15×15×64 | 187200 |
| FC1 | | | | 187200 | 100 |
| FC2 | | | | 100 | 100 |

Table 2 shows the detailed parameters of the SiameseFace2 model. The marker "(out)" in Table 2 indicates that the feature map of that layer is cascaded in the CAT layer. The output of CAT is used as the input of the fully-connected layers.
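As a rough check of the claim that SiameseFace2 is lighter, the sketch below counts the weights implied by the layer shapes in Table 1 and Table 2. It is an estimate under stated assumptions: biases are ignored, and the helper names `conv_weights`, `fc_weights` and the two layer lists are hypothetical, not taken from the authors' code.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights of one convolutional layer (biases ignored)."""
    return kernel * kernel * c_in * c_out

def fc_weights(n_in, n_out):
    """Weights of one fully-connected layer (biases ignored)."""
    return n_in * n_out

# (kernel, input channels, output channels) read from Table 1.
SIAMESEFACE1_CONV = [(3, 1, 32), (3, 32, 32), (3, 32, 64), (3, 64, 128),
                     (3, 128, 512), (3, 512, 512), (3, 512, 1024)]
SIAMESEFACE1_FC = [(15 * 15 * 1024, 400), (400, 400), (400, 400)]

# Same layout, read from Table 2.
SIAMESEFACE2_CONV = [(3, 1, 32), (3, 32, 64), (1, 64, 32),
                     (3, 32, 128), (3, 128, 128), (3, 128, 128), (1, 128, 64),
                     (3, 64, 512), (3, 512, 512), (1, 512, 64)]
CAT_FEATURES = 60 * 60 * 32 + 30 * 30 * 64 + 15 * 15 * 64
SIAMESEFACE2_FC = [(CAT_FEATURES, 100), (100, 100)]

def total_weights(conv_layers, fc_layers):
    conv = sum(conv_weights(k, ci, co) for k, ci, co in conv_layers)
    fc = sum(fc_weights(ni, no) for ni, no in fc_layers)
    return conv + fc

total1 = total_weights(SIAMESEFACE1_CONV, SIAMESEFACE1_FC)
total2 = total_weights(SIAMESEFACE2_CONV, SIAMESEFACE2_FC)
print(CAT_FEATURES, total1, total2)
```

Under these assumptions the count is about 100 million weights for SiameseFace1 and about 22 million for SiameseFace2. Most of the saving comes from the first fully-connected layer, whose input shrinks because the 1×1 convolutions cut the channel count before the features are cascaded.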

3.3 Contrastive Loss Function

We use loss functions to estimate the consistency between the predictions f(x) of the model and the ground truth. The log loss function, square loss function and exponential loss function are often used; however, they are not suitable for the Siamese network. Therefore, we employ the discriminative contrastive loss function [12] in this paper. Learning the network parameters is a process of minimizing the contrastive loss function, which reduces the distance between faces of the same person and increases the distance between faces of different persons.

As shown in Fig. 1, [TeX:] $$E _ { W } ^ { ( i ) }$$ denotes the Euclidean distance between the output features of the i-th sample pair [TeX:] $$< X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } >$$. [TeX:] $$E _ { W } ^ { ( i ) }$$ is computed as:

##### (1)
[TeX:] $$E _ { W } ^ { ( i ) } = \left\| G _ { W } \left( X _ { 1 } ^ { ( i ) } \right) - G _ { W } \left( X _ { 2 } ^ { ( i ) } \right) \right\| _ { 2 }$$

We feed the input data to the CNN in mini-batches for more effective training. The final loss function is:

##### (2)
[TeX:] $$L ( W ) = \frac { 1 } { m b } \sum _ { i = 1 } ^ { m b } H ^ { ( i ) }$$

Here, mb denotes the number of samples per batch. The [TeX:] $$H ^ { ( i ) }$$ represents the contrastive loss value of the sample pair i : [TeX:] $$< X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } >$$. The contrastive loss value [TeX:] $$H ^ { ( i ) }$$ is computed as:

##### (3)
[TeX:] $$H ^ { ( i ) } = \left( 1 - f ^ { ( i ) } \right) E _ { W } ^ { ( i ) } + f ^ { ( i ) } \max \left( m - E _ { W } ^ { ( i ) } , 0 \right)$$

Here, [TeX:] $$f ^ { ( i ) } \in \{ 0,1 \}$$ denotes the label of sample i. The label value [TeX:] $$f ^ { ( i ) } = 0$$ indicates that the sample pair [TeX:] $$< X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } >$$ shows faces of the same person, and its contrastive loss value is [TeX:] $$H ^ { ( i ) } = E _ { W } ^ { ( i ) }$$. The smaller [TeX:] $$H ^ { ( i ) }$$ is, the more reasonable the parameters of the model are. If [TeX:] $$H ^ { ( i ) }$$ is too large, we need to optimize the parameters of the model by back-propagation (BP). The label value [TeX:] $$f ^ { ( i ) } = 1$$ indicates that the sample pair [TeX:] $$< X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } >$$ does not show faces of the same person, and its loss value is [TeX:] $$H ^ { ( i ) } = m - E _ { W } ^ { ( i ) }$$ as long as the distance does not exceed m. Here, m denotes the boundary (margin) value. Minimizing the loss function therefore maximizes the distance [TeX:] $$E _ { W } ^ { ( i ) }$$ between different faces. When [TeX:] $$E _ { W } ^ { ( i ) } > m$$, the loss is set to 0 and the model parameters are not changed, so the sample pair i does not affect the learning process of the network model.
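The loss in Eqs. (1) to (3) can be sketched in a few lines. The version below is a minimal illustration in plain Python, including the clamp to zero that the text describes for distances larger than m; the function names `euclidean`, `pair_loss` and `batch_loss` are hypothetical, and the feature vectors in the usage lines are made-up numbers.

```python
import math

def euclidean(g1, g2):
    """Eq. (1): L2 distance between the two output feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))

def pair_loss(distance, label, margin):
    """Eq. (3): contrastive loss of one pair.

    label 0 = same person, label 1 = different persons.
    For label 1 the loss is clamped at zero once distance > margin.
    """
    if label == 0:
        return distance
    return max(margin - distance, 0.0)

def batch_loss(feature_pairs, labels, margin=2.0):
    """Eq. (2): mean contrastive loss over a mini-batch."""
    losses = [pair_loss(euclidean(g1, g2), f, margin)
              for (g1, g2), f in zip(feature_pairs, labels)]
    return sum(losses) / len(losses)

# Two toy pairs: a far-apart mismatched pair and a close matched pair.
pairs = [([0.0, 0.0], [3.0, 4.0]), ([0.0, 0.0], [0.0, 1.0])]
labels = [1, 0]
print(batch_loss(pairs, labels, margin=2.0))
```

In this toy batch the mismatched pair is already farther apart than the margin and contributes nothing, so only the matched pair drives the loss.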

## 4. Data Training

Without pretraining, it is difficult for state-of-the-art face recognition algorithms to obtain models with high recognition accuracy from a small number of training samples per class. The number of face samples for each person is relatively small in current public face datasets such as the AR dataset and the LFW dataset, and it is hard for deep learning to train an excellent network model without enough data. To address these limitations, we regenerate the training data for the experiments from the AR and LFW datasets in the pair form required by the Siamese network.

(1) AR dataset: AR, which provides color face images of 126 people, was created by Purdue University in America. In this paper, we use a subset of the AR dataset. This subset contains 100 people, 50 men and 50 women. Each person has 26 images, for a total of 2,600 images. Each image is 165×120 pixels. Fig. 4 shows some of the facial images in this subset.

(2) LFW dataset: In this paper, the training data also originates from the LFW dataset. LFW is an unconstrained face recognition dataset collected from scene images. The dataset consists of almost 13,000 face images of more than 5,000 celebrities in natural scenes with different orientations, expressions, and lighting. Among them, 1,680 celebrities have two or more face images. Each face image is a color image of size 250×250 and has a unique name ID and serial number. Fig. 5 shows some images from the dataset.

Fig. 4.

Sample face images from the AR dataset.

Fig. 5.

Sample face images from the LFW dataset.

The inputs of the Siamese network model are an image pair and a label. Therefore, it is necessary to collate and generate training data that meets this requirement. We label a pair of face images from the same person as 0, and otherwise as 1. A dataset containing 32,000 image pairs is generated with a 1:1 ratio of matched to mismatched pairs. In this dataset, 20,000 image pairs are used for training and 12,000 are used for testing. Table 3 shows the generation algorithm for the training data.

Table 3.

Training data generation algorithm

(1) Randomly select two face images [TeX:] $$A _ { 1 } \in S , A _ { 2 } \in S$$, where S is the database of face images;

(2) If [TeX:] $$A _ { 1 }$$ and [TeX:] $$A _ { 2 }$$ are the same image, return to (1);

(3) If the two different images are from the same person, set the label to 0; if they are from different persons, set the label to 1; then form a training sample pair [TeX:] $$T : \left( A _ { 1 } , A _ { 2 } , 0 \right) \text { or } \left( A _ { 1 } , A _ { 2 } , 1 \right)$$;

(4) Let the set of training samples be S'. If T does not exist in S', add it to the set; otherwise, return to (1);

(5) When the number of training sample pairs in S' reaches the set-point, end.
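The steps above can be sketched as a short sampling loop. This is a minimal illustration, not the authors' implementation: the function name `generate_pairs`, the `(image_id, person_id)` record layout, the treatment of a pair as unordered, and the per-label quota used to reach the 1:1 ratio are all assumptions, since the source does not say how the ratio is enforced.

```python
import random

def generate_pairs(images, per_label, seed=0):
    """Build a set of (id_a, id_b, label) training samples.

    images: list of (image_id, person_id) records.
    per_label: number of pairs wanted for each label (assumed quota
               that yields a 1:1 ratio of matched to mismatched pairs).
    """
    rng = random.Random(seed)
    samples = set()
    counts = {0: 0, 1: 0}
    while counts[0] < per_label or counts[1] < per_label:
        # Step (1): randomly select two face images.
        (id_a, person_a), (id_b, person_b) = rng.choice(images), rng.choice(images)
        # Step (2): reject a pair made of the same image.
        if id_a == id_b:
            continue
        # Step (3): label 0 for the same person, 1 otherwise.
        label = 0 if person_a == person_b else 1
        if counts[label] >= per_label:
            continue  # quota for this label already filled (assumption)
        # Step (4): keep the pair only if it is not in the set yet.
        pair = (min(id_a, id_b), max(id_a, id_b), label)
        if pair in samples:
            continue
        samples.add(pair)
        counts[label] += 1
    # Step (5): both quotas reached.
    return samples

# Toy database: 4 persons with 3 images each.
images = [(f"p{p}_{k}", p) for p in range(4) for k in range(3)]
pairs = generate_pairs(images, per_label=5)
print(len(pairs))
```

The quota must not exceed the number of distinct pairs the database can supply for each label, otherwise the loop never ends; a production version would need a guard for that case.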

Fig. 6 shows the matched and mismatched pairs formed by the generation algorithm, where [TeX:] $$A , A _ { 1 } , A _ { 2 } , A _ { 3 }$$ represent face images of the same person with variations in expression and gesture, distinguished by different IDs. Here, A, B, C, D represent the face images of different persons. The final sample pairs are [TeX:] $$\left( A , A _ { 1 } , 0 \right) , \left( A , A _ { 2 } , 0 \right) , \left( A , A _ { 3 } , 0 \right) , ( A , B , 1 ) , ( A , C , 1 ) , ( A , D , 1 )$$.

Fig. 6.

Matched pairs and unmatched pairs generation.

The images are not preprocessed for background or illumination before the CNN extracts features. Instead, in the process of nonlinear dimension reduction performed by the CNN, the influence of such interference factors can be eliminated automatically. To further reduce image matching time and computation, the images are resized to 120×120.

## 5. Experimental Results and Analysis

The experiments are run on a Rongtian SCW4550 GPU server with an Intel Xeon E5-2670 v3 2.3 GHz CPU, 128 GB of memory, and a GeForce GTX TITAN X with 12 GB of memory. The processing speed is up to 50 fps, which is faster than the real-time standard. We use PyTorch as the deep learning framework. The experiments explore the effects of network structure, parameter settings and loss functions separately, and are conducted on the AR and LFW datasets.

We set the parameters as follows: the boundary value m of the loss function is set to 2, and the mini-batch size (mb) is set to 32. Over the interval [0, 30], we gradually increase the threshold with a step size of 0.01. The recognition rate for each tested threshold is calculated to find the optimal threshold, that is, the one that achieves the best recognition performance. We choose 0.49 as the threshold, at which the recognition rate reaches its highest value of 0.988.
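The threshold search described above amounts to a one-dimensional grid search over the pair distances. The following is a minimal sketch under the labeling convention of this paper (0 for the same person, 1 for different persons); the function names `accuracy` and `best_threshold` and the toy distances are hypothetical.

```python
def accuracy(distances, labels, threshold):
    """Fraction of pairs classified correctly at a given threshold.

    A pair is predicted as 'same person' when its distance is below
    the threshold; label 0 means same person, 1 means different.
    """
    correct = sum((d < threshold) == (f == 0)
                  for d, f in zip(distances, labels))
    return correct / len(labels)

def best_threshold(distances, labels, lo=0.0, hi=30.0, step=0.01):
    """Scan [lo, hi] with the given step and keep the first best threshold."""
    n_steps = int(round((hi - lo) / step))
    best_t, best_acc = lo, accuracy(distances, labels, lo)
    for k in range(1, n_steps + 1):
        t = lo + k * step
        acc = accuracy(distances, labels, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Toy distances: matched pairs are close, mismatched pairs are far apart.
distances = [0.1, 0.3, 0.4, 0.9, 1.2, 2.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, acc = best_threshold(distances, labels)
print(threshold, acc)
```

In practice the threshold should be selected on data that is not used to report the final recognition rate, so that the reported figure is not tuned to the test pairs.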

Table 4.

Configuration and recognition rate of five different network models

| Model | Network configuration | Recognition rate (%) |
|---|---|---|
| model1 | Conv3 Conv3 Pooling Conv3 Conv3 Pooling Conv3 Conv3 Pooling Conv3 FC FC FC | 94.8 |
| model2 | Conv3 Conv3 Conv1 Pooling (out) Conv3 Conv3 Conv3 Conv1 Pooling (out) Conv3 Conv3 Conv1 Pooling (out) FC FC | 94.6 |
| model3 | Conv3 Conv3 Conv1 Pooling Conv3 Conv1 Conv1 Pooling Conv3 Conv3 Conv1 Pooling Conv3 Conv3 Conv1 Pooling Conv3 Conv3 Conv1 (out) Pooling Conv3 Conv3 Conv1 (out) Pooling Conv3 Conv3 Conv1 (out) FC FC FC | 50.1 |
| model4, model5 | Conv3 Conv3 Pooling Conv3 Conv3 Pooling Conv3 Conv3 Pooling Conv3 Conv3 Pooling Conv3 Conv3 Pooling Conv3 Conv3 Pooling Conv3 Conv3 Pooling Conv3 Conv3 Conv1 Pooling Conv3 Conv1 Conv1 Pooling Conv3 Conv3 Conv1 FC FC FC | 91.1, 93.2 |

5.1 Network Models and Parameters Comparison

In this paper, five different models are trained on the LFW dataset and the AR dataset, respectively. Each model has a different network structure; the specific configurations and recognition rates are shown in Table 4.

Table 4 shows that the network structure and parameter settings affect the accuracy of the algorithm. The recognition rate is highest when the number of convolutional layers is 7. With the number of convolutional layers fixed, the recognition rate is highest when the number of fully-connected layers is 3. Taking these results together, the first model is adopted in this experiment. The size of the convolutional kernel is set to 3×3 when designing the network parameters, which enhances the recognition ability of the discriminant function and reduces the number of parameters compared with convolutional kernels of size 5×5 and 7×7. For example, when the number of channels is C, three stacked convolutional layers with 3×3 kernels have [TeX:] $$3 \times ( 3 \times 3 \times C \times C ) = 27 C ^ { 2 }$$ parameters, whereas a single convolutional layer with a 7×7 kernel has [TeX:] $$7 \times 7 \times C \times C = 49 C ^ { 2 }$$ parameters.
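The comparison between three 3×3 layers and one 7×7 layer is natural because, at stride 1, both cover the same 7×7 receptive field. The short sketch below checks both the parameter counts and the receptive fields; the function names are hypothetical and biases are ignored.

```python
def stacked_conv_weights(kernel, channels, layers):
    """Weights of `layers` stacked kernel x kernel convolutions
    with `channels` input and output channels (biases ignored)."""
    return layers * kernel * kernel * channels * channels

def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return layers * (kernel - 1) + 1

C = 64  # example channel count, chosen only for illustration
small = stacked_conv_weights(3, C, layers=3)   # 27 * C^2
large = stacked_conv_weights(7, C, layers=1)   # 49 * C^2
print(small, large, small / large)
print(receptive_field(3, 3), receptive_field(7, 1))
```

The stacked 3×3 design therefore needs 27/49, or roughly 55%, of the weights of the single 7×7 layer, and it also inserts two extra nonlinearities between the layers.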

The convergence of each model is shown in Fig. 7. Model1 and model2 correspond to the SiameseFace1 model and the SiameseFace2 model, respectively. In the experiment, model1 has the best performance and its loss function converges fastest. Model3 is difficult to converge because the network is deep, and its performance is not ideal for the same number of iterations.

Fig. 7.

Comparison of the loss convergence of the different models.

5.2 AR Dataset Experiment

The AR face dataset contains 4,000 images of 126 people with variations in facial expression, illumination and camouflage (occlusion). The AR dataset is processed with the same training data generation method as LFW. The experimental results of the trained models and the existing algorithms are shown in Table 5.

The algorithm and value marked in bold indicate the best experimental result. Although the traditional algorithms in Table 5 achieve good results on the AR dataset, the recognition rate of our algorithm is considerably higher. This shows that the method based on the Siamese network can effectively solve the problem of insufficient training samples for a single category. The network model has learned effective features with which it can better compare and distinguish a pair of input images.

Table 5.

Experimental results on the AR dataset

| Algorithm | Recognition rate (%) |
|---|---|
| SRC [13] | 87.9 |
| CRC [15] | 90 |
| GSRC [14] | 93 |
| **SiameseFace1** | **98.8** |
| SiameseFace2 | 98.4 |

5.3 LFW Dataset Experiment

We randomly select 12,000 pairs of faces from the LFW dataset to form a face test dataset, of which 6,000 pairs belong to the same person in different postures and the remaining 6,000 pairs belong to two different persons. In the test process, a pair of images is the input of the Siamese network, and the output is 'yes' or 'no'. 'Yes' means that the image pair represents the same person, while 'no' means that the image pair represents different persons. The face recognition accuracy is the proportion of test pairs for which the output agrees with the ground truth. There are more than 13,000 face images collected from over 5,000 people in the LFW face dataset, of which only 1,680 people have two or more images and about 4,000 people have only one face image. This greatly increases the difficulty of model training. We use only the internal data of the LFW dataset when training the network and do not use external data to optimize it. Table 6 shows the experimental results of the trained models compared with the existing algorithms.

Table 6.

Experimental results on the LFW dataset

| Algorithm | Recognition rate (%) |
|---|---|
| Joint Bayesian [28] | 90.90 |
| Fisher Vector Faces [29] | 93.03 |
| FR+FCN [22] | 93.65 |
| Face++ [22] | 97.27 |
| SiameseFace1 | 94.80 |
| SiameseFace2 | 94.60 |

In Table 6, Face++, a commercial system built by the Face++ company, has the best performance. The number of facial feature points and the training data used by this algorithm are not publicly disclosed. Our algorithm is inferior only to Face++ and has a higher recognition rate than the other algorithms.

5.4 Comparison Experiment of Loss Function

The loss function used in this paper is the contrastive loss function, which achieves higher recognition accuracy. We also tried several other loss functions, including the triplets-loss function, the cosine proximity function, and the squared error function. Comparison experiments are carried out on the AR dataset and the results are shown in Table 7.

In the work of [27], a triplet is generated by randomly selecting a sample from the training dataset, denoted as S_a, and then randomly selecting one sample of the same class as S_a and one sample of a different class, denoted as the positive sample S_p and the negative sample S_n, respectively. For each element of the triplet, a parameter-sharing network is trained to obtain the feature expressions of the three elements, denoted as [TeX:] $$f \left( s _ { i } ^ { a } \right) , f \left( s _ { i } ^ { p } \right) , f \left( s _ { i } ^ { n } \right)$$. The purpose of the triplets-loss function is to learn feature expressions such that the distance between the same-class elements S_a and S_p is as small as possible, and the distance between the different-class elements S_a and S_n is as large as possible. The triplets-loss function is defined as:

##### (4)
[TeX:] $$L _ { \mathrm { trip } } = \sum _ { i = 1 } ^ { N } \left[ \left\| f \left( s _ { i } ^ { a } \right) - f \left( s _ { i } ^ { p } \right) \right\| _ { 2 } ^ { 2 } - \left\| f \left( s _ { i } ^ { a } \right) - f \left( s _ { i } ^ { n } \right) \right\| _ { 2 } ^ { 2 } + m \right] _ { + }$$

Here, N denotes the number of samples and m denotes the margin value. The subscript + indicates that the value in the brackets is taken as the loss when it is greater than zero; when it is less than zero, the loss is zero.
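Eq. (4) can be written out directly. The sketch below is a plain-Python illustration that operates on feature vectors that have already been computed; the function names `squared_l2` and `triplet_loss` and the toy vectors are hypothetical.

```python
def squared_l2(u, v):
    """Squared L2 distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchors, positives, negatives, margin):
    """Eq. (4): sum over triplets of [d(a, p) - d(a, n) + m]_+ ."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        value = squared_l2(a, p) - squared_l2(a, n) + margin
        total += max(value, 0.0)  # the [.]_+ operator
    return total

# Toy features: the first triplet already satisfies the margin,
# the second one violates it and contributes to the loss.
anchors = [[0.0, 0.0], [0.0, 0.0]]
positives = [[1.0, 0.0], [2.0, 0.0]]
negatives = [[3.0, 0.0], [1.0, 0.0]]
print(triplet_loss(anchors, positives, negatives, margin=1.0))
```

Only triplets whose negative is not yet farther from the anchor than the positive by at least the margin produce a nonzero term.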

The cosine distance, called cosine similarity in [30], uses the cosine of the angle between two vectors in vector space to measure the difference between two inputs. It is defined as:

##### (5)
[TeX:] $$L _ { \mathrm { cos } } = \frac { 1 } { m b } \sum _ { i = 1 } ^ { m b } \left[ \left( 1 - f ^ { ( i ) } \right) * \cos \left( X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } \right) + f ^ { ( i ) } * \left( m + \cos \left( X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } \right) \right) \right]$$

The loss function used in [31] is the squared error loss function, which is defined as:

##### (6)
[TeX:] $$L _ { \mathrm { sq } } = \frac { 1 } { m b } \sum _ { i = 1 } ^ { m b } \left[ \left( 1 - f ^ { ( i ) } \right) * \left( \frac { 1 } { 2 } - \delta \left( d ^ { ( i ) } \right) ^ { 2 } \right) ^ { 2 } + f ^ { ( i ) } * \left( 1 - \delta \left( d ^ { ( i ) } \right) ^ { 2 } \right) \right]$$

##### (7)
[TeX:] $$\delta ( x ) = \frac { 1 } { 1 + e ^ { - x } }$$

Here, [TeX:] $$\delta$$ is the logistic function and [TeX:] $$d ^ { ( i ) }$$ represents the similarity measure of the sample pair i: [TeX:] $$< X _ { 1 } ^ { ( i ) } , X _ { 2 } ^ { ( i ) } >$$.

A shortcoming of the squared error loss function is that its gradient vanishes easily.

Table 7.

Comparison of different loss functions on AR dataset

| Loss function | Recognition rate (%) |
|---|---|
| Triplets-loss function | 98.5 |
| Cosine proximity function | 97.2 |
| The squared error function | 96.5 |
| Contrastive loss function | 98.8 |

As shown in Table 7, the contrastive loss function used in this paper performs best. Its recognition rate is 1.6 percentage points higher than that of the cosine proximity function and slightly higher than that of the triplets-loss function. We also found in the experiment that training with the triplets-loss function is slow and prone to overfitting.

## 6. Conclusion

In this paper, we propose an effective face recognition algorithm based on a novel Siamese CNN, which indirectly expands the number of training samples of a single category on the AR and LFW datasets. With an image pair as the input of the network, the designed Siamese network model extracts the features, and the similarity is learned with the contrastive loss function. In addition, a lightweight network model without loss of recognition accuracy is also proposed. The training data generation method, combined with the proposed Siamese network model and the contrastive loss function, achieves a higher recognition rate on the AR and LFW datasets. In the future, we will carry out quantitative experimental analysis of single-sample training, design and optimize the deep network model, construct novel loss functions, and further improve the recognition performance of our algorithm.

## Acknowledgement

This research was supported by the National Natural Science Foundation of China (Nos. 61772454 and 61811530332), the Scientific Research Fund of Hunan Provincial Education Department (No. 16A008), the Scientific Research Fund of Hunan Provincial Transportation Department (No. 201446), the Industry-University Cooperation and Collaborative Education Project of Department of Higher Education of Ministry of Education (No. 201702137008), the Undergraduate Inquiry Learning and Innovative Experimental Fund of CSUST (No. 2018-6-119), and the Postgraduate Course Construction Fund of CSUST (No. KC201611).

## Biography

##### Jianming Zhang
https://orcid.org/0000-0002-4278-0805

He received the B.S. and M.S. degrees in 1996 and 2001 from Zhejiang University and the National University of Defense Technology, China, respectively. He received the Ph.D. degree in 2010 from Hunan University, China. Currently, he is an associate professor and the deputy dean in the School of Computer and Communication Engineering at Changsha University of Science and Technology, China. His main research interests lie in the areas of computer vision, data mining, and wireless ad hoc sensor networks.

##### Xiaokang Jin
https://orcid.org/0000-0002-9563-8888

He received the B.S. degree from Changsha University of Science and Technology, China, in 2016. He is currently pursuing the M.S. degree in computer science and technology at Changsha University of Science and Technology. His research interests include computer vision, deep learning, and object tracking.

##### Yukai Liu

He received the B.S. degree in 2013 from Xiangnan University, China. He received the M.S. degree in 2018 from Changsha University of Science and Technology, China. His research interests include computer vision, deep learning, and pattern recognition.

##### Arun Kumar Sangaiah
https://orcid.org/0000-0002-0229-2460

He received the M.S. degree in computer science and engineering from the Government College of Engineering, Tirunelveli, Anna University, India. He received the Ph.D. degree in computer science and engineering from VIT University, Vellore, India. He is presently working as an associate professor in the School of Computer Science and Engineering, VIT University, India. His areas of interest include software engineering, computational intelligence, wireless networks, bioinformatics, and embedded systems.

## References

• 1 Stephen ID, Hiew V, Coetzee V, Tiddeman BP, Perrett DI, "Facial Shape Analysis Identifies Valid Cues to Aspects of Physiological Health in Caucasian, Asian, and African Populations," Frontiers in Psychology, Aug. 2017, vol. 8. doi:10.3389/fpsyg.2017.01883
• 2 Blanco-Gonzalo R, Poh N, Wong R, et al., "Time evolution of face recognition in accessible scenarios," Human-centric Computing and Information Sciences, Aug. 2015, vol. 5, no. 1, p. 24. doi:10.1186/s13673-015-0043-0
• 3 Maze B, Adams J, Duncan JA, "IARPA Janus Benchmark–C: Face Dataset and Protocol," in Proceedings of the 11th IAPR International Conference on Biometrics, Queensland, Australia, 2018.
• 4 Liu F, Bi Y, Cui Y, "Local similarity based linear discriminant analysis for face recognition with single sample per person," in Asian Conference on Computer Vision, Singapore, Singapore, 2014.
• 5 Tsalakanidou F, Tzovaras D, Strintzis MG, "Use of depth and colour eigenfaces for face recognition," Pattern Recognition Letters, Jun. 2003, vol. 24, no. 9-10, pp. 1427-1435. doi:10.1016/S0167-8655(02)00383-5
• 6 He X, Yan S, Hu Y, "Face recognition using laplacianfaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, Mar. 2005, vol. 27, no. 3, pp. 328-340. doi:10.1109/TPAMI.2005.55
• 7 Tu Y, Lin Y, Wang J, Kim JU, "Semi-supervised Learning with Generative Adversarial Networks on Digital Signal Modulation Classification," Computers Materials Continua, May 2018, vol. 55, no. 2, pp. 243-254. doi:10.3970/cmc.2018.01755
• 8 Yu N, Yu Z, Gu F, Li T, Tian X, Pan Y, "Deep learning in genomic and medical image data analysis: challenges and approaches," Journal of Information Processing Systems, 2017, vol. 13, no. 2, pp. 204-214. doi:10.3745/JIPS.04.0029
• 9 Koo KM, Cha EY, "Image recognition performance enhancements using image normalization," Human-centric Computing and Information Sciences, 2017, vol. 7, no. 1, p. 33. doi:10.1186/s13673-017-0114-5
• 10 Sainath TN, Weiss RJ, Wilson KW, Li B, "Multichannel signal processing with deep neural networks for automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, May 2017, vol. 25, no. 5, pp. 965-979. doi:10.1109/TASLP.2017.2672401
• 11 Lee SG, Sung Y, Kim YG, "Variations of AlexNet and GoogLeNet to Improve Korean Character Recognition Performance," Journal of Information Processing Systems, 2018, vol. 14, no. 1, pp. 205-217. doi:10.3745/JIPS.04.0061
• 12 Chopra S, Hadsell R, LeCun Y, "Learning a similarity metric discriminatively, with application to face verification," in IEEE Conference on Computer Vision and Pattern Recognition, San Diego, US, 2005.
• 13 Wright J, Yang AY, Ganesh A, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Feb. 2009, vol. 31, no. 2, pp. 210-227. doi:10.1109/TPAMI.2008.79
• 14 Yang M, Zhang L, "Gabor feature based sparse representation for face recognition with gabor occlusion dictionary," in European Conference on Computer Vision, Crete, Greece, 2010.
• 15 Zhang L, Yang M, "Sparse representation or collaborative representation: Which helps face recognition?" in IEEE International Conference on Computer Vision, Barcelona, Spain, 2011.
• 16 Vo DM, Lee SW, "Robust face recognition via hierarchical collaborative representation," Information Sciences, Mar. 2018, vol. 432, pp. 332-346. doi:10.1016/j.ins.2017.12.014
• 17 Li C, Zhao S, Xiao K, "Face Recognition Based on the Combination of Enhanced Local Texture Feature and DBN under Complex Illumination Conditions," Journal of Information Processing Systems, 2018, vol. 14, no. 1, pp. 191-204. doi:10.3745/JIPS.04.0060
• 18 Whitelam C, Taborsky E, Blanton A, "IARPA Janus Benchmark-B Face Dataset," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 2017.
• 19 Tran L, Yin X, Liu X, "Disentangled representation learning gan for pose-invariant face recognition," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017.
• 20 Chen D, Cao X, Wen F, "Blessing of dimensionality: high dimensional feature and its efficient compression for face verification," in Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 2013.
• 21 Simonyan K, Parkhi OM, Vedaldi A, "Fisher Vector Faces in the Wild," in Proceedings of 2013 British Machine Vision Conference, Bristol, UK, 2013.
• 22 Zhu Z, Luo P, Wang X, Tang X, "Recover canonical-view faces in the wild with deep neural networks," arXiv:1404.3543, 2014.
• 23 Taigman Y, Yang M, Ranzato M, Wolf L, "DeepFace: closing the gap to human-level performance in face verification," in Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014.
• 24 Sun Y, Wang X, Tang X, "Deep learning face representation from predicting 10,000 classes," in Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014.
• 25 Sun Y, Chen Y, Wang X, Tang X, "Deep learning face representation by joint identification-verification," in Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 2014.
• 26 Parchami M, Bashbaghi S, Granger E, "Video-based face recognition using ensemble of haar-like deep convolutional neural networks," in International Joint Conference on Neural Networks, Anchorage, US, 2017.
• 27 Schroff F, Kalenichenko D, Philbin J, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, US, 2015.
• 28 Chen D, Cao X, Wang L, "Bayesian face revisited: A joint formulation," in European Conference on Computer Vision, Florence, Italy, 2012.
• 29 Barkan O, Weill J, Wolf L, "Fast high dimensional vector multiplication face recognition," in IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
• 30 Berlemont S, Lefebvre G, Duffner S, "Class-balanced siamese neural networks," Neurocomputing, Jan. 2018, vol. 273, pp. 47-56. doi:10.1016/j.neucom.2017.07.060
• 31 Shaham U, Lederman RR, "Learning by coincidence: siamese networks and common variable learning," Pattern Recognition, Feb. 2018, vol. 74, pp. 52-63. doi:10.1016/j.patcog.2017.09.015