Article Information
Corresponding Author: Songze Tang* (ts198708@163.com)
Songze Tang*, Dept. of Criminal Science and Technology, Nanjing Forest Police College, Nanjing, China, ts198708@163.com
Xuhuan Zhou*, Dept. of Criminal Science and Technology, Nanjing Forest Police College, Nanjing, China, 416126986@qq.com
Nan Zhou*, Dept. of Criminal Science and Technology, Nanjing Forest Police College, Nanjing, China, nudge@163.com
Le Sun**, School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China, sunlecncom@163.com
Jin Wang***, School of Computer & Communication Engineering, Changsha University of Science & Technology, Changsha, China, jinwang@csust.edu.cn
Published (Print): December 31 2019
Published (Electronic): December 31 2019
1. Introduction
In the real world, it is not easy to directly obtain a frontal face image of a criminal suspect in an actual investigation. Suspects often intentionally cover their faces in surveillance videos [1,2]. However, an artist can draw a sketch of the suspect according to the information in the surveillance video, and this sketch may then serve as a substitute for identifying the suspect. Face sketch synthesis, which refers to the transformation of a face photo into a sketch, has therefore received increasing attention. We roughly divide face sketch synthesis methods into two categories: image-based and patch-based.
1.1 Prior Works
Image-based methods treat the input photo as a whole and produce a sketch image with some model. Wang et al. [3] converted greyscale images to pencil sketches in which the pencil strokes adhered to the image features. Li and Cao [4] proposed a simple two-stage framework for face photo-sketch synthesis. However, these methods did not mimic the sketch style well. Since the breakthrough of deep learning [5], it has attracted great attention in image processing problems [6-8]. The fully convolutional network (FCN) was first introduced to learn mapping functions from photos to sketches [9]. Recently, the generative adversarial network (GAN) [10] has been attracting growing attention. To infer photo-realistic natural images, Ledig et al. [7] proposed a perceptual loss function consisting of an adversarial loss and a content loss. Based on the GAN method, a class of loss functions with perceptual similarity metrics was designed to generate images [11]. Wang et al. [12] further employed a back projection strategy to improve the final synthesis performance.
Various patch-based methods have also been proposed. They can be divided into three categories, i.e., subspace learning based methods [13-17], sparse representation based methods [18-23], and Bayesian inference based methods [24-28].
The subspace learning framework mainly includes linear subspace-based methods and nonlinear subspace-based methods. The seminal work in linear face sketch synthesis was the eigen-transformation method [13]. Considering the complexity of human faces, a linear relationship may not always hold, thus Liu et al. [14] proposed to characterize the nonlinear process of face sketch synthesis according to the concept of locally linear embedding (LLE) [15]. Inspired by the image denoising method, Song et al. [16] explored the K surrounding spatial neighbors for face sketch synthesis. Instead of searching for neighbors online, Wang et al. [17] randomly sampled the similar patches offline, and used them to reconstruct the target sketch patch. It was named Fast-RSLCR.
Owing to the great success of sparse representation in many image processing problems [18-20], Chang et al. [21] incorporated it into face sketch synthesis. To mitigate the handicap in the face image retrieval process, Gao et al. [22] proposed a two-step framework: they obtained a coarse estimate using neighbor selection and then enhanced the definition of the initial estimate by sparse representation. The above methods assumed that the photo patches and the corresponding sketch patches share the same sparse representation coefficients. In reality, the relationships between different styles of images are complex, so this assumption does not always hold. Wang et al. [23] therefore proposed to learn a mapping of the sparse coefficients between each sketch patch and the corresponding photo patch.
Bayesian inference methods take the constraints between neighboring image patches into consideration. Wang and Tang [25] considered this relationship at different scales in their multi-scale Markov random fields (MRF) method. However, this method tends to generate facial deformations. To address this issue, Zhou et al. [26] proposed a Markov weight field (MWF) method that embeds the LLE idea into the MRF model. Because lighting and pose variations often appear, neighbor selection is not robust; Peng et al. [28] therefore adaptively represented an image patch by multiple features to improve robustness.
1.2 Motivation and Contributions
As is generally understood, face images have strong structural similarity in local regions (the mouth is not similar to the nose) [29], which means that similarity is low between photo patches and sketch patches that do not correspond to the same small region. Thus, a local similarity constraint is employed to search for the best matching neighbor patches from the training sets. In addition, inspired by the nonlocal denoising method [30], a nonlocal similarity regularization is also introduced to further improve the sketch synthesis quality.
The main contributions of our work can be summarized as follows. First, we impose local similarity constraints on the selection of similar patches, which improves the quality of the synthesized sketches by discarding dissimilar training patches. Second, taking the redundancy of image patches into consideration, a global nonlocal similarity regularization is employed to suppress artifacts and maintain primitive facial features during the synthesis process. Thus, more robust synthesized results can be achieved.
2. Related Work
Given a training set with M face photo-sketch pairs, we divide each training image into N small overlapping patches. Let X and Y be an input test photo and an estimated sketch image, which are divided into N overlapping patches [TeX:] $$\left\{\mathbf{x}_{1}, \mathbf{x}_{2}, \cdots, \mathbf{x}_{N}\right\}$$ and [TeX:] $$\left\{\mathbf{y}_{1}, \mathbf{y}_{2}, \cdots, \mathbf{y}_{N}\right\}$$ , respectively, in the same way. For each test photo patch [TeX:] $$\mathbf{X}_{i}$$, we reconstruct it using the K nearest photo patches [TeX:] $$\mathbf{P}_{i, K}=\left\{\mathbf{p}_{i, k}\right\}_{k=1}^{K}$$ from the training dataset with corresponding weight vector [TeX:] $$\mathbf{w}_{i, K}=\left\{w_{i, k}\right\}_{k=1}^{K}.$$ Thus, the corresponding sketch image patch [TeX:] $$\mathbf{y}_{i}$$ can be synthesized by the corresponding K nearest sketch patches [TeX:] $$\left\{\mathbf{s}_{i, k}\right\}_{k=1}^{K}$$ with the above obtained weight vector [TeX:] $$\mathbf{W}_{i, K}$$ .
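As a concrete illustration of this setup, the patch decomposition can be sketched as follows; the function name `extract_patches` and the particular patch size and overlap are illustrative choices, not values from the paper:

```python
import numpy as np

def extract_patches(image, patch_size, overlap):
    """Divide a 2-D image into overlapping square patches.

    Returns the top-left coordinates of each patch and an (N, patch_size**2)
    array holding the vectorized patches, in row-major order.
    """
    step = patch_size - overlap
    H, W = image.shape
    coords, patches = [], []
    for top in range(0, H - patch_size + 1, step):
        for left in range(0, W - patch_size + 1, step):
            coords.append((top, left))
            patches.append(image[top:top + patch_size,
                                 left:left + patch_size].ravel())
    return coords, np.asarray(patches)
```

Both the training photo-sketch pairs and the test photo would be decomposed in exactly the same way, so that the i-th test patch can be matched against training patches at corresponding positions.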
2.1 LLE
For a test photo patch [TeX:] $$\mathbf{x}_{i}$$, the K nearest photo patches are searched for in the training set by Euclidean distance. The combination weights are then obtained according to LLE:

[TeX:] $$\min _{\mathbf{w}_{i, K}}\left\|\mathbf{x}_{i}-\sum_{k=1}^{K} w_{i, k} \mathbf{p}_{i, k}\right\|_{2}^{2}, \text { s.t. } \sum_{k=1}^{K} w_{i, k}=1 \qquad(1)$$
where [TeX:] $$w_{i, k}$$ represents the linear combination weight for the k-th photo patch [TeX:] $$\mathbf{p}_{i, k}$$. We can rewrite (1) in matrix form as

[TeX:] $$\min _{\mathbf{w}_{i, K}}\left\|\mathbf{x}_{i}-\mathbf{P}_{i, K} \mathbf{w}_{i, K}\right\|_{2}^{2}, \text { s.t. } \mathbf{1}^{T} \mathbf{w}_{i, K}=1 \qquad(2)$$
After the combination weight is obtained from (2), the target sketch patch [TeX:] $$\mathbf{y}_{i}$$ can be synthesized as

[TeX:] $$\mathbf{y}_{i}=\sum_{k=1}^{K} w_{i, k} \mathbf{s}_{i, k} \qquad(3)$$
Once all the sketch patches are generated from (3), the whole sketch Y can be assembled by averaging the pixel values in the overlapping regions.
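A minimal sketch of this LLE step, assuming the standard closed form for sum-to-one constrained reconstruction weights (a small regularizer is added because the local Gram matrix can be singular); the function names are illustrative:

```python
import numpy as np

def lle_weights(x, P, reg=1e-6):
    """Weights reconstructing test patch x from its K nearest training photo
    patches (columns of P), subject to the weights summing to one."""
    K = P.shape[1]
    diff = x[:, None] - P                # d x K difference vectors
    G = diff.T @ diff                    # local Gram matrix
    tr = np.trace(G)
    G = G + reg * (tr if tr > 0 else 1.0) * np.eye(K)  # guard against singularity
    w = np.linalg.solve(G, np.ones(K))
    return w / w.sum()

def synthesize_patch(S, w):
    """Apply the same weights to the corresponding sketch patches (columns of S)."""
    return S @ w
```

The key assumption carried over from LLE is that the photo-patch manifold and the sketch-patch manifold share the same local geometry, so the weights computed in photo space transfer directly to sketch space.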
2.2 MWF
Taking the relationship between adjacent sketch patches into account, Zhou et al. [26] proposed the MWF method, which introduces a linear combination into the MRF model. This is equivalent to minimizing the following cost function:

[TeX:] $$\min _{\mathbf{w}_{1}, \cdots, \mathbf{w}_{N}} \sum_{i=1}^{N}\left\|\mathbf{x}_{i}-\mathbf{P}_{i} \mathbf{w}_{i}\right\|_{2}^{2}+\lambda \sum_{(i, j) \in \mathrm{Ne}}\left\|\mathbf{O}_{i}^{j} \mathbf{w}_{i}-\mathbf{O}_{j}^{i} \mathbf{w}_{j}\right\|_{2}^{2} \qquad(4)$$
where [TeX:] $$(i, j) \in \mathrm{Ne}$$ indicates that the i-th and j-th patches are neighbors, [TeX:] $$\mathbf{O}_{i}^{j}$$ is a matrix whose k-th column [TeX:] $$\mathrm{O}_{i, k}^{j}$$ denotes the overlapping area of the k-th candidate for the i-th sketch patch with the j-th patch, and [TeX:] $$\lambda$$ is a balancing parameter between the two terms.
2.3 Fast-RSLCR
The abovementioned methods search for nearest neighbors online; thus, testing becomes significantly slower. Wang et al. [17] instead randomly sampled patches offline and imposed a locality constraint to regularize the reconstruction weights:

[TeX:] $$\min _{\mathbf{w}_{i}}\left\|\mathbf{x}_{i}-\mathbf{P}_{i} \mathbf{w}_{i}\right\|_{2}^{2}+\lambda\left\|\mathbf{d}_{i} \circ \mathbf{w}_{i}\right\|_{2}^{2} \qquad(5)$$
where [TeX:] $$\mathbf{d}_{i}=\left\|\mathbf{x}_{i}-\mathbf{p}_{i}\right\|_{2},(1 \leq i \leq N)$$ measures the distance between [TeX:] $$\mathbf{x}_{i}$$ and [TeX:] $$\mathbf{p}_{i}$$, and [TeX:] $$\lambda$$ is a balancing parameter. The Fast-RSLCR method speeds up the synthesis process. However, it allows more candidate patches to be sampled, which has two shortcomings: it reduces the discriminability of the synthesized sketch, and it increases the spatial complexity.
3. Face Sketch Synthesis Based on Local and Nonlocal Similarity Regularization
3.1 Adaptive Regularization by Local Similarity
As mentioned, the LLE method [15] calculates the linear combination coefficient vector without using the appropriate constraints, thus, biased solutions can be obtained easily. Due to the large quantization errors, similar test patches may have different content (Fig. 1(a)). In Fast-RSLCR, each test patch is more accurately represented by capturing the correlations between different training patches. However, not all patches play a positive role in the final results of the face sketch synthesis. A greater number of patches contributing to the final synthesized result implies lower discriminability, as shown in Fig. 1(b). Therefore, we introduce a local similarity regularization to the neighbor selection, which leads to (i) a stable solution, and (ii) discriminant synthesized results (Fig. 1(c)).
Fig. 1. Comparison between LLE (a), Fast-RSLCR (b), and the proposed method (c).
In our local similarity constraint model, we consider only the most relevant patches in the training set as effective samples. For each patch [TeX:] $$\mathbf{x}_{i}$$ in the test photo, the optimal weights are obtained by minimizing the local similarity regularized reconstruction error:

[TeX:] $$\min _{\mathbf{w}_{i, K}}\left\|\mathbf{x}_{i}-\mathbf{P}_{i, K} \mathbf{w}_{i, K}\right\|_{2}^{2}+\lambda_{1}\left\|\mathbf{D}_{i, K} \mathbf{w}_{i, K}\right\|_{2}^{2}, \text { s.t. } \mathbf{1}^{T} \mathbf{w}_{i, K}=1 \qquad(6)$$
where [TeX:] $$\mathbf{D}_{i, K}=\left[\begin{array}{cccc} {d_{i, 1}} & {} & {} & {0} \\ {} & {d_{i, 2}} & {} & {} \\ {} & {} & {\ddots} & {} \\ {0} & {} & {} & {d_{i, K}} \end{array}\right], d_{i, j}=\exp \left(\frac{\left\|\mathbf{x}_{i}-\mathbf{p}_{j}\right\|_{2}^{2}}{\sigma}\right)$$ represents the exponential locality adaptor and [TeX:] $$\sigma$$ is a positive number. The K sampled training photo patches constitute the matrix [TeX:] $$\mathbf{P}_{i, K}$$, and [TeX:] $$\mathbf{w}_{i, K}$$ is the weight vector. [TeX:] $$\lambda_{1}$$ is a balancing parameter. To preserve the data structure, the exponential function is used to improve the representation. Because [TeX:] $$d_{i, j}$$ grows exponentially with [TeX:] $$\left\|\mathbf{x}_{i}-\mathbf{p}_{j}\right\|_{2}^{2} / \sigma$$, the exponential locality adaptor will be quite large when [TeX:] $$\mathbf{x}_{i}$$ and [TeX:] $$\mathbf{p}_{j}$$ are far apart. This property is useful when we want to stress the importance of data locality (because [TeX:] $$d_{i, j}$$ is the weight of [TeX:] $$w_{i, j}$$ in (6), a large value of [TeX:] $$d_{i, j}$$ causes [TeX:] $$w_{i, j}$$ to be small).
To determine the solution [TeX:] $$\mathbf{w}_{i, K}$$ of (6), we consider the Lagrange function [TeX:] $$L\left(\mathbf{w}_{i, K}, \lambda_{1}, \beta\right)$$, which is defined as

[TeX:] $$L\left(\mathbf{w}_{i, K}, \lambda_{1}, \beta\right)=\left\|\mathbf{x}_{i}-\mathbf{P}_{i, K} \mathbf{w}_{i, K}\right\|_{2}^{2}+\lambda_{1}\left\|\mathbf{D}_{i, K} \mathbf{w}_{i, K}\right\|_{2}^{2}+\beta\left(\mathbf{1}^{T} \mathbf{w}_{i, K}-1\right) \qquad(7)$$
(7) can be reformulated as

[TeX:] $$L\left(\mathbf{w}_{i, K}, \lambda_{1}, \beta\right)=\mathbf{w}_{i, K}^{T} \mathbf{Z} \mathbf{w}_{i, K}+\beta\left(\mathbf{1}^{T} \mathbf{w}_{i, K}-1\right) \qquad(8)$$
where 1 is a column vector of all ones and [TeX:] $$\mathbf{Z}=\left(\mathbf{x}_{i} \mathbf{1}^{T}-\mathbf{P}_{i, K}\right)^{T}\left(\mathbf{x}_{i} \mathbf{1}^{T}-\mathbf{P}_{i, K}\right)+\lambda_{1} \mathbf{D}_{i, K}^{T} \mathbf{D}_{i, K}$$. By setting [TeX:] $$\frac{\partial}{\partial \mathbf{w}_{i, K}} L\left(\mathbf{w}_{i, K}, \lambda_{1}, \beta\right)=0$$ and [TeX:] $$\frac{\partial}{\partial \beta} L\left(\mathbf{w}_{i, K}, \lambda_{1}, \beta\right)=0$$ according to (8), we obtain the combination weight

[TeX:] $$\mathbf{w}_{i, K}=\frac{\mathbf{Z}^{-1} \mathbf{1}}{\mathbf{1}^{T} \mathbf{Z}^{-1} \mathbf{1}} \qquad(9)$$
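Under this formulation, the weight computation reduces to one small linear solve per patch. The following sketch assumes the closed form w = Z^{-1}1 / (1^T Z^{-1} 1); the default values of `lam1` and `sigma` are illustrative, not the tuned parameters of the paper:

```python
import numpy as np

def local_similarity_weights(x, P, lam1=0.2, sigma=1.0):
    """Locality-regularized combination weights for test patch x given the
    K sampled training photo patches (columns of P)."""
    K = P.shape[1]
    ones = np.ones(K)
    diff = x[:, None] - P
    # exponential locality adaptor: large for training patches far from x,
    # which drives the corresponding weights toward zero
    d = np.exp(np.sum(diff ** 2, axis=0) / sigma)
    Z = diff.T @ diff + lam1 * np.diag(d ** 2)   # D is diagonal, so D^T D = diag(d**2)
    z = np.linalg.solve(Z, ones)
    return z / (ones @ z)
```

Note that the regularization term also conditions Z, so no extra ridge term is needed here, unlike in the plain LLE solve.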
Then the sketch patch [TeX:] $$\mathbf{y}_{i}$$ can be synthesized by (3) with the weight in (9). Finally, a whole sketch Y can be assembled by averaging overlapping pixel values.
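The assembly step can be sketched as follows; each output pixel is the average of all overlapping patch estimates that cover it (the function name and arguments are illustrative):

```python
import numpy as np

def assemble_image(coords, patches, image_shape, patch_size):
    """Assemble a whole image from overlapping vectorized patches by
    averaging the pixel values in the overlapping regions."""
    acc = np.zeros(image_shape)
    cnt = np.zeros(image_shape)
    for (top, left), p in zip(coords, patches):
        acc[top:top + patch_size, left:left + patch_size] += p.reshape(patch_size, patch_size)
        cnt[top:top + patch_size, left:left + patch_size] += 1
    return acc / np.maximum(cnt, 1)  # avoid division by zero in uncovered areas
```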
Fig. 2. Nonlocal similarity in the sketch images.
3.2 Adaptive Regularization by Nonlocal Similarity
The local context constraint model exploits the local geometry of the data space. There are also many repetitive patterns throughout a sketch image, which are quite helpful for improving the quality of the final sketch [31,32], as shown in Fig. 2. Therefore, we also explore nonlocal self-similarity. Generally, for each patch [TeX:] $$\mathbf{y}_{i}$$ extracted from the sketch image Y, we search for its L most similar patches [TeX:] $$\left\{\mathbf{y}_{i}^{l}\right\}_{l=1}^{L}$$ in Y. Then the following linear relationship holds between [TeX:] $$\mathbf{y}_{i}$$ and [TeX:] $$\left\{\mathbf{y}_{i}^{l}\right\}_{l=1}^{L}$$:

[TeX:] $$\mathbf{y}_{i}=\sum_{l=1}^{L} b_{i}^{l} \mathbf{y}_{i}^{l} \qquad(10)$$
The nonlocal similarity weight [TeX:] $$b_{i}^{l}$$ in (10) is inversely proportional to the distance between the patches [TeX:] $$\mathbf{y}_{i}$$ and [TeX:] $$\mathbf{y}_{i}^{l}$$, and its value is calculated as

[TeX:] $$b_{i}^{l}=\exp \left(-\frac{\left\|\mathbf{y}_{i}-\mathbf{y}_{i}^{l}\right\|_{2}^{2}}{h}\right) \qquad(11)$$
where h is a pre-determined control factor of the weight. Let [TeX:] $$\mathbf{b}_{i}$$ be the column vector containing all the weights [TeX:] $$b_{i}^{l}$$ and [TeX:] $$\boldsymbol{\beta}_{i}$$ be the matrix whose columns are the patches [TeX:] $$\mathbf{y}_{i}^{l}$$. Then (10) can be rewritten as:

[TeX:] $$\mathbf{y}_{i}=\boldsymbol{\beta}_{i} \mathbf{b}_{i} \qquad(12)$$
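The nonlocal weight computation can be sketched as below, assuming weights that decay exponentially with patch distance and are normalized to sum to one (the default `h` is a hypothetical value):

```python
import numpy as np

def nonlocal_weights(y, candidates, h=0.5):
    """Similarity weights between sketch patch y and its L candidate
    patches (rows of candidates): closer patches get larger weights."""
    d2 = np.sum((candidates - y[None, :]) ** 2, axis=1)
    b = np.exp(-d2 / h)
    return b / b.sum()
```

In contrast to the local adaptor of Section 3.1, the exponent here is negative: a close candidate receives a weight near the maximum, while a distant one is effectively ignored.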
By incorporating the nonlocal similarity regularization term (12) into the patch aggregation, we obtain:

[TeX:] $$\mathbf{Y}=\arg \min _{\mathbf{Y}} \sum_{i=1}^{N}\left\|\mathbf{R}_{i} \mathbf{Y}-\mathbf{y}_{i}\right\|_{2}^{2}+\lambda_{2} \sum_{i=1}^{N}\left\|\mathbf{R}_{i} \mathbf{Y}-\boldsymbol{\beta}_{i} \mathbf{b}_{i}\right\|_{2}^{2} \qquad(13)$$
where [TeX:] $$\mathbf{R}_{i}$$ is the matrix that extracts the i-th patch from the image. (13) can be rewritten as

[TeX:] $$\mathbf{Y}=\arg \min _{\mathbf{Y}} \sum_{i=1}^{N}\left\|\mathbf{R}_{i} \mathbf{Y}-\mathbf{y}_{i}\right\|_{2}^{2}+\lambda_{2}\|(\mathbf{I}-\mathbf{B}) \mathbf{Y}\|_{2}^{2} \qquad(14)$$
where I is the identity matrix and [TeX:] $$\mathbf{B}(i, j)=\left\{\begin{array}{ll} {b_{i}^{l},} & {\text { if } \mathbf{y}_{i}^{l} \text { is an element of } \boldsymbol{\beta}_{i}, b_{i}^{l} \in \mathbf{b}_{i}} \\ {0,} & {\text { otherwise }} \end{array}\right.$$. Now we can easily obtain the final synthesized image:

[TeX:] $$\mathbf{Y}=\left(\sum_{i=1}^{N} \mathbf{R}_{i}^{T} \mathbf{R}_{i}+\lambda_{2}(\mathbf{I}-\mathbf{B})^{T}(\mathbf{I}-\mathbf{B})\right)^{-1} \sum_{i=1}^{N} \mathbf{R}_{i}^{T} \mathbf{y}_{i} \qquad(15)$$
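Assuming the final estimate combines a quadratic patch-data term with the nonlocal regularizer, the synthesis reduces to a single linear solve. The sketch below uses a dense solver for clarity (a sparse solver would be preferable at full image scale); `C` stands for the diagonal counting matrix (how many patches cover each pixel), `rhs` for the accumulated patch values, and all names are illustrative:

```python
import numpy as np

def nonlocal_solve(C, rhs, B, lam2=0.06):
    """Solve (diag(C) + lam2 * (I - B)^T (I - B)) Y = rhs for the
    vectorized sketch Y, where C counts how many patches cover each pixel
    and B holds the nonlocal similarity weights."""
    n = rhs.size
    I = np.eye(n)
    A = np.diag(C) + lam2 * (I - B).T @ (I - B)
    return np.linalg.solve(A, rhs)
```

A useful sanity check: when the rows of B sum to one, a constant image is left unchanged by the regularizer, so the solve reproduces the plain averaged sketch in flat regions.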
4. Experimental Results and Analysis
4.1 Database Description
We validate our method on the Chinese University of Hong Kong (CUHK) face sketch database (CUFS) [25] and the CUHK face sketch FERET database (CUFSF) [33]. The CUFS database contains three sub-datasets, i.e., the CUHK student (CUHKs) database, the AR database [34], and the XM2VTS database [35]. In the CUHKs database, 88 photo-sketch pairs constituted the training set, and the remaining 100 pairs were used for testing. In the AR database, 80 pairs were randomly selected as the training set and the rest were used as test cases. For the XM2VTS database, the training set had 100 pairs. The CUFSF database contains 1,194 photo-sketch pairs [36]; 250 pairs were randomly selected to construct the training set, and the remaining 944 pairs were used as test cases. All face images were cropped to 250 [TeX:] $$\times$$ 200 pixels.
Fig. 3. FSIM as a function of the regularization parameters [TeX:] $$\lambda_{1}$$ and [TeX:] $$\lambda_{2}$$ on different datasets: (a) CUHKs, (b) AR, (c) XM2VTS, and (d) CUFSF.
The proposed method was compared with some related methods, including the LLE [14], MWF [26], Fast-RSLCR [17], FCN [9], and BP-GAN [12]. The feature similarity index metric (FSIM) [37] was adopted as the evaluation criterion to estimate the quality of final synthesized sketches.
4.2 Discussion on the Parameters
4.2.1 The influence of different regularization parameters
Our algorithm has two free regularization parameters, [TeX:] $$\lambda_{1}$$ and [TeX:] $$\lambda_{2}$$, which balance the contributions of the regularization terms. A set of parametric experiments was performed to validate the effectiveness of the proposed regularization terms. We carefully tuned the local similarity parameter [TeX:] $$\lambda_{1}$$ (from 0 to 0.3 with a step size of 0.02) and the nonlocal similarity parameter [TeX:] $$\lambda_{2}$$ (from 0 to 0.1 with a step size of 0.01). Fig. 3 shows the resulting FSIM surfaces. It can be clearly observed that the synthesis performance is stable in terms of FSIM for [TeX:] $$\lambda_{1} \in[0.18,0.26]$$ and [TeX:] $$\lambda_{2} \in[0.05,0.07]$$.
Fig. 4. FSIM scores on the different databases with different numbers of nearest patches: (a) CUHKs, (b) AR, (c) XM2VTS, and (d) CUFSF.
4.2.2 The influence of nearest neighbor number
The performance of the proposed method is correlated with the number of nearest neighbors K. We conducted experiments on the four databases mentioned above by changing the value of K. The curves of the FSIM values plotted against the number of training patches are shown in Fig. 4. When K equals the number of training photos, our proposed method does not obtain the best performance. As shown in Fig. 4, the FSIM values increase steadily as the number of nearest neighbors increases. After the nearest patch number reaches a suitable value, the performance of the proposed method remains nearly constant, and it then shows a descending trend as K increases further (for values of K larger than 80% of the number of training photos). In view of this, to achieve optimal or nearly optimal performance, we recommend setting K to 80% of the number of training photos.
Fig. 5. Synthesized sketches on different databases by LLE, MWF, FCN, BP-GAN, Fast-RSLCR, and the proposed method, respectively.
4.3 Face Sketch Synthesis
Fig. 5 presents some sketches synthesized by the different methods on the abovementioned databases. Generally speaking, the proposed method generates much more detail than the other five popular methods. On the AR database, we note that the LLE, MWF, and Fast-RSLCR methods produce very smooth sketches (the first row in Fig. 5); for example, they fail to generate some details, such as hair. We then compared the synthesized sketches on the CUHKs database. As shown in the remaining results in Fig. 5, the proposed method achieves much better synthesis performance than the other patch-based methods, and textures (e.g., hair regions) are synthesized successfully. FCN can produce some details, but its results show some distortions. The results of BP-GAN look very good, but some details are also missing, such as the hair of the first and second persons. As shown in the results, our approach predicts unusual features well, whereas the comparison methods tend to smooth these regions. This illustrates the robustness and effectiveness of the proposed method.
To investigate the robustness of the proposed method against complex illumination, we compared the synthesis results of the different methods on the CUFSF database, as shown in the last two rows of Fig. 5. Our proposed method generates competitive results with more facial detail. Table 1 presents the average FSIM comparisons of the different methods.
Table 1. Average FSIM scores of the different methods on the different databases
Table 2. Time consumption (in seconds) on the different databases
4.4 Time Consumption
To compare the time cost of the proposed face sketch synthesis method, we list the runtimes of our algorithm and the five competitors on the different databases in Table 2. It can be seen that the FCN method has the fastest computation time, at less than 0.1 seconds. The BP-GAN method has a long processing time due to its neighbor selection process. Our proposed method runs more slowly than the Fast-RSLCR method. Overall, the proposed method achieves the best synthesis results with moderate time consumption among the compared algorithms.
5. Conclusion and Future Work
In this paper, we presented a novel face sketch synthesis method using two regularization terms. By incorporating a local similarity regularization term into the neighbor selection, we selected the most relevant patch samples to reconstruct face sketch versions of the input photos, thus generating discriminant face sketches with detailed features. A global nonlocal similarity regularization term was employed to further maintain primitive facial features. The results of thorough experimental testing on public databases demonstrated the superiority of the proposed method over other methods.
Compared with traditional synthesis methods, our novel generative approach retained more detailed information from the photos. However, our inference time was dependent on the amount of training data. Thus, we could incorporate priors into the deep learning method to improve performance and speed up the processing in the future.
Acknowledgement
This paper is supported in part by the National Natural Science Foundation of China (No. 61702269, 61971233, and 61671339), in part by the Natural Science Foundation of Jiangsu Province (No. BK20171074), and the Fundamental Research Funds for the Central Universities at Nanjing Forest Police College (No. LGZD201702).