Abdelhadi Lotfi* and Abdelkader Benyettou**

Cross-Validation Probabilistic Neural Network Based Face Identification

Abstract: In this paper, a cross-validation algorithm for training probabilistic neural networks (PNNs) is presented and applied to automatic face identification. Standard PNNs perform well on small and medium-sized databases, but they suffer from serious problems when used with large databases such as those encountered in biometric applications. To address this issue, we propose a new training algorithm for PNNs that reduces the size of the hidden layer while avoiding over-fitting. The proposed training algorithm generates networks with a smaller hidden layer that contains only representative examples from the training data set. Moreover, adding new classes or samples after training does not require retraining, which is one of the main characteristics of this solution. The results presented in this work show a great improvement in both the processing speed and the generalization of the proposed classifier, obtained mainly by reducing the size of the hidden layer significantly.

Keywords: Biometrics, Classification, Cross-Validation, Face Identification, Optimization, Probabilistic Neural Networks

1. Introduction

For a pattern recognition system in general, and for a biometric identification system in particular, the choice of the classification method used in the decision phase is, together with the feature extraction strategy, a determining factor in the effectiveness of the proposed solution. The user must therefore select a decision strategy that meets his or her needs. The available methods include probabilistic methods, Bayesian methods and neural methods. Each method has advantages and disadvantages, so choosing the appropriate strategy is not always obvious.

Recent progress in pattern recognition has brought problems with very large training databases. Databases containing thousands, and sometimes millions, of samples have become a common reality for applications such as biometrics, information retrieval, the semantic web, etc. For such applications, the choice of the most appropriate method is essential for a robust solution. The scientific community has therefore recently reconsidered many older techniques to make them meet these new requirements. Unfortunately, many methods have been left out of this refinement and need to be re-explored in more depth.

Since their creation by Specht [1], probabilistic neural networks (PNNs) have been used extensively to solve many problems related to classification, mapping and associative memory. Many papers in the literature refer to PNNs, especially for classification applications. They have been used in fields such as automatic email security enhancement [2], intrusion detection in computer networks [3], water quality assessment [4], illness diagnosis [4], antibiotic resistance detection [4], biometric applications [5-7], handwriting recognition [8], and military applications (target detection) [9, 10]. The main reason behind their popularity is that, unlike multi-layer perceptron (MLP) networks, these classifiers do not suffer from the problem of local minima frequently encountered with other neural networks [11].
In fact, in PNNs the global decision function is a sum of many local decision functions centered on the training samples, which makes local minima very unlikely [12]. Moreover, these networks have a stronger mathematical foundation than most other neural networks [12]. Unfortunately, PNNs cannot be used effectively to classify databases with a large number of training samples, because the number of hidden neurons is equal to the number of training samples. When such a situation arises, another technique is usually applied beforehand to reduce the number of training samples. This does not guarantee that the retained samples are good for the PNN, since the reduction does not take the PNN parameters into account. As a result, PNNs are generally used for small benchmarking databases, while for larger real-world databases other types of classifiers, such as MLPs, are preferred [13]. Despite this major drawback, the PNN has several important features: the probabilistic nature of its outputs, the possibility of learning in small chunks without retraining, parallelism in the hidden layer (for separate classes), robustness against the problem of local minima, multi-dimensional interpolation, etc.

A lot of work has been done to make PNNs more efficient for particular problems by changing either the architecture of the network or the training algorithm. Most of the time, these changes affect the network seriously, making it lose some of its major advantages and adding complexity to its implementation. In response to this concern, a learning vector quantization (LVQ) training algorithm for PNNs was proposed in [14]. De Falco et al. [15] proposed a solution that uses a Bayesian model to estimate the parameters and particle swarm optimization to determine the weights of the outputs of a PNN; this solution requires a very expensive computation to estimate the network parameters from the training samples. In [16], the authors proposed a PNN that uses fuzzy logic to measure the significance of each neuron in the hidden layer by assigning it an output weight towards the summation layer. Other propositions include Gram-Charlier PNNs, generalized PNNs [17], and rotated-kernel PNNs [18]. All of these solutions were proposed to solve a particular problem, and they modify the architecture of the network, which makes it lose its major advantages.

In this paper, we propose a new training algorithm for PNNs that does not change the architecture at all. It uses the principle of cross-validation to reduce the number of hidden neurons of the PNN and hence speed up the response of the network in the testing step. Our training algorithm is, of course, slower in training than the standard algorithm: the cross-validation PNN (CVPNN) has an algorithmic complexity of O(n^2), while the standard PNN's complexity is O(n), n being the number of training examples. The proposed solution was tested on a variety of benchmarking databases as well as on some biometric applications. The performances of the CVPNN and the standard PNN were compared in terms of hidden-layer size, processing speed and generalization to show the difference between the new and the old training algorithms.

2. Cross-Validation PNNs

2.1 Basics

In a classification context, suppose we have a training data set of examples belonging to m different classes and we want to classify a new vector x of dimension d.
According to Bayes' rule, the probability that a sample x belongs to class Ci is given by:
$$P(C_i \mid x) = \frac{P(x \mid C_i)\,P(C_i)}{\sum_{j=1}^{m} P(x \mid C_j)\,P(C_j)} \qquad (1)$$

The problem here is that the class-conditional probabilities P(x|Ci) are unknown, since the probability densities of the patterns in the categories to be separated are unknown [1]. The Parzen windowing technique provides a way to estimate these probability density functions (pdfs) from the training examples. The formula used in this paper for the estimation of the pdfs is:
$$g_i(x) = \frac{1}{N_i\,(2\pi)^{d/2}\sigma^d} \sum_{j=1}^{N_i} \exp\left(-\frac{\left\|x - x_{ij}\right\|^2}{2\sigma^2}\right) \qquad (2)$$

where gi(x) is the global decision function for class i, xij is sample j belonging to class i, Ni is the number of training samples in class i, and σ is a smoothing parameter. Example x is classified as belonging to class i if:

$$g_i(x) > g_j(x), \quad \forall j \neq i \qquad (3)$$
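As a concrete illustration of Eqs. (2) and (3), the following NumPy sketch estimates the per-class Parzen densities and applies the winner-take-all decision rule. It is only a minimal sketch, not the authors' implementation; the function names and the dictionary-based class representation are illustrative choices.

```python
import numpy as np

def parzen_density(x, class_samples, sigma):
    """Parzen-window estimate g_i(x) of Eq. (2) for one class.

    class_samples: array of shape (N_i, d) with the training vectors
    of class i; sigma is the smoothing parameter.
    """
    n_i, d = class_samples.shape
    sq_dist = np.sum((class_samples - x) ** 2, axis=1)
    kernels = np.exp(-sq_dist / (2.0 * sigma ** 2))
    norm = n_i * (2.0 * np.pi) ** (d / 2) * sigma ** d
    return kernels.sum() / norm

def classify(x, classes, sigma):
    """Decision rule of Eq. (3): choose the class with the largest g_i(x).

    classes: dict mapping each class label to its (N_i, d) sample matrix.
    """
    scores = {label: parzen_density(x, samples, sigma)
              for label, samples in classes.items()}
    return max(scores, key=scores.get)
```

A larger sigma smooths the decision surface, while a very small sigma lets each training sample dominate its own neighbourhood; the choice of sigma therefore directly affects the generalization behaviour discussed later.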
When we apply this principle to neural networks, the result is a multi-layer network with four layers (Fig. 1). The input layer has as many neurons as there are features in the input space (a d-dimensional space), and all of its neurons are connected to all neurons in the hidden layer. The pattern (hidden) layer has one neuron for each example in the training set. All hidden neurons belonging to a given class are connected to the neuron of the summation layer that represents this class. The hidden layer computes its outputs using an exponential activation function:
$$\Phi_j^i(x) = \frac{1}{(2\pi)^{d/2}\sigma^d} \exp\left(-\frac{(x_j^i - x)^t (x_j^i - x)}{2\sigma^2}\right) \qquad (4)$$

where x_j^i is the jth example from the ith class. The summation (class) layer contains m neurons, each representing a separate class. The activation of neuron i (class i) is calculated using:
$$P(C_i \mid x) = \frac{1}{N_i (2\pi)^{d/2}\sigma^d} \sum_{j=1}^{N_i} \exp\left(-\frac{(x_j^i - x)^t (x_j^i - x)}{2\sigma^2}\right) \qquad (5)$$

where Ni is the total number of examples in class i. The output layer determines the winning class using:
$$\hat{C}(x) = \underset{j=1,2,\ldots,m}{\arg\max}\; P(C_j \mid x) \qquad (6)$$

or, more formally:
$$\hat{C}(x) = \underset{i=1,2,\ldots,m}{\arg\max} \left\{ \frac{1}{N_i (2\pi)^{d/2}\sigma^d} \sum_{j=1}^{N_i} \exp\left(-\frac{(x_j^i - x)^t (x_j^i - x)}{2\sigma^2}\right) \right\} \qquad (7)$$

2.2 Training Algorithm

The training algorithm presented here is inspired by an earlier work in which we presented a network called the reduced PNN (RPNN) for the classification of large databases [12]. The CVPNN training algorithm is much faster and generates fewer hidden neurons. The former algorithm starts from a standard PNN, whose hidden layer is as large as the training set, and iteratively removes neurons after measuring their contribution to the final decision of the network, which takes a long time. The idea here is to start instead with an empty PNN (no neurons in the hidden layer) and to use a cross-validation procedure while constructing the hidden layer, so that only representative examples of the training set are selected. The training examples are presented one at a time, and each training vector is tested with the current network before the corresponding hidden neuron is created. If the new neuron would not add significant value to the generalization of the network, it is simply not created. This test-and-train process is repeated over the whole training data set (Fig. 2).

We start with an empty PNN and add neurons to the hidden layer only when necessary. Whether a new neuron is added depends on how accurately the current network classifies the corresponding example. Three cases are possible:
• If the network is unable to classify the current example correctly, the example is added to the hidden layer as a new neuron.
• If the classification is correct but with low confidence, the positions of the existing neurons are adjusted according to Fig. 3.
• If the classification is correct with high confidence, the current PNN is kept unchanged.

The major drawback of this new training algorithm is the training time, which can be very long compared with the standard training algorithm. However, training can be done class by class at separate points in time without retraining, which means that new examples and/or classes can be added whenever needed; this is one of the major advantages of this solution. The result of this training is a classifier that has a smaller hidden layer (fewer neurons), is more robust and resistant to over-fitting, and has better generalization (better classification rates). The reduced number of hidden neurons makes the testing (response time) of the network faster in proportion to the reduction ratio: fewer hidden neurons require less time to compute the activations of the hidden and summation layers and therefore produce a quicker response. The main idea of the training algorithm is further illustrated by the sketch given at the end of this section.

2.3 Testing Algorithm

The testing algorithm is similar to the one used in standard PNNs, and the same procedure is also used inside the training step (it appears in the sketch below).
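To make the test-and-train procedure of Sections 2.2 and 2.3 more concrete, the sketch below shows one possible incremental implementation in Python. It is not the authors' pseudo-code: the class name CVPNN, the confidence threshold, the adjustment step and the nearest-neuron update used in place of the rule of Fig. 3 are assumptions introduced for illustration only.

```python
import numpy as np

class CVPNN:
    """Sketch of a cross-validation PNN built incrementally (test-and-train)."""

    def __init__(self, sigma=0.5, confidence=0.6, step=0.1):
        self.sigma = sigma            # smoothing parameter of Eq. (4)
        self.confidence = confidence  # assumed "high confidence" threshold
        self.step = step              # assumed adjustment step (stand-in for Fig. 3)
        self.centers = []             # retained hidden-layer neurons
        self.labels = []              # class label of each hidden neuron

    def _posteriors(self, x):
        """Summation-layer outputs (Eq. 5), normalized over the known classes."""
        x = np.asarray(x, dtype=float)
        scores = {}
        for c in set(self.labels):
            idx = [i for i, l in enumerate(self.labels) if l == c]
            centers = np.asarray([self.centers[i] for i in idx])
            d = centers.shape[1]
            sq = np.sum((centers - x) ** 2, axis=1)
            k = np.exp(-sq / (2.0 * self.sigma ** 2)).sum()
            scores[c] = k / (len(idx) * (2.0 * np.pi) ** (d / 2) * self.sigma ** d)
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()} if total > 0 else scores

    def predict(self, x):
        """Output layer (Eq. 7): return the winning class, or None if empty."""
        post = self._posteriors(x)
        return max(post, key=post.get) if post else None

    def partial_fit(self, X, y):
        """Present examples one at a time and keep only the useful ones."""
        for x, label in zip(X, y):
            x = np.asarray(x, dtype=float)
            post = self._posteriors(x)
            if not post or label not in post:
                # class not represented yet: always keep the example
                self.centers.append(x.copy())
                self.labels.append(label)
                continue
            predicted = max(post, key=post.get)
            if predicted != label:
                # misclassified: the example is representative, add a neuron
                self.centers.append(x.copy())
                self.labels.append(label)
            elif post[label] < self.confidence:
                # correct but with low confidence: nudge the nearest neuron
                # of this class towards the example (rough stand-in for Fig. 3)
                i = min((i for i, l in enumerate(self.labels) if l == label),
                        key=lambda i: np.sum((self.centers[i] - x) ** 2))
                self.centers[i] += self.step * (x - self.centers[i])
            # correct with high confidence: the network is left unchanged
```

Because partial_fit only appends or adjusts neurons, new classes or samples can be presented at a later time without retraining from scratch, which mirrors the incremental-learning property claimed above.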
In the following section, the CVPNN is first used for the classification of simple data points in a two-dimensional space to show its performance compared with a standard PNN. After that, it is used for face identification on some well-known databases.

3. Experiments

3.1 Test on a Two-Dimensional Space

To show the difference between the CVPNN and the standard PNN, we present the classification of two-dimensional data points into three different classes, each represented in the graph by a different color. There are 137 training examples over all classes (Fig. 4).

3.2 The ORL Faces Database

The Olivetti Research Laboratory (ORL) face database was collected by an AT&T laboratory based in Cambridge. The database contains images of 40 subjects, each registered under 10 different views (Fig. 5). For some subjects, the images were collected at different times, with variations in lighting conditions, facial expressions (neutral expression, smile, eyes closed) and facial details (with/without glasses). All images were taken against a dark background. Each image is a 112×92-pixel gray-scale picture stored in PGM format. The head poses show some variation in depth with respect to the frontal pose, but these variations concern only certain individuals and are not systematic.

3.2.1 Results

For feature extraction (and dimensionality reduction), two algorithms were used, namely principal component analysis (PCA) and kernel principal component analysis (KPCA) [19]. After feature extraction, the two training algorithms (PNN and CVPNN) were used to construct the networks. The ORL database samples were divided into:
• a training set containing 280 samples chosen randomly from the overall database;
• a test set containing 120 samples, totally different from those used for training.
The results of these experiments are presented in Table 1, and a comparison between the two networks is shown in Fig. 6.

Table 1.
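As a rough illustration of the experimental pipeline of this section (feature extraction followed by training and evaluation), the sketch below combines scikit-learn's PCA or KernelPCA with the CVPNN sketch given earlier. The number of components, the value of sigma and the helper name run_experiment are illustrative assumptions and are not taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

def run_experiment(X_train, y_train, X_test, y_test, extractor, sigma=0.5):
    """Fit the feature extractor on the training images, train a CVPNN
    on the projected features and report accuracy and hidden-layer size."""
    F_train = extractor.fit_transform(X_train)
    F_test = extractor.transform(X_test)

    net = CVPNN(sigma=sigma)        # CVPNN sketch from Section 2
    net.partial_fit(F_train, y_train)

    predictions = [net.predict(f) for f in F_test]
    accuracy = float(np.mean([p == t for p, t in zip(predictions, y_test)]))
    return accuracy, len(net.centers)

# Hypothetical usage for the two feature extractors compared in Table 1:
# acc_pca, size_pca = run_experiment(X_tr, y_tr, X_te, y_te, PCA(n_components=60))
# acc_kpca, size_kpca = run_experiment(X_tr, y_tr, X_te, y_te,
#                                      KernelPCA(n_components=60, kernel="rbf"))
```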
3.3 The GT Faces Database

The Georgia Tech (GT) face database contains images of 50 subjects taken in two or three sessions at the Center for Signal and Image Processing of the Georgia Institute of Technology. All subjects are represented by 15 color JPEG images with a cluttered background and a resolution of 640×480 pixels. The average size of the faces in these images is 150×150 pixels. The pictures show frontal and/or tilted faces with different facial expressions, lighting conditions and scales. Each image is manually labeled to determine the position of the face within it. Fig. 7 shows a sample of the GT database.

3.3.1 Results

Linear discriminant analysis (LDA), PCA and KPCA were used to extract features from the images of the GT database. As in the previous experiment, the two training algorithms (PNN and CVPNN) were used to construct three networks each. The GT database samples were divided into:
• a training set containing 500 examples chosen randomly from the overall database;
• a test set containing the 250 remaining examples.
Table 2 shows the results of this study, and Fig. 8 compares both algorithms in terms of classification rate, processing speed and hidden-layer size.

4. Discussion

From Fig. 4, we can observe that the CVPNN produces almost the same separation areas as the standard PNN for all three classes. However, the number of neurons in the hidden layer of the CVPNN (21 neurons) is much lower than in the standard PNN (137, i.e., the number of training samples). The size of the hidden layer is thus reduced to only 15% of the training samples, and the time needed to classify all the points of the plane (100×100 = 10,000) is reduced accordingly. The CVPNN clearly outperforms the standard PNN on these data points, both in hidden-layer size and in classification speed.

According to the results in Table 1 and Fig. 6, the classification rates of the standard PNN and the CVPNN (proposed algorithm) are similar for both PCA and KPCA. However, the number of hidden neurons is much lower for the CVPNN (130 and 110 for PCA and KPCA, respectively, against 280 hidden neurons for the standard PNN). The gain in execution time is about 60%, which is proportional to the reduction in the number of hidden neurons.

From Table 2 and Fig. 8, we observe that despite its reduced number of neurons, the CVPNN gives better performance than the standard PNN. Here the gain is twofold: in generalization and in classification speed. Table 2 shows, for example, that with LDA features the CVPNN reaches a 95% classification rate with only 40% of the hidden neurons. This is clearly a network with better generalization than the standard PNN.

These experiments show that, by using the proposed training algorithm, one can easily gain in speed while keeping almost the same classification rate. Using fewer neurons in the hidden layer while maintaining the same classification rate means that the new classifier has a higher generalization ability than standard PNNs. This characteristic becomes more apparent for larger training databases: the larger the training set, the greater the gain. The problem of over-fitting is thus reduced significantly for CVPNNs.

5. Conclusions

The results presented in this paper show that the proposed CVPNN outperforms standard PNNs in all the cases treated in this work.
Significant improvements were obtained not only in the size of the network and its processing speed but also in its overall generalization. The proposed solution can be used in all cases where the database is large and where standard PNNs run into serious problems because of their huge hidden layer. It has also been shown that the network performs well for very small databases, at the cost of a more complex training algorithm.

In future work, the response of the network can be made even faster by using a parallel testing algorithm instead of a single-threaded one. Parallelism is possible because the connections to the summation layer are independent for each class; each neuron in the class (summation) layer can therefore compute its activation in a separate, dedicated execution thread. As a result, the processing time will depend greatly on the number of processors in the machine.

Biography

Abdelhadi Lotfi
https://orcid.org/0000-0001-8229-0539
He is an Associate Professor at the National Institute of Telecommunications and Information and Communication Technology of Oran (INTTIC), Algeria. He is also a member of the LaRATIC research laboratory at the INTTIC institute. He received his Ph.D. from the USTO-MB University and now works mainly on pattern recognition and intelligent systems.

References