1. Introduction
Facial expressions (FEs) provide a nonverbal channel for conveying genuine emotions and intentions, serving as a vital means of exchanging emotional information and facilitating interpersonal dynamics [1]. In computer vision, the analysis of human FEs enables the comprehension of human emotions and their integration into a wide array of human–computer interaction systems, spanning service robots, driver fatigue detection, and medical services [2]. In social interaction, complex facial movements and expressions have evolved to convey inner emotions; however, research has focused predominantly on the six fundamental emotional categories proposed by psychologists Ekman and Friesen: happiness, anger, sadness, surprise, disgust, and fear [3]. According to [4], in day-to-day human interaction, as much as 55% of the conveyed information is transmitted via FEs. This highlights the considerable research value and significance of facial expression recognition (FER).
Traditional FER methods predominantly rely on shallow learning and manually engineered features. These techniques include principal component analysis (PCA) [5], local binary patterns (LBP) [6], Gabor transformation [7], geometric feature-based extraction [8], and hybrid feature extraction [9]. Nonetheless, these FER approaches are constrained by their reliance on prior knowledge, limited generalization capabilities, and inability to meet the precision and efficiency demands of real-world applications.
With rapid advancements in deep learning (DL) technology, studies on FER using deep neural network models have made significant progress. Convolutional neural networks (CNNs) have gained popularity for image recognition and classification [10]. Researchers have undertaken multifaceted efforts to enhance the accuracy of CNNs for expression recognition. Mollahosseini et al. [11] constructed a 7-layer CNN, initially pretrained on an extensive face dataset and then fine-tuned on an FE dataset. Their innovative use of the inception layer architecture across multiple datasets for FER yielded superior results compared with traditional methods. However, the limited data in the FE datasets led to overfitting.
Ding et al. [12] introduced the FaceNet2ExpNet approach, employing deep features from the face network to supervise the training of the convolutional layers, after which a randomly initialized fully connected layer was appended and trained from scratch. Ng et al. [13] adapted a CNN model pretrained on the ImageNet dataset, adjusted it using the FER2013 dataset, and fine-tuned the modified model on the EmotiW dataset. They assessed the performance of the model for FER in real-world scenarios; however, the recognition accuracy was not optimal. Verma et al. [14] employed diverse subnetworks to extract rich features and efficiently combined them using an appropriate ensemble technique. This approach comprehensively considered the changes in facial features caused by significant facial movements and performed well on the CK+ dataset. Liu et al. [15] adopted a strategy involving three parallel multichannel CNNs to learn global and local features from distinct facial regions, and implemented a joint embedding feature learning strategy to explore identity- and pose-invariant expression representations based on fused regions in the embedding space. However, this method does not achieve precise FER accuracy in unconstrained environments.
In recent years, researchers have introduced attention mechanisms into CNNs [16]. By learning and adaptively assigning different weight coefficients to different regions of the feature maps (FMs), the network can obtain more expressive features, which enhances the efficiency and accuracy of FER. Hu et al. [17] presented the squeeze-and-excitation network (SENet) to capture the channel dependencies of the features, which significantly improved the performance of the CNN model. Woo et al. [18] introduced the convolutional block attention module (CBAM), in which feature attention operations are performed simultaneously in the spatial and channel dimensions, and good recognition results were obtained.
In recent years, the rapid development of image dehazing technology has had a profound impact on various computer vision domains, including FER [19]. In FER, image quality significantly affects algorithm performance. In particular, in practical applications such as security monitoring, facial recognition, and emotion analysis, image quality can be compromised by atmospheric conditions and adverse weather, making it challenging to accurately capture FEs [20].
Recent studies have emphasized the importance of image dehazing technology in FER. By applying state-of-the-art image dehazing algorithms, researchers can enhance the image clarity and visibility, improving the accuracy of expression recognition algorithms. This is particularly crucial for capturing expressions in low-light conditions or for conducting real-time facial analyses in outdoor environments. Furthermore, image dehazing technology can assist in reducing noise and enhancing image quality, thereby facilitating the precise capture of facial features and emotions [21]. Consequently, the application of image dehazing technology in FER has become a topic of significant interest, offering new opportunities to enhance the practicality and performance of FER systems. This trend will further drive interdisciplinary research on image dehazing and FER, aiming for a clearer and more accurate FER.
This study delves into the integration of contextual information and multiple attention mechanisms within the VGGNet16 network. In our proposed FER framework, the enhanced VGGNet serves as a backbone network for feature extraction. We introduce a multiscale feature merging strategy to combine FMs from different levels, thereby enhancing the utilization of lower-level features and achieving precise recognition performance. The main contributions of our study are as follows:
· We employed an improved VGGNet16 as the backbone network for feature extraction. In each backbone block, we implemented an enhanced group convolutional channel attention (GCCA) module to steer the network's focus toward critical areas while suppressing irrelevant ones.
· Five backbone blocks were used to extract multiple features of varying sizes in different layers. The lower-level blocks capture high-resolution edge features with limited semantics. A partial decoder (PD) was introduced at the end of the backbone to aggregate all of the high-level block features and generate a global map. This map guides progressive learning through reverse attention (RA) modules, enabling the network to learn more nuanced expression details.
The remainder of this paper is organized as follows. Section 2 describes the proposed FER framework. Section 3 presents the performance evaluation and comparison of the results obtained for two public datasets. Finally, Section 4 provides a comprehensive summary of the study, highlighting the limitations of the proposed approach and outlining future research directions.
2. Proposed FER Framework
2.1 Network Structure
Fig. 1 illustrates the overall framework of this study. The FE images first undergo processing in the initial two low-level blocks to extract high-resolution, low-level features with limited semantics. To bolster the extraction of boundary features, these features pass through a convolutional layer with a single kernel, thereby enhancing the edge mapping accuracy. The low-level feature $C_2$ is subsequently directed to the last three high-level blocks within the backbone network. To extract the partial features, a PD module is incorporated at the end of the backbone. The PD module consolidates all of the high-level features from these blocks, producing a comprehensive global map labeled as $P_d$. Furthermore, an RA module is placed after each high-level block. Each high-level block generates features at different scales and sequentially combines them with features from various blocks originating from the preliminary global map. These outputs serve as inputs to the RA module, enabling the extraction of finer expression details. Finally, multiple features are fused to generate the ultimate FER outcome.
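The wiring of this pipeline can be summarized in the PyTorch-style sketch below. It assumes that the GCCA-augmented backbone blocks, edge convolution, PD, and RA modules described in the following subsections are supplied as sub-modules; the class name, the argument names, and the pooling-and-concatenation fusion head are illustrative assumptions rather than the exact implementation of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FERFramework(nn.Module):
    """Illustrative wiring of Fig. 1; sub-modules are injected, and the fusion
    head is a simplified assumption, not the paper's exact layer layout."""

    def __init__(self, blocks, edge_conv, partial_decoder, ra_modules, classifier):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)          # five GCCA-augmented backbone blocks
        self.edge_conv = edge_conv                   # single-kernel conv for the edge map
        self.partial_decoder = partial_decoder       # aggregates C3, C4, C5 into P_d
        self.ra_modules = nn.ModuleList(ra_modules)  # one RA module per high-level block
        self.classifier = classifier                 # maps the fused descriptors to classes

    def forward(self, x):
        c1 = self.blocks[0](x)
        c2 = self.blocks[1](c1)                      # low-level, high-resolution features
        edge = self.edge_conv(c2)                    # boundary-enhancing side output
        c3 = self.blocks[2](c2)
        c4 = self.blocks[3](c3)
        c5 = self.blocks[4](c4)                      # high-level, semantic features
        p_d = self.partial_decoder(c3, c4, c5)       # preliminary global map
        maps, pred = [p_d, edge], p_d
        for ra, c in zip(self.ra_modules, (c5, c4, c3)):
            pred = ra(c, pred)                       # refine progressively, top to bottom
            maps.append(pred)
        fused = torch.cat([F.adaptive_avg_pool2d(m, 1).flatten(1) for m in maps], dim=1)
        return self.classifier(fused)                # final FER prediction
```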
2.2 Backbone Network
In our proposed FER framework, we opted for VGGNet16 as the backbone for feature extraction, with certain modifications to improve the stability of the FER tasks. A visual representation of the modified backbone is shown in Fig. 2. To streamline the model and safeguard against the loss of fine-grained FE details in high-level features, we omitted the final pooling layer and the fully connected layer of VGGNet16. In addition, we integrated a GCCA module into each block of the backbone network. This module encourages the network to concentrate on target areas, thereby enhancing the overall model performance.
Fig. 1. Structure of the proposed framework. The improved VGGNet16 is used as the backbone network, and feature fusion is employed to extract multiple features (shallow features from low-level blocks in the backbone network and the aggregation of deep internal details from high-level blocks using a PD module). Through the RA mechanism, the currently predicted area is erased from the high-level side-output features, guiding the entire network to progressively explore supplementary fine details from top to bottom.
Fig. 2. Structure of the backbone network. To steer the network's focus toward target regions, a GCCA module is introduced in every block of the backbone network. To empower the network with the capability to learn multiscale features, lateral output components are added after the last four blocks of the backbone network to provide feature information of varying scales.
To equip the network with multiscale feature learning capabilities, we introduced a lateral output component after each of the last four blocks of the backbone network, offering features of varying scales, as sketched below. As depicted in Fig. 2, the lateral output branch of low-level block 2 was employed to extract low-level features with limited semantics, whereas the lateral output branches of high-level blocks 3, 4, and 5 were dedicated to extracting high-level features with strong semantic content. The ultimate result was derived by fusing the high-level and low-level features.
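As a rough illustration of this backbone layout, the following sketch splits torchvision's VGG16 convolutional stages into five blocks, drops the final max-pooling layer (the fully connected head is simply not used), attaches a GCCA module to each block, and adds 1×1 lateral convolutions to blocks 2-5. The `gcca_cls` argument, the lateral channel width, and the use of ImageNet weights via torchvision >= 0.13 are assumptions.

```python
import torch.nn as nn
from torchvision.models import vgg16


def build_backbone(gcca_cls, lateral_channels=64):
    """Split VGG16 into five blocks, drop the last pool/FC layers, and attach
    a GCCA module to each block plus 1x1 lateral convs to blocks 2-5.
    `gcca_cls` and `lateral_channels` are illustrative assumptions."""
    feats = vgg16(weights="IMAGENET1K_V1").features   # torchvision >= 0.13 weights API
    cuts = [(0, 5), (5, 10), (10, 17), (17, 24), (24, 30)]  # last slice omits the final max-pool
    out_ch = [64, 128, 256, 512, 512]                 # standard VGG16 channel widths
    blocks = nn.ModuleList(
        nn.Sequential(*feats[a:b], gcca_cls(c)) for (a, b), c in zip(cuts, out_ch)
    )
    # 1x1 lateral convs on blocks 2-5 give the multiscale side outputs
    laterals = nn.ModuleList(nn.Conv2d(c, lateral_channels, kernel_size=1) for c in out_ch[1:])
    return blocks, laterals
```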
2.3 Group Convolutional Channel Attention Module
Within a CNN, the distinct feature channels exhibit varying responses. If each channel is assigned an equal weight, the significance of the individual channels in feature extraction remains inadequately addressed. To maximize the utility of each feature channel, we introduced a channel attention module (i.e., the GCCA module) based on the group convolution concept [22]. The structure of this module is illustrated in Fig. 3.
Fig. 3. GCCA module. While SENet employs only average pooling, the proposed module extends it by incorporating max pooling and random pooling. This triad of pooling methods collectively yields a more comprehensive extraction of global features from the various channels. Following these three distinct pooling processes, the FM is transformed into three channel descriptors matching the channel dimensionality of the input FM.
Let the input FM be $F \in R^{H \times W \times C}$, which is compressed into three channel descriptors, $F_{avg}^C$, $F_{max}^C$, and $F_{rdm}^C$, each of dimension $1 \times 1 \times C$, by the aforementioned pooling processes. To further learn the correlation between the channels, a group convolution operation was introduced. First, the three channel descriptors were grouped by channel, and the global information of the same channel was spliced together to form a new feature vector; each new feature vector thus contained the three types of global information. These feature vectors were then convolved by a grouped convolutional layer with 1×1 kernels so that the three types of global information were adaptively fused into a single channel descriptor of dimension $1 \times 1 \times C$. This channel descriptor was subsequently passed through two convolutional layers with 1×1 kernels for feature learning: the first layer has $C/16$ channels, and the second has $C$ channels to learn the weight coefficients of the different channels:
$$M_c(F) = \sigma\bigl(W_2\,\delta\bigl(W_1\,G\bigl(F_{avg}^C, F_{max}^C, F_{rdm}^C\bigr)\bigr)\bigr)$$
where $\delta$ and $\sigma$ are the ReLU and sigmoid functions, $G$ is the group convolution operation, and $W_1$ and $W_2$ are the convolution parameters of the first and second convolutional layers, respectively. $F_{avg}^C$, $F_{max}^C$, and $F_{rdm}^C$ are the descriptors produced by the average, max, and random pooling operations, respectively.
The sigmoid function limits the value of each element to the interval [0, 1], so directly multiplying the attention coefficients by the input FM weakens the output response of the FM. Hence, the input FM was first weighted by the attention coefficients through a point multiplication operation, strengthening the effective features and restraining the redundant ones, and the input FM was then added to the weighted FM to prevent the weakening of the output response and to strengthen the stability of the model. The ultimate output of the GCCA module is expressed as:
$$F^{\prime} = F + F \otimes M_c(F)$$
where $F^{\prime}$ is the FM output by the GCCA module, $M_c(F)$ is the attention weight coefficient, and $\otimes$ denotes element-wise (point) multiplication.
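A minimal PyTorch sketch of the GCCA module as described above is given below. The random-pooling step is approximated by sampling one spatial position per channel, which is an assumption, since the paper does not spell out its exact form.

```python
import torch
import torch.nn as nn


class GCCA(nn.Module):
    """Group convolutional channel attention (sketch of Section 2.3;
    the random-pooling variant is a simplified stand-in)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # grouped 1x1 conv fuses the three descriptors of each channel
        self.group_fuse = nn.Conv2d(3 * channels, channels, kernel_size=1, groups=channels)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.view(b, c, -1)
        f_avg = flat.mean(dim=2)
        f_max = flat.max(dim=2).values
        idx = torch.randint(0, h * w, (b, c, 1), device=x.device)
        f_rdm = flat.gather(2, idx).squeeze(2)           # random spatial sample per channel
        # interleave so that each channel's three descriptors form one group
        desc = torch.stack([f_avg, f_max, f_rdm], dim=2).view(b, 3 * c, 1, 1)
        m = self.fc(self.group_fuse(desc))               # attention weights M_c(F) in (0, 1)
        return x + x * m                                 # residual weighting, i.e., F' = F + F (x) M_c(F)
```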
2.4 Partial Decoder
In a CNN model, high-level features convey semantics, whereas low-level features depict the spatial details that are beneficial for refining object boundaries. Compared with high-level features, low-level features contribute relatively little to the overall performance and, because of their large spatial resolution, incur considerable computational overhead. Hence, a PD module [23] was introduced at the end of the backbone, as illustrated in Fig. 4.
Fig. 4. Structure of the PD module. The PD incorporates only high-level features, discarding the larger-resolution features from the low-level layers, which facilitates rapid and precise extraction of target region features.
The PD module exclusively integrates high-level features while discarding the larger-resolution shallow features, ensuring swift and accurate extraction of FE features. The process is as follows. Initially, three sets of high-confidence, high-level features $\{C_i, i=3,4,5\}$ are extracted from the high-level blocks of the backbone network. These high-level features are then amalgamated using the partial decoder $pd(\cdot)$. This fusion of features from different levels encourages information from distinct layers to complement one another, culminating in a preliminary global map $P_d = pd(C_3, C_4, C_5)$. This map serves as a guide for the subsequent progressive learning based on the reverse attention strategy.
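The following is a simplified stand-in for the partial decoder: it reduces C3-C5 to a common width, upsamples them to the C3 resolution, and fuses them into a single-channel global map. The channel widths and fusion layers are assumptions; the decoder of [23] is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PartialDecoderLite(nn.Module):
    """Simplified stand-in for pd(.) in Section 2.4: it fuses only the
    high-level features C3-C5 into a single global map P_d."""

    def __init__(self, in_channels=(256, 512, 512), mid=64):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(c, mid, kernel_size=1) for c in in_channels)
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * mid, mid, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, kernel_size=1),      # single-channel global map P_d
        )

    def forward(self, c3, c4, c5):
        target = c3.shape[2:]                      # decode at the C3 resolution
        feats = [F.interpolate(r(c), size=target, mode="bilinear", align_corners=False)
                 for r, c in zip(self.reduce, (c3, c4, c5))]
        return self.fuse(torch.cat(feats, dim=1))
```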
2.5 Reverse Attention Module
To meticulously capture detailed expression information from crucial regions, we introduced an RA module after each high-level block to progressively expand the target area. Starting from the initial global map generated by the PD module, the features of distinct sizes extracted by blocks 5, 4, and 3 serve as inputs to the RA modules.
The RA module methodically erases the presently predicted region from the high-level lateral output features, sequentially unveiling the missing details and nuanced features of the essential expression regions that require supplementation from top to bottom. In this approach, the present prediction result is obtained by upscaling information from the deeper network layers. This incremental erasure concept [24] refines the initial rough prediction into a comprehensive and precise prediction outcome.
The reverse attention output $R_i$ results from the element-wise multiplication of the high-level output features $\{C_i, i=3,4,5\}$ with the reverse attention weights $A_i$:
$$R_i = D(C_i, A_i)$$
where $D(\cdot)$ denotes the dot (element-wise) multiplication operation. The RA weight $A_i$ can be obtained by simply subtracting the upsampled prediction $S_{i+1}$ of the (i + 1)-th lateral output from 1:
$$A_i = 1 - U(S_{i+1})$$
where $U(\cdot)$ denotes the upsampling operation.
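One RA step can be sketched as follows. The use of a sigmoid to keep the inverted map in [0, 1], the refinement convolutions, and the residual connection back to the upsampled prediction are assumptions borrowed from common reverse-attention implementations, not details stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReverseAttention(nn.Module):
    """Sketch of one RA step (Section 2.5): the coarser prediction is upsampled,
    inverted to form A_i, used to weight the lateral features C_i, and the result
    refines the prediction. Layer widths are assumptions."""

    def __init__(self, channels, mid=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, kernel_size=3, padding=1),
        )

    def forward(self, c_i, pred_coarser):
        up = F.interpolate(pred_coarser, size=c_i.shape[2:],
                           mode="bilinear", align_corners=False)
        a_i = 1.0 - torch.sigmoid(up)              # erase the already-predicted region
        r_i = c_i * a_i                            # element-wise weighting D(C_i, A_i)
        return self.refine(r_i) + up               # supplement the missing details
```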
2.6 Loss Function
The loss function quantifies the disparity between the predicted and actual values, and network training aims to minimize this loss. Using the softmax loss function, the neural network output values were mapped to the (0, 1) interval, providing probabilities for the various classes, which were then compared to achieve multi-class classification. Although the softmax loss function effectively optimizes the interclass spacing, it can misjudge samples of the same FE class when there are substantial intraclass differences.
To address this issue, we introduced an islanding loss function [25]. By implementing these two functions, the objectives of increasing interclass distances and decreasing intraclass distances were realized. The islanding loss function is an enhancement built upon the center loss. Initially, a cosine distance is computed and 1 is added to extend the range to (0,2), thereby enlarging the distance between different classes. The islanding loss function is mathematically expressed as:
$$L_I = L_C + \lambda_1 L_{I-\operatorname{COS}}$$
where $L_C$ represents the center loss function (which is used to optimize the intraclass distance), $L_{I-\operatorname{COS}}$ represents the cosine distance between the cluster centers, and $\lambda_1$ is a hyperparameter indicating the weight ratio in the islanding loss function. $L_{I-\operatorname{COS}}$ can be calculated from the following equation:
$$L_{I-\operatorname{COS}} = \sum_{c_j \in N} \sum_{\substack{c_k \in N \\ c_k \neq c_j}} \left( \frac{c_k \cdot c_j}{\left\|c_k\right\|_2 \left\|c_j\right\|_2} + 1 \right)$$
where $N$ represents the set of sample labels, $c_k$ and $c_j$ represent the cluster centers of the k-th and j-th classes of expressions, respectively, and $\|c_k\|_2$ and $\|c_j\|_2$ represent the Euclidean distances from the cluster centers to the origin of the coordinates.
In our proposed framework, the training of the network was optimized by considering the softmax loss and islanding loss functions. Therefore, features of the same class were close to one another, and the distances between dissimilar classes of facial features were increased to achieve better recognition results. The joint loss function can be determined using the following equation:
$$L = L_S + \lambda L_I$$
where $L_S$ is the softmax loss, $L_I$ is the islanding loss, and $\lambda$ denotes the weight ratio of the islanding loss in the joint loss function. Based on the model performance with different parameter values in the experiments, $\lambda$ and $\lambda_1$ were fixed at 0.005 and 7, respectively.
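A compact sketch of this joint objective is shown below, with learnable class centers standing in for the cluster centers. The normalization of the pairwise cosine term and the use of cross-entropy as the softmax loss follow common center/island-loss implementations and are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointLoss(nn.Module):
    """Sketch of the joint objective in Section 2.6: softmax (cross-entropy) loss
    plus an islanding loss built on learnable class centers."""

    def __init__(self, num_classes, feat_dim, lam=0.005, lam1=7.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.lam, self.lam1 = lam, lam1

    def forward(self, logits, features, labels):
        ce = F.cross_entropy(logits, labels)                      # softmax loss L_S
        center = 0.5 * (features - self.centers[labels]).pow(2).sum(dim=1).mean()
        c = F.normalize(self.centers, dim=1)
        cos = c @ c.t() + 1.0                                     # cosine distance shifted to (0, 2)
        off_diag = cos - torch.diag(torch.diag(cos))              # exclude the k == j terms
        l_icos = off_diag.sum() / (c.size(0) * (c.size(0) - 1))   # averaging is an assumption
        island = center + self.lam1 * l_icos                      # L_I = L_C + lambda_1 * L_I-COS
        return ce + self.lam * island                             # L = L_S + lambda * L_I
```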
3. Experiment and Analysis
3.1 Experimental Dataset and Data Augmentation
In this experiment, two widely recognized public FER datasets were employed: the FER2013 and CK+ expression datasets.
The CK+ expression dataset [26] comprises 593 expression image sequences from 123 individuals, each spanning from a neutral to a peak expression; 327 of these sequences carry expression labels. The dataset covers eight fundamental FE classes: anger, contempt, happiness, sadness, surprise, disgust, fear, and neutral. All images portray clear, frontal FEs, with annotations meticulously validated by psychologists. For fairness in the performance evaluation and comparison, we omitted the contempt expressions because of their notably small sample size.
The FER2013 dataset [27] comprises 35,887 facial images accompanied by expression labels. This dataset encapsulates seven expressions: anger, disgust, fear, happiness, sadness, surprise, and neutral. The data collection was conducted in an uncontrolled environment, making it challenging to obtain precise recognition results.
Preprocessing and data augmentation. The multi-task cascaded convolutional network (MTCNN) model was used for facial detection and cropping to obtain nearly background-free facial images. The cropped facial images were then scaled and normalized using bilinear interpolation and saved with dimensions of 224 × 224 pixels.
Data augmentation was subsequently performed on the image samples to enhance the resilience of the model to interference, thereby expanding the sample pool. Each image underwent flipping and rotation, with rotation angles of ±10° at 5° intervals. Hence, the number of image samples increased to 15 times that of the original dataset. The effect of the data augmentation is depicted in Fig. 5.
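A minimal sketch of this augmentation step is shown below; the exact combination of flips and rotation angles that yields the 15× expansion is not specified, so the set of variants generated here is an assumption.

```python
from PIL import Image


def augment(face_img: Image.Image):
    """Generate flipped and rotated variants of an MTCNN-cropped 224x224 face image.
    The exact angle/flip combinations behind the 15x expansion are an assumption."""
    angles = [-10, -5, 0, 5, 10]                      # +/-10 degrees in 5-degree steps
    variants = []
    for img in (face_img, face_img.transpose(Image.FLIP_LEFT_RIGHT)):
        for angle in angles:
            variants.append(img.rotate(angle, resample=Image.BILINEAR))
    return variants                                   # angle == 0 keeps the unrotated image
```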
3.2 Performance Evaluation Metrics
Two performance indicators (average accuracy Acc and stability Sta) were used [28] in this experiment. Each method was tested N (N = 10) times, and the final average accuracy was determined as follows:
$$Acc = \frac{1}{N} \sum_{i=1}^{N} Acc_i$$
where $Acc_i$ denotes the recognition accuracy of the i-th experiment. Owing to the random initialization of the network parameters and the random batching of the training samples, each recognition result under the same settings carries a certain error; therefore, it is fairer and more reliable to use the average value over multiple experiments.
Stability is the mean square error of N experimental results, which is defined as:
$$Sta = \frac{1}{N} \sum_{i=1}^{N} \left(Acc_i - Acc\right)^2$$
Sta reflects the degree of variation in the experimental results under the same settings; a smaller value indicates a more stable model.
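The two metrics can be computed as follows for a list of per-run accuracies (the example values are illustrative only):

```python
from statistics import mean


def accuracy_and_stability(acc_runs):
    """Average accuracy (mean of N runs) and stability (mean squared deviation)."""
    acc = mean(acc_runs)
    sta = mean((a - acc) ** 2 for a in acc_runs)
    return acc, sta


# e.g., ten repeated runs of the same configuration (illustrative values)
print(accuracy_and_stability([74.1, 73.9, 74.3, 74.0, 74.2, 73.8, 74.1, 74.0, 74.2, 74.2]))
```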
3.3 Network Training and Parameter Setting
The following hardware was used for the experiments: an i5-10400F 2.9 GHz CPU, an NVIDIA GTX 1080Ti 8 GB GPU, and 16 GB of RAM. The software environment comprised the Ubuntu 18.04.3 64-bit operating system, MATLAB 2019a, PyTorch 1.4.0 with GPU support (to establish the training environment), and Python 3.6.2.
During the experiments, as training progressed, the learning rate was halved whenever the loss stopped decreasing for three consecutive iterations. To mitigate overfitting, dropout was integrated into the model to randomly deactivate neurons, with the dropout rate set to 0.001. During training, the validation set was evaluated after each pass over the training set, and the loss and accuracy were recorded.
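A sketch of this schedule using PyTorch's ReduceLROnPlateau is shown below; the optimizer choice, base learning rate, and the toy model with placeholder losses are assumptions used only to demonstrate the scheduler and dropout settings.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy model: dropout with the stated rate; the layer sizes are placeholders.
model = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224 * 3, 512),
                      nn.ReLU(), nn.Dropout(p=0.001), nn.Linear(512, 7))
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # optimizer is an assumption
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                 factor=0.5, patience=3)

for epoch in range(50):
    train_loss = torch.rand(1).item()     # placeholder for one pass over the training set
    val_loss = torch.rand(1).item()       # placeholder for the validation evaluation
    scheduler.step(val_loss)              # halve the LR when the loss plateaus for 3 epochs
```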
The joint loss function incorporates two weight parameters ($\lambda$ and $\lambda_1$) that require tuning. In principle, a grid search should be used to compute the optimal weights; in practice, the constraint weight for intraclass variations is best kept within the range 0.001–0.01. Hence, in the experiment, $\lambda$ was initially fixed at 0.01 while $\lambda_1$ was varied from 1 to 10. The recognition accuracy on the FER2013 dataset is shown in Fig. 6(b). Although the accuracy curve was not strictly convex, overall the recognition accuracy was significantly higher at $\lambda_1 = 7$ than in the other cases. Consequently, $\lambda_1$ was set to 7 while $\lambda$ was varied over 0.0005, 0.001, 0.005, 0.01, and 0.05; the recognition accuracy on the FER2013 dataset is shown in Fig. 6(a). Fig. 6 shows that the model achieved the highest recognition rate at $\lambda = 0.005$. Hence, we selected 0.005 and 7 for $\lambda$ and $\lambda_1$, respectively.
Fig. 6. Performance of the model with different weight values: (a) $\lambda_1 = 7$ and (b) $\lambda = 0.01$.
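The two-stage search over $\lambda_1$ and $\lambda$ described above can be sketched as follows; `evaluate()` is a hypothetical stand-in that would train the model with the given weights and return the FER2013 accuracy.

```python
import random


def evaluate(lam, lam1):
    """Hypothetical stand-in: the real experiment trains the full model with the
    given loss weights and returns the FER2013 validation accuracy."""
    random.seed(hash((lam, lam1)))
    return 70 + 5 * random.random()


# Stage 1: fix lambda = 0.01 and sweep lambda_1 over 1..10 (cf. Fig. 6(b)).
lam1_best = max(range(1, 11), key=lambda l1: evaluate(0.01, l1))
# Stage 2: fix the chosen lambda_1 and sweep lambda over the candidate values (cf. Fig. 6(a)).
lam_best = max([0.0005, 0.001, 0.005, 0.01, 0.05],
               key=lambda lam: evaluate(lam, lam1_best))
print(lam_best, lam1_best)
```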
3.4 Ablation Analysis
To assess the influence of various components in the framework on the FER accuracy and stability, we conducted multiple sets of ablation experiments. The aim of these experiments was to evaluate the effectiveness and dependability of different modification strategies within the proposed FER framework. Tests were performed on the FER2013 and CK+ datasets, with consistent experimental parameters across the various test groups. The outcomes are summarized in Table 1.
Table 1. Comparison of the experimental results for various improvement strategies
"×" denotes the components that were not used, and "√" indicates the components that were incorporated.
Compared with the original VGG16 model (model 1), the different improvement strategies adopted in this study improved the recognition accuracy and stability of the model to a certain extent. The results show that adding the GCCA module enhanced the effective features and suppressed the redundant features, significantly improving the recognition accuracy. The multifeature fusion strategy combines the edge features obtained from the low-level blocks, as constraints, with the high-level semantic features obtained from the high-level blocks, enabling the overall framework to learn more targeted areas and to better capture the detailed information of subtle facial changes. The effect of the PD module on the recognition accuracy was not obvious; however, the module effectively reduced the computational complexity of the overall framework and significantly improved the stability of the model. The subsequent RA module supplemented the missing information and detailed features of the key expression regions, ensuring better accuracy while increasing the processing speed. Finally, after applying all of the strategies, we attained accuracies of 98.66% and 74.08% on the CK+ and FER2013 datasets, respectively.
3.5 Performance Evaluation and Comparison
To further analyze the accuracy for various expression classes, we obtained the confusion matrices of FER results from different sources. The confusion matrices are presented in Figs. 7 and 8.
Fig. 7. Confusion matrix for the CK+ dataset: Acc (%).
Fig. 8. Confusion matrix for the FER2013 dataset: Acc (%).
For the FER2013 dataset, the proposed model achieved high recognition accuracy for the happy and surprise expressions. However, the model could not accurately identify the sad and fear expressions, with recognition accuracies of only 55.98% and 61.01%, respectively. The distribution of samples across classes in the FER2013 dataset is extremely unbalanced: the training set contains only approximately 500 sad images, whereas it contains more than 7,000 happy images. Moreover, two classes of expressions (sad and disgust) involve similar changes in the mouth and eye areas of the face, and the distinguishability of their expression features is relatively low, making them prone to recognition errors.
In contrast, for the CK+ dataset, the accuracy for the various expression classes was significantly higher than that for the FER2013 dataset, and the fluctuation in the recognition accuracy across classes was very small. This is because the CK+ dataset was collected under controlled laboratory conditions, and the facial images are clear, without occlusions or environmental interference. Even so, the fear, disgust, and sad expressions have a certain degree of similarity, which increases the difficulty of distinguishing between these three classes.
A comparison of the average accuracy and stability of different models tested on the FER2013 and CK+ datasets is presented in Table 2. Our model achieved the highest expression recognition accuracy and displayed the best stability on both experimental datasets.
Barman and Dutta [8] and Agarwal and Mukherjee [9] employed traditional FER methods based on manual features. In [8], the authors used an active shape model to extract facial contours and region positions, facilitating the extraction of salient FE features. However, this approach tends to lose key recognition and classification information, leading to relatively poor FER accuracy on both datasets.
In [9], complex non-rigid motion facial components were captured by extracting scale-independent features and tracking pixel motion. Unfortunately, the generalization of this method is rather limited, particularly the performance of the model on the FER2013 dataset, which proved to be unsatisfactory.
Among the deep learning methods, Mollahosseini et al. [11] adopted a fine-tuning strategy after pretraining and achieved better recognition results than traditional methods. Nonetheless, network overfitting was a concern, and no feature attention mechanism was considered. Verma et al. [14] processed image sequences through a visual branch network, introducing skip connections from low levels to high levels to account for the underlying features. This significantly improved the model performance, but the method did not account for contextual information or for the influence of highly similar expression classes on the recognition accuracy.
Liu et al. [15] proposed a parallel multi-channeled convolutional network to learn effective feature representation through the integration of global and local features, achieving good accuracy and robustness. However, the FER accuracy on the unconstrained environment dataset FER2013 still requires improvement, indicating limited generalizability.
Table 2. Performance comparison of different methods
Our approach achieved both high accuracy and high stability on both experimental datasets. This success can be attributed to our framework, which uses the improved VGGNet16 as the backbone and incorporates the GCCA module to capture crucial information in the deeper network layers. We fused multiple features, extracting shallow features from the low-level blocks of the backbone and aggregating the deep details from the high-level blocks using the PD module. Furthermore, the RA mechanism guides the entire network sequentially from top to bottom, allowing the mining of detailed information that requires supplementation. This approach makes full use of contextual information, leading to improved FER accuracy and model stability.
4. Conclusion
In this study, we developed a novel FER framework to address a limitation of traditional FER algorithms, which tend to overlook important features as the network depth increases during feature extraction. The framework is based on multiscale feature fusion and incorporates an attention mechanism that considers contextual information. The improved VGGNet was employed for FE feature extraction, complemented by a multiscale FM fusion strategy that introduces contextual information, thereby enhancing the recognition accuracy. In addition, we improved the channel attention module based on group convolution (the GCCA module) to extract more expressive FER features. Our results confirmed the high accuracy of our model for FER tasks across various scenarios. However, although the FER2013 dataset can represent uncontrolled, non-laboratory environments, it is primarily sourced from the Internet, potentially resulting in limited diversity in image quality and environmental conditions. This implies that the dataset may not comprehensively represent all possible real-world scenarios. Moreover, the experimental datasets include only seven major FEs, whereas real-world expressions are much more diverse, encompassing a richer array of emotions and emotional expressions. In future research, we plan to further optimize the network structure and explore datasets that more closely mimic real-world conditions, thereby enhancing the practical applicability of our research.