Weakly Supervised Aspect Category Detection Method Based on Label Hierarchical Filtering Mechanism

Yan Xiang, Anlan Zhang, Jinlei Shi and Yuxin Huang

Abstract: Aspect category detection is crucial for aspect-level sentiment analysis. Supervised methods require a large number of labeled samples to be effective, whereas unsupervised methods rely heavily on human judgment, which can compromise identification accuracy. To address these limitations, we propose a weakly supervised aspect category detection method that uses a hierarchical label filtering mechanism. Our approach begins by assigning preliminary pseudo-labels to comments based on topic similarity. Subsequently, unreliable pseudo-labeled samples are filtered out using semantic similarity. Finally, high-confidence training samples are selected using both high and low thresholds. Through the hierarchical label filtering mechanism employed in these three stages, we construct a high-confidence training dataset, which is used to train a topic representation-enhanced classifier for aspect category detection. The proposed method, evaluated on three publicly available datasets, outperformed existing baseline models while reducing the need for human intervention.

Keywords: Aspect Category Detection, Aspect-Level Sentiment Analysis, Label Filtering, Topic Model, Weakly Supervised Method

1. Introduction

With the rapid development of the Internet, an enormous number of online comments have accumulated. To analyze the opinions in these comments, aspect-based sentiment analysis (ABSA) [1] was proposed. Aspect category detection (ACD) [2], a foundational task of ABSA, aims to classify comments into predefined categories based on their key parts. For example, the aspect categories of the comment "The service is good although rooms are pretty expensive" are "service" and "price," respectively. This categorization is essential for understanding user opinions.

Conventional supervised ACD methods depend on a substantial volume of labeled data, which is time-consuming and costly to obtain. Consequently, researchers have attempted to use unsupervised or weakly supervised methods to accomplish ACD tasks. One unsupervised method is topic modeling [3], which leverages word co-occurrence statistics to generate aspect keywords. Another approach involves clustering-based models [4,5], which cluster relevant comments and match them to predefined aspect categories. However, even with the advancements in these models, a manual mapping process for aspect categories is still required, and the performance of category detection remains limited.

Given these limitations, weakly supervised methods have garnered extensive attention in recent years. A prevalent approach involves iterative training for ACD guided by seed words [3]. This approach focuses on learning an embedding space for sentences and seed words to establish similarities between sentences and aspects. However, aspect representations are often constrained by a fixed initial set of seed words. To expand the initial seed word set, some researchers have leveraged the vocabulary inherent to pretrained models [2]. Even so, relying solely on weakly supervised data for training can lead to suboptimal model performance owing to the limitations of the seed-word representations. To overcome this challenge, one line of work introduced a pseudo-labeling strategy [6].
In this method, labeled data are used to train an initial model, which is then applied to unlabeled data to generate pseudo-labels, thereby expanding the training dataset [7]. However, filtering out noisy labels during this process is crucial. To address this issue, we propose a framework for selecting pseudo-labels in a stepwise manner. The primary contributions are summarized as follows:

· A weakly supervised ACD method that utilizes a hierarchical label-filtering mechanism is introduced. This method iteratively identifies high-confidence pseudo-labeled comments from the unlabeled data by combining topic and semantic similarities. These high-quality pseudo-labels enhance the training of the ACD classifier.

· After the high-confidence pseudo-labeled dataset is obtained, topic words are integrated into the bidirectional encoder representations from transformers (BERT) classifier to characterize the connections between comments and topic words. This process is crucial for predicting the aspect categories.

· In experiments on three benchmark datasets, the proposed method outperformed several competitive methods.

2. Related Works

2.1 Supervised Aspect Category Detection

ACD methods can be categorized into supervised, unsupervised, and weakly supervised approaches based on the availability of labels [5,8]. In the supervised setting, the task typically takes the form of sequence labeling, in which methods such as conditional random fields and recurrent neural networks are employed. However, these models depend heavily on a substantial amount of domain-specific training data [9]. To address the scarcity of training data in a target domain, researchers have introduced cross-domain models. However, designing effective cross-domain models is particularly challenging, given the inherent differences between the source and target domains [10].

2.2 Unsupervised Aspect Category Detection

Unsupervised methods provide a solution to this issue. Earlier unsupervised ACD approaches relied heavily on the latent Dirichlet allocation (LDA) topic model [11-13] to generate word distributions for each aspect category using a Dirichlet prior. Wang et al. [14] introduced an augmented restricted Boltzmann machine (RBM) that integrated prior knowledge to jointly capture the general and sentiment aspects of comments. Subsequently, the aspect-based auto-encoder (ABAE) model [15] and its variants [16] were introduced, demonstrating significant performance improvements. These models capture word co-occurrence patterns to identify coherent aspects in text. Shi et al. [16] recently applied self-supervised representation learning with contrastive algorithms to enhance aspect and comment segment representations, thereby achieving improved outcomes. Yang et al. [17] devised a set of aspect category experts, assigning each expert the responsibility of encoding a specific aspect category. The primary limitation of these methods is the need for manual intervention, such that the subjectivity of human judgment significantly affects the results.

2.3 Weakly Supervised Aspect Category Detection

In recent years, weakly supervised methods for ACD have attracted widespread attention owing to their ability to model meaningful aspects using only a small amount of domain knowledge. Handcrafted mapping rules [2] and seed-driven methods [5] have emerged as two prominent approaches to weakly supervised ACD.
Huang et al. [18] leveraged seed words to generate aspect representations and used a convolutional neural network (CNN) model to align reviews with corresponding aspects. Nguyen et al. [19] introduced an aspect detection encoder that transforms comments and aspects into a low-dimensional embedding space. Tulkens and Van Cranenburgh [20] identified aspects by measuring the cosine similarity between pretrained aspect representations and label names. However, the effectiveness of seed words is often limited, making it challenging to further improve the performance of these weakly supervised methods.

3. Methodology

3.1 Task Definition

Given a dataset with C aspect categories, each category contains a few labeled comments and numerous unlabeled comments. Let x denote an unlabeled comment and $s_j^i$ denote the j-th labeled comment of the i-th category; the entire dataset is $\left\{x, s^i\right\}_{i=1}^{C}$, where $i=1,2,\ldots,C$, $j=1,2,\ldots,\left|s^i\right|$, and $\left|s^i\right|$ is the number of labeled comments of category i. The task is to predict the aspect categories of the unlabeled comments using the provided data.

3.2 Overview of the Model

The model consists of two main components: 1) construction of the training set using a hierarchical filtering mechanism, which combines the topic and semantic similarities of comments to iteratively filter and identify high-confidence pseudo-labeled comments from the unlabeled comments; and 2) training of a topic-enhanced classifier on the constructed training set, in which comments and their corresponding topic terms are concatenated and input into a BERT classifier to detect aspect categories. The overall workflow of the model is shown in Fig. 1.

3.3 Training Set Construction Based on the Hierarchical Filtering Mechanism

The training dataset is constructed in three steps: 1) pseudo-labels are initially assigned to unlabeled comments based on their topic similarity values with labeled comments; 2) the probabilities of pseudo-labeled comments belonging to different categories are calculated based on their semantic similarity values with labeled comments; and 3) high-confidence pseudo-labeled comments are selected for the training set using hierarchical filtering rules.

3.3.1 Initial pseudo-labeling based on topic similarity

To assign pseudo-labels to unlabeled comments, we first apply the topic model ABAE [15] with topic number K to the dataset $\left\{x, s^i\right\}_{i=1}^{C}$ and obtain the topic distribution vector $p_i \in \mathbb{R}^K$ of the i-th comment, as well as its top m topic terms $\left\{w_j\right\}_{j=1}^m$. Subsequently, the cosine similarity [21] of the topic distribution vectors is used to measure the distance between unlabeled and labeled comments. Specifically, the topic similarity $\operatorname{sim\_t}(\cdot)$ of two comments is calculated using Eq. (1):
(1)
$$\operatorname{sim\_t}\left(p_j^i, q\right)=\frac{p_j^i}{\left\|p_j^i\right\|} \cdot \frac{q}{\|q\|},$$

where $p_j^i$ is the topic distribution vector of the j-th labeled comment belonging to category i, and q is the topic distribution vector of the unlabeled comment x. The similarity values between the unlabeled comment x and all the labeled comments in the i-th category are calculated and averaged to obtain the topic similarity $SimT^i$ of x belonging to the i-th category, as follows:
(2)
$$\operatorname{Sim}T^i=\frac{\sum_{j=1}^{\left|s^i\right|} \operatorname{sim\_t}\left(p_j^i, q\right)}{\left|s^i\right|}.$$

Finally, a pseudo-label is assigned to the unlabeled comment x according to the highest value of $SimT^i$. Through this process, the unlabeled comments are classified into different aspect categories.
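As a concrete illustration of this first stage, the following Python/NumPy sketch computes Eqs. (1) and (2) and assigns the pseudo-label by the highest average topic similarity. The function names and toy topic vectors are our own illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of the initial pseudo-labeling stage (Eqs. (1)-(2)).
import numpy as np

def topic_cosine(p, q):
    """Cosine similarity between two topic distribution vectors, Eq. (1)."""
    return np.dot(p / np.linalg.norm(p), q / np.linalg.norm(q))

def assign_initial_pseudo_label(q, labeled_topic_vecs):
    """Assign the category whose labeled comments are closest on average.

    q                  -- topic distribution of an unlabeled comment, shape (K,)
    labeled_topic_vecs -- list over categories; entry i is an array of shape
                          (|s^i|, K) holding the topic vectors of the labeled
                          comments of category i
    """
    sim_t = [np.mean([topic_cosine(p, q) for p in vecs])   # Eq. (2)
             for vecs in labeled_topic_vecs]
    return int(np.argmax(sim_t)), sim_t

# Toy example: K = 3 topics, C = 2 categories, 2 labeled comments each.
labeled = [np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]),
           np.array([[0.1, 0.2, 0.7], [0.2, 0.1, 0.7]])]
label, sims = assign_initial_pseudo_label(np.array([0.65, 0.25, 0.10]), labeled)
print(label, sims)  # category 0 wins: its average cosine similarity is highest
```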
3.3.2 Further pseudo-labeling based on semantic similarity

To eliminate potential errors among the pseudo-labeled samples obtained above, relatively reliable samples are retained by additionally considering semantic similarity. BERT-whitening [22] is first applied to calculate the semantic similarity values between the unlabeled comment x and all the labeled comments in the i-th category, which are averaged to obtain the similarity $SimB^i$ of x belonging to the i-th category, as follows:

(3)
$$\operatorname{Sim}B^i=\frac{\sum_{j=1}^{\left|s^i\right|} \text{BERT-Whitening}\left(x, s_j^i\right)}{\left|s^i\right|},$$

where BERT-Whitening denotes the whitening operator, which applies a linear transformation to the output features of the BERT model. This operator reduces redundancy among features and enhances the effectiveness of computing semantic similarity [22]. $SimB^i$ is then normalized to obtain the probability $Score^i$ that x belongs to category i, as follows:
(4)
$$\operatorname{Score}^i=\frac{e^{\operatorname{Sim}B^i}}{1+e^{\operatorname{Sim}B^i}}.$$

After a second pseudo-label is assigned to the unlabeled comment based on the highest value of $Score^i$, this new pseudo-label is compared with the initial label obtained through topic similarity in the previous stage. If the new pseudo-label is inconsistent with the initial label, it is considered unreliable, and the corresponding pseudo-labeled comment is excluded from the training set. If the new pseudo-label is consistent with the initial label, we proceed to the next step for further evaluation.
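The following sketch illustrates one way to implement this stage, operating on precomputed BERT sentence embeddings. The whitening recipe (mean removal followed by an SVD-based linear map) follows the general BERT-whitening procedure of [22]; the function names and the random stand-in embeddings are illustrative assumptions.

```python
# A minimal sketch of the semantic-similarity stage (Eqs. (3)-(4)).
import numpy as np

def whitening_params(embeddings):
    """Estimate the whitening transform (mu, W) from sentence embeddings.

    Assumes enough sentences (N > d) for a stable covariance estimate.
    """
    mu = embeddings.mean(axis=0)
    cov = np.cov((embeddings - mu).T)
    u, s, _ = np.linalg.svd(cov)
    W = u @ np.diag(1.0 / np.sqrt(s))   # decorrelates and rescales features
    return mu, W

def whiten(e, mu, W):
    return (e - mu) @ W

def sim_b(x_emb, labeled_embs, mu, W):
    """Average whitened cosine similarity to category i's labeled comments, Eq. (3)."""
    xw = whiten(x_emb, mu, W)
    xw = xw / np.linalg.norm(xw)
    sims = []
    for e in labeled_embs:
        ew = whiten(e, mu, W)
        sims.append(float(xw @ (ew / np.linalg.norm(ew))))
    return float(np.mean(sims))

def score(sim):
    """Sigmoid normalization of Eq. (4)."""
    return np.exp(sim) / (1.0 + np.exp(sim))

# Usage with random stand-ins for BERT sentence embeddings (d = 8).
rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 8))
mu, W = whitening_params(embs)
print(score(sim_b(embs[0], embs[1:6], mu, W)))
```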
3.3.3 Selection of pseudo-labeled comments with high confidence

Assuming that the pseudo-labels obtained from both the topic and semantic similarities are of category c, the confidence level of the pseudo-label is determined as follows, based on the probability $Score^c$ that x belongs to category c. Fig. 2 illustrates the entire filtering process. 1) If $Score^c$ is greater than the high threshold $\tau_h$, the label is considered reliable, and the comment x together with its label c is added directly to the training set. 2) If $Score^c$ is lower than the high threshold $\tau_h$, we further check whether the probabilities of the comment belonging to all other categories are lower than the low threshold $\tau_l$. If so, the comment is also added to the training set. Formally, we use $PN^c$ to indicate whether comment x is accepted as belonging to category c:

(5)
$$PN^c= \begin{cases}1, & \operatorname{Score}^c \geq \tau_h \\ \prod_{i=1, i \neq c}^{C} I\left[\operatorname{Score}^i \leq \tau_l\right], & \operatorname{Score}^c<\tau_h\end{cases}$$

where $I(\cdot)$ is an indicator function that returns one when the condition is satisfied. If $PN^c = 1$, the comment and its corresponding label are added to the training set. In contrast to previous methods, this approach considers both high and low probabilities, ensuring that the pseudo-labeled samples selected for the training set are both abundant and reliable. Through the steps described in subsections 3.3.1 to 3.3.3, we construct the training set.
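The selection rule of Eq. (5) reduces to a few lines of code. In the sketch below, `scores` is the vector of per-category probabilities from Eq. (4), and the threshold defaults follow the values reported in Section 4.2; the function name is an illustrative assumption.

```python
# A minimal sketch of the high/low-threshold selection rule of Eq. (5).
def accept_pseudo_label(scores, c, tau_h=0.5, tau_l=0.15):
    """Return True (PN^c = 1) if the pseudo-label c is kept for training."""
    if scores[c] >= tau_h:                       # confidently category c
        return True
    # otherwise require every other category to be confidently ruled out
    return all(s <= tau_l for i, s in enumerate(scores) if i != c)

print(accept_pseudo_label([0.62, 0.10, 0.08], c=0))  # True: Score^c >= tau_h
print(accept_pseudo_label([0.40, 0.10, 0.05], c=0))  # True: others <= tau_l
print(accept_pseudo_label([0.40, 0.30, 0.05], c=0))  # False: ambiguous, rejected
```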
3.4 Classification based on Topic Information Enhancement

After obtaining the high-confidence training set, we train the classifier by fine-tuning a pretrained language model. To enhance the capacity of the model to represent aspect categories, we construct a sequence that merges each comment with its corresponding topic terms and encode it using the BERT model [23]. The input takes the form "[CLS] + Comment + [SEP] + Topic Terms + [SEP]," where "[CLS]" serves as a special identifier and "[SEP]" separates the comment from the topic terms. This design establishes interactions between the individual words in a comment and the corresponding topic terms. The hidden vector $h_j$ of the [CLS] token from the last layer serves as the representation of the j-th comment. Finally, $h_j$ is fed into a fully connected layer and a softmax layer to obtain the classification result:

(6)
$$\hat{y}_j=\operatorname{softmax}\left(W h_j+b\right),$$

where W and b denote the learnable weight and bias, respectively; $\hat{y}_j \in \mathbb{R}^C$ is the predicted probability distribution of the j-th comment over the aspect categories; and C is the number of classes. The training loss is expressed as:
(7)
$$\mathcal{L}=-\frac{1}{N} \sum_{j=1}^N \sum_{i=1}^C y_{ji} \log \hat{y}_{ji},$$

where N represents the number of training samples; C denotes the number of aspect categories; $y_{ji}$ is the true label probability that the j-th sample belongs to the i-th category; and $\hat{y}_{ji}$ is the probability predicted by the model that the j-th sample belongs to the i-th category.
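As a sketch of how this classifier can be realized, the following code builds the "[CLS] + Comment + [SEP] + Topic Terms + [SEP]" input via the HuggingFace `transformers` tokenizer's sentence-pair mode and attaches a linear head to the [CLS] vector; `nn.CrossEntropyLoss` combines the softmax of Eq. (6) with the loss of Eq. (7) for one-hot labels. The model name, example topic terms, and label are illustrative assumptions.

```python
# A minimal sketch of the topic-enhanced BERT classifier of Section 3.4.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TopicEnhancedClassifier(nn.Module):
    def __init__(self, num_classes, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        # fully connected layer: the W and b of Eq. (6)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        h = out.last_hidden_state[:, 0]      # hidden vector h_j of [CLS]
        return self.fc(self.dropout(h))      # logits; softmax applied in the loss

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TopicEnhancedClassifier(num_classes=3)

# Sentence-pair tokenization yields "[CLS] Comment [SEP] Topic Terms [SEP]".
batch = tokenizer(["The service is good although rooms are pretty expensive"],
                  ["staff waiter service friendly"],   # top-m topic terms (assumed)
                  padding=True, truncation=True, return_tensors="pt")
logits = model(**batch)
# Cross-entropy over the pseudo-labels implements the loss of Eq. (7).
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1]))
loss.backward()
```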
4. Experiments

4.1 Dataset

Three datasets were used in the experiments: Restaurant, Bags, and Keyboards. The Restaurant dataset (https://huggingface.co/datasets/Charitarth/SemEval2014-Task4) consists of Citysearch New York restaurant comments, and the latter two datasets are drawn from Amazon product comments (https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/). Table 1 presents detailed statistics on these datasets.

Table 1. Statistics on three datasets

4.2 Experimental Settings

Five labeled samples were randomly selected for each category; punctuation, stop words, and words occurring fewer than 10 times in the dataset were removed. We used ABAE [15] as the topic model and set the number of topics to 14. In the pseudo-label screening, the high threshold $\tau_h$ was set to 0.5 and the low threshold $\tau_l$ to 0.15. The classifier and BERT-whitening used the "BERT-base" configuration [23], with a hidden vector dimension of 768. During training, the batch size was 250, and optimization was performed using the adaptive moment estimation (Adam) optimizer with a learning rate of 0.01. The model underwent 15 update iterations, and a dropout layer was incorporated to prevent overfitting. The experimental environment was Windows 10 with an Intel Core i5-6300HQ @2.3 GHz, 32 GB RAM, and an NVIDIA GeForce GTX 960M GPU. The programming language was Python 3.7, and the framework was PyTorch 1.11.

4.3 Baseline Models

SERBM: Sentiment-aspect extraction based on RBM [14] jointly extracts comment aspects and sentiment polarity in an unsupervised manner. In this model, the hidden vector dimension was set to 10.

W2VLDA: Word2vec-based LDA [11] is a topic modeling approach that combines word embeddings with LDA. It automatically aligns discovered topics with predefined aspect names by utilizing user-provided seed words for different aspects.

ABAE [15] is an unsupervised neural topic model that employs neural word embeddings to acquire coherent aspects. Its parameters were optimized using Adam with a learning rate of 0.001 over 15 iterations, with a batch size of 50 to expedite training.

ABAEinit [24] modifies the aspect embedding vectors in ABAE by substituting them with the centroids of the relevant seed word embeddings and fixing the aspect embedding vectors during training. The experimental parameters were set in line with those of ABAE.

LDA-Anchors [25] uses latent Dirichlet allocation to obtain topic word distributions via seed (anchor) words.

MATE: The multi-seed aspect extractor [24] is a weakly supervised neural model that combines a seed aspect extractor trained under a multi-task objective with a multi-instance learning sentiment predictor to identify and extract useful comments.

MATE+MT: The multi-task (MT) counterpart of MATE [24] is a weakly supervised auto-encoder extension of the ABAE model that initializes the aspect-embedding matrix with aspect-specific seed words and subsequently refines it during training.

TS-*: Teacher-student (TS-*) [26] is a weakly supervised training framework in which the teacher network is a seed-word-based bag-of-words classifier, and the student network uses word2vec embeddings and the BERT model to encode text fragments.

AE-CSA: Aspect extraction via context-enhanced sememe attentions [27] is an unsupervised neural framework that enhances lexical semantics using sememes. The framework resembles an auto-encoder, reconstructing sentence representations and learning aspects through latent variables.

CAT: Continual adapter tuning [2] is a heuristic baseline model that encompasses comparative attention and an automated aspect assignment method.
4.4 Experimental Results and Analysis

4.4.1 Comparison with the baseline models

The experimental results of the different models on the Bags, Keyboards, and Restaurant datasets are listed in Table 2, from which we make the following observations. 1) Compared with the second-best models on the three datasets, the Macro-F1 score of the proposed model increased by 0.7%, 0.6%, and 2.1%, respectively, showing that our model outperforms the baseline models. 2) Unsupervised models such as ABAE and SERBM, which do not utilize labeled data, require manual judgment of the correspondence between aspect words and aspect categories, which significantly limits their performance. 3) Weakly supervised models such as MATE and MATE+MT surpass the unsupervised ABAE by leveraging aspect words for more accurate aspect category predictions, suggesting that the topic information provided by aspect words is important for ACD. Based on these observations, we infer that our model achieves substantial improvements by leveraging semantic and topic similarities to obtain more relevant samples and by filtering out low-confidence samples using the confidence thresholds.

Table 2. Macro-F1 score (%) of different models
4.4.2 Effectiveness of the label hierarchical filtering mechanism

To evaluate the impact of the hierarchical label filtering mechanism, we performed ablation experiments on the Restaurant dataset, comparing two approaches: the model using only the initial pseudo-labeling based on topic similarity and the model employing the complete label filtering mechanism. The F1 scores of both methods are shown in Fig. 3. The hierarchical filtering approach yielded noticeable improvements in specific categories: 0.5% for "Food," 0.7% for "Staff," and a substantial 4.2% for "Ambience." Overall, the stepwise filtering mechanism improved the Macro-F1 score by 1.8% compared with the method relying exclusively on topic similarity for pseudo-labeling. This suggests that incorrect pseudo-labels produced by the topic modeling stage exert a substantial influence on model performance. Applying hierarchical filtering to eliminate some of these inaccurate comments increases the reliability of the training samples, thereby improving model performance.

4.4.3 Effect of the number of labeled samples on the model

We further performed experiments on the Restaurant dataset to assess the impact of the amount of labeled data on the model. The experimental setup involved randomly selecting 5, 10, and 15 labeled samples per category. After constructing the training set with the label hierarchical filtering mechanism, we trained the topic-enhanced BERT classifier. The results are presented in Fig. 4, which shows that performance increases to varying degrees for all three aspect categories as more labeled data are added, with the most substantial improvement observed for the "Ambience" category. This indicates that introducing more labeled samples with classification information enables the model to identify additional pseudo-labeled comments related to specific aspect categories, thereby enhancing overall performance.

4.4.4 Validity of topic information

Ablation experiments were also performed on the Restaurant dataset to assess the influence of topic information on the classifier. Using the same constructed training set, we evaluated the BERT classifier with and without topic information augmentation. As shown in Fig. 5, a performance improvement of 1.8% is observed when topic information is appended to each comment sentence. The topic information directs the model's attention toward aspects related to specific themes. The inclusion of specific thematic information helps the model better understand and differentiate the meanings of aspects, thereby enhancing its ability to identify aspect categories.

5. Conclusion

In this paper, we proposed a weakly supervised ACD model comprising the following steps. First, limited labeled data are utilized to assign pseudo-labels to extensive unlabeled data based on topic similarity. Second, filtering rules are applied to select comments labeled with high confidence. Finally, the ACD classifier enhances its ability to identify aspect categories by integrating the high-quality pseudo-labeled dataset with topic terms. The experimental results demonstrated that the proposed model achieves more accurate ACD than existing models and reduces the subjectivity introduced by human intervention.
Ablation experiments demonstrated the effectiveness of the proposed hierarchical label-filtering mechanism. Moreover, incorporating thematic information is useful because it directs the attention of the classifier to words relevant to the aspect categories. In this study, BERT-whitening was employed for semantic similarity computation. In future studies, we plan to combine the understanding capabilities of large language models with hierarchical filtering mechanisms to achieve further improvements.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62162037, 62266027, U21B2027, 62266028, and 62241604), the Yunnan Provincial Major Science and Technology Special Plan Projects (Grant No. 202302AD080003), and the General Projects of Basic Research in Yunnan Province (Grant No. 202301AT070444).

Biography

Yan Xiang
https://orcid.org/0000-0002-6475-638X
She is an associate professor at the Faculty of Information Engineering and Automation at the Kunming University of Science and Technology. She graduated with a B.E. degree in engineering and an M.S. degree in science from Wuhan University, China. She presided over one general basic research project in Yunnan Province, one Yunnan Provincial Department of Education project, and one sub-project of the Yunnan Provincial Major Science and Technology Special Plan. She has published over 20 papers as first or corresponding author. Her primary research interests include text mining and sentiment analysis.