1. Introduction
Case-related news refers to news reports on cases or potential cases. Case-related news filtering aims to quickly extract the news related to a case from news data, which is significant for public-opinion supervision, prevention, and control.
There are two classic approaches to news filtering [1,2]: keyword-based filtering and machine-learning filtering. Initially, researchers matched news text against collections of domain-related keywords using string-matching algorithms such as Knuth-Morris-Pratt (KMP) [3] and Sunday [4]. However, keyword-based filtering relies heavily on the completeness of the keyword dictionary and tends toward high precision but low recall when the dictionary is incomplete; because case-related news reports are complicated and changeable, constructing a complete keyword dictionary is challenging. Machine learning is an effective alternative that makes assumptions about the data distribution of the news categories, using models such as support vector machines (SVM) [5] and decision trees [6]. However, these methods depend on handcrafted feature functions and often suffer from the curse of dimensionality. Deep neural networks learn vector representations from text without handcrafted feature functions, alleviating the dimensionality problems of statistical methods; however, they require a large amount of labeled data, and a lack of labeled training data makes it difficult for text-filtering methods to achieve desirable results. In real-world applications of case-related news filtering, only a negligible number of labeled case-related news samples are available, whereas a large amount of unlabeled news is easy to collect. Therefore, we focus on improving filtering performance by mining unlabeled news samples.
Machine learning using only positive samples is a common challenge in the fields of fake-review detection and recommendation. To address this challenge, researchers have proposed positive-unlabeled (PU) learning, which learns from positive and unlabeled samples [7]. The standard PU learning method is a two-step process [8]. First, it identifies reliable negative samples and trains a classifier using these negatives and the original positive samples. Next, the trained classifier scores the unlabeled examples, and reliable positive and negative samples are selected as new training samples. By repeating these steps, a classifier with higher accuracy can be obtained. This provides an effective training method for datasets with only positive labels. However, the final performance of the classifier largely depends on the initial reliable labeled data. When the initial labeled samples are very scarce, the iteratively trained classifier lacks accuracy; the positive and negative samples it selects may be unreliable, and more data may be misclassified. These unreliable data are then added in later iterations, producing ever more misclassified data and seriously degrading the final classifier. In this study, this phenomenon is referred to as "error accumulation." Error accumulation is caused by insufficient use of text information in the PU learning process. Case-related news contains hidden information, such as topic information, a news attribute that can be obtained using unsupervised methods [9]. By leveraging this information in the initial training and subsequent iterations of PU learning, error accumulation can be effectively alleviated.
In this study, we propose a PU learning method combined with topic information to filter case-related news, which requires only a negligible number of labeled case-related news samples. Our method extracts topic information from the labeled and unlabeled case-related news data through an unsupervised pre-trained topic model and adds this topic information to the initial training and subsequent iterative training of PU learning. In this way, more case-related topic information can be used when the initial labeled samples are few, and topic enhancement is performed in each subsequent training iteration. This allows the classifier trained in each iteration to obtain reliable positive and negative samples from the unlabeled data and improves the performance of the final case-related news classifier.
To conclude, the main contributions are as follows:
· We applied the PU learning method to the case-related news filtering task, effectively tackling case-related news filtering with a negligible number of manual annotations.
· We extracted topic information using a variational autoencoder (VAE) topic model and enhanced the text representation with the learned topic representation during PU learning, which significantly stabilized negative-sample selection.
· We constructed a case-related news dataset and conducted experiments with our method. The results indicate that our method achieves better results than the PU learning method without topic enhancement.
2. Related Work
Recently, in the recommendation-system and spam-review filtering domains, researchers have made a series of achievements in classification using only positive samples, which can be summarized into the following three approaches [10].
The one-class classification approach uses only positive sample data in the training set. Its core idea is to construct a minimal region that approximately covers the training set, with instances outside the region treated as negative samples. Manevitz and Yousef [11] proposed a one-class SVM for text classification. Because this approach completely ignores the unlabeled dataset, the classification information hidden in it is lost; when the unlabeled dataset contains reliable negative samples, the model is prone to overfitting because it ignores this valuable information.
The two-step approach uses positive samples and samples from the unlabeled dataset to build the final classifier. It mainly comprises two steps. First, a heuristic strategy identifies highly credible negative samples in the unlabeled data. Second, these negative samples are combined with the existing positive samples to form a new training set, on which existing classification methods train a classifier. This framework can train classifiers through iterative training. Its disadvantage is that the performance of the final classifier depends significantly on the initial reliable sample data: if their scale is small or their quality is low, the performance of the classifier is limited.
In addition, some researchers have considered training with positive samples and all unlabeled samples. The core idea of these methods is to build a binary classifier that determines the labels of the unlabeled samples, converting the unlabeled dataset into labeled data to be trained together with the known positive samples. Ren et al. [12] proposed a PU-based learning algorithm for fake reviews. Li et al. [13] extended the conventional PU problem to a streaming-data environment and proposed a clustering-based PU learning algorithm. Xiao et al. [14] proposed a similarity-based PU learning algorithm: positive samples are first used to extract reliable negative samples from the unlabeled dataset; then, based on the positive and extracted negative samples, the probabilities that the remaining unlabeled samples are positive or negative are calculated, and a probability-weighted SVM classifier is built on these data.
Topic models mainly adopt Gibbs sampling, variational inference, non-negative matrix factorization, and other machine-learning algorithms to infer latent topics from high-dimensional sparse text feature spaces [15]. The VAE is an encoding-decoding network proposed by Kingma and Welling [16] in 2014. For topic modeling, Miao et al. [17] first used a VAE to build a neural variational document model and, on this basis, considered the topic-word distribution, forming topic models based on a neural autoencoder structure [18-20]. The VAE is an unsupervised model: it needs no labeled data but constructs an optimization function and trains the model by reconstructing the data, which suits the application scenario of this task very well. Therefore, we choose the VAE to model the topics of case-related news and integrate the extracted unsupervised topics into the iterative process of PU learning, to alleviate the "error accumulation" challenge in PU learning and improve the performance of case-related news filtering.
3. Case-Related News Filtering Method
Based on PU learning under a neural-network framework, we propose a topic-enhanced PU learning method and combine it with a case-related news filtering model to improve filtering performance. The method can be divided into training, prediction, and iterative processes, as illustrated in Fig. 1.
Fig. 1. A classification model of case-related news that integrates topics to enhance PU learning.
3.1 VAE Topic Model
The VAE is an unsupervised document generation model, as shown in Fig. 2. Its purpose is to extract latent features from the word-vector space of documents, which we refer to as topic features. Gururangan et al. [9] used a VAE to extract topics to assist text classification tasks. Following previous work and VAE principles, we implemented this VAE structure and pre-trained it on the entire case-related news dataset in an unsupervised manner.
The VAE has an encoder-decoder architecture. The encoder compresses the input into a latent distribution Z, whereas the decoder reconstructs the input D by sampling from the distribution of Z in the latent space.
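The sampling equation referenced below is not reproduced in this version; a plausible reconstruction using the standard VAE reparameterization, consistent with the definitions that follow, is:

$$\mu^{(i)}=\operatorname{MLP}_{\mu}\left(d^{(i)}\right), \quad \delta^{2(i)}=\operatorname{MLP}_{\delta}\left(d^{(i)}\right), \quad Z^{(i)}=\mu^{(i)}+\delta^{(i)} \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$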
where $d^{(i)}$ represents a real sample in D, and $\mu$ and $\delta^{2}$ are generated from $d^{(i)}$ by a neural network. From the obtained $\mu^{(i)}$ and $\delta^{2(i)}$, the distribution $P(Z^{(i)} \mid d^{(i)})$ corresponding to each $d^{(i)}$ can be obtained, and $\tilde{d}^{(i)}$ can be reconstructed by the decoding network: $\tilde{d}^{(i)}=\operatorname{Decode}(Z^{(i)})$. We used a multilayer perceptron (MLP) to generate $\mu$ and $\delta^{2}$ and to implement the decoding network.
Here, $m$ represents the preset number of latent topics. After the above calculation, the latent topic distribution of a news text in this study can be expressed as $\vec{z}=\left\{Z^{(1)}, Z^{(2)}, \ldots, Z^{(m)}\right\}$.
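As a concrete illustration of this structure, the following is a minimal PyTorch-style sketch of the VAE topic model (the hidden size, layer names, and the binary bag-of-words input are our own assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAETopicModel(nn.Module):
    """Minimal VAE over binary bag-of-words document vectors; Z is read as the topic vector."""
    def __init__(self, vocab_size, num_topics, hidden=256):
        super().__init__()
        self.enc = nn.Linear(vocab_size, hidden)        # encoder MLP
        self.fc_mu = nn.Linear(hidden, num_topics)      # produces mu
        self.fc_logvar = nn.Linear(hidden, num_topics)  # produces log(delta^2)
        self.dec = nn.Linear(num_topics, vocab_size)    # decoding network

    def forward(self, d):                               # d: (batch, vocab_size)
        h = F.relu(self.enc(d))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar, z

def vae_loss(d, d_recon, mu, logvar):
    # reconstruction error plus KL divergence to the standard normal prior
    recon = F.binary_cross_entropy_with_logits(d_recon, d, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Because the model is trained purely by reconstruction, the entire case-related news dataset can be used without labels; after pre-training, the $\vec{z}$ produced by the encoder serves as the topic representation.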
3.2 Topic-Enhanced Positive-Unlabeled Learning
The PU learning process comprises three steps: training, prediction, and iteration. In the training and iterative processes, we use the topics obtained by the unsupervised topic model for guidance and enhancement. Because our main improvements concern the training process, we describe it first.
3.2.1 Training and predicting of PU learning
Training the classifier is the main training process of the PU learning method, and the unsupervised topic model is used to enhance it. The dataset contains only a small amount of labeled case-related news data together with unlabeled data. Therefore, extracting reliable non-case-related news data from the unlabeled data is the first problem the algorithm must solve, after which the initial training is executed on the combined data.
To extract reliable non-case-related news data from unlabeled data, we use an improved version of the I-DNF algorithm [21], which exploits the different frequencies of text features in the case-related and unlabeled sample sets to extract a non-case-related news set. We used I-DNF to obtain counterexamples of the same scale as the initial case-related news and then trained the initial classifier. For the classifier, a variety of machine-learning algorithms or deep networks can be used; our method uses an embedding layer followed by a bidirectional long short-term memory (BiLSTM) network [22].
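A minimal sketch of this frequency-difference extraction, in the spirit of 1-DNF (the whitespace tokenization and the frequency-ratio threshold are illustrative assumptions):

```python
from collections import Counter

def extract_reliable_negatives(pos_docs, unl_docs, ratio=2.0):
    """Treat a word as a 'positive feature' if its document frequency in the
    positive set is clearly higher than in the unlabeled set; unlabeled docs
    containing no positive feature are returned as reliable negatives."""
    pos_df = Counter(w for doc in pos_docs for w in set(doc.split()))
    unl_df = Counter(w for doc in unl_docs for w in set(doc.split()))
    n_pos, n_unl = len(pos_docs), len(unl_docs)
    pos_features = {w for w, c in pos_df.items()
                    if c / n_pos > ratio * unl_df.get(w, 0) / n_unl}
    return [doc for doc in unl_docs if not (pos_features & set(doc.split()))]
```

In our setting, the returned list would then be truncated to the same scale as the initial case-related news set before the initial classifier is trained.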
First, the embedding layer maps the sparse representation of the raw data into a high-dimensional space, forming a dense matrix with semantic representation. The continuous bag-of-words (CBOW) structure of word2vec is used to build the embedding layer [23]. The text is first segmented, and the position code of each word is obtained from the dictionary. The word-embedding vector $\vec{x}$ of each word is obtained through embedding, and the vectors are combined into the word-vector matrix of the text $X=\left\{\vec{x}_{1}, \vec{x}_{2}, \ldots, \vec{x}_{n}\right\} \in R^{n \times v}$, where $n$ is the length of the news text and $v$ is the word-vector dimension. In addition, the input text is passed through the VAE topic model to obtain the topic vector $\vec{z} \in R^{m}$ of the news text, where $m$ is the preset number of topics. After obtaining these two encodings, the news topic vector $\vec{z}$ is used to guide the word-embedding matrix $X$. Because the topic vector produced by the case-related topic model has shape $1 \times m$, we make $n$ copies of it and splice them onto the rows of $X$; the resulting matrix $X^{\prime}$ is the news representation fused with the topic vector.
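The tiling-and-splicing step can be written compactly; a sketch under the notation above (shapes follow the text, the function itself is illustrative):

```python
import torch

def fuse_topic(X: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """X: word-embedding matrix of shape (n, v); z: topic vector of shape (m,).
    Returns X' of shape (n, v + m): z is copied n times and spliced onto each row."""
    n = X.size(0)
    z_tiled = z.unsqueeze(0).expand(n, -1)  # n copies of the 1 x m topic vector
    return torch.cat([X, z_tiled], dim=-1)
```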
Bidirectional long short-term memory (BiLSTM) is a temporal network well suited to data with strong sequential dependencies, such as text. It has three gating mechanisms (the forget, input, and output gates) to alleviate the vanishing gradient and capture long-term contextual dependencies. We feed the topic-fused news representation $X^{\prime}$ into the BiLSTM layer to model its context and obtain the news semantic representation vector. The specific formula is as follows:
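(The original equation is not reproduced here; a plausible reconstruction from the surrounding definitions, in which the mean pooling and the projection parameters $W_{o}$ and $b_{o}$ are our assumptions, is:)

$$H=\operatorname{BiLSTM}\left(X^{\prime}\right) \in R^{n \times 2 q}, \qquad y=\sigma\left(W_{o} \bar{h}+b_{o}\right), \quad \bar{h}=\frac{1}{n} \sum_{i=1}^{n} H_{i}$$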
where $H$ is the sentence vector encoded by the BiLSTM, $q$ is the hidden-layer dimension of the BiLSTM, and $y$ represents the final probability output. Our method uses this classifier to predict the remaining unlabeled data and sorts the predicted probabilities for the unlabeled news from high to low. In each prediction round, the samples with the highest probabilities are taken, according to a fixed iteration step, as reliable case-related news samples, and the samples with the lowest probabilities as reliable negative samples; these are removed from the unlabeled set and added to the training data for the subsequent iterative training process. Once all case-related news samples have been extracted, the remaining samples are treated as negative.
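A minimal sketch of this probability-ranking selection (assuming the unlabeled pool is larger than twice the iteration step):

```python
def select_reliable(probs, docs, step):
    """Rank unlabeled docs by predicted case-related probability; take the top
    `step` as reliable positives and the bottom `step` as reliable negatives."""
    order = sorted(range(len(docs)), key=lambda i: probs[i], reverse=True)
    reliable_pos = [docs[i] for i in order[:step]]
    reliable_neg = [docs[i] for i in order[-step:]]
    remaining = [docs[i] for i in order[step:-step]]
    return reliable_pos, reliable_neg, remaining
```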
The training process of PU learning.
3.2.2 PU learning iterative algorithm
Fig. 3 illustrates the iteration process of PU learning, where Steps 1-3 describe the pre-training of the topic model and the first acquisition of positive and negative samples, and Steps 4-16 describe the iteration process of the entire PU learning. After completing the initial training and prediction, PU learning retrains the classifier on the newly obtained training set and repeats the entire prediction and training process. The training and prediction in each iteration are identical to the initial ones; the only difference is that the number of unlabeled samples decreases and the training set grows after each iteration. When all unlabeled data have been assigned as reliable samples, the iteration process is complete. Finally, all samples are used to train the classifier, and the resulting model is the case-related classification model used in this study. The specific algorithm flow is as follows.
PU iterative algorithm pseudocode
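Since the pseudocode figure itself is not reproduced here, the following Python sketch renders the flow described above (the helpers `train_classifier` and `predict_proba` are hypothetical; `extract_reliable_negatives` and `select_reliable` refer to the earlier sketches):

```python
def pu_learning_loop(positives, unlabeled, step):
    # Steps 1-3: the topic model is assumed pre-trained; obtain the first
    # reliable negatives via the 1-DNF-style extraction, trimmed to the
    # same scale as the initial positives.
    negatives = extract_reliable_negatives(positives, unlabeled)[:len(positives)]
    neg_set = set(negatives)
    unlabeled = [d for d in unlabeled if d not in neg_set]

    # Steps 4-16: retrain, predict, and peel off reliable samples until
    # the unlabeled pool is exhausted.
    while len(unlabeled) > 2 * step:
        clf = train_classifier(positives, negatives)   # topic-enhanced BiLSTM
        probs = [clf.predict_proba(d) for d in unlabeled]
        new_pos, new_neg, unlabeled = select_reliable(probs, unlabeled, step)
        positives += new_pos
        negatives += new_neg

    # Remaining samples are treated as negatives once the positives are
    # exhausted; train the final case-related classification model.
    negatives += unlabeled
    return train_classifier(positives, negatives)
```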
4. Experiment
To evaluate the performance of our model, we conducted three experiments on the case-related news dataset. The first compared our method with the PU classification algorithm without topic enhancement; we also analyzed their prediction performance during iterative training. In addition, we conducted comparative experiments on initial datasets of different scales and on different iteration steps, verifying the effectiveness of our method against the PU classification algorithm without topic enhancement under each setting. The experimental results validate the effectiveness of our method on the relevance analysis of case-related news and illustrate that topic information enhances the PU learning iterative process and improves model performance.
4.1 Dataset
We use the categories provided in Table 1 to define the scope of case-related news. We constructed a dataset by crawling relevant news data from microblogs, the Tianya forum, and other websites. The news texts in this dataset are approximately 100 to 250 characters long. To facilitate experimental verification, we manually labeled all the data: 10,000 case-related and 20,000 non-case-related news items. During the experiments, the labeled data were treated as unlabeled data, with their labels reserved for accurate evaluation.
Table 1. Categories of case-related news
4.2 Parameter Setting and Evaluation Metrics
This study sets the maximum text length to 200 characters. The Adam algorithm is used as the optimizer, the learning rate is set to 0.001, the dropout of the single-layer BiLSTM is set to 0.2, the batch size is set to 128, and the number of training epochs is set to 20; the number of training iterations is the ratio of the total amount of unlabeled data to the number of positive and negative samples extracted each time. Our evaluation metrics are accuracy (Acc), precision (P), recall (R), and F1. Accuracy is the proportion of correctly predicted samples among all samples; precision is the proportion of predicted positive samples that are truly positive; recall indicates how many of the true positive samples are predicted correctly; and the F1 value is the harmonic mean of precision and recall. In addition, we use the error rate to analyze the verification results.
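In terms of the confusion-matrix counts (true/false positives $TP$/$FP$ and true/false negatives $TN$/$FN$), these metrics are:

$$\mathrm{Acc}=\frac{T P+T N}{T P+F P+T N+F N}, \quad P=\frac{T P}{T P+F P}, \quad R=\frac{T P}{T P+F N}, \quad F 1=\frac{2 P R}{P+R}$$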
4.3 Experimental Results and Analysis
First, we compare our method with the PU learning baseline model in two groups of experiments and verify its effectiveness with a small amount of case-related news data. Subsequently, we compare our method with advanced PU learning methods, which indicates that our method is competitive. In addition, we test two important parameters in two groups of experiments and find that, when the initial data scale and iteration step are fixed, our method outperforms the PU baseline model in each iteration; when the initial data scale is small or the iteration step is large, the improvement is larger and more stable.
4.3.1 Comparative experiment with PU learning baseline model
Because our method mainly improves conventional two-step PU learning, this experiment compares our method with the PU learning baseline model on the case-related news dataset. We established two experimental groups: one used a reserved validation set to evaluate the generalization performance of the classifier trained in each iteration, and the other evaluated the performance of that classifier on the remaining unlabeled samples. The experimental results are illustrated in Fig. 4, where the x-axis represents the number of iterations and the y-axis the value of the evaluation metric. In the experiment, the number of initial case-related news samples was preset to 1000, and 1000 non-case-related samples were extracted; the comparisons were with PU learning without topic enhancement and with a conventional classification model. Here, "PU learning" refers to the PU learning method without a topic, using the same classifier as ours. The iteration step was set to 500, all other parameters of the two experimental groups were identical, and the evaluation metric was the F1 value.
Fig. 4. Results of the PU learning baseline and our method on different data: (a) evaluation on the reserved validation set and (b) evaluation on the remaining unlabeled data.
We evaluated our method on the reserved validation set, and the results are illustrated in Fig. 4(a). As can be observed, the F1 value of PU learning was 73.9%, whereas our method reached 75.7%, 1.8% higher. Fig. 4 illustrates that our method achieves effective results when only a few case-related news samples are available. The curve of our method is more stable and shows better overall performance than PU learning, and the classifier trained in each iteration gains a certain improvement in accuracy. These results indicate that the topic plays an enhancing role in PU learning, not only in the first training round but in every iteration, effectively relieving "error accumulation" in PU learning.
Fig. 4(b) illustrates the evaluation results of our method on the "unlabeled datasets" in the first nine training iterations. These "unlabeled datasets" were labeled manually but used as unlabeled data in the prediction process. As can be observed, our method predicts unlabeled data better than conventional PU learning, and the gap widens as the number of iterations increases. In the first to fifth iterations, our method is only slightly better than conventional PU learning, but in subsequent iterations, as the training data grow, the gap gradually widens. By the seventh iteration, the F1 value of our method on the unlabeled dataset is approximately 11.5% ahead of conventional PU learning. The reason for this phenomenon is that our method uses the text information more effectively and enhances the conventional PU learning process layer by layer.
4.3.2 Comparative experiment with advanced PU learning method
The main purpose of this experiment was to compare our method with other advanced PU learning methods. We chose two recent PU learning methods for comparison. The nnPU model [24] is a classical PU learning model; its idea is to reweight the unlabeled samples through an improved loss function and finally obtain an unbiased optimal solution. The I-PU model [25] calculates probabilities from the similarity between unlabeled and positive samples in an ensemble-learning manner, thereby labeling unlabeled samples and generating multiple datasets to train different models. The experimental results are listed in Table 2.
Table 2. Comparative experiment with advanced PU learning methods (unit: %)
As presented in Table 2, our method outperforms the two advanced methods on the case-related news dataset; its F1 value is 1.1% ahead of the I-PU model and 8.4% ahead of the nnPU model. On the one hand, the case-related news dataset has strong case-related topical characteristics, and extracting the topic to strengthen the iterative process gives the classifier good domain characteristics. On the other hand, our method adopts an iterative training process, which exceeds the other two methods in training duration and data utilization. Overall, our method is fairly competitive with other advanced methods in terms of performance.
4.3.3 Comparative experiment of different experimental parameters
The main purpose of this experiment was to observe the improvement of our method over conventional PU learning under different experimental parameters. We chose two important parameters: the initial data scale and the iteration step size. The experimental results are illustrated in Fig. 5.
We compare our method with PU learning under different initial data scales, and the results are illustrated in Fig. 5(a). We set five initial data scales: 500, 750, 1000, 1500, and 2000, with the iteration step fixed at 500. The x-axis represents the different data scales, and the figure presents the evaluation results obtained by iterating over the unlabeled data. When the initial data scale is merely 500, PU learning has already failed, as illustrated in Fig. 5(a). This failure occurs because PU learning depends on the scale of the initial labeled data: if the initial data scale is too small, the trained classifier lacks precision, and the low precision produces "reliable" positive and negative samples with a large bias in the subsequent prediction process. As the iterations proceed, this bias accumulates and leads to the failure of PU learning. With an increase in the initial data scale, the bias of each iteration is smaller and the final result better, a common phenomenon in PU learning that our method follows as well. Compared to PU learning, however, our method adapts better to small initial data. When the initial data size is only 750, the F1 gap between our method and conventional PU learning reaches 9.4%. As the initial data scale increases, the gap shrinks. This result indicates that our method uses the available information more effectively at small initial scales.
Fig. 5. Results with (a) initial data of different scales and (b) different iteration steps.
We compare our method with PU learning under different iteration steps, and the results are illustrated in Fig. 5(b). We set five iteration steps: 300, 500, 750, 1000, and 1500, with the initial data scale fixed at 1000; the x-axis represents the different iteration steps, and the figure presents the evaluation results obtained by iterating over the unlabeled data. As illustrated in Fig. 5(b), both our method and conventional PU learning initially maintain a good level of performance. As the iteration step increases, the performance of PU learning decreases, whereas our method holds up well. In either method, the classifier trained in each iteration has a basic level of precision; it predicts the unlabeled samples and sorts them by probability, which yields a higher density of reliable samples at both ends of the ranking. With a small iteration step, both methods select positive and negative samples well, so the bias of the selected samples is small. However, as the step increases, the positive- and negative-sample bias of PU learning grows and its performance drops. Our method shows an obvious drop only when the step reaches 1000, which indicates that PU learning combined with the topic has better adaptability. When the iteration step reaches 1500, our method fails, and so does PU learning: with an initial data scale of 1000, the classifier trained by PU learning has limited precision, and even topic enhancement cannot reach the required precision.
5. Conclusion
We proposed a filtering method for case-related news that combines topic information with PU learning. The performance of the final PU learning classifier strongly depends on the initial labeled data; with only a small amount of case-related news, the accuracy of case-related news filtering decreases. Therefore, we developed a topic-enhanced PU learning method that extracts topic information from news data through unsupervised pre-training, so that the trained PU classifier has higher accuracy. Through repeated iterative training of the obtained classifier, the performance of the case-related news filtering model was improved: on the test set, our method was 1.8% ahead of the PU learning baseline model in F1 value and 1.1% better than the advanced PU learning baseline. In the future, this method will be further optimized and applied to case-related public-opinion analysis.
Acknowledgement
This study was supported by the National Key Research and Development Project (No. 2018YFC0830100) and the Science and Technology Plan Projects of Yunnan Province (No. 202001AT070046).