Article Information
Corresponding Author: Dong-Ho Lee* (danny911kr@naver.com)
Dong-Ho Lee*, Dept. of Computer Education, Sungkyunkwan University, Seoul, Korea, danny911kr@naver.com
Yu-Ri Kim**, Dept. of Industrial & Management Engineering, Hansung University, Seoul, Korea, dbflek620@naver.com
Hyeong-Jun Kim***, Dept. of Computer Science, Yonsei University, Seoul, Korea, hjkim2246@gmail.com
Seung-Myun Park****, Dept. of Information System, Hanyang University, Seoul, Korea, psm5moto@gmail.com
Yu-Jun Yang*****, Dept. of Software, Gachon University, Seongnam, Korea, defr5623@gmail.com
Received: June 22, 2018
Revision received: September 20, 2018
Accepted: October 31, 2018
Published (Print): October 31, 2019
Published (Electronic): October 31, 2019
1. Introduction
Since 2010, social network services (SNSs) such as Facebook and Twitter have become widespread, and fake news, a form of false information disguised as legitimate media, has begun to spread through them. Fake news had a significant impact on voting decisions in the 2016 US Presidential Election and became a hot topic [1]; on Facebook, it was mainly used during the election to support a particular candidate [2]. Mainstream media around the world united to provide readers with a confidence index for articles and employed people to monitor fake news in order to prevent its spread [3]. There have also been various technical attempts to solve this problem, for example, artificial intelligence (AI)-based detection methods and methods that detect the abnormal diffusion pattern of fake news propagation [4]. AI-based detection uses models trained on data and is typically framed as a machine learning-based natural language processing (NLP) task. Several previous works using such models, including neural networks and decision trees, have achieved accuracy above 80% [5,6].
However, two issues prevent these works from being applied directly to Korean: (1) Korean expresses the same meaning in shorter sentences than English, so a deep neural network is difficult to train because fewer features are available for deep learning; and (2) semantic analysis is difficult because of morpheme ambiguity. We resolve these issues and propose a fake news detection model suited to Korean by implementing a system that uses several convolutional neural network (CNN)-based deep learning architectures together with “Fasttext,” a word embedding model trained at the syllable level. Among the various types of fake news, we detect so-called “click-bait” articles. In this paper, mission1 denotes the case in which the headline and body are inconsistent, and mission2 the case in which the content of the body is irrelevant to the context.
2. Related Work
In this paper, we apply and adapt various mechanisms based on “Fasttext” [7] and the “Shallow-and-wide CNN” [8] to implement a model for detecting fake news. This section introduces the previous works on which our models are built.
2.1 Word Embedding
Word embedding is a method of mapping words or phrases to vectors of real numbers. The traditional method, “discrete representation,” uses a “one-hot vector” that consists of 0s in all dimensions except for a single 1 in the dimension that represents the word. However, discrete representation does not reflect context and has difficulty handling synonyms and antonyms. More recently, “distributed representation” has emerged as a way to represent words in a continuous vector space in which all dimensions contribute to the representation of the word. This paper introduces and applies “Word2vec” and “Fasttext” among the various representations.
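To make the contrast concrete, the following minimal Python sketch compares a one-hot (discrete) representation with a toy distributed representation; the vocabulary and the dense vector values are purely illustrative and are not taken from any trained model.

```python
# Minimal sketch: discrete (one-hot) vs. distributed (dense) word representations.
import numpy as np

vocab = ["king", "queen", "apple"]            # toy vocabulary (illustrative only)
one_hot = np.eye(len(vocab))                  # one 1 per word, 0s in every other dimension
print(dict(zip(vocab, one_hot.tolist())))

# Distributed representation: every dimension carries part of the meaning,
# so semantically related words can end up close to each other.
dense = {"king":  np.array([0.70, 0.30, -0.10]),
         "queen": np.array([0.68, 0.35, -0.05]),
         "apple": np.array([-0.40, 0.90, 0.60])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(dense["king"], dense["queen"]))  # high similarity
print(cosine(dense["king"], dense["apple"]))  # low similarity
```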
2.1.1 Word2vec
“Word2vec” learns word embeddings using a neural network; it has two model architectures for learning distributed representations of words: continuous bag-of-words (CBOW) and Skip-gram. The Skip-gram architecture is widely used because it works better on semantic tasks than the CBOW model [9]. Skip-gram uses each current word $w_t$ as the input to the model and predicts the words within a certain range before and after it ($w_{t-k}, \ldots, w_{t+k}$). It maximizes the classification of a word based on another word in the same sentence, so similar words obtain similar vectors and their similarity increases [10]. Given a sequence of training words $w_1, \ldots, w_T$ and a training context of size $c$, the objective of the Skip-gram model is to maximize the average log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\; j \neq 0} \log p\left(w_{t+j} \mid w_{t}\right). \tag{1}$$
The basic Skip-gram formulation defines $p\left(w_{t+j} \mid w_{t}\right)$ using the softmax function as follows:

$$p\left(w_{O} \mid w_{I}\right)=\frac{\exp\left({v'_{w_{O}}}^{\top} v_{w_{I}}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_{I}}\right)}, \tag{2}$$

where $v_{w}$ and $v'_{w}$ are the “input” and “output” vector representations of $w$, and $W$ is the number of words in the vocabulary [11].
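As an illustration of how Skip-gram embeddings can be trained in practice, the following sketch uses the gensim library (assuming gensim 4.x, where the dimensionality parameter is named vector_size); the two toy sentences and all hyperparameter values are placeholders, not the settings used in this paper.

```python
# Minimal sketch: training Skip-gram (sg=1) word vectors with gensim.
from gensim.models import Word2Vec

sentences = [
    ["the", "fake", "news", "spread", "on", "social", "media"],
    ["the", "article", "headline", "did", "not", "match", "the", "body"],
]

model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1, epochs=50)
print(model.wv["news"][:5])                   # first 5 dimensions of the learned vector
print(model.wv.most_similar("news", topn=3))  # nearest neighbours in the toy corpus
```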
2.1.2 Fasttext
“Fasttext” adds the concept of sub-words to “Word2vec”: each word is represented as the sum of its character n-gram vectors and the vector of the word itself. Taking the word apple and n = 3 as an example, it is represented by the character n-grams <ap, app, ppl, ple, le> and by the word itself <apple>. The reason the set contains <ap and le>, which look like 2-grams, is that special boundary symbols < and > are added at the beginning and end of each word to distinguish prefixes and suffixes from other character sequences. The formulation is as follows: suppose we are given a dictionary of n-grams of size $G$. Given a word $w$, let $G_{w} \subset\{1, \ldots, G\}$ denote the set of n-grams appearing in $w$, and let $z_{g}$ be the vector representation associated with each n-gram $g$. Then $v_{w}$ in Eq. (2), the “input” vector representation of the input word, can be represented as follows [7]:

$$v_{w}=\sum_{g \in G_{w}} z_{g}. \tag{3}$$
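The character n-gram decomposition described above can be sketched in a few lines of Python; this is only an illustration of the idea for n = 3, reproducing the apple example, and is not fastText's actual implementation.

```python
# Minimal sketch: character n-grams with boundary symbols, as used by fastText sub-words.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"                             # boundary symbols < and >
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [padded]                               # n-grams plus the whole word

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>', '<apple>']

# Eq. (3) in words: the word vector v_w is the sum of the vectors z_g of all
# n-grams g in G_w, e.g.  v_w = sum(z[g] for g in char_ngrams("apple")).
```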
2.2 Shallow-and-Wide CNN
The model architecture shown in Fig. 1 is the “Shallow-and-wide CNN” architecture of Kim [8]. The first layer is a look-up table, a set of k-dimensional word vectors each corresponding to the i-th word in the sentence. A convolution operation with multiple filter widths is then applied, followed by a max-over-time pooling operation. Finally, these features are passed to a fully connected layer and the prediction is made with a softmax layer.
Fig. 1. Shallow-and-wide CNN architecture [8].
The architecture has two channels of word vectors: one named “static” that is kept fixed throughout training and one named “non-static” that is fine-tuned via backpropagation. Previous work conducted sentiment analysis on a dataset of short sentences; the “static” and “non-static” results were comparable, but the “non-static” channel allows the words to attain more meaningful representations [8]. However, if only the “non-static” channel is used, the model can overfit to new words. Therefore, both channels are used to preserve the generality of word meanings.
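The following is a minimal Keras sketch of a shallow-and-wide text CNN in the spirit of Kim [8], using a single trainable (“non-static”) channel; the vocabulary size, sequence length, filter widths, and filter counts are illustrative choices, not the settings used in this paper.

```python
# Minimal sketch: shallow-and-wide CNN for sentence classification (single channel).
from tensorflow.keras import layers, Model

vocab_size, seq_len, emb_dim, num_classes = 20000, 100, 128, 2

inp = layers.Input(shape=(seq_len,), dtype="int32")
emb = layers.Embedding(vocab_size, emb_dim)(inp)       # trainable ("non-static") look-up table

# One shallow convolution per filter width, each followed by max-over-time pooling.
pooled = []
for width in (3, 4, 5):
    conv = layers.Conv1D(100, width, activation="relu")(emb)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

feat = layers.Dropout(0.5)(layers.Concatenate()(pooled))
out = layers.Dense(num_classes, activation="softmax")(feat)

model = Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```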
2.3 Attentive Pooling
The model architecture shown in Fig. 2 is the “attentive pooling” architecture of Santos et al. [12]. Attention mechanisms have recently been used successfully for image captioning [13] and machine translation [14], but there had been few studies applying attention to NLP tasks with two inputs, such as pair-wise ranking or text classification. “Attentive pooling” improved performance on these tasks by effectively representing the similarity of the two inputs [12]. Whereas the term frequency-inverse document frequency (TF-IDF) method measures similarity statistically from the frequency of words in a document, this model measures similarity by increasing the weights of words whose meanings are the same or similar across the two inputs.
Fig. 2. Attentive pooling networks for answer selection [12].
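The following NumPy sketch illustrates the core attentive pooling computation for two inputs (here a headline and a body); the dimensions and the random interaction matrix U, which would be learned in practice, are illustrative only.

```python
# Minimal sketch: attentive pooling over two inputs, in the spirit of Santos et al. [12].
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 8, 5, 12                        # hidden size, headline length, body length
H = rng.normal(size=(d, m))               # column i: representation of headline token i
B = rng.normal(size=(d, n))               # column j: representation of body token j
U = rng.normal(size=(d, d))               # bilinear interaction parameters (learned in practice)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

G = np.tanh(H.T @ U @ B)                  # (m, n) soft alignment between the two inputs
attn_h = softmax(G.max(axis=1))           # importance of each headline token w.r.t. the body
attn_b = softmax(G.max(axis=0))           # importance of each body token w.r.t. the headline

r_h, r_b = H @ attn_h, B @ attn_b         # attention-weighted representations
similarity = float(r_h @ r_b / (np.linalg.norm(r_h) * np.linalg.norm(r_b)))
print(similarity)
```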
2.4 Bi-LSTM
Long short-term memory (LSTM) is a structure that learns how much of the previous network state to apply when new input data is received. It resolves the long-term dependency problem of the conventional recurrent neural network (RNN) by using, in addition to the hidden state, a cell state, a memory that stores past input information, together with gates that regulate how information is removed from or added to the cell state. The multiplicative gates and memory are defined for time $t$ as follows [15]:

$$f_{t}=\sigma\left(W_{f} \cdot\left[h_{t-1}, x_{t}\right]+b_{f}\right)$$
$$i_{t}=\sigma\left(W_{i} \cdot\left[h_{t-1}, x_{t}\right]+b_{i}\right)$$
$$o_{t}=\sigma\left(W_{o} \cdot\left[h_{t-1}, x_{t}\right]+b_{o}\right)$$
$$\tilde{C}_{t}=\tanh\left(W_{c} \cdot\left[h_{t-1}, x_{t}\right]+b_{c}\right)$$
$$C_{t}=f_{t} \odot C_{t-1}+i_{t} \odot \tilde{C}_{t}$$
$$h_{t}=o_{t} \odot \tanh\left(C_{t}\right)$$
where $\sigma(\cdot)$ is the sigmoid function and $f_{t}, i_{t}, o_{t}, C_{t},$ and $h_{t}$ are the vectors of the forget gate, input gate, output gate, memory cell, and hidden state, respectively; all of these vectors have the same size. Moreover, $W_{f}, W_{i}, W_{o},$ and $W_{c}$ denote the weight matrices of each gate and $b_{f}, b_{i}, b_{o},$ and $b_{c}$ denote the bias vectors of each gate. Another shortcoming of the conventional RNN is that it can only make use of previous context [16]. To resolve this, the bidirectional RNN (Bi-RNN) stacks two RNN layers: in addition to the forward RNN, which passes on only previous information, a backward RNN that receives subsequent information is stacked on top, as shown in Fig. 3. Combining the Bi-RNN structure with LSTM gives the bidirectional LSTM (Bi-LSTM), which can handle long-range context in both input directions [16].
Fig. 3. Bidirectional RNN. Adapted from Graves et al., “Speech recognition with deep recurrent neural networks,” Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645-6649, with the permission of IEEE [16].
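As a concrete illustration, a bidirectional LSTM encoder can be sketched in Keras as follows; the sequence length, embedding dimension, and hidden size are illustrative values.

```python
# Minimal sketch: a Bi-LSTM layer that reads the sequence in both directions.
from tensorflow.keras import layers, Model

seq_len, emb_dim, hidden = 100, 128, 64

inp = layers.Input(shape=(seq_len, emb_dim))
# Forward and backward LSTM outputs are concatenated at every time step,
# so each position sees both previous and subsequent context.
enc = layers.Bidirectional(layers.LSTM(hidden, return_sequences=True))(inp)

model = Model(inp, enc)
model.summary()   # output shape: (None, 100, 128), i.e., 2 * hidden
```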
3. Model Architecture
This paper modifies and combines “Fasttext” and the “Shallow-and-wide CNN” to implement a fake news detection model. To detect so-called “click-bait” articles among the various types of fake news, we need to understand the consistency and relevance between the headline and the body of an article. To do this, we extract a global feature vector from the headline and from the body, respectively, and compare the two vectors. Several extraction methods exist, such as TF-IDF and RNNs, but because the overall meaning of a text is determined by a few key words, we use a CNN, which can extract the most salient local features to form a fixed-length global feature vector [17]. These features are then passed to a fully connected layer and the prediction is made with a softmax layer. We call this model BCNN (Bi-CNN) because convolution and pooling are applied to both inputs of the model, the headline and the body. We also try to improve the accuracy by implementing new models that apply LSTM/Bi-LSTM and attentive pooling to BCNN. In this section, we first apply “Word2vec” and “Fasttext,” two representative word embedding techniques, to Korean and compare their accuracy. We then introduce several BCNN models that use the better-performing word embedding technique.
3.1 Word Embedding
We train “Word2vec” and “Fasttext” on 100K articles to find a word embedding suitable for Korean; the results are shown in Table 1.
This paper uses “Fasttext” because it performs better in terms of accuracy.
Table 1. Test results for “Word2vec” and “Fasttext”
3.2 BCNN
BCNN is a CNN with two inputs and the pre-trained word embedding “Fasttext,” as shown in Fig. 4. It extracts feature maps from the headline and the body using 3-gram (width-3) filters in the convolution layer. The number of filters is set proportionally, 256 filters for the headline and 1024 filters for the body, considering the large difference in the amount of text between them. Each feature map is then reduced to a single value by the max-pooling layer; this is the process of forming fixed-length global vectors for the headline and the body. Finally, classification is performed through the fully connected layer. We use the rectified linear unit (ReLU) as the activation function and the softmax function as the output function. We use the “static” channel, which keeps the pre-trained “Fasttext” word embedding fixed throughout training.
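The following Keras sketch outlines the BCNN structure described above: two input branches with width-3 (3-gram) convolution filters, 256 for the headline and 1024 for the body, max pooling, concatenation, and a softmax classifier. The random embedding_matrix, vocabulary size, and sequence lengths are placeholders standing in for the pre-trained fastText vectors, not the authors' exact configuration.

```python
# Minimal sketch: BCNN, a two-input CNN over headline and body with a "static" embedding.
import numpy as np
from tensorflow.keras import layers, Model, initializers

vocab_size, emb_dim = 20000, 100
head_len, body_len = 20, 1000
embedding_matrix = np.random.rand(vocab_size, emb_dim)   # stand-in for fastText vectors

def branch(seq_len, n_filters):
    inp = layers.Input(shape=(seq_len,), dtype="int32")
    emb = layers.Embedding(vocab_size, emb_dim,
                           embeddings_initializer=initializers.Constant(embedding_matrix),
                           trainable=False)(inp)                 # "static" channel
    conv = layers.Conv1D(n_filters, 3, activation="relu")(emb)   # 3-gram filters
    pooled = layers.GlobalMaxPooling1D()(conv)                   # fixed-length global vector
    return inp, pooled

head_in, head_vec = branch(head_len, 256)
body_in, body_vec = branch(body_len, 1024)

merged = layers.Concatenate()([head_vec, body_vec])
out = layers.Dense(2, activation="softmax")(merged)              # real vs. fake

model = Model([head_in, body_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```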
Fig. 5. LSTM/Bi-LSTM + BCNN architecture.
3.4 BCNN with Attentive Pooling Similarity
BCNN with Attentive Pooling Similarity (APS-BCNN) concatenates a similarity vector between the two inputs, computed with attentive pooling, to the fixed-length vectors derived from max-pooling in BCNN, as shown in Fig. 6. We expect improved performance because the similarity vector is added.
Fig. 6. BCNN with attentive pooling similarity architecture.
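A minimal Keras sketch of the APS-BCNN idea is shown below: the two max-pooled BCNN vectors are augmented with a similarity feature before the fully connected layer. Here a simple cosine similarity between projected vectors stands in for the full attentive-pooling similarity vector of Section 2.3, and the dimensions are illustrative.

```python
# Minimal sketch: concatenating a similarity feature to the pooled BCNN vectors.
from tensorflow.keras import layers, Model

head_vec = layers.Input(shape=(256,))    # pooled headline features from BCNN
body_vec = layers.Input(shape=(1024,))   # pooled body features from BCNN

proj = layers.Dense(256)(body_vec)                            # project body to headline size
sim = layers.Dot(axes=1, normalize=True)([head_vec, proj])    # cosine-similarity feature

merged = layers.Concatenate()([head_vec, body_vec, sim])
out = layers.Dense(2, activation="softmax")(merged)

model = Model([head_vec, body_vec], out)
model.summary()
```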
3.5 Hyperparameters
The hyperparameters are listed in Table 2. As mentioned above, the number of filters is set proportionally, considering the large difference in the amount of text between the headline and the body.
4. Experiments
This paper detects so-called “click-bait” articles among the various types of fake news. We define mission1 as the case in which the headline and body are inconsistent and mission2 as the case in which the content of the body is irrelevant to the context (Table 3).
Table 3. Example of fake news for each mission (Korean)
4.1 Dataset
We use 100K articles crawled from the Joongang Ilbo, Dong-A Ilbo, Chosun Ilbo, Hankyoreh, and Maeil Business Newspaper as a dataset. For each press, we categorize the news into economy, society, politics, entertainment, and sports, and then collect articles in the same proportion for each category. Of these, we use 31K articles for mission1 and 68K for mission2. Real news and fake news are in equal proportion for each mission, and the training and validation data are split at a ratio of 9:1. We measure each model's accuracy with test data consisting of 350 recent articles (as of March 2018) that are not included in the training and validation data and are composed of real and fake news in equal proportion.
4.2 Experiment Results
For each model, we measure the accuracy using the checkpoint with the lowest validation loss among the training steps; the results are shown in Table 4. The area under the receiver operating characteristic curve (AUROC) is used as the evaluation metric [18].
Table 4. Fake news classification accuracy results
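For reference, the AUROC metric can be computed with scikit-learn as follows; the labels and scores below are illustrative values only, not results from this paper.

```python
# Minimal sketch: computing AUROC from labels and predicted fake-news probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                 # 1 = fake, 0 = real
y_score = [0.2, 0.4, 0.8, 0.6, 0.3, 0.1]    # model's predicted probability of "fake"
print(roc_auc_score(y_true, y_score))
```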
5. Conclusions
This paper implements a deep learning model for fake news detection and measures the accuracy; its main contributions are as follows:
(1) The classification accuracy for mission2, which consists of fake news that is irrelevant to the article context, is highest with APS-BCNN, at an AUROC score of 0.726. It can be concluded that the similarity vector between the headline and the body contributes to detecting content that is irrelevant to the context.
(2) The classification accuracy for mission1, which consists of fake news whose headline and body are inconsistent, is highest with BCNN, at an AUROC score of 0.52; however, this accuracy is too low to detect real fake news. We can deduce the causes of the low accuracy as follows: (a) because a CNN classifies using the local information of texts, mission2 would have achieved high accuracy due to its large amount of perturbed local information, whereas mission1, which has relatively little perturbed local information, would have been difficult to classify; (b) the difference in the amount of training data between mission1 and mission2 would have caused the difference in accuracy between the missions. We were able to acquire a large amount of fake news data for mission2 by mixing parts of the bodies of several articles, but since the fake news data for mission1 had to be created individually, it was difficult to acquire as much data as for mission2.
(3) The CNN with LSTM has low classification accuracy. Although a previous LSTM-CNN model achieved high accuracy in text classification with a single input [19], applying LSTM to text classification with two inputs, as in this paper, yielded low accuracy. We can deduce the cause as follows: suppose, for example, that both the headline and the body contain the same word “apple,” as shown in Table 5.
Before the LSTM is applied, the word “apple” in both the headline and the body has the same vector trained by “Fasttext.” After the LSTM is applied, however, each word is influenced by the preceding words and receives a different vector. This reduces the association between “apple” in the headline and “apple” in the body even though they are the same word.
(4) “Fasttext” performs better than “Word2vec” in terms of Korean word similarity. We can deduce the cause of the better performance as follows: unlike in many other languages, the syllables that form Korean words carry their own meaning. For example, the word “대학” (university) is composed of the syllable “대,” which means “big,” and the syllable “학,” which means “learn.” This would make “Fasttext,” which is trained at the syllable level, perform better on word similarity than “Word2vec,” which is trained at the word level.
This paper proposes a meaningful deep learning model for fake news detection. The limitation of this study is that while we achieved meaningful classification accuracy for the case where the content of the body is irrelevant to the context, the accuracy was low when the headline and body were inconsistent. In future work, we will implement a big-data system to collect and generate good-quality fake news training data and retrain our model to improve its accuracy.
Acknowledgement
This paper was recommended from the 2018 Korea Information Processing Society Spring Conference. All code and data are available on our GitHub (https://github.com/2alive3s/Fake_news).