1. Introduction
Text summarization in natural language processing is a challenging task. It aims to produce a shorter version of a long document that encapsulates the main information [1,2]. There are two summarization approaches: extractive and abstractive summarization. Extractive summarization extracts representative sentences that are relevant to the document [3,4]. The importance of a sentence within the document is calculated based on the location of words and their relationship to neighboring words. Keywords or sentences are identified by considering the thematic role of a word and the grammatical structure of a sentence. The task of machine reading comprehension (MRC) is similar to extractive summarization: it answers questions by reading a document, extracting a specific passage, and highlighting the relevant portion of the document. Abstractive summarization is preferred when key sentences in the document do not represent the overall meaning of the entire document. In such a scenario, summarization is performed by analyzing the content of the document and generating new sentences that are not included in the original text [5,6].
Bidirectional encoder representations from transformers (BERT) embedding is widely used for machine reading comprehension and natural language understanding [7]. Neural networks are good models for implementing abstractive summarization [8,9] and have been applied to various tasks in natural language processing, such as text classification and machine reading comprehension [10]. Recurrent neural networks (RNNs) are popular models for processing sequential information. An RNN uses the $t$-th input and the $(t-1)$-th hidden state to create an output for the $t$-th input. This method naturally preserves the sequential characteristics of a sentence [11]: it reads a sentence from the first word rather than from the last word. However, an RNN is weak at capturing long-term dependencies. Transformers address this issue by using self-attention instead [12-14]. In this method, the words of each sentence are vectorized through the self-attention mechanism. The transformer uses positional encoding for the relative position of a word along with the embedding of the word, which allows the model to learn relative-position information. Therefore, the embedding vector varies according to the position of a word and its context, even for identical words, so the model can deal with cases where the same word has different meanings.
The BERT embedding model represents a word by using only the encoder portion of the transformer. BERT input embedding uses position embedding instead of positional encoding: one-hot index embedding is applied based on the position of a word, after which sentence and word embeddings are added sequentially. BERT learns through a masked language model and next-sentence prediction, which consider all the words before and after the input word through self-attention at each step of the sequence. Thus, BERT embedding considers all the words in a sentence and also learns the semantic relationship of the sentences in a document.
A neural network-based abstractive summarization model uses a dictionary focused on words that frequently appear in the training data. Rare words such as jargon therefore end up as out-of-vocabulary (OOV) words that the model cannot represent. When humans encounter rare words in a document, they can still summarize it based on the meaning of the surrounding sentences and on personal experience, for example by looking up related information. Previous studies use a copy mechanism to address this challenge [15,16]. In this mechanism, OOV words are copied from the pointed-to sub-sequences of the input document. The mechanism has been extended to increase performance by adding a gate that decides between generating a word and pointing to one. Keywords in the input document are selected through this gate, and pointing positions are assigned selectively.
Our research demonstrates the possibility of achieving better accuracy for abstractive summarization in the Korean language. A pre-trained Korean language model is improved by providing OOV words derived from the training data using a selective OOV copy method. We focus on two aspects. The first is to build a training data set for text summarization that works well for technical documents such as academic papers. The second is to check whether our proposed model properly generates OOV words in terms of summarization performance. We apply the following steps to improve accuracy:
Mask OOV words in the training stage,
Add contextual and morphological information in the embedding model,
Apply precise pointing and an optional copy instruction.
In the preprocessing stage, rare words are substituted with <unk> tags in order to train the OOV model. A masked-OOV (MOOV) training method is applied, analogous to the masked language model that intentionally distorts the input in BERT. Based on the input document, we perform context-based word embedding to select and generate a summary through syntactic and semantic features. The SentencePiece tokenizer used for the public BERT model does not utilize the morphological information of the Korean language. Therefore, we combine a pre-trained BERT model with a Korean morphological analyzer for performance enhancement. Korean academic papers are collected as a training data set. They are reliable documents, and the authors' keywords are used to construct an OOV vocabulary for training the OOV model. We improve summarization accuracy by using this additional information to selectively point or generate.
This paper is structured as follows. Section 2 describes related work, and Section 3 explains the selective-pointing OOV copy model. Section 4 presents how BERT embedding and masked OOV words are used. The experimental results are given in Section 5. Finally, we conclude the paper in Section 6.
2. Related Work
2.1 Seq2seq Copy Mechanism
The copy mechanism adopts a sequence-to-sequence (seq2seq) model to detect and copy OOV words from an input document [15]. Fig. 1 shows a summarization method that selects OOV items by using the copy mechanism in the decoding process. By adding copy-attention scores, the model decides whether a word should be generated normally or copied from the input document. A higher copy-attention score indicates that the influence of a specific part of the input document is high [16]. However, the copy mechanism may copy a word from the input document even when it should be generated in the normal summarization process.
Fig. 1. Copy mechanism for abstractive summarization.
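As an illustration only, a typical formulation of such a copy mechanism (in the style of pointer-generator networks, not necessarily the exact equations of [15,16]) mixes the generation distribution with the copy-attention distribution over input positions:

```latex
% Hedged illustration: p_gen is a learned switch, P_vocab the generation
% distribution, and a_i the copy-attention weight on input position i.
P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w)
     + (1 - p_{\mathrm{gen}}) \sum_{i\,:\,x_i = w} a_i
```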
2.2 Selective-Pointing Copy Mechanism
In order to improve the copy mechanism, Nallapati et al. [17] added position information by providing a criterion with a generator g and an OOV pointer p, as shown in Fig. 2. These are used to determine whether to copy a word from the input document or to generate a word during the summarization process. Because the copy mechanism is applied only to the necessary words, the copy-attention score is not considered for words already in the lexicon.
Fig. 2. Selective-pointing copy mechanism.
3. Selective-Pointing OOV Copy Model
We construct a long short-term memory (LSTM) OOV copy model with BERT embedding, which is designed to produce words by selectively generating them or pointing to them. The input to the model incorporates the result of BERT context embedding. Each word, with separated morphemes, is provided to the LSTM OOV copy model through BERT embedding. We use LSTM cells in an automatic-summarization neural network, which is a seq2seq model. The encoding process of the model is shown in Fig. 3. The left side of the diagram shows an input document as a vector, sequentially receiving the words of each input sentence, while the right side shows the sequential decoding of the summary starting from the final hidden state of the encoding step, i.e., the initial value of the decoder. The decoder also estimates the pointer $p$, the generator $g$, and the generation error of summary words, and the model is optimized through the loss function. The first of the three LSTM encoding hidden layers uses a bidirectional RNN, so bidirectional information is processed at each timestep. The three encoding layers have three corresponding decoding layers. If the value $g_2$ is 1, the model generates a summary word; if it is 0, the model points to $x_2$, i.e., one of the input words in Fig. 3.
Fig. 3. Selective-pointing OOV copy model with BERT embedding. $x$ is an input word; $y$, an output word; $p_t$, the input-word-selection number; $h^e, h^d$, hidden-state neural network matrices; $N$, the total number of timesteps.
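A minimal PyTorch sketch of the skeleton in Fig. 3 is given below. It is an illustrative reconstruction, not the authors' implementation: the layer sizes, the vocabulary size, and the way the decoder state is initialized are assumptions, and BERT embeddings are taken as a precomputed input tensor.

```python
import torch
import torch.nn as nn

class SelectivePointingEncoder(nn.Module):
    """Three LSTM encoding layers; the first layer is bidirectional (Fig. 3)."""
    def __init__(self, emb_dim=256, hidden=256):
        super().__init__()
        self.bi_lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.lstm3 = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, bert_emb):                # bert_emb: (batch, src_len, emb_dim)
        out, _ = self.bi_lstm(bert_emb)         # bidirectional information per timestep
        out, _ = self.lstm2(out)
        out, (h_n, c_n) = self.lstm3(out)       # out = {o^e}; (h_n, c_n) seeds the decoder
        return out, (h_n, c_n)

class SelectivePointingDecoder(nn.Module):
    """Three decoding layers matching the encoder. Each step emits vocabulary
    logits (word generation) and a gate g_t (1 = generate, 0 = point)."""
    def __init__(self, emb_dim=256, hidden=256, vocab_size=32000):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim + hidden, hidden, num_layers=3, batch_first=True)
        self.vocab_proj = nn.Linear(hidden, vocab_size)
        self.gate = nn.Linear(hidden, 1)

    def forward(self, prev_word_emb, att_wcv, state):
        # prev_word_emb: (batch, emb_dim); att_wcv: (batch, hidden) from the attention layer
        x = torch.cat([prev_word_emb, att_wcv], dim=-1).unsqueeze(1)
        out, state = self.lstm(x, state)
        h_d = out.squeeze(1)                    # decoder hidden state h_t^d
        return self.vocab_proj(h_d), torch.sigmoid(self.gate(h_d)), h_d, state
```

In practice the encoder's final state would be repeated across the three decoder layers to form the initial decoder state; the exact wiring is not specified in the text above.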
The attention layer in Fig. 4 is required for precisely pointing to OOV words. At decoding timestep $t$, an appropriate weight $a$ is calculated through a neural network whose inputs are the decoder hidden state $h_t^d$ and the encoder hidden outputs $o_i^e$. $g_t$ is derived from the <unk> tag information of the correct-answer summary; it informs the decoder whether to generate or copy a word. When $g_t = 1$, the decoder generates a word. When $g_t = 0$, it copies the word that receives the highest pointing attention. For each decoding timestep, the model calculates $c$ (attention-weighted context vector) and $a$ (attention weight), which are determined by how strongly the decoder focuses on each word of the encoded input. $c_t$ is used both to generate a word at the current timestep (att_wcv) and as input to the next timestep.
Fig. 4. Selective-pointing attention mechanism. W and b are weight and bias, respectively.
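The attention computation described above can be sketched as follows; the additive form and the shapes are assumptions, and only the roles of W, b, $a$, $c$, and $p_t$ follow the text.

```python
import torch
import torch.nn as nn

class PointingAttention(nn.Module):
    """Returns the attention weights a, the context vector c (att_wcv),
    and the pointer p (index of the largest attention weight)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.W_e = nn.Linear(hidden, hidden, bias=False)   # applied to encoder outputs
        self.W_d = nn.Linear(hidden, hidden, bias=True)    # W and b applied to h_t^d
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, h_d, o_e):               # h_d: (batch, H), o_e: (batch, src_len, H)
        scores = self.v(torch.tanh(self.W_e(o_e) + self.W_d(h_d).unsqueeze(1)))
        a = torch.softmax(scores.squeeze(-1), dim=-1)      # attention weight a_t
        c = torch.bmm(a.unsqueeze(1), o_e).squeeze(1)      # attention-weighted context c_t
        p = a.argmax(dim=-1)                               # pointer p_t, used when g_t = 0
        return a, c, p
```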
The overall flow of the decoding step is illustrated in Fig. 5. 'logits' denotes the scores over the lexicon from which the probability of generating each summary word is computed. The summary-word generation result is output as a lexicon index (vocab_id). $\{o^e\}$, the set of all output values from the encoding step, is used for computing $a$ and $c$ in the attention layer. $p_t$ is the index of the largest $a$ value. In the decoding step, the input words used for training differ from the input words used for prediction: during training, the input words are the correct summary words included in the training data, while prediction uses the words generated in the previous step. Each input word is combined with att_wcv at the next timestep so that the model reflects the change in attention information at each decoding step.
Fig. 5. Decoding step of summary word generation.
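The decoding loop, including the switch between teacher forcing (training) and feeding back the previous prediction, can be summarized in the following hedged sketch; decoder, attention, and embed stand for the hypothetical modules sketched above and are not the authors' exact interfaces.

```python
import torch

def decode(decoder, attention, embed, o_e, init_state, gold=None, max_steps=50, bos_id=1):
    """If `gold` (batch, T) is given, the correct summary word is fed at each step
    (training); otherwise the word generated at the previous step is fed back."""
    batch = o_e.size(0)
    word = torch.full((batch,), bos_id, dtype=torch.long)
    att_wcv = torch.zeros(batch, o_e.size(-1))           # initial context vector
    state, outputs = init_state, []
    steps = gold.size(1) if gold is not None else max_steps
    for t in range(steps):
        logits, g, h_d, state = decoder(embed(word), att_wcv, state)
        a, att_wcv, p = attention(h_d, o_e)               # a, c (att_wcv), and pointer p_t
        vocab_id = logits.argmax(dim=-1)                  # generated lexicon index
        outputs.append((vocab_id, g, p))
        word = gold[:, t] if gold is not None else vocab_id
    return outputs
```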
Eq. (1) is a probability function that involves the position value $p_t$ of the keyword along with $g_t$ (the value that decides whether to generate a word). It provides additional information to train the encoder, decoder, and OOV pointing-attention neural network. If $g_t = 0$, pointing attention is executed; otherwise ($g_t = 1$), a summary word is generated.
Eq. (1) describes the conditional probability of generating the summary words $Y$ given the encoding of the input words $X$. $Y_{t-1}$ denotes the state of the decoding process at the $(t-1)$-th timestep. For summary word generation, the next-word prediction probability $P(y_t \mid Y_{t-1}, X)$ is used. On the encoding side, the pointing probability $P(p_t \mid Y_{t-1}, X)$ is used for pointing to a word. To optimize the weights of the neural network, a loss function is defined by applying the natural log to Eq. (1) with a negative sign, i.e., the cross-entropy error function of Eq. (2).
Negative log-likelihood is a neural network loss function used to minimize the difference between the correct answer and the prediction. The loss function in Eq. (3) combines the generating function generation_NLL for summary word generation and the pointing function pointing_NLL that points to words in the input document. generation_NLL is a loss that teaches the decoder output to match the word in the actual summary, while pointing_NLL is a loss that, when the target summary word is not in the lexicon, teaches the model to accurately point to the position of that word in the input document. These functions normalize the output of the neural network and are expressed as cross-entropy functions of the error between prediction and the correct answer.
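The equations themselves are not reproduced in this text; a plausible reconstruction consistent with the description above (but not the paper's verbatim formulas) is:

```latex
% Eq. (1): the gate g_t selects between generating y_t and pointing to p_t.
P(Y \mid X) = \prod_{t} P(y_t \mid Y_{t-1}, X)^{\,g_t}\, P(p_t \mid Y_{t-1}, X)^{\,1-g_t}

% Eq. (2): negative log-likelihood of Eq. (1).
\mathcal{L} = -\log P(Y \mid X)

% Eq. (3): the same loss split into generation_NLL and pointing_NLL.
\mathcal{L} = -\sum_{t} g_t \log P(y_t \mid Y_{t-1}, X)
              \;-\; \sum_{t} (1 - g_t) \log P(p_t \mid Y_{t-1}, X)
```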
4. BERT Embedding with Masked OOV Words
The OOV model is trained on academic papers. The BERT model is trained on the MRC corpus of the National Information Society Agency (http://aihub.or.kr), which contains 107,717 news articles. After preprocessing, the BERT model is generated by pre-training on these documents, which exclude academic papers. It is then fine-tuned through OOV model training. Finally, the performance is evaluated by using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric [18]. We also collected 4,707 academic papers on computer science published by the Korean Institute of Information Scientists and Engineers (http://www.kiise.or.kr), crawled through the NDSL website (http://www.ndsl.kr) operated by the Korea Institute of Science and Technology Information (http://www.kisti.or.kr) (Table 1). The collected dataset is randomly divided into training, evaluation, and test data in a 6:1:3 ratio for the validation experiment.
Table 1. Data collection of academic papers
4.1 Bidirectional Encoder Representations from Transformers
BERT embedding is created through pre-training on data generated for the masked language model and next-sentence prediction tasks. The BERT pre-training data, extracted from the news articles in sentence units, consists of 1,497,079 lines. BERT model training was conducted with the options presented in Table 2. Considering the mean sentence length of the training documents, max_seq_length is set to 128.
The masked language model parameter is set to mask approximately 15% of all words. The hyperparameters of the BERT model, i.e., hidden_size, num_attention_heads, and num_hidden_layers, are set to 256, 4, and 2-12, respectively, considering the device performance. With these settings, the size of each attention head becomes 64. The size of the feed-forward hidden layer, which collects the multi-head attention output in the transformer encoder block, is set to 128. A Korean morphological analyzer, MeCab-ko, was used at the input layer of BERT. The best pre-training results were achieved with six hidden layers: the accuracy of the masked language model was 0.5267 and that of next-sentence prediction was 0.9875 (Fig. 6). Although the BERT-base model has 12 hidden layers, only six hidden layers are used in our BERT model because of the small set of collected documents.
Fig. 6. Performance comparison of the language model according to the number of BERT hidden layers.
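The configuration implied by these settings can be written in the style of the original BERT repository's config file and pre-training flags; vocab_size is not reported in the paper and is shown here only as a placeholder.

```python
# BERT configuration implied by Section 4.1 (vocab_size is a placeholder assumption).
bert_config = {
    "hidden_size": 256,            # embedding / hidden dimension
    "num_attention_heads": 4,      # 256 / 4 = 64 per attention head
    "num_hidden_layers": 6,        # best result among 2-12 layers (Fig. 6)
    "intermediate_size": 128,      # feed-forward size in each encoder block
    "max_position_embeddings": 128,
    "vocab_size": 32000,           # placeholder; not reported in the paper
}
pretraining_options = {
    "max_seq_length": 128,         # based on mean sentence length
    "masked_lm_prob": 0.15,        # about 15% of words are masked
}
```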
4.2 Training Masked OOV Words
Scientific article summarization differs from news article or general document summarization because scientific papers are typically long and contain complex concepts and technical terms [19]. Erera et al. [20] described a typical scenario of scientific paper usage through a qualitative user study: users first read the title and, if it is relevant, continue to read the abstract. Thus, the title can be considered a summary of the abstract. Abstractive summarization with OOV words requires a list of keywords to be added to the input document. Keywords may or may not appear in the abstract but are essential for learning the summarization pattern. In general, peer-reviewed articles follow a specific document type and style of expression and have keywords selected by the authors. We therefore use the title and abstract of each article as the summary statement and the input document, respectively. We assume that the title of an article summarizes the contents of the paper and that the keywords selected by the authors are the most important keywords of the paper. Inverse document frequency scores are calculated for each keyword, and the three keywords with the highest scores are selected as OOV words.
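The keyword selection can be sketched as follows; this is an assumed implementation of the IDF-based selection described above, with illustrative function and variable names.

```python
import math
from collections import Counter

def select_oov_keywords(doc_keywords, corpus_keyword_sets, top_k=3):
    """Pick the `top_k` author keywords with the highest inverse document frequency.
    corpus_keyword_sets: list of keyword sets, one per paper in the collection."""
    n_docs = len(corpus_keyword_sets)
    df = Counter(kw for kws in corpus_keyword_sets for kw in set(kws))
    idf = {kw: math.log(n_docs / (1 + df[kw])) for kw in doc_keywords}
    return sorted(doc_keywords, key=lambda kw: idf[kw], reverse=True)[:top_k]
```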
The training process comprises seven steps. In step 1, tokenization is performed by the morphological analyzer. In step 2, OOV words are substituted with <unk> tags and an (OOV word, <unk>) table is constructed. Step 3 generates a lexicon for indexing OOV words. In step 4, the OOV dictionary is generated to replace the <unk> tags with the actual words. In step 5, indexing is conducted by using the lexicon. Step 6 generates the selective gate's correct answer, which determines whether to generate or to copy a word. Finally, step 7 creates the correct-answer indicator for accurately pointing to the OOV word.
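Steps 2, 6, and 7 can be illustrated with the following simplified sketch; the names are illustrative, and the paper's actual table and indexing structures are not specified in this text.

```python
def build_oov_labels(src_tokens, tgt_tokens, oov_words):
    """Substitute OOV words with <unk> (step 2) and build the selective gate (step 6)
    and pointing (step 7) labels for the target summary tokens."""
    oov_table = {w: "<unk>" for w in oov_words}               # (OOV word, <unk>) table
    src_masked = [oov_table.get(t, t) for t in src_tokens]
    gates, pointers = [], []
    for t in tgt_tokens:
        if t in oov_words and t in src_tokens:
            gates.append(0)                       # g_t = 0: copy from the input document
            pointers.append(src_tokens.index(t))  # p_t: position to point to
        else:
            gates.append(1)                       # g_t = 1: generate from the lexicon
            pointers.append(-1)                   # unused when generating
    tgt_masked = [oov_table.get(t, t) for t in tgt_tokens]
    return src_masked, tgt_masked, gates, pointers, oov_table
```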
We replace words with <unk> tags according to the authors' keywords in order to train the OOV model. The model should not be trained in a context where all the OOV keywords are substituted with <unk> tags, because it would then ignore the occurrence pattern of the <unk> token; it should instead be trained in a context consistent with the sentences and words that occur before and after the <unk> tag. Therefore, we devised a MOOV training method based on masked language modeling, which randomly masks words in BERT. We use the MOOV model for the regularization of OOV words [21]. OOV words are substituted with the <unk> tag, retained, or replaced with a random word with probabilities P1, P2, and P3, respectively (Fig. 7). The experiments showed the highest performance when P1:P2:P3 is 9:1:0 (Table 3).
Fig. 7. Examples of converting masked OOV words.
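The MOOV conversion in Fig. 7 can be sketched as a per-occurrence random choice; the function below is an assumed implementation, with the best reported ratio P1:P2:P3 = 9:1:0 as the defaults.

```python
import random

def moov_mask(tokens, oov_words, lexicon, p1=0.9, p2=0.1, p3=0.0, seed=None):
    """Replace each OOV occurrence with <unk> (prob. p1), keep it (p2),
    or substitute a random lexicon word (p3)."""
    rng = random.Random(seed)
    out = []
    for t in tokens:
        if t in oov_words:
            r = rng.random()
            if r < p1:
                out.append("<unk>")
            elif r < p1 + p2:
                out.append(t)                     # keep the original OOV word
            else:
                out.append(rng.choice(lexicon))   # random-word replacement (unused at 9:1:0)
            continue
        out.append(t)
    return out
```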
Unlike in the BERT masked language model, substituting random words was not useful for training the MOOV model. The only noise that helped in training the OOV model was keeping the original word or replacing the OOV word with an <unk> tag. The sensitivity to noise differs because BERT produces a high-dimensional vector for each word, whereas OOV model training computes an exact position value and a value indicating whether to generate or select a word depending on the context.
Table 3. Summarization performance
4.3 Morphological Decomposition and Composition
Part-of-speech (POS) information is added in the training stage by the Korean morphological analyzer, which separates postpositions and endings from a word. We combine sentences and words by adding POS tags for roots, postpositions, and endings, which are generated in the prediction stage as listed in Table 4. Formal morphemes correspond to parts of speech such as case markers, verb endings, and suffixes; they form the subset of Korean parts of speech used for word embedding. Furthermore, symbols such as "·-o·" (terminal type) or "·--·" (linking type) are used as information to reconstruct a complete sentence in the prediction stage. We did not consider prefixes and compound nouns because they complicate the model too much.
Table 4. Formal morphemes and POS tags
To restore sentences generated as morpheme units, we construct words by connecting roots with their postpositions or endings and by adding a space when two roots are adjacent to each other. In other words, we add a space before each root and attach the other elements by distinguishing roots from postpositions and endings. The stepwise process of morphological decomposition and sentence generation is: (1) morphologically separate an input sentence using the morphological analyzer; (2) add a morpheme symbol; (3) add a space before the elements that correspond to roots in the result of (2); (4) concatenate all the elements of the list; (5) remove symbols such as "·-o·" or "·--·". As a result, a sentence identical to the input sentence is obtained. In this way, we applied the morpheme-to-sentence converter module to the results generated by the selective-pointing OOV model.
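The composition side of this procedure (steps 3-5) can be sketched as below. The POS tag set is illustrative only (a small subset of the Sejong/MeCab-ko tags), and the handling of the terminal and linking symbols is simplified away.

```python
# Roots start a new word (preceded by a space); formal morphemes such as case
# markers and endings are attached to the preceding root.
FORMAL_TAGS = {"JKS", "JKO", "JKB", "JKG", "EF", "EC", "XSV", "XSN"}  # example subset

def morphemes_to_sentence(tagged_morphemes):
    """tagged_morphemes: list of (surface, POS tag) pairs generated by the model."""
    pieces = []
    for surface, tag in tagged_morphemes:
        if tag in FORMAL_TAGS:
            pieces.append(surface)          # postposition/ending: attach to the root
        else:
            pieces.append(" " + surface)    # root: add a space before it
    return "".join(pieces).strip()

# e.g. [("모델", "NNG"), ("의", "JKG"), ("성능", "NNG"), ("을", "JKO")] -> "모델의 성능을"
```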
5. Experiments and Results
5.1 Experiment Settings
Whether information is omitted is important for evaluating the performance of abstractive summarization, and it is determined by comparing the generated summary with the correct answer. In this study, we used the ROUGE metric. There are many variants of ROUGE based on factors such as the appearance of correct words, the length of matched word order, and the number of matched words; we used ROUGE-L, ROUGE-1, ROUGE-2, and ROUGE-SU4. We experimented on a PC server with the specifications tabulated in Table 5, which also shows the hyperparameters used for training. Dropout and layer normalization were used to enhance model performance. LSTM cells were implemented in three layers for both encoding and decoding on top of a pre-trained BERT model.
Table 5. Hyperparameters for OOV copy model
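For reference, ROUGE-1 and ROUGE-L F1 can be computed as follows; this is an illustrative re-implementation over whitespace tokens, not the official toolkit used for the reported scores.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap F1 between a candidate summary and the reference."""
    c, r = candidate.split(), reference.split()
    overlap = sum((Counter(c) & Counter(r)).values())
    if not c or not r or overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def rouge_l(candidate, reference):
    """Longest-common-subsequence F1 between a candidate summary and the reference."""
    c, r = candidate.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if not c or not r or lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```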
We regenerated the MOOV data after every two epochs of training and then resumed training; the MOOV data were created randomly each time to enhance the training effect. During model training, the validation loss was measured every 30 minutes. When fine-tuning was performed in combination with the language model, we changed the learning rate three times, based on findings that adjusting the learning rate can reduce training time [22-25]. Fig. 8 shows the loss graph of fine-tuning the entire model. We began training the model with a learning rate of $2 \times 10^{-5}$ after initializing it with the pre-trained BERT model, and then retrained it with learning rates of $2 \times 10^{-4}$ and $2 \times 10^{-3}$ at the points where training progress slowed. Through training, we obtained the optimal model at 190,000 steps, where the validation loss began to increase; beyond this point the validation loss continued to increase because of overfitting. We generated summary sentences by using beam search with the trained model [26].
Fig. 8. Loss graph of the BERT OOV model (upper curve, validation loss; lower curve, training loss).
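The staged learning-rate adjustment can be sketched as below. The plateau criterion is an assumption (the text does not specify how the slowdown was detected); only the schedule 2e-5, 2e-4, 2e-3 follows the description, and the optimizer is assumed to expose PyTorch-style param_groups.

```python
class StagedLearningRate:
    """Raise the learning rate to the next stage when the validation loss plateaus."""
    def __init__(self, optimizer, schedule=(2e-5, 2e-4, 2e-3), patience=3):
        self.optimizer, self.schedule, self.patience = optimizer, schedule, patience
        self.stage, self.history = 0, []

    def step(self, val_loss):
        self.history.append(val_loss)
        plateaued = (len(self.history) > self.patience
                     and min(self.history[-self.patience:]) >= min(self.history[:-self.patience]))
        if plateaued and self.stage + 1 < len(self.schedule):
            self.stage += 1
            for group in self.optimizer.param_groups:   # PyTorch-style optimizer assumed
                group["lr"] = self.schedule[self.stage]
        return self.schedule[self.stage]
```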
5.2 Performance Evaluation
We evaluated the performance of the proposed model using the n-gram-based ROUGE-N, the longest-common-subsequence-based ROUGE-L, and the skip-bigram and unigram co-occurrence-based ROUGE-SU. We found that the performance improved compared with baseline models, namely the LSTM model, the LSTM+$p_t$ model, the LSTM+$g_t$+$p_t$ model, and the LSTM+$g_t$+$p_t$+MOOV model. We evaluated six models trained on the same dataset. All the baseline models except the BERT+LSTM+$g_t$+$p_t$+MOOV model and the BERT+LSTM+$g_t$+$p_t$ model used lookup tables that are optimized during model training, without a separate embedding representation layer. First, the LSTM model was constructed with three LSTM layers and a lookup embedding table to summarize the document. The LSTM+$p_t$ model performed summarization by using the copy mechanism for all <unk> tags. The LSTM+$g_t$+$p_t$ model performed summarization using the $g_t$ and $p_t$ information for <unk> tags and selectively executed the copy mechanism. The LSTM+$g_t$+$p_t$+MOOV model was the LSTM+$g_t$+$p_t$ model trained with MOOV.
Table 6 summarizes the ROUGE measurements obtained from the generative summary produced by each model with a test document as input. The highest performance was observed when BERT embedding was combined with the LSTM selective-pointing ($g_t$+$p_t$) model. On all ROUGE measures, our model performed considerably better than the other copy-mechanism models. This is because the summary was created through the OOV copy mechanism together with the positive effect of the vector representation of morpheme-separated words obtained from BERT pre-training, which reflected the Korean language model in training. ROUGE-1 is a performance index focusing on the occurrence of individual summary words, while ROUGE-L is a performance index for the occurrence of consecutive words. The ROUGE-1 score improved by 8.10 points (from 47.01 to 55.11), and ROUGE-L improved by 10.10 points (from 29.55 to 39.65).
Table 7 shows the word statistics of the summarization results, and Fig. 9 compares the generated summaries with the human-generated correct answers. In the case of the BERT+LSTM+MOOV copy model, the shaded words are OOV words; the summary was generated by copying these words through the pointing network. This means that the OOV word reproduction performance of the summary was improved through BERT embedding and the selective OOV copy method.
Table 6. Performance of our model and baselines
Table 7. Word statistics of summarization results
Fig. 9. Examples of abstractive summary generation. Shaded text indicates OOV words.
6. Conclusion
We improved the performance of abstractive summarization by training masked OOV words using pre-trained embedding together with OOV position information and a selective OOV copy method. The OOV copy-pointing method in an LSTM with BERT embedding improved the performance of document summarization. Experiments on academic papers demonstrated that summarization performance improved when MOOV training and morpheme-to-sentence conversion were applied to the selective-pointing OOV copy model. Specifically, we generated a summary by pointing to the OOV words in the input document, which were keywords selected by the authors. To improve the quality of summarization, we trained a neural network gate that decides between the pointing operation and the generation operation. The ROUGE scores increased because of the word-creation effect on the <unk> tags. Experimental results showed that the ROUGE-1 score was enhanced from 40.46 to 55.11.
In a natural language system, OOV words cause the system to misunderstand the semantic and syntactic information of the text, degrading performance and leading to misinterpretation of its meaning. Our approach can be applied to dialogue-based systems to improve their performance and help them correctly understand the meaning of utterances. This work produces summary sentences containing OOV words, but the quality of summarization can be improved further through research on generating more natural sentences.
Acknowledgement
This research was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (No. NRF-2019S1A5A2A03046571) and by the Korea Institute of Science and Technology Information (No. K-21-L01-C06-S01). Our research on BERT embedding, the selective-pointing mechanism, OOV masking, and deep learning methods contributes to solving the bias and fairness problem in AI systems.