1. Introduction
Keywords are usually extracted from a document by selecting a few critical terms that arise from the text. Such a method of keyword extraction can be widely applied to different information retrieval technologies such as keyword indexing [1-14] and information extraction [5,10,11,13]. Automatic keyword extraction can be realized by analyzing information such as word frequency count and word placement [2,5,9,13,14] or by using natural language processing that looks more closely at sentence structure and word context [3,4,6,13].
However, if there are no words to serve as keywords, if some words composed of keywords are spread across the text body, or when the context of the text is to be guessed, term draw was used as a meaning of abstraction, derivation, or conjecture. This process is an inferring of the context, concept, or conclusion, not a detection of words in the document, so term draw is not effective when there are abstract vocabularies or keywords by which the system cannot infer or guess. Furthermore, this is not effective when there are no terms within the document that can be used as a valid keyword, when the potential keywords appear in too many places in the document, or when the keywords are made of too abstract terms that cannot be associated with the subject of the document (the word “extraction” was used in [14] in the context of derivation and implication).
Concerning this issue, Nagata and Kimoto [6] defined an index rule explaining the basic concept of how to make up keywords (key concept) and describing concurrence (word sharing) among keywords. This rule served as their basis in their proposed keyword extraction method. In this rule, nouns are first detected from sentences, and a proper keyword is extracted by collecting words similar to or same with the nouns. An appropriate keyword to the text may not be analyzed as relevance among similar or same words is not considered when the key concept is extracted.
Against this backdrop, this study experiments on extracting new keywords by considering the relevance between and among concepts of the main words appearing in the document. To extract proper keywords as best as possible even if acceptable ones do not appear in the document itself, this study proposes a concept-based compound keyword extraction method. Particularly, when author-provided keywords do not appear in the text, or they are not selected as thematic words, a compound-word-production rule is used. Focused on the conceptual relevance of the production rule, a new calculation method to determine importance is introduced.
This study is structured as follows. Section 2 briefly analyzes author-provided keywords and provides an overview of a production rule setup. Section 3 proposes a concept-based production rule and the calculation method of keyword importance, considering the co-occurrence of keywords. Section 4 experiments for the validity of author-provided keywords and evaluates extracted compound keywords. Finally, Section 5 finishes with the conclusion and suggestions for future studies.
2. Keyword Patterns
There are some characteristics of keyword patterns, as seen in Table 1 selected and written by authors to extract correct keywords sticking to the theme of the document. The table displays six real examples (sentences or character strings) and extraction patterns. Each different pattern is adopted for each different occasion, and the occasion falls into three categories. First, when keywords exist in all parts of the document. Second, when keywords exist in a part of the document. Third and last, when keywords do not exist in the document. The extraction patterns described above can be correctly analyzed using each concept word. As a pre-step to generate keywords based on the rule, compound keywords are divided into each component, and pattern analysis will be carried out. For case # 6 found in Table 1, a dictionary for the conversion should be generated. The next section will introduce the concept-based keyword extraction method.
3. Keyword Extraction
This section attempts to extract keywords based on concepts. Section 3.1 discusses production rule for forming compound keywords, and Section 3.2 discusses conceptual distance, the number of cooccurrences, and the calculation of keyword importance for ranking keyword candidates.
3.1 Production Rule for Compound Keywords
Compound keywords are not often found in texts requiring a guess or inference from words appearing in the texts. This section employs a method that Nagata and Kimoto [6] used to define relations between the concept of morpheme in keywords and the keywords themselves as a rule. Under the rule, it detects words embracing abstract vocabularies or themes [9].
Under the assumption that a compound keyword w is a keyword and the compound word consists of component morphemes, which are composed of sub-concept words, the component morphemes are called conceptual components of w. These conceptual components consist of synonyms and related terms of w.
The compound word, w, can be made up again of each component’s synonyms and related terms. It can be structured as follows using [ ]. The first { } represents synonyms, and the second { }, related terms. Each component of w is expressed as w1, w2, …, wn, and the function to obtain the w’s concept is Concept ( ). The conceptual components of each phoneme are as follows:
Concept (meaning) = [{definition, value, ...}, {context, ...}]
Concept (processing) = [{dispensing, disposing, ...}, {solving, handling, ...}]
Concept (vocal) = [{voice, ...}, {timbre, intonation, ...}]
Concept (dialogue) = [{conversation, discussion, ...}, {talk, chat, ...}]
The production rule (PR) of w1, w2, … wn is defined as the addition of all conceptual components or PR(w1, w2, …, wn) = Concept(w1) + Concept(w2) + … + Concept(wn). This extracts the compound keyword w or the set of w1, w2, …, wn only when all concept elements from Concept(w1) to Concept(wn) appear in the document. For example, there is a compound word of “meaning processing” in a document. Under the production rule, if the context components (“meaning” and “processing”) and two-way conceptual components exist, it extracts a compound word: “meaning processing”. Likewise, in case of “voice dialogue processing” it extracts keywords as follows if three conceptual components (“voice”, “dialogue”, and “processing”) exist in text at the same time:
PR (meaning processing) = Concept (meaning) + Concept (processing)
PR (spoken dialogue process) = Concept (spoken) + Concept (dialogue) + Concept (processing)
3.2 Importance of Keyword Candidates
A new method for importance calculation is needed to enhance the extraction’s precision level. Sentential distance, sd, and conceptual distance, cd, are also required to measure importance.
3.2.1 Conceptual distance
A distance between sentences that include each component can be used as a distance between conceptual terms and serve as an indicator of importance. Therefore, the conceptual distance is obtained using the co-occurrence of concept words. However, the distance alone is not enough to understand the relevance between conceptual terms. Therefore, a distance between concepts should be used to focus on the concurrence1 of conceptual terms. For example, let us take a look at PR(XZ) = Concept(X) + Concept(Z).
Distance between sentences, sd, is shown as the arrow in Fig. 2. Variables v, x, y, and z are a top-layer language appearing in the document and displayed in lowercase English as conceptual components. Concept words are marked by capitalized alphabets V, X, Y, and Z. The conceptual distance between X and Z is simply 5. sd from sentence i to j is defined by [TeX:] $$s(j)-s(i)+1,(\text { where } j \geq i).$$
The relation of sentential distance and conceptual distance.
The conceptual terms are expressed in uppercase English V, X, Y, and Z; the conceptual distance between X and Z is simply the distance between sentences, which is 5 (=5–1+1); and sd between sentences i to j is defined as [TeX:] $$s(j)-s(i)+1(j \geq i).$$ However, if you focus on the co-occurring concept term V, which appears commonly in X and Z, V is in between X and Z, and the conceptual distance cd becomes 1. cd is calculated using the following equation.
where, cc is the number of common conceptual word X, cd of XV is 1, and cd of VZ is 1.
Therefore, cd becomes [TeX:] $$2\left(\left(=\frac{1+1}{1}, X V \text { and } V Z\right)\right..$$ This is a calculation of the conceptual distance considering meaningful relevance.
3.2.2 Number of co-occurrences (word sharing)
Conceptual terms or conceptual components that represent the theme often appear in a document. Such concept words are related by the co-occurrence with other words. These terms are also in the same context as the other words in many cases. In this regard, conceptual terms having many words sharing the same context are crucial or most likely to reflect the theme of the entire document. Therefore, it is essential to use a concept word that has a large number of co-occurrences.
Fig. 3 demonstrates the co-occurrences in the document where concept words with co-occurrences are connected by a dashed line. If, in some document, the i-th complex word w has the number of cooccurrences N(wi), the number of co-occurrences of i-th w is shared with other word groups. Concept word V and N(V) become 3(=1+1+1) as in Fig. 4, which shows the number of words involved in the concurrence relationships of V. Also, because V is the only conceptual term in a concurrent relationship with X, Y, and Z, N(X), N(Y), and N(Z) all become 1. In conclusion, [TeX:] $$N(V)>N(X), N(Y), \text { or } N(Z)$$ and V has higher importance than X, Y, and Z.
Concept word with co-occurrence.
Number of co-occurrences of concept word V.
3.2.3 Calculation of importance
Calculation of the importance of keyword candidates that consider the conceptual distance (cd) and the number of concurrences of a conceptual term wi, or N(wi), can be expressed as in Eq. (2). The smaller cd, the larger the number of concurrences and the higher frequency of the higher importance, I. Furthermore, the smaller cd, the higher frequency of synonyms (S) and related terms (R), of concept elements and the higher the importance. The following Eq. 2 expresses how to calculate the importance of candidate keywords considering the distance between conceptual terms cd and the number of cooccurrences of a concept word, N(wi); this is expressed in Eq. (2).
Here, n is the number of conceptual terms that make up the keyword candidates, S(wi) is the frequency of synonyms of wi, R(wi) is a frequency of how ofter wi, related terms appear ST is the overall frequency of the synonyms, RT is the overall frequency of the related terms, α and β are represent weighted values of synonyms and related terms, respectively [TeX:] $$(where, \alpha>\beta ).$$
In Table 2, the underlined words in the sample documents are words of interest and concept words that appear in the document. The distribution of these words is as shown in Table 3. Fig. 5 shows the cooccurrence relation between the concept element and concept keyword. Next, the production rule calculates the keyword importance by synonyms and similar meaning word candidates of extracted concept keywords. For example, the following “natural language processing,” in example (a) has n of 2 (“language”, “natural language”), S(natural language) of 6, S(processing) of 5, and, therefore, ST becomes 17(=2+6+5+2+1+1). N(natural language) is 5, N(processing) is 4, and cd is the minimal distance 1. Therefore, the importance is calculated as follows (where, the weighting of synonyms and related terms are respectively α = 1 and β = 0.5; also, the synonyms of this concept word are assumed to not appear in the document).
PR (natural language processing) = Concept (natural language) + Concept (processing)
PR (natural language interpretation) = Concept (natural language) + Concept (interpretation)
PR (Korean language interpretation) = Concept (Korean language) + Concept (interpretation)
Sample document ① [TeX:] $$( >$$ Language Processing // Processing Undefined Words // GLR Rule)
Conceptual words and conceptual elements extracted from sample document ①
Co-occurrence relation between the concept element and concept word.
Conceptual words
“Natural language interpretation” of (b) is
and “Korean language interpretation” of (c) is
If we calculate the importance of the rest of the keyword candidates, then the keyword candidate with the greatest number of co-occurrences is “natural language processing.” This keyword candidate also has the greatest importance. This shows that the author-provided keyword “natural language processing” at the bottom of Table 2 (marked by [TeX:] $$\langle\mathrm{KYWD}>>)$$ is the most appropriate keyword that captures the main topic of the document. It can also be conveyed from calculation results that the keyword “natural language interpretation” with importance greater than 1 is also an appropriate keyword candidate. Fig. 5 shows a graph for Sample ① about the co-occurrence relation between the concept element and concept word.
Next is an example of keyword generation using a group of related terms, as in Table 4 of sample document ②. The main topic of the text is termed as “understanding acoustic environment”. The appearance of the synonym “sound” and related term “voice” produces the concept word “acoustic”, and the appearance of the synonym “environment” generates the concept word “environment”. Lastly, no synonyms for the word “understanding” appeared, but a concept element “recognition” in related terms appears three times to generate the word “understanding”. Conclusively, the compound keyword “understanding acoustic environment” (different from the author provided keyword [TeX:] $$\langle\mathrm{KYWD}>>)$$ is formed from “understanding + acoustic + environment”.
Sample document ② [TeX:] $$(<\mathrm{K} Y W D>>$$ Speech Recognition // Acoustic Environment Recognition)
4. Extraction and Evaluation
4.1 System Configuration
The schematic of the keyword generation system based on the production rule is illustrated in Fig. 1. This system consists of two core modules of keyword generation and importance calculation. The keyword generation part selects only nouns from words found in the dictionary, and finds synonyms and related terms for concept elements. It then uses the production rule to create compound keywords as keyword candidates. The importance attribution part applies synonym and related term weighting, and uses co-occurrence information in the importance equation in Section 3.2.3 to extract keywords in the order of greatest importance.
4.2 Evaluation of Author-Provided Keywords
The experiment of this paper is based on 5,400 sets of titles and summaries of master’s and PhD journal articles (about 28.5 kB) offered by KTSET (Korean Test Set) and Jeonju University’s academic information center [15]. Table 5 shows keywords extracted from the system and those that were unsuccessfully detected. The experiment considered author-provided keywords but did not appear in the titles or abstracts as answer keywords and conducted the extraction.
Of 2,010 (average number of 4 author keywords in 1 set made up of a title and summary) authorprovided keywords, there were 1,540 answer keywords. About 70% of the total keywords were compound keywords, and it had an average of two composite morphemes. This shows that people often use compound keywords to capture the topic of a document. There were 114 keywords extracted using the production rule and 500 keywords generated using the dictionary (case #6 in Table 1), totaling 2,190. A total of 3,350 keywords were not extracted, which means that 17.5% of the papers did not have enough trails to generate keywords, and 47.5% lacked the concept elements (synonyms, related terms) in the dictionary. The production rule was not available at all for the 35%.
Recall and precision of author keywords were evaluated using Eqs. (3) and (4). The recall was the number of extracted answer keywords over the total number of answer keywords, and the precision was the number of extracted answer keywords over the total number of extracted keywords. The recall was 35%, while the precision was 11%. The noticeably low precision is likely caused by the fact that the number of answer keywords is only 1, whereas the number of author-provided keywords is 4. Therefore, keyword validation is necessary.
Comparison of both keywords extracted by system and author-provided keywords [TeX:] $$(<\mathrm{KYWD}>>)$$
4.3 Keyword Validation
Five experimenters evaluated the validation of the rule-based extraction of keywords using the rating in Table 6. A keyword is determined to be valid if more than four evaluators scored it with B or above, and then it is assumed as an answer keyword. The accuracy rate was defined as the number of answer keywords over the number of extracted keyword candidates as in Eq. (5). The test results were compared to that of Nagata and Kimoto [6].
First, let us look at Nagata’s experiment in Fig. 6 (labeled as the previous method of keyword extraction). It first defines the index dictionary that states the key concepts and keyword relationships and uses concepts that appear in the document to generate keywords. It also uses Eq. (6) to calculate the importance. The major variables of this equation are the number of concept words that form keyword candidates, conceptual distance based on co-occurrence, and frequency.
As seen in Fig. 6, precision improved up to 96%, which is a result of comparing the top seven candidate keywords among extracted ones with that of Nagata and Kimoto [6]. The experiment selected 30 words based on field terms and thematic words. A dictionary [8] classified by Kadogawa was used to refer to conceptual components after being translated into Korean. For some terminologies related to natural language and auditory language that were not listed on the dictionary, people manually selected words from the classification chart by matching the closest meaning as best as possible and used synonyms and the related terms as the conceptual components [15-17]. For the weight value of synonyms and related terms, multiple preliminary experiments were conducted, and the optimal value was obtained: = 1 and = 0.5, which were the optimum number determined from pre-experiments and a heuristic algorithm [18].
Validity of keyword candidates
Comparison of proposed method and Nagata’s method.
5. Conclusions
The present system suggests proper keywords to assist in the quick determination of whether they will be read or not by informing them of keywords having the highest importance. This is a new introduction of the concept-based compound keyword extraction focused on extracting inference words [14] or thematic terms to help readers easily guess the document category [19,201]. This study defines a conceptbased keyword production rule to improve the keyword precision and introduces a method to calculate the keyword importance by using the frequency of word appearance, concurrence, and distance between concepts. One of the greatest points of this method is that keywords that did not appear in a document can be extracted as well as author keywords. To improve the accuracy rate of extracted keywords, it defines a concept-based production rule and proposes a calculation method of keyword importance using frequency count, co-occurrence, and conceptual distance. The biggest advantage of the developed method is that it can extract not only the author-provided keywords but also keywords that do not appear in the text body or database [21] of the document.
This study’s limitations include the method’s inability to extract field terminologies when they are not listed in the dictionary of concept elements. Thus, for future studies, the dictionary of concept elements can be established to increase the accuracy rate, and it can further be applied to the information retrieval system currently in development. Moreover, several studies [22-24] can be beneficial for future research [10,21-23,25].