1. Introduction
This study analyzes the vocabulary usage patterns of Korean heritage language learners. These learners differ from learners of Korean as a foreign or second language because they are most often raised in environments where the mother tongue and the national language differ. In particular, Korean heritage language learners may be distinct from other learner groups in their reasons for learning Korean, the degree of background knowledge they have before starting to learn Korean, and their familiarity with Korean linguistic features such as vocabulary and word order.
Therefore, studying the interlanguage of learners of Korean as a heritage language can highlight how they differ from other Korean language learners. However, Korean language education research has not fully examined how the interlanguage of these learners develops as they gain proficiency. We therefore analyze the interlanguage of Korean heritage language learners to examine their vocabulary usage patterns and their use of major content keywords by proficiency level. The "Korean Learner’s Corpus" from the National Institute of Korean Language, which includes learner interlanguage data by proficiency level, was used for the data analysis. This study exemplifies the usefulness of employing corpus language data for applied linguistic research.
2. Previous Research
Many Korean language education studies have examined the interlanguage of Korean heritage language learners; however, most have focused on error analysis and few have examined interlanguage patterns by proficiency level using a cross-sectional language analysis. No research has been conducted on the vocabulary used in the interlanguage of adult Korean language learners. Therefore, the following review focuses on studies on interlanguage patterns rather than error analysis.
Kang [1] categorized Korean vocabulary recognition using the dialogue of a Korean-English speaker during a broadcast. Lee [2,3] examined the development of Korean vocabulary skills in Korean-American children. Kim and Pyun [4] reviewed the language, literacy patterns, and performance of Korean-Americans and revealed how heritage language learners developed their literacy skills. It was found that Korean language use at home and Korean literacy skills are positively correlated, but Korean literacy skills are not highly correlated with the learners’ cognitive maturity or length of schooling. Kim [5] analyzed the writing materials of Korean heritage language learners, found that their grammar use is limited to frequent colloquial grammar forms, and consequently proposed specific educational content for heritage language learners. Lee [6] analyzed the discourse of heritage and non-heritage Korean speakers and developed a conversational education model. Lee [7] compared the Korean language skills of Korean-American beginners and non-Korean beginner learners based on their Korean Language Proficiency Test (TOPIK) scores, finding that the spoken language skills of heritage language learners are superior to those of non-heritage language learners. The study also suggested that future research on the language skills of heritage language learners should take regional characteristics into account.
Taken together, these studies share the following characteristics. First, most research focused on only one proficiency level, possibly because it is difficult to obtain learner interlanguage materials for all levels from beginner to advanced. However, because the interlanguage patterns of Korean heritage language learners change as proficiency increases, comparing their linguistic characteristics across proficiency levels could provide a fuller understanding and inform the development of more appropriate heritage language learner courses.
Second, few studies focus on the vocabulary development patterns of Korean heritage language learners. However, to better understand overall language patterns, it is necessary to examine each linguistic area separately. Therefore, we focus on the vocabulary usage patterns of Korean heritage language learners.
The limitations of these previous studies stem largely from the lack of basic data describing the target language use of Korean learners by proficiency level. We sought to overcome this limitation by conducting an inductive analysis of the Korean language learner corpus, which is balanced across proficiency levels. This allowed us to objectively analyze and describe the target language development patterns of Korean heritage language learners, an important group among Korean language learners.
This study also exemplifies the convergence of linguistics and computer science research. Many studies using Korean learner data have been conducted. For example, Lee et al. [8] examined second language acquisition (SLA), Lee [9] examined Korean keyword extraction, Cho et al. [10] studied the scoring of writing materials by Koreans from a Korean language education perspective, and Shin and Nam [11] studied methods for automatically attaching codes to language materials. However, very few convergence studies have used a corpus of Korean learner writing organized by proficiency. Therefore, this study can serve as a model for convergence research by introducing the Korean learner corpus to engineers whose main research interest is Korean language data and by introducing an engineering methodology to Korean language researchers.
3. Methodology
3.1 Corpus
We used the Korean language learner corpus data from the National Institute of Korean Language at the Korean Language Learner Corpus Sharing Center as the research subject. The Korean language learner corpus, which was started in 2015, is a collection of Korean language data from domestic and foreign learners of Korean as a second language, foreign language, and heritage language. In this study, we only analyzed the data from Korean heritage language learners.
The Korean language learner corpus comprises three separate corpora: a raw corpus, a morphological annotation corpus, and an error annotation corpus. This study uses the morphological annotation corpus as it was deemed the most appropriate. The raw language data in the corpus are of two types: written and spoken. This study analyzes only the written data because fewer spoken samples are available and because the spoken data are difficult to analyze, as they also include native speaker data such as teacher dialogue. The corpus also includes learner data for proficiency levels 1–6 and higher; however, we exclude the data above level 6 because there is very little of it. Therefore, the data analyzed in this study are the written data in the morphological annotation corpus produced by Korean heritage language learners from 2015 to 2020. The analysis data comprise 589 samples and 72,802 words. The breakdown by level is shown in Table 1.
An important preprocessing task was to correct the learners' spelling errors in the content words (nouns, adjectives, verbs, and adverbs). This was done because the purpose of this study is to identify the vocabulary being used by the learners, not to examine their vocabulary errors. However, vocabulary that a learner misused in a way that did not fit the context was left unmodified so that we could analyze the vocabulary the learner intended to use.
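The selection criteria above (heritage learners, written mode, levels 1–6) can be illustrated with a minimal Python sketch. The record fields `learner_type`, `mode`, `level`, and `tokens` are hypothetical names chosen for illustration; they are not the actual schema of the Korean language learner corpus, and the toy records are not corpus data.

```python
from collections import defaultdict

# Toy records. The field names are illustrative assumptions,
# not the real schema of the learner corpus.
samples = [
    {"learner_type": "heritage", "mode": "written", "level": 1, "tokens": ["I", "student", "be"]},
    {"learner_type": "heritage", "mode": "spoken", "level": 1, "tokens": ["hello"]},
    {"learner_type": "foreign", "mode": "written", "level": 2, "tokens": ["school", "go"]},
    {"learner_type": "heritage", "mode": "written", "level": 7, "tokens": ["economy"]},
    {"learner_type": "heritage", "mode": "written", "level": 2, "tokens": ["friend", "meet"]},
]


def select_samples(records):
    """Keep written heritage-learner samples at proficiency levels 1-6."""
    return [
        r for r in records
        if r["learner_type"] == "heritage"
        and r["mode"] == "written"
        and 1 <= r["level"] <= 6
    ]


def counts_by_level(records):
    """Return {level: (number of samples, number of words)}."""
    n_samples = defaultdict(int)
    n_words = defaultdict(int)
    for r in records:
        n_samples[r["level"]] += 1
        n_words[r["level"]] += len(r["tokens"])
    return {level: (n_samples[level], n_words[level]) for level in sorted(n_samples)}


selected = select_samples(samples)
summary = counts_by_level(selected)
print(summary)
```

A per-level summary of this kind corresponds to the sample and word counts reported by level.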
Number of samples and words included in the analysis data by level
3.2 Keyword Analysis
The keywords were selected using the TextRank algorithm, a graph-based ranking technique used to extract keywords and to summarize documents. Since it was first proposed by Mihalcea and Tarau [12], TextRank has been used in many vocabulary extraction studies [13-15], text summarization studies [16,17], and word recommendation studies [18]. TextRank builds on PageRank, an algorithm that determines the relative importance of web pages; when applied to text, it weights words and/or sentences by considering their frequencies in the text and the connections between them. Using this algorithm, we selected the major content-word keywords for each Korean heritage language learner proficiency level.
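To make the scoring idea concrete, the following is a minimal Python sketch of TextRank-style keyword ranking over a word co-occurrence graph. It is an illustration under stated assumptions, not the implementation used in this study: the window size, the damping factor of 0.85, the fixed number of iterations, and the function name `textrank_keywords` are all assumptions, and the tokens are English glosses rather than corpus data.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Rank words with the recurrence
    S(v) = (1 - d) + d * sum_u [ w(u, v) / sum_k w(u, k) ] * S(u)
    on an undirected, weighted co-occurrence graph.
    """
    # Link two different words when they occur within `window` tokens
    # of each other; repeated co-occurrence increases the edge weight.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    # Start every word at 1.0 and apply the recurrence a fixed number of times.
    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_scores = {}
        for word in weights:
            rank = 0.0
            for neighbor, weight in weights[word].items():
                neighbor_total = sum(weights[neighbor].values())
                rank += weight / neighbor_total * scores[neighbor]
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores

    # Highest-scoring words first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative tokens (English glosses), assumed to be content words only.
tokens = ["school", "friend", "school", "teacher", "school", "study"]
ranked = textrank_keywords(tokens, window=1)
print(ranked[0][0])
```

In this unnormalized form the scores sum to the number of distinct words, so the average score is 1 and a word that is well connected to other frequent words scores above 1.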
4. Analysis of the Vocabulary Development Patterns by Heritage Language Learner Proficiency Level
4.1 Keywords Common to All Proficiency Levels
Before examining vocabulary development across proficiency levels, we first checked whether any keywords appear at all proficiency levels; the results are shown in Table 2. (The topics and semantic categories in the table follow those presented in <Step 4 of Research on the Development of Korean Language Education Vocabulary Contents>; where that list gives no topic or semantic category for a word, the researcher assigned one based on experience.)
The 17 words in Table 2 are common to the writing materials at all proficiency levels. The keyword results are summarized by the topic and meaning categories in <Step 4 of the Korean Language Education Vocabulary Development Research> from the National Institute of Korean Language. The topics shared across all proficiency levels are at the personal level and relate to "self-introduction, school life, daily life, and expressing emotions." The meaning categories are likewise limited to the personal level and general daily life. These results confirm that personal and daily life topics are used frequently by all learners regardless of proficiency.
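Finding the keywords shared by every level amounts to a set intersection over the per-level keyword lists. The sketch below shows this with toy data; the function name `common_keywords` and the example words are illustrative assumptions, not the study's actual keyword lists.

```python
def common_keywords(keywords_by_level):
    """Return the keywords that occur in the keyword list of every level."""
    keyword_sets = [set(words) for words in keywords_by_level.values()]
    if not keyword_sets:
        return set()
    return set.intersection(*keyword_sets)


# Toy keyword lists (English glosses), for illustration only.
keywords_by_level = {
    1: ["friend", "school", "eat"],
    2: ["friend", "school", "hobby"],
    3: ["friend", "school", "performance"],
}
shared = common_keywords(keywords_by_level)
print(sorted(shared))
```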
Keywords common to all proficiency levels and their semantic categories
a) The romanization of each word followed the results from the romanization transducer at Pusan National University.
4.2 Distinctive Keywords by Proficiency Level
Some words, however, are distinctive to specific proficiency levels. As the learner's proficiency increases, the number of distinctive keywords, topics, and semantic categories also increases.
We draw on Vygotsky's zone of proximal development to examine the Korean heritage language learners' vocabulary expansion across the proficiency levels [19]. Unlike Piaget [20], who held that behavioral development results from inquiry and knowledge construction, Vygotsky [19] argued that behavioral development occurs through interactions across multiple levels. The zone of proximal development is defined by two levels: an actual development level and a potential development level [19]. The actual development level is the level that learners can handle without assistance; that is, it reflects the results of development so far. The potential development level, in contrast, is the level that learners can achieve with the assistance of teachers, parents, or peers. Over time, the potential development level becomes the actual development level, and the scope of development expands from the individual to parents and family members, to society, and to specialized fields in society. As Vygotsky’s approach has been applied to language education, we use it here to explain the language development patterns of Korean heritage language learners.
We tagged the topics, semantic categories, and levels using the research results from <Step 4 of the Study on Developing Korean Vocabulary Education Content>; if a word was not tagged in these research results, it was tagged directly by the researcher, and if it did not belong anywhere, it was marked as “-.” Tables 3–8 show the topics and semantic categories for the distinctive keywords at proficiency levels 1–6 (due to space limitations, only words with a TextRank score of 2 or more are presented in the tables).
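The selection of distinctive keywords can be sketched as a set difference followed by a score threshold. This sketch assumes that "distinctive" means a keyword that appears in the keyword list of exactly one level, which is an interpretation rather than a definition given in the text; the function name `distinctive_keywords` and the scores are likewise illustrative.

```python
def distinctive_keywords(scores_by_level, min_score=2.0):
    """For each level, keep keywords that appear at no other level and
    whose score is at least `min_score`, sorted by descending score."""
    result = {}
    for level, scores in scores_by_level.items():
        # Collect every keyword that appears at some other level.
        others = set()
        for other_level, other_scores in scores_by_level.items():
            if other_level != level:
                others.update(other_scores)
        unique = [
            (word, score)
            for word, score in scores.items()
            if word not in others and score >= min_score
        ]
        result[level] = sorted(unique, key=lambda item: item[1], reverse=True)
    return result


# Toy scores (English glosses), for illustration only.
scores_by_level = {
    1: {"friend": 5.0, "price": 2.4, "time": 1.2},
    2: {"friend": 4.1, "hobby": 3.0},
}
distinct = distinctive_keywords(scores_by_level)
print(distinct)
```

Here "friend" is dropped because it occurs at both levels, and "time" is dropped because its score falls below the threshold of 2.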
Distinctive keywords in proficiency level 1
Parentheses indicate the abbreviation of the part of speech. N=nouns.
Among the beginner level keywords, the most frequent topic and semantic category labels are “buying item” and “concept.” Of the concept words, vocabulary referring to "time" is used the most. Most of the vocabulary used is necessary to describe daily life, which aligns with the topics and vocabulary covered in beginner Korean textbooks and in the beginner sections of the international Korean language curriculum.
Distinctive keywords in proficiency level 2
Parentheses indicate the abbreviation of the part of speech. N=nouns, V=verbs, M=adverbs.
Hobbies are the main distinctive keyword topic at intermediate levels 3 and 4, with hobby-related vocabulary appearing five times. The semantic category associated with “life-leisure activity” also has a high frequency. Therefore, compared with the daily life and survival topics of the beginner level, the distinctive intermediate level keywords show a move toward topics and vocabulary categories related to “social life,” indicating an expansion in the semantic categories of the distinctive keywords.
Distinctive keywords in proficiency level 3
Parentheses indicate the abbreviation of the part of speech. N=nouns, V=verbs.
Distinctive keywords in proficiency level 4
Parentheses indicate the abbreviation of the part of speech. N=nouns, V=verbs, A=adjectives, Adv=adverbs.
"-" means that the topic and semantic category of the vocabulary are not designated in the <Development of Korean Vocabulary Contents> presented by the National Institute of Korean Language.
Distinctive keywords in proficiency level 5
Parentheses indicate the abbreviation of the part of speech. N=nouns.
"-" means that the topic and semantic category of the vocabulary are not designated in the <Development of Korean Vocabulary Contents> presented by the National Institute of Korean Language.
In proficiency levels 5 and 6, specialized vocabulary appears in areas such as “economic, political, social, and education.” While some of the vocabulary still corresponds to the beginner level, levels 5 and 6 are mainly characterized by intermediate and advanced level keywords. Examining these advanced distinctive keywords confirms the development pattern in the higher-level writing of Korean heritage language learners, which moves from general social vocabulary to more professional vocabulary. However, there is a relatively large gap between levels 5 and 6. While topics related to “society” mainly appear at level 5, topics in “specialized, academic” fields are more prominent at level 6. The number of words also increases rapidly when the learners enter level 6. This differs from the vocabulary development pattern for general learners reported by Hur and Lee [21], who found that the advanced vocabulary use rate of general learners increases relatively slowly and that there is only a small gap between levels 5 and 6. Compared with these general learner results, the vocabulary proficiency of level 6 Korean heritage language learners appears to be higher than that of general learners.
Distinctive keywords in proficiency level 6
Parentheses indicate the abbreviation of the part of speech. N=nouns, V=verbs, A=adjectives, M=adverbs.
"-" means that the topic and semantic category of the vocabulary are not designated in the <Development of Korean Vocabulary Contents> presented by the National Institute of Korean Language.
4.3 Vocabulary Development between Proficiency Levels
In the International Standard Model of Korean Language Education (2017 Notice) and the Korean language curriculum (2020 Notice of the Ministry of Culture, Sports and Tourism), the topics and vocabulary covered at the beginner, intermediate, and advanced levels expand from daily to personal to social to professional. In this section, we examine the specific vocabulary patterns by proficiency level and compare them with the Korean language education model and the curriculum descriptions presented at the institutional level.
Figs. 1–6 show the top 30 keywords by TextRank score at each of levels 1–6.
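Selecting the top-ranked keywords for a level is a sort by score followed by truncation. The following sketch assumes a dictionary that maps each keyword to its score; the function name `top_keywords`, the alphabetical tie-break, and the toy scores are illustrative assumptions.

```python
def top_keywords(scores, n=30):
    """Return the n highest-scoring (word, score) pairs.

    Ties are broken alphabetically so that the result is deterministic.
    """
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]


# Toy scores (English glosses), for illustration only.
level_scores = {"time": 1.0, "friend": 3.0, "school": 2.0}
top_two = top_keywords(level_scores, n=2)
print(top_two)
```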
Level 1 keywords (by TextRank score).
The most frequent level 1 keywords belong to the concept semantic category, followed by eating, life, human, and economic life. The concept category has the highest frequency at all levels, which is consistent with analyses of general language use. The distinctive characteristic of level 1 is the high frequency of eating vocabulary; these eating keywords decrease as the proficiency level increases. These results suggest that the eating category is most frequent at level 1 because meals are the most frequently encountered element of daily life. All level 1 vocabulary could be tagged as beginner level in <Vocabulary Content Development Research>. Thus, even for heritage language learners, level 1 shows no vocabulary development pattern beyond the general-purpose level of Korean language education.
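The ordering of semantic categories by frequency can be reproduced by counting the category label of each keyword. In the sketch below, the mapping `category_of` and the example words are hypothetical; words without a label are counted under "-", mirroring the marker used for untagged vocabulary.

```python
from collections import Counter


def category_ranking(keywords, category_of):
    """Count semantic categories over a keyword list, most frequent first.

    Keywords without a category label are counted under "-".
    """
    counts = Counter(category_of.get(word, "-") for word in keywords)
    return counts.most_common()


# Hypothetical category labels (English glosses), for illustration only.
category_of = {
    "time": "concept",
    "day": "concept",
    "week": "concept",
    "rice": "eating",
    "meal": "eating",
    "friend": "human",
}
keywords = ["time", "rice", "day", "friend", "meal", "week"]
ranking = category_ranking(keywords, category_of)
print(ranking)
```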
Level 2 keywords (by TextRank score).
The semantic categories for the level 2 keywords are concept-human-life-social life-education-state-eating in that order of frequency. As with level 1, the concept, human, and life categories are the most frequent, but social life and education have higher frequencies than at level 1, indicating that the semantic categories expand as learners become able to deal with a greater number of topics.
Level 3 keywords (by TextRank score).
The semantic categories for the level 3 keywords are concept-human-life-social life in that order of frequency. While the frequencies of these semantic categories are similar to levels 1 and 2, the "cultural" category is more prominent than at level 1. The level 1 and 2 keywords are mainly related to daily life, whereas the level 3 keywords reflect more cultural aspects. Although beginner level vocabulary accounts for most of the level 3 keyword list, more intermediate vocabulary is included, such as "performance," "Daeha-da," "scene," and "stage," which correspond to the intermediate level in <Vocabulary Content Development Research>.
Level 4 keywords (by TextRank score).
The semantic categories for the level 4 keywords are concept-human-social life-life in that order of frequency. Compared with the keywords at levels 1–3, the top keywords at level 4 include abstract concepts such as "experience," "difference," and "life." The <Korean Standard Curriculum> categorizes the topics covered at level 4 under "social and abstract." While the "social" topic is already covered in the overall level 3 goal, the "abstract" topic first appears at level 4, and it also appears in the interlanguage of the level 4 heritage language learners. Unlike level 3, the distinctive keywords at level 4 are intermediate vocabulary words.
Level 5 keywords (by TextRank score).
The high frequency semantic categories for the level 5 keywords are concept-human-social life-life in that order of frequency. There are many abstract concept keywords, and national level semantic categories beyond the personal and social levels also begin to emerge. While the majority of the distinctive keyword vocabulary corresponds to the intermediate level, all of the vocabulary belongs to <Step 4 of the Korean Language Education Vocabulary Development Research>. This differs from level 6, which includes "non-level" vocabulary that does not belong to any of the beginner, intermediate, or advanced levels. Even though levels 5 and 6 are both advanced, they differ because the level 6 vocabulary is more diverse, detailed, and professional.
Level 6 keywords (by TextRank score).
The high frequency semantic categories for the level 6 keywords are concept-human-life-human life in that order of frequency. Compared with the keywords at levels 1–5, the prominent meaning category at level 6 is the professional area of "politics and administration," which does not appear among the level 5 keywords. The level 6 keywords are in line with the level 6 descriptions in the "International Common Korean Standard Model" and the "Korean Standard Curriculum." These two curricula define level 6 proficiency as the ability to deal with social, professional, and academic areas.
The results of this keyword analysis confirm that the Korean vocabulary skills of Korean heritage language learners expand from personal to social areas and from daily to professional and academic areas. (The use of difficult and complex Chinese characters particularly increases at the advanced levels, including level 6. However, this result may be attributable to the linguistic background of the learners who produced the texts: if learners whose native language has many Chinese characters were predominantly distributed in level 6, the result would have been more likely. Nevertheless, because the purpose of this paper was a holistic analysis rather than one that controls for such variables, we focused on examining the overall vocabulary development patterns of heritage learners.) The keyword expansion patterns of the Korean heritage language learners are shown in Fig. 7.
Extension of keywords by proficiency.
Keywords in the individual domain are prominent at beginner proficiency levels 1 and 2, keywords in the social domain are prominent at intermediate proficiency levels 3 and 4, and keywords in the social and professional domains are prominent at advanced proficiency levels 5 and 6. The analysis confirms that the cognitive development of Korean heritage language learners expands from the individual level to social, national, and professional domains, with negotiation between these domains, which is in line with the view of learning in Vygotsky's social constructivism. Vygotsky's theory of the zone of proximal development also assumes that a person's capacity gradually expands through interaction, which leads to a larger scope [19]. In other words, Vygotsky suggests that a learner's capacity is initially limited to a narrow range but gradually expands to a wider one.
5. Conclusion
Based on corpus analysis, this study objectively analyzes the Korean vocabulary development patterns of Korean heritage language learners. The keywords, which were analyzed by proficiency level together with their associated semantic categories, were determined using the TextRank algorithm. We found that as the heritage language learners’ proficiency increases, low-frequency (high-level) vocabulary begins to appear among the keywords, with the semantic categories expanding from daily to social to specialized fields. We therefore confirmed that the vocabulary use of Korean heritage language learners develops as their proficiency increases.
This study is meaningful because it confirms the Korean vocabulary development in Korean heritage language learners, a learner group that has not been focused on in past research. This study is also meaningful because it exemplifies the convergence of data-based applied linguistic research and computer science by using a keyword extraction algorithm devised in the machine learning field.
Further studies are needed to compare the similarities and differences in the vocabulary development patterns of Korean heritage and non-heritage language learners. If such a study were conducted, the differences in these Korean learner vocabulary development patterns could be examined in greater detail.