Automated Proofreading of Chinese Text Using Syntactic Analysis and Semantic Layering

Lili Zhu

Abstract

This paper presents a study integrating two methodologies for automated proofreading of Chinese text: a state-of-the-art syntactic analyzer and a semantic hierarchical model. The analyzer processed the input text, extracting part-of-speech tags and syntactic dependencies, while the semantic hierarchical model performed hierarchical semantic analysis. Combining syntactic rigor with semantic depth markedly increased error detection accuracy, outperforming both the benchmarks established in the literature and state-of-the-art software systems. In particular, the study demonstrated a substantial enhancement in the average F-measure, surpassing Kingsoft WPS (2024) by 39% and Microsoft Word (2024) by 55%. Notably, for error types whose F-measure has historically been difficult to improve, particularly word ambiguity, the study achieved a 65% increase in detection accuracy.

Keywords: Automated Proofreading, Chinese Text, Semantic Layering, Syntactic Analysis

1. Introduction

Most research addressing the challenges of automated Chinese text proofreading concentrates solely on the current state of the language and lacks a systematic methodological approach. The regularity and characteristics of word omission errors in Chinese texts remain to be studied in depth. Understanding and systematizing these characteristics could significantly accelerate research on automatic Chinese text recognition technology, which is still in the preliminary stages of development. Progress is predominantly obstructed by the multitude of linguistic phenomena in the Chinese language. Chinese differs significantly from English in terms of morphology, word formation, grammar, cognitive processes, pronunciation, and its intricate system of characters [1,2]. Considering these complexities, in this study we aimed to facilitate the development of automatic text detection methods and software tailored specifically for Chinese language processing. We achieved comprehensive coverage and detection of linguistic errors in Chinese text. The experimental results of this study demonstrate that the integrated effects of the syntactic analyzer and semantic hierarchical model proposed in this paper significantly outperform existing methods across multiple datasets.

The structure of this paper is as follows: Section 2 discusses the existing automatic text detection methods, Section 3 proposes a newly developed method, Section 4 introduces the experimental setup, Section 5 analyzes the experimental results, and Section 6 draws conclusions by analyzing and comparing the existing methods with the newly developed method.

2. Background and Related Work

In recent years, the advancement of deep learning for language has sparked the emergence of neural network-based automatic detection methods applied to Chinese text. The most widely used of these methods include: recurrent neural networks (RNN), sequence-to-sequence models (Seq2Seq) [3,4], the self-attention mechanism (Transformer) [5-7], the convolutional sequence-to-sequence model (ConvS2S) [8,9], and bidirectional encoder representations from transformers (BERT)-based methods such as the SpellGCN model, the DBNet model [10,11], soft-masked BERT [10,12], and Qwen [13].

Among them, Seq2Seq, a variant of RNN, predicts the current Chinese character or word by analyzing the preceding text. The advantage of Seq2Seq and Seq2Edit lies in their multi-dimensional evaluation system, which assesses models along two dimensions: minimal edit and fluency edit. On fluency-edit and mean F0.5 scores, these open tasks demonstrate strong error correction capabilities. A variety of performance enhancement techniques and large-model assistance effectively improve the speed and accuracy of text error correction. However, a major limitation of Seq2Seq and Seq2Edit is their exclusive reliance on model-level weights and their inability to categorize at the error-type level; these two characteristics may impair the correction of specific error types. Although large models have been explored for text error correction, existing approaches rely only on their direct prediction results for integration, without using the knowledge extracted by these models to generate pseudo-data or apply other enhancement techniques. As a result, although large models enhance both the speed and accuracy of error correction, the overall process using these models (e.g., GPT-4) remains challenging. Additionally, the features of artificial intelligence (AI)-generated text differ across large language models, contributing both to diminished accuracy in detecting AI-generated text and to the identification of biased data [14]. Excessive text embellishment and rewriting also affect assessment accuracy [3]. Furthermore, large models may exhibit limited performance in detecting complex contextual relationships and recognizing rare words, primarily due to the scalability issues associated with large datasets.

The application of the BERT model has helped mitigate certain limitations in Chinese Grammatical Error Correction (CGEC), achieving strong performance in text proofreading. However, the current evaluation metrics of this model remain insufficient to meet the requirements for point target detection or commercial standards [15]. Additionally, Lei and Hu [16] conducted a comparative analysis and found that deep networks demonstrated stronger performance in extracting features for error detection from Chinese text than shallow networks. Training shallow networks also required more skill. For instance, recent state-of-the-art methods, such as CRAFT (Character Region Awareness for Text Detection) and ESTA (Efficient Scale-aware Text Attention), were outperformed by deep networks. Furthermore, shallow feature extraction yielded unsatisfactory results when detecting small text regions.

3. Methods

In the automatic detection of incorrect words, error checking and correction should be based on the correct form of characters and words [17]. These processes are essential and should be based on predefined rules or big data. Therefore, corpus input and analysis are particularly important, with database construction standing out as a key and particularly challenging component of the research [18]. This study began with developing the corpus.

3.1 Corpus Development

We used the Qwen2 model [13], Oracle Database 20c together with SQL Server 2017, and first established the multiple corpus. In our study, we integrated pre-training and post-training on a foundation of high-quality, large-scale datasets to expand the contextual window and enhance our model's ability to process long texts. Post-training annotation data, supervised fine-tuning, and reinforcement learning were also used. Leveraging Oracle's multi-tenant architecture enabled us to implement and manage the database in the cloud, optimizing resource use and ensuring flexibility. For instance, the Oracle multi-tenant database architecture yielded greater integration of multiple corpora, more efficient data compression, and more accurate hierarchical structuring of data.

The corpus processing platform for automatic text detection was completed by inputting and refining the corpus. We developed a Chinese tree library based on an established syntax tree library; it was generated automatically and manually proofread. At present, the generated syntax and semantic decision-tree training library contains approximately 535 million words. The types of corpus developed and the methods used for its development are outlined in Fig. 1.

Fig. 1.
Flowchart of the corpus processing platform.
3.2 Building an Intelligent Information Processing Platform

Automatic proofreading was implemented through the following steps. First, based on the Oracle Database and SQL Server platforms, the specific steps (Fig. 2) were programmed in Java and JavaScript, completing an information processing platform for automatic text proofreading.

Java Database Connectivity (JDBC) was used to connect to the database and convert the text for detection and information processing.

Fig. 2.
Flowchart of the information processing platform.
3.3 Detection Process

This study utilizes the jieba.posseg module to annotate part-of-speech information in the corpus stored in the database. This module is compatible with the lexical analysis system ICTCLAS3.0.

We retrieved the parameters to obtain the text data for verification.

The text was read. The total number of sentences was obtained and then divided into individual sentences. The jieba.posseg module was used to segment the text and annotate each word with its part of speech (POS) and semantics.

We combined the information from the “tb_cydp” corpus to match the words in the target text tb_str0, as shown in Fig. 3.

Fig. 3.
Detection process.

In the next stage, part-of-speech strings were generated using n-gram techniques, and syntactic components were obtained by analyzing the relationship between adjacent words [19]. The Language Technology Platform (LTP) syntax analyzer was employed to identify the syntactic structure information, extract dependency trees, and obtain syntactic components from the text. Additionally, the syntax analyzer was used to perform a syntactic analysis on the input text, segmenting it into 1-n cells. Each cell corresponded to a single sentence, enabling the identification of syntactic dependency relationships. Meanwhile, the core predicate was extracted from the sentence based on part-of-speech tagging. Using either the core verb or predicate as a reference, the system automatically searches for words governed by the verb, extracts syntactic components of various sentence types, and identifies the word within the sentence that has the highest likelihood of syntactic prominence. Subsequently, the system detects the collocations of words and components.
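The n-gram generation over part-of-speech strings described above can be sketched as follows. This is a minimal illustration; the tag names and the four-word sentence are invented, not the paper's actual tag set:

```python
# Sliding an n-sized window over the part-of-speech (POS) tag sequence of a
# segmented sentence to produce POS strings for adjacent-word analysis.

def pos_ngrams(tags, n):
    """Return all n-grams (as tuples) over a sequence of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Hypothetical tagging of a four-word sentence: verb, noun, adverb, adjective.
tags = ["v", "n", "d", "a"]
bigrams = pos_ngrams(tags, 2)
trigrams = pos_ngrams(tags, 3)
print(bigrams)   # [('v', 'n'), ('n', 'd'), ('d', 'a')]
print(trigrams)  # [('v', 'n', 'd'), ('n', 'd', 'a')]
```

Each n-gram then serves as one unit when checking the collocation of adjacent syntactic components.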

Sentence S consists of a sequence of words and syntactic component sequences [TeX:] $$w_1, w_2, w_3, \ldots, w_n,$$ and its probability is expressed as formula (1):

(1)
[TeX:] $$p(s)=p\left(w_1, w_2, w_3, \ldots, w_n\right)=p\left(w_1\right) p\left(w_2 \mid w_1\right) p\left(w_3 \mid w_1, w_2\right) \ldots p\left(w_n \mid w_1, w_2, w_3, \ldots, w_{n-1}\right)$$

The parameter space of such a model is excessively large, with a multitude of variables involved, and most word sequences occur too rarely in any corpus for their probabilities to be estimated reliably.

Therefore, the Markov assumption was introduced. It states that the appearance of a word is related only to a limited number of preceding words, as in the following formula (2):

(2)
[TeX:] $$p\left(w_1 \ldots w_n\right)=\prod p\left(w_i \mid w_{i-1} \ldots w_1\right) \approx \prod p\left(w_i \mid w_{i-1} \ldots w_{i-N+1}\right)$$

The probability of each condition was calculated, as shown in formula (3):

(3)
[TeX:] $$\begin{gathered} p\left(w_n \mid w_{n-1}\right)=\frac{C\left(w_{n-1} w_n\right)}{C\left(w_{n-1}\right)} \\ p\left(w_n \mid w_{n-1} w_{n-2}\right)=\frac{C\left(w_{n-2} w_{n-1} w_n\right)}{C\left(w_{n-2} w_{n-1}\right)} \\ p\left(w_n \mid w_{n-1} \ldots w_2 w_1\right)=\frac{C\left(w_1 w_2 \ldots w_n\right)}{C\left(w_1 w_2 \ldots w_{n-1}\right)} \end{gathered}$$
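Formula (3) amounts to maximum-likelihood estimation from counts. A minimal sketch for the bigram case, over a toy token sequence that stands in for the corpus:

```python
from collections import Counter

def bigram_prob(corpus_tokens, prev, word):
    """Estimate p(word | prev) = C(prev word) / C(prev), as in formula (3)."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# Toy corpus of POS-like tokens (illustrative only, not the paper's corpus).
corpus = ["v", "n", "v", "n", "v", "a"]
print(bigram_prob(corpus, "v", "n"))  # 2/3: "v" occurs 3 times, followed by "n" twice
```

The trigram and higher-order cases replace the unigram and bigram counters with counters over longer windows.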

Subsequently, the neural network language model (NNLM) [20] was used to resolve the issue of data sparsity in probability estimation, as shown in Fig. 4.

Fig. 4.
Neural network language model (NNLM) used to resolve data sparsity in probability estimation.
Fig. 5.
Modified feedforward neural network language model combining syntactic and semantic analysis.

The feedforward neural language model is the neural network language model proposed by Bengio et al. [20] in 2003, utilizing a three-layer feedforward neural network for modeling. By combining syntax and semantic analysis, we were able to modify and implement this method, which is shown in Fig. 5.

Output layer:

[TeX:] $$\hat{y}=\operatorname{softmax}\left(w_2 h^{(t)}+b_2\right)$$

Hidden layer:

[TeX:] $$h^{(t)}=\mathrm{f}\left(w_h h^{(t-1)}+w_e c_t+b_1\right)$$

where [TeX:] $$h^{(0)}$$ is the initial hidden state.

Input layer:

[TeX:] $$c_1, c_2, c_3, c_4 \ldots$$
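The layer equations above can be traced with a minimal pure-Python forward pass. All weights, dimensions, and embeddings below are toy values chosen for illustration, not the trained model's parameters:

```python
import math

def matvec(M, v):
    """Matrix-vector product over plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def step(h_prev, c_t, W_h, W_e, b1):
    """One hidden-state update: h_t = tanh(W_h h_{t-1} + W_e c_t + b_1)."""
    z = vadd(vadd(matvec(W_h, h_prev), matvec(W_e, c_t)), b1)
    return [math.tanh(x) for x in z]

# Toy dimensions: hidden size 2, embedding size 2, vocabulary size 3.
W_h = [[0.1, 0.0], [0.0, 0.1]]
W_e = [[0.5, 0.0], [0.0, 0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b2 = [0.0, 0.0, 0.0]

h = [0.0, 0.0]                       # h(0): initial hidden state
for c in ([1.0, 0.0], [0.0, 1.0]):  # two word embeddings c_1, c_2
    h = step(h, c, W_h, W_e, b1)
y_hat = softmax(vadd(matvec(W2, h), b2))
print(y_hat)  # probability distribution over the 3-word vocabulary
```

The output layer maps the final hidden state to a distribution over the vocabulary, which is what makes the model usable for scoring candidate words in context.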

Ternary, quaternary, and even higher-order models do not adequately represent all aspects of language phenomena. This limitation is a consequence of the broad applicability of contextual correlation, which can extend across paragraphs. Therefore, despite the increased order of the model, the problem remains unresolved. The solution lies in using the previously mentioned long-distance dependency relationships.

Then, circular or repetitive phrasing in sentences was detected, combined with corpus information, to flag such words in the text requiring proofreading.

The threshold was set to [TeX:] $$T=\frac{1}{n} \sum_{i=1}^n \max \left(X_i\right),$$ and the maximum threshold from each corpus type was extracted separately for components. The sequence was represented as [TeX:] $$X_1, X_2, X_3, \ldots, X_n.$$

Using the component sequence [TeX:] $$X_1, X_2, X_3, \ldots X_n ,$$ adjacent component combinations (e.g., [TeX:] $$X_1 \text { and } X_2,$$ [TeX:] $$X_2 \text { and } X_3, \ldots X_{n-1} \text { and } X_n$$) and non-adjacent component combinations in sentences were detected (traversing [TeX:] $$X_1, X_2, X_3, \ldots X_n,$$ etc.).
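The traversal of adjacent and non-adjacent component combinations can be sketched as follows; the component labels are placeholders matching the notation above:

```python
from itertools import combinations

def component_pairs(components):
    """Split all pairwise combinations of a syntactic-component sequence into
    adjacent pairs (X1X2, X2X3, ...) and the remaining non-adjacent pairs."""
    adjacent = list(zip(components, components[1:]))
    non_adjacent = [p for p in combinations(components, 2) if p not in adjacent]
    return adjacent, non_adjacent

adj, non_adj = component_pairs(["X1", "X2", "X3", "X4"])
print(adj)      # [('X1', 'X2'), ('X2', 'X3'), ('X3', 'X4')]
print(non_adj)  # [('X1', 'X3'), ('X1', 'X4'), ('X2', 'X4')]
```

Each pair would then be checked against the collocation knowledge in the corpus.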

We used Chinese WordNet to assign the vocabulary to semantic levels and establish semantic relationships between words. Subsequently, syntactic and semantic rules were formulated and translated into rule engines within the program. Using existing training libraries generated by automatic machine learning, the system identified words governed by the core verb or predicate, extracted various semantic cases, determined the word most likely to dominate semantically in the sentence, and performed comprehensive detection of Chinese text at both grammatical and semantic levels. The maximum-likelihood algorithm for calculating a probability distribution was employed to estimate the empirical probability that a segmented word w functions as a given syntactic component t. The conditional probability of t in the tree library is defined by formula (4):

(4)
[TeX:] $$\tilde{p}(t \mid w)=\frac{\operatorname{freq}(w, t)}{\sum_{w, t} \operatorname{freq}(w, t)}$$

The frequency freq(w,t) represents the number of occurrences of the pair (w,t) in the tree library.

The empirical probability of a semantic component s appearing with w was calculated, where w is a segmented word. The conditional probability of s in the tree library is given by formula (5):

(5)
[TeX:] $$\tilde{p}(s \mid w)=\frac{\operatorname{freq}(w, s)}{\sum_{w, s} \operatorname{freq}(w, s)}$$

where freq(w,s) represents the number of occurrences of the pair (w,s) in the tree library.

Through repeated verification, the threshold was set to 0.2. Testing revealed that when the conditional probability [TeX:] $$\tilde{p}(t \mid w) \geq 0.2,$$ the probability of the string w functioning as the corresponding component in the sentence is extremely high, so w was inferred to constitute a sentence component; the string was then split, and the component names were marked accordingly. Conversely, [TeX:] $$\tilde{p}(t \mid w)\lt 0.2$$ indicated that the string w was not a component of the sentence, and each subsequent string was evaluated until all components were marked. The names of unmarked components were then displayed and indicated with an asterisk (*).
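The empirical probability of formula (4) and the 0.2 threshold rule can be sketched together. The words, component tags, and counts below are invented for illustration, and the denominator is read as summing freq(w, t) over the components observed for the given word:

```python
from collections import Counter

# Toy tree-library counts freq(w, t): occurrences of word w as component t.
freq = Counter({("提高", "predicate"): 8, ("提高", "object"): 2,
                ("人员", "object"): 9, ("人员", "subject"): 1})

def p_component(w, t):
    """Empirical p(t | w) = freq(w, t) / sum over t' of freq(w, t')."""
    total = sum(c for (word, _), c in freq.items() if word == w)
    return freq[(w, t)] / total if total else 0.0

def mark(w, t, threshold=0.2):
    """Threshold rule: >= 0.2 -> mark as component t; otherwise flag the
    component name with an asterisk as unmarked."""
    return t if p_component(w, t) >= threshold else "*" + t

print(p_component("提高", "predicate"))  # 0.8
print(mark("提高", "predicate"))         # 'predicate'
print(mark("人员", "subject"))           # '*subject' (0.1 < 0.2)
```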

Applying the matching principle along with the continuation strategy, this study proposes a more efficient method for detecting segmented words and traversing both the corpus and training library. This method merges both the maximum forward and reverse matching algorithms for word segmentation, leveraging the syntactic dependency and semantic hierarchical resource library. This method of guiding the matching process enhances its effectiveness in addressing common word errors. The scope of word detection was reduced to avoid excessive search and proofreading. The algorithm terminated upon meeting the necessary conditions for matching, resulting in a successful establishment. The syntax and semantic detection process is shown in Fig. 6.

Fig. 6.
The sequence of detection steps applied to syntactic components and semantic collocations.

Subsequently, pilot testing was conducted on examples of lexical items to demonstrate and evaluate the effectiveness of the detection method. The text undergoing proofreading was input as “提高和造就一批专业技术人员” (to train professional and technical personnel), and the items were annotated with value indicators. Subsequently, the identification of matches was initiated. The search terms were automatically marked as “提”, “高”, “和”, “造”, “就”, “一”, “批”, “专”, “业”, “技”, “术”, “人”, “员”. The maximum forward and reverse matching algorithms were called, and the words were segmented into “提高”, “和”, “造就”, “一批”, “专业”, “技术”, and “人员”. Syntactic components were then matched to determine whether to combine and match, with “提高” and “一批专业技术人员” causing semantic ambiguity. The terms “提高” and “人员” were added to the result array and pushed onto the stack. Results were obtained, corresponding error correction suggestions were provided, and the algorithm was terminated.
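The forward and reverse maximum-matching segmentation invoked in this example is a classic dictionary-based algorithm; a generic sketch is shown below, run on the example sentence. The vocabulary is a stand-in for the paper's corpus, not its actual dictionary:

```python
def forward_max_match(text, vocab, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary
    word starting there; fall back to a single character."""
    i, out = 0, []
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in vocab or length == 1:
                out.append(text[i:i + length])
                i += length
                break
    return out

def reverse_max_match(text, vocab, max_len=4):
    """Reverse maximum matching: same idea, scanning from the right."""
    j, out = len(text), []
    while j > 0:
        for length in range(min(max_len, j), 0, -1):
            if text[j - length:j] in vocab or length == 1:
                out.insert(0, text[j - length:j])
                j -= length
                break
    return out

vocab = {"提高", "和", "造就", "一批", "专业", "技术", "人员"}
text = "提高和造就一批专业技术人员"
print(forward_max_match(text, vocab))  # ['提高', '和', '造就', '一批', '专业', '技术', '人员']
print(reverse_max_match(text, vocab))  # same segmentation here
```

When the two directions disagree, the disagreement region itself is a useful signal of a likely segmentation or collocation error.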

4. Experiment

4.1 Experimental Data

We chose widely used and representative Chinese texts to ensure the comprehensiveness and reliability of the experimental results. The data for this experiment were sourced from the HSK Dynamic Composition Corpus (Version 2.0) (http://hsk.blcu.edu.cn/), which includes erroneous Chinese sentences written by international students worldwide. To better reflect the current state of language use, we collected 1,000 incorrect sentences selected at random, without predefined search criteria.

4.2 Evaluation Criteria

In our experiment, we evaluated the performance of the model and the detection results using metrics such as Precision, Recall, and F-measure. These indicators provide a complete view of the detection effectiveness and performance across categories.

The test is based on a confusion matrix to assess the following indicators:

[TeX:] $$\begin{gathered} \text { Precision }=\frac{T P}{T P+F P} \\ \text { Recall }=\frac{T P}{T P+F N} \\ F-\text { measure }=\frac{2 * \text { Precision } * \text { Recall }}{\text { Precision }+ \text { Recall }} \end{gathered}$$
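These metrics follow directly from confusion-matrix counts; a small helper, with hypothetical counts for illustration:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Hypothetical counts: 80 true positives, 20 false positives, 20 false negatives.
p, r, f = prf(80, 20, 20)
print(p, r, f)  # 0.8 for all three when precision equals recall
```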

4.3 Comparative Methods

Four sets of experiments were designed in this study: (1) description logics and ontology reasoning, (2) bidirectional LSTM and attention mechanism (Bi-LSTM-ATT) and conditional random field (CRF), (3) semantic error detection, and (4) syntactic analysis and semantic layering. The first three groups contain comparative experiments. The models are introduced as follows.

4.3.1 Description logics and ontology reasoning

The method was proposed by Jiang et al. [21]. It involves extracting semantic content from Chinese texts and transforming it into structured ontology, which is then combined with the appropriate background ontology. The logical consistency of the extracted semantic content is assessed using a description logic reasoner, which detects logical inconsistencies in certain Chinese semantic errors. This method has been tested in the domain of politically sensitive information.

4.3.2 Bi-LSTM-ATT and CRF

The method was proposed by Wang et al. [19]. The Bi-LSTM-ATT model was adopted to improve label accuracy by fully leveraging all potentially useful information from the sequence context by extracting features from both preceding and following words [22]. In general, the model can be defined by formulas (6)-(8):

(6)
[TeX:] $$u_i=\tanh \left(w_w h_i+b_w\right)$$

(7)
[TeX:] $$\alpha_i=\frac{\exp \left(u_i^T u_w\right)}{\sum \exp \left(u_i^T u_w\right)}$$

(8)
[TeX:] $$C=\sum \alpha_i h_i$$

where [TeX:] $$w_w$$ represents the weight matrix that linearly transforms the input [TeX:] $$h_i, u_w$$ denotes the context vector at the word level, and C is the attention-weighted context vector. The final scores for all labels associated with each word are obtained, corresponding to the probability score for mapping each word to the label.

4.3.3 Semantic error detection

The method was proposed by Zhang and Zheng [23]. On the basis of a three-layer semantic collocation knowledge base, a top-down search pattern was first employed to determine potentially incorrect semantic collocations. Then, the mutual information (MI) and aggregated probability distributions (PD) of semantic collocations were used as evidence of collocational strength. Subsequently, a trust allocation function was established through statistical methods. By integrating conflict resolution mechanisms for evidence derived from semantic contradictions and Dempster–Shafer weighted rules, uncertainty inference was conducted to assess the strength of semantic collocations between words and thereby facilitate the identification of potential semantic errors.

4.3.4 Syntactic analysis and semantic layering

This method incorporates an additional layer of semantic analysis into the semantic error detection approach. By leveraging syntactic structures, this integration enhances the ability to identify sentence collocations and semantic features, which both reduces training time and improves the performance of the new approach. The method is tailored to the characteristics of Chinese errors, integrating an analysis of Chinese combination rules and employing syntactic analysis and semantic layering techniques. It also minimizes the processing of positional information during analysis, enhancing overall detection performance.

The proposed method employs an integrated syntax-semantics analysis through a three-tier hierarchical architecture, according to the following workflow:

Syntactic parsing layer

The LTP dependency parser was adapted specifically to incorporate Chinese linguistic features. Namely, a transformation matrix was developed to define dependency relations and integrate 378 rules. This enabled the establishment of priority parsing paths for special constructions such as the ba-construction (把字句) and bei-construction (被字句). For example, upon detecting “V+得+Adj” structures (e.g., 跳得很高, “jumped very high”), complement relation labeling and analysis were automatically initiated.

Semantic constraint layer

Hierarchical semantic processing was achieved through a three-phase implementation: (1) integration with the Extended Tongyici Cilin (containing 75,696 Chinese lexical entries) to establish dynamic word-sense disambiguation mechanisms; (2) detection of verb-object collocation validity via dependency path analysis; and (3) implementation of an optimized pointwise mutual information (PMI) algorithm to calculate probabilities of verb-noun co-occurrence:

(9)
[TeX:] $$P M I_{a d j}(v, n)=\frac{\log _2 \frac{p(v, n)}{p(v) p(n)}}{1+\sqrt{D(v, n)}}$$

where D(v,n) represents the dependency strength. Next, semantic role labeling (SRL) was used to verify the logical consistency of propositions.

Error localization layer

Error confidence was determined through a dual-threshold mechanism:

(1) Error alerts were triggered when grammatical confidence [TeX:] $$\mathrm{Cg} \geq 0.7$$ and semantic confidence [TeX:] $$\mathrm{Cs} \leq 0.3.$$

(2) Positional encoding employed a relative positional attention mechanism, reducing computational complexity from [TeX:] $$O\left(n^2\right) \text { to } O(n \log n) .$$ This architecture achieved a 37.2% improvement in training efficiency (compared to the baseline in Section 4.4), with an F1-score of 92.1% in detecting typical Chinese language errors such as quantifier misuse (e.g., “一个书”) and function word redundancy (e.g., “的的地”).
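The dual-threshold mechanism in (1) is a simple predicate over the two confidence scores; the thresholds are the paper's values, while the confidence inputs below are hypothetical:

```python
def error_alert(c_grammar, c_semantic, g_thresh=0.7, s_thresh=0.3):
    """Trigger an error alert when grammatical confidence is high (Cg >= 0.7)
    but semantic confidence is low (Cs <= 0.3)."""
    return c_grammar >= g_thresh and c_semantic <= s_thresh

print(error_alert(0.85, 0.2))  # True: e.g., quantifier misuse such as "一个书"
print(error_alert(0.85, 0.6))  # False: semantically plausible, no alert
```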

4.4 Parameter Settings

For the experimental models, we selected proper feature representation methods and conducted parameter tuning. We chose appropriate kernel functions and regularization parameters.

4.4.1 Kernel function selection

- Parsing module: Radial basis function (RBF) kernel was adopted, with bandwidth parameter [TeX:] $$\gamma=0.15 .$$

- Semantic modeling module: Hybrid kernel function was implemented:

[TeX:] $$K_{mix}=0.6 K_{linear}+0.4 K_{poly} \quad(\text{polynomial order } d=3).$$

4.4.2 Regularization parameters

Determined via cross-validation, L2 regularization coefficient λ was set to [TeX:] $$10^{-3}$$ for syntactic layer and [TeX:] $$10^{-4}$$ for semantic layer. A dynamic decay strategy was adopted for the dropout rate, with an initial value of 0.5, decaying by 0.02 per epoch.

4.4.3 Feature engineering

The POS tagging dimension was represented by 43 dimensions (including five Chinese-specific tags: localizers, location words, etc.). Dependency relation embeddings were initialized using 64-dimensional GloVe vectors. Context window sizes were set to 5 for the syntactic layer and 7 for the semantic layer. After parameter optimization, the accuracy of Chinese classifier selection improved from the baseline method’s 76.4% to 93.7%.

5. Results and Discussion

The experimental results of this study are presented in Fig. 7 and Table 1.

Fig. 7.
Output the result after detection.

Across the 1,000 sentences used in the experiment, the recall rate is higher than the precision rate. Typos show the highest precision, followed by redundant words and inappropriate words, while word ambiguity shows the lowest. The precision of detecting missing words is significantly lower than that of redundant and inappropriate words. The primary challenge arises when words are missing, as the syntactic components then become difficult to segment further; this creates ambiguity in sentence-internal semantics, making such errors susceptible to being classified as a different error type. The recall and precision for ambiguity are lower than for the other items, largely owing to factors such as communication, emotion, and context. For example, while the syntactic and semantic expression may be correct, it could still be considered incorrect under specific circumstances.

The experimental results show that the algorithm proposed in this study shows high adaptability in detecting short words and maintains a high accuracy rate.

Table 1.
Experimental results: statistical analysis of incorrect sentence detection (unit: %)

6. Conclusion

Currently, the most well-known automated proofreading software programs are Kingsoft WPS proofreading and Microsoft Word detection. Both programs are widely used in China and demonstrate higher proofreading performance compared to other proofreading systems. Therefore, we focused on comparing the recall rate, accuracy, and F-measures of Kingsoft WPS (2024), Microsoft Word (2024), and the outcomes of this study. The test results are shown in Fig. 8.

Fig. 8.
Comparison of error detection results obtained using Microsoft Word, Kingsoft WPS, and the approach in this study. See Table 1 for a list of abbreviations.

While Microsoft Word (2024) outperforms Chinese methods in detecting English spelling errors, the program identifies Chinese characters more accurately than Chinese words. Additionally, only a portion of the repeated characters can be detected, and almost no missing words can be identified. Word (2024) generally fails to detect erroneous characters in an isolated text made up of individual, unconnected words and to identify syntactic and semantic errors.

Kingsoft WPS (2024) proofreading focuses mainly on word-level errors, i.e., identifying incorrect Chinese characters. The program can detect various characters and single-word transpositions by matching patterns and identifying redundant or missing words. Its primary strength lies in collocation detection. Kingsoft WPS (2024) detects missing words, misused parts of speech, and word ambiguity with greater precision than Microsoft Word (2024).

The results of this study enable more comprehensive and accurate detection, particularly for identifying error types with similar appearances, meanings, or grammatical and semantic relationships. Specifically, the study’s average F-measure score was 39% higher than that obtained using Kingsoft WPS (2024) and 55% higher than that with Microsoft Word (2024). Importantly, for error types that have traditionally been difficult to improve in terms of F-measure, such as Word Ambiguity, this study achieved a remarkable 65% increase in detection accuracy.

While this study has significantly improved the precision, recall, and F-measure of Chinese character and word error diagnosis, it still faces certain limitations and challenges. The training dataset was constructed primarily from the HSK corpus, which effectively captures the typical linguistic error patterns of second-language learners; however, its coverage of natural language contexts, such as social media and news articles, remains limited. Future research will incorporate multi-source data, including general documents, social media content, and news articles, to enhance the model's generalization ability in authentic linguistic contexts. While this study focuses on optimizing traditional natural language processing methods, we observe that large language models (LLMs) represented by ChatGPT and advanced BERT variants demonstrate superior effectiveness in correction tasks. In light of computational resource limitations, transformer-based models were not included in the current comparative analysis; this provides an important direction for future research. We have provided interfaces in our GitHub repository to support researchers in integrating HuggingFace models for further testing.

Conflict of Interest

The author declares no competing interests.

Funding

None.

Biography

Lili Zhu
https://orcid.org/0009-0003-2651-9580

She is a PhD candidate in Education (Language Teaching and Research) at Shinawatra University, Thailand. Her educational background is in the application and research of linguistics. She currently focuses on Chinese language education and its application in information processing.

References

  • 1 X. Liu, F. Cheng, K. Duh, and Y. Matsumoto, "A hybrid ranking approach to Chinese spelling check," ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 14, no. 4, article no. 16, 2015. https://doi.org/10.1145/2822264
  • 2 J. F. Yeh, W. Y. Chen, and M. C. Su, "Chinese spelling checker based on an inverted index list with a rescoring mechanism," ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 14, no. 4, article no. 17, 2015. https://doi.org/10.1145/2826235
  • 3 H. Jiang, Y. Liu, H. Zhou, Z. Qiao, B. Zhang, and C. Li, "CCL23-Eval Task 7 Track 1 system report: Suda & Alibaba team text error correction system," in Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations), Harbin, China, 2023, pp. 220-229. https://aclanthology.org/2023.ccl-3.25/
  • 4 I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," Advances in Neural Information Processing Systems, vol. 27, pp. 3104-3112, 2014.
  • 5 D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  • 6 C. Wang, L. Yang, Y. Wang, Y. Du, and E. Yang, "Chinese grammatical error correction method based on transformer enhanced architecture," Journal of Chinese Information Processing, vol. 34, no. 6, pp. 106-114, 2020. https://doi.org/10.3969/j.issn.1003-0077.2020.06.014
  • 7 Z. Qiu and Y. Qu, "A two-stage model for Chinese grammatical error correction," IEEE Access, vol. 7, pp. 146772-146777, 2019. https://doi.org/10.1109/ACCESS.2019.2940607
  • 8 J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," Proceedings of Machine Learning Research, vol. 70, pp. 1243-1252, 2017. https://proceedings.mlr.press/v70/gehring17a
  • 9 S. Li, J. Zhao, G. Shi, Y. Tan, H. Xu, G. Chen, H. Lan, and Z. Lin, "Chinese grammatical error correction based on convolutional sequence to sequence model," IEEE Access, vol. 7, pp. 72905-72913, 2019. https://doi.org/10.1109/ACCESS.2019.2917631
  • 10 Q. Zhang, G. Zhao, Y. Su, Y. Zhu, and H. Ren, "Power text semantic recognition algorithm based on improved BERT-AutoML," Electronic Design Engineering, vol. 32, no. 4, pp. 43-45, 2024. https://doi.org/10.14022/j.issn1674-6236.2024.04.009
  • 11 Z. Lian, Y. Yin, M. Zhi, and Q. Xu, "Review of differentiable binarization techniques for text detection in natural scenes," Journal of Frontiers of Computer Science and Technology, vol. 18, no. 9, pp. 2239-2260, 2024. https://doi.org/10.3778/j.issn.1673-9418.2311105
  • 12 C. Liu, K. Zhang, M. Bao, Y. Liu, and Q. Liu, "Research on Chinese spelling correction based on the integration of context and text structure," Journal of Nanjing University (Natural Sciences), vol. 60, no. 3, pp. 451-463, 2024. https://doi.org/10.13232/j.cnki.jnju.2024.03.009
  • 13 J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, et al., "Qwen technical report," 2023 (Online). Available: https://arxiv.org/abs/2309.16609
  • 14 Z. Fan and J. Yao, "Detecting ChatGPT generated texts based on deep pyramid convolutional neural network," Data Analysis and Knowledge Discovery, vol. 8, no. 7, pp. 14-22, 2024. https://doi.org/10.11925/infotech.2096-3467.2023.0609
  • 15 X. Bai, J. Li, H. Wang, P. Jia, and J. Wang, "Review on Chinese text automatic proofreading technology," Software Guide, vol. 21, no. 8, pp. 228-234, 2022. https://doi.org/10.11907/rjdk.211997
  • 16 X. Lei and J. Hu, "Text center pixel reconstruction to achieve efficient arbitrary shape text detection," Computer Engineering and Applications, vol. 59, no. 8, pp. 148-156, 2023. https://doi.org/10.3778/j.issn.1002-8331.2112-0108
  • 17 S. X. Lu, Essentials of Chinese Grammar. Beijing, China: The Commercial Press, 1982.
  • 18 D. Miao, Z. Wei, and Z. Zhang, Principles and Applications of Chinese Information Processing. Beijing, China: Tsinghua University Publishing House Co. Ltd., 2015.
  • 19 H. Wang, J. Pan, H. Wang, Q. Zhang, Y. Zhang, and M. Petresco, "Research on Chinese grammar error diagnosis method based on deep learning," Computer Technology and Development, vol. 30, no. 11, pp. 69-73, 2020. https://doi.org/10.3969/j.issn.1673-629X.2020.11.013
  • 20 Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137-1155, 2003. https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
  • 21 Y. Jiang, R. Zhuang, Y. Wu, and L. Zhu, "Semantic level Chinese proofreading method based on description logics ontology reasoning," Computer Systems & Applications, vol. 26, no. 4, pp. 224-229, 2017. https://doi.org/10.15888/j.cnki.csa.005680
  • 22 H. Yang, J. Wang, H. Shen, S. Zhang, L. Feng, and J. Xiao, "Text detection method based on AttentionDBNet algorithm," Journal of South-Central Minzu University (Natural Science Edition), vol. 43, no. 5, pp. 674-682, 2024. https://doi.org/10.20056/j.cnki.ZNMDZK.20240711
  • 23 Y. Zhang and J. Zheng, "Study of semantic error detecting method for Chinese text," Chinese Journal of Computers, vol. 40, no. 4, pp. 911-924, 2017.

Table 1.

Experimental results: statistical analysis of incorrect sentence detection (unit: %)

Type of error                    Recall   Precision   F-measure   Examples
Error characters (EC)            97.90    95.50       96.70       膻—闪*; 快—块*
Inappropriate word usage (IWU)   94.60    90.10       92.30       感触—感想*; 和谐—河蟹*
Misused parts of speech (MPS)    90.60    82.30       86.20       小人—不小人*; 大板凳—不板凳*
Missing words (MW)               95.30    90.70       93.00       神气活现—神活现*; 山海经—山海*
Redundant words (RW)             98.80    95.30       97.00       活泼—活泼泼*; 迷茫—迷茫茫*
Word ambiguity (WA)              86.10    83.40       84.70       买车票、船票—买车、船票*
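The F-measure column in Table 1 is the harmonic mean of recall and precision, F = 2PR/(P + R). As a quick sanity check, the snippet below recomputes it from the reported recall and precision values; this is a minimal sketch, not part of the evaluation pipeline, and the recomputed figures may differ from the published column in the last digit because the published recall and precision are themselves rounded.

```python
# Recompute F-measure = 2PR/(P+R) from the recall/precision values in Table 1.
# Last digits may differ slightly from the published column due to rounding.
table = {
    "EC":  (97.90, 95.50),
    "IWU": (94.60, 90.10),
    "MPS": (90.60, 82.30),
    "MW":  (95.30, 90.70),
    "RW":  (98.80, 95.30),
    "WA":  (86.10, 83.40),
}
for err_type, (recall, precision) in table.items():
    f_measure = 2 * precision * recall / (precision + recall)
    print(f"{err_type}: F-measure = {f_measure:.1f}")
```

For example, the EC row gives 2 × 95.50 × 97.90 / (95.50 + 97.90) ≈ 96.7, matching the table.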
Flowchart of the corpus processing platform.
Detection process: the sequence of detection steps applied to syntactic components and semantic collocations, with the result output after detection.
Comparison of error detection results obtained using Microsoft Word, Kingsoft WPS, and the approach in this study. See Table 1 for a list of abbreviations.