1. Introduction
Convolutional neural networks (CNNs) are receiving considerable use in the image-recognition field [1-3], and their recognition performance continues to improve in various settings. Studies to enhance character-recognition performance are also being conducted [4,5], and they show marked progress in Chinese character recognition. Zhang [6] compared the recognition accuracy of several CNN architectures of different depths on Chinese characters. Zhong et al. [7] created a feature map according to directional features; used alongside the existing training data, this newly created map improved recognition accuracy. Yang et al. [8] applied Chinese character domain information in their experiments with CNNs.
However, studies applying CNNs to Korean character recognition have not been conducted in earnest. Over the last three decades, studies on various methods have advanced Korean character recognition [9-14]; however, progress has been slow compared with the CNN-driven recognition gains for other scripts. Recently, Kim and Xie [15] conducted a study on Korean character recognition using a CNN; however, their experimental data was limited and their CNN configuration was relatively simple.
In this study, a large-scale Korean character database, PHD08 [16], is used for the experiments. PHD08 is the largest Korean character database currently in existence, containing a total of 5,139,450 characters. PHD08 is very useful for training because every character appears in different fonts, resolutions, rotations, and noise levels. After training two newly designed CNNs, Korean character recognition (KCR)-AlexNet and KCR-GoogLeNet, on PHD08 split into training and test data at fixed ratios, we compare their test accuracy as a function of training iteration. After demonstrating which network has higher test accuracy, we present a classification experiment based on additional Korean character data with fonts that are not in PHD08, to ensure the objectivity of the experiment. We then compare the classification success rates with those of online commercial optical character recognition (OCR) programs to weigh the pros and cons of KCR-AlexNet and KCR-GoogLeNet. Moreover, we measured the time taken by each experiment and used it as one of the factors for evaluating the performance of KCR-AlexNet and KCR-GoogLeNet.
The structure of this paper is as follows: Section 2 gives a brief introduction to CNNs and describes our KCR-AlexNet and KCR-GoogLeNet architectures in detail. Our experiments are discussed in Section 3. Section 3.1 describes the process of training KCR-AlexNet and KCR-GoogLeNet, from organizing the experimental data to plotting the test-accuracy curves. Section 3.2 presents an experiment classifying Korean characters with fonts that are not in PHD08, and compares the classification performance of KCR-AlexNet, KCR-GoogLeNet, and other commercial programs. Conclusions are drawn in Section 4.
2. CNN Architecture Design for Korean Character Recognition
2.1 Introduction to Convolutional Neural Network
CNNs, which extract local feature information from the input data used for training, came into use with LeNet, designed by LeCun et al. [17] to recognize digit images. A CNN is a type of neural network that repeatedly applies convolution, pooling, inner products, etc., and it has become the most frequently used deep-learning architecture in the image-recognition field. Recently, the problem of over-fitting has been mitigated using the rectified linear unit (ReLU) non-linear activation function and a drop-out layer. When the data enters the CNN's convolutional layer, the output value is calculated as in (1) and propagated to a node in the next layer.

[TeX:] $$x _ { i } ^ { l } = \sum _ { j } \left( x _ { j } ^ { l - 1 } * k _ { i } ^ { l } \right) + b _ { i } ^ { l }$$ (1)
where [TeX:] $$x _ { i } ^ { l }$$ is the result of calculating the [TeX:] $$i ^ { \mathrm { th } }$$ node of the [TeX:] $$l ^ { \mathrm { th } }$$ layer. It is calculated by accumulating the results of multiplying the node values at the [TeX:] $$( l - 1 ) ^ { \mathrm { th } }$$ layer by the [TeX:] $$i ^ { \mathrm { th } }$$ kernel map of the [TeX:] $$l ^ { \mathrm { th } }$$ layer, and adding the [TeX:] $$i ^ { \mathrm { th } }$$ bias of the [TeX:] $$l ^ { \mathrm { th } }$$ layer. For the results of the final CNN layer, the term loss (or error) is used to indicate how close the output value is to the target value. The average loss over all |D| instances of dataset D is calculated as follows:

[TeX:] $$L ( W ) = \frac { 1 } { | D | } \sum _ { i = 1 } ^ { | D | } f _ { W } \left( X ^ { ( i ) } \right) + \lambda r ( W )$$ (2)
where W is the weight-parameter map of the current network, [TeX:] $$f _ { W } \left( X ^ { ( i ) } \right)$$ is the loss on data instance [TeX:] $$X ^ { ( i ) }$$, and r(W) is a regularization term weighted by the constant λ. When the loss calculation of (2) is complete, the weight-parameter map must be updated at every training iteration, as in (3) and (4). These equations follow the stochastic gradient descent (SGD) algorithm to minimize the loss, which is the goal of training.

[TeX:] $$V _ { t + 1 } = \mu V _ { t } - \alpha \nabla L \left( W _ { t } \right)$$ (3)

[TeX:] $$W _ { t + 1 } = W _ { t } + V _ { t + 1 }$$ (4)
First, [TeX:] $$W _ { t + 1 }$$ is the updated weight-parameter map at training iteration t + 1, computed from the update value [TeX:] $$V _ { t + 1 }$$. The constant μ, called the momentum, is used to calculate [TeX:] $$V _ { t + 1 }$$; [TeX:] $$V _ { t }$$ is the update value at the previous training iteration. Lastly, α is the learning rate for the current training iteration and [TeX:] $$\nabla L \left( W _ { t } \right)$$ is the gradient of the loss with respect to the weight-parameter map at the previous training iteration; the negative sign in (3) makes the update descend the loss surface. The specific constant values used for α and μ are presented in Section 3.
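As a concrete illustration, the following is a minimal NumPy sketch of the update in Eqs. (3) and (4); the array shapes and the placeholder gradient are our own toy assumptions, not values from the paper.

import numpy as np

def sgd_momentum_step(W, V, grad_L, lr=0.01, momentum=0.9):
    """One SGD update following Eqs. (3) and (4):
    V_{t+1} = mu * V_t - alpha * grad L(W_t);  W_{t+1} = W_t + V_{t+1}."""
    V_next = momentum * V - lr * grad_L  # Eq. (3)
    W_next = W + V_next                  # Eq. (4)
    return W_next, V_next

# Toy usage: random weights and a placeholder gradient standing in for
# the real back-propagated loss gradient.
W = np.random.randn(3, 3)
V = np.zeros_like(W)
grad_L = np.random.randn(3, 3)
W, V = sgd_momentum_step(W, V, grad_L)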
2.2 Design of Two CNN Architectures for Korean Character Recognition

2.2.1 KCR-AlexNet
The CNN architecture that received the most attention after LeNet was AlexNet by Krizhevsky et al. [18]. AlexNet won first place in the ImageNet Large-Scale Visual Recognition Challenge 2012 (ILSVRC-2012) [19]. The overall architecture of KCR-AlexNet is the same as AlexNet's, but KCR-AlexNet uses a 56×56-pixel input size for Korean character images, which is smaller than AlexNet's 256×256 input size for natural images. In addition, while the output layer of the existing AlexNet has only 1,000 nodes for classifying ILSVRC's classes, KCR-AlexNet needs 2,350 nodes at the output layer to classify PHD08's 2,350 Korean character classes.
The details of KCR-AlexNet are depicted in Fig. 1; it consists of five convolutional layers, three max-pooling layers, and three fully connected layers. The ReLU non-linear activation function is applied to each convolutional and fully connected layer in KCR-AlexNet, and the final output value is calculated by softmax.
Fig. 1. KCR-AlexNet architecture.
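For illustration, below is a PyTorch-style sketch of such an architecture (the experiments in this paper used Caffe). The kernel sizes and channel counts are our own placeholder assumptions; the actual configuration is the one given in Fig. 1. Only the 56×56×1 input, the five-conv/three-pool/three-FC structure, and the 2,350-way output follow the text.

import torch
import torch.nn as nn

class KCRAlexNetSketch(nn.Module):
    # Hypothetical layer sizes; Fig. 1 defines the real KCR-AlexNet.
    def __init__(self, num_classes=2350):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, 5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),   # 56x56 -> 27x27
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),   # 27x27 -> 13x13
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),   # 13x13 -> 6x6
        )
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),  # 2,350 Korean character classes
        )

    def forward(self, x):
        x = torch.flatten(self.features(x), 1)
        return self.classifier(x)  # softmax is applied at the loss/output stage

print(KCRAlexNetSketch()(torch.zeros(1, 1, 56, 56)).shape)  # [1, 2350]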
2.2.2 KCR-GoogLeNet
We designed another CNN architecture, KCR-GoogLeNet, based on GoogLeNet, which was developed by Szegedy et al. [20]. GoogLeNet won the classification task of ILSVRC-2014 and has a much deeper architecture than earlier CNNs. KCR-GoogLeNet has 22 weight-parameter layers, nearly three times KCR-AlexNet's 8 weight-parameter layers. However, the most distinctive feature of GoogLeNet and KCR-GoogLeNet is the inception module included in the architecture. Szegedy et al. configured the inception module using the method of Arora et al. [21]. The inception module is designed as a network structure within a network, unlike a conventional CNN, which has only a one-dimensional, serial configuration. The same inception modules are used in GoogLeNet and KCR-GoogLeNet, and are shown in Fig. 2 with a depth of two, where the depth indicates how many weight-parameter layers are connected in sequence.
The inception module is a method introduced to effectively express the features of the local space. After subdividing the regional characteristics of the kernel space into sizes 1×1, 3×3, and 5×5 to calculate the convolutional values, all convolutional results are concatenated in the last layer of the inception module. A 1×1 convolutional layer with the ReLU activation function is used to reduce the complexity of the calculations occurring in the 3×3 and 5×5 convolutional layers. The biggest difference between GoogLeNet and KCR-GoogLeNet is that GoogLeNet uses nine inception modules whereas KCR-GoogLeNet uses only three. This is because GoogLeNet's purpose is to classify natural images of size 256×256×3, while KCR-GoogLeNet's purpose is to classify small Korean character images of size 56×56×1. The KCR-GoogLeNet architecture is shown in Fig. 3, and the detailed size of each layer, including the inception modules, is given in Table 1.
Fig. 2. Inception module for GoogLeNet and KCR-GoogLeNet.
Table 1. KCR-GoogLeNet incarnation of the inception architecture
Fig. 3. KCR-GoogLeNet architecture.
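To make the inception structure concrete, here is a PyTorch-style sketch of a standard GoogLeNet inception block with parallel 1×1, 3×3, and 5×5 branches, 1×1 reductions, and channel concatenation. The branch channel counts below are placeholder assumptions; Table 1 lists the actual KCR-GoogLeNet sizes.

import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    # c1/c3/c5: output channels of the 1x1/3x3/5x5 branches;
    # c3r/c5r: 1x1 reduction channels; cp: channels of the pooling branch.
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, cp, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate all branch outputs along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

y = InceptionSketch(192, 64, 96, 128, 16, 32, 32)(torch.zeros(1, 192, 14, 14))
print(y.shape)  # [1, 256, 14, 14]: 64 + 128 + 32 + 32 channels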
3. Experimental Results
In this section, we describe and analyze the results of the two experiments. First, in Section 3.1, after training KCR-AlexNet and KCR-GoogLeNet on PHD08, we compare their test accuracy as a function of training iteration, along with the time required per training iteration. Second, in Section 3.2, after feeding new Korean character data with fonts that are not in PHD08 into KCR-AlexNet and KCR-GoogLeNet, we compare their classification performance and classification time. Additionally, we compare the classification performance with that of online commercial OCR programs to ensure the objectivity of the experiments.
3.1 Experiments for PHD08

3.1.1 Experimental data
PHD08 is the Korean character database used for the experiments. It consists of binary scanned images of Korean characters printed under a variety of conditions. It contains 2,187 samples of each of the 2,350 Hangul characters in the Korean Standard (KS) complete set, and the samples differ in font, size, rotation, noise level, etc., as shown in Table 2. For the experiments, the binary data of PHD08 are converted to binary images to be used as input for KCR-AlexNet and KCR-GoogLeNet. However, because all CNN input data must be the same size, we resized all of the data to 56×56. Linear interpolation [22] was used for the resizing, and an example of the final transformed Korean character input data is shown in Fig. 4.
Fig. 4. Example of transformed input data from PHD08.
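A minimal OpenCV sketch of this preprocessing step is shown below; the file name is a hypothetical placeholder, and re-binarizing after interpolation is our own assumption, since linear interpolation introduces intermediate gray values.

import cv2

# Hypothetical sample file; PHD08 stores each character as binary pixel data.
img = cv2.imread('phd08_sample.png', cv2.IMREAD_GRAYSCALE)

# Resize to the fixed 56x56 CNN input size using linear interpolation [22].
resized = cv2.resize(img, (56, 56), interpolation=cv2.INTER_LINEAR)

# Restore a binary image after interpolation (the threshold choice is ours).
_, binary = cv2.threshold(resized, 127, 255, cv2.THRESH_BINARY)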
In this paper, to investigate how test accuracy varies with training iteration when the amount of training data is large or small, we constructed five experimental data sets, as shown in Table 3. The 5,139,450 PHD08 data elements were divided into training and test data at ratios from 1:0.5 to 1:8, composing the five sets. As KCR-AlexNet and KCR-GoogLeNet train on each data set, we observe how the test accuracy depends on the training iteration and determine which network is better suited to training Korean characters.
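As an illustration of how one such set could be built, the sketch below splits a single class's 2,187 samples at a given training-to-test ratio; the helper function and its seeding are our own assumptions, not the paper's actual procedure.

import random

def split_class_samples(samples, train_ratio, test_ratio, seed=0):
    # Shuffle one class's samples and split them at the requested ratio,
    # e.g., 1:0.5 or 1:8 as in Table 3.
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio / (train_ratio + test_ratio))
    return shuffled[:n_train], shuffled[n_train:]

# A 1:8 training-to-test split of one class's 2,187 samples.
train, test = split_class_samples(list(range(2187)), 1, 8)
print(len(train), len(test))  # 243 1944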
3.1.2 Experiment environment
Some of the parameters used in common by KCR-AlexNet and KCR-GoogLeNet were given equal initial values. SGD, the method for modifying the weight parameters that plays the most critical role in training, was explained in Section 2. The initial learning rate (α) was 0.01 and the momentum constant (μ) was 0.9 for all training experiments. In addition, the learning rate was multiplied by 0.96 every 10,000 training iterations. Batch sizes of 56 and 50 were used for the training and test phases, respectively; the batch size is the number of images used as input for each training or test iteration. Throughout the experiments, a GTX 970 graphics card (CUDA v7.0) was used for fast parallel computation, and we conducted the tests on the public Caffe framework [23].
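The resulting step-decay schedule can be written compactly; this small sketch simply evaluates the rule stated above (learning rate multiplied by 0.96 every 10,000 iterations).

def learning_rate(iteration, base_lr=0.01, gamma=0.96, step=10000):
    # Step decay: multiply the learning rate by gamma every `step` iterations.
    return base_lr * gamma ** (iteration // step)

print(learning_rate(0))      # 0.01
print(learning_rate(25000))  # 0.01 * 0.96**2 = 0.009216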
Table 3. Five experimental data sets
3.1.3 Experiment analysis
Fig. 5 shows the test-accuracy curves as a function of training iteration while KCR-AlexNet and KCR-GoogLeNet train on data sets E1-Set_1 to E1-Set_5 from Table 3. In the experiments, KCR-AlexNet and KCR-GoogLeNet always converged to over 98% test accuracy when training on each of the five data sets. The exact test-accuracy values at the end of training are shown in Table 4. With sufficient training, test accuracy keeps improving even when the proportion of training data is relatively small, as in E1-Set_5.
For the top-1 case shown in Table 4, KCR-GoogLeNet has higher test accuracy than KCR-AlexNet on all of the data sets. However, Fig. 5 shows that KCR-GoogLeNet's test accuracy always requires more training iterations to converge than KCR-AlexNet's. This is because the KCR-GoogLeNet architecture spends more time in its inception modules, finding features of small, compact areas, than KCR-AlexNet does. Such characteristics can greatly affect the training time, which is an important factor when designing a CNN. Therefore, while the two networks were training, we measured the average time spent on each training iteration and recorded it in Table 4. On average, KCR-AlexNet took 0.042 seconds per iteration and KCR-GoogLeNet took 0.424 seconds. If the time required for training is a significant constraint, KCR-AlexNet is more effective for training Korean characters.
Fig. 5. Comparison between KCR-AlexNet and KCR-GoogLeNet. (a) E1-Set_1, (b) E1-Set_2, (c) E1-Set_3, (d) E1-Set_4, and (e) E1-Set_5.
Table 4. Test accuracies at the last training iteration and average times for a single iteration
3.2 Classification Experiment with Other Applications
This experiment compares the time required for classification and the classification success rate between KCR-AlexNet and KCR-GoogLeNet on the new Korean character data with fonts that are not in PHD08. In addition, we compared the classification success rates with commercial OCR programs ABBYY FineReader 12 [24], ABC-OCR [25], and Office Lens [26] to ensure the objectivity of the experiment.
3.2.1 Experimental data
We used the characters in the Korean national anthem as experimental data. The national anthem is composed of four verses and a chorus. Each verse has 28 characters and the chorus has 24 characters, so the anthem contains 136 characters in total. However, we do not count duplicate characters; thus, only 82 distinct characters are used for this experiment.
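The de-duplication step amounts to taking the set of characters; a minimal sketch follows, where the anthem string is a truncated placeholder rather than the full 136-character text.

# `anthem` stands in for the full four verses plus chorus (136 characters).
anthem = "동해물과 백두산이 ..."  # truncated placeholder, not the full text

unique_chars = set(anthem) - set(" .")  # ignore spaces and punctuation
print(len(unique_chars))  # 82 for the complete anthem text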
Table 5. Fonts used for PHD08 and the new data set
Fig. 6. Transformation procedure for the new data set.
We made 10 data sets, as shown in Table 5. Each set contains all 82 characters, for a total of 820 characters. None of the fonts used in the new data sets appear in PHD08. The new data sets were used as input for classification after the procedure shown in Fig. 6: after the specified area of each character is selected from the image, its width and height are transformed to the size of 56×56 and the result is converted to a binary image. For classification, we used the weight-parameter maps of KCR-AlexNet and KCR-GoogLeNet trained on E1-Set_1 in Section 3.1.
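A sketch of this pipeline in Python is given below, reusing the hypothetical KCRAlexNetSketch class from Section 2.2.1; the file name and the character bounding box are placeholder assumptions, and in practice the weights trained on E1-Set_1 would be loaded instead of random ones.

import cv2
import torch

def preprocess(path, box):
    # Select the specified character area, resize it to 56x56 with linear
    # interpolation, and convert it to a binary image (Fig. 6).
    x, y, w, h = box
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)[y:y + h, x:x + w]
    img = cv2.resize(img, (56, 56), interpolation=cv2.INTER_LINEAR)
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return torch.from_numpy(img).float().div(255).view(1, 1, 56, 56)

model = KCRAlexNetSketch()  # in practice, load the weights trained on E1-Set_1
model.eval()
with torch.no_grad():
    scores = model(preprocess('anthem_scan.png', (10, 10, 40, 40)))
    class_index = scores.argmax(dim=1).item()  # predicted character class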
3.2.2 Experiment analysis
Table 6 shows the classification success rates for KCR-AlexNet, KCR-GoogLeNet, and the other OCR programs. For the 820 Korean characters used in the experiment, the three applications ABBYY, ABC-OCR, and Office Lens showed classification success rates of 72.07%, 83.17%, and 66.95%, respectively. On the other hand, KCR-AlexNet and KCR-GoogLeNet showed classification success rates of 90.12% and 89.14%, respectively, outperforming the existing programs. However, all networks and programs used in the experiment showed relatively low success rates for fonts resembling human handwriting, e.g., "OI (E2–Set_5)," "Humanpyeonjichae (E2–Set_8)," and "Ganeun Ansangsoochae (E2–Set_9)." This occurred because the handwritten font patterns did not match the PHD08 font patterns. Thus, additional training data is needed to increase the classification success rate for handwritten-pattern data.
Table 6. Classification success rate comparison between KCR-AlexNet, KCR-GoogLeNet, and other programs
Comparing only KCR-AlexNet and KCR-GoogLeNet: while KCR-GoogLeNet had higher test accuracy when training on PHD08 in Section 3.1, this experiment showed a slightly higher classification success rate for KCR-AlexNet. As an additional result, KCR-AlexNet took between 0.027 and 0.036 seconds to classify one character, while KCR-GoogLeNet took between 0.021 and 0.025 seconds. Such a difference may be significant when classifying characters in bulk. Therefore, the CNN should be selected in view of both the classification time and the classification success rate, depending on the situation.
4. Conclusions
The CNN structures used for Korean character recognition in this paper, KCR-AlexNet and KCR-GoogLeNet, showed more than 98% test accuracy on PHD08. Further, for an objective evaluation, we generated new Korean character data with fonts that do not exist in PHD08 and compared the performance with commercial OCR programs available online. The experimental results showed that the classification success rates of KCR-AlexNet and KCR-GoogLeNet were higher than those of the existing OCR programs, demonstrating their classification performance on Korean characters with various fonts.
However, the performance comparison between KCR-AlexNet and KCR-GoogLeNet requires additional discussion. When testing on PHD08, the test accuracy of KCR-GoogLeNet was higher; however, the classification success rate of KCR-AlexNet was higher in the experiments classifying the newly created Korean character data.
In addition to the test accuracy and classification success rate, we measured the time required for the experiments. While KCR-AlexNet trained faster on PHD08, KCR-GoogLeNet required less time to classify a single character. Thus, we showed that the training time on a given database, the test accuracy of a network, the classification success rate, and the time required for classification must all be considered when choosing a CNN for recognizing Korean characters.
Acknowledgement
This work was supported by a 2-Year Research Grant of Pusan National University.