1. Introduction
As important aspects of artificial intelligence technology, image classification and recognition [1,2] have been used extensively in industry, agriculture and daily life. For example, a fine-grained recognition method [3] was proposed to improve the real-time performance and accuracy of vehicle recognition in an intelligent transportation system. Chen et al. [4] presented a load forecasting framework based on self-adaptive dropout, which improves the load supply robustness of power systems. However, as the input data and specific requirements vary significantly in different application scenarios, it is difficult to use a general model to solve all problems. For example, in the identification of tobacco sales bills, different regions have varying management departments, and different customers adopt varying management clients, which results in many differences in the formats and contents of the sales bills.
In view of the above problems, this paper presents a novel multi-branch residual network for tobacco sales bills to improve the efficiency and accuracy of tobacco sales reconciliation. Our work can be summarized as follows:
(1) A novel multi-branch residual network is proposed for the recognition of tobacco sales bills. The proposed network integrates a multi-branch conv-block module and a spatial and channel squeeze and excitation (SCSE) module.
(2) Improved recognition scores are achieved on a large-scale dataset. We conducted comparative experiments using the China Tobacco Bill Image (CT-BI) dataset, which consists of more than 125,000 images.
The rest of this paper is structured as follows: Section 2 discusses related work on image classification and recognition for different tasks. The methodology of the proposed framework is described in detail in Section 3. Section 4 outlines several comparative experiments that were conducted on the CT-BI dataset, and conclusions are drawn in Section 5.
2. Related Work
Image recognition technology has been applied in various fields with steady progress. A weakly supervised fine-grained image recognition method that can accurately locate objects and parts without annotation was proposed in [5]. The experimental results demonstrated that the target mask module and salient point detection module in the method could suppress background interference and improve the correct recognition rate. Zhao et al. [6] proposed a nonlinear loosely coupled nonnegative matrix factorization method for low-resolution image recognition, in which the target images were regarded as being composed of different local features.
Furthermore, to evaluate the reliability of image recognition applications that are driven by deep learning technology, Zhang et al. [7] introduced a deformation test method for image recognition systems that changes the image background area. To alleviate the difficulty of understanding a network system owing to the unpredictability of the network structure, a coordination method based on deep learning technology was proposed [8] to solve the network structure reasoning problem by integrating a residual network and a fully-connected network. Yan et al. [9] proposed a method for predicting future traffic, and constructed a prediction interval based on the combination of a residual neural network and upper- and lower-bound estimation. This method generated a combined residual network by optimizing the objective function and adjusting the type of the residual blocks, which effectively improved the accuracy of quantifying the uncertainty of future traffic predictions. To capture subtle clues more effectively in fine-grained image recognition, Kim et al. [10] proposed a method for generating the features of hard negative samples that reduced the dependence on the number of tuples of hard negative samples.
3. Methodology
The proposed tobacco sales bill recognition model consists of three multi-branch conv-blocks and an SCSE module, as shown in Fig. 1.
3.1 Data Preprocessing
In this study, the data preprocessing consisted of two sequential processes: image calibration and image segmentation, as shown in Fig. 2.
Fig. 1. Architecture of the proposed recognition model.
Fig. 2. Flow of data preprocessing.
In the image calibration process, the input image was tilt-corrected: the four corners of the document were extracted to calculate the length and width of the image, and rotation alignment was performed. Thereafter, to reduce the impact of image noise on the recognition rate, a difference-of-Gaussians filter was applied for noise reduction and the image was binarized. The transfer and impulse response functions are presented in Eqs. (1) and (2), respectively.
where $M \geq N$, $\alpha_{1}>\alpha_{2}$, and $\sigma_{i}=1 /(2 \pi \alpha_{i})$.
Following image calibration, the Sobel operator was used to determine the gradient in the X-direction of the input image in order to locate the text. Each pixel in the input image was convolved with the two convolution kernels of the Sobel operator: one responds most strongly to vertical edges, and the other to horizontal edges. The maximum of the two convolution results was used as the output for the pixel. The processed image was then dilated and eroded to detect the text areas and thus segment the entire image into lines. Finally, a vertical projection operation was used to divide each line into a series of characters for processing.
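The edge-response and character-splitting steps above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `sobel_max` and `split_columns` are hypothetical, the absolute value of each kernel response is taken before the maximum (an assumption), the morphological dilation and erosion step is omitted, and the text line is assumed to be already binarized with 1 for ink and 0 for background.

```python
# Sobel kernels: KX responds most strongly to vertical edges,
# KY to horizontal edges.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_max(img):
    """Return max(|Gx|, |Gy|) for each interior pixel; borders stay 0.

    The kernels are applied without flipping (cross-correlation); for the
    Sobel kernels this only changes the sign, which abs() removes.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += KX[dy][dx] * p
                    gy += KY[dy][dx] * p
            out[y][x] = max(abs(gx), abs(gy))
    return out


def split_columns(binary_line):
    """Vertical projection: return (start, end) column spans containing ink."""
    w = len(binary_line[0])
    # Column profile: number of ink pixels in each column.
    profile = [sum(row[x] for row in binary_line) for x in range(w)]
    spans, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                 # a character begins
        elif count == 0 and start is not None:
            spans.append((start, x))  # a character ends (end is exclusive)
            start = None
    if start is not None:
        spans.append((start, w))
    return spans
```

For instance, `split_columns([[0, 1, 1, 0, 0, 1, 0]])` yields the spans `[(1, 3), (5, 6)]`, that is, one two-pixel-wide character and one single-pixel-wide character separated by empty columns.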
3.2 Recognition Model
The preprocessed image was input into the proposed recognition model that consists of a multi-branch conv-block module and an SCSE module, as shown in Fig. 1.
The former integrates the branching idea of the Inception series of models and the residual mechanism of ResNet. To reduce the total number of parameters, we followed the Inception network and used two stacked 3×3 convolutions instead of a single large 5×5 convolution. This change reduces the number of model parameters and introduces more nonlinear transformations, which increases the capability of the conv-block to learn features. Moreover, with this structure, the sparse matrix can be clustered into dense submatrices to improve the computational performance. As illustrated in Fig. 3, the multi-branch conv-block module is composed of four branches: branch 0 consists of one BasicConv2d; branch 1 consists of two BasicConv2d; branch 2 consists of three BasicConv2d; and branch 3 consists of an average pooling layer and one BasicConv2d. The input tensor passes through these four branches in parallel, after which the results are concatenated. The advantage of this design is that visual information is processed at different scales simultaneously and then aggregated.
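The parameter saving from replacing one 5×5 convolution with two stacked 3×3 convolutions can be checked with a short calculation. The sketch below assumes that the input and output channel counts are equal and that biases are ignored; the channel count of 64 is a hypothetical value chosen for illustration and is not taken from the paper.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def receptive_field(kernels):
    """Receptive field of stacked stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernels)


c = 64  # hypothetical channel count, not taken from the paper
single_5x5 = conv_weights(5, c, c)       # 25 * c^2 weights
stacked_3x3 = 2 * conv_weights(3, c, c)  # 18 * c^2 weights
saving = 1 - stacked_3x3 / single_5x5    # fraction of weights saved

# Both designs see the same 5x5 input window.
assert receptive_field([3, 3]) == receptive_field([5])
```

Under these assumptions the stacked design uses 18/25 of the weights of the single 5×5 convolution, a saving of 28%, while covering the same 5×5 receptive field and adding one extra nonlinearity between the two layers.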
Fig. 3. Multi-branch conv-block module.
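The four-branch layout described above might be sketched in PyTorch as follows. This is an illustration under stated assumptions rather than the authors' code: the paper does not give kernel sizes, strides, or channel widths for each BasicConv2d, so the 1×1 and 3×3 kernels, the equal branch width `c_branch`, and the class name `MultiBranchBlock` are all hypothetical. `BasicConv2d` is written here as convolution, batch normalization, and ReLU, following the naming used in common Inception implementations. The residual connection is omitted.

```python
import torch
from torch import nn


class BasicConv2d(nn.Module):
    """Conv -> BatchNorm -> ReLU (assumed definition of BasicConv2d)."""

    def __init__(self, c_in, c_out, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))


class MultiBranchBlock(nn.Module):
    """Four parallel branches whose outputs are concatenated on channels."""

    def __init__(self, c_in, c_branch):
        super().__init__()
        # Kernel sizes and widths are assumptions made for illustration.
        self.branch0 = BasicConv2d(c_in, c_branch, kernel_size=1)
        self.branch1 = nn.Sequential(
            BasicConv2d(c_in, c_branch, kernel_size=1),
            BasicConv2d(c_branch, c_branch, kernel_size=3, padding=1),
        )
        self.branch2 = nn.Sequential(
            BasicConv2d(c_in, c_branch, kernel_size=1),
            BasicConv2d(c_branch, c_branch, kernel_size=3, padding=1),
            BasicConv2d(c_branch, c_branch, kernel_size=3, padding=1),
        )
        self.branch3 = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            BasicConv2d(c_in, c_branch, kernel_size=1),
        )

    def forward(self, x):
        outs = [self.branch0(x), self.branch1(x),
                self.branch2(x), self.branch3(x)]
        return torch.cat(outs, dim=1)  # concatenate along the channel axis


block = MultiBranchBlock(c_in=32, c_branch=16).eval()
with torch.no_grad():
    y = block(torch.randn(2, 32, 28, 28))
```

With padding chosen to preserve the spatial size, the output keeps the 28×28 resolution and has 4 × 16 = 64 channels, one group of 16 per branch.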
The SCSE module consists of the sSE and cSE modules. The cSE is a channel attention module. Its process is as follows: global average pooling is used to convert the feature map from [C, H, W] to [C, 1, 1], and two 1×1 convolutions are used for information processing to obtain a C-dimensional vector. Thereafter, the sigmoid function is used for normalization to obtain the corresponding mask. Finally, the calibrated feature map is obtained by channel-wise multiplication. The sSE module is a spatial attention module, which is implemented as follows: a 1×1 convolution is applied directly to the feature maps, converting their dimensions from [C, H, W] to [1, H, W]. Subsequently, the result is activated with a sigmoid function to obtain a spatial attention map, which is applied to the original feature map to complete the spatial information calibration. The SCSE is a parallel connection of the two modules. The output features of the sSE and cSE modules are added to obtain a more accurately calibrated feature map. Finally, the result is added to the input tensor as the output of the block. The structure of the SCSE is presented in Fig. 4. By introducing an attention mechanism, the network can focus on the information that is most critical to the current task, which mitigates information overload and improves the efficiency and accuracy of the task processing.
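The two attention paths and their combination can be written out for a single [C, H, W] feature map in a few lines of NumPy. This is a minimal sketch and not the authors' implementation: the 1×1 convolutions are expressed as plain matrix products, biases are omitted, a ReLU is assumed between the two 1×1 convolutions of the cSE path, and the reduced width R and the function name `scse` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def scse(x, w1, w2, ws):
    """Apply SCSE to x of shape [C, H, W].

    w1: [R, C] and w2: [C, R] are the two 1x1 convolutions of the cSE path;
    ws: [C] is the 1x1 convolution of the sSE path.
    """
    # cSE: global average pooling -> two 1x1 convs -> sigmoid -> channel mask.
    z = x.mean(axis=(1, 2))                        # [C]
    hidden = np.maximum(w1 @ z, 0.0)               # [R], ReLU is an assumption
    c_mask = sigmoid(w2 @ hidden)                  # [C]
    cse = x * c_mask[:, None, None]                # channel-wise recalibration

    # sSE: 1x1 conv down to one channel -> sigmoid -> spatial mask.
    s_mask = sigmoid(np.tensordot(ws, x, axes=1))  # [H, W]
    sse = x * s_mask[None, :, :]                   # spatial recalibration

    # Parallel combination, then addition of the input tensor.
    return x + cse + sse
```

As a sanity check, setting all weights to zero makes both masks equal to sigmoid(0) = 0.5, so each path returns half of the input and the block output is exactly twice the input.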
4. Experiments
We conducted a series of comparative experiments to demonstrate the effectiveness of the proposed recognition approach through comparisons with existing methods.
4.1 Experimental Configuration
The hardware of the experimental environment consisted of an NVIDIA Titan X graphics card, 128 GB of RAM, and an Intel E5-2678V3 CPU. The software environment comprised Ubuntu 16, Python 3.6 and the PyTorch 1.0 development environment. The experiments were conducted using the integrated development environment Python 3.6+PyTorch 0.4.0.
The experimental data were obtained from the CT-BI dataset. This dataset contained more than 1.2 million tobacco sales bills and corresponding statistical data, collected from different regions and dealers. The sales bills were saved as JPG images organized by tobacco type, and the statistical data were saved in an .xlsx spreadsheet. Each sample image contained the store name, monopoly license number, and tobacco commodity sales data and amounts. Fig. 5 presents several samples from the CT-BI dataset. In the experiments, the correct recognition rate of the algorithm was tested by identifying the sales and amounts of specific types of cigarettes in the bill image and verifying these against the data in the statistical table. The experimental results were analyzed quantitatively using the Top-1 to Top-5 error rate indexes.
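The Top-k index can be made concrete with a small sketch. It assumes the standard definition, in which a sample counts as an error when its true label is not among the k highest-scoring classes; the paper does not spell out its exact definition, and the function name `topk_error` and the toy scores below are hypothetical.

```python
def topk_error(scores, labels, k):
    """Percentage of samples whose true label is not in the top-k scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return 100.0 * misses / len(labels)


# Toy scores for three samples over three classes (illustrative only).
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
errors = [topk_error(scores, labels, k) for k in (1, 2, 3)]
```

In this toy case the error falls from two misses out of three at k = 1, to one miss at k = 2, to none at k = 3, which shows why the Top-k error can only decrease as k grows.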
Fig. 5. Image examples from the CT-BI dataset.
4.2 Experiment I: Different Tobacco Types
In this experiment, we selected the sales bills of seven types of tobacco (SUYAN, NANJING, ZHONGHUA, TAISHAN, HUANJINYE, WANBAOLU and LIQUN) to verify the accuracy of the proposed recognition method. The experimental results are listed in Table 1.
It can be observed from Table 1 that the recognition accuracies for the different tobacco types differed slightly. This is mainly because the number and complexity of the Chinese characters in the names of these tobacco products differed, which changed the accuracy of locating the commodity names in sample images that contained interference information. For example, the Chinese name for HUANJINYE contains three complex Chinese characters, whereas the Chinese name for ZHONGHUA contains only two simple Chinese characters. This resulted in a difference of 0.88% in the Top-1 index and 0.36% in the Top-5 index. However, Table 1 also indicates that the maximum standard deviation across the indexes from Top-1 to Top-5 was only 0.62, which reflects the stability of the proposed method.
Table 1. Experimental results for different tobacco types (unit: %)
4.3 Experiment II: Different Recognition Methods
We compared the proposed method with several existing methods. The experimental data were the sales data of LIQUN tobacco. The experimental results are presented in Table 2.
It can be observed from Table 2 that the proposed method was superior to the other four methods in terms of the Top-1 to Top-5 indicators. For example, compared to the second-ranked Inception V3 method, the Top-1 of our method increased by 1.14%. Moreover, the Top-5 increased by almost 5% compared to the closest methods, Inception V3 and NasNet. The main reason for this performance improvement was that our method integrated the multi-branch concept of the Inception network and gave the conv-block a stronger feature learning ability by establishing more nonlinear transformations. Another possible reason was that, by introducing an attention mechanism, the proposed method could focus on the information most critical to the current task, which improved the efficiency and accuracy of the task processing.
Table 2. Experimental results for different methods (unit: %)
5. Conclusion
In this study, as one of the typical applications of artificial intelligence technology in traditional industries, a new multi-branch residual framework was developed for the recognition of tobacco sales bills. A multi-branch residual network recognition model was designed and trained based on the geometric correction and edge alignment of input images. Finally, the effectiveness of the proposed approach was verified through comparative experiments on a large-scale tobacco sales bill dataset.
Acknowledgement
This work was supported by the Research on Key Technology and Application of Marketing Robot Process Automation (RPA) Based on Intelligent Image Recognition of the Zhejiang China Tobacco Industry Co. Ltd. (No. ZJZY2021E001).