Article Information
Corresponding Author: Qingji Xue , xue_qj@sina.com
Xinhua Lu, School of Information Engineering, Nanyang Institute of Technology, Nanyang, China, ieluxinhua@sina.com
Haihai Wei, School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China, whh15649863783@163.com
Li Ma, School of Information Engineering, Nanyang Institute of Technology, Nanyang, China, ielima@sina.com
Qingji Xue, School of Information Engineering, Nanyang Institute of Technology, Nanyang, China, xue_qj@sina.com
Yonghui Fu, School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China, fu8926@qq.com
Received: July 26 2022
Revision received: October 14 2022
Revision received: November 7 2022
Accepted: November 14 2022
Published (Print): August 31 2023
Published (Electronic): August 31 2023
1. Introduction
Identifying the text in scene images correctly and efficiently helps people quickly acquire the semantic information in the images, which is important for text-related downstream tasks such as image search, robot navigation, and instant translation [1]. For scene text image recognition, several earlier models [2-8] have achieved excellent recognition results on clear images, but their performance drops dramatically [9] on low-resolution (LR) text images, because character degradation blurs the shapes and edges of the text. Single image super-resolution (SISR) technology is therefore considered an effective preprocessing method for LR text image recognition.
Some SISR models [10-14] have performed well on synthetic datasets, in which LR images are typically produced by down-sampling and blurring high-resolution (HR) images. In reality, these models cannot achieve satisfactory performance on real LR text images, whose degradation is much more complex. For real LR text images, many existing methods [15,16] focus on effectively capturing the sequence characteristics of text images or on generating finer image boundaries so as to produce more recognizable text images. These methods achieved some results on real LR text images, but they neglect the important role of the multi-scale features of text images.
This paper proposes TSRMAN, a novel model for the scene text image super-resolution (STISR) task that combines multi-scale learning and attention mechanisms. Many works [17-19] demonstrate that model capability can be further improved by fully utilizing image features at different scales. In addition, attention mechanisms are widely used to guide a model to concentrate on task-relevant regions. Channel attention (CA) and spatial attention (SA) [20] are two mechanisms commonly used in computer vision: the former establishes interactive dependencies along the image channel dimension, and the latter captures information along the image spatial dimension. This paper designs a multi-scale residual attention (MRA) module that skillfully fuses the two mechanisms for the real STISR task.
The contributions of our work are as follows:
To capture the features of text images at different scales, multi-scale learning based on different convolution kernel sizes is introduced.
An MRA module is designed, which skillfully fuses multi-scale learning and attention mechanisms to enrich the representation ability of image features and to increase the accuracy of text image recognition.
The experiments demonstrate that our work increases the average recognition accuracy of the scene text recognizer ASTER by 1.2% compared with the text super-resolution network (TSRN) on TextZoom.
2. Related Work
2.1 Single Image Super-Resolution
Restoring a reasonable HR image from an LR image is the primary objective of image super-resolution (SR) technology. Dong et al. [10] introduced the convolutional neural network (CNN) into the SISR reconstruction task and proposed a simple three-layer network (SRCNN) to generate HR images; its results indicated the advantage of deep learning in SR techniques. Since then, the image SR task has seen a large number of CNN-based SR models. Inspired by the residual network [21], Kim et al. [11] constructed a model called VDSR, which significantly improves SISR reconstruction results; Tong et al. [22] proposed SR-DenseNet, which uses a dense connection mechanism to connect image features of different depths in the model to each other to improve the reconstruction results; Zhang et al. [23] proposed RCAN, which improves image reconstruction by introducing the CA module. The above SISR technologies have achieved good results, but most of them rely on artificially synthesized datasets for training. Baek et al. [9] have shown that the performance of these models on real-scene LR images drops drastically because the degradation of real-scene images is very complex compared with synthetic LR images.
2.2 Scene Text Image Super-Resolution
SISR techniques can improve the accuracy of scene text recognizers when used as a preprocessing step. However, most previous SISR models are trained on artificially generated LR images, and it is difficult for such models to reconstruct LR scene text images because the degradation of real scene text images is more complex than that of synthetic LR images. Due to the lack of datasets for the scene text image SR task, there are few works on real-scene LR text images. Wang et al. [15] proposed the TextZoom dataset, filling the gap in datasets for the real-scene text image SR task, and proposed TSRN for this dataset. The experimental results of this model demonstrated that text image SR can significantly increase the recognition accuracy of real-scene LR text images.
2.3 Scene Text Recognition
The HR text image generated by the SR model is used as the input of the recognizer to obtain the text in the original LR text image. Shi et al. [24] proposed an image sequence recognition model (CRNN), which extracts the sequence information of text images by jointly using a CNN and a recurrent neural network (RNN) and utilizes connectionist temporal classification (CTC) to match the generated characters to the real labels. Shi et al. [5] proposed ASTER, which explicitly rectifies irregular text by introducing a spatial transformer network (STN) [25] and then uses an attention-based approach for decoding; Luo et al. [4] proposed a recognition model called MORAN, which designs a multi-object rectification module to correct irregular text. In this work, we choose the above models as model performance evaluators.
3. Proposed Method
This section introduces the proposed model (TSRMAN): we first describe the overall architecture and then focus on the proposed MRA module.
Architecture of the model (TSRMAN): [TeX:] $$\mathrm{L}_2$$ loss represents the pixel-wise loss and [TeX:] $$\mathrm{L}_{\mathrm{GP}}$$ loss represents the gradient prior loss.
As presented in Fig. 1, our model includes four parts: the STN, the shallow convolution module, the deep feature extraction module consisting of multiple MRA modules connected sequentially, and the upsampling module (pixel shuffle [26]). First, the LR text image and its binarized mask map are combined into a four-channel image and input to the STN; this process can be formulated as Eq. (1):

[TeX:] $$I_{s t n}=F_{s t n}\left(I_{L R}\right),$$
where [TeX:] $$F_{s t n}(.)$$ represents the STN, which is employed to deal with pixel misalignment between LR-HR text image pairs and blurring artifacts in reconstructed images, and [TeX:] $$I_{L R} \text { and } I_{s t n}$$ represent the four-channel LR image and the rectified image, respectively. [TeX:] $$I_{s t n}$$ is then input into the shallow convolution module. The shallow feature extraction process can be formulated as Eq. (2):

[TeX:] $$X_{S F}=F_{S F}\left(I_{s t n}\right),$$
where [TeX:] $$F_{SF}(.)$$ represents the shallow convolution module, which consists of a single convolutional layer with a 9×9 kernel, and [TeX:] $$X_{SF}$$ is the extracted shallow feature map. The large-kernel convolution can capture connections between long-distance pixels, which facilitates modeling image features from a global view. Then, [TeX:] $$X_{SF}$$ is input into the deep feature extraction module, which can be formulated as Eq. (3):

[TeX:] $$X_{D F}=F_{D F}\left(X_{S F}\right),$$
where [TeX:] $$F_{DF}(.)$$ represents the deep feature extraction module and [TeX:] $$X_{DF}$$ is the extracted deep feature map. [TeX:] $$F_{DF}(.)$$ can obtain a better feature representation, but as the network deepens it can cause exploding or vanishing gradients. Residual connections alleviate these problems by fusing shallow and deep features through element-wise addition. Finally, the fused map is input into an upsampling module. The upsampling method used in this paper is the sub-pixel convolution operation [26]. The upsampling process can be formulated as Eq. (4):

[TeX:] $$I_{\text {out }}=\operatorname{Conv}\left(F_{U P}\left(\operatorname{add}\left(X_{S F}, X_{D F}\right)\right)\right),$$
where [TeX:] $$\operatorname{Conv}(.)$$ represents the operation of generating a four-channel SR image from the upsampled feature map, [TeX:] $$F_{U P}(.)$$ represents the upsampling module, [TeX:] $$\operatorname{add}(.)$$ represents element-wise addition, and [TeX:] $$I_{\text {out }}$$ represents the generated SR image. The sub-pixel convolution operation achieves upsampling by rearranging pixels. Fig. 2 illustrates the specific implementation.
The sub-pixel convolution operation: H, W, C, and r represent the height, the width, the channel dimension, and the upsampling factor of the image, respectively (H, W, and C retain the same meaning throughout this paper).
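To make the data flow concrete, the following is a minimal PyTorch sketch of the pipeline in Fig. 1 and Eqs. (1)-(4). It is only an illustration under assumptions: the channel width, the identity stand-in for the STN, and the plain-convolution stand-in for the MRA stack are not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TSRMANSketch(nn.Module):
    """Illustrative pipeline: STN -> shallow 9x9 conv -> deep feature stack -> residual fusion -> pixel shuffle."""
    def __init__(self, channels=64, num_blocks=5, scale=2):
        super().__init__()
        self.stn = nn.Identity()  # placeholder for the spatial transformer network of Eq. (1)
        self.shallow = nn.Conv2d(4, channels, kernel_size=9, padding=4)  # F_SF in Eq. (2)
        # Stand-in for the stacked MRA modules (F_DF in Eq. (3)); see the MRA sketch in Section 3.2.
        self.deep = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_blocks)])
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),  # sub-pixel convolution of Fig. 2
        )
        self.to_image = nn.Conv2d(channels, 4, 3, padding=1)  # Conv(.) in Eq. (4): four-channel SR output

    def forward(self, lr_and_mask):                            # RGB image + binary mask = 4 channels
        x_stn = self.stn(lr_and_mask)                          # Eq. (1)
        x_sf = self.shallow(x_stn)                             # Eq. (2)
        x_df = self.deep(x_sf)                                 # Eq. (3)
        return self.to_image(self.upsample(x_sf + x_df))       # Eq. (4): add(.), then F_UP(.), then Conv(.)

# A 16x64 four-channel LR input yields a 32x128 four-channel SR output (x2 pixel shuffle).
sr = TSRMANSketch()(torch.randn(1, 4, 16, 64))
print(sr.shape)  # torch.Size([1, 4, 32, 128])
```

The shape check at the end also illustrates the rearrangement in Fig. 2: a feature map with C·r² channels becomes a map with C channels and r times the height and width.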
3.2 Multi-Scale Residual Attention Module
To fully extract the features of LR text images at different scales, this paper proposes the MRA module, which is introduced in detail in this section.
Fig. 3 shows the MRA module, which comprises three multi-scale (MS) blocks, three BPC (batch normalization [BN] + PReLU + CA) blocks, one SA module, and a bidirectional gated recurrent unit (GRU). The MS block is made up of three convolutional layers with different kernel sizes, namely 1×1, 3×3, and 5×5. This block captures the image's feature representation at different scales and fuses the features by element-wise addition. Furthermore, a parameter-sharing mechanism is used to reduce the number of parameters: all 1×1 convolutions in the multi-scale residual block share the same parameters, and similarly, all 5×5 convolutions share the same parameters. In the BPC block, the BN layer reduces the gradient dispersion issue in training deep networks and can even accelerate the convergence of the model [27], while the activation layer increases the model's nonlinearity.
The multi-scale residual attention module.
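A hedged sketch of the MS block with shared convolutions and of the BPC block is given below; the channel width is an assumption, the CA placeholder is filled in by the sketch after Fig. 4, and the exact wiring of the three MS blocks, the SA module, and the GRU inside the full MRA module follows Fig. 3 and is omitted here.

```python
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions fused by element-wise addition.
    Passing in shared conv layers realizes the parameter-sharing mechanism across MS blocks."""
    def __init__(self, channels=64, conv1=None, conv5=None):
        super().__init__()
        self.conv1 = conv1 if conv1 is not None else nn.Conv2d(channels, channels, 1)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv5 = conv5 if conv5 is not None else nn.Conv2d(channels, channels, 5, padding=2)

    def forward(self, x):
        return self.conv1(x) + self.conv3(x) + self.conv5(x)

class BPC(nn.Module):
    """BN + PReLU + channel attention."""
    def __init__(self, channels=64):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.PReLU()
        self.ca = nn.Identity()  # placeholder for the CA module sketched after Fig. 4

    def forward(self, x):
        return self.ca(self.act(self.bn(x)))

# Parameter sharing: the same 1x1 and 5x5 convolution instances are reused by all three MS blocks.
shared1, shared5 = nn.Conv2d(64, 64, 1), nn.Conv2d(64, 64, 5, padding=2)
ms_blocks = [MultiScaleBlock(64, shared1, shared5) for _ in range(3)]
```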
The convolution operation treats the feature maps along the image channel dimension equally, which limits the representation ability of the model. Therefore, this paper adopts the CA module [23], shown in Fig. 4, to establish the dependencies between different channels.
The channel attention (CA) module.
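A minimal sketch of an RCAN-style CA module as in [23] is shown below; the reduction ratio of 16 is an assumed hyperparameter.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling produces a per-channel descriptor; two 1x1 convolutions with a
    sigmoid produce per-channel weights that rescale the input feature map."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # B x C x 1 x 1 channel descriptor
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.excite(self.pool(x))                 # channel-wise reweighting
```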
The spatial attention (SA) module.
Fig. 5 shows the SA module. The SA module directs the model to focus more on image boundaries, which contain more high-frequency information and are helpful for image reconstruction. This paper adopts the SA module proposed in [21].
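As an illustration, a common CBAM-style SA formulation is sketched below; the paper's exact SA design may differ in detail, and the kernel size is an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise average and max maps are concatenated, convolved, and passed through a
    sigmoid to form a spatial mask that emphasizes informative (e.g., boundary) regions."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)                        # B x 1 x H x W
        max_map, _ = torch.max(x, dim=1, keepdim=True)                      # B x 1 x H x W
        mask = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * mask                                                      # spatial reweighting
```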
In [15], the authors demonstrate the effectiveness of modeling context information using a bidirectional long short-term memory (LSTM) network. Therefore, this paper takes a similar approach using a GRU, which has fewer parameters.
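The sketch below shows one way to apply a bidirectional GRU to a B×C×H×W feature map, treating each row as a horizontal sequence; the exact reshaping used in the MRA module is an assumption on our part.

```python
import torch
import torch.nn as nn

class HorizontalBiGRU(nn.Module):
    """Runs a bidirectional GRU along the width of every row so that each position can
    aggregate context from the characters to its left and right."""
    def __init__(self, channels=64):
        super().__init__()
        self.gru = nn.GRU(channels, channels // 2, bidirectional=True, batch_first=True)

    def forward(self, x):                                        # x: B x C x H x W
        b, c, h, w = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)         # (B*H) sequences of length W
        out, _ = self.gru(seq)                                   # C/2 hidden units per direction -> C total
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)       # back to B x C x H x W

print(HorizontalBiGRU()(torch.randn(2, 64, 16, 64)).shape)       # torch.Size([2, 64, 16, 64])
```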
3.3 Loss Function
The loss function L includes the pixel loss and the gradient prior loss, which can be expressed as Eq. (5):

[TeX:] $$L=a L_2+b L_{G P},$$
where coefficients a and b are the weights of the two terms of L.
The pixel loss measures the difference between two images. In this paper, the pixel loss is calculated using the mean square error ([TeX:] $$L_2$$ loss), which is expressed as Eq. (6):

[TeX:] $$L_2=\frac{1}{n} \sum_{i=1}^n\left\|I_{H R}^i-I_{S R}^i\right\|_2^2,$$
where n represents the batch size, [TeX:] $$I_{SR}$$ represents the SR image, and [TeX:] $$I_{HR}$$ represents the HR image.
Since sharpened characters are more recognizable than smooth ones, this paper adopts the gradient prior loss ([TeX:] $$L_{GP}$$), the same as in TSRN, to generate sharp image boundaries. [TeX:] $$L_{GP}$$ is given by Eq. (7):

[TeX:] $$L_{G P}=E_i\left\|\nabla I_{H R}(i)-\nabla I_{S R}(i)\right\|_1,$$
where [TeX:] $$\nabla I_{H R}(i) \text { and } \nabla I_{S R}(i)$$ represent the gradient fields of the HR image and the SR image, respectively, and [TeX:] $$i_0 \text { and } i_1$$ represent the pixels whose values begin to change and stop changing along the direction of the image gradient, respectively.
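A hedged sketch of the overall loss of Eqs. (5)-(7) is given below; the gradient fields are approximated here with simple finite differences, which may differ from the exact gradient-profile computation in TSRN.

```python
import torch
import torch.nn.functional as F

def gradient_field(img):
    """Finite-difference approximation of the image gradient along width and height."""
    dx = img[:, :, :, 1:] - img[:, :, :, :-1]
    dy = img[:, :, 1:, :] - img[:, :, :-1, :]
    return dx, dy

def total_loss(sr, hr, a=1.0, b=1e-4):
    """L = a * L2 + b * L_GP, following Eq. (5) with the weights listed in Section 4.2."""
    l2 = F.mse_loss(sr, hr)                                        # pixel-wise loss, Eq. (6)
    sr_dx, sr_dy = gradient_field(sr)
    hr_dx, hr_dy = gradient_field(hr)
    l_gp = F.l1_loss(sr_dx, hr_dx) + F.l1_loss(sr_dy, hr_dy)       # gradient prior loss, Eq. (7)
    return a * l2 + b * l_gp

loss = total_loss(torch.rand(2, 4, 32, 128), torch.rand(2, 4, 32, 128))
```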
4. Experiments
4.1 Dataset
This paper uses TextZoom for model training and testing. It contains 17,367 training pairs and 4,373 test pairs, where each sample consists of an LR-HR image pair and its corresponding text label. The test set has three subsets: 1,353 image pairs in the hard subset, 1,411 in the medium subset, and 1,619 in the easy subset. Example images are shown in Figs. 6-8.
Easy subset diagram: the top row represents LR text images, and the bottom row represents the corresponding HR text images.
Medium subset diagram: the top row represents LR text images, and the bottom row represents the corresponding HR text images.
Hard subset diagram: the top row represents LR text images, and the bottom row represents the corresponding HR text images.
4.2 Implementation Details
Following [15], all LR text images and all HR text images are resized to 16×64 and 32×128, respectively. The coefficients of [TeX:] $$L_2 \text { and } L_{G P}$$ are set to 1 and [TeX:] $$10^{-4}$$, respectively, and the Adam optimizer with a momentum of 0.9 is used. The evaluation metric is the recognition accuracy obtained from the released PyTorch version of ASTER [5]. The model is trained on an NVIDIA RTX 3080 Ti GPU for 500 epochs with a batch size of 64. Fig. 9 shows visualizations from the proposed model.
Image visualization: images with the same character are represented as a group, and from top to bottom are LR text images, SR images generated by the proposed model, and HR text images.
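The following sketch illustrates the preprocessing and optimizer setup of Section 4.2; the learning rate is not reported in the paper, so the value below is only a placeholder, and the single convolution is merely a stand-in for the full TSRMAN network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

# Resize an arbitrary LR/HR crop to the fixed shapes used in Section 4.2.
lr_img = F.interpolate(torch.rand(1, 3, 24, 90), size=(16, 64), mode="bicubic", align_corners=False)
hr_img = F.interpolate(torch.rand(1, 3, 48, 180), size=(32, 128), mode="bicubic", align_corners=False)

model = nn.Conv2d(4, 4, 3, padding=1)          # stand-in for the full TSRMAN model of Section 3
optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))  # beta1 = 0.9 matches the stated momentum

num_epochs, batch_size = 500, 64               # training schedule from Section 4.2
```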
4.3 Ablation Study
This paper analyzes the effectiveness of our work from the following two aspects.
4.3.1 The effect of the number of MRA modules
As shown in Table 1, building deeper networks by increasing the number of MRA modules cannot improve model capability indefinitely. When 6 MRA modules are stacked, the accuracy of the model with the text recognizer ASTER begins to decline; with 5 MRA modules, the model reaches saturation and obtains the best average recognition accuracy.
The effects of the number of MRA modules
The bold font indicates the best performance in the test on the number of MRA modules.
The effects of the components of the MRA module
The bold font indicates the best performance in the test on the components of the MRA module.
4.3.2 The effect of decomposition components of MRA modules
As shown in Table 2, we take TSRN as the baseline model and obtain four reconstructed models by decomposing the components that constitute the multi-scale residual module. Comparing Model 1 and Model 2 in Table 2, we find that the model is less effective at reconstructing images with low levels of blur; the reason may be that the reconstruction of this type of image is inhibited by the multi-scale learning features, resulting in a decline in the quality of the reconstructed images. Comparing Model 1 and Model 3, we conclude that adding the attention module alone only achieves recognition accuracy similar to that of the baseline model, which to some extent indicates that the feature utilization of the baseline model has reached saturation. Comparing Model 3 and Model 4, we find that the skillful fusion of multi-scale and attention mechanisms significantly enhances the model's performance, demonstrating that the multi-scale module does extract rich feature representations and that the attention module can indeed filter out the features beneficial to the target task from these rich image features.
4.4 Comparison
We compare the proposed model with current SISR models, including SRCNN [10], VDSR [11], SRResNet [28], EDSR [12], LapSRN [17], and TSRN [15]. All models are trained on TextZoom, and the evaluation metric is text recognition accuracy. The proposed model achieves an improvement of 1.2% over the baseline model TSRN [15] on TextZoom with the text recognizer ASTER [5]; the detailed results are shown in Table 3. The comparison of recognition accuracy is shown in Fig. 10. As shown in Table 4, our model also achieves relatively high values on images with a large degree of blur, which demonstrates its effectiveness in recovering image details.
Recognition result of SR models in different recognizers (unit: %)
The bold font indicates the best performance in the test across different recognizers.
PSNR and SSIM result comparison
PSNR=peak signal-to-noise ratio, SSIM=structural similarity index measure.
The bold font indicates the best performance in terms of PSNR and SSIM.
5. Conclusion
In this paper, a novel text image super-resolution model is proposed for real-scene text image super-resolution. It skillfully combines multi-scale learning and attention mechanisms in the designed MRA module, thereby improving the recognition accuracy of text recognizers on LR scene text images and surpassing the existing baseline model TSRN. Although the proposed model achieves better results, the recognition accuracy on extremely blurred and long text images is still low. In future work, we will introduce deblurring techniques for blurred text, and for long text we will try to build self-attention mechanisms to learn long-distance semantic information.