1. Introduction
Single image super-resolution (SISR) is a computer vision task that reconstructs high-resolution (HR) images from low-resolution (LR) images. Unlike high-level vision tasks that predict coordinate points from images, SISR is a pixel-level task that learns a one-to-many mapping. Dong et al. [1] introduced a neural network, the super-resolution convolutional neural network (SRCNN), to learn this mapping, and it outperformed the previous conventional methods. However, as deep neural networks grow deeper and wider, their parameter counts and computational costs increase, making these methods increasingly demanding in terms of hardware in practical applications. Researchers have therefore recently focused on designing lightweight neural networks. On the one hand, the simplest strategy is to build a shallow network, such as the efficient sub-pixel convolutional neural network (ESPCN) [2] and the fast super-resolution convolutional neural network (FSRCNN) [3]. On the other hand, some approaches reduce computation and parameters through parameter-sharing mechanisms such as recursive learning. For example, the deeply-recursive convolutional network (DRCN) [4] reduces redundant parameters by recursive learning, and the deep recursive residual network (DRRN) [5] combines residual learning with recursive learning.
These methods effectively reduce the number of model parameters and, compared with general methods, perform well in terms of inference speed. However, they have two drawbacks: (1) the up-sampling operation applied before the input increases the computational cost of the neural network, and (2) they balance network lightness against HR reconstruction quality poorly.
To solve these two problems, Ahn et al. [6] proposed CARN-M for mobile scenarios by cascading network structures [3], but it comes at the cost of a significant reduction in PSNR. Hui et al. [7] proposed an information distillation network (IDN) that explicitly divides the extracted features into two parts, one of which is retained while the other is processed further. In this way, IDN achieves good performance, but there is still room for improvement. MOA-S [8] and FALSR [9] introduced neural architecture search (NAS) to SISR. NAS [10] is an emerging approach for the automatic design of efficient networks. Although NAS-based methods seem theoretically effective, the limited search spaces and strategies of NAS restrict the performance and reproducibility of NAS-designed networks.
The methods above perform well at reducing model size or speeding up inference, but the goal of a lightweight design is to reduce the number of parameters of the reconstruction network and speed up computation while ensuring that reconstruction quality does not degrade. Learning from the experience of these methods, we adopt the idea of group convolution to design a channel split residual learning structure, together with a double-sampling structure that widens the upsampling network, to improve reconstruction performance while balancing reconstruction quality and computation speed.
Our contributions include the following three main points:
1. To design compact networks, we propose a channel split residual structure that, in our experiments, effectively reduces computation and parameters.
2. We propose a double-upsampling network for SISR to improve the network performance and relieve the pressure on the deep feature extraction network.
3. For the accurate selection of fast reconstruction networks in practical applications, we propose 100_FPS, a new frame-rate evaluation metric for lightweight super-resolution networks.
2. Related Work
2.1 Residual Learning
He et al. [11] introduced deep residual learning in ResNet to solve the performance degradation problem caused by deepening neural networks. SRResNet [12] was the first method to apply residual learning to SISR: it extracts reconstruction features of HR images with residual learning blocks and restores high-quality HR images. The enhanced deep super-resolution network (EDSR) [13] also uses residual learning but, compared with ResNet and SRResNet, proposes several improvements such as removing batch normalization (BN). Fig. 1 compares three different modules that use residual learning: the original ResNet [14], SRResNet, and EDSR. The original residual learning structure proposed by He et al. [11] is shown in Fig. 1(a). Fig. 1(b) shows SRResNet, which applies the residual learning of ResNet unchanged in the feature extraction module of image reconstruction.
In Fig. 1(c), EDSR removes the BN layer from the network, because Nah et al. [15] argue that BN normalizes the features and thereby eliminates the range flexibility of the network. Importantly, they also show experimentally that this simple modification greatly improves performance.
Four different residual learning modules: (a) ResNet, (b) SRResNet, (c) EDSR, and (d) Split ResNet. “conv” stands for convolution, “BN” stands for batch normalization, “ReLU” is the activation function, and “Addition” stands for the element-wise addition of feature maps.
2.2 Depthwise Separable Convolution
Depthwise separable convolution was proposed in MobileNet [17]. It is a form of factorized convolution that factorizes a standard convolution into a depthwise convolution and a pointwise convolution. Depthwise convolution can be regarded as a group convolution in which, unlike in other methods, the number of groups equals the number of input channels, so each input channel has its own convolution kernel; this removes a large number of parameters and computations. The pointwise convolution then maintains the flow of information between the groups. Fig. 2 shows the detailed computation process of depthwise separable convolution. Assuming 64 input and 64 output channels, the depthwise convolution uses 64 times fewer parameters than the original convolution (576 vs. 36,864 weights). We do not count the parameters of the pointwise convolution here, because it is a 1×1 convolution whose parameters are negligible compared with a general convolution. In conclusion, depthwise separable convolution is indeed a good way to reduce the computation and parameters of a network.

The structure of depthwise separable convolution, where “Group” represents the process of grouping feature maps, “Depthwise” is the channel-separated convolution, and “Pointwise” stands for the pointwise convolution.
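To make the counting concrete, the following minimal PyTorch sketch (ours, for illustration; the channel sizes match the example above) reproduces these parameter counts:

```python
import torch.nn as nn

in_ch, out_ch = 64, 64

# Standard 3x3 convolution: 3*3*64*64 = 36,864 weights.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Depthwise 3x3 convolution (groups == input channels): 3*3*1*64 = 576 weights.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                      groups=in_ch, bias=False)

# Pointwise 1x1 convolution keeps information flowing across the groups.
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise), count(pointwise))
# 36864 576 4096 -> the depthwise stage alone uses 1/64 of the parameters
```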
3. Proposed Method
3.1 Model Design Analysis
From Fig. 3(a) and 3(b), we can see that a super-resolution neural network consists mainly of a feature extraction network and an upsampling network. The feature extraction network extracts from the original image the deep feature information required for reconstruction, while the upsampling network uses this information to reconstruct the HR image. Since the available upsampling strategies are few and largely fixed, designing an efficient feature extraction network is the main route to improving reconstruction quality. From SRCNN [1] to the wide activation super-resolution network (WDSR) [18] and the super-resolution feedback network (SRFBN) [19], research has been committed to changing the feature extraction network to improve super-resolution. Yet although networks have grown from three layers to hundreds, the quality of reconstructed images has improved only modestly in terms of PSNR (peak signal-to-noise ratio) [20] and SSIM (structural similarity index measure) [21]: on the ×4 Set5 test images, SRCNN achieves a PSNR/SSIM of 30.48 dB/0.8628, while SRFBN achieves 32.56 dB/0.8992. The depth of the feature extraction network increased by a factor of hundreds, while PSNR and SSIM improved by only 2.08 dB and 0.0364, respectively. In addition, SISR is a pixel-level task that relies on shallow pixel-level information during reconstruction. Therefore, researchers should pay more attention to the shallow pixel-level information of the image when designing lightweight super-resolution networks.
Three different upsampling methods in the image reconstruction network: (a) pre-upsampling, (b) post-upsampling, and (c) double-upsampling. [TeX:] $$\oplus$$ stands for the addition operation.
Time cost of the basic convolutional neural network operations for different numbers of channels
Before designing a compact and lightweight network, we experimented on the time cost of the fundamental operations of a convolutional neural network. These experiments use the TensorFlow framework and the TensorFlow Timeline tool to measure time; all times in Table 1 are in milliseconds. From Table 1, we find that convolution is the most time-consuming operation. When the number of feature channels in a convolutional layer is halved, its computation time drops by about 3/4 (both input and output channels halve, so the multiply-accumulate count drops to roughly 1/4), whereas for the other operations the reduction is much weaker. Therefore, reducing the number of convolutional channels and convolution operations is the most effective way to reduce the parameters and computation of a lightweight network.
3.2 Channel Split Residual
Through the experiments and analyses in Section 3.1, we find that reducing the number of convolutional channels plays an important role in making the whole network lightweight, so we propose the channel split residual learning structure to reduce the number of convolutional channels. In channel split residual learning (Fig. 1(d)), we combine general residual learning with the recently popular group convolution, and we retain the removal of the BN layer proposed by EDSR [13]. Meanwhile, we adopt group convolution in the second convolution, where the number of groups equals the number of input channels. If there are 64 input channels and the same number of output channels, the number of parameters of the residual module in EDSR [13] is [TeX:] $$3 \times 3 \times 64 \times 64 \times 2$$ (shown in Fig. 1(c)) and the number of parameters of the channel split residual structure is [TeX:] $$3 \times 3 \times 64 \times 64+3 \times 3 \times 1 \times 64+1 \times 1 \times 64$$ (shown in Fig. 1(d)), which reduces the parameters to 50.78% of the original. Assuming an input feature map of size H×W, the floating point operations (FLOPs) of the channel split residual module are likewise calculated to be 50.82% of those of the EDSR residual module.
If the input and output feature maps are represented by [TeX:] $$X_{i} \text { and } X_{i+1},$$ respectively, the residual structure proposed by EDSR can be expressed by the formula (1):
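[TeX:] $$X_{i+1}=X_{i}+f_{\text {conv }}\left(\operatorname{ReLU}\left(f_{\text {conv }}\left(X_{i}\right)\right)\right)$$ (1)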
Then the channel split residual structure can be described by the formula (2):
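[TeX:] $$X_{i+1}=X_{i}+f_{G_{-} \text {conv }}\left(\operatorname{ReLU}\left(f_{\text {conv }}\left(X_{i}\right)\right)\right)$$ (2)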
where ReLU represents the activation function, [TeX:] $$f_{\text {conv }}$$ is an ordinary convolution operation with a [TeX:] $$3 \times 3$$ kernel, and [TeX:] $$f_{G_{-} \operatorname{conv}}$$ denotes the group convolution operation with G groups; in CSRNet, G equals the number of input channels. To maintain the data flow between channels, we rely on the ordinary addition operation of residual learning. Compared with channel shuffling [22] and the identity-skip method of the Ghost module [23], our method is more straightforward and computationally convenient, adding no extra computations, which is an advantage inherited from general residual learning.
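For illustration, a minimal PyTorch sketch of this block, following formula (2) and our reading of Fig. 1(d) (a sketch, not the authors' released code), is:

```python
import torch
import torch.nn as nn

class ChannelSplitResBlock(nn.Module):
    """Residual block whose second convolution is a group convolution
    with G equal to the number of input channels, as in formula (2)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.g_conv = nn.Conv2d(channels, channels, 3, padding=1,
                                groups=channels)  # G = input channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Ordinary addition maintains the data flow between channels.
        return x + self.g_conv(torch.relu(self.conv(x)))
```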
3.3 Double-Upsampling Network
There are two common types of upsampling networks: subpixel convolution and deconvolution. Subpixel convolution, proposed in ESPCN [2], maps the feature maps directly to the HR image, which reduces the computation and memory consumption of deconvolution. In the subpixel approach, extracting good HR features is key to the quality of the reconstructed image, since the final values of the HR feature maps correspond one-to-one to the reconstructed HR image pixels. This differs from other high-level visual tasks, which predict far fewer points than SISR. To reduce the feature extraction pressure on the network, we adopt a double-upsampling structure, which supplements the general upsampling network with a simple additional upsampling branch. In Fig. 3(c), the deep feature extraction network in the red box is regarded as the residual network, marked Res_Net, and the shaded network is regarded as the main network, marked Main_Net. If [TeX:] $$X_{\text {input }}$$ is the input LR image, [TeX:] $$Y_{r e}$$ the reconstructed HR image, and [TeX:] $$Y_{H}$$ the real HR image, then the feature map [TeX:] $$X_{\text {res_map }}$$ extracted by Res_Net during HR image reconstruction can be represented as:
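[TeX:] $$X_{\text {res_map }}=f_{\text {ex_res }}\left(X_{\text {input }}\right)$$ (3)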
The Main_Net feature map [TeX:] $$X_{\text {main_map }}$$ is represented as:
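[TeX:] $$X_{\text {main_map }}=f_{\text {ex_main }}\left(X_{\text {input }}\right)$$ (4)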
where [TeX:] $$f_{e x_{-} \text {res }} \text { and } f_{e x_{-} \text {main }}$$ represent the Res_Net and Main_Net feature extraction networks, respectively, and the deep feature extraction network is composed of multiple channel split residual structures. The whole reconstruction process is represented as:
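[TeX:] $$Y_{r e}=f_{\text {res_up }}\left(X_{\text {res_map }}\right)+f_{\text {main_up }}\left(X_{\text {main_map }}\right)$$ (5)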
where [TeX:] $$f_{\text {res_up }} \text { and } f_{\text {main_up }}$$ represent the upsampling networks of the Res_Net and Main_Net branches, respectively; both upsampling networks adopt the subpixel convolution method.
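A minimal PyTorch sketch of this double-upsampling layout follows (our reading of Fig. 3(c) and formulas (3)–(5), reusing ChannelSplitResBlock from Section 3.2; the single-stage subpixel upsampling and the layer counts are assumptions based on Section 4.2):

```python
import torch.nn as nn

class DoubleUpsamplingNet(nn.Module):
    def __init__(self, channels: int = 64, scale: int = 4, n_blocks: int = 10):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        # Deep branch (Res_Net): stacked channel split residual blocks.
        self.ex_res = nn.Sequential(
            *[ChannelSplitResBlock(channels) for _ in range(n_blocks)])
        # Shallow branch (Main_Net): a few plain convolutions.
        self.ex_main = nn.Sequential(
            *[nn.Conv2d(channels, channels, 3, padding=1) for _ in range(5)])

        def subpixel_up():
            # Subpixel (pixel-shuffle) upsampling to the HR resolution.
            return nn.Sequential(
                nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
                nn.PixelShuffle(scale))

        self.res_up, self.main_up = subpixel_up(), subpixel_up()

    def forward(self, x_input):
        feat = self.head(x_input)
        y_main = self.main_up(self.ex_main(feat))  # f_main_up(X_main_map)
        y_res = self.res_up(self.ex_res(feat))     # f_res_up(X_res_map)
        y_re = y_main + y_res                      # formula (5)
        return y_re, y_main                        # y_main is reused in loss_s
```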
To make the network easy to train, we propose a loss function for the double-upsampling network, described as follows:
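[TeX:] $$\operatorname{loss}_{a l l}=\operatorname{loss}_{s}+\operatorname{loss}_{o}$$ (6)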
From formula (6), we can see that the loss function of the double-upsampling network consists of two components: [TeX:] $$\operatorname{loss}_{s} \text { and } \operatorname{loss}_{o}.$$ [TeX:] $$\operatorname{loss}_{s}$$ represents the loss between the HR image reconstructed by Main_Net and the real HR image [TeX:] $$Y_{H},$$ and [TeX:] $$\operatorname{loss}_{o}$$ represents the loss between the final reconstructed HR image [TeX:] $$Y_{r e}$$ and the real HR image [TeX:] $$Y_{H}.$$ Adding the Main_Net reconstruction loss to the final loss reduces the difference between the Main_Net output and the real image, and also relieves the learning pressure on the upsampling network of the residual branch.
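As a sketch (the exact distance measure is not stated above; L1 is assumed here), the combined loss could be implemented as:

```python
import torch.nn as nn

l1 = nn.L1Loss()  # assumed; the distance measure is not specified above

def loss_all(y_re, y_main, y_h):
    loss_s = l1(y_main, y_h)  # Main_Net reconstruction vs. real HR image
    loss_o = l1(y_re, y_h)    # final reconstruction vs. real HR image
    return loss_s + loss_o    # formula (6)
```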
3.4 Frame Rate Evaluation
The existing evaluation metrics for lightweight networks are the number of parameters and FLOPs, both of which are objective metrics. However, because of GPU-internal acceleration mechanisms, these metrics reflect the actual inference speed of a neural network inaccurately: the percentage reduction in parameters and computation does not match the percentage increase in actual speed. In Table 2, we perform three sets of comparison experiments on the convolution structure, the number of convolutional channels, and the network depth. 10_res_block_64_[TeX:] $$3 \times 3$$ is our baseline control group: res_block represents residual blocks, 10 stands for the number of residual blocks, 64 stands for the number of convolutional channels, and [TeX:] $$3 \times 3$$ represents the size of the convolutional kernel. 10_res_block_64_[TeX:] $$3 \times 1+1 \times 3$$ indicates that the [TeX:] $$3 \times 3$$ convolution is replaced by [TeX:] $$3 \times 1 \text { and } 1 \times 3$$ convolutions. Each experiment in Table 2 was run three times on a TITAN X GPU using the TensorFlow framework with an image size of [TeX:] $$256 \times 256.$$ From Table 2, we can observe the following:
Compared with [TeX:] $$3 \times 3$$ convolution, [TeX:] $$3 \times 1 \text { and } 1 \times 3$$ convolutional groups require more computation time. In compact network design, replacing [TeX:] $$3 \times 3$$ convolution with [TeX:] $$3 \times 1 \text { and } 1 \times 3$$ convolutional groups does not reduce the computation time.
Theoretically, if the depth of the network is halved, the time should be halved as well, but in the experiment the time is 4/7 of the original.
Theoretically, if the number of channels of the network is halved, the time should drop to 1/4 of the original, but in the experiment the time is 1/3 of the original.
Changing the number of channels has the greatest influence on the overall network time, while changing the convolutional kernel size has little influence.
Comparison of convolution parameter changes and frame rates, where frame denotes the frame rate and time denotes the time cost
Therefore, we propose a new method to test speed, which requires a basic model as a baseline; as shown in Table 2, our baseline is 10_res_block_64_[TeX:] $$3 \times 3.$$ The measured time includes only the feature extraction network. To minimize statistical bias, we repeated each experiment three times. In each test we run 105 frames but time only the last 100, denoted T_100_frame, because the computation time of the first 5 frames is unstable in TensorFlow: the GPU needs some extra preparation time at the beginning of the computation. The frame rate 100_FPS is then computed from T_100_frame as follows:
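[TeX:] $$100\_F P S=\frac{100}{T\_100\_frame}$$ (7)

A PyTorch re-expression of this measurement procedure (a sketch under the stated protocol; the original measurements used TensorFlow and the Timeline tool) might look like:

```python
import time
import torch

@torch.no_grad()
def measure_100_fps(model, device="cuda"):
    """Run 105 frames on a 256x256 input, discard the 5 unstable
    warm-up frames, and report the frame rate over the last 100."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, 256, 256, device=device)
    t_after_warmup = None
    for i in range(105):
        model(x)
        torch.cuda.synchronize()      # wait until the GPU really finishes
        if i == 4:                    # end of the 5 warm-up frames
            t_after_warmup = time.perf_counter()
    t_100_frame = time.perf_counter() - t_after_warmup
    return 100.0 / t_100_frame        # formula (7)
```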
4. Experiments
4.1 Datasets and Metrics
In our experiments, we used the DIV2K dataset [24], which contains 800 high-quality training images, to train the model. We first randomly cropped [TeX:] $$256 \times 256$$ patches as the HR images and then generated the LR images from them by bicubic interpolation at different down-sampling factors. To increase the amount of training data, we applied several augmentation operations to these images, including random horizontal flips and rotations. We tested on four datasets: Set5 [25], Set14 [26], BSD100 [27], and Urban100 [28]. Set5 and Set14 are common test benchmarks. BSD100 is composed of natural images from the segmentation dataset proposed by Berkeley Lab. The urban images provided by Huang et al. [28] are particularly interesting because they contain many challenging structures that existing methods handle poorly. Together, these four datasets can verify the effectiveness of the model. We use two kinds of evaluation metrics: image quality metrics and lightweight-network metrics. The image quality metrics are PSNR and SSIM; the lightweight-network metrics are the number of parameters, Multi_Adds, and our proposed 100-frame frame rate (100_FPS).
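For reference, a minimal sketch of this LR generation step (the file name and scale are hypothetical):

```python
from PIL import Image

def make_lr(hr_patch: Image.Image, scale: int = 4) -> Image.Image:
    # Bicubic down-sampling of a 256x256 HR crop to its LR counterpart.
    return hr_patch.resize((hr_patch.width // scale,
                            hr_patch.height // scale), Image.BICUBIC)

hr = Image.open("hr_crop.png")   # hypothetical 256x256 HR crop
lr = make_lr(hr, scale=4)
```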
4.2 Implementation Details
In our network, the convolution kernels are set to [TeX:] $$3 \times 3$$ with a padding size of 1, which keeps the sizes of the input and output feature maps consistent. Res_Net uses a residual structure with 10 channel split residual blocks: in each block, the first convolution uses 64 [TeX:] $$3 \times 3 \times 64$$ convolution kernels, and the second uses 64 [TeX:] $$3 \times 3 \times 1$$ convolution kernels. In Main_Net, we use only a 5-layer general convolution as the feature extraction network. Both Main_Net and Res_Net use the subpixel approach to generate high-resolution images. In training, we use the Adam optimizer [29] to minimize the loss function [TeX:] $$\operatorname{loss}_{a l l}.$$ The initial learning rate is set to 0.001 and decreases by a factor of 10 after [TeX:] $$3 \times 10^{4}$$ iterations. We implemented the proposed network using the PyTorch framework and trained it on an NVIDIA TITAN X GPU; the entire model trains in less than 1 day.
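A sketch of this training setup (the model reuses the sketch from Section 3.3; whether the 10× decay repeats every [TeX:] $$3 \times 10^{4}$$ iterations is our assumption):

```python
import torch

model = DoubleUpsamplingNet()  # from the sketch in Section 3.3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate by 10x after 3*10^4 iterations (StepLR repeats
# the decay every step_size iterations, which we assume here).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=30_000, gamma=0.1)
```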
4.3 Ablation Analysis
To verify the validity of each part of the proposed method, we conducted three groups of comparison experiments. In the ablation study, the baseline model uses the residual learning structure proposed in EDSR. In Table 3, 10_res_block is the baseline model and 10_res_split is our proposed channel split residual. Comparing the first two rows, the channel split residual reduces the parameters by about 50% and the Multi_Adds by about 75% relative to the residual structure of EDSR, while the frame rate increases by 90%. This large reduction in parameters and computation introduces some quality loss in the reconstructed images. In addition to the channel split residual structure, we also tested the double-upsampling network, using 10_res_split and 10_res_split_double as a comparison group. “Double” denotes the double-upsampling structure, and the unmarked entry is the single-upsampling network. These experiments show that, with a simple Res_Net, the double-upsampling network has fewer parameters than the single-upsampling network while PSNR and SSIM improve by 0.08 dB and 0.097, respectively. In terms of speed, the double-upsampling network does not increase the running time: before the deep feature extraction in Res_Net finishes, the shallow Main_Net has already finished and waits for the Res_Net results. We also compared the loss function proposed for double-upsampling: 10_res_split_double_Lall uses the loss function [TeX:] $$\operatorname{loss}_{a l l},$$ while 10_res_split_double uses [TeX:] $$\operatorname{loss}_{o}.$$ This comparison shows that [TeX:] $$\operatorname{loss}_{a l l}$$ helps improve the quality of the reconstructed images for the same network structure.
Ablation comparison of the experimental groups: ×4 PSNR and SSIM, parameters, Multi_Adds, and frame rate on Set5
In Table 4, the proposed CSRNet is compared with general SISR methods such as SRCNN [1], DRRN [5], and EDSR [13], as well as with lightweight image super-resolution methods. Table 4 reports PSNR and SSIM for all methods on the three test datasets. In contrast to the earlier lightweight network DRRN [5], which has only 22 layers, CSRNet has 40 layers and therefore more parameters. On Set5, CSRNet exceeds DRRN by 0.46 dB in PSNR and 0.005 in SSIM, and it is also slightly faster in computation. We also compare with EDSR, on which the proposed channel split residual learning is based: CSRNet has about 50% of EDSR's parameters and computation, and achieves a significantly higher frame rate with little loss in reconstruction quality. Table 4 also compares the lightweight super-resolution networks FALSR [9], IDN [7], and CARN [30]. These three methods are similar in parameters and computation, with IDN reconstructing images best among them; compared with the other two, our method improves PSNR and SSIM by about 0.2 dB and 0.02, respectively. Although CSRNet has slightly more parameters and computation than CARN, and is therefore naturally slower, it exceeds CARN by 0.42 dB in PSNR. Compared with the recent MADNet [31], the proposed CSRNet is inferior on some evaluation metrics, indicating that there is still room for improvement, and we will continue to work on lightweight SISR. Comparing frame rate and PSNR together (Fig. 4) shows that although CSRNet is inferior to EDSR in performance, it is much faster; and although it is not as fast as AWSR [32], FALSR, and CARN, it is far superior to these three methods in performance.
Our method is compared with other state-of-the-art methods in terms of evaluation metrics
Scatter plot of frame rate and PSNR, where the horizontal axis represents the frame rate (in fps) and the vertical axis represents the PSNR value (in dB), which is the result of ×4 on Set5.
Some ×4 scale image results on the Urban100 dataset (Img047, Img012, and Img003); results marked in red are the best reconstructions, and those marked in purple are ours.
In addition, compared with the earlier DRRN method, CSRNet outperforms it in both performance and speed. For the qualitative test, we selected three images from Urban100, shown in Fig. 5. The image on the left is the original image, and the images on the right are the results cropped from the yellow box. From the result images, we find that the reconstructions of EDSR have the best visual effect, followed by ours; the reconstructed images of CSRNet look slightly sharper than those of the other lightweight methods. CSRNet has both advantages and disadvantages compared with currently popular methods, but it is an effective and usable method.
5. Conclusion
SISR serves as a preprocessing step for many high-level visual tasks, which place high requirements on the inference speed of HR image reconstruction. We therefore bring the idea of hand-designed lightweight networks to image super-resolution. To design a lightweight image super-resolution model, we propose the channel split residual learning and double-upsampling structures. Channel split residual learning mainly uses group convolution to reduce the number of convolution parameters and speed up computation; double-upsampling widens the upsampling network with a second upsampling branch to maintain the performance of the lightweight network without adding extra computation time. We demonstrate the effectiveness of our method on different datasets and find it comparable to the state of the art in the quality of the reconstructed images. Making the network lighter while further improving reconstruction quality will be our future work.