Wan Liao, Qian Zhang, Jiaqi Hou, Bin Wang, Yan Zhang, and Tao Yan

A Multiple Receptive Field Anti-occlusion Self-supervision Network with CBAM for Light Field Angular Super-resolution

Abstract: By recording high-dimensional light data, a light field (LF) can accurately perceive a dynamic environment, thereby supporting the understanding and decision-making of intelligent systems. However, because high-dimensional signals are sampled discretely, the effective acquisition of LF information remains insufficient. This study tackles this problem by introducing a self-supervised learning approach that processes sparse view inputs with convolutional neural networks of varying receptive fields and subsequently generates dense views through warping. The method relies primarily on the inherent correlation of the LF data: the convolutional block attention module (CBAM) is applied to process the LF data, and the warping operation is wrapped into a network layer to construct a deep network. The proposed method eliminates occlusion artifacts and achieves LF angular super-resolution reconstruction. Extensive experiments on the HCI dataset demonstrate that the proposed model outperforms recent state-of-the-art models.

Keywords: Angular Resolution, CBAM, Light Field, Self-supervision Learning

1. Introduction

A light field (LF) is a complete representation of light flow in a 3D world. LF [1] images record both spatial and angular information and therefore provide more comprehensive information than conventional images. The entire LF can be represented by a 7D plenoptic function, which is typically simplified and parameterized by two planes. Many LF acquisition devices have been designed based on this plenoptic model, such as Lytro [2] and RayTrix [3]. However, the limited sensor resolution leads to a trade-off between spatial and angular resolution. To address this issue, developing an efficient LF angular super-resolution method, which reconstructs an original sparse LF image with low angular resolution into a dense LF with high angular resolution, has become a key research focus. Deep learning has various real-world applications, such as centrifugal pump fault detection [4,5]. Owing to their simplicity, supervised learning approaches based on enhanced convolution techniques have been introduced [6] and are extensively employed for fundamental tasks, including image classification [7]. Self-supervised methods based on deep learning are equally applicable to LF data. Most recent studies have been based on supervised learning. However, the HCI [8] and Stanford Light Field [9] datasets cover only a limited range of real-world circumstances, and it is not always practical to capture the environment at full angular resolution; unsupervised learning techniques therefore have significant value. In addition, most current research relies on narrow baselines [10,11], whereas high spatial resolution images are frequently acquired using LF cameras with long baselines. Therefore, angular reconstruction techniques based on broad baselines are increasingly crucial. In this study, a self-supervised convolutional neural network (CNN) method that can take broad-baseline views as input is proposed. Compared with recent mainstream methods, our method adopts a multi-receptive-field structure in the network design, unlike the method proposed by Jin et al. [12], and considers the occlusion problem in the loss function, unlike the method proposed by Yun et al. [13].
Our approach provides the best overall performance and speed. The main contributions of this study are as follows.

· A multi-stream unsupervised learning network with three different receptive field sizes is designed for disparity estimation.
· The convolutional block attention module (CBAM) is introduced into the LF reconstruction network to improve network performance with few additional parameters.
· A novel anti-occlusion adaptive-weight block matching method is proposed in the loss function design.

The remainder of this paper is organized as follows. A summary of the related work is provided in Section 2. Section 3 details the proposed multiple-receptive-field anti-occlusion network with CBAM for angular super-resolution reconstruction. Simulation results are presented in Section 4. Finally, Section 5 concludes the paper.

2. Related Works for Light Field Reconstruction

Some previously proposed non-learning-based approaches for LF reconstruction perform poorly in terms of speed or quality [14,15]. The two primary categories of learning-based methodologies are supervised and unsupervised/self-supervised LF reconstruction. Some methods rely on ground-truth views for supervision [16], whereas others attempt to perform LF reconstruction supervised by real depth maps [10,11,17,18]. For example, a multi-stream CNN in a feature extraction module was exploited for disparity estimation in [19]. A learning-based view synthesis method for LF cameras that decomposes disparity and color has been proposed [17]. An efficient angular super-resolution method combined with a cascaded model fusion approach was developed in [20]. Wu et al. [21] constructed an end-to-end network to reconstruct the LF. Hu et al. [22] designed a spatial-angular dense network containing related blocks and spatial-angular dense skip connections. In the second category of reconstruction methods, supervision is obtained from the difference between the reconstructed center view and the original center view. Several approaches [12,13,23,24] use the estimated depth map to warp the original views and compare the result with the center view, and occlusion compensation networks using a forward-backward warping process have been proposed [24]. Li et al. [25] proposed an occlusion-pattern-aware loss function for unsupervised LF disparity estimation, which successfully extracts and encodes the general occlusion patterns inherent in the LF for loss computation. Smith et al. [26] exploited a novel LF synthesizer module that reconstructs a global LF from a set of object-centric LFs. Digumarti et al. [27] introduced a generalized encoding of sparse LFs, allowing unsupervised learning of odometry and depth. Mousnier et al. [15] achieved high angular resolution by appropriately exploiting image content captured at large focal length intervals. Wang et al. [28] designed LF-InterNet to use both spatial and angular information for reconstruction. Recently, optimized techniques have been employed for LF reconstruction. To eliminate artifacts, Wang et al. [10] used a pseudo-4DCNN paradigm. The combination of residual blocks and the CBAM has led to effective network refinement structures [19]. An angular-domain attention mechanism and EPI blur were implemented in a spatial-angular attention network proposed to remove the spatial high-frequency components of the EPI [21,29].
Similarly, an attention-based multilevel fusion network that produces an efficient matching cost to eliminate occlusions was proposed [30]. The performance of LF reconstruction has thus been improved using deep-learning-based approaches. Most deep-learning-based approaches, however, do not consider multiple receptive fields for the sparse input views, which is also important for enhancing the quality of LF reconstruction. Additionally, there is still room to remove occlusion-related artifacts and compute a densely reconstructed LF using a new loss function. This paper addresses these challenges by proposing a self-supervised learning technique that produces dense-view disparity maps by feeding the sparse input views into multiple CNNs with various receptive fields, and then obtains dense views through depth-image-based rendering. Our use of attention is inspired by the CBAM [31], which efficiently extracts detailed information by fusing spatial and channel attention. Examples of CBAM applications include object recognition with YOLOv4 [32], super-resolution [33,34], and pose estimation [35]. We therefore use the CBAM to exploit the strong correlations between related LF views. To recover details more accurately, we also define a dedicated loss function. Our experimental results demonstrate the effectiveness of our strategy in producing novel views while enhancing realistic details.

3. Proposed LF Angular Super-resolution Reconstruction Framework

Fig. 1 depicts the proposed LF angular super-resolution reconstruction framework. It has three primary modules: 1) the multi-stream disparity estimation module (MDEM), which predicts accurate disparity maps from the input LF views; 2) the LF warping module (LFWM), which creates new views by warping the input views with the estimated disparity maps; and 3) the LF blending module (LFBM), which eliminates artifacts and enhances network performance, producing the final high angular resolution LF. Fig. 1 shows the reconstruction of a 3×3 LF from a 2×2 sparse LF; reconstruction at other angular resolutions can easily be achieved by tuning certain network architecture parameters.

Fig. 1. Multi-stream attention fusion CNN architecture diagram. The MDEM is a multi-receptive-field network with a residual structure, and the LFBM is a single-stream network with an attention mechanism.

3.1 Multi-stream Parallax Estimation Module

First, the low angular resolution input views are fed into a three-stream CNN with different receptive fields. The multi-stream structure contains a dilated convolution whose effective kernel size is 7×7, 9×9, or 11×11, a dilated convolution with a fixed 7×7 kernel, and an ordinary convolution with a 5×5 kernel. The three streams are then merged, passed through two residual blocks, and finally through a 3×3 convolution to obtain the final disparity maps (taking 2×2 to 3×3 reconstruction as an example). In this module, this process can be expressed as follows:
(1) $$D(uu, x)=f_p\big(L(u, x)\big),$$

where f_p represents the MDEM and uu refers to the angular coordinates of the output disparity maps; the uu notation is introduced to distinguish them from the angular coordinates u of the original input views.
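To make the multi-stream idea concrete, the sketch below gives one possible PyTorch layout of the MDEM, assuming four grayscale input views stacked along the channel axis and nine output disparity maps (the 2×2 to 3×3 case). The class names, channel widths, and exact dilation rates are our own illustrative choices rather than the configuration reported here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simple two-layer residual block used after the stream fusion."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class MDEM(nn.Module):
    """Multi-stream disparity estimation (sketch): three branches with
    different receptive fields, merged and refined by residual blocks."""
    def __init__(self, in_views=4, out_views=9, feat=32):
        super().__init__()
        # Branch 1: ordinary 5x5 convolution (small receptive field).
        self.branch_small = nn.Conv2d(in_views, feat, 5, padding=2)
        # Branch 2: 7x7 dilated convolution (medium receptive field).
        self.branch_mid = nn.Conv2d(in_views, feat, 7, padding=6, dilation=2)
        # Branch 3: more strongly dilated convolution (wide receptive field).
        self.branch_large = nn.Conv2d(in_views, feat, 7, padding=9, dilation=3)
        self.fuse = nn.Conv2d(3 * feat, feat, 1)
        self.res1 = ResidualBlock(feat)
        self.res2 = ResidualBlock(feat)
        # Final 3x3 convolution producing one disparity map per dense view.
        self.head = nn.Conv2d(feat, out_views, 3, padding=1)

    def forward(self, sparse_views):          # (B, 4, H, W)
        f = torch.cat([torch.relu(self.branch_small(sparse_views)),
                       torch.relu(self.branch_mid(sparse_views)),
                       torch.relu(self.branch_large(sparse_views))], dim=1)
        f = torch.relu(self.fuse(f))
        f = self.res2(self.res1(f))
        return self.head(f)                   # (B, 9, H, W) disparity maps
```

In practice the branches can be made deeper and the raw views can be replaced by per-view features; only the three-branch, merge-then-refine pattern is essential here.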
3.2 Light Field Warping Module

The warping module takes the four input views and the nine disparity maps produced by the multi-stream parallax estimation module and performs the physical warping operation, obtaining 4×9 = 36 sub-images. The warped view can be expressed as follows:

(2) $$W(u^{\prime}, uu^{\prime}, x)=f_w\big(L(u^{\prime}, x), D(uu^{\prime}, x)\big),$$

where f_w denotes the warping operation, u' indexes the input view being warped, uu' indexes the disparity map used for the warping, and x denotes the spatial coordinate. Because disparity describes the offset of pixels, the final warped view is obtained by multiplying the disparity at each point by the baseline (angular offset) and adding the resulting shift to the coordinates of the original view. The resulting parallax increases with the distance from the central view and is mathematically stated as follows:
(3) $$W\left(u^{\prime}, uu^{\prime}, x\right)=L\left(u^{\prime}, x+D\left(uu^{\prime}, x\right)\left(uu_c-u^{\prime}\right)\right),$$

where D(uu', x) is the disparity map obtained by the CNN in this study; the disparity map of the real scene is not used.
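The warping of Eq. (3) amounts to resampling the source view at coordinates shifted by the disparity scaled by the angular offset. The following NumPy/SciPy snippet is a minimal, non-differentiable sketch for a single-channel view; the function name warp_view and the convention that (du, dv) is the target-minus-source angular offset are our assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_view(src_view, disparity, du, dv):
    """Warp a source view toward a target angular position.

    src_view  : (H, W) single-channel input view L(u', x).
    disparity : (H, W) disparity map D(uu', x) for the target view.
    du, dv    : angular offset (target minus source) along the two
                angular axes, in units of views.
    Returns the warped view W(u', uu', x) of Eq. (3).
    """
    h, w = src_view.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Shift every pixel by disparity * angular offset (Eq. (3)).
    src_y = ys + disparity * dv
    src_x = xs + disparity * du
    # Bilinear resampling of the source view at the shifted coordinates.
    return map_coordinates(src_view, [src_y, src_x], order=1, mode="nearest")
```

Inside a network, a differentiable sampler such as torch.nn.functional.grid_sample would be used instead, and color views would be warped channel by channel.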
3.3 Light Field Blending Module

Some previous methods complete the LF reconstruction directly after the warping module; however, experimental results reveal that such techniques suffer from numerous artifacts and perform poorly. To enhance performance, we include an LF blending module. First, the 4×9 warped views are fed into an ordinary 2D spatial convolution. Subsequently, the feature maps are sent into a CBAM, followed by a dilated convolution with a dilation rate of 2, and then a 2D angular convolution of size 3×3; this pair of alternating operations is repeated four times. The network then terminates with three 3D convolution layers whose kernel sizes are 5×3×3, 4×3×3, and 3×3×3 and whose strides are 4×1×1, 4×1×1, and 1×1×1. The final output feature maps have the same number of channels as the warped views, yielding the final reconstruction result. The above process can be expressed as:

(4) $$\hat{L}(u, x)=f_b\big(W(u^{\prime}, uu^{\prime}, x)\big),$$

where f_b represents the LF blending network.
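The key pattern in the blending module is the alternation between convolutions over the spatial axes and convolutions over the angular axes, which can be realized by reshaping the LF feature tensor before each 2D convolution. The sketch below illustrates only this alternation: the attention step is left as a placeholder, the channel width and tensor layout are assumed, and the final three 3D convolution layers are omitted.

```python
import torch
import torch.nn as nn

class SpatialAngularBlock(nn.Module):
    """One spatial->angular alternation on a 5D tensor (B, C, U*V, H, W)."""
    def __init__(self, channels, ang_res):
        super().__init__()
        self.ang_res = ang_res                     # e.g. 3 for a 3x3 LF
        self.attention = nn.Identity()             # placeholder for a CBAM
        self.spatial = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.angular = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                          # x: (B, C, A, H, W), A = U*V
        b, c, a, h, w = x.shape
        u = v = self.ang_res
        # Spatial convolution: fold the angular axis into the batch axis.
        s = x.permute(0, 2, 1, 3, 4).reshape(b * a, c, h, w)
        s = torch.relu(self.spatial(self.attention(s)))
        s = s.reshape(b, a, c, h, w).permute(0, 2, 1, 3, 4)
        # Angular convolution: fold the spatial axes into the batch axis.
        t = s.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, u, v)
        t = torch.relu(self.angular(t))
        t = t.reshape(b, h, w, c, a).permute(0, 3, 4, 1, 2)
        return t

# Example: four alternations on features of a 3x3 warped-view stack.
blocks = nn.Sequential(*[SpatialAngularBlock(16, ang_res=3) for _ in range(4)])
feats = torch.randn(1, 16, 9, 64, 64)              # (B, C, U*V, H, W)
out = blocks(feats)                                 # same shape as the input
```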
3.4 Convolutional Block Attention Module

The CBAM [31] is used to adaptively refine features by successively inferring attention maps along the channel and spatial dimensions and then multiplying the attention maps by the input feature map. The network architecture is shown in Fig. 2. The entire process is outlined as follows:

(5) $$F^{\prime}=\sigma\big(MLP(\operatorname{AvgPool}(F))+MLP(\operatorname{MaxPool}(F))\big) \otimes F,$$
(6) $$F^{\prime \prime}=\sigma\big(f\big(\left[\operatorname{AvgPool}(F^{\prime}) ; \operatorname{MaxPool}(F^{\prime})\right]\big)\big) \otimes F^{\prime},$$

where σ represents the sigmoid function, f represents a convolution operation, MLP represents a shared-parameter multilayer perceptron, and ⊗ represents element-wise multiplication.

Fig. 2. CBAM overview. The first half is the channel attention module, and the second half is the spatial attention module.
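Eqs. (5) and (6) follow the standard CBAM formulation, which the PyTorch sketch below reproduces: channel attention from a shared MLP over average- and max-pooled descriptors, followed by spatial attention from a convolution over channel-wise average and max maps. The reduction ratio of 16 and the 7×7 spatial kernel are common defaults rather than values stated in this paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (Eq. (5)) followed by spatial attention (Eq. (6))."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        # Convolution over the concatenated channel-wise avg/max maps.
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel,
                                      padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention: F' = sigmoid(MLP(AvgPool F) + MLP(MaxPool F)) * F.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = torch.sigmoid(avg + mx) * x
        # Spatial attention: F'' = sigmoid(conv([AvgPool F'; MaxPool F'])) * F'.
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map = torch.amax(x, dim=1, keepdim=True)
        attn = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return attn * x
```

A module such as CBAM(64) can then be dropped between convolution layers wherever the text places an attention block.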
3.5 Training Loss

Existing methods based on unsupervised learning tend to use one-sided and limited loss functions, which typically introduce artifacts into the reconstructed LF. To mitigate this, we design six loss terms to constrain the network, as described below.

Map loss: To create a mapping connection, the gradient of the disparity map is used to balance the difference between the input and warped views. This map loss is expressed as follows:

(7) $$\ell_{\text {map }}=\sum_{x, u}\left(\sum_{u u}|L(u, x)-W(u, u u, x)|+\nabla_x D(u, x)\right),$$

where u and uu index the angular positions, x indexes the spatial position, L is the input LF view, W is the warped view, and ∇_x D is the gradient of the disparity map.

EPI gradient loss: An EPI is a 2D slice obtained by fixing one spatial and one angular dimension of the LF. Because disparity is directly related to the slope of lines in the EPI, we constrain the EPI gradients of the input and output views to be equal. The loss function is expressed as follows:
(8) $$\begin{aligned} \ell_{epi}= & \sum_{x, u}\left(\left|\nabla_y E_{x, u}(v, y)-\nabla_y \hat{E}_{x, u}(v, y)\right|+\left|\nabla_v E_{x, u}(v, y)-\nabla_v \hat{E}_{x, u}(v, y)\right|\right) \\ & +\sum_{y, v}\left(\left|\nabla_x E_{y, v}(u, x)-\nabla_x \hat{E}_{y, v}(u, x)\right|+\left|\nabla_u E_{y, v}(u, x)-\nabla_u \hat{E}_{y, v}(u, x)\right|\right), \end{aligned}$$

where E and Ê denote the EPIs of the input and reconstructed LFs, respectively.
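Because every EPI is simply a 2D slice of the 4D LF, the gradient differences in Eq. (8) can be computed directly on the full LF tensors with finite differences along the relevant axes. The sketch below assumes both fields are stored as tensors of shape (U, V, Y, X); summing over all slices is equivalent to summing over the individual EPIs.

```python
import torch

def epi_gradient_loss(lf_ref, lf_rec):
    """L1 difference of EPI gradients between a reference and a
    reconstructed light field, both of shape (U, V, Y, X)."""
    loss = 0.0
    # Horizontal EPIs E_{x,u}(v, y): gradients along y (dim 2) and v (dim 1).
    for dim in (2, 1):
        loss = loss + torch.sum(torch.abs(
            torch.diff(lf_ref, dim=dim) - torch.diff(lf_rec, dim=dim)))
    # Vertical EPIs E_{y,v}(u, x): gradients along x (dim 3) and u (dim 0).
    for dim in (3, 0):
        loss = loss + torch.sum(torch.abs(
            torch.diff(lf_ref, dim=dim) - torch.diff(lf_rec, dim=dim)))
    return loss
```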
Blend loss: Because the corner views are used to reconstruct the intermediate views, we constrain the reconstructed corner views to be equal to the original input corner views. This loss occupies a relatively large weight in the total loss function and is expressed as follows:

(9) $$\ell_{\text {blend }}=\sum_{x, u}\left|\hat{L}\left(u_{\text {corner }}, x\right)-L\left(u_{\text {corner }}, x\right)\right|,$$

where L̂ represents the output LF view and u_corner denotes the four corner views.

Warp loss: After the physical warping procedure, nine warped views are obtained for each input view, and the view warped with the center disparity map should remain consistent with the corresponding original input view. This is summarized as follows:
(10) $$\ell_{\text {warp }}=\sum_x\left(\sum_{i=1}^{i n^2}\left|L\left(u_i, x\right)-W\left(u_i, u u_{\text {center }}, x\right)\right|\right).$$

Anti-occlusion loss: This is one of the highlights of the present study; the AM is used here for the second time. This loss function is primarily used to eliminate the impact of occlusion and is expressed as follows:
(11) $$\ell_{\text {anti-occlusion }}=\sum_{i=1}^{\text {out}^2} \sum_x\left|\hat{L}\left(u_i, x\right)-\sum_{j=1}^{in^2} \omega_{i, j} W\left(u_j, u u_i, x\right)\right|,$$

where out² is the number of final reconstructed output views, in² is the number of input warped views, and ω_{i,j} is the weight assigned to each input view: the farther an input view is from the output view, the greater its weight, and the better it can see around obstacles.
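The inner sum of Eq. (11) blends the warped input views with weights that grow with the angular distance between the input view and the target position. The sketch below computes the loss under an assumed weighting (a softmax over angular distances); the text does not give the exact formula for ω_{i,j}, so that part is purely illustrative.

```python
import torch

def anti_occlusion_loss(recon, warped, in_pos, out_pos):
    """Anti-occlusion loss of Eq. (11).

    recon   : (N_out, H, W) reconstructed views L_hat(u_i, x).
    warped  : (N_out, N_in, H, W) views W(u_j, uu_i, x), i.e. every input
              view warped to every output position.
    in_pos  : (N_in, 2) angular coordinates of the input views u_j.
    out_pos : (N_out, 2) angular coordinates of the output views uu_i.
    """
    # Angular distance between every output position and every input view.
    dist = torch.cdist(out_pos.float(), in_pos.float())      # (N_out, N_in)
    # Assumed weighting: farther input views get larger weights,
    # normalized so that the weights of each output view sum to one.
    weights = torch.softmax(dist, dim=1)                      # (N_out, N_in)
    blended = torch.einsum("oi,oihw->ohw", weights, warped)   # (N_out, H, W)
    return torch.sum(torch.abs(recon - blended))
```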
Parallax loss: Because the baseline of an LF camera is short, the disparities produced at the various optical center positions are rather small, and the differences between adjacent disparity maps are even smaller, which allows us to constrain the intermediate disparity maps to be as similar as possible to the center disparity map; otherwise, this term would occupy a dominant position. The loss function is expressed as follows:

(12) $$\ell_{\text {parallax }}=\sum_{i=1}^{\text {out}^2} \sum_x\left|D\left(u_{\text {center }}, x\right)-D\left(u_i, x\right)\right| .$$

Total loss: Our total loss function is defined as follows:

(13) $$L_{\text {total }}=\alpha \ell_{\text {map }}+\beta \ell_{\text {epi }}+\gamma \ell_{\text {blend }}+\delta \ell_{\text {warp }}+\varepsilon \ell_{\text {anti-occlusion }}+\xi \ell_{\text {parallax }},$$

where α, β, γ, δ, ε, and ξ are constants, set to 4, 8, 2, 2, 2, and 1, respectively, in our experiments.
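Given the individual terms, the weighted sum of Eq. (13) with the weights 4, 8, 2, 2, 2, and 1 can be assembled as follows; the dictionary-based helper is our own convenience wrapper rather than part of the method.

```python
# Weights from Eq. (13): alpha, beta, gamma, delta, epsilon, xi.
LOSS_WEIGHTS = {"map": 4.0, "epi": 8.0, "blend": 2.0,
                "warp": 2.0, "anti_occlusion": 2.0, "parallax": 1.0}

def total_loss(terms):
    """Combine the six loss terms of Eq. (13).

    terms: dict mapping the names in LOSS_WEIGHTS to scalar tensors,
    e.g. {"map": l_map, "epi": l_epi, ...}.
    """
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())
```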
4. Experiments

Quantitative and qualitative comparison studies were conducted against various state-of-the-art methods, including those developed by Jin et al. [12], Yun et al. [13], Wu et al. [36], and Meng et al. [37]. To ensure fairness, we first trained all models using the settings suggested by their authors and then tested them with the same hardware and environment on the new HCI LF dataset. The HCI dataset contains 28 scenes: four training scenes, four test scenes, 16 additional scenes, and four stratified scenes. For training, the four training scenes and the 15 additional scenes other than "dishes" were selected, giving 19 scenes in total as the training set. For testing, three test scenes ("bedroom," "bicycle," and "herb") and the additional scene "dishes," four scenes in total, were selected as the test set. Each HCI scene has a size of 512×512×9×9; we reconstructed the input views from 2×2 to 7×7, and the reconstructed result had a size of 468×468×7×7; thus, the reference views were cropped to the same size for comparison.

4.1 Quantitative Assessment

The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are the two key metrics employed for the quantitative analysis of LF reconstruction in our study. The experimental results are presented in Table 1. As is evident from the data, the techniques proposed by Wu et al. [36] and Meng et al. [37] exhibit strong performance under straightforward conditions but face challenges in occluded scenes. By contrast, our approach demonstrates remarkable robustness across diverse scenarios, consistently outperforming the alternative methods. Moreover, the comparison of running time and reconstruction performance depicted in Fig. 3 underscores the speed and efficacy of our method. The practical implications of these quantitative results extend beyond statistical analysis: our method's consistent performance in challenging scenarios, where others falter, underscores its potential significance in fields such as autonomous navigation, 3D imaging, virtual reality (VR), and related domains.

Table 1. Quantitative comparisons (PSNR/SSIM) of different methods
Fig. 3. Comparison of LF reconstruction performance achieved by different methods in recent years, in terms of reconstruction quality (PSNR and SSIM) and runtime.

4.2 Qualitative Analysis

In Fig. 4, the residual plots use the center view as the baseline, with a local close-up in the red box and normalized red and blue levels to depict the differences. Comparing the methods vertically, our method outperforms the other four methods in every scene: it produces fewer incorrect pixels, only a few edge regions show large differences, and the remaining regions are relatively smooth. Our method performs particularly well in scenarios containing occlusions.

Fig. 4. Comparison of our method with other methods. The results show the ground-truth image, the residual plot of the synthesized center view versus the ground-truth image, and a close-up of the portion of the image boxed in red. In the grid in the upper-left corner of each ground-truth image, the blue boxes at the four corners represent the input views of the LF, and the red box represents the displayed view.

4.3 Ablation Study

Effects of the multi-stream network: Table 2 compares the impact of the multi-stream structure on reconstruction performance; the parameter settings of each network model are also listed in Table 2. The multi-stream LF blending module uses three streams with different convolution kernel sizes, which are then merged in equal proportion. The single-stream network employs a 9×9 receptive field for disparity estimation. According to the ablation study, the multiple receptive fields used by the parallax estimation module significantly improve network performance.

Effects of CBAM: Table 3 shows the impact of the CBAM structure on network performance in various scenarios. A 5×5 convolution kernel is placed in front of the two CBAMs of the disparity estimation module, and a 3×3 convolution operation is placed after the warped views and after the first residual block following the multi-stream network connection. A comparison of the four network architectural parameters is presented in Table 3. According to the experimental data, adding a CBAM to the LF blending module vastly improves network performance while adding only a few extra parameters.

Table 2. Comparison of PSNR/SSIM with (+) or without (-) the multi-stream structure in the MDEM and LFBM
Table 3. Comparison of PSNR/SSIM with (+) or without (-) the CBAM in the MDEM and LFBM
5. Conclusion

In this study, we introduce a CNN architecture featuring multiple receptive fields and a CBAM structure for LF angular super-resolution reconstruction. The approach leverages the input views to extract richer information using features from various receptive fields. Although our model exhibits high adaptability and can be trained without relying on real depth maps or scene graphs, we must acknowledge its limitations. The following aspects should be addressed in future research:

1) Model limitations: Our model has certain limitations. For instance, it may face challenges in scenarios with extreme variations in lighting conditions. Furthermore, the computational cost, particularly in terms of the number of model parameters, remains a concern.

2) Improvement suggestions: To enhance model performance, future studies should focus on reducing the computational overhead and improving robustness to diverse lighting conditions. In addition, further exploration of loss-function variations and potential regularization techniques should be considered to more effectively mitigate occlusion-related artifacts.

3) Future directions: This study opens new avenues for future research, including the integration of real-time depth map estimation and scene graph generation to further enhance the performance and versatility of the model. The exploration of lightweight model architectures and their effect on efficiency is also a potential direction.

In summary, the proposed model demonstrated competitive performance, as evidenced by extensive experiments on the HCI dataset. However, the above limitations must be addressed, the suggested improvements implemented, and the outlined directions explored to further advance the field of LF angular super-resolution reconstruction.

Funding

This study was jointly supported by the National Natural Science Foundation of China (Grant No. 62301320), the Natural Science Foundation of Fujian (No. 2023J011009), the Scientific Research Project of Putian Science and Technology Bureau (No. 2021G2001ptxy08), and the Colleges and Universities in Hebei Province Science and Technology Research Project (No. ZC2021006).

Biography

Wan Liao (https://orcid.org/0000-0002-1300-3580)

He graduated from the School of Information and Mechatronics Engineering of Shanghai Normal University with a bachelor's degree in communications. He is currently studying for a master's degree in electronic information at Shanghai Normal University. His research interests include light field depth estimation and angular reconstruction.

Biography

Jiaqi Hou (https://orcid.org/0000-0002-7363-7753)

She received her bachelor's degree in electronic science and technology from the College of Electronic Engineering, Heilongjiang University. She is currently studying for a master's degree in electronic information at Shanghai Normal University. Her research interest includes super-resolution reconstruction of light field images.

Biography

Yan Zhang (https://orcid.org/0000-0001-5970-7244)

She received the Ph.D. degree in communication and information systems from Shanghai University, Shanghai, China. She has been with the faculty of the School of Computer, North China Institute of Aerospace Engineering, where she is currently an associate professor. Her major research interests include stereo video quality assessment and artificial intelligence.

Biography

Tao Yan (https://orcid.org/0000-0002-8304-8733)

He received the Ph.D. degree in communication and information systems from Shanghai University, Shanghai, China, in 2010.
He has been with the faculty of the School of Information Engineering, Putian University, where he is currently an associate professor. His major research interests include multiview high-efficiency video coding, rate control, and video codec optimization. He has authored or co-authored more than 20 refereed technical papers in international journals and conferences in the fields of video coding and image processing. He currently presides over a National Natural Science Foundation project.

References