Article Information
Corresponding Author: Wan Liao, 2232571549@qq.com
Dingkang Hua, School of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, China, hua-1998@qq.com
Qian Zhang, School of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, China, qianzhang@shnu.edu.cn
Wan Liao, School of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, China, 2232571549@qq.com
Bin Wang, School of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai, China, binwang@shnu.edu.cn
Tao Yan, School of Mechanical, Electrical & Information Engineering, Putian University, Putian, Fujian, China, yantaoshu@aliyun.com
Received: August 10, 2022
Revision received: November 28, 2022
Accepted: February 26, 2023
Published (Print): August 31, 2023
Published (Electronic): August 31, 2023
1. Introduction
In traditional photography, the camera forms pixel values by integrating light rays arriving from many directions, which discards information about where each ray originated. Light field photography overcomes this shortcoming by placing a microlens array between the main lens and the sensor, enabling dense angular sampling and the conversion of the measured data into multi-view light field images. Light field images contain depth cues about the scene, which have interesting applications in face recognition [1], autonomous driving [2], and three-dimensional (3D) reconstruction [3].
Although the depth cues of the scene are hidden in light field images, extracting depth information from them is still challenging. To visualize and characterize the light field, traditional depth estimation methods typically convert the 4D light field data into various 2D representations, such as multi-view images [4], epipolar plane images (EPIs) [5,6], and focal stacks [7], and then apply feature matching. These methods often require carefully designed optimization procedures, incur significant computational overhead and time, and therefore have limited practical value. With the successful application of convolutional neural networks to computer vision tasks, deep learning-based methods have been on the rise; they extract image features with convolutional networks and then refine them through global optimization or refinement networks, improving both the accuracy and the speed of depth estimation. Owing to benefits such as low inference time, high prediction accuracy, and a broad range of application scenarios, deep learning is rapidly replacing traditional light field depth estimation methods and has demonstrated outstanding performance on public datasets in recent years. For example, EpiNet [8], an end-to-end network trained without post-processing, achieved state-of-the-art accuracy on the HCI and CVIA-HCI datasets [9] when it was proposed and still ranks near the top of the benchmark today, which shows the strength of its design. However, the network does not consider the essential differences among its four multi-stream inputs: it concatenates the extracted features directly, ignoring the details of individual scene views, introducing redundancy in the subsequent extraction of high-level features, and diminishing prediction accuracy. Although the above methods have achieved good results for depth extraction from light field images, the complexity of natural scenes and the limited angular sampling of acquisition systems remain common problems in the depth acquisition process, and they negatively affect the overall efficiency of light field data encoding.
In this paper, we propose a new compound attention network for extracting depth maps from light field images, in order to overcome some drawbacks of previous approaches and improve on the efficiency of existing methods. The overall performance of our method, CAttNet, is shown in Fig. 1: our model achieves a good trade-off between prediction accuracy and running time.
Compared with other common light field depth estimation algorithms, CAttNet has advantages in prediction accuracy and running speed.
Our research consists of three main parts. Firstly, we exploit the epipolar plane structure of the light field image. Owing to the design of light field cameras, EPIs display directional lines that help visualize the light field: each line corresponds to the projection of a 3D scene point, and the depth of that point can be inferred by analyzing the slope of the line. To take advantage of this property and improve the correlation between views, we combine the horizontal, vertical, and diagonal views when selecting the input views during data processing, similar to EpiNet. Secondly, we reduce the redundancy between sub-aperture views. Because the baselines between the sub-aperture views of a light field image are very narrow, there is considerable redundancy between views, and using too many views increases the amount of computation and can negatively affect the predicted results. We therefore use a compound attention module [10] to weight the feature maps after the fusion of each channel, which not only focuses on the most important views but also discovers the important areas within those views, so that the network can use them more effectively for depth estimation. Thirdly, we deepen the network while keeping training effective. In previous deep learning frameworks, once the network depth reaches a certain level, gradient explosion can occur and the loss fails to converge. In this paper, we use residual modules [11] to overcome the gradient explosion problem while deepening the network.
The organizational structure of the paper is as follows: Section 2 reviews the current research status of light field depth estimation. Section 3 describes our proposed approach in detail. Sections 4 and 5 present simulations and analyses of actual data to illustrate the practical value of the method. The last section summarizes the paper and discusses future research.
2. Related Work
This section reviews light field depth estimation methods from two perspectives: optimization-based methods and deep learning-based methods.
2.1 Optimization-based Methods
Light field depth estimation methods emerged from depth estimation for 2D images, mostly utilizing multi-view matching, refocusing, or epipolar plane images. In terms of multi-view matching, Yu et al. [12] studied geometric relations in baseline space to improve the stereo matching of light fields, whereas Heber and Pock [13] used a new PCA matching condition to align sub-aperture images (SAIs). Recently, Chen et al. [14] proposed a depth estimation framework that regularizes initial label confidence maps with edge strength weights. Tao et al. [15] used a contrast-based metric to find the optimal parallax for refocusing, based on the fact that defocusing diminishes visual sharpness and contrast. Zhou et al. [7] proposed FocalStackNet, which learns deep semantic features and local structure information of the light field from a focal stack to obtain the depth map. EPIs are 2D slices of the light field along one angular and one spatial direction. Wanner and Goldluecke [16] used a structure tensor to calculate the slope of each line of the vertical and horizontal EPIs, turning light field depth estimation into the estimation of line slopes. Zhang et al. [17] used a spinning parallelogram operator (SPO) to find matching lines in EPIs to solve the depth estimation problem in occluded scenes. Li et al. [18] introduced a new tensor Kullback-Leibler divergence (KLD) to calculate depth by combining EPI tensors in the vertical and horizontal directions. However, these methods consume considerable computing resources, require long running times, and are difficult to apply to different datasets.
2.2 Deep Learning-based Methods
Recently, an increasing number of researchers have begun to use deep learning methods to obtain depth from light fields. Shin et al. [8] proposed an end-to-end network, EpiNet, which augmented the dataset and took horizontal, vertical, and diagonal view stacks as input. Tsai et al. [19] used channel attention to selectively weight the contribution of each input view and thus reduce data redundancy. Li et al. [20] proposed MANet, a multi-scale aggregation network with fewer parameters and faster running speed. Li et al. [21] used LLF-Net, a lightweight convolutional network trained end-to-end, to estimate depth for wide-baseline light fields, and it also performs well on narrow-baseline light fields. These methods improved prediction results to a large extent, but most of them focus on the number of selected views while ignoring the relationships between views, and the large amount of data redundancy interferes with the accuracy of the predicted results. In this paper, we adopt a compound attention approach that effectively exploits the relevance between views and completes the depth estimation process efficiently.
3. Proposed Method
In this work, we construct a compound attention network (CAttNet) for depth estimation from the light field.
A 4D light field is a collection of multiple views, which can be represented by the biplanar parameterization L(x,y,u,v). The relation between a point in the central view and the corresponding point in an edge view is described as follows:

$$L(x, y, u, v) = L\big(x + d(x, y)\,(u^{*} - u),\; y + d(x, y)\,(v^{*} - v),\; u^{*}, v^{*}\big)$$
where (x,y) are the spatial coordinates and (u,v) are the angular coordinates. In this paper, we take N = 9. $(u^*, v^*)$ denotes the coordinates of the central view, which is numbered 40 in the $N \times N$ view array (views are numbered from 00 to 80 in this article). d(x,y) denotes the parallax between the central view pixel (x,y) and the corresponding pixel of an adjacent view. To estimate the depth of the central view, one must determine the actual offset from the corresponding point in each edge view.
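As an illustration of this parameterization, the following sketch warps a single sub-aperture view toward the central view given the central-view disparity map. The function name `warp_to_center`, the bilinear interpolation via `scipy.ndimage.map_coordinates`, and the sign convention are illustrative assumptions, not part of our implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_to_center(view, disparity, du, dv):
    """Warp the sub-aperture view at angular offset (du, dv) = (u - u*, v - v*)
    toward the central view, using the central-view disparity map d(x, y).

    view, disparity: (H, W) arrays. Sign conventions differ between datasets.
    """
    h, w = view.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sample the edge view at the positions predicted by the disparity.
    coords = [ys + disparity * dv, xs + disparity * du]
    return map_coordinates(view, coords, order=1, mode="nearest")
```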
The structure of the CAttNet is shown in Fig. 2. The network consists of three parts: a processing layer, a weighted layer, and an abstraction layer, which are detailed in the following sections.
Structure of the CAttNet.
3.1 Processing Layer
Studies of light field depth estimation show that using all views of the light field as input can yield somewhat more accurate depth estimates, but the computation becomes more than ten times slower. We therefore adopt a four-stream input based on the EPI characteristics to reduce the amount of input data. Firstly, the horizontal, vertical, and diagonal view streams from the 9×9 view array are stacked and concatenated, and then fed successively into a three-stage Conv-SeLU-Conv-BN-SeLU block structure. The convolution layers extract image features. Batch normalization (BN) addresses changes in the data distribution of intermediate layers during training, preventing gradients from vanishing or exploding and accelerating training. We use the SeLU activation function [22] because it is a regularization scheme based on the activation function with self-normalizing properties: it converges even in the presence of noise, reduces the amount of computation, improves the sparsity of the network, reduces the interdependence of parameters, and prevents gradient vanishing. The four branch channels are processed separately without sharing parameters, and their outputs are then concatenated. To adapt to the narrow baseline of the light field, the spatial kernel size of the convolution filters is set to 2×2, the stride to 1, and the batch size to 16. The SeLU activation function is calculated as follows:
$$\operatorname{selu}(x) = \lambda_{\text{selu}} \begin{cases} x, & x > 0 \\ \alpha_{\text{selu}}\left(e^{x} - 1\right), & x \le 0 \end{cases}$$

where the fixed values of $\alpha_{\text{selu}}$ and $\lambda_{\text{selu}}$ have been calculated to be approximately 1.6733 and 1.0507, respectively.
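A minimal Keras sketch of one branch of the processing layer is given below. The filter count (70) and the branch function name are assumptions for illustration; only the kernel size (2×2), stride (1), and the use of SeLU and BN follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def epi_stream(stacked_views, name):
    """One branch of the processing layer (Conv-SeLU-Conv-BN-SeLU blocks).

    stacked_views: a (H, W, 3 * n_views) tensor holding one view stream
    (horizontal, vertical, or one diagonal), stacked along the channel axis.
    """
    x = stacked_views
    for i in range(3):  # three blocks, following the description above
        x = layers.Conv2D(70, kernel_size=2, strides=1, name=f"{name}_conv{i}a")(x)
        x = layers.Activation("selu")(x)
        x = layers.Conv2D(70, kernel_size=2, strides=1, name=f"{name}_conv{i}b")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("selu")(x)
    return x

# The four branches are processed with separate weights and their outputs are
# concatenated along the channel axis, e.g.:
# features = layers.Concatenate()([epi_stream(s, n) for s, n in zip(streams, names)])
```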
3.2 Weighted Layer
To obtain depth information, it is important to extract effective features from the image. As mentioned above, a light field image contains a large number of views, and the SAIs provide a great deal of information but also contain a large amount of redundancy. We therefore want our model to focus, during training, on the views that provide the most useful information as well as on the important areas within those views. We introduce a compound attention module to weight the feature maps, which reduces the training time and improves the running efficiency of the model. It is composed of a channel attention module and a spatial attention module; this compound attention module optimizes our method stably and reliably without significantly changing the number of parameters.
The feature maps from the four branch channels are weighted to determine which views are significant. Firstly, the spatial dimensions of the shallow features are compressed so as not to affect the channel dimension, which is achieved by max pooling and average pooling. The pooled features are then forwarded to a shared multilayer perceptron (MLP). Finally, the resulting feature vectors are summed and a sigmoid function is used to construct the channel attention weights. The channel attention weight is calculated as follows:
$$M_{c}(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_{1}(W_{0}(F^{c}_{\text{avg}})) + W_{1}(W_{0}(F^{c}_{\max}))\big)$$

where $\sigma(\cdot)$ denotes the sigmoid activation function, $F^{c}_{\text{avg}}$ and $F^{c}_{\max}$ denote the average-pooled and max-pooled features, and $W_{0}, W_{1}$ are the weights of the MLP, with $W_{0} \in \mathbb{R}^{C/r \times C}$ and $W_{1} \in \mathbb{R}^{C \times C/r}$.
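A sketch of the channel attention weighting in Keras is shown below; the reduction ratio r = 16 and the ReLU inside the shared MLP are assumed values that follow the usual design of such modules rather than our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(feature_map, reduction=16):
    """Weight the channels of `feature_map` (B, H, W, C) by M_c(F)."""
    channels = feature_map.shape[-1]
    shared_mlp = tf.keras.Sequential([
        layers.Dense(channels // reduction, activation="relu"),  # W0
        layers.Dense(channels),                                  # W1
    ])
    avg_pool = layers.GlobalAveragePooling2D()(feature_map)      # F_avg^c
    max_pool = layers.GlobalMaxPooling2D()(feature_map)          # F_max^c
    weights = tf.sigmoid(shared_mlp(avg_pool) + shared_mlp(max_pool))
    weights = layers.Reshape((1, 1, channels))(weights)
    return feature_map * weights                                 # F' = M_c(F) * F
```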
Spatial attention selects the key view regions using the weighted features obtained by multiplying the channel attention weights with the feature maps concatenated from our four branch channels. First, max pooling and average pooling are applied along the channel dimension; the pooled maps are then concatenated and passed through a convolution layer to generate the spatial attention map. The spatial attention weight is calculated as follows:
$$M_{s}(F') = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')])\big)$$

where $F'$ denotes the feature map weighted by channel attention, $\sigma(\cdot)$ denotes the sigmoid activation function, and $f^{7 \times 7}(\cdot)$ denotes a convolution with a 7×7 kernel.
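The spatial attention weighting can be sketched in the same way; the channel-wise pooling and the 7×7 convolution follow the formula above, while the remaining details (padding, function name) are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(weighted_features):
    """Weight the spatial positions of the channel-weighted features F'."""
    avg_pool = tf.reduce_mean(weighted_features, axis=-1, keepdims=True)  # AvgPool(F')
    max_pool = tf.reduce_max(weighted_features, axis=-1, keepdims=True)   # MaxPool(F')
    pooled = tf.concat([avg_pool, max_pool], axis=-1)                     # (B, H, W, 2)
    attention = layers.Conv2D(1, kernel_size=7, padding="same",
                              activation="sigmoid")(pooled)               # sigma(f^{7x7}(.))
    return weighted_features * attention
```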
As previously noted, the compound attention not only reduces data redundancy but also effectively reflects the importance of each view and view region in the parallax estimation step. Because the spatial distances and angles differ, different views and view regions do not contribute equally; some repetitive and redundant information in the light field image should be discarded so that the estimation module can focus on the important views and regions, obtain effective information, and improve the overall accuracy and efficiency of the network.
3.3 Abstraction Layer
In computer vision, the level of the extracted features grows as the network depth increases. Theoretically, deepening the network can achieve better results, but in practical training the network often exhibits gradient explosion or model degradation, which worsens the experimental results. To avoid the negative impact of network deepening, a residual structure can be used to reinject into the network some of the feature information lost in the preceding convolution layers. To improve the prediction results, this paper uses a series of residual modules to learn the relationships between features after the feature maps are weighted, allowing the network to extract deep features as effectively as shallow ones. As shown in Fig. 2, a total of four residual modules are used in the abstraction layer. Each residual block has two branches: one directly connects Conv, SeLU, and BN layers with a 2×2 convolution kernel for further feature extraction, and the other is the residual mapping branch with a Conv-BN structure, whose convolution kernel size is set to 5×5 to ensure that the feature maps have the same size when they are combined. The residual module is calculated as follows:
$$y = W_{2}\,\sigma(W_{1} x) + W_{s} x$$

where x and y represent the input and output, respectively, $W_{1}$ and $W_{2}$ represent the two convolution blocks, $\sigma(\cdot)$ stands for the ReLU activation function, and $W_{s}$ is used to match the dimensional difference between the input and the residual mapping to be learned.
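A residual block of the abstraction layer can be sketched as follows. The kernel sizes here (two 2×2 convolutions in the main branch and a 3×3 convolution in the shortcut) are chosen only so that the shapes of this sketch align under 'valid' padding; the configuration described above uses a 5×5 shortcut kernel sized to match its own main branch, and the filter count is assumed.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=70):
    """y = F(x) + W_s x, with a Conv-SeLU-BN main branch and a Conv-BN shortcut."""
    # Main branch: two 'valid' 2x2 convolutions shrink H and W by 2 in total.
    main = layers.Conv2D(filters, kernel_size=2)(x)
    main = layers.Activation("selu")(main)
    main = layers.BatchNormalization()(main)
    main = layers.Conv2D(filters, kernel_size=2)(main)
    main = layers.BatchNormalization()(main)
    # Shortcut (W_s): a single 'valid' 3x3 Conv-BN also shrinks H and W by 2,
    # so the two branches can be added element-wise.
    shortcut = layers.Conv2D(filters, kernel_size=3)(x)
    shortcut = layers.BatchNormalization()(shortcut)
    return layers.Add()([main, shortcut])
```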
Relation between the depth and parallax of point P in the light field.
After the residual modules, we apply a Conv-ReLU-Conv prediction module with a 2×2 convolution kernel to obtain the final parallax value. According to the geometric relationship between parallax and depth in light field imaging, the depth can then be calculated quickly to obtain the required depth image. The relation between the depth of a point P in the light field and its parallax is shown in Fig. 3. Parallax and depth are related as follows:
$$Z = \frac{f \cdot \Delta u}{\Delta x}$$

where Z represents the depth value, f represents the fixed focal length of the light field camera, $\Delta x$ represents the parallax between different viewing angles, and $\Delta u$ is the distance between two adjacent microlenses of the light field camera.
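Assuming the geometry of Fig. 3, the conversion from parallax to depth is a one-line computation. The helper below is a sketch; the clipping of very small parallax values is an assumption added only to avoid division by zero.

```python
import numpy as np

def disparity_to_depth(parallax, focal_length, lens_pitch, eps=1e-6):
    """Z = f * delta_u / delta_x for an (H, W) parallax map.

    focal_length: f, the fixed focal length of the light field camera.
    lens_pitch: delta_u, the distance between two adjacent microlenses.
    Assumes positive parallax; values below eps are clipped.
    """
    return focal_length * lens_pitch / np.clip(parallax, eps, None)
```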
4. Experiments
4.1 Operation Settings
PyCharm was used as the software environment in the experiment, TensorFlow was used as the training backend, and the Keras library was used to build the network. RMSprop was selected as the optimizer, the batch size was set to 16, and the network was trained for 600 epochs with 10,000 mini-batches per epoch. The learning rate was set to $10^{-4}$ for the first 400 epochs and then reduced to $10^{-5}$ to improve accuracy. Mean absolute error (MAE), which is robust to outliers, is used as the training loss, while mean squared error (MSE) and the bad pixel ratio (BP) are used as evaluation criteria.
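The training schedule described above can be expressed with a standard Keras learning-rate callback; the objects below would be passed to model.compile and model.fit of the network built in Section 3, and the exact callback arrangement is an assumption rather than our released training script.

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # 1e-4 for the first 400 epochs, then 1e-5 (Section 4.1).
    return 1e-4 if epoch < 400 else 1e-5

optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4)
lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# Used as: model.compile(optimizer=optimizer, loss="mae")
#          model.fit(..., batch_size=16, epochs=600, callbacks=[lr_callback])
```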
4.2 Datasets
The 4D Light Field dataset is frequently used as a benchmark for evaluating parallax estimation methods on light field images, making it possible to compare a variety of approaches. It has 24 scenes divided into four categories: stratified, test, training, and additional. The light field resolution is 9×9×512×512×3.
4.3 Experimental Comparison
In the training of our model, MAE is used as the evaluation standard to judge the error between the model output and the ground truth, since it reflects the magnitude of the prediction error well: the smaller the MAE value, the smaller the error. MAE is calculated as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{x} \left| d(x) - gt(x) \right|$$
To compare the different methods, MSE and BP are selected to evaluate the accuracy of the depth estimation. The smaller the values of these two metrics, the better the algorithm's performance. MSE and BP are calculated as follows:

$$\mathrm{MSE} = \frac{1}{N} \sum_{x} \left( d(x) - gt(x) \right)^{2}$$

$$\mathrm{BP} = \frac{1}{N} \sum_{x} \mathbb{1}\left( \left| d(x) - gt(x) \right| > t \right)$$
where d(x) is the estimated parallax map, gt(x) is the ground truth, N is the total number of pixels, $\mathbb{1}(\cdot)$ is the indicator function, and t is the bad-pixel threshold, which we set to 0.07.
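For reference, the three metrics can be computed directly from the predicted and ground-truth disparity maps; this numpy sketch assumes both maps are (H, W) arrays and that BP is reported as a percentage.

```python
import numpy as np

def mae(d, gt):
    return np.mean(np.abs(d - gt))

def mse_x100(d, gt):
    # Tables 1 and 2 report MSE multiplied by 100.
    return np.mean((d - gt) ** 2) * 100

def bad_pixel(d, gt, t=0.07):
    # Percentage of pixels whose absolute disparity error exceeds t.
    return np.mean(np.abs(d - gt) > t) * 100
```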
To further verify the accuracy of our method, we also use peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), two common image quality evaluation indicators. The higher the values of these two criteria, the better the algorithm's performance. PSNR and SSIM are calculated as follows:

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}$$

$$\mathrm{SSIM}(x, y) = \frac{\left( 2\mu_{x}\mu_{y} + c_{1} \right)\left( 2\sigma_{x, y} + c_{2} \right)}{\left( \mu_{x}^{2} + \mu_{y}^{2} + c_{1} \right)\left( \sigma_{x}^{2} + \sigma_{y}^{2} + c_{2} \right)}$$
where MAX represents the maximum possible pixel value of the image, x and y represent the depth map and the ground truth of the scene, respectively, $\mu_{x}$ and $\mu_{y}$ are the means of the images, $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are the variances of the images, $\sigma_{x, y}$ is the covariance of x and y, and $c_{1}$ and $c_{2}$ are small fixed constants.
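PSNR and SSIM can be computed with scikit-image as sketched below; choosing the data_range from the ground-truth disparity range is an assumption that depends on how the maps are scaled.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_quality(pred, gt):
    """pred, gt: (H, W) float arrays (estimated depth map and ground truth)."""
    data_range = float(gt.max() - gt.min())
    psnr = peak_signal_noise_ratio(gt, pred, data_range=data_range)
    ssim = structural_similarity(gt, pred, data_range=data_range)
    return psnr, ssim
```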
4.4 Experimental Results
We use the 4D Light Field Benchmark dataset to measure the performance of our model. The dataset consists of 28 light field scenes divided into four subsets: stratified, training, test, and additional. The image resolution is 512×512, and each scene contains 9×9 viewpoints. We use the 16 additional light field scenes to train our proposed network. We use the MSE and BP of the light field images to assess the quantitative performance of depth estimation. For the qualitative results of light field depth estimation, we use the eight light field scenes shown in the images as a comparison; from top to bottom, the scenes are "backgammon," "dots," "pyramids," "stripes," "boxes," "cotton," "dino," and "sideboard." We found that our results reproduce the details of the scene objects well. The magnified feature details are shown in Fig. 4.
Comparison of magnified local results: (a) initial scenes, (b) ground truth, (c) EpiNet, and (d) ours.
We compared our model with the following representative models: OBER-cross [23], SPO-MO [17], MANet [20], EPN+OS+GC [24], and EpiNet [8]. The depth maps obtained from the light field images and the corresponding BP diagrams are shown in Fig. 5. The green areas represent good pixels, and the red areas represent bad pixels.
Tables 1 and 2 give a quantitative comparison of the MSE (multiplied by 100) and the runtime in seconds for each method in different scenes; the best result in each scene is highlighted in bold. The experimental data show that our method performs well in most scenes. In terms of runtime, our method is only about 0.5 seconds slower than EpiNet after adding the modules, indicating excellent practicability.
From left to right, algorithms for each scene are (a) ground truth, (b) OBER-cross, (c) SPO-MO, (d) MANet, (e) EPN+OS+GC, (f) EpiNet, and (g) ours.
MSE results of the estimated depth on light field datasets
The bold font indicates the best performance in each test.
Runtime comparison of depth estimation algorithm on light field datasets
The bold font indicates the best performance in each test.
The experimental data show that both our method and EpiNet perform well in light field scenes. To examine the accuracy of these two techniques more deeply, we use additional image quality evaluation indicators to make the experimental evaluation more complete and systematic. The quantitative comparison of PSNR and SSIM for the two methods in different scenes is shown in Figs. 6 and 7, where the green line represents EpiNet and the red line represents our method. It can be concluded that the average performance of our method is better than that of EpiNet.
To investigate the convergence of our method on the training dataset, we observed how the BP and MSE indexes change with the number of iterations; the results are shown in Fig. 8. The red curve represents the MSE index and the blue curve represents the BP index.
PSNR results of the estimated depth on light field datasets.
SSIM results of the estimated depth on light field datasets.
Network training curve of MSE and BP changing with the number of iterations.
5. Ablation Experiments
To demonstrate the contribution of each element of the proposed model architecture, we conducted ablation experiments. The HCI dataset is used to train the proposed network.
To verify the role of each module in the whole network, we successively added the attention block and the Resblock and trained each model for the same number of iterations. As a visual comparison, we used the scene "Origami," as shown in Fig. 9. Table 3 reports the performance, measured by the average MSE value, after each module is added; for each configuration, the table indicates whether the attention block and the Resblock are included.
As can be seen from Fig. 9 and Table 3, although there are still some inconsistencies in the background, particularly on smooth and continuous planes, the attention module is effective at attending to scene details. The Resblock reduces this error by deepening the network and improving accuracy. Combining the two further improves the rendering of scene details while reducing errors. The method can thus better analyze edge details and angular information throughout the depth estimation process, improving the accuracy of the predicted depth map, minimizing the error, and saving time.
Effect of attention and Resblock on the accuracy of depth estimation under the same training times.
Impact of network components on performance
The symbols in the table indicate whether the corresponding module is added or not.
6. Conclusion
We propose a new network for depth estimation that exploits the geometry of light field images and introduces compound attention, which effectively improves the accuracy of the predicted depth maps. The 4D Light Field Benchmark datasets are used to evaluate the performance of our network, and the experimental results show that CAttNet performs well on scene details and occlusions, with an MSE roughly 18% lower than that of EpiNet while its running time is only slightly longer. In the future, we intend to continue improving our network so that it can perform better on smooth backgrounds.