
Non-local affinity adaptive acceleration propagation network for generating dense depth maps from LiDAR

Open Access

Abstract

Depth completion aims to generate dense depth maps from the sparse depth images generated by LiDAR. In this paper, we propose a non-local affinity adaptive accelerated (NL-3A) propagation network for depth completion to solve the mixing depth problem of different objects on the depth boundary. In the network, we design the NL-3A prediction layer to predict the initial dense depth maps and their reliability, non-local neighbors and affinities of each pixel, and learnable normalization factors. Compared with the traditional fixed-neighbor affinity refinement scheme, the non-local neighbors predicted by the network can overcome the propagation error problem of mixed depth objects. Subsequently, we combine the learnable normalized propagation of non-local neighbor affinity with pixel depth reliability in the NL-3A propagation layer, so that it can adaptively adjust the propagation weight of each neighbor during the propagation process, which enhances the robustness of the network. Finally, we design an accelerated propagation model. This model enables parallel propagation of all neighbor affinities and improves the efficiency of refining dense depth maps. Experiments on KITTI depth completion and NYU Depth V2 datasets show that our network is superior to most algorithms in terms of accuracy and efficiency of depth completion. In particular, we predict and reconstruct more smoothly and consistently at the pixel edges of different objects.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In recent years, depth estimation has become a popular research topic with the rapid development of indoor and outdoor mobile robot sensing and applications [1]. Sensors such as RGB-D cameras and LiDAR are usually used to obtain reliable depth measurements [2–4]. However, these sensors are limited by their active sensing principle and can only provide sparse depth measurements [5–8]. Dense depth maps are essential for robots in autonomous navigation and obstacle avoidance tasks. Although some visual or LiDAR SLAM (simultaneous localization and mapping) methods can build dense depth maps [9–13], these methods cannot scale to large scenes, which limits their applicability in realistic scenarios. To overcome the above problems and obtain dense depth maps, we investigate the estimation of dense depth information based on the obtained sparse depth, i.e., depth completion methods [14–16].

Traditional depth completion methods, which rely only on the sparse depth measurements generated by LiDAR, usually show artifacts in the completed dense depth maps, resulting in a loss of accuracy [17,18]. Since RGB images provide color and texture details, recent methods have used RGB images to guide depth completion [19,20]. Compared with traditional methods, these methods achieve good prediction accuracy. However, in complex environments with mixed depths, the predicted dense depth maps still appear blurred at mixed depth boundaries. Affinity-based spatial propagation methods mitigate this phenomenon by learning the affinities of local neighbors and using them to refine the predicted dense depth maps [21].

Affinity-based spatial propagation methods can be classified into local affinity propagation and non-local affinity propagation. Local affinity propagation uses a fixed local neighborhood configuration, such as a 3 × 3 neighborhood, a 5 × 5 neighborhood, or an n × n neighborhood that can be expanded and shrunk autonomously. Such local affinity propagation often propagates information that is unrelated to the reference pixel, especially at depth boundaries. Non-local affinity propagation avoids propagating information irrelevant to the reference pixel by predicting the non-local neighbors of each pixel, even when the irrelevant information lies within the local neighborhood. However, this method is time-consuming because it requires predicting the non-local neighbors and aggregating the relevant information using spatially varying affinities. Moreover, a reliability measure is desirable because it indicates the confidence of the predicted depth and can likewise be used to guide the depth completion method in predicting dense depth [22].

To address the above issues, we propose a non-local 3A (affinity, adaptation, acceleration) propagation network framework for generating dense depth maps from LiDAR. The main contributions of this work are as follows:

  • (1) In the network, we design a 3A prediction layer to obtain the initial predicted depth and its confidence, the non-local neighbors of each pixel, the affinities, and learnable normalization factors. We replace the traditional fixed affinity normalization parameter with a more reasonable normalization parameter that is learned and output by the network itself. This approach ensures more accurate affinity estimation, improves the accuracy of non-local neighbor propagation, and provides the basis for subsequent, more accurate refinement of the predicted depth maps.
  • (2) In the network, we propose a 3A propagation layer to refine the predicted dense depth maps faster and more accurately. We devise an affinity propagation method that makes the propagation of each non-local neighbor truly parallel, which greatly speeds up the propagation process. To further improve the robustness of the network and the resistance to the outliers of the input prediction depths, we design the propagation process as an adaptive propagation process. Specifically, we incorporate the reliability of the predicted depth into the propagation process to adaptively adjust the weight of each non-local neighbor depth value in the propagation process.
  • (3) Experimental results on the KITTI depth completion dataset and the NYU Depth V2 dataset show that our method achieves superior depth completion performance compared to state-of-the-art methods. Also, we further discuss the network parameter sizes and the depth completion time for each image frame. These demonstrate that our network reduces the network parameters and speeds up the depth completion time while ensuring depth completion accuracy.

2. Related work

2.1 Depth estimation and completion

Depth estimation and completion refer to obtaining dense depth maps from monocular images, stereo images, or sparse point cloud data collected by sensors. Li et al. proposed a method to jointly learn depth, ego-motion, and 3D dense object motion maps from monocular videos only [23]. This method does not rely on labeled data during training, but because it is limited to monocular images, its depth estimation accuracy is not high. Lu et al. proposed a semantically guided two-branch network to generate dense depth maps from single-line LiDAR and aligned RGB images [24]. Imran et al. proposed a multi-hypothesis depth representation [25]. This representation explicitly models the foreground and background depths in difficult occlusion boundary regions, and the extrapolated surfaces are fused into a single depth image using image data. Qiu et al. proposed a network that estimates surface normals as an intermediate representation to produce dense depths and can be trained end-to-end [26]. The network also predicts a confidence mask to handle the mixed LiDAR signal caused by occlusion near the foreground boundary, and combines the estimates from the color image and surface normal branches with learned attention maps to improve depth accuracy. Zhao et al. adopted graph propagation to capture the observed spatial context information [27]. Specifically, they first constructed multiple graphs at different scales from the observed pixels and then exploited the multimodal information of the input data for multimodal representation. In addition, they proposed a symmetric gated fusion strategy to efficiently utilize the extracted multimodal features. Tang et al. designed a new guided network to predict kernel weights from the guidance images [28]. These predicted kernels are then applied to extract depth image features. In this way, the network generates content-dependent, spatially varying kernels for multimodal feature fusion. Liu et al. formulated depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage [29]. Specifically, in the coarse-to-fine stage, more representative features are extracted from the color images and coarse depth maps using a channel-shuffle extraction operation, and an energy-based fusion operation is used to efficiently fuse these features, resulting in more accurate and finer depth maps. Eldesokey et al. focused on making the network lightweight enough to be deployed in embedded systems. They investigated fusion strategies that combine depth and RGB information in a normalized convolutional network framework and introduced output confidence as auxiliary information to improve the results [30]. Teixeira et al. proposed a depth and uncertainty estimation method based on [30] to better cope with challenges such as large variations in viewpoint and depth and limited computational resources [31]. Nazir et al. [32] proposed SemAttNet, an attention-based semantic-aware guided depth completion network. The network mainly consists of color-guided, semantic-guided, and depth-guided branches and uses the SAMMFAB module to fuse the features between the three branches. Although the above algorithms demonstrate good performance, the accuracy of depth estimation in mixed-depth environments still needs to be improved.

2.2 Local affinity propagation network

Liu et al. proposed a spatial propagation network with affinity matrices to improve depth estimation results in mixed depth environments [33]. They demonstrated that the model can learn semantically aware affinity values for high-level vision tasks. However, it propagates column by column and row by row, which is very inefficient. To improve efficiency, Cheng et al. proposed a convolutional spatial propagation network (CSPN) and applied it to depth completion [34]. Specifically, they predicted the affinity values of fixed local neighbors and updated all pixels of the image and their fixed local neighbor information simultaneously. Xu et al. proposed a deformable spatial propagation network (DSPN) based on CSPN to adaptively generate different receptive fields and affinity matrices for each pixel [35]. Cheng et al. then addressed the limitation of fixed local neighborhood affinity propagation in CSPN and proposed a context- and resource-aware convolutional spatial propagation network (CSPN++) for depth completion, which further improves effectiveness and efficiency by learning adaptive convolutional kernel sizes and the number of propagation iterations [36]. In addition, SemAttNet further refines the predicted dense depth maps with the CSPN++ method after fusing the features of the three branches introduced in Section 2.1, which improves the original accuracy. However, all of the above methods belong to local affinity propagation. Since local neighborhood pixels may come from different objects, local affinity propagation leads to the mixing of depth values during propagation, which degrades depth completion results at mixed depth boundaries.

2.3 Non-local affinity propagation network

Non-local information has been applied in various vision tasks. Wang et al. used non-local operations as a generic building block for capturing long-range dependencies, which has been inserted into many computer vision architectures [37]. Shim et al. proposed a new end-to-end trainable reference feature extraction module, a similarity search and extraction network with similarity-aware deformable convolution [38]. Park et al. applied a non-local affinity propagation network to depth completion and also proposed an affinity normalization method to better learn the combinations between affinities. However, this method is inefficient and time-consuming in its propagation [39]. The DySPN method proposed by Lin et al. utilizes a nonlinear propagation model. While it incorporates three variants aimed at reducing the required number of neighbors, it still falls short of optimal affinity propagation because it cannot effectively learn and identify the most closely related neighbors for each pixel's depth [40]. Unlike these earlier methods, our network can more efficiently and accurately select the most closely related non-local neighbors for propagation through the 3A prediction and propagation layers, while quickly refining the predicted dense depth, especially sharpening depth estimation at mixed depth boundaries.

3. Methodology

In this paper, we propose a non-local 3A propagation network for generating dense depth maps from LiDAR. We introduce the overall framework of the depth completion network in Section 3.1. Subsequently, the necessity of non-local neighbor selection and how to perform affinity normalization more rationally in the NL-3A prediction layer in the network are specified in Section 3.2. Next, in Section 3.3, affinity adaptive accelerated propagation is performed based on the input non-local neighbors, affinities, learnable normalization factors, and reliability. The NL-3A propagation layer further refines the predicted dense depth maps to obtain satisfactory accuracy while using fewer network parameters and consuming less time. This is the highlight of our approach. Finally, we present the loss functions used in the training process to guide the direction of the network training.

3.1 Depth completion network architecture

In this section, we describe the depth completion network framework in detail. The network framework is shown in Fig. 1 and consists of three main parts: the NL-3A extraction layer, the NL-3A prediction layer, and the NL-3A propagation layer. The NL-3A extraction layer first extracts high-level features from the RGB images and sparse depth maps and concatenates them. An encoder-decoder structure based on the ResNet-34 network [41] is then used to extract low-level features and, together with the high-level features, produces the feature sets needed by the NL-3A prediction layer. Within this structure, the 3 × 3 convolution kernel is employed because of its advantages: it effectively reduces the number of network parameters and enhances network performance. Specifically, for the same receptive field, 3 × 3 convolution kernels require significantly fewer parameters than 5 × 5 and 7 × 7 kernels, which reduces the model's complexity and accelerates training. The NL-3A prediction layer predicts the depth maps and their reliability, the non-local neighbors of each pixel, the affinities, and learnable normalization factors based on the passed feature sets, and passes these five pieces of information to the NL-3A propagation layer. The NL-3A propagation layer quickly refines the dense depth maps predicted by the prediction layer through efficient deformable-convolution computation and finally outputs dense depth maps with high accuracy.
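As a rough illustration of this saving (the channel width C = 64 below is an assumed example value, not a figure from our network), three stacked 3 × 3 convolutions cover the same 7 × 7 receptive field as a single 7 × 7 convolution with roughly 45% fewer weights:

```python
# Weight count (biases ignored) for covering a 7 x 7 receptive field with
# C input and C output channels; C = 64 is an assumed example width.
C = 64
single_7x7 = 7 * 7 * C * C           # one 7 x 7 convolution: 200,704 weights
stacked_3x3 = 3 * (3 * 3 * C * C)    # three stacked 3 x 3 convolutions: 110,592 weights

print(stacked_3x3 / single_7x7)      # ~0.55, i.e. roughly 45% fewer parameters
```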


Fig. 1. Depth completion network framework diagram. The inputs are RGB images and sparse depth maps. The outputs are the depth-completed dense depth maps. The whole network is divided into the NL-3A extraction layer, NL-3A prediction layer, and NL-3A propagation layer.


3.2 NL-3A prediction layer

We first introduce the propagation model of affinity between pixel neighbors. The propagation model propagates the observed depth values of similar neighbors to estimate missing depth values and to refine less reliable depth values in the prediction. ${\mathbf X} = ({x_{m,n}}) \in {{\mathbb R}^{M \times N}}$ denotes the 2D depth map to be updated by affinity propagation, where ${x_{m,n}}$ denotes the pixel value at $(m,n)$. After the t-th propagation step, ${x_{m,n}}$ can be expressed in terms of the values at step t-1 as:

$$x_{m,n}^t = (1 - \mathop \sum \limits_{(i,j) \in {N_{m,n}}} \omega _{m,n}^{i,j})x_{m,n}^{t - 1} + \mathop \sum \limits_{(i,j) \in {N_{m,n}}} \omega _{m,n}^{i,j}x_{i,j}^{t - 1}$$
where, $(i,j)$ is the pixel coordinate of the neighbor of the pixel, and $\omega _{m,n}^{i,j}$ denotes the affinity of two pixels between $(m,n)$ and $(i,j)$. As shown in Eq. (1), the whole affinity propagation model is divided into two parts: the propagation of the reference pixel and the propagation of the neighboring pixels with corresponding affinity weights.
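For clarity, a minimal per-pixel sketch of one propagation step of Eq. (1) is given below; the neighbor lists and normalized affinities are assumed to be given, and this naive loop form is only for exposition (Section 3.3 replaces it with a parallel tensor form):

```python
import torch

def propagate_step_naive(x, neighbors, affinity):
    """One propagation step of Eq. (1), written per pixel for readability.

    x         : (H, W) depth map from the previous step
    neighbors : dict mapping (m, n) -> list of neighbor coordinates (i, j)
    affinity  : dict mapping ((m, n), (i, j)) -> normalized affinity weight
    """
    x_new = x.clone()
    H, W = x.shape
    for m in range(H):
        for n in range(W):
            w_sum, acc = 0.0, 0.0
            for (i, j) in neighbors[(m, n)]:
                w = affinity[((m, n), (i, j))]
                w_sum += w
                acc += w * x[i, j]
            # Reference-pixel term plus weighted neighbor contributions.
            x_new[m, n] = (1.0 - w_sum) * x[m, n] + acc
    return x_new
```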

The next focus is to determine the neighbors of each pixel. As shown in Fig. 2, the choice of pixel neighbors is mainly divided into local neighbors with a fixed 3 × 3 range (see Fig. 2(a)), local neighbors with an expandable neighbor range (3 × 3/5 × 5/7 × 7) (see Fig. 2(b)), and the non-local neighbors adopted by us (see Fig. 2(c)).


Fig. 2. Visual comparison of fixed local and non-local neighbors. (a) CSPN, (b) DSPN, and (c) Ours show examples of the neighbors selected by different methods, where purple is the reference pixel and pink are the selected neighbor pixels. Unlike the other methods, we select the most suitable neighbors at the sub-pixel level. (d): An example RGB image and dense depth map from the NYUv2 dataset. (e): The fixed local neighbor selection of CSPN and DySPN mixes the depths of foreground and background objects in the depth mixing region. (f): Our non-local neighbor selection avoids this problem by learning and picking suitable neighbors (independent of pixel distance) through the neural network.


This method of selecting neighbors with a fixed 3 × 3 range takes into account all possible propagation directions, and the sets of local neighbors are defined as follows:

$$N_{m,n}^{CS} = \{{{x_{m + p,n + q}}|p \in \{ - 1,0,1\} ,q \in \{ - 1,0,1\} ,(p,q) \ne (0,0)} \}$$

Selecting local neighbors with an expandable neighbor range makes it possible to choose neighbors over a larger area. Meanwhile, to obtain a similar receptive field while using fewer neighbors, not all pixels in the larger range are selected. The neighbor sets are defined as follows:

$$N_{m,n,k}^{DS} = \{{{x_{m + p,n + q}}|p \in \{ - 2k + 1,0,2k - 1\} ,q \in \{ - 2k + 1,0,2k - 1\} ,(p,q) \ne (0,0)} \}$$
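The two local patterns of Eqs. (2) and (3) can be written as simple offset generators; this is only a sketch for illustration, not the implementation of CSPN or DSPN:

```python
def cs_offsets():
    # Fixed 3 x 3 neighborhood of Eq. (2): the eight offsets around the reference pixel.
    return [(p, q) for p in (-1, 0, 1) for q in (-1, 0, 1) if (p, q) != (0, 0)]

def ds_offsets(k):
    # Expandable neighborhood of Eq. (3): the offset distance grows with k,
    # but the number of selected neighbors stays at eight.
    d = 2 * k - 1
    return [(p, q) for p in (-d, 0, d) for q in (-d, 0, d) if (p, q) != (0, 0)]

print(cs_offsets())   # [(-1, -1), (-1, 0), ..., (1, 1)]
print(ds_offsets(2))  # offsets at distance 3, covering a 7 x 7 area with 8 neighbors
```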

Both CSPN and DSPN can effectively pass surrounding depth information to the reference pixel for depth refinement. However, their potential problem is that they are prone to incorrect depth propagation between different objects, as shown in Fig. 2(e). Specifically, foreground and background objects in the depth blending region have very different depth values even though they are neighbors in the pixel coordinate system. If depth propagation is performed between them, the depth values will be mixed and the boundary will be blurred.

To solve the above problems, we adopt affinity propagation beyond local regions by selecting non-local neighbors. Specifically, we use a neural network to estimate the neighbors of each pixel in the non-local region from color and depth features. The sets of non-local neighbors are defined as shown in Eq. (4):

$$N_{m,n}^{NL} = \{{{x_{m + p,n + q}}|(p,q) \in {f_\phi }({\boldsymbol I},{\boldsymbol D},m,n),p,q \in {\mathbb R}} \}$$
where ${\boldsymbol I}$ and ${\boldsymbol D}$ denote the RGB images and sparse depth images, respectively. ${f_\phi }()$ denotes the network that estimates the non-local neighbors of each pixel with network parameters $\phi $. Its specific architecture is shown in Fig. 1 (NL-3A prediction layer).

As shown in Fig. 2(f), we give some examples of non-local neighbors selected by the network at mixed depth boundaries. Compared with the fixed local neighbor selection shown in Fig. 2(e), our non-local neighbor affinity propagation avoids mixing the depths of different objects, while better refining each object’s edge depth and improving the accuracy of dense depth completion. We further compare the two neighbor selection methods from qualitative and quantitative perspectives in the discussion (Section 5).
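A minimal sketch of how the non-local neighbors of Eq. (4) could be gathered at sub-pixel positions by bilinear sampling is shown below; the offset tensor layout and the use of `grid_sample` are illustrative assumptions, since in our network this step is realized inside the NL-3A layers with deformable convolution:

```python
import torch
import torch.nn.functional as F

def sample_nonlocal_neighbors(depth, offsets):
    """Gather depth values at predicted non-local, sub-pixel neighbor positions.

    depth   : (B, 1, H, W) current depth estimate
    offsets : (B, 2*K, H, W) per-pixel (row, column) offsets in pixel units,
              as predicted by f_phi; K is the number of non-local neighbors
    returns : (B, K, H, W) bilinearly interpolated neighbor depths
    """
    B, _, H, W = depth.shape
    K = offsets.shape[1] // 2
    ys, xs = torch.meshgrid(
        torch.arange(H, device=depth.device, dtype=depth.dtype),
        torch.arange(W, device=depth.device, dtype=depth.dtype),
        indexing="ij")
    samples = []
    for k in range(K):
        dp = offsets[:, 2 * k]          # row offset p
        dq = offsets[:, 2 * k + 1]      # column offset q
        # Absolute sampling positions, normalized to [-1, 1] for grid_sample.
        gx = 2.0 * (xs.unsqueeze(0) + dq) / (W - 1) - 1.0
        gy = 2.0 * (ys.unsqueeze(0) + dp) / (H - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)            # (B, H, W, 2) in (x, y) order
        samples.append(F.grid_sample(depth, grid, align_corners=True))
    return torch.cat(samples, dim=1)
```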

We need to normalize the affinities before propagating the correlated neighbors of each pixel. The purpose of affinity normalization is to ensure stability during the propagation process. A sufficient condition for stability during propagation [33] is:

$$\partial {x^t}/\partial {x^{t - 1}} \le 1$$

Accordingly, the following condition needs to be satisfied in the affinity propagation process:

$$\mathop \sum \limits_{(i,j) \in {N_{m,n}}} |{\omega_{m,n}^{i,j}} |\le 1$$

To satisfy the condition in Eq. (6), previous work [23] normalized the affinity by the sum of absolute values:

$$\omega _{m,n}^{i,j} = \hat{\omega }_{m,n}^{i,j}/\mathop \sum \limits_{(i,j) \in {N_{m,n}}} |{\hat{\omega }_{m,n}^{i,j}} |$$

The paper [39] points out that the affinity normalization by such absolute sums can bias the combination of affinities towards a narrow high-dimensional space. Therefore, it is recommended to use hyperbolic tangent functions that can reduce the bias, while adding a normalization factor to ensure stability:

$$\omega _{m,n}^{i,j} = ({e^{\hat{\omega }_{m,n}^{i,j}}} - {e^{ - \hat{\omega }_{m,n}^{i,j}}})/(\eta ({e^{\hat{\omega }_{m,n}^{i,j}}} + {e^{ - \hat{\omega }_{m,n}^{i,j}}}))$$
where $\eta $ is the normalization factor. $\eta $ is generally set empirically, based on the total number of neighbors K, the training environment, etc.

However, we instead learn the best value of $\eta $ during training. Therefore, as shown in the NL-3A prediction layer in Fig. 1, the prediction layer of the network outputs the learned normalization factor, ensuring that the subsequent affinity normalization remains stable while reducing the bias.

In addition, the prediction layer of the network outputs the initial predicted dense depth maps and the reliability of each pixel value. In Section 3.3, it will be described how the dense depth maps are adaptively and quickly refined by non-local affinity.

3.3 NL-3A propagation layer

In the affinity propagation network, the affinity describes the correlation between pixel neighbors, and information is transferred between them based on this correlation. However, this implicitly assumes that every pixel's depth is reliable. In actual depth prediction, the predicted depth values are often unreliable due to factors such as noise or unclear boundary information. Therefore, we add pixel depth reliability to adaptively adjust the weight of each pixel's depth during information transfer, reducing the influence of unreliable depth values in the propagation process. We will show in subsequent experiments that this adaptive propagation provides more accurate depth completion results. As shown in Eq. (9), we combine the adaptive weighting of pixel depth values during propagation with the affinity normalization operation. Specifically, when the correlation between pixel neighbors is large and the reliability is high, information should be transmitted with a larger weight; conversely, when the correlation is small or the reliability is low, information should be transmitted with a smaller weight.

$$\omega _{m,n}^{i,j} = {a^{i,j}}\ast ({e^{\hat{\omega }_{m,n}^{i,j}}} - {e^{ - \hat{\omega }_{m,n}^{i,j}}})/(\eta ({e^{\hat{\omega }_{m,n}^{i,j}}} + {e^{ - \hat{\omega }_{m,n}^{i,j}}}))$$
where ${a^{i,j}} \in [0,1]$ represents the reliability of the depth pixel at pixel coordinate $(i,j)$.
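A compact sketch of this adaptive normalization, i.e., Eq. (9) with a learnable $\eta$, might look as follows; the tensor shapes and the way the reliability is gathered per neighbor are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AdaptiveAffinityNorm(nn.Module):
    """Eq. (9): tanh-based affinity normalization with a learnable factor eta,
    weighted by the reliability of each neighbor's depth."""

    def __init__(self, num_neighbors=8):
        super().__init__()
        # eta is initialized to the neighbor count K, matching the fixed-eta baseline.
        self.eta = nn.Parameter(torch.tensor(float(num_neighbors)))

    def forward(self, raw_affinity, reliability):
        # raw_affinity : (B, K, H, W) unnormalized affinities omega_hat
        # reliability  : (B, K, H, W) confidence a in [0, 1] of each neighbor's depth
        # tanh(x) = (e^x - e^-x) / (e^x + e^-x), so this matches Eq. (9).
        return reliability * torch.tanh(raw_affinity) / self.eta
```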

To make the propagation of each non-local neighbor truly parallel, we design an implementation that greatly speeds up the propagation process. As shown in Eq. (1), the entire equation is defined per pixel, which we convert into tensor-level operations. Considering that there are K neighbors, we can learn K affinity maps from the network, each map representing the affinity between a particular neighbor and the reference pixels. Next, each affinity map needs to be translated along the opposite direction of the corresponding non-local neighbor to align it. Taking 8 neighbors as an example, we use 8 convolution kernels to implement these operations, as shown in Fig. 3. We denote by ${\boldsymbol T}({{\boldsymbol W}^{\boldsymbol p}},{\boldsymbol p})$ a translation operator that moves an affinity map ${{\boldsymbol W}^{\boldsymbol p}}$ along the direction $-{\boldsymbol p}$. Then the affinity propagation defined in Eq. (1) can be redefined as:

$$x_{m,n}^t = {\boldsymbol T}({{\boldsymbol W}^{\boldsymbol 0}},{\boldsymbol 0}){\boldsymbol T}(x_{m,n}^{t - 1},{\boldsymbol 0}) + \mathop \sum \limits_{(i,j) \in {\boldsymbol p} \in N} {\boldsymbol T}({{\boldsymbol W}^{\boldsymbol p}},{\boldsymbol p}){\boldsymbol T}(x_{i,j}^{t - 1},{\boldsymbol p})$$


Fig. 3. Implementation example of accelerated affinity propagation process. Compared with pixel-level expansion, this tensor-level operation is faster and more efficient.


The first term of Eq. (10) represents the affinity propagation of the reference pixel itself between adjacent iterations, while the second term represents the affinity propagation of the non-local neighbors of the pixel between adjacent iterations.

We will show in the discussion (Section 5) that adopting Eq. (10) for affinity propagation greatly reduces the time required by Eq. (1). Finally, through Eqs. (9) and (10), non-local affinity adaptive accelerated propagation is realized to efficiently and accurately refine the initial depth maps predicted by the NL-3A prediction layer, and the NL-3A propagation layer finally outputs high-precision dense depth maps.
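To make the parallel form concrete, the sketch below performs one propagation step by shifting whole depth maps instead of looping over pixels. Integer offsets are used for simplicity, whereas our non-local neighbors are sub-pixel, so this only illustrates the tensor-level idea behind Eq. (10), not the exact implementation:

```python
import torch
import torch.nn.functional as F

def propagate_step_fast(x, affinity, offsets):
    """One propagation step in the parallel, tensor-level spirit of Eq. (10).

    x        : (B, 1, H, W) depth map from the previous iteration
    affinity : (B, K, H, W) normalized affinity maps, one per neighbor
    offsets  : list of K integer (dp, dq) neighbor offsets
    """
    H, W = x.shape[-2:]

    def shift(t, dp, dq):
        # Value at (m, n) of the result equals t at (m + dp, n + dq), zero outside the image.
        t = F.pad(t, (max(-dq, 0), max(dq, 0), max(-dp, 0), max(dp, 0)))
        return t[..., max(dp, 0):max(dp, 0) + H, max(dq, 0):max(dq, 0) + W]

    gathered = torch.cat([shift(x, dp, dq) for dp, dq in offsets], dim=1)  # (B, K, H, W)
    neighbor_term = (affinity * gathered).sum(dim=1, keepdim=True)
    self_term = (1.0 - affinity.sum(dim=1, keepdim=True)) * x
    return self_term + neighbor_term
```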

3.4 Loss function

To accurately predict dense depth maps, and to make a fair comparison with other state-of-the-art algorithms, we adopt the same loss function as [39,40] and train our NL-3A depth completion network with ground-truth depth, as follows:

$$\begin{array}{c} {L_{NL - 3A}}({{\boldsymbol d}^{gt}},{{\boldsymbol d}^{NL - 3A}}) = \frac{1}{V}\mathop \Sigma \limits_i^m \mathop \Sigma \limits_j^n |{(d_{i,j}^{NL - 3A} - d_{i,j}^{gt}) \cdot \Pi (d_{i,j}^{gt} > 0)} |\\ V = \mathop \Sigma \limits_i^m \mathop \Sigma \limits_j^n \Pi (d_{i,j}^{gt} > 0) \end{array}$$

where ${{\boldsymbol d}^{NL - 3A}}$ denotes the depth maps predicted by the NL-3A network and ${{\boldsymbol d}^{gt}}$ denotes the ground-truth depth maps. $\Pi (d_{i,j}^{gt} > 0)$ indicates whether ${{\boldsymbol d}^{gt}}$ is valid at pixel $(i,j)$, and V is the number of valid pixels in the ground-truth depth maps.
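A minimal sketch of this masked L1 loss, assuming prediction and ground-truth tensors of identical shape, is:

```python
import torch

def nl3a_loss(pred, gt):
    """Eq. (11): L1 error averaged only over pixels with valid ground truth (d_gt > 0)."""
    valid = (gt > 0).float()
    n_valid = valid.sum().clamp(min=1.0)   # guard against an empty validity mask
    return (torch.abs(pred - gt) * valid).sum() / n_valid
```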

4. Experiments

In this section, to demonstrate the performance of our NL-3A propagation network for generating dense depth maps from LiDAR, we compare it with other depth completion networks on the KITTI depth completion dataset [42] and the NYU Depth V2 dataset [43]. To facilitate quantitative comparison with other methods, we use the same evaluation metrics as they do. Specifically, for the KITTI depth completion dataset, we adopt the same error metrics as the dataset baseline, including root mean square error (RMSE), mean absolute error (MAE), inverse root mean square error (iRMSE), and inverse mean absolute error (iMAE). For the NYU Depth V2 dataset, root mean square error (RMSE), mean absolute relative error (REL), and the percentage of pixels satisfying the threshold $\delta < 1.25^i$ are chosen as evaluation metrics. All of the above evaluation indicators are shown in Table 1.
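For reference, the four KITTI error metrics can be computed over valid ground-truth pixels as follows; unit handling (the benchmark typically reports RMSE/MAE in mm and iRMSE/iMAE in 1/km) is omitted here, and the depths are assumed to be positive:

```python
import torch

def kitti_metrics(pred, gt):
    """RMSE, MAE, iRMSE, and iMAE computed over pixels where ground truth is valid."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    rmse = torch.sqrt(torch.mean((p - g) ** 2))
    mae = torch.mean(torch.abs(p - g))
    irmse = torch.sqrt(torch.mean((1.0 / p - 1.0 / g) ** 2))
    imae = torch.mean(torch.abs(1.0 / p - 1.0 / g))
    return rmse, mae, irmse, imae
```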


Table 1. Evaluation indicators

4.1 Datasets and setup

The KITTI depth completion dataset provides about 80,000 raw image frames of outdoor scenes and associated sparse depth maps. The sparse depth maps are constructed from the Velodyne LiDAR output and cover roughly 5% of the pixels of the dense depth maps. Of the RGB and LiDAR frames, 90,000 are used for training, 1,000 for validation, and 1,000 for test evaluation and comparison with other advanced affinity propagation completion methods. In training, we crop off the top of the RGB images, where no LiDAR data are available, and then center-crop to 1216 × 256.

The NYU Depth V2 dataset consists of RGB and depth images captured by a Microsoft Kinect camera in 464 indoor scenes. Our model is trained on a subset of 50,000 images from the official training split, and tested and evaluated on the 654 images of the officially labeled test set. To compare with other state-of-the-art depth completion methods, we follow the same protocol, downscaling each image to 320 × 240 and then performing a 304 × 228 center crop.

We implement the proposed method using PyTorch [45] and train on a machine equipped with an NVIDIA RTX 3090 Ti GPU with 24 GB of memory. For all experiments, we train with the Adam optimizer, ${\beta _1} = 0.9$, ${\beta _2} = 0.999$, and an initial learning rate of ${10^{ - 3}}$. We set the number of non-local neighbors to 8 for a fair comparison with state-of-the-art algorithms that use 3 × 3 local neighbors. The batch size is set to 8 for the KITTI depth completion dataset and 24 for the NYU Depth V2 dataset. In addition, we use data augmentation techniques such as random horizontal flipping and color jittering during training.
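As a small configuration sketch of the stated optimizer settings (the placeholder module below stands in for the NL-3A network, which is not reproduced here):

```python
import torch

# Placeholder standing in for the NL-3A network; only the optimizer settings matter here.
model = torch.nn.Conv2d(4, 1, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```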

4.2 Comparison of KITTI depth completion dataset

We submit our method to the KITTI depth completion benchmark and compare it with other state-of-the-art depth completion methods. The StD method proposed by Ma et al. first introduced additional sparse depth samples combined with a single RGB image for dense depth prediction [44]. The CSPN method exploits fixed neighbor affinities in each direction for propagation. The CSPN++ method uses adaptive kernels and a resource-aware scheme for fixed neighbor affinity propagation. TWISE models the foreground and background in difficult occluded boundary regions to address boundary depth mixing. DSPN uses deformable convolution kernels to obtain better-matching local neighbors, thereby refining the depth of mixed boundaries. DeepLiDAR predicts a confidence mask to handle mixed LiDAR signals caused by occlusion near foreground boundaries and combines estimates from the color image and surface normal branches with learned attention maps to improve depth accuracy. ACMNet adopts a symmetric gated fusion strategy to effectively utilize the extracted multimodal features and improve dense depth completion accuracy. GuideNet performs kernel-guided depth completion with content-dependent, spatially varying kernels for multimodal feature fusion. FCFR-Net uses channel-shuffle extraction operations to extract more representative features from color images and coarse depth maps, and uses energy-based fusion operations to effectively fuse these features to obtain more accurate depth maps. PENet uses a dilated network to obtain more fixed neighbor affinities while accelerating the propagation of the affinities. SemAttNet uses color-guided, semantic-guided, and depth-guided branches to generate a dense depth map and applies CSPN++ refinement to obtain the final dense depth maps. The DySPN method introduces three variants for selecting neighbors; in the following comparison, we choose the Deformable DySPN, which has the highest accuracy.

The quantitative results on the KITTI depth completion test set, using the evaluation indicators of Table 1, are shown in Table 2 for the above methods. Our method achieves higher depth completion accuracy than most methods and is only slightly worse than DySPN. The difference in performance between our method and DySPN can be attributed to the difference in neighbor selection. In our method, we set the number of non-local neighbors in the network to 8, whereas DySPN utilizes a variant neighbor selection approach that combines non-local and local neighbors, with the number of neighbors set to 24, three times as many as ours. This larger number of neighbors allows DySPN to better represent the errors of all pixels globally. However, at the mixed boundaries where different objects intersect, DySPN may encounter challenges and larger errors due to incorrect local neighbor affinity propagation. This limitation is demonstrated in the subsequent qualitative results, which highlight the areas where that method struggles. Additionally, in the discussion in Section 5, we analyze the accuracy of depth estimation at different depth boundaries, and the quantitative results show that our method surpasses DySPN in this regard. Moreover, since DySPN selects a larger number of neighbors, it has no advantage in running time. In contrast, when our method selects 8 neighbors like the other methods, the time for our entire NL-3A network to complete the dense depth is greatly reduced because we use accelerated affinity propagation. This point will also be demonstrated in the discussion in Section 5.


Table 2. Quantitative evaluation on the KITTI depth completion dataset. The results of other state-of-the-art depth completion methods are obtained from the respective reference papers

To qualitatively judge the effect of our method on the KITTI depth completion dataset, we visualize and compare the results of our method and other state-of-the-art depth completion methods, as shown in Fig. 4. The first row shows the RGB images, the second row the sparse depth maps collected by LiDAR, the third row the CSPN++ method, the fourth row the SemAttNet method, the fifth row the DySPN method, the sixth row the depth completion results of our method, and the last row the error maps of our method. In addition, we zoom in on the regions where different objects are located; our method is marked with a red box and the other methods with green boxes. Figure 4 shows the following. In the left image, after depth completion with our NL-3A network, the boundary between the cyclist and the car is reconstructed more clearly, and the closest of the four branches can be distinguished (all other methods reconstruct the four branches at a uniform distance). In the middle image, only our method reconstructs the shape of the stairs and their actual near-to-far distances. In the right image, our method reconstructs the stone block occluded by tree branches more smoothly and consistently than the other methods. This is thanks to our selection of non-local neighbors of boundary pixels of different objects and their affinities, which adaptively propagate refined edge structure information. This is the highlight of our method. We visualize the non-local neighbors of some boundary pixels in the discussion (Section 5).


Fig. 4. Depth completion results on the KITTI depth completion dataset. The first row: RGB images, the second row: sparse depth maps, the third row: CSPN++, the fourth row: SemAttNet, the fifth row: DySPN, the sixth row: Ours, the last row: depth error maps of our method.


4.3 Comparison of NYU depth V2 dataset

In the previous section, our method was tested on the KITTI depth completion dataset (an outdoor dataset), compared with other advanced depth completion methods, and achieved good results. In this section, we further test the method on the NYU Depth V2 dataset (an indoor dataset). Since this dataset captures indoor scenes, there are more objects in a relatively small environment. This characteristic is challenging for depth completion methods and thus provides a good additional test of our method.

As shown in Table 3, we use the indicators in Table 1 to compare with nine other advanced depth completion methods. Our method achieves nearly the best results, with an RMSE only 0.001 m higher than that of the DySPN method. As discussed in the previous section, we pick only 8 non-local neighbors, while Deformable DySPN picks 24 surrounding neighbors. This fully demonstrates the efficiency of our method. Furthermore, our method achieves higher accuracy than the CSPN and CSPN++ methods, which also use 8 neighbors. This reflects the importance of picking non-local neighbors to ensure the correct propagation of affinity for depth completion in dense spaces with multiple objects, such as indoor scenes. This point is further analyzed in the discussion in Section 5.


Table 3. Quantitative evaluation on NYU Depth V2 dataset. The results of other state-of-the-art depth completion methods are obtained from their respective reference papers

Next, we pick some multi-object scenes from NYU Depth V2 as qualitative examples. As shown in Fig. 5, the first column shows the RGB images, the second column the locally enlarged sparse point clouds, the third column the dense depth maps predicted by our method, and the fourth column the ground-truth dense depth maps. In addition, we mark different objects in pink, such as bicycles, small fans, pillows, seats, people, lamps, and bottles. Although many objects overlap or are close to each other, we can still effectively refine their boundary information by reasonably selecting non-local neighbors for adaptive accelerated affinity propagation. Our predictions and reconstructions are smoother and more consistent, especially along object edges.


Fig. 5. Depth completion results in the NYU Depth V2 dataset. The first column: RGB images, the second column: sparse depth maps, the third column: Ours, and the fourth column: true depth maps. Note that the sparse depth images are enlarged in order to visualize the sparse depth.


5. Discussion

In this section, we conduct ablation studies to discuss and validate the role of each component of our network, including non-local neighbor extraction experiments, adaptive affinity propagation experiments, and affinity accelerated propagation experiments.

In the first part of the ablation studies, we visualize the extracted non-local neighbors of some pixels, as shown in Fig. 6. Compared with fixed local neighbors and variable local neighbors, the non-local neighbors predicted by our network are more flexible and effective. In particular, when selecting the neighbors of object boundary pixels, our network selects them according to the geometric relative position near the depth boundary, so it can avoid propagating depth errors between adjacent pixels belonging to objects at different depths. Furthermore, we compare the depth variance of different neighbor selection configurations to show the correlation of the pixel depths of the selected neighbors. As shown in Table 4, we compare the average depth variance of three neighbor selection configurations at object boundary pixels in the KITTI depth completion dataset: the first is a fixed 3 × 3 neighborhood as in CSPN; the second is a deformable neighborhood as in DySPN; the third is our non-local neighborhood. The data in Table 4 show that our non-local neighbor configuration achieves the best result in this test of object-boundary neighbor selection. This demonstrates that our method performs better when the environment contains more objects at different depths.


Fig. 6. Examples of non-local neighbors predicted by our network. The first row: visualization of RGB images and some non-local neighbors, the second row: visualization of sparse depth images and some non-local neighbors, the third row: a zoom-in of non-local neighbors for visualization, and the fourth row: the dense depth map obtained by our method.



Table 4. Quantitative evaluation on the KITTI DC validation set for various pixel neighbor configurations

In the second part of the ablation studies, we test the adaptive affinity propagation algorithm. The tests cover the choice of affinity normalization function and the effect of adaptive propagation. Taking the KITTI depth completion dataset as an example, the experimental results are shown in Table 5. With the traditional sum-of-absolute-values normalization model, the affinity combination always lies in a narrow high-dimensional space, so this model performs poorly. The performance improves when we adopt the hyperbolic tangent function as the normalization function. We also compare the experimental results of fixing the normalization factor at $\eta = K = 8$ and learning it through the network. In the learning process, we use $\eta = K = 8$ as the initial value, and the network finally outputs the learned normalization factor ($\textrm{KITTI: }\eta = 6.7,\textrm{ NYU}v2\textrm{: }\eta = 5.5$). The change curve of the learned normalization factor during training is shown in Fig. 7. Additionally, we evaluate each normalization approach with and without adaptive propagation. The experimental results show that incorporating depth reliability into affinity normalization and propagation, making the propagation adaptive, better refines the dense depth maps and yields higher accuracy.


Fig. 7. The change curve of the learned normalization factor during training.



Table 5. Quantitative evaluation of KITTI depth completion validation set for various configurations

In the third part of the ablation studies, we test our affinity accelerated propagation module. As shown in Table 6, we compare the depth completion time of accelerated and non-accelerated propagation. The results show that our accelerated implementation greatly reduces the running time and improves the efficiency of depth completion. In addition, compared with the running time of other advanced depth completion methods, our method also has certain advantages.


Table 6. Quantitative evaluation of the effect of the accelerated propagation model on the KITTI depth completion validation set

6. Conclusion

In this paper, we propose a non-local affinity adaptive accelerated propagation network for generating dense depth maps from LiDAR. First, the network predicts the initial dense depth through the NL-3A prediction layer and flexibly selects the non-local neighbors of each pixel, avoiding the error propagation of irrelevant neighbors. Second, we combine learnable normalized propagation of non-local neighbor affinity with reliability to make it adaptive and improve the robustness of the propagation process. Next, we integrated the accelerated propagation model to make it truly parallel in the affinity adaptive propagation process, improving the efficiency of depth refinement. Finally, we conduct experiments on the KITTI depth completion dataset and the NYU Depth V2 dataset and demonstrate the superiority of our method compared with other advanced depth completion methods. Especially at the pixel edges of different objects, our prediction and reconstruction are smoother and more consistent. In addition, we performed ablation studies in the Discussion to verify the role of each building block.

Our method belongs to supervised depth completion approaches. Its accuracy relies on well-trained models and ground truth data during training. However, obtaining accurate ground truth data can be challenging in practical training scenarios. Therefore, in the future, we will focus on unsupervised depth completion methods to alleviate the dependency on labeled training data.

Funding

National Natural Science Foundation of China (61473100).

Acknowledgment

The authors would like to acknowledge the reviewers and editors for their careful work in improving the quality and presentation of this paper.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Xiang, L. Liu, H. P. Deng, J. Wu, Y. Yang, and L. Yu, “Fast depth estimation with cost minimization for structured light field,” Opt. Express 29(19), 30077–30093 (2021). [CrossRef]  

2. H. C. Wang, X. Z. Sang, D. Chen, P. Wang, X. Q. Ye, S. Qi, and B. B. Yan, “Self-supervised stereo depth estimation based on bi-directional pixel-movement learning,” Appl. Opt. 61(7), D7–D14 (2022). [CrossRef]  

3. Z. Ma, Z. F. Cen, and X. T. Li, “Depth estimation algorithm for light field data by epipolar image analysis and region interpolation,” Appl. Opt. 56(23), 6603–6610 (2017). [CrossRef]  

4. T. Y. Tao, Q. Chen, S. J. Feng, Y. Hu, and C. Zuo, “Active depth estimation from defocus using a camera array,” Appl. Opt. 57(18), 4960–4967 (2018). [CrossRef]  

5. Y. Zhao, L. Bai, Z. Zhang, and X. Huang, “A Surface Geometry Model for LiDAR Depth Completion,” IEEE Robot. Autom. Lett. 6(3), 4457–4464 (2021). [CrossRef]  

6. S. Hwang, J. Lee, W. J. Kim, S. Woo, K. Lee, and S. Lee, “LiDAR Depth Completion Using Color-Embedded Information via Knowledge Distillation,” IEEE Trans. Intell. Transport. Syst. 23(9), 14482–14496 (2022). [CrossRef]  

7. F. Ma, G. V. Cavalheiro, and S. Karaman, “Self-Supervised Sparse-to-Dense: Self-Supervised Depth Completion from LiDAR and Monocular Camera,” in 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, pp. 3288–3295 (2019).

8. K. Zhang, J. Xie, N. Snavely, and Q. Chen, “Depth Sensing Beyond LiDAR Range,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 1689–1697 (2020).

9. Z. Lai, F. Liu, S. Guo, X. Meng, S. Han, and W. Li, “Onboard Real-Time Dense Reconstruction in Large Terrain Scene Using Embedded UAV Platform,” Remote Sensing 13(14), 2778 (2021). [CrossRef]  

10. C. Zhang, R. Zhang, S. Jin, and X. Yi, “PFD-SLAM: A New RGB-D SLAM for Dynamic Indoor Environments Based on Non-Prior Semantic Segmentation,” Remote Sensing 14(10), 2445 (2022). [CrossRef]  

11. L. Yan, X. Hu, L. Zhao, Y. Chen, P. Wei, and H. Xie, “DGS-SLAM: A Fast and Robust RGBD SLAM in Dynamic Environments Combined by Geometric and Semantic Information,” Remote Sensing 14(3), 795 (2022). [CrossRef]  

12. R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “KinectFusion: Real-time dense surface mapping and tracking,” in 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, pp. 127–136 (2011).

13. Z. Liu and F. Zhang, “BALM: Bundle Adjustment for Lidar Mapping,” IEEE Robot. Autom. Lett. 6(2), 3184–3191 (2021). [CrossRef]  

14. Z. Yan, K. Wang, X. Li, Z. Zhang, J. Li, and J. Yang, “RigNet: Repetitive Image Guided Network for Depth Completion,” in Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, 13687. Springer, Cham (2022).

15. L. Liu, X. B. Song, J. D. Sun, X. Y. Lyu, L. Li, Y. Liu, and L. J. Zhang, “MFF-Net: Towards Efficient Monocular Depth Completion With Multi-Modal Feature Fusion,” IEEE Robot. Autom. Lett. 8(2), 920–927 (2023). [CrossRef]  

16. J. Liu and C. Jung, “NNNet: New Normal Guided Depth Completion from Sparse LiDAR Data and Single Color Image,” IEEE Access 10, 114252–114261 (2022). [CrossRef]  

17. M. Dimitrievski, P. Veelaert, and W. Philips, “Learning Morphological Operators for Depth Completion,” in Advanced Concepts for Intelligent Vision Systems. ACIVS 2018. Lecture Notes in Computer Science, 11182. Springer, Cham (2018).

18. N. Chodosh, C. Wang, and S. Lucey, “Deep Convolutional Compressed Sensing for LiDAR Depth Completion,” In Computer Vision – ACCV 2018. ACCV. Lecture Notes in Computer Science, 11361. Springer, Cham (2018).

19. S. S. Shivakumar, T. Nguyen, I. D. Miller, S. W. Chen, V. Kumar, and C. J. Taylor, “DFuseNet: Deep Fusion of RGB and Sparse Depth Information for Image Guided Dense Depth Completion,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand. pp. 13–20 (2019).

20. Y. Yang, A. Wong, and S. Soatto, “Dense Depth Posterior (DDP) From Single Image and Sparse Range,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 3348–3357 (2019).

21. M. Hu, S. Wang, B. Li, S. Ning, L. Fan, and X. Gong, “PENet: Towards Precise and Efficient Image Guided Depth Completion,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, pp. 13656–13662 (2021).

22. K. T. Giang, S. Song, D. Kim, and S. Choi, “Sequential Depth Completion with Confidence Estimation for 3D Model Reconstruction,” IEEE Robot. Autom. Lett. 6(2), 327–334 (2021). [CrossRef]  

23. H. Li, A. Gordon, H. Zhao, and C. Vincent, “Unsupervised Monocular Depth Learning in Dynamic Scenes,” in Conference on Robot Learning (2020).

24. H. Lu, S. Xu, and S. Cao, “SGTBN: Generating Dense Depth Maps from Single-Line LiDAR,” IEEE Sensors J. 21(17), 19091–19100 (2021). [CrossRef]  

25. S. Imran, X. Liu, and D. Morris, “Depth Completion with Twin Surface Extrapolation at Occlusion Boundaries,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, pp. 2583–2592 (2021).

26. J. Qiu, Z. Cui, Y. Zhang, X. Zhang, S. Liu, B. Zeng, and M. Pollefeys, “DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene from Sparse LiDAR Data and Single Color Image,” in proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3313–3322 (2019).

27. S. Zhao, M. Gong, H. Fu, and D. Tao, “Adaptive Context-Aware Multi-Modal Network for Depth Completion,” IEEE Transactions on Image Processing 30, 5264–5276 (2021). [CrossRef]  

28. J. Tang, F. P. Tian, W. Feng, J. Li, and P. Tan, “Learning Guided Convolutional Network for Depth Completion,” IEEE Transactions on Image Processing 30, 1116–1129 (2021). [CrossRef]  

29. L. Liu, X. Song, X. Lyu, J. Diao, M. Wang, Y. Liu, and L. Zhang, “FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Depth Completion,” in proceedings of the AAAI Conference on Artificial Intelligence, 35(3), 2136–2144 (2021).

30. A. Eldesokey, M. Felsberg, and F. S. Khan, “Confidence Propagation through CNNs for Guided Sparse Depth Regression,” IEEE Transactions on Pattern Analysis and Machine Intelligence 42(10), 2423–2436 (2020). [CrossRef]  

31. L. Teixeira, M. R. Oswald, M. Pollefeys, and M. Chli, “Aerial Single-View Depth Completion with Image-Guided Uncertainty Estimation,” IEEE Robot. Autom. Lett. 5(2), 1055–1062 (2020). [CrossRef]  

32. D. Nazir, A. Pagani, M. Liwicki, D. Stricker, and M. Z. Afzal, “SemAttNet: Toward Attention-Based Semantic Aware Guided Depth Completion,” IEEE Access 10, 120781–120791 (2022). [CrossRef]  

33. S. Liu, S. D. Mello, J. Gu, G. Zhong, M. H. Yang, and J. Kautz, “Learning affinity via spatial propagation networks,” in Proc. of Advances in Neural Information Processing Systems, pp. 1519–1529 (2017).

34. X. Cheng, P. Wang, and R. Yang, “Learning Depth with Convolutional Spatial Propagation Network,” IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 2361–2379 (2020).

35. Z. Xu, H. Yin, and J. Yao, “Deformable Spatial Propagation Networks for Depth Completion,” in 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, pp. 913–917 (2020).

36. X. Cheng, P. Wang, C. Guan, and R. Yang, “CSPN++: Learning Context and Resource Aware Convolutional Spatial Propagation Networks for Depth Completion,” in AAAI Conference on Artificial Intelligence (2019).

37. X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local Neural Networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 7794–7803 (2018).

38. G. Shim, J. Park, and I. S. Kweon, “Robust Reference-Based Super-Resolution With Similarity-Aware Deformable Convolution,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 8422–8431 (2020).

39. J. Park, K. Joo, Z. Hu, C. K. Liu, and I. S. Kweon, “Non-local Spatial Propagation Network for Depth Completion,” in Computer Vision – ECCV 2020. Lecture Notes in Computer Science, 12358. Springer, Cham (2020).

40. Y. Lin, Y. Qin, T. Cheng, Q. Zhong, W. Zhou, and H. Yang, “Dynamic Spatial Propagation Network for Depth Completion,” in AAAI Conference on Artificial Intelligence, 36(2) (2022).

41. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778 (2016).

42. J. Uhrig, J. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity Invariant CNNs,” in 2017 International Conference on 3D Vision (3DV), Qingdao, China, pp. 11–20 (2017).

43. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor Segmentation and Support Inference from RGBD Images,” in Computer Vision – ECCV 2012. ECCV 2012. Lecture Notes in Computer Science, 7576. Springer, Berlin, Heidelberg (2012).

44. F. Ma and S. Karaman, “Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, pp. 4796–4803 (2018).

45. A. Paszke, S. Gross, F. Massa, et al., “Pytorch: An imperative style, high-performance deep learning library,” in proceedings of the 33rd International Conference on Neural Information Processing Systems, 721, 8026–8037 (2019).
