SPADnet: deep RGB-SPAD sensor fusion assisted by monocular depth estimation

Zhanghao Sun; David B. Lindell; Olav Solgaard; Gordon Wetzstein

doi:10.1364/OE.392386

1. Introduction

Understanding the spatial layout of a scene, including depth, is a vital capability for many applications, including segmentation [1], automotive navigation [2], pose estimation [3], augmented reality [4], and robotics [5]. Although many depth mapping techniques exist, for example based on stereo vision and structured illumination, we are most interested in two specific categories: monocular depth estimation and light detection and ranging (LiDAR) systems. Monocular depth estimation is attractive because it uses only a single RGB image or video captured by a commodity camera to estimate a depth map [6–10]. These approaches are great at estimating relative depth ordering of a scene. However, recent research has shown that they often contain large errors when estimating absolute distances [6,7].

In contrast, LiDAR systems are attractive because they can capture accurate depth information at kilometer range [11–13], albeit at low resolution. These long-range depth sensing capabilities are primarily enabled by pulsed illumination combined with emerging single-photon avalanche diodes (SPADs), which can record the time-of-arrival of individual photons with picosecond accuracy [14–20]. Unfortunately, eye safety sets an upper limit on the power of light pulses emitted into the scene. Therefore, SPAD-based LiDAR systems usually suffer from a significant amount of measurement noise and are also corrupted by background signal from ambient light. Low reflectance and the square distance falloff quickly reduce the number of photons reflected from the scene back to the detector to only a handful. In direct or indirect sunlight, ambient photons further corrupt the measurements and bury the few signal photons in noise. To address these challenges, a significant amount of recent work has focused on developing robust algorithms for depth estimation from noisy SPAD data [14–18,21,22]. However, most of them are still limited by the quality of the data under low photon-flux conditions.

One of the most promising directions for depth estimation in low flux conditions relies on neural networks. Such approaches offer the capability of overcoming low flux limitations by fusing measurements from multiple complementary sensors. For example, recent approaches have fused data from SPADs and a conventional intensity camera [23] (RGB-SPAD sensor fusion). Indeed, sensor fusion techniques for 3D imaging are not uncommon; fusion between RGB and depth, or RGB-D data, has been explored intensively to overcome conventional limitations of 3D sensors [24–28]. By leveraging convolutional neural networks (CNNs), prominent advances have been made, from depth inpainting [29,30] and sparse to dense depth mapping [31–34] to depth super-resolution [35–38]. These RGB-D fusion models generally take as input a 2D depth map from a traditional depth sensor (e.g., based on stereo matching or active illumination) and rely on standard 2D convolutional layers for neural network processing. However, RGB-SPAD sensor fusion is a fundamentally different task because the input SPAD measurement consists of noisy photon arrivals, not a depth map. Moreover, the SPAD measurements are fundamentally 3D: photon arrivals captured on a grid of 2D spatial locations over time. While previous approaches for RGB-SPAD fusion directly merge an input image and SPAD measurements in a 3D CNN [23], we show that this straightforward approach fails at extreme low flux levels and has a high computational cost (see Fig. 1).

Fig. 1. (a) A SPAD array captures a datacube with time-resolved photon counts whereas a conventional intensity camera records the time-integrated photon flux of a scene. (b) Monocular depth estimators allow the depth of the scene to be directly recovered from the 2D image. While the ordinal (i.e., relative) depth information of such an estimate is often good, there is scale ambiguity resulting in a large error (inset). (c) RGB-SPAD fusion approaches use neural networks to fuse the SPAD data with the 2D image to optimize depth estimation. (d) We introduce SPADnet, a neural network architecture that achieves state-of-the-art results for RGB-SPAD sensor fusion.

Download Full Size | PDF

In this paper, we propose a neural network architecture, dubbed SPADnet, for RGB-SPAD sensor fusion and robust depth estimation. As opposed to previous work [23], SPADnet leverages the advantage of both monocular depth estimation and a SPAD-based LiDAR sensor. The approach uses a monocular depth estimation network [6,7,39] to extract a depth map from an RGB or gray-scale image. This estimate is fused with the noisy output of a SPAD array to compute a final depth map. The proposed strategy, together with several other improvements over related work, allows SPADnet to significantly improve depth estimation quality and also computational efficiency over previous approaches. Our extensive evaluations show that SPADnet achieves state-of-the-art performance on both simulated and captured data by a large margin.

This paper is organized as follows: Sec. 2 outlines the image formation model for SPAD-based 3D imaging. Sec. 3 describes the processing pipeline. Sec. 4 shows comparisons on results as well as comprehensive ablation studies on the proposed techniques. Sec. 5 discusses limitations and future work, and Sec. 6 concludes.

2. Image formation model

In a typical SPAD-based 3D imaging system, a short light pulse is generated by pulsed laser and emitted into the scene. The pulse is scattered and some of the photons will be reflected back to the SPAD detector, where they have a certain possibility to trigger a photon arrival event that is time-stamped. The number of photons returning to the sensor within each time bin is

(1)$$s[n] = \int^{ (n+1)\Delta t}_{n\Delta t} (g*f) (t - 2d/c) dt,$$

where $\Delta t$ is the time bin size, $g$ and $f$ are the laser pulse temporal shape and the detector jitter, $c$ is the speed of light, and $d$ is the depth of illuminated object [19]. After $N$ times of illumination-measurements (during which $N$ laser pulses are generated and detected), the final output of the sensor will be a histogram of time-stamped photon counts.

The number of photons recorded in the output histogram can be approximated as a Poisson process [15]. During the measurement process, background photon detection caused by ambient light and falsely detected events known as dark count occur with uniform temporal probability. Taken these effects into account, we express the histogram $h[n]$ as

(2)$$h[n] \sim \mathcal{P} (N (\eta \gamma s[n] + \eta a + dc))$$

where $\eta$ is photon detection probability of the detector, $\gamma$ is reflectivity of the object, $a$ is received ambient light intensity and $dc$ is the dark count.

It is important to note that this model only characterizes SPADs under low photon flux. With high photon flux incidence, histogram would be distorted by pile-up [17,40,41], afterpulsing [42], crosstalk [42] and other effects. Through this work, we consider the common low-photon-flux regime.

The raster-scanned 3D measurements comprise $M_1 \times M_2$ temporal histograms $h$, one per scanned location, each containing the number of detected photon events for a number of time bins. We use this 3D spatio-temporal volume as the input to a neural network and output a denoised 3D volume where detections from ambient light or dark count have been censored. This results in an output vector $\hat {h}$ for each pixel. Finally, we identify the time of flight of the laser pulse, converting the denoised 3D volume to a 2D depth map, through a soft argmax operation:

(3)$$\hat{n}_{ij} = \sum_{n} n\cdot \hat{h}_{ij}[n], \quad \hat{d}_{ij} = \hat{n}_{ij} \cdot 2c\Delta t,$$

where $\hat {h}$ is the denoised histogram, $\hat {n}$ is the reconstructed index, $i,j$ are 2D spatial indices, $\hat {d}$ is the estimated depth, $c$ is light speed and $\Delta t$ is time bin size.

3. SPADnet sensor fusion model

In this section, we describe the SPADnet sensor fusion model including the network architecture, loss function, and log-scale time rebinning as a pre-processing step.

3.1 Network architecture

The SPADnet architecture is shown in Fig. 2 and consists of a denoising branch, monocular depth estimator, fusion operator, and a refinement branch. The denoising and refinement branches consist of learned 3D convolutional layers, and the fusion operation is intended to combine 2D features from an estimated depth map with 3D features from the processed SPAD measurements. The output of the network is the predicted depth map.

Fig. 2. SPADnet uses a monocular depth estimator to convert the 2D image into a rough depth map and then conduct 2D-3D up-projection to fuse it with 3D features extracted from SPAD measurement.

Download Full Size | PDF

Fusion of 2D images into a 3D denoising network is a non-trivial task. Recently, Lindell et al. [23] fused a 2D image into the denoising branch of a 3D convolutional neural network (CNN) using a repetition approach. That is, the 2D image is lifted into 3D by simply repeating it along the time dimension. However, this fusion approach does not use any physical relationship between the 2D and 3D data. Intuitively, a physically-inspired approach may be able to extract additional information from the 2D image, improving performance. Based on this insight, we propose to use a pre-trained monocular depth estimator [6,39], as shown in Fig. 2. Monocular depth estimation at each spatial location ($x,y$) is converted into an index on the depth axis (i.e. z-axis) of the 3D volume. We assign the corresponding indices in the 3D volume a value of 1 and set other points in the volume to 0. We call this process “2D-3D up-projection” in Fig. 2.

Another critical observation that motivates us to combine monocular depth estimator and SPAD denoising network is the complementary relationship in error distributions of these two methods. The SPAD denoising output tends to give an accurate prediction when the signal is sufficiently high. However, errors can result in very noisy regions of the input. On the other hand, the monocular depth estimate is typically spatially consistent with few outlying errors, but has a global offset from the correct absolute depth. Neural networks for monocular depth estimation are generally understood to predict good ordinal depth, but poor metric depth [39]. Therefore, fusing these two data modalities together is beneficial.

3.2 Objective function

We utilized an ordinal regression (OR) loss as the objective function in network training. This loss considers the order of different temporal bins in the SPAD’s histograms. This makes OR loss well-suited for depth estimation [6]. In this work, we calculate the ordinal regression loss independently at each spatial location ($x,y$). Then we take an average across all spatial locations to get the total ordinal loss for the output point cloud:

(4)$$\begin{aligned}\mathcal{L}_{OR} (h, \hat{h}) &= \frac{-1}{M_1 \times M_2}\sum_{ij} \left( \sum_{n=1}^{l} \textrm{log} \left( 1-P_{ij}[n] \right) + \sum_{n=l+1}^{K} \textrm{log} \left(P_{ij}[n] \right) \right)\\ P_{ij}[n] &= \textrm{cumsum} \left( \hat{h}_{ij}[n] \right), \end{aligned}$$

where $\hat {h}$ is the output histogram, $l$ is the bin index of ground truth detection rate peak and “cumsum” stands for cumulative summation.

We also introduce a term for total variation (TV) spatial regularization on the output 2D depth map to make sure that edge sharpness is preserved. So the total loss function can be expressed as:

(5)$$\mathcal{L}_{total} = \mathcal{L}_{OR} (h, \hat{h}) + \lambda_{TV} \Vert \hat{d} \Vert_{TV}.$$

During experiments, we determine the best value of $\lambda _{TV}$ empirically.

3.3 Log-scale depth binning

3D volumes of SPAD data quickly become very memory intensive. In a SPAD denoising network, memory consumption is primarily generated by data and features, instead of network weights. This makes it difficult to use memory efficient network architectures, such as Mobile-net [43] or Squeezenet [44]. Therefore, in a pre-processing step, we re-bin the linear-scale time-dimension (with $B = 1024$ bins) into log-scale (with $B'$ bins). Detection count in each new bin $h^{log}[k]$ is assigned as the average of raw detection counts from $n_1 (k)$ to $n_2 (k)$, where $n_1$, $n_2$ and $k$ should satisfy:

(6)$$n_1 (k) = \lfloor B \times \frac{q^k - 1}{q - 1} \rfloor, \quad n_2 (k) = \lfloor B \times \frac{q^{k+1} - 1}{q - 1} \rfloor$$

$B$ is the number of original time bins, $q$ is a constant that controls the number of new log-scale time bins $B'$. In the 3D-2D projection step, we apply a soft-argmax operation with output $h^{log}[k]$ and central point of each log-scale bin.

(7)$$\hat{n}^{log}_{ij} = \sum_{k} \frac{n_1 (k) + n_2 (k)}{2}\cdot \hat{h}^{log}_{ij}[k]$$

We also experiment on the log-scale bin number $B'$ and find 128 to be a good value. With 8 times fewer time bins, we reduce the runtime and memory consumption by a factor of approximately $7$ (see Appendix B for detailed discussion).

4. Results

We use a simulated dataset for network training and compare network performance on both a simulated dataset as well as a real-world captured dataset.

For the training dataset, the image formation model outlined in Section 3 is used to conduct simulations on the NYUv2 dataset [45] and generate synthetic SPAD measurements and ground truth detection rates. In simulation, a SPAD data cube consists of $512\times 512$ histograms, each with 1024 time bins, which corresponds to a 80 ps temporal resolution, or 2.4 cm depth resolution. We use 7.6k scenes for training and 766 scenes for test. All models are trained for 5 epochs (around 12k iterations) for fair comparison.

In each training iteration, we randomly crop out a $128\times 128\times 1024$ data patch from a simulated point cloud. In the test phase, to ensure deterministic comparison, we divide the $512\times 512\times 1024$ SPAD data into $128\times 128\times 1024$ patches with overlap, then feed each data patch through the network and finally integrate the outputs together. Processing a $128\times 128\times 1024$ data patch with log-scale rebinning would take around 2GB memory and 1.5s computation time. For models with linear-scale binning, memory cost generated by a $128\times 128\times 1024$ data patch would exceed the capacity of our 12GB Titan V GPU. Therefore, we use a patch size of $64\times 64\times 1024$ and implement similar overlap strategy in test. While this smaller patch size indeed impairs network performance, we conduct ablation experiments to show that it is not the major contribution to performance difference. Detailed discussions on patch size can be found in Appendix B.

We also use real-world data from [23] for network evaluation. This small dataset is captured through a 1D LinoSPAD array with 256 SPAD detectors [46,47] and a PointGrey camera at 5Hz frame rate. Output from the SPAD array has $256\times 256$ spatial resolution and $1536$ time bins; each time bin corresponds to a 26 ps temporal resolution.

4.1 Simulated data

Table 1 lists the performance of different methods on synthetic dataset generated from NYUv2 indoor scenes (depth limited within 10 m). Here we fix the signal-to-background ratio (SBR) at 0.04 (2 signal photons vs. 50 background photons). We use the ordinal regression loss for the loss function and use “DORN” [6] for the monocular depth estimator for controlled comparisons. The SPADnet architecture with log-scale discretization achieves the best performance, reducing the root-mean-square error (RMSE) by a factor of 5 from model with linear-scale binning and repetition fusion strategy [23]. In the case of linear-scale discretization, the SPADnet architecture also has two times smaller error compared to simple repetition fusion.

Table 1. Results of different methods and linear/log-scale discretizations (up-arrow stands for higher is better and down-arrow stands for lower is better). SPADnet with a log-scale discretization reduces RMSE error by $80\%$. The best result is shown in bold and the second best is shown with underline. Note that we use different patch size division for linear and log-scale models, which are both largest patch size within our devices’ computational capacity. See supplementary for detailed discussions on influence of patch size on model performance.

View Table | View all tables in this article

In the first row, we show results from the pre-trained monocular depth estimator we use. The SPAD-denoising methods (except the repetition fusion model with linear-scale) outperforms monocular estimation, which means the network does learn how to fuse these two predictions, instead of simply relying on one of them and ignoring the other one. In the second row, we also include a state-of-the-art non-CNN algorithm proposed by Rapp et al. [15] for comprehensive comparison.

It is important to note that log-scale rebinning not only leads to lower computation cost, but also better denoising performance as well. This is because objects farther away from the sensor would have much lower signal level. In log-scale rebinned histogram, tolerance on farther objects are larger, due to larger time bin size. This would help the network to focus more on closer objects, which is easier to reconstruct.

Qualitative results are shown in Fig. 3, for the same SPAD simulations. Although the average SBR is fixed through the dataset, extremely weak returning signal happens when depth is large or reflectivity is low (Equation (2)). In the first row, we show depth prediction without these undesirable situations. All methods achieved comparatively good performance, though linear-scale repetition fusion model generates small artifacts because of 2D image. In the second and third rows, we encounter with large depth and low reflectivity, separately, in the marked out regions. It is evident from the comparison that SPADnet borrows information from rough monocular depth estimation in these regions.

Fig. 3. Qualitative results comparing various approaches with fixed SBR at 0.04 (2 signal photons vs. 50 background photons). SPADnet achieves the lowest RMSE error. White boxes in figures mark out regions with extremely weak signal return. SPADnet significantly improves the prediction in these regions.

Download Full Size | PDF

4.2 Captured data

We evaluate the proposed model with captured SPAD data qualitatively. It is difficult to quantify the detected signal photon number and background photon number in measurements. This poses difficulty to non-CNN based algorithms because they require a user-specified SBR estimation. On the other hand, CNN-based models are better at this task because of better generalization capability. For fair comparison, we try out different SBR values in Rapp and Goyal’s algorithm [15] and choose the best result. We also utilize the model trained on mixing noise level in [23], which is reported to be best for captured data. For the proposed networks, we only exploit the models trained on lowest SBR simulations (2 signal photons, 50 background photons) and demonstrate its capacity to generalize to complicated real-world conditions.

As shown in Fig. 4, the “stuff” scene has a comparatively high SBR (average photon counts per pixel $= 2.5$). Denoising results from different algorithms on this scene are similar. The “kitchen” and “hallway” scenes are taken under much lower SBR. Both scenes have average photon counts per pixel around 7. Moreover, the signal is further decreased by low reflectivity, optical misalignment, multipath interference and other practical issues in regions marked out by white boxes. Typical examples are a black tissue box in the kitchen (second row) or an open door in the hallway (third row).

Fig. 4. Evaluation of different algorithms on captured data. The “stuff” scene has higher SBR. All methods are comparatively good under this condition. In the “kitchen” and “hallway” scenes, the SBR is much lower. These two scenes also contain regions with low reflectivity or large depth. SPADnet significantly outperforms other methods in robustness. Video and 3D visualizations of the comparison are shown in Visualization 1 (Supplementary Material)

Download Full Size | PDF

Similar to the case in simulated dataset, previous algorithms generate large errors in these regions. On the other hand, monocular depth estimators we exploit are biased on NYUv2 dataset and corresponding camera settings. Therefore, we need to rescale the predictions with a global factor through comparison with denoised SPAD measurements. Even after this post-processing, predicted depth maps still deviate a lot from SPAD denoising results.

Contrary to the unsatisfying repetition fusion denoising and monocular estimation, SPADnet successfully extract reliable information from the two corrupted data and fuses them together, shown in the last column in Fig. 4. Reconstruction quality is largely enhanced and the largest errors are strongly suppressed. We also provide reconstructions from videos containing multiple measurements in the supplementary material (Visualization 1).

4.3 Ablation study

We conduct ablation studies on loss functions and monocular depth estimators. Results are shown in Table 2.

Fig. 5. Failure case for monocular depth estimation. In this case, SPADnet cannot effectively use information provided by the estimated depth map and performs about as good as previous neural network approaches. Both Lindell et al.’s method and SPADnet perform better than Rapp and Goyal’s method in this case.

Download Full Size | PDF

Table 2. Ablation experiments on different loss functions and different monocular depth estimators. As shown in the left panel, a model trained with the ordinal regression loss gives the lowest error. As shown in the right panel, SPADnet denoising results are robust against inaccurate monocular estimations.

View Table | View all tables in this article

4.3.1 Loss function

We compare ordinal regression loss to Kullback-Leibler (KL) divergence, which is a popular measure for the difference between two probability distributions. KL loss can be naturally applied on denoised histogram in SPAD denoising [23]. However, contrary to the ordinal regression loss, KL loss considers error in each time bin independently instead of considering ordinal relationship along the whole time axis. As shown in Table 2, ordinal regression (OR) loss outperforms KL loss. Apart from these two losses defined on the 3D point cloud, we also train the network with a 2D $L_2$ loss defined on projected depth map. The ordinal regression loss also achieves better convergence in all metrics.

4.3.2 Monocular depth networks

We substitute the DORN network with the recently reported DenseDepth model [39], which achieved state-of-the-art performance on NYUv2 dataset. However, the DenseDepth model was originally trained on a smaller subset of NYUv2 where unlabeled defects are corrected. This dataset uses different camera settings compared to the rest of NYUv2 that we use, so we re-train the model on our dataset with minimal modifications. Because of defects in the ground truth depth map, we omit perceptual loss terms in the original work and only use $L_1$ loss. The network accuracy is much worse than that reported in [39], as shown in the first row of Table 2. However, the SPAD denoising network performance is hardly influenced and even achieves small improvements. This demonstrates the robustness of SPADnet.

5. Discussion

In summary, we present SPADnet: a method for CNN-based sensor fusion of SPAD measurements and conventional images for robust depth estimation. Our approach outperforms other CNN-based techniques as well as state-of-the-art depth estimation approaches from SPAD measurements based on principled optimization techniques.

Through all the experiments above, we focus on models trained under low SBR (2 signal photons and 50 background photons). To understand the influence of noise level in detail, we fix the average signal photon number to be 2 and change the average background photon number from 2 to 50, corresponding to SBR $= 1$ to $0.04$. We choose the best-performing log-scale model with a $128\times 128$ patch size, and only study the influence of fusion strategy. We retrain the models on simulated data with different SBR and validate under the same SBR condition. As shown in Fig. 6, when background light intensity is low, all methods can reliably reconstruct the depth. On the other hand, when the number of background photons exceeds 20, it is evident that error of the repetition fusion model undergoes a faster increase than that of SPADnet.

Fig. 6. Depth prediction error under different noise levels. The denoising task is sensitive to SBR. The proposed model is able to work under extremely low SBR condition, without too much sacrifice in accuracy.

Download Full Size | PDF

Despite the advantages of SPADnet, we also mention some limitations. The first one is the trade-off between linear-binning and log-binning. 3D point cloud denoising is a classification task. Discretization with too much or too few classes are both unfavorable, the former one would increase difficulty of convergence and the latter one would result in large quantization error. As shown in Fig. 4, discontinuity artifacts are more significant in the log-scale binning result.

Another issue associated with SPADnet is that the required monocular depth estimation could fail. Most off-the-shelf monocular depth estimators do not generalize well beyond the datasets they are trained for. Moreover, for close objects monocular cues can be almost missing as, for example, shown for the “lamp” scene in Fig. 5. Here, the monocular depth estimator fails and SPADnet does not benefit from it. Despite this failure, SPADnet still achieves an equally good or better result than other methods.

6. Conclusions

Efficient and robust reconstruction algorithm is essential for SPAD-based depth sensors. In this work, we introduce monocular depth estimation in RGB-SPAD sensor fusion with a convolutional neural network (CNN) model. The proposed model achieves state-of-the-art accuracy and computational efficiency. This can lead to application of SPAD-based 3D reconstruction in broader contexts, such as 3D imaging with longer range and strong ambient light or 3D imaging on mobile devices. Also, the general insight that monocular depth estimator can facilitate fusion between 3D and 2D data modality may be helpful in other 3D sensing related scenarios.

Appendix A Implementation details

As shown in Fig. 2, SPADnet contains a multi-scale denoising network for pre-processing of 3D SPAD data and a fusion network that combines the up-projected monocular estimation with 3D features. Our denoising network is a four-stage Unet [48]. Each stage has four (convolutional layer $+$ batch normalization $+$ ReLU activation function) modules. We use decreasing convolutional kernel size from $9\times 9$ to $3\times 3$ for each stage. All features are upsampled with transpose convolutional layers and concatenated with features from the above stages. Output from the denoising network is 40 channels of 3D features. They are concatenated with up-projected monocular estimation and fused through the fusion network, which consists of four $7\times 7$ (convolutional layer $+$ batch normalization $+$ ReLU activation function) modules. The final output of the network is a denoised 3D point cloud and we conduct 3D-2D projection defined in Equation (3) to get the depth map. Code and pre-trained models will be made public. During training, we use the Adam optimizer [49] and set the learning rate to be $0.0001$, with a $0.5$ learning rate decay. All models are trained for 5 epochs (around 12k iterations) for fair comparison.

Despite the fusion module, network architecture design can involve many other factors, for example, where the fusion is conducted [31,34], structure of convolutional layers [50], depth of network [51] and so on. We observed that the performance was relatively insensitive to these parameters.

Also, it is helpful to mention that due to the color-augmentation strategy in [6], the DORN network has similar performance on gray-scale intensity images and RGB images.

Appendix B Patch sizes

We compare the influence of different patch sizes on model performance in Table 3. Smaller patch sizes negatively impact the network performance. However, compared to the linear-scale models, the $64\times 64$ patch size SPADnet is still better. Also, in practical applications, the large data size for large patch sizes is imposes a restriction on linear-scale models due to the much higher memory consumption.

Table 3. Influence of different patch sizes on SPADnet performance. Although smaller patch degrades the network performance. SPADnet using $64\times 64$ patch size data still performs better than linear-scale model with repetition fusion.

View Table | View all tables in this article

As shown in Fig. 7, a model with linear-scale binning has around 7 times longer run time and occupies 7 times more memory, compared with log-scale SPADnet (also takes monocular depth estimation into account). It is important to note a few things. First, we only consider memory consumption of model features and weights here. Second, runtime and memory consumption for different monocular depth estimators vary a lot. DenseDepth model [39] is memory intensive (3 GB for $512\times 512$ image resolution) due to the fact that it uses pre-trained densenet model. On the other hand, DORN [6] consumes only 700 MB memory. This is another reason that motivates us to choose DORN as the default monocular estimator. Both these two models have a runtime around 0.5 s for $512 \times 512$ resolution scene, which is much faster than the 3D denoising branch.

Fig. 7. Influence of different patch sizes on computational cost. Log-scale rebinning facilitates model inference significantly (Dashed line means the memory consumption is beyond computational capacity available to us)

Download Full Size | PDF

Fig. 8. Additional qualitative comparisons on the simulated dataset with a signal-to-background ratio (SBR) of 0.04. SPADnet achieves state-of-the-art performance.

Download Full Size | PDF

Appendix C Additional comparisons on simulated dataset

Here we show more comparisons on the simulated dataset (Fig. 8). The settings are same as explained in Section 4.

Funding

National Science Foundation (IIS 1553333).

Acknowledgments

D.L. was supported by a Stanford Graduate Fellowship. G.W. was supported by an NSF CAREER Award (IIS 1553333), a Sloan Fellowship, by the KAUST Office of Sponsored Research through the Visual Computing Center CCF grant, the DARPA REVEAL program, and a PECASE by the U.S. Army Research Office. The authors would like to thank Matthew O'Toole for his work on the prototype and data acquistion.

Disclosures

The authors declare no conflicts of interest.

References

1. C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,” in Computer Vision – ACCV 2016, S.-H. Lai, V. Lepetit, K. Nishino, and Y. Sato, eds. (Springer International Publishing, Cham, 2017), pp. 213–228

2. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The Int. J. Robotics Res. 32(11), 1231–1237 (2013). [CrossRef]

3. J. P. S. do Monte Lima, F. P. M. Simões, H. Uchiyama, V. Teichrieb, and E. Marchand, “Depth-assisted rectification for real-time object detection and pose estimation,” Mach. Vis. Appl. 27(2), 193–219 (2016). [CrossRef]

4. P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, “Rgb-d mapping: Using kinect-style depth cameras for dense 3d modeling of indoor environments,” The Int. J. Robotics Res. 31(5), 647–663 (2012). [CrossRef]

5. C. Cadena, A. R. Dick, and I. D. Reid, “Multi-modal auto-encoders as joint estimators for robotics scene understanding,” in Robotics: Science and Systems, vol. 5 (2016), p. 1.

6. H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

7. D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds. (Curran Associates, Inc., 2014), pp. 2366–2374.

8. A. Gordon, H. Li, R. Jonschkowski, and A. Angelova, “Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras,” CoRR abs/1904.04998 (2019).

9. D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling, “High-resolution stereo datasets with subpixel-accurate ground truth,” in Pattern Recognition, X. Jiang, J. Hornegger, and R. Koch, eds. (Springer International Publishing, Cham, 2014), pp. 31–42.

10. C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017).

11. A. McCarthy, N. J. Krichel, N. R. Gemmell, X. Ren, M. G. Tanner, S. N. Dorenbos, V. Zwiller, R. H. Hadfield, and G. S. Buller, “Kilometer-range, high resolution depth imaging via 1560 nm wavelength single-photon detection,” Opt. Express 21(7), 8904–8915 (2013). [CrossRef]

12. A. M. Pawlikowska, A. Halimi, R. A. Lamb, and G. S. Buller, “Single-photon three-dimensional imaging at up to 10 kilometers range,” Opt. Express 25(10), 11919–11931 (2017). [CrossRef]

13. Z.-P. Li, X. Huang, Y. Cao, B. Wang, Y.-H. Li, W. Jin, C. Yu, J. Zhang, Q. Zhang, and C.-Z. Peng, “Single-photon computational 3d imaging at 45 km,” arXiv preprint arXiv:1904.10341 (2019).

14. A. Kirmani, D. Venkatraman, D. Shin, A. Colaço, F. N. Wong, J. H. Shapiro, and V. K. Goyal, “First-photon imaging,” Science 343(6166), 58–61 (2014). [CrossRef]

15. J. Rapp and V. K. Goyal, “A few photons among many: Unmixing signal and noise for photon-efficient active imaging,” IEEE Trans. Comput. Imaging 3(3), 445–459 (2017). [CrossRef]

16. Y. Altmann, R. Aspden, M. Padgett, and S. McLaughlin, “A bayesian approach to denoising of single-photon binary images,” IEEE Trans. Comput. Imaging 3(3), 460–471 (2017). [CrossRef]

17. F. Heide, S. Diamond, D. B. Lindell, and G. Wetzstein, “Sub-picosecond photon-efficient 3d imaging using single-photon sensors,” Sci. Rep. 8(1), 17726 (2018). [CrossRef]

18. A. Gupta, A. Ingle, A. Velten, and M. Gupta, “Photon-flooded single-photon 3d cameras,” in Proc. CVPR, (2019).

19. M. O’Toole, F. Heide, D. B. Lindell, K. Zang, S. Diamond, and G. Wetzstein, “Reconstructing transient images from single-photon sensors,” in Proc. CVPR, (2019).

20. S. Xin, S. Nousias, K. N. Kutulakos, A. C. Sankaranarayanan, S. G. Narasimhan, and I. Gkioulekas, “A theory of fermat paths for non-line-of-sight shape reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2019), pp. 6800–6809.

21. D. Shin, F. Xu, D. Venkatraman, R. Lussana, F. Villa, F. Zappa, V. K. Goyal, F. N. Wong, and J. H. Shapiro, “Photon-efficient imaging with a single-photon camera,” Nat. Commun. 7(1), 12046 (2016). [CrossRef]

22. D. Shin, A. Kirmani, V. K. Goyal, and J. H. Shapiro, “Photon-efficient computational 3-d and reflectivity imaging with single-photon detectors,” IEEE Trans. Comput. Imaging 1(2), 112–125 (2015). [CrossRef]

23. D. B. Lindell, M. O’Toole, and G. Wetzstein, “Single-photon 3d imaging with deep sensor fusion,” ACM Trans. Graph. 37(4), 1–12 (2018). [CrossRef]

24. C. Ti, R. Yang, J. Davis, and Z. Pan, “Simultaneous time-of-flight sensing and photometric stereo with a single tof sensor,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),, (2015).

25. J. Diebel and S. Thrun, “An application of markov random fields to range sensing,” in Advances in neural information processing systems, (2006), pp. 291–298.

26. D. Ferstl, C. Reinbacher, R. Ranftl, M. Rüther, and H. Bischof, “Image guided depth upsampling using anisotropic total generalized variation,” in Proceedings of the IEEE International Conference on Computer Vision, (2013), pp. 993–1000.

27. J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, “Joint bilateral upsampling,” ACM Trans. Graph. 26(3), 96 (2007). [CrossRef]

28. P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, “Rgb-d mapping: Using kinect-style depth cameras for dense 3d modeling of indoor environments,” The Int. J. Robotics Res. 31(5), 647–663 (2012). [CrossRef]

29. J. Qiu, Z. Cui, Y. Zhang, X. Zhang, S. Liu, B. Zeng, and M. Pollefeys, “Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image,” (2018).

30. Y. Zhang and T. Funkhouser, “Deep depth completion of a single rgb-d image,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

31. F. Ma, G. V. Cavalheiro, and S. Karaman, “Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera,” (2018).

32. J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in 2017 International Conference on 3D Vision (3DV), (2017), pp. 11–20.

33. A. Eldesokey, M. Felsberg, and F. S. Khan, “Confidence propagation through cnns for guided sparse depth regression,” IEEE Trans. Pattern Anal. Mach. Intell.1 (2019).

34. M. Jaritz, R. D. Charette, E. Wirbel, X. Perrotton, and F. Nashashibi, “Sparse and dense data with cnns: Depth completion and semantic segmentation,” in 2018 International Conference on 3D Vision (3DV), (2018), pp. 52–60.

35. T.-W. Hui, C. C. Loy, and X. Tang, “Depth map super-resolution by deep multi-scale guidance,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, eds. (Springer International Publishing, Cham, 2016), pp. 353–369.

36. I. Eichhardt, D. Chetverikov, and Z. Jankó, “Image-guided tof depth upsampling: a survey,” Mach. Vis. Appl. 28(3-4), 267–282 (2017). [CrossRef]

37. Y. Wen, B. Sheng, P. Li, W. Lin, and D. D. Feng, “Deep color guided coarse-to-fine convolutional network cascade for depth image super-resolution,” IEEE Trans. on Image Process. 28(2), 994–1006 (2019). [CrossRef]

38. Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep joint image filtering,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, eds. (Springer International Publishing, Cham, 2016), pp. 154–169.

39. I. Alhashim and P. Wonka, “High quality monocular depth estimation via transfer learning,” (2018).

40. A. Gupta, A. Ingle, A. Velten, and M. Gupta, “Photon-flooded single-photon 3d cameras,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2019).

41. A. Ingle, A. Velten, and M. Gupta, “High flux passive imaging with single-photon sensors,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2019).

42. E. Charbon, “Single-photon imaging in complementary metal oxide semiconductor processes,” Phil. Trans. R. Soc. A 372(2012), 20130100 (2014). [CrossRef]

43. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” (2017).

44. F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mb model size,” arXiv preprint arXiv:1602.07360 (2016).

45. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, eds. (Springer Berlin Heidelberg, Berlin, Heidelberg, 2012) pp. 746–760.

46. S. Burri, C. Bruschini, and E. Charbon, “Linospad: a compact linear spad camera system with 64 fpga-based tdc modules for versatile 50 ps resolution time-resolved imaging,” Instruments 1(1), 6 (2017). [CrossRef]

47. S. Burri, H. Homulle, C. Bruschini, and E. Charbon, “Linospad: a time-resolved 256x1 cmos spad line sensor system featuring 64 fpga-based tdc channels running at up to 8.5 giga-events per second,” in Optical Sensing and Detection IV, vol. 9899 (International Society for Optics and Photonics, 2016), p. 98990D.

48. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, (Springer, 2015), 234–241.

49. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” (2014).

50. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018). [CrossRef]

51. K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision, (Springer, 2016), 630–645.

Methods	patch size	$δ < 1.25 ↑$	$δ < {1.25}^{2} ↑$	$δ < {1.25}^{3} ↑$	RMSE (cm) $↓$	Abs Rel $↓$
DORN [6]	-	0.881	0.976	0.993	53.7	0.117
Rapp et al. [15]	-	0.965	0.986	0.994	43.9	0.032
Repeat (Linear) [23]	$64 \times 64$	0.935	0.952	0.960	72.1	0.058
Repeat (Log)	$128 \times 128$	$\underline{0.993}$	$\underline{0.998}$	$\underline{0.999}$	$\underline{18.2}$	$\underline{0.011}$
SPADnet (Linear)	$64 \times 64$	0.970	0.993	0.998	34.7	0.031
SPADnet (Log)	$128 \times 128$	$0.996$	$0.999$	$0.999$	$14.4$	$0.010$

Loss functions	patch size	RMSE (cm)	Abs Rel
KL loss	$128 \times 128$	21.9	0.010
MSE loss	$128 \times 128$	17.5	0.013
OR loss	$128 \times 128$	$14.4$	$0.010$

Monocular depth networks	Monocular depth RMSE (cm)	SPADnet RMSE (cm)
DenseDepth [39]	71.2	13.3
DORN [6]	53.7	14.4

Patch size	RMSE (cm)	Abs Rel
$64 \times 64$	25.9	0.016
$128 \times 128$	14.4	0.010

Methods	patch size	$δ < 1.25 ↑$	$δ < {1.25}^{2} ↑$	$δ < {1.25}^{3} ↑$	RMSE (cm) $↓$	Abs Rel $↓$
DORN [6]	-	0.881	0.976	0.993	53.7	0.117
Rapp et al. [15]	-	0.965	0.986	0.994	43.9	0.032
Repeat (Linear) [23]	$64 \times 64$	0.935	0.952	0.960	72.1	0.058
Repeat (Log)	$128 \times 128$	$\underline{0.993}$	$\underline{0.998}$	$\underline{0.999}$	$\underline{18.2}$	$\underline{0.011}$
SPADnet (Linear)	$64 \times 64$	0.970	0.993	0.998	34.7	0.031
SPADnet (Log)	$128 \times 128$	$0.996$	$0.999$	$0.999$	$14.4$	$0.010$

SPADnet: deep RGB-SPAD sensor fusion assisted by monocular depth estimation

Abstract

1. Introduction

2. Image formation model

3. SPADnet sensor fusion model

3.1 Network architecture

3.2 Objective function

3.3 Log-scale depth binning

4. Results

4.1 Simulated data

4.2 Captured data

4.3 Ablation study

4.3.1 Loss function

4.3.2 Monocular depth networks

5. Discussion

6. Conclusions

Appendix A Implementation details

Appendix B Patch sizes

Appendix C Additional comparisons on simulated dataset

Funding

Acknowledgments

Disclosures

References

Supplementary Material (1)

Cited By

Figures (8)

Tables (3)

Equations (7)

Optics Express