
PE-RASP: range image stitching of photon-efficient imaging through reconstruction, alignment, stitching integration network based on intensity image priors

Open Access

Abstract

Single photon imaging integrates advanced single photon detection technology with Laser Radar (LiDAR) technology, offering heightened sensitivity and precise time measurement. This approach finds extensive applications in biological imaging, remote sensing, and non-line-of-sight imaging. Nevertheless, current single photon LiDAR systems encounter challenges such as low spatial resolution and a limited field of view in their intensity and range images due to constraints in the imaging detector hardware. To overcome these challenges, this study introduces a novel deep learning image stitching algorithm tailored for single photon imaging. Leveraging the robust feature extraction capabilities of neural networks and the richer feature information present in intensity images, the algorithm stitches range images based on intensity image priors. This innovative approach significantly enhances the spatial resolution and imaging range of single photon LiDAR systems. Simulation and experimental results demonstrate the effectiveness of the proposed method in generating high-quality stitched single-photon intensity images, and the range images exhibit comparable high quality when stitched with prior information from the intensity images.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Active optical imaging is a technique that employs active laser illumination to acquire information from the target scene [1,2]. It finds extensive applications across diverse domains, such as ghost imaging [3,4], laser ranging [5,6], scattering imaging [7,8], and remote sensing [9,10]. LiDAR stands out as a prominent active optical imaging system, with the detection range being a crucial metric for assessing its performance. Traditional avalanche photodiode (APD) detectors exhibit low sensitivity, falling short of the requirements for extended-range detection. Single photon detection, characterized by single-photon sensitivity, high signal gain, rapid response, and compact dimensions, emerges as a vital technological approach for achieving long-range detection in LiDAR systems [11,12]. However, single-photon avalanche diodes (SPADs) only produce binary (0 or 1) signals. Moreover, background noise, dark count noise, and thermal noise contribute to a high incidence of false detections in SPAD results. Consequently, achieving high-quality scene intensity and range reconstructions under conditions of sparse sampling poses a formidable challenge in single-photon LiDAR research. To address these challenges, Shin et al. [13] established an SPAD trigger probability model to mitigate Poisson noise on a per-pixel basis. Peng et al. [14] proposed an end-to-end reconstruction network with a denoising module as the core architecture, introducing non-local attention modules to extract long-range temporal and spatial correlation features from SPAD detection data. Yang et al. [15] designed a convolutional encoder to directly reconstruct incoming 3D data into range and intensity images, utilizing the NLSA [16] module to extract non-local features with lower computational overhead. With the continuous refinement of deep learning-based photon-efficient imaging reconstruction techniques, single-photon LiDAR systems can now deliver high-quality, high-precision, and high-speed intensity and range images.

Nevertheless, a noteworthy challenge currently confronting single-photon LiDAR is the issue of low imaging resolution [17]. Due to constraints in device manufacturing processes, the single-photon detection arrays typically incorporate a limited number of pixels. Consequently, this leads to diminished spatial resolution and a restricted field of view for both intensity and range images within single-photon LiDAR systems. In such scenarios, single-photon LiDAR systems can only offer rudimentary contour information of the target scene, lacking finer scene details. This limitation significantly complicates downstream tasks such as target detection and recognition. Consequently, achieving a balance between a wide field of view and high resolution without compromising imaging time, system power consumption, and size has emerged as a crucial research focus for advancing single-photon LiDAR systems.

Motivated by the success of deep learning-based image stitching in generating high-resolution, wide-field images [18,19], this study introduces a deep learning image stitching algorithm tailored for single-photon imaging, with the objective of significantly augmenting the spatial resolution and imaging range of single-photon LiDAR systems. The paper outlines a module for intensity and range image reconstruction that leverages the three-dimensional structure of single-photon detection data to reconstruct high-quality intensity and range images of the target scene. Notably, the reconstructed intensity images exhibit richer scene detail than their range counterparts. To ensure image alignment, a homography transformation matrix prediction network module is devised for single-photon intensity images, capitalizing on the rich prior knowledge contained in intensity images reconstructed from single-photon LiDAR data. Acknowledging the lower resolution and limited feature information in single-photon reconstructed images, a multi-scale feature fusion network (M-SFFN) module is introduced. This module adopts the design philosophy of residual networks (ResNet) and employs a Unet-like structure with lateral connections for feature fusion, significantly enhancing feature extraction without increasing the computational complexity of the original model. Operating at multiple scales, it estimates the homography that aligns the two input single-photon reconstructed images, and this homography is then applied to the corresponding range images. Building on the aligned images, a multi-scale progressive fusion module for single-photon range images is crafted. This module combines a Unet encoder-decoder network with a dense residual network (DRN) in stages, fully exploiting information across all image levels to improve the information and gradient flow within the network; this facilitates smoother, more stable training at a lower computational cost than ResNet. Simulation and experiments validate that the proposed method reconstructs stitched intensity and range images with a resolution of at least 100 × 100 from a SPAD device with a resolution of 64 × 64. The proposed image stitching network model demonstrates a distinct advantage in stitching quality over state-of-the-art methods, making it particularly well-suited for intensity and range image stitching in single-photon LiDAR systems, especially in scenarios with a low signal-to-background ratio (SBR), such as 0.01.

In conclusion, this work makes a three-fold contribution:

  • 1) An inventive approach to enhance the spatial resolution of single-photon imaging is introduced, employing image stitching to augment the imaging field of view and spatial resolution.
  • 2) During the alignment phase of single-photon reconstruction images, a multi-scale feature fusion network is devised to boost the network's feature extraction capabilities, thereby enhancing the network's alignment performance.
  • 3) A novel method for stitching range images with intensity images as priors is proposed, leading to high-quality stitching of range images.

2. Related work

2.1 Photon-efficient imaging reconstruction

Photon-efficient imaging has garnered increased attention in recent years. Kirmani et al. [20] introduced a pioneering First-Photon Imaging (FPI) system, employing a raster-scanned pulsed laser to illuminate the scene. This system utilizes the first photon received at each pixel by the detector to reconstruct the 3D structure and reflectance of the image, marking a breakthrough in few-photon imaging. Subsequently, Shin et al. [13] expanded the FPI framework from point-wise scanning to array-based scenarios, significantly enhancing the efficiency of photon imaging. They also devised a single-photon reconstruction algorithm suitable for array systems, capable of recovering scene range and intensity under low light conditions by leveraging the lateral smoothness and longitudinal sparsity inherent in natural scenes [21]. Additionally, Rapp et al. [22] addressed the prevalent high background noise in real-world imaging scenarios by proposing a novel method for range and intensity estimation. This method emphasizes the separation of signal and noise source contributions, markedly enhancing imaging robustness.

With the advancement of deep learning technology, deep learning-based single-photon imaging has emerged as a novel technological approach to address challenges in photon-efficient and rapid imaging. Lindell et al. [23] introduced a deep learning-based photon-efficient 3D imaging algorithm that combines photon arrival times from a low-resolution single-photon detection array with high-resolution intensity images from traditional cameras to generate high-resolution range maps. This approach enhances the reconstruction performance of single-photon LiDAR systems in low-light conditions. Peng et al. [14] designed an end-to-end deep neural network incorporating non-local modules to extract long-range spatiotemporal correlations from 3D digital signals detected by Single Photon Avalanche Diodes (SPADs). They also integrated a noise prior module to improve reconstruction performance, enabling the simultaneous recovery of range and intensity images in scenarios with extremely low photon counts and low signal-to-background ratios (SBR). Yang et al. [15] devised a convolutional encoder to directly reconstruct incoming 3D data into range and intensity images. They utilized an NLSA module to extract non-local features with low computational cost, further contributing to the field of photon-efficient imaging.

Considering the collective strengths of the aforementioned methods in photon-efficient imaging, encompassing the efficiency of First-Photon Imaging (FPI), the rapid scanning capabilities of array imaging, the accuracy of single-photon reconstruction algorithms, and the potential of deep learning techniques, they collectively confront a common challenge — the inherent limitation of relatively low imaging resolution. This suggests that achieving image quality on par with conventional high-resolution imaging techniques remains elusive. Consequently, enhancing imaging resolution remains a central hurdle for these methods to overcome.

2.2 Image stitching

Conventional image stitching methods typically employ a global homography [24] approach, mapping points from one image to corresponding points in another through a 2D transformation. The goal is to create a seamless composite image from distinct ones. However, this technique encounters challenges in complex scenes, leading to artifacts or geometric distortion. It assumes that a single transformation can describe the entire image, which proves impractical, particularly in scenarios with significant parallax or variations.

Several approaches have been proposed to address these challenges. Gao et al. [25] introduced Double Homography Warping (DHW) to separately align foreground and background, reducing artifacts to some extent but facing difficulties in complex scenes. Lin et al. [26] presented a method based on a smooth affine field, enhancing disparity tolerance and improving local deformation and alignment. Zaragoza et al. [27] proposed an “As-Projective-As-Possible” (APAP) method, dividing images into dense grids and assigning corresponding homography to each grid by weighting features. Chang et al. [28] introduced Shape-Preserving Half Projective (SPHP) stitching, extending the projection transformation smoothly to non-overlapping regions to reduce geometric distortion. Lin et al. [29] combined APAP and SPHP, presenting an “as-natural-as-possible” (AANAP) warping for more natural stitching effects.

While these feature-based traditional algorithms contribute to high-quality stitched images, they heavily rely on feature detection and struggle in scenarios with lower resolution and fewer features. Deep learning methods for image stitching have gained attention due to the strong feature extraction capabilities of CNN networks. Homography estimation, a critical aspect of image stitching, has been addressed by DeTone et al. [30], who introduced a deep homography method using a VGG-style network to predict image offsets for homography determination. Nie et al. [31–33] built on this, introducing a View-Free Image Stitching Network (VFISNet) capable of arbitrary view image stitching within a comprehensive deep learning framework. They further presented the Edge-Preserved Image Stitching Network (EPISNet), offering input image resolution flexibility and introducing the first unsupervised deep learning image stitching framework. Seam-driven image stitching methods, focusing on optimizing seams in images to achieve natural and seamless stitching, have also shown promise. Gao et al. [34] proposed a seam loss for optimal stitching through homography, and Zhang et al. [35] combined content-preserving warping optimization strategies with a seam-based local alignment method.

While deep learning methods exhibit potential in general image stitching, they often prove unsuitable for the stitching of single-photon reconstruction results. Thus, there is a need to redesign networks and methods to align with the unique features of single-photon reconstruction results.

3. Method

3.1 Forward model

Figure 1 illustrates the schematic of a pulsed LiDAR imaging system based on SPAD. The target object is illuminated by a pulsed laser, denoted as s(t), with a period of T. To prevent range aliasing, it is necessary to ensure T > 2zmax/c during detection, where zmax represents the maximum distance to the target, and c is the speed of light. Under conditions of low photon flux, the probability of photons reaching the detector is relatively low; because photons arriving during the dead time have a negligible effect on the statistics, the dead time can be ignored. d represents the dark count rate of the single-photon detector, bλ represents the background photon flux, and η represents the quantum efficiency of the detector. Combining the above analysis, the photon detection process implemented by the SPAD manifests as a non-uniform Poisson process with a rate function [13] under low-flux conditions:

$${\omega _{x,y}}(t) = \eta {\alpha _{x,y}}s({t - 2{d_{x,y}}/c} )+ \eta {b_\lambda } + d$$

Here, αx,y represents the intensity of the target object and dx,y is the range of the target. During the detection process, the probability mass function of the Poisson process for the number of photon arrivals within the time interval [0, T) in response to a single illumination pulse is as follows:

$$p(k) = \frac{{n{{(0,T)}^k}\exp [ - n(0,T)]}}{{k!}},k = 0,1,\ldots ,$$
where k represents the number of detected photons, $n(0,T) = \int_0^T {{\omega _{x,y}}(\tau )} d\tau$. The count of photons with time-of-flight information returning from the target location and the count of photons caused by background light noise and dark counts are respectively represented as S and B, with $S = \int_0^T {s({t - 2{d_{x,y}}/c} )dt}$ and $B = \int_0^T {({\eta {b_\lambda } + d} )dt} = ({\eta {b_\lambda } + d} )T$. Therefore, the photon flux n(0, T) can be expressed as $n(0,T) = \int_0^T {{\omega _{x,y}}(\tau )} d\tau = \eta {\alpha _{x,y}}S + B$. In long-range imaging, the echo signals are extremely weak; therefore, the low-flux condition $\eta {\alpha _{x,y}}S + B \ll 1$ holds. For any target position (x, y), within a pulse period, the detector's response probability follows the 0-1 distribution:
$${P_0}(x,y) = \exp [{ - ({\eta {\alpha_{x,y}}S + B} )} ]$$
$${P_1}(x,y) = 1 - {P_0}(x,y)$$

Fig. 1. Pulsed LiDAR imaging system based on SPAD.

Equations (3) and (4) indicate that for each emitted pulse, the detector fails to detect a photon event with probability P0 and detects a photon event with probability (1 − P0). For a detection process with a pulse count of N, detection across the N pulses is statistically independent, and the number of photons detected at pixel (x, y) of the detector array follows a binomial distribution. The probability mass function for the number of detected photons Kx,y is:

$$P[{K_{x,y}} = {k_{x,y}};{\alpha _{x,y}}] = \left( \begin{array}{l} N\\ {k_{x,y}} \end{array} \right){P_0}{(x,y)^{N - {k_{x,y}}}}{[{1 - {P_0}(x,y)} ]^{{k_{x,y}}}}$$

For each detected photon, the time of the photon detection event follows the distribution described below:

$$f(t;x,y) = \frac{{{\omega _{x,y}}(t)\exp \left[ { - \int_0^T {{\omega_{x,y}}(\tau )} d\tau } \right]}}{{1 - {P_0}(x,y)}}$$

Due to low photon flux conditions, the term $\exp \left[ { - \int_0^T {{\omega_{x,y}}(\tau )} d\tau } \right] = \exp [{ - ({\eta {\alpha_{x,y}}S + B} )} ]$ in Eq. (6) can be approximated as 1. Equation (6) can be expressed as:

$$\begin{aligned} f(t;x,y) &= \frac{{{\omega _{x,y}}(t)}}{{1 - {P_0}(x,y)}}\mathop = \limits^{(i)} \frac{{\eta {\alpha _{x,y}}s({t - 2{d_{x,y}}/c} )+ \eta {b_\lambda } + d}}{{\eta {\alpha _{x,y}}S + B}}\\ &= \frac{{\eta {\alpha _{x,y}}S}}{{\eta {\alpha _{x,y}}S + B}}\left( {\frac{{s({t - 2{d_{x,y}}/c} )}}{S}} \right) + \frac{B}{{\eta {\alpha _{x,y}}S + B}}({1/T} )\end{aligned}$$

The condition for the equality to hold at position (i) is $\eta {\alpha _{x,y}}S + B \ll 1$, in which case $1 - \exp [ - (\eta {\alpha _{x,y}}S + B)] \approx \eta {\alpha _{x,y}}S + B$. Equation (7) illustrates that the time-of-flight distribution recorded by the single-photon detector comprises two components. The first term of Eq. (7) is the echo signal reflected from the target scene, which follows the time distribution $s(t - 2{d_{x,y}}/c)$. The last term corresponds to background noise, which is uniformly distributed in time. Equation (7) also shows that the data obtained from single-photon detection encode both the range information and the intensity information of the target.
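As a concrete illustration of the forward model in Eqs. (1)–(7), the sketch below simulates the timing histogram recorded at one SPAD pixel: the number of detections follows the binomial model of Eq. (5), and each detection time is drawn from the signal/background mixture of Eq. (7). The Gaussian pulse shape and all numeric parameter values are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

T_BINS   = 1024        # time bins per period, matching the 64 x 64 x 1024 data cubes
N_PULSES = 200         # illumination pulses per pixel (assumed)
ETA      = 0.3         # detector quantum efficiency (assumed)
SIGMA    = 2.0         # pulse width in bins, assuming a Gaussian pulse s(t)

def simulate_pixel(alpha, d_bin, S=1.0, B=0.02):
    """Return a timing histogram for one pixel with reflectivity alpha and depth bin d_bin."""
    p_sig = ETA * alpha * S                      # mean signal detections per pulse
    p1 = 1.0 - np.exp(-(p_sig + B))              # Eq. (4): per-pulse detection probability
    k = rng.binomial(N_PULSES, p1)               # Eq. (5): number of detections over N pulses
    w_sig = p_sig / (p_sig + B)                  # Eq. (7): signal fraction of each detection
    is_sig = rng.random(k) < w_sig
    t_sig = rng.normal(d_bin, SIGMA, size=k)     # signal photons follow s(t - 2d/c)
    t_bg = rng.uniform(0, T_BINS, size=k)        # background photons are uniform in time
    times = np.where(is_sig, t_sig, t_bg).clip(0, T_BINS - 1).astype(int)
    return np.bincount(times, minlength=T_BINS)

hist = simulate_pixel(alpha=0.6, d_bin=300)
print(hist.sum(), hist.argmax())
```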

In previous works, the Maximum Likelihood Estimation (MLE) method [36] has been employed to estimate the intensity and range information of a target. The intensity and range information of the target at position (x, y) can be represented as:

$$\alpha _{x,y}^{ML} = \max \left\{ {\frac{1}{{\eta S}}\left[ {\log \left( {\frac{N}{{N - {k_{x,y}}}}} \right) - B} \right],0} \right\}$$
$$d_{x,y}^{ML} = \arg \max \sum\limits_{l = 1}^{{k_{x,y}}} {\log [{s({t_{x,y}^{(l)} - 2{d_{x,y}}/c} )} ]}$$
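For reference, a minimal sketch of the per-pixel ML estimators of Eqs. (8) and (9) is given below, operating on the kind of timing histogram produced by the forward-model sketch above. The Gaussian pulse shape and all parameter values are assumptions; the paper replaces these estimators with a learned network.

```python
import numpy as np

def mle_intensity(k, N=200, eta=0.3, S=1.0, B=0.02):
    """Eq. (8): invert the per-pulse detection probability, clamped at zero."""
    return max((np.log(N / (N - k)) - B) / (eta * S), 0.0)

def mle_range(hist, sigma=2.0):
    """Eq. (9): maximize the log-likelihood of the detection times over candidate depth bins."""
    T = len(hist)
    t = np.arange(T)
    scores = [float((hist * (-0.5 * ((t - d) / sigma) ** 2)).sum()) for d in range(T)]
    return int(np.argmax(scores))   # for a Gaussian pulse this reduces to the photon-time centroid

# toy histogram: a few signal counts clustered near bin 300
hist = np.zeros(1024, dtype=int)
hist[[298, 300, 301, 302]] = [1, 3, 2, 1]
print(mle_intensity(hist.sum()), mle_range(hist))   # background counts would bias mle_range
```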

Differing from conventional methods, the proposed method employs a Deep Learning (DL) network to reconstruct the range image $X_{range}^i$ and intensity image $X_{intensity}^i$. The reconstruction process can be represented as follows:

$$\{{X_{range}^i,X_{intensity}^i} \}= D{L_{photon}}({M|\theta = {\theta^\ast }} )$$
Here, DLphoton denotes the trained neural network, M represents the data obtained from single-photon detection, θ represents the parameters of the neural network, and θ* represents the parameters after training. After the range and intensity image reconstruction stage, a deep learning image stitching network is employed to stitch the reconstructed single-photon range and intensity images. This process is divided into an alignment stage and a stitching stage.

In the alignment stage, the intensity image reconstructed from single photon data is used to estimate the homography transformation matrix H between a pair of images with different field of view. The estimation process is as follows:

$$H = D{L_{align}}(X_{intensity}^t,X_{intensity}^r\textrm{|}\theta = {\theta ^\ast })$$
$X_{intensity}^t$ and $X_{intensity}^r$ represent the intensity target image and the reference image to be stitched, and it is necessary to warp the target image to align it with the reference image in pixel space. After obtaining the transformation matrix, it is applied to the target range image $X_{range}^t$ to obtain the warped image $X_{range}^{wt}$ aligned with the reference image $X_{range}^r$.

During the alignment stage, the concept of the Minimum Extended Image Space is introduced, defined as the smallest rectangular boundary of the stitched image. The coordinates of the four vertices of the warped target image can be expressed as:

$$(x_{wt}^k,y_{wt}^k) = (x_t^k,y_t^k) + (\Delta {x_k},\Delta {y_k}),k \in \{{1,2,3,4} \}$$

Here, $(x_{wt}^k,y_{wt}^k)$ and $(x_t^k,y_t^k)$ are the coordinates of the k-th vertex of the warped target image and the target image, respectively. $(\Delta {x_k},\Delta {y_k})$ represent the coordinate offsets of the k-th vertex of the target image estimated by the alignment network. The size of the warped and aligned image $(h \times w)$ can be obtained as follows:

$$\left\{ \begin{array}{l} h = \mathop {\max }\limits_{k \in \{{1,2,3,4} \}} \{{y_{wt}^k,y_r^k} \}- \mathop {\min }\limits_{k \in \{{1,2,3,4} \}} \{{y_{wt}^k,y_r^k} \}\\ w = \mathop {\max }\limits_{k \in \{{1,2,3,4} \}} \{{x_{wt}^k,x_r^k} \}- \mathop {\min }\limits_{k \in \{{1,2,3,4} \}} \{{x_{wt}^k,x_r^k} \}\end{array} \right.$$
where $(x_r^k,y_r^k)$ represents the vertex coordinates of the reference image, which coincide with $(x_t^k,y_t^k)$. The final warped and aligned single-photon range images $X_{range}^{wt}$ and $X_{range}^{wr}$ can be obtained from the input images $X_{range}^t$ and $X_{range}^r$ according to:
$$\left\{ \begin{array}{l} X_{range}^{wt} = W(X_{range}^t,H)\\ X_{range}^{wr} = W(X_{range}^r,E) \end{array} \right.$$
H and E represent the network-estimated homography matrix and the identity matrix, respectively. $W({\cdot} )$ denotes the use of a 3 × 3 transformation matrix to map the warped image to the Minimum Extended Image Space $(h \times w)$.
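A small numpy sketch of the Minimum Extended Image Space computation of Eqs. (12) and (13) is given below: the four target-image corners are mapped through the estimated homography, and the canvas size is the bounding box of the warped-target and reference corners. The example homography matrix is made up; in the paper, H comes from the alignment network.

```python
import numpy as np

def warp_corners(H, h, w):
    """Map the four corners of an h x w image through a 3x3 homography."""
    corners = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], float)
    homog = np.c_[corners, np.ones(4)] @ H.T        # to homogeneous coordinates
    return homog[:, :2] / homog[:, 2:3]             # back to (x, y)

def extended_canvas(H, h, w):
    """Eq. (13): size of the minimum rectangle containing both warped images."""
    warped = warp_corners(H, h, w)                  # target corners warped by H
    ref = warp_corners(np.eye(3), h, w)             # reference warped with the identity E
    pts = np.vstack([warped, ref])
    canvas_w = pts[:, 0].max() - pts[:, 0].min()
    canvas_h = pts[:, 1].max() - pts[:, 1].min()
    return int(np.ceil(canvas_h)), int(np.ceil(canvas_w))

H_example = np.array([[1.0, 0.02, 30.0],
                      [0.01, 1.0, 5.0],
                      [1e-5, 0.0, 1.0]])
print(extended_canvas(H_example, 128, 128))         # (canvas height, canvas width)
```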

In the stitching stage, the stitching module DLstitching with well-trained parameters θ* is utilized to stitch and fuse the warped and aligned single-photon range images $X_{range}^{wt}$ and $X_{range}^{wr}$. The final stitching result has the characteristics of high spatial resolution and a wide field of view. The stitching process can be represented as follows:

$$Stitch\_result = D{L_{stitching}}(X_{range}^{wt},X_{range}^{wr}\textrm{|}\theta = {\theta ^\ast })$$

3.2 Network architecture

The primary framework of the proposed wide-field, high-resolution single-photon reconstruction network model consists of three modules: the intensity and range image reconstruction module (IRR), the unsupervised single-photon reconstruction image alignment module (UIA), and the unsupervised single-photon image stitching module (UIS). The overall architecture of the network model is shown in Fig. 2.

Fig. 2. The overall network framework of PE-RASP.

During the intensity and range image reconstruction stage, a pair of 3D scene data cubes with different fields of view, detected by the single-photon detector, serves as the input to the GSA-Encoder network module. Through the reconstruction module, the corresponding intensity and range images are obtained. In the unsupervised single-photon reconstruction image alignment stage, the pair of reconstructed intensity images is fed into the M-SFFN network, which extracts fused features that are then used to estimate the homography matrix. The estimated homography matrix is applied to the corresponding reconstructed range images, resulting in aligned range images and content masks for fusion. In the unsupervised single-photon reconstruction image stitching stage, deep VGG features are first used for content concatenation reconstruction through an encoding-decoding structure, yielding a low-resolution stitched image. Subsequently, shallow VGG features are used as constraints, and high-resolution image reconstruction is performed through a densely connected residual network structure, producing the final stitched result.

3.2.1 Single-photon intensity and range image reconstruction

Inspired by the BSRN [37] super-resolution network and the application of the Global Self Attention (GSA) [38] module, an IRR module is proposed in this paper. The IRR module consists of an encoder network with a residual [39] structure as the backbone, integrating the GSA module. This network is used to reconstruct range and intensity images from the 3D data captured by the SPAD. The structure of the reconstruction network is depicted in Fig. 3.

Fig. 3. The intensity and range images reconstruction network structure.

In the network, the input is three-dimensional data with dimensions H × W × 1024, where 1024 is the time dimension and H × W is the spatial size. The network treats the 1024 time bins as channels and is composed of GSAR modules. Each GSAR module consists of a GSA module and four ResBlock modules. The GSA module effectively extracts globally related features, leveraging the long-range temporal and spatial correlations in the image data captured by the SPAD. Following the GSA module, four residual modules are cascaded, each consisting of three convolutional layers: 1 × 1, 3 × 3, and 1 × 1. The first 1 × 1 convolution reduces the dimensionality to enhance computational efficiency, while the second 1 × 1 convolution restores the dimension. The input of each residual block is connected through a skip connection to the output of the second 1 × 1 convolution. These modules extract and condense relevant features while reducing the computational workload. The final residual module accomplishes the dimension reduction, and the last layer of the network employs a 3 × 3 convolutional layer to reconstruct the range and intensity images.
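As an illustration of these bottleneck residual blocks, the hedged PyTorch sketch below implements one GSAR-style block: a placeholder for the GSA attention module followed by four cascaded 1 × 1 / 3 × 3 / 1 × 1 residual blocks with skip connections. The reduced channel width and the omission of the actual GSA layer are assumptions; this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),   # 1x1: reduce dimensionality
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1),   # 1x1: restore dimensionality
        )

    def forward(self, x):
        return x + self.body(x)                            # skip connection

class GSARBlock(nn.Module):
    """GSA attention (placeholder) followed by four cascaded ResBlocks."""
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.gsa = nn.Identity()                           # stand-in for the GSA module [38]
        self.res = nn.Sequential(*[ResBlock(channels, reduced) for _ in range(4)])

    def forward(self, x):
        return self.res(self.gsa(x))

x = torch.randn(1, 1024, 32, 32)                           # 1024 time bins treated as channels
print(GSARBlock(1024, 256)(x).shape)                       # small spatial patch for a quick check
```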

3.2.2 Unsupervised single photon reconstruction image alignment

In the UIA module, a Multi-scale Feature Fusion Network (M-SFFNet) is specifically designed to extract multi-scale image features. The network framework for the alignment stage is depicted in Fig. 4.

The pair of reconstructed intensity images $I_{intensity}^T$ and $I_{intensity}^R$ is fed into the network and passes through four convolutional layers for downsampling and feature extraction. The number of filters in each layer is set to 64, 128, 256, and 512, respectively, resulting in scale reductions of 1/2, 1/4, 1/8, and 1/16. Each feature layer at a specific scale is adjusted to 256 channels using a 1 × 1 convolution. Subsequently, the 1/16 scale feature is upsampled and combined with the 1/8 scale feature, both of which have undergone channel adjustment. The process is repeated across the remaining scales. Each fused scale feature is passed through a convolutional layer to produce the multi-scale features FR and FT. As shown in Fig. 4, the fused features at the 1/2, 1/4, and 1/8 scales are selected to form the feature pyramid. A homography matrix is estimated at each layer, and the estimate at the lower level is transmitted to the higher level to continually improve the estimation accuracy. This process enables the prediction of image homography transformations from coarse to fine at the feature level.
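The PyTorch sketch below illustrates one way to realize the multi-scale fusion described above, in the spirit of a feature-pyramid network: four strided convolutions (64/128/256/512 filters), 1 × 1 lateral convolutions to 256 channels, and top-down upsampling with addition. Layer details beyond those named in the text (kernel sizes, activation choices, which scales are returned) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFFN(nn.Module):
    """Four downsampling stages, 1x1 lateral convs to 256 channels, top-down fusion."""
    def __init__(self, widths=(64, 128, 256, 512), mid=256):
        super().__init__()
        chans = (1,) + tuple(widths)
        self.down = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(len(widths)))
        self.lateral = nn.ModuleList(nn.Conv2d(w, mid, 1) for w in widths)
        self.smooth = nn.ModuleList(nn.Conv2d(mid, mid, 3, padding=1) for _ in widths)

    def forward(self, x):
        feats = []
        for stage in self.down:                      # 1/2, 1/4, 1/8, 1/16 scale features
            x = stage(x)
            feats.append(x)
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        fused = [laterals[-1]]                       # start from the coarsest (1/16) level
        for lat in reversed(laterals[:-1]):          # upsample and add, coarse to fine
            fused.append(lat + F.interpolate(fused[-1], size=lat.shape[-2:]))
        fused = list(reversed(fused))                # reorder to 1/2, 1/4, 1/8, 1/16
        return [conv(f) for conv, f in zip(self.smooth, fused)]

pyr = MSFFN()(torch.randn(1, 1, 128, 128))
print([tuple(p.shape) for p in pyr])                 # 64-, 32-, 16- and 8-pixel feature maps
```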

Fig. 4. Unsupervised single-photon reconstruction image alignment network structure diagram.

Feature correlation layers are integrated into the network to improve feature matching and enhance the accuracy of estimated homography. The correlation between the reference image feature Fr and the target image feature Ft can be expressed by Eq. (16).

$${c^l}(x_T^l,x_R^l) = \frac{{\langle F_T^l(x_T^l),F_R^l(x_R^l)\rangle }}{{|{F_T^l(x_T^l)} ||{F_R^l(x_R^l)} |}},\textrm{ }x_T^l,x_R^l \in {{{\mathbb Z}}^2}$$
l corresponds to the level of the feature pyramid, which represents multi-scale features. $x_T^l,x_R^l \in {{{\mathbb Z}}^2}$ respectively represent the 2D spatial coordinates at each scale for the target image feature $F_T^l$ and the reference image feature $F_R^l$. Feature correlation is divided into local correlation and global correlation. Applying both global and local correlations in the alignment network allows the computation of homography from a global-to-local perspective. Following the correlation calculation on the extracted features, a regression network consisting of three convolutional layers and two fully connected layers is introduced to predict the eight coordinate offsets that determine the homography. The Direct Linear Transform (DLT) algorithm is employed to estimate the 3 × 3 homography matrix H, which maps points from one coordinate system to another through a linear transformation, and a warp operation distorts the target feature map using the estimated homography matrix to obtain the warped feature map. Finally, the aligned intensity images $I_{intensity}^{WT}$ and $I_{intensity}^{WR}$ are obtained in the minimum extended image space, along with the corresponding aligned range images $I_{range}^{WT}$ and $I_{range}^{WR}$.
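A hedged sketch of the global correlation of Eq. (16) is shown below: the cosine similarity between every spatial location of the target feature map and every location of the reference feature map, producing a correlation volume for the regression head. Tensor shapes and the absence of the local-correlation variant are assumptions.

```python
import torch
import torch.nn.functional as F

def global_correlation(feat_t: torch.Tensor, feat_r: torch.Tensor) -> torch.Tensor:
    """feat_t, feat_r: (B, C, H, W) feature maps -> correlation volume (B, H*W, H, W)."""
    b, c, h, w = feat_t.shape
    t = F.normalize(feat_t.flatten(2), dim=1)        # (B, C, H*W), unit norm per location
    r = F.normalize(feat_r.flatten(2), dim=1)
    corr = torch.einsum('bct,bcr->btr', t, r)        # <F_T(x_T), F_R(x_R)> / (|F_T||F_R|)
    return corr.reshape(b, h * w, h, w)              # each target location vs. the whole reference map

ft = torch.randn(1, 256, 16, 16)
fr = torch.randn(1, 256, 16, 16)
print(global_correlation(ft, fr).shape)              # torch.Size([1, 256, 16, 16])
```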

3.2.3 Unsupervised single-photon reconstruction image stitching

In the single-photon reconstruction image stitching stage, an improved unsupervised image stitching network is designed to eliminate the stitching artifacts that cannot be resolved by global homography alignment alone. The stitching network consists of two branches: a low-resolution reconstruction branch and a high-resolution refinement branch. The low-resolution branch is dedicated to learning the deformation rules of the images, whereas the high-resolution branch focuses on enhancing the resolution of the reconstructed images. The structure of the designed UIS module is depicted in Fig. 5.

Fig. 5. Unsupervised single-photon reconstruction image stitching network structure diagram.

This module employs a progressive stitching approach at different scales. In the low-resolution stitching stage, the aligned images are downsampled to a lower resolution. An encoder-decoder network composed of convolutions, pooling, and upsampling is used to reconstruct the stitched image, with skip connections linking features of the same resolution from lower and higher layers. Content masks and edge masks are applied to constrain the network, facilitating the learning of deformation rules for image stitching. In the high-resolution refinement branch, dense residual blocks (DRBs) are utilized, consisting of dense connection modules, local feature fusion modules, and local residual learning modules to fully exploit information from all levels of the low-resolution image. After passing through M dense connection modules, there are a total of G0 + G(M − 1) feature maps, where G0 is the initial feature dimension and G is the feature dimension growth rate, set to 64 and 32 in the network, respectively. Local feature fusion adaptively combines the obtained feature maps, ensuring that the output of each DRB is fixed at G0 channels, which allows a larger growth rate and more stable training of deep networks. The local residual learning modules further optimize the information flow and gradients, making full use of local features and achieving dense feature fusion. This module effectively enhances the network's representational capacity, resulting in improved image reconstruction performance.

After obtaining the low-resolution stitching result ILS, it is upsampled and pixel-wise overlaid with the aligned result of the same resolution. Shallow features are extracted through convolution, and these shallow features are subsequently input into three DRBs for feature fusion. The output undergoes global residual learning, where the original shallow features are added to the fused features, further enhancing image information and reducing information loss. Finally, a convolution operation is employed to produce the single-channel stitched result IHS.
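The hedged PyTorch sketch below shows one plausible form of the dense residual block described above: M densely connected 3 × 3 convolutions with growth rate G, a 1 × 1 local-feature-fusion convolution back to G0 channels, and local residual learning. G0 = 64 and G = 32 follow the text; M = 4 and the exact channel bookkeeping before fusion are assumptions.

```python
import torch
import torch.nn as nn

class DenseResidualBlock(nn.Module):
    def __init__(self, g0: int = 64, g: int = 32, m: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv2d(g0 + i * g, g, 3, padding=1), nn.ReLU(inplace=True))
            for i in range(m))
        # local feature fusion: a 1x1 conv brings the concatenated features back to G0 channels
        self.fusion = nn.Conv2d(g0 + m * g, g0, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # dense connections
        fused = self.fusion(torch.cat(feats, dim=1))        # local feature fusion
        return x + fused                                    # local residual learning

x = torch.randn(1, 64, 128, 128)
print(DenseResidualBlock()(x).shape)                        # torch.Size([1, 64, 128, 128])
```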

3.2.4 Loss function

In the single-photon image reconstruction stage, a composite loss function is defined using the mean absolute error (MAE) and structural similarity (SSIM) [40] to measure the difference between the 2D network output image and the target image. MAE calculates the error in the image and can be expressed as:

$${L_{MAE}}(X,Y) = \frac{1}{n}\sum\limits_{i = 1}^n {|{{X_i} - {Y_i}} |}$$
Y and X represent the reconstructed image and the target image respectively. n denotes the number of image pixels. To ensure structural similarity between the reconstructed image and the target image, SSIM is introduced as a constraint. The formula can be expressed as:
$${L_{SSIM}}(X,Y) = \frac{{(2{\mu _X}{\mu _Y} + {c_1})(2{\sigma _{XY}} + {c_2})}}{{({\mu _X}^2 + {\mu _Y}^2 + {c_1})({\sigma _X}^2 + {\sigma _Y}^2 + {c_2})}}$$
where ${\mu _X}$ and ${\mu _Y}$ are the means of X and Y, ${\sigma _X}^2$ and ${\sigma _Y}^2$ are the variances of X and Y, and ${\sigma _{XY}}$ is the covariance of X and Y. The SSIM ranges from 0 to 1, with higher values indicating greater similarity between the two images and hence a better-quality 2D reconstruction. Therefore, the loss term is taken as $1 - {L_{SSIM}}(X,Y)$. The total loss in the single-photon image reconstruction stage can be expressed as:
$${L_{SPAD}}\textrm{ = }L_{MAE}^{range} + {\lambda _1}({1 - L_{SSIM}^{range}} )+ {\lambda _2}[{L_{MAE}^{ref} + {\lambda_3}({1 - L_{SSIM}^{ref}} )} ]$$

Here, the values of ${\lambda _1}$, ${\lambda _2}$ and ${\lambda _3}$ are set to 0.2, 0.025, and 0.3, respectively.
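A minimal sketch of the composite reconstruction loss of Eqs. (17)–(19) follows, using the global (single-window) SSIM of Eq. (18) and the weights λ1 = 0.2, λ2 = 0.025, λ3 = 0.3 given above. The stabilizing constants c1, c2 and the possibility that the authors use a windowed SSIM are assumptions.

```python
import torch

def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """Eq. (18) computed over the whole image; c1, c2 are the usual constants for [0, 1] images."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(unbiased=False), y.var(unbiased=False)
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def l_spad(range_pred, range_gt, refl_pred, refl_gt, l1=0.2, l2=0.025, l3=0.3):
    mae = lambda a, b: (a - b).abs().mean()                                   # Eq. (17)
    loss_range = mae(range_pred, range_gt) + l1 * (1 - global_ssim(range_pred, range_gt))
    loss_refl = mae(refl_pred, refl_gt) + l3 * (1 - global_ssim(refl_pred, refl_gt))
    return loss_range + l2 * loss_refl                                        # Eq. (19)

r_hat, r = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
a_hat, a = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(l_spad(r_hat, r, a_hat, a))
```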

In the alignment stage, corresponding to multi-scale homography prediction, the loss is calculated by weighting the network at each scale, which can be expressed by the following:

$$L_{UIA}^l = \sum\limits_l {{\omega _l}{L_1}[I_{warp}^l(overlap),I_{reference}^l(overlap)]}$$
where ${\omega _l}$ is the weight of the lth scale, $I_{warp}^l$ is the prediction at the lth scale, i.e., the warped target image, and $I_{reference}^l$ is the reference image at the lth scale. overlap indicates that a pixel is counted in the L1 loss only when the corresponding masks of both images equal 1, that is, only within the overlapping region. There are three scales in total, and the objective function for a single scale can be expressed more intuitively as follows:
$$L_{UIA}^l = {||{H(E) \odot {I_T} - H({I_R})} ||_1}$$
$H({\cdot} )$ represents warping an image with the estimated homography to align it with another image, and IT and IR represent the target image and the reference image to be aligned, respectively. E is an all-ones matrix of the same size as ${I_T}$, and ${\odot}$ denotes element-wise multiplication. For the three scales, the hyperparameters used during training, ${\omega _1}$, ${\omega _\textrm{2}}$, and ${\omega _\textrm{3}}$, are set to 16, 4, and 1, respectively.
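The sketch below illustrates the overlap-masked, multi-scale L1 objective of Eqs. (20) and (21): the warped target is compared with the reference only where both content masks are 1, and the per-scale losses are combined with the weights 16, 4, and 1. The warping itself and the mapping of weights to scales are outside this sketch and assumed.

```python
import torch

def overlap_l1(warped_t, warped_r, mask_t, mask_r):
    overlap = mask_t * mask_r                      # 1 only where both images have content
    return ((warped_t - warped_r).abs() * overlap).sum() / overlap.sum().clamp(min=1)

def alignment_loss(warped_pyramid, weights=(16, 4, 1)):
    """warped_pyramid: list of (warped_t, warped_r, mask_t, mask_r) tuples, one per scale."""
    return sum(w * overlap_l1(*level) for w, level in zip(weights, warped_pyramid))

# toy pyramid with three scales and fully overlapping masks
pyr = [(torch.rand(1, 1, s, s), torch.rand(1, 1, s, s),
        torch.ones(1, 1, s, s), torch.ones(1, 1, s, s)) for s in (32, 64, 128)]
print(alignment_loss(pyr))
```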

During the image stitching phase, content masks Mrc and Mtc are used to constrain the similarity of image features near the warped areas, and edge masks Mrs and Mts are used to ensure the natural continuity of edges in the overlapping regions of the stitched image. The content masks Mrc and Mtc are obtained by replacing the warped and aligned images $I_{}^{WR}$ and $I_{}^{WT}$ with all-ones matrices of the same size, where $I_{}^{WT} = H({I_T}),I_{}^{WR} = H(E) \odot {I_R}$. The edge masks Mrs and Mts are calculated as follows:

$$\left\{ \begin{array}{l} \Delta {M_{rc}} = |{{M_{rc}}^{(x,y)} - {M_{rc}}^{(x - 1,y)}} |+ |{{M_{rc}}^{(x,y)} - {M_{rc}}^{(x,y - 1)}} |\\ \Delta {M_{tc}} = |{{M_{tc}}^{(x,y)} - {M_{tc}}^{(x - 1,y)}} |+ |{{M_{tc}}^{(x,y)} - {M_{tc}}^{(x,y - 1)}} |\end{array} \right.$$
$$\left\{ \begin{array}{l} {M_{rs}} = {\mathbb C}(\Delta {M_{tc}}\ast {E_{3 \times 3}}\ast {E_{3 \times 3}}\ast {E_{3 \times 3}}) \odot {M_{rc}}\\ {M_{ts}} = {\mathbb C}(\Delta {M_{rc}}\ast {E_{3 \times 3}}\ast {E_{3 \times 3}}\ast {E_{3 \times 3}}) \odot {M_{tc}} \end{array} \right.$$
(x, y) represents the coordinate position, * denotes the convolution operation, and ${\mathbb C}$ signifies clipping elements to the range of 0 to 1. The content loss and edge loss for the low-resolution branch are as follows:
$$L_C^{low} = {L_P}({I_{LS}} \odot {M_{rc}},{I^{WR}}) + {L_P}({I_{LS}} \odot {M_{tc}},{I^{WT}})$$
$$L_S^{low} = {L_1}({I_{LS}} \odot {M_{rs}},{I^{WR}} \odot {M_{rs}}) + {L_1}({I_{LS}} \odot {M_{ts}},{I^{WT}} \odot {M_{ts}})$$

Here, ILS represents the low-resolution stitched image, L1 and LP denote the L1 loss and perceptual loss [41], respectively. The perceptual loss is calculated using the ‘conv5_3’ deep features from the VGG-19 [42] network to focus on global features and reduce feature differences between images. The total loss function for the low-resolution phase can be represented as follows:

$${L_{Low}} = {\kappa _C}L_C^{Low} + {\kappa _S}L_S^{Low}$$

${\kappa _C}$ and ${\kappa _S}$ represent the weighting coefficients for these two losses. In the high-resolution branch, the total loss function can be expressed as follows:

$${L_{High}} = {\kappa _C}L_C^{High} + {\kappa _S}L_S^{High}$$

The low-resolution stitched image ILS is upsampled to the original input resolution IHigh, and the same approach is employed to construct the high-resolution loss functions $L_C^{High}$ and $L_S^{High}$. When calculating the high-resolution perceptual loss LP, the shallow features from the ‘conv3_3’ layer of VGG-19 are used to focus on detail information, which yields sharper outputs.

The high-resolution branch is designed to refine the stitched image and output a clear image, yet it may introduce some artifacts. To eliminate these artifacts, a content consistency loss LCC is employed, expressed as follows:

$${L_{CC}} = {L_1}({I_{LS}},I_H^{128})$$

Here, $I_H^{128}$ is obtained by downsampling the stitching result from the high-resolution output ${I_H}$ to a resolution of 128 × 128, which is the resolution output by the low-resolution branch. Considering all the losses, the total loss for the single-photon reconstruction image stitching stage can be expressed as Eq. (29):

$${L_R} = {\omega _{Low}}{L_{Low}} + {\omega _{High}}{L_{High}} + {\omega _{CC}}{L_{CC}}$$
where ${\omega _{Low}}$, ${\omega _{High}}$ and ${\omega _{CC}}$ denote the weight of each part. In training, ${\kappa _C}$ and ${\kappa _S}$ are set to 1 and 10, and ${\omega _{Low}}$, ${\omega _{High}}$ and ${\omega _{CC}}$ are set to 10, 1 and 1, respectively.
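Referring back to Eqs. (22) and (23), the sketch below shows one way to build the edge masks from the content masks: forward differences, three successive 3 × 3 all-ones convolutions to widen the seam region, clipping to [0, 1], and multiplication by the other image's content mask. The toy masks and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def edge_mask(delta_from: torch.Tensor, gated_by: torch.Tensor) -> torch.Tensor:
    """Both arguments: (B, 1, H, W) binary content masks of the warped images."""
    dx = (delta_from[..., 1:, :] - delta_from[..., :-1, :]).abs()
    dy = (delta_from[..., :, 1:] - delta_from[..., :, :-1]).abs()
    delta = F.pad(dx, (0, 0, 1, 0)) + F.pad(dy, (1, 0, 0, 0))     # Eq. (22): forward differences
    kernel = torch.ones(1, 1, 3, 3)
    for _ in range(3):                                            # E_3x3 * E_3x3 * E_3x3
        delta = F.conv2d(delta, kernel, padding=1)
    return delta.clamp(0, 1) * gated_by                           # Eq. (23): clip and gate

m_tc = torch.zeros(1, 1, 128, 128); m_tc[..., 20:100, 30:110] = 1  # warped target content mask
m_rc = torch.zeros(1, 1, 128, 128); m_rc[..., 0:90, 0:90] = 1      # warped reference content mask
m_rs = edge_mask(m_tc, m_rc)                                       # M_rs per Eq. (23)
print(m_rs.shape, float(m_rs.max()))
```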

4. Comparison, simulation, and ablation experiments

4.1 Training

A stepwise training strategy is employed for the single-photon reconstruction image stitching network. Initially, during the training of the single-photon reconstruction network, SPAD measurements are simulated based on the NYU v2 dataset [43]. Nine different signal-to-background ratio (SBR) conditions are set to simulate various illumination distances, and the input data dimension is 64 × 64 × 1024. A total of 38,624 and 2,372 measurements are generated for training and validation, respectively. The training process involves iterative optimization using the ADAM optimizer [44] with a learning rate of 2 × 10−4, which decays by a factor of 0.95 each epoch. Approximately 20 epochs are required for network training, with a batch size of 4, totaling around 193 k iterations.
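For concreteness, a minimal sketch of the optimizer schedule described above (ADAM, initial learning rate 2 × 10−4, decayed by a factor of 0.95 each epoch) is shown below; the placeholder module and the omitted data loop are not part of the paper.

```python
import torch

model = torch.nn.Conv2d(1024, 2, kernel_size=3, padding=1)   # placeholder for the IRR network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(20):
    # ... iterate over the simulated SPAD measurements, compute L_SPAD, and backpropagate ...
    optimizer.step()        # placeholder for the per-batch updates
    scheduler.step()        # decay the learning rate once per epoch
print(scheduler.get_last_lr())
```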

Subsequently, the module for predicting the homography transformation matrix is trained. To enhance the network's generalization across different scenes, the alignment network is pretrained on a synthetic dataset (Stitched MS-COCO) [31] for 150 epochs using the ADAM optimizer. The batch size is set to 16, with 50,000 pairs of 128 × 128 resolution images used for training and 5,000 pairs used for validation. The initial learning rate is set to 10−4 with exponential decay, with a decay step of 3,125, ultimately decaying to 10−5. For the single-photon reconstruction images, intensity images are extracted from the NYU v2 dataset. Simulated image pairs with a resolution of 256 × 256 are generated, and the network is fine-tuned for 50 epochs. The training dataset consists of 20,000 pairs, and the validation dataset contains 5,000 pairs.

Finally, the image stitching network is trained for 40 epochs using the generated single-photon reconstruction aligned images as inputs. The batch size is set to 16. During training, the ADAM optimizer is employed for iterative optimization, and the learning rate has exponential decay with an initial value of 3 × 10−4.

All of the steps are implemented and trained on PyTorch, and the training process is executed on an NVIDIA RTX 3090.

4.2 Comparison methods

There are several methods utilized for comparison in this study. During the homography estimation phase, the performance of the proposed network is compared against that of a conventional algorithm (SIFT + RANSAC) [45] and other deep learning algorithms (UDHN [46], VFIS [31], UDIS [33]). In the image stitching phase, Global Homography [32], APAP [27], VFIS, and UDIS are used as comparison methods.

4.2.1 Homography estimation performance comparison

To quantitatively assess the performance of the proposed enhanced unsupervised deep homography estimation network, which is based on M-SFFN feature extraction for single-photon reconstruction image alignment, a quantitative comparison is conducted using the synthetic dataset derived from NYU v2. To validate the network's robustness in stitching images with varying overlap ratios, the overlap ratios are set to 50%, 60%, 70%, and 80%. Following the method of Nie et al. [31], 5000 pairs of intensity-based synthetic data are generated for each overlap ratio. This dataset comprises target images along with corresponding vertex coordinate offset values.

In Tables 1 and 2, the image alignment results of the proposed method are compared with those of other methods. The Peak Signal-to-Noise Ratio (PSNR), the Universal Quality Index (UQI) [47], and the Structural Similarity Index Measure (SSIM) are all calculated for the overlapping regions of the aligned images. Additionally, the Root Mean Square Error (RMSE) of the four-point offset between the network's estimates and the actual coordinates is reported in Table 3. The PSNR, UQI, and SSIM can be expressed as follows:

$$\left\{ \begin{array}{l} PSN{R_{overlap}}({I_{wr}},{I_{wt}}) = PSNR(H(E) \odot {I_{wt}},H({I_{wr}}))\\ UQ{I_{overlap}}({I_{wr}},{I_{wt}}) = UQI(H(E) \odot {I_{wt}},H({I_{wr}}))\\ SSI{M_{overlap}}({I_{wr}},{I_{wt}}) = SSIM(H(E) \odot {I_{wt}},H({I_{wr}})) \end{array} \right.$$


Table 1. The PSNR and UQI for the overlapping regions in the intensity-based dataset. Higher values indicate that the image content in the overlapping parts is closer after distortion. The first and second-ranked solutions are marked in red and blue, respectively.


Table 2. The SSIM for the overlapping regions in the intensity-based dataset. The first and second-ranked solutions are marked in red and blue, respectively.


Table 3. 4pt-Homography RMSE (↓). The lower the value, the closer the four-point offset estimated by the network is to the true value. The first and second-ranked solutions are marked in red and blue, respectively.

4.2.2 Comparison of image stitching performance for reconstructed images

In the image stitching phase, a comparative analysis is conducted with other methods to validate the superiority of the proposed network in image stitching. To closely emulate real-world applications of image stitching, a 60% overlap intensity-based synthetic dataset with a resolution of 128 × 128 is chosen to evaluate the capabilities of the stitching network, as illustrated in Table 4 below.

Following the testing methodology proposed by Nie and colleagues [33], “error” signifies the count of program crashes, while “failure” denotes instances of stitching failures. Stitching failures encompass severe distortions and intolerable artifacts, as illustrated in Fig. 6, where the left part illustrates severe distortions and the right part exhibits severe artifacts.

Fig. 6. Examples of “failure,” with the left part showing a severely distorted stitched image and the right part displaying a stitched image with intolerable artifacts.

As shown in Table 4, the proposed method exhibits greater robustness than other methods owing to its enhanced feature extraction capabilities, which leads to a higher success rate in image stitching compared to alternative methods.


Table 4. Comparison of image stitching success rates for single-photon reconstructed images.

Furthermore, a user evaluation approach is employed to assess the quality of image stitching for qualitative comparison. Specifically, the image stitching results of the proposed method are individually compared with those of APAP, VFIS, and UDIS. During each comparison, four images are displayed on the screen: the two inputs, our stitching result, and the stitching result from APAP/VFIS/UDIS, presented in a random order. Users are asked to indicate the best option; in the absence of a preference, they can choose “both good” or “both bad.” Ten participants are invited, and each method is compared on 1000 images.

The results are depicted in Fig. 7. Ignoring the “both good” and “both bad” proportions, it is observed that images stitched using the proposed approach are favored over those of alternative methods. This indicates that the stitching results obtained through the proposed method exhibit superior visual quality in user evaluations.

Fig. 7. Visual quality user study: Compared to feature-based methods. Numbers are displayed as percentages and averaged across 10 participants.

4.3 Simulation experiment

Simulation experiments are conducted for eight different scenes at ten different SBRs, with the intensity and range spatial resolutions both set to 128 × 128. A qualitative evaluation is performed for comparison, with emphasis placed on the “Art” scene. The stitching results for this scene at various SBRs are depicted in Fig. 8.

Fig. 8. Stitching range images for the same scene at different Signal-to-Background Ratios (SBR) using various methods. The image resolution is 128 × 128. “Ground Truth” represents the ground truth range map provided in the dataset.

It can be seen that the conventional methods of Global Homography and APAP, as well as the unsupervised algorithm UDISNet, encounter difficulties in properly stitching the range images; their stitching results exhibit serious distortion. While VFISNet demonstrates proficiency in stitching range images, the non-overlapping portions of its stitched result suffer from pronounced distortion, presenting a noticeable misalignment compared to the true values. In contrast, the proposed PE-RASP method is capable of seamlessly stitching single-photon reconstructed range images, demonstrating excellent performance across various SBRs. The stitched results achieved by our method closely approximate the true values.

In Fig. 9, we present the results of stitching three images using our stitching method. During the stitching process, we select the middle image as the spatial coordinate reference and iteratively stitch pairs of images. Ultimately, this process yields a single-photon reconstructed range image with higher spatial resolution.

Fig. 9. The stitching results of three single-photon reconstructed range images.

The runtime of each module of the proposed method and the comparative alignment and stitching times of various methods are presented in Table 5. For reconstruction, all comparative methods utilize our single-photon reconstruction method; the test dataset comprises 10 sets of 64 × 64 × 1024-sized data, each reconstructed 100 times. For alignment and stitching, the dataset is the same as the one used in the alignment-phase comparison, consisting of 5000 intensity image sets with a 60% overlap rate. The APAP algorithm is computed on an R5-5600X @3.7 GHz CPU, while VFIS, UDIS, and our method are tested on an NVIDIA 3090 GPU.


Table 5. The average speed for reconstructing 1000 sets of data and aligning/stitching 5000 intensity image sets.

4.4 Ablation experiment

The ablation experiment is designed to investigate the influence of various network architectures on the final image stitching performance. This includes the intensity-to-range image stitching module, fine-tuning for intensity images in the alignment phase, and the incorporation of dense residual modules in the stitching stage, along with their collective impact on the overall stitching process. A comparison of six different network frameworks is conducted. The first framework excludes the intensity-to-range image stitching module (w/o I2R). To assess the effectiveness of this module, the results of the UDISNet framework are compared against the outcomes obtained by incorporating this module into that framework, yielding the second and third frameworks, respectively. The fourth framework (Ours_v1) applies no network fine-tuning for the intensity images. The fifth framework (Ours_v2) omits the dense residual modules in the stitching network and instead uses simpler residual modules, to assess the impact of the improved stitching network on stitching artifacts and seams. The sixth framework (Ours) includes all the aforementioned steps. Through these comparisons, the analysis evaluates the influence of the different components on the overall stitching performance, offering a comprehensive assessment of the network architecture's strengths and weaknesses.

The ablation experiments utilize the same test data as the simulation experiments, and the stitching results of range images under various scenarios with an SBR of 10:50 are shown in Fig. 10. Comparisons reveal that excluding the intensity-to-range module noticeably deteriorates the stitching performance, leading to a certain degree of distortion. The UDISNet network encounters challenges in directly stitching range images, resulting in significant distortion. However, after adding the intensity-to-range module, although some distortion still exists, the stitching results show significant improvement, further validating the effectiveness of the intensity-to-range module. The stitching results without fine-tuning the network specifically for intensity images (Ours_v1) exhibit severe distortion. This highlights the importance of fine-tuning the network for intensity image alignment, which significantly enhances the network's ability to align intensity images and, when combined with the intensity-to-range module, also improves its ability to align range images. The stitching network that excludes dense residuals (Ours_v2) produces results closely resembling those obtained with all steps included (Ours), but it exhibits noticeable seams and artifacts along the overlapping edges of the stitch, which can be observed within the red box in the image. In summary, the proposed network demonstrates superior range image stitching quality in these five test scenarios. This is attributed to the more pronounced features in intensity images, the effective feature extraction capabilities of the alignment network, and the ability of the stitching network's dense residual modules to fully utilize image feature information. These factors contribute to the effective stitching of range images, complementing the capabilities of feature-based stitching methods.

Fig. 10. Stitching range images using different methods in various scenarios. Image resolution is 128 × 128. Signal-to-noise ratio is 2:50. ‘Target’ represents the ground truth range images provided in the dataset.

5. Real-world experiments

To validate the performance of the proposed method under real-world experimental conditions, real-world data are collected using a SPAD-based single-photon LiDAR system, as depicted in Fig. 11. The detected targets are three dolls placed on a ping pong table. A fiber pulsed laser operates with a repetition frequency of 20 kHz, emitting pulses with a width of 1 ns, a wavelength of 1064 nm, and a peak power of 500 mW. The laser beam is diffused using external mirrors, with a divergence angle of 25 mrad. The laser emitter triggers the SPAD synchronously with pulse emission to detect the arrival time of the reflected photons. The SPAD model used is the GD5551 InGaAs SPAD, with an exposure time of 4096 ns per trigger, a time resolution of 1 ns, and a spatial resolution of 64 × 64. Risley prisms are employed to steer the receiving field of view.

Fig. 11. The utilized SPAD system in real-world experiments.

The images of the target scene along with the reconstruction results from the real-world experimental data are presented in Fig. 12. For the stitching process, 200 frames are utilized and reconstructed into 64 × 64 intensity and range images. The stitching results are depicted in Fig. 12(b). Notably, the conventional APAP algorithm fails to correctly stitch the images, and the UDISNet network exhibits unacceptable distortion in its stitching output. The stitching results produced by the VFISNet network suffer from severe distortions in non-overlapping regions, leading to images that deviate significantly from reality. In contrast, the method proposed in this paper successfully accomplishes the stitching task with high quality. It accurately restores disparities in non-overlapping areas, enabling the effective stitching of single-photon range images even at low spatial resolutions.

Fig. 12. (a) Experimental target scene. (b) Stitching results of various methods using real-world experimental detection data. The input images have a spatial resolution of 64 × 64.

6. Conclusion

An unsupervised learning method for stitching single-photon range images is proposed to address the challenge of low spatial resolution in single-photon imaging caused by the limited spatial resolution of SPAD arrays. The approach comprises three key modules: a reconstruction module, an alignment module, and an image stitching module. In the reconstruction stage, a GSA-encoder network is employed to reconstruct the captured three-dimensional data into separate intensity and range images. The alignment stage uses the M-SFFNet network for feature extraction, enabling the warping and alignment of the target image. In the stitching stage, the image is stitched by the low-resolution branch, while the high-resolution branch refines the stitched image; this process eliminates artifacts and enhances overall image quality by reconstructing the image from features to pixels. Additionally, a novel approach for stitching range images in single-photon imaging is introduced, leveraging intensity-image priors to achieve high-quality stitching. Ablation experiments are conducted to investigate the impact of each module on the network's stitching performance, confirming the effectiveness of using the intensity image to stitch the range image in single-photon image stitching. The proposed method is also tested in real-world experiments, demonstrating a robust ability to stitch single-photon range images and outperforming existing state-of-the-art methods.

Funding

National Natural Science Foundation of China (62301493, 62371163).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Z. Bai, G. Wu, M. J. Barth, et al., “PillarGrid: Deep Learning-based Cooperative Perception for 3D Object Detection from Onboard-Roadside LiDAR,” in 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC, 2022), pp. 1743–1749.

2. S. Zhou, H. Xu, G. Zhang, et al., “Leveraging Deep Convolutional Neural Networks Pre-Trained on Autonomous Driving Data for Vehicle Detection From Roadside LiDAR Data,” IEEE Trans. Intell. Transport. Syst. 23(11), 22367–22377 (2022). [CrossRef]  

3. X. Yang, Z. Yu, L. Xu, et al., “Underwater ghost imaging based on generative adversarial networks with high imaging quality,” Opt. Express 29(18), 28388–28405 (2021). [CrossRef]  

4. W. Yu, S. Shah, D. Li, et al., “Polarized computational ghost imaging in scattering system with half-cyclic sinusoidal patterns,” Opt. Laser Technol. 169, 110024 (2024). [CrossRef]  

5. P.-Y. Jiang, Z.-P. Li, W.-L. Ye, et al., “Long range 3D imaging through atmospheric obscurants using array-based single-photon LiDAR,” Opt. Express 31(10), 16054–16066 (2023). [CrossRef]  

6. Z.-P. Li, J.-T. Ye, X. Huang, et al., “Single-photon imaging over 200 km,” Optica 8(3), 344–349 (2021). [CrossRef]  

7. B. Lin, X. Fan, and Z. Guo, “Self-attention module in a multi-scale improved U-net (SAM-MIU-net) motivating high-performance polarization scattering imaging,” Opt. Express 31(2), 3046–3058 (2023). [CrossRef]  

8. X. Fan, B. Lin, K. Guo, et al., “TSMPN-PSI: high-performance polarization scattering imaging based on three-stage multi-pipeline networks,” Opt. Express 31(23), 38097–38113 (2023). [CrossRef]  

9. X. Zhou, K. Shen, L. Weng, et al., “Edge-Guided Recurrent Positioning Network for Salient Object Detection in Optical Remote Sensing Images,” IEEE Trans. Cybern. 53(1), 539–552 (2023). [CrossRef]  

10. P. Wang, B. Bayram, and E. Sertel, “A comprehensive review on deep learning based remote sensing image super-resolution methods,” Earth-Sci. Rev. 232, 104110 (2022). [CrossRef]  

11. E. D. Walsh, W. Jung, G.-H. Lee, et al., “Josephson junction infrared single-photon detector,” Science 372(6540), 409–412 (2021). [CrossRef]  

12. P. Vines, K. Kuzmenko, J. Kirdoda, et al., “High performance planar germanium-on-silicon single-photon avalanche diode detectors,” Nat. Commun. 10(1), 1086 (2019). [CrossRef]  

13. D. Shin, A. Kirmani, V. K. Goyal, et al., “Photon-efficient computational 3-d and reflectivity imaging with single-photon detectors,” IEEE Trans. Comput. Imaging 1(2), 112–125 (2015). [CrossRef]  

14. J. Peng, Z. Xiong, X. Huang, et al., “Photon-efficient 3d imaging with a non-local neural network,” in European Conference on Computer Vision (Springer, 2020), pp. 225–241.

15. X. Yang, Z. Tong, P. Jiang, et al., “Deep-learning based photon-efficient 3D and reflectivity imaging with a 64 × 64 single-photon avalanche detector array,” Opt. Express 30(18), 32948–32964 (2022). [CrossRef]  

16. Y. Mei, Y. Fan, and Y. Zhou, “Image super-resolution with non-local sparse attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 3517–3526.

17. R. Xue, Y. Kang, T. Zhang, et al., “Sub-Pixel Scanning High-Resolution Panoramic 3D Imaging Based on a SPAD Array,” IEEE Photonics J. 13(4), 1–6 (2021). [CrossRef]  

18. N. Yan, Y. Mei, L. Xu, et al., “Deep Learning on Image Stitching With Multi-viewpoint Images: A Survey,” Neural Process. Lett. 55(4), 3863–3898 (2023). [CrossRef]  

19. R. Szeliski, “Image Alignment and Stitching,” in Texts in Computer Science, (Springer, 2022), pp. 401–441.

20. A. Kirmani, D. Venkatraman, D. Shin, et al., “First-Photon Imaging,” Science 343(6166), 58–61 (2014). [CrossRef]  

21. D. Shin, F. Xu, D. Venkatraman, et al., “Photon-efficient imaging with a single-photon camera,” Nat. Commun. 7(1), 12046 (2016). [CrossRef]  

22. J. Rapp and V. K. Goyal, “A few photons among many: Unmixing signal and noise for photon-efficient active imaging,” IEEE Trans. Comput. Imaging 3(3), 445–459 (2017). [CrossRef]  

23. D. B. Lindell, M. O’Toole, and G. Wetzstein, “Single-photon 3d imaging with deep sensor fusion,” ACM Trans. Graph. 37(4), 1–12 (2018). [CrossRef]  

24. M. Hong, Y. Lu, N. Ye, et al., “Unsupervised Homography Estimation with Coplanarity-Aware GAN,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2022), pp. 17663–17672.

25. J. Gao, S. J. Kim, and M. S. Brown, “Constructing image panoramas using dual-homography warping,” in CVPR 2011 (IEEE, 2011), pp. 49–56.

26. W.-Y. Lin, S. Liu, Y. Matsushita, et al., “Smoothly varying affine stitching,” in CVPR 2011 (IEEE, 2011), pp. 345–352.

27. J. H. Zaragoza, T.-J. Chin, Q.-H. Tran, et al., “As-Projective-As-Possible Image Stitching with Moving DLT,” IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1285–1298 (2014). [CrossRef]  

28. C.-H. Chang, Y. Sato, and Y.-Y. Chuang, “Shape-Preserving Half-Projective Warps for Image Stitching,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2014), pp. 3254–3261.

29. C.-C. Lin, S. U. Pankanti, K. N. Ramamurthy, et al., “Adaptive as-natural-as-possible image stitching,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 1155–1163.

30. D. DeTone, T. Malisiewicz, and A. Rabinovich, “Deep Image Homography Estimation,” arXiv, arXiv:1606.03798 (2016). [CrossRef]  

31. L. Nie, C. Lin, K. Liao, et al., “A view-free image stitching network based on global homography,” J. Vis. Commun. Image Representation 73, 102950 (2020). [CrossRef]  

32. L. Nie, C. Lin, K. Liao, et al., “Learning Edge-Preserved Image Stitching from Large-Baseline Deep Homography,” arXiv, arXiv:2012.06194 (2020). [CrossRef]  

33. L. Nie, C. Lin, K. Liao, et al., “Unsupervised Deep Image Stitching: Reconstructing Stitched Features to Images,” IEEE Trans. on Image Process. 30, 6184–6197 (2021). [CrossRef]  

34. J. Gao, Y. Li, T.-J. Chin, et al., “Seam-Driven Image Stitching,” in Eurographics (Short Papers) (2013), pp. 45–48.

35. F. Zhang and F. Liu, “Parallax-Tolerant Image Stitching,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2014), pp. 3262–3269.

36. Q. Houwink, D. Kalisvaart, S. Hung, et al., “Theoretical minimum uncertainty of single-molecule localizations using a single-photon avalanche diode array,” Opt. Express 29(24), 39920–39929 (2021). [CrossRef]  

37. Z. Li, Y. Liu, X. Chen, et al., “Blueprint Separable Residual Network for Efficient Image Super-Resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2022), pp. 833–843.

38. Z. Shen, I. Bello, R. Vemulapalli, et al., “Global Self-Attention Networks for Image Recognition,” arXiv, arXiv:2010.03019 (2020). [CrossRef]  

39. K. He, X. Zhang, S. Ren, et al., “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 770–778.

40. Z. Wang, A. C. Bovik, H. R. Sheikh, et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]  

41. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution,” in Computer Vision – ECCV 2016, Lecture Notes in Computer Science (ECCV, 2016), pp. 694–711.

42. K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv, arXiv:1409.1556 (2014). [CrossRef]  

43. N. Silberman, D. Hoiem, P. Kohli, et al., “Indoor segmentation and support inference from RGBD images,” in European Conference on Computer Vision (Springer, 2012), pp. 746–760.

44. D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv, arXiv:1412.6980 (2014). [CrossRef]  

45. G. Shi, X. Xu, and Y. Dai, “SIFT Feature Point Matching Based on Improved RANSAC Algorithm,” in 2013 5th International Conference on Intelligent Human-Machine Systems and Cybernetics (IEEE, 2013), pp. 474–477.

46. T. Nguyen, S. W. Chen, S. S. Shivakumar, et al., “Unsupervised Deep Homography: A Fast and Robust Homography Estimation Model,” IEEE Robot. Autom. Lett. 3(3), 2346–2353 (2018). [CrossRef]  

47. U. Sara, M. Akter, and M. S. Uddin, “Image Quality Assessment through FSIM, SSIM, MSE and PSNR—A Comparative Study,” J. Comput. Commun. 7(3), 8–18 (2019). [CrossRef]  
