
Large depth-of-field computational imaging with multi-spectral and dual-aperture optics


Abstract

Large DOF (depth-of-field) imaging with high SNR (signal-to-noise ratio) is a crucial technique for applications ranging from security monitoring to medical diagnostics. However, traditional optical designs extend DOF by reducing the aperture size, and hence decrease light throughput and SNR. In this paper, we report a computational imaging system integrating dual-aperture optics with a physics-informed dual-encoder neural network to realize prominent DOF extension. Guided by the human vision mechanism and the optical imaging law, the dual-aperture imaging system consists of a small-aperture NIR camera that provides sharp edges and a large-aperture VIS camera that provides faithful color. To solve the imaging inverse problem of NIR-VIS fusion with different apertures, a specific network with parallel double encoders and a multi-scale fusion module is proposed to adaptively extract and learn the useful features, which helps prevent color deviation while preserving delicate scene textures. The proposed imaging framework is flexible and can be built in different prototypes with varied optical elements for different applications. We provide the theory for system design, demonstrate a prototype device, establish a real-scene dataset containing 3000 images, perform elaborate ablation studies, and conduct comparative experiments with peer methods. The experimental results demonstrate that our method effectively produces high-fidelity images with a DOF range about 3 times larger than that of the input raw images. Without complex optical design or strict practical limitations, this novel, intelligent and integratable system is promising for various vision applications such as smartphone photography, computational measurement, and medical imaging.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Large DOF imaging is one of the most widely used computational imaging techniques, with many applications such as optical microscopy [1] and integral imaging [2]. However, the trade-off between DOF and SNR is a fundamental and long-standing limitation in imaging. For example, microscopes usually pursue high definition and large NA (numerical aperture) values at the expense of DOF. Although DOF can be extended by reducing the aperture size, this results in insufficient light throughput and low SNR. A long exposure time may alleviate this, but it easily causes motion blur.

DOF extension is a challenging task due to its ill-posed nature. Until now, many solutions have been proposed and proved efficient, most notably multi-focus image fusion [3–5] and computational imaging [6–11]. The two representative solutions differ in that fusion-based methods focus on optimizing the algorithm in the image post-processing stage, while computational imaging mainly contributes to the co-design and co-optimization of the optical system and the image processing. In recent years, boosted by the big data revolution, deep learning-based artificial intelligence has been widely adopted and has also proved successful for DOF extension.

  • (1) Multi-focus image fusion is an effective and low-cost technique for DOF extension, generating an all-in-focus image from a set of partially focused images. It has been successfully applied to improve the imaging quality of smartphones, microscopes, etc. As is well known, desirable image fusion performance depends on appropriate activity-level measurement and fusion rules. These always need to be manually designed in most existing transform-domain algorithms, such as the multi-scale decomposition-based approach, which is inefficient and not robust. With the rapid development of artificial intelligence, automatic learning-based algorithms have emerged, such as the neural network-based approach.
Multi-scale decomposition-based approaches usually use multi-resolution analysis tools such as image pyramids and wavelets to perform the activity-level measurement, and employ a given fusion rule, such as maximum selection or a weighted sum, to fuse the features. For example, Burt et al. [12] proposed a multi-focus image fusion method based on the Laplacian pyramid. In their work, the absolute value of a decomposed coefficient was used as its activity-level measurement, and the maximum selection rule was applied to obtain the fused coefficient. However, such manual design increases the complexity of fusion algorithms and limits the final fusion performance.

Neural network-based approach employs CNNs (Convolutional neural networks) to model the activity-level measurement processing and deduce the applicable fusion rule. For example, Liu et al. [13] designed a supervised-learning CNN to generate a decision map, which was used to detect the focus area and output a large DOF fusion image. To get a more accurate decision map, Guo et al. [14] proposed a generative adversarial network named FuseGAN with a discriminator to distinguish the generated decision map from the ground truth. Apart from decision map-based methods, Zhang et al. [15] proposed an end-to-end network IFCNN consisting of features extraction, features fusion, and features reconstruction modules to directly generate a high-quality fused image.

  • (2) Computational imaging is a hybrid optical-digital technology, consisting of optical design and digital processing, which offers more flexibility for improving image quality than methods relying purely on complex optical design. Another obvious advantage is that, by jointly designing the optics and the post-processing algorithms, the complexity of the optical system can be greatly reduced, which has attracted researchers' attention in the past decades. In computational imaging, the PSF (point spread function) is an important bridge between optical design and image acquisition; it is vital to the imaging quality and determines the DOF of optical systems. Hence, PSF engineering-based methods have been extensively investigated in the computational imaging area to realize DOF extension, such as the wavefront coding-based approach and the focal sweep-based approach.
Wavefront coding-based approaches combine optical encoding and digital decoding to extend DOF. In the optical encoding, a special cubic phase plate codes the wavefront to produce a depth-invariant PSF, which provides spatially invariant features for high-definition image reconstruction in the digital decoding. For example, Dowski et al. [16] designed a misfocus-insensitive PSF by placing a phase mask at the pupil of the optical system; the extended DOF image was then obtained by deconvolution with the spatially invariant PSF.

Focal sweep-based approaches produce a depth-invariant PSF by sweeping either the object or the sensor along the optical axis during exposure, after which a deconvolution step removes the spatially invariant blur. Compared to wavefront coding, focal sweeping can generate a more uniform PSF [17], but it requires moving parts at the sacrifice of efficiency. To shorten the sweeping distance and reduce exposure time, Peng et al. [18] leveraged the design flexibility of diffractive optics to produce a multi-focal lens, and designed a deconvolution algorithm with cross-channel regularization to reconstruct large DOF images without chromatic aberration.

However, most existing computational imaging approaches jointly optimize a single optical-path system and a post-processing algorithm on a single image, where high-fidelity image reconstruction is challenging due to the limited raw input information. In addition, the design strategies of most existing multi-focus image fusion algorithms rely on well-calibrated image pairs with complementary DOF ranges, whose capture and calibration are complex and difficult, especially in dynamic scenes. To address these issues, we propose a computational imaging system with dual differentiated optical paths. Specifically, the designed optical imaging system is made up of a small-aperture NIR camera that provides sharp edges and a large-aperture VIS camera that provides faithful color, as shown in Fig. 1. Thanks to the photosensitive differences between the human eye and cameras, the large DOF NIR image and the faithful-color VIS image can be obtained at the same time without unpleasant light pollution in practice. To integrate the merits of the NIR and VIS optical paths, a fusion network DEU-Net (dual-encoder U-Net) is specifically designed, consisting of VIS encoders for color extraction, NIR encoders for edge extraction, and decoders for feature fusion. Parallel NIR and VIS encoders with separate parameters are designed for crucial information extraction, and multi-level skip concatenation is proposed to realize the information flow between the encoders and the decoder. Considering that a network trained with a single loss function is neither powerful nor robust, we propose a comprehensive loss function, consisting of a perceptual loss to enhance visual perception, a gradient loss to preserve structure information, and a pixel loss to preserve the luminance distribution, for the generation of high-quality and visually pleasing fusion images.

Fig. 1. Schematic overview of the computational dual-aperture imaging system for large DOF. Optical system: consisting of a NIR camera with small aperture N and a VIS camera with large aperture V. DEU-Net: image fusion network architecture with two parallel encoders and a decoder. It should be clarified that the NIR input is a composite image combining brightness information of the NIR image and color features of the VIS image.

Particularly, the main contributions of this paper are as follows:

  • 1. Building a computational imaging system with dual optical paths, enabling large DOF imaging.
  • 2. Leveraging the photosensitive differences between the human eye and the camera to extract high-definition details and the faithful color of scenes while avoiding unpleasant light pollution.
  • 3. Proposing an algorithm with dual encoders to extract crucial features from each optical path, and designing a multi-scale fusion module that combines these features to extend DOF while preserving high color fidelity.

2. Related work

NIR-VIS image fusion aims to reconstruct a perfect image from NIR and VIS images with complementary visual information [19]. It has been widely applied in many vision applications such as object tracking and video surveillance. Excellent fusion performance depends on effective feature extraction and appropriate fusion strategies. A variety of fusion algorithms have been proposed over the past decades, among which the multi-scale transform-based, sparse representation-based, and deep learning-based approaches are three representative methods.

Multi-scale transform-based approach. In multi-scale transform, the inputs are decomposed into a series of components at different scales, where each component represents the sub-image at each scale and real-world objects typically comprise components at different scales [20,21]. Several studies have demonstrated that the multi-scale transform is consistent with human visual characteristics, and this property contributes to the generation of fusion images with better visual perception [22,23]. In general, this approach comprises three parts: multi-scale decomposition, fusion strategy, and inverse transform. Each input is decomposed into a variety of multi-scale representations, which are then fused according to a given strategy. Finally, the fused image is obtained using the corresponding inverse transform on the fused representations. However, misregistration and noise bring bias to the fused multi-scale representation coefficients, resulting in the visual artifacts in the fused image.

Sparse representation-based approach. Sparse representation exploits the natural sparsity of signals and effectively characterizes the human visual system [24]. Different from the multi-scale transform-based approach with prefixed basis functions, the core of this approach is the construction of an over-complete dictionary. Normally, five steps are necessary: vectorization, dictionary building, sparse coding, fusion rule, and reconstruction. Specifically, each input is decomposed into several overlapping patches, which potentially reduces visual artifacts and improves robustness to misregistration [25]. The dictionary is built by learning from a large number of natural images. Each patch is then sparsely coded with the learned dictionary to obtain the sparse representation coefficients, which are fused according to the designed fusion rule. Finally, the fused image is generated using the fused coefficients and the over-complete dictionary.

Deep learning-based approach. Due to its strong capability of modeling complicated relationships among data, deep learning has been successfully applied to NIR-VIS fusion. Different from the two approaches above, it trains a dedicated CNN on abundant images to automatically learn the factors necessary for optimal fusion performance, which avoids difficult manual design and has powerful adaptability and transferability. For example, Li et al. [26] designed an end-to-end image fusion network to eliminate the arbitrariness of handcrafted design. Wang et al. [27] proposed a fully convolutional network to fuse a noisy VIS image and a sharp NIR image for noise reduction and detail enhancement, and their results showed better performance than the comparative traditional denoising methods.

3. Computational dual-aperture imaging system

In this section, the two core parts of our imaging system are presented: the optical imaging law and the observer visual model.

3.1 Optical imaging law

In optical imaging, different apertures have different effects on the quality of captured images. Here we explain how DOF is affected by the aperture size. As shown in Fig. 2, in contrast to the small aperture Q’, the large aperture Q enables more off-axis rays to pass through the lens, but this results in a larger blur kernel B and a smaller DOF.

Fig. 2. Schematic diagram of optical imaging law. Q and Q’ denote large aperture and small aperture. H and H’ are the principal points of object and imaging. It should be noted that the compound lens set adopted here is just for clarification and a more complex optical system is also applicable.

In Fig. 2, in a geometrical optical system, when the object is placed at O, a clear image is obtained at O’ according to the conjugate imaging relationship (the Gaussian lens law [28], Eq. (1)).

$$\frac{1}{f} = \frac{1}{v} + \frac{1}{u}. $$

Here, f is the focal length of the optical system, u is the object depth, and v is the image depth. This law indicates the spatial relationship between an object and its image. For an object point at A, away from O, the image recorded on the CCD becomes a spot of size B, which is determined by:

$$B = Qh(\frac{1}{f} - \frac{1}{h} - \frac{1}{u}), $$
where Q is the aperture size, and h is the sensor-to-lens distance. The depth range over which objects are imaged clearly on the CCD (i.e., where B is smaller than the limiting resolution L of the CCD) is called the DOF, and can be derived as:
$$DOF = \frac{{2{u^2}{f^3}L}}{{Q{f^4} - \frac{{{{(fuL)}^2}}}{Q}}}. $$

From Eqs. (2) and (3), reducing Q decreases B and increases DOF. Therefore, traditional ways to extend DOF require a reduction in Q, which causes insufficient light throughput and worse image quality. Using a high-power white light source as a supplementary lamp may mitigate this, but it introduces further problems, such as dazzling of the human eye and unpleasant light pollution.
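As a concrete illustration of this trade-off, the minimal sketch below evaluates Eqs. (2) and (3) for two aperture sizes. The focal length matches the 16 mm lens of our prototype, but the focus depth and the limiting resolution L are illustrative values, not calibrated parameters.

```python
# Minimal sketch: evaluate the blur-spot size B (Eq. 2) and the DOF (Eq. 3)
# for two aperture sizes. Numerical values are illustrative only.

def blur_spot(Q, f, h, u):
    """Blur-spot size B for aperture Q, focal length f, sensor distance h, object depth u (Eq. 2)."""
    return Q * h * (1.0 / f - 1.0 / h - 1.0 / u)

def depth_of_field(Q, f, u, L):
    """DOF from Eq. (3) for aperture Q, focal length f, focus depth u, resolution limit L."""
    return 2.0 * u**2 * f**3 * L / (Q * f**4 - (f * u * L) ** 2 / Q)

f = 16e-3   # focal length: the 16 mm lens used in our prototype
u = 1.0     # focused object depth: 1 m (illustrative, within the ~1.5 m scene range)
L = 5e-6    # CCD limiting resolution: ~5 um (illustrative)

for Q in (16e-3 / 1.4, 16e-3 / 12):   # aperture diameters at F1.4 and F12
    print(f"Q = {Q * 1e3:5.2f} mm  ->  DOF = {depth_of_field(Q, f, u, L):.3f} m")
# The F12 aperture yields a DOF roughly an order of magnitude larger than F1.4,
# at the cost of light throughput.
```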

3.2 Observer visual model

According to the imaging law, realizing high-quality, large DOF imaging while avoiding light pollution in one shot is necessary but difficult. To overcome this, an observer visual model of the human eye and the electronic detector is introduced, and their differences are modeled to serve the proposed computational dual-aperture imaging system shown in Fig. 1.

3.2.1 Human visual system

In the human visual mechanism, the retina, consisting of two basic types of photoreceptors, cones and rods, is responsible for the perception of light. The cones (termed photopic vision), working in bright conditions (over 5 cd/m2), can distinguish the color and details of the scene. Conversely, the rods (termed scotopic vision), working in dim conditions (below 0.005 cd/m2), can only perceive gray information. The spectral sensitivities of photopic vision and scotopic vision (termed luminosity functions) are characterized by the V(λ) and V’(λ) curves (shown as the black solid and dotted curves in Fig. 3), respectively.

Fig. 3. The diagram of LUT with different LP at 0.01, 0.3, 4.5 cd/m2. m determines the adaption conditions of the mesopic vision. The increase of LP or RSP results in larger m, which implies that the human eye will adjust the vision system to improve the role of cones in mesopic vision to adapt to the change in luminance.

Given the spectral luminous efficacies of photopic vision and scotopic vision as K(λ) and K’(λ), and their maxima as Km (Km = K(555 nm) = 683 lm/W) and K’m (K’m = K’(507 nm) = 1700 lm/W), the luminosity functions of photopic vision and scotopic vision are defined as:

$$V(\lambda ) = \frac{{K(\lambda )}}{{{K_m}}}, V^{\prime}(\lambda ) = \frac{{K^{\prime}(\lambda )}}{{K{^{\prime}_m}}}.$$

At luminance levels between 0.005 and 5 cd/m2, both the cones and rods are active; this regime is referred to as mesopic vision, whose luminosity function VM(λ) is the weighted combination of V(λ) and V’(λ):

$$\left\{ \begin{array}{l} M(m){V_M}(\lambda ) = mV(\lambda ) + (1 - m)V^{\prime}(\lambda ).\\ m = LUT({L_P},{R_{SP}}),{R_{SP}} = \frac{{\int\limits_0^\infty {\varPhi (\lambda ) \ast K^{\prime}(\lambda )d\lambda } }}{{\int\limits_0^\infty {\varPhi (\lambda ) \ast K(\lambda )d\lambda } }} \end{array} \right..$$

M(m) is a normalization function that scales the maximum of VM(λ) to 1, and m ranges from 0 to 1. The CIE report [29] gives the value of m as a LUT (look-up table) function of the photopic luminance LP and the illuminant ratio RSP, as listed in Supplement 1, Table 1. LP is the luminance of the illuminant. RSP is calculated as the ratio of the luminous flux perceived by scotopic vision to that perceived by photopic vision. The variation of m with RSP under several representative LP values is drawn to present the LUT more visually. As shown in Fig. 3, m increases with RSP for LP of 0.01, 0.3, and 4.5 cd/m2, but the rate of increase decreases as LP rises.
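For readers who wish to reproduce the weighting of Eq. (5), the sketch below builds a mesopic luminosity function from sampled photopic and scotopic curves for a given m. The sampled curves are coarse Gaussian placeholders rather than the CIE tabulations, and the CIE LUT of Supplement 1 is not reproduced here, so m is set by hand.

```python
import numpy as np

# Minimal sketch of Eq. (5): the mesopic luminosity function as a normalized
# weighted sum of the photopic V(lambda) and scotopic V'(lambda) curves.
# The sampled curves below are rough stand-ins, not the CIE data.

wavelengths = np.arange(400, 701, 10)                     # nm
V  = np.exp(-0.5 * ((wavelengths - 555) / 45.0) ** 2)     # placeholder for V(lambda), peak 555 nm
Vp = np.exp(-0.5 * ((wavelengths - 507) / 45.0) ** 2)     # placeholder for V'(lambda), peak 507 nm

def mesopic_luminosity(m, V, Vp):
    """Return V_M(lambda) with its maximum normalized to 1 (the role of M(m))."""
    vm = m * V + (1.0 - m) * Vp
    return vm / vm.max()

# m would normally come from the CIE LUT given L_P and R_SP (Supplement 1, Table 1);
# here it is varied by hand to show the scotopic-to-photopic transition.
for m in (0.0, 0.5, 1.0):
    vm = mesopic_luminosity(m, V, Vp)
    print(f"m = {m:.1f}: peak near {wavelengths[vm.argmax()]} nm")
```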

Table 1. LP is measured by the luminometer, RSP is calculated as Eq. (5), m is given via LUT.

When LP is lower than 0.005 cd/m2, m equals 0 and VM(λ) becomes V’(λ). When LP exceeds 5 cd/m2, m equals 1 and VM(λ) becomes V(λ). If LP ranges from 0.005 to 5 cd/m2, VM(λ) tends toward V(λ)/V’(λ) with the increase/decrease of m, as shown in Fig. 4. This implies that mesopic vision plays a key role in the transition between scotopic and photopic vision, which can also be used to explain many phenomena such as dark adaptation and dazzling.

Fig. 4. The black dotted curve, white solid curve, and black solid curve are the distribution of luminosity functions of the scotopic, mesopic and photopic vision, denoted as V’(λ) [30], VM(λ) and V(λ) [31], respectively.

3.2.2 Camera vision system

The spectrum of an NIR camera usually spans from 750 nm to 1000 nm, while a VIS camera typically covers the spectral range from 380 nm to 700 nm. In this paper, an NIR lamp (peak wavelength at 850 nm) and a 7176 K white illuminant are employed for illumination, and their spectral distributions are plotted in Fig. 5. As illustrated above, the human eye is insensitive to NIR light because both cones and rods rarely respond to long-wavelength light (over 750 nm). In contrast, the peak response region of NIR cameras commonly lies between 750 nm and 900 nm, as exemplified by a typical NIR-VIS camera whose spectral sensitivity is plotted in Fig. 6.

Fig. 5. Top: Spectral power distributions of our employed white and 850 nm NIR illuminant, denoted as ΦVis(λ) and ΦNir(λ).

Fig. 6. The camera spectral sensitivity of our used multi-spectral prism camera (JAI FS-3200D-10GE) [32].

To prove that the nearly invisible NIR light can provide a sufficient optical signal in the NIR path, we propose the Signal_Ratio, defined as the ratio of the optical signal perceived by the NIR camera to that perceived by the VIS camera:

$$\left\{ \begin{array}{l} Signal\_Ratio = \frac{{\int {{C_{Nir}}(\lambda ){\Phi_{Nir}}(\lambda )} d\lambda }}{{\int {{C_{Vis}}(\lambda ){\Phi_{Vis}}(\lambda )} d\lambda }},\\ {C_{VIS}}(\lambda ) = {w_R} \ast {C_{VIS\_R}}(\lambda ) + {w_G} \ast {C_{VIS\_G}}(\lambda ) + {w_B} \ast {C_{VIS\_B}}(\lambda ). \end{array} \right.$$
CNir(λ) and CVis(λ) are the spectral sensitivities of the NIR and VIS cameras, and ΦNir(λ) and ΦVis(λ) are the spectral power distributions of our NIR and white illuminants, respectively. Considering that the VIS camera has three spectral sensitivity curves while the NIR camera has only one, for a fair comparison the VIS camera spectral sensitivities of the red (CVIS_R(λ)), green (CVIS_G(λ)), and blue (CVIS_B(λ)) channels are summed with weights. In our experiment, Signal_Ratio is computed as 5.7869, which implies that high-quality, large DOF images can be captured in the small-aperture NIR optical path.
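In practice, Eq. (6) reduces to two numerical integrations of measured spectra against camera sensitivities, as in the sketch below. The spectral curves and the channel weights wR, wG, wB are synthetic placeholders for the measured quantities in Figs. 5 and 6; with the placeholders the computed ratio will not equal the reported 5.7869.

```python
import numpy as np

# Minimal sketch of Eq. (6). In the experiment, C_Nir, C_Vis_{R,G,B}, Phi_Nir and
# Phi_Vis are measured curves sampled on a common wavelength grid.

lam = np.arange(380.0, 1001.0, 1.0)    # wavelength grid in nm

def gaussian(center, width):
    return np.exp(-0.5 * ((lam - center) / width) ** 2)

phi_nir = gaussian(850, 20)                    # 850 nm NIR illuminant (placeholder)
phi_vis = gaussian(560, 120) * (lam < 750)     # white illuminant, VIS band (placeholder)
c_nir   = gaussian(830, 60)                    # NIR camera sensitivity (placeholder)
c_r, c_g, c_b = gaussian(600, 40), gaussian(540, 40), gaussian(460, 40)

w_r, w_g, w_b = 1.0 / 3, 1.0 / 3, 1.0 / 3      # assumed equal channel weights
c_vis = w_r * c_r + w_g * c_g + w_b * c_b      # weighted-sum VIS sensitivity

signal_ratio = np.trapz(c_nir * phi_nir, lam) / np.trapz(c_vis * phi_vis, lam)
print(f"Signal_Ratio = {signal_ratio:.3f}")    # the paper reports 5.7869 with measured curves
```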

3.2.3 Observer perception model

To quantitatively evaluate the differences between the human eye and cameras, Perception_Gain is computed as the ratio of the luminous flux perceived by a human under the NIR illuminant to that under the white illuminant:

$$Perception\_Gain = \frac{{1700\int {{V_M}^{\prime}(\lambda ){\varPhi _{Nir\_E}}(\lambda )} d\lambda }}{{683\int {{V_M}(\lambda ){\varPhi _{Vis}}(\lambda )d} \lambda }}.$$

The parameters of our illuminants are listed in Table 1. According to Eq. (5), VM’(λ) and VM(λ) are equal to V’(λ) and V(λ), respectively. The Perception_Gain [33] is calculated as 6.18×10−7, which indicates that the influence of the NIR illuminant on the human observer is negligible compared to that of the white illuminant.

The results of Signal_Ratio and Perception_Gain (shown in Fig. 7(B)) also demonstrate that a huge perception difference exists between the human visual model and the camera vision system. The observer perception model is the core of our computational dual-aperture imaging system and contributes to its success. Besides large DOF imaging, this model can also be applied to guide the choice of light sources and cameras in other applications such as low-light imaging.

Fig. 7. A. The schematic of our dual-aperture optical imaging system. B. The quantitative differences between the human eye and cameras in the ratio of VIS perception to NIR perception under the same measured lighting conditions in our experiment (extremely strong NIR light and relatively weak VIS light), showing the strong sensitivity of the human eye in the VIS band (LP 263.8 cd/m2, larger than the NIR LP of 0.03 cd/m2) and the opposite situation in the camera response (NIR response 5.7869 times as large as the VIS response).

Based on the perceptual differences between the human eye and the camera, we propose a dual-aperture optical imaging system for large DOF, as shown in Fig. 7(A). Specifically, the high-definition texture detail can be recorded by leveraging the NIR camera’s spectral efficacy while avoiding light pollution. Unfortunately, the NIR camera can’t record any color information, which is vital to a wide variety of computer vision applications such as image classification and target detection. To address that, a large-aperture VIS camera is introduced to record faithful color information. To get a large DOF image with high color fidelity, high-definition features from the NIR optical path and faithful color information from the VIS optical path can be fused by a specially designed fusion algorithm.

4. Learned image fusion

NIR-VIS image fusion is a challenging task due to image structure discrepancies, which easily cause inhomogeneous reconstructed pixel intensities within the same structure and result in color deviation. To overcome this, we train a variant of the U-Net model, named DEU-Net, consisting of a pyramid of VIS encoders for true color information extraction, NIR encoders for sharp feature extraction, and decoders for feature fusion, so that each reconstructed pixel has a good sense of its neighboring pixels, which is vital for the preservation of image structure, as shown in Fig. 8. Meanwhile, a combination of pixel loss, perceptual loss, and gradient loss is employed to improve the quality of the fused images.

Fig. 8. A. Architecture of DEU-Net. B. The layer configurations are illustrated with different color blocks. The blue (VIS encoder) and yellow (NIR encoder) blocks share the same configurations.

The input of the NIR encoder is a composite color image combining the basic color features from the VIS image and the brightness features from the NIR image. Because an RGB image is organized in a three-color order, we transform the RGB image into the YUV color space to separate the brightness and color components, which facilitates generating the composite image with brightness (Y) from the NIR image and color (U and V) from the VIS image. Finally, the composite image in YUV color space is transformed back to RGB color space and fed into the network. The whole process is shown in Fig. 9. To simultaneously achieve DOF extension and high color fidelity, the VIS image is fed into the VIS encoder to provide faithful color information during the fusion process.
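The composite NIR input can be generated with a few color-space conversions, as in the minimal sketch below. The use of OpenCV and the file names are assumptions for illustration, and the NIR and VIS captures are assumed to be registered and of identical size.

```python
import cv2

# Minimal sketch: build the composite NIR input by taking the brightness (Y)
# channel from the registered NIR capture and the chroma (U, V) channels from
# the VIS capture, then converting back to RGB. File names are placeholders.

vis_rgb  = cv2.cvtColor(cv2.imread("vis.png"), cv2.COLOR_BGR2RGB)   # blurry, colorful VIS capture
nir_gray = cv2.imread("nir.png", cv2.IMREAD_GRAYSCALE)              # sharp, monochrome NIR capture

vis_yuv = cv2.cvtColor(vis_rgb, cv2.COLOR_RGB2YUV)
vis_yuv[..., 0] = nir_gray                    # replace brightness (Y) with the NIR image
nir_input = cv2.cvtColor(vis_yuv, cv2.COLOR_YUV2RGB)

# The network then receives (nir_input, vis_rgb) as the NIR- and VIS-encoder inputs.
```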

Fig. 9. The generation of the NIR and VIS inputs to the following network. The VIS input is the captured VIS image. The NIR input is a composite image combining color information from the captured VIS image and brightness features from the captured NIR image.

4.1 Network architecture

NIR encoders. Four consecutive encoders are designed to extract sharp features from the NIR input, which is vital for DOF extension. Specifically, as shown by the yellow block in Fig. 8(B), middle, each encoder is made up of two convolutional blocks and a down-sampling block. Each convolutional block consists of a 3×3 convolution layer with stride 1 to extract high-definition features, a BatchNorm layer to improve the convergence rate, and a ReLU (rectified linear unit) function to increase the network nonlinearity. The down-sampling block is composed of a 4×4 convolution layer with stride 2 to extract deep semantic information, a BatchNorm layer, and a ReLU function. Moreover, the input and the output of each convolutional block are concatenated as the input of the following block, which enables the shallow features extracted by the first few blocks to be reused as much as possible. Details such as edges are thus well preserved for deeper layers.
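The following PyTorch sketch shows one plausible realization of a single encoder stage as described above (two 3×3 convolutional blocks with input-output concatenation, followed by a 4×4, stride-2 down-sampling block). The channel widths are our assumptions; the exact configuration is given in Fig. 8(B).

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """3x3 conv + BatchNorm + ReLU; the block input is concatenated to its output for feature reuse."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)   # reuse shallow features

class EncoderStage(nn.Module):
    """Two convolutional blocks followed by a 4x4, stride-2 down-sampling block."""
    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.block1 = ConvBlock(in_ch, growth)
        self.block2 = ConvBlock(in_ch + growth, growth)
        out_ch = in_ch + 2 * growth
        self.down = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        feat = self.block2(self.block1(x))   # full-resolution features (usable as skip connections)
        return feat, self.down(feat)         # and their down-sampled version for the next stage

# Example: one encoder stage applied to a 512x512, 3-channel composite input
stage = EncoderStage(in_ch=3)
skip, down = stage(torch.randn(1, 3, 512, 512))
print(skip.shape, down.shape)   # torch.Size([1, 67, 512, 512]) torch.Size([1, 67, 256, 256])
```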

VIS encoders. The VIS image is fed into the VIS encoder and passes through three identical encoders for color extraction. The configuration of the convolutional blocks in the VIS encoders is the same as that in the NIR encoders. For better fusion performance, the information extracted by the VIS and NIR encoders is concatenated along the channel dimension and fed into the decoders.

Decoders. Four consecutive decoders are designed to fuse the features from the NIR and VIS encoders. Each decoder is composed of a nearest-neighbor up-sampling layer and a multi-scale convolution block, in which the parallel branches of the trident module perform convolutions with different kernel sizes and strides, as shown by the orange block in Fig. 8(B), bottom. The input of the multi-scale convolution block is the concatenation of the visual features extracted by the VIS and NIR encoders and the output of the previous block. The small-scale (1×1) convolution with stride 1 establishes pixel-wise image fusion, which contributes to preserving details such as edges and texture. However, small-scale convolution fails to aggregate neighboring contextual information due to its small receptive field, which easily causes color artifacts. The middle-scale (3×3) convolution with stride 1 extracts neighboring features, which is vital for image structure matching. The large-scale (5×5) convolution with stride 2 increases the receptive field and extracts semantic features with fewer details but more global information, which assists semantic information fusion. Finally, the features of each scale are fused by a following 3×3 convolution layer.
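The sketch below illustrates one possible form of the multi-scale fusion block and a decoder stage. Because the 5×5 branch uses stride 2 while the other branches use stride 1, it is bilinearly upsampled before concatenation; this resizing step and the channel widths are our assumptions, not details stated in Fig. 8(B).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 branches fused by a trailing 3x3 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1)             # pixel-wise fusion
        self.branch3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)  # neighborhood features
        self.branch5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2)  # semantic / global features
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        b1, b3, b5 = self.branch1(x), self.branch3(x), self.branch5(x)
        # bring the stride-2 branch back to full resolution before fusion (assumed)
        b5 = F.interpolate(b5, size=b1.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([b1, b3, b5], dim=1))

class DecoderStage(nn.Module):
    """Nearest-neighbor up-sampling followed by multi-scale fusion of concatenated features."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.msf = MultiScaleFusion(in_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)                                  # restore spatial resolution
        return self.msf(torch.cat([x, skip], dim=1))    # fuse with NIR + VIS encoder features

# Example: fuse a 128-channel decoder feature with 134 channels of encoder skip features
dec = DecoderStage(in_ch=128, skip_ch=134, out_ch=64)
out = dec(torch.randn(1, 128, 64, 64), torch.randn(1, 134, 128, 128))
print(out.shape)   # torch.Size([1, 64, 128, 128])
```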

4.2 Loss function

Loss function is crucial to the training of the deep neural network. Tailored to the improvement of image quality, different loss functions pose constraints on various aspects. For example, the pixel loss contributes to reducing the brightness distribution mismatch between the fused image and the ground truth, the perceptual loss aims to improve the visual perception of the result, and the gradient loss is used to preserve more structure information. To synthetically enhance the performance of our network, an organic combination of pixel loss, perceptual loss and gradient loss is employed.

4.2.1 Pixel loss

The classical loss used in deep learning like MAE (mean absolute error) penalizes the pixel-wise differences between the ground truth and the output:

$$Pix = \frac{1}{{CHW}}\sum\limits_{c,h,w} {|{{x_{c,h,w}} - rea{l_{c,h,w}}} |}. $$

However, this loss may lead to overly blurry output due to the pixel-wise average of possible optima. Recently, the perceptual loss [34] has been proved effective and powerful in the enhancement of visual perception. Therefore, it is introduced to alleviate the blurry effect and obtain visually pleasing results.

4.2.2 Perceptual loss

In this loss component, a pretrained network is used to extract high-level representations of the output and the ground truth, which are compared to quantify the visual perception differences. In this paper, the feature maps of the sixth convolutional layer (i.e., conv3_2, following Peng et al. [35]) of VGG19 φ [36] are used to calculate the perceptual loss. The PL (perceptual loss) consists of a content loss and a style loss:

$$PL = \zeta _{feat}^{\varphi ,j} + \vartheta _{style}^{\varphi ,j}. $$

The content loss penalizes the content differences between the output x and the ground truth real, and can be mathematically expressed as:

$$\zeta _{feat}^{\varphi ,j}(x,real) = \frac{1}{{{C_j}{H_j}{W_j}}}||{{\varphi_j}(x) - {\varphi_j}(real)} ||_F^2. $$

Here, the feature map φj(x) with shape (Cj, Hj, Wj) is extracted by the jth convolutional layer of the VGG19 network φ.

The style transfer loss is used to measure semantic differences between the output and the ground truth and penalize the mismatch in style: color, texture, common pattern, etc.

$$\vartheta _{style}^{\varphi ,j}(x,real) = ||{G_j^\varphi (x) - G_j^\varphi (real)} ||_F^2, $$
where the Cj×Cj Gram matrix Gφj(x) is the product of φj(x), reshaped to (Cj, Hj×Wj), and its transpose:
$$G_j^\varphi {(x)_{c,c^{\prime}}} = \frac{1}{{{C_j}{H_j}{W_j}}}\sum\limits_{h = 1}^{{H_j}} {\sum\limits_{w = 1}^{{W_j}} {{\varphi _j}{{(x)}_{h,w,c}}{\varphi _j}{{(x)}_{h,w,c^{\prime}}}} }. $$
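A compact PyTorch sketch of Eqs. (9)-(12) is given below, using torchvision's pretrained VGG19 as the feature extractor φ. We assume that feature index 12 of torchvision's vgg19 corresponds to conv3_2, and that the inputs are already normalized as VGG expects; both are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """Content + style loss on VGG19 conv3_2 features (Eqs. 9-12), averaged over the batch."""
    def __init__(self, layer_index=12):          # index 12 assumed to be conv3_2 in vgg19.features
        super().__init__()
        vgg = models.vgg19(pretrained=True).features[: layer_index + 1]
        self.vgg = vgg.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)               # the feature extractor stays frozen

    @staticmethod
    def gram(feat):
        # Gram matrix of Eq. (12): (B, C, C), normalized by C*H*W
        b, c, h, w = feat.shape
        f = feat.reshape(b, c, h * w)
        return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

    def forward(self, x, real):
        # x, real: (B, 3, H, W), assumed normalized with ImageNet statistics
        fx, fr = self.vgg(x), self.vgg(real)
        b, c, h, w = fx.shape
        content = ((fx - fr) ** 2).sum() / (b * c * h * w)         # Eq. (10)
        style = ((self.gram(fx) - self.gram(fr)) ** 2).sum() / b   # Eq. (11)
        return content + style                                     # Eq. (9)
```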

4.2.3 Gradient loss

Gradient information also plays a key role in many vision tasks such as super-resolution, and it has been shown that the gradient loss induces the network to generate visually pleasing results with fewer geometric distortions [37]. Therefore, in our design, the gradient loss is employed to provide gradient-space supervision for better image fusion. The gradient map $\nabla I$ is obtained by computing the differences between adjacent pixels:

$$\left\{ \begin{array}{l} \nabla I = \sqrt {I_x^2 + I_y^2} ,\\ {I_x} = I(x + 1,y) - I(x - 1,y),{I_y} = I(x,y + 1) - I(x,y - 1). \end{array} \right.$$

The GL (gradient loss) penalizes the differences between gradient maps of the output x and the ground truth real:

$$GL = \frac{1}{{CHW}}\sum\limits_{c,h,w} {|{\nabla {x_{c,h,w}} - \nabla rea{l_{c,h,w}}} |}. $$
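Equations (13) and (14) translate directly into central differences over the image tensor, as in the sketch below; the replicate padding at the borders and the small epsilon inside the square root are our implementation choices.

```python
import torch
import torch.nn.functional as F

def gradient_map(img):
    """Central-difference gradient magnitude, Eq. (13). img: (B, C, H, W)."""
    padded = F.pad(img, (1, 1, 1, 1), mode="replicate")    # border handling (our choice)
    ix = padded[..., 1:-1, 2:] - padded[..., 1:-1, :-2]    # I(x+1, y) - I(x-1, y)
    iy = padded[..., 2:, 1:-1] - padded[..., :-2, 1:-1]    # I(x, y+1) - I(x, y-1)
    return torch.sqrt(ix ** 2 + iy ** 2 + 1e-12)           # epsilon keeps the sqrt differentiable at 0

def gradient_loss(x, real):
    """Mean absolute difference between gradient maps, Eq. (14)."""
    return torch.mean(torch.abs(gradient_map(x) - gradient_map(real)))
```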

4.2.4 Overall loss

The overall loss Γ of our network is the combination of the pixel loss, perceptual loss and gradient loss:

$$\Gamma = {w_1}Pix + {w_2}PL + {w_3}GL. $$
w1, w2 and w3 are the weights of these three loss terms, and they are all set to 1 in this paper. For different application scenarios, the weights can be set flexibly.

5. Experiment and results

5.1 Apparatus

To build a real-scene dataset, a NIR-VIS prism camera (JAI FS-3200D-10GE) with a 16 mm lens was adopted, and a NIR illuminant was used as a supplementary lamp, as shown in Fig. 10. In the experiment, the depth range was about 1500 mm, in which objects were randomly placed. To provide faithful color information and high-definition features, the VIS image was captured with the camera at F1.4, while the large DOF NIR image was captured at F12. It should be noted that a luminometer (Topcon SR-3AR) and a spectrometer (Specim IQ hyperspectral camera) were used to measure the parameters of the illuminants (as listed in Table 1 and Fig. 5) to infer the observer perception model, but they are not necessary in practice. A standard white plate was employed during the measurement processes.

Fig. 10. Illustration of our experiment setup. All the training, validation and test sets were captured by our camera. The NIR light as the supplement lamp was used to maintain enough light throughput in small-aperture NIR imaging. Meanwhile, a uniform light membrane was adhered to the pupil of the NIR illuminant for uniform illumination. The luminometer, spectrometer, and standard white plate were used to measure the parameters of our used white and NIR illuminants.

5.2 Dataset

We captured 1000 samples in total. Each sample contained a blurry and colorful VIS image, a sharp and monochromatic NIR image, and a ground-truth color image. The ground-truth image was used to provide an objective for network training and was not used in the test. The size of the input images is 512×512 pixels.

The three-fold cross-validation method [38] was employed in our training process to enhance the model robustness. The captured 1000 samples were separated into three parts: 810 samples for training, 90 samples for validation, and 100 samples for test. The training samples were used to train the network, and the validation samples were used to evaluate the prediction quality of the trained network. The well-trained network was finally tested on the remaining test samples.

Moreover, preprocessing was introduced to improve the generalization of our network, in which the inputs, including the ground truth, NIR and VIS images, were synchronously rotated and cropped in the same manner. They shared the same random rotation angle and cropping position, as shown in Fig. 11. In the rotation process, the maximum rotation angle was set to 60° because larger rotations introduce black areas and cause information loss. In the cropping step, the minimal cropping ratio was set to 0.8 to ensure that at least 80% of the effective information was retained, maintaining the reliability of the dataset.
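The synchronized augmentation can be implemented by drawing one set of random parameters and applying it to all three images, as in the sketch below. The symmetric angle range, the interpretation of the 0.8 ratio as a per-side ratio, and the resize back to the original resolution are our assumptions; the images are assumed to be PIL images of identical size.

```python
import random
import torchvision.transforms.functional as TF

def synchronized_augment(vis, nir, gt, max_angle=60, min_crop_ratio=0.8):
    """Apply the same random rotation and crop to the VIS, NIR and ground-truth PIL images."""
    # one shared rotation angle for all three images
    angle = random.uniform(-max_angle, max_angle)

    # one shared crop window, covering at least min_crop_ratio of each side (assumed)
    w, h = vis.size
    ratio = random.uniform(min_crop_ratio, 1.0)
    cw, ch = int(w * ratio), int(h * ratio)
    left = random.randint(0, w - cw)
    top = random.randint(0, h - ch)

    out = []
    for img in (vis, nir, gt):
        img = TF.rotate(img, angle)              # same angle for every image
        img = TF.crop(img, top, left, ch, cw)    # same crop window for every image
        out.append(TF.resize(img, [h, w]))       # back to the original 512x512 size (assumed)
    return out
```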

Fig. 11. Presentation of preprocessed images. The same rotation and cropping are applied to the VIS, NIR and ground truth images to realize data augmentation.

5.3 Training details

The network was trained for 300 epochs with a batch size of 2. The Adam optimizer with momentum parameters β1 = 0.9, β2 = 0.999 was employed during training, and the learning rate was set to 0.0001. The network was implemented in PyTorch 1.8. The reconstruction time for each test sample was about 0.2 s on a single Nvidia Tesla P100 GPU.

5.4 Results and analysis

In this section, the performance of our method is qualitatively and quantitatively evaluated. Our fusion algorithm is compared against other NIR-VIS fusion approaches to prove its efficiency. Furthermore, ablation experiments are conducted to analyze the capacity of comparative network architectures with different modules. Moreover, the performance of our method under different focal depths is analyzed. Finally, the influence of noise and blur on imaging quality is discussed.

5.4.1 Performance evaluation

The performance of our computational dual-aperture imaging system is evaluated from four aspects including DOF extension, detail preservation, image edge and color fidelity.

DOF extension. Large DOF imaging is vital in many photography applications and thus motivates our work. One test sample is shown in Fig. 12, including the colorful VIS image, the large DOF NIR image, the fused image, and the ground-truth image. As illustrated above, the small DOF VIS image with faithful color is captured with the large aperture, in contrast to the NIR image. In Fig. 12, the doll’s head (highlighted in the green rectangle) is blurry in the VIS image but sharp in the NIR image. In the fused image, it is strongly deblurred by successfully introducing details from the NIR image, and the DOF range is extended to about 3 times that of the VIS image. Moreover, the PSNR and SSIM [39] metrics of the fused images are higher on average than those of the raw VIS images, as listed in Table 2. All these results demonstrate that our method effectively achieves high-quality, large DOF imaging.

Fig. 12. Presentation of DOF extension. Magnified details are shown in green rectangle. The blurry doll’s head shown in VIS image (green rectangle) is well reconstructed in the fused image with extended DOF.

Table 2. Average quantitative evaluation results on 100 test samples. The best value in every row is highlighted in red and bold font.

Detail preservation. Apart from DOF extension, detail preservation is also very important in many imaging applications. One test sample is shown in Fig. 13 to analyze the detail-preservation capacity of our method. In Fig. 13, the doll’s head (shown in the red rectangle) is blurry and over-exposed in the VIS image but has clear edges in the NIR image. In the fused image, the details in the over-exposed area are well recovered by preserving details from the NIR image. The result proves that our method is powerful in detail preservation.

Fig. 13. Presentation of detail preservation. Magnified details are shown in red rectangle. The blurry and over-exposed doll’s hat shown in VIS image (red rectangle) is recovered with high color fidelity through the fusion of sharp features from NIR image and color information from VIS image.

Image edge. The image edge is also a significant factor in imaging quality tests. Edge reconstruction is challenging for convolutional neural networks because supplementary information is padded at the image border during the convolution operation. Different padding approaches, such as zero padding and replicate padding, may influence the results to varying degrees. Nevertheless, the image edge is well reconstructed by our network regardless of which padding approach is used, as shown in the red rectangles of Fig. 14(a) and (b), because the proposed multi-scale fusion module induces networks with different padding approaches to acquire the optimal fusion path. This demonstrates that our network is robust in edge reconstruction.

Fig. 14. Evaluating the image edge produced by our network with different padding approaches. Magnified details are shown in red rectangle.

Color fidelity. High color fidelity is a crucial factor in camera quality testing and scoring, such as the specific color evaluation section in DXOMARK [40]. In the DEU-Net, the pixel loss is employed to penalize the brightness distribution difference between the fused image and the ground truth, which contributes to the preservation of color information from the VIS image. One sample covering 24 Munsell color chips and many other objects is shown in Fig. 15, top. The blurry image is well reconstructed without apparent color deviation. Furthermore, the RGB color diagrams are also presented and compared, as shown in Fig. 15, bottom. There is no distinct change in the color distribution after fusion. It is also worth noting that in the over-exposed zone, such as the doll’s head shown in the green rectangle of Fig. 15, top, the specular effect is effectively removed, resulting in more visually pleasing fused images.

Fig. 15. Top: Qualitative evaluation of the color fidelity of our method. Bottom: The RGB color diagrams of VIS image, fused image and ground truth are adopted for quantitative evaluation.

5.4.2 Comparison to other image fusion methods

To prove the validity of the proposed image fusion algorithm, DEU-Net is compared with other image fusion methods: the brightness transfer approach, the direct multiplication approach, the pyramid-based approach [25], and the learning-based approach [41]. The inputs to all methods, a colorful VIS image and a large DOF NIR image, are the same. The comparison results are shown in Fig. 16.

Fig. 16. Comparison among different image fusion methods. The same VIS and NIR images are fed to different methods. Images (a) – (e) are the fused images generated by Brightness transfer, Multiplication fusion, Pyramid-based [25], U2Fusion [41], and our method, respectively.

In brightness transfer, the VIS image is converted from RGB to YUV color space and the Y channel is simply replaced with the NIR image to generate the fused image, which has clear edges but serious color deviation because this approach cannot integrate structure information or achieve deep fusion, as shown in Fig. 16(a). In the direct multiplication approach, the NIR image is directly multiplied with the color coordinates (r, g, b):

$$\left\{ \begin{array}{l} fuse = Cat(k \ast Nir \ast r,k \ast Nir \ast g,k \ast Nir \ast b),\\ r = \frac{R}{{R + G + B}}\textrm{, }g = \frac{G}{{R + G + B}}\textrm{, }b = \frac{B}{{R + G + B}}. \end{array} \right.$$

This approach preserves the NIR information well for deblurring, and the color deviation is alleviated compared with the previous approach, as shown in Fig. 16(b). However, the fusion performance heavily relies on the choice of the value of k.
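Equation (15) corresponds to a few lines of array arithmetic, as in the minimal sketch below; the gain k is the manually chosen free parameter on which the fusion quality depends, and the input value ranges are an assumption of this sketch.

```python
import numpy as np

def multiplication_fusion(vis_rgb, nir_gray, k=1.0):
    """Direct multiplication fusion, Eq. (15).
    vis_rgb: (H, W, 3) float array in [0, 1]; nir_gray: (H, W) float array in [0, 1]."""
    s = vis_rgb.sum(axis=2, keepdims=True) + 1e-8      # R + G + B (epsilon avoids division by zero)
    chroma = vis_rgb / s                               # color coordinates (r, g, b)
    fused = k * nir_gray[..., None] * chroma           # modulate the chroma by the NIR brightness
    return np.clip(fused, 0.0, 1.0)
```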

In the pyramid-based approach [25], the Laplacian pyramid is applied to the VIS and NIR images to obtain their low-pass and high-pass bands, which are fused by the designed fusion rule to generate the fused image. The result (shown in Fig. 16(c)) shows that the color deviation is alleviated by constraining the fusion ratio of NIR information compared to the first two approaches.

All of the aforementioned approaches heavily rely on human-designed feature extraction and fusion rules, which can hardly take all necessary factors into account and thus limit the fusion performance. In learning-based approaches, a deep convolutional network is trained on a large dataset to learn the feature extraction and fusion strategies, which avoids difficult manual design and has great potential to yield general and robust image fusion algorithms. In U2Fusion [41], the inputs are the Y channel of the VIS image and the NIR image, and the fused result is then concatenated with the original color information. By leveraging the powerful self-learning ability of CNNs, this approach achieves successful fusion for deblurring while avoiding complex manual intervention, although the result suffers from color deviation because color restoration is not considered, as shown in Fig. 16(d).

Compared with these methods, we leverage the inductive learning of convolutional networks and design special network modules to acquire more appropriate strategies for feature extraction and fusion, which contributes to the generation of large DOF images with faithful color. The result (shown in Fig. 16(e)) also proves that our fusion approach achieves higher-fidelity reconstruction.

5.4.3 Ablation study

To deeply analyze the capability of comparative network models with changed modules, ablation experiments on the network architecture and the loss function are performed. These networks are trained with the same preprocessing strategy, iterations, dataset, and training methodology. Besides subjective comparison, PSNR and SSIM are adopted for objective comparison.

The Pix (pixel-wise loss), like the L1 loss, relies on low-level differences and prompts the network to learn the brightness distribution of the ground truth well. However, it limits the ability in DOF extension because subjective visual perception is not considered. As shown in the closeup in the red rectangle of Fig. 17(a), some details are still blurry in the result. In contrast, the PL (perceptual loss), based on visual feature differences, is designed with a deeper evaluation of human perception, which is beneficial to enhancing the visual perception of the result. Therefore, when the PL is added to Pix, the blur caused by Pix is greatly alleviated, as shown in Fig. 17(b). Compared to the network using only Pix, the PSNR and SSIM metrics are improved (as listed in Table 3), which also proves that PL contributes to improving the image quality.

Fig. 17. Qualitative comparison between the network using Pix and the network using Pix + PL. (a). The reconstructed result of the network using Pix. (b). The reconstructed result of the network using Pix + PL.

Table 3. Objective comparison among different networks with changed modules (average results on 100 test samples). The best value is highlighted in red and bold font. ✓ means adoption, in contrast to ✗.

Although the network trained with the combination of PL and Pix can generate visually pleasing results, its generalization is weak. One test sample captured in another real scene with a larger depth range is shown in Fig. 18. In the blue rectangle of Fig. 18(a), a color structure distortion (the wall’s color deviates toward red) is introduced into the fused image. Together with the image-space losses, the GL (gradient loss) imposes restrictions on the second-order relationship of neighboring pixels, which is vital for the retention of structure information. Therefore, the combination of Pix, PL and GL is proposed to reduce the structure distortion and enhance the generalization of our network, and the corresponding result (shown in Fig. 18(b)) validates this. Meanwhile, the network trained with the combination of Pix, PL and GL obtains the best PSNR and SSIM values (as listed in Table 3). It should be noted that GL may also cause blur due to its strong constraint at the gradient level. Here we adopt GL mainly as a supplement to the pixel and perceptual losses to penalize structure distortion. For different application scenarios, the weight of the gradient loss can be set flexibly; for example, for applications requiring delicate details and sharp edges, it can be reduced or even removed.

Fig. 18. Qualitative comparison between the network using Pix + PL and our chief network using Pix + PL + GL. (a). The reconstructed image of the network using Pix + PL. (b). The reconstructed image of the network using Pix + PL + GL.

Generally, the PL used in low-level tasks contains only the content loss, which helps the network generate visually pleasing results, while the SL (style loss), which contributes to the preservation of color and texture, is often neglected. In our work, SL is introduced to penalize color deviation and detail loss, and the experimental results show that it indeed helps our network recover more natural color and more complete textures. As shown in the yellow rectangle of Fig. 19(a), the result reconstructed by the network without SL suffers from color deviation and texture loss, while the fused image of our chief network shows natural color and intact texture, as shown in Fig. 19(b). In addition, compared to the network without SL, the PSNR and SSIM metrics are improved (as listed in Table 3). All results demonstrate that SL helps the network achieve high-fidelity reconstruction.

Fig. 19. Qualitative evaluation in the improvement of image quality provided by the style transfer loss. (a). The reconstructed image of the network w/o (without) style loss. (b). The reconstructed image of our chief network adding the style loss.

In our chief network, the MFC (multi-scale fusion) layer is specially designed to fuse the features extracted by the encoders. To validate this, the MFC layer is replaced with sequential convolutional layers with 3×3 kernels and stride 1. Both the PSNR and SSIM metrics of that network are lower than those of our chief network (as listed in Table 3), which proves that the MFC module contributes to a better-quality fused result.

5.4.4 Performance evaluation under different focal depths

To analyze the performance of our method under different focal distances, six test samples are captured at focal distances from 475 mm to 1400 mm in the same scene, and the qualitative and quantitative results are given in Fig. 20 and Table 4. The VIS images captured at different focal distances show varying degrees of blur due to the limited DOF under the large aperture. Both subjective and objective results show that our model achieves high image quality except at the focal distance of 475 mm. This is because, as the focal distance shortens, objects far away from the camera become too severely blurred to be recovered. For cases demanding very short or very long focal distances, this can be solved by adding multi-focus image fusion into the network to realize further depth extension.

Fig. 20. Qualitative evaluation of our method in various object focal distances. Samples (a)–(f) are captured when the object focus is set 475 mm, 675 mm, 850 mm, 975 mm, 1200 mm, and 1400 mm away from the camera, respectively. Magnified details are shown in the red and green rectangles.

Table 4. Quantitative analysis of our method in different object focal distances. The best value in every row is highlighted in bold and red font.

5.4.5 Comparison between different apertures

The trade-off between DOF and SNR is a fundamental imaging problem. Although a small aperture helps the camera achieve a large DOF, severe noise will intrude and lower the imaging quality. NIR-VIS image fusion methods can eliminate noise but cannot overcome other inherent problems caused by the optical design. Specifically, (1) a small aperture may cause color deviation due to insufficient optical throughput; and (2) a small aperture may lead to low contrast because of the low gray-level dynamic range. Different from methods based on a reduced aperture, our work leverages the optical advantages of both large and small apertures to collect more complementary raw information (e.g., natural color, sharp edges) and overcome the above-mentioned problems; a specially designed network is then trained to fuse the essential information and obtain high-quality images.

To prove this, an ablation study is shown in Fig. 21. Two state-of-the-art denoising methods based on NIR-VIS fusion, one representative machine learning-based algorithm [42] and one recently proposed deep learning-based algorithm [43], are adopted for the comparison. From the qualitative comparison, several problems are prominent in the results of Scale map (shown in (e)) and MN (shown in (f)). In detail, (1) the results exhibit an erroneous tendency toward green because the given raw color information is already deviated due to the failure of AWB (auto white balance); AWB relies on image analysis and easily fails in precise illuminant estimation and color correction because of noise and BL (black level) drift in low-light conditions; (2) the results exhibit low contrast because the raw gray-intensity information is limited in small-aperture imaging; and (3) detail loss (shown in the green rectangle of (e)) and residual noise (shown in the red rectangle of (f)) exist in the results of Scale map and MN, respectively, because severe noise invades the VIS raw information under low light throughput. Conversely, our result shows natural color, high SNR, and rich texture under dual-aperture imaging, as shown in (b). Moreover, from the quantitative comparison, the F1.4 raw image has higher PSNR and SSIM values than the F12 raw image, as listed in Table 5, and our method achieves the best performance in the objective evaluation. Overall, the qualitative and quantitative results demonstrate that our method performs better.

Fig. 21. Qualitative evaluation of the influence under different apertures. Image (a) is captured under F1.4, (b) is the reconstructed result of our method, and (c) is the ground truth. Image (d) is captured under F12, (e) is the denoised result of Scale map [42], and (f) is the recovered result of MN [43]. Compared with large-aperture imaging, small aperture causes distinct color deviation.

Table 5. Quantitative comparison among F1.4 raw image, F12 raw image, and reconstructed images of different methods. The best value in every row is highlighted in red and bold font.

6. Conclusions and discussions

Large DOF imaging is a fundamental function for cameras. Deviating from the traditional optical design which minimizes the aberrations to extend DOF by designing complex lenses or stacking off-the-shelf lenses, we proposed a computational imaging system with dual differentiated optical paths, and specifically designed an image fusion network DEU-Net to produce a high-definition image with faithful color from the colorful VIS image and the sharp NIR image.

To validate the effectiveness of our proposed method, the prototype consisting of a NIR-VIS prism camera and a NIR illuminant was constructed to capture a real-scene dataset containing 3000 raw images. The results demonstrated that our system successfully achieves high-quality and large DOF imaging. Compared to other image fusion algorithms, our DEU-Net preferably balanced the DOF extension and the color fidelity. As for ablation studies, our chief network model achieved the optimum in subjective and objective evaluation.

Compared to previous works, the key advantage of our method is its flexibility in system formation. For different application scenarios, our system can be designed in different forms with variable optical elements. Moreover, the proposed algorithm can be easily integrated into NIR-VIS cameras to realize DOF extension. By leveraging the perception differences between the human eye and cameras, our system effectively integrates NIR and VIS information, and provides more raw information for various applications such as surveillance, computational photography and medical inspection while avoiding unpleasant light pollution. The uniquely diversified NIR-VIS optical system combining large and small apertures is compact and inexpensive, and can be combined with algorithms to form a new computational imaging technology. Based on deep learning computation, the whole methodology can be easily integrated into GPU-equipped computational imaging systems, such as most modern mobile phones.

The input image size of our well-trained network is currently 512×512 pixels; the network could be retrained and fine-tuned for larger inputs. The reconstruction time of each test sample is about 0.2 s, which is sufficient for commercial camera shots. For future real-time video applications, multi-frame parallel processing approaches can be adopted.

Our large DOF imaging framework, which adopts two differentiated optical paths to provide much richer raw information, has proved robust and powerful for DOF extension. Free from traditional complex optical design, the proposed method tightly integrates dual-aperture optics, NIR-VIS imaging, and an artificial intelligence network, providing a new alternative solution for future large DOF imaging tasks such as machine vision, virtual/augmented reality, and autonomous driving.

Funding

National Natural Science Foundation of China (62105227, 62075143); Sichuan Science and Technology Program (2022YFS0113); Chengdu Science and Technology Program (2021-YF05-01990-SN).

Acknowledgment

The authors would like to acknowledge funding support from the National Natural Science Foundation of China and the Chengdu Science and Technology Program.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. P. Wu, D. Zhang, J. Yuan, S. Zeng, H. Gong, Q. Luo, and X. Yang, “Large depth-of-field fluorescence microscopy based on deep learning supported by Fresnel incoherent correlation holography,” Opt. Express 30(4), 5177–5191 (2022). [CrossRef]  

2. A. Castro, Y. Frauel, and B. Javidi, “Integral imaging with large depth of field using an asymmetric phase mask,” Opt. Express 15(16), 10266–10273 (2007). [CrossRef]  

3. M. Amin-Naji, A. Aghagolzadeh, and M. Ezoji, “Ensemble of CNN for multi-focus image fusion,” Inf. Fusion 51, 201–214 (2019). [CrossRef]  

4. J. Li, X. Guo, G. Lu, B. Zhang, Y. Xu, F. Wu, and D. Zhang, “DRPL: Deep Regression Pair Learning for Multi-Focus Image Fusion,” IEEE Trans. on Image Process. 29, 4816–4831 (2020). [CrossRef]  

5. J. Zuo, W. Zhao, L. Chen, J. Li, K. Du, L. Xiong, S. Yin, and J. Wang, “Multi-focus image fusion algorithm based on random features embedding and ensemble learning,” Opt. Express 30(5), 8234–8247 (2022). [CrossRef]  

6. E. Ben-Eliezer, N. Konforti, B. Milgrom, and E. Marom, “An optimal binary amplitude-phase mask for hybrid imaging systems that exhibit high resolution and extended depth of field,” Opt. Express 16(25), 20540–20561 (2008). [CrossRef]  

7. S. Ryu and C. Joo, “Design of binary phase filters for depth-of-focus extension via binarization of axisymmetric aberrations,” Opt. Express 25(24), 30312–30326 (2017). [CrossRef]  

8. S. Elmalem, R. Giryes, and E. Marom, “Learned phase coded aperture for the benefit of depth of field extension,” Opt. Express 26(12), 15316–15331 (2018). [CrossRef]  

9. V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Trans. Graph. 37(4), 1–13 (2018). [CrossRef]  

10. B. Milgrom, R. Avrahamy, T. David, A. Caspi, Y. Golovachev, and S. Engelberg, “Extended depth-of-field imaging employing integrated binary phase pupil mask and principal component analysis image fusion,” Opt. Express 28(16), 23862–23873 (2020). [CrossRef]  

11. Y. Liu, C. Zhang, T. Kou, Y. Li, and J. Shen, “End-to-end computational optics with a singlet lens for large depth-of-field imaging,” Opt. Express 29(18), 28530–28548 (2021). [CrossRef]  

12. P. Burt and E. Adelson, “Merging images through pattern decomposition,” in Applications of Digital Image Processing VIII, International Society for Optics and Photonics, 575, 173–181 (1985).

13. Y. Liu, X. Chen, H. Peng, and Z. Wang, “Multi-focus image fusion with a deep convolutional neural network,” Inf. Fusion 36, 191–207 (2017). [CrossRef]

14. X. Guo, R. Nie, J. Cao, D. Zhou, L. Mei, and K. He, “FuseGAN: Learning to Fuse Multi-Focus Image via Conditional Generative Adversarial Network,” IEEE Trans. Multimedia 21(8), 1982–1996 (2019). [CrossRef]  

15. Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, “IFCNN: A general image fusion framework based on convolutional neural network,” Inf. Fusion 54, 99–118 (2020). [CrossRef]  

16. E. R. Dowski and W. T. Cathey, “Extended depth of field through wave-front coding,” Appl. Opt. 34(11), 1859–1866 (1995). [CrossRef]

17. O. Cossairt, C. Zhou, and S. Nayar, “Diffusion Coded Photography for Extended Depth of Field,” ACM Trans. Graph. 29(4), 1–10 (2010). [CrossRef]

18. Y. Peng, X. Dun, Q. Sun, F. Heide, and W. Heidrich, “Focal Sweep Imaging with Multi-focal Diffractive Optics,” in IEEE International Conference on Computational Photography (ICCP), 1–8 (2018).

19. J. Ma, Y. Ma, and C. Li, “Infrared and visible image fusion methods and applications: A survey,” Inf. Fusion 45, 153–178 (2019). [CrossRef]  

20. G. Piella, “A general framework for multiresolution image fusion: from pixels to regions,” Inf. Fusion 4(4), 259–280 (2003). [CrossRef]  

21. X. Bai, F. Zhou, and B. Xue, “Fusion of infrared and visual images through region extraction by using multi scale center-surround top-hat transform,” Opt. Express 19(9), 8444–8457 (2011). [CrossRef]  

22. D. Donoho and A. Flesia, “Can recent innovations in harmonic analysis ‘explain’ key findings in natural image statistics?” Network: Comput. Neural Syst. 12(3), 371–393 (2001). [CrossRef]  

23. H. Lin, Y. Tian, R. Pu, and L. Liang, “Remotely sensing image fusion based on wavelet transform and human vision system,” IJSIP 8, 291–298 (2015). [CrossRef]  

24. B. Olshausen and D. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature 381(6583), 607–609 (1996). [CrossRef]  

25. Y. Liu, S. Liu, and Z. Wang, “A general framework for image fusion based on multi-scale transform and sparse representation,” Inf. Fusion 24, 147–164 (2015). [CrossRef]  

26. H. Li, X. Wu, and J. Kittler, “RFN-Nest: An end-to-end residual fusion network for infrared and visible images,” Inf. Fusion 73, 72–86 (2021). [CrossRef]  

27. X. Wang, F. Dai, Y. Ma, J. Guo, Q. Zhao, and Y. Zhang, “Near-infrared Image Guided Neural Networks for Color Image Denoising,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3087–3811 (2019).

28. W. J. Smith, Modern Optical Engineering: The Design of Optical Systems (McGraw-Hill Education, 2008).

29. CIE, “Recommended system for visual performance based on mesopic photometry,” CIE Publication 191:2010 (2010).

30. B. Crawford, “The Scotopic Visibility Function,” Proc. Phys. Soc. B 62(5), 321–334 (1949). [CrossRef]  

31. J. Walsh, “Visibility of Radiant Energy Equation,” J. Opt. Soc. Am. 11(2), 111–112 (1925). [CrossRef]  

32. JAI Inc., “JAI FS-3200D-10GE Multi-spectral camera,” https://www.jai.com/products/fs-3200d-10ge [Online; accessed 16 May 2022].

33. J. Xiong, J. Wang, W. Heidrich, and S. Nayar, “Seeing in Extra Darkness Using a Deep-Red Flash,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 9995–10004 (2021).

34. J. Johnson, A. Alahi, and F. Li, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision (ECCV), 694–711 (2016).

35. Y. Peng, Q. Sun, X. Dun, G. Wetzstein, W. Heidrich, and F. Heide, “Learned large field-of-view imaging with thin-plate optics,” ACM Trans. Graph. 38(6), 1–14 (2019). [CrossRef]  

36. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 (2014).

37. C. Ma, Y. Rao, Y. Cheng, C. Chen, J. Lu, and J. Zhou, “Structure-Preserving Super Resolution with Gradient Guidance,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7766–7775 (2020).

38. Y. Hu, B. Wang, and S. Lin, “FC4: Fully Convolutional Color Constancy with Confidence-Weighted Pooling,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 330–339 (2017).

39. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]  

40. G. Facciolo, G. Pacianotto, M. Renaudin, C. Viard, and F. Guichard, “Quantitative measurement of contrast, texture, color, and noise for digital photography of high dynamic range scenes,” in Proc. IS&T Int’l. Symp. on Electronic Imaging: Image Quality and System Performance XV, 30, 1–10 (2018).

41. H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2Fusion: A Unified Unsupervised Image Fusion Network,” IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 502–518 (2022). [CrossRef]  

42. Q. Yan, X. Shen, L. Xu, S. Zhuo, X. Zhang, L. Shen, and J. Jia, “Cross-Field Joint Image Restoration via Scale Map,” in IEEE International Conference on Computer Vision (ICCV), 1537–1544 (2013).

43. S. Xu, J. Zhang, J. Wang, K. Sun, C. Zhang, J. Liu, and J. Hu, “A model-driven network for guided image denoising,” Inf. Fusion 85, 60–71 (2022). [CrossRef]  

Supplementary Material (1)

Supplement 1: The measurement criterion for mesopic vision




Figures (21)

Fig. 1.
Fig. 1. Schematic overview of the computational dual-aperture imaging system for large DOF. Optical system: consisting of a NIR camera with small aperture N and a VIS camera with large aperture V. DEU-Net: image fusion network architecture with two parallel encoders and a decoder. It should be clarified that the NIR input is a composite image combining brightness information of the NIR image and color features of the VIS image.
Fig. 2.
Fig. 2. Schematic diagram of optical imaging law. Q and Q’ denote large aperture and small aperture. H and H’ are the principal points of object and imaging. It should be noted that the compound lens set adopted here is just for clarification and a more complex optical system is also applicable.
Fig. 3.
Fig. 3. The diagram of LUT with different LP at 0.01, 0.3, 4.5 cd/m2. m determines the adaption conditions of the mesopic vision. The increase of LP or RSP results in larger m, which implies that the human eye will adjust the vision system to improve the role of cones in mesopic vision to adapt to the change in luminance.
Fig. 4.
Fig. 4. The black dotted curve, white solid curve, and black solid curve are the distribution of luminosity functions of the scotopic, mesopic and photopic vision, denoted as V’(λ) [30], VM(λ) and V(λ) [31], respectively.
Fig. 5.
Fig. 5. Top: Spectral power distributions of our employed white and 850 nm NIR illuminants, denoted as φVis(λ) and φNir(λ).
Fig. 6.
Fig. 6. The camera spectral sensitivity of our used multi-spectral prism camera (JAI FS-3200D-10GE) [32].
Fig. 7.
Fig. 7. A. The schematic of our dual-aperture optical imaging system. B. The quantitative difference between the human eye and the cameras in the ratio of VIS perception to NIR perception under the same measured lighting conditions in our experiment (very strong NIR light and relatively weak VIS light), showing the strong sensitivity of the human eye in the VIS band (VIS LP of 263.8 cd/m2 versus NIR LP of 0.03 cd/m2) and the opposite behavior of the camera (NIR response 5.7869 times as large as the VIS response).
Fig. 8.
Fig. 8. A. Architecture of DEU-Net. B. The layer configurations are illustrated with different color blocks. The blue (VIS encoder) and yellow (NIR encoder) blocks share the same configurations.
Fig. 9.
Fig. 9. The generation of the NIR and VIS input images fed to the network. The VIS input is the captured VIS image. The NIR input is a composite image that combines color information from the captured VIS image and brightness features from the captured NIR image.
Fig. 10.
Fig. 10. Illustration of our experiment setup. All the training, validation and test sets were captured by our camera. The NIR light as the supplement lamp was used to maintain enough light throughput in small-aperture NIR imaging. Meanwhile, a uniform light membrane was adhered to the pupil of the NIR illuminant for uniform illumination. The luminometer, spectrometer, and standard white plate were used to measure the parameters of our used white and NIR illuminants.
Fig. 11.
Fig. 11. Presentation of preprocessed images. The same preprocessing of rotation and cropping is applied to the VIS, NIR and ground-truth images for data augmentation.
Fig. 12.
Fig. 12. Presentation of DOF extension. Magnified details are shown in green rectangle. The blurry doll’s head shown in VIS image (green rectangle) is well reconstructed in the fused image with extended DOF.
Fig. 13.
Fig. 13. Presentation of detail preservation. Magnified details are shown in red rectangle. The blurry and over-exposed doll’s hat shown in VIS image (red rectangle) is recovered with high color fidelity through the fusion of sharp features from NIR image and color information from VIS image.
Fig. 14.
Fig. 14. Evaluating the image edge produced by our network with different padding approaches. Magnified details are shown in red rectangle.
Fig. 15.
Fig. 15. Top: Qualitative evaluation of the color fidelity of our method. Bottom: The RGB color diagrams of VIS image, fused image and ground truth are adopted for quantitative evaluation.
Fig. 16.
Fig. 16. Comparison among different image fusion methods. The same VIS and NIR images are fed to different methods. Images (a) – (e) are the fused images generated by Brightness transfer, Multiplication fusion, Pyramid-based [25], U2Fusion [41], and our method, respectively.
Fig. 17.
Fig. 17. Qualitative comparison between the network using Pix and the network using Pix + PL. (a). The reconstructed result of the network using Pix. (b). The reconstructed result of the network using Pix + PL.
Fig. 18.
Fig. 18. Qualitative comparison between the network using Pix + PL and our chief network using Pix + PL + GL. (a). The reconstructed image of the network using Pix + PL. (b). The reconstructed image of the network using Pix + PL + GL.
Fig. 19.
Fig. 19. Qualitative evaluation in the improvement of image quality provided by the style transfer loss. (a). The reconstructed image of the network w/o (without) style loss. (b). The reconstructed image of our chief network adding the style loss.
Fig. 20.
Fig. 20. Qualitative evaluation of our method at various object focal distances. Samples (a)–(f) are captured with the object focus set 475 mm, 675 mm, 850 mm, 975 mm, 1200 mm, and 1400 mm away from the camera, respectively. Magnified details are shown in the red and green rectangles.
Fig. 21.
Fig. 21. Qualitative evaluation of the influence under different apertures. Image (a) is captured under F1.4, (b) is the reconstructed result of our method, and (c) is the ground truth. Image (d) is captured under F12, (e) is the denoised result of Scale map [42], and (f) is the recovered result of MN [43]. Compared with large-aperture imaging, small aperture causes distinct color deviation.

Tables (5)

Table 1. LP is measured by the luminometer, RSP is calculated as Eq. (5), m is given via LUT.

Table 2. Average quantitative evaluation results on 100 test samples. The best value in every row is highlighted in red and bold font.

Table 3. Objective comparison among different networks with changed modules (average results on 100 test samples). The best value is highlighted in red and bold font. ✓ means adoption, in contrast to ✗.

Table 4. Quantitative analysis of our method in different object focal distances. The best value in every row is highlighted in bold and red font.

Table 5. Quantitative comparison among F1.4 raw image, F12 raw image, and reconstructed images of different methods. The best value in every row is highlighted in red and bold font.

Equations (16)


(1) $\frac{1}{f} = \frac{1}{v} + \frac{1}{u}$

(2) $B = Qh\left(\frac{1}{f} - \frac{1}{h} - \frac{1}{u}\right)$

(3) $\mathrm{DOF} = \dfrac{2u^{2}f^{3}LQ}{f^{4}Q^{2} - (fuL)^{2}}$

(4) $V(\lambda) = \dfrac{K(\lambda)}{K_{m}}, \quad V'(\lambda) = \dfrac{K'(\lambda)}{K'_{m}}$

(5) $\begin{cases} M(m)\,V_{M}(\lambda) = m\,V(\lambda) + (1-m)\,V'(\lambda) \\ m = \mathrm{LUT}(L_{P},\, R_{SP}) \\ R_{SP} = \dfrac{\int_{0}^{\infty}\Phi(\lambda)\,K'(\lambda)\,d\lambda}{\int_{0}^{\infty}\Phi(\lambda)\,K(\lambda)\,d\lambda} \end{cases}$

(6) $\begin{cases} \mathrm{Signal\_Ratio} = \dfrac{\int C_{Nir}(\lambda)\,\Phi_{Nir}(\lambda)\,d\lambda}{\int C_{Vis}(\lambda)\,\Phi_{Vis}(\lambda)\,d\lambda} \\ C_{VIS}(\lambda) = w_{R}\,C_{VIS\_R}(\lambda) + w_{G}\,C_{VIS\_G}(\lambda) + w_{B}\,C_{VIS\_B}(\lambda) \end{cases}$

(7) $\mathrm{Perception\_Gain} = \dfrac{1700\int V_{M}(\lambda)\,\Phi_{Nir\_E}(\lambda)\,d\lambda}{683\int V_{M}(\lambda)\,\Phi_{Vis}(\lambda)\,d\lambda}$

(8) $\mathrm{Pix} = \dfrac{1}{CHW}\sum_{c,h,w}\left| x_{c,h,w} - real_{c,h,w} \right|$

(9) $\mathrm{PL} = \zeta_{feat}^{\varphi,j} + \vartheta_{style}^{\varphi,j}$

(10) $\zeta_{feat}^{\varphi,j}(x, real) = \dfrac{1}{C_{j}H_{j}W_{j}}\left\| \varphi_{j}(x) - \varphi_{j}(real) \right\|_{F}^{2}$

(11) $\vartheta_{style}^{\varphi,j}(x, real) = \left\| G_{j}^{\varphi}(x) - G_{j}^{\varphi}(real) \right\|_{F}^{2}$

(12) $G_{j}^{\varphi}(x)_{c,c'} = \dfrac{1}{C_{j}H_{j}W_{j}}\sum_{h=1}^{H_{j}}\sum_{w=1}^{W_{j}} \varphi_{j}(x)_{h,w,c}\,\varphi_{j}(x)_{h,w,c'}$

(13) $\begin{cases} \nabla I = \sqrt{I_{x}^{2} + I_{y}^{2}} \\ I_{x} = I(x+1,y) - I(x-1,y) \\ I_{y} = I(x,y+1) - I(x,y-1) \end{cases}$

(14) $\mathrm{GL} = \dfrac{1}{CHW}\sum_{c,h,w}\left| \nabla x_{c,h,w} - \nabla real_{c,h,w} \right|$

(15) $\Gamma = w_{1}\,\mathrm{Pix} + w_{2}\,\mathrm{PL} + w_{3}\,\mathrm{GL}$

(16) $\begin{cases} fuse = \mathrm{Cat}(k \cdot Nir \cdot r,\; k \cdot Nir \cdot g,\; k \cdot Nir \cdot b) \\ r = \dfrac{R}{R+G+B},\quad g = \dfrac{G}{R+G+B},\quad b = \dfrac{B}{R+G+B} \end{cases}$
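As a concrete reading of the last equation above, the composite NIR input can be formed by modulating the NIR brightness with the VIS chromaticity coordinates r, g, b. The scale factor k and the epsilon guard in the sketch below are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the composite NIR input: each channel is k * Nir * {r, g, b},
# where r, g, b are the VIS chromaticity coordinates R/(R+G+B), G/(R+G+B),
# B/(R+G+B). The scale k and the epsilon guard are assumptions.
import numpy as np

def composite_nir_input(vis_rgb: np.ndarray, nir: np.ndarray, k: float = 3.0):
    """vis_rgb: (H, W, 3) float image; nir: (H, W) float image."""
    eps = 1e-6
    s = vis_rgb.sum(axis=-1, keepdims=True) + eps      # R + G + B
    chroma = vis_rgb / s                               # (r, g, b)
    return k * nir[..., None] * chroma                 # Cat(k*Nir*r, k*Nir*g, k*Nir*b)
```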