
Time-gated imaging through dense fog via physics-driven Swin transformer

Open Access

Abstract

Imaging through fog is valuable for many areas, such as autonomous driving and space exploration. However, strong backscattering and diffuse reflection generated by dense fog disrupt the temporal-spatial correlations of photons returning from the target object, so the reconstruction quality of most existing methods drops significantly under dense fog conditions. In this study, we describe the optical scattering imaging process and propose a physics-driven Swin Transformer method that combines Time-of-Flight (ToF) and deep learning principles to mitigate scattering effects and reconstruct targets in heterogeneous dense fog. The results suggest that, despite the exponential decrease in the number of ballistic photons as the optical thickness of the fog increases, the physics-driven Swin Transformer method performs satisfactorily when imaging targets obscured by dense fog. Importantly, in dense fog imaging experiments with optical thickness reaching up to 3.0, which exceeds that of previous studies, commonly used quantitative evaluation metrics such as PSNR and SSIM indicate that our method is state of the art for imaging through dense fog.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Fog is a common atmospheric phenomenon, constituted by an aerosol system of tiny water droplets formed by the condensation of water vapor in the air near the ground [1–3]. In authentic fog scenarios, the outdoor air teems with fine suspended particles. The interaction between light and complex media such as fog results in scattering [4–6]. The backscattered light, after mixing with the light reflected off targets, significantly diminishes the clarity and contrast of images captured by sensors. This can lead to color shifts and substantial loss of detail in the images, thereby obscuring accurate image information. Particularly in dense fog conditions, the prevalence of extensive backscattering notably reduces visibility, leading to a blurred field of vision. Under such circumstances, sensors exhibit pronounced distortion, adversely impacting operational efficiency in various domains, including but not limited to assisted and autonomous driving, remote sensing, and public safety. Therefore, the development of efficient imaging through the fog (ITTF) techniques is a critical factor in enhancing safety and reliability across these application areas. As illustrated in Fig. 1, detectors record the backscatter signals from ambient light or active laser illumination, then reconstruct clear fog-free scenes based on physical models or data-driven methods. In recent years, with the development of artificial intelligence, ITTF has become a comprehensive, multidisciplinary task, mainly spanning optics and computer vision.

Fig. 1. The basic principle of ITTF. Active ITTF uses an additional light source together with optical field modulation and algorithms, whereas passive ITTF requires no additional light source and is achieved through data-driven methods and algorithms.

By combining technical solutions from the fields of optics and computer vision, we can categorize ITTF into active and passive types according to the use of an active modulation light source. Active ITTF generally uses actively modulated light as the illumination source for the fog-obstructed target and employs pulsed lasers and ultra-fast detectors, such as intensified Charge-Coupled Devices (ICCD), Single-Photon Avalanche Diodes (SPAD), and ToF cameras. Via techniques such as the transmission matrix [7,8], speckle correlation [9–11], wavefront shaping [12–15], active polarization [16,17], and time of flight [18–22], it generates modulated signals, collects the active echo signals from the foggy scene, and then computes images from them.

Over the past decade, numerous ITTF technologies have been developed. While methods like the transmission matrix and speckle correlation enable non-invasive acquisition of object information, they are constrained by the field of view and the speed of reconstruction. Furthermore, phase distortions caused by scattering or optical components can be rectified either by iteratively optimizing the wavefront of the input light pattern or by conjugating the transmission matrix of the fog. Expanding on this principle, Lai et al. [14] designed an efficient dual-pulse excitation technique. This technique generates robust nonlinear photoacoustic (PA) signals utilizing the Grueneisen relaxation effect, and these nonlinear PA signals are then used as feedback to guide the iterative optimization of the wavefront. Such optimization achieves optical diffraction-limited focusing in scattering media. On the other hand, in the context of active polarization dehazing, current methods often employ a near-Lambertian light source similar to sunlight to enhance brightness in scattering environments. Rowe et al. [23] calculated the Stokes vectors [24] using polarized images captured at different polarization angles and designed a dehazing system based on bionic principles using polarization-difference optics. Fade et al. [17] improved upon this by capturing images with two polarization directions using an improved polarization-difference experimental setup.

However, the aforementioned methods, particularly transmission matrices, wavefront shaping, and polarization, are noticeably sensitive to the environment in foggy conditions. Because temperature and pressure directly affect the optical path, wavefront shaping requires considerable environmental stability; fluctuations in these quantities can generate aberrations, particularly spherical aberrations. Polarization-based defogging also requires specialized polarized light sources and unique configurations, and any changes in temperature, humidity, or air pressure can introduce inaccuracies into the model.

Alternative techniques for large-scale ITTF include time-gating methods based on ToF [18,19], with a recent survey on transient imaging compiled in [20]. These methods are viable when the object is positioned at a considerable distance (resulting in reduced coupling with the backward reflection of the fog) and are less affected by other environmental factors. However, their practicality is limited by the signal-to-noise ratio, requiring long integration times and static scenes. Satat et al. [21] discovered that background and signal photons from foggy scenes follow gamma and Gaussian distributions over time, respectively. They used the temporal distribution of arriving photons measured by a SPAD to separate ballistic photons from the highly scattered photons travelling through dense fog with optical thickness OT = 2.5. This elegant method achieved high imaging quality in scenes with dense fog. However, most SPAD imaging devices with sufficient temporal resolution have limited pixel counts, and available SPAD arrays with more than 1k$\times$1k pixels are expensive; both factors are obstacles to SPAD-based ITTF in real-world applications. Therefore, in this article, we focus on a time-gated imaging method for seeing through dense fog that employs devices with acceptable costs. Our findings suggest that, in the presence of fog with high optical thickness, the echo photons from the target object are considerably weak and subject to the limitations of detector sensitivity and time resolution. Although active ITTF technology has achieved better fog perception capability, there is still significant potential for improvement under dense fog conditions.

Unlike active ITTF, passive ITTF does not use additional light sources; it only utilizes ambient atmospheric light and various detectors such as Microbolometer Arrays (MA), Charge-Coupled Devices (CCD), and Complementary Metal-Oxide-Semiconductor (CMOS) sensors. It collects the original signals from foggy scenes and processes them using methods such as band selection [25,26], cloud tomography [27–29], passive polarization [30–32], as well as parameter estimation [33–36] and end-to-end image reconstruction [37–40] based on deep learning.

In the solar spectrum, the visible light wavelength range roughly spans from 390nm to 780nm. Ordinary visible light, because of its shorter wavelengths, struggles to penetrate fog, whereas infrared light can penetrate certain levels of fog. Schaul et al. [25] combined the strong penetration of near-infrared wavelengths through fog with the advantage of visible color output, achieving ITTF by fusing visible and near-infrared images of the foggy scene. Feng et al. [26] exploited the dissimilarity between visible and near-infrared light for airlight color estimation and proposed a two-stage dehazing method followed by an optimization framework. However, band selection techniques require the presence of the selected band in the scene and place strict requirements on the types of occluded targets, making them less suitable for widespread ITTF applications. Passive polarization techniques analyze the scattered polarized light measured in natural environments. Cloud tomography techniques measure the fog-obscured layer from multiple angles, using tomographic algorithms to reconstruct the three-dimensional structure of the fog in order to understand and predict its distribution and movement. Ultimately, these methods increase the clarity and accuracy of imaging. They impose no requirements on target type, but they currently struggle with accuracy under dense fog conditions.

In 2010, He et al. [41] introduced the Dark Channel Prior (DCP) method, which uses only a standard CMOS camera and assumes that in most non-sky regions of the captured image there is a channel whose pixel values tend toward zero. They combined this with an atmospheric scattering model [4–6] to estimate the thickness of haze. In recent years, with the advent of GPU parallel computing and the availability of various open-source fog datasets, deep learning has emerged as the dominant solution in the field of ITTF. Owing to the proven effectiveness of DCP in ITTF, early deep learning methods utilized images captured by standard CMOS cameras and incorporated the atmospheric scattering model to perform dehazing through parameter estimation. Cai et al. [35] proposed the DehazeNet model based on convolutional neural networks, which takes hazy images as input and outputs their transmission maps. Ren et al. [36] introduced a multi-scale deep neural network that learns the mapping between blurred images and their corresponding transmission maps for single-image dehazing.

As the deep learning field continues to evolve, an increasing number of end-to-end methods are being introduced to enhance the generalization capability of network models across a wider range of fog scenarios and to improve the quality of image reconstruction in varying atmospheric conditions. Li et al. [37] reformulated the atmospheric scattering model and employed a lightweight CNN to directly predict dehazed images. Chen et al. [38] proposed an end-to-end gated context aggregation network that employs smooth dilated convolutions to eliminate the grid artifacts caused by dilated convolution, directly recovering the final haze-free image. Qin et al. [39] reconfigured the feature attention module and multi-level feature fusion mechanism, using residual blocks as the backbone for feature extraction to reconstruct input hazy images. Song et al. [40] modified the Swin Transformer, which has demonstrated excellence in high-level computer vision tasks, to directly process hazy images and reconstruct clear images.

Unfortunately, the reconstruction process through deep learning methods is an ill-posed problem. Existing physical prior-based algorithms relying on the atmospheric scattering model and those utilizing deep learning approaches often encounter challenges due to unreliable physical priors. This leads to inaccurate transmission estimations. Moreover, the majority of deep learning-based methods resort to training on synthetic data. However, there exists a domain gap between synthetic and real data, and the scarcity of dense fog occlusion data with high optical thickness further compounds the issue.

It’s worth noting that in previous ITTF work, the optical thickness of fog has consistently remained below 2.5. Most studies have chosen fog scenarios with optical thickness between 1 and 2, which limits practical applications in extreme weather conditions. Due to the exponential attenuation of ballistic photons with increasing optical thickness beyond 2.0, the signal-to-noise ratio significantly deteriorates in heavy fog. To broaden the scope of ITTF for wider applications, new methods are needed to recover obscured objects from raw data in high optical thickness fog conditions.

To address these challenges, especially regarding target perception through dense fog, this paper proposes an active imaging-through-dense-fog technique, shown in Fig. 2, based on a Physics-Driven Swin Transformer (PDST) to eliminate scattering effects and reconstruct scenes containing heterogeneous scattering media of different concentrations. The main contributions are as follows:

  • (1) We propose a PDST theoretical framework by defining an optical scattering model and designing ToFormerv2; the framework eliminates scattering effects and processes depth-intensity distribution signals using the ToF principle and the ToFormerv2 network.
  • (2) We introduce ToFormerv2, built upon the Swin Transformer, incorporating novel feature extraction and multi-scale transformation methods. These enhancements aim to effectively capture both high-frequency and low-frequency features, balance contextual semantic information, and construct accurate spatial structures. Additionally, we define a loss function constrained by the physics model of fog imaging for training ToFormerv2.
  • (3) We have assembled a prototype of the PDST method and established a real-world Imaging Through Dense Fog (ITDF) dataset, which includes the widest range of optical thicknesses and the highest concentrations. Through comparative experiments with existing optical penetration and image dehazing techniques, we have confirmed the advantages of the PDST theory for imaging through strong scattering media.

Fig. 2. PDST theoretical framework. (a) A dual-laser and light source modulation system operating at 532nm and 633nm with a power of P. (b) A programmable gated optical reception system. (c) A control and processing center for photon and EET equation parameters. (d) The ToFormerv2 framework structure. (e) Short-distance dense fog scenarios with optical thickness ranging from 0 to 3.0.

2. Physics-driven Swin transformer

2.1 Active imaging through dense fog physical principle

In a hypothetical three-dimensional space containing only the target and dense fog as shown in Fig. 2(e), we consider a scenario where a pulsed laser emits photons directed towards the target. Adjacent to the laser is a CMOS camera. In this specialized context of single-sensor imaging through dense fog, the photon signals received by the sensor can be categorized as background photons that do not interact with the target and only carry scattering information, signal photons that interact with the target and contain information about the target’s reflectance and depth, and dark counts considered as error detection noise that occurs uniformly over time.

Therefore, the noise signal is composed of the summation of diffuse photons within the adjacent region of the specific point and the backscattered signals reflected at that point. The radius and structure of these signals are determined by the scene and density of the scattering environment. When coupled with the Radiative Transfer Function (RTF) theory, the signal contribution equation (SCE) received from the sensor can be expressed as Eq. (1):

$$\begin{aligned} S_{\text{recv}}(X, Y, T) & \approx S_{\text{ballistic}}(X, Y, T) + S_{\text{bk scat}}(X, Y, T)\\ & + \sum_{i}\sum_{j} \int_{k}S_{\text{snake}}(X+i, Y+j, T)dT+ S_{\text{dk cnt}}(X, Y, T) \end{aligned}$$
where $S_{\text{recv}}$ represents the total signal received by the sensor. $S_{\text{ballistic}}$ and $S_{\text{snake}}$ belong to the signal photons: $S_{\text{ballistic}}$ denotes ballistic photons that reflect off the target surface and reach the detector without further interaction with the environment, while $S_{\text{snake}}$ denotes photons that do interact with the environment. $S_{\text{bk scat}}$ falls under the background photons, and $S_{\text{dk cnt}}(X, Y, T)$ represents dark counts. $(X, Y)$ and $T$ denote the pixel location and time bin at which the CMOS sensor records these signal contributions, and $k$ represents the temporal duration of the time gate. When the optical thickness of the fog is low, the contribution from $S_{\text{snake}}$ can be neglected.
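As a rough, self-contained illustration of how Eq. (1) composes the received frame, the following Python sketch (our own construction, not code from the paper) sums synthetic ballistic, backscatter, snake, and dark-count contributions; the Poisson rates, frame size, and neighbourhood radius are arbitrary assumptions.

```python
# Minimal sketch (not from the paper): composing S_recv per Eq. (1) for a synthetic frame.
import numpy as np

H, W, K = 64, 64, 8          # spatial size and number of time bins within the gate (assumed)
rng = np.random.default_rng(0)

s_ballistic = rng.poisson(5.0, (H, W)).astype(float)     # target-reflected ballistic photons
s_bk_scat   = rng.poisson(20.0, (H, W)).astype(float)    # backscatter from the fog volume
s_dk_cnt    = rng.poisson(0.5, (H, W)).astype(float)     # uniform dark counts
s_snake_t   = rng.poisson(2.0, (K, H, W)).astype(float)  # scattered signal photons per time bin

def snake_term(snake, radius=1):
    """Sum snake photons over the (i, j) neighbourhood and integrate over the gate duration."""
    integrated = snake.sum(axis=0)                        # integral over T within the gate
    acc = np.zeros_like(integrated)
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            acc += np.roll(np.roll(integrated, di, axis=0), dj, axis=1)
    return acc

s_recv = s_ballistic + s_bk_scat + snake_term(s_snake_t) + s_dk_cnt
print(s_recv.shape)  # (64, 64)
```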

We first propose Hypothesis 1: in the scene depicted in Fig. 2(e), the target does not exist. In this case, for an active laser light source, the irradiance at a distance $Z$ reflected back onto the CMOS can be represented by Eq. (2).

$$dI = \frac{P}{{4F_{\text{n}}^2\tan(\theta)^2Z^2+A_0}} \left(e^{-\gamma Z} - e^{-\gamma (Z+dZ)}\right)G e^{-\gamma Z}$$

Here, $\gamma$ signifies the atmospheric attenuation coefficient, $F_{\text {n}}$ is the maximum aperture diameter of the photographic lens, and $G$ corresponds to the backscatter gain. $P$ represents the laser power, $A_0$ is the laser aperture, and $\theta$ is the laser divergence angle. Furthermore, $e^{-\gamma Z}$ characterizes the attenuation of laser intensity at distance $Z$, following Beer’s law. Calculating the precise value of $G$ is contingent upon specific atmospheric conditions, typically determined using Mie scattering theory.

We approximate Eq. (2) using a first-order Taylor series and integrate it over the total optical path $Z$ to obtain the echo signal $S_{\text{echo}}(Z)$ detected by the sensor over a range of length $Z$.

$$S_{\text{echo}}(Z) = \int_{0}^{Z} dI = \int_{0}^{Z} \frac{\gamma P e^{-2\gamma z}}{4F_{\text{n}}^2\tan(\theta)^2 z^2}\, G\, dz$$
$$\min|S_{\text{bk scat}}| = \int_{Z-Depth_{\text{target}}}^{Z} \frac{\gamma P e^{-2\gamma z}}{4F_{\text{n}}^2\tan(\theta)^2 z^2}\, G\, dz$$

Since no target is present within the range $Z$, the received signal $S_{\text{echo}}$ can be regarded as the background signal $S_{\text{bk scat}}$ in the SCE. By restricting the integral in Eq. (3) to the target depth interval $[Z-Depth_{\text{target}}, Z]$, as shown in Eq. (4), this short-range interval, relative to integrating over the entire optical path length $Z$, minimizes the scattered background noise $S_{\text{bk scat}}$.
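To make the effect of Eqs. (3) and (4) concrete, the sketch below numerically integrates the backscatter integrand once over (almost) the full path and once over only the target-depth interval; all parameter values ($\gamma$, $F_n$, $\theta$, $Z$, the target depth, and the near-field cut-off) are illustrative assumptions, not the paper's calibration.

```python
# Minimal sketch (illustrative parameters): numerically integrating the backscatter term of Eqs. (3)-(4).
import numpy as np

gamma, P, G = 0.5, 1.0, 1.0        # attenuation coefficient (1/m), laser power, backscatter gain
F_n, theta = 2.8, np.radians(1.0)  # lens f-number and laser divergence angle (assumed)
Z, depth_target = 6.0, 0.3         # total path length and target depth extent (m)

def integrand(z):
    # backscatter contribution per unit path length (the integrand of Eq. (3))
    return gamma * P * np.exp(-2.0 * gamma * z) * G / (4.0 * F_n**2 * np.tan(theta)**2 * z**2)

def integrate(a, b, n=100_000):
    # simple trapezoidal rule; a > 0 keeps clear of the 1/z^2 near-field singularity
    z = np.linspace(a, b, n)
    y = integrand(z)
    return float(np.sum((y[:-1] + y[1:]) * 0.5 * (z[1] - z[0])))

full_path = integrate(0.5, Z)               # Eq. (3): backscatter over (almost) the whole path
gated     = integrate(Z - depth_target, Z)  # Eq. (4): only the target-depth interval
print(f"full-path backscatter: {full_path:.3e}, gated backscatter: {gated:.3e}")
```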

We further propose Hypothesis 2: in the scenario described in Fig. 2(e), if the occluded target is at position $Z$, then within the range from 0 to $Z$ the situation is the same as in Hypothesis 1, where only backscatter exists. Thus, we can also obtain $\min|S_{\text {bk scat}}|$ at position $Z$.

Figure 3 shows the $S_{\text {recv}}$ results at different optical thicknesses, as well as the longitudinal intensity distribution curve when $|S_{\text {bk scat}}|$ is minimized. We find that although integrating at the target depth effectively suppresses the backscattering effects, the imaging intensity decreases as OT increases, exhibiting exponential decay in the Region of Interest (ROI) in particular. The smoothness of the intensity curve also decreases.

Fig. 3. Top: $S_{\text {recv}}$ results at different optical thicknesses (OT = 0.1, 1.4, 2.9) when $|S_{\text {bk scat}}|$ is minimized. Bottom: the intensity distribution projected onto the x-axis.

This phenomenon is due to the increase of suspended particles in the scene as OT increases. These particles scatter photons that, under thin-fog conditions, would have remained ballistic, increasing the $S_{\text {snake}}$ signal component, which is the main reason for the decreased smoothness of the intensity curve. Based on the Fresnel equations and the interference model, and assuming a single reflection occurs on the object surface without multiple reflections or interference under unchanged conditions of incident angle, wavelength, object surface properties, and the optical system, we can describe this process using Eq. (5):

$$I = I_0 \cdot R \cdot \left(1 - e^{-2\alpha \cdot ot}\right)$$

Here, $I$ is the imaging intensity, $I_0$ is the incident intensity, $R$ is the reflectance, $ot$ is the optical thickness, and $\alpha$ is a constant related to the material’s absorption characteristics. It can be seen that $ot$ has an inverse exponential relationship with imaging intensity, and high reflectance in the ROI region further accelerates the decay of intensity. We introduced a physics-driven ToFormerv2 network designed to reconstruct a clear scene of the target.

2.2 ToFormerv2

We aim to construct an efficient dehazing network based on physical driving principles, which can adapt to different optical thicknesses and realize the target reconstruction task in dense fog. We have designed the model’s capabilities in two aspects: multi-scale transformation and feature extraction. These efforts are directed to strike a balance between harnessing contextual semantic information and achieving precise reconstruction of spatial structures, thereby facilitating accurate target scene reconstruction. These two aspects are realized through all the modules described in Fig. 4. We will introduce these two aspects in this section. Additionally, we will provide a more comprehensive explanation of the design principles of each module contained within these two aspects, as well as the loss function, in the Supplement 1.

Fig. 4. ToFormerv2 Network Architecture: Based on the U-Net architecture, the ToFormerv2 network incorporates ToFormerv2 blocks as the fundamental backbone.

2.2.1 Overall framework

The overall structure of the proposed ToFormerv2 network is shown in Fig. 4(a). We use a multi-scale network structure (encoder-decoder U-net structure) to expand the receptive field and capture more refined contextual information.

We first divide the image into non-overlapping patches (4$\times$4) using 2D convolutions and a Patch Embedding layer, and linearly project the feature dimensions to the specified dimension $C$. Then, the transformed patch tokens are fed into the encoder to extract deep features at multiple scales and linear dimensions through multiple ToFormerv2 blocks (Fig. 4(b)) and Slice Reassemble (SR) blocks (Fig. 4(d)). The SR block is responsible for the multi-scale transformation involving downsampling and dimension augmentation, while the ToFormerv2 block focuses on feature extraction.
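The following PyTorch sketch conveys our reading of this encoder layout (4$\times$4 patch embedding, then alternating feature blocks and downsampling stages that double the channel dimension); the plain convolutions merely stand in for the ToFormerv2 and SR blocks described later, and all sizes are assumptions.

```python
# Minimal PyTorch sketch (our reading of Fig. 4(a), not the authors' code).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, in_ch=1, dim=48, patch=4):
        super().__init__()
        # non-overlapping 4x4 patches, linearly projected to C channels
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x)  # (B, C, H/4, W/4)

class TinyEncoder(nn.Module):
    """Placeholder encoder: real ToFormerv2/SR blocks would replace these layers."""
    def __init__(self, dim=48, depth=3):
        super().__init__()
        stages = []
        for i in range(depth):
            stages += [
                nn.Conv2d(dim * 2**i, dim * 2**i, 3, padding=1),        # stands in for a ToFormerv2 block
                nn.Conv2d(dim * 2**i, dim * 2**(i + 1), 2, stride=2),   # stands in for an SR block
            ]
        self.stages = nn.Sequential(*stages)

    def forward(self, x):
        return self.stages(x)

x = torch.randn(1, 1, 256, 256)            # gated intensity image
feat = TinyEncoder()(PatchEmbed()(x))
print(feat.shape)                          # torch.Size([1, 384, 8, 8])
```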

Subsequently, we combine multiple ToFormerv2 blocks (Fig. 4(b)), the Multi-branch upsample (MultiUP) block (Fig. 4(e)), and the Branch Attention Fusion Subnetwork (BAFS) (Fig. 4(f)) as the decoder. This mapping aims to translate deep, high-dimensional features to the upper-scale space while better integrating spatial structure information preserved from shallow-level features.

2.2.2 Feature extraction

In the feature extraction aspect, we introduce the ToFormerv2 Block as the feature extraction block, as shown in Fig. 4(b), incorporating the Dual-branch Spatial Convolution-Multi-Head Self-Attention (DSC-MHSA) method (Fig. 4(c)), the parameterizable W-Mask method, and the Convolutional Flexible Multi-Layer Perceptron (CF-MLP) block.

The DSC-MHSA is designed to balance high and low-frequency information within the spatial structure. To begin, we employ Point-wise convolution (PC) to map image features into Q, K, V. Subsequently, we concatenate features along the channel dimension from depth-wise convolution (DC) and window MHSA branches, and then feed the concatenated features into a point-wise convolution. Furthermore, we set up learnable weights to measure the importance of different fused feature channels for adaptive fusion of branch output features.
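A simplified sketch of this dual-branch idea is given below; it is our own approximation rather than the released implementation, it omits the W-Mask step discussed next, and it uses per-branch (rather than per-channel) fusion weights for brevity. The window size and head count are assumptions.

```python
# Minimal sketch of the dual-branch idea in DSC-MHSA (a simplified reading, not the authors' code).
import torch
import torch.nn as nn

class DualBranchAttention(nn.Module):
    def __init__(self, dim=48, window=8, heads=4):
        super().__init__()
        self.window = window
        self.pw_in = nn.Conv2d(dim, dim, 1)                       # point-wise projection
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # depth-wise convolution branch
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pw_out = nn.Conv2d(2 * dim, dim, 1)                  # fuse concatenated branches
        self.branch_w = nn.Parameter(torch.ones(2))               # learnable fusion weights

    def forward(self, x):
        B, C, H, W = x.shape
        x = self.pw_in(x)
        local = self.dw(x)                                        # high-frequency / local branch
        w = self.window                                           # window MHSA branch
        t = x.reshape(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        t, _ = self.attn(t, t, t)
        t = t.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        fused = torch.cat([self.branch_w[0] * local, self.branch_w[1] * t], dim=1)
        return self.pw_out(fused)

print(DualBranchAttention()(torch.randn(1, 48, 64, 64)).shape)  # torch.Size([1, 48, 64, 64])
```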

It is noteworthy that in the masking step of the MHSA branch of DSC-MHSA, unlike traditional MHSA, which directly masks off attention weights obtained from different regions, we define a parameterizable mask (W-Mask) and initialize its weight with a high value. This avoids hard-masking attention weights from different regions and encourages the network to adaptively learn the weight magnitude.
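The sketch below shows one way such a parameterizable mask could replace the usual fixed additive mask; the initial weight value and tensor shapes are assumptions on our part.

```python
# Minimal sketch of the W-Mask idea (an assumption, not the released implementation):
# a learnable weight, initialised high, scales the additive cross-region mask
# instead of a fixed large negative constant, so the network can relax it during training.
import torch
import torch.nn as nn

class WMask(nn.Module):
    def __init__(self, init_weight=100.0):
        super().__init__()
        self.weight = nn.Parameter(torch.tensor(init_weight))  # learnable, initialised high

    def forward(self, attn_logits, region_mask):
        # region_mask is 1 where two tokens belong to different shifted-window regions
        return attn_logits - self.weight * region_mask

logits = torch.randn(1, 4, 64, 64)                 # (batch, heads, tokens, tokens)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()    # toy cross-region indicator
softened = WMask()(logits, mask).softmax(dim=-1)
print(softened.shape)                              # torch.Size([1, 4, 64, 64])
```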

The output features from DSC-MHSA are further fed into the CF-MLP. In contrast to the original MLP, we employ only three layers of operations (Conv-PReLU-Conv). We opt for a monotonic non-linear activation function, PReLU, to alleviate gradient inversion issues. Additionally, PReLU prevents potential gradient vanishing issues within the negative value region by introducing learnable negative slopes. To capture abstract spatial features in high-dimensional image data, such as edges, textures, and output feature shapes, 1x1 convolutional layers are employed as hidden and output layers in CF-MLP. The dropout layer is omitted to expedite model convergence.
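A minimal sketch of the CF-MLP as described (1$\times$1 conv, PReLU, 1$\times$1 conv, no dropout) might look as follows; the hidden width is an assumption.

```python
# Minimal sketch of CF-MLP (Conv 1x1 - PReLU - Conv 1x1, no dropout); hidden width assumed.
import torch
import torch.nn as nn

class CFMLP(nn.Module):
    def __init__(self, dim=48, hidden=96):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),   # 1x1 conv as the hidden layer
            nn.PReLU(hidden),            # learnable negative slope per channel
            nn.Conv2d(hidden, dim, 1),   # 1x1 conv as the output layer
        )

    def forward(self, x):
        return self.body(x)

print(CFMLP()(torch.randn(1, 48, 64, 64)).shape)  # torch.Size([1, 48, 64, 64])
```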

2.2.3 Multi-scale transformation

In the multi-scale transformation aspect, we introduce the SR block (Fig. 4(d)), the MultiUP block (Fig. 4(e)), and the BAFS (Fig. 4(f)).

We employ the SR block [42] for downsampling, as illustrated in Fig. 4(d). It divides the feature map into four parts along the $H$ and $W$ directions at a 2-pixel interval. Subsequently, the SR block merges the parts along the channel dimension and applies a linear layer to adjust the feature dimensions to the specified size, effectively avoiding the loss of spatial structural information.
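Our reading of this slicing step can be sketched as follows; the 1$\times$1 projection stands in for the linear layer, and channel sizes are assumptions.

```python
# Minimal sketch of the Slice Reassemble (SR) downsampling step (our reading, not the authors' code).
import torch
import torch.nn as nn

class SRBlock(nn.Module):
    def __init__(self, dim, out_dim):
        super().__init__()
        self.proj = nn.Conv2d(4 * dim, out_dim, 1)   # linear projection over channels

    def forward(self, x):
        # slice into four interleaved parts at a 2-pixel interval, so no pixels are discarded
        parts = [x[:, :, 0::2, 0::2], x[:, :, 1::2, 0::2],
                 x[:, :, 0::2, 1::2], x[:, :, 1::2, 1::2]]
        return self.proj(torch.cat(parts, dim=1))    # (B, out_dim, H/2, W/2)

print(SRBlock(48, 96)(torch.randn(1, 48, 64, 64)).shape)  # torch.Size([1, 96, 32, 32])
```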

We employ the MultiUP block and BAFS for upsampling. The MultiUP block, as illustrated in Fig. 4(e), includes 1x1 convolution for feature extraction, PReLU activation function to introduce non-linearity, and different upsampling techniques, including Pixel Shuffle and Bilinear Interpolation. The first branch utilizes convolutional layers and Pixel Shuffle, focusing on extracting local texture features, while the second branch employs convolutional layers and Bilinear Interpolation, aiming to preserve global structure. To further balance contextual information and precise spatial structure reconstruction, we introduce the BAFS. As illustrated in Fig. 4(f), BAFS first concatenates the output features $F_1$ and $F_2$ from the two branches of MultiUP along the channel dimension, and then passes them through a linear layer to expand the feature space while reducing the channel dimension, forming the upsampled fusion feature $F_{up}$. Subsequently, BAFS concatenates $F_{up}$ with the output feature $F_{sc}$ from the SR Block of the encoder at the same depth, along the channel dimension to form the fused feature $F_{f}$. This is done to preserve global structural features and local texture details. Then, we apply a convolutional layer to $F_{f}$ to obtain a fused attention scale factor, which, along with the original fused feature $F_{f}$, is linearly combined to produce the output feature.
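The sketch below captures our simplified reading of the MultiUP branches and the BAFS fusion; the channel bookkeeping and the sigmoid-gated combination are assumptions chosen to match the description, not the authors' exact implementation.

```python
# Minimal sketch of MultiUP and BAFS (a simplified reading; channel sizes are assumptions).
import torch
import torch.nn as nn

class MultiUP(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.branch1 = nn.Sequential(          # local texture branch: conv + PixelShuffle
            nn.Conv2d(dim, 2 * dim, 1), nn.PReLU(), nn.PixelShuffle(2))
        self.branch2 = nn.Sequential(          # global structure branch: conv + bilinear upsampling
            nn.Conv2d(dim, dim // 2, 1), nn.PReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))

    def forward(self, x):
        return self.branch1(x), self.branch2(x)

class BAFS(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.up_fuse = nn.Conv2d(dim, dim // 2, 1)       # fuse the two upsampled branches
        self.attn = nn.Conv2d(dim, dim, 3, padding=1)    # fused attention scale factor

    def forward(self, f1, f2, f_sc):
        f_up = self.up_fuse(torch.cat([f1, f2], dim=1))  # upsampled fusion feature F_up
        f_f = torch.cat([f_up, f_sc], dim=1)             # concat with encoder skip feature F_sc
        return f_f + torch.sigmoid(self.attn(f_f)) * f_f # attention-weighted linear combination

dim = 96
f1, f2 = MultiUP(dim)(torch.randn(1, dim, 16, 16))
out = BAFS(dim)(f1, f2, torch.randn(1, dim // 2, 32, 32))
print(out.shape)  # torch.Size([1, 96, 32, 32])
```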

3. Experiment

To evaluate the effectiveness of the proposed method, we first set up an experimental environment and designed the PDST system prototype. We then compared it with active and passive ITTF methods in the field of computer vision and optics.

3.1 Experimental scenario

As the Swin Transformer is a deep learning model, it requires a large amount of sample data in advance to drive the network to learn complex mapping relationships. Since the optical thickness of fog in real-world scenes changes dynamically and continuously, it is necessary to acquire a significant amount of sample data for foggy conditions with different optical thicknesses before deploying the system in practical applications. This characteristic makes considerable preparatory work necessary for capturing the variations in foggy scenes. To improve efficiency, we constructed the ITTF scene (Fig. 5):

Fig. 5. Basic ITTF scene. The pulses of laser light emitted from the laser pass through the fog chamber and illuminate the surface of the obscured target. The gated camera collects the reflected signals from the region of interest behind the fog to reconstruct a two-dimensional image of the obscured target.

The dense fog chamber employs an ultrasonic atomizer to generate smoke containing suspended droplets of approximately 1-10 micrometers in size at a rate of 75 kg per hour. To introduce the fog into the chamber, a low-power, fan-embedded polyethylene pipe connects the atomizer to the chamber. For safety reasons, the chamber is constructed from acrylic material. The chamber dimensions are 0.8 m (H) $\times$ 0.5 m (W) $\times$ 1.0 m (L), with an entrance port of 10 cm in diameter. The fog chamber is located in the line of sight, at distances of 5 meters from the optical system and 1 meter from the target area. The acrylic glass chamber is sealed and filled with fog during each measurement, and the target scene is observed as the aerosol particles gradually dissipate.

Different objects of various materials, such as chemical fibers, wood, or metal, with dimensions of 0.3 meters (height) $\times$ 0.2 meters (width), are selected as targets to simulate pedestrians, vehicles, and other common objects in real foggy weather. The targets are placed at different depths to capture contributions from signals within various depth ranges. The continuous wave laser used to measure fog optical thickness is arranged in parallel with the targets, but outside the field of view of the ICMOS camera. This arrangement is an important component of the system and helps gather data related to foggy optical characteristics.

3.2 PDST system prototype

The system prototype of the physics-driven Swin Transformer (PDST) model is used to collect experimental data, build the PDST dataset, and promote the application of ITTF in real scenarios. The PDST system, illustrated in Fig. 6, is based on the ToF principle. It incorporates a 532nm CMOS camera (with a minimum gate of 3ns, >2000@single MCP, 1920$\times$1200px-5.86um) equipped with a 532nm bandpass filter and mounted on a telephoto lens to receive directional light. A pulsed laser, operating at a repetition rate of 1-20Hz and generating a 1064nm fundamental frequency that is converted to 532nm via second-harmonic generation, is optically expanded at the laser output using lens combinations to ensure complete coverage of the target. To quantitatively measure the optical thickness of the fog, a dedicated optical emission and reception system is devised. This system employs an industrial camera equipped with a 633nm bandpass filter, combined with a 633nm continuous-wave laser, enabling the industrial camera to receive the 633nm directional light emitted from the occluded scene without interfering with the simultaneous operation of the two imaging systems.

Fig. 6. PDST System Prototype: The CMOS camera is installed on a telephoto lens. A combination lens system is employed to expand a pulsed laser at a wavelength of 532nm. An industrial camera is used to receive the fog-penetrated signal of continuous wave laser at 633nm. An edge computing unit is utilized for ToFormerv2; Dense Fog Chamber: An ultrasonic fog generator is connected to the acrylic fog chamber through a polyethylene pipe with a small fan embedded in it.

To achieve $\min|S_{\text {bk scat}}|$ at target distance $Z$, we implement gating in front of the CMOS. The gate is opened only when the optical echo signal from the target object returns to the camera within a specific time bin. The relationship between camera shutter control and laser pulse width can be described as Equation (6):

$$\Gamma(d) = \sum_{i=1}^{n}\int_{0}^{\infty} L\!\left(t_{i} - \frac{2d}{c}\right) \ast C(t_{i}) \, dt$$
where $\Gamma (d)$ represents the effective exposure time of the target, $c$ is the speed of light, $d$ is the range of the target, and $C(t)$ and $L(t)$ are the gate width function of the camera and the pulse width function of the laser, respectively. Equation (6) should be considered only during the time that the gated camera is open. According to Eq. (3), by controlling $d$ as the distance between the sensor and the target object, we can mitigate the effects of backscattering, thereby improving the signal-to-noise ratio.
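For intuition, the round-trip timing that Eq. (6) builds on can be computed directly; the distances below are illustrative, not the experiment's geometry.

```python
# Minimal sketch (not the system's firmware): the gate delay that places the gate
# on the echo of a target at distance d follows the round-trip time 2d/c of Eq. (6).
C = 299_792_458.0  # speed of light, m/s

def gate_delay_ns(d_m: float) -> float:
    """Round-trip delay (ns) after which the gate should open for a target at d_m metres."""
    return 2.0 * d_m / C * 1e9

for d in (5.0, 6.0, 7.0):
    print(f"target at {d} m -> open gate ~{gate_delay_ns(d):.1f} ns after the laser pulse")
# target at 5.0 m -> open gate ~33.4 ns after the laser pulse
```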

3.3 Parameter settings

According to Eq. (1), the echo signal received by the sensor, $S_{\text {recv}}$, is influenced by the backscatter $S_{\text {bk scat}}$ and the diffusely scattered $S_{\text {snake}}$, and it undergoes significant attenuation after passing through dense fog. Therefore, as shown in Fig. 7, we use the "on-chip integration" method to accumulate target reflection signals from multiple pulses in the same time bin within a single CMOS exposure by intermittently opening the gate at a fixed frequency.

Fig. 7. (a) Trigger logic timing; (b) Principle of on-chip integration

The maximum number of times the MCP gate can be opened within a single exposure time, denoted as $n$ (the $n$ in Eq. (6)), is directly proportional to the external triggering frequency $f$ and the CMOS exposure time $t$.

Because the intrinsic delay of the CMOS is 3.1 microseconds, which is greater than the laser emission delay, the first laser pulse is missed within each exposure cycle (CMOS sensing period plus CMOS hardware delay). To address this issue, we perform "$N+1$" on-chip integration, meaning that the MCP gate is opened $N+1$ times within a single exposure cycle to accumulate $N$ effective signals $S_{\text {recv}}$. Figure 7 illustrates the overall trigger logic timing diagram.
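The bookkeeping described above can be sketched as follows; the trigger frequency and exposure time are illustrative values, not the prototype's settings.

```python
# Minimal sketch of the on-chip integration bookkeeping (illustrative numbers only):
# the gate can open at most n = floor(f * t) times per exposure, and one extra
# opening compensates for the missed first pulse.
def gate_openings(trigger_freq_hz: float, exposure_s: float) -> int:
    """Maximum number of MCP gate openings within one CMOS exposure."""
    return int(trigger_freq_hz * exposure_s)

f, t = 20.0, 0.5                    # external trigger frequency (Hz), exposure time (s) - assumed
n = gate_openings(f, t)             # openings available in one exposure
N = n - 1                           # effective accumulations after the missed first pulse
print(f"open the gate {N + 1} times to accumulate {N} effective S_recv signals")
```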

To train ToFormerv2 to learn the complex mapping relationship from the intermediate images to the clear images, we collected 1600 sets of high-resolution intermediate images covering different target types placed at various depths in the target scene. These images were then segmented into 56,000 sets of blurry images with a size of 256$\times$256 px, out of which we selected 52,500 sets for training. For testing, we conducted experiments on 100 sets of high-resolution blurry images.

For the distributed training of ToFormerv2, we utilized dual NVIDIA RTX A4000 graphics cards and the PyTorch 2.0.0 framework. During the training phase, we employed the ADAM optimizer as our chosen optimization method, setting the momentum parameters $\beta _1$ and $\beta _2$ to 0.9 and 0.999, respectively. Additionally, we initialized the learning rate as 0.0002 and configured the batch size as 4. To implement dynamic learning rate adjustment, we employed the cosine annealing strategy as shown in Eq. (7) [43].

$$\alpha(t) = \frac{1}{2} \cdot (\alpha_{\text{max}} + \alpha_{\text{min}}) \cdot \left(1 + \cos\left(\frac{t \pi}{T_{\text{max}}}\right)\right)$$

Based on experimental observations, we found that the optimal number of training epochs was 200 to ensure the model’s convergence throughout the training process.
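A minimal PyTorch sketch of the stated optimizer and schedule settings is given below; the model and training loop body are placeholders, and only the Adam hyperparameters, learning rate, and 200-epoch cosine schedule follow the text.

```python
# Minimal sketch of the stated training configuration (placeholder model and loop body).
import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)   # stand-in for ToFormerv2
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)  # cosine annealing, cf. Eq. (7)

for epoch in range(200):
    # ... one pass over the 52,500 training crops and the loss computation would go here ...
    optimizer.step()      # placeholder step; a real loop backpropagates the loss first
    scheduler.step()      # cosine-annealed learning rate update per epoch
```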

4. Results and discussion

4.1 Output of reconstructed images

Figure 8 presents the reconstruction results obtained during the target validation process, specifically focusing on chemical fibers and wood materials. The showcased instances illustrate two distinct fog scenarios with optical thicknesses of 1.8 and 3.0.

Fig. 8. Original, $CE\text{-}S_{\text{sep}}$, recovery, and intensity results. First column: original images with different optical thicknesses; second column: backscatter-suppressed images obtained with digital image processing; third column: reconstruction results using PDST; fourth column: ground truth.

In the visual representation, the first column shows the authentic original signals, i.e., the unprocessed $S_{\text{recv}}$ signals. The second column displays the digitally processed signals, named $CE\text{-}S_{\text{sep}}$, representing the $S_{\text{recv}}$ signal when $|S_{\text {bk scat}}|$ is minimized. The third column showcases the final reconstruction results, emphasizing the algorithm’s efficacy in alleviating the impact of fog and enhancing image clarity. The fourth column, serving as a crucial reference, presents the ground truth for direct comparison with the reconstruction results.

For a more in-depth analysis, intensity profiles along the lateral directions, denoted as $\alpha$, $\beta$, $\gamma$, and $\delta$ in Fig. 8, have been plotted for each signal type. These additional insights contribute to a deeper understanding of the spatial distribution of internal features within the reconstructed images and the overall performance of the PDST method.

It is noteworthy that ToFormerv2 achieves remarkable success in recovering de-fogged images for both chemical fibers and wood materials. Particularly impressive is its ability to achieve robust reconstruction even under challenging conditions, such as an optical thickness of 3.0. This not only highlights the algorithm’s effectiveness but also showcases its versatility in addressing various levels of atmospheric interference. The results presented in Fig. 8 collectively affirm the PDST method as a valuable tool for image reconstruction in adverse weather conditions.

4.2 Comparison with optical model-based penetration technique

We demonstrated our restoration technique within a scattering chamber containing dense, non-uniform, and heterogeneous strong scattering media. The optical thickness (OT) in this chamber ranged from 0 to 3.0. Existing optical methods for penetrating scattering media mainly focus on natural thin fog and uniform haze in real-world scenarios, and such methods exhibit varying degrees of degradation when applied to scenes with strong scattering media. Based on this, we selected state-of-the-art optical haze-penetrating methods that target experimental scenarios and parameter metrics similar to ours, involving high-concentration strong scattering media: SPAD combined with the probabilistic computational framework method, and SPAD combined with the time gating method.

Table 1 provides a quantitative evaluation of scatter-penetration perception achieved using these methods in different high-concentration strong scattering media scenarios (OT=3.0, 2.5, 2.1, 1.9, 1.6).

Table 1. Quantitative comparison of different optical model methods for imaging through fog

The evaluation results demonstrate that our method consistently achieves higher PSNR and SSIM values than existing methods across all optical thickness values, indicating its superior capability to effectively remove haze obscuration and improve image quality in strong scattering media scenarios. Specifically, our method achieves an average PSNR of 34.90 dB across four optical thickness values (OT = 2.5, 2.1, 1.9, 1.6), surpassing the probability statistics method and the time gating method by 18.07 dB and 22.03 dB, respectively. Similarly, our method attains an average SSIM of 0.974 across the same optical thickness values, outperforming the probability statistics method and the time gating method by 0.50 and 0.79, respectively.

Furthermore, we evaluated the PDST method at a higher optical thickness (OT=3.0), achieving a PSNR of 32.31 and an SSIM of 0.963 in a high-concentration strong scattering media scenario.

Moreover, we performed statistical analyses on the PSNR and SSIM values across all optical thickness levels, calculating their means and standard deviations. The results indicate that our method exhibits more stable performance in terms of PSNR and SSIM compared to existing methods, characterized by lower standard deviations. This suggests that our method is more reliable for processing haze images in various scenarios and performs better when handling haze images of different optical thicknesses.
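For reference, PSNR and SSIM of a reconstruction against its ground truth can be computed as in the sketch below (assuming scikit-image is available); the arrays are synthetic stand-ins, not the paper's data.

```python
# Minimal sketch of the quantitative metrics used above, on synthetic arrays.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

gt = np.random.default_rng(0).random((256, 256)).astype(np.float32)      # ground-truth image
noise = 0.01 * np.random.default_rng(1).standard_normal(gt.shape)
recon = np.clip(gt + noise, 0.0, 1.0).astype(np.float32)                 # stand-in reconstruction

psnr = peak_signal_noise_ratio(gt, recon, data_range=1.0)
ssim = structural_similarity(gt, recon, data_range=1.0)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.3f}")
```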

4.3 Comparison with image processing-based penetration technique

As shown in Table 2, we evaluated our PDST method and state-of-the-art (SOTA) image-based haze-penetration techniques in real-world scenarios using the ITDF dataset. Compared to existing SOTA methods, our PDST method achieved the best performance, with a PSNR of 33.03 dB and an SSIM of 0.974. Compared to the second-ranking method, DehazeFormer-B, our PDST method exhibited a performance gain of 7.83 dB in PSNR and 0.124 in SSIM.

We also conducted a qualitative analysis of the PDST method, as depicted in Fig. 9. We selected optical thickness values ranging from 0.5 to 3.0 for the scattering-obscured medium, encompassing scenes with moderate, dense, and extremely dense scattering concentrations. Within these scenarios, we present the original images, scatter images, and reconstructed images obtained with the various methods. We observed that at the moderate scatter level (OT = 0.5-0.9), prevailing comparative methods managed to achieve rudimentary haze removal; however, our approach notably outperforms these counterparts in terms of detail recovery and glare reduction. In instances of dense scatter (OT = 1.3-1.7), methods such as DCP, GCANet, and FFANet struggled to effectively eliminate backscattering, with GCANet exhibiting conspicuous intensity distortion, while MSBDN, TaylorFormer, and DehazeFormer achieved only rudimentary contour reconstruction of the target objects. At the extremely dense scatter level (OT = 2.1-3.0), existing ITTF techniques encountered challenges in reconstructing objects effectively, leading to pronounced distortion. In contrast, our PDST method consistently achieved higher-quality restored images across diverse optical thickness scenarios, particularly excelling in extremely dense scatter conditions. In summary, our proposed PDST method outperforms all existing image processing techniques on this task.

Fig. 9. Qualitative comparison of different image processing methods for imaging through fog.

Table 2. Quantitative comparison of different image processing methods for imaging through fog

4.4 Ablation study

To assess the effectiveness of the individual module designs in the PDST method, we initially conducted ablation analyses on both the PDST method and the ToFormerv2 network. Concurrently, to further validate the efficacy of our ToFormerv2 network module, we substituted the ToFormerv2 algorithm module in PDST with existing state-of-the-art (SOTA) methods [38–41,44,45] and evaluated the proposed ToFormerv2 network model on the ITDF dataset. The experimental results are presented in Table 3, which gives the average outcomes of the ablation experiments over the scattering-medium optical thickness range from 0 to 3.0.

Table 3. Ablation study on PDST method

In Table 3, the first row corresponds to the complete PDST approach, while the second and third rows validate the benefits of the ToF method and ToFormerv2 modules, respectively. The subsequent seven rows pertain to an analysis of the advancement of the ToFormerv2 network model. It is evident from these results that removing the two major modules, the ToF method and ToFormerv2, significantly reduces experimental performance, with the ToFormerv2 network exhibiting higher gains within the PDST Method.

Furthermore, the results in the last seven rows of the table indicate that substituting the ToFormerv2 network module with existing image processing methods for imaging through fog results in varying degrees of performance degradation. Notably, the degradation is most pronounced with the DCP algorithm, possibly because its reliance on the dark channel prior assumption does not hold well for single-channel grayscale images, leading to suboptimal performance. Additionally, we observed that DehazeFormer, which achieved the highest quantitative scores on the publicly available RESIDE dataset, experiences a decline in its ability to recover gated images. This decline could be attributed to the fact that target object information in heavily hazy scenes is nearly completely occluded, rendering the soft-constraint method proposed in DehazeFormer unsuitable for transmittance image data. Moreover, factors such as the reduced training data volume and the relatively high complexity of the DehazeFormer model contribute to overfitting.

In contrast, our ToFormerv2 method introduces improvement modules at various scales within the network architecture, enabling it to learn the intricate mapping relationship for imaging in the presence of strong scattering media and thus enhancing the model’s generalization capability. The qualitative analysis results of the different deep learning algorithms are shown in Fig. 10. The figure illustrates that, compared to other image processing methods, the ToFormerv2 network is better suited to the PDST approach. Specifically, algorithm (a) is unable to effectively reconstruct the target; networks (b), (d), and (g) encounter difficulties in reconstructing background elements in foggy scenes and tend to introduce new artifacts. Similarly, networks (e) and (f) cannot accurately reconstruct the shapes of objects, while networks (c) and (h) not only fail to effectively remove fog haze noise but also struggle to accurately reconstruct the shapes of objects.

To further validate the superiority of the ToFormerv2 network, we shift our focus to the effectiveness of its internal structure. We specifically conduct ablation experiments on various sub-modules within the DeScatter module, directing our primary attention to the following modules: (1) BAFS, (2) DSC-MHSA, and (3) CF-MLP. While keeping the other configurations and implementation details of the PDST system constant, we carry out these ablation experiments on the ITDF dataset, and the results are presented in Table 4. As shown in the first two rows of the table, we degraded the DSC-MHSA module to a regular MHSA similar to that in the Swin Transformer to validate the effectiveness of the DSC-MHSA module. It is evident that the regular MHSA, lacking the high-pass characteristics of the parallel convolutions, experiences a significant degradation in image restoration.

Fig. 10. The qualitative analysis results of different existing image processing-based penetration algorithms, including (a) DCP [41], (b) GCANet [38], (c) FFANet [39], (d) MSBDN [44], (e)-(g) DehazeFormer [40], (h) MB-TaylorFormer [45], and (i) our ToFormerv2 network.

Table 4. Ablation study on ToFormerv2

In the third and fourth rows, we verified the effectiveness of the CF-MLP module: according to the experimental results, our CF-MLP module exhibited a 16.7% improvement in PSNR and a 4.1% improvement in SSIM compared to the traditional MLP. This enhancement is attributed to the fact that PReLU, in comparison to GELU, is monotonic and is better suited to low-level computer vision tasks. Additionally, PReLU’s learnable negative slope enhances its robustness across different optical thickness scenarios.

Finally, we compared the proposed BAFS module with popular existing image dehazing upsampling methods. As shown in the fifth and sixth rows, the results indicate that the BAFS module offers higher gains compared to existing upsampling methods.

5. Conclusion

This paper introduces a Physics-Driven Swin Transformer method for imaging through dense fog. This method effectively mitigates scattering effects and enables the reconstruction of targets under heterogeneous fog with varying optical thickness. In summary, we define the optical imaging process in foggy environments and develop a physics-driven ToFormerv2. Through the ablation study, we have demonstrated the effectiveness of the feature extraction, multi-scale transformation, and loss function methods within the proposed ToFormerv2. Furthermore, we have constructed a prototype PDST system and introduced the ITDF dataset to evaluate its performance. In comparative analysis with SOTA optical and computer vision methodologies, the PDST system has shown exceptional imaging results, particularly in challenging dense fog scenarios. Our future research efforts will focus on imaging through the fog within real-world scenarios marked by heterogeneous scattering media occlusions. Additionally, we aim to design more robust and lightweight reconstruction networks to facilitate the development of real-time penetration sensing systems for device detection.

Funding

National Natural Science Foundation of China (62172371, 62272421, U21B2037); the Preresearch Project on Civil Aerospace Technologies funded by China National Space Administration (D010105).

Acknowledgment

This work is supported by the National Natural Science Foundation of China under Grant No. 62272421, and in part supported by the No. U21B2037 and No. 62172371. Additionally, it is supported by the Preresearch Project on Civil Aerospace Technologies funded by China National Space Administration under Grant No. D010105.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper will be released in the repository within two weeks after the article is formally published [46]. The dataset and code are available at the link in [47].

Supplemental document

See Supplement 1 for supporting content.

References

1. P. Gill, T. Graedel, and C. Weschler, “Organic films on atmospheric aerosol particles, fog droplets, cloud droplets, raindrops, and snowflakes,” Rev. Geophys. 21(4), 903–920 (1983). [CrossRef]  

2. J. L. Pérez-Díaz, O. Ivanov, Z. Peshev, et al., “Fogs: Physical basis, characteristic properties, and impacts on the environment and human health,” Water 9(10), 807 (2017). [CrossRef]  

3. I. Boutle, J. Price, I. Kudzotsa, et al., “Aerosol–fog interaction and the transition to well-mixed radiation fog,” Atmos. Chem. Phys. 18(11), 7827–7840 (2018). [CrossRef]  

4. E. J. McCartney, “Optics of the atmosphere: scattering by molecules and particles,” New York (1976).

5. S. G. Narasimhan and S. K. Nayar, “Chromatic framework for vision in bad weather,” in Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), vol. 1 (IEEE, 2000), pp. 598–605.

6. S. G. Narasimhan and S. K. Nayar, “Vision and the atmosphere,” Int. J. Comput. Vis. 48(3), 233–254 (2002). [CrossRef]  

7. S. Popoff, G. Lerosey, M. Fink, et al., “Image transmission through an opaque material,” Nat. Commun. 1(1), 81 (2010). [CrossRef]  

8. Y. Choi, T. D. Yang, C. Fang-Yen, et al., “Overcoming the diffraction limit using multiple light scattering in a highly disordered medium,” Phys. Rev. Lett. 107(2), 023902 (2011). [CrossRef]  

9. O. Katz, P. Heidmann, M. Fink, et al., “Non-invasive single-shot imaging through scattering layers and around corners via speckle correlations,” Nat. Photonics 8(10), 784–790 (2014). [CrossRef]  

10. M. Chen, H. Liu, Z. Liu, et al., “Expansion of the fov in speckle autocorrelation imaging by spatial filtering,” Opt. Lett. 44(24), 5997–6000 (2019). [CrossRef]  

11. H. He, X. Xie, Y. Liu, et al., “Exploiting the point spread function for optical imaging through a scattering medium based on deconvolution method,” J. Innovative Opt. Health Sci. 12(04), 1930005 (2019). [CrossRef]  

12. I. M. Vellekoop and A. Mosk, “Focusing coherent light through opaque strongly scattering media,” Opt. Lett. 32(16), 2309–2311 (2007). [CrossRef]  

13. A. P. Mosk, A. Lagendijk, G. Lerosey, et al., “Controlling waves in space and time for imaging and focusing in complex media,” Nat. Photonics 6(5), 283–292 (2012). [CrossRef]  

14. P. Lai, L. Wang, J. W. Tay, et al., “Photoacoustically guided wavefront shaping for enhanced optical focusing in scattering media,” Nat. Photonics 9(2), 126–132 (2015). [CrossRef]  

15. Z. Yu, H. Li, T. Zhong, et al., “Wavefront shaping: a versatile tool to conquer multiple scattering in multidisciplinary fields,” The Innov. 3(5), 100292 (2022). [CrossRef]  

16. D. Brousseau, J. Plant, and S. Thibault, “Real-time polarization difference imaging (rpdi) reveals surface details and textures in harsh environments,” in Photonic Applications for Aerospace, Commercial, and Harsh Environments IV, vol. 8720 (SPIE, 2013), pp. 100–106.

17. J. Fade, S. Panigrahi, A. Carré, et al., “Long-range polarimetric imaging through fog,” Appl. Opt. 53(18), 3854–3865 (2014). [CrossRef]  

18. O. David, N. S. Kopeika, and B. Weizer, “Range gated active night vision system for automobiles,” Appl. Opt. 45(28), 7248–7254 (2006). [CrossRef]  

19. M. Laurenzis, F. Christnacher, E. Bacher, et al., “New approaches of three-dimensional range-gated imaging in scattering environments,” in Electro-Optical Remote Sensing, Photonic Technologies, and Applications V, vol. 8186 (SPIE, 2011), pp. 27–36.

20. A. Jarabo, B. Masia, J. Marco, et al., “Recent advances in transient imaging: a computer graphics and vision perspective,” arXiv, arXiv:1611.00939 (2016). [CrossRef]  

21. G. Satat, M. Tancik, and R. Raskar, “Towards photography through realistic fog,” in 2018 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2018), pp. 1–10.

22. R. Tobin, A. Halimi, A. McCarthy, et al., “Robust real-time 3d imaging of moving scenes through atmospheric obscurant using single-photon lidar,” Sci. Rep. 11(1), 11236 (2021). [CrossRef]  

23. M. Rowe, E. Pugh, J. S. Tyo, et al., “Polarization-difference imaging: a biologically inspired technique for observation through scattering media,” Opt. Lett. 20(6), 608–610 (1995). [CrossRef]  

24. D. H. Goldstein, Polarized light (CRC press, 2017).

25. L. Schaul, C. Fredembach, and S. Süsstrunk, “Color image dehazing using the near-infrared,” in 2009 16th IEEE International Conference on Image Processing (ICIP), (IEEE, 2009), pp. 1629–1632.

26. C. Feng, S. Zhuo, X. Zhang, et al., “Near-infrared guided color image dehazing,” in 2013 IEEE international conference on image processing, (IEEE, 2013), pp. 2363–2367.

27. V. Holodovsky, Y. Y. Schechner, A. Levin, et al., “In-situ multi-view multi-scattering stochastic tomography,” in 2016 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2016), pp. 1–12.

28. A. Levis, Y. Y. Schechner, A. Aides, et al., “Airborne three-dimensional cloud tomography,” in Proceedings of the IEEE International Conference on Computer Vision, (2015), pp. 3379–3387.

29. A. Levis, Y. Y. Schechner, and A. B. Davis, “Multiple-scattering microphysics tomography,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), pp. 6740–6749.

30. Y. Y. Schechner, S. G. Narasimhan, and S. K. Nayar, “Instant dehazing of images using polarization,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, vol. 1 (IEEE, 2001), pp. I–I.

31. F. Liu, L. Cao, X. Shao, et al., “Polarimetric dehazing utilizing spatial frequency segregation of images,” Appl. Opt. 54(27), 8116–8122 (2015). [CrossRef]  

32. L. Cao, X. Shao, F. Liu, et al., “Dehazing method through polarimetric imaging and multi-scale analysis,” in Satellite Data Compression, Communications, and Processing XI, vol. 9501 (SPIE, 2015), pp. 266–273.

33. S. Lee, S. Yun, J.-H. Nam, et al., “A review on dark channel prior based image dehazing algorithms,” EURASIP J. on Image Video Process. 2016(1), 4–23 (2016). [CrossRef]  

34. R. Fattal, “Single image dehazing,” ACM Trans. Graph. 27(3), 1–9 (2008). [CrossRef]  

35. B. Cai, X. Xu, K. Jia, et al., “Dehazenet: An end-to-end system for single image haze removal,” IEEE Trans. on Image Process. 25(11), 5187–5198 (2016). [CrossRef]  

36. W. Ren, S. Liu, H. Zhang, et al., “Single image dehazing via multi-scale convolutional neural networks,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, (Springer, 2016), pp. 154–169.

37. B. Li, X. Peng, Z. Wang, et al., “Aod-net: All-in-one dehazing network,” in Proceedings of the IEEE international conference on computer vision, (2017), pp. 4770–4778.

38. D. Chen, M. He, Q. Fan, et al., “Gated context aggregation network for image dehazing and deraining,” in 2019 IEEE winter conference on applications of computer vision (WACV), (IEEE, 2019), pp. 1375–1383.

39. X. Qin, Z. Wang, Y. Bai, et al., “Ffa-net: Feature fusion attention network for single image dehazing,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34 (2020), pp. 11908–11915.

40. Y. Song, Z. He, H. Qian, et al., “Vision transformers for single image dehazing,” IEEE Trans. on Image Process. 32, 1927–1941 (2023). [CrossRef]  

41. K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2341–2353 (2011). [CrossRef]  

42. G. Huang, Y. Sun, Z. Liu, et al., “Deep networks with stochastic depth,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, (Springer, 2016), pp. 646–661.

43. T. He, Z. Zhang, H. Zhang, et al., “Bag of tricks for image classification with convolutional neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2019), pp. 558–567.

44. H. Dong, J. Pan, L. Xiang, et al., “Multi-scale boosted dehazing network with dense feature fusion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 2157–2167.

45. Y. Qiu, K. Zhang, C. Wang, et al., “Mb-taylorformer: Multi-branch efficient transformer expanded by taylor formula for image dehazing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2023), pp. 12802–12813.

46. H. Liu, P. Wang, X. He, et al., “Pi-nlos: polarized infrared non-line-of-sight imaging,” Opt. Express 31(26), 44113–44126 (2023). [CrossRef]  

47. S. Jin, M. Xu, and Z. Xu, “Code and data for Time-gated imaging through dense fog via physics-driven Swin transformer,” GitHub (2024) [accessed 29 April 2024], https://github.com/Unconventional-Vision-Lab-ZZU/ITTF-PDST.

Supplementary Material (1)

Supplement 1: This supplementary material provides a more detailed description of the design principles and details of the various modules of the ToFormerv2 network proposed in the main paper, and visualizes the intermediate feature maps and output feature interactions.

