Semantic-guided polarization image fusion method based on a dual-discriminator GAN

Ju Liu; Jin Duan; Jin Duan; Youfei Hao; Guangqiu Chen; Hao Zhang

doi:10.1364/OE.472214

1. Introduction

Polarization is an external manifestation of the transverse wave nature of light, and the polarizability of light is a physical property that is more universal than the intensity of light. The advantages of polarization properties include its robustness for the characterization of various materials. As a result, several polarization imaging techniques have developed. Compared with conventional imaging methods, polarization imaging is able to obtain information on the polarization of a target while acquiring information on the light intensity of the target. Therefore, Polarization imaging has found widespread use in military and civilian applications [1], such as reflection removal [2], dehazing [3], transparent object segmentation [4], road detection [5]. A typical example is shown in Fig. 1, polarization intensity image (S0), Degree of Linear Polarization (DoLP) and fused image are tested on well-trained deeplabv3+ mode to display the advantages of polarization image fusion in target detection and segmentation applications [6]. These results show that the fused image segmentation accuracy is improved by 7.8% and the fused image can detect the table, the empty water glass and the water glass with water more accurately than the single-modal image.

Fig. 1. Application of Polarization Image Fusion in Semantic Segmentation.

Download Full Size | PDF

In general, the Stokes vector is used to characterize the polarization state [7]:

(1)$$\left\{\begin{array}{l} S_{0}=\left(I_{0^{{\circ}}}+I_{45^{{\circ}}}+I_{90^{{\circ}}}+I_{135^{{\circ}}}\right) / 2 \\ S_{1}=I_{0^{{\circ}}}-I_{90^{{\circ}}} \\ S_{2}=I_{45^{{\circ}}}-I_{135^{{\circ}}}, \end{array}\right.$$

where $I_{\theta }$ is the intensity of light in different polarization directions $\theta \left (\theta \in 0^{\circ }, 45^{\circ }, 90^{\circ }, 135^{\circ }\right )$. $\mathrm {S}_{0}$ is the total intensity of light used to describe the reflectance and transmittance of the object, and $\mathrm {S}_{1}$ and $\mathrm {S}_{2}$ represent the polarization difference. The degree of linear polarization (DoLP) is calculated from the Stokes vector, which is expressed as:

(2)$$D o L P=\sqrt{\left(S_{1}^{2}+S_{2}^{2}\right)} / \mathrm{S}_{0} ,$$

It is clear from its description that DoLP has no relationship to intensity. DoLP can describe additional aspects, such as target surface shape, roughness and shading, that intensity images cannot. Thus, S0 and DoLP images offer complimentary scene data from various aspects. Further research into effective polarization image fusion techniques are required in the field of polarization imaging to generate images with richer information and enhance the capability of target recognition in complicated backgrounds. As a result, the primary focus of this research is the fusion of S0 and DoLP images.

In recent years, traditional image fusion techniques have been dominant. Numerous manually designed fusion methods have been proposed, for example, transform domain based methods [8], saliency based methods [9], sparse representation based methods [10], etc. Although these strategies have achieved significant success, they rarely consider all aspects of the integration process. Many constraints are added to improve the fusion effect, which makes the method complex and lacks flexibility and generalization ability.

With the improvement of computing power, the methods based on deep learning have been successfully used in the field of image fusion, which has greatly broken through the limitations of conventional methods. For example, Prabhakar et al. devised an unsupervised approach that employs the no-reference image quality metric(MEF-SSIM) loss function to train the network [11]. For infrared and visible image fusion task, Ma et al. designed a network based on salient object recognition network [12]. Although these techniques have shown positive outcomes, the use of polarization image fusion still some shortcomings. Polarization properties are very sensitive to surface roughness and conductance of objects, and polarization imaging is highly different for metals and dielectric materials, and the manufactured objects and natural background. Convolutional neural networks are now used in the majority of deep learning techniques to extract visual features. At present, most deep learning approaches generally choose convolutional neural networks to extract features from source images. Pre-designed loss functions are then used to improve the network. Since the pre-designed fusion rules are not tailored to the specificity of individual semantic items, the fusion rules are consistent throughout the image, i.e. they all use the same fusion rules. As a result, it is challenging for these approaches to fuse various semantic items in a particular way, and they also do not adequately capture the benefits of polarization imaging’s high contrast characterization for various materials. Second, now available GAN techniques construct an adversarial game between the discriminator and the generator that forces the fused picture to concurrently maintain the features from both the DoLP and the S0 images. However, because the discriminator affects the entire image, the fusion results cannot consider the importance of each semantic region. The edge and gradient information of DoLP, as well as the texture features of S0, will generate some loss as the adversarial game progresses.

Polarization image fusion is an effective means to improve advanced image applications such as semantic segmentation and object recognition. In particular, in vision tasks such as transparent objects, water, fog, etc. That cannot be achieved only by intensity and color information, by redefining the conventional image segmentation and recognition problem as the polarization problem of light, that is, through polarization image fusion, it can achieve better image semantic information expression ability. For it, we propose a new end-to-end network model, called Semantic Guided Polarization Image Fusion Dual Discriminator Generative Adversarial Network (SGPF-GAN). In order to maintain the polarization contrast of various materials, such as electrolytes and metals, and to guide the image fusion process, we also developed a Polarized Image Information Quantity Discriminator (PIQD) block. This block calculates its proper fusion weight for each semantic region. With the use of generator in the network is able to learn the fusion rules of different semantic regions, while dual discriminators are used to determine the extent to which both S0 and DoLP modal information is retained in the fused images. In summary, we primarily offer the three contributions listed below:

• According to the differences in the polarization properties of dielectrics and metal materials, we specifically design a module for discriminating the amount of information in polarized images, which can be targeted to retain the polarization information of different materials in the fusion process. Through the design of appropriate fusion rules, it can improve the adequacy of scene expression;
• For polarized image fusion, we suggest a dual-discriminator generative adversarial structure which enables the generator to consider the information of two modalities at the same time to constrain the training process and avoid information loss by introducing a single modality source image into the discriminator;
• Addressing the difference of polarized light reflected by adjacent micro-element surfaces in DoLP imaging, we designed a dual-stream generator structure to improve the ability to extract features from DoLP images.

The rest of the article is organized as follows. Some well-known related works are introduced in Section 2. For the polarized image fusion of S0 and DoLP, we suggest a semantically guided dual-discriminator generative adversarial network (SGPF-GAN) in section 3. A comprehensive evaluation of our methodology is provided in Section 4, which also includes qualitative and quantitative comparative analysis, ablation experiments, and generalization validation. Finally, the conclusion and prospect of this paper are provided.

2. Related work

In the following, we provide a brief description of the classical traditional image fusion methods and the currently more popular deep learning image fusion methods.

2.1 Traditional fusion methods

Traditional fusion techniques are mostly based on the transform domain and the spatial domain. In recent years, the more mainstream transform domain fusion methods include wavelet transform, pyramid transform and nonsubsampled contourlet transform, etc. These methods first perform multi-scale decomposition of the source image. Next, they fuse coefficients of the same scale in accordance with a specific rule. The outstanding contribution of this type of method include GTF(Gradient Transfer Fusion) [12], CVT (Curvelet Transform) [13], DTCWT (Dual-tree Complex Wavelet) [14] and SR (Sparse Representation) [8]. For fusion results, it is important to choose the transform domain and fusion rule. In addition to low-rank representation methods there are sparse representation methods that are still commonly utilized in the field of image fusion. The key innovation made by Zhang et al. [15]. In their work was the use of patch level consistency rectification to suppress spatial artefacts in a multi-focus image fusion approach ground on non-negative sparse representation. Using super-pixel clustering, Zhang et al. [16] devised a method for multifocal image fusion in the ULRR (unified low-rank representation) model that reduces spatial artefacts in the fused result by segmenting the source image into several super-pixels of irregular size.

2.2 Deep learning-based image fusion

With the widespread use of deep learning in advanced vision applications, various conventional convolution neural network (CNN)-based image fusion methods have been applied. For instance, Zhang et al. [17] proposed a scale maintenance loss of gradients and intensities to guide the network architecture to generate fused images in their end-to-end CNN-based image fusion method. Liu et al. [18] implemented the medical image fusion challenge utilizing the conventional Laplace pyramid method for feature extraction and a trained CNN to provide fusion rules. Li et al. [19] used a trained auto-encoder structure for feature extraction, followed by a mix of traditional addition and l1-norm rule to implement the image fusion task. Additionally, Ma et al. [20] proposed a GAN-based method for infrared and visible image fusion, which further enhances the texture details of the fused image by playing a game between the generator and discriminator to estimate the probability distribution of the target.

Due to the wide application of generative adversarial networks, a series of GAN-based image fusion strategies have emerged and achieved good results. Ma et al. [20] proposed a GAN-based method for infrared and visible image fusion, called Fusion-gan, which further enhances the texture details of the fused image by playing a game between the generator and discriminator to estimate the probability distribution of the target. On this basis, Ma et al. [21] also proposed a multi-classification constrained image fusion method by transforming the image fusion into a multi-distribution simultaneous estimation problem. However, the method uses only one discriminator, which does not allow for the simultaneous estimation of both modal images, and there is some difficulty in finding a balanced way to ensure that the important information of both modalities is fully retained. J. F. et al. [22] proposed to use the U-Net structure instead of the traditional encoder-decoder architecture in the generator structure of the GAN network, but the model tends to retain more information about the infrared images, resulting in the loss of some visible light information. Fu et al. [23] proposed a new method that combines dense blocks and GANs. The input image-visible images in the network are directly inserted into each layer of the whole network, which makes the fusion results more consistent with human visual perception. In addition, more typical GAN-based image processing methods include the method proposed by Li et al. [24] which uses the idea of domain transformation to preprocess images and combines generative loss and discriminative loss to achieve good constraints on the model. As well as Liu et al. [25] proposed the PD-GAN method based on GAN which modulates the depth features of random noise vectors by SPDNorm to achieve contextual constraints.

Despite the positive results of the previous studies, these techniques were developed for multi-modal, multi-exposure and multi-focus picture fusion, and they do not completely apply to polarized image fusion. The current work on polarimetric image fusion is relatively small, with a notable contribution being a self-learning based polarimetric image fusion strategy proposed by Zhang et al. [26] The network is a simple code-and-decode structure, in which a multi-scale structural similarity loss is specifically introduced to achieve good results in polarimetric image fusion. However, the method does not take into account the polarization differences of different materials (metals/dielectric materials), and the same feature extraction and fusion strategy is used in image fusion, which makes some important polarization information cannot be effectively retained.

We identified the following through the analysis of polarized images: (1) DoLP has significant differences in imaging of various target materials and is highly contrasted, but the polarization imaging distribution of a single semantic object is consistent, so using the existing convolutional network for global feature extraction will lose effective information. (2) While many existing networks treat inputs uniformly, DoLP images differ from intensity images in that they represent the degree to which the target’s micro-element surface reflects and radiates polarized light. As a result, intensity and DoLP images should be transformed differently. (3) The information such as gradients and edges of DoLP images is given less consideration by the existing GAN-based fusion approaches, which solely concentrate on fused images with more detailed information of S0 images. The fused images will be closer to the intensity image S0 if this method is applied to fuse polarization images with the adversarial game, and the target’s polarization properties won’t be able to be significantly expressed. (4) The differences in feature extraction from DoLP images are not taken into account by existing approaches, which employ the same convolution and structure to extract features from DoLP and S0 images.

To address the polarization image fusion problem while addressing the aforementioned issues, we propose a semantic-guided dual-discriminator generative adversarial framework that primarily uses the PIQD block to estimate the weight of each semantic part in the polarization image. Additionally, based on the differences in how polarization information is represented between DoLP and S0 images, we design a dual-stream generator network architecture to guarantee the effectiveness of the training process. Finally, we adopt a new loss function for unsupervised training of the fusion process.

3. Proposed method

The proposed semantic-guided S0 and DoLP polarization image fusion approach based on the information quantity discrimination block is thoroughly described in this section. We’ll go over the entire framework, the proposed PIQD block, and the SGPF-GAN’s loss function one by one. Finally, a precise description of the generator and discriminator network designs are provided.

3.1 Overall framework

We specially build the semantically guided dual discriminator GAN network to characterize the image fusion problem as a semantically guided GAN model in order to more effectively tackle the polarized image fusion challenge. Our aim is to obtain by training a generator G that can extract the best useful characteristics from the input images and produce an information-rich polarization fusion result F, permitting further advanced image applications, given a pair of polarization intensity image S0 and degree of polarization image DoLP. DoLP images can well characterize the material and roughness properties of the target, and there is generally complex target material and roughness variability in our acquired scene images, while the different semantic objects of polarization images have significant high contrast advantages in characterizing the polarization properties. Therefore, we define the challenge of image fusion as the fusing of several semantic items inside the image, as opposed to defining image feature extraction and fusion algorithms for the entire image. To quantify the information quantity of each semantic item, we particularly build a PIQD block based on the benefits of gradient characterization of polarization images. PIQD block weighting determines the degree of retention of every semantic region in the fused result.

Figure 2(a) shows the overall structure of the SGPF-GAN. First, we use a segmentation method to divide both S0 and DoLP into two parts: the target region $\left (\mathrm {S0}_{\mathrm {fg}}, \mathrm {DoLP}_{\mathrm {fg}}\right )$ and the background region $\left (\mathrm {S0}_{\mathrm {bg}}, \mathrm {DoLP}_{\mathrm {bg}}\right )$. The production process of the $\mathrm {S0}_{\mathrm {fg}}, \mathrm {DoLP}_{\mathrm {fg}} ; \mathrm {S0}_{\mathrm {bg}}, \mathrm {DoLP}_{\mathrm {bg}}$ can be formulated as in Equation (3):

(3)$$\begin{aligned} & S 0_{\mathrm{fg}}, D o L P_{\mathrm{fg}}=S 0 \circ \text{ mask1}, D o L P \circ \text{ mask1} \\ & S 0_{\mathrm{bg}}, D o L P_{\mathrm{bg}}=S 0 \circ \text{ mask2}, D o L P \circ \text{ mask2}, \end{aligned}$$

where $S0, DoLP$, mask1, and mask2 represent the intensity image (S0), the linear polarimetric image (DoLP), and the binarization results mask1 and mask1=1-mask2 of the weight mapping M, respectively. $\circ$ means the Hadamard operator.

Fig. 2. The overall network structure of the proposed SGPF-GAN. S0, DoLP, mask and $\text {I}_{\text {f}}$ denote polarization intensity image, polarization degree image, mask and fused image. The dual-stream generators G1 and G2 branches fuse foreground (fg) and background (bg) feature maps, respectively. Discriminators $\text {D}_{\text {S0}}$ and $\text {D}_{\text {DoLP}}$ act on the S0 and DoLP images, respectively, to ensure the balance and optimal retention of the two modal information in $\text {I}_{\text {f}}$.

Download Full Size | PDF

Our network architecture contains two discriminators, $\text {D}_{\text {S0}}$ and $\text {D}_{\text {DoLP}}$, which each form an adversarial relationship with the generator G. Inspired by [27], we design different feature extraction methods for the target region fg and the background region bg to form a two-stream generator G containing G1 and G2 to generate the fused target region $\text {I}_{\text {fbg}}$ and the fused background region $\text {I}_{\text {ffg}}$. In the Fig. 2(a), the white area of mask1 consists of semantic objects with weight map M score larger than 0.5, and the regional fusion images are biased towards DoLP. In contrast, the fusion results for the semantic objects represented by the black areas should be closer to S0. The most meaningful information in the fusion of polarization intensity image S0 and degree of polarization image DoLP is the edge, gradient and texture details. To achieve an effective combination of the selected information requires us to design a balanced structure that ensures the information of the two different modalities is optimally retained in the fusion process. That is, the target region $\text {I}_{\text {ffg}}$ and the background region $\text {I}_{\text {fbg}}$ of the generated image have sufficient realism and information to deceive the discriminator. Meanwhile, we use the output scalar values of the two discriminators $\text {D}_{\text {S0}}$ and $\text {D}_{\text {DoLP}}$ to estimate the probability that the input data is from the source image or the fused image. Specifically, $\text {D}_{\text {S0}}$ aims to separate the fused image from the S0 image, while $\text {D}_{\text {DoLP}}$ separates the fused image from the DoLP.

As shown in Fig. 2(b) is the testing process of applying our trained model. We feed S0 and DoLP into the trained generator G, and finally obtain the output fusion result $\text {I}_{\text {f}}$.

3.2 Polarization image information quantity discrimination block

In order to determine the retention degree of every semantic item in the process of fusion, we analyze the advantages of S0 and DoLP images information representation and design polarized image information quality identification block to guide polarized image fusion. The PIQD block is composed of four elements: reference-free image quality assessment (NR-IQA) [28], information entropy (EN) [29], gradient magnitude similarity deviation (GMSD) [30] and luminance contrast (LC) [31].

Firstly, NR-IQA is used to measure the quality of every semantic part of the input image, which discriminates whether the quality of each semantic region of the polarization image is degraded due to distortion for some reason, such as blocking effects, blurring, compression and many forms of digital image noise.

The S0 image, which is comprised of specular and diffuse reflections, is the sum of intensities in two orthogonal polarization directions, as seen in Fig. 3(a). Therefore, the IQA of S0 tends to be higher. However, IQA solely assesses image quality without taking into account any additional factors, particularly for specialized polarization imaging such as DoLP. Due to the large difference of DoLP on micro-surface, IQA may judge DoLP images as low quality images. A typical example is shown in Fig. 3(b), both vehicles and shaded areas in S0 image have higher IQA, but it is clear that the DoLP image provides a more complete representation of the scene. For example, the DoLP image highlights the vehicle in the shadow and the interior scenes that can be seen through the windows because DoLP imaging removes reflections from the window glass. Therefore, intuitively, we would prefer that the fusion result of window and shaded area be closer to DoLP image. Theoretically, we prefer the fused result to retain more information about the different modal source images, so we will measure the amount of information in every semantic item using the objective metric EN [32], which is defined as follows:

(4)$$E N={-}\sum_{l=0}^{L-1} p_{l} \log _{2} p_{l},$$

where L denotes the gray level of the image, and $p_{l}$ is used to denote the probability of each of the 256 levels. As shown in Fig. 3(c), the S0 and DoLP images on the left and the corresponding gradient amplitude value plots on the right are shown. It can be seen that in transparent objects DoLP is able to perform stress imaging based on the effect of stress on polarized light, but this type of imaging is very sensitive to the edge and gradient information of the imaged target, so effective gradient and edge information metrics for polarized images cannot be achieved by conventional EN and NR-IQA image quality calculation methods. For these reasons, we also introduced GMSD (Gradient Magnitude Similarity Deviation) metric to ensure the effective retention of edge and gradient features of DoLP images during polarimetric image fusion [33]. GMSD is defined as follows:

(5)$$G M S D=\sqrt{\frac{1}{N} \sum_{i=1}^{N}(G M S(i)-G M S M)^{2}},$$

The gradient magnitude similarity is denoted by GMS(i), while the gradient magnitude similarity mean is denoted by GMSM. As seen in the S0 and DoLP pictures on the left of Fig. 3(d), and the accompanying LC saliency 3D surface plots on the right side, DoLP images have significant imaging differences for semantic objects made of various materials. The texture of the wall is similar to that of the tree, and the tree is well camouflaged within the wall in the S0 image, but the semantic object of the tree is highlighted in the DoLP image due to the difference in polarization imaging of different materials. In order to make sure that important characteristics were kept in the fused images, we additionally apply the LC. the quantitative indicator of saliency of any pixel $\text {I}_{\text {k}}$ in image I is expressed as:

(6)$$L C\left(\mathbf{I}_{k}\right)=\sum_{i=1}^{W} \sum_{j=1}^{H}\left\|\mathbf{I}_{k}-(\mathbf{I})_{i, j}\right\|,$$

where, $\|\cdot \|$ represents the color distance metric, the value range of $({I})_{i, j}$ is [0,255], i.e., the gray value. IQA can assess noise and other issues that degrade image quality, but is weak in evaluating DoLP images. As a complement, GMSD ensures that gradient, edge information is preserved in DoLP images. EN makes sure that a lot of source data is included in the fused image, whereas LC keeps the target region’s saliency. A trustworthy assessment criterion for image information content is therefore provided by the combination of the four metrics (NR-IQA), (EN), (GMSD), and (LC), which ensures the correctness of the information content computation for each semantic item. Regarding polarimetric imaging properties, the PIQD block can combine the benefits of the individual information representations of the S0 and DoLP images to produce a weight map M that can be used to gauge the degree to which each semantic object in the scene will be retained and to direct the fusion process. Figure 4 shows a schematic diagram of the PIQD module, where we first label the S0 and DoLP images to obtain label images $\text {L}_{\text {S0}}$ and $\text {L}_{\text {DoLP}}$. Since some semantic items can only be recorded by S0 or DoLP due to the different imaging mechanisms. To produce a full label $\text {L}_{\text {f}}$ that specifies the precise placement of the semantic information for our fusion job, we thus fuse the label images using pre-made fusion rules. During testing, we do not need to label the images, and Section 4.1 describes the pre-made fusion rules. The information score $S_{1}^{p}$ and $S_{2}^{p}$ of each semantic object in S0 and DoLP is calculated as follows:

(7)$$\mathrm{S}_{1}^{P}=\omega_{1}^{p} \cdot\left(I Q A_{1}^{p}+\lambda E N_{1}^{p}\right), \quad \mathrm{S}_{2}^{P}=\omega_{2}^{p} \cdot\left(I Q A_{2}^{p}+\lambda E N_{2}^{p}\right),$$

where $S_{1}^{p}$ and $S_{2}^{p}$ denote the information quality score of the semantic object in S0 or DoLP, respectively. $\lambda$ is the balance coefficient between the quality and information entropy of the control source image.

(8)$$\omega_{1}^{p}=\frac{\sum_{i, j \in p}\left(S_{1}\right)_{i, j}}{\sum_{i, j \in p}\left(\left(S_{1}\right)_{i, j}+\left(S_{2}\right)_{i, j}\right)},$$

$\omega _{2}^{p}$ is defined as: $1-\omega _{1}^{p}$. In addition, a larger GMSD indicates more complementary information between S0 and DoLP images, for which we use the average gradient $\bar {m}_{1}^{p} , \bar {m}_{2}^{p}$ size at semantic region p of S0 and DoLP images as the discriminant, and the gradient magnitude similarity deviation weights $e_{G M S D}^{p}$ can be defined as:

(9)$$e_{G M S D}^{p}= \begin{cases}1-G M S D^{p} & \bar{m}_{1}^{p}<\bar{m}_{2}^{p} \\ G M S D^{p} & \bar{m}_{1}^{p}>\bar{m}_{2}^{p}\end{cases},$$

where ${G M S D}^{p}$ is the deviation value of the gradient magnitude similarity at the semantic region p of S0 and DoLP images. Finally, the final source image quality quantitative measure weight map M is obtained denoted as:

(10)$$M^{p}=\frac{\mathrm{S}_{2}^{P}}{\left(\mathrm{~S}_{1}^{P}+\mathrm{S}_{2}^{P}\right)} \cdot e_{G M S D}^{p} ,$$

where $M^{p}$ denotes the final weight map value of the semantic object region p, indicating the degree of retention of region p for DoLP features during the fusion of polarized images, i.e. the degree of retention of the p region for S0 image features in the p region can be expressed as:$1-M^{p}$. If $M^{p}>0.5$, it indicates that the p region in DoLP contains more edge gradient information and the polarization is stronger in highlighting the semantic object, and vice versa, it indicates that the p-region in the S0 image contains more texture detail features.

Fig. 3. Example of polarization degree DoLP (top) and polarization intensity S0 (bottom) images. The right side of (c) is the gradient magnitude map corresponding to the left side image, and the right side of (d) is the 3D visualization result of the luminance contrast map corresponding to the left image.

Download Full Size | PDF

Fig. 4. Sketch map of the PIQD block. $\text {L}_{\text {f}}$ denotes the result of fusion of two label images $\text {L}_{\text {S0}}$ and $\text {L}_{\text {DoLP}}$. $\text {S}_{\text {S0}}$ , $\text {S}_{\text {DoLP}}$ denotes the result of saliency detection LC. We use LC, NR-IQA, EN and GMSD to calculate the weight map M. Finally, the magnitude of M value determines the degree of retention of each semantic object in the fusion to achieve the purpose of guiding the training process.

Download Full Size | PDF

3.3 Network architecture

Generator Architecture. Figure 5 shows the detailed network structure of the generator G. The source image is segmented into two parts, foreground $\text {S0}_{\text {fg}}$, $\text {DoLP}_{\text {fg}}$ and background $\text {S0}_{\text {bg}}$, $\text {DoLP}_{\text {bg}}$, according to the weight mapping M. They are used as input to G. The output of G is the fused foreground $\operatorname {I}_{\mathrm {ffg}}$ and background $\operatorname {I}_{\mathrm {fbg}}$. Due to the effect of the weight mapping M, foreground $\operatorname {I}_{\mathrm {ffg}}$ tends to retain more DoLP information, while background $\operatorname {I}_{\mathrm {fbg}}$ retains more S0 intensity information. Therefore, based on the imaging differences between DoLP and S0 and inspired by [34] Gated Convolution, we apply the HA convolution proposed in [27] for feature extraction in the regions that tend to retain DoLP features. HA convolution can avoid the problem that the polarization information of the same material varies greatly due to large differences in exposure intensity in DoLP imaging, and enable the extraction of richer information from DoLP images.

Fig. 5. Structure of the proposed two-stream generator network.

Download Full Size | PDF

So as to extract the features of the different modalities of the source image more effectively, we designed the generator as a dual-stream structure, i.e. the standard ST convolution is used to extract the S0 intensity information, while the HA convolution kernel extracts the features of the DoLP image. Finally, the fused background and foreground are added to obtain the fused image $\text {L}_{\text {f}}$.

The dual-flow structure is composed of two branches, DoLP-Branch and S0-Branch. The first four layers of DoLP-Branch are convolution form in [27], and the last layer is consistent with S0-Branch structure. For any of the convolutional layers, the padding is set to SAME to ensure that the extracted feature map is of the same size. In S0-Branch, to avoid gradient vanishing, we follow the deep convolution GAN [35] rules for batch normalization and activation functions. We used batch normalization and a leaky ReLU activation function to enhance the robustness of G in all layers except the last one, for the fifth layer, we utilize the tanh activation function. In the five-level convolution, the first convolution kernel is 5 x 5, the middle three layers are 3 x 3, and the last layer is 1 x 1. The quantity of convolutional kernels is set to 16 for the first four layers and 1 for the fifth layer. In addition, to reuse the extracted features and avoid loss of information during the convolution operation, we apply dense convolutional connectivity layers in each of the first four layers [36]. Each branch in the dual-stream generator structure has the same input and output channels, 2:16, 16:16, 32:16, 48:16, and 64:1, respectively.

Discriminator Architecture. The proposed discriminator network consists of $\text {D}_{\text {S0}}$ and $\text {D}_{\text {DoLP}}$, and both discriminators have the same architecture, as shown in Fig. 6. In order to determine the likelihood that the input image came from the source image and not from an image generated by the generator, both discriminators are employed as classifiers. Both discriminators consist of a five-layer convolutional, the stride is set to 2. 3 x 3 kernels are set for all layers except for the last layer, and a leaky ReLU activation function is used, with batch normalization for the middle three layers. The output of the discriminator is used for classification, thus the fifth layer is designed as a fully connected layer.

Fig. 6. Structure of the proposed two-stream generator network.

Download Full Size | PDF

3.4 Loss function

The loss function constraint term designed for the SGPF-GAN structure in this paper contains two main components: one is the loss of the generator G, and the other is the loss of the two discriminators $\text {D}_{\text {S0}}$ and $\text {D}_{\text {DoLP}}$.

Loss function of generator G: Since the use of adversarial loss alone makes the training process of GAN unable to achieve better convergence, an additional content loss $\mathcal {L}_{\text {con }}$ is introduced to constrain the generator, i.e. the loss function of generator G is defined as follows:

(11)$$\mathcal{L}_{G}=\mathcal{L}_{a d v}+\lambda \mathcal{L}_{c o n},$$

$\lambda$ is a balance weight parameter.

The adversarial loss $\mathcal {L}_{\text {adv }}$ is used to guide the generator G to generate real fusion results to deceive the two discriminators $\text {D}_{\text {S0}}$ and $\text {D}_{\text {DoLP}}$, the loss $\mathcal {L}_{\text {adv }}$ is defined as:

(12)$$\mathcal{L}_{a d v}=\mathbb{E}\left(\log \left(1-D_{S 0}\left(\mathrm{I}_{\mathrm{ffg}}, \mathrm{I}_{\mathrm{fb g}}\right)\right)\right)+\mathbb{E}\left(\log \left(1-D_{D o L P}\left(\mathrm{I}_{\mathrm{ff g}}, \mathrm{I}_{\mathrm{fb g}}\right)\right)\right)$$

$\operatorname {I}_{\mathrm {ffg}}$ and $\operatorname {I}_{\mathrm {fbg}}$ denote the fused foreground and background, $D_{S 0}()$ and $D_{D o L P}()$ are the loss functions of the discriminator, $\mathbb {E}$ indicates the mathematical expectation. The aim is for the generator to expect the discriminator to enforce the fused image to be similar to both S0 and DoLP.

To retain the meaningful information from the source images, we use $\mathcal {L}_{\text {con }}$ loss to constrain the fusion process and ensure the effective utilization of information in adversarial learning. The meaningful information in the generator model consists of two items, the SSIM metric indicating the degree of similarity between the fused image and the source images, and the intensity loss highlighting the target region information in the DoLP image. $\mathcal {L}_{\text {con }}$ is defined as:

(13)$$\mathcal{L}_{c o n}=\mathcal{L}_{S S I M}+\alpha \mathcal{L}_{i n},$$

$\mathcal {L}_{\text {SSIM }}$ is the loss of structural similarity and $\mathcal {L}_{\text {in }}$ is the loss of strength. $\alpha$ controls the trade-off.

Due to the superiority of SSIM in evaluating image fusion performance, in the proposed SGPF-GAN, we refer to the multi-scale weighted SSIM (MSWSSIM) loss function design method from [26],

(14)$$\mathcal{L}_{SSIM}=1-\frac{1}{5} \cdot \sum_{w \in\{3,5,7,9,11\}}\left(\gamma_{w} \cdot loss_{ssim}\left(I_{S0}, \mathrm{I}_{\mathrm{f}} ; w\right)\\ +\left(1-\gamma_{w}\right) \cdot {loss}_{s s i m}\left(I_{D O L P}, \mathrm{I}_{\mathrm{f}} ; w\right)\right),$$

where the multiscale SSIM window ${w}$ size is chosen at five levels of 3, 5, 7, 9 and 11, and $\gamma _{w}$ is the average value of the weight map M over each window region, expressed as:

(15)$$\gamma_{w}=\frac{1}{w^{2}} \sum_{i=1}^{w} \sum_{j=1}^{w} M_{i, j},$$

$\operatorname {loss}_{\mathrm {ssim}}(\mathrm {x}, \mathrm {y} ; \mathrm {w})$ represents the local structural similarity of x and y in window, which can be expressed as:

(16)$$\operatorname{loss}_{\mathrm{SSim}}(\mathrm{x}, \mathrm{y} ; \mathrm{w})=\frac{\left(2 \overline{\mathrm{w}}_{\mathrm{x}} \overline{\mathrm{w}}_{\mathrm{y}}+\mathrm{C}_{1}\right)\left(2 \sigma_{\mathrm{w}_{\mathrm{x}} \mathrm{w}_{\mathrm{y}}}+\mathrm{C}_{2}\right)}{\left(\overline{\mathrm{w}}_{\mathrm{x}}^{2}+\overline{\mathrm{w}}_{\mathrm{y}}^{2}+\mathrm{C}_{1}\right)\left(\sigma_{\mathrm{w}_{\mathrm{x}}}^{2}+\sigma_{\mathrm{w}_{\mathrm{y}}}^{2}+\mathrm{C}_{2}\right)},$$

${C}_{1}$ and ${C}_{2}$ are constants. $\mathrm {w}_{\mathrm {x}}$ and $\mathrm {w}_{\mathrm {y}}$ are the regions of the image x and y within the window $\mathrm {w}$ respectively, $\bar {\mathrm {w}}_{\mathrm {x}}$ and $\bar {\mathrm {w}}_{\mathrm {y}}$ are the mean values of $\mathrm {w}_{\mathrm {x}}$ and $\mathrm {w}_{\mathrm {y}}$ respectively. $\sigma _{\mathrm {w}_{\mathrm {x}}}^{2}$ and $\sigma _{\mathrm {w}_{\mathrm {x}} \mathrm {w}_{\mathrm {y}}}$ denotes the variance and co-variance.

The intensity loss allows the fused result to maintain an intensity distribution similar to that of the input image, so as to retain the important contrast information. That is, the final expression for intensity loss is:

(17)$$\mathcal{L}_{\mathrm{in}}=\frac{1}{\mathrm{WH}} \sum_{\mathrm{i}=1}^{\mathrm{W}} \sum_{\mathrm{j}=1}^{\mathrm{H}}\left((\mathrm{M})_{\mathrm{i}, \mathrm{j}} \cdot\left(\left(\mathrm{I}_{\mathrm{f}}\right)_{\mathrm{i}, \mathrm{j}}-\left(\mathrm{I}_{\mathrm{DoLP}}\right)_{\mathrm{i}, \mathrm{j}}\right)^{2}+\eta\left(1-(\mathrm{M})_{\mathrm{i}, \mathrm{j}}\right) \cdot\left(\left(\mathrm{I}_{\mathrm{f}}\right)_{\mathrm{i}, \mathrm{j}}-\left(\mathrm{I}_{\mathrm{S} 0}\right)_{\mathrm{i}, \mathrm{j}}\right)^{2}\right),$$

$(\mathrm {M})_{\mathrm {i}, \mathrm {j}}$ denotes the weight of each pixel point, which determines the extent to which each semantic object of the source image is retained in the fusion result during image fusion, by adjusting the balance between the two terms DoLP and S0 by $\eta$.

Loss Function of Discriminators: SGPF-GAN uses two independent discriminators (i.e., $\text {D}_{\text {S0}}$ and $\text {D}_{\text {DoLP}}$) to identify the distribution of information in the target, background regions of S0 and DoLP, respectively. The discriminators are trained to output an estimate of the probability that the input comes from real input data but not from the generator G. The adversarial loss expressions for these two discriminators are:

(18)$$\mathcal{L}_{D_{\text{DoLP }}}=\mathbb{E}\left[-\log D_{\text{DoLP }}\left(\mathrm{DoLP}_{\mathrm{fg}}, \mathrm{DoLP}_{\mathrm{bg}}\right)\right]+\mathbb{E}\left[-\log \left(1-D_{DoLP}\left(\mathrm{I}_{\mathrm{ffg}}, \mathrm{I}_{\mathrm{fbg}}\right)\right)\right]$$

(19)$$\mathcal{L}_{D_{\text{S0}}}=\mathbb{E}\left[-\log D_{\text{S0 }}\left(\mathrm{S0}_{\mathrm{fg}}, \mathrm{S0}_{\mathrm{bg}}\right)\right]+\mathbb{E}\left[-\log \left(1-D_{S0}\left(\mathrm{I}_{\mathrm{ffg}}, \mathrm{I}_{\mathrm{fbg}}\right)\right)\right]$$

4. Experiments

For this section, we first demonstrate the superiority of SGPF-GAN networks for polarization image fusion through a comparison with nine better fusion approaches on existing datasets that are publicly accessible and with similar single scenes from [37] and [38], and then employ seven evaluation metrics to assess the fusion results quantitatively. To validate the generalization of the SGPF-GAN model, we performed a qualitative and quantitative comparison of all fusion methods on a dataset of complex scenes collected in [39]. Next, ablation tests are used to experimentally confirm the PIQD block’s efficiency. Finally, to demonstrate the superiority of the SGPF-GAN technique in enhancing semantic segmentation performance, we implement a performance evaluation of the SGPF-GAN network by using polarization fusion image to image segmentation.

4.1 Experimental configurations

(1) Datasets: Experiments on three publicly available datasets provided by [37–39], of which the [37] and [38] datasets have similar single scene. We label 80 sets of images from both datasets. As metals and dielectric materials, manufactured objects and natural backgrounds have different representational advantages in polarization and intensity images, we first divided the datasets into two main categories according to target and background, followed by detailed labelling of natural scenes and man-made objects or metallic and dielectric materials according to target and background. The third dataset, generously provided by the authors of [39], contains complex information about natural scenes. After an initial classification of man-made and natural objects, metals and dielectric materials, this dataset is then classified in detail with nine categories of specific tags containing vehicles, vegetation, traffic signs, sky, clouds, water bodies, ground, buildings, and backgrounds. Since some semantic objects of transparent objects and low-light images are not visible or partially visible in S0 or DoLP, we use pre-designed fusion rules to fuse the tagged images to generate the complete tag $\text {L}_{\text {f}}$. The fusion rules are designed since DoLP images have a more complete representation of transparent objects, low-light and other areas, while S0 images contain texture. Therefore, all semantic objects are divided into two categories by us: DoLP image labels (e.g., shaded areas and car windows) and S0 image labels (e.g., natural objects that reflect sufficient light), which are then stitched together into complete fused labels containing complementary information.

All datasets were tested with 20 sets of images. To get additional training data, we crop the images into 256 $\times$ 256 image patches for training. In the end, the amount of whole image patches used for training was 8838.

(2) Comparison method: Nine mainstream methods were selected for comparison with our method, including GTF [40], CVT [13], DTCWT [14], SR [8], FusionGAN [20], GANMcC [21], FIRe-GAN [22], Perceptual-FusionGan [23], DIFNet [41], U2Fusion [42], DenseFuse [19]. Of these, GTF, CVT, DCTWT, and SR are typical traditional methods, while the others are all deep learning-based methods. We set the parameters for all the comparison methods by drew on the original papers. In these comparative experiments, the traditional methods are performed on MatLab 2020a with 2.10 GHz Intel processor Xeon E5-2620, and the others deep learning methods are accomplished on an NVIDIA TESLA T4.

(3) Parameter Settings: In our approach, there are six main control parameters, including $\alpha$, $\lambda$, $\eta$, $\mathrm {C1}$, $\mathrm {C2}$ and $\mathrm {k}$. They determine the balance of the loss functions and guide the degree of retention of the source images by the generator. In practice, we empirically set: $\alpha =0.3$, $\lambda =1.33$, $\eta =0.6$, $C_{1}=2 \times 10^{-4}$, $C_{2}=8 \times 10^{-4}$, $\mathrm {k}=1$. We set the initial learning rate and decay rate to $2 e-5$, 0.6. And RMSProp and Adam are used as optimizers to train the generator and discriminator, and the epoch is set to 20.

(4) Evaluation metrics: The qualitative assessment first depends on the eye’s visual perception, and good polarization image fusion can accurately distinguish the regions of different materials of the target and enhance the detailed information of the target edge contours. Secondly, six quantitative evaluation metrics are utilized to judge the performance of the method, including Sum of Difference Correlation (SCD), Average Gradient (AG), Visual Information Fidelity (VIF), Mean Square Error (MSE), Correlation Coefficient (CC), Peak Signal to Noise Ratio (PSNR), etc. AG is a gradient evaluation of image quality, the higher its value the more detail and texture the fused image contains. The SCD evaluates the quality of the fusion result by considering the source image and the effect of the source image on the fused image, the higher the SCD value, the greater the quality of the fused image. A smaller value in MSE indicates a higher-quality fused image. It measures the difference between the source image and the fused image. PSNR measures the degree of distortion in the fused image, with a higher PSNR value representing a more informative fused image.

4.2 Experimental results on dataset [37,38]

(1) Qualitative comparison: Three representative scene fusion maps are shown in Fig. 7, and the results show that SGPF-GAN has certain advantages. The source images S0 and DoLP are shown in the blue dashed box in the figure, Then, the fusion results of GTF, CVT, DTCWT, SR, FusionGAN, GANMcC, FIRe-GAN, Perceptual-FusionGan, DIFNet, U2Fusion, DenseFuse, and our SGPF-GAN are displayed. We have zoomed the small region of the selected display target (the red box) and the small region of the display background textures(the green box) in the right corner of the images, respectively, to illustrate the differences more clearly.

Fig. 7. Polarized image fusion results from the dataset provided by [37,38].

Download Full Size | PDF

Visually, our method has obvious advantages. In Fig. 7(b), the mobile phone screen has sufficient polarized light so that it has an advantage in DoLP imaging, while the background areas are well represented in the intensity image. The results show that our method achieves good stability in retaining the superior information of the target and background, and the fused image enhances the edge and gradient information in the scene while ensuring the visual effect. Traditional methods such as GTF, CVT, DTCWT and SR may have blurring and artefacts. Fusiongan, FIRe-GAN and GANMcC over-emphasise the quantity of information in the fused image causing image distortion. DIFNet, U2Fusion and DenseFuse perform better but none can balance the dominant information in both the target and background regions and they do not retain the edge information in the background regions well. In addition, the DoLP of the transparent object scene in Fig. 7(a) has the advantage of internal stress description in terms of the retention of barcode information (the red box) and stress information (the green box), the barcode information is blurred in the traditional method, while DIFNet, Perceptual-FusionGan, U2Fusion and DenseFuse have some advantages over FusionGAN and GANMcC but the contrast of barcode information is low, it is clear that SGPF-GAN has the best fusion results in the stress ripple region and the barcode region. As shown in Fig. 7(c), the shaded areas of the S0 image (the red box) are highlighted by fusing the DoLP images, and the sky area is more texturally rich for DoLP. It can be seen that the GTF, CVT, DTCWT and SR methods have lower overall contrast and blurred results in the tree region. The different micro-surfaces of the target behave differently to polarised light, while FusionGAN, GANMcC has weak constraints on the structural similarity metric, the fusion results will be distorted and a large proportion of the generated results do not belong to the source image. DIFNet, U2Fusion and DenseFuse fusion results performed better overall, but DenseFuse had weaker texture performance in the sky region, while the contrast of DIFNet, Perceptual-FusionGan and U2Fusion in the tree region was not as good as SGPF-GAN.

(2) Quantitative comparison: We selected 20 sets of images for quantitative analysis, as shown in Table 1. It can be clearly observed from Table 1 that SSPF-GAN shows excellent performance. Specifically, VIF, SCD and AG average values are maximized by SGPF-GAN. This suggests that the approach used in this study, which has stronger contrast and sharper edges, is more suited to human eye perception. SGPF-GAN ranked second on CC, PSNR and MSE. Overall, the performance of SGPF-GAN is the most optimistic among all methods.

Table 1. Quantitative comparison of 20 pairs of images from datasets provided by [37,38].

View Table | View all tables in this article

4.3 Experimental results on dataset [39]

(1) Qualitative Comparisons: We do further comparisons on the complex scene images, and the results are shown in Fig. 8. Similarly, the red and green boxes in the figures indicate typical regions of the fusion results. The fused image in Fig. 8(a) highlights the vehicle information in the shadows of the selected region (the red box), which can reveal the vehicle’s interior scenes due to the de-reflective effect of DoLP imaging on the transparent glass. By comparing the fusion performance of several methods for glass regions, it is clear that our method can better maintain the balance between DoLP and S0 information and avoid using too much DoLP information. Although DoLP images can describe specific information, they are not smooth transitions between each pixel value, and excessive utilization can affect subsequent advanced applications of the image, such as degrading the performance of semantic segmentation.

Fig. 8. Polarized image fusion results from the dataset provided by [39].

Download Full Size | PDF

Figure 8(b) and Fig. 8(c) show the effect of identifying trees camouflaged in walls and the ability of glass to highlight in shadowed areas through polarization differences of different materials, respectively. The comparison reveals that the method in this paper clearly outperforms the others in terms of contrast enhancement, texture reconstruction and visual effects.

(2) Quantitative comparison: As shown in Table 2, we selected 20 sets of typical images from the complex scene dataset for testing, and the comparison results show that our method achieves maximum values on VIF, SCD, AG, PSNR and CC. It shows that SGPF-GAN can characterize rich texture and edge information while maintaining high-quality visual effects. In addition, SGPF-GAN ranks third on CC, after U2Fusion, Perceptual-FusionGan. Overall, SGPF-GAN can achieve the better performance in complex natural scenes as well.

Table 2. Quantitative comparison of 20 pairs of images from datasets provided by [39].

View Table | View all tables in this article

4.4 Ablation experiment

(1) Ablation Study of dual discriminators

We perform three comparison tests to confirm the effects of each discriminator as our SGPF-GAN is a dual discriminator structure comprising $\text {D}_{\text {S0}}$ and $\text {D}_{\text {DoLP}}$. To make the adversarial relationship between G and $\text {D}_{\text {DoLP}}$ non-existent, we first remove the $\text {D}_{\text {DoLP}}$ discriminator. Second, we eliminate the discriminator $\text {D}_{\text {S0}}$ from our network design, which only leaves G and $\text {D}_{\text {DoLP}}$ in an adversarial relationship. Third, we generate fused images using our SGPF-GAN complete network architecture. The combined findings of all these comparison trials, which were conducted in identical conditions, are displayed in Fig. 9.

Fig. 9. Experimental validation results of discriminators.

Download Full Size | PDF

In the experiment without discriminator $\text {D}_{\text {DoLP}}$, the fusion results tended to retain more information about the S0 image, but the vehicle information in the shaded part is not highlighted. In the experiments without discriminator $\text {D}_{\text {S0}}$, although the DoLP image information is fully retained, the overall visual effect is poor and the use of too much DoLP information led to a decrease in overall image quality. Thus, $\text {D}_{\text {DoLP}}$ mainly works on the salient target regions labeled with weight M to highlight the complementary information of DoLP on the $\text {D}_{\text {S0}}$ image. In contrast, $\text {D}_{\text {S0}}$ reacts mainly to the background to retain the rich texture detail in the S0 image. Our SGPF-GAN experiments show that the dual discriminator structure can balance the retention of information of each division item in different modalities from the source image to obtain the optimal fused result.

(2) Ablation Study of the PIQD Block

In the validation experiments of the PIQD block, we use four metrics: EN, NR-IQA, LC and GMSD. Four image metrics are used in the PIQD block validation experiments, and four comparative experiments are conducted to verify the effectiveness of the block. First, we remove the entire PIQD block in our approach which indicates the semantic guidance function based on image quality information for the guidance of each semantic object no longer exists. Second, in the PIQD block, we only use NR-IQA and EN. Therefore, NR-IQA and EN decide on the fusion weights of the various semantic objects. Third, we just remove the GMSD from the PIQD block. Fourth, we apply the complete PIQD to each semantic part of the fused image. The experimental results are shown in Fig. 10.

Fig. 10. Experimental validation results of PIQD Block.

Download Full Size | PDF

The DoLP image information is severely lost in the first comparison experiment’s poor fusion results. The highlighted tree region in the DoLP image was not improved in the second comparison experiment, despite the fused image retaining some of the edge and gradient information of the DoLP image. In the third experiment, the fused image has some enhancement of the salient regions and contained rich texture details, which are more consistent with the visual perception of the human eye, but the overall image contrast is low. The fourth experiment’s results demonstrate that the backdrop is unobscured, and the target area is enhanced. These experiments demonstrate that the PIQD block may enhance the ability of semantic expression and better direct the fusion rules.

(3) Ablation Study of the dual-stream generator structure

The incorporation of HA convolution marks the primary distinction between the two-stream generator structure developed in this research for the various polarization information distribution characteristics of the target and background. The effectiveness of the target region feature extraction for DoLP images proposed in this study is thus tested through two comparative experiments. In Fig. 11, the experiment’s results are displayed.

Fig. 11. Experimental validation results of dual-stream generator structure.

Download Full Size | PDF

The fused image introduces a greater amount of DoLP information, resulting in a reduction in its overall quality, with significant distortion especially in the red boxed marked areas. In the first comparison experiment where we replaced the conventional ST Conv with HA Conv in the dual-stream network HA-Baranch. This is because the DoLP seems to be more differently characterized in this location due to greater polarized light, but the global information is ignored during the ST Conv feature extraction. The second experiment is the result of combining our proposed dual-stream generator structure, which better maintains the characteristics of each division item and the input image’s texture information. The experiments show the viability of our dual-stream generator construction.

(4) Quantitative Ablation Experiments

The quantitative results of the ablation experiments are shown in Table 3 and Table 4. The results in Table 3 are optimal for all indicators of SGPF-GAN except PSNR and MSE.

Table 3. Quantitative results of ablation experiments with dual discriminators.

View Table | View all tables in this article

Table 4. Quantitative results of ablation experiments on PIQD block and dual-stream generator structure.

View Table | View all tables in this article

In the results of Table 4, the overall indicators of SGPF-GAN are the best. Since SGPF-GAN is a dual discriminator structure, each discriminator acts on one of the modalities in the input images, so the two discriminators work without influence on each other. When the $\text {D}_{\text {DoLP}}$ discriminator is removed, the fused image is more similar to the S0 image, and conversely, when the $\text {D}_{\text {S0}}$ discriminator is removed, the fused image is more similar to the DoLP image. Therefore, the PSNR value of SGPF-GAN should be in the middle of the two sets of experimental results. Since PSNR is inversely related to MSE, both PSNR and MSE metrics ranked second. The comparative results of the quantitative metrics from the ablation experiments show that the design of our SGPF-GAN method is effective.

4.5 Application to image segmentation

We trained segmentation models using S0, DoLP, and SGPF-GAN results, in that order, to show the efficacy of our SGPF-GAN design in fusing DoLP information to increase semantic segmentation of intensity images (S0). To train the three types of images mentioned above, we took inspiration from the deepLabv3+ network and utilized the same loss function and parameter values. Where the training dataset included the dataset from the literature [39] labeled in this paper as well as the ZJU-RGB-P Dataset [5] for a total of 410 sets of images. The following is a qualitative and quantitative analysis of the segmentation findings.

(1) Qualitative results: First, Fig. 12 displays the visualization results of the image segmentation in this study. As can be observed, the DoLP image-based segmentation model displays serious mis-segmentation in the wall region while segmenting the automobile more accurately in the shadow zone. In other words, the segmentation areas in both unimodal-based segmentation results defective and remote from the GT. The segmentation model on the basis of SGPF-GAN image fusion, in contrast, correctly segmented each region and produced results that are the closest to the values of the Ground Truth. The qualitative results demonstrate that the SGPF-GAN may indeed enhance image segmentation ability.

Fig. 12. Visualization results of image segmentation. The top row is the test image, from left to right are the S0, DoLP, SGPF-GAN fusion images and labels, and the bottom row is the segmentation result corresponding to the previous row.

Download Full Size | PDF

(2) Quantitative result: We also assessed the image segmentation results quantitatively, as shown in Table 5.

Table 5. Objective evaluation of segmentation performance is performed on 410 images provided from [5], [39]. Red represents the best segmentation result.

View Table | View all tables in this article

The results demonstrate that segmentation based on fused images outperforms segmentation based on DoLP images or S0 images in terms of quantitative performance. Therefore, the better image segmentation results also further show the advantages of our SGPF-GAN in a high-level image segmentation task.

5. Conclusions

For polarimetric image fusion, we suggest a brand-new SGPF-GAN structure in this study. We start by creating a polarization image information quantity discriminator module that will direct the fusion of polarization images and enhance the adequacy of the scene representation. For the purpose of constraining the training process and improve the reasonability of the fusion results, we also introduce a dual discriminator structure that enables the generator to take into account both modal information of the source image. The dual-stream generator network structure is also created to efficiently extract the characteristics of the original image. The outcomes avoid the high-contrast target region from being distorted in the fused image due to inappropriate feature extraction. Numerous qualitative and quantitative tests have demonstrated that SGPF-GAN performs better in polarized image fusion. Additionally, the use of additional image segmentation experiments show that our SGPF-GAN aids in the improvement of advanced computer vision task performance. In the future, we will introduce Angle of polarization (AoP) for fusion to produce photo-realistic fused image.

Funding

National Natural Science Foundation of China (61890960, 62127813); Jilin Scientific and Technological Development Program (20210203181SF).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. J. Wan, Y. Dong, J.-H. Xue, L. Lin, S. Du, J. Dong, Y. Yao, C. Li, and H. Ma, “Polarization-based probabilistic discriminative model for quantitative characterization of cancer cells,” Biomed. Opt. Express 13(6), 3339–3354 (2022). [CrossRef]

2. C. Lei, X. Huang, M. Zhang, Q. Yan, W. Sun, and Q. Chen, “Polarized reflection removal with perfect alignment in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2020), pp. 1750–1758.

3. L. Shen, Y. Zhao, Q. Peng, J. C.-W. Chan, and S. G. Kong, “An iterative image dehazing method with polarization,” IEEE Trans. Multimedia 21(5), 1093–1107 (2019). [CrossRef]

4. A. Kalra, V. Taamazyan, S. K. Rao, K. Venkataraman, R. Raskar, and A. Kadambi, “Deep polarization cues for transparent object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2020), pp. 8602–8611.

5. K. Xiang, K. Yang, and K. Wang, “Polarization-driven semantic segmentation via efficient attention-bridged fusion,” Opt. Express 29(4), 4802–4820 (2021). [CrossRef]

6. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), (Springer, 2018), pp. 801–818.

7. J. S. Tyo, D. L. Goldstein, D. B. Chenault, and J. A. Shaw, “Review of passive imaging polarimetry for remote sensing applications,” Appl. Opt. 45(22), 5453–5469 (2006). [CrossRef]

8. Y. Liu, S. Liu, and Z. Wang, “A general framework for image fusion based on multi-scale transform and sparse representation,” Inf. Fusion 24, 147–164 (2015). [CrossRef]

9. J. Han, E. J. Pauwels, and P. De Zeeuw, “Fast saliency-aware multi-modality image fusion,” Neurocomputing 111, 70–80 (2013). [CrossRef]

10. B. Yang and S. Li, “Multifocus image fusion and restoration with sparse representation,” IEEE Trans. Instrum. Meas. 59(4), 884–892 (2010). [CrossRef]

11. K. Ram Prabhakar, V. Sai Srikar, and R. Venkatesh Babu, “Deepfuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs,” in Proceedings of the IEEE international conference on computer vision (ICCV), (IEEE, 2017), pp. 4714–4722.

12. J. Ma, L. Tang, M. Xu, H. Zhang, and G. Xiao, “Stdfusionnet: An infrared and visible image fusion network based on salient target detection,” IEEE Trans. Instrum. Meas. 70, 1–13 (2021). [CrossRef]

13. F. Nencini, A. Garzelli, S. Baronti, and L. Alparone, “Remote sensing image fusion using the curvelet transform,” Inf. Fusion 8(2), 143–156 (2007). [CrossRef]

14. J. J. Lewis, R. J. O’Callaghan, S. G. Nikolov, D. R. Bull, and N. Canagarajah, “Pixel-and region-based image fusion with complex wavelets,” Inf. Fusion 8(2), 119–130 (2007). [CrossRef]

15. Q. Zhang, G. Li, Y. Cao, and J. Han, “Multi-focus image fusion based on non-negative sparse representation and patch-level consistency rectification,” Pattern Recognit. 104, 107325 (2020). [CrossRef]

16. Q. Zhang, F. Wang, Y. Luo, and J. Han, “Exploring a unified low rank representation for multi-focus image fusion,” Pattern Recognit. 113, 107752 (2021). [CrossRef]

17. H. Zhang, H. Xu, Y. Xiao, X. Guo, and J. Ma, “Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity,” in Proceedings of the AAAI Conference on Artificial Intelligence, 34 (2020), pp. 12797–12804.

18. Y. Liu, X. Chen, J. Cheng, and H. Peng, “A medical image fusion method based on convolutional neural networks,” in 2017 20th international conference on Inf. fusion (IEEE, 2017), pp. 1–7.

19. H. Li and X.-J. Wu, “Densefuse: A fusion approach to infrared and visible images,” IEEE Trans. on Image Process. 28(5), 2614–2623 (2019). [CrossRef]

20. J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang, “Fusiongan: A generative adversarial network for infrared and visible image fusion,” Inf. Fusion 48, 11–26 (2019). [CrossRef]

21. J. Ma, H. Zhang, Z. Shao, P. Liang, and H. Xu, “Ganmcc: A generative adversarial network with multiclassification constraints for infrared and visible image fusion,” IEEE Trans. Instrum. Meas. 70, 1–14 (2021). [CrossRef]

22. J. F. Ciprián-Sánchez, G. Ochoa-Ruiz, M. Gonzalez-Mendoza, and L. Rossi, “Fire-gan: A novel deep learning-based infrared-visible fusion method for wildfire imagery,” Neural Computing and Applications pp. 1–13 (2021).

23. Y. Fu, X.-J. Wu, and T. Durrani, “Image fusion based on generative adversarial network consistent with perception,” Inf. Fusion 72, 110–125 (2021). [CrossRef]

24. X. Li, Z. Du, Y. Huang, and Z. Tan, “A deep translation (gan) based change detection network for optical and sar remote sensing images,” ISPRS J. Photogramm. Remote. Sens. 179, 14–34 (2021). [CrossRef]

25. H. Liu, Z. Wan, W. Huang, Y. Song, X. Han, and J. Liao, “Pd-gan: Probabilistic diverse gan for image inpainting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2021), pp. 9371–9381.

26. J. Zhang, J. Shao, J. Chen, D. Yang, and B. Liang, “Polarization image fusion with self-learned fusion strategy,” Pattern Recognit. 118, 108045 (2021). [CrossRef]

27. J. Guo, S. Lai, C. Tao, Y. Cai, L. Wang, Y. Guo, and L.-Q. Yan, “Highlight-aware two-stream network for single-image svbrdf acquisition,” ACM Trans. Graph. 40(4), 1–14 (2021). [CrossRef]

28. S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Trans. on Image Process. 27(1), 206–219 (2018). [CrossRef]

29. L. Liu, B. Liu, H. Huang, and A. C. Bovik, “No-reference image quality assessment based on spatial and spectral entropies,” Signal Processing: Image Commun. 29(8), 856–863 (2014). [CrossRef]

30. W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: A highly efficient perceptual image quality index,” IEEE Trans. on Image Process. 23(2), 684–695 (2014). [CrossRef]

31. Y. Zhai and M. Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in Proceedings of the 14th ACM international conference on Multimedia (ACM, 2006), pp. 815–824.

32. H. Xu, J. Ma, Z. Le, J. Jiang, and X. Guo, “Fusiondn: A unified densely connected network for image fusion,” in Proceedings of the AAAI Conference on Artificial Intelligence, 34 (2020), pp. 12484–12491.

33. G. C. Sargent, B. M. Ratliff, and V. K. Asari, “Conditional generative adversarial network demosaicing strategy for division of focal plane polarimeters,” Opt. Express 28(25), 38419–38443 (2020). [CrossRef]

34. J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free-form image inpainting with gated convolution,” in Proceedings of the IEEE/CVF international conference on computer vision (ICCV), (IEEE, 2019), pp. 4471–4480.

35. H. Zhang, Z. Le, Z. Shao, H. Xu, and J. Ma, “Mff-gan: An unsupervised generative adversarial network with adaptive and gradient joint constraints for multi-focus image fusion,” Inf. Fusion 66, 40–53 (2021). [CrossRef]

36. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), (IEEE, 2017), pp. 4700–4708.

37. S. Qiu, Q. Fu, C. Wang, and W. Heidrich, “Linear polarization demosaicking for monochrome and colour polarization focal plane arrays,” in Computer Graphics Forum, 40 (Wiley Online Library, 2021), pp. 77–89.

38. M. Morimatsu, Y. Monno, M. Tanaka, and M. Okutomi, “Monochrome and color polarization demosaicking using edge-aware residual interpolation,” in 2020 IEEE International Conference on Image Processing (ICIP), (IEEE, 2020), pp. 2571–2575.

39. Y. Sun, J. Zhang, and R. Liang, “Color polarization demosaicking by a convolutional neural network,” Opt. Lett. 46(17), 4338–4341 (2021). [CrossRef]

40. J. Ma, C. Chen, C. Li, and J. Huang, “Infrared and visible image fusion via gradient transfer and total variation minimization,” Inf. Fusion 31, 100–109 (2016). [CrossRef]

41. H. Jung, Y. Kim, H. Jang, N. Ha, and K. Sohn, “Unsupervised deep image fusion with structure tensor representations,” IEEE Trans. on Image Process. 29, 3845–3858 (2020). [CrossRef]

42. H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2fusion: A unified unsupervised image fusion network,” IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 502–518 (2022). [CrossRef]

	VIF	CC	SCD	AG	PSNR	MSE
GTF	0.9573	0.4677	0.7333	1.0745	62.8047	0.0461
CVT	0.9731	0.6281	1.3194	2.7445	62.5151	0.0484
DTCWT	0.9978	0.6314	1.3187	2.6908	65.2445	0.0402
SR	1.1712	0.6234	1.3185	3.0145	65.3512	0.0333
FusionGAN	0.7395	0.3569	0.8242	2.8414	59.3662	0.1512
GANMcC	0.4485	0.2626	0.4963	3.2146	57.1808	0.1272
FIRe-GAN	0.9896	0.5216	1.0363	2.9658	61.3802	0.0812
Perceptual-FusionGan	1.0213	0.6378	1.4142	3.1713	65.2156	0.0241
DIFNet	0.9678	0.6376	1.3093	1.4993	63.0389	0.0246
U2Fusion	1.2266	0.6067	1.7152	3.0363	64.1939	0.0243
DenseFuse	1.2271	0.6414	1.5715	2.1773	65.5783	0.0238
SGPF-GAN	1.2659	0.6385	1.9161	3.6224	65.5055	0.0239

	VIF	CC	SCD	AG	PSNR	MSE
GTF	0.5244	0.3172	0.7056	1.1982	62.2218	0.0417
CVT	0.7425	0.5484	1.5981	6.1924	64.9198	0.0222
DTCWT	0.7723	0.5492	1.5983	6.1683	64.9319	0.0221
SR	0.7493	0.5321	1.5394	5.8561	64.6884	0.0235
FusionGAN	0.2253	0.2016	0.5135	1.9857	54.9755	0.1577
GANMcC	0.3791	0.1477	0.4935	3.5146	57.1824	0.1242
FIRe-GAN	0.5672	0.3171	0.8212	3.6262	60.243	0.0416
Perceptual-FusionGan	0.8272	0.5653	1.8024	4.3929	64.9858	0.0224
DIFNet	0.8302	0.5413	1.6704	2.9795	63.6673	0.0314
U2Fusion	0.8026	0.5648	1.3684	5.8666	62.3883	0.0332
DenseFuse	0.8107	0.5427	1.8157	4.5487	62.3233	0.0413
SGPF-GAN	0.8652	0.5501	1.8541	6.4331	65.0319	0.0217

Method	VIF	CC	SCD	AG	PSNR	MSE
$w / o$ $D_{D o L P}$	0.7672	0.4982	1.0648	4.9637	61.5528	0.0329
$w / o$ $D_{s 0}$	0.6144	0.4546	1.4321	3.5881	59.8134	0.0454
SGPF-GAN	0.8184	0.6727	1.7003	8.9031	61.4538	0.0358

Method	VIF	CC	SCD	AG	PSNR	MSE
$w / o$ $P I Q D$	0.7658	0.4643	1.7481	4.1223	62.6396	0.0354
$w / o$ $G M S D, L C$	0.8246	0.4796	1.6402	2.7910	63.4723	0.0236
$w / o$ $G M S D$	0.9112	0.5571	1.6659	3.6823	64.4785	0.0235
$w / o$ $H A$ $C o n v$	0.9872	0.5857	1.7312	2.8254	63.1846	0.0283
SGPF-GAN	1.1031	0.6361	1.9873	5.5707	65.1602	0.0221

class	car	Car window	Telegraph pole	wall	mIOU
S0	0.705	0.863	0.712	0.837	0.779
DoLP	0.849	0.821	0.673	0.689	0.758
Fusion	0.895	0.883	0.741	0.952	0.868

Semantic-guided polarization image fusion method based on a dual-discriminator GAN

Abstract

1. Introduction

2. Related work

2.1 Traditional fusion methods

2.2 Deep learning-based image fusion

3. Proposed method

3.1 Overall framework

3.2 Polarization image information quantity discrimination block

3.3 Network architecture

3.4 Loss function

4. Experiments

4.1 Experimental configurations

4.2 Experimental results on dataset [37,38]

4.3 Experimental results on dataset [39]

4.4 Ablation experiment

4.5 Application to image segmentation

5. Conclusions

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (12)

Tables (5)

Equations (19)

Optics Express