
Employing texture loss to denoise OCT images using generative adversarial networks


Abstract

Optical coherence tomography (OCT) is a widely used clinical ophthalmic imaging technique, but the presence of speckle noise can obscure important pathological features and hinder accurate segmentation. This paper presents a novel method for denoising OCT images using a combination of texture loss and generative adversarial networks (GANs). Previous approaches have integrated deep learning techniques, starting with denoising convolutional neural networks (CNNs) that employed pixel-wise losses. While effective in reducing noise, these methods often introduced a blurring effect in the denoised OCT images. To address this, perceptual losses were introduced, improving denoising performance and overall image quality. Building on these advancements, our research focuses on designing an image reconstruction GAN that generates OCT images with textural similarity to the gold standard, the averaged OCT image. We utilize the PatchGAN discriminator approach as a texture loss to enhance the quality of the reconstructed OCT images. We also compare the performance of UNet and ResUNet as generators in the conditional GAN (cGAN) setting, as well as PatchGAN with the Wasserstein GAN (WGAN) as discriminators. Using real clinical foveal-centered OCT retinal scans of children with normal vision, our experiments demonstrate that the combination of PatchGAN and UNet achieves superior performance (PSNR = 32.50) compared to recently proposed methods such as SiameseGAN (PSNR = 31.02). Qualitative experiments involving six masked clinical ophthalmologists also favor the reconstructed OCT images with PatchGAN texture loss. In summary, this paper introduces a novel method for denoising OCT images by incorporating texture loss within a GAN framework. The proposed approach outperforms existing methods and is well received by clinical experts, offering promising advancements in OCT image reconstruction and facilitating accurate clinical interpretation.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Optical coherence tomography (OCT) is a commonly used clinical imaging modality, especially in the fields of optometry and ophthalmology, where it is routinely used to capture images of the eye. Speckle noise is inherent to this imaging modality [1], which unfortunately creates challenges for clinical interpretation. The statistical properties of OCT images and the associated speckle noise have been studied in detail in [2,3]. The presence of speckle noise can occlude important features (such as pathology and retinal landmarks) as well as interfere with accurate segmentation of retinal layers [4–6], which is critical for extracting quantitative biomarkers from these images and, consequently, for clinical decision making [7].

A wide range of strategies has been proposed to address the issue of noise in OCT images, encompassing traditional methods as well as deep learning techniques utilizing CNNs and GANs. Traditional methods can be further classified based on whether they employ a single frame or multiple frames for denoising. Examples of single-frame denoising techniques include those proposed by Rogowska et al. [8], Wong et al. [9], Bernardes et al. [10], Puvanathasan et al. [11], Habib et al. [12], Hongwei et al. [13], Kafieh et al. [14], and Chong et al. [15], while examples of multiple-frame denoising techniques include those proposed by Chitchian et al. [16], Fang et al. [17], and Fang et al. [18]. However, traditional methods often require manual parameter selection and lack adaptability to different noise levels.

In recent years, deep learning methods based on CNNs have emerged as a promising alternative, surpassing the performance of traditional methods. For instance, Shi et al. [19] designed a deep learning network called DeSpecNet for speckle noise reduction in retinal OCT images. They investigated the impact of $L_1$ and $L_2$ losses and found the $L_1$ loss to be superior in terms of visual quality and quantitative indices. However, some denoised images produced by DeSpecNet exhibited significant blurriness.

To address the blurring issue, Qiu et al. [20] introduced a CNN-based denoising network that incorporated a perceptually-sensitive loss, the multi-scale structural similarity index (MS-SSIM). Their approach achieved lower levels of blurriness and improved perceptual representation of denoised OCT images, outperforming traditional $L_1$ and $L_2$ losses. The authors also reported enhanced contrast between retinal layers and the background in the denoised images.

Additionally, Mehdizadeh et al. [21] demonstrated the effectiveness of deep feature loss, which utilizes the internal activations of pretrained deep neural networks such as VGG, for CNN-based OCT denoising. This method outperformed traditional loss functions such as $L_1$ and $L_2$. However, denoised images obtained using deep feature loss could exhibit unwanted artifacts in the form of mesh-like patterns.

More recent studies have utilized GANs to reconstruct OCT images. The Denoising-GAN (DN-GAN) by Chen et al. [22] employed a GAN to reconstruct OCT images; their approach combined the adversarial loss with content losses ($L_1$ and $L_2$) and a perceptual loss (deep feature loss). Overall, the GAN method presented improved OCT denoising compared to traditional and CNN-based methods. However, the results showed that in some rare cases DN-GAN had limitations in preserving structural information in the posterior part of the OCT image (i.e., the choroidal tissue).

A recent study by Kande et al. [23] introduced the SiameseGAN network to reconstruct OCT images. Their approach constructed a conditional GAN (cGAN) comprising a residual UNet (ResUNet) generator [24] and a Wasserstein GAN (WGAN) [25] discriminator, complemented by a Siamese twin network [26] to better facilitate the generation of realistic-looking denoised OCT images. In their experiments, the authors investigated the use of UNet [27] versus ResUNet as generators and reported that ResUNet presented superior performance to UNet. Furthermore, they investigated perceptual loss versus mean squared error (MSE) loss for training the cGAN networks and reported that MSE is a better loss for generating denoised OCT images. Overall, SiameseGAN (ResUNet-WGAN-Siamese) showed improved peak signal-to-noise ratio (PSNR) compared to some other GAN-based networks, while the authors reported inferior texture preservation (TP) and edge preservation (EPI) indexes compared to CNN-based OCT denoising approaches.

Isola et al. [28] introduced an image-to-image (I2I) translation framework, pix2pix, designed based on Markovian model principles. The pix2pix network generated impressive results with realistic-looking textures and detail; the authors reported that participants who were blinded to the origin of the generated images still categorized some of the images generated by their framework as real. Pix2pix consists of a lightweight UNet generator and a Markovian discriminator, called PatchGAN. The authors demonstrated the success of pix2pix across multiple I2I translation applications with realistic texture.

Despite these encouraging results and the progress in the field, generating realistic-looking denoised OCT images that preserve the tissue structure and exhibit a texture similar to the averaged OCT image remains a challenge. Motivated by the work of Isola et al. [28], in this study we investigate OCT image reconstruction employing PatchGAN as a texture/style loss in an I2I translation setting. We construct cGAN networks with lightweight UNet and ResUNet generators and pair them with a PatchGAN discriminator to gain more insight into the texture synthesis of PatchGAN for the purpose of OCT image reconstruction. We compare the performance of our work with the recently proposed SiameseGAN. Additionally, through our experiments, we systematically compare the performance of the WGAN and PatchGAN discriminators. This study also explores the use of a lightweight UNet versus ResUNet as the generator and its effect on the overall performance of a cGAN for reconstructing OCT images. Furthermore, we report on the effect of perceptual loss and MSE on the performance of the networks. We provide quantitative and qualitative analyses of the OCT images reconstructed by the various cGAN networks.

In summary, the main contributions of this work are as follows:

  • Explore the effect of PatchGAN as a texture loss for OCT image denoising
  • Systematically compare PatchGAN with WGAN as discriminators in cGAN I2I networks for OCT image denoising
  • Systematically compare lightweight UNet with ResUNet as generators in cGAN I2I networks for OCT image denoising
  • Explore and compare the effect of perceptual loss versus MSE on the quality of the reconstructed OCT images in cGAN I2I networks
  • Use feedback from ophthalmologists to assess the perceptual reality, texture, and content of the reconstructed (denoised) OCT images
The remainder of this paper is structured as follows: Section 2 provides the background information, Section 3 presents details of the proposed methodology, Section 4 provides the experimental results and evaluation, and Sections 5 and 6 present the discussion and conclusion, respectively.

2. Background

2.1 Image-to-Image translation

Image-to-image (I2I) translation [29,30] is a class of learning tasks that transforms images from a source domain to a target domain while preserving the content. Applications include, but are not limited to, day-to-night translation, label-to-image synthesis, photo-to-painting transfer, and image colourisation.

2.2 GAN as a general solution for I2I

GAN [29] can be used as a versatile framework for generating images through the process of image-to-image (I2I) translation, allowing the learning of mappings from a source domain to a target domain. The generative model in GAN assumes a particular image distribution and learns to approximate it during training, enabling the generation of realistic images instead of solely classifying existing ones.

The main idea of GANs is to establish a zero-sum game between the two networks (players), namely, a generator $G$ and a discriminator $D$. Each network is represented by a differentiable function controlled by a set of parameters. The generator $G$ learns to generate fake but plausible images, while the discriminator $D$ learns to distinguish between the fake and real images. The solution of this game is to find a Nash equilibrium between the two networks.

The generator $G$ takes as input a random noise vector $z$ sampled from the model’s prior distribution $p(z)$ and generates an image $G(z)$ to fit the distribution of real images. Then, the discriminator $D$ takes a random real image $x$ from the dataset and the synthetic image $G(z)$ as inputs and outputs a probability between 0 and 1, indicating whether the synthetic image is real or fake. In other words, $D$ aims to discriminate the synthetic image $G(z)$ from the real image $x$, while $G$ intends to generate synthetic images that confuse $D$. According to Goodfellow et al. [31], the objective of GAN can be expressed as

$$\min_{G}\max_{D} \mathcal{L}(D,G) = \mathrm{E}_{x \sim p_{data}(x)} [\log D(x)] + \mathrm{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]$$
where the term $\mathrm{E}_{x \sim p_{data}(x)} [\log D(x)]$ represents the expectation over the real data distribution $p_{data}(x)$. It calculates the average log-probability of the discriminator $D$ correctly classifying real data samples $x$ as real; in other words, it measures how well the discriminator distinguishes real data. The term $\mathrm{E}_{z \sim p_{z}(z)} [\log(1 - D(G(z)))]$ represents the expectation over the generator’s input noise distribution $p_{z}(z)$. It calculates the average log-probability of the discriminator $D$ incorrectly classifying generated data samples $G(z)$ as real; in other words, it measures how well the generator can produce data that fools the discriminator.

By optimizing the objective function, the generator aims to generate data that the discriminator is more likely to classify as real, while the discriminator aims to correctly distinguish between the real and generated data. The training process involves iteratively updating the parameters of the generator and discriminator to improve their performance until equilibrium is reached.
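
To make this concrete, the following minimal sketch (an assumed TensorFlow implementation, not code from this work) writes the two players' losses using binary cross-entropy, with the commonly used non-saturating form for the generator.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # D maximises E[log D(x)] + E[log(1 - D(G(z)))]: label real images as 1
    # and generated images as 0, and minimise the cross-entropy.
    real_loss = bce(tf.ones_like(real_logits), real_logits)
    fake_loss = bce(tf.zeros_like(fake_logits), fake_logits)
    return real_loss + fake_loss

def generator_loss(fake_logits):
    # G tries to make D label its outputs as real (the non-saturating
    # variant of minimising log(1 - D(G(z)))).
    return bce(tf.ones_like(fake_logits), fake_logits)
```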

2.2.1 Conditional GANs

In traditional GANs, there is no control over the content of the generated image because the only input to the network is the random noise vector $z$. To address this issue, Mirza et al. [32] introduced a conditional version of GANs, in which both the generator and discriminator are conditioned on additional information $y$. The conditional input $y$ can encode various types of information, such as data labels, text, or image attributes. In this research, we input the averaged OCT image as the conditional input to the generator and discriminator. According to Mirza et al. [32], the objective of cGAN can be expressed as

$$\min_{G}\max_{D} \mathcal{L}(D,G) = \mathrm{E}_{x \sim p_{data}(x)} [\log D(x \mid y)] + \mathrm{E}_{z \sim p_{z}(z)} [\log(1 - D(G(z \mid y)))]$$
where $\min_{G}$ indicates that the generator aims to minimize the objective, $\max_{D}$ indicates that the discriminator aims to maximize it, and $\mathcal{L}(D,G)$ represents the loss function of the cGAN. The term $\mathrm{E}_{x \sim p_{data}(x)} [\log D(x \mid y)]$ calculates the average log-probability of the discriminator $D$ correctly classifying real data samples $x$ conditioned on a specific label $y$. The term $\mathrm{E}_{z \sim p_{z}(z)} [\log(1 - D(G(z \mid y)))]$ calculates the average log-probability of the discriminator $D$ incorrectly classifying generated data samples $G(z \mid y)$, conditioned on the label $y$, as real. It measures how well the generator can produce data, conditioned on the label, that fools the discriminator.

In our study, the goal of the cGAN is to find an equilibrium where the generator produces realistic denoised OCT image samples conditioned on the equivalent “averaged” OCT image, and the discriminator is unable to distinguish between real “averaged” OCT images and the generated denoised OCT images. The training process involves iteratively updating the parameters of the generator and the discriminator to improve their performance and reach equilibrium.

In the cGAN setting, an image restoration loss is added to the adversarial objective to measure the quality of the match between the generator’s output image and the target image. Two commonly used losses are: (i) perceptual loss [33], which is the sum of the squared differences of features extracted from a pre-trained network such as the VGG [34] network, and (ii) the mean-squared-error (MSE) loss. Perceptual loss has shown success in preserving structural content in OCT images [21,23] and is calculated here using the VGG-19 network pretrained on ImageNet [35].

$$L_\textrm{VGG}(T) = \frac{1}{n}\frac{1}{w\,h\,d} \sum_{i=1}^{n} \| VGG(T(\mathbf{x}_{i})) - VGG(\mathbf{y}_{i})\|^{2}$$
where $w$, $h$, and $d$ denote the width, height, and depth of the convolutional feature maps, respectively, $n$ is the number of images, and $T(\cdot)$ is the denoising network applied to the noisy input $\mathbf{x}_{i}$, whose output is compared against the gold standard $\mathbf{y}_{i}$.

The MSE loss measures the average squared difference between the estimated values and the actual values; here it is the average squared difference between the pixels of the two images $\mathbf{y}$ and $\mathbf{x}$.

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \| \mathbf{x}_{i} - \mathbf{y}_{i}\|^{2}$$
where $n$ is the number of pixels in the image, $\mathbf{y}$ is the gold standard averaged OCT image, and $\mathbf{x}$ is the output image. In our experiments, we computed the perceptual loss by extracting features from the block3_conv3 layer, aligning with the experimental approach of SiameseGAN to maintain methodological continuity.
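
The two restoration losses above can be sketched compactly in TensorFlow/Keras; the code below is an assumed implementation (not the authors' released code) using the pretrained VGG-19 block3_conv3 activations mentioned above, with inputs assumed to be scaled to [0, 1].

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
feature_extractor = tf.keras.Model(
    inputs=vgg.input, outputs=vgg.get_layer('block3_conv3').output)
feature_extractor.trainable = False

def perceptual_loss(denoised, averaged):
    # VGG-19 expects 3-channel inputs, so grayscale B-scans are tiled.
    d = tf.keras.applications.vgg19.preprocess_input(
        tf.image.grayscale_to_rgb(denoised) * 255.0)
    a = tf.keras.applications.vgg19.preprocess_input(
        tf.image.grayscale_to_rgb(averaged) * 255.0)
    # Mean squared feature difference (the 1/(w h d) normalisation above).
    return tf.reduce_mean(tf.square(feature_extractor(d) - feature_extractor(a)))

def mse_loss(denoised, averaged):
    # Average squared pixel difference between output and gold standard.
    return tf.reduce_mean(tf.square(denoised - averaged))
```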

3. Methodology

In this section, we outline the structural details of the generators and discriminators used to design the cGAN I2I networks in our study. First, the lightweight UNet and ResUNet models, which serve as generators, are described, followed by the PatchGAN and WGAN classifiers, which serve as discriminators.

3.1 Lightweight UNet model

The original UNet [27] was proposed by Ronneberger et al. for biomedical image segmentation. The cGAN pix2pix adopted the UNet model as its generator. This UNet variant has a highly lightweight design compared to the original UNet architecture, using only one convolution in each encoder and decoder component as opposed to two. Studies have shown that this lightweight version produces results comparable to the original UNet model [36]. The lightweight UNet is beneficial for applications with limited training data, which can occur in some medical imaging applications.

The lightweight UNet architecture consists of seven encoder layers, one bottleneck layer, and seven decoder layers (Fig. 1). Each encoder layer consists of a convolution, followed by batch normalization and Leaky ReLU (Conv+BN+LReLU) operations. Each decoder layer consists of a transpose convolution, followed by batch normalization and ReLU (ConvT+BN+ReLU) operations. The bottleneck layer serves as a bridge between the encoder and decoder units and consists of a convolution followed by a ReLU (Conv+ReLU) operation. The first encoder layer and the last decoder layer do not contain batch normalization operations. For all convolutions, the stride is set to 2 and the kernel size is set to $4\times 4$ (illustrated in Fig. 1).
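
A minimal Keras sketch of these building blocks is given below (an assumed implementation following the pix2pix convention; the 0.2 Leaky ReLU slope is assumed rather than stated in the text).

```python
from tensorflow.keras import layers

def encoder_block(x, filters, use_bn=True):
    # Conv+BN+LReLU with a 4x4 kernel and stride 2 (BN omitted in the first layer).
    x = layers.Conv2D(filters, kernel_size=4, strides=2, padding='same')(x)
    if use_bn:
        x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(0.2)(x)

def decoder_block(x, skip, filters, use_bn=True):
    # ConvT+BN+ReLU with a 4x4 kernel and stride 2, followed by the UNet
    # skip connection to the matching encoder feature map.
    x = layers.Conv2DTranspose(filters, kernel_size=4, strides=2, padding='same')(x)
    if use_bn:
        x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.Concatenate()([x, skip])
```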

Fig. 1. Lightweight UNet used in pix2pix; the architecture contains 54 million parameters that need to be trained. “f” represents the number of filters, “k” the kernel size, and “s” the stride of each layer. E1 and E2 stand for encoder types 1 and 2; similarly, D1 and D2 stand for decoder types 1 and 2. E1, E2, D1, D2, and the bottleneck, displayed on the right side, are the building blocks of the lightweight UNet architecture. The diagram displays the size of the input and output feature maps for each component.

3.2 Deep residual UNet

The deep residual UNet (ResUNet), proposed by Zhang et al. [24], is a variant of UNet that uses residual connections for better information flow and to facilitate training of the network. ResUNet consists of three parts: encoder, bottleneck, and decoder. All three parts are built with residual units (Fig. 2). Each residual unit consists of two $3 \times 3$ convolutional layers followed by batch normalization and ReLU activation layers. The identity mapping function contains a $1 \times 1$ convolution and a batch normalization operation. We utilize a 7-level deep ResUNet architecture in our experiments. As shown in Fig. 2, the network has three residual units in the encoder section, one residual unit in the bottleneck, and three residual units in the decoder section. The encoder section encodes the input image into compact representations using a stride of $s = 2$ in the convolutional layers to halve the size of the feature maps. The corresponding decoder path uses up-sampling of feature maps and a concatenation with feature maps from the corresponding encoder path before each residual unit; a stride of $s = 1$ is used in the decoder residual units. After the last residual unit in the decoder path, a $1 \times 1$ convolution followed by a $Tanh$ activation layer is used to map the multi-channel feature maps back to the pixel-level details of the reconstructed OCT image. The details of each layer, including the size of the feature maps at each layer, are illustrated in Fig. 2.
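
The residual unit described above can be sketched in Keras as follows (an assumed implementation; the exact ordering of convolution, batch normalization, and activation may differ from the authors' code).

```python
from tensorflow.keras import layers

def residual_unit(x, filters, strides=1):
    # Identity mapping on the shortcut: 1x1 convolution + batch normalization.
    shortcut = layers.Conv2D(filters, kernel_size=1, strides=strides, padding='same')(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # Main path: two 3x3 convolutions with batch normalization and ReLU.
    y = layers.Conv2D(filters, kernel_size=3, strides=strides, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, kernel_size=3, strides=1, padding='same')(y)
    y = layers.BatchNormalization()(y)

    # Residual connection; stride 2 is used in the encoder, stride 1 in the decoder.
    return layers.ReLU()(layers.Add()([y, shortcut]))
```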

Fig. 2. ResUNet and its components; this architecture contains 19 million parameters that need to be trained. “f” represents the number of filters, “k” the kernel size, and “s” the stride of each layer. The residual block is the building component of the ResUNet architecture, and its detail is displayed on the right-hand side of the diagram. The diagram displays the size of the input and output feature maps for each component.

3.3 Wasserstein GAN

Arjovsky et al. introduced Wasserstein GAN (WGAN) [37] as a solution to the vanishing gradients problem [38], where gradients in neural networks become extremely small during training, leading to slow or ineffective learning. WGAN addresses this by estimating the Wasserstein distance, measuring the difference between the distributions of real and generated samples, to evaluate the authenticity of an image. The authors demonstrated that WGAN exhibits improved stability and is less sensitive to hyperparameter choices compared to the original GAN. Several studies have shown a positive relationship between the loss of the WGAN discriminator and the quality of generated images.

The objective function of WGAN is formulated as:

$$\min_{G}\max_{D}\mathcal{L}_{WGAN}(D,G) ={-}\mathrm{E}_{{\boldsymbol{x}}}[D({\boldsymbol{x}})] +\mathrm{E}_{{\boldsymbol{z}}}[D(G({\boldsymbol{z}}))] + \lambda\mathrm{E}_{\hat{{\boldsymbol{x}}}}[(\|\nabla_{\hat{{\boldsymbol{x}}}}D(\hat{{\boldsymbol{x}}})\|_{2} - 1)^{2}]$$
where the first two terms estimate the Wasserstein distance; the last term is a regularization term that provides the gradient penalty. $\lambda$ is a constant weighting parameter, and $\hat{{\boldsymbol{x}}}$ is uniformly sampled along straight lines that connect pairs of generated samples $G({\boldsymbol{z}})$ and real samples ${\boldsymbol{x}}$. The schematic diagram of the WGAN is provided in Fig. 3.
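
A sketch of this objective's gradient-penalty term is shown below (an assumed TensorFlow implementation following the standard WGAN-GP recipe; the penalty weight of 10 is an assumption, not a value reported in this paper).

```python
import tensorflow as tf

def gradient_penalty(critic, real_images, fake_images):
    # Sample x_hat uniformly on straight lines between real and generated images
    # and push the critic's gradient norm at x_hat towards 1.
    batch_size = tf.shape(real_images)[0]
    eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
    x_hat = eps * real_images + (1.0 - eps) * fake_images
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        d_hat = critic(x_hat, training=True)
    grads = tape.gradient(d_hat, x_hat)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norm - 1.0))

def wgan_critic_loss(critic, real_images, fake_images, lam=10.0):
    # -E[D(x)] + E[D(G(z))] + lambda * gradient penalty
    return (tf.reduce_mean(critic(fake_images, training=True))
            - tf.reduce_mean(critic(real_images, training=True))
            + lam * gradient_penalty(critic, real_images, fake_images))
```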

Fig. 3. Comparison of the PatchGAN and WGAN concept, architecture, and network modules. PatchGAN discriminates on image patches, whereas WGAN classifies the whole image as either “real” or “fake”. The PatchGAN input is noisy and averaged OCT image pairs, whereas the WGAN input is only the noisy OCT image. Both discriminators have a similar architecture with the same layers; however, two main differences make each network unique. The first is the input layer, which is responsible for taking in the data. The second is the final layer, which is responsible for producing the output. These small design choices have a significant impact on the two networks’ functionality and purpose.

3.4 PatchGAN

The PatchGAN [28] penalizes structure at the scale of patches to ensure high-frequency correctness. It tries to classify each $N \times N$ patch as real or fake, and the final output is obtained by averaging all convolutional responses run over the image patches. The PatchGAN input is real and fake image pairs. It has five convolutional layers; after the last layer, a convolution is applied to map the patch responses to a one-dimensional output, followed by a Sigmoid operation. The design of PatchGAN depends on the receptive field, i.e., the patch size $p$. In this experiment, a size of $p = 70 \times 70$ was employed, as it was determined to be the optimal choice by the authors. The receptive field denotes the relationship between one output activation of the model and an area of the input image. The schematic diagram of PatchGAN is illustrated in Fig. 3.
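
The sketch below gives an assumed Keras implementation of such a 70×70 PatchGAN discriminator, modelled on pix2pix (the filter counts are assumptions): the conditioning image and the candidate image are concatenated on the channel axis, and each value in the output map is a per-patch real/fake logit.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_patchgan(input_shape=(512, 512, 1)):
    condition = layers.Input(shape=input_shape)   # conditioning OCT image
    candidate = layers.Input(shape=input_shape)   # averaged (real) or generated (fake) image
    x = layers.Concatenate()([condition, candidate])
    for i, (filters, strides) in enumerate([(64, 2), (128, 2), (256, 2), (512, 1)]):
        x = layers.Conv2D(filters, kernel_size=4, strides=strides, padding='same')(x)
        if i > 0:  # pix2pix omits batch normalization in the first layer
            x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
    # Final convolution maps each receptive-field patch to a single logit;
    # a sigmoid (or BCE from logits) converts these to per-patch probabilities.
    patch_logits = layers.Conv2D(1, kernel_size=4, strides=1, padding='same')(x)
    return tf.keras.Model([condition, candidate], patch_logits)
```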

The objective function of PatchGAN is formulated as:

$$\min_{G}\max_{D}\mathcal{L}_{PGAN}(D,G) = \mathrm{E}_{{\boldsymbol{x,y}}}[\log D({\boldsymbol{x,y}})] +\mathrm{E}_{{\boldsymbol{x,z}}}[\log (1 - D({\boldsymbol{x}}, G({\boldsymbol{x,z}})))] + \lambda\mathrm{E}_{{\boldsymbol{x,y,z}}}[\|{\boldsymbol{y}} - G({\boldsymbol{x,z}})\|_{1}]$$
where the first two terms are conditioned on ${\boldsymbol{x}}$, in contrast to the unconditional variant in which the discriminator does not observe ${\boldsymbol{x}}$ (Eq. (5)).

Figure 3 illustrates a comparison of PatchGAN with the WGAN discriminator. The two networks have a similar architecture of Conv+BN+LReLU layers. However, they differ firstly in that the input of the PatchGAN is “conditioned” [28] with the “averaged” OCT images. This makes PatchGAN a conditional discriminator, whereas the WGAN discriminator is not conditioned on the target images. Secondly, the final real/fake output of PatchGAN aggregates the votes on the realness/fakeness of the input image patches, whereas the final real/fake output of WGAN is a classification of the whole image.
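
When the generator is updated against the PatchGAN, the adversarial term and the weighted restoration term of the objective above are combined. The sketch below is an assumed implementation (the weight lambda = 100 is the pix2pix default, not a value reported here; the restoration term is shown as L1 per the objective above, and substituting tf.square for tf.abs gives the MSE variant used in our experiments).

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def patchgan_generator_loss(fake_patch_logits, generated, averaged, lam=100.0):
    # Adversarial term: every patch of the generated image should be judged real.
    adversarial = bce(tf.ones_like(fake_patch_logits), fake_patch_logits)
    # Restoration term: pixel-wise distance to the gold-standard averaged image.
    restoration = tf.reduce_mean(tf.abs(generated - averaged))
    return adversarial + lam * restoration
```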

4. Experiments

4.1 Experimental dataset

Our dataset comprises foveal-centered OCT retinal scans of 226 children aged between 4 and 12 years with normal vision in both eyes and no history of ocular pathology [39]. The images were acquired with a spectral domain OCT instrument (Copernicus SOCT-HR, Optopol Technology SA, Zawiercie, Poland). The dataset is ideal for these experiments because it consists of multiple noisy OCT B-scans captured at the same retinal location as well as the corresponding averaged “noise-free” image pairs (Fig. 4). The original image size is $999 \times 868$ pixels. The averaged image is obtained by image registration and averaging as presented by Alonso-Caneiro et al. in [40].

Fig. 4. Original OCT B-scan image (A) and the corresponding averaged OCT B-scan (B).

A total of 1660 OCT image pairs were used in this study. A randomly selected subset of 1460 image pairs was used for training the cGAN, and the remaining 200 image pairs were used for validation and testing.

All OCT images were resized to $512\times 512$ pixels to suit the network input. The input to all networks was identical, consisting of 1460 noisy and averaged OCT image pairs. The images were fed to the networks as whole images in gray-scale format.
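
A minimal sketch of this preprocessing is given below (an assumed implementation; the image file format and intensity scaling are assumptions rather than details reported in the paper).

```python
import tensorflow as tf

def load_pair(noisy_path, averaged_path):
    # Load a noisy/averaged OCT pair as grayscale, resize from 999x868 to the
    # 512x512 network input, and scale intensities to [-1, 1] for a tanh generator.
    def load(path):
        img = tf.io.decode_png(tf.io.read_file(path), channels=1)
        img = tf.image.resize(tf.cast(img, tf.float32), [512, 512])
        return img / 127.5 - 1.0
    return load(noisy_path), load(averaged_path)
```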

4.2 Experimental settings

We built six networks by varying the selection of generators (UNet and ResUNet), discriminators (PatchGAN and WGAN), and loss functions (perceptual and MSE for WGAN, and MSE only for PatchGAN, as the latter can only be trained with the MSE loss). Additionally, for comparative analysis, we assessed the performance of the SiameseGAN network on our OCT image dataset; the authors provided a code implementation, which was accessible at https://github.com/sml-iisc/SiameseGAN. Table 1 lists the seven networks and their constituent components.

Table 1. The seven networks trained in our experiments. The left column lists the names we use for the networks, and the right column lists the generator, discriminator, and adversarial loss used in each network.

4.3 Super-computing infrastructure

The deep learning frameworks were implemented in Tensorflow and all experiments were conducted on 1 GPU node on the Bracewell CSIRO super-computing facility where each node consists of 2x Intel Xeon Broadwell E5-2680 v4, 14-core CPUs (28 cores total) @ 2.4 GHz (nominal), 256 GB of RAM and 4x NVIDIA Tesla P100s with each card having 16 GB memory.

We trained each network for 100 epochs. After every epoch, the trained network was tested on a set of unseen validation images, the network weights were saved, and the PSNR of the validation images was calculated.
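
The sketch below illustrates this schedule (an assumed training driver, not the released code; gan_train_step stands in for one generator/discriminator update of whichever cGAN variant is being trained, and images are assumed scaled to [0, 1] for the PSNR call).

```python
import tensorflow as tf

def train(gan_train_step, generator, train_ds, val_ds, epochs=100):
    for epoch in range(epochs):
        for noisy, averaged in train_ds:
            gan_train_step(noisy, averaged)   # one adversarial update

        # Checkpoint the generator weights for this epoch.
        generator.save_weights(f'checkpoints/generator_epoch_{epoch:03d}.weights.h5')

        # Validation PSNR against the averaged (gold standard) images.
        psnrs = [tf.reduce_mean(tf.image.psnr(generator(noisy, training=False),
                                              averaged, max_val=1.0))
                 for noisy, averaged in val_ds]
        mean_psnr = float(tf.reduce_mean(tf.stack(psnrs)))
        print(f'epoch {epoch}: validation PSNR = {mean_psnr:.2f} dB')
```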

4.4 Evaluation metrics

We evaluated the reconstructed OCT images using both quantitative and qualitative measures.

4.4.1 Quantitative image quality measurements

Following a similar methodology to our previous work [21], the reconstructed OCT images were evaluated with well-known metrics such as the peak signal-to-noise ratio (PSNR), structural similarity (SSIM), texture preservation (TP), and edge preservation (EP) indexes. In addition, two no-reference image sharpness metrics, just noticeable blur (JNB) [41] and spectral and spatial sharpness (S3) [42], are reported in our experiments.
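
For the full-reference metrics, a minimal sketch is shown below (an assumed implementation; the background mask used for the "-B" variants reported later is hypothetical, and the no-reference JNB and S3 metrics are not reproduced here).

```python
import tensorflow as tf

def psnr_ssim_all(reconstructed, averaged, max_val=1.0):
    # PSNR and SSIM computed over the whole B-scan (the "-All" variants).
    return (tf.image.psnr(reconstructed, averaged, max_val),
            tf.image.ssim(reconstructed, averaged, max_val))

def psnr_background(reconstructed, averaged, background_mask, max_val=1.0):
    # PSNR restricted to a hypothetical 0/1 background (non-retina) mask,
    # analogous to the "-B" variants reported in the results.
    mse = (tf.reduce_sum(background_mask * tf.square(reconstructed - averaged))
           / tf.reduce_sum(background_mask))
    return 10.0 * tf.math.log(max_val ** 2 / mse) / tf.math.log(10.0)
```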

4.4.2 Expert feedback qualitative analysis

To qualitatively assess the performance of each network in reconstructing OCT images, six ophthalmologists who were not involved in the development of the networks were asked to provide feedback based on defined criteria. These expert clinicians all have at least two years of background in retinal image analysis and acquisition.

Three evaluation tests were specifically designed to gather expert user feedback and assess the performance of the networks. These tests encompassed texture similarity, structural integrity, and perceptual reality, enabling a comprehensive evaluation and comparison of the networks. The primary objective of these tests was to gauge the realism of the reconstructed OCT images as perceived by human experts, in comparison to the averaged OCT image. Furthermore, the preservation of crucial image content, such as blood vessels, macula boundary, and retinal layers, was also assessed. Thus, the design of the three evaluation tests was formulated with these objectives in mind:

  • Test 1 - Texture similarity: With reference to the averaged OCT image (labelled for the observer), choose the top three reconstructed OCT images that exhibit the most similar texture to the averaged OCT image (gold standard).
  • Test 2 - Structural integrity: With reference to the original noisy image (labelled for the observer), pick the top three reconstructed images that best preserve the content (detail) of the noisy image.
  • Test 3 - Perceptual reality: Label the reconstructed OCT images as “Synthetically generated” or “Perceptually realistic” to evaluate the “fakeness” or “realness” of the reconstructed images.
Thirty randomly selected sets out of 200 sets of 9 images (the averaged and noisy images and the corresponding reconstructed images from the 7 trained GAN networks) were used for this experiment. A web-based dashboard was designed to receive the expert users’ feedback. The system displayed the thirty records to the observer, who was blinded to the image condition except where a reference image was used. The user could open each record to view the 9 OCT images; in Test 1, the averaged image was labelled as the reference for the user. The same set of images was presented to all image graders and used for all three test cases.

4.5 Experimental results

4.5.1 Quantitative results

Table 2 presents the averaged quantitative results obtained from our experiments. The table showcases various evaluation metrics, including PSNR-B (peak signal-to-noise ratio of the background), PSNR-All (peak signal-to-noise ratio of the whole image), SSIM-B (structural similarity index of the background), SSIM-All (structural similarity index of the whole image), EPI (edge preservation index), ENL (equivalent number of looks), TP (texture preservation), and JNB (just noticeable blur).

Table 2. Average of the PSNR of the background (PSNR-B), PSNR of the whole image (PSNR-All), SSIM of the background (SSIM-B), SSIM of the whole image (SSIM-All), EPI, ENL, TP, and JNB.

Among the different configurations tested, the results highlight the following key observations:

  • The ResUNet model combined with the PatchGAN discriminator using MSE loss (ResUNet-PatchGAN-MSE) achieves the highest PSNR-B (32.88) and SSIM-All (0.98) scores, indicating superior performance in preserving image details and structural similarity.
  • The UNet model combined with the PatchGAN discriminator using MSE loss (UNet-PatchGAN-MSE) demonstrates competitive performance, with a PSNR-B of 32.50 and an SSIM-All of 0.97.
  • The ResUNet model combined with the WGAN discriminator and SiameseGAN also show promising results, particularly in terms of ENL (2.48) and JNB (8.35), indicating good preservation of edge information.
These findings emphasize the effectiveness of using the PatchGAN discriminator in conjunction with the ResUNet and UNet models, especially when trained with MSE loss, for achieving favourable quantitative results. Note that the values marked in bold represent the highest scores in their respective categories. Additionally, we analyze the edge preservation index (EPI) along the vertical direction at each boundary of the retinal layers. Figure 5 provides a comparative visualization of the EPIs across the seven networks. Notably, SiameseGAN exhibits the highest EPI values near the ISOS and RPE layers, indicating superior edge preservation in those regions. On the other hand, ResUNet-WGAN-Perceptual demonstrates the highest EPI scores around the IPL, OPL, and ELM layers.

Fig. 5. (a) An averaged B-scan OCT image with marked retinal layer boundaries, and (b) edge preserving index (EPI) for test OCT images around retinal layer boundaries.

Furthermore, we evaluate the perceived sharpness of all seven networks using the S3 measure, as depicted in Fig. 6. Perceived sharpness, captured by the S3 (spectral and spatial sharpness) method, is a metric that produces a perceptual sharpness map in which higher values indicate regions perceived as sharper by the human visual system. The overall sharpness is estimated by identifying the sharpest region in the image, corresponding to the maximum value in the sharpness map. The resulting S3 value offers a quantitative assessment of the overall perceived sharpness of the entire image.

Fig. 6. The slope parameter $\alpha$ of each trained GAN network and their relative positions on the perceived sharpness measure spectrum.

The diagram clearly illustrates that SiameseGAN achieves the highest S3 score, surpassing UNet-WGAN-Perceptual by more than 0.1 in perceived sharpness. Both UNet-PatchGAN and ResUNet-PatchGAN achieve S3 scores above 0.5, while the averaged OCT image achieves a perceived sharpness score above 0.3.

4.5.2 Qualitative results

Figure 7 displays the results for Test 1, illustrating the cumulative votes of all six clinical observers through violin plots. Only 5 out of the 7 networks were ever picked among the top three choices: UNet-PatchGAN-MSE (100%), SiameseGAN (63%), ResUNet-WGAN-Perceptual (61%), UNet-WGAN-Perceptual (51%), and UNet-WGAN-MSE (22%) were selected among the top three choices in Test 1. Some of the images reconstructed using the SiameseGAN network contained artifacts (40% of the time) and were therefore not selected within the top three. ResUNet-WGAN-MSE and ResUNet-PatchGAN-MSE were never picked among the top three choices.

The results (top three choices) for Test 2, which examined the structural integrity of the generated denoised images with reference to the original noisy images, ordered by score, are: averaged (77%), UNet-PatchGAN-MSE (74%), SiameseGAN (63%), ResUNet-WGAN-Perceptual (52%), UNet-WGAN-MSE (21%), UNet-WGAN-Perceptual (12%), and ResUNet-WGAN-MSE (1%). ResUNet-PatchGAN-MSE was never picked among the top three choices. It is important to note that the averaged OCT image was not selected among the top three choices $20{\% }$ of the time. According to Fig. 8, UNet-PatchGAN-MSE received the highest votes, competing with the averaged OCT image, while SiameseGAN was mostly picked as the second or third choice.

Fig. 7. Left graph - violin plot illustrating clinical observer classification on the top three images that exhibit the most similar texture to the averaged OCT image. The white circle indicates the median of the votes, the black thick bar shows the interquartile range. Right side - screenshot of the image grading tool displaying one typical record containing noisy OCT, its corresponding averaged image, and the reconstructed images through seven networks.

Fig. 8. Left graph - violin plot illustrating clinical observer classification on the top three images that exhibit the most structural integrity compared to the original noisy image. The white circle indicates the median of the votes, the black thick bar shows the interquartile range. Right side - screenshot of the image grading tool displaying one typical record containing noisy OCT, its corresponding averaged image (labelled for the user), and the associated reconstructed OCT images through seven networks.

Fig. 9. Each column shows the same cross-sectional example displaying the averaged OCT image and the OCT images reconstructed by the SiameseGAN and UNet-PatchGAN-MSE networks. On the left, the yellow arrow highlights the ambiguity in the ILM layer due to a failure in the image registration process; this shows that the networks were successful in reconstructing the OCT image regardless of the averaged image. On the right, the yellow arrow highlights the artifact introduced by the SiameseGAN network.

The results for Test 3, on the cumulative user feedback on the perceptual reality of the images, are illustrated in Fig. 10. UNet-PatchGAN-MSE (65%), ResUNet-WGAN-Perceptual (52%), UNet-WGAN-MSE (48%), ResUNet-WGAN-MSE (27%), SiameseGAN (25%), UNet-WGAN-Perceptual (23%), and ResUNet-PatchGAN-MSE (4%) were picked as exhibiting authentic textures.

Fig. 10. Right graph - violin plot illustrating clinical observer classification on the perceptual reality of the reconstructed images. The white circle indicates the median of the votes, the black thick bar shows the interquartile range. Left side - screenshot of the image grading tool displaying one typical record. In each record, the original noisy and the corresponding averaged image were labelled for the user. The user would then pick either “real” or “fake” label for the reconstructed images through seven trained networks.

The overall observation from the imaging experts was that ResUNet was more susceptible to introducing artifacts in the reconstructed images (Fig. 9), which would compromise the structural integrity of the reconstructed images (Fig. 11). The percentages of introduced artifacts in the ResUNet-based networks are: SiameseGAN ($9{\% }$), ResUNet-WGAN-MSE ($24.5{\% }$), and ResUNet-WGAN-Perceptual ($18.5{\% }$).

Fig. 11. The left column displays the averaged OCT image and the OCT images reconstructed by the ResUNet-WGAN-MSE and UNet-PatchGAN-MSE networks; the yellow arrow shows the artifact introduced by the ResUNet-WGAN-MSE network. The right column displays the averaged OCT image and the OCT images reconstructed by the ResUNet-WGAN-Perceptual and UNet-PatchGAN-MSE networks; the yellow arrow highlights the artifact introduced by the ResUNet-WGAN-Perceptual network.

5. Discussion

We have investigated the effect of the texture loss of the PatchGAN discriminator on the denoising performance of GAN networks for OCT speckle noise reduction. We performed an ablation study on generators (UNet and ResUNet) and discriminators (PatchGAN and WGAN). Additionally, different loss functions were considered, given that the PatchGAN discriminator loss typically uses an MSE loss, while the WGAN discriminator has been used with both MSE and perceptual losses. To understand the influence of each module in reconstructing OCT images, an experiment including six networks was performed in this study. Additionally, the study compares these networks with SiameseGAN (the seventh network), which consists of a ResUNet generator and a WGAN discriminator with perceptual loss, coupled with a twin Siamese network as an additional adversarial loss. We used a range of quantitative measures (PSNR-B, PSNR-All, SSIM-B, SSIM-All, EPI, ENL, TP, JNB, and S3) to evaluate the reconstructed OCT images. In addition, we gathered feedback from six clinical experts through three scenarios in which they had to categorize the reconstructed OCT images based on style (Test 1), content (Test 2), and perceptual realness (Test 3).

Our findings using real clinical images indicated that the texture loss of PatchGAN can effectively optimize the UNet generator with an MSE adversarial loss to generate reconstructed OCT images that were superior to the SiameseGAN scores for a number of metrics, including PSNR-B (32.50 dB compared to 31.02 dB), PSNR-All (44.48 dB compared to 33.76 dB), SSIM-B (0.88 compared to 0.85), and SSIM-All (0.97 compared to 0.89), while also scoring highly on EPI (0.90 compared to 0.77). These quantitative findings agree with the qualitative assessment: according to the imaging experts’ feedback, the UNet-PatchGAN-MSE network was chosen 100% of the time among the top three choices, whereas SiameseGAN was chosen among the top three only 63% of the time. According to the users’ feedback, SiameseGAN is prone to introducing artifacts in the reconstructed OCT images that compromise the structural integrity of the retina. Similarly, PatchGAN can effectively optimize the ResUNet generator with an MSE adversarial loss to generate reconstructed OCT images that scored higher than SiameseGAN in PSNR-B (32.88 dB compared to 31.02 dB), PSNR-All (44.13 dB compared to 33.76 dB), SSIM-B (0.91 compared to 0.85), SSIM-All (0.98 compared to 0.89), and EPI (0.93 compared to 0.77). However, ResUNet-PatchGAN-MSE was not a popular pick in comparison with the SiameseGAN and UNet-PatchGAN-MSE networks: clinicians expressed dissatisfaction with these images due to excessively blurred textures, which resulted in a perceived fading of retinal structures and diminished perceptual clarity.

Our findings underscore the critical role of visual evaluation in guiding the selection of image reconstruction methodologies. The unanimous preference for the UNet-PatchGAN-MSE network by the imaging experts, coupled with their reservations regarding artifacts introduced by SiameseGAN, highlights the necessity of incorporating qualitative feedback (visual assessment) in the assessment of proposed methods. As we look towards future research, the integration of more extensive clinical datasets and the active involvement of domain experts in the evaluation process will be paramount. Furthermore, while PatchGAN proves effective in optimizing both UNet and ResUNet generators, the relatively lower popularity of ResUNet-PatchGAN-MSE among imaging experts prompts intriguing questions about the interplay between network architectures and visual preferences. This observation emphasizes the need for further exploration into the interaction of different generator-discriminator architectures and their impact on visual acceptability. For future research, a holistic approach integrating quantitative metrics, qualitative assessments, and, most importantly, clinical expert feedback will be crucial. Such a comprehensive evaluation strategy ensures that the developed methodologies not only excel in numerical benchmarks but also align with the nuanced requirements and preferences of practitioners. As we delve deeper into refining image reconstruction techniques, a focus on bridging the gap between quantitative benchmarks and clinical applicability will remain a central theme of our research agenda.

When considering the effect of the two generators (UNet and ResUNet) coupled with the WGAN discriminator and the perceptual loss, the quantitative results indicate that UNet-WGAN-Perceptual outperforms ResUNet-WGAN-Perceptual in PSNR-B (31.73 dB compared to 31.35 dB), PSNR-All (42.49 dB compared to 39.58 dB), SSIM-B (0.86 compared to 0.81), SSIM-All (0.97 compared to 0.95), ENL (2.02 compared to 1.26), TP (1.05 compared to 0.86), and JNB (8.05 compared to 7.83). However, according to the clinical expert feedback, ResUNet-WGAN-Perceptual was selected more often than UNet-WGAN-Perceptual (61% vs 51%) on texture similarity to the averaged OCT (Test 1), and more often (52% vs 40%) on structural integrity (Test 2). Finally, ResUNet-WGAN-Perceptual was voted more realistic than UNet-WGAN-Perceptual (52% vs 23% of the time). Based on the clinicians’ feedback, UNet-WGAN-Perceptual introduced noticeable repeated patterns in the reconstructed OCT images, particularly affecting the clarity of the ILM boundary and the macular region. In contrast, ResUNet-WGAN-Perceptual did not exhibit prominent repeated patterns in the reconstructed images, contributing to enhanced clarity around the retinal tissue and the background (vitreous).

The UNet and ResUNet generators coupled with the WGAN discriminator optimised with MSE exhibited similar quantitative measures in PSNR-B (32.60 dB and 32.96 dB), PSNR-All (44.66 dB and 44.60 dB), SSIM-B (0.89 and 0.90), SSIM-All (0.97 both), EPI (0.91 both), ENL (1.15 both), TP (0.97 both), and JNB (4.90 and 4.70). However, according to the imaging experts’ feedback, the UNet-WGAN-MSE network was picked among the top three choices 22% of the time, whereas ResUNet-WGAN-MSE was never picked among the top three on texture similarity to the averaged OCT in Test 1. UNet-WGAN-MSE (21%) scored almost 10% higher than ResUNet-WGAN-MSE (12%) on the structural integrity of Test 2. UNet-WGAN-MSE was voted as realistic 48% of the time, 21% more than ResUNet-WGAN-MSE with 27% of the votes. Further analysis of the experts’ feedback revealed a consistent observation: both networks exhibited a certain degree of blurriness in the reconstructed OCT images. However, the ResUNet-WGAN-MSE network demonstrated a more pronounced blurring effect, suggesting limitations in capturing finer retinal structural and layer details compared to its UNet counterpart. The UNet-WGAN-MSE network, on the other hand, exhibited a higher capacity to reconstruct images with greater fidelity to retinal structures. While quantitative measures provide valuable feedback, the distinct preference for UNet-WGAN-MSE highlights that the perceived quality of reconstructed images extends beyond numerical benchmarks. The ability to capture fine details and nuances in retinal structures, as noted by the experts, underscores the importance of qualitative assessments in refining image reconstruction methodologies for clinical applications.

It is worth noting that the dataset used in this study only included OCT images of healthy individuals. Future work should explore the potential of the proposed method on OCT datasets with images that present pathologies. Similarly, the method was only tested on data from a single OCT instrument. The method’s extension to other OCT devices should also be explored. It is likely that the proposed network could serve as a pre-trained network that can be fine-tuned for other OCT devices.

6. Conclusion

In conclusion, this study has presented a robust methodology for speckle noise reduction in OCT images through image reconstruction, leveraging the capabilities of conditional Generative Adversarial Networks (cGAN). The architecture involved a UNet generator with skip connections, coupled with a PatchGAN discriminator. A comprehensive exploration of the discriminator and generator architectures, along with the impact of different training loss functions, was conducted. The dataset, comprising 1660 B-scan OCT images, enabled a thorough investigation, with 1460 image pairs dedicated to training and the remaining 200 pairs equally divided between validation and testing.

The experimental outcomes revealed notable improvements in key metrics, including PSNR, TP, SSIM, and EPI, when compared to SiameseGAN, a state-of-the-art method for speckle reduction. These quantitative advancements were generally supported by the qualitative assessments, as indicated by the favorable feedback from clinical experts. Notably, the UNet-PatchGAN-MSE configuration emerged as the preferred choice among imaging experts, demonstrating its efficacy in achieving enhanced image fidelity and noise reduction.

However, the nuanced preferences and reservations expressed by clinicians, particularly regarding the blurriness associated with ResUNet-WGAN-MSE, emphasize the importance of a holistic approach when assessing the outcomes produced by these networks. While quantitative benchmarks provide valuable insights, the incorporation of expert feedback has proven indispensable for better understanding the potential of image reconstruction methodologies for real-world clinical applications. The unanimous preference for the UNet-PatchGAN-MSE network, coupled with the acknowledgment of artifacts introduced by alternative methods such as SiameseGAN, underscores the necessity of incorporating visual feedback in the development process and subsequent assessment.

As we chart a course for future research, it becomes evident that the interplay between different generator-discriminator architectures and their impact on clinical acceptability demands further exploration. The comprehensive evaluation strategy, integrating quantitative metrics, qualitative assessments, and expert feedback, will remain pivotal. The imperative of bridging the gap between quantitative benchmarks and clinical applicability will continue to guide our research agenda. Future endeavours should focus on the integration of more extensive clinical datasets and the active participation of domain experts in the evaluation process, ensuring that the developed methodologies not only excel in numerical benchmarks but also align with the nuanced requirements and preferences of medical practitioners.

In summary, the presented methodology stands as a promising advancement in OCT image reconstruction, demonstrating its potential for clinical utility. Additionally, the collaboration between quantitative evaluations and clinical feedback serves as a paradigm for future research in medical image processing, where the success of a methodology lies in its ability to meet the multifaceted demands of real-world clinical scenarios.

Acknowledgments

We thank Dr Jason Charng, Dr Rachael Heath Jeffery, Dr Danial Roshandel, and Dr Mary Safwat Aziz Attia as the expert image graders for taking part in the qualitative experiments and their helpful discussions. The authors express their gratitude to Professor Scott Read for granting access to the dataset.

Disclosures

The authors declare that there are no conflicts of interests related to this article.

Data availability

The code implementations used in this study were based on the SiameseGAN and pix2pix frameworks. The SiameseGAN code was sourced from the public repository SiameseGAN [43], and the keras implementation of pix2pix code was obtained from pix2pix [44]. These frameworks provided the essential foundations for the development and training of the deep learning models employed in our experiments.

The datasets employed in this study were acquired from the Queensland University of Technology (QUT) under the terms of an ethics agreement that ensures compliance with all relevant ethical guidelines and regulations. Unfortunately, due to privacy and ethical considerations, the datasets cannot be made publicly available. Researchers interested in accessing the datasets can initiate the necessary ethical approval process through QUT.

The custom code developed to train the models is available on CodeOcean at [45].

References

1. M. S. Hepburn, K. Y. Foo, A. Curatolo, P. R. T. Munro, and B. F. Kennedy, Speckle in Optical Coherence Tomography, chap. 4, pp. 4–1–4–29.

2. B. Karamata, K. Hassler, M. Laubscher, and T. Lasser, “Speckle statistics in optical coherence tomography,” J. Opt. Soc. Am. A 22(4), 593–596 (2005). [CrossRef]  

3. M. A. Mayer, A. Borsdorf, M. Wagner, J. Hornegger, C. Y. Mardin, and R. P. Tornow, “Wavelet denoising of multiframe optical coherence tomography data,” Biomed. Opt. Express 3(3), 572–589 (2012). [CrossRef]  

4. P. Mittal and C. Bhatnagar, “Effectual accuracy of OCT image retinal segmentation with the aid of speckle noise reduction and boundary edge detection strategy,” Journal of Microscopy 289, 164–179 (2022). [CrossRef]  

5. Z. Mao, A. Miki, S. Mei, Y. Dong, K. Maruyama, R. Kawasaki, S. Usui, K. Matsushita, K. Nishida, and K. Chan, “Deep learning based noise reduction method for automatic 3d segmentation of the anterior of lamina cribrosa in optical coherence tomography volumetric scans,” Biomed. Opt. Express 10(11), 5832 (2019). [CrossRef]  

6. A. Stankiewicz, T. Marciniak, A. Döbrowski, M. Stopa, P. Rakowicz, and E. Marciniak, “Denoising methods for improving automatic segmentation in oct images of human eye,” Bull. The Pol. Acad. Sci. Sci. 65(1), 71–78 (2017). [CrossRef]  

7. L. Fang, D. Cunefare, C. Wang, R. H. Guymer, S. Li, and S. Farsiu, “Automatic segmentation of nine retinal layer boundaries in oct images of non-exudative amd patients using deep learning and graph search,” Biomed. Opt. Express 8(5), 2732–2744 (2017). [CrossRef]  

8. J. Rogowska and M. E. Brezinski, “Image processing techniques for noise removal, enhancement and segmentation of cartilage oct images,” Phys. Med. Biol. 47(4), 641–655 (2002). [CrossRef]  

9. A. Wong, A. Mishra, K. Bizheva, and D. A. Clausi, “General Bayesian estimation for speckle noise reduction in optical coherence tomography retinal imagery,” Opt. Express 18(8), 8338–8352 (2010). [CrossRef]  

10. R. Bernardes, C. Maduro, P. Serranho, A. Araújo, S. Barbeiro, and J. Cunha-Vaz, “Improved adaptive complex diffusion despeckling filter,” Opt. Express 18(23), 24048 (2010). [CrossRef]  

11. P. Puvanathasan and K. Bizheva, Interval type-ii Fuzzy Anisotropic Diffusion Algorithmfor Speckle Noise Reduction in Optical Coherence Tomography Images, (Optical Society of America, 2009), pp. 733–746.

12. W. Habib, A. M. Siddiqui, and I. Touqir, “Wavelet based despeckling of multiframe optical coherence tomography data using similarity measure and anisotropic diffusion filtering,” in 2013 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), (IEEE, 2013), pp. 330–333.

13. Z. Hongwei, L. Baowang, and F. Juan, “Adaptive wavelet transformation for speckle reduction in optical coherence tomography images,” in 2011 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), (2011), pp. 1–5.

14. R. Kafieh, H. Rabbani, and I. Selesnick, “Three dimensional data-driven multi scale atomic representation of optical coherence tomography,” IEEE Trans. Med. Imaging 34(5), 1042–1062 (2015). [CrossRef]  

15. B. Chong and Y.-K. Zhu, “Speckle reduction in optical coherence tomography images of human finger skin by wavelet modified BM3D filter,” Opt. Commun. 291, 461–469 (2013). [CrossRef]  

16. S. Chitchian, M. A. Mayer, A. Boretsky, F. v. Kuijk, and M. Motamedi, “Complex wavelet denoising of retinal OCT imaging,” Investigative Ophthalmology Visual Science 53, 3124 (2012).

17. L. Fang, S. Li, R. P. McNabb, Q. Nie, A. N. Kuo, C. A. Toth, J. A. Izatt, and S. Farsiu, “Fast acquisition and reconstruction of optical coherence tomography images via sparse representation,” IEEE Trans. Med. Imaging 32(11), 2034–2049 (2013). [CrossRef]  

18. L. Fang, S. Li, Q. Nie, J. A. Izatt, C. A. Toth, and S. Farsiu, “Sparsity based denoising of spectral domain optical coherence tomography images,” Phys. Med. Biol. 3(5), 927–942 (2012). [CrossRef]  

19. F. Shi, N. Cai, Y. Gu, D. Hu, Y. Ma, Y. Chen, and X. Chen, “DeSpecNet: a CNN-based method for speckle reduction in retinal optical coherence tomography images,” Phys. Med. Biol. 64(17), 175010 (2019). [CrossRef]  

20. B. Qiu, Z. Huang, X. Liu, X. Meng, Y. You, G. Liu, K. Yang, A. Maier, Q. Ren, and Y. Lu, “Noise reduction in optical coherence tomography images using a deep neural network with perceptually-sensitive loss function,” Biomed. Opt. Express 11(2), 817–830 (2020). [CrossRef]  

21. M. Mehdizadeh, C. MacNish, D. Xiao, D. Alonso-Caneiro, J. Kugelman, and M. Bennamoun, “Deep feature loss to denoise oct images using deep neural networks,” J. Biomed. Opt. 26(04), 046003 (2021). [CrossRef]  

22. Z. Chen, Z. Zeng, H. Shen, X. Zheng, P. Dai, and P. Ouyang, “DN-GAN: Denoising generative adversarial networks for speckle noise reduction in optical coherence tomography images,” Biomed. Signal Process. Control. 55, 101632 (2020). [CrossRef]  

23. N. A. Kande, R. Dakhane, A. Dukkipati, and P. K. Yalavarthy, “SiameseGAN: a generative model for denoising of spectral domain optical coherence tomography images,” IEEE Trans. Med. Imaging 40(1), 180–192 (2021). [CrossRef]  

24. Z. Zhang, Q. Liu, and Y. Wang, “Road Extraction by Deep Residual U-Net,” IEEE Geosci. Remote Sensing Lett. 15(5), 749–753 (2017). [CrossRef]  

25. M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, eds. (PMLR, 2017), pp. 214–223.

26. S. Dey, A. Dutta, J. I. Toledo, S. K. Ghosh, J. Lladós, and U. Pal, “SigNet: convolutional siamese network for writer independent Offline signature verification,” arXiv, arXiv:1707.02131 (2017). [CrossRef]  

27. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional networks for biomedical image segmentation,” arXiv, arXiv:1505.04597 (2015). [CrossRef]  

28. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv, arXiv:1611.07004 (2017). [CrossRef]  

29. Y. Pang, J. Lin, T. Qin, and Z. Chen, “Image-to-image translation: methods and applications,” arXiv, arXiv:2101.08629 (2021). [CrossRef]  

30. A. Alotaibi, “Deep generative adversarial networks for image-to-image translation: a review,” Symmetry 12(10), 1705 (2020). [CrossRef]  

31. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, vol. 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, eds. (Curran Associates, Inc., 2014).

32. M. Mirza and S. Osindero, “Conditional Generative Adversarial Nets,” arXiv, arXiv:1411.1784 (2014). [CrossRef]  

33. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, (2016).

34. K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, eds. (2015).

35. L. Fei-Fei, J. Deng, and K. Li, “ImageNet: Constructing a large-scale Image Database,” J. Vis. 9(8), 1037 (2010). [CrossRef]  

36. I. Pratikakis, K. Zagoris, X. Karagiannis, L. Tsochatzidis, T. Mondal, and I. Marthot-Santaniello, “ICDAR 2019 Competition on Document Image Binarization (DIBCO 2019),” in 2019 International Conference on Document Analysis and Recognition (ICDAR), (2019), pp. 1547–1556.

37. M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, (JMLR.org, 2017), ICML’17, p. 214–223.

38. M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” in International Conference on Learning Representations, (2017).

39. S. A. Read, M. J. Collins, S. J. Vincent, and D. Alonso-Caneiro, “Macular retinal layer thickness in childhood,” Retina 35(6), 1223 (2015). [CrossRef]  

40. D. Alonso-Caneiro, S. A. Read, and M. J. Collins, “Speckle reduction in optical coherence tomography imaging by Affine-Motion image registration,” J. Biomed. Opt. 16(11), 116027 (2011). [CrossRef]  

41. R. Ferzli and L. J. Karam, “A no-reference objective image sharpness metric based on just-noticeable blur and probability summation,” in 2007 IEEE International Conference on Image Processing, vol. 3 (2007), pp. III-445–III-448.

42. C. T. Vu, T. D. Phan, and D. M. Chandler, “S(3): A Spectral and Spatial Measure of Local Perceived Sharpness in Natural Images,” IEEE Trans. Image Process. 21(3), 934–945 (2012). [CrossRef]  

43. S. Gupta, “SiameseGAN,” Github, 2020, https://github.com/sml-iisc/SiameseGAN

44. J. Brownlee, “How to Develop a Pix2Pix GAN for Image-to-Image Translation,” Machine Learning Mastery, 2021, https://machinelearningmastery.com/how-to-develop-a-pix2pix-gan-for-image-to-image-translation/.

45. The Australian e-Health Research Centre, “OCT-Image-Reconstruction,” Github, 2024, https://github.com/aehrc/OCT_Denoising_pix2pix

Data availability

The code implementations used in this study were based on the SiameseGAN and pix2pix frameworks. The SiameseGAN code was sourced from its public repository [43], and the Keras implementation of pix2pix was obtained from the Machine Learning Mastery tutorial [44]. These frameworks provided the foundations for developing and training the deep learning models employed in our experiments.

The datasets employed in this study were acquired from the Queensland University of Technology (QUT) under the terms of an ethics agreement that ensures compliance with all relevant ethical guidelines and regulations. Unfortunately, due to privacy and ethical considerations, the datasets cannot be made publicly available. Researchers interested in accessing the datasets can initiate the necessary ethical approval process through QUT.

The custom code developed to train the models is available on GitHub [45].

Figures

Fig. 1. Lightweight UNet used in pix2pix; the architecture contains 54 million trainable parameters. “f” denotes the number of filters, “k” the kernel size, and “s” the stride in each layer. E1 and E2 stand for encoder types 1 and 2, and D1 and D2 stand for decoder types 1 and 2. E1, E2, D1, D2, and the bottleneck, displayed on the right-hand side, are the building blocks of the lightweight UNet architecture. The diagram shows the input and output feature-map sizes for each component.

Fig. 2. ResUNet and its components; this architecture contains 19 million trainable parameters. “f” denotes the number of filters, “k” the kernel size, and “s” the stride in each layer. The residual block is the building component of the ResUNet architecture, and its detail is displayed on the right-hand side of the diagram. The diagram shows the input and output feature-map sizes for each component.

Fig. 3. Comparison of the PatchGAN and WGAN concepts, architectures, and network modules. The PatchGAN discriminates on image patches, whereas the WGAN discriminates on the whole image as either “real” or “fake”. The PatchGAN input is a noisy and averaged OCT image pair, whereas the WGAN input is only the noisy OCT image. Both discriminators share a similar architecture with the same layers; the two main differences are the input layer, which takes in the data, and the final layer, which produces the output. These small design choices have a significant impact on the two networks’ functionality and purpose.

Fig. 4. Original OCT B-scan image (A) and the corresponding averaged OCT B-scan (B).

Fig. 5. (a) An averaged B-scan OCT image with marked retinal layer boundaries, and (b) edge preserving index (EPI) for test OCT images around retinal layer boundaries.

Fig. 6. The slope parameter α of each trained GAN network and their relative positions on the perceived sharpness measure spectrum.

Fig. 7. Left: violin plot illustrating the clinical observers’ selection of the top three images that exhibit the most similar texture to the averaged OCT image. The white circle indicates the median of the votes, and the thick black bar shows the interquartile range. Right: screenshot of the image grading tool displaying one typical record containing the noisy OCT image, its corresponding averaged image, and the images reconstructed by the seven networks.

Fig. 8. Left: violin plot illustrating the clinical observers’ selection of the top three images that exhibit the most structural integrity compared with the original noisy image. The white circle indicates the median of the votes, and the thick black bar shows the interquartile range. Right: screenshot of the image grading tool displaying one typical record containing the noisy OCT image, its corresponding averaged image (labelled for the user), and the associated OCT images reconstructed by the seven networks.

Fig. 9. Each column shows the same cross-sectional example, displaying the averaged OCT image and the OCT images reconstructed by the SiameseGAN and UNet-PatchGAN-MSE networks. On the left, the yellow arrow highlights the ambiguity in the ILM layer caused by a failure in the image registration process; the networks reconstructed the OCT image successfully regardless of the averaged image. On the right, the yellow arrow highlights the artifact introduced by the SiameseGAN network.

Fig. 10. Right: violin plot illustrating the clinical observers’ classification of the perceptual reality of the reconstructed images. The white circle indicates the median of the votes, and the thick black bar shows the interquartile range. Left: screenshot of the image grading tool displaying one typical record. In each record, the original noisy image and the corresponding averaged image were labelled for the user, who then assigned a “real” or “fake” label to each of the images reconstructed by the seven trained networks.

Fig. 11. The left column displays the averaged OCT image and the OCT images reconstructed by the ResUNet-WGAN-MSE and UNet-PatchGAN-MSE networks; the yellow arrow shows the artifact introduced by the ResUNet-WGAN-MSE network. The right column displays the averaged OCT image and the OCT images reconstructed by the ResUNet-WGAN-Perceptual and UNet-PatchGAN-MSE networks; the yellow arrow highlights the artifact introduced by the ResUNet-WGAN-Perceptual network.
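
To make the structural contrast in Fig. 3 concrete, the following is a minimal Keras sketch of the two discriminator heads: the PatchGAN takes a (noisy, averaged) image pair and outputs a map of per-patch real/fake scores, while the WGAN critic takes a single image and outputs one unbounded scalar. The filter counts, depth, and 256x256 input size are illustrative assumptions, not the exact trained configurations.

# Illustrative Keras sketch of the two discriminator heads compared in Fig. 3.
# Filter counts, depths, and the 256x256 input size are assumptions for clarity,
# not the trained architectures used in the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

def patchgan_discriminator(shape=(256, 256, 1)):
    """Conditional PatchGAN: scores (noisy, averaged) image pairs patch by patch."""
    noisy = layers.Input(shape, name="noisy_oct")
    target = layers.Input(shape, name="averaged_oct")
    x = layers.Concatenate()([noisy, target])          # the image pair goes in together
    for f in (64, 128, 256, 512):
        x = layers.Conv2D(f, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    # N x N map of per-patch real/fake probabilities rather than a single score
    patch_scores = layers.Conv2D(1, 4, padding="same", activation="sigmoid")(x)
    return Model([noisy, target], patch_scores, name="patchgan")

def wgan_critic(shape=(256, 256, 1)):
    """WGAN critic: scores a single image with one unbounded scalar."""
    img = layers.Input(shape, name="oct_image")
    x = img
    for f in (64, 128, 256, 512):
        x = layers.Conv2D(f, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)
    score = layers.Dense(1)(x)                          # no sigmoid: Wasserstein estimate
    return Model(img, score, name="wgan_critic")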

Tables

Table 1. The seven networks trained in our experiments. The left column lists the names we use for the networks, and the right column lists the generator, discriminator, and adversarial loss used in each network.

Table 2. Averages of the PSNR over the background (PSNR-B), PSNR over the whole image (PSNR-All), SSIM over the background (SSIM-B), SSIM over the whole image (SSIM-All), EPI, ENL, TP, and JNB.
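
For reference, the full-image PSNR and SSIM values summarized in Table 2 can be computed with standard scikit-image routines, as in the short sketch below. The data_range value and the absence of any background masking are assumptions for illustration; the paper's exact background-region definition and the remaining metrics (EPI, ENL, TP, JNB) are not reproduced here.

# Hedged sketch of the full-image PSNR and SSIM computations reported in Table 2,
# using standard scikit-image routines.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_image_metrics(denoised: np.ndarray, averaged: np.ndarray) -> dict:
    # PSNR/SSIM of a denoised B-scan against the averaged (gold standard) B-scan,
    # assuming both images are floats scaled to [0, 1].
    psnr_all = peak_signal_noise_ratio(averaged, denoised, data_range=1.0)
    ssim_all = structural_similarity(averaged, denoised, data_range=1.0)
    return {"PSNR-All": psnr_all, "SSIM-All": ssim_all}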

Equations

$$\min_G \max_D \, \mathcal{L}(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$

$$\min_G \max_D \, \mathcal{L}(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x \mid y)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big]$$

$$\mathcal{L}_{\mathrm{VGG}}(T) = \frac{1}{n}\,\frac{1}{whd}\sum_{i=1}^{n}\big\lVert \mathrm{VGG}(T(x_i)) - \mathrm{VGG}(y_i)\big\rVert^{2}$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big\lVert x_i - y_i\big\rVert^{2}$$

$$\min_G \max_D \, \mathcal{L}_{\mathrm{WGAN}}(D,G) = \mathbb{E}_{x}\big[D(x)\big] + \mathbb{E}_{z}\big[D(G(z))\big] + \lambda\,\mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_{2} - 1\big)^{2}\Big]$$

$$\min_G \max_D \, \mathcal{L}_{\mathrm{PGAN}}(D,G) = \mathbb{E}_{x,y}\big[\log D(x,y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x,z))\big)\big] + \lambda\,\mathbb{E}_{x,y,z}\big[\lVert y - G(x,z)\rVert_{1}\big]$$
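
As a minimal illustration of the last objective above (the PatchGAN adversarial term combined with a weighted L1 term), the following TensorFlow/Keras sketch shows how the discriminator and generator losses could be assembled. The lambda = 100 weight is the common pix2pix default and an assumption here, not necessarily the value tuned in this work.

# Minimal TensorFlow sketch of the pix2pix-style objective in the last equation:
# binary cross-entropy adversarial terms on the PatchGAN output plus a weighted
# L1 term between the reconstruction and the averaged image. LAMBDA = 100 is the
# common pix2pix default and an assumption, not the paper's tuned value.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
LAMBDA = 100.0

def discriminator_loss(d_real_patches, d_fake_patches):
    # E[log D(x, y)] + E[log(1 - D(x, G(x, z)))], which the discriminator maximizes
    real_term = bce(tf.ones_like(d_real_patches), d_real_patches)
    fake_term = bce(tf.zeros_like(d_fake_patches), d_fake_patches)
    return real_term + fake_term

def generator_loss(d_fake_patches, generated, averaged):
    # The generator pushes the patch scores towards "real", while the L1 term
    # keeps the reconstruction close to the averaged OCT image.
    adv_term = bce(tf.ones_like(d_fake_patches), d_fake_patches)
    l1_term = tf.reduce_mean(tf.abs(averaged - generated))
    return adv_term + LAMBDA * l1_term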