ERA-WGAT: Edge-enhanced residual autoencoder with a window-based graph attention convolutional network for low-dose CT denoising

Han Liu; Peixi Liao; Hu Chen; Yi Zhang

doi:10.1364/BOE.471340

1. Introduction

X-ray computed tomography (CT) has been widely used for medical examination and diagnosis for humans. CT images can help achieve more effective medical management by determining when surgery is necessary, reducing exploratory surgeries, improving cancer diagnosis and treatment, etc. They also show surgeons exactly where to operate, greatly improving surgery success. Various body tissues absorb X-rays differently, hence CT images can provide vital internal information. However, excessive radiation exposure if patients CT imaged too many times can lead to potential health risks [1,2].

Many algorithms have been proposed to improve image quality for low-dose CT (LDCT), hence lessening radiation exposure for patients, with the underlying approaches categorized as (1) sinogram domain filtration, (2) iterative reconstruction (IR), and (3) image post-processing.

Sinogram filtering methods directly smooth raw data before image reconstruction, such as filtered back projection (FBP). [3] investigated a relatively accurate statistical model for sinogram data and developed a penalized likelihood method to smooth the sinogram. Other typical methods include structural adaptive filtering [4], bilateral filtering [5], and penalized weighted least-squares algorithms [6]. Sinogram filtering algorithms are often restricted in pratice due to difficultly accessing projection data.

Iterative reconstruction (IR) algorithms incrementally estimate denoised images using priors have been proposed for LDCT image denoising, including total variation (TV) and its variants [7–10], non-local mean [11–13], dictionary learning [14,15], and other techniques [16–18]. These IR algorithms can significantly improve image quality, but can lose details and have high computational cost, considerably limiting their practical application.

Post-processing can be applied to LDCT images. Previous studies have proposed classical image processing methods, including non-local means filtering [19–21], dictionary learning [22,23], three-dimensional (3D) block matching (BM3D) [24–26], and diffusion filters [27], which are more computationally efficient then IR methods. However, noise distributions in reconstructed LDCT images are commonly non-uniform, which makes it difficult to obtain valid denoising results.

Convolutional neural networks (CNN) have recently been shown to be highly effective for image denoising [28–36]. Hence various network architectures have been proposed for LDCT denoising, including two-dimensional (2D) CNNs [28,29], 3D CNNs [30,33,35], residual connections [28,34], cascade CNN [31], dense connections [36], and quadratic convolution and deconvolution [34]; with many different objective functions, including mean squared error (MSE) [28–31,34,36], adversarial loss [30,32,33,35], and perceptual loss [32,33,35,36]. We selected two pixels P and Q for visualization of non-local dependencies as shown in Fig. 1(a). Figure 1(b) and (c) are related pixels of pixel P and Q, respectively. It can be seen that the pixels of CT images are not only dependent on local pixels, but also dependent on non-local pixels. The main drawback for CNN-based methods is the lack of non-local self-similarity capturing due to the convolution kernels.

Fig. 1. Non-local dependencies. (a) Two pixels selected for visualization of non-local dependencies. (b) Pixel P’s related pixels. (c) Pixel Q’s related pixels. The darker the color, the stronger the relationships.

Download Full Size | PDF

Different from CNN-based methods, Graph convolutional networks (GCNs) have shown great potential exploring non-local self-similarity for LDCT image denoising [37,38]. However, these GCN-based methods only consider non-local self-similarity between pixels, which tends to provide unstable performance due to noisy pixels in LDCT images. Moreover, the behavior of the GCN in these pixel-based GCNs is different between training and testing for efficiency reasons, i.e., they use overlapped patches during training and all graphs are constructed in the patches, while during testing they define a search window which is roughly comparable to the patch size used in training for each pixel. However, this procedure is suboptimal as some pixels might suffer from border effects during training, i.e., their search windows are not centered around them.

Therefore, we propose ERA-WGAT, an edge-enhanced residual autoencoder with window-based graph attention convolutional network, to perform LDCT denoising, treating non-overlapped windows of fixed size in the feature maps as nodes rather pixels. The main contributions for this paper are as follows.

1. We propose a conveying path based residual autoencoder and use eight types of learnable operators (vertical, horizontal, diagonal, and anti-diagonal Sobel operators; vertical and horizontal Scharr operators; two types of Laplacian operators) to extract edge information from the input LDCT image, and design an edge branch to provide sufficient edge information for each stage of the encoder part.
2. We propose a window-based graph attention convolutional network (WGAT) combining static and dynamic attention modules to explore non-local self-similarity in the encoder, bottleneck, and decoder parts of the proposed model for LDCT images.
3. In WGAT, we treated non-overlapped windows as nodes rather then pixels and adopted a hierarchical structure and perform WGAT on the feature maps with appropriate scale, which makes our model maintain the same behavior in the training and testing stages and solve the problem that some pixel may suffer from border effects in pixel-based GCN methods.
4. Extensive experiments demonstrate that the proposed ERA-WGAT provides superior noise suppression and better image quality compared with several state-of-the-art denoising methods.

The remainder of this paper is organized as follows. Section 2. surveys the deep learning based noise suppression methods for LDCT. Section 3. describes the proposed method, which is evaluated and validated in Section 4. Section 5. summarizes and concludes the paper.

2. Related work

Table 1 summarizes a comprehensive comparison between the proposed ERA-WGAT and current deep learning based LDCT denoising methods. The main difference of models are divided into two aspects: network architecture and objective function.

Table 1. Comparison between Deep Learning based Methods. the Abbreviations Mse, Al, Pl in Table are for Mean Squared Error, Adversarial Loss, and Perceptual Loss, respectively

View Table | View all tables in this article

2.1 Network architecture

The key elements for relevant network architectures include convolutional and deconvolutional layers, shortcut connections, Sobel operators, and GCNs.

2.1.1 Convolutional layers

Convolutional layers in deep learning networks can be broadly divided into 2D (Conv2d), quadratic 2D (Q-Conv2d) [34], and 3D (Conv3d) convolutions. Conv2d and Q-Conv2d are performed on a single CT slice, with Conv2d commonly employed whereas Q-Conv2d has been recently proposed to enhance individual neuron capabilities by replacing the inner product with a quadratic operation on input data. Conv3d is applied to several adjacent slices, incorporating 3D spatial information, typical methods include GAN-3D [30], CPCE-3D [33], and SACNN [35].

2.1.2 Deconvolutional layers

Autoencoder based methods often employ convolutional layers for the encoder, and deconvolutional layers for the decoder. However, deconvolution has uneven overlap when kernel size is not an integer multiple of the stride. But although selecting kernel size to be a multiple of the stride can help avoid overlap issues, it remains susceptible to create artifacts [39].

2.1.3 Shortcut connection

Shortcut connections are either skip connection (element-wise addition) or conveying path (concatenation). Skip connection bypasses non-linear transformations with an identity function, and conveying path reuses early feature maps as input for latter layers. Residual mapping, as proposed in RED-CNN [28], is also a skip connection, transforming direct mapping problems into residual mapping problems to helps avoid gradient vanishing problems and significantly enhance LDCT imaging performance. Dense connection, e.g., EDCNN [36], conveys edge enhancement module outputs to each convolution block. Cascade-CNN [31], cascades several CNNs, which are trained individually rather then forming a unified network. The current study employs residual mapping and conveying path based concatenation.

2.1.4 Edge enhancement

Edge enhancement modules (e.g. Sobel operators) applied to input CT images can help extract edge information [36,37]. CT-GCN [37] uses the Sobel edge extractor in horizontal and vertical directions; whereas EDCNN proposed the trainable Sobel convolution with four types of Sobel operators to extract edge information for the LDCT image, including vertical (Fig. 2(a)), horizontal (Fig. 2(b)), diagonal (Fig. 2(c)), and anti-diagonal (Fig. 2(d)) directions. On this basis, we add four operators to provide more powerful edge information extraction ability, including vertical Scharr (Fig. 2(e)), horizontal Scharr (Fig. 2(f)), and two Laplacian (Fig. 2(g) and (h)) operators. Compared with Sobel operators, Scharr operators have two changes: 1) increase the gap between pixels and can enlarge some edge details, 2) increase the weight in the cross directions and weaken the weight in the diagonal directions, that is, pay more attention to the influence of adjacent pixels. Laplacian operators are second-order differential operators while Sobel and Scharr operators are first-order differential ones. Laplacian operators can get more finer edge information and are sensitive noise. Therefore, we can rely on Laplacian operators to obtain the prior information (such as noise and edge details) where the pixel value of the LDCT image changes strongly. The learnable parameter $\lambda$ defined in these trainable operators can be adaptively adjusted during training process so that it can provide better adaptability and generalization ability.

Fig. 2. Eight types of trainable operators. (a), (b), (c), and (d) are vertical, horizontal, diagonal, and anti-diagonal Sobel operators, respectively. (e) and (f) are vertical and horizontal Scharr operators, respectively. (g) and (h) are two types of Laplacian operators.

Download Full Size | PDF

2.1.5 GCN

CNNs can only explore LDCT image local information, whereas non-local self-similarity has been shown to be proved benefical for LDCT denoising. In contrast with CNNs, GCNs have been widely used to process non-Euclidian geometry data, and have proved effective on real image denoising [40]. Although several previous studies have employed GCNs for LDCT denoising [37,38], they still have two disadvantages. As discussed in Section 1, one is that these methods are pixel-based GCNs which tends to provide unstable performance due to noisy pixels. The other is that these methods may cause some pixels to suffer from border effects due to the different training and testing behaviors. The proposed ERA-WGAT uses a hierarchical structure and performs non-overlapping window based graph attention networks on encoder, bottleneck, and decoder parts to overcome the shortcomings of pixel-based methods.

2.2 Objective function

As a image transformation task, LDCT image denoising use various objective functions for optimization. Per-pixel loss, e.g., MSE loss, tend to generate over-smoothed images [28] because structural information is not considered. [41] propose perceptual loss (PL) to address this problem, calculating similarity between images in feature space; and several subsequently proposed GAN based methods combine PL with adversarial loss (AL) to generate more realistic images [32,33,35]. Although these approaches can retain structural information from the original images, they remain poor at suppressing noise. EDCNN [36] proposed compound loss, combining MSE and multi-scale perceptual loss to mitigate over-smoothing. Similar to EDCNN, we use the combination of MSE loss and multi-scale perceptual loss to optimize the proposed model.

3. Method

3.1 Overall architecture

Figure 3 shows that the proposed ERA-WGAT overall architecture comprises four main parts.

Fig. 3. Overall architecture of our proposed network.

Download Full Size | PDF

3.1.1 Edge enhancement module

We first perform four groups of convolutions with eight types of learnable operators described above to extract edge information from the input LDCT image, then, we use three $3 \times 3$ convolution with stride $2$ followed by a Gaussian error linear unit (GELU [42]) to obtain feature maps of different scales, each convolution halves the size and doubles the channels of the feature map. These feature maps are concatenated with the corresponding feature maps from the encoder. In this way, we can provide sufficient edge information for the encoder.

3.1.2 Encoder

The encoder comprises five $3 \times 3$ convolution with stride $1$, three $3 \times 3$ convolutions with stride $2$, and a WGAT. Each convolution is followed by a GELU. The first $3 \times 3$ convolution with stride $1$ is responsible for mapping the input LDCT image to the $32$-dimensional feature space. The other $3 \times 3$ convolutions with stride $1$ perform nonlinear fusion of the feature maps corresponding to the edge enhancement module and encoder at different scales. Each $3 \times 3$ convolution with stride $2$ halves the size and doubles the channels of the feature map. The WGAT is performed on the feature map generated by the fourth $3 \times 3$ convolution with stride $1$.

3.1.3 Bottleneck

The bottleneck consists of three $5 \times 5$ convolutions with stride $1$ and a WGAT. Each convolution is followed by a GELU. Here we use $5 \times 5$ convolutions to increase the receptive field. And the WGAT is performed on the feature map generated by the second $5 \times 5$ convolution.

3.1.4 Decoder

The decoder comprises five $3 \times 3$ convolutions with stride $1$, three upsampling modules, and a WGAT. Each upsampling module consists of a bilinear upsampling layer that doubles the feature map size and a $3 \times 3$ convolution with stride $1$ that halves the number of the feature map channels. Except for the last convolution, all other convolutions are followed by a GELU. The four $3 \times 3$ convolutions with stride $1$ of blue arrow in decoder in Fig. 3 perform nonlinear fusion the feature maps corresponding to the encoder and decoder at different scales. This not only provides the decoder with early feature map information, but also provides rich edge information. We add residual compensation [28] to transform direct mapping into residual mapping problems, employing a $3 \times 3$ convolution to map each $32$-component feature vector for the feature map to $1$, and element-wise addition between the feature map and LDCT image. We also add a rectified linear unit (ReLU) to the resulting feature map after addition to obtain the restored image.

The following sections describe the WGAT and objective function.

3.2 Window based graph attention convolutional network

Figure 4 shows the proposed WGAT framework. We first construct a graph for the non-overlapped windows of the feature map, then pass the graph to the proposed WGAT to explore non-local self-similarity. This section describes WGAT and related components, including graph construction, input, WGAT, and readout layers.

Fig. 4. The WGAT framework.

Download Full Size | PDF

3.2.1 Graph construction

The input feature map is firstly extracted to non-overlapped windows, and then each extracted window is treated as a node. We use Euclidian distance to find $k$ nearest neighbors for each node, and the constructed graph is send to WGAT to explore non-local self-similarity for all windows.

Formally, let $C$, $H$, and $W$ represent input feature map channels, height, and width, respectively. The feature map is extracted to a set of non-overlapped windows $\mathbf {F} \in \mathbb {R}^{N \times C \times M \times M}$ with window size $M \times M$, $N = \frac {H}{M} \times \frac {W}{M}$, comprising $N$ windows $\mathbf {F}_{i} \in \mathbb {R}^{C \times M \times M}$. It should be noted that LDCT images are generally equal in height and width, so the width and height of the windows are also set to be equal. We construct a $k$-connected directed graph $\mathcal {G}$ for graph attention convolution. The edge weight is computed as

(1)$$\boldsymbol{\omega}_{j \rightarrow i} = \mathrm{exp} \left( - \frac{\mathrm{dist} \left( i, j \right) }{\sigma^{2}} \right) ,$$

where $i, j \in N$, and $\mathrm {dist}(i, j)$ is the feature distance between nodes $i$ and $j$; and feature distance $\mathrm {dist}(i, j)$ is calculated using the $\ell _2$-norm distance between two corresponding patches $\mathbf {F}_{i}$ and $\mathbf {F}_{j}$,

(2)$$\mathrm{dist} \left ( i, j \right ) = \left \| \mathbf{F}_{i} - \mathbf{F}_{j} \right \|_{2} .$$

Using (1) to compute $\boldsymbol {\omega }_{j \rightarrow i}$ means that edge weights are always non-negative. Thus, the scale parameter $\sigma$ is defined as the average distance for the $k$ nearest neighbors for each node.

3.2.2 Input layer

For a given graph, we have node feature $\mathbf {F}_{i} \in \mathbb {R}^{C \times M \times M}$ for each node $i$ and edge weight $\boldsymbol {\omega }_{j \rightarrow i} \in \mathbb {R}^{1}$. Input features $\mathbf {F}_{i}$ is embedded to hidden features $\mathbf {S}_{i}^{\ell =0}$ and $\mathbf {D}_{i}^{\ell =0}$, and $\boldsymbol {\omega }_{j \rightarrow i}$ is embedded to hidden feature $\mathbf {e}_{j \rightarrow i}$ using a simple assignment operation before passing them to a graph convolutional network:

(3)$$\mathbf{S}_{i}^{0}=\mathbf{F}_{i}; \mathbf{D}_{i}^{0}=\mathbf{F}_{i}; \mathbf{e}_{j \rightarrow i}=\boldsymbol{\omega}_{j \rightarrow i} ,$$

where we omit $\mathbf {e}_{j \rightarrow i}$ superscripts, since it remains constant once the graph is constructed, and is used in the static attention module in the WGAT layer.

3.2.3 WGAT layer

Graph convolution updates the current node hidden state by aggregating information from its neighbor nodes. The key point is to efficiently aggregate the information. [43] propose to use attention mechanism to perform aggregation. The basic concept is to assign an attention weight or importance to each neighbor, which is then used to weigh the neighbor’s influence during aggregation. We combine static and dynamic attention modules to improve performance. Static attention uses edge hidden embedding $\mathbf {e}_{j \rightarrow i}$ as the attention weight, which remains constant once the graph is constructed (see above). We extend the original dynamic attention definition [43] for window aggregation, generating the dynamic attention by self-attention on the nodes at each WGAT layer. Hidden states for the nodes are updated after each WGAT layer, and the attention weight changes dynamically.

Formally, let $\mathbf {S}_{i}^{\ell }, \mathbf {D}_{i}^{\ell } \in \mathbb {R}^{C \times M \times M}$ denote the static and dynamic hidden states at WGAT layer $\ell$ associated with node $i$. Next, let’s introduce the static and dynamic attention modules in detail.

Static attention. We use the constant edge weight $\mathbf {e}_{j \rightarrow i}$ to calculated the static attention module output $\mathbf {S}_{i}^{\ell + 1} \in \mathbb {R}^{C \times M \times M}$:

(4)$$\mathbf{S}_{i}^{\ell + 1} = \mathrm{\mbox{Max}}_{j \in \mathcal{N}_{i}} \left( \mathbf{W}_{s}^{\ell} \ast \left( \mathbf{e}_{j \rightarrow i} \mathbf{S}_{j}^{\ell} \right) + \mathbf{b}_{s}^{\ell} \right) ,$$

where $\mathcal {N}_{i}$ represents the set of neighbor nodes of node $i$, $\mathbf {W}_{s}^{\ell } \in \mathbb {R}^{P^{2} \times C \times C}$ and $\mathbf {b}_{s}^{\ell } \in \mathbb {R}^{C}$ denote the weight and bias, respectively. $\ast$ represents the convolution operator. $P^{2}$ is the convolution kernel size, and we simply set $P = 3$.

Dynamic attention. The dynamic attention module output $\mathbf {D}_{i}^{\ell + 1} \in \mathbb {R}^{C \times M \times M}$ is calculated as:

(5)$$\mathbf{D}_{i}^{\ell + 1} = \mathrm{\mbox{Sum}}_{j \in \mathcal{N}_{i}} \left( \alpha_{j \rightarrow i}^{\ell} \left( \mathbf{W}_{d}^{\ell} \ast \mathbf{D}_{j}^{\ell} + \mathbf{b}_{d}^{\ell} \right) \right) ,$$

where $\mathbf {W}_{d}^{\ell } \in \mathbb {R}^{P^{2} \times C \times C}$ and $\mathbf {b}_{d}^{\ell } \in \mathbb {R}^{C}$ denote the weight and bias, respectively. $\alpha _{j \rightarrow i}^{\ell }$ is the attention coefficient defined as:

(6)$$\alpha_{j \rightarrow i}^{\ell} = \frac{\exp \left( \hat{\alpha}_{j \rightarrow i}^{\ell} \right)}{\sum_{t\in \mathcal{N}_{i}} \exp \left( \hat{\alpha}_{t \rightarrow i}^{\ell} \right)} ,$$

(7)$$\hat{\alpha}_{j \rightarrow i}^{\ell} = \mathrm{\mbox{LR}} \left( \mathrm{\mbox{AvgPool}} \left( \mathbf{W}_{f}^{\ell} \ast \left( \mathrm{\mbox{Concat}} \left( \mathbf{W}_{d}^{\ell} \ast \mathbf{D}_{i}^{\ell} + \mathbf{b}_{d}^{\ell}, \mathbf{W}_{d}^{\ell} \ast \mathbf{D}_{j}^{\ell} + \mathbf{b}_{d}^{\ell} \right) \right) + \mathbf{b}_{f}^{\ell} \right) \right) ,$$

where LR, $\mathbf {W}_{f}^{\ell } \in \mathbb {R}^{P_{f}^{2} \times 1 \times 2C}$, and $\mathbf {b}_{f}^{\ell } \in \mathbb {R}^{1}$ represent LeakyReLU, the weight and bias of the convolution that fuses the transformed hidden features of node $i$ and its neighbor node $j$, respectively. $\mathbf {P}_{f}^{2}$ is the convolution kernel size of the convolution which use kernel size $2 \times 2$ and stride $2$ to downsample the concatenated feature maps. We use two symmetric aggregation functions (4) and (5), i.e., invariant to input permutations, Max and Sum aggregators, although any symmetric function would suffice.

The WGAT explores the feature map non-local self-similarity and the two attention modules can better aggregate information from neighbor nodes. The number of neighbors for graph construction and the number of WGAT layers are two hyperparameters, and we empirically set number of neighbors $=8$ and number of WGAT layers $=2$ for all experiments.

3.2.4 Readout layer

The final WGAT component is the readout layer. We first reorganize all the windows into the input feature map format to obtain the static attention module output $\mathbf {S} \in \mathbb {R}^{C \times H \times W}$ and dynamic attention module output $\mathbf {D} \in \mathbb {R}^{C \times H \times W}$. The output of the WGAT module $\mathbf {U} \in \mathbb {R}^{C \times H \times W}$ is calculated as:

(8)$$\mathbf{U} = \mathbf{W} \ast \left( \mbox{Concat} \left( \mathbf{S}, \mathbf{D} \right) \right) + \mathbf{b},$$

where $W \in \mathbb {R}^{\mathbf {P}^{2} \times C \times 2C}$, and $\mathbf {b} \in \mathbb {R}^{1}$ represent the weight and bias of the convolution.

More importantly, we propose a fusion strategy of local and non-local feature maps. First, the local feature map (the input feature map) and non-local feature map (the output feature map of WGAT) are concatenated and then passed through a $3 \times 3$ convolution followed by a GELU. This strategy can effectively combine local and non-local information to improve the feature extraction ability of the network.

3.3 Objective function

Previous CNN based methods usually use MSE loss to train the network, i.e., minimize pixel-wise error between denoised and NDCT images. However, MSE loss can generate blurry images and cause detail distortion or loss. Hence, the perceptual loss is proposed to address this problem, but methods that use only perceptual loss perform weak in noise suppression. Therefore, we leverage the compound loss which combines the MSE loss and multi-scale perceptual loss to optimize the proposed ERA-WGAT.

3.3.1 MSE loss

The MSE loss function $\mathcal {L}_{mse}$ can be expressed as:

(9)$$\mathcal{L}_{mse} = \frac{1}{N} \sum_{i=1}^{N} \left\| \phi \left( \mathbf{X}_{i} \right) - \mathbf{Y}_{i} \right\|^{2} ,$$

where $\mathbf {X}_{i}$, $\mathbf {Y}_{i}$, $N$, and $\phi$ represent the LDCT image, NDCT image, number of image pairs, and the proposed ERA-WGAT, respectively.

3.3.2 Multi-scale perceptual loss

The multi-scale perceptual loss $\mathcal {L}_{per}$ is computed as:

(10)$$\mathcal{L}_{per} = \frac{1}{NS} \sum_{i=1}^{N} \sum_{s=1}^{S} \left\| \gamma_{s} \left( \phi \left( \mathbf{X}_{i} \right) \right) - \gamma_{s} \left( \mathbf{Y}_{i} \right) \right\|^{2} ,$$

where $\gamma$ is the pretrained VGG19 [44]. $S$ is the number of scales. In the experiments, $\eta$ takes the fourth, eighth, $12$th, and $16$th convolutions in the VGG network. We duplicated the CT images to create RGB channels, since $\gamma$ takes color image inputs, whereas CT images are greyscale.

3.3.3 Compound loss

The compound loss $\mathcal {L}_{comp}$ can be expressed as:

(11)$$\mathcal{L}_{comp} = \mathcal{L}_{mse} + \alpha \mathcal{L}_{per} ,$$

where $\alpha$ is a hyperparameter to balance the two components.

4. Experimental design and results

This section details the datasets used to train and evaluate the networks, discusses hyperparameter selection and ablation experiments on different modules, and compares the proposed ERA-WGAT approach with recent state-of-the-art methods denoising CT images.

4.1 Data sources

We used a real clinical dataset from the 2016 NIH AAPM-Mayo Clinic Low-Dose CT Grand Challenge [45] by Mayo Clinic for the training and evaluation of the proposed ERA-WGAT. The dataset contains 10 anonymous patients’ normal-dose abdominal CT images and simulated quarter-dose CT images. All the networks were trained with a subset of full and quarter dose image pairs (4,736 images from 8 patients), and tested with the remaining pairs (896 images from 2 patients).

4.2 Parameter selection

We experimentally evaluated several parameter combinations and finalized parameter settings as follows. We use the original resolution of the images, i.e., $512 \times 512$. Networks were trained for 60 epochs; with base learning rate = $10^{-4}$, multiplied by 0.1 at the 20th and 40th epoch. Number of WGAT layers = 2, with window size in encoder and decoder = $8 \times 8$, window size in bottleneck = $4 \times 4$. Number of neighbors for graph construction = 8. We used the AdamW [46] algorithm with weight decay $=10^{-3}$ to optimize the network. All networks were implemented in PyTorch on two NVIDIA RTX 2080Ti GPU.

We employed four metrics to quantitatively evaluate network performance: root mean square error (RMSE), peak signal to noise ratio (PSNR), structural similarity (SSIM), and VGG-P. RMSE and PSNR focus on pixel-level similarity, SSIM on structural similarity, and VGG-P is the commonly used perceptual loss based on VGG19.

In order to determine the weighting parameter $\alpha$ for the compound loss, we select $\alpha$ from $\left \{ 0, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5} \right \}$. Table 2 summarizes the quantitative results associated with different value of $\alpha$ for the images in the test set. $\alpha = 10^{-4}$ achieved the result of trade-off on all the metrics, therefore we set $\alpha = 10^{-4}$ in the following experiments.

Table 2. Quantitative Results (Mean$\pm$Sd) Associated with Different Value of Parameter $\alpha$ in the Compound Loss in the Proposed Era-wgat for the Images in the Test Set

View Table | View all tables in this article

For comparison with the proposed ERA-WGAT, we employed BM3D [24], RED-CNN [28], WGAN-VGG [32], CPCE-2D [33], EDCNN [36], QAE [34], and CT-GCN [37] models.

4.3 Experimental results

4.3.1 Denoising performance comparison

Two representative patients’ scans from the test set were chosen to visualize the denoising performance. Figure 5 and Fig. 7 show the visualization results on the two slices, respectively. Figure 5(a) and Fig. 7(a) are NDCTs, Fig. 5(b) and Fig. 7(b) are LDCTs. Figure 5(c) and Fig. 7(c) are the denoising results of BM3D, a classical image processing method. Figure 5(d), (h), (i) and Fig. 7(d), (h), (i) show the denoising results of methods using only MSE loss, including RED-CNN, QAE, and CT-GCN. Figure 5(e), (f) and Fig. 7(e), (f) show the denoising results of models with GAN framework using perceptual loss, including WGAN-VGG and CPCE-2D. Figure 5(g), (j) and Fig. 7(g), (j) show the denoising results of models using the compound loss which combines MSE loss and perceptual loss, including EDCNN and our proposed ERA-WGAT. For a clearer comparison, we selected two enlarged region of interests (RoIs) marked by the blue rectangles in Fig. 5 and Fig. 7, and we also selected a complex structure marked by yellow rectangle in Fig. 7.

Fig. 5. Results from the abdominal image with a metastasis in the liver for comparison. The Region of Interest(RoI) in the blue box is selected and magnified for a clearer comparison.

Download Full Size | PDF

BM3D blurred the low-contrast lesions marked by red circles and caused many artifacts in the RoIs marked by blue rectangles in Fig. 5(c) and Fig. 7(c). It can be seen that RED-CNN had good denoising performance from its enlarged RoIs marked by blue rectangles in Fig. 5(d) and Fig. 7(d), but RED-CNN produced over-smoothed results (the contrast reduction of the image) and lost some texture structure information (some vessels pointed out by the orange arrows in Fig. 5(d) and Fig. 7(d)) compared with the NDCT images. The reason is that MSE loss is focused on minimizing the pixel-level average loss and often causes over-smoothing problem. Another two methods using MSE loss are QAE and CT-GCN which retained more structural details by using new network architectures (QAE uses the quadratic convolution, CT-GCN uses the graph convolution network), but they still generated over-smoothed results. WGAN-VGG and CPCE-2D use VGG perceptual loss to ease this problem, but they are poor in suppressing noise as shown in Fig. 5(e), (f) and Fig. 7(e), (f). EDCNN proposed to use the compound loss which combines the MSE loss and perceptual loss to reach a balance of noise reduction and structure preservation, as shown in Fig. 5(g) and Fig. 7(g). The noise removal ability of EDCNN is better than that of WGAN-VGG and CPCE-2D, however, EDCNN still had shortcomings in detail preservation (vessels pointed out by the orange arrow in Fig. 7(g)). At the positions pointed by the green arrows in Fig. 5(g) and Fig. 7(g), the vessels look a little vague and some are hardly identifiable. It is clear that our proposed ERA-WGAT obtained the best performance in terms of both noise suppression and structure preservation as shown in Fig. 5(j) and Fig. 7(j). At the positions pointed by orange and green arrows, ERA-WGAT performed better than the other methods in both vessel preservation and vessel brightness maintenance. Moreover, from the enlarged RoI marked by yellow rectangle in Fig. 7, ERA-WGAT gave the best performance of structure preservation over the other methods. The use of the compound loss helped ERA-WGAT avoid over-smoothing problem. ERA-WGAT shows a strong detail and structural information retention ability due to the rich edge information provided by the edge enhancement module and the non-local information provided by the window-based graph attention convolutional network (WGAT). Table 3 summarizes the quantitative results from the afore-mentioned two images. ERA-WGAT achieved better performance in terms of most of the metrics than the other methods.

Table 3. Quantitative Results Associated with Different Algorithms for Figs. 5 and 7

View Table | View all tables in this article

To further show the merits of the porposed ERA-WGAT, we provided the absolute difference images relative to the NDCT image of Figs. 5 and 7 in Figs. 6 and 8, respectively. It is clear that ERA-WGAT yielded the smallest difference from the NDCT image.

Fig. 6. Absolute difference images relative to the NDCT image of Fig. 5.

Download Full Size | PDF

Fig. 7. Results from the abdominal image with a metastasis in the liver for comparison. The Region of Interests(RoIs) in the blue box and yellow box are selected and magnified for a clearer comparison.

Download Full Size | PDF

Fig. 8. Absolute difference images relative to the NDCT image of Fig. 7.

Download Full Size | PDF

4.3.2 Quantitative results

Table 4 summarizes the comparison results associated with different algorithms tested on the test set. Our proposed ERA-WGAT gave better performance in terms of most of the metrics than the other methods. It should be noted that since the loss functions used by the methods involved in the comparison are different, the results in Table 4 may not fairly show the performance of the methods. Therefore, in order to fairly compare the performance of these methods, we retrained them using the same loss functions (since WGA-VGG and CPCE-2D are GAN-based methods, we only involved their generators in the comparison). Table 5 and Table 6 summarizes the quantitative results associated with different algorithms retrained using MSE loss and the compound loss, respectively. It is exciting that ERA-WGAT achieved the best results on all the metrics.

Table 4. Quantitative Results (Mean$\pm$Sd) Associated with Different Algorithms for the Images in the Test Set

View Table | View all tables in this article

Table 5. Quantitative Results (Mean$\pm$Sd) Associated with Different Algorithms using Mean Squared Error Loss for the Images in the Test Set

View Table | View all tables in this article

Table 6. Quantitative Results (Mean$\pm$Sd) Associated with Different Algorithms using the Compound Loss for the Images in the Test Set

View Table | View all tables in this article

4.3.3 Ablation study of proposed method

This section investigates ERA-WGAT performance under different model structure configurations. We designed a residual autoencoder (RA), removing edge enhancement module and WGAT from the structure presented in Fig. 3, and then separately added edge enhancement module (ERA), WGAT with static attention module ($\mbox {RA-WGAT}^{\left ( sa \right )}$), and WGAT with dynamic attention module ($\mbox {RA-WGAT}^{\left ( da \right )}$). We also added WGAT with static ($\mbox {ERA-WGAT}^{\left ( sa \right )}$) and static and dynamic ($\mbox {ERA-WGAT}^{\left ( sa + da \right )}$) attention modules into ERA. All models were trained using the compound loss with the same training strategy and datasets as used previously. Table 7 summarizes the quantitative results for various models. The complete ERA-WGAT model ($\mbox {ERA-WGAT}^{\left ( sa + da \right )}$) achieved the best results on all the metrics, with each additional component providing significantly performance improvement. Figure 9 shows the compound loss value on the testing dataset during training under different model structure configurations. It is that the compound loss value increases continuously by adding edge enhancement module, WGAT with static and dynamic attention modules.

Fig. 9. Compound loss value on the testing dataset during training under different model structure configurations.

Download Full Size | PDF

Table 7. Quantitative Results (Mean$\pm$Sd) of Ablation Experiments on Differnet Modules in our method for the images in the test set. the Abbreviations E, Sa, Da in Table are for Edge Enhancement Module, Static Attention Module, and Dynamic Attention Module, respectively

View Table | View all tables in this article

Table 8 summarizes the quantitative results of ablation experiments on the proposed edge branch in our method for the image in the test set, it is clear that the proposed edge branch can improve the performance of the model on all metrics, which is due to the fact that the edge branch can deliver rich edge information to all parts of the encoder.

Table 8. Quantitative Results (Mean$\pm$Sd) of Ablation Experiments on Edge Branch in our Method for the Images in the Test Set

View Table | View all tables in this article

Table 9 summarizes the quantitative results of different number of WGAT layers in our method for the images in the test set, which shows no significant difference on RMSE, SSIM, and VGGP. The proposed ERA-WGAT achieved the best performance on PSNR, Params, and FLOPs when the number of WGAT layers is $2$. As the number of WGAT layers increases, the Params and FLOPs increases. The basic concept bebind WGAT is that every node aggregates information from its neighborhood, and node embedding contains increasing information from further reaches of the graph as the WGAT layers progress, i.e., every node embedding contains information about its $L$-hop neighborhood after $L$ WGAT layers. When the number of WGAT layer is 2, we can already obtain very rich non-local information. If we increase the number of WGAT layers, not only the Params and FLOPs will increase, but also the essence is to smooth the information of all nodes. So we set the number of WGAT layers to 2 for better performance.

Table 9. Quantitative Results (Mean$\pm$Sd) of Different Number of WGAT Layers in our Method for the Images in the Test Set

View Table | View all tables in this article

5. Conclusion

We proposed ERA-WGAT, a residual autoencoder incorporating edge enhancement module providing edge information and window-based graph attention convolutional network (WGAT) exploring non-local self-similarity. From the results of denoising performance and quantitative metrics, ERA-WGAT achieved superior performance compared with current state-of-the-art methods. The ablation experiments have proved the effectiveness of the proposed components, including edge enhancement, edge branch, and WGAT with static and dynamic attention. In the experimental results, we found that the FLOPs of the proposed WGAT is a little large, which was mainly caused by the graph construction part. As we mentioned earlier, the previous GCN methods use patches in training to try to solve the problem of large calculation amount, which led to inconsistent behavior between training and testing, because GCN does not have the translation invariance, which will cause boundary effects in the denoising results. The traditional GCN methods still have a huge amound of computation because this kind of methods treats each pixel as a node, and the number of nodes in the graph is larger than our proposed WGAT. Although we have solved the boundary effect problem and reduced the number of nodes in the graph, the FLOPs of our method is still larger than that of CNN methods. We do not shy away from this problem, but take it as one of the directions fro feature improvement. Future study will also extend the WGAT module to 3D to explore non-local information between adjacent slices. We will also consider extracting non-local information of different scales to further improve the denoising performance.

Funding

National Natural Science Foundation of China (61871277);Sichuan Province Science and Technology Support Program (2022JDJQ0045, 2021JDJQ0024, 2019YFH0193); Chengdu Science and Technology Program (2018YF0500069SN).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in [45].

References

1. R. Smith-Bindman, J. Lipson, R. Marcus, K.-P. Kim, M. Mahesh, R. Gould, A. B. De González, and D. L. Miglioretti, “Radiation dose associated with common computed tomography examinations and the associated lifetime attributable risk of cancer,” Arch. Intern. Med. 169(22), 2078–2086 (2009). [CrossRef]

2. A. B. De González, M. Mahesh, K.-P. Kim, M. Bhargavan, R. Lewis, F. Mettler, and C. Land, “Projected cancer risks from computed tomographic scans performed in the united states in 2007,” Arch. Intern. Med. 169(22), 2071–2077 (2009). [CrossRef]

3. T. Li, X. Li, J. Wang, J. Wen, H. Lu, J. Hsieh, and Z. Liang, “Nonlinear sinogram smoothing for low-dose x-ray ct,” IEEE Trans. Nucl. Sci. 51(5), 2505–2513 (2004). [CrossRef]

4. M. Balda, J. Hornegger, and B. Heismann, “Ray contribution masks for structure adaptive sinogram filtering,” IEEE Trans. Med. Imaging 31(6), 1228–1239 (2012). [CrossRef]

5. A. Manduca, L. Yu, J. D. Trzasko, N. Khaylova, J. M. Kofler, C. M. McCollough, and J. G. Fletcher, “Projection space denoising with bilateral filtering and ct noise modeling for dose reduction in ct,” Med. Phys. 36(11), 4911–4919 (2009). [CrossRef]

6. J. Wang, T. Li, H. Lu, and Z. Liang, “Penalized weighted least-squares approach to sinogram noise reduction and image reconstruction for low-dose x-ray computed tomography,” IEEE Trans. Med. Imaging 25(10), 1272–1283 (2006). [CrossRef]

7. E. Y. Sidky and X. Pan, “Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization,” Phys. Med. Biol. 53(17), 4777–4807 (2008). [CrossRef]

8. Y. Zhang, W.-H. Zhang, H. Chen, M.-L. Yang, T.-Y. Li, and J.-L. Zhou, “Few-view image reconstruction combining total variation and a high-order norm,” Int. J. Imaging Syst. Technol. 23, 249–255 (2013). [CrossRef]

9. Y. Zhang, W. Zhang, Y. Lei, and J. Zhou, “Few-view image reconstruction with fractional-order total variation,” J. Opt. Soc. Am. A 31(5), 981–995 (2014). [CrossRef]

10. Y. Zhang, Y. Wang, W. Zhang, F. Lin, Y. Pu, and J. Zhou, “Statistical iterative reconstruction using adaptive fractional order regularization,” Biomed. Opt. Express 7(3), 1015–1029 (2016). [CrossRef]

11. Y. Chen, D. Gao, C. Nie, L. Luo, W. Chen, X. Yin, and Y. Lin, “Bayesian statistical reconstruction for low-dose x-ray computed tomography using an adaptive-weighting nonlocal prior,” Comput. Med. Imaging Graph. 33(7), 495–500 (2009). [CrossRef]

12. J. Ma, H. Zhang, Y. Gao, J. Huang, Z. Liang, Q. Feng, and W. Chen, “Iterative image reconstruction for cerebral perfusion ct using a pre-contrast scan induced edge-preserving prior,” Phys. Med. Biol. 57(22), 7519–7542 (2012). [CrossRef]

13. Y. Zhang, Y. Xi, Q. Yang, W. Cong, J. Zhou, and G. Wang, “Spectral ct reconstruction with image sparsity and spectral mean,” IEEE Trans. Comput. Imaging 2(4), 510–523 (2016). [CrossRef]

14. Q. Xu, H. Yu, X. Mou, L. Zhang, J. Hsieh, and G. Wang, “Low-dose x-ray ct reconstruction via dictionary learning,” IEEE Trans. Med. Imaging 31(9), 1682–1697 (2012). [CrossRef]

15. Y. Zhang, X. Mou, G. Wang, and H. Yu, “Tensor-based dictionary learning for spectral ct reconstruction,” IEEE Trans. Med. Imaging 36(1), 142–154 (2017). [CrossRef]

16. M. Yan, J. Chen, L. A. Vese, J. Villasenor, A. Bui, and J. Cong, “Em+ tv based reconstruction for cone-beam ct with reduced radiation,” in International Symposium on Visual Computing, (Springer, 2011), pp. 1–10.

17. K. Hammernik, T. Würfl, T. Pock, and A. Maier, “A deep learning architecture for limited-angle computed tomography reconstruction,” in Bildverarbeitung für die Medizin 2017, (Springer, 2017), pp. 92–97.

18. J. Adler and O. Öktem, “Learned primal-dual reconstruction,” IEEE Trans. Med. Imaging 37(6), 1322–1332 (2018). [CrossRef]

19. Z. S. Kelm, D. Blezek, B. Bartholmai, and B. J. Erickson, “Optimizing non-local means for denoising low dose ct,” in 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, (IEEE, 2009), pp. 662–665.

20. J. Ma, J. Huang, Q. Feng, H. Zhang, H. Lu, Z. Liang, and W. Chen, “Low-dose computed tomography image restoration using previous normal-dose scan,” Med. Phys. 38(10), 5713–5731 (2011). [CrossRef]

21. Z. Li, L. Yu, J. D. Trzasko, D. S. Lake, D. J. Blezek, J. G. Fletcher, C. H. McCollough, and A. Manduca, “Adaptive nonlocal means filtering based on local noise level for ct denoising,” Med. Phys. 41(1), 011908 (2013). [CrossRef]

22. M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process. 54(11), 4311–4322 (2006). [CrossRef]

23. Y. Chen, X. Yin, L. Shi, H. Shu, L. Luo, J.-L. Coatrieux, and C. Toumoulin, “Improving abdomen tumor low-dose ct images using a fast dictionary learning based processing,” Phys. Med. Biol. 58(16), 5803–5820 (2013). [CrossRef]

24. P. F. Feruglio, C. Vinegoni, J. Gros, A. Sbarbati, and R. Weissleder, “Block matching 3d random noise filtering for absorption optical projection tomography,” Phys. Med. Biol. 55(18), 5401–5415 (2010). [CrossRef]

25. D. Kang, P. Slomka, R. Nakazato, J. Woo, D. S. Berman, C.-C. J. Kuo, and D. Dey, “Image denoising of low-radiation dose coronary ct angiography by an adaptive block-matching 3d algorithm,” in Medical Imaging 2013: Image Processing, vol. 8669 (International Society for Optics and Photonics, 2013), vol. 8669, p. 86692G.

26. K. Sheng, S. Gou, J. Wu, and S. X. Qi, “Denoised and texture enhanced mvct to improve soft tissue conspicuity,” Med. Phys. 41(10), 101916 (2014). [CrossRef]

27. A. M. Mendrik, E.-J. Vonken, A. Rutten, M. A. Viergever, and B. van Ginneken, “Noise reduction in computed tomography scans using 3-d anisotropic hybrid diffusion with continuous switch,” IEEE Trans. Med. Imaging 28(10), 1585–1594 (2009). [CrossRef]

28. H. Chen, Y. Zhang, M. K. Kalra, F. Lin, Y. Chen, P. Liao, J. Zhou, and G. Wang, “Low-dose ct with a residual encoder-decoder convolutional neural network,” IEEE Trans. Med. Imaging 36(12), 2524–2535 (2017). [CrossRef]

29. H. Chen, Y. Zhang, W. Zhang, P. Liao, K. Li, J. Zhou, and G. Wang, “Low-dose ct via convolutional neural network,” Biomed. Opt. Express 8(2), 679–694 (2017). [CrossRef]

30. J. M. Wolterink, T. Leiner, M. A. Viergever, and I. Išgum, “Generative adversarial networks for noise reduction in low-dose ct,” IEEE Trans. Med. Imaging 36(12), 2536–2545 (2017). [CrossRef]

31. D. Wu, K. Kim, G. E. Fakhri, and Q. Li, “A cascaded convolutional neural network for x-ray low-dose ct image denoising,” arXiv preprint arXiv:1705.04267 (2017).

32. Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, Y. Zhang, L. Sun, and G. Wang, “Low-dose ct image denoising using a generative adversarial network with wasserstein distance and perceptual loss,” IEEE Trans. Med. Imaging 37(6), 1348–1357 (2018). [CrossRef]

33. H. Shan, Y. Zhang, Q. Yang, U. Kruger, M. K. Kalra, L. Sun, W. Cong, and G. Wang, “3-d convolutional encoder-decoder network for low-dose ct via transfer learning from a 2-d trained network,” IEEE Trans. Med. Imaging 37(6), 1522–1534 (2018). [CrossRef]

34. F. Fan, H. Shan, M. K. Kalra, R. Singh, G. Qian, M. Getzin, Y. Teng, J. Hahn, and G. Wang, “Quadratic autoencoder (q-ae) for low-dose ct denoising,” IEEE Trans. Med. Imaging 39(6), 2035–2050 (2020). [CrossRef]

35. M. Li, W. Hsu, X. Xie, J. Cong, and W. Gao, “Sacnn: self-attention convolutional neural network for low-dose ct denoising with self-supervised perceptual loss network,” IEEE Trans. Med. Imaging 39(7), 2289–2301 (2020). [CrossRef]

36. T. Liang, Y. Jin, Y. Li, and T. Wang, “Edcnn: Edge enhancement-based densely connected network with compound loss for low-dose ct denoising,” in 2020 15th IEEE International Conference on Signal Processing (ICSP), vol. 1 (IEEE, 2020), vol. 1, pp. 193–198.

37. K. Chen, X. Pu, Y. Ren, H. Qiu, H. Li, and J. Sun, “Low-dose ct image blind denoising with graph convolutional networks,” in International Conference on Neural Information Processing, (Springer, 2020), pp. 423–435.

38. Y.-J. Chen, C.-Y. Tsai, X. Xu, Y. Shi, T.-Y. Ho, M. Huang, H. Yuan, and J. Zhuang, “Ct image denoising with encoder-decoder based graph convolutional networks,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), (IEEE, 2021), pp. 400–404.

39. A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill 1(10), e3 (2016). [CrossRef]

40. D. Valsesia, G. Fracastoro, and E. Magli, “Deep graph-convolutional image denoising,” IEEE Trans. on Image Process. 29, 8226–8237 (2020). [CrossRef]

41. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution, in European conference on computer vision,” (Springer, 2016), pp. 694–711.

42. D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415 (2016).

43. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903 (2017).

44. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 (2014).

45. C. H. McCollough, A. C. Bartley, R. E. Carter, B. Chen, T. A. Drees, P. Edwards, D. R. Holmes III, A. E. Huang, F. Khan, S. Leng, K. McMillan, G. Michalak, K. Nunez, L. Yu, and J. G. Fletcher, “Low-dose ct for the detection and classification of metastatic liver lesions: results of the 2016 low dose ct grand challenge,” Med. Phys. 44(10), e339–e352 (2017). [CrossRef]

46. I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101 (2017).

$α$	RMSE ↓	PSNR ↑	SSIM ↑	VGG-P ↓
$10^{- 1}$	0.0073 $\pm$ 0.0011	42.8483 $\pm$ 1.2752	0.9620 $\pm$ 0.0125	0.0342 $\pm$ 0.0150
$10^{- 2}$	0.0070 $\pm$ 0.0011	43.1778 $\pm$ 1.2578	0.9647 $\pm$ 0.0110	0.0330 $\pm$ 0.0157
$10^{- 3}$	0.0066 $\pm$ 0.0010	43.6752 $\pm$ 1.2480	0.9682 $\pm$ 0.0102	0.0335 $\pm$ 0.0160
$10^{- 4}$	0.0063 $\pm$ 0.0010	44.0886 $\pm$ 1.2419	0.9712 $\pm$ 0.0093	0.0619 $\pm$ 0.0300
$10^{- 5}$	0.0063 $\pm$ 0.0009	44.1520 $\pm$ 1.2352	0.9717 $\pm$ 0.0092	0.1201 $\pm$ 0.0604
0	0.0063 $\pm$ 0.0009	44.1549 $\pm$ 1.2377	0.9717 $\pm$ 0.0092	0.1346 $\pm$ 0.0701

	Fig. 5				Fig. 7
	RMSE ↓	PSNR ↑	SSIM ↑	VGG-P ↓	RMSE ↓	PSNR ↑	SSIM ↑	VGG-P ↓
LDCT	0.0143	36.8973	0.8624	0.6772	0.0171	35.3317	0.8296	0.5979
BM3D	0.0083	41.6485	0.9387	0.3975	0.0105	39.5418	0.9153	0.2810
RED-CNN	0.0074	42.6343	0.9584	0.2973	0.0092	40.7403	0.9418	0.2236
WGAN-VGG	0.0100	39.9989	0.9273	0.1495	0.0125	38.0954	0.9004	0.1628
CPCE-2D	0.0098	40.2085	0.9372	0.1079	0.0121	38.3184	0.9098	0.1455
EDCNN	0.0080	41.8843	0.9514	0.1039	0.0100	39.9913	0.9322	0.1046
QAE	0.0077	42.2746	0.9560	0.3094	0.0097	40.2299	0.9350	0.2291
CT-GCN	0.0076	42.3923	0.9567	0.3127	0.0095	40.4406	0.9384	0.2237
ERA-WGAT	0.0073	42.7222	0.9599	0.1029	0.0090	40.8842	0.9446	0.1377

	RMSE ↓	PSNR ↑	SSIM ↑	VGG-P ↓
LDCT	0.0117 $\pm$ 0.0022	38.8083 $\pm$ 1.5995	0.9054 $\pm$ 0.0311	0.3376 $\pm$ 0.1595
BM3D	0.0075 $\pm$ 0.0010	42.6105 $\pm$ 1.0930	0.9553 $\pm$ 0.0135	0.2409 $\pm$ 0.0735
RED-CNN	0.0064 $\pm$ 0.0009	43.9140 $\pm$ 1.1936	0.9702 $\pm$ 0.0096	0.1416 $\pm$ 0.0672
WGAN-VGG	0.0090 $\pm$ 0.0011	40.9629 $\pm$ 1.0069	0.9482 $\pm$ 0.0166	0.0750 $\pm$ 0.0342
CPCE-2D	0.0087 $\pm$ 0.0012	41.2704 $\pm$ 1.1064	0.9547 $\pm$ 0.0150	0.0621 $\pm$ 0.0309
EDCNN	0.0070 $\pm$ 0.0010	43.2371 $\pm$ 1.2320	0.9652 $\pm$ 0.0110	0.0590 $\pm$ 0.0232
QAE	0.0068 $\pm$ 0.0010	43.4866 $\pm$ 1.1837	0.9676 $\pm$ 0.0103	0.1641 $\pm$ 0.0628
CT-GCN	0.0067 $\pm$ 0.0009	43.5646 $\pm$ 1.1426	0.9682 $\pm$ 0.0098	0.1585 $\pm$ 0.0621
ERA-WGAT	0.0063 $\pm$ 0.0010	44.0886 $\pm$ 1.2419	0.9712 $\pm$ 0.0093	0.0619 $\pm$ 0.0300

	RMSE ↓	PSNR ↑	SSIM ↑	VGG-P ↓
LDCT	0.0117 $\pm$ 0.0022	38.8083 $\pm$ 1.5995	0.9054 $\pm$ 0.0311	0.3376 $\pm$ 0.1595
RED-CNN	0.0064 $\pm$ 0.0009	43.9140 $\pm$ 1.1936	0.9702 $\pm$ 0.0096	0.1416 $\pm$ 0.0672
WGAN-VGG	0.0068 $\pm$ 0.0010	43.4073 $\pm$ 1.1547	0.9675 $\pm$ 0.0101	0.1653 $\pm$ 0.0625
CPCE-2D	0.0068 $\pm$ 0.0010	43.3890 $\pm$ 1.1464	0.9674 $\pm$ 0.0100	0.1660 $\pm$ 0.0596
EDCNN	0.0066 $\pm$ 0.0009	43.6518 $\pm$ 1.1653	0.9685 $\pm$ 0.0099	0.1465 $\pm$ 0.0578
QAE	0.0068 $\pm$ 0.0010	43.4866 $\pm$ 1.1837	0.9676 $\pm$ 0.0103	0.1641 $\pm$ 0.0628
CT-GCN	0.0067 $\pm$ 0.0009	43.5646 $\pm$ 1.1426	0.9682 $\pm$ 0.0098	0.1585 $\pm$ 0.0621
ERA-WGAT	0.0063 $\pm$ 0.0009	44.1549 $\pm$ 1.2377	0.9717 $\pm$ 0.0092	0.1346 $\pm$ 0.0701

	RMSE ↓	PSNR ↑	SSIM ↑	VGG-P ↓
LDCT	0.0117 $\pm$ 0.0022	38.8083 $\pm$ 1.5995	0.9054 $\pm$ 0.0311	0.3376 $\pm$ 0.1595
RED-CNN	0.0064 $\pm$ 0.0009	43.8987 $\pm$ 1.2002	0.9701 $\pm$ 0.0094	0.0627 $\pm$ 0.0281
WGAN-VGG	0.0069 $\pm$ 0.0010	43.2901 $\pm$ 1.2066	0.9666 $\pm$ 0.0109	0.0805 $\pm$ 0.0320
CPCE-2D	0.0069 $\pm$ 0.0010	43.3111 $\pm$ 1.2079	0.9665 $\pm$ 0.0108	0.0838 $\pm$ 0.0307
EDCNN	0.0067 $\pm$ 0.0010	43.5123 $\pm$ 1.1913	0.9673 $\pm$ 0.0103	0.0686 $\pm$ 0.0267
QAE	0.0068 $\pm$ 0.0010	43.4221 $\pm$ 1.1803	0.9666 $\pm$ 0.0103	0.0777 $\pm$ 0.0289
CT-GCN	0.0068 $\pm$ 0.0010	43.4128 $\pm$ 1.2013	0.9668 $\pm$ 0.0106	0.0676 $\pm$ 0.0257
ERA-WGAT	0.0063 $\pm$ 0.0010	44.0886 $\pm$ 1.2419	0.9712 $\pm$ 0.0093	0.0619 $\pm$ 0.0300

Method	Network Architecture					Objective Function
	Conv	Deconv	Shortcut	Edge Enhancement	GCN	MSE	AL	PL
RED-CNN [28]	Conv2d	Deconv2d	Residual + Skip	–	–	✓	–	–
CNN [29]	Conv2d	–	–	–	–	✓	–	–
GAN-3D [30]	Conv3d	–	Skip	–	–	✓	–	–
Cascade-CNN [31]	Conv2d	–	Cascade	–	–	✓	–	–
WGAN-VGG [32]	Conv2d	–	–	–	–	–	✓	✓
CPCE-2D [33]	Conv2d	Deconv2d	Concatenation	–	–	–	✓	✓
CPCE-3D [33]	Conv3d	Deconv2d	Concatenation	–	–	–	✓	✓
QAE [34]	Q-Conv2d	Q-Deconv2d	Residual + Skip	–	–	✓	–	–
SACNN [35]	Conv3d	–	Skip	–	–	–	✓	✓
EDCNN [36]	Conv2d	–	Densely connection	Learnable (4 types)	–	✓	–	✓
CT-GCN [37]	Conv2d	–	Concatenation	Fixed (2 types)	Pixel-based	✓	–	–
Chen et al. [38]	Conv2d	Deconv2d	Skip	–	Pixel-based	✓	–	–
ERA-WGAT (Ours)	Conv2d		Residual + Concatenation	Learnable (8 types)	Window-based	✓	–	✓

	E	SA	DA	RMSE ↓	PSNR ↑	SSIM ↑	VGG-P ↓	Params ↓	FLOPs ↓
RA	–	–	–	0.0066 $\pm$ 0.0009	43.6962 $\pm$ 1.1380	0.9696 $\pm$ 0.0093	0.0801 $\pm$ 0.0307	7.26M	58.05G
ERA	✓	–	–	0.0064 $\pm$ 0.0010	44.0277 $\pm$ 1.2388	0.9710 $\pm$ 0.0094	0.0642 $\pm$ 0.0314	9.21M	81.01G
${RA-WGAT}^{(s a)}$	–	✓	–	0.0064 $\pm$ 0.0009	43.9249 $\pm$ 1.1933	0.9705 $\pm$ 0.0093	0.0719 $\pm$ 0.0304	10.80M	188.51G
${RA-WGAT}^{(d a)}$	–	–	✓	0.0065 $\pm$ 0.0009	43.8108 $\pm$ 1.1604	0.9700 $\pm$ 0.0092	0.0770 $\pm$ 0.0308	10.82M	189.27G
${ERA-WGAT}^{(s a)}$	✓	✓	–	0.0063 $\pm$ 0.0010	44.0774 $\pm$ 1.2386	0.9712 $\pm$ 0.0093	0.0622 $\pm$ 0.0300	12.75M	211.46G
${ERA-WGAT}^{(s a + d a)}$	✓	✓	✓	0.0063 $\pm$ 0.0010	44.0886 $\pm$ 1.2419	0.9712 $\pm$ 0.0093	0.0619 $\pm$ 0.0300	16.30M	240.62G

Edge Branch	RMSE ↓	PSNR ↑	SSIM ↑	VGG-P ↓
–	0.0064 $\pm$ 0.0009	44.0326 $\pm$ 1.2314	0.9710 $\pm$ 0.0094	0.0649 $\pm$ 0.0318
✓	0.0063 $\pm$ 0.0010	44.0886 $\pm$ 1.2419	0.9712 $\pm$ 0.0093	0.0619 $\pm$ 0.0300

Abstract

1. Introduction

2. Related work

2.1 Network architecture

2.1.1 Convolutional layers

2.1.2 Deconvolutional layers

2.1.3 Shortcut connection

2.1.4 Edge enhancement

2.1.5 GCN

2.2 Objective function

3. Method

3.1 Overall architecture

3.1.1 Edge enhancement module

3.1.2 Encoder

3.1.3 Bottleneck

3.1.4 Decoder

3.2 Window based graph attention convolutional network

3.2.1 Graph construction

3.2.2 Input layer

3.2.3 WGAT layer

3.2.4 Readout layer

3.3 Objective function

3.3.1 MSE loss

3.3.2 Multi-scale perceptual loss

3.3.3 Compound loss

4. Experimental design and results

4.1 Data sources

4.2 Parameter selection

4.3 Experimental results

4.3.1 Denoising performance comparison

4.3.2 Quantitative results

4.3.3 Ablation study of proposed method

5. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (9)

Tables (9)

Equations (11)

Biomedical Optics Express