
PCCA-Model: an attention module for medical image segmentation


Abstract

Convolutional neural networks have been increasingly employed in the field of medical image segmentation. Based on the idea that neurons in the human visual cortex differ in receptive field size and can sense the stimulus location, we propose the pyramid channel coordinate attention (PCCA) module, which fuses multiscale features in the channel direction, aggregates local and global channel information, combines them with the location information in the spatial direction, and can be integrated into existing semantic segmentation networks. We conducted extensive experiments on three datasets, namely LiTS, ISIC-2018, and CX, and obtained state-of-the-art results.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Deep learning (DL) is a powerful, data-driven, large-scale computing tool for pattern recognition that can handle tasks such as classification, detection, and segmentation [1]. In the field of DL, various types of neural networks have been employed to solve related problems; for image segmentation in particular, several neural network models have been proposed to address specific problems. In medical image segmentation, networks with an encoder–decoder structure [2] (e.g., U-Net [3] and V-Net [4]) have been proposed. In addition, researchers have attempted to improve the performance of segmentation networks by various means, such as applying improved optimizers [1,5,6], proposing generative adversarial schemes [1,7], increasing the network width and depth [8], and using attention mechanisms. Among these approaches, the development of attention mechanisms has attracted increasing research interest. In this paper, we propose a method that combines channel and spatial attention to achieve accurate location information and multiscale feature aggregation. The channel branch aggregates multiscale information and captures and integrates global and local cross-channel interactions, whereas the spatial branch captures accurate location information. By computing the two branches in parallel and combining them, the location information can be embedded into the channel information that aggregates local and global interactions, yielding an attention module that contains accurate location information and aggregates local and global channel information.

In recent years, numerous attention mechanisms focusing on the channel dimension have been proposed. For instance, squeeze-and-excitation (SE) networks (SENet) that target the channel information [9] divide the whole process into SE modules to model channel dependencies. Based on the SE module, Wang et al. proposed efficient channel attention (ECA) [10], which avoids feature dimensionality reduction and uses an adaptive convolution kernel to realize cross-channel interactions. Considering the limitation of the standard convolutional neural network (CNN) to a single receptive field, Li et al. referred to the idea that the local receptive field of the visual neuron depends on the stimulus location and proposed the selective kernel network (SKNet) [11], wherein the size of the receptive field can be adjusted adaptively according to the multiscale information of the input feature map. Attention along the spatial direction has also received considerable interest. In this regard, Hou et al. proposed the coordinate attention (CA) module [12], which focuses on the spatial direction, integrates the spatial coordinate information into the generated attention map, captures long-range dependencies along the horizontal and vertical directions, and accurately preserves the location information. Furthermore, attention mechanisms that use both spatial and channel dimensions have been introduced. For instance, the bottleneck attention module (BAM) and the convolutional block attention module (CBAM) perform computation independently in the channel and spatial directions and then combine the results in a parallel and serial manner, respectively, to yield an output feature map [1,13] with information enhancement in both spatial and channel directions. Similarly, the dual attention (DA) module [14] has been proposed, which focuses on global information and uses a self-attention mechanism to capture the feature dependencies of the spatial and channel dimensions. In the aforementioned modules, attention is an effective means of improving network performance. In channel attention modules, either a fully connected layer has been employed to capture all channel information, or a convolution kernel has been used for local cross-channel interactions. In parallel attention modules, the location information in the spatial direction is typically combined with all channel information but not with the local cross-channel interactions. However, in the human visual cortex, the neurons located in the same area have different receptive field sizes and can thus obtain spatial information of different sizes at the same processing stage [11]. Furthermore, the receptive fields vary with the stimulus location. Based on this idea of biological vision, in this paper, we propose a pyramid channel coordinate attention (PCCA) module that can aggregate global information. The proposed module captures the dependencies between local and global channels, preserves the accurate location information in the spatial direction, and embeds it into the local and global channel information. We constructed pyramid channel attention (PCA) by applying the pyramid concept in the channel dimension, using pyramid-like convolution kernels to capture the interactions among the feature map channels and obtain the local and global channel information. In addition, we used the spatial attention module CA [12] to aggregate the long-range dependencies in the spatial direction and retain accurate location information.
The CA module computes in parallel with the PCA module and plays an important role in embedding the location information into the local and global channel relationships. We performed numerous experiments on three medical image datasets to demonstrate the validity of combining the PCCA module and the neural network.

The main contributions of this study are listed as follows.

  • 1. For the channel dimension, we proposed the PCA module, which can model local and global channel information. In addition, we used the pyramid structure to aggregate multiscale information and employed convolution kernels of different sizes to control the coverage of cross-channel interactions. Furthermore, we obtained the local dependencies and long-range correlation between various channels through nonlinear mapping, thus enhancing feature aggregation in the channel direction.
  • 2. We proposed the PCCA attention module, which can embed location information into local and global channel information. The PCCA attention module is composed of the parallel computing PCA and CA modules, which can combine the aggregated local and global channel information with accurate location information. The PCA module in the channel direction fuses multiscale information, and the pyramid idea models multiple interactions in the channel dimension of the feature map. Therefore, the local and global features in the channel dimension can be obtained by considering the spatial location information and aggregating the channel information.

2. Related work

2.1 Image segmentation networks

Image segmentation is an important part of image processing and computer vision. In recent years, DL models have achieved great success in the field of vision [15]; in particular, with the emergence of numerous image segmentation methods using DL models, medical image segmentation has received increased attention. CNNs are among the most widely used network architectures in the field of DL [16]. In CNNs, all receptive fields in the same layer share weights; thus, the number of parameters is greatly reduced. In recent years, Google has proposed several image segmentation models based on DeepLab [17,18] and made unique contributions, such as the encoder–decoder structure, pyramid structure, skip connection, multiscale analysis, and atrous convolution. In 2015, Long et al., in one of the first works to use DL for semantic segmentation, proposed the fully convolutional network (FCN) [19]. The FCN can accommodate input images of any size and achieve classification and prediction of each pixel. However, its upsampling process does not consider the relationships between pixels and is relatively rough. To address this problem, image segmentation models based on the encoder–decoder structure have been proposed. For instance, Badrinarayanan et al. proposed a convolutional encoder–decoder architecture for image segmentation called SegNet [2], wherein the decoder upsamples the low-resolution encoder feature maps back to full resolution. Several models in this category have been developed for biomedical segmentation (e.g., U-Net and V-Net [3,4]); however, they are no longer limited to medical applications. The U-Net architecture consists of two parts: the contracting path captures the context information, and the symmetric expanding path realizes precise localization. Downsampling utilizes an architecture similar to the FCN [19] for feature extraction, and upsampling uses up-convolution, which reduces the number of feature maps and increases the spatial resolution. In addition, skip connections copy the feature maps of the downsampling part of the network to the upsampling part to combine the shallow features with the deep features. Finally, a 1 × 1 convolution is applied to the feature map to generate a segmentation map, and the output feature map is classified pixel by pixel [15]. Milletari et al. proposed V-Net [4], another famous FCN-based model, for three-dimensional (3D) medical image segmentation. For model training, a new objective function based on the Dice coefficient was introduced to handle the imbalance of voxels between foreground and background [15]. In addition, multiscale analysis is a valuable concept in image processing and has been applied to various neural network structures, one of the most prominent models being the feature pyramid network (FPN) [20], which is currently used in both target detection and segmentation. HFNet, proposed by Zhou et al. [40], also focuses on multiscale features; its multilevel atrous spatial pyramid pooling (MASPP) can capture multiscale features of different fields of view for feature refinement. Its idea of utilizing detailed multiscale information for decoding, along with a channel attention mechanism, is highly relevant to our work.
Although existing DL methods have made some achievements in medical image segmentation, owing to the inherent limitations of the convolution operation and the traditional encoder–decoder structure, most networks can neither extract both global and local channel information nor fuse multiscale context information, resulting in poor performance when segmenting complex regions.

2.2 Attention mechanisms

Extensive research has been conducted on attention mechanisms in the field of computer vision to improve the performance of deep CNNs. Attention mechanisms simulate the role of attention in human perception, where the fovea centralis of the human eye has a higher resolution than the surrounding areas [21]. By scanning the global information, the region of interest that requires special attention can be determined [22], and more resources can be invested in this area to obtain more detailed information related to the target while ignoring other irrelevant information. Therefore, the limited attention resources can be used to quickly screen valuable information from massive amounts of data. Attention mechanisms are being developed in multiple directions [23–27], and various attention modules, such as channel attention modules, spatial attention modules, and hybrid attention modules, have been proposed and applied to neural networks for performance improvement. The classical channel attention SE module [9] proposed by Hu et al. divides the whole process into squeeze and excitation operations, explicitly modeling the interdependencies between channels and adaptively recalibrating the characteristic response of channels. Furthermore, the dimensionality reduction operation in the SE module has been studied in depth. Wang et al. proposed efficient channel attention (ECA) [10], which aggregates global information without dimensionality reduction and captures local cross-channel interactions by using an adaptive convolution kernel. Subsequently, Li et al. proposed the SKNet [11], which can adaptively adjust the size of the receptive field according to the multiple scales of the input information. The SKNet can capture target objects of various scales, thus emulating the capability of neurons to adaptively adjust the size of the receptive field according to the input. In addition to making improvements in a single direction, researchers have aggregated information in both spatial and channel directions. For example, the channel attention SE module and a spatial attention module have been combined to establish the BAM and CBAM modules [1,13], which infer attention maps along two paths (channel and space) separately. The two-branch structure saves computational and parameter costs. New spatial attention mechanisms have also been proposed. For instance, Hou et al. analyzed the advantages and disadvantages of SE and CBAM and proposed CA [12], which aggregates features along two spatial directions to obtain accurate location information and long-range dependencies. Moreover, with the advent of the self-attention mechanism, the dual attention (DA) network based on self-attention was proposed to capture feature dependencies in the spatial and channel dimensions [14]. DA can perform feature representation by selectively aggregating similar features of objects and adaptively integrating similar features at any scale from a global perspective. For long-range dependencies, Wu et al. proposed dimensional interaction (DI), an effective self-attention mechanism for feature processing [28], which captures dependencies of various dimensions through cross-dimension interactions and models long-range dependencies. However, the self-attention mechanism is computationally expensive.
Although the studies mentioned above have improved network performance to some extent, single-path attention modules suffer from limitations: a channel attention module aggregates either global or local channel information and lacks spatial direction information, whereas attention modules that consider both the spatial and channel dimensions do not take location information into account when fusing the two types of information. Therefore, these modules are not well suited to the segmentation of medical images.

3. Methods

The network structure of the proposed PCCA-Net is illustrated in Fig. 1. In PCCA-Net, the PCCA module is added to the bottleneck layer of U-Net; this enables the network to obtain more features from the encoding structure. The introduced pyramid structure realizes the extraction of multiscale features and the fusion of local and global information. Introducing spatial information focuses the network on location information, enabling it to obtain long-range dependencies and highlight the region of interest.

Fig. 1. Structure of PCCA-Net.
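As a rough illustration of this placement (a minimal sketch under our reading of Fig. 1; the class and argument names are hypothetical, not the authors' code), the PCCA block simply wraps the output of the U-Net bottleneck:

```python
import torch.nn as nn

class BottleneckWithAttention(nn.Module):
    """Hypothetical wrapper: applies an attention block (e.g., PCCA)
    to the feature map produced by the deepest U-Net layer."""
    def __init__(self, bottleneck: nn.Module, attention: nn.Module):
        super().__init__()
        self.bottleneck = bottleneck  # existing U-Net bottleneck convs
        self.attention = attention    # PCCA module from Section 3.1

    def forward(self, x):
        # Encoder features pass through the bottleneck, are refined by
        # the attention block, and then flow on to the decoder.
        return self.attention(self.bottleneck(x))
```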

3.1 PCCA module

The PCCA module aggregates attention in the channel and spatial directions through parallel computing. The computing process is illustrated in Fig. 2. For the channel direction, based on the idea that neurons in the same area of the human visual cortex have receptive fields of different sizes, we propose the PCA module based on the SE [9] and ECA [10] modules. The pyramid structure is used to aggregate information from different receptive fields and capture local and global channel information for feature aggregation, thereby enhancing or suppressing channel information as required. Because the receptive field of human visual cortex neurons changes with the stimulus location, the CA module [12] is used for the spatial direction: it performs one-dimensional (1D) feature encoding along both the horizontal and vertical directions, captures location information and long-range dependencies, and combines accurate location information with the channel information. Finally, we propose the PCCA module, which combines multiscale information, focuses on local and global channel features, and embeds the spatial location information into the channel dimension, similar to the stimulus processing of the visual nerve. Thus, feature enhancement is achieved, improving the neural network's ability to extract advanced semantic information from deep feature maps.

Fig. 2. Visual cortex neuron and PCCA module architecture.

The calculation process is as follows. For a given feature map $X \in {R^{C \times H \times W}}$, a 3D attention map $M(X )\in {R^{C \times H \times W}}$ is obtained using the proposed PCCA module. A residual connection is formed between the attention-weighted feature map $X \otimes M(X )$ and the input feature map so that the error of the bottom layer can be transferred to the upper layer, thereby alleviating the vanishing-gradient problem.

$$\begin{array}{{c}} {F = X\; + \; X \otimes M(X ),} \end{array}$$
where ${\otimes} $ represents element-wise multiplication. In this study, the classical residual learning scheme and attention mechanism were adopted, and a powerful module was designed to fuse the information aggregated by two independent branches of channel and space. The two branches correspond to the channel attention ${M_C}(X )\in {R^{C \times H \times W}}$ and the spatial attention ${M_S}(X )\in {R^{C \times H \times W}}$, respectively. Next, the attention map $M(X )$ is calculated as follows:
$$\begin{array}{{c}} {M(X )= \sigma ({{M_S}(X )+ \; {M_C}(X )} ),} \end{array}$$
where σ is the sigmoid function. Both branch outputs are resized to ${R^{C \times H \times W}}$ before summation.
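The fusion step of Eqs. (1) and (2) can be sketched as follows (a minimal PyTorch illustration, not the authors' implementation; the two branch modules are passed in as arguments and are assumed to return maps of shape (B, C, H, W)):

```python
import torch
import torch.nn as nn

class PCCAFusion(nn.Module):
    """Sketch of Eqs. (1)-(2): sum the two branch outputs, gate them
    with a sigmoid, and apply the result with a residual connection."""
    def __init__(self, channel_branch: nn.Module, spatial_branch: nn.Module):
        super().__init__()
        self.channel_branch = channel_branch   # M_C(X), Section 3.2
        self.spatial_branch = spatial_branch   # M_S(X), Section 3.3

    def forward(self, x):
        # M(X) = sigmoid(M_S(X) + M_C(X)); F = X + X * M(X)
        m = torch.sigmoid(self.spatial_branch(x) + self.channel_branch(x))
        return x + x * m
```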

The detailed calculation processes of channel and space in the PCCA module are described in Sections 3.2 and 3.3.

3.2 Channel attention dimension PCA

We designed the channel branch PCA by referring to ECA [10], pyramidal convolution (PyConv) [29], and the residual building block (RBB) [30], as shown in Fig. 3. The original ECA [10] uses an adaptive convolution kernel to capture local cross-channel interactions, with the kernel size determined adaptively by a nonlinear mapping of the channel dimension. However, capturing only local channel information fails to account for the influence of long-range cross-channel interactions.

Fig. 3. Channel dimension PCA.

The designed PCA combines the advantages of SE [9] and ECA [10], and PyConv aids in capturing the local and global cross-channel interactions. Because receptive fields of different sizes have different effects on targets of different scales, aggregating multiscale information helps achieve enhanced feature aggregation, and similar features are combined after multiple iterations. For a given feature map $X \in {R^{C \times H \times W}}$, global average pooling is conducted, and globally smooth features are obtained by averaging the spatial information to achieve increased robustness to spatial changes in the input:

$$\begin{array}{{c}} {{M_{avg}}(X )= \frac{1}{{H \times W}}\mathop \sum \limits_{i = 1}^H \mathop \sum \limits_{j = 1}^W X({i,j} ),\quad i = 1, \ldots ,H,\;j = 1, \ldots ,W,} \end{array}$$
where ${M_{avg}}$ represents global average pooling. By averaging each channel over its entire spatial extent, the feature map ${M_{avg}} \in {R^{C \times 1 \times 1}}$ is obtained.

Next, ${M_{avg}}(X )$ is fed into the PyConv structure to aggregate receptive fields of various sizes from top to bottom. The convolution kernel sizes are 1, 3, and 5. A kernel-size-1 convolution extracts local detailed features, such as the details of tissue structure. A kernel-size-3 convolution falls between the other two: it helps extract larger-scale features while preserving detail and plays a balancing, transitional role. A kernel-size-5 convolution has a larger receptive field, which can capture large objects and global semantic information; it is thus better able to extract linear features and longer edge lines from the image, while remaining more computationally efficient than larger kernels and reducing the risk of overfitting. Furthermore, the idea of avoiding dimensionality reduction in the channel dimension is employed so that the channels directly correspond to the weights. Summing the results of the three branches yields the aggregated channel direction information:

$$\begin{array}{{c}} {{M_C}^{\prime}(X )= \; Conv1({{M_{avg}}(X )} )+ \; Conv3({{M_{avg}}(X )} )+ \; Conv5({{M_{avg}}(X )} ),} \end{array}$$
where Conv1, Conv3, and Conv5 represent two-dimensional (2D) convolutions with kernel sizes of 1, 3, and 5, respectively. Summing them yields the feature mapping ${M_C}^{\prime}(X )\in {R^{C \times 1 \times 1}}$.

Finally, ${M_C}^{\prime}(X )$, which aggregates the channel information, is multiplied with the original feature map and restored to the same size as the input to obtain the feature map ${M_C}(X )\in {R^{C \times H \times W}}$, thereby realizing feature aggregation in the channel dimension and facilitating subsequent operations.

$$\begin{array}{{c}} {{M_C}(X )= \; {M_C}^{\prime}(X )\otimes X} \end{array}$$
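A minimal sketch of Eqs. (3)-(5) is given below. The text describes 2D convolutions applied to the pooled C × 1 × 1 map; for a workable sketch we follow an ECA-style reading in which kernels of size 1, 3, and 5 slide across the channel dimension of the pooled vector. This reading, and all layer hyperparameters, are our assumptions:

```python
import torch.nn as nn

class PCA(nn.Module):
    """Sketch of the PCA branch: global average pooling (Eq. (3)),
    a pyramid of cross-channel convolutions with kernels 1/3/5
    (Eq. (4)), and rescaling of the input features (Eq. (5))."""
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # Eq. (3)
        # No channel dimensionality reduction: weights map 1:1 to channels.
        self.conv1 = nn.Conv1d(1, 1, kernel_size=1, bias=False)
        self.conv3 = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)
        self.conv5 = nn.Conv1d(1, 1, kernel_size=5, padding=2, bias=False)

    def forward(self, x):                                  # x: (B, C, H, W)
        y = self.pool(x).squeeze(-1).transpose(1, 2)       # (B, 1, C)
        w = self.conv1(y) + self.conv3(y) + self.conv5(y)  # Eq. (4)
        w = w.transpose(1, 2).unsqueeze(-1)                # (B, C, 1, 1)
        return x * w                                       # Eq. (5)
```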

Compared with ordinary convolution [29], the proposed PCA module offers the following advantages, as shown in Fig. 4. (1) Multiscale processing: The receptive field can be expanded without increasing the cost, and the parallel convolution kernels of different sizes have varying channel resolutions and depths. As the kernel size increases, its depth decreases; small convolution kernels thus focus on details, whereas large convolution kernels focus on large objects or context information. (2) Efficiency: Similar model parameters and computational requirements are maintained. The PCA module can perform parallel computing with different convolution kernels, and atrous convolution can be used to construct more complex networks, thereby realizing parallel execution and merged output of various computing units. (3) Flexibility: The number of pyramid layers and the size of the PyConv kernel at each layer can be flexibly selected to serve various visual tasks.

3.3 Spatial dimension attention CA

The CA mechanism is adopted for the spatial dimension [12], as shown in Fig. 5, to accurately highlight the region of interest with accurate location information and long-range correlation. First, global average pooling is factorized into a pair of one-dimensional (1D) feature encoding operations.

Fig. 4. (a) Standard convolution; (b) Proposed PyConv.

Fig. 5. Spatial dimension CA.

For a given input ${x_S}$, average pooling is first performed using pooling kernels of size (H, 1) and (1, W) along the two spatial ranges in the horizontal and vertical directions, respectively; thus, the output of the c-th channel at height h can be expressed as

$$\begin{array}{{c}} {z_c^h(h )= \; \frac{1}{W}\mathop \sum \limits_{0 \le i < W} {x_S}({h,i} ).} \end{array}$$

Similarly, the output of the c-th channel at width w can be expressed as

$$\begin{array}{{c}} {z_c^w(w )= \frac{1}{H}\mathop \sum \limits_{0 \le j < H} {x_S}({j,w} ).} \end{array}$$

These two transformations aggregate features along two spatial directions; thus, a pair of directional perception feature mapping is obtained. In the direction-aware map generated along the h direction, long-distance dependency relationships are captured in the h direction, and position information is retained in the w direction. Conversely, in the feature map generated along the w direction, long-distance dependency relationships are captured in the w direction, and position information is retained in the h direction. In other words, both direction-aware maps capture long-distance dependency relationships along the pooling direction and retain location information along the other.

Next, $z_c^h(h )$ and $z_c^w(w )$ are concatenated for better feature aggregation, and the result is processed using a 1 × 1 convolution transformation function, followed by batch normalization and nonlinear processing. Batch normalization prevents changes in the distribution of deep input data caused by the amplification of small changes during transmission. For the nonlinear processing, the Swish activation function is used to prevent the loss of numerical accuracy caused by the sigmoid activation function, enabling the intermediate feature map f to aggregate global information:

$$\begin{array}{{c}} {z\; = \; Conv1[{Concat({{z^h},{z^w}} )} ],} \end{array}$$
$$\begin{array}{{c}} {f\; = \; Swish\; [{BN(z )} ],} \end{array}$$
where Concat represents the operation of concatenation along the spatial dimension, Conv1 represents 2D convolution with a kernel size of 1, BN represents the batch normalization process, and Swish represents the nonlinear activation function. This yields $f \in {R^{\frac{C}{r} \times ({H + W} )}}$, where r is the reduction ratio controlling the number of channels in the feature map.

Next, f is divided into two independent tensors along the spatial dimension: ${f^h} \in {R^{\frac{C}{r} \times H}}$ and ${f^w} \in {R^{\frac{C}{r} \times W}}$. Two 1 × 1 convolution transformations are then employed to transform ${f^h}$ and ${f^w}$ along the two directions into tensors with the same number of channels as the input feature map X, obtaining

$$\begin{array}{{c}} {{\textrm{g}^h} = \delta [{Conv1({{f^h}} )} ]} \end{array}$$
$$\begin{array}{{c}} {{\textrm{g}^w} = \delta [{Conv1({{f^w}} )} ]} \end{array}$$
where δ is the sigmoid activation function. Because the main concern here is the spatial information, a proper dimensionality reduction in the channel dimension (by the ratio r) reduces the model complexity.

Finally, the outputs ${g^h}$ and ${g^w}$ are expanded separately and used as attention weights, and the spatial dimension attention CA output ${\textrm{y}_S}$ containing the global information is obtained.

$$\begin{array}{{c}} {{\textrm{y}_S}({\textrm{i},\textrm{j}} )= {x_s}({i\; ,\; j} )\times g_c^h(i )\times g_c^w(j ).} \end{array}$$
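The whole spatial branch of Eqs. (6)-(12) can be sketched as follows (essentially the published CA design [12]; the reduction ratio r = 16 and the minimum reduced width are our assumptions):

```python
import torch
import torch.nn as nn

class CA(nn.Module):
    """Sketch of coordinate attention: directional pooling (Eqs. (6)-(7)),
    shared transform with BN and Swish (Eqs. (8)-(9)), per-direction
    gates (Eqs. (10)-(11)), and reweighting (Eq. (12))."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        mid = max(8, channels // r)                 # reduced channel width
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # Eq. (8)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.SiLU()                        # Swish activation, Eq. (9)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # Eq. (10)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # Eq. (11)

    def forward(self, x):                           # x: (B, C, H, W)
        _, _, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)           # Eq. (6): (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True)           # Eq. (7): (B, C, 1, W)
        z = torch.cat([z_h, z_w.transpose(2, 3)], dim=2)  # (B, C, H+W, 1)
        f = self.act(self.bn(self.conv1(z)))        # Eqs. (8)-(9)
        f_h, f_w = torch.split(f, [h, w], dim=2)    # split along spatial dim
        g_h = torch.sigmoid(self.conv_h(f_h))                  # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.transpose(2, 3)))  # (B, C, 1, W)
        return x * g_h * g_w                        # Eq. (12)
```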

4. Experiments and results

4.1 Datasets and settings

To verify the accuracy and effectiveness of the proposed PCCA module, we performed ablation and comparative experiments on three public datasets, namely liver segmentation, skin lesion segmentation, and chest X-ray datasets, provided by the Liver Tumor Segmentation Challenge (LiTS), the International Skin Imaging Collaboration (ISIC), and ChestX-ray14 (CX) released by Wang et al. (2017), respectively. The total number of images and the experimental division of the three datasets were as follows:

  • (1) LiTS: There are 19,160 liver images in the Medical Image Computing and Computer-assisted Intervention 2017 LiTS dataset. Of the 19,160 images, 10,477 images were used as the training set, 4278 images as the validation set, and the remaining 4405 images as the test set.
  • (2) ISIC-2018: This dataset contains 2694 RGB images of skin lesions. The images were randomly divided into the training set (2494 images), validation set (100 images), and test set (100 images).
  • (3) CX: This dataset lacks segmentation labels; thus, labeling was performed first for the selected 565 chest X-ray images. Next, these labeled images were randomly divided into 450 images as the training set, 66 images as the validation set, and the remaining 50 images as the test set.

All images were reshaped to a size of 512 × 512 before being input into the network. All models were run on a workstation with an NVIDIA GeForce RTX 3090 graphics card, and a Python DL framework was used for training. The Adam optimizer was used to optimize the model; the initial learning rate was set to 0.0001, the momentum parameter was set to 0.9, and the cross-entropy loss function was employed as the loss function. The training batch size was set to 8, and the maximum number of training epochs was set to 50, 200, and 50 for the LiTS, ISIC-2018, and CX datasets, respectively, ensuring that data were collected and sorted after model convergence.
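For reference, the stated settings translate roughly to the following configuration (a hedged sketch; mapping the reported momentum of 0.9 to Adam's first moment coefficient beta1 is our assumption, and `model` is any segmentation network):

```python
import torch
import torch.nn as nn

def build_training_setup(model: nn.Module):
    # Adam with initial learning rate 1e-4; momentum 0.9 read as beta1.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    criterion = nn.BCELoss()        # cross-entropy loss for binary masks
    return optimizer, criterion

BATCH_SIZE = 8
MAX_EPOCHS = {"LiTS": 50, "ISIC-2018": 200, "CX": 50}
INPUT_SIZE = (512, 512)             # all images reshaped before input
```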

4.2 Experimental details

Five commonly used evaluation indicators, namely the Dice similarity coefficient (DICE), intersection over union (IOU), sensitivity (SEN), positive predictive value (PPV), and relative volume difference (RVD), were used to comprehensively evaluate the segmentation performance of the proposed model. These indicators can be mathematically expressed as follows:

$$\begin{array}{{c}} {DICE = \frac{{2TP}}{{2TP + FP + FN}}} \end{array}$$
$$\begin{array}{{c}} {IOU = \frac{{TP}}{{TP + FP + FN}}} \end{array}$$
$$\begin{array}{{c}} {SEN = \frac{{TP}}{{TP + FN}}} \end{array}$$
$$\begin{array}{{c}} {PPV = \frac{{TP}}{{FP + TP}}} \end{array}$$
$$\begin{array}{{c}} {RVD = \frac{{|A |- |B |}}{{|B |}}} \end{array}$$
where TP (true positive) and TN (true negative) represent the number of pixels in the correctly segmented target area and background area, respectively; FP (false positive) indicates the number of background pixels incorrectly labeled as target, whereas FN (false negative) denotes the number of target pixels incorrectly predicted as background. A and B represent the target area in the predicted segmentation map and the true label, respectively. DICE is a function characterizing set similarity and is typically used to evaluate the similarity between the true label (ground truth) and the segmentation result in medical image segmentation; it has a value range of [0, 1], with a higher value corresponding to a better segmentation result. IOU is the ratio between the intersection and union of the prediction result of a certain category and the true label (ground truth). SEN measures the proportion of ground-truth positive voxels that are correctly segmented, that is, the ability to capture the region of interest in the segmentation experiment. PPV is used to evaluate the overlap between the predicted segmentation region and the true label. RVD is used to measure the relative volume error.
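For concreteness, Eqs. (13)-(17) can be computed from binary masks as follows (a minimal NumPy sketch; thresholding of network outputs and per-case averaging are omitted):

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Eqs. (13)-(17) for binary masks pred (A) and gt (B) in {0, 1}."""
    tp = np.sum((pred == 1) & (gt == 1))   # correctly segmented pixels
    fp = np.sum((pred == 1) & (gt == 0))   # background labeled as target
    fn = np.sum((pred == 0) & (gt == 1))   # target predicted as background
    eps = 1e-8                             # guard against empty masks
    return {
        "DICE": 2 * tp / (2 * tp + fp + fn + eps),
        "IOU": tp / (tp + fp + fn + eps),
        "SEN": tp / (tp + fn + eps),       # sensitivity (recall)
        "PPV": tp / (tp + fp + eps),
        "RVD": (pred.sum() - gt.sum()) / (gt.sum() + eps),
    }
```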

In the training process of deep learning, the gradient descent algorithm adjusts the model parameters to minimize the loss function. In the process of using gradient descent to find the optimal solution, we use the binary cross-entropy (BCE) loss, which has good mathematical properties. The BCE loss, a special case of the cross-entropy loss function, is commonly used in binary classification problems. It is used to calculate the loss between the predicted probability and the true label. Specifically, the BCE loss is computed as follows: for each sample, the negative log-likelihood of the predicted probability with respect to the true label is calculated, and the average over all samples gives the final loss value. The formula is as follows:

$$\begin{array}{{c}} {Loss ={-} ({y\cdot\log ({\hat{y}} )+ ({1 - y} )\log ({1 - \hat{y}} )} )} \end{array}$$
where $\hat{y}$ is the probability that the model predicts the sample to be a positive instance, and y is the sample label: if the sample is a positive instance, the value is 1; otherwise, the value is 0. The BCE loss measures the difference between the predicted and true probability distributions, thereby reflecting the degree of error in the model.
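A tiny numerical check of Eq. (18) (an illustrative example, not taken from the paper):

```python
import torch
import torch.nn as nn

y_hat = torch.tensor([0.9, 0.2, 0.7])   # predicted probabilities
y = torch.tensor([1.0, 0.0, 1.0])       # ground-truth labels
loss = nn.BCELoss()(y_hat, y)           # averages per-sample BCE
manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()
assert torch.allclose(loss, manual)     # matches Eq. (18), averaged
```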

4.3 Ablation study for the PCCA module

To demonstrate the effectiveness of the proposed PCCA module, ablation experiments were performed under the same experimental environment to compare the network performance after the addition of different modules. First, U-Net was employed as the benchmark to evaluate the network performance. As shown in Fig. 6, the SE, ECA, CA, PCA, and PCCA modules were added to the bottleneck layer of U-Net for training under the same experimental environment.

Fig. 6. Addition of attention modules (the red module signifies the position where the attention layer is added).

Ablation experiments were conducted on the LiTS, ISIC, and CX datasets. The experimental settings are shown in Table 1. First, the benchmark network U-Net was configured. Next, the original channel attention module SE was added to aggregate the global channel information. Then, the channel attention module ECA, which uses the adaptive convolution kernel, was adopted to aggregate local channel relationships. Next, the spatial attention module CA was added to aggregate the location information in the spatial direction and retain the location and long-range correlation information. Subsequently, the channel attention module PCA (with the pyramid structure) was added to fuse multiscale information and include the relationship between local and global channels. Finally, PCCA, which combines the spatial attention module CA and the channel attention module PCA, was added to embed location information into the local and global channel information.

The experimental results are presented in Table 2. SE, ECA, and PCA focused on different channel information, whereas CA integrated the location information. Therefore, focusing only on the channel or spatial information did not yield the optimal effect. In contrast, the PCCA module achieved good segmentation results and exhibited superior results in terms of the three evaluation indicators. Therefore, it can be concluded that the PCCA module improves the performance of model segmentation by aggregating channel and spatial information, fusing multiscale features, and retaining accurate location information.

Table 1. Ablation Study

Table 2. Results of the Ablation Experiment

4.4 Comparison between PCCA-Net and state-of-the-art network models on the LiTS dataset

The proposed PCCA-Net network model was compared with state-of-the-art network models to demonstrate its superiority in medical image segmentation. To ensure a fair comparison, all network models were run in the same experimental environment. The network models used for comparison included the classical and latest U-Net, DeepLabv3_plus, CENet, ExFuse, r2UNet, and UNeXt. The comparison results are presented in Table 3.

Table 3. Statistical Comparison between the Proposed and State-of-the-Art Network Models on the LiTS Dataset

As can be seen from the experimental results presented in Table 3, the proposed network model was superior in terms of most indicators when segmenting liver regions on the LiTS dataset, achieving DICE, IOU, SEN, RVD, and PPV values of 92.35%, 88.01%, 94.03%, 4.59%, and 94.46%, respectively. Compared with the original U-Net, the DICE, IOU, and SEN values of PCCA-Net were higher by 4.84%, 5.68%, and 8.75%, respectively. ExFuse embeds spatial information into advanced features, compensating for the lack of spatial information in advanced features [33]. UNeXt uses the shift operation to focus on local dependencies between channels [35]. As such, ExFuse and UNeXt achieved excellent performance. The performance of the proposed network model was better than that of ExFuse and UNeXt, thus proving the superiority of PCCA-Net in focusing on local and global channel information and using location information to compensate for the lack of advanced features.

The segmentation results obtained using the aforementioned network models for three liver images are shown in Fig. 7. U-Net was prone to undersegmentation and oversegmentation when segmenting lesion areas, whereas ExFuse and UNeXt [33,35] greatly alleviated these problems when segmenting edges and small target areas. This demonstrates that spatial information and local channel information are crucial for segmentation tasks. Furthermore, the proposed PCCA-Net achieved smoother and more accurate segmentation of liver region edges, and the overall segmentation results were more precise, which demonstrates that the fusion of local and global channel information with spatial location information plays a key role in improving segmentation performance.

Fig. 7. Segmentation results obtained using different attention mechanisms on the LiTS dataset. From left to right on each row, the true label and the segmentation results of different networks are presented; red-colored areas represent the wrongly segmented areas.

4.5 Comparison between PCCA-Net and state-of-the-art network models on the ISIC dataset

On the ISIC-2018 dataset, PCCA-Net was compared with state-of-the-art semantic segmentation networks, namely FCN, DeepLabv3_plus, U-Net_nest, CENet, r2UNet, and ScaleFormer, in the same experimental environment. The experimental results obtained using these network models on the ISIC-2018 dataset are presented in Table 4.

Table 4. Statistical Comparison between the Proposed and State-of-the-Art Network Models on the ISIC Dataset

As can be seen from Table 4, compared with the original U-Net, the DICE, IOU, SEN, RVD, and PPV values of PCCA-Net improved by 4.95%, 6.12%, 3.12%, 12.75%, and 4.82%, respectively. The proposed network model achieved the best results in terms of most indicators, yielding DICE, IOU, SEN, RVD, and PPV values of 81.82%, 73.22%, 83.89%, 11.47%, and 87.05%, respectively. In addition to the proposed network model, DeepLab and ScaleFormer exhibited excellent performance. DeepLab uses the feature pyramid and encoder–decoder structure to integrate multiscale and high-dimensional information [31], which proves the effectiveness of using the pyramid structure to obtain multiscale information in the channel direction. ScaleFormer extracts local and global information within each scale and highlights cross-scale dependencies to handle complex scale changes [38]; this approach is also consistent with the aggregation of local and global information employed in the proposed network model.

The segmentation results obtained using the aforementioned network models for three skin lesions are shown in Fig. 8. Lacking the combination of local and global information, U-Net yielded false segmentation in the output image when the contrast between the lesion area and normal skin tissue was low. DeepLabv3_plus, which integrates multiscale information, and ScaleFormer, which extracts local and global information within the scale, considerably reduced the segmentation error rate [31,38]. Compared with DeepLabv3_plus and ScaleFormer, PCCA-Net was more robust when segmenting the lesion regions.

Fig. 8. Segmentation results obtained using different attention mechanisms on the ISIC dataset. From left to right on each row, the true label and the segmentation results of different networks are presented; the red-colored areas represent the wrongly segmented areas.

4.6 Comparison between PCCA-Net and state-of-the-art network models on the CX dataset

The segmentation performance of the proposed network model was compared with that of existing state-of-the-art networks on the CX dataset. The experimental results are presented in Table 5.

Table 5. Statistical Comparison between the Proposed and State-of-the-Art Network Models on the CX Dataset

As can be seen from Table 5, the proposed network model exhibited superior performance over the other networks, again demonstrating that the idea of aggregating local and global information in the channel dimension, fusing multiscale features, and embedding spatial location information into the channel dimension is of great significance for medical image segmentation on small datasets. PCCA-Net yielded DICE, IOU, SEN, RVD, and PPV values of 96.75%, 93.75%, 98.13%, 2.87%, and 95.48%, respectively, which were vastly superior to those of the original U-Net. The segmentation results obtained for three X-ray images are shown in Fig. 9. PCCA-Net exhibited better segmentation performance than most other segmentation networks. When the color of the segmented area was similar to that of the background or when artifacts were present, PCCA-Net achieved good segmentation results. Thus, it can be concluded that the proposed network model can yield satisfactory performance on small datasets.

Fig. 9. Segmentation results obtained using different attention mechanisms on the CX dataset. From left to right on each row, the true label and the segmentation results of different networks are presented; the red-colored areas represent the wrongly segmented areas.

5. Discussion

Considering the challenges encountered in medical image segmentation and drawing inspiration from the variations in the receptive field of human visual neurons, we proposed an attention module named PCCA in this paper. This module aggregates multiscale information, combines local and global channel information, and embeds spatial location information to improve medical image segmentation. For the channel direction, as can be seen from the ablation results presented in Table 2, U-Net with the SE or ECA module achieved better results; the SE module focuses on the global channel information [9], whereas the ECA module focuses on the local channel information [10] by using an adaptive convolution kernel, demonstrating that considering the channel information can yield performance improvement. To better combine the advantages of the SE and ECA modules, and adopting the principle that the receptive fields of human visual cortex neurons have different sizes, the PCA module was designed by using PyConv to combine the local and global information in the channel direction and obtain multiscale information. Furthermore, because the response of the human visual cortex neuron depends on the stimulus location, we adopted the CA module and conducted 1D feature encoding along the horizontal and vertical directions to capture the long-range dependencies along the spatial direction and retain accurate location information. Finally, we proposed the PCCA module by combining the CA and PCA modules to embed the aggregated location information into the local and global channel information and realize multipath feature processing, thus making the segmentation process consistent with the signal processing of the human visual nerve.

We conducted experiments on three datasets: LiTS, ISIC-2018, and CX. The proposed PCCA-Net network based on U-Net exhibited superior performance in terms of DICE, IOU, and other indicators over state-of-the-art networks. The experimental results demonstrated that the PCCA module is extremely valuable in the processing of medical images of different quantities and forms and provides a new approach in the direction of multipath feature processing. By connecting the neural network design with the process of signal processing in visual cortex neurons and referring to the biological vision principle that the size of the neuron receptive field varies with the stimulus location, we combined the attention module that aggregates local and global channel information with the spatial attention module containing location information. However, for embedding the location information into the channel dimension, the CA module used in this study adopts a complex process of obtaining the location information in the spatial dimension; thus, in future studies, the two parallel branches can be simplified.

6. Conclusion

In this paper, a novel multipath feature processing PCCA module was proposed and employed for the segmentation of medical images. The PCCA module can be divided into two parallel paths: space and channel. In the channel path, we use PyConv to construct the PCA and capture multiscale contextual information, while the spatial path employs the CA module to capture long-range dependencies and accurate location information. In the proposed PCCA-Net network based on U-Net, the PCCA module is placed at the bottleneck layer to obtain deep and shallow features and enhance edge features through feature mapping. The performance of the PCCA module was evaluated on three datasets: LiTS, ISIC, and CX. Furthermore, comparative experiments demonstrated that the proposed module is effective in medical image segmentation tasks and that its number of parameters is smaller than that of most DL networks. In future studies, we will optimize the proposed structure, improve the combination of the space and channel dimensions, simplify the computing processes for the two dimensions, and explore its application in diverse medical image segmentation tasks.

Funding

Hebei Provincial Natural Science Fund Key Project (F2017201222); National Natural Science Foundation of China (61473112).

Disclosures

The authors declare no conflicts of interest.

Data availability

The data presented in this study are available on request from the corresponding author.

References

1. J. Park, S. Woo, J. Y. Lee, and I. S. Kweon, “Bam: Bottleneck attention module,” arXiv preprint arXiv:1807.06514 (2018).

2. V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017). [CrossRef]  

3. O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2015), 234–241.

4. F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: fully convolutional neural networks for volumetric medical image segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV), (IEEE, 2016), 565–571.

5. D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

6. M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701 (2012).

7. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, and D. Warde-Farley, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (MIT Press, 2014).

8. H. Cai, L. Zhu, and S. Han, “Proxylessnas: direct neural architecture search on target task and hardware,” arXiv preprint arXiv:1812.00332 (2018).

9. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), 7132–7141.

10. Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "ECA-Net: efficient channel attention for deep convolutional neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020).

11. X. Li, W. Wang, X. Hu, and J. Yang, “Selective kernel networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), 510–519.

12. Q. Hou, D. Zhou, and J. Feng, “Coordinate attention for efficient mobile network design,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), 13713–13722.

13. S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV) (2018), 3–19.

14. J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), 3146–3154.

15. S. Minaee, Y. Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: a survey,” IEEE Trans. Pattern Anal. Mach. Intell. 44, 1 (2021). [CrossRef]  

16. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE 86(11), 2278–2324 (1998). [CrossRef]  

17. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFS,” arXiv preprint arXiv:1412.7062 (2014).

18. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFS,” IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018). [CrossRef]  

19. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 3431–3440.

20. T. Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), 2117–2125.

21. J. Hirsch and C. A. Curcio, “The spatial resolution capacity of human foveal retina,” Vision Res. 29(9), 1095–1101 (1989). [CrossRef]  

22. H. Larochelle and G. Hinton, “Learning to combine foveal glimpses with a third-order Boltzmann machine,” in Proceedings of the 23rd International Conference on Neural Information Processing Systems-Volume 1 (2010), 1243–1251.

23. J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755 (2014).

24. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473 (2014).

25. K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra, “Draw: A recurrent neural network for image generation,” in International Conference on Machine Learning, (PMLR, 2015), 1462–1471.

26. M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, (2015), 2017–2025.

27. V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2, (2014), 2204–2212.

28. Y. Wu, G. Wang, Z. Wang, H. Wang, and Y. Li, “DI-Unet: dimensional interaction self-attention for medical image segmentation,” Biomed. Signal Process. Control 78, 103896 (2022). [CrossRef]  

29. I. C. Duta, L. Liu, F. Zhu, and L. Shao, “Pyramidal convolution: rethinking convolutional neural networks for visual recognition,” arXiv preprint arXiv:2006.11538 (2020).

30. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), 770–778.

31. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018), 801–818.

32. Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu, “Ce-net: context encoder network for 2D medical image segmentation,” IEEE Trans. Med. Imaging 38(10), 2281–2292 (2019). [CrossRef]  

33. Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun, “Exfuse: Enhancing feature fusion for semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018), 269–284.

34. M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent residual convolutional neural network based on u-net (R2U-net) for medical image segmentation,” arXiv preprint arXiv:1802.06955 (2018).

35. J. M. J. Valanarasu and V. M. Patel, “UNeXt: MLP-based rapid medical image segmentation network,” arXiv preprint arXiv:2203.04967 (2022).

36. Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: a nested U-Net architecture for medical image segmentation," arXiv preprint arXiv:1807.10165 (2018).

37. H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-W. Chen, and J. Wu, “Unet3+: A full-scale connected unet for medical image segmentation,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2020), 1055–1059.

38. H. Huang, S. Xie, L. Lin, Y. Iwamoto, X. Han, Y.-W. Chen, and R. Tong, “ScaleFormer: revisiting the transformer-based backbones from a scale-wise perspective for medical image segmentation,” arXiv preprint arXiv:2207.14552 (2022).

39. Q. Xu, W. Duan, and N. He, “DCSAU-Net: a deeper and more compact split-attention U-Net for medical image segmentation,” arXiv preprint arXiv:2202.00972 (2022).

40. W. Zhou, C. Liu, J. Lei, L. Yu, and T. Luo, “HFNet: hierarchical feedback network with multilevel atrous spatial pyramid pooling for RGB-D saliency detection,” Neurocomputing 490, 347–357 (2022). [CrossRef]  
