Ensemble model with cascade attention mechanism for high-resolution remote sensing image scene classification

Fengpeng Li; Fengpeng Li; Ruyi Feng; Ruyi Feng; Wei Han; Wei Han; Lizhe Wang; Lizhe Wang

doi:10.1364/OE.395866

1. Introduction

With the advancement of earth observation technology, acquiring remote sensing images with high spatial resolution becomes increasingly simplified and efficient. A large number of high-resolution remote sensing (HRRS) images are generated each day. Owing to the abundant texture information of ground objects and high frequency of data updating, HRRS images have been widely applied in land cover/use [1–5], urban planning [6–8], environment monitoring [9,10] and object detection [6,11]. HRRS image scene classification is a fundamental research topic and plays an essential role in all these tasks [12]. HRRS image scene classification has attracted a great deal of attention, and a large number of methods have been proposed for this task. Although these methods obtain excellent performance on HRRS image scene classification, developing an efficient and robust model for HRRS image scene classification is crucial because the performance of these methods still cannot satisfy the demand of practical tasks in earth observation [13].

In the recent years, artificial intelligence has made considerable progress. Models based on convolutional neural network (CNN) have been widely applied in many fields, including computer vision [15–20], voice recognition [21,22], and natural language processing [23,24] because of the strong representation capacity of deep learning. Inspired by the successful application of deep learning in computer vision tasks, many approaches based on CNNs have been proposed for HRRS image scene classification [13,14,25–34], object detection [35,36], and hyperspectral remote sensing image classficaition [37–39], and deep learning-based models [13,32,33] for HRRS image scene classification have achieved state-of-the-art performance in many studies. However, CNN-based methods have some limitations.

The first limitation is the number of training data. CNNs need a large number of labelled data to train a general model. As shown in Fig. 1, when labelled training data are limited, especially when the inter-class discrepancy is large, the models cannot capture the universal discriminative features for each class. To increase prior knowledge and improve the generalization ability of models, transfer learning policies have been adopted in scene classification. The most widely used approach is fine-tuning, which uses parameters of models trained on ImageNet [40] or other big-scale datasets, instead of training models from scratch. The source dataset of prior knowledge has great effects on model performance. And different CNNs trained on prior knowledge datasets tend to capture the same features. For some ensemble models [13,25,27], they adopt the same dataset, which is always ImageNet to acquire prior knowledge. This operation probably leads to each branch of models capturing redundant similarity features, resulting in a waste of computing resources.

Fig. 1. The large intra-class discrepancy for some categories in NWPU-RESISC45 [14].

Download Full Size | PDF

The second issue is that most of the existing methods overstate global information or information from single-type locals. As shown in Fig. 2, the ground objects in red boxes determine the scene category of the HRRS image, whereas the ground objects in black boxes are redundant and irrelevant information. That means scene categories of HRRS images usually depend on some regions that contain class-specific regions. Excessive attention to global information introduces a host of irrelevant, redundant information and reduces class-specific regions’ saliency. Some methods [32], which utilize recurrent-based attention, can focus merely on single-type areas, resulting in loss of class-specific information. To solve these issues, Lu et al. [26] proposed a global-local feature fusion method. Although the basic idea of this method makes sense, it cannot guarantee local information from class-specific regions. Similarly, some ensemble approaches [13,25,27], which combine several species of CNNs as feature extractors, pay attention to fusion of features from different branches and do not eliminate similar redundant features obtained by each branch.

Fig. 2. Different situations for HRRS image scene classification.

Download Full Size | PDF

Different prior knowledge datasets have different properties. Models trained on different datasets can have unique representation abilities, even if the structures of models are the same as each other. For example, models trained on ImageNet are much more universal, because the categories of ImageNet are abundant and cover thousands of nature scenes, whereas, models trained on CUB-200 are good at capturing fine-grained features. Each branch of ensemble methods employing different datasets to obtain prior knowledge can reduce redundant information and strengthen the model’s generality. Moreover, attention mechanisms that imitate the mechanism of the human visual system can focus on the most discriminative regions and eliminate redundant information. A combination of ensemble feature extractor and attention mechanisms avoid excessive attention on single-type features and force model capturing multiple class-specific features.

Motivated by the successful application of ensemble methods and attention mechanisms in HRRS image scene classification, an end-to-end ensemble model with cascade attention mechanism, CAE-CNN, is proposed in this paper. CAE-CNN employs Inception-V3 trained on CUB-200 and InceptionResNet-V2 with ImageNet weights as an ensemble feature extractor to obtain different class-specific features. Then the cascade attention module is used to eliminate redundant information and forces each branch of CAE-CNN extract different features.

The main contribution of this method can be summarised as follows:

1. A trick to improve the generality of the model is adopted. In this paper, each branch in the feature extractor of CAE-CNN is trained on different datasets to obtain multiple prior knowledge to reduce redundant information and encourage ensemble feature extractor capturing class-specific features as much as possible.
2. A cascade attention mechanism where there is spatial confusion attention, cross branch attention, and Branch Fusion Attention, is proposed. The Spatial Confusion Attention is used to eliminate irrelevant, redundant information with a constraint loss function. And Cross Branch Attention employs Branch Similarity Loss function and Feature Rank Loss to force each branch to learn different features. In the end, Branch Fusion Attention fuses all features to make predictions for input data.

There are four sections in the rest of the paper. Section 2 introduces existing methods and related knowledge about HRRS scene classification, while Section 3 explains CAE-CNN details. In Section 4, parameter settings, benchmark dataset and experiment results are displayed. And the conclusion is presented in the last section.

2. Related work

In the early days of satellite remote sensing, the spatial resolution of remote sensing image was extremely coarse, and ground objects with large practical sizes were tiny in remote sensing, that means each pixel contained abundant information [41]. Hence, methods for remote sensing image analysis processed data in pixel level. With the repaid development of earth observation technology, acquiring remote sensing images with high spatial resolution becomes ordinary. A great deal of datasets have proposed for HRRS image scene classification task, such as UC Merced Land-Used Dataset [42], OPTIMAL-31 [32], Aerial Image Dataset [43] and NWPU-RESISC45 [14]. As presented in Fig. 1, HRRS images have abundant texture information, and each one often contains several types of ground objects. Moreover, ground objects in HRRS images have changeable scales. All these special properties of HRRS images make scene classification for HRRS images a difficult task.

To face the challenge of HRRS image scene classification, in ths past several decades, numerous methods have been proposed for HRRS image scene classification. Based on the type of adopted feature extractors, they are roughly divided into three types: hand-crafted-feature-based methods, unsupervised-feature-based methods, and deep-learning-feature-based methods [14].

In the early time, models for HRRS image scene classification were mostly based on hand-crafted features [14], such as colour histograms [44], texture descriptors [45], scale-invariant feature transform (SIFT) [46] and histogram of oriented gradients (HOG) [47]. Each hand-crafted feature could only reflect a certain aspect, leading to considerable information being discarded. To solve this issue, some methods [28,48–50], which combine or fuse multiple hand-crafted features were proposed. For instance, Zhu et al. [49] proposed the BoVW-based HRRS image scene classification method, which fused local dense SIFT features and global shape-based texture features to represent HRRS images. Although feature combination or fusion methods partly improve the performance of HRRS image scene classification, the generality capacity of hand-crafted features are quite weak, and features need to be redesigned for different datasets.

To achieve automatic feature design and overcome the disadvantages of hand-crafted features, unsupervised methods, such as principal component analysis (PCA) [51], k-means clustering [52], sparse coding [53,54] and auto-encoder [55–59], were proposed, which aim to learn a set of functions, which are used to encoder input data as features. Unsupervised-feature-based methods can obtain features efficiently and automatically, but these features are not discriminative enough because of the lack of labelled information.

Many deep-learning-based methods have been proposed that have obtained state-of-the-art performance in multiple computer vision tasks [18,20,60–62]. Deep learning methods, especially CNN, have a strong representation capacity, and they can capture discriminative features without extra hand-engineering work. This is the greatest strength of deep learning methods. Inspired by the success of CNNs on natural scene image classification on ImageNet, numerous methods based on CNNs [13,14,25–33] have been proposed for HRRS image scene classification. However, trained a deep model with limited data is quite difficult. A host of approaches have been introduced to accelerate convergence and improve the generality capacity, and the most widely applied policy is fine-tuning. In fine-tuning, the model is trained on other big-scale datasets, such as ImageNet and CUB-200, to obtain some prior knowledge, instead of training the model from scratch. However, for tasks whose domains have a great gap with datasets for fine-tuning, fine-tuning tricks cannot provide positive effectiveness [63].

2.1 Attention mechanism

Attention mechanisms are an essential ability of the human visual system. When humans or other animals watch a scene, not all information in the visual scene has equal weights. The human or animal visual system focuses on the essential parts and ignores irrelevant information, thus making the watching procedure quite efficient. Motivated by the mechanism, models based on attention mechanisms have been developed for computer vision tasks [64–66], and natural language processing [67].

In remote sensing image processing, attention mechanisms have been adopted in scene classification [19,32,61] and hyperspectral image classification [68–70]. Wang et al. [32] proposed the first successful attention mechanism-based method for HRRS image scene classification, which adopts a recurrent attention mechanism to capture the most class-specific feature. The attention mechanism-based methods for hyperspectral image classification pay more attention to channel space, according to the characteristics of hyperspectral images. Wang et al. [32] proposed the first successful attention mechanism-based method for HRRS image scene classification, which adopts a recurrent attention mechanism to capture the most class-specific feature. However, recurrent attention mechanism-based methods overstate single-type class-specific features, which leads to misclassification in some conditions.

3. The proposed method: CAE-CNN

CAE-CNN is divided into two main parts. The first subsection explains the reasons why Inception-V3 and InceptionResNet-V2 are chosen to combine feature extractor, and some prior operations are also presented. In the last paragraph, the structure and principle of the Cascade Attention Mechanism are introduced. In this section, the pseudo-code is displayed at the last of the part.

3.1. Ensemble CNN feature extractor

Compared with hand-crafted features and unsupervised features, features extracted by CNNs have strong representation capacity without extra engineering work. Numerous CNNs models exist with different structures, and features extracted by different CNNs have various representations capacities.

For HRRS image scene classification, existing ensemble methods adopt a different policy. For instance, Liu et al. [25] proposed a triplet combination model, which employs VGGNet-16 [71], GoogleNet [60] and AlexNet [72] as feature extractors. For the other typical ensemble method, Rodrigo et al. [13] made use of ResNet [62] and DenseNet [73] to extract features. However, these CNN models may not be quite appropriate for the HRRS image. As shown in Fig. 2, class-specific regions on HRRS images are often tiny. These CNNs used in the existing ensemble methods have a single filter kernel size in each layer, limiting their capacity to capture multi-scale features and leads them to ignore tiny features easily.

Figure 3 displays the whole structure of Inception-V3; the blocks filled with the same colors have the same structure as each other. Inception-V3 has several species filter kernel sizes in each block, which is beneficial for capturing class-specific features, while ground objects have a changeable scale. Models based on Inception-V3 have been applied in fine-grained image classification and obtained state-of-the-art performance in this task [74], which means Inception-V3 has a strong capacity to capture stunning details. Thus, based on the characteristics of the HRRS image, Inception-V3 is adopted as one of the CAE-CNN feature extractors.

Fig. 3. The structure of Inception-V3.

Download Full Size | PDF

To make abundant features and make features from different branches easy to fuse, InceptionResNet-V2 is employed as the other feature extractor. As shown in Fig. 4, where boxes with the same colors represent blocks of InceptionResNet-V2 with the same structure, InceptionResNet-V2 is a CNN model that adds residual module in the construction of the Inception model.

Fig. 4. The structure of InceptionResNet-V2.

Download Full Size | PDF

To accelerate the training step and make each branch capture as little redundant information as possible, a branch based on Inception-V3 is trained on CUB-200 to introduce prior knowledge about capturing detailed parts. In contrast, the other branch uses ImageNet weights to obtain strong generality of capacity. Both are removed fully connected layers.

3.2. Cascade attention mechanism

The Cascade Attention module has three parts, and eah one has a different functional setting, which are described in detail in separate subsections. To better present the module and show details of each part, the module is divided into three section to describe. In the follwing parts, $X^{H\times W \times 3}=(x_1,x_2,\dots ,x_i)$ is input data, $H$, $W$ represent the height and width of input data, $i$ is the batch size, $Y^t=(y_1,y_2,\dots ,y_i)$ and $Y^{'}=(y_1^{'},y_2^{'},\dots ,y_i^{'})$ are true label and predicted labels for input HRRS images. $F_1$ and $F_2$ denote Inception-V3 and InceptionResNet-V2, while $f_1$ and $f_2$ are the feature maps extracted by the two branches.

3.2.1. Spatial confusion attention

Different from traditional operation which transform extracted feature maps into 1D vectors, feature maps $f_1$ and $f_2$ are processed by a $1\times 1$ convolutinal operation $W_c$, as shown in Eq. (1), 2, where $C$ is the number of categories and $A$ is named class activation map.

(1)$$A_1=f_1\times W_c\in R^{h\times w\times C}$$

(2)$$A_2=f_2\times W_c\in R^{h\times w\times C}$$

Because of the stronger capacity of capturing class-specific features of InceptionResNet-V2 than Inception-V3, and still, Inception-V3 trained on CUB-200 can capture tiny features. To make full use of the advantages of the various feature extractors, the second branch is used to locate regions with class-specific ground objects to eliminate the disturbance of irrelevant information. The other is used to make predictions in Spatial Confusion Attention. To achieve this, $A_2$ needs to be squeezed into a single channel feature map $M$. If just M is calculated, as shown in Eq. (3), some useful information is not taken into consideration. Thus, to make $M$ much more abundant with class-specific information, the weighted summation is the best choice. The weight of each channel of $A_2$ can be obtained based on the label of input data. As shown in the Eq. (4), where $m$ and $n$ are coordinate indexes, average value $A_{avg}$ of each channel is computed by a global average pooling, and then the average value vectors are used to obtain weights $W_c$ of each channel by softmax in Eq. (5). Weighted summation $M$ is calculated by Eq. (6) where $j$ is the channel index of $A_2$.

(3)$$M=max_C(A_2) \in R^{h\times w}$$

(4)$$A_{avg}=\frac{\sum_{m,n}^{w,h}A_2^{m,n}}{w\times h} \in R^C$$

(5)$$W_c=softmax(A_{avg}) \in R^{w\times h}$$

(6)$$M=\sum_{j=1}^{C}A_2\times W_c \in R^{w\times h}$$

Thus, attention map $M$ of input $X_i$ is obtained. The value of the pixels in $M$ is not enough to spatially present their importance. Therefore, the weight of each pixel needs to be calculated by the spatial attention mechanism. Two layers $1\times 1$ convolutional filter procedure $W_{cn1}$ and $W_{cn2}$ are adopted to compute the spatial weights matrix $W_{sw}$ in Eq. (7). As shown in Eq. (8), where $\odot$ is the element-wise product and $W_{cn3}$ and $W_{cn4}$ represent 1$\times$1 convolutional filter, $M_s$, a mask matrix for feature maps $f_1$, is obtained. The procedure of obtaining $M_s$ is the same as the method of computing spatial attention weights. Moreover, $M_s$ is used to locate the most discriminative part in feature maps $f_1$ extracted by Inception-V3 and make predictions for input HRRS images by softmax, following Eq. (9) and 10.

(7)$$W_{sw}=sigmoid(W_{cn2}(tanh(W_{cn1}(A_2)))) \in R^{h\times w}$$

(8)$$M_s=sigmoid(W_{cn4}(tanh(W_{cn3}(M\odot W_{sw})))) \in R^{h\times w}$$

(9)$$f_1^{'}=\frac{\sum_{n,m}^{h,w} f_1^{n,m}\odot M_s}{{\sum_{n,m}^{h,w}M_s^{n,m}} }\in R^C$$

(10)$$Y_1^{'}=\frac{e^{f_1^{'}}}{\sum_{j}^{C}e^{f_1^{'j}} } \in R^C$$

To force Spatial Confusion Attention achieving the expected effect, Spatial Mask Loss is proposed. For $M_s$, the value of a pixel of it needs to be between 0 and 1, because each value indicates the importance of the corresponding pixel in $f_1$. However, some values of pixels in $M_s$ are larger than 1. To achieve this, two hyperparameters $b_{low}$ and $b_{high}$ are manually set in this place. There are three steps in Spatial Mask Loss. First, the summation value of all pixels $T$ in $M_s$ is obtained following Eq. (11), where $n,m$ are the coordinate indexes. Then fluctuation range is obtained by Eq. (12) and 13 and the loss value $\ell _{sm}$ is summation presented in Eq. (14) of the lowest and highest values in this range.

(11)$$T=\sum_{n,m}^{h,w}(M_s)_{n,m} \in R^1$$

(12)$$\ell_{low}=max(b_{low}\times h \times w-T,0)$$

(13)$$\ell_{high}=max(T-b_{high}\times h \times w,0)$$

(14)$$\ell_{sm}=\frac{\ell_{high}+\ell_{low}}{h\times w}$$

The whole procedure of Spatial Mask Attention is shown in Fig. 5 and the mask operation of $M_s$ is displayed in Fig. 6, where the size of the input image is 448$\times$448. The masking procedure in CAE-CNN is quite similar to bilinear pooling. However, mask operation in CAE-CNN needs that the values of factors in $M_s$ should be smaller than 1 and larger than 0, while there is no such requirement in bilinear pooling. That is the essential difference between mask operation and bilinear pooling.

Fig. 5. The detailed structure of Spatial Confusion Attention.

Download Full Size | PDF

Fig. 6. The mask operation of $M_s$.

Download Full Size | PDF

3.2.2. Cross branch attention

The two branches of CAE-CNN are expected to capture different class-specific features and make the same predictions for input HRRS images $X_i$. Each branch has been trained on different datasets. Thus they possess different prior knowledge. Naturally, branch2 with InceptionResNet-V2 performs better than branch1. To increase the performance of each branch, Cross Branch Similarity Loss is proposed here. As presented in Eq. (16), the loss function has two parts; one of them is Kullback-Leibler divergence (as shown in Eq. (15)) and a pre-defined margin. $\ell _{cbs}$ is minimized during training so that the two branches can obtain the same prediction. $m_{cns}$ is a bias value, and its setting is based on [75].

(15)$$D_{kl}=\sum_{j}^{C}Y_{f_1}^{'j}\log(\frac{Y_{f_1}^{'j}}{Y_{f_2}^{'j}})$$

(16)$$\ell_{cbs}=D_{kl}-m_{cns}$$

Although Cross Branch Similarity Loss forces the two branches to obtain better performance, the weak capacity of Inception-V3 may decrease the performance of InceptionResNet-V2. To overcome this disadvantage, motivated by [75], a rank loss is adopted in Cross Branch Attention. The rank loss takes predictions from two branches into consideration:

(17)$$\ell_r=max(0,Y_{f_1}^{'j}-Y_{f_2}^{'j}+m_r)$$

In training procedure, $\ell _r$ is minimised to force $Y_{f_1}^{'j}-Y_{f_2}^{'j}+m_r>0$. Because the rank loss is unbounded, which leads to the unbalance of $Y_{f_1}^{'j}$ and $Y_{f_2}^{'j}$. To address the issue, the other part of the rank loss is presented as:

(18)$$\ell_s=max(0,-Y_{f_1}^{'j}-Y_{f_2}^{'j}+2m_s)$$

In Eq. (17) and 18, $m_s$ and $m_r$ are both pre-definded, satisfied margin. The pre-defined rank loss function is presented as shown in Eq. (19). The parameter settings of $m_s$ and $m_r$ is explained in [76].

(19)$$\ell_{sr}=max(\ell_s,\ell_r)$$

The proposed rank loss has two-fold. It can encourage $Y_{f_1}^{'j}$ and $Y_{f_2}^{'j}$ in the same trend. By the two-loss functions, the ensemble feature extractors can capture discriminative features and make the same predictions for each branch.

3.2.3. Branch fusion attention

There are several outputs of the ensemble feature extractor of CAE-CNN, and combining the outputs properly is essential. According to the residual network, the summation strategy is quite effective to fuse different outputs:

(20)$$A_f=Avgpool(A_1+A_2) \in R^C$$

(21)$$Y_{final}^{'}=\frac{e^{A_f}}{\sum_{j}^{C}e^{A_f^j}}\in R^C$$

However, treating each branch with equal contribution is not appropriate. Parameterized methods have not fully utilized the property of softmax. Therefore, a parameterless trick is adopted to calculate the weights of each output. The entropy of branches is computed as Eq. (22) and 23.

(22)$$E1={-}\sum_{j}^{C}Y_{f_1}^j\times \log(Y_{f_1}^j)$$

(23)$$E2={-}\sum_{j}^{C}Y_{f_2}^j\times \log(Y_{f_2}^j)$$

Entropy value reflects the confidence of predictions. If the entropy value is smaller, the predictions are closer to truth labels. Therefore, the weights of each output can be calculated as Eq. (24) and 25. And The final output can be presented as Eq. (27) and the final prediction is shown in Eq. (28). There are only two branches in this place, as shown in Eq. (26) and the summation of the two parameters are 1.

(24)$$\omega_1=\frac{\frac{1}{E_1}}{\frac{1}{E_1}+\frac{1}{E_2}}$$

(25)$$\omega_2=\frac{\frac{1}{E_2}}{\frac{1}{E_1}+\frac{1}{E_2}}$$

(26)$$\omega_2=1-\omega_1$$

And the final output can be presented as Eq. (27) and the final prediction is shown in Eq. (28).

(27)$$A_{final}=Avgpool(\omega_1\times A_1 +\omega_2\times A_2)$$

(28)$$Y_{final}^{'}=\frac{e^{A_f}}{\sum_{j}^{C}e^{A_f^j}}\in R^C$$

3.3. Pseudo code of CAE-CNN

To better display the whole structure of CAE-CNN, pseudo-code is presented. The algorithm can be included using the commands, as shown in algorithm 1. And the detailed structure of CAE-CNN is presented in Fig. 7.

Fig. 7. The detailed structure of CAE-CNN.

Download Full Size | PDF

4. Experiment

This section provides details of the experiments. First, evaluation metrics for HRRS image scene classification are presented. Next, we discuss the settings of some parameters for training and introduce benchmark datasets used in experiments. In the last subsection, we describe the experimental results of CAE-CNN with different hyperparameters on OPTIMAL-31 and display the CAE-CNN results on each benchmark dataset to compare with the existing methods. Moreover, we explain experimental results to indicate the superiority and robustness of CAE-CNN better.

4.1. Evaluation metrics

There are three frequently used evaluation metrics in image scene classification. They are Confusion Matrix (CM), overall accuracy (OA), and average accuracy (AA).

1. Confusion Matrix (CM): CM can visually present classification results, and it has been widely applied in supervised classification evaluation. In the CM matrix, the row labels represent the exact categories for input data, while the column labels are predictions.
2. Overall Accuracy (OA): OA is obtained by using numbers of correctly predicted data dividing total amounts of data.
3. Average Accuracy (AA): Average Accuracy (AA) is the average value of accuracy for each category in test datasets. If each class has the same numbers of data, AA is equal to OA.

4.2. Parameter settings

We perform our experiments with a code based on Keras used TensorFlow as the backend. The ImageNet weight of InceptionResNet-V2 can be found in Keras, and our experiment in 100 training steps obtains the CUB-200 weight of Inception-V3. All experiments of this paper are carried out on the Tianhe-2 V100 GPU distributed system, where there are four V100 GPUs with 16GB RAM on each computation node. We take a variable learning rate in our experiments [77] with the SGD optimizer accelerated by Nesterov into consideration. The original learning rate $lr$ is $1\times 10^{-3}$, $lr$ is changed as $lr=lr*0.9$ in each 10 steps, and the low limit of $lr$ is $1\times 10^{-6}$. We set the batch size to 12 for images with a size of $448\times 448$ in 200 epochs. The image size of some datasets is smaller than $448\times 448$. We use bilinear interpolation to enlarge the scale of these images. Although bilinear interpolation operation leads to image blur, it enlarges some tiny features and avoids losing class-specific details. Several groups experiments discuss two hyperparameters $b_h$ and $b_l$ in Spatial Mask Loss. we set $m_{cns}=0.15$ as described in [75] and $m_r=0.25$, $m_s=0.05$ as introduced in [76].

4.3. Brief introduction of datasets

1. OPTIMAL-31 Dataset:
The OPTIMAL-31 dataset [32] is a small dataset for HRRS image scene classification. All of these images in the dataset are collected from GoogleEarth with 31 categories and 1860 images. Each class of the dataset has 60 images with height and width as $256\times 256$. The rule for categories setting of OPTIMAL-31 is more reasonable and difficult than equal scale datasets, such as UC Merced Land-Use Dataset [42]. To compare with existing approaches that have shown the excellent effectiveness on OPTIMAL-31, we use 80% to train model, and the rest is used for testing procedure, according to [32].
2. UC Merced Land-Use dataset (UCM):
The UCM dataset [32] is one of the widely used benchmark datasets for HRRS image scene classification. There are 21 categories in the dataset, and each one has 100 images. All images in this dataset are generated from HRRS images with a spatial resolution of 0.3 meters and extracted by aerial orthography. To compare the effectiveness of CAE-CNN on this dataset, we perform two experiments with 50% and 80% for training, based on [29,30,32,33,78,79].
3. Aerial Image dataset (AID):
AID [43] is a large-scale land-use dataset for HRRS images where there are 30 categories. The dataset has 10000 images, and the size of each one is $600\times 600$. Remote sensing image interpretation specialists label all these images. We perform two groups experiments with 20% and 50% data for training separately, and the rest of the images are test data, according to the existing models in [25,32,33,73,80].
4. NWPU-RESISC45 dataset (NWPU-45):
NWPU-RESISC45 [14] is a challenging dataset for HRRS scene classification. The dataset has 45 classes. Each category of the dataset has 700 images with a size of $256\times 256$. The spatial resolution of these images is range from 0.2m to 0.3m. To our knowledge, this is the largest dataset for HRRS image scene classification [13]. Each category has abundant data. Intra-class diversity for some categories is the large, and inter-class similarity is high. All these properties of this dataset make it a challenge for models to achieve satisfying performance on it. Because CAE-CNN is an end-to-end model without any other trick, we compare its performance of CAE-CNN with existing end-to-end models on NWPU-45. According to [13,25,27,33,41,81], there are two groups of experiments used NWPU-45 20% and 10% images for training, and the performance of models are tested on the rest of the dataset.

Data partition ways of training datasets and testing datasets have a significant impact on testing accuracy. To avoid the occasionality of experimental results and display the robustness of CAE-CNN, several experiments are performed over each dataset with different partition ways. We also use horizontal flipping, vertical flipping, and random rotation from $-360$ degree to $360$ degree to augment the training data.

4.4. Experiment results and explaination

This subsection analyzes CAE-CNN performances over each benchmark and compares it with that of the existing methods with state-of-art performances. The beginning of this subsection discusses the best parameter settings of $b_h$ and $b_l$ in Spatial Mask Loss.

We use several group experiments on OPTIMAL-31 to find the most appropriate parameter setting of the two hyperparameters. In the experiments, we use 80$\%$ data of OPTIMAL-31 for training, while the rest of the data is used for testing. Because $b_h$ should far greater than $b_l$ because the too-small gap of the two parameters can result in the loss of a great deal of class-specific information. We set $b_h=\frac {1}{3}$ or $b_h=\frac {2}{3}$ or $b_h=1$, while $b_l$ is set as $\frac {1}{4}$ or $\frac {1}{3}$. As shown in Table 1, when $b_h$ is $\frac {2}{3}$ and $b_l$ is $\frac {1}{4}$, CAE-CNN obtains the best performance on OPTIMAL-31. Still from Fig. 8, in the best parameter settings case, CAE-CNN achieved the best performance in high efficiency, and we find that when the performance of CAE-CNN with $b_h=\frac {2}{3}$ is better than other parameter settings. Therefore, for $b_h$, $\frac {2}{3}$ is the best parameter setting case. The best parameter setting is still applied in experiments on other datasets to display the robustness of CAE-CNN.

Fig. 8. Testing accuracy tendency during training.

Download Full Size | PDF

Table 1. THE PERFORMANCE COMPARISON OF DIFFERENT $b_h$ and $b_l$ SETTINGS ON THE OPTIMAL-31 DATASET

View Table | View all tables in this article

1. OPTIMAL-31 dataset:
OPTIMAL-31 is a challenging dataset owing to the limited amount of images. Each category in this dataset has 60 images, with 80 $\%$ images used for training, while the rest are used for testing. In this situation, there are only 12 images of each category for testing. Misclassification of a single image of a category has a great negative impact on the performances of CAE-CNN. As shown in Table 2, where the corresponding true labels to these numbers are airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbour, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass, parking lot, railway, rectangular farmland, roundabout, and runway. CAE-CNN achieves great performance compared with existing methods over OPTIMAL-31.
To evaluate the influence of different adopted CNN extractors, we perform experiments over the four benchmark dataset by using CAE-CNN-single, which only use Inception-V3 as feature extractors and CAE-CNN-Multiple with Inception and IncetionResNet-V3 as feature extractors. Moreover, the influence of pre-trained policy with UCB200 is also discussed.
From Fig. 9, CAE-CNN makes correct predictions for almost all categories in OPTIMAL-31. When multiple prior knowledge is not adopted, that of CAE-CNN on the dataset is lower than CAE-CNN with various fine-tuning tricks. However, CAE-CNN still has better performance than other existing approaches. Although its performance is excellent over OPTIMAL-31, there are still some issues. As shown in Fig. 9, several images belonging to the church are classified into the wrong label, commercial area, from Fig. 10 and Fig. 11, incorrect classification images of category Church in the dataset are quite similar to typical samples in the category Commercial Area, which results in the misclassification issue between the two categories.
2. UCM Dataset: UCM is a sample dataset. We perform two groups of experiments on this dataset. As shown in Table 3, CAE-CNN achieves state-of-the-art performance on two training data ratios cases. Comparing with previous state-of-the-art performance, CAE-CNN obtained an apparent improvement. The performance of CAE-CNN training with 50% data is better than some models trained by 80% data. When multiple pre-trained policies are not adopted in experiments over the UCM dataset, the performance of CAE-CNN is slightly lower than it adopted. That indicates the effectiveness of multiple pre-trained policies in UCM dataset. Due to the saturated performance of CAE-CNN trained with 80% data for training on the UCM dataset, we only display the CM of CAE-CNN trained with 50%, as shown in Fig. 14, from 0 to 20, corresponding practical categories are agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, tennis court.
3. AID: AID is much more complicated than OPTIMAL-31. Although the inter-class discrepancy of AID categories is small, the intra-class similarity prevents improvement of the performance of existing HRRS image scene classification methods.
To compare with current end-to-end methods proposed for HRRS image scene classification, we perform two species of experiments. In the first experiment, 20$\%$ of data for each category in the dataset is used to train model, while in the other 50$\%$ data is utilized as training samples.
As shown in Table 4, CAE-CNN takes advantage of 20$\%$ samples of AID and obtains an extremely splendid average accuracy, which is even better than performances, which are achieved by existed models trained with 50$\%$ data. When 50 $\%$ data of the dataset is used to train our proposed model, CAE-CNN achieves state-of-the-art performance, compared with all existing HRRS image scene classification methods. Moreover, although multiple fine-tuning is not used before the training step, CAE-CNN still can obtain state-of-the-art performances than existing methods. In Fig. 15 and 16, from 0 to 29, the homologous labels are airport, bare land, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, baseball field, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, viaduct, beach, bridge, center, church, commercial, dense residential, desert. From Fig. 15, CAE-CNN makes high precision predictions for most of the categories in the AID dataset. However, for images, categorized resorts, the performance of the category does not have apparent improvement, even if the number of training images increases. A shown in Fig. 12 and 13, the incorrect classification images have quite close parallels with images labelled park, which results in the misclassification of images belonging to the resort category.
4. NWPU-RSISC-45 Dataset:
NWPU-45 is a challenging dataset because of the high intra-class discrepancy and has extremely fine-grained categories. Yet, ratios of data for training in the dataset are low, we have analyzed in the Introduction of this paper. To compare with existing end-to-end methods which have performed experiments on NWPU-45, we use two ratios of images for training.
As displayed in Table 5, CAE-CNN achieves state-of-the-art performance on both two experiments. For the first group experiments, CAE-CNN using 10$\%$ images of the dataset obtains a 1.21$\%$ increase compared with existing methods. Moreover, our proposed method still makes satisfactory performance when 50$\%$ images of NWPU-45 are fed into the model during the training procedure. Even if multiple fine-tuning policies are not adopted, CAE-CNN still achieves quite competitive performances on the NWPU-45, which is better than all of the existing end-to-end HRRS image scene classification methods.
From Fig. 17 and Fig. 18, where 0 to 44 represents airplane, airport, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbour, industrial area, intersection, baseball diamond, island, lake, meadow, medium residential, mobile home park, mountain, overpass, palace, parking lot, railway, basketball court, railway station, rectangular farmland, river, roundabout, runway, sea ice, ship, snow-berg, sparse residential, stadium, beach, storage tank, tennis court, terrace, thermal power station, wetland, bridge, chaparral, church, circular farmland, cloud, images labelled church are easily classified to category palace, and palace images also can be predicted as the church. As shown in Fig. 19 and 20, images belong to the two species are quite similar to each other, which makes it a difficult task to distinguish the two categories with high precision.

Table 2. THE PERFORMANCE COMPARISON OF MODELS ON THE OPTIMAL-31 DATASET

View Table | View all tables in this article

Fig. 9. The confusion matrix of CAE-CNN over the OPTIMAL-31 dataset.

Download Full Size | PDF

Fig. 10. Incorrect classification images belonging to church.

Download Full Size | PDF

Fig. 11. Typical samples belonging to commercial area.

Download Full Size | PDF

Table 3. PERFORMANCE COMPARISON OF MODELS ON UCM LAND-USE DATA SET

View Table | View all tables in this article

Fig. 12. Misclassification images belonging to resort.

Download Full Size | PDF

Fig. 13. Typical samples belonging to park.

Download Full Size | PDF

Fig. 14. The confusion matrix of CAE-CNN with 50$\%$ data for training over the UCM dataset.

Download Full Size | PDF

Fig. 15. The confusion matrix of CAE-CNN with 20$\%$ data for training over the AID dataset.

Download Full Size | PDF

Fig. 16. The confusion matrix of CAE-CNN with 50$\%$ data for training over the AID dataset.

Download Full Size | PDF

Table 4. THE PERFORMANCE COMPARISON OF MODELS ON THE AID DATASET

View Table | View all tables in this article

Table 5. THE PERFORMANCE COMPARISON OF MODELS ON THE NWPU-45 DATASET

View Table | View all tables in this article

Fig. 17. The confusion matrix of CAE-CNN with 10$\%$ data for training over the NWPU-45 dataset.

Download Full Size | PDF

Fig. 18. The confusion matrix of CAE-CNN with 20$\%$ data for training over the NWPU-45 dataset.

Download Full Size | PDF

Fig. 19. Typical samples belonging to church in the NWPU-45 dataset.

Download Full Size | PDF

Fig. 20. Typical samples belonging to palace in the NWPU-45 dataset.

Download Full Size | PDF

To better present the advantages of CAE-CNN, we compare the size and time consumption of CAE-CNN with currently existing methods for HRRS image scene classification in Table 6. CAE-CNN-single has a similar scale with ResNet-101, and the time-consuming is sightly lower than ResNet-101. However, CAE-CNN obtains far better performance than ResNet-101 overall benchmark datasets. The size of CAE-CNN-Multiple is quite large, and its time-consuming is not quite high. From Table 3 to Table 5, we find that CAE-CNN without pre-trained UCM200 still can obtain the state-of-the-art performance on all used datasets in this paper, which indicates the strong capacity of CAE-CNN on HRRS image scene classification.

Table 6. THE COMPARISON OF MODEL SIZE AND TIME CONSUMPTION ON THE NWPU-45 DATASET

View Table | View all tables in this article

5. Conclusion

In this paper, we present an end-to-end model, CAE-CNN, for HRRS image scene classification. CAE-CNN has two main parts, ensemble feature extractor, and cascade attention module. We train each branch of the ensemble feature extractor with different datasets to force them to obtain abundant prior knowledge. The cascade attention module employs three child parts to play different roles to push the whole model to capture the most discriminative feature. Considering all these strategies, CAE-CNN has achieved state-of-the-art performance on four benchmark datasets. Future studies should enhance the proposed method and further improve the performance, especially on the NWPU-45 dataset.

Funding

National Natural Science Foundation of China (No.41701429, U1711266); China National Funds for Distinguished Young Scientists (No.41925007).

Acknowledgments

The authors would like to thank Prof Qi Wang, Prof. Gong Cheng and Prof. Guisong Xia for sharing the OPTIMAL-31, NWPU-45 and AID data sets. We would also like to thank editors, associated editors, and anonymous reviewers for their insightful suggestions and comments, which significantly improve the paper.

Disclosures

The authors declare no conflicts of interest.

References

1. B. Huang, B. Zhao, and Y. Song, “Urban land-use mapping using a deep convolutional neural network with high spatial resolution multispectral remote sensing imagery,” Remote. Sens. Environ. 214, 73–86 (2018). [CrossRef]

2. F. Chen, K. Wang, T. Van de Voorde, and T. F. Tang, “Mapping urban land cover from high spatial resolution hyperspectral data: An approach based on simultaneously unmixing similar pixels with jointly sparse spectral mixture analysis,” Remote. Sens. Environ. 196, 324–342 (2017). [CrossRef]

3. G. Milani, M. Volpi, D. Tonolla, M. Doering, C. Robinson, M. Kneubühler, and M. Schaepman, “Robust quantification of riverine land cover dynamics by high-resolution remote sensing,” Remote. Sens. Environ. 217, 491–505 (2018). [CrossRef]

4. X.-Y. Tong, G.-S. Xia, Q. Lu, H. Shen, S. Li, S. You, and L. Zhang, “Land-cover classification with high-resolution remote sensing images using transferable deep models,” Remote Sens. Environ. 237, 111322 (2020). [CrossRef]

5. G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, “Effective and efficient midlevel visual elements-oriented land-use classification using vhr remote sensing images,” IEEE Trans. Geosci. Remote. Sens. 53(8), 4238–4249 (2015). [CrossRef]

6. G. Liu, Y. Gousseau, and F. Tupin, “A contrario comparison of local descriptors for change detection in very high spatial resolution satellite images of urban areas,” IEEE Trans. Geosci. Remote. Sens. 57(6), 3904–3918 (2019). [CrossRef]

7. J. Song, X. Tong, L. Wang, C. Zhao, and A. V. Prishchepov, “Monitoring finer-scale population density in urban functional zones: A remote sensing data fusion approach,” Landsc. Urban Plan. 190, 103580 (2019). [CrossRef]

8. J. G. Su, P. Dadvand, M. J. Nieuwenhuijsen, X. Bartoll, and M. Jerrett, “Associations of green space metrics with health and behavior outcomes at different buffer sizes and remote sensing sensor resolutions,” Environ. Int. 126, 162–170 (2019). [CrossRef]

9. S. Wang, M. Garcia, P. Bauer-Gottwein, J. Jakobsen, P. J. Zarco-Tejada, F. Bandini, V. S. Paz, and A. Ibrom, “High spatial resolution monitoring land surface energy, water and co2 fluxes from an unmanned aerial system,” Remote Sens. of Environ. 229, 14–31 (2019). [CrossRef]

10. Z. Cao, R. Ma, H. Duan, and K. Xue, “Effects of broad bandwidth on the remote sensing of inland waters: Implications for high spatial resolution satellite data applications,” ISPRS-J. Photogramm. Remote Sens. 153, 110–122 (2019). [CrossRef]

11. G. Cheng, J. Han, L. Guo, and T. Liu, “Learning coarse-to-fine sparselets for efficient object detection and scene classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2015, Boston, MA, USA, June, 2015, pp. 1173–1181.

12. E. Li, J. Xia, P. Du, C. Lin, and A. Samat, “Integrating multilayer features of convolutional neural networks for remote sensing scene classification,” IEEE Trans. Geosci. Remote. Sens. 55(10), 5653–5665 (2017). [CrossRef]

13. R. Minetto, M. P. Segundo, and S. Sarkar, “Hydra: an ensemble of convolutional neural networks for geospatial land classification,” IEEE Trans. Geosci. Remote. Sens. 57(9), 6530–6541 (2019). [CrossRef]

14. G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proc. IEEE 105(10), 1865–1883 (2017). [CrossRef]

15. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2016, Las Vegas, NV, USA, June, 2016, pp. 2921–2929.

16. G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2016, Las Vegas, NV, USA, June, 2016, pp. 2261–2269.

17. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2018, Salt Lake City, UT, USA, June, 2018, pp. 7132–7141.

18. Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr, “Fast online object tracking and segmentation: A unifying approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2019, Long Beach, CA, USA, June, 2019, pp. 1328–1338.

19. X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, “Progressive attention guided recurrent network for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2018, Salt Lake City, CA, USA, June, 2018, (2018), pp. 714–722.

20. S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). [CrossRef]

21. A. Chowdhury and A. Ross, “Fusing mfcc and lpc features using 1d triplet cnn for speaker recognition in severely degraded audio signals,” IEEE Trans. Inf. Forensic Secur. (2019).

22. T. Tuncer and S. Dogan, “Novel dynamic center based binary and ternary pattern network using m4 pooling for real world voice recognition,” Appl. Acoust. 156, 176–185 (2019). [CrossRef]

23. T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang, “Disan: Directional self-attention network for rnn/cnn-free language understanding,” in Proc. AAAI Conf. Artif. Intell.,AAAI 2018, New Orleans, Louisiana, USA, February, 2018, pp. 5446–5455.

24. B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Towards the imagenet-cnn of nlp: Pretraining sentence encoders with machine translation,” in Proc. Adv. Neural Info. Process. Syst., pp. 6285–6296.

25. Y. Liu, C. Y. Suen, Y. Liu, and L. Ding, “Scene classification using hierarchical wasserstein cnn,” IEEE Trans. Geosci. Remote. Sens. 57(5), 2494–2509 (2019). [CrossRef]

26. Y. Yuan, J. Fang, X. Lu, and Y. Feng, “Remote sensing image scene classification using rearranged local features,” IEEE Trans. Geosci. Remote. Sens. 57(3), 1779–1792 (2019). [CrossRef]

27. Y. Liu and C. Huang, “Scene classification via triplet networks,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 11(1), 220–237 (2018). [CrossRef]

28. J. Zou, W. Li, C. Chen, and Q. Du, “Scene classification using local and global features with collaborative representation fusion,” Inf. Sci. 348, 209–226 (2016). [CrossRef]

29. F. Hu, G.-S. Xia, J. Hu, and L. Zhang, “Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery,” Remote Sens. 7(11), 14680–14707 (2015). [CrossRef]

30. S. Chaib, H. Liu, Y. Gu, and H. Yao, “Deep feature fusion for VHR remote sensing scene classification,” IEEE Trans. Geosci. Remote. Sens. 55(8), 4775–4784 (2017). [CrossRef]

31. J. Xie, N. He, L. Fang, and A. Plaza, “Scale-Free Convolutional Neural Network for Remote Sensing Scene Classification,” IEEE Trans. Geosci. Remote. Sens. 57(9), 6916–6928 (2019). [CrossRef]

32. Q. Wang, S. Liu, J. Chanussot, and X. Li, “Scene classification with recurrent attention of VHR remote sensing images,” IEEE Trans. Geosci. Remote. Sens. 57(2), 1155–1167 (2019). [CrossRef]

33. G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, “When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs,” IEEE Trans. Geosci. Remote. Sens. 56(5), 2811–2821 (2018). [CrossRef]

34. X. Lu, B. Wang, X. Zheng, and X. Li, “Exploring models and data for remote sensing image caption generation,” IEEE Trans. Geosci. Remote. Sens. 56(4), 2183–2195 (2018). [CrossRef]

35. G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,” ISPRS-J. Photogramm. Remote Sens. 98, 119–132 (2014). [CrossRef]

36. G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote. Sens. 54(12), 7405–7415 (2016). [CrossRef]

37. S. Mei, J. Ji, J. Hou, X. Li, and Q. Du, “Learning sensor-specific spatial-spectral features of hyperspectral images via convolutional neural networks,” IEEE Trans. Geosci. Remote. Sens. 55(8), 4520–4533 (2017). [CrossRef]

38. Y. Zhang, Y. Yuan, Y. Feng, and X. Lu, “Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection,” IEEE Trans. Geosci. Remote. Sens. 57(8), 5535–5548 (2019). [CrossRef]

39. X. Lu, W. Zhang, and X. Li, “A hybrid sparsity and distance-based discrimination detector for hyperspectral images,” IEEE Trans. Geosci. Remote. Sens. 56(3), 1704–1717 (2018). [CrossRef]

40. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. CVPR 2009, Miami, Florida, USA, June, 2009, pp. 248–255.

41. W. Han, R. Feng, L. Wang, and Y. Cheng, “A semi-supervised generative framework with deep learning features for high-resolution remote sensing image scene classification,” ISPRS-J. Photogramm. Remote Sens. 145, 23–43 (2018). [CrossRef]

42. Y. Yi and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” pp. 270–279.

43. G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, “AID: A benchmark data set for performance evaluation of aerial scene classification,” IEEE Trans. Geosci. Remote. Sens. 55(7), 3965–3981 (2017). [CrossRef]

44. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2005, San Diego, CA, USA, June, 2005, pp. 886–893.

45. S. Bhagavathy and B. S. Manjunath, “Modeling and detection of geospatial objects using texture motifs,” IEEE Trans. Geosci. Remote. Sens. 44(12), 3706–3715 (2006). [CrossRef]

46. D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis. 60(2), 91–110 (2004). [CrossRef]

47. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2005, San Diego, CA, USA, June, 2005, pp. 886–893.

48. V. Risojević and Z. Babić, “Fusion of global and local descriptors for remote sensing image classification,” IEEE Geosci. Remote Sens. Lett. 10(4), 836–840 (2013). [CrossRef]

49. Q. Zhu, Y. Zhong, B. Zhao, G.-S. Xia, and L. Zhang, “Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery,” IEEE Geosci. Remote Sens. Lett. 13(6), 747–751 (2016). [CrossRef]

50. X. Lu, X. Zheng, and Y. Yuan, “Remote sensing scene classification by unsupervised representation learning,” IEEE Trans. Geosci. Remote. Sens. 55(9), 5148–5157 (2017). [CrossRef]

51. S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemom. Intell. Lab. Syst. 2(1-3), 37–52 (1987). [CrossRef]

52. J. A. Hartigan and M. A. Wong, “A k-means clustering algorithm,” J. R. Stat. Soc. Ser. C-Appl. Stat. 28, 100–108 (1979).

53. B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?” Vision Res. 37(23), 3311–3325 (1997). [CrossRef]

54. G. Sheng, W. Yang, T. Xu, and H. Sun, “High-resolution satellite scene classification using a sparse coding based multiple feature combination,” Int. J. Remote Sens. 33(8), 2395–2412 (2012). [CrossRef]

55. S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised deep change vector analysis for multiple-change detection in vhr images,” IEEE Trans. Geosci. Remote. Sens. 57(6), 3677–3693 (2019). [CrossRef]

56. J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convolutional auto-encoders for hierarchical feature extraction,” in Int. Conf. Artif. Neural Netw. ICANN 2011, Espoo, Finland, June, 2011, pp. 52–59.

57. Z. Fan, D. Bo, and Z. Liangpei, “Saliency-guided unsupervised feature learning for scene classification,” IEEE Trans. Geosci. Remote. Sens. 53(4), 2175–2184 (2015). [CrossRef]

58. D. Bo, X. Wei, W. Jia, Z. Lefei, Z. Liangpei, and T. Dacheng, “Stacked convolutional denoising auto-encoders for feature representation,” IEEE Trans. Cybern. 47(4), 1017–1027 (2017). [CrossRef]

59. G. Cheng, P. Zhou, J. Han, L. Guo, and J. Han, “Auto-encoder-based shared mid-level visual dictionary learning for scene classification using very high resolution remote sensing images,” Int. J. Comput. Vis. 9(5), 639–647 (2015). [CrossRef]

60. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2015, Boston, MA, USA, June, 2015, pp. 1–9.

61. M. Guo, Y. Zhao, C. Zhang, and Z. Chen, “Fast object detection based on selective visual attention,” Neurocomputing 144, 184–197 (2014). [CrossRef]

62. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2016, Las Vegas, NV, USA, June, 2016, pp. 770–778.

63. N. He, L. Fang, S. Li, A. Plaza, and J. Plaza, “Remote sensing scene classification using multilayer stacked covariance pooling,” IEEE Trans. Geosci. Remote. Sens. 56(12), 6899–6910 (2018). [CrossRef]

64. M. Corbetta and G. L. Shulman, “Control of goal-directed and stimulus-driven attention in the brain,” Nat. Rev. Neurosci. 3(3), 201–215 (2002). [CrossRef]

65. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. 27th Int. Conf. Mach. Learn., ICML 2015, Lille, France, July, 2015, pp. 2048–2057.

66. F. Wang, M. Jiang, C. Qian, S. Yang, C. Y. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2017, Honolulu, HI, USA, July, 2017, pp. 6450–6458.

67. T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. Conf. Empir. Methods in Nat. Lang. Process., EMNLP 2015, Lisbon, Portugal, September, 2015, pp. 1412–1421.

68. B. Fang, Y. Li, H. Zhang, and J. C.-W. Chan, “Hyperspectral images classification based on dense convolutional networks with spectral-wise attention mechanism,” Remote Sens. 11(2), 159–163 (2019). [CrossRef]

69. W. Ma, Q. Yang, Y. Wu, W. Zhao, and X. Zhang, “Double-branch multi-attention mechanism network for hyperspectral image classification,” Remote Sens. 11(11), 1307–1328 (2019). [CrossRef]

70. X. Mei, E. Pan, Y. Ma, X. Dai, J. Huang, F. Fan, Q. Du, H. Zheng, and J. Ma, “Spectral-spatial attention networks for hyperspectral image classification,” Remote Sens. 11(8), 963–981 (2019). [CrossRef]

71. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent., ICLR 2015, San Diego, CA, USA, May, 2015, pp. 1–9.

72. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. NIPS 2012, Lake Tahoe, Nevada, USA, December, 2012, pp. 84–90.

73. Y. Yu and F. Liu, “Dense connectivity based two-stream deep feature fusion framework for aerial scene classification,” Remote Sens. 10(7), 1158–1172 (2018). [CrossRef]

74. T. Lin, A. Roy Chowdhury, and S. Maji, “Bilinear convolutional neural networks for fine-grained visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 40(6), 1309–1322 (2018). [CrossRef]

75. J. Fu, H. Zheng, and T. Mei, “Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CVPR 2017, Honolulu, HI, USA, July, 2017, pp. 4438–4446.

76. Y. Zhu, R. Li, Y. Yang, and N. Ye, “Learning cascade attention for fine-grained image classification,” Neural Netw. 122, 174–182 (2020). [CrossRef]

77. L. Fan, T. Zhang, X. Zhao, H. Wang, and M. Zheng, “Deep topology network: A framework based on feedback adjustment learning rate for image classification,” Adv. Eng. Inform. 42, 100935 (2019). [CrossRef]

78. X. Bian, C. Chen, L. Tian, and Q. Du, “Fusing local and global features for high-resolution scene classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 10(6), 2889–2901 (2017). [CrossRef]

79. H. Sun, S. Li, X. Zheng, and X. Lu, “Remote sensing scene classification by gated bidirectional network,” IEEE Trans. Geosci. Remote. Sens. 58(1), 82–96 (2020). [CrossRef]

80. R. M. Anwer, F. S. Khan, J. van de Weijer, M. Molinier, and J. Laaksone, “Binary patterns encoded convolutional neural networks for texture recognition and remote sensing scene classification,” ISPRS-J. Photogramm. Remote Sens. 138, 74–85 (2018). [CrossRef]

81. J. Wang, W. Liu, L. Ma, H. Chen, and L. Chen, “Iorn: An effective remote sensing image scene classification framework,” IEEE Geosci. Remote Sens. Lett. 15(11), 1695–1699 (2018). [CrossRef]

82. E. Othman, Y. Bazi, N. Alajlan, H. Alhichri, and F. Melgani, “Using convolutional features and a sparse autoencoder for land-use scene classification,” Int. J. Remote Sens. 37(10), 2149–2167 (2016). [CrossRef]

83. B. Zhang, Y. Zhang, and S. Wang, “A lightweight and discriminative model for remote sensing scene classification with multidilation pooling module,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 12(8), 2636–2653 (2019). [CrossRef]

Value	$b_{h}$ = $\frac{1}{3}$	$b_{h}$ = $\frac{2}{3}$	$b_{h}$ = $1$
$b_{l}$ = $\frac{1}{4}$	95.30 $\pm$ 0.14	96.64 $\pm$ 0.09	96.38 $\pm$ 0.14
$b_{l}$ = $\frac{1}{3}$	None	96.39 $\pm$ 0.4	95.84 $\pm$ 0.14

Models/Ratio	OA(80%)
Fine-tuning DenseNet-121	92.57 $\pm$ 0.12
Fine-tuning DenseNet-169	93.33 $\pm$ 0.31
Fine-tuning ResNet-101	92.12 $\pm$ 0.22
Fine-tuning ResNet-152	93.02 $\pm$ 0.19
Fine-tuning GoogleNet [32]	87.15 $\pm$ 0.45
Fine-tuning VGGNet16 [32]	87.45 $\pm$ 0.45
Fine-tuning Inception-V3	89.15 $\pm$ 0.35
Fine-tuning InceptionResNet-V2	91.27 $\pm$ 0.12
VGG-VD-16 [43]	89.12 $\pm$ 0.35
ARCNet-VGGNet16 [32]	92.70 $\pm$ 0.35
CAE-CNN_single(without multiple pretrain)	95.10 $\pm$ 0.12
CAE-CNN_single	96.10 $\pm$ 0.35
CAE-CNN_multiple(without multiple pretrain)	95.40 $\pm$ 0.16
CAE-CNN_multiple	96.64 $\pm$ 0.09

Models/Ratio	OA(50%)	OA(80%)
Fine-tuning DenseNet-121	96.59 $\pm$ 0.33	97.57 $\pm$ 0.48
Fine-tuning DenseNet-169	96.92 $\pm$ 0.15	97.88 $\pm$ 0.12
Fine-tuning ResNet-101	95.97 $\pm$ 0.13	96.68 $\pm$ 0.32
Fine-tuning ResNet-169	96.63 $\pm$ 0.42	97.42 $\pm$ 0.23
Fine-tuning Inception-V3	95.38 $\pm$ 0.21	97.23 $\pm$ 0.18
Fine-tuning InceptionResNet-V2	96.27 $\pm$ 0.12	98.27 $\pm$ 0.33
Combing Scenatios I and II[29]	98.49
D-CNN [33]	98.93 $\pm$ 0.10
Fusion by Addition [30]	97.42 $\pm$ 0.17
CNN-NN [82]	97.19
ARCNet-VGG16 [32]	96.91 $\pm$ 0.14	99.12 $\pm$ 0.40
GBNet [79]	97.05 $\pm$ 0.19	98.57 $\pm$ 0.48
CAE-CNN-single(Without multiple pre-train)	97.52 $\pm$ 0.19	99.28 $\pm$ 0.24
CAE-CNN-single	98.92 $\pm$ 0.10	99.71 $\pm$ 0.10
CAE-CNN-multiple(Without multiple pre-train)	98.03 $\pm$ 0.17	99.36 $\pm$ 0.24
CAE-CNN-multiple	98.94 $\pm$ 0.39	99.91 $\pm$ 0.03

Models/Ratio	OA(20%)	OA(50%)
Fine-tuning Inception-V3	87.18 $\pm$ 0.12	91.37 $\pm$ 0.25
Fine-tuning InceptionResNet-V2	89.05 $\pm$ 0.17	92.53 $\pm$ 0.13
ARCNet-VGG16[32]	88.75 $\pm$ 0.40	93.10 $\pm$ 0.55
Pre-trained VGGNet-16+SVM	89.33 $\pm$ 0.23	96.04 $\pm$ 0.13
Fine-tuning RseNet-101	91.76 $\pm$ 0.32	94.67 $\pm$ 0.36
Fine-tuning RseNet-152	92.53 $\pm$ 0.32	95.23 $\pm$ 0.16
Fine-tuning DenseNet-121	92.87 $\pm$ 0.62	95.65 $\pm$ 0.22
Fine-tuning DenseNet-169	93.22 $\pm$ 0.34	96.31 $\pm$ 0.27
Binary Patterns encoded CNNs [80]		95.73 $\pm$ 0.16
Dense connectivity based two-steam fusion [73]		95.99 $\pm$ 0.35
D-CNN [33]	90.82 $\pm$ 0.16	96.89 $\pm$ 0.10
SE-MDPMNet [83]	94.68 $\pm$ 0.17	97.14 $\pm$ 0.15
HW-CNN [25]		96.98 $\pm$ 0.33
CAE-CNN-single(without multiple pretrain)	96.14 $\pm$ 0.17	97.24 $\pm$ 0.23
CAE-CNN-single	96.87 $\pm$ 0.13	97.81 $\pm$ 0.35
CAE-CNN-multiple(without multiple pretrain)	96.34 $\pm$ 0.15	97.79 $\pm$ 0.22
CAE-CNN-multiple	97.07 $\pm$ 0.03	98.01 $\pm$ 0.16

Models/Ratio	OA(10%)	OA(20%)
Fine-tuning Inception-V3	88.14 $\pm$ 0.26	91.43 $\pm$ 0.36
Fine-tuning InceptionResNet-V2	89.35 $\pm$ 0.37	92.21 $\pm$ 0.31
IOR4-VGG16 [81]	87.83 $\pm$ 0.16	91.30 $\pm$ 0.17
D-CNN [33]	89.22 $\pm$ 0.50	91.89 $\pm$ 0.22
Deeper ConvNet+Context Aggregation[25]		93.47 $\pm$ 0.26
DenseNet-121 [13]	91.06 $\pm$ 0.61	93.33 $\pm$ 0.55
DenseNet-169	92.41 $\pm$ 0.27	94.23 $\pm$ 0.10
ResNet-152	91.11 $\pm$ 0.31	93.23 $\pm$ 0.15
ResNet-101	90.69 $\pm$ 0.45	92.44 $\pm$ 0.22
Hygra: an ensemble of CNNs [13]	92.44 $\pm$ 0.34	94.51 $\pm$ 0.21
SE-MDPMNet [83]	91.80 $\pm$ 0.07	94.11 $\pm$ 0.0.03
HW-CNN [25]		94.38 $\pm$ 0.16
CAE-CNN-single(without multiple pretrain)	93.23 $\pm$ 0.17	95.20 $\pm$ 0.23
CAE-CNN-single	93.50 $\pm$ 0.21	95.65 $\pm$ 0.12
CAE-CNN-multiple(without multiple pretrain)	93.34 $\pm$ 0.11	95.60 $\pm$ 0.05
CAE-CNN-multiple	93.65 $\pm$ 0.10	95.74 $\pm$ 0.02

Ensemble model with cascade attention mechanism for high-resolution remote sensing image scene classification

Abstract

1. Introduction

2. Related work

2.1 Attention mechanism

3. The proposed method: CAE-CNN

3.1. Ensemble CNN feature extractor

3.2. Cascade attention mechanism

3.2.1. Spatial confusion attention

3.2.2. Cross branch attention

3.2.3. Branch fusion attention

3.3. Pseudo code of CAE-CNN

4. Experiment

4.1. Evaluation metrics

4.2. Parameter settings

4.3. Brief introduction of datasets

4.4. Experiment results and explaination

5. Conclusion

Funding

Acknowledgments

Disclosures

References

Cited By

Figures (20)

Tables (6)

Equations (28)

Optics Express

model	Each Step Time(s)	Model Size(MB)
CAE-CNN-single	1.63	180
CAE-CNN-multiple	2.04	300
ResNet-101	1.74	170
ResNet-152	2.59	230
DenseNet-121	1.10	30.8
DenseNet-169	1.22	54.7