Minimalistic fully convolution networks (MFCN): pixel-level classification for hyperspectral image with few labeled samples

Open Access

Abstract

Most of the existing deep learning methods for hyperspectral image (HSI) classification use pixel-wise or patch-wise classification. In this paper, we propose an image-wise classification method, where the network input is the original hyperspectral cube rather than the spectral curve of each pixel (i.e., pixel-wise) or the neighbor region of each pixel (i.e., patch-wise). Specifically, we propose a minimalistic fully convolution network (MFCN) and a semi-supervised loss function, which can perform pixel-level classification for HSI with few labeled samples. The comparison experiments demonstrated the effectiveness of our method, using three new benchmark HSI datasets (WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu) with a wavelength range from 400 to 1000 nm. In the comparison experiments, we randomly selected 25 labeled pixels from each class for training, equivalent to only 0.11%, 0.16%, and 0.14% of all labeled pixels for the three datasets, respectively. In addition, through ablation studies and theoretical analysis, we verified and analyzed the effectiveness and superiority of our design choices.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

A hyperspectral image (HSI) is a kind of image composed of multiple continuous spectral images of the same area, containing both spatial and spectral information of objects. HSI data forms a three-dimensional (3D) data cube that can be regarded as a collection of two-dimensional (2D) images in different bands or a collection of 1D spectral curves at different spatial positions. Due to its high spectral resolution and large number of spectral bands, HSI can improve the category separability of objects that are difficult to distinguish in standard optical, infrared, or multispectral images. Therefore, HSIs have attracted wide scholarly attention and have been applied in many fields [1–8]. Research on hyperspectral image processing can also promote the development of hyperspectral sensors in the field of optical remote sensing.

One of the most vibrant fields in HSI research is HSI classification, which aims at assigning each pixel to one category based on its spectral characteristics [9,10]. In early studies, HSI classification mainly relied on traditional machine learning methods, such as support vector machines (SVM) [11,12], multinomial logistic regression (MLR) [13,14], and k-nearest neighbors (KNN) [15]. However, these classifiers are highly dependent on hand-crafted features, so it is difficult for them to achieve good performance on nonlinear data such as HSI. In contrast, deep learning methods use neural networks to automatically learn a feature space that is tailored to the target task. Due to this superior feature representation learning ability, deep learning methods have revolutionized the way image data are processed.

Although deep learning methods have achieved a leading position in the field of image processing, it still remains challenging to train well-generalized models for HSI. The application of deep learning to HSI classification faces three difficulties: (1) Compared with RGB images with only three channels, hyperspectral images have hundreds of channels. (2) Compared with visible image data sets containing tens of thousands of images, a hyperspectral data set usually contains only one hyperspectral image with a very large spatial size. (3) The ground object information contained in different hyperspectral images varies greatly, and the model learned from one dataset cannot be directly applied to other datasets.

To cope with the huge size of HSI and the relatively small number of training datasets, most deep learning models for HSI pre-process the original HSI dataset mainly through two schemes:

(1) Pixel-wise scheme (as shown in Fig. 1(a)): imitating acoustic signal datasets, the spectral curve of a single pixel is regarded as a one-dimensional (1-D) signal, and a single hyperspectral image is regarded as a collection of 1D signals. For example, Mou et al. [16], Wu et al. [17], and Hao et al. [18] all treated each pixel of an HSI as a spectral sequence and used a recurrent neural network (RNN) or convolutional recurrent neural network (CRNN) to determine the pixel labels. However, the pixel-wise scheme gives no consideration to the spatial information.

Fig. 1. Different data pre-processing schemes of deep learning models for HSI classification: (a) Pixel-wise scheme. (b) Patch-wise scheme. (c) Image-wise scheme.

(2) Patch-wise scheme (as shown in Fig. 1(b)): imitating RGB image datasets, the original HSI is divided into many small image patches with the same spatial size. At the same time, PCA is used to reduce the spectral dimension, and the first three principal components are used to form three-channel data. For example, Zhong et al. [1] addressed the HSI classification task with a generative adversarial network and conditional random field (GAN-CRF)-based framework, in which HSI patches with a $9 \times 9$ spatial size were taken as the input of the GAN and the three most prominent PCA channels were used to facilitate the mean-field approximation of the dense conditional random field (CRF). Xue et al. [19] proposed a novel attention-based second-order pooling network (A-SPN) and discussed the influence of different patch sizes on the network performance. Although the patch-wise scheme makes use of spectral and spatial information, it also has several drawbacks: (a) Adjacent patches have overlapping pixels, leading to redundant calculations, training–test information leakage, and overly optimistic experimental results [20,21]. (b) The size of the patches limits the spatial information involved in the processing, thereby limiting the classification accuracy. (c) The commonly used PCA dimension-reduction method damages some of the spectral dimension information. (A code sketch of this patch-wise pre-processing is given below.)
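For concreteness, the following sketch illustrates the patch-wise pre-processing described above (PCA to three channels, then extraction of a square neighborhood around each pixel). This is not part of the method proposed in this paper; the function name, default parameters, and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def patchwise_preprocess(hsi, patch_size=9, n_components=3):
    """Illustrative patch-wise pre-processing: PCA to `n_components` channels,
    then extraction of a (patch_size x patch_size) neighborhood around each pixel.
    `hsi` is an (H, W, L) cube; returns an array of shape (H*W, patch, patch, n_components).
    For illustration only; a full-size HSI would normally be processed in batches."""
    H, W, L = hsi.shape
    flat = hsi.reshape(-1, L).astype(np.float64)
    pcs = PCA(n_components=n_components).fit_transform(flat).reshape(H, W, n_components)
    pad = patch_size // 2
    padded = np.pad(pcs, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    patches = np.empty((H * W, patch_size, patch_size, n_components), dtype=pcs.dtype)
    idx = 0
    for i in range(H):
        for j in range(W):
            patches[idx] = padded[i:i + patch_size, j:j + patch_size, :]
            idx += 1
    return patches
```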

1.1 Motivation

We seek to design a network that can fully and completely utilize the information of the entire HSI without the need for data preprocessing steps. We refer to such a deep learning scheme for HSI classification as an image-wise scheme (as shown in Fig. 1(c)).

First, we need to determine how to design the network structure. HSI classification is the process of assigning labels to all the pixels within the HSI, which can be treated as a semantic segmentation task. Fully convolutional networks (FCNs) are powerful models for image semantic segmentation because they can be trained end-to-end and pixel-to-pixel [22]. In addition, they can take an input of arbitrary size and produce a correspondingly sized output with efficient inference and learning. Kim et al. [23] investigated the usage of a 2D-FCN for single-image segmentation. By switching between different loss functions, the network could adapt to unsupervised and semi-supervised tasks. However, when we applied the network proposed in Ref. [23] to HSI, we found that it did not achieve very good results. Thus, in this paper, we focus on how to design effective FCN structures for HSI classification without any pre-training or data preprocessing. It should be clarified that we only discuss supervised and semi-supervised cases in this paper.

Second, one of the major limitations to training an excellent fully supervised deep neural network is the labeling process, which usually requires the efforts of experts and is often difficult, expensive, or time consuming [24]. There are no huge HSI datasets available comparable to those of natural images. Since the training samples of an HSI are usually only a few regions of all its pixels, we need to discuss how to make full use of the data, including both labeled and unlabeled samples. It is generally believed that using unlabeled data can improve the classification performance, but how to use these unlabeled data remains to be further studied.

1.2 Contributions

The main contributions of this paper are threefold.

  • (1) We propose a minimalistic fully convolution network (MFCN) structure, which can achieve good results for HSI classification. The MFCN has a simple structure and consists of convolution or transposed convolution layers. Through theoretical analysis and an ablation study, we explain why we selected these two types of convolution layers instead of other types.
  • (2) We propose a semi-supervised loss function to train the MFCN. We introduce a spatial loss, which constrains the differences between adjacent pixels, to cooperate with the conventional supervised cross-entropy loss. The semi-supervised loss is shown to be superior to the supervised loss and other pseudo-labeling-based semi-supervised losses [23,25].
  • (3) Based on the affine transformation properties of neural networks, we propose a series of concepts from the perspective of the feature space transformation to explain the different loss functions. This helps us to understand the training process of the network and the determination of the loss function.

The rest of this article is organized as follows. In Section 2, we review previous work on FCNs and semi-supervised learning related to this paper. In Section 3, we describe our proposed MFCN and semi-supervised loss function. In Section 4, we present the comparison experiments and ablation study. In Section 5, we discuss how our MFCN works. Finally, we conclude our work in Section 6.

2. Related work

2.1 Fully convolutional network

FCNs can efficiently learn to make predictions for per-pixel tasks, which avoids the limitations of patch-wise feature learning in recent deep-learning-based methods [22,26].

FCNs have been widely adopted in HSI classification methods. For example, Jiao et al. [26] used FCN-8s to excavate spatial structural information from the whole HSI and avoided the shortcomings of the patch-wise scheme. However, the classification method used had several inadequacies: (1) the FCN-8s needed to be pre-trained on natural image datasets, which was time-consuming and not practical; (2) since the FCN-8s was pre-trained on RGB images, only the first three principal components from PCA were used as the input when FCN-8s was applied to HSI; (3) the strides of the convolutional and deconvolutional layers in FCN-8s were all larger than $1$, which meant that the convolution kernel overstepped the correlation of adjacent pixels when extracting spatial information, yet the closest relationship between pixels in HSI usually occurs between adjacent pixels due to the very low spatial resolution.

Furthermore, Li et al. [27] developed a multilayer 2D FCN to extract pixel-level deep features from HSI and then used an extreme learning machine (ELM) to classify the extracted features. However, to reduce the data dimension and improve the calculation efficiency, only the first principal component from PCA was used, which resulted in a significant waste of spectral dimension information.

2.2 Semi-supervised classification

Semi-supervised classification aims to train a classifier on both labeled and unlabeled data to achieve better results than a supervised classifier trained only on labeled data. Deep semi-supervised learning is a fast-growing field with a range of practical applications. According to model designs and loss functions, deep semi-supervised methods can be categorized as generative methods, consistency regularization methods, graph-based methods, pseudo-labeling methods, and hybrid methods [24]. We refer interested readers to Ref. [24], which provides a comprehensive overview of deep semi-supervised learning methods. In this paper, we will focus on pseudo-labeling methods.

Lee [25] proposed a simple and efficient way to train neural networks in a semi-supervised fashion, in which the labeled and unlabeled data are trained simultaneously in the usual supervised manner with a cross-entropy loss. For unlabeled data, pseudo-labels, i.e., the classes corresponding to the channels with the maximum network output (recomputed at every weight update), are used as if they were true labels. Based on the pseudo-label idea, unlabeled data can also be exploited in other ways. For example, Kim et al. [23] used the pseudo-labels of unlabeled data to control the spatial-smoothness constraint [28] in unsupervised image segmentation.

Based on a pseudo-labeling method, Wu et al. [17] proposed a semi-supervised framework for HSI classification. The model was pre-trained on pseudo-labels (cluster labels) of unlabeled data and fine-tuned based on the limited labeled data. However, Wu et al. [17] used deep convolutional recurrent neural networks by treating each pixel as a spectral sequence, which abandoned the spatial information of the HSI. Beyond this, compared with other semi-supervised methods, such as generative adversarial networks [1,29–31], the application of pseudo-label methods has not been thoroughly studied for HSI, although it deserves more research and development.

3. Methodology

Figure 2 depicts the flowchart of our classification framework. The original hyperspectral image (HSI) is fed into the MFCN, and different bands are regarded as different input channels. For example, when the MFCN uses the WHU-Hi-LongKou dataset with 270 bands as input, the input layer has 270 input channels. The output of the MFCN is a new three-dimensional (3D) data cube, based on which the label of each pixel is predicted. Then, the spatial-smoothness loss is calculated based on the output, and the usual supervised loss function is calculated by combining the output with the training set. Through the backpropagation of the overall loss function, the network is optimized. In Section 4.3, we will show the effectiveness and superiority of our network structure and loss function through the ablation study.

Fig. 2. Flowchart of our method. The original hyperspectral image (HSI) is fed into the MFCN, and different bands are regarded as different input channels. The output of MFCN is a new three-dimensional (3D) data cube, based on which the label of each pixel is predicted. Then, the spatial-smoothness loss is calculated on the basis of the output, and the usual supervised loss function value is calculated by combining the output with the training set. Through the backpropagation of the overall loss function, the network is optimized.

3.1 Network structure

Our proposed MFCN requires the input and output to have the same spatial size. Figure 3(a)–(d) show four typical examples of our proposed MFCN, which contain four convolution layers or transposed convolution layers and the corresponding normalization layers and activation layers. In the later ablation study (Section 4.3.1), we will prove that the network structures in Fig. 3(a)–(d) can achieve almost equivalent performance. Figure 3(e) shows the 2D-FCN of Ref. [23], which will also be compared with our MFCN in the later ablation study (Section 4.3.1).

Fig. 3. Four typical examples of our proposed MFCN: (a) four same-padding convolution layers, (b) two combinations involving convolution–transposed convolution layers, (c) two same-padding convolution layers followed by one combination involving convolution–transposed convolution layers, and (d) one combination involving convolution–transposed convolution layers followed by two same-padding convolution layers. (e) 2D-FCN of Ref. [23] consisting of three same-padding convolution layers and one no-padding convolution layer with a unit kernel. In addition, it should be noted that the normalization and activation layer order of the two networks are different.

A convolution layer’s output shape is affected by the shape of its input as well as the choice of the kernel shape, padding, and strides [32]. In addition to these parameters, we will also introduce the network depth, width, normalization layers, and activation function. In Section 4.3.2, we investigate the effect of the kernel size in our MFCN on different datasets.

3.1.1 Stride

As mentioned in Section 2.1, if the stride of a convolutional layer is larger than $1$, the convolution kernel cannot extract spatial information from the adjacent pixels. Since HSI have low spatial resolution, the closest relationship between pixels usually occurs between adjacent pixels. Thus, we restrict the stride of our used convolutional layers to $1$.

3.1.2 Padding

The convolution with a kernel of size $k\ (k>1)$ will decrease the output size with respect to the input size. For a convolution layer, one simple way to keep the output and input the same shape is using the half (same) padding [32], i.e., using a padding of size $\lfloor k/2 \rfloor$ to make the input and output of the layer the same size. To avoid introducing noise, we used zero as the padding value.

If the direct convolution has no padding, another way to recover the shape of the initial input is to apply a no-padding transposed convolution layer after the no-padding convolution layer. In fact, with the same stride, padding, and kernel parameters, the convolution module composed of a direct convolution and a transposed convolution can ensure that the input and output have the same shape [32]. In this paper, we only consider the no-padding case when using transposed convolution layers.
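A quick PyTorch check of the two shape-preserving options just described; the 270-band channel count is borrowed from the WHU-Hi-LongKou configuration, and the spatial size is a toy value:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 270, 64, 64)          # toy input: (batch, bands, H, W)

# (a) half ("same") padding: a single stride-1 convolution keeps the spatial size
same_conv = nn.Conv2d(270, 270, kernel_size=3, stride=1, padding=1)
print(same_conv(x).shape)                # torch.Size([1, 270, 64, 64])

# (b) no padding: a 3x3 convolution shrinks H and W by 2; a no-padding
# transposed convolution with the same kernel and stride restores them
conv = nn.Conv2d(270, 270, kernel_size=3, stride=1, padding=0)
tconv = nn.ConvTranspose2d(270, 270, kernel_size=3, stride=1, padding=0)
print(tconv(conv(x)).shape)              # torch.Size([1, 270, 64, 64])
```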

3.1.3 Network depth and width

Due to the low spatial resolution and high spectral resolution of HSI, spectral information usually contributes more to category separability than spatial information. Since we use the channel axis to represent the spectral dimension, we tend to make the network structure as wide as possible, that is, with as many channels as possible. However, the network depth can be reduced correspondingly. When the network width is large enough, not too many layers are needed to ensure the efficiency and accuracy of the network [33,34]. In this paper, we set the network width as the number of bands of the input HSI and set the network depth as four.

3.1.4 Normalization layers and activation function

The essential operations of neural networks are affine transformations [32,35]. Through convolution layers, a vector space is transformed into another vector space by a linear transformation and translation. The nonlinear transformation is realized by the nonlinear activation function. In this paper, we choose rectified linear units (ReLU) as the activation function. The ReLU function can effectively prevent gradient explosion and return all negative values to zero. In addition, we follow each convolution layer with a normalization layer.
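Putting the above design choices together, the structure of Fig. 3(a) can be sketched in PyTorch as follows: the channel width equals the number of input bands, every convolution has stride 1 and half padding, and batch normalization is placed before ReLU. This is a minimal sketch, not the authoritative implementation; the per-dataset kernel sizes and channel counts used in the experiments are given in Table 2.

```python
import torch.nn as nn

class MFCN(nn.Module):
    """Sketch of the four-layer MFCN of Fig. 3(a): stride-1, same-padding
    convolutions whose channel width equals the number of input bands,
    each followed by batch normalization and then ReLU (BN before ReLU)."""
    def __init__(self, bands, kernel_size=3, depth=4):
        super().__init__()
        pad = kernel_size // 2
        layers = []
        for _ in range(depth):
            layers += [
                nn.Conv2d(bands, bands, kernel_size, stride=1, padding=pad),
                nn.BatchNorm2d(bands),
                nn.ReLU(inplace=True),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, bands, H, W)
        return self.net(x)         # output keeps the same shape as the input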

3.2 Loss function

For ease of description, we let one 3D matrix $X \in \mathbb {R}^{W \times H \times L}$ denote the raw data cube of the HSI, which means that the HSI contains $W \times H$ pixels and $L$ spectral bands. We select $X$ as our network input, and then we denote the network operation as $f$ and the network output as $F \in \mathbb {R}^{W \times H \times L}$, which means that the number of output channels is also $L$. For each pixel, the channel number with the maximum network output is the class prediction: $p^{(i,j)}\!=\underset {k}{\arg \max } \, F^{(i,j)}_{k}$.
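In code, this per-pixel prediction is simply an argmax over the channel axis of the output (a toy example with a (batch, L, H, W) tensor layout; zero-based class indices are assumed):

```python
import torch

output = torch.randn(1, 270, 64, 64)   # toy network output (batch, L, H, W)
pred = output.argmax(dim=1)            # (batch, H, W); per-pixel class index
```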

3.2.1 Fully supervised loss

Supposing there are $B$ classes in the original HSI, the true label of the $i$th labeled pixel is denoted by $t_{i} \in \{1,\ldots,B\}$. Then the one-hot code of the true label $t_{i}$ is denoted by $\boldsymbol {y}_{i} \in \mathbb {R}^{1 \times B}$, which is defined as:

$$\boldsymbol{y}_{i}=\left({y}_{i}^{(1)}, {y}_{i}^{(2)},\ldots, {y}_{i}^{(B)}\right)$$
in which
$$ y_{i}^{(b)}=\left\{\begin{array}{ll} 1 & \text{if } b = t_{i} \\ 0 & \text{if } b \neq t_{i} \end{array}\right. $$

If the network is trained in a fully supervised manner, the cross-entropy loss function is as follows:

$$\mathcal{L}_{\text{su}}=\frac{1}{N} \sum_{i=1}^{N} \ell_{\text{cross}} \left(\boldsymbol{y}_{i}, \boldsymbol{f}_{i}\right),$$
where $N$ is the number of labeled pixels in the network input $X$, and $\boldsymbol {f}_{i}$ is the network output for the $i$th labeled pixel.
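A sketch of Eq. (2) in PyTorch, assuming the cross-entropy is computed over the softmax of the $L$ output channels and that unlabeled pixels are marked with the index -1 (a convention of this sketch, not of the paper):

```python
import torch
import torch.nn.functional as F

def supervised_loss(output, labels):
    """Cross-entropy averaged over labeled pixels only (Eq. (2)).
    output: (1, L, H, W) network output; labels: (1, H, W) long tensor with
    class indices for labeled pixels and -1 for unlabeled ones."""
    return F.cross_entropy(output, labels, ignore_index=-1)
```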

3.2.2 Unsupervised loss

In Ref. [25], a kind of unsupervised loss based on pseudo-labels was proposed:

$$\mathcal{L}_{\text{un}}=\frac{1}{M} \sum_{j=1}^{M} \ell_{\text{cross}} \left(\boldsymbol{y}^{'}_{j}, \boldsymbol{f}_{j}\right),$$
where $M$ is the number of unlabeled pixels in the input $X$, $\boldsymbol {y}^{'}_{j}$ is the one-hot code of the pseudo-label for the $j$th unlabeled pixel, and $\boldsymbol {f}_{j}$ is the network output for that pixel. For unlabeled samples, the class predictions, also called pseudo-labels, are used as if they were true labels.
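For comparison, a sketch of the pseudo-label loss of Eq. (3); the boolean mask marking labeled pixels is an assumed convention of this sketch:

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(output, labeled_mask):
    """Cross-entropy of unlabeled pixels against their own argmax predictions
    (pseudo-labels), Eq. (3). output: (1, L, H, W); labeled_mask: (1, H, W)
    boolean tensor, True where a true label exists."""
    pseudo = output.argmax(dim=1).detach()            # current class predictions
    pseudo = pseudo.masked_fill(labeled_mask, -1)     # exclude labeled pixels
    return F.cross_entropy(output, pseudo, ignore_index=-1)
```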

3.2.3 Spatial-smoothness loss

In Ref. [23], a spatial-smoothness loss function [28] was introduced for unsupervised image segmentation. It was defined to minimize the neighborhood differences of the network output, which can be obtained by

$$\mathcal{L}_{\text{spa}}= \mathcal{S}_{W}+\mathcal{S}_{H}.$$
$\mathcal {S}_{W}$ represents the horizontal neighborhood differences:
$$\mathcal{S}_{W}= \frac{1}{(W-1) \times H} \sum_{i=1}^{W-1} \sum_{j=1}^{H} \Vert \boldsymbol{f}_{(i+1, j)}-\boldsymbol{f}_{(i, j)}\Vert _{1},$$
and $\mathcal {S}_{H}$ represents the vertical neighborhood differences:
$$\mathcal{S}_{H}=\frac{1}{W\times(H-1)} \sum_{i=1}^{W} \sum_{j=1}^{H-1} \Vert \boldsymbol{f}_{(i , j+1)}-\boldsymbol{f}_{(i, j)}\Vert _{1},$$
where $\Vert \cdot \Vert _{1}$ is the L1-norm operator, and $\boldsymbol {f}_{(i , j)}$ is the network output for the pixel at location $(i , j)$.
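A sketch of Eqs. (4)–(6): the per-pixel L1 norm of the difference between adjacent output vectors, averaged over the horizontal and vertical neighbor pairs (a (1, L, H, W) tensor layout is assumed):

```python
import torch

def spatial_smoothness_loss(output):
    """Spatial-smoothness loss of Eqs. (4)-(6): mean L1 difference between
    horizontally and vertically adjacent output vectors.
    output: (1, L, H, W) network output."""
    s_h = (output[:, :, 1:, :] - output[:, :, :-1, :]).abs().sum(dim=1).mean()
    s_w = (output[:, :, :, 1:] - output[:, :, :, :-1]).abs().sum(dim=1).mean()
    return s_w + s_h
```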

Combining the unsupervised loss and spatial-smoothness loss, Kim et al. [23] proposed an improved loss function to train an FCN for unsupervised image segmentation. It is defined as follows:

$$\mathcal{L}_{\text{un-improved}}=\mathcal{L}_{\text{un}}+\mu \mathcal{L}_{\text{spa}} \;\; \text{(in Ref. [23])},$$
where $\mu$ represents the weight for balancing the two constraints. It should be noted that $\mu$ is a constant weight.

3.2.4 Semi-supervised loss

If the network is trained in a semi-supervised manner, as in Ref. [25], the labeled and unlabeled data are both trained with the cross-entropy loss function. The overall semi-supervised loss function is defined as follows:

$$\mathcal{L}_{\text{semi}} = \mathcal{L}_{\text{su}} + \alpha (t) \mathcal{L}_{\text{un}} \;\; \text{(in Ref. [25])},$$
where $t$ is the epoch of the training process, and $\alpha (t)$ is a coefficient for balancing the supervised and unsupervised loss terms. According to Ref. [25], using unlabeled data in a supervised learning manner ($\mathcal {L}_{un}$) can regularize the network in such a way that the activations enter a saturation region. This can encourage an invariance or robustness of the representation for small variations of the input. However, as for the training process of our MFCN, the network input in every epoch is always the original HSI. Therefore, there is no need to add regularization constraints that constrain the network input variables. In other words, the unsupervised loss $\mathcal {L}_{\text {un}}$ is not suitable for our network.

Semi-supervised algorithms can be readily obtained by extending supervised or unsupervised learning algorithms [24]. Thus, Kim et al. [23] also proposed a semi-supervised loss for image segmentation with scribbles as the user input:

$$\mathcal{L}_{\text{semi}} = \mathcal{L}_{\text{su}} + \mu \mathcal{L}_{\text{un}} + \nu\mathcal{L}_{\text{spa}} \;\; \text{(in Ref. [23])}$$
where $\mu$ and $\nu$ are constant weights for balancing different constraints. The "scribbles" are simple lines inputted by the user to specify the boundary regions as well as the foreground/background.

Adjacent pixels in an image tend to be in the same category [23,36], and this is no exception for HSI. Thus, adding such a spatial continuity constraint will improve the network performance for pixel-level HSI classification. However, the treatment of the balancing coefficients in Ref. [23] is inappropriate. If the balancing coefficients ($\mu$ and $\nu$) are too high, it is difficult to predict the label, even for labeled data. If the balancing coefficients are too small, the information of the unlabeled data cannot be fully utilized. Thus, similar to the approach in Ref. [25], we recommend using a function that varies with the epoch, rather than a constant, as the balancing coefficient.

Based on the above analysis, we use the combination of $\mathcal {L}_{\text {su}}$ and $\mathcal {L}_{\text {spa}}$ to build our semi-supervised loss function. It is defined as follows:

$$\mathcal{L}_{\text{semi}} = \mathcal{L}_{\text{su}} + \alpha(t) \mathcal{L}_{\text{spa}},$$
where the supervised loss $\mathcal {L}_{\text {su}}$ can be used to guarantee the correct category prediction based on a limited number of labeled pixels. In addition, since the labeled pixels are scattered in different regions, the spatial-smoothness loss $\mathcal {L}_{\text {spa}}$ can be used to spread and adjust the correct category predictions from labeled pixels to their adjacent pixels. Furthermore, we use the following piecewise linear ramp function, starting from zero, to determine the value of $\alpha (t)$:
$$\alpha(t)= \begin{cases}0 & t<T_{1} \\ \frac{t-T_{1}}{T_{2}-T_{1}} \alpha_{f} & T_{1} \leq t<T_{2} \\ \alpha_{f} & T_{2} \leq t\end{cases},$$
where $t$ is the epoch of the training process, $T_{1}$ and $T_{2}$ are the turning-point epochs, and $\alpha _{f}$ is the final value of the balancing coefficient function.
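The balancing coefficient of Eq. (11) and a training loop built on the combined loss of Eq. (10) can be sketched as follows, reusing the MFCN, supervised_loss, and spatial_smoothness_loss sketches above. The toy tensor sizes and the optimizer choice (Adam, lr=1e-3) are assumptions of this sketch, not specified by the paper.

```python
import torch

def alpha(t, T1=10, T2=60, alpha_f=1.0):
    """Balancing coefficient of Eq. (11): zero before T1, a linear ramp
    between T1 and T2, and constant alpha_f afterwards."""
    if t < T1:
        return 0.0
    if t < T2:
        return (t - T1) / (T2 - T1) * alpha_f
    return alpha_f

# Whole-image training: the single HSI cube is the only network input.
hsi_cube = torch.randn(1, 270, 64, 64)               # toy stand-in for the HSI
train_labels = torch.full((1, 64, 64), -1)           # -1 marks unlabeled pixels
train_labels[0, ::8, ::8] = torch.randint(0, 9, (8, 8))  # a few labeled pixels

model = MFCN(bands=270)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    output = model(hsi_cube)
    loss = supervised_loss(output, train_labels) \
           + alpha(epoch) * spatial_smoothness_loss(output)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```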

In Sections 4.3.3 and 4.3.4, we will show the superiority of our loss function and discuss the parameters ($T_{1}$, $T_{2}$, and $\alpha _{f}$) in the balancing coefficient function.

4. Experiments

Our experiments were divided into two parts. First, we conducted comparison experiments with some state-of-the-art methods, and then we conducted ablation studies on the network structure, loss function, and kernel size.

4.1 Datasets

In this paper, three new benchmark datasets, WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu, were used to prove the advantages of our method for HSI classification. The three datasets were all acquired in farming areas with various crop types in Hubei province, China, via a Headwall Nano-Hyperspec sensor mounted on an unmanned aerial vehicle (UAV) platform [37]. These datasets have both high spectral resolution (nm level) and high spatial resolution (cm level). Since the very high spatial resolution causes severe spectral variability and spatial heterogeneity, it is challenging to apply the usual HSI classification methods to UAV-borne HSI. The data source address can be found in Ref. [42].

The characteristics of these datasets are listed in Table 1. The false-color composite images and the groundtruth maps are shown in Figs. 4, 5, and 6. A more detailed description of the datasets can be found in Ref. [37].

Fig. 4. WHU-Hi-LongKou dataset. (a) False-color composite image. The RGB channels correspond to bands 120, 60, and 20, respectively. (b) Groundtruth map.

Fig. 5. WHU-Hi-HanChuan dataset. (a) False-color composite image. The RGB channels correspond to bands 120, 60, and 20, respectively. (b) Groundtruth map.

Fig. 6. WHU-Hi-HongHu dataset. (a) False-color composite image. The RGB channels correspond to bands 120, 60, and 20, respectively. (b) Groundtruth map.

Table 1. Hyperspectral information of the datasets.

4.2 Comparison experiments

In the comparison experiments, we compare our method with six state-of-the-art methods, including SVM [12], 3D-CNN [38], 1D-RNN [16], 3D-FCN [39], 2D-FCN [23] and SSFCN-CRF [40].

4.2.1 Experiment configuration

As for the network structure of our MFCN, we used the MFCN in Fig. 3(a), consisting of four same-padding convolution layers. See Table 2 for detailed network configuration on each dataset.

Table 2. Network configuration of our MFCN in the comparison experiments.

As for the semi-supervised loss function (Eq. (10)), we set $T_{1}=10$, $T_{2}=60$, and $\alpha _{f}=1$.

We randomly selected 25 labeled samples of each class to form the training set, and the rest of the samples were used as the test set. For the SVM [12], 3D-CNN [38], 1D-RNN [16], and 3D-FCN [39] methods, 5% of the training set was used as an independent validation set. We ran 100 epochs for each network, and the experiments of all methods were executed for 3 runs on each dataset to obtain average results. However, our MFCN and the 2D-FCN [23] performed training and inference (that is, testing) at the same time, without the need for a validation set; thus, for these two 2D FCNs, the output of the epoch with the smallest loss function value was taken as the prediction result. It should be noted that the overall accuracy (OA) result of the SSFCN-CRF [40] method was extracted from the data source website (http://rsidea.whu.edu.cn/resource_WHUHi_sharing.htm); we did not reproduce the SSFCN-CRF [40] method, so only the OA results of this method are presented.
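For reference, a minimal NumPy sketch of how such a per-class training mask can be drawn from a ground-truth map (assuming the value 0 marks unlabeled background pixels; the function name and conventions are illustrative, and class indices may need shifting to zero-based form before computing the cross-entropy):

```python
import numpy as np

def sample_training_mask(gt, per_class=25, ignore=0, seed=0):
    """Randomly pick `per_class` labeled pixels of every class from the
    ground-truth map `gt` (H, W); pixels equal to `ignore` are background.
    Returns a label map with -1 for pixels excluded from the training set."""
    rng = np.random.default_rng(seed)
    train = np.full(gt.shape, -1, dtype=np.int64)
    for c in np.unique(gt):
        if c == ignore:
            continue
        idx = np.flatnonzero(gt == c)
        chosen = rng.choice(idx, size=min(per_class, idx.size), replace=False)
        train.flat[chosen] = c
    return train
```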

4.2.2 Experiment results

In the comparison experiments, we used four measures to analyze the classification results of the different methods: the class-specific precision, average accuracy (AA), overall accuracy (OA), and kappa coefficient ($\kappa$). Their values were obtained using the confusion matrix:

$$\left(\begin{array}{cccc} f_{11} & f_{12} & \cdots & f_{1g} \\ f_{21} & f_{22} & \cdots & f_{2g} \\ \vdots & \vdots & \ddots & \vdots \\ f_{g1} & f_{g2} & \cdots & f_{gg} \end{array}\right),$$
where $f_{ij}$ refers to the number of samples that were actually class $i$ but were predicted to be class $j$ in the classification testing procedure. Thus, $N=\sum _{i=1}^{g} \sum _{j=1}^{g} f_{ij}$ is the number of test samples.

The class-specific precision of class $j$ is defined as

$$P_{j} = \frac{f_{jj}}{\sum_{i=1}^{g} f_{ij}}.$$

The AA is the average recall of all the classes. It is defined as

$$\text{AA} = \frac{\sum_{i=1}^{g} \left(f_{ii}/\sum_{j=1}^{g} f_{ij}\right)}{g}.$$

The OA value reflects the observed probability of success in the classification. It is defined as

$$\text{OA} = \frac{\sum_{i=1}^{g} f_{i i}}{N}.$$

The number of samples in each category is often unbalanced, so the classifier may be biased toward large categories at the expense of small ones. As a result, the OA value may remain high even though the classification results for small categories are poor.

Therefore, the kappa coefficient ($\kappa$) was employed. It is defined as

$$\kappa=\frac{OA-p_{e}}{1-p_{e}},$$
in which $p_{e}$ is the hypothetical probability of chance agreement: $p_{e}=$ $\frac {1}{N^{2}} \sum _{i=1}^{g} f_{i+} f_{+i}$, where $f_{i+}$ is the total of the $i$th row and $f_{+i}$ is the total of the $i$th column of the confusion matrix. The lower the $\kappa$ value, the more biased the model; a higher $\kappa$ value indicates that the model is more consistent across categories.
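All four measures follow directly from the confusion matrix; a short NumPy sketch of Eqs. (12)–(16) (the function name is illustrative, and classes that are never predicted would produce an undefined precision):

```python
import numpy as np

def classification_metrics(conf):
    """Class-specific precision, AA, OA, and kappa from a g x g confusion
    matrix whose entry (i, j) counts samples of true class i predicted as j."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    diag = np.diag(conf)
    precision = diag / conf.sum(axis=0)          # per predicted class
    aa = np.mean(diag / conf.sum(axis=1))        # average recall over classes
    oa = diag.sum() / n
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return precision, aa, oa, kappa
```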

Tables 3, 4, and 5 present the values of the class-specific precision, AA, OA, and kappa coefficient for the WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu datasets, respectively. In terms of single-class accuracy, our MFCN achieved the best performance for most classes in all three datasets. In terms of AA, OA, and the kappa coefficient ($\kappa$), our MFCN produced the best classification results. The classification results fully demonstrated the superiority of our MFCN.

Table 3. Values of the class-specific precision, AA, OA, and kappa coefficient obtained by different methods for the WHU-Hi-LongKou dataset (25 labeled samples per class used for training).

Table 4. Values of the class-specific precision, AA, OA, and kappa coefficient obtained by different methods for the WHU-Hi-HanChuan dataset (25 labeled samples per class used for training).

Table 5. Values of the class-specific precision, AA, OA, and kappa coefficient obtained by different methods for the WHU-Hi-HongHu dataset (25 labeled samples per class used for training).

Figures 7, 8, and 9 show the classification maps obtained by the various methods in our comparison experiments, except for SSFCN-CRF [40]. It can be clearly seen from these figures that the classification performance of our MFCN has obvious advantages.

Fig. 7. The visualization of classification maps obtained via various methods for the WHU-Hi-LongKou dataset. (a) Groundtruth. (b) SVM [12]. (c) 3D-CNN [38]. (d) 1D-RNN [16]. (e) 3D-FCN [39]. (f) 2D-FCN [23]. (g) our MFCN.

Fig. 8. The visualization of classification maps obtained via various methods for the WHU-Hi-HanChuan dataset. (a) Groundtruth. (b) SVM [12]. (c) 3D-CNN [38]. (d) 1D-RNN [16]. (e) 3D-FCN [39]. (f) 2D-FCN [23]. (g) our MFCN.

Fig. 9. The visualization of classification maps obtained via various methods for the WHU-Hi-HongHu dataset. (a) Groundtruth. (b) SVM [12]. (c) 3D-CNN [38]. (d) 1D-RNN [16]. (e) 3D-FCN [39]. (f) 2D-FCN [23]. (g) our MFCN.

4.3 Ablation study

To thoroughly evaluate the effectiveness of our design choices, we performed extensive ablation experiments. The core of ablation studies is the control variable method. An ablation study typically refers to removing or changing some part of the model or algorithm, and seeing how that affects the final performance. We first studied different network structures, then the loss function. Finally, we investigated how the kernel size affected the performance of our MFCN on each dataset.

For all the ablation experiments, different numbers (25, 50, 100, 150, 200, 250, 300) of labeled pixels from an image were randomly selected as training samples, and the rest were used as test samples. We explain the details of ablation experiments below.

4.3.1 Network structure

In this subsection, we experimentally verified that using only convolution or transposed convolution layers could achieve good results. We compared four typical examples of our proposed MFCN (Fig. 3(a)–(d)) with the 2D-FCN [23] (Fig. 3(e)).

For each convolution layer, the stride was $1$ and the kernel size was $3$. The number of channels was the same as in the comparison experiments, as shown in Table 2. We selected our semi-supervised loss function (Eq. (10)) as the benchmark loss function. For the balancing coefficient, we set $T_{1}=10$, $T_{2}=60$, and $\alpha _{f}=1$. We ran 100 epochs for each network and selected the output of the epoch with the smallest loss function value as the prediction result. The experiments were executed for 3 runs with each dataset to obtain the average results.

Figure 10 shows the OA curves of different networks as a function of the percentage of the training set. We observed that the four MFCNs proposed in this work were equally effective, and all of them outperformed the 2D-FCN [23]. In the following experiments, we selected the MFCN shown in Fig. 3(a) as our benchmark network.

Fig. 10. Comparison of different network structures. The detailed structures of our proposed MFCNs are shown in Fig. 3. The overall accuracies (OAs) are plotted with different percentages of the training set for the (a) WHU-Hi-LongKou, (b) WHU-Hi-HanChuan, and (c) WHU-Hi-HongHu datasets. The four MFCNs proposed in this work were equally effective, and all of them outperformed the FCN proposed in Ref. [23].

From the perspective of the convolution layer structure, our MFCN only contained same-padding convolution layers, while the 2D-FCN [23] used not only same-padding convolution layers but also a no-padding convolution layer with a unit kernel. However, the effect of a no-padding convolution layer with a unit kernel is equivalent to that of a fully connected layer. It outputs linear weighted combinations between channels, which damages the nonlinear characteristics of the HSI in the channel dimension. Therefore, the use of the no-padding convolution layer with a unit kernel actually reduces the performance of the network.

In addition, it should be noted that the order of the normalization layer and the activation layer in our networks is different from that in the 2D-FCN [23]. Generally, the batch normalization layer is recommended to be placed in front of the activation function (ReLU), so that the inputs to the ReLU approach a normal distribution. If the normalized tensor is input to an activation function whose gradient change point is at 0 (such as the ReLU used here), the nonlinear characteristics of the activation function will be well exploited, and the constructed loss function will become smoother [41].

4.3.2 Kernel size

In this experiment, we investigated the effect of the kernel size in our convolution layers using kernel sizes of $3$, $5$, and $7$. The number of channels is the same as that in the comparison experiments, as shown in Table 2.

We selected the MFCN in Fig. 3(a) as the benchmark network. For each convolution layer, the stride was $1$. For the balancing coefficient in our semi-supervised loss function (Eq. (10)), we set $T_{1}=10$, $T_{2}=60$, and $\alpha _{f}=1$. We ran 100 epochs for each network and selected the output of the epoch with the smallest loss function as the prediction result. The experiments were executed for 3 runs with each dataset to obtain the average results.

The experimental results are presented in Fig. 11. It can be seen that with different numbers of training samples, the kernel size affects the performance of the network. In addition, the curves for the different kernels present strong variations. This can be explained by the fact that the spatial resolution of the datasets we used was very high: 0.463 m/pixel for WHU-Hi-LongKou, 0.109 m/pixel for WHU-Hi-HanChuan, and 0.043 m/pixel for WHU-Hi-HongHu. Therefore, classification on these datasets is more difficult, and the classification results are more prone to instability.

Fig. 11. Comparison of different convolution kernel sizes. The OAs are plotted with different percentages of the training set for the (a) WHU-Hi-LongKou, (b) WHU-Hi-HanChuan, and (c) WHU-Hi-HongHu datasets. We observed that using a kernel size of $5$ for the WHU-Hi-LongKou dataset, $3$ for the WHU-Hi-HanChuan dataset, and $7$ for the WHU-Hi-HongHu dataset achieved relatively better results.

4.3.3 Form of loss function

In this experiment, we compared the different forms of the loss functions developed from the usual supervised loss function (Eq. (2)) and three semi-supervised loss functions (Eqs. (8), (9), and (10)). For the three semi-supervised loss functions, we also studied the influence of constant and functional coefficients.

We selected the MFCN in Fig. 3(a) as the benchmark network. For each convolution layer, the stride was $1$ and the kernel size was $3$. The number of channels was the same as in the comparison experiments, as shown in Table 2. For the balancing coefficient in the functional form (Eq. (11)), we set $T_{1}=10$, $T_{2}=60$, and $\alpha _{f}=1$. We ran 100 epochs for each network and selected the output of the epoch with the smallest loss function value as the prediction result. The experiments were executed for 3 runs with each dataset to obtain the average results.

Figure 12 shows the experimental results. The results were as follows: (1) When the loss function was composed of the same components, the balancing coefficient in functional form was better than that in constant form. (2) Among the three semi-supervised loss functions that all used functional coefficients, our loss function achieved the best results. Thus, the superiority of our proposed semi-supervised loss function over the usual fully supervised loss and the semi-supervised losses in Refs. [25] and [23] was demonstrated. (3) Strong variations occurred only when $\mathcal {L}_{\text {un}}$ (defined in Eq. (3)) was involved in the loss function. This illustrates two points. First, we must be very careful when using both labeled and unlabeled samples; if unlabeled samples are used improperly, the classification accuracy will be reduced. Second, it supports our analysis of $\mathcal {L}_{\text {un}}$ in Section 3.2.4, that is, the unsupervised loss $\mathcal {L}_{\text {un}}$ is not suitable for our MFCN. (4) When $\mathcal {L}_{\text {un}}$ was used, increasing the number of training samples to 300 sometimes reduced the classification accuracy, which further illustrates that a reasonably designed loss function is very important for stable and excellent network performance.

Fig. 12. Comparison of different loss function forms. The OAs are plotted with different percentages of the training sets for the (a) WHU-Hi-LongKou, (b) WHU-Hi-HanChuan, and (c) WHU-Hi-HongHu datasets. Our proposed semi-supervised loss ($\mathcal {L}_{\text {su}} + \alpha (t) \mathcal {L}_{\text {spa}}$) outperformed all the other loss functions.

4.3.4 Parameters in balancing coefficient function

In this experiment, we investigated the effects of the parameters in the balancing coefficient function of our semi-supervised loss function (Eq. (10): $\mathcal {L}_{\text {semi}} = \mathcal {L}_{\text {su}} + \alpha (t) \mathcal {L}_{\text {spa}}$). We changed the values of the parameters $T_{1}$, $T_{2}$, and $\alpha _{f}$ to investigate their influence on the classification performance.

We selected the MFCN in Fig. 3(a) as the benchmark network. For each convolution layer, the stride was $1$ and the kernel size was $3$. The number of channels was the same as in the comparison experiments, as shown in Table 2. We ran 100 epochs for each network and selected the output of the epoch with the smallest loss function value as the prediction result. The experiments were executed for 3 runs with each dataset to obtain the average results.

The experimental results are presented in Fig. 13. The first set of parameters ($T_{1}=30$, $T_{2}=70$, and $\alpha _{f}=1$) had the worst robustness, followed by the second set ($T_{1}=20$, $T_{2}=80$, and $\alpha _{f}=1$). This indicated that the balancing coefficient should not remain zero for too many epochs during the network training process. The fourth set of parameters ($T_{1}=10$, $T_{2}=60$, and $\alpha _{f}=1$) performed slightly better than the third set ($T_{1}=10$, $T_{2}=60$, and $\alpha _{f}=0.4$), indicating that appropriately increasing the final value of the balancing coefficient could improve the classification accuracy.

Fig. 13. Comparison of different parameters in the balancing coefficient function. The OAs are plotted for different percentages of the training sets for the (a) WHU-Hi-LongKou, (b) WHU-Hi-HanChuan, and (c) WHU-Hi-HongHu datasets.

5. Discussion

Most deep learning classification methods that use the cross-entropy as the loss function regard the channel sequence number with the maximum value as the predicted label. This can be interpreted as follows: by treating the output of the neural network as another transformed feature space, the value of each channel can be viewed as the projection on each dimension, and the maximum projection dimension indicates the label. From the perspective of the feature space transformation, the training process of MFCN can be regarded as an optimization process of the network output feature space.

Typical fully supervised methods usually calculate the cross-entropy loss of labeled data outputs and optimize the affine transformation function (that is, the parameters of the neural network) through backpropagation. Every time the weights are updated, the transformed feature space is adjusted to make the maximum projection dimension reflect the label of each training sample as accurately as possible.

Adjacent pixels in an image tend to be in the same category [23,36], and this is no exception for HSI. Based on the usual supervised loss, the spatial-smoothness loss is additionally used to constrain the differences between adjacent pixels. Every time the weights are updated, the correct category predictions are spread and adjusted from labeled pixels to their unlabeled adjacent pixels. Thus, our MFCN combines the training and inference processes, simultaneously maintaining excellent results and high execution speeds.

6. Conclusion

In this paper, we have proposed a minimalistic fully convolution network (MFCN) and a semi-supervised loss function for HSI classification. Our MFCN is composed of convolution or transposed convolution layers. Our semi-supervised loss function is composed of the usual supervised loss and a spatial-smoothness loss. Its excellent performance was demonstrated by the comparison experiments in this paper. In addition, we examined the effects of the various elements of our method through ablation studies.

Funding

Young Talent Support Program of Shaanxi Province University (20200704); National Natural Science Foundation of China (6190528).

Acknowledgments

The authors would like to thank the editors and the anonymous reviewers for their helpful and valuable suggestions, which greatly improved the quality of this work. They also thank LetPub (www.letpub.com) for linguistic assistance and pre-submission expert review.

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are available in Ref. [42].

References

1. Z. Zhong, J. Li, D. A. Clausi, and A. Wong, “Generative adversarial networks and conditional random fields for hyperspectral image classification,” IEEE Trans. Cybern. 50(7), 3318–3329 (2020). [CrossRef]  

2. P. Launeau, Z. Kassouk, F. Debaine, R. Roy, P. G. Mestayer, C. Boulet, J.-M. Rouaud, and M. Giraud, “Airborne hyperspectral mapping of trees in an urban area,” Int. J. Remote. Sens. 38(5), 1277–1311 (2017). [CrossRef]  

3. H. Akbari, Y. Kosugi, K. Kojima, and N. Tanaka, “Detection and analysis of the intestinal ischemia using visible and invisible hyperspectral imaging,” IEEE Trans. Biomed. Eng. 57(8), 2011–2017 (2010). [CrossRef]  

4. B. Luo, C. Yang, J. Chanussot, and L. Zhang, “Crop yield estimation based on unsupervised linear unmixing of multidate hyperspectral imagery,” IEEE Trans. Geosci. Remote Sensing 51(1), 162–173 (2013). [CrossRef]  

5. B. Xu, X. Li, W. Hou, Y. Wang, and Y. Wei, “A similarity-based ranking method for hyperspectral band selection,” IEEE Transactions on Geoscience and Remote Sensing pp. 1–15 (2021).

6. A. J. Brown, T. I. Michaels, S. Byrne, W. Sun, T. N. Titus, A. Colaprete, M. J. Wolff, G. Videen, and C. J. Grund, “The case for a modern multiwavelength, polarization-sensitive lidar in orbit around mars,” J. Quant. Spectrosc. Radiat. Transfer 153, 131–143 (2015). [CrossRef]  

7. A. Brown, “Spectral curve fitting for automatic hyperspectral data analysis,” IEEE Trans. Geosci. Remote Sensing 44(6), 1601–1608 (2006). [CrossRef]  

8. A. J. Brown, “Equivalence relations and symmetries for laboratory, lidar, and planetary müeller matrix scattering geometries,” J. Opt. Soc. Am. A 31(12), 2789–2794 (2014). [CrossRef]  

9. L. He, J. Li, C. Liu, and S. Li, “Recent advances on spectral-spatial hyperspectral image classification: An overview and new guidelines,” IEEE Trans. Geosci. Remote Sensing 56(3), 1579–1597 (2018). [CrossRef]  

10. S. T. Li, W. W. Song, L. Y. Fang, Y. S. Chen, P. Ghamisi, and J. A. Benediktsson, “Deep learning for hyperspectral image classification: An overview,” IEEE Trans. Geosci. Remote Sensing 57(9), 6690–6709 (2019). [CrossRef]  

11. G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sensing 43(6), 1351–1362 (2005). [CrossRef]  

12. F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Trans. Geosci. Remote Sensing 42(8), 1778–1790 (2004). [CrossRef]  

13. J. Li, J. M. Bioucas-Dias, and A. Plaza, “Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning,” IEEE Transactions on Geosci. Remote. Sens 48, 4085–4098 (2010). [CrossRef]  

14. J. Li, J. M. Bioucas-Dias, and A. Plaza, “Spectral–spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields,” IEEE Trans. Geosci. Remote Sensing 50(3), 809–823 (2012). [CrossRef]  

15. T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory 13(1), 21–27 (1967). [CrossRef]  

16. L. Mou, P. Ghamisi, and X. X. Zhu, “Deep recurrent neural networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sensing 55(7), 3639–3655 (2017). [CrossRef]  

17. H. Wu and S. Prasad, “Semi-supervised deep learning using pseudo labels for hyperspectral image classification,” IEEE Trans. on Image Process. 27(3), 1259–1270 (2018). [CrossRef]  

18. H. Wu and S. Prasad, “Convolutional recurrent neural networks for hyperspectral data classification,” Remote Sens. 9(3), 298 (2017). [CrossRef]  

19. Z. Xue, M. Zhang, Y. Liu, and P. Du, “Attention-based second-order pooling network for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–16 (2021).

20. L. Zou, X. Zhu, C. Wu, Y. Liu, and L. Qu, “Spectral–spatial exploration for hyperspectral image classification via the fusion of fully convolutional networks,” IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing 13, 659–674 (2020). [CrossRef]  

21. J. Nalepa, M. Myller, and M. Kawulok, “Validating hyperspectral image segmentation,” IEEE Geoscience and Remote Sensing Letters 16(8), 1264–1268 (2019). [CrossRef]  

22. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2015), pp. 3431–3440.

23. W. Kim, A. Kanezaki, and M. Tanaka, “Unsupervised learning of image segmentation based on differentiable feature clustering,” IEEE Trans. on Image Process. 29, 8055–8068 (2020). [CrossRef]  

24. X. Yang, Z. Song, I. King, and Z. Xu, “A survey on deep semi-supervised learning,” arXiv (2021).

25. D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” ICML 2013 Workshop: Challenges in Representation Learning (WREPL) (2013).

26. L. Jiao, M. Liang, H. Chen, S. Yang, H. Liu, and X. Cao, “Deep fully convolutional network-based spatial distribution prediction for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sensing 55(10), 5585–5599 (2017). [CrossRef]  

27. J. Li, X. Zhao, Y. Li, Q. Du, B. Xi, and J. Hu, “Classification of hyperspectral imagery using a new fully convolutional neural network,” IEEE Geoscience and Remote Sensing Letters 15(2), 292–296 (2018). [CrossRef]  

28. T. Shibata, M. Tanaka, and M. Okutomi, “Misalignment-robust joint filter for cross-modal image pairs,” in 2017 IEEE International Conference on Computer Vision (ICCV), (2017), pp. 3315–3324.

29. L. Zhu, Y. Chen, P. Ghamisi, and J. A. Benediktsson, “Generative adversarial networks for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sensing 56(9), 5046–5063 (2018). [CrossRef]  

30. Y. Zhan, D. Hu, Y. Wang, and X. Yu, “Semisupervised hyperspectral image classification based on generative adversarial networks,” IEEE Geoscience and Remote Sensing Letters 15(2), 212–216 (2018). [CrossRef]  

31. M. Zhang, M. Gong, Y. Mao, J. Li, and Y. Wu, “Unsupervised feature extraction in hyperspectral images based on wasserstein generative adversarial network,” IEEE Trans. Geosci. Remote Sensing 57(5), 2669–2688 (2019). [CrossRef]  

32. V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv (2018).

33. Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, “The expressive power of neural networks: A view from the width,” in Advances in Neural Information Processing Systems, vol. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds. (Curran Associates, Inc., 2017), pp. 6231–6239.

34. R. Eldan and O. Shamir, “The power of depth for feedforward neural networks,” Comput. Sci. (2015).

35. C. C. Stearns and K. Kannappan, “Method for 2-d affine transformation of images,” (1995).

36. A. Kanezaki, “Unsupervised image segmentation by backpropagation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2018), pp. 1543–1547.

37. Y. Zhong, X. Hu, C. Luo, X. Wang, J. Zhao, and L. Zhang, “Whu-hi: Uav-borne hyperspectral with high spatial resolution (h2) benchmark datasets and classifier for precise crop identification based on deep convolutional neural network with crf,” Remote. Sens. Environ. 250, 112012 (2020). [CrossRef]  

38. A. B. Hamida, A. Benoit, P. Lambert, and C. B. Amar, “3-d deep learning approach for remote sensing image classification,” IEEE Trans. Geosci. Remote Sensing 56(8), 4420–4434 (2018). [CrossRef]  

39. H. Lee and H. Kwon, “Contextual deep cnn based hyperspectral classification,” in 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), (2016), pp. 3322–3325.

40. Y. Xu, B. Du, and L. Zhang, “Beyond the patchwise classification: Spectral-spatial fully convolutional networks for hyperspectral image classification,” IEEE Trans. Big Data 6(3), 492–506 (2020). [CrossRef]  

41. A. Galloway, A. Golubeva, T. Tanay, M. Moussa, and G. W. Taylor, “Batch normalization is a cause of adversarial vulnerability,” arXiv (2019).

42. Intelligent Data Extraction, Analysis and Applications of Remote Sensing (RSIDEA) academic research group, “WHU-Hi: UAV-borne hyperspectral and high spatial resolution (H2) benchmark datasets for crop precise classification,” Wuhan University, (2010), http://rsidea.whu.edu.cn/resource_WHUHi_sharing.htm.
