
Compressive hyperspectral image classification using a 3D coded convolutional neural network

Open Access

Abstract

Hyperspectral image classification (HIC) is an active research topic in remote sensing. Hyperspectral images typically generate large data cubes, posing significant challenges in data acquisition, storage, transmission, and processing. To overcome these limitations, this paper develops a novel deep learning HIC approach based on compressive measurements of coded-aperture snapshot spectral imagers (CASSI), without reconstructing the complete hyperspectral data cube. A new deep learning strategy, namely the 3D coded convolutional neural network (3D-CCNN), is proposed to efficiently solve the classification problem, where the hardware-based coded aperture is regarded as a pixel-wise connected network layer. An end-to-end training method is developed to jointly optimize the network parameters and the coded apertures with periodic structures. The classification accuracy is effectively improved by exploiting the synergy between the deep learning network and the coded apertures. The superiority of the proposed method over state-of-the-art HIC methods is assessed on several hyperspectral datasets.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Hyperspectral imaging acquires hundreds of image planes spanning wavelengths from the visible to the infrared. The rich spectral information of hyperspectral images has been widely employed in a range of remote sensing applications, such as ecological science, geological science, hydrological science, and precision agriculture [1,2]. Hyperspectral image classification (HIC) technology plays a crucial role in these applications, where a label is assigned to each spatial pixel of the scene based on its spectral signature. A large number of HIC methods have been proposed based on k-nearest-neighbors, the maximum likelihood criterion, logistic regression, and support vector machines (SVM) [3–6]. Over the past several years, deep learning has become one of the most efficient signal processing approaches, with great potential in hyperspectral imaging and classification [7–10].

Traditional HIC methods need three-dimensional (3D) spatio-spectral datasets, which are captured by scanning spectral imaging systems [11,12]. In this paper, the first two dimensions of the 3D data cube indicate x and y in the spatial domain, and the third dimension refers to λ in the spectral domain. However, these systems require a time-consuming scanning process in the spatial or spectral domain, and lead to large data volumes to be stored, transmitted, and processed. To overcome these limitations, Wagadarikar et al. introduced the concept of the coded aperture snapshot spectral imaging (CASSI) system based on compressive sensing (CS) theory [13]. The CASSI system simultaneously senses and compresses the 3D spectral data cube with just a single or a few two-dimensional (2D) projection measurements [14–18]. The complete 3D spectral data cube can be reconstructed from the compressive measurements, which can then be used for classification and processing. However, spectral image classification based on CASSI is a challenging task, since the reconstruction procedure is time-consuming and noise sensitive.

Recently, a supervised compressive spectral image classifier was proposed for the CASSI system, where each hyperspectral pixel is approximately represented as a sparse linear combination of samples in an overcomplete training dictionary [19]. The sparse coefficients were recovered from a set of CASSI compressive measurements to determine the clusters of the unknown pixels. Although this method does not need to reconstruct the complete 3D spectral data cube, recovering the sparse coefficients for all hyperspectral pixels is still computationally intensive. In addition, to improve the accuracy of compressive spectral image classifiers, improvements were made in two stages: the measurement stage and the classification stage. In the measurement stage, the coded apertures were optimized based on the restricted isometry property (RIP), a widely used criterion for obtaining optimal reconstruction performance in CS theory [19]. In the classification stage, the sparse dictionary was optimized and different sparsity-based classifiers were proposed [20–22]. However, coded apertures with optimal reconstruction performance are not necessarily the best for achieving the highest classification accuracy, since the reconstruction itself may introduce unexpected artifacts that are not supported by the compressive measurements. In addition, most existing methods ignored the synergy between the measurement stage and the classification stage, which limits the improvement of classification accuracy. Recently, the combination of optics and deep learning has become a trend, and some end-to-end optimization approaches of optics and image processing have been proposed [23–26]. This paper proposes a novel deep learning approach, namely the 3D coded convolutional neural network (3D-CCNN), which efficiently solves the HIC problem directly in the compressive domain without reconstruction, and jointly optimizes the coded apertures.

As shown in Fig. 1(a), the CASSI system with a dual-disperser architecture (DD-CASSI) is used in the measurement stage to capture the compressive measurement of a target scene. In DD-CASSI, the hyperspectral data cube is first shifted by the front dispersive element, then modulated by a coded aperture in spatial domain, and finally shifted back using the second dispersive element [15]. After that, the encoded hyperspectral data cube is projected onto a 2D integrated detector. In the measurement process, we switch different coded apertures to capture a set of compressive measurements. Different from the single-disperser-based CASSI system [13,14], the compressive measurements in DD-CASSI have the same spatial dimensions as the target scene. Each detector element receives the information from all spectral bands with different codes. These characteristics enable us to decompose the imaging model of DD-CASSI into patch-based models, whose dimensionality is consistent with the following classification network, since the classification is implemented in a patch-based manner.

Fig. 1. Sketches of (a) the DD-CASSI system and (b) the proposed 3D-CCNN framework. The DD-CASSI is used in the measurement stage, and the 3D-CNN is used in the classification stage. The imaging model of DD-CASSI can be decomposed into patch-based models to keep the dimensionality consistent with the classification network. The 3D-CCNN system combines the coded aperture optimization and the HSI classification into a single framework.

As shown in Fig. 1(b), the classification stage consists of a 3D convolutional neural network (3D-CNN) to predict the classification map directly in the compressive domain, without reconstructing the complete 3D data cube. Given the correlation of hyperspectral data across both spatial and spectral dimensions, the 3D-CNN takes the compressive measurement patches as the input. In order to obtain the optimal coding from a small training subset of the hyperspectral data, the coded apertures are designed as periodic patterns to reduce the number of independent optimization variables. Thus, we only need to optimize one period of the coded apertures, and then periodically extend it to the entire coded pattern. Taking full advantage of the patch-based model, an end-to-end training method is proposed to jointly optimize the coded apertures and the classification network parameters. In this work, the coded aperture optimization and hyperspectral image classification are concatenated into a single system, dubbed 3D-CCNN, which effectively increases the degrees of optimization freedom and improves the classification accuracy.

The main contributions of this paper are twofold. First, we integrate deep learning with CASSI to solve the classification problem directly in the compressive domain, thus avoiding the time-consuming reconstruction procedure and alleviating the influence of reconstruction artifacts. Second, the hardware-based coded aperture and the software-based classification network are unified into one framework, coined 3D-CCNN. The proposed method bridges the gap between coded aperture design and classification to increase the degrees of optimization freedom. Then, the end-to-end training method is used to jointly optimize the coded apertures and network parameters, which effectively improves the classification accuracy. The superiority of the proposed method over some state-of-the-art approaches is verified by a set of simulations. In this paper, the principle of the CASSI system is implemented in simulation, and the term “hardware” indicates that the coded aperture in a real CASSI system is implemented by physical devices, such as a mask pattern or a spatial light modulator. In the future, we will build a testbed for the CASSI system to verify the proposed compressive hyperspectral classification methods.

2. Forward imaging model of the DD-CASSI system

As shown in Fig. 1(a), the DD-CASSI system employs two opposite dispersers and a coded aperture to encode the hyperspectral data cube in both the spatial and spectral domains [15]. Let ${f_0}({x,y,\lambda } )$ be the hyperspectral data cube of the target scene, where x and y are the spatial coordinates, and $\lambda$ is the spectral coordinate. The hyperspectral data cube is first laterally shifted as a function of wavelength by the front disperser to form a skewed data cube, which is then projected by an imaging lens onto the coded aperture plane. The skewed data cube is modulated in the spatial domain by the coded aperture, whose transmission function is denoted by $T(x,y)$. Subsequently, the coded source planes are shifted back into a standard cube by the second disperser, and integrated along the $\lambda$ axis on the 2D focal plane array (FPA) detector. The dispersion effect enables the coded aperture to introduce distinguishable spatial modulations in different spectral bands. The measurement intensity on the FPA detector can be formulated as [15]:

$$Y({x,y} )= \int {T(x - \alpha (\lambda - {\lambda _c}),y){f_0}(x,y,\lambda )} d\lambda, $$
where $\alpha$ and ${\lambda _c}$ are the linear dispersion rate and the center wavelength of the prisms, respectively.

Due to the pixelated nature of the detector array, the continuous model in Eq. (1) can be transformed into a discrete form. Suppose we take $K$ snapshots in total with different coded apertures, and ${\mathbf T}_{}^k$ represents the coded aperture pattern used in the kth snapshot. Then, the kth snapshot measurement on the FPA is given by

$${\mathbf Y}_{ij}^k = \sum\limits_{l = 0}^{L - 1} {{{\mathbf F}_{i,j,l}}} {\mathbf T}_{i,j + l}^k + {\boldsymbol \omega }_{i,j}^k, $$
where i and j are the pixel coordinates in spatial domain, and l is the pixel coordinate in spectral domain; ${\mathbf F}$ is the 3D hyperspectral data cube of target with dimension $N \times M \times L$; ${{\mathbf F}_{i,j,l}}$ is the voxel at the spatial coordinate $(i,j)$ in the lth spectral band; and ${\boldsymbol \omega }_{i,j}^k$ is the measurement noise on the detector; ${\mathbf Y}_{}^k$ is the compressive measurement with dimension $N \times M$; and the dimension of ${\mathbf T}_{}^k$ is $N \times (M + L - 1)$. Next, we transform Eq. (2) into a matrix form. Let ${{\boldsymbol y}^k} \in {R^{NM \times 1}}$ and ${\boldsymbol f} \in {R^{NML \times 1}}$ be the vectorized representations of ${\mathbf Y}_{}^k$ and ${\mathbf F}$, respectively. Then, we have
$${{\boldsymbol y}^k} = {{\mathbf H}^k}{\boldsymbol f} + {{\boldsymbol \omega }^k}, $$
where ${\mathbf H}_{}^k$ is the system matrix representing the effect of the kth coded aperture and the dispersers, and ${{\boldsymbol \omega }^k}$ is the vector of measurement noise. Taking into account all of the $K$ snapshots, the measurements can be concatenated together, and the forward imaging model becomes [16,27,28]:
$${\boldsymbol y} = {\mathbf H}{\boldsymbol f} + {\boldsymbol \omega }, $$
where ${\boldsymbol y} = {[{({{\boldsymbol y}^1})^T},{({{\boldsymbol y}^2})^T},\ldots ,{({{\boldsymbol y}^K})^T}]^T}$ and ${\mathbf H} = {[{({{\mathbf H}^1})^T},{({{\mathbf H}^2})^T},\ldots ,{({{\mathbf H}^K})^T}]^T}$. Suppose the data cube is highly correlated across the spatial and spectral domains, and is sparse in some representation basis ${\mathbf \Psi }$ [2931]. Then, ${\boldsymbol f}$ in Eq. (4) can be represented as ${\boldsymbol f} = {\mathbf \Psi }{\boldsymbol \theta }$, where ${\mathbf \Psi } = {{\mathbf \Psi }_1} \otimes {{\mathbf \Psi }_2}$ is a 3D representation basis, ${\otimes}$ is the Kronecker product, and ${\boldsymbol \theta }$ is the coefficient vector. The ${{\mathbf \Psi }_1}$ indicates the 2D basis that depicts the correlation in the spatial domain, and ${{\mathbf \Psi }_2}$ indicates the one-dimensional (1D) basis in the spectral domain. For an example, this paper sets ${{\mathbf \Psi }_1}$ as the 2D wavelet Symmlet-8 basis, and sets ${{\mathbf \Psi }_2}$ as the 1D DCT basis. Substituting ${\boldsymbol f} = {\mathbf \Psi }{\boldsymbol \theta }$ into Eq. (4), we have
$${\boldsymbol y} = {\mathbf H\varPsi }{\boldsymbol \theta } + {\boldsymbol \omega }$$

It is noted that the matrix ${\mathbf H}$ is sparse and highly structured, which includes a set of diagonal line structures determined by the coded aperture entries ${\mathbf T}_{i,j + l}^k$. An illustrative example of the matrix ${\mathbf H}$ is shown in Fig. 2, where $K = 2$, $N = M = 6$, $L = 3$, and the coded aperture patterns obey the Bernoulli distribution with 50% transmittance.
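For concreteness, the following minimal NumPy sketch simulates the discrete forward model of Eq. (2) for K snapshots with Bernoulli random coded apertures of 50% transmittance, as in the example of Fig. 2. Noise is omitted and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def ddcassi_snapshot(F, T):
    """One DD-CASSI snapshot following Eq. (2), noise omitted.
    F: (N, M, L) hyperspectral cube; T: (N, M + L - 1) coded aperture.
    Returns the (N, M) compressive measurement."""
    N, M, L = F.shape
    Y = np.zeros((N, M))
    for l in range(L):
        # Band l sees the aperture shifted by l columns (dispersion), and
        # all coded bands are integrated on the detector.
        Y += F[:, :, l] * T[:, l:l + M]
    return Y

# Toy example matching Fig. 2: K = 2 snapshots, N = M = 6, L = 3, and
# Bernoulli coded apertures with 50% transmittance.
rng = np.random.default_rng(0)
N, M, L, K = 6, 6, 3, 2
F = rng.random((N, M, L))
T = rng.binomial(1, 0.5, size=(K, N, M + L - 1)).astype(float)
Y = np.stack([ddcassi_snapshot(F, T[k]) for k in range(K)], axis=-1)  # (N, M, K)
```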

Fig. 2. An illustrative example of the matrix H for the Bernoulli random coded apertures, where $K = 2$, $N = M = 6$, $L = 3$.

3. 3D-CCNN approach for hyperspectral image classification

In this section, we build up a seven-layer 3D-CNN to solve the classification problem directly in the compressive domain. Then, we decompose the forward imaging model of the DD-CASSI into patch-based models, and the periodic design of coded apertures is introduced. The coded aperture and classification network are further connected into a uniform framework, namely 3D-CCNN. The joint training method of 3D-CCNN is presented at the end of this section. The sketch of the 3D-CCNN framework is shown in Fig. 3.

Fig. 3. Sketch of the 3D-CCNN framework, which connects (a) the measurement stage of DD-CASSI system and (b) the 3D-CNN classification network. The coded apertures and classification network are jointly trained in an end-to-end supervised manner.

3.1 Compressive spectral image classification using the 3D-CNN

The DD-CASSI system is used to acquire several compressive measurements with different coded apertures. Initially, random coded apertures are used in DD-CASSI. Assume the hyperspectral data cube of the target scene consists of $N \times M$ spatial pixels and L spectral bands. Taking K snapshots, we obtain a compressive measurement data cube with dimension $N \times M \times K$. The goal is to solve the HIC problem directly from the compressive measurements without reconstruction.

Recently, deep learning has been shown to provide accurate semantic interpretation of the underlying datasets [32]. Given the 3D nature of the compressive measurement data cube, the 3D-CNN framework is chosen to perform the classification task, since it can simultaneously exploit information from all measurement slices with different codings, which is essential for improving classification performance [8,9,32–34].

As shown in Fig. 3, the HIC problem is pixel-based, where each spatial pixel on the hyperspectral images is associated with a specific classification label. Note that the pixels inside a small neighborhood often reflect relevant information of the underlying objects or materials. Thus, the information of measurement data surrounding a pixel is helpful to improve the classification accuracy of that pixel. For each pixel under consideration, we truncate a small patch around it from the compressive measurement data cube. The dimension of the patch is $P \times P \times K$, where $P \times P$ is the spatial size, and K is equal to the number of compressive measurements. The dimension P is often chosen as an odd number to keep the symmetry. The center of the patch is located on the pixel to be classified. Then, the patch is used as the input of the 3D-CNN, and the output is the classification label of the central pixel. It is noted that the pixel-label-based method is not the only way to perform the hyperspectral image classification. Another possible method is to first segment the entire hyperspectral image into several regions, and then determine the class label for each region. However, the segmentation-based method is out of the scope of this paper, and will be studied in the future. Next, we describe the structure of the 3D-CNN in more detail.

The choice of the depth and width of the 3D-CNN involves a trade-off, and it has recently been shown that one of the keys to better performance is finding the right balance between the network’s depth and width [32]. To harmonize the cost and accuracy of a deep network, the 3D-CNN built in this paper consists of 7 layers, including 6 convolutional layers and 1 fully connected layer. The first and second convolutional layers have 20 filters each, whereas the remaining convolutional layers have 35 filters. As shown in Fig. 3(b), the convolutional layers transform the input data into a series of 3D feature maps, which are gradually reduced into a 1D feature vector. The 1D feature vector is fed into a fully-connected layer, the output of which is then passed to a Softmax classifier to calculate the classification result. From the first layer to the sixth layer, the dimensions of the 3D convolution kernels are $20 \times 3 \times 3 \times 3$, $20 \times 3 \times 1 \times 1$, $35 \times 3 \times 3 \times 3$, $35 \times 3 \times 1 \times 1$, $35 \times 3 \times 1 \times 1$, and $35 \times 2 \times 1 \times 1$, respectively, where the first number is the filter count and the remaining three numbers ($K_1^m \times K_2^m \times K_3^m$ for the mth layer) are the kernel dimensions. For instance, “$20 \times 3 \times 3 \times 3$” means that there are twenty 3D kernels of dimension $3 \times 3 \times 3$ (i.e., two spatial dimensions and one spectral dimension). Let W denote the parameter set of the 3D-CNN, including all the convolution kernels, weights, and biases.
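The PyTorch sketch below illustrates one possible realization of this seven-layer 3D-CNN. The filter counts and kernel shapes follow the description above; the “same” padding, the ReLU activations, and the ordering of the kernel dimensions (snapshot axis first, then the two spatial axes) are assumptions made here for illustration, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class CompressiveHIC3DCNN(nn.Module):
    """Sketch of the seven-layer 3D-CNN of Section 3.1 (assumptions noted above)."""

    def __init__(self, num_classes, patch_size=7, num_snapshots=5):
        super().__init__()
        # (in_channels, out_channels, kernel shape (snapshot, spatial, spatial))
        cfg = [(1, 20, (3, 3, 3)), (20, 20, (3, 1, 1)),
               (20, 35, (3, 3, 3)), (35, 35, (3, 1, 1)),
               (35, 35, (3, 1, 1)), (35, 35, (2, 1, 1))]
        layers = []
        for c_in, c_out, k in cfg:
            layers += [nn.Conv3d(c_in, c_out, k, padding="same"), nn.ReLU()]
        self.features = nn.Sequential(*layers)
        self.fc = nn.Linear(35 * num_snapshots * patch_size * patch_size,
                            num_classes)

    def forward(self, x):            # x: (batch, 1, K, P, P) measurement patch
        x = self.features(x)         # six convolutional layers -> 3D feature maps
        x = torch.flatten(x, 1)      # 3D feature maps -> 1D feature vector
        return self.fc(x)            # class scores; Softmax is applied in the loss

# e.g., model = CompressiveHIC3DCNN(num_classes=9)
# logits = model(torch.randn(4, 1, 5, 7, 7))   # a batch of 7 x 7 x 5 patches
```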

3.2 Patch-based model with periodic coded apertures

To keep the dimensionality consistent, we first decompose the forward imaging model of DD-CASSI into patch-based models. As shown in Fig. 4, we first divide the hyperspectral data cube and the compressive measurements in Eq. (4) into small patches. Let ${{\boldsymbol y}^i}$ with dimension $P \times P \times K$ be the ith measurement patch truncated from the compressive measurement data cube. Define ${\boldsymbol y}_k^i \in {R^{P \times P}}$ as the kth slice of ${{\boldsymbol y}^i}$. Then, we can trace ${\boldsymbol y}_k^i$ from the detector back through the DD-CASSI system, and find the corresponding 3D patch ${{\boldsymbol s}^i}$ in the original hyperspectral data cube ${\mathbf F}$. The patch ${{\boldsymbol s}^i}$ is a $P \times P \times L$ data cube, where L denotes the number of spectral bands. Due to the first disperser, the hyperspectral patch ${{\boldsymbol s}^i}$ is shifted into a parallelepiped, and then modulated by a coded aperture patch ${\boldsymbol t}_k^i$ with dimension $P \times (P + L - 1)$, where ${\boldsymbol t}_k^i$ represents the coded aperture patch associated with ${{\boldsymbol s}^i}$ at the kth snapshot.

Fig. 4. The patch-based imaging model of DD-CASSI. For each snapshot, a 3D hyperspectral patch ${\boldsymbol s}_{}^i \in {R^{P \times P \times L}}$ corresponds to a compressive measurement patch ${\boldsymbol y}_k^i \in {R^{P \times P}}$ on the detector. The hyperspectral patch ${\boldsymbol s}_{}^i$ is modulated by a coded aperture patch ${\boldsymbol t}_k^i \in {R^{P \times (P\textrm{ + }L - 1)}}$.

In more detail, each spectral band of ${{\boldsymbol s}^i}$ is modulated by a different coding template with dimension $P \times P$ due to the dispersive effect, and every coding template can be regarded as a part of the coded aperture patch ${\boldsymbol t}_k^i$. Denote the central point of ${\boldsymbol y}_k^i$ as ${\mathbf Y}_{{x_0},{y_0}}^k$, where $({x_0},{y_0})$ is the central coordinate. Then, the central pixel of the hyperspectral patch ${{\boldsymbol s}^i}$ is ${{\mathbf F}_{{x_0},{y_0},l}}$, where l is the spectral coordinate. The lth spectral band of ${{\boldsymbol s}^i}$ and the coded aperture patch ${\boldsymbol t}_k^i$ can be given by:

$${\boldsymbol s}_l^i = \left[ {\begin{array}{ccc} {{{\mathbf F}_{{x_0} - q,{y_0} - q,l}}}& \cdots &{{{\mathbf F}_{{x_0} - q,{y_0} + q,l}}}\\ \vdots &{{{\mathbf F}_{{x_0},{y_0},l}}}& \vdots \\ {{{\mathbf F}_{{x_0} + q,{y_0} - q,l}}}& \ldots &{{{\mathbf F}_{{x_0} + q,{y_0} + q,l}}} \end{array}} \right], \,\,\textrm{and} {\boldsymbol t}_k^i = \left[ {\begin{array}{ccc} {{\mathbf T}_{{x_0} - q,{y_0} - q}^k}& \cdots &{{\mathbf T}_{{x_0} - q,{y_0} + q + L - 1}^k}\\ \vdots & \ddots & \vdots \\ {{\mathbf T}_{{x_0} + q,{y_0} - q}^k}& \cdots &{{\mathbf T}_{{x_0} + q,{y_0} + q + L - 1}^k} \end{array}} \right],$$
where $q = \lfloor{P/2} \rfloor$, and $\lfloor \cdot \rfloor$ is the floor operator.

Let ${\mathbf F}_{m,n,l}^{}$ be the $(m,n)\textrm{th}$ pixel in the lth spectral band of ${{\boldsymbol s}^i}$, and let ${\mathbf Y}_{m,n}^k$ be the $(m,n)\textrm{th}$ pixel in the kth band of ${{\boldsymbol y}^i}$. The spatial dimension of input patch ${{\boldsymbol s}^i}$ and the measurement patch ${{\boldsymbol y}^i}$ is the same. According to Eq. (6), the relationships between the subscripts are $m = {x_0} + \alpha$, $n = {y_0} + \alpha$, and $\alpha \in [ - q,q]$. Then, we have

$${\mathbf Y}_{m,n}^k\textrm{ = }\sum\limits_{l = 0}^{L - 1} {{{\mathbf F}_{m,n,l}}} {\mathbf T}_{m,n + l}^k, $$
where ${\mathbf T}_{m,n + l}^k(l = 0,\ldots ,L - 1)$ is the $(m,n + l)\textrm{th}$ pixel of the coded aperture patch ${\boldsymbol t}_k^i$.

When taking K snapshots, there are in total $KN(M + L - 1)$ coded aperture entries to be optimized. If all the coded aperture variables are independent of each other, it is impossible to train them using only a small set of training samples from the hyperspectral data cube, because the training problem becomes underdetermined when the number of variables exceeds the number of training samples. To circumvent this problem, we design the coded apertures with periodic patterns, where each coded aperture is cyclically filled by a basic block with dimension $B \times B$, and different coded apertures have different basic blocks. The reason for and benefit of using periodic coding patterns are explained as follows. For a limited number of training samples, some coded aperture pixels will be located in the central regions of training patches, while others will be located at the peripheries of training patches or even in blank regions outside any training patch. In the training process, the coded aperture pixels at different locations will be optimized differently. However, in the testing stage, each coded aperture pixel will be at the center of some compressive measurement patch, since the 3D-CCNN implements the hyperspectral classification pixel-by-pixel in the compressive domain. Thus, if the coded aperture is aperiodic, it may induce asymmetry and inconsistency between the training stage and the inference stage. On the other hand, if the coded aperture is periodic, the degrees of optimization freedom are limited within one period of the coded aperture. The coded aperture pixels at the same spatial location in different periods are kept the same and optimized synchronously. As the number of training samples reaches a certain level, every coded aperture pixel in a period will fall in the central region of some training patch. In this way, the asymmetry and inconsistency between the training and inference stages are effectively alleviated. The purpose of using periodic coded apertures is to reduce the degrees of optimization freedom, such that the periodic coded aperture patterns can be appropriately optimized by a limited number of training samples. Each coded aperture period should be optimized based on a set of training patches. Thus, the period size B of the coded apertures should be large enough to cover many training patches, that is, B should be much larger than the patch size P. In addition, the selection of the period size B should also accommodate the size of the training dataset: a smaller training dataset needs a smaller period size to reduce the degrees of optimization freedom, whereas the period size can be increased when a larger training dataset is available. On the other hand, the patch size P should be smaller than the period size B, and the value of P should not be too large, otherwise the correlation between the central spectra and the boundary spectra in the same patch will be reduced, which degrades the classification accuracy. Denote ${{\mathbf C}^k} \in {R^{B \times B}}$ as the basic block for the kth coded aperture ${{\mathbf T}^k} \in {R^{N \times (M + L - 1)}}$. The pixel value of ${\mathbf T}_{m,n + l}^k$ is equal to ${\mathbf C}_{{m_1},{n_1}}^k$, where ${m_1} = m\%B$, ${n_1} = (n + l)\%B$, and $\%$ is the remainder operation. Then, Eq. (7) can be rewritten as

$${\mathbf Y}_{m,n}^k\textrm{ = }\sum\limits_{l = 0}^{L - 1} {{{\mathbf F}_{m,n,l}}} {\mathbf C}_{{m_1},{n_1}}^k. $$

Taking all of the K snapshots into account, the entries of ${\mathbf C}_{{m_1},{n_1}}^k$ can be concatenated together to form:

$${\mathbf C}(i) = [{{\mathbf C}_{{m_1},{n_1}}^1,{\mathbf C}_{{m_1},{n_1}}^2,\ldots ,{\mathbf C}_{{m_1},{n_1}}^k,\ldots ,{\mathbf C}_{{m_1},{n_1}}^K} ], \quad \textrm{for different } {m_1} \textrm{ and } {n_1},$$
where i indicates the index of the input patch ${{\boldsymbol s}^i}$. The set ${\mathbf C}(i)$ consists of all the entries in the K basic blocks associated with ${{\boldsymbol s}^i}$. As shown in Fig. 3(a), the coded aperture can be regarded as a pixel-wise connected layer that encodes the input data cube and produces the measurement patches, which are then input to the 3D-CNN proposed in Section 3.1. The effect of the coded apertures is equivalent to a virtual connected layer in the 3D-CCNN framework. Then, we can jointly optimize the basic blocks of the coded apertures and the other network parameters using an end-to-end supervised training method.
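As an illustration of how the periodic coded aperture can act as a trainable, pixel-wise connected layer, the sketch below tiles a $B \times B$ basic block over the full aperture using the modulo indexing of Eq. (8) and applies the DD-CASSI forward model of Eq. (2). The Bernoulli initialization and the class interface are assumptions of this sketch, not details given in the paper.

```python
import torch
import torch.nn as nn

class PeriodicCodedAperture(nn.Module):
    """Coded apertures as a differentiable coding layer: each of the K apertures
    is cyclically tiled from a trainable B x B basic block (Eq. (8))."""

    def __init__(self, K, B, N, M, L):
        super().__init__()
        self.N, self.M, self.L = N, M, L
        # K trainable basic blocks, initialized as Bernoulli(0.5) patterns (assumption)
        self.blocks = nn.Parameter(torch.bernoulli(0.5 * torch.ones(K, B, B)))

    def full_apertures(self):
        """Tile each B x B basic block to a full N x (M + L - 1) aperture."""
        B = self.blocks.shape[-1]
        rows = torch.arange(self.N) % B
        cols = torch.arange(self.M + self.L - 1) % B
        return self.blocks[:, rows][:, :, cols]          # (K, N, M + L - 1)

    def forward(self, F):
        """DD-CASSI forward model of Eq. (2), noise omitted.
        F: (batch, N, M, L) hyperspectral cube -> (batch, K, N, M) measurements."""
        T = self.full_apertures()
        snapshots = []
        for k in range(T.shape[0]):
            Y_k = sum(F[..., l] * T[k, :, l:l + self.M] for l in range(self.L))
            snapshots.append(Y_k)
        return torch.stack(snapshots, dim=1)
```

Combined with the 3D-CNN sketched in Section 3.1, the basic blocks and the network weights can then be optimized with a single optimizer, as outlined in the next subsection.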

3.3 Joint training method of the 3D-CCNN

This section proposes an end-to-end supervised training method for the 3D-CCNN. The set of all parameters to be optimized is denoted as $\Theta = [W,{\mathbf C}(i)]$, where $W$ represents the parameters of the seven-layer 3D-CNN model described in Section 3.1, and ${\mathbf C}(i)$ represents the parameters of the coded apertures defined in Eq. (9). In the hyperspectral data cube, we randomly choose 30% of the labeled samples as the training data, and the remaining 70% of the pixels are used for testing.

In this work, the softmax loss is used as the objective function in the training process. Suppose the input of the softmax classifier is a vector ${\boldsymbol x} \in {R^{M \times 1}}$, where M is the number of classes. The output of the softmax classifier is an $M \times 1$ vector ${\boldsymbol p} = [{{\boldsymbol p}_1},{{\boldsymbol p}_2}, \ldots ,{{\boldsymbol p}_M}]$. Then, the loss function is

$$\textrm{Loss} ={-} \sum\limits_{j = 1}^M {{{\boldsymbol l}_j}} \log ({{\boldsymbol p}_j}), $$
where ${\boldsymbol l} = [{{\boldsymbol l}_1},{{\boldsymbol l}_2}, \ldots ,{{\boldsymbol l}_M}]$ is an $M \times 1$ true label vector. For $j \in [1,M]$, ${{\boldsymbol l}_j} = 1$ if the pixel under consideration actually belongs to the jth class, and ${{\boldsymbol l}_j} = 0$ otherwise. ${{\boldsymbol p}_j}$ is the jth element of the output vector, which represents the probability that the pixel belongs to the jth class:
$${\boldsymbol p}_j^{} = {e^{{\boldsymbol x}_j^{}}}/\sum\limits_{i = 1}^M {{e^{{\boldsymbol x}_i^{}}}}, $$
where ${{\boldsymbol x}_i}$ and ${{\boldsymbol x}_j}$ represent the ith and jth elements of the input vector ${\boldsymbol x}$, respectively. The loss function is minimized using the back-propagation method. The network parameters are updated as:
$$\{ {W^{v + 1}},{\mathbf C}{(i)^{v + 1}}\} \textrm{ = \{ }{W^v},{\mathbf C}{(i)^v}\} \textrm{ - }\eta \cdot \nabla Loss\{ {W^v},{\mathbf C}{(i)^v}\}, $$
where v indicates the iteration number, $\eta$ is the learning rate, and $\nabla Loss\{ {W^v},{\mathbf C}{(i)^v}\}$ represents the gradient of the loss function with respect to the variables. After the training process, the learned basic block ${{\mathbf C}^k}$ can be tiled to form the complete coded aperture ${\mathbf T}_{}^k$.
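A minimal sketch of this joint update is given below, reusing the PeriodicCodedAperture and CompressiveHIC3DCNN sketches above. The optimizer, learning rate, epoch count, period size, and the per-pixel (unbatched) loop are illustrative assumptions; the cross-entropy loss corresponds to the softmax loss of Eqs. (10)–(11), and the gradient step corresponds to Eq. (12).

```python
import torch
import torch.nn as nn

# `F` is a hyperspectral cube of shape (1, N, M, L); `train_pixels` and
# `train_labels` hold labeled pixel coordinates and class indices away from
# the image border.  All names and hyperparameters here are assumptions.
coding = PeriodicCodedAperture(K=5, B=64, N=256, M=256, L=103)
net = CompressiveHIC3DCNN(num_classes=9, patch_size=7, num_snapshots=5)
criterion = nn.CrossEntropyLoss()        # softmax loss of Eqs. (10)-(11)
optimizer = torch.optim.SGD(
    list(coding.parameters()) + list(net.parameters()), lr=1e-2)

P = 7
q = P // 2
for epoch in range(50):
    for (i, j), label in zip(train_pixels, train_labels):
        Y = coding(F)                    # (1, K, N, M); recomputed each step so
                                         # that gradients reach the coded aperture
        patch = Y[:, :, i - q:i + q + 1, j - q:j + q + 1]     # (1, K, P, P)
        logits = net(patch.unsqueeze(1))                      # (1, 1, K, P, P) in
        loss = criterion(logits, torch.tensor([label]))
        optimizer.zero_grad()
        loss.backward()                  # back propagation through both the
        optimizer.step()                 # 3D-CNN weights and the basic blocks
```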

It is noted that the coded apertures used in this paper are grey-scale, and the training process may lead to coded aperture elements outside the range [0,1]. At the end of the training process, we use a truncation function to limit the values of the coded aperture elements to between 0 and 1:

$${\mathbf C}_{m,n}^k = \left\{ \begin{array}{l} 1,\textrm{ if }{\mathbf C}_{m,n}^k > 1\\ 0,\textrm{ if }{\mathbf C}_{m,n}^k < 0 \end{array} \right., {\textrm{for}}\, {\textrm{different}} \, m \,{\textrm{and}}\, n.$$
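In the PyTorch sketch above, this truncation amounts to clamping the learned basic blocks after training, for example:

```python
# Truncation of Eq. (13): clip the learned basic blocks to the physically
# realizable transmittance range [0, 1] ("coding" is the sketch above).
with torch.no_grad():
    coding.blocks.clamp_(0.0, 1.0)
```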

As shown in Fig. 5, the blue arrows represent the end-to-end training process, and the red arrows represent the testing process. In the testing process, the optimized coded apertures obtained by the training method are first manufactured and installed in the DD-CASSI system. A set of compressive measurements is captured by the detector. Subsequently, the compressive measurement data cube is decomposed into patches, which are then input into the 3D-CNN to obtain the classification results.

Fig. 5. Overview of the training process (blue arrows) and testing process (red arrows) of 3D-CCNN.

4. Experimental results

In this section, we evaluate the 3D-CCNN method on two public hyperspectral datasets, the Pavia University dataset and the Salinas Valley dataset [7]. The network was trained with the deep learning tool PyTorch on a desktop with a GTX 1080 graphics processing unit (GPU) and 11 GB of RAM. In addition, we compare the proposed method with several competitive methods, including convolutional neural network and support vector machine (SVM) classifiers [35,36]. In the comparative experiments, we use random coded apertures or blue noise coded apertures in the DD-CASSI system. Note that the blue noise coding strategy has been proved to be optimal for reconstruction in CASSI [37]. In this paper, the transmittance of all random coded apertures is set to 0.5. The blue noise coded apertures are generated based on the method in [37]. In a practical implementation of the classification system, we need to calibrate and maintain the alignment between the coded aperture pattern and the detector. We can make a cross mark on the coded aperture, and use a standard whiteboard to replace the target. The cross mark is then imaged on the detector, and we can adjust the position of the cross-mark image to align the coded aperture with the detector. More details of the calibration method for the CASSI system can be found in the literature [38]. In addition, the modulation of the coded aperture in a real CASSI system cannot be regarded as ideal coding. Thus, we can first capture images of the coded apertures on the detector, and then use these images to calibrate the transmission functions of the coded apertures.

The first four comparative methods are defined as follows:

  • (1) “Rand-compress-3D-CNN” method: Use random coded apertures to obtain the compressive measurements, and then perform the hyperspectral classification using the seven-layer 3D-CNN model described in Section 3.1.
  • (2) “Bluenoise-compress-3D-CNN” method: Use blue noise coded apertures to obtain the compressive measurements, and then perform the hyperspectral classification using the seven-layer 3D-CNN model.
  • (3) “Rand-compress-SVM” method: Use random coded apertures to obtain the compressive measurements, and then perform the hyperspectral classification using the SVM classifier.
  • (4) “Bluenoise-compress-SVM” method: Use blue noise coded apertures to obtain the compressive measurements, and then perform the hyperspectral classification using the SVM classifier.

All of the methods mentioned above perform the classification in the compressive domain. It is noted that the hyperspectral data cube of the target scene can be reconstructed from the compressive measurements by solving an $\ell_1$-norm minimization problem. The details of the reconstruction methods have been published in the literature [14,39,40]. It is natural to ask whether the classification accuracy can be improved by using the reconstructed hyperspectral data cube instead of the compressive measurements. To answer this question, we compare the proposed method with the following comparative methods:

  • (5) “Rand-construct-3D-CNN” method: Use random coded apertures to obtain the compressive measurements, and then use the 3D-CNN to perform the classification based on the reconstructed hyperspectral data cube.
  • (6) “Bluenoise-construct-3D-CNN” method: Use blue noise coded apertures to obtain the compressive measurements, and then use the 3D-CNN to perform the classification based on the reconstructed hyperspectral data cube.
  • (7) “Rand-construct-SVM” method: Use random coded apertures to obtain the compressive measurements, and then use the SVM to perform the classification based on the reconstructed hyperspectral data cube.
  • (8) “Bluenoise-construct-SVM” method: Use blue noise coded apertures to obtain the compressive measurements, and then use the SVM to perform the classification based on the reconstructed hyperspectral data cube.

This paper uses a compressive sensing reconstruction algorithm for the hyperspectral images. Specifically, the gradient projection for sparse reconstruction (GPSR) algorithm is used [39], and ${\mathbf \Psi } = {{\mathbf \Psi }_1} \otimes {{\mathbf \Psi }_2}$ is constructed as the sparse representation basis, where ${{\mathbf \Psi }_1}$ is the 2D wavelet Symmlet-8 basis in the spatial domain and ${{\mathbf \Psi }_2}$ is the one-dimensional (1D) DCT basis in the spectral domain. Note that the reconstruction could also be implemented by an end-to-end whole-image-based deep learning method. Given the 3D nature of hyperspectral images, training end-to-end whole-image-based deep learning networks would be very computationally intensive. The influence of patch-based and end-to-end whole-image-based deep learning reconstruction approaches on the classification performance will be studied and compared in the future. Furthermore, in this paper the proposed method is also compared with classifiers for which the original hyperspectral data cube is assumed to be available:

  • (9) “Original-3D-CNN” method: Use the 3D-CNN to perform the classification directly on the original hyperspectral data cube of the target scene.
  • (10) “Original-SVM” method: Use the SVM to perform the classification directly on the original hyperspectral data cube of the target scene.

In the following simulations, several indices are used to quantitatively assess the classification performance, including the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Ka). The OA is defined as the ratio of the correctly classified samples over all testing samples. The AA is the mean accuracy over all categories. The Ka is a statistical metric that measures the agreement between the ground truth map and the classification map [7].
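These three indices can be computed directly from the confusion matrix; the following sketch is a straightforward implementation of the definitions above (variable names are illustrative).

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Compute OA, AA, and the Kappa coefficient from predicted and true labels."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                        # confusion matrix
    n = cm.sum()
    oa = np.trace(cm) / n                                    # overall accuracy
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)  # per-class accuracy
    aa = per_class.mean()                                    # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2      # chance agreement
    kappa = (oa - pe) / (1 - pe)                             # Kappa coefficient
    return oa, aa, kappa
```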

4.1 Simulation result on the Pavia University dataset

The Pavia University dataset was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) over the University of Pavia, Italy [7]. The spectral image in this dataset has a high spatial resolution (1.3 m per pixel), comprising $640 \times 340$ spatial pixels and 103 spectral reflectance bands in the wavelength range from 0.43 $\mu m$ to 0.86 $\mu m$. In the following, a $256 \times 256 \times 103$ cube is truncated from the entire dataset and used as the original hyperspectral data cube. Figure 6(a) shows the false-color composite image of the Pavia University spectral data. Figure 6(b) shows the ground truth of the classification map, which consists of nine distinct classes with different colors. Each class label corresponds to a different kind of object in the urban cover, and the black regions represent the unlabeled pixels. From the image, 30% of the labeled pixels are randomly chosen as the training samples, and the remaining 70% of the pixels are used for testing. Figure 7(a) illustrates one of the randomly initialized coded aperture patterns, and Fig. 7(b) illustrates the optimized coded aperture pattern after joint optimization of the 3D-CCNN. The difference between the initial coded aperture and the optimized coded aperture is shown in Fig. 7(c).

Fig. 6. (a) False-color composite image of the Pavia University spectral data and (b) ground truth of the classification map including nine distinct classes, where black regions represent the unlabeled pixels.

Fig. 7. Illustration of coded apertures for the Pavia University spectral data: (a) the initial coded aperture pattern, (b) the optimized coded aperture pattern using 3D-CCNN, and (c) the difference between the initial and optimized coded aperture patterns.

Figure 8 shows the classification results on the Pavia University dataset using (a) the proposed 3D-CCNN method, and the first four comparative methods: (b) the Rand-compress-3D-CNN method, (c) the Bluenoise-compress-3D-CNN method, (d) the Rand-compress-SVM method, and (e) the Bluenoise-compress-SVM method. The number of snapshots is 5, which means that the compression ratio of DD-CASSI is about 5% ($K/L = 5/103 \approx 4.9\%$). This is because when the compression ratio exceeds 5%, the classification accuracy cannot be improved significantly. One possible reason is that the hyperspectral datasets used in the experiments are highly correlated in both the spatial and spectral domains, and five compressive measurements (equivalent to a 5% compression ratio) are adequate to acquire enough information for the subsequent classification procedure. In the 3D-CCNN framework, the spatial dimension of each patch in both the training and testing sets is $7 \times 7$. Thus, the patch size of the 3D-CNN input is $7 \times 7 \times 5$.

Fig. 8. Classification results on the Pavia University dataset using the following methods: (a) the proposed 3D-CCNN method, (b) Rand-compress-3D-CNN method, (c) Bluenoise-compress-3D-CNN method, (d) Rand-compress-SVM method, and (e) Bluenoise-compress-SVM method.

Table 1 shows the classification performance of the proposed method and the first four comparative methods on the Pavia University dataset. The OA, AA, and Ka of 3D-CCNN are 86.58%, 87.47%, and 0.839, respectively; those of Rand-compress-3D-CNN are 73.39%, 73.01%, and 0.678; those of Bluenoise-compress-3D-CNN are 76.34%, 76.79%, and 0.715; those of Rand-compress-SVM are 45.78%, 36.69%, and 0.315; and those of Bluenoise-compress-SVM are 47.55%, 38.07%, and 0.334. These metrics are calculated by averaging over 5 runs of the experiments. Rows two through ten of the table show the percentage of correctly classified samples for each kind of object. The last three rows provide the OA, AA, and Ka of the overall classification result.

Table 1. The classification performance of the proposed method and the first four comparative methods using the Pavia University dataset (30% training and 70% testing)

In order to demonstrate the robustness of the proposed method, we swap the ratios of the training and testing sets. In the following simulations, 70% of the labeled pixels are chosen as the training samples, and the remaining 30% of the pixels are used for testing. Figure 9 shows the classification results on the Pavia University dataset using (a) the proposed 3D-CCNN method, (b) the Rand-compress-3D-CNN method, and (c) the Rand-compress-SVM method. Table 2 shows the classification performance of the three methods mentioned above. The OA, AA, and Ka of 3D-CCNN are 93.00%, 94.17%, and 0.916, respectively; those of Rand-compress-3D-CNN are 88.00%, 89.30%, and 0.856; and those of Rand-compress-SVM are 47.14%, 37.84%, and 0.329. Comparing the results in Table 1 and Table 2, it is observed that increasing the ratio of the training set improves the classification accuracy to some extent; however, the training time also increases when more training samples are used.

Fig. 9. Classification results on the Pavia University dataset using the following methods: (a) the proposed 3D-CCNN method, (b) Rand-compress-3D-CNN method, and (c) Rand-compress-SVM method. 70% of the labeled pixels are used as the training samples, and the remaining 30% pixels are used for testing.

Table 2. The classification performance of the proposed method and the two comparative methods using the Pavia University dataset (70% training and 30% testing)

Fig. 10. Classification results on the Pavia University dataset using the following methods: (a) the Original-3D-CNN method, (b) Rand-construct-3D-CNN method, (c) Bluenoise-construct-3D-CNN method, (d) Original-SVM method, (e) Rand-construct-SVM method, and (f) Bluenoise-construct-SVM method.

The above simulations show that the proposed 3D-CCNN method outperforms the other methods that operate directly on the compressive measurements. The gain of the proposed method is mainly attributed to the joint optimization of the coded apertures and the network parameters. In addition, both the 3D-CNN and SVM classifiers perform better with the blue noise coded apertures than with the random ones. That is because the blue noise coding strategy achieves more uniform sampling than random coding, which is beneficial for capturing more structural information from the target scene. Based on the same type of coded apertures, the 3D-CNN outperforms the SVM classifier, which demonstrates the superior prediction capacity of the deep learning approach. The poor performance of the SVM indicates that it is inadequate for obtaining practical classification results directly from the compressive measurements of the CASSI system. This makes sense given that the spectral information of the target scene is implicit in the compressive domain.

Figure 10 shows the classification results on the Pavia University dataset using (a) the Original-3D-CNN method, (b) the Rand-construct-3D-CNN method, (c) the Bluenoise-construct-3D-CNN method, (d) the Original-SVM method, (e) the Rand-construct-SVM method, and (f) the Bluenoise-construct-SVM method. From the image, 30% of the labeled pixels are randomly chosen as the training samples, and the remaining 70% of the pixels are used for testing. These methods perform the classification based on the original hyperspectral data cube or the reconstructed data cube. Table 3 provides the classification performance metrics for these methods. The OA, AA, and Ka of Original-3D-CNN are 94.88%, 97.46%, and 0.939, respectively; those of Rand-construct-3D-CNN are 92.79%, 95.53%, and 0.914; those of Bluenoise-construct-3D-CNN are 93.02%, 95.64%, and 0.916; those of Original-SVM are 88.20%, 80.46%, and 0.857; those of Rand-construct-SVM are 76.26%, 73.16%, and 0.710; and those of Bluenoise-construct-SVM are 78.67%, 76.31%, and 0.739.

Table 3. The classification performance of the last six comparative methods using the Pavia University dataset (30% training and 70% testing)

From Fig. 10 and Table 3, it is observed that the classifiers with blue noise coded apertures outperform those with random coded apertures, since the blue noise coding strategy achieves higher reconstruction quality than random coding. Although the 3D-CNN applied to the reconstructed data cube outperforms the proposed 3D-CCNN method, it is important to note the computational complexity of solving the reconstruction problem. In our simulations, the reconstruction process takes about 1168 s, whereas the 3D-CCNN only takes 57 s to calculate the entire classification map. That is, the proposed 3D-CCNN method achieves more than 20-fold acceleration compared to the reconstruction-based methods. In all of the experiments, the classification maps are calculated pixel by pixel; that is, the classification labels for the pixels in the testing dataset are calculated in sequence, and no special parallel computing method is used. On average, the 3D-CCNN takes approximately 0.18 s to classify one spatial pixel. It is expected that the efficiency of the 3D-CCNN approach can be significantly improved by using parallel computing methods. More interestingly, the performance of 3D-CCNN with only a 5% compression ratio is even better than that of the well-known SVM classifier applied to the reconstructed full data cube.

We also test and evaluate all of the classification methods on another dataset, referred to as the Salinas Valley dataset. Details of these experiments are presented in Supplement 1.

5. Conclusion

This paper develops an efficient 3D-CCNN method to perform hyperspectral classification directly on the DD-CASSI compressive measurements. The proposed 3D-CCNN method successfully avoids the time-consuming reconstruction procedure and the influence of reconstruction artifacts. In addition, the hardware-based coded apertures and the software-based 3D-CNN are combined into a uniform framework, which is then jointly optimized by an end-to-end training method to increase the degrees of optimization freedom. Based on a set of simulations, the proposed 3D-CCNN is shown to outperform the 3D-CNN and SVM classifiers based on the compressive measurements. Also, the performance of 3D-CCNN with only about a 5% compression ratio is comparable to or even better than that of the SVM classifier based on the full data cube. In the future, we will compare the proposed 3D-CCNN approach with other coding strategies, and study post-processing techniques to further improve the classification accuracy.

Funding

Fundamental Research Funds for the Central Universities (2018CX01025, 2020CX02002).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [7].

Supplemental document

See Supplement 1 for supporting content.

References

1. A. Plaza, Q. Du, Y.-L. Chang, and R. L. King, “High performance computing for hyperspectral remote sensing,” IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing 4(3), 528–544 (2011). [CrossRef]  

2. J. M. Bioucas-Dias, A. Plaza, G. Camps-Valls, P. Scheunders, N. Nasrabadi, and J. Chanussot, “Hyperspectral remote sensing data analysis and future challenges,” IEEE Geosci. Remote Sens. Mag. 1(2), 6–36 (2013). [CrossRef]  

3. J. Ediriwickrema and S. Khorram, “Hierarchical maximum-likelihood classification for improved accuracies,” IEEE Trans. Geosci. Remote Sensing 35(4), 810–816 (1997). [CrossRef]  

4. J. Li, J. M. Bioucas-Dias, and A. Plaza, “Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning,” IEEE Trans. Geosci. Remote Sensing 48(11), 4085–4098 (2010).

5. L. Samaniego, A. Bárdossy, and K. Schulz, “Supervised classification of remotely sensed imagery using a modified k-NN technique,” IEEE Trans. Geosci. Remote Sensing 46(7), 2112–2125 (2008). [CrossRef]  

6. F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Transactions on geoscience and remote sensing 42(8), 1778–1790 (2004). [CrossRef]  

7. M. Paoletti, J. Haut, J. Plaza, and A. Plaza, “Deep learning classifiers for hyperspectral imaging: A review,” ISPRS Journal of Photogrammetry and Remote Sensing 158, 279–317 (2019). [CrossRef]  

8. A. B. Hamida, A. Benoit, P. Lambert, and C. B. Amar, “3-D deep learning approach for remote sensing image classification,” IEEE Transactions on geoscience and remote sensing 56(8), 4420–4434 (2018). [CrossRef]  

9. S. K. Roy, G. Krishna, S. R. Dubey, and B. B. Chaudhuri, “HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters 17(2), 277–281 (2020). [CrossRef]  

10. N. Cohen, S. Shmilovich, Y. Oiknine, and A. Stern, “Deep neural network classification in the compressively sensed spectral image domain,” J. Electron. Imag. 30(04), 041406 (2021). [CrossRef]  

11. N. Gat, “Imaging spectroscopy using tunable filters: a review,” in Wavelet Applications VII (International Society for Optics and Photonics, 2000), vol. 4056, pp. 50–64. [CrossRef]  

12. A. Gorman, D. W. Fletcher-Holmes, and A. R. Harvey, “Generalization of the Lyot filter and its application to snapshot spectral imaging,” Opt. Express 18(6), 5602–5608 (2010). [CrossRef]  

13. A. Wagadarikar, R. John, R. Willett, and D. Brady, “Single disperser design for coded aperture snapshot spectral imaging,” Appl. Opt. 47(10), B44–B51 (2008). [CrossRef]  

14. G. R. Arce, D. J. Brady, L. Carin, H. Arguello, and D. S. Kittle, “Compressive coded aperture spectral imaging: An introduction,” IEEE Signal Process. Mag. 31(1), 105–115 (2013). [CrossRef]  

15. M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15(21), 14013–14027 (2007). [CrossRef]  

16. D. Kittle, K. Choi, A. Wagadarikar, and D. J. Brady, “Multiframe image estimation for coded aperture snapshot spectral imagers,” Appl. Opt. 49(36), 6824–6833 (2010). [CrossRef]  

17. C. Xun, Y. Tao, L. Xing, S. Lin, Y. Xin, Q. Dai, L. Carin, and D. Brady, “Computational snapshot multispectral cameras: Toward dynamic capture of the spectral world,” IEEE Signal Processing Magazine 33(5), 95–108 (2016). [CrossRef]  

18. R. Calderbank and S. Jafarpour, “Finding needles in compressed haystacks,” in Compressed Sensing: Theory and Applications, Y. C. Eldar and G. Kutyniok, eds. (Cambridge University Press, 2012), pp. 439–484.

19. A. Ramirez, H. Arguello, G. R. Arce, and B. M. Sadler, “Spectral image classification from optimal coded-aperture compressive measurements,” IEEE Trans. Geosci. Remote Sensing 52(6), 3299–3309 (2014). [CrossRef]  

20. Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral image classification using dictionary-based sparse representation,” IEEE Trans. Geosci. Remote Sensing 49(10), 3973–3985 (2011). [CrossRef]  

21. R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proc. IEEE 98(6), 1045–1057 (2010). [CrossRef]  

22. Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral image classification via kernel sparse representation,” IEEE Trans. Geosci. Remote Sensing 51(1), 217–231 (2013). [CrossRef]  

23. S.-H. Baek, H. Ikoma, D. S. Jeon, Y. Li, W. Heidrich, G. Wetzstein, and M. H. Kim, “Single-shot Hyperspectral-Depth Imaging with Learned Diffractive Optics,” arXiv preprint arXiv:2009.00463 (2020).

24. V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Trans. Graph. 37(4), 1–13 (2018). [CrossRef]  

25. G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljačić, C. Denz, D. A. B. Miller, and D. Psaltis, “Inference in artificial intelligence with deep optics and photonics,” Nature 588(7836), 39–47 (2020). [CrossRef]  

26. L. Wang, T. Zhang, Y. Fu, and H. Huang, “HyperReconNet: Joint Coded Aperture Optimization and Image Reconstruction for Compressive Hyperspectral Imaging,” IEEE Trans. on Image Process. 28(5), 2257–2270 (2019). [CrossRef]  

27. Y. Wu, I. O. Mirza, G. R. Arce, and D. W. Prather, “Development of a digital-micromirror-device-based multishot snapshot spectral imaging system,” Opt. Lett. 36(14), 2692–2694 (2011). [CrossRef]  

28. H. Rueda, D. Lau, and G. R. Arce, “Multi-spectral compressive snapshot imaging using RGB image sensors,” Opt. Express 23(9), 12207–12221 (2015). [CrossRef]  

29. H. Zhang, X. Ma, D. L. Lau, J. Zhu, and G. R. Arce, “Compressive spectral imaging based on hexagonal blue noise coded apertures,” IEEE Trans. Comput. Imaging 6, 749–763 (2020). [CrossRef]  

30. H. Zhang, X. Ma, and G. R. Arce, “Compressive spectral imaging approach using adaptive coded apertures,” Appl. Opt. 59(7), 1924–1938 (2020). [CrossRef]  

31. H. Arguello and G. R. Arce, “Rank minimization code aperture design for spectrally selective compressive imaging,” IEEE Trans. on Image Process. 22(3), 941–954 (2013). [CrossRef]  

32. M. He, B. Li, and H. Chen, “Multi-scale 3D deep convolutional neural network for hyperspectral image classification,” in 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3904-3908: IEEE.

33. W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, “Deep convolutional neural networks for hyperspectral image classification,” J. Sens. 2015, 258619 (2015). [CrossRef]  

34. Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sensing 54(10), 6232–6251 (2016). [CrossRef]  

35. C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning 20(3), 273–297 (1995).

36. M. Fauvel, J. Chanussot, and J. A. Benediktsson, “Evaluation of kernels for multiclass classification of hyperspectral remote sensing data,” in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2006, vol. 2, pp. II-II: IEEE.

37. C. V. Correa, H. Arguello, and G. R. Arce, “Spatiotemporal blue noise coded aperture design for multi-shot compressive spectral imaging,” J. Opt. Soc. Am. A 33(12), 2312–2322 (2016). [CrossRef]  

38. A. A. Wagadarikar, N. P. Pitsianis, X. Sun, and D. J. Brady, “Video rate spectral imaging using a coded aperture snapshot spectral imager,” Opt. Express 17(8), 6368–6388 (2009). [CrossRef]  

39. M. A. Figueiredo, R. D. Nowak, and S. J. Wright, “Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems,” IEEE J. Sel. Top. Signal Process. 1(4), 586–597 (2007). [CrossRef]  

40. J. M. Bioucas-Dias and M. A. Figueiredo, “A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. on Image Process. 16(12), 2992–3004 (2007). [CrossRef]  



