Single-shot hyperspectral imaging based on dual attention neural network with multi-modal learning

Tianyue He; Qican Zhang; Mingwei Zhou; Tingdong Kou; Junfei Shen

doi:10.1364/OE.446483

1. Introduction

Conventional digital camera imitates human eyes by assembling RGB color filters on the sensors, despite the missing details of scene spectra. In contrast, hyperspectral image (HSI) can be viewed as a 3D data-cube with 2D spatial and 1D spectral variation, providing each spatial location with a continuous range of spectrum. The intrinsic properties of the scene objects and power distribution of the illumination can be revealed by a HSI with high spectral resolution, which makes HSI useful in wide range of applications in science and industry, such as material classification [1,2], object detection [3–5], food safety analysis [6,7], and medical diagnostic [8–11].

There have been a variety of HSI recovery techniques proposed and evaluated over the past decades. Most of them can be categorized into four groups: dispersion-based methods, filter-based methods, Fourier transform based methods, and hybrid-based methods.

Dispersion-based methods (spectral-spatial modulation). Dispersion-based methods realize hyperspectral imaging by using different dispersive elements, in which the spectral components are spatially modulated.

(1) Spatial scanning. The conventional spatial scanning systems, such as push-broom and whisk-broom [12,13], record the spectrum of a slit of the scene through a dispersive prism, and then scan the entire scene to capture the complete hyperspectral cube. Although easy to handle, these approaches require precise moving parts and only work for static scenes, sacrificing the temporal resolution.
(2) Coded aperture. Another classical dispersive system, named coded aperture snapshot spectral imaging (CASSI), use coded mask and prims (or gratings) to encode the scene spectra [14–17]. The HSI can be restored from the encoded 2D image by solving a constraint inverse problem while the problem constraints can be given by different kinds of side information, such as the sparse prior of the scene or the spectral sensitivity of the camera sensors [18–20]. Generally, the CASSI system is able to capture hyperspectral images and even videos but is often confined to large tabletop setups due to the usage of multiple lenses and complex optical components.
(3) DOE and scatter. In spite of prism and grating, other dispersive medias, such as diffractive optical element (DOE) and scatter, have also been employed for multi-spectral imaging. These approaches mainly depend on diffraction transportation theory to modulate different wave-bands and generate the 2D projection [21–23]. In some cases, under the shift-invariant assumption of point spread function (PSF), the HSI can be reconstructed through a deconvolution process using the pre-calibrated PSF of the optical system [24]. Systems using DOE and scatters are usually compact because they require only a sensor and one kind of scattering media as their optical system. However, the trade-off between spatial-spectral resolution and measured spectral range is hardly to be avoided, since the PSF serves as a function of spatial coordinate and wavelength.

Filter-based methods (spectral-spectral modulation). Unlike dispersion-based methods using spatial encoding scheme, filter-based methods focus on rebuilding the HSI from a signal that is spectrally modulated by narrow-band or broad-band color filters.

(1) Narrow-band filters. Traditional spectral scanning systems place the tunable narrow-band spectral filter in front of the sensor, and the HSI can be recovered via multiple exposures on the target scene. These systems usually have limited temporal resolution due to the inherent scanning procedure. Technically, the narrow-band filters array can be directly assembled on the sensors to realize snapshot hyperspectral imaging [25]. Ono used a pixel-wise polarization color image sensor to realize single-shot spectral imaging, capturing the multispectral image through an imaging lens with multiple narrow-band spectral filters and polarization filters [26]. The multi-spectral data was reconstructed by multiplying the 2D signal recorded in the image plane with an invertible transmission matrix. The optical system reported in Ref. [26] has a compact size so that it can be mounted on portable devices. However, since the tiled array consists of a grid of filters, the spectral resolution increases while the spatial resolution decreases. The reconstruction quality may degrade in low light conditions due to the limited transmittance of narrow-band filters and polarization modules.
(2) Broad-band filters. Models using broad-band filters spectrally modulate the entire spectral bands, and then rebuild the multiplexed spectral information through a specific reconstructive process. Zhang et al. developed a deeply learned broad-band encoding stochastic hyperspectral camera by using advanced artificial intelligence in filter design [27]. This gives the flexibility to design a customized broadband filter system that can be jointly optimized with the reconstruction algorithm.

Typically, the commonly used RGB image can also be viewed as a spectral sampling of the original HSI, which is modulated by three broad-band color filters (red, green and blue). Therefore, it will be convenient if HSI can be directly recovered from the corresponding RGB image, without any complex optical elements, such as the projection architectures and large amounts of narrow-band filters. Toward this end, Xiong et al. proposed hyperspectral convolutional neural networks (HSCNN) and HSCNN+ to learn the mapping function from RGB image to ground-truth HSI [28,29], and then realized end-to-end spectral imaging by upsampling interpolation. HSCNN+ ranked the 1st place in the NTIRE 2018 challenge. Recently, Li et al. proposed an adaptive weighted attention network (AWAN) for RGB-to-HSI reconstruction with the help of camera spectral sensitivity prior [30]. Their remarkable work ranked the 1st place in the NTIRE 2020 challenge (Track 1).

Fourier transform based methods (spectral-frequency modulation). Fourier transform based methods obtain spatial-spectral information of the measured object by using the Fourier transform relationship between the captured signal and restored spectrum [31], mainly depend on diffractive optical elements such as Fresnel lenses and diffraction gratings. Although these approaches can realize spatial-spectral modulations in a wavelength scale, the large volume and complexity of the system limit their real applications.

Hybrid-based methods (hybrid modulation). Recently, a diverse range of hybrid architectures consisting of different optical devices emerged. Disparate optical components, such as color-coded aperture (CCA) [32,33], meta lenses [34], and photonic crystal slabs [35], can provide more efficient spatial-spectral modulations. For instance, Henry et al. explored the advantages of the spectral modulation of an optical setup composed of a DOE and a CCA [36]. Based on their previous work in deep optic [37], they developed a novel architecture in which the optical elements, including the spectral response of the filter and the height map of the diffractive elements, can be jointly optimized with the reconstruction algorithm to achieve unprecedented result. Monakhova et al. proposed a lensless architecture equipped with diffusercam and spectral filter array to circumvent the trade-off between spatial and spectral image resolution [38], which realized a novel, compact and inexpensive snapshot hyperspectral imaging system.

As previously demonstrated, these existing methods face one or more of the following challenges: long time-consuming, complicate light-path, large benchtops, and trade-off between spatial and spectral resolution. For instance, scanning approaches are precise but confined to long scanning time. CASSI method can realize snapshot spectral imaging but are often complex with special optical devices. In comparison, the broad-band filter based RGB-to-HSI methods have relatively balanced performance, realizing single-shot hyperspectral imaging without complicate optical elements. However, RGB-to-HSI methods have the difficulty in generalizing with different imaging systems. Disparate camera characteristics are hard to be considered in the single-modal based framework, which may lead to inaccurate reconstruction when the object camera has different spectral sensitivity functions.

To address the aforementioned issues, this paper proposes a multi-modal hyperspectral imaging neural network (MHINet) to accomplish end-to-end snapshot hyperspectral imaging. To realize HSI acquisition using arbitrary RGB cameras, our multi-model learning scheme focuses on the vital physics process of hyperspectral imaging and considers different camera characteristics. By solving an inverse imaging problem, the hyperspectral volume can be recovered from only one scene RGB image. During training, scene spectra features and camera spectral sensitivity prior supplied by different modal components are jointly learned, and the mapping relationship between RGB image and HSI can be automatically built. Meanwhile, a customized loss function and attention mechanism are employed to boost the network performance. Experiment validates the proposed approach on both synthetic and real captured data, and shows precise recovery of HSIs for indoor and outdoor scenes.

The main contributions of our paper are:

(1) A method combining deep learning with hyperspectral imaging, enabling compact, end-to-end, and snapshot spectral imaging.
(2) A customized framework for solving the inverse compressive imaging problem based on CNN structure and spatial-channel attention modules.
(3) Specifically designed multi-modal learning mechanism to integrate disparate modal information, which greatly facilitates the pre-trained network generalizing in different imaging systems.

The rest of this paper proceeds as follows: Methodology and network design are presented in Section 2. Section 3 shows experiment and results. Comparisons between different methods are also given here. Finally, conclusions and discussions are drawn in Section 4.

2. Method

2.1 Hyperspectral volume projection

A RGB image is formed by the projection of a HSI along the spectral dimension over wavelength domain. Let H(x, y, λ) denotes the 3D hyperspectral cube with (x, y) being the 2D spatial coordinates and λ being the spectral dimension. The relationship between HSI H(x, y, λ) and RGB image I_k (x, y) can be formulated as:

(1)$${I_k}({x,y} )= \int_\lambda {H({x,y,\lambda } ){S_k}(\lambda )} d\lambda, $$

where k∈{R, G, B} represents the spectral channel index and S_k(λ) denotes the spectral response of channel k at wavelength λ. Meanwhile, Eq. (1) can be discretized into a vector-matrix form as follows:

(2)$${{\mathbf I}_k}({x,y} )\textrm{ = }\sum\limits_\lambda {{\mathbf H}({x,y,\lambda } ){{\mathbf S}_k}(\lambda )}, $$

Equation (2) presents the fundamental hyperspectral imaging process and it will be further used to generate the training dataset. Details will be discussed in Section 3.1.

2.2 Multi-modal embedding for HSI recovery

We aimed to rebuild the HSI from the corresponding RGB image. Inspired by the spatial super-resolution task of a 2D image, recovering HSI from spectrally under-sampled projections can be regarded as a spectral super-resolution problem. However, this problem is severely ill-posed since a large amount of information is lost during the imaging process. Nevertheless, it is still possible to extract high-level information from RGB image and up-sample three channels to more channels since most of the hyperspectral bands are highly correlated. To make this problem solvable, a unified multi-modal embedded deep-learning framework “MHINet” is proposed. Compared with the single-modal-based method, multi-modal learning mechanism can find concise feature representations of each modality, extract the interdependencies of different modal information, and build their relations to achieve higher performance. In MHINet, the first visual-image-modal is used to learn scene spectra features, while the second spectral-modal focuses on assisting the spectral decoding process by providing the network with camera spectral sensitivity prior. Integrated with attention mechanism, disparate modal components are adaptively aggregated to establish a unique spectral learning procedure, from which the spectrum is reconstructed.

2.2.1 HSI reconstruction algorithm

According to Eq. (2), the projection from HSI to RGB contains two main steps: (1) spectral response encoding and (2) integration summation, in which the 3D hyperspectral cube is compressed into a 2D image. Therefore, the HSI can be completely rebuilt from each channel of the RGB image by solving a compressive-imaging inverse problem. Let HS denotes the multiplication of H(x, y, λ) and S_k(λ), the reconstruction process aims to learn a reverse process of : I_k(x, y) → HS → H(x, y, λ). For that purpose, three key modules are specifically designed in our MHINet, including the U-Net blocks, the attention blocks, and the pre-trained confidence voting convolutional neural network (CVNet). The block diagram of the proposed method is depicted in Fig. 1. It takes the scene-ColorChecker RGB image pair as input and outputs the corresponding hyperspectral cube. The pre-trained CVNet (detailed in Section 2.2.4) is capable of transforming the ColorChecker image into the corresponding camera response, thereby the input RGB-ColorChecker image pair should be taken by the same camera. During the training phase, the first visual-image-modal RGB image and the second spectral-modal camera response produced by the CVNet are learned by the network simultaneously.

Fig. 1. Overview of the proposed method. It takes the RGB image and ColorChecker image as input and outputs the reconstructed hyperspectral image. SA-CA denotes the channel attention and spatial attention block, and RGBA denotes the RGB attention block. The raw hyperspectral image (R-HSI) and the spectral-encoded hyperspectral image (SE-HSI) are the output of the U-Net and SA-CA, respectively. Response * represents the single channel (R, G, or B) spectral response of the camera.

Download Full Size | PDF

Inversion step 1: I_k(x, y) → HS. The U-Net blocks along with the spatial attention and channel attention (SA-CA) blocks are responsible for recovering the inverse process of integration summation. Instead of up-sampling by simple interpolation, the U-Net structure up-samples each channel of the RGB image (R, G and B) to the desired spectral resolution and maps these features into a certain kind, thereby the raw hyperspectral images (R-HSIs) are created by the unique scene spectra features. To further reduce the noise impact for higher precision, the attention module is closely placed with the U-Net module to extract important features of the R-HSI and integrate them with the highly related ones. Specifically, ‘channel attention’ focuses on spectral correlations through different spectral channels and ‘spatial attention’ tempts to locate the informative regions of the input image. Similar features would be related to each other regardless of their distances.

Inversion step 2: HS → H(x, y, λ). In the inverse design, the spectral encoded hyperspectral image (SE-HSI), i.e. HS, produced by the attention module can be viewed as a spectral data-cube encoded with camera spectral sensitivity information. This data-cube is then spectrally decoded by using the camera spectral response provided by the pre-trained CVNet. After the spectral decoding process, three HSIs can be obtained from the corresponding RGB paths. These elements are weighted by the RGB weights calculated from the RGB attention block and then integrated into the final reconstructed HSI.

In the backpropagation process, the network is optimized by minimizing the loss value between reconstructed HSI and ground-truth HSI (GT-HSI). Detailed inner structure of each sub-module will be respectively introduced in Section 2.2.2 ∼ 2.2.5.

2.2.2 U-Net block

The U-Net is commonly used in image segmentation and recognition. It follows a down-sample and up-sample structure and there is a special structure of skip-connection for rendering the large-scale features to the up-sampling process, where low-level information can be directly learned by the network to avoid model degradation. High-dimensional features can be extracted by the down-sample convolution and then classified by the up-sampling operations.

The U-Net structure introduced in our MHINet is demonstrated in Fig. 2. The integral U-Net module consists of three U-Net structures, each of which takes a single channel of the RGB image (size of H×W×1) as input and outputs the R-HSI with H×W spatial resolution and C spectral bands.

Fig. 2. Inner structure of the U-Net, whose input-output pair is a single channel R/G/B image and R-HSI, respectively. The bottom numbers represent the output dimension of this layer.

Download Full Size | PDF

2.2.3 SA-CA block

Convolution operations may lead to a local receptive field while attention modules can model long-range dependencies. Since the attention mechanism has been proved to be capable of boosting the network performance, it has recently become popular in computer vision (CV) domain [39]. The spatial and channel attention modules can explore global contextual information and better spatial-spectral representations, which is significantly important to these spectral imaging tasks that require accurate feature representation between different spatial locations and spectral channels. Therefore, we also considered integrating them into our MHINet as a spatial-spectral enhance module for inversion step 1. As illustrated in Fig. 3 and Fig. 4, two types of attention blocks are designed to draw global context over local features, including spatial attention block and channel attention block. The process to adaptively aggregate spatial and spectral context is elaborated as follows.

Fig. 3. Details of the spatial attention (SA) block.

Download Full Size | PDF

Fig. 4. Details of the channel attention (CA) block.

Download Full Size | PDF

In particular, a patch-level spatial attention mechanism is employed in our network. The input features are discretized into several patches to explore the relations between different image segments. As shown in Fig. 3, the input feature A∈ℝ^H^×W×C is first fed into the convolutional layers to generate three new feature maps Q, K, and V, where {Q, K}∈ℝ^H^×W and V∈ℝ^H^×W×C. Then, Q and K are reshaped to ℝ^{(X×Y)×(P×P)}, where P×P is the patch size, X×Y denotes the patch number while X = H/P and Y = W/P. After that a matrix multiplication is performed between Q and the transpose of K, and a softmax operation is applied to calculate the spatial attention map S∈ℝ^{(X×Y)×(X×Y)}. Let N = X×Y, so we have:

(3)$${s_{ij}} = \frac{{\exp ({{{\mathbf Q}_i} \cdot {{\mathbf K}_j}} )}}{{\sum\limits_{i = 1}^N {\sum\limits_{j = 1}^N {\exp ({{{\mathbf Q}_i} \cdot {{\mathbf K}_j}} )} } }}, $$

where i, j ∈ [1, N] is the position index of Q, K, and s_ij measures the i^th patch’s impact on the j^th patch, i.e., larger value of s_ij corresponds to more similar feature representations of these two patches. Meanwhile, another feature map V∈ℝ^H^×W×C is reshaped to ℝ^N^×(P×P×C). Then a matrix multiplication is performed between S and V. Finally, the result is reshaped to ℝ^H^×W×C and summed with the input feature A in an element-wise form to obtain the final output B∈ℝ^H^×W×C:

(4)$${{\mathbf B}_{H \times W \times C}} = reshape[{\alpha \cdot ({{{\mathbf S}_{N \times N}}{{\mathbf V}_{N \times (P \times P \times C)}}} )} ]+ {{\mathbf A}_{H \times W \times C}}, $$

where α is a trainable weight coefficient during training and reshape(·) denotes a dimension transformation from ℝ^N^×(P×P×C) to ℝ^H^×W×C. It can be inferred that each patch of the output feature B is a weighted summation of the features across all patches and features. Such a design encourages the spatial attention block to have a global contextual view and selectively aggregates contexts according to the attention map S.

In the channel attention block, as shown in Fig. 4, the input feature D∈ℝ^H^×W×C is first fed into the convolutional layers and transformed into three feature maps Q*, K*, and V*, respectively, where {Q*, K*, V*}∈ℝ^H^×W×C. Similarly, we reshape Q* and K* to ℝ^N^×C and perform a matrix multiplication between transposed Q* and K*. By applying a softmax layer to the multiplication result, the channel attention map S*∈ℝ^C^×C can be obtained, whose element s*_mn can be interpreted as:

(5)$$s_{mn}^\mathrm{\ast } = \frac{{\exp ({{\mathbf Q}_m^\mathrm{\ast } \cdot {\mathbf K}_n^\mathrm{\ast }} )}}{{\sum\limits_{m = 1}^C {\sum\limits_{n = 1}^C {\exp ({{\mathbf Q}_m^\mathrm{\ast } \cdot {\mathbf K}_n^\mathrm{\ast }} )} } }}, $$

where m, n ∈ [1, C] is the channel index of Q*, K*, and s*_mn measures the m^th channel’s impact on the n^th channel. Meanwhile, we perform a matrix multiplication between the feature map V* and S^* and reshape their result to ℝ^H^×W×C. Then, it is multiplied by a scale parameter β and summed with the input feature D to calculate the final output E∈ℝ^H^×W×C, which can be written as:

(6)$${{\mathbf E}_{H \times W \times C}} = reshape[{\beta \cdot ({{\mathbf V}_{N \times C}^\ast {\mathbf S}_{C \times C}^\ast } )} ]+ {{\mathbf D}_{H \times W \times C}}, $$

Equation (6) shows the final feature of each channel is a weighted summation of all channels and features.

To summarize, SA-CA block aims to model the long-range semantic dependence between different spatial positions and spectral bands. Owing to its global attention ability, the intensity and spectrum distribution of the original hyperspectral cube can be accurately learned. In this case, the intra-class compactness and semantic consistency of the reconstructed SE-HSI are improved to achieve mutual gains. For example, a red apple may be first recognized and classified by the U-Net, and then encoded in the SA-CA block with more long-wave bands information. The other advantage of the SA-CA block is that the output feature has the same shape with the input one, indicating that it can be embedded in an optional position as an enhanced sub-module. In MHINet, the SA-CA block is closely placed with the U-Net block to acquire global context of the R-HSI and convert it into SE-HSI, reducing noise impact and accomplishing higher accuracy.

2.2.4 Pre-trained CVNet

As previously demonstrated, the SE-HSI produced by the attention block is a spectrally encoded data-cube. Since the spectral sensitivity is boned to the camera, such redundant information should be removed by providing the network with correct preconditions. Based on our previous work for camera spectral sensitivity estimation [40], the pre-trained CVNet is applied to rebuilds the camera spectral sensitivity from only one ColorChecker image and offer the second spectral-modal for the spectral decoding process. By modeling the spectral sensitivity as the sum of weighted basis functions, there are two major steps contained in CVNet: (1) basic convolutional operations and (2) confidence voting integration. High-dimensional features from different image regions are extracted through multiple convolutional operations and then integrated by confidence voting algorithm to generate weight. The mapping relationship from RGB image to spectral sensitivity can be autonomously built. The inner structure of the CVNet is shown in Fig. 5, including 1 gamma layer, 6 convolutional layers, 5 pooling layers, and 1 confidence voting layer.

Fig. 5. Detailed structure of CVNet. It takes ColorChecker images (512×512×3 for example) as input and outputs C×3 camera response for three channels, where C is the spectral number determined by the shape of HSI. Bottom numbers represent the output dimension of this layer.

Download Full Size | PDF

Gamma nonlinear layer compensates for the model nonlinearity and optimizing the gradient descent process.

Multiple convolution and pooling layers extract high-dimensional features, such as the illumination spectrum and spectral reflectance, to offer the underlying constraint for solving the inverse problem.

Confidence voting layer evaluates the information extracted from different image segments and creates disparate confidence to generate basis functions’ weights. The length of confidence matrix is 1 + 3n while n is the basis function number. By applying softmax function to the first channel of confidence voting layer, normalized confidence levels c_i can be obtained. Each confidence value represents the importance of the behind features. For instance, when the extracted high-dimensional features come from the part of the image with less effective spectral information, the generated weight matrix matches lower confidence. The rest 3n channels of the confidence layer consist of the weight matrixes w_i∈ℝ^1×1×3n, and the weight coefficient W_BF ∈ℝⁿ^×3 of the basis functions can be written as:

(7)$${{\mathbf W}_{BF}} = reshape\left( {\sum\limits_i {{c_i} \cdot {{\mathbf w}_i}} } \right). $$

Each column of W_BF corresponds to the function weight of each channel (R, G and B). Finally, the camera spectral sensitivity S∈ℝ^C^×3 can be calculated by the weighted summation of the basis functions.

Although different types of basis functions and different function numbers could be the candidates for training CVNet, 50 Fourier basis functions were used in the real experiment because it was proved to have the best performance in our previous work. More detailed discussions and results about different basis functions can be found in Ref. [40].

In the spectral decoding process, the camera response reconstructed by the pre-trained CVNet is first fed into a 1×1 convolutional layer and then multiplied with the SE-HSI over the spectral dimension. Note that the SE-HSIs generated from the R/G/B image are decoded by using the spectral response of channel R, G, and B, respectively.

2.2.5 RGB attention (RGBA) block

As shown in Fig. 1, all of the SE-HSIs pass through the spectral decoding process and then create three different HSIs. Theoretically, the HSI can be rebuilt from an arbitrary channel of the RGB image, but more attention should be paid to these channels within more effective information. For instance, if a RGB image is captured under the monochromatic light condition (only 700nm for example), the HSI reconstructed from channel G or B may become unreliable. Therefore, the RGB attention block is applied to assign these three HSIs with their own weights, and then integrates them to generate the final reconstructed HSI.

Let I_R, I_G, and I_B represent the input R, G, and B images with the shape of H×W. The weight coefficients W_R, W_G, and W_B for three channels can be formulated as:

(8)$${{\mathbf W}_k} = {{\mathbf I}_k} \cdot \textrm{ }{{\mathbf I}_R} + {{\mathbf I}_k} \cdot \textrm{ }{{\mathbf I}_G} + {{\mathbf I}_k} \cdot \textrm{ }{{\mathbf I}_B}, $$

where k∈{R, G, B}, “ · “ denotes Hadamard product, and W_k∈ℝ^H^×W denotes the k^th channel’s weight. In addition, W_R, W_G, and W_B are concatenated to a 3D cube W∈ℝ^H^×W×3, and then normalized by the softmax function over the channel dimension. The final reconstructed hyperspectral image H can be calculated by a weighted combination, as shown in Eq. (9) and Fig. 6:

(9)$${\mathbf H} = \sum\limits_k {{{\mathbf W}_k}} \cdot {\mathbf HS}{{\mathbf I}_k}, $$

where HSI_k is the hyperspectral cube produced by the k^th channel.

Fig. 6. Details of the RGB attention block. It weights the HSIs produced by three channels and integrates them into the final reconstructed HSI.

Download Full Size | PDF

2.3 Loss function

To train the network, a customized loss function considering the correlations between different spatial locations and spectral channels is proposed. The difference between ground-truth and reconstructed HSI is measured by different losses. The overall loss function L_overall can be interpreted as:

(10)$${L_{overall}} = {L_{MSE}} + {L_{SSIM}} + {L_{COS}} + {L_{reg}}, $$

where L_MSE is the mean square error (MSE) for measuring the pixel-wise difference, which can be given by:

(11)$${L_{MSE}}\textrm{ = }\frac{1}{M}\sum\limits_{i = 1}^M {{{||{{\mathbf H}_i^\ast{-} {{\mathbf H}_i}} ||}^2}}, $$

where M is the pixel number of HSI, H_i* and H_i represent the pixel of ground-truth HSI and reconstructed HSI, respectively.

L_SSIM= 1 –SSIM(x, y), and SSIM(x, y) represents the local structural similarity (SSIM) between x and y, ensuring the luminance, contrast, and structure of the reconstructed image being consistent with the ground-truth. The SSIM(x, y) is defined as:

(12)$$\textrm{SSIM}({\mathbf x},{\mathbf y})\textrm{ = }\frac{{({2{\mu_x}{\mu_y} + {C_1}} )({2{\sigma_{xy}} + {C_2}} )}}{{({\mu_x^2\mu_y^2 + {C_1}} )({\sigma_x^2 + \sigma_y^2 + {C_2}} )}}, $$

where C₁, C₂ are constants and experimentally set as 1 × 10⁻³ and 1 × 10⁻⁴. μ_x is the mean of x. The variables σ_x² and σ_xy are the variance of x and the co-variance of x and y, respectively. The rest ones μ_x and σ_y² represent the corresponding meanings. Moreover, L_SSIM satisfies the condition that L_SSIM ∈[0,1] since the SSIM(x, y) is in the range of [0,1].

L_COS is the cosine distance between ground-truth H* and reconstructed H. Differing from the super-resolution task of a 2D image, millions of spectral information are involved in different spectral channels. Here the cosine distance is employed to preserve the “spectrum consistency” of these high-dimensional features. L_COS computes the loss value of each spatial location over the spectral dimension, which can be written as:

(13)$${L_{COS}}\textrm{ = 1} - \frac{1}{{H \times W}}\sum\limits_{i = 1}^H {\sum\limits_{j = 1}^W {\frac{{{{\mathbf h}_{ij}} \cdot \textrm{ }{\mathbf h}_{ij}^\ast }}{{||{{{\mathbf h}_{ij}}} ||\times ||{{\mathbf h}_{ij}^\ast } ||}}} }, $$

where “ || · || “ represents the vector norm, H and W are the image height and width. Either h_ij or h_ij* denotes a 1×1×C vector at position (i, j), while C represents the channel number.

Furthermore, the Euclidean norm loss L_reg is used to prevent over-fitting during training and enhance the network performance in real test. It can be expressed by:

(14)$${L_{reg}} = \lambda {\sum\limits_i {|{{\omega_i}} |} ^2}, $$

where ω_i is the optimizable weight of the hidden layer and λ is the weight decay coefficient of the regulation loss.

During training, the mapping relationship from RGB image to HSI can be gradually learned by minimizing L_overall.

3. Experiment and results

3.1 Dataset acquisition

From Fig. 1, it can be seen that each training pair consists of three components: a scene RGB image along with a ColorChecker RGB image captured by the same camera, and the ground-truth HSI. To provide a more accurate simulation of capturing these two RGB images in an identical camera system, we first collected their hyperspectral data. Then, the RGB images were numerically calculated using the corresponding hyperspectral data and supplied camera spectral sensitivity based on the RGB-to-HSI projection rule in Eq. (1).

In practice, the hyperspectral data of a 140-patch ColorChecker were taken by a hyperspectral camera “Specim IQ” with the working band of 400 nm ∼ 1000 nm. For a better adaption in real measurement, the ColorChecker HSIs were captured under variable illumination conditions to enhance the generality ability of the network. Figure 7 (left) shows the whole experiment setup for ColorChecker HSI acquisition. The used light source was a LED lighting box and halogen lamps constructed in our laboratory. The lighting box was an LED panel with the size of 700mm×400mm integrated with 128 LEDs (11 kinds of LEDs). The 11 kinds of high-power LEDs consisted of 8 color LEDs and 3 white LEDs with different CCTs, whose spectral distributions were all given by Fig. 7 (right). Luminance level of each LED could be controlled by the LED control circuit and the target spectral distribution was modulated by the software based on a light matching algorithm and the feedback signal from a spectrometer.

Fig. 7. Experiment setup to capture the hyperspectral data of 140-patch ColorChecker (left). Spectral distribution of used LEDs (right), with shaded lines indicating the central wavelength. Vertical axis represents the normalized intensity, which is in arbitrary units.

Download Full Size | PDF

The NTIRE 2018 dataset (https://icvl.cs.bgu.ac.il/img/hs_pub/NTIRE2018/) was concomitantly used to train our network, whose hyperspectral images were acquired using a Specim PS Kappa DX4 hyperspectral camera and a rotary stage for spatial scanning. This HSI database contained a variety of scene targets that were captured with 1392×1300 spatial resolution over 519 spectral bands (400nm ∼ 1000nm at roughly 1.25nm increments). In experiment, both the ColorChecker HSIs and NTIRE HSIs are downsampled to 31 spectral channels from 400nm to 700nm at 10nm increments and cropped into patches with the size of 512×512 pixels to provide a larger receptive field for target recognition.

Based on the abovementioned dataset acquisition method, 600 NTIRE HSIs and 25 ColorChecker HSIs were experimentally measured and collected. Since the second spectral-modal was designed to learn the camera spectral sensitivity prior, the network should be given with richer spectral information during training to make it generalize well in real test, where an arbitrary new camera may be adopted. For that purpose, 25 groups of camera spectral sensitivity functions were used to simulate different kinds of cameras, each of which comprises the spectral response of RGB channels covering the range from 400nm ∼ 700nm with 31 bands.

Training dataset. The total 10,000 sets of training data were created by integrating 500 scene HSIs and 20 ColorChecker HSIs with 20 camera spectral sensitivity, e.g., every 500 NTIRE HSIs and 1 ColorChecker HSI were mapped into the RGB images using the same spectral sensitivity function.

Test dataset. Similarly, the rest 100 scene HSIs, 5 ColorChecker HSIs, together with 5 spectral responses were used to make 500 test sets, whose hyperspectral and spectral sensitivity data did not appear in the training set.

3.2 Training details

As previously demonstrated, the MHINet model was trained on 10,000 sets (9000 for training and 1000 for validation) and tested on 500 sets of images, each of which consists of a scene-ColorChecker RGB image pair and the corresponding hyperspectral cube. The Tensorflow framework and Adam solver were adopted for model optimization with the weight decay coefficient μ = 2×10⁻⁴. The learning rate is initialized as 3×10⁻⁴ and exponentially decayed by 0.9 after every 2 epochs. Here 1 epoch denotes a complete consumption round of the training data and the whole training phase was ended at the 201^th epoch for better convergence. Except for the 1×1 convolutional layers, all convolutional layers used 3×3 filter with a zero-padding to keep the feature map unchanged. The patch size P in the spatial attention module was set to 16, e.g., the input 512×512 image was discretized into 32×32 patches. To prevent overfitting, 40 percent parameters of the network, including weights and biases, are randomly dropped out in the back propagation process to reduce the model complexity. Also, L2 regularization loss is adopted. The introduced CVNet model was pre-trained with 50 Fourier basis functions which was proved to have the best performance in our previous work [40] with the spectral sensitivity estimation accuracy of 99.14%. Moreover, the batch size was set to 4 and the average running time was recorded over 50 runs. Under the setting of 512×512×3 input RGB images, it takes about 280 hours on Tesla P100 GPU to well train a MHINet model while the whole network approximately contains 27.1×10⁶ parameters.

3.3 Network train and test

The loss value after each training epoch was recorded, as shown in Fig. 8 (first row). Epoch training loss and validation loss represent the loss value computed from the training set and validation set, respectively. It can be seen that the loss function converges after approximately 170 epoch training iterations, which proves the feasibility of the proposed model. To validate the spectral sensitivity estimation accuracy of the pre-trained CVNet, Fig. 8 (second row) shows a sampling of spectral reconstructions overlaid on top of each other, with the solid line indicating the ground-truth spectra. The shape of these RGB curves coincides closely over the whole spectrum for different cameras, ensuring high reliability of the supplied spectral-modal.

Fig. 8. (First row) Illustration of the validation loss and epoch loss. (Second row) Estimated results for RGB channels along with the corresponding ground truth data calculated from the pre-trained CVNet. Solid and dash line represent the ground-truth (GT) and estimated (E) spectral sensitivity functions, respectively.

Download Full Size | PDF

In experiment, the MSE and SSIM computed between the submitted reconstruction results and ground-truth was selected as the objective indices to evaluate the network performance. We performed a reconstruction accuracy analysis both spatially and spectrally. Except for the abovementioned method, three kinds of other strategies using disparate loss functions and network structures, were employed for comparisons.

Comparison 1. The MSE loss “L_MSE” in Eq. (9) was replaced by the mean value of l₁ norm to investigate the effect of different loss functions. The “L₁” loss can be formulated as:

(15)$${L_1}\textrm{ = }\frac{1}{M}{||{{\mathbf H}_{}^\ast{-} {\mathbf H}} ||_1}, $$

where “ || · ||_{1 “} is the l₁ norm and M is the pixel number of the HSI. H* and H represent ground-truth HSI and reconstructed HSI, respectively.

Comparison 2. To investigate the contributions of attention mechanism and integrated multi-modal information, the attention module was replaced by equivalent convolutional layers and the pre-trained CVNet was removed. In this case, the network only comprised the U-Net module along with the conventional CNN structure.

Comparison 3. The HSCNN+ and AWAN method in Ref. [29] and Ref. [30] were employed to compare with the proposed method.

For a fair comparison, the range of all pixels was normalized and limited to [0,1] in the data pre-processing process. Let θ denote the input image, the normalized image θ_norm can be computed by: θ_norm= θ/max(θ), where max(θ) is the maximum pixel value of θ.

3.3.1 Comparisons of MHINet and CNN

Experimental results of comparison 1 and 2 are shown in Table 1, where the final test loss (MSE) and SSIM of these approaches are calculated from the test dataset. Figure 9 shows the recovered images in five selected bands (430 nm, 520 nm, 570 nm, 620 nm, and 660 nm), and the difference-maps between reconstructed and ground-truth HSI are provided in Fig. 10 to better distinguish the errors.

Fig. 9. Visual comparison of CNN and MHINet. Reconstruction results in five selected bands (430 nm, 520 nm, 570 nm, 620 nm, and 660 nm) are shown from left to right.

Download Full Size | PDF

Fig. 10. Difference-maps of five selected bands for three different methods, where the closer value to zero denotes better performance.

Download Full Size | PDF

Table 1. MSE and SSIM of MHINet and CNN. ^a

View Table | View all tables in this article

Benefitted from the proposed loss function considering the correlations among different spatial locations and spectral channels, all of these reconstructed spectral channels agree well with the ground-truth. However, as shown in Table 1 and Figs. 9 and 10, schemes operating on MHINet outperform the conventional CNN structure. Due to the missing information of camera response and the absence of attention module, the results of CNN have a consistently larger MSE than our MHINet across the whole bands, which demonstrates the superiority of integrated multi-modal learning and attention mechanism.

The MSE and SSIM results of the entire 31 spectral bands are illustrated in Fig. 11. It can be seen that the reconstruction results of “L_MSE” have higher accuracy and stability than that of “L₁”. It should be because the luminance level among different hyperspectral bands usually varies significantly, e.g., the same deviation in the pixel value may have different influence on the bands with different luminance levels. It makes the MSE loss generate a bias towards the more important bands and thus perform better than other methods.

Fig. 11. MSE (left) and SSIM (right) results of three methods computed from the entire 31 spectral bands. L_MSE and L₁ represent the schemes of “MHINet + L_MSE” and “MHINet + L₁”, respectively.

Download Full Size | PDF

Since the estimation accuracy of scene spectra is a crucial metric in hyperspectral imaging, the recovered spectrum profiles of different spatial locations are extracted to evaluate the quality of spectral reconstruction, as shown in Fig. 12. Based on the customized “cosine loss”, all of these rebuilt curves are prone to cover the profile of ground-truth; however, the “L_MSE” scheme still has the best performance, as expected, while the CNN-based method is the worst. Certainly, it’s uneasy for traditional CNN to achieve higher precision without the supplied spectral response prior and attention learning mechanism. In contrast, the recovered spectral profiles of our MHINet closely match the ground-truth, verifying its spectral learning ability and generality ability on the test set.

Fig. 12. Experimental results of reconstructed scene spectra. Red circles in the left four scene RGB images demonstrate the locations of these selected points. Numbers in the bracket represent the [B, G, R] value of this point. The corresponding spectral line profiles of each point are plotted in the right four images. Ground-truth spectral line profiles are plotted in black for reference.

Download Full Size | PDF

Furthermore, the robustness against noise is also considered and analyzed in experiments by adding random gaussian noises (with mean value of zero and standard deviation of 1×10⁻³) into the testing images. Testing results on the noisy dataset are shown in Table 2. Compared with Table 1, added noises have different effects on the reconstruction accuracy while the anti-noise characteristic of our MHINet is stronger than conventional CNN. This is because the attention modules employed in our network is specifically designed to automatically discriminate different features extracted from different segments of the image. High-dimensional features learned from these ‘noisy segments’ will be assigned with lower weights, and thus make little contribution to the experimental result.

Table 2. MSE and SSIM of different networks on the noise dataset. ^a

View Table | View all tables in this article

3.3.2 Comparison to HSCNN+ and AWAN

Other network structures, HSCNN+ and AWAN, were employed to compare with our MHINet. Their final test results of MSE loss and SSIM are shown in Table 3. Compared with the MSE result of HSCNN+, our MHINet is 3.4 times (0.00235/0.00069) more precise. In addition, the reconstructed images from different spectral channels and the corresponding difference-maps are also given in Figs. 13 and 14.

Fig. 13. Visual comparison of HSCNN+, AWAN and MHINet. Reconstruction results in five selected bands (430 nm, 520 nm, 570 nm, 620 nm, and 660 nm) are shown from left to right.

Download Full Size | PDF

Table 3. MSE and SSIM of HSCNN+, AWAN and MHINet. ^a

View Table | View all tables in this article

From Fig. 13 and Fig. 14, it can be found that HSCNN+ and AWAN have lower error in some local regions. However, in most cases, especially in these pixels with high luminance level, their global errors are much larger than that of MHINet, which can also be inferred from the MSE and SSIM results of the entire 31 spectral bands shown in Fig. 15. The main reason might be that HSCNN+ and AWAN only used the generic CIE1964 filter response functions as their mapping function, which was adopted to generate the RGB images (simulated input) from hyperspectral images (ground truth). But limited by the manufacturing, in many real sensors, the CFA (color filter array) curves is different from the human perceptual curves, which now need to be corrected by CCM (color correction matrix) in camera ISP (image signal processing). If the spectral sensitivity information can be explicitly calculated to different sensors, more accurate reconstruction result can be obtained. In other words, it was the confinement of single-modal-based method that different semantic information could not cooperate with each other to be processed in a parallel way. Although AWAN considered the effect of camera response prior, it only used this prior information in the training phase to calculate the deviation between the original RGB image and the projected one. Namely, the spectral sensitivity information is hard to be calculated in real test, which impedes its generalization ability for various imaging systems.

Fig. 14. Difference-maps of five selected bands for HSCNN+, AWAN and MHINet.

Download Full Size | PDF

Fig. 15. MSE (left) and SSIM (right) results of HSCNN+, AWAN, and MHINet computed from the entire 31 spectral bands.

Download Full Size | PDF

Experiment result shows that the proposed multi-modal network design generalizes well in the dataset beyond the scope of the training group. Compared with traditional methods, it’s more suitable for these measurements using variable camera imaging system.

From Table 1 and Table 3, we note that the conventional CNN structure achieves slightly better results than HSCNN+. This might be because the CNN model here adopted comprises of three U-Net architectures and RGB integration layer. Its parameter number is still larger than that of HSCNN+, which is at the cost of large memory consumption and longer computing time. At the advantage of large model in traditional CNN and the specific design in HSCNN+, respectively, the MSE of these two methods are still comparable (with 0.00004 negligible difference).

For the running time, MHINet costs 0.45s for each conversion of the 512×512×3 RGB image to the reconstructed 512×512×31 HSI on the Tesla P100 GPU. In contrast, dozens of minutes to hours are required for traditional scanning method, which depends on the speed of system calibration and mechanical scanning. For reference, the reconstruction time of the network for HSCNN+, AWAN, and MHINet are listed in Table 4.

Table 4. Running time of three different networks.

View Table | View all tables in this article

3.3.3 Comparison to CASSI and DOE based method

The peak signal to noise ratio (PSNR) of the proposed method computed from the test dataset was compared with that of these projection methods, including CASSI and diffractive optical element (DOE) based methods. Results are shown in Table 5.

Table 5. Comparison to CASSI and DOE based method.

View Table | View all tables in this article

It can be seen that our reconstruction accuracy is comparable with that of Ref. [17] and [22], but it still has 3∼4 dB degradation compared with the other two approaches. This might be because:

(1) These projection methods adopt multiple well-designed optical elements to capture the source spectral information. As a comparison, our proposed system only comprised of a RGB camera so that some efficient information might be lost during the imaging process. In other words, the compact size of our system is at the expense of reconstruction accuracy.
(2) Generally, the pre-calibrated process of PSF and employment of iteration algorithms are included in these CASSI and DOE based methods, which can help better design the whole hyperspectral imaging system with specific algorithms. Our proposed method aims to realize end-to-end snapshot spectral imaging; thus, the accuracy may suffer some degradation.

3.3.4 Real scene validation

To test the performance of the trained network in real measurement, a raw image pair of the target scene and 140 patch ColorChecker was captured by Nikon D3X, as shown in Fig. 16 left (Pair 1 in the red box). Demosaic process was then manipulated using the Bayer pattern ‘RGGB’ for each color channel to generate corresponding RGB images. There were not any other operations done on the image gray-level intensity or chromatic value. The corresponding ground-truth data was taken by hyperspectral camera Specim IQ, whose spectral sensitivity was pre-calibrated and then used to obtain the original hyperspectral data that only contained illumination spectrum and object reflectance. Besides, considering the Colorchecker image may not be accessible in real measurement and our previous work [40] had proved that a natural image with rich color features can also be used for camera response estimation, another raw image pair consisting of the scene RGB image and natural image was utilized to test its performance without the stubborn color chart, as shown in Fig. 16 right (Pair 2 in the green box). The above mentioned two image pairs were separately used as the input of the pre-trained MHINet to generate the final reconstructed HSI.

Fig. 16. (Left) Input raw images (after demosaic) consisting of scene RGB image and 140 patch ColorChecker image. (Right) Input raw images consisting of the scene RGB image and a natural image. All of these raw images were taken by RGB camera Nikon D3X.

Download Full Size | PDF

Since the RGB image and reconstructed HSI were taken by two devices whose spatial resolution, focal distance, and field of view were totally different, it’s difficult to achieve accurate pixel-matching of these image data. In this case, some of the evaluation metrics used in simulation, such as the error-maps, may become unreliable. To address this problem, both qualitative and quantitative metrics are used for validation. For qualitative evaluation, the reconstructed results of five selected bands are given in Fig. 17. On the other hand, although it is uneasy to realize pixel-to-pixel comparison, the spectra estimation accuracy can be analyzed by using these objects with special spectral distribution. For that purpose, a 24 patch ColorChecker, which has the same color distribution in a local region, was adopted as the imaging target. The experimental setup for capturing hyperspectral and RGB data of the 24 patch ColorChecker is demonstrated in Fig. 18 (left). Accordingly, as the quantitative metric, the rebuilt spectral curves extracted from different segments of the 24 patch ColorChecker image are shown in Fig. 18 (right).

Fig. 17. Ground-truth (top) and reconstructed results of the real captured data. Five selected bands (430 nm, 520 nm, 570 nm, 620 nm, and 660 nm) are shown from left to right. Results of input pair 1 and 2 are illustrated in rectangles with corresponding colors (red for pair 1 and green for pair 2).

Download Full Size | PDF

Fig. 18. Experiment setup for taking the hyperspectral and RGB data of the 24 patch ColorChecker (left). Recovered spectral line profile along with the ground-truth (GT) curves of three selected points (right).

Download Full Size | PDF

From Fig. 17 and Fig. 18 we can see that the reconstructed images agree well with the ground-truth and the estimated spectral profiles closely match the reference curves, indicating that our network performs well in color reproduction and spectra detail recovery. As expected, the proposed approach can be adapted to a new camera system and generalizes well on the real captured data. The PSNR of pair 1 using scene-Colorchecker pair was lower than that of pair 2 using scene and natural image. This should be because colorchecker can provide more specific and abundant color information for camera spectral sensitivity estimation. Nevertheless, their reconstruction accuracy was still comparable in different spectral bands, verifying the model robustness in different experimental conditions. Moreover, we note that the PSNR in real test has some degradations compared with the result recorded in Table 5, mainly due to the effect of pixel mismatching.

In future work, we are planning to investigate the detailed influence of scene chromatic complexity on the reconstruction accuracy. As such, the training dataset can be expanded for more kinds of experimental conditions with various spectral reflectance distribution. Moreover, the network structure will be further optimized and integrated with the designed optical system. Other deep learning architectures with strong generation ability, such as Generative Adversarial Network (GAN) will also be primarily considered. In our approach, the spectral sensitivity prior is used as the second modal information to remove the dependence of camera response and make our network flexible for variable imaging systems. Similarly, the camera spectral sensitivity is a key parameter of other imaging systems, such as CASSI or DOE. The proposed multi-modal learning scheme can also be integrated with CASSI or DOE to achieve more accurate reconstruction by providing them with such useful information.

4. Conclusions and discussion

This paper presents a HSI recovery scheme based on the multi-modal learning neural network (MHINet) and integrated attention mechanism. Multiple contextual information extracted from different semantic modal was jointly used to recover the hyperspectral cube from the corresponding RGB image. There are four major parts in the network for solving this inverse imaging problem: (1) Basic convolutional operations and feature mapping. (2) Multi-spectral channel up-sampling. (3) SA-CA rescaling and encoding. (4) Multi-model integration and spectral decoding. The first visual-image-modal is employed for spectrum classification while the second spectral-modal focuses on providing the spectral decoding process with camera spectral sensitivity prior. During training, the mapping relationship between 2D RGB image and 3D hyperspectral volume can be autonomously learned. Experiment result shows that the reconstruction results with our approach have higher accuracy, both in terms of MSE and SSIM.

Based on the camera response formation model, 10500 sets of simulated images were adopted as the database, for network training, validation, and test. For higher precision, except for embedded multi-modal components, a customized loss function was proposed to learn the correlations among different spatial locations and spectral channels. Meanwhile, attention sub-modules were used to adaptively rescale pixel-wise features in all feature maps. Results show that the proposed multi-modal-based method outperforms the conventional CNN structure. In particular, the “MHINet + L_MSE” approach has the best performance due to its tolerance of large numerical deviation. The pre-trained network was tested on both indoor and outdoor scene images, and extensive results on both simulation and real data have verified the performance of our proposed method.

Benefited from the multi-modal learning mechanism, a key advantage of our design over previous work is its flexibility to be applied to different types of RGB cameras with various spectral responses. For instance, in some real applications, the hyperspectral data should be captured on different portable devices, such as digital camera or mobile phone, which means the traditional scanning method and the single-modal-based method would not work. Instead, the proposed multi-modal neural network can realize HSI reconstruction and computational imaging in variable imaging conditions, serving as an efficient alternative for high-complexity task. Furthermore, owing to the high-performance characteristic of deep learning algorithm which allows parallel batch processing, this work can be assembled on GPU to accelerate computation. The whole reconstruction can be completed without expensive and redundant hardware environment. For some AI smart applications without high-powered GPU assembled. This can also be achieved by algorithm optimization for terminal systems and model migration.

The spatial image resolution of the model is 512×512 while the spectral resolution is 10nm, which can be improved by training the network with high-resolution images, but it may require longer computation time and higher memory occupation. To address this issue, optimized algorithm and simplified network structure can be adopted to find a better balance between the desired precision and hardware cost. The reconstruction time for one HSI is presently 0.45s, which can be successfully used for snapshot imaging. However, that will fail if used in many real-time applications, such as dynamic scene measurement and online hyperspectral video synthesis. This can be solved by computational capacity improvement or utilizing multi-frame parallel processing technology.

Technically, this work provides a potentially simple, end-to-end, and scan-free route for hyperspectral imaging. Our method is efficient in terms of reconstruction quality and speed trade-off, and flexible enough to be ready to use for different imaging systems. Since the multi-modal learning mechanism can be further developed with the computational optical design, the proposed network architecture can be combined with traditional optical systems, such as CASSI and DOE to achieve unprecedented results. We believe that the proposed method will inspire future work towards high-spatial and high-temporal resolution HSI acquisition, and facilitate a wide range of applications.

Funding

National Natural Science Foundation of China (62075143, 62105227); Chengdu Science and Technology Program (2021-YF05-01990-SN).

Acknowledgment

The authors would like to thank Spectral imaging Ltd. for their technical help and support.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. G. Bonifazi, G. Capobianco, and S. Serranti, “Asbestos containing materials detection and classification by the use of hyperspectral imaging,” J. Hazard. Mater. 344, 981–993 (2018). [CrossRef]

2. B. Nie, L. Yang, F. Zhao, J. Zhou, and J. Jing, “Space object material identification method of hyperspectral imaging based on Tucker decomposition,” Adv. Space Res. 67(7), 2031–2043 (2021). [CrossRef]

3. S. Delalieux, A. Auwerkerken, W. W. Verstraeten, B. Somers, R. Valcke, S. Lhermitte, J. Keulemans, and P. Coppin, “Hyperspectral Reflectance and Fluorescence Imaging to Detect Scab Induced Stress in Apple Leaves,” Remote Sens. 1(4), 858–874 (2009). [CrossRef]

4. H. Finkelstein, “Next Generation Intelligent Hyperspectral Imagers for High-Resolution Object Identification,” in Imaging and Applied Optics 2017 (3D, AIO, COSI, IS, MATH, pcAOP), OSA Technical Digest (online) (Optical Society of America, 2017), AM2A.2. [CrossRef]

5. B. E. Nogueira de Faria, G. R. da Fonseca, G. Valentini, A. Bassi, G. Cerullo, C. Manzoni, and A. M. de Paula, “Hyperspectral Microscope Based on a Birefringent Interferometer for Biomedical Imaging,” in Biophotonics Congress: Optics in the Life Sciences Congress 2019 (BODA,BRAIN,NTM,OMA,OMP), The Optical Society (Optical Society of America, 2019), NS1B.5.

6. A. A. Gowen, C. P. O’Donnell, P. J. Cullen, G. Downey, and J. M. Frias, “Hyperspectral imaging – an emerging process analytical tool for food quality and safety control,” Trends Food Sci. Technol. 18(12), 590–598 (2007). [CrossRef]

7. W. Huang, J. Li, Q. Wang, and L. Chen, “Development of a multispectral imaging system for online detection of bruises on apples,” J. Food Eng. 146, 62–71 (2015). [CrossRef]

8. R. T. Kester, N. Bedard, L. Gao, and T. S. Tkaczyk, “Real-time snapshot hyperspectral imaging endoscope,” J. Biomed. Opt. 16(5), 056005 (2011). [CrossRef]

9. B. Hu, J. Du, Z. Zhang, and Q. Wang, “Tumor tissue classification based on micro-hyperspectral technology and deep learning,” Biomed. Opt. Express 10(12), 6370–6389 (2019). [CrossRef]

10. Q. Li, Y. Wang, H. Liu, Z. Sun, and Z. Liu, “Tongue fissure extraction and classification using hyperspectral imaging technology,” Appl. Opt. 49(11), 2006–2013 (2010). [CrossRef]

11. S. Ortega, H. Fabelo, R. Camacho, M. de la Luz Plaza, G. M. Callicó, and R. Sarmiento, “Detecting brain tumor in pathological slides using hyperspectral imaging,” Biomed. Opt. Express 9(2), 818–831 (2018). [CrossRef]

12. W. Porter and H. Enmark, “A System Overview Of The Airborne Visible/Infrared Imaging Spectrometer (Aviris),” in 31st Annual Technical Symposium on Optical and Optoelectronic Applied Sciences and Engineering (SPIE, 1987), Vol. 0834.

13. R. Basedow, D. Carmer, and M. Anderson, “HYDICE system: implementation and performance,” in SPIE's 1995 Symposium on OE/Aerospace Sensing and Dual Use Photonics (SPIE, 1995), Vol. 2480.

14. A. Wagadarikar, R. John, R. Willett, and D. Brady, “Single disperser design for coded aperture snapshot spectral imaging,” Appl. Opt. 47(10), B44–B51 (2008). [CrossRef]

15. M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15(21), 14013–14027 (2007). [CrossRef]

16. C. Xun, Y. Tao, L. Xing, S. Lin, Y. Xin, D. Qionghai, L. Carin, and D. J. Brady, “Computational Snapshot Multispectral Cameras: Toward dynamic capture of the spectral world,” IEEE Signal Process. Mag. 33, 95–108 (2016). [CrossRef]

17. X. Lin, Y. Liu, J. Wu, and Q. Dai, “Spatial-spectral encoded compressive hyperspectral imaging,” ACM Trans. Graph. 33, 233 (2014). [CrossRef]

18. H. Rueda, D. Lau, and G. R. Arce, “Multi-spectral compressive snapshot imaging using RGB image sensors,” Opt. Express 23(9), 12207–12221 (2015). [CrossRef]

19. L. Galvis, D. Lau, X. Ma, H. Arguello, and G. R. Arce, “Coded aperture design in compressive spectral imaging based on side information,” Appl. Opt. 56(22), 6332–6340 (2017). [CrossRef]

20. J. Bacca, Y. Fonseca, and H. Arguello, “Compressive spectral image reconstruction using deep prior and low-rank tensor representation,” Appl. Opt. 60(14), 4197–4207 (2021). [CrossRef]

21. S. K. Sahoo, D. Tang, and C. Dang, “Single-shot multispectral imaging with a monochromatic camera,” Optica 4(10), 1209–1213 (2017). [CrossRef]

22. D. Jeon, S.-H. Baek, S. Yi, Q. Fu, X. Dun, W. Heidrich, and M. Kim, “Compact snapshot hyperspectral imaging with diffracted rotation,” ACM Trans. Graph. 38(4), 1–13 (2019). [CrossRef]

23. X. Li, J. A. Greenberg, and M. E. Gehm, “Single-shot multispectral imaging through a thin scatterer,” Optica 6(7), 864–871 (2019). [CrossRef]

24. R. French, S. Gigan, and O. L. Muskens, “Speckle-based hyperspectral imaging combining multiple scattering and compressive sensing in nanowire mats,” Opt. Lett. 42(9), 1820–1823 (2017). [CrossRef]

25. A. Manakov, J. Restrepo, O. Klehm, R. Hegedüs, E. Eisemann, H.-P. Seidel, and I. Ihrke, “A reconfigurable camera add-on for high dynamic range, multispectral, polarization, and light-field imaging,” ACM Trans. Graph. 32(4), 1–14 (2013). [CrossRef]

26. S. Ono, “Snapshot multispectral imaging using a pixel-wise polarization color image sensor,” Opt. Express 28(23), 34536–34573 (2020). [CrossRef]

27. W. Zhang, H. Song, X. He, L. Huang, X. Zhang, J. Zheng, W. Shen, X. Hao, and X. Liu, “Deeply learned broadband encoding stochastic hyperspectral imaging,” Light: Sci. Appl. 10(1), 108 (2021). [CrossRef]

28. Z. Xiong, Z. Shi, H. Li, L. Wang, D. Liu, and F. Wu, “HSCNN: CNN-Based Hyperspectral Image Recovery from Spectrally Undersampled Projections,” in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), (2017), 518–525.

29. Z. Shi, C. Chen, Z. Xiong, D. Liu, and F. Wu, “HSCNN+: Advanced CNN-Based Hyperspectral Recovery from RGB Images,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (2018), 1052–10528.

30. J. Li, C. Wu, R. Song, Y. Li, and F. Liu, “Adaptive Weighted Attention Network with Camera Spectral Sensitivity Prior for Spectral Reconstruction from RGB Images,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (2020), 1894–1903.

31. C. Hsieh, O. Momtahan, A. Karbaschi, and A. Adibi, “Compact Fourier-transform volume holographic spectrometer for diffuse source spectroscopy,” Opt. Lett. 30(8), 836–838 (2005). [CrossRef]

32. H. Rueda-Chacon, J. F. Florez-Ospina, D. L. Lau, and G. R. Arce, “Snapshot Compressive ToF + Spectral Imaging via Optimized Color-Coded Apertures,” IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2346–2360 (2020). [CrossRef]

33. A. Parada-Mayorga and G. R. Arce, “Spectral Super-Resolution in Colored Coded Aperture Spectral Imaging,” IEEE Trans. Comput. Imaging 2(4), 440–455 (2016). [CrossRef]

34. A. Y. Zhu, W.-T. Chen, M. Khorasaninejad, J. Oh, A. Zaidi, I. Mishra, R. C. Devlin, and F. Capasso, “Ultra-compact visible chiral spectrometer with meta-lenses,” APL Photonics 2(3), 036103 (2017). [CrossRef]

35. Z. Wang, S. Yi, A. Chen, M. Zhou, T. S. Luk, A. James, J. Nogan, W. Ross, G. Joe, A. Shahsafi, K. X. Wang, M. A. Kats, and Z. Yu, “Single-shot on-chip spectral sensors based on photonic crystal slabs,” Nat. Commun. 10(1), 1020 (2019). [CrossRef]

36. H. Arguello, S. Pinilla, Y. Peng, H. Ikoma, J. Bacca, and G. Wetzstein, “Shift-variant color-coded diffractive spectral imaging system,” Optica 8(11), 1424–1434 (2021). [CrossRef]

37. V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Trans. Graph. 37(4), 1–13 (2018). [CrossRef]

38. K. Monakhova, K. Yanny, N. Aggarwal, and L. Waller, “Spectral DiffuserCam: lensless snapshot hyperspectral imaging with a spectral filter array,” Optica 7(10), 1298–1307 (2020). [CrossRef]

39. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale,” (2020).

40. T. He, Q. Zhang, M. Zhou, and J. Shen, “CVNet: confidence voting convolutional neural network for camera spectral sensitivity estimation,” Opt. Express 29(13), 19655–19674 (2021). [CrossRef]

Method	CNN(L_MSE)	MHINet (L₁)	MHINet (L_MSE)
MSE	0.00229	0.00101	0.00069
SSIM	0.9623	0.9822	0.9873

Method	CNN(L_MSE)	MHINet (L₁)	MHINet (L_MSE)
MSE	0.0437	0.0092	0.0059
SSIM	0.7661	0.8917	0.9095

Method	HSCNN+	AWAN	*MHINet (L_MSE)*
MSE	0.00235	0.00137	0.00069
SSIM	0.9560	0.9684	0.9873

Network	HSCNN+	AWAN	MHINet
Running time (s)	0.96	0.56	0.45
Compute platform	NVIDIA 1080Ti	NVIDIA 2080Ti	Tesla P100

Method	CNN(L_MSE)	MHINet (L₁)	MHINet (L_MSE)
MSE	0.00229	0.00101	0.00069
SSIM	0.9623	0.9822	0.9873

Single-shot hyperspectral imaging based on dual attention neural network with multi-modal learning

Abstract

1. Introduction

2. Method

2.1 Hyperspectral volume projection

2.2 Multi-modal embedding for HSI recovery

2.2.1 HSI reconstruction algorithm

2.2.2 U-Net block

2.2.3 SA-CA block

2.2.4 Pre-trained CVNet

2.2.5 RGB attention (RGBA) block

2.3 Loss function

3. Experiment and results

3.1 Dataset acquisition

3.2 Training details

3.3 Network train and test

3.3.1 Comparisons of MHINet and CNN

3.3.2 Comparison to HSCNN+ and AWAN

3.3.3 Comparison to CASSI and DOE based method

3.3.4 Real scene validation

4. Conclusions and discussion

Funding

Acknowledgment

Disclosures

Data availability

References

Data availability

Cited By

Figures (18)

Tables (5)

Equations (15)

Optics Express