Dual camera snapshot high-resolution-hyperspectral imaging system with parallel joint optimization via physics-informed learning

Hui Xie; Zhuang Zhao; Jing Han; Fengchao Xiong; Yi Zhang

doi:10.1364/OE.487253

1. Introduction

Traditional optical images are often RGB or grayscale images, but they often have difficulty in effectively identifying some substances. Hyperspectral imaging captures narrow-band spectral wavelength information reflected from the scene and provides images of the continuous spectral range of the scene. Hyperspectral images (HSI) can be viewed as a three-dimensional data cube with both two-dimensional spatial and one-dimensional spectral variations, greatly enriching the scene information. This has led to HSI being used in a wide range of applications such as target detection [1], scene classification, medical diagnosis and remote sensing [2]. However, common imaging spectrometers have complex optical structures that result in an imbalance between high resolution, high spectral resolution, and imaging time. The time-consuming nature of scanning along a spatial or spectral dimension further limits the application of HSI, especially in dynamic scenes.

To address the application of hyperspectral imaging in real-time scenes, snapshot compressive imaging (SCI), specifically coded aperture snapshot spectral imaging (CASSI) systems (including single-disperser (SD-CASSI) [3]) and dual-disperser (DD-CASSI [4])), provides an elegant solutions. They use coded masks and dispersive prisms to modulate the three-dimensional spectral information of the scene and capture the compressed measurements on a two-dimensional sensor. Then use reconstruction algorithms to reconstruct 3D hyperspectral data from the compressed measurements and masks. The reconstruction algorithm plays a key role in the SCI reconstruction because we rely on it to reconstruct the hyperspectral data we expect to obtain.

Existing SCI reconstruction algorithms can be classified into model-based and learning-based approaches. Since spectral SCI reformulation is an ill-posed problem, traditional model-based algorithms usually require a regularization prior, such as sparsity [5], total variation [6]. These model-based methods have complete theoretical proofs that explain their mathematical derivations well. Some representative algorithms are two-step iterative shrinkage/thresholding algorithm [6] (TWIST), generalized alternating projection total variation [7] (GAP-TV) and rank minimization for snapshot compressive imaging [8] (DeSCI). However, these methods often rely on relying on hand-crafted image priors, which have limited representation capabilities and are very time-consuming because of iterative optimization.

Deep learning has powerful learning capabilities and typically uses a convolutional neural network to learn the mapping relationships of hyperspectral datasets directly from the encoded compressed images and achieves good reconstruction results [9,10]. Such methods also include reconstructing hyperspectral images [11,12,13] from RGB images. To cope with the generalization problem of deep learning methods, Wang et al. use deep external and internal learning [14], which can effectively exploit spatial-spectral correlation and sufficiently represent the variety nature of HSI. However, this data-driven learning from compressed measurement data to the underlying spectral images lacks theoretical guarantee and interpretability. To improve interpretability, plug-and-play [15,16] algorithms combine the powerful learning capabilities of deep learning methods with the theoretical underpinnings of model-based physical models. Meng et al. proposed a deep image prior (DIP) based self-supervised network [17] to solve the reconstruction problem of SCI, which improves the reconstruction accuracy as well as the reconstruction speed by using an untrained network as the denoiser of the alternating direction method of multipliers [18] (ADMM) algorithm, but cannot be applied to real-time scenes. In recent years, the self-attention mechanism of natural language processing model called Transformer [19] has been a great success in the field of computer vision under the condition of rapid growth of hardware computing power.

Transformer module has impressive capabilities, such as MST [20] and MST++ [12]. Deep unfolding networks also combine the advantages of model-based and learning-based methods, allowing deep learning methods to be somewhat interpretable, such as GAP-Net [21]. Based on this, GAP-CCoT [22] network combines the advantages of deep unfolding network approach and Transformer module to further improve the reconstruction performance of SCI. However, these methods still cannot escape the nature of requiring large amounts of data for training, and trained models may fail when dealing with scenarios that differ significantly from the training data. Not to mention that in many real-world scenarios we usually do not have enough training data.

In our previous work [23], we proposed a physics-based learning self-supervised two-camera system to solve the reconstruction problem of hyperspectral images. However, at that time, we used HR-net [24] as the network, and used the compressed measurements of the CASSI system as the input to the network, without taking full advantage of the spatial information brought by the RGB images.

In this paper, we propose a two-camera high-resolution hyperspectral image reconstruction system that combines a physical model of optical imaging and a mathematical model of joint optimization to design a parallel joint-optimized self-supervised framework. We introduced Transformer in our architecture to leverage the spatial detail information from high-resolution RGB cameras to achieve both hyperspectral and high-resolution image reconstruction. Our main contributions in this paper are as follows.

(1) We added a high-resolution color camera to the CASSI system to form a high-resolution self-supervised hyperspectral reconstruction system.
(2) We combine the imaging principles of CASSI system and color camera, embed the physical model of the system optical imaging and the mathematical procedure of joint optimization into the CNN network, and introduce Transformer to design a parallel optimized deep learning network.
(3) The system achieves both hyperspectral and high-resolution image reconstruction, and reconstructs hyperspectral images directly from 2D compressed data without any data for pre-training. Meanwhile, the model completed by online self-learning based on the scenario can be directly used in the inference process of the network and still has better reconstruction performance in similar scenarios.

2. Methods

Our proposed high-resolution-hyperspectral imaging system is shown in Fig. 1, where we place a beam splitter in front of the CASSI system and use a high-resolution color camera (color camera has higher resolution than CASSI system) to capture the spatial information of the scene. In this chapter, we first introduce the physical process of CASSI system imaging and the mathematical process of our proposed joint optimization, and then introduce the architecture of our proposed parallel optimization network.

Fig. 1. High-resolution hyperspectral imaging system.

Download Full Size | PDF

2.1. Physical process of CASSI and the self-supervised learning

The basic idea of SCI is as follows: the 3D spectral information is first encoded, and then the 2D encoded compressed data is acquired using a charge coupled device (CCD) or complementary metal oxide semiconductor (CMOS). The coded mask is typically located on the image plane used for modulation, i.e., the conjugate focal plane of the sensor, and the spatial spectral voxels of the scene are modulated by a pixel on the coding mask and mapped to a pixel on the detector through the relay mirror.

The schematic diagram of spectral SCI is shown in Fig. 2. For SD-CASSI, the 3D spectral data are first modulated in the image plane of the scene using a coded mask, then passed through a dispersive medium, and finally a compressed measurement is obtained in 2D form. Thus, the single-disperser architecture uses a coded mask for spatial coding and a disperser for spectral shearing.

Fig. 2. Schematic diagram of SD-CASSI.

Download Full Size | PDF

In order to encode the spectral cube of a scene into a single measurement, the sensing matrix must vary with the spectral band. As shown in Fig. 3, for a single-disperser structure, assuming that the encoded mask version is an $H \times W$ matrix, the encoded aperture is first reshaped into a column vector, which is then repeated uniformly in the horizontal direction and filled with zeros in the vertical direction.

Fig. 3. Relationship between the coded aperture and the sensing matrix of the camera.

Download Full Size | PDF

What is common to both single-disperser and dual-disperser is that the final perception matrix is a series of diagonal matrices, and the perception matrix can be written as

(1)$$\begin{array}{c} {\varPhi = [{{\varPhi_1},{\varPhi_2} \cdots ,{\varPhi_C}} ],} \end{array}$$

where $\varPhi \in {\mathrm{\mathbb{R}}^{WH \times WHC}}$ denotes the sensing matrix determined by the physical mask. The physical mask version $\varPhi $ can be adjusted arbitrarily according to system requirements. And W, H, and C denote the width, height, and the number of spectral bands, respectively.

For the linear model of spectral SCI, the spectral cube of the scene is expressed as $X \in {\mathrm{\mathbb{R}}^{W \times H \times C}}$, and the measured value after coded modulation is $Y \in {\mathrm{\mathbb{R}}^{W \times H}}$ (or an approximate size). The generalized matrix representation of the CASSI system can be expressed as

(2)$$\begin{array}{c} {Y = \varPhi X + E,} \end{array}$$

where $E \in {\mathrm{\mathbb{R}}^{W \times H}}$ denotes the measurement or sensor noise.

The vectored spectral cube and measured values are $x = vec(X )\in {\mathrm{\mathbb{R}}^{WHC}}$, $y = vec(Y )\in {\mathrm{\mathbb{R}}^{WH}}$ and $\eta = vec(E )\in {\mathrm{\mathbb{R}}^{WH}}$ respectively. As shown in Fig. 4, the spectral SCI linear system can be represented as

(3)$$\begin{array}{c} {y = \varPhi x + \eta .} \end{array}$$

Based on the system, the spectral SCI reconstruction can be cast as the following optimization problem

(4)$$\begin{array}{c} {\hat{x} = arg \mathop {min}\limits_x \frac{1}{2}\left\|y - \varPhi x\right\|{^2} + \tau R(x ),} \end{array}$$

where $\left\|y - \varPhi x\right\|^{2}$ is the data fidelity term, $R(x )$ is the image prior, and $\tau $ is the hyperparameter that balances its importance.

Fig. 4. Generalized representation and matrix representation of spectral SCI.

Download Full Size | PDF

For self-supervised learning models, typically use an encoder to compress the input data into a representation of the embedding data in the latent space, and then use a decoder to reconstruct the embedding vector into an image. The process as shown in Fig. 5.

Fig. 5. Self-supervised learning model.

Download Full Size | PDF

The generated self-supervised learning model represents the inverse problem of image reconstruction as an energy minimization problem:

(5)$$\begin{array}{c} {\hat{x} = \mathop {min}\limits_x E({x;{x_0}} )+ R(x ),} \end{array}$$

where, $E({x;{x_0}} )$ is the data item associated with the image reconstruction task; ${x_0}$ is the measured value of the original signal x; $R(x )$ is the regularization constraints. And the implicit prior captured by the neural network can be used in the self-supervised learning model to replace $R(x )$

(6)$$\begin{array}{c} {\hat{\theta } = \mathop {argmin}\limits_\theta E({{f_\theta }(e );{x_0}} ),\hat{x} = {f_{\hat{\theta }}}(e ),} \end{array}$$

where, $f({\cdot} )$ is the convolutional neural network, $\theta $ is network parameters; e is a fixed random tensor which is initially input to the network. And the so-called implied prior means that $R(x )$ is implied in the convolutional neural network $f({\cdot} )$ in another way. In layman's terms, this means that if the network $f({\cdot} )$ can easily reconstruct the image x, then it can be assumed that $R(x )= 0$. Conversely, if the network $f({\cdot} )$ cannot reconstruct the image x, then $R(x )={+} \infty $ [25].

Then for CASSI system, it is to compress the 3D hyperspectral data into a 2D compression measurement, and then recover the 3D hyperspectral data with reconstruction algorithm. This process is also one of downscaling and compression, followed by reconstruction and reduction.

If we consider the physical process $\varPhi x$ of the CASSI system for compressing measurements of 3D hyperspectral images as an encoder, then the measurement data y captured by the CASSI system is considered as an embedding data representation of 3D hyperspectral images in the lower dimensional space. Thus, we can use the massive amount of data y to train the convolutional neural network as a decoder to reconstruct the hyperspectral image data x by a self-supervised learning method, and this process is shown in Fig. 6.

Fig. 6. Self-supervised learning model for CASSI system.

Download Full Size | PDF

Since we want the network to act as a decoder to learn the physical processes of the optical system, the data input to the network is the measurement results of the CASSI system. However, due to the high-dimensional character of hyperspectral information, the reconstruction of its compressed measurement data is an ill-posed problem. If we only use the physical process of the CASSI system as a constraint of the model, i.e., $min\left\|Y - \varPhi X\right\|^{2}$, the reconstructed image, although possessing the spectral information of the scene, is also covered by the encoding matrix in terms of spatial forms. Therefore, a prior with spatial information of the scene is necessary to constrain the model.

And for color cameras, RGB images are obtained by integrating a spectral data cube along the spectral dimensional direction on the CMOS. $X({x,y,\lambda } )$ represents the 3D spectral data cube, while the RGB image ${Z_k}({x,y} )$ can be represented as

(7)$$\begin{array}{c} {{Z_k}({x,y} )= \mathop \smallint \limits_\lambda X({x,y,\lambda } ){L_k}(\lambda )d\lambda ,} \end{array}$$

where, $k \in R,G,B$ represents the spectral channel and ${L_k}(\lambda )$ represents the response of channel k at wavelength $\lambda $, i.e., the quantum effect curve of the camera. Its matrix form after discretization can be expressed as

(8)$$\begin{array}{c} {{Z_k} = X{L_k}.} \end{array}$$

In our previous work [23], we introduced the physical process of color camera imaging as a priori constraint on the spatial information of the scene into the self-supervised learning model, i.e., $min\left\|X{L_k} - {Z_k}\right\|^2$, to form a double path self-supervised learning framework based on physical learning ($min\left\|Y - \varPhi X\right\|^{2} + \left\|X{L_k} - {Z_k}\right\|^2$). The HR-Net is then used as the decoder network in Fig. 6, and the input to the network is the compressed measurements from the CASSI system.

As for the CASSI system, in the ideal case shown in Fig. 2, a voxel in space corresponds to a pixel of the coded mask and then to a pixel of the camera. However, in reality, if the pixel size of the coded mask is too small, it will be difficult to calibrate the system due to the presence of misalignment between the pixels of the coding mask and the camera element in the actual system, etc. Therefore, the pixel size of 2 × 2 or 3 × 3 camera elements is generally chosen as the pixel size of the coded mask when manufacturing the coded mask. This also results in a lower spatial resolution of the CASSI system. Thus, it is very necessary to introduce the high-resolution spatial detail information provided by color cameras to improve the imaging performance of the system. So, we asked the question: How can we logically introduce the spatial detail information provided by the color camera into the self-supervised learning model, rather than just as a constraint on the network model?

Then, we combine the physical drivers to design a mathematical model for the joint optimization of the CASSI system and the color camera imaging process, and we propose a self-supervised learning framework for parallel joint optimization on the basis of this model. This framework takes full advantage of the spatial information provided by the color camera to achieve high resolution and hyperspectral image reconstruction.

2.2. Mathematical process of our joint optimization

For CASSI system iterative optimization solution, by introducing the auxiliary variable z, Eq. (4) can be written as

(9)$$\begin{array}{c} {({{x^{(n )}},{z^{(n )}}} )= \mathop {argmin}\limits_{x,z} \frac{1}{2}\left\|x - z\right\|{^2} + \tau R(z )\; \; \; s.t.\varPhi x = y,} \end{array}$$

where, n indexes the iterations number. Eq. (9) can be written as two subproblems using block-coordinate descent [26], which can be optimized by alternately updating x and z. For a given z, the solution is the Euclidean projection of z onto a linear manifold, which can be written as

(10)$$\begin{array}{c} {{x^{({n + 1} )}} = {z^{(n )}} + {\varPhi ^T}{{({\varPhi {\varPhi ^T}} )}^{ - 1}}({y - \varPhi {z^{(n )}}} ).} \end{array}$$

The next step is given x to update z according to the regularization terms. When considering the process as a denoising, the solution can be obtained by

(11)$$\begin{array}{c} {{z^{({n + 1} )}} = {\mathrm{{\cal D}}_{n + 1}}({{x^{({n + 1} )}}} ),} \end{array}$$

where $\mathrm{{\cal D}}$ represents denoiser.

Since we use a high-resolution color camera in a dual camera system, which is jointly optimized with the reconstruction of the CASSI branch system. Because of the higher spatial resolution of the color camera than the spatial resolution of the CASSI branch, the imaging model of the CASSI system can be abbreviated at this point as

(12)$$\begin{array}{c} {Y = \varPhi ({BX} ),} \end{array}$$

where Y is the encoded and modulated compressed image, X is the high-resolution hyperspectral data cube, B is the spatial sampling matrix, and $\varPhi $ is the sensing matrix of the CASSI system.

Although recovering hyperspectral images from RGB images is a morbid problem, RGB images can provide richer spatial detail information to help reconstruct hyperspectral images.

And the goal of joint optimization is to estimate a high-quality, high-resolution hyperspectral image X from an encoded and compressed low-quality image Y and a high-quality, clear image Z. Then the optimization problem for hyperspectral images is as follows

(13)$$\begin{array}{c} {X = \mathop {argmin}\limits_X \left\|{Z_k} - X{L_k}\right\|_F^2 + \left\|Y - \varPhi ({BX} )\right\|_F^2 + \tau J(X ),} \end{array}$$

where $J(X )$ represents the image prior.

The parallel joint optimization network framework designed based on optimization Eq. (10) and Eq. (11) and Eq. (13). For the solution of Eq. (13), we will describe it together with the introduction of the network architecture.

2.3. Our parallel joint optimization network framework

According to Eq. (10), Eq. (11) and Eq. (13). our proposed parallel joint optimization network framework is shown in Fig. 7(a). The network consists of n stages that form an alternating parallel joint optimization structure. Since the resolution of the image captured by the color camera and the compressed image captured by the CASSI system are not the same, up-sampling and down-sampling are required in parallel network information transmission. In this case we define the resolution of the image captured by the color camera as $H \times W$ and the sampling value of the CASSI system as $h \times w$. The high-resolution $H \times W$ can be converted to the low-resolution $h \times w$ by the spatial sampling matrix B.

Fig. 7. Our proposed parallel optimization network (a) Parallel Optimization Network; (b) Denoising Network $\mathrm{{\cal D}}({\cdot} )$; (c) Spatial-spectral optimization CNN; (d) Spatial-spectral Regularization Module; (e) Transformer block.

Download Full Size | PDF

The computed result ${z^{(n )}}$ of the ${n^{th}}$ order Spectral Stage is up-sampled, concatenated with the computed result ${X^{({n - 1} )}}$ of the ${({n - 1} )^{th}}$ order Spatial Stage, and then passed to the ${n^{th}}$ order Spatial Stage for calculation. The result ${X^{({n - 1} )}}$ of the ${({n - 1} )^{th}}$ Spatial Stage is down-sampled and connected to the result ${z^{({n - 1} )}}$ of the ${({n - 1} )^{th}}$ Spectral Stage, which is then passed to the ${n^{th}}$ Spectral Stage for calculation. This results in a parallel alternating joint optimization structure. Here, down-sampling is performed using the sliding step of the convolutional layer, and up-sampling is performed using the bicubic interpolation algorithm. The initial values ${z^{(0 )}} = {\varPhi ^T}y$ for the first-order Spectral Stage and ${X^{(0 )}} = {z^{(1 )}}$ for the first-order Spatial Stage.

2.3.1 Spectral Stage

The Spectral Stage draws on the deep unfolding network structure, and $f({\cdot} )$ is the operation of Eq. (10). The denoising network $\mathrm{{\cal D}}({\cdot} )$ in Fig. 7(b) is built as a denoiser. Since convolution operations may lead to locally receptive domains, the Transformer module is advantageous in capturing long-range correlations between spatial regions. The Transformer module has been shown higher capacity to improve network performance and has gained popularity in the field of computer vision in recent years. In the denoising network, we use convolutional branch to extract the spatial features of the image, using the Transformer block shown in Fig. 7(e) to improve the spectral dimensional features of the obtained hyperspectral image. Meanwhile, the multi-scale image features are learned using both down-sampling and up-sampling structures, where down-sampling is performed using a convolutional layer sliding step and up-sampling is performed using Pixel-shuffle.

The Transformer block shown in Fig. 7(e) will be described in detail in Section 2.3.3.

2.3.2 Spatial Stage

Since both the observed coded compressed image and the RGB image can be viewed as obtained from the hyperspectral data cube through the imaging process of the optical system, and the solution of the hyperspectral image X is an underdetermined problem, the solution space is not unique. Therefore, it needs to be regularized with some image priori information $J(X )$. For this purpose, we designed a Spatial-spectral Regularization Module, as shown in Fig. 7(d). Defining the input of the module as ${C_{in}}$ and the output as ${C_{out}}$, and making it learn the prior information adaptively, Eq. (13) can be written as

(14)$$\begin{array}{c} {X = \mathop {argmin}\limits_X \left\|Z - XL\right\|_F^2 + \left\|Y - \varPhi ({BX} )\right\|_F^2 + \tau \left\|{C_{in}} - {C_{out}}\right\|_2^2.} \end{array}$$

For Eq. (14), although quadratic optimization can be used to solve X, it is complicated to compute the inverse of the large matrix. Therefore, we use the single-step gradient descent method to solve this problem

(15)$$\begin{array}{c} {{X^{(n )}} = {X^{({n - 1} )}} - \alpha \{{({{X^{({n - 1} )}}L - Z} ){L^T} + {\varPhi ^T}({\varPhi ({B{X^{({n - 1} )}}} )- Y} )+ \tau ({C_{in}^{(n )} - C_{out}^{(n )}} )} \},} \end{array}$$

where $\alpha $ is the step size. In parallel optimization, ${X^{\mathrm{\ast }({n - 1} )}}$ is defined as the result of ${X^{({n - 1} )}}$ concatenated with ${z^{(n )}}$, and Eq. (15) becomes

(16)$$\begin{array}{c} {{X^{(n )}} = {X^{\mathrm{\ast }({n - 1} )}} - \alpha \{{({{X^{\mathrm{\ast }({n - 1} )}}L - Z} ){L^T} + {\varPhi ^T}({\varPhi ({B{X^{\mathrm{\ast }({n - 1} )}}} )- Y} )+ \tau ({C_{in}^{(n )} - C_{out}^{(n )}} )} \}.} \end{array}$$

Based on Eq. (16), we designed the spatial-spectral optimized CNN using the imaging model of the color camera and the imaging model of the CASSI system. As shown in Fig. 7(c), the network has four branches.

The first branch connects the RGB images Z and ${X^{\mathrm{\ast }({n - 1} )}}$ acquired by the camera in series and then passes them to the Spatial-spectral Regularization Module consisting of the U-net and Transformer block. U-Net is a commonly used network that follows a down-sampling and up-sampling structure and uses a jump connection structure to present large-scale features to the up-sampling process and avoid model degradation. We replaced the last layer of the U-Net with the Transformer block to improve the accurate characterization between spectral channels.

In the second branch, we embed the quantum effect curve of the camera into the network and synthesize the RGB image with ${X^{\mathrm{\ast }({n - 1} )}}$ data, i.e., ${X^{\mathrm{\ast }({n - 1} )}}L$ is the color camera imaging process simulated in the network, and then use the real RGB image Z taken by the color camera to calculate the residual $Re{s_1} = {X^{\mathrm{\ast }({n - 1} )}}L - Z$. The transposed version of ${L^T}$ uses convolutional layers to model.

The third branch is the transmit branch of ${X^{\mathrm{\ast }({n - 1} )}}$.

In the fourth branch, we embed the coded compressed optical imaging process of the CASSI system into the network and synthesize a coded compressed image using ${X^{\mathrm{\ast }({n - 1} )}}$ data, i.e., $\varPhi B{X^{\mathrm{\ast }({n - 1} )}}$ for the optical imaging process of the simulated CASSI system in the network. The residual $Re{s_2} = \varPhi B{X^{\mathrm{\ast }({n - 1} )}} - Y$ is calculated using the real coded compressed image Y captured by the CASSI system, and the transposed version ${\varPhi ^T}$ is modeled with the de-convolution layer.

The Spatial-spectral optimization CNN output of ${X^{(n )}}$ can be expressed as

(17)$$\begin{array}{c} {{X^{(n )}} = {X^{\mathrm{\ast }({n - 1} )}} - \alpha \{{Re{s_1}{L^T} + {\varPhi ^T}Re{s_2} + \tau ({C_{in}^{(n )} - C_{out}^{(n )}} )} \}.} \end{array}$$

2.3.3 Transformer block

In the Transformer block, the input ${A_{in}} \in {\mathrm{\mathbb{R}}^{W \times H \times C}}$ is first reshaped into tokens ${A_s} \in {\mathrm{\mathbb{R}}^{WH \times C}}$, then fed into the convolutional layers and transformed into three feature maps $Q,K$ and V respectively, where $Q,K,V \in {\mathrm{\mathbb{R}}^{WH \times C}}$. Subsequently, consider each spectral representation as a token, and perform matrix multiplication between the transposed Q and K. The result of multiplication is then used to calculate the soft-max to obtain the spectral channel self-noticing map $A \in {\mathrm{\mathbb{R}}^{C \times C}}$, whose element ${a_{nm}}$ can be expressed as follows

(18)$$\begin{array}{c} {{a_{nm}} = {\sigma _{nm}}\frac{{exp ({{Q_m} \cdot {K_n}} )}}{{\mathop \sum \nolimits_{m = 1}^C \mathop \sum \nolimits_{n = 1}^C exp ({{Q_m} \cdot {K_n}} )}},} \end{array}$$

where, $m,n \in [{1,C} ]$ represent the number of channels of Q and K. Also, due to the large variation of spectral features with wavelength, the results of the reweighting matrix product ${a_{nm}}$ are computed together with the learnable parameter ${\sigma _{nm}}$ to accommodate the self-attentive mechanism. Subsequently, matrix multiplication is done between the mapping features V and A. The result is reshaped as ${\mathrm{\mathbb{R}}^{W \times H \times C}}$, which is added with the input feature ${A_{in}}$, and finally the output feature ${A_{out}} \in {\mathrm{\mathbb{R}}^{W \times H \times C}}$ is obtained, and this process can be written as

(19)$$\begin{array}{c} {{A_{out}} = reshape({V \times A} )+ {A_{in}}.} \end{array}$$

2.3.4 Double path self-supervised learning

We use $min\left\|y - \varPhi x\right\|^{2} + \left\|x{l_k} - {z_k}\right\|^2$ as a constraint for self-supervised learning, as described previously. Thus, we created two self-supervised learning branches using the final result ${X^{(n )}}$ of the network output, as show in Fig. 7(a). One branch is down-sampled and then projected onto the physical process of CASSI to obtain a modulated compressed image to learn the spectral information. In another branch, quantum efficiency curves of color cameras are used to project the imaging process of color cameras to obtain RGB image to learn spatial information. These two self-supervised branches are combined with a dual-camera system to form a closed-loop online self-learning system.

Here, we define the result of the network output ${X^{(n )}}$ as $\hat{X}$; The coded matrix $\varPhi $ obtained by calibrating the CASSI system is defined as the sensing matrix of the system; Using the ratio of compressed image resolution and RGB image resolution as the spatial sampling matrix B. Then the projected image ${P_{CASSI}}$ of the CASSI system is

(20)$$\begin{array}{c} {{P_{CASSI}} = \varPhi ({B\hat{X}} ),} \end{array}$$

using the spectral response curve ${L_k}$ of the color camera to project the physical process of color camera imaging, the RGB image ${P_{RGB}}$ is obtained

(21)$$\begin{array}{c} {{P_{RGB}} = \hat{X}{L_k}.} \end{array}$$

We use the mean square error (MSE) between the real images taken by the two-camera system and the projected images of the two self-supervised branches to constrain the whole system. We define the real images captured by grayscale and RGB cameras in the dual camera system as ${C_{CASSI}}$ and ${C_{RGB}}$, respectively. The calculation results are

(22)$$\begin{array}{c} {{L_{RGB}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^n {{|{{C_{RG{B_i}}} - {P_{RG{B_i}}}} |}^2},} \end{array}$$

(23)$$\begin{array}{c} {{L_{CASSI}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^n {{|{{C_{CASS{I_i}}} - {P_{CASS{I_i}}}} |}^2},} \end{array}$$

where n. represents the total number of self-supervised learning data.

The overall loss function ${L_{overall}}$ can be obtained by:

(24)$$\begin{array}{c} {{L_{overall}} = {L_{RGB}} + {L_{CASSI}}.} \end{array}$$

In contrast to other supervised learning-based methods, we use a self-supervised framework to learn the physical imaging process of the optical system rather than fitting a mapping relationship between a compressed image and a hyperspectral data cube. We use a jointly optimized mathematical model to embed the optical imaging process into a deep learning network that makes full use of the spatial information of RGB images. And the rational construction of parallel optimization architecture combined with deep unfolding makes the network interpretable. Due to our self-supervised framework based on physical models, we do not need a large amount of standard data for pre-training. In the self-learning process, only the compressed image and the high-resolution RGB image need to be encoded to obtain high-resolution-hyperspectral data. In the self-learning process, only encoded compressed images and high-resolution RGB images are required to obtain high-resolution-hyperspectral data. Also, after the learning is completed, the resulting model has good reconfiguration capability for the same type of scenes.

3. Experiments and results

3.1 Simulations

In this section, we compare the performance of the proposed joint optized physical model with several state of the arts (SOTA) methods on a simulated data set. Peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and spectral angle mapping (SAM) were used to evaluate the performance of different hyperspectral image reconstruction methods. The methods we compared included GAP-TV [7], DeSCI [8], PnP-DIP-HIS [17], MST [20], and GAP-CCoT [22]. Among them, GAP-TV and DeSCI are model-based iterative optimization methods, PnP-DIP-HIS is a DIP-based self-supervised method, MST is a supervised method based on Transformer, and GAP-CCoT is a supervised method with deep unfolding structure.

The computing platform uses an i7 10700 CPU, 32 G RAM and RTX TITAN GPU with PyTorch. The number of spectral channels λ of the network is set to 31, and the initial learning rate is 0.0001 and decreased by 10% for every 30 epochs. The simulated data were obtained using the KAIST [27] dataset, which consists of 30 scenes including full spectral resolution reflectance data in 31 bands from 400 nm to 700 nm with a step size of 10 nm and an image resolution of 2704 × 3376. Since our proposed self-supervised model does not require GT for pre-training, there is no need to distinguish between training and testing datasets. However, our models still have the ability to be reconstructed after learning, so it is necessary to discuss the reconstructive ability of models after learning. Here, we divide the validation of the method into two strategies.

3.1.1 Compared with iterative optimization algorithm methods

Thanks to our physical self-supervised framework and the mathematical model of joint optimization, our network has more robust performance, more efficient information utilization, and possesses more powerful reconstruction capabilities than iterative optimization methods. In this strategy, we compared with two model-based approaches, GAP-TV and DeSCI, and a self-supervised approach, PnP-DIP-HIS. We adjusted the image resolution of the hyperspectral data cube to 512 × 512 for simulating the imaging process of CASSI, and the RGB image resolution to 512 × 512 and 2048 × 2048 for simulating the case where the color camera has the same resolution as the grayscale phase and the color camera has a higher resolution than the grayscale camera, and the reconstructed image resolutions are 512 × 512 and 2048 × 2048, respectively. The number of the stage are set to two (Ours-S2) and four (Ours-S4), respectively. Here, we also tested the performance of our previous work (Ours-pre) on the KAIST dataset.

In order not to lose the generality, we take the average of the calculated results of 30 images from the KAIST dataset, and the reconstruction indices are shown in Table 1.

Table 1. Performance comparison of our method with iterative optimization methods

View Table | View all tables in this article

As can be seen from Table 1, when the RGB images are of 512 × 512 resolution, due to our dual-camera structure, the reconstruction metrics are much higher than the model-based methods and the DIP-based self-supervised method. Compared to our previous work, the proposed parallel joint optimization network architecture based on physical perception can reasonably introduce high-resolution RGB image information into the network, which greatly improves the reconstruction performance of the system, especially SAM. And when the RGB image resolution is 2048 × 2048, i.e., the reconstructed image resolution is 2048 × 2048, our model still has excellent performance and can provide high-resolution-hyperspectral data reconstruction. In addition, the reconstruction performance is not reduced much when the network is reduced from four stages to two stages, while the reconstruction time can be reduced by half. In addition, our proposed model takes 50 and 200 minutes to reconstruct images with resolutions of 512 × 512 and 2048 × 2048, respectively, in the four stages case, while the computation time of the model can be reduced by half in the two stages case. In comparison, the time required to reconstruct 512 × 512 resolution images were 115 and 310 minutes for DeSCI and PnP-DIP-HIS, respectively.

Fig. 8 shows the reconstruction results and error maps for CASSI 512 × 542 measurement resolution and 512 × 512 RGB image resolution, as well as the spectra at three selected spatial locations. From the local magnification images as well as the error maps, it can be seen that our method has a clear advantage over the compared methods.

Fig. 8. The reconstruction results and error maps for CASSI 512 × 542 measurement resolution and 512 × 512 RGB image resolution, and the spectra at three selected spatial locations.

Download Full Size | PDF

Fig. 9 shows the reconstruction results and error maps for CASSI 512 × 542 measurement resolution and 2048 × 2048 RGB image resolution, the reconstructed image resolution is 2048 × 2048. The local magnification and error maps show that our proposed model has very high reconstruction accuracy in reconstructing the spatial details of high-resolution images.

Fig. 9. The reconstruction results and error maps for CASSI 512 × 542 measurement resolution and 2048 × 2048 RGB image resolution.

Download Full Size | PDF

3.1.2 Compared with the supervised deep learning methods

Unlike model-based iterative optimization methods, our proposed method also has the ability to be reconstructed after learning is complete, which is similar to supervised deep learning methods. Thus, in this comparison strategy, similar to the training and testing strategy of most supervised deep learning methods. To facilitate comparison, we used the training and test datasets employed by MST and GAP-CCoT, and the wavelengths of the training and test data were modified by spectral interpolation. The final wavelengths were fitted to 450 nm to 650 nm and divided into 28 spectral bands. Its training set is the CAVE [28] dataset, which is expanded to more than 1000 sets using random cropping, rotation and flipping. The test set is a selection of 10 scenes from KAIST with hyperspectral image resolution of 256 × 256. We first simulated the self-learning of the model at the same resolution for the color camera and the grayscale camera.

Table 2 shows the reconstruction results of the completed model for the test scenes. Thanks to our jointly optimized dual-camera structure, we can see that the reconstruction performance of the completed model after learning is significantly better than that of SOTA supervised deep learning methods.

Table 2. Average PSNR, SSIM and SAM of different algorithms on 10 selection scenes

View Table | View all tables in this article

When the model is set to four stages, it takes 0.3 seconds to reconstruct an image with 256 × 256 resolution during the test, and 0.15 seconds in the two stages case. Most importantly, our method only requires coded compressed image captured by the CASSI system and a RGB image captured by the color camera during the learning process, which does not require ground truth from the training set and thus has good scene adaptation capability. At the same time, we can still use the “ fine-tune “ from our previous work to continue improving the performance of the model. “Fine-tune” means the practice of using already trained models and adding new data to train new models for new scenarios.

As can be seen from the reconstructed results of the local zoom of scene 1 shown in Fig. 10 and the spectra of three selected spatial locations in scene 9 shown in Fig. 11, our method outperforms SOTA supervised deep learning method in terms of reconstruction accuracy on the test set.

Fig. 10. The reconstruction results and error maps of scene 1.

Download Full Size | PDF

Fig. 11. The spectra of three selected spatial locations in scene 9.

Download Full Size | PDF

We randomly selected another 20 scenes from the KAIST dataset as the training set and the remaining 10 scenes as the test set. To simulate the imaging process of CASSI, the image resolution of the hyperspectral data cube was adjusted to 512 × 512, and to simulate the color camera with higher resolution than the grayscale camera, the RGB image resolution was adjusted to 2048 × 2048, and the reconstructed image resolution was 2048 × 2048. Table 3 shows the average reconstruction index of the model on the test set after learning is completed. As can be seen from the table 3, the high-resolution reconstruction performance of our method on the test set is still very good.

Table 3. Reconstruction index on test set with higher resolution RGB image

View Table | View all tables in this article

3.1.3 Discussion on extreme performance

In CASSI systems, because the amount of dispersion modulation of light by dispersion elements such as prisms in the system is a fixed value, when the spatial resolution of the scene becomes low (e.g., telephoto), the sampled value after the system coding modulation becomes very blurred. For example, Fig. 12 simulates the measurements of test samples passing through the CASSI system at different scales. We adjusted the image resolution of the hyperspectral data cube to different scales, such as 512 × 512, 256 × 256, 128 × 128, 64 × 64, etc., encoded using a random matrix, and the dispersion offsets were all set to 1 pixel per channel.

Fig. 12. The measurements of test samples passing through the CASSI system at different scales. (a) 512 × 512 (b) 256 × 256 (c) 128 × 128 (d) 64 × 64.

Download Full Size | PDF

As we can see in Fig. 12, the measurements of the CASSI system become very blurred when the spatial resolution of the scene becomes low. And to reconstruct hyperspectral data cubes from these data is a great challenge for the reconstruction methods. In our experiments, we found that SOTA supervised deep learning methods have serious generalization problems when reconstructing such data, and the models do not even converge when trying to train for such data. The model-based iterative optimization algorithm is very poor at reconstructing such data.

To explore the extreme performance of our proposed method, we conducted the following additional experiments using the KAIST dataset as test data. The imaging spatial resolution of the CASSI system is gradually reduced to validate the performance of our proposed model.

(1) To simulate the imaging process of CASSI, the image resolution of the hyperspectral data cube was resized to 128 × 128, the RGB image resolution was resized to 512 × 512, the reconstructed image resolution was 512 × 512, and the model was set to four stages and two stages. The contrast methods were GAP-TV, DeSCI and PnP-DIP-HIS, and the reconstructed image resolution was 128 × 128.
(2) To simulate the imaging process of CASSI, the image resolution of the hyperspectral data cube was resized to 64 × 64, the RGB image resolution was adjusted to 512 × 512, the reconstructed image resolution was 512 × 512, and the model was set to four stages and two stages. The comparison methods are GAP-TV, DeSCI and PnP-DIP-HIS, and the reconstructed image resolution is 64 × 64.

Tables 4 and 5 and Fig. 13 and Fig. 14 show the reconstruction index and reconstruction samples for the compared methods, and the method proposed in this paper still has very strong reconstruction performance at low spatial resolution. This demonstrates the superiority of the joint optimization model proposed in this paper compared to our previous work.

Fig. 13. Reconstructed samples at 128 × 128 resolution.

Download Full Size | PDF

Fig. 14. Reconstructed samples at 64 × 64 resolution.

Download Full Size | PDF

Table 4. Reconstruction index at 128 × 128 resolution

View Table | View all tables in this article

Table 5. Reconstruction index at 64 × 64 resolution

View Table | View all tables in this article

3.2 Experiments

To verify the performance of our proposed method in a real scene, we set up an experimental system as shown in Fig. 15, using Hikvision MV-CA020-10UC for the color camera and Hikvision MV-CA020-10UM for the grayscale camera, with image element size of 4.5µm for both color and grayscale cameras. The relay lens is Hikvision MVL-KF5028M-12MP, the objective lens is Hikvision MVL-KF2528M-12MP, and the non-polarizing beam splitter is Hengyang Optics HCBS1-020-30-VIS. The Near-infrared cut-off band-pass filter was manufactured by Beijing Yongxing Sensing Information Technology Co., Ltd, the passband wavelength is 450 nm to 650 nm. JINBEI EFII100 photography lamp is used to fill the target lighting of indoor scenes. The coded aperture consisted of a random matrix made of lithographed chromium etched on a CaF2 optical glass with a pixel pitch of 9 µm. The resolution of the encoding mask is 256 × 256, and one pixel of the mask corresponds to 2 × 2 pixels on the detector. We calibrate the system using a monochromator, the dispersive prisms use double Amici dispersive prisms, producing 31-pixel dispersion from 450 nm to 650 nm when the camera binning selected 2 to 1.

Fig. 15. Equipment diagram of our dual camera system.

Download Full Size | PDF

We first operate the grayscale camera and the color camera in pixel duplex mode to capture RGB images and compressed images with the same resolution of the grayscale camera and the color camera. The color camera then runs in normal mode and the grayscale camera runs in pixel duplex mode to capture high-resolution RGB images. The illustration of a picture book was used as the experimental target, and the reconstruction results are shown in Fig. 16.

Fig. 16. Reconstruction results of realistic scenes. (a) Color camera and grayscale camera with the same resolution. (b) Color cameras with higher resolution than grayscale cameras.

Download Full Size | PDF

As we can see from the Fig. 16(a), we can see that the reconstruction results are very promising when the resolution of the color camera is consistent with the CASSI system. And Fig. 16(b) shows when the resolution of the RGB image is increased, it greatly improves the sharpness of the reconstructed image from the CASSI system.

To verify the spectral accuracy of the reconstructed images from this system, we replaced the coded matrix with a column vector, i.e., a slit. Thus, the CASSI system becomes a slit spectrometer. We collected the spectra of the target scene with this spectrometer and compared them with the reconstructed results of the model proposed in this paper (the data were normalized). The relative intensity values tested on the standard color card as shown in Fig. 17. We can see that the model has a high spectral accuracy in the real system.

Fig. 17. Three exemplar reconstruction spectra of the standard color card.

Download Full Size | PDF

4. Conclusion

In this paper, we design a two-camera high-resolution self-supervised hyperspectral imaging system based on the optical imaging process and a jointly optimized mathematical model, based on our previous work. By embedding the optical imaging process and the joint optimization model into the network, a high-resolution hyperspectral image can be reconstructed using only one RGB image taken by high-resolution color camera and a coded compressed image taken by CASSI system. The model has a strong learning capability and does not require any ground truth. The specific contributions are as follows

(1) We design a self-supervised deep learning network with parallel structure, embedding the optical imaging process and jointly optimized digital model into the network, and combining the powerful learning capability of the Transformer module to more rationally utilize the spatial detail information provided by RGB images to improve the reconstruction capability of the system.
(2) The system reconstructs high-resolution hyperspectral images with only high-resolution RGB image and coded compressed image during the optical imaging process of learning color cameras and CASSI. The brute-force mapping relationship between the compressed image and the standard HSI is replaced by learning the physical and mathematical process without any ground truth, with better adaptation to the scene.
(3) This deep learning framework based on physical and mathematical processes shows better reconstruction performance compared to model-based iterative optimization algorithms. Moreover, the learned completed model has the same strong reconstruction capability when dealing with similar scenes, which is not inferior to the SOTA supervised deep learning methods.
(4) Due to this framework based on a physical model of optical imaging and a jointly optimized mathematical model, our system still has a powerful reconstruction capability under the limitation of low spatial resolution.

The core of our method is to introduce mathematical and physical models in the network to more rationally utilize spatially detailed information from high-resolution color camera and to provide guidance on how to more rationally design deep learning networks. In optical systems, there are optical phenomena such as phase aberration, distortion and scatter, and many SOTA methods do not take these disturbances into account. In the future, we will continue to refine our model in order to further improve the imaging accuracy of the system. Besides, although the network framework based on deep unfolding has a strong performance. However, due to the complexity of the model and the multi-stage structure of the deep unfolding, the network is relatively large, which puts high demands on the computational performance, and how to streamline to improve efficiency is also something we need to do.

Funding

National Natural Science Foundation of China (62271263, 62031018); Fundamental Research Funds for the Central Universities (30922010705); Jiangsu Provincial Key Research and Development Program (BE2022391).

Acknowledgment

We thank Shuaifeng Gong and Lei Gan for technical supports. We also thank the anonymous reviewers for their helpful comments to improve our paper.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time.

References

1. F. C. Xiong, J. Zhou, and Y. T. Qian, “Material based object tracking in hyperspectral videos,” IEEE Trans. on Image Process. 29, 3719–3733 (2020). [CrossRef]

2. F. C. Xiong, J. Zhou, Q. L. Zhao, J. F. Lu, and Y. T. Qian, “MAC-Net: Model-aided nonlocal neural network for hyperspectral image denoising,” IEEE Trans. Geosci. Remote Sensing 60, 1–14 (2022). [CrossRef]

3. A. Wagadarikar, R. John, R. Willett, and D. J. Brady, “Single disperser design for coded aperture snapshot spectral imaging,” Appl. Opt. 47(10), B44–B51 (2008). [CrossRef]

4. M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15(21), 14013–14027 (2007). [CrossRef]

5. M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, “Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems,” IEEE J. Sel. Top. Signal Process. 1(4), 586–597 (2007). [CrossRef]

6. J. M. Bioucas-Dias and M. A. T. Figueiredo, “A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. on Image Process. 16(12), 2992–3004 (2007). [CrossRef]

7. X. Yuan, “Generalized alternating projection based total variation minimization for compressive sensing,” IEEE International Conference on Image Processing (ICIP), IEEE, 2539–2543 (2016).

8. Y. Liu, X. Yuan, J. L. Suo, D. J. Brady, and Q. H. Dai, “Rank minimization for snapshot compressive imaging,” IEEE Trans. Pattern Anal. Mach. Intell. 41(12), 2990–3006 (2019). [CrossRef]

9. X. Miao, X. Yuan, Y. C. Pu, and V. Athitsos, “l-net: Reconstruct hyperspectral images from a snapshot measurement,” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4059–4069 (2019).

10. L. Z. Wang, T. Zhang, Y. Fu, and H. Huang, “Hyperreconnet: Joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging,” IEEE Trans. on Image Process. 28(5), 2257–2270 (2019). [CrossRef]

11. Z. Shi, C. Chen, Z. W. Xiong, D. Liu, and F. Wu, “Hscnn+: Advanced cnn-based hyperspectral recovery from rgb images,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR). 939–947 (2018).

12. Y. H. Cai, J. Lin, Z. D. Lin, H. Q. Wang, Y. L. Zhang, H. Pfister, R. Timofte, and L. V. Gool, “Mst++: Multi-stage spectral-wise transformer for efficient spectral reconstruction,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 745–755 (2022).

13. T. Huang, W. S. Dong, J. J. Wu, L. D. Li, X. Li, and G. M. Shi, “Deep hyperspectral image fusion network with iterative spatio-spectral regularization,” IEEE Trans. Comput. Imaging 8, 201–214 (2022). [CrossRef]

14. T. Zhang, Y. Fu, L. Z. Wang, and H. Huang, “Hyperspectral image reconstruction using deep external and internal learning,” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 8559–8568 (2019).

15. S. M. Zheng, Y. Liu, Z. Y. Meng, M. Qiao, Z. S. Tong, X. Y. Yang, S. S. Han, and X. Yuan, “Deep plug-and-play priors for spectral snapshot compressive imaging,” Photonics Res. 9(2), B18–B29 (2021). [CrossRef]

16. Z. Q. Lai, K. X. Wei, and Y. Fu, “Deep plug-and-play prior for hyperspectral image restoration,” Neurocomputing 481, 281–293 (2022). [CrossRef]

17. Z. Y. Meng, Z. M. Yu, K. Xu, and X. Yuan, “Self-supervised neural networks for spectral snapshot compressive imaging,” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2622–2631 (2021).

18. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” FNT in Machine Learning 3(1), 1–122 (2010). [CrossRef]

19. Y. Liu, Y. Zhang, Y. X. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. C. Shi, J. P. Fan, and Z. Q. He, “A survey of visual transformers,” arXiv, arXiv:2111.06091 (2021). [CrossRef]

20. Y. H. Cai, J. Lin, X. W. Hu, H. Q. Wang, X. Yuan, Y. L. Zhang, R. Timofte, and L. V. Gool, “Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 17502–17511 (2022).

21. Z. Y. Meng, S. Jalali, and X. Yuan, “Gap-net for snapshot compressive imaging,” arXiv, arXiv:2012.08364 (2020). [CrossRef]

22. L. S. Wang, Z. L. Wu, Y. Zhong, and X. Yuan, “Snapshot spectral compressive imaging reconstruction using convolution and contextual Transformer,” Photonics Res. 10(8), 1848–1858 (2022). [CrossRef]

23. H. Xie, Z. Zhao, J. Han, Y. Zhang, L. F. Bai, and J. Lu, “Dual camera snapshot hyperspectral imaging system via physics-informed learning,” Opt. Lasers Eng. 154, 107023 (2022). [CrossRef]

24. Y. Li, C. X. Wang, Y. Cao, B. Y. Liu, Y. Luo, and H. G. Zhang, “A-hrnet: Attention based high resolution network for human pose estimation,” 2020 Second International Conference on Transdisciplinary AI (TransAI). IEEE, 75–79 (2020).

25. D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 9446–9454 (2018).

26. D. P. Bertsekas, “Nonlinear programming,” J. Operational Res. Soc. 48(3), 334 (1997). [CrossRef]

27. I. Choi, D. S. Jeon, G. Nam, D. Gutierrez, and M. H. Kim, “High-quality hyperspectral reconstruction using a spectral prior,” ACM Trans. Graph. 36(6), 1–13 (2017). [CrossRef]

28. F. Yasuma, T. Mitsunaga, D. Iso, and S. K. Nayar, “Generalized assorted pixel camera: post-capture control of resolution, dynamic range, and spectrum,” IEEE Trans. on Image Process. 19(9), 2241–2253 (2010). [CrossRef]

Method	PSNR (dB)	SSIM	SAM (°)
GAP-TV (512 × 512)	26.23	0.79	20.11
DeSCI (512 × 512)	27.47	0.85	12.49
PnP-DIP-HIS (512 × 512)	29.50	0.83	10.50
Ours-Pre (512 × 512)	38.63	0.95	10.37
Ours-S4 (512 × 512)	47.28	0.99	3.38
Ours-S2 (512 × 512)	44.68	0.97	4.12
Ours-S4 (2048 × 2048)	42.91	0.97	5.0
Ours-S2 (2048 × 2048)	41.37	0.96	5.6

Method	Scene	1	2	3	4	5	6	7	8	9	10
MST	PSNR (dB)	35.47	36.15	36.43	42.09	32.98	34.75	34.12	32.92	35.08	32.77
	SSIM	0.94	0.95	0.95	0.97	0.95	0.95	0.93	0.95	0.94	0.94
	SAM (°)	7.64	9.40	6.85	12.07	6.94	10.65	6.50	12.89	8.77	11.37
GAP-CCoT	PSNR (dB)	35.13	35.85	36.84	41.95	32.59	34.92	33.43	33.11	35.69	32.41
	SSIM	0.94	0.94	0.96	0.97	0.95	0.96	0.93	0.95	0.95	0.94
	SAM (°)	7.64	9.60	6.12	11.18	6.42	9.78	7.01	11.28	7.48	10.64
Ours-Pre	PSNR (dB)	33.26	34.59	33.93	33.82	38.04	30.39	32.23	31.08	31.17	31.26
	SSIM	0.91	0.95	0.91	0.93	0.95	0.94	0.95	0.94	0.95	0.95
	SAM (°)	8.34	11.4	10.23	7.91	13.20	11.74	13.34	7.88	13.50	9.63
Ours-S4	PSNR (dB)	40.92	39.05	40.58	45.06	44.37	39.15	39.03	39.73	38.62	39.01
	SSIM	0.98	0.98	0.96	0.97	0.96	0.97	0.97	0.96	0.96	0.96
	SAM (°)	6.05	8.10	5.03	5.77	4.42	5.28	5.11	5.66	6.26	6.55
Ours-S2	PSNR (dB)	37.81	35.57	37.74	41.64	41.90	35.80	36.78	36.00	36.42	37.69
	SSIM	0.96	0.96	0.94	0.93	0.94	0.92	0.95	0.91	0.95	0.92
	SAM (°)	8.56	9.50	7.38	7.85	6.03	8.05	7.06	8.03	8.44	8.67

Method	PSNR (dB)	SSIM	SAM (°)
Ours-S4 (2048 × 2048)	36.19	0.95	8.87
Ours-S2 (2048 × 2048)	32.65	0.88	12.29

Method	PSNR (dB)	SSIM	SAM (°)
GAP-TV (128 × 128)	18.82	0.62	28.13
DeSCI (128 × 128)	20.08	0.67	32.21
PnP-DIP-HIS (128 × 128)	24.83	0.67	15.67
Ours-Pre (128 × 128)	26.37	0.73	18.37
Ours-S4 (512 × 512)	39.98	0.96	6.50
Ours-S2 (512 × 512)	37.51	0.93	9.52

Method	PSNR (dB)	SSIM	SAM (°)
GAP-TV (64 × 64)	15.70	0.57	33.11
DeSCI (64 × 64)	17.10	0.58	34.95
PnP-DIP-HIS (64 × 64)	21.95	0.56	19.05
Ours-Pre (64 × 64)	23.98	0.61	23.71
Ours-S4 (512 × 512)	37.63	0.94	9.43
Ours-S2 (512 × 512)	36.75	0.94	10.08

Dual camera snapshot high-resolution-hyperspectral imaging system with parallel joint optimization via physics-informed learning

Abstract

1. Introduction

2. Methods

2.1. Physical process of CASSI and the self-supervised learning

2.2. Mathematical process of our joint optimization

2.3. Our parallel joint optimization network framework

2.3.1 Spectral Stage

2.3.2 Spatial Stage

2.3.3 Transformer block

2.3.4 Double path self-supervised learning

3. Experiments and results

3.1 Simulations

3.1.1 Compared with iterative optimization algorithm methods

3.1.2 Compared with the supervised deep learning methods

3.1.3 Discussion on extreme performance

3.2 Experiments

4. Conclusion

Funding

Acknowledgment

Disclosures

Data availability

References

Data availability

Cited By

Figures (17)

Tables (5)

Equations (24)

Optics Express