Deep learning enabled reflective coded aperture snapshot spectral imaging

Open Access

Abstract

Coded aperture snapshot spectral imaging (CASSI) acquires rich spatial and spectral information at ultra-high speed and thus holds broad application prospects. CASSI employs the idea of compressive sensing to capture the spatial-spectral data cube with a monochromatic detector and uses reconstruction algorithms to recover the desired spatial-spectral information. Depending on the optical design, CASSI currently has two implementations: single-disperser (SD) CASSI and dual-disperser (DD) CASSI. However, SD-CASSI inherently suffers from poor spatial resolution, while DD-CASSI increases size and cost because of the extra prism. In this work, we propose a deep learning-enabled reflective coded aperture snapshot spectral imaging (R-CASSI) system, which uses a mask and a beam splitter to receive the light reflected by the mask. The optical path design of R-CASSI makes the system compact, using only one prism as two dispersers. Furthermore, an encoder-decoder network with 3D convolution kernels, dubbed U-net-3D, is built for the reconstruction. The designed U-net-3D network achieves both spatial and spectral consistency, leading to state-of-the-art reconstruction results. The real data is released and can serve as a benchmark dataset for testing new reconstruction algorithms.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Hyperspectral imaging is an important tool for capturing the 3D spectral-spatial information of real-world scenes, comprising 1D spectral information and a 2D spatial image. Spectral-spatial information is of great significance for cultural relic identification, the food industry, medical treatment, and other fields. Because existing sensors can only capture 2D information in one shot, traditional spectral imaging systems usually resort to multi-camera or mechanical-scanning acquisition, suffering from slow speed, high system complexity, and high cost. To address this challenge, researchers have turned to compressive sensing theory [1-3] and developed a variety of compressive imaging systems. The representative one for spectral imaging is coded aperture snapshot spectral imaging (CASSI) [4]. The underlying principle of CASSI is to capture a compressed measurement in a snapshot and recover the 3D hyperspectral cube by a compressive-sensing reconstruction algorithm [5]. Based on this principle, the main idea of CASSI is to modulate different spectral channels of the hyperspectral image with different modulation patterns.

In the literature, the first CASSI system is dual-disperser (DD) CASSI [4], which uses two prisms as dispersers along with a fixed mask. The mask plays the role of modulation, and the two dispersers shear and unshear the spectral cube sequentially, so that the coded measurement is obtained without spatial shifting. Another, more popular design is the single-disperser (SD) CASSI system [6], which applies only a fixed mask to encode the light and realizes the equivalent modulation by different masks in different spectral bands through the dispersion of a single prism. Because of the simpler optics compared to DD-CASSI, researchers have developed many hardware and algorithm optimizations for SD-CASSI over the past decade [7-9]. However, the spatial quality of the SD-CASSI measurement is poor due to the mixed spatial and spectral shifting, so most reconstruction algorithms cannot obtain excellent spatial results from SD-CASSI measurements. On the other hand, reconstructing the spectral data cube of DD-CASSI is relatively simple, but its optical system is larger [10].

To obtain a compact CASSI system with high spatial performance, we extend our previous work [11] and propose a deep learning enabled compact CASSI system, dubbed reflective CASSI (R-CASSI), which exploits a reflective optical path. The proposed R-CASSI uses only one prism to perform dispersion twice. In this manner, the size of R-CASSI is almost halved compared to DD-CASSI. Meanwhile, R-CASSI retains the advantage of DD-CASSI, i.e., a spatially clearer compressed measurement. Therefore, the spatial reconstruction of R-CASSI is significantly better than that of SD-CASSI.

The main contributions of this paper are summarized as follows:

  • A new spectral compressive imaging system, R-CASSI, is proposed and built, which enjoys the advantages of DD-CASSI but uses only a single disperser, thus reducing the optical system size.
  • A new deep learning-based reconstruction network using a three-dimensional (3D) U-net [12], dubbed U-net-3D, is developed to reconstruct the spatial-spectral data cube from the compressed measurement captured by our R-CASSI system, leading to state-of-the-art reconstruction results.
  • We release the experimental data and code used in this work as a benchmark for testing various algorithms, available in Code 1, Ref. [13].

2. Related work

Developing an effective method to obtain 3D hyperspectral data cubes is the most fundamental task of hyperspectral imaging. Traditional methods include temporal sequential scanning and spatial-spectral scanning. Temporal sequential scanning uses different band-pass filters or liquid crystal tunable filters (LCTFs) to record different spectral bands of the hyperspectral data cube; for example, an adjustable spectral filter can change the spectral channel of the image [14], or an LCTF can sequentially scan the visible spectrum [15]. The spectral resolution of these methods is limited by the number of filters used. Spatial-spectral scanning captures the spectrum of a single scene point or slit and then scans the entire scene to obtain the complete data cube, for example using a whiskbroom or pushbroom scanner [16].

Owing to their simple sampling principles, the spatial or spectral resolution of the above approaches is limited; at high resolution, scanning the full data cube takes a long time. In contrast, snapshot approaches capture the full 3D data cube in a single (usually compressed) image.

2.1 Compressed hyperspectral imaging

One of the most efficient hyperspectral imaging methods is coded aperture snapshot spectral imaging, a.k.a. CASSI [4], which can capture both the spectral and spatial information of a scene at high speed in a single compressed measurement. This compressive sampling technique requires reconstructing the desired hyperspectral image from the captured coded information according to the light transmission (or sensing) matrix of the optical system. CASSI systems can be divided into two categories based on the dispersers used in the optical path.

  • 1) Single-disperser CASSI [6,17], shown in the upper part of Fig. 1(d), first modulates the spatial-spectral data cube with a mask and then uses the disperser to spread the spectral bands, i.e., shifting different wavelengths to different spatial locations; in this way, each spectral band of the data cube is modulated by a different mask (a shifted version of the same mask). The modulated 3D data cube is then integrated across the spectral dimension into a single 2D measurement.
  • 2) The key idea in CASSI is thus to modulate different spectral bands with different masks. Different from SD-CASSI, two dispersers are employed in DD-CASSI [4], as shown in the middle part of Fig. 1(d). The first disperser shears the spatial-spectral data cube, which is then modulated by the mask; the second disperser unshears the modulated data cube. Finally, the detector captures the (sheared-modulated-unsheared) data cube by integrating the light across the spectral dimension.

The analysis shows that SD-CASSI has a higher spectral accuracy while DD-CASSI gains more spatial details [18]. This paper aims to devise a third optical system, R-CASSI, to combine the merits of both. Specifically, R-CASSI uses a single disperser with a reflector to implement the principle of DD-CASSI, which saves the optical system size, as illustrated in Fig. 1 (a) and (d).

Fig. 1. (a) Principle of the R-CASSI system. (b) Schematic illustration of the R-CASSI system (upper) and optical setup of the R-CASSI system (lower). L1: objective lens; L2 and L3: relay lenses; BS: beam splitter. (c) Details of CASSI reconstruction with U-net-3D. The recovery process consists of data initialization and reconstruction: the initialization phase takes the measurement $g$ and sensing matrix $\Phi$ to obtain the initialized data $f_0$; U-net-3D then takes $f_0$ as input and generates the reconstructed hyperspectral image data $\hat{f}$. (d) Schematic diagram of three CASSI systems: SD-CASSI, DD-CASSI, and R-CASSI.

To improve the quality of CASSI imaging, temporal multiplexing sampling was introduced using a digital micro-mirror device (DMD) [19]. In 2010, multiframe image estimation for CASSI [18] was proposed, which uses multiple different coded apertures to capture the same scene; it improves the spatial and spectral reconstruction fidelity but suffers from low acquisition and reconstruction speed. Color-coded aperture compressive spectral imaging (CC-CASSI) [20] achieves higher compression and higher-quality reconstruction by using colored coded apertures. In 2014, spatial-spectral encoded compressive hyperspectral imaging (SS-CASSI) [21] was proposed by Lin et al.; SS-CASSI uses a diffraction grating as the disperser and is similar to DD-CASSI in its mathematical model. Lin et al. also proposed a dual-coded compressive hyperspectral imaging system [22], which uses two spatial light modulators and significantly facilitates dynamic scene capture. In 2021, Arguello et al. proposed a high-fidelity spectral imaging system termed SCCD [23] by combining the color-coded aperture technique of coded spectral imaging with the diffractive optical element (DOE) technique of diffractive spectral imaging, which provides high spectral resolution. Also in 2021, Monsalve et al. proposed a DMD-based dual-dispersive compressive optical system [24], which estimates the spectral covariance matrix of a scene from a set of compressed measurements. Compared to traditional DD-CASSI, it improves the reconstruction accuracy and eliminates one prism, but it requires multiple acquisitions with the DMD and loses the snapshot advantage.

All of these techniques have an inherent trade-off between spatial resolution and spectral accuracy, and the reconstruction step determines the quality of the final desired hyperspectral image. Since the measurement is a single 2D (compressed) image and we aim to reconstruct a 3D data cube, the reconstruction is an ill-posed inverse problem.

2.2 Reconstruction algorithms

Optimization approaches were the first used to solve this ill-posed reconstruction problem. They minimize a cost function consisting of a data fidelity term and a regularization term, such as total variation (TV), which imposes sparsity on the gradients. The iterations are usually implemented as alternating projections between the data-fidelity term and the prior term. Commonly used optimization algorithms include generalized alternating projection (GAP) [25] and the alternating direction method of multipliers (ADMM) [26]. The two-step iterative shrinkage/thresholding (TwIST) algorithm [27] uses the TV model as a regularization term and achieves good results in preserving image boundaries and recovering smooth regions, so it has been widely used in CASSI reconstruction. Although these optimization-based snapshot compressive imaging (SCI) reconstruction methods are mathematically well grounded and interpretable, their main disadvantages are slow reconstruction and limited reconstruction quality.
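To make the alternating-projection structure concrete, below is a schematic PyTorch sketch in the spirit of accelerated GAP [25], not the exact published algorithm: `A`/`At` stand in for $\Phi$ and $\Phi^\top$, the identity `denoise` is a placeholder where GAP-TV would run a TV denoiser, and the toy two-channel operators are our own illustration.

```python
import torch

def gap_denoise(g, A, At, denoise, iters=50):
    """Schematic accelerated-GAP loop: g is the 2D measurement, A/At apply
    Phi and Phi^T, and `denoise` is the prior step (TV in GAP-TV)."""
    f = At(g)                                   # start from Phi^T g
    y = g.clone()
    phi_diag = A(At(torch.ones_like(g))).clamp(min=1e-6)  # diag of Phi Phi^T
    for _ in range(iters):
        f = f + At((y - A(f)) / phi_diag)       # projection onto {f : A(f) = y}
        f = denoise(f)                          # prior / regularization step
        y = y + (g - A(f))                      # accelerated measurement update
    return f

# Toy two-channel operators; identity stands in for a real TV denoiser.
mask = (torch.rand(2, 64, 64) > 0.5).float()    # per-channel modulation
A = lambda f: (mask * f).sum(0)                 # cube -> measurement
At = lambda g: mask * g                         # measurement -> cube
f_hat = gap_denoise(A(torch.rand(2, 64, 64)), A, At, denoise=lambda f: f)
print(f_hat.shape)                              # torch.Size([2, 64, 64])
```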

Deep learning (DL) is a main branch of machine learning (ML), and convolutional neural networks (CNNs) have been demonstrated to be powerful tools in computer vision tasks such as image denoising, deblurring, restoration, and various inverse problems. Compared with traditional iterative methods, CNN-based methods capture image priors and features during a training process that requires a large amount of training data, instead of hand-designing complicated image priors. The trained CNN then solves the inverse problem by a single forward propagation. Thanks to GPU acceleration, CNNs can achieve millisecond-level reconstruction times. The low time cost and high reconstruction quality make CNN-based end-to-end imaging systems potentially useful in practical applications. In recent years, hyperspectral imaging based on deep learning reconstruction has been explored extensively [28-37]. Wang et al. and Iliadis et al. proposed CNN architectures for snapshot compressive spectral imaging and temporal compressive imaging, respectively, which learn the modulation mask in coded aperture compressive imaging and realize a unified architecture for mask optimization and image reconstruction [29,34]. Subsequently, targeting the particularities of hyperspectral data, researchers proposed a CNN based on a self-attention mechanism and a network exploiting deep spatial-spectral prior information, achieving better results in snapshot compressive spectral imaging [32,33]. Recently, Zheng et al. proposed a plug-and-play (PnP) method that uses deep-learning-based denoisers as regularization priors for spectral snapshot compressive imaging [38]; this method is flexible enough to be used readily with different compressive coding mechanisms. Bacca et al. proposed a method that requires no training data, regarding the spectral data cube as a 3D tensor and performing a Tucker decomposition in a learned way [39]. Sun et al. [40] proposed an unsupervised network, HCS2-Net, which takes the random code of the coded aperture and the snapshot measurement as network inputs; it achieves reconstruction results comparable to deep networks with pretraining.

In 2017, Xiong et al. [41] proposed a deep learning framework, HSCNN, one of the first CNN-based methods for hyperspectral recovery from a single RGB image. Fubara et al. [42] proposed a CNN-based strategy for learning the RGB-to-hyperspectral mapping by jointly learning a set of basis functions and weights. These spectral reconstruction methods, which map RGB values to high-dimensional spectral data, need large prior datasets; because such datasets are sensitive to scene and illumination, great challenges remain in their practical application. Because an RGB image carries limited spectral information, researchers tend to add more information before reconstruction; for example, optical devices have been co-designed with RGB reconstruction techniques to achieve excellent results [43].

3. Methods

Figure 1 depicts the end-to-end pipeline of R-CASSI for capturing hyperspectral images, composed of a hardware encoder, the reflective CASSI optics, and a software decoder, i.e., the reconstruction algorithm using U-net-3D. Figure 1(a) shows the principle of the R-CASSI system. Here $x$, $y$, and $\lambda$ denote the three dimensions of the 3D spectral data cube, where $x$ and $y$ are the two spatial dimensions and $\lambda$ is the spectral dimension. The spectral dimension contains $N$ channels $\lambda_1,\dots,\lambda_N$. Mathematically, let $F(x,y,\lambda)$ denote the spectral intensity of light with wavelength $\lambda$ at location $(x,y)$, where $(x,y)$ are the spatial coordinates ($1\leq x\leq X$, $1\leq y\leq Y$). As the disperser, the prism creates dispersion along the spatial $y$-axis, described by the dispersion function $d(\lambda)$. The reflective mask encodes the data cube according to its transfer function $\varphi(x,y)$, so the sheared and coded data cube can be expressed as $F(x,y+d(\lambda),\lambda)\varphi(x,y)$. Owing to the reflective design, when the light passes back through the prism, the opposite dispersion occurs along the $y$-axis, so the unsheared coded data cube becomes $F(x,y,\lambda)\varphi(x,y-d(\lambda))$. On the monochromatic camera sensor, the spectral density modulated by the coded aperture is $g(x,y)$:

$$g(x,y) = \int_\lambda \varphi(x,y-d(\lambda))F(x,y,\lambda)d\lambda.$$

According to the mathematical model of compressed sensing, Eq. (1) can be written as

$$\boldsymbol{g} = {\mathbf{\Phi}}\boldsymbol{f} + \boldsymbol{e},$$

Let $\boldsymbol{f}\in \mathbb{R}^{n}$ denote the vectorized hyperspectral data cube, where $n=H\times W\times C$; $H$ and $W$ are the height and width of each 2D hyperspectral image, and $C$ is the number of spectral channels. The transfer matrix of the mask is reshaped into a diagonal matrix for each spectral channel according to its coding, and these matrices are then concatenated horizontally to obtain the transfer matrix of the R-CASSI system, $\boldsymbol{\Phi}\in \mathbb{R}^{m\times n}$, where $m=H\times W$. In Eq. (2), $\boldsymbol{g}\in \mathbb{R}^{m}$ is the measurement captured by the monochrome camera and $\boldsymbol{e}\in \mathbb{R}^{m}$ is the vectorized optical system noise. This equation describes a highly under-determined system, since $m\ll n$; the reconstruction of the hyperspectral data therefore requires efficient computational methods.
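As a concrete illustration of Eqs. (1) and (2), the following is a minimal sketch (not our calibration code) of the discretized R-CASSI forward model: each spectral channel is modulated by the mask rolled by that channel's dispersion $d(\lambda_c)$, and the channels are summed on the detector with no spatial shift of the cube itself. The integer shift list `d` and the choice of dispersion axis are assumptions of the sketch.

```python
import torch

def rcassi_forward(f, mask, d):
    """f: (H, W, C) spectral cube; mask: (H, W) binary pattern;
    d: C integer pixel shifts, a discretized dispersion d(lambda)."""
    H, W, C = f.shape
    g = torch.zeros(H, W)
    for c in range(C):
        # phi(x, y - d(lambda_c)): roll the mask by this channel's dispersion
        g += torch.roll(mask, shifts=d[c], dims=1) * f[:, :, c]
    return g  # a single 2D compressed measurement

f = torch.rand(256, 256, 27)
mask = (torch.rand(256, 256) > 0.5).float()
g = rcassi_forward(f, mask, d=list(range(27)))  # one-pixel step per band (assumed)
print(g.shape)                                  # torch.Size([256, 256])
```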

3.1 Hardware prototype implementation

The upper part of Fig. 1(b) shows the schematic illustration of our designed R-CASSI, where L stands for lens and BS for beam splitter. L1 is the objective lens; L2 and L3 are the two relay lenses. The reflective mask consists of a series of small pixels that are either reflective or transparent. We place an optical beam splitter behind the objective lens; light passes through the beam splitter and enters the optical system. The two relay lenses form a 4-f system, and the prism creates the shear. When the light reaches the mask plane, the sheared information is coded and reflected back to the prism, which undoes the shear. Finally, the light is received by the monochromatic camera through the 4-f system and the beam splitter.

The experimental prototype of R-CASSI is shown in the lower part of Fig. 1(b). The objective lens is a monofocal lens (M1614-MP2) from Computar. The other optics, from Thorlabs, are two relay lenses (ACA254-060-A), the prism (PS812-A), and the beam splitter (BS013). The camera is a Basler acA2000-165$\mu m$ with resolution $2048\times 1088$ and pixel pitch 5.5 $\mu m$. The mask is a binary random pattern with pixel pitch 11.0 $\mu m$, made of a glass base plate, silver, and silicon dioxide: the silver reflects the light, and the silicon dioxide coating prevents its oxidation. It is worth mentioning that the mask is backed by a beam trap (Thorlabs, BT610), which scatters and absorbs the energy of the light, preventing light passing through the mask from interfering with the optical system. Moreover, we place two filters (Thorlabs, FELH0450 and FELH0700, not shown in Fig. 1(b)) in front of the optical system to restrict the spectrum to the 450-700 nm band and filter out unwanted spectral content. The experimental scene is illuminated by an LED light (ABSOLUTE SERIES LED D65). The captured images are in 10-bit RAW format. By using air-spaced achromatic doublet lenses and experimental calibration, the chromatic aberrations of the system can be effectively reduced. Note that the beam splitter used in R-CASSI reduces the luminous flux, which can be compensated by increasing the detector integration time. In our experimental setup, the acquisition time is set to 20 ms, allowing for both high acquisition speed and quality.

3.2 Reconstruction network

Solving the reconstruction problem, which is ill-posed, is a key part of this work. We use a CNN to learn the mapping from the snapshot measurement to the hyperspectral image; experimental results show that the CNN-based algorithm outperforms several traditional algorithms in both reconstruction quality and reconstruction time. Specifically, we use U-net as the backbone. U-net was originally proposed for image segmentation and has been re-purposed as an image generator for various problems; this U-shaped architecture has proven efficient and serves as the backbone of many other networks. Thanks to its skip connections and residual learning, degradation can be avoided as the network gets deeper, improving reconstruction quality. Because the traditional U-net is inefficient for volumetric image segmentation, 3D U-net was proposed [44] to segment a 3D volume semi-automatically and fully automatically from sparse annotations. A U-net-based reconstruction network handles only spatial information well while ignoring spectral correlation; yet besides the high correlation in the spatial dimensions, hyperspectral images also exhibit high correlation in the spectral dimension, unlike other image types. We therefore build a U-net-3D network for R-CASSI reconstruction in this paper. Compared with U-net, U-net-3D adds a new dimension to handle the spectral information.

3.2.1 Network structure

In this work, we build an encoder-decoder U-net-3D based on U-net, using 3D convolution kernels. Different from a traditional convolution kernel that slides only in two dimensions, U-net-3D uses $3\times 3\times 3$ kernels that also slide along the third dimension, as shown in the lower part of Fig. 1(c). Traditional 'pseudo' 3D convolution kernels generate only 2D feature maps, whereas the $3\times 3\times 3$ kernels in our U-net-3D generate 3D 'feature maps', which we call feature cubes. A 'pseudo' 3D kernel cannot capture the spectral correlation inherent in hyperspectral data, while the U-net-3D designed in this paper achieves spatial and spectral consistency simultaneously.
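This distinction can be checked directly in PyTorch: a 2D convolution over a 27-band input collapses the spectral axis into a single feature map, while a true 3D kernel slides along it and returns a feature cube.

```python
import torch
import torch.nn as nn

x2d = torch.rand(1, 27, 64, 64)     # 27 bands treated as 2D input channels
x3d = torch.rand(1, 1, 27, 64, 64)  # 27 bands treated as a depth dimension
print(nn.Conv2d(27, 1, 3, padding=1)(x2d).shape)  # (1, 1, 64, 64): spectral axis collapsed
print(nn.Conv3d(1, 1, 3, padding=1)(x3d).shape)   # (1, 1, 27, 64, 64): a feature cube
```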

Figure 1(c) shows the detailed structure of U-net-3D and the feature cubes at every stage. The encoder and the decoder each contain three blocks. The first two encoding blocks consist of a double-conv layer and a max-pooling layer, and the third contains only a double-conv layer. The two subsequent decoding blocks each contain an up-sampling layer and a double-conv layer, and the last layer is a $1\times 1\times 1$ convolutional layer. In the encoder and decoder, the last layer uses the Tanh activation function, while all other convolution layers use the rectified linear unit (ReLU). The feature cubes are 4D, of the form $H\times W\times C\times F$, where $H\times W\times C$ is the size of a feature cube and $F$ is the number of features. As in U-net, each double-conv consists of two convolutional layers that expand or shrink the number of feature cubes; the kernel size is $3\times 3\times 3$ and the convolution stride is 1. Two max-pooling layers perform down-sampling, and two up-sampling layers ensure that the output has the same size as the input. Skip connections merge low-level and high-level features to avoid the vanishing-gradient problem.
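A minimal PyTorch sketch of the architecture just described is given below. The feature widths (16/32/64) and the choice to pool only the spatial dimensions (keeping all spectral planes intact) are our assumptions; the paper fixes the block layout, kernel sizes, activations, and skip connections but not these details.

```python
import torch
import torch.nn as nn

def double_conv(cin, cout):
    # two 3x3x3 convolutions with stride 1, each followed by ReLU
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=1, padding=1), nn.ReLU(),
        nn.Conv3d(cout, cout, 3, stride=1, padding=1), nn.ReLU())

class UNet3D(nn.Module):
    def __init__(self, f=16):
        super().__init__()
        self.enc1 = double_conv(1, f)
        self.enc2 = double_conv(f, 2 * f)
        self.enc3 = double_conv(2 * f, 4 * f)        # third block: double-conv only
        self.pool = nn.MaxPool3d((1, 2, 2))          # pool spatial dims only (our assumption)
        self.up = nn.Upsample(scale_factor=(1, 2, 2), mode='trilinear', align_corners=False)
        self.dec2 = double_conv(4 * f + 2 * f, 2 * f)   # concat via skip connection
        self.dec1 = double_conv(2 * f + f, f)
        self.out = nn.Sequential(nn.Conv3d(f, 1, 1), nn.Tanh())  # 1x1x1 conv + Tanh

    def forward(self, x):                            # x: (B, 1, C, H, W)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return self.out(d1)

print(UNet3D()(torch.rand(1, 1, 27, 256, 256)).shape)  # (1, 1, 27, 256, 256)
```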

3.2.2 Training details

To train the model, we first calibrate and capture the mask in our real R-CASSI system and generate measurements from the captured mask and the dataset; the ground-truth cube is $f$, and the network output is the reconstructed hyperspectral image data $\hat{f}$. The network uses the mean square error (MSE) between the output $\hat{f}$ and the ground truth $f$ as the loss function:

$${Loss}=\frac{1}{m} \sum_{i=1}^{m}\left\|\hat{\boldsymbol{f}}_i-\boldsymbol{f}_i\right\|_2^{2},$$
where $m$ is the number of training samples and $\hat{\boldsymbol{f}}_i$, $\boldsymbol{f}_i$ are the network output and ground truth for the $i$-th sample.

Our reconstruction network is trained on the CAVE dataset, which contains 32 hyperspectral scenes with a spatial resolution of $512\times 512$ and a spectral resolution of 10 nm, for a total of 31 channels from 400 nm to 700 nm. We first interpolate the data to the wavelength bands of our R-CASSI system (27 spectral bands). For data augmentation, we randomly select 4 scenes, rotate each randomly (by 0$^{\circ}$, 90$^{\circ}$, 180$^{\circ}$, or 270$^{\circ}$), splice the 4 scenes into one $1024\times 1024$ scene, and repeat this operation to obtain 168 large scenes. We then directly resize the 32 original scenes to $1024\times 1024$ to obtain another 32 large scenes for better augmentation. The training set is thus composed of 200 scenes with a spatial resolution of $1024\times 1024$ and 27 spectral bands. During training, we randomly crop patches of size $256\times 256\times 27$. For the test set, we use 10 scenes from the KAIST dataset to verify the scalability and generalizability of U-net-3D. We implement the network in PyTorch [45] with the Adam optimizer; the total number of training epochs is 500, the batch size is 5, and the learning rate is $10^{-4}$. On a machine equipped with an Intel Xeon Gold 5218 CPU, 128 GB of memory, and an NVIDIA RTX 3090 GPU, training takes approximately 2 days.
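The training procedure can be sketched as follows, reusing `rcassi_forward` and `UNet3D` from the earlier sketches. The placeholder tensor standing in for the augmented CAVE crops, and the $\Phi^\top g$ initializer (one common choice for the initialization phase of Fig. 1(c)), are assumptions rather than the released code.

```python
import torch

def initialize_f0(g, mask, d):
    # One common initializer: apply Phi^T to g, i.e., re-modulate the
    # measurement with each channel's shifted mask. Returns a (C, H, W) cube.
    return torch.stack([torch.roll(mask, s, dims=1) * g for s in d], dim=0)

model = UNet3D()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = torch.nn.MSELoss()
d = list(range(27))
mask = (torch.rand(256, 256) > 0.5).float()
train_cubes = torch.rand(20, 27, 256, 256)      # placeholder for CAVE crops

for epoch in range(2):                          # the paper trains for 500 epochs
    for i in range(0, len(train_cubes), 5):     # batch size 5
        batch = train_cubes[i:i + 5]
        f0 = torch.stack([initialize_f0(
            rcassi_forward(f.permute(1, 2, 0), mask, d), mask, d)
            for f in batch])                    # (B, C, H, W) network input
        pred = model(f0.unsqueeze(1)).squeeze(1)
        loss = mse(pred, batch)                 # MSE against the ground truth
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```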

4. Results

In this section, we verify the performance of the R-CASSI system and the U-net-3D algorithm through extensive simulations and experiments. We first conduct simulations to test U-net-3D against other competitive methods, and then use the real R-CASSI system to capture several scenes and reconstruct them with a variety of algorithms.

We use quantitative and qualitative metrics for comparison. The quantitative metrics are peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [46]. For each 2D spatial image, PSNR and SSIM compare the spatial quality of the reconstructed hyperspectral images against the ground-truth reference images; for spectral results, we calculate spectral correlation values. High PSNR, SSIM, and correlation values indicate good algorithm performance. Qualitative evaluation compares spectral curves and visual quality against the ground truth.
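For reference, below are minimal implementations of two of these metrics: per-image PSNR and the Pearson-type spectral correlation at a chosen pixel (SSIM is omitted; see, e.g., scikit-image's structural_similarity).

```python
import torch

def psnr(x, ref, peak=1.0):
    # peak signal-to-noise ratio in dB for signals normalized to [0, peak]
    mse = torch.mean((x - ref) ** 2)
    return 10 * torch.log10(peak ** 2 / mse)

def spectral_correlation(x, ref, i, j):
    # Pearson correlation between reconstructed and reference spectra at pixel (i, j)
    a, b = x[:, i, j], ref[:, i, j]
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm())

cube = torch.rand(27, 256, 256)
recon = cube + 0.05 * torch.randn_like(cube)
print(psnr(recon, cube).item(), spectral_correlation(recon, cube, 100, 100).item())
```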

4.1 Simulation results

To accurately evaluate the reconstruction performance of U-net-3D on the R-CASSI system, the simulation parameters, such as the mask and wavelengths, are kept consistent with the real experimental system described above. Table 1 lists the average PSNR and SSIM over 10 scenes from the KAIST dataset using the TwIST [27], ADMM-TV, U-net, and U-net-3D algorithms, with simulation results for both SD-CASSI and R-CASSI. Our designed U-net-3D outperforms the other algorithms in most scenes; on average, it provides about 1 dB higher PSNR than U-net in the R-CASSI simulation. For SD-CASSI, U-net-3D also improves both PSNR and SSIM compared to U-net, TwIST, and ADMM-TV.

Table 1. Average PSNR (dB, left) and SSIM (right) of 10 Simulation Scenes in KAIST dataset using various algorithms.

Figure 2 shows visualization results of the R-CASSI system with the U-net and U-net-3D algorithms at a spatial size of $256\times 256$. The scene is the sculpture from the KAIST dataset, with the same 27 spectral channels as the real experimental system described below. The PSNR and SSIM are shown for each algorithm; U-net-3D clearly provides better results in the zoomed areas.

Fig. 2. Simulation results (PSNR/SSIM) of a KAIST scene of size $256\times 256$, compared to the ground truth.

4.2 Experimental results

We capture 10 real scenes and reconstruct their hyperspectral images with R-CASSI. The captured images are in 10-bit RAW format. As noted above, the air-spaced achromatic doublet lenses and experimental calibration effectively reduce the chromatic aberrations of the system, and the luminous flux lost at the beam splitter is compensated by a longer detector integration time (20 ms, i.e., 50 fps), allowing for both high acquisition speed and quality. All data consist of 27 spectral bands: 454.1, 459.3, 464.6, 470.2, 476.0, 482.0, 488.2, 494.7, 501.4, 508.4, 515.8, 523.4, 531.4, 539.7, 548.5, 557.6, 567.2, 577.3, 587.9, 599.1, 610.8, 623.2, 636.3, 650.1, 664.7, 680.2, and 696.6 nm, and the spatial resolution is $1024\times 768$. Figure 3 shows the RGB reference images of these scenes.

Fig. 3. RGB reference images of 10 scenes.

Block-wise reconstruction of real measurements: In our experiments, the measurement captured by R-CASSI is $768\times 1024$, while the CNN is trained on HSI data of size $256\times 256\times 27$ with measurements of size $256\times 256$. Since directly training the CNN with an input size of $768\times 1024$ is computationally expensive, we train it with the smaller input size to reduce the time cost. For reconstruction of a real measurement, we divide it into 12 blocks, each of size $256\times 256$, matching the CNN input size. Notably, our network is flexible with respect to the mask modulation: because the mask regions used for training are randomly selected, a single network trained at $256\times 256$ can handle all the blocks and is more tolerant to the noise of the experimental system. After reconstruction, the 12 blocks are stitched together to form a complete hyperspectral image of size $768\times 1024\times 27$, as sketched below. For reconstruction we use several state-of-the-art methods to recover the real captured data, including two iterative optimization algorithms, TwIST and ADMM-TV, and four deep learning methods: U-net, TSA-net [8], HD-net [37], and U-net-3D. The reconstructed data cube of every real scene is of size $1024\times 768\times 27$.
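A sketch of this block-wise pipeline, reusing `initialize_f0` and the U-net-3D sketch from above; the full-size calibrated mask is assumed available.

```python
import torch

def blockwise_reconstruct(g, mask, d, model, block=256):
    H, W = g.shape                      # 768 x 1024 -> 3 x 4 = 12 blocks
    out = torch.zeros(len(d), H, W)
    with torch.no_grad():
        for r in range(0, H, block):
            for c in range(0, W, block):
                g_blk = g[r:r + block, c:c + block]
                m_blk = mask[r:r + block, c:c + block]
                f0 = initialize_f0(g_blk, m_blk, d)[None, None]  # (1, 1, C, h, w)
                out[:, r:r + block, c:c + block] = model(f0).squeeze()
    return out                          # stitched (27, 768, 1024) cube

g = torch.rand(768, 1024)
mask_full = (torch.rand(768, 1024) > 0.5).float()
cube = blockwise_reconstruct(g, mask_full, list(range(27)), UNet3D())
print(cube.shape)                       # torch.Size([27, 768, 1024])
```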

The top of Fig. 4 shows the reference RGB image of a plasticine mold made of plastic. We select three positions (a, b, and c) in this scene to plot the spectral reconstruction curves and calculate the spectral correlation values; the spectral results of the six algorithms are compared at the top of Fig. 4, and the lower part shows visualizations of all 27 channels. As shown in Fig. 4, the spatial results of U-net-3D, U-net, TSA-net, and HD-net have finer details and sharper edges than those of ADMM-TV and TwIST. Compared with U-net, U-net-3D clearly gives better spatial results: it reduces artifacts over large areas, making the overall result smoother and closer to the real scene, while also retaining fine details. For example, both U-net and U-net-3D reconstruct the protrusion in the middle of the lower-left part of the scene, but only U-net-3D reconstructs the smaller protrusions around it. Compared with TSA-net and HD-net, U-net-3D achieves similar resolution and detail but is smoother over large areas with fewer artifacts. For the spectral results, the two iterative algorithms perform well in the green and red bands but are inferior to the four deep learning algorithms overall, consistent with the simulation results. U-net-3D gives among the best reconstructions at positions a, b, and c, which have different colors: HD-net is best in region 'a', while U-net-3D is better in regions 'b' and 'c'.

Fig. 4. Scene of Plasticine Mold. Reconstruction results with ADMM-TV, TwIST, U-net, TSA-net, HD-net, and U-net-3D. Three points marked by ‘a’, ‘b’, and ‘c’ in the RGB reference images are selected to plot the spectral curves.

It can be concluded that U-net-3D presents some of the best spectral results; furthermore, it provides finer details than U-net and fewer artifacts than TSA-net and HD-net.

We provide spatial and spectral results with 4 out of 27 spectral channels (470.2 nm, 531.4 nm, 587.9 nm, 650.1 nm) in some other scenes. All the reconstructions use the six methods mentioned above.

Figure 5 shows the doll of a cartoon character named Bulbasaur, whose body is smooth and which is surrounded by a jumble of flowers and leaves not in the same focal plane; its RGB reference image is also shown. We can draw conclusions similar to the previous scene: U-net-3D gives the best spatial results and removes a significant amount of artifacts, especially on the body of the toy, and it works best for the detailed reconstruction of the red flowers. The other algorithms can also reconstruct the flowers, which demonstrates the ability of the R-CASSI system to adapt to a larger depth of field. For the spectral results, U-net-3D again shows among the best reconstructions, especially in the blue region at 'a'. To visualize the recovered spatial and spectral quality, we convert the 27-channel spectral images into synthetic RGB (sRGB) images using color matching functions [47], as sketched below; the results are shown in Fig. 6. U-net-3D performs the best: the traditional ADMM-TV result loses clarity, the TwIST result loses details, and compared with the other deep learning methods U-net-3D still shows excellent performance. For example, the spectral curve reconstructed by HD-net has accuracy similar to U-net-3D, but the RGB image synthesized from HD-net shows uneven colors, a problem that also affects the other algorithms, as marked by the red box in Fig. 6.
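A minimal sketch of this rendering step: weight the bands by the CIE 1931 color matching functions to obtain XYZ, then apply the standard XYZ-to-sRGB matrix. The (27, 3) table `cmf` (real CIE samples at our band centers) is assumed available, and the plain 1/2.2 gamma is a simplification of the sRGB transfer curve.

```python
import torch

def cube_to_srgb(cube, cmf):
    """cube: (27, H, W) reconstructed spectra; cmf: (27, 3) CIE weights."""
    xyz = torch.einsum('chw,ck->khw', cube, cmf)    # integrate bands -> XYZ
    xyz = xyz / xyz[1].max()                        # normalize by peak luminance
    M = torch.tensor([[ 3.2406, -1.5372, -0.4986],  # standard XYZ -> linear sRGB
                      [-0.9689,  1.8758,  0.0415],
                      [ 0.0557, -0.2040,  1.0570]])
    rgb = torch.einsum('kc,chw->khw', M, xyz).clamp(0.0, 1.0)
    return rgb ** (1 / 2.2)                         # simplified gamma

cmf = torch.rand(27, 3)                             # placeholder for real CIE samples
srgb = cube_to_srgb(torch.rand(27, 256, 256), cmf)
print(srgb.shape)                                   # torch.Size([3, 256, 256])
```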

Fig. 5. Scene of Toy doll. Reconstruction results with ADMM-TV, TwIST, U-net, TSA-net, HD-net, and U-net-3D. Three points marked by ‘a’, ‘b’, and ‘c’ in the RGB reference images are selected to plot the spectral curves.

Fig. 6. Results of the selected scene shown in synthetic-RGB.

The visualization results for some scenes can be found in the Supplementary Materials.

4.3 Comparison with SD-CASSI

We compare the experimental results of the R-CASSI and SD-CASSI systems using the U-net and U-net-3D methods. The parameters of R-CASSI are exactly as described above, and we show visualization results at four wavelengths (470.2 nm, 531.4 nm, 587.9 nm, 650.1 nm). We also built a traditional SD-CASSI system in our laboratory, which captures hyperspectral data in the 450-650 nm band; it uses the same detector pixel size and resolution as R-CASSI, i.e., 5.5 $\mu m$ and $2048\times 1088$. The acquisition times of the two systems are set for their respective optimal performance, i.e., 10 ms for SD-CASSI and 20 ms for R-CASSI. For comparison with R-CASSI, we select four SD-CASSI channels (470.5 nm, 531.8 nm, 588.5 nm, 648.2 nm) close to the four bands above; owing to hardware calibration, it is very challenging to make R-CASSI and SD-CASSI share exactly the same wavelengths.

The scene shown in Fig. 7 is a scrambled magic cube with five main colors whose material is fairly reflective; the scene contains large smooth areas of uniform color. From the measurement on the left side of Fig. 7, because SD-CASSI performs only one dispersion, the 3D hyperspectral data is shifted along the spectral dimension, resulting in unclear edges in the compressed image; in contrast, R-CASSI obtains compressed measurements with clear edge information. From the spatial results, we observe that U-net-3D not only provides finer details but also eliminates artifacts in the large smooth areas, and for a given algorithm, R-CASSI yields higher spatial accuracy. For example, the scene in Fig. 8 is a richly detailed house model, and the spatial results of R-CASSI exhibit low noise and sharp edges. Specifically, the sign at the upper right of the house reads 'Flower Shop': both U-net and U-net-3D clearly reconstruct this text from the R-CASSI data, whereas in the reconstructions from the SD-CASSI data the text is completely lost. As mentioned above, the spatial prior information in the R-CASSI measurement remains unchanged, so it is easier for the algorithms to reconstruct the spatial details; this improvement in spatial quality makes R-CASSI more suitable for practical hyperspectral imaging applications.

Fig. 7. SD-CASSI and R-CASSI results of the magic cube scene.

Fig. 8. SD-CASSI and R-CASSI results of the house scene.

5. Conclusion

In summary, we have developed a reflective coded aperture snapshot spectral imaging system, named R-CASSI. Compared with SD-CASSI, the coded measurement of R-CASSI suffers no spatial shifting, which benefits the reconstruction algorithm and yields better spatial reconstruction performance. Compared with DD-CASSI, which uses two dispersion arms, each with its own 4-f system, R-CASSI applies a single prism to realize dispersion twice, so its size is almost halved. Furthermore, we have developed an encoder-decoder network with 3D convolution kernels, dubbed U-net-3D, for spectral snapshot compressive imaging reconstruction; the network achieves spatial and spectral consistency simultaneously. In the R-CASSI simulations, U-net-3D provides on average about 1 dB higher PSNR than U-net. Extensive experiments verify the effectiveness of R-CASSI and U-net-3D: R-CASSI significantly improves the spatial reconstruction performance while retaining high spectral reconstruction performance, and U-net-3D shows better spatial reconstruction than U-net on both R-CASSI and SD-CASSI experimental data. Importantly, we have released our real data, which we hope can serve as a benchmark dataset for testing new reconstruction algorithms. Future work includes developing more efficient algorithms and miniaturizing the hardware setup.

Funding

National Key Research and Development Program of China (2021YFF0901700); National Natural Science Foundation of China (61821001, 61901045, 62271414); Natural Science Foundation of Zhejiang Province; State Key Laboratory of Information Photonics and Optical Communications (IPOC2021ZT18); Westlake Foundation (2021B1501-2).

Acknowledgments

Xin Yuan would like to thank the support from the Research Center for Industries of the Future (RCIF) at Westlake University. Note: UNO is a trademark of Mattel, Inc. TOY STORY is a trademark of Disney Enterprises, Inc. They are not overseeing, involved with, or responsible for this activity, product, or service.

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are available in Ref. [13].

Supplemental document

See Supplement 1 for supporting content.

References

1. D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]  

2. E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?” IEEE Trans. Inf. Theory 52(12), 5406–5425 (2006). [CrossRef]  

3. E. J. Candès, “The restricted isometry property and its implications for compressed sensing,” Comptes rendus mathematique 346(9-10), 589–592 (2008). [CrossRef]  

4. M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15(21), 14013–14027 (2007). [CrossRef]  

5. X. Yuan, D. J. Brady, and A. K. Katsaggelos, “Snapshot compressive imaging: Theory, algorithms, and applications,” IEEE Signal Process. Mag. 38(2), 65–88 (2021). [CrossRef]  

6. A. Wagadarikar, R. John, R. Willett, and D. J. Brady, “Single disperser design for coded aperture snapshot spectral imaging,” Appl. Opt. 47(10), B44–B51 (2008). [CrossRef]  

7. G. R. Arce, D. J. Brady, L. Carin, H. Arguello, and D. S. Kittle, “Compressive coded aperture spectral imaging: An introduction,” IEEE Signal Process. Mag. 31(1), 105–115 (2014). [CrossRef]  

8. Z. Meng, J. Ma, and X. Yuan, “End-to-end low cost compressive spectral imaging with spatial-spectral self-attention,” in European Conference on Computer Vision (ECCV), (2020).

9. Z. Meng, M. Qiao, J. Ma, Z. Yu, K. Xu, and X. Yuan, “Snapshot multispectral endomicroscopy,” Opt. Lett. 45(14), 3897–3900 (2020). [CrossRef]  

10. I. Choi, D. S. Jeon, G. Nam, D. Gutierrez, and M. H. Kim, “High-quality hyperspectral reconstruction using a spectral prior,” ACM Trans. Graph. 36(6), 1–13 (2017). [CrossRef]  

11. Z. Zhao, Z. Meng, Z. Ju, Z. Yu, and K. Xu, “A compact dual-dispersion architecture for snapshot compressive spectral imaging,” in 2021 Asia Communications and Photonics Conference (ACP), (IEEE, 2021), pp. 1–3.

12. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), vol. 9351 of LNCS (Springer, 2015), pp. 234–241.

13. D. Liu, “Reflective-cassi,” figshare (2022), https://doi.org/10.6084/m9.figshare.20281494.v1.

14. N. Gat, “Imaging spectroscopy using tunable filters: a review,” in Wavelet Applications VII, vol. 4056 (International Society for Optics and Photonics, 2000), pp. 50–64.

15. H. Lee and M. H. Kim, “Building a two-way hyperspectral imaging system with liquid crystal tunable filters,” in International Conference on Image and Signal Processing, (Springer, 2014), pp. 26–34.

16. N. Brusco, S. Capeleto, M. Fedel, A. Paviotti, L. Poletto, G. M. Cortelazzo, and G. Tondello, “A system for 3d modeling frescoed historical buildings with multispectral texture information,” Mach. Vis. Appl. 17(6), 373–393 (2006). [CrossRef]  

17. A. A. Wagadarikar, N. P. Pitsianis, X. Sun, and D. J. Brady, “Video rate spectral imaging using a coded aperture snapshot spectral imager,” Opt. Express 17(8), 6368–6388 (2009). [CrossRef]  

18. D. Kittle, K. Choi, A. Wagadarikar, and D. J. Brady, “Multiframe image estimation for coded aperture snapshot spectral imagers,” Appl. Opt. 49(36), 6824 (2010). [CrossRef]  

19. Y. Wu, I. O. Mirza, G. R. Arce, and D. W. Prather, “Development of a digital-micromirror-device-based multishot snapshot spectral imaging system,” Opt. Lett. 36(14), 2692–2694 (2011). [CrossRef]  

20. H. Arguello and G. R. Arce, “Colored coded aperture design by concentration of measure in compressive spectral imaging,” IEEE Trans. on Image Process. 23(4), 1896–1908 (2014). [CrossRef]  

21. X. Lin, Y. Liu, J. Wu, and Q. Dai, “Spatial-spectral encoded compressive hyperspectral imaging,” ACM Trans. Graph. 33(6), 1–11 (2014). [CrossRef]  

22. X. Lin, G. Wetzstein, Y. Liu, and Q. Dai, “Dual-coded compressive hyperspectral imaging,” Opt. Lett. 39(7), 2044–2047 (2014). [CrossRef]  

23. H. Arguello, S. Pinilla, Y. Peng, H. Ikoma, J. Bacca, and G. Wetzstein, “Shift-variant color-coded diffractive spectral imaging system,” Optica 8(11), 1424–1434 (2021). [CrossRef]  

24. J. Monsalve, M. Marquez, I. Esnaola, and H. Arguello, “Compressive covariance matrix estimation from a dual-dispersive coded aperture spectral imager,” in 2021 IEEE International Conference on Image Processing (ICIP), (IEEE, 2021), pp. 2823–2827.

25. X. Liao, H. Li, and L. Carin, “Generalized alternating projection for weighted-$\ell_{2,1}$ minimization with applications to model-based compressive sensing,” SIAM J. on Imaging Sci. 7(2), 797–823 (2014). [CrossRef]  

26. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations Trends Mach. Learn. 3(1), 1–122 (2010). [CrossRef]  

27. J. Bioucas-Dias and M. Figueiredo, “A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. on Image Process. 16(12), 2992–3004 (2007). [CrossRef]  

28. A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos, “Using deep neural networks for inverse problems in imaging: beyond analytical methods,” IEEE Signal Process. Mag. 35(1), 20–36 (2018). [CrossRef]  

29. L. Wang, T. Zhang, Y. Fu, and H. Huang, “Hyperreconnet: Joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging,” IEEE Trans. on Image Process. 28(5), 2257–2270 (2019). [CrossRef]  

30. J. Ma, X.-Y. Liu, Z. Shou, and X. Yuan, “Deep tensor admm-net for snapshot compressive imaging,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 10223–10232.

31. Y. Liu, X. Yuan, J. Suo, D. J. Brady, and Q. Dai, “Rank minimization for snapshot compressive imaging,” IEEE Trans. Pattern Anal. Mach. Intell. 41(12), 2990–3006 (2019). [CrossRef]  

32. X. Miao, X. Yuan, Y. Pu, and V. Athitsos, “λ-net: Reconstruct hyperspectral images from a snapshot measurement,” in IEEE/CVF Conference on Computer Vision (ICCV), (2019).

33. L. Wang, C. Sun, Y. Fu, M. H. Kim, and H. Huang, “Hyperspectral image reconstruction using a deep spatial-spectral prior,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019), pp. 8024–8033.

34. M. Iliadis, L. Spinoulas, and A. K. Katsaggelos, “Deepbinarymask: Learning a binary mask for video compressive sensing,” Digit. Signal Process. 96, 102591 (2020). [CrossRef]  

35. M. Qiao, Z. Meng, J. Ma, and X. Yuan, “Deep learning for video compressive sensing,” APL Photonics 5(3), 030801 (2020). [CrossRef]  

36. S.-H. Baek, H. Ikoma, D. S. Jeon, Y. Li, W. Heidrich, G. Wetzstein, and M. H. Kim, “Single-shot hyperspectral-depth imaging with learned diffractive optics,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 2651–2660.

37. X. Hu, Y. Cai, J. Lin, H. Wang, X. Yuan, Y. Zhang, R. Timofte, and L. Van Gool, “HDNet: High-resolution dual-domain learning for spectral compressive imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 17542–17551.

38. S. Zheng, Y. Liu, Z. Meng, M. Qiao, Z. Tong, X. Yang, S. Han, and X. Yuan, “Deep plug-and-play priors for spectral snapshot compressive imaging,” Photonics Res. 9(2), B18–B29 (2021). [CrossRef]  

39. J. Bacca, Y. Fonseca, and H. Arguello, “Compressive spectral image reconstruction using deep prior and low-rank tensor representation,” Appl. Opt. 60(14), 4197–4207 (2021). [CrossRef]  

40. Y. Sun, Y. Yang, Q. Liu, and M. Kankanhalli, “Unsupervised spatial–spectral network learning for hyperspectral compressive snapshot reconstruction,” IEEE Transactions on Geoscience and Remote Sensing 60, 1–14 (2022). [CrossRef]  

41. Z. Xiong, Z. Shi, H. Li, L. Wang, D. Liu, and F. Wu, “HSCNN: CNN-based hyperspectral image recovery from spectrally undersampled projections,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, (2017), pp. 518–525.

42. B. J. Fubara, M. Sedky, and D. Dyke, “Rgb to spectral reconstruction via learned basis functions and weights,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, (2020), pp. 480–481.

43. W. Zhang, H. Song, X. He, L. Huang, X. Zhang, J. Zheng, W. Shen, X. Hao, and X. Liu, “Deeply learned broadband encoding stochastic hyperspectral imaging,” Light: Sci. Appl. 10(1), 108 (2021). [CrossRef]  

44. Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in International conference on medical image computing and computer-assisted intervention, (Springer, 2016), pp. 424–432.

45. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, eds. (Curran Associates, Inc., 2019), pp. 8024–8035.

46. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]  

47. T. Smith and J. Guild, “The CIE colorimetric standards and their use,” Trans. Opt. Soc. 33(3), 73–134 (1931). [CrossRef]  

Supplementary Material (2)

Code 1: Experimental data and code for the paper “Deep learning enabled reflective coded aperture snapshot spectral imaging.”
Supplement 1: Supplementary Material for “Deep learning enabled reflective coded aperture snapshot spectral imaging.”
