Spatial compressive imaging deep learning framework using joint input of multi-frame measurements and degraded maps

Open Access

Abstract

Traditional compressive imaging reconstruction is often based on iterative algorithms, which are time-consuming. To address this issue, several groups have used deep learning for reconstruction, achieving low running time with good performance. However, excessive dependence on data and network structure leaves such networks lacking flexibility and interpretability, and they are often inapplicable when compression ratios are high. To solve these issues, we study an end-to-end network, Joinput-CiNet (joint input compressive imaging net). We use a tailored encoding module to make the imaging degradation model part of the network input. The network thereby obtains prior knowledge of the imaging system, which improves training efficiency and reconstruction performance. On five widely used image datasets and experimentally collected infrared (IR) measurements, Joinput-CiNet demonstrates superior reconstruction performance at high compression ratios such as 1:16 and 1:64, with fast speed compared with other networks.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Compressed sensing (CS) theory is a landmark theory for signal acquisition and processing [1]. For signals that are sparse in some transform domain, CS can reconstruct a signal from far fewer measurements than required by the Nyquist sampling theorem [2]. Compared with traditional imaging methods, CS-based imaging has advantages such as obtaining high-resolution or high-speed images in non-visible bands [3] and relaxing the transmission bandwidth and storage requirements of information collection equipment [4,5]. The random distribution of measurement matrices in CS also allows the theory to be used in encrypted information transmission [6].

Although CS is used in many applications, it still faces problems. Traditional CS reconstruction algorithms fall mainly into five categories: greedy algorithms [7,8], thresholding algorithms [9–11], convex optimization algorithms [12,13], Bayesian framework algorithms [14,15], and combinatorial algorithms [16]. The limitations of these algorithms involve three aspects. First, traditional CS algorithms often adopt a time-consuming iterative form, which is not friendly to hardware acceleration. Second, the reconstruction performance of these algorithms relies on object sparsity: good reconstructions can easily be obtained for a sparse object, but many natural signals satisfy the sparsity requirement only to some extent. Finally, in many applications bandwidth restrictions are stringent and a higher compression ratio is required; in such cases, traditional CS algorithms often have difficulty reconstructing. Therefore, better solutions are desired.

In recent years, researchers have applied deep neural networks to CS problems, breaking through these limitations and obtaining excellent results. For wireless image transmission, Bo et al. proposed FompNet to suppress the noise in data transmission channels, improving reconstruction quality over traditional iterative algorithms [17]. Mousavi et al. proposed a stacked denoising autoencoder (SDA) that improves reconstruction quality by capturing the statistical correlations between different elements of the target [18]. In the spatial compressive imaging (SCI) field, Kulkarni et al. proposed ReconNet, which achieves fast, high-quality recovery by learning the nonlinear mapping between the measurement values and the original object [19]. Xu et al. proposed LAPRAN to achieve multi-resolution output [20]. These methods produce fast, high-quality reconstructions, but they depend heavily on data and often lack interpretability. In addition, these networks often encounter difficulty when the compression ratio is large. To deal with these issues, Yang et al. and Zhang et al. proposed ADMM-Net [21] and ISTA-Net [22] for TCI (temporal compressive imaging) [23,24] and SCI, respectively. These networks are designed by replacing each iteration of an iterative optimization algorithm (ADMM or ISTA) with a deep neural network, which makes them more understandable, and they yield reconstructions with high speed and quality. However, in none of these networks has the sensing matrix, or the forward imaging model of spatial compressive imaging, been used expressly as a network input.

In this paper, we study another deep neural network for SCI, Joinput-CiNet (joint input compressive imaging net). For the first time, the sensing matrix information is incorporated as an input of an SCI network. We design an encoding module based on principal component analysis (PCA) to reduce the dimensionality of the sensing matrix. The obtained PCA components are then used to form degraded maps, which are sent to the network for training. The whole degraded-map generation process is named the DM-Gen module. With this network, better reconstruction performance is obtained with a small number of network parameters and fast running time. In addition, for the inverse problem of compressed sensing reconstruction, using the sensing matrix is equivalent to adding degradation prior information, which constrains the solution space of the reconstruction and further improves the interpretability of the network model.

The paper is organized as follows. In Section 2, we discuss the principle of spatial compressive imaging. In Section 3, we present the framework of Joinput-CiNet and the loss function for network training. In Section 4, simulated experiments are conducted using visible-band images from several datasets. In Section 5, optical experimental results obtained with an IR compressive imaging system demonstrate the superior performance of our network. Finally, we conclude the work in Section 6.

2. Spatial compressive imaging and iterative reconstruction algorithms

In the imaging field, CS has applications such as single-pixel imaging [25], ghost imaging [26], and block-wise compressive imaging (BCI) [27]. Among these, BCI is often discussed for the infrared band, where the resolution of a detector array is difficult to improve due to the limitations of semiconductor technology. In BCI, a target is divided into many blocks. The target is focused onto a DMD (digital micromirror device) using a lens, and after modulation the result is focused onto a detector array. Each detector makes measurements of one target block. Compared to single-pixel imaging, BCI effectively reduces the data compression ratio and the number of measurements per block. The measurement collection process in BCI is defined as follows:

$$\boldsymbol{y} = \boldsymbol{\Phi} \boldsymbol{x},$$
where $\boldsymbol {x}\left ( {K^2 \times 1} \right )$ is the sparse original signal block, $\boldsymbol {\Phi } \left ( {M \times K^2} \right )$ is the measurement matrix, $\boldsymbol {y}\left ( {M \times 1} \right )$ is the measurement signal, and $K^2 \gg M$. To obtain an effective target recovery, the measurement matrix needs to satisfy the RIP (restricted isometry property) criterion [28]. Such matrices include the random Gaussian matrix, random Bernoulli matrix, random partial orthogonal matrix, and random Fourier transform matrix.
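To make Eq. (1) concrete, here is a minimal NumPy sketch of measuring one object block with a $\{0,1\}$ Bernoulli matrix; the sizes ($K=4$, $M=1$, i.e., a 1:16 compression ratio) mirror settings used later in the paper, while the variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

K, M = 4, 1                      # block size and measurements per block (1:16 ratio)
Phi = rng.integers(0, 2, size=(M, K * K)).astype(float)  # {0,1} Bernoulli matrix

x = rng.random(K * K)            # one lexicographically ordered object block
y = Phi @ x                      # Eq. (1): y = Phi x, with K^2 >> M

print(y.shape)                   # (1,) -- M measurements from K^2 pixels
```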

To reconstruct the object, many reconstruction algorithms can be used, such as TVAL-3 [29], BM3D-AMP [30], NLR-CS [31], ReconNet [19], and LDAMP [32]. In this work, we use the trained Joinput-CiNet for reconstruction. Figure 1 summarizes the modulation and reconstruction processes in BCI.

Fig. 1. The model for SCI. The modulation process uses a random Bernoulli matrix. The reconstruction process uses a trained Joinput-CiNet model.

3. Joint input compressive imaging net (Joinput-CiNet)

3.1 Joinput-CiNet architecture

The design of Joinput-CiNet is inspired by the super-resolution (SR) network SRMD (Super-Resolution Network for Multiple Degradations) [33]. SCI and SR are two related problems in computational imaging: in both, an original high-resolution image is reconstructed from one or several low-resolution images or measurement frames. The image or measurement acquisition processes in SCI and SR can be represented by linear functions and are closely related to convolution. Deep-network-based SCI and SR reconstruction methods therefore often use a CNN as the basic framework, and the loss functions in both problems are typically the pixel-wise difference between the network output and the original object.

SRMD consists of Conv+BN+ReLU blocks that extract object information layer by layer, where Conv, BN, and ReLU denote convolution layer, batch normalization, and rectified linear unit, respectively. A pixel-shuffle block then enlarges the network output resolution. The biggest difference between SRMD and other SR networks is that it includes the image degradation model, i.e., the blur kernel and noise level, as network inputs. In this way, the network obtains the image degradation information through training and achieves better reconstruction quality. Using the degradation model as a network input also reduces the model's dependence on data and makes it more interpretable, consistent with the trend of new deep networks for computational imaging.

Similar to SRMD, the input of Joinput-CiNet consists of two parts: low-resolution measurement frames and degraded maps. The low-resolution measurement frames are obtained by modulating and then downsampling the original object, a process that can be characterized by a matrix multiplication. As discussed above, the degraded maps come from the image degradation model, which in SCI is represented by the random matrix; in this paper, it is a random Bernoulli matrix. To generate the degraded maps, we design an encoding module based on PCA, named the DM-Gen module, as shown in Fig. 2.

Fig. 2. Generation of degraded maps.

In both Joinput-CiNet and SRMD, representing the modulation template or the imaging blur kernel with a low-dimensional vector is the central step in generating degraded maps. In SRMD [33], the imaging blur kernel is assumed to be a random Gaussian vector $\boldsymbol {h}$ $(K^2 \times 1)$. By generating a large number of samples, the auto-correlation matrix $\boldsymbol {C_h}$ of size $(K^2 \times K^2)$ can be calculated. The $T$ eigenvectors corresponding to the largest $T$ eigenvalues of $\boldsymbol {C_h}$ then become the rows of the PCA matrix $\boldsymbol {P}~(T\times K^2)$. For an imaging system with a specific blur kernel $\boldsymbol {h}$, a $(T\times 1)$ vector $\boldsymbol {g}=\boldsymbol {Ph}$ is used to represent the kernel, where $T\ll K^2$. Each element of $\boldsymbol {g}$ is then stretched into a degraded map of size $(W\times H)$, where $(W\times H)$ is the dimension of the low-resolution image. Note that in SRMD, the PCA matrix $\boldsymbol {P}$ is assumed to work for all kernels with the same Gaussian distribution.

For an SCI system, the blur kernel becomes random binary templates. For $(K\times K)$ object blocks, we use $M$ modulation templates of size $(K\times K)$. As defined in Eq. (1), a vector $\boldsymbol {x}~(K^2 \times 1)$ represents an object block. Although a general PCA matrix for all binary templates would be attractive, the resulting degraded maps do not give convincing results. We therefore simplify the PCA process by assuming the templates are known. For each template, we reorder the pixels lexicographically into a vector of size $(K^2\times 1)$ and then apply the PCA process to generate the low-dimensional vector $\boldsymbol {g}$. As shown in Fig. 2, the measurement matrix $\boldsymbol {\Phi }$ $(M\times K^2)$ generates $M$ groups of $(T\times 1)$ vectors $(\boldsymbol {g_1}\cdots \boldsymbol {g_M})$ through the PCA process. Note that $T=1$ if each template is used individually. In this case, the value of $\boldsymbol {g}$ is directly related to the number of "1"s in a template: as the number of "1"s increases, the values in the degraded maps become larger. In other words, the values in the degraded maps represent the amount of light emitted from an object and collected by a detector using a $\{0,1\}$ Bernoulli matrix. Although this representation of a sensing vector by a degraded map is simple, it brings additional information about the sensing matrix into the network. Besides using each modulation template individually, the PCA process could be implemented with all templates together; however, the resulting vector $\boldsymbol {g}$ would not be as closely related to the imaging system or its measurements as the approach in this paper, and we doubt that it would perform better.
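The following NumPy sketch illustrates the DM-Gen idea for the $T=1$, known-template case described above. The sample-based PCA and all variable names are our illustrative assumptions, not necessarily the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, T = 4, 4, 1                # block size, templates, PCA components
W, H = 48, 48                    # low-resolution measurement frame size

# M known binary modulation templates, flattened lexicographically
templates = rng.integers(0, 2, size=(M, K * K)).astype(float)

# PCA over a large sample of Bernoulli template vectors
samples = rng.integers(0, 2, size=(10000, K * K)).astype(float)
C = samples.T @ samples / len(samples)        # (K^2, K^2) auto-correlation
eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
P = eigvecs[:, -T:].T                         # top-T eigenvectors as rows, (T, K^2)

g = templates @ P.T                           # (M, T) low-dimensional codes
degraded_maps = np.broadcast_to(
    g.reshape(M * T, 1, 1), (M * T, H, W)     # each g value tiled to a (H, W) map
).copy()
print(degraded_maps.shape)                    # (M*T, H, W), stacked with measurements
```

Consistent with the text, the dominant eigenvector of the Bernoulli auto-correlation points roughly along the all-ones direction, so each projected value $\boldsymbol{g}$ scales with the number of "1"s in its template.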

The framework of Joinput-CiNet is presented in Fig. 3. We assume the resolution of the original object is $(KW\times KH\times C)$, where $K$ is the block size in BCI, $W$ and $H$ are the width and height of a measurement frame, and $C$ is the number of channels, which is 3 in this paper for the red, green, and blue channels. With $M$ sensing templates of size $(K\times K)$, we obtain low-resolution measurement frames of size $(W\times H\times MC)$. We then use the DM-Gen module to generate degraded maps of size $(W\times H\times MT)$. Finally, we concatenate the low-resolution measurement frames and the degraded maps to obtain a joint input of size $(W\times H\times M(C+T))$. The network uses 11 Conv+BN+ReLU layers with $(3\times 3)$ convolution kernels; zero padding keeps the intermediate feature maps at size $(W\times H)$. After the 11 Conv+BN+ReLU layers, a $(3\times 3)$ Conv layer is added, producing a feature map of size $(W\times H\times CK^2)$. Finally, a pixel-shuffle [34] operation transforms the feature map into an output of size $(KW \times KH \times C)$, the reconstructed high-resolution object. Note that we use a pixel-shuffle module instead of a fully connected layer or a transposed convolutional layer at the end of the network, because it effectively improves the training efficiency and reconstruction performance of Joinput-CiNet. It also avoids the checkerboard effect that often occurs due to zero padding in transposed convolution [34].
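The architecture described above can be sketched compactly in PyTorch. The paper does not specify the number of feature channels per layer, so the width of 128 below is an assumption; everything else follows the text (11 Conv+BN+ReLU layers, a final $(3\times 3)$ Conv, and a pixel shuffle):

```python
import torch
import torch.nn as nn

class JoinputCiNet(nn.Module):
    """Sketch of Joinput-CiNet: 11 Conv+BN+ReLU layers, a final Conv,
    and a pixel-shuffle upsampler, following the text of Sec. 3.1."""
    def __init__(self, K=4, M=4, C=3, T=1, feat=128):
        super().__init__()
        layers = [nn.Conv2d(M * (C + T), feat, 3, padding=1),
                  nn.BatchNorm2d(feat), nn.ReLU(inplace=True)]
        for _ in range(10):
            layers += [nn.Conv2d(feat, feat, 3, padding=1),
                       nn.BatchNorm2d(feat), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(feat, C * K * K, 3, padding=1),  # (W, H, CK^2)
                   nn.PixelShuffle(K)]                        # -> (KW, KH, C)
        self.body = nn.Sequential(*layers)

    def forward(self, joint_input):          # (N, M(C+T), H, W)
        return self.body(joint_input)        # (N, C, KH, KW)

net = JoinputCiNet()
x = torch.randn(2, 4 * (3 + 1), 48, 48)      # measurement frames + degraded maps
print(net(x).shape)                          # torch.Size([2, 3, 192, 192])
```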

Fig. 3. A system diagram for SCI (upper) and the framework of Joinput-CiNet (lower).

3.2 Loss function for Joinput-CiNet

We consider the maximum a posteriori (MAP) estimation model for the loss function. The sampling process in BCI is similar to the general image degradation model in SR imaging: both include two key factors, a blur kernel and noise, and the goal of both tasks is to restore the original high-dimensional object from low-dimensional data.

As discussed in Section 2, the degradation model from an original object to a low-resolution measurement frame can be written as Eq. (1). The core of image reconstruction is to restore $\boldsymbol {x}$ from $\boldsymbol {y}$, $\boldsymbol {\Phi }$, and $\sigma$. Mathematically, the reconstruction can be obtained through the MAP framework, as shown in Eq. (2):

$$\hat{\boldsymbol{x}} = \arg\min_{\boldsymbol{x}}\frac{1}{2\sigma^2}\|\boldsymbol{\Phi}\boldsymbol{x} - \boldsymbol{y}\|^2 + \lambda \Psi(\boldsymbol{x}),$$
where $\hat{\boldsymbol{x}}$ represents the reconstructed object, $\sigma$ is the noise level, $\lambda$ is the trade-off parameter, and $\Psi(\boldsymbol{x})$ is a regularization term. We can rewrite Eq. (2) as a function of the parameters $\boldsymbol{y}$, $\boldsymbol{\Phi}$, $\sigma$, and $\Theta$:
$$\hat{\boldsymbol{x}} = R(\boldsymbol{y},\boldsymbol{\Phi},\sigma,\Theta),$$
where $\Theta$ represents the parameters of the MAP inference. Based on this function, we design the loss function for Joinput-CiNet as follows:
$$L(\Theta) = \frac{1}{2 i_s}\sum_{i = 1}^{i_s} \|R(\boldsymbol{y}_i,\boldsymbol{\Phi},\sigma,\Theta) - \boldsymbol{x}_i\|^2.$$

Here $i_s$ is the number of samples in a batch, and $\boldsymbol {y_i}$ and $\boldsymbol {x_i}$ represent the $i$th measurement frame and original signal, respectively.
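As a sketch, Eq. (4) is a scaled sum of squared errors over the batch; in PyTorch (names illustrative, assuming the network output plays the role of $R(\boldsymbol{y},\boldsymbol{\Phi},\sigma,\Theta)$):

```python
import torch

def joinput_loss(recon, target):
    """Eq. (4): L(Theta) = 1/(2*i_s) * sum_i ||R(y_i, Phi, sigma, Theta) - x_i||^2,
    where recon is the network output and target is the original object."""
    i_s = recon.shape[0]                       # batch size
    return ((recon - target) ** 2).sum() / (2 * i_s)

recon = torch.randn(8, 3, 192, 192, requires_grad=True)
target = torch.randn(8, 3, 192, 192)
loss = joinput_loss(recon, target)
loss.backward()                                # gradients w.r.t. network output
```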

4. Numerical experimental results using visible (VIS) images

Before discussing the training data, we present some settings of the BCI system. The block size is assumed to be $(4\times 4)$ or $(8\times 8)$. Thus, for an RGB measurement frame of size $(W\times H\times C)$, the original object has size $(4W\times 4H\times C)$ or $(8W\times 8H\times C)$, respectively. The sensing matrices for the following experiments are $\{0,1\}$ Bernoulli random matrices.

4.1 Dataset and network training

We use the DIV2K dataset as the training set [35]. The dataset contains 800 images with sizes around $(2000\times 1500)$. As discussed above, we numerically emulate the measurement frames. For example, if the block size is $(4\times 4)$, the high-resolution images are modulated using a Bernoulli sensing matrix and then downsampled to generate low-resolution measurement frames.

In the training process for Joinput-CiNet, object patches of size $(192\times 192)$ were randomly cut from each original object to make low-resolution measurement frames. For a $(4\times 4)$ block size, the measurement frame, i.e., the feature map at the network input, has size $(48\times 48)$; for an $(8\times 8)$ block size, it has size $(24\times 24)$. Note that although we cut each object into $(192\times 192)$ patches during training, the object size during testing is not restricted.
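A minimal sketch of this emulation for one channel, assuming a $(4\times 4)$ block size and one template; the reshape-based block-wise modulation is our illustrative implementation, not necessarily the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
patch = rng.random((192, 192))                 # one HR training patch (one channel)
template = rng.integers(0, 2, size=(K, K)).astype(float)  # one Bernoulli template

# Modulate every (K x K) block with the template, then sum within each block:
# this is the block-wise version of y = Phi x with M = 1.
H, W = patch.shape
blocks = patch.reshape(H // K, K, W // K, K)   # (48, 4, 48, 4)
measurement = (blocks * template[None, :, None, :]).sum(axis=(1, 3))

print(measurement.shape)                       # (48, 48) low-resolution frame
```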

In the training process, we use Adam optimization [36]. The batch size is set to 128. The learning rate ($lr$) is set to $10^{-4}$ initially; when the training loss becomes stable, $lr$ is changed to $10^{-5}$ until the training process converges. To enlarge the training set and improve the robustness of the network model, we apply data augmentation, using rotation, flipping, cropping, translation, and other operations on each training object. The GPU used for training is an RTX 3090. For a $(4\times 4)$ block size, it takes about 3 days to obtain a well-trained network. We evaluate the reconstruction performance using PSNR (peak signal-to-noise ratio) and SSIM (structural similarity) [37].
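For reference, the two metrics can be computed with scikit-image's standard implementations (the `channel_axis` argument assumes a recent scikit-image version; the arrays here are synthetic placeholders):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
gt = rng.random((192, 192, 3))          # ground-truth object, values in [0, 1]
recon = np.clip(gt + 0.05 * rng.standard_normal(gt.shape), 0, 1)

psnr = peak_signal_noise_ratio(gt, recon, data_range=1.0)
ssim = structural_similarity(gt, recon, data_range=1.0, channel_axis=-1)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```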

We use $1\sim 4$ measurement frames to train the network models. For $M$ measurement frames, $M$ degraded maps of size $(W\times H)$ are also sent into Joinput-CiNet for training and object reconstruction. Figure 4 presents the PSNR and SSIM values versus the training iterations. To simplify notation, we use Joinput-CiNet-$M$ for the model with $M$ measurement frames. As the number of iterations increases, the PSNR and SSIM values improve and then saturate for both Joinput-CiNet-1 and Joinput-CiNet-4. Between the two models, Joinput-CiNet-4 gives higher PSNR/SSIM values and converges faster. As expected, multiple measurement frames increase the amount of information at the network input, thereby helping the reconstruction performance.

Fig. 4. PSNR and SSIM comparison between Joinput-CiNet-1 and Joinput-CiNet-4 with various numbers of iterations during training.

4.2 Testing Joinput-CiNet using visible (VIS) images

We compared Joinput-CiNet with other reconstruction methods, including TVAL-3 (total variation) [29], BM3D-AMP (block matching 3D approximate message passing) [30], and ReconNet [19]. The first two are traditional iterative SCI algorithms; ReconNet is an end-to-end SCI deep network. The same dataset (DIV2K) and measurement matrix (Bernoulli) are used to train ReconNet and Joinput-CiNet. To compare with Joinput-CiNet-1 and Joinput-CiNet-4, we trained ReconNet-1 and ReconNet-4, where ReconNet-$M$ again denotes a model using $M$ measurement frames for reconstruction. We used five datasets for testing: Set5 [38], Set14 [39], BSD100 [40], Urban100 [41], and Manga109 [42]. PSNR and SSIM are used for performance evaluation. In the first set of tests, the block size is assumed to be $(4\times 4)$. With TVAL-3 and BM3D-AMP, objects are reconstructed from one measurement frame; with ReconNet and Joinput-CiNet, objects are reconstructed from $1\sim 4$ measurement frames.

Before comparing the reconstruction methods, we first demonstrate the advantage of using degraded maps. We reconstruct objects using networks with and without degraded maps as inputs, labeled “w/ maps” and “w/o maps” in Fig. 5. As discussed in Section 3, when $T=1$, each degraded map encodes the weight carried by a modulation template, which is directly related to the amount of light collected by the detector array. Since Joinput-CiNet-1 uses only one measurement frame for reconstruction, we expect the reconstructions with and without degraded maps to differ little. However, when Joinput-CiNet-4 takes degraded maps as inputs, compared with the case without them, we obtain PSNR(dB)/SSIM improvements of 1.25/0.0167, 0.94/0.0328, 0.78/0.0284, 1.08/0.0433, and 1.81/0.0262 on the datasets Set5, Set14, BSD100, Urban100, and Manga109, respectively. Figure 5 shows the results of Joinput-CiNet-1 and Joinput-CiNet-4 with and without degraded maps on the object “zebra”. Clearly, in the case of 4 measurement frames, the degraded-map input helps the network obtain a better reconstruction.

Fig. 5. Reconstructions using Joinput-CiNet with or without degraded-map input. The values in the parentheses are PSNR(dB)/SSIM. The red values indicate the best reconstruction.

In Table 1, we summarize the reconstruction PSNR and SSIM values using TVAL-3, BM3D-AMP, ReconNet, and Joinput-CiNet. From Table 1, with single-frame input, Joinput-CiNet-1 achieves the best PSNR/SSIM. Compared to ReconNet-1, Joinput-CiNet-1 gives PSNR(dB)/SSIM improvements of 3.76/0.0609, 2.76/0.0743, 2.53/0.0659, 2.21/0.1044, and 3.24/0.1056 on Set5, Set14, BSD100, Urban100, and Manga109, respectively. We also observe that as the number of measurement frames increases, the PSNR/SSIM values of Joinput-CiNet and ReconNet both increase. However, with 4 measurement frames, Joinput-CiNet-4 leads ReconNet-4 by a large margin in PSNR/SSIM, demonstrating the advantage of using degraded maps as part of the network input.

Table 1. Average PSNR and SSIM on datasets Set5, Set14, BSD100 and Urban100. TVAL-3, BM3D-AMP, ReconNet-1$\&$4 and Joinput-CiNet-1$\sim$4 are used for reconstruction. The object block size is $(4\times 4)$. The best and the second best results are highlighted in red and blue, respectively.

Figure 6 presents the reconstructions of the objects “barbara”, “monarch”, and “ppt3” in Set14 using different reconstruction methods. The reconstructions obtained using TVAL-3 and BM3D-AMP have lower resolution, and the texture details are poorly restored. For ReconNet-1 and ReconNet-4, we add a BM3D module after the network to remove block artifacts in the outputs. However, due to the limited reconstruction quality, the blocking effect cannot be removed completely from ReconNet-1. ReconNet-4, using multiple measurement frames, gives better reconstructions, but the quality is still worse than that of Joinput-CiNet. Once again, as the number of measurement frames increases, the reconstructions of the Joinput-CiNet models improve significantly; the reconstructions of Joinput-CiNet-4 are very close to the original objects.

Fig. 6. Reconstructions using objects in dataset Set14. The object block size is $(4\times 4)$. PSNR(dB)/SSIM values of the reconstructions are listed in parentheses under each enlarged reconstruction detail. The best and second-best results are highlighted in red and blue, respectively.

In the next set of tests, the block size is set to $(8\times 8)$. When one measurement frame is used, the SCI compression ratio is 1:64, at which TVAL-3 and BM3D-AMP do not work properly. Thus, in these tests, we only present the reconstructions using ReconNet-1$\&$4 and Joinput-CiNet-1$\&$4. Figure 7 shows the reconstruction results for the objects “img_063”, “img_083”, “img_062”, “img_023”, “img_044”, and “img_093” in the Urban100 dataset [41]. The object resolution is around $(1000\times 1000)$. ReconNet-1 has limited reconstruction quality at the 1:64 compression ratio, while Joinput-CiNet-1 still works. When the number of measurement frames is increased to 4, the ReconNet-4 reconstructions improve considerably but remain quite blurred; in contrast, the Joinput-CiNet-4 reconstructions have much better resolution. This again demonstrates the advantage of Joinput-CiNet at a relatively high compression ratio such as 1:64.

Fig. 7. Reconstructions using objects in dataset Urban100. The object block size is $(8\times 8)$. In each enlarged reconstruction detail, R$i$ and J$i$ ($i$=1,4) represent ReconNet-$i$ and Joinput-CiNet-$i$, respectively.

5. Experimental results using an infrared system

5.1 Reconstruction results of Joinput-CiNet on four-bar objects

In this section, we test the reconstruction performance of the Joinput-CiNet models on a four-bar target, which is commonly used in infrared imaging to test the resolution of a system. Figure 8 presents the mid-wave IR system. We use a blackbody working at around $95\,^{\circ}\mathrm{C}$ as the light source. The four-bar target is placed in the focal plane of a collimator and illuminated by the light source. The DMD is a DLP9500, modified to work in the MIR band. The detector array has a pixel size of $15\,\mu\mathrm{m}$. More specific parameters of the experimental system can be found in our previous publication [43].

Fig. 8. The Mid-wave Infrared system.

Although we have an experimental system, we are unable to acquire high-resolution images with it for network training. Thus, we simulated the degradation process of an SCI system and constructed a training dataset containing 500 high-resolution and low-resolution image pairs. Each image has six four-bar targets of different sizes rotated by different angles. On the four-bar target dataset at a compression ratio of 1:16, the reconstruction results of TVAL-3 and BM3D-AMP are similar to the low-resolution raw measurements, so we only present the results obtained using Joinput-CiNet and ReconNet. We train Joinput-CiNet and ReconNet models with $1\, \&\, 4$ measurement frames on the same dataset, and first test the models numerically using self-made four-bar targets. The block size is $(4\times 4)$. The results are shown in Fig. 9.

Fig. 9. Reconstructions using Joinput-CiNet and ReconNet with simulated four-bar targets. HR: the original objects, LR: the low-resolution measurement frames. The object block size is $(4\times 4)$.

Clearly, in the low-resolution measurements shown in Fig. 9(a2) and (b2), the four lines in a target cannot be distinguished. Figures 9(b3) and (b5) show that with a single-frame input, neither Joinput-CiNet nor ReconNet can reconstruct the vertically positioned four-bar target well. As the number of measurement frames increases, the reconstructions improve; with 4 frames, the 4 lines in both targets can be observed clearly. Additionally, Joinput-CiNet shows better reconstructions than ReconNet, which is also reflected in the SSIM values. Between the two targets, the one rotated by an angle is reconstructed more easily.

Figure 10 presents the reconstructions obtained from Joinput-CiNet and ReconNet using the mid-wave infrared BCI system. Figure 10(A) shows one measurement frame from the mid-wave infrared system (Fig. 8) and its enlarged details. Figure 10(B) shows the simulated HR image, which is used to calculate the SSIM of the reconstructions in Figs. 10(C)-(H) to quantify the reconstruction quality. The lines in the medium-size targets labeled by green squares cannot be distinguished in the measurement. Even with only one measurement frame, the lines can be observed in the reconstructions from Joinput-CiNet-1 and ReconNet. As the number of measurement frames increases, the reconstruction quality improves significantly. Once again, compared to ReconNet, Joinput-CiNet presents better results with single-frame and multi-frame inputs in terms of both visual quality and SSIM values.

Fig. 10. Reconstructions using Joinput-CiNet and ReconNet with a mid-wave infrared four-bar target. The size of the collected infrared image is $(320\times 256)$; the picture in (A) is shown after cropping, with size $(50\times 46)$. The corresponding HR images (B)-(H) have size $(200\times 184)$.

5.2 Comparison of the running time of different algorithms

In the last experiment, we test the running time of TVAL-3, BM3D-AMP, ReconNet, ReconNet-BM3D, and Joinput-CiNet. Infrared objects of size $(320\times 256)$ are used, with a $(4\times 4)$ block size. The running time is averaged over 100 objects. The results are presented in Fig. 11. Joinput-CiNet has approximately 1.54 million parameters and approximately 7103 million FLOPs (floating-point operations). With different numbers of input frames, the parameter count and FLOPs fluctuate only slightly, which makes real-time imaging possible.
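As a rough consistency check, counting the parameters of the Section 3.1 PyTorch sketch (which assumed a feature width of 128) gives about 1.55 million, close to the reported 1.54 million; the snippet below assumes the `JoinputCiNet` class defined there is in scope:

```python
# Assumes the JoinputCiNet sketch from Sec. 3.1 (feat=128, M=4, C=3, T=1, K=4).
net = JoinputCiNet()
n_params = sum(p.numel() for p in net.parameters())
print(f"{n_params / 1e6:.2f} M parameters")   # ~1.55 M under these assumptions
```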

Fig. 11. SSIM vs. running time for different algorithms. Note that smaller running time values are on the right side of the x axis.

From the figure, the traditional iterative reconstruction algorithms TVAL-3 and BM3D-AMP are time-consuming, taking roughly more than 10 s, which makes them difficult to use in real-time imaging applications. ReconNet has the smallest running time due to its small network depth. However, ReconNet often needs a BM3D module to improve reconstruction quality because of the blocking effect in its outputs; this module consumes much time, making the network unattractive for real-time applications. On the other hand, the running time of the Joinput-CiNet models is between 0.01 s and 0.001 s, fast enough for most applications, and the models achieve high SSIM values, indicating good reconstruction quality. Overall, the Joinput-CiNet model has obvious advantages for real-time high-quality reconstruction.

6. Conclusions

In summary, we have designed a neural network named Joinput-CiNet for spatial compressive sensing reconstruction. Different from existing SCI neural networks, we use a tailored encoding module, DM-Gen, to convert the sensing matrix, which characterizes the image degradation process of SCI, into degraded maps, and then send the maps together with the low-resolution measurement frames as the joint input of Joinput-CiNet. In this way, Joinput-CiNet achieves high-quality reconstruction at fast speed, providing a possible solution for real-time SCI applications. However, Joinput-CiNet is still not a perfect network for SCI reconstruction. The usage of sensing vectors or matrices as part of a network's input can be explored further, for example, by relaxing the prior knowledge of the sensing vectors and assuming only some of their probability parameters in the network design. Like other reconstruction methods, Joinput-CiNet may erase fine details in reconstructions due to its strong denoising behavior. To reduce SCI imaging time or network training time, fewer measurement frames are generally preferred, but this sacrifices reconstruction performance. We will explore solutions to these issues in future work.

Funding

National Natural Science Foundation of China (61675023).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]  

2. E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory 52(2), 489–509 (2006). [CrossRef]  

3. T. Gerrits, D. J. Lum, V. Verma, J. Howell, R. P. Mirin, and S. W. Nam, “Short-wave infrared compressive imaging of single photons,” Opt. Express 26(12), 15519–15527 (2018). [CrossRef]  

4. Z. Tian and G. B. Giannakis, “Compressed sensing for wideband cognitive radios,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, vol. 4 (IEEE, 2007), pp. IV–1357.

5. D. Lv, S. Zhu, and R. Liu, “Research on big data security storage based on compressed sensing,” IEEE Access 7, 3810–3825 (2019). [CrossRef]  

6. A. Orsdemir, H. O. Altun, G. Sharma, and M. F. Bocko, “On the security and robustness of encryption via compressed sensing,” in MILCOM 2008-2008 IEEE Military Communications Conference, (IEEE, 2008), pp. 1–7.

7. D. L. Donoho, Y. Tsaig, I. Drori, and J.-L. Starck, “Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit,” IEEE Trans. Inf. Theory 58(2), 1094–1121 (2012). [CrossRef]  

8. D. Needell and R. Vershynin, “Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit,” Found. Comput. Math. 9(3), 317–334 (2009). [CrossRef]  

9. T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,” Appl. Comput. Harmon. Anal. 27(3), 265–274 (2009). [CrossRef]  

10. I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Comm. Pure Appl. Math. 57(11), 1413–1457 (2004). [CrossRef]  

11. D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Natl. Acad. Sci. 106(45), 18914–18919 (2009). [CrossRef]  

12. S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Rev. 43(1), 129–159 (2001). [CrossRef]  

13. E. Candes and T. Tao, “The Dantzig selector: Statistical estimation when p is much larger than n,” Ann. Statist. 35(6), 2313–2351 (2007). [CrossRef]

14. S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Trans. Signal Process. 56(6), 2346–2356 (2008). [CrossRef]  

15. D. P. Wipf and B. D. Rao, “Sparse Bayesian learning for basis selection,” IEEE Trans. Signal Process. 52(8), 2153–2164 (2004). [CrossRef]

16. G. Cormode, “Sketch techniques for approximate query processing,” Foundations and Trends in Databases (2011).

17. L. Bo, H. Lu, Y. Lu, J. Meng, and W. Wang, “Fompnet: Compressive sensing reconstruction with deep learning over wireless fading channels,” in 2017 9th International Conference on Wireless Communications and Signal Processing (WCSP), (IEEE, 2017), pp. 1–6.

18. A. Mousavi, A. B. Patel, and R. G. Baraniuk, “A deep learning approach to structured signal recovery,” in 2015 53rd annual allerton conference on communication, control, and computing (Allerton), (IEEE, 2015), pp. 1336–1343.

19. K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “Reconnet: Non-iterative reconstruction of images from compressively sensed measurements,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 449–458.

20. K. Xu, Z. Zhang, and F. Ren, “Lapran: A scalable laplacian pyramid reconstructive adversarial network for flexible compressive sensing reconstruction,” in Proceedings of the European Conference on Computer Vision (ECCV), (2018), pp. 485–500.

21. Y. Yang, J. Sun, H. Li, and Z. Xu, “Admm-net: A deep learning approach for compressive sensing MRI,” arXiv preprint arXiv:1705.06869 (2017).

22. J. Zhang and B. Ghanem, “Ista-net: Interpretable optimization-inspired deep network for image compressive sensing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 1828–1837.

23. Q. Zhou, J. Ke, and E. Y. Lam, “Near-infrared temporal compressive imaging for video,” Opt. Lett. 44(7), 1702–1705 (2019). [CrossRef]  

24. J. Ke, L. Zhang, Q. Zhou, and E. Y. Lam, “Broad dual-band temporal compressive imaging with optical calibration,” Opt. Express 29(4), 5710–5729 (2021). [CrossRef]  

25. M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, “Single-pixel imaging via compressive sampling,” IEEE Signal Process. Mag. 25(2), 83–91 (2008). [CrossRef]  

26. O. Katz, Y. Bromberg, and Y. Silberberg, “Compressive ghost imaging,” Appl. Phys. Lett. 95(13), 131110 (2009). [CrossRef]  

27. M. T. Nguyen, K. A. Teague, and N. Rahnavard, “Ccs: Energy-efficient data collection in clustered wireless sensor networks utilizing block-wise compressive sensing,” Comput. Netw. 106, 171–185 (2016). [CrossRef]  

28. E. J. Candes, “The restricted isometry property and its implications for compressed sensing,” Comptes. Rendus Math. 346(9-10), 589–592 (2008). [CrossRef]  

29. C. Li, W. Yin, and Y. Zhang, “User’s guide for tval3: Tv minimization by augmented lagrangian and alternating direction algorithms,” CAAM report 20, 4 (2009).

30. C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,” IEEE Trans. Inf. Theory 62(9), 5117–5144 (2016). [CrossRef]  

31. W. Dong, G. Shi, X. Li, Y. Ma, and F. Huang, “Compressive sensing via nonlocal low-rank regularization,” IEEE Trans. on Image Process. 23(8), 3618–3632 (2014). [CrossRef]  

32. C. A. Metzler, A. Mousavi, and R. G. Baraniuk, “Learned d-amp: Principled neural network based compressive image recovery,” arXiv preprint arXiv:1704.06625 (2017).

33. K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 3262–3271.

34. W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 1874–1883.

35. E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, (2017), pp. 126–135.

36. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

37. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]  

38. M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in British Machine Vision Conference (BMVC), (2012).

39. J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. on Image Process. 19(11), 2861–2873 (2010). [CrossRef]  

40. D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2 (IEEE, 2001), pp. 416–423.

41. J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2015), pp. 5197–5206.

42. Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using manga109 dataset,” Multimed. Tools Appl. 76(20), 21811–21838 (2017). [CrossRef]  

43. L. Zhang, J. Ke, S. Chi, X. Hao, T. Yang, and D. Cheng, “High-resolution fast mid-wave infrared compressive imaging,” Opt. Lett. 46(10), 2469–2472 (2021). [CrossRef]  
