
Lensless cameras using a mask based on almost perfect sequence through deep learning

Open Access

Abstract

Mask-based lensless imaging cameras have many applications due to their small volume and low cost. However, because the inverse problem is ill-conditioned, the reconstructed images typically have low resolution and poor quality. In this article, we use a mask based on an almost perfect sequence, which has excellent autocorrelation properties, for lensless imaging, and we propose a Learned Analytic solution Net for image reconstruction under the framework of unrolled optimization. Our network combines a physical imaging model with deep learning to achieve high-quality image reconstruction. The experimental results indicate that our reconstructed images at a resolution of 512 × 512 perform well in both visual quality and objective evaluations.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Traditional refractive lenses have dominated the field of imaging for many years. The images they produce are of good quality, but the lenses always occupy a large volume, which limits the applications of cameras. In the past few years, computational imaging has developed rapidly, and several small lensless imaging cameras have been invented. Asif et al. proposed the FlatCam lensless imaging system [1–3]. They placed a mask in front of the sensor and then restored the image from the captured coded pattern. Adams et al. [4] extended FlatCam to the field of microscopy. Spatial light modulators have been used to realize a similar function [5]. Instead of amplitude masks, Tajima et al. [6] used a multiphased Fresnel zone aperture for imaging. DiffuserCam used a phase diffuser to achieve 3D lensless imaging [7,8]. Similar to FlatCam, only a single element is used in DiffuserCam. The direct outputs of these cameras are patterns that resemble speckles, and an image restoration algorithm is necessary to reconstruct the image from the pattern. Kuo et al. [9] extended DiffuserCam to microscopic imaging. Kim et al. built an interesting lens-free camera by placing sensors on the side of transparent glass [10]. Compressed sensing is an important branch of computational imaging. Yuan et al. built a parallel lensless compressive imaging system [34] that realized real-time imaging. Another type is the lens-free camera based on phase gratings. Gill et al. invented an ultraminiature computational imager by employing special optical phase gratings integrated with CMOS photodetector matrixes [11,12]. To take advantage of the angle sensitivity, they devised a Planar Fourier Capture Array that directly captures the 2D Fourier transforms of scenes [13]. Compared with traditional cameras, these lensless cameras have the advantages of miniaturization, ease of manufacture and low cost, which enables them to be applied to the Internet of Things, surveillance, drones, UAVs and other mobile platforms [14]. Lensless cameras can also be applied to special scenes such as medical imaging [15] and AR glasses [16]. All in all, they have a bright future.

Inspired by FlatCam, we propose to use a mask based on an almost perfect sequence for lensless imaging (Fig. 1). A mask is transparent glass covered by a special pattern. In the original FlatCam, the mask pattern is generated by a maximum length sequence (MLS). In this paper, we call this a mask based on MLS and abbreviate it to MLS mask. Our proposed mask’s pattern is generated from an almost perfect sequence [17]. In the same way, we abbreviate our proposed mask based on an almost perfect sequence to almost perfect mask (AP mask). An almost perfect sequence (AP sequence) is a sequence with good autocorrelation properties [18], which gives the system transfer matrixes small condition numbers. In other words, the ill-conditioning of the system transfer matrixes is alleviated. By using the AP mask for imaging, we enrich the mask types of FlatCam lensless cameras. Our experiments show that our AP mask-based lensless camera achieves the best results reported so far in the field of FlatCam lensless imaging.


Fig. 1. Overview of our imaging pipeline. (a) The lensless display-capture device consists of our prototype and a Pad. The images displayed on the Pad screen are captured with the lensless camera to form training pairs and the images are used as ground truth labels. (b) The imaging model of FlatCam. The system transfer matrixes are obtained through calibration. (c) Our Learned Analytic solution Net that integrates the imaging models. The inputs of the network are the system transfer matrixes and measurements. After several iterations, the network outputs the reconstructed images.


A common feature of lensless cameras is that they do not directly implement a point-to-point mapping from an object to an image as traditional cameras do. Instead, they encode the object information into sensor measurements, which require reconstruction to recover the original image. In addition, image reconstruction essentially amounts to solving an ill-conditioned inverse problem. Classical image restoration algorithms are derived from optimization principles: they build a fidelity term and a regularization term and obtain a low-loss output through many iterations [19–21]. Although the regularization term encodes a statistical prior, the reconstructed image often has poor quality due to artifacts. This is easy to understand because hand-picked priors are crude: they are derived from statistics over a large number of images and may not match a particular image exactly. In addition, classical algorithms are time-consuming because they require hundreds of iterations.

Fortunately, deep learning has developed rapidly in recent years, and image reconstruction algorithms based on convolutional neural networks (CNNs) have made great progress. In the field of computational imaging, deep learning is a powerful tool. For example, Qiao et al. realized video snapshot compressive imaging through deep learning [35]. Yuan et al. proposed an efficient inversion algorithm based on a deep convolutional neural network to realize real-time image reconstruction [34]. Image reconstruction algorithms based on deep learning can be divided into two categories: noniterative methods, which are pure end-to-end convolutional neural networks [22–24]; and unrolled optimization methods, which are combined with a classical optimization algorithm [25–27]. Kulkarni et al. proposed a noniterative network for reconstructing compressively sensed images, named ReconNet [28]. This kind of noniterative network usually has a light architecture, which can restore an image quickly and directly. However, its disadvantages are also obvious: it cannot make full use of prior knowledge, such as the sensing matrix in compressed sensing or the PSF in DiffuserCam. An unrolled network combines traditional optimization with a neural network, so the advantages of both can be exploited. Unrolled optimization solves the inverse problem by incorporating the system model into the network. In this structure, the main function of the network is to learn a prior, rather than the whole inverse operation. Thus, the unrolled structure places lower demands on the network and is easier to realize. Zhang et al. [29] proposed ISTA-Net, which is derived from the iterative procedure of a soft-threshold shrinkage algorithm. Le-ADMM Net was used for DiffuserCam imaging [30]. The unrolled optimization structure can be combined with the imaging model to make full use of prior knowledge, and it transfers the burden of image restoration to a denoising network, for which many successful models exist at present.

In this paper, we propose the Learned Analytic solution Net (LAs Net) for the image restoration of FlatCam under the framework of unrolled optimization. The Learned Analytic solution Net has $k$ layers, and each layer consists of two parts: an analytic solution updating block and a CNN optimization block (Fig. 1). The former solves the inverse problem in the form of an analytic solution, while the latter further improves the image quality by refining the current analytic estimate with a convolutional neural network, as explained in detail in Sec. 3. Furthermore, the network can easily be applied to compressed sensing image reconstruction or other imaging models. To the best of our knowledge, this is the first time that a CNN with an analytic solution has been applied to the image reconstruction of FlatCam. Similar to other unrolled optimization networks, it incorporates the imaging model into the neural network, which strongly ensures that high-quality images can be reconstructed.

To train our Learned Analytic solution Net, we set up a display-capture device to obtain the training dataset (see Fig. 1). After training, we imaged both pictures shown on the display screen and natural scenes. Our work shows that the combination of lensless imaging models and deep learning can be very useful for image reconstruction. Combined with our Learned Analytic solution Net, our AP mask-based lensless camera achieves high-quality image reconstructions at a resolution of $512 \times 512$ that perform well in both visual quality and objective evaluations.

In summary, our contributions include the following:

  • 1. A novel and effective AP mask for FlatCam lensless imaging,
  • 2. A more convenient FlatCam calibration method that reduces the time by half without any loss,
  • 3. An image restoration network, LAs Net, for FlatCam that realizes high-quality image reconstruction at a resolution of $512 \times 512$.

2. FlatCam with an almost perfect mask

In this paper, we propose to use the AP mask instead of the MLS mask for FlatCam lensless imaging. The imaging principle and AP mask are described in detail below.

2.1 FlatCam model

FlatCam is an ultrathin lensless imaging system proposed by Asif [1]. From a coding perspective, they placed a mask at a submillimeter distance in front of the sensor. As a result, each pixel encodes multiple scene points. In addition, the mask is transparent glass covered with a special pattern. To reduce the computational complexity, they used a rank 1 mask, which means that the pattern is the outer product of two one-dimensional sequences. This separation of the row and column design allows the imaging system model to be written as follows:

$$Y = {\Phi _L}X{\Phi _R}^T + N$$
Here, $Y$ is the encoded measurement, ${\Phi _L}$ and ${\Phi _R}$ are the left and right system transfer matrixes, X represents the scene radiance, and N represents the model error and noise. To recover scene X from Y, the authors obtain the system transfer matrixes by calibration [3], and then impose Tikhonov or total-variation constraints to solve Eq. (1).
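
The separable forward model of Eq. (1) can be simulated in a few lines of NumPy; the random matrices below are placeholders for the calibrated transfer matrixes and the sizes are toy values rather than those of our prototype.

```python
# Minimal simulation of the separable FlatCam forward model, Eq. (1).
# Random matrices stand in for the calibrated Phi_L and Phi_R (Sec. 3.1),
# and the sizes are toy values rather than the paper's.
import numpy as np

rng = np.random.default_rng(0)
n_scene, n_sensor = 64, 128
Phi_L = rng.standard_normal((n_sensor, n_scene))
Phi_R = rng.standard_normal((n_sensor, n_scene))
X = rng.random((n_scene, n_scene))                    # scene radiance
N = 0.01 * rng.standard_normal((n_sensor, n_sensor))  # model error and noise

Y = Phi_L @ X @ Phi_R.T + N                           # coded measurement
print(Y.shape)                                        # (128, 128)
```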

In this paper, we use the almost perfect mask for imaging instead of the MLS mask [1]. The forward imaging model is unchanged. Thus, we can reconstruct the scene by solving a regularized optimization problem of the following form:

$$\hat{X} = \mathop {\arg \min }\limits_X \frac{1}{2}{||{Y - {\Phi _L}X{\Phi _R}^T} ||^2} + \lambda \cdot r(X)$$
where $r(X)$ is the regularization term and $\lambda$ is a tuning parameter that balances fidelity and regularization.

2.2 Almost perfect mask

Instead of using the MLS to generate the mask pattern, we use an almost perfect sequence; we call the resulting mask the almost perfect mask. The almost perfect sequence is a special kind of pseudorandom sequence with excellent autocorrelation properties: all out-of-phase autocorrelation coefficients are zero except one. This property gives it many applications. For example, in phase-modulated radar detection, it can achieve a zero-sidelobe detection effect similar to that of a perfect sequence [17].

Let $({{s_0},{s_1},{s_2}, \cdots ,{s_{n - 1}}} )$ be an almost perfect sequence of period n. Since it is a periodic sequence, ${s_{i + n}} = {s_i}$ for every index $i$. Its autocorrelation function is expressed as follows:

$$R(\tau )= \sum\limits_{i = 0}^{n - 1} {{s_i}{s_{i + \tau }}} = \left\{ \begin{array}{l} n,\;\;\;\;\;\;\;\;\tau = 0({\bmod n} )\\ 4 - n,\;\;\;\tau = {n / 2}\\ 0,\;\;\;\;\;\;\;\;else \end{array} \right.$$
Here, $\tau$ is the number of steps by which the sequence is circularly shifted, and $R(\tau )$ is the corresponding autocorrelation coefficient. $R(0 )$ is the in-phase autocorrelation value and $R(\tau )({\tau \ne 0(\bmod n)} )$ are the out-of-phase autocorrelation coefficients. From Eq. (3), we can see that the autocorrelation function of the almost perfect sequence is almost perfect: the out-of-phase autocorrelation coefficients are nonzero only at the half-sequence-length shift. This is an advantage over the MLS, whose out-of-phase autocorrelation coefficients are a nonzero constant over the entire length.
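
As a quick check of Eq. (3), the circular autocorrelation of a $\pm 1$ sequence can be computed directly. The sequence in the sketch below is only a random placeholder so that the snippet runs; substituting a true almost perfect sequence should yield $R(0)=n$, $R(n/2)=4-n$ and zero elsewhere.

```python
# Circular autocorrelation of a +/-1 sequence, following Eq. (3).
# `s` is a random placeholder so the snippet runs; a true almost perfect
# sequence should give R(0) = n, R(n/2) = 4 - n, and 0 for all other shifts.
import numpy as np

def circular_autocorrelation(s):
    s = np.asarray(s, dtype=float)
    n = len(s)
    return np.array([np.dot(s, np.roll(s, -tau)) for tau in range(n)])

rng = np.random.default_rng(0)
s = rng.choice([-1.0, 1.0], size=16)   # placeholder, not an AP sequence
print(circular_autocorrelation(s))
```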

Unlike the MLS, almost perfect sequences are found by computer search according to their characteristic formulas. Cyclic difference sets are among the best tools for studying binary sequences. Chen et al. [17] deduced the characteristic formula of the almost perfect sequence from the cyclic difference set in the following form:

$${s_{\tau - 1}} + 2\sum\limits_{i = \tau }^{n/2 - \tau - 2} {{s_i}} + {s_{n/2 - \tau - 1}} + 2\sum\limits_{i = 0}^{\tau - 2} {{s_i}{s_{i + n/2 - \tau }}} - 2\sum\limits_{i = 0}^{n/2 - \tau - 2} {{s_i}{s_{i + \tau }}} = n/2 - \tau ,\;\;\;\;\tau = 1,2, \cdots ,n/4 - 1$$
For a canonical expanded sequence, except that ${s_{n/2}}$ and ${s_n}$ equal $-1$, the first half of the sequence is complementary to the second half:
$${s_i} + {s_{{n / 2} + i}} = 0\;\;\;i = 0,1,2, \cdots ,{n / 2} - 1$$
We implement the above formulas to obtain an almost perfect sequence of length 516. We take the outer product of the sequence with itself to obtain the pattern of the almost perfect mask proposed in this paper. The mask’s feature size is $\Delta = 30\,\mu\textrm{m}$.
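
The construction of the rank-1 mask pattern can be sketched as follows; the sequence below is a random 0/1 placeholder standing in for the length-516 almost perfect sequence obtained from Eqs. (4) and (5).

```python
# Sketch of building the rank-1 mask pattern as an outer product (Sec. 2).
# `ap_seq` is a random 0/1 placeholder; in practice it is the length-516
# almost perfect sequence found from Eqs. (4) and (5).
import numpy as np

rng = np.random.default_rng(0)
ap_seq = rng.integers(0, 2, size=516)      # placeholder binary sequence
pattern = np.outer(ap_seq, ap_seq)         # 516 x 516 open/opaque mask cells
feature_size_um = 30                       # physical size of one mask cell
print(pattern.shape)                       # (516, 516)
```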

In FlatCam, ${\Phi _L}$ and ${\Phi _R}$ are the composites of the separable pixel response and the encoding [4]. The response is fixed, so we can only change the encoding to mitigate the ill-conditioning of the system matrixes or, in other words, to reduce their condition numbers. Through experiments, we find that each column in ${\Phi _L}$ and ${\Phi _R}$ is approximately a shifted copy of the others, which resembles a Toeplitz matrix to some extent. To maximize the differences between columns, we made the mask using the almost perfect sequence because of its excellent autocorrelation properties. The normalized autocorrelation functions of the MLS and AP sequences are shown in Fig. 2(a).


Fig. 2. Comparison of the MLS mask and AP mask. (a) Autocorrelation coefficients of the MLS and AP sequences. All values are normalized. (b) Singular value spectrums of the system transfer matrixes of the MLS mask and AP mask.


As seen from Fig. 2(a), the autocorrelation of the MLS sequence has a fixed sidelobe level over the whole length, while the autocorrelation of the AP sequence is nonzero only at the half-length shift. The almost perfect autocorrelation of the AP sequence ensures smaller condition numbers for the system transfer matrixes. For example, in our experiment, the condition number of ${\Phi _L}$ for the MLS mask is 853, while the condition number of ${\Phi _L}$ for the AP mask is 837.

The singular value spectrum was used to evaluate the degree of ill-conditioning of the system transfer matrixes [1]. For this intuitive comparison, we simulated the system transfer matrixes for a 2-D scene at $64 \times 64$ resolution using the MLS mask and the AP mask. The MLS mask comes from an MLS of length 511, and the AP mask comes from an almost perfect sequence of length 516; the other parameters are the same. The singular value spectra of the system transfer matrixes of the MLS mask and AP mask are shown in Fig. 2(b), which shows that the AP mask has a flatter singular value spectrum. In other words, its condition number is lower, which ensures a more stable recovery of the scene image X from the sensor measurement Y.
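
This comparison can be reproduced with a few lines of NumPy; the matrices below are random stand-ins for the simulated or calibrated transfer matrixes of the two masks.

```python
# Sketch of comparing singular value spectra / condition numbers of two
# transfer matrixes, as in Fig. 2(b). Random matrices stand in for the
# simulated (or calibrated) Phi_L of the MLS mask and of the AP mask.
import numpy as np

rng = np.random.default_rng(0)
Phi_L_mls = rng.standard_normal((256, 64))   # stand-in for the MLS-mask matrix
Phi_L_ap  = rng.standard_normal((256, 64))   # stand-in for the AP-mask matrix

for name, Phi in [("MLS", Phi_L_mls), ("AP", Phi_L_ap)]:
    sv = np.linalg.svd(Phi, compute_uv=False)    # singular value spectrum
    print(name, "condition number:", sv[0] / sv[-1])
```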

3. Image reconstruction

FlatCam reconstruction is challenging because the transfer matrixes are ill-conditioned. Traditional methods generally specify a particular $r(X )$ and then solve Eq. (2). Prior knowledge is very important in this case, and traditional hand-picked priors often cannot adapt to all situations. We apply a neural network to this problem, which learns a prior from a large amount of data. In this section, to reconstruct high-quality FlatCam images, LAs Net is proposed to restore the images acquired with the AP mask by combining the network with the imaging model of FlatCam. Here is how it works.

We reformulate Eq. (2) in the following form using the Half Quadratic Splitting (HQS) method:

$$\hat{X} = \mathop {\arg \min }\limits_X {||{Y - {\Phi _L}X{\Phi _R}^T} ||^2} + \mu {||{Z - X} ||^2}$$
$$\hat{Z} = \mathop {\arg \min }\limits_Z \;\frac{\mu }{2}{||{Z - X} ||^2} + \lambda \cdot r(Z )$$
Here, $\lambda$ and $\mu$ are scalar penalty parameters, and Z is the introduced auxiliary variable, or the shadow variable of X. With the HQS method, the original problem is split into two subproblems, and the fidelity term is separated from the regularization term. Eq. (6) is updated with an analytic solution, and Eq. (7) is solved with a CNN. After k iterations of these two steps, we obtain the final result.
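
The resulting alternation can be summarized by the following skeleton, in which the three update functions are placeholders for the Tikhonov initialization of Eq. (8), the analytic update of Eq. (11), and the CNN block.

```python
# Skeleton of the HQS alternation unrolled by LAs Net. The three callables
# are placeholders: init_estimate implements Eq. (8), x_update implements
# Eq. (11), and z_update stands for the learned CNN prior of Eq. (7).
def unrolled_hqs(Y, init_estimate, x_update, z_update, num_layers=4):
    Z = init_estimate(Y)          # Tikhonov initial estimate, Eq. (8)
    for _ in range(num_layers):
        X = x_update(Y, Z)        # analytic fidelity update, Eq. (11)
        Z = z_update(X)           # CNN / denoising step for Eq. (7)
    return Z                      # final reconstruction Z^k
```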

Before the reconstruction, we first obtain the system transfer matrixes through calibration, and then send them into LAs Net as prior knowledge.

3.1 Camera calibration

Since the imaging model is row- and column-separable, the transfer matrixes can be obtained by sweeping a line of light horizontally and vertically [31]. In FlatCam, Hadamard patterns were projected onto the screen. In detail, each row and column of the Hadamard matrix is taken out and stretched into a two-dimensional pattern. Two opposite pictures (see Fig. 3) must be projected for each column or row because the Hadamard matrix consists of ${\pm} 1$ entries [1]. The final measurement is obtained by subtracting the two corresponding sensor images. Although the two pictures are opposite, the two sensor images are very similar in the actual shooting, because each picture can be viewed as a small displacement of its opposite. This similarity makes the subsequent subtraction error-prone, which makes the calibration difficult to carry out.


Fig. 3. The opposite stripe pictures used in camera calibration. ${h_k}$ is the kth column of the Hadamard matrix. ${1^T}$ is a row vector that is all ones. The red line is the border of the picture. (a) The image of ${h_k} \cdot {1^T}$ while setting the negative entries to zero. (b) The image of $- {h_k} \cdot {1^T}$ while setting the negative entries to zero.


In our view, each pattern is a combination of stripes. Therefore, we can also take a full-rank matrix consisting only of 0's and 1's as the basis and stretch each row and each column of it into a two-dimensional pattern. Fortunately, by replacing every -1 in the Hadamard matrix with 0, we obtain such a matrix. In this way, we only need to project one picture for each column or row, and each picture is still a multi-stripe combination, which preserves the SNR. All the other calibration steps (see Fig. 4) remain the same [1]. With this small change, the calibration time is reduced by half without any loss. Furthermore, by transforming from bipolar to unipolar patterns, the calibration process and subsequent calculations are easier to complete. In our experiment, we displayed 512 horizontal stripe patterns and 512 vertical stripe patterns; thus, the final resolution of the reconstructed image is 512 × 512. For a fixed field of view, the ill-conditioning of the system transfer matrix at a resolution of 512 × 512 is much higher than that at 256 × 256, so images reconstructed at 256 × 256 are often significantly better than those reconstructed at 512 × 512. In this paper, we reconstruct images at a resolution of 512 × 512, and their quality is good.
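
The construction of the unipolar calibration patterns can be sketched as follows, assuming SciPy's Hadamard generator; each displayed picture is the outer product of one 0/1 column with a row of ones, as in Fig. 3.

```python
# Sketch of the modified calibration patterns (Sec. 3.1): replace -1 with 0
# in a Hadamard matrix so only one picture per row/column is displayed,
# then stretch the k-th column into a 2-D stripe pattern h_k * 1^T.
import numpy as np
from scipy.linalg import hadamard

n = 512                                   # reconstruction resolution
H01 = (hadamard(n) + 1) // 2              # 0/1 basis (the paper's full-rank matrix)
ones_row = np.ones((1, n), dtype=int)

k = 3                                     # index of the displayed pattern
pattern_k = H01[:, [k]] @ ones_row        # each row constant -> horizontal stripes
print(pattern_k.shape)                    # (512, 512)
```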


Fig. 4. Camera calibration procedure. A set of horizontal patterns and a set of vertical patterns are displayed on the display screen and photographed with the lensless camera. Then, using the FlatCam imaging model, these shooting results are decomposed to obtain the system transfer matrixes. Horizontal stripes for left transfer matrix, vertical stripes for right transfer matrix.


3.2 LAs Net for FlatCam reconstruction

Following the success of unrolled optimization methods, we propose LAs Net for FlatCam reconstruction. LAs Net combines the FlatCam imaging model with a convolutional neural network and uses the system transfer matrixes obtained accurately through calibration, which effectively ensures high-quality image recovery. LAs Net has an initial estimate block, in which a Tikhonov-regularized scene X is recovered from the measurement, followed by k layers. Each layer consists of two parts: an analytic solution updating block that solves Eq. (6) in the form of an analytic solution, and a CNN block for Eq. (7). Figure 5 shows the generalized block diagram of our LAs Net. In the following subsections, we describe each of these steps in more detail.


Fig. 5. LAs Net architecture. LAs Net has an initial estimate block and k layers. The input measurement and the system transfer matrixes are first fed into the initial estimate block of LAs Net. The initial estimate ${X^0}$ is then fed into k layers. At each layer, we first update it according to Eq. (11), and then optimize it with a denoised CNN.


When a sensor measurement $Y$ is fed into LAs Net, we first find an initial estimate by specifying $r(X )$ in Eq. (2) as Tikhonov regularization. Using the least squares method, we obtain the following solution:

$${X^0} = {V_L}\left[ {({U_L}^TY{U_R}) \odot ({\sigma _L}{\sigma _R}^T)./({\sigma _L}^2{{({\sigma _R}^2)}^T} + \lambda \cdot {{11}^T})} \right]{V_R}^T$$
where ${U_L}$, ${V_L}$, ${U_R}$, and ${V_R}$ are obtained from the singular value decompositions (SVD) of the system transfer matrixes ${\Phi _L}$ and ${\Phi _R}$. Specifically, $[{{U_L},{S_L},{V_L}^T} ]= SVD({\Phi _L})$ and $[{{U_R},{S_R},{V_R}^T} ]= SVD({\Phi _R})$. The vectors ${\sigma _L},{\sigma _R}$ are the diagonal entries of ${S_L},{S_R}$, respectively, and ${\sigma _L}^2,{\sigma _R}^2$ are their elementwise squares. ${\odot}$ and $./$ denote elementwise multiplication and division of matrixes or vectors, respectively. ${11^T}$ is a matrix whose elements are all one, and $\lambda$ is the penalty parameter.
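
A direct NumPy implementation of Eq. (8) is sketched below; the random matrices stand in for the calibrated ${\Phi _L}$ and ${\Phi _R}$.

```python
# Sketch of the Tikhonov initial estimate of Eq. (8) using the SVDs of the
# transfer matrixes. Random matrices stand in for the calibrated Phi_L, Phi_R.
import numpy as np

def tikhonov_init(Y, Phi_L, Phi_R, lam=0.1):
    U_L, s_L, V_L_T = np.linalg.svd(Phi_L, full_matrices=False)
    U_R, s_R, V_R_T = np.linalg.svd(Phi_R, full_matrices=False)
    num = (U_L.T @ Y @ U_R) * np.outer(s_L, s_R)    # elementwise product with sigma_L sigma_R^T
    den = np.outer(s_L**2, s_R**2) + lam            # sigma_L^2 (sigma_R^2)^T + lambda 11^T
    return V_L_T.T @ (num / den) @ V_R_T            # X^0

rng = np.random.default_rng(0)
Phi_L = rng.standard_normal((128, 64))
Phi_R = rng.standard_normal((128, 64))
Y = Phi_L @ rng.random((64, 64)) @ Phi_R.T
print(tikhonov_init(Y, Phi_L, Phi_R).shape)         # (64, 64)
```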

The initial estimate is then passed through the $k$ layers of LAs Net. Each layer contains an analytic solution updating block ${A_k}$ that solves Eq. (6) and a CNN block that solves Eq. (7). Setting the gradient of the objective in Eq. (6) to zero yields the following equation:

$${\Phi _L}^T{\Phi _L}X{\Phi _R}^T{\Phi _R} + \mu X = {\Phi _L}^TY{\Phi _R} + \mu Z$$
Replacing ${\Phi _L}$ and ${\Phi _R}$ with their SVD decompositions yields
$${S_L}^T{S_L}{V_L}^TX{V_R}{S_R}^T{S_R} + \mu {V_L}^TX{V_R} = {S_L}^T{U_L}^TY{U_R}{S_R} + \mu {V_L}^TZ{V_R}$$
The update formula of block ${A_k}$ can be obtained by further simplifying Eq. (10):
$${X^k} = {V_L}\left[ {(({U_L}^TY{U_R}) \odot ({\sigma _L}{\sigma _R}^T) + \mu {V_L}^T{Z^{k - 1}}{V_R})./({\sigma _L}^2{{({\sigma _R}^2)}^T} + \mu \cdot {{11}^T})} \right]{V_R}^T$$
where ${Z^k}$ is the output of ${C_k}$; when k equals one, ${Z^0}$ is equal to ${X^0}$. For the CNN block ${C_k}$, we choose the U-Net architecture [32] with soft-thresholding. In the proposed CNN block, we use four scales for the encoder-decoder architecture, and the number of channels at each scale is set to 32, 64, 128, and 256. At each scale of the encoder, two convolutional layers and a max-pooling layer encode spatial features, and at each scale of the decoder there are two convolutional layers and an up-convolution layer. All convolutional layers use a kernel size of $3 \times 3$. More channels lead to better results, but this improvement is limited by the amount of GPU memory. In addition, we apply a soft-thresholding function to denoise the output of the four-scale U-Net. Finally, a convolutional layer with 3 output channels produces the RGB image.
$$vec({X^k}) = ({V_R} \otimes {V_L})\left[ {vec((\mu \cdot {{11}^T})./({\sigma _L}^2{{({\sigma _R}^2)}^T} + \mu \cdot {{11}^T})) \odot (({V_R}^T \otimes {V_L}^T) \cdot vec({Z^{k - 1}}))} \right]$$
$$G = {\textstyle{{\partial vec({{X^K}} )} \over {\partial vec{{({Z^{K - 1}})}^T}}}} = ({{V_R} \otimes {V_L}} )\cdot diag({SG} )\cdot {({{V_R} \otimes {V_L}} )^T}$$

By vectorizing the matrix, the ${Z^{k-1}}$-dependent part of Eq. (11) can be written as Eq. (12) (the $Y$-dependent term is constant with respect to ${Z^{k-1}}$ and is omitted), and from Eq. (12) we obtain its derivative, Eq. (13). Here, $vec(m )$ denotes the vectorization of $m$, ${\otimes}$ is the Kronecker product, $SG$ equals $vec((\mu \cdot {11^T})./({\sigma _L}^2{({\sigma _R}^2)^T} + \mu \cdot {11^T}))$, and $diag(A )$ expands the vector A into a diagonal matrix. ${V_L}$ and ${V_R}$ are orthogonal matrices and therefore full rank. Since $rank({A \otimes B} )= rank(A )rank(B )$, $({{V_R} \otimes {V_L}} )$ is full rank. By the Kronecker product property ${({A \otimes B} )^{ - 1}} = {A^{ - 1}} \otimes {B^{ - 1}}$, we have ${({{V_R} \otimes {V_L}} )^{ - 1}} = {V_R}^{ - 1} \otimes {V_L}^{ - 1} = {V_R}^T \otimes {V_L}^T = {({{V_R} \otimes {V_L}} )^T}$. Thus, $({{V_R} \otimes {V_L}} )$ is an orthogonal (unitary) matrix, which has many excellent properties, such as norm preservation: multiplying by it amounts to a rotation. From Eq. (13), G is a real symmetric matrix, and the diagonal elements of $diag({SG} )$ are its eigenvalues; all of them lie in $({0,1} )$, so none of the eigenvalues of G is zero. Therefore, G is full rank, in other words, nonsingular. In summary, $({{V_R} \otimes {V_L}} )$ is a unitary matrix and all the eigenvalues of G lie between 0 and 1, so no singularities or gradient explosions are introduced by backpropagation through the analytic solution block. We have trained the network with many different parameter settings; all backpropagations complete successfully, and the losses converge.
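
For completeness, the analytic update block ${A_k}$ of Eq. (11) can be sketched as follows, reusing the SVD factors computed for Eq. (8); the function is a NumPy illustration rather than the trained TensorFlow graph.

```python
# Sketch of the analytic solution updating block A_k, Eq. (11). The SVD
# factors U_L, s_L, V_L, U_R, s_R, V_R are those of Phi_L and Phi_R, and
# mu is the (trainable) penalty parameter. NumPy illustration only.
import numpy as np

def analytic_update(Y, Z_prev, U_L, s_L, V_L, U_R, s_R, V_R, mu=0.1):
    num = (U_L.T @ Y @ U_R) * np.outer(s_L, s_R) + mu * (V_L.T @ Z_prev @ V_R)
    den = np.outer(s_L**2, s_R**2) + mu        # sigma_L^2 (sigma_R^2)^T + mu 11^T
    return V_L @ (num / den) @ V_R.T           # X^k
```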

4. Implementation

Our camera prototype consists of an almost perfect mask and a monochrome CMOS sensor with $2048 \times 2048$ pixels; the pixel size of the CMOS sensor is $5.5\,\mu\textrm{m}$. The AP mask is generated by an almost perfect sequence of length 516, and the MLS mask is generated by an MLS of length 511. The AP mask is placed approximately $550\,\mu\textrm{m}$ in front of the CMOS sensor; for a fair comparison, the MLS mask is placed at the same distance. To train our Learned Analytic solution Net, we built a lensless display-capture device to obtain the training dataset (see Fig. 1). It includes our FlatCam prototype and a Pad (Samsung Tab S4) with a 10.5-inch screen. The distance between the screen and the mask is $32\,\textrm{cm}$.

We cropped 10000 color images of size $512 \times 512$ from the DIV2K dataset [33]. These images are displayed on the screen, and the sensor captures the corresponding images with $2048 \times 2048$ pixels. In the actual display, only a $14.2\,\textrm{cm}$ square area in the center of the screen was used. These 10000 images also serve as the ground truth of the training dataset because the camera calibration (${\Phi _L}$ and ${\Phi _R}$) ensures accurate registration between the captured images and the ground truth. Since the CMOS sensor is grayscale, we captured the R, G, and B channels separately. The test images consist of two parts: one part is 100 images from DIV2K’s validation set captured with the display-capture device, and the other part is taken in real scenarios where objects are placed in front of the screen. The original 100 images in DIV2K’s validation set are at high resolution, and we resize each image to $512 \times 512$. The images used for training and the images used for testing are thus derived from DIV2K’s training set and validation set, respectively. We performed the above process twice: once with the AP mask and once with the MLS mask for comparison. All parameters were set the same in both experiments.

In this paper, our LAs Net has four layers, which means that the final output of the network is ${Z^4}$. $\lambda$ is set to 0.1, and $\mu$ is a trainable variable with an initial value of 0.1. The loss function is the mean squared error between the output of LAs Net and the ground truth, ${||{{Z^4} - {X_{gt}}} ||_2}^2$. We use the ADAM optimizer over a total of 50 training epochs with a batch size of 12. The initial learning rate is 0.0001, and it is halved every 5 epochs. Our networks are implemented in TensorFlow and trained on a Linux server with an Intel E5-2678 2.5 GHz CPU, 64 GB of memory and four graphics cards (NVIDIA GTX 1080Ti) with 11 GB of memory each.
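
The training schedule described above can be summarized by the following sketch; the helper names are illustrative rather than part of our released code.

```python
# Sketch of the training hyperparameters described above. The function names
# are illustrative; the actual model is the TensorFlow LAs Net.
import numpy as np

def learning_rate(epoch, initial_lr=1e-4, halve_every=5):
    # initial learning rate 1e-4, halved every 5 epochs
    return initial_lr * 0.5 ** (epoch // halve_every)

def mse_loss(z4, x_gt):
    # mean squared error between the network output Z^4 and the ground truth
    return np.mean((z4 - x_gt) ** 2)

print([learning_rate(e) for e in (0, 5, 10)])   # [1e-4, 5e-5, 2.5e-5]
```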

5. Results

In this section, we explain the results of the above implementation in detail. This section is divided into three parts. The first part is the comparison between our AP mask and MLS mask, the second part is the comparison between our LAs Net and existing methods, and the third part is the results of real scenes.

5.1 Comparison with MLS mask

As described in Sec. 4, we conducted two experiments: one using an AP mask generated by an almost perfect sequence of length 516, and another using an MLS mask generated by an MLS of length 511. All parameters are the same except for the mask. We captured 100 images from the DIV2K validation set through the display-capture device. Because the masks were not rigidly fixed to the camera and the capture sessions were long, small positional deviations of the masks occurred during the test, resulting in an incomplete match between the reconstructed images and the ground truth images. Therefore, we use a dense optical flow network to align them before calculating the quantitative values. The reconstruction results are shown in Fig. 6. Visually, the quality of our reconstructed images is excellent, and in the quantitative evaluation our reconstructed images also score highly. Compared with the reconstructions of the MLS mask, the reconstructions of the AP mask have lower noise. The red boxes in Fig. 6 show that the reconstructions of the AP mask have fewer artifacts than those of the MLS mask and are clearer.
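
As an illustration of this registration step, the sketch below warps a reconstruction onto its ground truth using OpenCV's Farneback dense optical flow as a simple stand-in for the flow network we actually used.

```python
# Alignment sketch: warp a reconstruction onto the ground truth before
# computing metrics. OpenCV's Farneback dense flow is used here as a simple
# stand-in for the dense optical flow network mentioned in the text.
import cv2
import numpy as np

def align_to_ground_truth(recon_gray, gt_gray):
    # both inputs are 8-bit single-channel images of the same size
    flow = cv2.calcOpticalFlowFarneback(gt_gray, recon_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gt_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(recon_gray, map_x, map_y, cv2.INTER_LINEAR)
```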


Fig. 6. Reconstruction images of the MLS mask and AP mask through LAs Net. The first line is the ground truth images for reference, the second line is the reconstruction images of the MLS mask, and the third line is the reconstruction images of the AP mask.


Table 1 shows the quantitative evaluation of the reconstruction results based on the MLS mask and the AP mask. For the quantitative comparison, the well-known PSNR and SSIM were used as objective evaluation indexes. The average PSNR and SSIM over the 100 test images of DIV2K’s validation set are summarized in Table 1.
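
The averages in Table 1 can be computed with standard reference implementations, for example scikit-image's metrics (the `channel_axis` argument assumes scikit-image ≥ 0.19), as sketched below.

```python
# Sketch of the quantitative evaluation: average PSNR / SSIM over the test
# set, using scikit-image's reference metrics (images assumed in [0, 1]).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(recons, gts):
    psnrs, ssims = [], []
    for rec, gt in zip(recons, gts):
        psnrs.append(peak_signal_noise_ratio(gt, rec, data_range=1.0))
        ssims.append(structural_similarity(gt, rec, channel_axis=-1, data_range=1.0))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```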


Table 1. Quantitative evaluation of the reconstruction results based on the MLS mask and the AP mask.

5.2 Comparison with other methods

In this section, we compare our method with three existing methods. The first is the Tikhonov regularized reconstruction proposed by Asif [1]. The second applies FISTA [20] to the FlatCam reconstruction. Both of these are traditional algorithms. The third is Khan’s method [16], which is currently the best deep learning method for reconstructing FlatCam images. The performance of the above three methods and of our method is shown in Fig. 7; the photos are from ImageNet [36]. The green boxes (see Fig. 7) show that our reconstructed images are sharper and have less pseudo-texture and noise. Compared with Khan’s method, our reconstructed images are clearer and more similar to the ground truth images, with less color distortion (see the purple boxes in Fig. 7). In terms of visual quality, our reconstructed images perform better than those of the comparison methods, and the quantitative evaluation also confirms that our results surpass the three existing methods. Although we use four U-Nets, the overall model is not large because each U-Net is small. Our method has fewer parameters and floating point operations (FLOPs) than Khan’s method. The parameters and FLOPs are summarized in Table 2.


Fig. 7. Images reconstructed by various methods. The green inset shows the finer region in each image. (a) Ground truth for reference images. (b) Tikhonov regularization reconstruction results. (c) Results of FISTA method. (d) Results of Khan’s method. (e) Results of our LAs Net. The quantitative evaluation is at the bottom of each image.



Table 2. Parameters and FLOPs of Khan’s method and our method.

5.3 Test in real scenarios

In the real shooting process, we placed objects in front of the Pad and used LED lights as the light source. The lensless camera with the AP mask was used in these real scenarios. The reconstruction results of the three methods are shown in Fig. 8. Due to the strong ill-conditioning of the system, the reconstructed images of the Tikhonov and FISTA methods are of poor quality. Figures 8(a1)-(a4) show that the Tikhonov regularized reconstructions have visible noise, which results from measurement noise being amplified in the inverse restoration process; the FISTA reconstructions behave similarly. Figures 8(c1)-(c4) show that our reconstructed images have good quality, although some texture details are lost. Figures 8(c5) and 8(c6) indicate that we can reconstruct letters and numbers well. On the whole, the quality of the reconstructed images is good. We analyze Fig. 8(c5) to estimate the optical resolution of the system in real shooting. The object in Fig. 8(c5) is a printed photograph 13.4 cm across, and the minimum resolvable distance is three pixels in the original image. Thus, the optical resolution of the system is 2.4 mrad.
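
For clarity, the resolution figure follows from the quoted numbers, assuming the photograph spans the full 512-pixel reconstruction and sits at roughly the 32 cm working distance given in Sec. 4:

$$\theta \approx \frac{3 \times ({13.4\,\textrm{cm}}/{512})}{32\,\textrm{cm}} \approx 2.4\,\textrm{mrad}$$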


Fig. 8. Reconstructed images of the AP mask in the real scenarios. (a) Tikhonov regularized reconstructed image. (b) Image reconstructed by FISTA method. (c) Image reconstructed by our LAs Net.


6. Conclusion

We use a mask based on an almost perfect sequence for lensless imaging. Because of its excellent autocorrelation properties, the reconstructed images have good quality and less noise. In addition, we present LAs Net for the image reconstruction of FlatCam. Although lensless imaging has a promising future, the quality of reconstructed images is currently limited. Our network reconstructs high-quality images by combining the physical imaging model of FlatCam with deep learning. Although the transfer matrixes of the system are ill-conditioned at a resolution of $512 \times 512$, we can still reconstruct images of good quality. Moreover, we improved the calibration method to halve the calibration time without any loss in the actual experiment. Combined with our LAs Net, our AP mask-based lensless camera reconstructs images at a resolution of $512 \times 512$ that perform well in both visual quality and objective evaluations.

Acknowledgements

This work was financially supported by ZJU-Sunny Photonics Innovation Center #2019-04.

Disclosures

The authors declare no conflicts of interest.

References

1. M. S. Asif, A. Ayremlou, A. Sankaranarayanan, A. Veeraraghavan, and R. G. Baraniuk, “Flatcam: Thin, lensless cameras using coded aperture and computation,” IEEE Trans. Comput. Imaging. 3(3), 384–397 (2017). [CrossRef]  

2. M. S. Asif, “Lensless 3d imaging using mask-based cameras,” in Proceeding of IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2018), pp. 6498–6502.

3. V. Boominathan, J. K. Adams, M. S. Asif, R. G. Baraniuk, and A. Veeraraghavan, “Lensless Imaging: A computational renaissance,” IEEE Signal Process. Mag. 33(5), 23–35 (2016). [CrossRef]  

4. J. K. Adams, V. Boominathan, B. W. Avants, D. G. Vercosa, F. Ye, R. G. Baraniuk, J. T. Robinson, and A. Veeraraghavan, “Single-frame 3D fluorescence microscopy with ultraminiature lensless FlatScope,” Sci. Adv. 3(12), e1701548 (2017). [CrossRef]  

5. M. J. DeWeert and B. P. Farm, “Lensless coded-aperture imaging with separable Doubly-Toeplitz masks,” Opt. Eng. 54(2), 023102 (2015). [CrossRef]  

6. K. Tajima, T. Shimano, Y. Nakamura, M. Sao, and T. Hoshizawa, “Lensless light-field imaging with multi-phased fresnel zone aperture,” in IEEE International Conference on Computational Photography (IEEE, 2017), pp. 1–7.

7. N. Antipa, G. Kuo, R. Ng, and L. Waller, “3D Diffusercam: Single-shot compressive lensless imaging,” in Computational Optical Sensing and Imaging (Optical Society of America, 2017) paper CM2B-2.

8. N. Antipa, G. Kuo, R. Heckel, B. Mildenhall, E. Bostan, R. Ng, and L. Waller, “DiffuserCam: lensless single-exposure 3D imaging,” Optica 5(1), 1–9 (2018). [CrossRef]  

9. G. Kuo, N. Antipa, R. Ng, and L. Waller, “3D Fluorescence microscopy with diffusercam,” in Computational Optical Sensing and Imaging (Optical Society of America, 2018) paper CM3E-3.

10. G. Kim and R. Menon, “Computational imaging enables a “see-through” lens-less camera,” Opt. Express 26(18), 22826–22836 (2018). [CrossRef]  

11. P. R. Gill, “Odd-symmetry phase gratings produce optical nulls uniquely insensitive to wavelength and depth,” Opt. Lett. 38(12), 2074–2076 (2013). [CrossRef]  

12. P. R. Gill and D. G. Stork, “Lensless ultra-miniature imagers using odd-symmetry spiral phase gratings,” in Computational Optical Sensing and Imaging (Optical Society of America, 2013) paper CW4C-3.

13. P. R. Gill, C. Lee, D. Lee, A. Wang, and A. Molnar, “A microscale camera using direct Fourier-domain scene capture,” Opt. Lett. 36(15), 2949–2951 (2011). [CrossRef]  

14. J. Tan, L. Niu, J. K. Adams, V. Boominathan, J. T. Robinson, R. G. Baraniuk, and A. Veeraraghavan, “Face detection and verification using lensless cameras,” IEEE Trans. Comput. Imaging. 5(2), 180–194 (2019). [CrossRef]  

15. K. Yanny, N. Antipa, R. Ng, and L. Waller, “Miniature 3D fluorescence microscope using random microlenses,” in Optics and the Brain (Optical Society of America, 2019) paper BT3A-4.

16. S. S. Khan, V. R. Adarsh, V. Boominathan, J. Tan, A. Veeraraghavan, and K. Mitra, “Towards photorealistic reconstruction of highly multiplexed lensless images,” in Proceedings of the IEEE International Conference on Computer Vision (IEEE, 2019), pp. 7860–7869.

17. G. Chen and Z. Zhao, “Almost perfect sequences based on cyclic difference sets,” J. of Syst. Eng. Electron. 18(1), 155–159 (2007). [CrossRef]  

18. J. Wolfmann, “Almost perfect autocorrelation sequences,” IEEE Trans. Inf. Theory 38(4), 1412–1418 (1992). [CrossRef]  

19. J. M. Bioucas-Dias and M. A. T. Figueiredo, “A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. Image Proc. 16(12), 2992–3004 (2007). [CrossRef]  

20. A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sci. 2(1), 183–202 (2009). [CrossRef]  

21. S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, “Sparse reconstruction by separable approximation,” IEEE Trans. Signal Proc. 57(7), 2479–2493 (2009). [CrossRef]  

22. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017). [CrossRef]  

23. Y. Li, Y. Xue, and L. Tian, “Deep speckle correlation: a deep learning approach toward scalable imaging through scattering media,” Optica 5(10), 1181–1190 (2018). [CrossRef]  

24. S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5(7), 803–813 (2018). [CrossRef]  

25. W. Dong, P. Wang, W. Yin, G. Shi, F. Wu, and X. Lu, “Denoising prior driven deep neural network for image restoration,” IEEE Trans. Pattern Anal. Mach. Intell. 41(10), 2305–2318 (2019). [CrossRef]  

26. K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Proc. 26(9), 4509–4522 (2017). [CrossRef]  

27. K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE2017), pp. 3929–3938.

28. K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “Reconnet: Non-iterative reconstruction of images from compressively sensed measurements,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE2016), pp. 449–458.

29. J. Zhang and B. Ghanem, “ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 1828–1837.

30. K. Monakhova, J. Yurtsever, G. Kuo, N. Antipa, K. Yanny, and L. Waller, “Learned reconstructions for practical mask-based lensless imaging,” Opt. Express 27(20), 28075–28090 (2019). [CrossRef]  

31. A. Ayremlou, “FlatCam: Lensless Imaging, Principles, Applications and Fabrication,” 2015.

32. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. (Springer, 2015), pp. 234–241.

33. E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (IEEE2017), pp. 126–135.

34. X. Yuan and Y. Pu, “Parallel lensless compressive imaging via deep convolutional neural networks,” Opt. Express 26(2), 1962–1977 (2018). [CrossRef]  

35. M. Qiao, Z. Meng, J. Ma, and X. Yuan, “Deep learning for video compressive sensing,” APL Photonics 5(3), 030801 (2020). [CrossRef]  

36. O. Russakovsky, J. Deng, and F. Li, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis. 115(3), 211–252 (2015). [CrossRef]  
