MWDNs: reconstruction in multi-scale feature spaces for lensless imaging

Ying Li; Ying Li; Ying Li; Ying Li; Ying Li; Zhengdai Li; Zhengdai Li; Zhengdai Li; Zhengdai Li; Kaiyu Chen; Kaiyu Chen; Kaiyu Chen; Kaiyu Chen; Youming Guo; Youming Guo; Youming Guo; Changhui Rao; Changhui Rao; Changhui Rao

doi:10.1364/OE.501970

1. Introduction

Lensless imaging utilizes modulation mask to encode phase, amplitude and other light field information of a scene, which provides possibilities for optical system to be no longer limited by traditional complex lenses group. It has important significance in certain special scenarios that require subminiature designs, such as in vivo imaging, wearables, mobile platforms [1–5].

The imaging process of lensless camera is usually modeled as a straightforward linear system, and the visual image is restored by solving the inverse problem. Iterative reconstruction [1,6] is the preferred alternative for achieving image restoration based on the camera physical model, as it has stable deconvolution and robust results. Unfortunately, the overly idealized model would result in obvious noise and artifacts in reconstructions. Besides, iterations typically require expensive and slow calculations. In order to address this, data-driven pure Convolutional Neural Network (CNN) without physics priors is proposed to grow in the speed and quality [7,8]. CNN is capable of learning the complex scene statistics, however cannot accurately simulate the optical transmission of imaging system, making it difficult for the network to recover details from this highly reusable mode. Hence it is essential to incorporate traditional calculation into the entire algorithm.

Therefore, some works suggested cascading the two methods [9–12], with model-based algorithm for initial deconvolution and deep learning as post-processing for augmenting perception, making a remarkable improvement in overall performance. Despite the enhancement in reconstruction quality, there is still a lot of room for amelioration. The forward model is incomplete or incorrect that a large gap occurring between reality and model [13]. This problem may originate from misaligned or inaccurate point spread function (PSF), field of view (FOV) exceeding lateral shift invariance, and environmental or system noise during imaging. So, the complete description of field variation aberrations is extremely cumbersome, and that usually multiplies computational costs and is too slow for real-time imaging.

For correcting deviations and compensating the information loss caused by model mismatch, two solutions have been put forward in recent literatures. One is to design a mask for maximizing the acquired information in optics [14–17], the other is to adjust the reconstruction algorithms for increasing the information utilization of captured images [13,18–22]. In this article, we are more concerned with the algorithms. Although these advanced algorithms preserve some key details in the reconstructions, the complex convolutional structures and massive global optimization make training much more difficult. Meanwhile, the limited exploitation of raw images and physical knowledges drops the final performance. To fix it, we come up with a novel class of algorithms, Multi Wiener Deconvolution Networks called MWDNs, that use physical knowledge to deconvolute in multi-scale feature spaces for recovering more information from the multiplexed measurements; ulteriorly, it uses neural network encoder to correct PSF for degrading model error. Furthermore, we select Wiener filters instead of iterative optimization like ADMM, which observably reduces the complexity of computing while keeping high quality results.

Concretely speaking, the main contributions of this paper include:

1. An effective network framework. We learn the Multi Wiener Deconvolution Networks, utilizing model repeatedly in multi-scale feature spaces to fix model mismatch, and analyze the role of each part of the net through ablation experiments.
2. A practical experimental verification. We build a lensless camera for proof by using it to collect 25, 000 images displayed on a computer screen and a small number of real scenes, that proves the practicality of our method in an actual lensless camera.
3. A demonstration of excellent algorithm performance. Compared with advanced reconstruction algorithms in recent years, our method not only improves the metrics about PSNR and LPIPS, but also cut down by over half the reconstruction time.

2. Related work

Recently, lensless imaging has great potential because of its small and cheap design and high-dimensional capture, while the rise of computational imaging further unlocks the broad prospects of it. In this section, we provide a brief survey of lensless cameras and existing technologies of lensless image reconstruction.

2.1 Lensless cameras

Lensless camera is a simplified imaging system consisting of an image sensor with a modulation mask against the front end of it. The incident light from the target is received by sensor pixels after optical modulation, ultimately producing an encoded non-photorealistic caustics pattern. Numerous works demonstrate that the performance using masks as hardware optical encoders, including Fresnel zone plate [23,24], spatial light modulator [25], amplitude diffraction grating [26], separable amplitude mask [27], phase mask [14] and so on. Unlike traditional lens cameras that directly record images, lensless system transfers some imaging burden from hardware to computation, which requires computational algorithms to reconstruct the speckles into visualized images. Currently, lensless cameras have been used for holography [28], machine vision [2], 3D fluorescence microscopy [29] and refocusable photography [30].

2.2 Lensless image reconstruction methods

The forward imaging of lensless camera is assumed to be a convolution operation based on PSF of the system. Classical reconstruction methods are roughly divided into two categories: single-step [31] and iterative reconstructions [6]. Single-step method is fast but have harsh limitations, requiring precise calibration PSF and low noise environments. Iterative method typically exhibits better, guaranteeing strong convergence, but has the disadvantage of slow speed.

Then deep learning plays a role. Reference [8] achieved feature self-correction by fully convolutional network. Another paper [32] inferred better results from global features by a fully connected neural network with a transformer. Nevertheless, most prefer to study non-blind deconvolution, which is beneficial for detail reconstruction. The work in [9] unrolled five iterations of ADMM, followed by a U-Net to generate more visually appealing images. Reference [33] demonstrated an unsupervised approach on untrained networks, which was without labeled data but took 1-5 hours, no practicality.

Recently, algorithms were further improved when considering the model mismatch between forward model and the actual. In [13], the results of each unrolled ADMM were input into the residual block and then fed into U-Net denoiser to correct model errors. Zhou et al. [18,19] combined traditional optimization and deep denoising as each layer of iteration to gradually refine reconstruction quality. on the other side, some proposed to modify the kernel parameters in the inversion during training for suppressing the systematic error in PSF measurement [20,22]. Similarly, the work in [21] embedded multiple PSF sized trainable kernels and simple CNN into iterative model to compensate for optical aberrations and modify the physical model. For another, a work based the theory of wave optics analyzed the zero point of frequency response from diffraction, completing two measurements by moving the mask and then fusing the images for zero compensation [34]. The common defects with these methods are that it not only is complicated and difficult to optimize but also takes a long time to train and test, and more important is that there are much information omissions during the use of measurements.

To make up the deficiencies, we get inspiration from the supervised image deblurring algorithms in traditional optical camera. In the first place we found that performing classical Wiener deconvolution in feature space can significantly polish up image quality [35]. Later, we focused on a network that combines Wiener filters and U-Net, which achieves deconvolution in multi-scale feature spaces for preserving more image information [36]. Therefore, we extend this fast and high-performance algorithm to lensless camera in this work.

3. Method

We first introduce the forward model of lensless imaging system. Then on this basis we specify our traditional reconstruction approach based on the model. Finally, we provide a detailed explanation of the MWDNs (MWDN and its variant structure) and show the loss functions applied therein.

3.1 Physical model formulation

Assuming that the PSF conforms translation invariant throughout the whole FOV, the measurement, b, from lensless imaging system is the result of the linear combination of PSFs after the points in scene, v, are mapped onto the sensor. Meanwhile, taking into account the presence of some additional noise, ɛ, in the environment and system, the process of lensless imaging is described as:

(1)$${\textbf b}({x,y} )= {\textbf h}({x,y} )\ast {\textbf v}({x,y} )+ {\mathbf \varepsilon }({x,y} ),$$

where h denotes PSF of the optical system, (x, y) are the sensor coordinates, and * represents a 2D discrete linear convolution.

Here, our goal is to recover the target scene, v, from the measurement, b. Since our whole algorithm includes multiple inverse operations, we must control the calculation amount of a traditional model-based inverse solvers to ensure that inversion is completed in a short time. Thereby the v in this paper is preliminarily estimated by a simple Wiener filter:

(2)$$\hat{{\textbf V}}({u,v} )= \frac{{w{{\textbf H}^ \ast }({u,v} ){\textbf B}({u,v} )}}{{{{|{w{\textbf H}({u,v} )} |}^2} + \lambda }},$$

where w is the wight used to control the energy of PSF, λ is the constant term of regularization. H and B respectively represent the transformation of h and b from time domain to frequency domain, by the same token, $\hat{{\textbf V}}$ indicates the estimation of scene in the frequency domain.

However, there are some unavoidable issues with the above system model. For example, it ignores that the farther away from the center of the FOV, the weaker the shift invariance will be. Alternatively, the point source may be not on the axis when collecting the system PSF. All of these would prevent pure model-based algorithm from achieving optimal image. Thus, we in this work attempt to correct the relationship between the captured b, h, and v by inserting multi-Wiener filters into multi-scale feature spaces and learning to optimize PSF, in order to raise image quality. In the next section, we will describe in more detail how we achieve reconstruction quickly and accurately.

3.2 Networks

Here, we mainly explore two efficient reconstruction algorithms to achieve state-of-the-art results for intensity estimation from single snapshot coded image. The scheme we design follows the principle of “deconvoluting and denoising synchronously”, that Wiener filters and CNN working on a deeper integration, so that the physical model is inherently built into the entire architecture. This collaborative approach makes the network architecture more stable and efficient, and its specific content will be introduced in the following subsections.

3.2.1 Multi Wiener deconvolution network (MWDN)

In this network, we insert a Wiener filter into each skip connection of U-Net to complete deconvolution in multi-scale feature spaces, providing various physical information for deep learning. The measurement is not directly involved in deconvolution, but is first optimized by the encoder of U-Net to minimize the difference between the actual and predicted measurements and correct some model mismatch errors caused by the system. Based on this, the entire network structure becomes more generalized without complicating the U-Net. Figure 1 illustrates how our data and parameters flow through the network.

Fig. 1. Multi Wiener deconvolution network. The measurement is input into a common U-Net, with the difference being that the feature map is deconvoluted with the corresponding scale PSF before performing each skip connection.

Download Full Size | PDF

According to formula (2), it is known that Wiener filter heavily relies on regularization constant and the weight of the PSF, so it is necessary to optimize the two parameters in each Wiener. There are four Wiener filters applied in MWDN with parameters $w = \{{{w_1},{\; }{w_2},{\; }{w_3},{\; }{w_4}} \}$ and $\lambda = \{{{\lambda_1},{\; }{\lambda_2},{\; }{\lambda_3},{\; }{\lambda_4}} \}$. The PSFs involved with different scales are directly obtained through max pooling on the original PSF. Here together with the parameters in U-Net, the total number of learning parameters is 17,267,531. At the end of the network, we use the loss functions recommended in Section 3.3 to compare the reconstruction with the ground truth and update the trainable parameters by backpropagation.

3.2.2 Multi Wiener deconvolution network on corrected PSF (MWDN-CPSF)

In the MWDN above, we regard the encoder of U-Net as an optimization of the measurement (correcting for chromatic aberration and field curvature) to approximate to physical model. On the other hand, the amelioration of the forward model matching depends on a better and more standard PSF except for the optimized measurement, as can be seen from Eq. (1). Here, the diameter, wavelength and position of the point light source might affect to acquire precise PSF. As a consequence, another variation of MWDN is come into use. We add some CNN (same as the encoder of U-Net) to adjust the PSF, as shown in Fig. 2, which allows the feature map of each channel to receive a corresponding optimized PSF during Wiener deconvolution. This method has more learnable parameters, with a total of 21,956,743. Among this, since the PSF undergoes an encoding branch, the learning parameter w in Wiener is removed and only the regularization constant λ is retained.

Fig. 2. Multi Wiener deconvolution network on corrected PSF. We add a branch to optimize PSF in MWDN to reduce model errors. Specifically, each feature map together with a different PSF participating in Wiener deconvolution, making full use of the original image. As a result, the final reconstruction would have more detail.

Download Full Size | PDF

3.2.3 Other networks for ablation experiments

For completeness, we conduct ablation experiments with adding three other comparative methods. 1. Wiener (a pure physical model): using a single Wiener filter for image reconstruction, covering a pair of learning parameters w and λ. 2. U-Net (a pure deep learning model): establishing a mapping relationship between measurements and reconstructions through data-driven and including 17,267,523 parameters. 3. Wiener-U (a collaboration between physics knowledge and deep learning): combining the first two that Wiener deconvolution followed by a U-Net perceptron, and the number of learning parameters is also the sum of them, totaling 17,267,525.

3.3 Loss functions

The loss function must be carefully taken into account because it dictates the parameter updates throughout the training process. During training for image reconstruction, the mean-squared error (MSE) loss, that reflecting the degree of difference between the ground truth and the estimator, is indispensable. Besides, since MSE favors low frequency information in images, the reconstructions are blurry and lack detail, and appear noticeably distorted in color as well. The Learned Perceptual Image Patch Similarity metric (LPIPS) mitigate this problem by using 2 or 3 deep features to quantify a perceptual distance between two images, as introduced in [37]. The deep features are commonly computed from a pre-trained Alex, or VGG-16 network. In this paper, we select Alex for LPIPS. A combination of both MSE and LPIPS as follows:

(3)$$\begin{array}{c} {L_{Intensity}}({{\textbf v},\hat{{\textbf v}}} )= {\lambda _1}{L_{MSE}} + {\lambda _2}{L_{LPIPS}},\\ {L_{MSE}}({{\textbf v},\hat{{\textbf v}}} )= ||{{\textbf v} - \hat{{\textbf v}}} ||_2^2,\\ {L_{LPIPS}}({{\textbf v},\hat{{\textbf v}}} )= ||{{\phi_2}({\textbf v} )- {\phi_2}({\hat{{\textbf v}}} )} ||_2^2 + ||{{\phi_4}({\textbf v} )- {\phi_4}({\hat{{\textbf v}}} )} ||_2^2, \end{array}$$

where v indicates the ground truth, $\hat{{\textbf v}}$ is the reconstructed intensity. And, $\phi $ means extracting features from an image matrix, where ${\phi _2}$ is the perceptual output of the 2nd convolutional layer (after activation) and ${\phi _4}$ is from the 4th convolutional (after activation) layer in the pre-trained model. The parameter ${\lambda _1}$ and ${\lambda _2}$, as the weights of the respective loss functions, require to be set manually. If you want to achieve a higher SNR in results, increase ${\lambda _1}$ and decrease ${\lambda _2}$; conversely, if you want to get a better visual perception of images, decrease ${\lambda _1}$ and increase ${\lambda _2}$.

4. Implementation

In this section, we record our lensless camera and different datasets used in our experiment, as well as the process of training the networks.

Lensless camera To verify the effectiveness of our networks, we made a lensless camera, referring to Fig. 3(a). The phase mask used is customized according to the design scheme provided by [14]. As shown in Fig. 3(b), that is a 7 × 7 sq. mm diffractive optical element (DOE) with a pixel pitch of 2µm. Figure 3(c) shows the height map of the designed DOE, which is segmented into 8 levels with a 145 nm difference between the adjacent levels. It has better local focusing than the pseudo random phase mask [9], which could produce a higher contrast PSF like Fig. 3(d). The mask is placed at a distance of 2 millimeters from the front end of a HIKROBOT color camera with Sony IMX183 sensor. The aperture of the camera is about 2.5 × 3 square millimeters. The calibration PSF is obtained from a white LED light source (with the wavelength range of 400-760 nm) filtered by a pinhole placed 20 cm away from the camera. When the object distance exceeds 20 cm, the PSF variation is very small that could be ignored.

Fig. 3. Diagrammatic overview of lensless camera and experiment. (a) A lensless camera. (b) The appearance of a customized DOE. (c) The height map of the DOE. (d) The PSF of the lensless camera. (e) The process of capturing the Dogs vs. Cats dataset.

Download Full Size | PDF

Datasets In the first stage, we validate the performance of our methods on publicly available datasets, DiffuserCam dataset provided in [9] and PhlatCam dataset in [20], containing 25,000 and 10,000 image pairs, respectively. In the second step, we take training samples with our own camera. The experimental device is shown in Fig. 3(e). We put the camera about 20 centimeters away from the 27-inch Dell display, then controlled the display to play pictures from Dogs vs. Cats dataset in Kaggle. The area occupied by these pictures is perhaps 14 × 14 square centimeter. We continuously captured 25,000 images with a pixel size of 1280 × 1280, finally binning to 320 × 320. After that, we utilized ADMM to reconstruct an image for computing the homography transformation between the reconstructed image and the image data displayed on screen. The image data was then aligned and converted to the ground truth by means of resizing and padding.

Training We split up both DiffuserCam dataset and self-made dataset into 20,000 training patches, 2,500 validation patches, and 2,500 test patches with a ratio of 8:1:1. For the PhlatCam dataset, since the data volume is small, we divide it into 9,900 training patches and 100 test patches. All the networks are implemented in PyTorch and trained on a RTX 3090 GPU, using an ADAM optimizer throughout training. In order to restore images with higher PSNR, we set ${\lambda _1} = 1,\; {\lambda _2} = 1$ during earlier epochs and set ${\lambda _1} = 1,\; {\lambda _2} = 0.05$ during later epochs.

5. Results

We contrasted our work with the state-of-the-art methods. At the same time, we carried out ablation studies to determine the contribution of each component in our algorithm to the reconstruction quality. The results in three test sets show that our method accomplishes better image quality and accelerates by three times. Finally, we use the hardware prototype to verify our method against the natural objects in the actual scene.

5.1 Test results on public datasets

After training, the reconstruction results in test set from public DiffuserCam dataset show that our proposed method is able to effectively enhance the details of images. In Table 1, we compare against our works with others, analyzing the discrepancies between different networks and summarizing the respective performance and speed of reconstructions. As for the components, Le-ADMM-U [9] and MMCN [13], as well as Wiener-U, combine physics prior with CNN denoiser, ultimately achieving similar quality. To break through the foregoing performance ceilings, UPDN [21] and our approaches optimize the matching degree of physical knowledge, mainly by treating the measurements and PSF differently. Further, our networks do not require to follow up denoising. So that the MWDNs not only have the highest PSNR and LPIPS scores, but also the fastest reconstruction speed, increasing by 2-3 times. In the ablation experiments, we explored various combinations of Wiener filters and U-Net. Here, the performance from single Wiener to multi-Wiener (Wiener → U-Net → Wiener-U → MWDN → MWDN-CPSF) is getting better and better, proving that our strategies adding Wiener filters and PSF optimization branches are totally effective. MWDN-CPSF is slightly slower than MWDN because of its more learning branches on optimized PSF.

Table 1. Network performance. We summarize the average MSE, LPIPS metrics for reconstruction on test set of DiffuserCam. MWDN-CPSF has the best performance in terms of MSE and LPIPS, outperforming MWDN. Among methods with similar effects, two MWDNs are superior in terms of time.

View Table | View all tables in this article

Figure 4 exhibits several sample patterns reconstructed by various networks on the same test set. Overall, MWDN-CPSF has the best reconstruction performance, commendably restoring the high frequency of images for more detail and producing images that are visually more faithful to the ground truth images. This demonstrates the practicality of multiple Wiener filters and U-Net working together. Observing the results of ablation experiment, the pure Wiener appears a large number of artifacts on the images, while the pure deep learning has poor generalization ability and distorts the lines and color. When the model-based deconvolution is installed at the front of U-Net, the quality is greatly facilitated but the details are seriously lost. MWDN and MWDN-CPSF change the position and quantity of Wiener filters, allocating the underlying physical model of light transmission in lensless camera more reasonably by encoding measurement and PSF to accomplish the best performance.

Fig. 4. Test set results of the public DiffuserCam dataset, with the coded images and the ground truth images for reference. MWDN-CPSF have more details and noticeably better visual image quality than others.

Download Full Size | PDF

To further confirm the performance of the proposed methods, we also compared MWDNs with Le-ADMM-U and UPDN on the public PhlatCam dataset, some of which are shown in Fig. 5. Again, MWDN-CPSF achieved the highest evaluation. In some cases, the results of MWDN are inferior to UPDN in PSNR because of the lack of calibrating PSF in MWDN.

Fig. 5. Some test results of public PhlatCam dataset. These samples perform best under the MWDN-CPSF structure. Although MWDN does not have an advantage in SPNR, it is clearly superior to UPDN on the vision (described by LPIPS).

Download Full Size | PDF

5.2 Test results on our dataset

To validate the performance of our proposed methods in different systems and environments, we train and test them on our collected images. Table 2 records the scores of reconstruction images from different networks on PSNR and LPIPS. Clearly, WNDN and WNDN-CPSF are still the best performer of all. The performance of Le-ADMM-U, MMCN, and Wiener-U is basically consistent, the reason of this is that the customized PSF is higher contrast which brings a higher match between the real system and the physical model, resulting in richer details in the deconvolution image. Similarly, a good PSF weakens the effectiveness of our algorithms in correcting the model, leading to an obviously smaller improvement compared to the results in section 5.1. Figure 6 shows partial reconstructions of these works on the homemade test set, further demonstrating the superiority of our algorithms. However, it can also be found that the images of this dataset have more high-frequency components and more complex content than DiffuserCam dataset, making MWDN and MWDN-CPSF work a little worse than before.

Fig. 6. Test set results of our dataset. Both MWDNs perform better than the others.

Download Full Size | PDF

Table 2. Network performance. It documents the test results of our home-made dataset. From the vertical perspective, WMDN-CPSF always brings out the best in all methods.

View Table | View all tables in this article

5.3 Real scene results

Next, we took some lensless images of real objects (toys, boxes, brochures) and reconstructed them on different algorithms. Figure 7 shows these reconstructions using our learned networks. Need of special note is the ground truth are taken from the author's mobile phone, so the size and orientation of targets are slightly different from the reconstructions. From these results, it can be seen that the images generated by our methods have higher visual quality than current advanced methods. For example, we clearly reconstructed the horns of the dolls marked in green in column 2/3 and the letters circled in green in column 5/7. In particular, MWDN and MWDN-CPSF, while producing the most visually appealing images, also have faster reconstruction speeds than other methods. This manifests that our networks have the potential not only to extend from computer monitors to natural environments, but also to achieve real-time interaction.

Fig. 7. The performance of our lensless camera in the wild. Visually, our methods have clearer details compared to the other three advanced, especially these areas marked by highlighted green boxes, including the horns of toys and the letters printed on boxes and brochures.

Download Full Size | PDF

6. Conclusion

In this work, we present a new series of Multi Wiener Deconvolution Networks to reinforce model matching and achieve image restoration from a single snapshot coded image of mask-based lensless cameras. In our framework, by synchronously optimizing measurements and PSF, as well as multiple single-step reconstructions in multi-scale feature spaces, the more accurate and comprehensive deconvolution is achieved. Compared with the existing methods, ours have lower computational cost and a running speed of 2-4 times, which is conducive to interactive preview of scenes. In terms of performance, we conducted extensive simulations using different types of datasets from different lensless cameras, as a consequence, our algorithms recover image intensity with higher accuracy than others. It suggests the applicability of our method for both different cameras and different scenes. Moreover, testing on images captured in the field also reveals that our network can generalize to natural scenes.

After analyzing these reconstruction methods, we can know that the quality and resolution are affected by model defects and that can be cut down by optimizing the measurements and PSF. In this regard, there are two major limitations to our works. On the one hand, these optimizations depend on data-driven statistics, missing the specific research of theoretical basis. On the other hand, Wiener filter, with fewer constraints, only rebuild poor images (far inferior to ADMM), which leads to a sharp rise in the difficulty of deep learning. So, in the next stage we plan to modify the ways of optimization or look for suitable model replacing Formula 2 (the Wiener filters in the network) to overcome these limitations for more accurate imaging. Looking into the future, we hope to promote our idea to practical applications.

Funding

Youth Innovation Promotion Association of the Chinese Academy of Sciences (2020376); National Natural Science Foundation of China (12173041); Frontier Research Fund of Institute of Optics and Electronics, Chinese Academy of Sciences (C21K002).

Disclosures

The authors declare that there are no conflicts of interest.

Data availability

All data needed to evaluate the conclusions in the paper are provided herein. Additional data related to this paper may be kindly requested from the corresponding author.

References

1. N. Antipa, G. Kuo, R. Heckel, B. Mildenhall, E. Bostan, R. Ng, and L. Waller, “DiffuserCam: lensless single-exposure 3D imaging,” Optica 5(1), 1–9 (2018). [CrossRef]

2. J. Tan, L. Niu, J. K. Adams, V. Boominathan, J. T. Robinson, R. G. Baraniuk, and A. Veeraraghavan, “Face Detection and Verification Using Lensless Cameras,” IEEE Trans. Comput. Imaging 5(2), 180–194 (2019). [CrossRef]

3. J. K. Adams, D. Yan, J. Wu, V. Boominathan, S. Gao, A. V. Rodriguez, S. Kim, J. Carns, R. Richards-Kortum, C. Kemere, A. Veeraraghavan, and J. T. Robinson, “In vivo lensless microscopy via a phase mask generating diffraction patterns with high-contrast contours,” Nat. Biomed. Eng. 6(5), 617–628 (2022). [CrossRef]

4. F. Tian and W. Yang, “Learned lensless 3D camera,” Opt. Express 30(19), 34479–34496 (2022). [CrossRef]

5. V. Boominathan, J. T. Robinson, L. Waller, and A. Veeraraghavan, “Recent advances in lensless imaging,” Optica 9(1), 1–16 (2022). [CrossRef]

6. Y. Zheng and M. Salman Asif, “Joint image and depth estimation with mask-based lensless cameras,” IEEE Trans. Comput. Imaging 6, 1167–1178 (2020). [CrossRef]

7. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017). [CrossRef]

8. B. Donggeon, J. Jaewoo, B. Nakkyu, and L. Seung Ah, “Lensless imaging with an end-to-end deep neural network,” International Conference on Consumer Electronics - Asia (IEEE, 2020).

9. K. Monakhova, J. Yurtsever, G. Kuo, N. Antipa, K. Yanny, and L. Waller, “Learned reconstructions for practical mask-based lensless imaging,” Opt. Express 27(20), 28075–28090 (2019). [CrossRef]

10. Y. Guo, L. Zhong, L. Min, J. Wang, Y. Wu, K. Chen, K. Wei, and C. Rao, “Adaptive optics based on machine learning: a review,” Opto-Electron. Adv. 5(7), 200082 (2022). [CrossRef]

11. K. Yanny, K. Monakhova, R. W. Shuai, and L. Waller, “Deep learning for fast spatially varying deconvolution,” Optica 9(1), 96–99 (2022). [CrossRef]

12. D. Bagadthey, S. Prabhu, S. S. Khan, D. T. Fredrick, V. Boominathan, A. Veeraraghavan, and K. Mitra, “FlatNet3D: intensity and absolute depth from single-shot lensless capture,” J. Opt. Soc. Am. A 39(10), 1903–1912 (2022). [CrossRef]

13. T. Zeng and E. Y. Lam, “Robust reconstruction with deep learning to handle model mismatch in lensless imaging,” IEEE Trans. Comput. Imaging 7, 1080–1092 (2021). [CrossRef]

14. V. Boominathan, J. K. Adams, J. T. Robinson, and A. Veeraraghavan, “PhlatCam: Designed Phase-Mask Based Thin Lensless Camera,” IEEE Trans. Pattern Anal. Mach. Intell. 42(7), 1618–1629 (2020). [CrossRef]

15. Q. Fu, D.-M. Yan, and W. Heidrich, “Diffractive lensless imaging with optimized Voronoi-Fresnel phase,” Opt. Express 30(25), 45807–45823 (2022). [CrossRef]

16. V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Trans. Graph. 37(4), 1–13 (2018). [CrossRef]

17. Y. Peng, Q. Sun, X. Dun, G. Wetzstein, W. Heidrich, and F. Heide, “Learned large field-of-view imaging with thin-plate optics,” ACM Trans. Graph. 38(6), 1–14 (2019). [CrossRef]

18. H. Zhou, H. Feng, Z. Hu, Z. Xu, Q. Li, and Y. Chen, “Lensless cameras using a mask based on almost perfect sequence through deep learning,” Opt. Express 28(20), 30248–30262 (2020). [CrossRef]

19. H. Zhou, H. Feng, W. Xu, Z. Xu, Q. Li, and Y. Chen, “Deep denoiser prior based deep analytic network for lensless image restoration,” Opt. Express 29(17), 27237–27253 (2021). [CrossRef]

20. S. S. Khan, V. Sundar, V. Boominathan, A. Veeraraghavan, and K. Mitra, “FlatNet: towards photorealistic scene reconstruction from lensless measurements,” IEEE Trans. Pattern Anal. Mach. Intell. 44, 1934–1948 (2022). [CrossRef]

21. O. Kingshott, N. Antipa, E. Bostan, and K. Aksit, “Unrolled primal-dual networks for lensless cameras,” Opt. Express 30(26), 46324–46335 (2022). [CrossRef]

22. J. Yang, X. Yin, M. Zhang, H. Yue, X. Cui, and H. Yue, “Learning image formation and regularization in unrolling AMP for lensless image reconstruction,” IEEE Trans. Comput. Imaging 8, 479–489 (2022). [CrossRef]

23. J. Wu, H. Zhang, W. Zhang, G. Jin, L. Cao, and G. Barbastathis, “Single-shot lensless imaging with fresnel zone aperture and incoherent illumination,” Light: Sci. Appl. 9(1), 53 (2020). [CrossRef]

24. J. Wu, L. Cao, and G. Barbastathis, “DNN-FZA camera: a deep learning approach toward broadband FZA lensless imaging,” Opt. Lett. 46(1), 130–133 (2021). [CrossRef]

25. W. Chi and N. George, “Optical imaging with phase-coded aperture,” Opt. Express 19(5), 4294–4300 (2011). [CrossRef]

26. J. D. Rego, H. Chen, S. Li, J. Gu, and S. Jayasuriya, “Deep camera obscura: an image restoration pipeline for pinhole photography,” Opt. Express 30(15), 27214–27235 (2022). [CrossRef]

27. M. J. DeWeert and B. P. Farm, “Lensless coded-aperture imaging with separable Doubly-Toeplitz masks,” Opt. Eng. 54(2), 023102 (2015). [CrossRef]

28. J. Hao, X. Lin, Y. Lin, M. Chen, R. Chen, G. Situ, H. Horimai, and X. Tan, “Lensless complex amplitude demodulation based on deep learning in holographic data storage,” Opto-Electron. Adv. 6(3), 220157 (2023). [CrossRef]

29. K. Yanny, N. Antipa, W. Liberti, S. Dehaeck, K. Monakhova, F. L. Liu, K. Shen, R. Ng, and L. Waller, “Miniscope3D: optimized single-shot miniature 3D fluorescence microscopy (vol 9, 171, 2020),” Light: Sci. Appl. 12(1), 93 (2023). [CrossRef]

30. K. Tajima, T. Shimano, Y. Nakamura, M. Sao, and T. Hoshizawa, “Lensless light-field imaging with multi-phased Fresnel Zone Aperture,” in International Conference on Computational Photography (IEEE, 2017), 76–82.

31. M. S. Asif, A. Ayremlou, A. Veeraraghavan, R. Baraniuk, and A. Sankaranarayanan, “FlatCam: replacing lenses with masks and computation,” in International Conference on Computer Vision (IEEE, 2015), 663–666.

32. X. Pan, X. Chen, S. Takeyama, and M. Yamaguchi, “Image reconstruction with transformer for mask-based lensless imaging,” Opt. Lett. 47(7), 1843–1846 (2022). [CrossRef]

33. K. Monakhova, T. Vi, G. Kuo, and L. Waller, “Untrained networks for compressive lensless photography,” Opt. Express 29(13), 20913–20929 (2021). [CrossRef]

34. X. Chen, X. Pan, T. Nakamura, S. Takeyama, T. Shimano, K. Tajima, and M. Yamaguchi, “Wave-optics-based image synthesis for super resolution reconstruction of a FZA lensless camera,” Opt. Express 31(8), 12739–12755 (2023). [CrossRef]

35. J. Dong, S. Roth, and B. Schiele, “DWDN: Deep Wiener Deconvolution Network for Non-Blind Image Deblurring,” IEEE Trans. Pattern Anal. Mach. Intell. 44(12), 9960–9976 (2022). [CrossRef]

36. Z. Shi, Y. Bahat, S.-H. Baek, Q. Fu, H. Amata, X. Li, P. Chakravarthula, W. Heidrich, and F. Heide, “Seeing through obstructions with diffractive cloaking,” ACM Trans. Graph. 41(4), 1–15 (2022). [CrossRef]

37. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The Unreasonable effectiveness of deep features as a perceptual metric,” in 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), 586–595.

Method	Component				Metric		Time (ms)
Method	Physics prior	CNN denoiser	Optimized measurement	Optimized PSF	PSNR	LPIPS	Time (ms)
Wiener	✓				11.54	0.6265	0.4
U-Net		✓			20.23	0.2864	5.0
Wiener-U	✓	✓			21.46	0.2308	5.1
MWDN	✓		✓		25.03	0.1785	5.1
MWDN-CPSF	✓		✓	✓	27.06	0.1501	6.1
UPDN	✓	✓	✓	✓	25.06	0.2240	13.0
MMCN	✓	✓			23.09	0.2030	24.1
Le-ADMM-U	✓	✓			22.54	0.2087	18.6

Method	Metric		Time (ms)
Method	PSNR	LPIPS	Time (ms)
Wiener-U	22.97	0.2282	4.8
MWDN	24.18	0.1904	5.6
MWDN-CPSF	25.02	0.1573	6.4
UPDN	23.72	0.5071	15.0
MMCN	23.33	0.2323	22.9
Le-ADMM-U	23.35	0.2119	18.2

Method	Component				Metric		Time (ms)
Method	Physics prior	CNN denoiser	Optimized measurement	Optimized PSF	PSNR	LPIPS	Time (ms)
Wiener	✓				11.54	0.6265	0.4
U-Net		✓			20.23	0.2864	5.0
Wiener-U	✓	✓			21.46	0.2308	5.1
MWDN	✓		✓		25.03	0.1785	5.1
MWDN-CPSF	✓		✓	✓	27.06	0.1501	6.1
UPDN	✓	✓	✓	✓	25.06	0.2240	13.0
MMCN	✓	✓			23.09	0.2030	24.1
Le-ADMM-U	✓	✓			22.54	0.2087	18.6

Method	Metric		Time (ms)
Method	PSNR	LPIPS	Time (ms)
Wiener-U	22.97	0.2282	4.8
MWDN	24.18	0.1904	5.6
MWDN-CPSF	25.02	0.1573	6.4
UPDN	23.72	0.5071	15.0
MMCN	23.33	0.2323	22.9
Le-ADMM-U	23.35	0.2119	18.2

MWDNs: reconstruction in multi-scale feature spaces for lensless imaging

Abstract

1. Introduction

2. Related work

2.1 Lensless cameras

2.2 Lensless image reconstruction methods

3. Method

3.1 Physical model formulation

3.2 Networks

3.2.1 Multi Wiener deconvolution network (MWDN)

3.2.2 Multi Wiener deconvolution network on corrected PSF (MWDN-CPSF)

3.2.3 Other networks for ablation experiments

3.3 Loss functions

4. Implementation

5. Results

5.1 Test results on public datasets

5.2 Test results on our dataset

5.3 Real scene results

6. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (7)

Tables (2)

Equations (3)

Optics Express