
Deep camera obscura: an image restoration pipeline for pinhole photography


Abstract

Modern machine learning has enhanced image quality for consumer and mobile photography through low-light denoising, high dynamic range (HDR) imaging, and improved demosaicing, among other applications. While most of these advances target conventional lens-based cameras, an emerging body of research aims to improve photography with lensless cameras that use thin optics such as amplitude or phase masks, diffraction gratings, or diffusion layers. These lensless cameras suit size- and cost-constrained applications, such as tiny robotics and microscopy, that prohibit the use of a large lens. However, the earliest and simplest camera design, the camera obscura or pinhole camera, has been relatively overlooked in machine learning pipelines, with minimal research on enhancing pinhole camera images for everyday photography. In this paper, we develop an image restoration pipeline for the pinhole system that enhances pinhole image quality through joint denoising and deblurring. Our pipeline integrates optics-based filtering and reblur losses for reconstructing high-resolution still images (2600 × 1952), as well as temporal consistency for video reconstruction, to enable practical exposure times (30 FPS) for high-resolution video (1920 × 1080). We demonstrate 2D image quality on real pinhole images that is on par with or slightly better than other lensless cameras. This work opens up the potential for pinhole cameras to be used for photography in size-limited devices such as smartphones in the future.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

High-quality consumer photography is a modern reality, with smartphones incorporating multiple cameras whose sensors can exceed 50 MP and enable pro-level features such as high dynamic range (HDR) mode, panorama stitching, and augmented reality. These advancements have greatly improved the creativity and quality of everyday photography, yet lens-based cameras still have inherent limitations in the smartphone medium. Traditionally, smartphone image quality has been largely limited by the image sensor size. Larger sensors have better resolution, noise performance, and low-light detection, yet require larger, bulkier optics, which smartphones cannot accommodate while keeping phone thickness low.

To overcome these trade-offs for lenses, researchers have developed computational imaging algorithms to improve the performance of single-lens cameras (as opposed to the typical compound optics) [1]. Recently, lensless cameras have become a viable option to overcome size limitations in applications like smartphone photography, microscopy and endoscopy, and tiny/micro robotic vision platforms [2–4]. These cameras are generally low-cost, require simpler construction, and have a thinner form factor than the conventional multi-lens stack. Research in amplitude [5] and phase mask [6] lensless cameras has gained popularity, with the reconstructed 2D image quality being continually improved. Further, the use of diffractive optics coupled with computational imaging algorithms has enabled high-quality performance [7], including the use of metasurface optics [8].

This paper revisits the original pinhole camera as another candidate for lensless photography in size/cost-limited formats. Pinhole cameras have infinite depth-of-field, no lens-based optical distortion [9], and have a much thinner footprint than typical multi-element lenses used on larger micro-4/3 or full-frame sensors. However, pinhole cameras can suffer from diffraction blur in the image and are limited by less light throughput (e.g. f/200), introducing higher levels of noise. Both these factors remain largely unaddressed for pinhole cameras by the research community, particularly with machine learning techniques.

Mask-based lensless cameras utilize optical multiplexing to capture low-noise but heavily blurred raw images, which creates challenges for the reconstruction algorithms to recover the high-frequency details. Meanwhile, a pinhole’s optical point spread function (PSF) sacrifices relatively less high-frequency detail in exchange for higher noise due to low light. This tradeoff between better deblurring and denoising is common to the computational photography literature with most recent papers choosing to denoise multiple images rather than deblur them in both burst photography [10] and HDR imaging [11]. In this paper, we propose a full system pipeline for improving the quality of pinhole photography at short exposures, as demonstrated in Fig. 1. We leverage both imaging physics and data-driven networks to build a joint denoise and deblur pipeline. The problem is challenging as denoising and deblurring methods have competing objectives where the former reduces noise in high frequencies and the latter restores high-frequency detail in the image. To overcome this, we use knowledge of the pinhole’s PSF frequency cutoff to design a matching pre-filtering module, as well as implement a reblur loss to improve the denoise and deblur performance for low-light pinhole images. This pipeline is trained on synthetic data only but can effectively generalize to real-world data without any fine-tuning. We define practical pinhole photography to be images captured at a usable exposure time of 1/30s, and design a temporal consistency loss with a state-of-the-art optical flow network [12] to reconstruct high-quality video. Our specific contributions include the following:

  • An end-to-end imaging pipeline for practical pinhole photography with joint denoising and deblurring for low-light capture, leveraging both a data-driven network and imaging physics.
  • A pinhole image dataset with a measured HDR PSF suitable for generating synthetic data.
  • Detailed tradeoff analysis and ablation studies, including pinhole size, light loss, choice of denoising and deblurring networks, and the effects of ISO and exposure time.

We validate our pipeline by comparing against both conventional denoising/deblurring algorithms and traditional optimization-based lensless reconstruction algorithms. We also compare against end-to-end lensless camera solutions such as diffusion and coded mask cameras [5,6]. We note that these comparisons are not equal due to differing hardware sensor sizes and reconstruction algorithms used, but illustrate our reconstruction results in context. Our main argument is that while each form of lensless camera has its own strengths and weaknesses, our proposed framework can more adequately handle 2D everyday photography with high-resolution reconstruction as large as $2600\times 1952$ or 5 megapixels. We hope this work renews interest in practical pinhole photography in general.

2. Related work

Pinhole cameras. The camera obscura or pinhole camera has existed since antiquity [13] and was popularized during the Renaissance [14,15]. However, with the invention of lenses, film, and digital sensors, pinhole cameras largely fell out of common use due to their poor low-light performance and optical blur. Research on pinhole optics has focused on diffraction effects on image quality [9], optimal pinhole sizes to mitigate blur [16–18], and the transfer function for spatial frequencies [19]. In this paper, we show that such classical optical tradeoffs can be overcome with computational imaging techniques.

Pinhole cameras have been used extensively in scientific imaging applications [20–22], and serve as the basis for the camera model that underpins many geometric vision algorithms [23]. However, papers that leverage actual pinhole cameras, as opposed to the model, for computer vision have been sparse in the literature. Some recent inspiring work includes accidental pinhole and pinspeck cameras [24,25], showing that simple image processing can be applied to apertures occurring naturally in an environment.

Coded lensless cameras. As shown in Table 1, multi-aperture or multiplexed/coded lensless imaging systems typically allow more light throughput than a pinhole but typically at the expense of lower reconstructed 2D image quality, as we will demonstrate later in our experimental comparison to some systems. Coded lensless imaging can capture 3D information through their optical multiplexing which is useful for certain applications, but the focus of this paper is on 2D photography where the pinhole camera has a simpler optical design.


Table 1. Comparison of example lensless imaging systems

Coded lensless cameras feature optical elements that are thin and scalable to a small size in place of a main lens [5,6,26,27]. FlatCam [5] uses a coded amplitude mask to multiplex light and then reconstructs the image in post-processing. Since amplitude masks lose some light efficiency, newer designs have featured improved phase masks [28,29]. DiffuserCam uses diffusion layers that scatter light onto the sensor [6]. Alternatively, diffraction gratings [27,30,31] and Fresnel plates [32] have been used to achieve small form factors. We note that our pinhole camera still requires a vertical distance between the sensor and the aperture (flange distance of 19.25mm) due to the desired field-of-view and large sensor size. Other lensless cameras leveraging thin optics are flatter and have smaller form-factors. Thus our solution is a middle ground between lens-based cameras and these flat lensless cameras.

To reconstruct 2D lensless images, optimization algorithms such as the alternating direction method of multipliers (ADMM) [33], regularized $\ell _1$ [34], or total variation regularization [35] are adapted for lensless imaging [5,6]. Recently, deep learning has shown superior performance at lensless image reconstruction and other vision tasks [36–38].

Deep optics. There has been a series of recent works that co-design diffractive optical elements (DOEs) with deep learning algorithms for enhanced performance. These methods have improved depth from defocus [39], hyperspectral imaging [40–42], single-shot HDR imaging [43], and large field-of-view imaging [44]. Convolutional neural networks have even been optimized jointly with diffractive optical systems [45,46]. Further, state-of-the-art metasurface technology has led to promising results for achieving simple optical elements with high imaging quality [8]. Our work does not optimize the optical element itself, but rather focuses on how much image enhancement can be achieved for the relatively simple optics of a pinhole camera.

Image denoising and deblurring. Image denoising and deblurring are traditional low-level vision tasks with a rich history of research. Image denoising has been accomplished using a variety of classical methods [47–49] and deep learning networks [50–52]. Leveraging these advances, recent work has changed the photography landscape with successful applications in low-light and burst photography [53,54], as well as by considering realistic sensor noise in the image model [55,56].

Image deblurring is commonly tackled through either blind or non-blind image deconvolution [57,58], with the main focus on either motion or defocus blur. Techniques to handle motion blur in photography include traditional [59,60] and neural methods [61–63], including the DeblurGAN architectures [64,65] with high-quality performance. Defocus deblurring in optics research includes deconvolution in microscopy [66], dual pixels [67], and coded aperture systems [68,69].

Joint denoise/deblur methods have primarily focused on improving low-light noise and motion blur [10,70–73]. Such joint methods are challenging due to the competing natures of the two tasks: one aims to reduce high-frequency content (noise) in the image, while the other aims to restore high-frequency content (spatial details) lost due to optical blur. Our paper contributes to this literature by introducing a joint network architecture specifically designed for pinhole cameras that mitigates these competing issues. In particular, we show that leveraging the pinhole’s optical PSF frequency cutoff as a low-pass filter prior to denoising and deblurring helps the learned network improve reconstructions. Further, we extend this architecture to video with a temporal consistency loss.

3. Method

Our proposed approach involves five critical components: (1) accurate optical point spread function (PSF) capture and modeling, (2) a denoising architecture to improve the low-light captures, (3) high-quality deblurring to restore image details lost due to optical blur, (4) a reblur loss using the PSF to help jointly train the architecture, and (5) a temporal consistency loss leveraging optical flow to help mitigate temporal artifacts and flicker in pinhole video reconstruction at 30 FPS. A summary of our pipeline is visualized in Fig. 2.


Fig. 1. The proposed Deep Camera Obscura (DCO) pipeline is a jointly optimized denoise + deblur pipeline that restores degraded pinhole camera images to a quality suitable for perceptual viewing. It is trained on synthetic data and utilizes domain knowledge of the optical point spread function to help improve image restoration for pinhole cameras. The resulting DCO pipeline can operate on 5 MP images with 1/30s exposure time.


Fig. 2. DCO System Architecture. Due to diffraction and low light throughput, pinhole camera images exhibit two major artifacts, optical blur and sensor noise, which we tackle with a jointly optimized denoise and deblur framework. The proposed system turns blur into an advantage for better denoising. Knowing that a circular pinhole automatically performs ideal low-pass filtering (LPF) in optics (see Sec. 3), the denoise module first performs optics-aware LPF with an ideal filter that matches the pinhole’s diffraction limit, since any signal with frequency above the limit is due to noise. A denoise network then further cleans mid- and low-frequency noise, given that a deblur or deconvolution process is prone to amplifying noise. Finally, a GAN-based deblur network is used to recover high-frequency information lost in the pinhole imaging process to form the final output. All modules are jointly trained with data-driven, physical, and temporal consistency losses to eliminate GAN-style artifacts.


Fig. 3. Comparison between two lensless camera systems’ PSFs. Left is the PSF of the DiffuserCam [6] and right is the visualization of the measured HDR pinhole PSF, tone-mapped using [75]. Note that although the pinhole PSF’s energy is much more concentrated at the main peak, the PSF’s side lobes still carry a non-negligible amount of energy, causing blur and haze in the pinhole image and making the HDR measurement of the PSF necessary. Meanwhile, the DiffuserCam’s PSF is less structured, making it difficult to reconstruct the image with high fidelity in the details.


Image model. The optical point spread function (PSF) represents the impulse response of our pinhole imaging system from a single point source in the scene. For a wavelength of light $\lambda$, an imaging system with focal length $f$ and aperture of radius $R$ will produce an Airy disk with width $w = 1.22 \frac {\lambda f}{R}$ at the sensor plane. The pinhole PSF (i.e. the intensity distribution of this Airy pattern), as shown in Fig. 3, can be modeled by the Fraunhofer diffraction of the pinhole aperture. For a circular aperture with radius $R$, the PSF is given by [74],

$$P(x,y) = \left( \dfrac{A}{\lambda z} \right)^2 \left[ 2\dfrac{J_{1}(kR\sqrt{x^2+y^2}/z)}{kR\sqrt{x^2+y^2}/z} \right]^2,$$
where $(x,y)$ is the spatial coordinate on the observation plane, $A = \pi R^2$ is the aperture area, $z$ is the distance from the aperture to the observation plane, $J_1$ is the first-order Bessel function of the first kind, and $k = 2\pi / \lambda$ is the wavenumber. The Fraunhofer approximation is usually only valid for far-field observations, i.e. distance $z \gg R^2/\lambda$, which is satisfied by our pinhole camera system.
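
For illustration, the following is a minimal sketch of evaluating Eq. (1) on a sensor pixel grid. The wavelength, pixel pitch, and propagation distance are assumptions chosen only to roughly match the prototype described later (0.20mm pinhole diameter, 19.25mm flange distance); they are not parameters reported elsewhere in this paper.

```python
import numpy as np
from scipy.special import j1  # first-order Bessel function of the first kind

def airy_psf(size_px, pixel_pitch, wavelength, aperture_radius, z):
    """Evaluate the Fraunhofer PSF of a circular aperture (Eq. 1) on a pixel grid.

    size_px         -- side length of the PSF patch in pixels
    pixel_pitch     -- sensor pixel pitch in meters
    wavelength      -- illumination wavelength in meters
    aperture_radius -- pinhole radius R in meters
    z               -- aperture-to-sensor distance in meters
    """
    k = 2 * np.pi / wavelength                      # wavenumber
    A = np.pi * aperture_radius ** 2                # aperture area
    coords = (np.arange(size_px) - size_px // 2) * pixel_pitch
    x, y = np.meshgrid(coords, coords)
    r = np.sqrt(x ** 2 + y ** 2)
    u = k * aperture_radius * r / z                 # Bessel-function argument
    # 2*J1(u)/u -> 1 as u -> 0 (on-axis value); avoid division by zero on axis.
    airy = np.where(u == 0, 1.0, 2 * j1(u) / np.maximum(u, 1e-12))
    psf = (A / (wavelength * z)) ** 2 * airy ** 2
    return psf / psf.sum()                          # normalize to unit energy

# Example values (assumed): 0.1 mm pinhole radius, 19.25 mm distance,
# 550 nm green light, 3.75 um pixel pitch.
psf = airy_psf(size_px=255, pixel_pitch=3.75e-6,
               wavelength=550e-9, aperture_radius=0.1e-3, z=19.25e-3)
```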

The forward imaging model formulates the optical response of the camera from the PSF. Specifically, the captured image is the result of a convolution of the object and the PSF, represented as,

$$I(x,y) = O(x,y) \ast P(x,y),$$
where the image $I$, object $O$, and PSF are denoted as functions of position $(x, y)$ in the spatial domain.

This forward model is used extensively in our paper, both to create synthetic data for training our network and in a reblur loss that checks the consistency of our network’s restoration against the camera measurements.
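
As a concrete example, a hedged sketch of the forward model in Eq. (2), applied per color channel via FFT-based convolution, might look as follows; the function and variable names are illustrative rather than our exact implementation.

```python
import numpy as np

def pinhole_forward(image, psf):
    """Simulate the pinhole capture I = O * P (Eq. 2) per color channel via FFT convolution.

    image -- sharp image O, shape (H, W, 3), values in [0, 1]
    psf   -- measured (or simulated) PSF patch, shape (kh, kw)
    """
    H, W = image.shape[:2]
    kh, kw = psf.shape
    # Zero-pad the PSF to the image size and re-center it so the blur is not shifted.
    psf_pad = np.zeros((H, W), dtype=np.float64)
    psf_pad[:kh, :kw] = psf / psf.sum()
    psf_pad = np.roll(psf_pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    otf = np.fft.rfft2(psf_pad)
    blurred = np.empty_like(image, dtype=np.float64)
    for c in range(image.shape[2]):
        # FFT multiplication gives circular convolution; for training patches the
        # boundary wrap-around is typically acceptable.
        blurred[..., c] = np.fft.irfft2(np.fft.rfft2(image[..., c]) * otf, s=(H, W))
    return np.clip(blurred, 0.0, 1.0)
```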

Denoise module. Due to the small aperture of a pinhole, the resulting captured images have a high amount of noise, with photon shot noise dominating [76,77]. Thus the first step in our pipeline (after demosaicing the RAW image) is to perform denoising. One big advantage of the blur caused by the pinhole is that it imposes a global frequency limit on the true image intensity, so any signal above the PSF’s frequency cutoff is attributable to noise. We observed in practice that even a simple, ideal low-pass filter with this frequency cutoff used for denoising gave high-quality results when the image was passed to subsequent deblurring. Note that we also tried optimizing the low-pass filter as well as jointly learning the filter with DCO, but this did not result in superior performance compared to the simple ideal LPF.
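
A minimal sketch of this optics-aware ideal low-pass filter is shown below. The cutoff value is an assumption derived from the incoherent diffraction limit of a circular aperture, $2R/(\lambda z)$, with illustrative wavelength and pixel-pitch values; it is not the exact cutoff used in our implementation.

```python
import numpy as np

def ideal_lowpass(image, cutoff):
    """Zero out all spatial frequencies above `cutoff` (cycles/pixel).

    Any content beyond the pinhole's diffraction limit cannot come from the scene,
    so removing it suppresses noise without touching the recoverable signal.
    """
    H, W = image.shape[:2]
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    mask = (np.sqrt(fx ** 2 + fy ** 2) <= cutoff).astype(np.float64)
    out = np.empty_like(image, dtype=np.float64)
    for c in range(image.shape[2]):
        out[..., c] = np.real(np.fft.ifft2(np.fft.fft2(image[..., c]) * mask))
    return out

# The incoherent diffraction cutoff of a circular aperture is 2R/(lambda*z);
# converted to cycles/pixel via the pixel pitch (values here are illustrative).
cutoff = 2 * 0.1e-3 * 3.75e-6 / (550e-9 * 19.25e-3)   # ~0.07 cycles/pixel
```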

However, since noise can also occur at lower and middle frequencies, we utilize a denoise network based on the FFDNet architecture [51] to further mitigate the noise. Our choice of FFDNet was informed by an ablation study of various neural network architectures which is presented in Sec. 5.3. As discussed in [78], FFDNet’s unique and simple initial pixel-shuffling down-sample layer doubles the receptive field without increasing the network size, resulting in faster execution time and smaller memory footprint, while being able to handle various noise types. To train this architecture, we utilize synthetic data and simulate photon and Gaussian read noise for various ISOs for the camera as detailed in Sec. 4.

Deblur module. Once the denoised image is obtained, to restore the blurry pinhole image, we incorporate a GAN-based architecture with a modified feature pyramid generator and double-scale discriminator, as implemented in DeblurGANv2 [65]. We chose this architecture given its superior performance in our ablation study as compared to other deblurring architectures as shown in Sec. 5.3. Since the original model is trained for motion blur, the pre-trained model is not suitable for the optical blur in pinhole images. For our pipeline, we train the network on our own synthesized pinhole dataset for it to perform deblurring on pinhole images.

Reblur loss. One key challenge for a joint denoising and deblurring architecture is that deblurring can undo the effects of denoising and reintroduce noise-like artifacts into the final image. This is especially true for background patches in the image, where the deblur module must infer the missing data with few textures or patterns to draw from.

To alleviate this, we introduce joint training of our network using a reblur loss, $\mathcal {L}_{RB}$. Reblur losses have been introduced in other contexts, including motion blur [79,80] and lensless imaging [81,82]. We convolve the network’s generated output with the captured PSF from our camera system to form an estimated reblurred image, and then use an MSE loss between the reblurred image and the original lensless image to fine-tune our network during training. Thanks to the simplicity of the pinhole imaging model, the PSF-convolution reblur model is highly accurate and can be measured easily, making the reblur loss more effective than in motion blur cases [79,80].
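
A hedged PyTorch-style sketch of this reblur loss is given below, assuming the measured PSF fits in an odd-sized kernel; the tensor names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def reblur_loss(restored, pinhole_capture, psf):
    """MSE between the re-blurred restoration and the original pinhole measurement.

    restored        -- network output, shape (B, 3, H, W)
    pinhole_capture -- demosaiced pinhole capture, shape (B, 3, H, W)
    psf             -- measured PSF, odd-sized 2D tensor (kh, kw) normalized to sum to 1
    """
    kh, kw = psf.shape
    kernel = psf.expand(3, 1, kh, kw)  # one copy of the PSF per color channel
    # conv2d performs cross-correlation; for a (nearly) symmetric pinhole PSF this
    # matches convolution with the measured PSF.
    reblurred = F.conv2d(restored, kernel, padding=(kh // 2, kw // 2), groups=3)
    return F.mse_loss(reblurred, pinhole_capture)
```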

Temporal consistency for video reconstruction. One of the main goals of this paper is to perform video reconstruction for pinhole images captured at 30 FPS with short exposure. Running our network on individual frames produces sharp images, but the resulting video suffers from temporal artifacts including flicker and noise (please refer to the supplemental videos). To improve this, we develop a temporal consistency loss leveraging RAFT, a state-of-the-art optical flow network [12], inspired by a blind video consistency method [83]. In Fig. 2’s Temporal Consistency section, one can see how RAFT is used to warp the previous frame generated by the network to the current frame, which is then fed into the loss function,

$$\mathcal{L}_{TC} = M_{t-1 \Rightarrow t} ||O_t - \tilde{O}_t||_1,$$
where $O_{t}$ is the current output frame, $\tilde {O}_t$ is the previous frame $O_{t-1}$ warped to the current frame, and $M_{t-1 \Rightarrow t}=\text {exp}(-\alpha ||I_t - \hat {I}_t||_2^2)$ is a non-occlusion mask between the two consecutive frames generated by the occlusion detection method from [84].
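
For concreteness, a sketch of Eq. (3) in PyTorch is shown below. It assumes the optical flow relating the current frame to the previous frame has already been computed (e.g. with RAFT), and the warping helper, tensor names, and the mask parameter $\alpha$ are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp `frame` (B, C, H, W) using per-pixel flow (B, 2, H, W) that maps each
    current-frame pixel to its location in `frame` (backward warping)."""
    B, _, H, W = frame.shape
    gy, gx = torch.meshgrid(torch.arange(H, device=frame.device, dtype=frame.dtype),
                            torch.arange(W, device=frame.device, dtype=frame.dtype),
                            indexing="ij")
    # Normalize sample locations (x + u, y + v) to [-1, 1] for grid_sample.
    sx = 2.0 * (gx + flow[:, 0]) / (W - 1) - 1.0
    sy = 2.0 * (gy + flow[:, 1]) / (H - 1) - 1.0
    grid = torch.stack((sx, sy), dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)

def temporal_consistency_loss(out_t, out_prev, in_t, in_prev, flow_t_to_prev, alpha=50.0):
    """Eq. (3): masked L1 between the current output and the warped previous output.

    alpha is an illustrative choice; the mask follows M = exp(-alpha * ||I_t - I_hat_t||^2).
    """
    warped_prev_out = backward_warp(out_prev, flow_t_to_prev)   # \tilde{O}_t
    warped_prev_in = backward_warp(in_prev, flow_t_to_prev)     # \hat{I}_t
    mask = torch.exp(-alpha * (in_t - warped_prev_in).pow(2).sum(dim=1, keepdim=True))
    return (mask * (out_t - warped_prev_out).abs()).mean()
```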

4. Implementation

Training dataset. To train our network architecture, we use the HDR+ subset [54] containing $153$ images of $4048\times 3036$ resolution, trained with $256\times 256$ patches. We create a simulated blurred version of this dataset as the input during training by convolving the original data with the real captured PSF. The original data is then used as ground truth while training the network. We implement both photon shot noise and Gaussian read noise in our simulator, using a realistic noise simulator with parameters set for the ISO range $1600-25600$, which we randomly toggle during training. This includes applying Gaussian noise with variance between $0.03$ and $0.1$, and shot noise with variance between $0.01$ and $0.05$. For the RAW input to the pipeline, we apply black level correction and demosaicing as pre-processing steps. Post-processing consists of white balance, gamma correction, and exposure enhancement.
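
As an illustration of the noise-injection step, a minimal sketch using the variance ranges quoted above is given below; the Gaussian approximation of shot noise and the per-sample random draw are assumed simplifications, not the exact simulator used for training.

```python
import numpy as np

def add_simulated_noise(blurred, rng=None):
    """Add signal-dependent shot noise and signal-independent read noise to an
    image in [0, 1], with variances drawn from the ranges used during training."""
    rng = np.random.default_rng() if rng is None else rng
    shot_var = rng.uniform(0.01, 0.05)   # shot-noise variance range from the text
    read_var = rng.uniform(0.03, 0.10)   # read-noise variance range from the text
    # Gaussian approximation of photon shot noise: variance scales with intensity.
    shot = rng.normal(0.0, 1.0, blurred.shape) * np.sqrt(shot_var * np.clip(blurred, 0, 1))
    read = rng.normal(0.0, np.sqrt(read_var), blurred.shape)
    return np.clip(blurred + shot + read, 0.0, 1.0)
```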

Our training procedure consisted of the following: (1) separately training the denoise module on blurred images with noise, and the deblur module with the subsequent denoised output images. Both the FFDNet and DeblurGANv2 are initially trained from scratch on the synthetic dataset. (2) After separate training, joint finetuning of the architecture is done while incorporating the reblur loss. (3) For video results, we then perform further joint finetuning of the architecture with the temporal consistency loss as a separate model. Synthetic pinhole video data to train the video model is created using the DAVIS dataset [85] by convolving with the real captured PSF, similar to the single image training dataset.

Real data capture. We capture real pinhole images for both quantitative and qualitative analysis of our pipeline. We use a Panasonic Lumix G85 mirrorless camera with a micro 4/3 sensor and the Thingyfy Pinhole Pro with a pinhole diameter of $0.20$mm.

The $2600\times 1952$ resolution images are captured at 1/30 second exposure time with ISO $3200$. We perform black level correction, demosaicing, white balancing, and gamma correction. Finally, lens-shading correction for pinhole vignetting is calibrated by capturing a flat-white surface to measure the vignetting and create a compensation mask that is applied to all real pinhole images. Any desired exposure enhancement is then applied to the image.
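
A hedged sketch of this pre/post-processing chain is shown below; the helper names, the demosaicing routine, and the gamma value are placeholders rather than the exact ISP used in our pipeline.

```python
import numpy as np

def process_pinhole_raw(raw, black_level, white_level, demosaic, wb_gains, flat_field, gamma=2.2):
    """Minimal pre/post-processing for a real pinhole capture (steps listed above).

    raw        -- Bayer mosaic as a float array
    demosaic   -- any demosaicing function returning an (H, W, 3) RGB image in [0, 1]
    wb_gains   -- per-channel white-balance gains (r, g, b)
    flat_field -- normalized flat-white capture (H, W) used to correct pinhole vignetting
    """
    img = np.clip((raw - black_level) / (white_level - black_level), 0.0, 1.0)  # black level
    rgb = demosaic(img)                                        # demosaic to RGB
    rgb = rgb * np.asarray(wb_gains)[None, None, :]            # white balance
    rgb = rgb / np.clip(flat_field[..., None], 1e-3, None)     # lens-shading / vignetting mask
    return np.clip(rgb, 0.0, 1.0) ** (1.0 / gamma)             # gamma correction for display
```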

For real data capture, it is not trivial to capture ground truth images. A beamsplitter to optically align a lens camera with the pinhole camera would still introduce lens optical aberrations and finite depth-of-field into the ground truth. Thus in this paper, we choose to train on synthetic data only and use our network for inference on real data, showing our network’s generalizability. For real data, due to possible mismatch between real noise and our noise simulations, we lowpass filter the images with the PSF before running through our denoise and deblur modules. We found this helps improve results for the network as the denoise network is expecting noise statistics similar to synthetic data.

To compare against other lensless camera systems, namely FlatCam and DiffuserCam, we capture scenes displayed from a monitor similar to these papers. We display their test dataset and capture these with our pinhole camera. Note that our network is not trained on their datasets, but only tested on their images at inference.

Network hyperparameters and computational time. Our system was trained on two NVIDIA 2080Ti GPUs. While training, we use a batch size of $1$ image, which is cropped into multiple patches of size $256\times 256$ per iteration. For both the generator and discriminator, we use the Adam [86] optimizer with a learning rate of $1e-4$ for 50 epochs, then linearly decay the learning rate from $1e-4$ to $1e-7$ over another 200 epochs. For the generator loss, we assign weights to $\mathcal {L}_{MSE}$, $\mathcal {L}_{perc.}$, and $\mathcal {L}_{adv}$ of $\lambda _{MSE}=0.5$, $\lambda _{perc.}=0.006$, and $\lambda _{adv}=1e-3$, the last of which is used for both the generator and discriminator adversarial losses.
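
For reference, the weighted generator objective can be sketched as follows; the weight of the reblur term added during joint fine-tuning is not stated in this section, so it is left as an illustrative placeholder.

```python
def generator_objective(l_mse, l_perc, l_adv, l_reblur=None, lam_reblur=1.0):
    """Combine the generator loss terms with the weights reported above.

    l_mse, l_perc, l_adv -- scalar loss tensors (MSE, perceptual, adversarial)
    l_reblur             -- optional reblur loss used during joint fine-tuning;
                            its weight lam_reblur is an illustrative placeholder.
    """
    loss = 0.5 * l_mse + 0.006 * l_perc + 1e-3 * l_adv
    if l_reblur is not None:
        loss = loss + lam_reblur * l_reblur
    return loss
```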

Processing each frame through the pipeline on one RTX 2080Ti GPU takes $6.21$ seconds for full-resolution ($2600\times 1952$) still images and $4.94$ seconds per frame for full-HD ($1920\times 1080$) video.

Comparison to FlatCam [87] and DiffuserCam [81]. For our comparisons to other lensless cameras, we note that it is difficult to make perfect comparisons of the different masks and algorithms used due to differences in hardware, such as different camera and sensor sizes. The goal of this paper is not to compare across different mask types with the same reconstruction algorithms, but to show our reconstruction in context with that of FlatCam and DiffuserCam as alternative lensless methods. FlatCam has a sensor-to-mask distance of around 1.5mm and DiffuserCam places the mask around 6mm from the sensor. While these are much thinner than our current prototype’s minimum thickness of around 19.25mm (the micro 4/3 flange distance), this is mainly due to the much larger sensor ($17.3$mm $\times 13$mm) used in our camera. For a fixed field-of-view, if a smaller sensor was used, such as the $1/3.7"$ sensor ($4.2$mm$\times 2.4$mm) in the Basler Dart camera for DiffuserCam, the resulting crop factor would be $\sim 4.47\times$ which would result in an equivalent thickness of $\sim 4.3$mm for the pinhole camera.
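
The equivalent-thickness figure follows from a simple diagonal crop-factor calculation, sketched below with the sensor dimensions quoted above.

```python
import math

# Diagonal crop factor between the micro 4/3 sensor (17.3 mm x 13 mm) and the
# 1/3.7" sensor (4.2 mm x 2.4 mm), and the implied equivalent pinhole thickness.
d_m43 = math.hypot(17.3, 13.0)            # ~21.6 mm diagonal
d_small = math.hypot(4.2, 2.4)            # ~4.8 mm diagonal
crop_factor = d_m43 / d_small             # ~4.47x
equiv_thickness_mm = 19.25 / crop_factor  # ~4.3 mm for the same field of view
```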

For comparisons with FlatCam [87], we use the testing dataset from their paper, display the images on a monitor, and then capture them with an exposure time of 1/30s at ISO 3200 with our pinhole camera. Similarly, for the DiffuserCam comparisons, we display the ground-truth images used in their paper on an LCD monitor, capture the images using the lensless pinhole camera, and process them using the DCO framework. Note that DiffuserCam’s raw images are captured at $1920 \times 1080$, reconstructed at $480\times 270$, and cropped to $380\times 210$. For their evaluation, the reference images are captured using a lensed camera through a beam splitter, then resized and cropped to the same $480 \times 270$ size. In contrast, our raw captures and reconstructions are all conducted at $1920 \times 1080$, and the reference images in our evaluation are the original displayed images, resized to $1920 \times 1080$. Runtime for the DiffuserCam network Le-ADMM-U is 0.075 seconds, while FlatNet takes 0.006 seconds, as reported in their work. These are currently much quicker than our network’s 4.94 seconds, primarily because of the slower Fourier transform operations used in our low-pass filter, which we have not optimized for speed in this paper.

5. Results

5.1 End-to-end DCO restoration

Real-world scene results. We present results from real captured pinhole images in Fig. 4, using the Panasonic m4/3 camera and Thingyfy Pinhole Pro. These feature various scenes with different objects and textures. Note that these images have no corresponding ground truth, as we are not imaging a monitor as in our lensless system comparisons. We show our captured pinhole images in (a), the resulting DCO reconstructions in (b), and comparisons to two off-the-shelf optimization methods for deconvolution/deblurring, ADMM [33] (c) and the self-tuned Wiener filter [88] (d), as well as (e) DMPHN [89], a modern deblurring neural network. Note how our method recovers sharp, high-frequency details with less noise and fewer artifacts compared to ADMM, the Wiener filter, and DMPHN. In Fig. 5 we display results at the different steps of the DCO pipeline along with a reference image captured using a lens. All images are $2600\times 1952$ resolution, captured at 1/30 second exposure and ISO 3200.


Fig. 4. Real-world pinhole image restoration results. We compare the pinhole image restoration results of the proposed method (DCO) with other deconvolution methods, including traditional ADMM [33], the self-tuned Wiener filter [88], and the recent deep learning-based DMPHN [89]. The proposed DCO reduces noise, sharpens, and recovers more high-frequency details while causing fewer artifacts. The traditional deconvolution methods in (c) and (d) produce high amounts of artifacts while struggling to sufficiently handle noise, and the network-based method in (e) shows lower deblurring performance along with color mismatch and artifacts.


Fig. 5. Step-by-step results through the DCO pipeline.


Tradeoff analysis with lenses. Scaling up smartphone photography requires larger sensors, which force the lens optics to grow considerably in size and weight. Lenses also have inherent aberrations (spherical, chromatic, etc.), while lensless alternatives are much more lightweight and free of those aberrations. However, pinhole cameras allow much less light throughput due to the small aperture size and can be less sharp than lenses due to diffraction blur. We plot the quantitative effects of both in Fig. 6, showing the loss in light throughput with decreasing aperture size (or increasing f/#), as well as a modulation transfer function (MTF) analysis for the lens and pinhole optics. In Fig. 6 (top), we plot the experimental measurements in blue against the theoretical light loss in orange; our 0.20mm pinhole is equivalent to an f/130 lens. In Fig. 6 (bottom), we see that our DCO improves the MTF50 resolution of the pinhole system by 4.2x, reaching 63.5% of the lens counterpart. We calculate the MTF using the slanted-edge method [90] in the IMATEST software [91].
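
The theoretical light-loss curve in Fig. 6 (top) follows the usual $1/N^2$ scaling with f-number. The sketch below illustrates the scale of this loss relative to an assumed f/2.8 reference lens; that reference is an illustrative assumption and not a comparison made elsewhere in this paper.

```python
import math

# Light throughput of a circular aperture scales as 1/N^2 (N = f-number).
f_ref, f_pinhole = 2.8, 130.0                      # assumed reference lens vs. our f/130 pinhole
relative_throughput = (f_ref / f_pinhole) ** 2     # ~4.6e-4
stops_lost = math.log2(1.0 / relative_throughput)  # ~11 stops less light than f/2.8
```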


Fig. 6. Optical properties of the pinhole camera system. (Top) Light throughput vs. aperture size for a pinhole and lens. Experimentally captured values in blue are plotted against the theoretical light loss of a circular aperture in orange. (Middle) Images captured of the test chart used to determine the MTF. (Bottom) MTF plots for lens, pinhole, and pinhole with DCO outputs. The solid cyan and magenta markers denote the MTF50 and MTF30 values, and the dotted red, green and blue lines denote the MTF of the corresponding RGB channels, respectively. DCO increases the MTF50 by $4\times$ from 0.01326 to 0.05573 cycles/pixel which is 63.5% of the lens value. We calculate the MTF using the slanted edge methods [90] in the IMATEST software [91].


Lensless imaging comparisons. We perform a comparison with a state-of-the-art, end-to-end lensless camera system, FlatNet [87]. In Fig. 7, we show the qualitative results of this comparison. Note that the pinhole naturally captures less light than FlatCam due to its single aperture, as shown in (a) with the split visualization. This results in a higher noise level for our DCO reconstruction in (b). However, note that the details of the image are preserved at high spatial frequencies compared to FlatNet. FlatNet’s reconstructions (d) exhibit machine learning artifacts, including distorted signs and text as well as physically unrealistic deformations.


Fig. 7. Lensless comparisons with FlatNet. We compare the input (visualized in its actual form and brightened for display purposes) and reconstruction of our proposed system (DCO) with FlatCam/FlatNet [5,87] on monitor captures of the ILSVRC 2012 dataset [92]. Note that our input and results have higher frequency features without many reconstruction artifacts compared to the FlatNet results from [87].


In Fig. 8, we compare the results from our proposed lensless system (DCO) with a diffusion-based lensless camera, DiffuserCam [81]. Our DCO reconstructions in (e) are sharper and reconstruct more details at a higher resolution than the DiffuserCam Le-ADMM-U reconstruction in (f). However, our results do have slightly more noise artifacts due to the very low light throughput of the pinhole capture. Note that we also compare the results of ADMM for both our pinhole images and the DiffuserCam images; this shows that pinhole images are more easily inverted than DiffuserCam images given the same ADMM reconstruction backend. A similar comparison to FlatCam could not be made because no PSF was available from that hardware prototype. Our results show that pinhole cameras are a viable alternative to coded aperture cameras and can handle large image resolutions.


Fig. 8. Lensless comparison with DiffuserCam. We compare the input (visualized in its actual form and brightened for display purposes) and reconstruction of our proposed system (DCO) with DiffuserCam, a phase mask-based lensless imaging system, on the MIRFLICKR dataset [93]. Note that our input and results have higher frequency features without many reconstruction artifacts compared to the DiffuserCam results from [81].


In Table 2, we report our PSNR compared to FlatNet and DiffuserCam (ours is higher by a small margin), using the numbers reported by those papers [81,87].


Table 2. Quantitative comparison of lensless systems. Each lensless system is paired with an associated state-of-the-art reconstruction algorithm. PSNR is reported for single frame results (higher is better), and temporal consistency is reported for video (lower is better). Pinhole + DCO achieves superior performance in these two metrics to both FlatCam and DiffuserCam.

User study. In addition to the qualitative results shown here, we conducted a user study to determine visual preference when comparing our method (DCO) to FlatCam and DiffuserCam reconstructions. 15 subjects, with no prior knowledge of the lensless methods or research, participated in the study.

The study was conducted in a two-alternative forced-choice format, where subjects were presented with two reconstructed image results for a particular sample and required to pick their preferred choice. We use 10 sample images from the MIRFLICKR dataset [93] to compare our method with DiffuserCam, and 10 sample images from the ILSVRC dataset [92] to compare our method with FlatCam. Each trial randomly picks a sample from either dataset and randomly positions our method versus the alternative on the left or right side. The subjects were instructed to pick the image they found more visually appealing and/or of higher quality.

The results from the user study, shown in Fig. 9, favor our method over the alternatives. For the DiffuserCam comparisons, our method is overwhelmingly preferred (choice ratio of $86.67\%$) for the higher-resolution and high-frequency detail in the reconstructions. The results are closer when compared to FlatCam, as our results are slightly noisier with some color inconsistencies. However, the choice ratio is still $71.33\%$, showing a preference for DCO.


Fig. 9. User study for lensless method preference with 15 subjects in a two-alternative forced choice format. Users were instructed to pick which of two images they preferred for its visual quality. The choice ratio for DCO:FlatCam was $71.33\%$ and DCO:DiffuserCam was $86.67\%$.


Video reconstruction. Finally, we show real-world video captured at 1/30 second exposure with our pinhole camera in Fig. 10. As one can see from the video frames, the input pinhole images are blurry and suffer from noise. Our method improves sharpness around object edges and enhances textures without noticeable noise degradation. We include additional video examples with comparisons to other video temporal consistency training methods in the supplemental material. We also compare the temporal consistency metric against FlatCam and DiffuserCam in Table 2; our lower value indicates fewer flickering artifacts in our video reconstructions.


Fig. 10. DCO video results. We show selected frames ($1920 \times 1080$ resolution at 1/30 seconds exposure) from pinhole videos captured by our pinhole camera setup and restored with the proposed DCO framework. The DCO results are perceptually sharper and less noisy than the original pinhole inputs. We refer the readers to the supplementary material package for the full-length videos.


5.2 Ablation studies for camera parameters

Pinhole size. We performed an ablation study on the size of our pinhole aperture to determine which gives optimal results after DCO reconstruction. Note that while there are analytical formulas for the optimal pinhole size depending on the optical forward model [16–18], due to the data-driven nature of our reconstruction pipeline we instead experimentally determine the aperture size that performs best when coupled with the end-to-end pipeline. In Table 3, we show MTF50 results for our aperture sweep. We chose 0.2mm as the pinhole size, as it preserves the highest spatial frequencies with DCO.


Table 3. Ablation study of pinhole size with modulation transfer function (MTF). Note that pinhole apertures 0.10-0.30mm are compared with their native MTF as well as their MTFs after DCO reconstruction.


Table 4. Quantitative performance comparison of denoise network candidates.


Table 5. Quantitative performance comparison of deblur network candidates.

Exposure time. Under the very low light throughput at which the pinhole camera operates, the noise is dominated by photon noise. To understand photon noise’s impact on the reconstruction, we fixed the digital gain of the camera at ISO 6400 and conducted reconstructions at various exposure times. The results are shown in Fig. 11. We observe that reconstruction, particularly denoising, becomes more challenging as light throughput decreases.


Fig. 11. Exposure sweep experiment. We captured data with the ISO fixed at 6400 and performed reconstruction with the proposed DCO pipeline. The top row is the input, the middle row is the brightened input, and the last row is the DCO reconstruction. The results indicate that light throughput has a noticeable impact on the final reconstruction quality.


ISO. In contrast to the exposure sweep experiment, we also fixed the exposure time at 1/30 seconds and swept across ISO settings from 1600 to 25600. In this case, the light throughput is constant, and the read noise is amplified by varying amounts depending on the ISO setting. As Fig. 12 shows, we found that ISO settings do not noticeably impact the final reconstruction results.


Fig. 12. ISO sweep experiment. We captured data with the exposure time fixed at 1/30 seconds and performed reconstruction with the proposed DCO framework. The top row is the input, the middle row is the brightened input, and the last row is the DCO reconstruction. The results indicate that ISO does not significantly impact the final reconstruction results.


Depth-of-field. One advantage of pinhole cameras is that they have a virtually infinite depth-of-field compared to lenses. We demonstrate this in Fig. 13, where the top-left image shows a short depth-of-field using a Lumix lens at f/5.6, the bottom-left shows a large depth-of-field with the Lumix lens at f/22, the top-right shows a pinhole capture with infinite depth-of-field, and the bottom-right shows the pinhole capture after DCO. While the original pinhole image is not as sharp as the large depth-of-field lens image, this is only due to the inherent diffraction blur rather than focus blur, and it is consistent at all depths. We can see this in the difference in focus sharpness across depths between the pinhole and the Lumix lens at f/5.6.


Fig. 13. Depth-of-field comparison between lens and pinhole. (Upper-left) An f/5.6 aperture lens; (Upper-right) pinhole capture; (Lower-left) an f/22 aperture lens; (Lower-right) pinhole + DCO output. Note how the pinhole capture has a consistent diffraction blur independent of depth, with a depth-of-field larger than that of the f/5.6 lens. After DCO, the pinhole image recovers a sharp image over a large depth of field. Note that this is still not as sharp as the f/22 lens, but the pinhole is a lensless imaging system with far less size and weight than a lens.


5.3 Network ablation studies

Denoise module. In Table 4 we benchmark three denoise network candidates for denoising performance on the synthetic pinhole noise data. Overall, FFDNet [51] performs better than FOCNet [94], and is on par with FC-AIDE [95] quantitatively but with improved perceptual quality. Therefore, we chose FFDNet for our final pipeline based on this ablation study.

Deblur module. We also performed an ablation study to compare three different deblurring networks for our pipeline on the synthetic pinhole blur data. These networks were DMPHN [89], SIUN [96], and DeblurGANv2 [65]. As we can see from Table 5, DeblurGANv2 achieved the highest scores and thus was our choice for the deblurring module in our pipeline.

Optimizing the low-pass filter. The current DCO pipeline uses an ideal low-pass filter whose frequency cutoff is informed by the global frequency limit from the pinhole’s PSF. In addition, we also attempted to learn the low-pass filter by making its parameters differentiable. We show an example result in Fig. 14, where the first column is the pinhole camera input, the second column is the restored DCO result with the ideal low-pass filter, the third column is the restored DCO result with a separately trained low-pass filter that is then applied to every input to the rest of the pipeline, and the last column is the restored DCO result when the entire pipeline is jointly trained with a learnable low-pass filter. As shown in the result, the separately trained low-pass filter improves sharpening after deblurring but also introduces severe artifacts in flat regions. Jointly training the learnable low-pass filter with the full pipeline leads to sharpening close to the ideal low-pass filter while still showing slight artifacts. We therefore use the simpler PSF-informed low-pass filter, since the slight improvement in sharpening did not justify the added artifacts and complexity.


Fig. 14. Learning low-pass filters for the DCO input. (i) Pinhole input; (ii) DCO with the ideal low-pass filter; (iii) DCO with a learned low-pass filter (separately trained); and (iv) DCO with a jointly trained low-pass filter. Note how the separately trained low-pass filter generates sharp reconstructions but causes significant artifacts in flat regions of the image, while the jointly trained method is on par with the ideal low-pass filter. For reasons of network complexity and size, we decided to use the ideal low-pass filter for our final DCO pipeline.


Video temporal consistency method. Applying a network-based pipeline to restore video data can cause temporal fluctuations in the output from frame to frame, such as flickering and artifacts. To overcome this, we need to improve the temporal consistency of the output. We attempted three different methods: (1) multi-frame input with single-frame output, (2) separately training a temporal consistency network, and (3) fine-tuning DCO with a temporal consistency loss. We show the results of all three methods on a few video examples (Visualization 1, Visualization 2, and Visualization 3) in the supplemental material files DCO_video1.avi, DCO_video2.avi, and DCO_video3.avi. Figure 15 is an example frame demonstrating the comparison format of the videos. The top-left video is the pinhole input; the top-right uses the DCO pipeline with 3 sequential frames (previous, current, and next) as the input and outputs a single frame. The bottom-left uses a temporal consistency network [83] as an additional module that we train separately on the reconstructed DCO outputs. The bottom-right is our current method, which utilizes optical flow with RAFT [12] to fine-tune DCO with a temporal consistency loss. The videos show that while our chosen method reduces sharpness slightly compared to the other methods, the amount of sharpening looks more realistic, whereas the other methods over-sharpen edges. This contributes to more consistent sharpening of edges and details from frame to frame.


Fig. 15. Example video frame. (top-left) pinhole input, (top-right) DCO with multi-input single-output, (bottom-left) DCO with additional temporal consistency module (trained separately), (bottom-right) DCO fine-tuned with temporal consistency loss using optical flow. See Visualization 1, Visualization 2 and Visualization 3 for video examples.


6. Discussion

We present a practical pipeline to restore low-light and diffraction-blurred pinhole photography images. The pipeline first performs denoising using a low-pass filter with a frequency limit prior based on the optical point spread function. It then passes the low-pass filtered image through a deep learning-based denoising network trained jointly with a GAN-based deblurring module on our synthesized pinhole dataset. We introduce reblur and temporal consistency losses to enhance the performance of the joint network architecture. We verify our pipeline on test data from the synthesized HDR+ dataset and on real-world captured pinhole images to show qualitative improvements. Our method generalizes to unseen data despite training on synthetic data only, and we process $1920 \times 1080$ video (larger than most competing lensless cameras) at 1/30 second exposure times.

Limitations and potential impacts of the technology. Our deblurring results could still improve, as the high-frequency details are not as sharp as lens-based systems. While the pinhole camera’s infinite DoF is desirable in some size-limited scenarios, such as smartphone landscape photography, lensless microscopy [3], and tiny/micro robotics [4], a very shallow DoF is sometimes aesthetically desired for portrait photography, which our work cannot achieve. Moreover, unlike other lensless cameras, a pinhole camera’s flatness is determined by the desired field-of-view and the sensor size. Reducing the focal length and the sensor size of the pinhole camera can both result in a flatter camera, but a shorter focal length would also introduce more vignetting, noise, and PSF variation across the frame. This would need to be overcome with additional methods, and a smaller sensor pixel pitch could mean a larger circle of confusion for the diffraction blur. Further, lightweight networks are needed for deployment on embedded platforms and for processing real-time pinhole video. Lastly, the use of this technology should follow ethical and privacy guidelines in surveillance applications due to the small form factor of the pinhole.

Future work. A detailed comparative study with the same hardware sensor and reconstruction algorithm but varying the type of optical mask (amplitude, phase, diffractive, pinhole) would be valuable to the lensless imaging community. In addition, there may be further avenues for revisiting pinhole cameras beyond everyday photography for computer vision applications. Since the exposure of the image can be fixed, HDR imaging for pinhole cameras is an interesting direction that can further their potential for smartphone photography. Additionally, stereo pinhole photography can open up use cases for 3D depth estimation for SLAM algorithms and robotics in general. Finally, adapting existing burst denoising algorithms for extreme low-light pinhole imagery is an interesting problem to explore, since the textures required by the alignment step in burst denoisers may be hidden below the sensor’s noise floor.

Funding

SenseBrain Technology.

Acknowledgments

The authors acknowledge Research Computing at Arizona State University for providing GPU resources that have contributed to the research results reported within this paper.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [97].

References

1. F. Heide, M. Rouf, M. B. Hullin, B. Labitzke, W. Heidrich, and A. Kolb, “High-quality computational imaging through simple lenses,” ACM Trans. Graph. 32(5), 1–14 (2013). [CrossRef]  

2. V. Boominathan, J. T. Robinson, L. Waller, and A. Veeraraghavan, “Recent advances in lensless imaging,” Optica 9(1), 1–16 (2022). [CrossRef]  

3. A. Ozcan and E. McLeod, “Lensless imaging and sensing,” Annu. Rev. Biomed. Eng. 18(1), 77–102 (2016). PMID: 27420569. [CrossRef]  

4. S. J. Koppal, “A survey of computational photography in the small: Creating intelligent cameras for the next wave of miniature devices,” IEEE Signal Process. Mag. 33(5), 16–22 (2016). [CrossRef]  

5. M. S. Asif, A. Ayremlou, A. Sankaranarayanan, A. Veeraraghavan, and R. G. Baraniuk, “Flatcam: Thin, lensless cameras using coded aperture and computation,” IEEE Trans. Comput. Imaging 3(3), 384–397 (2017). [CrossRef]  

6. N. Antipa, G. Kuo, R. Heckel, B. Mildenhall, E. Bostan, R. Ng, and L. Waller, “Diffusercam: lensless single-exposure 3d imaging,” Optica 5(1), 1–9 (2018). [CrossRef]  

7. F. Heide, Q. Fu, Y. Peng, and W. Heidrich, “Encoded diffractive optics for full-spectrum computational imaging,” Sci. Rep. 6(1), 33543 (2016). [CrossRef]  

8. E. Tseng, S. Colburn, J. Whitehead, L. Huang, S.-H. Baek, A. Majumdar, and F. Heide, “Neural nano-optics for high-quality thin lens imaging,” Nat. Commun. 12(1), 6493 (2021). [CrossRef]  

9. M. Young, “Pinhole optics,” Appl. Opt. 10(12), 2763–2767 (1971). [CrossRef]  

10. O. Liba, K. Murthy, Y.-T. Tsai, T. Brooks, T. Xue, N. Karnad, Q. He, J. T. Barron, D. Sharlet, R. Geiss, S. W. Hasinoff, Y. Pritch, and M. Levoy, “Handheld mobile photography in very low light,” ACM Trans. Graph. 38(6), 1–16 (2019). [CrossRef]  

11. L. Zhang, A. Deshpande, and X. Chen, “Denoising vs. deblurring: Hdr imaging techniques using moving cameras,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (IEEE, 2010), pp. 522–529.

12. Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” in European Conference on Computer Vision, (Springer, 2020), pp. 402–419.

13. J. H. Hammond, The Camera Obscura, A Chronicle (Taylor & Francis, 1981).

14. G. Della Porta, Magia naturalis, vol. 2 (1714).

15. L. Da Vinci, “Codex atlanticus,” Biblioteca Ambrosiana, Milan 26 (1894).

16. A. C. Hardy, A. C. Hardy, and F. H. Perrin, The principles of optics (McGraw-Hill book Company, Incorporated, 1932).

17. R. Kingslake, Lenses in Photography (A. S. Barnes and Co., 1963).

18. K. Sayanagi, “Pinhole imagery*,” J. Opt. Soc. Am. 57(9), 1091–1098 (1967). [CrossRef]  

19. R. E. Swing and D. P. Rooney, “General transfer function for the pinhole camera,” J. Opt. Soc. Am. 58(5), 629–635 (1968). [CrossRef]  

20. P. A. Newman and V. E. Rible, “Pinhole array camera for integrated circuits,” Appl. Opt. 5(7), 1225–1228 (1966). [CrossRef]  

21. G. Druart, N. Guérineau, J. Taboury, S. Rommeluère, R. Haïdar, J. Primot, M. Fendler, and J.-C. Cigna, “Compact infrared pinhole fisheye for wide field applications,” Appl. Opt. 48(6), 1104–1113 (2009). [CrossRef]  

22. O. Ivanov, A. Sudarkin, V. Stepanov, and L. Urutskoev, “Portable x-ray and gamma-ray imager with coded mask: performance characteristics and methods of image reconstruction,” Nucl. Instrum. Methods Phys. Res., Sect. A 422(1-3), 729–734 (1999). [CrossRef]  

23. P. Sturm, Pinhole Camera Model (Springer US, Boston, MA, 2014), pp. 610–613.

24. A. Torralba and W. T. Freeman, “Accidental pinhole and pinspeck cameras: Revealing the scene outside the picture,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2012), pp. 374–381.

25. A. Torralba and W. T. Freeman, “Accidental pinhole and pinspeck cameras,” Int. J. Comput. Vis. 110(2), 92–112 (2014). [CrossRef]  

26. J. Tanida, T. Kumagai, K. Yamada, S. Miyatake, K. Ishida, T. Morimoto, N. Kondou, D. Miyazaki, and Y. Ichioka, “Thin observation module by bound optics (tombo): concept and experimental verification,” Appl. Opt. 40(11), 1806–1813 (2001). [CrossRef]  

27. P. R. Gill, C. Lee, D.-G. Lee, A. Wang, and A. Molnar, “A microscale camera using direct fourier-domain scene capture,” Opt. Lett. 36(15), 2949–2951 (2011). [CrossRef]  

28. V. Boominathan, J. Adams, J. Robinson, and A. Veeraraghavan, “Phlatcam: Designed phase-mask based thin lensless camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).

29. Y. Wu, V. Boominathan, H. Chen, A. Sankaranarayanan, and A. Veeraraghavan, “Phasecam3d—learning phase masks for passive single view depth estimation,” in 2019 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2019), pp. 1–12.

30. P. R. Gill and D. G. Stork, “Lensless ultra-miniature imagers using odd-symmetry spiral phase gratings,” in Computational Optical Sensing and Imaging, (Optical Society of America, 2013), pp. CW4C–3.

31. M. Hirsch, S. Sivaramakrishnan, S. Jayasuriya, A. Wang, A. Molnar, R. Raskar, and G. Wetzstein, “A switchable light field camera architecture with angle sensitive pixels and dictionary-based sparse coding,” in 2014 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2014), pp. 1–10.

32. K. Tajima, T. Shimano, Y. Nakamura, M. Sao, and T. Hoshizawa, “Lensless light-field imaging with multi-phased fresnel zone aperture,” in 2017 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2017), pp. 1–7.

33. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Found. Trends Mach. Learn. 3(1), 1–122 (2010). [CrossRef]  

34. A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sci. 2(1), 183–202 (2009). [CrossRef]  

35. L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D 60(1-4), 259–268 (1992). [CrossRef]  

36. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017). [CrossRef]  

37. S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5(7), 803–813 (2018). [CrossRef]  

38. Y. Li, Y. Xue, and L. Tian, “Deep speckle correlation: a deep learning approach toward scalable imaging through scattering media,” Optica 5(10), 1181–1190 (2018). [CrossRef]  

39. H. Ikoma, C. M. Nguyen, C. A. Metzler, Y. Peng, and G. Wetzstein, “Depth from defocus with learned optics for imaging and occlusion-aware depth estimation,” in 2021 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2021), pp. 1–12.

40. X. Dun, H. Ikoma, G. Wetzstein, Z. Wang, X. Cheng, and Y. Peng, “Learned rotationally symmetric diffractive achromat for full-spectrum computational imaging,” Optica 7(8), 913–922 (2020). [CrossRef]  

41. S.-H. Baek, H. Ikoma, D. S. Jeon, Y. Li, W. Heidrich, G. Wetzstein, and M. H. Kim, “Single-shot hyperspectral-depth imaging with learned diffractive optics,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 2651–2660.

42. H. Arguello, S. Pinilla, Y. Peng, H. Ikoma, J. Bacca, and G. Wetzstein, “Shift-variant color-coded diffractive spectral imaging system,” Optica 8(11), 1424–1434 (2021). [CrossRef]  

43. C. A. Metzler, H. Ikoma, Y. Peng, and G. Wetzstein, “Deep optics for single-shot high-dynamic-range imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 1375–1385.

44. Y. Peng, Q. Sun, X. Dun, G. Wetzstein, W. Heidrich, and F. Heide, “Learned large field-of-view imaging with thin-plate optics,” ACM Trans. Graph. 38(6), 1–14 (2019). [CrossRef]  

45. H. G. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivaramakrishnan, A. Veeraraghavan, and A. Molnar, “ASP vision: Optically computing the first layer of convolutional neural networks using angle sensitive pixels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 903–912.

46. J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Sci. Rep. 8(1), 12324 (2018). [CrossRef]  

47. K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Trans. on Image Process. 16(8), 2080–2095 (2007). [CrossRef]

48. A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2 (IEEE, 2005), pp. 60–65.

49. M. Lebrun, A. Buades, and J.-M. Morel, “A nonlocal bayesian image denoising algorithm,” SIAM J. Imaging Sci. 6(3), 1665–1688 (2013). [CrossRef]  

50. K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. on Image Process. 26(7), 3142–3155 (2017). [CrossRef]

51. K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN-based image denoising,” IEEE Trans. on Image Process. 27(9), 4608–4622 (2018). [CrossRef]

52. K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), pp. 3929–3938.

53. C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 3291–3300.

54. S. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy, “Burst photography for high dynamic range and low-light imaging on mobile cameras,” SIGGRAPH Asia (2016).

55. S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 1712–1722.

56. A. Abdelhamed, M. A. Brubaker, and M. S. Brown, “Noise flow: Noise modeling with conditional normalizing flows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 3165–3173.

57. P. Campisi and K. Egiazarian, Blind image deconvolution: theory and applications (CRC Press, 2017).

58. D. Kundur and D. Hatzinakos, “Blind image deconvolution,” IEEE Signal Process. Mag. 13(3), 43–64 (1996). [CrossRef]  

59. S. K. Nayar and M. Ben-Ezra, “Motion-based motion deblurring,” IEEE Trans. Pattern Anal. Machine Intell. 26(6), 689–698 (2004). [CrossRef]  

60. S. Cho and S. Lee, “Fast motion deblurring,” in ACM SIGGRAPH Asia 2009 Papers, (Association for Computing Machinery, New York, NY, USA, 2009), SIGGRAPH Asia ’09.

61. A. Chakrabarti, “A neural approach to blind motion deblurring,” in European Conference on Computer Vision, (Springer, 2016), pp. 221–235.

62. L. Xu, S. Zheng, and J. Jia, “Unnatural L0 sparse representation for natural image deblurring,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2013), pp. 1107–1114.

63. T. Eboli, J. Sun, and J. Ponce, “End-to-end interpretable learning of non-blind image deblurring,” in ECCV 2020-16th European Conference on Computer Vision, (2020).

64. O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “DeblurGAN: Blind motion deblurring using conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 8183–8192.

65. O. Kupyn, T. Martyniuk, J. Wu, and Z. Wang, “DeblurGAN-v2: Deblurring (orders-of-magnitude) faster and better,” in The IEEE International Conference on Computer Vision (ICCV), (2019).

66. P. Sarder and A. Nehorai, “Deconvolution methods for 3-D fluorescence microscopy images,” IEEE Signal Process. Mag. 23(3), 32–45 (2006). [CrossRef]

67. A. Abuolaim and M. S. Brown, “Defocus deblurring using dual-pixel data,” in European Conference on Computer Vision, (Springer, 2020), pp. 111–126.

68. C. Zhou and S. Nayar, “What are good apertures for defocus deblurring?” in 2009 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2009), pp. 1–8.

69. C. Zhou, S. Lin, and S. K. Nayar, “Coded aperture pairs for depth from defocus and defocus deblurring,” Int. J. Comput. Vis. 93(1), 53–72 (2011). [CrossRef]  

70. J. Mustaniemi, J. Kannala, J. Matas, S. Särkkä, and J. Heikkilä, “LSD2 – joint denoising and deblurring of short and long exposure images with CNNs,” arXiv preprint arXiv:1811.09485 (2018).

71. Y.-W. Tai and S. Lin, “Motion-aware noise filtering for deblurring of noisy and blurry images,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2012), pp. 17–24.

72. S. Diamond, V. Sitzmann, S. Boyd, G. Wetzstein, and F. Heide, “Dirty pixels: Optimizing image classification architectures for raw sensor data,” arXiv e-prints, arXiv:1701 (2017).

73. T. Eboli, J. Sun, and J. Ponce, “Learning to jointly deblur, demosaick and denoise raw images,” arXiv preprint arXiv:2104.06459 (2021).

74. J. W. Goodman, Introduction to Fourier optics (Roberts and Company Publishers, 2005).

75. P. E. Debevec and J. Malik, “Recovering high dynamic range radiance maps from photographs,” in Proceedings of the 24th annual Conference on Computer Graphics and Interactive Techniques, (1997), pp. 369–378.

76. D. X. D. Yang and A. E. Gamal, “Comparative analysis of SNR for image sensors with enhanced dynamic range,” in Sensors, Cameras, and Systems for Scientific/Industrial Applications, vol. 3649, M. M. Blouke and G. M. W. Jr., eds. (SPIE, 1999), pp. 197–211.

77. S. Jayasuriya, “Image sensors,” in Computer Vision – A Reference Guide, 2nd ed., K. Ikeuchi, ed. (2021), pp. 1–5.

78. M. Tassano, J. Delon, and T. Veit, “An analysis and implementation of the FFDNet image denoising method,” Image Processing On Line 9, 1–25 (2019). [CrossRef]

79. H. Chen, J. Gu, O. Gallo, M.-Y. Liu, A. Veeraraghavan, and J. Kautz, “Reblur2deblur: Deblurring videos via self-supervised learning,” in 2018 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2018), pp. 1–9.

80. T. Michaeli and M. Irani, “Blind deblurring using internal patch recurrence,” in European Conference on Computer Vision, (Springer, 2014), pp. 783–798.

81. K. Monakhova, J. Yurtsever, G. Kuo, N. Antipa, K. Yanny, and L. Waller, “Learned reconstructions for practical mask-based lensless imaging,” Opt. Express 27(20), 28075–28090 (2019). [CrossRef]  

82. J. D. Rego, K. Kulkarni, and S. Jayasuriya, “Robust lensless image reconstruction via PSF estimation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (2021), pp. 403–412.

83. W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-H. Yang, “Learning blind video temporal consistency,” in European Conference on Computer Vision, (2018).

84. M. Ruder, A. Dosovitskiy, and T. Brox, “Artistic style transfer for videos,” in German Conference on Pattern Recognition (GCPR), (2016).

85. F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in Computer Vision and Pattern Recognition, (2016).

86. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv 1412.6980 (2014).

87. S. S. Khan, V. Sundar, V. Boominathan, A. Veeraraghavan, and K. Mitra, “FlatNet: Towards photorealistic scene reconstruction from lensless measurements,” IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).

88. F. Orieux, J.-F. Giovannelli, and T. Rodet, “Bayesian estimation of regularization and point spread function parameters for wiener–hunt deconvolution,” J. Opt. Soc. Am. A 27(7), 1593–1607 (2010). [CrossRef]  

89. H. Zhang, Y. Dai, H. Li, and P. Koniusz, “Deep stacked hierarchical multi-patch network for image deblurring,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 5978–5986.

90. P. D. Burns, “Slanted-edge MTF for digital camera and scanner analysis,” in Proceedings of IS&T 2000 PICS Conference, (2000), pp. 135–138.

91. N. Koren, “The imatest program: comparing cameras with different amounts of sharpening,” in Digital Photography II, vol. 6069 (International Society for Optics and Photonics, 2006), p. 60690L.

92. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput. Vis. 115(3), 211–252 (2015). [CrossRef]  

93. M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in MIR ’08, (2008).

94. X. Jia, S. Liu, X. Feng, and L. Zhang, “FOCNet: A fractional optimal control network for image denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019).

95. S. Cha and T. Moon, “Fully convolutional pixel adaptive image denoiser,” (2019).

96. M. Ye, D. Lyu, and G. Chen, “Scale-iterative upscaling network for image deblurring,” IEEE Access 8, 18316–18325 (2020). [CrossRef]  

97. J. D. Rego, H. Chen, S. Li, J. Gu, and S. Jayasuriya, “Deep Camera Obscura Repository,” GitHub (2021), https://github.com/ImagingLyceum-ASU/dco-pinhole-restoration.

Supplementary Material (3)

Visualization 1: Video of outdoor plank wood and stones using Deep Camera Obscura reconstruction and comparing different temporal consistency methods
Visualization 2: Video of small stones and palm trees using Deep Camera Obscura reconstruction and comparing different temporal consistency methods
Visualization 3: Video of a street and large palm tree using Deep Camera Obscura reconstruction and comparing different temporal consistency methods

Data availability

Data underlying the results presented in this paper are available in Ref. [97].




Figures (15)

Fig. 1. The proposed Deep Camera Obscura (DCO) pipeline is a jointly optimized denoise + deblur pipeline that restores degraded pinhole camera images for perceptual viewing. It is trained on synthetic data and uses domain knowledge of the optical point spread function to improve image restoration for pinhole cameras. The resulting DCO pipeline can operate on 5 MP images with a 1/30 s exposure time.
Fig. 2. DCO system architecture. Due to diffraction and low light throughput, pinhole camera images exhibit two major artifacts: optical blur and sensor noise. We tackle both with a jointly optimized denoise and deblur framework that turns the blur into an advantage for better denoising. Because a circular pinhole inherently performs ideal low-pass filtering (LPF) in optics (see Sec. 3), the denoise module first applies an optics-aware LPF whose ideal cutoff matches the pinhole's diffraction limit, since any signal above that frequency is due to noise. A denoise network then cleans the remaining mid- and low-frequency noise, because the subsequent deblur or deconvolution step is prone to amplifying noise. Finally, a GAN-based deblur network recovers the high-frequency information lost in the pinhole imaging process to form the final output. All modules are jointly trained with data-driven, physical, and temporal consistency losses to eliminate GAN-style artifacts.
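The optics-aware LPF step in the Fig. 2 caption can be made concrete with a short sketch. This is a minimal, single-channel NumPy illustration written for this summary and not the released DCO code (Ref. [97]); the function name and the 0.26 mm pinhole, 550 nm wavelength, 25 mm pinhole-to-sensor distance, and 2.4 µm pixel pitch in the usage comment are example assumptions, not the paper's exact hardware. The cutoff follows standard Fourier optics [74]: the incoherent OTF of a circular aperture of diameter $D$ vanishes beyond $f_c = D/(\lambda z)$, so any spectral content above $f_c$ can be discarded before denoising.

```python
import numpy as np

def ideal_lowpass(img, pinhole_diameter, wavelength, distance, pixel_pitch):
    """Ideal low-pass filter at the pinhole diffraction limit (grayscale image).

    The incoherent OTF of a circular aperture is zero beyond
    f_c = D / (lambda * z) cycles per meter at the sensor, so any spectral
    content above f_c can only be noise and is removed before denoising.
    All physical quantities are in meters.
    """
    f_c = pinhole_diameter / (wavelength * distance)   # cutoff, cycles per meter
    f_c_pix = f_c * pixel_pitch                        # cutoff, cycles per pixel
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]                    # vertical frequencies (cyc/px)
    fx = np.fft.fftfreq(w)[None, :]                    # horizontal frequencies (cyc/px)
    passband = (fx**2 + fy**2) <= f_c_pix**2           # ideal circular pass band
    return np.real(np.fft.ifft2(np.fft.fft2(img) * passband))

# Example with hypothetical parameters: 0.26 mm pinhole, 550 nm light,
# 25 mm pinhole-to-sensor distance, 2.4 um pixel pitch.
# filtered = ideal_lowpass(noisy_capture, 0.26e-3, 550e-9, 25e-3, 2.4e-6)
```

Because the filter only zeroes frequencies that the pinhole optics cannot transmit, it suppresses noise without discarding recoverable scene detail, which is the rationale for placing it before the denoise network.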
Fig. 3. Comparison between two lensless camera systems' PSFs. Left: the PSF of the DiffuserCam [6]; right: a visualization of the measured HDR pinhole PSF, tone-mapped using [75]. Although the pinhole PSF's energy is much more concentrated at the main peak, its side lobes still carry a non-negligible amount of energy, causing blur and haze in the pinhole image and making the HDR measurement of the PSF necessary. The DiffuserCam's PSF, by contrast, is less structured, making it difficult to reconstruct the image with high fidelity in the details.
Fig. 4. Real-world pinhole image restoration results. We compare the pinhole image restoration results of the proposed method (DCO) with other deconvolution methods, including traditional ADMM [33], a self-tuned Wiener filter [88], and the recent deep learning-based DMPHN [89]. The proposed DCO reduces noise, sharpens, and recovers more high-frequency details while causing fewer artifacts. The traditional deconvolution methods in (c) and (d) produce high amounts of artifacts while struggling to sufficiently handle noise, and the network-based method in (e) shows lower deblurring performance along with color mismatch and artifacts.
Fig. 5. Step-by-step results through the DCO pipeline.
Fig. 6. Optical properties of the pinhole camera system. (Top) Light throughput vs. aperture size for a pinhole and a lens. Experimentally captured values in blue are plotted against the theoretical light loss of a circular aperture in orange. (Middle) Images captured of the test chart used to determine the MTF. (Bottom) MTF plots for the lens, pinhole, and pinhole with DCO outputs. The solid cyan and magenta markers denote the MTF50 and MTF30 values, and the dotted red, green, and blue lines denote the MTF of the corresponding RGB channels. DCO increases the MTF50 by $4\times$, from 0.01326 to 0.05573 cycles/pixel, which is 63.5% of the lens value. We calculate the MTF using the slanted-edge method [90] in the Imatest software [91].
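For readers who want to reproduce an MTF50 number such as the roughly $4\times$ improvement quoted above (0.05573 / 0.01326 ≈ 4.2), the sketch below estimates an MTF curve from a crop containing a near-vertical step edge. It is a simplified stand-in for the slanted-edge method [90] and the Imatest processing [91] used in the paper (no sub-pixel edge binning or noise correction), and the function names are our own illustration.

```python
import numpy as np

def mtf_from_edge(roi):
    """Estimate the MTF from an ROI containing a near-vertical step edge.

    Simplified chain: edge -> ESF -> LSF -> |FFT| -> MTF (without the 4x
    oversampling of the full slanted-edge method).
    """
    esf = roi.mean(axis=0)                   # average rows: edge spread function
    lsf = np.diff(esf)                       # derivative: line spread function
    lsf = lsf * np.hanning(lsf.size)         # window to limit truncation leakage
    mtf = np.abs(np.fft.rfft(lsf))
    mtf /= mtf[0]                            # normalize so MTF(0) = 1
    freqs = np.fft.rfftfreq(lsf.size)        # spatial frequency in cycles/pixel
    return freqs, mtf

def mtf50(freqs, mtf):
    """Frequency where the MTF first falls below 0.5 (linear interpolation)."""
    below = np.nonzero(mtf < 0.5)[0]
    if below.size == 0:
        return freqs[-1]                     # curve never drops below 0.5
    i = below[0]
    f0, f1, m0, m1 = freqs[i - 1], freqs[i], mtf[i - 1], mtf[i]
    return f0 + (0.5 - m0) * (f1 - f0) / (m1 - m0)
```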
Fig. 7. Lensless comparison with FlatNet. We compare the input (shown in its actual form and brightened for display purposes) and reconstruction of our proposed system (DCO) with FlatCam/FlatNet [5,87] on monitor captures of the ILSVRC 2012 dataset [92]. Note that our input and results have higher-frequency features without many reconstruction artifacts compared to the FlatNet results from [87].
Fig. 8. Lensless comparison with DiffuserCam. We compare the input (shown in its actual form and brightened for display purposes) and reconstruction of our proposed system (DCO) with DiffuserCam, a phase mask-based lensless imaging system, on the MIRFLICKR dataset [93]. Note that our input and results have higher-frequency features without many reconstruction artifacts compared to the DiffuserCam results from [81].
Fig. 9. User study of lensless method preference with 15 subjects in a two-alternative forced choice format. Users were instructed to pick which of two images they preferred for its visual quality. The choice ratio for DCO:FlatCam was $71.33\%$ and for DCO:DiffuserCam was $86.67\%$.
Fig. 10. DCO video results. We show selected frames ($1920 \times 1080$ resolution at 1/30 s exposure) from pinhole videos captured by our pinhole camera setup and restored with the proposed DCO framework. The DCO results are perceptually sharper and less noisy than the original pinhole inputs. We refer readers to the supplementary material for the full-length videos.
Fig. 11. Exposure sweep experiment. We captured data with the ISO fixed at 6400 and performed reconstruction with the proposed DCO pipeline. The top row is the input, the middle row is the brightened input, and the bottom row is the DCO reconstruction. The results indicate that light throughput does have a noticeable impact on the final reconstruction quality.
Fig. 12. ISO sweep experiment. We captured data with the exposure time fixed at 1/30 s and performed reconstruction with the proposed DCO framework. The top row is the input, the middle row is the brightened input, and the bottom row is the DCO reconstruction. The results indicate that ISO does not significantly impact the final reconstruction results.
Fig. 13. Depth-of-field comparison between lens and pinhole. (Upper-left) An f/5.6 aperture lens; (upper-right) pinhole capture; (lower-left) an f/22 aperture lens; (lower-right) pinhole + DCO output. Note how the pinhole capture has a consistent diffraction blur independent of depth and a depth of field larger than that of the f/5.6 lens. After DCO, the pinhole image recovers a sharp image over a large depth of field. This is still not as sharp as the f/22 lens, but the pinhole is a lensless imaging system with far less size and weight than a lens.
Fig. 14. Learning low-pass filters for the DCO input. (i) Pinhole input; (ii) DCO with an ideal low-pass filter; (iii) DCO with a learned low-pass filter (separately trained); and (iv) DCO with a jointly trained low-pass filter. Note how the separately trained low-pass filter generates sharp reconstructions but causes significant artifacts in flat regions of the image, while the jointly trained method is on par with the ideal low-pass filter. For reasons of network complexity and size, we use the ideal low-pass filter in our final DCO pipeline.
Fig. 15. Example video frame. (Top-left) Pinhole input; (top-right) DCO with multi-input single-output; (bottom-left) DCO with an additional temporal consistency module (trained separately); (bottom-right) DCO fine-tuned with a temporal consistency loss using optical flow. See Visualization 1, Visualization 2, and Visualization 3 for video examples.

Tables (5)

Table 1. Comparison of example lensless imaging systems.

Table 2. Quantitative comparison of lensless systems. Each lensless system is paired with an associated state-of-the-art reconstruction algorithm. PSNR is reported for single-frame results (higher is better), and temporal consistency is reported for video (lower is better). Pinhole + DCO achieves superior performance on both metrics compared to FlatCam and DiffuserCam.

Table 3. Ablation study of pinhole size with the modulation transfer function (MTF). Pinhole apertures of 0.10–0.30 mm are compared with their native MTFs as well as their MTFs after DCO reconstruction.

Table 4. Quantitative performance comparison of denoise network candidates.

Table 5. Quantitative performance comparison of deblur network candidates.

Equations (3)


$$P(x, y) = \left(\frac{A}{\lambda z}\right)^2 \left[\frac{2\, J_1\!\left(k R \sqrt{x^2 + y^2}\,/\,z\right)}{k R \sqrt{x^2 + y^2}\,/\,z}\right]^2,$$
$$I(x, y) = O(x, y) \ast P(x, y),$$
$$L_{TC} = M_{t-1 \Rightarrow t}\, \big\lVert O_t - \widetilde{O}_t \big\rVert_1 ,$$
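To make the first two equations concrete, the sketch below samples the Airy-pattern PSF $P(x, y)$ on the sensor grid and applies the convolution imaging model $I = O \ast P$. This is our own NumPy/SciPy illustration, not the released pipeline code (Ref. [97]); the prefactor $(A/\lambda z)^2$ is omitted because the PSF is normalized to unit sum, and all physical parameters are assumed to be in meters.

```python
import numpy as np
from scipy.signal import fftconvolve
from scipy.special import j1  # Bessel function of the first kind, order 1

def airy_psf(size, pinhole_radius, wavelength, distance, pixel_pitch):
    """Sample P(x, y) = [2 J1(k R r / z) / (k R r / z)]^2 on a size x size grid."""
    k = 2.0 * np.pi / wavelength
    coords = (np.arange(size) - size // 2) * pixel_pitch
    x, y = np.meshgrid(coords, coords)
    rho = k * pinhole_radius * np.hypot(x, y) / distance
    rho = np.where(rho == 0.0, 1e-12, rho)      # the Airy pattern tends to 1 as rho -> 0
    psf = (2.0 * j1(rho) / rho) ** 2
    return psf / psf.sum()                      # normalize so convolution preserves energy

def simulate_pinhole(scene, psf):
    """Imaging model I(x, y) = O(x, y) * P(x, y) for a grayscale scene."""
    return fftconvolve(scene, psf, mode="same")
```

In the third equation, the temporal consistency loss used for video, $O_t$ is the restored frame at time $t$, $\widetilde{O}_t$ is (per the optical-flow-based fine-tuning described in Fig. 15) the previous output warped to frame $t$, and $M$ masks pixels where the warp is unreliable, so the $\ell_1$ penalty only acts where consecutive frames should agree.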