## Abstract

Many deep learning approaches to computational imaging problems have proven successful by relying solely on data. However, when applied to the raw output of a bare (optics-free) image sensor, these methods fail to reconstruct target images that are structurally diverse. In this work we propose a self-consistent supervised model that learns not only the inverse but also the forward model, better constraining the predictions by encouraging the network to model the ideal bijective imaging system. To do this, we employ cycle consistency alongside traditional reconstruction losses, both of which we show are needed for incoherent optics-free image reconstruction. By eliminating all optics, we demonstrate imaging with the thinnest camera possible.

© 2022 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. INTRODUCTION

Computational imaging is the process of analyzing an object or scene using measurements and prior knowledge of the target. These measurements are gathered through a forward process where, in a conventional system, light is propagated through optics onto an image sensor. They are then processed computationally using the prior to produce an analysis of the target, solving the inverse problem. In this work we discuss the estimation of an anthropocentric image of the object, image reconstruction, using an optics-free camera, i.e., a bare image sensor. Analytical methods to solve the inverse problem explicitly define the physics involved in the forward model and utilize prior knowledge of the target, such as sparsity or smoothness.

Alternatively, deep learning approaches often do not incorporate the underlying physics into the model at all and instead learn it directly from the data [1]. These convolutional models commonly take the form of an encoder-decoder network that is trained in a supervised fashion to solve the inverse problem by taking the raw measurements as input and directly comparing model predictions to the ground-truth images using a pixelwise loss such as L1 or L2. Other loss functions, e.g., negative Pearson correlation coefficient (NPCC), cross entropy, structural similarity index measure (SSIM), as well as the perceptual loss, which instead compares the feature representations from a second neural network, commonly the image recognition network VGG, have proven to be effective in super-resolution [2], medical imaging [3], image denoising [4], imaging through diffuse media [5], imaging from a “see through” camera [6], and so on. Further improvements to encoder-decoder models in computational imaging include architectural modifications such as the use of skip connections [7], i.e., the U-net, and residual connections [8]. The U-net in particular has been the most widely used deep neural network in computational imaging [9].

Generative adversarial networks (GANs) [10] instead take a game-theoretic perspective by training a discriminator to determine whether generated images are real or fake and a generator to fool the discriminator by creating images consistent with the ground-truth domain. It has been shown that using an adversarial loss alongside pixelwise losses for image-to-image translation tasks produces more realistic photos than the conventional pixelwise loss alone [11,12]. Further GAN research [13,14] showed that paired input-output images are not always necessary, just images from both domains. These models use a cycle-consistent loss, wherein two GANs are trained in unison to invert each other: the output of one network is passed into the other, which is required to predict the original input.

The benefits of these deep learning approaches are that they: 1) move the bulk of the computation time to training so testing can be near real time and 2) learn directly from the data without the need for knowledge of the imaging system’s physics. When the physics is known, it can be incorporated into these models, which can be especially useful under extremely ill-posed scenarios [9]. In this work we aim to reduce ill-posedness not by integrating knowledge of the physics directly into the network but through enforcing a bijective constraint on the model. Through this, we encourage the model to learn the ideal imaging system.

Previous work shows that handwritten digits [15] and QR codes [16] can be imaged using an image sensor alone (with no optics at all), in the latter case by training a U-net directly on the raw images. We posit that the structural and color consistency in these datasets reduce uncertainty enough to enable the model to perform well. When there is less dataset consistency, as in the case of natural images, the same U-net cannot reconstruct the target images with high fidelity. To estimate the structural consistency of these datasets, we compute the SSIM between each of 250 images and each of an additional set of 250 images and average the results, obtaining an average SSIM of 0.515 for QR codes and of 0.045 for Cifar-10 images [17]. Given the amount of information loss from using an optics-free image sensor [18], a model trained for imaging on the raw data alone is forced to guess when making predictions. Without consistency in the dataset, these guesses are less constrained. To alleviate this increase in uncertainty, we apply structural constraints to the neural network such that it models the underlying physics of the imaging system. Specifically, we train one GAN to learn the inverse problem and another to simultaneously invert the reconstruction and learn the forward problem in a cycle-consistent model. Cycle consistency forces the inverse-model predictions to be consistent with the target domain, so that the forward model can use them as input to reconstruct the sensor image.
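The dataset-consistency estimate described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes grayscale images flattened to lists of floats in [0, 1], and it uses a simplified single-window SSIM computed from global image statistics, whereas the standard SSIM averages over local windows. The names `global_ssim` and `mean_cross_ssim` are hypothetical.

```python
from statistics import fmean


def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM of two equal-length flat lists of pixel values.

    Simplification (assumption): statistics are taken over the whole image
    instead of over sliding local windows.
    """
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean((p - mu_a) ** 2 for p in a)
    var_b = fmean((q - mu_b) ** 2 for q in b)
    cov = fmean((p - mu_a) * (q - mu_b) for p, q in zip(a, b))
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def mean_cross_ssim(set_a, set_b):
    """Average SSIM over every pair (image from set_a, image from set_b)."""
    return fmean(global_ssim(a, b) for a in set_a for b in set_b)


# Toy data: a stripe pattern and its inverse stand in for real images.
stripes = [0.0, 1.0, 0.0, 1.0]
inverse = [1.0, 0.0, 1.0, 0.0]
print(mean_cross_ssim([stripes], [stripes, inverse]))
```

A dataset whose images share structure, such as QR codes, would score higher under this kind of average than a set of unrelated natural images.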

To the best of our knowledge, this is the first application of cycle consistency to paired input-target images. We demonstrate a simple but effective manner in which to do so—by applying a traditional reconstruction loss, in the form of an L1 or perceptual loss, in addition to the cycle consistency. We show that neither loss alone sufficiently constrains the network to produce anthropocentric reconstructions. However, when combined, our results show that optics-free imaging with a structurally inconsistent dataset is possible with our self-consistent supervised model, even when the distances between the sensor and the object are as large as 10 mm. Furthermore, we emphasize that by eliminating all optics, we demonstrated the thinnest possible camera.

## 2. METHOD

In our experiment, a bare image sensor (Mini-2MP-Plus, Arducam, pixel width 2.2 µm) is placed at a distance ${z} = {1}\;{\rm mm}$ (or 10 mm) away from a ${20} \times {20}\;({200} \times {200})\;{\rm pixel}$ image [physical size of 6.22 mm and 62.2 mm, respectively; see Fig. 1(a) and Supplement 1]. This corresponds to a field of view of ${2}\theta = {144}\;{\rm deg}$. The object is displayed on a liquid-crystal display (LCD, Acer G276HL, ${1920} \times {1080}\;{\rm pixels}$). Images are taken from the Cifar-10 dataset. Cifar-10 contains 60,000 color images (${32} \times {32}\;{\rm pixels}$) of natural scenes, of which 45,000 are used for training, 5000 for validation, and 10,000 for testing. Each image is resized with bicubic interpolation. The RGB output of the image sensor is also resized from ${320} \times {240}$ to ${240} \times {180}\;{\rm pixels}$, as this produced better results than the original output size, likely because the forward network must predict more pixels at the original size. The goal, then, is to reconstruct the original target image from the raw sensor image using a convolutional neural network. A different neural network is trained for ${z} = {1}\;{\rm mm}$ and for ${z} = {10}\;{\rm mm}$, but the model architecture is kept consistent.
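The quoted field of view follows from the stated geometry. The sketch below assumes a flat object centered on the optical axis, so that $\tan\theta$ equals half the object width divided by the distance $z$; the function name is hypothetical.

```python
import math


def full_field_of_view_deg(object_width_mm, distance_mm):
    """Full angle 2*theta subtended by a centered flat object (assumed geometry)."""
    half_angle = math.atan((object_width_mm / 2) / distance_mm)
    return 2 * math.degrees(half_angle)


fov_1mm = full_field_of_view_deg(6.22, 1.0)     # 6.22 mm object at z = 1 mm
fov_10mm = full_field_of_view_deg(62.2, 10.0)   # 62.2 mm object at z = 10 mm
print(round(fov_1mm), round(fov_10mm))
```

Because the object size scales with the distance, both configurations subtend the same angle of roughly 144 deg.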

#### A. Self-Consistent Supervised Model

Our model utilizes two generators and two discriminators, which are trained in a cycle-consistent manner as illustrated in Fig. 1(b) [13]. The generator $F$ is trained to reconstruct the ground truth image $x$ given a raw sensor image $y$. We use $\hat x$ to denote the reconstructed image. The generator $G$ is then trained to reconstruct $y$ given the input $\hat x$. This cycled sensor image is denoted $\hat {\hat y}$. Therefore, $F:y \to \hat x$ and $G:\hat x \to \hat {\hat y}$ as indicated on the left of Fig. 1(b). Additionally, the networks are trained to model the system in the opposite direction, i.e., $G:x \to \hat y$ and $F:\hat y \to \hat {\hat x}$, where $\hat y$ and $\hat {\hat x}$ are the reconstructed sensor and cycled images, respectively. In other words, the cycle-consistency loss enforces $G({F(y)}) = \hat {\hat y} \approx y$ and $F({G(x)}) = \hat {\hat x} \approx x$. Thus, we encourage the network to model a bijective system where $G$ learns the forward process, while $F$ learns the inverse. This corresponds to the following loss function:
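A form consistent with the two constraints just stated is sketched here. The L1 norm is an assumption borrowed from CycleGAN [13], since the norm is not named in the text:

$$
{\cal L}_{\textit{cyc}} = {\cal L}_{\textit{cyc}}(F,G,x) + {\cal L}_{\textit{cyc}}(G,F,y) = \lVert F(G(x)) - x \rVert_1 + \lVert G(F(y)) - y \rVert_1
$$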

where the full cycle-consistency loss is the sum of the backward cycle-consistency loss ${{\cal L}_{\textit{cyc}}}({F,G,x})$ and the forward cycle-consistency loss ${{\cal L}_{\textit{cyc}}}({G,F,y})$.

The discriminator ${D_X}$ is trained to discriminate between real ground-truth images and those generated by $F$. Similarly, ${D_Y}$ discriminates between real sensor images and those generated by $G$. ${D_X}$ and ${D_Y}$ force the generators to produce images that are indistinguishable from the real image domains, resulting in more realistic outputs [12]. We use the adversarial least-squares loss, as it has shown improved stability and results over the traditional sigmoid cross-entropy loss [19]:
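As a sketch, the least-squares objective of [19] applied to the pair $F$, ${D_X}$ can be written as follows. The symbol ${\cal L}_{\textit{GAN}}$ and the target labels 1 (real) and 0 (generated) are assumptions, with ${D_X}$ taken to act on the ground-truth image domain:

$$
{\cal L}_{\textit{GAN}}(F, D_X) = \mathbb{E}_{x}\left[ (D_X(x) - 1)^2 \right] + \mathbb{E}_{y}\left[ D_X(F(y))^2 \right]
$$

Under these assumptions ${D_X}$ minimizes this expression, while $F$ minimizes $\mathbb{E}_{y}[(D_X(F(y)) - 1)^2]$. The term for $G$ and ${D_Y}$ would follow by exchanging the roles of $x$ and $y$.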

Finally, we constrain our model with a reconstruction loss for both $F$ and $G$. We found it necessary to use this loss alongside the cycle and adversarial losses for training stability with our optics-free images. Two variants are explored: the pixelwise L1 loss, to learn additional low-level information, and the perceptual loss, to enhance high-level features. For the perceptual loss, we utilize the popular content loss from [2], which extracts intermediate features from a VGG network [20] pretrained on ImageNet [21]. For consistency we compare these features with the L1 rather than the L2 distance, and we apply the perceptual loss only to the $F$ network, while $G$ still uses the pixelwise L1. The reconstruction loss for both models is given by

${{\cal L}_{\textit{rec}}}({F,G,Identity})$ for the pixelwise loss or

${{\cal L}_{\textit{rec}}}({F,G,VGG})$ for the perceptual loss, which uses the feature-map output of the last convolution of the third block of VGG-19. This layer proved more effective for our purposes than the last convolution of the fifth block as used in SRGAN (a generative adversarial network for single-image super-resolution) [22], which used the perceptual loss to enhance image super-resolution performance. It is worth noting that the unpaired methods DualGAN [14] and CycleGAN [13] performed worse than the paired Pix2Pix [12] on certain tasks. Here we apply the cycle-consistency framework to two Pix2Pix models (with unconditional discriminators) for use with our paired data.
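The two reconstruction-loss variants described above can be summarized in one expression. This is a sketch under the stated description: the feature map $\phi$ is notation assumed here, equal to the identity for the pixelwise variant and to the VGG-19 feature extractor for the perceptual variant, and it is applied only to the $F$ term:

$$
{\cal L}_{\textit{rec}}(F,G,\phi) = \lVert \phi(F(y)) - \phi(x) \rVert_1 + \lVert G(x) - y \rVert_1, \qquad \phi \in \{ Identity,\; VGG \}
$$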

Utilizing each of the above losses, the full self-consistent supervised model loss is
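Written out, a form consistent with the terms described above would be the following sketch. Here ${\cal L}_{\textit{GAN}}$ denotes the adversarial least-squares loss, ${\cal L}_{\textit{cyc}}$ the full cycle-consistency loss, and ${\cal L}_{\textit{rec}}$ the reconstruction loss in either variant. The weights $\lambda_{\textit{cyc}}$ and $\lambda_{\textit{rec}}$ are placeholders assumed for illustration, and no particular values are implied:

$$
{\cal L} = {\cal L}_{\textit{GAN}}(F, D_X) + {\cal L}_{\textit{GAN}}(G, D_Y) + \lambda_{\textit{cyc}}\, {\cal L}_{\textit{cyc}} + \lambda_{\textit{rec}}\, {\cal L}_{\textit{rec}}
$$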

## 3. NETWORK

At a high level the generators $F$ and $G$ are inverses of one another. The overall structure of each follows the encoder-decoder framework, where the input is contracted in the first half of the network, followed by multiple bottleneck layers at the bottom of the model and an expansion of the input in the second half. This contract-expand structure reduces the network size while increasing the receptive field (the number of pixels in the input image impacting each output pixel). Given the size discrepancy between the optics-free images and ground-truth images, the contraction is larger for the inverse model $F$, while the expansion is larger for the forward model $G$ (see Fig. 2). Here "larger" refers to the reduction or increase in image size; the number and size of convolutions are kept the same on each side. To further increase the receptive field size, large kernel sizes (${5} \times {5}$) and dilated convolutions [23] (dilation rate 2) are used. This is beneficial with the optics-free camera, as each sensor image pixel is dependent on large areas of the target image. Residual connections [8] are also employed to reduce the problem of vanishing gradients from the large network size. We use a single residual connection at each stage of the encoder and decoder, and multiple residuals in the bottleneck layers. Differing from the popular U-net, skip connections are not used, as they did not improve the reconstruction quality. Our discriminators follow the PatchGAN [13] architecture. ${D_X}$ is a ${13} \times {13}$ PatchGAN, while ${D_Y}$ is a ${70} \times {70}$ PatchGAN. The smaller ${D_X}$ is used, since the pixel size of $x$ is small compared to $y$. The ${13} \times {13}$ variant is obtained by removing the final 512-filter convolutional layer from the ${70} \times {70}$ PatchGAN and applying a stride of 1 to all convolutions.
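The receptive-field figures quoted above can be checked with a short calculation. This sketch assumes the standard Pix2Pix PatchGAN layout (five convolutions with $4 \times 4$ kernels and strides 2, 2, 2, 1, 1), which the text does not spell out; the helper names are hypothetical.

```python
def effective_kernel(kernel, dilation=1):
    """Span in pixels of a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride, dilation) convolutions."""
    rf, jump = 1, 1  # jump = spacing of this layer's inputs, in input pixels
    for kernel, stride, dilation in layers:
        rf += (effective_kernel(kernel, dilation) - 1) * jump
        jump *= stride
    return rf


# Assumed standard 70x70 PatchGAN: 4x4 kernels, strides 2, 2, 2, 1, 1.
patchgan_70 = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]
# Variant from the text: drop the 512-filter layer, use stride 1 everywhere.
patchgan_13 = [(4, 1, 1), (4, 1, 1), (4, 1, 1), (4, 1, 1)]

print(receptive_field(patchgan_70))  # 70
print(receptive_field(patchgan_13))  # 13
print(effective_kernel(5, 2))        # 9: a 5x5 kernel at dilation rate 2
```

Under these assumptions the modified stack gives a 13-pixel receptive field, matching the size quoted for ${D_X}$, and a ${5} \times {5}$ kernel with dilation rate 2 covers a ${9} \times {9}$ region.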

We train our networks with the Adam optimizer using learning rates of $2 \times {10^{- 4}}$ for the generators and $2 \times {10^{- 5}}$ for the discriminators. This difference in learning rates, known as the two time-scale update rule (TTUR) [24], allows both models to be updated at each step, which speeds up training while not allowing either adversary to dominate the other. As in Pix2Pix, we apply batch normalization at test time using the test batch statistics as opposed to the training set statistics.

## 4. RESULTS

#### A. Comparison to Traditional Supervision

We first compare our self-consistent supervised model using ${{\cal L}_{\rm{rec}}}({F,G,Identity})$ as the reconstruction loss against the U-net architecture trained with either the conventional L1 loss or the L1 and adversarial losses. The U-net architecture has proven effective in a variety of computational imaging tasks [9], including on the same optics-free camera for reconstruction of QR codes [16]. The L1 U-net is trained for 80 iterations using the Adam optimizer with a learning rate of $1 \times {10^{- 3}}$. The L1 and adversarial U-net is the same as the Pix2Pix [12] with unconditional discriminators (see Section 2.A for comparison of the Pix2Pix with the self-consistent supervised model). It is trained with the same L1 loss weight as Pix2Pix ($\lambda = {100}$), a ${13} \times {13}$ PatchGAN, and the same learning rates and optimizers as the self-consistent supervised model. As can be seen in Fig. 3, the self-consistent model shows clear improvements over both versions of the U-net. Our model achieves improved SSIM, PSNR, and mean absolute error (MAE; see Table 1), with the 10 mm self-consistent model results comparable with those of the 1 mm U-net.
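For reference, two of the reported metrics can be computed as in the sketch below. It assumes images flattened to lists of floats in [0, 1] and the usual definition of PSNR in decibels from the mean squared error; the function names and toy values are illustrative only.

```python
import math


def mae(pred, target):
    """Mean absolute error between two equal-length flat pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf
    return 10 * math.log10(data_range ** 2 / mse)


# Toy three-pixel example (made-up values, not experimental data).
prediction = [0.0, 0.5, 1.0]
ground_truth = [0.1, 0.5, 0.8]
print(mae(prediction, ground_truth), psnr(prediction, ground_truth))
```

Lower MAE and higher PSNR both indicate a reconstruction closer to the ground truth.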

#### B. Loss Evaluation

We then run ablations on the model to evaluate the loss function and model architecture. We train three new models with different losses. The first model uses our architecture (the $F$ model) and is trained with the ${{\cal L}_{\textit{rec}}}({F,G,Identity})$ reconstruction loss, without the cycle or GAN losses. The second model is trained with the ${{\cal L}_{\textit{rec}}}({F,G,Identity})$ reconstruction and cycle losses, without the GAN losses. The third model is the full self-consistent supervised model using ${{\cal L}_{\textit{rec}}}({F,G,VGG})$ as the reconstruction loss. We compare these three models with the full self-consistent supervised ${{\cal L}_{\textit{rec}}}({F,G,Identity})$ model in Fig. 4. We attempted to train the pure CycleGAN without the reconstruction loss, but training was extremely unstable and the model was unable to learn. While missing some details, especially in the image background and around the periphery, the full self-consistent reconstructions are fairly coherent. The two models trained without the GAN losses produce overly smooth images, which is to be expected when using an L1 loss without an adversarial loss [12]. Interestingly, as Table 2 shows, much of the improvement over the U-net comes from the model architecture. However, with the adversarial and cycle losses, the reconstructions are significantly improved. When using the perceptual reconstruction loss, the SSIM is further increased, enabling realistic-looking reconstructions, especially of the main object in the image at ${z} = {1}\;{\rm mm}$. When the sensor is farther from the target images (${z} = {10}\;{\rm mm}$), the reconstructed images are still coherent; however, they show some blurring and are missing many of the fine-grained details.

#### C. Sensor Image Size

Next we test our model against reductions in sensor image resolution. We train a separate model on each of the sizes ${160} \times {120}$ and ${80} \times {60}\;{\rm pixels}$ as well as the original ${240} \times {180}$. For each experiment the model architecture is kept consistent, except for increases in padding at each level of the encoders and decoders, as well as the target image resolution of ${32} \times {32}$. We train with the ${{\cal L}_{\rm{rec}}}({F,G,Identity})$ loss to keep the model as small as possible. Interestingly, there is only a slight degradation in MAE, SSIM, and PSNR, even when the input is reduced by two-thirds along each dimension, a ninefold reduction in pixel count (Table 3). This is important, as it shows that the model size and training time can be reduced with little reduction in quality. Figure 5 shows reconstructions of the same target image for each of these models.

#### D. Forward Model

Figure 6 displays how well the forward model learns to reconstruct the sensor image using the target image as input. The results show consistent reconstructions even where the inverse model performs poorly. The average MAE for the forward model is 0.09 on the 10 mm dataset and 0.07 on the 1 mm dataset. We additionally show the corresponding forward cycle $G({F(y)})$ and backward cycle $F({G(x)})$ images, which are nearly identical to the input.

#### E. Ensemble Learning

Finally, we attempt to improve our model by ensembling multiple networks. To do this, we simply average the outputs of six self-consistent supervised models, each trained with ${{\cal L}_{\rm{rec}}}({F,G,Identity})$ as the reconstruction loss (Table 4). As Fig. 7 shows, ensembling cleans up many of the additional artifacts that arise from using the adversarial loss.
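The averaging step can be sketched as a pixelwise mean over the model outputs. The snippet assumes each prediction is a flat list of floats of the same length; the function name and toy values are hypothetical stand-ins for the outputs of six trained models.

```python
def ensemble_average(predictions):
    """Pixelwise mean of several same-sized predictions (flat lists of floats)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    count = len(predictions)
    return [sum(pixels) / count for pixels in zip(*predictions)]


# Six toy two-pixel "reconstructions" standing in for six model outputs.
toy_outputs = [
    [0.0, 1.0], [0.2, 1.0], [0.4, 1.0],
    [0.6, 1.0], [0.8, 1.0], [1.0, 1.0],
]
averaged = ensemble_average(toy_outputs)
print(averaged)
```

Averaging tends to suppress artifacts that differ from model to model while keeping the structure that the models agree on.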

## 5. CONCLUSION

In this work we show improvements on the traditional supervised deep learning approach for optics-free imaging. This is done by learning the forward process as well as the inverse and utilizing both to encourage the network to model the ideal bijective imaging system. To do this we train two GANs, one for the forward and another for the inverse, with cycle consistency [13]. This work is the first to apply cycle consistency on paired input-target images. We show that naively applying the CycleGAN is not enough to reconstruct optics-free images. But, by simply incorporating an additional L1 reconstruction term for both networks in addition to the cycle-consistency loss, model uncertainty can be reduced and reconstructions improved with the forward network while taking advantage of having paired images. We also show that this reconstruction term can be replaced with a perceptual loss to further enhance the structure of the reconstructions. Further, to make full use of this loss for optics-free imaging, we develop a new network architecture with a focus on a large receptive field. This self-consistent supervised model shows that the forward model can be learned and used to perform image reconstructions with an optics-free camera of structurally inconsistent datasets even with a stand-off distance of 10 mm. Recently, there has been significant interest in minimizing thickness of imaging systems, and a variety of very sophisticated approaches have been proposed [25,26]. Here, we demonstrate imaging with no optics. Nothing thinner is possible. This is an especially remarkable result considering the significant amount of information loss that occurs in the absence of optics.

## Funding

Office of Naval Research (N000141912458); National Science Foundation (1533611).

## Acknowledgment

University of Utah Center for High Performance Computing (CHPC) facilities are gratefully acknowledged.

## Disclosures

RM: University of Utah (P).

## Data availability

Data available from authors upon request.

## Supplemental document

See Supplement 1 for supporting content.

## REFERENCES

**1.** A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica **4**, 1117–1125 (2017).

**2.** J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in *European Conference on Computer Vision (ECCV)* (2016), pp. 694–711.

**3.** K. Armanious, C. Jiang, M. Fischer, T. Küstner, T. Hepp, K. Nikolaou, S. Gatidis, and B. Yang, “MedGAN: medical image translation using GANs,” Comput. Med. Imag. Graph. **79**, 101684 (2020).

**4.** C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)* (2018), pp. 3291–3300.

**5.** S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica **5**, 803–813 (2018).

**6.** Z. Pan, B. Rodriguez, and R. Menon, “Machine-learning enables image reconstruction and classification in a ‘see-through’ camera,” OSA Continuum **3**, 401–409 (2020).

**7.** O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional networks for biomedical image segmentation,” in *International Conference on Medical Image Computing and Computer-assisted Intervention* (2016), pp. 234–241.

**8.** K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)* (2016), pp. 770–778.

**9.** G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica **6**, 921–943 (2019).

**10.** I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in *Neural Information Processing Systems (NIPS)* (2014), Vol. 27, p. 2672.

**11.** D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)* (2016), pp. 2536–2544.

**12.** P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)* (2017), pp. 1125–1134.

**13.** J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in *IEEE International Conference on Computer Vision (ICCV)* (2017), pp. 2223–2232.

**14.** Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: unsupervised dual learning for image-to-image translation,” in *IEEE International Conference on Computer Vision (ICCV)* (2017), pp. 2849–2857.

**15.** G. Kim, K. Isaacson, R. Palmer, and R. Menon, “Lensless photography with only an image sensor,” Appl. Opt. **56**, 6450–6456 (2017).

**16.** S. Nelson, E. Scullion, and R. Menon, “Optics-free imaging of complex, non-sparse QR-codes with deep neural networks,” OSA Continuum **3**, 2423–2428 (2020).

**17.** A. Krizhevsky, V. Nair, and G. Hinton, “Cifar-10 (Canadian institute for advanced research),” 2009, https://www.cs.toronto.edu/~kriz/cifar.html.

**18.** S. Nelson and R. Menon, “Classification of optics-free images with deep neural networks,” arXiv:2011.05132 (2020).

**19.** X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in *IEEE International Conference on Computer Vision (ICCV)* (2017), p. 2794.

**20.** K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in *International Conference on Learning Representations (ICLR)* (2015).

**21.** O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis. **115**, 211–252 (2015).

**22.** C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)* (2017), pp. 4681–4690.

**23.** F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in *International Conference on Learning Representations (ICLR)* (San Juan, Puerto Rico, May 2-4 2016).

**24.** M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local nash equilibrium,” in *31st Conference on Neural Information Processing Systems (NIPS 2017)* (ACM, 2017), p. 6629.

**25.** C. Guo, H. Wang, and S. Fan, “Squeeze free space with nonlocal flat optics,” Optica **7**, 1133 (2020).

**26.** O. Reshef, M. P. DelMastro, K. K. M. Bearne, A. H. Alhulaymi, L. Giner, R. W. Boyd, and J. S. Lundeen, “An optic to replace space and its application towards ultra-thin imaging systems,” Nat. Commun. **12**, 3512 (2021).